I have the following grammar:
LIST = LBRACE LISTBODY RBRACE
LISTBODY = ATOM | ATOM COMMA LISTBODY
ATOM = NUMBER | WORD
NUMBER = INTEGER
INTEGER = #'[-|+]{0,1}[ ]*(\\d)+'
WORD = #'[a-z]([a-zA-z0-9_])*'
LBRACE = '['
RBRACE = ']'
COMMA = ','
Trying to parse [a,b] fails. If I replace RBRACE=']' with RBRACE='}' I can then parse [a,b}
Is there something special about the ] that I am missing?
What you have as LBRACE and RBRACE are fine.
I suspect that what is happening is not what you think is happening, i.e. multiple rules are working against each other. You can often tell from the error, which has the form [:index n], where n is how far the parsing got.
INTEGER is a regex that has [ and ] characters in it, with a space between one pair. Could you take them, the space, and the * away and see how you go? I know they should be escaped inside a regex, but I would still try experiments along those lines.
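For example (a sketch, not tested against your setup), collapsing the character classes and dropping the space class and its * entirely would give something like:
INTEGER = #'[-+]?\d+'
(with the backslash doubled, as in your original, if the grammar is embedded in a host-language string).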
I am trying to write an EBNF production rule for parsing math expressions with +, -, *, / operations (op) and matching parentheses. I would like to point out that writing a = op b is not possible. Example: a = - 1 is not allowed. What is allowed is if + or - are touching the literal (no whitespace between); then they are treated as a lit_num by the lexer. Identifiers are not allowed to contain special characters, and the lexer throws an error if they do.
This is what I have done so far:
ex =
(literal | identifier)[('+'|'-'|'*'|'/') ex]
| '(' ex ')' [('+'|'-'|'*'|'/') ex]
I think it could also be written like this:
ex =
( '(' ex ')' | (literal | identifier) )[('+'|'-'|'*'|'/') ex]
It should work; I tried parsing some long expressions by hand and it also works in the algorithm. I would just like to hear your opinion on it.
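For instance, here is one of the hand-worked derivations, for ( a + b ) * c under the combined rule (a sketch; a, b, c are identifiers, and the optional [op ex] part is taken where needed):
ex
=> '(' ex ')' '*' ex
=> '(' identifier '+' ex ')' '*' ex
=> '(' a '+' b ')' '*' c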
EDIT: I just realised that by allowing -1 with no whitespace I could write something like a = 1 + -1. I will just put a rule into the lexer to disallow this. So I will not be able to prefix a number, but this is not a big deal. Assigning a negative number can be written as 0 - x.
I have a string like RANDOM = "SOMEGIBBERISH ("DOG CAT-DOG","DOG CAT-DOG")". For quoted string literals I use:
StringLiteralSQ : UnterminatedStringLiteralSQ '\'' ;
UnterminatedStringLiteralSQ : '\'' (~['\r\n] | '\\' (. | EOF))* ;
StringLiteralDQ : UnterminatedStringLiteralDQ '"' ;
UnterminatedStringLiteralDQ : '"' (~[\r\n] | '\\' (. | EOF))* ;
This parses the above-mentioned string. I need to identify the words as comma-separated tokens, like DOG CAT-DOG. For this I use something like
options : name EQUALS value
| OPTIONS L_PAREN (name EQUALS value) (COMMA (name EQUALS value))* R_PAREN
;
However, when I make the string of this format RANDOM = "SOMEGIBBERISH ("DOG CAT-DOG"DOG CAT-DOG")", it fails with an out-of-memory error.
I want to keep parsing the strings I have been parsing before, but also parse this kind of string ("DOG CAT-DOG"DOG CAT-DOG") and perhaps treat it as a single token. How can I do that?
Your question is a bit confusing, so I'm not sure I understand what you are after. You ask for handling escaped characters, but then you don't show any input which uses escapes.
However, I think you are making things way too complicated. Look in other grammars to see how they define string tokens, including escape handling. Here's a typical example:
fragment SINGLE_QUOTE: '\'';
fragment DOUBLE_QUOTE: '"';
DOUBLE_QUOTED_TEXT: (
DOUBLE_QUOTE ('\\'? .)*? DOUBLE_QUOTE
)+
;
SINGLE_QUOTED_TEXT: (
SINGLE_QUOTE ('\\'? .)*? SINGLE_QUOTE
)+
;
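As an aside (an illustration, not taken from your grammar): the outer ( ... )+ loop is what lets adjacent quoted runs continue a single token, SQL-style, so an input such as
"DOG CAT-DOG""DOG CAT-DOG"
lexes as one DOUBLE_QUOTED_TEXT token, while the '\\'? . inside consumes a backslash escape together with whatever character follows it.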
With the following grammar:
program: /*empty*/ | stmt program;
stmt: var_decl | assignment;
var_decl: type ID '=' expr ';';
assignment: expr '=' expr ';';
type: ID | ID '[' NUMBER ']';
expr: ID | NUMBER | subscript_expr;
subscript_expr: expr '[' expr ']';
I'd expect the following to be valid:
array[5] = 0;
That's just an assignment with a subscript_expr on the left-hand-side. However the generated parser gives an error for that statement:
syntax error, unexpected '=', expecting ID
Generating the parser also warns that there's 1 shift/reduce conflict. Removing subscript_expr makes it go away.
Why does this happen and how can I get it to parse array[5] = 0; as an assignment with a subscript_expr?
I'm using Bison 2.3.
The following two statements are both valid in your language:
x [ 3 ] = 42;
x [ 3 ] y = 42;
The first is an assignment of an element of the array variable x, while the second is a declaration and initialization of the array variable y whose elements are of type x.
But from the parser's viewpoint, x and y are both just IDs; it has no way of knowing that x is a variable in the first case and a type in the second case. All it can do is notice that the two statements match the productions assignment and var_decl, respectively.
Unfortunately, it cannot do that until it sees the token after the ]. If that token is an ID, then the statement must be a var_decl; otherwise, it's an assignment (assuming the statement is valid, of course).
But in order to parse the statement as an assignment, the parser must be able to produce
expr '=' expr
which in this case is the result of expr: subscript_expr, which in turn comes from subscript_expr: expr '[' expr ']'.
So the set of reductions for the first statement will be as follows: (Note: I didn't write the shifts; rather, I mark the progress of the parse by putting a • at the end of each reduction. To get to the next step, just shift the • until you reach the end of the handle.)
ID • [ NUMBER ] = NUMBER ; expr: ID
expr [ NUMBER • ] = NUMBER ; expr: NUMBER
expr [ expr ] • = NUMBER ; subscript_expr: expr '[' expr ']'
subscript_expr • = NUMBER ; expr: subscript_expr
expr = NUMBER • ; expr: NUMBER
expr = expr ; • assignment: expr '=' expr ';'
assignment
The second statement must be parsed as follows:
ID [ NUMBER ] • ID = NUMBER ; type: ID '[' NUMBER ']'
type ID = NUMBER • ; expr: NUMBER
type ID = expr ; • var_decl: type ID '=' expr ';'
var_decl
That's a shift/reduce conflict, because the crucial decision must be made immediately after the first ID. In the first statement, we need to reduce the identifier to an expr. In the second statement, we must continue shifting until we are ready to reduce a type.
Of course, this problem wouldn't exist if we could lexically distinguish type IDs from variable name IDs, but that may not be possible (or, if possible, it may not be desirable because it requires feedback from the parser to the lexer).
As written, the shift/reduce prediction can be made with fixed lookahead, since the fourth token after the ID will determine the possibilities. That makes the grammar LALR(4), but that doesn't help much since bison only implements LALR(1) parsers. In any case, it is likely that a less simplified grammar will not be fixed-lookahead, for example if constant expressions are allowed for array sizes, or if arrays can have multiple dimensions.
Even so, the grammar is not ambiguous, so it is possible to use a GLR parser to parse it. Bison does implement GLR parsers; it is only necessary to insert
%glr-parser
into the prologue. (The shift/reduce warning will still be produced, but the parser will correctly identify both kinds of statement.)
It's worth noting that C doesn't have this particular parsing problem precisely because it puts the array size after the name of the variable being declared. I don't believe this was done to avoid parsing problems (although who knows?) but rather because it was believed that it is more natural to write declarations the way variables are used. Hence, we write int a[3] and char *p, because in the program we will dereference using a[i] and *p.
It is possible to write an LALR(1) grammar for this syntax, but it's a bit annoying. The key is to delay the reduction of the syntax ID [ NUMBER ] until we know for sure which production it will be the start of. That means we need to include the production expr: ID '[' NUMBER ']'. That will result in a larger number of shift/reduce warnings (since it makes the grammar ambiguous), but since bison always prefers to shift, it should produce a correct parser.
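A sketch of that workaround, assuming the rest of the grammar stays as you posted it:
expr: ID | NUMBER | subscript_expr | ID '[' NUMBER ']';
Now the parser can keep shifting through ID '[' NUMBER ']' and decide from the very next token whether it has a type (next token is an ID) or an expr.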
Adding %glr-parser solves this.
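That is (a minimal sketch), the directive goes in the declarations section, before the %% that introduces the rules:
%glr-parser
%%
program: /*empty*/ | stmt program;
/* ...remaining rules unchanged... */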
I am trying to write a parser for a dialect of Answer Set Programming (ASP) which, in terms of grammar, looks like Prolog with some extensions.
One extension, for instance, is expansion, meaning that fact(1..3). is expanded into fact(1). fact(2). fact(3).. Notice that the language understands INT and FLOAT numbers and also uses . as a terminator.
In some cases the parser fails to distinguish between integers, floats, expansions and separators because, I reckon, the language is clearly ambiguous. In those cases I have to explicitly separate tokens with white space. Any Prolog or ASP parser, however, deals with such productions correctly. I read that ANTLR4 can disambiguate problematic productions autonomously, though it probably needs some help, and I don't know how to provide it! ;-) I read something like here and here, but apparently they did not help me.
Could somebody please tell me what to do to overcome this ambiguity?
Please notice that I cannot change the language because it is quite standard.
In order to simplify the experts' work, I created a minimal working example that follows.
grammar Test;
program:
statement* ;
statement: // DOT is the statement terminator
range DOT |
intNum DOT |
floatNum DOT ;
intNum: // not needed, but helps in TestRig
INT;
floatNum: // not needed, but helps in TestRig
FLOAT;
range: // defines an expansion
INT DOTS INT ;
DOTS: '..';
DOT: '.';
FLOAT: DIGIT+ '.' DIGIT* | '.' DIGIT+ ;
INT: DIGIT+ ;
WS: [ \t\r\n]+ -> skip ;
fragment NONZERO : [1-9] ;
fragment DIGIT : [0] | NONZERO ;
I use the following input:
1 .
1. .
1.5 .
.5 .
1 .. 5 .
1.
1..
1.5.
.5.
1..5.
And I get the following errors on inputs that other tools parse correctly:
line 8:0 extraneous input '1.' expecting '.'
line 11:2 extraneous input '.5' expecting '.'
Many thanks in advance!
Before your DOTS rule, add a unique rule for the statement-terminating dot and disambiguate the DOTS rule (and change your other rules to use TERMINAL):
TERMINAL: DOT { isTerminal(1) }? ;
DOTS: DOT DOT { !isTerminal(2) }? ;
DOT: '.';
where the predicate method simply looks ahead in the _input character stream to see if, at the current token index, the next character is white space. Put something like this in an @members block in your grammar:
public boolean isTerminal(int la) {
    // offset into the raw character stream, past the dot(s) being matched
    int offset = _tokenStartCharIndex + 1 + la;
    // Interval is org.antlr.v4.runtime.misc.Interval
    String s = _input.getText(Interval.of(offset, offset));
    // the dot terminates a statement only when followed by whitespace
    if (Character.isWhitespace(s.charAt(0))) {
        return true;
    }
    return false;
}
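Note that Interval lives in org.antlr.v4.runtime.misc; if the generated lexer doesn't already import it, you can pull it in through a header action (a sketch):
@lexer::header {
import org.antlr.v4.runtime.misc.Interval;
}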
May have to do a bit more work if whitespace is valid between a DOTS and the trailing INT.
I recommend shifting the work to the parser.
If the lexer can't decide whether 1..2 is 1. .2 or 1 .. 2, leave it up to the parser.
Maybe there is a context in which it can be interpreted as the first alternative and another context in which it may be interpreted as the second alternative.
Btw: 1..2. could be interpreted as 1 .. 2 . (range) or as 1. . 2 . (floatNum, intNum). How do you want to deal with this?
The following grammar should parse everything. But note that . . is now treated as dots, just as 1 . 23 is a floatNum! You can check for these cases while parsing or after parsing (depending on whether it should influence the parsing or not).
grammar Test;
program:
statement* ;
statement: // DOT is the statement terminator
range DOT |
intNum DOT |
floatNum DOT ;
intNum: // not needed, but helps in TestRig
INT;
floatNum:
INT DOT INT? | DOT INT ;
range: // defines an expansion
INT dots INT ;
dots : DOT DOT;
DOT: '.';
INT: DIGIT+ ;
WS: [ \t\r\n]+ -> skip ;
fragment NONZERO : [1-9] ;
fragment DIGIT : [0] | NONZERO ;
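With this grammar the lexer only ever emits INT and DOT (WS is skipped), so the inputs that failed before now tokenize unambiguously; for example (a sketch of the token streams):
1..5.  ->  INT DOT DOT INT DOT  ->  range DOT
1.5.   ->  INT DOT INT DOT      ->  floatNum DOT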
Prolog does not accept 1. as a float. This feature makes your grammar significantly more ambiguous, so maybe try removing that feature.
I played a bit with the BNF Converter and tried to re-engineer parts of the Mathematica language. My BNF had already about 150 lines and worked OK, until I noticed a very basic bug. Brackets [] in Mathematica are used for two different things
expr[arg] to call a function
list[[spec]] to access elements of an expression, e.g. a List
Let's assume I want to create the parser for a language which consists only of identifiers, function calls, element access and sequence of expressions as arguments. These forms would be valid
f[]
f[a]
f[a,b,c]
f[[a]]
f[[a,b]]
f[a,f[b]]
f[[a,f[x]]]
A direct, but obviously wrong input-file for BNFC could look like
entrypoints Expr ;
TSymbol. Expr1 ::= Ident ;
FunctionCall. Expr ::= Expr "[" [Sequence] "]" ;
Part. Expr ::= Expr "[[" [Sequence] "]]" ;
coercions Expr 1 ;
separator Sequence "," ;
SequenceExpr. Sequence ::= Expr ;
This BNF does not work for the last two examples of the first code-block.
The problem seems to be located in the generated Yylex lexer file, which matches ] and ]] separately. This is wrong because, as can be seen in the last two examples, whether it's a closing ] or ]] depends on the context. So either you have to maintain a stack of brackets to ensure the right matching, or you leave that to the parser.
Can someone enlighten me whether it's possible to realize this with BNFC?
(Btw, other hints would be gratefully taken too)
Your problem is the token "]]". If the lexer collects this without having any memory of its past, it might be mistaken. So just don't do that! The parser by definition remembers its left context, so you can get it to do the bracket matching correctly.
I would define your grammar this way:
FunctionCall. Expr ::= Expr "[" [Sequence] "]" ;
Part. Expr ::= Expr "[" "[" [Sequence] "]" "]" ;
with the lexer detecting only single "[" "]" as tokens.
An odd variant:
FunctionCall. Expr ::= Expr "[" [Sequence] "]" ;
Part. Expr ::= Expr "[[" [Sequence] "]" "]" ;
with the lexer also detecting "[[" as a token, since it can't be mistaken.
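For example, under the first variant the two ] at the end of f[a,f[b]] are matched by different productions (a hand-worked sketch):
f[a,f[b]]
  outer: FunctionCall  f "[" a "," f[b] "]"
  inner: FunctionCall  f "[" b "]"
so the first ] closes the inner call and the second closes the outer one; f[[a,f[x]]] works the same way, with Part contributing two separate "[" ... "]" pairs around the sequence.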