Lemon Parser - Parsing conflict between rules for a.b.c & a.b[0].c - parsing

typename ::= typename DOT ID.
typename ::= ID.
lvalue ::= lvalue DOT lvalue2.
lvalue ::= lvalue2.
lvalue2 ::= ID LSQB expr RSQB. // LSQB & RSQB: left & right square bracket. i.e. [ ].
lvalue2 ::= ID.
typename is a rule for the names of types. It matches the following code:
ClassA
package_a.ClassA
while lvalue is a rule for left values. It matches the following code:
varA
varB.C
varD.E[i].F
Now the 2 rules conflicts with each other. Maybe it is because lvalue can also match package_a.ClassA?
How can I solve this?

You can't resolve this issue grammatically because your syntax is ambiguous. a.b = 3 is valid if a.b is a member of a, and invalid if a.b is a type, but the semantics of a.b cannot be determined by the syntax.
You could resolve this in a fairly messy way if you have some way to figure that out in the lexer (which will probably involve some kind of lexical feedback, since the lexer would presumably need access to the symbol table in order to provide that information). Then the lexer could use two different token types for IDs, based on whether or not they are type names.
But probably the best option is to either abandon the idea of distinguishing grammatically between lvalues and rvalues, and or to assume that all selection operations (a.b) produce lvalues, and then to validate the use of an expression as an lvalue in the semantic action or some subsequent semantic analysis.

Related

ANTLR: Why is this grammar rule for a tuples not LL(1)?

I have the following grammar rules defined to cover tuples of the form: (a), (a,), (a,b), (a,b,) and so on. However, antlr3 gives the warning:
"Decision can match input such as "COMMA" using multiple alternatives: 1, 2
I believe this means that my grammar is not LL(1). This caught me by surprise as, based on my extremely limited understanding of this topic, the parser would only need to look one token ahead from (COMMA)? to ')' in order to know which comma it was on.
Also based on the discussion I found here I am further confused: Amend JSON - based grammar to allow for trailing comma
And their source code here: https://github.com/doctrine/annotations/blob/1.13.x/lib/Doctrine/Common/Annotations/DocParser.php#L1307
Is this because of the kind of parser that antlr is trying to generate and not because my grammar isn't LL(1)? Any insight would be appreciated.
options {k=1; backtrack=no;}
tuple : '(' IDENT (COMMA IDENT)* (COMMA)? ')';
DIGIT : '0'..'9' ;
LOWER : 'a'..'z' ;
UPPER : 'A'..'Z' ;
IDENT : (LOWER | UPPER | '_') (LOWER | UPPER | '_' | DIGIT)* ;
edit: changed typo in tuple: ... from (IDENT)? to (COMMA)?
Note:
The question has been edited since this answer was written. In the original, the grammar had the line:
tuple : '(' IDENT (COMMA IDENT)* (IDENT)? ')';
and that's what this answer is referring to.
That grammar works without warnings, but it doesn't describe the language you intend to parse. It accepts, for example, (a, b c) but fails to accept (a, b,).
My best guess is that you actually used something like the grammars in the links you provide, in which the final optional element is a comma, not an identifier:
tuple : '(' IDENT (COMMA IDENT)* (COMMA)? ')';
That does give the warning you indicate, and it won't match (a,) (for example), because, as the warning says, the second alternative has been disabled.
LL(1) as a property of formal grammars only applies to grammars with fixed right-hand sides, as opposed to the "Extended" BNF used by many top-down parser generators, including Antlr, in which a right-hand side can be a set of possibilities. It's possible to expand EBNF using additional non-terminals for each subrule (although there is not necessarily a canonical expansion, and expansions might differ in their parsing category). But, informally, we could extend the concept of LL(k) by saying that in every EBNF right-hand side, at every point where there is more than one alternative, the parser must be able to predict the appropriate alternative looking only at the next k tokens.
You're right that the grammar you provide is LL(1) in that sense. When the parser has just seen IDENT, it has three clear alternatives, each marked by a different lookahead token:
COMMA ↠ predict another repetition of (COMMA IDENT).
IDENT ↠ predict (IDENT).
')' ↠ predict an empty (IDENT)?.
But in the correct grammar (with my modification above), IDENT is a syntax error and COMMA could be either another repetition of ( COMMA IDENT ), or it could be the COMMA in ( COMMA )?.
You could change k=1 to k=2, thereby allowing the parser to examine the next two tokens, and if you did so it would compile with no warnings. In effect, that grammar is LL(2).
You could make an LL(1) grammar by left-factoring the expansion of the EBNF, but it's not going to be as pretty (or as easy for a reader to understand). So if you have a parser generator which can cope with the grammar as written, you might as well not worry about it.
But, for what it's worth, here's a possible solution:
tuple : '(' idents ')' ;
idents : IDENT ( COMMA ( idents )? )? ;
Untested because I don't have a working Antlr3 installation, but it at least compiles the grammar without warnings. Sorry if there is a problem.
It would probably be better to use tuple : '(' (idents)? ')'; in order to allow empty tuples. Also, there's no obvious reason to insist on COMMA instead of just using ',', assuming that '(' and ')' work as expected on Antlr3.

How to differentiate identifier from function call in LL(1) parser

My graduate student and I are working on a training compiler, which we will use to teach students at the subject "Compilers and Interpreters".
The input program language is a limited subset of the Java language and the compiler implementation language is Java.
The grammar of the input language syntax is LL(1), because it is easier to be understood and implemented by students. We have the following general problem in the parser implementation. How to differentiate identifier from function call during the parsing?
For example we may have:
b = sum(10,5) //sum is a function call
or
b = a //a is an identifier
In both cases after the = symbol we have an identifier.
Is it possible to differentiate what kind of construct (a function call or an identifier) we have after the equality symbol =?
May be it is not possible in LL(1) parser, as we can look only 1 symbol ahead? If this is true, how do you recommend to define the function call in the grammar? Maybe some additional symbol in front of the function call is necessary, e.g. b = #sum(10,5)?
Do You think this symbol would be confusing for students? What kind of symbol for the function call would be proper?
You indeed can't have separate rules for function calls and variables in an LL(1) grammar because that would require additional lookahead. The common solution to this is to combine them into one rule that matches an identifier, optionally followed by an argument list:
primary_expression ::= ID ( "(" expression_list ")" )?
| ...
In a language where a function can be an arbitrary expression, not just an identifier, you'll want to treat it just like any other postfix operator:
postfix_expression ::= primary_expression postfix_operator*
postfix_operator ::= "++"
| "--"
| "[" expression "]"
| "(" expression_list ")"

ANTLR4 Context-sensitive rule: unexpected parsing/resynchronization when failing semantic predicate

I'm working on a grammar that is context-sensitive. Here is its description:
It describes the set of expressions.
Each expression contains one or more parts separated by logical operator.
Each part consists of optional field identifier followed by some comparison operator (that is also optional) and the list of values.
Values are separated by logical operator as well.
By default value is a sequence of characters. Sometimes (depending on context) set of possible characters for each value can be extended. It even can consume comparison operator (that is used for separating of field identifiers from list of values, according to 3rd rule) to treat it as value's character.
Here's the simplified version of a grammar:
grammar TestGrammar;
#members {
boolean isValue = false;
}
exprSet: (expr NL?)+;
expr: expr log_op expr
| part
| '(' expr ')'
;
part: (fieldId comp_op)? values;
fieldId: STRNG;
values: values log_op values
| value
| '(' values ')'
;
value: strng;
strng: ( STRNG
| {isValue}? comp_op
)+;
log_op: '&' '&';
comp_op: '=';
NL: '\r'? '\n';
WS: ' ' -> channel(HIDDEN);
STRNG: CHR+;
CHR: [A-Za-z];
I'm using semantic predicate in strng rule. It should extend the set of possible tokens depending on isValue variable;
The problem occurs when semantic predicate evaluates to false. I expect that 2 STRNG tokens with '=' token between them will be treated as part node. Instead of it, it parses each STRNG token as a value, and throws out '=' token when re-synchronizing.
Here's the input string and the resulting expression tree that is incorrect:
a && b=c
To look at correct expression tree it's enough to remove an alternative with semantic predicate from strng rule (that makes it static and so is inappropriate for my solution):
strng: ( STRNG
// | {isValue}? comp_op
)+;
Here's resulting expression tree:
BTW, when semantic predicate evaluates to true - the result is as expected: strng rule matches an extended set of tokens:
strng: ( STRNG
| {!isValue}? comp_op
)+;
Please explain why this happens in such way, and help to find out correct solution. Thanks!
What about removing one option from values? Otherwise the text a && b may be either a
expr -> expr log_op expr
or
expr -> part -> values log_op values
.
It seems Antlr resolves it by using the second option!
values
: //values log_op values
value
| '(' values ')'
;
I believe your expr rule is written in the wrong order. Try moving the binary expression to be the last alternative instead of the first.
Ok, I've realized that current approach is inappropriate for my task.
I've chosen another approach based on overriding of Lexer's nextToken() and emit() methods, as described in ANTLR4: How to inject tokens .
It has given me almost full control on the stream of tokens. I got following advantages:
assigning required types to tokens;
postpone sending tokens with yet undefined type to parser (by sending fake tokens on hidden channel);
possibility to split and merge tokens;
possibility to organize postponed tokens into queues.
Having all these possibilities I'm able to resolve all the ambiguities in the parser.
P.S. Thanks to everyone who tried to help, I appreciate it!

Overloading multiplication using menhir and OCaml

I have written a lexer and parser to analyze linear algebra statements. Each statement consists of one or more expressions followed by one or more declarations. I am using menhir and OCaml to write the lexer and parser.
For example:
Ax = b, where A is invertible.
This should be read as A * x = b, (A, invertible)
In an expression all ids must be either an uppercase or lowercase symbol. I would like to overload the multiplication operator so that the user does not have to type in the '*' symbol.
However, since the lexer also needs to be able to read strings (such as "invertible" in this case), the "Ax" portion of the expression is sent over to the parser as a string. This causes a parser error since no strings should be encountered in the expression portion of the statement.
Here is the basic idea of the grammar
stmt :=
| expr "."
| decl "."
| expr "," decl "."
expr :=
| term
| unop expr
| expr binop expr
term :=
| <int> num
| <char> id
| "(" expr ")"
decl :=
| id "is" kinds
kinds :=
| <string> kind
| kind "and" kinds
Is there some way to separate the individual characters and tell the parser that they should be treated as multiplication? Is there a way to change the lexer so that it is smart enough to know that all character clusters before a comma are ids and all clusters after should be treated as strings?
It seems to me you have two problems:
You want your lexer to treat sequences of characters differently in different places.
You want multiplication to be indicated by adjacent expressions (no operator in between).
The first problem I would tackle in the lexer.
One question is why you say you need to use strings. This implies that there is a completely open-ended set of things you can say. It might be true, but if you can limit yourself to a smallish number, you can use keywords rather than strings. E.g., invertible would be a keyword.
If you really want to allow any string at all in such places, it's definitely still possible to hack a lexer so that it maintains a state describing what it has seen, and looks ahead to see what's coming. If you're not required to adhere to a pre-defined grammar, you could adjust your grammar to make this easier. (E.g., you could use commas for only one purpose.)
For the second problem, I'd say you need to add adjacency to your grammar. I.e., your grammar needs a rule that says something like term := term term. I suspect it's tricky to get this to work correctly, but it does work in OCaml (where adjacent expressions represent function application) and in awk (where adjacent expressions represent string concatenation).

Left recursion, associativity and AST evaluation

So I have been reading a bit on lexers, parser, interpreters and even compiling.
For a language I'm trying to implement I settled on a Recrusive Descent Parser. Since the original grammar of the language had left-recursion, I had to slightly rewrite it.
Here's a simplified version of the grammar I had (note that it's not any standard format grammar, but somewhat pseudo, I guess, it's how I found it in the documentation):
expr:
-----
expr + expr
expr - expr
expr * expr
expr / expr
( expr )
integer
identifier
To get rid of the left-recursion, I turned it into this (note the addition of the NOT operator):
expr:
-----
expr_term {+ expr}
expr_term {- expr}
expr_term {* expr}
expr_term {/ expr}
expr_term:
----------
! expr_term
( expr )
integer
identifier
And then go through my tokens using the following sub-routines (simplified pseudo-code-ish):
public string Expression()
{
string term = ExpressionTerm();
if (term != null)
{
while (PeekToken() == OperatorToken)
{
term += ReadToken() + Expression();
}
}
return term;
}
public string ExpressionTerm()
{
//PeekToken and ReadToken accordingly, otherwise return null
}
This works! The result after calling Expression is always equal to the input it was given.
This makes me wonder: If I would create AST nodes rather than a string in these subroutines, and evaluate the AST using an infix evaluator (which also keeps in mind associativity and precedence of operators, etcetera), won't I get the same result?
And if I do, then why are there so many topics covering "fixing left recursion, keeping in mind associativity and what not" when it's actually "dead simple" to solve or even a non-problem as it seems? Or is it really the structure of the resulting AST people are concerned about (rather than what it evaluates to)? Could anyone shed a light, I might be getting it all wrong as well, haha!
The shape of the AST is important, since a+(b*3) is not usually the same as (a+b)*3 and one might reasonably expect the parser to indicate which of those a+b*3 means.
Normally, the AST will actually delete parentheses. (A parse tree wouldn't, but an AST is expected to abstract away syntactic noise.) So the AST for a+(b*3) should look something like:
Sum
|
+---+---+
| |
Var Prod
| |
a +---+---+
| |
Var Const
| |
b 3
If you language obeys usual mathematical notation conventions, so will the AST for a+b*3.
An "infix evaluator" -- or what I imagine you're referring to -- is just another parser. So, yes, if you are happy to parse later, you don't have to parse now.
By the way, showing that you can put tokens back together in the order that you read them doesn't actually demonstrate much about the parser functioning. You could do that much more simply by just echoing the tokenizer's output.
The standard and easiest way to deal with expressions, mathematical or other, is with a rule hierarchy that reflects the intended associations and operator precedence:
expre = sum
sum = addend '+' sum | addend
addend = term '*' addend | term
term = '(' expre ')' | '-' integer | '+' integer | integer
Such grammars let the parse or abstract trees be directly evaluatable. You can expand the rule hierarchy to include power and bitwise operators, or make it part of the hierarchy for logical expressions with and or and comparisons.

Resources