LL(1) parsing conflict

I'm writing an LL(1) parser for a very simple grammar, yet I've run into conflicts when trying to build the parsing table.
I'm surprised, since the grammar seems simple. I don't know whether the problem is in my parser or in my understanding of LL(1) parsing. Maybe the grammar is not LL(1) after all.
The grammar is:
1: S -> begin list
2: list -> id listPrime
3: listPrime -> id listPrime
4: | ε
My code runs into two conflicts, both for deriving listPrime: one on the terminal symbol id and one on EOF. In both cases rule 3 clashes with rule 4.
My computed FIRST and FOLLOW sets are:
first:
{ S: Set { 'begin' },
  list: Set { 'id' },
  listPrime: Set { 'id', 'eps' } }
follow:
{ S: Set { 'EOF' },
  list: Set { 'EOF', 'id' },
  listPrime: Set { 'EOF', 'id' } }

The grammar is LL(1). Your FOLLOW sets are computed incorrectly, which is easy to verify: there is no derivation in which list or listPrime is followed by any token other than EOF, so FOLLOW(list) and FOLLOW(listPrime) should both be { EOF }.
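For reference, here is a minimal FIRST/FOLLOW computation for this grammar (a Python sketch, not the asker's code; 'eps' and 'EOF' are just the markers used above). It reproduces the FIRST sets quoted in the question and shows that FOLLOW(list) and FOLLOW(listPrime) contain only EOF, so rule 3 goes in the id column of the table and rule 4 in the EOF column, with no conflict:

EPS, EOF = 'eps', 'EOF'
grammar = {
    'S':         [['begin', 'list']],
    'list':      [['id', 'listPrime']],
    'listPrime': [['id', 'listPrime'], []],   # [] is the epsilon production
}
first = {nt: set() for nt in grammar}
follow = {nt: set() for nt in grammar}
follow['S'].add(EOF)

def first_of_seq(seq):
    # FIRST of a sequence of symbols; contains EPS only if every symbol is nullable
    out = set()
    for sym in seq:
        f = first[sym] if sym in grammar else {sym}
        out |= f - {EPS}
        if EPS not in f:
            return out
    out.add(EPS)
    return out

changed = True
while changed:
    changed = False
    for nt, prods in grammar.items():
        for prod in prods:
            f = first_of_seq(prod)
            if not f <= first[nt]:
                first[nt] |= f
                changed = True
            for i, sym in enumerate(prod):
                if sym not in grammar:
                    continue
                rest = first_of_seq(prod[i + 1:])
                add = (rest - {EPS}) | (follow[nt] if EPS in rest else set())
                if not add <= follow[sym]:
                    follow[sym] |= add
                    changed = True

print(first)    # {'S': {'begin'}, 'list': {'id'}, 'listPrime': {'id', 'eps'}}
print(follow)   # {'S': {'EOF'}, 'list': {'EOF'}, 'listPrime': {'EOF'}}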


How to break up a grammar file per statement

I have the following single-file grammar which will parse a (fake) SELECT statement or ALTER statement:
grammar Calc;
statements
: statement (NEWLINE statement)*
;
statement
: select_statement
| alter_statement
;
select_statement
: SELECT IDENTIFIER
;
alter_statement
: ALTER IDENTIFIER
;
NEWLINE: '\n';
ALTER: 'ALTER';
SELECT: 'SELECT';
IDENTIFIER: 'one' | 'two' | 'three';
WS: [ \t\n\r]+ -> skip; // We put it at the end and ONLY capture single white-spaces,
// so it's always overridden if anything else is provided above it, such as the NEWLINE
And the test input:
SELECT one
ALTER two
I would like to separate this grammar into separate parser files for each statement. This can be done in two steps. In the first part I'll separate the parser and lexer:
parser grammar SQLParser;
options { tokenVocab = SQLLexer; }
program: statements EOF;
statements: statement (NEWLINE statement)*;
statement
: select_statement
| alter_statement
;
select_statement: SELECT IDENTIFIER;
alter_statement: ALTER IDENTIFIER;
lexer grammar SQLLexer;
options { caseInsensitive=true; }
NEWLINE: '\n';
ALTER: 'ALTER';
SELECT: 'SELECT';
IDENTIFIER: 'one' | 'two' | 'three';
WS: [ \t\n\r]+ -> skip;
How would I then separate the two statements into their own files, such as SQLSelectParser.g4 and SQLAlterParser.g4? Are there any downsides to breaking up multiple complex statements into their own files? (In this example it's trivial, of course; it's just to illustrate the question.)
Update: the following approach seems to work fine, though it would be good for someone experienced to comment on it and on whether it's even a good idea at all:
# SQLParser.g4
parser grammar SQLParser;
import SQLAlterParser, SQLSelectParser;
options { tokenVocab = SQLLexer; }
program
: statements EOF
;
statements
: statement (NEWLINE statement)*
;
statement
: select_statement
| alter_statement
;
# SQLSelectParser.g4
parser grammar SQLSelectParser;
options { tokenVocab = SQLLexer; }
select_statement
: SELECT IDENTIFIER
;
# SQLAlterParser.g4
parser grammar SQLAlterParser;
options { tokenVocab = SQLLexer; }
alter_statement
: ALTER IDENTIFIER
;
# SQLLexer.g4
lexer grammar SQLLexer;
options { caseInsensitive=true; }
NEWLINE: '\n';
ALTER: 'ALTER';
SELECT: 'SELECT';
IDENTIFIER: 'one' | 'two' | 'three';
WS: [ \t\n\r]+ -> skip;
ANTLR import is not a simple include; each grammar is treated as its own grammar "object". When importing, you get something closer to superclass behavior, as explained in the Grammar Imports documentation. You may find that it gets tricky to keep everything straight (especially as you discover common sub-rules), but as the documentation shows, it can definitely be useful.
You didn't elaborate on your motivation for breaking things up. If it's just to split the grammar into smaller source files for organizational purposes, the complexity introduced by the way imports work may outweigh what you save compared with keeping the grammar in a single file (or just splitting the lexer and parser, which is very common).

Why does this grammar fail to parse this input?

I'm defining a grammar for a small language with Antlr4. The idea is that in this language there's a keyword "function" which can be used either to define a function or as a type specifier when defining parameters. I would like to be able to do something like this:
function aFunctionHere(int a, function callback) ....
However, it seems Antlr doesn't like that I use "function" in two different places. As far as I can tell, the grammar isn't even ambiguous.
In the following grammar, if I remove LINE 1, the generated parser parses the sample input without a problem. Also, if I change the token string in either LINE 2 or LINE 3, so that they are not equal, the parser works.
The error I get with the grammar as-is:
line 1:0 mismatched input 'function' expecting <INVALID>
What does "expecting <INVALID>" mean?
The (stripped down) grammar:
grammar test;
begin : function ;
function: FUNCTION IDENTIFIER '(' parameterlist? ')' ;
parameterlist: parameter (',' parameter)+ ;
parameter: BaseParamType IDENTIFIER ;
// Lexer stuff
BaseParamType:
INT_TYPE
| FUNCTION_TYPE // <---- LINE 1
;
FUNCTION : 'function'; // <---- LINE 2
INT_TYPE : 'int';
FUNCTION_TYPE : 'function'; // <---- LINE 3
IDENTIFIER : [a-zA-Z_$]+[a-zA-Z_$0-9]*;
WS : [ \t\r\n]+ -> skip ;
The input I'm using:
function abc(int c, int d, int a)
The program to test the generated parser:
from antlr4 import *
from testLexer import testLexer as Lexer
from testParser import testParser as Parser
from antlr4.tree.Trees import Trees

def main(argv):
    input = FileStream(argv[1] if len(argv) > 1 else "test.in")
    lexer = Lexer(input)
    tokens = CommonTokenStream(lexer)
    parser = Parser(tokens)
    tree = parser.begin()
    print(Trees.toStringTree(tree, None, parser))

if __name__ == '__main__':
    import sys
    main(sys.argv)
Just use one name for the token function.
A token is just a token. Looking at function in isolation, it is not possible to decide whether it is a FUNCTION or a FUNCTION_TYPE. Since FUNCTION comes first in the file, that's the one the lexer uses. That makes it impossible to ever match FUNCTION_TYPE, so it becomes an invalid token type.
The parser will figure out the syntactic role of the token function, so there would be no point in using two different lexical descriptors for the same token, even if it were possible.
In the grammar in the OP, BaseParamType is also a lexical type, which will absorb all uses of the token function, preventing FUNCTION from being recognized in the production for function. Changing its name to baseParamType, which effectively changes it to a parser non-terminal, will allow the parser to work, although I suppose it may alter the parse tree in undesirable ways.
I understand the objection that the parser "should know" which lexical tokens are possible in context, given the nature of Antlr's predictive parsing strategy. I'm far from an Antlr expert so I won't pretend to explain why it doesn't seem to work, but with the majority of parser generators -- and all the ones I commonly use -- lexical analysis is effectively performed as a prior pass to parsing, so the conversion of textual input into a stream of tokens is done prior to the parser establishing context. (Most lexical generators, including Antlr, have mechanisms with which the user can build lexical context, but IMHO these mechanisms reduce grammar readability and should only be used if strictly necessary.)
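As a rough illustration of that point, here is a toy single-pass lexer in Python (purely hypothetical, and simplified: real ANTLR prefers the longest match and only breaks ties by rule order). Looking at the text alone, 'function' always becomes a FUNCTION token, regardless of what the parser would like to see next:

import re

RULES = [                       # (token type, pattern) in priority order
    ('FUNCTION',   r'function'),
    ('INT_TYPE',   r'int'),
    ('IDENTIFIER', r'[a-zA-Z_$]+[a-zA-Z_$0-9]*'),
    ('PUNCT',      r'[(),]'),
    ('WS',         r'[ \t\r\n]+'),
]

def tokenize(text):
    pos = 0
    while pos < len(text):
        for name, pattern in RULES:
            m = re.match(pattern, text[pos:])
            if m:
                if name != 'WS':            # emulate '-> skip'
                    yield name, m.group(0)
                pos += m.end()
                break
        else:
            raise SyntaxError('no rule matches at position %d' % pos)

print(list(tokenize('function abc(int c, function d)')))
# every 'function' is lexed as FUNCTION; the parser decides its syntactic role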
Here's the grammar file which I tested:
grammar test;
begin : function ;
function: FUNCTION IDENTIFIER '(' parameterlist? ')' ;
parameterlist: parameter (',' parameter)+ ;
parameter: baseParamType IDENTIFIER ;
baseParamType:
INT_TYPE
| FUNCTION
;
// Lexer stuff
FUNCTION : 'function';
INT_TYPE : 'int';
IDENTIFIER : [a-zA-Z_$]+[a-zA-Z_$0-9]*;
WS : [ \t\r\n]+ -> skip ;

How to resolve ambiguity in the definition of an LR(1) grammar?

I am writing a Golang compiler in OCaml, and argument lists are causing me a bit of a headache. In Go, you can group consecutive parameter names of the same type in the following way:
func f(a, b, c int) === func f(a int, b int, c int)
You can also have a list of types, without parameter names:
func g(int, string, int)
The two styles cannot be mixed and matched; either all parameters are named or none are.
My issue is that when the parser sees a comma, it doesn't know what to do. In the first example, is a the name of a type, or the name of a variable with more variables coming up? The comma plays a dual role here and I am not sure how to resolve it.
I am using the Menhir parser generator tool for OCaml.
Edit: at the moment, my Menhir grammar follows exactly the rules as specified at http://golang.org/ref/spec#Function_types
As written, the Go grammar is not LALR(1). In fact, it is not LR(k) for any k. It is, however, unambiguous, so you could successfully parse it with a GLR parser, if you can find one (I'm pretty sure that there are several GLR parser generators for OCaml, but I don't know enough about any of them to recommend one).
If you don't want to (or can't) use a GLR parser, you can do it the same way Russ Cox did in the gccgo compiler, which uses bison. (bison can generate GLR parsers, but Cox doesn't use that feature.) His technique does not rely on the scanner distinguishing between type-names and non-type-names.
Rather, it just accepts parameter lists whose elements are either name_or_type or name name_or_type (actually, there are more possibilities than that, because of the ... syntax, but it doesn't change the general principle.) That's simple, unambiguous and LALR(1), but it is overly-accepting -- it will accept func foo(a, b int, c), for example -- and it does not produce the correct abstract syntax tree because it doesn't attach the type to the list of parameters being declared.
What that means is that once the argument list is fully parsed and is about to be inserted into the AST attached to some function declaration (for example), a semantic scan is performed to fix it up and, if necessary, produce an error message. That scan is done right-to-left over the list of declaration elements, so that the specified type can be propagated to the left.
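Here is a rough Python sketch of that right-to-left fix-up pass (purely illustrative; the element shapes and function name are made up, not gccgo's actual code). Each parsed element is either a bare item, which could still be a name or a type, or an explicit "name type" pair; the scan propagates each explicit type to the bare items on its left and rejects mixed lists:

def fix_params(elems):
    # elems: ('item', x) for a bare identifier-or-type, ('decl', name, type) for "name type"
    params, pending_type = [], None
    for e in reversed(elems):
        if e[0] == 'decl':
            _, name, typ = e
            pending_type = typ                    # propagates to the items on its left
            params.append((name, typ))
        elif pending_type is not None:
            params.append((e[1], pending_type))   # bare item is a parameter name
        else:
            params.append((None, e[1]))           # no named parameters: bare item is a type
    params.reverse()
    named = [p for p in params if p[0] is not None]
    if named and len(named) != len(params):
        raise SyntaxError('either all parameters are named or none are')
    return params

print(fix_params([('item', 'a'), ('item', 'b'), ('decl', 'c', 'int')]))     # func f(a, b, c int)
print(fix_params([('item', 'int'), ('item', 'string'), ('item', 'int')]))   # func g(int, string, int)
# fix_params([('item', 'a'), ('decl', 'b', 'int'), ('item', 'c')])          # func foo(a, b int, c): rejected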
It's worth noting that the grammar in the reference manual is also overly-accepting, because it does not express the constraint that "either all parameters are named or none are". That constraint could be expressed in an LR(1) grammar -- I'll leave that as an exercise for readers -- but the resulting grammar would be a lot more difficult to understand.
You don't have ambiguity. The fact that the standard Go parser is LALR(1) proves that.
is a the name of a type or the name of a variable with more variables coming up?
So basically your grammar, and the parser as a whole, should be completely disconnected from the symbol table; don't be C: your grammar is not ambiguous, so you can check the type names later, in the AST.
These are the relevant rules (from http://golang.org/ref/spec); they are already correct.
Parameters = "(" [ ParameterList [ "," ] ] ")" .
ParameterList = ParameterDecl { "," ParameterDecl } .
ParameterDecl = [ IdentifierList ] [ "..." ] Type .
IdentifierList = identifier { "," identifier } .
I'll explain them to you:
IdentifierList = identifier { "," identifier } .
The curly braces represent the Kleene closure (in POSIX regular expression notation it's the asterisk). This rule says "an identifier, optionally followed by a literal comma and an identifier, optionally followed by a literal comma and an identifier, and so on".
ParameterDecl = [ IdentifierList ] [ "..." ] Type .
The square brackets denote optionality: that part may or may not be present (in POSIX regular expression notation it's the question mark). So you have "maybe an IdentifierList, followed by maybe an ellipsis, followed by a Type".
ParameterList = ParameterDecl { "," ParameterDecl } .
You can have several ParameterDecls in a list, e.g. func x(a, b int, c, d string).
Parameters = "(" [ ParameterList [ "," ] ] ")" .
This rule says that the ParameterList is optional, is surrounded by parentheses, and may include an optional trailing comma, which is useful when you write something like:
func x(
a, b int,
c, d string, // <- note the final comma
)
The Go grammar is portable and can be parsed by any bottom-up parser with one token of lookahead.
Edit regarding "don't be C": I said this because C is context-sensitive, and the way many (all?) compilers solve this problem is by wiring the symbol table into the lexer and lexing tokens differently depending on whether they are defined as type names or as variables. This is a hack and should not be done for unambiguous grammars!

Overlapping Tokens in ANTLR 4

I have the following ANTLR 4 combined grammar:
grammar Example;
fieldList: field* ;
field: 'field' identifier '{' note '}' ;
note: NOTE ;
identifier: IDENTIFIER ;
NOTE: [A-Ga-g] ;
IDENTIFIER: [A-Za-z0-9]+ ;
WS: [ \t\r\n]+ -> skip ;
This parses:
field x { A }
field x { B }
This does not:
field a { A }
field b { B }
In the case where parsing fails, I think the lexer is getting confused and emitting a NOTE token where I want it to emit an IDENTIFIER token.
Edit:
In the tokens coming out of the lexer, the NOTE token is showing up where the parser expects IDENTIFIER. NOTE takes precedence because it is declared first in the grammar. I can think of two ways to fix this: I could alter the grammar to disambiguate NOTE and IDENTIFIER (like adding a '$' in front of a NOTE), or I could just use IDENTIFIER wherever I would use note and then detect the issues when I walk the parse tree. Neither of those feels optimal. Surely there must be a way to fix this?
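To make the lexer's decision concrete, here is a toy longest-match tokenizer in Python (illustrative only; it mimics ANTLR's rule of "longest match wins, ties broken by whichever rule is declared first"). A one-letter name like a ties between NOTE and IDENTIFIER, and NOTE wins because it is declared first:

import re

RULES = [('NOTE', r'[A-Ga-g]'), ('IDENTIFIER', r'[A-Za-z0-9]+')]   # declaration order

def lex_one(text):
    best = None
    for name, pattern in RULES:
        m = re.match(pattern, text)
        if m and (best is None or len(m.group(0)) > len(best[1])):
            best = (name, m.group(0))   # strictly longer matches win; ties keep the earlier rule
    return best

print(lex_one('a'))    # ('NOTE', 'a')        -- why "field a { A }" breaks
print(lex_one('x'))    # ('IDENTIFIER', 'x')  -- why "field x { A }" works
print(lex_one('ab'))   # ('IDENTIFIER', 'ab') -- a longer match beats NOTE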
I actually ended up solving it like this:
grammar Example;
fieldList: field* ;
field: 'field' identifier '{' note '}' ;
note: NOTE ;
identifier: IDENTIFIER | NOTE ;
NOTE: [A-Ga-g] ;
IDENTIFIER: [A-Za-z0-9]+ ;
WS: [ \t\r\n]+ -> skip ;
My parse tree still ends up looking how I'd like.
The actual grammar I'm developing is more complicated, as is the workaround based on this approach. But in general, the approach seems to work well.
A quick and dirty fix for your problem:
Change IDENTIFIER to match only the complement of NOTE, then put the two together in identifier.
Resulting grammar:
grammar Example;
fieldList: field* ;
field: 'field' identifier '{' note '}' ;
note: NOTE ;
identifier: (NOTE|IDENTIFIER_C)+ ;
NOTE: [A-Ga-g] ;
IDENTIFIER_C: [H-Zh-z0-9] ;
WS: [ \t\r\n]+ -> skip ;
The disadvantage of this solution is that you no longer get the identifier as a single token; every character is tokenized separately.

Antlr conditional rewrites

I have the following Antlr grammar rule:
expression1
: e=expression2 (BINOR^ e2=expression2)*
;
However if I have '3 | 1 | 2 | 6' this results in a flat tree, with 3, 1, 2, 6 all children of the BINOR node. What I really want is to be able to pattern match on either
expression2
or
^(BINOR expression2 expression2)
How can I change the rewrite so that these are the 2 patterns?
EDIT:
If I use custom rewrites, I'm thinking along the lines of
expression1
: e=expression2 (BINOR e2=expression2)*
-> {$BINOR != null}? ^(BINOR $e $e2*)
-> $e
But when I do this with '1|2|3' the resulting tree only has one BINOR node with two children which are 1 and 3, so 2 is missing.
Many thanks
You were close, this would work:
expression1
@init{boolean or = false;}
: e=expression2 (BINOR {or=true;} expression2)* -> {or}? ^(BINOR expression2+)
-> $e
;
But this is preferred since it doesn't use any custom code:
grammar T;
options {
output=AST;
}
expression1
: (e=expression2 -> $e) ((BINOR expression2)+ -> ^(BINOR expression2+))?
;
expression2
: NUMBER
;
NUMBER
: '0'..'9'+
;
BINOR
: '|'
;
The parser generated from the grammar above parses the input "3|1|2|6" into an AST with a single BINOR root and 3, 1, 2 and 6 as its children, ^(BINOR 3 1 2 6), and the input "3" into a tree consisting of just the node 3.
But your original try:
expression1
: e=expression2 (BINOR^ e2=expression2)*
;
does not produce a flat tree (assuming you have output=AST; in your options). For "3|1|2|6" it generates a left-nested AST with a BINOR node per operator: ^(BINOR ^(BINOR ^(BINOR 3 1) 2) 6).
If you "see" a flat tree, I guess you're using the interpreter in ANTLRWorks, which does not show the AST but the parse tree of your parse. The interpreter is also rather buggy (it does not handle predicates and does not evaluate custom code), so best not to use it. Use the ANTLRWorks debugger instead, which works like a charm (the ASTs described in this answer come from the debugger)!
