Fixing ambiguities on a Lemon parser grammar

I have what appears to be an ambiguous grammar. The problem seems to lie under FileText, since there is no conflict when I run only the top part (above FileText). Can anyone help me spot the issue? I believe my tree looks fine. Here's an input sample:
lemon AND (#Chapter1.Title : "BNF grammar" AND #Chapter10.Title : ("BNF notion" OR "EBNF notion"))
error:
QUOT shift 17
QUOT reduce 14 ** Parsing conflict **
STR shift-reduce 20 subval ::= STR
STR reduce 14 ** Parsing conflict **
LPAR shift 7
LPAR reduce 14 ** Parsing conflict **
WS shift-reduce 10 space ::= WS
WS reduce 14 ** Parsing conflict **
op shift 9
space shift 12
text shift-reduce 15 filetext::= filetext text
subvalue shift-reduce 15 filetext::= filetext text /*because subval==text
{default} reduce 14 location ::= location COLON filetext
grammar:
%left::=AND.
%left::=OR.
book::= expr.
expr::= expr term.
expr::= expr op term.
expr::= term.
term::= value.
term::= QUOT STR QUOT.
value::= atom.
value::= LPAR expr RPAR.
atom::= STR.
atom::= file.
op::= space AND space.
op::= space OR space.
space::= WS.
space::= space WS.
file::= location COLON filetext.
location::= SHARP STR PERIOD STR.
filetext::= filetext text.
filetext::= filetext op text.
filetext::= text.
text::= subvalue.
text::= QUOT STR QUOT.
subvalue::= subatom.
subvalue::= LPAR filetext RPAR.
subatom::= STR.
For what it's worth, this is the tree I came up with and derived my grammar from:

The problem is that you allow implicit concatenation (i.e., without an operator) in both expr and filetext.
Consider the concatenation expr term.
Note that expr can derive file (through term -> value -> atom) and file ends with filetext, which in turn can be filetext text.
Both text and term derive STR and QUOT STR QUOT. Suppose your input were something like
SHARP STR PERIOD STR COLON STR STR
------location------ --filetext--?
----------------------file-------------?
Now, how is that last STR to be handled?
The parser could reduce location COLON filetext to file and then expr. Then the last STR could be reduced to term (through atom and value), so that the whole input reduces to expr using expr::=expr term.
But it could also reduce STR to text (via subatom and subvalue), which would let it extend filetext to include the last STR. Then when it reduces location COLON filetext to file and expr, it's done.
That's definitely an ambiguity (and it's just one of many). There's no obvious way to resolve the conflict -- at least, nothing is obvious to me -- so you'll have to figure out which of the alternatives you actually want.
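The two readings described above can be checked mechanically. The following is only a sketch: it copies the grammar from the question with the op and space rules left out (an assumption made to keep it small), and uses a hypothetical helper, count_parses, which counts the parse trees for a token sequence by trying every way of splitting it.

```python
from functools import lru_cache

# Simplified copy of the grammar from the question (op/space rules omitted).
GRAMMAR = {
    "book": [["expr"]],
    "expr": [["expr", "term"], ["term"]],
    "term": [["value"], ["QUOT", "STR", "QUOT"]],
    "value": [["atom"], ["LPAR", "expr", "RPAR"]],
    "atom": [["STR"], ["file"]],
    "file": [["location", "COLON", "filetext"]],
    "location": [["SHARP", "STR", "PERIOD", "STR"]],
    "filetext": [["filetext", "text"], ["text"]],
    "text": [["subvalue"], ["QUOT", "STR", "QUOT"]],
    "subvalue": [["subatom"], ["LPAR", "filetext", "RPAR"]],
    "subatom": [["STR"]],
}


def count_parses(tokens, start="book"):
    """Count the distinct parse trees that derive `tokens` from `start`."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def derive(symbol, i, j):
        # Number of trees in which `symbol` derives tokens[i:j].
        if symbol not in GRAMMAR:  # terminal: must match exactly one token
            return 1 if j - i == 1 and tokens[i] == symbol else 0
        return sum(sequence(tuple(rhs), i, j) for rhs in GRAMMAR[symbol])

    @lru_cache(maxsize=None)
    def sequence(rhs, i, j):
        # Number of ways the symbols in `rhs` derive tokens[i:j].
        # No rule is empty, so every symbol consumes at least one token.
        if len(rhs) == 1:
            return derive(rhs[0], i, j)
        total = 0
        for k in range(i + 1, j - len(rhs) + 2):
            left = derive(rhs[0], i, k)
            if left:
                total += left * sequence(rhs[1:], k, j)
        return total

    return derive(start, 0, len(tokens))


sample = "SHARP STR PERIOD STR COLON STR STR".split()
print(count_parses(sample))
```

A count greater than one for a token sequence means the grammar is ambiguous for that input, independently of which parser generator is used.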
By the way, I don't think your whitespace handling corresponds to your expectations. For example, nothing in your grammar permits the whitespace which surrounds your :. You're probably better off just ignoring whitespace in your lexical analyser and dropping it from the grammar, where it just complicates things. If you don't want to allow "word"suffix to be analysed as two lexemes ("word" suffix), then you can put a check in your lexical analyser which requires a close quote to not be followed by a STR (and, similarly, an open quote to not be preceded by a STR). You would probably also be better off recognising the quoted strings in your lexical analyser, since the difference is lexical rather than syntactic. As written, QUOT STR QUOT won't match "BNF grammar", for example (because BNF grammar doesn't match STR).
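As a rough illustration of that advice, here is a sketch of a lexical analyser that drops whitespace and recognises a quoted string as a single token. It is written in Python rather than as a real Lemon tokenizer, and the token name QSTR and the exact character set for STR are assumptions, not part of the original grammar. The adjacency check for "word"suffix is not implemented here.

```python
import re

# One alternative per token kind; earlier alternatives win, so the
# keywords AND/OR are tried before the generic STR pattern.
TOKEN_RE = re.compile(r'''
    (?P<WS>\s+)
  | (?P<QSTR>"[^"]*")        # whole quoted string, spaces included
  | (?P<AND>\bAND\b)
  | (?P<OR>\bOR\b)
  | (?P<SHARP>\#)
  | (?P<PERIOD>\.)
  | (?P<COLON>:)
  | (?P<LPAR>\()
  | (?P<RPAR>\))
  | (?P<STR>[A-Za-z0-9_]+)
''', re.VERBOSE)


def tokenize(text):
    """Return (kind, lexeme) pairs; whitespace never reaches the parser."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character {text[pos]!r} at offset {pos}")
        if match.lastgroup != "WS":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


for token in tokenize('lemon AND (#Chapter1.Title : "BNF grammar")'):
    print(token)
```

With a token stream like this, the grammar no longer needs the space rules, and term and text can refer to the quoted-string token directly instead of QUOT STR QUOT.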

Related

Use character as operator between numbers, otherwise treat it as a token ANTLR4

I'm making a language in ANTLR, where a sequence of digits is a number. A sequence of digits, letters and an underscore is, however, an identifier. So, for example:
These are numbers: 234, 0243, 0, 11
These are identifiers: foo, 2foo, foo2, 2y8
However, I also have operators, like multiplication, addition, division... They all work fine, except for one operator, which is the scientific operator e (or E). Unlike in most other languages, where the e for a scientific number is considered part of the number itself (like 2e3), in my language the e is considered to be an operator itself. So for example, (2+5)e4 is valid.
However, this brings an issue: since e is a letter, unless I separate the parts of my scientific number with spaces, ANTLR recognizes 2e3 as an identifier instead of the operation 2-e-3.
I'd like the language to always treat e as an operator, and not as part of an identifier, if what's on both sides of the e is something other than a letter or an underscore, or if it's alone. So, for example:
The following are treated as operations: 2e3, 2 e 3, 2E3, 2e 3, 2 e3, 23.4e78.9, 5e($my_var), .0e4, -7e3, 7e+9, 0e-4, (2)e(3), 2e(hello)
The following are treated as identifiers: hello, 2ey9, w3e6, 6ee7, 7ep, e9, e.
I have the following minimal reproduction example:
grammar ScientificExample;
file: expr* EOF;
expr
: UNARY expr
| L_PAREN expr R_PAREN
| expr SCIENTIFIC expr
| atom;
atom: NUMBER | IDENTIFIER;
WS : [ \r\t\n]+ -> skip ;
SCIENTIFIC: [eE];
NUMBER: ( [0-9]* '.' )? [0-9]+;
L_PAREN: '(';
R_PAREN: ')';
IDENTIFIER: [a-zA-Z0-9_]+;
UNARY: [+-];
As I understand it, ANTLR gives priority to whichever parsing rule produces the largest possible result, which might be why it's not giving priority to scientific being first. Any ideas how I might engineer this so priority is given to valid scientific expressions, as long as both members are an expression that isn't a simple IDENTIFIER atom, or the lack thereof?
I've thought of over-engineering the IDENTIFIER lexing rule, however since ANTLR doesn't really have lookahead and lookbehind expressions, I'm not completely sure how I'd achieve that.

Parser grammar rule is being ignored

The Goal
The goal is to interpret plain text content and recognise patterns, e.g. Arithmetic, Comments, Units of Measurement.
Example Input
This would be entered by a user.
# This is an example comment
10 + 10
// Another comment
This is one line of text
Tested
Expected Parse Tree
The goal of my grammar is to generate a tree that would look like this (if anyone has a better method I'd be interested to hear).
Note: The 10 + 10 is being recognised as an arithmetic rule.
Current Parse Tree aka The Problem
Below is the current output from the lexer and parser.
Note: The 10 + 10 is being recognised as text and not as the arithmetic rule.
Grammar Definition
The logic of the grammar at a high level is as follows:
Parse line by line
Determine the content type of each line; if nothing else matches, fall back to text
grammar ContentParser;
/*
* Tokens
*/
NUMBER: '-'? [0-9]+;
LPARAN: '(';
RPARAN: ')';
POW: '^';
MUL: '*';
DIV: '/';
ADD: '+';
SUB: '-';
LINE_COMMENT: '#' TEXT | '//' TEXT;
TEXT: ~[\n\r]+;
EOL: '\r'? '\n';
/*
* Rules
*/
start: file;
file: line+ EOF;
line: content EOL;
content
: comment
| arithmetic
| text
;
// Custom Content Types
comment: LINE_COMMENT;
/// Example taken from ANTLR Docs
arithmetic:
NUMBER # Number
| LPARAN inner = arithmetic RPARAN # Parentheses
| left = arithmetic operator = POW right = arithmetic # Power
| left = arithmetic operator = (MUL | DIV) right = arithmetic # MultiplicationOrDivision
| left = arithmetic operator = (ADD | SUB) right = arithmetic # AdditionOrSubtraction;
text: TEXT;
My Understanding
The content rule should check for a match of the comment rule then followed by the arithmetic rule and finally falling back to the text rule which matches any character apart from newlines.
However, I believe that the lexer is being greedy on the TEXT tokens which is causing issues but I'm not sure.
(I'm still learning ANTLR)

When you are writing a parser, it's always a good idea to print out the tokens for the input.
In the current grammar, 10 + 10 is recognized by the lexer as TEXT, which is not what is needed. It is TEXT because that is the longest string matched by any rule. It does not matter in this case that the TEXT rule occurs after the NUMBER rule in the grammar. The rule is that ANTLR lexers will always match the longest string possible among the given lexer rules. But if two or more lexer rules match strings of equal length, then the first rule "wins". The lexer works pretty much independently of the parser.
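The longest-match rule with first-rule tie-breaking can be imitated in a few lines. This is only a sketch in Python, not ANTLR's actual implementation; the rule list is a cut-down copy of the lexer rules from the question, and next_token and tokenize are hypothetical helper names.

```python
import re

# Rules in grammar order; the order only matters when two matches tie.
RULES = [
    ("NUMBER", re.compile(r"-?[0-9]+")),
    ("ADD", re.compile(r"\+")),
    ("TEXT", re.compile(r"[^\n\r]+")),
    ("EOL", re.compile(r"\r?\n")),
]


def next_token(text, pos):
    """Pick the rule with the longest match at `pos`; ties go to the earliest rule."""
    best = None
    for name, pattern in RULES:
        match = pattern.match(text, pos)
        # Strictly longer is required to replace an earlier rule's match.
        if match and match.end() > pos and (best is None or match.end() > best[2]):
            best = (name, match.group(), match.end())
    return best


def tokenize(text):
    pos, tokens = 0, []
    while pos < len(text):
        token = next_token(text, pos)
        if token is None:
            raise ValueError(f"no rule matches at offset {pos}")
        tokens.append(token[:2])
        pos = token[2]
    return tokens


print(tokenize("10 + 10\n"))  # TEXT swallows the whole line
print(tokenize("10\n"))       # NUMBER and TEXT tie, NUMBER is listed first
```

Printing the token stream like this is the quickest way to see that the parser never gets a chance to try the arithmetic rule for 10 + 10.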
There is no way to reliably have spaces in a text string, and not have them in arithmetic. The fix is to push spaces and tabs into an "off-channel" stream, then reconstruct the text by looking at the start and end character indices of the first and last tokens for the text tree node. The tree is a little messier, but it does what you need.
Also, you should name the grammar "Content" rather than "ContentParser", because otherwise you end up with "ContentParserParser.java" and "ContentParserLexer.java" when you generate the parser, which is rather confusing. I also took the liberty of removing the labels and variables (I don't use them because I work with XPath expressions on the tree). And I reordered and reformatted the grammar so that each rule is on a single line and the rules are sorted alphabetically, which makes them quicker to find in a text editor without needing an IDE to navigate around.
A grammar that does all this is:
grammar Content;
arithmetic: NUMBER | LPARAN arithmetic RPARAN | arithmetic POW arithmetic | arithmetic (MUL | DIV) arithmetic | arithmetic (ADD | SUB) arithmetic ;
comment: LINE_COMMENT;
content : comment | arithmetic | text ;
file: line+ EOF;
line: content? EOL;
start: file;
text: TEXT+;
ADD: '+';
DIV: '/';
LINE_COMMENT: '#' STUFF | '//' STUFF;
LPARAN: '(';
MUL: '*';
NUMBER: '-'? [0-9]+;
POW: '^';
RPARAN: ')';
SUB: '-';
fragment STUFF : ~[\n\r]* ;
EOL: '\r'? '\n';
WS : [ \t]+ -> channel(HIDDEN);
TEXT: .; // Must be last lexer rule, and only one char in length.
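Reconstructing the text from the character indices of the first and last tokens might look roughly like the following. This is a Python sketch rather than code against the ANTLR runtime: the Token class and lex_line are hypothetical stand-ins for the generated lexer, and the only property assumed is that each token records an inclusive start and stop index into the source.

```python
from dataclasses import dataclass


@dataclass
class Token:
    type: str
    start: int   # index of the token's first character
    stop: int    # index of the token's last character (inclusive)
    channel: str = "DEFAULT"


def lex_line(line):
    """Stand-in lexer: one TEXT token per character, and each run of
    spaces or tabs becomes a single WS token on the hidden channel."""
    tokens, i = [], 0
    while i < len(line):
        if line[i] in " \t":
            j = i
            while j < len(line) and line[j] in " \t":
                j += 1
            tokens.append(Token("WS", i, j - 1, channel="HIDDEN"))
            i = j
        else:
            tokens.append(Token("TEXT", i, i))
            i += 1
    return tokens


def node_text(source, node_tokens):
    """Rebuild the source text spanned by a tree node from its first and last tokens."""
    first, last = node_tokens[0], node_tokens[-1]
    return source[first.start:last.stop + 1]


source = "This is  one line"
# The parser only sees default-channel tokens, so the text node holds these.
visible = [t for t in lex_line(source) if t.channel == "DEFAULT"]
print(node_text(source, visible))
```

Because the slice is taken from the original input, the hidden whitespace between the visible tokens comes back unchanged, including runs of more than one space.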

parsing maxscript - problem with newlines

I am trying to create parser for MAXScript language using their official grammar description of the language. I use flex and bison to create the lexer and parser.
However, I have run into following problem. In traditional languages (e.g. C) statements are separated by a special token (; in C). But in MAXScript expressions inside a compound expression can be separated either by ; or newline. There are other languages that use whitespace characters in their parsers, like Python. But Python is much more strict about the placement of the newline, and following program in Python is invalid:
# compile error
def
foo(x):
print(x)
# compile error
def bar
(x):
foo(x)
However in MAXScript following program is valid:
fn
foo x =
( // parenthesis start the compound expression
a = 3 + 2; // the semicolon is optional
print x
)
fn bar
x =
foo x
And you can even write things like this:
for
x
in
#(1,2,3,4)
do
format "%," x
Which will evaluate fine and print 1,2,3,4, to the output. So newlines can be inserted into many places with no special meaning.
However if you insert one more newline in the program like this:
for
x
in
#(1,2,3,4)
do
format "%,"
x
You will get a runtime error as format function expects to have more than one parameter passed.
Here is part of the bison input file that I have:
expr:
simple_expr
| if_expr
| while_loop
| do_loop
| for_loop
| expr_seq
expr_seq:
"(" expr_semicolon_list ")"
expr_semicolon_list:
expr
| expr TK_SEMICOLON expr_semicolon_list
| expr TK_EOL expr_semicolon_list
if_expr:
"if" expr "then" expr "else" expr
| "if" expr "then" expr
| "if" expr "do" expr
// etc.
This will parse only programs which use newline only as expression separator and will not expect newlines to be scattered in other places in the program.
My question is: Is there some way to tell bison to treat a token as an optional token? For bison it would mean this:
If you find newline token and you can shift with it or reduce, then do so.
Otherwise just discard the newline token and continue parsing.
Because if there is no way to do this, the only other solution I can think of is modifying the bison grammar file so that it expects those newlines everywhere. And bump the precedence of the rule where newline acts as an expression separator. Like this:
%precedence EXPR_SEPARATOR // high precedence
%%
// w = sequence of whitespace tokens
w: %empty // either nothing
| TK_EOL w // or newline followed by other whitespace tokens
expr:
w simple_expr w
| w if_expr w
| w while_loop w
| w do_loop w
| w for_loop w
| w expr_seq w
expr_seq:
w "(" w expr_semicolon_list w ")" w
expr_semicolon_list:
expr
| expr w TK_SEMICOLON w expr_semicolon_list
| expr TK_EOL w expr_semicolon_list %prec EXPR_SEPARATOR
if_expr:
w "if" w expr w "then" w expr w "else" w expr w
| w "if" w expr w "then" w expr w
| w "if" w expr w "do" w expr w
// etc.
However this looks very ugly and error-prone, and I would like to avoid such solution if possible.

My question is: Is there some way to tell bison to treat a token as an optional token?
No, there isn't. (See below for a longer explanation with diagrams.)
Still, the workaround is not quite as ugly as you think, although it's not without its problems.
In order to simplify things, I'm going to assume that the lexer can be convinced to produce only a single '\n' token regardless of how many consecutive newlines appear in the program text, including the case where there are comments scattered among the blank lines. That could be achieved with a complex regular expression, but a simpler way to do it is to use a start condition to suppress \n tokens until a regular token is encountered. The lexer's initial start condition should be the one which suppresses newline tokens, so that blank lines at the beginning of the program text won't confuse anything.
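The effect of that start condition can be sketched as a filter over the token stream. The sketch below is in Python rather than flex, it assumes comments have already been discarded, and collapse_newlines is a hypothetical name. The flag plays the role of the start condition: it starts in the suppressing state and returns to it after every newline that is let through.

```python
def collapse_newlines(tokens):
    """Emit at most one newline token after each regular token.

    Leading newlines and repeated newlines are dropped.
    """
    out = []
    suppress = True  # initial state: newlines are suppressed
    for token in tokens:
        if token == "\n":
            if not suppress:
                out.append(token)
                suppress = True  # swallow further newlines until a real token
        else:
            out.append(token)
            suppress = False
    return out


print(collapse_newlines(["\n", "\n", "fn", "\n", "\n", "foo", "x", "=", "\n"]))
```

In flex itself the same idea would be expressed with two start conditions and a rule for newline in each, one that returns the token and switches state and one that discards it.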
Now, the key insight is that we don't have to insert "maybe a newline" markers all over the grammar, since every newline must appear right after some real token. And that means that we can just add one non-terminal for every terminal:
tok_id: ID | ID '\n'
tok_if: "if" | "if" '\n'
tok_then: "then" | "then" '\n'
tok_else: "else" | "else" '\n'
tok_do: "do" | "do" '\n'
tok_semi: ';' | ';' '\n'
tok_dot: '.' | '.' '\n'
tok_plus: '+' | '+' '\n'
tok_dash: '-' | '-' '\n'
tok_star: '*' | '*' '\n'
tok_slash: '/' | '/' '\n'
tok_caret: '^' | '^' '\n'
tok_open: '(' | '(' '\n'
tok_close: ')' | ')' '\n'
tok_openb: '[' | '[' '\n'
tok_closeb: ']' | ']' '\n'
/* Etc. */
Now, it's just a question of replacing the use of a terminal with the corresponding non-terminal defined above. (No w non-terminal is required.) Once we do that, bison will report a number of shift-reduce conflicts in the non-terminal definitions just added; any terminal which can appear at the end of an expression will instigate a conflict, since the newline could be absorbed either by the terminal's non-terminal wrapper or by the expr_semicolon_list production. We want the newline to be part of expr_semicolon_list, so we need to add precedence declarations starting with newline, so that it is lower precedence than any other token.
That will most likely work for your grammar, but it is not 100% certain. The problem with precedence-based solutions is that they can have the effect of hiding real shift-reduce conflict issues. So I'd recommend running bison on the grammar and verifying that all the shift-reduce conflicts appear where expected (in the wrapper productions) before adding the precedence declarations.
Why token fallback is not as simple as it looks
In theory, it would be possible to implement a feature similar to the one you suggest. [Note 1]
But it's non-trivial, because of the way the LALR parser construction algorithm combines states. The result is that the parser might not "know" that the lookahead token cannot be shifted until it has done one or more reductions. So by the time it figures out that the lookahead token is not valid, it has already performed reductions which would have to be undone in order to continue the parse without the lookahead token.
Most parser generators compound the problem by removing error actions corresponding to a lookahead token if the default action in the state for that token is a reduction. The effect is again to delay detection of the error until after one or more futile reductions, but it has the benefit of significantly reducing the size of the transition table (since default entries don't need to be stored explicitly). Since the delayed error will be detected before any more input is consumed, the delay is generally considered acceptable. (Bison has an option to prevent this optimisation, however.)
As a practical illustration, here's a very simple expression grammar with only two operators:
prog: expr '\n' | prog expr '\n'
expr: prod | expr '+' prod
prod: term | prod '*' term
term: ID | '(' expr ')'
That leads to this state diagram [Note 2]:
Let's suppose that we wanted to ignore newlines pythonically, allowing the input
(
a + b
)
That means that the parser must ignore the newline after the b, since the input might be
(
a + b
* c
)
(Which is fine in Python but not, if I understand correctly, in MAXScript.)
Of course, the newline would be recognised as a statement separator if the input were not parenthesized:
a + b
Looking at the state diagram, we can see that the parser will end up in State 15 after the b is read, whether or not the expression is parenthesized. In that state, a newline is marked as a valid lookahead for the reduction action, so the reduction action will be performed, presumably creating an AST node for the sum. Only after this reduction will the parser notice that there is no action for the newline. If it now discards the newline character, it's too late; there is now no way to reduce b * c in order to make it an operand of the sum.
Bison does allow you to request a Canonical LR parser, which does not combine states. As a result, the state machine is much, much bigger; so much so that Canonical-LR is still considered impractical for non-toy grammars. In the simple two-operator expression grammar above, asking for a Canonical LR parser only increases the state count from 16 to 26, as shown here:
In the Canonical LR parser, there are two different states for the reduction expr: expr '+' prod. State 16 applies at the top level, and thus the lookahead includes newline but not ). Inside parentheses the parser will instead reach state 26, where ) is a valid lookahead but newline is not. So, at least in some grammars, using a Canonical LR parser could make the prediction more precise. But features which are dependent on the use of a mammoth parsing automaton are not particularly practical.
One alternative would be for the parser to react to the newline by first simulating the reduction actions to see if a shift would eventually succeed. If you request Lookahead Correction (%define parse.lac full), bison will insert code to do precisely this. This code can create significant overhead, but many people request it anyway because it makes verbose error messages more accurate. So it would certainly be possible to repurpose this code to do token fallback handling, but no-one has actually done so, as far as I know.
Notes:
1. A similar question which comes up from time to time is whether you can tell bison to cause a token to be reclassified to a fallback token if there is no possibility to shift the token. (That would be useful for parsing languages like SQL which have a lot of non-reserved keywords.)
2. I generated the state graphs using Bison's -g option:
bison -o ex.tab.c --report=all -g ex.y
dot -Tpng -oex.png ex.dot
To produce the Canonical LR, I defined lr.type to be canonical-lr:
bison -o ex_canon.c --report=all -g -Dlr.type=canonical-lr ex.y
dot -Tpng -oex_canon.png ex_canon.dot

When to reduce in a shift-reduce parser?

This is the skeleton of my bottom-up parser:
while (!stack.empty())
{
if (!reduce())
{
shift();
}
}
And I have these rules:
Program -> Expr
Expr -> Expr '+' Expr
Expr -> Number
Number -> FLOAT | INTEGER // These 2 are terminal symbols
If I have the following input:
2 + 3
2 gets pushed onto the stack, then gets reduced to a Number, then an Expression and then a Program. So it doesn't have any chance to parse the whole addition. How can I force the parser to parse the rest too? Should I do something like:
Program -> Expr EOF
?
Bottom-up parsing is pretty new for me so any help is appreciated.

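You can use a look-ahead to decide whether to shift or reduce. A bottom-up parser with one symbol of look-ahead should be able to handle your example grammar, provided you also settle how chained additions associate: as written, Expr -> Expr '+' Expr is ambiguous, so the grammar is not strictly LR(1).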
In your example you have input:
2 + 3
So you build up a stack:
Program, Expr, Number
Shift INTEGER (the 2), reduce Number, reduce Expr. Now you have a choice: reduce to Program or shift '+'. So you look ahead to see whether there is a '+'. If so, you shift and follow the Expr = Expr '+' Expr rule.
You may still want to do Program = Expr EOF so your lookahead can always return EOF if there's nothing left to parse.
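A minimal sketch of that loop, in Python rather than the question's language, could look like this. It hard-codes the four rules from the question, appends an EOF token as suggested, and reduces Expr '+' Expr as soon as it is on the stack, which makes '+' left-associative (that choice is an assumption, the question does not say). The token representation as (kind, value) pairs is also made up for the example.

```python
def parse(tokens):
    """Shift-reduce parse of (kind, value) tokens; kind is INTEGER, FLOAT or '+'."""
    tokens = list(tokens) + [("EOF", None)]
    stack = []
    pos = 0
    while True:
        lookahead = tokens[pos][0]
        top = stack[-1][0] if stack else None
        kinds = [kind for kind, _ in stack[-3:]]
        if top in ("FLOAT", "INTEGER"):
            # Number -> FLOAT | INTEGER
            stack[-1] = ("Number", stack[-1][1])
        elif top == "Number":
            # Expr -> Number
            stack[-1] = ("Expr", stack[-1][1])
        elif kinds == ["Expr", "+", "Expr"]:
            # Expr -> Expr '+' Expr
            right = stack.pop()[1]
            stack.pop()
            left = stack.pop()[1]
            stack.append(("Expr", ("+", left, right)))
        elif lookahead == "EOF":
            # Program -> Expr EOF: only finish when nothing is left to read
            if [kind for kind, _ in stack] == ["Expr"]:
                return stack[0][1]
            raise SyntaxError("unexpected end of input")
        else:
            # Nothing to reduce and input remains: shift
            stack.append(tokens[pos])
            pos += 1


print(parse([("INTEGER", 2), ("+", "+"), ("INTEGER", 3)]))
```

The look-ahead is what stops the lone Expr for 2 from being turned into a Program: that final step is only taken when the next token is EOF, otherwise the parser shifts.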

Producing Expressions from This Grammar with Recursive Descent

I've got a simple grammar. Actually, the grammar I'm using is more complex, but this is the smallest subset that illustrates my question.
Expr ::= Value Suffix
| "(" Expr ")" Suffix
Suffix ::= "->" Expr
| "<-" Expr
| Expr
| epsilon
Value matches identifiers, strings, numbers, et cetera. The Suffix rule is there to eliminate left-recursion. This matches expressions such as:
a -> b (c -> (d) (e))
That is, a graph where a goes to both b and the result of (c -> (d) (e)), and c goes to d and e. I'm trying to produce an abstract syntax tree for these expressions, but I'm running into difficulty because all of the operators can accept any number of operands on each side. I'd rather keep the logic for producing the AST within the recursive descent parsing methods, since it avoids having to duplicate the logic of extracting an expression. My current strategy is as follows:
1. If a Value appears, push it to the output.
2. If a From or To appears:
2.1. Output a separator.
2.2. Get the next Expr.
2.3. Create a Link node.
2.4. Pop the first set of operands from output into the Link until a separator appears.
2.5. Erase the separator discovered.
2.6. Pop the second set of operands into the Link until a separator.
2.7. Push the Link to the output.
If I run this through without obeying steps 2.3–2.7, I get a list of values and separators. For the expression quoted above, a -> b (c -> (d) (e)), the output should be:
A sep_1 B sep_2 C sep_3 D E
Applying the To rule would then yield:
A sep_1 B sep_2 (link from C to {D, E})
And subsequently:
(link from A to {B, (link from C to {D, E})})
The important thing to note is that sep_2, crucial to delimit the left-hand operands of the second ->, does not appear, so the parser believes that the expression was actually written:
a -> (b c -> (d) (e))
In order to solve this with my current strategy, I would need a way to produce a separator between adjacent expressions, but only if the current expression is a From or To expression enclosed in parentheses. If that's possible, then I'm just not seeing it and the answer ought to be simple. If there's a better way to go about this, however, then please let me know!

I haven't tried to analyze it in detail, but: "From or To expression enclosed in parentheses" starts to sound a lot like "context dependent", which recursive descent can't handle directly. To avoid context dependence you'll probably need a separate production for a From or To in parentheses vs. a From or To without the parens.
Edit: Though it may be too late to do any good, if my understanding of what you want to match is correct, I think I'd write it more like this:
Graph :=
| List Sep Graph
;
Sep := "->"
| "<-"
;
List :=
| Value List
;
Value := Number
| Identifier
| String
| '(' Graph ')'
;
It's hard to be certain, but I think this should at least be close to matching (only) the inputs you want, and should make it reasonably easy to generate an AST that reflects the input correctly.
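To make the idea concrete, here is a sketch of a recursive descent parser for a grammar of that shape, in Python. It rests on assumptions: Graph is read as a List optionally followed by Sep and another Graph, List as one or more Values, and Value is limited to simple words and parenthesised graphs. The AST shape (tuples for links, Python lists for operand lists) is made up for the example.

```python
import re


def tokenize(text):
    # Arrows, parentheses and simple words; anything else is ignored.
    return re.findall(r"->|<-|[()]|\w+", text)


class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def graph(self):
        # Graph: List | List Sep Graph
        left = self.value_list()
        if self.peek() in ("->", "<-"):
            sep = self.tokens[self.pos]
            self.pos += 1
            return (sep, left, self.graph())
        return left

    def value_list(self):
        # List: Value | Value List
        values = [self.value()]
        while self.peek() not in (None, ")", "->", "<-"):
            values.append(self.value())
        return values

    def value(self):
        # Value: word | '(' Graph ')'
        token = self.peek()
        if token is None or token in (")", "->", "<-"):
            raise SyntaxError(f"expected a value, found {token!r}")
        self.pos += 1
        if token == "(":
            inner = self.graph()
            if self.peek() != ")":
                raise SyntaxError("expected ')'")
            self.pos += 1
            return inner
        return token


def parse(text):
    parser = Parser(tokenize(text))
    tree = parser.graph()
    if parser.peek() is not None:
        raise SyntaxError(f"unexpected {parser.peek()!r}")
    return tree


print(parse("a -> b (c -> (d) (e))"))
```

Because each parenthesised group is parsed by its own call to graph, the operands on the left of an inner arrow are collected inside that call, so no separator markers are needed to keep b apart from c.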
