Distinguish rules with common prefix - parsing

I am facing a problem with rules that require long lookahead.
Take as an example a parser that consumes integers or fractions, where a fraction has no GAPs in its numerator (though there may be GAPs between the numerator and the SLASH).
The regular expression [0-9_ ]*?([0-9]+\/[0-9_ ]+)|[0-9_ ]+ describes the valid inputs.
Here is one way to write it:
Value: Integer | Fraction;
Fraction: IntegerTokenStar DigitPlus GapStar SLASH IntegerToken
DigitPlus: DIGIT DigitPlus | DIGIT
GapStar: GAP GapStar | %empty
Integer: IntegerTokenPlus
IntegerToken: DIGIT | GAP
IntegerTokenStar: IntegerToken IntegerTokenStar | %empty
IntegerTokenPlus: IntegerToken IntegerTokenPlus | IntegerToken
But it will fail to parse even an example like 0 0/0: IntegerTokenStar consumes as much as it can, so when the parser tries to parse the numerator there is no digit available, and it cannot continue with Integer either because of the '/'.
How can I write this in a conceptually clear way that still produces a valid parser?
Examples
A few strings and the expected (i)nteger part, (n)umerator, (d)enominator.
1_1_ 1___/1_1 -> fraction {i:"1_1_ ",n:"1___", d:"1_1"}
1_1_ 1___1_1 -> integer {i:"1_1_ 1___1_1",n:"", d:""}
1_1_1___/1_1 -> fraction {i:"",n:"1_1_1___",d:"1_1"}
frac.y
%define parse.error verbose
%locations
%{
void yyerror(const char* s);
extern int yylex();
extern int yylineno;
extern int yycolumn;
#include <stdio.h>
#include <stdlib.h>
%}
%token DIGIT SLASH GAP NEWLINE
%start File
%%
File: Value | Value NEWLINE File
Value: Integer | Fraction;
Fraction: IntegerTokenStar DigitPlus SLASH IntegerToken
DigitPlus: DIGIT DigitPlus | DIGIT
Integer: IntegerTokenPlus
IntegerToken: DIGIT | GAP
IntegerTokenStar: IntegerToken IntegerTokenStar | %empty
IntegerTokenPlus: IntegerToken IntegerTokenPlus | IntegerToken
%%
int main() {
    yyparse();
    return 0;
}
void yyerror(const char* s) {
    fprintf(stderr, "Line %d: %s\n", yylineno, s);
    exit(1);
}
frac.l
%option noyywrap yylineno
%{
#include <stdio.h>
#include "frac.tab.h"
#define YY_DECL int yylex()
int yycolumn = 1;
#define YY_USER_ACTION yylloc.first_line = yylloc.last_line = yylineno; \
yylloc.first_column = yycolumn; yylloc.last_column = yycolumn + yyleng - 1; \
yycolumn = yytext[0] == '\n' ? 1: yycolumn + yyleng;
%}
%%
[\n] {return NEWLINE;}
[_ ] {return GAP;}
[0-9] {return DIGIT;}
"/" {return SLASH;}
%%
Makefile
frac: frac.yy.c frac.tab.c
	gcc frac.tab.c frac.yy.c -o frac
frac.yy.c: frac.l
	flex -o frac.yy.c frac.l
frac.tab.c frac.tab.h: frac.y
	bison -d frac.y

The basic problem is that you've split digit sequences with gaps and digit sequences without gaps into two independent rules, which means the parser has to decide up front which one it is going to match, and that decision requires (possibly unbounded) lookahead.
The solution is generally to match tokens "bottom up" -- a single rule for each thing independent of context, and build up lookahead-dependent things from that. In your case, that means building up IntegerToken from DigitStar rather than from DIGIT directly -- an input of digits will be recognized as a DigitStar, and only when you get to the end of it (and see the non-digit) do you need to decide what it is.
The problem is that the obvious fix for your grammar (changing IntegerToken: DIGIT | GAP to DigitStar | GAP) doesn't work, because it makes IntegerTokenStar (and -Plus) ambiguous: any sequence of 2 or more digits might be any number of DigitStar tokens. So you need to rewrite this to make sure you can't have two consecutive DigitStar tokens, which turns out to be quite tricky. You really need to rethink things "bottom up" -- the input is a sequence of alternating numbers (1+ digits each) and gaps (1+ spaces), with an optional single / that can appear directly between two numbers (no gaps) or between a number and a gap (no gap before the /). So you get rules that look more like:
File: Value | Value NEWLINE File
Value: OptGap Integer OptGap | Fraction ;
Fraction: OptGap Integer GapPlus DigitPlus SLASH OptGap Integer OptGap
| OptGap DigitPlus SLASH OptGap Integer OptGap
DigitPlus: DIGIT DigitPlus | DIGIT
GapPlus: GAP GapPlus | GAP
OptGap: %empty | GapPlus
Integer: Integer GapPlus DigitPlus | DigitPlus
This does the trick, but it is needlessly complex, as it recognizes 'number' and 'gap'[1] tokens in the grammar rather than in the lexer. There's also the odd corner case of disallowing a gap between the numerator and the / in a fraction -- without that we could just ignore spaces (gaps) in the lexer and make things much simpler:
File: Value | Value NEWLINE File
Value: Integer | Fraction ;
Fraction: Integer NUMBER SLASH Integer | NUMBER SLASH Integer
Integer: Integer NUMBER | NUMBER
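As a rough sketch (my own addition, not part of the original answer), the matching lexer for this simplified grammar would just skip gaps and return whole numbers, assuming a NUMBER token is declared in the Bison file alongside SLASH and NEWLINE:
%option noyywrap
%{
#include "frac.tab.h"
%}
%%
[0-9]+  { return NUMBER; /* a maximal run of digits is one token */ }
[_ ]+   { /* gaps are skipped entirely */ }
"/"     { return SLASH; }
\n      { return NEWLINE; }
%%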
[1] Side note -- your lexer seems to disagree with your examples, as it treats _ as a GAP token rather than a DIGIT token.
But then your examples don't match your regex either -- having a _ immediately before a / will not match.

Related

Parser grammar rule is being ignored

The Goal
The goal is to interpret plain text content and recognise patterns, e.g. Arithmetic, Comments, Units of Measurement.
Example Input
This would be entered by a user.
# This is an example comment
10 + 10
// Another comment
This is one line of text
Tested
Expected Parse Tree
The goal of my grammar is to generate a tree that would look like this (if anyone has a better method I'd be interested to hear).
Note: The 10 + 10 is being recognised as an arithmetic rule.
Current Parse Tree aka The Problem
Below is the current output from the lexer and parser.
Note: The 10 + 10 is being recognised as an text and not the arithmetic rule.
Grammar Definition
The logic of the grammar at a high level is as follows:
Parse line by line
Determine the line content; if nothing matches, fall back to text
grammar ContentParser;
/*
* Tokens
*/
NUMBER: '-'? [0-9]+;
LPARAN: '(';
RPARAN: ')';
POW: '^';
MUL: '*';
DIV: '/';
ADD: '+';
SUB: '-';
LINE_COMMENT: '#' TEXT | '//' TEXT;
TEXT: ~[\n\r]+;
EOL: '\r'? '\n';
/*
* Rules
*/
start: file;
file: line+ EOF;
line: content EOL;
content
: comment
| arithmetic
| text
;
// Custom Content Types
comment: LINE_COMMENT;
/// Example taken from ANTLR Docs
arithmetic:
NUMBER # Number
| LPARAN inner = arithmetic RPARAN # Parentheses
| left = arithmetic operator = POW right = arithmetic # Power
| left = arithmetic operator = (MUL | DIV) right = arithmetic # MultiplicationOrDivision
| left = arithmetic operator = (ADD | SUB) right = arithmetic # AdditionOrSubtraction;
text: TEXT;
My Understanding
The content rule should first check for a match of the comment rule, then the arithmetic rule, and finally fall back to the text rule, which matches any characters apart from newlines.
However, I believe that the lexer is being greedy with the TEXT tokens, which is causing issues, but I'm not sure.
(I'm still learning ANTLR)
When you are writing a parser, it's always a good idea to print out the tokens for the input.
In the current grammar, 10 + 10 is recognized by the lexer as TEXT, which is not what is needed. The reason it is TEXT is that it is the longest string matched by any lexer rule. It does not matter in this case that the TEXT rule occurs after the NUMBER rule in the grammar. The rule is that ANTLR lexers always match the longest string possible over the given lexer rules. But if two or more lexer rules match strings of equal length, then the first rule "wins". The lexer works pretty much independently of the parser.
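For instance, a minimal way to dump the token stream for the original grammar (a sketch; the class name ContentParserLexer follows from the grammar name ContentParser, and the harness itself is my addition):
import org.antlr.v4.runtime.*;

public class DumpTokens {
    public static void main(String[] args) {
        CharStream input = CharStreams.fromString("10 + 10\n");
        ContentParserLexer lexer = new ContentParserLexer(input);
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        tokens.fill();
        // Each token prints its type, text, channel and position,
        // which makes the "longest match wins" behaviour visible.
        for (Token t : tokens.getTokens()) {
            System.out.println(t);
        }
    }
}
With the original rules, 10 + 10 comes back as a single TEXT token followed by EOL, which makes the problem obvious.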
There is no way to reliably have spaces in a text string, and not have them in arithmetic. The fix is to push spaces and tabs into an "off-channel" stream, then reconstruct the text by looking at the start and end character indices of the first and last tokens for the text tree node. The tree is a little messier, but it does what you need.
Also, you should name the grammar "Content", not "ContentParser", because otherwise you end up with "ContentParserParser.java" and "ContentParserLexer.java" when you generate the parser--rather confusing. I also took the liberty of removing the labels and variables (I don't use them because I work with XPath expressions on the tree). And I reordered and reformatted the grammar so each rule is on a single line, sorted alphabetically, in order to find rules more quickly in a text editor rather than requiring an IDE to navigate around.
A grammar that does all this is:
grammar Content;
arithmetic: NUMBER | LPARAN arithmetic RPARAN | arithmetic POW arithmetic | arithmetic (MUL | DIV) arithmetic | arithmetic (ADD | SUB) arithmetic ;
comment: LINE_COMMENT;
content : comment | arithmetic | text ;
file: line+ EOF;
line: content? EOL;
start: file;
text: TEXT+;
ADD: '+';
DIV: '/';
LINE_COMMENT: '#' STUFF | '//' STUFF;
LPARAN: '(';
MUL: '*';
NUMBER: '-'? [0-9]+;
POW: '^';
RPARAN: ')';
SUB: '-';
fragment STUFF : ~[\n\r]* ;
EOL: '\r'? '\n';
WS : [ \t]+ -> channel(HIDDEN);
TEXT: .; // Must be last lexer rule, and only one char in length.
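To recover the original spacing of a text node, as described above, read the characters between the node's first and last tokens straight from the input stream. A sketch (the listener class and its name are my additions; ContentBaseListener and ContentParser.TextContext are what ANTLR generates for the grammar above):
import org.antlr.v4.runtime.*;
import org.antlr.v4.runtime.misc.Interval;

public class TextPrinter extends ContentBaseListener {
    @Override
    public void exitText(ContentParser.TextContext ctx) {
        Token first = ctx.getStart();
        Token last = ctx.getStop();
        // The hidden WS tokens are not in the tree, but the character
        // indices of the first and last visible tokens still bracket them.
        String original = first.getInputStream()
                .getText(Interval.of(first.getStartIndex(), last.getStopIndex()));
        System.out.println("text: " + original);
    }
}
Attach it after parsing with ParseTreeWalker.DEFAULT.walk(new TextPrinter(), tree).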

unable to parse ints with antlr

I'm trying to parse ints, but I can parse only multi-digit ints, not single-digit ints.
I narrowed it down to a very small lexer and parser which I based on sample grammars from antlr.org as follows:
# IntLexerTest.g4
lexer grammar IntLexerTest;
DIGIT
: '0' .. '9'
;
INT
: DIGIT+
;
#IntParserTest.g4
parser grammar IntParserTest;
options {
tokenVocab = IntLexerTest;
}
mything
: INT
;
And when I try to parse the digit 3 all by itself, I get "line 1:0 mismatched input '3' expecting INT". On the other hand, if I try to parse 33, it's fine. What am I doing wrong?
The lexer matches rules from top to bottom. When 2 (or more) rules match the same number of characters, the rule defined first wins. That is why a single digit is matched as a DIGIT and two or more digits as an INT.
What you should do is make DIGIT a fragment. Fragments are only used by other lexer rules and will never become a token of their own:
fragment DIGIT
: '0' .. '9'
;
INT
: DIGIT+
;
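A quick way to check the fix (a test harness of my own; the class names IntLexerTest and IntParserTest and the entry rule mything come from the grammars above):
import org.antlr.v4.runtime.*;

public class SingleDigitTest {
    public static void main(String[] args) {
        // With DIGIT as a fragment, a lone '3' now lexes as a single INT token.
        IntLexerTest lexer = new IntLexerTest(CharStreams.fromString("3"));
        IntParserTest parser = new IntParserTest(new CommonTokenStream(lexer));
        System.out.println(parser.mything().toStringTree(parser)); // prints (mything 3)
    }
}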

Grammar of calculator in a finite field

I have a working calculator apart from one thing: unary operator '-'.
It has to be evaluated and dealt with in two different cases:
When there is some expression further like so -(3+3)
When there isn't: -3
For case 1, I want to get a postfix output 3 3 + -
For case 2, I want to get just correct value of this token in this field, so for example in Z10 it's 10-3 = 7.
My current idea:
E: ...
| '-' NUM %prec NEGATIVE { $$ = correct(-yylval); appendNumber($$); }
| '-' E %prec NEGATIVE { $$ = correct(P-$2); strcat(rpn, "-"); }
| NUM { appendNumber(yylval); $$ = correct(yylval); }
Where NUM is a token. Obviously the compiler says there is a reduce/reduce conflict, as E can also be a NUM in some cases. Although it works, I want to get rid of the compiler warning, and I have run out of ideas.
It has to be evaluated and dealt with in two different cases:
No it doesn't. The cases are not distinct.
Both - E and - NUM are incorrect. The correct grammar would be something like:
primary
: NUM
| '-' primary
| '+' primary /* for completeness */
| '(' expression ')'
;
Normally, this should be implemented as two rules (pseudocode, I don't know bison syntax):
This is the likely rule for the 'terminal' element of an expression. Naturally, a parenthesized expression leads to a recursion to the top rule:
Element => Number
| '(' Expression ')'
The unary minus (and also the unary plus!) are just on one level up in the stack of productions (grammar rules):
Term => '-' Element
| '+' Element
| Element
Naturally, this can unbundle into all possible combinations such as '-' Number, '-' '(' Expression ')', likewise with '+' and without any unary operator at all.
Suppose we want addition / subtraction, and multiplication / division. Then the rest of the grammar would look like this:
Expression => Expression '+' MultiplicationExpr
| Expression '-' MultiplicationExpr
| MultiplicationExpr
MultiplicationExpr => MultiplicationExpr '*' Term
| MultiplicationExpr '/' Term
| Term
For the sake of completeness:
Terminals:
Number
Non-terminals:
Expression
Element
Term
MultiplicationExpr
Number, which is a terminal, shall match a regexp like this [0-9]+. In other words, it does not parse a minus sign — it's always a positive integer (or zero). Negative integers are calculated by matching a '-' Number sequence of tokens.
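In Bison syntax, the whole layering above looks roughly like this (a structural sketch only; the semantic actions for the finite-field correction and the postfix output from the question are left out):
expression
    : expression '+' mult_expr
    | expression '-' mult_expr
    | mult_expr
    ;
mult_expr
    : mult_expr '*' term
    | mult_expr '/' term
    | term
    ;
term
    : '-' element
    | '+' element
    | element
    ;
element
    : NUM
    | '(' expression ')'
    ;
Because the unary sign is attached at the term level and NUM itself never carries a sign, there is only one way to derive '-' NUM, so the reduce/reduce conflict disappears.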

Bison Reduce/Reduce Conflict with Casting and Expression Parentheses

I'm building a grammar in bison, and I've narrowed down my last reduce/reduce error to the following test-case:
%{
#include <stdio.h>
#include <string.h>
extern int yydebug;
void yyerror(const char *str)
{
    fprintf(stderr, "Error: %s\n", str);
}
int main()
{
    yydebug = 1;
    yyparse();
}
%}
%right '='
%precedence CAST
%left '('
%token AUTO BOOL BYTE DOUBLE FLOAT INT LONG SHORT SIGNED STRING UNSIGNED VOID
%token IDENTIFIER
%start file
%debug
%%
file
: %empty
| statement file
;
statement
: expression ';'
;
expression
: expression '=' expression
| '(' type ')' expression %prec CAST
| '(' expression ')'
| IDENTIFIER
;
type
: VOID
| AUTO
| BOOL
| BYTE
| SHORT
| INT
| LONG
| FLOAT
| DOUBLE
| SIGNED
| UNSIGNED
| STRING
| IDENTIFIER
;
Presumably the issue is that it can't tell the difference between type and expression when it sees (IDENTIFIER) in an expression.
Output:
fail.y: warning: 1 reduce/reduce conflict [-Wconflicts-rr]
fail.y:64.5-14: warning: rule useless in parser due to conflicts [-Wother]
| IDENTIFIER
^^^^^^^^^^
What can I do to fix this conflict?
If the grammar were limited to the productions shown in the OP, it would be relatively easy to remove the conflict, since the grammar is unambiguous. The only problem is that it is LR(2) and not LR(1).
The analysis in the OP is completely correct. When the parser sees, for example:
( identifier1 · )
(where · marks the current point, so the lookahead token is )), it is not possible to know whether that is a prefix of
( identifier1 · ) ;
( identifier1 · ) = ...
( identifier1 · ) identifier2
( identifier1 · ) ( ...
In the first two cases, identifier1 must be reduced to expression, so that ( expression ) can subsequently be reduced to expression, whereas in the last two cases,
identifier1 must be reduced to type so that ( type ) expression can subsequently be reduced to expression. If the parser could see one token further into the future, the decision could be made.
Since for any LR(k) grammar, there is an LR(1) grammar which recognizes the same language, there is clearly a solution; the general approach is to defer the reduction until the one-token lookahead is sufficient to distinguish. One way to do this is:
cast_or_expr : '(' IDENTIFIER ')'
;
cast : cast_or_expr
| '(' type ')'
;
expr_except_id : cast_or_expr
| cast expression %prec CAST
| '(' expr_except_id ')'
| expression '=' expression
;
expression : IDENTIFIER
| expr_except_id
;
(The rest of the grammar is the same except for the removal of IDENTIFIER from the productions for type.)
That works fine for grammars where no symbol can be both a prefix and an infix operator (like -) and where no operator can be elided (effectively, as in function calls). In particular, it won't work for C, because it will leave the ambiguities:
( t ) * a // product or cast?
( t ) ( 3 ) // function-call or cast?
Those are real ambiguities in the grammar which can only be resolved by knowing whether t is a typename or a variable/function name.
The "usual" solution for C parsers is to resolve the ambiguity by sharing the symbol table between the scanner and the parser; since typedef type alias declarations must appear before the first use of the symbol as a typename in the applicable scope, it can be known in advance of the scan of the token whether or not the token has been declared with a typedef. More accurately, if the typedef has not been seen, it can be assumed that the symbol is not a type, although it may be completely undeclared.
By using a GLR grammar and a semantic predicate, it is possible to restrict the logic to the parser. Some people find that more elegant.

Dealing with overloaded symbols in ambiguous grammars in ANTLR4

I am trying to write a parser for a dialect of Answer Set Programming (ASP) which, in terms of grammar, looks like Prolog with some extensions.
One extension, for instance, is expansion, meaning that fact(1..3). is expanded into fact(1). fact(2). fact(3).. Notice that the language understands INT and FLOAT numbers and also uses . as a terminator.
In some cases the parser fails to distinguish between integers, floats, expansions and separators because, I reckon, the language is clearly ambiguous. In those cases, I have to explicitly separate tokens with white space. Any Prolog or ASP parser, however, correctly deals with such productions. I read that ANTLR4 can disambiguate problematic productions autonomously, though it probably needs some help, and I don't know how to provide it! ;-) I have read some related answers, but apparently they did not help me.
Could somebody please tell me what to do to overcome this ambiguity?
Please notice that I cannot change the language because it is quite standard.
In order to simplify the experts' work, I created a minimal working example that follows.
grammar Test;
program:
statement* ;
statement: // DOT is the statement terminator
range DOT |
intNum DOT |
floatNum DOT ;
intNum: // not needed, but helps in TestRig
INT;
floatNum: // not needed, but helps in TestRig
FLOAT;
range: // defines an expansion
INT DOTS INT ;
DOTS: '..';
DOT: '.';
FLOAT: DIGIT+ '.' DIGIT* | '.' DIGIT+ ;
INT: DIGIT+ ;
WS: [ \t\r\n]+ -> skip ;
fragment NONZERO : [1-9] ;
fragment DIGIT : [0] | NONZERO ;
I use the following input:
1 .
1. .
1.5 .
.5 .
1 .. 5 .
1.
1..
1.5.
.5.
1..5.
And I get the following errors, on input which is instead parsed correctly by other tools:
line 8:0 extraneous input '1.' expecting '.'
line 11:2 extraneous input '.5' expecting '.'
Many thanks in advance!
Before your DOTS rule, add a distinct rule for the statement-terminating dot and disambiguate the DOTS rule (and change your other rules to use TERMINAL):
TERMINAL: DOT { isTerminal(1) }? ;
DOTS: DOT DOT { !isTerminal(2) }? ;
DOT: '.';
where the predicate method simply looks ahead in the _input character stream to see if, at the current token position, the next character is white space. Put something like this in an @members block in your grammar:
public boolean isTerminal(int la) {
    int offset = _tokenStartCharIndex + 1 + la;
    String s = _input.getText(Interval.of(offset, offset));
    if (Character.isWhitespace(s.charAt(0))) {
        return true;
    }
    return false;
}
You may have to do a bit more work if whitespace is valid between a DOTS and the trailing INT.
I recommend shifting the work to the parser.
If the lexer can't decide whether 1..2 is 1. .2 or 1 .. 2, leave it up to the parser.
Maybe there is a context in which it can be interpreted as the first alternative and another context in which it may be interpreted as the second alternative.
Btw: 1..2. could be interpreted as 1 .. 2 . (range) or as 1. . 2 . (floatNum, intNum). How do you want to deal with this?
The following grammar should parse everything. But note that . . is then treated as dots, and 1 . 23 is parsed as a floatNum! You can check for these cases while parsing or after parsing (depending on whether they should influence the parse or not).
grammar Test;
program:
statement* ;
statement: // DOT is the statement terminator
range DOT |
intNum DOT |
floatNum DOT ;
intNum: // not needed, but helps in TestRig
INT;
floatNum:
INT DOT INT? | DOT INT ;
range: // defines an expansion
INT dots INT ;
dots : DOT DOT;
DOT: '.';
INT: DIGIT+ ;
WS: [ \t\r\n]+ -> skip ;
fragment NONZERO : [1-9] ;
fragment DIGIT : [0] | NONZERO ;
Prolog does not accept 1. as a float. This feature makes your grammar significantly more ambiguous, so maybe try removing that feature.

Resources