I'm trying to parse ints, but I can parse only multi-digit ints, not single-digit ints.
I narrowed it down to a very small lexer and parser which I based on sample grammars from antlr.org as follows:
# IntLexerTest.g4
lexer grammar IntLexerTest;
DIGIT
: '0' .. '9'
;
INT
: DIGIT+
;
#IntParserTest.g4
parser grammar IntParserTest;
options {
tokenVocab = IntLexerTest;
}
mything
: INT
;
And when I try to parse the digit 3 all by itself, I get "line 1:0 mismatched input '3' expecting INT". On the other hand, if I try to parse 33, it's fine. What am I doing wrong?
The lexer matches rules from top to bottom. When 2 (or more) rules match the same amount of characters, the rule defined first will win. That is why a single digit is matched as an DIGIT and two or more digits as an INT.
What you should do is make DIGIT a fragment. Fragments are only used by other lexer rules and will never become a token of their own:
fragment DIGIT
: '0' .. '9'
;
INT
: DIGIT+
;
Related
The Goal
The goal is interpret plain text content and recognise patterns e.g. Arithmetic, Comments, Units of Measurements.
Example Input
This would be entered by a user.
# This is an example comment
10 + 10
// Another comment
This is one line of text
Tested
Expected Parse Tree
The goal of my grammar is to generate a tree that would look like this (if anyone has a better method I'd be interested to hear).
Note: The 10 + 10 is being recognised as an arithmetic rule.
Current Parse Tree aka The Problem
Below is the current output from the lexer and parser.
Note: The 10 + 10 is being recognised as an text and not the arithmetic rule.
Grammar Definition
The logic of the grammar at a high levels is as follows:
Parse line by line
Determine the line content if not fall back to text
grammar ContentParser;
/*
* Tokens
*/
NUMBER: '-'? [0-9]+;
LPARAN: '(';
RPARAN: ')';
POW: '^';
MUL: '*';
DIV: '/';
ADD: '+';
SUB: '-';
LINE_COMMENT: '#' TEXT | '//' TEXT;
TEXT: ~[\n\r]+;
EOL: '\r'? '\n';
/*
* Rules
*/
start: file;
file: line+ EOF;
line: content EOL;
content
: comment
| arithmetic
| text
;
// Custom Content Types
comment: LINE_COMMENT;
/// Example taken from ANTLR Docs
arithmetic:
NUMBER # Number
| LPARAN inner = arithmetic RPARAN # Parentheses
| left = arithmetic operator = POW right = arithmetic # Power
| left = arithmetic operator = (MUL | DIV) right = arithmetic # MultiplicationOrDivision
| left = arithmetic operator = (ADD | SUB) right = arithmetic # AdditionOrSubtraction;
text: TEXT;
My Understanding
The content rule should check for a match of the comment rule then followed by the arithmetic rule and finally falling back to the text rule which matches any character apart from newlines.
However, I believe that the lexer is being greedy on the TEXT tokens which is causing issues but I'm not sure.
(I'm still learning ANTLR)
When you are writing a parser, it's always a good idea to print out the tokens for the input.
In the current grammar, 10 + 10 is recognized by the lexer as TEXT, which is not what is needed. The reason it is text is because that is the longest string matched by a rule. It does not matter in this case that the TEXT rule occurs after the NUMBER rule in the grammar. The rule is that Antlr lexers will always match the longest string possible of the given lexer rules. But, if it can match two or more lexer rules where the strings are of equal length, then the first rule "wins". The lexer works pretty much independently of the parser.
There is no way to reliably have spaces in a text string, and not have them in arithmetic. The fix is to push spaces and tabs into an "off-channel" stream, then reconstruct the text by looking at the start and end character indices of the first and last tokens for the text tree node. The tree is a little messier, but it does what you need.
Also, you should just name the grammar as "Context" not "ContextParser" because you end up with "ContextParserParser.java" and "ContextParserLexer.java" when you generate the parser--rather confusing. I also took liberty to remove labeling an variables (I don't used them because I work with XPath expressions on the tree). And, I reordered and reformatted the grammar to be single line, sort alphabetically in order to find rules quicker in a text editor rather than require an IDE to navigate around.
A grammar that does all this is:
grammar Content;
arithmetic: NUMBER | LPARAN arithmetic RPARAN | arithmetic POW arithmetic | arithmetic (MUL | DIV) arithmetic | arithmetic (ADD | SUB) arithmetic ;
comment: LINE_COMMENT;
content : comment | arithmetic | text ;
file: line+ EOF;
line: content? EOL;
start: file;
text: TEXT+;
ADD: '+';
DIV: '/';
LINE_COMMENT: '#' STUFF | '//' STUFF;
LPARAN: '(';
MUL: '*';
NUMBER: '-'? [0-9]+;
POW: '^';
RPARAN: ')';
SUB: '-';
fragment STUFF : ~[\n\r]* ;
EOL: '\r'? '\n';
WS : [ \t]+ -> channel(HIDDEN);
TEXT: .; // Must be last lexer rule, and only one char in length.
I have the following minimal example of a grammar I'd like to use with Jison.
/* lexical grammar */
%lex
%%
\s+ /* skip whitespace */
[0-9]+("."[0-9]+)?\b return 'NUMBER'
[0-9] return 'DIGIT'
[,-] return 'SEPARATOR'
// EOF means "end of file"
<<EOF>> return 'EOF'
. return 'INVALID'
/lex
%start expressions
%% /* language grammar */
expressions
: e SEPARATOR d EOF
{return $1;}
;
d
: DIGIT
{$$ = Number(yytext);}
;
e
: NUMBER
{$$ = Number(yytext);}
;
Here I have defined both NUMBER and DIGIT in order to allow for both digits and numbers, depending on the context. What I do not know, is how I define the context. The above example always returns
Expecting 'DIGIT', got 'NUMBER'
when I try to run it in the Jison debugger. How can I define the grammar in order to always expect a digit after a separator? I tried the following which does not work either
/* lexical grammar */
%lex
%%
\s+ /* skip whitespace */
[,-] return 'SEPARATOR'
// EOF means "end of file"
<<EOF>> return 'EOF'
. return 'INVALID'
/lex
%start expressions
%% /* language grammar */
expressions
: e SEPARATOR d EOF
{return $1;}
;
d
: [0-9]
{$$ = Number(yytext);}
;
e
: [0-9]+("."[0-9]+)?\b
{$$ = Number(yytext);}
;
The classic scanner/parser model (originally from lex/yacc, and implemented by jison as well) puts the scanner before the parser. In other words, the scanner is expected to tokenize the input stream without regard to parsing context.
Most lexical scanner generators, including jison, provide a mechanism for the scanner to adapt to context (see "start conditions"), but the scanner is responsible for tracking context on its own, and that gets quite ugly.
The simplest solution in this case is to define only a NUMBER token, and have the parser check for validity in the semantic action of rules which actually require a DIGIT. That will work because the difference between DIGIT and NUMBER does not affect the parse other than to make some parses illegal. It would be different if the difference between NUMBER and DIGIT determined which production to use, but that would probably be ambiguous since all digits are actually numbers as well.
Another solution is to allow either NUMBER or DIGIT where a number is allowed. That would require changing e so that it accepted either NUMBER or DIGIT, and ensuring that DIGIT wins out in the case that both NUMBER and DIGIT are possible. That requires putting its rule earlier in the grammar file, and adding the \b at the end:
[0-9]\b return 'DIGIT'
[0-9]+("."[0-9]+)?\b return 'NUMBER'
A weird thing is going on. I defined the grammar and this is an excerpt.
name
: Letter
| Digit name
| Letter name
;
numeral
: Digit
| Digit numeral
;
fragment
Digit
: [0-9]
;
fragment
Letter
: [a-zA-Z]
;
So why does it show warnings for just two lines (Letter and Digit name) where i referenced a fragment and others below are completely fine...
Lexer rules you mark as fragments can only be used by other lexer rules, not by parser rules. Fragment rules never become a token of their own.
Be sure you understand the difference: What does "fragment" mean in ANTLR?
EDIT
Also, I now see that you're doing too much in the parser. The rules name and numeral should really be a lexer rule:
Name
: ( Digit | Letter)* Letter
;
Numeral
: Digit+
;
in which case you don't need to account for a Space rule in any of your parser rules (this is about your last question which was just removed).
Just in case you are using an older version of antlr:
[0-9]
and
[a-zA-Z]
are not valid regular expressions in old Antlr.
replace them with
'0'..'9'
and
('a'..'z' | 'A'..'Z')
and your issues should go away.
I am trying to write a parser for a dialect of Answer Set Programming (ASP) which, in terms of grammar, looks like Prolog with some extensions.
One extension, for instance is expansion, meaning that fact(1..3). for instance is expanded in fact(1). fact(2). fact(3).. Notice that the language understands INT and FLOAT numbers and uses . also as a terminator.
In some cases the parser fails to distinguish between integers, floats, extensions and separators because I reckon the language is clearly ambiguous. In that cases, I have to explicitly separate tokens with white spaces. Any Prolog or ASP parser, however, correctly deals with such productions. I read that ANTLR4 can disambiguate problematic productions autonomously, but probably it needs some help but I don't know how to do! ;-) I read something like here and here, but apparently they did not help me.
Could somebody please tell me what to do to overcome this ambiguity?
Please notice that I cannot change the language because it is quite standard.
In order to simplify the experts' work, I created a minimal working example that follows.
grammar Test;
program:
statement* ;
statement: // DOT is the statement terminator
range DOT |
intNum DOT |
floatNum DOT ;
intNum: // not needed, but helps in TestRig
INT;
floatNum: // not needed, but helps in TestRig
FLOAT;
range: // defines an expansion
INT DOTS INT ;
DOTS: '..';
DOT: '.';
FLOAT: DIGIT+ '.' DIGIT* | '.' DIGIT+ ;
INT: DIGIT+ ;
WS: [ \t\r\n]+ -> skip ;
fragment NONZERO : [1-9] ;
fragment DIGIT : [0] | NONZERO ;
I use the following input:
1 .
1. .
1.5 .
.5 .
1 .. 5 .
1.
1..
1.5.
.5.
1..5.
And I get the following errors which instead are parsed corrected by other tools:
line 8:0 extraneous input '1.' expecting '.'
line 11:2 extraneous input '.5' expecting '.'
Many thanks in advance!
Before your DOTS rule, add a unique rule for the statement terminal dot and disambiguate the DOTS rule (and change your other rules to use the TERMINAL):
TERMINAL: DOT { isTerminal(1) }? ;
DOTS: DOT DOT { !isTerminal(2) }? ;
DOT: '.';
where the predicate method simply looks ahead on the _input character stream to see if, at the current token index, the next character is white space. Put something like this in an #member block in your grammar:
public boolean isTerminal(int la) {
int offset = _tokenStartCharIndex + 1 + la;
String s = _input.getText(Interval.of(offset, offset));
if (Character.isWhitespace(s.charAt(0))) {
return true;
}
return false;
}
May have to do a bit more work if whitespace is valid between a DOTS and the trailing INT.
I recommend shifting the work to the parser.
If the lexer can't decide if 1..2 is 1. .2 or 1 .. 2 leave if up to the parser.
Maybe there is a context in which it can be interpreted as the first alternative and another context in which it may be interpreted as the second alternative.
Btw: 1..2. could be interpreted as 1 .. 2 . (range) or as 1. . 2 . (floatNum, intNum). How do you want to deal with this?
The following grammar should parse everything. But note that . . is treated as dots as well as 1 . 23 is a floatNum! You can check these tough while parsing or after parsing (depending on whether it should influence the parsing or not).
grammar Test;
program:
statement* ;
statement: // DOT is the statement terminator
range DOT |
intNum DOT |
floatNum DOT ;
intNum: // not needed, but helps in TestRig
INT;
floatNum:
INT DOT INT? | DOT INT ;
range: // defines an expansion
INT dots INT ;
dots : DOT DOT;
DOT: '.';
INT: DIGIT+ ;
WS: [ \t\r\n]+ -> skip ;
fragment NONZERO : [1-9] ;
fragment DIGIT : [0] | NONZERO ;
Prolog does not accept 1. as a float. This feature makes your grammar significantly more ambiguous, so maybe try removing that feature.
I am working on a parser for a DSL that has two currently 'conflicting' features:
Floating-point numbers like 123.4.
Ranges specified like ID[2..5] (ID is defined as 'a'..'z'+ and doesn't matter much. The part '[2..5]' matters most.
The test grammar that should parse it looks as follows:
grammar DotTest;
span returns [double value]
: ID'['e=INT'..'f=INT']' { /*some code to process the values*/ $value = (double)(Int32.Parse($e.text) + Int32.Parse($f.text)); } ;
num returns [double value]
: DOUBLE {$value = double.Parse($DOUBLE.text); } ;
INT : '0'..'9'+ ;
DOUBLE : '0'..'9'+'.''0'..'9'+ ;
ID : 'a'..'z'+ ;
WS : ( ' ' | '\t' | '\r' | '\n' ) {$channel=HIDDEN;} ;
The problem: the rule span cannot parse its input correctly, because it conflicts with DOUBLE token. The lexer tries to match 2..5 as a DOUBLE and fails. Here is how it looks in ANTLR Works:
What will be the correct way to solve this conflict and parse the two INTs in the span correctly?
P.S. I'm using ANTLR 3 and not ANTLR 4 as I'm going to generate a C# parser, which is not currently implemented in ANTLR 4.
This solution (the second grammar) works fine. After I transformed the lexer rules to the following:
NUM : (INT RNG)=> INT {$type=INT;}
| (DOUBLE)=> DOUBLE {$type=DOUBLE;}
| INT {$type=INT;};
fragment INT : '0'..'9'+ ;
fragment DOUBLE : '0'..'9'+'.''0'..'9'+ ;
RNG: '..' ;
parsing of intervals like 1..2 started working smoothly.
The DOUBLE rule you posted above does not conflict with the .. operator since the '0'..'9'+ following the '.' contains at least one digit. The following alternate definition of DOUBLE would in fact conflict:
DOUBLE : '0'..'9'+ '.' '0'..'9'*;
I suspect you are using the interpreter in ANTLRWorks, which is known to give incorrect results in many cases.