I took inspiration from the DOT.g4 grammar in the GitHub repository grammars-v4 (grammars-v4/dot/DOT.g4), because I also have a DOT file to parse.
This is a possible structure of my DOT file:
digraph G {
rankdir=LR
label="\n[Büchi]"
labelloc="t"
node [shape="circle"]
I [label="", style=invis, width=0]
I -> 34
0 [label="0", peripheries=2]
0 -> 0 [label="!v_0"]
1 [label="1", peripheries=2]
1 -> 1 [label="!v_2 & !v_5"]
2 [label="2"]
2 -> 1 [label="v_0 & v_1 > 5 & !v_2 & v_3 < 8 & !v_5"]
3 [label="3"]
3 -> 1 [label="v_0 & v_1 > 5 & !v_2 & v_3 < 8 & !v_5"]
4 [label="4"]
4 -> 1 [label="v_1 > 5 & !v_2 & v_3 < 8 & !v_5"]
5 [label="5"]
5 -> 1 [label="v_0 & v_1 > 5 & !v_2 & v_3 < 8 & !v_5"]
}
And here is my grammar.g4 file, which I modified from the grammar linked above:
parse: nba| EOF;
nba: STRICT? ( GRAPH | DIGRAPH ) ( initialId? ) '{' stmtList '}';
stmtList : ( stmt ';'? )* ;
stmt: nodeStmt| edgeStmt| attrStmt | initialId '=' initialId;
attrStmt: ( GRAPH | NODE | EDGE ) '[' a_list? ']';
a_list: ( initialId ( '=' initialId )? ','? )+;
edgeStmt: (node_id) edgeRHS label ',' a_list? ']';
label: ('[' LABEL '=' '"' (id)+ '"' );
edgeRHS: ( edgeop ( node_id ) )+;
edgeop: '->';
nodeStmt: node_id label? ',' a_list? ']';
node_id: initialId ;
id: ID | SPACE | DIGIT | LETTER | SYMBOL | STRING ;
initialId : STRING | LETTER | DIGIT;
And here are the lexer rules:
GRAPH: [Gg] [Rr] [Aa] [Pp] [Hh];
DIGRAPH: [Dd] [Ii] [Gg] [Rr] [Aa] [Pp] [Hh];
NODE: [Nn] [Oo] [Dd] [Ee];
EDGE: [Ee] [Dd] [Gg] [Ee];
LABEL: [Ll] [Aa] [Bb] [Ee] [Ll];
/** "a numeral [-]?(.[0-9]+ | [0-9]+(.[0-9]*)? )" */
NUMBER: '-'? ( '.' DIGIT+ | DIGIT+ ( '.' DIGIT* )? );
DIGIT: [0-9];
/** "any double-quoted string ("...") possibly containing escaped quotes" */
STRING: '"' ( '\\"' | . )*? '"';
/** "Any string of alphabetic ([a-zA-Z\200-\377]) characters, underscores
* ('_') or digits ([0-9]), not beginning with a digit"
*/
ID: LETTER ( LETTER | DIGIT )*;
SPACE: '" "';
LETTER: [a-zA-Z\u0080-\u00FF_];
SYMBOL: '<'| '>'| '&'| 'U'| '!';
COMMENT: '/*' .*? '*/' -> skip;
LINE_COMMENT: '//' .*? '\r'? '\n' -> skip;
/** "a '#' character is considered a line output from a C preprocessor */
PREPROC: '#' ~[\r\n]* -> skip;
/*whitespace are ignored from the constructor*/
WS: [ \t\n\r]+ -> skip;
I clicked on the ANTLR Recognizer section, which generates the Java files and the tokens needed to interpret the grammar. Now I have to write a parser in which I override some methods so that my Java code works with the Java files generated by ANTLR4. But first I want to check whether my grammar is correct for this kind of DOT file. How can I verify that?
Re: "I clicked on the ANTLR Recognizer"... it sounds like you're using some sort of IDE with a plugin, or another ANTLR tool. I use VS Code and IntelliJ with plugins, but neither has an "ANTLR Recognizer" section (that I can see), so the following assumes the command line. It's simple command-line stuff and definitely worth learning early on when using ANTLR. (Both of the plugins I use can also show the token stream and parse tree from within the plugin, though.)
If you follow the "QuickStart" at www.antlr.org, you'll have created the grun alias, which is useful for just this purpose.
(Assuming your grammar name is DOT)
To dump out your token stream (the result of all your lexer rules):
grun DOT tokens -tokens
To verify that you're parsing input correctly:
grun DOT parse -gui
or
grun DOT parse -tree
BTW, it's rather unlikely that you'll need to override the parser class. First take a look at visitors and listeners.
I am writing an ANTLR Lexer and Parser grammar that will parse text that is quite similar to a Java class. Eventually it will parse text like the following:
reference schema:"https://schema.org/";
reference dc:"https://www.dublincore.org/";
type dc:Author {
}
I am building up the Lexer and Parser slowly. I have successfully managed to parse the references but have hit a wall when parsing the type.
Before adding support for the type I was able to use string literals for space, colon, and semi-colon in the parser, but afterwards I encountered "cannot create implicit token for string literal" errors. I defined a lexer rule for each of those characters and replaced all occurrences of the literal with the rule. However, this broke the parsing of references.
I have included my lexer and parser that successfully parse references below (along with a sample input and the parsed abstract syntax tree), and the evolved version, which isn't working. I am not getting any compilation errors but plenty of token recognition errors (screenshot included below).
What is the correct way to handle the parsing?
Working
Lexer
lexer grammar WorkingLexerGrammar;
WS: ('\t' | '\n' | '\r' )+ -> skip ;
fragment Colon : ':';
fragment SemiColon: ';';
fragment Underscores: '_'+ ;
fragment Digits: [0-9]+ ;
fragment LowercaseLetters: [a-z]+ ;
fragment UppercaseLetters: [A-Z]+ ;
fragment String: '"' .*? '"' ;
fragment Prefix: (Underscores | Digits | LowercaseLetters)+ ;
REFERENCE_KEYWORD: 'reference' ;
TYPE_KEYWORD: 'type' ;
PREFIXED_REFERENCE: ' ' -> pushMode(PrefixedReferenceMode) ;
mode PrefixedReferenceMode;
REFERENCE_PREFIX: Prefix;
REFERENCE_PREFIX_SEPARATOR: ':' -> pushMode(IriMode);
END_IRI: ';' -> popMode;
mode IriMode;
IRI: String -> popMode;
Parser
parser grammar WorkingParserGrammar ;
options { tokenVocab=WorkingLexerGrammar; }
document: reference* EOF ;
prefixedReference: REFERENCE_PREFIX ':' IRI;
reference: REFERENCE_KEYWORD ' ' prefixedReference ';';
Input
reference schema:"https://schema.org/";
reference dc:"https://www.dublincore.org/";
Output
Evolved (not working)
Lexer
lexer grammar NotWorkingLexerGrammar;
WS: ('\t' | '\n' | '\r' )+ -> skip ;
fragment Colon : ':';
fragment SemiColon: ';';
fragment Underscores: '_'+ ;
fragment Digits: [0-9]+ ;
fragment LowercaseLetters: [a-z]+ ;
fragment UppercaseLetters: [A-Z]+ ;
fragment String: '"' .*? '"' ;
fragment Prefix: (Underscores | Digits | LowercaseLetters)+ ;
COLON: Colon;
SEMICOLON: SemiColon;
SPACE: ' ';
REFERENCE_KEYWORD: 'reference' ;
TYPE_KEYWORD: 'type' ;
PREFIXED_REFERENCE: SPACE -> pushMode(PrefixedReferenceMode) ;
mode PrefixedReferenceMode;
REFERENCE_PREFIX: Prefix;
REFERENCE_PREFIX_SEPARATOR: COLON -> pushMode(IriMode);
END_IRI: SEMICOLON -> popMode;
mode IriMode;
IRI: String -> popMode;
PREFIXED_NAME: SPACE -> pushMode(PrefixedNameMode) ;
mode PrefixedNameMode;
NAME_PREFIX: Prefix;
NAME_PREFIX_SEPARATOR: COLON -> pushMode(LocalNameMode);
END_NAME: SEMICOLON -> popMode;
mode LocalNameMode;
LOCAL_NAME: (Underscores | Digits | LowercaseLetters | UppercaseLetters)+ -> popMode;
Parser
parser grammar NotWorkingParserGrammar ;
options { tokenVocab=NotWorkingLexerGrammar; }
document: reference* type* EOF ;
prefixedReference: REFERENCE_PREFIX COLON IRI;
reference: REFERENCE_KEYWORD SPACE prefixedReference SEMICOLON;
prefixedName: NAME_PREFIX SPACE LOCAL_NAME;
type: TYPE_KEYWORD SPACE prefixedName;
Output
Following Bart Kiers' help I have made two updates to the lexer and parser grammars with varying success.
First update
This change parses the type definition correctly, but only if I remove the lexer rules for reference. I think the reason is that the two rules are the same (i.e. PREFIXED_REFERENCE: SPACE -> pushMode(PrefixedReferenceMode) ; for reference and PREFIXED_NAME: SPACE -> pushMode(PrefixedNameMode) ; for type): they both match on a space. My second update attempts to fix this; the full lexer and parser grammars for the first update are below.
Lexer
lexer grammar NotWorkingLexerGrammar;
WS: ('\t' | '\n' | '\r' )+ -> skip ;
fragment Underscores: '_'+ ;
fragment Digits: [0-9]+ ;
fragment LowercaseLetters: [a-z]+ ;
fragment UppercaseLetters: [A-Z]+ ;
fragment String: '"' .*? '"' ;
fragment Prefix: (Underscores | Digits | LowercaseLetters)+ ;
fragment COLON: ':';
fragment SEMICOLON: ';';
fragment SPACE: ' ';
fragment REFERENCE_KEYWORD: 'reference' ;
fragment TYPE_KEYWORD: 'type' ;
PREFIXED_REFERENCE: SPACE -> pushMode(PrefixedReferenceMode) ;
mode PrefixedReferenceMode;
REFERENCE_PREFIX: Prefix;
REFERENCE_PREFIX_SEPARATOR: COLON -> pushMode(IriMode);
END_IRI: SEMICOLON -> popMode;
mode IriMode;
IRI: String -> popMode;
PREFIXED_NAME: SPACE -> pushMode(PrefixedNameMode) ;
mode PrefixedNameMode;
NAME_PREFIX: Prefix;
NAME_PREFIX_SEPARATOR: COLON -> pushMode(LocalNameMode);
END_NAME: SEMICOLON -> popMode;
mode LocalNameMode;
LOCAL_NAME: (Underscores | Digits | LowercaseLetters | UppercaseLetters)+ -> popMode;
Parser
parser grammar NotWorkingParserGrammar ;
options { tokenVocab=NotWorkingLexerGrammar; }
document: reference* type* EOF ;
prefixedReference: REFERENCE_PREFIX REFERENCE_PREFIX_SEPARATOR IRI;
reference: REFERENCE_KEYWORD PREFIXED_REFERENCE prefixedReference END_IRI;
prefixedName: NAME_PREFIX NAME_PREFIX_SEPARATOR LOCAL_NAME;
type: TYPE_KEYWORD PREFIXED_NAME prefixedName END_NAME;
Second update
In an attempt to fix this I moved the reference and type keywords to the Lexer rules for the corresponding parts but this only parses the type if I remove all of the Lexer rules for reference. However references are parsed correctly.
Lexer
lexer grammar NotWorkingLexerGrammar;
WS: ('\t' | '\n' | '\r' )+ -> skip ;
fragment Underscores: '_'+ ;
fragment Digits: [0-9]+ ;
fragment LowercaseLetters: [a-z]+ ;
fragment UppercaseLetters: [A-Z]+ ;
fragment String: '"' .*? '"' ;
fragment Prefix: (Underscores | Digits | LowercaseLetters)+ ;
fragment COLON: ':';
fragment SEMICOLON: ';';
fragment SPACE: ' ';
fragment REFERENCE_KEYWORD: 'reference' ;
fragment TYPE_KEYWORD: 'type' ;
PREFIXED_REFERENCE: REFERENCE_KEYWORD SPACE -> pushMode(PrefixedReferenceMode) ;
mode PrefixedReferenceMode;
REFERENCE_PREFIX: Prefix;
REFERENCE_PREFIX_SEPARATOR: COLON -> pushMode(IriMode);
END_IRI: SEMICOLON -> popMode;
mode IriMode;
IRI: String -> popMode;
TYPE_DEFINITION: TYPE_KEYWORD SPACE -> pushMode(PrefixedNameMode) ;
mode PrefixedNameMode;
NAME_PREFIX: Prefix;
NAME_PREFIX_SEPARATOR: COLON -> pushMode(LocalNameMode);
END_NAME: SEMICOLON -> popMode;
mode LocalNameMode;
LOCAL_NAME: (Underscores | Digits | LowercaseLetters | UppercaseLetters)+ -> popMode;
Parser
parser grammar NotWorkingParserGrammar ;
options { tokenVocab=NotWorkingLexerGrammar; }
document: reference* type* EOF ;
prefixedReference: REFERENCE_PREFIX REFERENCE_PREFIX_SEPARATOR IRI;
reference: PREFIXED_REFERENCE prefixedReference END_IRI;
prefixedName: NAME_PREFIX NAME_PREFIX_SEPARATOR LOCAL_NAME;
type: TYPE_DEFINITION prefixedName END_NAME;
Output
For the following input:
reference schema:"https://schema.org/";
reference dc:"https://www.dublincore.org/";
type dc:Author;
This is the output:
line 4:0 token recognition error at: 't'
line 4:1 token recognition error at: 'y'
line 4:2 token recognition error at: 'p'
line 4:3 token recognition error at: 'e'
line 4:4 token recognition error at: ' '
line 4:5 token recognition error at: 'd'
line 4:6 token recognition error at: 'c'
line 4:7 token recognition error at: ':'
line 4:8 token recognition error at: 'A'
line 4:9 token recognition error at: 'u'
line 4:10 token recognition error at: 't'
line 4:11 token recognition error at: 'h'
line 4:12 token recognition error at: 'o'
line 4:13 token recognition error at: 'r;'
My reasoning for using modes is to limit the scope of rules. This is a language I control, but I would prefer not to change it dramatically. There is much more to the language than I've shown here, and we already have a grammar (currently a combined grammar), but it is quite brittle. I tried to make a change to prevent uppercase characters in prefixes but permit them in the local name, but this snowballed and other rules started applying. Research suggested that modes were an approach to handle this situation, but I'm not very familiar with ANTLR, so I've possibly misunderstood it.
When encountering errors/warnings like these:
line 4:0 token recognition error at: 't'
line 4:1 token recognition error at: 'y'
line 4:2 token recognition error at: 'p'
line 4:3 token recognition error at: 'e'
...
it means that the lexer cannot construct a token for the input (type ... in this case). In your case, it means the lexer cannot create a token from the input in the mode it is in at that moment.
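To make "the mode it is in at that moment" concrete, here is a small hand-written simulation of a mode-stack lexer in plain Java (no ANTLR runtime involved). It is a sketch built on one reading of the "second update" lexer: in ANTLR, every rule that follows a mode declaration belongs to that mode until the next mode declaration, so TYPE_DEFINITION, which follows mode IriMode;, would only be active inside IriMode. The class name ModeLexerDemo, the Rule helper, the regular expressions standing in for the fragments, and the typeRuleInDefaultMode switch are all invented for this illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Toy mode-stack lexer; rule placement mirrors the "second update" lexer grammar. */
final class ModeLexerDemo {
    /** One lexer rule (hypothetical helper type). */
    static final class Rule {
        final String name;
        final Pattern pattern;
        final String pushMode;  // mode to enter after a match, or null
        final boolean popMode;  // true to return to the previous mode

        Rule(String name, String regex, String pushMode, boolean popMode) {
            this.name = name;
            this.pattern = Pattern.compile(regex);
            this.pushMode = pushMode;
            this.popMode = popMode;
        }
    }

    /** Builds the mode table; the flag moves TYPE_DEFINITION into the default mode. */
    static Map<String, List<Rule>> modes(boolean typeRuleInDefaultMode) {
        List<Rule> defaultMode = new ArrayList<>(List.of(
                new Rule("WS", "[\\t\\r\\n]+", null, false),
                new Rule("PREFIXED_REFERENCE", "reference ", "PrefixedReferenceMode", false)));
        List<Rule> iriMode = new ArrayList<>(List.of(
                new Rule("IRI", "\"[^\"]*\"", null, true)));
        // As posted, TYPE_DEFINITION follows "mode IriMode;", so it belongs to IriMode.
        Rule typeDefinition = new Rule("TYPE_DEFINITION", "type ", "PrefixedNameMode", false);
        (typeRuleInDefaultMode ? defaultMode : iriMode).add(typeDefinition);
        return Map.of(
                "DEFAULT", defaultMode,
                "PrefixedReferenceMode", List.of(
                        new Rule("REFERENCE_PREFIX", "[_0-9a-z]+", null, false),
                        new Rule("REFERENCE_PREFIX_SEPARATOR", ":", "IriMode", false),
                        new Rule("END_IRI", ";", null, true)),
                "IriMode", iriMode,
                "PrefixedNameMode", List.of(
                        new Rule("NAME_PREFIX", "[_0-9a-z]+", null, false),
                        new Rule("NAME_PREFIX_SEPARATOR", ":", "LocalNameMode", false),
                        new Rule("END_NAME", ";", null, true)),
                "LocalNameMode", List.of(
                        new Rule("LOCAL_NAME", "[_0-9a-zA-Z]+", null, true)));
    }

    /** Returns token names, plus one error entry per unrecognised character. */
    static List<String> tokenize(Map<String, List<Rule>> modes, String input) {
        List<String> out = new ArrayList<>();
        Deque<String> stack = new ArrayDeque<>();
        String mode = "DEFAULT";
        int pos = 0;
        while (pos < input.length()) {
            Rule best = null;
            int bestLen = 0;
            // Only the rules of the current mode are candidates.
            for (Rule rule : modes.get(mode)) {
                Matcher m = rule.pattern.matcher(input).region(pos, input.length());
                if (m.lookingAt() && m.end() - pos > bestLen) {
                    best = rule;
                    bestLen = m.end() - pos;
                }
            }
            if (best == null) {
                out.add("error at '" + input.charAt(pos) + "' in " + mode);
                pos++;
                continue;
            }
            if (!best.name.equals("WS")) {
                out.add(best.name);
            }
            pos += bestLen;
            if (best.pushMode != null) {
                stack.push(mode);
                mode = best.pushMode;
            } else if (best.popMode) {
                mode = stack.pop();
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String input = "reference dc:\"https://www.dublincore.org/\";\ntype dc:Author;";
        System.out.println(tokenize(modes(false), input));
        System.out.println(tokenize(modes(true), input));
    }
}
```

With the rule placement as posted, the simulation reports one error per character of the type line, all in the default mode. With TYPE_DEFINITION moved into the default mode, the same input tokenizes cleanly. Whether the real grammar behaves identically is worth confirming with grun and the -tokens option.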
I tried to make a change to prevent uppercase characters in prefixes but permit them in the local name but this snowballed and other rules started applying
There are two options to resolve such things:
just parse prefixes like any ordinary identifier (upper or lower cased) and after parsing, walk the generated parse tree and validate that the prefix-identifiers are really lower cased using an ANTLR visitor or listener (see: https://github.com/antlr/antlr4/blob/master/doc/listeners.md)
make a distinction in your lexer between lower- and upper cased identifiers and use them accordingly in your parser rules, something like this could work:
document
: reference* type* EOF
;
reference
: K_REFERENCE LOWER_ID COL STRING SCOL
;
type
: K_TYPE LOWER_ID COL id OPAR CPAR
;
id
: LOWER_ID
| ID
;
K_REFERENCE : 'reference';
K_TYPE : 'type';
LOWER_ID : [a-z_] [a-z_0-9]*;
ID : [a-zA-Z_] [a-zA-Z_0-9]*;
STRING : '"' ~["]* '"';
SCOL : ';';
COL : ':';
OPAR : '{';
CPAR : '}';
SPACES : [ \t\r\n] -> skip;
Modes are meant to be used for input that really is two (or more) languages embedded in each other. For example, when parsing HTML files there is content (text) and there are tags with attributes. From what I see, you're not using modes as they are meant to be used, IMO.
I am trying to write a parser for a relatively simple but idiosyncratic language.
Simply put, one of the rules is that comment lines are denoted by an asterisk only if that asterisk is the first character of the line. How might I go about formalising such a rule in ANTLR4? I thought about using:
START_LINE_COMMENT: '\n*' .*? '\n' -> skip;
But I am certain this won't work with more than one comment line in a row, as the newline at the end will be consumed as part of the START_LINE_COMMENT token, meaning any subsequent comment lines will be missing the required initial newline character. Is there a way I can perhaps check if the line starts with a '*' without needing to consume the prior '\n'?
Matching a comment line is not easy. As I write one grammar per year, I had to reach for The Definitive ANTLR Reference to refresh my memory. Try this :
grammar Question;
/* Comment line having an * in column 1. */
question
: line+
;
line
// : ( ID | INT )+
: ( ID | INT | MULT )+
;
LINE_COMMENT
: '*' {getCharPositionInLine() == 1}? ~[\r\n]* -> channel(HIDDEN) ;
ID : [a-zA-Z]+ ;
INT : [0-9]+ ;
//WS : [ \t\r\n]+ -> channel(HIDDEN) ;
WS : [ \t\r\n]+ -> skip ;
MULT : '*' ;
Compile and execute :
$ echo $CLASSPATH
.:/usr/local/lib/antlr-4.6-complete.jar:
$ alias
alias a4='java -jar /usr/local/lib/antlr-4.6-complete.jar'
alias grun='java org.antlr.v4.gui.TestRig'
$ a4 Question.g4
$ javac Q*.java
$ grun Question question -tokens data.txt
[#0,0:3='line',<ID>,1:0]
[#1,5:5='1',<INT>,1:5]
[#2,9:12='line',<ID>,2:2]
[#3,14:14='2',<INT>,2:7]
[#4,16:26='* comment 1',<LINE_COMMENT>,channel=1,3:0]
[#5,32:35='line',<ID>,4:4]
[#6,37:37='4',<INT>,4:9]
[#7,39:48='*comment 2',<LINE_COMMENT>,channel=1,5:0]
[#8,51:78='* comment 3 after empty line',<LINE_COMMENT>,channel=1,7:0]
[#9,81:81='*',<'*'>,8:1]
[#10,83:85='not',<ID>,8:3]
[#11,87:87='a',<ID>,8:7]
[#12,89:95='comment',<ID>,8:9]
[#13,97:100='line',<ID>,9:0]
[#14,102:102='9',<INT>,9:5]
[#15,107:107='*',<'*'>,9:10]
[#16,109:110='no',<ID>,9:12]
[#17,112:118='comment',<ID>,9:15]
[#18,120:119='<EOF>',<EOF>,10:0]
with the following data.txt file :
line 1
  line 2
* comment 1
    line 4
*comment 2

* comment 3 after empty line
 * not a comment
line 9    * no comment
Note that without the MULT token or '*' somewhere in a parser rule, the asterisk is not listed in the tokens, and the lexer complains :
line 8:1 token recognition error at: '*'
If you display the parse tree
$ grun Question question -gui data.txt
you'll see that the whole file is absorbed by one line rule. If you need to recognize lines, change the line and white space rules like so :
line
: ( ID | INT | MULT )+ NL
| NL
;
//WS : [ \t\r\n]+ -> skip ;
NL : [\r\n] ;
WS : [ \t]+ -> skip ;
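The column test that the semantic predicate in LINE_COMMENT performs can also be illustrated outside ANTLR. The following is a hand-written sketch in plain Java, not generated code; the class name ColumnZeroComments and the output labels are made up for the example. It tracks the column while scanning and treats '*' as the start of a comment only in column 0. The grammar's predicate compares against 1 because it is evaluated after the '*' has been consumed, whereas this sketch checks the column before consuming the character.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy scanner: '*' opens a comment only when it sits in column 0. */
final class ColumnZeroComments {
    static List<String> scan(String text) {
        List<String> out = new ArrayList<>();
        int column = 0;
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            if (c == '\n') {
                // A newline resets the column; nothing needs to be "un-consumed".
                column = 0;
                i++;
            } else if (c == '*' && column == 0) {
                // Comment runs up to, but not including, the line terminator.
                int end = i;
                while (end < text.length() && text.charAt(end) != '\r' && text.charAt(end) != '\n') {
                    end++;
                }
                out.add("LINE_COMMENT:" + text.substring(i, end));
                column += end - i;
                i = end;
            } else {
                if (c == '*') {
                    out.add("MULT@" + column);  // an ordinary asterisk
                }
                column++;
                i++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "line 1\n* comment 1\n * not a comment\nline 9 * no comment\n*comment 2";
        System.out.println(scan(text));
    }
}
```

Because the newline is handled on its own and is never part of the comment, consecutive comment lines are recognised without any special casing.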
I have checked similar questions surrounding this issue but none seems to provide a solution to my version of the problem.
I just started Antlr4 recently and all has been going nicely until I hit this particular roadblock.
My grammar is a basic math expression grammar, but for some reason the generated parser(?) is unable to walk from parser rule "equal" to parser rule "expr" in order to reach lexer rule "NAME".
grammar MathCraze;
NUM : [0-9]+ ('.' [0-9]+)?;
WS : [ \t]+ -> skip;
NL : '\r'? '\n' -> skip;
NAME: [a-zA-Z_][a-zA-Z_0-9]*;
ADD: '+';
SUB : '-';
MUL : '*';
DIV : '/';
POW : '^';
equal
: add # add1
| NAME '=' equal # assign
;
add
: mul # mul1
| add op=('+'|'-') mul # addSub
;
mul
: exponent # power1
| mul op=('*'|'/') exponent # mulDiv
;
exponent
: expr # expr1
| expr '^' exponent # power
;
expr
: NUM # num
| NAME # name
| '(' add ')' # parens
;
If I pass a word as input, something like "variable", the parser throws the error above, but if I pass a number as input (say "78"), the parser walks the tree successfully (i.e., from rule "equal" to "expr").
equal equal
| |
add add
| |
mul mul
| |
exponent exponent
| |
expr expr
| |
NUM NAME
| |
"78" # No Error "variable" # Error! Tree walk doesn't reach here.
I've checked for every type of ambiguity I know of, so I'm probably missing something here.
I'm using Antlr5.6 by the way and I will appreciate if this problem gets solved. Thanks in advance.
Your style of expression hierarchy is the one we use in parsers written by hand or in ANTLR v3, from low to high precedence.
As Raven said, ANTLR 4 is much more powerful. Note the <assoc = right> specification in the power rule, which is usually right-associative.
grammar Question;
question
: line+ EOF
;
line
: expr NL
| assign NL
;
assign
: NAME '=' expr # assignSingle
| NAME '=' assign # assignMulti
;
expr // from high to low precedence
: <assoc = right> expr '^' expr # power
| expr op=( '*' | '/' ) expr # mulDiv
| expr op=( '+' | '-' ) expr # addSub
| '(' expr ')' # parens
| atom_r # atom
;
atom_r
: NUM
| NAME
;
NAME: [a-zA-Z_][a-zA-Z_0-9]*;
NUM : [0-9]+ ('.' [0-9]+)?;
WS : [ \t]+ -> skip;
NL : [\r\n]+ ;
Run with the -gui option to see the parse tree :
$ echo $CLASSPATH
.:/usr/local/lib/antlr-4.6-complete.jar
$ alias grun
alias grun='java org.antlr.v4.gui.TestRig'
$ grun Question question -gui data.txt
and this data.txt file :
variable
78
a + b * c
a * b + c
a = 8 + (6 * 9)
a ^ b
a ^ b ^ c
7 * 2 ^ 5
a = b = c = 88
.
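To see what the precedence ordering and the <assoc = right> annotation mean for the resulting values, here is a small hand-written precedence-climbing evaluator in plain Java. It is only a sketch of the same idea, not what ANTLR generates: the class name PrecedenceDemo is invented, it handles numbers only (no NAME tokens or assignments), and it evaluates directly instead of building a parse tree.

```java
/** Hand-written precedence-climbing evaluator; a sketch, not ANTLR output. */
final class PrecedenceDemo {
    private final String src;
    private int pos;

    private PrecedenceDemo(String text) {
        this.src = text.replace(" ", "");
    }

    static double eval(String text) {
        PrecedenceDemo parser = new PrecedenceDemo(text);
        double value = parser.expr(1);
        if (parser.pos != parser.src.length()) {
            throw new IllegalArgumentException("unexpected input at offset " + parser.pos);
        }
        return value;
    }

    /** Higher number binds tighter; -1 means "not a binary operator". */
    private static int precedence(char op) {
        switch (op) {
            case '+': case '-': return 1;
            case '*': case '/': return 2;
            case '^': return 3;
            default: return -1;
        }
    }

    private double expr(int minPrec) {
        double left = atom();
        while (pos < src.length() && precedence(src.charAt(pos)) >= minPrec) {
            char op = src.charAt(pos++);
            int prec = precedence(op);
            // Right-associative '^' recurses at the same level, the others one level up.
            double right = expr(op == '^' ? prec : prec + 1);
            switch (op) {
                case '+': left += right; break;
                case '-': left -= right; break;
                case '*': left *= right; break;
                case '/': left /= right; break;
                default:  left = Math.pow(left, right);
            }
        }
        return left;
    }

    private double atom() {
        if (pos < src.length() && src.charAt(pos) == '(') {
            pos++;
            double value = expr(1);
            if (pos >= src.length() || src.charAt(pos) != ')') {
                throw new IllegalArgumentException("')' expected at offset " + pos);
            }
            pos++;
            return value;
        }
        int start = pos;
        while (pos < src.length() && (Character.isDigit(src.charAt(pos)) || src.charAt(pos) == '.')) {
            pos++;
        }
        if (start == pos) {
            throw new IllegalArgumentException("number expected at offset " + pos);
        }
        return Double.parseDouble(src.substring(start, pos));
    }

    public static void main(String[] args) {
        System.out.println(eval("2 ^ 3 ^ 2"));   // grouped as 2 ^ (3 ^ 2)
        System.out.println(eval("7 * 2 ^ 5"));   // '^' binds tighter than '*'
        System.out.println(eval("8 + (6 * 9)"));
    }
}
```

If '^' were treated as left-associative, 2 ^ 3 ^ 2 would be grouped as (2 ^ 3) ^ 2 and give 64 instead of 512, which is the difference the annotation in the power alternative controls.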
Added
Using your original grammar and starting with the equal rule, I have the following error :
$ grun Q2 equal -tokens data.txt
[#0,0:7='variable',<NAME>,1:0]
[#1,9:10='78',<NUM>,2:0]
...
[#41,89:88='<EOF>',<EOF>,10:0]
line 2:0 no viable alternative at input 'variable78'
If I start with rule expr, there is no error :
$ grun Q2 expr -tokens data.txt
[#0,0:7='variable',<NAME>,1:0]
...
[#41,89:88='<EOF>',<EOF>,10:0]
$
Run grun with the -gui option and you'll see the difference :
running with expr, the input token variable is caught by NAME, rule expr is satisfied and terminates;
running with equal, it all ends in an error. The parser tries the first alternative equal -> add -> mul -> exponent -> expr -> NAME => OK. It consumes the token variable and tries to do something with the next token, 78. It backs out through each rule to see whether another alternative of that rule applies, but each of those requires an operator. Thus it arrives back in equal and starts again with the token variable, this time using the alternative | NAME '='. NAME consumes the token, then the rule requires '=', but the next token is 78, which does not satisfy it. As there is no other choice, it reports that there is no viable alternative.
$ grun Q2 equal -tokens data.txt
[#0,0:7='variable',<NAME>,1:0]
[#1,8:7='<EOF>',<EOF>,1:8]
line 1:8 no viable alternative at input 'variable'
If variable is the only token, same reasoning : first alternative equal -> add -> mul -> exponent -> expr -> NAME => OK, consumes variable, back to equal, tries the alt which requires '=', but the input is at EOF. That's why it says there is no viable alternative.
$ grun Q2 equal -tokens data.txt
[#0,0:1='78',<NUM>,1:0]
[#1,2:1='<EOF>',<EOF>,1:2]
If 78 is the only token, do the same reasoning : first alternative equal -> add -> mul -> exponent -> expr -> NUM => OK, consumes 78, back to equal. The alternative is not an option. Satisfied ? oops, what about EOF.
Now let's add a NUM alt to equal :
equal
: add # add1
| NAME '=' equal # assign
| NUM '=' equal # assignNum
;
$ grun Q2 equal -tokens data.txt
[#0,0:1='78',<NUM>,1:0]
[#1,2:1='<EOF>',<EOF>,1:2]
line 1:2 no viable alternative at input '78'
First alternative equal -> add -> mul -> exponent -> expr -> NUM => OK, consumes 78, back to equal. Now there is also an alt for NUM, starts again, this time using the alt | NUM '='. NUM consumes the token 78,
then the parser requires '=', but the input is at EOF, hence the message.
Now let's add a new rule with EOF and let's run the grammar from all :
all : equal EOF ;
$ grun Q2 all -tokens data.txt
[#0,0:1='78',<NUM>,1:0]
[#1,2:1='<EOF>',<EOF>,1:2]
$ grun Q2 all -tokens data.txt
[#0,0:7='variable',<NAME>,1:0]
[#1,8:7='<EOF>',<EOF>,1:8]
The input corresponds to the grammar, and there is no more message.
Although I can't answer your question about why the parser can't reach NAME in expr, I'd like to point out that with ANTLR4 you can use direct left recursion in your rule specification, which makes your grammar more compact and improves readability.
With that in mind your grammar could be rewritten as
math:
assignment
| expression
;
assignment:
NAME '=' (assignment | expression)
;
expression:
expression '^' expression
| expression ('*' | '/') expression
| expression ('+' | '-') expression
| NAME
| NUM
;
That grammar happily takes a NAME as part of an expression, so I guess it would solve your problem.
If you're really interested in why it didn't work with your grammar, I'd first check whether the lexer has turned the input into the expected tokens. Afterwards I would look at the parse tree to see what the parser makes of the given token sequence, and then try to do the parsing manually according to your grammar. While doing that you should be able to find the point at which the parser does something different from what you'd expect.
I'm following the example given here-
https://datapsyche.wordpress.com/2014/10/23/back-to-learning-grammar-with-antlr/
which basically has following grammar-
grammar Simpleql;
statement : expr command* ;
expr : expr ('AND' | 'OR' | 'NOT') expr # expopexp
| expr expr # expexp
| predicate # predicexpr
| text # textexpr
| '(' expr ')' # exprgroup
;
predicate : text ('=' | '!=' | '>=' | '<=' | '>' | '<') text ;
command : '| show' text* # showcmd
| '| show' text (',' text)* # showcsv
;
text : NUMBER # numbertxt
| QTEXT # quotedtxt
| UQTEXT # unquotedtxt
;
AND : 'AND' ;
OR : 'OR' ;
NOT : 'NOT' ;
EQUALS : '=' ;
NOTEQUALS : '!=' ;
GREQUALS : '>=' ;
LSEQUALS : '<=' ;
GREATERTHAN : '>' ;
LESSTHAN : '<' ;
NUMBER : DIGIT+
| DIGIT+ '.' DIGIT+
| '.' DIGIT+
;
QTEXT : '"' (ESC|.)*? '"' ;
UQTEXT : ~[ ()=,<>!\r\n]+ ;
fragment
DIGIT : [0-9] ;
fragment
ESC : '\\"' | '\\\\' ;
WS : [ \t\r\n]+ -> skip ;
When I pass input like this-
Abishek AND (country=India OR city=NY) LOGIN 404 | show name city
I get error- line 1:65 no viable alternative at input '<EOF>'
I went through a couple of SO posts related to the error but can't seem to be able to figure out what is wrong with the grammar.
I tried running your example and got a number of errors in ANTLRWorks 2. However, I was able to run it without any errors in the test rig, getting the following output:
(statement (expr (expr (expr (text Abishek)) AND (expr ( (expr (expr (predicate (text country) = (text India))) OR (expr (predicate (text city) = (text NY)))) ))) (expr (expr (text LOGIN)) (expr (text 404)))) (command | show (text name) (text city)))
The tree display also matched the one shown on the website.
My guess is that the problem is your actual input. I've had problems in the past with ANTLR reading text from a file that was not encoded as ASCII/ANSI/UTF-8 or whatever works for the OS you are using. I encountered this when I saved a file on Linux from a Linux text editor and tried to run it on Windows with the same generated parser. So my recommendation is to try re-saving your text input, 'Abishek AND (country=India OR city=NY) LOGIN 404 | show name city', using a different encoding each time, in case this is the cause.
Note that you can also specify the encoding explicitly, like this or in a similar way :
CharStream charStream = new ANTLRInputStream(inputStream, "UTF-8");
This matters because with an encoding mismatch the lexer will still try to tokenize the wrongly decoded characters, which can result in no matches being found.
Let me know if it works after saving the file in a few different encodings and I'll try to help further. Hope this helps.
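As an illustration of how a mismatch changes what the lexer receives, the following plain-Java sketch decodes the same bytes with two charsets. The scenario is hypothetical (an editor that saved the query as UTF-16LE), and the class name EncodingDemo is made up. It uses only the JDK, so it shows the decoding step rather than anything specific to ANTLR. In ANTLR 4.7 and later, the CharStreams factory methods that accept a Charset, for example CharStreams.fromStream(inputStream, StandardCharsets.UTF_8), should be the usual place to pass the encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Shows how the same bytes become different characters under different charsets. */
final class EncodingDemo {
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        String query = "| show name city";
        // Hypothetical scenario: the input file was saved as UTF-16LE.
        byte[] savedAsUtf16 = query.getBytes(StandardCharsets.UTF_16LE);

        String right = decode(savedAsUtf16, StandardCharsets.UTF_16LE);
        String wrong = decode(savedAsUtf16, StandardCharsets.UTF_8);

        System.out.println("decoded with matching charset equals original: " + right.equals(query));
        // Every second character is NUL, which no rule in the grammar matches.
        System.out.println("decoded as UTF-8: " + wrong.length() + " chars, NUL at index 1: "
                + (wrong.charAt(1) == '\0'));
    }
}
```

Decoded with the wrong charset, the text is twice as long and interleaved with NUL characters, so literals such as '| show' would no longer match contiguous input.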
I've defined multiple lexer rules that potentially matches the same character sequence. For example:
LBRACE: '{' ;
RBRACE: '}' ;
LPARENT: '(' ;
RPARENT: ')' ;
LBRACKET: '[' ;
RBRACKET: ']' ;
SEMICOLON: ';' ;
ASTERISK: '*' ;
AMPERSAND: '&' ;
IGNORED_SYMBOLS: ('!' | '#' | '%' | '^' | '-' | '+' | '=' |
'\\'| '|' | ':' | '"' | '\''| '<' | '>' | ',' | '.' |'?' | '/' ) ;
// WS comments*****************************
WS: (' '|'\n'| '\r'|'\t'|'\f' )+ {$channel=HIDDEN;};
ML_COMMENT: '/*' .* '*/' {$channel=HIDDEN;};
SL_COMMENT: '//' .* '\r'? '\n' {$channel=HIDDEN;};
STRING_LITERAL: '"' (STR_ESC | ~( '"' ))* '"';
fragment STR_ESC: '\\' '"' ;
CHAR_LITERAL : '\'' (CH_ESC | ~( '\'' )) '\'' ;
fragment CH_ESC : '\\' '\'';
My IGNORED_SYMBOLS and ASTERISK match /, " and * respectively. Since they're placed (unintentionally) before my comment and string literal rules, which also match /* and ", I expected the comment and string literal rules to be disabled (unintentionally). But surprisingly, the ML_COMMENT, SL_COMMENT and STRING_LITERAL rules still work correctly.
This is somewhat confusing. Shouldn't a /, whether it is part of /* or just a standalone /, always be matched and consumed by IGNORED_SYMBOLS first, before ML_COMMENT has any chance to match it?
What is the way the lexer decides which rules to apply if the characters match more than one rule?
What is the way the lexer decides which rules to apply if the characters match more than one rule?
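Lexer rules are matched from top to bottom, but the lexer first tries to match as many characters as possible. In case two (or more) rules match the same number of characters, the one that is defined first has precedence over the one(s) defined later in the grammar. In case a rule matches N characters and a later rule matches the same N characters plus 1 or more characters, then the later rule is matched (greedy match). That is why your comment rules still work: at /* the ML_COMMENT rule can match more than the single / that IGNORED_SYMBOLS matches.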
Take the following rules for example:
DO : 'do';
ID : 'a'..'z'+;
The input "do" would obviously be matched by the rule DO.
And input like: "done" would be greedily matched by ID. It is not tokenized as the 2 tokens: [DO:"do"] followed by [ID:"ne"].
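The two tie-breaking rules can be sketched with a tiny hand-written tokenizer in plain Java. This is an illustration of "longest match wins, earliest rule breaks ties", not ANTLR's actual implementation; the class name LongestMatchDemo, the added WS rule, and the NAME:text output format are assumptions made for the example. It relies on java.util.regex, whose greedy quantifiers give the longest match for these simple patterns.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Toy lexer: the longest match wins; on a tie the rule defined first wins. */
final class LongestMatchDemo {
    // Insertion order stands in for the order of the rules in a grammar file.
    private static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("DO", Pattern.compile("do"));
        RULES.put("ID", Pattern.compile("[a-z]+"));
        RULES.put("WS", Pattern.compile("[ \\t\\r\\n]+"));
    }

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String bestRule = null;
            int bestLen = 0;
            for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
                Matcher m = rule.getValue().matcher(input).region(pos, input.length());
                // Strictly greater: a later rule only wins by matching more text.
                if (m.lookingAt() && m.end() - pos > bestLen) {
                    bestLen = m.end() - pos;
                    bestRule = rule.getKey();
                }
            }
            if (bestRule == null) {
                throw new IllegalArgumentException(
                        "token recognition error at: '" + input.charAt(pos) + "'");
            }
            if (!bestRule.equals("WS")) {  // whitespace is skipped
                tokens.add(bestRule + ":" + input.substring(pos, pos + bestLen));
            }
            pos += bestLen;
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("do"));    // [DO:do]
        System.out.println(tokenize("done"));  // [ID:done]
        System.out.println(tokenize("do ne")); // [DO:do, ID:ne]
    }
}
```

For "do" both DO and ID match two characters, so the earlier rule DO is chosen. For "done" only ID can match all four characters, so it wins even though it is defined later.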