identifier token keyword antlr parser - parsing

How do I handle the case where the token 'for' is used in two different ways in the language being parsed, both as a statement keyword and as a plain "parameter", as in the following example?
echo for print example
for i in {0..10..2}
do
echo "Welcome $i times"
done
Output:
for print example
Welcome 0 times
Welcome 2 times
Welcome 4 times
Welcome 6 times
Welcome 8 times
Welcome 10 times
Thanks.

The only way I can see to do this is to define an Echo rule in your lexer grammar that matches the characters echo followed by all other characters except \r and \n:
Echo
: 'echo' ~('\r' | '\n')+
;
and make sure that rule is before the rule that matches identifiers and keywords (like for).
A quick demo of a possible start would be:
grammar Test;
parse
: (echo | for)*
;
echo
: Echo (NewLine | EOF)
;
for
: For Identifier In range NewLine
Do NewLine
echo
Done (NewLine | EOF)
;
range
: '{' Integer '..' Integer ('..' Integer)? '}'
;
Echo
: 'echo' ~('\r' | '\n')+
;
For : 'for';
In : 'in';
Do : 'do';
Done : 'done';
Identifier
: ('a'..'z' | 'A'..'Z' | '_') ('a'..'z' | 'A'..'Z' | '_' | '0'..'9')*
;
Integer
: '0'..'9'+
;
NewLine
: '\r' '\n'
| '\n'
| '\r'
;
Space
: (' ' | '\t') {skip();}
;
If you'd parse the input:
echo for print example
for i in {0..10..2}
do
echo "Welcome $i times"
done
echo the end for now!
with it, it would look like:
[Image: parse tree for the input above; originally hosted at http://img571.imageshack.us/img571/5713/grammar.png]
HTH.
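To see why the Echo rule swallows the word for on an echo line, here is a minimal Python sketch that imitates the lexer rules of the demo grammar. It is only a simulation under stated assumptions: the `RULES` table, the `Punct` entry (standing in for the literal tokens '{', '}' and '..') and the `tokenize` helper are made up for illustration and are not ANTLR API.

```python
import re

# Rules in grammar order; Echo comes first, as the answer recommends.
# "Punct" is a hypothetical stand-in for the literals '{', '}' and '..'.
RULES = [
    ("Echo", re.compile(r"echo[^\r\n]+")),
    ("For", re.compile(r"for")),
    ("In", re.compile(r"in")),
    ("Do", re.compile(r"do")),
    ("Done", re.compile(r"done")),
    ("Identifier", re.compile(r"[A-Za-z_][A-Za-z_0-9]*")),
    ("Integer", re.compile(r"[0-9]+")),
    ("NewLine", re.compile(r"\r\n|\n|\r")),
    ("Punct", re.compile(r"\{|\}|\.\.")),
]

def tokenize(text):
    """Longest match wins; on a tie the rule listed first wins."""
    tokens, pos = [], 0
    while pos < len(text):
        if text[pos] in " \t":
            pos += 1  # the Space rule is skipped
            continue
        best = None
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            # Only a strictly longer match replaces the current best.
            if m and (best is None or m.end() > best[1]):
                best = (name, m.end())
        if best is None:
            raise ValueError(f"cannot tokenize at offset {pos}")
        tokens.append((best[0], text[pos:best[1]]))
        pos = best[1]
    return tokens

tokens = tokenize("echo for print example\nfor i in {0..10..2}\n")
for token in tokens:
    print(token)
```

The whole first line comes out as a single Echo token, so the only For token is the one that starts the loop on the second line.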

In order to do that you need to use a semantic predicate to only take that lexer rule when it really is the for keyword.
Details are available on the keywords as identifiers page on the ANTLR wiki.

Well, it's pretty easy, most grammars use something like this:
TOKEN_REF
: 'A'..'Z' ('a'..'z'|'A'..'Z'|'_'|'0'..'9')*
;
So when referring to a print statement you would do something like:
'print' (TOKEN_REF)*
And with a for statement you just explicitly state 'for', such as:
'for' INT 'in' SOMETHING

Related

Antlr grun error - no viable alternative input at

I'm trying to write a grammar for a Prolog interpreter. When I run grun from the command line on input like "father(john,mary).", I get a message saying "no viable input at 'father(john,'" and I don't know why. I've tried rearranging the rules in my grammar, using different entry points, etc., but I still get the same error. I'm not even sure whether it's caused by my grammar or by something else, such as ANTLR itself. Can someone point out what is wrong with my grammar, or suggest what the cause could be if it is not the grammar?
The commands I ran are:
antlr4 -no-listener -visitor Expr.g4
javac *.java
grun antlr.Expr start tests/test.txt -gui
And this is the resulting parse tree:
Here is my grammar:
grammar Expr;
@header {
package antlr;
}
//start rule
start : (program | query) EOF
;
program : (rule_ '.')*
;
query : conjunction '?'
;
rule_ : compound
| compound ':-' conjunction
;
conjunction : compound
| compound ',' conjunction
;
compound : Atom '(' elements ')'
| '.(' elements ')'
;
list : '[]'
| '[' element ']'
| '[' elements ']'
;
element : Term
| list
| compound
;
elements : element
| element ',' elements
;
WS : [ \t\r\n]+ -> skip ;
Atom : [a-z]([a-z]|[A-Z]|[0-9]|'_')*
| '0'
;
Var : [A-Z]([a-z]|[A-Z]|[0-9]|'_')*
;
Term : Atom
| Var
;
The lexer will always produce the same tokens for any input. The lexer does not "listen" to what the parser is trying to match. The rules the lexer applies are quite simple:
1. try to match as many characters as possible
2. when 2 or more lexer rules match the same amount of characters, let the rule defined first "win"
Because of the 2nd rule, the rule Term will never be matched: Atom and Var are defined before it and match exactly the same text. Moving the Term rule above Var and Atom would instead cause those two rules never to be matched. The solution: "promote" the Term rule to a parser rule:
start : (program | query) EOF
;
program : (rule_ '.')*
;
query : conjunction '?'
;
rule_ : compound (':-' conjunction)?
;
conjunction : compound (',' conjunction)?
;
compound : Atom '(' elements ')'
| '.' '(' elements ')'
;
list : '[' elements? ']'
;
element : term
| list
| compound
;
elements : element (',' element)*
;
term : Atom
| Var
;
WS : [ \t\r\n]+ -> skip ;
Atom : [a-z] [a-zA-Z0-9_]*
| '0'
;
Var : [A-Z] [a-zA-Z0-9_]*
;
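The two lexer rules described above (longest match, then first-defined rule on a tie) can be imitated in a few lines of Python to show why the lexer rule Term from the question never produces a token. This is a rough simulation, not ANTLR itself: the `RULES` table, the `PUNCT` entry for the literal tokens and the `tokenize` helper are assumptions made for the example.

```python
import re

# Lexer rules in the order used by the question's grammar.
# Term matches exactly what Atom and Var match together.
RULES = [
    ("Atom", re.compile(r"[a-z][a-zA-Z0-9_]*|0")),
    ("Var", re.compile(r"[A-Z][a-zA-Z0-9_]*")),
    ("Term", re.compile(r"[a-z][a-zA-Z0-9_]*|0|[A-Z][a-zA-Z0-9_]*")),
    ("PUNCT", re.compile(r"[().,]")),  # stand-in for the literal tokens
]

def tokenize(text):
    tokens, pos = [], 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1  # WS is skipped
            continue
        best_name, best_len = None, 0
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            # Strictly longer wins; on a tie the earlier rule is kept.
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens

tokens = tokenize("father(john,Mary).")
print(tokens)
```

Every name comes out as Atom or Var, never as Term, so a parser rule that asks for a Term token cannot succeed on this input.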

Grammar in ANTLR4

I have taken inspiration from the DOT.g4 grammar in this GitHub repository: grammars-v4/dot/DOT.g4. That's why I also have a DOT file to parse.
This is a possible structure of my DOT file:
digraph G {
rankdir=LR
label="\n[Büchi]"
labelloc="t"
node [shape="circle"]
I [label="", style=invis, width=0]
I -> 34
0 [label="0", peripheries=2]
0 -> 0 [label="!v_0"]
1 [label="1", peripheries=2]
1 -> 1 [label="!v_2 & !v_5"]
2 [label="2"]
2 -> 1 [label="v_0 & v_1 > 5 & !v_2 & v_3 < 8 & !v_5"]
3 [label="3"]
3 -> 1 [label="v_0 & v_1 > 5 & !v_2 & v_3 < 8 & !v_5"]
4 [label="4"]
4 -> 1 [label="v_1 > 5 & !v_2 & v_3 < 8 & !v_5"]
5 [label="5"]
5 -> 1 [label="v_0 & v_1 > 5 & !v_2 & v_3 < 8 & !v_5"]
}
And here is my grammar.g4 file, which I modified from the one linked above:
parse: nba| EOF;
nba: STRICT? ( GRAPH | DIGRAPH ) ( initialId? ) '{' stmtList '}';
stmtList : ( stmt ';'? )* ;
stmt: nodeStmt| edgeStmt| attrStmt | initialId '=' initialId;
attrStmt: ( GRAPH | NODE | EDGE ) '[' a_list? ']';
a_list: ( initialId ( '=' initialId )? ','? )+;
edgeStmt: (node_id) edgeRHS label ',' a_list? ']';
label: ('[' LABEL '=' '"' (id)+ '"' );
edgeRHS: ( edgeop ( node_id ) )+;
edgeop: '->';
nodeStmt: node_id label? ',' a_list? ']';
node_id: initialId ;
id: ID | SPACE | DIGIT | LETTER | SYMBOL | STRING ;
initialId : STRING | LETTER | DIGIT;
And here are the lexer rules:
GRAPH: [Gg] [Rr] [Aa] [Pp] [Hh];
DIGRAPH: [Dd] [Ii] [Gg] [Rr] [Aa] [Pp] [Hh];
NODE: [Nn] [Oo] [Dd] [Ee];
EDGE: [Ee] [Dd] [Gg] [Ee];
LABEL: [Ll] [Aa] [Bb] [Ee] [Ll];
/** "a numeral [-]?(.[0-9]+ | [0-9]+(.[0-9]*)? )" */
NUMBER: '-'? ( '.' DIGIT+ | DIGIT+ ( '.' DIGIT* )? );
DIGIT: [0-9];
/** "any double-quoted string ("...") possibly containing escaped quotes" */
STRING: '"' ( '\\"' | . )*? '"';
/** "Any string of alphabetic ([a-zA-Z\200-\377]) characters, underscores
* ('_') or digits ([0-9]), not beginning with a digit"
*/
ID: LETTER ( LETTER | DIGIT )*;
SPACE: '" "';
LETTER: [a-zA-Z\u0080-\u00FF_];
SYMBOL: '<'| '>'| '&'| 'U'| '!';
COMMENT: '/*' .*? '*/' -> skip;
LINE_COMMENT: '//' .*? '\r'? '\n' -> skip;
/** "a '#' character is considered a line output from a C preprocessor */
PREPROC: '#' ~[\r\n]* -> skip;
/*whitespace are ignored from the constructor*/
WS: [ \t\n\r]+ -> skip;
I clicked on the ANTLR Recognizer section, which itself generates the Java files and the tokens used to interpret the grammar. Now I have to construct a parser in which I override some methods so that my Java code matches the Java files created by ANTLR4. But first I want to understand whether my grammar for that kind of DOT file is correct. How can I verify that?
Re: "I clicked on the ANTLR Recognizer"... it sounds like you're using some sort of IDE with a plugin, or another ANTLR tool. I use VS Code and IntelliJ with plugins, but neither has an "ANTLR Recognizer" section (that I can see), so the following assumes you are using the command line. It's simple command-line stuff and definitely worth learning early on when using ANTLR. (Both of the plugins I use also let you view the token stream and parse tree from within the plugin, though.)
If you follow the "QuickStart" at www.antlr.org, you'll have created the grun alias that's useful for just this purpose.
(Assuming your grammar name is DOT)
To dump out your token stream (the result of all your lexer rules):
grun DOT tokens -tokens
To verify that you're parsing input correctly:
grun DOT parse -gui
or
grun DOT parse -tree
BTW, it's rather unlikely that you'll need to override the parser class. First take a look at Visitors and Listeners.

antlr error "no viable alternative at input"

I'm following the example given here-
https://datapsyche.wordpress.com/2014/10/23/back-to-learning-grammar-with-antlr/
which basically has following grammar-
grammar Simpleql;
statement : expr command* ;
expr : expr ('AND' | 'OR' | 'NOT') expr # expopexp
| expr expr # expexp
| predicate # predicexpr
| text # textexpr
| '(' expr ')' # exprgroup
;
predicate : text ('=' | '!=' | '>=' | '<=' | '>' | '<') text ;
command : '| show' text* # showcmd
| '| show' text (',' text)* # showcsv
;
text : NUMBER # numbertxt
| QTEXT # quotedtxt
| UQTEXT # unquotedtxt
;
AND : 'AND' ;
OR : 'OR' ;
NOT : 'NOT' ;
EQUALS : '=' ;
NOTEQUALS : '!=' ;
GREQUALS : '>=' ;
LSEQUALS : '<=' ;
GREATERTHAN : '>' ;
LESSTHAN : '<' ;
NUMBER : DIGIT+
| DIGIT+ '.' DIGIT+
| '.' DIGIT+
;
QTEXT : '"' (ESC|.)*? '"' ;
UQTEXT : ~[ ()=,<>!\r\n]+ ;
fragment
DIGIT : [0-9] ;
fragment
ESC : '\\"' | '\\\\' ;
WS : [ \t\r\n]+ -> skip ;
When I pass input like this-
Abishek AND (country=India OR city=NY) LOGIN 404 | show name city
I get error- line 1:65 no viable alternative at input '<EOF>'
I went through a couple of SO posts related to the error but can't seem to be able to figure out what is wrong with the grammar.
I tried running your example but got a number of errors in ANTLRWorks 2. However, I was able to run it without any errors in the test rig, getting the following output:
(statement (expr (expr (expr (text Abishek)) AND (expr ( (expr (expr (predicate (text country) = (text India))) OR (expr (predicate (text city) = (text NY)))) ))) (expr (expr (text LOGIN)) (expr (text 404)))) (command | show (text name) (text city)))
And the same output of the tree shown on the website.
My guess is that the problem is your actual input. I've had problems in the past with ANTLR reading text from a file that was not saved in the encoding the parser expected (ASCII, ANSI, UTF-8, or whatever works for the OS you are using). I encountered this when I saved a file on Linux from a Linux text editor and then tried to run it on Windows with the same generated parser. So my recommendation is to try re-saving your text input, 'Abishek AND (country=India OR city=NY) LOGIN 404 | show name city', using a different encoding each time, in case this is the cause.
Note that you can also specify the encoding explicitly, like this or in similar ways:
CharStream charStream = new ANTLRInputStream(new InputStreamReader(inputStream, "UTF-8"));
An encoding mismatch matters because the input is then decoded into the wrong characters regardless, and none of the rules match.
Let me know if it works after saving the file in a few different encodings and I'll try to help further. Hope this helps.
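As a rough illustration of the encoding theory above, the following Python sketch shows how the same query text turns into different characters when its bytes are decoded with the wrong charset. It does not involve ANTLR at all, and the choice of UTF-16 and Latin-1 as the mismatched pair is only an assumption made for the example.

```python
text = "Abishek AND (country=India OR city=NY) LOGIN 404 | show name city"

utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16")  # starts with a byte order mark

# Decoding with the matching charset round-trips the query unchanged.
same_utf8 = utf8_bytes.decode("utf-8")
same_utf16 = utf16_bytes.decode("utf-16")

# Decoding UTF-16 bytes as Latin-1 never raises, but every character is
# now followed by a NUL and the text starts with two stray BOM characters.
wrong = utf16_bytes.decode("latin-1")

# A UTF-8 file saved with a BOM keeps an invisible U+FEFF in front when it
# is read as plain UTF-8, which a lexer would see as an extra character.
with_bom = text.encode("utf-8-sig").decode("utf-8")

print(repr(wrong[:8]))
print(repr(with_bom[:3]))
```

If the text your parser receives looks like `wrong` or `with_bom` rather than `text`, the grammar is probably fine and the way the input is read is the thing to fix.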

ANTLR grammar for multi-level text segmentation

I want to create a grammar that will parse a text file and create a tree of levels according to configurable "segmentors". This is what I have created so far, it kind of works, but will halt when a "segmentor" appears in the beginning of a text. For example, text "and location" will fail to parse. Any ideas?
Also, I'm pretty certain that the grammar could be greatly improved, so any suggestions are welcome.
grammar DocSegmentor;
@header {
package segmentor.antlr;
}
// PARSER RULES
levelOne: (levelTwo LEVEL1_SEG*)+ ;
levelTwo: (levelThree+ LEVEL2_SEG?)+ ;
levelThree: (levelFour+ LEVEL3_SEG?)+ ;
levelFour: (levelFive+ LEVEL4_SEG?)+ ;
levelFive: tokens;
tokens: (DELIM | PAREN | TEXT | WS)+ ;
// LEXER RULES
LEVEL1_SEG : '\r'? '\n'| EOF ;
LEVEL2_SEG : '.' ;
LEVEL3_SEG : ',' ;
LEVEL4_SEG : 'and' | 'or' ;
DELIM : '`' | '"' | ';' | '/' | ':' | '’' | '‘' | '=' | '?' | '-' | '_';
PAREN : '(' | ')' | '[' | ']' | '{' | '}' ;
TEXT : (('a'..'z') | ('A'..'Z') | ('0'..'9'))+ ;
WS : [ \t]+ ;
I'd definitely go with a Scala parser combinator library.
https://lihaoyi.github.io/fastparse/
https://github.com/scala/scala-parser-combinators
Those are just two examples for a library you can write by hand with little effort and tune to whatever you need. I should mention that you should go with Scalaz (https://github.com/scalaz/scalaz) if you're writing a parser monad on your own.
I wouldn't use a parser at all for that task. All you need is keyword spotting.
It's much easier and more flexible if you just scan your text for the "segmentors" by walking over the input. This also lets you handle text of any size (e.g. by using memory-mapped files), whereas parsers usually (ANTLR for sure) load the entire text into memory and tokenize it fully before parsing starts.
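A minimal sketch of the keyword-spotting idea, in Python rather than ANTLR: split the text on the level 1 segmentor, then split each piece on the level 2 segmentor, and so on. The `SEGMENTORS` table is taken from the LEVEL1_SEG to LEVEL4_SEG rules in the question; the `segment` function and its nested-list result are assumptions made for illustration, and it works on an in-memory string rather than a memory-mapped file.

```python
import re

# Configurable segmentors, outermost level first.
SEGMENTORS = [
    r"\r?\n",            # level 1: line breaks
    r"\.",               # level 2: full stop
    r",",                # level 3: comma
    r"\b(?:and|or)\b",   # level 4: the whole words "and" / "or"
]

def segment(text, level=0):
    """Split text recursively; the leaves are stripped strings."""
    if level == len(SEGMENTORS):
        return text.strip()
    # Empty pieces (e.g. before a leading segmentor) are simply dropped.
    parts = [p for p in re.split(SEGMENTORS[level], text) if p.strip()]
    return [segment(p, level + 1) for p in parts]

print(segment("and location"))  # [[[['location']]]]
print(segment("red and blue, green. done\nnext"))
```

A segmentor at the very start of the text, as in "and location", just produces an empty piece that is filtered out, so the case that stalls the grammar needs no special handling here.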

Help with parsing a log file (ANTLR3)

I need a little guidance in writing a grammar to parse the log file of the game Aion. I've decided upon using Antlr3 (because it seems to be a tool that can do the job and I figured it's good for me to learn to use it). However, I've run into problems because the log file is not exactly structured.
The log file I need to parse looks like the one below:
2010.04.27 22:32:22 : You changed the connection status to Online.
2010.04.27 22:32:22 : You changed the group to the Solo state.
2010.04.27 22:32:22 : You changed the group to the Solo state.
2010.04.27 22:32:28 : Legion Message: www.xxxxxxxx.com (forum)
ventrillo: 19x.xxx.xxx.xxx
Port: 3712
Pass: xxxx (blabla)
4/27/2010 7:47 PM
2010.04.27 22:32:28 : You have item(s) left to settle in the sales agency window.
As you can see, most lines start with a timestamp, but there are exceptions. What I'd like to do in Antlr3 is write a parser that uses only the lines starting with the timestamp while silently discarding the others.
This is what I've written so far (I'm a beginner with these things so please don't laugh :D)
grammar Antlr;
options {
language = Java;
}
logfile: line* EOF;
line : dataline | textline;
dataline: timestamp WS ':' WS text NL ;
textline: ~DIG text NL;
timestamp: four_dig '.' two_dig '.' two_dig WS two_dig ':' two_dig ':' two_dig ;
four_dig: DIG DIG DIG DIG;
two_dig: DIG DIG;
text: ~NL+;
/* Whitespace */
WS: (' ' | '\t')+;
/* New line goes to \r\n or EOF */
NL: '\r'? '\n' ;
/* Digits */
DIG : '0'..'9';
So what I need is an example of how to parse this without generating errors for lines without the timestamp.
Thanks!
No one is going to laugh. In fact, you did a pretty good job for a first try. Of course, there's room for improvement! :)
First some remarks: you can only negate single characters. Since your NL rule can possibly consist of two characters, you can't negate it. Also, when negating from within your parser rule(s), you don't negate single characters, but you're negating lexer rules. This may sound a bit confusing so let me clarify with an example. Take the combined (parser & lexer) grammar T:
grammar T;
// parser rule
foo
: ~A
;
// lexer rules
A
: 'a'
;
B
: 'b'
;
C
: 'c'
;
As you can see, I'm negating the A lexer rule in the foo parser rule. The foo rule does not match any character except 'a'; instead, it matches any token except A. In other words, it will only match a 'b' or 'c' character.
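The difference between negating characters and negating tokens can be mimicked with a tiny Python model of grammar T. This is only an illustration: `LEXER_RULES`, `lex` and `foo_matches` are hypothetical helpers, not anything ANTLR generates.

```python
# Hypothetical stand-in for grammar T: one token type per character.
LEXER_RULES = {"a": "A", "b": "B", "c": "C"}

def lex(text):
    """Turn characters into token types; unknown characters raise KeyError."""
    return [LEXER_RULES[ch] for ch in text]

def foo_matches(token_type):
    """Parser rule foo : ~A ; accepts any single token whose type is not A."""
    return token_type in LEXER_RULES.values() and token_type != "A"

accepted = [ch for ch in "abc" if foo_matches(lex(ch)[0])]
print(accepted)  # ['b', 'c']
```

A character such as 'd' never reaches foo at all in this model, because no lexer rule turns it into a token in the first place.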
Also, you don't need to put:
options {
language = Java;
}
in your grammar: the default target is Java (it does not hurt to leave it in there of course).
Now, in your grammar, you can already make a distinction between data- and text-lines in your lexer grammar. Here's a possible way to do so:
logfile
: line+
;
line
: dataline
| textline
;
dataline
: DataLine
;
textline
: TextLine
;
DataLine
: TwoDigits TwoDigits '.' TwoDigits '.' TwoDigits Space+ TwoDigits ':' TwoDigits ':' TwoDigits Space+ ':' TextLine
;
TextLine
: ~('\r' | '\n')* (NewLine | EOF)
;
fragment
NewLine
: '\r'? '\n'
| '\r'
;
fragment
TwoDigits
: '0'..'9' '0'..'9'
;
fragment
Space
: ' '
| '\t'
;
Note that the fragment keyword in front of a lexer rule means that no tokens are created from that rule: it is only used by other lexer rules. So the lexer will only create two different types of tokens: DataLine and TextLine.
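For comparison, the same split into data lines and text lines can be sketched outside ANTLR with a regular expression. This swaps in a different technique (plain Python regex matching instead of lexer rules) and is only meant to show which lines the DataLine rule is intended to pick up; `DATA_LINE` and `data_lines` are names invented for the example.

```python
import re

# Mirrors the shape of the DataLine rule: yyyy.mm.dd hh:mm:ss : text
DATA_LINE = re.compile(
    r"(\d{4})\.(\d{2})\.(\d{2})[ \t]+(\d{2}):(\d{2}):(\d{2})[ \t]+:(.*)"
)

def data_lines(log_text):
    """Yield (timestamp, message) for lines that start with a timestamp."""
    for line in log_text.splitlines():
        m = DATA_LINE.fullmatch(line)
        if m:
            y, mo, d, h, mi, s = m.groups()[:6]
            yield f"{y}.{mo}.{d} {h}:{mi}:{s}", m.group(7).strip()
        # Any other line plays the role of a TextLine and is dropped.

sample = (
    "2010.04.27 22:32:28 : Legion Message: www.xxxxxxxx.com (forum)\n"
    "ventrillo: 19x.xxx.xxx.xxx\n"
    "4/27/2010 7:47 PM\n"
    "2010.04.27 22:32:28 : You have item(s) left to settle in the sales agency window.\n"
)
result = list(data_lines(sample))
for timestamp, message in result:
    print(timestamp, "|", message)
```

Only the two timestamped lines survive; the continuation lines of the legion message are silently discarded, which is the behaviour the question asks for.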
Trying to keep your grammar as close to the original as possible, here is how I got it to work on the example input. Because whitespace is passed from the lexer to the parser, I moved all the literal tokens in your parser rules into actual lexer rules. The main change is the addition of another line alternative, which I then tuned so that it matches the non-timestamp lines in your test data and not the real data lines. I also assumed that a blank line should be discarded, as you can tell from the rule. So here is what I got working:
logfile: line* EOF;
//line : dataline | textline;
line : dataline | textline | discardline;
dataline: timestamp WS COLON WS text NL ;
textline: ~DIG text NL;
//"new"
discardline: (WS)+ discardtext (text|DIG|PERIOD|COLON|SLASH|WS)* NL
| (WS)* NL;
discardtext: (two_dig| DIG) WS* SLASH;
// two_dig SLASH four_dig;
timestamp: four_dig PERIOD two_dig PERIOD two_dig WS two_dig COLON two_dig COLON two_dig ;
four_dig: DIG DIG DIG DIG;
two_dig: DIG DIG;
//Following is very different
text: CHAR (CHAR|DIG|PERIOD|COLON|SLASH|WS)*;
/* Whitespace */
WS: (' ' | '\t')+ ;
/* New line goes to \r\n or EOF */
NL: '\r'? '\n' ;
/* Digits */
DIG : '0'..'9';
//new lexer rules
CHAR : 'a'..'z'|'A'..'Z';
PERIOD : '.';
COLON : ':';
SLASH : '/' | '\\';
Hopefully that helps you, good luck.
