What's the best way to handle optional tokens in antlr4 - parsing

Suppose I have following input:
Great University
Graduated in 2010
Some University
09/2009 - 06/2011
Nice University
06/2011
I want to handle years of studying. My grammar looks like that:
education:
(section)*
EOF
;
section:
(school | years)+
;
degree: WORD* DEGREE WORD* SEPARATOR;
years: WORD* ( (YEAR_START '-')? YEAR_END) WORD* SEPARATOR;
WS : [ \t\r]+ -> skip;
SEPARATOR : (NEWLINE | COMMA);
COMMA : ',';
NEWLINE : '\n';
SCHOOL : ('university' | 'University' | 'school' | 'School');
WORD : [a-zA-Z'()]+;
YEAR_START : YEAR;
YEAR_END : YEAR;
YEAR : (DIGIT DIGIT '/')? [1-2] DIGIT DIGIT DIGIT;
DIGIT : [0-9];
I'm getting following errors:
line 1:17 mismatched input '\n' expecting '-'
line 6:17 mismatched input '\n' expecting '-'
How can I handle optional start year via grammar?

The lexer can assign only one token type to one pattern. You expect it to assign a year pattern to three token types and to decide at runtime which one is the correct one. This is not how ANTLR works.
In your case all years (not only the optional one) will be captured by the first rule, i.e. YEAR_START. This means following tokenization
"Graduated in 2010" -> WORD WORD YEAR_START
The only matching rule is
years: WORD* ( (YEAR_START '-')? YEAR_END) WORD* SEPARATOR;
but the '-' is missing.
The grammar should work if you delete the YEAR_START and YEAR_END rules and replace all occurrences by YEAR. Probably YEAR_START and YEAR_END have the purpose to distinguish start and end, yet for this purpose there exist labels.
If this does not work, please post your complete grammar; the one you posted does e.g. not contain a rule for DEGREE.

Related

ANTLR4: How can I recognize words from an alphabet?

I am new to Antl4. I have an antlr grammar file that consists of something similar to:
consonant : 'b' | 'c' | 'd' | 'f' ;
vowel : 'a' | 'e' | 'i' ;
connector : ':' | '-' ;
cseq : (consonant)+ ;
vseq : (vowel)+ ;
prefix : cseq vseq ;
word : (cseq vseq | cseq)+ ;
From my understanding, even though these lines are at the bottom of a file, they're still considered rules. My parse tree captures each individual letter instead of treating them as lexical items - or words. How can I change these rules into lexer statements?
A couple of things to keep in mind.
parser rules are rules beginning with lower case letters
lexer rules are those whose name begins with an uppercase character (fairly common convention is to make then all uppercase)
if you put a literal character in a parser rule (all of your rules are parser rules, as they begin with lower case characters), ANTLR will synthesize a TOKEN rule for those characters.
Since it appears that you want a word to be a lexical item (i.e. Token), you could do something along the lines of:
fragment CONSONANT : 'b' | 'c' | 'd' | 'f' ;
fragment VOWEL : 'a' | 'e' | 'i' ;
CONNECTOR : ':' | '-' ; // not sure what you intend for this
fragment CSEQ: CONSONANT+ ;
fragment VSEQ : VOWEL+ ;
PREFIX : CSEQ VSEQ ; // not sure what you intend for this
WORD : (CSEQ VSEQ | CSEQ)+ ;
(That's making quite a few assumptions about your intention.)
Main point, if you want WORDs to be single tokens, they need to be defined as a Lexer rule.
If you want to compose rules for Lexer rules, you can define fragment rules. These rules can be used to compose Lexer rules, but will not, themselves, be recognized as tokens.
With the changes here, you should be able to use WORD in a parser rule, and have all the characters that make up your WORD in a single Token.

ANTLR grammar not working as expected. What am I doing wrong?

I have this grammar below for implementing an IN operator taking a list of numbers or strings.
grammar listFilterExpr;
listFilterExpr: entityIdNumberListFilter | entityIdStringListFilter;
entityIdNumberProperty
: 'a.Id'
| 'c.Id'
| 'e.Id'
;
entityIdStringProperty
: 'f.phone'
;
listFilterExpr
: entityIdNumberListFilter
| entityIdStringListFilter
;
listOperator
: '$in:'
;
entityIdNumberListFilter
: entityIdNumberProperty listOperator numberList
;
entityIdStringListFilter
: entityIdStringProperty listOperator stringList
;
numberList: '[' ID (',' ID)* ']';
fragment ID: [1-9][0-9]*;
stringList: '[' STRING (',' STRING)* ']';
STRING
: '"'(ESC | SAFECODEPOINT)*'"'
;
fragment ESC
: '\\' (["\\/bfnrt] | UNICODE)
;
fragment SAFECODEPOINT
: ~ ["\\\u0000-\u001F]
;
If I try to parse the following input:
c.Id $in: [1,1]
Then I get the following error in the parser:
mismatched input '1' expecting ID
Please help me to correct this grammar.
Update
I found this following rule way above in the huge grammar file of my project that might be matching '1' before it gets to match to ID:
NUMBER
: '-'? INT ('.' [0-9] +)?
;
fragment INT
: '0' | [1-9] [0-9]*
;
But, If I write my ID rule before NUMBER then other things fail, because they have already matched ID which should have matched NUMBER
What should I do?
As mentioned by rici: ID should not be a fragment. Fragments can only be used by other lexer rules, they will never become a token on their own (and can therefor not be used in parser rules).
Just remove the fragment keyword from it: ID: [1-9][0-9]*;
Note that you'll also have to account for spaces. You probably want to skip them:
SPACES : [ \t\r\n] -> skip;
...
mismatched input '1' expecting ID
...
This looks like there's another lexer, besides ID, that also matches the input 1 and is defined before ID. In that case, have a look at this Q&A: ANTLR 4.5 - Mismatched Input 'x' expecting 'x'
EDIT
Because you have the rules ordered like this:
NUMBER
: '-'? INT ('.' [0-9] +)?
;
fragment INT
: '0' | [1-9] [0-9]*
;
ID
: [1-9][0-9]*
;
the lexer will never create an ID token (only NUMBER tokens will be created). This is just how ANTLR works: in case of 2 or more lexer rules match the same amount of characters, the one defined first "wins".
In the first place I think it's odd to have an ID rule that matches only digits, but, if that's the language you're parsing, OK. In your case, you could do something like this:
id : POS_NUMBER;
number : POS_NUMBER | NEG_NUMBER;
POS_NUMBER : INT ('.' [0-9] +)?;
NEG_NUMBER : '-' POS_NUMBER;
fragment INT
: '0' | [1-9] [0-9]*
;
and then instead of ID, use id in your parser rules. As well as using number instead of the NUMBER you're using now.

How do I figure out ANTLR grammar failure to parse input, special timestamp in logfile

INPUT:
Mar 9 10:19:07 west info tmm1[17280]: 01870003:6:
/Common/mysaml.app/mysaml:Common:00000000: helloasdfasdf asdfadf vgnfg
GRAMMAR:
grammar scratch;
lines : datestamp hostname level proc msgnum module msgstring;
datestamp: month day time;
//month : MONTH;
day : INTEGER;
time : INTEGER ':' INTEGER ':' INTEGER;
hostname : STRING;
level : ALPHA;
proc: procname '[' procnum ']' ':';
procname : STRING;
procnum : INTEGER;
msgnum : INTEGER ':' DIGIT':';
module : '/' DOTSLASHSTRING ':' PARTITION ':' SESSID ':';
PARTITION: STRING;
sessid : HEX;
msgstring: MSGSTRING;
DOTSLASHSTRING : [a-zA-Z./]+;
SESSID : HEX;
INTEGER : [0-9]+;
DIGIT: [0-9];
STRING : [a-zA-Z][a-zA-Z0-9]*;
HEX : [a-f0-9]+;
//ALPHA: [a-zA-Z]+;
ALPHA: ('['|'(') .*? (']'|')');
MSGSTRING : [a-zA-Z0-9':,_(). ]+ [\r];
// | 'Agent' MSGSTRING;
month : 'Jan' | 'Feb' | 'Mar' | 'Apr' | 'May' | 'Jun' | 'Jul' | 'Aug' | 'Sep' | 'Oct' | 'Nov' | 'Dec' ;
WS : [ \t\r\n]+ -> skip;
PROBLEM:
the parse tree shows that the month is populated properly, but the next item, day is not. In the parse tree, it shows day is set to the entire rest of the input. Don't see how this is possible.
Error from parser is:
line 1:4 mismatched input '9' expecting INTEGER
The parser (i.e. rules starting with lowercase letter) and lexer (uppercase first letter) behave in a slightly different way:
The parser knows what token it expects and tries to match it (except when it has multiple alternatives - then it looks at the next tokens to see which alternative to choose)
The lexer however knows nothing about the parser rules - it matches whatever it can match to the current input. When multiple lexer rules can match a prefix of the input:
It will match (and emit token for) the rule that matches the longest sequence
If multiple rules can match the same sequence, the rule earlier in the file (closer to the top) wins.
So your input would most likely be tokenized to*:
MONTH Mar
(WS)
SESSID 9 - SESSID matches and is higher up than INTEGER
(WS)
SESSID 10
':' :
SESSID 19
':' :
SESSID 07
(WS)
PARTITION west - same as STRING but higher up - STRING will never be matched
(WS)
PARTITION info
(WS)
PARTITION tmm1
ALPHA [17280] - matches longer sequence than just '[' in rule "proc"
':' :
(WS)
SESSID 01870003
':' :
SESSID 6
':' :
(WS)
DOTSLASHSTRING /Common/mysaml.app/mysaml - longer than just '/' in rule "module"
MSGSTRING :Common:00000000: helloasdfasdf asdfadf vgnfg - the rest can be matched to this rule
As you can see, these are quite different tokens than your parser expects.
The bottom line is, you have too much logic in your lexer rules, namely you tried to put semantic meanings into lexer. It's not suited to that task. If a single input sequence can mean different things (like 123 might be an integer number, hex number or session ID), that distinction needs to go into parser since it can only be decided based on context (where in the sentence it occurred), and not by the content of the 123 itself. Similarly, if [17280] can either be ALPHA (whatever that is) or an INTEGER in brackets, that decision needs to go into parser because it cannot be decided solely by looking at [17280] (it's now in lexer due to the ALPHA rule).
* The likely tokenization is based on the input from your screenshot which is all on one line, while the input in question itself is on two lines - not sure whether that is intentional or a result of line wrap.

Ordering lexer rules in a grammar using ANTLR4

I'm using ANTLR4 to generate a parser. I am new to parser grammars. I've read the very helpful ANTLR Mega Tutorial but I am still stuck on how to properly order (and/or write) my lexer and parser rules.
I want the parser to be able to handle something like this:
Hello << name >>, how are you?
At runtime I will replace "<< name >>" with the user's name.
So mostly I am parsing text words (and punctuation, symbols, etc), except with the occasional "<< something >>" tag, which I am calling a "func" in my lexer rules.
Here is my grammar:
doc: item* EOF ;
item: (func | WORD) PUNCT? ;
func: '<<' ID '>>' ;
WS : [ \t\n\r] -> skip ;
fragment LETTER : [a-zA-Z] ;
fragment DIGIT : [0-9] ;
fragment CHAR : (LETTER | DIGIT | SYMB ) ;
WORD : CHAR+ ;
ID: LETTER ( LETTER | DIGIT)* ;
PUNCT : [.,?!] ;
fragment SYMB : ~[a-zA-Z0-9.,?! |{}<>] ;
Side note: I added "PUNCT?" at the end of the "item" rule because it is possible, such as in the example sentence I gave above, to have a comma appear right after a "func". But since you can also have a comma after a "WORD" then I decided to put the punctuation in "item" instead of in both of "func" and "WORD".
If I run this parser on the above sentence, I get a parse tree that looks like this:
Anything highlighted in red is a parse error.
So it is not recognizing the "ID" inside the double angle brackets as an "ID". Presumably this is because "WORD" comes first in my list of lexer rules. However, I have no rule that says "<< WORD >>", only a rule that says "<< ID >>", so I'm not clear on why that is happening.
If I swap the order of "ID" and "WORD" in my grammar, so now they are in this order:
ID: LETTER ( LETTER | DIGIT)* ;
WORD : CHAR+ ;
And run the parser, I get a parse tree like this:
So now the "func" and "ID" rules are being handled appropriately, but none of the "WORD"s are being recognized.
How do I get past this conundrum?
I suppose one option might be to change the "func" rule to "<< WORD >>" and just treat everything as words, doing away with "ID". But I wanted to differentiate a text word from a variable identifier (for instance, no special characters are allowed in a variable identifier).
Thanks for any help!
From The Definitive ANTLR 4 Reference :
ANTLR resolves lexical ambiguities by
matching the input string to the rule specified first in the grammar.
With your grammar (in Question.g4) and a t.text file containing
Hello << name >>, how are you at nine o'clock?
the execution of
$ grun Question doc -tokens -diagnostics t.text
gives
[#0,0:4='Hello',<WORD>,1:0]
[#1,6:7='<<',<'<<'>,1:6]
[#2,9:12='name',<WORD>,1:9]
[#3,14:15='>>',<'>>'>,1:14]
[#4,16:16=',',<PUNCT>,1:16]
[#5,18:20='how',<WORD>,1:18]
[#6,22:24='are',<WORD>,1:22]
[#7,26:28='you',<WORD>,1:26]
[#8,30:31='at',<WORD>,1:30]
[#9,33:36='nine',<WORD>,1:33]
[#10,38:44='o'clock',<WORD>,1:38]
[#11,45:45='?',<PUNCT>,1:45]
[#12,47:46='<EOF>',<EOF>,2:0]
line 1:9 mismatched input 'name' expecting ID
line 1:14 extraneous input '>>' expecting {<EOF>, '<<', WORD, PUNCT}
Now change WORD to word in the item rule, and add a word rule :
item: (func | word) PUNCT? ;
word: WORD | ID ;
and put ID before WORD :
ID: LETTER ( LETTER | DIGIT)* ;
WORD : CHAR+ ;
The tokens are now
[#0,0:4='Hello',<ID>,1:0]
[#1,6:7='<<',<'<<'>,1:6]
[#2,9:12='name',<ID>,1:9]
[#3,14:15='>>',<'>>'>,1:14]
[#4,16:16=',',<PUNCT>,1:16]
[#5,18:20='how',<ID>,1:18]
[#6,22:24='are',<ID>,1:22]
[#7,26:28='you',<ID>,1:26]
[#8,30:31='at',<ID>,1:30]
[#9,33:36='nine',<ID>,1:33]
[#10,38:44='o'clock',<WORD>,1:38]
[#11,45:45='?',<PUNCT>,1:45]
[#12,47:46='<EOF>',<EOF>,2:0]
and there is no more error. As the -gui graphic shows, you have now branches identified as word or func.
As "500 - Internal Server Error" already mentioned in his comment ANTLR will match lexer rules in the order they are defined in the grammar (the topmost rule will be matched first) and if a certain input has been matched ANTLR won't try to match it differently.
In your case the WORD and ID rule can both match input like abc but as WORD is declared first abc will always be matched as a WORD and never as an ID. In fact ID will never be matched as there is no valid input as an ID that can not be matched by WORD.
However if your only goal is to replace whatever is in between << and >> you'd be better off using regular expressions. However if you still want to use ANTLR for it you should reduce your grammar to only care about the essentials. That is to distinguish between any input and input in between << and >>. Therefore your grammar should look something like this:
start: (INTERESTING | UNINTERESTING) ;
INTERESTING: '<<' .*? '>>' ;
UNINTERESTING: (~[<])+ | '<' ;
Or you could skip the UNINTERESTING completely.

Help with parsing a log file (ANTLR3)

I need a little guidance in writing a grammar to parse the log file of the game Aion. I've decided upon using Antlr3 (because it seems to be a tool that can do the job and I figured it's good for me to learn to use it). However, I've run into problems because the log file is not exactly structured.
The log file I need to parse looks like the one below:
2010.04.27 22:32:22 : You changed the connection status to Online.
2010.04.27 22:32:22 : You changed the group to the Solo state.
2010.04.27 22:32:22 : You changed the group to the Solo state.
2010.04.27 22:32:28 : Legion Message: www.xxxxxxxx.com (forum)
ventrillo: 19x.xxx.xxx.xxx
Port: 3712
Pass: xxxx (blabla)
4/27/2010 7:47 PM
2010.04.27 22:32:28 : You have item(s) left to settle in the sales agency window.
As you can see, most lines start with a timestamp, but there are exceptions. What I'd like to do in Antlr3 is write a parser that uses only the lines starting with the timestamp while silently discarding the others.
This is what I've written so far (I'm a beginner with these things so please don't laugh :D)
grammar Antlr;
options {
language = Java;
}
logfile: line* EOF;
line : dataline | textline;
dataline: timestamp WS ':' WS text NL ;
textline: ~DIG text NL;
timestamp: four_dig '.' two_dig '.' two_dig WS two_dig ':' two_dig ':' two_dig ;
four_dig: DIG DIG DIG DIG;
two_dig: DIG DIG;
text: ~NL+;
/* Whitespace */
WS: (' ' | '\t')+;
/* New line goes to \r\n or EOF */
NL: '\r'? '\n' ;
/* Digits */
DIG : '0'..'9';
So what I need is an example of how to parse this without generating errors for lines without the timestamp.
Thanks!
No one is going to laugh. In fact, you did a pretty good job for a first try. Of course, there's room for improvement! :)
First some remarks: you can only negate single characters. Since your NL rule can possibly consist of two characters, you can't negate it. Also, when negating from within your parser rule(s), you don't negate single characters, but you're negating lexer rules. This may sound a bit confusing so let me clarify with an example. Take the combined (parser & lexer) grammar T:
grammar T;
// parser rule
foo
: ~A
;
// lexer rules
A
: 'a'
;
B
: 'b'
;
C
: 'c'
;
As you can see, I'm negating the A lexer-rule in the foo parser-rule. The foo rule does now not match any character except the 'a', but it matches any lexer rule except A. In other words, it will only match a 'b' or 'c' character.
Also, you don't need to put:
options {
language = Java;
}
in your grammar: the default target is Java (it does not hurt to leave it in there of course).
Now, in your grammar, you can already make a distinction between data- and text-lines in your lexer grammar. Here's a possible way to do so:
logfile
: line+
;
line
: dataline
| textline
;
dataline
: DataLine
;
textline
: TextLine
;
DataLine
: TwoDigits TwoDigits '.' TwoDigits '.' TwoDigits Space+ TwoDigits ':' TwoDigits ':' TwoDigits Space+ ':' TextLine
;
TextLine
: ~('\r' | '\n')* (NewLine | EOF)
;
fragment
NewLine
: '\r'? '\n'
| '\r'
;
fragment
TwoDigits
: '0'..'9' '0'..'9'
;
fragment
Space
: ' '
| '\t'
;
Note that the fragment part in the lexer rules mean that no tokens are being created from those rules: they are only used in other lexer rules. So the lexer will only create two different type of tokens: DataLine's and TextLine's.
Trying to keep your grammar as close as possible, here is how I was able to get it to work based on the example input. Because whitespace is being passed to the parser from the lexer, I did move all your tokens from the parser into actual lexer rules. The main change is really just adding another line option and then trying to get it to match your test data and not the actual other good data, I also assumed that a blank line should be discarded as you can tell by the rule. So here is what I was able to get working:
logfile: line* EOF;
//line : dataline | textline;
line : dataline | textline | discardline;
dataline: timestamp WS COLON WS text NL ;
textline: ~DIG text NL;
//"new"
discardline: (WS)+ discardtext (text|DIG|PERIOD|COLON|SLASH|WS)* NL
| (WS)* NL;
discardtext: (two_dig| DIG) WS* SLASH;
// two_dig SLASH four_dig;
timestamp: four_dig PERIOD two_dig PERIOD two_dig WS two_dig COLON two_dig COLON two_dig ;
four_dig: DIG DIG DIG DIG;
two_dig: DIG DIG;
//Following is very different
text: CHAR (CHAR|DIG|PERIOD|COLON|SLASH|WS)*;
/* Whitespace */
WS: (' ' | '\t')+ ;
/* New line goes to \r\n or EOF */
NL: '\r'? '\n' ;
/* Digits */
DIG : '0'..'9';
//new lexer rules
CHAR : 'a'..'z'|'A'..'Z';
PERIOD : '.';
COLON : ':';
SLASH : '/' | '\\';
Hopefully that helps you, good luck.

Resources