How to parse a partial date in ANTLR?

I am taking my first steps with antlr4 and am trying to parse a partial date in the European format DD.MM.YYYY.
I want to recognize a normal date like 15.05.2020 or 7.5.20, but also dates that contain only month and year, like 05.2020 or 5.20, and dates that consist only of a year, like 2020 or 20. In my application I want to have access to all parts of a date (day, month and year), where some parts may be empty/null.
Here is my grammar so far.
grammar LogicalDateExpressions;
stmt : date EOF
;
date : (YEAR)
| (MONTH DOT YEAR)
| (DAY DOT MONTH DOT YEAR)
;
YEAR : ([12] [0-9] [0-9] [0-9])
| ([0-9] [0-9])
;
MONTH : ('0'? [1-9])
| ('1' [012])
;
DAY : ('0'? [1-9])
| ([12][0-9])
| ('3'[01])
;
DOT : '.';
WS : [ \t\r\n\u000C]+ -> skip;
This grammar works with a single year (2020) but fails to recognize a month-year combination (05.2020). grun -tokens told me the following.
[#0,0:1='05',<YEAR>,1:0]
[#1,2:2='.',<'.'>,1:2]
[#2,3:6='2020',<YEAR>,1:3]
[#3,9:8='<EOF>',<EOF>,2:0]
line 1:2 mismatched input '.' expecting <EOF>
So with my smattering I figured the parser rule date is the problem and I rewrote it to
date : (
(DAY DOT)?
MONTH DOT
)?
YEAR
;
But I still got the same error. Then I thought maybe I need to reorder the lexer rules. So instead of YEAR -> MONTH -> DAY, I wrote them DAY -> MONTH -> YEAR. But grun told me.
[#0,0:1='05',<DAY>,1:0]
[#1,2:2='.',<'.'>,1:2]
[#2,3:6='2020',<YEAR>,1:3]
[#3,9:8='<EOF>',<EOF>,2:0]
line 1:3 mismatched input '2020' expecting MONTH
I also tried to change the order of the or'ed alternatives in the parser rule date, but that did not work either. Then I tried to turn the lexer rules DAY, MONTH, YEAR into parser rules (day, month, year). After getting some errors, because apparently the [0-9] notation is not allowed in parser rules, I changed the grammar to this.
date : (year)
| (month DOT year)
| (day DOT month DOT year)
;
[...]
year : (('1'|'2') DIGIT DIGIT DIGIT)
| (DIGIT DIGIT)
;
month : ('0'? DIGIT_NO_ZERO)
| ('1' ('0'|'1'|'2'))
;
day : ('0'? DIGIT_NO_ZERO)
| (('1'|'2') DIGIT)
| ('3' ('0'|'1'))
;
[...]
DIGIT : [0-9];
DIGIT_NO_ZERO : [1-9];
That was a bummer too. grun told me.
[#0,0:0='0',<'0'>,1:0]
[#1,1:1='5',<DIGIT>,1:1]
[#2,2:2='.',<'.'>,1:2]
[#3,3:3='2',<'2'>,1:3]
[#4,4:4='0',<'0'>,1:4]
[#5,5:5='2',<'2'>,1:5]
[#6,6:6='0',<'0'>,1:6]
[#7,9:8='<EOF>',<EOF>,2:0]
line 1:1 no viable alternative at input '05'
As far as I understand the language I am looking for is a regular one. And every input is unambiguous. So I tried to get the whole "logic" into the lexer and I succeeded with the following grammar.
grammar LogicalDateExpressions;
stmt : date EOF
;
date : DT
;
DT : (
((('0'? [1-9])|([12][0-9])|('3'[01])) DOT)? // Day
(('0'? [1-9])|('1' [012])) DOT // Month
)?
((DIGIT DIGIT DIGIT DIGIT)|(DIGIT DIGIT)) // Year
;
DIGIT : [0-9];
DOT : '.';
WS : [ \t\r\n\u000C]+ -> skip;
It parses every input I give it. But the problem is that every input is just a DT.
[#0,0:6='05.2020',<DT>,1:0]
[#1,9:8='<EOF>',<EOF>,2:0]
I can not distinguish between the day, the month and the year in a visitor/listener because labels are not allowed in lexer rules.
So my question is where is the problem with the first given grammar and what do I need to change to make it work?
From a look at the token output from grun I think I might grasp the problem: each input for a day, month and/or year on its own might be ambiguous, but the whole input in conjunction with the dots should not be. How can I tell antlr that?

So my question is where is the problem with the first given grammar and what do I need to change to make it work?
The problem is that the lexer is not driven by the parser. When the parser tries to match the tokens DAY DOT MONTH and the input is 01.01, the lexer will not create a DAY and a MONTH for these two 01's; it creates two tokens of the same type. This is how ANTLR's lexer works: it grabs as many characters as possible for a token, and when two or more rules match the same characters (01 can be matched by YEAR, MONTH and DAY alike), the rule defined first "wins". In your first grammar that is YEAR, which is why grun reports 05 as a YEAR. There is no way around this.
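The tie-breaking behavior described above can be imitated outside ANTLR. The following is only a rough Python sketch of "longest match first, then first-defined rule", not ANTLR's actual implementation; the regexes are transcribed from the question's first grammar, and the helper names longest_match and tokenize are made up for illustration.

```python
import re

# Token patterns transcribed from the first grammar in the question.
PATTERNS = {
    "YEAR": r"[12][0-9]{3}|[0-9]{2}",
    "MONTH": r"0?[1-9]|1[012]",
    "DAY": r"0?[1-9]|[12][0-9]|3[01]",
    "DOT": r"\.",
}

def longest_match(pattern, text, pos):
    """Length of the longest prefix of text[pos:] that the pattern matches."""
    for end in range(len(text), pos, -1):
        if re.fullmatch(pattern, text[pos:end]):
            return end - pos
    return 0

def tokenize(text, rule_order):
    """Longest match wins; on a tie the rule listed first wins."""
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name in rule_order:
            length = longest_match(PATTERNS[name], text, pos)
            if length > best_len:  # strict '>' keeps the earlier rule on ties
                best_name, best_len = name, length
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens

print(tokenize("05.2020", ["YEAR", "MONTH", "DAY", "DOT"]))
print(tokenize("05.2020", ["DAY", "MONTH", "YEAR", "DOT"]))
```

With YEAR listed first, 05 comes out as a YEAR; with DAY listed first, it comes out as a DAY. Both agree with the two grun token dumps in the question: the parser never gets a say in which type is chosen.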
What you could do is something like this (untested):
date
: year
| month DOT year
| day DOT month DOT year
;
day
: N_01_12
| N_13_31
;
month
: N_01_12
;
year
: N_01_12
| N_13_31
| N_32_99
| N_1000_2999
;
N_01_12
: '0'? [1-9] // 1-9 and 01-09
| '1' [0-2] // 10-12
;
N_13_31
: '1' [3-9] // 13-19
| '2' D // 20-29
| '3' [01] // 30-31
;
N_32_99
: '3' [2-9] // 32-39
| [4-9] D // 40-99
;
N_1000_2999
: [12] D D D // 1000-2999
;
fragment D : [0-9];
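To see how the non-overlapping range tokens give access to day, month and year, here is a small Python sketch of the same idea. It is an illustration under assumptions, not generated ANTLR code: the regexes follow the ranges named in the rule comments (N_01_12 is taken as 1-9, 01-09 and 10-12), and token_type and parse_date are hypothetical helper names.

```python
import re

# One regex per range token, following the ranges in the rule comments.
TOKEN_RULES = [
    ("N_01_12", r"0?[1-9]|1[0-2]"),
    ("N_13_31", r"1[3-9]|2[0-9]|3[01]"),
    ("N_32_99", r"3[2-9]|[4-9][0-9]"),
    ("N_1000_2999", r"[12][0-9]{3}"),
]

# Which tokens each parser rule (day, month, year) accepts.
DAY_TOKENS = {"N_01_12", "N_13_31"}
MONTH_TOKENS = {"N_01_12"}
YEAR_TOKENS = {"N_01_12", "N_13_31", "N_32_99", "N_1000_2999"}

def token_type(text):
    """The ranges do not overlap, so at most one rule can match."""
    for name, pattern in TOKEN_RULES:
        if re.fullmatch(pattern, text):
            return name
    raise ValueError(f"no token matches {text!r}")

def parse_date(text):
    """Return (day, month, year); parts missing from the input are None."""
    parts = text.strip().split(".")
    if not 1 <= len(parts) <= 3:
        raise ValueError(f"not a date: {text!r}")
    # Align from the right: the last part is always the year.
    allowed = [DAY_TOKENS, MONTH_TOKENS, YEAR_TOKENS][-len(parts):]
    for part, ok in zip(parts, allowed):
        if token_type(part) not in ok:
            raise ValueError(f"{part!r} is out of range in {text!r}")
    values = [None] * (3 - len(parts)) + [int(p) for p in parts]
    return tuple(values)

for sample in ["15.05.2020", "7.5.20", "05.2020", "5.20", "2020", "20"]:
    print(sample, parse_date(sample))
```

In the ANTLR version the same information is available from the parse tree: a visitor or listener can check whether the day and month sub-rules are present in the date context and treat absent ones as null.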

Related

ANTLR4 parsing - tackling difficult syntax

I am parsing a language which has some difficult syntax that I need some help or suggestions to tackle it.
Here's a typical line:
IF CLCI((ZNTEM+CHRCNT),1,1H())) EQ 0
The difficult bit is nH(....any character within the character set.....) where n is 1 in this example and the single char in question is a ')' . My lexer has:
fragment Lp: '(';
fragment Rp: ')';
LP: Lp;
RP: Rp; etc....
My current non-working solution is to switch modes in the lexer, because then I can define all the special chars to consume:
// Default mode rules
STRING_SUB: INT 'H' LP -> pushMode(ISLAND) ; // switch to ISLAND mode
and then to switch back
// Special mode of INT H ( ID )
// ID is the string substitute which can includes, spaces, backslash, etc, special chars
mode ISLAND;
ISLAND_CLOSE : RP -> popMode ; // back to normal mode
ID : SpecialChars+; // match/send ID in tag to parser
fragment SpecialChars: '\u0020'..'\u0027' | '\u002A'..'\u0060' | '\u0061'..'\u007E' | '¦';
But obviously the pop mode trigger is the ')' which fails in the particular example case, because the payload is a RP. Any suggestions?
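One way to look at the difficulty: if n in nH(...) really is the number of payload characters (an assumption based on the example, where n is 1 and the payload is a single ')'), then the end of the payload is determined by counting, not by the next ')'. A lexer mode on its own has no counter, so the counting would presumably have to happen in target-language code. The Python sketch below only illustrates that counting step; read_counted_literal is a hypothetical helper, not part of any ANTLR API.

```python
import re

# Prefix of a counted literal: a decimal count, 'H', and an opening paren.
PREFIX = re.compile(r"(\d+)H\(")

def read_counted_literal(text, pos):
    """Read nH(payload) starting at pos.

    Returns (payload, next_pos), or None if no literal starts at pos.
    Assumes n is the exact number of payload characters.
    """
    m = PREFIX.match(text, pos)
    if m is None:
        return None
    count = int(m.group(1))
    start = m.end()
    end = start + count
    # The character right after the counted payload must be the closing paren.
    if end >= len(text) or text[end] != ")":
        raise ValueError("payload is not followed by ')'")
    return text[start:end], end + 1

line = "IF CLCI((ZNTEM+CHRCNT),1,1H())) EQ 0"
payload, next_pos = read_counted_literal(line, line.index("1H("))
print(repr(payload), repr(line[next_pos:]))
```

Because the payload is taken by length, a ')' inside it is never mistaken for the closing parenthesis.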

ANTLR: How to give precedence to parser rule alternative

So I am working with an ANTLR grammar for parsing dates, and I want to be able to recognize not just individual date-units, but also pairs of date-units.
For the purposes of this question, I think it might be helpful to divide the kinds of questions I want to be able to recognize into 3 classes:
What was the temperature in August 2019? - Straightforward. Single date-unit (August 2019).
Which was hotter between June 3rd 2019 and yesterday? - Still straightforward. Two date-units (June 3, 2019 and yesterday).
Between August 2018 and 2019, which was hotter? - Tricky. The natural expectation of the user in this case would be to compare August 2018 and August 2019 (implicitly). To handle such cases, I want 2018 and 2019 to be parsed as a single year_pair rule and August to be parsed as a month.
I am currently handling only cases 1 and 2. Case 1 is handled in a straightforward way. Case 2 is handled by a date_unit AND date_unit rule. To handle case 3, I tried adding a year AND year rule so that 2018 and 2019 is picked up as a year_pair first, but due to the top-down nature of ANTLR, the input is still parsed as August 2018 and 2019.
How can I go about changing this such that it parses August 2018 and 2019 into August and 2018 and 2019 instead (while also retaining the general date_unit AND date_unit rule)?
You are trying to add semantics to syntax. From a language standpoint the implicit user expectation doesn't matter at all. The parser (as a syntax tool) can only determine whether input conforms to a language, not whether the input also satisfies semantic rules.
Instead you should use ANTLR4 to quantify your input and create the parse tree. Then in a second step do the semantic analysis where you can apply your special date rules (e.g. auto fill implicit date parts).
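As a rough illustration of such a second step, the Python sketch below takes the two date units that the plain date_unit AND date_unit rule would produce and fills in the implicit month afterwards. The DateUnit class and fill_implicit_month function are hypothetical names, and the rule itself (copy the month when the first unit has a month but no day and the second is a bare year) is just one possible reading of the user's expectation.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple

@dataclass(frozen=True)
class DateUnit:
    """What the visitor could build from one date_unit subtree."""
    year: int
    month: Optional[int] = None
    day: Optional[int] = None

def fill_implicit_month(first: DateUnit, second: DateUnit) -> Tuple[DateUnit, DateUnit]:
    """Semantic rule for 'August 2018 and 2019': the bare year inherits the month.

    Applies only when the first unit is a month-year pair (no day) and the
    second unit is a year alone; otherwise both units are returned unchanged.
    """
    if first.month is not None and first.day is None and second.month is None:
        second = replace(second, month=first.month)
    return first, second

# 'Between August 2018 and 2019' parsed as (August 2018) AND (2019)
print(fill_implicit_month(DateUnit(2018, 8), DateUnit(2019)))
# 'between June 3rd 2019 and 2018' is left alone under this rule
print(fill_implicit_month(DateUnit(2019, 6, 3), DateUnit(2018)))
```

Keeping this logic out of the grammar means the syntax stays unambiguous, and the rule can be changed later without regenerating the parser.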
"Bottom-up" is a term that's synonymous for LR parsing for decades and has nothing to do with ANTLR nor the problem. It's the wrong term.
Mike's solution above is what most people would do because a date_range corresponds to just a Tuple<date_unit, date_unit>, and one would just be creating that type in the semantic analyzer. You want to describe a different range, something like Tuple<month, Tuple<year, year>> and other variations syntactically. Here is a grammar that does that. It produces the trees you are looking for, for all three of your examples.
grammar Dates;
MONTH : 'January' | 'February' | 'March' | 'April' | 'May' | 'June' | 'July' | 'August' | 'September' | 'October' | 'November' | 'December' ;
YESTERDAY : 'yesterday' ;
FIRST : 'First';
SECOND : 'Second';
THIRD : 'Third';
AND : 'and' ;
BETWEEN : 'between';
ORDINAL: [1-9][0-9]* ('rd' | 'th');
CARDINAL : [0-9]+ ;
WS: [ \t\r\n]+ -> skip;
// NB: Note order here.
range
: BETWEEN month year_group
| BETWEEN date_unit AND date_unit
;
input: ( date_unit | range ) EOF ;
year_group : year AND year ;
date_unit : month day year | month year | year | yesterday ;
day : ordinal | CARDINAL ;
ordinal : ORDINAL | FIRST | SECOND | THIRD ;
month : MONTH ;
year : CARDINAL ;
yesterday : YESTERDAY ;

ANTLR Grammar for parsing a text file

I'm going crazy trying to write a parser grammar with ANTLR.
I've got a plain text file like:
Diagram : VW 503 FSX 09/02/2015 12/02/2015 STP
Fleet : AAAA
OFF :
AAA 05+44 5R06
KKK 05+55 06.04 1R06 5530
ZZZ 06.24 06.30 1R06 5530
YYY 07.53 REVRSE
YYY 08.23 9G98 5070
WORKS :
MILES :(LD) 1288.35 (ETY) 3.18 (TOT) 1291.53
Each "Diagram" entity is contained between "Diagram :" and the "(TOT)" before EOF.
Multiple "Diagram" entities can be present in the same plain text file.
I've done some tests with ANTLR:
grammar Hello2;
xxxt : diagram+;
diagram : DIAGRAM_ini txt fleet LEGS+ DIAGRAM_end;
txt : TEXT;
fleet : FLEET_INI txt;
num : NUMBER;
// Lexer Rules
DIAGRAM_ini : 'Diagram :';
DIAGRAM_end : '(TOT)' ;
LEGS : ('AAA' | 'KKK' | 'ZZZ' | 'YYY') ;
FLEET_INI : 'Fleet :';
TEXT : ('a'..'z')+ ;
NUMBER: ('0'..'9') ;
WHITESPACE : ( '\t' | ' ' | '\r' | '\n'| '\u000C' )+ -> skip ;
My goal is to be able to parse Diagrams recursively, and gather all LEGS text/number.
Any help/tips is much more than appreciated!
Many Thanks
Regs
S.
I suggest not parsing the file like you did. This file does not define a language with words and grammar, but rather a formatted text of chars:
The formatting conventions are rather weak
The labels before the colon cannot serve as tokens since they may reappear in the body (AAA (=label) vs AAAA (=body))
The tokens must be very primitive to fit these requirements
Solution with ANTLR
You need a weaker grammar to solve this problem, e.g.
grammar diagrams;
diagrams : diagram+ ;
diagram : section+ ;
section : WORD ':' body? ;
body : textline+;
textline : (WORD | NUMBER | SIGNS)* ('\r' | '\n')+;
WORD : LETTER+ ;
NUMBER : DIGIT+ ;
SIGNS : SIGN+ ;
WHITESPACE : ( '\t' | ' ' )+ -> skip ;
fragment LETTER : ('a'..'z' | 'A'..'Z') ;
fragment SIGN : ('.'|'+'|'('|')'|'/') ;
fragment DIGIT : ('0'..'9') ;
Run a visitor on the Parsing result
to build up the normalized text of body
to filter out the LEGS lines out of the body
to parse a LEGS line with another parser (a regexp-parser would be sufficient here, but you could also define another ANTLR-Parser)
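The last step could be as small as the following Python sketch. The field layout is guessed from the sample file only: it assumes a LEGS line starts with a three-letter uppercase code and that times look like 05+44 or 06.04; everything else is kept as an uninterpreted field. parse_leg is a hypothetical helper name.

```python
import re

# A LEGS line: three uppercase letters, then at least one more field.
# The body must not start with ':' so that section labels like 'OFF :' are skipped.
LEG_LINE = re.compile(r"^\s*(?P<code>[A-Z]{3})\s+(?P<rest>[^:\s].*?)\s*$")
# Assumed time format: two digits, '.' or '+', two digits.
TIME = re.compile(r"\d{2}[.+]\d{2}")

def parse_leg(line):
    """Split one LEGS line into code, time fields and remaining fields."""
    m = LEG_LINE.match(line)
    if m is None:
        return None  # not a LEGS line
    fields = m.group("rest").split()
    return {
        "code": m.group("code"),
        "times": [f for f in fields if TIME.fullmatch(f)],
        "other": [f for f in fields if not TIME.fullmatch(f)],
    }

body = [
    "OFF :",
    "AAA 05+44 5R06",
    "KKK 05+55 06.04 1R06 5530",
    "YYY 07.53 REVRSE",
]
for line in body:
    print(parse_leg(line))
```

Lines that are not LEGS lines come back as None, so the visitor can simply drop them.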
Another alternative:
Try out Packrat parsing (e.g. parboiled)
it is more comprehensible (especially for people with little experience in compiler construction)
it is a better match for your grammar design
parboiled is pure java (grammar specified in java)
Disadvantages:
Whitespace handling must be done in Parser Rules
Debugging/Error Messages are a problem (with all packrat parsers)

Grammar for a recognizer of a spice-like language

I'm trying to build a grammar for a recognizer of a spice-like language using Antlr-3.1.3 (I use this version because of the Python target). I don't have experience with parsers. I've found a master's thesis where the student has done the syntactic analysis of the SPICE 2G6 language and built a parser using the LEX and YACC compiler writing tools. (http://digitool.library.mcgill.ca/R/?func=dbin-jump-full&object_id=60667&local_base=GEN01-MCG02) In chapter 4, he describes a grammar in Backus-Naur form for the SPICE 2G6 language, and appends to the work the LEX and YACC code files of the parser.
I'm building on this work to create a simpler grammar for a recognizer of a more restrictive spice language.
I read the Antlr manual, but could not figure out how to solve two problems, that the code snippet below illustrates.
grammar Najm_teste;
resistor
: RES NODE NODE VALUE 'G2'? COMMENT? NEWLINE
;
// START:tokens
RES : ('R'|'r') DIG+;
NODE : DIG+; // non-negative integer
VALUE : REAL; // non-negative real
fragment
SIG : '+'|'-';
fragment
DIG : '0'..'9';
fragment
EXP : ('E'|'e') SIG? DIG+;
fragment
FLT : (DIG+ '.' DIG+)|('.' DIG+)|(DIG+ '.');
fragment
REAL : (DIG+ EXP?)|(FLT EXP?);
COMMENT : '%' ( options {greedy=false;} : . )* NEWLINE;
NEWLINE : '\r'? '\n';
WS : (' '|'\t')+ {$channel=HIDDEN;};
// END:tokens
In the grammar above, the token NODE is a subset of the set represented by the VALUE token. The grammar correctly interprets an input like "R1 5 0 1.1\n", but cannot interpret an input like "R1 5 0 1\n", because it maps "1" to the token NODE instead of VALUE, as NODE comes before VALUE in the tokens section. Given such inputs, does anyone have an idea of how I can map the "1" to the correct token VALUE, or a suggestion of how I can alter the grammar so that it correctly interprets the input?
The second problem is the presence of a comment at the end of a line. The NEWLINE token delimits both (1) the end of a comment and (2) the end of a line of code. When I include a comment at the end of a line of code, two newline characters are necessary for the parser to correctly recognize the line of code; otherwise, just one newline character is necessary. How could I improve this?
Thanks!
Problem 1
The lexer does not "listen" to the parser. The lexer simply creates tokens that contain as many characters as possible. In case two tokens match the same number of characters, the token defined first will "win". In other words, "1" will always be tokenized as a NODE, even though the parser is trying to match a VALUE.
You can do something like this instead:
resistor
: RES NODE NODE value 'G2'? COMMENT? NEWLINE
;
value : NODE | REAL;
// START:tokens
RES : ('R'|'r') DIG+;
NODE : DIG+;
REAL : (DIG+ EXP?) | (FLT EXP?);
...
I.e., I removed VALUE, added value, and removed fragment from REAL.
Problem 2
Do not let the comment match the line break:
COMMENT : '%' ~('\r' | '\n')*;
where ~('\r' | '\n')* matches zero or more chars other than line break characters.
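The effect can be checked with an ordinary regex tokenizer. The Python sketch below is a simplified stand-in for the lexer, not the ANTLR grammar itself: the WORD pattern lumps RES, NODE and REAL together, and the tokenize helper is made up for illustration.

```python
import re

# COMMENT stops before the line break, so NEWLINE is always a separate token.
TOKEN = re.compile(
    r"(?P<COMMENT>%[^\r\n]*)"
    r"|(?P<NEWLINE>\r?\n)"
    r"|(?P<WS>[ \t]+)"
    r"|(?P<WORD>[^\s%]+)"
)

def tokenize(line):
    """Return (type, text) pairs, dropping whitespace like the hidden channel."""
    return [(m.lastgroup, m.group())
            for m in TOKEN.finditer(line)
            if m.lastgroup != "WS"]

print(tokenize("R1 5 0 1 % load resistor\n"))
print(tokenize("R1 5 0 1\n"))
```

In both cases the last token is a single NEWLINE, so a line with a trailing comment no longer needs a second line break.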

ANTLR3: set the number of accepted characters for a token

I have to create a lexer which will accept, for example, an integer only if it has a maximum of 8 digits. Is there an alternative to writing it out like this?
INTEGER : (DIG | DIG DIG | DIG DIG DIG | ...)
This can be done using a gated semantic predicate like this:
INTEGER
@init{int n = 1;}
: ({n <= 8}?=> DIGIT {n++;})+
;
fragment DIGIT : '0'..'9';
For details about this kind of predicate, see: What is a 'semantic predicate' in ANTLR?
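The counting idea behind the predicate can be written out as a plain loop. This Python sketch mirrors the counter n from the rule above (start at 1, consume a digit only while n <= 8); it is not the code ANTLR generates, and scan_integer is a hypothetical name.

```python
DIGITS = "0123456789"

def scan_integer(text, pos=0, max_digits=8):
    """Consume at most max_digits digits starting at pos; None if there are none."""
    n = 1  # same role as the counter initialised in the rule's init block
    end = pos
    # The 'n <= max_digits' check plays the part of the gated predicate.
    while end < len(text) and text[end] in DIGITS and n <= max_digits:
        end += 1
        n += 1
    return text[pos:end] or None

print(scan_integer("42"))
print(scan_integer("123456789"))
```

For a nine-digit input the loop stops after eight digits, so the ninth digit is left over for whatever comes next. Whether that leftover should start a new token or be reported as an error is a separate decision.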

Resources