So I am working with an ANTLR grammar for parsing dates, and I want to be able to recognize not just individual date-units, but also pairs of date-units.
For the purposes of this question, I think it might be helpful to divide the kinds of questions I want to be able to recognize into 3 classes:
1. What was the temperature in August 2019? - Straightforward. Single date-unit (August 2019).
2. Which was hotter between June 3rd 2019 and yesterday? - Still straightforward. Two date-units (June 3, 2019 and yesterday).
3. Between August 2018 and 2019, which was hotter? - Tricky. The natural expectation of the user in this case would be to compare August 2018 and August 2019 (implicitly). To handle such cases, I want 2018 and 2019 to be parsed as a single year_pair rule and August to be parsed as a month.
I am currently handling only cases 1 and 2. Case 1 is handled in a straightforward way. Case 2 is handled by having a date_unit AND date_unit rule. To handle Case 3, I also tried adding a year AND year rule, so that 2018 and 2019 is picked up as a year_pair earlier, but due to the top-down nature of ANTLR, it still parses the input into August 2018 and 2019.
How can I change this so that it parses August 2018 and 2019 into August and 2018 and 2019 instead (while also retaining the general date_unit AND date_unit rule)?
You are trying to add semantics to syntax. From a language standpoint, the implicit user expectation doesn't matter at all: the parser, as a syntax tool, can only determine whether the input conforms to a language, not whether the input also satisfies semantic rules.
Instead, you should use ANTLR4 to parse your input and create the parse tree. Then, in a second step, do the semantic analysis, where you can apply your special date rules (e.g. automatically filling in implicit date parts).
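As a rough illustration of that second step, here is a minimal Python sketch. It assumes the parse tree has already been walked into two simple records; the DateUnit class and the fill_implicit function are hypothetical names invented for this example, not part of ANTLR.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class DateUnit:
    """Hypothetical record built from one date_unit subtree."""
    year: int
    month: Optional[int] = None
    day: Optional[int] = None


def fill_implicit(left: DateUnit, right: DateUnit) -> Tuple[DateUnit, DateUnit]:
    """If the right side is a bare year, borrow month and day from the left."""
    if right.month is None and left.month is not None:
        right = replace(right, month=left.month, day=left.day)
    return left, right


# "August 2018 and 2019" parsed as (August 2018) AND (2019)
left, right = fill_implicit(DateUnit(2018, 8), DateUnit(2019))
print(left, right)
```

The grammar stays simple (date_unit AND date_unit), and the "August is implied" rule lives in ordinary code where it is easy to change.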
"Bottom-up" is a term that's synonymous for LR parsing for decades and has nothing to do with ANTLR nor the problem. It's the wrong term.
Mike's solution above is what most people would do because a date_range corresponds to just a Tuple<date_unit, date_unit>, and one would just be creating that type in the semantic analyzer. You want to describe a different range, something like Tuple<month, Tuple<year, year>> and other variations syntactically. Here is a grammar that does that. It produces the trees you are looking for, for all three of your examples.
grammar Dates;
MONTH : 'January' | 'February' | 'March' | 'April' | 'May' | 'June' | 'July' | 'August' | 'September' | 'October' | 'November' | 'December' ;
YESTERDAY : 'yesterday' ;
FIRST : 'First';
SECOND : 'Second';
THIRD : 'Third';
AND : 'and' ;
BETWEEN : 'between';
ORDINAL: [1-9][0-9]* ('st' | 'nd' | 'rd' | 'th');
CARDINAL : [0-9]+ ;
WS: [ \t\r\n]+ -> skip;
// NB: Note order here.
range
: BETWEEN month year_group
| BETWEEN date_unit AND date_unit
;
input: ( date_unit | range ) EOF ;
year_group : year AND year ;
date_unit : month day year | month year | year | yesterday ;
day : ordinal | CARDINAL ;
ordinal : ORDINAL | FIRST | SECOND | THIRD ;
month : MONTH ;
year : CARDINAL ;
yesterday : YESTERDAY ;
I am taking my first steps with antlr4 and am trying to parse a partial date in European format DD.MM.YYYY.
I want to recognize a normal date like 15.05.2020 or 7.5.20, but also dates which only contain month and year like 05.2020 or 5.20, and in addition dates that consist only of a year like 2020 or 20. In my application I want to have access to all parts of a date (day, month and year), where some parts may be empty/null.
Here is my grammar so far.
grammar LogicalDateExpressions;
stmt : date EOF
;
date : (YEAR)
| (MONTH DOT YEAR)
| (DAY DOT MONTH DOT YEAR)
;
YEAR : ([12] [0-9] [0-9] [0-9])
| ([0-9] [0-9])
;
MONTH : ('0'? [1-9])
| ('1' [012])
;
DAY : ('0'? [1-9])
| ([12][0-9])
| ('3'[01])
;
DOT : '.';
WS : [ \t\r\n\u000C]+ -> skip;
This grammar works with a single year (2020) but fails to recognize a month-year combination (05.2020). grun -tokens told me the following.
[#0,0:1='05',<YEAR>,1:0]
[#1,2:2='.',<'.'>,1:2]
[#2,3:6='2020',<YEAR>,1:3]
[#3,9:8='<EOF>',<EOF>,2:0]
line 1:2 mismatched input '.' expecting <EOF>
So with my smattering I figured the parser rule date is the problem and I rewrote it to
date : (
(DAY DOT)?
MONTH DOT
)?
YEAR
;
But I still got the same error. Then I thought maybe I need to reorder the lexer rules. So instead of YEAR -> MONTH -> DAY, I wrote them DAY -> MONTH -> YEAR. But grun told me.
[#0,0:1='05',<DAY>,1:0]
[#1,2:2='.',<'.'>,1:2]
[#2,3:6='2020',<YEAR>,1:3]
[#3,9:8='<EOF>',<EOF>,2:0]
line 1:3 mismatched input '2020' expecting MONTH
I also tried to change the order of the or'ed alternatives in the parser rule date, but that did not work out either. Then I tried to change the lexer rules DAY, MONTH, YEAR to make them parser rules (day, month, year). After getting some errors because apparently the [0-9] notation is not allowed in parser rules, I changed the grammar to this.
date : (year)
| (month DOT year)
| (day DOT month DOT year)
;
[...]
year : (('1'|'2') DIGIT DIGIT DIGIT)
| (DIGIT DIGIT)
;
month : ('0'? DIGIT_NO_ZERO)
| ('1' ('0'|'1'|'2'))
;
day : ('0'? DIGIT_NO_ZERO)
| (('1'|'2') DIGIT)
| ('3' ('0'|'1'))
;
[...]
DIGIT : [0-9];
DIGIT_NO_ZERO : [1-9];
That was a bummer too. grun told me.
[#0,0:0='0',<'0'>,1:0]
[#1,1:1='5',<DIGIT>,1:1]
[#2,2:2='.',<'.'>,1:2]
[#3,3:3='2',<'2'>,1:3]
[#4,4:4='0',<'0'>,1:4]
[#5,5:5='2',<'2'>,1:5]
[#6,6:6='0',<'0'>,1:6]
[#7,9:8='<EOF>',<EOF>,2:0]
line 1:1 no viable alternative at input '05'
As far as I understand the language I am looking for is a regular one. And every input is unambiguous. So I tried to get the whole "logic" into the lexer and I succeeded with the following grammar.
grammar LogicalDateExpressions;
stmt : date EOF
;
date : DT
;
DT : (
((('0'? [1-9])|([12][0-9])|('3'[01])) DOT)? // Day
(('0'? [1-9])|('1' [012])) DOT // Month
)?
((DIGIT DIGIT DIGIT DIGIT)|(DIGIT DIGIT)) // Year
;
DIGIT : [0-9];
DOT : '.';
WS : [ \t\r\n\u000C]+ -> skip;
It parses every input I give it. But the problem is that every input is just a DT.
[#0,0:6='05.2020',<DT>,1:0]
[#1,9:8='<EOF>',<EOF>,2:0]
I can not distinguish between the day, the month and the year in a visitor/listener because labels are not allowed in lexer rules.
So my question is where is the problem with the first given grammar and what do I need to change to make it work?
From a look at the token output from grun, I think I might grasp the problem: every input for a day, month and/or year might be ambiguous on its own, but the whole input in conjunction with the dots should not be. How can I tell antlr that?
The problem is that the lexer is not driven by the parser. What this means is that when the parser tries to match the tokens DAY DOT MONTH and the input is 01.01, the lexer will not create a DAY and a MONTH for these two 01's; it will produce the same token type for both. This is how ANTLR's lexer works: it tries to grab as many characters as possible for a token, and when 2 or more token rules match the same characters (01 can be matched by DAY, by MONTH and by the two-digit alternative of YEAR), the rule defined first "wins". In your first grammar that is YEAR, which is why grun reports 05 as a YEAR token; if only DAY and MONTH competed, MONTH would win. There is no way around this.
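The tie-breaking behaviour can be imitated in a few lines of Python. This is only a sketch of the "longest match, then first rule" policy, not ANTLR's actual implementation; the regexes approximate the lexer rules from the question, and tokenize is a name made up for the example.

```python
import re

# Approximations of the question's lexer rules. Python alternation is ordered
# rather than longest-match, so the two-digit alternatives are listed first.
YEAR = ("YEAR", r"[12][0-9]{3}|[0-9]{2}")
MONTH = ("MONTH", r"1[012]|0?[1-9]")
DAY = ("DAY", r"[12][0-9]|3[01]|0?[1-9]")
DOT = ("DOT", r"\.")


def tokenize(text, rules):
    """Longest match wins; on a tie the rule listed first wins."""
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            # Strict '>' keeps the earlier rule when lengths are equal.
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


print(tokenize("05.2020", [YEAR, MONTH, DAY, DOT]))  # YEAR listed first
print(tokenize("05.2020", [DAY, MONTH, YEAR, DOT]))  # DAY listed first
```

With YEAR first, 05 comes out as YEAR; with DAY first, it comes out as DAY. In neither ordering does the lexer look at what the parser expects.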
What you could do is something like this (untested):
date
: year
| month DOT year
| day DOT month DOT year
;
day
: N_01_12
| N_13_31
;
month
: N_01_12
;
year
: N_01_12
| N_13_31
| N_32_99
| N_1000_2999
;
N_01_12
: '0'? D // 0-9 and 00-09 (D also admits 0; use [1-9] to exclude it)
| '1' [0-2] // 10-12
;
N_13_31
: '1' [3-9] // 13-19
| '2' D // 20-29
| '3' [01] // 30-31
;
N_32_99
: '3' [2-9] // 32-39
| [4-9] D // 40-99
;
N_1000_2999
: [12] D D D // 1000-2999
;
fragment D : [0-9];
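The idea behind this grammar is that a number only gets its meaning (day, month or year) from its position in the parser rule, followed by a range check. The same idea can be sketched outside ANTLR in Python; parse_partial_date is a hypothetical helper, and unlike the grammar it does not restrict the number of digits in the year.

```python
from typing import Optional, Tuple


def parse_partial_date(text: str) -> Tuple[Optional[int], Optional[int], int]:
    """Return (day, month, year); missing leading parts are None."""
    parts = text.strip().split(".")
    if not 1 <= len(parts) <= 3 or not all(p.isdecimal() for p in parts):
        raise ValueError(f"not a partial date: {text!r}")
    # Right-align the parts: the last one is always the year.
    padded = [None] * (3 - len(parts)) + [int(p) for p in parts]
    day, month, year = padded
    if month is not None and not 1 <= month <= 12:
        raise ValueError(f"month out of range: {month}")
    if day is not None and not 1 <= day <= 31:
        raise ValueError(f"day out of range: {day}")
    return day, month, year


for sample in ("15.05.2020", "05.2020", "20"):
    print(sample, parse_partial_date(sample))
```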
I am currently implementing part of the Decaf (programming language) grammar. Here is the relevant snippet of bison code:
type:
INT
| ID
| type LS RS
;
local_var_decl:
type ID SEMICOLON
;
name:
THIS
| ID
| name DOT ID
| name LS expression RS
;
However, as soon as I started working on the name production rule, my parser started giving a reduce-reduce warning.
Here is what's inside the .output file (generated by bison):
State 84
23 type: ID .
61 name: ID .
ID reduce using rule 23 (type)
LS reduce using rule 23 (type)
LS [reduce using rule 61 (name)]
$default reduce using rule 61 (name)
So, if we give the following input { abc[1] = abc; }, it reports syntax error, unexpected NUMBER, expected RS. The NUMBER comes from the expression rule (which is how the input should have been parsed), but the parser tries to parse it through the local_var_decl rule instead.
What do you think should be changed in order to solve this problem? Spent around 2 hours, tried different stuff, did not work.
Thank you!!
PS. Here is the link to the full .y source code
This is a specific instance of a common problem where the parser is being forced to make a decision before it has enough information. In some cases, such as this one, the information needed is not far away, and it would be sufficient to increase the lookahead, if that were possible. (Unfortunately, few parser generators produce LR(k) parsers with k > 1, and bison is no exception.) The usual solution is to simply allow the parse to continue without having to decide. Another solution, with bison (but only in C mode) is to ask for a %glr-parser, which is much more flexible about when reductions need to be resolved at the cost of additional processing time.
In this case, the context allows either a type or a name, both of which can start with an ID followed by a [ (LS). In the case of a name, the [ must be followed by an expression (a number, in your example); in the case of a type, the [ must be followed by a ]. So if we could see the second token after the ID, we could immediately decide.
But we can only see one token ahead, which is the [. And the grammar insists that we be able to make an immediate decision because in one case we must reduce the ID to a name and in the other case, to a type. So we have a reduce-reduce conflict, which bison resolves by always using whichever reduction comes first in the grammar file.
One solution is to avoid forcing this choice, at the cost of duplicating productions. For example:
type_other:
INT
| ID LS RS
| type_other LS RS
;
type: ID
| type_other
;
name_other:
THIS
| ID LS expression RS
| name_other DOT ID
| name_other LS expression RS
;
name: ID
| name_other
;
I found in some code I maintain they used this format for an update query
UPDATE X=to_date('$var','%iY-%m-%d %H:%M:%S.%F3') ...
But I can't find anywhere in the Informix documentation what the i is for. Running the following query returns the same value for both formats.
SELECT TO_CHAR(CURRENT, '%Y-%m-%d %H:%M:%S%F3') as wo_I,
TO_CHAR(CURRENT, '%iY-%m-%d %H:%M:%S%F3') as with_I FROM X;
wo_i | with_i
------------------------|------------------------
2017-06-20 16:49:44.712 | 2017-06-20 16:49:44.712
So what am I missing?
Resources I looked into:
https://www.ibm.com/support/knowledgecenter/SSGU8G_11.70.0/com.ibm.sqlt.doc/ids_sqt_130.htm
https://www.ibm.com/support/knowledgecenter/SSGU8G_11.70.0/com.ibm.sqlt.doc/ids_sqt_129.htm
http://www.sqlines.com/informix-to-oracle/to_char_datetime
It's a trifle hard to find, but one location for the information you need (assuming you use Informix 11.70 rather than 12.10, though it probably hasn't changed much) is:
Client APIs and Tools — GLS User's Guide — GLS Environment Variables
In particular, it says:
%iy — Is replaced by the year as a two-digit number (00 - 99) for both reading and printing. It is the formatting directive specific to IBM Informix for %y.
%iY — Is replaced by the year as a four-digit number (0000 - 9999) for both reading and printing. It is the formatting directive specific to IBM Informix for %Y.
…
%y — Requires that the year is a two-digit number (00 through 99) for both reading and printing.
%Y — Requires that the year is a four-digit number (0000 through 9999) for both reading and printing.
There clearly isn't much difference between the two — I'm not even sure I understand what the difference is supposed to be. I think it may be the difference between accepting but not requiring leading zeros on 1, 2 or 3 digit year numbers. But for the most part, it seems you can treat them as equivalent.
I'm having trouble with semantic predicates in ANTLR 4. My grammar is syntactically ambiguous, and needs to look ahead one token to resolve the ambiguity.
As an example, I want to parse "Jan 19, 2012 until 9 pm" as the date "Jan 19, 2012" leaving parser's next token at "until". And I want to parse "Jan 19, 7 until 9 pm" as the date "Jan. 19" with parser's next token at "7".
So I need to look at the 3rd token and either take it or leave it.
My grammar fragment is:
date
: month d=INTEGER { isYear(getCurrentToken().getText())}? y=INTEGER
{//handle date, use $y for year}
| month d=INTEGER {//handle date, use 2013 for year}
;
When the parser runs on either sample input, I get this message:
line 1:9 rule date failed predicate: { isYear(getCurrentToken().getText())}?
It never gets to the 2nd rule alternative, because (I'm guessing) it's already read one extra token.
Can someone show me how to accomplish this?
In parser rules, ANTLR 4 only uses predicates on the left edge when making a decision. Inline predicates like the one you showed above are only validated.
The following modification will cause ANTLR to evaluate the predicate while it makes the decision, but obviously you'll need to modify it to use the correct lookahead token instead of calling getCurrentToken().
date
: {isYear(getCurrentToken().getText())}? month d=INTEGER y=INTEGER
{//handle date, use $y for year}
| month d=INTEGER {//handle date, use 2013 for year}
;
PS: If month is always exactly one token long, then _input.LT(3) should provide the token you want.
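The decision itself (peek at the third token, then either take it or leave it) can be sketched in Python as follows. This only illustrates the lookahead logic, not ANTLR code; is_year is a stand-in for the question's isYear(), and the four-digit test is an assumption about what it checks.

```python
DEFAULT_YEAR = 2013  # fallback year used in the question's second alternative


def is_year(text):
    """Hypothetical stand-in for isYear(): assumes a year has four digits."""
    return text.isdecimal() and len(text) == 4


def parse_date(tokens, pos=0):
    """Return ((month, day, year), index of the next unconsumed token)."""
    month, day = tokens[pos], int(tokens[pos + 1])
    third = tokens[pos + 2] if pos + 2 < len(tokens) else None
    if third is not None and is_year(third):
        return (month, day, int(third)), pos + 3  # consume the year
    return (month, day, DEFAULT_YEAR), pos + 2    # leave the third token


for text in ("Jan 19, 2012 until 9 pm", "Jan 19, 7 until 9 pm"):
    tokens = text.replace(",", " ").split()
    date, nxt = parse_date(tokens)
    print(date, "next token:", tokens[nxt])
```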
I have to create a lexer which will accept, for example, an integer only if it has a maximum of 8 digits. Is there an alternative to writing it out like this?
INTEGER : (DIG | DIG DIG | DIG DIG DIG | ...)
This can be done using a gated semantic predicate (ANTLR 3 syntax) like this:
INTEGER
@init{int n = 1;}
: ({n <= 8}?=> DIGIT {n++;})+
;
fragment DIGIT : '0'..'9';
For details about this kind of predicate, see: What is a 'semantic predicate' in ANTLR?
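The counting logic of that rule can be mimicked in plain Python to see what it does with longer input. This is a sketch of the loop only, under the assumption that the predicate simply ends the token once the limit is reached; lex_integer is a made-up helper name.

```python
def lex_integer(text, pos=0, max_digits=8):
    """Consume at most max_digits digits; return (lexeme, end) or None."""
    n = 1  # mirrors the counter initialised in the rule's init action
    end = pos
    # The 'n <= max_digits' test plays the role of the gated predicate.
    while end < len(text) and text[end] in "0123456789" and n <= max_digits:
        end += 1
        n += 1
    if end == pos:
        return None  # no digit at this position
    return text[pos:end], end


print(lex_integer("123456789"))  # the ninth digit is left for the next token
print(lex_integer("42;"))
```

Note that a ninth digit is not rejected by this approach; it just starts a new token, so an error for over-long numbers would have to come from the parser or a later check.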