I wrote the following combined grammar:
grammar KeywordGrammar;
options{
TokenLabelType = MyToken;
}
//start rule
start: sequence+ EOF;
sequence: keyword filter?;
filter: simpleFilter | logicalFilter | rangeFilter;
logicalFilter: andFilter | orFilter | notFilter;
simpleFilter: lessFilter | greatFilter | equalFilter | containsFilter;
andFilter: simpleFilter AND? simpleFilter;
orFilter: simpleFilter OR simpleFilter;
lessFilter: LESS (DIGIT | FLOAT|DATE);
notFilter: NOT IN? (STRING|ID);
greatFilter: GREATER (DIGIT|FLOAT|DATE);
equalFilter: EQUAL (DIGIT|FLOAT|DATE);
containsFilter: EQUAL (STRING|ID);
rangeFilter: RANGE? DATE DATE? | RANGE? FLOAT FLOAT?;
keyword: ID | STRING;
DATE: DIGIT DIGIT? SEPARATOR MONTH SEPARATOR DIGIT DIGIT (DIGIT DIGIT)?;
MONTH: JAN
| FEV
| MAR
| APR
| MAY
| JUN
| JUL
| AUG
| SEP
| OCT
| NOV
| DEC
;
JAN : 'janeiro'|'jan'|'01'|'1';
FEV : 'fevereiro'|'fev'|'02'|'2';
MAR : 'março'|'mar'|'03'|'3';
APR : 'abril' |'abril'|'04'|'4';
MAY : 'maio'| 'mai'| '05'|'5';
JUN : 'junho'|'jun'|'06'|'6';
JUL : 'julho'|'jul'|'07'|'7';
AUG : 'agosto'|'ago'|'08'|'8';
SEP : 'setembro'|'set'|'09'|'9';
OCT : 'outubro'|'out'|'10';
NOV : 'novembro'|'nov'|'11';
DEC : 'dezembro'|'dez'|'12';
SEPARATOR: '/'|'-';
AND: ('e'|'E');
OR: ('O'|'o')('U'|'u');
NOT: ('N'|'n')('Ã'|'ã')('O'|'o');
IN: ('E'|'e')('M'|'m');
GREATER: '>' | ('m'|'M')('a'|'A')('i'|'I')('o'|'O')('r'|'R') ;
LESS: '<' | ('m'|'M')('e'|'E')('n'|'N')('o'|'O')('r'|'R');
EQUAL: '=' | ('i'|'I')('g'|'G')('u'|'U')('a'|'A')('l'|'L');
RANGE: ('e'|'E')('n'|'N')('t'|'T')('r'|'R')('e'|'E');
FLOAT: DIGIT+ | DIGIT+ POINT DIGIT+;
ID: (LETTER|DIGIT+ SYMBOL) (LETTER|SYMBOL|DIGIT)*;
STRING: '"' ( ESC_SEQ | ~('\\'|'"') )* '"';
DIGIT: [0-9];
WS: (' '
| '\t'
| '\r'
| '\n') -> skip
;
POINT: '.' | ',';
fragment
LETTER: 'A'..'Z'
| 'a'..'z'
| '\u00C0'..'\u00D6'
| '\u00D8'..'\u00F6'
| '\u00F8'..'\u02FF'
| '\u0370'..'\u037D'
| '\u037F'..'\u1FFF'
| '\u200C'..'\u200D'
| '\u2070'..'\u218F'
| '\u2C00'..'\u2FEF'
| '\u3001'..'\uD7FF'
| '\uF900'..'\uFDCF'
| '\uFDF0'..'\uFFFD'
;
fragment
SYMBOL: '-' | '_';
fragment
HEX_DIGIT: ('0'..'9'|'a'..'f'|'A'..'F');
fragment
ESC_SEQ: '\\' ('b'|'t'|'n'|'f'|'r'|'\"'|'\''|'\\')
| UNICODE_ESC
| OCTAL_ESC
;
fragment
OCTAL_ESC: '\\' ('0'..'3') ('0'..'7') ('0'..'7')
| '\\' ('0'..'7') ('0'..'7')
| '\\' ('0'..'7')
;
fragment
UNICODE_ESC: '\\' 'u' HEX_DIGIT HEX_DIGIT HEX_DIGIT HEX_DIGIT;
But a "no viable alternative at input" error occurs, only when I try to parse sentences of the form keyword OPERATOR DIGIT; for example:
filter = 2
filter < 2
filter > 2
With zero as the value, it works!
Where is the error?
Thanks for your help,
Yenier
You have a lot of ambiguity in your lexer rules. What specifically breaks your case is that the digits 1-9 can be matched by both DIGIT and MONTH (via JAN, FEV, etc.). The digit 0 is immune to this problem. Use grun with -tokens to diagnose problems of the sort you encountered:
$ grun KeywordGrammar start -tokens
filter = 0
[#0,0:5='filter',<24>,1:0]
[#1,7:7='=',<21>,1:7]
[#2,9:9='0',<23>,1:9]
[#3,11:10='<EOF>',<-1>,2:0]
$ grun KeywordGrammar start -tokens
filter = 2
[#0,0:5='filter',<24>,1:0]
[#1,7:7='=',<21>,1:7]
[#2,9:9='2',<1>,1:9]
[#3,11:10='<EOF>',<-1>,2:0]
line 1:9 no viable alternative at input '=2'
As you can see, 0 in the first case has token type <23>, while 2 in the second case has token type <1>. Look at your generated KeywordGrammar.tokens:
MONTH=1
JAN=2
...
FLOAT=23
...
So it is not a DIGIT or FLOAT - it is MONTH. As a result, your filter rule does not match. And yes, the order of rules matters: when several lexer rules match the same longest input, ANTLR picks the rule defined first.
Remove the ambiguity from the lexer. Turn months and similar tokens into grammar (parser) rules. There are plenty of other problem spots too: for example, your FLOAT rule makes it impossible for DIGIT to appear standalone, yet you still refer to DIGIT alongside FLOAT in the parser rules. If DIGIT has no significance at the grammar level, make it a fragment and use only FLOAT in parser rules.
And make it a habit to use grun and/or the ANTLR plugins for your IDE to make sure you know what your lexers and parsers actually see.
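The tie-breaking behaviour described above can be imitated in a few lines of Python. This is only a sketch of the general idea (longest match wins, and on a tie the rule listed first wins), not ANTLR's implementation; the `tokenize` helper and the heavily simplified regex versions of MONTH, FLOAT and ID are assumptions made for illustration.

```python
import re

def tokenize(text, rules):
    """Longest match wins; on a tie the rule listed first wins."""
    pos, tokens = 0, []
    while pos < len(text):
        if text[pos].isspace():          # stands in for the WS -> skip rule
            pos += 1
            continue
        best_name, best_len = None, 0
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            # strictly greater: a later rule only wins with a longer match
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens

# Simplified stand-ins for the grammar's rules (numeric month forms only).
MONTH = ("MONTH", r"1[0-2]|0?[1-9]")
EQUAL = ("EQUAL", r"=")
FLOAT = ("FLOAT", r"[0-9]+(?:[.,][0-9]+)?")
ID = ("ID", r"[A-Za-z][A-Za-z0-9_-]*")

MONTH_FIRST = [MONTH, EQUAL, FLOAT, ID]   # MONTH defined before FLOAT
FLOAT_FIRST = [FLOAT, MONTH, EQUAL, ID]   # FLOAT defined before MONTH

print(tokenize("filter = 2", MONTH_FIRST))
print(tokenize("filter = 0", MONTH_FIRST))
print(tokenize("filter = 2", FLOAT_FIRST))
```

With MONTH listed first, the `2` comes out as MONTH while `0` still comes out as FLOAT; moving FLOAT to the front turns `2` into a FLOAT, which mirrors the reordering effect reported in this thread.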
While testing, I saw that the problem disappears when the FLOAT token definition is placed before the DATE definition.
...
FLOAT: DIGIT+ (POINT DIGIT+)?;
DATE: DIGIT DIGIT? SEPARATOR MONTH SEPARATOR DIGIT DIGIT (DIGIT DIGIT)?;
...
I do not know why. Does the order matter?
I'm using ANTLR with Presto grammar in order to parse SQL queries.
I'm having an issue with parsing a decimal number. I've the following definitions:
number
: decimalValue #decimalLiteral
| DOUBLE_VALUE #doubleLiteral
| INTEGER_VALUE #integerLiteral
;
decimalValue
: INTEGER_VALUE '.' INTEGER_VALUE?
| '.' INTEGER_VALUE
;
DOUBLE_VALUE
: DIGIT+ ('.' DIGIT*)? EXPONENT
| '.' DIGIT+ EXPONENT
;
IDENTIFIER
// : (LETTER | '_' | DIGIT) (LETTER | DIGIT | '_' | '#' | ':' | '.')*
: (LETTER | DIGIT | '_' | '#' | ':' | '-' )+
;
This works ok for most cases. However, it has an issue with parsing decimal values.
select x/(0.3-0.2)
from table1
It fails to parse. The reason is that the lexer thinks "3-0" is an identifier.
When I change the query to be something like:
select x/(0.3 - 0.2)
from table1
it works.
Any ideas how I can handle the original query (without, of course, causing a regression)?
Thanks,
Nir.
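The effect described in the question can be reproduced outside ANTLR with two regular expressions. This is a diagnostic sketch only: `INTEGER_VALUE` is assumed to be `DIGIT+`, with `LETTER` as `[A-Za-z]` and `DIGIT` as `[0-9]` (none of these definitions are shown above), and the `candidates` helper is hypothetical.

```python
import re

# IDENTIFIER as defined in the question; the '-' in the class is the culprit.
IDENTIFIER = re.compile(r"[A-Za-z0-9_#:\-]+")
# Assumed definition: INTEGER_VALUE : DIGIT+ ;
INTEGER_VALUE = re.compile(r"[0-9]+")

def candidates(text, pos):
    """Return what each rule would consume when matching starts at pos."""
    return {
        "IDENTIFIER": IDENTIFIER.match(text, pos).group(),
        "INTEGER_VALUE": INTEGER_VALUE.match(text, pos).group(),
    }

# Position 2 is the '3' right after "0." in both inputs.
print(candidates("0.3-0.2", 2))    # {'IDENTIFIER': '3-0', 'INTEGER_VALUE': '3'}
print(candidates("0.3 - 0.2", 2))  # {'IDENTIFIER': '3', 'INTEGER_VALUE': '3'}
```

Because a lexer that prefers the longest match takes the three-character candidate, `3-0` becomes a single IDENTIFIER in the first input. In the spaced input both candidates have the same length, so the longer-match preference no longer favours IDENTIFIER.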
Thanks for taking a look at my question.
I have grammar and lexer rules that I use to parse user input for a grocery list.
The grammar matches sentences such as '10 Pound beef', which has the tokens 'amount unit ware'. The ware token matches any valid Unicode string, but I cannot enter strings that are matched by the unit token, because they are caught by the unit rule. So my question is: can I instruct my lexer to ignore the unit rule after the first match, so that I can input '10 Pound Pound' with the tokens 'amount unit ware' without errors?
Grammar:
grammar Shopping;
import lexerrules;
parse : item EOF ;
item : (amount (SPACE* unit)? SPACE+)? ware | (unit (SPACE* amount)? SPACE+)? ware ;
ware : STRING (SPACE+ STRING)* ;
amount : NUM ;
unit : UNIT ;
Lexer rules:
lexer grammar lexerrules;
NUM : [0-9]+(('.'|',')[0-9]+)? ;
UNIT : WEIGHT | LENGTH | VOLUME | MISC ;
STRING : CHAR+ ;
SPACE : ' ' ;
WS : [\u000C\f\t\r\n]+ -> skip ;
CHAR : '\u0041' .. '\uFFFF' ;
WEIGHT : [Kk]'g' | [Kk]'ilo' | [Kk]'ilogram' | [Gg] | [Gg]'ram' |
[Dd]'ecigram' | [Oo]'unce' | [Oo]'z' | [Pp]'ound' | [Ll]'b' ;
LENGTH : [Mm] | [Mm]'eter' | [Cc]'m' | [Cc]'entimeter' |
[Ii]'nch' | [Ii][Nn] ;
VOLUME : [Ll] | [Ll]'iter' | [Dd]'l' | [Dd]'eciliter' | [Cc][Ll] |
[Cc]'entiliter' ;
I want to create a Grammar that will parse the input statement
myvar is 43+23
and
otherVar of myvar is "hallo"
But the parser doesn't recognize anything here.
(Sorry, I am not allowed to post images. Imagine a statement node with the tokens
[myvar] [is] [43] [+] [23] as children, all marked red. The same goes for the other statement.)
I'm getting error messages that confuse me:
line 2:7 no viable alternative at input 'myvaris'
line 3:19 no viable alternative at input 'otherVarofmyvaris'
Where have the spaces gone? I assume it's something in my lexer, but I can't see what the problem is. Just in case, here is the grammar for these statements:
statement
: envCall #call_Environment_Function
| identifier IS expression # assignment_statement // This one should be used
| loopHeader statement_block # loop_statement
etc...
expression
: '(' expression ')' #bracket_Expression
| mathExpression #math_Expression
| identifier #identifier_Expression // this one should be used
| objectExpression #object_Expression
etc ...
identifier //both of these should be used
: selector=IDENTIFIER OF object=expression #ofIdentifier
| selector=IDENTIFIER #idLocal
;
here are all the Lexer rules I have so far:
IdentifierNamespace: IDENTIFIER '.' IDENTIFIER;
FromIn: FROM | IN;
OPENBLOCK: NEWLINE? '{';
CLOSEBLOCK: '}' NEWLINE;
NEWLINE: ['\n''\t']+;
NUMBER: INT | FLOAT;
INT: [0-9]+;
FLOAT: [0-9]* '.' [0-9]+;
IsAre: IS | ARE;
OF: 'of';
IS: 'is';
ARE: 'are';
DO: 'do';
FROM: 'from';
IN: 'in';
IDENTIFIER : [a-zA-Z]+ ;
//WHITESPACE: [ \t]+ -> skip;
fragment UNICODE : 'u' HEX HEX HEX HEX ;
fragment HEX : [0-9a-fA-F] ;
fragment ESC : '\\' (["\\/bfnrt] | UNICODE) ;
STRING : '"' (ESC | ~["\\])* '"' ;
END: 'END'[.]* EOF;
WHITESPACE : ( '\t' | ' ' )+ -> skip ;
OK, found it. There was a compOP rule defined for the parser, and it was messing up the tree generation.
compOP: '<'
| '>'
| '=' // the programmers '=='
| '>='
| '<='
| '<>'
| '!='
| 'in'
| 'not' 'in'
| 'is' // <- removed this alternative and it works now
;
So: never assign the same keyword to Parser and Lexer, I guess.
I am trying to write a grammar for SMT formulae, and this is what I have so far:
grammar Z3input;
startRule : formulaList? EOF;
LEFT_PAREN : '(';
RIGHT_PAREN : ')';
COMMA : ',';
SEMICOLON : ';';
PLUS : '+';
MINUS : '-';
TIMES : '*';
DIVIDE : '/';
DIGIT : [0-9];
INTEGER : '0' | [1-9] DIGIT*;
FLOAT : DIGIT+ '.' DIGIT+;
NUMERICAL_LITERAL : FLOAT | INTEGER;
BOOLEAN_LITERAL : 'True' | 'False';
LITERAL : MINUS? NUMERICAL_LITERAL | BOOLEAN_LITERAL;
COMPARISON_OPERATOR : '>' | '<' | '>=' | '<=' | '!=' | '==';
WHITESPACE: [ \t\n\r]+ -> skip;
IDENTIFIER : [a-uw-zB-DF-Z]+ ([a-zA-Z0-9]? [a-uw-zB-DF-Z])*; // omits 'v', 'A', 'E' and cannot end in those characters
IMPLIES : '->' | '-->' | 'implies';
AND : '&' | 'and' | '^';
OR : 'or' | 'v' | '|';
NOT : '~' | '!' | 'not';
QUANTIFIER : 'A' | 'E' | 'forall' | 'exists';
formulaList : formula ( SEMICOLON formula )*;
argumentList : expression ( COMMA expression )*;
formula : formulaConjunction
| LEFT_PAREN formula RIGHT_PAREN OR LEFT_PAREN formulaConjunction RIGHT_PAREN
| formula IMPLIES LEFT_PAREN formulaConjunction RIGHT_PAREN;
formulaConjunction : formulaNegation | formulaConjunction AND formulaNegation;
formulaNegation : formulaAtom | NOT formulaNegation;
formulaAtom : BOOLEAN_LITERAL
| IDENTIFIER ( LEFT_PAREN argumentList? RIGHT_PAREN )?
| QUANTIFIER '.' LEFT_PAREN formulaAtom RIGHT_PAREN
| compareExpn;
expression : boolConjunction | expression OR boolConjunction;
boolConjunction : boolNegation | boolConjunction AND boolNegation;
boolNegation : compareExpn | NOT boolNegation;
compareExpn : arithExpn COMPARISON_OPERATOR arithExpn;
arithExpn : term | arithExpn PLUS term | arithExpn MINUS term;
term : factor | term TIMES factor | term DIVIDE factor;
factor : primary | MINUS factor;
primary : LITERAL
| IDENTIFIER ( LEFT_PAREN argumentList? RIGHT_PAREN )?
| LEFT_PAREN expression RIGHT_PAREN;
SMT formulae are formulae of first-order logic with function symbols (identifiers that can be called with any number of arguments), variables, comparisons between boolean literals (i.e. 'True' or 'False'), numeric literals, function calls, or variables, and arithmetic with the operators '+', '*', '-', and '/'. Essentially, these formulae are first-order logic over some signature, and for my purposes I've chosen this signature to be the theory of rationals.
I can get a proper interpretation of something like 'True ^ True' but anything more complicated, including even 'True | True', seems to always result in something along the lines of
... mismatched input '|' expecting {<EOF>, ';', IMPLIES, AND}
so I would like some help with correcting the grammar. And for the record I would prefer to keep the grammar run-time independent.
Your formula rule seems to be causing the issue here: LEFT_PAREN formula RIGHT_PAREN OR LEFT_PAREN formulaConjunction RIGHT_PAREN.
That's saying that only formulas of the form (FORMULA)|(CONJUNCTIVE) will be accepted by the language.
Instead, specify precedence rules for each operator, and use a nonterminal for each level of precedence. For example, your grammar might look something like the following:
formula : (QUANTIFIER IDENTIFIER '.')? formulaImplication;
formulaImplication : formulaConjunction (IMPLIES formula)?;
formulaConjunction : formulaDisjunction (AND formulaConjunction)?;
formulaDisjunction : formulaNegation (OR formulaDisjunction)?;
formulaNegation : formulaAtom | NOT formulaNegation;
formulaAtom : BOOLEAN_LITERAL | IDENTIFIER ( LEFT_PAREN argumentList? RIGHT_PAREN )? | LEFT_PAREN formula RIGHT_PAREN | compareExpn;
expression : boolConjunction | expression OR boolConjunction;
boolConjunction : boolNegation | boolConjunction AND boolNegation;
boolNegation : compareExpn | NOT boolNegation;
compareExpn : arithExpn COMPARISON_OPERATOR arithExpn;
arithExpn : term | arithExpn PLUS term | arithExpn MINUS term;
term : factor ((TIMES | DIVIDE) term)?;
factor : primary | MINUS factor;
primary : LITERAL | IDENTIFIER ( LEFT_PAREN argumentList? RIGHT_PAREN )? | LEFT_PAREN expression RIGHT_PAREN;
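The "one nonterminal per precedence level" idea can also be seen in a hand-written recursive-descent parser, where each level becomes one method. The following is a minimal sketch restricted to boolean literals, `~`, `&`, `|`, `->` and parentheses; it assumes the conventional ordering (NOT binds tightest, then AND, then OR, then implication), which is not necessarily the ordering used in the grammar above, and all names in it are made up for the example.

```python
import re

TOKEN = re.compile(r"\s*(->|[()&|~]|True|False)")

def tokenize(text):
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if not m:
            raise SyntaxError(f"unexpected input at position {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens

class Parser:
    def __init__(self, tokens):
        self.tokens, self.i = tokens, 0

    def peek(self):
        return self.tokens[self.i] if self.i < len(self.tokens) else None

    def take(self, expected=None):
        tok = self.peek()
        if tok is None or (expected is not None and tok != expected):
            raise SyntaxError(f"expected {expected!r}, got {tok!r}")
        self.i += 1
        return tok

    def implication(self):            # lowest precedence, right-associative
        left = self.disjunction()
        if self.peek() == "->":
            self.take()
            return ("->", left, self.implication())
        return left

    def disjunction(self):            # '|' binds tighter than '->'
        left = self.conjunction()
        while self.peek() == "|":
            self.take()
            left = ("|", left, self.conjunction())
        return left

    def conjunction(self):            # '&' binds tighter than '|'
        left = self.negation()
        while self.peek() == "&":
            self.take()
            left = ("&", left, self.negation())
        return left

    def negation(self):               # '~' binds tighter than '&'
        if self.peek() == "~":
            self.take()
            return ("~", self.negation())
        return self.atom()

    def atom(self):                   # literals and parenthesised formulas
        if self.peek() == "(":
            self.take("(")
            inner = self.implication()
            self.take(")")
            return inner
        tok = self.take()
        if tok not in ("True", "False"):
            raise SyntaxError(f"unexpected token {tok!r}")
        return tok

def parse(text):
    parser = Parser(tokenize(text))
    tree = parser.implication()
    if parser.peek() is not None:
        raise SyntaxError(f"trailing input: {parser.peek()!r}")
    return tree

print(parse("True | True"))
print(parse("True | False & ~True"))
```

Each method only calls the method for the next tighter level (or itself, for right-associative and prefix operators), so `True | True` parses without requiring parentheses around the operands.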
I'm trying to build a parser with bison and have narrowed all my errors down to one difficult one.
Here's the debug output of bison with the state where the error lies:
state 120
12 statement_list: statement_list . SEMICOLON statement
24 if_statement: IF conditional THEN statement_lists ELSE statement_list .
SEMICOLON shift, and go to state 50
SEMICOLON [reduce using rule 24 (if_statement)]
$default reduce using rule 24 (if_statement)
Here are the translation rules in the parser.y source:
%%
program : ID COLON block ENDP ID POINT
;
block : CODE statement_list
| DECLARATIONS declaration_block CODE statement_list
;
declaration_block : id_list OF TYPE type SEMICOLON
| declaration_block id_list OF TYPE type SEMICOLON
;
id_list : ID
| ID COMMA id_list
;
type : CHARACTER
| INTEGER
| REAL
;
statement_list : statement
| statement_list SEMICOLON statement
;
statement_lists : statement
| statement_list SEMICOLON statement
;
statement : assignment_statement
| if_statement
| do_statement
| while_statement
| for_statement
| write_statement
| read_statement
;
assignment_statement : expression OUTPUTTO ID
;
if_statement : IF conditional THEN statement_lists ENDIF
| IF conditional THEN statement_lists ELSE statement_list
;
do_statement : DO statement_list WHILE conditional ENDDO
;
while_statement : WHILE conditional DO statement_list ENDWHILE
;
for_statement : FOR ID IS expression BY expressions TO expression DO statement_list ENDFOR
;
write_statement : WRITE BRA output_list KET
| NEWLINE
;
read_statement : READ BRA ID KET
;
output_list : value
| value COMMA output_list
;
condition : expression comparator expression
;
conditional : condition
| NOT conditional
| condition AND conditional
| condition OR conditional
;
comparator : ASSIGNMENT
| BETWEEN
| LT
| GT
| LESSEQUAL
| GREATEREQUAL
;
expression : term
| term PLUS expression
| term MINUS expression
;
expressions : term
| term PLUS expressions
| term MINUS expressions
;
term : value
| value MULTIPLY term
| value DIVIDE term
;
value : ID
| constant
| BRA expression KET
;
constant : number_constant
| CHARCONST
;
number_constant : NUMBER
| MINUS NUMBER
| NUMBER POINT NUMBER
| MINUS NUMBER POINT NUMBER
;
%%
When I remove the if_statement rule there are no errors, so I've narrowed it down considerably, but still can't solve the error.
Thanks for any help.
Consider this statement: if condition then s1 else s2; s3
There are two interpretations:
if condition then
    s1;
else
    s2;
s3;
The other one is:
if condition then
    s1;
else
    s2;
    s3;
In the first one, the statement list is composed of an if statement followed by s3, while in the other the statement list is composed of a single if statement whose else branch contains both s2 and s3. That's where the ambiguity comes from. Bison prefers shift over reduce when a shift-reduce conflict exists, so in the above case the parser will shift the SEMICOLON and make s3 part of the else branch.
Since you already have an ENDIF in your if-then statement, consider introducing an ENDIF in your if-then-else statement as well; that resolves the conflict.
I think you are missing ENDIF in the IF-THEN-ELSE-ENDIF rule.
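To see why a closing ENDIF removes the ambiguity, here is a small Python sketch of a recursive-descent parser for a toy version of the language in which both forms of the if statement end with `endif`. It is an illustration only, not bison output; the lowercase keywords, the single-token condition and the function names are assumptions made for the example.

```python
def parse_statement_list(tokens, i):
    """Parse stmt (';' stmt)* and return (statements, next index)."""
    stmts = []
    while True:
        stmt, i = parse_statement(tokens, i)
        stmts.append(stmt)
        if i < len(tokens) and tokens[i] == ";":
            i += 1                      # the ';' always continues this list
            continue
        return stmts, i

def parse_statement(tokens, i):
    if tokens[i] != "if":
        return tokens[i], i + 1         # a plain statement such as s1
    cond = tokens[i + 1]                # single-token condition (assumption)
    if tokens[i + 2] != "then":
        raise SyntaxError("expected 'then'")
    then_branch, i = parse_statement_list(tokens, i + 3)
    else_branch = []
    if i < len(tokens) and tokens[i] == "else":
        else_branch, i = parse_statement_list(tokens, i + 1)
    if i >= len(tokens) or tokens[i] != "endif":
        raise SyntaxError("every if must be closed by 'endif'")
    return ("if", cond, then_branch, else_branch), i + 1

def parse(text):
    tokens = text.split()
    stmts, i = parse_statement_list(tokens, 0)
    if i != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[i]!r}")
    return stmts

# s3 inside the else branch:
print(parse("if c then s1 else s2 ; s3 endif"))
# s3 after the whole if statement:
print(parse("if c then s1 else s2 endif ; s3"))
```

Because `endif` marks where the else branch stops, the position of `s3` relative to `endif` decides which of the two readings is meant, and the parser never has to guess whether a `;` continues the inner or the outer list.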