How to parse decimal values correctly? - parsing

I'm using ANTLR with Presto grammar in order to parse SQL queries.
I'm having an issue with parsing a decimal number. I've the following definitions:
number
: decimalValue #decimalLiteral
| DOUBLE_VALUE #doubleLiteral
| INTEGER_VALUE #integerLiteral
;
decimalValue
: INTEGER_VALUE '.' INTEGER_VALUE?
| '.' INTEGER_VALUE
;
DOUBLE_VALUE
: DIGIT+ ('.' DIGIT*)? EXPONENT
| '.' DIGIT+ EXPONENT
;
IDENTIFIER
// : (LETTER | '_' | DIGIT) (LETTER | DIGIT | '_' | '#' | ':' | '.')*
: (LETTER | DIGIT | '_' | '#' | ':' | '-' )+
;
This works ok for most cases. However, it has an issue with parsing decimal values.
select x/(0.3-0.2)
from table1
It fails to parse. The reason is that the lexer thinks "3-0" is identifier.
When I change the query to be something like:
select x/(0.3 - 0.2)
from table1
it works.
Any ideas how can I handle the original query (without, of course, causing a regression)?
Thanks,
Nir.

Related

antlr4 does't parse obvious tree

I want to create a Grammar that will parse the input statement
myvar is 43+23
and
otherVar of myvar is "hallo"
But the parser doesn't recognize anything here.
(sorry, I am not allowed to post images :( imagine a statement node with the Tokens
[myvar] [is] [43] [+] [23] as children all marked red. Same goes for the other statement)
I'm getting error messages that confuse me:
line 2:7 no viable alternative at input 'myvaris'
line 3:19 no viable alternative at input 'otherVarofmyvaris'
Where are the spaces gone? I assume, It's something with my lexer, but I can't see what the problem is. Just in case here is the grammar for these statements:
statement
: envCall #call_Environment_Function
| identifier IS expression # assignment_statement // This one should be used
| loopHeader statement_block # loop_statement
etc...
expression
: '(' expression ')' #bracket_Expression
| mathExpression #math_Expression
| identifier #identifier_Expression // this one should be used
| objectExpression #object_Expression
etc ...
identifier //both of these should be used
: selector=IDENTIFIER OF object=expression #ofIdentifier
| selector=IDENTIFIER #idLocal
;
here are all the Lexer rules I have so far:
IdentifierNamespace: IDENTIFIER '.' IDENTIFIER;
FromIn: FROM | IN;
OPENBLOCK: NEWLINE? '{';
CLOSEBLOCK: '}' NEWLINE;
NEWLINE: ['\n''\t']+;
NUMBER: INT | FLOAT;
INT: [0-9]+;
FLOAT: [0-9]* '.' [0-9]+;
IsAre: IS | ARE;
OF: 'of';
IS: 'is';
ARE: 'are';
DO: 'do';
FROM: 'from';
IN: 'in';
IDENTIFIER : [a-zA-Z]+ ;
//WHITESPACE: [ \t]+ -> skip;
fragment UNICODE : 'u' HEX HEX HEX HEX ;
fragment HEX : [0-9a-fA-F] ;
fragment ESC : '\\' (["\\/bfnrt] | UNICODE) ;
STRING : '"' (ESC | ~["\\])* '"' ;
END: 'END'[.]* EOF;
WHITESPACE : ( '\t' | ' ' )+ -> skip ;
Ok, found it. There was a compOP defined for the parser, and it was messing up the treegeneration.
compOP: '<'
| '>'
| '=' // the programmers '=='
| '>='
| '<='
| '<>'
| '!='
| 'in'
| 'not' 'in'
| 'is' <- removed this one and it works now
;
So: never assign the same keyword to Parser and Lexer, I guess.

ANTLR grammar for SMT formulae

I am trying to make a grammar for SMT formulae and this is what I have so far
grammar Z3input;
startRule : formulaList? EOF;
LEFT_PAREN : '(';
RIGHT_PAREN : ')';
COMMA : ',';
SEMICOLON : ';';
PLUS : '+';
MINUS : '-';
TIMES : '*';
DIVIDE : '/';
DIGIT : [0-9];
INTEGER : '0' | [1-9] DIGIT*;
FLOAT : DIGIT+ '.' DIGIT+;
NUMERICAL_LITERAL : FLOAT | INTEGER;
BOOLEAN_LITERAL : 'True' | 'False';
LITERAL : MINUS? NUMERICAL_LITERAL | BOOLEAN_LITERAL;
COMPARISON_OPERATOR : '>' | '<' | '>=' | '<=' | '!=' | '==';
WHITESPACE: [ \t\n\r]+ -> skip;
IDENTIFIER : [a-uw-zB-DF-Z]+ ([a-zA-Z0-9]? [a-uw-zB-DF-Z])*; // omits 'v', 'A', 'E' and cannot end in those characters
IMPLIES : '->' | '-->' | 'implies';
AND : '&' | 'and' | '^';
OR : 'or' | 'v' | '|';
NOT : '~' | '!' | 'not';
QUANTIFIER : 'A' | 'E' | 'forall' | 'exists';
formulaList : formula ( SEMICOLON formula )*;
argumentList : expression ( COMMA expression )*;
formula : formulaConjunction
| LEFT_PAREN formula RIGHT_PAREN OR LEFT_PAREN formulaConjunction RIGHT_PAREN
| formula IMPLIES LEFT_PAREN formulaConjunction RIGHT_PAREN;
formulaConjunction : formulaNegation | formulaConjunction AND formulaNegation;
formulaNegation : formulaAtom | NOT formulaNegation;
formulaAtom : BOOLEAN_LITERAL
| IDENTIFIER ( LEFT_PAREN argumentList? RIGHT_PAREN )?
| QUANTIFIER '.' LEFT_PAREN formulaAtom RIGHT_PAREN
| compareExpn;
expression : boolConjunction | expression OR boolConjunction;
boolConjunction : boolNegation | boolConjunction AND boolNegation;
boolNegation : compareExpn | NOT boolNegation;
compareExpn : arithExpn COMPARISON_OPERATOR arithExpn;
arithExpn : term | arithExpn PLUS term | arithExpn MINUS term;
term : factor | term TIMES factor | term DIVIDE factor;
factor : primary | MINUS factor;
primary : LITERAL
| IDENTIFIER ( LEFT_PAREN argumentList? RIGHT_PAREN )?
| LEFT_PAREN expression RIGHT_PAREN;
SMT formulae are formulae of first-order logic with function symbols (identifiers which can be called with however many arguments), variables, comparison of either boolean literals (I.e. 'True' or 'False') or numeric literals or function calls or variables, arithmetic with operators '+', '*', '-', and '/'. Essentially these formulae are first-order logic over some signature and for my purposes I've chosen for this signature to be the theory of rationals.
I can get a proper interpretation of something like 'True ^ True' but anything more complicated, including even 'True | True', seems to always result in something along the lines of
... mismatched input '|' expecting {<EOF>, ';', IMPLIES, AND}
so I would like some help with correcting the grammar. And for the record I would prefer to keep the grammar run-time independent.
Your formula rule seems to be causing the issue here: LEFT_PAREN formula RIGHT_PAREN OR LEFT_PAREN formulaConjunction RIGHT_PAREN.
That's saying that only formulas of the form (FORMULA)|(CONJUNCTIVE) will be accepted by the language.
Instead, specify precedence rules for each operator, and use a nonterminal for each level of precedence. For example, your grammar might look something like the following:
formula : (QUANTIFIER IDENTIFIER '.')? formulaImplication;
formulaImplication : formulaConjunction (IMPLIES formula)?;
formulaConjunction : formulaDisjunction (AND formulaConjunction)?;
formulaDisjunction : formulaNegation (OR formulaDisjunction)?;
formulaNegation : formulaAtom | NOT formulaNegation;
formulaAtom : BOOLEAN_LITERAL | IDENTIFIER ( LEFT_PAREN argumentList? RIGHT_PAREN )? | LEFT_PAREN formula RIGHT_PAREN | compareExpn;
expression : boolConjunction | expression OR boolConjunction;
boolConjunction : boolNegation | boolConjunction AND boolNegation;
boolNegation : compareExpn | NOT boolNegation;
compareExpn : arithExpn COMPARISON_OPERATOR arithExpn;
arithExpn : term | arithExpn PLUS term | arithExpn MINUS term;
term : factor ((TIMES | DIVIDE) term)?;
factor : primary | MINUS factor;
primary : LITERAL | IDENTIFIER ( LEFT_PAREN argumentList? RIGHT_PAREN )? | LEFT_PAREN expression RIGHT_PAREN;

Context-Free-Grammar for assignment statements in ANTLR

I'm writing an ANTLR lexer/parser for context free grammar.
This is what I have now:
statement
: assignment_statement
;
assignment_statement
: IDENTIFIER '=' expression ';'
;
term
: IDENT
| '(' expression ')'
| INTEGER
| STRING_LITERAL
| CHAR_LITERAL
| IDENT '(' actualParameters ')'
;
negation
: 'not'* term
;
unary
: ('+' | '-')* negation
;
mult
: unary (('*' | '/' | 'mod') unary)*
;
add
: mult (('+' | '-') mult)*
;
relation
: add (('=' | '/=' | '<' | '<=' | '>=' | '>') add)*
;
expression
: relation (('and' | 'or') relation)*
;
IDENTIFIER : LETTER (LETTER | DIGIT)*;
fragment DIGIT : '0'..'9';
fragment LETTER : ('a'..'z' | 'A'..'Z');
So my assignment statement is identified by the form
IDENTIFIER = expression;
However, assignment statement should also take into account cases when the right hand side is a function call (the return value of the statement). For example,
items = getItems();
What grammar rule should I add for this? I thought of adding a function call to the "expression" rule, but I wasn't sure if function call should be regarded as expression..
Thanks
This grammar looks fine to me. I am assuming that IDENT and IDENTIFIER are the same and that you have additional productions for the remaining terminals.
This production seems to define a function call.
| IDENT '(' actualParameters ')'
You need a production for the actual parameters, something like this.
actualParameters : nothing | expression ( ',' expression )*

Antlr wrong rule invocation

I'm trying to implement a grammar for parsing lucene queries. So far everything went smooth until i tried to add support for range queries . Lucene details aside my grammar looks like this :
grammar ModifiedParser;
TERM_RANGE : '[' ('*' | TERM_TEXT) 'TO' ('*' | TERM_TEXT) ']'
| '{' ('*' | TERM_TEXT) 'TO' ('*' | TERM_TEXT) '}'
;
query : not (booleanOperator? not)* ;
booleanOperator : andClause
| orClause
;
andClause : 'AND' ;
notClause : 'NOT' ;
orClause : 'OR' ;
not : notClause? MODIFIER? clause;
clause : unqualified
| qualified
;
unqualified : TERM_RANGE # termRange
| TERM_PHRASE # termPhrase
| TERM_PHRASE_ANYTHING # termTruncatedPhrase
| '(' query ')' # queryUnqualified
| TERM_TEXT_TRUNCATED # termTruncatedText
| TERM_NORMAL # termText
;
qualified : TERM_NORMAL ':' unqualified
;
fragment TERM_CHAR : (~(' ' | '\t' | '\n' | '\r' | '\u3000'
| '\'' | '\"' | '(' | ')' | '[' | ']' | '{' | '}'
| '+' | '-' | '!' | ':' | '~' | '^'
| '?' | '*' | '\\' ))
;
fragment TERM_START_CHAR : TERM_CHAR
| ESCAPE
;
fragment ESCAPE : '\\' ~[];
MODIFIER : '-'
| '+'
;
AND : 'AND';
OR : 'OR';
NOT : 'NOT';
TERM_PHRASE_ANYTHING : '"' (ESCAPE|~('\"'|'\\'))+ '"' ;
TERM_PHRASE : '"' (ESCAPE|~('\"'|'\\'|'?'|'*'))+ '"' ;
TERM_TEXT_TRUNCATED : ('*'|'?')(TERM_CHAR+ ('*'|'?'))+ TERM_CHAR*
| TERM_START_CHAR (TERM_CHAR* ('?'|'*'))+ TERM_CHAR+
| ('?'|'*') TERM_CHAR+
;
TERM_NORMAL : TERM_TEXT;
fragment TERM_TEXT : TERM_START_CHAR TERM_CHAR* ;
WS : [ \t\r\n] -> skip ;
When i try to do a visitor and work with the tokens apparently parsing asd [ 10 TO 100 ] { 1 TO 1000 } 100..1000 will throw token recognition error for [ , ] , } and {, and only tries to visit the termRange rule on the third range . do you guys know what i'm missing here ? Thanks in advance
Since you made TERM_RANGE a lexer rule, you must account for everything at a character level. In particular, you forgot to allow whitespace characters in your input.
You would likely be in a much better position if you instead created termRange, a parser rule.

'Token collision' in Boolean Query Parser

I'm creating a simple boolean query parser. I would like to do something like this below.
grammar BooleanQuery;
options
{
language = Java;
output = AST;
}
LPAREN : ( '(' ) ;
RPAREN : ( ')' );
QUOTE : ( '"' );
AND : ( 'AND' | '&' | 'EN' | '+' ) ;
OR : ( 'OR' | '|' | 'OF' );
WS : ( ' ' | '\t' | '\r' | '\n') {$channel=HIDDEN;} ;
WORD : (~( ' ' | '\t' | '\r' | '\n' | '(' | ')' | '"' ))*;
MINUS : '-';
PLUS : '+';
expr : andexpr;
andexpr : orexpr (AND^ orexpr)*;
orexpr : part (OR^ part)*;
phrase : QUOTE ( options {greedy=false;} : . )* QUOTE;
requiredexpr : PLUS atom;
excludedexpr : MINUS atom;
part : excludedexpr | requiredexpr | atom;
atom : phrase | WORD | LPAREN! expr RPAREN!;
The problem is that the MINUS and PLUS tokens 'collide' with the MINUS and PLUS signs in the AND and OR tokens. Sorry if I don't use the correct terminology. I'm a ANTLR newbie.
Below an example query:
foo OR (pow AND -"bar with cream" AND -bar)
What mistakes did I make?
A token must be unique. You can, however, use the same token for several purposes in you syntax (like the unary and binary minus in Java).
I do not know the exact syntax of your environment, but something like changing the following two clauses
AND : ( 'AND' | '&' | 'EN' ) ;
and
andexpr : orexpr ((AND^ | PLUS^) orexpr)*;
would probably solve this issue.

Resources