ANTLR grammar for SMT formulae - parsing

I am trying to make a grammar for SMT formulae and this is what I have so far
grammar Z3input;
startRule : formulaList? EOF;
LEFT_PAREN : '(';
RIGHT_PAREN : ')';
COMMA : ',';
SEMICOLON : ';';
PLUS : '+';
MINUS : '-';
TIMES : '*';
DIVIDE : '/';
DIGIT : [0-9];
INTEGER : '0' | [1-9] DIGIT*;
FLOAT : DIGIT+ '.' DIGIT+;
NUMERICAL_LITERAL : FLOAT | INTEGER;
BOOLEAN_LITERAL : 'True' | 'False';
LITERAL : MINUS? NUMERICAL_LITERAL | BOOLEAN_LITERAL;
COMPARISON_OPERATOR : '>' | '<' | '>=' | '<=' | '!=' | '==';
WHITESPACE: [ \t\n\r]+ -> skip;
IDENTIFIER : [a-uw-zB-DF-Z]+ ([a-zA-Z0-9]? [a-uw-zB-DF-Z])*; // omits 'v', 'A', 'E' and cannot end in those characters
IMPLIES : '->' | '-->' | 'implies';
AND : '&' | 'and' | '^';
OR : 'or' | 'v' | '|';
NOT : '~' | '!' | 'not';
QUANTIFIER : 'A' | 'E' | 'forall' | 'exists';
formulaList : formula ( SEMICOLON formula )*;
argumentList : expression ( COMMA expression )*;
formula : formulaConjunction
| LEFT_PAREN formula RIGHT_PAREN OR LEFT_PAREN formulaConjunction RIGHT_PAREN
| formula IMPLIES LEFT_PAREN formulaConjunction RIGHT_PAREN;
formulaConjunction : formulaNegation | formulaConjunction AND formulaNegation;
formulaNegation : formulaAtom | NOT formulaNegation;
formulaAtom : BOOLEAN_LITERAL
| IDENTIFIER ( LEFT_PAREN argumentList? RIGHT_PAREN )?
| QUANTIFIER '.' LEFT_PAREN formulaAtom RIGHT_PAREN
| compareExpn;
expression : boolConjunction | expression OR boolConjunction;
boolConjunction : boolNegation | boolConjunction AND boolNegation;
boolNegation : compareExpn | NOT boolNegation;
compareExpn : arithExpn COMPARISON_OPERATOR arithExpn;
arithExpn : term | arithExpn PLUS term | arithExpn MINUS term;
term : factor | term TIMES factor | term DIVIDE factor;
factor : primary | MINUS factor;
primary : LITERAL
| IDENTIFIER ( LEFT_PAREN argumentList? RIGHT_PAREN )?
| LEFT_PAREN expression RIGHT_PAREN;
SMT formulae are formulae of first-order logic with function symbols (identifiers which can be called with however many arguments), variables, comparison of either boolean literals (I.e. 'True' or 'False') or numeric literals or function calls or variables, arithmetic with operators '+', '*', '-', and '/'. Essentially these formulae are first-order logic over some signature and for my purposes I've chosen for this signature to be the theory of rationals.
I can get a proper interpretation of something like 'True ^ True' but anything more complicated, including even 'True | True', seems to always result in something along the lines of
... mismatched input '|' expecting {<EOF>, ';', IMPLIES, AND}
so I would like some help with correcting the grammar. And for the record I would prefer to keep the grammar run-time independent.

Your formula rule seems to be causing the issue here: LEFT_PAREN formula RIGHT_PAREN OR LEFT_PAREN formulaConjunction RIGHT_PAREN.
That's saying that only formulas of the form (FORMULA)|(CONJUNCTIVE) will be accepted by the language.
Instead, specify precedence rules for each operator, and use a nonterminal for each level of precedence. For example, your grammar might look something like the following:
formula : (QUANTIFIER IDENTIFIER '.')? formulaImplication;
formulaImplication : formulaConjunction (IMPLIES formula)?;
formulaConjunction : formulaDisjunction (AND formulaConjunction)?;
formulaDisjunction : formulaNegation (OR formulaDisjunction)?;
formulaNegation : formulaAtom | NOT formulaNegation;
formulaAtom : BOOLEAN_LITERAL | IDENTIFIER ( LEFT_PAREN argumentList? RIGHT_PAREN )? | LEFT_PAREN formula RIGHT_PAREN | compareExpn;
expression : boolConjunction | expression OR boolConjunction;
boolConjunction : boolNegation | boolConjunction AND boolNegation;
boolNegation : compareExpn | NOT boolNegation;
compareExpn : arithExpn COMPARISON_OPERATOR arithExpn;
arithExpn : term | arithExpn PLUS term | arithExpn MINUS term;
term : factor ((TIMES | DIVIDE) term)?;
factor : primary | MINUS factor;
primary : LITERAL | IDENTIFIER ( LEFT_PAREN argumentList? RIGHT_PAREN )? | LEFT_PAREN expression RIGHT_PAREN;

Related

How to parse decimal values correctly?

I'm using ANTLR with Presto grammar in order to parse SQL queries.
I'm having an issue with parsing a decimal number. I've the following definitions:
number
: decimalValue #decimalLiteral
| DOUBLE_VALUE #doubleLiteral
| INTEGER_VALUE #integerLiteral
;
decimalValue
: INTEGER_VALUE '.' INTEGER_VALUE?
| '.' INTEGER_VALUE
;
DOUBLE_VALUE
: DIGIT+ ('.' DIGIT*)? EXPONENT
| '.' DIGIT+ EXPONENT
;
IDENTIFIER
// : (LETTER | '_' | DIGIT) (LETTER | DIGIT | '_' | '#' | ':' | '.')*
: (LETTER | DIGIT | '_' | '#' | ':' | '-' )+
;
This works ok for most cases. However, it has an issue with parsing decimal values.
select x/(0.3-0.2)
from table1
It fails to parse. The reason is that the lexer thinks "3-0" is identifier.
When I change the query to be something like:
select x/(0.3 - 0.2)
from table1
it works.
Any ideas how can I handle the original query (without, of course, causing a regression)?
Thanks,
Nir.

Parse Rules Decaf grammar antlr4

I am creating parser and lexer rules for Decaf programming language written in ANTLR4. I'm trying to parse a test file and keep getting an error, there must be something wrong in the grammar but i cant figure it out.
My test file looks like:
class Program {
int i[10];
}
The error is : line 2:8 mismatched input '10' expecting INT_LITERAL
And here is the full Decaf.g4 grammar file
grammar Decaf;
/*
LEXER RULES
-----------
Lexer rules define the basic syntax of individual words and symbols of a
valid Decaf program. Lexer rules follow regular expression syntax.
Complete the lexer rules following the Decaf Language Specification.
*/
CLASS : 'class';
INT : 'int';
RETURN : 'return';
VOID : 'void';
IF : 'if';
ELSE : 'else';
FOR : 'for';
BREAK : 'break';
CONTINUE : 'continue';
CALLOUT : 'callout';
TRUE : 'True' ;
FALSE : 'False' ;
BOOLEAN : 'boolean';
LCURLY : '{';
RCURLY : '}';
LBRACE : '(';
RBRACE : ')';
LSQUARE : '[';
RSQUARE : ']';
ADD : '+';
SUB : '-';
MUL : '*';
DIV : '/';
EQ : '=';
SEMI : ';';
COMMA : ',';
AND : '&&';
LESS : '<';
GREATER : '>';
LESSEQUAL : '<=' ;
GREATEREQUAL : '>=' ;
EQUALTO : '==' ;
NOTEQUAL : '!=' ;
EXCLAMATION : '!';
fragment CHAR : (' '..'!') | ('#'..'&') | ('('..'[') | (']'..'~') | ('\\'[']) | ('\\"') | ('\\') | ('\t') | ('\n');
CHAR_LITERAL : '\'' CHAR '\'';
//STRING_LITERAL : '"' CHAR+ '"' ;
HEXMARK : '0x';
fragment HEXA : [a-fA-F];
fragment HEXDIGIT : DIGIT | HEXA ;
HEX_LITERAL : HEXMARK HEXDIGIT+;
STRING : '"' (ESC|.)*? '"';
fragment ESC : '\\"' | '\\\\';
fragment DIGIT : [0-9];
DECIMAL_LITERAL : DIGIT(DIGIT)*;
COMMENT : '//' ~('\n')* '\n' -> skip;
WS : (' ' | '\n' | '\t' | '\r') + -> skip;
fragment ALPHA : [a-zA-Z] | '_';
fragment ALPHA_NUM : ALPHA | DIGIT;
ID : ALPHA ALPHA_NUM*;
INT_LITERAL : DECIMAL_LITERAL | HEX_LITERAL;
BOOL_LITERAL : TRUE | FALSE;
/*
PARSER RULES
------------
Parser rules are all lower case, and make use of lexer rules defined above
and other parser rules defined below. Parser rules also follow regular
expression syntax. Complete the parser rules following the Decaf Language
Specification.
*/
program : CLASS ID LCURLY field_decl* method_decl* RCURLY EOF;
field_name : ID | ID LSQUARE INT_LITERAL RSQUARE;
field_decl : datatype field_name (COMMA field_name)* SEMI;
method_decl : (datatype | VOID) ID LBRACE ((datatype ID) (COMMA datatype ID)*)? RBRACE block;
block : LCURLY var_decl* statement* RCURLY;
var_decl : datatype ID (COMMA ID)* SEMI;
datatype : INT | BOOLEAN;
statement : location assign_op expr SEMI
| method_call SEMI
| IF LBRACE expr RBRACE block (ELSE block)?
| FOR ID EQ expr COMMA expr block
| RETURN (expr)? SEMI
| BREAK SEMI
| CONTINUE SEMI
| block;
assign_op : EQ
| ADD EQ
| SUB EQ;
method_call : method_name LBRACE (expr (COMMA expr)*)? RBRACE
| CALLOUT LBRACE STRING(COMMA callout_arg (COMMA callout_arg)*) RBRACE;
method_name : ID;
location : ID | ID LSQUARE expr RSQUARE;
expr : location
| method_call
| literal
| expr bin_op expr
| SUB expr
| EXCLAMATION expr
| LBRACE expr RBRACE;
callout_arg : expr
| STRING ;
bin_op : arith_op
| rel_op
| eq_op
| cond_op;
arith_op : ADD | SUB | MUL | DIV | '%' ;
rel_op : LESS | GREATER | LESSEQUAL | GREATEREQUAL ;
eq_op : EQUALTO | NOTEQUAL ;
cond_op : AND | '||' ;
literal : INT_LITERAL | CHAR_LITERAL | BOOL_LITERAL ;
Whenever there are 2 or more lexer rules that match the same characters, the one defined first wins. In your case, these 2 rules both match 10:
DECIMAL_LITERAL : DIGIT(DIGIT)*;
INT_LITERAL : DECIMAL_LITERAL | HEX_LITERAL;
and since INT_LITERAL is defined after DECIMAL_LITERAL, the lexer will never create a INT_LITERAL token. If you now try to use it in a parser rule, you get an error message you posted.
The solution: remove INT_LITERAL from your lexer and create a parser rule instead:
int_literal : DECIMAL_LITERAL | HEX_LITERAL;
and use int_literal in your parser rules instead.

How to make certain rules mandatory in Antlr

I wrote the following grammar which should check for a conditional expression.
Examples below is what I want to achieve using this grammar:
test invalid
test = 1 valid
test = 1 and another_test>=0.2 valid
test = 1 kasd y = 1 invalid (two conditions MUST be separated by AND/OR)
a = 1 or (b=1 and c) invalid (there cannot be a lonely character like 'c'. It should always be a triplet. i.e, literal operator literal)
grammar expression;
expr
: literal_value
| expr ( '='|'<>'| '<' | '<=' | '>' | '>=' ) expr
| expr K_AND expr
| expr K_OR expr
| function_name '(' ( expr ( ',' expr )* | '*' )? ')'
| '(' expr ')'
;
literal_value
: NUMERIC_LITERAL
| STRING_LITERAL
| IDENTIFIER
;
keyword
: K_AND
| K_OR
;
name
: any_name
;
function_name
: any_name
;
database_name
: any_name
;
table_name
: any_name
;
column_name
: any_name
;
any_name
: IDENTIFIER
| keyword
| STRING_LITERAL
| '(' any_name ')'
;
K_AND : A N D;
K_OR : O R;
IDENTIFIER
: '"' (~'"' | '""')* '"'
| '`' (~'`' | '``')* '`'
| '[' ~']'* ']'
| [a-zA-Z_] [a-zA-Z_0-9]*
;
NUMERIC_LITERAL
: DIGIT+ ( '.' DIGIT* )? ( E [-+]? DIGIT+ )?
| '.' DIGIT+ ( E [-+]? DIGIT+ )?
;
STRING_LITERAL
: '\'' ( ~'\'' | '\'\'' )* '\''
;
fragment DIGIT : [0-9];
fragment A : [aA];
fragment B : [bB];
fragment C : [cC];
fragment D : [dD];
fragment E : [eE];
fragment F : [fF];
fragment G : [gG];
fragment H : [hH];
fragment I : [iI];
fragment J : [jJ];
fragment K : [kK];
fragment L : [lL];
fragment M : [mM];
fragment N : [nN];
fragment O : [oO];
fragment P : [pP];
fragment Q : [qQ];
fragment R : [rR];
fragment S : [sS];
fragment T : [tT];
fragment U : [uU];
fragment V : [vV];
fragment W : [wW];
fragment X : [xX];
fragment Y : [yY];
fragment Z : [zZ];
WS: [ \n\t\r]+ -> skip;
So my question is, how can I get the grammar to work for the examples mentioned above? Can we make certain words as mandatory between two triplets (literal operator literal)? In a sense I'm just trying to get a parser to validate the where clause condition but only simple condition and functions are permitted. I also want have a visitor that retrieves the values like function, parenthesis, any literal etc in Java, how to achieve that?
Yes and no.
You can change your grammar to only allow expressions that are comparisons and logical operations on the same:
expr
: term ( '='|'<>'| '<' | '<=' | '>' | '>=' ) term
| expr K_AND expr
| expr K_OR expr
| '(' expr ')'
;
term
: literal_value
| function_name '(' ( expr ( ',' expr )* | '*' )? ')'
;
The issue comes if you want to allow boolean variables or functions -- you need to classify the functions/vars in your lexer and have a different terminal for each, which is tricky and error prone.
Instead, it is generally better to NOT do this kind of checking in the parser -- have your parser be permissive and accept anything expression-like, and generate an expression tree for it. Then have a separate pass over the tree (called a type checker) that checks the types of the operands of operations and the arguments to functions.
This latter approach (with a separate type checker) generally ends up being much simpler, clearer, more flexible, and gives better error messages (rather than just 'syntax error').

Context-Free-Grammar for assignment statements in ANTLR

I'm writing an ANTLR lexer/parser for context free grammar.
This is what I have now:
statement
: assignment_statement
;
assignment_statement
: IDENTIFIER '=' expression ';'
;
term
: IDENT
| '(' expression ')'
| INTEGER
| STRING_LITERAL
| CHAR_LITERAL
| IDENT '(' actualParameters ')'
;
negation
: 'not'* term
;
unary
: ('+' | '-')* negation
;
mult
: unary (('*' | '/' | 'mod') unary)*
;
add
: mult (('+' | '-') mult)*
;
relation
: add (('=' | '/=' | '<' | '<=' | '>=' | '>') add)*
;
expression
: relation (('and' | 'or') relation)*
;
IDENTIFIER : LETTER (LETTER | DIGIT)*;
fragment DIGIT : '0'..'9';
fragment LETTER : ('a'..'z' | 'A'..'Z');
So my assignment statement is identified by the form
IDENTIFIER = expression;
However, assignment statement should also take into account cases when the right hand side is a function call (the return value of the statement). For example,
items = getItems();
What grammar rule should I add for this? I thought of adding a function call to the "expression" rule, but I wasn't sure if function call should be regarded as expression..
Thanks
This grammar looks fine to me. I am assuming that IDENT and IDENTIFIER are the same and that you have additional productions for the remaining terminals.
This production seems to define a function call.
| IDENT '(' actualParameters ')'
You need a production for the actual parameters, something like this.
actualParameters : nothing | expression ( ',' expression )*

'Token collision' in Boolean Query Parser

I'm creating a simple boolean query parser. I would like to do something like this below.
grammar BooleanQuery;
options
{
language = Java;
output = AST;
}
LPAREN : ( '(' ) ;
RPAREN : ( ')' );
QUOTE : ( '"' );
AND : ( 'AND' | '&' | 'EN' | '+' ) ;
OR : ( 'OR' | '|' | 'OF' );
WS : ( ' ' | '\t' | '\r' | '\n') {$channel=HIDDEN;} ;
WORD : (~( ' ' | '\t' | '\r' | '\n' | '(' | ')' | '"' ))*;
MINUS : '-';
PLUS : '+';
expr : andexpr;
andexpr : orexpr (AND^ orexpr)*;
orexpr : part (OR^ part)*;
phrase : QUOTE ( options {greedy=false;} : . )* QUOTE;
requiredexpr : PLUS atom;
excludedexpr : MINUS atom;
part : excludedexpr | requiredexpr | atom;
atom : phrase | WORD | LPAREN! expr RPAREN!;
The problem is that the MINUS and PLUS tokens 'collide' with the MINUS and PLUS signs in the AND and OR tokens. Sorry if I don't use the correct terminology. I'm a ANTLR newbie.
Below an example query:
foo OR (pow AND -"bar with cream" AND -bar)
What mistakes did I make?
A token must be unique. You can, however, use the same token for several purposes in you syntax (like the unary and binary minus in Java).
I do not know the exact syntax of your environment, but something like changing the following two clauses
AND : ( 'AND' | '&' | 'EN' ) ;
and
andexpr : orexpr ((AND^ | PLUS^) orexpr)*;
would probably solve this issue.

Resources