'Token collision' in Boolean Query Parser - parsing

I'm creating a simple boolean query parser. I would like to do something like this below.
grammar BooleanQuery;
options
{
language = Java;
output = AST;
}
LPAREN : ( '(' ) ;
RPAREN : ( ')' );
QUOTE : ( '"' );
AND : ( 'AND' | '&' | 'EN' | '+' ) ;
OR : ( 'OR' | '|' | 'OF' );
WS : ( ' ' | '\t' | '\r' | '\n') {$channel=HIDDEN;} ;
WORD : (~( ' ' | '\t' | '\r' | '\n' | '(' | ')' | '"' ))*;
MINUS : '-';
PLUS : '+';
expr : andexpr;
andexpr : orexpr (AND^ orexpr)*;
orexpr : part (OR^ part)*;
phrase : QUOTE ( options {greedy=false;} : . )* QUOTE;
requiredexpr : PLUS atom;
excludedexpr : MINUS atom;
part : excludedexpr | requiredexpr | atom;
atom : phrase | WORD | LPAREN! expr RPAREN!;
The problem is that the MINUS and PLUS tokens 'collide' with the MINUS and PLUS signs in the AND and OR tokens. Sorry if I don't use the correct terminology. I'm a ANTLR newbie.
Below an example query:
foo OR (pow AND -"bar with cream" AND -bar)
What mistakes did I make?

A token must be unique. You can, however, use the same token for several purposes in you syntax (like the unary and binary minus in Java).
I do not know the exact syntax of your environment, but something like changing the following two clauses
AND : ( 'AND' | '&' | 'EN' ) ;
and
andexpr : orexpr ((AND^ | PLUS^) orexpr)*;
would probably solve this issue.

Related

Parse Rules Decaf grammar antlr4

I am creating parser and lexer rules for Decaf programming language written in ANTLR4. I'm trying to parse a test file and keep getting an error, there must be something wrong in the grammar but i cant figure it out.
My test file looks like:
class Program {
int i[10];
}
The error is : line 2:8 mismatched input '10' expecting INT_LITERAL
And here is the full Decaf.g4 grammar file
grammar Decaf;
/*
LEXER RULES
-----------
Lexer rules define the basic syntax of individual words and symbols of a
valid Decaf program. Lexer rules follow regular expression syntax.
Complete the lexer rules following the Decaf Language Specification.
*/
CLASS : 'class';
INT : 'int';
RETURN : 'return';
VOID : 'void';
IF : 'if';
ELSE : 'else';
FOR : 'for';
BREAK : 'break';
CONTINUE : 'continue';
CALLOUT : 'callout';
TRUE : 'True' ;
FALSE : 'False' ;
BOOLEAN : 'boolean';
LCURLY : '{';
RCURLY : '}';
LBRACE : '(';
RBRACE : ')';
LSQUARE : '[';
RSQUARE : ']';
ADD : '+';
SUB : '-';
MUL : '*';
DIV : '/';
EQ : '=';
SEMI : ';';
COMMA : ',';
AND : '&&';
LESS : '<';
GREATER : '>';
LESSEQUAL : '<=' ;
GREATEREQUAL : '>=' ;
EQUALTO : '==' ;
NOTEQUAL : '!=' ;
EXCLAMATION : '!';
fragment CHAR : (' '..'!') | ('#'..'&') | ('('..'[') | (']'..'~') | ('\\'[']) | ('\\"') | ('\\') | ('\t') | ('\n');
CHAR_LITERAL : '\'' CHAR '\'';
//STRING_LITERAL : '"' CHAR+ '"' ;
HEXMARK : '0x';
fragment HEXA : [a-fA-F];
fragment HEXDIGIT : DIGIT | HEXA ;
HEX_LITERAL : HEXMARK HEXDIGIT+;
STRING : '"' (ESC|.)*? '"';
fragment ESC : '\\"' | '\\\\';
fragment DIGIT : [0-9];
DECIMAL_LITERAL : DIGIT(DIGIT)*;
COMMENT : '//' ~('\n')* '\n' -> skip;
WS : (' ' | '\n' | '\t' | '\r') + -> skip;
fragment ALPHA : [a-zA-Z] | '_';
fragment ALPHA_NUM : ALPHA | DIGIT;
ID : ALPHA ALPHA_NUM*;
INT_LITERAL : DECIMAL_LITERAL | HEX_LITERAL;
BOOL_LITERAL : TRUE | FALSE;
/*
PARSER RULES
------------
Parser rules are all lower case, and make use of lexer rules defined above
and other parser rules defined below. Parser rules also follow regular
expression syntax. Complete the parser rules following the Decaf Language
Specification.
*/
program : CLASS ID LCURLY field_decl* method_decl* RCURLY EOF;
field_name : ID | ID LSQUARE INT_LITERAL RSQUARE;
field_decl : datatype field_name (COMMA field_name)* SEMI;
method_decl : (datatype | VOID) ID LBRACE ((datatype ID) (COMMA datatype ID)*)? RBRACE block;
block : LCURLY var_decl* statement* RCURLY;
var_decl : datatype ID (COMMA ID)* SEMI;
datatype : INT | BOOLEAN;
statement : location assign_op expr SEMI
| method_call SEMI
| IF LBRACE expr RBRACE block (ELSE block)?
| FOR ID EQ expr COMMA expr block
| RETURN (expr)? SEMI
| BREAK SEMI
| CONTINUE SEMI
| block;
assign_op : EQ
| ADD EQ
| SUB EQ;
method_call : method_name LBRACE (expr (COMMA expr)*)? RBRACE
| CALLOUT LBRACE STRING(COMMA callout_arg (COMMA callout_arg)*) RBRACE;
method_name : ID;
location : ID | ID LSQUARE expr RSQUARE;
expr : location
| method_call
| literal
| expr bin_op expr
| SUB expr
| EXCLAMATION expr
| LBRACE expr RBRACE;
callout_arg : expr
| STRING ;
bin_op : arith_op
| rel_op
| eq_op
| cond_op;
arith_op : ADD | SUB | MUL | DIV | '%' ;
rel_op : LESS | GREATER | LESSEQUAL | GREATEREQUAL ;
eq_op : EQUALTO | NOTEQUAL ;
cond_op : AND | '||' ;
literal : INT_LITERAL | CHAR_LITERAL | BOOL_LITERAL ;
Whenever there are 2 or more lexer rules that match the same characters, the one defined first wins. In your case, these 2 rules both match 10:
DECIMAL_LITERAL : DIGIT(DIGIT)*;
INT_LITERAL : DECIMAL_LITERAL | HEX_LITERAL;
and since INT_LITERAL is defined after DECIMAL_LITERAL, the lexer will never create a INT_LITERAL token. If you now try to use it in a parser rule, you get an error message you posted.
The solution: remove INT_LITERAL from your lexer and create a parser rule instead:
int_literal : DECIMAL_LITERAL | HEX_LITERAL;
and use int_literal in your parser rules instead.

Antlr3 - Non Greedy Double Quoted String with Escaped Double Quote

The following Antlr3 Grammar file doesn't cater for escaped double quotes as part of the STRING lexer rule. Any ideas why?
Expressions working:
\"hello\"
ref(\"hello\",\"hello\")
Expressions NOT working:
\"h\"e\"l\"l\"o\"
ref(\"hello\", \"hel\"lo\")
Antlr3 grammar file runnable in AntlrWorks:
grammar Grammar;
options
{
output=AST;
ASTLabelType=CommonTree;
language=CSharp3;
}
public oaExpression
: exponentiationExpression EOF!
;
exponentiationExpression
: equalityExpression ( '^' equalityExpression )*
;
equalityExpression
: relationalExpression ( ( ('==' | '=' ) | ('!=' | '<>' ) ) relationalExpression )*
;
relationalExpression
: additiveExpression ( ( '>' | '>=' | '<' | '<=' ) additiveExpression )*
;
additiveExpression
: multiplicativeExpression ( ( '+' | '-' ) multiplicativeExpression )*
;
multiplicativeExpression
: primaryExpression ( ( '*' | '/' ) primaryExpression )*
;
primaryExpression
: '(' exponentiationExpression ')' | value | identifier (arguments )?
;
value
: STRING
;
identifier
: ID
;
expressionList
: exponentiationExpression ( ',' exponentiationExpression )*
;
arguments
: '(' ( expressionList )? ')'
;
/*
* Lexer rules
*/
ID
: LETTER (LETTER | DIGIT)*
;
STRING
: '"' ( options { greedy=false; } : ~'"' )* '"'
;
WS
: (' '|'\r'|'\t'|'\u000C'|'\n') {$channel=Hidden;}
;
/*
* Fragment Lexer rules
*/
fragment
LETTER
: 'a'..'z'
| 'A'..'Z'
| '_'
;
fragment
EXPONENT
: ('e'|'E') ('+'|'-')? ( DIGIT )+
;
fragment
HEX_DIGIT
: ( DIGIT |'a'..'f'|'A'..'F')
;
fragment
DIGIT
: '0'..'9'
;
Try this:
STRING
: '"' // a opening quote
( // start group
'\\' ~('\r' | '\n') // an escaped char other than a line break char
| // OR
~('\\' | '"'| '\r' | '\n') // any char other than '"', '\' and line breaks
)* // end group and repeat zero or more times
'"' // the closing quote
;
When I test the 4 different test cases from your comment:
"\"hello\""
"ref(\"hello\",\"hello\")"
"\"h\"e\"l\"l\"o\""
"ref(\"hello\", \"hel\"lo\")"
with the lexer rule I suggested:
grammar T;
parse
: string+ EOF
;
string
: STRING
;
STRING
: '"' ('\\' ~('\r' | '\n') | ~('\\' | '"'| '\r' | '\n'))* '"'
;
SPACE
: (' ' | '\t' | '\r' | '\n')+ {skip();}
;
ANTLRWorks' debugger produces the following parse tree:
In other words: it works just fine (on my machine :)).
EDIT II
And I've also used your grammar (making some small changes to make it Java compatible) where I replaced the incorrect STRING rule into the one I suggested:
oaExpression
: STRING+ EOF!
//: exponentiationExpression EOF!
;
exponentiationExpression
: equalityExpression ( '^' equalityExpression )*
;
equalityExpression
: relationalExpression ( ( ('==' | '=' ) | ('!=' | '<>' ) ) relationalExpression )*
;
relationalExpression
: additiveExpression ( ( '>' | '>=' | '<' | '<=' ) additiveExpression )*
;
additiveExpression
: multiplicativeExpression ( ( '+' | '-' ) multiplicativeExpression )*
;
multiplicativeExpression
: primaryExpression ( ( '*' | '/' ) primaryExpression )*
;
primaryExpression
: '(' exponentiationExpression ')' | value | identifier (arguments )?
;
value
: STRING
;
identifier
: ID
;
expressionList
: exponentiationExpression ( ',' exponentiationExpression )*
;
arguments
: '(' ( expressionList )? ')'
;
/*
* Lexer rules
*/
ID
: LETTER (LETTER | DIGIT)*
;
//STRING
// : '"' ( options { greedy=false; } : ~'"' )* '"'
// ;
STRING
: '"' ('\\' ~('\r' | '\n') | ~('\\' | '"'| '\r' | '\n'))* '"'
;
WS
: (' '|'\r'|'\t'|'\u000C'|'\n') {$channel=HIDDEN;} /*{$channel=Hidden;}*/
;
/*
* Fragment Lexer rules
*/
fragment
LETTER
: 'a'..'z'
| 'A'..'Z'
| '_'
;
fragment
EXPONENT
: ('e'|'E') ('+'|'-')? ( DIGIT )+
;
fragment
HEX_DIGIT
: ( DIGIT |'a'..'f'|'A'..'F')
;
fragment
DIGIT
: '0'..'9'
;
which parses the input from my previous example in an identical parse tree.
This is how I do this with strings that can contain escape sequences (not just \" but any):
DOUBLE_QUOTED_TEXT
#init { int escape_count = 0; }:
DOUBLE_QUOTE
(
DOUBLE_QUOTE DOUBLE_QUOTE { escape_count++; }
| ESCAPE_OPERATOR . { escape_count++; }
| ~(DOUBLE_QUOTE | ESCAPE_OPERATOR)
)*
DOUBLE_QUOTE
{ EMIT(); LTOKEN->user1 = escape_count; }
;
The rule additionally counts the escapes and stores them in the token. This allows the receiver to quickly see if it needs to do anything with the string (if user1 > 0). If you don't need that remove the #init part and the actions.

Antlr wrong rule invocation

I'm trying to implement a grammar for parsing lucene queries. So far everything went smooth until i tried to add support for range queries . Lucene details aside my grammar looks like this :
grammar ModifiedParser;
TERM_RANGE : '[' ('*' | TERM_TEXT) 'TO' ('*' | TERM_TEXT) ']'
| '{' ('*' | TERM_TEXT) 'TO' ('*' | TERM_TEXT) '}'
;
query : not (booleanOperator? not)* ;
booleanOperator : andClause
| orClause
;
andClause : 'AND' ;
notClause : 'NOT' ;
orClause : 'OR' ;
not : notClause? MODIFIER? clause;
clause : unqualified
| qualified
;
unqualified : TERM_RANGE # termRange
| TERM_PHRASE # termPhrase
| TERM_PHRASE_ANYTHING # termTruncatedPhrase
| '(' query ')' # queryUnqualified
| TERM_TEXT_TRUNCATED # termTruncatedText
| TERM_NORMAL # termText
;
qualified : TERM_NORMAL ':' unqualified
;
fragment TERM_CHAR : (~(' ' | '\t' | '\n' | '\r' | '\u3000'
| '\'' | '\"' | '(' | ')' | '[' | ']' | '{' | '}'
| '+' | '-' | '!' | ':' | '~' | '^'
| '?' | '*' | '\\' ))
;
fragment TERM_START_CHAR : TERM_CHAR
| ESCAPE
;
fragment ESCAPE : '\\' ~[];
MODIFIER : '-'
| '+'
;
AND : 'AND';
OR : 'OR';
NOT : 'NOT';
TERM_PHRASE_ANYTHING : '"' (ESCAPE|~('\"'|'\\'))+ '"' ;
TERM_PHRASE : '"' (ESCAPE|~('\"'|'\\'|'?'|'*'))+ '"' ;
TERM_TEXT_TRUNCATED : ('*'|'?')(TERM_CHAR+ ('*'|'?'))+ TERM_CHAR*
| TERM_START_CHAR (TERM_CHAR* ('?'|'*'))+ TERM_CHAR+
| ('?'|'*') TERM_CHAR+
;
TERM_NORMAL : TERM_TEXT;
fragment TERM_TEXT : TERM_START_CHAR TERM_CHAR* ;
WS : [ \t\r\n] -> skip ;
When i try to do a visitor and work with the tokens apparently parsing asd [ 10 TO 100 ] { 1 TO 1000 } 100..1000 will throw token recognition error for [ , ] , } and {, and only tries to visit the termRange rule on the third range . do you guys know what i'm missing here ? Thanks in advance
Since you made TERM_RANGE a lexer rule, you must account for everything at a character level. In particular, you forgot to allow whitespace characters in your input.
You would likely be in a much better position if you instead created termRange, a parser rule.

ANTLR ambiguity '-'

I have a grammar and everything works fine until this portion:
lexp
: factor ( ('+' | '-') factor)*
;
factor :('-')? IDENT;
This of course introduces an ambiguity. For example a-a can be matched by either Factor - Factor or Factor -> - IDENT
I get the following warning stating this:
[18:49:39] warning(200): withoutWarningButIncomplete.g:57:31:
Decision can match input such as "'-' {IDENT, '-'}" using multiple alternatives: 1, 2
How can I resolve this ambiguity? I just don't see a way around it. Is there some kind of option that I can use?
Here is the full grammar:
program
: includes decls (procedure)*
;
/* Check if correct! */
includes
: ('#include' STRING)*
;
decls
: (typedident ';')*
;
typedident
: ('int' | 'char') IDENT
;
procedure
: ('int' | 'char') IDENT '(' args ')' body
;
args
: typedident (',' typedident )* /* Check if correct! */
| /* epsilon */
;
body
: '{' decls stmtlist '}'
;
stmtlist
: (stmt)*;
stmt
: '{' stmtlist '}'
| 'read' '(' IDENT ')' ';'
| 'output' '(' IDENT ')' ';'
| 'print' '(' STRING ')' ';'
| 'return' (lexp)* ';'
| 'readc' '(' IDENT ')' ';'
| 'outputc' '(' IDENT ')' ';'
| IDENT '(' (IDENT ( ',' IDENT )*)? ')' ';'
| IDENT '=' lexp ';';
lexp
: term (( '+' | '-' ) term) * /*Add in | '-' to reveal the warning! !*/
;
term
: factor (('*' | '/' | '%') factor )*
;
factor : '(' lexp ')'
| ('-')? IDENT
| NUMBER;
fragment DIGIT
: ('0' .. '9')
;
IDENT : ('A' .. 'Z' | 'a' .. 'z') (( 'A' .. 'Z' | 'a' .. 'z' | '0' .. '9' | '_'))* ;
NUMBER
: ( ('-')? DIGIT+)
;
CHARACTER
: '\'' ('a' .. 'z' | 'A' .. 'Z' | '0' .. '9' | '\\n' | '\\t' | '\\\\' | '\\' | 'EOF' |'.' | ',' |':' ) '\'' /* IS THIS COMPLETE? */
;
As mentioned in the comments: these rules are not ambiguous:
lexp
: factor (('+' | '-') factor)*
;
factor : ('-')? IDENT;
This is the cause of the ambiguity:
'return' (lexp)* ';'
which can parse the input a-b in two different ways:
a-b as a single binary expression
a as a single expression, and -b as an unary expression
You will need to change your grammar. Perhaps add a comma in multiple return values? Something like this:
'return' (lexp (',' lexp)*)? ';'
which will match:
return;
return a;
return a, -b;
return a-b, c+d+e, f;
...

Parsing string interpolation in ANTLR

I'm working on a simple string manipulation DSL for internal purposes, and I would like the language to support string interpolation as it is used in Ruby.
For example:
name = "Bob"
msg = "Hello ${name}!"
print(msg) # prints "Hello Bob!"
I'm attempting to implement my parser in ANTLRv3, but I'm pretty inexperienced with using ANTLR so I'm unsure how to implement this feature. So far, I've specified my string literals in the lexer, but in this case I'll obviously need to handle the interpolation content in the parser.
My current string literal grammar looks like this:
STRINGLITERAL : '"' ( StringEscapeSeq | ~( '\\' | '"' | '\r' | '\n' ) )* '"' ;
fragment StringEscapeSeq : '\\' ( 't' | 'n' | 'r' | '"' | '\\' | '$' | ('0'..'9')) ;
Moving the string literal handling into the parser seems to make everything else stop working as it should. Cursory web searches didn't yield any information. Any suggestions as to how to get started on this?
I'm no ANTLR expert, but here's a possible grammar:
grammar Str;
parse
: ((Space)* statement (Space)* ';')+ (Space)* EOF
;
statement
: print | assignment
;
print
: 'print' '(' (Identifier | stringLiteral) ')'
;
assignment
: Identifier (Space)* '=' (Space)* stringLiteral
;
stringLiteral
: '"' (Identifier | EscapeSequence | NormalChar | Space | Interpolation)* '"'
;
Interpolation
: '${' Identifier '}'
;
Identifier
: ('a'..'z' | 'A'..'Z' | '_') ('a'..'z' | 'A'..'Z' | '_' | '0'..'9')*
;
EscapeSequence
: '\\' SpecialChar
;
SpecialChar
: '"' | '\\' | '$'
;
Space
: (' ' | '\t' | '\r' | '\n')
;
NormalChar
: ~SpecialChar
;
As you notice, there are a couple of (Space)*-es inside the example grammar. This is because the stringLiteral is a parser-rule instead of a lexer-rule. Therefor, when tokenizing the source file, the lexer cannot know if a white space is part of a string literal, or is just a space inside the source file that can be ignored.
I tested the example with a little Java class and all worked as expected:
/* the same grammar, but now with a bit of Java code in it */
grammar Str;
#parser::header {
package antlrdemo;
import java.util.HashMap;
}
#lexer::header {
package antlrdemo;
}
#parser::members {
HashMap<String, String> vars = new HashMap<String, String>();
}
parse
: ((Space)* statement (Space)* ';')+ (Space)* EOF
;
statement
: print | assignment
;
print
: 'print' '('
( id=Identifier {System.out.println("> "+vars.get($id.text));}
| st=stringLiteral {System.out.println("> "+$st.value);}
)
')'
;
assignment
: id=Identifier (Space)* '=' (Space)* st=stringLiteral {vars.put($id.text, $st.value);}
;
stringLiteral returns [String value]
: '"'
{StringBuilder b = new StringBuilder();}
( id=Identifier {b.append($id.text);}
| es=EscapeSequence {b.append($es.text);}
| ch=(NormalChar | Space) {b.append($ch.text);}
| in=Interpolation {b.append(vars.get($in.text.substring(2, $in.text.length()-1)));}
)*
'"'
{$value = b.toString();}
;
Interpolation
: '${' i=Identifier '}'
;
Identifier
: ('a'..'z' | 'A'..'Z' | '_') ('a'..'z' | 'A'..'Z' | '_' | '0'..'9')*
;
EscapeSequence
: '\\' SpecialChar
;
SpecialChar
: '"' | '\\' | '$'
;
Space
: (' ' | '\t' | '\r' | '\n')
;
NormalChar
: ~SpecialChar
;
And a class with a main method to test it all:
package antlrdemo;
import org.antlr.runtime.*;
public class ANTLRDemo {
public static void main(String[] args) throws RecognitionException {
String source = "name = \"Bob\"; \n"+
"msg = \"Hello ${name}\"; \n"+
"print(msg); \n"+
"print(\"Bye \\${for} now!\"); ";
ANTLRStringStream in = new ANTLRStringStream(source);
StrLexer lexer = new StrLexer(in);
CommonTokenStream tokens = new CommonTokenStream(lexer);
StrParser parser = new StrParser(tokens);
parser.parse();
}
}
which produces the following output:
> Hello Bob
> Bye \${for} now!
Again, I am no expert, but this (at least) gives you a way to solve it.
HTH.

Resources