How to use antlr4 to analyze the grammar of .aidl files? - parsing

I was assigned to analyze the grammar of .aidl files and extract the grammar elements using listener methods.
After a lot of thought, I finally worked out the following .g4 file:
grammar aidl3;
file : pack* imp* parcelable? interfa? ;
pack : 'package' WS+ PAC_NAME WS* ';' WS* ;
imp : 'import' WS+ IMP_NAME WS* ';' WS* ;
parcelable : 'parcelable' WS+ PARCE_NAME WS* ';' WS* ;
interfa : INTER_TAG? WS* 'interface' WS+ INTER_NAME WS* '{' WS* methods+ WS* '}' WS*;
methods : RETURN_TYPE WS+ METHOD_NAME WS* '(' WS* argmentsa* WS* argmentsb* WS* ')' WS* ';' WS* ;
argmentsa : TAG? WS* ARG_TYPE WS+ ARG_NAME WS* ',' WS* ;
argmentsb : TAG? WS* ARG_TYPE WS+ ARG_NAME WS* ;
PAC_NAME : ~[; \n\r]+ ;
//PAC_NAME : [_a-zA-Z] [_.a-zA-Z0-9]* ;
IMP_NAME : ~[ ;\n\r]+ ;
PARCE_NAME : ~[ ;\n\r.]+ ;
INTER_TAG : 'oneway';
INTER_NAME : ~[ ;\n\r{.]+ ;
RETURN_TYPE : ~[ ;\n\r.]+ ;
METHOD_NAME : ~[ ;\n\r(]+ ;
TAG : 'in' | 'out' | 'inout' ;
//ARG_TYPE : ~[) ,\n\r]+ ;
ARG_TYPE : [a-zA-Z] ~' '* | [a-zA-Z] ~' '* ' ' '[' ']' ;
ARG_NAME : ~[ ,\n\r).]+ ;
WS: [ \t\n\r];
However, I've run into a weird problem when I try to analyze .aidl files, e.g.
package android.view.accessibility;
oneway interface IAccessibilityInteractionConnection {
void findAccessibilityNodeInfoByAccessibilityId(long accessibilityNodeId, in Region bounds,
int interactionId, IAccessibilityInteractionConnectionCallback callback, int flags,
int interrogatingPid, long interrogatingTid, in MagnificationSpec spec);
void findAccessibilityNodeInfosByViewId(long accessibilityNodeId, String viewId,
in Region bounds, int interactionId, IAccessibilityInteractionConnectionCallback callback,
int flags, int interrogatingPid, long interrogatingTid, in MagnificationSpec spec);
void findAccessibilityNodeInfosByText(long accessibilityNodeId, String text, in Region bounds,
int interactionId, IAccessibilityInteractionConnectionCallback callback, int flags,
int interrogatingPid, long interrogatingTid, in MagnificationSpec spec);
void findFocus(long accessibilityNodeId, int focusType, in Region bounds, int interactionId,
IAccessibilityInteractionConnectionCallback callback, int flags, int interrogatingPid,
long interrogatingTid, in MagnificationSpec spec);
void focusSearch(long accessibilityNodeId, int direction, in Region bounds, int interactionId,
IAccessibilityInteractionConnectionCallback callback, int flags, int interrogatingPid,
long interrogatingTid, in MagnificationSpec spec);
void performAccessibilityAction(long accessibilityNodeId, int action, in Bundle arguments,
int interactionId, IAccessibilityInteractionConnectionCallback callback, int flags,
int interrogatingPid, long interrogatingTid);
}
it gives this output:
[#0,0:6='package',<'package'>,1:0]
[#1,7:7=' ',<WS>,1:7]
[#2,8:41='android.view.accessibility;\noneway',<ARG_TYPE>,1:8]
[#3,42:42=' ',<WS>,2:6]
[#4,43:51='interface',<'interface'>,2:7]
[#5,52:52=' ',<WS>,2:16]
[#6,53:87='IAccessibilityInteractionConnection',<PAC_NAME>,2:17]
[#7,88:88=' ',<WS>,2:52]
[#8,89:89='{',<'{'>,2:53]
[#9,90:90='\n',<WS>,2:54]
[#10,91:91=' ',<WS>,3:0]
[#11,92:92=' ',<WS>,3:1]
[#12,93:93=' ',<WS>,3:2]
[#13,94:94=' ',<WS>,3:3]
[#14,95:98='void',<PAC_NAME>,3:4]
[#15,99:99=' ',<WS>,3:8]
[#16,100:146='findAccessibilityNodeInfoByAccessibilityId(long',<PAC_NAME>,3:9]
[#17,147:147=' ',<WS>,3:56]
[#18,148:167='accessibilityNodeId,',<PAC_NAME>,3:57]
...
Look at the third line of the output, '[#2,8:41='android.view.accessibility;\noneway',<ARG_TYPE>,1:8]': 'ARG_TYPE' was used to match 'android.view.accessibility;\noneway'. How can that be? 'ARG_TYPE' never appears in the rule 'pack', which should have used 'PAC_NAME' to match 'android.view.accessibility'.
It would be nice if someone could help me figure this out, because I'm facing a close deadline.
I'm a new learner and I know that my .g4 file doesn't look good, so if possible, could you please tell me how to write the grammar for .aidl in a better way? Or even show me the right answer?
I would be really grateful for any help. Thanks!

ANTLR's lexer tries to create tokens with as many characters as possible. Since ARG_TYPE is able to match android.view.accessibility;\noneway (and no other rule can match more characters), an ARG_TYPE token is created. Only when two or more rules match the same number of characters does ANTLR choose the one defined first.
You must understand that the lexer does not create tokens based on what the parser is trying to match. Tokenisation is done independently of the parsing phase. Therefore, most of your rules that look like ~[ ;\n\r(]+ are way too broad.
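To make the longest-match behaviour concrete, here is a minimal Python sketch that imitates it with regular expressions. It is not ANTLR itself: the two patterns are simplified stand-ins for PAC_NAME and the first alternative of ARG_TYPE from the question, and `next_token` is a hypothetical helper written for this illustration.

```python
import re

# Simplified stand-ins for two lexer rules from the question, in definition order.
RULES = [
    ("PAC_NAME", re.compile(r"[^; \n\r]+")),     # ~[; \n\r]+
    ("ARG_TYPE", re.compile(r"[a-zA-Z][^ ]*")),  # [a-zA-Z] ~' '*
]

def next_token(text, pos=0):
    """Return (rule_name, lexeme) for the longest match at pos.

    A later rule only replaces an earlier one when it matches strictly
    more characters, so ties go to the rule defined first.
    """
    best = None
    for name, pattern in RULES:
        m = pattern.match(text, pos)
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best

source = "android.view.accessibility;\noneway interface"
print(next_token(source))        # ARG_TYPE wins: it also swallows ';' and the newline
print(next_token("void find"))   # both rules match 'void'; the first one defined wins
```

The parser rule that is currently being matched plays no part in this decision, which is why a token type that never appears in `pack` can still show up in the token stream.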
I suggest you take a look at an existing Java grammar, and use that in order to work with AIDL files.
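As a rough illustration of what "narrower" token rules look like, the following Python sketch tokenises AIDL-like text with only three kinds of tokens: identifiers, keywords and punctuation. The keyword set and the `tokenize` helper are assumptions made for this example, not part of any ANTLR grammar; the point is that each token is small and unambiguous, and the parser (not the lexer) decides whether an identifier is a package name, a type or a method name.

```python
import re

# Assumed keyword set, taken from the literals used in the question's grammar.
KEYWORDS = {"package", "import", "parcelable", "interface", "oneway", "in", "out", "inout"}

TOKEN_RE = re.compile(
    r"(?P<WS>\s+)"
    r"|(?P<IDENT>[A-Za-z_][A-Za-z_0-9]*)"
    r"|(?P<PUNCT>[.;,(){}\[\]])"
)

def tokenize(text):
    """Split text into (kind, lexeme) pairs, skipping whitespace."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at offset {pos}")
        pos = m.end()
        kind = m.lastgroup
        if kind == "WS":
            continue  # whitespace never reaches the parser
        lexeme = m.group()
        if kind == "IDENT" and lexeme in KEYWORDS:
            kind = "KEYWORD"  # keywords take priority over plain identifiers
        tokens.append((kind, lexeme))
    return tokens

for token in tokenize("package android.view;\noneway interface Foo {"):
    print(token)
```

With tokens like these, a qualified name is simply a sequence of identifiers separated by '.', which is how the Java grammars handle it.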
EDIT
If I take the grammar file posted above, and change:
formalParameter
: variableModifier* unannType variableDeclaratorId
;
into:
formalParameter
: 'in'? variableModifier* unannType variableDeclaratorId
;
and change:
interfaceModifier
: annotation
| 'public'
| 'protected'
| 'private'
| 'abstract'
| 'static'
| 'strictfp'
;
into:
interfaceModifier
: annotation
| 'public'
| 'protected'
| 'private'
| 'abstract'
| 'static'
| 'strictfp'
| 'oneway'
;
then your example file parses correctly.

Related

How to manage semantic rule of declaration of variable in bison

I have to build a compiler that translates the Java language into Python. I'm using the Flex and Bison tools. I created the Flex file and defined the syntactic grammar in Bison for the restricted set of features that I have to implement (such as arrays, loops, a class, logical and arithmetic operators, etc.).
I'm having trouble understanding how to handle semantic rules. For example, I should handle the semantics for import statement and variable declaration, add the variable in the symbol table and then handle the translation.
This is the structure of the symbol table in the symboltable.h module:
struct symtable {
    char *scopename; // key
    struct symtable2 *subtable; // secondary symbol table
    UT_hash_handle hh; // makes this structure hashable
};
struct symtable2 // secondary symbol structure
{
    char *name; // Name of the symbol (key)
    char *element; // it can be a variable or an array
    char *type; // Type assumed by the token (int, float, char, bool)
    char *value; // value assigned to the variable
    int dim; // Array size, it is zero in the case of a variable.
    UT_hash_handle hh; // makes this structure hashable
};
And this is the add symbol function:
/* Add a new symbol to the symbol table of the given scope. */
void add_symbol(char *name, char *current_scopename, char *element, char *current_type, char *current_value, int dim, int nr) {
    struct symtable *s;
    /* The keys are strings, so they are looked up with HASH_FIND_STR. */
    HASH_FIND_STR(symbols, current_scopename, s);
    if (s == NULL) {
        s = (struct symtable *)malloc(sizeof *s);
        s->scopename = current_scopename;
        s->subtable = NULL;
        HASH_ADD_KEYPTR(hh, symbols, s->scopename, strlen(s->scopename), s);
    }
    struct symtable2 *s2;
    /* Search the scope's own sub-table, which is where new symbols are added below. */
    HASH_FIND_STR(s->subtable, name, s2);
    if (s2 == NULL) {
        s2 = (struct symtable2 *)malloc(sizeof *s2);
        s2->name = name;
        s2->element = element;
        s2->type = current_type;
        s2->value = current_value;
        s2->dim = dim;
        HASH_ADD_KEYPTR(hh, s->subtable, s2->name, strlen(s2->name), s2);
    } else {
        if (strcmp(s2->type, current_type) == 0) {
            s2->value = current_value;
        } else {
            printf("\033[01;31mLine %i. [FATAL] SEMANTIC ERROR: assignment violates the primitive type of the variable.\033[00m\n", nr);
            printf("\n\n\033[01;31mParsing failed.\033[00m\n");
        }
    }
}
This is a part of the bison file with the grammar to handle import statement and the variable declaration:
%{
#include <stdio.h>
#include <ctype.h>
#include "symboltable.h"
FILE *f_ptr;
%}
%start program
%token NUMBER
%token ID
%token INT
%token FLOAT
%token DOUBLE
%token CHAR
%token IMPORT
%right ASSIGNOP
%left SCOR
%left SCAND
%left EQ NE
%left LT GT LE GE
%left ADD SUB
%left MULT DIV MOD
%right NOT
%left '(' ')' '[' ']'
%%
program
: ImportStatement GlobalVariableDeclarations
;
ImportStatement
: IMPORT LibraryName ';' { delete_file (); f_ptr = open_file (); fprintf(f_ptr, "import array \n"); }
;
LibraryName
: 'java.util.*'
;
GlobalVariableFieldDeclarations
: type GlobalVariableDeclarations ';'
;
GlobalVariableDeclarations
: GlobalVariableDeclaration
| GlobalVariableDeclarations ',' GlobalVariableDeclaration
;
GlobalVariableDeclaration
: VariableName
| VariableName ASSIGNOP VariableInitializer {if (typeChecking($1,$3)== 0) {$1= $3; $$=$1;}}
;
VariableName
: ID {$$ = $1 ;}
;
type
: INT
| CHAR
| FLOAT
| DOUBLE
| BOOLEAN
;
VariableInitializers
: VariableInitializer
| VariableInitializers ',' VariableInitializer
;
VariableInitializer
: ExpressionStatement
;
ExpressionStatement
: VariableName
| NUMBER
| ArithmeticExpression
| RelationalExpression
| BooleanExpression
;
ArithmeticExpression
: ExpressionStatement ADD ExpressionStatement
| ExpressionStatement SUB ExpressionStatement
| ExpressionStatement MULT ExpressionStatement
| ExpressionStatement DIV ExpressionStatement
| ExpressionStatement MOD ExpressionStatement
;
RelationalExpression
: ExpressionStatement GT ExpressionStatement
| ExpressionStatement LT ExpressionStatement
| ExpressionStatement GE ExpressionStatement
| ExpressionStatement LE ExpressionStatement
| ExpressionStatement EQ ExpressionStatement
| ExpressionStatement NE ExpressionStatement
;
BooleanExpression
: ExpressionStatement SCAND ExpressionStatement
| ExpressionStatement SCOR ExpressionStatement
| NOT ExpressionStatement
;
%%
int yyerror (char *s)
{
    printf ("%s \n",s);
    return 0;
}
int main (void) {
    return yyparse();
}
#include <string.h> /* strcmp */
/* Returns 0 if both names are variables of the same type, 1 if their types differ, -1 on error. */
int typeChecking (char *variable1, char *variable2) {
    /* scopename and find_symbol() are assumed to come from symboltable.h */
    struct symtable2 *s1 = find_symbol (scopename, variable1);
    if (s1 == NULL) {
        printf("\n\n\033[01;31mVariable 1 not defined.\033[00m\n");
        return -1;
    }
    struct symtable2 *s2 = find_symbol (scopename, variable2);
    if (s2 == NULL) {
        printf("\n\n\033[01;31mVariable 2 not defined.\033[00m\n");
        return -1;
    }
    /* element and type are strings, so they are compared with strcmp rather than == */
    if (strcmp(s1->element, "variable") == 0 && strcmp(s2->element, "variable") == 0) {
        return strcmp(s1->type, s2->type) == 0 ? 0 : 1;
    }
    printf("\n\n\033[01;31m Different elements.\033[00m\n");
    return -1;
}
I am a beginner with Bison's syntax for semantic actions, and I have doubts about the semantic rules for the following productions:
GlobalVariableFieldDeclarations
: type GlobalVariableDeclarations ';'
;
GlobalVariableDeclarations
: GlobalVariableDeclaration
| GlobalVariableDeclarations ',' GlobalVariableDeclaration
;
GlobalVariableDeclaration
: VariableName
| VariableName ASSIGNOP VariableInitializer {if (typeChecking($1,$3)== 0) {$1= $3; $$=$1;}}
;
VariableName
: ID {$$ = $1 ;}
;
Is it correct to handle the semantics of the GlobalVariableDeclaration production in this way? And how can I pass the required parameter values to the symbol table via the add_symbol function? (Or better, how can I obtain the required parameters from the productions so that I can pass them to the add_symbol function that I have implemented?) Forgive me, I am a beginner and many things about semantics are not clear to me. I hope you have the patience to help me. Thank you in advance.
You should use Bison to build an AST and then perform semantic analysis on the tree instead of in the grammar. Building an AST lets you run the analysis on richer data structures than just the grammar rules you wrote in Bison.
Once you have the AST for the input, you can write rules that convert it into a Python program with the same behavior.
Here is an example of a Bison/Flex compiler for the Decaf language that might give you some ideas https://github.com/davidcox143/Decaf-Compiler
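To show the shape of that approach without any Bison machinery, here is a small Python sketch of the two passes that run after parsing. The node classes (`Num`, `Name`, `VarDecl`) and the functions `analyze` and `emit` are hypothetical names invented for this illustration; in the real compiler the Bison actions would only build equivalent C structs, and the checks would run afterwards.

```python
from dataclasses import dataclass
from typing import Optional, Union

@dataclass
class Num:
    value: str
    type: str          # "int" or "float"

@dataclass
class Name:
    ident: str

@dataclass
class VarDecl:
    type: str
    name: str
    init: Optional[Union[Num, Name]] = None

def analyze(decls):
    """Semantic pass: fill the symbol table and collect type errors."""
    symbols, errors = {}, []
    for decl in decls:
        if decl.name in symbols:
            errors.append(f"{decl.name}: already declared")
            continue
        init_type = None
        if isinstance(decl.init, Num):
            init_type = decl.init.type
        elif isinstance(decl.init, Name):
            init_type = symbols.get(decl.init.ident)
            if init_type is None:
                errors.append(f"{decl.init.ident}: not declared")
        if init_type is not None and init_type != decl.type:
            errors.append(f"{decl.name}: cannot initialise {decl.type} from {init_type}")
        symbols[decl.name] = decl.type
    return symbols, errors

def emit(decls):
    """Translation pass: produce one Python assignment per declaration."""
    lines = []
    for decl in decls:
        if decl.init is None:
            rhs = "None"
        elif isinstance(decl.init, Name):
            rhs = decl.init.ident
        else:
            rhs = decl.init.value
        lines.append(f"{decl.name} = {rhs}")
    return lines

# Tree for: int a = 1; int b = a; float c = a;
tree = [VarDecl("int", "a", Num("1", "int")),
        VarDecl("int", "b", Name("a")),
        VarDecl("float", "c", Name("a"))]
symbols, errors = analyze(tree)
print(symbols, errors)
print("\n".join(emit(tree)))
```

Because the checks see whole declarations rather than individual grammar reductions, the declared type is available at the point where each variable is added to the symbol table, which is awkward to arrange inside the grammar actions.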

In antlr4.7 how to parse a rule like ISO 8601 interval "P3M2D" ahead of an "ID" rule

I am trying to parse ISO 8601 period expressions like "P3M2D", using antlr4. But I am hitting some kind of roadblock and will appreciate help. I am rather new to both antlr and compilers.
My grammar is as below. I have combined the lexer and parser rules in one go here:
grammar test_iso ;
// import testLexerRules ;
iso : ( date_expr NEWLINE)* EOF;
date_expr
: date_expr op=( '+' | '-' ) iso8601_interval #dateexpr_Interval
| date_expr op='-' date_expr #dateexpr_Diff
| DATETIME_NAME #dateexpr_Named
| '(' inner=date_expr ')' #dateexpr_Paren
;
///////////////////////////////////////////
iso8601_interval
: iso8601_interval_d
{ System.out.println("ISO8601_INTERVAL DATE seen " + $text);}
;
iso8601_interval_d
: 'P' ( y=NUMBER_INT 'Y' )? ( m=NUMBER_INT 'M' )? ( w=NUMBER_INT 'W' )? ( d=NUMBER_INT 'D' )?
;
///////////////////////////////////////////
// in separate file : test_lexer.g4
// lexer grammar testLexerRules ;
///////////////////////////////////////////
fragment
TODAY
: 'today' | 'TODAY'
;
fragment
NOW
: 'now' | 'NOW'
;
DATETIME_NAME
: TODAY
| NOW
;
///////////////////////////////////////////
NUMBER_INT
: '-'? INT // -3, 45
;
fragment
DIGIT : [0-9] ;
fragment
INT : '0' | [1-9] DIGIT* ;
//////////////////////////////////////////////
//
// identifiers
//
ID
: ALPHA ALPH_NUM*
{ System.out.println("ID seen " + getText()); }
;
ID_SQLFUNC
: 'h$' ALPHA_UPPER ALPHA_UPPER_NUM*
{ System.out.println("SQL FUNC seen " + getText()); }
;
fragment
ALPHA : [a-zA-Z] ;
fragment
ALPH_NUM : [a-zA-Z_0-9] ;
fragment
ALPHA_UPPER : [A-Z] ;
fragment
ALPHA_UPPER_NUM : [A-Z_0-9] ;
//////////////////////////////////////////////
NEWLINE : '\r\n' ;
WS : [ \t]+ -> skip ;
In a test run, the input never reaches the iso8601_interval_d rule; it always goes to the ID rule.
C:\lab>java org.antlr.v4.gui.TestRig test_iso iso -tokens -tree
now + P3M2D
^Z
ID seen P3M2D
[#0,0:2='now',<DATETIME_NAME>,1:0]
[#1,4:4='+',<'+'>,1:4]
[#2,6:10='P3M2D',<ID>,1:6]
[#3,11:12='\r\n',<'
'>,1:11]
[#4,13:12='<EOF>',<EOF>,2:0]
line 1:6 mismatched input 'P3M2D' expecting 'P'
ISO8601_INTERVAL DATE seen P3M2D
(iso (date_expr (date_expr now) + (iso8601_interval (iso8601_interval_d P3M2D))) \r\n <EOF>)
If I remove the "ID" rule and run again, it parses as desired:
now + P3M2D
^Z
[#0,0:2='now',<DATETIME_NAME>,1:0]
[#1,4:4='+',<'+'>,1:4]
[#2,6:6='P',<'P'>,1:6]
[#3,7:7='3',<NUMBER_INT>,1:7]
[#4,8:8='M',<'M'>,1:8]
[#5,9:9='2',<NUMBER_INT>,1:9]
[#6,10:10='D',<'D'>,1:10]
[#7,11:12='\r\n',<'
'>,1:11]
[#8,13:12='<EOF>',<EOF>,2:0]
ISO8601_INTERVAL DATE seen P3M2D
(iso (date_expr (date_expr now) + (iso8601_interval (iso8601_interval_d P 3 M 2 D))) \r\n <EOF>)
I also tried prefixing a special character like "#" in the parser rule
iso8601_interval_d
: '#P' ( y=NUMBER_INT 'Y' )? ( m=NUMBER_INT 'M' )? ( w=NUMBER_INT 'W' )? ( d=NUMBER_INT 'D' )?
;
but now I get a different kind of failure:
now + #P3M2D
^Z
ID seen M2D
[#0,0:2='now',<DATETIME_NAME>,1:0]
[#1,4:4='+',<'+'>,1:4]
[#2,6:7='#P',<'#P'>,1:6]
[#3,8:8='3',<NUMBER_INT>,1:8]
[#4,9:11='M2D',<ID>,1:9]
[#5,12:13='\r\n',<'
'>,1:12]
[#6,14:13='<EOF>',<EOF>,2:0]
line 1:9 no viable alternative at input '3M2D'
ISO8601_INTERVAL DATE seen #P3M2D
(iso (date_expr (date_expr now) + (iso8601_interval (iso8601_interval_d #P 3 M2D))) \r\n <EOF>)
I am sure I am not the first one to hit upon something like this. What is the antlr idiom here?
EDIT -- I need the ID token elsewhere in other parts of my grammar that I have omitted here, to highlight the problem I am facing.
As others have already found, the issue is the ID token: the ISO 8601 duration syntax is also a valid ID. Besides the solution suggested by @Mike, if a so-called island grammar suits your needs, you can use ANTLR's lexical modes to keep the ID lexer rule out of the way while an ISO date expression is being scanned.
Below is an example of how it could work:
parser grammar iso;
options { tokenVocab=iso_lexer; }
iso : ISO_BEGIN ( date_expr NEWLINE)* ISO_END;
date_expr
: date_expr op=( '+' | '-' ) iso8601_interval #dateexpr_Interval
| date_expr op='-' date_expr #dateexpr_Diff
| DATETIME_NAME #dateexpr_Named
| '(' inner=date_expr ')' #dateexpr_Paren
;
///////////////////////////////////////////
iso8601_interval
: iso8601_interval_d
{ System.out.println("ISO8601_INTERVAL DATE seen " + $text);}
;
iso8601_interval_d
: 'P' ( y=NUMBER_INT 'Y' )? ( m=NUMBER_INT 'M' )? ( w=NUMBER_INT 'W' )? ( d=NUMBER_INT 'D' )?
;
then in the lexer
lexer grammar iso_lexer;
//
// identifiers (in DEFAULT_MODE)
//
ISO_BEGIN
: '<#' -> mode(ISO)
;
ID
: ALPHA ALPH_NUM*
{ System.out.println("ID seen " + getText()); }
;
ID_SQLFUNC
: 'h$' ALPHA_UPPER ALPHA_UPPER_NUM*
{ System.out.println("SQL FUNC seen " + getText()); }
;
WS0 : [ \t]+ -> skip ;
// all the following token are scanned only when iso mode is active
mode ISO;
ISO_END
: '#>' -> mode(DEFAULT_MODE)
;
WS1 : [ \t]+ -> skip ;
NEWLINE : '\r'? '\n' ;
ADD : '+' ;
SUB : '-' ;
LPAREN : '(' ;
RPAREN : ')' ;
P : 'P' ;
Y : 'Y' ;
M : 'M' ;
W : 'W' ;
D : 'D' ;
DATETIME_NAME
: TODAY
| NOW
;
fragment TODAY: 'today' | 'TODAY' ;
fragment NOW : 'now' | 'NOW' ;
///////////////////////////////////////////
NUMBER_INT
: '-'? INT // -3, 45
;
fragment DIGIT : [0-9] ;
fragment INT : '0' | [1-9] DIGIT* ;
//////////////////////////////////////////////
fragment ALPHA : [a-zA-Z] ;
fragment ALPH_NUM : [a-zA-Z_0-9] ;
fragment ALPHA_UPPER : [A-Z] ;
fragment ALPHA_UPPER_NUM : [A-Z_0-9] ;
Such a grammar can parse expressions like
Pluton Planet <# now + P10Y
#>
I changed the parser rule iso a bit to demonstrate mixing IDs and periods.
Hope this helps.
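For readers who have not used lexical modes, this Python sketch mimics the idea with two rule tables and a current-mode variable. It is a simplification rather than a port of the grammar above: the first matching rule wins instead of the longest, P, Y, M, W and D are collapsed into one UNIT kind, and `MODES` and `tokenize` are names made up for this example.

```python
import re

# Each rule is (token kind, regex, mode to switch to after the match or None).
MODES = {
    "DEFAULT": [
        ("ISO_BEGIN", r"<#", "ISO"),
        ("ID", r"[a-zA-Z][a-zA-Z_0-9]*", None),
        ("WS", r"[ \t\r\n]+", None),
    ],
    "ISO": [
        ("ISO_END", r"#>", "DEFAULT"),
        ("WS", r"[ \t]+", None),
        ("NEWLINE", r"\r?\n", None),
        ("DATETIME_NAME", r"today|TODAY|now|NOW", None),
        ("NUMBER_INT", r"-?(?:0|[1-9][0-9]*)", None),
        ("OP", r"[+\-()]", None),
        ("UNIT", r"[PYMWD]", None),
    ],
}

def tokenize(text):
    """Tokenise text, switching rule tables when a mode-changing token is seen."""
    mode, pos, out = "DEFAULT", 0, []
    while pos < len(text):
        for kind, pattern, next_mode in MODES[mode]:
            m = re.compile(pattern).match(text, pos)
            if m:
                break
        else:
            raise ValueError(f"no rule matches at offset {pos} in mode {mode}")
        pos = m.end()
        if kind != "WS":
            out.append((kind, m.group()))
        if next_mode:
            mode = next_mode
    return out

for token in tokenize("Pluton <# now + P10Y\n#>"):
    print(token)
```

Outside the `<# ... #>` island "Pluton" is an ordinary ID, while inside it the ID rule does not exist, so P10Y can only be split into its parts.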
What you want to do is not possible as written. ID matches the same input as iso8601_interval. In such cases ANTLR4 picks the longest match, which is ID, as it can match an unlimited number of characters.
The only way you could possibly make it work in the grammar is to exclude P as a possible ID introducer, so that it can be used exclusively for durations.
Another option is a post-processing step. Parse durations like any other identifier, and in your semantic phase check all those IDs that look like a duration. This is probably the best solution.
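The post-processing check can be very small. The sketch below is written in Python for brevity (the same regular expression works with java.util.regex in a Java listener); `parse_duration` is a hypothetical helper, and it assumes unsigned components and at least one component after the P, which is slightly stricter than the parser rule in the question.

```python
import re

# P followed by optional year, month, week and day components, in that order.
DURATION_RE = re.compile(
    r"P(?:(?P<years>\d+)Y)?(?:(?P<months>\d+)M)?(?:(?P<weeks>\d+)W)?(?:(?P<days>\d+)D)?"
)

def parse_duration(ident):
    """Return the duration components of an ID token's text, or None if it is a plain ID."""
    m = DURATION_RE.fullmatch(ident)
    if m is None:
        return None
    parts = {unit: int(value) for unit, value in m.groupdict().items() if value is not None}
    return parts or None  # a bare "P" is not treated as a duration

print(parse_duration("P3M2D"))   # a duration
print(parse_duration("Pluton"))  # an ordinary identifier
```

In the semantic phase you would call this on the text of each ID that appears where an interval is allowed, and report an error if it returns None there.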

"No viable alternative at input" error for ANTLR 4 JSON grammar

I am trying to change the STRING part of pair in object into a CamelString, but it fails and reports "No viable alternative at input".
I have tried using my CamelString as an independent grammar, and it works well. I think this means there is an ambiguity in my grammar, but I cannot understand why.
For the test input
{
'BaaaBcccCdddd':'cc'
}
The error is
line 2:2 no viable alternative at input '{'BaaaBcccCdddd''
The following is my grammar. It's almost the same as the standard JSON grammar for ANTLR 4.
/** Taken from "The Definitive ANTLR 4 Reference" by Terence Parr */
// Derived from http://json.org
grammar Json;
json: object
| array
;
object
: '{' pair (',' pair)* '}'
| '{' '}' // empty object
;
pair : camel_string ':' value;
camel_string : '\'' (camel_body)*? '\'';
STRING
: '\'' (ESC | ~['\\])* '\'';
camel_body: CAMEL_BODY;
CAMEL_START: [a-z] ALPHA_NUM_LOWER*;
CAMEL_BODY: [A-Z] ALPHA_NUM_LOWER*;
CAMEL_END: [A-Z]+;
fragment ALPHA_NUM_LOWER : [0-9a-z];
array
: '[' value (',' value)* ']'
| '[' ']' // empty array
;
value
: STRING
| NUMBER
| object // recursion
| array // recursion
| 'true' // keywords
| 'false'
| 'null'
;
fragment ESC : '\\' (["\\/bfnrt] | UNICODE) ;
fragment UNICODE : 'u' HEX HEX HEX HEX ;
fragment HEX : [0-9a-fA-F] ;
NUMBER
: '-'? INT '.' [0-9]+ EXP? // 1.35, 1.35E-9, 0.3, -4.5
| '-'? INT EXP // 1e10 -3e4
| '-'? INT // -3, 45
;
fragment INT : '0' | [1-9] [0-9]* ; // no leading zeros
fragment EXP : [Ee] [+\-]? INT ; // \- since - means "range" inside [...]
WS : [ \t\n\r]+ -> skip ;

ANTLR Expression List Conflict

Here is a basic structure for simple nested expressions...
infix : prefix (INFIX_OP^ prefix)*;
prefix : postfix | (PREFIX_OP postfix) -> ^(PREFIX_OP postfix);
postfix : INT (POSTFIX_OP^)?;
POSTFIX_OP : '!';
INFIX_OP : '+';
PREFIX_OP : '-';
INT : '0'..'9'*;
If I wanted to create a list of these expressions I could use the following...
list: infix (',' infix)*;
Here we use the ',' as a delimiter.
I want to be able to build a list of expressions without any delimiter.
So if I have the string 4 5 2+3 1 6 I would like to be able to interpret that as (4) (5) ^(+ 2 3) (1) (6)
The problem is that both 4 and 2+3 have the same first symbol (INT) so I have a conflict. I'm trying to figure out how I can resolve this.
EDIT
I've almost figured it out, just having trouble coming up with the correct rewrite for a certain condition...
expr: (a=atom -> $a)
(op='+' b=atom-> {$a.text != "+" && $b.text != "+"}? ^($op $expr $b) // infix
-> {$b.text != "+"}? // HAVING TROUBLE COMING UP WITH THIS CORRECT REWRITE!
-> $expr $op $b)*; // simple list
atom: INT | '+';
INT : '0'..'9'+;
This will parse 1+2+3++4+5+ as ^(+ ^(+ 1 2) 3) (+) (+) ^(+ 4 5) (+), which is what I want.
Now I'm trying to finish my rewrite rule so that ++1+2 will parse as (+) (+) ^(+ 1 2).
Overall I want a list of tokens and to find all the infix expressions, and leave the rest as a list.
There's a problem with your INT rule:
INT : '0'..'9'*;
which matches an empty string. It should always match at least 1 char:
INT : '0'..'9'+;
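The difference between the two quantifiers can be checked quickly with Python regular expressions, used here only as an analogy for the lexer rules (this is not ANTLR output):

```python
import re

star = re.compile(r"[0-9]*")  # analogue of INT : '0'..'9'* ;
plus = re.compile(r"[0-9]+")  # analogue of INT : '0'..'9'+ ;

# At a position where no digit follows, the '*' version still "succeeds"
# with an empty match, so a lexer could emit a token without consuming input.
empty = star.match("+1")
print(repr(empty.group()), empty.end())

# The '+' version fails there, leaving the '+' for another rule.
print(plus.match("+1"))
print(plus.match("42+").group())
```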
Besides that, it seems to work just fine.
Given the grammar:
grammar T;
options {
output=AST;
}
tokens {
LIST;
}
parse : list EOF -> list;
list : infix+ -> ^(LIST infix+);
infix : prefix (INFIX_OP^ prefix)*;
prefix : postfix -> postfix
| PREFIX_OP postfix -> ^(PREFIX_OP postfix)
;
postfix : INT (POSTFIX_OP^)?;
POSTFIX_OP : '!';
INFIX_OP : '+';
PREFIX_OP : '-';
INT : '0'..'9'+;
SPACE : ' ' {skip();};
which parses the input:
4 5 2+3 1 6
into the following AST:
EDIT
Introducing operators that can both be used in post- and infix expressions will make your list ambiguous (well, in my version below, that is... :)). So, I'll keep the comma in there for this demo:
grammar T;
options {
output=AST;
}
tokens {
LIST;
P_ADD;
}
parse : list EOF -> list;
list : expr (',' expr)* -> ^(LIST expr+);
expr : postfix_expr;
postfix_expr : (infix_expr -> infix_expr) (ADD -> ^(P_ADD infix_expr))?;
infix_expr : atom ((ADD | SUB)^ atom)*;
atom : INT;
ADD : '+';
SUB : '-';
INT : '0'..'9'+;
SPACE : ' ' {skip();};
In the grammar above, the + as an infix operator has precedence over the postfix-version, as you can see when parsing input like 2+5+:

Support optional quotes in a Boolean expression

Background
I have been using ANTLRWorks (V 1.4.3) for a few days now and trying to write a simple Boolean parser. The combined lexer/parser grammar below works well for most of the requirements including support for quoted white-spaced text as operands for a Boolean expression.
Problem
I would like the grammar to work for white-spaced operands without the need of quotes.
Example
For example, expression-
"left right" AND center
should have the same parse tree even after dropping the quotes-
left right AND center.
I have been learning about backtracking, predicates etc but can't seem to find a solution.
Code
Below is the grammar I have got so far. Any feedback on the foolish mistakes is appreciated :).
Lexer/Parser Grammar
grammar boolean_expr;
options {
TokenLabelType=CommonToken;
output=AST;
ASTLabelType=CommonTree;
}
@modifier{public}
@ctorModifier{public}
@lexer::namespace{Org.CSharp.Parsers}
@parser::namespace{Org.CSharp.Parsers}
public
evaluator
: expr EOF
;
public
expr
: orexpr
;
public
orexpr
: andexpr (OR^ andexpr)*
;
public
andexpr
: notexpr (AND^ notexpr)*
;
public
notexpr
: (NOT^)? atom
;
public
atom
: word | LPAREN! expr RPAREN!
;
public
word
: QUOTED_TEXT | TEXT
;
/*
* Lexer Rules
*/
LPAREN
: '('
;
RPAREN
: ')'
;
AND
: 'AND'
;
OR
: 'OR'
;
NOT
: 'NOT'
;
WS
: ( ' ' | '\t' | '\r' | '\n') {$channel=HIDDEN;}
;
QUOTED_TEXT
: '"' (LETTER | DIGIT | ' ' | ',' | '-')+ '"'
;
TEXT
: (LETTER | DIGIT)+
;
/*
Fragment lexer rules can be used by other lexer rules, but do not return tokens by themselves
*/
fragment DIGIT
: ('0'..'9')
;
fragment LOWER
: ('a'..'z')
;
fragment UPPER
: ('A'..'Z')
;
fragment LETTER
: LOWER | UPPER
;
Simply let TEXT in your atom rule match once or more: TEXT+. When it matches a TEXT token more than once, you'll also want to create a custom root node for these TEXT tokens (I added an imaginary token called WORD in the grammar below).
grammar boolean_expr;
options {
output=AST;
}
tokens {
WORD;
}
evaluator
: expr EOF
;
...
word
: QUOTED_TEXT
| TEXT+ -> ^(WORD TEXT+)
;
...
Your input "left right AND center" would now be parsed as follows:
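As an illustration of the resulting tree shape (a hand-written Python sketch, not ANTLR output), the recursive-descent parser below follows the same rule structure and groups consecutive TEXT tokens under a WORD node. The tuple representation and the `tokenize` and `parse` helpers are assumptions made for this example.

```python
import re

TOKEN_RE = re.compile(r'\s*(?:(?P<QUOTED>"[^"]*")|(?P<PAREN>[()])|(?P<TEXT>[A-Za-z0-9]+))')
OPERATORS = {"AND", "OR", "NOT"}

def tokenize(text):
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected input at offset {pos}")
        pos = m.end()
        kind, lexeme = m.lastgroup, m.group(m.lastgroup)
        if kind == "PAREN" or (kind == "TEXT" and lexeme in OPERATORS):
            kind = lexeme  # operators and parentheses are their own token kinds
        tokens.append((kind, lexeme))
    return tokens

def parse(text):
    tokens, pos = tokenize(text), 0

    def peek():
        return tokens[pos][0] if pos < len(tokens) else "EOF"

    def take():
        nonlocal pos
        pos += 1
        return tokens[pos - 1]

    def orexpr():
        node = andexpr()
        while peek() == "OR":
            take()
            node = ("OR", node, andexpr())
        return node

    def andexpr():
        node = notexpr()
        while peek() == "AND":
            take()
            node = ("AND", node, notexpr())
        return node

    def notexpr():
        if peek() == "NOT":
            take()
            return ("NOT", atom())
        return atom()

    def atom():
        if peek() == "(":
            take()
            node = orexpr()
            if peek() != ")":
                raise SyntaxError("expected ')'")
            take()
            return node
        if peek() == "QUOTED":
            return ("QUOTED_TEXT", take()[1])
        words = []
        while peek() == "TEXT":  # TEXT+ : collect every adjacent word
            words.append(take()[1])
        if not words:
            raise SyntaxError(f"unexpected {peek()}")
        return ("WORD", *words)

    tree = orexpr()
    if peek() != "EOF":
        raise SyntaxError(f"unexpected {peek()}")
    return tree

print(parse("left right AND center"))
```

Note that the quoted and unquoted forms do not produce identical nodes (QUOTED_TEXT versus WORD with two children), so a later pass would have to normalise them if the trees need to compare equal.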
