Generating a parser for expressions

I am trying to generate a JavaScript parser for the EBNF grammar described in this Microsoft article. The EBNF specified in the article does not work when I use it as written, so I have tried to simplify it to make it work with the REx parser generator.
The goal is to be able to parse expressions like these in JavaScript and evaluate them to True or False:
AttributeA > 2 - The value of AttributeA is greater than 2
HasCategory(Assembly) - The node has Category Assembly
Assembly - The node has Category Assembly
HasValue(AttributeA) - The attribute AttributeA has a value. It's not undefined.
AttributeA < AttributeB - The value of attribute AttributeA is less than the value of attribute AttributeB
IsReference - The value of the attribute IsReference is True
AttributeA + 2 > 5 and AttributeB - 5 != 7
AttributeA * 1.25 >= 500
I am using the REx parser generator online here: https://bottlecaps.de/rex/. If you have suggestions for other parser generators that produce JavaScript I would appreciate some links to where I can find them.
The issue I'm struggling with is the definition of MethodCall. I have tried many different definitions, but all of them fail. When I remove the MethodCall and MethodArgs definitions, the REx parser generator produces a parser.
I would greatly appreciate any help cracking this problem.
Below is the grammar as far as I have been able to get it.
Expression
    ::= BinaryExpression | MethodCall | "(" Expression ")" | Number
BinaryExpression
    ::= RelationalExpression ( ( '=' | '!=' ) RelationalExpression )*
RelationalExpression
    ::= AdditiveExpression ( ( '<' | '>' | '<=' | '>=' | 'and' | 'or' ) AdditiveExpression )*
AdditiveExpression
    ::= MultiplicativeExpression ( ( '+' | '-' ) MultiplicativeExpression )*
MultiplicativeExpression
    ::= UnaryExpression ( ( '*' | '/' | '%' ) UnaryExpression )*
UnaryExpression
    ::= "!" Identifier | "+" Identifier | "-" Identifier | Identifier
MethodCall
    ::= Identifier "(" MethodArgs* ")"
MethodArgs
    ::= Expression | Expression "," MethodArgs
Identifier
    ::= Letter ( Letter | Digit | "_" )*
Number
    ::= Digit ('.' Digit) | (Digit)*
<?TOKENS?>
Letter
    ::= [A-Za-z]
Digit
    ::= [0-9]
Here are some of the different versions for the MethodCall definition I have tried but all of them fail.
MethodCall
    ::= Identifier "(" MethodArgs? ")"
MethodArgs
    ::= Expression ("," MethodArgs)*
MethodCall
    ::= Identifier "(" MethodArgs ")"
MethodArgs
    ::= ( Expression ("," MethodArgs)* )?
MethodCall
    ::= MethodName "(" MethodArgs? ")"
MethodName
    ::= Identifier
MethodArgs
    ::= Expression ("," MethodArgs)*
MethodCall
    ::= Identifier "(" MethodArgs ")"
MethodArgs
    ::= Expression ("," MethodArgs)* | ""
MethodCall
    ::= Identifier MethodArgs
MethodArgs
    ::= "(" (Expression ("," MethodArgs)* | "") ")"
MethodCall
    ::= Identifier "(" MethodArgs ")"
MethodArgs
    ::= Expression | Expression "," MethodArgs | ""
MethodCall
    ::= Identifier "(" Expression ")"
I have looked at the grammars of a number of other languages to see how they do it, but with no luck so far:
https://bottlecaps.de/rex/EcmaScript.ebnf
https://bottlecaps.de/rex/Java.ebnf
https://cs.au.dk/~amoeller/RegAut/JavaBNF.html
https://docs.python.org/3/reference/compound_stmts.html#function-definitions
Update
Just wanted to let you know how this turned out. I struggled to get the EBNF grammar to do what I needed, so I took a look at Nearley as @rici suggested, converted my grammar into the Nearley grammar syntax, and got the tooling to work. It was simply a much better choice for my project. The documentation is very good, the tooling is great, and the error messages are very helpful. So a huge thanks to @rici for suggesting Nearley.
Below is the grammar I have implemented. I have tested with the following inputs:
'2 + 4', '2 + 4 - 6', '(2 + 4)', '!true', '!(true)', 'hasCategory(test)', 'hasCategory(test,test2)', 'hasCategory( test , test2 )', 'hasCategory(test,test2, test3)', 'IsReference', 'IsReference()', '2 * 4', '(2 / 4)', '2 * 4 + 2', '(2 / 4) + 2', '2 > 4', '2 >= 2', '2 = 4', '2 == 2', '2 != 4', '2 !== 2', '(2 * 4 + 2) > 4', '(2 * 4 + 2) > (4 + 10)', 'true', 'true or false', 'true || false', 'true and false', 'true && false', '(true or false)', '!(true or false)', '2 != 1+1', '2 != (1+1)', '2 != (1+2)', '(2 > 2 or (2 != 1+1))',
@builtin "whitespace.ne" # `_` means arbitrary amount of whitespace
@builtin "number.ne" # `int`, `decimal`, and `percentage` number primitives
@builtin "postprocessors.ne"
@{%
// Build a postprocessor that wraps a match as a methodCall node.
// nameId and argsId are indexes into the matched data array.
function methodCall(nameId, argsId = -1) {
  return function(data) {
    return {
      type: 'methodCall',
      name: data[nameId],
      args: argsId === -1 ? [] : data[argsId]
    };
  };
}
// Build a postprocessor that wraps an already evaluated result as a value node.
function value() {
  return function(data) {
    return {
      type: 'value',
      value: data[0]
    };
  };
}
%}
expression ->
methodCall {% id %}
| relationalExpression {% value() %}
| booleanExpression {% value() %}
| _ identifier _ {% methodCall(1) %}
booleanExpression ->
parentheses {% id %}
| parentheses _ "and"i _ parentheses {% d => d[0] && d[4] %}
| parentheses _ "&&" _ parentheses {% d => d[0] && d[4] %}
| parentheses _ "or"i _ parentheses {% d => d[0] || d[4] %}
| parentheses _ "||" _ parentheses {% d => d[0] || d[4] %}
parentheses ->
_ "(" relationalExpression ")" _ {% nth(2) %}
| _ "(" booleanExpression ")" _ {% nth(2) %}
| unaryExpression {% id %}
relationalExpression ->
_ additiveExpression _ {% nth(1) %}
| relationalExpression _ "=" _ additiveExpression {% d => d[0] == d[4] %}
| relationalExpression _ "==" _ additiveExpression {% d => d[0] == d[4] %}
| relationalExpression _ "!=" _ additiveExpression {% d => d[0] != d[4] %}
| relationalExpression _ "!==" _ additiveExpression {% d => d[0] != d[4] %}
| relationalExpression _ "<" _ additiveExpression {% d => d[0] < d[4] %}
| relationalExpression _ ">" _ additiveExpression {% d => d[0] > d[4] %}
| relationalExpression _ "<=" _ additiveExpression {% d => d[0] <= d[4] %}
| relationalExpression _ ">=" _ additiveExpression {% d => d[0] >= d[4] %}
additiveExpression ->
_ multiplicativeExpression _ {% nth(1) %}
| additiveExpression _ "+" _ multiplicativeExpression {% d => d[0] + d[4] %}
| additiveExpression _ "-" _ multiplicativeExpression {% d => d[0] - d[4] %}
multiplicativeExpression ->
_ parentheses _ {% nth(1) %}
| parentheses _ "*" _ parentheses {% d => d[0] * d[4] %}
| parentheses _ "/" _ parentheses {% d => d[0] / d[4] %}
| parentheses _ "%" _ parentheses {% d => d[0] % d[4] %}
unaryExpression ->
_ "!" _ expression _ {% d => !d[3] %}
| _ decimal _ {% nth(1) %}
| _ unsigned_int _ {% nth(1) %}
| _ boolean _ {% nth(1) %}
| _ identifier _ {% nth(1) %}
methodCall ->
identifier "(" methodArgs ")" {% methodCall(0, 2) %}
| identifier "(" _ ")" {% methodCall(0) %}
methodArgs ->
_ identifier _ {% d => [d[1]] %}
| _ identifier _ "," _ methodArgs _ {% d => [d[1]].concat(d[5]) %}
boolean ->
"true"i {% () => true %}
| "false"i {% () => false %}
identifier ->
[A-Za-z0-9_]:+ {% (data, l, reject) => {
var ident = data[0].join('');
if (ident.toLowerCase() === 'true' || ident.toLowerCase() === 'false') {
return reject;
} else {
return ident;
}
}
%}
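The postprocessors above return plain objects of type 'methodCall' or 'value', so something still has to turn them into True or False for a given node. Below is a minimal sketch of such a step, not part of the grammar itself. The node shape (`categories`, `attributes`), the `methods` table, and the `evaluate` function are assumptions made up for illustration.

```javascript
// Hypothetical method table; keys are lower-cased method names.
const methods = new Map([
  // True when the node has every listed category.
  ['hascategory', (node, args) => args.every((c) => node.categories.includes(c))],
  // True when every listed attribute is defined on the node.
  ['hasvalue', (node, args) => args.every((a) => node.attributes[a] !== undefined)],
]);

// Evaluate one parse result ({type: 'value'} or {type: 'methodCall'}) to a boolean.
function evaluate(ast, node) {
  if (ast.type === 'value') {
    return Boolean(ast.value);
  }
  if (ast.type === 'methodCall') {
    const method = methods.get(ast.name.toLowerCase());
    if (method) {
      return method(node, ast.args);
    }
    if (ast.args.length === 0) {
      // Bare identifier: a boolean attribute (IsReference) or a category (Assembly).
      return node.attributes[ast.name] === true || node.categories.includes(ast.name);
    }
    throw new Error(`Unknown method: ${ast.name}`);
  }
  throw new Error(`Unknown result type: ${ast.type}`);
}

// Assumed node shape, for illustration only.
const sampleNode = {
  categories: ['Assembly'],
  attributes: { IsReference: true, AttributeA: 3 },
};

console.log(evaluate({ type: 'methodCall', name: 'hasCategory', args: ['Assembly'] }, sampleNode)); // true
console.log(evaluate({ type: 'methodCall', name: 'HasValue', args: ['AttributeB'] }, sampleNode)); // false
console.log(evaluate({ type: 'methodCall', name: 'IsReference', args: [] }, sampleNode)); // true
```

Using a Map rather than a plain object for the method table avoids accidental hits on inherited properties such as `constructor`.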

There are a few problems with your grammar, but it's mostly fine.
'and' and 'or' conflict with Identifier, so subtract those string literals from Identifier in its rule.
Number was missing parentheses. It should be Number ::= Digit ( ('.' Digit) | (Digit)* )
You are missing the EOF rule. Almost every parser generator I know requires a bottom/EOF rule to force consumption of the entire input. I added the rule "Input".
Make sure to click the "configure" box, then select "backtracking". Your grammar is ambiguous, which is fine, but you have to tell the parser generator to handle that.
Each parser generator has a slightly different syntax for "EBNF", and REx takes its own variant, which adds a <?TOKENS?> marker to denote the boundary between parser and lexer rules. Microsoft says its grammar is "BNF", but it isn't, because <Identifier> ::= [^. ]* uses the Kleene star, which is an EBNF construct. It also fudges the definitions of <Literal> and <Number> with prose.
I haven't tested the generated parser, but it looks like a straightforward recursive descent implementation. The parser generators that I am familiar with, and that are popular, are listed on the conversion page. (I'm writing converters for all of them and many more.)
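To see what the missing parentheses in Number change, here is a rough sketch that translates both versions of the rule into JavaScript regular expressions. The translation is an approximation of the EBNF, assuming that alternation binds loosest, and the names `originalNumber` and `correctedNumber` are made up for this illustration.

```javascript
// Original rule:  Digit ('.' Digit) | (Digit)*
// Read as:        (Digit '.' Digit) | (Digit*)
const originalNumber = /^(?:[0-9]\.[0-9]|[0-9]*)$/;

// Corrected rule: Digit ( ('.' Digit) | (Digit)* )
const correctedNumber = /^[0-9](?:\.[0-9]|[0-9]*)$/;

for (const input of ['', '7', '42', '1.5', '1.25']) {
  console.log(JSON.stringify(input), originalNumber.test(input), correctedNumber.test(input));
}
// ""     true  false   (the original rule can match nothing at all)
// "7"    true  true
// "42"   true  true
// "1.5"  true  true
// "1.25" false false   (only one digit is allowed after the '.')
```

Under this reading, neither version accepts 1.25 from the examples in the question. If such literals are needed, a rule along the lines of Digit+ ( '.' Digit+ )? would be one option, but that is a suggestion rather than something the grammar above contains.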
Try this:
Input ::= Expression EOF
Expression
    ::= BinaryExpression | MethodCall | "(" Expression ")" | Number
BinaryExpression
    ::= RelationalExpression ( ( '=' | '!=' ) RelationalExpression )*
RelationalExpression
    ::= AdditiveExpression ( ( '<' | '>' | '<=' | '>=' | 'and' | 'or' ) AdditiveExpression )*
AdditiveExpression
    ::= MultiplicativeExpression ( ( '+' | '-' ) MultiplicativeExpression )*
MultiplicativeExpression
    ::= UnaryExpression ( ( '*' | '/' | '%' ) UnaryExpression )*
UnaryExpression
    ::= "!" Identifier | "+" Identifier | "-" Identifier | Identifier
MethodCall
    ::= Identifier "(" MethodArgs* ")"
MethodArgs
    ::= Expression | Expression "," MethodArgs
<?TOKENS?>
Identifier
    ::= ( ( Letter ( Letter | Digit | "_" )* ) - ( 'and' | 'or' ) )
Number ::= Digit ( ('.' Digit) | (Digit)* )
Letter
    ::= [A-Za-z]
Digit
    ::= [0-9]
EOF ::= $
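For comparison, the MethodCall and Input rules can also be read as a small hand-written recursive descent parser. The sketch below is not what REx generates. It assumes a simplified Expression (a number, an identifier, or a method call), and `tokenize` and `parseInput` are hypothetical names. It reads the identifier first and only then checks for "(", which is one way to avoid having to choose between a bare Identifier and a MethodCall up front.

```javascript
// Split the input into identifiers, numbers and punctuation; whitespace is skipped.
function tokenize(text) {
  const tokens = [];
  const pattern = /\s*(?:([A-Za-z][A-Za-z0-9_]*)|([0-9]+(?:\.[0-9]+)?)|([(),]))/y;
  let pos = 0;
  while (pos < text.length) {
    pattern.lastIndex = pos;
    const m = pattern.exec(text);
    if (!m) {
      if (/^\s*$/.test(text.slice(pos))) break; // only trailing whitespace is left
      throw new SyntaxError(`Unexpected character at offset ${pos}`);
    }
    if (m[1] !== undefined) tokens.push({ type: 'identifier', text: m[1] });
    else if (m[2] !== undefined) tokens.push({ type: 'number', text: m[2] });
    else tokens.push({ type: m[3], text: m[3] });
    pos = pattern.lastIndex;
  }
  tokens.push({ type: 'EOF', text: '' });
  return tokens;
}

function parseInput(text) {
  const tokens = tokenize(text);
  let index = 0;
  const peek = () => tokens[index];
  const expect = (type) => {
    const token = tokens[index];
    if (token.type !== type) {
      throw new SyntaxError(`Expected ${type} but found ${token.type}`);
    }
    index += 1;
    return token;
  };

  // Expression ::= Number | Identifier | MethodCall   (simplified for this sketch)
  function parseExpression() {
    if (peek().type === 'number') {
      return { type: 'number', value: Number(expect('number').text) };
    }
    const name = expect('identifier').text;
    if (peek().type !== '(') return { type: 'identifier', name };
    // MethodCall ::= Identifier "(" ( Expression ( "," Expression )* )? ")"
    expect('(');
    const args = [];
    if (peek().type !== ')') {
      args.push(parseExpression());
      while (peek().type === ',') {
        expect(',');
        args.push(parseExpression());
      }
    }
    expect(')');
    return { type: 'methodCall', name, args };
  }

  const result = parseExpression();
  expect('EOF'); // Input ::= Expression EOF, so leftover input is an error
  return result;
}

console.log(JSON.stringify(parseInput('HasCategory(Assembly)')));
// {"type":"methodCall","name":"HasCategory","args":[{"type":"identifier","name":"Assembly"}]}
```

The final `expect('EOF')` plays the role of the Input rule: without it, an input such as `f(a) b` would be accepted after reading only `f(a)`.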

Related

Resolving Shift/reduce conflicts in GNU Bison

I have the following grammar rules:
%precedence KW2
%left "or"
%left "and"
%left "==" "!=" ">=" ">" "<=" "<"
%left "-" "+"
%left "/" "*"
%start statement1
%%
param
: id
| id ":" expr // Conflict is caused by this line
| id "=" expr
;
param_list
: param_list "," param
| param
;
defparam
: param_list "," "/"
| param_list "," "/" ","
;
param_arg_list
: defparam param_list
| param_list
;
statement1
: KEYWORD1 "(" param_arg_list ")" ":" expr {}
expression1
: KEYWORD2 param_arg_list ":" expr %prec KW2 {} // This causes shift/reduce conflicts
expr
: id
| expr "+" expr
| expr "-" expr
| expr "*" expr
| expr "/" expr
| expr "==" expr
| expr "!=" expr
| expr "<" expr
| expr "<=" expr
| expr ">" expr
| expr ">=" expr
| expr "and" expr
| expr "or" expr
| expression1
id
: TK_NAME {}
.output
State 33
12 param: id . [":", ",", ")"]
13 | id . ":" expr
14 | id . "=" expr
":" shift, and go to state 55
"=" shift, and go to state 56
":" [reduce using rule 12 (param)]
$default reduce using rule 12 (param)
The problem is that expression1 does not need the id ":" expr rule in param, so if I remove that rule, the conflicts are resolved. But I cannot remove it from param, because statement1 requires it.
I wanted to use param_arg_list for both statement1 and expression1 because it simplifies the grammar by not repeating the same rules again and again.
My question: is there any other way to resolve the conflict?

ANTLR4 mismatched input '' expecting

Currently, I've just defined simple rules in ANTLR4:
// Recognizer Rules
program : (class_dcl)+ EOF;
class_dcl: 'class' ID ('extends' ID)? '{' class_body '}';
class_body: (const_dcl|var_dcl|method_dcl)*;
const_dcl: ('static')? 'final' PRIMITIVE_TYPE ID '=' expr ';';
var_dcl: ('static')? id_list ':' type ';';
method_dcl: PRIMITIVE_TYPE ('static')? ID '(' para_list ')' block_stm;
para_list: (para_dcl (';' para_dcl)*)?;
para_dcl: id_list ':' PRIMITIVE_TYPE;
block_stm: '{' '}';
expr: <assoc=right> expr '=' expr | expr1;
expr1: term ('<' | '>' | '<=' | '>=' | '==' | '!=') term | term;
term: ('+'|'-') term | term ('*'|'/') term | term ('+'|'-') term | fact;
fact: INTLIT | FLOATLIT | BOOLLIT | ID | '(' expr ')';
type: PRIMITIVE_TYPE ('[' INTLIT ']')?;
id_list: ID (',' ID)*;
// Lexer Rules
KEYWORD: PRIMITIVE_TYPE | BOOLLIT | 'class' | 'extends' | 'if' | 'then' | 'else'
| 'null' | 'break' | 'continue' | 'while' | 'return' | 'self' | 'final'
| 'static' | 'new' | 'do';
SEPARATOR: '[' | ']' | '{' | '}' | '(' | ')' | ';' | ':' | '.' | ',';
OPERATOR: '^' | 'new' | '=' | UNA_OPERATOR | BIN_OPERATOR;
UNA_OPERATOR: '!';
BIN_OPERATOR: '+' | '-' | '*' | '\\' | '/' | '%' | '>' | '>=' | '<' | '<='
| '==' | '<>' | '&&' | '||' | ':=';
PRIMITIVE_TYPE: 'integer' | 'float' | 'bool' | 'string' | 'void';
BOOLLIT: 'true' | 'false';
FLOATLIT: [0-9]+ ((('.'[0-9]* (('E'|'e')('+'|'-')?[0-9]+)? ))|(('E'|'e')('+'|'-')? [0-9]+));
INTLIT: [0-9]+;
STRINGLIT: '"' ('\\'[bfrnt\\"]|~[\r\t\n\\"])* '"';
ILLEGAL_ESC: '"' (('\\'[bfrnt\\"]|~[\n\\"]))* ('\\'(~[bfrnt\\"]))
{if (true) throw new bkool.parser.IllegalEscape(getText());};
UNCLOSED_STRING: '"'('\\'[bfrnt\\"]|~[\r\t\n\\"])*
{if (true) throw new bkool.parser.UncloseString(getText());};
COMMENT: (BLOCK_COMMENT|LINE_COMMENT) -> skip;
BLOCK_COMMENT: '(''*'(('*')?(~')'))*'*'')';
LINE_COMMENT: '#' (~[\n])* ('\n'|EOF);
ID: [a-zA-z_]+ [a-zA-z_0-9]* ;
WS: [ \t\r\n]+ -> skip ;
ERROR_TOKEN: . {if (true) throw new bkool.parser.ErrorToken(getText());};
I opened the parse tree, and tried to test:
class abc
{
final integer x=1;
}
It returned errors:
BKOOL::program:3:8: mismatched input 'integer' expecting PRIMITIVE_TYPE
BKOOL::program:3:17: mismatched input '=' expecting {':', ','}
I still don't understand why. Could you please help me see why it didn't recognize the rules and tokens as I expected?

Lexer rules are exclusive: each piece of input becomes exactly one token. The longest match wins, and ties are broken by the order of the rules in the grammar.
In your case, integer is lexed as a KEYWORD instead of a PRIMITIVE_TYPE.
What you should do here:
Make one distinct token per keyword instead of a catch-all KEYWORD rule.
Turn PRIMITIVE_TYPE into a parser rule.
Do the same for operators.
Right now, your example:
class abc
{
final integer x=1;
}
gets converted to a token stream such as:
class ID { final KEYWORD ID = INTLIT ; }
This is thanks to implicit token typing: you've used literals such as 'class' in your parser rules, and these get converted to anonymous tokens such as T_001 : 'class'; which get the highest priority.
If this weren't the case, you'd end up with:
KEYWORD ID SEPARATOR KEYWORD KEYWORD ID OPERATOR INTLIT ; SEPARATOR
And that's... not exactly easy to parse ;-)
That's why I'm telling you to break down your tokens properly.
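The "longest match wins, grammar order breaks ties" rule can be sketched with a toy lexer. This is only an illustration of the idea, not how ANTLR is implemented. `makeLexer` and the rule lists are hypothetical, and the literal rules are listed first to mimic the implicit tokens described above.

```javascript
// Build a lexer from [tokenName, RegExp] pairs listed in grammar order.
function makeLexer(rules) {
  return function lex(text) {
    const tokens = [];
    let pos = 0;
    while (pos < text.length) {
      let best = null;
      for (const [name, pattern] of rules) {
        const sticky = new RegExp(pattern.source, 'y'); // match only at `pos`
        sticky.lastIndex = pos;
        const match = sticky.exec(text);
        // Only a strictly longer match replaces the current best, so on a tie
        // the rule that appears first in the list is kept.
        if (match && match[0].length > (best ? best.text.length : 0)) {
          best = { name, text: match[0] };
        }
      }
      if (best === null) throw new SyntaxError(`No rule matches at offset ${pos}`);
      if (best.name !== 'WS') tokens.push(best.name);
      pos += best.text.length;
    }
    return tokens.join(' ');
  };
}

// Literals used in parser rules come first, mirroring the implicit tokens.
const literals = [
  ['class', /class/], ['final', /final/],
  ['{', /\{/], ['}', /\}/], ['=', /=/], [';', /;/],
];
const tail = [
  ['INTLIT', /[0-9]+/],
  ['ID', /[a-zA-Z_][a-zA-Z_0-9]*/],
  ['WS', /[ \t\r\n]+/],
];

// KEYWORD is listed before PRIMITIVE_TYPE, so it wins the tie on "integer".
const catchAllLexer = makeLexer([
  ...literals,
  ['KEYWORD', /integer|float|bool|string|void|true|false/],
  ['PRIMITIVE_TYPE', /integer|float|bool|string|void/],
  ...tail,
]);

// One token per keyword, so the parser can tell them apart.
const perKeywordLexer = makeLexer([
  ...literals,
  ['INTEGER', /integer/], ['FLOAT', /float/], ['BOOL', /bool/],
  ...tail,
]);

const source = 'class abc { final integer x=1; }';
console.log(catchAllLexer(source));   // class ID { final KEYWORD ID = INTLIT ; }
console.log(perKeywordLexer(source)); // class ID { final INTEGER ID = INTLIT ; }
```

With one token per keyword, a parser rule (for example a primitive type rule listing INTEGER, FLOAT and BOOL) can then group them, which is the restructuring suggested above.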

Removing ambiguity in bison

I am writing a simple parser in bison. The parser checks whether a program has any syntax errors with respect to my following grammar:
%{
#include <stdio.h>
void yyerror (const char *s) /* Called by yyparse on error */
{
printf ("%s\n", s);
}
%}
%token tNUM tINT tREAL tIDENT tINTTYPE tREALTYPE tINTMATRIXTYPE
%token tREALMATRIXTYPE tINTVECTORTYPE tREALVECTORTYPE tTRANSPOSE
%token tIF tENDIF tDOTPROD tEQ tNE tGTE tLTE tGT tLT tOR tAND
%left "(" ")" "[" "]"
%left "<" "<=" ">" ">="
%right "="
%left "+" "-"
%left "*" "/"
%left "||"
%left "&&"
%left "==" "!="
%% /* Grammar rules and actions follow */
prog: stmtlst ;
stmtlst: stmt | stmt stmtlst ;
stmt: decl | asgn | if;
decl: type vars "=" expr ";" ;
type: tINTTYPE | tINTVECTORTYPE | tINTMATRIXTYPE | tREALTYPE | tREALVECTORTYPE
| tREALMATRIXTYPE ;
vars: tIDENT | tIDENT "," vars ;
asgn: tIDENT "=" expr ";" ;
if: tIF "(" bool ")" stmtlst tENDIF ;
expr: tIDENT | tINT | tREAL | vectorLit | matrixLit | expr "+" expr| expr "-" expr
| expr "*" expr | expr "/" expr| expr tDOTPROD expr | transpose ;
transpose: tTRANSPOSE "(" expr ")" ;
vectorLit: "[" row "]" ;
matrixLit: "[" row ";" rows "]" ;
row: value | value "," row ;
rows: row | row ";" rows ;
value: tINT | tREAL | tIDENT ;
bool: comp | bool tAND bool | bool tOR bool ;
comp: expr relation expr ;
relation: tGT | tLT | tGTE | tLTE | tNE | tEQ ;
%%
int main ()
{
if (yyparse()) {
// parse error
printf("ERROR\n");
return 1;
}
else {
// successful parsing
printf("OK\n");
return 0;
}
}
The code may look long and complicated, and I don't think my question needs all of it, but I preferred to include it anyway. I am sure my grammar is correct but ambiguous. When I try to build the parser by running "bison -d filename.y", I get a message saying "conflicts: 13 shift/reduce". I defined the precedence of the operators at the beginning of this file and tried many combinations of these precedences, but I still get this message. How can I remove this ambiguity? Thank you.

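tOR, tAND, and tDOTPROD need to have their precedence specified as well. The rules bool tAND bool, bool tOR bool, and expr tDOTPROD expr use those token names, and none of them appears in a %left or %right declaration.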
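As a loose analogy for what a precedence declaration contributes, here is a small precedence-climbing sketch. It is not how Bison resolves conflicts internally. The `PRECEDENCE` table and `parseWithPrecedence` are hypothetical, and an operator missing from the table is simply rejected here, whereas Bison reports a conflict and picks a default.

```javascript
// Hypothetical precedence table: larger numbers bind tighter.
// Every operator in this sketch is treated as left-associative.
const PRECEDENCE = new Map([
  ['or', 1], ['and', 2], ['+', 3], ['-', 3], ['*', 4], ['/', 4],
]);

// tokens alternate operand, operator, operand, ...; the result is a
// fully parenthesized string that shows how the operands were grouped.
function parseWithPrecedence(tokens) {
  let index = 0;
  function parseBinary(minPrecedence) {
    let left = tokens[index++]; // operand
    while (index < tokens.length) {
      const op = tokens[index];
      if (!PRECEDENCE.has(op)) {
        throw new SyntaxError(`No precedence declared for ${op}`);
      }
      const precedence = PRECEDENCE.get(op);
      if (precedence < minPrecedence) break;
      index += 1;
      // Requiring a strictly higher precedence on the right side
      // makes operators of equal precedence group to the left.
      const right = parseBinary(precedence + 1);
      left = `(${left} ${op} ${right})`;
    }
    return left;
  }
  return parseBinary(1);
}

console.log(parseWithPrecedence(['a', 'or', 'b', 'and', 'c'])); // (a or (b and c))
console.log(parseWithPrecedence(['a', '-', 'b', '-', 'c']));    // ((a - b) - c)
```

Without an entry in the table there is no basis for choosing between the two possible groupings of `a op b op c`, which is the situation the undeclared tokens are in.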

ANTLR ambiguity '-'

I have a grammar and everything works fine until this portion:
lexp
: factor ( ('+' | '-') factor)*
;
factor :('-')? IDENT;
This of course introduces an ambiguity. For example a-a can be matched by either Factor - Factor or Factor -> - IDENT
I get the following warning stating this:
[18:49:39] warning(200): withoutWarningButIncomplete.g:57:31:
Decision can match input such as "'-' {IDENT, '-'}" using multiple alternatives: 1, 2
How can I resolve this ambiguity? I just don't see a way around it. Is there some kind of option that I can use?
Here is the full grammar:
program
: includes decls (procedure)*
;
/* Check if correct! */
includes
: ('#include' STRING)*
;
decls
: (typedident ';')*
;
typedident
: ('int' | 'char') IDENT
;
procedure
: ('int' | 'char') IDENT '(' args ')' body
;
args
: typedident (',' typedident )* /* Check if correct! */
| /* epsilon */
;
body
: '{' decls stmtlist '}'
;
stmtlist
: (stmt)*;
stmt
: '{' stmtlist '}'
| 'read' '(' IDENT ')' ';'
| 'output' '(' IDENT ')' ';'
| 'print' '(' STRING ')' ';'
| 'return' (lexp)* ';'
| 'readc' '(' IDENT ')' ';'
| 'outputc' '(' IDENT ')' ';'
| IDENT '(' (IDENT ( ',' IDENT )*)? ')' ';'
| IDENT '=' lexp ';';
lexp
: term (( '+' | '-' ) term) * /*Add in | '-' to reveal the warning! !*/
;
term
: factor (('*' | '/' | '%') factor )*
;
factor : '(' lexp ')'
| ('-')? IDENT
| NUMBER;
fragment DIGIT
: ('0' .. '9')
;
IDENT : ('A' .. 'Z' | 'a' .. 'z') (( 'A' .. 'Z' | 'a' .. 'z' | '0' .. '9' | '_'))* ;
NUMBER
: ( ('-')? DIGIT+)
;
CHARACTER
: '\'' ('a' .. 'z' | 'A' .. 'Z' | '0' .. '9' | '\\n' | '\\t' | '\\\\' | '\\' | 'EOF' |'.' | ',' |':' ) '\'' /* IS THIS COMPLETE? */
;

As mentioned in the comments: these rules are not ambiguous:
lexp
: factor (('+' | '-') factor)*
;
factor : ('-')? IDENT;
This is the cause of the ambiguity:
'return' (lexp)* ';'
which lets the input a-b be parsed in two different ways:
a-b as a single binary expression
a as one expression, followed by -b as a second, unary expression
You will need to change your grammar. Perhaps separate multiple return values with commas? Something like this:
'return' (lexp (',' lexp)*)? ';'
which will match:
return;
return a;
return a, -b;
return a-b, c+d+e, f;
...
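The two readings can be counted mechanically. The sketch below is a brute-force enumeration, not an ANTLR feature. It uses the simplified lexp and factor rules quoted at the top of this answer, and `factorEnd`, `lexpEnds` and `countParses` are hypothetical helpers.

```javascript
// factor : ('-')? IDENT
// Returns the index just past the factor, or -1 if none starts at `start`.
function factorEnd(tokens, start) {
  let pos = start;
  if (tokens[pos] === '-') pos += 1;
  return /^[A-Za-z]\w*$/.test(tokens[pos] ?? '') ? pos + 1 : -1;
}

// lexp : factor (('+' | '-') factor)*
// Returns every index at which one lexp starting at `start` can end.
function lexpEnds(tokens, start) {
  const ends = [];
  let pos = factorEnd(tokens, start);
  while (pos !== -1) {
    ends.push(pos);
    if (tokens[pos] !== '+' && tokens[pos] !== '-') break;
    pos = factorEnd(tokens, pos + 1);
  }
  return ends;
}

// Count the ways to cover all tokens with one or more lexp.
// separator === null models (lexp)*; ',' models lexp (',' lexp)*.
function countParses(tokens, separator) {
  function from(start) {
    let total = 0;
    for (const end of lexpEnds(tokens, start)) {
      if (end === tokens.length) total += 1;
      else if (separator === null) total += from(end);
      else if (tokens[end] === separator) total += from(end + 1);
    }
    return total;
  }
  return from(0);
}

console.log(countParses(['a', '-', 'b'], null));     // 2
console.log(countParses(['a', '-', 'b'], ','));      // 1
console.log(countParses(['a', ',', '-', 'b'], ',')); // 1
```

With the comma as a required separator, each input has a single reading, which is why the rewritten return rule no longer triggers the warning.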

'Token collision' in Boolean Query Parser

I'm creating a simple boolean query parser. I would like to do something like this below.
grammar BooleanQuery;
options
{
language = Java;
output = AST;
}
LPAREN : ( '(' ) ;
RPAREN : ( ')' );
QUOTE : ( '"' );
AND : ( 'AND' | '&' | 'EN' | '+' ) ;
OR : ( 'OR' | '|' | 'OF' );
WS : ( ' ' | '\t' | '\r' | '\n') {$channel=HIDDEN;} ;
WORD : (~( ' ' | '\t' | '\r' | '\n' | '(' | ')' | '"' ))*;
MINUS : '-';
PLUS : '+';
expr : andexpr;
andexpr : orexpr (AND^ orexpr)*;
orexpr : part (OR^ part)*;
phrase : QUOTE ( options {greedy=false;} : . )* QUOTE;
requiredexpr : PLUS atom;
excludedexpr : MINUS atom;
part : excludedexpr | requiredexpr | atom;
atom : phrase | WORD | LPAREN! expr RPAREN!;
The problem is that the MINUS and PLUS tokens 'collide' with the MINUS and PLUS signs in the AND and OR tokens. Sorry if I don't use the correct terminology. I'm an ANTLR newbie.
Below an example query:
foo OR (pow AND -"bar with cream" AND -bar)
What mistakes did I make?

A token must be unique. You can, however, use the same token for several purposes in your syntax (like the unary and binary minus in Java).
I do not know the exact syntax of your environment, but something like changing the following two clauses
AND : ( 'AND' | '&' | 'EN' ) ;
and
andexpr : orexpr ((AND^ | PLUS^) orexpr)*;
would probably solve this issue.
