VBScript parser grammar: Array assignment modelling - parsing

I'm writing a GoldParser VBScript grammar. In my grammar array assignment statements such as id(1) = 2 are not parsed as assignment statements, but as call statements id ((1) = 2) (the = symbol can be both the assignment operator and the comparison operator). How can I change the following grammar to correctly parse array assignment statements?
<CallStmt> ::= 'Call' <CallExpr>
| '.' <CallPath>
| <CallPath>
<AssignStmt> ::= <CallExpr> '=' <Expr>
| 'Set' <CallExpr> '=' <Expr>
| 'Set' <CallExpr> '=' 'New' <CtorPath>
<CtorPath> ::= IDDot <CtorPath>
| <Member>
<CallPath> ::= <MemberDot> <CallPath>
| ID '(' ')'
| ID <ParameterList>
<CallExpr> ::= '.' <MemberPath>
| <MemberPath>
<MemberPath> ::= <MemberDot> <MemberPath>
| <Member>
<Member> ::= ID
| ID '(' <ParameterList> ')'
<MemberDot> ::= IDDot
| ID '(' <ParameterList> ').'
!VBScript allows to skip parameters a(1,,2)
<ParameterList> ::= <Expr> ',' <ParameterList>
| ',' <ParameterList>
| <Expr>
|
! Value can be reduced from <Expr>
<Value> ::= NumberLiteral
| StringLiteral
| <CallExpr>
| '(' <Expr> ')'
!--- The rest of the grammar ---
"Start Symbol" = <Start>
{WS} = {Whitespace} - {CR} - {LF}
{ID Head} = {Letter} + [_]
{ID Tail} = {Alphanumeric} + [_]
{String Chars} = {Printable} + {HT} - ["]
Whitespace = {WS}+
NewLine = {CR}{LF} | {CR} | {LF}
ID = {ID Head}{ID Tail}*
IDDot = {ID Head}{ID Tail}* '.'
StringLiteral = ('"' {String Chars}* '"')+
NumberLiteral = {Number}+ ('.' {Number}+ )?
<nl> ::= NewLine <nl> !One or more
| NewLine
<nl Opt> ::= NewLine <nl Opt> !Zero or more
| !Empty
<Start> ::= <nl opt> <StmtList>
<StmtList> ::= <CallStmt> <nl> <StmtList>
| <AssignStmt> <nl> <StmtList>
|
<Expr> ::= <Compare Exp>
<Compare Exp> ::= <Compare Exp> '=' <Add Exp>
| <Add Exp>
<Add Exp> ::= <Add Exp> '+' <Mult Exp>
| <Add Exp> '-' <Mult Exp>
| <Mult Exp>
<Mult Exp> ::= <Mult Exp> '*' <Negate Exp>
| <Mult Exp> '/' <Negate Exp>
| <Negate Exp>
<Negate Exp> ::= '-' <Value>
| <Value>
Note: I added the IDDot terminal so that statements inside a With block are parsed correctly, e.g. .obj.sub .obj.par1.

Related

ANTLR4 mismatched input '' expecting

Currently, I've just defined simple rules in ANTLR4:
// Recognizer Rules
program : (class_dcl)+ EOF;
class_dcl: 'class' ID ('extends' ID)? '{' class_body '}';
class_body: (const_dcl|var_dcl|method_dcl)*;
const_dcl: ('static')? 'final' PRIMITIVE_TYPE ID '=' expr ';';
var_dcl: ('static')? id_list ':' type ';';
method_dcl: PRIMITIVE_TYPE ('static')? ID '(' para_list ')' block_stm;
para_list: (para_dcl (';' para_dcl)*)?;
para_dcl: id_list ':' PRIMITIVE_TYPE;
block_stm: '{' '}';
expr: <assoc=right> expr '=' expr | expr1;
expr1: term ('<' | '>' | '<=' | '>=' | '==' | '!=') term | term;
term: ('+'|'-') term | term ('*'|'/') term | term ('+'|'-') term | fact;
fact: INTLIT | FLOATLIT | BOOLLIT | ID | '(' expr ')';
type: PRIMITIVE_TYPE ('[' INTLIT ']')?;
id_list: ID (',' ID)*;
// Lexer Rules
KEYWORD: PRIMITIVE_TYPE | BOOLLIT | 'class' | 'extends' | 'if' | 'then' | 'else'
| 'null' | 'break' | 'continue' | 'while' | 'return' | 'self' | 'final'
| 'static' | 'new' | 'do';
SEPARATOR: '[' | ']' | '{' | '}' | '(' | ')' | ';' | ':' | '.' | ',';
OPERATOR: '^' | 'new' | '=' | UNA_OPERATOR | BIN_OPERATOR;
UNA_OPERATOR: '!';
BIN_OPERATOR: '+' | '-' | '*' | '\\' | '/' | '%' | '>' | '>=' | '<' | '<='
| '==' | '<>' | '&&' | '||' | ':=';
PRIMITIVE_TYPE: 'integer' | 'float' | 'bool' | 'string' | 'void';
BOOLLIT: 'true' | 'false';
FLOATLIT: [0-9]+ ((('.'[0-9]* (('E'|'e')('+'|'-')?[0-9]+)? ))|(('E'|'e')('+'|'-')? [0-9]+));
INTLIT: [0-9]+;
STRINGLIT: '"' ('\\'[bfrnt\\"]|~[\r\t\n\\"])* '"';
ILLEGAL_ESC: '"' (('\\'[bfrnt\\"]|~[\n\\"]))* ('\\'(~[bfrnt\\"]))
{if (true) throw new bkool.parser.IllegalEscape(getText());};
UNCLOSED_STRING: '"'('\\'[bfrnt\\"]|~[\r\t\n\\"])*
{if (true) throw new bkool.parser.UncloseString(getText());};
COMMENT: (BLOCK_COMMENT|LINE_COMMENT) -> skip;
BLOCK_COMMENT: '(''*'(('*')?(~')'))*'*'')';
LINE_COMMENT: '#' (~[\n])* ('\n'|EOF);
ID: [a-zA-z_]+ [a-zA-z_0-9]* ;
WS: [ \t\r\n]+ -> skip ;
ERROR_TOKEN: . {if (true) throw new bkool.parser.ErrorToken(getText());};
I opened the parse tree, and tried to test:
class abc
{
final integer x=1;
}
It returned errors:
BKOOL::program:3:8: mismatched input 'integer' expecting PRIMITIVE_TYPE
BKOOL::program:3:17: mismatched input '=' expecting {':', ','}
I still don't understand why. Could you please help me see why the rules and tokens are not recognized as I expected?
Lexer rules are mutually exclusive: each piece of input becomes exactly one token. The longest match wins, and ties are broken by the order of the rules in the grammar.
In your case, integer is lexed as KEYWORD instead of PRIMITIVE_TYPE, because KEYWORD is defined first.
What you should do here:
Make one distinct token per keyword instead of a catch-all KEYWORD rule.
Turn PRIMITIVE_TYPE into a parser rule.
Do the same for operators.
Right now, your example:
class abc
{
final integer x=1;
}
Gets converted to lexemes such as:
class ID { final KEYWORD ID = INTLIT ; }
This happens because of implicit token definitions: you've used literals such as 'class' in your parser rules, and these get converted to anonymous tokens such as T_001 : 'class'; which get the highest priority.
If this weren't the case, you'd end up with:
KEYWORD ID SEPARATOR KEYWORD KEYWORD ID OPERATOR INTLIT ; SEPARATOR
And that's... not quite easy to parse ;-)
That's why I'm telling you to break down your tokens properly.
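The longest-match and rule-order behaviour described in this answer can be imitated with a few lines of Python. This is only a sketch under stated assumptions: the rule list is a hypothetical, heavily reduced version of the lexer in the question, it ignores the implicit tokens ANTLR creates for literals used in parser rules, and `tokenize` is an illustrative helper, not ANTLR's actual algorithm.

```python
import re

# Hypothetical, heavily reduced rule list. Order matters: it breaks ties.
RULES = [
    ("KEYWORD", r"integer|float|final|class"),
    ("PRIMITIVE_TYPE", r"integer|float"),
    ("ID", r"[A-Za-z_][A-Za-z_0-9]*"),
    ("INTLIT", r"[0-9]+"),
    ("SYMBOL", r"[=;{}]"),
]

def tokenize(text, rules=RULES):
    """Longest match wins; on equal length the rule listed first wins."""
    tokens, pos = [], 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        best = None  # (rule name, end offset)
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            # Only a strictly longer match replaces the current best,
            # so on a tie the earlier rule is kept.
            if m and m.end() > pos and (best is None or m.end() > best[1]):
                best = (name, m.end())
        if best is None:
            raise ValueError(f"no rule matches at offset {pos}")
        tokens.append((best[0], text[pos:best[1]]))
        pos = best[1]
    return tokens

print(tokenize("final integer x=1;"))
```

With this rule order, integer comes out as KEYWORD, so a parser waiting for PRIMITIVE_TYPE never sees one. Moving PRIMITIVE_TYPE above KEYWORD flips the result, which is why one token per keyword (and a parser rule for the type names) is the more robust layout.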

Why are table literals treated differently from table references in Lua?

Here is a Lua 5.2.2 transcript, showing the declaration and indexing of a table:
> mylist = {'foo', 'bar'}
> print(mylist[1])
foo
Why isn't the following statement legal?
> print({'foo', 'bar'}[1])
stdin:1: ')' expected near '['
I can't think of any other language where a literal can't be substituted for a reference (except, of course, when an lvalue is needed).
FWIW, parenthesizing the table literal makes the statement legal:
> print(({'foo', 'bar'})[1])
foo
It is also related to the fact that in Lua this syntax is valid:
myfunc { 1, 2, 3 }
and it is equivalent to:
myfunc( { 1, 2, 3 } )
Therefore an expression such as:
myfunc { 1, 2, 3 } [2]
is parsed as:
myfunc( { 1, 2, 3 } )[2]
so first the function call is evaluated, then the indexing takes place.
If {1,2,3}[2] could be parsed as a valid indexing operation, it could lead to ambiguities in the previous expression that would require more lookahead. The Lua team chose to make the Lua bytecode compiler fast by making it a single-pass compiler, so it scans source code only once with a minimum of lookahead. A message to the Lua mailing list from Roberto Ierusalimschy (lead Lua developer) points in that direction.
The same problem exists for string literals and method calls. This is invalid:
"my literal":sub(1)
but this is valid:
("my literal"):sub(1)
Again, Lua allows this:
func "my literal"
as equivalent to this:
func( "my literal" )
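The rule behind these examples (suffixes such as [index] or call arguments attach only to a name or a parenthesized expression, never to a bare literal) can be sketched as a tiny recursive-descent parser. This is a simplified illustration, not Lua's real parser: it covers only names, numbers, strings, table constructors, indexing and calls, and the names `lex`, `Parser` and `parse` are made up for the example.

```python
import re

TOKEN = re.compile(r"""\s*([A-Za-z_]\w*|\d+|'[^']*'|"[^"]*"|[{}\[\](),])""")
NAME = re.compile(r"[A-Za-z_]\w*")

def lex(src):
    src, pos, toks = src.rstrip(), 0, []
    while pos < len(src):
        m = TOKEN.match(src, pos)
        if not m:
            raise SyntaxError(f"unexpected character at offset {pos}")
        toks.append(m.group(1))
        pos = m.end()
    return toks + ["<eof>"]

class Parser:
    def __init__(self, src):
        self.toks, self.i = lex(src), 0

    def peek(self):
        return self.toks[self.i]

    def expect(self, tok):
        if self.peek() != tok:
            raise SyntaxError(f"'{tok}' expected near '{self.peek()}'")
        self.i += 1

    def explist(self, closer):
        items = []
        if self.peek() != closer:
            items.append(self.exp())
            while self.peek() == ",":
                self.i += 1
                items.append(self.exp())
        return items

    def table(self):
        self.expect("{")
        items = self.explist("}")
        self.expect("}")
        return ("table", items)

    def exp(self):
        t = self.peek()
        if t == "{":                      # bare constructor: no suffixes allowed
            return self.table()
        if t[0].isdigit() or t[0] in "'\"":
            self.i += 1
            return ("lit", t)
        return self.prefixexp()

    def prefixexp(self):
        t = self.peek()
        if t == "(":
            self.i += 1
            node = ("paren", self.exp())
            self.expect(")")
        elif NAME.fullmatch(t):
            self.i += 1
            node = ("name", t)
        else:
            raise SyntaxError(f"unexpected symbol near '{t}'")
        while True:                       # suffixes attach only to a prefixexp
            t = self.peek()
            if t == "[":
                self.i += 1
                node = ("index", node, self.exp())
                self.expect("]")
            elif t == "(":
                self.i += 1
                node = ("call", node, self.explist(")"))
                self.expect(")")
            elif t == "{":
                node = ("call", node, [self.table()])
            elif t[0] in "'\"":
                self.i += 1
                node = ("call", node, [("lit", t)])
            else:
                return node

def parse(src):
    p = Parser(src)
    tree = p.exp()
    p.expect("<eof>")
    return tree
```

In this sketch, parse("myfunc { 1, 2, 3 } [2]") yields an index node wrapped around a call node, while parse("print({'foo', 'bar'}[1])") raises a SyntaxError at the '[' because the argument list has already ended with the table constructor. No lookahead beyond the current token is needed for either decision.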
Going off the Lua grammar, the non-parenthesized version is invalid and the parenthesized one is valid because the parse takes a different path in each case: without the parentheses, the parser expects a closing ) at that point, since no other symbol is allowed in that context.
In the first case:
functioncall ::= prefixexp args | prefixexp `:´ Name args
prefixexp =>
prefixexp ::= var | functioncall | `(´ exp `)´
var -> print
THEREFORE prefixexp -> print
args =>
args ::= `(´ [explist] `)´ | tableconstructor | String
match '('
explist =>
explist ::= {exp `,´} exp
exp =>
exp ::= nil | false | true | Number | String | `...´ | function | prefixexp | tableconstructor | exp binop exp | unop exp
tableconstructor =>
explist-> {'foo','bar'}
THEREFORE explist = {'foo','bar'}
match ')' ERROR!!! found '[' expected ')'
On the other hand, with parentheses:
functioncall ::= prefixexp args | prefixexp `:´ Name args
prefixexp =>
prefixexp ::= var | functioncall | `(´ exp `)´
var -> print
THEREFORE prefixexp -> print
args =>
args ::= `(´ [explist] `)´ | tableconstructor | String
match '('
explist =>
explist ::= {exp `,´} exp
exp =>
exp ::= nil | false | true | Number | String | `...´ | function | prefixexp | tableconstructor | exp binop exp | unop exp
prefixexp =>
prefixexp ::= var | functioncall | `(´ exp `)´
var =>
var ::= Name | prefixexp `[´ exp `]´ | prefixexp `.´ Name
prefixexp =>
prefixexp ::= var | functioncall | `(´ exp `)´
match '('
exp =>
exp ::= nil | false | true | Number | String | `...´ | function | prefixexp | tableconstructor | exp binop exp | unop exp
tableconstructor =>
THEREFORE exp = {'foo','bar'}
match ')'
THEREFORE prefixexp = ({'foo','bar'})
match '['
exp => Number = 1
match ']'
THEREFORE VAR = ({'foo','bar'})[1]
THEREFORE prefixexp = VAR
THEREFORE exp = VAR
THEREFORE explist = VAR
match ')'
THEREFORE args = (VAR)
=> print(VAR)

Antlr wrong rule invocation

I'm trying to implement a grammar for parsing Lucene queries. Everything went smoothly until I tried to add support for range queries. Lucene details aside, my grammar looks like this:
grammar ModifiedParser;
TERM_RANGE : '[' ('*' | TERM_TEXT) 'TO' ('*' | TERM_TEXT) ']'
| '{' ('*' | TERM_TEXT) 'TO' ('*' | TERM_TEXT) '}'
;
query : not (booleanOperator? not)* ;
booleanOperator : andClause
| orClause
;
andClause : 'AND' ;
notClause : 'NOT' ;
orClause : 'OR' ;
not : notClause? MODIFIER? clause;
clause : unqualified
| qualified
;
unqualified : TERM_RANGE # termRange
| TERM_PHRASE # termPhrase
| TERM_PHRASE_ANYTHING # termTruncatedPhrase
| '(' query ')' # queryUnqualified
| TERM_TEXT_TRUNCATED # termTruncatedText
| TERM_NORMAL # termText
;
qualified : TERM_NORMAL ':' unqualified
;
fragment TERM_CHAR : (~(' ' | '\t' | '\n' | '\r' | '\u3000'
| '\'' | '\"' | '(' | ')' | '[' | ']' | '{' | '}'
| '+' | '-' | '!' | ':' | '~' | '^'
| '?' | '*' | '\\' ))
;
fragment TERM_START_CHAR : TERM_CHAR
| ESCAPE
;
fragment ESCAPE : '\\' ~[];
MODIFIER : '-'
| '+'
;
AND : 'AND';
OR : 'OR';
NOT : 'NOT';
TERM_PHRASE_ANYTHING : '"' (ESCAPE|~('\"'|'\\'))+ '"' ;
TERM_PHRASE : '"' (ESCAPE|~('\"'|'\\'|'?'|'*'))+ '"' ;
TERM_TEXT_TRUNCATED : ('*'|'?')(TERM_CHAR+ ('*'|'?'))+ TERM_CHAR*
| TERM_START_CHAR (TERM_CHAR* ('?'|'*'))+ TERM_CHAR+
| ('?'|'*') TERM_CHAR+
;
TERM_NORMAL : TERM_TEXT;
fragment TERM_TEXT : TERM_START_CHAR TERM_CHAR* ;
WS : [ \t\r\n] -> skip ;
When I write a visitor and work with the tokens, parsing asd [ 10 TO 100 ] { 1 TO 1000 } 100..1000 throws token recognition errors for [, ], } and {, and only tries to visit the termRange rule on the third range. Do you know what I'm missing here? Thanks in advance.
Since you made TERM_RANGE a lexer rule, you must account for everything at a character level. In particular, you forgot to allow whitespace characters in your input.
You would likely be in a much better position if you instead created termRange, a parser rule.
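The difference between matching a range at the character level and at the token level can be illustrated with regular expressions in Python. This is a rough analogy, not ANTLR itself: `STRICT_RANGE` is a hypothetical stand-in for a lexer rule that leaves no room for whitespace, and `tokens` plus `parse_range` play the role of a whitespace-skipping lexer followed by a termRange parser rule.

```python
import re

# Character-level rule: nothing between '[', the bounds, TO and ']' is allowed.
STRICT_RANGE = re.compile(r"\[(\*|\w+)TO(\*|\w+)\]")

# Token-level approach: skip whitespace while lexing, then match on tokens.
TOKEN = re.compile(r"\s*(\[|\]|\{|\}|\*|\w+)")

def tokens(text):
    text, pos, out = text.rstrip(), 0, []
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if not m:
            raise ValueError(f"token recognition error at offset {pos}")
        out.append(m.group(1))
        pos = m.end()
    return out

def parse_range(toks):
    """Return (open, low, high, close) if toks form a range, else None."""
    brackets = {("[", "]"), ("{", "}")}
    if len(toks) == 5 and (toks[0], toks[4]) in brackets and toks[2] == "TO":
        return (toks[0], toks[1], toks[3], toks[4])
    return None

print(STRICT_RANGE.fullmatch("[ 10 TO 100 ]"))      # None: spaces are not accounted for
print(parse_range(tokens("[ 10 TO 100 ]")))
```

The character-level pattern rejects the spaced input outright, whereas the token-level version never sees the whitespace, which is the practical argument for making termRange a parser rule.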

ANTLR ambiguity '-'

I have a grammar and everything works fine until this portion:
lexp
: factor ( ('+' | '-') factor)*
;
factor :('-')? IDENT;
This of course introduces an ambiguity. For example a-a can be matched by either Factor - Factor or Factor -> - IDENT
I get the following warning stating this:
[18:49:39] warning(200): withoutWarningButIncomplete.g:57:31:
Decision can match input such as "'-' {IDENT, '-'}" using multiple alternatives: 1, 2
How can I resolve this ambiguity? I just don't see a way around it. Is there some kind of option that I can use?
Here is the full grammar:
program
: includes decls (procedure)*
;
/* Check if correct! */
includes
: ('#include' STRING)*
;
decls
: (typedident ';')*
;
typedident
: ('int' | 'char') IDENT
;
procedure
: ('int' | 'char') IDENT '(' args ')' body
;
args
: typedident (',' typedident )* /* Check if correct! */
| /* epsilon */
;
body
: '{' decls stmtlist '}'
;
stmtlist
: (stmt)*;
stmt
: '{' stmtlist '}'
| 'read' '(' IDENT ')' ';'
| 'output' '(' IDENT ')' ';'
| 'print' '(' STRING ')' ';'
| 'return' (lexp)* ';'
| 'readc' '(' IDENT ')' ';'
| 'outputc' '(' IDENT ')' ';'
| IDENT '(' (IDENT ( ',' IDENT )*)? ')' ';'
| IDENT '=' lexp ';';
lexp
: term (( '+' | '-' ) term) * /*Add in | '-' to reveal the warning! !*/
;
term
: factor (('*' | '/' | '%') factor )*
;
factor : '(' lexp ')'
| ('-')? IDENT
| NUMBER;
fragment DIGIT
: ('0' .. '9')
;
IDENT : ('A' .. 'Z' | 'a' .. 'z') (( 'A' .. 'Z' | 'a' .. 'z' | '0' .. '9' | '_'))* ;
NUMBER
: ( ('-')? DIGIT+)
;
CHARACTER
: '\'' ('a' .. 'z' | 'A' .. 'Z' | '0' .. '9' | '\\n' | '\\t' | '\\\\' | '\\' | 'EOF' |'.' | ',' |':' ) '\'' /* IS THIS COMPLETE? */
;
As mentioned in the comments: these rules are not ambiguous:
lexp
: factor (('+' | '-') factor)*
;
factor : ('-')? IDENT;
This is the cause of the ambiguity:
'return' (lexp)* ';'
which can parse the input a-b in two different ways:
a-b as a single binary expression
a as a single expression, and -b as an unary expression
You will need to change your grammar. Perhaps add a comma in multiple return values? Something like this:
'return' (lexp (',' lexp)*)? ';'
which will match:
return;
return a;
return a, -b;
return a-b, c+d+e, f;
...
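To make the ambiguity concrete, the following Python sketch counts how many ways a token sequence can be split into a sequence of lexp phrases, as 'return' (lexp)* ';' allows. It uses the reduced rules lexp : factor (('+' | '-') factor)* and factor : '-'? IDENT from the question; the function name `count_parses` and the token representation are assumptions made for the example.

```python
from functools import lru_cache

def count_parses(tokens):
    """Count the ways tokens can be read as zero or more consecutive lexps.

    lexp   : factor (('+' | '-') factor)*
    factor : '-'? IDENT
    """
    toks = tuple(tokens)
    n = len(toks)

    def factor_end(i):
        # Offset just past a factor starting at i, or None if there is none.
        if i < n and toks[i] == "-":
            i += 1
        if i < n and toks[i].isalpha():
            return i + 1
        return None

    def lexp_ends(i):
        # Every offset at which a single lexp starting at i may stop.
        ends = []
        j = factor_end(i)
        while j is not None:
            ends.append(j)
            if j < n and toks[j] in ("+", "-"):
                j = factor_end(j + 1)
            else:
                break
        return ends

    @lru_cache(maxsize=None)
    def ways(i):
        if i == n:
            return 1
        return sum(ways(j) for j in lexp_ends(i))

    return ways(0)

print(count_parses(["a", "-", "b"]))  # 2
print(count_parses(["a", "+", "b"]))  # 1
```

The input a - b has two readings (one binary expression, or a followed by -b), while a + b has only one because a factor cannot start with '+'. Requiring a comma between returned expressions removes the second reading.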

Problems with LL(1) grammar

I have a 26-rule grammar for a subset of Mini Java. This grammar is supposed to be non-object-oriented. I've been trying to left-factor it and remove left recursion. However, when I test it with JFLAP, it tells me it is not LL(1). I have followed every step of the algorithm in the Aho-Sethi book.
Could you please give me some tips?
Goal ::= MainClass $
MainClass ::= class <IDENTIFIER> { MethodDeclarations public static void main ( ) {
VarDeclarations Statements } }
VarDeclarations ::= VarDeclaration VarDeclarations | e
VarDeclaration ::= Type <IDENTIFIER> ;
MethodDeclarations ::= MethodDeclaration MethodDeclarations | e
MethodDeclaration ::= public static Type <IDENTIFIER> ( Parameters ) {
VarDeclarations Statements return GenExpression ; }
Parameters ::= Type <IDENTIFIER> Parameter | e
Parameter ::= , Type <IDENTIFIER> Parameter | e
Type ::= boolean | int
Statements ::= Statement Statements | e
Statement ::= { Statements }
| if ( GenExpression ) Statement else Statement
| while ( GenExpression ) Statement
| System.out.println ( GenExpression ) ;
| <IDENTIFIER> = GenExpression ;
GenExpression ::= Expression | RelExpression
Expression ::= Term ExpressionRest
ExpressionRest ::= e | + Term ExpressionRest | - Term ExpressionRest
Term ::= Factor TermRest
TermRest ::= e | * Factor TermRest
Factor ::= ( Expression )
| true
| false
| <INTEGER-LITERAL>
| <IDENTIFIER> ArgumentList
ArgumentList ::= e | ( Arguments )
RelExpression ::= RelTerm RelExpressionRest
RelExpressionRest ::= e | && RelTerm RelExpressionEnd
RelExpressionEnd ::= e | RelExpressionRest
RelTerm ::= Term RelTermRest
RelTermRest ::= == Expression | < Expression | ExpressionRest RelTermEnding
RelTermEnding ::= == Expression | < Expression
Arguments ::= Expression Argument | RelExpression Argument | e
Argument ::= , GenExpression Argument | e
Each <IDENTIFIER> is a valid Java identifier, and <INTEGER-LITERAL> is a simple integer. Each e production stands for an epsilon production, and the $ in the first rule is the end-of-file marker.
I think I spotted two problems (there might be more):
Problem #1
In MainClass you have
MethodDeclarations public static void main
And a MethodDeclaration is
public static Type | e
That's not LL(1) since when the parser sees "public" it cannot tell if it's a MethodDeclaration or the "public static void main" method.
Problem #2
Arguments ::= Expression Argument | RelExpression Argument | e
Both Expression:
Expression ::= Term ExpressionRest
... and RelExpression:
RelExpression ::= RelTerm RelExpressionRest
RelTerm ::= Term RelTermRest
... start with "Term" so that's not LL(1) either.
I'd just go for LL(k) or LL(*) because they allow you to write much more maintainable grammars.
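Problem #2 can be checked mechanically by computing FIRST sets. The sketch below transcribes the expression part of the grammar from the question (with INT and ID standing for the literal and identifier tokens, and "e" for epsilon) and reports alternatives of one nonterminal whose FIRST sets overlap. It only detects FIRST/FIRST conflicts; Problem #1 involves a nullable MethodDeclarations and would additionally need FOLLOW sets, which this sketch does not compute. The helper names are made up for the example.

```python
EPS = "e"

GRAMMAR = {
    "Arguments": [["Expression", "Argument"], ["RelExpression", "Argument"], [EPS]],
    "Argument": [[",", "GenExpression", "Argument"], [EPS]],
    "GenExpression": [["Expression"], ["RelExpression"]],
    "Expression": [["Term", "ExpressionRest"]],
    "ExpressionRest": [[EPS], ["+", "Term", "ExpressionRest"], ["-", "Term", "ExpressionRest"]],
    "Term": [["Factor", "TermRest"]],
    "TermRest": [[EPS], ["*", "Factor", "TermRest"]],
    "Factor": [["(", "Expression", ")"], ["true"], ["false"], ["INT"], ["ID", "ArgumentList"]],
    "ArgumentList": [[EPS], ["(", "Arguments", ")"]],
    "RelExpression": [["RelTerm", "RelExpressionRest"]],
    "RelExpressionRest": [[EPS], ["&&", "RelTerm", "RelExpressionEnd"]],
    "RelExpressionEnd": [[EPS], ["RelExpressionRest"]],
    "RelTerm": [["Term", "RelTermRest"]],
    "RelTermRest": [["==", "Expression"], ["<", "Expression"], ["ExpressionRest", "RelTermEnding"]],
    "RelTermEnding": [["==", "Expression"], ["<", "Expression"]],
}

def first_of_sequence(seq, first):
    """FIRST of a symbol sequence, given the FIRST sets known so far."""
    out = set()
    for sym in seq:
        if sym == EPS:
            continue
        sym_first = first[sym] if sym in first else {sym}  # terminals map to themselves
        out |= sym_first - {EPS}
        if EPS not in sym_first:
            return out
    out.add(EPS)  # every symbol in the sequence can vanish
    return out

def compute_first(grammar):
    """Iterate to a fixed point over all productions."""
    first = {nt: set() for nt in grammar}
    changed = True
    while changed:
        changed = False
        for nt, alts in grammar.items():
            for alt in alts:
                new = first_of_sequence(alt, first)
                if not new <= first[nt]:
                    first[nt] |= new
                    changed = True
    return first

def first_first_conflicts(grammar, nt):
    """Pairs of alternatives of nt that can start with the same terminal."""
    first = compute_first(grammar)
    alts = grammar[nt]
    conflicts = []
    for i in range(len(alts)):
        for j in range(i + 1, len(alts)):
            common = first_of_sequence(alts[i], first) & first_of_sequence(alts[j], first)
            common -= {EPS}
            if common:
                conflicts.append((i, j, common))
    return conflicts

print(first_first_conflicts(GRAMMAR, "Arguments"))
```

For Arguments the first two alternatives share every token a Term can start with, so one token of lookahead cannot choose between them. Under this transcription the same check also flags GenExpression and RelTermRest, so those rules would need left-factoring as well.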
Is there anything to prevent IDENTIFIER from being the same as one of your reserved words? If not, then your grammar would be ambiguous. I don't see anything else, though.
If all else fails, I'd remove all but the last line of the grammar, and test that. If that passes I'd add each line one at a time until I found the problem line.

Resources