Handling Token Ambiguity in JavaCC - parsing

I'm attempting to write a parser in JavaCC that can recognize a language that has some ambiguity at the token level. In this particular case the language supports the "/" token by itself as a division operator while it also supports regular expression literals.
Consider the following JavaCC grammar:
TOKEN :
{
...
< VAR : "var" > |
< DIV : "/" > |
< EQUALS : "=" > |
< SEMICOLON : ";" > |
...
}
TOKEN :
{
< IDENTIFIER : <IDENTIFIER_START> (<IDENTIFIER_START> | <IDENTIFIER_CHAR>)* > |
< #IDENTIFIER_START : ( [ "$","_","A"-"Z","a"-"z" ] )> |
< #IDENTIFIER_CHAR : ( [ "$","_","A"-"Z","a"-"z","0"-"9" ] ) > |
< REGEX_LITERAL : ("/" <REGEX_BODY> "/" ( <REGEX_FLAGS> )? ) > |
< #REGEX_BODY : ( <REGEX_FIRST_CHAR> <REGEX_CHARS> ) > |
< #REGEX_CHARS : ( <REGEX_CHAR> )* > |
< #REGEX_FIRST_CHAR : ( ~["\r", "\n", "*", "/", "\\"] | <BACKSLASH_SEQUENCE> ) > |
< #REGEX_CHAR : ( ~[ "\r", "\n", "/", "\\" ] | <BACKSLASH_SEQUENCE> ) > |
< #BACKSLASH_SEQUENCE : ("\\" ~[ "\r", "\n"] ) > |
< #REGEX_FLAGS : ( <IDENTIFIER_CHAR> )* >
}
Given the following code:
var y = a/b/c;
Two different sets of tokens could be generated. The token stream should be either:
<VAR> <IDENTIFIER> <EQUALS> <IDENTIFIER> <DIV> <IDENTIFIER> <DIV> <IDENTIFIER> <SEMICOLON>
or
<VAR> <IDENTIFIER> <EQUALS> <IDENTIFIER> <REGEX_LITERAL> <SEMICOLON>
How can I ensure that the TokenManager generates the token stream that I expect for this case?

JavaCC always consumes the longest possible match, and there is no way to configure it otherwise. The only way to accomplish this is to add a lexical state, say IGNORE_REGEX, that excludes the offending token, in this case <REGEX_LITERAL>. Then, whenever a token is recognized that cannot be followed by <REGEX_LITERAL>, the lexical state must be switched to IGNORE_REGEX.
With the input:
var y = a/b/c
The following would occur:
<VAR> is consumed, lexical state is set to DEFAULT
<IDENTIFIER> is consumed, lexical state is set to IGNORE_REGEX
<EQUALS> is consumed, lexical state is set to DEFAULT
<IDENTIFIER> is consumed, lexical state is set to IGNORE_REGEX
At this point there is an ambiguity in the grammar: either a <DIV> or a <REGEX_LITERAL> could be consumed. Since the lexical state is IGNORE_REGEX, and that state does not match <REGEX_LITERAL>, a <DIV> is consumed.
<DIV> is consumed, lexical state is set to DEFAULT
<IDENTIFIER> is consumed, lexical state is set to IGNORE_REGEX
<DIV> is consumed, lexical state is set to DEFAULT
<IDENTIFIER> is consumed, lexical state is set to IGNORE_REGEX
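The state switching above can be sketched outside JavaCC. The following is a minimal Python illustration, not JavaCC code: the names `tokenize`, `RULES` and `ENDS_OPERAND` are made up for this sketch, and the set of tokens that switch to IGNORE_REGEX is an assumption (a real grammar would likely add closing brackets, literals, and so on).

```python
import re

# Assumption for this sketch: tokens after which "/" must mean division.
ENDS_OPERAND = {"IDENTIFIER", "REGEX_LITERAL"}

COMMON = [
    ("VAR", re.compile(r"var\b")),
    ("IDENTIFIER", re.compile(r"[$_A-Za-z][$_A-Za-z0-9]*")),
    ("DIV", re.compile(r"/")),
    ("EQUALS", re.compile(r"=")),
    ("SEMICOLON", re.compile(r";")),
]
# Mirrors REGEX_LITERAL from the question: body, closing "/", optional flags.
REGEX_LITERAL = (
    "REGEX_LITERAL",
    re.compile(r"/(?:[^\r\n*/\\]|\\[^\r\n])(?:[^\r\n/\\]|\\[^\r\n])*/[$_A-Za-z0-9]*"),
)

# Each lexical state has its own rule set; IGNORE_REGEX simply lacks REGEX_LITERAL.
RULES = {
    "DEFAULT": [REGEX_LITERAL] + COMMON,
    "IGNORE_REGEX": COMMON,
}

def tokenize(text):
    state, pos, kinds = "DEFAULT", 0, []
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        # Longest match wins; on a tie the earlier rule is kept.
        best_kind, best_end = None, pos
        for kind, pattern in RULES[state]:
            m = pattern.match(text, pos)
            if m and m.end() > best_end:
                best_kind, best_end = kind, m.end()
        if best_kind is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at {pos}")
        kinds.append(best_kind)
        pos = best_end
        # Switch lexical state based on the token just consumed.
        state = "IGNORE_REGEX" if best_kind in ENDS_OPERAND else "DEFAULT"
    return kinds
```

With this sketch, `tokenize("var y = a/b/c;")` yields two DIV tokens, while `tokenize("var y = /b/c;")` yields a single REGEX_LITERAL, because the lexer is in DEFAULT right after EQUALS.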

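As far as I remember (I worked with JavaCC some time back), the order in which you write the rules matters: when two rules match the same text at the same length, the one written first wins. So write your rules in an order that generates the tokens you want. Note that a longer match still beats an earlier rule, so ordering alone will not make "/" win over a longer <REGEX_LITERAL>.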
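A small Python sketch of how rule order and match length interact, following the rule JavaCC documents (longest match first, earliest rule on a tie). The function `first_token` and the two rule tuples are hypothetical names used only for illustration.

```python
import re

VAR = ("VAR", re.compile(r"var"))
IDENTIFIER = ("IDENTIFIER", re.compile(r"[$_A-Za-z][$_A-Za-z0-9]*"))

def first_token(text, rules):
    """Return (kind, length) of the winning rule at the start of text."""
    best = None
    for kind, pattern in rules:
        m = pattern.match(text)
        # A strictly longer match replaces the current best;
        # an equal-length match does not, so the earlier rule wins ties.
        if m and (best is None or m.end() > best[1]):
            best = (kind, m.end())
    return best
```

Here `first_token("var", [VAR, IDENTIFIER])` picks VAR and swapping the two rules picks IDENTIFIER, but `first_token("variable", [VAR, IDENTIFIER])` picks IDENTIFIER regardless of order, since its match is longer.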

Since JavaScript/EcmaScript does the same thing (that is, it contains regex literals and a divide operator that look just like those in your examples) you might want to look for an existing JavaCC grammar to learn from. I found one linked to from this blog entry, there may be others.

Related

Does antlr automatically discard whitespace?

I've written the following arithmetic grammar:
grammar Calc;
program
: expressions
;
expressions
: expression (NEWLINE expression)*
;
expression
: '(' expression ')' // parenExpression has highest precedence
| expression MULDIV expression // then multDivExpression
| expression ADDSUB expression // then addSubExpression
| OPERAND // finally the operand itself
;
MULDIV
: [*/]
;
ADDSUB
: [-+]
;
// 12 or .12 or 2. or 2.38
OPERAND
: [0-9]+ ('.' [0-9]*)?
| '.' [0-9]+
;
NEWLINE
: '\n'
;
And I've noticed that regardless of how I space the tokens I get the same result, for example:
1+2
2+3
Or:
1 +2
2+3
Still give me the same thing. Also I've noticed that adding in the following rule does nothing for me:
WS
: [ \r\n\t]+ -> skip
;
Which makes me wonder whether skipping whitespace is the default behavior of antlr4?
ANTLR4-based parsers can skip over a single unwanted token, or carry on past a single missing one, and continue parsing where possible (which is the case here). There is no default that ignores whitespace: you always have to specify a whitespace rule which either skips it or puts it on a hidden channel.
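To illustrate why the result can look identical with and without a whitespace rule, here is a rough Python sketch of a lexer that reports an error for any character no rule matches and then carries on. It is a simplification, not ANTLR's implementation. The `lex` function is a made-up name, and the `WS` rule here deliberately leaves out "\n" (unlike the rule in the question) so that it does not compete with NEWLINE.

```python
import re

RULES = [
    ("OPERAND", re.compile(r"[0-9]+(?:\.[0-9]*)?|\.[0-9]+")),
    ("MULDIV", re.compile(r"[*/]")),
    ("ADDSUB", re.compile(r"[-+]")),
    ("NEWLINE", re.compile(r"\n")),
]
# Assumption: a skip rule without "\n", so NEWLINE tokens are never swallowed.
WS = ("WS", re.compile(r"[ \t\r]+"))

def lex(text, rules):
    tokens, errors, pos = [], [], 0
    while pos < len(text):
        best = None
        for kind, pattern in rules:
            m = pattern.match(text, pos)
            if m and (best is None or m.end() > best[1]):
                best = (kind, m.end())
        if best is None:
            # No rule matches: report the character, drop it, keep going.
            errors.append(f"token recognition error at: {text[pos]!r}")
            pos += 1
            continue
        kind, end = best
        if kind != "WS":  # WS behaves like "-> skip": matched but not emitted
            tokens.append((kind, text[pos:end]))
        pos = end
    return tokens, errors
```

In this sketch `lex("1 +2", RULES)` and `lex("1 +2", RULES + [WS])` produce the same tokens; the only difference is that the first call also records an error for the space.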

Preferring shift over reduce in parser for language without statement terminators

I'm parsing a language that doesn't have statement terminators like ;. Expressions are defined as the longest sequence of tokens, so 5-5 has to be parsed as a subtraction, not as two statements (literal 5 followed by a unary negated -5).
I'm using LALRPOP as the parser generator (despite the name, it is LR(1) instead of LALR, afaik). LALRPOP doesn't have precedence attributes and doesn't prefer shift over reduce by default like yacc would do. I think I understand how regular operator precedence is encoded in an LR grammar by building a "chain" of rules, but I don't know how to apply that to this issue.
The expected parses would be (individual statements in brackets):
"5 - 5" → 5-5 instead of 5, -5
"5 (- 5)" → 5, -5
"- 5" → -5
"5 5" → 5, 5
How do I change the grammar such that it always prefers the longer parse?
Going through the first few pages of google results as well as stack overflow didn't yield any results for this specific problem. Most related questions need more lookahead or the result is to not allow consecutive statements without terminators.
I created a minimal sample grammar that reproduces the shift/reduce conflict (a statement in this grammar is just an expression; in the full grammar there would also be "if", "while", etc. and more levels of operator precedence, but I've omitted them for brevity). Besides unary minus, the original grammar has other conflicts, such as print(5), which could be parsed either as the identifier print followed by a parenthesized number (5) or as a function call. There might be more conflicts like this, but they all share the same underlying issue: the longer sequence should be preferred, yet both parses are currently valid.
For convenience, I created a repo (checkout and cargo run). The grammar is:
use std::str::FromStr;
grammar;
match {
"+",
"-",
"(",
")",
r"[0-9]+",
// Skip whitespace
r"\s*" => { },
}
Expr: i32 = {
<l:Expr> "+" <r:Unary> => l + r,
<l:Expr> "-" <r:Unary> => l - r,
Unary,
};
Unary: i32 = {
"-" <r:Unary> => -r,
Term,
}
Term: i32 = {
Num,
"(" <Expr> ")",
};
Num: i32 = {
r"[0-9]+" => i32::from_str(<>).unwrap(),
};
Stmt: i32 = {
Expr
};
pub Stmts: Vec<i32> = {
Stmt*
};
Part of the error (full error message):
/lalrpop-shift-repro/src/test.lalrpop:37:5: 37:8: Local ambiguity detected
The problem arises after having observed the following symbols in the input:
Stmt+ Expr
At that point, if the next token is a `"-"`, then the parser can proceed in two different ways.
First, the parser could execute the production at
/lalrpop-shift-repro/src/test.lalrpop:37:5: 37:8, which would consume
the top 1 token(s) from the stack and produce a `Stmt`. This might then yield a parse tree like
Expr    ╷ Stmt
├─Stmt──┤    │
├─Stmt+─┘    │
└─Stmt+──────┘
Alternatively, the parser could shift the `"-"` token and later use it to construct a `Expr`. This might
then yield a parse tree like
Stmt+ Expr "-" Unary
│     ├─Expr───────┤
│     └─Stmt───────┤
└─Stmt+────────────┘
See the LALRPOP manual for advice on making your grammar LR(1).
The issue you're going to have to confront is how to deal with function calls. I can't really give you any concrete advice based on your question, because the grammar you provide lacks any indication of the intended syntax of function calls, but the hint that print(5) is a valid statement makes it clear that there are two distinct situations, which need to be handled separately.
Consider:
5 - 5        One statement        5 ( - 5 )    Two statements
print(-5)    One statement        print - 5    Two statements (presumably)
a - 5        ???
The ambiguity of the third expression could be resolved if the compiler knew whether a is a function or a variable (if we assume that functions are not first-class values, making print an invalid statement). But there aren't many ways that the parser could know that, and none of them seem very likely:
There might not be any user-defined functions. Then the lexer could be built to recognise identifier-like tokens which happen to be built-in functions (like print) and then a(-5) would be illegal since a is not a built-in function.
The names of functions and identifiers might differ in some way that the lexer can detect. For example, the language might require functions to start with a capital letter. I presume this is not the case since you wrote print rather than Print but there might be some other simple distinction, such as requiring identifiers to be a single character.
Functions must be declared as such before the first use of the function, and the parser shares the symbol table with the lexer. (I didn't search the rather inadequate documentation for the generator you're using to see if lexical feedback is practical.)
If there were an optional statement delimiter (as with Lua, for example), then you could simply require that statements which start with parentheses (usually a pretty rare case) be explicitly delimited unless they are the first statement in a block. Or there might be an optional keyword such as compute which can be used as an unambiguous statement starter and whose use is required for statements which start with a parenthesis. I presume that neither of these is the case here, since you could have used that to force 5 - 5 to be recognised as two statements (5; -5 or 5 compute - 5.)
Another unlikely possibility, again based on the print(5) example, is that function calls use a different bracket than expression grouping. In that case, a[5] (for example) would be a function call and a(5) would unambiguously be two statements.
Since I don't know the precise requirements here, I'll show a grammar (in yacc/bison syntax, although it should be easy enough to translate it) which attempts to illustrate a representative sample. It implements one statement (return) in addition to expression statements, and expressions include multiplication, subtraction, negation and single argument function calls. To force "greedy" expressions, it prohibits certain statement sequences:
statements starting with a unary operator
statements starting with an open parenthesis if the previous statement ends with an identifier. (This effectively requires that the function to be applied in a call expression be a simple identifier. Without that restriction, it becomes close to impossible to distinguish two consecutive parenthesized expressions from a single function call expression, and you then need some other way to disambiguate.)
Those rules are easy to state, but the actual implementation is annoyingly repetitive because it requires various different kinds of expressions, depending on what the first and last token in the expression is, and possibly different kinds of statements, if you have statements which might end with an expression. (return x, for example.) The formalism used by ECMAScript would be useful here, but I suspect that your parser-generator doesn't implement it -- although it's possible that its macro facility could be used to that effect, if it came with something resembling documentation. Without that, there is a lot of duplication.
In a vague attempt to generate the grammar, I used the following suffixes:
_un / _pr / _oth: starts with unary / parenthesis / other token
_id / _nid: ends / does not end with an id
The absence of a suffix is used for the union of different possibilities. There are probably more unit productions than necessary. It has not been thoroughly debugged, but it worked on a few test cases (see below):
program : block
block_id : stmt_id
| block_id stmt_oth_id
| block_nid stmt_pr_id
| block_nid stmt_oth_id
block_nid : stmt_nid
| block_id stmt_oth_nid
| block_nid stmt_pr_nid
| block_nid stmt_oth_nid
block : %empty
| block_id | block_nid
stmt_un_id : expr_un_id
stmt_un_nid : expr_un_nid
stmt_pr_id : expr_pr_id
stmt_pr_nid : expr_pr_nid
stmt_oth_id : expr_oth_id
| return_id
stmt_oth_nid : expr_oth_nid
| return_nid
stmt_id : stmt_un_id | stmt_pr_id | stmt_oth_id
stmt_nid : stmt_un_nid | stmt_pr_nid | stmt_oth_nid
return_id : "return" expr_id
return_nid : "return" expr_nid
expr_un_id : sum_un_id
expr_un_nid : sum_un_nid
expr_pr_id : sum_pr_id
expr_pr_nid : sum_pr_nid
expr_oth_id : sum_oth_id
expr_oth_nid : sum_oth_nid
expr_id : expr_un_id | expr_pr_id | expr_oth_id
expr_nid : expr_un_nid | expr_pr_nid | expr_oth_nid
expr : expr_id | expr_nid
sum_un_id : mul_un_id
| sum_un '-' mul_id
sum_un_nid : mul_un_nid
| sum_un '-' mul_nid
sum_un : sum_un_id | sum_un_nid
sum_pr_id : mul_pr_id
| sum_pr '-' mul_id
sum_pr_nid : mul_pr_nid
| sum_pr '-' mul_nid
sum_pr : sum_pr_id | sum_pr_nid
sum_oth_id : mul_oth_id
| sum_oth '-' mul_id
sum_oth_nid : mul_oth_nid
| sum_oth '-' mul_nid
sum_oth : sum_oth_id | sum_oth_nid
mul_un_id : unary_un_id
| mul_un '*' unary_id
mul_un_nid : unary_un_nid
| mul_un '*' unary_nid
mul_un : mul_un_id | mul_un_nid
mul_pr_id : mul_pr '*' unary_id
mul_pr_nid : unary_pr_nid
| mul_pr '*' unary_nid
mul_pr : mul_pr_id | mul_pr_nid
mul_oth_id : unary_oth_id
| mul_oth '*' unary_id
mul_oth_nid : unary_oth_nid
| mul_oth '*' unary_nid
mul_oth : mul_oth_id | mul_oth_nid
mul_id : mul_un_id | mul_pr_id | mul_oth_id
mul_nid : mul_un_nid | mul_pr_nid | mul_oth_nid
unary_un_id : '-' unary_id
unary_un_nid : '-' unary_nid
unary_pr_nid : term_pr_nid
unary_oth_id : term_oth_id
unary_oth_nid: term_oth_nid
unary_id : unary_un_id | unary_oth_id
unary_nid : unary_un_nid | unary_pr_nid | unary_oth_nid
term_oth_id : IDENT
term_oth_nid : NUMBER
| IDENT '(' expr ')'
term_pr_nid : '(' expr ')'
Here's a little test:
> 5-5
{ [- 5 5] }
> 5(-5)
{ 5; [~ -- 5] }
> a-5
{ [- a 5] }
> a(5)
{ [CALL a 5] }
> -7*a
{ [* [~ -- 7] a] }
> a*-7
{ [* a [~ -- 7]] }
> a-b*c
{ [- a [* b c]] }
> a*b-c
{ [- [* a b] c] }
> a*b(3)-c
{ [- [* a [CALL b 3]] c] }
> a*b-c(3)
{ [- [* a b] [CALL c 3]] }
> a*b-7(3)
{ [- [* a b] 7]; 3 }
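For comparison, the same "greedy" behaviour falls out naturally of a hand-written recursive-descent parser, which is a different technique from the LR grammar above. The Python sketch below is only an illustration under the same assumptions (subtraction, multiplication, negation, single-argument calls on a plain identifier); the `Parser` class, the tuple output format and the `neg`/`call` tags are invented here.

```python
import re

def tokenize(src):
    # Numbers, identifiers, and the operators used here; anything else is ignored.
    return re.findall(r"\d+|[A-Za-z_]\w*|[-*()]", src)

class Parser:
    def __init__(self, src):
        self.toks = tokenize(src)
        self.pos = 0

    def peek(self):
        return self.toks[self.pos] if self.pos < len(self.toks) else None

    def take(self, expected=None):
        tok = self.peek()
        if tok is None or (expected is not None and tok != expected):
            raise SyntaxError(f"unexpected {tok!r} at token {self.pos}")
        self.pos += 1
        return tok

    def statements(self):
        out = []
        while self.peek() is not None:
            out.append(self.sum())  # each statement is the longest expression
        return out

    def sum(self):
        left = self.mul()
        while self.peek() == "-":  # a "-" after an operand always continues it
            self.take()
            left = ("-", left, self.mul())
        return left

    def mul(self):
        left = self.unary()
        while self.peek() == "*":
            self.take()
            left = ("*", left, self.unary())
        return left

    def unary(self):
        if self.peek() == "-":
            self.take()
            return ("neg", self.unary())
        return self.term()

    def term(self):
        tok = self.take()
        if tok.isdigit():
            return int(tok)
        if tok == "(":
            inner = self.sum()
            self.take(")")
            return inner
        if tok[0].isalpha() or tok[0] == "_":
            if self.peek() == "(":  # "(" right after an identifier is a call
                self.take()
                arg = self.sum()
                self.take(")")
                return ("call", tok, arg)
            return tok
        raise SyntaxError(f"unexpected {tok!r}")
```

`Parser("5-5").statements()` gives one statement and `Parser("5(-5)").statements()` gives two, matching the first two test cases above; `Parser("a*b-7(3)").statements()` splits before the parenthesis because 7 is not an identifier.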

ANTLR Making Negative Test Cases

I'm new to ANTLR and am trying to understand how to do some things with it. I need it to throw an error when a statement is missing things, like a semicolon or an end bracket. It's been called negative test cases by the problem set that I'm working through.
For example, the below code returns true, which is correct.
val program = """
1 + 2;
"""
recognize(program)
However, this code also returns true, despite it missing the semicolon at the end. It should return false ([PARSER error at line=1]: missing ';' at '').
val program = """
1 + 2
""".trimIndent()
recognize(program)
The grammar is as follows:
program: (expression ';')* | EOF;
expression: INT PLUS INT | OPENBRAC INT PLUS INT CLOSEBRAC | QUOTE IDENT QUOTE PLUS QUOTE IDENT QUOTE;
IDENT: [A-Za-z0-9]+;
INT: [-][0-9]+ | ('0'..'9')+;
PLUS: '+';
OPENBRAC: '(';
CLOSEBRAC: ')';
QUOTE: '"';
program: (expression ';')* | EOF;
This means a program can either be zero or more instances of expression ';' followed by whatever else is in the input stream or it can be empty. Since (expression ';')* can already match the empty input by itself, the | EOF is just redundant.
What you want is program: (expression ';')* EOF, which means that a program consists of zero or more instances of expression ';', followed by the end of input, meaning there must be nothing left in the input afterwards.
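The difference can be sketched with a tiny hand-written recognizer in Python. This is a simplification and not how ANTLR works internally: it only handles the `INT + INT ;` form, and the function names are made up for the example.

```python
import re

def tokenize(text):
    # Integers or any single non-space character.
    return re.findall(r"\d+|\S", text)

def match_statements(tokens):
    """Consume as many `INT + INT ;` statements as possible.

    Returns the index of the first token that was not consumed.
    """
    i = 0
    while (
        len(tokens) - i >= 4
        and tokens[i].isdigit()
        and tokens[i + 1] == "+"
        and tokens[i + 2].isdigit()
        and tokens[i + 3] == ";"
    ):
        i += 4
    return i

def recognize_prefix(text):
    # Like `(expression ';')*` alone: zero statements is a valid match,
    # and leftover input is never inspected.
    match_statements(tokenize(text))
    return True

def recognize_all(text):
    # Like `(expression ';')* EOF`: nothing may remain after the statements.
    tokens = tokenize(text)
    return match_statements(tokens) == len(tokens)
```

With this sketch `recognize_prefix("1 + 2")` is true because zero statements match and the rest is ignored, whereas `recognize_all("1 + 2")` is false and `recognize_all("1 + 2;")` is true.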

Can I do something to avoid the need to backtrack in this grammar?

I am trying to implement an interpreter for a programming language, and ended up stumbling upon a case where I would need to backtrack, but my parser generator (ply, a lex&yacc clone written in Python) does not allow that.
Here are the rules involved:
'var_access_start : super'
'var_access_start : NAME'
'var_access_name : DOT NAME'
'var_access_idx : OPSQR expression CLSQR'
'''callargs : callargs COMMA expression
| expression
| '''
'var_access_metcall : DOT NAME LPAREN callargs RPAREN'
'''var_access_token : var_access_name
| var_access_idx
| var_access_metcall'''
'''var_access_tokens : var_access_tokens var_access_token
| var_access_token'''
'''fornew_var_access_tokens : var_access_tokens var_access_name
| var_access_tokens var_access_idx
| var_access_name
| var_access_idx'''
'type_varref : var_access_start fornew_var_access_tokens'
'hard_varref : var_access_start var_access_tokens'
'easy_varref : var_access_start'
'varref : easy_varref'
'varref : hard_varref'
'typereference : NAME'
'typereference : type_varref'
'''expression : new typereference LPAREN callargs RPAREN'''
'var_decl_empty : NAME'
'var_decl_value : NAME EQUALS expression'
'''var_decl : var_decl_empty
| var_decl_value'''
'''var_decls : var_decls COMMA var_decl
| var_decl'''
'statement : var var_decls SEMIC'
The error occurs with statements of the form
var x = new SomeGuy.SomeOtherGuy();
where SomeGuy.SomeOtherGuy would be a valid variable that stores a type (types are first class objects) - and that type has a constructor with no arguments
What happens when parsing that expression is that the parser constructs a
var_access_start = SomeGuy
var_access_metcall = . SomeOtherGuy ( )
and then finds a semicolon and ends in an error state. I would clearly like the parser to backtrack and try constructing an expression = new typereference(SomeGuy .SomeOtherGuy) LPAREN empty_list RPAREN instead; then things would work, because the ; would match the var statement syntax.
However, given that PLY does not support backtracking and I definitely do not have enough experience in parser generators to actually implement it myself - is there any change I can make to my grammar to work around the issue?
I have considered using -> instead of . as the "method call" operator, but I would rather not change the language just to appease the parser.
Also, I have methods as a form of "variable reference" so you can do
myObject.someMethod().aChildOfTheResult[0].doSomeOtherThing(1,2,3).helloWorld()
but if the grammar can be reworked to achieve the same effect, that would also work for me
Thanks!
I assume that your language includes expressions other than the ones you've included in the excerpt. I'm also going to assume that new, super and var are actually terminals.
The following is only a rough outline. For readability, I'm using bison syntax with quoted literals, but I don't think you'll have any trouble converting.
You say that "types are first-class values" but your syntax explicitly precludes using a method call to return a type. In fact, it also seems to preclude a method call returning a function, but that seems odd since it would imply that methods are not first-class values, even though types are. So I've simplified the grammar by allowing expressions like:
new foo.returns_method_which_returns_type()()()
It's easy enough to add the restrictions back in, but it makes the exposition harder to follow.
The basic idea is to avoid forcing the parser to make a premature decision; once new is encountered, it is only possible to distinguish between a method call and a constructor call from the lookahead token. So we need to make sure that the same reductions are used up to that point, which means that when the open parenthesis is encountered, we must still retain both possibilities.
primary: NAME
| "super"
;
postfixed: primary
| postfixed '.' NAME
| postfixed '[' expression ']'
| postfixed '(' call_args ')' /* PRODUCTION 1 */
;
expression: postfixed
| "new" postfixed '(' call_args ')' /* PRODUCTION 2 */
/* | other stuff not relevant here */
;
/* Your callargs allows (,,,3). This one doesn't */
call_args : /* EMPTY */
| expression_list
;
expression_list: expression
| expression_list ',' expression
;
/* Another slightly simplified production */
var_decl: NAME
| NAME '=' expression
;
var_decl_list: var_decl
| var_decl_list ',' var_decl
;
statement: "var" var_decl_list ';'
/* | other stuff not relevant here */
;
Now, take a look at PRODUCTION 1 and PRODUCTION 2, which are very similar. (Marked with comments.) These are basically the ambiguity for which you sought backtracking. However, in this grammar, there is no issue, since once a new has been encountered, the reduction of PRODUCTION 2 can only be performed when the lookahead token is , or ;, while PRODUCTION 1 can only be performed with lookahead tokens ., ( and [.
(Grammar tested with bison, just to make sure there are no conflicts.)
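The same idea, parsing one shared postfix chain and deciding only at the end which parenthesis group belongs to new, can be illustrated with a hand-written recursive-descent parser in Python. This is a different technique from the LR grammar above and only a sketch: the function names and the tuple node format are invented, super is treated as an ordinary name, and only names are supported as primaries.

```python
import re

NAME = re.compile(r"[A-Za-z_]\w*")

def tokenize(src):
    return re.findall(r"[A-Za-z_]\w*|[.()\[\],]", src)

def peek(toks, pos):
    return toks[pos] if pos < len(toks) else None

def expect(toks, pos, tok):
    if peek(toks, pos) != tok:
        raise SyntaxError(f"expected {tok!r} at token {pos}")
    return pos + 1

def parse_name(toks, pos):
    tok = peek(toks, pos)
    if tok is None or tok == "new" or not NAME.fullmatch(tok):
        raise SyntaxError(f"expected a name at token {pos}")
    return tok, pos + 1

def parse_postfixed(toks, pos):
    name, pos = parse_name(toks, pos)
    node = ("name", name)
    while True:  # one loop handles ".name", "[expr]" and "(args)" alike
        tok = peek(toks, pos)
        if tok == ".":
            member, pos = parse_name(toks, pos + 1)
            node = ("member", node, member)
        elif tok == "[":
            index, pos = parse_expression(toks, pos + 1)
            pos = expect(toks, pos, "]")
            node = ("index", node, index)
        elif tok == "(":
            args, pos = parse_call_args(toks, pos + 1)
            node = ("call", node, args)
        else:
            return node, pos

def parse_call_args(toks, pos):
    args = []
    if peek(toks, pos) != ")":
        while True:
            arg, pos = parse_expression(toks, pos)
            args.append(arg)
            if peek(toks, pos) != ",":
                break
            pos += 1
    return args, expect(toks, pos, ")")

def parse_expression(toks, pos=0):
    if peek(toks, pos) == "new":
        node, pos = parse_postfixed(toks, pos + 1)
        # The last argument list of the chain belongs to "new" (PRODUCTION 2).
        if node[0] != "call":
            raise SyntaxError("'new' requires a trailing argument list")
        _, type_ref, args = node
        return ("new", type_ref, args), pos
    return parse_postfixed(toks, pos)
```

For example, `parse_expression(tokenize("new SomeGuy.SomeOtherGuy()"))` returns a `new` node whose type reference is the member access `SomeGuy.SomeOtherGuy`, rather than a method call.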

What is the start symbol in this grammar?

What is the start symbol?
Based on some research "The start symbol we choose should allow the grammar to parse the most input sentences"
Clearly < Var > is NOT the start symbol, as it would parse the fewest input sentences. So is the start symbol < One > or < Group > ?
<Group> ::= [ <One>, <Group> ] | <One>
<One> ::= <Var> | ( <Group> )
<Var> ::= a | b | c
The start symbol is also called the AXIOM.
It is always given explicitly and should never be deduced: it is decided by the author of the grammar.
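To make that concrete, here is a small Python sketch in which the start symbol is simply an argument the caller supplies alongside the productions. It assumes that the square brackets, parentheses and comma in the grammar are literal terminals (not EBNF notation) and that input contains no whitespace; `GRAMMAR`, `derive` and `accepts` are names invented for the example.

```python
GRAMMAR = {
    "Group": [["[", "One", ",", "Group", "]"], ["One"]],
    "One": [["Var"], ["(", "Group", ")"]],
    "Var": [["a"], ["b"], ["c"]],
}

def derive(symbol, text, pos):
    """Yield every position reachable after matching symbol at pos."""
    if symbol not in GRAMMAR:  # terminal: must match literally
        if text.startswith(symbol, pos):
            yield pos + len(symbol)
        return
    for production in GRAMMAR[symbol]:
        positions = {pos}
        for part in production:
            positions = {end for p in positions for end in derive(part, text, p)}
        yield from positions

def accepts(text, start):
    # The start symbol is chosen by the caller; the productions alone do not fix it.
    return len(text) in set(derive(start, text, 0))
```

The same productions accept different languages depending on the chosen start: `accepts("[a,(b)]", "Group")` is true, while `accepts("[a,(b)]", "One")` and `accepts("(a)", "Var")` are false.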

Resources