Jison parser generator shift-reduce conflict with parentheses, how to solve?

I'm trying to implement parentheses in my parser, but I get a conflict in my grammar:
"Conflict in grammar: multiple actions possible when lookahead token is )"
Here is a simplified version of it:
// grammar
{
"Root": ["", "Body"],
"Body": ["Line", "Body TERMINATOR Line"],
"Line": ["Expression", "Statement"],
"Statement": ["VariableDeclaration", "Call", "With", "Invocation"],
"Expression": ["Value", "Parenthetical", "Operation", "Assign"],
"Identifier": ["IDENTIFIER"],
"Literal": ["String", "Number"],
"Value": ["Literal", "ParenthesizedInvocation"],
"Accessor": [". Property"],
"ParenthesizedInvocation": ["Value ParenthesizedArgs"],
"Invocation": ["Value ArgList"],
"Call": ["CALL ParenthesizedInvocation"],
"ParenthesizedArgs": ["( )", "( ArgList )"],
"ArgList": ["Arg", "ArgList , Arg"],
"Arg": ["Expression", "NamedArg"],
"NamedArg": ["Identifier := Value"],
"Parenthetical": ["( Expression )"],
"Operation": ["Expression + Expression", "Expression - Expression"]
}
//precedence
[
['right', 'RETURN'],
['left', ':='],
['left', '='],
['left', 'IF'],
['left', 'ELSE', 'ELSE_IF'],
['left', 'LOGICAL'],
['left', 'COMPARE'],
['left', '&'],
['left', '-', '+'],
['left', 'MOD'],
['left', '\\'],
['left', '*', '/'],
['left', '^'],
['left', 'CALL'],
['left', '(', ')'],
['left', '.'],
]
In my implementation I need function calls like this (with parentheses and comma-separated arguments):
Foo(1, 2)
Foo 1, 2
And I need to be able to use regular parentheses for the priority of operations, even in function calls (but only in parenthesized function calls):
Foo(1, (2 + 4) / 2)
Foo 1, 2
A function call without parentheses is treated as a statement; a function call with parentheses is treated as an expression.
How can I solve this conflict?

In VBA, function call statements (as opposed to expressions) have two forms (simplified):
CALL name '(' arglist ')'
name arglist
Note that the second one does not have parentheses around the argument list. That's precisely to avoid the ambiguity of:
Func (3)
which is the ambiguity you're running into.
The ambiguity is that it is not clear whether the parentheses surround an argument list or a parenthesized expression. That's not an essential ambiguity, since the result is effectively the same. But it's still important because of the possibility that the program continues like this:
Foo (3), (4)
in which case, it is essential that the parentheses be parsed as parentheses surrounding a parenthesized expression.
So one possibility is to modify your grammar to be similar to the grammar in the VBA reference:
call-statement = "Call" (simple-name-expression / member-access-expression / index-expression / with-expression)
call-statement =/ (simple-name-expression / member-access-expression / with-expression) argument-list
But I suppose that you really want to implement a language similar to VBA without being strictly conformant. That makes things slightly more complicated.
As a first approximation, you can require that the form name '(' [arglist] ')' have at least two arguments (unless it's empty):
# Not tested
"Invocation": ["Literal '(' ArgList2 ')' ", "Literal '(' ')' ", "Literal ArgList"],
"ArgList": ["Arg", "ArgList2"],                   # a single Arg, or two or more
"ArgList2": ["Arg ',' Arg", "ArgList2 ',' Arg"],  # always at least two Args

Related

ANTLR4: how to build a grammar that allows keywords as identifiers

This is some demo code:
label:
var id
let id = 10
goto label
If keywords were allowed as identifiers, it would be:
let:
var var
let var = 10
goto let
This is totally legal code, but it seems very hard to do in ANTLR.
AFAIK, if ANTLR matches a let token, it will never fall back to the ID token, so ANTLR will see:
LET_TOKEN :
VAR_TOKEN <missing ID_TOKEN>VAR_TOKEN
LET_TOKEN <missing ID_TOKEN>VAR_TOKEN = 10
Although ANTLR allows predicates, I would have to control every token match, which is problematic. The grammar becomes this:
grammar Demo;
options {
language = Go;
}
@parser::members{
var _need = map[string]bool{}
func skip(name string,v bool){
_need[name] = !v
fmt.Println("SKIP",name,v)
}
func need(name string)bool{
fmt.Println("NEED",name,_need[name])
return _need[name]
}
}
proj@init{skip("inst",false)}: (line? NL)* EOF;
line
: VAR ID
| LET ID EQ? Integer
;
NL: '\n';
VAR: {need("inst")}? 'var' {skip("inst",true)};
LET: {need("inst")}? 'let' {skip("inst",true)};
EQ: '=';
ID: ([a-zA-Z] [a-zA-Z0-9]*);
Integer: [0-9]+;
WS: [ \t] -> skip;
It looks so terrible.
But this is easy in PEG; test this in PEG.js:
Expression = (Line? _ '\n')* ;
Line
= 'var' _ ID
/ 'let' _ ID _ "=" _ Integer
Integer "integer"
= [0-9]+ { return parseInt(text(), 10); }
ID = [a-zA-Z] [a-zA-Z0-9]*
_ "whitespace"
= [ \t]*
I have actually done this in peggo and JavaCC.
My question is how to handle these grammars in ANTLR 4.6. I was so excited about the ANTLR 4.6 Go target, but it seems I chose the wrong tool for my grammar?
The simplest way is to define a parser rule for identifiers:
id: ID | VAR | LET;
VAR: 'var';
LET: 'let';
ID: [a-zA-Z] [a-zA-Z0-9]*;
And then use id instead of ID in your parser rules.
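For the question's demo grammar, that would look something like this (a sketch, with the original predicates dropped):
// keyword tokens are now accepted wherever an identifier may appear,
// so 'var var' and 'let var = 10' parse
line
: VAR id
| LET id EQ? Integer
;
id: ID | VAR | LET;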
A different way is to use ID for identifiers and keywords, and use predicates for disambiguation. But it's less readable, so I'd use the first way instead.

Writing a parser with multiple expected completions

Let's say I have the following grammar for boolean queries:
Query := <Atom> [ <And-Chain> | <Or-Chain> ]
Atom := "(" <Query> ")" | <Var> <Op> <Var>
And-Chain := "&&" <Atom> [ <And-Chain> ]
Or-Chain := "||" <Atom> [ <Or-Chain> ]
Var := "A" | "B" | "C" | ... | "Z"
Op := "<" | ">" | "="
A parser for this grammar would have the pseudo-code:
parse(query) :=
tokens <- tokenize(query)
parse-query(tokens)
assert-empty(tokens)
parse-query(tokens) :=
parse-atom(tokens)
if peek(tokens) equals "&&"
parse-chain(tokens, "&&")
else if peek(tokens) equals "||"
parse-chain(tokens, "||")
parse-atom(tokens) :=
if peek(tokens) equals "("
assert-equal( "(", shift(tokens) )
parse-query(tokens)
assert-equal( ")", shift(tokens) )
else
parse-var(tokens)
parse-op(tokens)
parse-var(tokens)
parse-chain(tokens, connector) :=
assert-equal( connector, shift(tokens) )
parse-atom(tokens)
if peek(tokens) equals connector
parse-chain(tokens, connector)
parse-var(tokens) :=
assert-matches( /[A-Z]/, shift(tokens) )
parse-op(tokens) :=
assert-matches( /[<>=]/, shift(tokens) )
What I want to make sure of, though, is that my parser will report helpful parse errors. For example, given a query that starts with "( A < B && B < C || ...", I'd like an error like:
found "||" but expected "&&" or ")"
The trick with this is that it gathers expectations from across different parts of the parser. I can work out ways to do this, but it all ends up feeling a little clunky.
Like, in Java, I might throw a GreedyError when attempting to peek for a "&&" or "||":
// in parseAtom
if ( tokens.peek() == "(" ) {
assertEqual( "(", tokens.shift() );
try {
parseQuery(tokens);
caught = null;
}
catch (GreedyError e) {
caught = e;
}
try {
assertEqual( ")", tokens.shift() );
}
catch (AssertionError e) {
throw e.or(caught);
}
}
// ...
// in parseChain
assertEqual( connector, tokens.shift() );
parseAtom(tokens);
if (tokens.peek() == connector) {
parseChain(tokens, connector);
}
else {
throw new GreedyError(connector);
}
Or, in Haskell, I might use WriterT to keep track of my last failing comparisons for the current token, using censor to clear out after every successful token match.
But both of these solutions feel a bit hacked and clunky. I feel like I'm missing something fundamental, some pattern, that could handle this elegantly. What is it?
Build an L(AL)R(1) state machine. More details about LALR can be found here.
What you want to use for error reporting is the FIRSTOF set for the dot in each state. This answer will make sense once you understand how the parser states are generated.
If you have such a set of states, you can record the union of the FIRSTOF sets with each state; then, when the parser is in that state and no transition is possible, your error message is "Expected (FIRSTOF of the current state)".
If you don't want to record the FIRSTOF sets, you can easily write an algorithm that climbs through the state tables to reconstruct them. That would be the algorithmic equivalent of your "peeking ahead".
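As a rough illustration of the idea, here is a small Python sketch (the ACTIONS table is hand-written and incomplete here; a real one would be generated by an LALR tool): the keys of the current state's row are exactly the tokens the parser can accept, so the "expected" list falls out of the table itself.
# Sketch: ACTIONS maps state -> {lookahead token -> action}.
ACTIONS = {
    0: {"(": ("shift", 2), "VAR": ("shift", 1)},
    # ... one row per parser state, emitted by the parser generator
}

def step(state, token):
    row = ACTIONS[state]
    action = row.get(token)
    if action is None:
        expected = '" or "'.join(sorted(row))
        raise SyntaxError(f'found "{token}" but expected "{expected}"')
    return action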

ANTLR Grammar to Preprocess Source Files While Preserving WhiteSpace Formatting

I am trying to preprocess my C++ source files with ANTLR. I would like to output the input file, preserving all of the whitespace formatting of the original source file, while inserting some new source code of my own at the appropriate locations.
I know preserving WS requires this lexer rule:
WS: (' '|'\n'| '\r'|'\t'|'\f' )+ {$channel=HIDDEN;};
With this, my parser rules would have a $text attribute containing all the hidden WS. But the problem is that, for any parser rule, the $text attribute only includes the input text starting from the position of the first token matched by the rule. For example, if this is my input (note the formatting WS before and in between the tokens):
line 1; line 2;
And, if I have 2 separate parser rules matching
"line 1;"
and
"line 2;"
above separately but not the whole line:
" line 1; line 2;"
then the leading WS and the WS in between "line 1" and "line 2" are lost (not accessible by any of my rules).
What should I do to preserve ALL THE WHITESPACEs while allowing my parser rules to determine when to add new codes at the appropriate locations?
EDIT
Let's say that whenever my code contains a call to function(1), using 1 as the parameter and not something else, an extraFunction() call is added before it:
void myFunction() {
function();
function(1);
}
Becomes:
void myFunction() {
function();
extraFunction();
function(1);
}
This preprocessed output should remain human-readable, as people would continue coding on it. For this simple example, a text editor can handle it. But there are more complicated cases that justify the use of ANTLR.
Another solution, though maybe also not very practical: you can collect all whitespace backwards, with something like this untested pseudocode:
grammar T;
@members {
public void printWhitespaceBetweenRules(Token start) {
    int index = start.getTokenIndex() - 1;
    StringBuilder ws = new StringBuilder();
    while(index >= 0) {
        Token token = input.get(index);
        if(token.getChannel() != Token.HIDDEN_CHANNEL) break;
        ws.insert(0, token.getText()); // prepend, so the whitespace comes out in source order
        index--;
    }
    System.out.print(ws);
}
}
line1: 'line' '1' {printWhitespaceBetweenRules($start); };
line2: 'line' '2' {printWhitespaceBetweenRules($start); };
WS: (' '|'\n'| '\r'|'\t'|'\f' )+ {$channel=HIDDEN;};
But you would still need to change every rule.
I guess one solution is to keep the WS tokens in the same channel by removing $channel = HIDDEN;. This will give you access to the WS tokens in your parser.
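That is, the rule would become just:
WS: (' '|'\n'| '\r'|'\t'|'\f' )+;
at the cost of having to account for possible WS tokens in every parser rule.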
Here's another way to solve it (at least the example you posted).
So you want to replace ...function(1) with ...extraFunction();\nfunction(1), where the dots are indents, and \n a line break.
What you could do is match:
Function1
: Spaces 'function' Spaces '(' Spaces '1' Spaces ')'
;
fragment Spaces
: (' ' | '\t')*
;
and replace that with the text it matches, but pre-pended with your extra method. However, the lexer will now complain when it stumbles upon input like:
'function()'
(without the 1 as a parameter)
or:
' x...'
(indents not followed by the f from function)
So, you'll need to "branch out" in your Function1 rule and make sure you only replace the proper occurrence.
You also must take care of occurrences of function(1) inside string literals and comments, assuming you don't want them to be pre-pended with extraFunction();\n.
A little demo:
grammar T;
parse
: (t=. {System.out.print($t.text);})* EOF
;
Function1
: indent=Spaces
( 'function' Spaces '(' Spaces ( '1' Spaces ')' {setText($indent.text + "extraFunction();\n" + $text);}
| ~'1' // do nothing if something other than `1` occurs
)
| '"' ~('"' | '\r' | '\n')* '"' // do nothing in case of a string literal
| '/*' .* '*/' // do nothing in case of a multi-line comment
| '//' ~('\r' | '\n')* // do nothing in case of a single-line comment
| ~'f' // do nothing in case of a char other than 'f' is seen
)
;
OtherChar
: . // a "fall-through" rule: it will match anything if none of the above matched
;
fragment Spaces
: (' ' | '\t')* // fragment rules are only used inside other lexer rules
;
You can test it with the following class:
import org.antlr.runtime.*;
public class Main {
public static void main(String[] args) throws Exception {
String source =
"/* \n" +
" function(1) \n" +
"*/ \n" +
"void myFunction() { \n" +
" s = \"function(1)\"; \n" +
" function(); \n" +
" function(1); \n" +
"} \n";
System.out.println(source);
System.out.println("---------------------------------");
TLexer lexer = new TLexer(new ANTLRStringStream(source));
TParser parser = new TParser(new CommonTokenStream(lexer));
parser.parse();
}
}
And if you run this Main class, you will see the following being printed to the console:
bart@hades:~/Programming/ANTLR/Demos/T$ java -cp antlr-3.3.jar org.antlr.Tool T.g
bart@hades:~/Programming/ANTLR/Demos/T$ javac -cp antlr-3.3.jar *.java
bart@hades:~/Programming/ANTLR/Demos/T$ java -cp .:antlr-3.3.jar Main
/*
function(1)
*/
void myFunction() {
s = "function(1)";
function();
function(1);
}
---------------------------------
/*
function(1)
*/
void myFunction() {
s = "function(1)";
function();
extraFunction();
function(1);
}
I'm sure it's not fool-proof (I didn't account for char literals, for one), but this could be a start to solving this, IMO.

How to set up flex/bison rules for parsing a comma-delimited argument list

I would like to be able to parse a non-empty, one-or-many element, comma-delimited (and optionally parenthesized) list using flex/bison parse rules.
Some examples of parseable lists:
1
1,2
(1,2)
(3)
3,4,5
(3,4,5,6)
etc.
I am using the following rules to parse the list (the final result is the parse element 'topLevelList'), but they do not seem to give the desired result when parsing (I get a syntax error when supplying a valid list). Any suggestions on how I might set this up?
cList : ELEMENT
{
...
}
| cList COMMA ELEMENT
{
...
}
;
topLevelList : LPAREN cList RPAREN
{
...
}
| cList
{
...
}
;
This sounds simple. Tell me if I missed something or if my example doesn't work.
RvalCommaList:
RvalCommaListLoop
| '(' RvalCommaListLoop ')'
;
RvalCommaListLoop:
Rval
| RvalCommaListLoop ',' Rval
;
Rval: INT_LITERAL | WHATEVER;
However, if you accept rvals as well as this list, you'll have a conflict between a regular rval and a single-item list. In that case you can use the rules below, which either require the '(' ')' around the list or require 2 items before it is considered a list:
RvalCommaList2:
Rval ',' RvalCommaListLoop
| '(' RvalCommaListLoop ')'
;
I too want to know how to do this. Thinking about it briefly, one way to achieve it would be to use a linked list of the form:
struct list;
struct list {
void *item;
struct list *next;
};
struct list *make_list(void *item, struct list *next);
and using an action like:
{ $$ = make_list( $1, $2); }
This solution is very similar in design to:
Using bison to parse list of elements
The hard bit is to figure out how to handle lists in the scheme of a (I presume) binary AST.
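For completeness, a minimal implementation of the make_list helper declared above might look like this (a sketch, not code from the original answer):
#include <stdlib.h>

/* Prepend item onto next, returning the new head of the list. */
struct list *make_list(void *item, struct list *next) {
    struct list *l = malloc(sizeof *l);
    if (l != NULL) {
        l->item = item;
        l->next = next;
    }
    return l;
}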
%start input
%%
input:
%empty
| integer_list
;
integer_list
: integer_loop
| '(' integer_loop ')'
;
integer_loop
: INTEGER
| integer_loop COMMA INTEGER
;
%%
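Note that for this grammar to build, it also needs %token INTEGER COMMA declarations and a lexer. A minimal flex sketch to pair with it (everything here is assumed, not from the question; the header name is the bison default):
%{
#include "parser.tab.h" /* bison-generated token definitions (assumed file name) */
%}
%%
[0-9]+   { return INTEGER; }
","      { return COMMA; }
[()]     { return yytext[0]; }
[ \t\n]  { /* skip whitespace */ }
%%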

ANTLR rule to consume fixed number of characters

I am trying to write an ANTLR grammar for the PHP serialize() format, and everything seems to work fine, except for strings. The problem is that the format of serialized strings is:
s:6:"length";
In terms of regexes, a rule like s:(\d+):".{\1}"; would describe this format if only backreferences were allowed in the "number of matches" count (but they are not).
But I cannot find a way to express this for either a lexer or parser grammar: the whole idea is to make the number of characters read depend on a backreference describing the number of characters to read, as in Fortran Hollerith constants (e.g. 6HLength), not on a string delimiter.
This example from the ANTLR grammar for Fortran seems to point the way, but I don't see how. Note that my target language is Python, while most of the doc and examples are for Java:
// numeral literal
ICON {int counter=0;} :
/* other alternatives */
// hollerith
'h' ({counter>0}? NOTNL {counter--;})* {counter==0}?
{
$setType(HOLLERITH);
String str = $getText;
str = str.replaceFirst("([0-9])+h", "");
$setText(str);
}
/* more alternatives */
;
Since input like s:3:"a"b"; is valid, you can't define a String token in your lexer, unless the first and last double quote are always the start and end of your string. But I guess this is not the case.
So, you'll need a lexer rule like this:
SString
: 's:' Int ':"' ( . )* '";'
;
In other words: match an s:, then an integer value followed by :", then zero or more characters that can be anything, ending with ";. But you need to tell the lexer to stop consuming once it has read Int characters. You can do that by mixing some plain code into your grammar. You can embed plain code by wrapping it inside { and }. So first convert the value held by the Int token into an integer variable called chars:
SString
: 's:' Int {chars = int($Int.text)} ':"' ( . )* '";'
;
Now embed some code inside the ( . )* loop to stop it consuming as soon as chars is counted down to zero:
SString
: 's:' Int {chars = int($Int.text)} ':"' ( {if chars == 0: break} . {chars = chars-1} )* '";'
;
and that's it.
A little demo grammar:
grammar Test;
options {
language=Python;
}
parse
: (SString {print 'parsed: [\%s]' \% $SString.text})+ EOF
;
SString
: 's:' Int {chars = int($Int.text)} ':"' ( {if chars == 0: break} . {chars = chars-1} )* '";'
;
Int
: '0'..'9'+
;
(note that you need to escape the % inside your grammar!)
And a test script:
import antlr3
from TestLexer import TestLexer
from TestParser import TestParser
input = 's:6:"length";s:1:""";s:0:"";s:3:"end";'
char_stream = antlr3.ANTLRStringStream(input)
lexer = TestLexer(char_stream)
tokens = antlr3.CommonTokenStream(lexer)
parser = TestParser(tokens)
parser.parse()
which produces the following output:
parsed: [s:6:"length";]
parsed: [s:1:""";]
parsed: [s:0:"";]
parsed: [s:3:"end";]
