I'm working on a grammar that is context-sensitive. Here is its description:
It describes the set of expressions.
Each expression contains one or more parts separated by logical operator.
Each part consists of optional field identifier followed by some comparison operator (that is also optional) and the list of values.
Values are separated by logical operator as well.
By default value is a sequence of characters. Sometimes (depending on context) set of possible characters for each value can be extended. It even can consume comparison operator (that is used for separating of field identifiers from list of values, according to 3rd rule) to treat it as value's character.
Here's the simplified version of a grammar:
grammar TestGrammar;
#members {
boolean isValue = false;
}
exprSet: (expr NL?)+;
expr: expr log_op expr
| part
| '(' expr ')'
;
part: (fieldId comp_op)? values;
fieldId: STRNG;
values: values log_op values
| value
| '(' values ')'
;
value: strng;
strng: ( STRNG
| {isValue}? comp_op
)+;
log_op: '&' '&';
comp_op: '=';
NL: '\r'? '\n';
WS: ' ' -> channel(HIDDEN);
STRNG: CHR+;
CHR: [A-Za-z];
I'm using semantic predicate in strng rule. It should extend the set of possible tokens depending on isValue variable;
The problem occurs when semantic predicate evaluates to false. I expect that 2 STRNG tokens with '=' token between them will be treated as part node. Instead of it, it parses each STRNG token as a value, and throws out '=' token when re-synchronizing.
Here's the input string and the resulting expression tree that is incorrect:
a && b=c
To look at correct expression tree it's enough to remove an alternative with semantic predicate from strng rule (that makes it static and so is inappropriate for my solution):
strng: ( STRNG
// | {isValue}? comp_op
)+;
Here's resulting expression tree:
BTW, when semantic predicate evaluates to true - the result is as expected: strng rule matches an extended set of tokens:
strng: ( STRNG
| {!isValue}? comp_op
)+;
Please explain why this happens in such way, and help to find out correct solution. Thanks!
What about removing one option from values? Otherwise the text a && b may be either a
expr -> expr log_op expr
or
expr -> part -> values log_op values
.
It seems Antlr resolves it by using the second option!
values
: //values log_op values
value
| '(' values ')'
;
I believe your expr rule is written in the wrong order. Try moving the binary expression to be the last alternative instead of the first.
Ok, I've realized that current approach is inappropriate for my task.
I've chosen another approach based on overriding of Lexer's nextToken() and emit() methods, as described in ANTLR4: How to inject tokens .
It has given me almost full control on the stream of tokens. I got following advantages:
assigning required types to tokens;
postpone sending tokens with yet undefined type to parser (by sending fake tokens on hidden channel);
possibility to split and merge tokens;
possibility to organize postponed tokens into queues.
Having all these possibilities I'm able to resolve all the ambiguities in the parser.
P.S. Thanks to everyone who tried to help, I appreciate it!
Related
I am using Bison (AFAIK they use LL(1) parsing as default).
My grammar says something like this:
function_decl: ID '(' params ')' ':' TYPE ... // body may go here
function_call: ID '(' arguments ')'
params: ID ':' TYPE
| params ',' ID ':' TYPE
| %empty
arguments: ID
| arguments ',' ID
| %empty
Now, bison warns saying a reduce/reduce conflict, because params and arguments both are null-able (in case of a zero-parameter function()).
My question is, how can I remove (instead of suppressing) this conflict ?
Although someone suggested to use different parsing technique, I want to make it clear for myself if it is possible (and I should do so) or I should just ignore it.
If you ignore the warning, you will end up with a parser which cannot recognise a function call without arguments. So that is probably not a good idea.
You are quite correct that the conflict is the result of both params and arguments producing an empty string. Because the parser can only lookahead at one symbol a when it reads the ) in func(), it needs to decide whether to reduce an empty params (which will force it to proceed with function_decl) or an empty arguments (which will commit it to function_call). But there is no way to know until the next token is read.
The easiest solution is to avoid the empty reductions, although it makes the grammar slightly wordier. In the following, neither params nor arguments produce the empty string, and extra productions for function_decl and function_call are used to recognise these cases.
function_decl: ID '(' params ')' ':' TYPE ... // body may go here
function_decl: ID '(' ')' ':' TYPE ...
function_call: ID '(' arguments ')'
function_call: ID '(' ')'
params: ID ':' TYPE
| params ',' ID ':' TYPE
arguments: ID
| arguments ',' ID
This works because it lets the parser defer the decision between call and declaration. An LR parser can defer decisions until it has to reduce; it can keep several productions open at the same time, but it must reduce a production when it reaches its end.
Note that even without the conflict, your original grammar is incorrect. As written, it allows arguments (or params) to start with an arbitrary number of commas. Your intention was not to allow %empty as an alternative base case, but rather as an alternative complete expansion. The correct grammar for an optional comma-separated list requires two non-terminals:
arguments
: %empty
| argument_list
argument_list
: argument
| argument_list ',' argument
I have written a lexer and parser to analyze linear algebra statements. Each statement consists of one or more expressions followed by one or more declarations. I am using menhir and OCaml to write the lexer and parser.
For example:
Ax = b, where A is invertible.
This should be read as A * x = b, (A, invertible)
In an expression all ids must be either an uppercase or lowercase symbol. I would like to overload the multiplication operator so that the user does not have to type in the '*' symbol.
However, since the lexer also needs to be able to read strings (such as "invertible" in this case), the "Ax" portion of the expression is sent over to the parser as a string. This causes a parser error since no strings should be encountered in the expression portion of the statement.
Here is the basic idea of the grammar
stmt :=
| expr "."
| decl "."
| expr "," decl "."
expr :=
| term
| unop expr
| expr binop expr
term :=
| <int> num
| <char> id
| "(" expr ")"
decl :=
| id "is" kinds
kinds :=
| <string> kind
| kind "and" kinds
Is there some way to separate the individual characters and tell the parser that they should be treated as multiplication? Is there a way to change the lexer so that it is smart enough to know that all character clusters before a comma are ids and all clusters after should be treated as strings?
It seems to me you have two problems:
You want your lexer to treat sequences of characters differently in different places.
You want multiplication to be indicated by adjacent expressions (no operator in between).
The first problem I would tackle in the lexer.
One question is why you say you need to use strings. This implies that there is a completely open-ended set of things you can say. It might be true, but if you can limit yourself to a smallish number, you can use keywords rather than strings. E.g., invertible would be a keyword.
If you really want to allow any string at all in such places, it's definitely still possible to hack a lexer so that it maintains a state describing what it has seen, and looks ahead to see what's coming. If you're not required to adhere to a pre-defined grammar, you could adjust your grammar to make this easier. (E.g., you could use commas for only one purpose.)
For the second problem, I'd say you need to add adjacency to your grammar. I.e., your grammar needs a rule that says something like term := term term. I suspect it's tricky to get this to work correctly, but it does work in OCaml (where adjacent expressions represent function application) and in awk (where adjacent expressions represent string concatenation).
I am trying to implement an interpreter for a programming language, and ended up stumbling upon a case where I would need to backtrack, but my parser generator (ply, a lex&yacc clone written in Python) does not allow that
Here's the rules involved:
'var_access_start : super'
'var_access_start : NAME'
'var_access_name : DOT NAME'
'var_access_idx : OPSQR expression CLSQR'
'''callargs : callargs COMMA expression
| expression
| '''
'var_access_metcall : DOT NAME LPAREN callargs RPAREN'
'''var_access_token : var_access_name
| var_access_idx
| var_access_metcall'''
'''var_access_tokens : var_access_tokens var_access_token
| var_access_token'''
'''fornew_var_access_tokens : var_access_tokens var_access_name
| var_access_tokens var_access_idx
| var_access_name
| var_access_idx'''
'type_varref : var_access_start fornew_var_access_tokens'
'hard_varref : var_access_start var_access_tokens'
'easy_varref : var_access_start'
'varref : easy_varref'
'varref : hard_varref'
'typereference : NAME'
'typereference : type_varref'
'''expression : new typereference LPAREN callargs RPAREN'''
'var_decl_empty : NAME'
'var_decl_value : NAME EQUALS expression'
'''var_decl : var_decl_empty
| var_decl_value'''
'''var_decls : var_decls COMMA var_decl
| var_decl'''
'statement : var var_decls SEMIC'
The error occurs with statements of the form
var x = new SomeGuy.SomeOtherGuy();
where SomeGuy.SomeOtherGuy would be a valid variable that stores a type (types are first class objects) - and that type has a constructor with no arguments
What happens when parsing that expression is that the parser constructs a
var_access_start = SomeGuy
var_access_metcall = . SomeOtherGuy ( )
and then finds a semicolon and ends in an error state - I would clearly like the parser to backtrack, and try constructing an expression = new typereference(SomeGuy .SomeOtherGuy) LPAREN empty_list RPAREN and then things would work because the ; would match the var statement syntax all right
However, given that PLY does not support backtracking and I definitely do not have enough experience in parser generators to actually implement it myself - is there any change I can make to my grammar to work around the issue?
I have considered using -> instead of . as the "method call" operator, but I would rather not change the language just to appease the parser.
Also, I have methods as a form of "variable reference" so you can do
myObject.someMethod().aChildOfTheResult[0].doSomeOtherThing(1,2,3).helloWorld()
but if the grammar can be reworked to achieve the same effect, that would also work for me
Thanks!
I assume that your language includes expressions other than the ones you've included in the excerpt. I'm also going to assume that new, super and var are actually terminals.
The following is only a rough outline. For readability, I'm using bison syntax with quoted literals, but I don't think you'll have any trouble converting.
You say that "types are first-class values" but your syntax explicitly precludes using a method call to return a type. In fact, it also seems to preclude a method call returning a function, but that seems odd since it would imply that methods are not first-class values, even though types are. So I've simplified the grammar by allowing expressions like:
new foo.returns_method_which_returns_type()()()
It's easy enough to add the restrictions back in, but it makes the exposition harder to follow.
The basic idea is that to avoid forcing the parser to make a premature decision; once new is encountered, it is only possible to distinguish between a method call and a constructor call from the lookahead token. So we need to make sure that the same reductions are used up to that point, which means that when the open parenthesis is encountered, we must still retain both possibilities.
primary: NAME
| "super"
;
postfixed: primary
| postfixed '.' NAME
| postfixed '[' expression ']'
| postfixed '(' call_args ')' /* PRODUCTION 1 */
;
expression: postfixed
| "new" postfixed '(' call_args ')' /* PRODUCTION 2 */
/* | other stuff not relevant here */
;
/* Your callargs allows (,,,3). This one doesn't */
call_args : /* EMPTY */
| expression_list
;
expression_list: expression
| expression_list ',' expression
;
/* Another slightly simplified production */
var_decl: NAME
| NAME '=' expression
;
var_decl_list: var_decl
| var_decl_list ',' var_decl
;
statement: "var" var_decl_list ';'
/* | other stuff not relevant here */
;
Now, take a look at PRODUCTION 1 and PRODUCTION 2, which are very similar. (Marked with comments.) These are basically the ambiguity for which you sought backtracking. However, in this grammar, there is no issue, since once a new has been encountered, the reduction of PRODUCTION 2 can only be performed when the lookahead token is , or ;, while PRODUCTION 1 can only be performed with lookahead tokens ., ( and [.
(Grammar tested with bison, just to make sure there are no conflicts.)
So I have been reading a bit on lexers, parser, interpreters and even compiling.
For a language I'm trying to implement I settled on a Recrusive Descent Parser. Since the original grammar of the language had left-recursion, I had to slightly rewrite it.
Here's a simplified version of the grammar I had (note that it's not any standard format grammar, but somewhat pseudo, I guess, it's how I found it in the documentation):
expr:
-----
expr + expr
expr - expr
expr * expr
expr / expr
( expr )
integer
identifier
To get rid of the left-recursion, I turned it into this (note the addition of the NOT operator):
expr:
-----
expr_term {+ expr}
expr_term {- expr}
expr_term {* expr}
expr_term {/ expr}
expr_term:
----------
! expr_term
( expr )
integer
identifier
And then go through my tokens using the following sub-routines (simplified pseudo-code-ish):
public string Expression()
{
string term = ExpressionTerm();
if (term != null)
{
while (PeekToken() == OperatorToken)
{
term += ReadToken() + Expression();
}
}
return term;
}
public string ExpressionTerm()
{
//PeekToken and ReadToken accordingly, otherwise return null
}
This works! The result after calling Expression is always equal to the input it was given.
This makes me wonder: If I would create AST nodes rather than a string in these subroutines, and evaluate the AST using an infix evaluator (which also keeps in mind associativity and precedence of operators, etcetera), won't I get the same result?
And if I do, then why are there so many topics covering "fixing left recursion, keeping in mind associativity and what not" when it's actually "dead simple" to solve or even a non-problem as it seems? Or is it really the structure of the resulting AST people are concerned about (rather than what it evaluates to)? Could anyone shed a light, I might be getting it all wrong as well, haha!
The shape of the AST is important, since a+(b*3) is not usually the same as (a+b)*3 and one might reasonably expect the parser to indicate which of those a+b*3 means.
Normally, the AST will actually delete parentheses. (A parse tree wouldn't, but an AST is expected to abstract away syntactic noise.) So the AST for a+(b*3) should look something like:
Sum
|
+---+---+
| |
Var Prod
| |
a +---+---+
| |
Var Const
| |
b 3
If you language obeys usual mathematical notation conventions, so will the AST for a+b*3.
An "infix evaluator" -- or what I imagine you're referring to -- is just another parser. So, yes, if you are happy to parse later, you don't have to parse now.
By the way, showing that you can put tokens back together in the order that you read them doesn't actually demonstrate much about the parser functioning. You could do that much more simply by just echoing the tokenizer's output.
The standard and easiest way to deal with expressions, mathematical or other, is with a rule hierarchy that reflects the intended associations and operator precedence:
expre = sum
sum = addend '+' sum | addend
addend = term '*' addend | term
term = '(' expre ')' | '-' integer | '+' integer | integer
Such grammars let the parse or abstract trees be directly evaluatable. You can expand the rule hierarchy to include power and bitwise operators, or make it part of the hierarchy for logical expressions with and or and comparisons.
I'm building a parser for a language I've designed, in which type names start with an upper case letter and variable names start with a lower case letter, such that the lexer can tell the difference and provide different tokens. Also, the string 'this' is recognised by the lexer (it's an OOP language) and passed as a separate token. Finally, data members can only be accessed on the 'this' object, so I built the grammar as so:
%token TYPENAME
%token VARNAME
%token THIS
%%
start:
Expression
;
Expression:
THIS
| THIS '.' VARNAME
| Expression '.' TYPENAME
;
%%
The first rule of Expression allows the user to pass 'this' around as a value (for example, returning it from a method or passing to a method call). The second is for accessing data on 'this'. The third rule is for calling methods, however I've removed the brackets and parameters since they are irrelevant to the problem. The originally grammar was clearly much larger than this, however this is the smallest part that generates the same error (1 Shift/Reduce conflict) - I isolated it into its own parser file and verified this, so the error has nothing to do with any other symbols.
As far as I can see, the grammar given here is unambiguous and so should not produce any errors. If you remove any of the three rules or change the second rule to
Expression '.' VARNAME
there is no conflict. In any case, I probably need someone to state the obvious of why this conflict occurs and how to resolve it.
The problem is that the grammar can only look one ahead. Therefore when you see a THIS then a ., are you in line 2(Expression: THIS '.' VARNAME) or line 3 (Expression: Expression '.' TYPENAME, via a reduction according to line 1).
The grammar could reduce THIS. to Expression. and then look for a TYPENAME or shift it to THIS. and look for a VARNAME, but it has to decide when it gets to the ..
I try to avoid y.output but sometimes it does help. I looked at the file it produced and saw.
state 1
2 Expression: THIS. [$end, '.']
3 | THIS . '.' VARNAME
'.' shift, and go to state 4
'.' [reduce using rule 2 (Expression)]
$default reduce using rule 2 (Expression)
Basically it is saying it sees '.' and can reduce or it can shift. Reduce makes me anrgu sometimes because they are hard to fine. The shift is rule 3 and is obvious (but the output doesnt mention the rule #). The reduce where it see's '.' in this case is the line
| Expression '.' TYPENAME
When it goes to Expression it looks at the next letter (the '.') and goes in. Now it sees THIS | so when it gets to the end of that statement it expects '.' when it leaves or an error. However it sees THIS '.' while its between this and '.' (hence the dot in the out file) and it CAN reduce a rule so there is a path conflict. I believe you can use %glr-parser to allow it to try both but the more conflicts you have the more likely you'll either get unexpected output or an ambiguity error. I had ambiguity errors in the past. They are annoying to deal with especially if you dont remember what rule caused or affected them. it is recommended to avoid conflicts.
I highly recommend this book before attempting to use bison.
I cant think of a 'great' solution but this gives no conflicts
start:
ExpressionLoop
;
ExpressionLoop:
Expression
| ExpressionLoop ';' Expression
;
Expression:
rval
| rval '.' TYPENAME
| THIS //trick is moving this AWAY so it doesnt reduce
rval:
THIS '.' VARNAME
Alternative you can make it reduce later by adding more to the rule so it doesnt reduce as soon or by adding a token after or before to make it clear which path to take or fails (remember, it must know BEFORE reducing ANY path)
start:
ExpressionLoop
;
ExpressionLoop:
Expression
| ExpressionLoop ';' Expression
;
Expression:
rval
| rval '.' TYPENAME
rval:
THIS '#'
| THIS '.' VARNAME
%%
-edit- note if i want to do func param and type varname i cant because type according to the lexer func is a Var (which is A-Za-z09_) as well as type. param and varname are both var's as well so this will cause me a reduce/reduce conflict. You cant write this as what they are, only what they look like. So keep that in mind when writing. You'll have to write a token to differentiate the two or write it as one of the two but write additional logic in code (the part that is in { } on the right side of the rules) to check if it is a funcname or a type and handle both those case.