Ocamlyacc token not visible when performing semantic action - parsing

I am using ocamlyacc for a small parser which also performs some semantic actions on most parsing rules.
I have defined a set of tokens in the beginning:
%token T_plus
%token T_minus
%token <int> T_int_const
%left T_plus T_minus
A parser rule which performs a semantic action is the following:
exp: exp T_plus exp
{
checkType T_plus $1 $3
}
where checkType is an external helper function. However, I'm getting this strange warning (which refers to a line in my Parser.mly file)
warning: T_plus was selected from type Parser.token.
It is not visible in the current scope,
and will not be selected if the type becomes unknown.
I haven't found any relevant info in the ocamlyacc manual. Has anyone encountered a similar error? Why is the token not visible inside the scope of the semantic action?

It is not possible to guess what goes wrong on your side, since you're not disclosing enough information. I can guess, that you somehow misread the error message, and the problem is in another file. For example, the following file:
%{
let f PLUS _ = ()
%}
%token PLUS
%left PLUS
%start exp
%type <unit> exp
%%
exp : exp PLUS exp {f PLUS $1}
compiles any problems or warnings with
ocamlbuild Parser.byte
I can only suggest, to look at the generated Parser.ml and see what's is happening there.
In general, this message means, that you're referring to a constructor, that was not brought to the scope. In Parser.mly tokens are always in the scope, so you can't see this error in that file. Usually, you may do this in your lexer. So make sure, that you have open Parser in the intro section of your lexer.

Related

Solve shift/reduce conflict across rules

I'm trying to learn bison by writing a simple math parser and evaluator. I'm currently implementing variables. A variable can be part of a expression however I'd like do something different when one enters only a single variable name as input, which by itself is also a valid expression and hence the shift reduce conflict. I've reduced the language to this:
%token <double> NUM
%token <const char*> VAR
%nterm <double> exp
%left '+'
%precedence TWO
%precedence ONE
%%
input:
%empty
| input line
;
line:
'\n'
| VAR '\n' %prec ONE
| exp '\n' %prec TWO
;
exp:
NUM
| VAR %prec TWO
| exp '+' exp { $$ = $1 + $3; }
;
%%
As you can see, I've tried solving this by adding the ONE and TWO precedences manually to some rules, however it doesn't seem to work, I always get the exact same conflict. The goal is to prefer the line: VAR '\n' rule for a line consisting of nothing but a variable name, otherwise parse it as expression.
For reference, the conflicting state:
State 4
4 line: VAR . '\n'
7 exp: VAR . ['+', '\n']
'\n' shift, and go to state 8
'\n' [reduce using rule 7 (exp)]
$default reduce using rule 7 (exp)
Precedence comparisons are always, without exception, between a production and a token. (At least, on Yacc/Bison). So you can be sure that if your precedence level list does not contain a real token, it will have no effect whatsoever.
For the same reason, you cannot resolve reduce-reduce conflicts with precedence. That doesn't matter in this case, since it's a shift-reduce conflict, but all the same it's useful to know.
To be even more specific, the precedence comparison is between a reduction (using the precedence of the production to be reduced) and that of the incoming lookahead token. In this case, the lookahead token is \n and the reduction is exp: VAR. The precedence level of that production is the precedence of VAR, since that is the last terminal symbol in the production. So if you want the shift to win out over the reduction, you need to declare your precedences so that the shift is higher:
%precedence VAR
%precedence '\n'
No pseudotokens (or %prec modifiers) are needed.
This will not change the parse, because Bison always prefers shift if there are no applicable precedence rules. But it will suppress the warning.

yacc shift-reduce for ambiguous lambda syntax

I'm writing a grammar for a toy language in Yacc (the one packaged with Go) and I have an expected shift-reduce conflict due to the following pseudo-issue. I have to distilled the problem grammar down to the following.
start:
stmt_list
expr:
INT | IDENT | lambda | '(' expr ')' { $$ = $2 }
lambda:
'(' params ')' '{' stmt_list '}'
params:
expr | params ',' expr
stmt:
/* empty */ | expr
stmt_list:
stmt | stmt_list ';' stmt
A lambda function looks something like this:
map((v) { v * 2 }, collection)
My parser emits:
conflicts: 1 shift/reduce
Given the input:
(a)
It correctly parses an expr by the '(' expr ')' rule. However given an input of:
(a) { a }
(Which would be a lambda for the identity function, returning its input). I get:
syntax error: unexpected '{'
This is because when (a) is read, the parser is choosing to reduce it as '(' expr ')', rather than consider it to be '(' params ')'. Given this conflict is a shift-reduce and not a reduce-reduce, I'm assuming this is solvable. I just don't know how to structure the grammar to support this syntax.
EDIT | It's ugly, but I'm considering defining a token so that the lexer can recognize the ')' '{' sequence and send it through as a single token to resolve this.
EDIT 2 | Actually, better still, I'll make lambdas require syntax like ->(a, b) { a * b} in the grammar, but have the lexer emit the -> rather than it being in the actual source code.
Your analysis is indeed correct; although the grammar is not ambiguous, it is impossible for the parser to decide with the input reduced to ( <expr> and with lookahead ) whether or not the expr should be reduced to params before shifting the ) or whether the ) should be shifted as part of a lambda. If the next token were visible, the decision could be made, so the grammar LR(2), which is outside of the competence of go/yacc.
If you were using bison, you could easily solve this problem by requesting a GLR parser, but I don't believe that go/yacc provides that feature.
There is an LR(1) grammar for the language (there is always an LR(1) grammar corresponding to any LR(k) grammar for any value of k) but it is rather annoying to write by hand. The essential idea of the LR(k) to LR(1) transformation is to shift the reduction decisions k-1 tokens forward by accumulating k-1 tokens of context into each production. So in the case that k is 2, each production P: N → α will be replaced with productions TNU → Tα U for each T in FIRST(α) and each U in FOLLOW(N). [See Note 1] That leads to a considerable blow-up of non-terminals in any non-trivial grammar.
Rather than pursuing that idea, let me propose two much simpler solutions, both of which you seem to be quite close to.
First, in the grammar you present, the issue really is simply the need for a two-token lookahead when the two tokens are ){. That could easily be detected in the lexer, and leads to a solution which is still hacky but a simpler hack: Return ){ as a single token. You need to deal with intervening whitespace, etc., but it doesn't require retaining any context in the lexer. This has the added bonus that you don't need to define params as a list of exprs; they can just be a list of IDENT (if that's relevant; a comment suggests that it isn't).
The alternative, which I think is a bit cleaner, is to extend the solution you already seem to be proposing: accept a little too much and reject the errors in a semantic action. In this case, you might do something like:
start:
stmt_list
expr:
INT
| IDENT
| lambda
| '(' expr_list ')'
{ // If $2 has more than one expr, report error
$$ = $2
}
lambda:
'(' expr_list ')' '{' stmt_list '}'
{ // If anything in expr_list is not a valid param, report error
$$ = make_lambda($2, $4)
}
expr_list:
expr | expr_list ',' expr
stmt:
/* empty */ | expr
stmt_list:
stmt | stmt_list ';' stmt
Notes
That's only an outline; the complete algorithm includes the mechanism to recover the original parse tree. If k is greater than 2 then T and U are strings the the FIRSTk-1 and FOLLOWk-1 sets.
If it really is a shift-reduce conflict, and you want only the shift behavior, your parser generator may give you a way to prefer a shift vs. a reduce. This is classically how the conflict for grammar rules for "if-then-stmt" and "if-then-stmt-else-stmt" is resolved, when the if statement can also be a statement.
See http://www.gnu.org/software/bison/manual/html_node/Shift_002fReduce.html
You can get this effect two ways:
a) Count on the accidental behavior of the parsing engine.
If an LALR parser handles shifts first, and then reductions if there are no shifts, then you'll get this "prefer shift" for free. All the parser generator has to do is built the parse tables anyway, even if there is a detected conflict.
b) Enforce the accidental behavior. Design (or a get a) parser generator to accept "prefer shift on token T". Then one can supress the ambiguity. One still have to implement the parsing engine as in a) but that's pretty easy.
I think this is easier/cleaner than abusing the lexer to make strange tokens (and that doesn't always work anyway).
Obviously, you could make a preference for reductions to turn it the other way. With some extra hacking, you could make shift-vs-reduce specific the state in which the conflict occured; you can even make it specific to the pair of conflicting rules but now the parsing engine needs to keep preference data around for nonterminals. That still isn't hard. Finally, you could add a predicate for each nonterminal which is called when a shift-reduce conflict is about to occur, and it have it provide a decision.
The point is you don't have to accept "pure" LALR parsing; you can bend it easily in a variety of ways, if you are willing to modify the parser generator/engine a little bit. This gives a really good reason to understand how these tools work; then you can abuse them to your benefit.

Detect wrong grammar (error alternatives) for quick fixes

In my ANTLR grammar I would like to detect wrong keywords or wrong typed constants, e.g. 'null' instead of 'NULL'. I added error alternatives in the grammar, e.g.:
| extra='null' #error1
If the wrong constant is detected in my custom editor I can fix it by replacing it with the correct constant.
But I don't know if this is the correct way to address and detect wrong keywords or constants in a grammar.
In addition I tried to detect missing closing in the grammar (see ANTLR book chapter 9.4):
.......
| 'if' '(' expr expr #error2
| '{' exprlist #error3
| 'while' '(' expr expr #error4
........
But this massively slows down the parsing process and so I think that it is wrong to do so.
My questions are:
Is it correct to detect wrong keywords, constants, etc. in that way as described above?
Is it somehow possible to catch missing closing in the grammar without the massive speed decrease?
Any help is appreciated.
I found out that I can extract the second Token from the rule (to add a brace at that position):
| 'if' '(' expr expr #error2
with:
Token myToken = tokens.get(ctx.getChild(2).getSourceInterval().b);
parser.notifyErrorListeners(........
However it massively decreases the speed of the parser.

bison shift reduce conflict, i don't know where

Grammar: http://pastebin.com/ef2jt8Rg
y.output: http://pastebin.com/AEKXrrRG
I don't know where is those conflicts, someone can help me with this?
The y.output file tells you exactly where the conflicts are. The first one is in state 4, so if you go down to look at state 4, you see:
state 4
99 compound_statement: '{' . '}'
100 | '{' . statement_list '}'
IDENTIFIER shift, and go to state 6
:
IDENTIFIER [reduce using rule 1 (threat_as_ref)]
IDENTIFIER [reduce using rule 2 (func_call_start)]
This is telling you that in this state (parsing a compound_statement, having seen a {), and looking at the next token being IDENTIFIER, there are 3 possible things it could do -- shift the token (which would be the beginning of a statement_list), reduce the threat_as_ref empty production, or reduce the func_call_start empty production.
The brackets tell you that it has decided to never do those actions -- the default "prefer shift over reduce" conflict resolution means that it will always do the shift.
The problem with your grammar is these empty rules threat_as_ref and func_call_start -- they need to be reduced BEFORE shifting the IDENTIFIER, but in order to know if they're valid, the parser would need to see the tokens AFTER the identifer. func_call_start should only be reduced if this is the beginning of the function call (which depends on there being a ( after the IDENTIFIER.) So the parser needs more lookahead to deal with your gramar. In your specific case, you grammar is LALR(2) (2 token lookahead would suffice), but not LALR(1), so bison can't deal with it.
Now you could fix it by just getting rid of those empty rules -- func_call_start has no action at all, and the action for threat_as_ref could be moved into the action for variable, but if you want those rules in the future that may be a problem.
(1) I see at least one thing that looks odd. Your productions for expression_statement are similar to those for postfix_statement, but not quite the same. They don't have the '(' and ')' tokens:
expression_statement
: ';'
| expression ';'
| func_call_start IDENTIFIER { ras_parse_variable_psh($2); aFree($2); } func_call_end ';'
| func_call_start IDENTIFIER { ras_parse_variable_psh($2); aFree($2); } argument_expression_list func_call_end ';'
;
Since an expression can be a primary_expression, which can be an IDENTIFIER, and since func_call_start and func_call_end are epsilon (null) productions, when presented with the input
foo;
the parser has to decide whether to apply
expression_statement : expression ';'
or
expression_statement : func_call_start IDENTIFIER { ras_parse_variable_psh($2); aFree($2); } func_call_end ';'
(2) Also, I'm not certain of this, but I suspect the epsilon non-terminal threat_as_ref might be causing you some trouble. I have not traced it through, but there may be a case where the parser has to decide whether something is a variable_ref or a variable.

Shift/reduce conflict in yacc due to look-ahead token limitation?

I've been trying to tackle a seemingly simple shift/reduce conflict with no avail. Naturally, the parser works fine if I just ignore the conflict, but I'd feel much safer if I reorganized my rules. Here, I've simplified a relatively complex grammar to the single conflict:
statement_list
: statement_list statement
|
;
statement
: lvalue '=' expression
| function
;
lvalue
: IDENTIFIER
| '(' expression ')'
;
expression
: lvalue
| function
;
function
: IDENTIFIER '(' ')'
;
With the verbose option in yacc, I get this output file describing the state with the mentioned conflict:
state 2
lvalue -> IDENTIFIER . (rule 5)
function -> IDENTIFIER . '(' ')' (rule 9)
'(' shift, and go to state 7
'(' [reduce using rule 5 (lvalue)]
$default reduce using rule 5 (lvalue)
Thank you for any assistance.
The problem is that this requires 2-token lookahead to know when it has reached the end of a statement. If you have input of the form:
ID = ID ( ID ) = ID
after parser shifts the second ID (lookahead is (), it doesn't know whether that's the end of the first statement (the ( is the beginning of a second statement), or this is a function. So it shifts (continuing to parse a function), which is the wrong thing to do with the example input above.
If you extend function to allow an argument inside the parenthesis and expression to allow actual expressions, things become worse, as the lookahead required is unbounded -- the parser needs to get all the way to the second = to determine that this is not a function call.
The basic problem here is that there's no helper punctuation to aid the parser in finding the end of a statement. Since text that is the beginning of a valid statement can also appear in the middle of a valid statement, finding statement boundaries is hard.

Resources