I'm using Python PLY to parse a specific language. For a grammar like:
IF LPAREN condition RPAREN LBRACE stmtlist RBRACE ELSE LBRACE stmtlist RBRACE
When I know the condition value, say True, then is there a way to stop parsing the stmtlist in the ELSE path?
Thanks,
You will have to continue the parse, because you need to find the end of the block enclosed by the second RBRACE; in other words, you need to parse to find the beginning of the next statement.
That said, when you analyze the results of the parse (to generate code, construct an AST, whatever you need to do), if you can determine that condition always evaluates to true (perhaps it is the expression 1 = 1), then you can suppress generating code for the second stmtlist.
Update:
Your grammar (the syntax of your language) is specified non-procedurally, so there's no place for you to attach conditional logic.
On the other hand, you specify semantic actions to take when a particular syntactic element of your grammar is matched, and you do this procedurally. In PLY, you do this by coding the body of grammar rule function(s). In the grammar rule function that matches the second stmtlist, you can write conditional code to skip over code generation, based on other information you have figured out about the input program (the input to your compiled language processor).
Related
I am working on a SQL grammar where pretty much anything can be an expression, even places where you might not realize it. Here are a few examples:
-- using an expression on the list indexing
SELECT ([1,2,3])[(select 1) : (select 1 union select 1 limit 1)];
Of course this is an extreme example, but my point being, many places in SQL you can use an arbitrarily nested expression (even when it would seem "Oh that is probably just going to allow a number or string constant).
Because of this, I currently have one long rule for expressions that may reference itself, the following being a pared down example:
grammar DBParser;
options { caseInsensitive=true; }
statement:select_statement EOF;
select_statement
: 'SELECT' expr
'WHERE' expr // The WHERE clause should only allow a BoolExpr
;
expr
: expr '=' expr # EqualsExpr
| expr 'OR' expr # BoolExpr
| ATOM # ConstExpr
;
ATOM: [0-9]+ | '\'' [a-z]+ '\'';
WHITESPACE: [ \t\r\n] -> skip;
With sample input SELECT 1 WHERE 'abc' OR 1=2. However, one place I do want to limit what expressions are allowed is in the WHERE (and HAVING) clause, where the expression must be a boolean expression, in other words WHERE 1=1 is valid, but WHERE 'abc' is invalid. In practical terms what this means is the top node of the expression must be a BoolExpr. Is this something that I should modify in my parser rules, or should I be doing this validation downstream, for example in the semantic phase of validation? Doing it this way would probably be quite a bit simpler (even if the lexer rules are a bit lax), as there would be so much indirection and probably indirect left-recursion involved that it would become incredibly convoluted. What would be a good approach here?
Your intuition is correct that breaking this out would probably create indirect left recursion. Also, is it possible that an IDENTIFIER could represent a boolean value?
This is the point of #user207421's comment. You can't fully capture types (i.e. whether an expression is boolean or not) in the parser.
The parser's job (in the Lexer & Parser sense), put fairly simply, is to convert your input stream of characters into the a parse tree that you can work with. As long as it gives a parse tree that is the only possible way to interest the input (whether it is semantically valid or not), it has served its purpose. Once you have a parse tree then during semantic validation, you can consider the expression passed as a parameter to your where clause and determine whether or not it has a boolean value (this may even require consulting a symbol table to determine the type of an identifier). Just like your semantic validation of an OR expression will need to determine that both the lhs and rhs are, themselves, boolean expressions.
Also consider that even if you could torture the parser into catching some of your type exceptions, the error messages you produce from semantic validation are almost guaranteed to be more useful than the generated syntax errors. The parser only catches syntax errors, and it should probably feel a bit "odd" to consider a non-boolean expression to be a "syntax error".
In my lexer & parser by ocamllex and ocamlyacc, I have a .mly as follows:
%{
open Params
open Syntax
%}
main:
| expr EOF { $1 }
expr:
| INTEGER { EE_integer $1 }
| LBRACKET expr_separators RBRACKET { EE_brackets (List.rev $2) }
expr_separators:
/* empty */ { [] }
| expr { [$1] }
| expr_separators ...... expr_separators { $3 :: $1 }
In params.ml, a variable separator is defined. Its value is either ; or , and set by the upstream system.
In the .mly, I want the rule of expr_separators to be defined based on the value of Params.separator. For example, when params.separtoris ;, only [1;2;3] is considered as expr, whereas [1,2,3] is not. When params.separtoris ,, only [1,2,3] is considered as expr, whereas [1;2;3] is not.
Does anyone know how to amend the lexer and parser to realize this?
PS:
The value of Params.separator is set before the parsing, it will not change during the parsing.
At the moment, in the lexer, , returns a token COMMA and ; returns SEMICOLON. In the parser, there are other rules where COMMA or SEMICOLON are involved.
I just want to set a rule expr_separators such that it considers ; and ignores , (which may be parsed by other rules), when Params.separator is ;; and it considers , and ignore ; (which may be parsed by other rules), when Params.separator is ,.
In some ways, this request is essentially the same as asking a macro preprocessor to alter its substitution at runtime, or a compiler to alter the type of a variable. As with the program itself, once the grammar has been compiled (whether into executable code or a parsing table), it's not possible to go back and modify it. At least, that's the case for most LR(k) parser generators, which produce deterministic parsers.
Moreover, it seems unlikely that the only difference the configuration parameter makes is the selection of a single separator token. If the non-selected separator token "may be parsed by other rules", then it may be parsed by those other rules when it is the selected separator token, unless the configuration setting also causes those other rules to be suppressed. So at a minimum, it seems like you'd be looking at something like:
expr : general_expr
expr_list : expr
%if separator is comma
expr : expr_using_semicolon
expr_list : expr_list ',' expr
%else
expr : expr_using_comma
expr_list : expr_list ';' expr
%endif
Without a more specific idea of what you're trying to achieve, the best suggestion I can provide is that you write two grammars and select which one to use at runtime, based on the configuration setting. Presumably the two grammars will be mostly similar, so you can probably use your own custom-written preprocessor to generate both of them from the same input text, which might look a bit like the above example. (You can use m4, which is a general-purpose macro processor, but you might feel the learning curve is too steep for such a simple application.)
Parser generators which produce general parsers have an easier time with run-time dynamic modifications; many such parser generators have mechanisms which can do that (although they are not necessarily efficient mechanisms). For example, the Bison tool can produce GLR parsers, in which case you can select or deselect specific rules using a predicate action. The OCAML GLR generator Dypgen allows sets of rules to be dynamically added to the grammar during the parse. (I've never used dypgen, but I keep on meaning to try it; it looks interesting.) And there are many others.
Having played around with dynamic parsing features in some GLR parsers, I can only say that my personal experience has been a bit mixed. Modifying grammars at run-time is a brittle technique; grammars tend not to be very easy to split into independent pieces, so modifying a grammar rule can have unexpected consequences in places you don't expect to be affected. You don't always know exactly what language your parsing accepts, because the dynamic modifications can be hard to predict. And so on. My suggest, if you try this technique, is to start with the simplest modification possible and put a lot more effort into grammar tests (which is always a good idea, anyway).
Currently I'm trying to build a parser for VHDL which
has some of the problems C++-Parsers have to face.
The context-free grammar of VHDL produces a parse
forest rather than a single parse tree because of it's
ambiguity regarding function calls and array subscriptions
foo := fun(3) + arr(5);
This assignment can only be parsed unambiguous if the parser
would carry around a hirachically, type-aware symbol table
which it'd use to resolve the ambiguities somewhat on-the-fly.
I don't want to do this, because for statements like the
aforementioned, the parse forest would not grow exponentially, but
rather linear depending on the amount of function calls and
array subscriptions.
(Except, of course, one would torture the parser with statements like)
foo := fun(fun(fun(fun(fun(4)))));
Since bison forces the user to just create one single parse-tree,
I used %merge attributes to collect all subtrees recursively and
added those subtrees under so called AMBIG nodes in the singleton
AST.
The result looks like this.
In order to produce the above, I parsed the token stream "I=I(N);".
The substance of the grammar I used inside the parse.y file, is
collected below. It tries to resemble the ambiguous parts of VHDL:
start: prog
;
/* I cut out every semantic action to make this
text more readable */
prog: assignment ';'
| prog assignment ';'
;
assignment: 'I' '=' expression
;
expression: function_call %merge <stmtmerge2>
| array_indexing %merge <stmtmerge2>
| 'N'
;
function_call: 'I' '(' expression ')'
| 'I'
;
array_indexing: function_call '(' expression ')' %merge <stmtmerge>
| 'I' '(' expression ')' %merge <stmtmerge>
;
The whole sourcecode can be read at this github repository.
And now, let's get down to the actual Problem.
As you can see in the generated parse tree above,
the nodes FCALL1 and ARRIDX1 refer to the same
single node EXPR1 which in turn refers to N1 twice.
This, by all means, should not have happened and I don't
know why. Instead there should be the paths
FCALL1 -> EXPR2 -> N2
ARRIDX1 -> EXPR1 -> N1
Do you have any idea why bison reuses the aforementioned
nodes?
I also wrote a bugreport on the official gnu mailing
list for bison, without a reply to this point though.
Unfortunately, due to the restictions for new stackoverflow
users, I can't provide no link to this bug report...
That behaviour is expected.
expression can be unambiguously reduced, and that reduced value is used by both possible ambiguous reductions which include the value. Remember that GLR, like LR, is a left-to-right parser. When a reduction action is executed, all of the child reductions have already happened. The effect is not different from the use of a terminal in a right-hand side; the terminal will not be artificially copied in order to produce different instances in the ambiguous productions which use it.
For most people, this would be a feature rather than a bug, and I don't mean that as a joke. Without the graph-structured stack, GLR has exponential run-time. If you really want to do a deep copy of shared AST nodes when you merge parse trees, you will have to do it yourself, but I suggest that you find a way to make use of the fact that the parse forest is really an directed acyclic graph rather than a tree; you will probably be able to take advantage of the lack of duplication.
In the appendices of the Dragon-book, a LL(1) front end was given as a example. I think it is very helpful. However, I find out that for the context free grammar below, a at least LL(2) parser was needed instead.
statement : variable ':=' expression
| functionCall
functionCall : ID'(' (expression ( ',' expression )*)? ')'
;
variable : ID
| ID'.'variable
| ID '[' expression ']'
;
How could I adapt the lexer for LL(1) parser to support k look ahead tokens?
Are there some elegant ways?
I know I can add some buffers for tokens. I'd like to discuss some details of programming.
this is the Parser:
class Parser
{
private Lexer lex;
private Token look;
public Parser(Lexer l)
{
lex = l;
move();
}
private void move()
{
look = lex.scan();
}
}
and the Lexer.scan() returns the next token from the stream.
In effect, you need to buffer k lookahead tokens in order to do LL(k) parsing. If k is 2, then you just need to extend your current method, which buffers one token in look, using another private member look2 or some such. For larger k, you could use a ring buffer.
In practice, you don't need the full lookahead all the time. Most of the time, one-token lookahead is sufficient. You should structure the code as a decision tree, where future tokens are only consulted if necessary to resolve ambiguity. (It's often useful to provide a special token type, "unknown", which can be assigned to the buffered token list to indicate that the lookahead hasn't reached that point yet. Alternatively, you can just always maintain k tokens of lookahead; for handbuilt parsers, that can be simpler.)
Alternatively, you can use a fallback structure where you simply try one alternative and if that doesn't work, instead of reporting a syntax error, restore the state of the parser and lexer to the next alternative. In this model, the lexer takes as an explicit argument the current input buffer position, and the input buffer needs to be rewindable. However, you can use a lookahead buffer to effectively memoize the lexer function, which can avoid rewinding and rescanning. (Scanning is usually fast enough that occasional rescans don't matter, so you might want to put off adding code complexity until your profiling indicates that it would be useful.)
Two notes:
1) I'm skeptical about the rule:
functionCall : ID'(' (expression ( ',' expression )*)* ')'
;
That would allow, for example:
function(a[3], b[2] c[x] d[y], e.foo)
which doesn't look right to me. Normally, you'd mark the contents of the () as optional instead of repeatable, eg. using an optional marker ? instead of the second Kleene star *:
functionCall : ID'(' (expression ( ',' expression )*)? ')'
;
2) In my opinion, you really should consider using bottom-up parsing for an expression language, either a generated LR(1) parser or a hand-built Pratt parser. LL(1) is rarely adequate. Of course, if you're using a parser generator, you can use tools like ANTLR which effectively implement LL(∞); that will take care of the lookahead for you.
I'm trying to parse a simple grammar using an LALR(1) parser generator (Bison, but the problem is not specific to that tool), and I'm hitting a shift-reduce conflict. The docs and other sources I've found about fixing these tend to say one or more of the following:
If the grammar is ambiguous (e.g. if-then-else ambiguity), change the language to fix the ambiguity.
If it's an operator precedence issue, specify precedence explicitly.
Accept the default resolution and tell the generator not to complain about it.
However, none of these seem to apply to my situation: the grammar is unambiguous so far as I can tell (though of course it's ambiguous with only one character of lookahead), it has only one operator, and the default resolution leads to parse errors on correctly-formed input. Are there any techniques for reworking the definition of a grammar to remove shift-reduce conflicts that don't fall into the above buckets?
For concreteness, here's the grammar in question:
%token LETTER
%%
%start input;
input: /* empty */ | input input_elt;
input_elt: rule | statement;
statement: successor ';';
rule: LETTER "->" successor ';';
successor: /* empty */ | successor LETTER;
%%
The intent is to parse semicolon-separated lines of the form "[A-Za-z]+" or "[A-Za-z] -> [A-Za-z]+".
Using the Solaris version of yacc, I get:
1: shift/reduce conflict (shift 5, red'n 7) on LETTER
state 1
$accept : input_$end
input : input_input_elt
successor : _ (7)
$end accept
LETTER shift 5
. reduce 7
input_elt goto 2
rule goto 3
statement goto 4
successor goto 6
So, the trouble is, as it very often is, the empty rule - specifically, the empty successor. It isn't completely clear whether you want to allow a semi-colon as a valid input - at the moment, it is. If you modified the successor rule to:
successor: LETTER | successor LETTER;
the shift/reduce conflict is eliminated.
Thanks for whittling down the grammar and posting it. Changing the successor rule to -
successor: /* empty */ | LETTER successor;
...worked for me. ITYM the language looked unambiguous.