I implemented a type-checker and reducer for the calculus of constructions in Haskell with a simple monadic parser using Megaparsec. Now I want to improve it so it can recognize this syntactic shortcut:
∀(x:A)->B (with x not free in B) = A -> B
The grammar for this syntax is as follows:
<expr>
= "(" <expr> ")"
| <expr> <expr>
| "λ" "(" <name> ":" <expr> ")" "→" <expr>
| "∀" "(" <name> ":" <expr> ")" "→" <expr>
| <expr> "→" <expr>
| <name>
| "*"
<name> = [_A-Za-z][_0-9A-Za-z]*
My current parser uses this variation with left recursion eliminated (without the shortcut):
<expr>
= "(" <appl> ")"
| "λ" "(" <name> ":" <appl> ")" "→" <appl>
| "∀" "(" <name> ":" <appl> ")" "→" <appl>
| <name>
| "*"
<appl> = <expr>+
<name> = [_A-Za-z][_0-9A-Za-z]*
The previously mentioned shortcut is left-recursive. I have no idea how to convert it to a right-recursive grammar so it can be handled by a conventional recursive descent parser.
I know there exist more powerful parsing techniques that can handle left-recursive grammars, but I want to keep it right-recursive to leave open the possibility of implementing a parser by hand in the near future.
The answer became evident after a short break. Use exactly the same trick that was used for <appl> and extend it as follows:
<expr>
= "(" <appl> ")"
| "λ" "(" <name> ":" <appl> ")" "→" <appl>
| "∀" "(" <name> ":" <appl> ")" "→" <appl>
| <name>
| "*"
<appl> = <expr>+ ("→" <appl>)?
<name> = [_A-Za-z][_0-9A-Za-z]*
I will leave the question open in case it helps somebody.
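As a rough illustration of how the extended <appl> rule maps onto a hand-written recursive descent parser, here is a minimal Python sketch. It is not the Haskell/Megaparsec parser from the question. The tuple-shaped AST, the class and function names, the acceptance of "->" as an alternative spelling of "→", and the use of "_" as the binder name for the shortcut are all assumptions made for this example; a real implementation would need a binder name that is guaranteed not to occur free in B.

```python
import re

# One token per match: a name, or one of the punctuation symbols.
TOKEN = re.compile(r"\s*(?:(?P<name>[_A-Za-z][_0-9A-Za-z]*)|(?P<sym>->|[()λ∀→:*]))")

def tokenize(src):
    src = src.rstrip()
    pos, tokens = 0, []
    while pos < len(src):
        m = TOKEN.match(src, pos)
        if m is None:
            raise SyntaxError(f"unexpected character at offset {pos}")
        if m.group("name"):
            tokens.append(("name", m.group("name")))
        else:
            sym = m.group("sym")
            tokens.append(("sym", "→" if sym == "->" else sym))
        pos = m.end()
    tokens.append(("eof", ""))
    return tokens

class Parser:
    def __init__(self, src):
        self.toks = tokenize(src)
        self.i = 0

    def peek(self):
        return self.toks[self.i]

    def advance(self):
        tok = self.toks[self.i]
        self.i += 1
        return tok

    def expect(self, sym):
        tok = self.advance()
        if tok != ("sym", sym):
            raise SyntaxError(f"expected {sym!r}, got {tok[1]!r}")

    def starts_expr(self):
        kind, text = self.peek()
        return kind == "name" or (kind == "sym" and text in ("(", "λ", "∀", "*"))

    def appl(self):
        # <appl> = <expr>+ ("→" <appl>)?
        node = self.expr()
        while self.starts_expr():          # application folds to the left
            node = ("app", node, self.expr())
        if self.peek() == ("sym", "→"):    # the A → B shortcut, right-recursive
            self.advance()
            return ("pi", "_", node, self.appl())  # "_" is a placeholder binder
        return node

    def expr(self):
        kind, text = self.advance()
        if kind == "name":
            return ("var", text)
        if kind == "sym" and text == "*":
            return ("star",)
        if kind == "sym" and text == "(":
            node = self.appl()
            self.expect(")")
            return node
        if kind == "sym" and text in ("λ", "∀"):
            self.expect("(")
            name_kind, name = self.advance()
            if name_kind != "name":
                raise SyntaxError("expected a binder name")
            self.expect(":")
            ty = self.appl()
            self.expect(")")
            self.expect("→")
            body = self.appl()
            return ("lam" if text == "λ" else "pi", name, ty, body)
        raise SyntaxError(f"unexpected token {text!r}")

def parse(src):
    parser = Parser(src)
    node = parser.appl()
    if parser.peek() != ("eof", ""):
        raise SyntaxError("trailing input")
    return node
```

With this sketch, parse("A → B → C") nests to the right, while parse("f x y") nests to the left, which mirrors how the arrow and application are usually read.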
Related
I'm trying to make an expression parser and although it works, it does calculations chronologically rather than by BIDMAS; 1 + 2 * 3 + 4 returns 15 instead of 11. I've rewritten the parser to use recursive descent parsing and a proper grammar which I thought would work, but it makes the same mistake.
My grammar so far is:
exp ::= term op exp | term
op ::= "/" | "*" | "+" | "-"
term ::= number | (exp)
It also lacks other features, but right now I'm not sure how to make division precede multiplication, etc. How should I modify my grammar to implement operator precedence?
Try this:
exp ::= add
add ::= mul (("+" | "-") mul)*
mul ::= term (("*" | "/") term)*
term ::= number | "(" exp ")"
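Here ()* means zero or more times. This grammar is deterministic and unambiguous. If each repetition is parsed with a loop that folds the next operand into the result so far, it produces left-associative trees, which is what subtraction and division need. Multiplication and division share one priority level, and so do addition and subtraction.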
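To make the layering concrete, here is a minimal Python sketch of a recursive descent evaluator with one method per rule of the grammar above. The class name Calc, the helper names, and the "$" end marker are assumptions made for this example, and the tokenizer silently skips characters it does not recognize. Each ()* repetition is written as a loop that folds operands from left to right; that is a choice made in this sketch, not something the grammar itself dictates.

```python
import re

def tokenize(src):
    # Numbers, the four operators and parentheses; "$" marks end of input.
    return re.findall(r"\d+(?:\.\d+)?|[-+*/()]", src) + ["$"]

class Calc:
    def __init__(self, src):
        self.toks = tokenize(src)
        self.i = 0

    def peek(self):
        return self.toks[self.i]

    def advance(self):
        tok = self.toks[self.i]
        self.i += 1
        return tok

    def add(self):
        # add ::= mul (("+" | "-") mul)*
        value = self.mul()
        while self.peek() in ("+", "-"):
            op = self.advance()
            rhs = self.mul()
            value = value + rhs if op == "+" else value - rhs
        return value

    def mul(self):
        # mul ::= term (("*" | "/") term)*
        value = self.term()
        while self.peek() in ("*", "/"):
            op = self.advance()
            rhs = self.term()
            value = value * rhs if op == "*" else value / rhs
        return value

    def term(self):
        # term ::= number | "(" exp ")"
        tok = self.advance()
        if tok == "(":
            value = self.add()
            if self.advance() != ")":
                raise SyntaxError("expected ')'")
            return value
        if tok[0].isdigit():
            return float(tok) if "." in tok else int(tok)
        raise SyntaxError(f"unexpected token {tok!r}")

def evaluate(src):
    calc = Calc(src)
    value = calc.add()
    if calc.peek() != "$":
        raise SyntaxError(f"unexpected token {calc.peek()!r}")
    return value
```

With this sketch, evaluate("1 + 2 * 3 + 4") gives 11 rather than 15, because mul consumes 2 * 3 before add sees the surrounding additions.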
I have a grammar that looks like this:
<type> ::= <base_type> <optional_array_size>
<optional_array_size> ::= "[" <INTEGER_LITERAL> "]" | ""
<base_type> ::= <integer_type> | <real_type> | <function_type>
<function_type> ::= "(" <params> ")" "->" <type>
<params> ::= <type> <params_tail> | ""
<params_tail> ::= "," <type> <params_tail> | ""
so that I can define types like Integer[42], Real, or (Integer, Real) -> Integer. This is all good and well, but I would like my functions to be first class citizens. Given the grammar above, I can't have arrays of functions, as it would only turn the return type into an array. (Integer, Real) -> Integer [42] won't be an array of 42 functions, but one function that returns an array of 42 integers.
I was considering adding optional parentheses around function types, as in ((Integer, Real) -> Integer)[42], but that creates another issue (note: I am using a top-down recursive descent parser, so my grammar has to be LL(1)):
<function_type> ::= "(" <function_type_tail>
<function_type_tail> ::= <params> ")" "->" <type>
| "(" <params> ")" "->" <type> ")"
The issue is that first(params) contains "(" because function types could be passed as function parameters: ((Integer) -> Real, Real) -> Integer. This syntax was valid before I modified the grammar, but it no longer works now. How can I modify my grammar to get what I want?
That's definitely a challenge.
It's much easier to make an LR grammar for that language, although it's still a bit of a challenge. To start with, it's necessary to remove the ambiguity which results from
<type> ::= <base_type> <optional_array_size>
<base_type> ::= <function_type>
<function_type> ::= "(" <params> ")" "->" <type>
The ambiguity, as I'm sure you know, results from not knowing whether the [42] in ()->Integer[42] is part of the top-level <type> or the enclosed <function_type>. To remove the ambiguity, we need to be explicit about what construct can take an array size. (Here, I've added the desired production which allows <type> to be parenthesized):
<type> ::= <atomic_type> <opt_array_size>
| <function_type>
<opt_array_size> ::= ""
| <array_size>
<atomic_type> ::= <integer_type>
| <real_type>
| "(" <type> ")"
<function_type> ::= "(" <opt_params> ")" "->" <type>
<opt_params> ::= ""
| <params>
<params> ::= <type>
| <params> "," <type>
Unfortunately, that grammar is LR(2), not LR(1). The problem occurs with
( Integer ) [ 42 ]
( Integer ) -> Integer
^
|
+----------------- Lookahead
At the lookahead point, the parser still doesn't know if it is looking at a (redundantly) parenthesized type or at the parameter list in a function type. It won't know that until it sees the following symbol (which might be the end of input, in addition to the two options above). In both cases, it needs to reduce Integer to <atomic_type> and then to <type>. But then, in the first case it can just shift the close parenthesis, while in the second case it needs to continue reducing, first to <params> and then to <opt_params>. That's a shift-reduce conflict. Of course, it can easily be resolved by looking one more token into the future, but the need to see two tokens into the future is what makes the grammar LR(2).
Fortunately, LR(k) grammars can always be reduced to LR(1) grammars. (This is not true of LL(k) grammars, by the way.) It just gets a bit messy because it is necessary to introduce a bit of redundancy. We do that by avoiding the need to reduce <type> until we know that we have a parameter list, which means that we need to accept "(" <type> ")" without committing to one or the other parse. That leads to the following, where an apparently redundant rule was added to <function_type> and <opt_params> was modified to accept either 0 or at least two parameters:
<type> ::= <atomic_type> <opt_array_size>
| <function_type>
<atomic_type> ::= <integer_type>
| <real_type>
| "(" <type> ")"
<function_type> ::= "(" <opt_params> ")" "->" <type>
| "(" <type> ")" "->" <type>
<opt_params> ::= ""
| <params2>
<params2> ::= <type> "," <type>
| <params2> "," <type>
Now, I personally would stop there. There are lots of LR parser generators out there, and the above grammar is LALR(1) and still reasonably easy to read. But it is possible to convert it to an LL(1) grammar, with quite a bit of work. (I used a grammar transformation tool to do some of these transformations.)
It's straightforward to remove left-recursion and then left-factor the grammar:
# Not LL(1)
<type> ::= <atomic_type> <opt_size>
| <function_type>
<opt_size> ::= ""
| "[" integer "]"
<atomic_type> ::= <integer_type>
| <real_type>
| "(" <type> ")"
<function_type> ::= "(" <fop>
<fop> ::= <opt_params> ")" "->" <type>
| <type> ")" "->" <type>
<opt_params> ::= ""
| <params2>
<params2> ::= <type> "," <type> <params_tail>
<params_tail> ::= "," <type> <params_tail>
| ""
But that's not sufficient, because <function_type> and <atomic_type> can both start with "(" <type>. And there's a similar problem between the productions for the parameter list. To get rid of these issues, we need yet another technique: expand non-terminals in place in order to get the conflicts into the same non-terminal so that we can left-factor them. As with this example, that often comes at the cost of some duplication.
By expanding <atomic_type>, <function_type> and <opt_params>, we get:
<type> ::= <integer_type> <opt_size>
| <real_type> <opt_size>
| "(" <type> ")" <opt_size>
| "(" ")" "->" <type>
| "(" <type> ")" "->" <type>
| "(" <type> "," <type> <params2> ")" "->" <type>
<opt_size> ::= ""
| "[" INTEGER_LITERAL "]"
<params2> ::= ""
| "," <type> <params2>
And then we can left-factor to produce
<type> ::= <integer_type> <opt_size>
| <real_type> <opt_size>
| "(" <fop>
<fop> ::= <type> <ftype>
| ")" "->" <type>
<ftype> ::= ")" <fcp>
| "," <type> <params2> ")" "->" <type>
<fcp> ::= <opt_size>
| "->" <type>
<opt_size> ::= ""
| "[" INTEGER_LITERAL "]"
<params2> ::= ""
| "," <type> <params2>
which is LL(1). I'll leave it as an exercise to reattach all the appropriate actions to these productions.
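For readers who want to see the final grammar in action, here is a minimal Python sketch of a hand-written LL(1) parser with one method per non-terminal and one token of lookahead. It is only one possible way to reattach actions. The literal spellings Integer and Real are taken from the question's examples; the tuple-shaped result, the class and method names, and the "$" end marker are assumptions made for this example. The tokenizer silently skips characters it does not recognize, and <params2> is written as a loop instead of tail recursion.

```python
import re

def tokenize(src):
    # "$" marks end of input.
    return re.findall(r"->|[()\[\],]|[A-Za-z]+|\d+", src) + ["$"]

class TypeParser:
    def __init__(self, src):
        self.toks = tokenize(src)
        self.i = 0

    def peek(self):
        return self.toks[self.i]

    def eat(self, expected):
        tok = self.peek()
        if tok != expected:
            raise SyntaxError(f"expected {expected!r}, got {tok!r}")
        self.i += 1

    def type_(self):
        # <type> ::= <integer_type> <opt_size> | <real_type> <opt_size> | "(" <fop>
        tok = self.peek()
        if tok in ("Integer", "Real"):
            self.i += 1
            return self.opt_size(tok)
        self.eat("(")
        return self.fop()

    def fop(self):
        # <fop> ::= <type> <ftype> | ")" "->" <type>
        if self.peek() == ")":
            self.eat(")")
            self.eat("->")
            return ("func", [], self.type_())
        return self.ftype(self.type_())

    def ftype(self, first):
        # <ftype> ::= ")" <fcp> | "," <type> <params2> ")" "->" <type>
        if self.peek() == ")":
            self.eat(")")
            return self.fcp(first)
        self.eat(",")
        params = [first, self.type_()]
        while self.peek() == ",":          # <params2>
            self.eat(",")
            params.append(self.type_())
        self.eat(")")
        self.eat("->")
        return ("func", params, self.type_())

    def fcp(self, inner):
        # <fcp> ::= <opt_size> | "->" <type>
        if self.peek() == "->":
            self.eat("->")
            return ("func", [inner], self.type_())
        return self.opt_size(inner)        # it was only a parenthesized type

    def opt_size(self, elem):
        # <opt_size> ::= "" | "[" INTEGER_LITERAL "]"
        if self.peek() != "[":
            return elem
        self.eat("[")
        tok = self.peek()
        if not tok.isdigit():
            raise SyntaxError(f"expected an integer literal, got {tok!r}")
        self.i += 1
        self.eat("]")
        return ("array", elem, int(tok))

def parse_type(src):
    parser = TypeParser(src)
    result = parser.type_()
    parser.eat("$")
    return result
```

In this sketch the decision between "parenthesized type" and "single-parameter function" is postponed until fcp, after the closing parenthesis has been consumed, which is exactly the point of the left-factoring above.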
I am new to bison and I am trying to make a grammar parsing expressions.
I am facing a shift/reduce conflict that I am not able to solve.
The grammar is the following:
%left "[" "("
%left "+"
%%
expression_list : expression_list "," expression
| expression
| /*empty*/
;
expression : "(" expression ")"
| STRING_LITERAL
| INTEGER_LITERAL
| DOUBLE_LITERAL
| expression "(" expression_list ")" /*function call*/
| expression "[" expression "]" /*index access*/
| expression "+" expression
;
This is my grammar, but I am facing a shift/reduce conflict with those two rules "(" expression ")" and expression "(" expression_list ")".
How can I resolve this conflict?
EDIT: I know I could solve this using precedence climbing, but I would like to not do so, because this is only a small part of the expression grammar, and the size of the expression grammar would explode using precedence climbing.
There is no shift-reduce conflict in the grammar as presented, so I suppose that it is just an excerpt of the full grammar. In particular, there will be precisely the shift/reduce conflict mentioned if the real grammar includes:
%start program
%%
program: %empty
| program expression
In that case, you will run into an ambiguity because given, for example, a(b), the parser cannot tell whether it is a single call-expression or two consecutive expressions, first a single variable, and second a parenthesized expression. To avoid this problem you need to have some token which separates expressions (statements).
There are some other issues:
expression_list : expression_list "," expression
| expression
| /*empty*/
;
That allows an expression list to be ,foo (as in f(,foo)), which is likely not desirable. Better would be
arguments: %empty
| expr_list
expr_list: expr
| expr_list ',' expr
And the precedences are probably backwards. Usually one wants postfix operators like call and index to bind more tightly than arithmetic operators, so they should come at the end. Otherwise a+b(7) is (a+b)(7), which is unconventional.
I need help solving this one, and an explanation of how to deal with shift/reduce conflicts like this in the future.
I have some conflicts between a few states in my CUP file.
The grammar looks like this:
I have conflicts between the "(" [ActPars] ")" states.
1. Statement = Designator ("=" Expr | "++" | "--" | "(" [ActPars] ")" ) ";"
2. Factor = number | charConst | Designator [ "(" [ActPars] ")" ].
I don't want to paste the whole 700 lines of the CUP file.
I will give you the relevant states and the error output.
This is the code for line 1.)
Matched ::= Designator LPAREN ActParamsList RPAREN SEMI_COMMA
ActParamsList ::= ActPars
|
/* EPS */
;
ActPars ::= Expr
|
Expr ActPComma
;
ActPComma ::= COMMA ActPars;
This is the code for line 2.)
Factor ::= Designator ActParamsOptional ;
ActParamsOptional ::= LPAREN ActParamsList2 RPAREN
|
/* EPS */
;
ActParamsList2 ::= ActPars
|
/* EPS */
;
Expr ::= SUBSTRACT Term RepOptionalExpression
|
Term RepOptionalExpression
;
The ERROR output looks like this:
Warning : *** Shift/Reduce conflict found in state #182
between ActParamsOptional ::= LPAREN ActParamsList RPAREN (*)
and Matched ::= Designator LPAREN ActParamsList RPAREN (*) SEMI_COMMA
under symbol SEMI_COMMA
Resolved in favor of shifting.
Error : * More conflicts encountered than expected -- parser generation aborted
I believe the problem is that, once it has seen RPAREN, your parser cannot tell whether it should shift the token
SEMI_COMMA
or reduce to the nonterminal
ActParamsOptional
since, after a Designator, both ActParamsOptional and Matched accept the same sequence
LPAREN ActParamsList RPAREN
How can I interpret this as an EBNF grammar?
<assign>--> <id> = <expr>
<id>--> A | B | C
<expr> --> <expr> * <expr>
<expr> --> <expr> + <expr>
| <id> + <expr>
|( <expr> )
| <id>
I can make a parse tree and a derivation of any statement using this grammar, but I am having trouble with EBNF.
<assign>--> <id> = <expr>
An assign is the sequence: id equals-sign expr.
<id>--> A | B | C
An id is one of A, B or C
<expr> --> <expr> * <expr>
<expr> --> <expr> + <expr>
| <id> + <expr>
|( <expr> )
| <id>
An expression can be:
The product of two expressions (infix notation)
The addition of two expressions (infix notation)
The addition of an identifier and an expression (which is a particular case of addition of two expressions, where the first expression is just <id>)
A parenthesized expression.
An identifier.