Unambiguous grammar for higher order functions - parsing

I have a grammar that looks like this:
<type> ::= <base_type> <optional_array_size>
<optional_array_size> ::= "[" <INTEGER_LITERAL> "]" | ""
<base_type> ::= <integer_type> | <real_type> | <function_type>
<function_type> ::= "(" <params> ")" "->" <type>
<params> ::= <type> <params_tail> | ""
<params_tail> ::= "," <type> <params_tail> | ""
so that I can define types like Integer[42], Real, or (Integer, Real) -> Integer. This is all good and well, but I would like my functions to be first class citizens. Given the grammar above, I can't have arrays of functions, as it would only turn the return type into an array. (Integer, Real) -> Integer [42] won't be an array of 42 functions, but one function that returns an array of 42 integers.
I was considering adding optional parentheses around function types, as in ((Integer, Real) -> Integer)[42], but that creates another issue (note: I am using a top-down recursive descent parser, so my grammar has to be LL(1)):
<function_type> ::= "(" <function_type_tail>
<function_type_tail> ::= <params> ")" "->" <type>
| "(" <params> ")" "->" <type> ")"
The issue is that first(params) contains "(" because function types could be passed as function parameters: ((Integer) -> Real, Real) -> Integer. This syntax was valid before I modified the grammar, but it no longer works now. How can I modify my grammar to get what I want?

That's definitely a challenge.
It's much easier to make an LR grammar for that language, although it's still a bit of a challenge. To start with, it's necessary to remove the ambiguity which results from
<type> ::= <base_type> <optional_array_size>
<base_type> ::= <function_type>
<function_type> ::= "(" <params> ")" "->" <type>
The ambiguity, as I'm sure you know, results from not knowing whether the [42] in ()->Integer[42] is part of the top-level <type> or the enclosed <function_type>. To remove the ambiguity, we need to be explicit about what construct can take an array size. (Here, I've added the desired production which allows <type> to be parenthesized):
<type> ::= <atomic_type> <opt_array_size>
| <function_type>
<opt_array_size> ::= ""
| <array_size>
<atomic_type> ::= <integer_type>
| <real_type>
| "(" <type> ")"
<function_type> ::= "(" <opt_params> ")" "->" <type>
<opt_params> ::= ""
| <params>
<params> ::= <type>
| <params> "," <type>
Unfortunately, that grammar is LR(2), not LR(1). The problem occurs with
( Integer ) [ 42 ]
( Integer ) -> Integer
          ^
          |
          +----------------- Lookahead
At the lookahead point, the parser still doesn't know if it is looking at a (redundantly) parenthesized type or at the parameter list in a function type. It won't know that until it sees the following symbol (which might be the end of input, in addition to the two options above). In both cases, it needs to reduce Integer to <atomic_type> and then to <type>. But then, in the first case it can just shift the close parenthesis, while in the second case it needs to continue reducing, first to <params> and then to <opt_params>. That's a shift-reduce conflict. Of course, it can easily be resolved by looking one more token into the future, but the need to see two tokens into the future is what makes the grammar LR(2).
Fortunately, LR(k) grammars can always be reduced to LR(1) grammars. (This is not true of LL(k) grammars, by the way.) It just gets a bit messy because it is necessary to introduce a bit of redundancy. We do that by avoiding the need to reduce <type> until we know that we have a parameter list, which means that we need to accept "(" <type> ")" without committing to one or the other parse. That leads to the following, where an apparently redundant rule was added to <function_type> and <opt_params> was modified to accept either 0 or at least two parameters:
<type> ::= <atomic_type> <opt_array_size>
| <function_type>
<atomic_type> ::= <integer_type>
| <real_type>
| "(" <type> ")"
<function_type> ::= "(" <opt_params> ")" "->" <type>
| "(" <type> ")" "->" <type>
<opt_params> ::= ""
| <params2>
<params2> ::= <type> "," <type>
| <params2> "," <type>
Now, I personally would stop there. There are lots of LR parser generators out there, and the above grammar is LALR(1) and still reasonably easy to read. But it is possible to convert it to an LL(1) grammar, with quite a bit of work. (I used a grammar transformation tool to do some of these transformations.)
It's straightforward to remove left-recursion and then left-factor the grammar:
# Not LL(1)
<type> ::= <atomic_type> <opt_size>
| <function_type>
<opt_size> ::= ""
| "[" INTEGER_LITERAL "]"
<atomic_type> ::= <integer_type>
| <real_type>
| "(" <type> ")"
<function_type> ::= "(" <fop>
<fop> ::= <opt_params> ")" "->" <type>
| <type> ")" "->" <type>
<opt_params> ::= ""
| <params2>
<params2> ::= <type> "," <type> <params_tail>
<params_tail> ::= "," <type> <params_tail>
| ""
But that's not sufficient, because <function_type> and <atomic_type> can both start with "(" <type>. And there's a similar problem between the productions for the parameter list. To get rid of these issues, we need yet another technique: expand non-terminals in place in order to get the conflicts into the same non-terminal so that we can left-factor them. As with this example, that often comes at the cost of some duplication.
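The overlap described above can be checked mechanically. The following is a minimal Python sketch, not part of the original answer, that computes FIRST and FOLLOW sets and reports alternatives whose prediction sets overlap. The dictionary encoding of the grammar, the function names first_of and ll1_conflicts, and the terminal names Integer, Real and INT (standing in for <integer_type>, <real_type> and INTEGER_LITERAL) are all assumptions made for illustration.

```python
from itertools import combinations

def first_of(seq, first, nullable):
    """Return (FIRST set of a symbol sequence, whether it can derive the empty string)."""
    out = set()
    for sym in seq:
        if sym not in first:            # not a non-terminal, so it is a terminal
            out.add(sym)
            return out, False
        out |= first[sym]
        if sym not in nullable:
            return out, False
    return out, True

def ll1_conflicts(grammar, start):
    """List (non-terminal, overlapping tokens) for alternatives LL(1) cannot tell apart."""
    first = {nt: set() for nt in grammar}
    follow = {nt: set() for nt in grammar}
    follow[start].add("$")              # "$" stands for end of input
    nullable = set()
    changed = True
    while changed:                      # iterate FIRST/FOLLOW/nullable to a fixed point
        changed = False
        for nt, alts in grammar.items():
            for alt in alts:
                f, eps = first_of(alt, first, nullable)
                if eps and nt not in nullable:
                    nullable.add(nt)
                    changed = True
                if not f <= first[nt]:
                    first[nt] |= f
                    changed = True
                for i, sym in enumerate(alt):
                    if sym in grammar:  # propagate FOLLOW into each non-terminal
                        f, eps = first_of(alt[i + 1:], first, nullable)
                        new = f | (follow[nt] if eps else set())
                        if not new <= follow[sym]:
                            follow[sym] |= new
                            changed = True
    conflicts = []
    for nt, alts in grammar.items():
        predict = []
        for alt in alts:
            f, eps = first_of(alt, first, nullable)
            predict.append(f | (follow[nt] if eps else set()))
        for a, b in combinations(predict, 2):
            if a & b:                   # two alternatives share a lookahead token
                conflicts.append((nt, a & b))
    return conflicts

# The left-factored grammar marked "# Not LL(1)" above; () is the empty alternative.
NOT_LL1 = {
    "type": [("atomic_type", "opt_size"), ("function_type",)],
    "opt_size": [(), ("[", "INT", "]")],
    "atomic_type": [("Integer",), ("Real",), ("(", "type", ")")],
    "function_type": [("(", "fop")],
    "fop": [("opt_params", ")", "->", "type"), ("type", ")", "->", "type")],
    "opt_params": [(), ("params2",)],
    "params2": [("type", ",", "type", "params_tail")],
    "params_tail": [(",", "type", "params_tail"), ()],
}

for nt, overlap in ll1_conflicts(NOT_LL1, "type"):
    print(f"<{nt}>: alternatives overlap on {sorted(overlap)}")
```

Under these assumptions the checker flags <type> and <fop>, which are the two places where the conflicting alternatives still live in different non-terminals or share a common prefix.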
By expanding <atomic_type>, <function_type> and <opt_params>, we get:
<type> ::= <integer_type> <opt_size>
| <real_type> <opt_size>
| "(" <type> ")" <opt_size>
| "(" ")" "->" <type>
| "(" <type> ")" "->" <type>
| "(" <type> "," <type> <params2> ")" "->" <type>
<opt_size> ::= ""
| "[" INTEGER_LITERAL "]"
<params2> ::= ""
| "," <type> <params2>
And then we can left-factor to produce
<type> ::= <integer_type> <opt_size>
| <real_type> <opt_size>
| "(" <fop>
<fop> ::= <type> <ftype>
| ")" "->" <type>
<ftype> ::= ")" <fcp>
| "," <type> <params2> ")" "->" <type>
<fcp> ::= <opt_size>
| "->" <type>
<opt_size> ::= ""
| "[" INTEGER_LITERAL "]"
<params2> ::= ""
| "," <type> <params2>
which is LL(1). I'll leave it as an exercise to reattach all the appropriate actions to these productions.
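To show how the final grammar maps onto a hand-written parser, here is a minimal recursive descent sketch in Python with one method per non-terminal. It is only an illustration under stated assumptions: <integer_type> and <real_type> are taken to be the keywords Integer and Real, the tuple-based result shapes ("array", ...) and ("func", ...) are invented here, and <params2> is written as a loop instead of a recursive call.

```python
import re

# Tokens: "->", single-character punctuation, integer literals, identifiers.
TOKEN_RE = re.compile(r"\s*(->|[()\[\],]|\d+|[A-Za-z_]\w*)")

def tokenize(text):
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character at offset {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens + ["$"]               # "$" marks end of input

class TypeParser:
    def __init__(self, text):
        self.toks, self.i = tokenize(text), 0

    def peek(self):
        return self.toks[self.i]

    def expect(self, tok):
        if self.peek() != tok:
            raise SyntaxError(f"expected {tok!r}, got {self.peek()!r}")
        self.i += 1

    def typ(self):
        # <type> ::= <integer_type> <opt_size> | <real_type> <opt_size> | "(" <fop>
        tok = self.peek()
        if tok in ("Integer", "Real"):  # assumed spellings of the base types
            self.i += 1
            return self.opt_size(tok)
        self.expect("(")
        return self.fop()

    def fop(self):
        # <fop> ::= <type> <ftype> | ")" "->" <type>
        if self.peek() == ")":
            self.i += 1
            self.expect("->")
            return ("func", [], self.typ())
        return self.ftype(self.typ())

    def ftype(self, first):
        # <ftype> ::= ")" <fcp> | "," <type> <params2> ")" "->" <type>
        if self.peek() == ")":
            self.i += 1
            return self.fcp(first)
        self.expect(",")
        params = [first, self.typ()]
        while self.peek() == ",":       # <params2>, unrolled into a loop
            self.i += 1
            params.append(self.typ())
        self.expect(")")
        self.expect("->")
        return ("func", params, self.typ())

    def fcp(self, inner):
        # <fcp> ::= <opt_size> | "->" <type>
        if self.peek() == "->":
            self.i += 1
            return ("func", [inner], self.typ())
        return self.opt_size(inner)     # it was a parenthesized type after all

    def opt_size(self, base):
        # <opt_size> ::= "" | "[" INTEGER_LITERAL "]"
        if self.peek() != "[":
            return base
        self.i += 1
        size = self.peek()
        if not size.isdigit():
            raise SyntaxError(f"expected an integer literal, got {size!r}")
        self.i += 1
        self.expect("]")
        return ("array", base, int(size))

def parse_type(text):
    parser = TypeParser(text)
    result = parser.typ()
    parser.expect("$")
    return result

print(parse_type("((Integer, Real) -> Integer)[42]"))
print(parse_type("(Integer, Real) -> Integer[42]"))
```

Note how the decision between "parenthesized type" and "single-parameter function" is postponed until fcp, which is exactly what the left-factoring bought: every method chooses its alternative by looking at one token only.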

Why doesn't Bison accept this grammar file?

When I use the command bison -d -o parser.java parser.y to generate a parser from my grammar file parser.y, Bison produces the following error:
:8.8-10: syntax error, unexpected string, expecting char or identifier or type
Here is the file parser.y:
%{
import java.util.*;
import java.io.*;
%}
%start PROGRAM
%token number identifier function break call if else let read return while write
%token "(" ")" "{" "}" ";" "=" "+" "-" "*" "/" "%" "<" ">" " <= " " >= " "==" "!=" "&" "|" "~" "!"
%left "+" "-"
%left "*" "/" "%"
%left "&" "|"
%nonassoc "!"
%type <Node> PROGRAM FUNCTION PARAMLIST BLOCK STATEMENT IF ELSE EXPR
%type <String> identifier
%type <Integer> number
%union {
Node node;
String identifier;
int number;
}
%%
PROGRAM:
| PROGRAM FUNCTION
| BLOCK
;
FUNCTION:
function identifier '(' PARAMLIST ')' BLOCK
;
PARAMLIST:
identifier
| identifier ',' PARAMLIST
|
;
BLOCK:
'{' STATEMENT '}'
;
STATEMENT:
BREAK
| CALL ';'
| IF
| LET
| READ
| RETURN
| WHILE
| WRITE
;
BREAK:
break ';'
;
CALL:
call identifier '(' ARGLIST ')'
;
ARGLIST:
EXPR
| EXPR ',' ARGLIST
|
;
IF:
if EXPR BLOCK ELSE
;
ELSE:
else BLOCK
|
;
LET:
let identifier '=' EXPR ';'
| let identifier '=' CALL ';'
;
READ:
read identifier ';'
;
RETURN:
return EXPR ';'
;
WHILE:
while EXPR BLOCK
;
WRITE:
write EXPR ';'
;
EXPR:
number
| identifier
| '(' EXPR ')'
| '!' EXPR
| '~' EXPR
| EXPR '+' EXPR
| EXPR '-' EXPR
| EXPR '*' EXPR
| EXPR '/' EXPR
| EXPR '%' EXPR
| EXPR '&' EXPR
| EXPR '|' EXPR
| EXPR '<' EXPR
| EXPR '>' EXPR
| EXPR "<=" EXPR
| EXPR ">=" EXPR
| EXPR "==" EXPR
| EXPR "!=" EXPR
;
%%
int yyerror(String s) {
System.err.println("error: " + s);
}
Bison doesn't allow you to declare quoted token names (such as "(") with the %token declaration. It knows they are tokens; they cannot be anything else.
You use the %token declaration to declare symbolic names for tokens, which you will find useful when writing your lexer. In the declaration, the symbolic name comes first, optionally followed by the double-quoted alias. You can repeat that as often as you like. For example, you could write:
%token TK_LE "<=" TK_GE ">="
You can then use either the symbolic name or the alias in your grammar, but using the alias makes your grammar more readable. Also, Bison uses the alias when constructing error messages, which is a good thing since "expecting TK_SEMIC" is not a great way to communicate with a user that a ";" was required.
Keep in mind that a single-quoted single character token, such as '(', is not the same token as the double-quoted alias. In your grammar, you use '(' but attempt to declare "(". Had you succeeded in declaring "(", you would have gotten an "unused token" warning. Since '(' doesn't require a symbolic name, you can just remove the declaration. You will only need them for multicharacter tokens like "<=". (Note that spaces are significant inside quotes. " <= " is not the same as "<=".)
Symbolic token names are used as Java values, so their names cannot conflict with variables or Java keywords. You cannot, for example, use break as a symbolic token name. Trying to do so will cause compilation errors.
For this reason, it's customary to write token names in ALL_CAPS, and non-terminals in lower case. Non-terminals names are not used in the generated code, so you can use whatever names you wish.
You reverse this convention, which will cause a variety of errors when you compile the generated parser, and which is hard to read for those of us accustomed to the standard style.
A couple of other notes:
The bison Java interface does not use a %union declaration. The %type declarations are sufficient.
You are missing precedence declarations for many operators, particularly comparison operators. That will lead to a large number of parser conflicts. Make sure you write the precedence levels in the correct order.

Unbalanced tree. Most probably caused by unbalanced markers

I'm working on an IntelliJ plugin which will add support for a custom language. Currently, I'm still just trying to get used to Grammar-Kit and how plugin development works.
To that end, I've started working on a parser for basic expressions:
(1.0 * 5 + (3.44 ^ -2))
Following the documentation provided by JetBrains, I've attempted to write BNF and JFlex grammars for the above example.
The generated code for these grammars compiles, but when the plugin is run, it crashes with:
java.lang.Throwable: Unbalanced tree. Most probably caused by unbalanced markers. Try calling setDebugMode(true) against PsiBuilder passed to identify exact location of the problem
Enabling debug mode prints a long list of traces:
java.lang.Throwable: Created at the following trace.
at com.intellij.lang.impl.MarkerOptionalData.notifyAllocated(MarkerOptionalData.java:83)
at com.intellij.lang.impl.PsiBuilderImpl.createMarker(PsiBuilderImpl.java:820)
at com.intellij.lang.impl.PsiBuilderImpl.precede(PsiBuilderImpl.java:457)
at com.intellij.lang.impl.PsiBuilderImpl.access$700(PsiBuilderImpl.java:51)
at com.intellij.lang.impl.PsiBuilderImpl$StartMarker.precede(PsiBuilderImpl.java:361)
java.lang.Throwable: Created at the following trace.
at com.intellij.lang.impl.MarkerOptionalData.notifyAllocated(MarkerOptionalData.java:83)
at com.intellij.lang.impl.PsiBuilderImpl.createMarker(PsiBuilderImpl.java:820)
at com.intellij.lang.impl.PsiBuilderImpl.mark(PsiBuilderImpl.java:810)
at com.intellij.lang.impl.PsiBuilderAdapter.mark(PsiBuilderAdapter.java:107)
at com.intellij.lang.parser.GeneratedParserUtilBase.enter_section_(GeneratedParserUtilBase.java:432)
at com.example.intellij.mylang.MyLangParser.exp_expr_0(MyLangParser.java:154)
java.lang.Throwable: Created at the following trace.
at com.intellij.lang.impl.MarkerOptionalData.notifyAllocated(MarkerOptionalData.java:83)
at com.intellij.lang.impl.PsiBuilderImpl.createMarker(PsiBuilderImpl.java:820)
at com.intellij.lang.impl.PsiBuilderImpl.precede(PsiBuilderImpl.java:457)
at com.intellij.lang.impl.PsiBuilderImpl.access$700(PsiBuilderImpl.java:51)
at com.intellij.lang.impl.PsiBuilderImpl$StartMarker.precede(PsiBuilderImpl.java:361)
Even with these debug logs, I still don't understand what's going wrong. I've tried googling around, and I can't even figure out what 'marker' means in this context...
Here's the BNF grammar:
root ::= expr *
expr ::= add_expr
left add_expr ::= add_op mod_expr | mod_expr
private add_op ::= '+'|'-'
left mod_expr ::= mod_op int_div_expr | int_div_expr
private mod_op ::= 'mod'
left int_div_expr ::= int_div_op mult_expr | mult_expr
private int_div_op ::= '\'
left mult_expr ::= mult_op unary_expr | unary_expr
private mult_op ::= '*'|'/'
unary_expr ::= '-' unary_expr | '+' unary_expr | exp_expr
left exp_expr ::= exp_op exp_expr | value
private exp_op ::= '^'
// TODO: Add support for left_expr. Example: "someVar.x"
value ::= const_expr | '(' expr ')'
const_expr ::= bool_literal | integer_literal | FLOAT_LITERAL | STRING_LITERAL | invalid
bool_literal ::= 'true' | 'false'
integer_literal ::= INT_LITERAL | HEX_LITERAL
I figured out the issue. It had nothing to do with my BNF. The problem was that in my JFlex file I was calling yybegin(YYINITIAL) while already in the YYINITIAL state.

Recursive descent parser for calculus of constructions

I implemented a type-checker and reducer of calculus of constructions in Haskell with a simple monadic parser using Megaparsec. Now I want to improve it so it can recognize this syntactic shortcut:
∀(x:A)->B (with x not free in B) = A -> B
The grammar for this syntax is as follows:
<expr>
= "(" <expr> ")"
| <expr> <expr>
| "λ" "(" <name> ":" <expr> ")" "→" <expr>
| "∀" "(" <name> ":" <expr> ")" "→" <expr>
| <expr> "→" <expr>
| <name>
| "*"
<name> = [_A-Za-z][_0-9A-Za-z]*
My current parser uses this variation with left recursion eliminated (without the shortcut):
<expr>
= "(" <appl> ")"
| "λ" "(" <name> ":" <appl> ")" "→" <appl>
| "∀" "(" <name> ":" <appl> ")" "→" <appl>
| <name>
| "*"
<appl> = <expr>+
<name> = [_A-Za-z][_0-9A-Za-z]*
The previously mentioned shortcut is left-recursive. I have no idea how to convert it to a right-recursive grammar so it can be handled by a conventional recursive descent parser.
I know there exist more powerful parsing techniques that can handle left-recursive grammars, but I want to keep it right-recursive to leave open the possibility of implementing a parser by hand in the near future.
The answer became evident after a short break. Use exactly the same trick that we used on <appl> and extend it as follows:
<expr>
= "(" <appl> ")"
| "λ" "(" <name> ":" <appl> ")" "→" <appl>
| "∀" "(" <name> ":" <appl> ")" "→" <appl>
| <name>
| "*"
<appl> = <expr>+ ("→" <appl>)?
<name> = [_A-Za-z][_0-9A-Za-z]*
I will leave the question open in case it helps somebody.
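For anyone who wants to try the extended grammar by hand, here is a minimal recursive descent sketch. It is written in Python instead of Haskell/Megaparsec purely for illustration, and the tuple-shaped AST, the class and function names, and the use of "_" as the unused binder name for A → B are assumptions of this sketch, not part of the original implementation.

```python
import re

TOKEN_RE = re.compile(r"\s*(→|λ|∀|[():*]|[_A-Za-z][_0-9A-Za-z]*)")
NAME_RE = re.compile(r"[_A-Za-z][_0-9A-Za-z]*")

def tokenize(src):
    src = src.rstrip()
    toks, pos = [], 0
    while pos < len(src):
        m = TOKEN_RE.match(src, pos)
        if m is None:
            raise SyntaxError(f"unexpected character at offset {pos}")
        toks.append(m.group(1))
        pos = m.end()
    return toks + ["<eof>"]

class CocParser:
    def __init__(self, src):
        self.toks, self.pos = tokenize(src), 0

    def peek(self):
        return self.toks[self.pos]

    def expect(self, tok):
        if self.peek() != tok:
            raise SyntaxError(f"expected {tok!r}, got {self.peek()!r}")
        self.pos += 1

    def name(self):
        tok = self.peek()
        if not NAME_RE.fullmatch(tok):
            raise SyntaxError(f"expected a name, got {tok!r}")
        self.pos += 1
        return tok

    def starts_expr(self):
        tok = self.peek()
        return tok in ("(", "λ", "∀", "*") or NAME_RE.fullmatch(tok) is not None

    def appl(self):
        # <appl> = <expr>+ ("→" <appl>)?
        node = self.expr()
        while self.starts_expr():       # left-nested application
            node = ("app", node, self.expr())
        if self.peek() == "→":          # the shortcut: A → B is ∀(_:A) → B
            self.pos += 1
            return ("pi", "_", node, self.appl())
        return node

    def binder(self, tag):
        # shared tail of λ and ∀: "(" <name> ":" <appl> ")" "→" <appl>
        self.expect("(")
        var = self.name()
        self.expect(":")
        ty = self.appl()
        self.expect(")")
        self.expect("→")
        return (tag, var, ty, self.appl())

    def expr(self):
        tok = self.peek()
        if tok == "(":
            self.pos += 1
            node = self.appl()
            self.expect(")")
            return node
        if tok in ("λ", "∀"):
            self.pos += 1
            return self.binder("lam" if tok == "λ" else "pi")
        if tok == "*":
            self.pos += 1
            return ("star",)
        return ("var", self.name())

def parse(src):
    parser = CocParser(src)
    node = parser.appl()
    parser.expect("<eof>")
    return node

print(parse("∀(a:*) → a → a"))
```

Because the optional "→" tail recurses into <appl>, the arrow comes out right-associative and binds more loosely than application, so f x → g y parses as (f x) → (g y).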

Why does parsing this program with BNFC fail?

Given the following grammar:
comment "/*" "*/" ;
TInt. Type1 ::= "int" ;
TBool. Type1 ::= "bool" ;
coercions Type 1 ;
BTrue. BExp ::= "true" ;
BFalse. BExp ::= "false" ;
EOr. Exp ::= Exp "||" Exp1 ;
EAnd. Exp1 ::= Exp1 "&&" Exp2 ;
EEq. Exp2 ::= Exp2 "==" Exp3 ;
ENeq. Exp2 ::= Exp2 "!=" Exp3 ;
ELt. Exp3 ::= Exp3 "<" Exp4 ;
EGt. Exp3 ::= Exp3 ">" Exp4 ;
ELte. Exp3 ::= Exp3 "<=" Exp4 ;
EGte. Exp3 ::= Exp3 ">=" Exp4 ;
EAdd. Exp4 ::= Exp4 "+" Exp5 ;
ESub. Exp4 ::= Exp4 "-" Exp5 ;
EMul. Exp5 ::= Exp5 "*" Exp6 ;
EDiv. Exp5 ::= Exp5 "/" Exp6 ;
EMod. Exp5 ::= Exp5 "%" Exp6 ;
ENot. Exp6 ::= "!" Exp ;
EVar. Exp8 ::= Ident ;
EInt. Exp8 ::= Integer ;
EBool. Exp8 ::= BExp ;
EIver. Exp8 ::= "[" Exp "]" ;
coercions Exp 8 ;
Decl. Decl ::= Ident ":" Type ;
terminator Decl ";" ;
LIdent. Lvalue ::= Ident ;
SBlock. Stm ::= "{" [Decl] [Stm] "}" ;
SExp. Stm ::= Exp ";" ;
SWhile. Stm ::= "while" "(" Exp ")" Stm ;
SReturn. Stm ::= "return" Exp ";" ;
SAssign. Stm ::= Lvalue "=" Exp ";" ;
SPrint. Stm ::= "print" Exp ";" ;
SIf. Stm ::= "if" "(" Exp ")" "then" Stm "endif" ;
SIfElse. Stm ::= "if" "(" Exp ")" "then" Stm "else" Stm "endif" ;
terminator Stm "" ;
entrypoints Stm;
the parser created with BNFC fails to parse
{ c = a; }
although it parses
c = a;
or
{ print a; c = a; }
I think the problem could be that the parser sees an Ident and doesn't know whether it's a declaration or a statement (LR stuff, etc.), although one token of lookahead should still be enough? However, I couldn't find any note in the BNFC documentation saying that it doesn't work for all grammars.
Any ideas how to get this working?
I would think you would get a shift/reduce conflict report for that grammar, although where that error message shows up might well depend on which tool BNFC is using to generate the parser. As far as I know, all the backend tools have the same approach to dealing with shift/reduce conflicts, which is to (1) warn the user about the conflict, and then (2) resolve the conflict in favour of shifting.
The problematic production is this one: (I've left out type annotations to reduce clutter)
Stm ::= "{" [Decl] [Stm] "}" ;
Here, [Decl] and [Stm] are macros, which automatically produce definitions for the non-terminals with those names (or something equivalent which will be accepted by the backend tool). Specifically, the automatically-produced productions are:
[Decl] ::= /* empty */
| Decl ';' [Decl]
[Stm] ::= /* empty */
| Stm [Stm]
(The ; in the first rule is the result of a "terminator" declaration. I don't know why BNFC generates right-recursive rules, but that's how I interpret the reference manual -- after a very quick glance -- and I'm sure they have their reasons. For the purpose of this problem, it doesn't matter.)
What's important is that both Decl and Stm can start with an Ident. So let's suppose we're parsing { id ..., which might be { id : ... or { id = ..., but we've only read the { and the lookahead token id. So there are two possibilities:
id is the start of a Decl. We should shift the Ident and go to the state which includes Decl → Ident • ':' Type
id is the start of a Stm. In this case, we need to reduce the production [Decl] → • before we shift Ident into a Stm production.
So we have a shift/reduce conflict, because we cannot see the second next token (either : or =). And, as mentioned above, shift usually wins in this case, so the LR(1) parser will commit itself to expect a Decl. Consequently, { a = b ; } will fail.
An LR(2) parser generator would do fine with this grammar, but those are much harder to find. (Modern bison can produce GLR parsers, which are even more powerful than LR(2) at the cost of a bit of extra compute time, but not the version required by the BNFC tool.)
Possible solutions
Allow declarations to be intermingled with statements. This one is my preference. It is simple, and many programmers expect to be able to declare a variable at first use rather than at the beginning of the enclosing block.
Make the declaration recognizable from the first token, either by putting the type first (as in C) or by adding a keyword such as var (as in JavaScript).
Modify the grammar to defer the lookahead. It is always possible to find an LR(1) grammar for any LR(k) language (provided k is finite), but it can be tedious. An ugly but effective alternative is to continue the lexical scan until either a : or some other non-whitespace character is found, so that id : gets tokenized as IdentDefine or some such. (This is the solution used by bison, as it happens. It means that you can't put comments between an identifier and the following :, but there are few, if any, good reasons to put a comment in that context.)
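The lexical-scan alternative from the last option can be sketched in a few lines. This is a hypothetical Python illustration, not BNFC or bison code: the token kinds IDENT_DEFINE, IDENT, INT and PUNCT, and the helper split_block, are names invented here, and the helper only handles flat blocks whose items each end in ";".

```python
import re

TOKEN_RE = re.compile(r"""
    (?P<IDENT_DEFINE>[A-Za-z_]\w*(?=\s*:))   # identifier whose next non-blank char is ':'
  | (?P<IDENT>[A-Za-z_]\w*)
  | (?P<INT>\d+)
  | (?P<PUNCT>[{}:;=])
  | (?P<WS>\s+)
""", re.VERBOSE)

def tokenize(src):
    """Return (kind, text) pairs; the lexer itself peeks past whitespace for ':'."""
    tokens, pos = [], 0
    while pos < len(src):
        m = TOKEN_RE.match(src, pos)
        if m is None:
            raise SyntaxError(f"unexpected character at offset {pos}")
        if m.lastgroup != "WS":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

def split_block(src):
    """Split '{ decls stmts }' into (decls, stmts), deciding on the first token of each item."""
    toks = tokenize(src)
    if len(toks) < 2 or toks[0] != ("PUNCT", "{") or toks[-1] != ("PUNCT", "}"):
        raise SyntaxError("expected a block in braces")
    body, decls, stmts, i = toks[1:-1], [], [], 0
    while i < len(body):
        try:
            end = body.index(("PUNCT", ";"), i)
        except ValueError:
            raise SyntaxError("missing ';'") from None
        item = [text for _, text in body[i:end]]
        if body[i][0] == "IDENT_DEFINE":     # one token is now enough to see a Decl
            if stmts:
                raise SyntaxError("declaration after a statement")
            decls.append(item)
        else:
            stmts.append(item)
        i = end + 1
    return decls, stmts

print(split_block("{ c : int; c = a; }"))
print(split_block("{ c = a; }"))
```

With the distinction moved into the lexer, the parser never has to choose between shifting an Ident and reducing an empty [Decl] on the same lookahead token, because the two cases arrive as different token kinds.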

How to solve SHIFT/REDUCE conflict - in parser generator

I need help solving this one, and an explanation of how to deal with shift/reduce conflicts like this in the future.
I have some conflicts between a few states in my CUP file.
The conflicts are between the "(" [ActPars] ")" parts of these two rules of the grammar:
1. Statement = Designator ("=" Expr | "++" | "--" | "(" [ActPars] ")" ) ";"
2. Factor = number | charConst | Designator [ "(" [ActPars] ")" ].
I don't want to paste the whole 700 lines of the CUP file, so I will give you the relevant rules and the error output.
This is the code for line 1:
Matched ::= Designator LPAREN ActParamsList RPAREN SEMI_COMMA
ActParamsList ::= ActPars
|
/* EPS */
;
ActPars ::= Expr
|
Expr ActPComma
;
ActPComma ::= COMMA ActPars;
This is the code for line 2:
Factor ::= Designator ActParamsOptional ;
ActParamsOptional ::= LPAREN ActParamsList2 RPAREN
|
/* EPS */
;
ActParamsList2 ::= ActPars
|
/* EPS */
;
Expr ::= SUBSTRACT Term RepOptionalExpression
|
Term RepOptionalExpression
;
The ERROR output looks like this:
Warning : *** Shift/Reduce conflict found in state #182
between ActParamsOptional ::= LPAREN ActParamsList RPAREN (*)
and Matched ::= Designator LPAREN ActParamsList RPAREN (*) SEMI_COMMA
under symbol SEMI_COMMA
Resolved in favor of shifting.
Error : * More conflicts encountered than expected -- parser generation aborted
I believe the problem is that, after the RPAREN, your parser won't know whether it should shift the token
SEMI_COMMA
or reduce to the non-terminal
ActParamsOptional
since both ActParamsOptional and Matched match the same sequence after a Designator:
LPAREN ActPars RPAREN
