I am trying to implement a lambda calculus inside of Rascal but am having trouble getting the precedence and parsing to work the way I would like it to. Currently I have a grammar that looks something like:
keyword Keywords= "if" | "then" | "else" | "end" | "fun";
lexical Ident = [a-zA-Z] !>> [a-zA-Z]+ !>> [a-zA-Z0-9] \ Keywords;
lexical Natural = [0-9]+ !>> [0-9];
lexical LAYOUT = [\t-\n\r\ ];
layout LAYOUTLIST = LAYOUT* !>> [\t-\n\r\ ];
start syntax Prog = prog: Exp LAYOUTLIST;
syntax Exp =
var: Ident
| nat: Natural
| bracket "(" Exp ")"
> left app: Exp Exp
> right func: "fun" Ident "-\>" Exp
When I parse a program of the form:
(fun x -> fun y -> x) 1 2
The resulting tree is:
prog(app(
    app(
      func(
        "x",
        func(
          "y",
          var("x"))),
      nat(1)),
    nat(2)))
whereas what I am really looking for is something like this (I think):
prog(app(
    func(
      "x",
      app(
        func(
          "y",
          var("x")),
        nat(2))),
    nat(1)))
I've tried a number of variations of the precedence ordering in the grammar, as well as wrapping the app rule in parentheses and several other things. There seems to be something going on here that I don't understand. Any help would be most appreciated. Thanks.
I've used the following grammar, which removes the extra LAYOUTLIST and the dead right, but this should not make a difference. It seems to work as you want when I use the generic implode function:
keyword Keywords= "if" | "then" | "else" | "end" | "fun";
lexical Ident = [a-zA-Z] !>> [a-zA-Z]+ !>> [a-zA-Z0-9] \ Keywords;
lexical Natural = [0-9]+ !>> [0-9];
lexical LAYOUT = [\t-\n\r\ ];
layout LAYOUTLIST = LAYOUT* !>> [\t-\n\r\ ];
start syntax Prog = prog: Exp;
syntax Exp =
var: Ident
| nat: Natural
| bracket "(" Exp ")"
> left app: Exp Exp
> func: "fun" Ident "-\>" Exp
;
Then calling the parser and imploding to an untyped AST (I've removed the location annotations for readability):
rascal>import ParseTree;
ok
rascal>implode(#node, parse(#start[Prog], "(fun x -\> fun y -\> x) 1 2"))
node: "prog"("app"(
"app"(
"func"(
"x",
"func"(
"y",
"var"("x"))),
"nat"("1")),
"nat"("2")))
So, I am guessing you got the grammar right for the shape of tree you want. How do you go from concrete parse tree to abstract AST? Perhaps there is something funny going on there.
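If you eventually want a typed AST rather than the untyped node above, one option (a rough, untested sketch; the AProg/AExp names are my own) is to declare an ADT whose constructor names mirror the production labels and implode into that:
data AProg = prog(AExp body);
data AExp
  = var(str name)
  | nat(int val)
  | app(AExp fn, AExp arg)
  | func(str param, AExp body)
  ;
With the grammar module and ParseTree imported, implode(#AProg, parse(#start[Prog], "(fun x -\> fun y -\> x) 1 2")) should then build the typed tree directly: implode matches constructors by label and arity, drops the literals, and converts the Natural lexical to int because the field is declared as int.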
I'm trying to write a parser that accepts a toy language for a software project class. The production rules relevant to the question are given below in EBNF-like syntax (there are many more relational operators, but I've removed some of them to keep it simple):
cond_expr = rel_expr
| '!' '(' cond_expr ')'
| '(' cond_expr ')' '&&' '(' cond_expr ')' ;
rel_expr = rel_factor '==' rel_factor
| rel_factor '!=' rel_factor ;
rel_factor = VAR | INTEGER | expr ;
expr = expr '+' term
| expr '-' term
| term ;
term = term '*' factor
| term '/' factor
| factor ;
factor = VAR | INTEGER | '(' expr ')' ;
VAR = [a-zA-Z][a-zA-Z0-9]* ;
INTEGER = '0' | [1-9][0-9]* ;
I've written more or less the entire parser already. I used recursive descent for the majority of the language, except for expressions, which I decided to parse with the shunting yard algorithm (because I couldn't get recursive descent to work even after left recursion elimination and left factoring).
The real problem I have is with the cond_expr rule: shunting yard is too powerful for this grammar, i.e. it accepts conditional expressions that the grammar can't. For example, the expression (x == 1) is not accepted by the grammar, and neither is !(x == 1) || (y == 1). I would use recursive descent to check whether an expression can be accepted, but the issue is the rel_expr inside cond_expr: rel_expr can be substituted with rel_factor '==' rel_factor or rel_factor '!=' rel_factor, and each rel_factor can in turn be substituted with '(' expr ')'. This leads to ambiguity (I don't know if that's the correct term) when deciding which branch to take in the cond_expr method upon seeing a '(' token. Something like the below:
Expression cond_expr() {
if (next() == "!") {
expect("!");
expect("(");
auto cond = cond_expr();
expect(")");
return cond;
} else if (next() == "(") {
// this will fail for e.g (x + 1) == 2
expect("(");
auto cond1 = cond_expr();
expect(")");
expect("&&");
expect("(");
auto cond2 = cond_expr();
expect(")");
return Node("&&", cond1, cond2);
} else {
return rel_expr();
}
}
The strategy I'm currently attempting is to first validate that the expression can be accepted by the grammar using some subroutine, and then call the shunting yard algorithm to parse it into the required AST. However, I'm having a lot of trouble writing this validation subroutine. Does anyone have suggestions for a way to solve this?
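For what it's worth, one common alternative to a separate validation pass is to make cond_expr itself speculative: on '(' try the '&&' form first and backtrack to rel_expr if it fails. A rough sketch against the pseudocode above, where save(), restore() and ParseError are hypothetical helpers for snapshotting the token position and signalling a mismatch from expect():
Expression cond_expr() {
    if (next() == "!") {
        // unchanged from the version above
        expect("!");
        expect("(");
        auto cond = cond_expr();
        expect(")");
        return cond;
    }
    if (next() == "(") {
        auto mark = save();               // hypothetical: remember the token position
        try {
            expect("(");
            auto cond1 = cond_expr();
            expect(")");
            expect("&&");
            expect("(");
            auto cond2 = cond_expr();
            expect(")");
            return Node("&&", cond1, cond2);
        } catch (const ParseError&) {     // hypothetical: thrown by expect() on a mismatch
            restore(mark);                // rewind; it's a rel_expr such as (x + 1) == 2
        }
    }
    return rel_expr();
}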
I have a grammar for arithmetic expressions which evaluates a number of expressions (one per line) in a text file. When compiling with YACC I get a message reporting 2 shift/reduce conflicts, but my calculations come out right. If the parser gives the proper output, how does it resolve the shift/reduce conflicts? And in my case, is there any way to get rid of them in the YACC grammar?
YACC GRAMMAR
Calc : Expr {printf(" = %d\n",$1);}
| Calc Expr {printf(" = %d\n",$2);}
| error {yyerror("\nBad Expression\n ");}
;
Expr : Term { $$ = $1; }
| Expr '+' Term { $$ = $1 + $3; }
| Expr '-' Term { $$ = $1 - $3; }
;
Term : Fact { $$ = $1; }
| Term '*' Fact { $$ = $1 * $3; }
| Term '/' Fact { if ($3 == 0) {
                      yyerror("Divide by Zero Encountered.");
                      break;
                  } else
                      $$ = $1 / $3;
                }
;
Fact : Prim { $$ = $1; }
| '-' Prim { $$ = -$2; }
;
Prim : '(' Expr ')' { $$ = $2; }
| Id { $$ = $1; }
;
Id :NUM { $$ = yylval; }
;
What change should I make to remove these conflicts from my grammar?
Bison/yacc resolves shift-reduce conflicts by choosing to shift. This is explained in the bison manual in the section on Shift-Reduce conflicts.
Your problem is that your input is just a series of Exprs, run together without any delimiter between them. That means that:
4 - 2
could be one expression (4-2) or it could be two expressions (4, -2). Since bison-generated parsers always prefer to shift, the parser will choose to parse it as one expression, even if it were typed on two lines:
4
-2
If you want to allow users to type their expressions like that, without any separator, then you could either live with the conflict (since it is relatively benign) or you could codify it into your grammar, but that's quite a bit more work. To put it into the grammar, you need to define two different types of Expr: one (the one you use at the top level) cannot start with a unary minus, and the other one (which you can use everywhere else) is allowed to start with a unary minus.
I suspect that what you really want to do is use newlines or some other kind of expression separator. That's as simple as passing the newline through to your parser and changing Calc to Calc: | Calc '\n' | Calc Expr '\n'.
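Spelled out (with the printf action carried over from your grammar, and assuming the lexer now returns '\n' as a token instead of discarding it), that change would look something like:
Calc : /* empty */
     | Calc '\n'
     | Calc Expr '\n'   { printf(" = %d\n", $2); }
     ;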
I'm sure this appears somewhere else on SO, but I can't find it. So here is how you disallow a unary minus at the beginning of an expression, so that you can run expressions together without delimiters. The non-terminals starting with n_ cannot start with a unary minus:
input: %empty | input n_expr { /* print $2 */ }
expr: term | expr '+' term | expr '-' term
n_expr: n_term | n_expr '+' term | n_expr '-' term
term: factor | term '*' factor | term '/' factor
n_term: value | n_term '*' factor | n_term '/' factor
factor: value | '-' factor
value: NUM | '(' expr ')'
That parses the same language as your grammar, but without generating the shift-reduce conflict. Since it parses the same language, the input
4
-2
will still be parsed as a single expression; to get the expected result you would need to type
4
(-2)
I have the following simple grammar:
E -> T | ^ v . E
T -> F T1
T1 -> F T1 | epsilon
F -> ( E ) | v
I'm pretty new to Bison, so I was hoping someone could show me how to write it out in that format. All I have so far is the following, but I'm not sure if it's correct:
%left '.'
%left 'v'
%% /* The grammar follows. */
exp:
term {printf("1");}
| '^' 'v' '.' exp {printf("2");}
;
term:
factor term1 {printf("3");}
;
term1:
factor term1 {printf("4");}
| {printf("5");}
;
factor:
'(' exp ')' {printf("6");}
| 'v' {printf("7");}
;
%%
You are missing the closing semicolon from several of the productions. There's nothing in the source grammar to suggest you need the productions about lines.
I've been using regexes to go through a pile of Verilog files and pull out certain statements. Currently, regexes are fine for this; however, I'm starting to get to the point where a real parser is going to be needed in order to deal with nested structures, so I'm investigating ocamllex/ocamlyacc. I'd like to first duplicate what I've got in my regex implementation and then slowly add more to the grammar.
Right now I'm mainly interested in pulling out module declarations and instantiations. To keep this question a bit more brief, let's look at module declarations only.
In Verilog a module declaration looks like:
module modname ( ...other statements ) endmodule;
My current regex implementation simply checks that there is a module declared with a particular name (checking against a list of names that I'm interested in; I don't need to find all module declarations, just ones with certain names). So basically, I take each line of the Verilog file I want to parse and do a match like this (pseudo-OCaml with Pythonish and Rubyish elements):
foreach file in list_of_files:
let found_mods = Hashtbl.create 17;
open file
foreach line in file:
foreach modname in modlist
let mod_patt= Str.regexp ("module"^space^"+"^modname^"\\("^space^"+\\|(\\)") in
try
Str.search_forward (mod_patt) line 0
found_mods[file] = modname; (* map filename to modname *)
with Not_found -> ()
That works great. The module declaration can occur anywhere in the Verilog file; I just want to find out whether the file contains that particular declaration, and I don't care about what else may be in the file.
My first attempt at converting this over to ocamllex/ocamlyacc:
verLexer.mll:
rule lex = parse
| [' ' '\n' '\t'] { lex lexbuf }
| ['0'-'9']+ as s { INT(int_of_string s) }
| '(' { LPAREN }
| ')' { RPAREN }
| "module" { MODULE }
| ['A'-'Z''a'-'z''0'-'9''_']+ as s { IDENT(s) }
| eof { EOF }
verParser.mly:
%{ type expr = Module of expr | Ident of string | Int of int %}
%token <int> INT
%token <string> IDENT
%token LPAREN RPAREN MODULE EOF
%start expr1
%type <expr> expr1
%%
expr:
| MODULE IDENT LPAREN { Module( Ident $2) };
expr1:
| expr EOF { $1 };
Then trying it out in the REPL:
# #use "verLexer.ml" ;;
# #use "verParser.ml" ;;
# expr1 lex (Lexing.from_string "module foo (" ) ;;
- : expr = Module (Ident "foo")
That's great, it works!
However, a real Verilog file will have more than a module declaration in it:
# expr1 lex (Lexing.from_string "//comment\nmodule foo ( \nstuff" ) ;;
Exception: Failure "lexing: empty token".
I don't really care about what appears before or after that module definition. Is there a way to just extract that part of the grammar, to determine that a Verilog file contains the 'module foo (' statement? Yes, I realize that regexes are working fine for this; however, as stated above, I am planning to grow this grammar slowly and add more elements to it, and the regexes will start to break down.
EDIT: I added a match any char to the lex rule:
| _ { lex lexbuf }
Thinking that it would skip any characters that weren't matched so far, but that didn't seem to work:
# expr1 lex (Lexing.from_string "fof\n module foo (\n" ) ;;
Exception: Parsing.Parse_error.
First, a minute of advertisement: instead of ocamlyacc you should consider using François Pottier's Menhir, which is like a "yacc, upgraded": better in all aspects (more readable grammars, more powerful constructs, easier to debug...) while still very similar. It can of course be used in combination with ocamllex.
Your expr1 rule only allows the input to begin and end with an expr. You should enlarge it to allow "stuff" before or after the expr. Something like:
junk:
| junk LPAREN
| junk RPAREN
| junk INT
| junk IDENT
expr1:
| junk expr junk EOF
Note that this grammar does not allow the module token to appear in the junk section. Doing so would be a bit problematic, as it would make the grammar ambiguous (the structure you're looking for could be embedded either in expr or in junk). If a module token could occur outside the form you're looking for, you should consider changing the lexer to capture the whole "module ident (" structure of interest as a single token, so that it can be matched atomically from the grammar. In the long term, however, having finer-grained tokens is probably better.
As suggested by @gasche, I tried Menhir and am already getting much better results. I changed verLexer.mll to:
{
open VerParser
}
rule lex = parse
| [' ' '\n' '\t'] { lex lexbuf }
| ['0'-'9']+ as s { INT(int_of_string s) }
| '(' { LPAREN }
| ')' { RPAREN }
| "module" { MODULE }
| ['A'-'Z''a'-'z''0'-'9''_']+ as s { IDENT(s) }
| _ as c { lex lexbuf }
| eof { EOF }
And changed verParser.mly to:
%{ type expr = Module of expr | Ident of string | Int of int
|Lparen | Rparen | Junk %}
%token <int> INT
%token <string> IDENT
%token LPAREN RPAREN MODULE EOF
%start expr1
%type <expr> expr1
%%
expr:
| MODULE IDENT LPAREN { Module( Ident $2) };
junk:
| LPAREN { }
| RPAREN { }
| INT { }
| IDENT { } ;
expr1:
| junk* expr junk* EOF { $2 };
The key here is that Menhir allows a rule to be parameterized with a '*', as in the line above where I've written 'junk*', meaning match junk zero or more times. ocamlyacc doesn't seem to allow that.
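(For completeness, in plain ocamlyacc the same effect can be had by writing the repetition as an explicit recursive rule; a rough, untested sketch:)
junk_list:
    /* empty */                    { }
  | junk_list junk                 { }
;
expr1:
    junk_list expr junk_list EOF   { $2 }
;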
Now when I tried it in the REPL I get:
# #use "verParser.ml" ;;
# #use "verLexer.ml" ;;
# expr1 lex (Lexing.from_string "module foo ( " ) ;;
- : expr = Module (Ident "foo")
# expr1 lex (Lexing.from_string "some module foo ( " ) ;;
- : expr = Module (Ident "foo")
# expr1 lex (Lexing.from_string "some module foo (\nbar " ) ;;
- : expr = Module (Ident "foo")
# expr1 lex (Lexing.from_string "some module foo (\n//comment " ) ;;
- : expr = Module (Ident "foo")
# expr1 lex (Lexing.from_string "some module fot foo (\n//comment " ) ;;
Exception: Error.
# expr1 lex (Lexing.from_string "some module foo (\n//comment " ) ;;
Which seems to work exactly as I want it to.
I am supposed to make a parser for a language with the following grammar:
Program ::= Stmts "return" Expr ";"
Stmts ::= Stmt Stmts
| ε
Stmt ::= ident "=" Expr ";"
| "{" Stmts "}"
| "for" ident "=" Expr "to" Expr Stmt
| "choice" "{" Choices "}"
Choices ::= Choice Choices
| Choice
Choice ::= integer ":" Stmt
Expr ::= Shift
Shift ::= Shift "<<" integer
| Shift ">>" integer
| Term
Term ::= Term "+" Prod
| Term "-" Prod
| Prod
Prod ::= Prod "*" Prim
| Prim
Prim ::= ident
| integer
| "(" Expr ")"
With the following data type for Expr:
data Expr = Var Ident
| Val Int
| Lshift Expr Int
| Rshift Expr Int
| Plus Expr Expr
| Minus Expr Expr
| Mult Expr Expr
deriving (Eq, Show, Read)
My problem is implementing the Shift operator, because I get the following error when I encounter a left or right shift:
unexpected ">"
expecting operator or ";"
Here is the code I have for Expr:
expr = try (exprOp)
<|> exprShift
exprOp = buildExpressionParser arithmeticalOps prim <?> "arithmetical expression"
prim :: Parser Expr
prim = new_ident <|> new_integer <|> pE <?> "primitive expression"
where
new_ident = do {i <- ident; return $ Var i }
new_integer = do {i <- first_integer; return $ Val i }
pE = parens expr
arithmeticalOps = [ [binary "*" Mult AssocLeft],
[binary "+" Plus AssocLeft, binary "-" Minus AssocLeft]
]
binary name fun assoc = Infix (do{ reservedOp name; return fun }) assoc
exprShift =
do
e <- expr
a <- aShift
i <- first_integer
return $ a e i
aShift = (reservedOp "<<" >> return Lshift)
<|> (reservedOp ">>" >> return Rshift)
I suspect the problem is concerning lookahead, but I can't seem to figure it out.
Here's a grammar with left recursion eliminated (untested). Stmts and Choices can be simplified with Parsec's many and many1. The other recursive productions have to be expanded:
Program ::= Stmts "return" Expr ";"
Stmts ::= many Stmt
Stmt ::= ident "=" Expr ";"
| "{" Stmts "}"
| "for" ident "=" Expr "to" Expr Stmt
| "choice" "{" Choices "}"
Choices ::= many1 Choice
Choice ::= integer ":" Stmt
Expr ::= Shift
Shift ::= Term ShiftRest
ShiftRest ::= <empty>
| "<<" integer
| ">>" integer
Term ::= Prod TermRest
TermRest ::= <empty>
| "+" Term
| "-" Term
Prod ::= Prim ProdRest
ProdRest ::= <empty>
| "*" Prod
Prim ::= ident
| integer
| "(" Expr ")"
Edit - "Part Two"
"empty" (in angles) is the empty production, you were using epsilon in the original post, but I don't know its Unicode code point and didn't think to copy-paste it.
Here's an example of how I would code the grammar. Note - unlike the grammar I posted empty versions must always be the last choice to give the other productions chance to match. Also your datatypes and constructors for the Abstract Syntax Tree probably differ to the the guesses I've made, but it should be fairly clear what's going on. The code is untested - hopefully any errors are obvious:
shift :: Parser Expr
shift = do
t <- term
leftShift t <|> rightShift t <|> emptyShift t
-- Note - this gets an Expr passed in - it is the "prefix"
-- of the shift production.
--
leftShift :: Expr -> Parser Expr
leftShift t = do
reservedOp "<<"
i <- int
return (LShift t i)
-- Again this gets an Expr passed in.
--
rightShift :: Expr -> Parser Expr
rightShift t = do
reservedOp ">>"
i <- int
return (RShift t i)
-- The empty version does no parsing.
-- Usually I would change the definition of "shift"
-- and not bother defining "emptyShift", the last
-- line of "shift" would then be:
--
-- > leftShift t <|> rightShift t <|> return t
--
emptyShift :: Expr -> Parser Expr
emptyShift t = return t
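Term and Prod can be expanded in the same way. A sketch (again untested, reusing the Plus and Minus constructors from your Expr type and assuming a prod parser exists; note that, like the rewritten grammar above, this makes "+" and "-" right-associative):
term :: Parser Expr
term = do
  p <- prod
  plusTerm p <|> minusTerm p <|> return p
  where
    -- the "+" Term and "-" Term branches of TermRest
    plusTerm p  = reservedOp "+" >> fmap (Plus p)  term
    minusTerm p = reservedOp "-" >> fmap (Minus p) term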
Parsec is still Greek to me, but my vague guess is that aShift should use try.
The parsec docs on Hackage have an example explaining the use of try with <|> that might help you out.
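If you want to experiment with that guess, wrapping the first operator alternative in try would look roughly like this (so a failed "<<" match doesn't commit before the ">>" branch is tried):
aShift = try (reservedOp "<<" >> return Lshift)
     <|>     (reservedOp ">>" >> return Rshift)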