When to use the "and" keyword in an OCaml AST

I am translating the rules of my grammar into an AST.
Is it necessary to use the "and" operator in defining our AST?
For instance, I have translated my grammar thus far like so:
type program =
| Decls of typ * identifier * declsprime
type typ =
| INT
| BOOL
| VOID
type identifier = string
(* decls_prime = vdecl decls | fdecl decls *)
type declsprime =
| Vdecl of variabledeclaration * decls
| Fdecl of functiondeclaration * decls
(*“lparen” formals_opt “rparen” “LBRACE” vdecl_list stmt_list “RBRACE”*)
type functiondeclaration =
| Fdecl of variabledeclarationlist * stmtlist
(*formals_opt = formal_list | epsilon *)
type formalsopt =
| FormalsOpt of formallist
(* typ "ID" formal_list_prime *)
type formallist =
| FormalList of typ * identifier * formallistprime
type formallistprime =
| FormalListPrime of formallist
type variabledeclarationlist =
| VdeclList of variabledeclaration * variabledeclarationlist
(*stmt stmt_list | epsilon*)
type stmtlist =
| StmtList of stmt * stmtlist
| StmtlistNil
(* stmt = “RETURN” stmt_prime| expr SEMI |“LBRACE” stmt_list RBRACE| IF LPAREN expr RPAREN stmt stmt_prime_prime| FOR LPAREN expr_opt SEMI expr SEMI expr_opt RPAREN stmt| WHILE LPAREN expr RPAREN stmt*)
type stmt =
| Return of stmtprime
| Expression of expr
| StmtList of stmtlist
| IF of expr * stmt * stmtprimeprime
| FOR of expropt * expr * expropt * stmt
| WHILE of expr * stmt
(*stmt_prime = SEMI| expr SEMI*)
type stmtprime =
| SEMI
| Expression of expr
(*NOELSE | ELSE stmt*)
type stmtprimeprime =
| NOELSE
| ELSE of stmt
(* Expr_opt = expr | epsilon *)
type expropt =
| Expression of expr
| ExprNil
type expr
type exprprime
(* Actuals_opt = actuals_list | epsilon *)
type actualsopt =
| ActualsList of actualslist
| ActualsNil
type actualslist =
| ActualsList of expr * actualslistprime
(*actualslistprime = COMMA expr actuals_list_prime | epsilon*)
type actualslistprime =
| ActualsListPrime of expr * actualslistprime
| ALPNil
But it looks as though this example from Illinois uses a slightly different structure:
type program = Program of (class_decl list)
and class_decl = Class of id * id * (var_decl list) * (method_decl list)
and method_decl = Method....
Is it necessary to use "and" when defining my AST? And moreover, is it wrong for me to use a StmtList type rather than (stmt list) even though I call the AST StmtList method correctly in my parser?

You only need "and" when your definitions are mutually recursive. That is, if a statement could contain an expression and an expression could in turn contain a statement, then the expr and stmt types would have to be connected with "and". If your code compiles without "and", you don't need it.
PS: This is unrelated to your question, but I think it would make a lot more sense to use the built-in list and option types than to define your own versions for specific types (such as stmtlist, expropt etc.). stmtprime is another such case: you could just define Return as Return of expr option and get rid of the stmtprime type. The same goes for stmtprimeprime.
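To make both points concrete, here is a minimal sketch rather than the asker's full grammar. The stmt and expr types refer to each other, so they are joined with "and", while typ stands alone. The Lambda constructor is a hypothetical addition, included only so that an expression can contain statements; the other constructor names and the count_returns helper are likewise assumptions made for illustration. The built-in list and option types stand in for stmtlist, stmtprime and stmtprimeprime.

```ocaml
(* Not recursive with anything else: a plain [type] is enough. *)
type typ = Int | Bool | Void

(* [stmt] mentions [expr] and [expr] mentions [stmt] (via the hypothetical
   [Lambda] constructor), so the two definitions must be joined with [and]. *)
type stmt =
  | Return of expr option            (* replaces stmtprime *)
  | Expression of expr
  | Block of stmt list               (* replaces stmtlist *)
  | If of expr * stmt * stmt option  (* replaces stmtprimeprime *)
  | While of expr * stmt

and expr =
  | IntLit of int
  | Var of string
  | Lambda of stmt list              (* assumption: an expression holding statements *)

(* A small traversal to show the types in use: count Return statements,
   including those nested inside expressions. *)
let rec count_returns (s : stmt) : int =
  match s with
  | Return None -> 1
  | Return (Some e) -> 1 + count_in_expr e
  | Expression e -> count_in_expr e
  | Block ss -> List.fold_left (fun n s -> n + count_returns s) 0 ss
  | If (c, t, None) -> count_in_expr c + count_returns t
  | If (c, t, Some e) -> count_in_expr c + count_returns t + count_returns e
  | While (c, body) -> count_in_expr c + count_returns body

and count_in_expr (e : expr) : int =
  match e with
  | IntLit _ | Var _ -> 0
  | Lambda ss -> List.fold_left (fun n s -> n + count_returns s) 0 ss

let example : stmt =
  Block [ If (Var "x", Return (Some (IntLit 1)), None); Return None ]
```

If Lambda were removed, the cycle would disappear: expr could then be defined first with its own "type", followed by stmt, and no "and" would be needed.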

Related

How to write yacc grammar rules to identify function definitions vs function calls?

I have started learning about YACC, and I have executed a few examples of simple toy programs. But I have never seen a practical example that demonstrates how to build a compiler that identifies and implements function definitions and function calls, array implementation and so on, nor has it been easy to find an example using Google search. Can someone please provide one example of how to generate the tree using YACC? C or C++ is fine.
Thanks in advance!
Let's parse some code with yacc.
The file test contains valid C code that we want to parse:
int main (int c, int b) {
    int a;
    while ( 1 ) {
        int d;
    }
}
A lex file c.l
alpha [a-zA-Z]
digit [0-9]
%%
[ \t] ;
[ \n] { yylineno = yylineno + 1;}
int return INT;
float return FLOAT;
char return CHAR;
void return VOID;
double return DOUBLE;
for return FOR;
while return WHILE;
if return IF;
else return ELSE;
printf return PRINTF;
struct return STRUCT;
^"#include ".+ ;
{digit}+ return NUM;
{alpha}({alpha}|{digit})* return ID;
"<=" return LE;
">=" return GE;
"==" return EQ;
"!=" return NE;
">" return GT;
"<" return LT;
"." return DOT;
\/\/.* ;
\/\*(.*\n)*.*\*\/ ;
. return yytext[0];
%%
The file c.y is the input to yacc:
%{
#include <stdio.h>
#include <stdlib.h>
extern FILE *fp;
%}
%token INT FLOAT CHAR DOUBLE VOID
%token FOR WHILE
%token IF ELSE PRINTF
%token STRUCT
%token NUM ID
%token INCLUDE
%token DOT
%right '='
%left AND OR
%left '<' '>' LE GE EQ NE LT GT
%%
start: Function
| Declaration
;
/* Declaration block */
Declaration: Type Assignment ';'
| Assignment ';'
| FunctionCall ';'
| ArrayUsage ';'
| Type ArrayUsage ';'
| StructStmt ';'
| error
;
/* Assignment block */
Assignment: ID '=' Assignment
| ID '=' FunctionCall
| ID '=' ArrayUsage
| ArrayUsage '=' Assignment
| ID ',' Assignment
| NUM ',' Assignment
| ID '+' Assignment
| ID '-' Assignment
| ID '*' Assignment
| ID '/' Assignment
| NUM '+' Assignment
| NUM '-' Assignment
| NUM '*' Assignment
| NUM '/' Assignment
| '\'' Assignment '\''
| '(' Assignment ')'
| '-' '(' Assignment ')'
| '-' NUM
| '-' ID
| NUM
| ID
;
/* Function Call Block */
FunctionCall : ID'('')'
| ID'('Assignment')'
;
/* Array Usage */
ArrayUsage : ID'['Assignment']'
;
/* Function block */
Function: Type ID '(' ArgListOpt ')' CompoundStmt
;
ArgListOpt: ArgList
|
;
ArgList: ArgList ',' Arg
| Arg
;
Arg: Type ID
;
CompoundStmt: '{' StmtList '}'
;
StmtList: StmtList Stmt
|
;
Stmt: WhileStmt
| Declaration
| ForStmt
| IfStmt
| PrintFunc
| ';'
;
/* Type Identifier block */
Type: INT
| FLOAT
| CHAR
| DOUBLE
| VOID
;
/* Loop Blocks */
WhileStmt: WHILE '(' Expr ')' Stmt
| WHILE '(' Expr ')' CompoundStmt
;
/* For Block */
ForStmt: FOR '(' Expr ';' Expr ';' Expr ')' Stmt
| FOR '(' Expr ';' Expr ';' Expr ')' CompoundStmt
| FOR '(' Expr ')' Stmt
| FOR '(' Expr ')' CompoundStmt
;
/* IfStmt Block */
IfStmt : IF '(' Expr ')'
Stmt
;
/* Struct Statement */
StructStmt : STRUCT ID '{' Type Assignment '}'
;
/* Print Function */
PrintFunc : PRINTF '(' Expr ')' ';'
;
/*Expression Block*/
Expr:
| Expr LE Expr
| Expr GE Expr
| Expr NE Expr
| Expr EQ Expr
| Expr GT Expr
| Expr LT Expr
| Assignment
| ArrayUsage
;
%%
#include"lex.yy.c"
#include<ctype.h>
int count=0;
int main(int argc, char *argv[])
{
yyin = fopen(argv[1], "r");
if(!yyparse())
printf("\nParsing complete\n");
else
printf("\nParsing failed\n");
fclose(yyin);
return 0;
}
yyerror(char *s) {
printf("%d : %s %s\n", yylineno, s, yytext );
}
A Makefile to put it together. I use flex-lexer and bison but the example will also work with lex and yacc.
miniC: c.l c.y
	bison c.y
	flex c.l
	gcc c.tab.c -ll -ly
Compile and parse the test code:
$ make
bison c.y
flex c.l
gcc c.tab.c -ll -ly
c.tab.c: In function ‘yyparse’:
c.tab.c:1273:16: warning: implicit declaration of function ‘yylex’ [-Wimplicit-function-declaration]
yychar = yylex ();
^
c.tab.c:1402:7: warning: implicit declaration of function ‘yyerror’ [-Wimplicit-function-declaration]
yyerror (YY_("syntax error"));
^
c.y: At top level:
c.y:155:1: warning: return type defaults to ‘int’ [-Wimplicit-int]
yyerror(char *s) {
^
$ ls
a.out c.l CMakeLists.txt c.tab.c c.y lex.yy.c Makefile README.md test
$ ./a.out test
Parsing complete
For reading resources I can recommend the books Modern Compiler Implementation in C by Andrew Appel and the flex/bison book by John Levine.

Is it possible to create a very permissive grammar using Menhir?

I'm trying to parse some bits and pieces of Verilog - I'm primarily interested in extracting module definitions and instantiations.
In Verilog, a module is defined like:
module foo ( ... ) endmodule;
And a module is instantiated in one of two different possible ways:
foo fooinst ( ... );
foo #( ...list of params... ) fooinst ( .... );
At this point I'm only interested in finding the name of the defined or instantiated module; 'foo' in both cases above.
Given this menhir grammar (verParser.mly):
%{
type expr = Module of expr
| ModInst of expr
| Ident of string
| Int of int
| Lparen
| Rparen
| Junk
| ExprList of expr list
%}
%token <string> INT
%token <string> IDENT
%token LPAREN RPAREN MODULE TICK OTHER HASH EOF
%start expr2
%type <expr> mod_expr
%type <expr> expr1
%type <expr list> expr2
%%
mod_expr:
| MODULE IDENT LPAREN { Module ( Ident $2) }
| IDENT IDENT LPAREN { ModInst ( Ident $1) }
| IDENT HASH LPAREN { ModInst ( Ident $1) };
junk:
| LPAREN { }
| RPAREN { }
| HASH { }
| INT { };
expr1:
| junk* mod_expr junk* { $2 } ;
expr2:
| expr1* EOF { $1 };
When I try this out in the Menhir interpreter, it works fine for extracting the module definition:
MODULE IDENT LPAREN
ACCEPT
[expr2:
[list(expr1):
[expr1:
[list(junk):]
[mod_expr: MODULE IDENT LPAREN]
[list(junk):]
]
[list(expr1):]
]
EOF
]
It works fine for the single module instantiation:
IDENT IDENT LPAREN
ACCEPT
[expr2:
[list(expr1):
[expr1:
[list(junk):]
[mod_expr: IDENT IDENT LPAREN]
[list(junk):]
]
[list(expr1):]
]
EOF
]
But of course, if there is an IDENT that appears prior to any of these it will REJECT:
IDENT MODULE IDENT LPAREN IDENT IDENT LPAREN
REJECT
... and of course there will be identifiers in an actual Verilog file prior to these definitions.
I'm trying not to have to fully specify a Verilog grammar. Instead, I want to build the grammar up slowly and incrementally so that it eventually parses more and more of the language.
If I add IDENT to the junk rule, that fixes the problem above, but then the module instantiation rule doesn't work because now the junk rule is capturing the IDENT.
Is it possible to create a very permissive rule that will bypass stuff I don't want to match, or is it generally required that you must create a complete grammar to actually do something like this?
Is it possible to create a rule that would let me match:
MODULE IDENT LPAREN stuff* RPAREN ENDMODULE
where "stuff*" initially matches everything but RPAREN?
Something like :
stuff:
| !RPAREN { } ;
I've used PEG parsers in the past which would allow constructs like that.
I've decided that PEG is a better fit for a permissive, non-exhaustive grammar. Took a look at peg/leg and was able to very quickly put together a leg grammar that does what I need to do:
start = ( comment | mod_match | char)
line = < (( '\n' '\r'* ) | ( '\r' '\n'* )) > { lines++; chars += yyleng; }
module_decl = module modnm:ident lparen ( !rparen . )* rparen { chars += yyleng; printf("Module decl: <%s>\n",yytext);}
module_inst = modinstname:ident ident lparen { chars += yyleng; printf("Module Inst: <%s>\n",yytext);}
|modinstname:ident hash lparen { chars += yyleng; printf("Module Inst: <%s>\n",yytext);}
mod_match = ( module_decl | module_inst )
module = 'module' ws { modules++; chars +=yyleng; printf("Module: <%s>\n", yytext); }
endmodule = 'endmodule' ws { endmodules++; chars +=yyleng; printf("EndModule: <%s>\n", yytext); }
kwd = (module|endmodule)
ident = !kwd<[a-zA-Z_][a-zA-Z0-9_]+>- { words++; chars += yyleng; printf("Ident: <%s>\n", yytext); }
char = . { chars++; }
lparen = '(' -
rparen = ')' -
hash = '#'
- = ( space | comment )*
ws = space+
space = ' ' | '\t' | EOL
comment = '//' ( !EOL .)* EOL
| '/*' ( !'*/' .)* '*/'
EOF = !.
EOL = '\r\n' | '\n' | '\r'
Aurochs is possibly also an option, but I have concerns about speed and memory usage of an Aurochs generated parser. peg/leg produce a parser in C which should be quite speedy.
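The same "skip whatever you don't recognise" idea can also be sketched directly in OCaml over a token list, without a parser generator. This is only an illustration under assumptions: the token type mirrors the token names in the Menhir grammar above (INT is folded into OTHER), while found and scan are hypothetical names. It inherits the same heuristic as the original rules, so any IDENT IDENT LPAREN sequence is reported as an instantiation.

```ocaml
type token = MODULE | IDENT of string | LPAREN | RPAREN | HASH | OTHER

type found = ModuleDef of string | ModuleInst of string

(* Try each pattern of interest at the current position; if none matches,
   drop one token and continue. The last case plays the role of the PEG's
   catch-all alternative. *)
let rec scan (tokens : token list) : found list =
  match tokens with
  | [] -> []
  | MODULE :: IDENT name :: LPAREN :: rest -> ModuleDef name :: scan rest
  | IDENT name :: IDENT _ :: LPAREN :: rest -> ModuleInst name :: scan rest
  | IDENT name :: HASH :: LPAREN :: rest -> ModuleInst name :: scan rest
  | _ :: rest -> scan rest

(* The token sequence that the Menhir grammar rejected:
   IDENT MODULE IDENT LPAREN IDENT IDENT LPAREN *)
let results =
  scan
    [ IDENT "x"; MODULE; IDENT "foo"; LPAREN;
      IDENT "bar"; IDENT "u1"; LPAREN ]
```

Because the catch-all case consumes exactly one token, a leading stray identifier no longer causes a rejection; the scan simply resumes at the next position.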

Assignment as expression in Antlr grammar

I'm trying to extend the grammar of the Tiny Language to treat assignment as expression. Thus it would be valid to write
a = b = 1; // -> a = (b = 1)
a = 2 * (b = 1); // contrived but valid
a = 1 = 2; // invalid
Assignment differs from other operators in two respects. It's right-associative (not a big deal), and its left-hand side has to be a variable. So I changed the grammar like this:
statement: assignmentExpr | functionCall ...;
assignmentExpr: Identifier indexes? '=' expression;
expression: assignmentExpr | condExpr;
It doesn't work, because it contains a non-LL(*) decision. I also tried this variant:
assignmentExpr: Identifier indexes? '=' (expression | condExpr);
but I got the same error. I am interested in three things: this specific question; given a grammar with a non-LL(*) decision, how to find the two paths that cause the problem; and how to fix it.
I think you can change your grammar like this to achieve the same, without using syntactic predicates:
statement: Expr ';' | functionCall ';'...;
Expr: Identifier indexes? '=' Expr | condExpr ;
condExpr: .... and so on;
I altered Bart's example with this idea in mind:
grammar TL;
options {
output=AST;
}
tokens {
ROOT;
}
parse
: stat+ EOF -> ^(ROOT stat+)
;
stat
: expr ';'
;
expr
: Id Assign expr -> ^(Assign Id expr)
| add
;
add
: mult (('+' | '-')^ mult)*
;
mult
: atom (('*' | '/')^ atom)*
;
atom
: Id
| Num
| '('! expr ')' !
;
Assign : '=' ;
Comment : '//' ~('\r' | '\n')* {skip();};
Id : 'a'..'z'+;
Num : '0'..'9'+;
Space : (' ' | '\t' | '\r' | '\n')+ {skip();};
And for the input:
a=b=4;
a = 2 * (b = 1);
you get the following parse tree: [figure: parse tree for the two statements]
The key here is that you need to "assure" the parser that inside an expression, there is something ahead that satisfies the expression. This can be done using a syntactic predicate (the ( ... )=> parts in the add and mult rules).
A quick demo:
grammar TL;
options {
output=AST;
}
tokens {
ROOT;
ASSIGN;
}
parse
: stat* EOF -> ^(ROOT stat*)
;
stat
: expr ';' -> expr
;
expr
: add
;
add
: mult ((('+' | '-') mult)=> ('+' | '-')^ mult)*
;
mult
: atom ((('*' | '/') atom)=> ('*' | '/')^ atom)*
;
atom
: (Id -> Id) ('=' expr -> ^(ASSIGN Id expr))?
| Num
| '(' expr ')' -> expr
;
Comment : '//' ~('\r' | '\n')* {skip();};
Id : 'a'..'z'+;
Num : '0'..'9'+;
Space : (' ' | '\t' | '\r' | '\n')+ {skip();};
which will parse the input:
a = b = 1; // -> a = (b = 1)
a = 2 * (b = 1); // contrived but valid
into the following AST: [figure: AST for the two assignments]

Optimizing Bison Grammar

I have this grammar for a C#-like language and I want to write a parser for it, but when I run Bison on the grammar it reports shift/reduce conflicts. I tried to fix some, but I can't find a way to improve the grammar further. Any help would be greatly appreciated :D Here's the grammar:
Program: Decl
| Program Decl
;
Decl: VariableDecl
| FunctionDecl
| ClassDecl
| InterfaceDecl
;
VariableDecl: Variable SEMICOLON
;
Variable: Type IDENTIFIER
;
Type: TOKINT
| TOKDOUBLE
| TOKBOOL
| TOKSTRING
| IDENTIFIER
| Type BRACKETS
;
FunctionDecl: Type IDENTIFIER OPARENS Formals CPARENS StmtBlock
| TOKVOID IDENTIFIER OPARENS Formals CPARENS StmtBlock
;
Formals: VariablePlus
| /* epsilon */
;
VariablePlus: Variable
| VariablePlus COMMA Variable
;
ClassDecl: TOKCLASS IDENTIFIER OptExtends OptImplements OBRACE ListaField CBRACE
;
OptExtends: TOKEXTENDS IDENTIFIER
| /* epsilon */
;
OptImplements: TOKIMPLEMENTS ListaIdent
| /* epsilon */
;
ListaIdent: ListaIdent COMMA IDENTIFIER
| IDENTIFIER
;
ListaField: ListaField Field
| /* epsilon */
;
Field: VariableDecl
| FunctionDecl
;
InterfaceDecl: TOKINTERFACE IDENTIFIER OBRACE ListaProto CBRACE
;
ListaProto: ListaProto Prototype
| /* epsilon */
;
Prototype: Type IDENTIFIER OPARENS Formals CPARENS SEMICOLON
| TOKVOID IDENTIFIER OPARENS Formals CPARENS SEMICOLON
;
StmtBlock: OBRACE ListaOptG CBRACE
;
ListaOptG: /* epsilon */
| VariableDecl ListaOptG
| Stmt ListaOptG
;
Stmt: OptExpr SEMICOLON
| IfStmt
| WhileStmt
| ForStmt
| BreakStmt
| ReturnStmt
| PrintStmt
| StmtBlock
;
OptExpr: Expr
| /* epsilon */
;
IfStmt: TOKIF OPARENS Expr CPARENS Stmt OptElse
;
OptElse: TOKELSE Stmt
| /* epsilon */
;
WhileStmt: TOKWHILE OPARENS Expr CPARENS Stmt
;
ForStmt: TOKFOR OPARENS OptExpr SEMICOLON Expr SEMICOLON OptExpr CPARENS Stmt
;
ReturnStmt: TOKRETURN OptExpr SEMICOLON
;
BreakStmt: TOKBREAK SEMICOLON
;
PrintStmt: TOKPRINT OPARENS ListaExprPlus CPARENS SEMICOLON
;
ListaExprPlus: Expr
| ListaExprPlus COMMA Expr
;
Expr: LValue LOCATION Expr
| Constant
| LValue
| TOKTHIS
| Call
| OPARENS Expr CPARENS
| Expr PLUS Expr
| Expr MINUS Expr
| Expr TIMES Expr
| Expr DIVIDED Expr
| Expr MODULO Expr
| MINUS Expr
| Expr LESSTHAN Expr
| Expr LESSEQUALTHAN Expr
| Expr GREATERTHAN Expr
| Expr GREATEREQUALTHAN Expr
| Expr EQUALS Expr
| Expr NOTEQUALS Expr
| Expr AND Expr
| Expr OR Expr
| NOT Expr
| TOKNEW OPARENS IDENTIFIER CPARENS
| TOKNEWARRAY OPARENS Expr COMMA Type CPARENS
| TOKREADINTEGER OPARENS CPARENS
| TOKREADLINE OPARENS CPARENS
| TOKMALLOC OPARENS Expr CPARENS
;
LValue: IDENTIFIER
| Expr PERIOD IDENTIFIER
| Expr OBRACKET Expr CBRACKET
;
Call: IDENTIFIER OPARENS Actuals CPARENS
| Expr PERIOD IDENTIFIER OPARENS Actuals CPARENS
| Expr PERIOD LibCall OPARENS Actuals CPARENS
;
LibCall: TOKGETBYTE OPARENS Expr CPARENS
| TOKSETBYTE OPARENS Expr COMMA Expr CPARENS
;
Actuals: ListaExprPlus
| /* epsilon */
;
Constant: INTCONSTANT
| DOUBLECONSTANT
| BOOLCONSTANT
| STRINGCONSTANT
| TOKNULL
;
The old Bison version on my school's server says you have 241 shift/reduce conflicts. One is the dangling if/else statement. Putting "OptElse" does NOT solve it. You should just write out the IfStmt and an IfElseStmt and then use %nonassoc and %prec options in bison to fix it.
Your expressions are the cause of almost all of the other 240 conflicts. What you need to do is either force precedence rules (messy and a terrible idea) or break your arithmetic expressions into rules like:
AddSubtractExpr: AddSubtractExpr PLUS MultDivExpr | ....
;
MultDivExpr: MultDivExpr TIMES Factor | ....
;
Factor: Variable | LPAREN Expr RPAREN | call | ...
;
Since Bison produces a bottom-up parser, layering the rules like this will give you the correct order of operations. If you have a copy of the first edition of the Dragon Book, you should look at the grammar in Appendix A. I believe the 2nd edition also has similar rules for simple expressions.
Conflicts (shift/reduce or reduce/reduce) mean that your grammar is not LALR(1), so Bison can't handle it directly without help. There are a number of immediately obvious problems:
Expression ambiguity: there's no precedence in the grammar, so things like a + b * c are ambiguous. You can fix this by adding precedence rules, or by splitting the Expr rule into separate AdditiveExpr, MultiplicativeExpr, ConditionalExpr etc. rules.
Dangling else ambiguity: in if (a) if (b) x; else y; the else could be matched with either if. You can either ignore this if the default shift is correct (it usually is for this specific case, but ignoring conflicts is always dangerous) or split the Stmt rule.
There are many books on grammars and parsing that will help with this.

How to write a recursive descent parser from scratch?

As a purely academic exercise, I'm writing a recursive descent parser from scratch -- without using ANTLR or lex/yacc.
I'm writing a simple function which converts math expressions into their equivalent AST. I have the following:
open System.Text.RegularExpressions

// grammar
type expr =
    | Lit of float
    | Add of expr * expr
    | Mul of expr * expr
    | Div of expr * expr
    | Sub of expr * expr

// tokens
type tokens =
    | Num of float
    | LParen | RParen
    | XPlus | XStar | XMinus | XSlash

// split the input into numbers, operators and parentheses
let tokenize (input : string) =
    Regex.Matches(input.Replace(" ", ""), "\d+|[+/*\-()]")
    |> Seq.cast<Match>
    |> Seq.map (fun x -> x.Value)
    |> Seq.map (function
        | "+" -> XPlus
        | "-" -> XMinus
        | "/" -> XSlash
        | "*" -> XStar
        | "(" -> LParen
        | ")" -> RParen
        | num -> Num(float num))
    |> Seq.toList
So, tokenize "10 * (4 + 5) - 1" returns the following token stream:
[Num 10.0; XStar; LParen; Num 4.0; XPlus; Num 5.0; RParen; XMinus; Num 1.0]
At this point, I'd like to map the token stream to its AST with respect to operator precedence:
Sub(
    Mul(
        Lit 10.0
        ,Add(Lit 4.0, Lit 5.0)
    )
    ,Lit 1.0
)
However, I'm drawing a blank. I've never written a parser from scratch, and I don't know even in principle how to begin.
How do I convert a token stream into its representative AST?
Do you know about language grammars?
Assuming yes, you have a grammar with rules along the lines
...
addTerm := mulTerm addOp addTerm
| mulTerm
addOp := XPlus | XMinus
mulTerm := litOrParen mulOp mulTerm
| litOrParen
...
which ends up turning into code like (writing code in browser, never compiled)
let rec AddTerm() =
    let mulTerm = MulTerm() // will parse next mul term (error if fails to parse)
    match TryAddOp() with // peek ahead in token stream to try parse
    | None -> mulTerm // next token was not prefix for addOp rule, stop here
    | Some(ao) -> // did parse an addOp
        let rhsAddTerm = AddTerm() // right-hand side is an addTerm, per the grammar
        match ao with
        | XPlus -> Add(mulTerm, rhsAddTerm)
        | XMinus -> Sub(mulTerm, rhsAddTerm)
        | _ -> failwith "unreachable: TryAddOp only returns XPlus or XMinus"
and TryAddOp() =
    let next = tokens.Peek()
    match next with
    | XPlus | XMinus ->
        tokens.ConsumeNext()
        Some(next)
    | _ -> None
...
Hopefully you see the basic idea. This assumes a global mutable token stream that allows both 'peek at next token' and 'consume next token'.
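For comparison, here is a complete sketch of the same idea in OCaml, under a few stated assumptions. It passes the remaining token list explicitly instead of using a global mutable stream, and the function names (parse_add, add_loop, parse_mul, mul_loop, parse_atom, parse) and the Parse_error exception are hypothetical. Unlike the right-recursive addTerm rule above, it folds operators in a loop so that subtraction and division associate to the left.

```ocaml
type expr =
  | Lit of float
  | Add of expr * expr
  | Mul of expr * expr
  | Div of expr * expr
  | Sub of expr * expr

type token =
  | Num of float
  | LParen | RParen
  | XPlus | XStar | XMinus | XSlash

exception Parse_error of string

(* Each function takes the remaining tokens and returns the parsed
   expression together with the tokens it did not consume. *)
let rec parse_add tokens =
  let lhs, rest = parse_mul tokens in
  add_loop lhs rest

(* Fold further "+ term" / "- term" pairs into the left operand, so that
   10 - 4 - 1 becomes Sub (Sub (10, 4), 1). *)
and add_loop lhs tokens =
  match tokens with
  | XPlus :: rest ->
    let rhs, rest' = parse_mul rest in
    add_loop (Add (lhs, rhs)) rest'
  | XMinus :: rest ->
    let rhs, rest' = parse_mul rest in
    add_loop (Sub (lhs, rhs)) rest'
  | _ -> (lhs, tokens)

and parse_mul tokens =
  let lhs, rest = parse_atom tokens in
  mul_loop lhs rest

(* Same shape as add_loop, one precedence level tighter. *)
and mul_loop lhs tokens =
  match tokens with
  | XStar :: rest ->
    let rhs, rest' = parse_atom rest in
    mul_loop (Mul (lhs, rhs)) rest'
  | XSlash :: rest ->
    let rhs, rest' = parse_atom rest in
    mul_loop (Div (lhs, rhs)) rest'
  | _ -> (lhs, tokens)

(* A literal, or a parenthesised expression that restarts at the top level. *)
and parse_atom tokens =
  match tokens with
  | Num n :: rest -> (Lit n, rest)
  | LParen :: rest ->
    (match parse_add rest with
     | e, RParen :: rest' -> (e, rest')
     | _ -> raise (Parse_error "expected ')'"))
  | _ -> raise (Parse_error "expected a number or '('")

(* Entry point: the whole token list must be consumed. *)
let parse tokens =
  match parse_add tokens with
  | e, [] -> e
  | _ -> raise (Parse_error "unexpected trailing tokens")
```

Applied to the token stream from the question, parse returns Sub (Mul (Lit 10.0, Add (Lit 4.0, Lit 5.0)), Lit 1.0). One function per precedence level is what encodes operator precedence: parse_mul is called from parse_add, so multiplication binds tighter than addition.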
If I remember correctly from my college classes, the idea was to build expression trees from a grammar like:
<program> --> <expression> <op> <expression> | <expression>
<expression> --> (<expression>) | <constant>
<op> --> * | - | + | /
<constant> --> <constant><constant> | [0-9]
Then, once you have constructed your tree completely, you get something like:
      exp
 exp  op  exp
  5   +   and so on
Then you run your completed tree through another routine that recursively descends into the tree, calculating expressions until you have an answer. If the parser cannot build a tree from the input, you have a syntax error. Hope that helps.
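The second step, walking the finished tree to compute a value, can be sketched in a few lines of OCaml. The expr type is the one from the question translated to OCaml syntax; the eval function and the result binding are hypothetical names used only for this illustration.

```ocaml
type expr =
  | Lit of float
  | Add of expr * expr
  | Mul of expr * expr
  | Div of expr * expr
  | Sub of expr * expr

(* Evaluate the children first, then combine their values at each node. *)
let rec eval (e : expr) : float =
  match e with
  | Lit x -> x
  | Add (a, b) -> eval a +. eval b
  | Sub (a, b) -> eval a -. eval b
  | Mul (a, b) -> eval a *. eval b
  | Div (a, b) -> eval a /. eval b

(* The AST for 10 * (4 + 5) - 1 from the question. *)
let result = eval (Sub (Mul (Lit 10.0, Add (Lit 4.0, Lit 5.0)), Lit 1.0))
```

Here result is 89.0. Keeping parsing and evaluation separate means the same tree could also be pretty-printed or simplified without touching the parser.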

Resources