Using ocamllex/ocamlyacc to parse part of a grammar

Using ocamllex/ocamlyacc to parse part of a grammar - parsing

I've been using regexes to go through a pile of Verilog files and pull out certain statements. Currently, regexes are fine for this, however, I'm starting to get to the point where a real parser is going to be needed in order to deal with nested structures so I'm investigating ocamllex/ocamlyacc. I'd like to first duplicate what I've got in my regex implementation and then slowly add more to the grammar.
Right now I'm mainly interested in pulling out module declarations and instantiations. To keep this question a bit more brief, let's look at module declarations only.
In Verilog a module declaration looks like:
module modmame ( ...other statements ) endmodule;
My current regex implementation simply checks that there is a module declared with a particular name ( checking against a list of names that I'm interested in - I don't need to find all module declarations just ones with certain names). So basically, I get each line of the Verilog file I want to parse and do a match like this (pseudo-OCaml with Pythonish and Rubyish elements ):
foreach file in list_of_files:
let found_mods = Hashtbl.create 17;
open file
foreach line in file:
foreach modname in modlist
let mod_patt= Str.regexp ("module"^space^"+"^modname^"\\("^space^"+\\|(\\)") in
try
Str.search_forward (mod_patt) line 0
found_mods[file] = modname; (* map filename to modname *)
with Not_found -> ()
That works great. The module declaration can occur anywhere in the Verilog file; I'm just wanting to find out if the file contains that particular declaration, I don't care about what else may be in that file.
My first attempt at converting this over to ocamllex/ocamlyacc:
verLexer.mll:
rule lex = parse
| [' ' '\n' '\t'] { lex lexbuf }
| ['0'-'9']+ as s { INT(int_of_string s) }
| '(' { LPAREN }
| ')' { RPAREN }
| "module" { MODULE }
| ['A'-'Z''a'-'z''0'-'9''_']+ as s { IDENT(s) }
| _ { lex lexbuf }
| eof
verParser.mly:
%{ type expr = Module of expr | Ident of string | Int of int %}
%token <int> INT
%token <string> IDENT
%token LPAREN RPAREN MODULE EOF
%start expr1
%type <expr> expr1
%%
expr:
| MODULE IDENT LPAREN { Module( Ident $2) };
expr1:
| expr EOF { $1 };
Then trying it out in the REPL:
# #use "verLexer.ml" ;;
# #use "verParser.ml" ;;
# expr1 lex (Lexing.from_string "module foo (" ) ;;
- : expr = Module (Ident "foo")
That's great, it works!
However, a real Verilog file will have more than a module declaration in it:
# expr1 lex (Lexing.from_string "//comment\nmodule foo ( \nstuff" ) ;;
Exception: Failure "lexing: empty token".
I don't really care about what appeared before or after that module definition, is there a way to just extract that part of the grammar to determine that the Verilog files contains the 'module foo (' statement? Yes, I realize that regexes are working fine for this, however, as stated above, I am planning to grow this grammar slowly and add more elements to it and regexes will start to break down.
EDIT: I added a match any char to the lex rule:
| _ { lex lexbuf }
Thinking that it would skip any characters that weren't matched so far, but that didn't seem to work:
# expr1 lex (Lexing.from_string "fof\n module foo (\n" ) ;;
Exception: Parsing.Parse_error.

A first advertisement minute: instead of ocamlyacc you should consider using François Pottier's Menhir, which is like a "yacc, upgraded", better in all aspects (more readable grammars, more powerful constructs, easier to debug...) while still very similar. It can of course be used in combination with ocamllex.
Your expr1 rule only allows to begin and end with a expr rule. You should enlarge it to allow "stuff" before or after expr. Something like:
junk:
| junk LPAREN
| junk RPAREN
| junk INT
| junk IDENT
expr1:
| junk expr junk EOF
Note that this grammar does not allow the module token to appear in the junk section. Doing so would be a bit problematic as it would make the grammar ambiguous (the structure you're looking for could be embedded either in expr or junk). If you could have a module token happening outside the form you're looking form, you should consider changing the lexer to capture the whole module ident ( structure of interest in a single token, so that it can be atomically matched from the grammar. On the long term, however, have finer-grained tokens is probably better.

As suggested by #gasche I tried menhir and am already getting much better results. I changed the verLexer.ml to:
{
open VerParser
}
rule lex = parse
| [' ' '\n' '\t'] { lex lexbuf }
| ['0'-'9']+ as s { INT(int_of_string s) }
| '(' { LPAREN }
| ')' { RPAREN }
| "module" { MODULE }
| ['A'-'Z''a'-'z''0'-'9''_']+ as s { IDENT(s) }
| _ as c { lex lexbuf }
| eof { EOF }
And changed verParser.mly to:
%{ type expr = Module of expr | Ident of string | Int of int
|Lparen | Rparen | Junk %}
%token <int> INT
%token <string> IDENT
%token LPAREN RPAREN MODULE EOF
%start expr1
%type <expr> expr1
%%
expr:
| MODULE IDENT LPAREN { Module( Ident $2) };
junk:
| LPAREN { }
| RPAREN { }
| INT { }
| IDENT { } ;
expr1:
| junk* expr junk* EOF { $2 };
The key here is that menhir allows a rule to be parameterized with a '*' as in the line above where I've got 'junk*' in a rule meaning match junk 0 or more times. ocamlyacc doesn't seem to allow that.
Now when I tried it in the REPL I get:
# #use "verParser.ml" ;;
# #use "verLexer.ml" ;;
# expr1 lex (Lexing.from_string "module foo ( " ) ;;
- : expr = Module (Ident "foo")
# expr1 lex (Lexing.from_string "some module foo ( " ) ;;
- : expr = Module (Ident "foo")
# expr1 lex (Lexing.from_string "some module foo (\nbar " ) ;;
- : expr = Module (Ident "foo")
# expr1 lex (Lexing.from_string "some module foo (\n//comment " ) ;;
- : expr = Module (Ident "foo")
# expr1 lex (Lexing.from_string "some module fot foo (\n//comment " ) ;;
Exception: Error.
# expr1 lex (Lexing.from_string "some module foo (\n//comment " ) ;;
Which seems to work exactly as I want it to.

Related

How does a parser solves shift/reduce conflict?

I have a grammar for arithmetic expression which solves number of expression (one per line) in a text file. While compiling YACC I am getting message 2 shift reduce conflicts. But my calculations are proper. If parser is giving proper output how does it resolves the shift/reduce conflict. And In my case is there any way to solve it in YACC Grammar.
YACC GRAMMAR
Calc : Expr {printf(" = %d\n",$1);}
| Calc Expr {printf(" = %d\n",$2);}
| error {yyerror("\nBad Expression\n ");}
;
Expr : Term { $$ = $1; }
| Expr '+' Term { $$ = $1 + $3; }
| Expr '-' Term { $$ = $1 - $3; }
;
Term : Fact { $$ = $1; }
| Term '*' Fact { $$ = $1 * $3; }
| Term '/' Fact { if($3==0){
yyerror("Divide by Zero Encountered.");
break;}
else
$$ = $1 / $3;
}
;
Fact : Prim { $$ = $1; }
| '-' Prim { $$ = -$2; }
;
Prim : '(' Expr ')' { $$ = $2; }
| Id { $$ = $1; }
;
Id :NUM { $$ = yylval; }
;
What change should I do to remove such conflicts in my grammar ?

Bison/yacc resolves shift-reduce conflicts by choosing to shift. This is explained in the bison manual in the section on Shift-Reduce conflicts.
Your problem is that your input is just a series of Exprs, run together without any delimiter between them. That means that:
4 - 2
could be one expression (4-2) or it could be two expressions (4, -2). Since bison-generated parsers always prefer to shift, the parser will choose to parse it as one expression, even if it were typed on two lines:
4
-2
If you want to allow users to type their expressions like that, without any separator, then you could either live with the conflict (since it is relatively benign) or you could codify it into your grammar, but that's quite a bit more work. To put it into the grammar, you need to define two different types of Expr: one (which is the one you use at the top level) cannot start with an unary minus, and the other one (which you can use anywhere else) is allowed to start with a unary minus.
I suspect that what you really want to do is use newlines or some other kind of expression separator. That's as simple as passing the newline through to your parser and changing Calc to Calc: | Calc '\n' | Calc Expr '\n'.
I'm sure that this appears somewhere else on SO, but I can't find it. So here is how you disallow the use of unary minus at the beginning of an expression, so that you can run expressions together without delimiters. The non-terminals starting n_ cannot start with a unary minus:
input: %empty | input n_expr { /* print $2 */ }
expr: term | expr '+' term | expr '-' term
n_expr: n_term | n_expr '+' term | n_expr '-' term
term: factor | term '*' factor | term '/' factor
n_term: value | n_term '+' factor | n_term '/' factor
factor: value | '-' factor
value: NUM | '(' expr ')'
That parses the same language as your grammar, but without generating the shift-reduce conflict. Since it parses the same language, the input
4
-2
will still be parsed as a single expression; to get the expected result you would need to type
4
(-2)

Parsing and implementing a lambda calculus in Rascal

I am trying to implement a lambda calculus inside of Rascal but am having trouble getting the precedence and parsing to work the way I would like it to. Currently I have a grammar that looks something like:
keyword Keywords= "if" | "then" | "else" | "end" | "fun";
lexical Ident = [a-zA-Z] !>> [a-zA-Z]+ !>> [a-zA-Z0-9] \ Keywords;
lexical Natural = [0-9]+ !>> [0-9];
lexical LAYOUT = [\t-\n\r\ ];
layout LAYOUTLIST = LAYOUT* !>> [\t-\n\r\ ];
start syntax Prog = prog: Exp LAYOUTLIST;
syntax Exp =
var: Ident
| nat: Natural
| bracket "(" Exp ")"
> left app: Exp Exp
> right func: "fun" Ident "-\>" Exp
When I parse a program of the form:
(fun x -> fun y -> x) 1 2
The resulting tree is:
prog(app(
app(
func(
"x",
func(
"y",
var("x")
nat(1),
nat(2))))))
Where really I am looking for something like this (I think):
prog(app(
func(
"x",
app(
func(
"y",
var("x")),
nat(2))),
nat(1)))
I've tried a number of variations of the precedence in the grammar, I've tried wrapping the App rule in parenthesis, and a number of other variations. There seems to be something going on here I don't understand. Any help would be most appreciated. Thanks.

I've used the following grammar, which removes the extra LAYOUTLIST and the dead right, but this should not make a difference. It seems to work as you want when I use the generic implode function :
keyword Keywords= "if" | "then" | "else" | "end" | "fun";
lexical Ident = [a-zA-Z] !>> [a-zA-Z]+ !>> [a-zA-Z0-9] \ Keywords;
lexical Natural = [0-9]+ !>> [0-9];
lexical LAYOUT = [\t-\n\r\ ];
layout LAYOUTLIST = LAYOUT* !>> [\t-\n\r\ ];
start syntax Prog = prog: Exp;
syntax Exp =
var: Ident
| nat: Natural
| bracket "(" Exp ")"
> left app: Exp Exp
> func: "fun" Ident "-\>" Exp
;
Then calling the parser and imploding to an untyped AST (I've removed the location annotations for readability):
rascal>import ParseTree;
ok
rascal>implode(#node, parse(#start[Prog], "(fun x -\> fun y -\> x) 1 2"))
node: "prog"("app"(
"app"(
"func"(
"x",
"func"(
"y",
"var"("x"))),
"nat"("1")),
"nat"("2")))
So, I am guessing you got the grammar right for the shape of tree you want. How do you go from concrete parse tree to abstract AST? Perhaps there is something funny going on there.

Shift/Reduce conflicts in a propositional logic parser in Happy

I'm making a simple propositional logic parser on happy based on this BNF definition of the propositional logic grammar, this is my code
{
module FNC where
import Data.Char
import System.IO
}
-- Parser name, token types and error function name:
--
%name parse Prop
%tokentype { Token }
%error { parseError }
-- Token list:
%token
var { TokenVar $$ } -- alphabetic identifier
or { TokenOr }
and { TokenAnd }
'¬' { TokenNot }
"=>" { TokenImp } -- Implication
"<=>" { TokenDImp } --double implication
'(' { TokenOB } --open bracket
')' { TokenCB } --closing bracket
'.' {TokenEnd}
%left "<=>"
%left "=>"
%left or
%left and
%left '¬'
%left '(' ')'
%%
--Grammar
Prop :: {Sentence}
Prop : Sentence '.' {$1}
Sentence :: {Sentence}
Sentence : AtomSent {Atom $1}
| CompSent {Comp $1}
AtomSent :: {AtomSent}
AtomSent : var { Variable $1 }
CompSent :: {CompSent}
CompSent : '(' Sentence ')' { Bracket $2 }
| Sentence Connective Sentence {Bin $2 $1 $3}
| '¬' Sentence {Not $2}
Connective :: {Connective}
Connective : and {And}
| or {Or}
| "=>" {Imp}
| "<=>" {DImp}
{
--Error function
parseError :: [Token] -> a
parseError _ = error ("parseError: Syntax analysis error.\n")
--Data types to represent the grammar
data Sentence
= Atom AtomSent
| Comp CompSent
deriving Show
data AtomSent = Variable String deriving Show
data CompSent
= Bin Connective Sentence Sentence
| Not Sentence
| Bracket Sentence
deriving Show
data Connective
= And
| Or
| Imp
| DImp
deriving Show
--Data types for the tokens
data Token
= TokenVar String
| TokenOr
| TokenAnd
| TokenNot
| TokenImp
| TokenDImp
| TokenOB
| TokenCB
| TokenEnd
deriving Show
--Lexer
lexer :: String -> [Token]
lexer [] = [] -- cadena vacia
lexer (c:cs) -- cadena es un caracter, c, seguido de caracteres, cs.
| isSpace c = lexer cs
| isAlpha c = lexVar (c:cs)
| isSymbol c = lexSym (c:cs)
| c== '(' = TokenOB : lexer cs
| c== ')' = TokenCB : lexer cs
| c== '¬' = TokenNot : lexer cs --solved
| c== '.' = [TokenEnd]
| otherwise = error "lexer: Token invalido"
lexVar cs =
case span isAlpha cs of
("or",rest) -> TokenOr : lexer rest
("and",rest) -> TokenAnd : lexer rest
(var,rest) -> TokenVar var : lexer rest
lexSym cs =
case span isSymbol cs of
("=>",rest) -> TokenImp : lexer rest
("<=>",rest) -> TokenDImp : lexer rest
}
Now, I have two problems here
For some reason I get 4 shift/reduce conflicts, I don't really know where they might be since I thought the precedence would solve them (and I think I followed the BNF grammar correctly)...
(this is rather a Haskell problem) On my lexer function, for some reason I get parsing errors on the line where I say what to do with '¬', if I remove that line it's works, why could that be? (this issue is solved)
Any help would be great.

If you use happy with -i it will generate an info file. The file lists all the states that your parser has. It will also list all the possible transitions for each state. You can use this information to determine if the shift/reduce conflict is one you care about.
Information about invoking happy and conflicts:
http://www.haskell.org/happy/doc/html/sec-invoking.html
http://www.haskell.org/happy/doc/html/sec-conflict-tips.html
Below is some of the output of -i. I've removed all but State 17. You'll want to get a copy of this file so that you can properly debug the problem. What you see here is just to help talk about it:
-----------------------------------------------------------------------------
Info file generated by Happy Version 1.18.10 from FNC.y
-----------------------------------------------------------------------------
state 17 contains 4 shift/reduce conflicts.
-----------------------------------------------------------------------------
Grammar
-----------------------------------------------------------------------------
%start_parse -> Prop (0)
Prop -> Sentence '.' (1)
Sentence -> AtomSent (2)
Sentence -> CompSent (3)
AtomSent -> var (4)
CompSent -> '(' Sentence ')' (5)
CompSent -> Sentence Connective Sentence (6)
CompSent -> '¬' Sentence (7)
Connective -> and (8)
Connective -> or (9)
Connective -> "=>" (10)
Connective -> "<=>" (11)
-----------------------------------------------------------------------------
Terminals
-----------------------------------------------------------------------------
var { TokenVar $$ }
or { TokenOr }
and { TokenAnd }
'¬' { TokenNot }
"=>" { TokenImp }
"<=>" { TokenDImp }
'(' { TokenOB }
')' { TokenCB }
'.' { TokenEnd }
-----------------------------------------------------------------------------
Non-terminals
-----------------------------------------------------------------------------
%start_parse rule 0
Prop rule 1
Sentence rules 2, 3
AtomSent rule 4
CompSent rules 5, 6, 7
Connective rules 8, 9, 10, 11
-----------------------------------------------------------------------------
States
-----------------------------------------------------------------------------
State 17
CompSent -> Sentence . Connective Sentence (rule 6)
CompSent -> Sentence Connective Sentence . (rule 6)
or shift, and enter state 12
(reduce using rule 6)
and shift, and enter state 13
(reduce using rule 6)
"=>" shift, and enter state 14
(reduce using rule 6)
"<=>" shift, and enter state 15
(reduce using rule 6)
')' reduce using rule 6
'.' reduce using rule 6
Connective goto state 11
-----------------------------------------------------------------------------
Grammar Totals
-----------------------------------------------------------------------------
Number of rules: 12
Number of terminals: 9
Number of non-terminals: 6
Number of states: 19
That output basically says that it runs into a bit of ambiguity when it's looking at connectives. It turns out, the slides you linked mention this (Slide 11), "ambiguities are resolved through precedence ¬∧∨⇒⇔ or parentheses".
At this point, I would recommend looking at the shift/reduce conflicts and your desired precedences to see if the parser you have will do the right thing. If so, then you can safely ignore the warnings. If not, you have more work for yourself.

I can answer No. 2:
| c== '¬' == TokenNot : lexer cs --problem here
-- ^^
You have a == there where you should have a =.

Is it possible to create a very permissive grammar using Menhir?

I'm trying to parse some bits and pieces of Verilog - I'm primarily interested in extracting module definitions and instantiations.
In verilog a module is defined like:
module foo ( ... ) endmodule;
And a module is instantiated in one of two different possible ways:
foo fooinst ( ... );
foo #( ...list of params... ) fooinst ( .... );
At this point I'm only interested in finding the name of the defined or instantiated module; 'foo' in both cases above.
Given this menhir grammar (verParser.mly):
%{
type expr = Module of expr
| ModInst of expr
| Ident of string
| Int of int
| Lparen
| Rparen
| Junk
| ExprList of expr list
%}
%token <string> INT
%token <string> IDENT
%token LPAREN RPAREN MODULE TICK OTHER HASH EOF
%start expr2
%type <expr> mod_expr
%type <expr> expr1
%type <expr list> expr2
%%
mod_expr:
| MODULE IDENT LPAREN { Module ( Ident $2) }
| IDENT IDENT LPAREN { ModInst ( Ident $1) }
| IDENT HASH LPAREN { ModInst ( Ident $1) };
junk:
| LPAREN { }
| RPAREN { }
| HASH { }
| INT { };
expr1:
| junk* mod_expr junk* { $2 } ;
expr2:
| expr1* EOF { $1 };
When I try this out in the menhir interpretter it works fine extracting the module instantion:
MODULE IDENT LPAREN
ACCEPT
[expr2:
[list(expr1):
[expr1:
[list(junk):]
[mod_expr: MODULE IDENT LPAREN]
[list(junk):]
]
[list(expr1):]
]
EOF
]
It works fine for the single module instantiation:
IDENT IDENT LPAREN
ACCEPT
[expr2:
[list(expr1):
[expr1:
[list(junk):]
[mod_expr: IDENT IDENT LPAREN]
[list(junk):]
]
[list(expr1):]
]
EOF
]
But of course, if there is an IDENT that appears prior to any of these it will REJECT:
IDENT MODULE IDENT LPAREN IDENT IDENT LPAREN
REJECT
... and of course there will be identifiers in an actual verilog file prior to these defs.
I'm trying not to have to fully specify a Verilog grammar, instead I want to build the grammar up slowly and incrementally to eventually parse more and more of the language.
If I add IDENT to the junk rule, that fixes the problem above, but then the module instantiation rule doesn't work because now the junk rule is capturing the IDENT.
Is it possible to create a very permissive rule that will bypass stuff I don't want to match, or is it generally required that you must create a complete grammar to actually do something like this?
Is it possible to create a rule that would let me match:
MODULE IDENT LPAREN stuff* RPAREN ENDMODULE
where "stuff*" initially matches everything but RPAREN?
Something like :
stuff:
| !RPAREN { } ;
I've used PEG parsers in the past which would allow constructs like that.

I've decided that PEG is a better fit for a permissive, non-exhaustive grammar. Took a look at peg/leg and was able to very quickly put together a leg grammar that does what I need to do:
start = ( comment | mod_match | char)
line = < (( '\n' '\r'* ) | ( '\r' '\n'* )) > { lines++; chars += yyleng; }
module_decl = module modnm:ident lparen ( !rparen . )* rparen { chars += yyleng; printf("Module decl: <%s>\n",yytext);}
module_inst = modinstname:ident ident lparen { chars += yyleng; printf("Module Inst: <%s>\n",yytext);}
|modinstname:ident hash lparen { chars += yyleng; printf("Module Inst: <%s>\n",yytext);}
mod_match = ( module_decl | module_inst )
module = 'module' ws { modules++; chars +=yyleng; printf("Module: <%s>\n", yytext); }
endmodule = 'endmodule' ws { endmodules++; chars +=yyleng; printf("EndModule: <%s>\n", yytext); }
kwd = (module|endmodule)
ident = !kwd<[a-zA-z][a-zA-Z0-9_]+>- { words++; chars += yyleng; printf("Ident: <%s>\n", yytext); }
char = . { chars++; }
lparen = '(' -
rparen = ')' -
hash = '#'
- = ( space | comment )*
ws = space+
space = ' ' | '\t' | EOL
comment = '//' ( !EOL .)* EOL
| '/*' ( !'*/' .)* '*/'
EOF = !.
EOL = '\r\n' | '\n' | '\r'
Aurochs is possibly also an option, but I have concerns about speed and memory usage of an Aurochs generated parser. peg/leg produce a parser in C which should be quite speedy.

Why does ANTLR not parse the entire input?

I am quite new to ANTLR, so this is likely a simple question.
I have defined a simple grammar which is supposed to include arithmetic expressions with numbers and identifiers (strings that start with a letter and continue with one or more letters or numbers.)
The grammar looks as follows:
grammar while;
#lexer::header {
package ConFreeG;
}
#header {
package ConFreeG;
import ConFreeG.IR.*;
}
#parser::members {
}
arith:
term
| '(' arith ( '-' | '+' | '*' ) arith ')'
;
term returns [AExpr a]:
NUM
{
int n = Integer.parseInt($NUM.text);
a = new Num(n);
}
| IDENT
{
a = new Var($IDENT.text);
}
;
fragment LOWER : ('a'..'z');
fragment UPPER : ('A'..'Z');
fragment NONNULL : ('1'..'9');
fragment NUMBER : ('0' | NONNULL);
IDENT : ( LOWER | UPPER ) ( LOWER | UPPER | NUMBER )*;
NUM : '0' | NONNULL NUMBER*;
fragment NEWLINE:'\r'? '\n';
WHITESPACE : ( ' ' | '\t' | NEWLINE )+ { $channel=HIDDEN; };
I am using ANTLR v3 with the ANTLR IDE Eclipse plugin. When I parse the expression (8 + a45) using the interpreter, only part of the parse tree is generated:
Why does the second term (a45) not get parsed? The same happens if both terms are numbers.

You'll want to create a parser rule that has an EOF (end of file) token in it so that the parser will be forced to go through the entire token stream.
Add this rule to your grammar:
parse
: arith EOF
;
and let the interpreter start at that rule instead of the arith rule:

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart