I would like to implement the following grammar in OCaml using the Menhir parser generator.
There are four different statements that come one after another; however, up to three of them can be missing. So any program contains at least one of these statements, but it can contain more, appearing in this specific order.
Here is the grammar:
main = A (B) (C) (D)
| (A) B (C) (D)
| (A) (B) C (D)
| (A) (B) (C) D
Is it possible to express it in a more concise representation?
Here is an example of parser.mly for this grammar:
%token <char> ACHAR BCHAR CCHAR DCHAR
%token EOF
%start <char option list> main
%type <char> a b c d
%%
main:
a option(b) option(c) option(d) { [Some($1); $2; $3; $4] }
| option(a) b option(c) option(d) { [$1; Some($2); $3; $4] }
| option(a) option(b) c option(d) { [$1; $2; Some($3); $4] }
| option(a) option(b) option(c) d { [$1; $2; $3; Some($4)] }
| EOF { [] }
a:
ACHAR { $1 } (* returns 'A' *)
b:
BCHAR { $1 } (* returns 'B' *)
c:
CCHAR { $1 } (* returns 'C' *)
d:
DCHAR { $1 } (* returns 'D' *)
For this case menhir produces warnings:
Warning: production option(a) -> a is never reduced.
Warning: production option(d) -> d is never reduced.
and cases such as A B C D, A, A C, and B D are not matched. How can I improve the grammar/parser implementation to fix this?
Try this:
main:
a option(b) option(c) option(d) { [Some($1); $2; $3; $4] }
| b option(c) option(d) { [None; Some($1); $2; $3] }
| c option(d) { [None; None; Some($1); $2] }
| d { [None; None; None; Some($1)] }
I removed the last alternative, which matches the empty sequence, because it contradicts your requirement that at least one of a, b, c or d be present. If you are prepared to accept empty input, you could just use
main:
option(a) option(b) option(c) option(d) { [$1; $2; $3; $4] }
although you might want to adjust the action to return [] in the case where all four options are None.
You can write a? instead of option(a).
Also, since you always return exactly four elements, a tuple would be a better fit than a list.
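Combining the two suggestions above, a sketch of the revised rule (note the %start type then becomes the tuple char option * char option * char option * char option rather than a list):

```
main:
  a b? c? d? { (Some $1, $2, $3, $4) }
| b c? d?    { (None, Some $1, $2, $3) }
| c d?       { (None, None, Some $1, $2) }
| d          { (None, None, None, Some $1) }
```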
What I would like to do
I would like to correctly parse minus floating-point numbers.
How should I fix my code?
What is not working
When I try to interpret - 5 as -5.000000, it shows me this error.
Fatal error: exception Stdlib.Parsing.Parse_error
1c1
< error: parse error at char=0, near token '-'
---
> - 5 = -5.000000
My source code
calc_ast.ml
(* abstract syntax tree *)
type expr =
Num of float
| Plus of expr * expr
| Times of expr * expr
| Div of expr * expr
| Minus of expr * expr
;;
calc_lex.ml
{
open Calc_parse
;;
}
rule lex = parse
| [' ' '\t' '\n' ] { lex lexbuf }
| '-'? ['0'-'9']+ as s { NUM(float_of_string s) }
| '-'? ['0'-'9']+ ('.' ['0'-'9']*)? as s { NUM(float_of_string s) }
| '+' { PLUS }
| '-' { MINUS }
| '*' { TIMES }
| '/' { DIV }
| '(' { LPAREN }
| ')' { RPAREN }
| eof { EOF }
calc_parse.mly
%{
%}
%token <float> NUM
%token PLUS TIMES EOF MINUS DIV LPAREN RPAREN
%start program
%type <Calc_ast.expr> program
%%
program :
| compound_expr EOF { $1 }
compound_expr :
| expr { $1 }
| LPAREN expr RPAREN { $2 }
expr :
| mul { $1 }
| expr PLUS mul { Calc_ast.Plus($1, $3) }
| expr MINUS mul { Calc_ast.Minus($1, $3) }
mul :
| NUM { Calc_ast.Num $1 }
| mul TIMES NUM { Calc_ast.Times($1, Calc_ast.Num $3) }
| mul DIV NUM { Calc_ast.Div($1, Calc_ast.Num $3) }
%%
calc.ml
open Calc_parse
(* token -> string *)
let string_of_token t =
match t with
NUM(s) -> Printf.sprintf "NUM(%f)" s
| PLUS -> "PLUS"
| TIMES -> "TIMES"
| MINUS -> "MINUS"
| DIV -> "DIV"
| LPAREN -> "LPAREN"
| RPAREN -> "RPAREN"
| EOF -> "EOF"
;;
(* print token t and return it *)
let print_token t =
Printf.printf "%s\n" (string_of_token t);
t
;;
(* apply lexer to string s *)
let lex_string s =
let rec loop b =
match print_token (Calc_lex.lex b) with
EOF -> ()
| _ -> loop b
in
loop (Lexing.from_string s)
;;
(* apply parser to string s;
show some info when a parse error happens *)
let parse_string s =
let b = Lexing.from_string s in
try
program Calc_lex.lex b (* main work *)
with Parsing.Parse_error as exn ->
(* handle parse error *)
let c0 = Lexing.lexeme_start b in
let c1 = Lexing.lexeme_end b in
Printf.fprintf stdout
"error: parse error at char=%d, near token '%s'\n"
c0 (String.sub s c0 (c1 - c0));
raise exn
;;
(* evaluate expression (AST tree) *)
let rec eval_expr e =
match e with
Calc_ast.Num(c) -> c
| Calc_ast.Plus(e0, e1)
-> (eval_expr e0) +. (eval_expr e1)
| Calc_ast.Minus(e0, e1)
-> (eval_expr e0) -. (eval_expr e1)
| Calc_ast.Times(e0, e1)
-> (eval_expr e0) *. (eval_expr e1)
| Calc_ast.Div(e0, e1)
-> (eval_expr e0) /. (eval_expr e1)
;;
(* evaluate string *)
let eval_string s =
let e = parse_string s in
eval_expr e
;;
(* evaluate string and print it *)
let eval_print_string s =
let y = eval_string s in
Printf.printf "%s = %f\n" s y
;;
let eval_print_stdin () =
let ch = stdin in
let s = input_line ch in
eval_print_string (String.trim s)
;;
let main argv =
eval_print_stdin ()
;;
if not !Sys.interactive then
main Sys.argv
;;
As indicated in the comments, it's almost never a good idea for the lexical analyser to try to recognise the - as part of a numeric literal:
Since the lexical token must be a contiguous string, - 5 will not match. Instead, you'll get two tokens. So you need to handle that in the parser anyway.
On the other hand, if you don't put a space after the -, then 3-4 will be analysed as the two tokens 3 and -4, which is also going to lead to a syntax error.
A simple solution is to add term to recognise the unary negation operator:
mul :
| term { $1 }
| mul TIMES term { Calc_ast.Times($1, $3) }
| mul DIV term { Calc_ast.Div($1, $3) }
term :
| NUM { Calc_ast.Num $1 }
| MINUS term { Calc_ast.Minus(Calc_ast.Num 0., $2) }
| LPAREN expr RPAREN { $2 }
In the above, I also moved the handling of parentheses from the bottom to the top of the hierarchy, in order to make 4*(5+3) possible. With that change, you will no longer require compound_expr.
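For this to work, the lexer should also stop trying to absorb the sign into the literal; a minimal sketch of the revised number rule in calc_lex.ml (replacing the two '-'? rules, everything else as posted):

```
rule lex = parse
| [' ' '\t' '\n' ] { lex lexbuf }
| ['0'-'9']+ ('.' ['0'-'9']*)? as s { NUM(float_of_string s) }
(* remaining rules unchanged: PLUS, MINUS, TIMES, DIV, LPAREN, RPAREN, EOF *)
```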
I'm trying to get familiar with Happy parser generator for Haskell. Currently, I have an example from the documentation but when I compile the program, I get an error.
This is the code:
{
module Main where
import Data.Char
}
%name calc
%tokentype { Token }
%error { parseError }
%token
let { TokenLet }
in { TokenIn }
int { TokenInt $$ }
var { TokenVar $$ }
'=' { TokenEq }
'+' { TokenPlus }
'-' { TokenMinus }
'*' { TokenTimes }
'/' { TokenDiv }
'(' { TokenOB }
')' { TokenCB }
%%
Exp : let var '=' Exp in Exp { \p -> $6 (($2,$4 p):p) }
| Exp1 { $1 }
Exp1 : Exp1 '+' Term { \p -> $1 p + $3 p }
| Exp1 '-' Term { \p -> $1 p - $3 p }
| Term { $1 }
Term : Term '*' Factor { \p -> $1 p * $3 p }
| Term '/' Factor { \p -> $1 p `div` $3 p }
| Factor { $1 }
Factor
: int { \p -> $1 }
| var { \p -> case lookup $1 p of
Nothing -> error "no var"
Just i -> i }
| '(' Exp ')' { $2 }
{
parseError :: [Token] -> a
parseError _ = error "Parse error"
data Token
= TokenLet
| TokenIn
| TokenInt Int
| TokenVar String
| TokenEq
| TokenPlus
| TokenMinus
| TokenTimes
| TokenDiv
| TokenOB
| TokenCB
deriving Show
lexer :: String -> [Token]
lexer [] = []
lexer (c:cs)
| isSpace c = lexer cs
| isAlpha c = lexVar (c:cs)
| isDigit c = lexNum (c:cs)
lexer ('=':cs) = TokenEq : lexer cs
lexer ('+':cs) = TokenPlus : lexer cs
lexer ('-':cs) = TokenMinus : lexer cs
lexer ('*':cs) = TokenTimes : lexer cs
lexer ('/':cs) = TokenDiv : lexer cs
lexer ('(':cs) = TokenOB : lexer cs
lexer (')':cs) = TokenCB : lexer cs
lexNum cs = TokenInt (read num) : lexer rest
where (num,rest) = span isDigit cs
lexVar cs =
case span isAlpha cs of
("let",rest) -> TokenLet : lexer rest
("in",rest) -> TokenIn : lexer rest
(var,rest) -> TokenVar var : lexer rest
main = getContents >>= print . calc . lexer
}
I'm getting this error:
[1 of 1] Compiling Main ( gr.hs, gr.o )
gr.hs:310:24:
No instance for (Show ([(String, Int)] -> Int))
arising from a use of `print'
Possible fix:
add an instance declaration for (Show ([(String, Int)] -> Int))
In the first argument of `(.)', namely `print'
In the second argument of `(>>=)', namely `print . calc . lexer'
In the expression: getContents >>= print . calc . lexer
Do you know why this happens and how I can solve it?
If you examine the error message
No instance for (Show ([(String, Int)] -> Int))
arising from a use of `print'
it's clear that the problem is that you are trying to print a function. And indeed, the value produced by the parser function calc is supposed to be a function which takes a lookup table of variable bindings and gives back a result. See for example the rule for variables:
{ \p -> case lookup $1 p of
Nothing -> error "no var"
Just i -> i }
So in main, we need to pass in a list for the p argument, for example an empty list. (Or you could add some pre-defined global variables if you wanted). I've expanded the point-free code to a do block so it's easier to see what's going on:
main = do
input <- getContents
let fn = calc $ lexer input
print $ fn [] -- or e.g. [("foo", 42)] if you wanted it pre-defined
Now it works:
$ happy Calc.y
$ runghc Calc.hs <<< "let x = 1337 in x * 2"
2674
I'm making a simple propositional logic parser in Happy, based on this BNF definition of the propositional logic grammar. This is my code:
{
module FNC where
import Data.Char
import System.IO
}
-- Parser name, token types and error function name:
--
%name parse Prop
%tokentype { Token }
%error { parseError }
-- Token list:
%token
var { TokenVar $$ } -- alphabetic identifier
or { TokenOr }
and { TokenAnd }
'¬' { TokenNot }
"=>" { TokenImp } -- Implication
"<=>" { TokenDImp } --double implication
'(' { TokenOB } --open bracket
')' { TokenCB } --closing bracket
'.' {TokenEnd}
%left "<=>"
%left "=>"
%left or
%left and
%left '¬'
%left '(' ')'
%%
--Grammar
Prop :: {Sentence}
Prop : Sentence '.' {$1}
Sentence :: {Sentence}
Sentence : AtomSent {Atom $1}
| CompSent {Comp $1}
AtomSent :: {AtomSent}
AtomSent : var { Variable $1 }
CompSent :: {CompSent}
CompSent : '(' Sentence ')' { Bracket $2 }
| Sentence Connective Sentence {Bin $2 $1 $3}
| '¬' Sentence {Not $2}
Connective :: {Connective}
Connective : and {And}
| or {Or}
| "=>" {Imp}
| "<=>" {DImp}
{
--Error function
parseError :: [Token] -> a
parseError _ = error ("parseError: Syntax analysis error.\n")
--Data types to represent the grammar
data Sentence
= Atom AtomSent
| Comp CompSent
deriving Show
data AtomSent = Variable String deriving Show
data CompSent
= Bin Connective Sentence Sentence
| Not Sentence
| Bracket Sentence
deriving Show
data Connective
= And
| Or
| Imp
| DImp
deriving Show
--Data types for the tokens
data Token
= TokenVar String
| TokenOr
| TokenAnd
| TokenNot
| TokenImp
| TokenDImp
| TokenOB
| TokenCB
| TokenEnd
deriving Show
--Lexer
lexer :: String -> [Token]
lexer [] = [] -- empty string
lexer (c:cs) -- the string is a character c followed by the remaining characters cs
| isSpace c = lexer cs
| isAlpha c = lexVar (c:cs)
| isSymbol c = lexSym (c:cs)
| c== '(' = TokenOB : lexer cs
| c== ')' = TokenCB : lexer cs
| c== '¬' = TokenNot : lexer cs --solved
| c== '.' = [TokenEnd]
| otherwise = error "lexer: invalid token"
lexVar cs =
case span isAlpha cs of
("or",rest) -> TokenOr : lexer rest
("and",rest) -> TokenAnd : lexer rest
(var,rest) -> TokenVar var : lexer rest
lexSym cs =
case span isSymbol cs of
("=>",rest) -> TokenImp : lexer rest
("<=>",rest) -> TokenDImp : lexer rest
}
Now, I have two problems here:
For some reason I get 4 shift/reduce conflicts; I don't really know where they might be, since I thought the precedence declarations would solve them (and I think I followed the BNF grammar correctly)...
(This is rather a Haskell problem.) In my lexer function, for some reason I got parse errors on the line that handles '¬'; if I removed that line, it worked. Why could that be? (This issue is now solved.)
Any help would be great.
If you use happy with -i it will generate an info file. The file lists all the states that your parser has. It will also list all the possible transitions for each state. You can use this information to determine if the shift/reduce conflict is one you care about.
Information about invoking happy and conflicts:
http://www.haskell.org/happy/doc/html/sec-invoking.html
http://www.haskell.org/happy/doc/html/sec-conflict-tips.html
Below is some of the output of -i. I've removed all but State 17. You'll want to get a copy of this file so that you can properly debug the problem. What you see here is just to help talk about it:
-----------------------------------------------------------------------------
Info file generated by Happy Version 1.18.10 from FNC.y
-----------------------------------------------------------------------------
state 17 contains 4 shift/reduce conflicts.
-----------------------------------------------------------------------------
Grammar
-----------------------------------------------------------------------------
%start_parse -> Prop (0)
Prop -> Sentence '.' (1)
Sentence -> AtomSent (2)
Sentence -> CompSent (3)
AtomSent -> var (4)
CompSent -> '(' Sentence ')' (5)
CompSent -> Sentence Connective Sentence (6)
CompSent -> '¬' Sentence (7)
Connective -> and (8)
Connective -> or (9)
Connective -> "=>" (10)
Connective -> "<=>" (11)
-----------------------------------------------------------------------------
Terminals
-----------------------------------------------------------------------------
var { TokenVar $$ }
or { TokenOr }
and { TokenAnd }
'¬' { TokenNot }
"=>" { TokenImp }
"<=>" { TokenDImp }
'(' { TokenOB }
')' { TokenCB }
'.' { TokenEnd }
-----------------------------------------------------------------------------
Non-terminals
-----------------------------------------------------------------------------
%start_parse rule 0
Prop rule 1
Sentence rules 2, 3
AtomSent rule 4
CompSent rules 5, 6, 7
Connective rules 8, 9, 10, 11
-----------------------------------------------------------------------------
States
-----------------------------------------------------------------------------
State 17
CompSent -> Sentence . Connective Sentence (rule 6)
CompSent -> Sentence Connective Sentence . (rule 6)
or shift, and enter state 12
(reduce using rule 6)
and shift, and enter state 13
(reduce using rule 6)
"=>" shift, and enter state 14
(reduce using rule 6)
"<=>" shift, and enter state 15
(reduce using rule 6)
')' reduce using rule 6
'.' reduce using rule 6
Connective goto state 11
-----------------------------------------------------------------------------
Grammar Totals
-----------------------------------------------------------------------------
Number of rules: 12
Number of terminals: 9
Number of non-terminals: 6
Number of states: 19
That output basically says that it runs into a bit of ambiguity when it's looking at connectives. It turns out, the slides you linked mention this (Slide 11), "ambiguities are resolved through precedence ¬∧∨⇒⇔ or parentheses".
At this point, I would recommend looking at the shift/reduce conflicts and your desired precedences to see if the parser you have will do the right thing. If so, then you can safely ignore the warnings. If not, you have more work for yourself.
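If you decide the conflicts do matter, one way to let the declared precedences take effect (a sketch, not the only possible fix) is to inline the connectives: in Happy a rule only inherits a precedence from its last terminal, and Sentence Connective Sentence contains no terminal at all.

```
CompSent : '(' Sentence ')'        { Bracket $2 }
         | Sentence and Sentence   { Bin And $1 $3 }
         | Sentence or Sentence    { Bin Or $1 $3 }
         | Sentence "=>" Sentence  { Bin Imp $1 $3 }
         | Sentence "<=>" Sentence { Bin DImp $1 $3 }
         | '¬' Sentence            { Not $2 }
```

The Connective nonterminal and its rules then become unnecessary, since the actions use the And/Or/Imp/DImp constructors directly.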
I can answer No. 2:
| c== '¬' == TokenNot : lexer cs --problem here
-- ^^
You have a == there where you should have a =.
I am trying to make a frontend for a kind of program... there are two particularities:
1) When we meet a string beginning with =, I want to read the rest of the string as a formula instead of a string value. For instance, "123", "TRUE", and "TRUE+123" are considered to have type string, while "=123", "=TRUE", and "=TRUE+123" are considered to have type Syntax.formula. By the way,
(* in syntax.ml *)
and expression =
| E_formula of formula
| E_string of string
...
and formula =
| F_int of int
| F_bool of bool
| F_Plus of formula * formula
| F_RC of rc
and rc =
| RC of int * int
2) Inside a formula, some strings are interpreted differently than outside one. For instance, in the command R4C5 := 4, R4C5, which is actually a variable, is considered an identifier, while in "=123+R4C5", which is translated to a formula, R4C5 is translated as RC (4, 5) : rc.
So I don't know how to realize this with one or two lexers, and one or two parsers.
At the moment, I am trying to do everything in one lexer and one parser. Here is part of the code, which doesn't work; it still considers R4C5 as an identifier instead of an rc:
(* in lexer.mll *)
let begin_formula = double_quote "="
let end_formula = double_quote
let STRING = double_quote ([^ "=" ])* double_quote
rule token = parse
...
| begin_formula { BEGIN_FORMULA }
| 'R' { R }
| 'C' { C }
| end_formula { END_FORMULA }
| lex_identifier as li
{ try Hashtbl.find keyword_table (lowercase li)
with Not_found -> IDENTIFIER li }
| STRING as s { STRING s }
...
(* in parser.mly *)
expression:
| BEGIN_FORMULA f = formula END_FORMULA { E_formula f }
| s = STRING { E_string s }
...
formula:
| i = INTEGER { F_int i }
| b = BOOL { F_bool b }
| f0 = formula PLUS f1 = formula { F_Plus (f0, f1) }
| rc { F_RC $1 }
rc:
| R i0 = INTEGER C i1 = INTEGER { RC (i0, i1) }
Could anyone help?
New idea: I am thinking of sticking with one lexer and one parser, and creating an entrypoint for formulas in the lexer, as one normally does for comments... here are some updates to lexer.mll and parser.mly:
(* in lexer.mll *)
rule token = parse
...
| begin_formula { formula lexbuf }
...
| INTEGER as i { INTEGER (int_of_string i) }
| '+' { PLUS }
...
and formula = parse
| end_formula { token lexbuf }
| INTEGER as i { INTEGER_F (int_of_string i) }
| 'R' { R }
| 'C' { C }
| '+' { PLUS_F }
| _ { raise (Lexing_error ("unknown in formula")) }
(* in parser.mly *)
expression:
| f = formula { E_formula f }
...
formula:
| i = INTEGER_F { F_int i }
| f0 = formula PLUS_F f1 = formula { F_Plus (f0, f1) }
...
I have done some tests, for instance parsing "=R4". The problem is that it parses R well, but it considers 4 as INTEGER instead of INTEGER_F; it seems that formula lexbuf needs to be called again from time to time in the body of the formula entrypoint (though I don't understand why parsing in the body of the token entrypoint works without always mentioning token lexbuf). I have tried several possibilities: | 'R' { R; formula lexbuf }, | 'R' { formula lexbuf; R }, etc., but they didn't work... Could anyone help?
I think the simplest choice would be to have two different lexers and two different parsers; call the lexer and parser for formulas from inside the global parser. Afterwards you can see how much is shared between the two grammars, and factor things out where possible.
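A sketch of that idea, assuming a second lexer.mll/parser.mly pair compiled to hypothetical modules Formula_lexer and Formula_parser (with a %start main of type Syntax.formula), and assuming the STRING token carries the unquoted contents:

```
(* in the global parser.mly: delegate strings starting with '=' *)
expression:
| s = STRING
    { if String.length s > 0 && s.[0] = '=' then
        let body = String.sub s 1 (String.length s - 1) in
        E_formula (Formula_parser.main Formula_lexer.token (Lexing.from_string body))
      else
        E_string s }
```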
So I am trying to implement a pretty simple grammar for one-line statements:
# Grammar
c : Character c [a-z0-9-]
(v) : Vowel (= [a,e,u,i,o])
(c) : Consonant
(?) : Any character (incl. number)
(l) : Any alpha char (= [a-z])
(n) : Any integer (= [0-9])
(c1-c2) : Range from char c1 to char c2
(c1,c2,c3) : List including chars c1, c2 and c3
Examples:
h(v)(c)no(l)(l)jj-k(n)
h(v)(c)no(l)(l)(a)(a)(n)
h(e-g)allo
h(e,f,g)allo
h(x,y,z)uul
h(x,y,z)(x,y,z)(x,y,z)(x,y,z)uul
I am using the Happy parser generator (http://www.haskell.org/happy/) but for some reason there seems to be some ambiguity problem.
The error message is: "shift/reduce conflicts: 1"
I think the ambiguity is with these two lines:
| lBracket char rBracket { (\c -> case c of
'v' -> TVowel
'c' -> TConsonant
'l' -> TLetter
'n' -> TNumber) $2 }
| lBracket char hyphen char rBracket { TRange $2 $4 }
An example case is: "(a)" vs "(a-z)"
The lexer would give the following for the two cases:
(a) : [CLBracket, CChar 'a', CRBracket]
(a-z) : [CLBracket, CChar 'a', CHyphen, CChar 'z', CRBracket]
What I don't understand is how this can be ambiguous with an LL[2] parser.
In case it helps here is the entire Happy grammar definition:
{
module XHappyParser where
import Data.Char
import Prelude hiding (lex)
import XLexer
import XString
}
%name parse
%tokentype { Character }
%error { parseError }
%token
lBracket { CLBracket }
rBracket { CRBracket }
hyphen { CHyphen }
question { CQuestion }
comma { CComma }
char { CChar $$ }
%%
xstring : tokens { XString (reverse $1) }
tokens : token { [$1] }
| tokens token { $2 : $1 }
token : char { TLiteral $1 }
| hyphen { TLiteral '-' }
| lBracket char rBracket { (\c -> case c of
'v' -> TVowel
'c' -> TConsonant
'l' -> TLetter
'n' -> TNumber) $2 }
| lBracket question rBracket { TAny }
| lBracket char hyphen char rBracket { TRange $2 $4 }
| lBracket listitems rBracket { TList $2 }
listitems : char { [$1] }
| listitems comma char { $1 ++ [$3] }
{
parseError :: [Character] -> a
parseError _ = error "parse error"
}
Thank you!
Here's the ambiguity:
token : [...]
| lBracket char rBracket
| [...]
| lBracket listitems rBracket
listitems : char
| [...]
Your parser could accept (v) both as XString [TVowel] and as XString [TList ['v']], not to mention the missing characters in that case expression.
One possible way of solving it would be to modify your grammar so lists are at least two items, or have some different notation for vowels, consonants, etc.
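A sketch of the first option: make listitems require at least two elements, so (v) can only be parsed as the single-character class form:

```
listitems : char comma char      { [$1, $3] }
          | listitems comma char { $1 ++ [$3] }
```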
The problem seems to be:
| lBracket char rBracket
...
| lBracket listitems rBracket
or, in cleaner syntax, (c) can be a TVowel, TConsonant, TLetter, TNumber (as you know) or a singleton TList.
As the Happy manual says, shift/reduce conflicts usually aren't an issue. You can use precedence to force the behavior/remove the warning if you'd like.