Managing position information with Alex and Happy - parsing

I'm learning to use Alex and Happy to write a small compiler. I want to maintain line and column information for my AST nodes so that I can give the user meaningful error messages. To illustrate how I plan to do it, I wrote a small example (see the code below), and I'd like to know whether my approach (attaching AlexPosn to the tokens, giving AST nodes a polymorphic attribute field, and accessing both through tkPos and astAttr) is good style, or whether there are better ways to handle position information.
Lexer.x:
{
module Lexer where
}

%wrapper "posn"

$white = [\ \t\n]

tokens :-
  $white+  ;
  [xX]     { \pos s -> MkToken pos X }
  "+"      { \pos s -> MkToken pos Plus }
  "*"      { \pos s -> MkToken pos Times }
  "("      { \pos s -> MkToken pos LParen }
  ")"      { \pos s -> MkToken pos RParen }

{
data Token = MkToken AlexPosn TokenClass
  deriving (Show, Eq)

data TokenClass = X
                | Plus
                | Times
                | LParen
                | RParen
  deriving (Show, Eq)

tkPos :: Token -> (Int, Int)
tkPos (MkToken (AlexPn _ line col) _) = (line, col)
}
Parser.y:
{
module Parser where
import Lexer
}

%name simple
%tokentype { Token }

%token
  '(' { MkToken _ LParen }
  ')' { MkToken _ RParen }
  '+' { MkToken _ Plus }
  '*' { MkToken _ Times }
  x   { MkToken _ X }

%%

Expr : Term '+' Expr   { NAdd $1 $3 (astAttr $1) }
     | Term            { $1 }

Term : Factor '*' Term { NMul $1 $3 (astAttr $1) }
     | Factor          { $1 }

Factor : x             { NX (tkPos $1) }
       | '(' Expr ')'  { $2 }

{
data AST a = NX a
           | NMul (AST a) (AST a) a
           | NAdd (AST a) (AST a) a
  deriving (Show, Eq)

astAttr :: AST a -> a
astAttr (NX a)       = a
astAttr (NMul _ _ a) = a
astAttr (NAdd _ _ a) = a

happyError :: [Token] -> a
happyError _ = error "parse error"
}
Main.hs:
module Main where

import Lexer
import Parser

main :: IO ()
main = do
  s <- getContents
  let toks = alexScanTokens s
  print $ simple toks

I would personally be pretty OK with the style you've described. However, it is very manual, so I wanted to at least offer one alternative that might be easier to manage.
If you look a little further down the Alex documentation on wrappers, you'll notice that the monad and monadUserState wrappers both carry position information. The downside is that the whole lexer is now wrapped in a monad, which complicates the parser slightly. However, because of that monad, the result of a parse is an Alex a, which means you have full access to the line and column information when you create your AST nodes. This mainly removes some of the boilerplate from the lexer and doesn't do much more.
By doing this, you could also carry around the AlexState with your token, but that might be unnecessary.
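For example, here is a minimal sketch of what the lexer above might look like with the monad wrapper (the tok helper and the TEOF constructor are my own illustrative additions; alexMonadScan requires an alexEOF action to call at end of input):

%wrapper "monad"

tokens :-
  $white+  ;
  [xX]     { tok X }
  "+"      { tok Plus }
  "*"      { tok Times }
  "("      { tok LParen }
  ")"      { tok RParen }

{
data Token = MkToken AlexPosn TokenClass
           | TEOF
  deriving (Show, Eq)

-- With the monad wrapper, an action has type
-- AlexInput -> Int -> Alex Token, and the AlexInput tuple
-- already carries the current AlexPosn.
tok :: TokenClass -> AlexInput -> Int -> Alex Token
tok c (pos, _, _, _) _len = return (MkToken pos c)

alexEOF :: Alex Token
alexEOF = return TEOF
}

You would then drive the lexer with alexMonadScan inside the Alex monad (for example, in a loop that stops at TEOF) instead of calling alexScanTokens.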
If you need help actually fixing the parser to handle a monad/monadstate wrapper, I wrote a response on how I managed to get it working here: How to use an Alex monadic lexer with Happy?

Related

What is wrong with my "token type" in Happy?

I am writing a simple arithmetic expression parser in the Haskell platform's Happy. The Happy tutorial (labeled "Documentation") from the Haskell site implements a similar grammar to what I need. The difference is that I want to include floating point numbers in my expressions and I do not need to define variables (i.e. expressions will not contain "let", "in", "=", or "x" or "y").
When I compile my grammar file with Happy it outputs:
unused terminals: 1
happy: no token type given
CallStack (from HasCallStack):
  error, called at src/AbsSyn.lhs:93:24 in main:AbsSyn
I've searched Stack Overflow for questions about "no token type given" and found nothing mentioning this error. I also can't figure out what the "CallStack" trace means. (I'm quite new to Haskell.)
I've defined a helper function to tell whether a token is a float:
isNum :: [Token] -> a -> Bool
isNum x = typeOf (read x) == typeOf 1.1
I've copied the documentation page's grammar file almost exactly, except where I've removed any production rules for variables, "=", or other alphabetic input, and where I've added rules for floating point numbers, i.e.
%token
  int   { TokenInt $$ }
  float { TokenNum $$ }
...
Exp : Expl          { Expl $1 }
Exp : Expl '+' Term { Plus $1 $3 }
...
Factor : int   { Int $1 }
       | float { Float $1 }
...
data Exp
  = Let String Exp Exp
  | Expl Expl
  deriving Show

data Expl
  = Plus Expl Term
  | Minus Expl Term
  | Term Term
  deriving Show
...
data Token
  = TokenInt Int
  | TokenFloat Float
  | TokenNum Float
  | TokenPlus
...
lexer (c:cs)
  | isSpace c = lexer cs
  | isDigit c = lexNum (c:cs)
lexer ('=':cs) = TokenEq : lexer cs
...
lexNum cs = TokenInt (read num) : lexer rest
  where (num, rest) = span isDigit cs

lexFloat cs = TokenFloat (read num) : lexer rest
  where (num, rest) = span isNum cs
That's about it, so far.

How do I generalize a Parser which extracts values from tokens?

I'm creating a parser in Haskell for a simple language. The parser takes a list of tokens, generated by a separate tokenizer, and turns it into a structured tree. While writing this parser, I came across a recurring pattern that I would like to generalize. Here are the relevant definitions:
-- Parser definition
data Parser a b = Parser { runParser :: [a] -> Maybe ([a], b) }

-- Token definition
data Token = TokOp    { getTokOp :: Operator }
           | TokInt   { getTokInt :: Int }
           | TokIdent { getTokIdent :: String }
           | TokAssign
           | TokLParen
           | TokRParen
           deriving (Eq, Show)
The parser also has instances defined for MonadPlus and all of its superclasses. Here are two examples of the recurring pattern I am trying to generalize:
-- Identifier parser
identP :: Parser Token String
identP = Parser $ \input -> case input of
  TokIdent s : rest -> Just (rest, s)
  _                 -> Nothing

-- Integer parser
intP :: Parser Token Int
intP = Parser $ \input -> case input of
  TokInt n : rest -> Just (rest, n)
  _               -> Nothing
As you can see, these two examples are incredibly similar, yet I see no way to generalize them. Ideally I would want some function of type extractToken :: ?? -> Parser Token a, where a is the value contained by the token. I am guessing the solution involves >>=, but I am not experienced enough with monads to figure it out. This may also be impossible; I am not sure.
It seems difficult to avoid at least some boilerplate. One simple approach is to manually define the field selectors to return Maybes:
{-# LANGUAGE LambdaCase #-}

data Token = TokOp Operator
           | TokInt Int
           | TokIdent String
           | TokAssign
           | TokLParen
           | TokRParen
           deriving (Eq, Show)

getTokOp    = \case { TokOp x    -> Just x ; _ -> Nothing }
getTokInt   = \case { TokInt x   -> Just x ; _ -> Nothing }
getTokIdent = \case { TokIdent x -> Just x ; _ -> Nothing }
and then the rest is:
fieldP :: (Token -> Maybe a) -> Parser Token a
fieldP sel = Parser $ \case
  tok:rest -> (,) rest <$> sel tok
  []       -> Nothing

opP    = fieldP getTokOp
identP = fieldP getTokIdent
intP   = fieldP getTokInt
You could derive the getXXX selectors using Template Haskell or generics, though it's likely not worth it.
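For instance, a quick sketch of fieldP in action (the token values here are made up for illustration):

demo1 :: Maybe ([Token], Int)
demo1 = runParser intP [TokInt 42, TokRParen]
-- Just ([TokRParen], 42)

demo2 :: Maybe ([Token], String)
demo2 = runParser identP [TokInt 42]
-- Nothing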

Is there any trick to translating BNF into a Parsec program?

The BNF that matches a function call chain (like x(y)(z)...):

expr = term T

T    = (expr) T
     | EMPTY

term = (expr)
     | VAR
Translating it into a Parsec program looks quite tricky:
term :: Parser Term
term = parens expr <|> var

expr :: Parser Term
expr = do whiteSpace
          e <- term
          maybeAddSuffix e
  where addSuffix e0 = do e1 <- parens expr
                          maybeAddSuffix $ TermApp e0 e1
        maybeAddSuffix e = addSuffix e
                           <|> return e
Could you list the design patterns for translating BNF into a Parsec program?
The simplest thing you could do, if your grammar is sizeable, is to just use the Alex/Happy combo. It is fairly straightforward to use, accepts the BNF format directly (no manual translation needed), and perhaps most importantly, produces blazingly fast parsers/lexers.
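To give a flavor of that, here is a rough sketch (mine, untested) of how the BNF above might look as Happy grammar rules, reusing the Term type defined below and omitting the %token section and module plumbing:

Expr : Term Apps            { foldl App $1 $2 }

Apps : '(' Expr ')' Apps    { $2 : $4 }
     | {- empty -}          { [] }

Term : '(' Expr ')'         { $2 }
     | var                  { Var $1 }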
If you are dead set on doing it with parsec (or you are doing this as a learning exercise), I find it easier in general to do it in two stages; first lexing, then parsing. Parsec will do both!
First write the appropriate types:
{-# LANGUAGE LambdaCase #-}

import Text.Parsec
import Text.Parsec.Combinator
import Text.Parsec.Prim
import Text.Parsec.Pos
import Text.ParserCombinators.Parsec.Char
import Control.Applicative hiding ((<|>))
import Control.Monad

data Term  = App Term Term | Var String   deriving (Show, Eq)
data Token = LParen | RParen | Str String deriving (Show, Eq)

type Lexer  = Parsec [Char]  () -- a lexer accepts a stream of Char
type Parser = Parsec [Token] () -- a parser accepts a stream of Token
Parsing a single token is simple. For simplicity, a variable is 1 or more letters. You can of course change this however you like.
oneToken :: Lexer Token
oneToken = (char '(' >> return LParen) <|>
           (char ')' >> return RParen) <|>
           (Str <$> many1 letter)
Parsing the entire token stream is just parsing a single token many times, possibly separated by whitespace:
lexer :: Lexer [Token]
lexer = spaces >> many1 (oneToken <* spaces)
Note the placement of spaces: this way, white space is accepted at the beginning and end of the string.
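A quick sanity check of the lexer on its own (the output shown is what I would expect, not from the original answer):

>runParser (lexer <* eof) () "" " x ( y ) "
Right [Str "x",LParen,Str "y",RParen]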
Since Parser uses a custom token type, you have to use a custom satisfy function. Fortunately, this is almost identical to the existing satisfy.
satisfy' :: (Token -> Bool) -> Parser Token
satisfy' f = tokenPrim show
                       (\src _ _ -> incSourceColumn src 1)
                       (\x -> if f x then Just x else Nothing)
Then we can write parsers for each of the primitive tokens.
lparen = satisfy' $ \case { LParen -> True ; _ -> False }
rparen = satisfy' $ \case { RParen -> True ; _ -> False }
strTok = (\(Str s) -> s) <$> (satisfy' $ \case { Str {} -> True ; _ -> False })
As you may imagine, parens would be useful for our purposes. It is very straightforward to write.
parens :: Parser a -> Parser a
parens = between lparen rparen
Now the interesting parts.
term, expr, var :: Parser Term
term = parens expr <|> var
var = Var <$> strTok
These two should be fairly obvious to you.
Parsec contains the combinators option and optionMaybe, which are useful when you need to "maybe do something".
expr = do
  e0 <- term
  option e0 (parens expr >>= \e1 -> return (App e0 e1))
The last line means: try to apply the parser given to option; if it fails, return e0 instead.
For testing you can do:
tokAndParse = runParser (lexer <* eof) () "" >=> runParser (expr <* eof) () ""
The eof attached to each parser makes sure that the entire input is consumed; the string cannot be a member of the grammar if there are extra trailing characters. Note that the expr above parses at most one application suffix, so your example x(y)(z) is rejected even though the BNF derives it:
>tokAndParse "x(y)(z)"
Left (line 1, column 5):
unexpected LParen
expecting end of input
But the following is accepted:
>tokAndParse "(x(y))(z)"
Right (App (App (Var "x") (Var "y")) (Var "z"))

Haskell - Happy - "No instance ..." error

I'm trying to get familiar with the Happy parser generator for Haskell. Currently, I have an example from the documentation, but when I compile the program, I get an error.
This is the code:
{
module Main where

import Data.Char
}

%name calc
%tokentype { Token }
%error { parseError }

%token
  let { TokenLet }
  in  { TokenIn }
  int { TokenInt $$ }
  var { TokenVar $$ }
  '=' { TokenEq }
  '+' { TokenPlus }
  '-' { TokenMinus }
  '*' { TokenTimes }
  '/' { TokenDiv }
  '(' { TokenOB }
  ')' { TokenCB }

%%

Exp  : let var '=' Exp in Exp { \p -> $6 (($2, $4 p) : p) }
     | Exp1                   { $1 }

Exp1 : Exp1 '+' Term          { \p -> $1 p + $3 p }
     | Exp1 '-' Term          { \p -> $1 p - $3 p }
     | Term                   { $1 }

Term : Term '*' Factor        { \p -> $1 p * $3 p }
     | Term '/' Factor        { \p -> $1 p `div` $3 p }
     | Factor                 { $1 }

Factor
  : int                       { \p -> $1 }
  | var                       { \p -> case lookup $1 p of
                                        Nothing -> error "no var"
                                        Just i  -> i }
  | '(' Exp ')'               { $2 }

{
parseError :: [Token] -> a
parseError _ = error "Parse error"

data Token
  = TokenLet
  | TokenIn
  | TokenInt Int
  | TokenVar String
  | TokenEq
  | TokenPlus
  | TokenMinus
  | TokenTimes
  | TokenDiv
  | TokenOB
  | TokenCB
  deriving Show

lexer :: String -> [Token]
lexer [] = []
lexer (c:cs)
  | isSpace c = lexer cs
  | isAlpha c = lexVar (c:cs)
  | isDigit c = lexNum (c:cs)
lexer ('=':cs) = TokenEq : lexer cs
lexer ('+':cs) = TokenPlus : lexer cs
lexer ('-':cs) = TokenMinus : lexer cs
lexer ('*':cs) = TokenTimes : lexer cs
lexer ('/':cs) = TokenDiv : lexer cs
lexer ('(':cs) = TokenOB : lexer cs
lexer (')':cs) = TokenCB : lexer cs

lexNum cs = TokenInt (read num) : lexer rest
  where (num, rest) = span isDigit cs

lexVar cs =
  case span isAlpha cs of
    ("let", rest) -> TokenLet : lexer rest
    ("in", rest)  -> TokenIn : lexer rest
    (var, rest)   -> TokenVar var : lexer rest

main = getContents >>= print . calc . lexer
}
I'm getting this error:
[1 of 1] Compiling Main             ( gr.hs, gr.o )

gr.hs:310:24:
    No instance for (Show ([(String, Int)] -> Int))
      arising from a use of `print'
    Possible fix:
      add an instance declaration for (Show ([(String, Int)] -> Int))
    In the first argument of `(.)', namely `print'
    In the second argument of `(>>=)', namely `print . calc . lexer'
    In the expression: getContents >>= print . calc . lexer
Do you know why this happens, and how I can solve it?
If you examine the error message
No instance for (Show ([(String, Int)] -> Int))
  arising from a use of `print'
it's clear that the problem is that you are trying to print a function. And indeed, the value produced by the parser function calc is supposed to be a function which takes a lookup table of variable bindings and gives back a result. See for example the rule for variables:
{ \p -> case lookup $1 p of
          Nothing -> error "no var"
          Just i  -> i }
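Concretely, this means the type Happy infers for the generated parser is roughly this (my reading of the grammar; the [(String, Int)] argument is the environment p of variable bindings):

calc :: [Token] -> [(String, Int)] -> Int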
So in main, we need to pass in a list for the p argument, for example an empty list. (Or you could add some pre-defined global variables if you wanted). I've expanded the point-free code to a do block so it's easier to see what's going on:
main = do
  input <- getContents
  let fn = calc $ lexer input
  print $ fn []  -- or e.g. [("foo", 42)] if you wanted it pre-defined
Now it works:
$ happy Calc.y
$ runghc Calc.hs <<< "let x = 1337 in x * 2"
2674

Shift/Reduce conflicts in a propositional logic parser in Happy

I'm making a simple propositional logic parser in Happy, based on this BNF definition of the propositional logic grammar. This is my code:
{
module FNC where

import Data.Char
import System.IO
}

-- Parser name, token type and error function name:
%name parse Prop
%tokentype { Token }
%error { parseError }

-- Token list:
%token
  var   { TokenVar $$ }  -- alphabetic identifier
  or    { TokenOr }
  and   { TokenAnd }
  '¬'   { TokenNot }
  "=>"  { TokenImp }     -- implication
  "<=>" { TokenDImp }    -- double implication
  '('   { TokenOB }      -- open bracket
  ')'   { TokenCB }      -- closing bracket
  '.'   { TokenEnd }

%left "<=>"
%left "=>"
%left or
%left and
%left '¬'
%left '(' ')'

%%

-- Grammar
Prop :: {Sentence}
Prop : Sentence '.'                     { $1 }

Sentence :: {Sentence}
Sentence : AtomSent                     { Atom $1 }
         | CompSent                     { Comp $1 }

AtomSent :: {AtomSent}
AtomSent : var                          { Variable $1 }

CompSent :: {CompSent}
CompSent : '(' Sentence ')'             { Bracket $2 }
         | Sentence Connective Sentence { Bin $2 $1 $3 }
         | '¬' Sentence                 { Not $2 }

Connective :: {Connective}
Connective : and                        { And }
           | or                         { Or }
           | "=>"                       { Imp }
           | "<=>"                      { DImp }

{
-- Error function
parseError :: [Token] -> a
parseError _ = error "parseError: Syntax analysis error.\n"

-- Data types to represent the grammar
data Sentence
  = Atom AtomSent
  | Comp CompSent
  deriving Show

data AtomSent = Variable String deriving Show

data CompSent
  = Bin Connective Sentence Sentence
  | Not Sentence
  | Bracket Sentence
  deriving Show

data Connective
  = And
  | Or
  | Imp
  | DImp
  deriving Show

-- Data types for the tokens
data Token
  = TokenVar String
  | TokenOr
  | TokenAnd
  | TokenNot
  | TokenImp
  | TokenDImp
  | TokenOB
  | TokenCB
  | TokenEnd
  deriving Show

-- Lexer
lexer :: String -> [Token]
lexer [] = []  -- empty input
lexer (c:cs)   -- the input is a character c followed by characters cs
  | isSpace c  = lexer cs
  | isAlpha c  = lexVar (c:cs)
  | isSymbol c = lexSym (c:cs)
  | c == '('   = TokenOB : lexer cs
  | c == ')'   = TokenCB : lexer cs
  | c == '¬'   = TokenNot : lexer cs  -- solved
  | c == '.'   = [TokenEnd]
  | otherwise  = error "lexer: invalid token"

lexVar cs =
  case span isAlpha cs of
    ("or", rest)  -> TokenOr : lexer rest
    ("and", rest) -> TokenAnd : lexer rest
    (var, rest)   -> TokenVar var : lexer rest

lexSym cs =
  case span isSymbol cs of
    ("=>", rest)  -> TokenImp : lexer rest
    ("<=>", rest) -> TokenDImp : lexer rest
}
Now, I have two problems here:
1. For some reason I get 4 shift/reduce conflicts, and I don't really know where they might be, since I thought the precedence declarations would resolve them (and I think I followed the BNF grammar correctly)...
2. (This is rather a Haskell problem.) In my lexer function, for some reason I get parse errors on the line where I say what to do with '¬'; if I remove that line, it works. Why could that be? (This issue is solved.)
Any help would be great.
If you use happy with -i it will generate an info file. The file lists all the states that your parser has. It will also list all the possible transitions for each state. You can use this information to determine if the shift/reduce conflict is one you care about.
Information about invoking happy and conflicts:
http://www.haskell.org/happy/doc/html/sec-invoking.html
http://www.haskell.org/happy/doc/html/sec-conflict-tips.html
Below is some of the output of -i. I've removed all but State 17. You'll want to get a copy of this file so that you can properly debug the problem. What you see here is just to help talk about it:
-----------------------------------------------------------------------------
Info file generated by Happy Version 1.18.10 from FNC.y
-----------------------------------------------------------------------------

state 17 contains 4 shift/reduce conflicts.

-----------------------------------------------------------------------------
Grammar
-----------------------------------------------------------------------------
    %start_parse -> Prop                      (0)
    Prop -> Sentence '.'                      (1)
    Sentence -> AtomSent                      (2)
    Sentence -> CompSent                      (3)
    AtomSent -> var                           (4)
    CompSent -> '(' Sentence ')'              (5)
    CompSent -> Sentence Connective Sentence  (6)
    CompSent -> '¬' Sentence                  (7)
    Connective -> and                         (8)
    Connective -> or                          (9)
    Connective -> "=>"                        (10)
    Connective -> "<=>"                       (11)

-----------------------------------------------------------------------------
Terminals
-----------------------------------------------------------------------------
    var   { TokenVar $$ }
    or    { TokenOr }
    and   { TokenAnd }
    '¬'   { TokenNot }
    "=>"  { TokenImp }
    "<=>" { TokenDImp }
    '('   { TokenOB }
    ')'   { TokenCB }
    '.'   { TokenEnd }

-----------------------------------------------------------------------------
Non-terminals
-----------------------------------------------------------------------------
    %start_parse  rule  0
    Prop          rule  1
    Sentence      rules 2, 3
    AtomSent      rule  4
    CompSent      rules 5, 6, 7
    Connective    rules 8, 9, 10, 11

-----------------------------------------------------------------------------
States
-----------------------------------------------------------------------------
State 17

    CompSent -> Sentence . Connective Sentence    (rule 6)
    CompSent -> Sentence Connective Sentence .    (rule 6)

    or      shift, and enter state 12
            (reduce using rule 6)
    and     shift, and enter state 13
            (reduce using rule 6)
    "=>"    shift, and enter state 14
            (reduce using rule 6)
    "<=>"   shift, and enter state 15
            (reduce using rule 6)
    ')'     reduce using rule 6
    '.'     reduce using rule 6

    Connective  goto state 11

-----------------------------------------------------------------------------
Grammar Totals
-----------------------------------------------------------------------------
Number of rules: 12
Number of terminals: 9
Number of non-terminals: 6
Number of states: 19
That output basically says that it runs into a bit of ambiguity when it's looking at connectives. It turns out the slides you linked mention this (slide 11): "ambiguities are resolved through precedence ¬∧∨⇒⇔ or parentheses".
At this point, I would recommend looking at the shift/reduce conflicts and your desired precedences to see if the parser you have will do the right thing. If so, then you can safely ignore the warnings. If not, you have more work for yourself.
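In this particular grammar, the %left declarations can never fire for the binary rule, because Happy takes a rule's precedence from the last terminal in it, and CompSent -> Sentence Connective Sentence contains no terminals at all. One common rework (my sketch, not part of the original answer) is to inline the connectives so that each binary rule ends in a token the precedence declarations can act on:

CompSent : '(' Sentence ')'        { Bracket $2 }
         | Sentence and Sentence   { Bin And $1 $3 }
         | Sentence or Sentence    { Bin Or $1 $3 }
         | Sentence "=>" Sentence  { Bin Imp $1 $3 }
         | Sentence "<=>" Sentence { Bin DImp $1 $3 }
         | '¬' Sentence            { Not $2 }

With that change, the Connective nonterminal (and the four conflicts around it) disappear, at the cost of one rule per operator.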
I can answer No. 2:
| c== '¬' == TokenNot : lexer cs --problem here
--        ^^
You have a == there where you should have a =.
