Mixing an identifier parser with a whitespace parser

The parsing library has the function identifier and the function whitespace:
identifier :: Parser Char String
whitespace :: Parser Char ()
The identifier function parses an identifier, but it does not discard the whitespace that follows it in the input text. That is what the whitespace function is for: it consumes all whitespace from the input.
I was wondering if I could combine these two functions into one, like:
identifier' :: Parser Char String
However, I'm new to Haskell and don't know exactly how to do that. Is there a way?

You can use (<*) :: Applicative f => f a -> f b -> f a, which here runs the first parser, then the second, and returns the result of the first, so:
identifier' :: Parser Char String
identifier' = identifier <* whitespace
This is equivalent to:
identifier' :: Parser Char String
identifier' = do
  idf <- identifier
  whitespace
  return idf
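If more token parsers need the same treatment, the trick generalizes to a small combinator. Here is a minimal sketch, assuming the Parser type from the question (the name lexeme is only a convention, not necessarily something the library exports):
-- hypothetical helper: run p, then discard the whitespace that follows it
lexeme :: Parser Char a -> Parser Char a
lexeme p = p <* whitespace

identifier' :: Parser Char String
identifier' = lexeme identifier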

Related

Using makeExprParser with ambiguity

I'm currently encountering a problem while translating a parser from a CFG-based tool (antlr) to Megaparsec.
The grammar contains lists of expressions (handled with makeExprParser) that are enclosed in brackets (<, >) and separated by ,.
Stuff like <>, <23>, <23,87> etc.
The problem now is that the expressions may themselves contain the > operator (meaning "greater than"), which causes my parser to fail.
<1223>234> should, for example, be parsed into [BinaryExpression ">" (IntExpr 1223) (IntExpr 234)].
I presume that I have to place a try somewhere strategically, but the places I tried (the first argument of sepBy and the first argument of makeExprParser) unfortunately did not work.
Can I use makeExprParser in such a situation, or do I have to write the expression parser by hand?
This is the relevant part of my parser:
-- uses megaparsec, text, and parser-combinators
{-# LANGUAGE OverloadedStrings #-}
module Main where
import Control.Monad.Combinators.Expr
import Data.Text
import Data.Void
import System.Environment
import Text.Megaparsec
import Text.Megaparsec.Char
import qualified Text.Megaparsec.Char.Lexer as L
type BinaryOperator = Text
type Name = Text
data Expr
  = IntExpr Integer
  | BinaryExpression BinaryOperator Expr Expr
  deriving (Eq, Show)
type Parser = Parsec Void Text
lexeme :: Parser a -> Parser a
lexeme = L.lexeme sc
symbol :: Text -> Parser Text
symbol = L.symbol sc
sc :: Parser ()
sc = L.space space1 (L.skipLineComment "//") (L.skipBlockCommentNested "/*" "*/")
parseInteger :: Parser Expr
parseInteger = do
  number <- some digitChar
  _ <- sc
  return $ IntExpr $ read number
parseExpr :: Parser Expr
parseExpr = makeExprParser parseInteger [[InfixL (BinaryExpression ">" <$ symbol ">")]]
parseBracketList :: Parser [Expr]
parseBracketList = do
  _ <- symbol "<"
  exprs <- sepBy parseExpr (symbol ",")
  _ <- symbol ">"
  return exprs
main :: IO ()
main = do
  text : _ <- getArgs
  let res = runParser parseBracketList "stdin" (pack text)
  case res of
    (Right suc) -> do
      print suc
    (Left err) ->
      putStrLn $ errorBundlePretty err
You've (probably) misdiagnosed the problem. Your parser fails on <1233>234> because it's trying to parse > as a left associative operator, like +. In other words, the same way:
1+2+
would fail, because the second + has no right-hand operand, your parser is failing because:
1233>234>
has no digit following the second >. Assuming you don't want your > operator to chain (i.e., 1>2>3 is not a valid Expr), you should first replace InfixL with InfixN (non-associative) in your makeExprParser table. Then, it will parse this example fine.
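In code, that first change affects only the operator table; a sketch, with the rest of parseExpr exactly as in the question:
parseExpr :: Parser Expr
parseExpr = makeExprParser parseInteger
  [[InfixN (BinaryExpression ">" <$ symbol ">")]]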
Unfortunately, with or without this change your parser will still fail on the simpler test case:
<1233>
because the > is interpreted as an operator within a continuing expression.
In other words, the problem isn't that your parser can't handle expressions with > characters, it's that it's overly aggressive in treating > characters as part of an expression, preventing them from being recognized as the closing angle bracket.
To fix this, you need to figure out exactly what you're parsing. Specifically, you need to resolve the ambiguity in your parser by precisely characterizing the situations where > can be part of a continuing expression and where it can't.
One rule that will probably work is to only consider a > as an operator if it is followed by a valid "term" (i.e., a parseInteger). You can do this with lookAhead. The parser:
symbol ">" <* lookAhead term
will parse a > operator only if it is followed by a valid term. If it fails to find a term, it will consume some input (at least the > symbol itself), so you must surround it with a try:
try (symbol ">" <* lookAhead term)
With the above two fixes applied to parseExpr:
parseExpr :: Parser Expr
parseExpr = makeExprParser term
  [[InfixN (BinaryExpression ">" <$ try (symbol ">" <* lookAhead term))]]
  where term = parseInteger
you'll get the following parses:
λ> parseTest parseBracketList "<23>"
[IntExpr 23]
λ> parseTest parseBracketList "<23,87>"
[IntExpr 23,IntExpr 87]
λ> parseTest parseBracketList "<23,87>18>"
[IntExpr 23,BinaryExpression ">" (IntExpr 87) (IntExpr 18)]
However, the following will fail:
λ> parseTest parseBracketList "<23,87>18"
1:10:
|
1 | <23,87>18
| ^
unexpected end of input
expecting ',', '>', or digit
because the > followed by 18 is taken as a valid operator, and it is then a parse error that the valid expression 87>18 is followed by neither a comma nor a closing angle bracket >.
If you need to parse something like <23,87>18, you have bigger problems. Consider the following two test cases:
<1,2>3,4,5,6,7,...,100000000000,100000000001>
<1,2>3,4,5,6,7,...,100000000000,100000000001
It's a challenge to write an efficient parser that will parse the first one as one huge list of expressions but the second one as a list of just two expressions:
[IntExpr 1, IntExpr 2]
followed by some "extra" text. Hopefully, the underlying "language" you're trying to parse isn't so hopelessly broken that this will be an issue.

Parse letter or number with Parsec

I am trying to write a parser for strings such as x and A (i.e. single letters) and 657 and 0 (i.e. non-negative integers).
Here is the code I wrote.
import Text.Parsec
data Expression = String String | Number Int
value = letter <|> many1 digit
However I get the following error.
Couldn't match type ‘[Char]’ with ‘Char’
How to convert Char -> String inside the parser?
What should the type annotation be for value ?
letter parses just a single letter and returns a Char. You want to parse a String, namely [Char] (it's the same thing), so I guess you want to parse many letter?
But if you want to parse just a single letter as a String, you can take advantage of the fact that Parsec s u has a Functor instance and map over its result to pack it in a list:
value :: Parsec s u String
value = fmap (:[]) letter <|> many1 digit
After the edit I guess you want to parse the Expression you have presented to us, so you will need some more fancy fmapping to wrap the results in proper constructors:
value :: Parsec s u Expression
value = fmap (String . (:[])) letter
    <|> fmap (Number . read) (many1 digit)
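Here is a self-contained version you can load and test. It is a sketch: the concrete Parser synonym from Text.Parsec.String and the derived Show instance are additions here so the results can be printed:
import Text.Parsec
import Text.Parsec.String (Parser)

data Expression = String String | Number Int
  deriving Show

-- a single letter becomes a one-character String; digits become a Number
value :: Parser Expression
value = fmap (String . (:[])) letter
    <|> fmap (Number . read) (many1 digit)
In GHCi:
λ> parseTest value "x"
String "x"
λ> parseTest value "657"
Number 657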

Correctly parsing nested data using megaparsec

I am trying to get more familiar with megaparsec, and I am running into some issues with precedences. By 'nested data' in the title I mean that I am trying to parse types, which may in turn contain other types. If someone can explain why this does not behave as I expect, please don't hesitate to tell me.
I am trying to parse types similar to those found in Haskell. Types are either base types Int, Bool, Float or type variables a (any lowercase word).
We can also construct algebraic data types from type constructors (Uppercase words) such as Maybe and type parameters (any other type). Examples are Maybe a and Either (Maybe Int) Bool. Functions associate to the right and are constructed with ->, such as Maybe a -> Either Int (b -> c). N-ary tuples are a sequence of types separated by , and enclosed in ( and ), such as (Int, Bool, a). A type can be wrapped in parenthesis to raise its precedence level (Maybe a). A unit type () is also defined.
I am using this ADT to describe this.
newtype Ident = Ident String
newtype UIdent = UIdent String
data Type a
  = TLam a (Type a) (Type a)
  | TVar a Ident
  | TNil a
  | TAdt a UIdent [Type a]
  | TTup a [Type a]
  | TBool a
  | TInt a
  | TFloat a
I have tried to write a megaparsec parser to parse such types, but I get unexpected results. I attach the relevant code below after which I will try to describe what I experience.
{-# LANGUAGE OverloadedStrings #-}
module Parser where
import AbsTinyCamiot
import Text.Megaparsec
import Text.Megaparsec.Char
import qualified Text.Megaparsec.Char.Lexer as Lexer
import Text.Megaparsec.Debug
import Control.Applicative hiding (many, some, Const)
import Control.Monad.Combinators.Expr
import Control.Monad.Identity
import Data.Void
import Data.Text (Text, unpack)
type Parser a = ParsecT Void Text Identity a
-- parse types
pBaseType :: Parser (Type ())
pBaseType = choice [
  TInt () <$ label "parse int" (pSymbol "Int"),
  TBool () <$ label "parse bool" (pSymbol "Bool"),
  TFloat () <$ label "parse float" (pSymbol "Float"),
  TNil () <$ label "parse void" (pSymbol "()"),
  TVar () <$> label "parse type variable" pIdent]
pAdt :: Parser (Type ())
pAdt = label "parse ADT" $ do
  con <- pUIdent
  variables <- many $ try $ many spaceChar >> pType
  return $ TAdt () con variables
pType :: Parser (Type ())
pType = label "parse a type" $
  makeExprParser
    (choice [ try pFunctionType
            , try $ parens pType
            , try pTupleType
            , try pBaseType
            , try pAdt
            ])
    []  -- [[InfixR (TLam () <$ pSymbol "->")]]
pTupleType :: Parser (Type ())
pTupleType = label "parse a tuple type" $ do
  pSymbol "("
  fst <- pType
  rest <- some (pSymbol "," >> pType)
  pSymbol ")"
  return $ TTup () (fst : rest)
pFunctionType :: Parser (Type ())
pFunctionType = label "parse a function type" $ do
  domain <- pType
  some spaceChar
  pSymbol "->"
  some spaceChar
  codomain <- pType
  return $ TLam () domain codomain
parens :: Parser a -> Parser a
parens p = label "parse a type wrapped in parentheses" $ do
  pSymbol "("
  a <- p
  pSymbol ")"
  return a
pUIdent :: Parser UIdent
pUIdent = label "parse a UIdent" $ do
  a <- upperChar
  rest <- many $ choice [letterChar, digitChar, char '_']
  return $ UIdent (a:rest)
pIdent :: Parser Ident
pIdent = label "parse an Ident" $ do
  a <- lowerChar
  rest <- many $ choice [letterChar, digitChar, char '_']
  return $ Ident (a:rest)
pSymbol :: Text -> Parser Text
pSymbol = Lexer.symbol pSpace
pSpace :: Parser ()
pSpace = Lexer.space
  (void spaceChar)
  (Lexer.skipLineComment "--")
  (Lexer.skipBlockComment "{-" "-}")
This might be overwhelming, so let me explain some key points. I understand that I have a lot of different constructions that could match on an opening parenthesis, so I've wrapped those parsers in try, such that if they fail I can try the next parser that might consume an opening parenthesis. Perhaps I am using try too much? Does it affect performance to potentially backtrack so much?
I've also tried to build an expression parser by defining some terms and an operator table. You can see that I've commented out the operator (function arrow), however. As the code stands I loop infinitely when I try to parse a function type. I think this is because when I try to parse a function type (invoked from pType) I immediately try to parse the type representing the domain of the function, which again calls pType. How would I do this correctly?
If I decide to use the operator table instead, and not use my custom parser for function types, things are parsed with the wrong precedences. E.g. Maybe a -> b gets parsed as Maybe (a -> b), while I would want it to be parsed as (Maybe a) -> b. Is there a way I could use the operator table and still have type constructors bind more tightly than the function arrow?
Lastly, as I am learning megaparsec as I go: if anyone sees any misunderstandings or things that are weird/unexpected, please tell me. I've read most of this tutorial to get this far.
Please let me know of any edits I can make to increase the quality of my question!
Your code does not handle precedences at all, and as a result it also runs into looping left-recursion.
To give an example of left-recursion in your code, pFunctionType calls pType as the first action, which calls pFunctionType as the first action. This is clearly a loop.
For precedences, I recommend looking at tutorials on "recursive descent operator parsing"; a quick Google search reveals several of them. Nevertheless, I can summarize the key points here as I write some code.
{-# language OverloadedStrings #-}
import Control.Monad.Identity
import Data.Text (Text)
import Data.Void
import Text.Megaparsec
import Text.Megaparsec.Char
import qualified Text.Megaparsec.Char.Lexer as Lexer
type Parser a = ParsecT Void Text Identity a
newtype Ident = Ident String deriving Show
newtype UIdent = UIdent String deriving Show
data Type
  = TVar Ident
  | TFun Type Type      -- instead of "TLam"
  | TAdt UIdent [Type]
  | TTup [Type]
  | TUnit               -- instead of "TNil"
  | TBool
  | TInt
  | TFloat
  deriving Show
pSymbol :: Text -> Parser Text
pSymbol = Lexer.symbol pSpace
pChar :: Char -> Parser ()
pChar c = void (char c <* pSpace)
pSpace :: Parser ()
pSpace = Lexer.space
  (void spaceChar)
  (Lexer.skipLineComment "--")
  (Lexer.skipBlockComment "{-" "-}")
keywords :: [String]
keywords = ["Bool", "Int", "Float"]
pUIdent :: Parser UIdent
pUIdent = try $ do
  a <- upperChar
  rest <- many $ choice [letterChar, digitChar, char '_']
  pSpace
  let x = a:rest
  if elem x keywords
    then fail "expected an ADT name"
    else pure $ UIdent x
pIdent :: Parser Ident
pIdent = try $ do
  a <- lowerChar
  rest <- many $ choice [letterChar, digitChar, char '_']
  pSpace
  return $ Ident (a:rest)
Let's stop here.
I changed the names of constructors in Type to conform to how they are called in Haskell. I also removed the parameter on Type, to have less noise in my example, but you can add it back of course.
Note the changed pUIdent and the addition of keywords. In general, if you want to parse identifiers, you have to disambiguate them from keywords. In this case, Int could parse both as Int and as an upper case identifier, so we have to specify that Int is not an identifier.
Continuing:
pClosed :: Parser Type
pClosed =
      (TInt <$ pSymbol "Int")
  <|> (TBool <$ pSymbol "Bool")
  <|> (TFloat <$ pSymbol "Float")
  <|> (TVar <$> pIdent)
  <|> (do pChar '('
          ts <- sepBy pFun (pChar ',') <* pChar ')'
          case ts of
            []  -> pure TUnit
            [t] -> pure t
            _   -> pure (TTup ts))
pApp :: Parser Type
pApp = (TAdt <$> pUIdent <*> many pClosed)
   <|> pClosed
pFun :: Parser Type
pFun = foldr1 TFun <$> sepBy1 pApp (pSymbol "->")
pExpr :: Parser Type
pExpr = pSpace *> pFun <* eof
We have to group operators according to binding strength. For each strength, we need to have a separate parsing function which parses all operators of that strength. In this case we have pFun, pApp and pClosed in increasing order of binding strength. pExpr is just a wrapper which handles top-level expressions, and takes care of leading whitespace and matches the end of the input.
When writing an operator parser, the first thing we should pin down is the group of closed expressions. Closed expressions are delimited by a keyword or symbol both on the left and on the right. This is conceptually "infinite" binding strength, since the text before and after such expressions doesn't change how they parse at all.
Keywords and variables are clearly closed, since they consist of a single token. We also have three more closed cases: the unit type, tuples and parenthesized expressions. Since all of these start with a (, I factor this out. After that, we have zero or more types separated by , and we have to branch on the number of parsed types.
The rule in precedence parsing is that when parsing an operator expression of given strength, we always call the next stronger expression parser when reading the expressions between operator symbols.
, is the weakest operator, so we call the function for the second weakest operator, pFun.
pFun in turn calls pApp, which reads ADT applications, or falls back to pClosed. In pFun you can also see the handling of right associativity, as we use foldr1 TFun to combine expressions. In a left-associative infix operator, we would instead use foldl1.
Note that parser functions always parse all stronger expressions as well. So pFun falls back on pApp when there is no -> (because sepBy1 accepts the case with no separators), and pApp falls back on pClosed when there's no ADT application.
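As a quick check, assuming the definitions above are loaded in GHCi, the example from the question now parses with the type constructor binding more tightly than the arrow:
λ> parseTest pExpr "Maybe a -> b"
TFun (TAdt (UIdent "Maybe") [TVar (Ident "a")]) (TVar (Ident "b"))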

Parsing multiple lines into a list of lists in Haskell

I am trying to parse a file that looks like:
a b c
f e d
I want to match each of the symbols in the line and parse everything into a list of lists such as:
[[A, B, C], [D, E, F]]
In order to do that I tried the following:
import Control.Monad
import Text.ParserCombinators.Parsec
import Text.ParserCombinators.Parsec.Language
import qualified Text.ParserCombinators.Parsec.Token as P
parserP :: Parser [[MyType]]
parserP = do
  x <- rowP
  xs <- many (newline >> rowP)
  return (x : xs)
rowP :: Parser [MyType]
rowP = manyTill cellP $ void newline <|> eof
cellP :: Parser MyType
cellP = aP <|> bP <|> ... -- rest of the parsers, they all look very similar
aP :: Parser MyType
aP = symbol "a" >> return A
bP :: Parser MyType
bP = symbol "b" >> return B
lexer = P.makeTokenParser emptyDef
symbol = P.symbol lexer
But it fails to return multiple inner lists. Instead what I get is:
[[A, B, C, D, E, F]]
What am I doing wrong? I was expecting manyTill to parse cellP until the newline character, but that's not the case.
Parser combinators are overkill for something this simple. I'd use lines :: String -> [String] and words :: String -> [String] to break up the input and then map the individual tokens into MyTypes.
toMyType :: String -> Maybe MyType
toMyType "a" = Just A
toMyType "b" = Just B
toMyType "c" = Just C
toMyType _ = Nothing
parseMyType :: String -> Maybe [[MyType]]
parseMyType = traverse (traverse toMyType) . fmap words . lines
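For example, assuming toMyType is extended with cases for the remaining letters and MyType derives Show:
λ> parseMyType "a b c\nd e f"
Just [[A,B,C],[D,E,F]]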
You're right that manyTill keeps parsing until a newline. But manyTill never gets to see the newline because cellP is too eager. cellP ends up calling P.symbol, whose documentation states
symbol :: String -> ParsecT s u m String
Lexeme parser symbol s parses string s and skips trailing white space.
The keyword there is 'white space'. It turns out, Parsec defines whitespace as being any character which satisfies isSpace, which includes newlines. So P.symbol is happily consuming the c, followed by the space and the newline, and then manyTill looks and doesn't see a newline because it's already been consumed.
If you want to drop the Parsec routine, go with Benjamin's solution. But if you're determined to stick with it, the basic idea is to override the token parser's whiteSpace field so that whitespace no longer includes newlines. Something like
lexer = let lexer0 = P.makeTokenParser emptyDef
        in lexer0 { whiteSpace = void $ many (oneOf " \t") }
That's pseudocode and probably won't work for your specific case, but the idea is there. You want to change the definition of whiteSpace to be whatever you want to define as whiteSpace, not what the system defines by default. Note that changing this will also break your comment syntax, if you have one defined, since whiteSpace was previously equipped to handle comments.
In short, Benjamin's answer is probably the best way to go. There's no real reason to use Parsec here. But it's also helpful to know why this particular solution didn't work: Parsec's default definition of a language wasn't designed to treat newlines with significance.
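That said, if you do want to keep a Parsec version, a minimal sketch of the idea is to skip the token-parser machinery entirely and define a lexeme that only eats horizontal whitespace. The constructor and helper names below are illustrative, not taken from the question:
import Control.Monad (void)
import Text.ParserCombinators.Parsec

data MyType = A | B | C | D | E | F deriving Show

-- skip spaces and tabs, but never newlines
hspaces :: Parser ()
hspaces = void $ many (oneOf " \t")

-- a lexeme-style symbol that only consumes horizontal whitespace afterwards
symbol' :: String -> Parser String
symbol' s = string s <* hspaces

cellP :: Parser MyType
cellP = choice
  [ A <$ symbol' "a", B <$ symbol' "b", C <$ symbol' "c"
  , D <$ symbol' "d", E <$ symbol' "e", F <$ symbol' "f" ]

rowP :: Parser [MyType]
rowP = many1 cellP

parserP :: Parser [[MyType]]
parserP = rowP `sepEndBy` newline
Now newlines survive long enough for the row structure to be visible:
λ> parse parserP "" "a b c\nf e d"
Right [[A,B,C],[F,E,D]]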

Designing parsing code using Parsec

In the course of following the tutorial Write Yourself a Scheme in 48 Hours, I was attempting to enhance my parsing code to add support for hexadecimal, octal, binary and decimal literals.
import Text.ParserCombinators.Parsec hiding (spaces)
import Control.Monad
import Numeric
hexChar :: Parser Char
hexChar = ...
octChar :: Parser Char
octChar = ...
hexNumber :: Parser Integer
hexNumber = do
  char '#'
  char 'x'
  s <- many1 hexChar
  return $ (fst . head) (readHex s)
octNumber :: Parser Integer
octNumber = do
  char '#'
  char 'o'
  s <- many1 octChar
  return $ (fst . head) (readOct s)
If we forget about decimal and binary numbers in this discussion:
parseNumber :: Parser Integer
parseNumber = hexNumber <|> octNumber
Then this parser will fail to recognize octal numbers. This seems to be related to the amount of lookahead required to tell an octal number apart from a hexadecimal one (if we drop the leading '#' from the syntax, the parser works). Hence it seems we are forced to revisit the code and 'factor out' the leading '#', so to speak, by dropping the char '#' from the individual parsers and defining:
parseNumber = char '#' >> (hexNumber <|> octNumber)
This is fine, but I find the code less pleasant. Somehow, if I have a function called hexNumber I would expect it to recognize #xffff (which is proper Scheme syntax) and not xffff. Is this something I have to live with, or are there ways around this 'forced factorization' of the leading character #?
If the first argument of (<|>) fails after having consumed some input, then it fails immediately without trying the second alternative. If a failure in the first argument should lead to a retry with the second argument, you can use try to avoid consuming input. In hexNumber you must consume '#' only if the following character matches 'x'.
hexNumber :: Parser Integer
hexNumber = do
  try $ char '#' >> char 'x'
  s <- many1 hexChar
  return $ (fst . head) (readHex s)
octNumber :: Parser Integer
octNumber = do
  try $ char '#' >> char 'o'
  s <- many1 octChar
  return $ (fst . head) (readOct s)
Note that this is somewhat inefficient since you parse '#' twice, and it gets worse as the common prefix gets longer and more complex.
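If the double parse bothers you, another option (a sketch, not part of the original answer) is to factor out '#' once and dispatch on the radix character, so nothing is re-read; hexChar and octChar are the helpers elided in the question:
parseNumber :: Parser Integer
parseNumber = do
  _ <- char '#'
  radix <- oneOf "xo"
  case radix of
    'x' -> (fst . head) . readHex <$> many1 hexChar
    _   -> (fst . head) . readOct <$> many1 octChar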
