Converting normal attoparsec parser code to conduit/pipes-based parsing

I have written the following parsing code using attoparsec:
data Test = Test {
  a :: Int,
  b :: Int
} deriving (Show)

testParser :: Parser Test
testParser = do
  a <- decimal
  tab
  b <- decimal
  return $ Test a b

tParser :: Parser [Test]
tParser = many' $ testParser <* endOfLine
This works fine for small files; I execute it like this:
main :: IO ()
main = do
  text <- TL.readFile "./testFile"
  let (Right a) = parseOnly (manyTill anyChar endOfLine *> tParser) text
  print a
But when the size of the file is greater than 70MB, it consumes tons of memory. As a solution, I thought I would use attoparsec-conduit. After going through its API, I'm not sure how to make the two work together. My parser has the type Parser Test, but sinkParser actually accepts a parser of type Parser a b. How can I execute this parser in constant memory? (A pipes-based solution is also acceptable, but I am not used to the pipes API.)

The first type parameter to Parser is just the data type of the input (either Text or ByteString). You can provide your testParser function as the argument to sinkParser and it will work fine. Here's a short example:
{-# LANGUAGE OverloadedStrings #-}
import           Conduit                 (liftIO, mapM_C, runResourceT,
                                          sourceFile, ($$), (=$))
import           Data.Attoparsec.Text    (Parser, decimal, endOfLine, space)
import           Data.Conduit.Attoparsec (conduitParser)

data Test = Test {
  a :: Int,
  b :: Int
} deriving (Show)

testParser :: Parser Test
testParser = do
  a <- decimal
  space
  b <- decimal
  endOfLine
  return $ Test a b

main :: IO ()
main = runResourceT
     $ sourceFile "foo.txt"
    $$ conduitParser testParser
    =$ mapM_C (liftIO . print)
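
Since the question asks about sinkParser specifically: it accepts testParser just as well, but it parses a single value rather than a stream of them. A minimal sketch of that usage (my addition, reusing the imports from the example above plus one extra; sinkParser throws a ParseError if the parser fails):

import Data.Conduit.Attoparsec (sinkParser)

-- Parse one Test from the beginning of the file and return it.
parseOne :: IO Test
parseOne = runResourceT
         $ sourceFile "foo.txt"
        $$ sinkParser testParser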

Here is the pipes solution (assuming that you are using a Text-based parser):
import Pipes
import Pipes.Text.IO (fromHandle)
import Pipes.Attoparsec (parsed)
import qualified System.IO as IO

main = IO.withFile "./testfile" IO.ReadMode $ \handle -> runEffect $
    for (parsed (testParser <* endOfLine) (fromHandle handle)) (lift . print)
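
The return value of parsed tells you whether the whole input was consumed; the example above discards it. Here is a sketch (my addition, not part of the original answer) that inspects it; ParsingError has a Show instance:

main2 :: IO ()
main2 = IO.withFile "./testfile" IO.ReadMode $ \handle -> do
    r <- runEffect $
         for (parsed (testParser <* endOfLine) (fromHandle handle)) (lift . print)
    case r of
        Left (err, _leftovers) -> print err   -- parsing stopped early
        Right ()               -> return ()   -- the entire input was parsed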

Related

Correctly parsing nested data using megaparsec

I am trying to get more familiar with megaparsec, and I am running into some issues with precedences. By 'nested data' in the title I refer to the fact that I am trying to parse types, which in turn could contain other types. If someone could explain why this does not behave as I would expect, please don't hesitate to tell me.
I am trying to parse types similar to those found in Haskell. Types are either base types Int, Bool, Float or type variables a (any lowercase word).
We can also construct algebraic data types from type constructors (uppercase words) such as Maybe and type parameters (any other type). Examples are Maybe a and Either (Maybe Int) Bool. Functions associate to the right and are constructed with ->, such as Maybe a -> Either Int (b -> c). N-ary tuples are a sequence of types separated by , and enclosed in ( and ), such as (Int, Bool, a). A type can be wrapped in parentheses to raise its precedence level (Maybe a). A unit type () is also defined.
I am using this ADT to describe these types.
newtype Ident = Ident String
newtype UIdent = UIdent String

data Type a
    = TLam a (Type a) (Type a)
    | TVar a Ident
    | TNil a
    | TAdt a UIdent [Type a]
    | TTup a [Type a]
    | TBool a
    | TInt a
    | TFloat a
I have tried to write a megaparsec parser to parse such types, but I get unexpected results. I attach the relevant code below after which I will try to describe what I experience.
{-# LANGUAGE OverloadedStrings #-}
module Parser where

import AbsTinyCamiot

import Text.Megaparsec
import Text.Megaparsec.Char
import qualified Text.Megaparsec.Char.Lexer as Lexer
import Text.Megaparsec.Debug

import Control.Applicative hiding (many, some, Const)
import Control.Monad.Combinators.Expr
import Control.Monad.Identity

import Data.Void
import Data.Text (Text, unpack)

type Parser a = ParsecT Void Text Identity a

-- parse types
pBaseType :: Parser (Type ())
pBaseType = choice [
    TInt   () <$  label "parse int"           (pSymbol "Int"),
    TBool  () <$  label "parse bool"          (pSymbol "Bool"),
    TFloat () <$  label "parse float"         (pSymbol "Float"),
    TNil   () <$  label "parse void"          (pSymbol "()"),
    TVar   () <$> label "parse type variable" pIdent]

pAdt :: Parser (Type ())
pAdt = label "parse ADT" $ do
    con <- pUIdent
    variables <- many $ try $ many spaceChar >> pType
    return $ TAdt () con variables

pType :: Parser (Type ())
pType = label "parse a type" $
    makeExprParser
        (choice [ try pFunctionType
                , try $ parens pType
                , try pTupleType
                , try pBaseType
                , try pAdt
                ])
        []--[[InfixR (TLam () <$ pSymbol "->")]]

pTupleType :: Parser (Type ())
pTupleType = label "parse a tuple type" $ do
    pSymbol "("
    fst <- pType
    rest <- some (pSymbol "," >> pType)
    pSymbol ")"
    return $ TTup () (fst : rest)

pFunctionType :: Parser (Type ())
pFunctionType = label "parse a function type" $ do
    domain <- pType
    some spaceChar
    pSymbol "->"
    some spaceChar
    codomain <- pType
    return $ TLam () domain codomain

parens :: Parser a -> Parser a
parens p = label "parse a type wrapped in parentheses" $ do
    pSymbol "("
    a <- p
    pSymbol ")"
    return a

pUIdent :: Parser UIdent
pUIdent = label "parse a UIdent" $ do
    a <- upperChar
    rest <- many $ choice [letterChar, digitChar, char '_']
    return $ UIdent (a:rest)

pIdent :: Parser Ident
pIdent = label "parse an Ident" $ do
    a <- lowerChar
    rest <- many $ choice [letterChar, digitChar, char '_']
    return $ Ident (a:rest)

pSymbol :: Text -> Parser Text
pSymbol = Lexer.symbol pSpace

pSpace :: Parser ()
pSpace = Lexer.space
    (void spaceChar)
    (Lexer.skipLineComment "--")
    (Lexer.skipBlockComment "{-" "-}")
This might be overwhelming, so let me explain some key points. I understand that I have a lot of different constructions that could match on an opening parenthesis, so I've wrapped those parsers in try, such that if they fail I can try the next parser that might consume an opening parenthesis. Perhaps I am using try too much? Does it affect performance to potentially backtrack so much?
I've also tried to make an expression parser by defining some terms and an operator table. You can see that I've commented out the operator (the function arrow), however. As the code stands right now, I loop infinitely when I try to parse a function type. I think this is because when I try to parse a function type (invoked from pType) I immediately try to parse a type representing the domain of the function, which again calls pType. How would I do this correctly?
If I decide to use the operator table instead, and not use my custom parser for function types, I parse things with the wrong precedences. E.g. Maybe a -> b gets parsed as Maybe (a -> b), while I would want it to be parsed as (Maybe a) -> b. Is there a way I could use the operator table and still have type constructors bind more tightly than the function arrow?
Lastly, as I am learning megaparsec as I go, if anyone sees any misunderstandings or things that are weird/unexpected, please tell me. I've read most of this tutorial to get this far.
Please let me know of any edits I can make to increase the quality of my question!
Your code does not handle precedences at all, and as a result it also contains looping left recursion.
To give an example of left-recursion in your code, pFunctionType calls pType as the first action, which calls pFunctionType as the first action. This is clearly a loop.
For precedences, I recommend looking at tutorials on "recursive descent operator parsing"; a quick Google search turns up several of them. Nevertheless, I can summarize the key points here. Let's start with some code.
{-# language OverloadedStrings #-}

import Control.Monad.Identity
import Data.Text (Text)
import Data.Void
import Text.Megaparsec
import Text.Megaparsec.Char
import qualified Text.Megaparsec.Char.Lexer as Lexer

type Parser a = ParsecT Void Text Identity a

newtype Ident  = Ident String  deriving Show
newtype UIdent = UIdent String deriving Show

data Type
  = TVar Ident
  | TFun Type Type   -- instead of "TLam"
  | TAdt UIdent [Type]
  | TTup [Type]
  | TUnit            -- instead of "TNil"
  | TBool
  | TInt
  | TFloat
  deriving Show

pSymbol :: Text -> Parser Text
pSymbol = Lexer.symbol pSpace

pChar :: Char -> Parser ()
pChar c = void (char c <* pSpace)

pSpace :: Parser ()
pSpace = Lexer.space
  (void spaceChar)
  (Lexer.skipLineComment "--")
  (Lexer.skipBlockComment "{-" "-}")

keywords :: [String]
keywords = ["Bool", "Int", "Float"]

pUIdent :: Parser UIdent
pUIdent = try $ do
  a <- upperChar
  rest <- many $ choice [letterChar, digitChar, char '_']
  pSpace
  let x = a:rest
  if elem x keywords
    then fail "expected an ADT name"
    else pure $ UIdent x

pIdent :: Parser Ident
pIdent = try $ do
  a <- lowerChar
  rest <- many $ choice [letterChar, digitChar, char '_']
  pSpace
  return $ Ident (a:rest)
Let's stop here.
I changed the names of constructors in Type to conform to how they are called in Haskell. I also removed the parameter on Type, to have less noise in my example, but you can add it back of course.
Note the changed pUIdent and the addition of keywords. In general, if you want to parse identifiers, you have to disambiguate them from keywords. In this case, Int could parse both as Int and as an upper case identifier, so we have to specify that Int is not an identifier.
Continuing:
pClosed :: Parser Type
pClosed =
      (TInt   <$  pSymbol "Int")
  <|> (TBool  <$  pSymbol "Bool")
  <|> (TFloat <$  pSymbol "Float")
  <|> (TVar   <$> pIdent)
  <|> (do pChar '('
          ts <- sepBy1 pFun (pChar ',') <* pChar ')'
          case ts of
            []  -> pure TUnit
            [t] -> pure t
            _   -> pure (TTup ts))

pApp :: Parser Type
pApp = (TAdt <$> pUIdent <*> many pClosed)
   <|> pClosed

pFun :: Parser Type
pFun = foldr1 TFun <$> sepBy1 pApp (pSymbol "->")

pExpr :: Parser Type
pExpr = pSpace *> pFun <* eof
We have to group operators according to binding strength. For each strength, we need to have a separate parsing function which parses all operators of that strength. In this case we have pFun, pApp and pClosed in increasing order of binding strength. pExpr is just a wrapper which handles top-level expressions, and takes care of leading whitespace and matches the end of the input.
When writing an operator parser, the first thing we should pin down is the group of closed expressions. Closed expressions are delimited by a keyword or symbol on both the left and the right. This is conceptually "infinite" binding strength, since text before and after such expressions doesn't change their parsing at all.
Keywords and variables are clearly closed, since they consist of a single token. We also have three more closed cases: the unit type, tuples and parenthesized expressions. Since all of these start with a (, I factor this out. After that, we have one or more types separated by , and we have to branch on the number of parsed types.
The rule in precedence parsing is that when parsing an operator expression of given strength, we always call the next stronger expression parser when reading the expressions between operator symbols.
, is the weakest operator, so we call the function for the second weakest operator, pFun.
pFun in turn calls pApp, which reads ADT applications, or falls back to pClosed. In pFun you can also see the handling of right associativity, as we use foldr1 TFun to combine expressions. In a left-associative infix operator, we would instead use foldl1.
Note that parser functions always parse all stronger expressions as well. So pFun falls back on pApp when there is no -> (because sepBy1 accepts the case with no separators), and pApp falls back on pClosed when there's no ADT application.
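
As a quick check of the precedence behaviour the question asked about (my addition, not part of the original answer), application now binds tighter than the arrow:

-- Expected result shape:
--   TFun (TAdt (UIdent "Maybe") [TVar (Ident "a")]) (TVar (Ident "b"))
-- i.e. (Maybe a) -> b rather than Maybe (a -> b).
checkPrecedence :: IO ()
checkPrecedence = parseTest pExpr "Maybe a -> b"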

Recursively return all words from .txt file using attoparsec

I am fairly new to Haskell and I'm just starting to learn how to work with attoparsec for parsing huge chunks of English text from a .txt file. I know how to get the number of words in a .txt file without using attoparsec, but I'm kinda stuck with attoparsec. When I run my code below on, let's say,
"Hello World, I am Elliot Anderson. \nAnd I'm Mr.Robot.\n"
I only get back:
World, I am Elliot Anderson. \nAnd I'm Mr.Robot.\n" (Prose {word = "Hello"})
This is my current code:
{-# LANGUAGE OverloadedStrings #-}
import Control.Exception (catch, SomeException)
import System.Environment (getArgs)
import Data.Attoparsec.Text
import qualified Data.Text.IO as Txt
import Data.Char
import Control.Applicative ((<*>), (*>), (<$>), (<|>), pure)

{-
This is how I would usually get the length of the list of words in a .txt file normally.

countWords :: String -> Int
countWords input = sum $ map (length.words) (lines input)
-}

data Prose = Prose {
  word :: String
} deriving Show

prose :: Parser Prose
prose = do
  word <- many' $ letter
  return $ Prose word

main :: IO()
main = do
  input <- Txt.readFile "small.txt"
  print $ parse prose input
Also, how can I get the integer count of words later on? And do you have any suggestions on how to get started with attoparsec?
You have a pretty good start already - you can parse a word.
What you need next is a Parser [Prose], which can be expressed by combining your prose parser with another one which consumes the "not prose" parts, using sepBy or sepBy1, which you can look up in the Data.Attoparsec.Text documentation.
From there, the easiest way to get the word count would be to simply get the length of your obtained [Prose].
EDIT:
Here is a minimal working example. The Parser runner has been swapped for parseOnly to allow for residual input to be ignored, meaning that a trailing non-word won't make the parser go cray-cray.
{-# LANGUAGE OverloadedStrings #-}
module Atto where

--import qualified Data.Text.IO as Txt
import Data.Attoparsec.Text
import Control.Applicative ((*>), (<$>), (<|>), pure)
import qualified Data.Text as T

data Prose = Prose {
  word :: String
} deriving Show

optional :: Parser a -> Parser ()
optional p = option () (try p *> pure ())

-- Modified to disallow empty words, switched to applicative style
prose :: Parser Prose
prose = Prose <$> many1' letter

separator :: Parser ()
separator = many1 (space <|> satisfy (inClass ",.'")) >> pure ()

wordParser :: String -> [Prose]
wordParser str = case parseOnly wp (T.pack str) of
    Left err -> error err
    Right x  -> x
  where
    wp = optional separator *> prose `sepBy1` separator

main :: IO ()
main = do
  let input = "Hello World, I am Elliot Anderson. \nAnd I'm Mr.Robot.\n"
  let words = wordParser input
  print words
  print $ length words
The provided parser does not give the exact same result as concatMap words . lines since it also breaks words on .,'. Modifying this behaviour is left as a simple exercise.
Hope it helps! :)
You're on the right track! You've written a parser (prose) which reads a single word: many' letter recognises a sequence of letters.
So now that you've figured out how to parse a single word, your job is to scale this up to parse a sequence of words separated by spaces. That's what sepBy does: p `sepBy` q runs the p parser repeatedly with the q parser interspersed.
So a parser for a sequence of words looks something like this (I've taken the liberty of renaming your prose to word):
word = many letter
phrase = word `sepBy` some space -- "some" runs a parser one-or-more times

ghci> parseOnly phrase "wibble wobble wubble" -- with -XOverloadedStrings
Right ["wibble","wobble","wubble"]
Now, phrase, being composed out of letter and space, will die on non-letter non-space characters such as ' and .. I'll leave it to you to figure out how to fix that. (As a hint, you'll probably need to change many letter to many (letter <|> ...), depending on how exactly you want it to behave on the various punctuation marks.)
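
For example, one possible refinement (my sketch, one of several reasonable behaviours) keeps apostrophes inside words so "I'm" counts as a single word, and folds the remaining punctuation into the separator:

import Control.Applicative ((<|>))
import Data.Attoparsec.Text

-- A word may contain letters and apostrophes; separators are runs of
-- whitespace, commas and full stops.
wordChar :: Parser Char
wordChar = letter <|> char '\''

word' :: Parser String
word' = many1 wordChar

phrase' :: Parser [String]
phrase' = word' `sepBy` many1 (space <|> satisfy (inClass ",."))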

Auto-completion suggestions from parse error

I am writing a parser for a custom Jupyter kernel using megaparsec. I was able to reuse the parser to provide completions too: the custom error messages generated by the megaparsec library are transformed into the list of expected symbols. That way, whenever I change the parser, completion automatically adjusts itself, which is great.
The only thing I am struggling with is how to get info from the optional parsers. A minimal example illustrating what I want to achieve follows:
{-# LANGUAGE OverloadedStrings #-}
module Main where

import Control.Applicative
import Text.Megaparsec
import Text.Megaparsec.Char
import qualified Text.Megaparsec.Char.Lexer as L
import Data.Monoid
import Data.Text (Text)
import Data.Set (singleton)

type Parser = Parsec MyError Text

data MyError = ExpectKeyword Text deriving (Eq, Ord, Show)

lexeme = L.lexeme sc
sc = L.space (skipSome (oneOf [' ', '\t'])) empty empty

-- | Reserved words
rword :: Text -> Parser Text
rword w = region (fancyExpect (ExpectKeyword w)) $
          lexeme (string w *> return w)

fancyExpect f e = FancyError (errorPos e) (singleton . ErrorCustom $ f)

p1 = rword "foo" <|> rword "bar"
p2 = (<>) <$> option "def" (rword "opt") <*> p1

main = do
  putStrLn . show $ parse p1 "" ("xyz" :: Text) -- shows "foo" and "bar" in errors
  putStrLn . show $ parse p2 "" ("xyz" :: Text) -- like above, no optional "opt"
In the first case, the parser fails and I get the list of errors from all alternatives. Ideally, in the second case I would like to see the error from the failed optional parser too.
This example could be solved simply by removing option and making two branches with <|>: one with the optional keyword and one without (sketched below). However, in the real case the optional part is a permutation parser consisting of several optional parts, so such a trick is not feasible.
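
For reference, the two-branch version mentioned above would look roughly like this (a sketch of the idea only; as noted, it does not scale to a permutation parser with several optional parts):

p2' :: Parser Text
p2' = ((<>) <$> rword "opt" <*> p1)   -- branch with the optional keyword
  <|> (("def" <>) <$> p1)             -- branch without it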

How can I parse fixed-length, non-delimited integers with attoparsec?

I'm trying to parse two integers from 3 characters using attoparsec. A sample input might look something like this:
341
... which I would like to parse into:
Constructor 34 1
I have two solutions that work but which are somewhat clunky:
stdK :: P.Parser Packet
stdK = do
P.char '1'
qstr <- P.take 2
let q = rExt $ P.parseOnly P.decimal qstr
n <- P.decimal
return $ Std q n
stdK2 :: P.Parser Packet
stdK2 = do
P.char '1'
qn <- P.decimal
let q = div qn 10
let n = rem qn 10
return $ Std q n
There must be a better way to achieve something as simple as this. Am I missing something?
Your code snippet is far from being self-contained (in particular, imports and the definition of your Packet data type are missing), but you seem to be overcomplicating things.
First, define a parser for one-digit integers. Then, use the latter as a building block for a parser for two-digit integers. After that, use applicative operators to combine those two parsers and define a parser for your custom Packet data type. See below.
Note that you don't need the full power of monads; applicative parsing is sufficient here.
-- test_attoparsec.hs
{-# LANGUAGE OverloadedStrings #-}

import Control.Applicative ((<$>))
import Data.Attoparsec.Text
import Data.Char

data Packet = Std {-# UNPACK #-} !Int
                  {-# UNPACK #-} !Int
  deriving (Show)

stdK :: Parser Packet
stdK = char '1' *> (Std <$> twoDigitInt <*> oneDigitInt)

twoDigitInt :: Parser Int
twoDigitInt = timesTenPlus <$> oneDigitInt <*> oneDigitInt
  where
    timesTenPlus x y = 10 * x + y

oneDigitInt :: Parser Int
oneDigitInt = digitToInt <$> digit
Tests in GHCi:
λ> :l test_attoparsec.hs
[1 of 1] Compiling Main ( test_attoparsec.hs, interpreted )
Ok, modules loaded: Main.
λ> :set -XOverloadedStrings
λ> parseOnly stdK "1341"
Right (Std 34 1)
λ> parseOnly stdK "212"
Left "1: Failed reading: satisfyWith"

Using `opt` combinator in uu-parsinglib

I am writing a parser for a simple text template language for my project, and I am completely stuck on the opt combinator in uu-parsinglib (version 2.7.3.2, in case that matters). Any ideas on how to use it properly?
Here is a very simplified example that shows my predicament.
{-# LANGUAGE FlexibleContexts #-}
import Text.ParserCombinators.UU hiding (pEnd)
import Text.ParserCombinators.UU.Utils
import Text.ParserCombinators.UU.BasicInstances

pIdentifier :: Parser String
pIdentifier = pMany pLetter

pIfClause :: Parser ((String, String), String, Maybe (String, String), String)
pIfClause = (,,,) <$> pIf <*> pIdentifier <*> pOptionalElse <*> pEnd

pIf :: Parser (String, String)
pIf = pBraces ((,) <$> pToken "if " <*> pIdentifier)

pOptionalElse :: Parser (Maybe (String, String))
pOptionalElse = (((\x y -> Just (x, y)) <$> pElse <*> pIdentifier) `opt` Nothing)

pElse :: Parser String
pElse = pBraces (pToken "else")

pEnd :: Parser String
pEnd = pBraces (pToken "end")

main :: IO ()
main = do
  putStrLn $ show $ runParser "works" pIfClause "{if abc}def{else}ghi{end}"
  putStrLn $ show $ runParser "doesn't work" pIfClause "{if abc}def{end}"
The first string parses properly, but the second fails with this error:
main: Failed parsing 'doesn't work' :
Expected at position LineColPos 0 12 12 expecting one of [Whitespace, "else"] at LineColPos 0 12 12 :
            v
{if abc}def{end}
            ^
The documentation for opt says:
If p can be recognized, the return value of p is used. Otherwise, the value v is used. Note that opt by default is greedy.
What greedy means is explained in the documentation for <<|>:
<<|> is the greedy version of <|>. If its left hand side parser can make any progress then it commits to that alternative.
In your case, the first argument to opt does recognize part of the input, because else and end both start with e. Thus, it commits to pElse, which fails and makes the whole parse fail.
An easy way to solve this is to use ... <|> pure Nothing, as the documentation suggests.
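
Applied to the example above, that fix would look something like this (a sketch; <|> is the non-greedy choice, so the parser is no longer committed to pElse after seeing the opening brace):

pOptionalElse' :: Parser (Maybe (String, String))
pOptionalElse' = ((\x y -> Just (x, y)) <$> pElse <*> pIdentifier) <|> pure Nothing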
