I'm trying to parse two integers from 3 characters using attoparsec. A sample input might look something like this:
341
... which I would like to parse into:
Constructor 34 1
I have two solutions that work but which are somewhat clunky:
stdK :: P.Parser Packet
stdK = do
    P.char '1'
    qstr <- P.take 2
    let q = rExt $ P.parseOnly P.decimal qstr
    n <- P.decimal
    return $ Std q n

stdK2 :: P.Parser Packet
stdK2 = do
    P.char '1'
    qn <- P.decimal
    let q = div qn 10
    let n = rem qn 10
    return $ Std q n
There must be a better way to achieve something as simple as this. Am I missing something?
Your code snippet is far from being self-contained (in particular, imports and the definition of your Packet data type are missing), but you seem to be overcomplicating things.
First, define a parser for one-digit integers. Then, use the latter as a building block for a parser for two-digit integers. After that, use applicative operators to combine those two parsers and define a parser for your custom Packet data type. See below.
Note that you don't need the full power of monads; applicative parsing is sufficient, here.
-- test_attoparsec.hs
{-# LANGUAGE OverloadedStrings #-}

import Control.Applicative ((<$>))
import Data.Attoparsec.Text
import Data.Char

data Packet = Std {-# UNPACK #-} !Int
                  {-# UNPACK #-} !Int
  deriving (Show)

stdK :: Parser Packet
stdK = char '1' *> (Std <$> twoDigitInt <*> oneDigitInt)

twoDigitInt :: Parser Int
twoDigitInt = timesTenPlus <$> oneDigitInt <*> oneDigitInt
  where
    timesTenPlus x y = 10 * x + y

oneDigitInt :: Parser Int
oneDigitInt = digitToInt <$> digit
Tests in GHCi:
λ> :l test_attoparsec.hs
[1 of 1] Compiling Main ( test_attoparsec.hs, interpreted )
Ok, modules loaded: Main.
λ> :set -XOverloadedStrings
λ> parseOnly stdK "1341"
Right (Std 34 1)
λ> parseOnly stdK "212"
Left "1: Failed reading: satisfyWith"
I am writing a parser for a custom Jupyter kernel using megaparsec. I was able to reuse the parser to provide completions too: the custom error messages generated by the megaparsec library are transformed into the list of expected symbols. That way, whenever I change the parser, completion automatically adjusts itself, which is great.
The only thing I am struggling with is how to get info from the optional parsers. A minimal example illustrating what I want to achieve is the following:
{-# LANGUAGE OverloadedStrings #-}
module Main where

import Control.Applicative
import Text.Megaparsec
import Text.Megaparsec.Char
import qualified Text.Megaparsec.Char.Lexer as L
import Data.Monoid
import Data.Text (Text)
import Data.Set (singleton)

type Parser = Parsec MyError Text

data MyError = ExpectKeyword Text deriving (Eq, Ord, Show)

lexeme = L.lexeme sc
sc = L.space (skipSome (oneOf [' ', '\t'])) empty empty

-- | Reserved words
rword :: Text -> Parser Text
rword w = region (fancyExpect (ExpectKeyword w)) $
          lexeme (string w *> return w)

fancyExpect f e = FancyError (errorPos e) (singleton . ErrorCustom $ f)

p1 = rword "foo" <|> rword "bar"
p2 = (<>) <$> option "def" (rword "opt") <*> p1

main = do
  putStrLn . show $ parse p1 "" ("xyz" :: Text) -- shows "foo" and "bar" in errors
  putStrLn . show $ parse p2 "" ("xyz" :: Text) -- like above, no optional "opt"
In the first case, parser fails and I get the list of all errors from all alternatives. Ideally, in the second case I would like to see the error of the failed optional parser too.
This example can be solved simply by removing option and making two branches with <|>: one with the optional part and the other without. However, in the real case the optional part is a permutation parser consisting of several optional parts, so such a trick is not feasible.
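For completeness, here is a sketch of what that two-branch workaround looks like for the minimal example above (it reuses only the definitions from the snippet; with several optional parts this duplication is exactly what I am trying to avoid):

-- Sketch of the <|>-based workaround: spell out the branch with the
-- optional keyword and the branch without it, so the failed "opt"
-- alternative is reported alongside the others.
p2' :: Parser Text
p2' = ((<>) <$> rword "opt" <*> p1)
  <|> (("def" <>) <$> p1)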
Haskell's Parsec supports parsing of permutation phrases via the Perm module. In the documentation for permute you can see how the various permutations "abc", "aab", "bc", "cbaaaaa", etc. can be parsed. While the example shows support for parsing many contiguous instances of the same element, like "aaaa", it won't parse non-contiguous instances like "aabca", presumably because each parser is included only once in each permutation (the paper seems to imply this in the tree...)
Besides sorting the input so that like instances are contiguous, what options do I have for parsing non-contiguous instances?
Depending on what you actually want, you may be able to use many $ oneOf ['a','b','c'].
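A minimal sketch of that simpler approach (plain Parsec over String; the example input is made up):

import Text.Parsec
import Text.Parsec.String (Parser)

-- Accepts the characters in any order and with any multiplicity,
-- without involving the permutation machinery at all.
anyOrder :: Parser String
anyOrder = many (oneOf ['a', 'b', 'c'])

-- ghci> parseTest anyOrder "aabca"
-- "aabca"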
If you really need to use the permutation parser, keep in mind that allowing parses of multiple adjacent characters introduces ambiguity. For example, the string "bacacbbca" could be parsed as the perms bac, acb, bca or, if you allow repeated characters, as bac, acbb, with a leftover non-permutation of ca.
If you allow repeated letters,
{-# LANGUAGE FlexibleContexts #-}
import Text.Parsec
import Text.Parsec.Char
import Text.Parsec.Perm
import Data.Text
import Control.Monad.Identity
perm :: (Stream s Identity Char) => Parsec s u (String,String,String)
perm = permute $ triple <$$>
         (many1 $ char 'a') <||>
         (many1 $ char 'b') <||>
         (many1 $ char 'c')
  where triple a b c = (a,b,c)
multiPerm :: (Stream s Identity Char) => Parsec s u [(String,String,String)]
multiPerm = many $ try $ perm
main :: IO ()
main = parseTest multiPerm $ "bacacbbca"
main produces [("a","b","c"),("a","bb","c")].
If not:
perm :: (Stream s Identity Char) => Parsec s u (Char,Char,Char)
perm = permute $ triple <$$>
         (char 'a') <||>
         (char 'b') <||>
         (char 'c')
  where triple a b c = (a,b,c)
you get the arguably better: [('a','b','c'),('a','b','c'),('a','b','c')].
I have written the following parsing code using attoparsec:
data Test = Test {
    a :: Int,
    b :: Int
  } deriving (Show)

testParser :: Parser Test
testParser = do
  a <- decimal
  tab
  b <- decimal
  return $ Test a b

tParser :: Parser [Test]
tParser = many' $ testParser <* endOfLine
This works fine for small files; I execute it like this:
main :: IO ()
main = do
  text <- TL.readFile "./testFile"
  let (Right a) = parseOnly (manyTill anyChar endOfLine *> tParser) text
  print a
But when the size of the file is greater than 70MB, it consumes tons of memory. As a solution, I thought I would use attoparsec-conduit. After going through its API, I'm not sure how to make them work together. My parser has the type Parser Test, but its sinkParser actually accepts a parser of type Parser a b. I'm interested in how to execute this parser in constant memory. (A pipes-based solution is also acceptable, but I'm not used to the pipes API.)
The first type parameter to Parser is just the data type of the input (either Text or ByteString). You can provide your testParser function as the argument to sinkParser and it will work fine. Here's a short example:
{-# LANGUAGE OverloadedStrings #-}
import           Conduit                 (liftIO, mapM_C, runResourceT,
                                          sourceFile, ($$), (=$))
import           Data.Attoparsec.Text    (Parser, decimal, endOfLine, space)
import           Data.Conduit.Attoparsec (conduitParser)

data Test = Test {
    a :: Int,
    b :: Int
  } deriving (Show)

testParser :: Parser Test
testParser = do
  a <- decimal
  space
  b <- decimal
  endOfLine
  return $ Test a b

main :: IO ()
main = runResourceT
     $ sourceFile "foo.txt"
    $$ conduitParser testParser
    =$ mapM_C (liftIO . print)
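If you only need a single value out of the stream, sinkParser (the function mentioned in the question) can be used directly. Here is a hedged sketch, assuming the same Test/testParser as above plus two extra imports: decodeUtf8C from Conduit and sinkParser from Data.Conduit.Attoparsec. decodeUtf8C converts the ByteString chunks produced by sourceFile into the Text this parser expects.

-- Sketch only: run one parser against the stream and return its result.
parseOne :: IO Test
parseOne = runResourceT
     $ sourceFile "foo.txt"
    $$ decodeUtf8C
    =$ sinkParser testParser

conduitParser, as in the example above, remains the right choice when the stream contains many records rather than one.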
Here is the pipes solution (assuming that you are using a Text-based parser):
import Pipes
import Pipes.Text.IO (fromHandle)
import Pipes.Attoparsec (parsed)
import qualified System.IO as IO
main = IO.withFile "./testfile" IO.ReadMode $ \handle -> runEffect $
  for (parsed (testParser <* endOfLine) (fromHandle handle)) (lift . print)
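A hedged variant of the same program that also inspects the return value of parsed, which (per pipes-attoparsec) carries either a ParsingError plus the leftover producer or the final return value of the source:

main :: IO ()
main = IO.withFile "./testfile" IO.ReadMode $ \handle -> do
  r <- runEffect $
         for (parsed (testParser <* endOfLine) (fromHandle handle)) (lift . print)
  case r of
    Left (err, _leftovers) -> putStrLn ("Parsing stopped: " ++ show err)
    Right ()               -> return ()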
I'm trying to figure out the "right" way to parse a particular text file in Haskell.
In F#, I loop over each line, testing it against a regular expression to determine if it's a line I want to parse, and then if it is, I parse it using the regular expression. Otherwise, I ignore the line.
The file is a printable report, with headers on each page. Each record is one line, and each field is separated by two or more spaces. Here's an example:
MY COMPANY'S NAME
PROGRAM LISTING
STATE: OK PRODUCT: ProductName
(DESCRIPTION OF REPORT)
DATE: 11/03/2013
This is the first line of a two-line description of the contents of this report. The description, as noted,
spans two lines. This is more text. I'm running out of things to write. Blah.
DIVISION CODE: 3 XYZ CODE: FAA3 AGENT CODE: 0007 PAGE NO: 1
AGENT TARGET NAME ST UD TARGET# XYZ# X-DATE YEAR CO ENCODING
----- ------------------------------ -- -- ------- ---- ---------- ---- ---------- ----------
0007 SMITH, JOHN 43 3 1234567 001 12/06/2013 2004 ABC SIZE XL
0007 SMITH, JANE 43 3 2345678 001 12/07/2013 2005 ACME YELLOW
0007 DOE, JOHN 43 3 3456789 004 12/09/2013 2008 MICROSOFT GREEN
0007 DOE, JANE 43 3 4567890 002 12/09/2013 2007 MICROSOFT BLUE
0007 BORGES, JORGE LUIS 43 3 5678901 001 12/09/2013 2008 DUFEMSCHM Y1500
0007 DEWEY, JOHN & 43 3 6789012 003 12/11/2013 2013 ERTZEVILI X1500
0007 NIETZSCHE, FRIEDRICH 43 3 7890123 004 12/11/2013 2006 NCORPORAT X7
I first built the parser to test each line to see if it were a record. Were it a record, I just cut up the line based on character position with my home-grown substring function. This works just fine.
Then I discovered that I did, indeed, have a regular expression library in my Haskell installation, so I decided to try using regular expressions like I do in F#. That failed miserably, as the library rejects perfectly valid regular expressions.
Then I thought, What about Parsec? But the learning curve for using that is getting steeper the higher I climb, and I find myself wondering if it is the right tool for such a simple task as parsing this report.
So I thought I'd ask some Haskell experts: how would you go about parsing this kind of report? I'm not asking for code, though if you've got some, I'd love to see it. I'm really asking for technique or technology.
Thanks!
P.s. The output is just a colon-separated file with a line of field names at the top of the file, followed by just the records, that can be imported into Excel for the end-user.
Edit:
Thank you all so much for the great comments and answers!
Because I didn't make it clear originally: The first fourteen lines of the example repeat for every page of (print) output, with the number of records varying per page from zero to a full page (looks like 45 records). I apologize for not making that clear earlier, as it will probably affect some of the answers already offered.
My Haskell system currently is limited to Parsec (it doesn't have attoparsec) and Text.Regex.Base and Text.Regex.Posix. I'll have to see about installing attoparsec and/or additional Regex libraries. But for the time being, you've convinced me to keep at learning Parsec. Thank you for the very helpful code examples!
This is definitely a job worthy of a parsing library. My primary goal is normally (i.e., for anything I intend to use more than once or twice) to get the data into a non-textual form ASAP, something like:
module ReportParser where
import Prelude hiding (takeWhile)
import Data.Text hiding (takeWhile)
import Control.Applicative
import Data.Attoparsec.Text
data ReportHeaderData = Company Text
                      | Program Text
                      | State Text
                      -- ...
                      | FieldNames [Text]
data ReportData = ReportData Int Text Int Int Int Int Date Int Text Text
data Date = Date Int Int Int
and we can say, for the sake of argument, that a report is
data Report = Report [ReportHeaderData] [ReportData]
Now, I generally create a parser which is a function of the same name as the data type
-- Ending condition for a field
doubleSpace :: Parser Char
doubleSpace = space >> space
-- Clears leading spaces
clearSpaces :: Parser Text
clearSpaces = takeWhile (== ' ') -- Naively assumes no tabs
-- Throws away everything up to and including a newline character (naively assumes unix line endings)
clearNewline :: Parser ()
clearNewline = (anyChar `manyTill` char '\n') *> pure ()
-- Parse a date
date :: Parser Date
date = Date <$> decimal <*> (char '/' *> decimal) <*> (char '/' *> decimal)
-- Parse a report
reportData :: Parser ReportData
reportData = let f1  = decimal <* clearSpaces
                 f2  = (pack <$> manyTill anyChar doubleSpace) <* clearSpaces
                 f3  = decimal <* clearSpaces
                 f4  = decimal <* clearSpaces
                 f5  = decimal <* clearSpaces
                 f6  = decimal <* clearSpaces
                 f7  = date <* clearSpaces
                 f8  = decimal <* clearSpaces
                 f9  = (pack <$> manyTill anyChar doubleSpace) <* clearSpaces
                 f10 = (pack <$> manyTill anyChar doubleSpace) <* clearNewline
             in ReportData <$> f1 <*> f2 <*> f3 <*> f4 <*> f5 <*> f6 <*> f7 <*> f8 <*> f9 <*> f10
By properly running one of the parse functions and using one of the combinators (such as many, and possibly feed if you end up with a Partial result), you should end up with a list of ReportDatas. You can then convert them to CSV with some function you've created.
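For the colon-separated output described in the question, such a conversion function might look roughly like this (a sketch only; the field order simply follows the ReportData constructor, and intercalate and pack come from Data.Text, which the module above already imports):

reportDataToLine :: ReportData -> Text
reportDataToLine (ReportData f1 f2 f3 f4 f5 f6 (Date d1 d2 d3) f8 f9 f10) =
  intercalate (pack ":")
    [ showT f1, f2, showT f3, showT f4, showT f5, showT f6
    , intercalate (pack "/") [showT d1, showT d2, showT d3]
    , showT f8, f9, f10
    ]
  where
    showT :: Int -> Text
    showT = pack . show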
Note that I didn't deal with the header. It should be relatively trivial to write code to parse it, and build a Report with e.g.
-- Not tested
parseReport = Report <$> (many reportHeader) <*> (many reportData)
Note that I prefer the Applicative form, but it's also possible to use the monadic form if you prefer (I did in doubleSpace). The Alternative class (from Control.Applicative) is also useful, for reasons implied by the name.
For playing with this, I highly recommend GHCI and the parseTest function. GHCI is just overall handy and a good way to test individual parsers, while parseTest takes a parser and input string and outputs the status of the run, the parsed string, and any remaining string not parsed. Very useful when you're not quite sure what's going on.
There are very few languages in which I would recommend using a parser library for something so simple (I've parsed many a file like this using regular expressions in the past), but parsec makes it so easy:
parseLine = do
  first <- count 4 anyChar
  second <- count 4 anyChar
  return (first, second)
parseFile = endBy parseLine (char '\n')
main = interact $ show . parse parseFile "-"
The function "parseLine" creates a parser for an individual line by chaining together two fields made up of fixed length (4 chars, any char will do).
The function "parseFile" then chains these together as a list of lines.
Of course you will have to add more fields, and cut off the header in your data still, but all of this is easy in parsec.
This is arguably much easier to read than regexps....
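A hedged sketch of the header-cutting part, assuming the same Parsec imports as the snippet above. The question's edit says each page begins with fourteen header lines; this only deals with a single page's worth, and the main here replaces the one above so the types stay concrete:

-- Skip one page's worth of header lines, then parse the records as before.
parseFileNoHeader = count 14 (manyTill anyChar newline) >> parseFile

main = interact $ show . parse parseFileNoHeader "-"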
Assuming a few things—that the header is fixed and the field of each line is "double space" delimited—it's really quite easy to implement a parser in Haskell for this file. The end result is probably going to be longer than your regexp (and there are regexp libraries in Haskell if that fits your desire) but it's far more testable and readable. I'll demonstrate some of that while I outline how to build one for this file format.
I'll use Attoparsec. We'll also need to use the ByteString data type (and the OverloadedStrings pragma, which lets Haskell interpret string literals as both String and ByteString) and some combinators from Control.Applicative and Control.Monad.
{-# LANGUAGE OverloadedStrings #-}
import Data.Attoparsec.Char8
import Control.Applicative
import Control.Monad
import qualified Data.ByteString.Char8 as S
First, we'll build a data type representing each record.
data YearMonthDay =
  YearMonthDay { ymdYear  :: Int
               , ymdMonth :: Int
               , ymdDay   :: Int
               }
  deriving ( Show )

data Line =
  Line { agent     :: Int
       , name      :: S.ByteString
       , st        :: Int
       , ud        :: Int
       , targetNum :: Int
       , xyz       :: Int
       , xDate     :: YearMonthDay
       , year      :: Int
       , co        :: S.ByteString
       , encoding  :: S.ByteString
       }
  deriving ( Show )
You could fill in more descriptive types for each field if desired, but this isn't a bad start. Since each line can be parsed independently, I'll do just that. The first step is to build a Parser Line type---read that as a parser type which returns a Line if it succeeds.
To do this, we'll build our Line type "inside of" the Parser using its Applicative interface. That sounds really complex, but it's simple and looks quite pretty. We'll start with the YearMonthDay type as a warm-up
parseYMDWrong :: Parser YearMonthDay
parseYMDWrong =
  YearMonthDay <$> decimal
               <*> decimal
               <*> decimal
Here, decimal is a built-in Attoparsec parser which parses an integral type like Int. You can read this parser as nothing more than "parse three decimal numbers and use them to build my YearMonthDay type" and you'd be basically correct. The (<*>) operator (read as "apply") sequences the parses and collects their results into our YearMonthDay constructor function.
Unfortunately, as I indicated in the type, it's a little bit wrong. In particular, we're currently ignoring the '/' characters which delimit the numbers inside of our YearMonthDay. We fix this by using the "sequence and throw away" operator (<*). It's a visual pun on (<*>) and we use it when we want to perform a parsing action... but we don't want to keep the result.
We use (<*) to augment the first two decimal parsers with their following '/' characters using the built-in char8 parser.
parseYMD :: Parser YearMonthDay
parseYMD =
  YearMonthDay <$> (decimal <* char8 '/')
               <*> (decimal <* char8 '/')
               <*> decimal
And we can test that this is a valid parser using Attoparsec's parseOnly function
>>> parseOnly parseYMD "2013/12/12"
Right (YearMonthDay {ymdYear = 2013, ymdMonth = 12, ymdDay = 12})
We'd like to now generalize this technique to the entire Line parser. There's one hitch, however. We'd like to parse ByteString fields like "SMITH, JOHN" which might contain spaces... while also delimiting each field of our Line by double spaces. This means that we need a special ByteString parser which consumes any character including single spaces... but quits the moment it sees two spaces in a row.
We can build this using the scan combinator. scan allows us to accumulate a state while consuming characters in our parse and determine when to stop that parse on the fly. We'll keep a boolean state—"was the last character a space?"—and stop whenever we see a new space while knowing the previous character was a space too.
parseStringField :: Parser S.ByteString
parseStringField = scan False step where
  step :: Bool -> Char -> Maybe Bool
  step b ' ' | b         = Nothing
             | otherwise = Just True
  step _ _               = Just False
We can again test this little piece using parseOnly. Let's try parsing three string fields.
>>> let p = (,,) <$> parseStringField <*> parseStringField <*> parseStringField
>>> parseOnly p "foo bar baz"
Right ("foo "," bar "," baz")
>>> parseOnly p "foo bar baz quux end"
Right ("foo bar "," baz quux "," end")
>>> parseOnly p "a sentence with no double space delimiters"
Right ("a sentence with no double space delimiters","","")
Depending on your actual file format, this might be perfect. It's worth noting that it leaves trailing spaces (these could be trimmed if desired) and it allows some space delimited fields to be empty. It's easy to continue to fiddle with this piece in order to fix these errors, but I'll leave it for now.
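If the trailing spaces bother you, one possible tweak (not from the original text) is to strip them with spanEnd from Data.ByteString.Char8, which is already imported as S:

-- Like parseStringField, but without the trailing spaces: spanEnd splits
-- off the longest run of spaces at the end, and fst keeps only the part
-- before it.
parseStringFieldTrimmed :: Parser S.ByteString
parseStringFieldTrimmed = fst . S.spanEnd (== ' ') <$> parseStringField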
We can now build our Line parser. Like with parseYMD, we'll follow each field's parser with a delimiting parser, someSpaces which consumes two or more spaces. We'll use the MonadPlus interface to Parser to build this atop the built-in parser space by (1) parsing some spaces and (2) checking to be sure that we got at least two of them.
someSpaces :: Parser Int
someSpaces = do
  sps <- some space
  let count = length sps
  if count >= 2 then return count else mzero
>>> parseOnly someSpaces " "
Right 2
>>> parseOnly someSpaces " "
Right 4
>>> parseOnly someSpaces " "
Left "Failed reading: mzero"
And now we can build the line parser
lineParser :: Parser Line
lineParser =
  Line <$> (decimal <* someSpaces)
       <*> (parseStringField <* someSpaces)
       <*> (decimal <* someSpaces)
       <*> (decimal <* someSpaces)
       <*> (decimal <* someSpaces)
       <*> (decimal <* someSpaces)
       <*> (parseYMD <* someSpaces)
       <*> (decimal <* someSpaces)
       <*> (parseStringField <* someSpaces)
       <*> (parseStringField <* some space)
>>> parseOnly lineParser "0007 SMITH, JOHN 43 3 1234567 001 12/06/2013 2004 ABC SIZE XL "
Right (Line { agent = 7
            , name = "SMITH, JOHN "
            , st = 43
            , ud = 3
            , targetNum = 1234567
            , xyz = 1
            , xDate = YearMonthDay {ymdYear = 12, ymdMonth = 6, ymdDay = 2013}
            , year = 2004
            , co = "ABC "
            , encoding = "SIZE XL "
            })
And then we can just cut off the header and parse each line.
parseFile :: S.ByteString -> [Either String Line]
parseFile = map (parseOnly lineParser) . drop 14 . S.lines
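A hypothetical driver for this (the file name is made up; S.readFile is Data.ByteString.Char8's readFile):

main :: IO ()
main = do
  contents <- S.readFile "report.txt"
  mapM_ print (parseFile contents)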
I am writing a parser for a simple text template language for my project, and I am completely stuck on the opt combinator in uu-parsinglib (version 2.7.3.2, in case that matters). Any ideas on how to use it properly?
Here is a very simplified example that shows my predicament.
{-# LANGUAGE FlexibleContexts #-}
import Text.ParserCombinators.UU hiding (pEnd)
import Text.ParserCombinators.UU.Utils
import Text.ParserCombinators.UU.BasicInstances
pIdentifier :: Parser String
pIdentifier = pMany pLetter
pIfClause :: Parser ((String, String), String, Maybe (String, String), String)
pIfClause = (,,,) <$> pIf <*> pIdentifier <*> pOptionalElse <*> pEnd
pIf :: Parser (String, String)
pIf = pBraces ((,) <$> pToken "if " <*> pIdentifier)
pOptionalElse :: Parser (Maybe (String, String))
pOptionalElse = (((\x y -> Just (x, y)) <$> pElse <*> pIdentifier) `opt` Nothing)
pElse :: Parser String
pElse = pBraces (pToken "else")
pEnd :: Parser String
pEnd = pBraces (pToken "end")
main :: IO ()
main = do
  putStrLn $ show $ runParser "works" pIfClause "{if abc}def{else}ghi{end}"
  putStrLn $ show $ runParser "doesn't work" pIfClause "{if abc}def{end}"
The first string parses properly, but the second fails with an error:
main: Failed parsing 'doesn't work' :
Expected at position LineColPos 0 12 12 expecting one of [Whitespace, "else"] at LineColPos 0 12 12 :
            v
{if abc}def{end}
            ^
The documentation for opt says:
If p can be recognized, the return value of p is used. Otherwise, the value v is used. Note that opt by default is greedy.
What greedy means is explained in the documentation for <<|>:
<<|> is the greedy version of <|>. If its left hand side parser can make any progress then it commits to that alternative.
In your case, the first argument to opt does recognize part of the input, because "{else}" and "{end}" share the prefix "{e". Thus it commits to pElse, which then fails and makes the whole parse fail.
An easy way to solve this is to use ... <|> pure Nothing, as the documentation suggests.
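Concretely, a minimal sketch of that fix, reusing the definitions from the question (only pOptionalElse changes):

pOptionalElse :: Parser (Maybe (String, String))
pOptionalElse = ((\x y -> Just (x, y)) <$> pElse <*> pIdentifier) <|> pure Nothing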