Operating on parsed data with attoparsec - parsing

Background
I've written a logfile parser using attoparsec. All my smaller parsers succeed, as does the composed final parser. I've confirmed this with tests. But I'm stumbling over performing operations with the parsed stream.
What I've tried
I started by trying to pass the successfully parsed input to a function. But all the seems to get is Done (), which I'm presuming means the logfile has been consumed by this point.
prepareStats :: Result Log -> IO ()
prepareStats r =
case r of
Fail _ _ _ -> putStrLn $ "Parsing failed"
Done _ parsedLog -> putStrLn "Success" -- This now has a [LogEntry] array. Do something with it.
main :: IO ()
main = do
[f] <- getArgs
logFile <- B.readFile (f :: FilePath)
let results = parseOnly parseLog logFile
putStrLn "TBC"
What I'm trying to do
I want to accumulate some stats from the logfile as I consume the input. For example, I'm parsing response codes and I'd like to count how many 2** responses there were and how many 4/5** ones. I'm parsing the number of bytes each response returned as Ints, and I'd like to efficiently sum these (sounds like a foldl'?). I've defined a data type like this:
data Stats = Stats {
successfulRequestsPerMinute :: Int
, failingRequestsPerMinute :: Int
, meanResponseTime :: Int
, megabytesPerMinute :: Int
} deriving Show
And I'd like to constantly update that as I parse the input. But the part of performing operations as I consume is where I got stuck. So far print is the only function I've successfully passed output to and it showed the parsing is succeeding by returning Done before printing the output.
My main parser(s) look like this:
parseLogEntry :: Parser LogEntry
parseLogEntry = do
ip <- logItem
_ <- char ' '
logName <- logItem
_ <- char ' '
user <- logItem
_ <- char ' '
time <- datetimeLogItem
_ <- char ' '
firstLogLine <- quotedLogItem
_ <- char ' '
finalRequestStatus <- intLogItem
_ <- char ' '
responseSizeB <- intLogItem
_ <- char ' '
timeToResponse <- intLogItem
return $ LogEntry ip logName user time firstLogLine finalRequestStatus responseSizeB timeToResponse
type Log = [LogEntry]
parseLog :: Parser Log
parseLog = many $ parseLogEntry <* endOfLine
Desired outcome
I want to pass each parsed line to a function that will update the above data type. Ideally I want this to be very memory efficient because it'll be operating on large files.

You have to make your unit of parsing a single log entry rather than a list of log entries.
It's not pretty, but here is an example of how to interleave parsing and processing:
(Depends on bytestring, attoparsec and mtl)
{-# LANGUAGE NoMonomorphismRestriction, FlexibleContexts #-}
import qualified Data.ByteString.Char8 as BS
import qualified Data.Attoparsec.ByteString.Char8 as A
import Data.Attoparsec.ByteString.Char8 hiding (takeWhile)
import Data.Char
import Control.Monad.State.Strict
aWord :: Parser BS.ByteString
aWord = skipSpace >> A.takeWhile isAlphaNum
getNext :: MonadState [a] m => m (Maybe a)
getNext = do
xs <- get
case xs of
[] -> return Nothing
(y:ys) -> put ys >> return (Just y)
loop iresult =
case iresult of
Fail _ _ msg -> error $ "parse failed: " ++ msg
Done x' aword -> do lift $ process aword; loop (parse aWord x')
Partial _ -> do
mx <- getNext
case mx of
Just y -> loop (feed iresult y)
Nothing -> case feed iresult BS.empty of
Fail _ _ msg -> error $ "parse failed: " ++ msg
Done x' aword -> do lift $ process aword; return ()
Partial _ -> error $ "partial returned" -- probably can't happen
process :: Show a => a -> IO ()
process w = putStrLn $ "got a word: " ++ show w
theWords = map BS.pack [ "this is a te", "st of the emergency ", "broadcasting sys", "tem"]
main = runStateT (loop (Partial (parse aWord))) theWords
Notes:
We parse a aWord at a time and call process after each word is recognized.
Use feed to feed the parser more input when it returns a Partial.
Feed the parser an empty string when there is no more input left.
When Done is return, process the recognized word and continue with parse aWord.
getNext is just an example of a monadic function which gets the next unit of input. Replace it with your own version - i.e. something that reads the next line from a file.
Update
Here is a solution using parseWith as #dfeuer suggested:
noMoreInput = fmap null get
loop2 x = do
iresult <- parseWith (fmap (fromMaybe BS.empty) getNext) aWord x
case iresult of
Fail _ _ msg -> error $ "parse failed: " ++ msg
Done x' aword -> do lift $ process aword;
if BS.null x'
then do b <- noMoreInput
if b then return ()
else loop2 x'
else loop2 x'
Partial _ -> error $ "huh???" -- this really can't happen
main2 = runStateT (loop2 BS.empty) theWords

If each log entry is exactly one line, here's a simpler solution:
do loglines <- fmap BS.lines $ BS.readfile "input-file.log"
foldl' go initialStats loglines
where
go stats logline =
case parseOnly yourParser logline of
Left e -> error $ "oops: " ++ e
Right r -> let stats' = ... combine r with stats ...
in stats'
Basically you are just reading the file line-by-line and calling parseOnly on each line and accumulating the results.

This is properly done with a streaming library
main = do
f:_ <- getArgs
withFile f ReadMode $ \h -> do
result <- foldStream $ streamProcess $ streamHandle h
print result
where
streamHandle = undefined
streamProcess = undefined
foldStream = undefined
where the blanks can be filled by any streaming library, e.g.
import qualified Pipes.Prelude as P
import Pipes
import qualified Pipes.ByteString as PB
import Pipes.Group (folds)
import qualified Control.Foldl as L
import Control.Lens (view) -- or import Lens.Simple (view), or whatever
streamHandle = Pipes.ByteStream.fromHandle :: Handle -> Producer ByteString IO ()
in that case we might then divide the labor further thus:
streamProcess :: Producer ByteString m r -> Producer LogEntry m r
streamProcess p = streamLines p >-> lineParser
streamLines :: Producer ByteString m r -> Producer ByteString m r
streamLines p = L.purely fold L.list (view (Pipes.ByteString.lines p)) >-> P.map B.toStrict
lineParser :: Pipe ByteString LogEntry m r
lineParser = P.map (parseOnly line_parser) >-> P.concat -- concat removes lefts
(This is slightly laborious because pipes is sensible persnickety about accumulating lines, and memory generally: we are just trying to get a producer of individual strict bytestring lines, and then to convert that into a producer of parsed lines, and then to throw out bad parses, if there are any. With io-streams or conduit, things will be basically the same, and that particular step will be easier.)
We are now in a position to fold over our Producer LogEntry IO (). This can be done explicitly using Pipes.Prelude.fold, which makes a strict left fold. Here we will just cop the structure from user5402
foldStream str = P.fold go initial_stats id
where
go stats_till_now new_entry = undefined
If you get used to the use of the foldl library and the application of a fold to a Producer with L.purely fold some_fold, then you can build Control.Foldl.Folds for your LogEntries out of components and slot in different requests as you please.
If you use pipes-attoparsec and include the newline bit in your parser, then you can just write
handleToLogEntries :: Handle -> Producer LogEntry IO ()
handleToLogEntries h = void $ parsed my_line_parser (fromHandle h) >-> P.concat
and get the Producer LogEntry IO () more directly. (This ultra-simple way of writing it will, however, stop at a bad parse; dividing on lines first will be faster than using attoparsec to recognize newlines.) This is very simple with io-streams too, you would write something like
import qualified System.IO.Streams as Streams
io :: Handle -> IO ()
io h = do
bytes <- Streams.handleToInputStream h
log_entries <- Streams.parserToInputStream my_line_parser bytes
fold_result <- Stream.fold go initial_stats log_entries
print fold_result
or to keep with the structure above:
where
streamHandle = Streams.handleToInputStream
streamProcess io_bytes =
io_bytes >>= Streams.parserToInputStream my_line_parser
foldStream io_logentries =
log_entries >>= Stream.fold go initial_stats
Either way, my_line_parser should return a Maybe LogEntry and should recognize the newline.

Related

Parser written in Haskell not working as intended

I was playing around with Haskell's parsec library. I was trying to parse a hexadecimal string of the form "#x[0-9A-Fa-f]*" into an integer. This the code I thought would work:
module Main where
import Control.Monad
import Numeric
import System.Environment
import Text.ParserCombinators.Parsec hiding (spaces)
parseHex :: Parser Integer
parseHex = do
string "#x"
x <- many1 hexDigit
return (fst (head (readHex x)))
testHex :: String -> String
testHex input = case parse parseHex "lisp" input of
Left err -> "Does not match " ++ show err
Right val -> "Matched" ++ show val
main :: IO ()
main = do
args <- getArgs
putStrLn (testHex (head args))
And then I tried testing the testHex function in Haskell's repl:
GHCi, version 8.6.5: http://www.haskell.org/ghc/ :? for help
[1 of 1] Compiling Main ( src/Main.hs, interpreted )
Ok, one module loaded.
*Main> testHex "#xcafebeef"
"Matched3405692655"
*Main> testHex "#xnothx"
"Does not match \"lisp\" (line 1, column 3):\nunexpected \"n\"\nexpecting hexadecimal digit"
*Main> testHex "#xcafexbeef"
"Matched51966"
The first and second try work as intended. But in the third one, the string is matching upto the invalid character. I do not want the parser to do this, but rather not match if any digit in the string is not a valid string. Why is this happening, and how do if fix this?
Thank you!
You need to place eof at the end.
parseHex :: Parser Integer
parseHex = do
string "#x"
x <- many1 hexDigit
eof
return (fst (head (readHex x)))
Alternatively, you can compose it with eof where you use it if you want to reuse parseHex in other places.
testHex :: String -> String
testHex input = case parse (parseHex <* eof) "lisp" input of
Left err -> "Does not match " ++ show err
Right val -> "Matched" ++ show val

How do I parse S-expressions into a data structure in Haskell?

I'm new to Haskell and could use some guidance.
The challenge: take an S-expression and parse it into a record.
Where I have succeeded: I can take a file and read it into a parsed String.
Yet, using parsing Text to DFA s.t
let
toDFA :: [L.Text] -> EntryDFA
toDFA t =
let [q,a,d,s,f] = t
in EntryDFA {
state = read q
,alpha = read a
,delta = read d
,start = read s
,final = read f }
returns this error:
• Couldn't match type ‘L.Text’ with ‘[Char]’
Expected type: String
Actual type: L.Text
There must be a more idiomatic approach.
read is a partial function with type Read a => String -> a, which throws an exception on parsing failure. Normally you want to avoid it (use readMaybe instead if you have a string). String and L.Text are different types, which is why you're getting an error.
Your sample code is missing an extra ) after the trans-func.
I'm using the Megaparsec package which provides an easy way to work with parser combinators. The author of the library has written a longer tutorial here.
The basic idea is that Parser a is the type of a value that can parse something of type a. In Text.Megaparsec there are several functions which you can use (parse, parseMaybe etc.), to "run" the parser on a "stringy" data type (e.g. String or strict/lazy Text).
When you use do notation for IO, it means "do one action after another". Similarly, you can use do notation with Parser, it means "parse this one thing, then parse the next thing".
p1 *> p2 means run the parser p1, run p2 and return the result of running p2. p1 <* p2 means run the parser p1, run p2 and return the result of running p1. You can also look up documentation on Hoogle in case you're having trouble understanding something.
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE NamedFieldPuns #-}
-- In practice, many of these imports would be unqualified, but I've
-- opted for explicitness for clarity.
import Control.Applicative (empty, many, some, (<*), (*>))
import Control.Exception (try, IOException)
import Data.Maybe (fromMaybe)
import Data.Set (Set)
import Data.Text (Text)
import qualified Data.Set as Set
import qualified Data.Text as T
import qualified Data.Text.IO as TIO
import qualified Text.Megaparsec as MP
import qualified Text.Megaparsec.Char as MPC
import qualified Text.Megaparsec.Char.Lexer as MPCL
type Q = Text
type E = Char
data EntryDFA = EntryDFA
{ state :: Set Q
, alpha :: Set E
, delta :: Set (Q,E,Q)
, start :: Q
, final :: Set Q
} deriving Show
inputFile = "foo.sexp"
main :: IO ()
main = do
-- read file and check for exception instead of checking if
-- it exists and then trying to read it
result <- try (TIO.readFile inputFile)
case result of
Left e -> print (e :: IOException)
Right txt -> do
case MP.parse dfaParser inputFile txt of
Left e -> print e
Right dfa -> print dfa
type Parser = MP.Parsec () Text
-- There are no comments in the S-exprs, so leave those empty
spaceConsumer :: Parser ()
spaceConsumer = MPCL.space MPC.space1 empty empty
symbol :: Text -> Parser Text
symbol txt = MPCL.symbol spaceConsumer txt
parens :: Parser a -> Parser a
parens p = MP.between (symbol "(") (symbol ")") p
setP :: Ord a => Parser a -> Parser (Set a)
setP p = do
items <- parens (p `MP.sepBy1` (symbol ","))
return (Set.fromList items)
pair :: Parser a -> Parser b -> Parser (a, b)
pair p1 p2 = parens $ do
x1 <- p1
x2 <- symbol "," *> p2
return (x1, x2)
stateP :: Parser Text
stateP = do
c <- MPC.letterChar
cs <- many MPC.alphaNumChar
return (T.pack (c:cs))
dfaParser :: Parser EntryDFA
dfaParser = do
() <- spaceConsumer
(_, state) <- pair (symbol "states") (setP stateP)
(_, alpha) <- pair (symbol "alpha") (setP alphaP)
(_, delta) <- pair (symbol "trans-func") (setP transFuncP)
(_, start) <- pair (symbol "start") valP
(_, final) <- pair (symbol "final") (setP valP)
return (EntryDFA {state, alpha, delta, start, final})
where
alphaP :: Parser Char
alphaP = MPC.letterChar <* spaceConsumer
transFuncP :: Parser (Text, Char, Text)
transFuncP = parens $ do
s1 <- stateP
a <- symbol "," *> alphaP
s2 <- symbol "," *> stateP
return (s1, a, s2)
valP :: Parser Text
valP = fmap T.pack (some MPC.digitChar)

Lazy parsing of large files with millions of data points

I have a parser I'm trying to write and I've gone over multiple versions of it and I can't seem to keep the memory usage down. I'm trying to parse the wikipedia sql dumps and in this example take the page entries file and throw them all in one giant vector (13 million pages). I got this working before, but I sharded the file into smaller pieces and aggregated them from the filesystem. Anyway, below is my attempt to parse these files in one pass. I'm pulling out the lines "INSERT INTO ... (data),(data)..(data);". I'm letting the other lines just fail in the parser and not yield values. The insert lines average 1MB in size and 8k entries.
I've tried what feels like a thousand versions of this with deepseqs and outputting lists of page entries per line instead of the vector I have now and using lineC with a partial parser instead of the linesUnboundedC. I assume there must be some high level concept I'm missing that is keeping the memory usage exponential. The file is 5GB and I'm easily blowing through 16GB of memory, but I can't seem to nail down a lazy thunk memory leak anywhere. I would assume that this code would be able to store each line independently and then move on without the extra memory to the next line, but I can't figure out what is wrong. I had no issue counting the amount of entries or using a more complex monoid to get data from the file using the same parser.
Any help is much appreciated!
{-# LANGUAGE OverloadedStrings #-}
module Parser.Helper where
import Conduit
import Control.DeepSeq
import Control.Monad (void)
import Control.Monad.Primitive
import Data.Attoparsec.Text
import Data.Maybe (catMaybes)
import Data.Text (Text)
import qualified Data.Vector as V
import Data.Vector.Generic (Vector)
import System.IO (openFile,IOMode(..))
data Page = Page !Int !Text !Bool
deriving Show
instance NFData Page where
rnf (Page pid title isRedirect) = pid `seq` title `seq` isRedirect `seq` ()
parseFile
:: (Foldable t, Vector v e)
=> FilePath -> Parser (t e) -> IO [v e]
parseFile fp parser =
do
handle <- openFile fp ReadMode
runConduit $ sourceHandle handle
.| decodeUtf8LenientC
.| peekForeverE linesUnboundedC
.| vectors parser
.| sinkList
vectors
:: (Foldable t, Vector v e, MonadBase base m, Control.Monad.Primitive.PrimMonad base)
=> Parser (t e) -> ConduitM Text (v e) m ()
vectors parser =
do
vectorBuilderC (1024*1024)
(\f ->
peekForeverE $ do
ml <- await
case ml of
Nothing -> return ()
Just l ->
case parseOnly parser l of
Left _ -> return ()
Right v -> mapM_ f v
)
parsePageLine :: Parser (V.Vector Page)
parsePageLine =
do
string "INSERT INTO" *> skipWhile (/= '(')
V.fromList . catMaybes <$> sepBy' parsePageField (char ',') <* char ';'
parsePageField :: Parser (Maybe Page)
parsePageField =
do
void $ char '('
pid <- parseInt <* char ','
namespace <- parseInt <* char ','
title <- parseTextField <* char ','
_ <- skipField <* char ','
_ <- skipField <* char ','
redirect <- parseInt <* char ','
void $ sepBy' skipField (char ',')
void $ char ')'
let ret = case namespace == 0 of
True -> Just $ Page pid title (redirect == 1)
False -> Nothing
return $ force ret
parseTextField :: Parser Text
parseTextField = char '\'' *> scan False f <* char '\''
where
f :: Bool -> Char -> Maybe Bool
f False '\'' = Nothing
f False '\\' = Just True
f _ _ = Just False
skipField :: Parser ()
skipField = void $ scan False f
where
f :: Bool-> Char -> Maybe Bool
f False '\\' = Just True
f False ',' = Nothing
f False ')' = Nothing
f _ _ = Just False
parseInt :: Parser Int
parseInt = signed $ decimal

parsec produce strange errors when trying to handle Maybe as ParseError

If I have this code :
import Text.Parsec
ispositive a = if (a<0) then Nothing else (Just a)
f a b = a+b
parserfrommaybe :: String -> (Maybe c) -> Parsec a b c
parserfrommaybe msg Nothing = fail msg
parserfrommaybe _ (Just res) = return res
(<!>) :: Parsec a b (Maybe c) -> String -> Parsec a b c
(<!>) p1 msg = p1 >>= (parserfrommaybe msg)
integermaybeparser = (ispositive <$> integer) <!> "negative numbers are not allowed"
testparser = f <$> (integermaybeparser <* whiteSpace) <*> integermaybeparser
when I test testparser with input like this "-1 3" it gives :
Left (line 1, column 4):
unexpected "3"
negative numbers are not allowed
I expected it to give error on Column 1 and give the error message without the sentence "unexpected 3" but it seems parsec continued parsing.
Why did this happen ? and how to make parsec give the error message I expect ?
I have found the solution, the cause of is that the first parser gets run and consumes input even when failing.
The solution was to use lookAhead like this:
(<!>) :: (Monad m,Stream a m t) => ParsecT a b m (Maybe c) -> String -> ParsecT a b m c
(<!>) p1 msg = ((lookAhead p1) >>= (parserfrommaybe msg)) *> (p1 >>= (parserfrommaybe msg))
if lookAhead p1 returns Nothing then the first argument of *> would fail without consuming input because of lookAhead, now if lookAhead p1 returns Just res then it would succeed again without consuming input and the result would be obtained from the second argument of *>.
ofcourse I had to change parserfrommaybe type annotation to (Monad m) => String -> (Maybe c) -> ParsecT a b m c to satisfy ghc.

Checking that a price value in a string is in the correct format

I use n <- getLine to get from user price. How can I check is value correct ? (Price can have '.' and digits and must be greater than 0) ?
It doesn't work:
isFloat = do
n <- getLine
let val = case reads n of
((v,_):_) -> True
_ -> False
If The Input Is Always Valid Or Exceptions Are OK
If you have users entering decimal numbers in the form of "123.456" then this can simply be converted to a Float or Double using read:
n <- getLine
let val = read n
Or in one line (having imported Control.Monad):
n <- liftM read getLine
To Catch Erroneous Input
The above code fails with an exception if the users enter invalid entries. If that's a problem then use reads and listToMaybe (from Data.Maybe):
n <- liftM (fmap fst . listToMaybe . reads) getLine
If that code looks complex then don't sweat it - the below is the same operation but doing all the work with explicit case statements:
n <- getLine
let val = case reads n of
((v,_):_) -> Just v
_ -> Nothing
Notice we pattern match to get the first element of the tuple in the head of the list, The head of the list being (v,_) and the first element is v. The underscore (_) just means "ignore the value in this spot".
If Floating Point Isn't Acceptable
Floating values are well known to be approximate, and not suitable for real world financial computations (but perhaps homework, depending on your professor). In this case you'd want to read the values into a Rational (from Data.Ratio).
n <- liftM maybeRational getLine
...
where
maybeRational :: String -> Maybe Rational
maybeRational str =
let (a,b) = break (=='.') str
in liftM2 (%) (readMaybe a) (readMaybe $ drop 1 b)
readMaybe = fmap fst . listToMaybe . reads
In addition to the parsing advice provided by TomMD, consider using the appropriate monad for error reporting. It allows you to conveniently chain computations which can fail, avoiding explicit error checking on every step.
{-# LANGUAGE FlexibleContexts #-}
import Control.Monad.Error
parsePrice :: MonadError String m => String -> m Double
parsePrice s = do
x <- case reads s of
[(x, "")] -> return x
_ -> throwError "Not a valid real number."
when (x <= 0) $ throwError "Price must be positive."
return x
main = do
n <- getLine
case parsePrice n of
Left err -> putStrLn err
Right x -> putStrLn $ "Price is " ++ show x

Resources