Parsing multi line log with attoparsec - parsing

I'm trying to parse a multiline log like this
[xxx] This is 1
[xxx] This is also 1
[yyy] This is 2
I have these types defined
{-# LANGUAGE OverloadedStrings #-}
module Parser where
import Prelude hiding(takeWhile)
import Data.Text
import Data.Word
import Data.Attoparsec.Text as T
import Data.Char
import Data.String
data ID = ID String deriving (Eq, Show)
data Entry = Entry ID String deriving (Eq, Show)
data Block = Block ID [String]
data Log = Log [Block]
And defined these parsers:
parseID :: Parser ID
parseID = do
char '['
id <- takeTill ( == ']' )
char ']'
return $ ID $ unpack id
parseEntry :: Parser Entry
parseEntry = do
id <- parseID
char ' '
content <- takeTill isEndOfLine
return $ Entry id (unpack content)
This works ok when I do stuff like parseOnly parseEntry entryString and I get back an Entry.
The problem is when I try to parse something like the log I added at the start.
I would get a [Entry] but I would like to get [Block].
Also I want that when 2 or more consecutive lines have the same ID (like xxx) the should get stored into the same block, so for parsing the aforementioned log I'd like to get back
[block1, block2]
-- block1 == Block "xxx" ["This is 1", "This is also 1"]
-- block2 == Block "yyy" ["This is 2"]
How can I make the parser create new blocks or add into the last generated one depending on if the ID changes?
One obvious solution is to simply generate a [Entry] and then use a folding function to convert it to [Block] with the proper logic, but I'd be doing 2 passes, 1 over the log and another over the [Entry] which seems not only is not too performant for large logs but also feels like the wrong way to do it (from my very limited attoparsec knowledge)
Any other ideas?
EDIT
Bob Dalgleish solution essentially works (many thanks!!!), just needed a few tweaks to make it work.
This is my final solution:
data ID = ID String deriving (Eq, Show)
data Entry = Entry ID String deriving (Eq, Show)
data Block = Block ID [String] deriving (Eq, Show)
data Log = Log [Block] deriving (Eq, Show)
parseID :: Parser ID
parseID = do
char '['
id <- takeTill ( == ']' )
char ']'
return $ ID $ unpack id
parseEntry :: Parser Entry
parseEntry = do
id <- parseID
char ' '
content <- takeTill isEndOfLine
return $ Entry id (unpack content)
parseEntryFor :: ID -> Parser Entry
parseEntryFor blockId = do
id <- parseID
if blockId == id
then do
char ' '
content <- takeTill isEndOfLine
endOfLine <|> endOfInput
return $ Entry id (unpack content)
else fail "nonmatching id"
parseBlock :: Parser Block
parseBlock = do
(Entry entryId s) <- parseEntry
let newBlock = Block entryId [s]
endOfLine <|> endOfInput
entries <- many' (parseEntryFor entryId)
return $ Block entryId (s : Prelude.map (\(Entry _ s') -> s') entries)

You need to have a parser for Blocks. It accepts an Entry, does a lookahead for an Entry with the same id; if not the same, it backtracks and returns what it has so far.
First, let's introduce a conditional Entry parser:
parseEntryFor :: ID -> Parser Entry
parseEntryFor blockId = do
id <- parseEntry
if blockId == id
then do
char ' '
content <- takeTill isEndOfLine
endOfLine
return $ Entry id (unpack content)
else fail "nonmatching id"
-- |A Block consists of one or more Entry's with the same ID
parseBlock :: Parser Block
parseBlock = do
(Entry entryId s) <- parseEntry
let newBlock = Block entryId [s]
endOfLine
entries <- many' (parseEntryFor entryId)
return $ Block entryId s: (map (\(Entry _ s') -> x') entries)
(This code is not tested, as I have only ever used Parsec.)

Related

Couldn't match type error when running text.parsec.indent example

I would like to learn how to Parse an Indentation-Sensitive Language, but I get Couldn't match type errors.
I installed the dependencied by cabal install parsec and cabal install indents, and tried to run this example code:
module Main where
-- First, import all the needed modules.
import Text.Parsec hiding (State)
import Text.Parsec.Indent
import Control.Monad.State
-- Next, define our new Parser type. This replaces the Identity monad
-- with the (State SourcePos) monad.
type IParser a = ParsecT String () (State SourcePos) a
-- Now we define our new parse function. This one accepts an IParser
-- (which we've just defined) instead of a Parser.
iParse :: IParser a -> SourceName -> String -> Either ParseError a
iParse aParser source_name input =
runIndent source_name $ runParserT aParser () source_name input
-- Define our sample input string. Note: the unlines function joins
-- strings together with newline characters.
input_text :: String
input_text = unlines [
"listName:",
" item1",
" item2",
" item3"
]
-- Define main. It parses the input text and prints the parsed value. If
-- there was an error, it prints the error.
main :: IO ()
main = do
case iParse aNamedList "indented_example" input_text of
Left err -> print err
Right result -> putStrLn $ "I parsed: " ++ show result
-- Define a datatype to hold our parsed value.
data NamedList = NamedList Name [Item]
deriving (Show)
-- Define what we mean by 'Name' and 'Item'. In this case, they are both
-- strings.
type Name = String
type Item = String
-- Define how we parse a NamedList. A Named list is a Name and a list of
-- Items contained in the NamedList data structure. Read more about the
-- withBlock function here:
-- http://hackage.haskell.org/packages/archive/indents/0.3.2/doc/html/Text-Parsec-Indent.html#v:withBlock
aNamedList :: IParser NamedList
aNamedList = do
b <- withBlock NamedList aName anItem
spaces
return b
-- A name is an alpha-numeric string followed by a ':' and some
-- whitespace.
aName :: IParser Name
aName = do
s <- many1 alphaNum
_ <- char ':'
spaces
return s
-- An item is an alpha-numeric string followed by some whitespace.
anItem :: IParser Item
anItem = do
i <- many1 alphaNum
spaces
return i
-- Output:
-- > runhaskell -Wall indented_parsec_example.hs
-- I parsed: NamedList "listName" ["item1","item2","item3"]
I get this error:
Main.hs:19:13: error:
* Couldn't match type `[Char]'
with `Control.Monad.Trans.Reader.ReaderT
Text.Parsec.Indent.Internal.Indentation
Data.Functor.Identity.Identity
(State SourcePos (Either ParseError a) -> Either ParseError a)'
Expected type: IndentT
Data.Functor.Identity.Identity
(State SourcePos (Either ParseError a) -> Either ParseError a)
Actual type: SourceName
* In the first argument of `runIndent', namely `source_name'
In the expression: runIndent source_name
In the expression:
runIndent source_name $ runParserT aParser () source_name input
* Relevant bindings include
aParser :: IParser a (bound at Main.hs:18:8)
iParse :: IParser a -> SourceName -> String -> Either ParseError a
(bound at Main.hs:18:1)
Main.hs:60:8: error:
* Couldn't match type `Control.Monad.Trans.Reader.ReaderT
Text.Parsec.Indent.Internal.Indentation m0'
with `StateT SourcePos Data.Functor.Identity.Identity'
Expected type: ParsecT String () (State SourcePos) NamedList
Actual type: IndentParserT String () m0 NamedList
* In a stmt of a 'do' block: b <- withBlock NamedList aName anItem
In the expression:
do { b <- withBlock NamedList aName anItem;
spaces;
return b }
In an equation for `aNamedList':
aNamedList
= do { b <- withBlock NamedList aName anItem;
spaces;
return b }
Main.hs:60:28: error:
* Couldn't match type `StateT
SourcePos Data.Functor.Identity.Identity'
with `Control.Monad.Trans.Reader.ReaderT
Text.Parsec.Indent.Internal.Indentation m0'
Expected type: IndentParserT String () m0 Name
Actual type: IParser Name
* In the second argument of `withBlock', namely `aName'
In a stmt of a 'do' block: b <- withBlock NamedList aName anItem
In the expression:
do { b <- withBlock NamedList aName anItem;
spaces;
return b }
Main.hs:60:34: error:
* Couldn't match type `StateT
SourcePos Data.Functor.Identity.Identity'
with `Control.Monad.Trans.Reader.ReaderT
Text.Parsec.Indent.Internal.Indentation m0'
Expected type: IndentParserT String () m0 Item
Actual type: IParser Item
* In the third argument of `withBlock', namely `anItem'
In a stmt of a 'do' block: b <- withBlock NamedList aName anItem
In the expression:
do { b <- withBlock NamedList aName anItem;
spaces;
return b }
I have no idea why I get these errors, I used the exact same example code. What do I do wrong?
That tutorial was written before version 0.4.0.0 of indents was released, which made a lot of breaking changes. Downgrade to indents-0.3.3 for it to work as written.

Parser written in Haskell not working as intended

I was playing around with Haskell's parsec library. I was trying to parse a hexadecimal string of the form "#x[0-9A-Fa-f]*" into an integer. This the code I thought would work:
module Main where
import Control.Monad
import Numeric
import System.Environment
import Text.ParserCombinators.Parsec hiding (spaces)
parseHex :: Parser Integer
parseHex = do
string "#x"
x <- many1 hexDigit
return (fst (head (readHex x)))
testHex :: String -> String
testHex input = case parse parseHex "lisp" input of
Left err -> "Does not match " ++ show err
Right val -> "Matched" ++ show val
main :: IO ()
main = do
args <- getArgs
putStrLn (testHex (head args))
And then I tried testing the testHex function in Haskell's repl:
GHCi, version 8.6.5: http://www.haskell.org/ghc/ :? for help
[1 of 1] Compiling Main ( src/Main.hs, interpreted )
Ok, one module loaded.
*Main> testHex "#xcafebeef"
"Matched3405692655"
*Main> testHex "#xnothx"
"Does not match \"lisp\" (line 1, column 3):\nunexpected \"n\"\nexpecting hexadecimal digit"
*Main> testHex "#xcafexbeef"
"Matched51966"
The first and second try work as intended. But in the third one, the string is matching upto the invalid character. I do not want the parser to do this, but rather not match if any digit in the string is not a valid string. Why is this happening, and how do if fix this?
Thank you!
You need to place eof at the end.
parseHex :: Parser Integer
parseHex = do
string "#x"
x <- many1 hexDigit
eof
return (fst (head (readHex x)))
Alternatively, you can compose it with eof where you use it if you want to reuse parseHex in other places.
testHex :: String -> String
testHex input = case parse (parseHex <* eof) "lisp" input of
Left err -> "Does not match " ++ show err
Right val -> "Matched" ++ show val

Parsec fails without error if reading from file

I wrote a small parsec parser to read samples from a user supplied input string or an input file. It fails properly on wrong input with a useful error message if the input is provided as a semicolon separated string:
> readUncalC14String "test1,7444,37;6800,36;testA,testB,2000,222;test3,7750,40"
*** Exception: Error in parsing dates from string: (line 1, column 29):
unexpected "t"
expecting digit
But it fails silently for the input file inputFile.txt with identical entries:
test1,7444,37
6800,36
testA,testB,2000,222
test3,7750,40
> readUncalC14FromFile "inputFile.txt"
[UncalC14 "test1" 7444 37,UncalC14 "unknownSampleName" 6800 36]
Why is that and how can I make readUncalC14FromFile fail in a useful manner as well?
Here is a minimal subset of my code:
import qualified Text.Parsec as P
import qualified Text.Parsec.String as P
data UncalC14 = UncalC14 String Int Int deriving Show
readUncalC14FromFile :: FilePath -> IO [UncalC14]
readUncalC14FromFile uncalFile = do
s <- readFile uncalFile
case P.runParser uncalC14SepByNewline () "" s of
Left err -> error $ "Error in parsing dates from file: " ++ show err
Right x -> return x
where
uncalC14SepByNewline :: P.Parser [UncalC14]
uncalC14SepByNewline = P.endBy parseOneUncalC14 (P.newline <* P.spaces)
readUncalC14String :: String -> Either String [UncalC14]
readUncalC14String s =
case P.runParser uncalC14SepBySemicolon () "" s of
Left err -> error $ "Error in parsing dates from string: " ++ show err
Right x -> Right x
where
uncalC14SepBySemicolon :: P.Parser [UncalC14]
uncalC14SepBySemicolon = P.sepBy parseOneUncalC14 (P.char ';' <* P.spaces)
parseOneUncalC14 :: P.Parser UncalC14
parseOneUncalC14 = do
P.try long P.<|> short
where
long = do
name <- P.many (P.noneOf ",")
_ <- P.oneOf ","
mean <- read <$> P.many1 P.digit
_ <- P.oneOf ","
std <- read <$> P.many1 P.digit
return (UncalC14 name mean std)
short = do
mean <- read <$> P.many1 P.digit
_ <- P.oneOf ","
std <- read <$> P.many1 P.digit
return (UncalC14 "unknownSampleName" mean std)
What is happening here is that a prefix of your input is a valid string. To force parsec to use the whole input you can use the eof parser:
uncalC14SepByNewline = P.endBy parseOneUncalC14 (P.newline <* P.spaces) <* P.eof
The reason that one works and the other doesn't is due to the difference between sepBy and endBy. Here is a simpler example:
sepTest, endTest :: String -> Either P.ParseError String
sepTest s = P.runParser (P.sepBy (P.char 'a') (P.char 'b')) () "" s
endTest s = P.runParser (P.endBy (P.char 'a') (P.char 'b')) () "" s
Here are some interesting examples:
ghci> sepTest "abababb"
Left (line 1, column 7):
unexpected "b"
expecting "a"
ghci> endTest "abababb"
Right "aaa"
ghci> sepTest "ababaa"
Right "aaa"
ghci> endTest "ababaa"
Left (line 1, column 6):
unexpected "a"
expecting "b"
As you can see both sepBy and endBy can fail silently, but sepBy fails silently if the prefix doesn't end in the separator b and endBy fails silently if the prefix doesn't end in the main parser a.
So you should use eof after both parsers if you want to make sure you read the whole file/string.

Parsec Parsing list of different kind of statements

I'm trying to parse (for now) a subset of the Dot language.
The grammar is here and my code is the following
import System.Environment
import System.IO
import qualified Text.Parsec.Token as P
import Text.ParserCombinators.Parsec.Char -- for letter
import Text.Parsec
import qualified Control.Applicative as App
import Lib
type Id = String
data Dot = Undirected Id Stmts
| Directed Id Stmts
deriving (Show)
data Stmt = NodeStmt Node | EdgeStmt Edges
deriving (Show)
type Stmts = [Stmt]
data Node = Node Id Attributes deriving (Show)
data Edge = Edge Id Id deriving (Show)
type Edges = [Edge]
data Attribute = Attribute Id Id deriving (Show)
type Attributes = [Attribute]
dotDef :: P.LanguageDef st
dotDef = P.LanguageDef
{ P.commentStart = "/*"
, P.commentEnd = "*/"
, P.commentLine = "//"
, P.nestedComments = True
, P.identStart = letter
, P.identLetter = alphaNum
, P.reservedNames = ["node", "edge", "graph", "digraph", "subgraph", "strict" ]
, P.caseSensitive = True
, P.opStart = oneOf "-="
, P.opLetter = oneOf "->"
, P.reservedOpNames = []
}
lexer = P.makeTokenParser dotDef
brackets = P.brackets lexer
braces = P.braces lexer
identifier = P.identifier lexer
reserved = P.reserved lexer
semi = P.semi lexer
comma = P.comma lexer
reservedOp = P.reservedOp lexer
eq_op = reservedOp "="
undir_edge_op = reservedOp "--"
dir_edge_op = reservedOp "->"
edge_op = undir_edge_op <|> dir_edge_op
-- -> Attribute
attribute = do
id1 <- identifier
eq_op
id2 <- identifier
optional (semi <|> comma)
return $ Attribute id1 id2
a_list = many attribute
bracked_alist =
brackets $ option [] a_list
attributes =
do
nestedAttributes <- many1 bracked_alist
return $ concat nestedAttributes
nodeStmt = do
nodeName <- identifier
attr <- option [] attributes
return $ NodeStmt $ Node nodeName attr
dropLast = reverse . tail . reverse
edgeStmt = do
nodes <- identifier `sepBy1` edge_op
return $ EdgeStmt $ fmap (\x -> Edge (fst x) (snd x)) (zip (dropLast nodes) (tail nodes))
stmt = do
x <- nodeStmt <|> edgeStmt
optional semi
return x
stmt_list = many stmt
graphDecl = do
reserved "graph"
varName <- option "" identifier
stms <- braces stmt_list
return $ Undirected varName stms
digraphDecl = do
reserved "digraph"
varName <- option "" identifier
stms <- braces stmt_list
return $ Directed varName stms
topLevel3 = do
spaces
graphDecl <|> digraphDecl
main :: IO ()
main = do
(file:_) <- getArgs
content <- readFile file
case parse topLevel3 "" content of
Right g -> print g
Left err -> print err
Given this input
digraph PZIFOZBO{
a[toto = bar] b ; c ; w // 1
a->b // 2
}
It works fine if line 1 or line 2 is commented, but if both are enabled, it fails with
(line 3, column 10): unexpected "-" expecting identifier or "}"
My understanding it that the parser picks first matching rule (with backtracking). Here both edge and node statement starts with and identifier, so it always pick this one.
I tried reversing the order in stmt, without any luck.
I also tried to sprinkle some try in stmt, nodeStmt and edgeStmt, without luck either.
Any help appreciated.
Note that I get the same error whether or not line 1 is commented out, so:
digraph PZIFOZBO{
a->b
}
also says unexpected "-".
As I think you have correctly diagnosed, the problem here is that the stmt parser tries nodeStmt first. That succeeds and parses "a", leaving "->b" yet to be consumed, but ->b isn't a valid statement. Note that Parsec does not backtrack automatically in the absence of a try, so it's not going to go back and revisit this decisions when it "discovers" that ->b can't be parsed.
You can "fix" this problem by swapping the order in stmt:
x <- edgeStmt <|> nodeStmt
but now the parse will break on an expression like a[toto = bar]. That's because edgeStmt is buggy. It parses "a" as a valid statement EdgeStmt [] because sepBy1 allows a single edge "a", which isn't what you want.
If you rewrite edgeStmt to require at least one edge:
import Control.Monad (guard)
edgeStmt = do
nodes <- identifier `sepBy1` edge_op
guard $ length nodes > 1
return $ EdgeStmt $ fmap (\x -> Edge (fst x) (snd x)) (zip (dropLast nodes) (tail nodes))
and adjust stmt to "try" an edge statement first and backtrack to a node statement:
stmt = do
x <- try edgeStmt <|> nodeStmt
optional semi
return x
then your example compiles fine.

Haskell read variable name

I need to write a code that parses some language. I got stuck on parsing variable name - it can be anything that is at least 1 char long, starts with lowercase letter and can contain underscore '_' character. I think I made a good start with following code:
identToken :: Parser String
identToken = do
c <- letter
cs <- letdigs
return (c:cs)
where letter = satisfy isLetter
letdigs = munch isLetter +++ munch isDigit +++ munch underscore
num = satisfy isDigit
underscore = \x -> x == '_'
lowerCase = \x -> x `elem` ['a'..'z'] -- how to add this function to current code?
ident :: Parser Ident
ident = do
_ <- skipSpaces
s <- identToken
skipSpaces; return $ s
idents :: Parser Command
idents = do
skipSpaces; ids <- many1 ident
...
This function however gives me a weird results. If I call my test function
test_parseIdents :: String -> Either Error [Ident]
test_parseIdents p =
case readP_to_S prog p of
[(j, "")] -> Right j
[] -> Left InvalidParse
multipleRes -> Left (AmbiguousIdents multipleRes)
where
prog :: Parser [Ident]
prog = do
result <- many ident
eof
return result
like this:
test_parseIdents "test"
I get this:
Left (AmbiguousIdents [(["test"],""),(["t","est"],""),(["t","e","st"],""),
(["t","e","st"],""),(["t","est"],""),(["t","e","st"],""),(["t","e","st"],""),
(["t","e","s","t"],""),(["t","e","s","t"],""),(["t","e","s","t"],""),
(["t","e","s","t"],""),(["t","e","s","t"],""),(["t","e","s","t"],""),
(["t","e","s","t"],""),(["t","e","s","t"],""),(["t","e","s","t"],""),
(["t","e","s","t"],""),(["t","e","s","t"],""),(["t","e","s","t"],""),
(["t","e","s","t"],""),(["t","e","s","t"],""),(["t","e","s","t"],""),
(["t","e","s","t"],""),(["t","e","s","t"],""),(["t","e","s","t"],""),
(["t","e","s","t"],""),(["t","e","s","t"],""),(["t","e","s","t"],""),
(["t","e","s","t"],""),(["t","e","s","t"],""),(["t","e","s","t"],"")])
Note that Parser is just synonym for ReadP a.
I also want to encode in the parser that variable names should start with a lowercase character.
Thank you for your help.
Part of the problem is with your use of the +++ operator. The following code works for me:
import Data.Char
import Text.ParserCombinators.ReadP
type Parser a = ReadP a
type Ident = String
identToken :: Parser String
identToken = do c <- satisfy lowerCase
cs <- letdigs
return (c:cs)
where lowerCase = \x -> x `elem` ['a'..'z']
underscore = \x -> x == '_'
letdigs = munch (\c -> isLetter c || isDigit c || underscore c)
ident :: Parser Ident
ident = do _ <- skipSpaces
s <- identToken
skipSpaces
return s
test_parseIdents :: String -> Either String [Ident]
test_parseIdents p = case readP_to_S prog p of
[(j, "")] -> Right j
[] -> Left "Invalid parse"
multipleRes -> Left ("Ambiguous idents: " ++ show multipleRes)
where prog :: Parser [Ident]
prog = do result <- many ident
eof
return result
main = print $ test_parseIdents "test_1349_zefz"
So what went wrong:
+++ imposes an order on its arguments, and allows for multiple alternatives to succeed (symmetric choice). <++ is left-biased so only the left-most option succeeds -> this would remove the ambiguity in the parse, but still leaves the next problem.
Your parser was looking for letters first, then digits, and finally underscores. Digits after underscores failed, for example. The parser had to be modified to munch characters that were either letters, digits or underscores.
I also removed some functions that were unused and made an educated guess for the definition of your datatypes.

Resources