Parsing data with Parsec and omitting comments - parsing

I am trying to write a Haskell Parsec parser that parses input data from a file into the LogLine datatype as follows:
-- Final parser that holds the individual parsers.
final :: Parser [LogLine]
final = do
    logLines <- sepBy1 logLine eol
    return logLines
-- The LogLine token declaration
logLine :: Parser LogLine
logLine = do
    name <- plainValue          -- parse the name (identifier)
    many1 space                 -- parse and throw away the spaces
    args1 <- bracketedValue     -- parse the first arguments
    many1 space                 -- throw away the second run of spaces
    args2 <- bracketedValue     -- parse the second list of arguments
    many1 space                 -- throw away the third run of spaces
    constant <- plainValue      -- parse the constant identifier
    space
    weighting <- plainValue     -- parse the weighting double
    space
    return $ LogLine name args1 args2 constant weighting
It parses everything just fine, but now I need to add comments to the file, and I have to modify the parser so that it ignores them.
It should support only single-line comments, beginning with "--" and ending with a '\n'.
I've tried defining the comment token as follows:
comments :: Parser String
comments = do
    string "--"
    comment <- manyTill anyChar newline
    return ""
And then plugging it into the final parser like so:
final :: Parser [LogLine]
final = do
    optional comments
    logLines <- sepBy1 logLine (comments<|>newline)
    optional comments
    return logLines
It compiles fine, but it does not parse. I've tried several minor modifications but the best result was parsing everything up to the first comment, so I'm beginning to think that this is not the way to do it.
PS:
I've seen this Similar Question, but it is slightly different from what I'm trying to achieve.

If I understand your description of the format in your comment correctly, your example for the format would be
name arg1 arg2 c1 weight
-- comment goes here
optionally followed by further log-lines and/or comments.
Then your problem is that there is a newline between the log-line and the comment line, which means that the comments part of the separator parser fails - comments must start with "--" - without consuming input, so newline is tried and succeeds. Then the next line begins with "--" which makes plainValue fail without consuming input, and thus ends the sepBy1.
The solution is to let the separator first consume a newline, and then as many comment lines as follow:
final = do
    skipMany comments
    sepEndBy1 logLine (newline >> skipMany comments)
By allowing the sequence to be ended by a separator (sepEndBy1 instead of sepBy1), any comment lines after the final LogLine are automatically skipped.
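To see the separator trick in isolation, here is a minimal, self-contained sketch. It is not the asker's real parser: LogLine is replaced by a hypothetical stand-in (a list of words), and logLine, comment and parseLog are illustrative names. It also assumes every comment line ends with a newline.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- Hypothetical stand-in for the asker's LogLine: just the words on the line.
type LogLine = [String]

-- A log line: one or more alphanumeric words separated by single spaces.
logLine :: Parser LogLine
logLine = many1 alphaNum `sepBy1` char ' '

-- A comment line: "--" up to and including the terminating newline.
comment :: Parser ()
comment = string "--" >> manyTill anyChar newline >> return ()

-- Skip leading comments, then let the separator eat the newline
-- plus any comment lines that follow it.
final :: Parser [LogLine]
final = do
  skipMany comment
  sepEndBy1 logLine (newline >> skipMany comment)

parseLog :: String -> Either ParseError [LogLine]
parseLog = parse (final <* eof) "(input)"
```

For example, parseLog "-- header\na b\n-- note\nc d\n" should yield the two log lines and drop both comments.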

The way I understand your problem, each line is either a comment or a logLine. If so, final should look something like this:
final :: Parser [LogLine]
final = do
    logLines <- sepBy1 (comment<|>logLine) newline
    return logLines

Related

Why Parsec's sepBy stops and does not parse all elements?

I am trying to parse some comma separated string which may or may not contain a string with image dimensions. For example "hello world, 300x300, good bye world".
I've written the following little program:
{-# LANGUAGE OverloadedStrings #-}
import Data.Text (Text)
import Text.Parsec
import qualified Text.Parsec.Text as PS
parseTestString :: Text -> [Maybe (Int, Int)]
parseTestString s = case parse dimensStringParser "" s of
    Left _ -> [Nothing]
    Right dimens -> dimens
dimensStringParser :: PS.Parser [Maybe (Int, Int)]
dimensStringParser = (optionMaybe dimensParser) `sepBy` (char ',')
dimensParser :: PS.Parser (Int, Int)
dimensParser = do
    w <- many1 digit
    char 'x'
    h <- many1 digit
    return (read w, read h)
main :: IO ()
main = do
    print $ parseTestString "300x300,40x40,5x5"
    print $ parseTestString "300x300,hello,5x5,6x6"
According to optionMaybe documentation, it returns Nothing if it can't parse, so I would expect to get this output:
[Just (300,300),Just (40,40),Just (5,5)]
[Just (300,300),Nothing, Just (5,5), Just (6,6)]
but instead I get:
[Just (300,300),Just (40,40),Just (5,5)]
[Just (300,300),Nothing]
I.e. parsing stops after first failure. So I have two questions:
Why does it behave this way?
How do I write a correct parser for this case?
In order to answer this question, it's handy to take a piece of paper, write down the input, and act as a dumb parser.
We start with "300x300,hello,5x5,6x6", our current parser is optionMaybe .... Does our dimensParser correctly parse the dimension? Let's check:
w <- many1 digit -- yes, "300"
char 'x' -- yes, "x"
h <- many1 digit -- yes, "300"
return (read w, read h) -- never fails
We've successfully parsed the first dimension. The next token is ,, so sepBy successfully parses that as well. Next, we try to parse "hello" and fail:
w <- many1 digit -- no. 'h' is not a digit. Stop
Next, sepBy tries to parse ,, but that's not possible, since the next token is a 'h', not a ,. Therefore, sepBy stops.
We haven't parsed all the input, but that isn't required for the parse to succeed. You would have gotten a proper error message if you had used
parse (dimensStringParser <* eof)
Either way, if you want to discard anything in the list that's not a dimension, you can use
dimensStringParser1 :: PS.Parser (Maybe (Int, Int))
-- try lets us back out of a half-parsed dimension such as "5xhello"
dimensStringParser1 = try (Just <$> dimensParser) <|> (skipMany (noneOf ",") >> return Nothing)
dimensStringParser = dimensStringParser1 `sepBy` char ','
I'd guess that optionMaybe dimensParser, when fed with input "hello,...", tries dimensParser. That fails, so optionMaybe returns success with Nothing, and consumes no portion of the input.
The last part is the crucial one: after Nothing is returned, the input string to be parsed is still "hello,...".
At that point sepBy tries to parse char ',', which fails. So, it deduces that the list is over, and terminates the output list, without consuming any more input.
If you want to skip other entities, you need a "consuming" parser that returns Nothing instead of optionMaybe. That parser, however, needs to know how much to consume: in your case, everything up to the next comma.
Perhaps you need something like (untested)
( try (Just <$> dimensParser)
  <|> (skipMany (noneOf ",") >> return Nothing))
  `sepBy` char ','
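Putting the two answers together, a self-contained version might look like the sketch below. It uses String input rather than Text to keep the imports short, and the names dimens, field, fields and parseDimens are illustrative rather than taken from the question.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- "<digits>x<digits>", e.g. "300x300".
dimens :: Parser (Int, Int)
dimens = do
  w <- many1 digit
  _ <- char 'x'
  h <- many1 digit
  return (read w, read h)

-- Either a dimension, or Nothing after skipping everything up to the
-- next comma. try undoes a partially parsed dimension like "5xhello".
field :: Parser (Maybe (Int, Int))
field = try (Just <$> dimens) <|> (Nothing <$ skipMany (noneOf ","))

fields :: Parser [Maybe (Int, Int)]
fields = field `sepBy` char ','

-- eof makes leftover input an error instead of silently ignoring it.
parseDimens :: String -> Either ParseError [Maybe (Int, Int)]
parseDimens = parse (fields <* eof) ""
```

The key difference from optionMaybe is that the Nothing branch consumes the unwanted field, so sepBy finds the comma it is looking for and keeps going.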

Parsing multiple lines into a list of lists in Haskell

I am trying to parse a file that looks like:
a b c
f e d
I want to match each of the symbols in the line and parse everything into a list of lists such as:
[[A, B, C], [D, E, F]]
In order to do that I tried the following:
import Control.Monad
import Text.ParserCombinators.Parsec
import Text.ParserCombinators.Parsec.Language
import qualified Text.ParserCombinators.Parsec.Token as P
parserP :: Parser [[MyType]]
parserP = do
    x <- rowP
    xs <- many (newline >> rowP)
    return (x : xs)
rowP :: Parser [MyType]
rowP = manyTill cellP $ void newline <|> eof
cellP :: Parser MyType
cellP = aP <|> bP <|> ... -- rest of the parsers, they all look very similar
aP :: Parser MyType
aP = symbol "a" >> return A
bP :: Parser MyType
bP = symbol "b" >> return B
lexer = P.makeTokenParser emptyDef
symbol = P.symbol lexer
But it fails to return multiple inner lists. Instead what I get is:
[[A, B, C, D, E, F]]
What am I doing wrong? I was expecting manyTill to parse cellP until the newline character, but that's not the case.
Parser combinators are overkill for something this simple. I'd use lines :: String -> [String] and words :: String -> [String] to break up the input and then map the individual tokens into MyTypes.
toMyType :: String -> Maybe MyType
toMyType "a" = Just A
toMyType "b" = Just B
toMyType "c" = Just C
toMyType _ = Nothing
parseMyType :: String -> Maybe [[MyType]]
parseMyType = traverse (traverse toMyType) . fmap words . lines
You're right that manyTill keeps parsing until a newline. But manyTill never gets to see the newline because cellP is too eager. cellP ends up calling P.symbol, whose documentation states
symbol :: String -> ParsecT s u m String
Lexeme parser symbol s parses string s and skips trailing white space.
The keyword there is 'white space'. It turns out, Parsec defines whitespace as being any character which satisfies isSpace, which includes newlines. So P.symbol is happily consuming the c, followed by the space and the newline, and then manyTill looks and doesn't see a newline because it's already been consumed.
If you want to drop the Parsec routine, go with Benjamin's solution. But if you're determined to stick with that, the basic idea is that you want to modify the language's whiteSpace field to correctly define whitespace to not be newlines. Something like
lexer = let lexer0 = P.makeTokenParser emptyDef
        in lexer0 { P.whiteSpace = void $ many (oneOf " \t") }
That's pseudocode and probably won't work as written: makeTokenParser builds symbol (and the other lexeme parsers) around its own whitespace skipper, so updating the whiteSpace field afterwards does not by itself change what symbol consumes. But the idea is there. You want to change the definition of whitespace to be whatever you want to treat as whitespace, not what the system defines by default. Note that changing this will also break your comment syntax, if you have one defined, since whiteSpace was previously equipped to handle comments.
In short, Benjamin's answer is probably the best way to go. There's no real reason to use Parsec here. But it's also helpful to know why this particular solution didn't work: Parsec's default definition of a language wasn't designed to treat newlines with significance.
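If you do want to stay with Parsec, one alternative to the token module is a hand-written lexeme helper that skips only spaces and tabs. The following is a rough sketch under assumptions: MyType's constructors, the single-character cells, and the lexeme helper are all made up for illustration and are not the asker's actual definitions.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- Hypothetical cell type; the question does not show the real one.
data MyType = A | B | C | D | E | F deriving (Eq, Show)

-- Run a parser, then skip spaces and tabs only, so that newlines
-- stay in the input for the row parser to see.
lexeme :: Parser a -> Parser a
lexeme p = p <* skipMany (oneOf " \t")

cellP :: Parser MyType
cellP = lexeme $ choice
  [ A <$ char 'a', B <$ char 'b', C <$ char 'c'
  , D <$ char 'd', E <$ char 'e', F <$ char 'f' ]

-- A row is one or more cells; it stops at the first non-cell character.
rowP :: Parser [MyType]
rowP = many1 cellP

-- Rows are separated (and optionally terminated) by newlines.
parserP :: Parser [[MyType]]
parserP = rowP `sepEndBy1` newline <* eof
```

With this, parse parserP "" "a b c\nf e d\n" should give one inner list per line, because no cell parser ever swallows the newline.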

Parsec: Handling Overlapping Parsers

I'm really new to parsing in Haskell, but it's mostly making sense.
I'm working on building a Templating program mostly to learn parsing better; templates can interpolate values in via {{ value }} notation.
Here's my current parser,
data Template a = Template [Either String a]
data Directive = Directive String deriving Show -- notFollowedBy needs Show
templateFromFile :: FilePath -> IO (Either ParseError (Template Directive))
templateFromFile = parseFromFile templateParser
templateParser :: Parser (Template Directive)
templateParser = do
    tmp <- template
    eof
    return tmp
template :: Parser (Template Directive)
template = Template <$> many (dir <|> txt)
  where
    dir = Right <$> directive
    txt = Left <$> many1 (anyChar <* notFollowedBy directive)
directive :: Parser Directive
directive = do
    _ <- string "{{"
    txt <- manyTill anyChar (string "}}")
    return $ Directive txt
Then I run it on a file something like this:
{{ value }}
This is normal Text
{{ expression }}
When I run this using templateFromFile "./template.txt" I get the error:
Left "./template.txt" (line 5, column 17):
unexpected Directive " expression "
Why is this happening and how can I fix it?
My basic understanding is that many1 (anyChar <* notFollowedBy directive) should grab all of the characters up until the start of the next directive, then should fail and return the list of characters up till that point; then it should fall back to the previous many and should try parsing dir again and should succeed; clearly something else is happening though. I'm having trouble figuring out how to parse things between other things when the parsers mostly overlap.
I'd love some tips on how to structure this all more idiomatically, please let me know if I'm doing something in a silly way. Cheers! Thanks for your time!
You have a couple of problems. First, in Parsec, if a parser consumes any input and then fails, that's an error. So, when the parser:
anyChar <* notFollowedBy directive
fails (because the character is followed by a directive), it fails after anyChar has consumed input, and that generates an error immediately. Therefore, the parser:
let p1 = many1 (anyChar <* notFollowedBy directive)
will never succeed if it runs into a directive. For example:
parse p1 "" "okay" -- works
parse p1 "" "oops {{}}" -- will fail after consuming "oops "
You can fix this by inserting a try clause:
let p2 = many1 (try (anyChar <* notFollowedBy directive))
parse p2 "" "okay {{}}"
which yields Right "okay" and reveals the second problem. Parser p2 only consumes characters that aren't followed by a directive, so that excludes the space immediately before the directive, and you have no means in your parser to consume a character that is followed by a directive, so it gets stuck.
You actually want something like:
let p3 = many1 (notFollowedBy directive *> anyChar)
which first checks that, at the current position, we aren't looking at a directive before grabbing a character. No try clause is needed because if this fails, it fails without consuming input. (notFollowedBy never consumes input, as per the documentation.)
parse p3 "" "okay" -- returns Right "okay"
parse p3 "" "okay {{}}" -- returns Right "okay "
parse p3 "" "{{fails}}" -- correctly fails w/o consuming input
So, taking your original example with:
txt = Left <$> many1 (notFollowedBy directive *> anyChar)
should work fine.
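As a self-contained check of that fix, here is a small sketch. It simplifies the question's types (directives are plain Strings instead of the Directive wrapper) and adds try around the directive and its closing "}}" as a precaution against partial matches such as a lone "{"; those additions are assumptions of this sketch, not part of the original code.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- "{{ ... }}", returning the text between the braces.
directive :: Parser String
directive = string "{{" *> manyTill anyChar (try (string "}}"))

-- Literal text: take a character only if no directive starts here.
-- notFollowedBy never consumes input, so no try is needed around it.
txt :: Parser String
txt = many1 (notFollowedBy directive *> anyChar)

-- Left = literal text, Right = directive body.
template :: Parser [Either String String]
template = many ((Right <$> try directive) <|> (Left <$> txt)) <* eof
```

Running parse template "" "{{ value }}\ntext\n{{ e }}" should alternate Right, Left, Right, with the newline before the second directive kept as part of the literal text.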
replace-megaparsec is a library for doing search-and-replace with parsers. The search-and-replace function is streamEdit, which can find your {{ value }} patterns and then substitute in some other text. streamEdit is built from a generalized version of your template function called sepCap.
import Replace.Megaparsec
import Text.Megaparsec
import Text.Megaparsec.Char
import Data.Char
import Data.Void (Void)
input = unlines
    [ "{{ value }}"
    , ""
    , "This is normal Text"
    , ""
    , "{{ expression }}"
    ]
directive :: Parsec Void String String
directive = do
    _ <- string "{{"
    txt <- manyTill anySingle (string "}}")
    return txt
editor k = fmap toUpper k
streamEdit directive editor input
VALUE
This is normal Text
EXPRESSION
many1 (anyChar <* notFollowedBy directive)
This parses only characters not followed by a directive.
{{ value }}
This is normal Text
{{ expression }}
When parsing the text in the middle, it will stop at the last t, leaving the newline before the directive unconsumed (because it's, well, a character followed by a directive), so the next iteration, you try to parse a directive and you fail. Then you retry txt on that newline, the parser expects it not to be followed by a directive, but it finds one, hence the error.

Grouping lines with Parsec

I have a line-based text format I want to parse with Parsec†. A line either starts with a pound sign and specifies a key value pair separated by a colon or is a URL that is described by the previous tags.
Here's a short example:
#foo:bar
#faz:baz
https://example.com
#foo:beep
https://example.net
For simplicity's sake, I'm going to store everything as String. A Tag is a type Tag = (String, String), for example ("foo", "bar"). Ultimately, I'd like to group these as ([Tag], URL).
However, I struggle figuring out how to parse either [one or more tags] or [one URL].
My current approach looks like this:
import qualified System.Environment as Env
import qualified Text.Megaparsec as M
import qualified Text.Megaparsec.Text as M
type Tag = (String, String)
data Segment = Tags [Tag] | URL String
deriving (Eq, Show)
tagP :: M.Parser Tag
tagP = M.char '#' *> ((,) <$> M.someTill M.printChar (M.char ':') <*> M.someTill M.printChar M.eol) M.<?> "Tag starting with #"
urlP :: M.Parser String
urlP = M.someTill M.printChar M.eol M.<?> "Some URL"
parser :: M.Parser Segment
parser = (Tags <$> M.many tagP) M.<|> (URL <$> urlP)
main :: IO ()
main = do
    fname <- head <$> Env.getArgs
    res <- M.parseFromFile (parser <* M.eof) fname
    print res
If I try to run this on the above sample, I get a parsing error like this:
3:1:
unexpected 'h'
expecting Tag starting with # or end of input
Clearly my use of many in combination with <|> is incorrect. Since the tag parser won't consume any input from the URL parser it cannot be related to backtracking. How do I need to change this to get to the desired result?
The full example is available on GitHub.
† I'm actually using MegaParsec here for better error messages but I think the problem is quite generic and not about any particular implementation of parser combinators.
What you're doing works quite fine, only, at the moment you only parse a single segment (i.e., either only tags or only a URL), but that doesn't consume the whole input. It's eof that's causing the error.
Simply use one more many or some, to allow for multiple segments:
main :: IO ()
main = do
    fname <- head <$> Env.getArgs
    res <- M.parseFromFile (M.many parser <* M.eof) fname
    print res
#cocreature answered this for me on Twitter.
As leftaroundabout pointed out here, there are two separate mistakes in my code:
The parser itself misuses <|> while it should just sequentially parse the lines and skip to the next parser if it doesn't consume any input.
The invocation (parseFromFile) only applies the parser function a single time and would fail as soon as it would get to the second block.
We can fix the parser and introduce grouping in one go:
import Control.Applicative (liftA2)
parser :: M.Parser ([Tag], String)
parser = liftA2 (,) (M.many tagP) urlP
Afterwards, we just need to apply the change suggested by leftaroundabout:
...
res <- M.parseFromFile (M.many parser <* M.eof) fname
Running this leads to the desired result:
[([("foo","bar"),("faz","baz")],"https://example.com"),([("foo","beep")],"https://example.net")]
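The same grouping idea can be tried without a Megaparsec install. The sketch below swaps in plain Parsec (a deliberate substitution, since the question uses Megaparsec) and assumes every line, including the last, ends with '\n'. The name groups is illustrative.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

type Tag = (String, String)

-- "#key:value" followed by a newline.
tagP :: Parser Tag
tagP = do
  _   <- char '#'
  key <- many1 (noneOf ":\n")
  _   <- char ':'
  val <- many1 (noneOf "\n")
  _   <- newline
  return (key, val)

-- Any non-empty line. Tag lines are consumed by tagP before this runs.
urlP :: Parser String
urlP = many1 (noneOf "\n") <* newline

-- Zero or more tags followed by exactly one URL, repeated to end of input.
groups :: Parser [([Tag], String)]
groups = many ((,) <$> many tagP <*> urlP) <* eof
```

Sequencing the tags and the URL, instead of choosing between them with <|>, is what produces the ([Tag], URL) pairs directly.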

Parse array of numbers between emptylines

I'm trying to make a parser to scan arrays of numbers separated by empty lines in a text file.
1 235 623 684
2 871 699 557
3 918 686 49
4 53 564 906

1 154
2 321
3 519

1 235 623 684
2 871 699 557
3 918 686 49
Here is the full text file
I wrote the following parser with parsec :
import Text.ParserCombinators.Parsec
emptyLine = do
    spaces
    newline
emptyLines = many1 emptyLine
data1 = do
    dat <- many1 digit
    return (dat)
datan = do
    many1 (oneOf " \t")
    dat <- many1 digit
    return (dat)
dataline = do
    dat1 <- data1
    dat2 <- many datan
    many (oneOf " \t")
    newline
    return (dat1:dat2)
parseSeries = do
    dat <- many1 dataline
    return dat
parseParag = try parseSeries
parseListing = do
    --cont <- parseSeries `sepBy` emptyLines
    cont <- between emptyLines emptyLines parseSeries
    eof
    return cont
main = do
    fichier <- readFile ("test_listtst.txt")
    case parse parseListing "(test)" fichier of
        Left error -> do putStrLn "!!! Error !!!"
                         print error
        Right serie -> do
            mapM_ print serie
but it fails with the following error :
!!! Error !!!
"(test)" (line 6, column 1):
unexpected "1"
expecting space or new-line
and I don't understand why.
Do you have any idea of what's wrong with my parser ?
Do you have an example on how to parse a structured bunch of data separated by empty lines ?
The spaces in emptyLine is consuming the '\n', and then newline has no '\n' to parse. You can write it as:
emptyLine = do
    skipMany $ satisfy (\c -> isSpace c && c /= '\n') -- isSpace is from Data.Char
    newline
And you should change parseListing to:
parseListing = do
    cont <- parseSeries `sepEndBy` emptyLines
    eof
    return cont
I think sepEndBy is better than sepBy, because it will skip any new lines that you may have at the end of the file.
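For reference, here is how those two changes might fit into a complete parser. This is only a sketch: the names hspace, dataLine, series and listing are illustrative, the numbers are read as Int, and data lines are assumed not to start with whitespace.

```haskell
import Data.Char (isSpace)
import Text.Parsec
import Text.Parsec.String (Parser)

-- Horizontal whitespace: any space character except the newline.
hspace :: Parser ()
hspace = skipMany (satisfy (\c -> isSpace c && c /= '\n'))

-- A line holding nothing but horizontal whitespace. try keeps a failed
-- attempt from eating leading blanks of a non-empty line.
emptyLine :: Parser ()
emptyLine = () <$ try (hspace *> newline)

-- One or more numbers, ended by a newline or by the end of the input.
dataLine :: Parser [Int]
dataLine = many1 (read <$> many1 digit <* hspace) <* ((() <$ newline) <|> eof)

series :: Parser [[Int]]
series = many1 dataLine

-- Blocks of data lines, separated (and optionally followed) by empty lines.
listing :: Parser [[[Int]]]
listing = skipMany emptyLine *> (series `sepEndBy` many1 emptyLine) <* eof
```

Because hspace never touches '\n', newline always gets its chance, which is exactly what the original spaces-based emptyLine prevented.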
A couple of things:
spaces includes new lines, and so spaces >> newline always fails which implies that the emptyLine parser will always fail.
I've had luck with these definitions of parseSeries and parseListing:
parseSeries = do
    s <- many1 dataline
    spaces -- eat trailing whitespace
    return s
parseListing = do
    spaces -- ignore leading whitespace
    ss <- many parseSeries -- begin parseSeries at non-whitespace
    eof
    return ss
The idea is that a parser always eats the whitespace following it.
This approach also handles empty files.
Do you have any idea of what's wrong with my parser ?
A few things:
As other answerers have already pointed out, the spaces parser is designed to consume a sequence of characters that satisfy Data.Char.isSpace; the newline ('\n') is such a character. Therefore, your emptyLine parser always fails, because newline expects a newline character that has already been consumed.
You probably shouldn't use the newline parser in your "line" parsers anyway, because those parsers will fail on the last line of the file if the latter doesn't end with a newline.
Why not use parsec 3 (Text.Parsec.*) rather than parsec 2 (Text.ParserCombinators.*)?
Why not parse the numbers as Integers or Ints as you go, rather than keep them as Strings?
Personal preference, but you rely too much on the do notation for my taste, to the detriment of readability. For instance,
data1 = do
    dat <- many1 digit
    return (dat)
can be simplified to
data1 = many1 digit
You would do well to add a type signature to all your top-level bindings.
Be consistent in how you name your parsers: why "parseListing" instead of simply "listing"?
Have you considered using a different type of input stream (e.g. Text) for better performance?
Do you have an example on how to parse a structured bunch of data separated by empty lines ?
Below is a much simplified version of the kind of parser you want. Note that the input is not supposed to begin with (but may end with) empty lines, and "data lines" are not supposed to contain leading spaces, but may contain trailing spaces (in the sense of the spaces parser).
module Main where
import Data.Char ( isSpace )
import Text.Parsec
import Text.Parsec.String ( Parser )
eolChar :: Char
eolChar = '\n'
eol :: Parser Char
eol = char eolChar
whitespace :: Parser String
whitespace = many $ satisfy $ \c -> isSpace c && c /= eolChar
emptyLine :: Parser String
emptyLine = whitespace
emptyLines :: Parser [String]
emptyLines = sepEndBy1 emptyLine eol
cell :: Parser Integer
cell = read <$> many1 digit
dataLine :: Parser [Integer]
dataLine = sepEndBy1 cell whitespace
-- ^
-- replace by endBy1 if no trailing whitespace is allowed in a "data line"
dataLines :: Parser [[Integer]]
dataLines = sepEndBy1 dataLine eol
listing :: Parser [[[Integer]]]
listing = sepEndBy dataLines emptyLines
main :: IO ()
main = do
    fichier <- readFile ("test_listtst.txt")
    case parse listing "(test)" fichier of
        Left error -> putStrLn "!!! Error !!!"
        Right serie -> mapM_ print serie
Test:
λ> main
[[1,235,623,684],[2,871,699,557],[3,918,686,49],[4,53,564,906]]
[[1,154],[2,321],[3,519]]
[[1,235,623,684],[2,871,699,557],[3,918,686,49]]
Here is another approach which allows you to stream in the data and process each block as it is identified:
import Data.Char
import Control.Monad
-- toBlocks - convert a list of lines into a list of blocks
toBlocks :: [String] -> [[[String]]]
toBlocks [] = []
toBlocks theLines =
    let (block,rest) = break isBlank theLines
        next = dropWhile isBlank rest
    in if null block
        then toBlocks next
        else [ words x | x <- block ] : toBlocks next
  where isBlank str = all isSpace str
main' path = do
    content <- readFile path
    forM_ (toBlocks (lines content)) $ print
Parsec has to read in the entire file before it gives you the list of blocks which might be a problem if your input file is large.
I don't know the exact problem, but my experience with parsing "line oriented" files with parsec is: don't use parsec (or at least not this way).
The problem is that you want to strip the blanks (spaces and newlines) between numbers on the same line, while still being aware of them when needed.
It's really hard to do in one step (which is what you are trying to do).
It's probably doable by adding lookahead, but it's really messy (and to be honest I never managed to make it work).
The easiest way is to split the input into lines in a first step (which allows you to detect empty lines) and then parse each line separately.
To do that, you don't need parsec at all and can do it using just lines and words. However, doing this, you lose the ability to backtrack.
There is probably a way to do "multiple step" parsing using parsec and its tokenizer (but I haven't found any useful documentation about how to use the parsec tokenizer).
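A rough sketch of that two-step idea using only the base library: split into lines, group the lines into blocks at blank lines, then convert each word with readMaybe. The function names blocks and parseBlocks are made up for this example, and reading the cells as Int is an assumption.

```haskell
import Data.Char (isSpace)
import Text.Read (readMaybe)

-- Step 1: group lines into blocks, using blank lines as separators.
blocks :: [String] -> [[String]]
blocks ls = case break isBlank (dropWhile isBlank ls) of
  ([], _)       -> []
  (block, rest) -> block : blocks rest
  where isBlank = all isSpace

-- Step 2: read every word of every line as an Int.
-- The result is Nothing if any word is not a number.
parseBlocks :: String -> Maybe [[[Int]]]
parseBlocks = traverse (traverse (traverse readMaybe . words)) . blocks . lines
```

The trade-off is the one mentioned above: there are no positions or error messages, only success or Nothing.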

Resources