Parsing Printable Text File in Haskell

I'm trying to figure out the "right" way to parse a particular text file in Haskell.
In F#, I loop over each line, testing it against a regular expression to determine if it's a line I want to parse, and then if it is, I parse it using the regular expression. Otherwise, I ignore the line.
The file is a printable report, with headers on each page. Each record is one line, and each field is separated by two or more spaces. Here's an example:
MY COMPANY'S NAME
PROGRAM LISTING
STATE: OK PRODUCT: ProductName
(DESCRIPTION OF REPORT)
DATE: 11/03/2013
This is the first line of a two-line description of the contents of this report. The description, as noted,
spans two lines. This is more text. I'm running out of things to write. Blah.
DIVISION CODE: 3 XYZ CODE: FAA3 AGENT CODE: 0007 PAGE NO: 1
AGENT  TARGET NAME                     ST  UD  TARGET#  XYZ#  X-DATE      YEAR  CO          ENCODING
-----  ------------------------------  --  --  -------  ----  ----------  ----  ----------  ----------
0007   SMITH, JOHN                     43  3   1234567  001   12/06/2013  2004  ABC         SIZE XL
0007   SMITH, JANE                     43  3   2345678  001   12/07/2013  2005  ACME        YELLOW
0007   DOE, JOHN                       43  3   3456789  004   12/09/2013  2008  MICROSOFT   GREEN
0007   DOE, JANE                       43  3   4567890  002   12/09/2013  2007  MICROSOFT   BLUE
0007   BORGES, JORGE LUIS              43  3   5678901  001   12/09/2013  2008  DUFEMSCHM   Y1500
0007   DEWEY, JOHN &                   43  3   6789012  003   12/11/2013  2013  ERTZEVILI   X1500
0007   NIETZSCHE, FRIEDRICH            43  3   7890123  004   12/11/2013  2006  NCORPORAT   X7
I first built the parser to test each line to see whether it was a record. If it was, I cut the line up based on character position with my home-grown substring function. This works just fine.
Then I discovered that I did, indeed, have a regular expression library in my Haskell installation, so I decided to try using regular expressions like I do in F#. That failed miserably, as the library rejects perfectly valid regular expressions.
Then I thought, What about Parsec? But the learning curve for using that is getting steeper the higher I climb, and I find myself wondering if it is the right tool for such a simple task as parsing this report.
So I thought I'd ask some Haskell experts: how would you go about parsing this kind of report? I'm not asking for code, though if you've got some, I'd love to see it. I'm really asking for technique or technology.
Thanks!
P.S. The output is just a colon-separated file with a line of field names at the top of the file, followed by just the records, which can be imported into Excel for the end user.
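For example, the first record above would come out something like this (illustrative field names, not the program's actual output):
AGENT:TARGET NAME:ST:UD:TARGET#:XYZ#:X-DATE:YEAR:CO:ENCODING
0007:SMITH, JOHN:43:3:1234567:001:12/06/2013:2004:ABC:SIZE XL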
Edit:
Thank you all so much for the great comments and answers!
Because I didn't make it clear originally: The first fourteen lines of the example repeat for every page of (print) output, with the number of records varying per page from zero to a full page (looks like 45 records). I apologize for not making that clear earlier, as it will probably affect some of the answers already offered.
My Haskell system currently is limited to Parsec (it doesn't have attoparsec) and Text.Regex.Base and Text.Regex.Posix. I'll have to see about installing attoparsec and/or additional Regex libraries. But for the time being, you've convinced me to keep at learning Parsec. Thank you for the very helpful code examples!

This is definitely a job worthy of a parsing library. My primary goal is normally (i.e., for anything I intend to use more than once or twice) to get the data into a non-textual form ASAP, something like
module ReportParser where

import Prelude hiding (takeWhile)
import Data.Text hiding (takeWhile)
import Control.Applicative
import Data.Attoparsec.Text

data ReportHeaderData = Company Text
                      | Program Text
                      | State Text
                      -- ...
                      | FieldNames [Text]

data ReportData = ReportData Int Text Int Int Int Int Date Int Text Text

data Date = Date Int Int Int
and we can say, for the sake of argument, that a report is
data Report = Report [ReportHeaderData] [ReportData]
Now, I generally create a parser which is a function of the same name as the data type
-- Ending condition for a field
doubleSpace :: Parser Char
doubleSpace = space >> space
-- Clears leading spaces
clearSpaces :: Parser Text
clearSpaces = takeWhile (== ' ') -- Naively assumes no tabs
-- Throws away everything up to and including a newline character (naively assumes unix line endings)
clearNewline :: Parser ()
clearNewline = (anyChar `manyTill` char '\n') *> pure ()
-- Parse a date
date :: Parser Date
date = Date <$> decimal <*> (char '/' *> decimal) <*> (char '/' *> decimal)
-- Parse a report
reportData :: Parser ReportData
reportData = let f1  = decimal <* clearSpaces
                 f2  = (pack <$> manyTill anyChar doubleSpace) <* clearSpaces
                 f3  = decimal <* clearSpaces
                 f4  = decimal <* clearSpaces
                 f5  = decimal <* clearSpaces
                 f6  = decimal <* clearSpaces
                 f7  = date <* clearSpaces
                 f8  = decimal <* clearSpaces
                 f9  = (pack <$> manyTill anyChar doubleSpace) <* clearSpaces
                 f10 = (pack <$> manyTill anyChar doubleSpace) <* clearNewline
             in ReportData <$> f1 <*> f2 <*> f3 <*> f4 <*> f5 <*> f6 <*> f7 <*> f8 <*> f9 <*> f10
By properly running one of the parse functions with one of the combinators (such as many, and possibly feed if you end up with a Partial result), you should end up with a list of ReportDatas. You can then convert them to CSV with some function you've created.
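For instance, a minimal sketch of a driver, assuming the header lines have already been stripped (parseOnly runs a parser to completion, sidestepping the Partial case; toColonSeparated is a hypothetical rendering function and report.txt a hypothetical input file):

import qualified Data.Text.IO as TIO

main :: IO ()
main = do
  input <- TIO.readFile "report.txt"
  case parseOnly (many reportData) input of
    Left err   -> putStrLn ("Parse error: " ++ err)
    Right rows -> mapM_ (putStrLn . toColonSeparated) rows  -- toColonSeparated :: ReportData -> String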
Note that I didn't deal with the header. It should be relatively trivial to write code to parse it, and build a Report with e.g.
-- Not tested
parseReport = Report <$> (many reportHeader) <*> (many reportData)
Note that I prefer the Applicative form, but it's also possible to use the monadic form if you prefer (I did in doubleSpace). The Alternative class (from Control.Applicative) is also useful, for reasons implied by the name.
For playing with this, I highly recommend GHCI and the parseTest function. GHCI is just overall handy and a good way to test individual parsers, while parseTest takes a parser and input string and outputs the status of the run, the parsed string, and any remaining string not parsed. Very useful when you're not quite sure what's going on.
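For example, checking the date parser from above in GHCi (assuming a deriving Show on Date; parseOnly runs the parser to completion rather than returning a Partial):

>>> parseOnly date "11/03/2013"
Right (Date 11 3 2013)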

There are very few languages in which I would recommend using a parser library for something so simple (I've parsed many a file like this using regular expressions in the past), but parsec makes it so easy:
parseLine = do
  first <- count 4 anyChar
  second <- count 4 anyChar
  return (first, second)

parseFile = endBy parseLine (char '\n')

main = interact $ show . parse parseFile "-"
The function "parseLine" creates a parser for an individual line by chaining together two fields made up of fixed length (4 chars, any char will do).
The function "parseFile" then chains these together as a list of lines.
Of course you will have to add more fields, and cut off the header in your data still, but all of this is easy in parsec.
This is arguably much easier to read than regexps....
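For instance, loading the above into GHCi (a sketch; the two tuple fields are just the two fixed-width columns of this toy format):

λ parse parseFile "-" "ABCD1234\nEFGH5678\n"
Right [("ABCD","1234"),("EFGH","5678")]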

Assuming a few things—that the header is fixed and the fields of each line are "double space" delimited—it's really quite easy to implement a parser in Haskell for this file. The end result is probably going to be longer than your regexp (and there are regexp libraries in Haskell if that fits your desire) but it's far more testable and readable. I'll demonstrate some of that while I outline how to build one for this file format.
I'll use Attoparsec. We'll also need the ByteString data type (and the OverloadedStrings pragma, which lets Haskell interpret string literals as ByteString as well as String) and some combinators from Control.Applicative and Control.Monad.
{-# LANGUAGE OverloadedStrings #-}
import Data.Attoparsec.Char8
import Control.Applicative
import Control.Monad
import qualified Data.ByteString.Char8 as S
First, we'll build a data type representing each record.
data YearMonthDay =
  YearMonthDay { ymdYear  :: Int
               , ymdMonth :: Int
               , ymdDay   :: Int
               }
  deriving ( Show )

data Line =
  Line { agent     :: Int
       , name      :: S.ByteString
       , st        :: Int
       , ud        :: Int
       , targetNum :: Int
       , xyz       :: Int
       , xDate     :: YearMonthDay
       , year      :: Int
       , co        :: S.ByteString
       , encoding  :: S.ByteString
       }
  deriving ( Show )
You could fill in more descriptive types for each field if desired, but this isn't a bad start. Since each line can be parsed independently, I'll do just that. The first step is to build a Parser Line type---read that as a parser type which returns a Line if it succeeds.
To do this, we'll build our Line type "inside of" the Parser using its Applicative interface. That sounds really complex, but it's simple and looks quite pretty. We'll start with the YearMonthDay type as a warm-up
parseYMDWrong :: Parser YearMonthDay
parseYMDWrong =
  YearMonthDay <$> decimal
               <*> decimal
               <*> decimal
Here, decimal is a built-in Attoparsec parser which parses an integral type like Int. You can read this parser as nothing more than "parse three decimal numbers and use them to build my YearMonthDay type" and you'd be basically correct. The (<*>) operator (read as "apply") sequences the parses and collects their results into our YearMonthDay constructor function.
Unfortunately, as I indicated in the type, it's a little bit wrong. In particular, we're currently ignoring the '/' characters which delimit the numbers inside our YearMonthDay. We fix this by using the "sequence and throw away" operator (<*). It's a visual pun on (<*>) and we use it when we want to perform a parsing action... but we don't want to keep the result.
We use (<*) to augment the first two decimal parsers with their following '/' characters using the built-in char8 parser.
parseYMD :: Parser YearMonthDay
parseYMD =
  YearMonthDay <$> (decimal <* char8 '/')
               <*> (decimal <* char8 '/')
               <*> decimal
And we can test that this is a valid parser using Attoparsec's parseOnly function
>>> parseOnly parseYMD "2013/12/12"
Right (YearMonthDay {ymdYear = 2013, ymdMonth = 12, ymdDay = 12})
We'd like to now generalize this technique to the entire Line parser. There's one hitch, however. We'd like to parse ByteString fields like "SMITH, JOHN" which might contain spaces... while also delimiting each field of our Line by double spaces. This means that we need a special ByteString parser which consumes any character including single spaces... but quits the moment it sees two spaces in a row.
We can build this using the scan combinator. scan allows us to accumulate a state while consuming characters in our parse and determine when to stop that parse on the fly. We'll keep a boolean state—"was the last character a space?"—and stop whenever we see a new space while knowing the previous character was a space too.
parseStringField :: Parser S.ByteString
parseStringField = scan False step where
  step :: Bool -> Char -> Maybe Bool
  step b ' ' | b         = Nothing
             | otherwise = Just True
  step _ _   = Just False
We can again test this little piece using parseOnly. Let's try parsing three string fields.
>>> let p = (,,) <$> parseStringField <*> parseStringField <*> parseStringField
>>> parseOnly p "foo  bar  baz"
Right ("foo "," bar "," baz")
>>> parseOnly p "foo bar  baz quux  end"
Right ("foo bar "," baz quux "," end")
>>> parseOnly p "a sentence with no double space delimiters"
Right ("a sentence with no double space delimiters","","")
Depending on your actual file format, this might be perfect. It's worth noting that it leaves trailing spaces (these could be trimmed if desired) and it allows some space delimited fields to be empty. It's easy to continue to fiddle with this piece in order to fix these errors, but I'll leave it for now.
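For instance, trimming those trailing spaces could be layered on afterwards (a hedged sketch; trimmedField is a name I'm introducing, and S.spanEnd comes from Data.ByteString.Char8):

-- Strip trailing spaces from whatever parseStringField consumed.
trimmedField :: Parser S.ByteString
trimmedField = fst . S.spanEnd (== ' ') <$> parseStringField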
We can now build our Line parser. Like with parseYMD, we'll follow each field's parser with a delimiting parser, someSpaces which consumes two or more spaces. We'll use the MonadPlus interface to Parser to build this atop the built-in parser space by (1) parsing some spaces and (2) checking to be sure that we got at least two of them.
someSpaces :: Parser Int
someSpaces = do
  sps <- some space
  let count = length sps
  if count >= 2 then return count else mzero
>>> parseOnly someSpaces "  "
Right 2
>>> parseOnly someSpaces "    "
Right 4
>>> parseOnly someSpaces " "
Left "Failed reading: mzero"
And now we can build the line parser
lineParser :: Parser Line
lineParser =
  Line <$> (decimal          <* someSpaces)
       <*> (parseStringField <* someSpaces)
       <*> (decimal          <* someSpaces)
       <*> (decimal          <* someSpaces)
       <*> (decimal          <* someSpaces)
       <*> (decimal          <* someSpaces)
       <*> (parseYMD         <* someSpaces)
       <*> (decimal          <* someSpaces)
       <*> (parseStringField <* someSpaces)
       <*> (parseStringField <* some space)
>>> parseOnly lineParser "0007   SMITH, JOHN    43  3   1234567  001   12/06/2013  2004  ABC    SIZE XL  "
Right (Line { agent = 7
            , name = "SMITH, JOHN "
            , st = 43
            , ud = 3
            , targetNum = 1234567
            , xyz = 1
            , xDate = YearMonthDay {ymdYear = 12, ymdMonth = 6, ymdDay = 2013}
            , year = 2004
            , co = "ABC "
            , encoding = "SIZE XL "
            })
And then we can just cut off the header and parse each line.
parseFile :: S.ByteString -> [Either String Line]
parseFile = map (parseOnly lineParser) . drop 14 . S.lines
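Since the question's edit notes that the fourteen header lines actually repeat on every page, a hedged variant is to run the line parser over every line and keep only the successes, letting header lines fall out naturally (parseFile' is a name I'm introducing):

parseFile' :: S.ByteString -> [Line]
parseFile' bs = [ line | Right line <- map (parseOnly lineParser) (S.lines bs) ]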

Related

Why does try not trigger backtracking in this example

I am trying to wrap my head around writing a parser using parsec in Haskell, in particular how backtracking works.
Take the following simple parser:
import Text.Parsec

type Parser = Parsec String () String

parseConst :: Parser
parseConst = do {
  x <- many digit;
  return $ read x
}

parseAdd :: Parser
parseAdd = do {
  l <- parseExp;
  char '+';
  r <- parseExp;
  return $ l <> "+" <> r
}

parseExp :: Parser
parseExp = try parseConst <|> parseAdd

pp :: Parser
pp = parseExp <* eof

test = parse pp "" "1+1"
test has value
Left (line 1, column 2):
unexpected '+'
expecting digit or end of input
In my mind this should succeed since I used the try combinator on parseConst in the definition of parseExp.
What am I missing? I am also interested in pointers on how to debug this on my own; I tried using parserTraced, which just allowed me to conclude that it indeed wasn't backtracking.
PS.
I know this is an awful way to write an expression parser, but I'd like to understand why it doesn't work.
There are a lot of problems here.
First, parseConst can never work right. The type says it must produce a String, so read :: String -> String. That particular Read instance requires the input to be a quoted string, so passing zero or more digit characters to read will always result in a call to error when you try to evaluate the value it produces.
Second, parseConst can succeed on matching zero characters. I think you probably wanted some instead of many. That will make it actually fail if it encounters input that doesn't start with a digit.
Third, (<|>) doesn't do what you think. You might think that (a <* c) <|> (b <* c) is interchangeable with (a <|> b) <* c, but it isn't. There is no way to throw try in and make it the same, either. The problem is that (<|>) commits to whichever branch succeeds, if one does. In (a <|> b) <* c, if a matches, there's no way later to backtrack and try b there. It doesn't matter how you lob try around; it can't undo the fact that (<|>) committed to a. In contrast, (a <* c) <|> (b <* c) doesn't commit until both a and c or b and c match the input.
This is the situation you're encountering. You have (try parseConst <|> parseAdd) <* eof, after a bit of inlining. Since parseConst will always succeed (see the second issue), parseAdd will never get tried, even if the eof fails. So after parseConst consumes zero or more leading digits, the parse will fail unless that's the end of the input. Working around this essentially requires carefully planning your grammar such that any use of (<|>) is safe to commit locally. That is, the contents of each branch must not overlap in a way that is disambiguated only by later portions of the grammar.
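For illustration, a minimal sketch of such a refactoring under the asker's String-returning Parser type (parseExp' is a name I'm introducing): parse the constant first, then optionally continue with '+' and another expression, so the two original branches no longer compete:

parseExp' :: Parser
parseExp' = do
  l <- many1 digit                             -- a constant
  rest <- optionMaybe (char '+' >> parseExp')  -- optionally "+ <expression>"
  return $ maybe l (\r -> l <> "+" <> r) rest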
Note that this unpleasant behavior with (<|>) is how the parsec family of libraries work, but not how all parser libraries in Haskell work. Other libraries work without the left bias or commit behavior the parsec family have chosen.

Why does Parsec's sepBy stop and not parse all elements?

I am trying to parse a comma-separated string which may or may not contain substrings with image dimensions. For example "hello world, 300x300, good bye world".
I've written the following little program:
{-# LANGUAGE OverloadedStrings #-} -- added so the Text literals below typecheck

import Data.Text (Text) -- added; needed for the signature of parseTestString
import Text.Parsec
import qualified Text.Parsec.Text as PS

parseTestString :: Text -> [Maybe (Int, Int)]
parseTestString s = case parse dimensStringParser "" s of
  Left _       -> [Nothing]
  Right dimens -> dimens

dimensStringParser :: PS.Parser [Maybe (Int, Int)]
dimensStringParser = (optionMaybe dimensParser) `sepBy` (char ',')

dimensParser :: PS.Parser (Int, Int)
dimensParser = do
  w <- many1 digit
  char 'x'
  h <- many1 digit
  return (read w, read h)

main :: IO ()
main = do
  print $ parseTestString "300x300,40x40,5x5"
  print $ parseTestString "300x300,hello,5x5,6x6"
According to optionMaybe documentation, it returns Nothing if it can't parse, so I would expect to get this output:
[Just (300,300),Just (40,40),Just (5,5)]
[Just (300,300),Nothing, Just (5,5), Just (6,6)]
but instead I get:
[Just (300,300),Just (40,40),Just (5,5)]
[Just (300,300),Nothing]
I.e. parsing stops after first failure. So I have two questions:
Why does it behave this way?
How do I write a correct parser for this case?
In order to answer this question, it's handy to take a piece of paper, write down the input, and act as a dumb parser.
We start with "300x300,hello,5x5,6x6", our current parser is optionMaybe .... Does our dimensParser correctly parse the dimension? Let's check:
w <- many1 digit -- yes, "300"
char 'x' -- yes, "x"
h <- many1 digit -- yes, "300"
return (read w, read h) -- never fails
We've successfully parsed the first dimension. The next token is ',', so sepBy successfully parses that as well. Next, we try to parse "hello" and fail:
w <- many1 digit -- no. 'h' is not a digit. Stop
Next, sepBy tries to parse ',', but that's not possible, since the next token is 'h', not ','. Therefore, sepBy stops.
We haven't parsed all the input, but that's not actually necessary. You would get a proper error message if you had used
parse (dimensStringParser <* eof)
Either way, if you want to discard anything in the list that's not a dimension, you can use
dimensStringParser1 :: PS.Parser (Maybe (Int, Int))
dimensStringParser1 = (Just <$> dimensParser) <|> (skipMany (noneOf ",") >> return Nothing)

dimensStringParser = dimensStringParser1 `sepBy` char ','
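With that fix, the second example produces what the asker expected (a GHCi sketch; the literal needs OverloadedStrings since the parser runs over Text):

λ parse dimensStringParser "" "300x300,hello,5x5,6x6"
Right [Just (300,300),Nothing,Just (5,5),Just (6,6)]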
I'd guess that optionMaybe dimensParser, when fed with input "hello,...", tries dimensParser. That fails, so optionMaybe returns success with Nothing, and consumes no portion of the input.
The last part is the crucial one: after Nothing is returned, the input string to be parsed is still "hello,...".
At that point sepBy tries to parse char ',', which fails. So, it deduces that the list is over, and terminates the output list, without consuming any more input.
If you want to skip other entities, you need a "consuming" parser that returns Nothing instead of optionMaybe. That parser, however, needs to know how much to consume: in your case, until the comma.
Perhaps you need something like (untested)

(    try (Just <$> dimensParser)
 <|> (many1 (noneOf ",") >> return Nothing))
  `sepBy` char ','

Parsing multiple lines into a list of lists in Haskell

I am trying to parse a file that looks like:
a b c
f e d
I want to match each of the symbols in the line and parse everything into a list of lists such as:
[[A, B, C], [F, E, D]]
In order to do that I tried the following:
import Control.Monad
import Text.ParserCombinators.Parsec
import Text.ParserCombinators.Parsec.Language
import qualified Text.ParserCombinators.Parsec.Token as P

parserP :: Parser [[MyType]]
parserP = do
  x <- rowP
  xs <- many (newline >> rowP)
  return (x : xs)

rowP :: Parser [MyType]
rowP = manyTill cellP $ void newline <|> eof

cellP :: Parser MyType
cellP = aP <|> bP <|> ... -- rest of the parsers, they all look very similar

aP :: Parser MyType
aP = symbol "a" >> return A

bP :: Parser MyType
bP = symbol "b" >> return B

lexer = P.makeTokenParser emptyDef
symbol = P.symbol lexer
But it fails to return multiple inner lists. Instead what I get is:
[[A, B, C, D, E, F]]
What am I doing wrong? I was expecting manyTill to parse cellP until the newline character, but that's not the case.
Parser combinators are overkill for something this simple. I'd use lines :: String -> [String] and words :: String -> [String] to break up the input and then map the individual tokens into MyTypes.
toMyType :: String -> Maybe MyType
toMyType "a" = Just A
toMyType "b" = Just B
toMyType "c" = Just C
toMyType _ = Nothing
parseMyType :: String -> Maybe [[MyType]]
parseMyType = traverse (traverse toMyType) . fmap words . lines
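For instance, in GHCi (assuming MyType derives Show, and sticking to the three tokens toMyType knows about):

λ parseMyType "a b c\nc b a"
Just [[A,B,C],[C,B,A]]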
You're right that manyTill keeps parsing until a newline. But manyTill never gets to see the newline because cellP is too eager. cellP ends up calling P.symbol, whose documentation states
symbol :: String -> ParsecT s u m String
Lexeme parser symbol s parses string s and skips trailing white space.
The keyword there is 'white space'. It turns out, Parsec defines whitespace as being any character which satisfies isSpace, which includes newlines. So P.symbol is happily consuming the c, followed by the space and the newline, and then manyTill looks and doesn't see a newline because it's already been consumed.
If you want to drop the Parsec routine, go with Benjamin's solution. But if you're determined to stick with that, the basic idea is that you want to modify the language's whiteSpace field to correctly define whitespace to not be newlines. Something like
lexer = let lexer0 = P.makeTokenParser emptyDef
        in lexer0 { P.whiteSpace = void $ many (oneOf " \t") }
That's pseudocode and probably won't work for your specific case, but the idea is there. You want to change the definition of whiteSpace to be whatever you want to define as whiteSpace, not what the system defines by default. Note that changing this will also break your comment syntax, if you have one defined, since whiteSpace was previously equipped to handle comments.
In short, Benjamin's answer is probably the best way to go. There's no real reason to use Parsec here. But it's also helpful to know why this particular solution didn't work: Parsec's default definition of a language wasn't designed to treat newlines with significance.

Parse array of numbers between empty lines

I'm trying to make a parser to scan arrays of numbers separated by empty lines in a text file.
1 235 623 684
2 871 699 557
3 918 686 49
4 53 564 906

1 154
2 321
3 519

1 235 623 684
2 871 699 557
3 918 686 49
Here is the full text file
I wrote the following parser with parsec:
import Text.ParserCombinators.Parsec

emptyLine = do
  spaces
  newline

emptyLines = many1 emptyLine

data1 = do
  dat <- many1 digit
  return (dat)

datan = do
  many1 (oneOf " \t")
  dat <- many1 digit
  return (dat)

dataline = do
  dat1 <- data1
  dat2 <- many datan
  many (oneOf " \t")
  newline
  return (dat1:dat2)

parseSeries = do
  dat <- many1 dataline
  return dat

parseParag = try parseSeries

parseListing = do
  --cont <- parseSeries `sepBy` emptyLines
  cont <- between emptyLines emptyLines parseSeries
  eof
  return cont

main = do
  fichier <- readFile ("test_listtst.txt")
  case parse parseListing "(test)" fichier of
    Left error -> do
      putStrLn "!!! Error !!!"
      print error
    Right serie -> do
      mapM_ print serie
but it fails with the following error:
!!! Error !!!
"(test)" (line 6, column 1):
unexpected "1"
expecting space or new-line
and I don't understand why.
Do you have any idea of what's wrong with my parser?
Do you have an example on how to parse a structured bunch of data separated by empty lines?
The spaces parser in emptyLine is consuming the '\n', and then newline has no '\n' left to parse. You can write it as:
emptyLine = do
  skipMany $ satisfy (\c -> isSpace c && c /= '\n') -- isSpace comes from Data.Char
  newline
And you should change parseListing to:
parseListing = do
  cont <- parseSeries `sepEndBy` emptyLines
  eof
  return cont
I think sepEndBy is better than sepBy, because it will skip any new lines that you may have at the end of the file.
A couple of things:
spaces includes newlines, and so spaces >> newline always fails, which implies that the emptyLine parser will always fail.
I've had luck with these definitions of parseSeries and parseListing:
parseSeries = do
  s <- many1 dataline
  spaces -- eat trailing whitespace
  return s

parseListing = do
  spaces -- ignore leading whitespace
  ss <- many parseSeries -- begin parseSeries at non-whitespace
  eof
  return ss
The idea is that a parser always eats the whitespace following it.
This approach also handles empty files.
Do you have any idea of what's wrong with my parser?
A few things:
As other answerers have already pointed out, the spaces parser is designed to consume a sequence of characters that satisfy Data.Char.isSpace; the newline ('\n') is such a character. Therefore, your emptyLine parser always fails, because newline expects a newline character that has already been consumed.
You probably shouldn't use the newline parser in your "line" parsers anyway, because those parsers will fail on the last line of the file if the latter doesn't end with a newline.
Why not use parsec 3 (Text.Parsec.*) rather than parsec 2 (Text.ParserCombinators.*)?
Why not parse the numbers as Integers or Ints as you go, rather than keep them as Strings?
Personal preference, but you rely too much on the do notation for my taste, to the detriment of readability. For instance,
data1 = do
  dat <- many1 digit
  return (dat)
can be simplified to
data1 = many1 digit
You would do well to add a type signature to all your top-level bindings.
Be consistent in how you name your parsers: why "parseListing" instead of simply "listing"?
Have you considered using a different type of input stream (e.g. Text) for better performance?
Do you have an example on how to parse a structured bunch of data separated by empty lines?
Below is a much simplified version of the kind of parser you want. Note that the input is not supposed to begin with (but may end with) empty lines, and "data lines" are not supposed to contain leading spaces, but may contain trailing spaces (in the sense of the spaces parser).
module Main where

import Data.Char ( isSpace )
import Text.Parsec
import Text.Parsec.String ( Parser )

eolChar :: Char
eolChar = '\n'

eol :: Parser Char
eol = char eolChar

whitespace :: Parser String
whitespace = many $ satisfy $ \c -> isSpace c && c /= eolChar

emptyLine :: Parser String
emptyLine = whitespace

emptyLines :: Parser [String]
emptyLines = sepEndBy1 emptyLine eol

cell :: Parser Integer
cell = read <$> many1 digit

dataLine :: Parser [Integer]
dataLine = sepEndBy1 cell whitespace
-- ^
-- replace by endBy1 if no trailing whitespace is allowed in a "data line"

dataLines :: Parser [[Integer]]
dataLines = sepEndBy1 dataLine eol

listing :: Parser [[[Integer]]]
listing = sepEndBy dataLines emptyLines

main :: IO ()
main = do
  fichier <- readFile ("test_listtst.txt")
  case parse listing "(test)" fichier of
    Left error  -> putStrLn "!!! Error !!!"
    Right serie -> mapM_ print serie
Test:
λ> main
[[1,235,623,684],[2,871,699,557],[3,918,686,49],[4,53,564,906]]
[[1,154],[2,321],[3,519]]
[[1,235,623,684],[2,871,699,557],[3,918,686,49]]
Here is another approach which allows you to stream in the data and process each block as it is identified:
import Data.Char
import Control.Monad

-- toBlocks - convert a list of lines into a list of blocks
toBlocks :: [String] -> [[[String]]]
toBlocks [] = []
toBlocks theLines =
  let (block, rest) = break isBlank theLines
      next          = dropWhile isBlank rest
  in if null block
       then toBlocks next
       else [ words x | x <- block ] : toBlocks next
  where isBlank str = all isSpace str

main' path = do
  content <- readFile path
  forM_ (toBlocks (lines content)) $ print
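A quick GHCi check of toBlocks on a toy input:

λ toBlocks ["1 2", "", "3 4"]
[[["1","2"]],[["3","4"]]]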
Parsec has to read in the entire file before it gives you the list of blocks, which might be a problem if your input file is large.
I don't know the exact problem, but my experience with parsing "line oriented" files with parsec is: don't use parsec (or at least not this way).
I mean the problem is that you want somehow to strip the blanks (spaces and newlines) between numbers (on the same line) while still being aware of them when needed.
It's really hard to do in one step (which is what you are trying to do).
It's probably doable by adding lookahead, but it's really messy (and to be honest I never managed to make it work).
The easiest way is to parse lines in a first step (which allows you to detect empty lines) and then parse each line separately.
To do that, you don't need parsec at all and can do it just using lines and words. However, doing this, you lose the ability to backtrack.
There is probably a way to do "multiple step" parsing using parsec and its tokenizer (but I haven't found any useful documentation about how to use the parsec tokenizer).

Make a parser ignore all redundant whitespace

Say I have a Parser p in Parsec and I want to specify that I want to ignore all superfluous/redundant white space in p. Let's for example say that I define a list as starting with "[", ending with "]", and containing integers separated by white space. But I don't want any errors if there is white space in front of the "[", after the "]", in between the "[" and the first integer, and so on.
In my case, I want this to work for my parser for a toy programming language.
I will update with code if that is requested/necessary.
Just surround everything with spaces:
parseIntList :: Parsec String u [Int]
parseIntList = do
  spaces
  char '['
  spaces
  first <- many1 digit
  rest <- many $ try $ do
    spaces
    char ','
    spaces
    many1 digit
  spaces
  char ']'
  return $ map read $ first : rest
This is a very basic one; there are cases where it'll fail (such as an empty list), but it's a good start towards getting something to work.
Joehillen's suggestion will also work, but it requires some more type magic to use the token features of Parsec. The definition of spaces matches zero or more characters that satisfy Data.Char.isSpace, which covers all the standard ASCII whitespace characters.
Use combinators to say what you mean:
import Control.Applicative
import Text.Parsec
import Text.Parsec.String

program :: Parser [[Int]]
program = spaces *> many1 term <* eof

term :: Parser [Int]
term = list

list :: Parser [Int]
list = between listBegin listEnd (number `sepBy` listSeparator)

listBegin, listEnd, listSeparator :: Parser Char
listBegin = lexeme (char '[')
listEnd = lexeme (char ']')
listSeparator = lexeme (char ',')

lexeme :: Parser a -> Parser a
lexeme parser = parser <* spaces

number :: Parser Int
number = lexeme $ do
  digits <- many1 digit
  return (read digits :: Int)
Try it out:
λ :l Parse.hs
Ok, modules loaded: Main.
λ parseTest program " [1, 2, 3] [4, 5, 6] "
[[1,2,3],[4,5,6]]
This lexeme combinator takes a parser and allows arbitrary whitespace after it. Then you only need to use lexeme around the primitive tokens in your language such as listSeparator and number.
Alternatively, you can parse the stream of characters into a stream of tokens, then parse the stream of tokens into a parse tree. That way, both the lexer and parser can be greatly simplified. It’s only worth doing for larger grammars, though, where maintainability is a concern; and you have to use some of the lower-level Parsec API such as tokenPrim.
