Make a parser ignore all redundant whitespace

Make a parser ignore all redundant whitespace - parsing

Say I have a Parser p in Parsec and I want to specify that I want to ignore all superfluous/redundant white space in p. Let's for example say that I define a list as starting with "[", end with "]", and in the list are integers separated by white space. But I don't want any errors if there are white space in front of the "[", after the "]", in between the "[" and the first integer, and so on.
In my case, I want this to work for my parser for a toy programming language.
I will update with code if that is requested/necessary.

Just surround everything with space:
parseIntList :: Parsec String u [Int]
parseIntList = do
spaces
char '['
spaces
first <- many1 digit
rest <- many $ try $ do
spaces
char ','
spaces
many1 digit
spaces
char ']'
return $ map read $ first : rest
This is a very basic one, there are cases where it'll fail (such as an empty list) but it's a good start towards getting something to work.
#Joehillen's suggestion will also work, but it requires some more type magic to use the token features of Parsec. The definition of spaces matches 0 or more characters that satisfies Data.Char.isSpace, which is all the standard ASCII space characters.

Use combinators to say what you mean:
import Control.Applicative
import Text.Parsec
import Text.Parsec.String
program :: Parser [[Int]]
program = spaces *> many1 term <* eof
term :: Parser [Int]
term = list
list :: Parser [Int]
list = between listBegin listEnd (number `sepBy` listSeparator)
listBegin, listEnd, listSeparator :: Parser Char
listBegin = lexeme (char '[')
listEnd = lexeme (char ']')
listSeparator = lexeme (char ',')
lexeme :: Parser a -> Parser a
lexeme parser = parser <* spaces
number :: Parser Int
number = lexeme $ do
digits <- many1 digit
return (read digits :: Int)
Try it out:
λ :l Parse.hs
Ok, modules loaded: Main.
λ parseTest program " [1, 2, 3] [4, 5, 6] "
[[1,2,3],[4,5,6]]
This lexeme combinator takes a parser and allows arbitrary whitespace after it. Then you only need to use lexeme around the primitive tokens in your language such as listSeparator and number.
Alternatively, you can parse the stream of characters into a stream of tokens, then parse the stream of tokens into a parse tree. That way, both the lexer and parser can be greatly simplified. It’s only worth doing for larger grammars, though, where maintainability is a concern; and you have to use some of the lower-level Parsec API such as tokenPrim.

Related

Using makeExprParser with ambiguity

I'm currently encountering a problem while translating a parser from a CFG-based tool (antlr) to Megaparsec.
The grammar contains lists of expressions (handled with makeExprParser) that are enclosed in brackets (<, >) and separated by ,.
Stuff like <>, <23>, <23,87> etc.
The problem now is that the expressions may themselves contain the > operator (meaning "greater than"), which causes my parser to fail.
<1223>234> should, for example, be parsed into [BinaryExpression ">" (IntExpr 1223) (IntExpr 234)].
I presume that I have to strategically place try somewhere, but the places I tried (to the first argument of sepBy and the first argument of makeExprParser) did unfortunately not work.
Can I use makeExprParser in such a situation or do I have to manually write the expression parser?:
This is the relevant part of my parser:
-- uses megaparsec, text, and parser-combinators
{-# LANGUAGE OverloadedStrings #-}
module Main where
import Control.Monad.Combinators.Expr
import Data.Text
import Data.Void
import System.Environment
import Text.Megaparsec
import Text.Megaparsec.Char
import qualified Text.Megaparsec.Char.Lexer as L
type BinaryOperator = Text
type Name = Text
data Expr
= IntExpr Integer
| BinaryExpression BinaryOperator Expr Expr
deriving (Eq, Show)
type Parser = Parsec Void Text
lexeme :: Parser a -> Parser a
lexeme = L.lexeme sc
symbol :: Text -> Parser Text
symbol = L.symbol sc
sc :: Parser ()
sc = L.space space1 (L.skipLineComment "//") (L.skipBlockCommentNested "/*" "*/")
parseInteger :: Parser Expr
parseInteger = do
number <- some digitChar
_ <- sc
return $ IntExpr $ read number
parseExpr :: Parser Expr
parseExpr = makeExprParser parseInteger [[InfixL (BinaryExpression ">" <$ symbol ">")]]
parseBracketList :: Parser [Expr]
parseBracketList = do
_ <- symbol "<"
exprs <- sepBy parseExpr (symbol ",")
_ <- symbol ">"
return exprs
main :: IO ()
main = do
text : _ <- getArgs
let res = runParser parseBracketList "stdin" (pack text)
case res of
(Right suc) -> do
print suc
(Left err) ->
putStrLn $ errorBundlePretty err

You've (probably) misdiagnosed the problem. Your parser fails on <1233>234> because it's trying to parse > as a left associative operator, like +. In other words, the same way:
1+2+
would fail, because the second + has no right-hand operand, your parser is failing because:
1233>234>
has no digit following the second >. Assuming you don't want your > operator to chain (i.e., 1>2>3 is not a valid Expr), you should first replace InfixL with InfixN (non-associative) in your makeExprParser table. Then, it will parse this example fine.
Unfortunately, with or without this change your parser will still fail on the simpler test case:
<1233>
because the > is interpreted as an operator within a continuing expression.
In other words, the problem isn't that your parser can't handle expressions with > characters, it's that it's overly aggressive in treating > characters as part of an expression, preventing them from being recognized as the closing angle bracket.
To fix this, you need to figure out exactly what you're parsing. Specifically, you need to resolve the ambiguity in your parser by precisely characterizing the situations where > can be part of a continuing expression and where it can't.
One rule that will probably work is to only consider a > as an operator if it is followed by a valid "term" (i.e., a parseInteger). You can do this with lookAhead. The parser:
symbol ">" <* lookAhead term
will parse a > operator only if it is followed by a valid term. If it fails to find a term, it will consume some input (at least the > symbol itself), so you must surround it with a try:
try (symbol ">" <* lookAhead term)
With the above two fixes applied to parseExpr:
parseExpr :: Parser Expr
parseExpr = makeExprParser term
[[InfixN (BinaryExpression ">" <$ try (symbol ">" <* lookAhead term))]]
where term = parseInteger
you'll get the following parses:
λ> parseTest parseBracketList "<23>"
[IntExpr 23]
λ> parseTest parseBracketList "<23,87>"
[IntExpr 23,IntExpr 87]
λ> parseTest parseBracketList "<23,87>18>"
[IntExpr 23,BinaryExpression ">" (IntExpr 87) (IntExpr 18)]
However, the following will fail:
λ> parseTest parseBracketList "<23,87>18"
1:10:
|
1 | <23,87>18
| ^
unexpected end of input
expecting ',', '>', or digit
λ>
because the fact that the > is followed by 18 means that it is a valid operator, and it is parse failure that the valid expression 87>18 is followed by neither a comma nor a closing > angle bracket.
If you need to parse something like <23,87>18, you have bigger problems. Consider the following two test cases:
<1,2>3,4,5,6,7,...,100000000000,100000000001>
<1,2>3,4,5,6,7,...,100000000000,100000000001
It's a challenge to write an efficient parser that will parse the first one as a list of 10000000000 expressions but the second one as a list of two expression:
[IntExpr 1, IntExpr 2]
followed by some "extra" text. Hopefully, the underlying "language" you're trying to parse isn't so hopelessly broken that this will be an issue.

Why Parsec's sepBy stops and does not parse all elements?

I am trying to parse some comma separated string which may or may not contain a string with image dimensions. For example "hello world, 300x300, good bye world".
I've written the following little program:
import Text.Parsec
import qualified Text.Parsec.Text as PS
parseTestString :: Text -> [Maybe (Int, Int)]
parseTestString s = case parse dimensStringParser "" s of
Left _ -> [Nothing]
Right dimens -> dimens
dimensStringParser :: PS.Parser [Maybe (Int, Int)]
dimensStringParser = (optionMaybe dimensParser) `sepBy` (char ',')
dimensParser :: PS.Parser (Int, Int)
dimensParser = do
w <- many1 digit
char 'x'
h <- many1 digit
return (read w, read h)
main :: IO ()
main = do
print $ parseTestString "300x300,40x40,5x5"
print $ parseTestString "300x300,hello,5x5,6x6"
According to optionMaybe documentation, it returns Nothing if it can't parse, so I would expect to get this output:
[Just (300,300),Just (40,40),Just (5,5)]
[Just (300,300),Nothing, Just (5,5), Just (6,6)]
but instead I get:
[Just (300,300),Just (40,40),Just (5,5)]
[Just (300,300),Nothing]
I.e. parsing stops after first failure. So I have two questions:
Why does it behave this way?
How do I write a correct parser for this case?

In order to answer this question, it's handy to take a piece of paper, write down the input, and act as a dumb parser.
We start with "300x300,hello,5x5,6x6", our current parser is optionMaybe .... Does our dimensParser correctly parse the dimension? Let's check:
w <- many1 digit -- yes, "300"
char 'x' -- yes, "x"
h <- many1 digit -- yes, "300"
return (read w, read h) -- never fails
We've successfully parsed the first dimension. The next token is ,, so sepBy successfully parses that as well. Next, we try to parse "hello" and fail:
w <- many1 digit -- no. 'h' is not a digit. Stop
Next, sepBy tries to parse ,, but that's not possible, since the next token is a 'h', not a ,. Therefore, sepBy stops.
We haven't parsed all the input, but that's not actually necessary. You would get a proper error message if you've used
parse (dimensStringParser <* eof)
Either way, if you want to discard anything in the list that's not a dimension, you can use
dimensStringParser1 :: Parser (Maybe (Int, Int))
dimensStringParser1 = (Just <$> dimensParser) <|> (skipMany (noneOf ",") >> Nothing)
dimensStringParser = dimensStringParser1 `sepBy` char ','

I'd guess that optionMaybe dimensParser, when fed with input "hello,...", tries dimensParser. That fails, so optionMaybe returns success with Nothing, and consumes no portion of the input.
The last part is the crucial one: after Nothing is returned, the input string to be parsed is still "hello,...".
At that point sepBy tries to parse char ',', which fails. So, it deduces that the list is over, and terminates the output list, without consuming any more input.
If you want to skip other entities, you need a "consuming" parser that returns Nothing instead of optionMaybe. That parser, however, need to know how much to consume: in your case, until the comma.
Perhaps you need some like (untested)
( try (Just <$> dimensParser)
<|> (noneOf "," >> return Nothing))
`sepBy` char ','

Designing parsing code using Parsec

In the course for following the tutorial Write yourself a Scheme in 48 hours, I was attempting to enhance my parsing code to create support for hexadecimal, octal, binary and decimal literals.
import Text.ParserCombinators.Parsec hiding (spaces)
import Control.Monad
import Numeric
hexChar :: Parser Char
hexChar = ...
octChar :: Parser Char
octChar = ...
hexNumber :: Parser Integer
hexNumber = do
char '#'
char 'x'
s <- many1 hexChar
return $ (fst . head) readHex s
octNumber :: Parser Integer
octNumber = do
char '#'
char 'o'
s <- many1 hexChar
return $ (fst . head) readOct s
If we forget about decimal and binary numbers in this discussion:
parseNumber :: Parse Integer
parseNumber = hexNumber <|> octNumber
Then this parser will fail to recognize octal numbers. This seems to be related to the number of lookahead characters required to tell apart and octal from hexadecimal numbers (if we drop the leading '#' in the syntax, then the
parser will work). Hence it seems we are forced to revisit the code and 'factorize' the leading '#' so to speak, by dropping the char '#' in the individual parsers and defining:
parseNumber = char '#' >> (hexNumber <|> octNumber)
This is fine but I find the code less pleasant. Somehow, if I have a function called hexNumber I would expect it to recognize #xffff (which is proper Scheme syntax) and not xffff. Is this something I have to live with, or are there ways to go around this 'forced factorization' of the leading character #?

If the first argument of (<|>) fails after having consumed some input, then it fails immediately without trying the second alternative. If a failure in the first argument should lead to a retry with the second argument, you can use try to avoid consuming input. In hexNumber you must consume '#' only if the following character matches 'x'.
hexNumber :: Parser Integer
hexNumber = do
try $ char '#' >> char 'x'
s <- many1 hexChar
return $ (fst . head) readHex s
octNumber :: Parser Integer
octNumber = do
try $ char '#' >> char 'o'
s <- many1 hexChar
return $ (fst . head) readOct s
Note that this is somewhat inefficient since you parse '#' twice, and it gets worse as the common prefix gets longer and more complex.

Parse array of numbers between emptylines

I'm trying to make a parser to scan arrays of numbers separated by empty lines in a text file.
1 235 623 684
2 871 699 557
3 918 686 49
4 53 564 906
1 154
2 321
3 519
1 235 623 684
2 871 699 557
3 918 686 49
Here is the full text file
I wrote the following parser with parsec :
import Text.ParserCombinators.Parsec
emptyLine = do
spaces
newline
emptyLines = many1 emptyLine
data1 = do
dat <- many1 digit
return (dat)
datan = do
many1 (oneOf " \t")
dat <- many1 digit
return (dat)
dataline = do
dat1 <- data1
dat2 <- many datan
many (oneOf " \t")
newline
return (dat1:dat2)
parseSeries = do
dat <- many1 dataline
return dat
parseParag = try parseSeries
parseListing = do
--cont <- parseSeries `sepBy` emptyLines
cont <- between emptyLines emptyLines parseSeries
eof
return cont
main = do
fichier <- readFile ("test_listtst.txt")
case parse parseListing "(test)" fichier of
Left error -> do putStrLn "!!! Error !!!"
print error
Right serie -> do
mapM_ print serie
but it fails with the following error :
!!! Error !!!
"(test)" (line 6, column 1):
unexpected "1"
expecting space or new-line
and I don't understand why.
Do you have any idea of what's wrong with my parser ?
Do you have an example on how to parse a structured bunch of data separated by empty lines ?

The spaces in emptyLine is consuming the '\n', and then newline has no '\n' to parse. You can write it as:
emptyLine = do
skipMany $ satisfy (\c -> isSpace c && c /= '\n')
newline
And you should change parseListing to:
parseListing = do
cont <- parseSeries `sepEndBy` emptyLines
eof
return cont
I think sepEndBy is better than sepBy, because it will skip any new lines that you may have at the end of the file.

A couple of things:
spaces includes new lines, and so spaces >> newline always fails which implies that the emptyLine parser will always fail.
I've had luck with these definitions of parseSeries and parseListing:
parseSeries = do
s <- many1 dataline
spaces -- eat trailing whitespace
return s
parseListing = do
spaces -- ignore leading whitespace
ss <- many parseSeries -- begin parseSeries at non-whitespace
eof
return ss
The idea is that a parser always eats the whitespace following it.
This approach also handles empty files.

Do you have any idea of what's wrong with my parser ?
A few things:
As other answerers have already pointed out, the spaces parser is designed to consume a sequence of characters that satisfy Data.Char.isSpace; the newline ('\n') is such a character. Therefore, your emptyLine parser always fails, because newline expects a newline character that has already been consumed.
You probably shouldn't use the newline parser in your "line" parsers anyway, because those parsers will fail on the last line of the file if the latter doesn't end with a newline.
Why not use parsec 3 (Text.Parsec.*) rather than parsec 2 (Text.ParserCombinators.*)?
Why not parse the numbers as Integers or Ints as you go, rather than keep them as Strings?
Personal preference, but you rely too much on the do notation for my taste, to the detriment of readability. For instance,
data1 = do
dat <- many1 digit
return (dat)
can be simplified to
data1 = many1 digit
You would do well to add a type signature to all your top-level bindings.
Be consistent in how you name your parsers: why "parseListing" instead of simply "listing"?
Have you considered using a different type of input stream (e.g. Text) for better performance?
Do you have an example on how to parse a structured bunch of data separated by empty lines ?
Below is a much simplified version of the kind of parser you want. Note that the input is not supposed to begin with (but may end with) empty lines, and "data lines" are not supposed to contain leading spaces, but may contain trailing spaces (in the sense of the spaces parser).
module Main where
import Data.Char ( isSpace )
import Text.Parsec
import Text.Parsec.String ( Parser )
eolChar :: Char
eolChar = '\n'
eol :: Parser Char
eol = char eolChar
whitespace :: Parser String
whitespace = many $ satisfy $ \c -> isSpace c && c /= eolChar
emptyLine :: Parser String
emptyLine = whitespace
emptyLines :: Parser [String]
emptyLines = sepEndBy1 emptyLine eol
cell :: Parser Integer
cell = read <$> many1 digit
dataLine :: Parser [Integer]
dataLine = sepEndBy1 cell whitespace
-- ^
-- replace by endBy1 if no trailing whitespace is allowed in a "data line"
dataLines :: Parser [[Integer]]
dataLines = sepEndBy1 dataLine eol
listing :: Parser [[[Integer]]]
listing = sepEndBy dataLines emptyLines
main :: IO ()
main = do
fichier <- readFile ("test_listtst.txt")
case parse listing "(test)" fichier of
Left error -> putStrLn "!!! Error !!!"
Right serie -> mapM_ print serie
Test:
λ> main
[[1,235,623,684],[2,871,699,557],[3,918,686,49],[4,53,564,906]]
[[1,154],[2,321],[3,519]]
[[1,235,623,684],[2,871,699,557],[3,918,686,49]]

Here is another approach which allows you to stream in the data and process each block as it is identified:
import Data.Char
import Control.Monad
-- toBlocks - convert a list of lines into a list of blocks
toBlocks :: [String] -> [[[String]]]
toBlocks [] = []
toBlocks theLines =
let (block,rest) = break isBlank theLines
next = dropWhile isBlank rest
in if null block
then toBlocks next
else [ words x | x <- block ] : toBlocks next
where isBlank str = all isSpace str
main' path = do
content <- readFile path
forM_ (toBlocks (lines content)) $ print
Parsec has to read in the entire file before it gives you the list of blocks which might be a problem if your input file is large.

I don't know the exact problem but my experience with parsing "line oriented" file with parsec is : don't use parsec ( or at least not this way).
I mean the problem is you want somehow to strip the blanks (spaces and newline) between numbers (on the same line) but still being aware of them when needed.
It's really hard to do in one step (which is what you are trying to do).
It's probably doable adding lookahead but it's really messy (and to be honest I never managed to make it work).
The easiest way is to parse lines on the first step (which allow you to detect empty lines) and then parse each line separately.
To do that, you don't need parsec at all and can do it just using lines and words. However, doing this, you are losing the ability to backtrack.
There is probably a way to "mulitple steps" parsing using parsec and it's tokenizer (but I haven't find any useful doc about how to use parsec tokenizer).

Correct ReadP usage in Haskell

I did a very simple parser for lists of numbers in a file, using ReadP in Haskell. It works, but it is very slow... is this normal behavior of this type of parser or am I doing something wrong?
import Text.ParserCombinators.ReadP
import qualified Data.IntSet as IntSet
import Data.Char
setsReader :: ReadP [ IntSet.IntSet ]
setsReader =
setReader `sepBy` ( char '\n' )
innocentWhitespace :: ReadP ()
innocentWhitespace =
skipMany $ (char ' ') <++ (char '\t' )
setReader :: ReadP IntSet.IntSet
setReader = do
innocentWhitespace
int_list <- integerReader `sepBy1` innocentWhitespace
innocentWhitespace
return $ IntSet.fromList int_list
integerReader :: ReadP Int
integerReader = do
digits <- many1 $ satisfy isDigit
return $ read digits
readClusters:: String -> IO [ IntSet.IntSet ]
readClusters filename = do
whole_file <- readFile filename
return $ ( fst . last ) $ readP_to_S setsReader whole_file

setReader has exponential behavior, because it is allowing the whitespace between the numbers to be optional. So for the line:
12 34 56
It is seeing these parses:
[1,2,3,4,5,6]
[12,3,4,5,6]
[1,2,34,5,6]
[12,34,5,6]
[1,2,3,4,56]
[12,3,4,56]
[1,2,34,56]
[12,34,56]
You could see how this could get out of hand for long lines. ReadP returns all valid parses in increasing length order, so to get to the last parse you have to traverse through all these intermediate parses. Change:
int_list <- integerReader `sepBy1` innocentWhitespace
To:
int_list <- integerReader `sepBy1` mandatoryWhitespace
For a suitable definition of mandatoryWhitespace to squash this exponential behavior. The parsing strategy used by parsec is more resistant to this kind of error, because it is greedy -- once it consumes input in a given branch, it is committed to that branch and never goes back (unless you explicitly asked it to). So once it correctly parsed 12, it would never go back to parse 1 2. Of course that means it matters in which order you state your choices, which I always find to be a bit of a pain to think about.
Also I would use:
head [ x | (x,"") <- readP_to_S setsReader whole_file ]
To extract a valid whole-file parse, in case it very quickly consumed all input but there were a hundred bazillion ways to interpret that input. If you don't care about the ambiguity, you would probably rather it return the first one than the last one, because the first one will arrive faster.

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart

Make a parser ignore all redundant whitespace - parsing

Related

Using makeExprParser with ambiguity

Why Parsec's sepBy stops and does not parse all elements?

Designing parsing code using Parsec

Parse array of numbers between emptylines

Correct ReadP usage in Haskell

Categories

Resources