Skip everything until a successful parse - parsing

I'd like to parse all days from a text like this:
Ignore this
Also this
2019-09-05
More to ignore
2019-09-06
2019-09-07
Using Trifecta, I've defined a function to parse a day:
dayParser :: Parser Day
dayParser = do
dayString <- tillEnd
parseDay dayString
tillEnd :: Parser String
tillEnd = manyTill anyChar (try eof <|> eol)
parseDay :: String -> Parser Day
parseDay s = maybe failure return dayMaybe
where
dayMaybe = parseTime' dayFormat s
failure = unexpected $ "Failed to parse date. Expected format: " ++ dayFormat
-- %-m makes the parser accept months consisting of a single digit
dayFormat = "%Y-%-m-%-d"
eol :: Parser ()
eol = char '\n' <|> char '\r' >> return ()
-- "%Y-%-m-%-d" for example
type TimeFormat = String
-- Given a time format and a string, parses the string to a time.
parseTime' :: (Monad m, ParseTime t) => TimeFormat -> String -> m t
-- True means that the parser tolerates whitespace before and after the date
parseTime' = parseTimeM True defaultTimeLocale
Parsing a day this way works. What I'm having trouble with is ignoring anything in the text that's not a day.
The following can't work since it assumes the number of text blocks that aren't a day:
daysParser :: Parser [Day]
daysParser = do
-- Ignore everything that's not a day
_ <- manyTill anyChar $ try dayParser
days <- many $ token dayParser
_ <- manyTill anyChar $ try dayParser
-- There might be more days after this...
return days
I reckon there's a straightforward way to express this with Trifecta but I can't seem to find it.
Here's the whole module including an example text to parse:
{-# LANGUAGE QuasiQuotes #-}
module DateParser where
import Text.RawString.QQ
import Data.Time
import Text.Trifecta
import Control.Applicative ( (<|>) )
-- "%Y-%-m-%-d" for example
type TimeFormat = String
dayParser :: Parser Day
dayParser = do
dayString <- tillEnd
parseDay dayString
tillEnd :: Parser String
tillEnd = manyTill anyChar (try eof <|> eol)
parseDay :: String -> Parser Day
parseDay s = maybe failure return dayMaybe
where
dayMaybe = parseTime' dayFormat s
failure = unexpected $ "Failed to parse date. Expected format: " ++ dayFormat
-- %-m makes the parser accept months consisting of a single digit
dayFormat = "%Y-%-m-%-d"
eol :: Parser ()
eol = char '\n' <|> char '\r' >> return ()
-- Given a time format and a string, parses the string to a time.
parseTime' :: (Monad m, ParseTime t) => TimeFormat -> String -> m t
-- True means that the parser tolerates whitespace before and after the date
parseTime' = parseTimeM True defaultTimeLocale
daysParser :: Parser [Day]
daysParser = do
-- Ignore everything that's not a day
_ <- manyTill anyChar $ try dayParser
days <- many $ token dayParser
_ <- manyTill anyChar $ try dayParser
-- There might be more days after this...
return days
test = parseString daysParser mempty text1
text1 = [r|
Ignore this
Also this
2019-09-05
More to ignore
2019-09-06
2019-09-07|]

There are three large problems here.
First, the way you're defining dayParser, it's always trying to parse the rest of the text as a date. For example, if your input text is "2019-01-01 foo bar", then dayParser would first consume the whole string, so that dayString == "2019-01-01 foo bar", and then will try to parse that string as a date. Which, of course, would fail.
In order to have a saner behavior, you could only bite off the beginning of the string that kinda looks like a date and try to parse that, like:
dayParser =
parseDay =<< many (digit <|> char '-')
This implementation bites off the beginning of the input consisting of digits and dashes, and tries to parse that as a date.
Note that this is a quick-n-dirty implementation. It is imprecise. For example, this implementation would accept input like "2019-01-0123456" and try to parse that as a date, and of course will fail. From your question, it is not clear whether you'd want to still parse 2019-01-01 and leave the rest, or whether you want to not consider that a proper date. If you wanted to be super-precise about this, you could specify the exact format as precisely as you want, e.g.:
dayParser = do
y <- count 4 digit
void $ char '-'
m <- try (count 2 digit) <|> count 1 digit
void $ char '-'
d <- try (count 2 digit) <|> count 1 digit
parseDay $ y ++ "-" ++ m ++ "-" ++ d
This implementation expects exactly the format of the date.
Second, there is a logical problem: your daysParser tries to first parse some garbage, then parse many days, and then parse some garbage again. This logic does not admit a case where the many dates have some garbage between them.
Third problem is much more tricky. You see, the way the try combinator works - if the parser fails, then try will roll back the input position, but if the parser succeeds, then the input remains consumed! This means that you cannot use try as a zero-consumption lookahead, the way you're trying to do in manyTill anyChar $ try dayParser. Such a parser will parse until it finds a date, and then it will consume the date, leaving nothing for the next parser and causing it to fail.
I will illustrate with a simpler example. Consider this:
> parseString (many (char 'a')) mempty "aaa"
Success "aaa"
Cool, it parses three 'a's. Now let's add a try at the beginning:
> parseString (try (char 'b') *> many (char 'a')) mempty "aaa"
Success "aaa"
Awesome, this still works: the try fails, and then we parse three 'a's as before.
Now let's change the try from 'b' to 'a':
> parseString (try (char 'a') *> many (char 'a')) mempty "aaa"
Success "aa"
Look what happened: the try has consumed the first 'a', leaving only two to be parsed by many.
We can even extend it to more fully resemble your approach:
> p = manyTill anyChar (try (char 'a')) *> many (char 'a')
> parseString p mempty "aaa"
Success "aa"
> parseString p mempty "cccaaa"
Success "aa"
See what happens? manyTill correctly skips all the 'c's up to the first 'a', but then it also consumes that first 'a'!
There appears to be no sane way (that I see) to have a zero-consumption lookahead like this. You always have to consume the first successful hit.
If I had this problem, I would probably resort to recursion: parsing chars one by one, at every step looking if I can get a day, and concatenating in a list. Something like this:
data WhatsThis = AChar Char | ADay Day | EOF
daysParser = do
r <- (ADay <$> dayParser) <|> (AChar <$> anyChar) <|> (EOF <$ eof)
case r of
ADay d -> do
rest <- daysParser
pure $ d : rest
AChar _ ->
daysParser
EOF ->
pure []
It tries to parse a day, and if that fails, just skips a char, unless there are no more chars. If day parsing succeeded, it calls itself recursively, then prepends the day to the result of the recursive call.
Note that this approach is not very composable: it always consumes everything till the end of the input. If you want to compose it with something else, you may want consider replacing eof with a parameter:
daysParser stop = do
r <- (ADay <$> dayParser) <|> (AChar <$> anyChar) <|> (EOF <$ stop)
...

Related

Passing argument to a ReadP Parser in Haskell

I am trying to create a parser from scratch in Haskell. I have problems passing a string as an argument to a function that is already part of a do block in which the parsing occurs. Why does the following Minimal viable example code return [] and not 4 as expected.
import Data.Char
import Text.ParserCombinators.ReadP
import Control.Applicative ((<|>))
type Parser a = ReadP a
token :: Parser a -> Parser a
token combinator = (do spaces
combinator)
space :: Parser Char
space = satisfy isSpace
spaces :: Parser String
spaces = many space
parseString input = readP_to_S (do
e <- pExpr
token eof
return e) input
pExpr = (do
pv <- pOpHelper
spaces
str <- string pv
return str
)
pOpHelper :: Parser String
pOpHelper = (do
e1 <- munch isDigit
return e1
)
I am of course interested in returning a processed version of whatever string pv returns. However I can't understand why the current setup wouldn't return anything besides [] on parseString "4" since calling just pOpHelper wihtout pExpr seems to work.
Edit
I think I have located the 'bug' to be part of the string function. I had a closer look at it here but I can't see from the documentation why it shouldn't work in the above. But the above code is narrowed down to the parts that produce the unintended outputs as specified.
EDIT EDIT
I have now narrowed the problem down even further. It has to do with how 'consumption' works for the parser. The problem is that if I give it parseString "4" the string pv expects the "4" that is returned by pv, but it will still be parsing the next characters on which munch isDigit is no longer satisfies. This means that it will only return [("4","")] rather than [] if the input is parseString "4 4", and only if the spaces has been added to the do-clause in pExpr.
But how can I work around this and avoid 'consuming' the string that I put as input. Is there a way to use look for instance, in the above documentation.
As pointed out in the comments below I am interested in transforming whatever is the input to pOpHelper and then passing its output to functions (in a recursion) that is part of the parent parser-function called. But how can I do it without consuming the input with pOpHelper first such that the following example would return str on input of "4":
pExpr = (do
pv <- pOpHelper
--spaces
str <- string pv
if str == "(4)" then return str -- do stuff!
else pfail
)
pOpHelper :: Parser String
pOpHelper = (do
e1 <- munch isDigit
return ( "(" ++ e1 ++ ")" )
)

Parsec fails without error if reading from file

I wrote a small parsec parser to read samples from a user supplied input string or an input file. It fails properly on wrong input with a useful error message if the input is provided as a semicolon separated string:
> readUncalC14String "test1,7444,37;6800,36;testA,testB,2000,222;test3,7750,40"
*** Exception: Error in parsing dates from string: (line 1, column 29):
unexpected "t"
expecting digit
But it fails silently for the input file inputFile.txt with identical entries:
test1,7444,37
6800,36
testA,testB,2000,222
test3,7750,40
> readUncalC14FromFile "inputFile.txt"
[UncalC14 "test1" 7444 37,UncalC14 "unknownSampleName" 6800 36]
Why is that and how can I make readUncalC14FromFile fail in a useful manner as well?
Here is a minimal subset of my code:
import qualified Text.Parsec as P
import qualified Text.Parsec.String as P
data UncalC14 = UncalC14 String Int Int deriving Show
readUncalC14FromFile :: FilePath -> IO [UncalC14]
readUncalC14FromFile uncalFile = do
s <- readFile uncalFile
case P.runParser uncalC14SepByNewline () "" s of
Left err -> error $ "Error in parsing dates from file: " ++ show err
Right x -> return x
where
uncalC14SepByNewline :: P.Parser [UncalC14]
uncalC14SepByNewline = P.endBy parseOneUncalC14 (P.newline <* P.spaces)
readUncalC14String :: String -> Either String [UncalC14]
readUncalC14String s =
case P.runParser uncalC14SepBySemicolon () "" s of
Left err -> error $ "Error in parsing dates from string: " ++ show err
Right x -> Right x
where
uncalC14SepBySemicolon :: P.Parser [UncalC14]
uncalC14SepBySemicolon = P.sepBy parseOneUncalC14 (P.char ';' <* P.spaces)
parseOneUncalC14 :: P.Parser UncalC14
parseOneUncalC14 = do
P.try long P.<|> short
where
long = do
name <- P.many (P.noneOf ",")
_ <- P.oneOf ","
mean <- read <$> P.many1 P.digit
_ <- P.oneOf ","
std <- read <$> P.many1 P.digit
return (UncalC14 name mean std)
short = do
mean <- read <$> P.many1 P.digit
_ <- P.oneOf ","
std <- read <$> P.many1 P.digit
return (UncalC14 "unknownSampleName" mean std)
What is happening here is that a prefix of your input is a valid string. To force parsec to use the whole input you can use the eof parser:
uncalC14SepByNewline = P.endBy parseOneUncalC14 (P.newline <* P.spaces) <* P.eof
The reason that one works and the other doesn't is due to the difference between sepBy and endBy. Here is a simpler example:
sepTest, endTest :: String -> Either P.ParseError String
sepTest s = P.runParser (P.sepBy (P.char 'a') (P.char 'b')) () "" s
endTest s = P.runParser (P.endBy (P.char 'a') (P.char 'b')) () "" s
Here are some interesting examples:
ghci> sepTest "abababb"
Left (line 1, column 7):
unexpected "b"
expecting "a"
ghci> endTest "abababb"
Right "aaa"
ghci> sepTest "ababaa"
Right "aaa"
ghci> endTest "ababaa"
Left (line 1, column 6):
unexpected "a"
expecting "b"
As you can see both sepBy and endBy can fail silently, but sepBy fails silently if the prefix doesn't end in the separator b and endBy fails silently if the prefix doesn't end in the main parser a.
So you should use eof after both parsers if you want to make sure you read the whole file/string.

How do I skip specified symbols while parsing

I am trying to write a trifecta parser that can parse all three of the phone numbers below. When I try to use parsePhone by calling parseString parsePhone mempty phoneNum2, the parser fails at the first dash and says it expected '('.
When I call the parser on phoneNum1 it fails at ')' , saying it expected '('.
Why is my skipSymbol parser failing? I would think that due to my use of <|>, the parser would be fine with not detecting '(' and move on. Is the technique that I am attempting with skipSymbol bound to fail?
phoneNum1 = "(123) 456 7890"
phoneNum2 = "123-456-7890"
phoneNum3 = "1234567890"
type NumberingPlanArea = Integer
type Exchange = Integer
type LineNumber = Integer
data PhoneNumber =
PhoneNumber NumberingPlanArea
Exchange LineNumber
deriving (Eq, Show)
parse3digits :: Parser Integer
parse3digits = read <$> replicateM 3 digit
skipSymbol :: Parser ()
skipSymbol =
skipMany (char '(')
<|> skipMany (char ')')
<|> skipMany (char '-')
<|> skipMany (char ' ')
parsePhone :: Parser PhoneNumber
parsePhone =
skipSymbol >>
parse3digits >>=
\area -> skipSymbol >>
parse3digits >>=
\exch -> skipSymbol >>
integer >>=
\line ->
pure $ PhoneNumber area exch line
skipMany p applies the parser p zero-or-more times. Operationally, it goes something like this:
Attempt to apply p.
If p succeeded, repeat step 1.
If p failed without consuming input, succeed and return (). (This is what the “zero” in “zero-or-more” means.)
If p failed after consuming some input, report the failure.
Let's look at how skipSymbol operates on the input ).
Parsec tries the left hand choice of skipSymbol, namely skipMany (char '(').
skipMany (char '(') attempts to apply char '(', which fails without consuming input because the input character is ).
Because char '(' failed without consuming input, skipMany (char '(') succeeds without consuming input. This means the other choices in skipSymbol won't be attempted.
The current input character is still ) (which is what causes parse3Digits to later fail).
As noted in the comments, the fix is to change the definition of skipSymbol to
skipSymbol :: Parser ()
skipSymbol = skipMany $ choice [char c | c <- "()- "]
This version loops a choice, rather than choosing between loops.

avoid parsing last separator with `sepBy`

I'm trying to parse a string using megaparsec.
Part of it is a repetition of strings separated by a separator and I'm using sepBy for this.
Consider for example
sepBy (char 'a') (char 's')
This parses correctly "", "a", "asa", ...
The problem appears if I need to continue parsing with another parser which starts with my separator, as in
(,) <$> sepBy (char 'a') (char 's') <*> string "something"
If I try to parse the string "asasomething" with this parser I'd expect to get ("aa", "something"). Instead I get an error because I don't have an a after the second s.
I tried also with sepEndBy but the result is the same
I solved it as follows.
The implementation of sepBy used by megapersec is
sepBy :: MonadPlus m => m a -> m sep -> m [a]
sepBy p sep = do
r <- C.optional p
case r of
Nothing -> return []
Just x -> (x:) <$> many (sep >> p)
I modified it to
sepBy :: Parser a -> Parser sep -> Parser [a]
sepBy p sep = do
r <- optional p
case r of
Nothing -> return []
Just x -> (x:) <$> many (try $ sep >> p)
to specialise it to Parsec add a try to avoid eager parsing

Correctly parsing line indentations in uu-parsinglib in Haskell

I want to create a parser combinator, which will collect all lines below current place, which indentation levels will be greater or equal some i. I think the idea is simple:
Consume a line - if its indentation is:
ok -> do it for next lines
wrong -> fail
Lets consider following code:
import qualified Text.ParserCombinators.UU as UU
import Text.ParserCombinators.UU hiding(parse)
import Text.ParserCombinators.UU.BasicInstances hiding (Parser)
-- end of line
pEOL = pSym '\n'
pSpace = pSym ' '
pTab = pSym '\t'
indentOf s = case s of
' ' -> 1
'\t' -> 4
-- return the indentation level (number of spaces on the beginning of the line)
pIndent = (+) <$> (indentOf <$> (pSpace <|> pTab)) <*> pIndent `opt` 0
-- returns tuple of (indentation level, result of parsing the second argument)
pIndentLine p = (,) <$> pIndent <*> p <* pEOL
-- SHOULD collect all lines below witch indentations greater or equal i
myParse p i = do
(lind, expr) <- pIndentLine p
if lind < i
then pFail
else do
rest <- myParse p i `opt` []
return $ expr:rest
-- sample inputs
s1 = " a\
\\n a\
\\n"
s2 = " a\
\\na\
\\n"
-- execution
pProgram = myParse (pSym 'a') 1
parse p s = UU.parse ( (,) <$> p <*> pEnd) (createStr (LineColPos 0 0 0) s)
main :: IO ()
main = do
print $ parse pProgram s1
print $ parse pProgram s2
return ()
Which gives following output:
("aa",[])
Test.hs: no correcting alternative found
The result for s1 is correct. The result for s2 should consume first "a" and stop consuming. Where this error comes from?
The parsers which you are constructing will always try to proceed; if necessary input will be discarded or added. However pFail is a dead-end. It acts as a unit element for <|>.
In you parser there is however no other alternative present in case the input does not comply to the language recognised by the parser. In you specification you say you want the parser to fail on input s2. Now it fails with a message saying that is fails, and you are surprised.
Maybe you do not want it to fail, but you want to stop accepting further input? In that case
replace pFail by return [].
Note that the text:
do
rest <- myParse p i `opt` []
return $ expr:rest
can be replaced by (expr:) <$> (myParse p i `opt` [])
A natural way to solve your problem is probably something like
pIndented p = do i <- pGetIndent
(:) <$> p <* pEOL <*> pMany (pToken (take i (repeat ' ')) *> p <* pEOL)
pIndent = length <$> pMany (pSym ' ')

Resources