Haskell: How to integrate semantic whitespacing into a parser? - parsing

I'm currently writing a language in Haskell, https://github.com/EdONeill1/HENRYGCL, and I'm having trouble in figuring out how to allow for a program to be written on multiple lines. Take the following loop that adds 1 onto x until it reaches 10.
Henry > x := 1
Henry > if <x<10> [] x := <x+1> [] x := <x+10>
I would like the program to be somewhat in the form of:
Henry > x := 1
Henry > if <x<10>
[] x := <x+1>
[] x := <x+10>
I thought about using the function space or newline from the Text.ParserCombinators.Parsec.Char module. This would allow me to recognise (I think) the newline token \n. Upon using it the following parser function:
ifStmt :: Parser HenryVal
ifStmt =
do
reserved "if"
cond <- bExpression <|>
do
_ <- char '<'
x <- try parseBinary
_ <- char '>'
return x
some (space <|> newline)
reserved "[]"
stmt1 <- statement
some (space <|> newline)
reserved " []"
stmt2 <- statement
return $ If cond stmt1 stmt2
I receive the following error when I try following:
Henry > x:=1
1
Henry > if <x<10>
Parse error at "Henry" (line 1, column 10):
unexpected end of input
expecting space or lf new-line
Henry > if <x<10>
Parse error at "Henry" (line 1, column 12):
unexpected end of input
expecting space, lf new-line or "[]"
Henry >
The first error arises from pressing Enter when I finish typing > and the second error arises from pressing Space once then pressing Enter. In both instances a newline wasn't corrected. I'm not sure as well as to what lf in lf new-line actually means because to my understanding, shouldn't hitting Enter give you a newline?
In another section of my code I have the following, whiteSpace = Token.whiteSpace lexer. When I replace the some (space <|> newline) with this and press enter after if <x<10>, a newline is actually created. However despite being able to write a full if-statement, there's no termination and it just allows me to keep writing as much as I want indefinitely.
I'm quite confused in how to proceed from here. I think my logic using some (space <|> newline) is correct insofar as if the program encounters at least one space or newline, a space or newline is made but I know my implementation is incorrect. I thought that maybe whiteSpace would lead to somewhere but it seems as if that's another dead end.

The parser looks fine. The problem here is that main currently runs the parser on a single line at a time. You need to accumulate the whole input before running the parser.

Related

Why does try not trigger backtracking in this example

I am trying to wrap my head around writing parser using parsec in Haskell, in particular how backtracking works.
Take the following simple parser:
import Text.Parsec
type Parser = Parsec String () String
parseConst :: Parser
parseConst = do {
x <- many digit;
return $ read x
}
parseAdd :: Parser
parseAdd = do {
l <- parseExp;
char '+';
r <- parseExp;
return $ l <> "+" <> r
}
parseExp :: Parser
parseExp = try parseConst <|> parseAdd
pp :: Parser
pp = parseExp <* eof
test = parse pp "" "1+1"
test has value
Left (line 1, column 2):
unexpected '+'
expecting digit or end of input
In my mind this should succeed since I used the try combinator on parseConst in the definition of parseExp.
What am I missing? I am also interrested in pointers for how to debug this in my own, I tried using parserTraced which just allowed me to conclude that it indeed wasn't backtracking.
PS.
I know this is an awful way to write an expression parser, but I'd like to understand why it doesn't work.
There are a lot of problems here.
First, parseConst can never work right. The type says it must produce a String, so read :: String -> String. That particular Read instance requires the input be a quoted string, so being passed 0 or more digit characters to read is always going to result in a call to error if you try to evaluate the value it produces.
Second, parseConst can succeed on matching zero characters. I think you probably wanted some instead of many. That will make it actually fail if it encounters input that doesn't start with a digit.
Third, (<|>) doesn't do what you think. You might think that (a <* c) <|> (b <* c) is interchangeable with (a <|> b) <* c, but it isn't. There is no way to throw try in and make it the same, either. The problem is that (<|>) commits to whichever branch succeed, if one does. In (a <|> b) <* c, if a matches, there's no way later to backtrack and try b there. Doesn't matter how you lob try around, it can't undo the fact that (<|>) committed to a. In contrast, (a <* c) <|> (b <* c) doesn't commit until both a and c or b and c match the input.
This is the situation you're encountering. You have (try parseConst <|> parseAdd) <* eof, after a bit of inlining. Since parseConst will always succeed (see the second issue), parseAdd will never get tried, even if the eof fails. So after parseConst consumes zero or more leading digits, the parse will fail unless that's the end of the input. Working around this essentially requires carefully planning your grammar such that any use of (<|>) is safe to commit locally. That is, the contents of each branch must not overlap in a way that is disambiguated only by later portions of the grammar.
Note that this unpleasant behavior with (<|>) is how the parsec family of libraries work, but not how all parser libraries in Haskell work. Other libraries work without the left bias or commit behavior the parsec family have chosen.

parsec: stopping at empty line

I would like to solve the following task with parsec, although splitOn "\n\n" is probably the simpler answer.
I have an inputstring like
testInput = unlines ["ab", "cd", "", "e"] -- "ab\ncd\n\ne"
The parser shall stop when encountering an empty line.
I tried this
import Text.ParserCombinators.Parsec
inputFileP :: GenParser Char st String
inputFileP = many (lower <|> delimP)
delimP :: GenParser Char st Char
delimP = do
x <- char '\n'
notFollowedBy (char '\n')
return x
This fails with unexpected '\n'.
Why?
I was under the impression that many x parses x until it fails and then stops.
I was under the impression that many x parses x until it fails and then stops.
This is only the case if x fails without consuming any input. If x fails after consuming input, the whole parse will fail unless there's a try somewhere (this isn't just specific to many: x <|> y would also fail in that case even if y would succeed). In your case delimP fails on the notFollowedBy (char '\n') after already consuming the first \n, so the whole parse fails.
To change this behaviour, you need to explicitly enable backtracking using try like this:
delimP = try $ do
x <- char '\n'
notFollowedBy (char '\n')
return x
Alternatively, you can make it so that delimP fails without consuming any input (and thus no need for try) by making it look ahead by two characters before matching the \n:
delimP = do
notFollowedBy (string "\n\n")
char '\n'

Problem while writing a small parser in Haskell using Parsec

I am trying to write a parser for a small language with the following piece of code
import Text.ParserCombinators.Parsec
import Text.Parsec.Token
data Exp = Atom String | Op String Exp
instance Show Exp where
show (Atom x) = x
show (Op f x) = f ++ "(" ++ (show x) ++ ")"
parse_exp :: Parser Exp
parse_exp = (try parse_atom) <|> parse_op
parse_atom :: Parser Exp
parse_atom = do
x <- many1 letter
return (Atom x)
parse_op :: Parser Exp
parse_op = do
x <- many1 letter
char '('
y <- parse_exp
char ')'
return (Op x y)
But when I type in ghci
>>> parse (parse_exp <* eof) "<error>" "s(t)"
I get the output
Left "<error>" (line 1, column 2):
unexpected '('
expecting letter or end of input
If I redefine parse_exp as
parse_exp = (try parse_op) <|> parse_atom
then with I get correct result
>>> parse (parse_exp <* eof) "<error>" "s(t)"
Right s(t)
But I am confused why the first one does not work. Is there a general fix to these kinds of problems in parsing?
When a Parsec parser, like parse_atom, is run on a particular string, there are four possible results:
It succeeds, consuming some input.
It fails, consuming some input.
It succeeds, consuming no input.
It fails, consuming no input.
In the Parsec source code, these are referred to as "consumed ok", "consumed err", "empty ok" and "empty err" (sometimes abbreviated cok, cerr, eok, eerr).
When two Parsec parsers are used in an alternative, like p <|> q, here's how it's parsed. First, Parsec tries to parse with p. Then:
If this results in "consumed ok" or "empty ok", the parse succeeds and this becomes the result of the entire parser p <|> q.
If this results in "empty err", Parsec tries the alternative q, and this becomes the result of the entire p <|> q parser.
If this results in "consumed err", the entire parser p <|> q fails with "consumed err" (cerr).
Note the critical difference between p returning cerr (which causes the whole parser to fail) versus returning eerr (which causes the alternative parser q to be tried).
The try function changes the behavior of a parser by converting a "cerr" result to an "eerr" result.
This means that if you are trying to parse the text "s(t)" with different parsers:
with the parser parse_atom <|> parse_op, the parser parse_atom returns "cok" consuming "s" and leaving unparseable text "(t)" which causes an error
with the parser try parse_atom <|> parse_op, the parser parse_atom still returns "cok" consuming "s", so the try (which only changes cerr to eerr) has no effect, and the unparseable text "(t)" causes the same error
with the parser parse_op <|> parse_atom, the parser parse_op successfully parses the string (actually, it doesn't because the recursive call to parse_exp can't parse "t", but let's ignore that); however, if the same parser was used on the text "s", then parse_op would consume the "s" before failing (i.e., cerr), causing the entire parse to fail instead of trying the alternative parse_atom
with the parser try parse_op <|> parse_atom, this would parse "s(t)", exactly as the previous example, and the try would have no effect; however, it would also work on the text "s", because parse_op would consume the "s" before failing with cerr, then try would "rescue" the parse by turning the cerr into an eerr, and the alternative parse_atom would be checked, successfully parsing (cok) the atom "s".
That's why the "correct" parser for your problem is try parse_op <|> parse_atom.
Be warned that this behavior isn't a fundamental aspect of monadic parsers. It's a design choice made by Parsec (and compatible parsers like Megaparsec). Other monadic parsers can have different rules for how alternatives with <|> work.
The "general fix" for these kind of Parsec parsing problems is to be aware of the facts that in the expression p <|> q:
p is tried first, and if it succeeds, q will be ignored, even if q would provide a "longer" or "better" or "more sensible" parse or avoid additional parsing errors further down the road. In parse_atom <|> parse_op, because parse_atom can succeed on strings meant for parse_op, this order won't work correctly.
q is only tried if p fails without consuming input. You must arrange for p to not consume anything on failure, possibly by using try, if you expect the alternative q to be checked. So, parse_op <|> parse_atom isn't going to work if parse_op starts to consume something (like an identifier) before realizing it can't continue and returning cerr.
As an alternative to using try, you can also think more carefully about the structure of your parser. An alternative way of writing parse_exp, for example, would be:
parse_exp :: Parser Exp
parse_exp = do
-- there's always an identifier
x <- many1 letter
-- there *might* be an expression in parentheses
y <- optionMaybe (parens parse_exp)
case y of
Nothing -> return (Atom x)
Just y' -> return (Op x y')
where parens = between (char '(') (char ')')
This can be written a little more concisely, but even then it's not as "elegant" as something like try parse_op <|> parse_atom. (It performs better, though, so that might be a consideration in some applications.)
The problem is that the string "s" counts as an atom according to your definitions. Try this:
parse parse_atom "" "s(t)"
> Atom "s"
So your parser parse_exp actually succeeds, returning Atom "s", but then you also expect an EOF right after it, and that's where it fails, encountering an open paren instead of an EOF (just like the error message says!)
When you swap the alternative around, it would first attempt parse_op, which would succeed, returning Op "s" "t", and then encounter EOF, just as expected.

Why Parsec's sepBy stops and does not parse all elements?

I am trying to parse some comma separated string which may or may not contain a string with image dimensions. For example "hello world, 300x300, good bye world".
I've written the following little program:
import Text.Parsec
import qualified Text.Parsec.Text as PS
parseTestString :: Text -> [Maybe (Int, Int)]
parseTestString s = case parse dimensStringParser "" s of
Left _ -> [Nothing]
Right dimens -> dimens
dimensStringParser :: PS.Parser [Maybe (Int, Int)]
dimensStringParser = (optionMaybe dimensParser) `sepBy` (char ',')
dimensParser :: PS.Parser (Int, Int)
dimensParser = do
w <- many1 digit
char 'x'
h <- many1 digit
return (read w, read h)
main :: IO ()
main = do
print $ parseTestString "300x300,40x40,5x5"
print $ parseTestString "300x300,hello,5x5,6x6"
According to optionMaybe documentation, it returns Nothing if it can't parse, so I would expect to get this output:
[Just (300,300),Just (40,40),Just (5,5)]
[Just (300,300),Nothing, Just (5,5), Just (6,6)]
but instead I get:
[Just (300,300),Just (40,40),Just (5,5)]
[Just (300,300),Nothing]
I.e. parsing stops after first failure. So I have two questions:
Why does it behave this way?
How do I write a correct parser for this case?
In order to answer this question, it's handy to take a piece of paper, write down the input, and act as a dumb parser.
We start with "300x300,hello,5x5,6x6", our current parser is optionMaybe .... Does our dimensParser correctly parse the dimension? Let's check:
w <- many1 digit -- yes, "300"
char 'x' -- yes, "x"
h <- many1 digit -- yes, "300"
return (read w, read h) -- never fails
We've successfully parsed the first dimension. The next token is ,, so sepBy successfully parses that as well. Next, we try to parse "hello" and fail:
w <- many1 digit -- no. 'h' is not a digit. Stop
Next, sepBy tries to parse ,, but that's not possible, since the next token is a 'h', not a ,. Therefore, sepBy stops.
We haven't parsed all the input, but that's not actually necessary. You would get a proper error message if you've used
parse (dimensStringParser <* eof)
Either way, if you want to discard anything in the list that's not a dimension, you can use
dimensStringParser1 :: Parser (Maybe (Int, Int))
dimensStringParser1 = (Just <$> dimensParser) <|> (skipMany (noneOf ",") >> Nothing)
dimensStringParser = dimensStringParser1 `sepBy` char ','
I'd guess that optionMaybe dimensParser, when fed with input "hello,...", tries dimensParser. That fails, so optionMaybe returns success with Nothing, and consumes no portion of the input.
The last part is the crucial one: after Nothing is returned, the input string to be parsed is still "hello,...".
At that point sepBy tries to parse char ',', which fails. So, it deduces that the list is over, and terminates the output list, without consuming any more input.
If you want to skip other entities, you need a "consuming" parser that returns Nothing instead of optionMaybe. That parser, however, need to know how much to consume: in your case, until the comma.
Perhaps you need some like (untested)
( try (Just <$> dimensParser)
<|> (noneOf "," >> return Nothing))
`sepBy` char ','

How to parse many characters except few in the parentheses?

The best way to parse any character except few, is to use noneOf combinator,
Unfortunately it doesn't work if I combine it in the following way:
Combine.parse (Combine.parens <| Combine.many <| Combine.Char.noneOf ['"', '\\']) "()"
Err ((),{ data = "()", input = "", position = 2 },["expected \")\""])
: Result.Result
(Combine.ParseErr ()) (Combine.ParseOk () (List Char))
Your use of noneOf results in that parser consuming all characters including the closing parenthesis. Since the inner portion consumes the closing paren, the Combine.parens parser will not see the closing paren. You need to cause the many <| noneOf ... parser to halt on a closing parenthesis.
Consider adding the closing parenthesis to the list of characters in noneOf:
Combine.parse (Combine.parens <| Combine.many <| Combine.Char.noneOf ['"', '\\', ')']) "()"

Resources