Haskell -- parser combinators keywords - parsing

I am building a parser in Haskell using parser combinators, and I have an issue with parsing keywords such as "while", "true", "if", etc.
The issue is that a keyword must be followed by a separator or whitespace. For example, in the statement
if cond then stat1 else stat2 fi;x = 1
every keyword is followed by either a space or a semicolon. However, different separators are allowed in different situations.
Currently I have implemented it as follows:
keyword :: String -> Parser String
keyword k = do
  kword <- leadingWS (string k)
  check (== ';') <|> check isSpace <|> check (== ',') <|> check (== ']')
  junk
  return kword
However, the problem with this keyword parser is that it accepts programs containing statements like if; cond then stat1 else stat2 fi
We tried passing a (Char -> Bool) predicate to keyword, which would then be passed on to check. But this doesn't work, because at the point where we parse the keyword we don't know which kinds of separator are allowed.
Could I have some help with this issue?

Don't try to handle the separators in keyword. All that keyword needs to ensure is that the keyword "if" is not confused with an identifier such as "iffy" (see the comment by sepp2k).
keyword :: String -> Parser String
keyword k = leadingWS $ try (string k <* notFollowedBy alphanum)
Handling separators for statements would go like this:
statements = statement `sepBy` semi
statement = ifStatement <|> assignmentStatement <|> ...
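Here is a minimal, self-contained sketch of the same idea written against Parsec. The question does not say which library is in use, so several things here are assumptions: leadingWS is a guess at what the asker's helper does, alphaNum is Parsec's spelling of alphanum, and identifier and run are hypothetical helpers added only for experimenting. The try is placed around the whole parser so that a failed keyword gives back any whitespace it skipped.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- Assumed behaviour of the asker's leadingWS: skip whitespace, then run p.
leadingWS :: Parser a -> Parser a
leadingWS p = spaces *> p

-- A keyword is the literal text k NOT followed by another identifier character.
-- try makes the check atomic, so a failed keyword consumes no input.
keyword :: String -> Parser String
keyword k = try (leadingWS (string k <* notFollowedBy alphaNum))

-- Hypothetical identifier parser, for contrast: a letter, then letters or digits.
identifier :: Parser String
identifier = try (leadingWS ((:) <$> letter <*> many alphaNum))

-- Hypothetical helper for GHCi: Nothing on a parse error, Just the result otherwise.
run :: Parser a -> String -> Maybe a
run p = either (const Nothing) Just . parse p ""
```

Loaded in GHCi, run (keyword "if") "if;x" accepts the keyword and leaves the semicolon for whichever statement-level parser comes next, while run (keyword "if") "iffy" is rejected, so keyword "if" <|> identifier can still read "iffy" as an identifier.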

Related

Problem while writing a small parser in Haskell using Parsec

I am trying to write a parser for a small language with the following piece of code
import Text.ParserCombinators.Parsec
import Text.Parsec.Token
data Exp = Atom String | Op String Exp
instance Show Exp where
  show (Atom x) = x
  show (Op f x) = f ++ "(" ++ (show x) ++ ")"
parse_exp :: Parser Exp
parse_exp = (try parse_atom) <|> parse_op
parse_atom :: Parser Exp
parse_atom = do
  x <- many1 letter
  return (Atom x)
parse_op :: Parser Exp
parse_op = do
  x <- many1 letter
  char '('
  y <- parse_exp
  char ')'
  return (Op x y)
But when I type in ghci
>>> parse (parse_exp <* eof) "<error>" "s(t)"
I get the output
Left "<error>" (line 1, column 2):
unexpected '('
expecting letter or end of input
If I redefine parse_exp as
parse_exp = (try parse_op) <|> parse_atom
then I get the correct result
>>> parse (parse_exp <* eof) "<error>" "s(t)"
Right s(t)
But I am confused why the first one does not work. Is there a general fix to these kinds of problems in parsing?
When a Parsec parser, like parse_atom, is run on a particular string, there are four possible results:
It succeeds, consuming some input.
It fails, consuming some input.
It succeeds, consuming no input.
It fails, consuming no input.
In the Parsec source code, these are referred to as "consumed ok", "consumed err", "empty ok" and "empty err" (sometimes abbreviated cok, cerr, eok, eerr).
When two Parsec parsers are used in an alternative, like p <|> q, here's how it's parsed. First, Parsec tries to parse with p. Then:
If this results in "consumed ok" or "empty ok", the parse succeeds and this becomes the result of the entire parser p <|> q.
If this results in "empty err", Parsec tries the alternative q, and this becomes the result of the entire p <|> q parser.
If this results in "consumed err", the entire parser p <|> q fails with "consumed err" (cerr).
Note the critical difference between p returning cerr (which causes the whole parser to fail) versus returning eerr (which causes the alternative parser q to be tried).
The try function changes the behavior of a parser by converting a "cerr" result to an "eerr" result.
This means that if you are trying to parse the text "s(t)" with different parsers:
with the parser parse_atom <|> parse_op, the parser parse_atom returns "cok" consuming "s" and leaving unparseable text "(t)" which causes an error
with the parser try parse_atom <|> parse_op, the parser parse_atom still returns "cok" consuming "s", so the try (which only changes cerr to eerr) has no effect, and the unparseable text "(t)" causes the same error
with the parser parse_op <|> parse_atom, the parser parse_op successfully parses the string (actually, it doesn't because the recursive call to parse_exp can't parse "t", but let's ignore that); however, if the same parser was used on the text "s", then parse_op would consume the "s" before failing (i.e., cerr), causing the entire parse to fail instead of trying the alternative parse_atom
with the parser try parse_op <|> parse_atom, this would parse "s(t)", exactly as the previous example, and the try would have no effect; however, it would also work on the text "s", because parse_op would consume the "s" before failing with cerr, then try would "rescue" the parse by turning the cerr into an eerr, and the alternative parse_atom would be checked, successfully parsing (cok) the atom "s".
That's why the "correct" parser for your problem is try parse_op <|> parse_atom.
Be warned that this behavior isn't a fundamental aspect of monadic parsers. It's a design choice made by Parsec (and compatible parsers like Megaparsec). Other monadic parsers can have different rules for how alternatives with <|> work.
The "general fix" for these kinds of Parsec parsing problems is to be aware that, in the expression p <|> q:
p is tried first, and if it succeeds, q will be ignored, even if q would provide a "longer" or "better" or "more sensible" parse or avoid additional parsing errors further down the road. In parse_atom <|> parse_op, because parse_atom can succeed on strings meant for parse_op, this order won't work correctly.
q is only tried if p fails without consuming input. You must arrange for p to not consume anything on failure, possibly by using try, if you expect the alternative q to be checked. So, parse_op <|> parse_atom isn't going to work if parse_op starts to consume something (like an identifier) before realizing it can't continue and returning cerr.
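The three orderings discussed above can be compared side by side with a small sketch. It reuses the question's grammar, but the names parseAtom, parseOp, atomFirst, opFirst, tryOpFirst and run are my own, and parseOp takes the expression parser as an argument (an assumption made so that each variant recurses into itself).

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

data Exp = Atom String | Op String Exp
  deriving (Eq, Show)

parseAtom :: Parser Exp
parseAtom = Atom <$> many1 letter

-- Parameterised over the parser used inside the parentheses.
parseOp :: Parser Exp -> Parser Exp
parseOp expr = Op <$> many1 letter <*> (char '(' *> expr <* char ')')

atomFirst, opFirst, tryOpFirst :: Parser Exp
atomFirst  = try parseAtom <|> parseOp atomFirst     -- the question's first attempt
opFirst    = parseOp opFirst <|> parseAtom           -- swapped, but without try
tryOpFirst = try (parseOp tryOpFirst) <|> parseAtom  -- swapped, with try

-- Run against the whole input; Nothing stands for a parse error.
run :: Parser Exp -> String -> Maybe Exp
run p = either (const Nothing) Just . parse (p <* eof) ""
```

In GHCi, run atomFirst "s(t)" fails because the atom succeeds early and eof then sees the parenthesis, run opFirst "s" fails because parseOp consumed the "s" before failing, and only tryOpFirst accepts both "s" and "s(t)".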
As an alternative to using try, you can also think more carefully about the structure of your parser. An alternative way of writing parse_exp, for example, would be:
parse_exp :: Parser Exp
parse_exp = do
  -- there's always an identifier
  x <- many1 letter
  -- there *might* be an expression in parentheses
  y <- optionMaybe (parens parse_exp)
  case y of
    Nothing -> return (Atom x)
    Just y' -> return (Op x y')
  where parens = between (char '(') (char ')')
This can be written a little more concisely, but even then it's not as "elegant" as something like try parse_op <|> parse_atom. (It performs better, though, so that might be a consideration in some applications.)
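For what it's worth, one possible concise form is sketched below. It is only one way of shortening the parser, and the names parseExp and run, as well as the derived Eq and Show instances, are mine rather than the question's.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

data Exp = Atom String | Op String Exp
  deriving (Eq, Show)

-- Read the identifier once, then let the optional parenthesised argument
-- decide between Atom and Op.
parseExp :: Parser Exp
parseExp = do
  x <- many1 letter
  maybe (Atom x) (Op x) <$> optionMaybe (parens parseExp)
  where
    parens = between (char '(') (char ')')

-- Hypothetical helper: Nothing on a parse error, Just the result otherwise.
run :: String -> Maybe Exp
run = either (const Nothing) Just . parse (parseExp <* eof) ""
```

No try is needed here because the two cases share their common prefix, the identifier, instead of parsing it twice.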
The problem is that the string "s" counts as an atom according to your definitions. Try this:
parse parse_atom "" "s(t)"
> Atom "s"
So your parser parse_exp actually succeeds, returning Atom "s", but then you also expect an EOF right after it, and that's where it fails, encountering an open paren instead of an EOF (just like the error message says!)
When you swap the alternatives around (keeping the try), it first attempts parse_op, which succeeds, returning Op "s" (Atom "t"), and then encounters EOF, just as expected.

Parsing multiple lines into a list of lists in Haskell

I am trying to parse a file that looks like:
a b c
f e d
I want to match each of the symbols in the line and parse everything into a list of lists such as:
[[A, B, C], [D, E, F]]
In order to do that I tried the following:
import Control.Monad
import Text.ParserCombinators.Parsec
import Text.ParserCombinators.Parsec.Language
import qualified Text.ParserCombinators.Parsec.Token as P
parserP :: Parser [[MyType]]
parserP = do
  x <- rowP
  xs <- many (newline >> rowP)
  return (x : xs)
rowP :: Parser [MyType]
rowP = manyTill cellP $ void newline <|> eof
cellP :: Parser MyType
cellP = aP <|> bP <|> ... -- rest of the parsers, they all look very similar
aP :: Parser MyType
aP = symbol "a" >> return A
bP :: Parser MyType
bP = symbol "b" >> return B
lexer = P.makeTokenParser emptyDef
symbol = P.symbol lexer
But it fails to return multiple inner lists. Instead what I get is:
[[A, B, C, D, E, F]]
What am I doing wrong? I was expecting manyTill to parse cellP until the newline character, but that's not the case.
Parser combinators are overkill for something this simple. I'd use lines :: String -> [String] and words :: String -> [String] to break up the input and then map the individual tokens into MyTypes.
toMyType :: String -> Maybe MyType
toMyType "a" = Just A
toMyType "b" = Just B
toMyType "c" = Just C
toMyType _ = Nothing
parseMyType :: String -> Maybe [[MyType]]
parseMyType = traverse (traverse toMyType) . fmap words . lines
You're right that manyTill keeps parsing until a newline. But manyTill never gets to see the newline because cellP is too eager. cellP ends up calling P.symbol, whose documentation states
symbol :: String -> ParsecT s u m String
Lexeme parser symbol s parses string s and skips trailing white space.
The keyword there is 'white space'. It turns out, Parsec defines whitespace as being any character which satisfies isSpace, which includes newlines. So P.symbol is happily consuming the c, followed by the space and the newline, and then manyTill looks and doesn't see a newline because it's already been consumed.
If you want to drop Parsec altogether, go with Benjamin's solution. But if you're determined to stick with Parsec, the basic idea is to change the lexer's whiteSpace field so that newlines no longer count as whitespace. Something like
lexer = let lexer0 = P.makeTokenParser emptyDef
        in lexer0 { P.whiteSpace = void $ many (oneOf " \t") }
That's pseudocode and probably won't work for your specific case, but the idea is there. In particular, makeTokenParser builds symbol and the other lexeme parsers around its own internal whiteSpace, so updating the record field afterwards does not change what they skip; you would need to define your own symbol on top of the new whiteSpace. The point is that whitespace should mean whatever you decide it means, not what the library defines by default. Note that changing this will also break your comment syntax, if you have one defined, since the original whiteSpace also handled comments.
In short, Benjamin's answer is probably the best way to go. There's no real reason to use Parsec here. But it's also helpful to know why this particular solution didn't work: Parsec's default definition of a language wasn't designed to treat newlines with significance.
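If you do want to stay with Parsec, one way to make newlines significant is to skip the token module and write a small symbol by hand. The sketch below assumes MyType has the six constructors A to F and that cells are separated by spaces or tabs; hspace, symbol and run are names I made up for the example.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- Assumed shape of MyType; the question does not show its definition.
data MyType = A | B | C | D | E | F
  deriving (Eq, Show)

-- Only spaces and tabs are skippable; newlines are left in the input.
hspace :: Parser ()
hspace = skipMany (oneOf " \t")

-- Hand-written replacement for P.symbol: match s, then skip horizontal whitespace only.
symbol :: String -> Parser String
symbol s = try (string s) <* hspace

cellP :: Parser MyType
cellP = choice
  [ A <$ symbol "a", B <$ symbol "b", C <$ symbol "c"
  , D <$ symbol "d", E <$ symbol "e", F <$ symbol "f" ]

-- A row is one or more cells on a single line.
rowP :: Parser [MyType]
rowP = hspace *> many1 cellP

-- Rows are separated, and optionally ended, by newlines.
parserP :: Parser [[MyType]]
parserP = rowP `sepEndBy1` newline <* eof

-- Hypothetical helper: Nothing on a parse error, Just the rows otherwise.
run :: String -> Maybe [[MyType]]
run = either (const Nothing) Just . parse parserP ""
```

Because symbol never eats the newline, sepEndBy1 gets to see it and starts a new inner list for each line.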

Parse EBNF with Megaparsec nested sepBy

As an exercise I am trying to parse an EBNF/ABNF grammar with Megaparsec. I got trivial things like terminals and optionals working, but I'm struggling with alternatives. With this grammar:
S ::= 'hello' ['world'] IDENTIFIER LITERAL | 'test';
And this code:
production :: Parser Production
production = sepBy1 alternativeTerm (char '|') >>= return . Production
alternativeTerm :: Parser AlternativeTerm
alternativeTerm = sepBy1 term space >>= return . AlternativeTerm
term :: Parser Term
term = terminal
  <|> optional
  <|> identifier
  <|> literal
I get this error:
unexpected '|'
expecting "IDENTIFIER", "LITERAL", ''', '[', or white space
I guess the alternativeTerm parser does not return to the production parser when it encounters a sequence it cannot parse, and throws an error instead.
What can I do about this? Should I change my ADT for EBNF, or somehow flatten the parsing? And if so, how?
It's probably best to expand my previous comment into a full answer.
Your grammar is basically a list of lists of terms separated (and ended) by whitespace, with the inner lists in turn separated by |. Your solution with sepBy1 does not work because there is trailing whitespace after LITERAL: sepBy1 assumes that another term follows that whitespace and tries to apply term to the |, which fails.
If your alternativeTerm is guaranteed to end with one or more whitespace characters, rewrite it as follows:
alternativeTerm = (term `sepEndBy1` space) >>= return . AlternativeTerm
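The difference between the two combinators can be seen in a small sketch. Note that it is written against Parsec rather than Megaparsec (both provide sepBy1 and sepEndBy1), and that term is reduced to a run of letters and the separator sp to one or more blanks, so this only illustrates the behaviour and is not a drop-in replacement for the code in the question.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- Stand-in for terminal/optional/identifier/literal: just a run of letters.
term :: Parser String
term = many1 letter

-- Assumed separator: one or more blanks.
sp :: Parser ()
sp = skipMany1 (char ' ')

-- sepBy1 commits to another term once it has consumed a separator.
altSepBy :: Parser [String]
altSepBy = term `sepBy1` sp

-- sepEndBy1 tolerates a trailing separator with no term after it.
altSepEndBy :: Parser [String]
altSepEndBy = term `sepEndBy1` sp

-- Alternatives separated by '|', parameterised over the inner parser.
production :: Parser [String] -> Parser [[String]]
production alt = (alt `sepBy1` (char '|' <* spaces)) <* eof

-- Hypothetical helper: Nothing on a parse error, Just the result otherwise.
run :: Parser [String] -> String -> Maybe [[String]]
run alt = either (const Nothing) Just . parse (production alt) ""
```

With the blank before the bar, run altSepBy "hello world | test" fails in the same way as the question's parser, while run altSepEndBy accepts the same input.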

Haskell Parsec Parser for Encountering [...]

I'm attempting to write a parser in Haskell using Parsec. Currently I have a program that can parse
test x [1,2,3] end
The code that does this is given as follows
testParser = do {
    reserved "test";
    v <- identifier;
    symbol "[";
    l <- sepBy natural commaSep;
    symbol "]";
    p <- pParser;
    return $ Test v (List l) p
  } <?> "end"
where commaSep is defined as
commaSep = skipMany1 (space <|> char ',')
Now is there some way for me to parse a similar statement, specifically:
test x [1...3] end
Being new to Haskell, and Parsec for that matter, I'm sure there's some nice concise way of doing this that I'm just not aware of. Any help would be appreciated.
Thanks again.
I'll be using some functions from Control.Applicative like (*>). These functions are useful if you want to avoid the monadic interface of Parsec and prefer the applicative interface, because the parsers become easier to read that way in my opinion.
If you aren't familiar with the basic applicative functions, leave a comment and I'll explain them. You can look them up on Hoogle if you are unsure.
As I've understood your problem, you want a parser for some data structure like this:
data Test = Test String Numbers
data Numbers = List [Int] | Range Int Int
A parser that can parse such a data structure would look like this (I've not compiled the code, but it should work):
-- parses "test <identifier> [<numbers>] end"
testParser :: Parser Test
testParser =
  Test <$> (reserved "test" *> identifier)
       <*> (symbol "[" *> numbersParser <* symbol "]")
       <*  reserved "end"
       <?> "test"
-- the range must be tried first: a list parser built on sepBy would accept
-- the leading "1" of "1...3" as a one-element list and then fail at "..."
numbersParser :: Parser Numbers
numbersParser = try rangeParser <|> listParser
-- parses "<natural>, <natural>, <natural>" etc
listParser :: Parser Numbers
listParser =
  List <$> sepBy natural (symbol ",")
       <?> "list"
-- parses "<natural> ... <natural>"
rangeParser :: Parser Numbers
rangeParser =
  Range <$> natural <* symbol "..."
        <*> natural
        <?> "range"
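One thing worth checking in GHCi is the order of the two alternatives. The sketch below is a cut-down version that only parses the bracketed part; it assumes a token parser built from emptyDef and uses Integer because that is what Parsec's natural returns. The names rangeP, listP, numbersP, wrongOrderP and run are mine.

```haskell
import Text.Parsec
import Text.Parsec.Language (emptyDef)
import Text.Parsec.String (Parser)
import qualified Text.Parsec.Token as P

data Numbers = List [Integer] | Range Integer Integer
  deriving (Eq, Show)

-- Assumed lexer: Parsec's token parser with the default empty language.
lexer :: P.TokenParser ()
lexer = P.makeTokenParser emptyDef

natural :: Parser Integer
natural = P.natural lexer

symbol :: String -> Parser String
symbol = P.symbol lexer

rangeP :: Parser Numbers
rangeP = Range <$> natural <* symbol "..." <*> natural

listP :: Parser Numbers
listP = List <$> (natural `sepBy` symbol ",")

-- Range first: try rewinds if "..." does not follow the first number.
numbersP :: Parser Numbers
numbersP = P.brackets lexer (try rangeP <|> listP)

-- List first: listP happily succeeds on the leading "1" of "1...3".
wrongOrderP :: Parser Numbers
wrongOrderP = P.brackets lexer (try listP <|> rangeP)

-- Hypothetical helper: Nothing on a parse error, Just the result otherwise.
run :: Parser Numbers -> String -> Maybe Numbers
run p = either (const Nothing) Just . parse (p <* eof) ""
```

Here run numbersP handles both "[1,2,3]" and "[1...3]", whereas run wrongOrderP "[1...3]" fails, because the list alternative succeeds with a single element and the closing bracket is then expected where the dots are.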

FParsec identifiers vs keywords

For languages with keywords, some special trickery needs to happen to prevent for example "if" from being interpreted as an identifier and "ifSomeVariableName" from becoming keyword "if" followed by identifier "SomeVariableName" in the token stream.
For recursive descent and Lex/Yacc, I've simply taken the approach (as per helpful instruction) of transforming the token stream between the lexer and the parser.
However, FParsec doesn't really seem do a separate lexer step, so I'm wondering what the best way to deal with this is. Speaking of, it seems like Haskell's Parsec supports a lexer layer, but FParsec does not?
I think this problem is quite simple. You have to:
Parse out an entire word ([a-z]+), lower case only;
Check if it belongs to a dictionary; if so, return a keyword; otherwise, the parser will fall back;
Parse identifiers separately.
For example (hypothetical code, not tested):
let keyWordSet =
    System.Collections.Generic.HashSet<_>(
        [|"while"; "begin"; "end"; "do"; "if"; "then"; "else"; "print"|]
    )
let pKeyword =
    (many1Satisfy isLower .>> nonAlphaNumeric) // [a-z]+
    >>= (fun s -> if keyWordSet.Contains(s) then (preturn s) else fail "not a keyword")
let pContent =
    pLineComment <|> pOperator <|> pNumeral <|> pKeyword <|> pIdentifier
The code above will parse a keyword or an identifier twice. To fix that, you may instead:
Parse out an entire word ([a-zA-Z][a-zA-Z0-9]*), i.e. everything alphanumeric;
Check whether it's a keyword (lower case and belonging to the dictionary) or an identifier, and either
Return a keyword
Return an identifier
P.S. Don't forget to order "cheaper" parsers first, if it does not ruin the logic.
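The second approach is not specific to FParsec. Since the question also mentions Haskell's Parsec, here is a rough sketch of the same idea there: read the whole word once, then classify it. The Tok type, the keyword list (copied from the F# snippet above) and the helper run are assumptions made for the example.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- Hypothetical token type for the example.
data Tok = Keyword String | Identifier String
  deriving (Eq, Show)

keywords :: [String]
keywords = ["while", "begin", "end", "do", "if", "then", "else", "print"]

-- Read the whole alphanumeric word once, then decide what it is.
word :: Parser Tok
word = classify <$> ((:) <$> letter <*> many alphaNum)
  where
    classify w
      | w `elem` keywords = Keyword w
      | otherwise         = Identifier w

-- Hypothetical helper: Nothing on a parse error, Just the token otherwise.
run :: String -> Maybe Tok
run = either (const Nothing) Just . parse (word <* eof) ""
```

Because the word is consumed in full before it is classified, "ifSomeVariableName" can never be split into the keyword "if" plus an identifier.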
You can define a parser for whitespace and check that each keyword or identifier is followed by it.
For example, a generic whitespace parser will look like
let pWhiteSpace = pLineComment <|> pMultilineComment <|> pSpaces
The following requires at least one whitespace element:
let ws1 = skipMany1 pWhiteSpace
Then the parser for the if keyword will look like
let pIf = pstring "if" .>> ws1

Resources