FParsec backtracking in nested parser - f#

I want to parse expressions that are constructed like: a is x or y or z or b is z or w, so basically I have the same separator for different rules in my grammar.
I already succeed parsing such expressions with Antlr since it can backtrack quite nicely. But now I want to parse it with FParsec and I don't get the inner parser to not be greedy. My current parsers look like this:
let variable = // matches a,b,c,...
// variables ::= variable { "or" variable }+ ;
let variables =
variable .>>? keyword "or" .>>.? (sepBy1 variable (keyword "or"))
let operation =
variable .>>? keyword "is" .>>.? variables
// expression ::= operation { "or" operation }+ ;
let expression =
operation .>>? keyword "or" .>>.? (sepBy1 variable (keyword "or"))
In my example the variables parser consumes x or y or z or b and the whole thing fails at is. This means I need to make that variables parser less greedy or make it backtrack correctly.
I found a similar question where they make a backtracking version of sepBy1, but using that still does not solve my problem. I guess that is because I want to backtrack into a nested parser.
So what is the correct way to make FParsec accept my input?

Besides switching one of the meanings of or to | as I mentioned in a comment, you could also use notFollowedBy (keyword "is") as follows:
let variables =
variable .>>? keyword "or" .>>.? (sepBy1 (variable .>> (notFollowedBy (keyword "is"))) (keyword "or"))
I'm not really enthusiastic about this solution because it doesn't generalize easily. Are there other keywords besides is that could appear after a variable? E.g., do you have a syntax like b matches x or y, or anything similar? If so, then you'd need to write something like (notFollowedBy ((keyword "is") <|> (keyword "matches"))) and it can quickly get complicated. But as long as the keyword is is the only one that is throwing off your parser, using (notFollowedBy (keyword "is")) is probably your best bet.

Related

Using midaction rules in Lemon to interpret "let" expression

I'm trying to write a "toy" interpreter using Flex + Lemon that supports a very basic "let" syntax where a variable X is temporarily bound to an expression. For example, "letx 3 + 4 in x + 8" should evaluate to 15.
In essence, what I'd "like" the rule to say is:
expr(E) ::= LETX expr(N) IN expr(O). {
environment->X = N;
E = O;
}
But that won't work since O is evaluated before the X = N assignment is made.
I understand that the usual solution for this would be a mid-rule action. Lemon doesn't explicitly support this, but I've read elsewhere that would just be syntactic sugar in any event.
So I've tried to put together a mid-rule action that would do my assignment of X = N before interpreting O:
midruleaction ::= /* mid rule */. { environment->X = N; }
expr(E) ::= LETX expr(N) IN midruleaction expr(O). { E = O; }
But that won't work because there's no way for the midruleaction rule to access N, or at least none I can see in the lemon docs/examples.
I think I'm missing something here. I know I could build up a tree and then walk it in a second pass. And I might end up doing that, but I'd like to understand how to solve this more directly first.
Any suggestions?
It's really not a very scalable solution to evaluate immediately in a parser. See below.
It is true that mid-rule actions are (mostly) syntactic sugar. However, in most cases they are not syntactic sugar for "markers" (non-terminals with empty right-hand sides) but rather for non-terminals representing production prefixes. For example, you could write your letx rule like this:
expr(E) ::= letx_prefix IN expr(O). { E = O; }
letx_prefix ::= LETX expr(N). { environment->X = N; }
Or you could do this:
expr(E) ::= LETX assigned_expr IN expr(O). { E = O; }
assigned_expr ::= expr(N). { environment->X = N; }
The first one is the prefix desugaring; the second one is the one I'd use because I feel that it separates concerns better. The important point is that the environment->X = N; action requires access to the semantic values of the prefix of the RHS, so it must be part of a prefix rule (which includes at least the symbols whose semantic values are requires), rather than a marker, which has access to no semantic values at all.
Having said all that, immediate evaluation during parsing is a very limited strategy. It cannot cope with a large range of constructs which require deferred evaluation, such as loops and function definitions. It cannot cleanly cope with constructs which may suppress evaluation, such as conditionals and short-circuit operators. (These can be handled using MRAs and a stateful environment which contains an evaluation-suppressed flag, but that's very ugly.)
Another problem is that syntactically incorrect expressions may be partially evaluated before the syntax error is discovered, and it may not be immediately obvious to the user which parts of the expression have and have not been evaluated.
On the whole, you're better off building an easily-evaluated AST during the parse, and evaluating the AST when the parse successfully completes.

try function in parsing lambda expressions

I'm totally new to Haskell and trying to implement a "Lambda calculus" parser, that will be used to read the input to a lambda reducer .. It's required to parse bindings first "identifier = expression;" from a text file, then at the end there's an expression alone ..
till now it can parse bindings only, and displays errors when encountering an expression alone .. when I try to use the try or option functions, it gives a type mismatch error:
Couldn't match type `[Expr]'
with `Text.Parsec.Prim.ParsecT s0 u0 m0 [[Expr]]'
Expected type: Text.Parsec.Prim.ParsecT
s0 u0 m0 (Text.Parsec.Prim.ParsecT s0 u0 m0 [[Expr]])
Actual type: Text.Parsec.Prim.ParsecT s0 u0 m0 [Expr]
In the second argument of `option', namely `bindings'
bindings weren't supposed to return anything, but I tried to add a return statement and it also returned a type mismatch error:
Couldn't match type `[Expr]' with `Expr'
Expected type: Text.Parsec.Prim.ParsecT
[Char] u0 Data.Functor.Identity.Identity [Expr]
Actual type: Text.Parsec.Prim.ParsecT
[Char] u0 Data.Functor.Identity.Identity [[Expr]]
In the second argument of `(<|>)', namely `expressions'
Don't use <|> if you want to allow both
Your program parser does its main work with
program = do
spaces
try bindings <|> expressions
spaces >> eof
This <|> is choice - it does bindings if it can, and if that fails, expressions, which isn't what you want. You want zero or more bindings, followed by expressions, so let's make it do that.
Sadly, even when this works, the last line of your parser is eof and
First, let's allow zero bindings, since they're optional, then let's get both the bindings and the expressions:
bindings = many binding
program = do
spaces
bs <- bindings
es <- expressions
spaces >> eof
return (bs,es)
This error would be easier to find with plenty more <?> "binding" type hints so you can see more clearly what was expected.
endBy doesn't need many
The error message you have stems from the line
expressions = many (endBy expression eol)
which should be
expressions :: Parser [Expr]
expressions = endBy expression eol
endBy works like sepBy - you don't need to use many on it because it already parses many.
This error would have been easier to find with a stronger data type tree, so:
Use try to deal with common prefixes
One of the hard-to-debug problems you've had is when you get the error expecting space or "=" whilst parsing an expression. If we think about that, the only place we expect = is in a binding, so it must be part way through parsing a binding when we've given it an expression. This only happens if our expression starts with an identifier, just like a binding does.
binding sees the first identifier and says "It's OK guys, I've got this" but then finds no = and gives you an error, where we wanted it to backtrack and let expression have a go. The key point is we've already used the identifier input, and we want to unuse it. try is right for that.
Encase your binding parser with try so if it fails, we'll go back to the start of the line and hand over to expression.
binding = try (do
(Var id) <- identifier
_ <- char '='
spaces
exp <- expression
spaces
eol <?> "end of line"
return $ Eq id exp
<?> "binding")
It's important that as far as possible each parser starts with matching something unique to avoid this problem. (try is backtracking, hence inefficient, so should be avoided if possible.)
In particular, avoid starting parsers with spaces, but instead make sure you finish them all with spaces. Your main program can start with spaces if you like, since it's the only alternative.
Use types for most productions - better structure & readability
My first piece of general advice is that you could do with a more fine-grained data type, and should annotate your parsers with their type. At the moment, everything's wrapped up in Expr, which means you can only get error messages about whether you have an Expr or a [Expr]. The fact that you had to add Eq to Expr is a sign you're pushing the type too far.
Usually it's worth making a data type for quite a lot of the productions, and if you import Control.Applicative hiding ((<|>),(<$>),many) Control.Applicative you can use <$> and <*> so that the production, the datatype and the parser are all the same structure:
--<program> ::= <spaces> [<bindings>] <expressions>
data Program = Prog [Binding] [Expr]
program = spaces >> Prog <$> bindings <*> expressions
-- <expression> ::= <abstraction> | factors
data Expression = Ab Abstraction | Fa [Factor]
expression = Ab <$> abstraction <|> Fa <$> factors <?> "expression"
Don't do this with letters for example, but for important things. What counts as important things is a matter of judgement, but I'd start with Identifiers. (You can use <* or *> to not include syntax like = in the results.)
Amended code:
Before refactoring types and using Applicative here
And afterwards here

Parsing the full input twice

To achieve case-insensitive infix operators using OperatorPrecedenceParser, I'm preprocessing the input, parsing it as text delimited by string literals. The text portion is then searched for infix operators that need to be uppercased (to conform to the operator as known to the OPP). The actual parsing then takes place.
My question is, can both phases be combined into a single parser? I tried
// preprocess: Parser<string,_>
// scalarExpr: Parser<ScalarExpr,_>
let filter = (preprocess .>> eof) >>. (scalarExpr .>> eof)
but it fails at the end of the input, seemingly expecting a scalarExpr. The input can be parsed by preprocess and scalarExpr independently, so I'm guessing it's an issue with eof, but I can't seem to get it right. Is this possible?
Here are the other parsers for reference.
let stringLiteral =
let subString = manySatisfy ((<>) '"')
let escapedQuote = stringReturn "\"\"" "\""
(between (pstring "\"") (pstring "\"") (stringsSepBy subString escapedQuote))
let canonicalizeKeywords =
let keywords =
[
"OR"
"AND"
"CONTAINS"
"STARTSWITH"
"ENDSWITH"
]
let caseInsensitiveKeywords = HashSet(keywords, StringComparer.InvariantCultureIgnoreCase)
fun text ->
let re = Regex(#"([\w][\w']*\w)")
re.Replace(text, MatchEvaluator(fun m ->
if caseInsensitiveKeywords.Contains(m.Value) then m.Value.ToUpperInvariant()
else m.Value))
let preprocess =
stringsSepBy
((manySatisfy ((<>) '"')) |>> canonicalizeKeywords)
(stringLiteral |>> (fun s -> "\"" + s + "\""))
The simplest way to parse case insensitive operators with FParsec's OperatorPrecedenceParser is to add operator definitions for every casing you want to support. If you only need to support short operator names, such as "and" or "or", you could simply add all possible case combinations. If you want to use operator names that are too long for this approach, you might consider only supporting the sane casings, i.e. lowercase, UPPERCASE, camelCase and PascalCase. When you want to support multiple casings, it is usually convenient to write a helper function that automatically generates all the needed casings for you from a standard one.
If you have long operator names and you really want to support all casings, the OperatorPrecedenceParser's dynamic configurability also allows the following approach, which should be easier and more efficient than transforming the input:
Search the input for all case insensitive occurrences of the supported operators. This search shouldn't miss any occurrences, but it's no problem if it finds false positives if e.g. the operator name is used inside a function name or inside a string literal.
Add all unique casings you found in step 1 to the OperatorPrecedenceParser. (Usually there won't be many casings of the same operator.)
Parse the input with the configured OperatorPrecedenceParser.
When you parse multiple inputs, you can keep the OperatorPrecedenceParser instance around and just lazily add new operators casings as you need them.

Using Parsec to write a Read instance

Using Parsec, I'm able to write a function of type String -> Maybe MyType with relative ease. I would now like to create a Read instance for my type based on that; however, I don't understand how readsPrec works or what it is supposed to do.
My best guess right now is that readsPrec is used to build a recursive parser from scratch to traverse a string, building up the desired datatype in Haskell. However, I already have a very robust parser who does that very thing for me. So how do I tell readsPrec to use my parser? What is the "operator precedence" parameter it takes, and what is it good for in my context?
If it helps, I've created a minimal example on Github. It contains a type, a parser, and a blank Read instance, and reflects quite well where I'm stuck.
(Background: The real parser is for Scheme.)
However, I already have a very robust parser who does that very thing for me.
It's actually not that robust, your parser has problems with superfluous parentheses, it won't parse
((1) (2))
for example, and it will throw an exception on some malformed inputs, because
singleP = Single . read <$> many digit
may use read "" :: Int.
That out of the way, the precedence argument is used to determine whether parentheses are necessary in some place, e.g. if you have
infixr 6 :+:
data a :+: b = a :+: b
data C = C Int
data D = D C
you don't need parentheses around a C 12 as an argument of (:+:), since the precedence of application is higher than that of (:+:), but you'd need parentheses around C 12 as an argument of D.
So you'd usually have something like
readsPrec p = needsParens (p >= precedenceLevel) someParser
where someParser parses a value from the input without enclosing parentheses, and needsParens True thing parses a thing between parentheses, while needsParens False thing parses a thing optionally enclosed in parentheses [you should always accept more parentheses than necessary, ((((((1)))))) should parse fine as an Int].
Since the readsPrec p parsers are used to parse parts of the input as parts of the value when reading lists, tuples etc., they must return not only the parsed value, but also the remaining part of the input.
With that, a simple way to transform a parsec parser to a readsPrec parser would be
withRemaining :: Parser a -> Parser (a, String)
withRemaining p = (,) <$> p <*> getInput
parsecToReadsPrec :: Parser a -> Int -> ReadS a
parsecToReadsPrec parsecParser prec input
= case parse (withremaining $ needsParens (prec >= threshold) parsecParser) "" input of
Left _ -> []
Right result -> [result]
If you're using GHC, it may however be preferable to use a ReadPrec / ReadP parser (built using Text.ParserCombinators.ReadP[rec]) instead of a parsec parser and define readPrec instead of readsPrec.

FParsec identifiers vs keywords

For languages with keywords, some special trickery needs to happen to prevent for example "if" from being interpreted as an identifier and "ifSomeVariableName" from becoming keyword "if" followed by identifier "SomeVariableName" in the token stream.
For recursive descent and Lex/Yacc, I've simply taken the approach (as per helpful instruction) of transforming the token stream between the lexer and the parser.
However, FParsec doesn't really seem do a separate lexer step, so I'm wondering what the best way to deal with this is. Speaking of, it seems like Haskell's Parsec supports a lexer layer, but FParsec does not?
I think, this problem is very simple. The answer is that you have to:
Parse out an entire word ([a-z]+), lower case only;
Check if it belongs to a dictionary; if so, return a keyword; otherwise, the parser will fall back;
Parse identifier separately;
E.g. (just a hypothetical code, not tested):
let keyWordSet =
System.Collections.Generic.HashSet<_>(
[|"while"; "begin"; "end"; "do"; "if"; "then"; "else"; "print"|]
)
let pKeyword =
(many1Satisfy isLower .>> nonAlphaNumeric) // [a-z]+
>>= (fun s -> if keyWordSet.Contains(s) then (preturn x) else fail "not a keyword")
let pContent =
pLineComment <|> pOperator <|> pNumeral <|> pKeyword <|> pIdentifier
The code above will parse a keyword or an identifier twice. To fix it, alternatively, you may:
Parse out an entire word ([a-z][A-Z]+[a-z][A-Z][0-9]+), e.g. everything alphanumeric;
Check if it's a keyword or an identifier (lower case and belonging to a dictionary) and either
Return a keyword
Return an identifier
P.S. Don't forget to order "cheaper" parsers first, if it does not ruin the logic.
You can define a parser for whitespace and check if keyword or identifier is followed by it.
For example some generic whitespace parser will look like
let pWhiteSpace = pLineComment <|> pMultilineComment <|> pSpaces
this will require at least one whitespace
let ws1 = skipMany1 pWhiteSpace
then if will look like
let pIf = pstring "if" .>> ws1

Resources