Parsing the full input twice

Parsing the full input twice - f#

To achieve case-insensitive infix operators using OperatorPrecedenceParser, I'm preprocessing the input, parsing it as text delimited by string literals. The text portion is then searched for infix operators that need to be uppercased (to conform to the operator as known to the OPP). The actual parsing then takes place.
My question is, can both phases be combined into a single parser? I tried
// preprocess: Parser<string,_>
// scalarExpr: Parser<ScalarExpr,_>
let filter = (preprocess .>> eof) >>. (scalarExpr .>> eof)
but it fails at the end of the input, seemingly expecting a scalarExpr. The input can be parsed by preprocess and scalarExpr independently, so I'm guessing it's an issue with eof, but I can't seem to get it right. Is this possible?
Here are the other parsers for reference.
let stringLiteral =
let subString = manySatisfy ((<>) '"')
let escapedQuote = stringReturn "\"\"" "\""
(between (pstring "\"") (pstring "\"") (stringsSepBy subString escapedQuote))
let canonicalizeKeywords =
let keywords =
[
"OR"
"AND"
"CONTAINS"
"STARTSWITH"
"ENDSWITH"
]
let caseInsensitiveKeywords = HashSet(keywords, StringComparer.InvariantCultureIgnoreCase)
fun text ->
let re = Regex(#"([\w][\w']*\w)")
re.Replace(text, MatchEvaluator(fun m ->
if caseInsensitiveKeywords.Contains(m.Value) then m.Value.ToUpperInvariant()
else m.Value))
let preprocess =
stringsSepBy
((manySatisfy ((<>) '"')) |>> canonicalizeKeywords)
(stringLiteral |>> (fun s -> "\"" + s + "\""))

The simplest way to parse case insensitive operators with FParsec's OperatorPrecedenceParser is to add operator definitions for every casing you want to support. If you only need to support short operator names, such as "and" or "or", you could simply add all possible case combinations. If you want to use operator names that are too long for this approach, you might consider only supporting the sane casings, i.e. lowercase, UPPERCASE, camelCase and PascalCase. When you want to support multiple casings, it is usually convenient to write a helper function that automatically generates all the needed casings for you from a standard one.
If you have long operator names and you really want to support all casings, the OperatorPrecedenceParser's dynamic configurability also allows the following approach, which should be easier and more efficient than transforming the input:
Search the input for all case insensitive occurrences of the supported operators. This search shouldn't miss any occurrences, but it's no problem if it finds false positives if e.g. the operator name is used inside a function name or inside a string literal.
Add all unique casings you found in step 1 to the OperatorPrecedenceParser. (Usually there won't be many casings of the same operator.)
Parse the input with the configured OperatorPrecedenceParser.
When you parse multiple inputs, you can keep the OperatorPrecedenceParser instance around and just lazily add new operators casings as you need them.

Related

Unary minus messes up parsing

Here is the grammar of the language id' like to parse:
expr ::= val | const | (expr) | unop expr | expr binop expr
var ::= letter
const ::= {digit}+
unop ::= -
binop ::= /*+-
I'm using an example from the haskell wiki.
The semantics and token parser are not shown here.
exprparser = buildExpressionParser table term <?> "expression"
table = [ [Prefix (m_reservedOp "-" >> return (Uno Oppo))]
,[Infix (m_reservedOp "/" >> return (Bino Quot)) AssocLeft
,Infix (m_reservedOp "*" >> return (Bino Prod)) AssocLeft]
,[Infix (m_reservedOp "-" >> return (Bino Diff)) AssocLeft
,Infix (m_reservedOp "+" >> return (Bino Somm)) AssocLeft]
]
term = m_parens exprparser
<|> fmap Var m_identifier
<|> fmap Con m_natural
The minus char appears two times, once as unary, once as binary operator.
On input "1--2", the parser gives only
Con 1
instead of the expected
"Bino Diff (Con 1) (Uno Oppo (Con 2))"
Any help welcome.Full code here

The purpose of reservedOp is to create a parser (which you've named m_reservedOp) that parses the given string of operator symbols while ensuring that it is not the prefix of a longer string of operator symbols. You can see this from the definition of reservedOp in the source:
reservedOp name =
lexeme $ try $
do{ _ <- string name
; notFollowedBy (opLetter languageDef) <?> ("end of " ++ show name)
}
Note that the supplied name is parsed only if it is not followed by any opLetter symbols.
In your case, the string "--2" can't be parsed by m_reservedOp "-" because, even though it starts with the valid operator "-", this string occurs as the prefix of a longer valid operator "--".
In a language with single-character operators, you probably don't want to use reservedOp at all, unless you want to disallow adjacent operators without intervening whitespace. Just use symbol "-", which will always parse "-", no matter what follows (and consume following whitespace, if any). Also, in a language with a fixed set of operators (i.e., no user-defined operators), you probably won't use the operator parser, so you won't need opStart, or reservedOpNames. Without reservedOp or operator, the opLetter parser isn't used, so you can drop it too.
This is probably pretty confusing, and the Parsec documentation does a terrible job of explaining how the "reserved" mechanism is supposed to work. Here's a primer:
Let's start with identifiers, instead of operators. In a typical language that allows user-defined identifiers (i.e., pretty much any language, since "variables" and "functions" have user-defined names) and may also have some reserved words that aren't allowed as identifiers, the relevant settings in the GenLanguageDef are:
identStart -- parser for first character of valid identifier
identLetter -- second and following characters of valid identifier
reservedNames -- list of reserved names not allowed as identifiers
The lexeme (whitespace-absorbing) parsers created using the GenTokenParser object are:
identifier - Parses an unknown, user-defined identifier. It parses a character from identStart followed by zero or more identLetters up to the first non-identLetter. (It never parses a partial identifier, so it'll never leave more identLetters on the table.) Additionally, it checks that the identifier is not in the list reservedNames.
symbol - Parses the given string. If the string is a reserved word, no check is made that it isn't part of a larger valid identifier. So, symbol "for" would match the beginning of foreground = "black", which is rarely what you want. Note that symbol makes no use of identStart, identLetter, or reservedNames.
reserved - Parses the given string, and then ensures that it's not followed by an identLetter. So, m_reserved "for" will parse for (i=1; ... but not parse foreground = "black". Usually, the supplied string will be a valid identifier, but no check is made for this, so you can write m_reserved "15" if you want -- in a language with the usual sorts of alphanumeric identifiers, this would parse "15" provided it wasn't following by a letter or another digit. Also, maybe somewhat surprisingly, no check is made that the supplied string is in reservedNames.
If that makes sense to you, then the operator settings follow the exact same pattern. The relevant settings are:
opStart -- parser for first character of valid operator
opLetter -- valid second and following operator chars, for multichar operators
reservedOpNames -- list of reserved operator names not allowed as user-defined operators
and the relevant parsers are:
operator - Parses an unknown, user-defined operator starting with an opStart and followed by zero or more opLetters up to the first non-opLetter. So, operator applied to the string "--2" will always take the whole operator "--", never just the prefix "-". An additional check is made that the resulting operator is not in the reservedOpNames list.
symbol - Exactly as for identifiers. It parses a string with no checks or reference to opStart, opLetter, or reservedOpNames, so symbol "-" will parse the first character of the string "--" just fine, leaving the second "-" character for a later parser.
reservedOp - Parses the given string, ensuring it's not followed by opLetter. So, m_reservedOp "-" will parse the start of "-x" but not "--2", assuming - matches opLetter. As before, no check is made that the string is in reservedOpNames.

FParsec backtracking in nested parser

I want to parse expressions that are constructed like: a is x or y or z or b is z or w, so basically I have the same separator for different rules in my grammar.
I already succeed parsing such expressions with Antlr since it can backtrack quite nicely. But now I want to parse it with FParsec and I don't get the inner parser to not be greedy. My current parsers look like this:
let variable = // matches a,b,c,...
// variables ::= variable { "or" variable }+ ;
let variables =
variable .>>? keyword "or" .>>.? (sepBy1 variable (keyword "or"))
let operation =
variable .>>? keyword "is" .>>.? variables
// expression ::= operation { "or" operation }+ ;
let expression =
operation .>>? keyword "or" .>>.? (sepBy1 variable (keyword "or"))
In my example the variables parser consumes x or y or z or b and the whole thing fails at is. This means I need to make that variables parser less greedy or make it backtrack correctly.
I found a similar question where they make a backtracking version of sepBy1, but using that still does not solve my problem. I guess that is because I want to backtrack into a nested parser.
So what is the correct way to make FParsec accept my input?

Besides switching one of the meanings of or to | as I mentioned in a comment, you could also use notFollowedBy (keyword "is") as follows:
let variables =
variable .>>? keyword "or" .>>.? (sepBy1 (variable .>> (notFollowedBy (keyword "is"))) (keyword "or"))
I'm not really enthusiastic about this solution because it doesn't generalize easily. Are there other keywords besides is that could appear after a variable? E.g., do you have a syntax like b matches x or y, or anything similar? If so, then you'd need to write something like (notFollowedBy ((keyword "is") <|> (keyword "matches"))) and it can quickly get complicated. But as long as the keyword is is the only one that is throwing off your parser, using (notFollowedBy (keyword "is")) is probably your best bet.

Using ParserResult

The example code below appears to work nicely:
open FParsec
let capitalized : Parser<unit,unit> =(asciiUpper >>. many asciiLower >>. eof)
let inverted : Parser<unit,unit> =(asciiLower >>. many asciiUpper >>. eof)
let capsOrInvert =choice [capitalized;inverted]
You can then do:
run capsOrInvert "Dog";;
run capsOrInvert "dOG";;
and get a success or:
run capsOrInvert "dog";;
and get a failure.
Now that I have a ParserResult, how do I do things with it? For example, print the string backwards?

There are several notable issues with your code.
First off, as noticed in #scrwtp's answer, your parser returns unit. Here's why: operator (>>.) returns only the result returned by the right inner parser. On the other hand, (.>>) would return the result of a left parser, while (.>>.) would return a tuple of both left and right ones.
So, parser1 >>. parser2 >>. eof is essentially (parser1 >>. parser2) >>. eof.
The code in parens completely ignores the result of parser1, and the second (>>.) then ignores the entire result of the parser in parens. Finally, eof returns unit, and this value is being returned.
You may need some meaningful data returned instead, e.g. the parsed string. The easiest way is:
let capitalized = (asciiUpper .>>. many asciiLower .>> eof)
Mind the operators.
The code for inverted can be done in a similar manner.
This parser would be of type Parser<(char * char list), unit>, a tuple of first character and all the remaining ones, so you may need to merge them back. There are several ways to do that, here's one:
let mymerge (c1: char, cs: char list) = c1 :: cs // a simple cons
let pCapitalized = capitalized >>= mymerge
The beauty of this code is that your mymerge is a normal function, working with normal char's, it knows nothing about parsers or so. It just works with the data, and (>>=) operator does the rest.
Note, pCapitalized is also a parser, but it returns a single char list.
Nothing stops you from applying further transitions. As you mentioned printing the string backwards:
let pCapitalizedAndReversed =
capitalized
>>= mymerge
>>= List.rev
I have written the code in this way for purpose. In different lines you see a gradual transition of your domain data, still within the paradigm of Parser. This is an important consideration, because any subsequent transition may "decide" that the data is bad for some reason and raise a parsing exception, for example. Or, alternatively, it may be merged with other parser.
As soon as your domain data (a parsed-out word) is complete, you extract the result as mentioned in another answer.
A minor note. choice is superfluous for only two parsers. Use (<|>) instead. From experience, careful choosing parser combinators is important because a wrong choice deep inside your core parser logic can easily make your parsers dramatically slow.
See FParsec Primitives for further details.

ParserResult is a discriminated union. You simply match the Success and Failure cases.
let r = run capsOrInvert "Dog"
match r with
| Success(result, _, _) -> printfn "Success: %A" result
| Failure(errorMsg, _, _) -> printfn "Failure: %s" errorMsg
But this is probably not what you find tricky about your situation.
The thing about your Parser<unit, unit> type is that the parsed value is of type unit (the first type argument to Parser). What this means is that this parser doesn't really produce any sensible output for you to use - it can only tell you whether it can parse a string (in which case you get back a Success ((), _, _) - carrying the single value of type unit) or not.
What do you expect to get out of this parser?
Edit: This sounds close to what you want, or at least you should be able to pick up some pointers from it. capitalized accepts capitalized strings, inverted accepts capitalized strings that have been reversed and reverses them as part of the parser logic.
let reverse (s: string) =
System.String(Array.rev (Array.ofSeq s))
let capitalized : Parser<string,unit> =
(asciiUpper .>>. manyChars asciiLower)
|>> fun (upper, lower) -> string upper + lower
let inverted : Parser<string,unit> =
(manyChars asciiLower .>>. asciiUpper)
|>> fun (lower, upper) -> reverse (lower + string upper)
let capsOrInvert = choice [capitalized;inverted]
run capsOrInvert "Dog"
run capsOrInvert "doG"
run capsOrInvert "dog"

Using Parsec to write a Read instance

Using Parsec, I'm able to write a function of type String -> Maybe MyType with relative ease. I would now like to create a Read instance for my type based on that; however, I don't understand how readsPrec works or what it is supposed to do.
My best guess right now is that readsPrec is used to build a recursive parser from scratch to traverse a string, building up the desired datatype in Haskell. However, I already have a very robust parser who does that very thing for me. So how do I tell readsPrec to use my parser? What is the "operator precedence" parameter it takes, and what is it good for in my context?
If it helps, I've created a minimal example on Github. It contains a type, a parser, and a blank Read instance, and reflects quite well where I'm stuck.
(Background: The real parser is for Scheme.)

However, I already have a very robust parser who does that very thing for me.
It's actually not that robust, your parser has problems with superfluous parentheses, it won't parse
((1) (2))
for example, and it will throw an exception on some malformed inputs, because
singleP = Single . read <$> many digit
may use read "" :: Int.
That out of the way, the precedence argument is used to determine whether parentheses are necessary in some place, e.g. if you have
infixr 6 :+:
data a :+: b = a :+: b
data C = C Int
data D = D C
you don't need parentheses around a C 12 as an argument of (:+:), since the precedence of application is higher than that of (:+:), but you'd need parentheses around C 12 as an argument of D.
So you'd usually have something like
readsPrec p = needsParens (p >= precedenceLevel) someParser
where someParser parses a value from the input without enclosing parentheses, and needsParens True thing parses a thing between parentheses, while needsParens False thing parses a thing optionally enclosed in parentheses [you should always accept more parentheses than necessary, ((((((1)))))) should parse fine as an Int].
Since the readsPrec p parsers are used to parse parts of the input as parts of the value when reading lists, tuples etc., they must return not only the parsed value, but also the remaining part of the input.
With that, a simple way to transform a parsec parser to a readsPrec parser would be
withRemaining :: Parser a -> Parser (a, String)
withRemaining p = (,) <$> p <*> getInput
parsecToReadsPrec :: Parser a -> Int -> ReadS a
parsecToReadsPrec parsecParser prec input
= case parse (withremaining $ needsParens (prec >= threshold) parsecParser) "" input of
Left _ -> []
Right result -> [result]
If you're using GHC, it may however be preferable to use a ReadPrec / ReadP parser (built using Text.ParserCombinators.ReadP[rec]) instead of a parsec parser and define readPrec instead of readsPrec.

FParsec identifiers vs keywords

For languages with keywords, some special trickery needs to happen to prevent for example "if" from being interpreted as an identifier and "ifSomeVariableName" from becoming keyword "if" followed by identifier "SomeVariableName" in the token stream.
For recursive descent and Lex/Yacc, I've simply taken the approach (as per helpful instruction) of transforming the token stream between the lexer and the parser.
However, FParsec doesn't really seem do a separate lexer step, so I'm wondering what the best way to deal with this is. Speaking of, it seems like Haskell's Parsec supports a lexer layer, but FParsec does not?

I think, this problem is very simple. The answer is that you have to:
Parse out an entire word ([a-z]+), lower case only;
Check if it belongs to a dictionary; if so, return a keyword; otherwise, the parser will fall back;
Parse identifier separately;
E.g. (just a hypothetical code, not tested):
let keyWordSet =
System.Collections.Generic.HashSet<_>(
[|"while"; "begin"; "end"; "do"; "if"; "then"; "else"; "print"|]
)
let pKeyword =
(many1Satisfy isLower .>> nonAlphaNumeric) // [a-z]+
>>= (fun s -> if keyWordSet.Contains(s) then (preturn x) else fail "not a keyword")
let pContent =
pLineComment <|> pOperator <|> pNumeral <|> pKeyword <|> pIdentifier
The code above will parse a keyword or an identifier twice. To fix it, alternatively, you may:
Parse out an entire word ([a-z][A-Z]+[a-z][A-Z][0-9]+), e.g. everything alphanumeric;
Check if it's a keyword or an identifier (lower case and belonging to a dictionary) and either
Return a keyword
Return an identifier
P.S. Don't forget to order "cheaper" parsers first, if it does not ruin the logic.

You can define a parser for whitespace and check if keyword or identifier is followed by it.
For example some generic whitespace parser will look like
let pWhiteSpace = pLineComment <|> pMultilineComment <|> pSpaces
this will require at least one whitespace
let ws1 = skipMany1 pWhiteSpace
then if will look like
let pIf = pstring "if" .>> ws1

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart