Searching for a pattern with Parsec - parsing

Not sure if this is possible (or recommended), but I am essentially trying to search for a sequence of characters in file using Parsec. Example file:
START (name)
junk
morejunk=junk;
dontcare
foo ()
bar
care_about this (stuff in here i dont care about);
don't care about this
or this
foo = bar;
also_care
about_this
(dont care whats in here);
and_this too(only the names
at the front
do i care about
);
foobar
may hit something = perhaps maybe (like this);
foobar
END
And here is my attempt at getting it working:
careAbout :: Parser (String, String)
careAbout = do
    name1 <- many1 (noneOf " \n\r")
    skipMany space
    name2 <- many1 (noneOf " (\r\n")
    skipMany space
    skipMany1 parens
    skipMany space
    char ';'
    return (name1, name2)

parens :: Parser ()
parens = do
    char '('
    many (parens <|> skipMany1 (noneOf "()"))
    char ')'
    return ()

parseFile = do
    manyTill (do
        try careAbout <|>
            anyChar >> return ("", "")) (try $ string "END")
I'm trying to brute force the search by looking for careAbout, and if that doesn't work, eat one character and try again. I could parse all the junk in the middle (I know what it could be), but I don't care about what it is (so why bother parsing it), and it's potentially complicated.
Problem is, my solution doesn't quite work. anyChar ends up consuming everything, and the searching for END never gets a chance. Also, somewhere in the careAbout we hit eof and some Exception is thrown because of it.
This is probably the exact wrong way of doing it, and I would like to know of a way, or even better, the Right Way™, of doing it.

If not for the parens parser, this would be a good fit for a regular language parser, such as regex-applicative. This is because regular language parsers are much more "smart" about "backtracking" (in fact there's no backtracking going on at all, and yet every possible branch is explored).
However, as you probably know, matching parentheses is not a regular language. If you can relax your grammar to become regular, give regex-applicative a try.

I can't really tell from OP's post which parts of the file we care about or don't, so I'm not going to post a specific solution. But in general, for searching through a file for patterns which match a recursive parser, one can use replace-megaparsec.
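For comparison, the "try the pattern, otherwise skip one character" search from the question can also be written directly in Parsec. The following is only a minimal sketch under assumptions: findAll and call are hypothetical names, and the pattern (a name followed by balanced parentheses) is a simplified stand-in for the real careAbout parser. Wrapping each miss in Nothing keeps both branches of (<|>) at the same type, which the original parseFile did not.

```haskell
import Data.Maybe (catMaybes)
import Text.Parsec
import Text.Parsec.String (Parser)

-- Collect every match of p; whenever p fails, skip one character and retry.
-- 'try' rewinds any input that p consumed before failing.
findAll :: Parser a -> Parser [a]
findAll p = catMaybes <$> manyTill step eof
  where
    step = (Just <$> try p) <|> (Nothing <$ anyChar)

-- Balanced parentheses whose contents are ignored (as in the question).
parens :: Parser ()
parens = char '(' *> skipMany (parens <|> skipMany1 (noneOf "()")) <* char ')'

-- Hypothetical pattern: a name followed by balanced parentheses.
call :: Parser String
call = many1 (letter <|> char '_') <* spaces <* parens

main :: IO ()
main = print (parse (findAll call) "" "x = 1; foo (a (b)) junk bar(); END")
-- prints: Right ["foo","bar"]
```

Because every failed attempt restarts one character later, this approach is quadratic in the worst case; it is meant to show the shape of the search, not an efficient scanner.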

Related

Why does try not trigger backtracking in this example

I am trying to wrap my head around writing parser using parsec in Haskell, in particular how backtracking works.
Take the following simple parser:
import Text.Parsec

type Parser = Parsec String () String

parseConst :: Parser
parseConst = do {
    x <- many digit;
    return $ read x
}

parseAdd :: Parser
parseAdd = do {
    l <- parseExp;
    char '+';
    r <- parseExp;
    return $ l <> "+" <> r
}

parseExp :: Parser
parseExp = try parseConst <|> parseAdd

pp :: Parser
pp = parseExp <* eof

test = parse pp "" "1+1"
test has value
Left (line 1, column 2):
unexpected '+'
expecting digit or end of input
In my mind this should succeed since I used the try combinator on parseConst in the definition of parseExp.
What am I missing? I am also interested in pointers on how to debug this on my own. I tried using parserTraced, which only allowed me to conclude that it indeed wasn't backtracking.
PS.
I know this is an awful way to write an expression parser, but I'd like to understand why it doesn't work.
There are a lot of problems here.
First, parseConst can never work right. The type says it must produce a String, so read :: String -> String. That particular Read instance requires the input be a quoted string, so being passed 0 or more digit characters to read is always going to result in a call to error if you try to evaluate the value it produces.
Second, parseConst can succeed on matching zero characters. I think you probably wanted some instead of many. That will make it actually fail if it encounters input that doesn't start with a digit.
Third, (<|>) doesn't do what you think. You might think that (a <* c) <|> (b <* c) is interchangeable with (a <|> b) <* c, but it isn't. There is no way to throw try in and make it the same, either. The problem is that (<|>) commits to whichever branch succeeds, if one does. In (a <|> b) <* c, if a matches, there's no way to backtrack later and try b there. It doesn't matter how you lob try around; it can't undo the fact that (<|>) committed to a. In contrast, (a <* c) <|> (b <* c) doesn't commit until both a and c or b and c match the input (in Parsec the first branch additionally needs a try if it can consume input before failing).
This is the situation you're encountering. You have (try parseConst <|> parseAdd) <* eof, after a bit of inlining. Since parseConst will always succeed (see the second issue), parseAdd will never get tried, even if the eof fails. So after parseConst consumes zero or more leading digits, the parse will fail unless that's the end of the input. Working around this essentially requires carefully planning your grammar such that any use of (<|>) is safe to commit locally. That is, the contents of each branch must not overlap in a way that is disambiguated only by later portions of the grammar.
Note that this unpleasant behavior with (<|>) is how the parsec family of libraries works, but not how all parser libraries in Haskell work. Other libraries work without the left bias or commit behavior the parsec family has chosen.
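As an illustration of planning the grammar so that every choice is safe to commit locally, here is a minimal sketch that assumes the goal is only to accept digit sequences joined by '+'. The names constant and expr are illustrative and not from the question. It uses many1 so that a constant cannot match empty input, keeps the result as a String so that no read is needed, and parses the shared prefix (the left operand) once before deciding whether a '+' follows.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- One or more digits; fails without consuming input on anything else.
constant :: Parser String
constant = many1 digit

-- Parse the common prefix first, then optionally continue with "+ expr".
-- No try is needed: the decision is made on the next character alone.
expr :: Parser String
expr = do
  l <- constant
  rest <- optionMaybe (char '+' *> expr)
  return (maybe l (\r -> l ++ "+" ++ r) rest)

pp :: Parser String
pp = expr <* eof

main :: IO ()
main = print (parse pp "" "1+1")
-- prints: Right "1+1"
```

An input such as "1+" is rejected, because after the '+' has been consumed an operand is required.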

Using ParserResult

The example code below appears to work nicely:
open FParsec
let capitalized : Parser<unit,unit> =(asciiUpper >>. many asciiLower >>. eof)
let inverted : Parser<unit,unit> =(asciiLower >>. many asciiUpper >>. eof)
let capsOrInvert =choice [capitalized;inverted]
You can then do:
run capsOrInvert "Dog";;
run capsOrInvert "dOG";;
and get a success or:
run capsOrInvert "dog";;
and get a failure.
Now that I have a ParserResult, how do I do things with it? For example, print the string backwards?
There are several notable issues with your code.
First off, as noted in @scrwtp's answer, your parser returns unit. Here's why: the operator (>>.) returns only the result of the right inner parser. On the other hand, (.>>) would return the result of the left parser, while (.>>.) would return a tuple of both the left and right results.
So, parser1 >>. parser2 >>. eof is essentially (parser1 >>. parser2) >>. eof.
The code in parens completely ignores the result of parser1, and the second (>>.) then ignores the entire result of the parser in parens. Finally, eof returns unit, and this value is being returned.
You may need some meaningful data returned instead, e.g. the parsed string. The easiest way is:
let capitalized = (asciiUpper .>>. many asciiLower .>> eof)
Mind the operators.
The code for inverted can be done in a similar manner.
This parser would be of type Parser<(char * char list), unit>, a tuple of the first character and all the remaining ones, so you may need to merge them back. There are several ways to do that; here's one:
let mymerge (c1: char, cs: char list) = c1 :: cs // a simple cons
let pCapitalized = capitalized |>> mymerge
The beauty of this code is that mymerge is a normal function working with normal chars; it knows nothing about parsers. It just works with the data, and the (|>>) operator does the rest by applying it to the parser's result. ((>>=) would not fit here, because it expects a function that returns another parser.)
Note that pCapitalized is also a parser, but it returns a single char list.
Nothing stops you from applying further transformations. As you mentioned printing the string backwards:
let pCapitalizedAndReversed =
    capitalized
    |>> mymerge
    |>> List.rev
I have written the code this way on purpose. On the different lines you see a gradual transformation of your domain data, still within the paradigm of Parser. This is an important consideration, because a subsequent step (written with (>>=), which may return a failing parser) can "decide" that the data is bad for some reason and fail the parse, for example. Or, alternatively, it may be merged with another parser.
As soon as your domain data (a parsed-out word) is complete, you extract the result as mentioned in another answer.
A minor note: choice is superfluous for only two parsers. Use (<|>) instead. From experience, choosing parser combinators carefully is important, because a wrong choice deep inside your core parser logic can easily make your parsers dramatically slower.
See FParsec Primitives for further details.
ParserResult is a discriminated union. You simply match the Success and Failure cases.
let r = run capsOrInvert "Dog"
match r with
| Success(result, _, _) -> printfn "Success: %A" result
| Failure(errorMsg, _, _) -> printfn "Failure: %s" errorMsg
But this is probably not what you find tricky about your situation.
The thing about your Parser<unit, unit> type is that the parsed value is of type unit (the first type argument to Parser). What this means is that this parser doesn't really produce any sensible output for you to use - it can only tell you whether it can parse a string (in which case you get back a Success ((), _, _) - carrying the single value of type unit) or not.
What do you expect to get out of this parser?
Edit: This sounds close to what you want, or at least you should be able to pick up some pointers from it. capitalized accepts capitalized strings, inverted accepts capitalized strings that have been reversed and reverses them as part of the parser logic.
let reverse (s: string) =
    System.String(Array.rev (Array.ofSeq s))
let capitalized : Parser<string,unit> =
    (asciiUpper .>>. manyChars asciiLower)
    |>> fun (upper, lower) -> string upper + lower
let inverted : Parser<string,unit> =
    (manyChars asciiLower .>>. asciiUpper)
    |>> fun (lower, upper) -> reverse (lower + string upper)
let capsOrInvert = choice [capitalized;inverted]
run capsOrInvert "Dog"
run capsOrInvert "doG"
run capsOrInvert "dog"

How to negate a parser with Parsec

I have a file with line endings “\r\r\n”, and use the parser eol = string "\r\r\n" :: Parser String to handle them. To get a list of the lines between these separators, I would like to use sepBy along with a parser that returns any text that would not be captured by eol. Looking through the documentation I did not see a combinator that negates a parser (an ‘anything but the pattern ”\r\r\n”’ parser).
I have tried using sepBy (many anyToken) end, but many anyToken appears to be greedy, not stopping for eol matches. I cannot use many (noneOf "\n\r"), because there are several places in my text with the single '\n' character.
Is there a combinator that can get me the inverse of string "\r\r\n"?
I'm afraid you're going about it backwards. Parsec parsers don't chop up the input, they build the output.
The more you try to parse by thinking about what you don't want, the harder it'll be. You need to think bottom-up about what's permissible, not top-down about where to chop.
You should start with the smallest, most basic thing you do want. For example, don't think of an identifier as everything before a space, think of it as a letter followed by alphanumeric data. You can then combine that, separated by whitespace with the other things you expect on a line.
line = do
    i <- identifier
    whiteSpace
    string "="
    e <- expr
    return $ Line i e
Only when you've completed a parser that successfully parses what you want from a line and rejects invalid lines should you parse multiple lines:
lines = sepBy line eol
As a tentative answer, it looks like manyTill anyChar (try eol) does what I want. As part of my original question though, I'm still interested in knowing whether there is a general way to negate a parser, or whether there's another recommended way of doing what I want.
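A minimal sketch of how manyTill anyChar (try eol) could be assembled into a complete line splitter. The names textLine and allLines are illustrative, and the sketch assumes that every line, including the last one, ends with "\r\r\n"; an unterminated final line makes the parse fail.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

eol :: Parser String
eol = string "\r\r\n"

-- One line: any characters up to the next eol, which is consumed and dropped.
-- 'try' lets a partial match such as "\r\r" followed by another character
-- be rewound and treated as ordinary line content.
textLine :: Parser String
textLine = manyTill anyChar (try eol)

allLines :: Parser [String]
allLines = many textLine <* eof

main :: IO ()
main = print (parse allLines "" "one\r\r\ntwo\nstill two\r\r\n")
-- prints: Right ["one","two\nstill two"]
```

A single '\n' inside a line is kept as content, which is the case that ruled out many (noneOf "\n\r").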
The sepCap parser combinator from the package replace-megaparsec does this kind of parser negation, and returns a list of Either with the negative matches in Left and the positive matches in Right.
import Data.Void (Void)
import Replace.Megaparsec
import Text.Megaparsec
parseTest (sepCap (chunk "\r\r\n" :: Parsec Void String String))
    $ "one\r\r\ntwo\r\r\nthree\r\r\n"
[ Left "one"
, Right "\r\r\n"
, Left "two"
, Right "\r\r\n"
, Left "three"
, Right "\r\r\n"
]

try function in parsing lambda expressions

I'm totally new to Haskell and trying to implement a lambda calculus parser that will be used to read the input to a lambda reducer. It needs to parse bindings of the form "identifier = expression;" from a text file first, and then a lone expression at the end.
So far it can only parse bindings, and it reports errors when it encounters a lone expression. When I try to use the try or option functions, it gives a type mismatch error:
Couldn't match type `[Expr]'
with `Text.Parsec.Prim.ParsecT s0 u0 m0 [[Expr]]'
Expected type: Text.Parsec.Prim.ParsecT
s0 u0 m0 (Text.Parsec.Prim.ParsecT s0 u0 m0 [[Expr]])
Actual type: Text.Parsec.Prim.ParsecT s0 u0 m0 [Expr]
In the second argument of `option', namely `bindings'
bindings weren't supposed to return anything, but I tried to add a return statement and it also returned a type mismatch error:
Couldn't match type `[Expr]' with `Expr'
Expected type: Text.Parsec.Prim.ParsecT
[Char] u0 Data.Functor.Identity.Identity [Expr]
Actual type: Text.Parsec.Prim.ParsecT
[Char] u0 Data.Functor.Identity.Identity [[Expr]]
In the second argument of `(<|>)', namely `expressions'
Don't use <|> if you want to allow both
Your program parser does its main work with
program = do
    spaces
    try bindings <|> expressions
    spaces >> eof
This <|> is choice - it does bindings if it can, and if that fails, expressions, which isn't what you want. You want zero or more bindings, followed by expressions, so let's make it do that.
Sadly, even when this works, the last line of your parser is eof, so the parser as a whole returns () rather than what it parsed.
First, let's allow zero bindings, since they're optional, then let's get both the bindings and the expressions:
bindings = many binding
program = do
    spaces
    bs <- bindings
    es <- expressions
    spaces >> eof
    return (bs,es)
This error would be easier to find with plenty more <?> "binding" type hints so you can see more clearly what was expected.
endBy doesn't need many
The error message you have stems from the line
expressions = many (endBy expression eol)
which should be
expressions :: Parser [Expr]
expressions = endBy expression eol
endBy works like sepBy - you don't need to use many on it because it already parses many.
This error would have been easier to find with a stronger data type tree, so:
Use try to deal with common prefixes
One of the hard-to-debug problems you've had is when you get the error expecting space or "=" whilst parsing an expression. If we think about that, the only place we expect = is in a binding, so it must be part way through parsing a binding when we've given it an expression. This only happens if our expression starts with an identifier, just like a binding does.
binding sees the first identifier and says "It's OK guys, I've got this" but then finds no = and gives you an error, where we wanted it to backtrack and let expression have a go. The key point is we've already used the identifier input, and we want to unuse it. try is right for that.
Encase your binding parser with try so if it fails, we'll go back to the start of the line and hand over to expression.
binding = try (do
    (Var id) <- identifier
    _ <- char '='
    spaces
    exp <- expression
    spaces
    eol <?> "end of line"
    return $ Eq id exp
  <?> "binding")
It's important that as far as possible each parser starts with matching something unique to avoid this problem. (try is backtracking, hence inefficient, so should be avoided if possible.)
In particular, avoid starting parsers with spaces, but instead make sure you finish them all with spaces. Your main program can start with spaces if you like, since it's the only alternative.
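The points in this section (zero or more bindings followed by an expression, try around binding, and every token parser finishing with spaces) can be put together in a minimal self-contained sketch. Everything here is simplified for illustration and is not the asker's grammar: Binding holds plain strings, an expression is assumed to be just a sequence of identifiers, and a binding is assumed to end with ';'.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

data Binding = Binding String String deriving (Eq, Show)

-- Token parsers consume the whitespace that follows them.
identifier :: Parser String
identifier = many1 letter <* spaces

-- Hypothetical stand-in for a real expression parser.
expression :: Parser String
expression = unwords <$> many1 identifier

-- try rewinds the identifier if no '=' follows, so that the same input
-- can be handed to expression instead.
binding :: Parser Binding
binding =
  try (Binding <$> identifier <* char '=' <* spaces
               <*> expression <* char ';' <* spaces)
    <?> "binding"

program :: Parser ([Binding], String)
program = do
  spaces
  bs <- many binding
  e <- expression
  eof
  return (bs, e)

main :: IO ()
main = print (parse program "" "x = a b;\ny = c;\nx y")
-- prints: Right ([Binding "x" "a b",Binding "y" "c"],"x y")
```

Wrapping the whole binding in try keeps the sketch short, at the cost of less precise error positions inside a malformed binding; narrowing the try to the identifier and the '=' would be a reasonable refinement.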
Use types for most productions - better structure & readability
My first piece of general advice is that you could do with a more fine-grained data type, and should annotate your parsers with their type. At the moment, everything's wrapped up in Expr, which means you can only get error messages about whether you have an Expr or a [Expr]. The fact that you had to add Eq to Expr is a sign you're pushing the type too far.
Usually it's worth making a data type for quite a lot of the productions, and if you import Control.Applicative hiding ((<|>),(<$>),many) you can use <$> and <*> so that the production, the datatype and the parser all have the same structure:
--<program> ::= <spaces> [<bindings>] <expressions>
data Program = Prog [Binding] [Expr]
program = spaces >> Prog <$> bindings <*> expressions
-- <expression> ::= <abstraction> | factors
data Expression = Ab Abstraction | Fa [Factor]
expression = Ab <$> abstraction <|> Fa <$> factors <?> "expression"
Don't do this with letters for example, but for important things. What counts as important things is a matter of judgement, but I'd start with Identifiers. (You can use <* or *> to not include syntax like = in the results.)

FParsec identifiers vs keywords

For languages with keywords, some special trickery needs to happen to prevent for example "if" from being interpreted as an identifier and "ifSomeVariableName" from becoming keyword "if" followed by identifier "SomeVariableName" in the token stream.
For recursive descent and Lex/Yacc, I've simply taken the approach (as per helpful instruction) of transforming the token stream between the lexer and the parser.
However, FParsec doesn't really seem to do a separate lexer step, so I'm wondering what the best way to deal with this is. Speaking of which, it seems like Haskell's Parsec supports a lexer layer, but FParsec does not?
I think this problem is very simple. The answer is that you have to:
Parse out an entire word ([a-z]+), lower case only;
Check if it belongs to a dictionary; if so, return a keyword; otherwise, the parser will fall back;
Parse the identifier separately.
E.g. (just hypothetical code, not tested):
let keyWordSet =
    System.Collections.Generic.HashSet<_>(
        [|"while"; "begin"; "end"; "do"; "if"; "then"; "else"; "print"|]
    )
let pKeyword =
    (many1Satisfy isLower .>> nonAlphaNumeric) // [a-z]+
    >>= (fun s -> if keyWordSet.Contains(s) then (preturn s) else fail "not a keyword")
let pContent =
    pLineComment <|> pOperator <|> pNumeral <|> pKeyword <|> pIdentifier
The code above will parse a keyword or an identifier twice. To fix it, alternatively, you may:
Parse out an entire word ([a-zA-Z][a-zA-Z0-9]*), i.e. a letter followed by anything alphanumeric;
Check if it's a keyword or an identifier (lower case and belonging to a dictionary) and either
Return a keyword
Return an identifier
P.S. Don't forget to order "cheaper" parsers first, if it does not ruin the logic.
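Since the question also mentions Haskell's Parsec, here is a minimal Parsec sketch of the second approach (read the whole word once, then classify it). The Lexeme type and the names word and wordStream are hypothetical, the keyword list is the one from the F# snippet above, and the sketch assumes words are separated by whitespace only.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

data Lexeme = Keyword String | Identifier String deriving (Eq, Show)

keywords :: [String]
keywords = ["while", "begin", "end", "do", "if", "then", "else", "print"]

-- Read a letter followed by any alphanumerics, then decide what it is.
-- Because the whole word is read before classification, "ifSomeVariableName"
-- can never be split into the keyword "if" plus an identifier.
word :: Parser Lexeme
word = do
  w <- (:) <$> letter <*> many alphaNum
  spaces
  return (if w `elem` keywords then Keyword w else Identifier w)

wordStream :: Parser [Lexeme]
wordStream = spaces *> many word <* eof

main :: IO ()
main = print (parse wordStream "" "if ifSomeVariableName then")
-- prints: Right [Keyword "if",Identifier "ifSomeVariableName",Keyword "then"]
```

Each word is scanned only once, which avoids the double parse mentioned above.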
You can define a parser for whitespace and check if keyword or identifier is followed by it.
For example some generic whitespace parser will look like
let pWhiteSpace = pLineComment <|> pMultilineComment <|> pSpaces
A parser that requires at least one whitespace is then
let ws1 = skipMany1 pWhiteSpace
and the parser for the keyword if will look like
let pIf = pstring "if" .>> ws1

Resources