understanding attoparsec - parsing

attoparsec was suggested to me for parsing a file, and now I need to understand how to use it.
Somebody gave me this piece of code:
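import Control.Applicative ((<$>), (<*>), (<*))
import qualified Data.Map as M
import Data.Attoparsec (maybeResult)
import qualified Data.Attoparsec.Char8 as A
import qualified Data.ByteString.Char8 as B
type Environment = M.Map String String
-- A file is a list of "key: value;" entries, one per line.
environment :: A.Parser Environment
environment = M.fromList <$> A.sepBy entry A.endOfLine
-- Run the parser, signal end of input, and convert the result to a Maybe.
parseEnvironment = maybeResult . flip A.feed B.empty . A.parse environment
spaces = A.many $ A.char ' '
entry = (,) <$> upTo ':' <*> upTo ';'
-- Take everything up to the delimiter, then skip the delimiter and any spaces around it.
upTo delimiter = B.unpack <$> A.takeWhile (A.notInClass $ delimiter : " ")
                          <* (spaces >> A.char delimiter >> spaces)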
That works very well, but I do not know why.
What is the reason for using flip? Would it not be easier to give A.feed its arguments in a different order? And why is B.empty there?
Is there a tutorial on this that I can study?
Thanks in advance.

There's an explanation of the need for feed in the answers to this StackOverflow question. As Bryan O'Sullivan (the creator of Attoparsec) says there:
If you write an attoparsec parser that consumes as much input as possible before failing, you must tell the partial result continuation when you've reached the end of your input. You can do this by feeding it an empty bytestring.
I'll admit that I wrote the code in question, and I actually didn't use pointfree in this case. Simple composition just makes sense to me here: you run the parser (A.parse environment), you tell it you're done (flip A.feed B.empty), and you convert to a Maybe as a kind of basic error handling (maybeResult). In my opinion this feels cleaner than the pointed version:
parseEnvironment b = maybeResult $ A.feed (A.parse environment b) B.empty
The rest is I think fairly idiomatic applicative parsing, although I'm not sure why I would have used >> instead of *>.
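To make the role of the empty chunk concrete, here is a minimal base-only sketch of incremental parsing. The Result type and the feed, maybeResult and digits functions below are toy stand-ins written for illustration; they are not attoparsec's real definitions, only the same shape. A greedy parser that runs out of input cannot tell whether more is coming, so it returns a continuation, and an empty chunk is the agreed signal for "that was all".

```haskell
import Data.Char (isDigit)

-- Toy stand-in for an incremental result type (hypothetical, not the
-- library's real definition).
data Result
  = Partial (String -> Result)  -- the parser wants more input
  | Done String String          -- leftover input, then the parsed digits

-- Consume digits greedily. If the chunk ends while still matching, we cannot
-- know whether more digits follow, so ask for more input.
digits :: String -> String -> Result
digits acc chunk =
  case span isDigit chunk of
    (ds, [])   -> Partial (next (acc ++ ds))
    (ds, rest) -> Done rest (acc ++ ds)
  where
    -- An empty chunk means "end of input"; anything else continues the parse.
    next acc' ""   = Done "" acc'
    next acc' more = digits acc' more

parseToy :: String -> Result
parseToy = digits ""

-- Same argument order as in the answer: result first, then the next chunk.
feed :: Result -> String -> Result
feed (Partial k) s = k s
feed done        _ = done

maybeResult :: Result -> Maybe String
maybeResult (Done _ r) = Just r
maybeResult _          = Nothing

-- Pointfree pipeline: flip lets the fixed second argument "" be supplied
-- first, leaving a function that only awaits the Result.
parseAll :: String -> Maybe String
parseAll = maybeResult . flip feed "" . parseToy
```

In this model, maybeResult (parseToy "123") is Nothing because the parser is still waiting, while parseAll "123" yields the digits once the empty chunk has been fed.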

Related

How to make a sub parser with Parsec?

I would like to parse several lists of commands, indented or formatted as an array, with Parsec. For example, my lists will be formatted like this:
Command1 arg1 arg2 Command1 arg1 arg2 Command1 arg1 arg2
Command2 arg1 Command3 arg1 arg2 arg3
Command3 arg1 arg2 arg3
Command4
Command3 arg1 arg2 arg3 Command2 arg1
Command4
Command4
Command5 arg1 Command2 arg1
These commands are supposed to be parsed column by column, with state changes in the parser.
My idea is to gather the commands into separate lists of strings and parse these strings in a subparser (executed inside the main parser).
I inspected the API of the Parsec library but didn't find a function to do that.
I considered using runParser, but this function only extracts the result of the parser and not its state.
I also considered making a function inspired by runParsecT and mkPT to make my own parser, but the constructors ParsecT or initialPos are not available (not exported by the library).
Is it possible to run a subparser inside a parser with Parsec?
If not, can a library such as megaparsec solve my problem?
Not a complete answer, more a question for clarification:
Is it necessary to build a list of strings?
I would prefer to parse the input and convert it into a more specific datatype. That way you can rely on the type guarantees of Haskell.
I would begin by defining a datatype for my commands:
data Command = Command1 Argtype1
             | Command2 Argtype2
             | Command3 Argtype1 Argtype2
data Argtype1 = Arg1 | Arg2 | ArgX
data Argtype2 = Arg2_1 | Arg2_2
After that you can parse the input and put it in datatypes.
While parsing, you can accumulate the results: for lists, either prepend each command with (:) or combine partial lists with mappend, which for lists is (++).
You end up with a datatype of [Command].
With that you can work further.
For parsing the text, you can follow the megaparsec introduction at https://markkarpov.com/megaparsec/parsing-simple-imperative-language.html
Or do you mean something completely different? Perhaps each line (containing some commands) as a whole is meant to be one input to a state machine, whose state changes according to the commands? Then I wonder why the state machine should be implemented as a parser.
As a starting point, the simplest answer to "How to make a sub parser" is to use monadic bind, applicative <*>, alternative <|>, and the combinators provided by the library. Assuming that each command belongs to a single type (as in Hans Kruger's answer), and with an arbitrary number of columns, the following might make a good template.
import Text.Parsec
import Text.Parsec.Char
import Data.List (transpose)
-- One inner list per input line. Note that sepBy takes the item parser
-- first and the separator second.
cmdFileParser :: Parsec String u [[CommandType]]
cmdFileParser = sepBy cmdLineParser sepParser
  where
    sepParser = newline -- from Text.Parsec.Char
-- Commands within a line are separated by tabs.
cmdLineParser :: Parsec String u [CommandType]
cmdLineParser = sepBy cmdParser sepParser
  where
    sepParser = tab
cmdParser :: Parsec String u CommandType
cmdParser = parseCommand1
        <|> parseCommand2
        <|> parseCommand3
        -- <|> etc.
Then, after the parsing, transpose the [[CommandType]] to group the commands by column:
main = do
  ...
  let ret = runParser cmdFileParser
                      ()  -- initial user state
                      "debug string telling what was parsed"
                      stringToParse
  case ret of
    Left e     -> putStrLn "wasn't parsed"
    Right cmds -> doSomethingWith (transpose cmds)
I would say that the above is a typical approach. There are variations, of course. For instance, if you know there should be only three columns, you might replace the cmdLineParser above with the following:
cmdLineParser :: Parsec String u (CommandType, CommandType, CommandType)
cmdLineParser = (\a b c -> (a, b, c)) <$> ct <*> ct <*> cmdParser
  where
    ct = cmdParser <* tab
I would say that using getState is atypical. When I first started using Parsec, I remember getting something like what I think you are after working, but it wasn't pretty. Of course, if you really want to just return the strings you can always parse for any char except your newlines and tabs.
cmdParser :: Parsec s u String
cmdParser = many (noneOf "\n\t")
Be careful with the above, though. I've been burned by many before, where it takes too much or always succeeds: many accepts zero characters, so this parser succeeds even on an empty field. So I don't have high confidence that that exact formulation will get you the command string. Also, if you just parse the command as a string and then reparse it in your main, you will be parsing twice!
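For a compilable variant of the parse-rows-then-transpose idea, here is a small sketch using Parsec. The Command type, the word and command parsers, and the columns helper are assumptions made for illustration (the question does not say what a command or an argument looks like); the sketch also assumes that arguments are separated by single spaces, columns by tabs, and that every line ends with a newline.

```haskell
import Data.List (transpose)
import Text.Parsec
import Text.Parsec.String (Parser)

-- Hypothetical command type for illustration: a name plus its arguments.
data Command = Command String [String]
  deriving (Eq, Show)

-- A word is a non-empty run of characters other than space, tab and newline.
word :: Parser String
word = many1 (noneOf " \t\n")

-- One command: a name followed by arguments, each preceded by one space.
command :: Parser Command
command = Command <$> word <*> many (try (char ' ' *> word))

-- One line: one or more commands separated by tabs.
commandLine :: Parser [Command]
commandLine = command `sepBy1` tab

-- Whole input: lines, each terminated by a newline, then end of input.
commandFile :: Parser [[Command]]
commandFile = commandLine `endBy` newline <* eof

-- Parse row by row, then regroup the result into columns.
columns :: String -> Either ParseError [[Command]]
columns = fmap transpose . parse commandFile "<input>"
```

Note that transpose silently skips missing cells, so rows of different lengths produce columns of different lengths; if column position matters, empty cells would need an explicit representation.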

Parse String to Datatype in Haskell

I'm taking a Haskell course at school, and I have to define a logical proposition datatype in Haskell. Everything so far works fine (definition and functions), and I've declared it as an instance of Ord, Eq and Show. The problem comes when I'm required to define a program which interacts with the user: I have to parse the input from the user into my datatype:
type Var = String
data FProp = V Var
           | No FProp
           | Y FProp FProp
           | O FProp FProp
           | Si FProp FProp
           | Sii FProp FProp
where the formula: ¬q ^ p would be: (Y (No (V "q")) (V "p"))
I've been researching, and found that I can declare my datatype as an instance of Read.
Is this advisable? If it is, can I get some help in order to define the parsing method?
Not a complete answer, since this is a homework problem, but here are some hints.
The other answer suggested getLine followed by splitting at words. It sounds like you instead want something more like a conventional tokenizer, which would let you write things like:
(Y
   (No (V q))
   (V p))
Here’s one implementation that turns a string into tokens that are either a string of alphanumeric characters or a single, non-alphanumeric printable character. You would need to extend it to support quoted strings:
import Data.Char
type Token = String
tokenize :: String -> [Token]
{- Here, a token is either a string of alphanumeric characters, or else one
 - non-spacing printable character, such as "(" or ")".
 -}
tokenize [] = []
tokenize (x:xs) | isSpace x = tokenize xs
                | not (isPrint x) = error $
                    "Invalid character " ++ show x ++ " in input."
                | not (isAlphaNum x) = [x]:(tokenize xs)
                | otherwise = let (token, rest) = span isAlphaNum (x:xs)
                              in token:(tokenize rest)
It turns the example into ["(","Y","(","No","(","V","q",")",")","(","V","p",")",")"]. Note that you have access to the entire repertoire of Unicode.
The main function that evaluates this interactively might look like:
main = interact ( unlines . map show . map evaluate . parse . tokenize )
Where parse turns a list of tokens into a list of ASTs and evaluate turns an AST into a printable expression.
As for implementing the parser, your language appears to have similar syntax to LISP, which is one of the simplest languages to parse; you don’t even need precedence rules. A recursive-descent parser could do it, and is probably the easiest to implement by hand. You can pattern-match on parse ("(":xs) =, but pattern-matching syntax can also implement lookahead very easily, for example parse ("(":x1:xs) = to look ahead one token.
If you’re calling the parser recursively, you would define a helper function that consumes only a single expression, and that has a type signature like :: [Token] -> (AST, [Token]). This lets you parse the inner expression, check that the next token is ")", and proceed with the parse. However, externally, you’ll want to consume all the tokens and return an AST or a list of them.
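As a sketch of that helper shape, here is a recursive-descent parser for a generic parenthesised language rather than for the assignment's datatype. The SExpr type and the names parseOne, parseItems and parseAll are made up for this illustration, and the helper returns a Maybe around the (AST, [Token]) pair so that malformed input fails instead of crashing.

```haskell
type Token = String

-- Hypothetical generic AST: an atom or a parenthesised list of expressions.
data SExpr = Atom String | List [SExpr]
  deriving (Eq, Show)

-- Consume exactly one expression and return it with the unconsumed tokens.
parseOne :: [Token] -> Maybe (SExpr, [Token])
parseOne []         = Nothing            -- nothing left to parse
parseOne (")" : _)  = Nothing            -- close paren with no matching open
parseOne ("(" : ts) = parseItems [] ts
parseOne (t : ts)   = Just (Atom t, ts)

-- Collect expressions until the matching ")" is reached.
parseItems :: [SExpr] -> [Token] -> Maybe (SExpr, [Token])
parseItems _   []         = Nothing      -- ran out of tokens before ")"
parseItems acc (")" : ts) = Just (List (reverse acc), ts)
parseItems acc ts         = do
  (e, rest) <- parseOne ts
  parseItems (e : acc) rest

-- External entry point: consume every token or fail.
parseAll :: [Token] -> Maybe [SExpr]
parseAll [] = Just []
parseAll ts = do
  (e, rest) <- parseOne ts
  (e :) <$> parseAll rest
```

Turning the resulting tree into a specific datatype is then a separate pattern match on the head atom of each list, which keeps the bracket matching and the language-specific rules apart.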
The stylish way to write a parser is with monadic parser combinators. (And maybe someone will post an example of one.) The industrial-strength solution would be a library like Parsec, but that’s probably overkill here. Still, parsing is (mostly!) a solved problem, and if you just want to get the assignment done on time, using a library off the shelf is a good idea.
The read part of a REPL interpreter typically looks like this:
repl :: ForthState -> IO () -- parser definition
repl state
  = do putStr "> " -- puts a > character to indicate it's waiting for input
       input <- getLine -- this is what you're looking for, to read a line.
       if input == "quit" -- allows user to quit the interpreter
         then do putStrLn "Bye!"
                 return ()
         else let (is, cs, d, output) = eval (words input) state -- your grammar definition is somewhere down the chain when eval is called on input
              in do mapM_ putStrLn output
                    repl (is, cs, d, [])
main = do putStrLn "Welcome to your very own interpreter!"
          repl initialForthState -- runs the parser, starting with read
Your eval function will have various loops, stack manipulations, conditionals, etc. to actually figure out what the user entered. I hope this helps with at least the input-reading part.

Grouping lines with Parsec

I have a line-based text format I want to parse with Parsec†. A line either starts with a pound sign and specifies a key value pair separated by a colon or is a URL that is described by the previous tags.
Here's a short example:
#foo:bar
#faz:baz
https://example.com
#foo:beep
https://example.net
For simplicity's sake, I'm going to store everything as String. A Tag is a type Tag = (String, String), for example ("foo", "bar"). Ultimately, I'd like to group these as ([Tag], URL).
However, I struggle figuring out how to parse either [one or more tags] or [one URL].
My current approach looks like this:
import qualified System.Environment as Env
import qualified Text.Megaparsec as M
import qualified Text.Megaparsec.Text as M
type Tag = (String, String)
data Segment = Tags [Tag] | URL String
  deriving (Eq, Show)
tagP :: M.Parser Tag
tagP = M.char '#' *> ((,) <$> M.someTill M.printChar (M.char ':') <*> M.someTill M.printChar M.eol) M.<?> "Tag starting with #"
urlP :: M.Parser String
urlP = M.someTill M.printChar M.eol M.<?> "Some URL"
parser :: M.Parser Segment
parser = (Tags <$> M.many tagP) M.<|> (URL <$> urlP)
main :: IO ()
main = do
  fname <- head <$> Env.getArgs
  res <- M.parseFromFile (parser <* M.eof) fname
  print res
If I try to run this on the above sample, I get a parsing error like this:
3:1:
unexpected 'h'
expecting Tag starting with # or end of input
Clearly my use of many in combination with <|> is incorrect. Since the tag parser won't consume any input from the URL parser it cannot be related to backtracking. How do I need to change this to get to the desired result?
The full example is available on GitHub.
† I'm actually using MegaParsec here for better error messages but I think the problem is quite generic and not about any particular implementation of parser combinators.
What you're doing works quite fine, only, at the moment you only parse a single segment (i.e., either only tags or only a URL), but that doesn't consume the whole input. It's eof that's causing the error.
Simply use one more many or some, to allow for multiple segments:
main :: IO ()
main = do
  fname <- head <$> Env.getArgs
  res <- M.parseFromFile (M.many parser <* M.eof) fname
  print res
#cocreature answered this for me on Twitter.
As leftaroundabout pointed out here, there are two separate mistakes in my code:
The parser itself misuses <|> while it should just sequentially parse the lines and skip to the next parser if it doesn't consume any input.
The invocation (parseFromFile) only applies the parser function a single time and would fail as soon as it would get to the second block.
We can fix the parser and introduce grouping in one go:
import Control.Applicative (liftA2)
parser :: M.Parser ([Tag], String)
parser = liftA2 (,) (M.many tagP) urlP
Afterwards, we just need to apply the change suggested by leftaroundabout:
...
res <- M.parseFromFile (M.many parser <* M.eof) fname
Running this leads to the desired result:
[([("foo","bar"),("faz","baz")],"https://example.com"),([("foo","beep")],"https://example.net")]
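Since the question notes that the problem is not specific to one combinator library, here is a self-contained sketch of the same grouping written against Parsec instead of Megaparsec. This is a swapped-in library, not the code from the thread: the character classes used for keys, values and URLs, and the names segment and segments, are assumptions for illustration, and every line is assumed to end with a newline.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

type Tag = (String, String)

-- "#key:value" terminated by a newline.
tagP :: Parser Tag
tagP = char '#' *> ((,) <$> many1 (noneOf ":\n") <* char ':'
                        <*> many1 (noneOf "\n") <* newline)
       <?> "tag starting with #"

-- Any non-empty line; it is only reached once no further tag matches.
urlP :: Parser String
urlP = many1 (noneOf "\n") <* newline <?> "URL"

-- Zero or more tags followed by the URL they describe.
segment :: Parser ([Tag], String)
segment = (,) <$> many tagP <*> urlP

-- The whole file: as many segments as there are, then end of input.
segments :: Parser [([Tag], String)]
segments = many segment <* eof
```

Running parse segments "<input>" on the sample from the question, with a trailing newline, should give the same grouped list as the Megaparsec version.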

try function in parsing lambda expressions

I'm totally new to Haskell and am trying to implement a lambda calculus parser that will be used to read the input to a lambda reducer. It needs to parse bindings of the form "identifier = expression;" from a text file first; then, at the end, there is a lone expression.
So far it can only parse bindings, and it reports errors when it encounters a lone expression. When I try to use the try or option functions, it gives a type mismatch error:
Couldn't match type `[Expr]'
with `Text.Parsec.Prim.ParsecT s0 u0 m0 [[Expr]]'
Expected type: Text.Parsec.Prim.ParsecT
s0 u0 m0 (Text.Parsec.Prim.ParsecT s0 u0 m0 [[Expr]])
Actual type: Text.Parsec.Prim.ParsecT s0 u0 m0 [Expr]
In the second argument of `option', namely `bindings'
bindings weren't supposed to return anything, but I tried to add a return statement and it also returned a type mismatch error:
Couldn't match type `[Expr]' with `Expr'
Expected type: Text.Parsec.Prim.ParsecT
[Char] u0 Data.Functor.Identity.Identity [Expr]
Actual type: Text.Parsec.Prim.ParsecT
[Char] u0 Data.Functor.Identity.Identity [[Expr]]
In the second argument of `(<|>)', namely `expressions'
Don't use <|> if you want to allow both
Your program parser does its main work with
program = do
  spaces
  try bindings <|> expressions
  spaces >> eof
This <|> is choice - it does bindings if it can, and if that fails, expressions, which isn't what you want. You want zero or more bindings, followed by expressions, so let's make it do that.
Sadly, even when this works, the last line of your parser is eof and
First, let's allow zero bindings, since they're optional, then let's get both the bindings and the expressions:
bindings = many binding
program = do
  spaces
  bs <- bindings
  es <- expressions
  spaces >> eof
  return (bs,es)
This error would be easier to find with plenty more <?> "binding" type hints so you can see more clearly what was expected.
endBy doesn't need many
The error message you have stems from the line
expressions = many (endBy expression eol)
which should be
expressions :: Parser [Expr]
expressions = endBy expression eol
endBy works like sepBy - you don't need to use many on it because it already parses many.
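The type difference can be seen in a tiny Parsec sketch. The expression parser here is a hypothetical stand-in (a run of letters) and newline stands in for eol; only the shapes of the result types matter.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- Hypothetical stand-in for the real expression parser: a run of letters.
expression :: Parser String
expression = many1 letter

-- endBy already repeats: zero or more expressions, each followed by a newline.
expressions :: Parser [String]
expressions = expression `endBy` newline

-- Wrapping endBy in many adds a second list layer, which is the
-- nested-list-versus-list mismatch from the error message. Shown for its
-- type only: the inner parser can succeed without consuming input, which
-- Parsec's many rejects at run time.
nested :: Parser [[String]]
nested = many expressions
```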
This error would have been easier to find with a stronger data type tree, so:
Use try to deal with common prefixes
One of the hard-to-debug problems you've had is when you get the error expecting space or "=" whilst parsing an expression. If we think about that, the only place we expect = is in a binding, so it must be part way through parsing a binding when we've given it an expression. This only happens if our expression starts with an identifier, just like a binding does.
binding sees the first identifier and says "It's OK guys, I've got this" but then finds no = and gives you an error, where we wanted it to backtrack and let expression have a go. The key point is we've already used the identifier input, and we want to unuse it. try is right for that.
Encase your binding parser with try so if it fails, we'll go back to the start of the line and hand over to expression.
binding = try (do
    (Var id) <- identifier
    _ <- char '='
    spaces
    exp <- expression
    spaces
    eol <?> "end of line"
    return $ Eq id exp
  <?> "binding")
It's important that as far as possible each parser starts with matching something unique to avoid this problem. (try is backtracking, hence inefficient, so should be avoided if possible.)
In particular, avoid starting parsers with spaces, but instead make sure you finish them all with spaces. Your main program can start with spaces if you like, since it's the only alternative.
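The common-prefix problem can be reproduced in a few lines of Parsec. The grammar below is a deliberately tiny, hypothetical one (a line is either "name = name" or a bare name), not the lambda calculus grammar from the question, but both alternatives start with an identifier in the same way.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- Hypothetical mini-grammar: a line is "name = name" or a bare "name".
data Line = Binding String String | Expr String
  deriving (Eq, Show)

-- Identifiers finish by skipping trailing whitespace, as advised above.
identifier :: Parser String
identifier = many1 letter <* spaces

binding :: Parser Line
binding = Binding <$> identifier <* char '=' <* spaces <*> identifier

expression :: Parser Line
expression = Expr <$> identifier

-- Without try: binding consumes the identifier, fails at the missing '=',
-- and <|> does not run expression because input was already consumed.
lineNoTry :: Parser Line
lineNoTry = (binding <|> expression) <* eof

-- With try: the failed binding rewinds to where it started, so expression
-- gets a turn on the same input.
lineTry :: Parser Line
lineTry = (try binding <|> expression) <* eof
```

On the input "x", lineNoTry fails with a message about the missing '=', while lineTry succeeds with an Expr.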
Use types for most productions - better structure & readability
My first piece of general advice is that you could do with a more fine-grained data type, and should annotate your parsers with their type. At the moment, everything's wrapped up in Expr, which means you can only get error messages about whether you have an Expr or a [Expr]. The fact that you had to add Eq to Expr is a sign you're pushing the type too far.
Usually it's worth making a data type for quite a lot of the productions, and if you import Control.Applicative hiding ((<|>), many) you can use <$> and <*> so that the production, the datatype and the parser all have the same structure:
--<program> ::= <spaces> [<bindings>] <expressions>
data Program = Prog [Binding] [Expr]
program = spaces >> Prog <$> bindings <*> expressions
-- <expression> ::= <abstraction> | factors
data Expression = Ab Abstraction | Fa [Factor]
expression = Ab <$> abstraction <|> Fa <$> factors <?> "expression"
Don't do this with letters for example, but for important things. What counts as important things is a matter of judgement, but I'd start with Identifiers. (You can use <* or *> to not include syntax like = in the results.)
Amended code:
Before refactoring types and using Applicative here
And afterwards here

Searching for a pattern with Parsec

Not sure if this is possible (or recommended), but I am essentially trying to search for a sequence of characters in a file using Parsec. Example file:
START (name)
junk
morejunk=junk;
dontcare
foo ()
bar
care_about this (stuff in here i dont care about);
don't care about this
or this
foo = bar;
also_care
about_this
(dont care whats in here);
and_this too(only the names
at the front
do i care about
);
foobar
may hit something = perhaps maybe (like this);
foobar
END
And here is my attempt at getting it working:
careAbout :: Parser (String, String)
careAbout = do
  name1 <- many1 (noneOf " \n\r")
  skipMany space
  name2 <- many1 (noneOf " (\r\n")
  skipMany space
  skipMany1 parens
  skipMany space
  char ';'
  return (name1, name2)
parens :: Parser ()
parens = do
  char '('
  many (parens <|> skipMany1 (noneOf "()"))
  char ')'
  return ()
parseFile = do
  manyTill (do
    try careAbout <|>
      anyChar >> return ("", "")) (try $ string "END")
I'm trying to brute force the search by looking for careAbout, and if that doesn't work, eat one character and try again. I could parse all the junk in the middle (I know what it could be), but I don't care about what it is (so why bother parsing it), and it's potentially complicated.
Problem is, my solution doesn't quite work. anyChar ends up consuming everything, and the searching for END never gets a chance. Also, somewhere in the careAbout we hit eof and some Exception is thrown because of it.
This is probably the exact wrong way of doing it, and I would like to know of a way, or even better, the Right Way™, of doing it.
If not for the parens parser, this would be a good fit for a regular language parser, such as regex-applicative. This is because regular language parsers are much more "smart" about "backtracking" (in fact there's no backtracking going on at all, and yet every possible branch is explored).
However, as you probably know, matching parentheses is not a regular language. If you can relax your grammar to become regular, give regex-applicative a try.
I can't really tell from OP's post which parts of the file we care about or don't, so I'm not going to post a specific solution. But in general, for searching through a file for patterns which match a recursive parser, one can use replace-megaparsec.
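The general idea of scanning for a recursive pattern can also be sketched directly in Parsec. This is not the replace-megaparsec API; findAll, target and search are hypothetical names, and the pattern (an identifier followed by a balanced parenthesised group and a semicolon) is only a guess at the kind of thing being searched for.

```haskell
import Data.Maybe (catMaybes)
import Text.Parsec
import Text.Parsec.String (Parser)

-- Balanced parentheses with arbitrary content; the content is discarded.
parens :: Parser ()
parens = () <$ between (char '(') (char ')')
                       (many (parens <|> skipMany1 (noneOf "()")))

-- Hypothetical pattern of interest: an identifier, optional whitespace,
-- a parenthesised group, optional whitespace, and a semicolon.
target :: Parser String
target = many1 (alphaNum <|> char '_') <* spaces <* parens <* spaces <* char ';'

-- Try the pattern at the current position; on failure, rewind and skip
-- exactly one character. Repeat until no character is left.
findAll :: Parser a -> Parser [a]
findAll p = catMaybes <$> many ((Just <$> try p) <|> (Nothing <$ anyChar))

search :: String -> Either ParseError [String]
search = parse (findAll target <* eof) "<input>"
```

Because every failed attempt rewinds and restarts one character later, this can re-read the same text many times, so it is best suited to small inputs or to patterns that fail quickly.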

Resources