Parsing a particular string in Haskell - parsing

I'm using the parsec Haskell library.
I want to parse strings of the following kind:
[[v1]][[v2]]
xyz[[v1]][[v2]]
[[v1]]xyz[[v2]]
etc.
I'm interested in collecting only the values v1 and v2, and storing them in a data structure.
I tried with the following code:
import Text.ParserCombinators.Parsec
quantifiedVars = sepEndBy var (string "]]")
var = between (string "[[") (string "") (many (noneOf "]]"))
parseSL :: String -> Either ParseError [String]
parseSL input = parse quantifiedVars "(unknown)" input
main = do {
c <- getContents;
case parse quantifiedVars "(stdin)" c of {
Left e -> do { putStrLn "Error parsing input:"; print e; };
Right r -> do{ putStrLn "ok"; mapM_ print r; };
}
}
In this way, if the input is "[[v1]][[v2]]" the program works fine, returning the following output:
"v1"
"v2"
If the input is "xyz[[v1]][[v2]]" the program doesn't work. In particular, I want only what is contained in [[...]], ignoring "xyz".
Also, I want to store the content of [[...]] in a data structure.
How do you solve this problem?

You need to restructure your parser. You are using combinators in very strange locations, and they mess things up.
A var is a varName between "[[" and "]]". So, write that:
var = between (string "[[") (string "]]") varName
A varName should have some kind of format (I don't think that you want to accept "%A¤%&", do you?), so you should make a parser for that; but in case it really can be anything, just do this:
varName = many $ noneOf "]"
Then a text containing vars is a sequence of vars separated by non-vars.
varText = someText *> var `sepEndBy` someText
... where someText is anything except a '[':
someText = many $ noneOf "["
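Putting the pieces above together, a minimal sketch could look like the following. The type signatures, the parentheses around the sepEndBy expression, the trailing eof check and the helper name extractVars are my additions for illustration, not part of the original answer.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- A variable name: anything up to the first ']'.
varName :: Parser String
varName = many (noneOf "]")

-- A variable is a name wrapped in "[[" and "]]".
var :: Parser String
var = between (string "[[") (string "]]") varName

-- Text between variables: anything except '['.
someText :: Parser String
someText = many (noneOf "[")

-- Skip leading text, then collect variables separated by (possibly empty) text.
varText :: Parser [String]
varText = someText *> (var `sepEndBy` someText)

-- Hypothetical helper: run the parser and require that all input is consumed.
extractVars :: String -> Either ParseError [String]
extractVars = parse (varText <* eof) "(input)"
```

With this, extractVars "xyz[[v1]][[v2]]" should give Right ["v1","v2"], and the list of strings is the data structure holding the collected values.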
Things get more complicated if you want this to be parseable:
bla bla [ bla bla [[somevar]blabla]]
Then you need a better parser for varName and someText:
varName = concat <$> many (try incompleteTerminator <|> many1 (noneOf "]"))
-- Parses e.g. "]a"
incompleteTerminator = (\ a b -> [a, b]) <$> char ']' <*> noneOf "]"
someText = concat <$> many (try incompleteInitiator <|> many1 (noneOf "["))
-- Parses e.g. "[b"
incompleteInitiator = (\ a b -> [a, b]) <$> char '[' <*> noneOf "["
PS. (<*>), (*>) and (<$>) are from Control.Applicative.
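For the more tolerant variant, here is a self-contained sketch that can be loaded into GHCi as is. The type signatures, the eof check and the helper name extractVarsRobust are assumptions added for illustration; the parser bodies are the ones given above.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- Parses e.g. "]a": a lone ']' that does not start the "]]" terminator.
incompleteTerminator :: Parser String
incompleteTerminator = (\a b -> [a, b]) <$> char ']' <*> noneOf "]"

-- A name may contain single ']' characters, but not "]]".
varName :: Parser String
varName = concat <$> many (try incompleteTerminator <|> many1 (noneOf "]"))

-- Parses e.g. "[b": a lone '[' that does not start the "[[" initiator.
incompleteInitiator :: Parser String
incompleteInitiator = (\a b -> [a, b]) <$> char '[' <*> noneOf "["

-- Text between variables may contain single '[' characters, but not "[[".
someText :: Parser String
someText = concat <$> many (try incompleteInitiator <|> many1 (noneOf "["))

var :: Parser String
var = between (string "[[") (string "]]") varName

varText :: Parser [String]
varText = someText *> (var `sepEndBy` someText)

-- Hypothetical helper: run the parser and require that all input is consumed.
extractVarsRobust :: String -> Either ParseError [String]
extractVarsRobust = parse (varText <* eof) "(input)"
```

The try is what lets a lone '[' or ']' be handed back when the second character turns out to start a real delimiter.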

Related

How to fail a nested megaparsec parser?

I am stuck at the following parsing problem:
Parse some text string that may contain zero or more elements from a limited character set, up to but not including one of a set of termination characters. Content/no content should be indicated through Maybe. Termination characters may appear in the string in escaped form. Parsing should fail on any inadmissible character.
This is what I came up with (simplified):
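import Control.Applicative ((<|>))
import Data.Text (Text)
import qualified Data.Text as T
import Data.Void (Void)
import qualified Text.Megaparsec as MP
import qualified Text.Megaparsec.Char as MC
type Parser = MP.Parsec Void Text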
-- Predicate for admissible characters, not including the control characters.
isAdmissibleChar :: Char -> Bool
...
-- Predicate for control characters that need to be escaped.
isControlChar :: Char -> Bool
...
-- The escape character.
escChar :: Char
...
pComponent :: Parser (Maybe Text)
pComponent = do
t <- MP.many (escaped <|> regular)
if null t then return Nothing else return $ Just (T.pack t)
where
regular = MP.satisfy isAdmissibleChar <|> fail "Inadmissible character"
escaped = do
_ <- MC.char escChar
MP.satisfy isControlChar -- only control characters may be escaped
Say, admissible characters are uppercase ASCII, escape is '\', and control is ':'.
Then, the following parses correctly: ABC\:D:EF to yield ABC:D.
However, parsing ABC&D, where & is inadmissible, yields ABC, whereas I would expect an error message instead.
Two questions:
Why does fail end parsing instead of failing the parser?
Is the above a sensible approach to the problem, or is there a "proper", canonical way to parse such terminated strings that I am not aware of?
many has to allow its sub-parser to fail once without the whole parse
failing - for example many (char 'A') *> char 'B', while parsing
"AAAB", has to fail to parse the B to know it got to the end of the
As.
You might want manyTill which allows you to recognise the terminator
explicitly. Something like this:
MP.manyTill (escaped <|> regular) (MP.satisfy isControlChar)
"ABC&D" would give an error here assuming '&' isn't accepted by isControlChar.
Or if you want to parse more than one component you might keep your
existing definition of pComponent and use it with sepBy or similar, like:
MP.sepBy pComponent (MP.satisfy isControlChar)
If you also check for end-of-file after this, like:
MP.sepBy pComponent (MP.satisfy isControlChar) <* MP.eof
then "ABC&D" should give an error again, because the '&' will end the first component but will not be accepted as a separator.
What a parser object normally does is to extract from the input stream whatever subset it is supposed to accept. That's the usual rule.
Here, it seems you want the parser to accept strings that are followed by something specific. From your examples, it is either end of file (eof) or character ':'. So you might want to consider look ahead.
Environment and auxiliary functions:
import Data.Void (Void)
import qualified Data.Text as T
import qualified Text.Megaparsec as MP
import qualified Text.Megaparsec.Char as MC
type Parser = MP.Parsec Void T.Text
-- Predicate for admissible characters, not including the control characters.
isAdmissibleChar :: Char -> Bool
isAdmissibleChar ch = elem ch ['A' .. 'Z']
-- Predicate for control characters that need to be escaped.
isControlChar :: Char -> Bool
isControlChar ch = elem ch ":"
-- The escape character:
escChar :: Char
escChar = '\\'
Termination parser, to be used for look ahead:
termination :: Parser ()
termination = MP.eof MP.<|> do
_ <- MP.satisfy isControlChar
return ()
Modified pComponent parser:
pComponent :: Parser (Maybe T.Text)
pComponent = do
txt <- MP.many (escaped MP.<|> regular)
MP.lookAhead termination -- **CHANGE HERE**
if (null txt) then (return Nothing) else (return $ Just (T.pack txt))
where
regular = (MP.satisfy isAdmissibleChar) MP.<|> (fail "Inadmissible character")
escaped = do
_ <- MC.char escChar
MP.satisfy isControlChar -- only control characters may be escaped
Testing utility:
tryParse :: String -> IO ()
tryParse str = do
let res = MP.parse pComponent "(noname)" (T.pack str)
putStrLn $ (show res)
Let's try to rerun your examples:
$ ghci
λ>
λ> :load q67809465.hs
λ>
λ> str1 = "ABC\\:D:EF"
λ> putStrLn str1
ABC\:D:EF
λ>
λ> tryParse str1
Right (Just "ABC:D")
λ>
So that is successful, as desired.
λ>
λ> tryParse "ABC&D"
Left (ParseErrorBundle {bundleErrors = TrivialError 3 (Just (Tokens ('&' :| ""))) (fromList [EndOfInput]) :| [], bundlePosState = PosState {pstateInput = "ABC&D", pstateOffset = 0, pstateSourcePos = SourcePos {sourceName = "(noname)", sourceLine = Pos 1, sourceColumn = Pos 1}, pstateTabWidth = Pos 8, pstateLinePrefix = ""}})
λ>
So that fails, as desired.
Trying our 2 acceptable termination contexts:
λ> tryParse "ABC:&D"
Right (Just "ABC")
λ>
λ>
λ> tryParse "ABCDEF"
Right (Just "ABCDEF")
λ>
fail does not end parsing in general. It just continues with the next alternative. In this case it selects the empty list alternative introduced by the many combinator, so it stops parsing without an error message.
I think the best way to solve your problem is to specify that the input must end in a termination character, so that it cannot "succeed" halfway like this. You can do that with the notFollowedBy or lookAhead combinators, both of which are covered in the megaparsec tutorial.
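To make the difference concrete, here is a small sketch. Note that it uses Parsec rather than Megaparsec, so it is an analogy and not the code from the question; many, fail and lookAhead play the same roles here. The names regular, lenient and strict are made up for the example, and "admissible" is assumed to mean uppercase ASCII with ':' as the only terminator.

```haskell
import Data.Char (isAsciiUpper)
import Text.Parsec
import Text.Parsec.String (Parser)

-- One admissible character; the fail branch consumes no input.
regular :: Parser Char
regular = satisfy isAsciiUpper <|> fail "Inadmissible character"

-- many treats the failure of regular as "no more items" and succeeds
-- with whatever it has collected so far.
lenient :: Parser String
lenient = many regular

-- Same as lenient, but the next thing must be ':' or end of input.
-- lookAhead checks this without consuming the terminator.
strict :: Parser String
strict = many regular <* lookAhead (eof <|> (() <$ char ':'))
```

In GHCi, parse lenient "" "ABC&D" should come back as Right "ABC", while parse strict "" "ABC&D" should be a Left, and parse strict "" "ABC:D" should still give Right "ABC".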

Parsec sepBy Haskell

I wrote a function and it compiles, but I'm not sure if it works the way I intend it to or how to call it in the terminal. Essentially, I want to take a string like ("age",5),("age",6) and turn it into a list of tuples [("age1",5)...]. I am trying to write a function that separates the items at the commas, and either I am just not sure how to call it in the terminal or I did it wrong.
items :: Parser (String,Integer) -> Parser [(String,Integer)]
items p = do { p <- sepBy strToTup (char ",");
return p }
I'm not sure what you want and I don't know what Parser is.
Starting from such a string:
thestring = "(\"age\",5),(\"age\",6),(\"age\",7)"
I would first remove the outer commas with a regular expression:
import Text.Regex
rgx = mkRegex "\\),\\("
thestring' = subRegex rgx thestring ")("
This gives:
>>> thestring'
"(\"age\",5)(\"age\",6)(\"age\",7)"
Then I would split:
import Data.List.Split
thelist = split (startsWith "(") thestring'
which gives:
>>> thelist
["(\"age\",5)","(\"age\",6)","(\"age\",7)"]
This is what you want, if I correctly understood.
That's probably not the best way. Since all the elements of the final list have form ("age", X) you could extract all numbers (I don't know but it should not be difficult) and then it would be easy to get the final list. Maybe better.
Apologies if this has nothing to do with your question.
Edit
JFF ("just for fun"), another way:
import Data.Char (isDigit)
import Data.List.Split
thestring = "(\"age\",15),(\"age\",6),(\"age\",7)"
ages = (split . dropBlanks . dropDelims . whenElt) (not . isDigit) thestring
map (\age -> "(age," ++ age ++ ")") ages
-- result: ["(age,15)","(age,6)","(age,7)"]
Or rather:
>>> map (\age -> ("age",age)) ages
[("age","15"),("age","6"),("age","7")]
Or if you want integers:
>>> map (\age -> ("age", read age :: Int)) ages
[("age",15),("age",6),("age",7)]
Or if you want age1, age2, ...:
import Data.List.Index
imap (\i age -> ("age" ++ show (i+1), read age :: Int)) ages
-- result: [("age1",15),("age2",6),("age3",7)]
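For comparison, here is a rough sketch of the sepBy approach the question was aiming at, assuming Parser is Parsec's String parser (the question does not say which library it comes from). The names tuple and parseItems are hypothetical, and the input is assumed to contain no whitespace.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- One item such as ("age",5): a quoted name, a comma, then digits.
tuple :: Parser (String, Integer)
tuple = between (char '(') (char ')') $ do
  name <- between (char '"') (char '"') (many (noneOf "\""))
  _ <- char ','
  n <- read <$> many1 digit
  pure (name, n)

-- Zero or more items separated by commas.
items :: Parser [(String, Integer)]
items = tuple `sepBy` char ','

-- Hypothetical entry point: parse a whole string or report an error.
parseItems :: String -> Either ParseError [(String, Integer)]
parseItems = parse (items <* eof) "(input)"
```

Calling parseItems "(\"age\",5),(\"age\",6)" in GHCi should give Right [("age",5),("age",6)]. Note that char takes a Char, so the separator is char ',' rather than char ",".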

Strange behaviour parsing an imperative language using Parsec

I'm trying to parse a fragment of the ABAP language with Parsec in Haskell. Statements in ABAP are delimited by dots. The syntax for a function definition is:
FORM <name> <arguments>.
<statements>.
ENDFORM.
I will use it as a minimal example.
Here is my attempt at writing the corresponding type and parser in Haskell. The GenStatement constructor covers all statements other than the function definition described above.
module Main where
import Control.Applicative
import Data.Functor.Identity
import qualified Text.Parsec as P
import qualified Text.Parsec.String as S
import Text.Parsec.Language
import qualified Text.Parsec.Token as T
type Args = String
type Name = String
data AbapExpr -- ABAP Program
= Form Name Args [AbapExpr]
| GenStatement String [AbapExpr]
deriving (Show, Read)
lexer :: T.TokenParser ()
lexer = T.makeTokenParser style
where
caseSensitive = False
keys = ["form", "endform"]
style = emptyDef
{ T.reservedNames = keys
, T.identStart = P.alphaNum <|> P.char '_'
, T.identLetter = P.alphaNum <|> P.char '_'
}
dot :: S.Parser String
dot = T.dot lexer
reserved :: String -> S.Parser ()
reserved = T.reserved lexer
identifier :: S.Parser String
identifier = T.identifier lexer
argsP :: S.Parser String
argsP = P.manyTill P.anyChar (P.try (P.lookAhead dot))
genericStatementP :: S.Parser String
genericStatementP = P.manyTill P.anyChar (P.try dot)
abapExprP = P.try (P.between (reserved "form")
(reserved "endform" >> dot)
abapFormP)
<|> abapStmtP
where
abapFormP = Form <$> identifier <*> argsP <* dot <*> many abapExprP
abapStmtP = GenStatement <$> genericStatementP <*> many abapExprP
Testing the parser with the following input results in a strange behaviour.
-- a wrapper for convenience
parse :: S.Parser a -> String -> Either P.ParseError a
parse = flip P.parse "Test"
testParse1 = parse abapExprP "form foo arg1 arg2 arg2. form bar arg1. endform. endform."
results in
Right (GenStatement "form foo arg1 arg2 arg2" [GenStatement "form bar arg1" [GenStatement "endform" [GenStatement "endform" []]]])
so it seems the first branch always fails and only the second, generic branch succeeds. However, if the second branch (parsing generic statements) is commented out, parsing forms suddenly succeeds:
abapExprP = P.try (P.between (reserved "form")
(reserved "endform" >> dot)
abapFormP)
-- <|> abapStmtP
where
abapFormP = Form <$> identifier <*> argsP <* dot <*> many abapExprP
-- abapStmtP = GenStatement <$> genericStatementP <*> many abapExprP
Now we get
Right (Form "foo" "arg1 arg2 arg2" [Form "bar" "arg1" []])
How is this possible? It seems that the first branch succeeds so why doesn't it work in the first example - what am I missing?
Many thanks in advance!
It looks to me like your parser genericStatementP parses any character until a dot appears (you are using P.anyChar). Hence it doesn't recognize the reserved keywords of your lexer.
I think you must define:
type Args = [String]
and:
argsP :: S.Parser [String]
argsP = P.manyTill identifier (P.try (P.lookAhead dot))
genericStatementP :: S.Parser String
genericStatementP = identifier
With these changes I get the following result:
Right (Form "foo" ["arg1","arg2","arg2"] [Form "bar" ["arg1"] []])

Writing a parser for Persons in haskell

I'm trying to write a parser for a data type Person. But I have to write it in just one line using <$> and <*>, and I have tried a lot, but I'm getting really "overtaxed".
The parser type is as usual:
newtype Parser a = Parser (String -> [(a,String)])
And I have this function:
parse :: Parser a -> String -> Maybe a
that returns the first complete parse.
e.g.
if I have this easy function:
upper :: Parser Char
upper = satisfy isUpper
If I run parse upper "A" I get Just 'A'
I also have a funnier function like this:
name1 :: Parser String
name1 = (:) <$> (satisfy isUpper) <*> (many $ satisfy isAlpha)
which, as you can see, accepts all strings that consist of letters and begin with an uppercase letter.
so:
*Main> parse name1 "hello"
Nothing
*Main> parse name1 "Hello"
Just "Hello"
So far everything is fine; the only problem is that I have to do something like that for the data type Person
so, I have this:
data Person = Person String deriving (Eq, Show)
And then, in just one line, I have to write the parser for Person, but the name should satisfy the parser name1; that is, the name should be just a sequence of letters where the first one is upper case.
And it should work so:
> parse parserPerson "Chuck"
Just (Person "Chuck")
> parse parserPerson "chuck"
Nothing
where:
parserPerson :: Parser Person
parserPerson = ???
As you can see, before "Chuck" there is Person, so I have to use *> somehow to get it.
And that's it, just a line with <$>, <*> and *> that works that way.
I don't have a clue, and I'm getting crazy with this. Maybe anyone could help me.
EDIT
satisfy :: (Char -> Bool) -> Parser Char -- parse a desired character
satisfy p = Parser check
where
check (c:s) | p c = [(c,s)] -- successful
check _ = [ ] -- no parse
and many (like some) is a function from Control.Applicative.
As tsorn said, the answer was really easy...
parserPerson :: Parser Person
parserPerson = Person <$> name1
and it works because the Functor instance was defined.
instance Functor Parser where
fmap f (Parser p) = Parser $ \s -> map (\(a,b) -> (f a, b)) $ p s
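For completeness, here is a self-contained sketch that can be loaded into GHCi. The Applicative and Alternative instances, runParser, and the body of parse are my guesses at the parts the question did not show (assuming the usual list-of-successes design), so treat them as assumptions rather than the original code.

```haskell
import Control.Applicative (Alternative (..))
import Data.Char (isAlpha, isUpper)
import Data.Maybe (listToMaybe)

newtype Parser a = Parser (String -> [(a, String)])

runParser :: Parser a -> String -> [(a, String)]
runParser (Parser p) = p

instance Functor Parser where
  fmap f (Parser p) = Parser $ \s -> [(f a, rest) | (a, rest) <- p s]

-- Assumed instance: run the first parser, then the second on what is left.
instance Applicative Parser where
  pure x = Parser $ \s -> [(x, s)]
  Parser pf <*> Parser pa =
    Parser $ \s -> [(f a, s2) | (f, s1) <- pf s, (a, s2) <- pa s1]

-- Assumed instance: collect the results of both alternatives.
instance Alternative Parser where
  empty = Parser (const [])
  Parser p <|> Parser q = Parser $ \s -> p s ++ q s

satisfy :: (Char -> Bool) -> Parser Char
satisfy p = Parser check
  where
    check (c : s) | p c = [(c, s)]
    check _ = []

-- Assumed definition: the first parse that consumes the whole input.
parse :: Parser a -> String -> Maybe a
parse p s = listToMaybe [a | (a, "") <- runParser p s]

data Person = Person String deriving (Eq, Show)

name1 :: Parser String
name1 = (:) <$> satisfy isUpper <*> many (satisfy isAlpha)

parserPerson :: Parser Person
parserPerson = Person <$> name1
```

Here Person is just a function String -> Person, so fmap (that is, <$>) applies it to whatever name1 produces; no *> is needed.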

The value of x is undefined here, so this reference is not allowed

I wrote a very simple parser combinator library which seems to work alright (https://github.com/mukeshsoni/tinyparsec).
I then tried writing parser for json with the library. The code for the json parser is here - https://github.com/mukeshsoni/tinyparsec/blob/master/src/example_parsers/JsonParser.purs
The grammar for json is recursive -
data JsonVal
= JsonInt Int
| JsonString String
| JsonBool Boolean
| JsonObj (List (Tuple String JsonVal))
Which means the parser for json object must again call the parser for jsonVal. The code for jsonObj parser looks like this -
jsonValParser
= jsonIntParser <|> jsonBoolParser <|> jsonStringParser <|> jsonObjParser
propValParser :: Parser (Tuple String JsonVal)
propValParser = do
prop <- stringLitParser
_ <- symb ":"
val <- jsonValParser
pure (Tuple prop val)
listOfPropValParser :: Parser (List (Tuple String JsonVal))
listOfPropValParser = sepBy propValParser (symb ",")
jsonObjParser :: Parser JsonVal
jsonObjParser = do
_ <- symb "{"
propValList <- listOfPropValParser
_ <- symb "}"
pure (JsonObj propValList)
But when I try to build it, I get the following error: The value of propValParser is undefined here, so this reference is not allowed.
I found similar issues on Stack Overflow but could not understand why the error happens or how I should refactor my code so that it handles the recursive references from jsonValParser to propValParser.
Any help would be appreciated.
See https://stackoverflow.com/a/36991223/139614 for a similar case - you'll need to make use of the fix function, or introduce Unit -> ... in front of a parser somewhere to break the cyclic definition.
I managed to get rid of the error by wrapping the definitions that were triggering it in a do block and starting the do block with a no-op -
listOfPropValParser :: Parser (List (Tuple String JsonVal))
listOfPropValParser = do
_ <- pure 1 -- does nothing but defer the execution of the second line
sepBy propValParser (symb ",")
Had to do the same for jsonValParser.
jsonValParser = do
_ <- pure 1
jsonIntParser <|> jsonBoolParser <|> jsonStringParser <|> jsonObjParser
The idea is to defer the execution of the code which might lead to a cyclic dependency. The added line, _ <- pure 1, does exactly that. I think it might be doing the same thing as fix from Data.Fix or defer from Data.Lazy.

Resources