Megaparsec: transforming comment syntax into a Record - parsing

Using Megaparsec, if I want to parse a string containing comments of the form ~{content} into a Comment record, how would I go about doing that? For instance:
data Comment = { id :: Integer, content :: String }
parse :: Parser [Comment]
parse = _
parse
"hello world ~{1-sometext} bla bla ~{2-another comment}"
== [Comment { id = 1, content = "sometext" }, Comment { id = 2, content = "another comment"}]
The thing I'm stuck on is allowing for everything that's not ~{} to be ignored, including the lone char ~ and the lone brackets {}.

You can do this by dropping characters up to the next tilde, then parsing the tilde optionally followed by a valid comment, and looping.
In particular, if we define nonTildes to discard non-tildes:
nonTildes :: Parser String
nonTildes = takeWhileP (Just "non-tilde") (/= '~')
and then an optionalComment to parse a tilde and optional following comment in braces:
optionalComment :: Parser (Maybe Comment)
optionalComment = char '~' *>
optional (braces (Comment <$> ident_ <* char '-' <*> content_))
where
braces = between (char '{') (char '}')
ident_ = read <$> takeWhile1P (Just "digit") isDigit
content_ = takeWhileP Nothing (/= '}')
Then the comments can be parsed with:
comments :: Parser [Comment]
comments = catMaybes <$> (nonTildes *> many (optionalComment <* nonTildes))
This assumes that a ~{ without a matching } is a parse error, rather than valid non-comment text, which seems sensible. However, the definition of the content_ parser is probably too liberal. It gobbles everything up to the next }, meaning that:
"~{1-{{{\n}"
is a valid comment with content "{{{\n". Disallowing { (and maybe ~) in comments, or alternatively requiring braces to be properly nested in comments seems like a good idea.
Anyway, here's a full code example for you to fiddle with:
{-# OPTIONS_GHC -Wall #-}
import Data.Char
import Data.Maybe
import Data.Void
import Text.Megaparsec
import Text.Megaparsec.Char
type Parser = Parsec Void String
data Comment = Comment { ident :: Integer, content :: String } deriving (Show)
nonTildes :: Parser String
nonTildes = takeWhileP (Just "non-tilde") (/= '~')
optionalComment :: Parser (Maybe Comment)
optionalComment = char '~' *>
optional (braces (Comment <$> ident_ <* char '-' <*> content_))
where
braces = between (char '{') (char '}')
ident_ = read <$> takeWhile1P (Just "digit") isDigit
content_ = takeWhileP Nothing (/= '}')
comments :: Parser [Comment]
comments = catMaybes <$> (nonTildes *> many (optionalComment <* nonTildes))
main :: IO ()
main = do
parseTest comments "hello world ~{1-sometext} bla bla ~{2-another comment}"
parseTest comments "~~~ ~~~{1-sometext} {junk}"

Related

How to express parsing logic in Parsec ParserT monad

I was working on "Write Yourself a Scheme in 48 hours" to learn Haskell and I've run into a problem I don't really understand. It's for question 2 from the exercises at the bottom of this section.
The task is to rewrite
import Text.ParserCombinators.Parsec
parseString :: Parser LispVal
parseString = do
char '"'
x <- many (noneOf "\"")
char '"'
return $ String x
such that quotation marks which are properly escaped (e.g. in "This sentence \" is nonsense") get accepted by the parser.
In an imperative language I might write something like this (roughly pythonic pseudocode):
def parseString(input):
if input[0] != "\"" or input[len(input)-1] != "\"":
return error
input = input[1:len(input) - 1] # slice off quotation marks
output = "" # This is the 'zero' that accumulates over the following loop
# If there is a '"' in our string we want to make sure the previous char
# was '\'
for n in range(len(input)):
if input[n] == "\"":
try:
if input[n - 1] != "\\":
return error
catch IndexOutOfBoundsError:
return error
output += input[n]
return output
I've been looking at the docs for Parsec and I just can't figure out how to work this as a monadic expression.
I got to this:
parseString :: Parser LispVal
parseString = do
char '"'
regular <- try $ many (noneOf "\"\\")
quote <- string "\\\""
char '"'
return $ String $ regular ++ quote
But this only works for one quotation mark and it has to be at the very end of the string--I can't think of a functional expression that does the work that my loops and if-statements do in the imperative pseudocode.
I appreciate you taking your time to read this and give me advice.
Try something like this:
dq :: Char
dq = '"'
parseString :: Parser Val
parseString = do
_ <- char dq
x <- many ((char '\\' >> escapes) <|> noneOf [dq])
_ <- char dq
return $ String x
where
escapes = dq <$ char dq
<|> '\n' <$ char 'n'
<|> '\r' <$ char 'r'
<|> '\t' <$ char 't'
<|> '\\' <$ char '\\'
The solution is to define a string literal as a starting quote + many valid characters + an ending quote where a "valid character" is either a an escape sequence or non-quote.
So there is a one line change to parseString:
parseString = do char '"'
x <- many validChar
char '"'
return $ String x
and we add the definitions:
validChar = try escapeSequence <|> satisfy ( /= '"' )
escapeSequence = do { char '\\'; anyChar }
escapeSequence may be refined to allow a limited set of escape sequences.

Parser for Quoted string using Parsec

I want to parse input strings like this: "this is \"test \" message \"sample\" text"
Now, I wrote a parser for parsing individual text without any quotes:
parseString :: Parser String
parseString = do
char '"'
x <- (many $ noneOf "\"")
char '"'
return x
This parses simple strings like this: "test message"
Then I wrote a parser for quoted strings:
quotedString :: Parser String
quotedString = do
initial <- string "\\\""
x <- many $ noneOf "\\\""
end <- string "\\\""
return $ initial ++ x ++ end
This parsers for strings like this: \"test message\"
Is there a way that I can combine both the parsers so that I obtain my desired objective ? What exactly is the idomatic way to tackle this problem ?
This is what I would do:
escape :: Parser String
escape = do
d <- char '\\'
c <- oneOf "\\\"0nrvtbf" -- all the characters which can be escaped
return [d, c]
nonEscape :: Parser Char
nonEscape = noneOf "\\\"\0\n\r\v\t\b\f"
character :: Parser String
character = fmap return nonEscape <|> escape
parseString :: Parser String
parseString = do
char '"'
strings <- many character
char '"'
return $ concat strings
Now all you need to do is call it:
parse parseString "test" "\"this is \\\"test \\\" message \\\"sample\\\" text\""
Parser combinators are a bit difficult to understand at first, but once you get the hang of it they are easier than writing BNF grammars.
quotedString = do
char '"'
x <- many (noneOf "\"" <|> (char '\\' >> char '\"'))
char '"'
return x
I believe, this should work.
In case somebody is looking for a more out of the box solution, this answer in code-review provides just that. Here is a complete example with the right imports:
import Text.Parsec
import Text.Parsec.Language
import Text.Parsec.Token
lexer :: GenTokenParser String u Identity
lexer = makeTokenParser haskellDef
strParser :: Parser String
strParser = stringLiteral lexer
parseString :: String -> Either ParseError String
parseString = parse strParser ""
I prefer the following because it is easier to read:
quotedString :: Parser String
quotedString = do
a <- string "\""
b <- concat <$> many quotedChar
c <- string "\""
-- return (a ++ b ++ c) -- if you want to preserve the quotes
return b
where quotedChar = try (string "\\\\")
<|> try (string "\\\"")
<|> ((noneOf "\"\n") >>= \x -> return [x] )
Aadit's solution may be faster because it does not use try but it's probably harder to read.
Note that it is different from Aadit's solution. My solution ignores escaped things in the string and really only cares about \" and \\.
For example, let's assume you have a tab character in the string.
My solution successfully parses "\"\t\"" to Right "\t". Aadit's solutions says unexpected "\t" expecting "\\" or "\"".
Also note that Aadit's solution only accepts 'valid' escapes. For example, it rejects "\"\\a\"". \a is not a valid escape sequence (well according to man ascii, it represents the system bell and is valid). My solution just returns Right "\\a".
So we have two different use cases.
My solution: Parse quoted strings with possibly escaped quotes and escaped escapes
Aadit's solution: Parse quoted strings with valid escape sequences where valid escapes means "\\\"\0\n\r\v\t\b\f"
I wanted to parse quoted strings and remove any backslashes used for escaping during the parsing step. In my simple language, the only escapable characters were double quotes and backslashes. Here is my solution:
quotedString = do
string <- between (char '"') (char '"') (many quotedStringChar)
return string
where
quotedStringChar = escapedChar <|> normalChar
escapedChar = (char '\\') *> (oneOf ['\\', '"'])
normalChar = noneOf "\""
elaborating on #Priyatham response
pEscString::Char->Parser String
pEscString e= do
char e;
s<-many (
do{char '\\';c<-anyChar;return ['\\',c]}
<|>many1 (noneOf (e:"\\")))
char e
return$concat s

Parsec-Parser works alright, but could it be done better?

I try to do this:
Parse a Text in the form:
Some Text #{0,0,0} some Text #{0,0,0}#{0,0,0} more Text #{0,0,0}
into a list of some data structure:
[Inside "Some Text ",Outside (0,0,0),Inside " some Text ",Outside (0,0,0),Outside (0,0,0),Inside " more Text ",Outside (0,0,0)]
So these #{a,b,c}-bits should turn into different things as the rest of the text.
I have this code:
module ParsecTest where
import Text.ParserCombinators.Parsec
import Monad
type Reference = (Int, Int, Int)
data Transc = Inside String | Outside Reference
deriving (Show)
text :: Parser Transc
text = do
x <- manyTill anyChar ((lookAhead reference) <|> (eof >> return (Inside "")));
return (Inside x)
transc = reference <|> text
alot :: Parser [Transc]
alot = do
manyTill transc eof
reference :: Parser Transc
reference = try (do{ char '#';
char '{';
a <- number;
char ',';
b <- number;
char ',';
c <- number;
char '}';
return (Outside (a,b,c)) })
number :: Parser Int
number = do{ x <- many1 digit;
return (read x) }
This works as expected. You can test this in ghci by typing
parseTest alot "Some Text #{0,0,0} some Text #{0,0,0}#{0,0,0} more Text #{0,0,0}"
But I think it's not nice.
1) Is the use of lookAhead really necessary for my problem?
2) Is the return (Inside "") an ugly hack?
3) Is there generally a more concise/smarter way to archieve the same?
1) I think you do need lookAhead as you need the result of that parse. It would be nice to avoid running that parser twice by having a Parser (Transc,Maybe Transc) to indicate an Inside with an optional following Outside. If performance is an issue, then this is worth doing.
2) Yes.
3) Applicatives
number2 :: Parser Int
number2 = read <$> many1 digit
text2 :: Parser Transc
text2 = (Inside .) . (:)
<$> anyChar
<*> manyTill anyChar (try (lookAhead reference2) *> pure () <|> eof)
reference2 :: Parser Transc
reference2 = ((Outside .) .) . (,,)
<$> (string "#{" *> number2 <* char ',')
<*> number2
<*> (char ',' *> number2 <* char '}')
transc2 = reference2 <|> text2
alot2 = many transc2
You may want to rewrite the beginning of reference2 using a helper like aux x y z = Outside (x,y,z).
EDIT: Changed text to deal with inputs that don't end with an Outside.

How do you use parsec in a greedy fashion?

In my work I come across a lot of gnarly sql, and I had the bright idea of writing a program to parse the sql and print it out neatly. I made most of it pretty quickly, but I ran into a problem that I don't know how to solve.
So let's pretend the sql is "select foo from bar where 1". My thought was that there is always a keyword followed by data for it, so all I have to do is parse a keyword, and then capture all gibberish before the next keyword and store that for later cleanup, if it is worthwhile. Here's the code:
import Text.Parsec
import Text.Parsec.Combinator
import Text.Parsec.Char
import Data.Text (strip)
newtype Statement = Statement [Atom]
data Atom = Branch String [Atom] | Leaf String deriving Show
trim str = reverse $ trim' (reverse $ trim' str)
where
trim' (' ':xs) = trim' xs
trim' str = str
printStatement atoms = mapM_ printAtom atoms
printAtom atom = loop 0 atom
where
loop depth (Leaf str) = putStrLn $ (replicate depth ' ') ++ str
loop depth (Branch str atoms) = do
putStrLn $ (replicate depth ' ') ++ str
mapM_ (loop (depth + 2)) atoms
keywords :: [String]
keywords = [
"select",
"update",
"delete",
"from",
"where"]
keywordparser :: Parsec String u String
keywordparser = try ((choice $ map string keywords) <?> "keywordparser")
stuffparser :: Parsec String u String
stuffparser = manyTill anyChar (eof <|> (lookAhead keywordparser >> return ()))
statementparser = do
key <- keywordparser
stuff <- stuffparser
return $ Branch key [Leaf (trim stuff)]
<?> "statementparser"
tp = parse (many statementparser) ""
The key here is the stuffparser. That is the stuff in between the keywords that could be anything from column lists to where criteria. This function catches all characters leading up to a keyword. But it needs something else before it is finished. What if there is a subselect? "select id,(select product from products) from bar". Well in that case if it hits that keyword, it screws everything up, parses it wrong and screws up my indenting. Also where clauses can have parenthesis as well.
So I need to change that anyChar into another combinator that slurps up characters one at a time but also tries to look for parenthesis, and if it finds them, traverse and capture all that, but also if there are more parenthesis, do that until we have fully closed the parenthesis, then concatenate it all and return it. Here's what I've tried, but I can't quite get it to work.
stuffparser :: Parsec String u String
stuffparser = fmap concat $ manyTill somechars (eof <|> (lookAhead keywordparser >> return ()))
where
somechars = parens <|> fmap (\c -> [c]) anyChar
parens= between (char '(') (char ')') somechars
This will error like so:
> tp "select asdf(qwerty) from foo where 1"
Left (line 1, column 14):
unexpected "w"
expecting ")"
But I can't think of any way to rewrite this so that it works. I've tried to use manyTill on the parenthesis part, but I end up having trouble getting it to typecheck when I have both string producing parens and single chars as alternatives. Does anyone have any suggestions on how to go about this?
Yeah, between might not work for what you're looking for. Of course, for your use case, I'd follow hammar's suggestion and grab an off-the-shelf SQL parser. (personal opinion: or, try not to use SQL unless you really have to; the idea to use strings for database queries was imho a historical mistake).
Note: I add an operator called <++> which will concatenate the results of two parsers, whether they are strings or characters. (code at bottom.)
First, for the task of parsing parenthesis: the top level will parse some stuff between the relevant characters, which is exactly what the code says,
parseParen = char '(' <++> inner <++> char ')'
Then, the inner function should parse anything else: non-parens, possibly including another set of parenthesis, and non-paren junk that follows.
parseParen = char '(' <++> inner <++> char ')' where
inner = many (noneOf "()") <++> option "" (parseParen <++> inner)
I'll make the assumption that for the rest of the solution, what you want to do is analgous to splitting things up by top-level SQL keywords. (i.e. ignoring those in parenthesis). Namely, we'll have a parser that will behave like so,
Main> parseTest parseSqlToplevel "select asdf(select m( 2) fr(o)m w where n) from b where delete 4"
[(Select," asdf(select m( 2) fr(o)m w where n) "),(From," b "),(Where," "),(Delete," 4")]
Suppose we have a parseKw parser that will get the likes of select, etc. After we consume a keyword, we need to read until the next [top-level] keyword. The last trick to my solution is using the lookAhead combinator to determine whether the next word is a keyword, and put it back if so. If it's not, then we consume a parenthesis or other character, and then recurse on the rest.
-- consume spaces, then eat a word or parenthesis
parseOther = many space <++>
(("" <$ lookAhead (try parseKw)) <|> -- if there's a keyword, put it back!
option "" ((parseParen <|> many1 (noneOf "() \t")) <++> parseOther))
My entire solution is as follows
-- overloaded operator to concatenate string results from parsers
class CharOrStr a where toStr :: a -> String
instance CharOrStr Char where toStr x = [x]
instance CharOrStr String where toStr = id
infixl 4 <++>
f <++> g = (\x y -> toStr x ++ toStr y) <$> f <*> g
data Keyword = Select | Update | Delete | From | Where deriving (Eq, Show)
parseKw =
(Select <$ string "select") <|>
(Update <$ string "update") <|>
(Delete <$ string "delete") <|>
(From <$ string "from") <|>
(Where <$ string "where") <?>
"keyword (select, update, delete, from, where)"
-- consume spaces, then eat a word or parenthesis
parseOther = many space <++>
(("" <$ lookAhead (try parseKw)) <|> -- if there's a keyword, put it back!
option "" ((parseParen <|> many1 (noneOf "() \t")) <++> parseOther))
parseSqlToplevel = many ((,) <$> parseKw <*> (space <++> parseOther)) <* eof
parseParen = char '(' <++> inner <++> char ')' where
inner = many (noneOf "()") <++> option "" (parseParen <++> inner)
edit - version with quote support
you can do the same thing as with the parens to support quotes,
import Control.Applicative hiding (many, (<|>))
import Text.Parsec
import Text.Parsec.Combinator
-- overloaded operator to concatenate string results from parsers
class CharOrStr a where toStr :: a -> String
instance CharOrStr Char where toStr x = [x]
instance CharOrStr String where toStr = id
infixl 4 <++>
f <++> g = (\x y -> toStr x ++ toStr y) <$> f <*> g
data Keyword = Select | Update | Delete | From | Where deriving (Eq, Show)
parseKw =
(Select <$ string "select") <|>
(Update <$ string "update") <|>
(Delete <$ string "delete") <|>
(From <$ string "from") <|>
(Where <$ string "where") <?>
"keyword (select, update, delete, from, where)"
-- consume spaces, then eat a word or parenthesis
parseOther = many space <++>
(("" <$ lookAhead (try parseKw)) <|> -- if there's a keyword, put it back!
option "" ((parseParen <|> parseQuote <|> many1 (noneOf "'() \t")) <++> parseOther))
parseSqlToplevel = many ((,) <$> parseKw <*> (space <++> parseOther)) <* eof
parseQuote = char '\'' <++> inner <++> char '\'' where
inner = many (noneOf "'\\") <++>
option "" (char '\\' <++> anyChar <++> inner)
parseParen = char '(' <++> inner <++> char ')' where
inner = many (noneOf "'()") <++>
(parseQuote <++> inner <|> option "" (parseParen <++> inner))
I tried it with parseTest parseSqlToplevel "select ('a(sdf'())b". cheers

Parsing text with optional data at the end

Please note, subsequently to posting this question I managed to derive a solution myself. See the end of this question for my final answer.
I'm working on a little parser at the moment for org-mode documents, and in these documents headings can have a title, and may optionally consist of a list of tags at the of the heading:
* Heading :foo:bar:baz:
I'm having difficulty writing a parser for this, however. The following is what I'm working with for now:
import Control.Applicative
import Text.ParserCombinators.Parsec
data Node = Node String [String]
deriving (Show)
myTest = parse node "" "Some text here :tags:here:"
node = Node <$> (many1 anyChar) <*> tags
tags = (char ':') >> (sepEndBy1 (many1 alphaNum) (char ':'))
<?> "Tag list"
While my simple tags parser works, it doesn't work in the context of node because all of the characters are used up parsing the title of the heading (many1 anyChar). Furthermore, I can't change this parser to use noneOf ":" because : is valid in the title. In fact, it's only special if it's in a taglist, at the very end of the line.
Any ideas how I can parse this optional data?
As an aside, this is my first real Haskell project, so if Parsec is not even the right tool for the job - feel free to point that out and suggest other options!
Ok, I got a complete solution now, but it needs refactoring. The following works:
import Control.Applicative hiding (many, optional, (<|>))
import Control.Monad
import Data.Char (isSpace)
import Text.ParserCombinators.Parsec
data Node = Node { level :: Int, keyword :: Maybe String, heading :: String, tags :: Maybe [String] }
deriving (Show)
parseNode = Node <$> level <*> (optionMaybe keyword) <*> name <*> (optionMaybe tags)
where level = length <$> many1 (char '*') <* space
keyword = (try (many1 upper <* space))
name = noneOf "\n" `manyTill` (eof <|> (lookAhead (try (tags *> eof))))
tags = char ':' *> many1 alphaNum `sepEndBy1` char ':'
myTest = parse parseNode "org-mode" "** Some : text here :tags: JUST KIDDING :tags:here:"
myTest2 = parse parseNode "org-mode" "* TODO Just a node"
import Control.Applicative hiding (many, optional, (<|>))
import Control.Monad
import Text.ParserCombinators.Parsec
instance Applicative (GenParser s a) where
pure = return
(<*>) = ap
data Node = Node { name :: String, tags :: Maybe [String] }
deriving (Show)
parseNode = Node <$> name <*> tags
where tags = optionMaybe $ optional (string " :") *> many (noneOf ":\n") `sepEndBy` (char ':')
name = noneOf "\n" `manyTill` try (string " :" <|> string "\n")
myTest = parse parseNode "" "Some:text here :tags:here:"
myTest2 = parse parseNode "" "Sometext here :tags:here:"
Results:
*Main> myTest
Right (Node {name = "Some:text here", tags = Just ["tags","here",""]})
*Main> myTest2
Right (Node {name = "Sometext here", tags = Just ["tags","here",""]})

Resources