Fast parsing of string that allows escaped characters? - parsing

I'm trying to parse a string that can contain escaped characters, here's an example:
import qualified Data.Text as T
exampleParser :: Parser T.Text
exampleParser = T.pack <$> many (char '\\' *> escaped <|> anyChar)
where escaped = satisfy (\c -> c `elem` ['\\', '"', '[', ']'])
The parser above creates a String and then packs it into Text. Is there any way to parse a string with escapes like the above using the functions for efficient string handling that attoparsec provides? Like string, scan, runScanner, takeWhile, ...
Parsing something like "one \"two\" \[three\]" would produce one "two" [three].
Update:
Thanks to #epsilonhalbe I was able to come out with a generalized solution perfect for my needs; note that the following function doesn't look for matching escaped characters like [..], "..", (..), etc; and also, if it finds an escaped character that is not valid it treats \ as a literal character.
takeEscapedWhile :: (Char -> Bool) -> (Char -> Bool) -> Parser Text
takeEscapedWhile isEscapable while = do
x <- normal
xs <- many escaped
return $ T.concat (x:xs)
where normal = Atto.takeWhile (\c -> c /= '\\' && while c)
escaped = do
x <- (char '\\' *> satisfy isEscapable) <|> char '\\'
xs <- normal
return $ T.cons x xs

It is possible writing some escaping code, attoparsec and text - altogether it is pretty straightforward - seeing you have already worked with parsers
import Data.Attoparsec.Text as AT
import qualified Data.Text as T
import Data.Text (Text)
escaped, quoted, brackted :: Parser Text
normal = AT.takeWhile (/= '\\')
escaped = do r <- normal
rs <- many escaped'
return $ T.concat $ r:rs
where escaped' = do r1 <- normal
r2 <- quoted <|> brackted
return $ r1 <> r2
quoted = do string "\\\""
res <- normal
string "\\\""
return $ "\""<>res <>"\""
brackted = do string "\\["
res <- normal
string "\\]"
return $ "["<>res<>"]"
then you can use it to parse the following test cases
Prelude >: MyModule
Prelude MyModule> import Data.Attoparsec.Text as AT
Prelude MyModule AT> import Data.Text.IO as TIO
Prelude MyModule AT TIO>:set -XOverloadedStrings
Prelude MyModule AT TIO> TIO.putStrLn $ parseOnly escaped "test"
test
Prelude MyModule AT TIO> TIO.putStrLn $ parseOnly escaped "\\\"test\\\""
"test"
Prelude MyModule AT TIO> TIO.putStrLn $ parseOnly escaped "\\[test\\]"
[test]
Prelude MyModule AT TIO> TIO.putStrLn $ parseOnly escaped "test \\\"test\\\" \\[test\\]"
test "test" [test]
note you have to escape the escapes - that's why you see \\\" instead of \"
Also if you just parse it will print the Text values escaped, like
Right "test \"text\" [test]"
for the last example.
If you parse a file you write simpley escaped text in the file.
test.txt
I \[like\] \"Haskell\"
then you can
Prelude MyModule AT TIO> file <- TIO.readFile "test.txt"
Prelude MyModule AT TIO> TIO.putStrLn $ parseOnly escaped file
I [like] "Haskell"

Related

Parsec try : should try to go to next option

Currently, I have the following code:
import Control.Applicative ((<|>))
import Text.Parsec (ParseError, endBy, sepBy, try)
import Text.Parsec.String (Parser)
import qualified Data.Char as Char
import qualified Text.Parsec as Parsec
data Operation = Lt | Gt deriving (Show)
data Value =
Raw String
| Op Operation
deriving (Show)
sampleStr :: String
sampleStr = unlines
[ "#BEGIN#"
, "x <- 3.14 + 2.72;"
, "x < 10;"
]
gtParser :: Parser Value
gtParser = do
Parsec.string "<"
return $ Op Gt
ltParser :: Parser Value
ltParser = do
Parsec.string ">"
return $ Op Lt
opParser :: Parser Value
opParser = gtParser <|> ltParser
rawParser :: Parser Value
rawParser = do
str <- Parsec.many1 $ Parsec.satisfy $ not . Char.isSpace
return $ Raw str
valueParser :: Parser Value
valueParser = try opParser <|> rawParser
eolParser :: Parser Char
eolParser = try (Parsec.char ';' >> Parsec.endOfLine)
<|> Parsec.endOfLine
lineParser :: Parser [Value]
lineParser = sepBy valueParser $ Parsec.many1 $ Parsec.char ' '
fileParser :: Parser [[Value]]
fileParser = endBy lineParser eolParser
parse :: String -> Either ParseError [[Value]]
parse = Parsec.parse fileParser "fail..."
main :: IO ()
main = print $ parse sampleStr
This will fail with the message
Left "fail..." (line 2, column 4):
unexpected "-"
expecting " ", ";" or new-line
To my understanding, since I have try opParser, after Parsec sees that the token <- cannot be parsed by opParser, it should go to rawParser. (It is essentially a lookahead).
What is my misunderstanding, and how do I fix this error?
You can replicate the problem with the smaller test case:
> Parsec.parse fileParser "foo" "x <- 3.14"
The problem is that fileParser first calls lineParser, which successfully parses "x <" into [Raw "x", Op Gt] and leaves "- 3.14" yet to be parsed. Unfortunately, fileParser now expects to parse something with eolParser, but eolParser can't parse "- 3.14" because it starts with neither a semicolon nor an endOfLine.
Your try opParser has no effect here because opParser successfully parses <, so there's nothing to backtrack from.
There are many ways you might fix the problem. If <- is the only case where a < might be misparsed, you could exclude this case with notFollowedBy:
gtParser :: Parser Value
gtParser = do
Parsec.string "<"
notFollowedBy $ Parsec.string "-"
return $ Op Gt

Parsec: Parsing expression between slashes

I'm trying to parse simple expressions between slashes. Example: / 1+2*3 / should evaluate to 7.
I was trying this
module Test where
import Text.Parsec
import Text.Parsec.Language (emptyDef)
import Text.Parsec.Combinator (between)
import Text.Parsec.String (Parser)
import qualified Text.Parsec.Expr as Ex
import qualified Text.Parsec.Token as Tok
lexer :: Tok.TokenParser ()
lexer = Tok.makeTokenParser style
where
ops = ["+","*","-","/",";"]
names = ["def","extern"]
style = emptyDef {
Tok.commentLine = "#"
, Tok.reservedOpNames = ops
, Tok.reservedNames = names
}
integer :: Parser Int
integer = fromIntegral <$> Tok.integer lexer
parens :: Parser a -> Parser a
parens = Tok.parens lexer
braces :: Parser a -> Parser a
braces = Tok.braces lexer
slashes :: Parser a -> Parser a
slashes = between (reserved "/") (reserved "/")
reserved :: String -> Parser ()
reserved = Tok.reserved lexer
reservedOp :: String -> Parser ()
reservedOp = Tok.reservedOp lexer
binary s f assoc = Ex.Infix (reservedOp s >> return f) assoc
table = [[binary "*" (*) Ex.AssocLeft,
binary "/" div Ex.AssocLeft]
,[binary "+" (+) Ex.AssocLeft,
binary "-" (-) Ex.AssocLeft]]
factor :: Parser Int
factor = try integer
<|> parens expr
expr :: Parser Int
expr = Ex.buildExpressionParser table factor
programInSlashes :: Parser Int
programInSlashes = slashes expr
programInBraces :: Parser Int
programInBraces = braces expr
which works okay for programInBraces:
*Test> parse programInBraces "" "{ 1+2*3/4 }"
Right 2
however, programInSlashes does fail:
*Test> parse programInSlashes "" "/ 1+2*3/4 /"
Left (line 1, column 12):
unexpected end of input
expecting end of "/", integer or "("
Clearly the problem is that / is both an operator and the delimiter for the program itself. But as the language isn't ambiguous we should be able to parse that, no?
I think you can use Text.Parsec.Expr to parse the interior expression; then you can embed backtracking for the / case, for example:
Infix (try $ do { reserved "/"; notFollowedBy eof; return div }) AssocLeft
You can also parse the exterior language and the interior expression in separate passes. I’ve done this in a compiler for a language with custom operators: first parse the program without touching infix expressions, then run another pass to parse infix expressions according to the operators in scope.

Expression parser for unary operator

I was trying to make an expression parser for two operators out of which only ^ is postfix unary, so with an operand it would look like R^.
The problem is that whenever the operator ^ is encountered somewhere other than the end, it just stops there and returns whatever parsed successfully. It means "R;S;T^" parses successfully, but "R^;S;T^" stops at R. However, "(R^);S;T^" just works fine.
I took help from an online post which he used for unary minus but that is a prefix operator (for example -X). His original solution was giving errors at reservedOp2, so I modified it to reservedOp2 name = try (string name >> notFollowedBy (oneOf opChar)) and it produces the output mentioned above. I need it to work with or without parenthesis.
import Control.Applicative
import Text.ParserCombinators.Parsec hiding (many,optinal,(<|>))
import Text.ParserCombinators.Parsec.Expr
import Text.Parsec.Language (haskell)
import qualified Text.Parsec.Token as P
import Text.Parsec.String (Parser)
data RT = Comp RT RT
| Conv RT
| Var String
deriving (Show)
whiteSpace = P.whiteSpace haskell
word = P.identifier haskell
parens = P.parens haskell
opChar = "^;"
reservedOp2 :: String -> CharParser st ()
reservedOp2 name = try (string name >> notFollowedBy (oneOf opChar) >> whiteSpace)
term = parens relexpr
<|> Var <$> word
<?> "term"
table = [ [postfix "^" Conv]
, [binary ";" Comp AssocLeft]
]
prefix name fun = Prefix $ reservedOp2 name >> return fun
binary name fun = Infix $ reservedOp2 name >> return fun
postfix name fun = Postfix $ reservedOp2 name >> return fun
relexpr :: Parser RT
relexpr = buildExpressionParser table term <?> "expression"
It fails to parse e.g. "R^;S" because postfix "^" Conv fails on notFollowedBy (oneOf opChar) (since '^' is followed by ';'). The fix is to remove ';' from opChar:
opChar = "^"
Or, even easier if you can just use reservedOp from haskell:
reservedOp2 = P.reservedOp haskell
Either of these changes fixes the parsing of your examples.

Parser for Quoted string using Parsec

I want to parse input strings like this: "this is \"test \" message \"sample\" text"
Now, I wrote a parser for parsing individual text without any quotes:
parseString :: Parser String
parseString = do
char '"'
x <- (many $ noneOf "\"")
char '"'
return x
This parses simple strings like this: "test message"
Then I wrote a parser for quoted strings:
quotedString :: Parser String
quotedString = do
initial <- string "\\\""
x <- many $ noneOf "\\\""
end <- string "\\\""
return $ initial ++ x ++ end
This parsers for strings like this: \"test message\"
Is there a way that I can combine both the parsers so that I obtain my desired objective ? What exactly is the idomatic way to tackle this problem ?
This is what I would do:
escape :: Parser String
escape = do
d <- char '\\'
c <- oneOf "\\\"0nrvtbf" -- all the characters which can be escaped
return [d, c]
nonEscape :: Parser Char
nonEscape = noneOf "\\\"\0\n\r\v\t\b\f"
character :: Parser String
character = fmap return nonEscape <|> escape
parseString :: Parser String
parseString = do
char '"'
strings <- many character
char '"'
return $ concat strings
Now all you need to do is call it:
parse parseString "test" "\"this is \\\"test \\\" message \\\"sample\\\" text\""
Parser combinators are a bit difficult to understand at first, but once you get the hang of it they are easier than writing BNF grammars.
quotedString = do
char '"'
x <- many (noneOf "\"" <|> (char '\\' >> char '\"'))
char '"'
return x
I believe, this should work.
In case somebody is looking for a more out of the box solution, this answer in code-review provides just that. Here is a complete example with the right imports:
import Text.Parsec
import Text.Parsec.Language
import Text.Parsec.Token
lexer :: GenTokenParser String u Identity
lexer = makeTokenParser haskellDef
strParser :: Parser String
strParser = stringLiteral lexer
parseString :: String -> Either ParseError String
parseString = parse strParser ""
I prefer the following because it is easier to read:
quotedString :: Parser String
quotedString = do
a <- string "\""
b <- concat <$> many quotedChar
c <- string "\""
-- return (a ++ b ++ c) -- if you want to preserve the quotes
return b
where quotedChar = try (string "\\\\")
<|> try (string "\\\"")
<|> ((noneOf "\"\n") >>= \x -> return [x] )
Aadit's solution may be faster because it does not use try but it's probably harder to read.
Note that it is different from Aadit's solution. My solution ignores escaped things in the string and really only cares about \" and \\.
For example, let's assume you have a tab character in the string.
My solution successfully parses "\"\t\"" to Right "\t". Aadit's solutions says unexpected "\t" expecting "\\" or "\"".
Also note that Aadit's solution only accepts 'valid' escapes. For example, it rejects "\"\\a\"". \a is not a valid escape sequence (well according to man ascii, it represents the system bell and is valid). My solution just returns Right "\\a".
So we have two different use cases.
My solution: Parse quoted strings with possibly escaped quotes and escaped escapes
Aadit's solution: Parse quoted strings with valid escape sequences where valid escapes means "\\\"\0\n\r\v\t\b\f"
I wanted to parse quoted strings and remove any backslashes used for escaping during the parsing step. In my simple language, the only escapable characters were double quotes and backslashes. Here is my solution:
quotedString = do
string <- between (char '"') (char '"') (many quotedStringChar)
return string
where
quotedStringChar = escapedChar <|> normalChar
escapedChar = (char '\\') *> (oneOf ['\\', '"'])
normalChar = noneOf "\""
elaborating on #Priyatham response
pEscString::Char->Parser String
pEscString e= do
char e;
s<-many (
do{char '\\';c<-anyChar;return ['\\',c]}
<|>many1 (noneOf (e:"\\")))
char e
return$concat s

Parsing text with optional data at the end

Please note, subsequently to posting this question I managed to derive a solution myself. See the end of this question for my final answer.
I'm working on a little parser at the moment for org-mode documents, and in these documents headings can have a title, and may optionally consist of a list of tags at the of the heading:
* Heading :foo:bar:baz:
I'm having difficulty writing a parser for this, however. The following is what I'm working with for now:
import Control.Applicative
import Text.ParserCombinators.Parsec
data Node = Node String [String]
deriving (Show)
myTest = parse node "" "Some text here :tags:here:"
node = Node <$> (many1 anyChar) <*> tags
tags = (char ':') >> (sepEndBy1 (many1 alphaNum) (char ':'))
<?> "Tag list"
While my simple tags parser works, it doesn't work in the context of node because all of the characters are used up parsing the title of the heading (many1 anyChar). Furthermore, I can't change this parser to use noneOf ":" because : is valid in the title. In fact, it's only special if it's in a taglist, at the very end of the line.
Any ideas how I can parse this optional data?
As an aside, this is my first real Haskell project, so if Parsec is not even the right tool for the job - feel free to point that out and suggest other options!
Ok, I got a complete solution now, but it needs refactoring. The following works:
import Control.Applicative hiding (many, optional, (<|>))
import Control.Monad
import Data.Char (isSpace)
import Text.ParserCombinators.Parsec
data Node = Node { level :: Int, keyword :: Maybe String, heading :: String, tags :: Maybe [String] }
deriving (Show)
parseNode = Node <$> level <*> (optionMaybe keyword) <*> name <*> (optionMaybe tags)
where level = length <$> many1 (char '*') <* space
keyword = (try (many1 upper <* space))
name = noneOf "\n" `manyTill` (eof <|> (lookAhead (try (tags *> eof))))
tags = char ':' *> many1 alphaNum `sepEndBy1` char ':'
myTest = parse parseNode "org-mode" "** Some : text here :tags: JUST KIDDING :tags:here:"
myTest2 = parse parseNode "org-mode" "* TODO Just a node"
import Control.Applicative hiding (many, optional, (<|>))
import Control.Monad
import Text.ParserCombinators.Parsec
instance Applicative (GenParser s a) where
pure = return
(<*>) = ap
data Node = Node { name :: String, tags :: Maybe [String] }
deriving (Show)
parseNode = Node <$> name <*> tags
where tags = optionMaybe $ optional (string " :") *> many (noneOf ":\n") `sepEndBy` (char ':')
name = noneOf "\n" `manyTill` try (string " :" <|> string "\n")
myTest = parse parseNode "" "Some:text here :tags:here:"
myTest2 = parse parseNode "" "Sometext here :tags:here:"
Results:
*Main> myTest
Right (Node {name = "Some:text here", tags = Just ["tags","here",""]})
*Main> myTest2
Right (Node {name = "Sometext here", tags = Just ["tags","here",""]})

Resources