Parsing sentences in Haskell more effectively using attoparsec - parsing

I'm trying to parse an ebook in .txt form, to learn more about attoparsec and Haskell (I'm a newbie). In this case, I'm trying to count the number of sentences in the given text file. Here's my code:
{-# LANGUAGE OverloadedStrings #-}
import Data.Attoparsec.Text
import qualified Data.Text as T
import qualified Data.Text.IO as Txt
import Data.List
import Control.Applicative ((<*>), (*>), (<$>), (<|>), pure)
data Prose = Prose {
word :: [Char]
} deriving Show
optional :: Parser a -> Parser ()
optional p = option () (try p *> pure ())
specialChars = ['-', '_', '…', '“', '”', '\"', '\'', '’', '#', '#', '$',
'%', '^', '&', '*', '(', ')', '+', '=', '~', '`', '{', '}',
'[', ']', '/', ':', ';', ',']
inputSentence :: Parser Prose
inputSentence = Prose <$> many1' (letter <|> digit <|> space <|> satisfy (inClass specialChars))
sentenceSeparator :: Parser ()
sentenceSeparator = many1 (space <|> satisfy (inClass ".?!")) >> pure ()
sentenceParser :: String -> [Prose]
sentenceParser str = case parseOnly wp (T.pack str) of
Left err -> error err
Right x -> x
where
wp = optional sentenceSeparator *> inputSentence `sepBy1` sentenceSeparator
main :: IO()
main = do
input <- readFile "test.txt"
let sentences = sentenceParser input
print sentences
print $ length sentences
Click this link to the github repo if you want to take a complete look at what I'm doing.
My problem is that when I try to parse text file with input:
I get an`output as follows:
So my question is, how can I:
Make the parser realize that anything with "\n\n.." is a different sentence.
Input like Daniel G. Brinton is just 1 sentence.
I've tried using isHorizontalSpace, but to no avail.

Related

Parsec try : should try to go to next option

Currently, I have the following code:
import Control.Applicative ((<|>))
import Text.Parsec (ParseError, endBy, sepBy, try)
import Text.Parsec.String (Parser)
import qualified Data.Char as Char
import qualified Text.Parsec as Parsec
data Operation = Lt | Gt deriving (Show)
data Value =
Raw String
| Op Operation
deriving (Show)
sampleStr :: String
sampleStr = unlines
[ "#BEGIN#"
, "x <- 3.14 + 2.72;"
, "x < 10;"
]
gtParser :: Parser Value
gtParser = do
Parsec.string "<"
return $ Op Gt
ltParser :: Parser Value
ltParser = do
Parsec.string ">"
return $ Op Lt
opParser :: Parser Value
opParser = gtParser <|> ltParser
rawParser :: Parser Value
rawParser = do
str <- Parsec.many1 $ Parsec.satisfy $ not . Char.isSpace
return $ Raw str
valueParser :: Parser Value
valueParser = try opParser <|> rawParser
eolParser :: Parser Char
eolParser = try (Parsec.char ';' >> Parsec.endOfLine)
<|> Parsec.endOfLine
lineParser :: Parser [Value]
lineParser = sepBy valueParser $ Parsec.many1 $ Parsec.char ' '
fileParser :: Parser [[Value]]
fileParser = endBy lineParser eolParser
parse :: String -> Either ParseError [[Value]]
parse = Parsec.parse fileParser "fail..."
main :: IO ()
main = print $ parse sampleStr
This will fail with the message
Left "fail..." (line 2, column 4):
unexpected "-"
expecting " ", ";" or new-line
To my understanding, since I have try opParser, after Parsec sees that the token <- cannot be parsed by opParser, it should go to rawParser. (It is essentially a lookahead).
What is my misunderstanding, and how do I fix this error?
You can replicate the problem with the smaller test case:
> Parsec.parse fileParser "foo" "x <- 3.14"
The problem is that fileParser first calls lineParser, which successfully parses "x <" into [Raw "x", Op Gt] and leaves "- 3.14" yet to be parsed. Unfortunately, fileParser now expects to parse something with eolParser, but eolParser can't parse "- 3.14" because it starts with neither a semicolon nor an endOfLine.
Your try opParser has no effect here because opParser successfully parses <, so there's nothing to backtrack from.
There are many ways you might fix the problem. If <- is the only case where a < might be misparsed, you could exclude this case with notFollowedBy:
gtParser :: Parser Value
gtParser = do
Parsec.string "<"
notFollowedBy $ Parsec.string "-"
return $ Op Gt

Parsec: Parsing expression between slashes

I'm trying to parse simple expressions between slashes. Example: / 1+2*3 / should evaluate to 7.
I was trying this
module Test where
import Text.Parsec
import Text.Parsec.Language (emptyDef)
import Text.Parsec.Combinator (between)
import Text.Parsec.String (Parser)
import qualified Text.Parsec.Expr as Ex
import qualified Text.Parsec.Token as Tok
lexer :: Tok.TokenParser ()
lexer = Tok.makeTokenParser style
where
ops = ["+","*","-","/",";"]
names = ["def","extern"]
style = emptyDef {
Tok.commentLine = "#"
, Tok.reservedOpNames = ops
, Tok.reservedNames = names
}
integer :: Parser Int
integer = fromIntegral <$> Tok.integer lexer
parens :: Parser a -> Parser a
parens = Tok.parens lexer
braces :: Parser a -> Parser a
braces = Tok.braces lexer
slashes :: Parser a -> Parser a
slashes = between (reserved "/") (reserved "/")
reserved :: String -> Parser ()
reserved = Tok.reserved lexer
reservedOp :: String -> Parser ()
reservedOp = Tok.reservedOp lexer
binary s f assoc = Ex.Infix (reservedOp s >> return f) assoc
table = [[binary "*" (*) Ex.AssocLeft,
binary "/" div Ex.AssocLeft]
,[binary "+" (+) Ex.AssocLeft,
binary "-" (-) Ex.AssocLeft]]
factor :: Parser Int
factor = try integer
<|> parens expr
expr :: Parser Int
expr = Ex.buildExpressionParser table factor
programInSlashes :: Parser Int
programInSlashes = slashes expr
programInBraces :: Parser Int
programInBraces = braces expr
which works okay for programInBraces:
*Test> parse programInBraces "" "{ 1+2*3/4 }"
Right 2
however, programInSlashes does fail:
*Test> parse programInSlashes "" "/ 1+2*3/4 /"
Left (line 1, column 12):
unexpected end of input
expecting end of "/", integer or "("
Clearly the problem is that / is both an operator and the delimiter for the program itself. But as the language isn't ambiguous we should be able to parse that, no?
I think you can use Text.Parsec.Expr to parse the interior expression; then you can embed backtracking for the / case, for example:
Infix (try $ do { reserved "/"; notFollowedBy eof; return div }) AssocLeft
You can also parse the exterior language and the interior expression in separate passes. I’ve done this in a compiler for a language with custom operators: first parse the program without touching infix expressions, then run another pass to parse infix expressions according to the operators in scope.

Fast parsing of string that allows escaped characters?

I'm trying to parse a string that can contain escaped characters, here's an example:
import qualified Data.Text as T
exampleParser :: Parser T.Text
exampleParser = T.pack <$> many (char '\\' *> escaped <|> anyChar)
where escaped = satisfy (\c -> c `elem` ['\\', '"', '[', ']'])
The parser above creates a String and then packs it into Text. Is there any way to parse a string with escapes like the above using the functions for efficient string handling that attoparsec provides? Like string, scan, runScanner, takeWhile, ...
Parsing something like "one \"two\" \[three\]" would produce one "two" [three].
Update:
Thanks to #epsilonhalbe I was able to come out with a generalized solution perfect for my needs; note that the following function doesn't look for matching escaped characters like [..], "..", (..), etc; and also, if it finds an escaped character that is not valid it treats \ as a literal character.
takeEscapedWhile :: (Char -> Bool) -> (Char -> Bool) -> Parser Text
takeEscapedWhile isEscapable while = do
x <- normal
xs <- many escaped
return $ T.concat (x:xs)
where normal = Atto.takeWhile (\c -> c /= '\\' && while c)
escaped = do
x <- (char '\\' *> satisfy isEscapable) <|> char '\\'
xs <- normal
return $ T.cons x xs
It is possible writing some escaping code, attoparsec and text - altogether it is pretty straightforward - seeing you have already worked with parsers
import Data.Attoparsec.Text as AT
import qualified Data.Text as T
import Data.Text (Text)
escaped, quoted, brackted :: Parser Text
normal = AT.takeWhile (/= '\\')
escaped = do r <- normal
rs <- many escaped'
return $ T.concat $ r:rs
where escaped' = do r1 <- normal
r2 <- quoted <|> brackted
return $ r1 <> r2
quoted = do string "\\\""
res <- normal
string "\\\""
return $ "\""<>res <>"\""
brackted = do string "\\["
res <- normal
string "\\]"
return $ "["<>res<>"]"
then you can use it to parse the following test cases
Prelude >: MyModule
Prelude MyModule> import Data.Attoparsec.Text as AT
Prelude MyModule AT> import Data.Text.IO as TIO
Prelude MyModule AT TIO>:set -XOverloadedStrings
Prelude MyModule AT TIO> TIO.putStrLn $ parseOnly escaped "test"
test
Prelude MyModule AT TIO> TIO.putStrLn $ parseOnly escaped "\\\"test\\\""
"test"
Prelude MyModule AT TIO> TIO.putStrLn $ parseOnly escaped "\\[test\\]"
[test]
Prelude MyModule AT TIO> TIO.putStrLn $ parseOnly escaped "test \\\"test\\\" \\[test\\]"
test "test" [test]
note you have to escape the escapes - that's why you see \\\" instead of \"
Also if you just parse it will print the Text values escaped, like
Right "test \"text\" [test]"
for the last example.
If you parse a file you write simpley escaped text in the file.
test.txt
I \[like\] \"Haskell\"
then you can
Prelude MyModule AT TIO> file <- TIO.readFile "test.txt"
Prelude MyModule AT TIO> TIO.putStrLn $ parseOnly escaped file
I [like] "Haskell"

Is there any trick about translating BNF to Parsec program?

The BNF that match function call chain (like x(y)(z)...):
expr = term T
T = (expr) T
| EMPTY
term = (expr)
| VAR
Translate it to Parsec program that looks so tricky.
term :: Parser Term
term = parens expr <|> var
expr :: Parser Term
expr = do whiteSpace
e <- term
maybeAddSuffix e
where addSuffix e0 = do e1 <- parens expr
maybeAddSuffix $ TermApp e0 e1
maybeAddSuffix e = addSuffix e
<|> return e
Could you list all the design patterns about translating BNF to Parsec program?
The simplest think you could do if your grammar is sizeable is to just use the Alex/Happy combo. It is fairly straightforward to use, accepts the BNF format directly - no human translation needed - and perhaps most importantly, produces blazingly fast parsers/lexers.
If you are dead set on doing it with parsec (or you are doing this as a learning exercise), I find it easier in general to do it in two stages; first lexing, then parsing. Parsec will do both!
First write the appropriate types:
{-# LANGUAGE LambdaCase #-}
import Text.Parsec
import Text.Parsec.Combinator
import Text.Parsec.Prim
import Text.Parsec.Pos
import Text.ParserCombinators.Parsec.Char
import Control.Applicative hiding ((<|>))
import Control.Monad
data Term = App Term Term | Var String deriving (Show, Eq)
data Token = LParen | RParen | Str String deriving (Show, Eq)
type Lexer = Parsec [Char] () -- A lexer accepts a stream of Char
type Parser = Parsec [Token] () -- A parser accepts a stream of Token
Parsing a single token is simple. For simplicity, a variable is 1 or more letters. You can of course change this however you like.
oneToken :: Lexer Token
oneToken = (char '(' >> return LParen) <|>
(char ')' >> return RParen) <|>
(Str <$> many1 letter)
Parsing the entire token stream is just parsing a single token many times, possible separated by whitespace:
lexer :: Lexer [Token]
lexer = spaces >> many1 (oneToken <* spaces)
Note the placement of spaces: this way, white space is accepted at the beginning and end of the string.
Since Parser uses a custom token type, you have to use a custom satisfy function. Fortunately, this is almost identical to the existing satisfy.
satisfy' :: (Token -> Bool) -> Parser Token
satisfy' f = tokenPrim show
(\src _ _ -> incSourceColumn src 1)
(\x -> if f x then Just x else Nothing)
Then we can write parsers for each of the primitive tokens.
lparen = satisfy' $ \case { LParen -> True ; _ -> False }
rparen = satisfy' $ \case { RParen -> True ; _ -> False }
strTok = (\(Str s) -> s) <$> (satisfy' $ \case { Str {} -> True ; _ -> False })
As you may imagine, parens would be useful for our purposes. It is very straightforward to write.
parens :: Parser a -> Parser a
parens = between lparen rparen
Now the interesting parts.
term, expr, var :: Parser Term
term = parens expr <|> var
var = Var <$> strTok
These two should be fairly obvious to you.
Parec contains combinators option and optionMaybe which are useful when you you need to "maybe do something".
expr = do
e0 <- term
option e0 (parens expr >>= \e1 -> return (App e0 e1))
The last line means - try to apply the parser given to option - if it fails, instead return e0.
For testing you can do:
tokAndParse = runParser (lexer <* eof) () "" >=> runParser (expr <* eof) () ""
The eof attached to each parser is to make sure that the entire input is consumed; the string cannot be a member of the grammar if there are extra trailing characters. Note - your example x(y)(z) is not actually in your grammar!
>tokAndParse "x(y)(z)"
Left (line 1, column 5):
unexpected LParen
expecting end of input
But the following is
>tokAndParse "(x(y))(z)"
Right (App (App (Var "x") (Var "y")) (Var "z"))

Parsing text with optional data at the end

Please note, subsequently to posting this question I managed to derive a solution myself. See the end of this question for my final answer.
I'm working on a little parser at the moment for org-mode documents, and in these documents headings can have a title, and may optionally consist of a list of tags at the of the heading:
* Heading :foo:bar:baz:
I'm having difficulty writing a parser for this, however. The following is what I'm working with for now:
import Control.Applicative
import Text.ParserCombinators.Parsec
data Node = Node String [String]
deriving (Show)
myTest = parse node "" "Some text here :tags:here:"
node = Node <$> (many1 anyChar) <*> tags
tags = (char ':') >> (sepEndBy1 (many1 alphaNum) (char ':'))
<?> "Tag list"
While my simple tags parser works, it doesn't work in the context of node because all of the characters are used up parsing the title of the heading (many1 anyChar). Furthermore, I can't change this parser to use noneOf ":" because : is valid in the title. In fact, it's only special if it's in a taglist, at the very end of the line.
Any ideas how I can parse this optional data?
As an aside, this is my first real Haskell project, so if Parsec is not even the right tool for the job - feel free to point that out and suggest other options!
Ok, I got a complete solution now, but it needs refactoring. The following works:
import Control.Applicative hiding (many, optional, (<|>))
import Control.Monad
import Data.Char (isSpace)
import Text.ParserCombinators.Parsec
data Node = Node { level :: Int, keyword :: Maybe String, heading :: String, tags :: Maybe [String] }
deriving (Show)
parseNode = Node <$> level <*> (optionMaybe keyword) <*> name <*> (optionMaybe tags)
where level = length <$> many1 (char '*') <* space
keyword = (try (many1 upper <* space))
name = noneOf "\n" `manyTill` (eof <|> (lookAhead (try (tags *> eof))))
tags = char ':' *> many1 alphaNum `sepEndBy1` char ':'
myTest = parse parseNode "org-mode" "** Some : text here :tags: JUST KIDDING :tags:here:"
myTest2 = parse parseNode "org-mode" "* TODO Just a node"
import Control.Applicative hiding (many, optional, (<|>))
import Control.Monad
import Text.ParserCombinators.Parsec
instance Applicative (GenParser s a) where
pure = return
(<*>) = ap
data Node = Node { name :: String, tags :: Maybe [String] }
deriving (Show)
parseNode = Node <$> name <*> tags
where tags = optionMaybe $ optional (string " :") *> many (noneOf ":\n") `sepEndBy` (char ':')
name = noneOf "\n" `manyTill` try (string " :" <|> string "\n")
myTest = parse parseNode "" "Some:text here :tags:here:"
myTest2 = parse parseNode "" "Sometext here :tags:here:"
Results:
*Main> myTest
Right (Node {name = "Some:text here", tags = Just ["tags","here",""]})
*Main> myTest2
Right (Node {name = "Sometext here", tags = Just ["tags","here",""]})

Categories

Resources