Haskell: Traverse through a String/Text File - parsing

I am trying to read a script file, process it, and write the result to an HTML file. Whenever the script file contains #title(this is a title), I want to emit [header] this is a title [/header] in the HTML output. My approach is to read the script file into a string, process the string, and then write the string to the HTML file.
In order to recognize #title, I need to read the string character by character. When I read '#', I need to check whether the following characters are t i t l e.
QUESTION: How do I traverse a string (which is a list of Char) in Haskell?

You could use simple recursion with pattern matching, for example:
findTag [] = -- end of list code.
findTag ('#':xs)
  | take 5 xs == "title" = -- your code for #title
  | otherwise = findTag xs
findTag (_:xs) = findTag xs
So basically you pattern match on whether the next character (the head of the list) is '#', and then you check whether the next 5 characters form "title". If so, you can continue with your parsing code. If the next character isn't '#', you just continue the recursion. Once the list is empty, you reach the first clause.
Someone else might have a better solution.
I hope this answers your question.
edit:
For a bit more flexibility, if you want to find a specific tag you could do this:
findTag [] _ = -- end of list code.
findTag ('#':xs) tagName
  | take (length tagName) xs == tagName = -- your code for the tag
  | otherwise = findTag xs tagName
findTag (_:xs) tagName = findTag xs tagName
This way, if you do
findTag text "title"
you'll look specifically for the title, and you can always change the tag name to whatever you want.
Another edit:
findTag :: String -> String -> String
findTag [] _ = "" -- end of list: no tag found (put your own handling here)
findTag ('#':xs) tagName
  | take tLength xs == tagName = getTagContents tLength xs
  | otherwise = findTag xs tagName
  where tLength = length tagName
findTag (_:xs) tagName = findTag xs tagName
getTagContents :: Int -> String -> String
getTagContents len = takeWhile (/= ')') . drop (len + 1)
To be honest, it's getting a bit messy, but here's what's happening:
getTagContents first drops the length of the tagName, then one more character for the opening bracket, and finishes by using takeWhile to take the characters up to the closing bracket.
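Putting the pieces of this answer together, a minimal self-contained sketch might look like the following. It is not the answerer's code: the name convertTitles and the use of stripPrefix and break are assumptions made for illustration, and it only handles the #title(...) form from the question.

```haskell
import Data.List (stripPrefix)

-- Replace every "#title(...)" in the input with "[header]...[/header]".
-- Hypothetical helper; an unclosed "#title(" is copied through unchanged.
convertTitles :: String -> String
convertTitles [] = []
convertTitles s@(c : cs) =
  case stripPrefix "#title(" s of
    Just rest
      -- Split at the first ')' and keep converting what follows it.
      | (inner, ')' : after) <- break (== ')') rest ->
          "[header]" ++ inner ++ "[/header]" ++ convertTitles after
    -- No tag starts here: copy one character and move on.
    _ -> c : convertTitles cs
```

In a program this could be wired up with readFile and writeFile, for example writeFile "out.html" . convertTitles =<< readFile "script.txt" (both file names are placeholders).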

Evidently your problem falls into the parsing category. As Daniel Wagner wisely stated, for maintainability reasons you're much better off approaching it generally with a parser.
Another thing is if you want to work with textual data efficiently, you're better off using Text instead of String.
Here's how you could solve your problem using the Attoparsec parser library:
-- For autocasting of hardcoded strings to `Text` type
{-# LANGUAGE OverloadedStrings #-}
-- Import a way more convenient prelude, excluding symbols conflicting
-- with the parser library. See
-- http://hackage.haskell.org/package/classy-prelude
import ClassyPrelude hiding (takeWhile, try)
-- Exclude the standard Prelude
import Prelude ()
import Data.Attoparsec.Text
-- A parser and an inplace converter for title
title = do
  string "#title("
  r <- takeWhile $ notInClass ")"
  string ")"
  return $ "[header]" ++ r ++ "[/header]"
-- A parser which parses the whole document to parts which are either
-- single-character `Text`s or modified titles
parts =
  (try endOfInput >> return []) ++
  ((:) <$> (try title ++ (singleton <$> anyChar)) <*> parts)
-- The topmost parser which concats all parts into a single text
top = concat <$> parts
-- A sample input
input = "aldsfj#title(this is a title)sdlfkj#title(this is a title2)"
-- Run the parser and output result
main = print $ parseOnly top input
This outputs
Right "aldsfj[header]this is a title[/header]sdlfkj[header]this is a title2[/header]"
P.S. ClassyPrelude reimplements ++ as an alias for Monoid's mappend, so you can replace it with mappend, <> or Alternative's <|> if you want.
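For readers who want to try the same structure without installing attoparsec or ClassyPrelude, here is a rough sketch using ReadP from base instead. This is a swapped-in library, not the answer's code: title mirrors the parser above, while part, top and convert are names chosen for illustration.

```haskell
import Text.ParserCombinators.ReadP

-- Parse "#title(...)" and return the converted text.
title :: ReadP String
title = do
  _ <- string "#title("
  r <- munch (/= ')')
  _ <- char ')'
  pure ("[header]" ++ r ++ "[/header]")

-- Either a converted title or a single character copied through.
-- (<++) is left-biased: the fallback runs only if `title` fails.
part :: ReadP String
part = title <++ (pure <$> get)

-- The whole document: any number of parts, then end of input.
top :: ReadP String
top = concat <$> many part <* eof

-- Run the parser, keeping only a parse that consumed all the input.
convert :: String -> Maybe String
convert s = case [out | (out, "") <- readP_to_S top s] of
  (out : _) -> Just out
  [] -> Nothing
```

ReadP works on String rather than Text, so this trades the efficiency argument made above for having no extra dependencies.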

For pattern search-and-replace, you can use streamEdit from the replace-megaparsec package.
import Control.Monad (void)
import Data.Void (Void)
import Replace.Megaparsec
import Text.Megaparsec
import Text.Megaparsec.Char
title :: Parsec Void String String
title = do
  void $ string "#title("
  someTill anySingle $ string ")"
editor t = "[header]" ++ t ++ "[/header]"
streamEdit title editor " #title(this is a title) "
" [header]this is a title[/header] "

Related

How can I make a Haskell parser from a list of words?

I'm a Haskell beginner, using Attoparsec to find some color expressions in a text. I want to be able to match, for example, "light blue-green" and "light blue green" in a text. But of course I need a generalized solution for any string like that. So I've been thinking that it would be something like
"light" >> sep >> "blue" >> sep >> "green"
where sep = inClass "\n\r- "
In other words, I think I need a way to intercalate >> sep >> to a list of words. Something like:
import qualified Data.Text as T
import Data.Attoparsec.Text
-- | Makes a parser from a list of words, accepting
-- spaces, newlines, and hyphens as separators.
wordListParser :: [T.Text] -> Parser
wordListParser wordList = -- Some magic here
Or maybe I'm thinking about this the wrong way entirely, and there's an easier way?
Edit: this minimal non-working example feels like it's almost there:
{-# LANGUAGE OverloadedStrings #-}
import Replace.Attoparsec.Text
import Data.Attoparsec.Text as AT
import qualified Data.Text as T
import Control.Applicative (empty)
wordListParser :: [T.Text] -> Parser T.Text
wordListParser (w:ws) = string w >> satisfy (inClass " -") >> wordListParser ws
wordListParser [w] = string w
wordListParser [] = empty -- or whatever the empty parser is
main :: IO ()
main = parseTest (wordListParser (T.words "light green blue")) "light green-blue"
which I think can be run with something like
stack runhaskell ThisFile.hs --package attoparsec replace-attoparsec text
Here is what I would do, assuming that you have a data type for your colours; if you don't, just substitute it for what you're using. The function parseColourGen takes any Text that is space-separated, and generates a parser that accepts a colour where each word is separated by one or more legal separators.
{-# LANGUAGE OverloadedStrings #-} -- needed for the Text literals in wordList below
import Prelude hiding (concat, words)
import Control.Applicative ((<|>))
import Data.Attoparsec.Text
import Data.List (intersperse)
import Data.Text (concat, pack, singleton, Text, words)
data Colour = LightBlue | DarkBlue | VibrantRed deriving Show
parseColourGen :: Text -> Parser [Text]
parseColourGen = sequence . intersperse (mempty <$ many1 legalSep) .
  fmap string . words
parseColour :: [(Text, Colour)] -> Parser Colour
parseColour = foldl1 (<|>) . fmap (\(text, colour) ->
  colour <$ parseColourGen text)
legalSep :: Parser Text
-- The '-' goes last: in the middle of the string, inClass reads "x-y" as a range.
legalSep = singleton <$> satisfy (inClass "\n\r -")
You can then feed your wordList to the parser; however, it needs to be an association list:
wordList :: [(Text, Colour)]
wordList = [("light blue", LightBlue), ("dark blue", DarkBlue), ("vibrant red", VibrantRed)]
This way, you can configure all of your colours and their corresponding colour names in one place, and you can then run the parser like so:
> parse (parseColour wordList) $ pack "vibrant-red"
Done "" VibrantRed
EDIT
After the edit in your question, I think I understand what you want a little bit better. FWIW, I would still prefer the solution above, but here is how to fix your last block of code:
As the compiler should tell you, patterns (w:ws) and [w] overlap, so if you want the runtime to catch the single-element pattern, you have to place it on top.
a >> b means "run action a, discard its result, then run action b and use that result". Which is why your parser (with the fix above) will output Done "" "blue". A simple way to fix this is to use do notation to bind the result of all three computations, and return their concatenation.
Here is what your code now looks like:
wordListParser :: [Text] -> Parser Text
wordListParser [w] = string w
wordListParser (w:ws) = do
  a <- string w
  b <- satisfy (inClass " -")
  c <- wordListParser ws
  return (a `append` (singleton b) `append` c) -- singleton :: Char -> Text
wordListParser [] = empty
One last thing: your current implementation will not parse Windows line breaks (\r\n), because it accepts exactly one separator character between words. I don't know if you dropped \n and \r from your separator characters on purpose, but if you haven't and Windows files are a possibility for you, it's something to keep in mind.
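The idea of placing a separator parser between the words of a list can also be tried without attoparsec. The sketch below uses ReadP from base as a stand-in; sep, wordList and matches are illustrative names, and the separator accepts one or more characters so that a Windows line break between words is also allowed.

```haskell
import Text.ParserCombinators.ReadP

-- One or more separator characters: space, newline, carriage return, hyphen.
sep :: ReadP String
sep = munch1 (`elem` " \n\r-")

-- Match the given words in order, with separators between them,
-- and return the words as matched. An empty word list never matches.
wordList :: [String] -> ReadP [String]
wordList [] = pfail
wordList (w : ws) = (:) <$> string w <*> traverse (\x -> sep *> string x) ws

-- True if the whole input is exactly the word list with separators.
matches :: [String] -> String -> Bool
matches ws input = not (null [() | (_, "") <- readP_to_S (wordList ws) input])
```

Here matches (words "light blue green") accepts both "light blue-green" and "light blue green", and rejects "light bluegreen" because a separator is required between words.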
I’m not familiar with attoparsec, but you could possibly use a recursive solution:
wordListParser :: [T.Text] -> Parser T.Text
wordListParser [] = empty
wordListParser [w] = string w
wordListParser (w:ws) = string w >> satisfy (inClass "\n\r -") >> wordListParser ws

Recursively return all words from .txt file using attoparsec

I am fairly new to Haskell and I'm just starting to learn how to work with attoparsec for parsing huge chunks of English text from a .txt file. I know how to get the number of words in a .txt file without using attoparsec, but I'm stuck with attoparsec. When I run my code below on, let's say,
"Hello World, I am Elliot Anderson. \nAnd I'm Mr.Robot.\n"
I only get back:
World, I am Elliot Anderson. \nAnd I'm Mr.Robot.\n" (Prose {word = "Hello"})
This is my current code:
{-# LANGUAGE OverloadedStrings #-}
import Control.Exception (catch, SomeException)
import System.Environment (getArgs)
import Data.Attoparsec.Text
import qualified Data.Text.IO as Txt
import Data.Char
import Control.Applicative ((<*>), (*>), (<$>), (<|>), pure)
{-
This is how I would usually get the length of the list of words in a .txt file normally.
countWords :: String -> Int
countWords input = sum $ map (length.words) (lines input)
-}
data Prose = Prose {
  word :: String
  } deriving Show
prose :: Parser Prose
prose = do
  word <- many' $ letter
  return $ Prose word
main :: IO()
main = do
  input <- Txt.readFile "small.txt"
  print $ parse prose input
Also how can I get the integer count of words, later on? Furthermore any suggestions on how to get started with attoparsec?
You have a pretty good start already - you can parse a word.
What you need next is a Parser [Prose], which can be expressed by combining your prose parser with another one which consumes the "not prose" parts, using sepBy or sepBy1, which you can look up in the Data.Attoparsec.Text documentation.
From there, the easiest way to get the word count would be to simply get the length of your obtained [Prose].
EDIT:
Here is a minimal working example. The Parser runner has been swapped for parseOnly to allow for residual input to be ignored, meaning that a trailing non-word won't make the parser go cray-cray.
{-# LANGUAGE OverloadedStrings #-}
module Atto where
--import qualified Data.Text.IO as Txt
import Data.Attoparsec.Text
import Control.Applicative ((*>), (<$>), (<|>), pure)
import qualified Data.Text as T
data Prose = Prose {
  word :: String
  } deriving Show
optional :: Parser a -> Parser ()
optional p = option () (try p *> pure ())
-- Modified to disallow empty words, switched to applicative style
prose :: Parser Prose
prose = Prose <$> many1' letter
separator :: Parser ()
separator = many1 (space <|> satisfy (inClass ",.'")) >> pure ()
wordParser :: String -> [Prose]
wordParser str = case parseOnly wp (T.pack str) of
    Left err -> error err
    Right x -> x
  where
    wp = optional separator *> prose `sepBy1` separator
main :: IO ()
main = do
  let input = "Hello World, I am Elliot Anderson. \nAnd I'm Mr.Robot.\n"
  let words = wordParser input
  print words
  print $ length words
The provided parser does not give the exact same result as concatMap words . lines since it also breaks words on .,'. Modifying this behaviour is left as a simple exercise.
Hope it helps! :)
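To see what "words of letters, separated by anything else" amounts to without a parser library, here is a small base-only sketch. It is not attoparsec code; letterWords and countLetterWords are hypothetical names, and like the parser above it breaks words on punctuation such as . , and '.

```haskell
import Data.Char (isLetter)

-- Split the input into maximal runs of letters; every other
-- character (spaces, newlines, punctuation) acts as a separator.
letterWords :: String -> [String]
letterWords s = case dropWhile (not . isLetter) s of
  "" -> []
  s' -> let (w, rest) = span isLetter s'
        in w : letterWords rest

-- The word count is just the length of that list.
countLetterWords :: String -> Int
countLetterWords = length . letterWords
```

On the sample input from the question this counts "I'm" as two words ("I" and "m") and "Mr.Robot" as two as well, which matches the splitting behaviour described for the attoparsec version.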
You're on the right track! You've written a parser (prose) which reads a single word: many' letter recognises a sequence of letters.
So now that you've figured out how to parse a single word, your job is to scale this up to parse a sequence of words separated by spaces. That's what sepBy does: p `sepBy` q runs the p parser repeatedly with the q parser interspersed.
So a parser for a sequence of words looks something like this (I've taken the liberty of renaming your prose to word):
word = many letter
phrase = word `sepBy` some space -- "some" runs a parser one-or-more times
ghci> parseOnly phrase "wibble wobble wubble" -- with -XOverloadedStrings
Right ["wibble","wobble","wubble"]
Now, phrase, being composed out of letter and space, will die on non-letter non-space characters such as ' and .. I'll leave it to you to figure out how to fix that. (As a hint, you'll probably need to change many letter to many (letter <|> ...), depending on how exactly you want it to behave on the various punctuation marks.)

Parse String to Datatype in Haskell

I'm taking a Haskell course at school, and I have to define a logical proposition datatype in Haskell. Everything so far works fine (definition and functions), and I've declared it as an instance of Ord, Eq and Show. The problem comes when I'm required to define a program which interacts with the user: I have to parse the input from the user into my datatype:
type Var = String
data FProp = V Var
           | No FProp
           | Y FProp FProp
           | O FProp FProp
           | Si FProp FProp
           | Sii FProp FProp
where the formula: ¬q ^ p would be: (Y (No (V "q")) (V "p"))
I've been researching, and found that I can declare my datatype as an instance of Read.
Is this advisable? If it is, can I get some help in order to define the parsing method?
Not a complete answer, since this is a homework problem, but here are some hints.
The other answer suggested getLine followed by splitting at words. It sounds like you instead want something more like a conventional tokenizer, which would let you write things like:
(Y
(No (V q))
(V p))
Here’s one implementation that turns a string into tokens that are either a string of alphanumeric characters or a single, non-alphanumeric printable character. You would need to extend it to support quoted strings:
import Data.Char
type Token = String
tokenize :: String -> [Token]
{- Here, a token is either a string of alphanumeric characters, or else one
- non-spacing printable character, such as "(" or ")".
-}
tokenize [] = []
tokenize (x:xs) | isSpace x = tokenize xs
                | not (isPrint x) = error $
                    "Invalid character " ++ show x ++ " in input."
                | not (isAlphaNum x) = [x]:(tokenize xs)
                | otherwise = let (token, rest) = span isAlphaNum (x:xs)
                              in token:(tokenize rest)
It turns the example into ["(","Y","(","No","(","V","q",")",")","(","V","p",")",")"]. Note that you have access to the entire repertoire of Unicode.
The main function that evaluates this interactively might look like:
main = interact ( unlines . map show . map evaluate . parse . tokenize )
Where parse turns a list of tokens into a list of ASTs and evaluate turns an AST into a printable expression.
As for implementing the parser, your language appears to have similar syntax to LISP, which is one of the simplest languages to parse; you don’t even need precedence rules. A recursive-descent parser could do it, and is probably the easiest to implement by hand. You can pattern-match on parse ("(":xs) =, but pattern-matching syntax can also implement lookahead very easily, for example parse ("(":x1:xs) = to look ahead one token.
If you’re calling the parser recursively, you would define a helper function that consumes only a single expression, and that has a type signature like :: [Token] -> (AST, [Token]). This lets you parse the inner expression, check that the next token is ")", and proceed with the parse. However, externally, you’ll want to consume all the tokens and return an AST or a list of them.
The stylish way to write a parser is with monadic parser combinators. (And maybe someone will post an example of one.) The industrial-strength solution would be a library like Parsec, but that’s probably overkill here. Still, parsing is (mostly!) a solved problem, and if you just want to get the assignment done on time, using a library off the shelf is a good idea.
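As an illustration of the helper with type [Token] -> (AST, [Token]) described above, here is a partial sketch. It is an assumption-laden example rather than a full solution: it covers only the V, No and Y constructors, uses Maybe to signal failure, and the names parseExpr, expectClose and parseAll are made up for this sketch. The remaining constructors would follow the same pattern as Y.

```haskell
import Data.Char (isAlphaNum)

type Var = String
type Token = String

-- Only three of the question's constructors, to keep the sketch short.
data FProp = V Var | No FProp | Y FProp FProp
  deriving (Eq, Show)

-- Parse exactly one parenthesised expression; return it with the unused tokens.
parseExpr :: [Token] -> Maybe (FProp, [Token])
parseExpr ("(" : "V" : name : ")" : rest)
  | all isAlphaNum name = Just (V name, rest)
parseExpr ("(" : "No" : ts) = do
  (p, rest) <- parseExpr ts
  rest' <- expectClose rest
  Just (No p, rest')
parseExpr ("(" : "Y" : ts) = do
  (p, rest1) <- parseExpr ts
  (q, rest2) <- parseExpr rest1
  rest' <- expectClose rest2
  Just (Y p q, rest')
parseExpr _ = Nothing

-- Check that the next token closes the current expression.
expectClose :: [Token] -> Maybe [Token]
expectClose (")" : rest) = Just rest
expectClose _ = Nothing

-- Externally, insist that every token is consumed.
parseAll :: [Token] -> Maybe FProp
parseAll ts = case parseExpr ts of
  Just (p, []) -> Just p
  _ -> Nothing
```

Fed the token list shown earlier for the example input, parseAll yields the value Y (No (V "q")) (V "p").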
The read part of a REPL interpreter typically looks like this:
repl :: ForthState -> IO () -- the read-eval-print loop
repl state
  = do putStr "> " -- prints a > prompt to indicate it's waiting for input
       input <- getLine -- this is what you're looking for, to read a line.
       if input == "quit" -- allows user to quit the interpreter
         then do putStrLn "Bye!"
                 return ()
         else let (is, cs, d, output) = eval (words input) state -- your grammar definition is somewhere down the chain when eval is called on input
              in do mapM_ putStrLn output
                    repl (is, cs, d, [])
main = do putStrLn "Welcome to your very own interpreter!"
          repl initialForthState -- starts the loop with the initial state
Your eval function will have various loops, stack manipulations, conditionals, etc. to actually figure out what the user entered. Hope this helps you with at least the reading input part.
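Since ForthState and eval are not shown in this answer, here is a self-contained sketch of the same loop shape with the IO stripped out, so it can be tried in isolation. Everything in it is hypothetical: runLines threads a state through the input lines, and sumStep is a toy evaluator that keeps a running total.

```haskell
-- Pure core of a REPL: thread a state through the input lines and
-- collect the output lines. Stops at "quit", like the loop above.
runLines :: (s -> String -> (s, [String])) -> s -> [String] -> [String]
runLines _ _ [] = []
runLines step st (l : ls)
  | l == "quit" = ["Bye!"]
  | otherwise =
      let (st', out) = step st l
      in out ++ runLines step st' ls

-- Toy evaluator (an assumption for this sketch): add each number
-- to a running total and report the total so far.
sumStep :: Int -> String -> (Int, [String])
sumStep total line = case reads line of
  [(n, "")] -> let t = total + n in (t, [show t])
  _ -> (total, ["not a number: " ++ line])
```

To run it against standard input, something like interact (unlines . runLines sumStep 0 . lines) would do; a real interpreter would replace sumStep with its own evaluation function and state.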

Parsing multiple lines into a list of lists in Haskell

I am trying to parse a file that looks like:
a b c
d e f
I want to match each of the symbols in the line and parse everything into a list of lists such as:
[[A, B, C], [D, E, F]]
In order to do that I tried the following:
import Control.Monad
import Text.ParserCombinators.Parsec
import Text.ParserCombinators.Parsec.Language
import qualified Text.ParserCombinators.Parsec.Token as P
parserP :: Parser [[MyType]]
parserP = do
  x <- rowP
  xs <- many (newline >> rowP)
  return (x : xs)
rowP :: Parser [MyType]
rowP = manyTill cellP $ void newline <|> eof
cellP :: Parser MyType
cellP = aP <|> bP <|> ... -- rest of the parsers, they all look very similar
aP :: Parser MyType
aP = symbol "a" >> return A
bP :: Parser MyType
bP = symbol "b" >> return B
lexer = P.makeTokenParser emptyDef
symbol = P.symbol lexer
But it fails to return multiple inner lists. Instead what I get is:
[[A, B, C, D, E, F]]
What am I doing wrong? I was expecting manyTill to parse cellP until the newline character, but that's not the case.
Parser combinators are overkill for something this simple. I'd use lines :: String -> [String] and words :: String -> [String] to break up the input and then map the individual tokens into MyTypes.
toMyType :: String -> Maybe MyType
toMyType "a" = Just A
toMyType "b" = Just B
toMyType "c" = Just C
toMyType _ = Nothing
parseMyType :: String -> Maybe [[MyType]]
parseMyType = traverse (traverse toMyType) . fmap words . lines
You're right that manyTill keeps parsing until a newline. But manyTill never gets to see the newline because cellP is too eager. cellP ends up calling P.symbol, whose documentation states
symbol :: String -> ParsecT s u m String
Lexeme parser symbol s parses string s and skips trailing white space.
The keyword there is 'white space'. It turns out, Parsec defines whitespace as being any character which satisfies isSpace, which includes newlines. So P.symbol is happily consuming the c, followed by the space and the newline, and then manyTill looks and doesn't see a newline because it's already been consumed.
If you want to drop the Parsec routine, go with Benjamin's solution. But if you're determined to stick with that, the basic idea is that you want to modify the language's whiteSpace field to correctly define whitespace to not be newlines. Something like
lexer = let lexer0 = P.makeTokenParser emptyDef
        in lexer0 { P.whiteSpace = void $ many (oneOf " \t") }
That's pseudocode and probably won't work as written: the other fields of the token parser, such as symbol, were already built against the original whiteSpace when makeTokenParser ran, so a record update alone does not change what they skip. But the idea is there. You want whitespace to mean whatever you define it to be, not what the library defines by default. Note that changing this will also break your comment syntax, if you have one defined, since whiteSpace was previously equipped to handle comments.
In short, Benjamin's answer is probably the best way to go. There's no real reason to use Parsec here. But it's also helpful to know why this particular solution didn't work: Parsec's default definition of a language wasn't designed to treat newlines with significance.
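To make the point about newline-significant whitespace concrete, here is a small sketch in which the cell separator is defined by hand and never consumes a newline. It uses ReadP from base rather than Parsec, so it is a stand-in for the approach and not a drop-in fix; the constructors of MyType and the names cell, hspace, row, grid and parseGrid are assumptions for this example.

```haskell
import Data.Maybe (listToMaybe)
import Text.ParserCombinators.ReadP

data MyType = A | B | C | D | E | F
  deriving (Eq, Show)

-- One symbol per cell.
cell :: ReadP MyType
cell = choice
  [ A <$ char 'a', B <$ char 'b', C <$ char 'c'
  , D <$ char 'd', E <$ char 'e', F <$ char 'f' ]

-- Horizontal whitespace only: a newline is never skipped here.
hspace :: ReadP String
hspace = munch1 (`elem` " \t")

-- A row is cells separated by horizontal whitespace.
row :: ReadP [MyType]
row = sepBy1 cell hspace

-- Rows are separated by newlines; one trailing newline is allowed.
grid :: ReadP [[MyType]]
grid = sepBy1 row (char '\n') <* optional (char '\n') <* eof

-- Keep only a parse that consumed the whole input.
parseGrid :: String -> Maybe [[MyType]]
parseGrid s = listToMaybe [g | (g, "") <- readP_to_S grid s]
```

Because hspace cannot eat the newline, the row parser stops at the end of each line and the rows come out as separate inner lists.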

Grouping lines with Parsec

I have a line-based text format I want to parse with Parsec†. A line either starts with a pound sign and specifies a key value pair separated by a colon or is a URL that is described by the previous tags.
Here's a short example:
#foo:bar
#faz:baz
https://example.com
#foo:beep
https://example.net
For simplicity's sake, I'm going to store everything as String. A Tag is a type Tag = (String, String), for example ("foo", "bar"). Ultimately, I'd like to group these as ([Tag], URL).
However, I struggle figuring out how to parse either [one or more tags] or [one URL].
My current approach looks like this:
import qualified System.Environment as Env
import qualified Text.Megaparsec as M
import qualified Text.Megaparsec.Text as M
type Tag = (String, String)
data Segment = Tags [Tag] | URL String
  deriving (Eq, Show)
tagP :: M.Parser Tag
tagP = M.char '#' *> ((,) <$> M.someTill M.printChar (M.char ':') <*> M.someTill M.printChar M.eol) M.<?> "Tag starting with #"
urlP :: M.Parser String
urlP = M.someTill M.printChar M.eol M.<?> "Some URL"
parser :: M.Parser Segment
parser = (Tags <$> M.many tagP) M.<|> (URL <$> urlP)
main :: IO ()
main = do
  fname <- head <$> Env.getArgs
  res <- M.parseFromFile (parser <* M.eof) fname
  print res
If I try to run this on the above sample, I get a parsing error like this:
3:1:
unexpected 'h'
expecting Tag starting with # or end of input
Clearly my use of many in combination with <|> is incorrect. Since the tag parser won't consume any input from the URL parser it cannot be related to backtracking. How do I need to change this to get to the desired result?
The full example is available on GitHub.
† I'm actually using MegaParsec here for better error messages but I think the problem is quite generic and not about any particular implementation of parser combinators.
What you're doing works quite fine, only, at the moment you only parse a single segment (i.e., either only tags or only a URL), but that doesn't consume the whole input. It's eof that's causing the error.
Simply use one more many or some, to allow for multiple segments:
main :: IO ()
main = do
  fname <- head <$> Env.getArgs
  res <- M.parseFromFile (M.many parser <* M.eof) fname
  print res
@cocreature answered this for me on Twitter.
As leftaroundabout pointed out here, there are two separate mistakes in my code:
The parser itself misuses <|> while it should just sequentially parse the lines and skip to the next parser if it doesn't consume any input.
The invocation (parseFromFile) only applies the parser function a single time and would fail as soon as it would get to the second block.
We can fix the parser and introduce grouping in one go:
import Control.Applicative (liftA2)
parser :: M.Parser ([Tag], String)
parser = liftA2 (,) (M.many tagP) urlP
Afterwards, we just need to apply the change suggested by leftaroundabout:
...
res <- M.parseFromFile (M.many parser <* M.eof) fname
Running this leads to the desired result:
[([("foo","bar"),("faz","baz")],"https://example.com"),([("foo","beep")],"https://example.net")]
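The grouping itself ("zero or more tags, then one URL") can also be expressed over plain lines, which may help when checking what the parser should produce. The following is a base-only sketch and not Megaparsec code; parseTag and groupSegments are hypothetical names, and any line that is not a well-formed #key:value tag is treated as a URL.

```haskell
type Tag = (String, String)

-- "#key:value" becomes (key, value); anything else is not a tag.
parseTag :: String -> Maybe Tag
parseTag ('#' : rest) = case break (== ':') rest of
  (key, ':' : value) -> Just (key, value)
  _ -> Nothing
parseTag _ = Nothing

-- Collect tags until a non-tag line (the URL) closes the group.
-- Trailing tags with no URL after them are dropped.
groupSegments :: [String] -> [([Tag], String)]
groupSegments = go []
  where
    go _ [] = []
    go acc (l : ls) = case parseTag l of
      Just t -> go (t : acc) ls
      Nothing -> (reverse acc, l) : go [] ls
```

Applied to the lines of the sample file, for instance via groupSegments . lines <$> readFile fname, this gives the same grouped list as shown above.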

Resources