Parsec: Getting start and end source positions of expressions? - parsing

I'm writing a programming language which uses Parsec for its parsing. For reporting error messages, I've got each element of my syntax tree labelled with its source location, using Parsec's getPosition function.
However, it only gives the location of the beginning of each expression I parse, and I'd like both the beginning and the end, so that I can highlight the expression's full extent within the source code.
Is such a thing possible with parsec? Is there a standard way of getting the end-point of an expression I'm parsing, so that I can include it in my AST?

You can use getPosition after you parse as well.
import Text.Parsec
import Text.Parsec.String

spanned :: Parser a -> Parser (SourcePos, SourcePos, a)
spanned p = do
  pos1 <- getPosition
  a <- p
  pos2 <- getPosition
  pure (pos1, pos2, a)
Testing:
> parseTest (spanned (many1 (char 'a'))) "aaaaafff"
((line 1, column 1),(line 1, column 6),"aaaaa")
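To attach the span to an AST node, you can wrap the individual parsers with spanned. The Expr type below is a made-up placeholder (not from the question), just to show the shape; note that the end position reported is where parsing resumed, i.e. one column past the last consumed character (column 6 for the five 'a's above).
-- Hypothetical AST node carrying its source span; Expr and its fields
-- are illustrative names, not taken from the question.
data Expr = Var SourcePos SourcePos String
  deriving Show

varExpr :: Parser Expr
varExpr = do
  (start, end, name) <- spanned (many1 letter)
  pure (Var start end name)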

Related

Recursively return all words from .txt file using attoparsec

I am fairly new to Haskell and I'm just starting to learn how to work with attoparsec for parsing huge chunks of English text from a .txt file. I know how to get the number of words in a .txt file without using attoparsec, but I'm kinda stuck with attoparsec. When I run my code below on, let's say,
"Hello World, I am Elliot Anderson. \nAnd I'm Mr.Robot.\n"
I only get back:
World, I am Elliot Anderson. \nAnd I'm Mr.Robot.\n" (Prose {word =
"Hello"})
This is my current code:
{-# LANGUAGE OverloadedStrings #-}
import Control.Exception (catch, SomeException)
import System.Environment (getArgs)
import Data.Attoparsec.Text
import qualified Data.Text.IO as Txt
import Data.Char
import Control.Applicative ((<*>), (*>), (<$>), (<|>), pure)

{-
This is how I would usually get the length of the list of words in a .txt file normally.
countWords :: String -> Int
countWords input = sum $ map (length.words) (lines input)
-}

data Prose = Prose {
  word :: String
} deriving Show

prose :: Parser Prose
prose = do
  word <- many' $ letter
  return $ Prose word

main :: IO ()
main = do
  input <- Txt.readFile "small.txt"
  print $ parse prose input
Also, how can I get the integer count of words later on? And do you have any suggestions on how to get started with attoparsec?
You have a pretty good start already - you can parse a word.
What you need next is a Parser [Prose]. You can build it by combining your prose parser with another one that consumes the "not prose" parts, using sepBy or sepBy1, both of which you can look up in the Data.Attoparsec.Text documentation.
From there, the easiest way to get the word count would be to simply get the length of your obtained [Prose].
EDIT:
Here is a minimal working example. The Parser runner has been swapped for parseOnly to allow for residual input to be ignored, meaning that a trailing non-word won't make the parser go cray-cray.
{-# LANGUAGE OverloadedStrings #-}
module Atto where

--import qualified Data.Text.IO as Txt
import Data.Attoparsec.Text
import Control.Applicative ((*>), (<$>), (<|>), pure)
import qualified Data.Text as T

data Prose = Prose {
  word :: String
} deriving Show

optional :: Parser a -> Parser ()
optional p = option () (try p *> pure ())

-- Modified to disallow empty words, switched to applicative style
prose :: Parser Prose
prose = Prose <$> many1' letter

separator :: Parser ()
separator = many1 (space <|> satisfy (inClass ",.'")) >> pure ()

wordParser :: String -> [Prose]
wordParser str = case parseOnly wp (T.pack str) of
    Left err -> error err
    Right x  -> x
  where
    wp = optional separator *> prose `sepBy1` separator

main :: IO ()
main = do
  let input = "Hello World, I am Elliot Anderson. \nAnd I'm Mr.Robot.\n"
  let words = wordParser input
  print words
  print $ length words
The provided parser does not give the exact same result as concatMap words . lines since it also breaks words on .,'. Modifying this behaviour is left as a simple exercise.
Hope it helps! :)
You're on the right track! You've written a parser (prose) which reads a single word: many' letter recognises a sequence of letters.
So now that you've figured out how to parse a single word, your job is to scale this up to parse a sequence of words separated by spaces. That's what sepBy does: p `sepBy` q runs the p parser repeatedly with the q parser interspersed.
So a parser for a sequence of words looks something like this (I've taken the liberty of renaming your prose to word):
word = many letter
phrase = word `sepBy` some space -- "some" runs a parser one-or-more times
ghci> parseOnly phrase "wibble wobble wubble" -- with -XOverloadedStrings
Right ["wibble","wobble","wubble"]
Now, phrase, being composed out of letter and space, will die on non-letter, non-space characters such as ' and '.'. I'll leave it to you to figure out how to fix that. (As a hint, you'll probably need to change many letter to many (letter <|> ...), depending on how exactly you want it to behave on the various punctuation marks.)
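For instance, one direction (an illustration of the hint above, not the only reasonable choice) is to allow apostrophes inside words and to treat commas and dots as part of the separator:
import Data.Attoparsec.Text
import Control.Applicative (some, (<|>))

-- Apostrophes count as word characters; whitespace, commas and dots separate words.
word' :: Parser String
word' = some (letter <|> char '\'')

phrase' :: Parser [String]
phrase' = word' `sepBy` some (space <|> satisfy (inClass ",."))
With this split, parseOnly phrase' "Hello World, I'm Mr.Robot." should yield Right ["Hello","World","I'm","Mr","Robot"] (note that Mr.Robot becomes two words), which may or may not be what you want.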

Parse String to Datatype in Haskell

I'm taking a Haskell course at school, and I have to define a Logical Proposition datatype in Haskell. Everything so far works fine (definition and functions), and I've declared it as an instance of Ord, Eq and Show. The problem comes when I'm required to define a program which interacts with the user: I have to parse the user's input into my datatype:
type Var = String

data FProp = V Var
           | No FProp
           | Y FProp FProp
           | O FProp FProp
           | Si FProp FProp
           | Sii FProp FProp
where the formula: ¬q ^ p would be: (Y (No (V "q")) (V "p"))
I've been researching, and found that I can declare my datatype as an instance of Read.
Is this advisable? If it is, can I get some help in order to define the parsing method?
Not a complete answer, since this is a homework problem, but here are some hints.
The other answer suggested getLine followed by splitting at words. It sounds like you instead want something more like a conventional tokenizer, which would let you write things like:
(Y
  (No (V q))
  (V p))
Here’s one implementation that turns a string into tokens that are either a string of alphanumeric characters or a single, non-alphanumeric printable character. You would need to extend it to support quoted strings:
import Data.Char

type Token = String

tokenize :: String -> [Token]
{- Here, a token is either a string of alphanumeric characters, or else one
 - non-spacing printable character, such as "(" or ")".
 -}
tokenize [] = []
tokenize (x:xs) | isSpace x = tokenize xs
                | not (isPrint x) = error $
                    "Invalid character " ++ show x ++ " in input."
                | not (isAlphaNum x) = [x]:(tokenize xs)
                | otherwise = let (token, rest) = span isAlphaNum (x:xs)
                              in token:(tokenize rest)
It turns the example into ["(","Y","(","No","(","V","q",")",")","(","V","p",")",")"]. Note that you have access to the entire repertoire of Unicode.
The main function that evaluates this interactively might look like:
main = interact ( unlines . map show . map evaluate . parse . tokenize )
Where parse turns a list of tokens into a list of ASTs and evaluate turns an AST into a printable expression.
As for implementing the parser, your language appears to have similar syntax to LISP, which is one of the simplest languages to parse; you don’t even need precedence rules. A recursive-descent parser could do it, and is probably the easiest to implement by hand. You can pattern-match on parse ("(":xs) =, but pattern-matching syntax can also implement lookahead very easily, for example parse ("(":x1:xs) = to look ahead one token.
If you’re calling the parser recursively, you would define a helper function that consumes only a single expression, and that has a type signature like :: [Token] -> (AST, [Token]). This lets you parse the inner expression, check that the next token is ")", and proceed with the parse. However, externally, you’ll want to consume all the tokens and return an AST or a list of them.
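For concreteness, here is a minimal sketch of such a helper, assuming the tokenize function above and the FProp type from the question. Only a few constructors are handled, and it calls error where real code would report a parse failure:
-- Sketch only: O, Si and Sii would follow the same pattern as Y.
parseExpr :: [Token] -> (FProp, [Token])
parseExpr ("(":"V":v:")":rest) = (V v, rest)
parseExpr ("(":"No":ts) =
  let (inner, rest) = parseExpr ts
  in case rest of
       ")":rest' -> (No inner, rest')
       _         -> error "expected ')'"
parseExpr ("(":"Y":ts) =
  let (a, rest1) = parseExpr ts
      (b, rest2) = parseExpr rest1
  in case rest2 of
       ")":rest' -> (Y a b, rest')
       _         -> error "expected ')'"
parseExpr ts = error ("unexpected input: " ++ show ts)
The top-level parse would then call parseExpr repeatedly until the token list is empty, collecting the resulting FProp values.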
The stylish way to write a parser is with monadic parser combinators. (And maybe someone will post an example of one.) The industrial-strength solution would be a library like Parsec, but that’s probably overkill here. Still, parsing is (mostly!) a solved problem, and if you just want to get the assignment done on time, using a library off the shelf is a good idea.
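For the record, a rough Parsec sketch of that combinator style might look like this (it handles only V, No and Y from the question's type; the other constructors follow the same pattern):
import Text.Parsec
import Text.Parsec.String (Parser)

-- Sketch of a combinator-style parser for the question's FProp type.
fprop :: Parser FProp
fprop = spaces *> (parenForm <|> bareVar)
  where
    bareVar   = V <$> many1 letter
    parenForm = between (char '(') (spaces *> char ')') form
    form = do
      ctor <- spaces *> many1 letter
      case ctor of
        "V"  -> V  <$> (spaces *> many1 letter)
        "No" -> No <$> fprop
        "Y"  -> Y  <$> fprop <*> fprop
        _    -> fail ("unknown constructor: " ++ ctor)
Given a Show instance for FProp, parseTest fprop "(Y (No (V q)) (V p))" should print the expected tree.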
The read part of a REPL interpreter typically looks like this:
repl :: ForthState -> IO ()  -- parser definition
repl state
  = do putStr "> "        -- puts a > character to indicate it's waiting for input
       input <- getLine   -- this is what you're looking for, to read a line
       if input == "quit" -- allows the user to quit the interpreter
         then do putStrLn "Bye!"
                 return ()
         -- your grammar definition is somewhere down the chain when eval is called on input
         else let (is, cs, d, output) = eval (words input) state
              in do mapM_ putStrLn output
                    repl (is, cs, d, [])

main = do putStrLn "Welcome to your very own interpreter!"
          repl initialForthState  -- runs the parser, starting with read
Your eval method will have various loops, stack manipulations, conditionals, etc. to actually figure out what the user entered. Hope this helps you with at least the reading-input part.

Auto-completion suggestions from parse error

I am writing a parser for a custom Jupyter kernel using megaparsec. I was able to re-use the parser to provide completions too: the custom error messages generated by the megaparsec library are transformed into the list of expected symbols. That way, whenever I change the parser, completion automatically adjusts itself, which is great.
The only thing I am struggling with is how to get info from the optional parsers. A minimal example illustrating what I want to achieve follows:
{-# LANGUAGE OverloadedStrings #-}
module Main where

import Control.Applicative
import Text.Megaparsec
import Text.Megaparsec.Char
import qualified Text.Megaparsec.Char.Lexer as L
import Data.Monoid
import Data.Text (Text)
import Data.Set (singleton)

type Parser = Parsec MyError Text

data MyError = ExpectKeyword Text deriving (Eq, Ord, Show)

lexeme = L.lexeme sc
sc = L.space (skipSome (oneOf [' ', '\t'])) empty empty

-- | Reserved words
rword :: Text -> Parser Text
rword w = region (fancyExpect (ExpectKeyword w)) $
  lexeme (string w *> return w)

fancyExpect f e = FancyError (errorPos e) (singleton . ErrorCustom $ f)

p1 = rword "foo" <|> rword "bar"
p2 = (<>) <$> option "def" (rword "opt") <*> p1

main = do
  putStrLn . show $ parse p1 "" ("xyz" :: Text) -- shows "foo" and "bar" in errors
  putStrLn . show $ parse p2 "" ("xyz" :: Text) -- like above, no optional "opt"
In the first case, parser fails and I get the list of all errors from all alternatives. Ideally, in the second case I would like to see the error of the failed optional parser too.
This example could be solved simply by removing option and making two branches with <|>: one with the optional part and the other without. However, in the real case the optional part is a permutation parser consisting of several optional pieces, so such a trick is not feasible.

Insert a character into parser combinator character stream in Haskell

This question is related to both Parsec and uu-parsinglib. When we write parser combinators, they process character streams. Is it somehow possible to parse a character and put it back (or push another character back) into the input stream?
For example, I want to parse the input "test + 5": parse the t, e, s, t, and after recognising the test pattern, push (say) a v character back into the character stream, so that as parsing continues we are matching against v + 5.
I do not want to use this for any particular case right now - I want to learn the possibilities in depth.
I'm not sure if it's possible with these parsers directly, but in general you can accomplish it by combining parsers with some streaming that allows injecting leftovers.
For example, using attoparsec-conduit you can turn a parser into a conduit using
sinkParser :: (AttoparsecInput a, MonadThrow m)
           => Parser a b -> Consumer a m b
where Consumer is a special kind of conduit that doesn't produce any output, only receives input and returns a final value.
Since conduits support leftovers, you can create a helper that converts a parser, one which optionally returns a value to be pushed back into the stream, into a conduit:
import Data.Attoparsec.Types
import Data.Conduit
import Data.Conduit.Attoparsec
import Data.Functor

reinject :: (AttoparsecInput a, MonadThrow m)
         => Parser a (Maybe a, b) -> Consumer a m b
reinject p = do
  (lo, r) <- sinkParser p
  maybe (return ()) leftover lo
  return r
Then you convert standard parsers to conduits using sinkParser and these special parsers using reinject, and then combine conduits instead of parsers.
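As a rough usage sketch (the pipeline below and its expected result are my assumptions about a reasonably recent conduit API, not something stated in the answer): the first parser recognises "test" and asks for a "v" to be pushed back, and the second parser then sees the rewritten stream.
{-# LANGUAGE OverloadedStrings #-}
import Data.Conduit (runConduit, (.|), yield)
import Data.Conduit.Attoparsec (sinkParser)
import Data.Attoparsec.Text (string, takeText)
import Data.Text (Text)

-- Uses the reinject helper defined above.
demo :: IO Text
demo = runConduit $ yield ("test + 5" :: Text) .| do
  _ <- reinject ((\_ -> (Just "v", ())) <$> string "test")
  sinkParser takeText  -- should see "v + 5", assuming sinkParser returns unconsumed input as leftovers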
I think the simplest way to achieve this is to build a multi-layered parser. Think of a lexer + parser combination. This is a clean approach to the problem.
You have to separate the two kinds of parsing. The search-and-replace parsing goes into the first parser and the build-the-AST parsing into the second. Or you can create an intermediate token representation.
import Text.Parsec
import Text.Parsec.String

parserLvl1 :: Parser String
parserLvl1 = many (try (string "test" >> return 'v') <|> anyChar)

parserLvl2 :: Parser Plus
parserLvl2 = do text1 <- many (noneOf "+")
                char '+'
                text2 <- many (noneOf "+")
                return $ Plus text1 text2

data Plus = Plus String String
  deriving Show

wholeParse :: String -> Either ParseError Plus
wholeParse source = do res1 <- parse parserLvl1 "lvl1" source
                       res2 <- parse parserLvl2 "lvl2" res1
                       return res2
Now you can parse your example. wholeParse "test+5" results in Right (Plus "v" "5").
Possible variations:
Create a class and an instance for combining wrapped parser stages. (Possibly carrying parser state.)
Create an intermediate representation, a stream of tokens (a rough sketch of this follows below).
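As a sketch of that token-stream variation (the Token type and its constructors here are invented for illustration; whitespace handling is omitted):
import Text.Parsec
import Text.Parsec.String (Parser)

-- First layer: characters to tokens, doing the "test" -> TVar "v" rewrite here.
data Token = TVar String | TPlus | TNum Integer
  deriving Show

lexerLvl1 :: Parser [Token]
lexerLvl1 = many (    try (string "test" >> return (TVar "v"))
                  <|> (char '+' >> return TPlus)
                  <|> (TNum . read <$> many1 digit))
The second layer would then be written over [Token] instead of String, either with Parsec's token/tokenPrim primitives or by pattern matching on the list directly.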
This is easily done in uu-parsinglib using the pSwitch function. But the question is why you would want to do so. Because the v is missing from the input? In that case uu-parsinglib will perform error correction automatically, so you do not need something like this. Otherwise you can write:
pSwitch :: (st1 -> (st2, st2 -> st1)) -> P st2 a -> P st1 a
pInsert_v = pSwitch (\st1 -> (prepend 'v' st1, id)) (pSucceed ())
How the v is actually added depends on your actual state type, so you will have to define the function prepend yourself. I do not know, e.g., how such an insertion would influence the current position in the file, etc.
Doaitse Swierstra

Parsing user input with reads in Haskell

I am trying to parse a user-entered string like "A12" into a Haskell tuple, like ('A', 12).
Here's what I have tried:
import Data.Maybe
type Pos = (Char, Int)
parse :: String -> Maybe Pos
parse u = do
  (c, rest) <- (listToMaybe.reads) u
  (r, _) <- (listToMaybe.reads) rest
  return $ (c, r)
But this always returns Nothing. Why does this happen, and what is the correct way to parse this string? Since this is fairly simple, I'd like to avoid using Parsec or a similar advanced parsing library.
EDIT (to clarify):
Sample Input and Output:
"A12" gives Just ('A', 12)
"J5" gives Just ('J', 5)
"A" gives Nothing
"2324" gives Nothing
read is usually the opposite of show and they both generally use Haskell syntax to represent the given values. This means that since the Haskell syntax for characters uses single quotes, show on a character will add single quotes around it, and read will expect the single quotes to be there.
In other words, your function expects syntax like 'A' 42, and indeed it works if you try that:
> parse "'A' 42"
Just ('A',42)
For your format, I would instead use pattern matching for the first character and then reads for the rest, e.g. something like this:
parse :: String -> Maybe Pos
parse [] = Nothing
parse (c:rest) = do
  (r, _) <- listToMaybe $ reads rest
  return (c, r)
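Since the question's examples also want "2324" to give Nothing, one small extension (my addition, not part of the answer above) is to require the first character to be a letter:
import Data.Char (isAlpha)
import Data.Maybe (listToMaybe)

type Pos = (Char, Int)

-- Same idea as above, plus a guard so inputs like "2324" are rejected.
parseLetterNum :: String -> Maybe Pos
parseLetterNum (c:rest) | isAlpha c = do
  (r, _) <- listToMaybe (reads rest)
  return (c, r)
parseLetterNum _ = Nothing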
Do you have to use do notation? If not, the following function suits your needs. It's not pretty, but it gets the job done.
parse :: String -> Maybe Pos
parse (x:xs) = Just (x,read xs::Int)
I'm not sure what you consider "failing" and thus worthy of a Nothing.
