Correctly parsing line indentations in uu-parsinglib in Haskell - parsing

I want to create a parser combinator, which will collect all lines below current place, which indentation levels will be greater or equal some i. I think the idea is simple:
Consume a line - if its indentation is:
ok -> do it for next lines
wrong -> fail
Lets consider following code:
import qualified Text.ParserCombinators.UU as UU
import Text.ParserCombinators.UU hiding(parse)
import Text.ParserCombinators.UU.BasicInstances hiding (Parser)
-- end of line
pEOL = pSym '\n'
pSpace = pSym ' '
pTab = pSym '\t'
indentOf s = case s of
' ' -> 1
'\t' -> 4
-- return the indentation level (number of spaces on the beginning of the line)
pIndent = (+) <$> (indentOf <$> (pSpace <|> pTab)) <*> pIndent `opt` 0
-- returns tuple of (indentation level, result of parsing the second argument)
pIndentLine p = (,) <$> pIndent <*> p <* pEOL
-- SHOULD collect all lines below witch indentations greater or equal i
myParse p i = do
(lind, expr) <- pIndentLine p
if lind < i
then pFail
else do
rest <- myParse p i `opt` []
return $ expr:rest
-- sample inputs
s1 = " a\
\\n a\
\\n"
s2 = " a\
\\na\
\\n"
-- execution
pProgram = myParse (pSym 'a') 1
parse p s = UU.parse ( (,) <$> p <*> pEnd) (createStr (LineColPos 0 0 0) s)
main :: IO ()
main = do
print $ parse pProgram s1
print $ parse pProgram s2
return ()
Which gives following output:
("aa",[])
Test.hs: no correcting alternative found
The result for s1 is correct. The result for s2 should consume first "a" and stop consuming. Where this error comes from?

The parsers which you are constructing will always try to proceed; if necessary input will be discarded or added. However pFail is a dead-end. It acts as a unit element for <|>.
In you parser there is however no other alternative present in case the input does not comply to the language recognised by the parser. In you specification you say you want the parser to fail on input s2. Now it fails with a message saying that is fails, and you are surprised.
Maybe you do not want it to fail, but you want to stop accepting further input? In that case
replace pFail by return [].
Note that the text:
do
rest <- myParse p i `opt` []
return $ expr:rest
can be replaced by (expr:) <$> (myParse p i `opt` [])
A natural way to solve your problem is probably something like
pIndented p = do i <- pGetIndent
(:) <$> p <* pEOL <*> pMany (pToken (take i (repeat ' ')) *> p <* pEOL)
pIndent = length <$> pMany (pSym ' ')

Related

Parser written in Haskell not working as intended

I was playing around with Haskell's parsec library. I was trying to parse a hexadecimal string of the form "#x[0-9A-Fa-f]*" into an integer. This the code I thought would work:
module Main where
import Control.Monad
import Numeric
import System.Environment
import Text.ParserCombinators.Parsec hiding (spaces)
parseHex :: Parser Integer
parseHex = do
string "#x"
x <- many1 hexDigit
return (fst (head (readHex x)))
testHex :: String -> String
testHex input = case parse parseHex "lisp" input of
Left err -> "Does not match " ++ show err
Right val -> "Matched" ++ show val
main :: IO ()
main = do
args <- getArgs
putStrLn (testHex (head args))
And then I tried testing the testHex function in Haskell's repl:
GHCi, version 8.6.5: http://www.haskell.org/ghc/ :? for help
[1 of 1] Compiling Main ( src/Main.hs, interpreted )
Ok, one module loaded.
*Main> testHex "#xcafebeef"
"Matched3405692655"
*Main> testHex "#xnothx"
"Does not match \"lisp\" (line 1, column 3):\nunexpected \"n\"\nexpecting hexadecimal digit"
*Main> testHex "#xcafexbeef"
"Matched51966"
The first and second try work as intended. But in the third one, the string is matching upto the invalid character. I do not want the parser to do this, but rather not match if any digit in the string is not a valid string. Why is this happening, and how do if fix this?
Thank you!
You need to place eof at the end.
parseHex :: Parser Integer
parseHex = do
string "#x"
x <- many1 hexDigit
eof
return (fst (head (readHex x)))
Alternatively, you can compose it with eof where you use it if you want to reuse parseHex in other places.
testHex :: String -> String
testHex input = case parse (parseHex <* eof) "lisp" input of
Left err -> "Does not match " ++ show err
Right val -> "Matched" ++ show val

Haskell: Parsing a file finishes after first expression despite more input in file

The following is an example program of a language in which I'm writing a parser.
n := 1
Do (1)-> -- The 1 in brackets is a placeholder for a Boolean or relational expression.
n := 1 + 1
Od
When the program looks like this, the parseFile functions ends after the assignment on the first line however when the assignment is removed, it parses as expected. Below is how it's called in GHCI, first with the first line present then removed:
λ > parseFile "example.hnry"
Assign "n" (HInteger 1)
λ > parseFile "example.hnry"
Do (HInteger 1) (Assign "n" (AExpr (HInteger 1) Add (HInteger 1)))
The expected output would look similar to this:
λ > parseFile "example.hnry"
Assign "n" (HInteger 1) Do (HInteger 1) (Assign "n" (AExpr (HInteger 1) Add (HInteger 1)))
I first assumed it was something to do with the the assignment parser but in the body of the loop, there exists an assignment which parses as expected so I was able to rule that out. I believe that the issue is within the parseFile function itself. The following is the parseFile function and the other functions that make up the parseExpression function that I'm using to parse a program.
I think that the error is within parseFile because it parses an expression only once and doesn't "loop" for the want of a better word to itself to check if there's more input left the parse. I think that's the error but I'm not quite sure.
parseFile :: String -> IO HVal
parseFile file =
do program <- readFile file
case parse parseExpression "" program of
Left err -> fail "Parse Error"
Right parsed -> return $ parsed
parseExpression :: Parser HVal
parseExpression = parseAExpr <|> parseDo <|> parseAssign
parseDo :: Parser HVal
parseDo = do
_ <- string "Do "
_ <- char '('
x <- parseHVal -- Will be changed to a Boolean expression
_ <- string ")->"
spaces
y <- parseExpression
spaces
_ <- string "Od"
return $ Do x y
parseAExpr :: Parser HVal
parseAExpr = do
x <- parseInteger
spaces
op <- parseOp
spaces
y <- parseInteger <|> do
_ <- char '('
z <- parseAExpr
_ <- char ')'
return $ z
return $ AExpr x op y
parseAssign :: Parser HVal
parseAssign = do
var <- oneOf ['a'..'z'] <|> oneOf ['A'..'Z']
spaces
_ <- string ":="
spaces
val <- parseHVal <|> do
_ <- char '('
z <- parseAExpr
_ <- char ')'
return $ z
return $ Assign [var] val
As you note, your parseFile function parses a single expression (though maybe "statement" would be a better name) using the parseExpression parser. You probably want to introduce a new parser for a "program" or sequence of expressions/statements:
parseProgram :: Parser [HVal]
parseProgram = spaces *> many (parseExpression <* spaces)
and then in parseFile, replace parseExpression with parseProgram:
parseFile :: String -> IO [HVal]
parseFile file =
do program <- readFile file
case parse parseProgram "" program of
Left err -> fail "Parse Error"
Right parsed -> return $ parsed
Note that I've had to change the type here from HVal to [HVal] to reflect the fact that a program, being a sequence of expressions each of type HVal, needs to be represented as some sort of data type capable of combining multiple HVals together, and a list [HVal] is one way of doing so.
If you want a program to be an HVal instead of an [HVal], then you need to introduce a new constructor in your HVal type that's capable of representing programs. One method is to use a constructor to directly represent a block of statements:
data HVal = ... | Block [HVal]
Another is to add a constructor represent a sequence of two statements:
data HVal = ... | Seq HVal HVal
Both methods are used in real parsers. (Note that you'd normally pick one; you wouldn't use both.) To represent a sequence of three assignment statements, for example, the block method would do it directly as a list:
Block [Assign "a" (HInteger 1), Assign "b" (HInteger 2), Assign "c" (HInteger 3)]
while the two-statement sequence method would build a sort of nested tree:
Seq (Assign "a" (HInteger 1)) (Seq (Assign "b" (HInteger 2)
(Assign "c" (HInteger 3))
The appropriate parsers for these two alternatives, both of which return a plain HVal, might be:
-- use blocks
parseProgram1 :: Parser HVal
parseProgram1 = do
spaces
xs <- many (parseExpression <* spaces)
return $ Block xs
parseProgram2 :: Parser HVal
parseProgram2 = do
spaces
x <- parseExpression
spaces
(do xs <- parseProgram2
return $ Seq x xs)
<|> return x

Parsing juxtaposition-based, indentation-aware syntax using Text.Parsec.Layout

I'm trying to parse a small language with Haskell-like syntax, using parsec-layout. The two key features that don't seem to interact too well with each other are:
Function application syntax is juxtaposition, i.e. if F and E are terms, F E is the syntax for F applied to E.
Indentation can be used to denote nesting, i.e. the following two are equivalent:
X = case Y of
A -> V
B -> W
X = case Y of A -> V; B -> W
I haven't managed to figure out a combination of skipping and keeping whitespace that would allow me to parse a list of such definitions. Here's my simplified code:
import Text.Parsec hiding (space, runP)
import Text.Parsec.Layout
import Control.Monad (void)
type Parser = Parsec String LayoutEnv
data Term = Var String
| App Term Term
| Case Term [(String, Term)]
deriving Show
name :: Parser String
name = spaced $ (:) <$> upper <*> many alphaNum
kw :: String -> Parser ()
kw = void . spaced . string
reserved :: String -> Parser ()
reserved s = try $ spaced $ string s >> notFollowedBy alphaNum
term :: Parser Term
term = foldl1 App <$> part `sepBy1` space
where
part = choice [ caseBlock
, Var <$> name
]
caseBlock = Case <$> (reserved "case" *> term <* reserved "of") <*> laidout alt
alt = (,) <$> (name <* kw "->") <*> term
binding :: Parser (String, Term)
binding = (,) <$> (name <* kw "=") <*> term
-- https://github.com/luqui/parsec-layout/issues/1
trim :: String -> String
trim = reverse . dropWhile (== '\n') . reverse
runP :: Parser a -> String -> Either ParseError a
runP p = runParser (p <* eof) defaultLayoutEnv "" . trim
If I try to run it on input like
s = unlines [ "A = case B of"
, " X -> Y Z"
, "C = D"
]
via runP (laidout binding) s, it fails on the application Y Z:
(line 2, column 10):
expecting space or semi-colon
However, if I change the definition of term to
term = foldl1 App <$> many1 part
then it doesn't stop parsing the term at the start of the (unindented!) third line, leading to
(line 3, column 4):
expecting semi-colon
I think the problem has to do with that name already eliminates the following space, so the sepBy1 in the definition of term doesn't see it.
Consider these simplified versions of term:
term0 = foldl1 App <$> (Var <$> name) `sepBy1` space
term1 = foldl1 App <$> (Var <$> name') `sepBy1` space
name' = (:) <$> upper <*> many alphaNum
term3 = foldl1 App <$> many (Var <$> name)
Then:
runP term0 "A B C" -- fails
runP term1 "A B C" -- succeeds
runP term3 "A B C" -- succeeds
I think part of the solution is to define
part = [ caseBlock, Var <$> name' ]
where name' is as above. However, there are still some issues.

Haskell read variable name

I need to write a code that parses some language. I got stuck on parsing variable name - it can be anything that is at least 1 char long, starts with lowercase letter and can contain underscore '_' character. I think I made a good start with following code:
identToken :: Parser String
identToken = do
c <- letter
cs <- letdigs
return (c:cs)
where letter = satisfy isLetter
letdigs = munch isLetter +++ munch isDigit +++ munch underscore
num = satisfy isDigit
underscore = \x -> x == '_'
lowerCase = \x -> x `elem` ['a'..'z'] -- how to add this function to current code?
ident :: Parser Ident
ident = do
_ <- skipSpaces
s <- identToken
skipSpaces; return $ s
idents :: Parser Command
idents = do
skipSpaces; ids <- many1 ident
...
This function however gives me a weird results. If I call my test function
test_parseIdents :: String -> Either Error [Ident]
test_parseIdents p =
case readP_to_S prog p of
[(j, "")] -> Right j
[] -> Left InvalidParse
multipleRes -> Left (AmbiguousIdents multipleRes)
where
prog :: Parser [Ident]
prog = do
result <- many ident
eof
return result
like this:
test_parseIdents "test"
I get this:
Left (AmbiguousIdents [(["test"],""),(["t","est"],""),(["t","e","st"],""),
(["t","e","st"],""),(["t","est"],""),(["t","e","st"],""),(["t","e","st"],""),
(["t","e","s","t"],""),(["t","e","s","t"],""),(["t","e","s","t"],""),
(["t","e","s","t"],""),(["t","e","s","t"],""),(["t","e","s","t"],""),
(["t","e","s","t"],""),(["t","e","s","t"],""),(["t","e","s","t"],""),
(["t","e","s","t"],""),(["t","e","s","t"],""),(["t","e","s","t"],""),
(["t","e","s","t"],""),(["t","e","s","t"],""),(["t","e","s","t"],""),
(["t","e","s","t"],""),(["t","e","s","t"],""),(["t","e","s","t"],""),
(["t","e","s","t"],""),(["t","e","s","t"],""),(["t","e","s","t"],""),
(["t","e","s","t"],""),(["t","e","s","t"],""),(["t","e","s","t"],"")])
Note that Parser is just synonym for ReadP a.
I also want to encode in the parser that variable names should start with a lowercase character.
Thank you for your help.
Part of the problem is with your use of the +++ operator. The following code works for me:
import Data.Char
import Text.ParserCombinators.ReadP
type Parser a = ReadP a
type Ident = String
identToken :: Parser String
identToken = do c <- satisfy lowerCase
cs <- letdigs
return (c:cs)
where lowerCase = \x -> x `elem` ['a'..'z']
underscore = \x -> x == '_'
letdigs = munch (\c -> isLetter c || isDigit c || underscore c)
ident :: Parser Ident
ident = do _ <- skipSpaces
s <- identToken
skipSpaces
return s
test_parseIdents :: String -> Either String [Ident]
test_parseIdents p = case readP_to_S prog p of
[(j, "")] -> Right j
[] -> Left "Invalid parse"
multipleRes -> Left ("Ambiguous idents: " ++ show multipleRes)
where prog :: Parser [Ident]
prog = do result <- many ident
eof
return result
main = print $ test_parseIdents "test_1349_zefz"
So what went wrong:
+++ imposes an order on its arguments, and allows for multiple alternatives to succeed (symmetric choice). <++ is left-biased so only the left-most option succeeds -> this would remove the ambiguity in the parse, but still leaves the next problem.
Your parser was looking for letters first, then digits, and finally underscores. Digits after underscores failed, for example. The parser had to be modified to munch characters that were either letters, digits or underscores.
I also removed some functions that were unused and made an educated guess for the definition of your datatypes.

Unplanned greedy behaviour in uu-parsinglib

The problem
I came across a problem today and I do not know how to solve it. It is very strange to me, because the code I've written should (according to my current knowledge) is correct.
So below you can find a sample parser combinators. The most important one is pOperator, which in very simple way (only for demonstration purposes) builds an operator AST.
It consumes "x" and can consume multiple "x" separated by whitespaces.
I've got also pParens combinator which is defined like:
pPacked pParenL (pWSpaces *> pParenR)
so it consumes Whitespaces before closing bracket.
Sample input / output
The correct input/output SHOULD be:
in: "(x)"
out: Single "x"
in: "(x )"
out: Single "x"
but I'm getting:
in: "(x)"
out: Single "x"
in: "(x )"
out: Multi (Single "x") (Single "x")
-- Correcting steps:
-- Inserted 'x' at position LineColPos 0 3 3 expecting one of ['\t', ' ', 'x']
but in the second example I'm getting error - and the parser behaves like it greedy eats some tokens (and there is no greedy operation).
I would be thankful for any help with it.
Sample code
import Prelude hiding(lex)
import Data.Char hiding (Space)
import qualified Text.ParserCombinators.UU as UU
import Text.ParserCombinators.UU hiding(parse)
import qualified Text.ParserCombinators.UU.Utils as Utils
import Text.ParserCombinators.UU.BasicInstances hiding (Parser)
data El = Multi El El
| Single String
deriving (Show)
---------- Example core grammar ----------
pElement = Single <$> pSyms "x"
pOperator = applyAll <$> pElement <*> pMany (flip <$> (Multi <$ pWSpaces1) <*> pElement)
---------- Basic combinators ----------
applyAll x (f:fs) = applyAll (f x) fs
applyAll x [] = x
pSpace = pSym ' '
pTab = pSym '\t'
pWSpace = pSpace <|> pTab
pWSpaces = pMany pWSpace
pWSpaces1 = pMany1 pWSpace
pMany1 p = (:) <$> p <*> pMany p
pSyms [] = pReturn []
pSyms (x : xs) = (:) <$> pSym x <*> pSyms xs
pParenL = Utils.lexeme $ pSym '('
pParenR = Utils.lexeme $ pSym ')'
pParens = pPacked pParenL (pWSpaces *> pParenR)
---------- Program ----------
pProgram = pParens pOperator
-- if you replace it with following line, it works:
-- pProgram = pParens pElement
-- so it seems like something in pOperator is greedy
tests = [ ("test", "(x)")
, ("test", "(x )")
]
---------- Helpers ----------
type Parser a = P (Str Char String LineColPos) a
parse p s = UU.parse ( (,) <$> p <*> pEnd) (createStr (LineColPos 0 0 0) s)
main :: IO ()
main = do
mapM_ (\(desc, p) -> putStrLn ("\n=== " ++ desc ++ " ===") >> run pProgram p) tests
return ()
run :: Show t => Parser t -> String -> IO ()
run p inp = do let (a, errors) = parse p inp
putStrLn ("-- Result: \n" ++ show a)
if null errors then return ()
else do putStr ("-- Correcting steps: \n")
show_errors errors
putStrLn "-- "
where show_errors :: (Show a) => [a] -> IO ()
show_errors = sequence_ . (map (putStrLn . show))
IMPORTANT
pOperator = applyAll <$> pElement <*> pMany (flip <$> (Multi <$ pWSpaces1) <*> pElement)
is equivalent to:
foldr pChainl pElement (Multi <$ pWSpaces1)
according to: Combinator Parsing: A Short Tutorial
And it is used to define operator precedense.
The definition of pMany reads:
pMany :: IsParser p => p a -> p [a]
pMany p = pList p
and this suggest the solution. When seeing the space we should not commit immediately to the choice to continue with more x-es so we define:
pMany :: IsParser p => p a -> p [a]
pMany_ng p = pList_ng p
Of course you may also call pList_ng immediately. Even better would be to write:
pParens (pChainr_ng (pMulti <$ pWSpaces1) px) --
I did not test it since I am not sure whether between x-es there should be at least one space etc.
Doaitse

Resources