Parsing single quote char in a single-quoted string in parsec - parsing

I've got a silly situation in my parsec parsers that I would like your help on.
I need to parse a sequence of strings / chars that are separated by | characters.
So, we could have a|b|'c'|'abcd'
which should be turned into
[a,b,c,abcd]
Spaces are not allowed, except inside a '...' string. With my naïve attempt, I can parse strings like a'a|'bb' to [a'a,bb] but not aa|'b'b' to [aa,b'b].
singleQuotedChar :: Parser Char
singleQuotedChar = noneOf "'" <|> try (string "''" >> return '\'')
simpleLabel = do
    whiteSpace haskelldef
    lab <- many1 (noneOf "|")
    return $ lab
quotedLabel = do
    whiteSpace haskelldef
    char '\''
    lab <- many singleQuotedChar
    char '\''
    return $ lab
Now, how do I tell the parser to treat a ' as the closing quote only if it is followed by a | or white space?
(Or, get some ' char counting into this). The input is user generated, so I cannot rely on them \'-ing chars.

Note that allowing a quote in the middle of a string delimited by quotes is very confusing to read, but I believe this should allow you to parse it.
quotedLabel = do -- reads the first quote.
    whiteSpace
    char '\''
    quotedLabel2
quotedLabel2 = do -- reads the string and the finishing quote.
    lab <- many singleQuotedChar
    try (do more <- quotedLabel3
            return $ lttrace "quotedLabel2" (lab ++ more))
      <|> (do char '\''
              return $ lttrace "quotedLabel2" lab)
quotedLabel3 = do -- handle middle quotes
    char '\''
    lookAhead $ noneOf ['|']
    ret <- quotedLabel2
    return $ lttrace "quotedLabel3" $ "'" ++ ret
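A different way to express the same idea is to define what a closing quote looks like and read characters until it appears. The following is only a sketch under assumptions that are not in the question: a closing quote is a ' followed by | or by the end of the input, unquoted labels may not contain spaces, and the whiteSpace handling is left out. The names closingQuote and labels are made up for this example.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- A closing quote: a ' that is followed by '|' or by the end of input.
-- lookAhead checks the follower without consuming it; try undoes the
-- consumed ' when the follower does not match.
closingQuote :: Parser ()
closingQuote = try (char '\'' *> lookAhead (eof <|> (() <$ char '|')))

-- Inside quotes every character is accepted until closingQuote succeeds,
-- so a ' in the middle of the label is kept as an ordinary character.
quotedLabel :: Parser String
quotedLabel = char '\'' *> manyTill anyChar closingQuote

-- Unquoted labels: anything except '|' and space (assumption).
simpleLabel :: Parser String
simpleLabel = many1 (noneOf "| ")

-- The whole input: labels separated by '|'.
labels :: Parser [String]
labels = (quotedLabel <|> simpleLabel) `sepBy1` char '|' <* eof

-- parse labels "" "aa|'b'b'"        gives Right ["aa","b'b"]
-- parse labels "" "a'a|'bb'"        gives Right ["a'a","bb"]
-- parse labels "" "a|b|'c'|'abcd'"  gives Right ["a","b","c","abcd"]
```

Because quotedLabel is tried first and fails without consuming input when the label does not start with ', a quote in the middle of an unquoted label (a'a) is still handled by simpleLabel.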

Related

parsec: stopping at empty line

I would like to solve the following task with parsec, although splitOn "\n\n" is probably the simpler answer.
I have an input string like
testInput = unlines ["ab", "cd", "", "e"] -- "ab\ncd\n\ne"
The parser shall stop when encountering an empty line.
I tried this
import Text.ParserCombinators.Parsec
inputFileP :: GenParser Char st String
inputFileP = many (lower <|> delimP)
delimP :: GenParser Char st Char
delimP = do
    x <- char '\n'
    notFollowedBy (char '\n')
    return x
This fails with unexpected '\n'.
Why?
I was under the impression that many x parses x until it fails and then stops.
This is only the case if x fails without consuming any input. If x fails after consuming input, the whole parse will fail unless there's a try somewhere (this isn't just specific to many: x <|> y would also fail in that case even if y would succeed). In your case delimP fails on the notFollowedBy (char '\n') after already consuming the first \n, so the whole parse fails.
To change this behaviour, you need to explicitly enable backtracking using try like this:
delimP = try $ do
    x <- char '\n'
    notFollowedBy (char '\n')
    return x
Alternatively, you can make it so that delimP fails without consuming any input (and thus no need for try) by making it look ahead by two characters before matching the \n:
delimP = do
    notFollowedBy (string "\n\n")
    char '\n'
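To see the try version at work on the input from the question, here is a small self-contained sketch. It uses the Text.Parsec module names rather than the older Text.ParserCombinators.Parsec ones, and firstBlock is a name made up for this example.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- A newline that is not followed by a second newline. Without try, the
-- failure of notFollowedBy would come after '\n' was consumed and would
-- abort the surrounding many.
delimP :: Parser Char
delimP = try (char '\n' <* notFollowedBy (char '\n'))

-- Lowercase letters and single newlines, up to the first empty line.
inputFileP :: Parser String
inputFileP = many (lower <|> delimP)

-- Stops before the "\n\n" that marks the empty line.
firstBlock :: Either ParseError String
firstBlock = parse inputFileP "" "ab\ncd\n\ne"
-- firstBlock is Right "ab\ncd"
```

The rest of the input ("\n\ne") is left unconsumed, so a following parser can pick up at the empty line.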

Parse a sub-string with parsec (by ignoring unmatched prefixes)

I would like to extract the repository name from the first line of git remote -v, which is usually of the form:
origin git@github.com:some-user/some-repo.git (fetch)
I quickly made the following parser using parsec:
-- | Parse the repository name from the output given by the first line of `git remote -v`.
repoNameFromRemoteP :: Parser String
repoNameFromRemoteP = do
    _ <- originPart >> hostPart
    _ <- char ':'
    firstPart <- many1 alphaNum
    _ <- char '/'
    secondPart <- many1 alphaNum
    _ <- string ".git"
    return $ firstPart ++ "/" ++ secondPart
  where
    originPart = many1 alphaNum >> space
    hostPart = many1 alphaNum
        >> (string "@" <|> string "://")
        >> many1 alphaNum `sepBy` char '.'
But this parser looks a bit awkward. Actually I'm only interested in whatever follows the colon (":"), and it would be easier if I could just write a parser for it.
Is there a way to have parsec skip a character upon a failed match, and re-try from the next position?
If I've understood the question, try many (noneOf ":"). This will consume any character until it sees a ':', then stop.
Edit: Seems I had not understood the question. You can use the try combinator to turn a parser which may consume some characters before failing into one that consumes no characters on a failure. So:
skipUntil p = try p <|> (anyChar >> skipUntil p)
Beware that this can be quite expensive, both in runtime (because it will try matching p at every position) and memory (because try prevents p from consuming characters and so the input cannot be garbage collected at all until p completes). You might be able to alleviate the first of those two problems by parameterizing the anyChar bit so that the caller could choose some cheap parser for finding candidate positions; e.g.
skipUntil p skipper = try p <|> (skipper >> skipUntil p skipper)
You could then potentially use the above many (noneOf ":") construction to only try p on positions that start with a :.
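As a rough sketch of how the parameterized version might be used on the git remote line: repoName and toNextColon below are names invented for this example, and allowing '-' in the user and repository names is an assumption. One detail worth noting is that the skipper must consume at least one character, otherwise skipUntil would loop forever when p fails at a ':' position, so the skipper here takes one character first and then runs up to the next ':'.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- Try p at the current position; on failure, let skipper advance
-- and try again from there.
skipUntil :: Parser a -> Parser () -> Parser a
skipUntil p skipper = try p <|> (skipper *> skipUntil p skipper)

-- Hypothetical parser for the interesting part: ":user/repo.git".
repoName :: Parser String
repoName = do
  _ <- char ':'
  user <- many1 (alphaNum <|> char '-')
  _ <- char '/'
  repo <- many1 (alphaNum <|> char '-')
  _ <- string ".git"
  return (user ++ "/" ++ repo)

-- Consume one character, then everything up to (not including) the next ':'.
toNextColon :: Parser ()
toNextColon = anyChar *> skipMany (noneOf ":")

-- parse (skipUntil repoName toNextColon) ""
--       "origin git@github.com:some-user/some-repo.git (fetch)"
-- gives Right "some-user/some-repo"
```

If no position matches, toNextColon eventually fails at the end of the input and the whole parser fails instead of looping.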
The sepCap combinator from replace-megaparsec can skip a character upon a failed match, and re-try from the next position. Maybe this is overkill for your particular case, but it does solve the general problem.
import Replace.Megaparsec
import Text.Megaparsec
import Text.Megaparsec.Char
import Control.Monad (void)
import Data.Maybe
import Data.Either
import Data.Void (Void)
username :: Parsec Void String String
username = do
    void $ single ':'
    some $ alphaNumChar <|> single '-'
listToMaybe . rights =<< parseMaybe (sepCap username)
    "origin git@github.com:some-user/some-repo.git (fetch)"
Just "some-user"

Designing parsing code using Parsec

In the course of following the tutorial Write Yourself a Scheme in 48 Hours, I was attempting to enhance my parsing code to support hexadecimal, octal, binary and decimal literals.
import Text.ParserCombinators.Parsec hiding (spaces)
import Control.Monad
import Numeric
hexChar :: Parser Char
hexChar = ...
octChar :: Parser Char
octChar = ...
hexNumber :: Parser Integer
hexNumber = do
    char '#'
    char 'x'
    s <- many1 hexChar
    return $ (fst . head) (readHex s)
octNumber :: Parser Integer
octNumber = do
    char '#'
    char 'o'
    s <- many1 octChar
    return $ (fst . head) (readOct s)
If we forget about decimal and binary numbers in this discussion:
parseNumber :: Parser Integer
parseNumber = hexNumber <|> octNumber
Then this parser will fail to recognize octal numbers. This seems to be related to the number of lookahead characters required to tell an octal number apart from a hexadecimal one (if we drop the leading '#' in the syntax, then the parser will work). Hence it seems we are forced to revisit the code and 'factorize' the leading '#', so to speak, by dropping the char '#' in the individual parsers and defining:
parseNumber = char '#' >> (hexNumber <|> octNumber)
This is fine, but I find the code less pleasant. Somehow, if I have a function called hexNumber I would expect it to recognize #xffff (which is proper Scheme syntax) and not xffff. Is this something I have to live with, or are there ways to get around this 'forced factorization' of the leading character #?
If the first argument of (<|>) fails after having consumed some input, then it fails immediately without trying the second alternative. If a failure in the first argument should lead to a retry with the second argument, you can use try to avoid consuming input. In hexNumber you must consume '#' only if the following character matches 'x'.
hexNumber :: Parser Integer
hexNumber = do
    try $ char '#' >> char 'x'
    s <- many1 hexChar
    return $ (fst . head) (readHex s)
octNumber :: Parser Integer
octNumber = do
    try $ char '#' >> char 'o'
    s <- many1 octChar
    return $ (fst . head) (readOct s)
Note that this is somewhat inefficient since you parse '#' twice, and it gets worse as the common prefix gets longer and more complex.
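If the repetition between the two parsers is the bothersome part, the shared shape can be pulled into one helper while hexNumber and octNumber still accept the full #x... and #o... syntax. This is only a sketch: radixLiteral is a name invented here, Parsec's built-in hexDigit and octDigit stand in for the elided hexChar and octChar, and the result of readHex / readOct is pattern matched instead of using head so that a bad literal becomes a parse error rather than a crash.

```haskell
import Numeric (readHex, readOct)
import Text.Parsec
import Text.Parsec.String (Parser)

-- Hypothetical helper: '#', a tag character, then digits read in some radix.
-- try covers only the two-character prefix, so a wrong tag backtracks
-- to before the '#'.
radixLiteral :: Char -> Parser Char -> ReadS Integer -> Parser Integer
radixLiteral tag digitP reader = do
  _ <- try (char '#' *> char tag)
  ds <- many1 digitP
  case reader ds of
    [(n, "")] -> return n
    _         -> parserFail "malformed numeric literal"

hexNumber, octNumber :: Parser Integer
hexNumber = radixLiteral 'x' hexDigit readHex
octNumber = radixLiteral 'o' octDigit readOct

parseNumber :: Parser Integer
parseNumber = hexNumber <|> octNumber

-- parse parseNumber "" "#xff"  gives Right 255
-- parse parseNumber "" "#o17"  gives Right 15
```

The '#' is still examined once per alternative, as noted above; the helper only removes the duplicated code, not the repeated lookahead.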

How to write a parsec parser for a list of interspersed elements?

Let's say the input looks something like foo#1 bar baz-3.qux [...]. I want to write a parser that only consumes the input up until the first space before the [, which means foo#1 bar baz-3.qux (without the trailing space).
How should I approach this using parsec?
I can imagine something like
foo = many1 $ letter <|> digit <|> oneOf " #-."
but this consumes even the space at the end, which I'd like to avoid. What is a general approach to parsing a list of things interspersed with another thing? (Imagine it's not just a space, but something that would also need to be parsed).
P.S: I'm looking for the most general solution possible, not a clever hack that solves this particular example.
I think what you're looking for is exactly notFollowedBy. Something like
foo = many1 $ letter
    <|> digit
    <|> oneOf "#-."
    <|> (try $ char ' ' >> notFollowedBy (char '[') >> return ' ')
You can abstract out the pattern to get the general function of course:
endedBy :: (Show y) => Parser x -> Parser x -> Parser y -> Parser [x]
endedBy p final terminal = many1 $ p <|> t where
    t = try $ do
        x <- final
        notFollowedBy terminal
        return x
foo' = endedBy (letter <|> digit <|> oneOf "#-.") (char ' ') (char '[')
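When the interspersed thing is a real separator, another option is to parse the elements and the separator as distinct parsers and combine them with sepBy1, putting the try and notFollowedBy on the separator. This is a sketch rather than a drop-in replacement: it returns the words as a list instead of one string, and word, wordsUntilBracket and demo are names made up for the example.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- One element of the list.
word :: Parser String
word = many1 (letter <|> digit <|> oneOf "#-.")

-- A separator is a space that does not introduce the '[' part.
-- try makes the failed separator give back the space it consumed.
separator :: Parser Char
separator = try (char ' ' <* notFollowedBy (char '['))

wordsUntilBracket :: Parser [String]
wordsUntilBracket = word `sepBy1` separator

-- Returns the parsed words together with the unconsumed rest of the input.
demo :: Either ParseError ([String], String)
demo = parse ((,) <$> wordsUntilBracket <*> getInput) "" "foo#1 bar baz-3.qux [...]"
-- demo is Right (["foo#1","bar","baz-3.qux"]," [...]")
```

The trailing space stays in the input because the last separator attempt backtracks, which is the behaviour asked for in the question.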

Make a parser ignore all redundant whitespace

Say I have a parser p in Parsec and I want it to ignore all superfluous/redundant white space. For example, say I define a list as starting with "[", ending with "]", and containing integers separated by white space. I don't want any errors if there is white space in front of the "[", after the "]", between the "[" and the first integer, and so on.
In my case, I want this to work for my parser for a toy programming language.
I will update with code if that is requested/necessary.
Just surround everything with spaces:
parseIntList :: Parsec String u [Int]
parseIntList = do
    spaces
    char '['
    spaces
    first <- many1 digit
    rest <- many $ try $ do
        spaces
        char ','
        spaces
        many1 digit
    spaces
    char ']'
    return $ map read $ first : rest
This is a very basic one, there are cases where it'll fail (such as an empty list) but it's a good start towards getting something to work.
@Joehillen's suggestion will also work, but it requires some more type magic to use the token features of Parsec. The definition of spaces matches zero or more characters that satisfy Data.Char.isSpace, which covers all the standard ASCII whitespace characters.
Use combinators to say what you mean:
import Control.Applicative
import Text.Parsec
import Text.Parsec.String
program :: Parser [[Int]]
program = spaces *> many1 term <* eof
term :: Parser [Int]
term = list
list :: Parser [Int]
list = between listBegin listEnd (number `sepBy` listSeparator)
listBegin, listEnd, listSeparator :: Parser Char
listBegin = lexeme (char '[')
listEnd = lexeme (char ']')
listSeparator = lexeme (char ',')
lexeme :: Parser a -> Parser a
lexeme parser = parser <* spaces
number :: Parser Int
number = lexeme $ do
    digits <- many1 digit
    return (read digits :: Int)
Try it out:
λ :l Parse.hs
Ok, modules loaded: Main.
λ parseTest program " [1, 2, 3] [4, 5, 6] "
[[1,2,3],[4,5,6]]
This lexeme combinator takes a parser and allows arbitrary whitespace after it. Then you only need to use lexeme around the primitive tokens in your language such as listSeparator and number.
Alternatively, you can parse the stream of characters into a stream of tokens, then parse the stream of tokens into a parse tree. That way, both the lexer and parser can be greatly simplified. It’s only worth doing for larger grammars, though, where maintainability is a concern; and you have to use some of the lower-level Parsec API such as tokenPrim.
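A minimal sketch of that two-stage approach for the same list grammar might look as follows. Everything here except tokenPrim and the standard Parsec combinators is invented for illustration (the Token type, lexer, satisfyTok, and so on), and error messages and source names are kept deliberately simple.

```haskell
import Text.Parsec

-- Stage 1: characters to tokens. Whitespace is dropped here, once.
data Token = TOpen | TClose | TComma | TInt Int
  deriving (Eq, Show)

tokenP :: Parsec String () (SourcePos, Token)
tokenP = (,) <$> getPosition <*> oneTok <* spaces
  where
    oneTok = choice
      [ TOpen  <$ char '['
      , TClose <$ char ']'
      , TComma <$ char ','
      , TInt . read <$> many1 digit
      ]

lexer :: Parsec String () [(SourcePos, Token)]
lexer = spaces *> many tokenP <* eof

-- Stage 2: tokens to values. No whitespace handling is needed any more.
type TokParser a = Parsec [(SourcePos, Token)] () a

-- Accept one token if f maps it to Just; positions come from the lexer.
satisfyTok :: (Token -> Maybe a) -> TokParser a
satisfyTok f = tokenPrim (show . snd) nextPos (f . snd)
  where
    nextPos _   _ ((p, _) : _) = p
    nextPos pos _ []           = pos

tok :: Token -> TokParser ()
tok t = satisfyTok (\t' -> if t == t' then Just () else Nothing)

int :: TokParser Int
int = satisfyTok (\t -> case t of
                          TInt n -> Just n
                          _      -> Nothing)

list :: TokParser [Int]
list = between (tok TOpen) (tok TClose) (int `sepBy` tok TComma)

parseLists :: String -> Either ParseError [[Int]]
parseLists s = parse lexer "" s >>= parse (many1 list <* eof) ""

-- parseLists " [1, 2, 3] [4,5] "  gives Right [[1,2,3],[4,5]]
```

For a grammar this small the lexeme version is shorter; the split starts to pay off once there are keywords, comments, and other lexical rules that would otherwise leak into every parser.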

Resources