Parse a sub-string with parsec (by ignoring unmatched prefixes) - parsing

I would like to extract the repository name from the first line of git remote -v, which is usually of the form:
origin git#github.com:some-user/some-repo.git (fetch)
I quickly made the following parser using parsec:
-- | Parse the repository name from the output given by the first line of `git remote -v`.
repoNameFromRemoteP :: Parser String
repoNameFromRemoteP = do
_ <- originPart >> hostPart
_ <- char ':'
firstPart <- many1 alphaNum
_ <- char '/'
secondPart <- many1 alphaNum
_ <- string ".git"
return $ firstPart ++ "/" ++ secondPart
where
originPart = many1 alphaNum >> space
hostPart = many1 alphaNum
>> (string "#" <|> string "://")
>> many1 alphaNum `sepBy` char '.'
But this parser looks a bit awkward. Actually I'm only interested in whatever follows the colon (":"), and it would be easier if I could just write a parser for it.
Is there a way to have parsec skip a character upon a failed match, and re-try from the next position?

If I've understood the question, try many (noneOf ":"). This will consume any character until it sees a ':', then stop.
Edit: Seems I had not understood the question. You can use the try combinator to turn a parser which may consume some characters before failing into one that consumes no characters on a failure. So:
skipUntil p = try p <|> (anyChar >> skipUntil p)
Beware that this can be quite expensive, both in runtime (because it will try matching p at every position) and memory (because try prevents p from consuming characters and so the input cannot be garbage collected at all until p completes). You might be able to alleviate the first of those two problems by parameterizing the anyChar bit so that the caller could choose some cheap parser for finding candidate positions; e.g.
skipUntil p skipper = try p <|> (skipper >> skipUntil p skipper)
You could then potentially use the above many (noneOf ":") construction to only try p on positions that start with a :.

The
sepCap
combinator from
replace-megaparsec
can skip a character upon a failed match, and re-try from the next position.
Maybe this is overkill for your particular case, but it does solve the
general problem.
import Replace.Megaparsec
import Text.Megaparsec
import Text.Megaparsec.Char
import Data.Maybe
import Data.Either
username :: Parsec Void String String
username = do
void $ single ':'
some $ alphaNumChar <|> single '-'
listToMaybe . rights =<< parseMaybe (sepCap username)
"origin git#github.com:some-user/some-repo.git (fetch)"
Just "some-user"

Related

Problem while writing a small parser in Haskell using Parsec

I am trying to write a parser for a small language with the following piece of code
import Text.ParserCombinators.Parsec
import Text.Parsec.Token
data Exp = Atom String | Op String Exp
instance Show Exp where
show (Atom x) = x
show (Op f x) = f ++ "(" ++ (show x) ++ ")"
parse_exp :: Parser Exp
parse_exp = (try parse_atom) <|> parse_op
parse_atom :: Parser Exp
parse_atom = do
x <- many1 letter
return (Atom x)
parse_op :: Parser Exp
parse_op = do
x <- many1 letter
char '('
y <- parse_exp
char ')'
return (Op x y)
But when I type in ghci
>>> parse (parse_exp <* eof) "<error>" "s(t)"
I get the output
Left "<error>" (line 1, column 2):
unexpected '('
expecting letter or end of input
If I redefine parse_exp as
parse_exp = (try parse_op) <|> parse_atom
then with I get correct result
>>> parse (parse_exp <* eof) "<error>" "s(t)"
Right s(t)
But I am confused why the first one does not work. Is there a general fix to these kinds of problems in parsing?
When a Parsec parser, like parse_atom, is run on a particular string, there are four possible results:
It succeeds, consuming some input.
It fails, consuming some input.
It succeeds, consuming no input.
It fails, consuming no input.
In the Parsec source code, these are referred to as "consumed ok", "consumed err", "empty ok" and "empty err" (sometimes abbreviated cok, cerr, eok, eerr).
When two Parsec parsers are used in an alternative, like p <|> q, here's how it's parsed. First, Parsec tries to parse with p. Then:
If this results in "consumed ok" or "empty ok", the parse succeeds and this becomes the result of the entire parser p <|> q.
If this results in "empty err", Parsec tries the alternative q, and this becomes the result of the entire p <|> q parser.
If this results in "consumed err", the entire parser p <|> q fails with "consumed err" (cerr).
Note the critical difference between p returning cerr (which causes the whole parser to fail) versus returning eerr (which causes the alternative parser q to be tried).
The try function changes the behavior of a parser by converting a "cerr" result to an "eerr" result.
This means that if you are trying to parse the text "s(t)" with different parsers:
with the parser parse_atom <|> parse_op, the parser parse_atom returns "cok" consuming "s" and leaving unparseable text "(t)" which causes an error
with the parser try parse_atom <|> parse_op, the parser parse_atom still returns "cok" consuming "s", so the try (which only changes cerr to eerr) has no effect, and the unparseable text "(t)" causes the same error
with the parser parse_op <|> parse_atom, the parser parse_op successfully parses the string (actually, it doesn't because the recursive call to parse_exp can't parse "t", but let's ignore that); however, if the same parser was used on the text "s", then parse_op would consume the "s" before failing (i.e., cerr), causing the entire parse to fail instead of trying the alternative parse_atom
with the parser try parse_op <|> parse_atom, this would parse "s(t)", exactly as the previous example, and the try would have no effect; however, it would also work on the text "s", because parse_op would consume the "s" before failing with cerr, then try would "rescue" the parse by turning the cerr into an eerr, and the alternative parse_atom would be checked, successfully parsing (cok) the atom "s".
That's why the "correct" parser for your problem is try parse_op <|> parse_atom.
Be warned that this behavior isn't a fundamental aspect of monadic parsers. It's a design choice made by Parsec (and compatible parsers like Megaparsec). Other monadic parsers can have different rules for how alternatives with <|> work.
The "general fix" for these kind of Parsec parsing problems is to be aware of the facts that in the expression p <|> q:
p is tried first, and if it succeeds, q will be ignored, even if q would provide a "longer" or "better" or "more sensible" parse or avoid additional parsing errors further down the road. In parse_atom <|> parse_op, because parse_atom can succeed on strings meant for parse_op, this order won't work correctly.
q is only tried if p fails without consuming input. You must arrange for p to not consume anything on failure, possibly by using try, if you expect the alternative q to be checked. So, parse_op <|> parse_atom isn't going to work if parse_op starts to consume something (like an identifier) before realizing it can't continue and returning cerr.
As an alternative to using try, you can also think more carefully about the structure of your parser. An alternative way of writing parse_exp, for example, would be:
parse_exp :: Parser Exp
parse_exp = do
-- there's always an identifier
x <- many1 letter
-- there *might* be an expression in parentheses
y <- optionMaybe (parens parse_exp)
case y of
Nothing -> return (Atom x)
Just y' -> return (Op x y')
where parens = between (char '(') (char ')')
This can be written a little more concisely, but even then it's not as "elegant" as something like try parse_op <|> parse_atom. (It performs better, though, so that might be a consideration in some applications.)
The problem is that the string "s" counts as an atom according to your definitions. Try this:
parse parse_atom "" "s(t)"
> Atom "s"
So your parser parse_exp actually succeeds, returning Atom "s", but then you also expect an EOF right after it, and that's where it fails, encountering an open paren instead of an EOF (just like the error message says!)
When you swap the alternative around, it would first attempt parse_op, which would succeed, returning Op "s" "t", and then encounter EOF, just as expected.

Parsing multiple lines into a list of lists in Haskell

I am trying to parse a file that looks like:
a b c
f e d
I want to match each of the symbols in the line and parse everything into a list of lists such as:
[[A, B, C], [D, E, F]]
In order to do that I tried the following:
import Control.Monad
import Text.ParserCombinators.Parsec
import Text.ParserCombinators.Parsec.Language
import qualified Text.ParserCombinators.Parsec.Token as P
parserP :: Parser [[MyType]]
parserP = do
x <- rowP
xs <- many (newline >> rowP)
return (x : xs)
rowP :: Parser [MyType]
rowP = manyTill cellP $ void newline <|> eof
cellP :: Parser (Cell Color)
cellP = aP <|> bP <|> ... -- rest of the parsers, they all look very similar
aP :: Parser MyType
aP = symbol "a" >> return A
bP :: Parser MyType
bP = symbol "b" >> return B
lexer = P.makeTokenParser emptyDef
symbol = P.symbol lexer
But it fails to return multiple inner lists. Instead what I get is:
[[A, B, C, D, E, F]]
What am I doing wrong? I was expecting manyTill to parse cellP until the newline character, but that's not the case.
Parser combinators are overkill for something this simple. I'd use lines :: String -> [String] and words :: String -> [String] to break up the input and then map the individual tokens into MyTypes.
toMyType :: String -> Maybe MyType
toMyType "a" = Just A
toMyType "b" = Just B
toMyType "c" = Just C
toMyType _ = Nothing
parseMyType :: String -> Maybe [[MyType]]
parseMyType = traverse (traverse toMyType) . fmap words . lines
You're right that manyTill keeps parsing until a newline. But manyTill never gets to see the newline because cellP is too eager. cellP ends up calling P.symbol, whose documentation states
symbol :: String -> ParsecT s u m String
Lexeme parser symbol s parses string s and skips trailing white space.
The keyword there is 'white space'. It turns out, Parsec defines whitespace as being any character which satisfies isSpace, which includes newlines. So P.symbol is happily consuming the c, followed by the space and the newline, and then manyTill looks and doesn't see a newline because it's already been consumed.
If you want to drop the Parsec routine, go with Benjamin's solution. But if you're determined to stick with that, the basic idea is that you want to modify the language's whiteSpace field to correctly define whitespace to not be newlines. Something like
lexer = let lexer0 = P.makeTokenParser emptyDef
in lexer0 { whiteSpace = void $ many (oneOf " \t") }
That's pseudocode and probably won't work for your specific case, but the idea is there. You want to change the definition of whiteSpace to be whatever you want to define as whiteSpace, not what the system defines by default. Note that changing this will also break your comment syntax, if you have one defined, since whiteSpace was previously equipped to handle comments.
In short, Benjamin's answer is probably the best way to go. There's no real reason to use Parsec here. But it's also helpful to know why this particular solution didn't work: Parsec's default definition of a language wasn't designed to treat newlines with significance.

How to write a parsec parser for a list of interspersed elements?

Let's say the input looks something like foo#1 bar baz-3.qux [...]. I want to write a parser that only consumes the input up until the first space before the [, which means foo#1 bar baz-3.qux (without the trailing space).
How should I approach this using parsec?
I can imagine something like
foo = many1 $ letter <|> digit <|> oneOf " #-."
but this consumes even the space at the end, which I'd like to avoid. What is a general approach to parsing a list of things interspersed with another thing? (Imagine it's not just a space, but something that would also need to be parsed).
P.S: I'm looking for the most general solution possible, not a clever hack that solves this particular example.
I think what you're looking for is exactly notFollowedBy. Something like
foo = many1 $ letter
<|> digit
<|> oneOf "#-."
<|> (try $ char ' ' >> notFollowedBy (char '[') >> return ' ')
You can abstract out the pattern to get the general function of course:
endedBy :: (Show y) => Parser x -> Parser x -> Parser y -> Parser [x]
endedBy p final terminal = many1 $ p <|> t where
t = try $ do
x <- final
notFollowedBy terminal
return x
foo' = endedBy (letter <|> digit <|> oneOf "#-.") (char ' ') (char '[')

Make a parser ignore all redundant whitespace

Say I have a Parser p in Parsec and I want to specify that I want to ignore all superfluous/redundant white space in p. Let's for example say that I define a list as starting with "[", end with "]", and in the list are integers separated by white space. But I don't want any errors if there are white space in front of the "[", after the "]", in between the "[" and the first integer, and so on.
In my case, I want this to work for my parser for a toy programming language.
I will update with code if that is requested/necessary.
Just surround everything with space:
parseIntList :: Parsec String u [Int]
parseIntList = do
spaces
char '['
spaces
first <- many1 digit
rest <- many $ try $ do
spaces
char ','
spaces
many1 digit
spaces
char ']'
return $ map read $ first : rest
This is a very basic one, there are cases where it'll fail (such as an empty list) but it's a good start towards getting something to work.
#Joehillen's suggestion will also work, but it requires some more type magic to use the token features of Parsec. The definition of spaces matches 0 or more characters that satisfies Data.Char.isSpace, which is all the standard ASCII space characters.
Use combinators to say what you mean:
import Control.Applicative
import Text.Parsec
import Text.Parsec.String
program :: Parser [[Int]]
program = spaces *> many1 term <* eof
term :: Parser [Int]
term = list
list :: Parser [Int]
list = between listBegin listEnd (number `sepBy` listSeparator)
listBegin, listEnd, listSeparator :: Parser Char
listBegin = lexeme (char '[')
listEnd = lexeme (char ']')
listSeparator = lexeme (char ',')
lexeme :: Parser a -> Parser a
lexeme parser = parser <* spaces
number :: Parser Int
number = lexeme $ do
digits <- many1 digit
return (read digits :: Int)
Try it out:
λ :l Parse.hs
Ok, modules loaded: Main.
λ parseTest program " [1, 2, 3] [4, 5, 6] "
[[1,2,3],[4,5,6]]
This lexeme combinator takes a parser and allows arbitrary whitespace after it. Then you only need to use lexeme around the primitive tokens in your language such as listSeparator and number.
Alternatively, you can parse the stream of characters into a stream of tokens, then parse the stream of tokens into a parse tree. That way, both the lexer and parser can be greatly simplified. It’s only worth doing for larger grammars, though, where maintainability is a concern; and you have to use some of the lower-level Parsec API such as tokenPrim.

Correct ReadP usage in Haskell

I did a very simple parser for lists of numbers in a file, using ReadP in Haskell. It works, but it is very slow... is this normal behavior of this type of parser or am I doing something wrong?
import Text.ParserCombinators.ReadP
import qualified Data.IntSet as IntSet
import Data.Char
setsReader :: ReadP [ IntSet.IntSet ]
setsReader =
setReader `sepBy` ( char '\n' )
innocentWhitespace :: ReadP ()
innocentWhitespace =
skipMany $ (char ' ') <++ (char '\t' )
setReader :: ReadP IntSet.IntSet
setReader = do
innocentWhitespace
int_list <- integerReader `sepBy1` innocentWhitespace
innocentWhitespace
return $ IntSet.fromList int_list
integerReader :: ReadP Int
integerReader = do
digits <- many1 $ satisfy isDigit
return $ read digits
readClusters:: String -> IO [ IntSet.IntSet ]
readClusters filename = do
whole_file <- readFile filename
return $ ( fst . last ) $ readP_to_S setsReader whole_file
setReader has exponential behavior, because it is allowing the whitespace between the numbers to be optional. So for the line:
12 34 56
It is seeing these parses:
[1,2,3,4,5,6]
[12,3,4,5,6]
[1,2,34,5,6]
[12,34,5,6]
[1,2,3,4,56]
[12,3,4,56]
[1,2,34,56]
[12,34,56]
You could see how this could get out of hand for long lines. ReadP returns all valid parses in increasing length order, so to get to the last parse you have to traverse through all these intermediate parses. Change:
int_list <- integerReader `sepBy1` innocentWhitespace
To:
int_list <- integerReader `sepBy1` mandatoryWhitespace
For a suitable definition of mandatoryWhitespace to squash this exponential behavior. The parsing strategy used by parsec is more resistant to this kind of error, because it is greedy -- once it consumes input in a given branch, it is committed to that branch and never goes back (unless you explicitly asked it to). So once it correctly parsed 12, it would never go back to parse 1 2. Of course that means it matters in which order you state your choices, which I always find to be a bit of a pain to think about.
Also I would use:
head [ x | (x,"") <- readP_to_S setsReader whole_file ]
To extract a valid whole-file parse, in case it very quickly consumed all input but there were a hundred bazillion ways to interpret that input. If you don't care about the ambiguity, you would probably rather it return the first one than the last one, because the first one will arrive faster.

Resources