I've written some parsers using Attoparsec but only now realised that I don't always want them to backtrack on failure, but attoparsec parsers always backtrack on failure.
Is there a way to force a parser not to backtrack?
For example, this attoparsec parser will succeed when given the input "for":
string "foo" <|> string "for"
A parsec parser would not succeed on that input and I want to emulate this behaviour using an attoparsec parser.
Related
I'm looking at the following approach to using parser combinators in Haskell. The author gives the following example of Parser Combinators:
windSpeed :: String -> Maybe Int
windSpeed windInfo =
parseMaybe windSpeedParser windInfo
windSpeedParser :: ReadP Int
windSpeedParser = do
direction <- numbers 3
speed <- numbers 2 <|> numbers 3
unit <- string "KT" <|> string "MPS"
return speed
The author gives the following reasons for this approach:
easy to read (I agree with this)
similar format to the specification ie the parser itself is basically a description of what it parses (I agree with this)
I can't help but feel I'm missing some of the reasons for choosing parser combinators. Some benefit of either using Haskell, compile-time guarantees, elimination of runtime errors. Or some subsequent benefit when you starting parsing DSLs and using free monads.
My question is: What are the reasons for using parser combinators?
I see several benefits of using parser combinators:
Parser combinators are a generalization of hand-written top-down parsers. In the case that you hand-write a parser, use parser combinators to abstract away common patterns.
Unlike parser generators, parser combinators are potentially dynamic, allowing for decisions during runtime. This aspect may be useful if the language's grammar may be redefined based on the input.
Parsers are first-class objects.
I can't seem to find the parsing expression grammar (PEG) of PEG itself.
How to parse a parsing expression grammar?
Note that this question is not about how to construct a recursive decent parser from a PEG, but rather for parsing a PEG.
The PEG Paper ("Parsing Expression Grammars: A Recognition-Based Syntactic Foundation") includes the grammar for PEGs.
I think I understand (roughly) how recursive descent parsers (e.g. Scala's Parser Combinators) work: You parse the input string with one parser, and that parser calls other, smaller parsers for each "part" of the whole input, and so on, until you reach the low level parsers which directly generate the AST from fragments of the input string
I also think I understand how Lexing/Parsing works: you first run a lexer to break the whole input into a flat list of tokens, and you then run a parser to take the token list and generate an AST.
However, I do not understand is how the Lex/Parse strategy deals with cases where exactly how you tokenize something depends on the tokens that were tokenized earlier. For example, if I take a chunk of XML:
"<tag attr='moo' omg='wtf'>attr='moo' omg='wtf'</tag>"
A recursive descent parser may take this and break it down (each subsequent indent represents the decomposition of the parent string)
"<tag attr='moo' omg='wtf'>attr='moo' omg='wtf'</tag>"
-> "<tag attr='moo' omg='wtf'>"
-> "<tag"
-> "attr='moo'"
-> "attr"
-> "="
-> "moo"
-> "omg='wtf'"
-> "omg"
-> "="
-> "wtf"
-> ">"
-> "attr='moo' omg='wtf'"
-> "</tag>"
And the small parsers which individually parse <tag, attr="moo", etc. would then construct a representation of an XML tag and add attributes to it.
However, how does a single-step Lex/Parse work? How does the Lexer know that the string after <tag and before > must be tokenized into separate attributes, while the string between > and </tag> does not need to be? Wouldn't it need the Parser to tell it that the first string is within a tag body, and the second case is outside a tag body?
EDIT: Changed the example to make it clearer
Typically the lexer will have a "mode" or "state" setting, which changes according to the input. For example, on seeing a < character, the mode would change to "tag" mode, and the lexer would tokenize appropriately until it sees a >. Then it would enter "contents" mode, and the lexer would return all of attr='moo' omg='wtf' as a single string. Programming language lexers, for example, handle string literals this way:
string s1 = "y = x+5";
The y = x+5 would never be handled as a mathematical expression and then turned back into a string. It's recognized as a string literal, because the " changes the lexer mode.
For languages like XML and HTML, it's probably easier to build a custom parser than to use one of the parser generators like yacc, bison, or ANTLR. They have a different structure than programming languages, which are a better fit for the automatic tools.
If your parser needs to turn a list of tokens back into the string it came from, that's a sign that something is wrong in the design. You need to parse it a different way.
How does the Lexer know that the string after must
be tokenized into separate attributes, while the string between > and
does not need to be?
It doesn't.
Wouldn't it need the Parser to tell it that the first string is within
a tag body, and the second case is outside a tag body?
Yes.
Generally, the lexer turns the input stream into a sequence of tokens. A token has no context - that is, a token has the same meaning no matter where it occurs in the input stream. Once the lexing process has completed, each token is treated as a single unit.
For XML, a generated lexer would typically identify integers, identifiers, string literal and so on as well as the control characters, like '<' and '>' but not a whole tag. The work of understanding what is an open tag, close tag, attribute, element, etc., is left to the parser proper.
When you look at the EBNF description of a language, you often see a definition for integers and real numbers:
integer ::= digit digit* // Accepts numbers with a 0 prefix
real ::= integer "." integer (('e'|'E') integer)?
(Definitions were made on the fly, I have probably made a mistake in them).
Although they appear in the context-free grammar, numbers are often recognized in the lexical analysis phase. Are they included in the language definition to make it more complete and it is up to the implementer to realize that they should actually be in the scanner?
Many common parser generator tools -- such as ANTLR, Lex/YACC -- separate parsing into two phases: first, the input string is tokenized. Second, the tokens are combined into productions to create a concrete syntax tree.
However, there are alternative techniques that do not require tokenization: check out backtracking recursive-descent parsers. For such a parser, tokens are defined in a similar way to non-tokens. pyparsing is a parser generator for such parsers.
The advantage of the two-step technique is that it usually produces more efficient parsers -- with tokens, there's a lot less string manipulation, string searching, and backtracking.
According to "The Definitive ANTLR Reference" (Terence Parr),
The only difference between [lexers and parsers] is that the parser recognizes grammatical structure in a stream of tokens while the lexer recognizes structure in a stream of characters.
The grammar syntax needs to be complete to be precise, so of course it includes details as to the precise format of identifiers and the spelling of operators.
Yes, the compiler engineer decides but generally it is pretty obvious. You want the lexer to handle all the character-level detail efficiently.
There's a longer answer at Is it a Lexer's Job to Parse Numbers and Strings?
I have implemented combinatorial GLR parsers. Among them there are:
char(·) parser which consumes specified character or range of characters.
many(·) combinator which repeats specified parser from zero to infinite times.
Example: "char('a').many()" will match a string with any number of "a"-s.
But many(·) combinator is greedy, so, for example, char('{') >> char('{') >> char('a'..'z').many() >> char('}') >> char('}') (where ">>" is sequential chaining of parsers) will successfully consume the whole "{{foo}}some{{bar}}" string.
I want to implement the lazy version of many(·) which, being used in previous example, will consume "{{foo}}" only. How can I do that?
Edit:
May be I confused ya all. In my program a parser is a function (or "functor" in terms of C++) which accepts a "step" and returns forest of "steps". A "step" may be of OK type (that means that parser has consumed part of input successfully) and FAIL type (that means the parser has encountered error). There are more types of steps but they are auxiliary.
Parser = f(Step) -> Collection of TreeNodes of Steps.
So when I parse input, I:
Compose simple predefined Parser functions to get complex Parser function representing required grammar.
Form initial Step from the input.
Give the initial Step to the complex Parser function.
Filter TreeNodes with Steps, leaving only OK ones (or with minimum FAIL-s if there were errors in input).
Gather information from Steps which were left.
I have implemented and have been using GLR parsers for 15 years as language front ends for a program transformation system.
I don't know what a "combinatorial GLR parser" is, and I'm unfamiliar with your notation so I'm not quite sure how to interpret it. I assume this is some kind of curried function notation? I'm imagining your combinator rules are equivalent to definining a grammer in terms of terminal characters, where "char('a').many" corresponds to grammar rules:
char = "a" ;
char = char "a" ;
GLR parsers, indeed, produce all possible parses. The key insight to GLR parsing is its psuedo-parallel processing of all possible parses. If your "combinators" can propose multiple parses (that is, they produce grammar rules sort of equivalent to the above), and you indeed have them connected to a GLR parser, they will all get tried, and only those sequences of productions that tile the text will survive (meaning all valid parsess, e.g., ambiguous parses) will survive.
If you have indeed implemented a GLR parser, this collection of all possible parses should have been extremely clear to you. The fact that it is not hints what you have implemented is not a GLR parser.
Error recovery with a GLR parser is possible, just as with any other parsing technology. What we do is keep the set of live parses before the point of the error; when an error is found, we try (in psuedo-parallel, the GLR parsing machinery makes this easy if it it bent properly) all the following: a) deleting the offending token, b) inserting all tokens that essentially are FOLLOW(x) where x is live parse. In essence, delete the token, or insert one expected by a live parse. We then turn the GLR parser loose again. Only the valid parses (e.g., repairs) will survive. If the current token cannot be processed, the parser processing the stream with the token deleted survives. In the worst case, the GLR parser error recovery ends up throwing away all tokens to EOF. A serious downside to this is the GLR parser's running time grows pretty radically while parsing errors; if there are many in one place, the error recovery time can go through the roof.
Won't a GLR parser produce all possible parses of the input? Then resolving the ambiguity is a matter of picking the parse you prefer. To do that, I suppose the elements of the parse forest need to be labeled according to what kind of combinator produced them, eager or lazy. (You can't resolve the ambiguity incrementally before you've seen all the input, in general.)
(This answer based on my dim memory and vague possible misunderstanding of GLR parsing. Hopefully someone expert will come by.)
Consider the regular expression <.*?> and the input <a>bc<d>ef. This should find <a>, and no other matches, right?
Now consider the regular expression <.*?>e with the same input. This should find <a>bc<d>e, right?
This poses a dilemma. For the user's sake, we want the behavior of the combinator >> to be understood in terms of its two operands. Yet there is no way to produce the second parser's behavior in terms of what the first one finds.
One answer is for each parser to produce a sequence of all parses, ordered by preference, rather than the unordered set of all parsers. Greedy matching would return matches sorted longest to shortest; non-greedy, shortest to longest.
Non-greedy functionality is nothing more than a disambiguation mechanism. If you truly have a generalized parser (which does not require disambiguation to produce its results), then "non-greedy" is meaningless; the same results will be returned whether or not an operator is "non-greedy".
Non-greedy disambiguation behavior could be applied to the complete set of results provided by a generalized parser. Working left-to-right, filter the ambiguous sub-groups corresponding to a non-greedy operator to use the shortest match which still led to a successful parse of the remaining input.