I can't seem to find the parsing expression grammar (PEG) of PEG itself.
How to parse a parsing expression grammar?
Note that this question is not about how to construct a recursive descent parser from a PEG, but rather about how to parse a PEG itself.
The PEG Paper ("Parsing Expression Grammars: A Recognition-Based Syntactic Foundation") includes the grammar for PEGs.
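From memory, the hierarchical part of that grammar looks roughly like this (paraphrased; see the paper for the authoritative version, including the lexical rules for Identifier, Literal, Class, Spacing and the operator tokens):

Grammar    <- Spacing Definition+ EndOfFile
Definition <- Identifier LEFTARROW Expression
Expression <- Sequence (SLASH Sequence)*
Sequence   <- Prefix*
Prefix     <- (AND / NOT)? Suffix
Suffix     <- Primary (QUESTION / STAR / PLUS)?
Primary    <- Identifier !LEFTARROW
            / OPEN Expression CLOSE
            / Literal / Class / DOT

Since the grammar of PEGs is itself a PEG, any of the usual PEG implementation techniques (recursive descent with backtracking, packrat parsing) can be used to parse it.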
I'm looking at the following approach to using parser combinators in Haskell. The author gives the following example of Parser Combinators:
windSpeed :: String -> Maybe Int
windSpeed windInfo =
  parseMaybe windSpeedParser windInfo

windSpeedParser :: ReadP Int
windSpeedParser = do
  direction <- numbers 3
  speed <- numbers 2 <|> numbers 3
  unit <- string "KT" <|> string "MPS"
  return speed
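The snippet leans on two small helpers (numbers and parseMaybe) that are presumably defined elsewhere in the post. A rough reconstruction of mine, together with the imports the whole example needs:

import Text.ParserCombinators.ReadP
import Control.Applicative ((<|>)) -- <|> is used by windSpeedParser above
import Data.Char (isDigit)

-- parse exactly n digits and read them as an Int
numbers :: Int -> ReadP Int
numbers n = read <$> count n (satisfy isDigit)

-- run a ReadP parser, requiring it to consume the whole input
parseMaybe :: ReadP a -> String -> Maybe a
parseMaybe parser input =
  case readP_to_S (parser <* eof) input of
    []           -> Nothing
    ((a, _) : _) -> Just a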
The author gives the following reasons for this approach:
easy to read (I agree with this)
similar format to the specification, i.e. the parser itself is basically a description of what it parses (I agree with this)
I can't help but feel I'm missing some of the reasons for choosing parser combinators: some benefit of using Haskell, such as compile-time guarantees or the elimination of runtime errors, or some later benefit when you start parsing DSLs and using free monads.
My question is: What are the reasons for using parser combinators?
I see several benefits of using parser combinators:
Parser combinators are a generalization of hand-written top-down parsers. If you would hand-write a parser anyway, parser combinators let you abstract away the common patterns.
Unlike parser generators, parser combinators are potentially dynamic, allowing decisions to be made at runtime. This can be useful if the language's grammar may be redefined based on the input.
Parsers are first-class values (see the sketch after this list).
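To make the last point concrete, here is a minimal sketch (names are mine, using Text.ParserCombinators.ReadP from base) of what first-class parsers buy you: parsers can be produced by ordinary functions and assembled from plain runtime data.

import Text.ParserCombinators.ReadP
import Control.Applicative ((<|>))

-- choose among alternatives supplied as an ordinary list at runtime
oneOfStrings :: [String] -> ReadP String
oneOfStrings = foldr (\s p -> string s <|> p) pfail

-- a table-driven parser: the set of accepted units is just data
unit :: ReadP String
unit = oneOfStrings ["KT", "MPS"]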
Is there a parser that can parse ambiguous grammars (ideally in Haskell)?
Parsec's paper (http://research.microsoft.com/en-us/um/people/daan/download/papers/parsec-paper.pdf) states the following:
"Ambiguous grammars have more than one parse tree for a sentence in the language. Only parser combinators that can return more than one value can handle ambiguous grammars. Such combinators use a list as their reply type."
But I haven't found any such parser combinators. Do they exist?
I've written some parsers using attoparsec, but I've only now realised that I don't always want them to backtrack on failure; attoparsec parsers, however, always backtrack on failure.
Is there a way to force a parser not to backtrack?
For example, this attoparsec parser will succeed when given the input "for":
string "foo" <|> string "for"
A parsec parser would not succeed on that input and I want to emulate this behaviour using an attoparsec parser.
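A small, self-contained reproduction of what I mean (assuming Data.Attoparsec.Text and OverloadedStrings; the ByteString variant behaves the same):

{-# LANGUAGE OverloadedStrings #-}
import Control.Applicative ((<|>))
import Data.Attoparsec.Text (Parser, parseOnly, string)
import Data.Text (Text)

fooOrFor :: Parser Text
fooOrFor = string "foo" <|> string "for"

main :: IO ()
main = print (parseOnly fooOrFor "for")
-- prints Right "for": string "foo" fails on the 'r', <|> backtracks and tries "for"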
I'm currently having a look at GNU Bison to parse program code (or actually to extend a program that uses Bison for that). I understand that Bison can only (or: best) handle LR(1) grammars, i.e. a special class of context-free grammars, and I believe I understand the rules of context-free and LR(1) grammars.
However, I'm somehow lacking a good intuition for what makes a grammar LR(1). Take SQL, for instance. SQL has, I believe, a context-free grammar. But is it also an LR(1) grammar? How could I tell? And if it is not, what would violate the LR(1) rules?
LR(1) means that the parser can choose the proper rule to reduce by seeing all the tokens that will be reduced plus one token of lookahead after them. AND in boolean expressions and the BETWEEN ... AND operator pose no fundamental problem. Consider, for example, a grammar along these lines (as written it is ambiguous, so in practice you would stratify it or give "and" a precedence; a stratified sketch follows below):
expr ::= and_expr | between_expr | variable
and_expr ::= expr "and" expr
between_expr ::= "between" expr "and" expr
variable ::= x
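For illustration, a stratified version roughly like this (my sketch, not taken from any SQL grammar) is unambiguous and comfortably LALR(1), which is the point: the AND/BETWEEN interaction by itself does not push a grammar outside LR(1):

expr         ::= expr "and" primary | primary
primary      ::= between_expr | variable
between_expr ::= "between" primary "and" primary
variable     ::= x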
I believe that the whole SQL grammar is even simpler than LR(1). Probably LR(0) or even LL(n).
Some of my customers created SQL and DB2 parsers with my LALR(1) parser generator and have used them successfully for many years. The grammars they sent me are LALR(1), except for a few shift-reduce conflicts that are resolved the way you would want. For the purists: not strictly LALR(1), but they work fine in practice; no GLR or the more powerful LR(1) is needed, AFAIK.
I think the best way to find out is to get an SQL grammar and a good LALR/LR(1) parser generator and see whether you get a conflict report. As I remember, an SQL grammar (a little out of date) that is LALR(1) is available in this download: http://lrstar.tech/downloads.html
LRSTAR is an LR(1) parser generator that will give you a conflict report. It's also LR(*) if you cannot resolve the conflicts.
When you look at the EBNF description of a language, you often see a definition for integers and real numbers:
integer ::= digit digit* // Accepts numbers with a 0 prefix
real ::= integer "." integer (('e'|'E') integer)?
(These definitions were made up on the fly; I have probably made a mistake in them.)
Although they appear in the context-free grammar, numbers are often recognized in the lexical analysis phase. Are they included in the language definition to make it more complete, and is it up to the implementer to realize that they should actually be handled in the scanner?
Many common parser generator tools, such as ANTLR and lex/yacc, separate parsing into two phases: first, the input string is tokenized; second, the tokens are combined into productions to create a concrete syntax tree.
However, there are alternative techniques that do not require tokenization: look into backtracking recursive-descent parsers, where token-level rules are written in the same way as the rest of the grammar. pyparsing is a Python library for building such parsers.
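To make the contrast concrete, here is a rough scannerless-style sketch (mine, using ReadP) in which the token-level pieces of the real-number rule from the question are written with the same combinators as everything else:

import Text.ParserCombinators.ReadP
import Control.Applicative ((<|>))
import Data.Char (isDigit)

-- integer ::= digit digit*
integer :: ReadP String
integer = munch1 isDigit

-- real ::= integer "." integer (('e'|'E') integer)?
real :: ReadP String
real = do
  whole <- integer
  _     <- char '.'
  frac  <- integer
  ex    <- option "" exponentPart
  return (whole ++ "." ++ frac ++ ex)

exponentPart :: ReadP String
exponentPart = do
  e  <- char 'e' <|> char 'E'
  ds <- integer
  return (e : ds)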
The advantage of the two-step technique is that it usually produces more efficient parsers -- with tokens, there's a lot less string manipulation, string searching, and backtracking.
According to "The Definitive ANTLR Reference" (Terence Parr),
The only difference between [lexers and parsers] is that the parser recognizes grammatical structure in a stream of tokens while the lexer recognizes structure in a stream of characters.
The grammar syntax needs to be complete to be precise, so of course it includes details as to the precise format of identifiers and the spelling of operators.
Yes, the compiler engineer decides, but generally it is pretty obvious: you want the lexer to handle all the character-level detail efficiently.
There's a longer answer at Is it a Lexer's Job to Parse Numbers and Strings?