eliminating left-recursion, e, left-factoring? - parsing

I have a grammar
S->a{b}
and I'm trying to rewrite it to avoid using {} . If I write
S->a|aB
B->b|bB
then I'm unable to parse predictively in the second rule. If I write
S->a|aB
B->b|Bb
then I become left-recursive in the second rule.
Trying to do left-factoring,
B->bC
C->(e)|B
I'm introducing empty symbols. The wish so far is to make grammar without (e), suitable for predictive parsing and not left-recursive.
Is it possible?

I don't think so. Essentially, you can drop the a part in your grammar and the first of the b's, they are not relevant to your problem. You have then enumerated all three styles of declaring an infinite number of b's. You'll have to choose one of them.
I'd advise to just go with the empty symbol.

Related

ANTLR4 - Parse subset of a language (e.g. just query statements)

I'm trying to figure out how I can best parse just a subset of a given language with ANTLR. For example, say I'm looking to parse U-SQL. Really, I'm only interested in parsing certain parts of the language, such as query statements. I couldn't be bothered with parsing the many other features of the language. My current approach has been to design my lexer / parser grammar as follows:
// ...
statement
: queryStatement
| undefinedStatement
;
// ...
undefinedStatement
: (.)+?
;
// ...
UndefinedToken
: (.)+?
;
The gist is, I add a fall-back parser rule and lexer rule for undefined structures and tokens. I imagine later, when I go to walk the parse tree, I can simply ignore the undefined statements in the tree, and focus on the statements I'm interested in.
This seems like it would work, but is this an optimal strategy? Are there more elegant options available? Thanks in advance!
Parsing a subpart of a grammar is super easy. Usually you have a top level rule which you call to parse the full input with the entire grammar.
For the subpart use the function that parses only a subrule like:
const expression = parser.statement();
I use this approach frequently when I want to parse stored procedures or data types only.
Keep in mind however, that subrules usually are not termined with the EOF token (as the top level rule should be). This will cause no syntax error if more than the subelement is in the token stream (the parser just stops when the subrule has matched completely). If that's a problem for you then add a copy of the subrule you wanna parse, give it a dedicated name and end it with EOF, like this:
dataTypeDefinition: // For external use only. Don't reference this in the normal grammar.
dataType EOF
;
dataType: // type in sql_yacc.yy
type = (
...
Check the MySQL grammar for more details.
This general idea -- to parse the interesting bits of an input and ignore the sea of surrounding tokens -- is usually called "island parsing". There's an example of an island parser in the ANTLR reference book, although I don't know if it is directly applicable.
The tricky part of island parsing is getting the island boundaries right. If you miss a boundary, or recognise as a boundary something which isn't, then your parse will fail disastrously. So you need to understand the input at least well enough to be able to detect where the islands are. In your example, that might mean recognising a SELECT statement, for example. However, you cannot blindly recognise the string of letters SELECT because that string might appear inside a string constant or a comment or some other context in which it was never intended to be recognised as a token at all.
I suspect that if you are going to parse queries, you'll basically need to be able to recognise any token. So it's not going to be sea of uninspected input characters. You can view it as a sea of recognised but unparsed tokens. In that case, it should be reasonably safe to parse a non-query statement as a keyword followed by arbitrary tokens other than ; and ending with a ;. (But you might need to recognise nested blocks; I don't really know what the possibilities are.)

FParsec parse unordered clauses

I want to parse some grammar like the following
OUTPUT data
GROUPBY key
TO location
USING object
the order of the GROUPBY TO USING clauses is allowed to vary, but each clause can occur at most once.
Is there a convenient or built-in way to parse this in FParsec? I read some questions and answers that mentions Haskell Parsec permute. There doesn't seem to be a permute in FParsec. If this is the way to go, what would I go about building a permute in FParsec?
I don't think there's a permutation parser in FParsec. I see a few directions you could take it though.
In general, what #FuleSnabel suggests is pretty sound, and probably simplest to implement. Don't make the parser responsible for asserting the property that each clause appears at most once. Instead parse each clause separately, allowing for duplicates, then inspect the resulting AST and error out if your property doesn't hold.
You could generate all permutations of your parsers and combine them with choice. Obviously this approach doesn't scale, but for three parsers I'd say it's fair game.
You could write your own primitive for parsing using a collection of parsers applied in any order. This would be a variant of many where in each step you make a choice of a parser, then discard that parser. So in each step you choose from a shrinking list of parsers until you can't parse anymore, finally returning the results collected along the way.
You could use user state to keep track of parsers already used and fail if a parser would be used twice within the same context. Not sure if this would yield a particularly nice solution - haven't really tried it before.

Issue with left recursion in top down parsing

I have read this to understand more the difference between top down and bottom up parsing, can anyone explain the problems associated with left recursion in a top down parser?
In a top-down parser, the parser begins with the start symbol and tries to guess which productions to apply to reach the input string. To do so, top-down parsers need to use contextual clues from the input string to guide its guesswork.
Most top-down parsers are directional parsers, which scan the input in some direction (typically, left to right) when trying to determine which productions to guess. The LL(k) family of parsers is one example of this - these parsers use information about the next k symbols of input to determine which productions to use.
Typically, the parser uses the next few tokens of input to guess productions by looking at which productions can ultimately lead to strings that start with the upcoming tokens. For example, if you had the production
A → bC
you wouldn't choose to use this production unless the next character to match was b. Otherwise, you'd be guaranteed there was a mismatch. Similarly, if the next input character was b, you might want to choose this production.
So where does left recursion come in? Well, suppose that you have these two productions:
A → Ab | b
This grammar generates all strings of one or more copies of the character b. If you see a b in the input as your next character, which production should you pick? If you choose Ab, then you're assuming there are multiple b's ahead of you even though you're not sure this is the case. If you choose b, you're assuming there's only one b ahead of you, which might be wrong. In other words, if you have to pick one of the two productions, you can't always choose correctly.
The issue with left recursion is that if you have a nonterminal that's left-recursive and find a string that might match it, you can't necessarily know whether to use the recursion to generate a longer string or avoid the recursion and generate a shorter string. Most top-down parsers will either fail to work for this reason (they'll report that there's some uncertainty about how to proceed and refuse to parse), or they'll potentially use extra memory to track each possible branch, running out of space.
In short, top-down parsers usually try to guess what to do from limited information about the string. Because of this, they get confused by left recursion because they can't always accurately predict which productions to use.
Hope this helps!
Reasons
1)The grammar which are left recursive(Direct/Indirect) can't be converted into {Greibach normal form (GNF)}* So the Left recursion can be eliminated to Right Recuraive Format.
2)Left Recursive Grammars are also nit LL(1),So again elimination of left Recursion may result into LL(1) grammer.
GNF
A Grammer of the form A->aV is Greibach Normal Form.

How to fix grammar with optional non-terminal?

I wrote grammar for LALR parser and I am stuck at optional non-terminal. Consider for example C++ dereference, when you can write:
******expression;
Of course you can write:
expression;
And here is my problem, dereference non terminal is optional really and this has such impact on grammar, that now parser sees it fits everywhere (almost), because, well, it might be empty.
Is there a common pattern how should I rewrite the grammar to fix it?
I would also be grateful for pointing out some book or other resources which deals with "common problems & patterns when writing grammars".
First of all, the problem you are having is not the one you are claiming to have. Having a nullable (possibly empty) nonterminal does not mean that the parser will try to stick it everywhere. (I use the term “nullable” here to avoid confusion, because “optional” might refer to an optional occurrence of a nonterminal, as in x? where x is the nonterminal name). It just means that whenever you use that nonterminal in your grammar, the parser might skip over it or match with an empty word (details are according to the rules of the particular parsing algorithm, in your case LALR).
Secondly, the problem most probably is that the resulting grammar is ambiguous. My guess is that you used some kind of combination of right recursion for defining the nonterminal with the stars, and having an asterisk as a binary multiplication operator. (Feel free to update the question with a grammar fragment, then I might be able to offer more detailed help).
Thirdly, and mainly concerning your quest for general problems and patterns in grammars: usually people would not put the stars in one nonterminal and the expression in another, because ultimately you would want to transform your parse tree into an abstract syntax tree on which you probably intend to perform some calculations, in that case you would prefer to have a construction that says “dereference of a dereference of a dereference of an expression” rather than “three stars followed by an expression”. Again, the answer would have been less vague if you provided more details.

If the grammar is ambiguous then there exists exactly one handle for each sentential form.?

there can be two productions from which we can do the reduction. After giving precedence and associations as required there will be one handle only.so is this statement true??
This is partially true, a reduce/reduce conflict is usually resolved by specifying precedence or by letting the parser builder choose which rule to apply before the other.
This means that the conflict is solved but not that the parser is going to behave exactly as intended. It is convenient to study what is causing the conflict and think if a refactoring of the grammar is needed to express what you are trying to parse or if the automatic choice/precedence is enough.
If you have a grammar which has ambiguous rules, you get multiple interpretations. You don't have to insist that the grammar removes ambiguity; you can simply agree that something is ambiguous and parse it multiple ways:
fruit flies like an arrow.
The result of the parse is multiple interpretations.
Now, for such a language to be useful to a reader, either he has to be happy with the ambiguity, or you need to give him a way to resolve it. (In the example, I've decided for you that you are happy the ambiguity, because I haven't given you a way to resolve it!). Or, one can provide the reader of something with ambiguous parsess, a way to choose which parse make sense, and he rejects the inappropriate parses.
I can do that for the above case by telling you that I mean "fruit => watermelon".
Computer grammars are not different, but most programmers don't want ambiguous code. So in general, langauge designers like to define unambiguous grammars. In practice, they don't succeed and you get funny language rules like, "If this could be interpreted ambiguously, then interpret it this way.".

Resources