How to disprove grammar of being LL(3)?

How to disprove grammar of being LL(3)? - parsing

I have the following grammar I made up:
S -> a a a t | a a a a
I'm trying to prove that it's not LL(3). I found that First(S)={a} and Follow(S)={$} but I can't seem to figure out what I need to do to disprove it being LL(3). It's a small grammar I built for myself to understand how to disprove LL(k). For LL(1) I build a table and in each field I insert the rule base on there First/Follow. But with lookahead 3 how can I do it?

To prove a grammar is (not) LL(3), you need to construct an LL(3) parser for it and show that there is (or is not) a conflict in the resulting table.
To construct the parser tables for LL(3) you need FIRST3 and FOLLOW3, which are analogous to FIRST and FOLLOW, except they are sets of sequences of (up to) 3 tokens, rather than sets of single tokens. So you get FIRST3(S) = { "aaa" } which can select either production for S, giving you a conflict in the table between those productions.

Related

Are the following grammars LL(1)?

S -> Abc|aAcb
A -> b|c|ε
I think the first one is LL(1)
S -> aAS|b
A -> a|bSA
But the problem is second one. There's no conflict problem, but I think it doesn't satisfy right-recursion.
I'm not sure about those problems.

The first grammar is not LL(1), because of several conflicts, as for example for input bc, an LL parser will need 2 tokens of look-ahead to parse it:
enter in rule S
enter in rule A
recognize character b
load the next token c
exit rule A
another b is expected, but the current token is c
go back into A
move one token backwards, again to b
exit rule A without recognizing anything (because of the epsilon)
recognize the b character after the reference to rule A that was "jumped" without the use of any token
load the next token c
recognize c
success
You have a similar case in the second alternative of S for an input acb. The grammar is not ambiguous, because in the end there is only one possible syntax tree. Its not LL(1), as in fact is LL(2).
The second grammars is deterministic - there is only one way to parse any input that is valid according to the grammar. This means that it can be used for parsing by an LL(1) parser.
I have made a tool (Tunnel Grammar Studio) that detects the grammar conflicts for nondeterministic grammars and generates parsers. This grammar in ABNF (RFC 5234) like syntax is:
S = 'a' A S / 'b'
A = 'a' / 'b' S A
The right recursion by itself does not create ambiguities inside the grammar. One way to have a right recursion ambiguity is to have some dangling element as in this grammar:
S = 'c' / 'a' S 0*1 'b'
You can read it as: rule S recognizes character c or character a followed by rule S itself and maybe (zero or one time) followed by character b.
The grammar up has a right recursion related ambiguity because of the dangling b character. Meaning that for an input aacb, there is more then one way to parse it: recognize the first a in S; enter into S again; recognize the second a; enter again into S and recognize c; exit S one time; then there are two choices:
case one) recognize the b character, exit S two times or case two) first exit S one time and then recognize the b character. Both cases are these (screen shots from the visual debugger of TGS):
This grammar is thus ambiguous (i.e. not LL(1)) because more then one syntax tree can be generated for some valid inputs. For this input the possible trees are only two, but for an input aaacb there are three trees, as there are three trees for aaacbb, because of the possible 3 places where you can 'attach' the two b characters, two of this places will have b, and one will remain empty. For input aaacbbb there is of course only one possible syntax tree, but the grammar is defined to be ambiguous if there is at least one input for which there is more then one possible syntax tree.

How to parse this simple grammar? Is it ambiguous?

I'm delving deeper into parsing and came across an issue I don't quite understand. I made up the following grammar:
S = R | aSc
R = b | RbR
where S is the start symbol. It is possible to show that abbbc is a valid sentence based on this grammar, hopefully, that is correct but I may have completely missunderstood something. If I try to implement this using recursive descent I seem to have a problem when trying to parse abbbc, using left-derivation eg
S => aSc
aSc => aRc
at this point I would have thought that recursive descent would pick the first option in the second production because the next token is b leading to:
aRc => abc
and we're finished since there are no more non-terminals, which isn't of course abbbc. The only way to show that abbbc is valid is to pick the second option but with one lookahead I assume it would always pick b. I don't think the grammar is ambiguous unless I missed something. So what I am doing wrong?
Update: I came across this nice derivation app at https://web.stanford.edu/class/archive/cs/cs103/cs103.1156/tools/cfg/. I used to do a sanity check that abbbc is a valid sentence and it is.
Thinking more about this problem, is it true to say that I can't use LL(1) to parse this grammar but in fact need LL(2)? With two lookaheads I could correctly pick the second option in the second production because I now also know there are more tokens to be read and therefore picking b would prematurely terminate the derivation.

For starters, I’m glad you’re finding our CFG tool useful! A few of my TAs made that a while back and we’ve gotten a lot of mileage out of it.
Your grammar is indeed ambiguous. This stems from your R nonterminal:
R → b | RbR
Generally speaking, if you have recursive production rules with two copies of the same nonterminal in it, it will lead to ambiguities because there will be multiple options for how to apply the rule twice. For example, in this case, you can derive bbbbb by first expanding R to RbR, then either
expanding the left R to RbR and converting each R to a b, or
expanding the right R to RbR and converting each R to a b.
Because this grammar is ambiguous, it isn’t going to be LL(k) for any choice of k because all LL(k) grammars must be unambiguous. That means that stepping up the power of your parser won’t help here. You’ll need to rewrite the grammar to not be ambiguous.
The nonterminal R that you’ve described here generates strings of odd numbers of b’s in them, so we could try redesigning R to achieve this more directly. An initial try might be something like this:
R → b | bbR
This, unfortunately, isn’t LL(1), since after seeing a single b it’s unclear whether you’re supposed to apply the first production rule or the second. However, it is LL(2).
If you’d like an LL(1) grammar, you could do something like this:
R → bX
X → bbX | ε
This works by laying down a single b, then laying down as many optional pairs of b’s as you’d like.

How to handle assignment and variable syntax in an interpreter

Most interpreters let you type the following at their console:
>> a = 2
>> a+3
5
>>
My question is what mechanisms are usually used to handle this syntax? Somehow the parser is able to distinguish between an assignment and an expression even though they could both start with a digit or letter. It's only when we retrieve the second token that you know if you have an assignment or not. In the past, I've looked ahead two tokens and if the second token isn't an equals I push the tokens back into the lexical stream and assume it's an expression. I suppose one could treat the assignment as an expression which I think some languages do. I thought of using left-factoring but I can't see it working.
eg
assignment = variable A
A = '=' expression | empty
Update I found this question on StackOverflow which address the same question: How to modify parsing grammar to allow assignment and non-assignment statements?

From how you're describing your approach - doing a few tokens of lookahead to decide how to handle things - it sounds like you're trying to write some sort of top-down parser along the lines of an LL(1) or an LL(2) parser, and you're trying to immediately decide whether the expression you're parsing is a variable assignment or an arithmetical expression. There are several ways that you could parse expressions like these quite naturally, and they essentially involve weakening one of those two assumptions.
The first way we could do this would be to switch from using a top-down parser like an LL(1) or LL(2) parser to something else like an LR(0) or SLR(1) parser. Those parsers work bottom-up by reading larger prefixes of the input string before deciding what they're looking at. In your case, a bottom-up parser might work by seeing the variable and thinking "okay, I'm either going to be reading an expression to print or an assignment statement, but with what I've seen so far I can't commit to either," then scanning more tokens to see what comes next. If they see an equals sign, great! It's an assignment statement. If they see something else, great! It's not. The nice part about this is that if you're using a standard bottom-up parsing algorithm like LR(0), SLR(1), LALR(1), or LR(1), you should probably find that the parser generally handles these sorts of issues quite well and no special-casing logic is necessary.
The other option would be to parse the entire expression assuming that = is a legitimate binary operator like any other operation, and then check afterwards whether what you parsed is a legal assignment statement or not. For example, if you use Dijkstra's shunting-yard algorithm to do the parsing, you can recover a parse tree for the overall expression, regardless of whether it's an arithmetical expression or an assignment. You could then walk the parse tree to ask questions like
if the top-level operation is an assignment, is the left-hand side a single variable?
if the top-level operation isn't an assignment, are there nested assignment statements buried in here that we need to get rid of?
In other words, you'd parse a broader class of statements than just the ones that are legal, and then do a postprocessing step to toss out anything that isn't valid.

eliminating left-recursion, e, left-factoring?

I have a grammar
S->a{b}
and I'm trying to rewrite it to avoid using {} . If I write
S->a|aB
B->b|bB
then I'm unable to parse predictively in the second rule. If I write
S->a|aB
B->b|Bb
then I become left-recursive in the second rule.
Trying to do left-factoring,
B->bC
C->(e)|B
I'm introducing empty symbols. The wish so far is to make grammar without (e), suitable for predictive parsing and not left-recursive.
Is it possible?

I don't think so. Essentially, you can drop the a part in your grammar and the first of the b's, they are not relevant to your problem. You have then enumerated all three styles of declaring an infinite number of b's. You'll have to choose one of them.
I'd advise to just go with the empty symbol.

Parsing rules - how to make them play nice together

So I'm doing a Parser, where I favor flexibility over speed, and I want it to be easy to write grammars for, e.g. no tricky workaround rules (fake rules to solve conflicts etc, like you have to do in yacc/bison etc.)
There's a hand-coded Lexer with a fixed set of tokens (e.g. PLUS, DECIMAL, STRING_LIT, NAME, and so on) right now there are three types of rules:
TokenRule: matches a particular token
SequenceRule: matches an ordered list of rules
GroupRule: matches any rule from a list
For example, let's say we have the TokenRule 'varAccess', which matches token NAME (roughly /[A-Za-z][A-Za-z0-9_]*/), and the SequenceRule 'assignment', which matches [expression, TokenRule(PLUS), expression].
Expression is a GroupRule matching either 'assignment' or 'varAccess' (the actual ruleset I'm testing with is a bit more complete, but that'll do for the example)
But now let's say I want to parse
var1 = var2
And let's say the Parser begins with rule Expression (the order in which they are defined shouldn't matter - priorities will be solved later). And let's say the GroupRule expression will first try 'assignment'. Then since 'expression' is the first rule to be matched in 'assignment', it will try to parse an expression again, and so on until the stack is filled up and the computer - as expected - simply gives up in a sparkly segfault.
So what I did is - SequenceRules add themselves as 'leafs' to their first rule, and become non-roôt rules. Root rules are rules that the parser will first try. When one of those is applied and matches, it tries to subapply each of its leafs, one by one, until one matches. Then it tries the leafs of the matching leaf, and so on, until nothing matches anymore.
So that it can parse expressions like
var1 = var2 = var3 = var4
Just right =) Now the interesting stuff. This code:
var1 = (var2 + var3)
Won't parse. What happens is, var1 get parsed (varAccess), assign is sub-applied, it looks for an expression, tries 'parenthesis', begins, looks for an expression after the '(', finds var2, and then chokes on the '+' because it was expecting a ')'.
Why doesn't it match the 'var2 + var3' ? (and yes, there's an 'add' SequenceRule, before you ask). Because 'add' isn't a root rule (to avoid infinite recursion with the parse-expresssion-beginning-with-expression-etc.) and that leafs aren't tested in SequenceRules otherwise it would parse things like
reader readLine() println()
as
reader (readLine() println())
(e.g. '1 = 3' is the expression expected by add, the leaf of varAccess a)
whereas we'd like it to be left-associative, e.g. parsing as
(reader readLine()) println()
So anyway, now we've got this problem that we should be able to parse expression such as '1 + 2' within SequenceRules. What to do? Add a special case that when SequenceRules begin with a TokenRule, then the GroupRules it contains are tested for leafs? Would that even make sense outside that particular example? Or should one be able to specify in each element of a SequenceRule if it should be tested for leafs or not? Tell me what you think (other than throw away the whole system - that'll probably happen in a few months anyway)
P.S: Please, pretty please, don't answer something like "go read this 400pages book or you don't even deserve our time" If you feel the need to - just refrain yourself and go bash on reddit. Okay? Thanks in advance.

LL(k) parsers (top down recursive, whether automated or written by hand) require refactoring of your grammar to avoid left recursion, and often require special specifications of lookahead (e.g. ANTLR) to be able to handle k-token lookahead. Since grammars are complex, you get to discover k by experimenting, which is exactly the thing you wish to avoid.
YACC/LALR(1) grammars aviod the problem of left recursion, which is a big step forward. The bad news is that there are no real programming langauges (other than Wirth's original PASCAL) that are LALR(1). Therefore you get to hack your grammar to change it from LR(k) to LALR(1), again forcing you to suffer the experiments that expose the strange cases, and hacking the grammar reduction logic to try to handle K-lookaheads when the parser generators (YACC, BISON, ... you name it) produce 1-lookahead parsers.
GLR parsers (http://en.wikipedia.org/wiki/GLR_parser) allow you to avoid almost all of this nonsense. If you can write a context free parser, under most practical circumstances, a GLR parser will parse it without further effort. That's an enormous relief when you try to write arbitrary grammars. And a really good GLR parser will directly produce a tree.
BISON has been enhanced to do GLR parsing, sort of. You still have to write complicated logic to produce your desired AST, and you have to worry about how to handle failed parsers and cleaning up/deleting their corresponding (failed) trees. The DMS Software Reengineering Tookit provides standard GLR parsers for any context free grammar, and automatically builds ASTs without any additional effort on your part; ambiguous trees are automatically constructed and can be cleaned up by post-parsing semantic analyis. We've used this to do define 30+ language grammars including C, including C++ (which is widely thought to be hard to parse [and it is almost impossible to parse with YACC] but is straightforward with real GLR); see C+++ front end parser and AST builder based on DMS.
Bottom line: if you want to write grammar rules in a straightforward way, and get a parser to process them, use GLR parsing technology. Bison almost works. DMs really works.

My favourite parsing technique is to create recursive-descent (RD) parser from a PEG grammar specification. They are usually very fast, simple, and flexible. One nice advantage is you don't have to worry about separate tokenization passes, and worrying about squeezing the grammar into some LALR form is non-existent. Some PEG libraries are listed [here][1].
Sorry, I know this falls into throw away the system, but you are barely out of the gate with your problem and switching to a PEG RD parser, would just eliminate your headaches now.

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart