Calculating FIRST sets by hand - parsing

I'm having a slight problem that's been bugging me for quite some time now and I haven't been able to find a suitable answer online.
Given a grammar:
S = Sc | Eab | Db | b
D = EDcD | ca | ɛ
E = dE | DY
Y = Ed | Dad | ɛ
To find the FIRST set of say Y, so FIRST(Y), am I correct in assuming that it goes like this:
FIRST(Y)
= FIRST(Ed) ∪ FIRST(Dad) ∪ FIRST(ɛ)
= FIRST(E) ∪ (FIRST(D)\{ɛ}) ∪ FIRST(ad) ∪ {ɛ}
= FIRST(E) ∪ (FIRST(D)\{ɛ}) ∪ {a, ɛ}
Now the question is how do I find FIRST(E) and FIRST(D)?

So, the problem with FIRST(E) and FIRST(D) is that E and D reference one another. And the solution is the usual one when you want a sort of "least fixed point" -- start with everything empty and keep iterating until they stabilize.
That is: first of all, initialize all your FIRST sets to the empty set. Now, repeatedly, consider each production and pretend your current estimates for the non-terminal's FIRST sets are the truth. (In reality, they will typically be "underestimates".) Work out what the production tells you about the FIRST set of its LHS, and update your estimate of that non-terminal's FIRST set accordingly. Keep doing this, processing all the productions in turn, until you've gone through all of them and your estimates haven't changed. At that point, you're done.
In this case, here's how it goes (assuming I haven't goofed, of course). The first pass produces successively S: {b}, D: {c,ɛ}, E: {c,d}, Y: {a,c,d,ɛ}; the a in FIRST(Y) comes from Y = Dad with D deriving ɛ. The second produces successively S: {b,c,d}, D: {c,d,ɛ}, E: {a,c,d,ɛ}, Y: {a,c,d,ɛ}. The third produces S: {a,b,c,d} and D: {a,c,d,ɛ} and leaves E and Y unchanged. The fourth doesn't change any of those, so those are the final answers.
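The iteration described above can be sketched in a few lines of Python. This is only an illustration under some assumptions: the grammar is encoded as a dictionary of alternatives, an empty list stands for the ɛ alternative, and the names `GRAMMAR`, `EPS` and `first_sets` are made up for this example. Running it on the grammar from the question is a way to cross-check a hand computation.

```python
EPS = "ɛ"

# Each nonterminal maps to its alternatives; each alternative is a list of
# symbols. An empty list stands for the ɛ alternative.
GRAMMAR = {
    "S": [["S", "c"], ["E", "a", "b"], ["D", "b"], ["b"]],
    "D": [["E", "D", "c", "D"], ["c", "a"], []],
    "E": [["d", "E"], ["D", "Y"]],
    "Y": [["E", "d"], ["D", "a", "d"], []],
}


def first_sets(grammar):
    """Compute FIRST sets by iterating until the estimates stop changing."""
    first = {nt: set() for nt in grammar}  # start with everything empty
    changed = True
    while changed:
        changed = False
        for nt, alternatives in grammar.items():
            for alt in alternatives:
                before = len(first[nt])
                for sym in alt:
                    if sym not in grammar:  # terminal: it starts the string
                        first[nt].add(sym)
                        break
                    # Nonterminal: trust its current estimate, minus ɛ.
                    first[nt] |= first[sym] - {EPS}
                    if EPS not in first[sym]:
                        break
                else:
                    # Every symbol could vanish (or the alternative is empty).
                    first[nt].add(EPS)
                if len(first[nt]) != before:
                    changed = True
    return first


if __name__ == "__main__":
    for nt, symbols in first_sets(GRAMMAR).items():
        print(nt, sorted(symbols))
```

Sets only ever grow and the alphabet is finite, so the loop has to terminate.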

Related

Applying YACC to GCODE (GRBL)

GCode is the language used to tell multi-axis (CNC) robots how to move.
It looks like this :
M3 S5000 (Start Spindle Clockwise at 5000 RPM)
G21 (All units in mm)
G00 Z1.000000 (lift Z axis up by 1mm)
G00 X94.720505 Y-14.904622 (Go to this XY coordinate)
G01 Z0.000000 F100.0 (Penetrate at 100mm/m)
G01 X97.298434 Y-14.870127 F400 (cut to here)
G03 X98.003848 Y-14.275867 I-0.028107 J0.749174 (cut an arc)
G00 Z1.000000 (lift Z axis)
etc.
I have laid these commands out in sentences, but each token could be on a separate line.
And in fact there are no rules about numbers being concatenated to their respective code letters. Yet I already have a LEX parser which can get me the tokens as described below.
Note that certain commands (M or G codes) have parameters.
In the case of M3, it can have an S (spindle speed) parameter.
G0 and G1 can have X,Y,Z,F etc.
G3 can have X,Y,Z,I,J,R...
However each G code does not require ALL those parameters, just one, many or all.
One thing to note here is that we are cutting a single path, then lifting the z axis.
That is, we move to a location above the work surface, penetrate, cut a path then lift off.
I would call this a 'block' or a 'path' and it is this that I'm interested in.
I need to be able to parse GCode in any messy format and then create a structure of 'blocks', where a block is any series of 'commands' between a z axis down and up.
I can tokenise this language using LEX (python PLY specifically).
And get :
type M value 3
type S value 5000
type COMMENT value "Start Spindle Clockwise at 5000 RPM"
type G value 21
type COMMENT value "All units in mm"
type G value 0
type Z value 1.0
etc.
Now using YACC I need a rule for a thing called a 'command'.
A command is any comment, or :
A 'G' or 'M' code followed by ANY of the appropriate parameter codes (X,Y,Z etc.)
Command ends when another command (comment, G or M) is encountered.
Then I need a thing called a 'block',
where a block is any set of 'commands' that come after a Z down and before a Z up.
There are 100 G codes and 100 M codes and 24 parameter codes (A-Z minus G and M).
A rule for 'command' might look like :
command : G F H I J K L S T W X Y Z (how to specify ONE OF)
| M S F (How to specify one of)
| COMMENT
And then how would we define block!?
I realise this is a very long post, but if anyone can give me even an idea as to whether YACC can do this? Otherwise I'll just write some code that converts the lex tokens into a tree manually.
Addendum @rici
Thank you for taking the time to understand this question.
By way of feedback:
My task in full is to get YACC to do the heavy lifting of separating chunks of code into blocks based on different use cases.
For example, when 'engraving', often a block will represent a letter or some other shape (in the xy plane). So a block will be defined by the movement of the z axis in and out of the xy plane.
I want to be able to post process blocks:
Hatch fill a 'block', which will involve some fairly complicated calculation of path boundaries, tangents to those boundaries, tool diameter etc. This is the most pressing use case and I don't have a good solution for it yet, but I know it can be done because it can be done in Inkscape (vector graphics application).
Rotate by n degrees. A fairly simple coordinate transformation; I have a solution for this already.
Iteratively deepen (extrude). Copy blocks and adjust Z depth on each iteration. Simple.
etc.
If you just want to ensure that a G command is followed by something, you can do this:
g_modifier: F | H | I | J | K | L | S | T | W | X | Y | Z
m_modifier: S | F
g_command: G g_modifier | g_command g_modifier
m_command: M m_modifier | m_command m_modifier
command: g_command | m_command | COMMENT
If you want to split those into sequences using the presence of a Z modifier, that can be done. You might want the lexer to be able to produce two different Z token types, based on the sign of the argument, because the parser can only make syntax decisions based on tokens, not on semantic values.
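As a rough illustration of the "two different Z token types" idea, here is a small Python sketch that retags Z tokens after lexing. Everything in it is an assumption made for the example: tokens are plain `(type, value)` pairs, the names `Z_DOWN`, `Z_UP` and `retag_z` are invented, and "down" is taken to mean a Z value at or below a `surface` height of 0. With PLY, a similar effect can probably be had by reassigning the token's `type` inside the lexer rule for Z.

```python
def retag_z(tokens, surface=0.0):
    """Yield (type, value) pairs with every Z token split into Z_DOWN or Z_UP.

    Assumption: a Z value at or below `surface` means the tool is in the work.
    """
    for tok_type, value in tokens:
        if tok_type == "Z":
            tok_type = "Z_DOWN" if value <= surface else "Z_UP"
        yield tok_type, value


if __name__ == "__main__":
    stream = [("G", 0), ("Z", 1.0), ("G", 1), ("Z", 0.0), ("F", 100.0)]
    for token in retag_z(stream):
        print(token)
```

Once the two token types exist, the grammar can mention them separately, which a single Z token does not allow.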
Your question provides at least two different definitions of a block, making it a bit difficult to provide a clear answer.
"That is, we move to a location above the work surface, penetrate, cut a path then lift off. I would call this a 'block' or a 'path' and it is this that I'm interested in."
That would be, for example:
G00 X94.7 Y-14.9 (Move)
G01 Z0.0 (Penetrate)
G01 X97.2 Y-14.8 G03 X98.0 Y-14.2 I-0.02 J0.7 (Path)
G00 Z1.0 (Lift)
But later you say, "a block is any set of 'commands' that come after a Z down and before a Z up."
That would be just this part of the previous example:
G01 X97.2 Y-14.8 G03 X98.0 Y-14.2 I-0.02 J0.7 (Path)
Those are both possible, but obviously different. Here are some possible building blocks:
# This list doesn't include Z words
g_modifier: F | H | I | J | K | L | S | T | W | X | Y
g_command_no_z: G g_modifier
| g_command_no_z g_modifier
# This doesn't distinguish between Z up and Z down. If you want that to
# affect syntax, you need two different Z tokens, and then two different
# with_z non-terminals.
g_command_with_z: G Z
| g_command_no_z Z
| g_command_with_z g_modifier
# You might or might not want this.
# It's a non-empty sequence of G or M commands with no Z's.
path: command_no_z
| path command_no_z
command_no_z: COMMENT
| m_command
| g_command_no_z
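For comparison, the same grouping can be sketched by hand in plain Python, without a parser generator. This is a hedged sketch, not a replacement for the grammar: it assumes tokens are `(type, value)` pairs as in the lexer output shown in the question, the function names `group_commands` and `split_paths` are invented, and unlike the rules above it also accepts a G or M code with no parameters (such as G21).

```python
def group_commands(tokens):
    """Group a flat (type, value) token stream into commands.

    A COMMENT is a command on its own; a G or M token starts a command and
    collects every parameter token that follows it.
    """
    commands = []
    for tok_type, value in tokens:
        if tok_type in ("G", "M", "COMMENT"):
            commands.append([(tok_type, value)])
        elif commands and commands[-1][0][0] in ("G", "M"):
            commands[-1].append((tok_type, value))
        else:
            raise ValueError(f"parameter {tok_type} without a G or M code")
    return commands


def split_paths(commands):
    """Split commands into maximal runs that contain no Z word."""
    paths, current = [], []
    for command in commands:
        if any(tok_type == "Z" for tok_type, _ in command):
            if current:
                paths.append(current)
            current = []  # the command carrying Z is a separator, not part of a path
        else:
            current.append(command)
    if current:
        paths.append(current)
    return paths


if __name__ == "__main__":
    tokens = [
        ("G", 0), ("X", 94.7), ("Y", -14.9),
        ("G", 1), ("Z", 0.0),
        ("G", 1), ("X", 97.2), ("Y", -14.8),
        ("G", 0), ("Z", 1.0),
    ]
    for path in split_paths(group_commands(tokens)):
        print(path)
```

Each element of the result corresponds to the `path` non-terminal: a non-empty run of commands with no Z in it.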

How many equivalence classes in the RL relation for {w in {a, b}* | (#a(w) mod m) = ((#b(w)+1) mod m)}

How many equivalence classes in the RL relation for
{w in {a, b}* | (#a(w) mod m) = ((#b(w)+1) mod m)}
I am looking at a past test question which gives me the options
m(m+1)
2m
m^2
m^2+1
infinite
However, I claim that it's m, and I came up with an automaton that I believe accepts this language which contains 3 states (for m=3).
Am I right?
Actually you're right. To see this, observe that the difference of #a(w) and #b(w), #a(w) - #b(w) modulo m, is all that matters; and there are only m possible values of this difference modulo m. So, m states are always sufficient to accept a language of this form: simply make the state corresponding to the appropriate difference the accepting state.
In your DFA, a2 corresponds to a difference of zero, a1 to a difference of one and a3 to a difference of two.
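The construction can be checked mechanically. The sketch below is an illustration with invented function names: it simulates the m-state automaton whose state is (#a(w) - #b(w)) mod m, and compares it against the defining condition on all short strings. It only demonstrates that m states are enough; it does not by itself prove that fewer would fail.

```python
from itertools import product


def in_language(w, m):
    """Direct check of the defining condition."""
    return w.count("a") % m == (w.count("b") + 1) % m


def dfa_accepts(w, m):
    """Run the m-state DFA whose state is (#a - #b) mod m."""
    state = 0
    for ch in w:
        state = (state + 1) % m if ch == "a" else (state - 1) % m
    # Accept when the difference is 1 modulo m (which is 0 when m == 1).
    return state == 1 % m


def agree_up_to(m, max_len):
    """True if the DFA and the definition agree on all strings up to max_len."""
    return all(
        dfa_accepts("".join(w), m) == in_language("".join(w), m)
        for n in range(max_len + 1)
        for w in product("ab", repeat=n)
    )


if __name__ == "__main__":
    print(agree_up_to(3, 8))
```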

Why is this grammar LL(1) even though all the FIRST sets are the same?

Consider the following CFG:
S := AbC
A := aA | epsilon
C := Ac
Here, FIRST(A) = FIRST(B) = FIRST(C) = {a, ε}, so all the FIRST sets are the same. However, this grammar is supposedly LL(1). How is that possible? Wouldn't that mean that there would be a bunch of FIRST/FIRST conflicts everywhere?
There's nothing fundamentally wrong about having multiple nonterminals that have the same FIRST sets. Things only become problematic if you have multiple nonterminals with overlapping FIRST or FOLLOW sets in a context where you have to choose between a number of production options.
As an example, consider this simple grammar:
A → bB | cC
B → b | c
C → b | c
Notice that all three of A, B, and C have the same FIRST set, namely {b, c}. But this grammar is also LL(1). While you can formally convince yourself of this by writing out the actual LL(1) parsing table, you can think of this in another way as well. Imagine you're reading the nonterminal A, and you see the character b. Which production should you pick: A → bB, or A → cC? Well, there's no reason to pick A → cC, because that would put c at the front of your string. So don't pick that one. Instead, pick A → bB. Similarly, suppose you're reading an A and you see the character c. Which production should you pick? You'd never want to pick A → bB, since that would put b at the front of your string. Instead, you'd pick A → cC.
Notice that in this discussion, we never stopped to think about what FIRST(B) or FIRST(C) was. It simply didn't come up because we never needed to know what characters could be produced there.
Now, let's look at your example. If you're trying to expand an S, there's only one possible production to apply, which is S → AbC. So there's no possible conflict here; when you see S, you always apply that rule. Similarly, if you're trying to expand a C, there's no choice of what to do. You have to expand C → Ac.
So now let's think about the nonterminal A, where there really is a choice of what to do next. If you see the character a, then we have to decide - do we expand out A → aA, or do we expand out A → ε? In answering that question, we have to think about the FOLLOW set of A, since the production A → ε would only make sense to pick if we saw a terminal symbol where we basically just want to get A out of the way. Here, FOLLOW(A) = {b, c}, with the b coming from the production S → AbC and the c coming from the production C → Ac. So we'd only pick A → ε if we see b or c, not if we see a. That means that
on reading a, we pick A → aA, and
on reading b or c, we pick A → ε.
Notice that in this discussion we never needed to think about what FIRST(B) or FIRST(C) was. In fact, we never even needed to look at what FIRST(A) was either! So that's why there isn't necessarily a conflict. Were we to encounter a scenario where we needed to compare FIRST(A) against FIRST(B) or something like that, then yes, we'd definitely have a conflict. But that never came up, so no conflict exists.
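To make the "one token of lookahead is enough" argument concrete, here is a minimal recursive-descent sketch in Python for the grammar from the question. It is only an illustration: the function names are invented, the input is a plain string of single-character terminals, and "$" is assumed as an end-of-input marker.

```python
def parse(text):
    """Return True if text is derivable from S in
    S -> A b C,  A -> a A | ε,  C -> A c."""
    pos = 0

    def peek():
        return text[pos] if pos < len(text) else "$"

    def expect(ch):
        nonlocal pos
        if peek() != ch:
            raise SyntaxError(f"expected {ch!r} at position {pos}")
        pos += 1

    def parse_A():
        if peek() == "a":            # lookahead a: pick A -> aA
            expect("a")
            parse_A()
        elif peek() in ("b", "c"):   # lookahead in FOLLOW(A): pick A -> ε
            return
        else:
            raise SyntaxError(f"unexpected {peek()!r} at position {pos}")

    def parse_C():
        parse_A()                    # only one production, no choice to make
        expect("c")

    def parse_S():
        parse_A()                    # only one production, no choice to make
        expect("b")
        parse_C()

    try:
        parse_S()
        return pos == len(text)
    except SyntaxError:
        return False


if __name__ == "__main__":
    for sample in ["bc", "aabac", "ab", "ac"]:
        print(sample, parse(sample))
```

The only place the parser ever inspects the lookahead to choose between productions is `parse_A`, which mirrors the discussion above.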

Top down parsing - Compute FIRST and FOLLOW

Given the following grammar:
S -> S + S | S S | (S) | S* | a
S -> S S + | S S * | a
For the life of me I can't seem to figure out how to compute the FIRST and FOLLOW for the above grammar. The recursive non-terminal of S confuses me. Does that mean I have to factor out the grammar first before computing the FIRST and FOLLOW?
The general rule for computing FIRST sets in CFGs without ε productions is the following:
Initialize FIRST(A) as follows: for each production A → tω, where t is a terminal, add t to FIRST(A).
Repeatedly apply the following until nothing changes: for each production of the form A → Bω, where B is a nonterminal, set FIRST(A) = FIRST(A) ∪ FIRST(B).
We could follow the above rules as written, but there's something interesting here we can notice. Your grammar only has a single nonterminal, so that second rule - which imports elements into the FIRST set of one nonterminal from FIRST sets from another nonterminal - won't actually do anything. In other words, we can compute the FIRST set just by applying that initial rule. And that's not too bad here - we just look at all the productions that start with a terminal and get FIRST(S) = { a, ( }.
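The two rules can be sketched in Python as follows. This is an illustration under assumptions: the grammar is a dictionary from nonterminal to a list of alternatives, every alternative is non-empty (no ε productions), and the names `first_sets_no_epsilon`, `GRAMMAR_1` and `GRAMMAR_2` are made up for this example.

```python
def first_sets_no_epsilon(grammar):
    """FIRST sets for a grammar without ε productions."""
    # Rule 1: seed each set with the terminals that begin a production.
    first = {
        nt: {alt[0] for alt in alts if alt[0] not in grammar}
        for nt, alts in grammar.items()
    }
    # Rule 2: for each A -> B ω, import FIRST(B) into FIRST(A) until stable.
    changed = True
    while changed:
        changed = False
        for nt, alts in grammar.items():
            for alt in alts:
                head = alt[0]
                if head in grammar and not first[head] <= first[nt]:
                    first[nt] |= first[head]
                    changed = True
    return first


# The two grammars from the question, one symbol per list element.
GRAMMAR_1 = {"S": [["S", "+", "S"], ["S", "S"], ["(", "S", ")"], ["S", "*"], ["a"]]}
GRAMMAR_2 = {"S": [["S", "S", "+"], ["S", "S", "*"], ["a"]]}

if __name__ == "__main__":
    print(first_sets_no_epsilon(GRAMMAR_1))
    print(first_sets_no_epsilon(GRAMMAR_2))
```

With a single nonterminal the loop for the second rule finds nothing to import, which is the observation made above.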

How to calculate FIRST sets by hand

I don't understand one of the examples provided by my tutor.
Example
S ::= aBA | BB | Bc
A ::= Ad | d
B ::= ε
We have
FIRST(B) = FIRST(ε)
= {ε}
FIRST(A) = FIRST(Ad) ∪ FIRST(d)
= FIRST(A) ∪ {d}
= {d}
FIRST(S) = FIRST(aBA) ∪ FIRST(BB) ∪ FIRST(Bc)
= FIRST(a) ∪ (FIRST(B)\{ε}) ∪ FIRST(B) ∪ (FIRST(B)\{ε}) ∪ FIRST(c)
= {a, ε, c}
Why is there a FIRST(B) in the FIRST(S) calculation? Shouldn't it be
(FIRST(B)\{ε})?
Why is A missing from FIRST(S) calculation?
This page gives the mechanical rules for deriving FIRST (and FOLLOW) sets. I'll try to explain the logic behind these rules and how they apply to your example.
FIRST sets
FIRST(u) is the set of terminals that can occur first in a full derivation of u, where u is a sequence of terminals and non-terminals. In other words, when calculating the FIRST(u) set, we are looking only for the terminals that could possibly be the first terminal of a string that can be derived from u.
FIRST(aBA)
Given the definition, we can see that FIRST(aBA) reduces to FIRST(a), then to a. This is because no matter what the A and B productions are, the terminal a will always occur first in anything derived from aBA since a is a terminal, and can't be removed from the front of that sequence.
FIRST(Bc)
I'm going to skip FIRST(BB) for now and move on to FIRST(Bc). Things are different here, since B is a non-terminal. At first, we say that anything in FIRST(B) is also in FIRST(S). Unfortunately, FIRST(B) contains ε which causes problems, as we could have the scenario
FIRST(Bc)
-> FIRST(εc)
= FIRST(c)
= c
where the arrow is a possible derivation/reduction. In general, we therefore say that FIRST(Xu), where ε is in FIRST(X), is equal to (FIRST(X)\{ε}) ∪ FIRST(u). This explains the last two terms in your calculation.
FIRST(BB)
Using the above rule, we can now derive FIRST(BB) as (FIRST(B)\{ε}) ∪ FIRST(B). Similarly, if we were calculating FIRST(BBB) we would reduce it as
FIRST(BBB)
= (FIRST(B)\{ε}) ∪ FIRST(BB)
= (FIRST(B)\{ε}) ∪ (FIRST(B)\{ε}) ∪ FIRST(B)
Of note is that while calculating a FIRST set, the last symbol in a sequence of symbols never has the empty string removed from it, because at this point, the empty string is a legitimate possibility. This can be seen in a possible derivation in your example:
S
-> BB
-> εε
-> ε
Hopefully you can see from all of the above why FIRST(B) appears in your calculation while FIRST(A) does not.
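The rule for FIRST of a sequence can be written as a short Python function. This is a sketch with invented names (`first_of_sequence`, `FIRST`, `EPS`); it assumes the FIRST sets of the nonterminals are already known, and takes FIRST(A) = {d} and FIRST(B) = {ε} from the example.

```python
EPS = "ε"


def first_of_sequence(symbols, first):
    """FIRST of a sequence of symbols, given FIRST sets of the nonterminals."""
    result = set()
    for sym in symbols:
        sym_first = first.get(sym, {sym})  # a terminal's FIRST is just itself
        result |= sym_first - {EPS}
        if EPS not in sym_first:
            return result                  # later symbols can never come first
    result.add(EPS)                        # every symbol could vanish, so ε stays
    return result


FIRST = {"A": {"d"}, "B": {EPS}}

if __name__ == "__main__":
    alternatives = [["a", "B", "A"], ["B", "B"], ["B", "c"]]
    first_s = set()
    for alt in alternatives:
        first_s |= first_of_sequence(alt, FIRST)
    print(first_s)
```

The loop returns as soon as it meets a symbol that cannot derive ε, which is why A never contributes to FIRST(aBA), and ε is only added when the whole sequence can vanish, as in BB.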

Resources