Applying YACC to GCODE (GRBL) - parsing

GCode is a language used to tell multi-axis (CNC) robots how to move.
It looks like this :
M3 S5000 (Start Spindle Clockwise at 5000 RPM)
G21 (All units in mm)
G00 Z1.000000 (lift Z axis up by 1mm)
G00 X94.720505 Y-14.904622 (Go to this XY coordinate)
G01 Z0.000000 F100.0 (Penetrate at 100mm/m)
G01 X97.298434 Y-14.870127 F400 (cut to here)
G03 X98.003848 Y-14.275867 I-0.028107 J0.749174 (cut an arc)
G00 Z1.000000 (lift Z axis)
etc.
I have laid these commands out in sentences, but each token could be on a separate line.
And in fact there are no rules about numbers being concatenated to their respective code letters. Yet I already have a LEX parser which can get me the tokens as described below.
Note that certain commands (M or G codes) have parameters.
In the case of M3, it can have an S (spindle speed) parameter.
G0 and G1 can have X,Y,Z,F etc.
G3 can have X,Y,Z,I,J,R...
However each G code does not require ALL those parameters, just one, many or all.
One thing to note here is that we are cutting a single path, then lifting the z axis.
That is, we move to a location above the work surface, penetrate, cut a path then lift off.
I would call this a 'block' or a 'path' and it is this that I'm interested in.
I need to be able to parse GCode in any messy format and then create a structure of 'blocks', where a block is any series of 'commands' between a Z axis down and a Z axis up.
I can tokenise this language using LEX (python PLY specifically).
And get :
type M value 3
type S value 5000
type COMMENT value "Start Spindle Clockwise at 5000 RPM"
type G value 21
type COMMENT value "All units in mm"
type G value 0
type Z value 1.0
etc.
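For readers without PLY installed, a token stream of this shape can be approximated with the standard re module. This is only a sketch under assumptions: the name tokenize and the regular expression are made up for illustration and are not the PLY lexer described above. Numbers without a decimal point are kept as integers, and anything the pattern does not recognise is silently skipped.

```python
import re

# One token: a parenthesised comment, or a code letter followed by a number.
TOKEN_RE = re.compile(
    r"\((?P<comment>[^)]*)\)"                      # (comment text)
    r"|(?P<letter>[A-Za-z])\s*"                    # code letter, optional gap
    r"(?P<number>[+-]?(?:\d+\.?\d*|\.\d+))"        # 5000, -14.9, .5
)

def tokenize(text):
    """Yield (type, value) pairs; whitespace and line breaks are ignored."""
    for match in TOKEN_RE.finditer(text):
        if match.group("comment") is not None:
            yield ("COMMENT", match.group("comment"))
            continue
        letter = match.group("letter").upper()
        number = match.group("number")
        # Keep "5000" as an int and "1.000000" as a float.
        value = float(number) if "." in number else int(number)
        yield (letter, value)

tokens = list(tokenize("M3 S5000 (Start Spindle)\nG21 G00 Z1.000000"))
for kind, value in tokens:
    print("type", kind, "value", value)
```

Because the pattern allows optional whitespace between a letter and its number, input such as "G 01X-14.9" tokenises the same way as "G01 X-14.9".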
Now using YACC I need a rule for a thing called a 'command'.
A command is any comment, or :
A 'G' or 'M' code followed by ANY of the appropriate parameter codes (X,Y,Z etc.)
Command ends when another command (comment, G or M) is encountered.
Then I need a thing called a 'block',
where a block is any set of 'commands' that come after a Z down and before a Z up.
There are 100 G codes and 100 M codes and 24 parameter codes (A-Z minus G and M).
A rule for 'command' might look like :
command : G F H I J K L S T W X Y Z (how to specify ONE OF)
| M S F (How to specify one of)
| COMMENT
And then how would we define block!?
I realise this is a very long post, but if anyone can give me even an idea as to whether YACC can do this? Otherwise I'll just write some code that converts the lex tokens into a tree manually.
Addendum @rici
Thank you for taking the time to understand this question.
By way of feedback:
My task in full is to get YACC to do the heavy lifting of separating chunks of code into blocks based on different use cases.
For example When 'engraving', often a block will represent a letter or some other shape (in the xy plane). So a block will be defined by the movement of the z axis in and out of the xy plane.
I want to be able to post process blocks:
hatch fill a 'block'. which will involve some fairly complicated calculation of path boundaries, tangents to those boundaries, tool diameter etc. This is the most pressing use case and I haven't a good solution to this yet but I know it can be done because it can be done in Inkscape (vector graphics application)
rotate by n degrees. A fairly simple coordinate transformation; I have a solution for this already.
iteratively deepen (extrude). Copy blocks and adjust Z depth on each iteration. Simple.
etc.

If you just want to ensure that a G command is followed by something, you can do this:
g_modifier: F | H | I | J | K | L | S | T | W | X | Y | Z
m_modifier: S | F
g_command: G g_modifier | g_command g_modifier
m_command: M m_modifier | m_command m_modifier
command: g_command | m_command | COMMENT
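As a rough illustration of what these rules recognise, here is a hand-written Python sketch that groups a token list into commands. It is not PLY or yacc output, and the name group_commands and the dictionary layout are assumptions made for this example. Unlike the grammar, it also accepts a G or M word with no modifiers at all, since G21 in the sample has none.

```python
def group_commands(tokens):
    """Group (type, value) tokens into commands.

    A command is a COMMENT, or a G/M word plus the parameter words after it.
    """
    commands = []
    current = None  # the G/M command still collecting parameters, if any
    for kind, value in tokens:
        if kind == "COMMENT":
            current = None  # a comment ends the command before it
            commands.append({"code": "COMMENT", "value": value, "params": {}})
        elif kind in ("G", "M"):
            current = {"code": kind, "value": value, "params": {}}
            commands.append(current)
        elif current is None:
            raise ValueError(f"parameter {kind}{value} has no G or M command")
        else:
            current["params"][kind] = value
    return commands

sample = [("M", 3), ("S", 5000), ("COMMENT", "start"),
          ("G", 0), ("X", 94.7), ("Y", -14.9)]
commands = group_commands(sample)
```

A parameter word that appears before any G or M word is reported as an error here, which matches the grammar but is stricter than real controllers that remember the last motion mode.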
If you want to split those into sequences using the presence of a Z modifier, that can be done. You might want the lexer to be able to produce two different Z token types, based on the sign of the argument, because the parser can only make syntax decisions based on tokens, not on semantic values.
Your question provides at least two different definitions of a block, making it a bit difficult to provide a clear answer.
"That is, we move to a location above the work surface, penetrate, cut a path then lift off. I would call this a 'block' or a 'path' and it is this that I'm interested in."
That would be, for example:
G00 X94.7 Y-14.9 (Move)
G01 Z0.0 (Penetrate)
G01 X97.2 Y-14.8 G03 X98.0 Y-14.2 I-0.02 J0.7 (Path)
G00 Z1.0 (Lift)
But later you say, "a block is any set of 'commands' that come after a Z down and before a Z up."
That would be just this part of the previous example:
G01 X97.2 Y-14.8 G03 X98.0 Y-14.2 I-0.02 J0.7 (Path)
Those are both possible, but obviously different. Here are some possible building blocks:
# This list doesn't include Z words
g_modifier: F | H | I | J | K | L | S | T | W | X | Y
g_command_no_z: G g_modifier
| g_command_no_z g_modifier
# This doesn't distinguish between Z up and Z down. If you want that to
# affect syntax, you need two different Z tokens, and then two different
# with_z non-terminals.
g_command_with_z: G Z
| g_command_no_z Z
| g_command_with_z g_modifier
# You might or might not want this.
# It's a non-empty sequence of G or M commands with no Z's.
path: command_no_z
| path command_no_z
command_no_z: COMMENT
| m_command
| g_command_no_z
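The path rule above (a non-empty run of commands with no Z word) can also be sketched by hand in Python. This is an illustration only: split_paths and the dictionary layout of a command are hypothetical, and the sketch treats any command carrying a Z parameter as a separator without distinguishing Z up from Z down.

```python
def split_paths(commands):
    """Split commands into maximal runs that contain no Z word."""
    paths, current = [], []
    for command in commands:
        if "Z" in command["params"]:
            # A Z move closes the run collected so far (if there is one).
            if current:
                paths.append(current)
            current = []
        else:
            current.append(command)
    if current:
        paths.append(current)
    return paths

move = {"code": "G", "value": 0, "params": {"X": 94.7, "Y": -14.9}}
down = {"code": "G", "value": 1, "params": {"Z": 0.0}}
cut = {"code": "G", "value": 1, "params": {"X": 97.2, "Y": -14.8}}
arc = {"code": "G", "value": 3, "params": {"X": 98.0, "Y": -14.2, "I": -0.02, "J": 0.7}}
lift = {"code": "G", "value": 0, "params": {"Z": 1.0}}

paths = split_paths([move, down, cut, arc, lift])
```

With this definition the positioning move before the plunge forms a run of its own, so a caller interested only in cutting paths would still need to know whether the preceding Z word went down or up.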

Related

How many equivalence classes in the RL relation for {w in {a, b}* | (#a(w) mod m) = ((#b(w)+1) mod m)}

How many equivalence classes in the RL relation for
{w in {a, b}* | (#a(w) mod m) = ((#b(w)+1) mod m)}
I am looking at a past test question which gives me the options
m(m+1)
2m
m^2
m^2+1
infinite
However, I claim that it's m, and I came up with an automaton that I believe accepts this language which contains 3 states (for m=3).
Am I right?
Actually you're right. To see this, observe that the difference of #a(w) and #b(w), #a(w) - #b(w) modulo m, is all that matters; and there are only m possible values of this difference modulo m. So, m states are always sufficient to accept a language of this form: simply make the state corresponding to the appropriate difference the accepting state.
In your DFA, a2 corresponds to a difference of zero, a1 to a difference of one and a3 to a difference of two.
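The sufficiency argument can be checked by brute force. The following Python sketch, whose function names are invented for this example, simulates an m-state automaton that tracks (#a - #b) mod m and compares it with the direct definition of the language on every word up to a given length.

```python
from itertools import product

def in_language(word, m):
    """Direct definition: #a(w) mod m == (#b(w) + 1) mod m."""
    return word.count("a") % m == (word.count("b") + 1) % m

def dfa_accepts(word, m):
    """m-state DFA: the state is (#a - #b) mod m; state 1 mod m accepts."""
    state = 0
    for symbol in word:
        state = (state + (1 if symbol == "a" else -1)) % m
    return state == 1 % m

def agree_up_to(m, max_len):
    """True if the DFA and the definition agree on all words up to max_len."""
    return all(
        in_language("".join(w), m) == dfa_accepts("".join(w), m)
        for n in range(max_len + 1)
        for w in product("ab", repeat=n)
    )

print(agree_up_to(3, 8))
```

This only tests that m states are enough for the words tried; that no two of those states can be merged is a separate argument.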

Different methods of implementing a specific parsing rule for a compiler

Let's say we have a rule in parsing tokens that specifies:
x -> [y[,y]*]
Where the brackets '[ ]' mean that anything in them is optional in order for the rule to take place and the * means 0 or more.
e.g. it could be:
x : (empty)
OR
x : y
OR
x : y,y
as well, etc. (the above are examples of input for which the 'x' rule would match, not how the code should look)
I have tried the following that works already
x : y commaY
|
;
commaY : COMMA y commaY
|
;
I would like to know alternative options in the above that would make it work, if there are any, for educational purposes.
Thank you in advance.
EDIT my earlier answer was incorrect (as pointed out in the comments), but I cannot remove an accepted answer, so I decided to edit it.
You will need (at least) 2 rules for x -> [y[,y]*]. Here is another possibility:
x
: list
| /* eps */
;
list
: list ',' y
| y
;
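To see which inputs such a rule accepts, here is a small hand-written recogniser in Python. It is a sketch under assumptions: parse_x is a made-up name, the tokens are taken to be the literal strings "y" and ",", and the loop mirrors the left-recursive list rule by extending the list one ", y" at a time.

```python
def parse_x(tokens):
    """Parse x -> [y[,y]*] from a list of tokens and return the y's found.

    An empty token list matches the empty alternative of x.
    """
    if not tokens:
        return []                      # x : /* eps */
    if tokens[0] != "y":
        raise SyntaxError("expected y")
    items = [tokens[0]]                # list : y
    pos = 1
    while pos < len(tokens):           # list : list ',' y
        has_pair = pos + 1 < len(tokens)
        if tokens[pos] != "," or not has_pair or tokens[pos + 1] != "y":
            raise SyntaxError(f"unexpected input at position {pos}")
        items.append(tokens[pos + 1])
        pos += 2
    return items

print(parse_x([]))
print(parse_x(["y", ",", "y"]))
```

A trailing comma or two adjacent y tokens raise SyntaxError, which is the behaviour the grammar implies.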

How to count number of non-empty nodes in binary tree in F#

Consider the binary tree algebraic datatype
type btree = Empty | Node of btree * int * btree
and a new datatype defined as follows:
type finding = NotFound | Found of int
Here's my code so far:
let s = Node (Node(Empty, 5, Node(Empty, 2, Empty)), 3, Node (Empty, 6, Empty))
(*
(3)
/ \
(5) (6)
/ \ | \
() (2) () ()
/ \
() ()
*)
(* size: btree -> int *)
let rec size t =
match t with
Empty -> false
| Node (t1, m, t2) -> if (m != Empty) then sum+1 || (size t1) || (size t2)
let num = occurs s
printfn "There are %i nodes in the tree" num
This probably isn't close, I took a function that would find if an integer existed in a tree and tried changing the code for what I was trying to do.
I am very new to using F# and would appreciate any help. I am trying to count all non empty nodes in the tree. For example the tree I'm using should print the value 4.
I did not run the compiler on your code, but I believe this does not even compile.
However your idea to use a pattern match in a recursive function is good.
As rmunn commented, you want to determine the number of nodes in each case:
An empty tree has no nodes, hence the result is zero.
A non-empty tree has the root node plus the nodes of its left and right subtrees.
So something along the lines of the following should work
let rec size t =
    match t with
    | Empty -> 0
    | Node (t1, _, t2) -> 1 + (size t1) + (size t2)
The most important detail here is that you do not need a global variable sum to store any intermediate values. The whole idea of a recursive function is that those intermediate values are the results of recursive calls.
As a remark, your tree in the comment should look like this, I believe.
(*
         (3)
        /   \
     (5)     (6)
     / \     | \
   ()  (2)  ()  ()
       / \
      ()  ()
*)
Edit: I misread the misaligned () as leaves of an empty tree, where in fact they are leaves of the subtree (2). So it was just an ASCII art issue :-)
Friedrich already posted a simple version of the size function that will work for most trees. However, the solution is not "tail-recursive", so it can cause a stack overflow for large trees. In functional programming languages like F#, recursion is often the preferred technique for things like counting and other aggregate functions. However, recursive functions generally consume a stack frame for each recursive call. This means that for large structures, the call stack can be exhausted before the function completes. In order to avoid this problem, compilers can optimize functions that are considered "tail-recursive" so that they use only one stack frame regardless of how many times they recurse. Unfortunately, this optimization cannot just be implemented for any recursive algorithm. It requires that the recursive call be the last thing that the function does, thereby ensuring that the compiler does not have to worry about jumping back into the function after the call, allowing it to overwrite the stack frame instead of adding another one.
In order to change the size function to be tail-recursive, we need some way to avoid having to call it twice in the case of a non-empty node, so that the call can be the last step of the function, instead of the addition between the two calls in Friedrich's solution. This can be accomplished using a couple different techniques, generally either using an accumulator or using Continuation Passing Style. The simpler solution is often to use an accumulator to keep track of the total size instead of having it be the return value, while Continuation Passing Style is a more general solution that can handle more complex recursive algorithms.
In order to make an accumulator pattern work for a tree where we have to sum both the left and right sub-trees, we need some way to make one tail-call at the end of the function, while still making sure that both sub-trees are evaluated. A simple way to do that is to also accumulate the right sub-trees in addition to the total count, so we can make subsequent tail-calls to evaluate those trees while evaluating the left sub-trees first. That solution might look something like this:
let size t =
    let rec size acc ts = function
        | Empty ->
            match ts with
            | [] -> acc
            | head :: tail -> head |> size acc tail
        | Node (t1, _, t2) ->
            t1 |> size (acc + 1) (t2 :: ts)
    t |> size 0 []
This adds the acc parameter and the ts parameter to represent the total count and remaining unevaluated sub-trees. When we hit a populated node, we evaluate the left sub-tree while adding the right sub-tree to our list of trees to evaluate later. When we hit an empty node, we start evaluating any ts we've accumulated, until we have no further populated nodes or unevaluated sub-trees. This isn't the best possible solution for computing the tree size, and most real solutions would use Continuation Passing Style to make it tail-recursive, but that should make a good exercise as you get more familiar with the language.
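For comparison, the same bookkeeping (a running count plus a list of sub-trees still to visit) can be written as a plain loop. The sketch below is in Python rather than F#, and the tuple encoding of the tree (None for Empty, a (left, value, right) tuple for Node) is an assumption made only for this example.

```python
def size(tree):
    """Count nodes; a tree is None (empty) or a (left, value, right) tuple."""
    count = 0
    pending = [tree]          # sub-trees still waiting to be visited
    while pending:
        node = pending.pop()
        if node is None:
            continue
        left, _value, right = node
        count += 1
        pending.append(right)
        pending.append(left)  # popped next, so the left side is visited first
    return count

# The tree s from the question, in the tuple encoding.
s = ((None, 5, (None, 2, None)), 3, (None, 6, None))
print("There are %i nodes in the tree" % size(s))
```

The pending list plays the role of ts and count plays the role of acc, so the depth of the tree no longer affects the call stack.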

elimination of indirect left recursion

I'm having problems understanding an online explanation of how to remove the left recursion in this grammar. I know how to remove direct recursion, but I'm not clear how to handle the indirect. Could anyone explain it?
A --> B x y | x
B --> C D
C --> A | c
D --> d
The way I learned to do this is to replace one of the offending non-terminal symbols with each of its expansions. In this case, we first replace B with its expansion:
A --> B x y | x
B --> C D
becomes
A --> C D x y | x
Now, we do the same for non-terminal symbol C:
A --> C D x y | x
C --> A | c
becomes
A --> A D x y | c D x y | x
The only other remaining grammar rule is
D --> d
so you can also make that replacement, leaving your entire grammar as
A --> A d x y | c d x y | x
There is no indirect left recursion now, since there is nothing indirect at all.
Also see here.
To eliminate left recursion altogether (not merely indirect left recursion), introduce the A' symbol from your own materials (credit to OP for this clarification and completion):
A -> c d x y A' | x A'
A' -> d x y A' | epsilon
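The direct step (a rule A -> A alpha | beta becomes A -> beta A' and A' -> alpha A' | epsilon) is mechanical enough to sketch in code. The Python function below is an illustration with a made-up name and data layout, shown on the textbook rule E -> E + T | T rather than on the grammar from the question.

```python
def eliminate_direct_left_recursion(name, alternatives):
    """Rewrite  A -> A a1 | ... | b1 | ...  without direct left recursion.

    `alternatives` is a list of symbol lists. Returns a dict mapping each
    non-terminal to its new alternatives; [] stands for epsilon.
    """
    recursive = [alt[1:] for alt in alternatives if alt and alt[0] == name]
    others = [alt for alt in alternatives if not alt or alt[0] != name]
    if not recursive:
        return {name: alternatives}  # nothing to do
    tail = name + "'"
    return {
        name: [alt + [tail] for alt in others],
        tail: [alt + [tail] for alt in recursive] + [[]],
    }

grammar = eliminate_direct_left_recursion("E", [["E", "+", "T"], ["T"]])
for lhs, alts in grammar.items():
    print(lhs, "->", " | ".join(" ".join(alt) or "epsilon" for alt in alts))
```

The sketch handles one non-terminal at a time, so indirect recursion still has to be turned into direct recursion by substitution first.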
Response to naomik's comments
Yes, grammars have interesting properties, and you can characterize certain semantic capabilities in terms of constraints on grammar rules. There are transformation algorithms to handle certain types of parsing problems.
In this case, we want to remove left-recursion: one desirable property of a grammar is that the use of any rule must consume at least one input token (terminal symbol). Left-recursion opens a door to infinite recursion in the parser.
I learned these things in my "Foundations of Computing" and "Compiler Construction" classes many years ago. Instead of writing a parser to adapt to a particular grammar, we'd transform the grammar to fit the parser style we wanted.

What is a "Production" in plain English?

I can read on Wikipedia the formal definition of a Production, however when you start reading that article, it makes an assumption about prior knowledge.
Wikipedia defines it as follows:
A production or production rule in computer science is a rewrite rule specifying a symbol substitution that can be recursively performed to generate new symbol sequences.
This assumes that I know and understand what a rewrite rule is. I don't, and if I click the link, I get into another fairly technical explanation.
Can someone explain to me in plain English what a Production actually is?
Note: I have made many attempts to understand this, but I don't think I've succeeded. From what I can tell it rewrites the given string in terms of grammar rules. Not sure if I'm correct.
To explain what a production is I'd like to introduce a bit of context first.
The dragon book states that a context free grammar has 4 components:
a set of terminal symbols (tokens)
a set of non-terminal symbols (syntactic variables)
a set of productions of the form: non-term --> sequence of terminals and non-terminals
a non-terminal symbol designated as the start symbol
It is also said that parsing is the problem of taking a string of terminals (the source code) and figuring out what are the steps required to derive this string of terminals from the start symbol of the grammar.
Now that this has been said, a production is essentially a possible (intermediate) step. I say possible because some symbols can derive into different sequences.
For example, let's make a simple grammar to represent an arbitrarily long sequence of a's ending with a b. The 4 components of this grammar would be:
Terminals: a, b
Non-terminals: S, X
Rules: S --> X, X --> aX, X --> ab
Start symbol: S
From the description I gave above, "aaaab" should be derivable from this grammar. Let's see if that holds up. We start from the start symbol and then apply productions until either a) we get the final sequence, or b) we exhaust all possibilities without succeeding (meaning the sequence is not "grammatically correct").
S
X (after applying S --> X)
aX (after applying X --> aX)
aaX (after applying X --> aX)
aaaX (after applying X --> aX)
aaaab (after applying X --> ab)
And we're done, we got the original sequence. So as you can see we re-wrote the non-terminal symbols by applying rules (one of them we applied recursively) which transformed the sequence into a new sequence of symbols at every step and we did so until we had the final sequence.
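The derivation above can be replayed mechanically. In this Python sketch (the function name derive is invented for the example, and every grammar symbol is assumed to be a single character), each step rewrites the leftmost occurrence of a production's left-hand side.

```python
def derive(start, steps):
    """Apply (lhs, rhs) productions in order, rewriting the leftmost lhs."""
    form = start
    history = [form]
    for lhs, rhs in steps:
        if lhs not in form:
            raise ValueError(f"{lhs} does not occur in {form!r}")
        form = form.replace(lhs, rhs, 1)
        history.append(form)
    return history

history = derive("S", [("S", "X"), ("X", "aX"), ("X", "aX"),
                       ("X", "aX"), ("X", "ab")])
print(" => ".join(history))
```

A parser solves the reverse problem: given only the final string, it has to discover which sequence of productions leads to it.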
A rewrite rule is a method of replacing subterms of a formula with other terms. In their most basic form, they consist of a set of objects, plus relations on how to transform those objects.
An example of a rewrite rule could look like:
A → B
Now as for what this actually does! You are right in your note. Take, for example, a list of things (and 2 rewrite rules):
X, Y, Z
X → Y
Y → Z
Which would result in:
Z, Z, Z
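As a small illustration, the two rules can be applied repeatedly until nothing changes. The Python sketch below uses a hypothetical helper named rewrite and assumes each rule replaces a whole list item; rules that form a cycle (such as X → Y together with Y → X) would make it loop forever.

```python
def rewrite(items, rules):
    """Apply single-symbol rewrite rules until no item changes."""
    changed = True
    while changed:
        new_items = [rules.get(item, item) for item in items]
        changed = new_items != items
        items = new_items
    return items

result = rewrite(["X", "Y", "Z"], {"X": "Y", "Y": "Z"})
print(", ".join(result))
```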
A production rule is a rewrite rule because it is a method of replacing subterms of a formula (probably a string in your case). A production rule could look like:
X, Y, Z
X → aX
By using the rule in such a way it becomes possible to apply recursion (create new sequences) as it will keep replacing itself:
aX, Y, Z
aaX, Y, Z
aaaX, Y, Z
As for the question you are asking, you could say: "A production rule is a replacement rule for formulas that uses recursion to create new sequences".
