I was writing a little helper function toString(TypeSymbol t, M3 m) when I encountered a weird parser error.
The function has a lot of statements like:
...
}else if(object() := t){
return "object";
}else if(float() := t){
return "float";
}else if(double() := t){
return "double";
...
These work fine.
However, when I try this same pattern for int() or void(), the compiler gives an error, specifically on the = sign.
if(int() := t){}
^ Parse error here
As it often happens, I found the answer to this question while I was typing it up.
However, I think it'll be valuable for others so I will post it nonetheless.
I got the syntax for pattern matching in this answer: https://stackoverflow.com/a/21929342/451847
It seems that the 'proper' way of pattern matching is to prefix the type you want to test for with a \.
So, the code above becomes:
...
}else if(\object() := t){
return "object";
}else if(\float() := t){
return "float";
}else if(\double() := t){
return "double";
...
The non-\ syntax works for most of the cases but I think int() and void() have a different definition.
Related
Hi Relatively new to Dafny and have defined methods set2Seq and seq2Set for conversion between sets and seqs. But can only find how to write a function fseq2Set from sets to sequences.
I can not find how to define fseq2Set. As Lemmas can not reference methods this makes proving the identity beyond me. Any help much appreciated?
Code:
function method fseq2Set(se: seq<int>) :set<int>
{ set x:int | x in se :: x }
method seq2Set(se: seq<int>) returns (s:set<int>)
{ s := set x:int | x in se :: x; }
method set2Seq(s: set<int>) returns (se:seq<int>)
requires s != {}
ensures s == fseq2Set(se)
decreases |s|
{
var y :| y in s;
var tmp ;
if (s=={y}) {tmp := [];} else {tmp := set2Seq(s-{y});}
se := [y] + tmp;
assert (s-{y}) + {y} == fseq2Set([y] + tmp);
}
/* below fails */
function fset2Seq(s:set<int>):seq<int>
decreases s { var y :| y in s ; [y] + fset2Seq(s-{y}) }
lemma cycle(s:set<int>) ensures forall s:set<int> :: fseq2Set(fset2Seq(s)) == s { }
Ok, there is kind of a lot going on here. First of all, I'm not sure if you intended to do anything with the methods seq2Set and set2Seq, but they don't seem to be relevant to your failure, so I'm just going to ignore them and focus on the functions.
Speaking of functions, Danfy reports an error on your definition of fset2Seq, because s might be empty. In that case, we should return the empty sequence, so I adjusted your definition to:
function fset2Seq(s:set<int>):seq<int>
decreases s
{
if s == {} then []
else
var y := Pick(s);
[y] + fset2Seq(s - {y})
}
function Pick(s: set<int>): int
requires s != {}
{
var x :| x in s; x
}
which fixes that error. Notice that I also wrapped the let-such-that operator :| in a function called Pick. This is essential, but hard to explain. Just trust me for now.
Now on to the lemma. Your lemma is stated a bit weirdly, because it takes a parameter s, but then the ensures clause doesn't mention s. (Instead it mentions a completely different variable, also called s, that is bound by the forall quantifier!) So I adjusted it to get rid of the quantifier, as follows:
lemma cycle(s:set<int>)
ensures fseq2Set(fset2Seq(s)) == s
Next, I follow My Favorite Heuristic™ in program verification, which is that the structure of the proof follows the structure of the program. In this case, the "program" in question is fseq2Set(fset2Seq(s)). Starting from our input s, it first gets processed recursively by fset2Seq and then through the set comprehension in fseq2Set. So, I expect a proof by induction on s that follows the structure of fset2Seq. That structure is to branch on whether s is empty, so let's do that in the lemma too:
lemma cycle(s:set<int>)
ensures fseq2Set(fset2Seq(s)) == s
{
if s == {} {
} else {
}
...
Dafny reports an error on the else branch but not on the if branch. In other words, Dafny has proved the base case, but it needs help with the inductive case. The next thing that fset2Seq(s) does is call Pick(s). Let's do that too.
lemma cycle(s:set<int>)
ensures fseq2Set(fset2Seq(s)) == s
{
if s == {} {
} else {
var y := Pick(s);
...
Now we know from its definition that fset2Seq(s) is going to return [y] + fset2Seq(s - {y}), so we can copy-paste our ensures clause and manually substitute this expression.
lemma cycle(s:set<int>)
ensures fseq2Set(fset2Seq(s)) == s
{
if s == {} {
} else {
var y := Pick(s);
assert fseq2Set([y] + fset2Seq(s - {y})) == s;
...
Dafny reports an error on this assertion, which is not surprising, since it's just a lightly edited version of the ensures clause we're trying to prove. But importantly, Dafny no longer reports an error on the ensures clause itself. In other words, if we can prove this assertion, we are done.
Looking at this assert, we can see that fseq2Set is applied to two lists appended together. And we would expect that to be equivalent to separately converting the two lists to sets, and then taking their union. We could prove a lemma to that effect, or we could just ask Dafny if it already knows this fact, like this:
lemma cycle(s:set<int>)
ensures fseq2Set(fset2Seq(s)) == s
{
if s == {} {
} else {
var y := Pick(s);
assert fseq2Set([y] + fset2Seq(s - {y})) == fseq2Set([y]) + fseq2Set(fset2Seq(s - {y}));
assert fseq2Set([y] + fset2Seq(s - {y})) == s;
...
(Note that the newly added assertion is before the last one.)
Dafny now accepts our lemma. We can clean up a little by deleting the the base case and the final assertion that was just a copy-pasted version of our ensures clause. Here is the polished proof.
lemma cycle(s:set<int>)
ensures fseq2Set(fset2Seq(s)) == s
{
if s != {} {
var y := Pick(s);
assert fseq2Set([y] + fset2Seq(s - {y})) == fseq2Set([y]) + fseq2Set(fset2Seq(s - {y}));
}
}
I hope this explains how to prove the lemma and also gives you a little bit of an idea about how to make progress when you are stuck.
I did not explain Pick. Basically, as a rule of thumb, you should just always wrap :| in a function whenever you use it. To understand why, see the Dafny power user posts on iterating over collecion and functions over set elements. Also, see Rustan's paper Compiling Hilbert's epsilon operator.
The grammar for the type language is as follows:
TYPE ::= TYPEVAR | PRIMITIVE_TYPE | FUNCTYPE | LISTTYPE;
PRIMITIVE_TYPE ::= ‘int’ | ‘float’ | ‘long’ | ‘string’;
TYPEVAR ::= ‘`’ VARNAME; // Note, the character is a backwards apostrophe!
VARNAME ::= [a-zA-Z][a-zA-Z0-9]*; // Initial letter, then can have numbers
FUNCTYPE ::= ‘(‘ ARGLIST ‘)’ -> TYPE | ‘(‘ ‘)’ -> TYPE;
ARGLIST ::= TYPE ‘,’ ARGLIST | TYPE;
LISTTYPE ::= ‘[‘ TYPE ‘]’;
My input like this: TYPE
for example, if I input (int,int)->float, this is valid. If I input ( [int] , int), it's a wrong type and invalid.
I need to parse input from keyboard and decide if it's valid under this grammar(for later type inference). However, I don't know how to build this grammar with go and how to parse input by each byte. Is there any hint or similar implementation? That's will be really helpful.
For your purposes, the grammar of types looks simple enough that you should be able to write a recursive descent parser that roughly matches the shape of your grammar.
As a concrete example, let's say that we're recognizing a similar language.
TYPE ::= PRIMITIVETYPE | TUPLETYPE
PRIMITIVETYPE ::= 'int'
TUPLETYPE ::= '(' ARGLIST ')'
ARGLIST ::= TYPE ARGLIST | TYPE
Not quite exactly the same as your original problem, but you should be able to see the similarities.
A recursive descent parser consists of functions for each production rule.
func ParseType(???) error {
???
}
func ParsePrimitiveType(???) error {
???
}
func ParseTupleType(???) error {
???
}
func ParseArgList(???) error {
???
}
where we'll denote things that we don't quite know what to put as ???* till we get there. We at least will say for now that we get an error if we can't parse.
The input into each of the functions is some stream of tokens. In our case, those tokens consist of sequences of:
"int"
"("
")"
and we can imagine a Stream might be something that satisfies:
type Stream interface {
Peek() string // peek at next token, stay where we are
Next() string // pick next token, move forward
}
to let us walk sequentially through the token stream.
A lexer is responsible for taking something like a string or io.Reader and producing this stream of string tokens. Lexers are fairly easy to write: you can imagine just using regexps or something similar to break a string into tokens.
Assuming we have a token stream, then a parser then just needs to deal with that stream and a very limited set of possibilities. As mentioned before, each production rule corresponds to a parsing function. Within a production rule, each alternative is a conditional branch. If the grammar is particularly simple (as yours is!), we can figure out which conditional branch to take.
For example, let's look at TYPE and its corresponding ParseType function:
TYPE ::= PRIMITIVETYPE | TUPLETYPE
PRIMITIVETYPE ::= 'int'
TUPLETYPE ::= '(' ARGLIST ')'
How might this corresponds to the definition of ParseType?
The production says that there are two possibilities: it can either be (1) primitive, or (2) tuple. We can peek at the token stream: if we see "int", then we know it's primitive. If we see a "(", then since the only possibility is that it's tuple type, we can call the tupletype parser function and let it do the dirty work.
It's important to note: if we don't see either a "(" nor an "int", then something horribly has gone wrong! We know this just from looking at the grammar. We can see that every type must parse from something FIRST starting with one of those two tokens.
Ok, let's write the code.
func ParseType(s Stream) error {
peeked := s.Peek()
if peeked == "int" {
return ParsePrimitiveType(s)
}
if peeked == "(" {
return ParseTupleType(s)
}
return fmt.Errorf("ParseType on %#v", peeked)
}
Parsing PRIMITIVETYPE and TUPLETYPE is equally direct.
func ParsePrimitiveType(s Stream) error {
next := s.Next()
if next == "int" {
return nil
}
return fmt.Errorf("ParsePrimitiveType on %#v", next)
}
func ParseTupleType(s Stream) error {
lparen := s.Next()
if lparen != "(" {
return fmt.Errorf("ParseTupleType on %#v", lparen)
}
err := ParseArgList(s)
if err != nil {
return err
}
rparen := s.Next()
if rparen != ")" {
return fmt.Errorf("ParseTupleType on %#v", rparen)
}
return nil
}
The only one that might cause some issues is the parser for argument lists. Let's look at the rule.
ARGLIST ::= TYPE ARGLIST | TYPE
If we try to write the function ParseArgList, we might get stuck because we don't yet know which choice to make. Do we go for the first, or the second choice?
Well, let's at least parse out the part that's common to both alternatives: the TYPE part.
func ParseArgList(s Stream) error {
err := ParseType(s)
if err != nil {
return err
}
/// ... FILL ME IN. Do we call ParseArgList() again, or stop?
}
So we've parsed the prefix. If it was the second case, we're done. But what if it were the first case? Then we'd still have to read additional lists of types.
Ah, but if we are continuing to read additional types, then the stream must FIRST start with another type. And we know that all types FIRST start either with "int" or "(". So we can peek at the stream. Our decision whether or not we picked the first or second choice hinges just on this!
func ParseArgList(s Stream) error {
err := ParseType(s)
if err != nil {
return err
}
peeked := s.Peek()
if peeked == "int" || peeked == "(" {
// alternative 1
return ParseArgList(s)
}
// alternative 2
return nil
}
Believe it or not, that's pretty much all we need. Here is working code.
package main
import "fmt"
type Stream interface {
Peek() string
Next() string
}
type TokenSlice []string
func (s *TokenSlice) Peek() string {
return (*s)[0]
}
func (s *TokenSlice) Next() string {
result := (*s)[0]
*s = (*s)[1:]
return result
}
func ParseType(s Stream) error {
peeked := s.Peek()
if peeked == "int" {
return ParsePrimitiveType(s)
}
if peeked == "(" {
return ParseTupleType(s)
}
return fmt.Errorf("ParseType on %#v", peeked)
}
func ParsePrimitiveType(s Stream) error {
next := s.Next()
if next == "int" {
return nil
}
return fmt.Errorf("ParsePrimitiveType on %#v", next)
}
func ParseTupleType(s Stream) error {
lparen := s.Next()
if lparen != "(" {
return fmt.Errorf("ParseTupleType on %#v", lparen)
}
err := ParseArgList(s)
if err != nil {
return err
}
rparen := s.Next()
if rparen != ")" {
return fmt.Errorf("ParseTupleType on %#v", rparen)
}
return nil
}
func ParseArgList(s Stream) error {
err := ParseType(s)
if err != nil {
return err
}
peeked := s.Peek()
if peeked == "int" || peeked == "(" {
// alternative 1
return ParseArgList(s)
}
// alternative 2
return nil
}
func main() {
fmt.Println(ParseType(&TokenSlice{"int"}))
fmt.Println(ParseType(&TokenSlice{"(", "int", ")"}))
fmt.Println(ParseType(&TokenSlice{"(", "int", "int", ")"}))
fmt.Println(ParseType(&TokenSlice{"(", "(", "int", ")", "(", "int", ")", ")"}))
// Should show error:
fmt.Println(ParseType(&TokenSlice{"(", ")"}))
}
This is a toy parser, of course, because it is not handling certain kinds of errors very well (like premature end of input), and tokens should include, not only their textual content, but also their source location for good error reporting. For your own purposes, you'll also want to expand the parsers so that they don't just return error, but also some kind of useful result from the parse.
This answer is just a sketch on how recursive descent parsers work. But you should really read a good compiler book to get the details, because you need them. The Dragon Book, for example, spends at least a good chapter on about how to write recursive descent parsers with plenty of the technical details. in particular, you want to know about the concept of FIRST sets (which I hinted at), because you'll need to understand them to choose which alternative is appropriate when writing each of your parser functions.
I would like to know how I can handle multiple optionals without concrete pattern matching for each possible permutation.
Below is a simplified example of the problem I am facing:
lexical Int = [0-9]+;
syntax Bool = "True" | "False";
syntax Period = "Day" | "Month" | "Quarter" | "Year";
layout Standard = [\ \t\n\f\r]*;
syntax Optionals = Int? i Bool? b Period? p;
str printOptionals(Optionals opt){
str res = "";
if(!isEmpty("<opt.i>")) { // opt has i is always true (same for opt.i?)
res += printInt(opt.i);
}
if(!isEmpty("<opt.b>")){
res += printBool(opt.b);
}
if(!isEmpty("<opt.p>")) {
res += printPeriod(opt.period);
}
return res;
}
str printInt(Int i) = "<i>";
str printBool(Bool b) = "<b>";
str printPeriod(Period p) = "<p>";
However this gives the error message:
The called signature: printInt(opt(lex("Int"))), does not match the declared signature: str printInt(sort("Int"));
How do I get rid of the opt part when I know it is there?
I'm not sure how ideal this is, but you could do this for now:
if (/Int i := opt.i) {
res += printInt(i);
}
This will extract the Int from within opt.i if it is there, but the match will fail if Int was not provided as one of the options.
The current master on github has the following feature to deal with optionals: they can be iterated over.
For example:
if (Int i <- opt.i) {
res += printInt(i);
}
The <- will produce false immediately if the optional value is absent, and otherwise loop once through and bind the value which is present to the pattern.
An untyped solution is to project out the element from the parse tree:
rascal>opt.i.args[0];
Tree: `1`
Tree: appl(prod(lex("Int"),[iter(\char-class([range(48,57)]))],{}),[appl(regular(iter(\char-class([range(48,57)]))),[char(49)])[#loc=|file://-|(0,1,<1,0>,<1,1>)]])[#loc=|file://-|(0,1,<1,0>,<1,1>)]
However, then to transfer this back to an Int you'd have to pattern match, like so:
rascal>if (Int i := opt.i.args[0]) { printInt(i); }
str: "1"
One could write a generic cast function to help out here:
rascal>&T cast(type[&T] t, value v) { if (&T a := v) return a; throw "cast exception"; }
ok
rascal>printInt(cast(#Int, opt.i.args[0]))
str: "1"
Still, I believe Rascal is missing a feature here. Something like this would be a good feature request:
rascal>Int j = opt.i.value;
rascal>opt.i has value
bool: true
Let's say I have the following grammar for boolean queries:
Query := <Atom> [ <And-Chain> | <Or-Chain> ]
Atom := "(" <Query> ")" | <Var> <Op> <Var>
And-Chain := "&&" <Atom> [ <And-Chain> ]
Or-Chain := "||" <Atom> [ <Or-Chain> ]
Var := "A" | "B" | "C" | ... | "Z"
Op := "<" | ">" | "="
A parser for this grammar would have the pseudo-code:
parse(query) :=
tokens <- tokenize(query)
parse-query(tokens)
assert-empty(tokens)
parse-query(tokens) :=
parse-atom(tokens)
if peek(tokens) equals "&&"
parse-chain(tokens, "&&")
else if peek(tokens) equals "||"
parse-chain(tokens, "||")
parse-atom(tokens) :=
if peek(tokens) equals "("
assert-equal( "(", shift(tokens) )
parse-query(tokens)
assert-equal( ")", shift(tokens) )
else
parse-var(tokens)
parse-op(tokens)
parse-var(tokens)
parse-chain(tokens, connector) :=
assert-equal( connector, shift(tokens) )
parse-atom(tokens)
if peek(tokens) equals connector
parse-chain(tokens, connector)
parse-var(tokens) :=
assert-matches( /[A-Z]/, shift(tokens) )
parse-op(tokens) :=
assert-matches( /[<>=]/, shift(tokens) )
What I want to make sure of, though, is that my parser will report helpful parse errors. For example, given a query that starts with "( A < B && B < C || ...", I'd like an error like :
found "||" but expected "&&" or ")"
The trick with this is that it gathers expectations from across different parts of the parser. I can work out ways to do this, but it all ends up feeling a little clunky.
Like, in Java, I might throw a GreedyError when attempting to peek for a "&&" or "||"
// in parseAtom
if ( tokens.peek() == "(" ) {
assertEqual( "(", tokens.shift() );
try {
parseQuery(tokens);
caught = null;
}
catch (GreedyError e) {
caught = e;
}
try {
assertEqual( ")", tokens.shift() );
}
catch (AssertionError e) {
throw e.or(caught);
}
}
// ...
// in parseChain
assertEqual( connector, tokens.shift() );
parseAtom(tokens);
if (tokens.peek() == connector) {
parseChain(tokens, connector);
}
else {
new GreedyError(connector);
}
Or, in Haskell, I might use WriterT to keep track of my last failing comparisons for the current token, using censor to clear out after every successful token match.
But both of these solutions feel a bit hacked and clunky. I feel like I'm missing something fundamental, some pattern, that could handle this elegantly. What is it?
Build an L(AL)R(1) state machine. More details about LALR can be found here.
What you want to use for error reporting is the FIRSTOF set for dot in each state. This answer will make sense when you understand how the parser states are generated.
If you have such a set of states, you can record the union of the FIRSTOF sets with each state; then when in that state, and no transition is possible, your error message is "Exepect (firstof currentstate)".
If you don't want to record the FIRST sets, you can easily write an algorithm that will climb through the state tables to reconstruct it. That would the algorithmic equivalent of your "peeking ahead".
I am looking to write some pseudo-code of a recursive descent parser. Now, I have no experience with this type of coding. I have read some examples online but they only work on grammar that uses mathematical expressions. Here is the grammar I am basing the parser on.
S -> if E then S | if E then S else S | begin S L | print E
L -> end | ; S L
E -> i
I must write methods S(), L() and E() and return some error messages, but the tutorials I have found online have not helped a lot. Can anyone point me in the right direction and give me some examples?.
I would like to write it in C# or Java syntax, since it is easier for me to relate.
Update
public void S() {
if (currentToken == "if") {
getNextToken();
E();
if (currentToken == "then") {
getNextToken();
S();
if (currentToken == "else") {
getNextToken();
S();
Return;
}
} else {
throw new IllegalTokenException("Procedure S() expected a 'then' token " + "but received: " + currentToken);
} else if (currentToken == "begin") {
getNextToken();
S();
L();
return;
} else if (currentToken == "print") {
getNextToken();
E();
return;
} else {
throw new IllegalTokenException("Procedure S() expected an 'if' or 'then' or else or begin or print token " + "but received: " + currentToken);
}
}
}
public void L() {
if (currentToken == "end") {
getNextToken();
return;
} else if (currentToken == ";") {
getNextToken();
S();
L();
return;
} else {
throw new IllegalTokenException("Procedure L() expected an 'end' or ';' token " + "but received: " + currentToken);
}
}
public void E() {
if (currentToken == "i") {
getNextToken();
return;
} else {
throw new IllegalTokenException("Procedure E() expected an 'i' token " + "but received: " + currentToken);
}
}
Basically in recursive descent parsing each non-terminal in the grammar is translated into a procedure, then inside each procedure you check to see if the current token you are looking at matches what you would expect to see on the right hand side of the non-terminal symbol corresponding to the procedure, if it does then continue applying the production, if it doesn't then you have encountered an error and must take some action.
So in your case as you mentioned above you will have procedures: S(), L(), and E(), I will give an example of how L() could be implemented and then you can try and do S() and E() on your own.
It is also important to note that you will need some other program to tokenize the input for you, then you can just ask that program for the next token from your input.
/**
* This procedure corresponds to the L non-terminal
* L -> 'end'
* L -> ';' S L
*/
public void L()
{
if(currentToken == 'end')
{
//we found an 'end' token, get the next token in the input stream
//Notice, there are no other non-terminals or terminals after
//'end' in this production so all we do now is return
//Note: that here we return to the procedure that called L()
getNextToken();
return;
}
else if(currentToken == ';')
{
//we found an ';', get the next token in the input stream
getNextToken();
//Notice that there are two more non-terminal symbols after ';'
//in this production, S and L; all we have to do is call those
//procedures in the order they appear in the production, then return
S();
L();
return;
}
else
{
//The currentToken is not either an 'end' token or a ';' token
//This is an error, raise some exception
throw new IllegalTokenException(
"Procedure L() expected an 'end' or ';' token "+
"but received: " + currentToken);
}
}
Now you try S() and E(), and post back.
Also as Kristopher points out your grammar has something called a dangling else, meaing that you have a production that starts with the same thing up to a point:
S -> if E then S
S -> if E then S else S
So this begs the question if your parser sees an 'if' token, which production should it choose to process the input? The answer is it has no idea which one to choose because unlike humans the compiler can't look ahead into the input stream to search for an 'else' token. This is a simple problem to fix by applying a rule known as Left-Factoring, very similar to how you would factor an algebra problem.
All you have to do is create a new non-terminal symbol S' (S-prime) whose right hand side will hold the pieces of the productions that aren't common, so your S productions no becomes:
S -> if E then S S'
S' -> else S
S' -> e
(e is used here to denote the empty string, which basically means there is no
input seen)
This is not the easiest grammar to start with because you have an unlimited amount of lookahead on your first production rule:
S -> if E then S | if E then S else S | begin S L | print E
consider
if 5 then begin begin begin begin ...
When do we determine this stupid else?
also, consider
if 5 then if 4 then if 3 then if 2 then print 2 else ...
Now, was that else supposed to bind to the if 5 then fragment? If not, that's actually cool, but be explicit.
You can rewrite your grammar (possibly, depending on else rule) equivalently as:
S -> if E then S (else S)? | begin S L | print E
L -> end | ; S L
E -> i
Which may or may not be what you want. But the pseudocode sort of jumps out from this.
define S() {
if (peek()=="if") {
consume("if")
E()
consume("then")
S()
if (peek()=="else") {
consume("else")
S()
}
} else if (peek()=="begin") {
consume("begin")
S()
L()
} else if (peek()=="print") {
consume("print")
E()
} else {
throw error()
}
}
define L() {
if (peek()=="end") {
consume("end")
} else if (peek==";")
consume(";")
S()
L()
} else {
throw error()
}
}
define E() {
consume_token_i()
}
For each alternate, I created an if statement that looked at the unique prefix. The final else on any match attempt is always an error. I consume keywords and call the functions corresponding to production rules as I encounter them.
Translating from pseudocode to real code isn't too complicated, but it isn't trivial. Those peeks and consumes probably don't actually operate on strings. It's far easier to operate on tokens. And simply walking a sentence and consuming it isn't quite the same as parsing it. You'll want to do something as you consume the pieces, possibly building up a parse tree (which means these functions probably return something). And throwing an error might be correct at a high level, but you'd want to put some meaningful information into the error. Also, things get more complex if you do require lookahead.
I would recommend Language Implementation Patterns by Terence Parr (the guy who wrote antlr, a recursive descent parser generator) when looking at these kinds of problems. The Dragon Book (Aho, et al, recommended in a comment) is good, too (it is still probably the dominant textbook in compiler courses).
I taught (really just, helped) the parsing section of a PL class last semester. I really recommend looking at the parsing slides from our page: here. Basically, for recursive descent parsing, you ask yourself the following question:
I've parsed a little bit of a nonterminal, now I'm at a point where I can make a choice as to what I'm supposed to parse next. What I see next will make decisions as to what nonterminal I'm in.
Your grammar -- by the way -- exhibits a very common ambiguity called the "dangling else," which has been around since the Algol days. In shift reduce parsers (typically produced by parser generators) this generates a shift / reduce conflict, where you typically choose to arbitrarily shift instead of reduce, giving you the common "maximal much" principal. (So, if you see "if (b) then if (b2) S1 else S2" you read it as "if (b) then { if (b2) { s1; } else { s2; } }")
Let's throw this out of your grammar, and work with a slightly simpler grammar:
T -> A + T
| A - T
| A
A -> NUM * A
| NUM / A
| NUM
we're also going to assume that NUM is something you get from the lexer (i.e., it's just a token). This grammar is LL(1), that is, you can parse it with a recursive descent parser implemented using a naive algorithm. The algorithm works like this:
parse_T() {
a = parse_A();
if (next_token() == '+') {
next_token(); // advance to the next token in the stream
t = parse_T();
return T_node_plus_constructor(a, t);
}
else if (next_token() == '-') {
next_token(); // advance to the next token in the stream
t = parse_T();
return T_node_minus_constructor(a, t);
} else {
return T_node_a_constructor(a);
}
}
parse_A() {
num = token(); // gets the current token from the token stream
next_token(); // advance to the next token in the stream
assert(token_type(a) == NUM);
if (next_token() == '*') {
a = parse_A();
return A_node_times_constructor(num, a);
}
else if (next_token() == '/') {
a = parse_A();
return A_node_div_constructor(num, a);
} else {
return A_node_num_constructor(num);
}
}
Do you see the pattern here:
for each nonterminal in your grammar: parse the first thing, then see what you have to look at to decide what you should parse next. Continue this until you're done!
Also, please note that this type of parsing typically won't produce what you want for arithmetic. Recursive descent parsers (unless you use a little trick with tail recursion?) won't produce leftmost derivations. In particular, you can't write left recursive rules like "a -> a - a" where the leftmost associativity is really necessary! This is why people typically use fancier parser generator tools. But the recursive descent trick is still worth knowing about and playing around with.