I am trying to build a parser for a desk calculator and am using the following bison code for it.
%union{
float f;
char c;
// int
}
%token <f> NUM
%token <c> ID
%type <f> S E T F G
%%
C : S ';'
| C S ';'
;
S : ID '=' E {fprintf(debug,"13\n");printf("%c has been assigned the value %f.",$1,$3);symbolTable[$1]=$3;}
| E {fprintf(debug,"12\n");result = $$;}
;
E : E '+' T {fprintf(debug,"11\n");$$ = $1+$3;}
| E '-' T {fprintf(debug,"10\n");$$ = $1-$3;}
| T {fprintf(debug,"9\n");$$ = $1;}
;
T : T '*' F {fprintf(debug,"7\n");$$ = $1*$3;}
| T '/' F {fprintf(debug,"6\n");$$ = $1/$3;}
| F {fprintf(debug,"5\n");$$ = $1;}
;
F : G '#' F {fprintf(debug,"4\n");$$ = pow($1,$3);}
| G {fprintf(debug,"3\n");$$ = $1;}
;
G : '(' E ')' {fprintf(debug,"2\n");$$ = $2;}
| NUM {fprintf(debug,"1\n");$$ = $1;}
| ID {fprintf(debug,"0\n");$$ = symbolTable[$1];}
;
%%
My LEX rules are
digit [0-9]
num {digit}+
alpha [A-Za-z]
id {alpha}({alpha}|{digit})*
white [\ \t]
%%
let {printf("let");return LET;}
{num} {yylval.f = atoi(yytext);return NUM;}
{alpha} {yylval.c = yytext[0];return ID;}
[+\-\*/#\(\)] {return yytext[0];}
. {}
%%
The input I gave is a=2+3
When the lexer returns an ID(for 'a'), the parser is going for the production with fprintf(debug,"0\n"). But I want it to go for the production fprintf(debug,"13\n").
So, I am wondering what made my parser go for a reduction on production 0, instead of shifting = to stack, and how do I control it?
What you actually specified is a translation grammar, given by the following:
C → S ';' 14 | C S ';' 8
S → ID '=' E 13 | E 12
E → E '+' T 11 | E '-' T 10 | T 9
T → T '*' F 7 | T "/" F 6 | F 5
F → G '#' F 4 | G 3
G → '(' E ')' 2 | NUM 1 | ID 0
with top-level/start configuration C. (For completeness, I added in 8 and 14).
There is only one word generated from C, by this translation grammar, containing ID '=' NUM '+' NUM as the subword of input tokens, and that is ID ('a') '=' NUM('2') 1 3 5 9 '+' NUM('3') 1 3 5 11 13 ';' 14, which is equal to the input-output pair (ID '=' NUM '+' NUM ';', 1 3 5 9 1 3 5 11 13 14). So, the sequence 1 3 5 9 1 3 5 11 13 14 is the one and only translation. Provided the grammar is LALR(1), then this translation will be produced, as a result; and the grammar is LALR(1).
If you're not getting this result, then that can only mean that you implemented wrong whatever you left out of your description: i.e. the lexer ... or that your grammar processor has a bug or your machine has a failure.
And, no; actually what you did is the better way to see what's going on - just stick in a single printf statement to the right hand side of each rule and run it that way to see what translation sequences are produced. The "trace" facility in the parser generator is superfluous for that very reason ... at least the way it is usually implemented (more on that below). In addition, you can get a direct view of everything with the -v option, which produces the LR(0) tables with LALR(1) annotations.
The kind of built-in testing facility that would actually be more helpful - especially for examples like this - is just what I described: one that echoes the inputs interleaved with the output actions. So, when you run it on "a = 2 + 3 ;", it would give you ID('a') '=' NUM('2') 1 3 5 9 '+' NUM('3') 1 3 5 11 13 ';' 14 with echo turned on, and just 1 3 5 9 1 3 5 11 13 14 with echo turned off. That would actually be more useful to have as a built-in capability, instead of the trace mode you normally see in implementations of yacc.
The POSIX specification actually leaves open the issue of how "YYDEBUG", "yydebug" and "-t" are to be implemented in a compliant implementation of yacc, to make room for alternative approaches like this.
Well, it turns out that the problem is I am not identifying = as a token here, in my LEX.
As silly as it sounds, it points out a very important concept of yacc/Bison. The question of whether to shift or reduce is answered by checking the next symbol, also called the lookahead. In this case, the lookahead was NUM(for 2) and not =, because of my faulty LEX code. Since there is no production involving ID followed by NUM, it is going for a reduction to G.
And about how I figured it out, it turns out bison has a built-in trace feature. It lays out neatly like a diary entry, whatever it does while parsing. each and every step is written down.
To enable it,
Run bison with -Dparse.trace option.
bison calc.y -d -Dparse.trace
In the main function of parser grab the extern yydebug and set it to non-zero value.
int main(){
extern int yydebug;
yydebug = 1;
.
.
.
}
Related
I have checked similar questions surrounding this issue but none seems to provide a solution to my version of the problem.
I just started Antlr4 recently and all has been going nicely until I hit this particular roadblock.
My grammar is a basic math expression grammar but for some reason I noticed the generated parser(?) is unable to walk from paser-rule "equal" to paser-rule "expr", in order to reach lexer-rule "NAME".
grammar MathCraze;
NUM : [0-9]+ ('.' [0-9]+)?;
WS : [ \t]+ -> skip;
NL : '\r'? '\n' -> skip;
NAME: [a-zA-Z_][a-zA-Z_0-9]*;
ADD: '+';
SUB : '-';
MUL : '*';
DIV : '/';
POW : '^';
equal
: add # add1
| NAME '=' equal # assign
;
add
: mul # mul1
| add op=('+'|'-') mul # addSub
;
mul
: exponent # power1
| mul op=('*'|'/') exponent # mulDiv
;
exponent
: expr # expr1
| expr '^' exponent # power
;
expr
: NUM # num
| NAME # name
| '(' add ')' # parens
;
If I pass a word as input, sth like "variable", the parser throws the error above, but if I pass a number as input (say "78"), the parser walks the tree successfully (i.e, from rule "equal" to "expr").
equal equal
| |
add add
| |
mul mul
| |
exponent exponent
| |
expr expr
| |
NUM NAME
| |
"78" # No Error "variable" # Error! Tree walk doesn't reach here.
I've checked for every type of ambiguity I know of, so I'm probably missing something here.
I'm using Antlr5.6 by the way and I will appreciate if this problem gets solved. Thanks in advance.
Your style of expression hierarchy is the one we use in parsers written by hand or in ANTLR v3, from low to high precedence.
As Raven said, ANTLR 4 is much more powerful. Note the <assoc = right> specification in the power rule, which is usually right-associative.
grammar Question;
question
: line+ EOF
;
line
: expr NL
| assign NL
;
assign
: NAME '=' expr # assignSingle
| NAME '=' assign # assignMulti
;
expr // from high to low precedence
: <assoc = right> expr '^' expr # power
| expr op=( '*' | '/' ) expr # mulDiv
| expr op=( '+' | '-' ) expr # addSub
| '(' expr ')' # parens
| atom_r # atom
;
atom_r
: NUM
| NAME
;
NAME: [a-zA-Z_][a-zA-Z_0-9]*;
NUM : [0-9]+ ('.' [0-9]+)?;
WS : [ \t]+ -> skip;
NL : [\r\n]+ ;
Run with the -gui option to see the parse tree :
$ echo $CLASSPATH
.:/usr/local/lib/antlr-4.6-complete.jar
$ alias grun
alias grun='java org.antlr.v4.gui.TestRig'
$ grun Question question -gui data.txt
and this data.txt file :
variable
78
a + b * c
a * b + c
a = 8 + (6 * 9)
a ^ b
a ^ b ^ c
7 * 2 ^ 5
a = b = c = 88
.
Added
Using your original grammar and starting with the equal rule, I have the following error :
$ grun Q2 equal -tokens data.txt
[#0,0:7='variable',<NAME>,1:0]
[#1,9:10='78',<NUM>,2:0]
...
[#41,89:88='<EOF>',<EOF>,10:0]
line 2:0 no viable alternative at input 'variable78'
If I start with rule expr, there is no error :
$ grun Q2 expr -tokens data.txt
[#0,0:7='variable',<NAME>,1:0]
...
[#41,89:88='<EOF>',<EOF>,10:0]
$
Run grun with the -gui option and you'll see the difference :
running with expr, the input token variable is catched in NAME, rule expr is satisfied and terminates;
running with equal it's all in error. The parser tries the first alternative equal -> add -> mul -> exponent -> expr -> NAME => OK. It consumes the token variable and tries to do something with the next token 78. It rolls back in each rule, see if it can do something with the alt of rule, but each alt requires an operator. Thus it arrives in equal and starts again with the token variable, this time using the alt | NAME '='. NAME consumes the token, then the rule requires '=', but the input is 78 and does not satisfies it. As there is no other choice, it says there is no viable alternative.
$ grun Q2 equal -tokens data.txt
[#0,0:7='variable',<NAME>,1:0]
[#1,8:7='<EOF>',<EOF>,1:8]
line 1:8 no viable alternative at input 'variable'
If variable is the only token, same reasoning : first alternative equal -> add -> mul -> exponent -> expr -> NAME => OK, consumes variable, back to equal, tries the alt which requires '=', but the input is at EOF. That's why it says there is no viable alternative.
$ grun Q2 equal -tokens data.txt
[#0,0:1='78',<NUM>,1:0]
[#1,2:1='<EOF>',<EOF>,1:2]
If 78 is the only token, do the same reasoning : first alternative equal -> add -> mul -> exponent -> expr -> NUM => OK, consumes 78, back to equal. The alternative is not an option. Satisfied ? oops, what about EOF.
Now let's add a NUM alt to equal :
equal
: add # add1
| NAME '=' equal # assign
| NUM '=' equal # assignNum
;
$ grun Q2 equal -tokens data.txt
[#0,0:1='78',<NUM>,1:0]
[#1,2:1='<EOF>',<EOF>,1:2]
line 1:2 no viable alternative at input '78'
First alternative equal -> add -> mul -> exponent -> expr -> NUM => OK, consumes 78, back to equal. Now there is also an alt for NUM, starts again, this time using the alt | NUM '='. NUM consumes the token 78,
then the parser requires '=', but the input is at EOF, hence the message.
Now let's add a new rule with EOF and let's run the grammar from all :
all : equal EOF ;
$ grun Q2 all -tokens data.txt
[#0,0:1='78',<NUM>,1:0]
[#1,2:1='<EOF>',<EOF>,1:2]
$ grun Q2 all -tokens data.txt
[#0,0:7='variable',<NAME>,1:0]
[#1,8:7='<EOF>',<EOF>,1:8]
The input corresponds to the grammar, and there is no more message.
Although I can't answer your question about why the parser can't reach NAME in expr I'd like to point out that with Antlr4 you can use direct left recursion in your rule specification which makes your grammar more compact and omproves readability.
With that in mind your grammar could be rewritten as
math:
assignment
| expression
;
assignment:
ID '=' (assignment | expression)
;
expression:
expression '^' expression
| expression ('*' | '/') expression
| expression ('+' | '-') expression
| NAME
| NUM
;
That grammar hapily takes a NAME as part of an expression so I guess it would solve your problem.
If you're really interested in why it didn't work with your grammar then I'd first check if the lexer has matched the input into the expected tokens. Afterwards I would have a look at the parse tree to see what the parser is making of the given token sequence and then trying to do the parsing manually accoding to your grammar and during that you should be able to find the point at which the parser does something different from what you'd expect it to do.
I have a working calculator apart from one thing: unary operator '-'.
It has to be evaluated and dealt with in 2 difference cases:
When there is some expression further like so -(3+3)
When there isn't: -3
For case 1, I want to get a postfix output 3 3 + -
For case 2, I want to get just correct value of this token in this field, so for example in Z10 it's 10-3 = 7.
My current idea:
E: ...
| '-' NUM %prec NEGATIVE { $$ = correct(-yylval); appendNumber($$); }
| '-' E %prec NEGATIVE { $$ = correct(P-$2); strcat(rpn, "-"); }
| NUM { appendNumber(yylval); $$ = correct(yylval); }
Where NUM is a token, but obviously compiler says there is a confict reduce/reduce as E can also be a NUM in some cases, altough it works I want to get rid of the compilator warning.. and I ran out of ideas.
It has to be evaluated and dealt with in 2 difference cases:
No it doesn't. The cases are not distinct.
Both - E and - NUM are incorrect. The correct grammar would be something like:
primary
: NUM
| '-' primary
| '+' primary /* for completeness */
| '(' expression ')'
;
Normally, this should be implemented as two rules (pseudocode, I don't know bison syntax):
This is the likely rule for the 'terminal' element of an expression. Naturally, a parenthesized expression leads to a recursion to the top rule:
Element => Number
| '(' Expression ')'
The unary minus (and also the unary plus!) are just on one level up in the stack of productions (grammar rules):
Term => '-' Element
| '+' Element
| Element
Naturally, this can unbundle into all possible combinations such as '-' Number, '-' '(' Expression ')', likewise with '+' and without any unary operator at all.
Suppose we want addition / subtraction, and multiplication / division. Then the rest of the grammar would look like this:
Expression => Expression '+' MultiplicationExpr
| Expression '-' MultiplicationExpr
| MultiplicationExpr
MultiplicationExpr => MultiplicationExpr '*' Term
| MultiplicationExpr '/' Term
| Term
For the sake of completeness:
Terminals:
Number
Non-terminals:
Expression
Element
Term
MultiplicationExpr
Number, which is a terminal, shall match a regexp like this [0-9]+. In other words, it does not parse a minus sign — it's always a positive integer (or zero). Negative integers are calculated by matching a '-' Number sequence of tokens.
I have a little grammar containing a few commands which have to be used with Numbers and some of these commands return Numbers as well.
My grammar snippet looks like this:
Command:
name Numbers
| Numbers "test"
;
name:
"abs"
| "acos"
;
Numbers:
NUMBER
| numberReturn
;
numberReturn:
name Numbers
;
terminal NUMBER:
('0'..'9')+("."("0".."9")+)?
;
After having inserted the "Numbers 'test'" part in rule command the compiler complains about non-LL() decicions and tells me I have to work around these (left-factoring, syntactic predicates, backtracking) but my problem is that I have no idea what kind of input wouldn't be non-LL() in this case nor do I have an idea how to left-factor my grammar (I don't want toturn on backtracking).
EDIT:
A few examples of what this grammar should match:
abs 3;
acos abs 4; //interpreted as "acos (abs 4)"
acos 3 test; //(acos 3) test
Best regards
Raven
The grammar you are trying to achieve is left-recursive; that means the parser does not know how to tell between (acos 10) test and acos (10 test) (without the parentheses). However, you can give the parser some hints for it to know the correct order, such as parenthesized expressions.
This would be a valid Xtext grammar, with testparenthesized expressions:
grammar org.xtext.example.mydsl.MyDsl with org.eclipse.xtext.common.Terminals
generate myDsl "http://www.xtext.org/example/mydsl/MyDsl"
Model
: operations += UnaryOperation*
;
UnaryOperation returns Expression
: 'abs' exp = Primary
| 'acos' exp = Primary
| '(' exp = Primary 'test' ')'
;
Primary returns Expression
: NumberLiteral
| UnaryOperation
;
NumberLiteral
: value = INT
;
The parser will correctly recognize expressions such as:
(acos abs (20 test) test)
acos abs 20
acos 20
(20 test)
These articles may be helpful for you:
https://dslmeinte.wordpress.com/tag/unary-operator/
http://blog.efftinge.de/2010/08/parsing-expressions-with-xtext.html
I've tried several different parser generators (Bison, DParser, etc.) that claim to be able to generate GLR parsers i.e., ones that can handle ambiguous grammars. Here is a very simple ambiguous grammar of the type I'm talking about:
START: A | B;
A: C | D;
B: C | D;
C: T1 | T2;
D: T3 | T4;
T1: 't1';
T2: 't2';
T3: 't3';
T4: 't4';
I can generate the parsers just fine, but I get "unresolved ambiguity" errors or just outright crashes when I give the parser input that should be valid. There are no problems of any kind when I change the grammar to an unambiguous version.
What am I not understanding about GLR parsers? I thought the whole point was that in cases of ambiguity, ALL possible parses are tracked up until they merge or reach a dead end. All I need is a parser that can tell me whether there is ANY valid parse of the input.
Thanks for any help.
edit:
This is frustrating. Using %dprec and %merge I've been able to get Bison to handle ambiguous rules and terminals, but it still chokes on very simple but highly pathological pseudo-English grammars of the kind that I need to handle:
S: NP VP | NPA VP;
NPA: D N | NP PP;
NP: D N | NP PP | NPA;
VP: V NP | VP PP;
PP: P NP;
D: "the" | "a";
P: "in" | "with";
V: "saw";
N: "saw" | "girl" | "boy" | "park" | "telescope";
With input "a boy saw a girl", Bison is unable to parse and returns with code 1. Tom on the other hand, deals with this grammar and this input sentence flawlessly, and even naturally handles unknown terminals by just assigning them to all possible terminal types. But unlike Bison, Tom chokes on large grammars. (By "chokes" I mean fails in various different ways. If the failure modes would be helpful, I can report those.)
Anyone have any other ideas?
Unfortunately, bison really insists on producing a (single) parse, so you have to specify some way to merge ambiguous parses. If you don't, and there is more than one possible parse, bison's GLR parser will indeed complain that the parse is ambiguous.
If you don't really care which of the multiple parses is accepted, then it's not too difficult to bend bison to your will. The simplest way is to just assign a different %dprec to every possibly ambiguous production. Bison will then select whichever applicable production happens to have the best precedence.
You can even get bison to tell you about multiple parses with a simple %merge function; there is an example in the bison manual. (The documentation of this feature isn't great but it might be adequate to your needs. If not, feel free to ask a more specific question.)
I don't have much experience with DParser, but the manual indicates that its behaviour when faced with multiple possible parses is similar: the default is to complain, but you can provide a trivial merge function of your own: (The quote comes from Section 12, Ambiguities)
Ambiguities are resolved automatically based on priorities and associativities. In addition, when the other resolution techniques fail, user defined ambiguity resolution is possible. The default ambiguity handler produces a fatal error on an unresolved ambiguity. This behavior can be replaced with a user defined resolvers the signature of which is provided in dparse.h.
Here's an example bison GLR grammar for the second example. I left out the lexer, which is really not relevant (and slightly embarrassing because I was rushing).
%{
int yylex();
void yyerror(const char* msg);
%}
%error-verbose
%glr-parser
%token WORD_A "a"
%token WORD_BOY "boy"
%token WORD_GIRL "girl"
%token WORD_IN "in"
%token WORD_LIKED "liked"
%token WORD_PARK "park"
%token WORD_SAW "saw"
%token WORD_TELESCOPE "telescope"
%token WORD_THE "the"
%token WORD_WITH "with"
%%
S : NP VP {puts("S: NP VP");} %dprec 1
| NPA VP {puts("S: NPA VP");} %dprec 2
;
NPA: D N {puts("NPA: D N");} %dprec 3
| NP PP {puts("NPA: NP PP");} %dprec 4
;
NP : D N {puts("NP: D N");} %dprec 6
| NP PP {puts("NP: NP PP");} %dprec 7
| NPA {puts("NP: NPA");} %dprec 10
;
VP : V NP {puts("VP: V NP ");} %dprec 11
| VP PP {puts("VP: VP PP");} %dprec 12
;
PP : P NP {puts("PP: P NP");} %dprec 14
;
D : "the" {puts("D: the");} %dprec 15
| "a" {puts("D: a");} %dprec 16
;
P : "in" {puts("P: in");} %dprec 17
| "with" {puts("P: with");} %dprec 18
;
V : "liked" {puts("V: liked");} %dprec 19
| "saw" {puts("V: saw");} %dprec 20
;
N : "girl" {puts("N: girl");} %dprec 21
| "boy" {puts("N: boy");} %dprec 22
| "park" {puts("N: park");} %dprec 23
| "saw" {puts("N: saw");} %dprec 24
| "telescope"{puts("N: telescope");} %dprec 25
;
%%
int main(int argc, char** argv) {
printf("yyparse returned %d\n", yyparse());
return 0;
}
Compilation:
$ make ambig2
bison30 -v -d -o ambig2.c ambig2.y
ambig2.y: warning: 6 shift/reduce conflicts [-Wconflicts-sr]
ambig2.y: warning: 10 reduce/reduce conflicts [-Wconflicts-rr]
gcc-4.8 -ggdb -Wall -D_POSIX_C_SOURCE=200809L -std=c99 -c -o ambig2.o ambig2.c
gcc-4.8 ambig2.o -o ambig2
rm ambig2.o ambig2.c
Sample parses:
$ ./ambig2 <<<"a boy saw a girl"
D: a
N: boy
NPA: D N
V: saw
D: a
N: girl
NPA: D N
NP: NPA
VP: V NP
S: NPA VP
yyparse returned 0
$ ./ambig2 <<<"a saw saw the saw in a saw"
D: a
N: saw
NPA: D N
V: saw
D: the
N: saw
NPA: D N
NP: NPA
VP: V NP
P: in
D: a
N: saw
NPA: D N
NP: NPA
PP: P NP
VP: VP PP
S: NPA VP
yyparse returned 0
Your grammar doesn't cause GLR parsers to choke.
You need a GLR parsing engine that delivers what GLR parsers are supposed to deliver: parsing in the face of ambiguities, and handing you the result. Presumably you use additional context to resolve the ambiguities. (You can tangle context-checking
into the parsing process if you really insist on avoiding producing context-prevented ambiguities. If you do that, you get the kind of complications that the GCC guys
had when they tried to parse C and C++ with LALR).
Here's the output for OP's problem, given to our DMS Software Reengineering Toolkit's GLR parser generator. I had to define a lexer and a grammar compatible for DMS:
Lexer (defining individual tokens as words; a more scalable version
might have defined word class tokens such as D P V N):
%%
%%main
#skip "\s+"
#skip "[\u000d\u000a]+"
#token 'the' "the"
#token 'a' "a"
#token 'in' "in"
#token 'with' "with"
#token 'saw' "saw"
#token 'girl' "girl"
#token 'boy' "boy"
#token 'park' "park"
#token 'telescope' "telescope"
%%
Grammar (DMS doesn't bother with EBNF):
S = NP VP ;
S = NPA VP ;
NPA = D N ;
NPA = NP PP ;
NP = D N ;
NP = NP PP ;
NP = NPA ;
VP = V NP ;
VP = VP PP ;
PP = P NP ;
D = 'the' ;
D = 'a';
P = 'in' ;
P = 'with' ;
V = 'saw' ;
N = 'saw' ;
N = 'girl' ;
N = 'boy' ;
N = 'park' ;
N = 'telescope' ;
Sample file "aboysawagirl.txt"
a boy saw a girl\n
From start to finish, building lexer and parser (about 10 minutes of fumbling...)
Parsing the sample file and dumping the automatically built tree:
C:\DMS\Domains\simpenglish\Tools\Parser\Source>run ..\domainparser ++AST ..\..\Lexer\aboysawagirl.txt
simpenglish Domain Parser Version 2.5.15
Copyright (C) 1996-2013 Semantic Designs, Inc; All Rights Reserved; SD Confidential
Powered by DMS (R) Software Reengineering Toolkit
24 tree nodes in tree.
3 ambiguity nodes in tree.
(AMBIGUITY<S=11>#simpenglish=31##1f35140^0{2} Line 1 Column 1 File C:/DMS/Domains/simpenglish/Tools/Lexer/aboysawagirl.txt
(S#simpenglish=1##1f350e0^1#1f35140:1 Line 1 Column 1 File C:/DMS/Domains/simpenglish/Tools/Lexer/aboysawagirl.txt
(AMBIGUITY<NP=12>#simpenglish=31##1f34ba0^1#1f350e0:1{2} Line 1 Column 1 File C:/DMS/Domains/simpenglish/Tools/Lexer/aboysawagirl.txt
(NP#simpenglish=5##1f34b80^1#1f34ba0:1 Line 1 Column 1 File C:/DMS/Domains/simpenglish/Tools/Lexer/aboysawagirl.txt
|(D#simpenglish=12##1f34aa0^2#1f34b80:1#1f34b40:1 Line 1 Column 1 File C:/DMS/Domains/simpenglish/Tools/Lexer/aboysawagirl.txt
| ('a'#simpenglish=22##1f349c0^1#1f34aa0:1[Keyword:0] Line 1 Column 1 File C:/DMS/Domains/simpenglish/Tools/Lexer/aboysawagirl.txt)'a'
|)D#1f34aa0
|(N#simpenglish=18##1f34b20^2#1f34b80:2#1f34b40:2 Line 1 Column 3 File C:/DMS/Domains/simpenglish/Tools/Lexer/aboysawagirl.txt
| ('boy'#simpenglish=27##1f34a80^1#1f34b20:1[Keyword:0] Line 1 Column 3 File C:/DMS/Domains/simpenglish/Tools/Lexer/aboysawagirl.txt)'boy'
|)N#1f34b20
)NP#1f34b80
(NP#simpenglish=7##1f34c60^1#1f34ba0:2 Line 1 Column 1 File C:/DMS/Domains/simpenglish/Tools/Lexer/aboysawagirl.txt
|(NPA#simpenglish=3##1f34b40^2#1f35040:1#1f34c60:1 Line 1 Column 1 File C:/DMS/Domains/simpenglish/Tools/Lexer/aboysawagirl.txt
| (D#simpenglish=12##1f34aa0^2... [ALREADY PRINTED] ...)
| (N#simpenglish=18##1f34b20^2... [ALREADY PRINTED] ...)
|)NPA#1f34b40
)NP#1f34c60
)AMBIGUITY#1f34ba0
(VP#simpenglish=8##1f34fc0^1#1f350e0:2 Line 1 Column 7 File C:/DMS/Domains/simpenglish/Tools/Lexer/aboysawagirl.txt
(V#simpenglish=15##1f34d60^1#1f34fc0:1 Line 1 Column 7 File C:/DMS/Domains/simpenglish/Tools/Lexer/aboysawagirl.txt
|('saw'#simpenglish=25##1f34b00^2#1f34d60:1#1f34d40:1[Keyword:0] Line 1 Column 7 File C:/DMS/Domains/simpenglish/Tools/Lexer/aboysawagirl.txt)'saw'
)V#1f34d60
(AMBIGUITY<NP=12>#simpenglish=31##1f34f00^2#1f34f80:2#1f34fc0:2{2} Line 1 Column 11 File C:/DMS/Domains/simpenglish/Tools/Lexer/aboysawagirl.txt
|(NP#simpenglish=5##1f34e60^1#1f34f00:1 Line 1 Column 11 File C:/DMS/Domains/simpenglish/Tools/Lexer/aboysawagirl.txt
| (D#simpenglish=12##1f34da0^2#1f34e60:1#1f34de0:1 Line 1 Column 11 File C:/DMS/Domains/simpenglish/Tools/Lexer/aboysawagirl.txt
| ('a'#simpenglish=22##1f34ce0^1#1f34da0:1[Keyword:0] Line 1 Column 11 File C:/DMS/Domains/simpenglish/Tools/Lexer/aboysawagirl.txt)'a'
| )D#1f34da0
| (N#simpenglish=17##1f34dc0^2#1f34e60:2#1f34de0:2 Line 1 Column 13 File C:/DMS/Domains/simpenglish/Tools/Lexer/aboysawagirl.txt
| ('girl'#simpenglish=26##1f34d80^1#1f34dc0:1[Keyword:0] Line 1 Column 13 File C:/DMS/Domains/simpenglish/Tools/Lexer/aboysawagirl.txt)'girl'
| )N#1f34dc0
|)NP#1f34e60
|(NP#simpenglish=7##1f34f20^1#1f34f00:2 Line 1 Column 11 File C:/DMS/Domains/simpenglish/Tools/Lexer/aboysawagirl.txt
| (NPA#simpenglish=3##1f34de0^1#1f34f20:1 Line 1 Column 11 File C:/DMS/Domains/simpenglish/Tools/Lexer/aboysawagirl.txt
| (D#simpenglish=12##1f34da0^2... [ALREADY PRINTED] ...)
| (N#simpenglish=17##1f34dc0^2... [ALREADY PRINTED] ...)
| )NPA#1f34de0
|)NP#1f34f20
)AMBIGUITY#1f34f00
)VP#1f34fc0
)S#1f350e0
(S#simpenglish=2##1f35040^1#1f35140:2 Line 1 Column 1 File C:/DMS/Domains/simpenglish/Tools/Lexer/aboysawagirl.txt
(NPA#simpenglish=3##1f34b40^2... [ALREADY PRINTED] ...)
(VP#simpenglish=8##1f34f80^1#1f35040:2 Line 1 Column 7 File C:/DMS/Domains/simpenglish/Tools/Lexer/aboysawagirl.txt
(V#simpenglish=15##1f34d40^1#1f34f80:1 Line 1 Column 7 File C:/DMS/Domains/simpenglish/Tools/Lexer/aboysawagirl.txt
|('saw'#simpenglish=25##1f34b00^2... [ALREADY PRINTED] ...)
)V#1f34d40
(AMBIGUITY<NP=12>#simpenglish=31##1f34f00^2... [ALREADY PRINTED] ...)
)VP#1f34f80
)S#1f35040
)AMBIGUITY#1f35140
Your simple english grammar parser parses your example sentence in different ways.
This is a lot more spectacular with a full C++11 grammar.
Parsing binary sums / products are easy, but I'm having troubles defining a grammar that parses
a + b * c + d + e
as
sum(a, prod(b, c), d, e)
My initial (naive) attempt generated 61 shift / reduce conflicts.
I'm using java cup (but I suppose a solution for any other parser generator would be easily translated).
The following ANTLR grammar:
parse
: exp EOF
;
exp
: add_exp
;
add_exp
: mul_exp ('+' mul_exp)*
;
mul_exp
: atom ('*' atom)*
;
atom
: Number
| '(' exp ')'
;
Number
: 'a'..'z'
;
parses the input a + b * c + d + e as:
alt text http://img266.imageshack.us/img266/7099/17212574.png
As you can see, the mul_exp is the furthest away in the tree and (using an appropriate "walk" through your tree) will be evaluated first.
and the input a + b * (c + d) + e is parsed as:
alt text http://img688.imageshack.us/img688/2207/89332200.png
The images were generated with ANTLRWorks.
EDIT:
A tool like ANTLRWorks makes debugging a grammar a breeze! For example, if I click on the atom rule in the grammar above, the following is automatically generated and displayed at the bottom of the screen:
alt text http://img340.imageshack.us/img340/6793/53395907.png
Of course, that rule isn't complex at all, but when you do get to work with more complex rules, it's pretty darn easy to visualize them like that.
HTH.