I made a simple grammar for parsing regular expressions. Unfortunately, when I try to test my regex compiler on large expressions, I get a StackOverflowException. The problem is similar to this one, except that their solution no longer works in my scenario. Here is my grammar:
union: concat | concat '|' union ;
concat: kleene_closure concat | kleene_closure;
kleene_closure: atomic '*' | atomic ;
atomic : '(' union ')' | LETTER ;
Now the problem is that I have a really large file that looks like
something1 | something2 | something3 | .... | something1000
I use ANTLR's Visitor class to process the parse tree. I know I could probably optimize somewhat by using +/*, like this:
union: (concat '|')* concat ;
concat: kleene_closure+;
kleene_closure: atomic '*' | atomic ;
atomic : '(' union ')' | LETTER ;
However, it doesn't really solve the problem, due to the recursive nature of this grammar. For instance, it would still fail on the following sample, which clearly requires recursion:
(...(((something1) | something2) | something3) | .... ) | something1000
How can I avoid the StackOverflowException? How do other compilers, like for instance a C compiler, deal with really large texts that have thousands of lines of code?
If you're going to use a recursive descent parser, then you will inevitably run into an input which exceeds the call stack depth. This problem is ameliorated by languages like Java, which are capable of monitoring their own stack depth, so that there is a controllable result like a StackOverflowException. But it's still a real problem.
Parser generators like Yacc/Bison and Java Cup use a bottom-up LALR(1) algorithm which uses an explicit stack for temporary storage, rather than using the call stack for that purpose. That means that the parsers have to manage storage for the parser stack (or use a container ADT from the host language's standard library, if there is one), which is slightly more complex. But you don't have to deal with that complexity; it's built in to the parser generator.
There are several advantages of the explicit stack for the parser generator:
It's easier to control maximum stack size;
The maximum stack size is (usually) only limited by available memory;
It's probably more memory efficient because control flow information doesn't need to be kept in stack frames.
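To make that concrete, here is a minimal, hypothetical sketch in C of the pattern such generated parsers rely on: recursion replaced by a loop over an explicit, heap-allocated stack, so nesting depth is limited by available memory rather than by the thread's call stack. (A balanced-parenthesis check stands in for a real parser here; none of these names come from any particular generator.)

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    size_t cap = 16, top = 0;
    char *stack = malloc(cap);          /* explicit stack on the heap */
    int c, ok = 1;
    while (ok && (c = getchar()) != EOF && c != '\n') {
        if (c == '(') {
            if (top == cap)
                stack = realloc(stack, cap *= 2);   /* grow instead of overflowing */
            stack[top++] = '(';
        } else if (c == ')') {
            ok = (top > 0 && stack[--top] == '(');  /* pop and match */
        }
    }
    printf("%s\n", ok && top == 0 ? "balanced" : "unbalanced");
    free(stack);
    return 0;
}

A recursive-descent version of the same check would burn one call-stack frame per '(' and die at some fixed depth; this version keeps going until allocation fails.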
Still, it's not a panacea. A sufficiently complicated expression will exceed any fixed stack size, and that can lead to certain programs being unparseable. Furthermore, if you take advantage of the flexibility mentioned in the second point above ("only limited by available memory"), you may well find that your compiler is terminated unceremoniously by the OOM killer (or a segfault) rather than being able to respond to a more polite out-of-memory exception (depending on OS and configuration, of course).
As to:
How do other compilers, like for instance a C compiler, deal with really large texts that have thousands of lines of code?
Having thousands of lines of code is not a problem if you use a repetition operator in your grammar (or, in the case that you are using an LALR(1) parser, that your grammar is left-recursive). The problem arises, as you note in your question, when you have texts with thousands of nested blocks. And the answer is that many C compilers don't deal gracefully with such texts. Here's a simple experiment with gcc:
$ # A function which generates deeply-nested C programs
$ type deep
deep is a function
deep ()
{
    n=$1;
    printf "%s\n%s\n %s\n" '#include <stdio.h>' 'int main(void) {' 'int a0 = 0;';
    for ((i=0; i<n; ++i))
    do
        printf '%*s{ int a%d = a%d + 1;\n' $((i+1)) '' $((i+1)) $i;
    done;
    printf '%*sprintf("%%d\\n", a%d);\n' $n '' $n;
    for ((i=0; i<n; ++i))
    do
        printf "%s" '}';
    done;
    printf "%s\n" '}'
}
$ deep 3
#include <stdio.h>
int main(void) {
 int a0 = 0;
 { int a1 = a0 + 1;
  { int a2 = a1 + 1;
   { int a3 = a2 + 1;
   printf("%d\n", a3);
}}}}
$ # For small depths, GCC is OK with that.
$ deep 3 | gcc -x c - && ./a.out
3
$ # Let's go deeper:
$ deep 10 | gcc -x c - && ./a.out
10
$ deep 100 | gcc -x c - && ./a.out
100
$ deep 1000 | gcc -x c - && ./a.out
1000
$ deep 10000 | gcc -x c - && ./a.out
10000
$ # Ka-bang. (Took quite a long time, too.)
$ deep 100000 | gcc -x c - && ./a.out
gcc: internal compiler error: Segmentation fault (program cc1)
Please submit a full bug report,
with preprocessed source if appropriate.
See <file:///usr/share/doc/gcc-7/README.Bugs> for instructions.
Without the nested blocks, gcc is still slow but can handle the program:
$ type big
big is a function
big ()
{
    n=$1;
    printf "%s\n%s\n %s\n" '#include <stdio.h>' 'int main(void) {' 'int a0 = 0;';
    for ((i=0; i<n; ++i))
    do
        printf ' int a%d = a%d + 1;\n' $((i+1)) $i;
    done;
    printf ' printf("%%d\\n", a%d);\n' $n;
    printf "%s\n" '}'
}
$ big 3
#include <stdio.h>
int main(void) {
 int a0 = 0;
 int a1 = a0 + 1;
 int a2 = a1 + 1;
 int a3 = a2 + 1;
 printf("%d\n", a3);
}
$ big 3|gcc -x c - && ./a.out
3
$ big 10000|gcc -x c - && ./a.out
10000
$ big 100000|gcc -x c - && ./a.out
100000
You can define your grammar in ABNF syntax and give it to TGS* to parse it iteratively, without using the thread's dedicated stack for recursion. The parser generator generates parsers that run iteratively for all of their operations: lexing, parsing, tree construction, tree-to-string conversion, tree iteration and tree destruction.
At runtime, the parser can also give you the tree-building information with events only, and you can then build your tree however you want (or do any calculation without any tree). In this case, when you parse with events (a deterministic parser grammar without explicit tree building), if you have enough memory to hold the depth of the parsing rules, you can practically "stream" any input regardless of its size.
The deterministic grammar, in ABNF-like (RFC 5234) syntax, is this:
alternative = concatenation *('|' concatenation)
concatenation = 1* kleene-closure
kleene-closure = atomic 0*1 '*'
atomic = '(' alternative ')' / letter
letter = 'a'-'z' / 'A'-'Z'
This grammar, however, has one letter per item: for input like "ab" you will get two atomic nodes, one letter per node. If you want identifiers with more than one letter, then maybe this grammar will do:
alternative = concatenation *('|' *ws concatenation)
concatenation = element *(ws 0*1 element)
element = primary 0*1 '*'
primary = '(' *ws alternative ')' / identifier
identifier = 1*('a'-'z' / 'A'-'Z')
ws = %x20 / %x9 / %xA / %xD
You can read that as: an alternative is made of one or more concatenations separated by |. A concatenation is one or more elements separated by at least one whitespace character. An element, which may end in *, is either an alternative in parentheses or an identifier, which in turn is one or more letters. Whitespace is a space, tab, newline or carriage return. If you want more complex identifiers, you may use this:
identifier = (letter / '_') *(letter / '_' / digit)
letter = 'a'-'z' / 'A'-'Z'
digit = '0'-'9'
*I work on that project.
While writing parser code in Menhir, I keep coming across this design pattern that is becoming very frustrating. I'm trying to build a parser that accepts either "a*ba" or "bb". To do this, I'm using the following syntax (note that A* is the same as list(A)):
exp:
| A*; B; A; {1}
| B; B; {2}
However, this code fails to parse the string "ba". The menhir compiler also indicates that there are shift-reduce conflicts in the parser, specifically as follows:
** In state 0, looking ahead at B, shifting is permitted
** because of the following sub-derivation:
. B B
** In state 0, looking ahead at B, reducing production
** list(A) ->
** is permitted because of the following sub-derivation:
list(A) B A // lookahead token appears
So | B; B requires a shift, while | A*; B; A requires a reduce when the first token is B. I can resolve this ambiguity manually and get the expected behavior by changing the expression to read as follows (note that A+ is the same as nonempty_list(A)):
exp2:
| B; A; {1}
| A+; B; A; {1}
| B; B; {2}
In my mind, exp and exp2 read the same, but are clearly treated differently. Is there a way to write exp to do what I want without code duplication (which can cause other problems)? Is this a design pattern I should be avoiding entirely?
exp and exp2 parse the same language, but they're definitely not the same grammar. exp requires a two-symbol lookahead to correctly parse a sentence starting with B, for precisely the reason you've noted: the parser can't decide whether to insert an empty A* into the parse before it sees the symbol after the B, but it needs to do that insertion before it can handle the B. By contrast, exp2 does not need an empty production to create an empty list of As before B A, so no decision is necessary.
You don't need a list to produce this conflict. Replacing A* with A? would produce exactly the same conflict.
You've already found the usual solution to this shift-reduce conflict for LALR(1) parser generators: a little bit of redundancy. As you note, though, that solution is not ideal.
Another common solution (but maybe not for menhir) involves using a right-recursive definition of a terminated list:
prefix:
| B;
| A; prefix;
exp:
| prefix; A; { 1 }
| B; B; { 2 }
As far as I know, menhir's standard library doesn't include a terminated-list macro, but it would be easy enough to write. It might look something like this:
%public terminated_list(X, Y):
  | y = Y;
      { ( [], y ) }
  | x = X; xsy = terminated_list(X, Y);
      { ( x :: (fst xsy), (snd xsy) ) }
There's probably a more idiomatic way to write that; I don't pretend to be an OCaml coder.
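Plugging the macro into the original example might look like this (an untested sketch; the pair that terminated_list returns is simply discarded here, since the actions only return 1 or 2):

exp:
  | terminated_list(A, B); A; {1}
  | B; B; {2}

Because terminated_list is right-recursive, the parser just shifts As until it reaches the terminating B, so the empty-list reduction that caused the conflict never has to happen before the B is seen.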
I'm working on a GLR-parser in GNU bison and I have the following problem:
the language I'm trying to parse allows boolean expressions including relations (<, >, <=, ...) and boolean composition (and, or, not). Now the problem is that the language also allows multiple arithmetic expressions on the right side of a relation... and they are composed using the same AND token that is used for boolean composition! This is very dumb language design, but I can't change it.
So you can have a > b and c, which is supposed to be equivalent to (a > b) and (a > c), and you can also have a > b and c > d, which is supposed to be equivalent to (a > b) and (c > d).
The S/R conflict this causes is already obvious in this example: after reading a > b with lookahead and you could either reduce the a > b to a boolean expression and wait for another boolean expression or you could shift the and and wait for another arithmetic expression.
My grammar currently looks like this:
booleanexpression
    : relation
    | booleanexpression TOK_AND booleanexpression
    ...
    ;
relation
    : arithmeticexpression TOK_GT maxtree
    ...
    ;
maxtree
    : arithmeticexpression
    | maxtree TOK_AND maxtree
    ...
    ;
The language is clearly not LR(k) for any k, since the S/R conflict can't be resolved using any constant k-lookahead, because the arithmeticexpression in between can have arbitrarily many tokens. Because of that, I turned GLR-parsing on.
But when I try to parse a > b and c with this, I can see in my debug outputs, that the parser behaves like this:
it reads the a and at lookahead > it reduces the a to an arithmeticexpression
it reads the b and at lookahead and it reduces the b to an arithmeticexpression and then already to a maxtree
it reduces the a > b to a relation
it reads the c and reduces it to an arithmeticexpression
then nothing happens! The and c tokens are apparently discarded - the debug outputs don't show any action for them. Not even an error message. The corresponding if-statement doesn't exist in my AST (I still get an AST because I have error recovery).
I would think that, after reading the b, there should be 2 stacks. But then the b shouldn't be reduced. Or at least it should give me some error message ("language is ambiguous" would be okay and I have seen that message before - I don't see why it wouldn't apply here). Can anyone make sense of this?
From looking at the grammar for a while, you can tell that the main question here is whether after the next arithmeticexpression there comes
another relation token (then you should reduce)
another boolean composition (then you should shift)
a token outside of the boolean/arithmetic -expression syntax (like THEN) which would terminate the expression and you should also shift
Can you think of a different grammar that captures the situation in a better / more deterministic way? How would you approach the problem? I'm currently thinking about making the grammar more right-to-left, like
booleanexpression : relation AND booleanexpression
maxtree : arithmeticexpression AND maxtree
etc.
I think that would make bison prefer shifting and only reduce on the right first. Maybe by using different non-terminals it would allow a quasi-"lookahead" behind the arithmeticexpression...
Side note: GnuCOBOL handles this problem by just collecting all the tokens, pushing them on an intermediate stack and manually building the expression from there. That discourages me, but I cling to the hope that they did it this way because bison didn't support GLR-parsing when they started...
EDIT:
a small reproducible example
%{
#include <stdio.h>
int yylex ();
void yyerror(const char* msg);
%}
%glr-parser
%left '&'
%left '>'
%%
input: %empty | input bool '\n' {printf("\n");};
arith : 'a' | 'b' | 'c';
maxtree : arith { printf("[maxtree : arith] "); }
        | maxtree '&' maxtree { printf("[maxtree : maxtree & maxtree] "); } ;
rel : arith '>' maxtree { printf("[rel : arith > maxtree] "); } ;
bool : rel { printf("[bool : rel] "); }
     | bool '&' bool { printf("[bool : bool & bool] "); } ;
%%
void yyerror(const char* msg) { printf("%s\n", msg); }
int yylex () {
    int c;
    while ((c = getchar ()) == ' ' || c == '\t');
    return c == EOF ? 0 : c;
}
int main (int argc, char** argv) {
    return yyparse();
}
This one, strangely, does print the error message "syntax error" on input a>b&c.
Being able to simplify grammars by using precedence declarations is really handy (sometimes) [Note 1] but it doesn't play well with using GLR parsers because it can lead to early rejection of an unambiguous parse.
The idea behind precedence declarations is that they resolve ambiguities (or, more accurately, shift/reduce conflicts) using a simple one-token lookahead and a configured precedence between the possible reduction and the possible shift. If a grammar has no shift/reduce conflict, the precedence declarations won't be used, but if they are used they will be used to suppress either the shift or the reduce, depending on the (static) precedence relationship.
A Bison-generated GLR parser does not actually resolve ambiguity, but it allows possibly incorrect parses to continue to be developed until the ambiguity is resolved by the grammar. Unlike the use of precedence, this is a delayed resolution; a bit slower but a lot more powerful. (GLR parsers can produce a "parse forest" containing all possible parses, but Bison doesn't implement this feature, since it expects to be parsing programming languages, which, unlike human languages, cannot be ambiguous.)
In your language, it is impossible to resolve the non-determinism of the shift/reduce conflict statically, as you note yourself in the question. Your grammar is simply not LR(1), much less operator precedence, and GLR parsing is therefore a practical solution. But you have to allow GLR to do its work. Prematurely eliminating one of the plausible parses with a precedence comparison will prevent the GLR algorithm from considering it later. This will be particularly serious if you manage to eliminate the only parse which could have been correct.
In your grammar, it is impossible to define a precedence relationship between the rel productions and the & symbol, because no precedence relationship exists. In some sentences, the rel reduction needs to win; in other sentences, the shift should win. Since the grammar is not ambiguous, GLR will eventually figure out which is which, as long as both the shift and the reduce are allowed to proceed.
In your full language, both boolean and arithmetic expressions have something akin to operator precedence, but only within their respective domains. An operator precedence parser (and, equivalently, yacc/bison's precedence declarations) works by erasing the difference between different non-terminals; it cannot handle a grammar like yours in which some operator has different precedences in different domains (or between different domains).
Fortunately, this particular use of precedence declarations is only a shortcut; it does not give any additional power to the grammar and can easily and mechanically be implemented by creating new non-terminals, one for each precedence level. The alternative grammar will not be ambiguous. The classic example, which you can find in pretty well any textbook or tutorial which includes parsing arithmetic expressions, is the expr/term/factor grammar. Here I've also provided the precedence grammar for comparison:
%left '+' '-'
%left '*' '/'
%%                          %%
expr  : term
      | expr '+' term       expr: expr '+' expr
      | expr '-' term           | expr '-' expr
term  : factor
      | term '*' factor         | expr '*' expr
      | term '/' factor         | expr '/' expr
factor: ID                      | ID
      | '(' expr ')'            | '(' expr ')'
In your minimal example, there are already enough non-terminals that no new ones need to be invented, so I've just rewritten it according to the above model.
I've left the actions as I wrote them, in case the style is useful to you. Note that this style leaks memory like a sieve, but that's ok for quick tests:
%code top {
#define _GNU_SOURCE 1
}
%{
#include <ctype.h>
#include <stdio.h>
#include <string.h>
int yylex(void);
void yyerror(const char* msg);
%}
%define api.value.type { char* }
%glr-parser
%token ID
%%
input : %empty
      | input bool '\n'     { puts($2); }
arith : ID
maxtree : arith
        | maxtree '&' arith { asprintf(&$$, "[maxtree& %s %s]", $1, $3); }
rel   : arith '>' maxtree   { asprintf(&$$, "[COMP %s %s]", $1, $3); }
bool  : rel
      | bool '&' rel        { asprintf(&$$, "[AND %s %s]", $1, $3); }
%%
void yyerror(const char* msg) { printf("%s\n", msg); }
int yylex(void) {
    int c;
    while ((c = getchar ()) == ' ' || c == '\t');
    if (isalpha(c)) {
        *(yylval = strdup(" ")) = c;
        return ID;
    }
    else return c == EOF ? 0 : c;
}
int main (int argc, char** argv) {
#if YYDEBUG
    if (argc > 1 && strncmp(argv[1], "-d", 2) == 0) yydebug = 1;
#endif
    return yyparse();
}
Here's a sample run. Note the warning from bison about a shift/reduce conflict. If there had been no such warning, the GLR parser would probably be unnecessary, since a grammar without conflicts is deterministic. (On the other hand, since bison's GLR implementation optimises for determinism, there is not too much cost for using a GLR parser on a deterministic language.)
$ bison -t -o glr_prec.c glr_prec.y
glr_prec.y: warning: 1 shift/reduce conflict [-Wconflicts-sr]
$ gcc -Wall -o glr_prec glr_prec.c
$ ./glr_prec
a>b
[COMP a b]
a>b & c
[COMP a [maxtree& b c]]
a>b & c>d
[AND [COMP a b] [COMP c d]]
a>b & c & c>d
[AND [COMP a [maxtree& b c]] [COMP c d]]
a>b & c>d & e
[AND [COMP a b] [COMP c [maxtree& d e]]]
$
Notes
1. Although precedence declarations are handy when you understand what's actually going on, there is a huge tendency for people to just cargo-cult them from some other grammar they found on the internet, and not infrequently a grammar which was also cargo-culted from somewhere else. When the precedence declarations don't work as expected, the next step is to randomly modify them in the hopes of finding a configuration which works. Sometimes that succeeds, often leaving behind unnecessary detritus which will go on to be cargo-culted again.
So, although there are circumstances in which precedence declarations really simplify grammars and the unambiguous implementation would be quite a lot more complicated (such as dangling-else resolution in languages which have many different compound statement types), I've still found myself recommending against their use.
In a recent answer to a different question, I wrote what I hope is a good explanation of the precedence algorithm (and if it isn't, please let me know how it falls short).
Welcome to the wonderful world of COBOL. I could be wrong, but you may have a few additional problems here. An expression such as A > B AND C in COBOL is ambiguous until you know how C was declared. Consider the following program:
IDENTIFICATION DIVISION.
PROGRAM-ID. EXAMPLE.
DATA DIVISION.
WORKING-STORAGE SECTION.
01 A PIC 9 VALUE 2.
01 B PIC 9 VALUE 1.
01 W PIC 9 VALUE 3.
   88 C VALUE 3.
PROCEDURE DIVISION.
    IF A > B AND C
       DISPLAY 'A > B AND 88 LEVEL C is TRUE because W = ' W
    ELSE
       DISPLAY 'A not > B or 88 LEVEL C is not TRUE'
    END-IF
    DISPLAY 'A: ' A ' B: ' B ' W:' W
    GOBACK
    .
Output from this program is:
A > B AND 88 LEVEL C is TRUE because W = 3
A: 2 B: 1 W: 3
In essence, the expression A > B AND C is equivalent to A > B AND W = 3. Had C been defined in a manner similar to A and B, the semantics would have been A > B AND A > C, which in this case is FALSE.
The code mentioned above works well, but I had never gotten it to work in my real project, even though I couldn't see a difference between my real project and this code.
This drove me crazy, but I just found another problem in my code, which prevented this method from working:
I had an (admittedly cargo-culted) %skeleton "lalr1.cc" in my prologue, which disabled the GLR parsing again!
I needed to replace this with
%skeleton "glr.cc"
I would like to apply De Morgan's theorem to an input using yacc and lex.
The input could be any expression such as a+b, !(A+B), etc.:
The expression a+b should result in !a∙!b
The expression !(a+b) should result in a+b
I think the lex part is done but I'm having difficulty with the yacc grammar needed to apply the laws to an expression.
What I'm trying to implement is the following algorithm. Consider the following equation as input: Y = A+B
After applying De Morgan's law it becomes: !Y = !(A+B)
Finally, expanding the parentheses should result in !Y = !A∙!B
Here is my lex code:
%{
#include <stdio.h>
#include "y.tab.h"
extern int yylval;
int yywrap (void);
%}
%%
[a-zA-Z]+ {yylval = *yytext; return ALPHABET;}
"&&" return AND;
"||" return OR;
"=" return ('=');
[\t] ;
\n return 0;
. return yytext[0];
"0exit" return 0;
%%
int yywrap (void)
{
return 1;
}
Here is my yacc code:
%{
#include <stdio.h>
#include <stdlib.h>
int yylex (void);
void yyerror (char *);
extern FILE* yyin;
%}
%token ALPHABET
%left '+''*'
%right '=' '!' NOT
%left AND OR
%start check
%%
check : expr {printf("%s\n",$$);}
      ;
expr : plus
     | plus '+' plus {$$ = $1 + $3;}
     ;
plus : times
     | times '*' times {$$ = $1 * $3;}
     ;
times : and_op
      | and_op AND and_op {$$ = $1 && $3;}
      ;
and_op : or_op
       | or_op OR or_op {$$ = $1 || $3;}
       ;
or_op : not_op
       | '!' not_op {$$ = !$2;}
       ;
not_op : paren
       | '(' paren ')' {$$ = $2;}
       ;
paren :
       | ALPHABET {$$ = $1;}
       ;
/*
E: E '+' E {$$ = $1 + $3;}
|E '*' E {$$ = $1 * $3;}
|E '=' E {$$ = $1 = $3;}
|E AND E {$$ = ($1 && $3);}
|E OR E {$$ = ($1 || $3);}
|'(' E ')' {$$ = $2;}
|'!' E %prec NOT {$$ = !$2;}
|ALPHABET {$$ = $1;}
;*/
%%
int main()
{
    char filename[30];
    char * line = NULL;
    size_t len = 0;
    printf("\nEnter filename\n");
    scanf("%s",filename);
    FILE *fp = fopen(filename, "r");
    if(fp == NULL)
    {
        fprintf(stderr,"Can't read file %s\n",filename);
        exit(EXIT_FAILURE);
    }
    yyin = fp;
    // while (getline(&line, &len, fp) != -1)
    // {
    //     printf("%s",line);
    // }
    // printf("Enter the expression:\n");
    do
    {
        yyparse();
    } while(!feof(yyin));
    return 0;
}
You are trying to build a computer algebra system.
Your task is conceptually simple:
1. Define a lexer for the atoms of your "boolean" expressions
2. Define a parser for propositional logic in terms of the lexemes
3. Build a tree that stores the expressions
4. Define procedures that implement logical equivalences (De Morgan's theorem is one), that find a place in the tree where a rule can be applied by matching tree structure, and then modify the tree accordingly
5. Run those procedures to achieve the logic rewrites you want
6. Prettyprint the final AST as the answer
But conceptually simple doesn't necessarily mean easy to do and get it all right.
(f)lex and yacc are designed to help you do steps 1-3 in a relatively straightforward way; their documentation contains a pretty good guide.
They won't help with steps 4-6 at all, and this is where the real work happens. (Your grammar looks like a pretty good start for this part).
(You can do 1-3 without flex and yacc by building a recursive descent parser that also happens to build the AST as it goes).
Step 4 can be messy, because you have to decide what logical theorems you wish to use, and then write a procedure for each one to do the tree matching and tree smashing needed to achieve the desired result. You can do it; it's just procedural code that walks up and down the tree comparing node types and relations to children for a match, and then delinks nodes, deletes nodes, creates nodes, and relinks them to effect the tree modification. This is just a bunch of code.
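To make step 4 concrete, here is a minimal sketch in C of exactly one such rewrite, De Morgan over or, applied everywhere in a tree. The node layout and names are hypothetical, invented for this sketch rather than taken from the question's code:

#include <stdlib.h>

typedef enum { VAR, NOT, AND, OR } Kind;
typedef struct Node {
    Kind kind;
    char name;          /* used when kind == VAR */
    struct Node *l, *r; /* children; NOT uses l only */
} Node;

static Node *mk(Kind k, Node *l, Node *r) {
    Node *n = malloc(sizeof *n);
    n->kind = k; n->name = 0; n->l = l; n->r = r;
    return n;
}

/* Wherever NOT sits directly over OR, delink both nodes and relink the
   rewritten children under an AND with two fresh NOTs: !(a+b) => !a * !b */
static Node *demorgan_over_or(Node *n) {
    if (n == NULL) return NULL;
    if (n->kind == NOT && n->l->kind == OR) {
        Node *o = n->l;
        Node *out = mk(AND, mk(NOT, demorgan_over_or(o->l), NULL),
                            mk(NOT, demorgan_over_or(o->r), NULL));
        free(o);
        free(n);
        return out;
    }
    n->l = demorgan_over_or(n->l);
    n->r = demorgan_over_or(n->r);
    return n;
}

int main(void) {
    /* !(a + b)  ==>  !a * !b */
    Node *a = mk(VAR, NULL, NULL); a->name = 'a';
    Node *b = mk(VAR, NULL, NULL); b->name = 'b';
    Node *e = mk(NOT, mk(OR, a, b), NULL);
    e = demorgan_over_or(e);   /* now AND(NOT(a), NOT(b)) */
    return e->kind == AND ? 0 : 1;
}

Even for this one fixed rule, most of the code is delinking, relinking and freeing; every additional rule needs its own version of that plumbing, which is the pain point described below.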
A subtlety of algebraic rewrites is now going to bite you: (boolean) algebra has associative and commutative operators. What this means is that some algebra rules will apply to parts of the tree that are arbitrarily far apart. Consider this rule:
a*(b + !a) => a*(b)
What happens when the actual term being parsed looks like:
q*(a + b + c + ... !q ... + z)
"Simple" procedural code to look at the tree now has to walk arbitrarily far down on of the subtrees to find where the rule can apply. Suddenly coding the matching logic isn't so easy, nor is the tree-smash to implement the effect.
If we ignore associative and commutative issues, for complex matches and modifications the code might be a bit clumsy to write and hard to read; after you've done it once, this will be obvious. If you only want to do De Morgan-over-or, you can do it relatively easily by just coding it. If you want to implement lots of boolean algebra rules for simplification, this will start to be painful. What you'd ideally like to do is express the logic rules in the same notation as your boolean logic so they are easy to write, but now you need something that can read and interpret the logic rules. That is a complex piece of code, but if done right, you can code the logic rules something like the following:
rule deMorgan_for_or(t1:boolexp, t2:boolexp):boolexp->boolexp
" ! (\t1 + \t2) " -> " !\t1 * !\t2 ";
A related problem (step 5) is: where do you want to apply the logic rules? Just because you can apply De Morgan's law in 15 places in a very big logic term doesn't mean you necessarily want to do that. So somewhere you need a control mechanism that decides which of your many rules should apply, and where they should apply. This gets you into metaprogramming, a whole new topic.
If your rules are "monotonic", that is, they in effect can only be applied once, you can simply run them all everywhere and get a terminating computation, if that monotonic answer is the one you want. If you have rules that are inverses (e.g., !(x+y) => !x * !y, and !a * !b => !(a+b)), then your rules may run forever repeatedly doing and undoing a rewrite. So you have to be careful to ensure you get termination.
Finally, when you have the modified tree, you'll need to print it back out in readable form (Step 6). See my SO answer on how to build a prettyprinter.
Doing all of this for one or two rules by yourself is a great learning exercise.
Doing it with the idea of producing a useful tool is a whole different animal. There what you want is a set of infrastructure that makes this easy to express: a program transformation system. You can see a complete example of what this looks like for a system doing arithmetic rather than boolean computations using surface-syntax rewrite rules, including the handling of the associative and commutative rewrite issues. In another example, you can see what it looks like for boolean logic (see simplify_boolean near the end of the page), which shows real examples of rules like the one I wrote above.
I have been trying to create a parser for expressions with variables and simplify them to quadratic-expression form.
This is my parser grammar:
Exercise : Expr '=' Expr
Expr : Term [+-] Expr | Term
Term : Factor [*/] Term | Factor
Factor: Atom [^] Factor | Atom
Atom: Number | Identifier | [- sqrt] Atom | '(' Expr ')'
For parsing I'm using recursive descent parser. Let's say I'd like to parse this:
" 2 - 1 + 1 = 0"
and the result is 0, because the parser creates the wrong tree:
    -
   / \
  2   +
     / \
    1   1
How can I make this grammar left-associative? I'm a newbie at this; can you please point me to sources where I can find more information? Can I achieve this with a recursive descent parser?
Take a look at Parsing Expressions by Recursive Descent by Theodore Norvell
There he gives three ways to solve the problem.
1. The shunting yard algorithm.
2. Classic solution by factoring the grammar.
3. Precedence climbing
Your problem stems from the fact that your grammar needs several changes, one example being
Expr: Term { [+-] Term }
notice the addition of the { } around [+-] Term, indicating that they should be parsed together and that there can be 0 or more of them.
Also by default you are building the tree as
    -
   / \
  2   +
     / \
    1   1
i.e. -(2,+(1,1))
when you should be building the tree for left associative operators of the same precedence as
      +
     / \
    -   1
   / \
  2   1
i.e. +(-(2,1),1)
Since there are three ways to do this in the paper, I won't expand on them here. Also, since you mention that you are new to this, you should get a good compiler book to understand the details you will encounter reading the paper. Most of these methods are implemented in common programming languages and available for free on the internet, but be aware that many people do what you do and post wrong results.
The best way to check if you have it right is with a test like this, using a sequence of multiple subtraction operations:
7-3-2 = 2
if you get
7-3-2 = 6 or something else
then it is wrong.
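Here is a minimal sketch in C of the second method (factoring the grammar), with a hypothetical single-digit tokenizer invented for the sketch. The loop wraps each new operator node around the tree built so far, which is exactly what makes the result left-associative:

#include <stdio.h>
#include <stdlib.h>

typedef struct Node { char op; struct Node *l, *r; int val; } Node; /* op == 0 means leaf */

static const char *p;                   /* input cursor */

static Node *term(void) {               /* Term: a single digit in this sketch */
    while (*p == ' ') p++;
    Node *n = calloc(1, sizeof *n);
    n->val = *p++ - '0';
    return n;
}

/* Expr -> Term { [+-] Term }: each new operator node takes the tree
   built so far as its LEFT child. */
static Node *expr(void) {
    Node *left = term();
    for (;;) {
        while (*p == ' ') p++;
        if (*p != '+' && *p != '-') return left;
        Node *n = calloc(1, sizeof *n);
        n->op = *p++;
        n->l = left;
        n->r = term();
        left = n;
    }
}

static void show(const Node *n) {       /* prints the +(-(2,1),1) notation used above */
    if (n->op == 0) { printf("%d", n->val); return; }
    printf("%c(", n->op); show(n->l); printf(","); show(n->r); printf(")");
}

int main(void) {
    p = "2 - 1 + 1";
    show(expr());                       /* prints +(-(2,1),1) */
    printf("\n");
    return 0;
}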
So I have been reading a bit on lexers, parsers, interpreters and even compiling.
For a language I'm trying to implement, I settled on a recursive descent parser. Since the original grammar of the language had left recursion, I had to rewrite it slightly.
Here's a simplified version of the grammar I had (note that it's not in any standard grammar format, but somewhat pseudo, I guess; it's how I found it in the documentation):
expr:
-----
expr + expr
expr - expr
expr * expr
expr / expr
( expr )
integer
identifier
To get rid of the left-recursion, I turned it into this (note the addition of the NOT operator):
expr:
-----
expr_term {+ expr}
expr_term {- expr}
expr_term {* expr}
expr_term {/ expr}
expr_term:
----------
! expr_term
( expr )
integer
identifier
And then go through my tokens using the following sub-routines (simplified pseudo-code-ish):
public string Expression()
{
string term = ExpressionTerm();
if (term != null)
{
while (PeekToken() == OperatorToken)
{
term += ReadToken() + Expression();
}
}
return term;
}
public string ExpressionTerm()
{
//PeekToken and ReadToken accordingly, otherwise return null
}
This works! The result after calling Expression is always equal to the input it was given.
This makes me wonder: if I were to create AST nodes rather than a string in these subroutines, and evaluate the AST using an infix evaluator (which also keeps in mind associativity and precedence of operators, etcetera), wouldn't I get the same result?
And if so, then why are there so many topics covering "fixing left recursion, keeping in mind associativity and what not" when it's actually "dead simple" to solve, or even a non-problem, as it seems? Or is it really the structure of the resulting AST that people are concerned about (rather than what it evaluates to)? Could anyone shed some light? I might be getting it all wrong as well, haha!
The shape of the AST is important, since a+(b*3) is not usually the same as (a+b)*3 and one might reasonably expect the parser to indicate which of those a+b*3 means.
Normally, the AST will actually delete parentheses. (A parse tree wouldn't, but an AST is expected to abstract away syntactic noise.) So the AST for a+(b*3) should look something like:
      Sum
       |
   +---+---+
   |       |
  Var     Prod
   |       |
   a   +---+---+
       |       |
      Var    Const
       |       |
       b       3
If your language obeys the usual mathematical notation conventions, so will the AST for a+b*3.
An "infix evaluator" -- or what I imagine you're referring to -- is just another parser. So, yes, if you are happy to parse later, you don't have to parse now.
By the way, showing that you can put tokens back together in the order that you read them doesn't actually demonstrate much about the parser functioning. You could do that much more simply by just echoing the tokenizer's output.
The standard and easiest way to deal with expressions, mathematical or other, is with a rule hierarchy that reflects the intended associations and operator precedence:
expre = sum
sum = addend '+' sum | addend
addend = term '*' addend | term
term = '(' expre ')' | '-' integer | '+' integer | integer
Such grammars let the parse or abstract trees be evaluated directly. You can expand the rule hierarchy to include power and bitwise operators, or make it part of the hierarchy for logical expressions with and/or and comparisons.
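As a minimal sketch of why such a hierarchy evaluates directly, here is a C version in which each rule becomes one function. The cursor-based tokenizer is invented for the sketch, and the right-recursive sum and addend rules are written as loops, which computes the same values here since + and * are associative:

#include <ctype.h>
#include <stdio.h>

static const char *p;       /* cursor into the input string */

static long expre(void);    /* forward declaration: term needs it for '(' expre ')' */

static long term(void) {    /* term = '(' expre ')' | '-' integer | '+' integer | integer */
    if (*p == '(') { p++; long v = expre(); p++; /* skip ')' */ return v; }
    int sign = 1;
    if (*p == '-') { sign = -1; p++; } else if (*p == '+') p++;
    long v = 0;
    while (isdigit((unsigned char)*p)) v = v * 10 + (*p++ - '0');
    return sign * v;
}

static long addend(void) {  /* addend = term '*' addend | term */
    long v = term();
    while (*p == '*') { p++; v *= term(); }
    return v;
}

static long expre(void) {   /* expre = sum; sum = addend '+' sum | addend */
    long v = addend();
    while (*p == '+') { p++; v += addend(); }
    return v;
}

int main(void) {
    p = "2*(3+4)+-5";
    printf("%ld\n", expre());   /* prints 9 */
    return 0;
}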