I am trying to write a simple parser using Lex and Yacc, and I have not used either of them before. After finishing the lex and yacc files, I get an error when I compile them. I think the error is related to string header files that are not included properly, but I couldn't figure it out by myself.
The Lex file named "tokens.l":
%{
#include "parser.hpp"
%}
MODEL "model"
PORT "input"|"output"|"intern"
GATE "xor"|"and"|"or"|"buf"|"cmos1"|"dff"|"dlat"|"inv"|"mux"|"nand"|"nor"|"tie0"|"tie1"|"tiex"|"tiez"|"tsh"|"tsl"|"tsli"|"xnor"
INSTNAME [A-Z0-9]+
PRIMITIVE "primitive"
LEFT "("
RIGHT ")"
COMMA ","
SEMICOLON ";"
EQUAL "="
BLANK [ \t\n]+
%%
{MODEL} {return MODEL;}
{PORT} { if (yytext == "input")
return INPUT;
else if (yytext == "output")
return OUTPUT;
else
return INTERN;
}
_{GATE} {return GATE;}
{INSTNAME} {return INSTNAME;}
{PRIMITIVE} {return PRIMITIVE;}
{LEFT} {return LEFT;}
{RIGHT} {return RIGHT;}
{COMMA} {return COMMA;}
{SEMICOLON} {return SEMICOLON;}
{EQUAL} {return EQUAL;}
{BLANK} {;}
"\0" {return END;}
%%
The yacc file named "parser.y":
%{
#include <iostream>
#include <string>
#include <cstdio>
extern FILE *fp;
%}
%union{
std::string* str;
}
%token <str> MODEL
%token <str> INPUT
%token <str> OUTPUT
%token <str> INTERN
%token <str> GATE
%token <str> INSTNAME
%token PRIMITIVE
%token LEFT
%token RIGHT
%token COMMA
%token SEMICOLON
%token EQUAL
%token END
%type <str> vfile modules module params param interngates interngate primitives
%%
vfile : modules END {
std::ofstream fp;
fp.open("output.v");
fp<<$1;
fp.close();
$$ = new std::string("success");
std::cout<<$$;
}
modules : modules module {$$=$1+$2;}
| module {$$=$1;}
module :MODEL INSTNAME LEFT params RIGHT LEFT interngates RIGHT
{$$ = "module "+$2+" ("+$4+");\n"+$7+"endmodule\n";}
interngates :interngates interngate {$$=$1+$2+"\n";}
|interngate {$$=$1+"\n";}
interngate :INPUT LEFT params RIGHT primitives {$$=$1+$3+"\n"+$5;}
| OUTPUT LEFT params RIGHT primitives { $$=$1+$3+"\n"+$5;}
| INTERN LEFT params RIGHT primitives {$$="wire"+$3+"\n"+$5;}
primitives :LEFT RIGHT {$$="";}
|LEFT PRIMITIVE EQUAL GATE INSTNAME params SEMICOLON RIGHT
{$$=$4+" "+$5+" ("+$6+");\n";}
params :params COMMA param {$$=$1+","+$3;}
| param {$$=$1;}
param :INSTNAME {$$=$1;}
%%
To compile the file, I use the command below:
bison -d -o parser.cpp parser.y
lex -o tokens.cpp tokens.l
g++ -o myparser tokens.cpp parser.cpp -lfl
Can anybody give me a clue? Thanks a lot!
Updated: Error report on osx.
http://www.edaplayground.com/x/3HL
You can't use automatic storage for a C++ std::string (or any other string class with a non-trivial constructor) in %union. You'll need to use dynamic (heap) allocation.
Instead of
%union {
string str;
}
Try:
%union {
std::string *str;
}
You will need to change every use of yylval->str, $$, $1, etc. whose %type is <str> so that it uses a dynamically allocated string.
So instead of
$$ = "success";
You have to do:
$$ = new std::string("success");
It is customary to use pointers in a yacc/bison parser's YYSTYPE %union anyway, to avoid a huge amount of copying on the stack. Keep in mind that your productions should free the strings for tokens or non-terminals that are no longer used. If your parser runtime is short-lived and the source files aren't huge, you can cheat and skip freeing them, or use garbage collection.
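As a rough sketch of what actions look like once every <str> value is a std::string*, a small helper can build the result and free the operands in one place. The helpers concat, lit and join_params below are made-up names for illustration; they are not part of bison, and the sketch ignores exception safety.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not part of bison): takes ownership of both operands,
// returns a newly allocated concatenation, and frees the inputs.
std::string* concat(std::string* a, std::string* b) {
    assert(a && b);
    std::string* out = new std::string(*a + *b);
    delete a;
    delete b;
    return out;
}

// Hypothetical helper: wraps a literal so it can be passed to concat().
std::string* lit(const char* s) {
    return new std::string(s);
}

// Mirrors an action such as
//   params : params COMMA param { $$ = concat(concat($1, lit(",")), $3); }
// with plain parameters standing in for $1 and $3.
std::string* join_params(std::string* left, std::string* right) {
    return concat(concat(left, lit(",")), right);
}
```

With pointers in the union, anything that prints or compares a value has to dereference it first, for example fp << *$1 rather than fp << $1.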
It is possible to redefine YYSTYPE to a regular string (non-pointer), but you lose the ability to use the union, which most non-trivial parsers need to pass up a mix of tokens or typed AST objects in semantic actions. Constraining your productions to a single type is less useful than void *.
It is also possible to redefine YYSTYPE to use a variant / polymorphic type, or to use a multi-member struct (a poor substitute for a variant). The former defeats the purpose of the "type safe" %type and %token macros, and the latter forces you to remember the type of each terminal or non-terminal and use explicit notation for the member of your struct ($$->str = "foo", $$->expr.left = $1->str, etc.). This is the downside of using a C-based parser with C++. You may want to try Bison's C++ parser skeleton; I have little experience with it because of the compile errors I hit every time I tried it over the years.
There are other (better) workarounds that I have found; I have seen Bison patched to allow boost::variant for YYSTYPE with support of %type and %token. Google "bison Michiel de Wilde" or "bison variant YYSTYPE" (http://lists.gnu.org/archive/html/bison-patches/2007-06/msg00000.html), however, like many Bison suggestions over the years, the patches are met with some vague arguments or general discussion about alternatives, then it fizzles.
Currently I'm trying to implement a grammar which is very similar to Ruby. To keep it simple, the lexer currently ignores space characters.
However, in some cases a space character makes a big difference:
def some_callback(arg=0)
arg * 100
end
some_callback (1 + 1) + 1 # 300
some_callback(1 + 1) + 1 # 201
some_callback +1 # 100
some_callback+1 # 1
some_callback + 1 # 1
So currently all whitespace is ignored by the lexer:
{WHITESPACE} { ; }
And the language says for example something like:
UnaryExpression:
PostfixExpression
| T_PLUS UnaryExpression
| T_MINUS UnaryExpression
;
One way I can think of to solve this problem would be to add whitespace explicitly throughout the grammar, but that would make the whole grammar a lot more complex:
// OLD:
AdditiveExpression:
MultiplicativeExpression
| AdditiveExpression T_ADD MultiplicativeExpression
| AdditiveExpression T_SUB MultiplicativeExpression
;
// NEW:
_:
/* empty */
| WHITESPACE _;
AdditiveExpression:
MultiplicativeExpression
| AdditiveExpression _ T_ADD _ MultiplicativeExpression
| AdditiveExpression _ T_SUB _ MultiplicativeExpression
;
//...
UnaryExpression:
PostfixExpression
| T_PLUS UnaryExpression
| T_MINUS UnaryExpression
;
So I would like to ask whether there is a best practice for handling this in the grammar.
Thank you in advance!
Without having a full specification of the syntax you are trying to parse, it's not easy to give a precise answer. In the following, I'm assuming that those are the only two places where the presence (or absence) of whitespace between two tokens affects the parse.
Differentiating between f(...) and f (...) occurs in a surprising number of languages. One common strategy is for the lexer to recognize an identifier which is immediately followed by an open parenthesis as a "FUNCTION_CALL" token.
You'll find that in most awk implementations, for example; in awk, the ambiguity between a function call and concatenation is resolved by requiring that the open parenthesis in a function call immediately follow the identifier. Similarly, the C pre-processor macro definition directive distinguishes between #define foo(A) A (the definition of a macro with arguments) and #define foo (A) (an ordinary macro whose expansion starts with a ( token).
If you're doing this with (f)lex, you can use the / trailing-context operator:
[[:alpha:]_][[:alnum:]_]*/"(" { yylval = strdup(yytext); return FUNC_CALL; }
[[:alpha:]_][[:alnum:]_]* { yylval = strdup(yytext); return IDENT; }
The grammar is now pretty straight-forward:
call: FUNC_CALL '(' expression_list ')' /* foo(1, 2) */
| IDENT expression_list /* foo (1, 2) */
| IDENT /* foo * 3 */
This distinction will not be useful in all syntactic contexts, so it will often prove useful to add a non-terminal which will match either identifier form:
name: IDENT | FUNC_CALL
But you will need to be careful with this non-terminal. In particular, using it as part of the expression grammar could lead to parser conflicts. But in other contexts, it will be fine:
func_defn: "def" name '(' parameters ')' block "end"
(I'm aware that this is not the precise syntax for Ruby function definitions. It's just for illustrative purposes.)
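Outside of flex, the same one-character lookahead can be sketched by hand. The enum Tok and the function scan_name below are invented for this illustration; they only show the classification step, not a full scanner.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Token kinds for this sketch only; the names mirror the flex rules above.
enum class Tok { FuncCall, Ident, Other };

// Classify the identifier starting at position `pos` in `src`.
// It is a FuncCall only when '(' follows with no whitespace in between,
// which is what the trailing-context pattern expresses.
// On return, `pos` points just past the identifier; the '(' is not consumed.
Tok scan_name(const std::string& src, std::size_t& pos) {
    assert(pos <= src.size());
    auto is_start = [](unsigned char c) { return std::isalpha(c) || c == '_'; };
    auto is_part  = [](unsigned char c) { return std::isalnum(c) || c == '_'; };
    if (pos >= src.size() || !is_start(src[pos])) return Tok::Other;
    while (pos < src.size() && is_part(src[pos])) ++pos;
    return (pos < src.size() && src[pos] == '(') ? Tok::FuncCall : Tok::Ident;
}
```

Leaving the parenthesis unconsumed matches the flex version, where the trailing context is not part of yytext and the '(' is returned as its own token.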
More troubling is the other ambiguity, in which it appears that the unary operators + and - should be treated as part of an integer literal in certain circumstances. The behaviour of the Ruby parser suggests that the lexer is combining the sign character with an immediately following number in the case where it might be the first argument to a function. (That is, in the context <identifier><whitespace><sign><digits> where <identifier> is not an already declared local variable.)
That sort of contextual rule could certainly be added to the lexical scanner using start conditions, although it's more than a little ugly. A not-fully-fleshed out implementation, building on the previous:
%x SIGNED_NUMBERS
%%
[[:alpha:]_][[:alnum:]_]*/"(" { yylval.id = strdup(yytext);
return FUNC_CALL; }
[[:alpha:]_][[:alnum:]_]*/[[:blank:]] { yylval.id = strdup(yytext);
if (!is_local(yylval.id))
BEGIN(SIGNED_NUMBERS);
return IDENT; }
[[:alpha:]_][[:alnum:]_]* { yylval.id = strdup(yytext);
return IDENT; }
<SIGNED_NUMBERS>[[:blank:]]+ ;
/* Numeric patterns, one version for each context */
<SIGNED_NUMBERS>[+-]?[[:digit:]]+ { yylval.integer = strtol(yytext, NULL, 0);
BEGIN(INITIAL);
return INTEGER; }
[[:digit:]]+ { yylval.integer = strtol(yytext, NULL, 0);
return INTEGER; }
/* ... */
/* If the next character is not a digit or a sign, rescan in INITIAL state */
<SIGNED_NUMBERS>.|\n { yyless(0); BEGIN(INITIAL); }
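The contextual rule itself is small when written out as an ordinary function. The sketch below encodes only the guess made above (identifier, blanks, sign, digit, and the identifier is not a local); the function name and the std::set standing in for the symbol table are assumptions of this example, not anything Ruby or flex provides.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <set>
#include <string>

// Decide whether the sign at src[sign_pos] should be lexed as part of a
// numeric literal, i.e. whether the context is
//   <identifier><blanks><sign><digit>  with <identifier> not a known local.
bool sign_belongs_to_literal(const std::string& src, std::size_t sign_pos,
                             const std::set<std::string>& locals) {
    assert(sign_pos < src.size());
    auto is_digit = [](char c) { return std::isdigit(static_cast<unsigned char>(c)) != 0; };
    auto is_part  = [](char c) { return std::isalnum(static_cast<unsigned char>(c)) || c == '_'; };

    if (src[sign_pos] != '+' && src[sign_pos] != '-') return false;
    // The sign must be directly followed by a digit.
    if (sign_pos + 1 >= src.size() || !is_digit(src[sign_pos + 1])) return false;
    // It must be preceded by at least one blank...
    std::size_t i = sign_pos;
    while (i > 0 && (src[i - 1] == ' ' || src[i - 1] == '\t')) --i;
    if (i == sign_pos) return false;
    // ...and the blanks must be preceded by an identifier.
    std::size_t end = i;
    while (i > 0 && is_part(src[i - 1])) --i;
    if (i == end || is_digit(src[i])) return false;
    // Known local variables keep the ordinary binary-operator reading.
    return locals.count(src.substr(i, end - i)) == 0;
}
```

For the inputs in the question, "some_callback +1" satisfies the rule, while "some_callback+1" and "some_callback + 1" do not.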
Another possible solution would be for the lexer to distinguish sign characters which follow a space and are directly followed by a digit, and then let the parser try to figure out whether or not the sign should be combined with the following number. However, this will still depend on being able to distinguish between local variables and other identifiers, which will still require the lexical feedback through the symbol table.
It's worth noting that the end result of all this complication is a language whose semantics are not very obvious in some corner cases. The fact that f+3 and f +3 produce different results could easily lead to subtle bugs which might be very hard to detect. In many projects using languages with these kinds of ambiguities, the project style guide will prohibit legal constructs with unclear semantics. You might want to take this into account in your language design, if you have not already done so.
I'm working on a GLR-parser in GNU bison and I have the following problem:
The language I'm trying to parse allows boolean expressions including relations (<, >, <=, ...) and boolean composition (and, or, not). The problem is that the language also allows multiple arithmetic expressions on the right side of a relation, and they are composed using the same AND token that is used for boolean composition! This is a very dumb language design, but I can't change it.
So you can have a > b and c which is supposed to be equivalent to (a > b) and (a > c) and you can also have a > b and c > d which is supposed to be equivalent to (a > b) and (c > d)
The S/R conflict this causes is already obvious in this example: after reading a > b with "and" as the lookahead, you could either reduce a > b to a boolean expression and wait for another boolean expression, or shift the "and" and wait for another arithmetic expression.
My grammar currently looks like this:
booleanexpression
: relation
| booleanexpression TOK_AND booleanexpression
...
;
relation
: arithmeticexpression TOK_GT maxtree
...
;
maxtree
: arithmeticexpression
| maxtree TOK_AND maxtree
...
;
The language is clearly not LR(k) for any k, since the S/R conflict can't be resolved using any constant k-lookahead, because the arithmeticexpression in between can have arbitrarily many tokens. Because of that, I turned GLR-parsing on.
But when I try to parse a > b and c with this, I can see in my debug outputs, that the parser behaves like this:
it reads the a and at lookahead > it reduces the a to an arithmeticexpression
it reads the b and at lookahead and it reduces the b to an arithmeticexpression and then already to a maxtree
it reduces the a > b to a relation
it reads the c and reduces it to an arithmeticexpression
then nothing happens! The and c are apparently discarded - the debug outputs don't show any action for these tokens. Not even an error message. The corresponding if-statement doesn't exist in my AST (I still get an AST because I have error recovery).
I would think that, after reading the b, there should be 2 stacks. But then the b shouldn't be reduced. Or at least it should give me some error message ("language is ambiguous" would be okay and I have seen that message before - I don't see why it wouldn't apply here). Can anyone make sense of this?
From looking at the grammar for a while, you can tell that the main question here is whether after the next arithmeticexpression there comes
another relation token (then you should reduce)
another boolean composition (then you should shift)
a token outside of the boolean/arithmetic -expression syntax (like THEN) which would terminate the expression and you should also shift
Can you think of a different grammar that captures the situation in a better / more deterministic way? How would you approach the problem? I'm currently thinking about making the grammar more right-to-left, like
booleanexpression : relation AND booleanexpression
maxtree : arithmeticexpression AND maxtree
etc.
I think that would make bison prefer shifting and only reduce on the right first. Maybe by using different non-terminals it would allow a quasi-"lookahead" behind the arithmeticexpression...
Side note: GnuCOBOL handles this problem by just collecting all the tokens, pushing them on an intermediate stack and manually building the expression from there. That discourages me, but I cling to the hope that they did it this way because bison didn't support GLR-parsing when they started...
EDIT:
a small reproducible example
%{
#include <stdio.h>
int yylex ();
void yyerror(const char* msg);
%}
%glr-parser
%left '&'
%left '>'
%%
input: %empty | input bool '\n' {printf("\n");};
arith : 'a' | 'b' | 'c';
maxtree : arith { printf("[maxtree : arith] "); }
| maxtree '&' maxtree { printf("[maxtree : maxtree & maxtree] "); } ;
rel : arith '>' maxtree { printf("[rel : arith > maxtree] "); } ;
bool : rel { printf("[bool : rel] "); }
| bool '&' bool { printf("[bool : bool & bool] "); } ;
%%
void yyerror(const char* msg) { printf("%s\n", msg); }
int yylex () {
int c;
while ((c = getchar ()) == ' ' || c == '\t');
return c == EOF ? 0 : c;
}
int main (int argc, char** argv) {
return yyparse();
}
Strangely, this one does print the error message "syntax error" on input a>b&c.
Being able to simplify grammars by using precedence declarations is really handy (sometimes) [Note 1] but it doesn't play well with using GLR parsers because it can lead to early rejection of an unambiguous parse.
The idea behind precedence declarations is that they resolve ambiguities (or, more accurately, shift/reduce conflicts) using a simple one-token lookahead and a configured precedence between the possible reduction and the possible shift. If a grammar has no shift/reduce conflict, the precedence declarations won't be used, but if they are used they will be used to suppress either the shift or the reduce, depending on the (static) precedence relationship.
A Bison-generated GLR parser does not actually resolve ambiguity, but it allows possibly incorrect parses to continue to be developed until the ambiguity is resolved by the grammar. Unlike the use of precedence, this is a delayed resolution; a bit slower but a lot more powerful. (GLR parsers can produce a "parse forest" containing all possible parses. But Bison doesn't implement this feature, since it expects to be parsing programming languages and unlike human languages, programming languages cannot be ambiguous.)
In your language, it is impossible to resolve the non-determinism of the shift/reduce conflict statically, as you note yourself in the question. Your grammar is simply not LR(1), much less operator precedence, and GLR parsing is therefore a practical solution. But you have to allow GLR to do its work. Prematurely eliminating one of the plausible parses with a precedence comparison will prevent the GLR algorithm from considering it later. This will be particularly serious if you manage to eliminate the only parse which could have been correct.
In your grammar, it is impossible to define a precedence relationship between the rel productions and the & symbol, because no precedence relationship exists. In some sentences, the rel reduction needs to win; in other sentences, the shift should win. Since the grammar is not ambiguous, GLR will eventually figure out which is which, as long as both the shift and the reduce are allowed to proceed.
In your full language, both boolean and arithmetic expressions have something akin to operator precedence, but only within their respective domains. An operator precedence parser (and, equivalently, yacc/bison's precedence declarations) works by erasing the difference between different non-terminals; it cannot handle a grammar like yours in which some operator has different precedences in different domains (or between different domains).
Fortunately, this particular use of precedence declarations is only a shortcut; it does not give any additional power to the grammar and can easily and mechanically be implemented by creating new non-terminals, one for each precedence level. The alternative grammar will not be ambiguous. The classic example, which you can find in pretty well any textbook or tutorial which includes parsing arithmetic expressions, is the expr/term/factor grammar. Here I've also provided the precedence grammar for comparison:
                              %left '+' '-'
                              %left '*' '/'
%%                            %%
expr  : term
      | expr '+' term         expr: expr '+' expr
      | expr '-' term             | expr '-' expr
term  : factor
      | term '*' factor           | expr '*' expr
      | term '/' factor           | expr '/' expr
factor: ID                        | ID
      | '(' expr ')'              | '(' expr ')'
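To see why one non-terminal per precedence level is enough, here is a minimal hand-written sketch of the expr/term/factor grammar, with one function per level. It is not what bison generates; it evaluates non-negative integers instead of ID so that the result can be checked, and the names Parser and evaluate are made up for this example. Error handling is deliberately minimal.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// One function per precedence level, mirroring expr / term / factor.
// Left recursion in the grammar becomes a loop, which keeps the
// operators left-associative.
struct Parser {
    std::string src;
    std::size_t pos = 0;

    // Skip blanks and return the next character, or '\0' at end of input.
    char peek() {
        while (pos < src.size() && src[pos] == ' ') ++pos;
        return pos < src.size() ? src[pos] : '\0';
    }

    long factor() {               // factor: NUMBER | '(' expr ')'
        if (peek() == '(') {
            ++pos;
            long v = expr();
            if (peek() == ')') ++pos;
            return v;
        }
        long v = 0;
        while (pos < src.size() && std::isdigit(static_cast<unsigned char>(src[pos])))
            v = v * 10 + (src[pos++] - '0');
        return v;
    }

    long term() {                 // term: factor | term '*' factor | term '/' factor
        long v = factor();
        for (char op = peek(); op == '*' || op == '/'; op = peek()) {
            ++pos;
            long rhs = factor();
            if (op == '/') assert(rhs != 0);
            v = (op == '*') ? v * rhs : v / rhs;
        }
        return v;
    }

    long expr() {                 // expr: term | expr '+' term | expr '-' term
        long v = term();
        for (char op = peek(); op == '+' || op == '-'; op = peek()) {
            ++pos;
            long rhs = term();
            v = (op == '+') ? v + rhs : v - rhs;
        }
        return v;
    }
};

long evaluate(const std::string& text) {
    Parser p;
    p.src = text;
    return p.expr();
}
```

Because term can only be reached through expr, and factor only through term, multiplication binds tighter than addition without any precedence declaration: evaluate("1+2*3") gives 7 and evaluate("(1+2)*3") gives 9.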
In your minimal example, there are already enough non-terminals that no new ones need to be invented, so I've just rewritten it according to the above model.
I've left the actions as I wrote them, in case the style is useful to you. Note that this style leaks memory like a sieve, but that's ok for quick tests:
%code top {
#define _GNU_SOURCE 1
}
%{
#include <ctype.h>
#include <stdio.h>
#include <string.h>
int yylex(void);
void yyerror(const char* msg);
%}
%define api.value.type { char* }
%glr-parser
%token ID
%%
input : %empty
| input bool '\n' { puts($2); }
arith : ID
maxtree : arith
| maxtree '&' arith { asprintf(&$$, "[maxtree& %s %s]", $1, $3); }
rel : arith '>' maxtree { asprintf(&$$, "[COMP %s %s]", $1, $3); }
bool : rel
| bool '&' rel { asprintf(&$$, "[AND %s %s]", $1, $3); }
%%
void yyerror(const char* msg) { printf("%s\n", msg); }
int yylex(void) {
int c;
while ((c = getchar ()) == ' ' || c == '\t');
if (isalpha(c)) {
*(yylval = strdup(" ")) = c;
return ID;
}
else return c == EOF ? 0 : c;
}
int main (int argc, char** argv) {
#if YYDEBUG
if (argc > 1 && strncmp(argv[1], "-d", 2) == 0) yydebug = 1;
#endif
return yyparse();
}
Here's a sample run. Note the warning from bison about a shift/reduce conflict. If there had been no such warning, the GLR parser would probably be unnecessary, since a grammar without conflicts is deterministic. (On the other hand, since bison's GLR implementation optimises for determinism, there is not too much cost for using a GLR parser on a deterministic language.)
$ bison -t -o glr_prec.c glr_prec.y
glr_prec.y: warning: 1 shift/reduce conflict [-Wconflicts-sr]
$ gcc -Wall -o glr_prec glr_prec.c
$ ./glr_prec
a>b
[COMP a b]
a>b & c
[COMP a [maxtree& b c]]
a>b & c>d
[AND [COMP a b] [COMP c d]]
a>b & c & c>d
[AND [COMP a [maxtree& b c]] [COMP c d]]
a>b & c>d & e
[AND [COMP a b] [COMP c [maxtree& d e]]]
$
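The decision that GLR postpones can be made visible with a small hand-written sketch of the same toy grammar. It only works with a fixed lookahead because arith is a single token here; in the full language, where an arithmetic expression can be arbitrarily long, this shortcut is not available, which is exactly why GLR is useful. The names RelParser and parse_bool are invented for this illustration, and input is assumed to be well formed.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

struct RelParser {
    std::vector<char> toks;   // one character per token, blanks removed
    std::size_t pos = 0;

    char at(std::size_t i) const { return i < toks.size() ? toks[i] : '\0'; }

    // arith: ID (a single letter in this sketch)
    std::string arith() { return std::string(1, at(pos++)); }

    // maxtree: arith | maxtree '&' arith
    // The '&' belongs to the maxtree only if the arith after it is NOT
    // followed by '>'. Otherwise it starts a new relation.
    std::string maxtree() {
        std::string left = arith();
        while (at(pos) == '&' && at(pos + 2) != '>') {
            ++pos;
            left = "[maxtree& " + left + " " + arith() + "]";
        }
        return left;
    }

    // rel: arith '>' maxtree
    std::string rel() {
        std::string lhs = arith();
        assert(at(pos) == '>');
        ++pos;
        return "[COMP " + lhs + " " + maxtree() + "]";
    }

    // bool: rel | bool '&' rel
    std::string boolean() {
        std::string left = rel();
        while (at(pos) == '&') {
            ++pos;
            left = "[AND " + left + " " + rel() + "]";
        }
        return left;
    }
};

std::string parse_bool(const std::string& text) {
    RelParser p;
    for (char c : text)
        if (c != ' ' && c != '\t') p.toks.push_back(c);
    return p.boolean();
}
```

For the inputs in the sample run, parse_bool produces the same bracketed strings as the bison-generated parser.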
Notes
Although precedence declarations are handy when you understand what's actually going on, there is a huge tendency for people to just cargo-cult them from some other grammar they found on the internet, and not infrequently a grammar which was also cargo-culted from somewhere else. When the precedence declarations don't work as expected, the next step is to randomly modify them in the hopes of finding a configuration which works. Sometimes that succeeds, often leaving behind unnecessary detritus which will go on to be cargo-culted again.
So, although there are circumstances in which precedence declarations really simplify grammars and the unambiguous implementation would be quite a lot more complicated (such as dangling-else resolution in languages which have many different compound statement types), I've still found myself recommending against their use.
In a recent answer to a different question, I wrote what I hope is a good explanation of the precedence algorithm (and if it isn't, please let me know how it falls short).
Welcome to the wonderful world of COBOL. I could be wrong, but you may have a few
additional problems here. An expression such as A > B AND C in COBOL is ambiguous
until you know how C was declared. Consider the following program:
IDENTIFICATION DIVISION.
PROGRAM-ID EXAMPLE.
DATA DIVISION.
WORKING-STORAGE SECTION.
01 A PIC 9 VALUE 2.
01 B PIC 9 VALUE 1.
01 W PIC 9 VALUE 3.
88 C VALUE 3.
PROCEDURE DIVISION.
IF A > B AND C
DISPLAY 'A > B AND 88 LEVEL C is TRUE because W = ' W
ELSE
DISPLAY 'A not > B or 88 LEVEL C is not TRUE'
END-IF
DISPLAY 'A: ' A ' B: ' B ' W:' W
GOBACK
.
Output from this program is:
A > B AND 88 LEVEL C is TRUE because W = 3
A: 2 B: 1 W: 3
In essence the expression: A > B AND C is equivalent to: A > B AND W = 3. Had C
been defined in a manner similar to A and B, the semantics would
have been: A > B AND A > C, which in this case, is FALSE.
The code mentioned above works well, but I had never gotten it to work in my real project, even though I couldn't see a difference between my real project and this code.
This drove me crazy, but I just found another problem in my code, which prevented this method from working:
I had an (admittedly cargo-culted) %skeleton "lalr1.cc" in my prologue, which disabled the GLR parsing again!
I needed to replace this with
%skeleton "glr.cc"
I intend to use bison to parse a scripting language in which I can write code like the following:
a = input()
b = a + 1
function myfunc
a = input()
b = a + 1
end function
I found that the block
a = input()
b = a + 1
which appears both inside and outside the function definition, can be reduced by the same rule stmts, so I wrote the following:
%{
#include <stdio.h>
#include <string>
#include <sstream>
#include <iostream>
#include <stdarg.h>
#include <tuple>
using namespace std;
%}
%debug
%token CRLF EXP FUNCTIONBEGIN FUNCTIONEND
%start program
%%
stmt : EXP
|
stmts : stmt CRLF stmts
| stmt
function : FUNCTIONBEGIN CRLF stmts CRLF FUNCTIONEND
unit : function
| stmts
program : unit
| unit CRLF program
%%
Of course this doesn't work, because of a shift/reduce conflict:
State 3
3 stmts: stmt . CRLF stmts
4 | stmt .
CRLF shift, and go to state 9
CRLF [reduce using rule 4 (stmts)]
$default reduce using rule 4 (stmts)
I think this conflict arises because both my program rule and my stmts rule use the same terminal CRLF as a "binary operator", so I can't resolve it by assigning precedence to operators.
Maybe I can merge the program and stmts rules by adding another two rules to stmts:
stmts : function CRLF stmts
| function
However, I don't find this approach very elegant (even if it works in practice), so I am asking whether there are other solutions.
The problem has nothing to do with CRLF tokens. Rather, it is your definition of program. A program is a series of units where each unit is a function or a stmts. But stmts is not a "unit", which is hinted at by the fact that its name is plural. A stmts is a series of individual stmt items.
So suppose we have a program consisting of three statements. How many units is that? Is it one stmts consisting of all three statements? Or two of them, one with two statements and the other with just one? Or the other way around? Or even three units, each consisting of a stmts containing a single statement?
The parser can't tell which of those alternatives is desired because the grammar is ambiguous. And that is what creates the conflict.
The simplest solution is to change the production unit: stmts to be singular: unit: stmt. Then there is no ambiguity; the three-statement program has exactly three units, each a single stmt.
By the way, you should always prefer left recursion when writing LR grammars. Right recursion doesn't usually create conflicts, and it has nothing to do with your current problem, but it does lead to unnecessarily large parsing stacks, and the reduction of lists like units and stmts will execute from right to left as the components are popped off the stack, which is often not what is intended.
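The difference between the two recursion styles can be illustrated with a toy model of the parser stack. This is not a real LR automaton: it ignores separators and lookahead and only counts stack entries for a list of n items. Trace, left_recursive and right_recursive are names made up for this sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct Trace {
    std::size_t max_depth = 0;       // deepest the stack ever gets
    std::vector<int> action_order;   // order in which items reach a semantic action
};

// list: item | list item
// Each item can be reduced as soon as it has been shifted.
Trace left_recursive(int n) {
    assert(n >= 1);
    Trace t;
    std::size_t depth = 0;
    for (int i = 1; i <= n; ++i) {
        ++depth;                                   // shift item i
        t.max_depth = std::max(t.max_depth, depth);
        t.action_order.push_back(i);               // the action for item i runs now
        depth = 1;                                 // the stack holds a single `list`
    }
    return t;
}

// list: item | item list
// Nothing can be reduced until the last item has been seen.
Trace right_recursive(int n) {
    assert(n >= 1);
    Trace t;
    std::size_t depth = 0;
    for (int i = 1; i <= n; ++i) {                 // shift every item first
        ++depth;
        t.max_depth = std::max(t.max_depth, depth);
    }
    for (int i = n; i >= 1; --i)                   // then unwind from the right
        t.action_order.push_back(i);
    return t;
}
```

In this model the left-recursive list never needs more than two stack entries, while the right-recursive one needs one entry per item and runs its actions for the last item first.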
I would like to apply De Morgan's theorem to an input using yacc and lex.
The input could be any expression, such as a+b or !(A+B):
The expression a+b should result in !a∙!b
The expression !(a+b) should result in a+b
I think the lex part is done but I'm having difficulty with the yacc grammar needed to apply the laws to an expression.
What I'm trying to implement is the following algorithm. Consider the following equation as input: Y = A+B
After applying De Morgan's law it becomes: !Y = !(A+B)
Finally, expanding the parentheses should result in !Y = !A∙!B
Here is my lex code:
%{
#include <stdio.h>
#include "y.tab.h"
extern int yylval;
int yywrap (void);
%}
%%
[a-zA-Z]+ {yylval = *yytext; return ALPHABET;}
"&&" return AND;
"||" return OR;
"=" return ('=');
[\t] ;
\n return 0;
. return yytext[0];
"0exit" return 0;
%%
int yywrap (void)
{
return 1;
}
Here is my yacc code:
%{
#include <stdio.h>
int yylex (void);
void yyerror (char *);
extern FILE* yyin;
%}
%token ALPHABET
%left '+''*'
%right '=' '!' NOT
%left AND OR
%start check
%%
check : expr {printf("%s\n",$$);}
;
expr : plus
|plus '+' plus {$$ = $1 + $3;}
;
plus : times
|times '*' times {$$ = $1 * $3;}
;
times : and_op
|and_op AND and_op{$$ = $1 && $3;}
;
and_op : or_op
|or_op OR or_op {$$ = $1 || $3;}
;
or_op : not_op
|'!' not_op {$$ = !$2;}
;
not_op : paren
|'(' paren ')' {$$ = $2;}
;
paren :
|ALPHABET {$$ = $1;}
;
/*
E: E '+' E {$$ = $1 + $3;}
|E '*' E {$$ = $1 * $3;}
|E '=' E {$$ = $1 = $3;}
|E AND E {$$ = ($1 && $3);}
|E OR E {$$ = ($1 || $3);}
|'(' E ')' {$$ = $2;}
|'!' E %prec NOT {$$ = !$2;}
|ALPHABET {$$ = $1;}
;*/
%%
int main()
{
char filename[30];
char * line = NULL;
size_t len = 0;
printf("\nEnter filename\n");
scanf("%s",filename);
FILE *fp = fopen(filename, "r");
if(fp == NULL)
{
fprintf(stderr,"Can't read file %s\n",filename);
exit(EXIT_FAILURE);
}
yyin = fp;
// while (getline(&line, &len, fp) != -1)
// {
// printf("%s",line);
// }
// printf("Enter the expression:\n");
do
{
yyparse();
}while(!feof(yyin));
return 0;
}
You are trying to build a computer algebra system.
Your task is conceptually simple:
1. Define a lexer for the atoms of your "boolean" expressions
2. Define a parser for propositional logic in terms of the lexemes
3. Build a tree that stores the expressions
4. Define procedures that implement logical equivalences (De Morgan's theorem is one), that find a place in the tree where it can be applied by matching tree structure, and then modify the tree accordingly
5. Run those procedures to achieve the logic rewrites you want
6. Prettyprint the final AST as the answer
But conceptually simple doesn't necessarily mean easy to do and get it all right.
(f)lex and yacc are designed to help you do steps 1-3 in a relatively straightforward way; their documentation contains a pretty good guide.
They won't help with steps 4-6 at all, and this is where the real work happens. (Your grammar looks like a pretty good start for this part).
(You can do 1-3 without flex and yacc by building a recursive descent parser that also happens to build the AST as it goes).
Step 4 can be messy, because you have to decide what logical theorems you wish to use, and then write a procedure for each one to do tree matching, and tree smashing, to achieve the desired result. You can do it; it's just procedural code that walks up and down the tree comparing node types and relations to children for a match, and then delinking nodes, deleting nodes, creating nodes, and relinking them to effect the tree modification. This is just a bunch of code.
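As a minimal sketch of that kind of tree matching, the code below hard-codes a single pass for De Morgan's law. The node layout and the names var, neg, conj, disj, de_morgan and show are all made up for this example. The pass also removes double negations (a separate equivalence, not De Morgan itself) so that it ends with negations sitting only on variables, and it builds new nodes rather than modifying the tree in place.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Minimal expression tree: variables, NOT, AND (printed '*'), OR (printed '+').
struct Node;
using NodePtr = std::shared_ptr<Node>;

struct Node {
    enum Kind { Var, Not, And, Or } kind;
    std::string name;      // used only when kind == Var
    NodePtr left, right;   // Not uses only `left`
};

NodePtr var(const std::string& n) { return std::make_shared<Node>(Node{Node::Var, n, nullptr, nullptr}); }
NodePtr neg(NodePtr a)            { return std::make_shared<Node>(Node{Node::Not, "", a, nullptr}); }
NodePtr conj(NodePtr a, NodePtr b) { return std::make_shared<Node>(Node{Node::And, "", a, b}); }
NodePtr disj(NodePtr a, NodePtr b) { return std::make_shared<Node>(Node{Node::Or, "", a, b}); }

// Match "NOT directly above AND/OR" and rewrite it, recursing so that
// every negation ends up on a variable.
NodePtr de_morgan(const NodePtr& n) {
    assert(n);
    if (n->kind == Node::Var) return n;
    if (n->kind == Node::Not) {
        const NodePtr& c = n->left;
        if (c->kind == Node::Or)  return conj(de_morgan(neg(c->left)), de_morgan(neg(c->right)));
        if (c->kind == Node::And) return disj(de_morgan(neg(c->left)), de_morgan(neg(c->right)));
        if (c->kind == Node::Not) return de_morgan(c->left);   // !!x => x
        return n;                                              // !variable
    }
    if (n->kind == Node::And) return conj(de_morgan(n->left), de_morgan(n->right));
    return disj(de_morgan(n->left), de_morgan(n->right));
}

// Prettyprinter that adds parentheses only where precedence requires them.
std::string show(const NodePtr& n) {
    assert(n);
    switch (n->kind) {
        case Node::Var: return n->name;
        case Node::Not: {
            bool atom = n->left->kind == Node::Var || n->left->kind == Node::Not;
            return atom ? "!" + show(n->left) : "!(" + show(n->left) + ")";
        }
        case Node::And: {
            auto operand = [](const NodePtr& c) {
                return c->kind == Node::Or ? "(" + show(c) + ")" : show(c);
            };
            return operand(n->left) + "*" + operand(n->right);
        }
        case Node::Or: return show(n->left) + "+" + show(n->right);
    }
    return "";
}
```

For example, show(de_morgan(neg(disj(var("a"), var("b"))))) produces !a*!b. A yacc action would build the same kind of tree with calls such as $$ = disj($1, $3), provided the semantic value type can hold a node pointer.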
A subtlety of algebraic rewrites is now going to bite you: (boolean) algebra has associative and commutative operators. What this means is that some algebra rules will apply to parts of the tree that are arbitrarily far apart. Consider this rule:
a*(b + !a) => a*(b)
What happens when the actual term being parsed looks like:
q*(a + b + c + ... !q ... + z)
"Simple" procedural code to look at the tree now has to walk arbitrarily far down one of the subtrees to find where the rule can apply. Suddenly coding the matching logic isn't so easy, nor is the tree-smash to implement the effect.
If we ignore associative and commutative issues, for complex matches and modifications, the code might be a bit clumsy to write and hard to read; after you've done it once this will be obvious. If you only want to do De Morgan-over-or, you can do it relatively easily by just coding it. If you want to implement lots of boolean algebra rules for simplification, this will start to be painful. What you'd ideally like to do is express the logic rules in the same notation as your boolean logic so they are easily expressed, but now you need something that can read and interpret the logic rules. That is a complex piece of code, but if done right, you can code the logic rules something like the following:
rule deMorgan_for_or(t1:boolexp, t2:boolexp):boolexp->boolexp
" ! (\t1 + \t2) " -> " !\t1 * !\t2 ";
A related problem (step 5) is, where do you want to apply the logic rules? Just because you can apply De Morgan's law in 15 places in a very big logic term doesn't mean you necessarily want to do that. So somewhere you need to have a control mechanism that decides which of your many rules should apply, and where they should apply. This gets you into metaprogramming, a whole new topic.
If your rules are "monotonic", that is, they in effect can only be applied once, you can simply run them all everywhere and get a terminating computation, if that monotonic answer is the one you want. If you have rules that are inverses (e.g., !(x+y) => !x * !y, and !a * !b => !(a+b)), then your rules may run forever repeatedly doing and undoing a rewrite. So you have to be careful to ensure you get termination.
Finally, when you have the modified tree, you'll need to print it back out in readable form (Step 6). See my SO answer on how to build a prettyprinter.
Doing all of this for one or two rules by yourself is a great learning exercise.
Doing it with the idea of producing a useful tool is a whole different animal. There what you want is a set of infrastructure that makes this easy to express: a program transformation system. You can see a complete example of what this looks like for a system doing arithmetic rather than boolean computations using surface syntax rewrite rules, including handling the associative and commutative rewrite issues. In another example, you can see what it looks like for boolean logic (see simplify_boolean near end of page), which shows a real example for rules like I wrote above.
I am using ocamlyacc for a small parser which also performs some semantic actions on most parsing rules.
I have defined a set of tokens in the beginning:
%token T_plus
%token T_minus
%token <int> T_int_const
%left T_plus T_minus
A parser rule which performs a semantic action is the following:
exp: exp T_plus exp
{
checkType T_plus $1 $3
}
where checkType is an external helper function. However, I'm getting this strange warning (which refers to a line in my Parser.mly file)
warning: T_plus was selected from type Parser.token.
It is not visible in the current scope,
and will not be selected if the type becomes unknown.
I haven't found any relevant info in the ocamlyacc manual. Has anyone encountered a similar error? Why is the token not visible inside the scope of the semantic action?
It is not possible to guess what is going wrong on your side, since you haven't disclosed enough information. My guess is that you somehow misread the error message and the problem is in another file. For example, the following file:
%{
let f PLUS _ = ()
%}
%token PLUS
%left PLUS
%start exp
%type <unit> exp
%%
exp : exp PLUS exp {f PLUS $1}
compiles without any problems or warnings with
ocamlbuild Parser.byte
I can only suggest looking at the generated Parser.ml to see what is happening there.
In general, this message means that you're referring to a constructor that was not brought into scope. In Parser.mly, tokens are always in scope, so you can't see this error in that file. It usually happens in the lexer, so make sure that you have open Parser in the header section of your lexer.