Bison parser expanding rule instead of reducing rule - parsing

Is it possible for a Bison rule to expand instead of reducing so that it turns into more tokens? Asked a different way: is it possible to insert extra tokens to be parsed before the next token in the parser input?
Here is an example where I might want this:
Suppose I want a parser that understands three token types: numbers (just positive integers for the sake of simplicity - INT), words (any number of letters, upper or lower case - STRING) and some kind of other symbol (let's use an exclamation mark for no good reason - EXC).
Suppose I have a rule that reduces a word followed by a number followed by an exclamation mark. This rule results in an integer type, let's say for now that it simply doubles its input. This rule also allows itself to be the integer that it parses.
I also have a rule to accept any number of these in a row (the start rule).
The Bison parser looks like this: (quicktest.y)
%{
#include <stdio.h>
%}
%union {
int INT_VAL;
}
%token STRING EXC
%token <INT_VAL> INT
%type <INT_VAL> somenumber
%%
start: somenumber {printf ("Result: %d\n", $1);}
| start somenumber {printf ("Result: %d\n", $2);}
;
somenumber: STRING INT EXC {$$ = $2 *2;}
| STRING somenumber EXC {$$ = $2 *2;}
;
%%
main(int argc, char ** argv){
yyparse();
}
yyerror(char* s){
fprintf(stderr, "%s\n", s);
}
The tokens can be generated with a flex lexer like so: (quicktest.l)
%{
#include "quicktest.tab.h"
%}
%%
[A-Za-z]+ {return STRING;}
[1-9]+ {yylval.INT_VAL = atoi(yytext); return INT;}
"!" {return EXC;}
. {}
This can be built with the following commands:
bison -d quicktest.y
flex quicktest.l
gcc -o quicktest quicktest.tab.c lex.yy.c -lfl -ggdb
I can now input something like this:
double double 2 ! !
and get the result 8
Now if I want the user to be able to avoid having lots of exclamation marks on one line, like this:
a b c d e f 2 ! ! ! ! ! !
I'd like to be able to allow them to input something like this:
a b c d e f 2 !*6
So I can add a flex expression for such a token that simply extracts the number of exclamations needed:
!\*[1-9]+ {
char *number = malloc(sizeof(char) * (strlen(yytext)-1));
strcpy(number, yytext+2);
yylval.INT_VAL = atoi(number);
free(number);
printf("Multiple exclamations: %d\n", yylval.INT_VAL);
return REPEAT_EXC;
}
But how would I implement the bison side of things?
I can add the token type like so:
%token <INT_VAL> REPEAT_EXC
And then a rule of some kind perhaps?
repeat_exc: REPEAT_EXC {/*expand into n exclamation marks (EXC tokens)*/}
;
Does Bison support this in any way?
If not how should I implement this?
Should I somehow have the lexer return the EXC token n times when it receives the repeat exc expression? (I'd rather avoid this if possible as this requires the flex code to keep record of some kind of state, it could be in the repeat exclamation state or in a normal state. The lexer is then not as simple to maintain.)

That's really not possible in a context-free grammar.
It's not that difficult to do in a traditional lexer, but as you say it requires that the lexer maintain state. An easier approach is to use a push parser, where the parser is called from the lexer rather than the other way around. [Note 1]
The bison manual doesn't explain the API very well; if you declare a pure push parser, the interface you get is:
int yypush_parse(yypstate*, int, const YYSTYPE*);
or, if position-tracking is enabled:
int yypush_parse(yypstate*, int, const YYSTYPE*, YYLTYPE*);
I made fairly minimal changes to your example, in order to show the push_parser interface. First, the parser; the only differences are the %define directives to declare a push parser; the elimination of main (the lexer is now top-level), and the declaration of yyerror with an explicit void return type. [Note 2]
%{
#include <stdio.h>
void yyerror(char* msg);
%}
%define api.pure full
%define api.push-pull push
%union {
int INT_VAL;
}
%token STRING EXC
%token <INT_VAL> INT
%type <INT_VAL> somenumber
%%
start: somenumber {printf ("Result: %d\n", $1);}
| start somenumber {printf ("Result: %d\n", $2);}
;
somenumber: STRING INT EXC {$$ = $2 *2;}
| STRING somenumber EXC {$$ = $2 *2;}
;
%%
void yyerror(char* s){
fprintf(stderr, "%s\n", s);
}
The lexer has some more substantial changes, but I don't think the end result is any harder to read or maintain. It might even be easier.
The macro PARSE sends a token with a specified type tag and value to yypush_parse; the macro PARSE_TOKEN sends a token without a semantic value.
The %option line removes several warnings from the compile step.
The initialization of the parser state was added. (Indented lines after the %% and before any rule are inserted at the top of the lexer function, in this case yylex, so they can be used to declare and initialize local variables.)
The INT rule was changed to allow 10 to be a valid integer.
The !*<int> rule was added.
The <<EOF>> rule was added. (It's pretty well boiler-plate for lexer-driven push-parsing.)
A main function was added, which calls yylex.
(Oh, and I changed a rule to avoid echoing new lines.)
%{
#include "push.tab.h"
#define PARSE(tok,tag,val) do { \
YYSTYPE yylval = {.tag=val}; \
int status = yypush_parse(ps, tok, &yylval); \
if (status != YYPUSH_MORE) return status; \
} while(0)
#define PARSE_TOKEN(tok) do { \
int status = yypush_parse(ps, tok, 0); \
if (status != YYPUSH_MORE) return status; \
} while(0)
%}
%option noyywrap nounput noinput
%%
    yypstate *ps = yypstate_new ();
[A-Za-z]+ {PARSE_TOKEN(STRING);}
[1-9][0-9]* {PARSE(INT,INT_VAL,atoi(yytext));}
"!*"[1-9][0-9]* {int r = atoi(yytext+2);
while (r--) PARSE_TOKEN(EXC);
}
"!" {PARSE_TOKEN(EXC);}
.|\n {}
<<EOF>> {int status = yypush_parse(ps, 0, 0);
yypstate_delete(ps);
return status;
}
%%
int main(int argc, char** argv) {
return yylex();
}
Notes
1. This is the style of the lemon parser generator. lemon was originally written to create the sqlite SQL parser but is used in various projects precisely for the convenience of the "push" interface. bison's push-parser support is more recent, and very welcome.
2. I'm not crazy about INT_VAL; I prefer lower-case for union tags, but I was trying to minimize the diff.
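
For comparison, here is a rough sketch of the stateful alternative mentioned in the question, where a traditional pull-style lexer remembers how many EXC tokens it still owes after seeing "!*N". It is hand-written C rather than flex output, and the names (struct scanner, scanner_init, next_token, the TOK_* codes) are all made up for illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>

/* Hypothetical token codes; stand-ins for the values bison would generate. */
enum { TOK_EOF = 0, TOK_STRING = 258, TOK_INT, TOK_EXC };

/* Hypothetical scanner state: the input cursor plus the number of EXC
   tokens still owed from the most recent "!*N" lexeme. */
struct scanner {
    const char *p;
    int pending_exc;
    int int_val;            /* semantic value of the last TOK_INT */
};

void scanner_init(struct scanner *s, const char *input) {
    assert(s != NULL && input != NULL);
    s->p = input;
    s->pending_exc = 0;
    s->int_val = 0;
}

/* Returns the next token, pull-style (the parser calls the lexer). */
int next_token(struct scanner *s) {
    assert(s != NULL);
    if (s->pending_exc > 0) {          /* still replaying an expanded "!*N" */
        s->pending_exc--;
        return TOK_EXC;
    }
    for (;;) {
        unsigned char c = (unsigned char)*s->p;
        if (c == '\0') return TOK_EOF;
        if (isalpha(c)) {              /* [A-Za-z]+ */
            while (isalpha((unsigned char)*s->p)) s->p++;
            return TOK_STRING;
        }
        if (c >= '1' && c <= '9') {    /* [1-9][0-9]* */
            char *end;
            s->int_val = (int)strtol(s->p, &end, 10);
            s->p = end;
            return TOK_INT;
        }
        if (c == '!') {
            s->p++;
            if (s->p[0] == '*' && s->p[1] >= '1' && s->p[1] <= '9') {
                char *end;
                long n = strtol(s->p + 1, &end, 10);
                s->p = end;
                s->pending_exc = (int)n - 1;   /* one EXC is returned now */
            }
            return TOK_EXC;
        }
        s->p++;                        /* ignore everything else */
    }
}
```

The extra pending_exc field and the check at the top of next_token are the state the question was hoping to avoid; the push-parser version needs neither, because the loop lives inside a single lexer action.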

Related

Why the following LEX program is not printing "No. of tokens"

My code prints the identifiers, separators and everything else, but it does not print the number of tokens. I can't pinpoint the problem.
%{
int n=0;
%}
%%
"while"|"if"|"else"|"printf" {
n++;
printf("\t keywords : %s", yytext);}
"int"|"float" {
n++;printf("\t identifier : %s", yytext);
}
"<="|"=="|"="|"++"|"-"|"*"|"+" {
n++;printf("\t operator : %s", yytext);
}
[(){}|, ;] {n++;printf("\t seperator : %s", yytext);}
[0-9]*"."[0-9]+ {
n++;printf("\t float : %s", yytext);
}
[0-9]+ {
n++;printf("\t integer : %s", yytext);
}
.;
%%
int main(void)
{
yylex();
printf("\n total no. of tokens = %d\n",n);
}
int yywrap()
{
return 0;
}
If yywrap() returns 0, the lexer assumes that yywrap() has somehow arranged for yyin to have more data, and the lexer will continue to read input. So your lexer will never terminate.
If you want to signal that there is no more data, you need to return 1 from yywrap().
It's probably better to avoid the need for yywrap by placing
%option noyywrap
in the flex prologue.
I usually use %option noinput nounput noyywrap, which eliminates some compiler warnings assuming you ask for compiler warnings, which you should always do. Also %option nodefault can help you find lex specification bugs, since it will complain if some input does not have a matching rule. (The default (f)lex action on unrecognised input is to simply write the unmatched character to standard output. That's not usually very helpful, and unlike an error message, it is very easy to miss.) Finally, %option 8bit is only necessary if you request a lexer optimised for speed rather than table-size. But it doesn't hurt to add it, and it might save you from an embarrassing bug if you (or someone) someday decides to try the faster scanner skeleton. (Not recommended, except in very special circumstances.)
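
The yywrap() contract can be illustrated with a small model in plain C. This is only a simulation of the behaviour described above, not flex's actual implementation; struct input_set, wrap and scan_all are hypothetical names, with wrap playing the role of yywrap() and scan_all the role of yylex().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a scanner's inputs; not flex's real data structure. */
struct input_set {
    const char **bufs;   /* the input buffers, in order */
    size_t count;
    size_t current;      /* index of the buffer being read */
};

/* Plays the role of yywrap(): called when the current input is exhausted.
   Returns 0 after arranging more input, 1 when there is none left. */
int wrap(struct input_set *in) {
    assert(in != NULL);
    if (in->current + 1 < in->count) {
        in->current++;
        return 0;        /* "keep going, I switched to a new input" */
    }
    return 1;            /* "really done" */
}

/* Plays the role of yylex(): counts non-blank characters in all inputs. */
int scan_all(struct input_set *in) {
    assert(in != NULL && in->count > 0);
    int n = 0;
    for (;;) {
        for (const char *p = in->bufs[in->current]; *p; p++)
            if (*p != ' ' && *p != '\n') n++;
        if (wrap(in)) return n;   /* stop only when wrap() reports 1 */
    }
}
```

If wrap() returned 0 unconditionally, as the yywrap() in the question does, scan_all would never return, which matches the symptom of the token count never being printed.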

Yacc and Lex "syntax error"

When I'm trying to check the expression "boolean x;" I'm getting "syntax error" and I can't understand why.
When I'm checking the expression "x = 3;" or "2 = 1;", the abstract syntax tree is generated and no errors are presented.
(I'm not allowed to use anything beside Lex and Yacc in this project and I'm using Ubuntu)
Lex file:
%%
[\n\t ]+;
boolean {return BOOL;}
TRUE {return TRUE;}
FALSE {return FALSE;}
[0-9]+ {return NUM;}
[a-zA-Z][0-9a-zA-Z]* {return ID;}
. {return yytext[0];}
%%
Yacc file:
%{
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
typedef struct node{
struct node *left;
struct node *right;
char *token;
} node;
node *mknode(node *left, node *right, char *token);
void printtree(node *tree);
#define YYSTYPE struct node *
%}
%start code
%token ID,NUM,TRUE,FALSE,BOOL
%right '='
%%
code:lines{printtree($1); printf("\n");}
lines:calcExp';'|assignExp';'|boolExp ';'{$$ = $1;}
boolExp: boolST id{$$=$2;}
calcExp: number '+' number {$$ = mknode($1,$3,"+");}
assignExp: id '=' number{$$ = mknode($1,$3,"=");}
boolSt : BOOL;
id : ID {$$ = mknode(0,0,yytext);}
number : NUM{$$ = mknode(0,0,yytext);}
%%
#include "lex.yy.c"
int main (void) {return yyparse();}
node *mknode(node *left, node *right, char *token){
node *newnode = (node *)malloc(sizeof(node));
char *newstr = (char *)malloc(strlen(token)+1);
strcpy(newstr, token);
newnode->left = left;
newnode->right = right;
newnode->token = newstr;
return newnode;
}
void printtree(node *tree){
if (tree->left || tree->right)
printf("(");
printf(" %s ", tree->token);
if(tree->left)
printtree(tree->left);
if(tree->right)
printtree(tree->right);
if(tree->left || tree->right)
printf(")");
}
void yyerror (char *s) {
fprintf (stderr, "%s\n",s);}
The first step to debug syntax errors is to enable %error-verbose in the bison file. Now instead of just saying "syntax errors", it tells us there was an unexpected character after the boolean keyword when it expected an identifier.
So let's add a print statement to the . rule in the lexer that prints the matched character, so that we can see where it produces unexpected characters. Now we see that it prints a space, but spaces should have been ignored, right? So let's look at the rule that's supposed to do that:
[\n\t ]+;
If your editor has proper syntax highlighting for flex files, the problem should become apparent now: The ; is seen as part of the rule, not the action. That is, the rule matches white space, followed by a semicolon, instead of just matching white space.
So remove the semicolon and it should work.
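
When printing the matched character from the catch-all rule, a space is easy to overlook in the output. One possible way to make it visible is a small helper like the sketch below; describe_char is a hypothetical name, and the output format is just one choice among many.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>

/* Hypothetical debugging helper: renders a character so that blanks and
   control characters are visible. buf must hold at least 16 bytes. */
const char *describe_char(int c, char *buf, size_t size) {
    assert(buf != NULL && size >= 16);
    unsigned char u = (unsigned char)c;
    if (isgraph(u))
        snprintf(buf, size, "'%c' (0x%02X)", u, (unsigned)u);  /* visible */
    else
        snprintf(buf, size, "<0x%02X>", (unsigned)u);          /* invisible */
    return buf;
}
```

Inside the lexer it could be called from the action of the . rule with yytext[0] as the argument, so that a stray space shows up as <0x20> instead of an apparently empty line.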

flex atoi(yytext) does not assign value to a variable

I am making a simple lexer using flex. I want to read the yytext value and save it as an integer in the variable t, but when I compile it I get the following error:
error: stray ‘\35’ in program
t = atoi(yytext);
Here is the code:
%{
#include "global.h"//contains stdlib
int t=0;
%}
DIGIT [0-9]
%%
{DIGIT} {
printf("found an integer, = %d \n", atoi( yytext));//this compiles without errors
t = atoi(yytext); //here I have error
//...rest of code
}
%%
main(){
yylex();
}
Typically, errors like:
error: stray ‘\35’ in program
are connected to the use of the wrong quotation marks.
Example:
`a` ‘a’ instead of 'a'
”a“ ... instead of "a"
See if this appears in your "global.h"
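
gcc reports the stray byte in octal, so '\35' is the byte 0x1D, a control character; a typographic quote encoded as UTF-8 would instead begin with 0xE2, which gcc prints as '\342'. Either way, the offending byte is often invisible in an editor. A minimal sketch of a check that locates it, assuming the source text has already been read into memory (find_stray_byte is a hypothetical helper, not part of any toolchain):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns the offset of the first byte that is neither
   printable ASCII nor tab, newline or carriage return; -1 if there is none. */
long find_stray_byte(const char *src, size_t len) {
    assert(src != NULL);
    for (size_t i = 0; i < len; i++) {
        unsigned char c = (unsigned char)src[i];
        int ok = (c >= 0x20 && c <= 0x7E) ||
                 c == '\t' || c == '\n' || c == '\r';
        if (!ok) return (long)i;
    }
    return -1;
}
```

Running it over the .l file and over global.h should point at the exact column to retype by hand.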

Reading new line giving syntax error in LEX YACC

I am trying to parse some code, and for that I have written the LEX and YACC files given below. The first line is read correctly, but after that I get a syntax error and the next line is not read. Should I modify the input and unput functions? I am reading from a file and writing my output to a file. I have just started using LEX and YACC and need some ideas.
input file
b_7 = _6 + b_3;
a_8 = b_7 - c_5;
lex file
%{
/*
parser for ssa;
*/
#include<stdio.h>
#include<stdlib.h>
#include"y.tab.h"
%}
%%
[\t]+ ;
\n ;
[if]+ printf("first input\n");
[else]+ return(op);
[=]+ return(equal);
[+]+ return(op);
[*]+ return(op);
[-]+ return(op);
[\<][b][b][ ]+[1-9][\>] {return(bblock);}
([[_][a-z]])|([a-z][_][0-9]+)|([0-9]+) {return(var);}
. ;
%%
yacc file
%{
/* lexer for ssa gramer to use for recognizing operations*/
#include<stdio.h>
char add_graph(char,char,...);
%}
%token opif opelse equal op bblock var
%%
sentence: var equal var op var { add_graph($1,$2,$3,$4,$5);}
;
%%
extern FILE *yyin;
main(argc,argv)
int argc;
char **argv;
{
if(argc > 1) {
FILE *file;
file=fopen(argv[1],"r");
if(file==NULL) {
fprintf(stderr,"couldnot open%s\n",argv[0]);
exit(1);
}
yyin=file;
}
do
{
yyparse();
}while (!feof(yyin));
fclose(yyin);
}
char add_graph(something)
{
.....
.....
}
yyerror(s)
char *s;
{
fprintf(stderr,"%s there is error\n",s);
}
yywrap()
{
printf("the output");
}
Lots of problems here:
your grammar is expecting the token op, but your lexer will never produce it, instead producing opadd opmul etc
your example has ; at the end of lines, but neither your lexer nor parser deal with them. The default lexer action of copying to stdout is almost never what you want.
your yacc file tries to use \\ as some sort of comment marker, but yacc doesn't understand that. Some versions of yacc understand C++-style // as a comment, but not all
your grammar only allows for one sentence in the input
your sentence has a spurious op at the end (on the next line), which is not a separate sentence rule -- you need | to separate rules.
you attempt to loop if you haven't reached the eof when yyparse returns, but if there's an error, it's likely that the input will still have some cruft that will cause an immediate error, resulting in an error storm -- probably not what you want.
Your grammar only permits one sentence. So if there is any input after the first sentence, an error will be raised. You want to permit one or more sentences. Try this in your .y file:
%%
sentences : sentences sentence
| sentence
;
sentence : var equal var op var { add_graph($1,$2,$3,$4,$5);}
;
%%
David is correct, but one more modification needs to be made. Add this rule to the lex file:
";" ;
See if this helps. Let me know if I am wrong.

printout lexemes and tokens of a C code with using flex

I'm trying to print out lexemes and tokens using the lexical analyzer flex. The problem is that I can find the lexemes but can only print the tokens, not the lexemes. This is the simple code I am using:
%{
#include<stdio.h>
char RW[] = "RESERVE_WORD";
%}
int [i][n][t]
%%
int printf("%s --> %s\n", yylex(), RW);
.|\n { /* Ignore all other */}
%%
int main(int argc, char *argv[]) {
yyin = fopen(argv[1], "r");
yylex();
fclose(yyin);
return 0;
}
When I run the lexical analysis, the yylex() function returns "null", and the compiler says:
example5.l:8:1: warning: format ‘%s’ expects argument of type ‘char *’, but argument 2 has type ‘int’ [-Wformat].
I would be glad if you could help. Thanks in advance.
OK, I solved the problem. We should use the yytext variable, which holds the text of the most recently matched token as a string. In addition, the yylex() function returns either the value of the next token or a number <= 0 indicating EOF.

Resources