Parsing a statement and Adding Paranthesis to them - parsing

I am trying to parse logical BNF statements , and trying to apply paranthesis to them.
For example:
I am trying to parse a statement a=>b<=>c&d as ((a)=>(b))<=>((c)&(d)), and similar statements as well.
Problem Facing: Some of the statements are working fine, and while some are not. The example provided above is not working and the solution is printing as ((c)&(d))<=>((c)&(d)) The second expr seems to be overriding the first one.
Conditions that are working: While other simple examples like a<=>b , a|(b&c) are working fine.
I think I have made some basic error in my code, which I cannot figure out.
Here is my code
lex file
letters [a-zA-Z]
identifier {letters}+
operator (?:<=>|=>|\||&|!)
separator [\(\)]
%%
{identifier} {
yylval.s = strdup(yytext);
return IDENTIFIER; }
{operator} { return *yytext; }
{separator} { return *yytext; }
[\n] { return *yytext; }
%%
yacc file
%start program
%union {char* s;}
%type <s> program expr IDENTIFIER
%token IDENTIFIER
%left '<=>'
%left '=>'
%left '|'
%left '&'
%right '!'
%left '(' ')'
%%
program : expr '\n'
{
cout<<$$;
exit(0);
}
;
expr : IDENTIFIER {
cout<<" Atom ";
cout<<$1<<endl;
string s1 = string($1);
cout<<$$<<endl;
}
| expr '<=>' expr {
cout<<"Inside <=>\n";
string s1 = string($1);
string s2 = string($3);
string s3 = "(" + s1 +")" +"<=>"+"(" + s2 +")";
$$ = (char * )s3.c_str();
cout<<s3<<endl;
}
| expr '=>' expr {
cout<<"Inside =>\n";
string s1 = string($1);
string s2 = string($3);
string s3 = "(" + s1 +")" +"=>"+"(" + s2 +")";
$$ = (char *)s3.c_str();
cout<<$$<<endl;
}
| expr '|' expr {
cout<<"Inside |\n";
string s1 = string($1);
string s2 = string($3);
string s3 = "(" + s1 +")" +"|"+"(" + s2 +")";
$$ = (char *)s3.c_str();
cout<<$$<<endl;
}
| expr '&' expr {
cout<<"Inside &\n";
string s1 = string($1);
string s2 = string($3);
string s3 = "(" + s1 +")" +"&"+"(" + s2 +")";
$$ = (char *)s3.c_str();
cout<<$$<<endl;
}
| '!' expr {
cout<<"Inside !\n";
string s1 = string($2);
cout<<s1<<endl;
string s2 = "!" + s1;
$$ = (char *)s2.c_str();
cout<<$$<<endl;
}
| '(' expr ')' { $$ = $2; cout<<"INSIDE BRACKETS"; }
;
%%
Please let me know the mistake I have made.
Thank you

The basic problem you have is that you save the pointer returned by string::c_str() on the yacc value stack, but after the action finishes and the string object is destroyed, that pointer is no longer valid.
To fix this you need to either not use std::string at all, or change your %union to be { std::string *s; } (instead of char *). In either case you have issues with memory leaks. If you are using Linux, the former is pretty easy. Your actions would become something like:
| expr '<=>' expr {
cout<<"Inside <=>\n";
asprintf(&$$, "(%s)<=>(%s)", $1, $3);
cout<<$$<<endl;
free($1);
free($3);
}
for the latter, the action would look like:
| expr '<=>' expr {
cout<<"Inside <=>\n";
$$ = new string("(" + *$1 +")" +"<=>"+"(" + *$2 +")");
cout<<$$<<endl;
delete $1;
delete $3;
}

Related

Does YACC (LALR) parses left recursive grammars

I was doing a simple C parser project.
This problem occurred when I was writing the grammar for if-else construct.
the grammar that I have written is as following:
iexp: IF OP exp CP block eixp END{
printf("Valid if-else ladder\n");
};
eixp:
|ELSE iexp
|ELSE block;
exp: id
|NUMBER
|exp EQU exp
|exp LESS exp
|exp GRT exp
|OP exp CP;
block: statement
|iexp
|OCP statement CCP
|OCP iexp CCP;
statement: id ASSIGN NUMBER SEMICOL;
id: VAR;
where the lex part looks something like this
"if" {return IF;}
"else" {return ELSE;}
[0-9]+ {return NUMBER;}
">" {return GRT;}
"<" {return LESS;}
"==" {return EQU;}
"{" {return OCP;}
"}" {return CCP;}
"(" {return OP;}
")" {return CP;}
"$" {return END;}
";" {return SEMICOL;}
"=" {return ASSIGN;}
[a-zA-Z]+ {return VAR;}
. {;}
I am getting o/p as
yacc: 9 shift/reduce conflicts, 1 reduce/reduce conflict.
When I eliminate the left recursion on exp derivation the conflicts vanish but why it's so ?
the revised grammar after eliminating left recursion was :
exp: id
|NUMBER
|id EQU exp
|id LESS exp
|id GRT exp
|OP exp CP;
I was able to parse successfully the grammar for the evaluation of arithmetic expressions. Is it so that %right, %left made it successful
%token ID
%left '+' '-'
%left '*' '/'
%right NEGATIVE
%%
S:E {
printf("\nexpression : %s\nResult=%d\n", buf, $$);
buf[0] = '\0';
};
E: E '+' E {
printf("+");
($$ = $1 + $3);
} |
E '-' E {
printf("-");
($$ = $1 - $3);
} |
E '*' E {
printf("*");
($$ = $1 * $3);
} |
E '/' E {
printf("/");
($$ = $1 / $3);
} |
'(' E ')' {
($$ = $2);
} |
ID {
/*do nothing done by lex*/
};

Using bison rule inside another rule

Let's assume I have grammar like this:
expr : expr '-' expr { $$ = $1 - $3; }
| "Function" '(' expr ',' expr ')' { $$ = ($3 - $5) * 2; }
| NUMBER { $$ = $1; };
How can use rule
expr : expr '-' expr { $$ = $1 - $3; }
inside
expr : "Function" '(' expr ',' expr ')' { $$ = ($3 - $5) * 2; }
Because implementation of $1 - $3 is repeated? It would be much better if I can use already implemented subtraction from rule one and only add multiplication with 2. This is just the basic example, but I have very big grammar with lot of repeating calculations.

Grammar ambiguity: why? (problem is: "(a)" vs "(a-z)")

So I am trying to implement a pretty simple grammar for one-line statements:
# Grammar
c : Character c [a-z0-9-]
(v) : Vowel (= [a,e,u,i,o])
(c) : Consonant
(?) : Any character (incl. number)
(l) : Any alpha char (= [a-z])
(n) : Any integer (= [0-9])
(c1-c2) : Range from char c1 to char c2
(c1,c2,c3) : List including chars c1, c2 and c3
Examples:
h(v)(c)no(l)(l)jj-k(n)
h(v)(c)no(l)(l)(a)(a)(n)
h(e-g)allo
h(e,f,g)allo
h(x,y,z)uul
h(x,y,z)(x,y,z)(x,y,z)(x,y,z)uul
I am using the Happy parser generator (http://www.haskell.org/happy/) but for some reason there seems to be some ambiguity problem.
The error message is: "shift/reduce conflicts: 1"
I think the ambiguity is with these two lines:
| lBracket char rBracket { (\c -> case c of
'v' -> TVowel
'c' -> TConsonant
'l' -> TLetter
'n' -> TNumber) $2 }
| lBracket char hyphen char rBracket { TRange $2 $4 }
An example case is: "(a)" vs "(a-z)"
The lexer would give the following for the two cases:
(a) : [CLBracket, CChar 'a', CRBracket]
(a-z) : [CLBracket, CChar 'a', CHyphen, CChar 'z', CRBracket]
What I don't understand is how this can be ambiguous with an LL[2] parser.
In case it helps here is the entire Happy grammar definition:
{
module XHappyParser where
import Data.Char
import Prelude hiding (lex)
import XLexer
import XString
}
%name parse
%tokentype { Character }
%error { parseError }
%token
lBracket { CLBracket }
rBracket { CRBracket }
hyphen { CHyphen }
question { CQuestion }
comma { CComma }
char { CChar $$ }
%%
xstring : tokens { XString (reverse $1) }
tokens : token { [$1] }
| tokens token { $2 : $1 }
token : char { TLiteral $1 }
| hyphen { TLiteral '-' }
| lBracket char rBracket { (\c -> case c of
'v' -> TVowel
'c' -> TConsonant
'l' -> TLetter
'n' -> TNumber) $2 }
| lBracket question rBracket { TAny }
| lBracket char hyphen char rBracket { TRange $2 $4 }
| lBracket listitems rBracket { TList $2 }
listitems : char { [$1] }
| listitems comma char { $1 ++ [$3] }
{
parseError :: [Character] -> a
parseError _ = error "parse error"
}
Thank you!
Here's the ambiguity:
token : [...]
| lBracket char rBracket
| [...]
| lBracket listitems rBracket
listitems : char
| [...]
Your parser could accept (v) as both TString [TVowel] and TString [TList ['v']], not to mention the missing characters in that case expression.
One possible way of solving it would be to modify your grammar so lists are at least two items, or have some different notation for vowels, consonants, etc.
The problem seems to be:
| lBracket char rBracket
...
| lBracket listitems rBracket
or in cleaner syntax:
(c)
Can be a TVowel, TConsonant, TLetter, TNumber (as you know) or a singleton TList.
As the happy manual says, shift reduce usually isn't an issue. You can us precedence to force behavior/remove the warning if you'd like.

Lua long strings in fslex

I've been working on a Lua fslex lexer in my spare time, using the ocamllex manual as a reference.
I hit a few snags while trying to tokenize long strings correctly. "Long strings" are delimited by '[' ('=')* '[' and ']' ('=')* ']' tokens; the number of = signs must be the same.
In the first implementation, the lexer seemed to not recognize [[ patterns, producing two LBRACKET tokens despite the longest match rule, whereas [=[ and variations where recognized correctly. In addition, the regular expression failed to ensure that the correct closing token is used, stopping at the first ']' ('=')* ']' capture, no matter the actual long string "level". Also, fslex does not seem to support "as" constructs in regular expressions.
let lualongstring = '[' ('=')* '[' ( escapeseq | [^ '\\' '[' ] )* ']' ('=')* ']'
(* ... *)
| lualongstring { (* ... *) }
| '[' { LBRACKET }
| ']' { RBRACKET }
(* ... *)
I've been trying to solve the issue with another rule in the lexer:
rule tokenize = parse
(* ... *)
| '[' ('=')* '[' { longstring (getLongStringLevel(lexeme lexbuf)) lexbuf }
(* ... *)
and longstring level = parse
| ']' ('=')* ']' { (* check level, do something *) }
| _ { (* aggregate other chars *) }
(* or *)
| _ {
let c = lexbuf.LexerChar(0);
(* ... *)
}
But I'm stuck, for two reasons: first, I don't think I can "push", so to speak, a token to the next rule once I'm done reading the long string; second, I don't like the idea of reading char by char until the right closing token is found, making the current design useless.
How can I tokenize Lua long strings in fslex? Thanks for reading.
Apologies if I answer my own question, but I'd like to contribute with my own solution to the problem for future reference.
I am keeping state across lexer function calls with the LexBuffer<_>.BufferLocalStore property, which is simply a writeable IDictionary instance.
Note: long brackets are used both by long string and multiline comments. This is often an overlooked part of the Lua grammar.
let beginlongbracket = '[' ('=')* '['
let endlongbracket = ']' ('=')* ']'
rule tokenize = parse
| beginlongbracket
{ longstring (longBracketLevel(lexeme lexbuf)) lexbuf }
(* ... *)
and longstring level = parse
| endlongbracket
{ if longBracketLevel(lexeme lexbuf) = level then
LUASTRING(endLongString(lexbuf))
else
longstring level lexbuf
}
| _
{ toLongString lexbuf (lexeme lexbuf); longstring level lexbuf }
| eof
{ failwith "Unexpected end of file in string." }
Here are the functions I use to simplify storing data into the BufferLocalStore:
let longBracketLevel (str : string) =
str.Count(fun c -> c = '=')
let createLongStringStorage (lexbuf : LexBuffer<_>) =
let sb = new StringBuilder(1000)
lexbuf.BufferLocalStore.["longstring"] <- box sb
sb
let toLongString (lexbuf : LexBuffer<_>) (c : string) =
let hasString, sb = lexbuf.BufferLocalStore.TryGetValue("longstring")
let storage = if hasString then (sb :?> StringBuilder) else (createLongStringStorage lexbuf)
storage.Append(c.[0]) |> ignore
let endLongString (lexbuf : LexBuffer<_>) : string =
let hasString, sb = lexbuf.BufferLocalStore.TryGetValue("longstring")
let ret = if not hasString then "" else (sb :?> StringBuilder).ToString()
lexbuf.BufferLocalStore.Remove("longstring") |> ignore
ret
Perhaps it's not very functional, but it seems to be getting the job done.
use the tokenize rule until the beginning of a long bracket is found
switch to the longstring rule and loop until a closing long bracket of the same level is found
store every lexeme that does not match a closing long bracket of the same level into a StringBuilder, which is in turn stored into the LexBuffer BufferLocalStore.
once the longstring is over, clear the BufferLocalStore.
Edit: You can find the project at http://ironlua.codeplex.com. Lexing and parsing should be okay. I am planning on using the DLR. Comments and constructive criticism welcome.

Making YACC output an AST (token tree)

Is it possible to make YACC (or I'm my case MPPG) output an Abstract Syntax Tree (AST).
All the stuff I'm reading suggests its simple to make YACC do this, but I'm struggling to see how you know when to move up a node in the tree as your building it.
Expanding on Hao's point and from the manual, you want to do something like the following:
Assuming you have your abstract syntax tree with function node which creates an object in the tree:
expr : expr '+' expr
{
$$ = node( '+', $1, $3 );
}
This code translates to "When parsing expression with a plus, take the left and right descendants $1/$3 and use them as arguments to node. Save the output to $$ (the return value) of the expression.
$$ (from the manual):
To return a value, the action normally
sets the pseudovariable ``$$'' to some
value.
Have you looked at the manual (search for "parse tree" to find the spot)? It suggests putting node creation in an action with your left and right descendants being $1 and $3, or whatever they may be. In this case, yacc would be moving up the tree on your behalf rather than your doing it manually.
The other answers propose to modify the grammar, this isn't doable when playing with a C++ grammer (several hundred of rules..)
Fortunately, we can do it automatically, by redefining the debug macros.
In this code, we are redefining YY_SYMBOL_PRINT actived with YYDEBUG :
%{
typedef struct tree_t {
struct tree_t **links;
int nb_links;
char* type; // the grammar rule
};
#define YYDEBUG 1
//int yydebug = 1;
tree_t *_C_treeRoot;
%}
%union tree_t
%start program
%token IDENTIFIER
%token CONSTANT
%left '+' '-'
%left '*' '/'
%right '^'
%%
progam: exprs { _C_treeRoot = &$1.t; }
|
| hack
;
exprs:
expr ';'
| exprs expr ';'
;
number:
IDENTIFIER
| '-' IDENTIFIER
| CONSTANT
| '-' CONSTANT
;
expr:
number
| '(' expr ')'
| expr '+' expr
| expr '-' expr
| expr '*' expr
| expr '/' expr
| expr '^' expr
;
hack:
{
// called at each reduction in YYDEBUG mode
#undef YY_SYMBOL_PRINT
#define YY_SYMBOL_PRINT(A,B,C,D) \
do { \
int n = yyr2[yyn]; \
int i; \
yyval.t.nb_links = n; \
yyval.t.links = malloc(sizeof *yyval.t.links * yyval.t.nb_links);\
yyval.t.str = NULL; \
yyval.t.type = yytname[yyr1[yyn]]; \
for (i = 0; i < n; i++) { \
yyval.t.links[i] = malloc(sizeof (YYSTYPE)); \
memcpy(yyval.t.links[i], &yyvsp[(i + 1) - n], sizeof(YYSTYPE)); \
} \
} while (0)
}
;
%%
#include "lexer.c"
int yyerror(char *s) {
printf("ERROR : %s [ligne %d]\n",s, num_ligne);
return 0;
}
int doParse(char *buffer)
{
mon_yybuffer = buffer;
tmp_buffer_ptr = buffer;
tree_t *_C_treeRoot = NULL;
num_ligne = 1;
mon_yyptr = 0;
int ret = !yyparse();
/////////****
here access and print the tree from _C_treeRoot
***///////////
}
char *tokenStrings[300] = {NULL};
char *charTokenStrings[512];
void initYaccTokenStrings()
{
int k;
for (k = 0; k < 256; k++)
{
charTokenStrings[2*k] = (char)k;
charTokenStrings[2*k+1] = 0;
tokenStrings[k] = &charTokenStrings[2*k];
}
tokenStrings[CONSTANT] = "CONSTANT";
tokenStrings[IDENTIFIER] = "IDENTIFIER";
extern char space_string[256];
for (k = 0; k < 256; k++)
{
space_string[k] = ' ';
}
}
the leafs are created just before the RETURN in the FLEX lexer

Resources