Whitespaces and line in BYacc

Hi guys.
What do you think about these two rules for parsing whitespace and recognizing the different lines of the file I have to translate?
1.
line: NEW_LINE {$$ = System.lineSeparator();}
| line NEW_LINE {$$ = $1 + System.lineSeparator();}
where:
NEW_LINE = \r\n|\n|\r in Jflex
2.
whitespace: WHITESPACE {$$ = " ";}
| whitespace WHITESPACE {$$ = $1 + " ";}
where:
WHITESPACE = [ \t] in Jflex
Are they correct? Thanks to all.

line: NEW_LINE {$$ = System.lineSeparator();}
| line NEW_LINE {$$ = $1 + System.lineSeparator();}
where:
NEW_LINE = \r\n|\n|\r in Jflex
If you don't really care about multiple newlines, as this grammar suggests, collect them all in the lexer:
NEW_LINE = (\r\n|\n|\r)+ return NEW_LINE;
and not in the parser:
line : NEW_LINE { $$ = System.lineSeparator(); }
Whitespace normally includes line terminators (unless they are significant in your grammar, which they seem to be here), and it should also include form feeds:
WHITESPACE [ \t\f]
and again it is much more efficient to collect it all in the lexer rather than the parser:
WHITESPACE [ \t\f]+
whitespace: WHITESPACE { $$ = strdup(yytext); }
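Note that in a C parser this has to be free()-d whenever it reappears as $1, $2, etc., and isn't copied directly to $$. With BYacc/J and JFlex, which generate Java, the token text comes from yytext() and is garbage-collected, so there is no strdup() or free() to worry about.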
But then whitespace usually doesn't appear in the grammar at all; it is just ignored by the lexer:
WHITESPACE [ \t\f]+ ;
unless again you really really need it in the grammar. This is pretty unlikely. You should just be able to work with the non-whitespace tokens the lexer returns to you.
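
To illustrate the idea of doing this work in the lexer, here is a rough Python sketch (not JFlex) of a tokenizer that drops runs of spaces, tabs and form feeds and collapses a run of line terminators into one NEW_LINE token. The WORD token and the tokenize function are made up for the illustration; a real lexer would return whatever tokens the grammar needs.

```python
import re

# One alternative per lexer rule. NEW_LINE and WHITESPACE mirror the rules
# above; WORD is an assumed stand-in for the real non-whitespace tokens.
TOKEN_RE = re.compile(
    r"(?P<NEW_LINE>(?:\r\n|\n|\r)+)"
    r"|(?P<WHITESPACE>[ \t\f]+)"
    r"|(?P<WORD>[^ \t\f\r\n]+)"
)


def tokenize(text):
    """Return (kind, text) pairs; whitespace never reaches the parser."""
    tokens = []
    for match in TOKEN_RE.finditer(text):
        kind = match.lastgroup
        if kind == "WHITESPACE":
            continue  # ignored in the lexer, like the empty action above
        if kind == "NEW_LINE":
            # A whole run of line terminators becomes a single token.
            tokens.append(("NEW_LINE", "\n"))
        else:
            tokens.append((kind, match.group()))
    return tokens


print(tokenize("a  b\r\n\n\nc\t"))
```

With this shape the parser only ever sees one NEW_LINE between lines, so the recursive line rule is not needed.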

Related

Why does my "equation" grammar break the parser?

Currently, my parser file looks like this:
%{
#include <stdio.h>
#include <stdlib.h> /* for exit() */
#include <math.h>
int yylex();
void yyerror (const char *s);
%}
%union {
long num;
char* str;
}
%start line
%token print
%token exit_cmd
%token <str> identifier
%token <str> string
%token <num> number
%%
line: assignment {;}
| exit_stmt {;}
| print_stmt {;}
| line assignment {;}
| line exit_stmt {;}
| line print_stmt {;}
;
assignment: identifier '=' number {printf("Assigning var %s to value %ld\n", $1, $3);}
| identifier '=' string {printf("Assigning var %s to value %s\n", $1, $3);}
;
exit_stmt: exit_cmd {exit(0);}
;
print_stmt: print print_expr {;}
;
print_expr: string {printf("%s\n", $1);}
| number {printf("%ld\n", $1);}
;
%%
int main(void)
{
return yyparse();
}
void yyerror (const char *s) {fprintf(stderr, "%s\n", s);}
Giving the input: myvar = 3 gives the output Assigning var myvar = 3 to value 3, as expected. However, modifying the code to include an equation grammar rule breaks such assignments.
Equation grammar:
equation: number '+' number {$$ = $1 + $3;}
| number '-' number {$$ = $1 - $3;}
| number '*' number {$$ = $1 * $3;}
| number '/' number {$$ = $1 / $3;}
| number '^' number {$$ = pow($1, $3);}
| equation '+' number {$$ = $1 + $3;}
| equation '-' number {$$ = $1 - $3;}
| equation '*' number {$$ = $1 * $3;}
| equation '/' number {$$ = $1 / $3;}
| equation '^' number {$$ = pow($1, $3);}
;
Modifying the assignment grammar accordingly as well:
assignment: identifier '=' number {printf("Assigning var %s to value %ld\n", $1, $3);}
| identifier '=' equation {printf("Assigning var %s to value %ld\n", $1, $3);}
| identifier '=' string {printf("Assigning var %s to value %s\n", $1, $3);}
;
And giving the equation rule the type of num in the parser's first section:
%type <num> equation
Giving the same input: var = 3 freezes the program.
I know this is a long question but can anyone please explain what is going on here?
Also, here's the lexer in case you wanna take a look.

It doesn't "freeze the program". The program is just waiting for more input.
In your first grammar, var = 3 is a complete statement which cannot be extended. But in your second grammar, it could be the beginning of var = 3 + 4, for example. So the parser needs to read another token after the 3. If you want input lines to be terminated by a newline, you will need to modify your scanner to send a newline character as a token, and then modify your grammar to expect a newline token at the end of every statement. If you intend to allow statements to be spread out over several lines, you'll need to be aware of that fact while typing input.
There are several problems with your grammar, and also with your parser. (Flex doesn't implement non-greedy repetition, for example.) Please look at the examples in the bison and flex manuals.
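
The lookahead point can be sketched outside of bison. The following Python function is only an illustration, with made-up names (parse_statement, NEWLINE) and a cut-down grammar that knows just + and -: it cannot finish a statement until it has seen the token after the last number, and an explicit newline token is what lets it stop.

```python
NEWLINE = "\n"  # assumed token the scanner would send at the end of a line


def parse_statement(tokens):
    """Parse: identifier '=' number (('+' | '-') number)* NEWLINE."""
    if len(tokens) < 3 or tokens[1] != "=":
        raise SyntaxError("expected: identifier '=' number")
    name, value = tokens[0], int(tokens[2])
    pos = 3
    while True:
        if pos >= len(tokens):
            # No lookahead yet: "var = 3" might still become "var = 3 + 4",
            # so a real parser would block here and wait for more input.
            raise EOFError("need another token to decide")
        lookahead = tokens[pos]
        if lookahead == NEWLINE:
            return name, value  # the terminator makes the statement complete
        if lookahead not in ("+", "-"):
            raise SyntaxError(f"unexpected token {lookahead!r}")
        if pos + 1 >= len(tokens):
            raise EOFError("need another token to decide")
        operand = int(tokens[pos + 1])
        value = value + operand if lookahead == "+" else value - operand
        pos += 2


print(parse_statement(["var", "=", "3", NEWLINE]))
print(parse_statement(["var", "=", "3", "+", "4", NEWLINE]))
```

Calling it with just ["var", "=", "3"] raises EOFError, which corresponds to the program sitting there waiting for input.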

Odd bison parsing

To preface this, I understand that the formatting of the parse structure is weird; the teacher wanted it to be ~roughly~ in this format.
I'm making a simple "calculator" parser for an assignment using flex and bison, but I am getting odd or unusual output for the answer when using modulus. It seems to work fine for all other operations.
Input: "10 % 5"
Output: " % 10"
Input: "101 % 12"
Output: " % 101"
Input: "2^(-1 + 15/5) - 3*(4-1) + (-6)"
Output: "-11" //Correct
Relevant section of bison.y
command : pexpri {printf("%d\n", $1); return 0;}
;
pexpri : '-' expri '+' termi {$$ = -$2 + $4;} /* Super glued on unary, also reduce conflict, TODO: find bug */
| '-' expri '-' termi {$$ = -$2 - $4;}
| '-' expri {$$ = -$2;}
| expri {$$ = $1;}
;
expri : expri '+' termi {$$ = $1 + $3;} /* Addition subtraction level operations*/
| expri '-' termi {$$ = $1 - $3;}
| termi {$$ = $1;}
;
termi : termi '*' factori {$$ = $1 * $3;} /* Multiplication division level operations*/
| termi '/' factori {$$ = $1 / $3;}
| termi '%' factori {$$ = $1 % $3;}
| factori {$$ = $1;}
;
factori : factori '^' parti {$$ = pow($1, $3);} /* Exponentiation level operations */
| parti {$$ = $1;}
;
parti : '(' pexpri ')' {$$ = $2;} /* Parentheses handling or terminal, also adds even more reduction errors.... */
| INTEGER
;
Relevant section tokenizer.l
0 { /* To avoid useless leading zeros. */
yylval.iVal = atoi(yytext);
return INTEGER;
}
[1-9][0-9]* {
yylval.iVal = atoi(yytext);
return INTEGER;
}
[-()^\+\*/] {return *yytext;}
The main function is essentially just a wrapper for yyparse.
I don't understand how or why it is printing the modulus symbol in the output because the ONLY print in the entire code is in the command section. I understand that the code isn't the best (in fact, it is awful), but any insight is much appreciated.
Also, if anybody can help me figure out how to manage unary negation in a more elegant way (hopefully without spoiling much), that would also be super appreciated. (I can't just use %precedence or %left.) The way I have it currently set up is ambiguous and is causing reduction errors.

If you look closely at
[-()^\+\*/] {return *yytext;}
you'll notice that it is not going to match %. The most likely consequence is that (f)lex's default fallback rule will apply. That rule matches any single character and uses ECHO to copy the matched token to the output stream.
It looks to me like whitespace characters might also be falling through to the default rule. They should be ignored explicitly.
By the way, it is not necessary to backslash-escape regular expression operators inside character classes, since they have no special meaning in that context. Hence a correct and easier to read rule would be
[-+*/%^()] {return *yytext;}
However, I strongly recommend using a fallback rule instead of listing all the possible single-character tokens. If an invalid single-character token is handled by a fallback rule, then the parser will respond by flagging an error.
[[:space:]]+ { /* Ignore whitespace*/ }
0|[1-9][0-9]* { yylval.iVal = atoi(yytext); return INTEGER; }
. { return *yytext; /* Fallback rule */ }
The default fallback rule is rarely useful in parsing, and I find it useful to add
%option nodefault
to my flex prolog, which will cause flex to produce an error message if a fallback rule is required.
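
To make the difference concrete, here is a rough Python imitation of the two scanners (it is not flex, and the scan function and its parameters are invented for the sketch). Without a fallback, characters that match no rule are echoed instead of being returned as tokens; with the fallback, whitespace is skipped and every other character reaches the parser.

```python
import re

INTEGER = re.compile(r"0|[1-9][0-9]*")


def scan(text, single_char_tokens, use_fallback=False):
    """Return (tokens, echoed): what the parser sees and what gets printed."""
    tokens, echoed, pos = [], [], 0
    while pos < len(text):
        match = INTEGER.match(text, pos)
        if match:
            tokens.append(("INTEGER", int(match.group())))
            pos = match.end()
            continue
        ch = text[pos]
        pos += 1
        if use_fallback:
            if not ch.isspace():  # whitespace is ignored explicitly
                tokens.append((ch, ch))  # any other character is a token
        elif ch in single_char_tokens:
            tokens.append((ch, ch))
        else:
            echoed.append(ch)  # imitates the default rule's ECHO
    return tokens, "".join(echoed)


# Character class from the question: '%' and ' ' are not in it.
print(scan("10 % 5", "-()^+*/"))
# Explicit fallback rule: '%' is passed on to the parser.
print(scan("10 % 5", "", use_fallback=True))
```

In the first call the parser never sees a % token at all, and the string " % " is what ends up on the output stream.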

How does a parser resolve a shift/reduce conflict?

I have a grammar for arithmetic expressions which evaluates a number of expressions (one per line) in a text file. When compiling it with YACC I get a message about 2 shift/reduce conflicts, but my calculations come out right. If the parser gives the proper output, how does it resolve the shift/reduce conflicts? And in my case, is there any way to resolve them in the YACC grammar?
YACC GRAMMAR
Calc : Expr {printf(" = %d\n",$1);}
| Calc Expr {printf(" = %d\n",$2);}
| error {yyerror("\nBad Expression\n ");}
;
Expr : Term { $$ = $1; }
| Expr '+' Term { $$ = $1 + $3; }
| Expr '-' Term { $$ = $1 - $3; }
;
Term : Fact { $$ = $1; }
| Term '*' Fact { $$ = $1 * $3; }
| Term '/' Fact { if($3==0){
yyerror("Divide by Zero Encountered.");
break;}
else
$$ = $1 / $3;
}
;
Fact : Prim { $$ = $1; }
| '-' Prim { $$ = -$2; }
;
Prim : '(' Expr ')' { $$ = $2; }
| Id { $$ = $1; }
;
Id :NUM { $$ = yylval; }
;
What changes should I make to remove these conflicts from my grammar?

Bison/yacc resolves shift-reduce conflicts by choosing to shift. This is explained in the bison manual in the section on Shift-Reduce conflicts.
Your problem is that your input is just a series of Exprs, run together without any delimiter between them. That means that:
4 - 2
could be one expression (4-2) or it could be two expressions (4, -2). Since bison-generated parsers always prefer to shift, the parser will choose to parse it as one expression, even if it were typed on two lines:
4
-2
If you want to allow users to type their expressions like that, without any separator, then you could either live with the conflict (since it is relatively benign) or you could codify it into your grammar, but that's quite a bit more work. To put it into the grammar, you need to define two different types of Expr: one (which is the one you use at the top level) cannot start with an unary minus, and the other one (which you can use anywhere else) is allowed to start with a unary minus.
I suspect that what you really want to do is use newlines or some other kind of expression separator. That's as simple as passing the newline through to your parser and changing Calc to Calc: | Calc '\n' | Calc Expr '\n'.
I'm sure that this appears somewhere else on SO, but I can't find it. So here is how you disallow the use of unary minus at the beginning of an expression, so that you can run expressions together without delimiters. The non-terminals starting n_ cannot start with a unary minus:
input: %empty | input n_expr { /* print $2 */ }
expr: term | expr '+' term | expr '-' term
n_expr: n_term | n_expr '+' term | n_expr '-' term
term: factor | term '*' factor | term '/' factor
n_term: value | n_term '*' factor | n_term '/' factor
factor: value | '-' factor
value: NUM | '(' expr ')'
That parses the same language as your grammar, but without generating the shift-reduce conflict. Since it parses the same language, the input
4
-2
will still be parsed as a single expression; to get the expected result you would need to type
4
(-2)
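
The "always prefer to shift" behaviour can be imitated with a small greedy recursive-descent evaluator in Python. This is only a sketch of the effect, not how bison works internally; the function names are made up, and division truncates toward zero to resemble C integer division.

```python
import re

TOKEN_RE = re.compile(r"\s*(\d+|[-+*/()])")


def tokenize(text):
    tokens, pos, text = [], 0, text.rstrip()
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise SyntaxError(f"bad character at offset {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def evaluate_all(text):
    """Evaluate a run of expressions written with no separator between them."""
    tokens, pos = tokenize(text), 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        pos += 1
        return tokens[pos - 1]

    def prim():
        if peek() is None:
            raise SyntaxError("unexpected end of input")
        tok = advance()
        if tok == "(":
            value = expr()
            if peek() != ")":
                raise SyntaxError("expected ')'")
            advance()
            return value
        if tok.isdigit():
            return int(tok)
        raise SyntaxError(f"unexpected token {tok!r}")

    def fact():
        if peek() == "-":
            advance()
            return -prim()
        return prim()

    def term():
        value = fact()
        while peek() in ("*", "/"):
            op, rhs = advance(), fact()
            value = value * rhs if op == "*" else int(value / rhs)
        return value

    def expr():
        value = term()
        # Greedy, like preferring the shift: a '-' after a complete
        # expression always continues it instead of starting a new one.
        while peek() in ("+", "-"):
            op, rhs = advance(), term()
            value = value + rhs if op == "+" else value - rhs
        return value

    results = []
    while pos < len(tokens):
        results.append(expr())
    return results


print(evaluate_all("4\n-2"))    # one expression: 4 - 2
print(evaluate_all("4\n(-2)"))  # two expressions: 4 and -2
```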

Exponent operator does not work when no space is added. What's wrong with my grammar?

I am trying to write an expression evaluator in which I want to add the underscore _ as a reserved word denoting a certain constant value.
Here is my grammar, it successfully parses 5 ^ _ but it fails to parse _^ 5 (without space). It only acts up that way for ^ operator.
COMPILER Formula
CHARACTERS
digit = '0'..'9'.
letter = 'A'..'z'.
TOKENS
number = digit {digit}.
identifier = letter {letter|digit}.
self = '_'.
IGNORE '\r' + '\n'
PRODUCTIONS
Formula = Term{ ( '+' | '-') Term}.
Term = Factor {( '*' | "/" |'%' | '^' ) Factor}.
Factor = number | Self.
Self = self.
END Formula.
What am I missing? I am using Coco/R compiler generator.

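Your current definition of letter causes this issue, because the range 'A'..'z' includes the _ and ^ characters (along with [, \, ] and `), so _^ is scanned as a single identifier. Defining letter = 'A'..'Z' + 'a'..'z'. instead avoids this.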
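
If you want to check which characters sneak into such a range, a few lines of Python will list them (the helper name is made up for this example):

```python
def non_letters_in_range(first, last):
    """List the characters between first and last (inclusive) that are not letters."""
    return [
        chr(code)
        for code in range(ord(first), ord(last) + 1)
        if not chr(code).isalpha()
    ]


print(non_letters_in_range("A", "z"))  # ['[', '\\', ']', '^', '_', '`']
```

These are the six ASCII characters that sit between Z and a.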

You can rewrite the Formula and Term rules like this:
Formula = Formula ( '+' | '-') Term | Term
Term = Term ( '*' | "/" |'%' | '^' ) Factor | Factor
e.g. https://metacpan.org/pod/distribution/Marpa-R2/pod/Marpa_R2.pod#Synopsis

Simple Instaparse parser with sub-expression syntax

I'm using Instaparse to parse expressions like:
$(foo bar baz $(frob))
into something like:
[:expr "foo" "bar" "baz" [:expr "frob"]]
I've almost got it, but having trouble with ambiguity. Here's a simplified version of my grammar that repros, attempting to rely on negative lookahead.
(def simple
(insta/parser
"expr = <dollar> <lparen> word (<space> word)* <rparen>
<word> = !(dollar lparen) #'.+' !(rparen)
<space> = #'\\s+'
<dollar> = <'$'>
<lparen> = <'('>
<rparen> = <')'>"))
(simple "$(foo bar)")
which errors:
Parse error at line 1, column 11:
$(foo bar)
^
Expected one of:
")"
#"\s+"
Here I've said a word can be any char, in order to support expressions like:
$(foo () `bar` b-a-z)
etc. Note a word can contain () but it cannot contain $(). Not sure how to express this in the grammar. Seems the problem is <word> is too greedy, consuming the last ) instead of letting expr have it.
Update removed whitespace from word:
(def simple2
(insta/parser
"expr = <dollar> <lparen> word (<space> word)* <rparen>
<word> = !(dollar lparen) #'[^ ]+' !(rparen)
<space> = #'\\s+'
<dollar> = <'$'>
<lparen> = <'('>
<rparen> = <')'>"))
(simple2 "$(foo bar)")
; Parse error at line 1, column 11:
; $(foo bar)
; ^
; Expected one of:
; ")"
; #"\s+"
(simple2 "$(foo () bar)")
; Parse error at line 1, column 14:
; $(foo () bar)
; ^
; Expected one of:
; ")"
; #"\s+"
Update 2 more test cases
(simple2 "$(foo bar ())")
(simple2 "$((foo bar baz))")
Update 3 full working parser
For anyone curious, the full working parser, which was outside the scope of this question is:
(def parse
"expr - the top-level expression made up of cmds and sub-exprs. When multiple
cmds are present, it implies they should be successively piped.
cmd - a single command consisting of words.
sub-expr - a backticked or $(..)-style sub-expression to be evaluated inline.
parened - a grouping of words wrapped in parenthesis, explicitly tokenized to
allow parenthesis in cmds and disambiguate between sub-expression
syntax."
(insta/parser
"expr = cmd (<space> <pipe> <space> cmd)*
cmd = words
<sub-expr> = <backtick> expr <backtick> | nestable-sub-expr
<nestable-sub-expr> = <dollar> <lparen> expr <rparen>
words = word (<space>* word)*
<word> = sub-expr | parened | word-chars
<word-chars> = #'[^ `$()|]+'
parened = lparen words rparen
<space> = #'[ ]+'
<pipe> = #'[|]'
<dollar> = <'$'>
<lparen> = '('
<rparen> = ')'
<backtick> = <'`'>"))
Example usage:
(parse "foo bar (qux) $(clj (map (partial * $(js 45 * 2)) (range 10))) `frob`")
Parses to:
[:expr [:cmd [:words "foo" "bar" [:parened "(" [:words "qux"] ")"] [:expr [:cmd [:words "clj" [:parened "(" [:words "map" [:parened "(" [:words "partial" "*" [:expr [:cmd [:words "js" "45" "*" "2"]]]] ")"] [:parened "(" [:words "range" "10"] ")"]] ")"]]]] [:expr [:cmd [:words "frob"]]]]]]
This is a parser for a chatbot I wrote, yetibot. It replaces the previous mess of regex-based, by-hand parsing.

I don't really know Instaparse, so I just read enough documentation to give me a false sense of security. I also didn't test, and I don't really know what your requirements are.
In particular, I don't know:
1) Whether $() can nest (your grammar makes that impossible, I think, but it seems odd to me)
2) Whether () can contain whitespace without being parsed as more than one word
3) Whether () can contain $()
You'll need to be clear on things like this in order to write the grammar (or, as it happens, to ask for advice).
Update: Revised the grammar based on comments. I removed the productions for $ ( and ) because they seemed unnecessary, and this way the angle-brackets feel easier to deal with.
The following is based on answering the above questions "yes, no, yes" and some random assumptions about regex format. (I'm not totally clear on how angle-brackets work, but I don't think it will be easy to make parentheses output the way you want; I settled for just outputting them as single elements. If I figure out something, I'll edit it.)
<sequence> = element (<space> element)*
<element> = expr | paren_sequence | word
expr = <'$'> <'('> sequence <')'>
<word> = !('$'? '(') #'([^ $()]|\$[^(])+'
<paren_sequence> = '(' sequence ')'
<space> = #'\\s+'
Hope that helps a bit.

Well there are two changes you have to make in order to get both of your examples to work.
1) Add Negative Lookbehind
First, you will need a negative lookbehind in the regex for <word>. That way it will drop all the occurrences of ) as the last character:
<word> = !(dollar lparen) #'[^ ]+(?<!\\))'
So this will fix your first test case:
(simple2 "$(foo bar)")
=> [:expr "foo" "bar"]
2) Add grammar for the last word
Now if you run your second test case it will fail:
(simple2 "$(foo () bar)")
=> Parse error at line 1, column 8:
$(foo () bar)
^
Expected one of:
")" (followed by end-of-string)
#"\s+"
This fails because we have told our grammar to drop the last ) in all instances of <word>. We now have to tell our grammar how to differentiate between the last instance of <word> and other instances. We'll do this by adding a specific <lastword> grammar, and make all other instances of <word> optional. The full grammar would look like this:
(def simple2
(insta/parser
"expr = <dollar> <lparen> word* lastword <rparen>
<word> = !(dollar lparen) #'[^ ]+' <space>+
<lastword> = !(dollar lparen) #'[^ ]+(?<!\\))'
<space> = #'\\s+'
<dollar> = <'$'>
<lparen> = <'('>
<rparen> = <')'>"))
And your two test cases should work fine:
(simple2 "$(foo bar)")
=> [:expr "foo" "bar"]
(simple2 "$(foo () bar)")
=> [:expr "foo" "()" "bar"]
Hope this helps.
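
The effect of the negative lookbehind can be tried out with plain regular expressions. Below is a Python sketch that approximates the grammar above with a single regex; it is an assumption-laden imitation rather than Instaparse (the !(dollar lparen) lookahead is left out, and parse_expr is a made-up name).

```python
import re

WORD = r"[^ ]+"
LAST_WORD = r"[^ ]+(?<!\))"  # same lookbehind: must not end with ')'

# "$(" , any number of words each followed by whitespace, the last word, ")"
EXPR = re.compile(r"\$\(((?:" + WORD + r"\s+)*)(" + LAST_WORD + r")\)")


def parse_expr(text):
    """Return the list of words, or None if the text does not match."""
    match = EXPR.fullmatch(text)
    if match is None:
        return None
    return match.group(1).split() + [match.group(2)]


print(parse_expr("$(foo bar)"))     # ['foo', 'bar']
print(parse_expr("$(foo () bar)"))  # ['foo', '()', 'bar']
```

The lookbehind forces the regex engine to back off from the trailing ), which leaves that character available for the closing delimiter.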
