PEG Grammar Parsing, error when expression starts with negative number

PEG Grammar Parsing, error when expression starts with negative number - parsing

I have the following PEG grammar defined:
Program = _{ SOI ~ Expr ~ EOF }
Expr = { UnaryExpr | BinaryExpr }
Term = _{Int | "(" ~ Expr ~ ")" }
UnaryExpr = { Operator ~ Term }
BinaryExpr = { Term ~ (Operator ~ Term)* }
Operator = { "+" | "-" | "*" | "^" }
Int = #{ Operator? ~ ASCII_DIGIT+ }
WHITESPACE = _{ " " | "\t" }
EOF = _{ EOI | ";" }
And the following expressions are all parsed correctly:
1 + 2
1 - 2
1 + -2
1 - -2
1
+1
-1
But any expression that begins with a negative number errors
-1 + 2
errors with
--> 1:4
|
1 | -1 + 2
| ^---
|
= expected EOI
What I expect (what I would like) is for -1 + 2 to be treated the same as 1 + -2, that is a Binary expression that is made up of two Unary Expressions.
I have toyed around with a lot of variations with no success. And, I'm open to using an entirely different paradigm if I need to, but I'd really like to keep the UnaryExpression idea since I've already built my parser around it.
I'm new to PEG, so I'd appreciate any help.
For what its worth, I'm using Rust v1.59 and https://pest.rs/ to both parse and test my expressions.

You have a small error in the Expr logic. The first part before the | takes precedence if both match.
And -1 is a valid UnaryExpr so the program as a whole is expected to match SOI ~ UnaryExpr ~ EOF in this case. But there is additional data (+ 2) which leads to this error.
If you reverse the possibilities of Expr so that Expr = { BinaryExpr | UnaryExpr } the example works. The reason for that is that first BinaryExpr will be checked and only if that fails UnaryExpr.

Related

How do I preform 'lookahead' in an OCaml lexer / how do I rollback a lexeme?

Well, I'm writing my first parser, in OCaml, and I immediately somehow managed to make one with an infinite-loop.
Of particular note, I'm trying to lex identifiers according to the rules of the Scheme specification (I have no idea what I'm doing, obviously) — and there's some language in there about identifiers requiring that they are followed by a delimiter. My approach, right now, is to have a delimited_identifier regex that includes one of the delimiter characters, that should not be consumed by the main lexer … and then once that's been matched, the reading of that lexeme is reverted by Sedlexing.rollback (well, my wrapper thereof), before being passed to a sublexer that only eats the actual identifier, hopefully leaving the delimiter in the buffer to be eaten as a different lexeme by the parent lexer.
I'm using Menhir and Sedlex, mostly synthesizing the examples from #smolkaj's ocaml-parsing example-repo and RWO's parsing chapter; here's the simplest reduction of my current parser and lexer:
%token LPAR RPAR LVEC APOS TICK COMMA COMMA_AT DQUO SEMI EOF
%token <string> IDENTIFIER
(* %token <bool> BOOL *)
(* %token <int> NUM10 *)
(* %token <string> STREL *)
%start <Parser.AST.t> program
%%
program:
| p = list(expression); EOF { p }
;
expression:
| i = IDENTIFIER { Parser.AST.Atom i }
%%
… and …
(** Regular expressions *)
let newline = [%sedlex.regexp? '\r' | '\n' | "\r\n" ]
let whitespace = [%sedlex.regexp? ' ' | newline ]
let delimiter = [%sedlex.regexp? eof | whitespace | '(' | ')' | '"' | ';' ]
let digit = [%sedlex.regexp? '0'..'9']
let letter = [%sedlex.regexp? 'A'..'Z' | 'a'..'z']
let special_initial = [%sedlex.regexp?
'!' | '$' | '%' | '&' | '*' | '/' | ':' | '<' | '=' | '>' | '?' | '^' | '_' | '~' ]
let initial = [%sedlex.regexp? letter | special_initial ]
let special_subsequent = [%sedlex.regexp? '+' | '-' | '.' | '#' ]
let subsequent = [%sedlex.regexp? initial | digit | special_subsequent ]
let peculiar_identifier = [%sedlex.regexp? '+' | '-' | "..." ]
let identifier = [%sedlex.regexp? initial, Star subsequent | peculiar_identifier ]
let delimited_identifier = [%sedlex.regexp? identifier, delimiter ]
(** Swallow whitespace and comments. *)
let rec swallow_atmosphere buf =
match%sedlex buf with
| Plus whitespace -> swallow_atmosphere buf
| ";" -> swallow_comment buf
| _ -> ()
and swallow_comment buf =
match%sedlex buf with
| newline -> swallow_atmosphere buf
| any -> swallow_comment buf
| _ -> assert false
(** Return the next token. *)
let rec token buf =
swallow_atmosphere buf;
match%sedlex buf with
| eof -> EOF
| delimited_identifier ->
Sedlexing.rollback buf;
identifier buf
| '(' -> LPAR
| ')' -> RPAR
| _ -> illegal buf (Char.chr (next buf))
and identifier buf =
match%sedlex buf with
| _ -> IDENTIFIER (Sedlexing.Utf8.lexeme buf)
(Yes, it's basically a no-op / the simplest thing possible rn. I'm trying to learn! :x)
Unfortunately, this combination results in an infinite loop in the parsing automaton:
State 0:
Lookahead token is now IDENTIFIER (1-1)
Shifting (IDENTIFIER) to state 1
State 1:
Lookahead token is now IDENTIFIER (1-1)
Reducing production expression -> IDENTIFIER
State 5:
Shifting (IDENTIFIER) to state 1
State 1:
Lookahead token is now IDENTIFIER (1-1)
Reducing production expression -> IDENTIFIER
State 5:
Shifting (IDENTIFIER) to state 1
State 1:
...
I'm new to parsing and lexing and all this; any advice would be welcome. I'm sure it's just a newbie mistake, but …
Thanks!

As said before, implementing too much logic inside the lexer is a bad idea.
However, the infinite loop does not come from the rollback but from your definition of identifier:
identifier buf =
match%sedlex buf with
| _ -> IDENTIFIER (Sedlexing.Utf8.lexeme buf)
within this definition _ matches the shortest possible words in the language consisting of all possible characters. In other words, _ always matches the empty word μ without consuming any part of its input, sending the parser into an infinite loop.

How does a parser solves shift/reduce conflict?

I have a grammar for arithmetic expression which solves number of expression (one per line) in a text file. While compiling YACC I am getting message 2 shift reduce conflicts. But my calculations are proper. If parser is giving proper output how does it resolves the shift/reduce conflict. And In my case is there any way to solve it in YACC Grammar.
YACC GRAMMAR
Calc : Expr {printf(" = %d\n",$1);}
| Calc Expr {printf(" = %d\n",$2);}
| error {yyerror("\nBad Expression\n ");}
;
Expr : Term { $$ = $1; }
| Expr '+' Term { $$ = $1 + $3; }
| Expr '-' Term { $$ = $1 - $3; }
;
Term : Fact { $$ = $1; }
| Term '*' Fact { $$ = $1 * $3; }
| Term '/' Fact { if($3==0){
yyerror("Divide by Zero Encountered.");
break;}
else
$$ = $1 / $3;
}
;
Fact : Prim { $$ = $1; }
| '-' Prim { $$ = -$2; }
;
Prim : '(' Expr ')' { $$ = $2; }
| Id { $$ = $1; }
;
Id :NUM { $$ = yylval; }
;
What change should I do to remove such conflicts in my grammar ?

Bison/yacc resolves shift-reduce conflicts by choosing to shift. This is explained in the bison manual in the section on Shift-Reduce conflicts.
Your problem is that your input is just a series of Exprs, run together without any delimiter between them. That means that:
4 - 2
could be one expression (4-2) or it could be two expressions (4, -2). Since bison-generated parsers always prefer to shift, the parser will choose to parse it as one expression, even if it were typed on two lines:
4
-2
If you want to allow users to type their expressions like that, without any separator, then you could either live with the conflict (since it is relatively benign) or you could codify it into your grammar, but that's quite a bit more work. To put it into the grammar, you need to define two different types of Expr: one (which is the one you use at the top level) cannot start with an unary minus, and the other one (which you can use anywhere else) is allowed to start with a unary minus.
I suspect that what you really want to do is use newlines or some other kind of expression separator. That's as simple as passing the newline through to your parser and changing Calc to Calc: | Calc '\n' | Calc Expr '\n'.
I'm sure that this appears somewhere else on SO, but I can't find it. So here is how you disallow the use of unary minus at the beginning of an expression, so that you can run expressions together without delimiters. The non-terminals starting n_ cannot start with a unary minus:
input: %empty | input n_expr { /* print $2 */ }
expr: term | expr '+' term | expr '-' term
n_expr: n_term | n_expr '+' term | n_expr '-' term
term: factor | term '*' factor | term '/' factor
n_term: value | n_term '+' factor | n_term '/' factor
factor: value | '-' factor
value: NUM | '(' expr ')'
That parses the same language as your grammar, but without generating the shift-reduce conflict. Since it parses the same language, the input
4
-2
will still be parsed as a single expression; to get the expected result you would need to type
4
(-2)

Flex and Yacc Grammar Issue

Edit #1: I think the problem is in my .l file. I don't think the rules are being treated as rules, and I'm not sure how to treat the terminals of the rules as strings.
My last project for a compilers class is to write a .l and a .y file for a simple SQL grammar. I have no experience with Flex or Yacc, so everything I have written I have pieced together. I only have a basic understanding of how these files work, so if you spot my problem can you also explain what that section of the file is supposed to do? I'm not even sure what the '%' symbols do.
Basically some rules just do not work when I try to parse something. Some rules hang and others reject when they should accept. I need to implement the following grammar:
start
::= expression
expression
::= one-relation-expression | two-relation-expression
one-relation-expression
::= renaming | restriction | projection
renaming
::= term RENAME attribute AS attribute
term
::= relation | ( expression )
restriction
::= term WHERE comparison
projection
::= term | term [ attribute-commalist ]
attribute-commalist
::= attribute | attribute , attribute-commalist
two-relation-expression
::= projection binary-operation expression
binary-operation
::= UNION | INTERSECT | MINUS | TIMES | JOIN | DIVIDEBY
comparison
::= attribute compare number
compare
::= < | > | <= | >= | = | <>
number
::= val | val number
val
::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
attribute
::= CNO | CITY | CNAME | SNO | PNO | TQTY |
SNAME | QUOTA | PNAME | COST | AVQTY |
S# | STATUS | P# | COLOR | WEIGHT | QTY
relation
::= S | P | SP | PRDCT | CUST | ORDERS
Here is my .l file:
%{
#include <stdio.h>
#include "p5.tab.h"
%}
binaryOperation UINION|INTERSECT|MINUS|TIMES|JOIN|DIVIDEBY
compare <|>|<=|>=|=|<>
attribute CNO|CITY|CNAME|SNO|PNO|TQTY|SNAME|QUOTA|PNAME|COST|AVQTY|S#|STATUS|P#|COLOR|WEIGHT|QTY
relation S|P|SP|PRDCT|CUST|ORDERS
%%
[ \t\n]+ ;
{binaryOperation} return binaryOperation;
{compare} return compare;
[0-9]+ return val;
{attribute} return attribute;
{relation} return relation;
"RENAME" return RENAME;
"AS" return AS;
"WHERE" return WHERE;
"(" return '(';
")" return ')';
"[" return '[';
"]" return ']';
"," return ',';
. {printf("REJECT\n");
exit(0);}
%%
Here is my .y file:
%{
#include <stdio.h>
#include <stdlib.h>
%}
%token RENAME attribute AS relation WHERE binaryOperation compare val
%%
start:
expression {printf("ACCEPT\n");}
;
expression:
oneRelationExpression
| twoRelationExpression
;
oneRelationExpression:
renaming
| restriction
| projection
;
renaming:
term RENAME attribute AS attribute
;
term:
relation
| '(' expression ')'
;
restriction:
term WHERE comparison
;
projection:
term
| term '[' attributeCommalist ']'
;
attributeCommalist:
attribute
| attribute ',' attributeCommalist
;
twoRelationExpression:
projection binaryOperation expression
;
comparison:
attribute compare number
;
number:
val
| val number
;
%%
yyerror() {
printf("REJECT\n");
exit(0);
}
main() {
yyparse();
}
yywrap() {}
Here is my makefile:
p5: p5.tab.c lex.yy.c
cc -o p5 p5.tab.c lex.yy.c
p5.tab.c: p5.y
bison -d p5.y
lex.yy.c: p5.l
flex p5.l
This works:
S RENAME CNO AS CITY
These do not:
S
S WHERE CNO = 5
I have not tested everything, but I think there is a common problem for these issues.

Your grammar is correct, the problem is that you are running interactively. When you call yyparse() it will attempt to read all input. Because the input
S
could be followed by either RENAME or WHERE it won't accept. Similarly,
S WHERE CNO = 5
could be followed by one or more numbers, so yyparse won't accept until it gets an EOF or an unexpected token.
What you want to do is follow the advice here and change p5.l to have these lines:
[ \t]+ ;
\n if (yyin==stdin) return 0;
That way when you are running interactively it will take the ENTER key to be the end of input.
Also, you want to use left recursion for number:
number:
val
| number val
;

Exponent operator does not work when no space added? Whats wrong with my grammar

I am trying to write an expression evaluator in which I am trying to add underscore _ as a reserve word which would denote a certain constant value.
Here is my grammar, it successfully parses 5 ^ _ but it fails to parse _^ 5 (without space). It only acts up that way for ^ operator.
COMPILER Formula
CHARACTERS
digit = '0'..'9'.
letter = 'A'..'z'.
TOKENS
number = digit {digit}.
identifier = letter {letter|digit}.
self = '_'.
IGNORE '\r' + '\n'
PRODUCTIONS
Formula = Term{ ( '+' | '-') Term}.
Term = Factor {( '*' | "/" |'%' | '^' ) Factor}.
Factor = number | Self.
Self = self.
END Formula.
What am I missing? I am using Coco/R compiler generator.

Your current definition of the token letter causes this issue because the range A..z includes the _ character and ^ character.

You can rewrite the Formula and Term rules like this:
Formula = Formula ( '+' | '-') Term | Term
Term = Term ( '*' | "/" |'%' | '^' ) Factor | Factor
e.g. https://metacpan.org/pod/distribution/Marpa-R2/pod/Marpa_R2.pod#Synopsis

Why does ANTLR not parse the entire input?

I am quite new to ANTLR, so this is likely a simple question.
I have defined a simple grammar which is supposed to include arithmetic expressions with numbers and identifiers (strings that start with a letter and continue with one or more letters or numbers.)
The grammar looks as follows:
grammar while;
#lexer::header {
package ConFreeG;
}
#header {
package ConFreeG;
import ConFreeG.IR.*;
}
#parser::members {
}
arith:
term
| '(' arith ( '-' | '+' | '*' ) arith ')'
;
term returns [AExpr a]:
NUM
{
int n = Integer.parseInt($NUM.text);
a = new Num(n);
}
| IDENT
{
a = new Var($IDENT.text);
}
;
fragment LOWER : ('a'..'z');
fragment UPPER : ('A'..'Z');
fragment NONNULL : ('1'..'9');
fragment NUMBER : ('0' | NONNULL);
IDENT : ( LOWER | UPPER ) ( LOWER | UPPER | NUMBER )*;
NUM : '0' | NONNULL NUMBER*;
fragment NEWLINE:'\r'? '\n';
WHITESPACE : ( ' ' | '\t' | NEWLINE )+ { $channel=HIDDEN; };
I am using ANTLR v3 with the ANTLR IDE Eclipse plugin. When I parse the expression (8 + a45) using the interpreter, only part of the parse tree is generated:
Why does the second term (a45) not get parsed? The same happens if both terms are numbers.

You'll want to create a parser rule that has an EOF (end of file) token in it so that the parser will be forced to go through the entire token stream.
Add this rule to your grammar:
parse
: arith EOF
;
and let the interpreter start at that rule instead of the arith rule:

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart