Failure to match character that matches rule with PEG parser

I'm trying to parse Java-style floating point numbers (accepting underscores in the middle of digits) and have simplified the grammar presented in the Java spec:
float_lit = [[DIGITS] '.'] DIGITS [FLOAT_EXP] [FLOAT_SUFFIX] ;
DIGITS = /\d[\d_]*\d/ | /\d/ ;
FLOAT_EXP = ( 'e' | 'E' ) [ '+' | '-' ] DIGITS ;
FLOAT_SUFFIX = 'f' | 'F' | 'd' | 'D' ;
Unfortunately, this doesn't accept the "1e10" input, weirdly failing to match the 'e' within FLOAT_EXP as shown in the trace below:
<float_lit ~1:1
1e10
<DIGITS<float_lit ~1:1
1e10
!'' /\d[\d_]*?\d/
1e10
>'1' /\d/
e10
>DIGITS<float_lit ~1:2
e10
!'.'
e10
<DIGITS<float_lit ~1:1
1e10
>DIGITS<float_lit ~1:2
e10
<FLOAT_EXP<float_lit ~1:2
e10
!'e'
e10
!'E'
e10
!FLOAT_EXP<float_lit ~1:2
e10
<FLOAT_SUFFIX<float_lit ~1:2
e10
!'f'
e10
!'F'
e10
!'d'
e10
!'D'
e10
!FLOAT_SUFFIX<float_lit ~1:2
e10
>float_lit ~1:2
e10
'1'
Can anyone point out what I'm doing wrong?

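The issue here is TatSu's nameguard feature for tokens. A token that looks like a name, such as 'e', is not matched when the character following it is alphanumeric, which prevents a keyword-like token from eagerly consuming the start of a longer name. In "1e10" the 'e' is followed by '1', so the token match fails.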
The solution is using regexps instead of token choices to match these characters:
float_lit = [[DIGITS] '.'] DIGITS [FLOAT_EXP] [FLOAT_SUFFIX] ;
DIGITS = /\d[\d_]*\d/ | /\d/ ;
FLOAT_EXP = /[eE][+-]?/ DIGITS ;
FLOAT_SUFFIX = /[fFdD]/ ;
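
To illustrate why the token 'e' fails where the regexp succeeds, here is a minimal Python sketch of a name-guard check. It is not TatSu's implementation: the functions match_token and match_pattern are hypothetical, and the guard rule (an alphanumeric token must not be followed by an alphanumeric character) is a simplifying assumption based on the behaviour described above.

```python
import re


def match_token(text, pos, token, nameguard=True):
    """Return the position after `token` if it matches at `pos`, else None.

    Hypothetical helper: a simplified model of a name-guarded token match.
    """
    if not text.startswith(token, pos):
        return None
    end = pos + len(token)
    # Simplified guard: a name-like token is rejected when the next
    # character would make it part of a longer alphanumeric run.
    if nameguard and token.isalnum() and end < len(text) and text[end].isalnum():
        return None
    return end


def match_pattern(text, pos, pattern):
    """Return the position after a regexp match at `pos`, else None."""
    m = re.compile(pattern).match(text, pos)
    return m.end() if m else None


# Position 1 in "1e10" is the 'e' that the trace fails to match.
guarded = match_token("1e10", 1, "e")                      # guard rejects it
unguarded = match_token("1e10", 1, "e", nameguard=False)   # plain match
by_regex = match_pattern("1e10", 1, r"[eE][+-]?")          # no guard applies
```

In this model the guarded token match returns None, while the regexp match advances past the 'e', which mirrors the effect of replacing the token choices with /[eE][+-]?/ in the grammar.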

Related

YACC Parser getting an error at line 1 even if test program is empty?

I have been trying to do my homework, which is language design with Lex and YACC. My assignment is to build a simple parser with YACC. My problem is that the test file I send to my parser always produces a syntax error on the first line. As far as I can tell, the parser does not even reach the first statement of the example. The other problem is that the program reports an error even if the test program is empty. Here are my Lex and YACC code and the example program I am using.
My Lex Code:
%{
#include <stdio.h>
#include "y.tab.h"
void yyerror(char *);
%}
lowerLetter [a-z]
letter [a-zA-Z]
digit [0-9]
signs [+-]
integer {signs}?{digit}+
double {signs}?{digit}*(\.)?{digit}+
word {lowerLetter}+({letter}*{digit}*)*
string \"[^\"]*\"
day ("Monday"|"Tuesday"|"Wednesday"|"Thursday"|"Friday"|"Saturday"|"Sunday")
month ("January"|"February"|"March"|"April"|"May"|"June"|"July"|"August"|"September"|"October"|"November"|"December")
time {day}(\,)(\ )[0-3][1-9](\ ){month}(\ ){digit}*(\ )[0-2][0-9](\:)[0-5][0-9](\:)[0-5][0-9]((\ )("GMT")(((\+)|(\-)){integer}(\:){integer})?)?
sensor (\$)("s"){digit}*
switch (\$)("sw"){digit}
url ("http")("s")?("://")("www.")?[a-zA-Z0-9]*(\.)(.)*
type ("integer"|"double"|"string"|"sensor"|"switch"|"url"|"boolean"|"time"|"letter")
boolean ("true")|("false")|([a-zA-Z0-9]+((\ )*)?(((\=|\<|\>|\!)(\=))|(\>|\<))+((\ )*)?[a-zA-Z0-9]+)
identifier ({letter}*|{digit}*|(\_)*)*
%%
while return WHILE;
for return FOR;
return return RETURN;
sysin return SYSIN;
sysout return SYSOUT;
if return IF;
main return MAIN;
end return END;
else return ELSE;
fun return FUNCTION_IDENTIFIER;
cons return CONS;
isURL return IS_URL;
connect return CONNECT;
send return SEND;
receive return RECEIVE;
\\\n return NL;
\. return DOT;
\, return COMMA;
\: return COLON;
\; return SEMICOLON;
\+ return PLUS_OP;
\- return MINUS_OP;
\* return MULTIPLY_OP;
\/ return DIVIDE_OP;
\% return MOD_OP;
\# return HASHTAG;
\$ return SENSOR_IDENTIFIER;
\^ return POWER_OP;
\_ return UNDER_SCORE;
\? return QUESTION;
\! return NOT;
\( return LP;
\) return RP;
\{ return LCB;
\} return RCB;
\[ return LSB;
\] return RSB;
\= return ASSN_OP;
\> return GT;
\< return LT;
\=\= return EQ;
\>\= return GEQ;
\<\= return LEQ;
\!\= return NE;
\&\& return AND;
\|\| return OR;
\/\/ return DS;
{day} return DAY;
{month} return MONTH;
{time} return TIME;
{type} return TYPE;
{lowerLetter} return LOWER_LETTER;
{letter} return LETTER;
{integer} return INTEGER;
{string} return STRING;
{sensor} return SENSOR;
{switch} return SWITCH;
{double} return DOUBLE;
{boolean} return BOOLEAN;
{url} return URL;
{word} return WORD;
{identifier} return IDENTIFIER;
[ \t] ;
%%
int yywrap(void)
{
return 1;
}
My YACC code:
%{
#include "stdio.h"
#include <stdlib.h>
void yyerror(char *);
extern int yylineno;
#include "y.tab.h"
int yylex(void);
%}
%token MAIN
%token DOT COMMA COLON SEMICOLON UNDER_SCORE QUESTION LP RP LCB RCB LSB RSB DS NL
%token WHILE FOR RETURN SYSIN SYSOUT IF END ELSE FUNCTION_IDENTIFIER NOT CONS PLUS_OP MINUS_OP MULTIPLY_OP DIVIDE_OP MOD_OP HASHTAG SENSOR_IDENTIFIER POWER_OP
%token ASSN_OP GT LT EQ GEQ LEQ NE AND OR DAY MONTH TIME TYPE IS_URL CONNECT SEND RECEIVE
%token LOWER_LETTER LETTER WORD STRING INTEGER DOUBLE BOOLEAN SENSOR SWITCH URL IDENTIFIER
%nonassoc ELSE
%left PLUS_OP MINUS_OP
%left MULTIPLY_OP DIVIDE_OP
%left POWER_OP MOD_OP
%start program
%%
program:
stmts {printf("\rProgram is valid.\n");};
stmts: stmt
| stmts stmt;
stmt: if_stmt
| non_if_stmt ;
if_stmt: IF LP logical_expr RP LCB stmts RCB
| IF LP logical_expr RP LCB stmts RCB ELSE LCB stmts RCB;
non_if_stmt: loops
| aritmetic_op
| func_call
| func_dec
| initialize
| decl
| decl_ini
| input_stmt
| output_stmt
//| comment
| url_checker
| send
| receive
| connect;
loops: while_loop
| for_loop;
while_loop: WHILE LP logical_expr RP LCB stmts RCB;
for_loop: FOR LP decl_ini SEMICOLON logical_expr SEMICOLON aritmetic_op RP LCB stmts RCB;
initialize: IDENTIFIER ASSN_OP value SEMICOLON;
logical_expr: logical_term logical_op logical_term
| logical_expr logical_connector logical_term
| boolean_stmt;
boolean_stmt: BOOLEAN
| NOT boolean_stmt;
logical_term: term;
//| logical_term AND term;
logical_connector: AND
| OR;
term: IDENTIFIER
| constant;
constant: CONS IDENTIFIER;
logical_op: EQ
| NE
| LT
| GT
| LEQ
| GEQ;
decl: TYPE term SEMICOLON;
decl_ini: TYPE term ASSN_OP value SEMICOLON;
value: number
| STRING
| SENSOR
| SWITCH
| URL
| BOOLEAN
| LETTER
| TIME;
number: DOUBLE
| INTEGER;
/*
value_list: value
| value_list COMMA value;
array_decl: TYPE term LSB RSB SEMICOLON;
array_ini: TYPE term LSB RSB ASSN_OP LSB value_list RSB SEMICOLON
| term ASSN_OP LSB value_list RSB SEMICOLON;
get_array_val: term LSB INTEGER RSB;
*/
func_dec: FUNCTION_IDENTIFIER TYPE IDENTIFIER LP arguments RP LCB function_block RCB;
func_call: IDENTIFIER LP arguments RP SEMICOLON;
function_block: stmts
| stmts RETURN value
| stmts RETURN term;
arguments: TYPE term
| arguments COMMA TYPE term;
aritmetic_op: addition
| subtraction
| multiplication
| division
| modulo
| power;
addition: aritmetic_op PLUS_OP term
| term PLUS_OP term
| term PLUS_OP aritmetic_op;
subtraction: aritmetic_op MINUS_OP term
| term MINUS_OP term
| term MINUS_OP aritmetic_op;
multiplication: aritmetic_op MULTIPLY_OP term
| term MULTIPLY_OP term
| term MULTIPLY_OP aritmetic_op;
division: aritmetic_op DIVIDE_OP term
| term DIVIDE_OP term
| term DIVIDE_OP aritmetic_op;
modulo: aritmetic_op MOD_OP term
| term MOD_OP term
| term MOD_OP aritmetic_op;
power: aritmetic_op POWER_OP term
| term POWER_OP term
| term POWER_OP aritmetic_op;
/*
comment:
| DS sentence NL;
sentence:
| IDENTIFIER
| DOT | COMMA | COLON | SEMICOLON | PLUS_OP | MINUS_OP | MULTIPLY_OP | DIVIDE_OP
| MOD_OP | HASHTAG | SENSOR_IDENTIFIER | POWER_OP | UNDER_SCORE | QUESTION | NOT
| LP | RP | LCB | RCB | ASSN_OP | GT | LT
| sentence sentence;
*/
input_stmt: TYPE term ASSN_OP SYSIN LP RP SEMICOLON
| SYSIN LP RP SEMICOLON;
output_stmt: SYSOUT LP output RP SEMICOLON;
output: term
| value
| aritmetic_op
| output COMMA term;
url_checker: IS_URL LP STRING RP
| IS_URL LP term RP;
connect: CONNECT LP URL RP;
send: SEND LP number COMMA URL RP;
receive: RECEIVE LP URL RP;
%%
void yyerror(char *s)
{
fprintf(stderr, "syntax error at line: %d %s\n", yylineno, s);
}
int main(void){
yyparse();
if(yynerrs < 1) printf("there are no syntax errors!!\n");
}
My Simple Test Program:
integer a = 5;
double e = 5.5;
double f = 3.0;
sadasd
My Extended Test Program:
sensor a = $s1;
switch b = $sw1;
integer c = -5;
integer d = 75;
double e = 5.5;
double f = 3.0;
c = c + d;
f = f + e;
sysout (e + f);
integer g = sysin();
url k = https://www.cs.bilkent.edu.tr/~guvenir/courses/CS315/Pr1.htm;
url l = https://docs.oracle.com/cd/E19504-01/802-5880/lex-6/index.html;
if(isURL(k)){connect(k);} //Connecting to URL k after checking if its a URL
send(g, k); //Sending the integer g to the URL k
integer h = receive(l); //Receiving the integer h from URL l
sysout (h); //Printing the integer h which we took from the URL l
if (e > f) {
e = f + e;
}else{
e = f - e;
}
time t = Friday, 14 October 2022 18:05:52
time t = Monday, 17 October 2022 12:05:52 GMT
time t = Saturday, 22 October 2022 20:05:52 GMT+03:00 //Defining time in different ways
while (e > f) {
e = e - f;
}
for (integer i = 0, e <= f, i = i + 1) {
e = e + i;
}
fun boolean isGreater(integer x, integer y){
boolean z = (x > y);
return z;
}
boolean g = isGreater(e, f);
sysout (k);
sysout (l);
And my Makefile:
LEX = lex
YACC = yacc -d
CC = gcc
all: parser clean
parser: y.tab.o lex.yy.o
$(CC) -o parser y.tab.o lex.yy.o
./parser < test.txt
lex.yy.o: lex.yy.c y.tab.h
lex.yy.o y.tab.o: y.tab.c
y.tab.c y.tab.h: y.y
$(YACC) -v y.y
lex.yy.c: lex.l
$(LEX) lex.l
clean:
-rm -f *.o lex.yy.c *.tab.* parser *.output

How to parse decimal values correctly?

I'm using ANTLR with Presto grammar in order to parse SQL queries.
I'm having an issue with parsing a decimal number. I have the following definitions:
number
: decimalValue #decimalLiteral
| DOUBLE_VALUE #doubleLiteral
| INTEGER_VALUE #integerLiteral
;
decimalValue
: INTEGER_VALUE '.' INTEGER_VALUE?
| '.' INTEGER_VALUE
;
DOUBLE_VALUE
: DIGIT+ ('.' DIGIT*)? EXPONENT
| '.' DIGIT+ EXPONENT
;
IDENTIFIER
// : (LETTER | '_' | DIGIT) (LETTER | DIGIT | '_' | '#' | ':' | '.')*
: (LETTER | DIGIT | '_' | '#' | ':' | '-' )+
;
This works ok for most cases. However, it has an issue with parsing decimal values.
select x/(0.3-0.2)
from table1
It fails to parse. The reason is that the lexer treats "3-0" as an identifier.
When I change the query to be something like:
select x/(0.3 - 0.2)
from table1
it works.
Any ideas on how I can handle the original query (without, of course, causing a regression)?
Thanks,
Nir.
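
The behaviour described in the question can be reproduced with a small longest-match tokenizer. The Python sketch below is only an illustration, not the Presto lexer: the token table is a hypothetical, cut-down version of the rules above, and the MINUS and DOT entries are assumptions standing in for the implicit '-' and '.' tokens. It relies on the usual lexer policy of picking the longest match and, on a tie, the rule listed first.

```python
import re

# Hypothetical, simplified token table modeled on the rules in the question.
TOKEN_PATTERNS = [
    ("INTEGER_VALUE", r"[0-9]+"),
    ("MINUS", r"-"),
    ("DOT", r"\."),
    ("IDENTIFIER", r"[A-Za-z0-9_#:\-]+"),
]


def longest_match(text, pos):
    """Return (name, lexeme, end) for the longest rule matching at pos."""
    best = None
    for name, pattern in TOKEN_PATTERNS:
        m = re.compile(pattern).match(text, pos)
        # A strictly longer match wins; on a tie the earlier rule is kept.
        if m and (best is None or m.end() > best[2]):
            best = (name, m.group(), m.end())
    return best


def tokenize(text):
    """Split text into (name, lexeme) pairs, skipping whitespace."""
    tokens, pos = [], 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        hit = longest_match(text, pos)
        if hit is None:
            raise ValueError(f"no token matches at position {pos}")
        name, lexeme, pos = hit
        tokens.append((name, lexeme))
    return tokens


tight = tokenize("0.3-0.2")     # "3-0" is swallowed by IDENTIFIER
spaced = tokenize("0.3 - 0.2")  # the spaces break the identifier apart
```

Because IDENTIFIER accepts both digits and '-', it produces a longer match than INTEGER_VALUE at the "3", so the unspaced input never yields a separate minus token in this sketch.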

Grammar for expressions which disallows outer parentheses

I have the following grammar for expressions involving binary operators (| ^ & << >> + - * /):
expression : expression BITWISE_OR xor_expression
| xor_expression
xor_expression : xor_expression BITWISE_XOR and_expression
| and_expression
and_expression : and_expression BITWISE_AND shift_expression
| shift_expression
shift_expression : shift_expression LEFT_SHIFT arith_expression
| shift_expression RIGHT_SHIFT arith_expression
| arith_expression
arith_expression : arith_expression PLUS term
| arith_expression MINUS term
| term
term : term TIMES factor
| term DIVIDE factor
| factor
factor : NUMBER
| LPAREN expression RPAREN
This seems to work fine, but doesn't quite match my needs because it allows outer parentheses e.g. ((3 + 4) * 2).
How can I change the grammar to disallow outer parentheses, while still allowing them within expressions e.g. (3 + 4) * 2, even redundantly e.g. (3 * 4) + 2?

Add this rule to your grammar:
top_level : expression BITWISE_OR xor_expression
| xor_expression BITWISE_XOR and_expression
| and_expression BITWISE_AND shift_expression
| shift_expression LEFT_SHIFT arith_expression
| shift_expression RIGHT_SHIFT arith_expression
| arith_expression PLUS term
| arith_expression MINUS term
| term TIMES factor
| term DIVIDE factor
| NUMBER
and use top_level where you want expressions without outer parens.
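
The effect of top_level can be checked with a small hand-written parser. The Python sketch below is an assumption-laden illustration rather than the generated parser: it uses a different mechanism (it parses the full expression and then rejects a result whose root is a parenthesized factor), and the names LEVELS, parse_level, parse_factor, and is_top_level are hypothetical.

```python
import re

TOKEN_RE = re.compile(r"\s*(\d+|<<|>>|[|^&+\-*/()])")

# Binary operators grouped by precedence level, loosest first, mirroring
# the rule chain expression -> xor_expression -> ... -> term -> factor.
LEVELS = [["|"], ["^"], ["&"], ["<<", ">>"], ["+", "-"], ["*", "/"]]


def tokenize(text):
    """Split the input into number, operator, and parenthesis tokens."""
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if not m:
            raise SyntaxError(f"unexpected character at {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens


def parse_level(tokens, i, level):
    """Parse one precedence level; return (tree, next_index)."""
    if level == len(LEVELS):
        return parse_factor(tokens, i)
    left, i = parse_level(tokens, i, level + 1)
    while i < len(tokens) and tokens[i] in LEVELS[level]:
        op = tokens[i]
        right, i = parse_level(tokens, i + 1, level + 1)
        left = (op, left, right)  # left-associative, as in the grammar
    return left, i


def parse_factor(tokens, i):
    """factor : NUMBER | LPAREN expression RPAREN"""
    if i < len(tokens) and tokens[i].isdigit():
        return ("num", tokens[i]), i + 1
    if i < len(tokens) and tokens[i] == "(":
        inner, i = parse_level(tokens, i + 1, 0)
        if i >= len(tokens) or tokens[i] != ")":
            raise SyntaxError("expected ')'")
        return ("paren", inner), i + 1
    raise SyntaxError("expected a number or '('")


def is_top_level(text):
    """True if text is a valid expression not wrapped in outer parentheses."""
    tokens = tokenize(text)
    tree, i = parse_level(tokens, 0, 0)
    if i != len(tokens):
        raise SyntaxError("trailing input")
    return tree[0] != "paren"
```

With this sketch, is_top_level("(3 + 4) * 2") and is_top_level("(3 * 4) + 2") are accepted, while is_top_level("((3 + 4) * 2)") is rejected because the whole input reduces to a single parenthesized factor.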

ANTLR grammar for SMT formulae

I am trying to make a grammar for SMT formulae, and this is what I have so far:
grammar Z3input;
startRule : formulaList? EOF;
LEFT_PAREN : '(';
RIGHT_PAREN : ')';
COMMA : ',';
SEMICOLON : ';';
PLUS : '+';
MINUS : '-';
TIMES : '*';
DIVIDE : '/';
DIGIT : [0-9];
INTEGER : '0' | [1-9] DIGIT*;
FLOAT : DIGIT+ '.' DIGIT+;
NUMERICAL_LITERAL : FLOAT | INTEGER;
BOOLEAN_LITERAL : 'True' | 'False';
LITERAL : MINUS? NUMERICAL_LITERAL | BOOLEAN_LITERAL;
COMPARISON_OPERATOR : '>' | '<' | '>=' | '<=' | '!=' | '==';
WHITESPACE: [ \t\n\r]+ -> skip;
IDENTIFIER : [a-uw-zB-DF-Z]+ ([a-zA-Z0-9]? [a-uw-zB-DF-Z])*; // omits 'v', 'A', 'E' and cannot end in those characters
IMPLIES : '->' | '-->' | 'implies';
AND : '&' | 'and' | '^';
OR : 'or' | 'v' | '|';
NOT : '~' | '!' | 'not';
QUANTIFIER : 'A' | 'E' | 'forall' | 'exists';
formulaList : formula ( SEMICOLON formula )*;
argumentList : expression ( COMMA expression )*;
formula : formulaConjunction
| LEFT_PAREN formula RIGHT_PAREN OR LEFT_PAREN formulaConjunction RIGHT_PAREN
| formula IMPLIES LEFT_PAREN formulaConjunction RIGHT_PAREN;
formulaConjunction : formulaNegation | formulaConjunction AND formulaNegation;
formulaNegation : formulaAtom | NOT formulaNegation;
formulaAtom : BOOLEAN_LITERAL
| IDENTIFIER ( LEFT_PAREN argumentList? RIGHT_PAREN )?
| QUANTIFIER '.' LEFT_PAREN formulaAtom RIGHT_PAREN
| compareExpn;
expression : boolConjunction | expression OR boolConjunction;
boolConjunction : boolNegation | boolConjunction AND boolNegation;
boolNegation : compareExpn | NOT boolNegation;
compareExpn : arithExpn COMPARISON_OPERATOR arithExpn;
arithExpn : term | arithExpn PLUS term | arithExpn MINUS term;
term : factor | term TIMES factor | term DIVIDE factor;
factor : primary | MINUS factor;
primary : LITERAL
| IDENTIFIER ( LEFT_PAREN argumentList? RIGHT_PAREN )?
| LEFT_PAREN expression RIGHT_PAREN;
SMT formulae are formulae of first-order logic with function symbols (identifiers that can be called with any number of arguments), variables, comparisons between boolean literals (i.e. 'True' or 'False'), numeric literals, function calls, or variables, and arithmetic with the operators '+', '*', '-', and '/'. Essentially, these formulae are first-order logic over some signature, and for my purposes I have chosen this signature to be the theory of rationals.
I can get a proper interpretation of something like 'True ^ True' but anything more complicated, including even 'True | True', seems to always result in something along the lines of
... mismatched input '|' expecting {<EOF>, ';', IMPLIES, AND}
so I would like some help with correcting the grammar. And for the record I would prefer to keep the grammar run-time independent.

Your formula rule seems to be causing the issue here: LEFT_PAREN formula RIGHT_PAREN OR LEFT_PAREN formulaConjunction RIGHT_PAREN.
That's saying that only formulas of the form (FORMULA)|(CONJUNCTIVE) will be accepted by the language.
Instead, specify precedence rules for each operator, and use a nonterminal for each level of precedence. For example, your grammar might look something like the following:
formula : (QUANTIFIER IDENTIFIER '.')? formulaImplication;
formulaImplication : formulaConjunction (IMPLIES formula)?;
formulaConjunction : formulaDisjunction (AND formulaConjunction)?;
formulaDisjunction : formulaNegation (OR formulaDisjunction)?;
formulaNegation : formulaAtom | NOT formulaNegation;
formulaAtom : BOOLEAN_LITERAL | IDENTIFIER ( LEFT_PAREN argumentList? RIGHT_PAREN )? | LEFT_PAREN formula RIGHT_PAREN | compareExpn;
expression : boolConjunction | expression OR boolConjunction;
boolConjunction : boolNegation | boolConjunction AND boolNegation;
boolNegation : compareExpn | NOT boolNegation;
compareExpn : arithExpn COMPARISON_OPERATOR arithExpn;
arithExpn : term | arithExpn PLUS term | arithExpn MINUS term;
term : factor ((TIMES | DIVIDE) term)?;
factor : primary | MINUS factor;
primary : LITERAL | IDENTIFIER ( LEFT_PAREN argumentList? RIGHT_PAREN )? | LEFT_PAREN expression RIGHT_PAREN;

Removing ambiguity in bison

I am writing a simple parser in bison. The parser checks whether a program has any syntax errors with respect to the following grammar:
%{
#include <stdio.h>
void yyerror (const char *s) /* Called by yyparse on error */
{
printf ("%s\n", s);
}
%}
%token tNUM tINT tREAL tIDENT tINTTYPE tREALTYPE tINTMATRIXTYPE
%token tREALMATRIXTYPE tINTVECTORTYPE tREALVECTORTYPE tTRANSPOSE
%token tIF tENDIF tDOTPROD tEQ tNE tGTE tLTE tGT tLT tOR tAND
%left "(" ")" "[" "]"
%left "<" "<=" ">" ">="
%right "="
%left "+" "-"
%left "*" "/"
%left "||"
%left "&&"
%left "==" "!="
%% /* Grammar rules and actions follow */
prog: stmtlst ;
stmtlst: stmt | stmt stmtlst ;
stmt: decl | asgn | if;
decl: type vars "=" expr ";" ;
type: tINTTYPE | tINTVECTORTYPE | tINTMATRIXTYPE | tREALTYPE | tREALVECTORTYPE
| tREALMATRIXTYPE ;
vars: tIDENT | tIDENT "," vars ;
asgn: tIDENT "=" expr ";" ;
if: tIF "(" bool ")" stmtlst tENDIF ;
expr: tIDENT | tINT | tREAL | vectorLit | matrixLit | expr "+" expr| expr "-" expr
| expr "*" expr | expr "/" expr| expr tDOTPROD expr | transpose ;
transpose: tTRANSPOSE "(" expr ")" ;
vectorLit: "[" row "]" ;
matrixLit: "[" row ";" rows "]" ;
row: value | value "," row ;
rows: row | row ";" rows ;
value: tINT | tREAL | tIDENT ;
bool: comp | bool tAND bool | bool tOR bool ;
comp: expr relation expr ;
relation: tGT | tLT | tGTE | tLTE | tNE | tEQ ;
%%
int main ()
{
if (yyparse()) {
// parse error
printf("ERROR\n");
return 1;
}
else {
// successful parsing
printf("OK\n");
return 0;
}
}
The code may look long and complicated, and I don't think my question needs all of it, but I preferred to include it anyway. I am sure my grammar is correct, but it is ambiguous. When I run "bison -d filename.y", I get a message reporting "conflicts: 13 shift/reduce". I defined the precedence of the operators at the beginning of the file and tried many combinations of these precedences, but I still get this message. How can I remove this ambiguity? Thank you.

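tOR, tAND, and tDOTPROD need to have their precedence specified as well. The grammar uses these named tokens in the rules bool tAND bool, bool tOR bool, and expr tDOTPROD expr, while the %left declarations only mention quoted literals such as "||" and "&&", so bison has no precedence with which to resolve those rules.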
