Flex and Yacc Grammar Issue - parsing

Edit #1: I think the problem is in my .l file. I don't think the rules are being treated as rules, and I'm not sure how to treat the terminals of the rules as strings.
My last project for a compilers class is to write a .l and a .y file for a simple SQL grammar. I have no experience with Flex or Yacc, so everything I have written I have pieced together. I only have a basic understanding of how these files work, so if you spot my problem can you also explain what that section of the file is supposed to do? I'm not even sure what the '%' symbols do.
Basically some rules just do not work when I try to parse something. Some rules hang and others reject when they should accept. I need to implement the following grammar:
start
::= expression
expression
::= one-relation-expression | two-relation-expression
one-relation-expression
::= renaming | restriction | projection
renaming
::= term RENAME attribute AS attribute
term
::= relation | ( expression )
restriction
::= term WHERE comparison
projection
::= term | term [ attribute-commalist ]
attribute-commalist
::= attribute | attribute , attribute-commalist
two-relation-expression
::= projection binary-operation expression
binary-operation
::= UNION | INTERSECT | MINUS | TIMES | JOIN | DIVIDEBY
comparison
::= attribute compare number
compare
::= < | > | <= | >= | = | <>
number
::= val | val number
val
::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
attribute
::= CNO | CITY | CNAME | SNO | PNO | TQTY |
SNAME | QUOTA | PNAME | COST | AVQTY |
S# | STATUS | P# | COLOR | WEIGHT | QTY
relation
::= S | P | SP | PRDCT | CUST | ORDERS
Here is my .l file:
%{
#include <stdio.h>
#include "p5.tab.h"
%}
binaryOperation UINION|INTERSECT|MINUS|TIMES|JOIN|DIVIDEBY
compare <|>|<=|>=|=|<>
attribute CNO|CITY|CNAME|SNO|PNO|TQTY|SNAME|QUOTA|PNAME|COST|AVQTY|S#|STATUS|P#|COLOR|WEIGHT|QTY
relation S|P|SP|PRDCT|CUST|ORDERS
%%
[ \t\n]+ ;
{binaryOperation} return binaryOperation;
{compare} return compare;
[0-9]+ return val;
{attribute} return attribute;
{relation} return relation;
"RENAME" return RENAME;
"AS" return AS;
"WHERE" return WHERE;
"(" return '(';
")" return ')';
"[" return '[';
"]" return ']';
"," return ',';
. {printf("REJECT\n");
exit(0);}
%%
Here is my .y file:
%{
#include <stdio.h>
#include <stdlib.h>
%}
%token RENAME attribute AS relation WHERE binaryOperation compare val
%%
start:
expression {printf("ACCEPT\n");}
;
expression:
oneRelationExpression
| twoRelationExpression
;
oneRelationExpression:
renaming
| restriction
| projection
;
renaming:
term RENAME attribute AS attribute
;
term:
relation
| '(' expression ')'
;
restriction:
term WHERE comparison
;
projection:
term
| term '[' attributeCommalist ']'
;
attributeCommalist:
attribute
| attribute ',' attributeCommalist
;
twoRelationExpression:
projection binaryOperation expression
;
comparison:
attribute compare number
;
number:
val
| val number
;
%%
yyerror() {
printf("REJECT\n");
exit(0);
}
main() {
yyparse();
}
yywrap() {}
Here is my makefile:
p5: p5.tab.c lex.yy.c
cc -o p5 p5.tab.c lex.yy.c
p5.tab.c: p5.y
bison -d p5.y
lex.yy.c: p5.l
flex p5.l
This works:
S RENAME CNO AS CITY
These do not:
S
S WHERE CNO = 5
I have not tested everything, but I think there is a common problem for these issues.

Your grammar is correct; the problem is that you are running interactively. When you call yyparse(), it tries to read all of the input. Because the input
S
could be followed by either RENAME or WHERE it won't accept. Similarly,
S WHERE CNO = 5
could be followed by one or more numbers, so yyparse won't accept until it gets an EOF or an unexpected token.
What you want to do is change p5.l so that these lines replace the existing whitespace rule:
[ \t]+ ;
\n if (yyin==stdin) return 0;
That way when you are running interactively it will take the ENTER key to be the end of input.
Also, you want to use left recursion for number:
number:
val
| number val
;
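The usual reason for preferring left recursion in a bison grammar is stack usage: with the right-recursive rule the parser has to shift every val before it can reduce anything, while the left-recursive rule can reduce after each val. The following is a minimal hand-written sketch of that behaviour, not bison's actual algorithm; the function names are hypothetical and the input is assumed to consist only of val tokens.

```python
def max_stack_left(tokens):
    """Simulate shift/reduce for  number: val | number val  (left recursion)."""
    stack, deepest = [], 0
    for tok in tokens:
        stack.append(tok)                      # shift val
        deepest = max(deepest, len(stack))
        if stack == ["val"]:
            stack = ["number"]                 # reduce  number: val
        elif stack[-2:] == ["number", "val"]:
            stack[-2:] = ["number"]            # reduce  number: number val
    return deepest


def max_stack_right(tokens):
    """Simulate shift/reduce for  number: val | val number  (right recursion)."""
    stack, deepest = [], 0
    for tok in tokens:
        stack.append(tok)                      # nothing can be reduced yet
        deepest = max(deepest, len(stack))
    if stack:
        stack[-1:] = ["number"]                # reduce  number: val
        while len(stack) > 1:
            stack[-2:] = ["number"]            # reduce  number: val number
    return deepest


digits = ["val"] * 5
print(max_stack_left(digits), max_stack_right(digits))
```

In this sketch the left-recursive version never holds more than two symbols, whereas the right-recursive version holds one symbol per digit before the first reduction.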

How to parse decimal values correctly?

I'm using ANTLR with Presto grammar in order to parse SQL queries.
I'm having an issue with parsing a decimal number. I've the following definitions:
number
: decimalValue #decimalLiteral
| DOUBLE_VALUE #doubleLiteral
| INTEGER_VALUE #integerLiteral
;
decimalValue
: INTEGER_VALUE '.' INTEGER_VALUE?
| '.' INTEGER_VALUE
;
DOUBLE_VALUE
: DIGIT+ ('.' DIGIT*)? EXPONENT
| '.' DIGIT+ EXPONENT
;
IDENTIFIER
// : (LETTER | '_' | DIGIT) (LETTER | DIGIT | '_' | '#' | ':' | '.')*
: (LETTER | DIGIT | '_' | '#' | ':' | '-' )+
;
This works ok for most cases. However, it has an issue with parsing decimal values.
select x/(0.3-0.2)
from table1
It fails to parse. The reason is that the lexer thinks "3-0" is identifier.
When I change the query to be something like:
select x/(0.3 - 0.2)
from table1
it works.
Any ideas how can I handle the original query (without, of course, causing a regression)?
Thanks,
Nir.
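The tokenization described in the question can be reproduced with a small longest-match check. This is only a sketch using Python regular expressions in place of the ANTLR lexer: it assumes DIGIT is [0-9] and LETTER is [A-Za-z], and it assumes INTEGER_VALUE is declared before IDENTIFIER so that it wins ties. The helper name longest_rule is hypothetical.

```python
import re

# Lexer rules transcribed from the question (DIGIT and LETTER are assumed).
INTEGER_VALUE = re.compile(r"[0-9]+")
IDENTIFIER = re.compile(r"[A-Za-z0-9_#:-]+")   # note the '-' in the character set


def longest_rule(text, pos):
    """Return (rule name, lexeme) for the longest match starting at pos.

    On a tie the earlier rule wins; INTEGER_VALUE is assumed to come first.
    """
    best = None
    for name, pattern in (("INTEGER_VALUE", INTEGER_VALUE), ("IDENTIFIER", IDENTIFIER)):
        m = pattern.match(text, pos)
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best


# Position 2 is the character right after "0." in both inputs.
print(longest_rule("0.3-0.2", 2))     # IDENTIFIER can swallow "3-0"
print(longest_rule("0.3 - 0.2", 2))   # the space stops IDENTIFIER after "3"
```

Because '-' is a legal IDENTIFIER character, the identifier rule produces a longer match than INTEGER_VALUE whenever a digit is directly followed by '-', which is the situation in the failing query.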

antlr4 doesn't parse obvious tree

I want to create a Grammar that will parse the input statement
myvar is 43+23
and
otherVar of myvar is "hallo"
But the parser doesn't recognize anything here.
(Sorry, I am not allowed to post images. Imagine a statement node with the tokens
[myvar] [is] [43] [+] [23] as children, all marked red. The same goes for the other statement.)
I'm getting error messages that confuse me:
line 2:7 no viable alternative at input 'myvaris'
line 3:19 no viable alternative at input 'otherVarofmyvaris'
Where have the spaces gone? I assume it's something with my lexer, but I can't see what the problem is. Just in case, here is the grammar for these statements:
statement
: envCall #call_Environment_Function
| identifier IS expression # assignment_statement // This one should be used
| loopHeader statement_block # loop_statement
etc...
expression
: '(' expression ')' #bracket_Expression
| mathExpression #math_Expression
| identifier #identifier_Expression // this one should be used
| objectExpression #object_Expression
etc ...
identifier //both of these should be used
: selector=IDENTIFIER OF object=expression #ofIdentifier
| selector=IDENTIFIER #idLocal
;
Here are all the lexer rules I have so far:
IdentifierNamespace: IDENTIFIER '.' IDENTIFIER;
FromIn: FROM | IN;
OPENBLOCK: NEWLINE? '{';
CLOSEBLOCK: '}' NEWLINE;
NEWLINE: ['\n''\t']+;
NUMBER: INT | FLOAT;
INT: [0-9]+;
FLOAT: [0-9]* '.' [0-9]+;
IsAre: IS | ARE;
OF: 'of';
IS: 'is';
ARE: 'are';
DO: 'do';
FROM: 'from';
IN: 'in';
IDENTIFIER : [a-zA-Z]+ ;
//WHITESPACE: [ \t]+ -> skip;
fragment UNICODE : 'u' HEX HEX HEX HEX ;
fragment HEX : [0-9a-fA-F] ;
fragment ESC : '\\' (["\\/bfnrt] | UNICODE) ;
STRING : '"' (ESC | ~["\\])* '"' ;
END: 'END'[.]* EOF;
WHITESPACE : ( '\t' | ' ' )+ -> skip ;
OK, found it. There was a compOP rule defined for the parser, and it was messing up the tree generation.
compOP: '<'
| '>'
| '=' // the programmers '=='
| '>='
| '<='
| '<>'
| '!='
| 'in'
| 'not' 'in'
| 'is' <- removed this one and it works now
;
So: never assign the same keyword to Parser and Lexer, I guess.

Matching of tokens with Antlr4

I am an ANTLR4 newbie and have problems with a relatively simple grammar. The grammar is given at the end of this question. (It is a fragment of a grammar for parsing descriptions of biological sequence variants.)
I am trying to parse the string "p.A3L" in the following unit test.
@Test
public void testProteinSubtitutionWithoutRef() {
ANTLRInputStream inputStream = new ANTLRInputStream("p.A3L");
HGVSLexer l = new HGVSLexer(inputStream);
HGVSParser p = new HGVSParser(new CommonTokenStream(l));
p.setTrace(true);
p.addErrorListener(new BaseErrorListener() {
@Override
public void syntaxError(Recognizer<?, ?> recognizer, Object offendingSymbol, int line,
int charPositionInLine, String msg, RecognitionException e) {
throw new IllegalStateException("failed to parse at line " + line + " due to " + msg, e);
}
});
p.hgvs();
}
The test fails with the message "line 1:2 mismatched input 'A3L' expecting AA". I assume that this is related to lexing, i.e. splitting "A3L" into the three tokens A, 3, and L, such that the parser can then generate the corresponding syntax subtree containing the three terminals from it.
What is going wrong here and where can I learn how to fix this?
The grammar
grammar HGVS;
hgvs: protein_var
;
// Basic lexemes
AA: AA1
| AA3
| 'X';
AA1: 'A'
| 'R'
| 'N'
| 'D'
| 'C'
| 'Q'
| 'E'
| 'G'
| 'H'
| 'I'
| 'L'
| 'K'
| 'M'
| 'F'
| 'P'
| 'S'
| 'T'
| 'W'
| 'Y'
| 'V';
AA3: 'Ala'
| 'Arg'
| 'Asn'
| 'Asp'
| 'Cys'
| 'Gln'
| 'Glu'
| 'Gly'
| 'His'
| 'Ile'
| 'Leu'
| 'Lys'
| 'Met'
| 'Phe'
| 'Pro'
| 'Ser'
| 'Thr'
| 'Trp'
| 'Tyr'
| 'Val';
NUMBER: [0-9]+;
NAME: [a-zA-Z0-9_]+;
// Top-level Rule
/** Variant in a protein. */
protein_var: 'p.' AA NUMBER AA
;
There are two problems:
1. Define the rule for protein_var ahead of the lexer rules (it should work as it is now too, but it is not easy to read because the other parser rule comes first).
2. Remove the rule for NAME. A3L is not lexed as AA NUMBER AA (as you probably expected) but as a single NAME token, because ANTLR always prefers the longest matching lexer rule.
The resulting grammar should look like:
grammar HGVS;
hgvs
: protein_var
;
protein_var
: 'p.' AA NUMBER AA
;
AA: ...;
AA3: ...;
AA1: ...;
NUMBER: [0-9]+;
If you need NAME for other purposes, you will have to disambiguate it in the lexer (by a prefix that NAMEs and AA do not have in common or by using lexer modes).
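The longest-match behaviour can be illustrated with a small tokenizer sketch. It is not ANTLR's implementation: it uses Python regular expressions, the rule name P_DOT is a hypothetical stand-in for the implicit 'p.' token, and ties are assumed to go to the rule listed first.

```python
import re

# AA as one pattern: three-letter codes first, then one-letter codes and X.
AA = (r"Ala|Arg|Asn|Asp|Cys|Gln|Glu|Gly|His|Ile|Leu|Lys|Met|Phe|Pro|Ser|Thr|Trp|Tyr|Val"
      r"|[ARNDCQEGHILKMFPSTWYV]|X")

BASE_RULES = [("P_DOT", r"p\."), ("AA", AA), ("NUMBER", r"[0-9]+")]
WITH_NAME = BASE_RULES + [("NAME", r"[a-zA-Z0-9_]+")]


def tokenize(text, rules):
    """Split text into (rule, lexeme) pairs; the longest match wins each time."""
    pos, tokens = 0, []
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            if m and len(m.group()) > best_len:   # strictly longer, so first rule wins ties
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


print(tokenize("p.A3L", WITH_NAME))    # NAME swallows "A3L" as one token
print(tokenize("p.A3L", BASE_RULES))   # without NAME: AA, NUMBER, AA
```

With NAME present, the three characters A3L form a single NAME token, which matches the "mismatched input 'A3L' expecting AA" error; with NAME removed, the same input splits into the three tokens the parser rule expects.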

Bison: Conflicts: 1 shift/reduce error

I'm trying to build a parser with bison and have narrowed all my errors down to one difficult one.
Here's the debug output of bison with the state where the error lies:
state 120
12 statement_list: statement_list . SEMICOLON statement
24 if_statement: IF conditional THEN statement_lists ELSE statement_list .
SEMICOLON shift, and go to state 50
SEMICOLON [reduce using rule 24 (if_statement)]
$default reduce using rule 24 (if_statement)
Here are the translation rules in the parser.y source
%%
program : ID COLON block ENDP ID POINT
;
block : CODE statement_list
| DECLARATIONS declaration_block CODE statement_list
;
declaration_block : id_list OF TYPE type SEMICOLON
| declaration_block id_list OF TYPE type SEMICOLON
;
id_list : ID
| ID COMMA id_list
;
type : CHARACTER
| INTEGER
| REAL
;
statement_list : statement
| statement_list SEMICOLON statement
;
statement_lists : statement
| statement_list SEMICOLON statement
;
statement : assignment_statement
| if_statement
| do_statement
| while_statement
| for_statement
| write_statement
| read_statement
;
assignment_statement : expression OUTPUTTO ID
;
if_statement : IF conditional THEN statement_lists ENDIF
| IF conditional THEN statement_lists ELSE statement_list
;
do_statement : DO statement_list WHILE conditional ENDDO
;
while_statement : WHILE conditional DO statement_list ENDWHILE
;
for_statement : FOR ID IS expression BY expressions TO expression DO statement_list ENDFOR
;
write_statement : WRITE BRA output_list KET
| NEWLINE
;
read_statement : READ BRA ID KET
;
output_list : value
| value COMMA output_list
;
condition : expression comparator expression
;
conditional : condition
| NOT conditional
| condition AND conditional
| condition OR conditional
;
comparator : ASSIGNMENT
| BETWEEN
| LT
| GT
| LESSEQUAL
| GREATEREQUAL
;
expression : term
| term PLUS expression
| term MINUS expression
;
expressions : term
| term PLUS expressions
| term MINUS expressions
;
term : value
| value MULTIPLY term
| value DIVIDE term
;
value : ID
| constant
| BRA expression KET
;
constant : number_constant
| CHARCONST
;
number_constant : NUMBER
| MINUS NUMBER
| NUMBER POINT NUMBER
| MINUS NUMBER POINT NUMBER
;
%%
When I remove the if_statement rule there are no errors, so I've narrowed it down considerably, but still can't solve the error.
Thanks for any help.
Consider this statement: if condition then s1 else s2; s3
There are two interpretations:
if condition then
    s1;
else
    s2;
    s3;
The other one is:
if condition then
    s1;
else
    s2;
s3;
In the first one, the statement list consists of a single if statement whose else branch contains both s2 and s3. In the second one, the statement list consists of the if statement followed by s3. That's where the ambiguity comes from. Bison prefers shift to reduce when a shift-reduce conflict exists, so in the above case the parser will choose to shift the SEMICOLON, which makes s3 part of the else branch.
Since you already have an ENDIF in your if-then statement, consider introducing an ENDIF in your if-then-else statement as well; that resolves the conflict.
I think you are missing ENDIF in the IF-THEN-ELSE-ENDIF rule.
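To see why a closing ENDIF removes the ambiguity, here is a minimal recursive-descent sketch of a simplified version of the grammar in which every if statement ends with endif. It is a hand-written illustration, not bison output; the lowercase keywords, the single-token condition, and the function names are all assumptions made for the example.

```python
def expect(tokens, pos, word):
    """Check that tokens[pos] is the given keyword and step past it."""
    if pos >= len(tokens) or tokens[pos] != word:
        raise SyntaxError(f"expected {word!r} at token {pos}")
    return pos + 1


def parse_statement(tokens, pos):
    if pos >= len(tokens):
        raise SyntaxError("unexpected end of input")
    if tokens[pos] != "if":
        return tokens[pos], pos + 1              # any other token is a simple statement
    if pos + 1 >= len(tokens):
        raise SyntaxError("missing condition")
    cond = tokens[pos + 1]                       # the condition is a single token here
    pos = expect(tokens, pos + 2, "then")
    then_branch, pos = parse_statement_list(tokens, pos)
    else_branch = None
    if pos < len(tokens) and tokens[pos] == "else":
        else_branch, pos = parse_statement_list(tokens, pos + 1)
    pos = expect(tokens, pos, "endif")           # endif closes the statement explicitly
    return ("if", cond, then_branch, else_branch), pos


def parse_statement_list(tokens, pos=0):
    stmt, pos = parse_statement(tokens, pos)
    stmts = [stmt]
    while pos < len(tokens) and tokens[pos] == ";":
        stmt, pos = parse_statement(tokens, pos + 1)
        stmts.append(stmt)
    return stmts, pos


def parse(source):
    tokens = source.split()
    tree, pos = parse_statement_list(tokens)
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos]!r}")
    return tree


print(parse("if c then s1 else s2 endif ; s3"))   # s3 follows the if statement
print(parse("if c then s1 else s2 ; s3 endif"))   # s3 belongs to the else branch
```

The position of endif now decides which statement list s3 belongs to, so each input has exactly one reading.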

ANTLR assignment expression disambiguation

The following grammar works, but also gives a warning:
test.g
grammar test;
options {
language = Java;
output = AST;
ASTLabelType = CommonTree;
}
program
: expr ';'!
;
term: ID | INT
;
assign
: term ('='^ expr)?
;
add : assign (('+' | '-')^ assign)*
;
expr: add
;
// T O K E N S
ID : (LETTER | '_') (LETTER | DIGIT | '_')* ;
INT : DIGIT+ ;
WS :
( ' '
| '\t'
| '\r'
| '\n'
) {$channel=HIDDEN;}
;
DOT : '.' ;
fragment
LETTER : ('a'..'z'|'A'..'Z') ;
fragment
DIGIT : '0'..'9' ;
Warning
[15:08:20] warning(200): C:\Users\Charles\Desktop\test.g:21:34:
Decision can match input such as "'+'..'-'" using multiple alternatives: 1, 2
As a result, alternative(s) 2 were disabled for that input
Again, it does produce a tree the way I want:
Input: 0 + a = 1 + b = 2 + 3;
ANTLR produces       | ... but I think it
this tree:           | gives the warning
                     | because it _could_
    +                | also be parsed this
   / \               | way:
  0   =              |
     / \             |               +
    a   +            |              / \
       / \           |             +   3
      1   =          |            / \
         / \         |           +   =
        b   +        |          / \ / \
           / \       |         0  = b  2
          2   3      |           / \
                     |          a   1
How can I explicitly tell ANTLR that I want it to create the AST on the left, thus making my intent clear and silencing the warning?
Charles wrote:
How can I explicitly tell ANTLR that I want it to create the AST on the left, thus making my intent clear and silencing the warning?
You shouldn't create two separate rules for assign and add. As your rules are now, assign has precedence over add, which you don't want: they should have equal precedence by looking at your desired AST. So, you need to wrap all operators +, - and = in one rule:
program
: expr ';'!
;
expr
: term (('+' | '-' | '=')^ expr)*
;
But now the grammar is still ambiguous. You'll need to "help" the parser look beyond this ambiguity, to make sure there really is an operator followed by an expr ahead when parsing (('+' | '-' | '=') expr)*. This can be done using a syntactic predicate, which looks like this:
(look_ahead_rule(s)_in_here)=> rule(s)_to_actually_parse
(the ( ... )=> is the predicate syntax)
A little demo:
grammar test;
options {
output=AST;
ASTLabelType=CommonTree;
}
program
: expr ';'!
;
expr
: term ((op expr)=> op^ expr)*
;
op
: '+'
| '-'
| '='
;
term
: ID
| INT
;
ID : (LETTER | '_') (LETTER | DIGIT | '_')* ;
INT : DIGIT+ ;
WS : (' ' | '\t' | '\r' | '\n') {$channel=HIDDEN;};
fragment LETTER : ('a'..'z'|'A'..'Z');
fragment DIGIT : '0'..'9';
which can be tested with the class:
import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;
import org.antlr.stringtemplate.*;
public class Main {
public static void main(String[] args) throws Exception {
String source = "0 + a = 1 + b = 2 + 3;";
testLexer lexer = new testLexer(new ANTLRStringStream(source));
testParser parser = new testParser(new CommonTokenStream(lexer));
CommonTree tree = (CommonTree)parser.program().getTree();
DOTTreeGenerator gen = new DOTTreeGenerator();
StringTemplate st = gen.toDOT(tree);
System.out.println(st);
}
}
And the output of the Main class corresponds to the following AST (image not preserved), which is created without any warnings from ANTLR.
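The shape of tree that a rule like expr : term (op expr)* builds, with each operator becoming the root of its two operands, can be sketched by hand. The following is not ANTLR-generated code; it is a small Python illustration with hypothetical function names, and it assumes the tokens are separated by spaces.

```python
OPERATORS = {"+", "-", "="}


def parse_expr(tokens, pos=0):
    """expr : term (op expr)*  with every operator becoming the root of its operands."""
    if pos >= len(tokens):
        raise SyntaxError("expected a term")
    left, pos = tokens[pos], pos + 1             # term: an ID or INT token
    while pos < len(tokens) and tokens[pos] in OPERATORS:
        op = tokens[pos]
        right, pos = parse_expr(tokens, pos + 1)  # the rest of the input becomes the right operand
        left = (op, left, right)
    return left, pos


def parse(source):
    tokens = source.split()
    tree, pos = parse_expr(tokens)
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos]!r}")
    return tree


print(parse("0 + a = 1 + b = 2 + 3"))
```

Because the recursive call consumes the whole remainder of the input, every operator takes everything to its right as its right operand, which gives the right-leaning tree shown on the left in the question.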
