Is something wrong with my grammar? - parsing

I am using Jison and have read the documentation on EBNF grammars, but I can't make my grammar work.
Here are the images of my grammar, input and error.
According to the error, the grammar recognizes just one instruction line, but the Kleene star should match zero or more instances.
I am new to Jison, so maybe I am not using EBNF the way it is meant to be used. I'd be grateful for any help.
The minimal complete version of my grammar:
METODO
: 'void' id '(' ')' '{' INSTR '}'
;
INSTR
: INSTRUCCION*
;
INSTRUCCION
: IF
| id '=' EXP ';'
| id ':' INSTR
;
Input:
void metodo_1(){
t2 = p + 1;
l2:
t6 = heap[t4];
print("%c", t6);
t5 = t5 + 1;
if t6 != 0 goto l2;
l0: }
Error: (image of the error message not included)
I added %ebnf at the beginning of my parser
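As a rough illustration of what the Kleene star in INSTR is expected to do, here is a minimal hand-written sketch in Python (not Jison): zero-or-more becomes a loop that keeps reading instructions until the closing brace. Everything here is an assumption made for the sketch: the `Parser` class and its method names are hypothetical, a label is treated as a standalone instruction (`id ':'`) because nesting `INSTR` inside `INSTRUCCION` leaves it unclear which list the following instructions belong to, and EXP is simply "all tokens up to the semicolon". The `IF` and `print` forms from the input are not handled.

```python
import re


def tokenize(text):
    # Identifiers, integers, or any single non-space symbol.
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", text)


class Parser:
    def __init__(self, text):
        self.tokens = tokenize(text)
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def expect(self, value):
        if self.peek() != value:
            raise SyntaxError(f"expected {value!r}, got {self.peek()!r}")
        self.pos += 1

    def ident(self):
        tok = self.peek()
        if tok is None or not re.fullmatch(r"[A-Za-z_]\w*", tok):
            raise SyntaxError(f"expected identifier, got {tok!r}")
        self.pos += 1
        return tok

    def metodo(self):
        # METODO : 'void' id '(' ')' '{' INSTR '}'
        self.expect("void")
        name = self.ident()
        for symbol in "(){":
            self.expect(symbol)
        body = self.instr()
        self.expect("}")
        return name, body

    def instr(self):
        # INSTR : INSTRUCCION*  (loop until no instruction can start)
        items = []
        while self.peek() not in (None, "}"):
            items.append(self.instruccion())
        return items

    def instruccion(self):
        name = self.ident()
        if self.peek() == ":":  # label, assumed to stand alone
            self.pos += 1
            return ("label", name)
        self.expect("=")  # assignment: id '=' EXP ';'
        exp = []
        while self.peek() not in (None, ";"):
            exp.append(self.peek())
            self.pos += 1
        self.expect(";")
        return ("assign", name, exp)
```

Calling `Parser(source).metodo()` returns the method name and a list with one entry per instruction, so an empty body yields an empty list and a trailing label such as `l0:` is just the last entry.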

Related

Verifying if an expression conforms to restrictive context-free grammar

I'm trying to write a parser that accepts a toy language for a software project class. Part of the production rules relevant to the question in EBNF-like syntax is given here (there's way more relational operators, but I've removed some of them to keep it simple):
cond_expr = rel_expr
| '!' '(' cond_expr ')'
| '(' cond_expr ')' '&&' '(' cond_expr ')' ;
rel_expr = rel_factor '==' rel_factor
| rel_factor '!=' rel_factor ;
rel_factor = VAR | INTEGER | expr ;
expr = expr '+' term
| expr '-' term
| term ;
term = term '*' factor
| term '/' factor
| factor ;
factor = VAR | INTEGER | '(' expr ')' ;
VAR = [a-zA-Z][a-zA-Z0-9]* ;
INTEGER = '0' | [1-9][0-9]* ;
I've written more or less the entire parser already. I used recursive descent for the majority of the language, except for expressions, which I decided to parse with the shunting yard algorithm (because I couldn't get recursive descent to work even after left recursion elimination and left factoring).
The real problem is the cond_expr rule: shunting yard is too permissive for this grammar, i.e. it accepts conditional expressions that the grammar rejects. For example, the grammar does not accept (x == 1), nor !(x == 1) || (y == 1). I would use recursive descent to check whether an expression is accepted, but the issue is the rel_expr alternative of cond_expr: rel_expr can be expanded to rel_factor '==' rel_factor or rel_factor '!=' rel_factor, and each rel_factor can in turn be expanded to '(' expr ')'. So on seeing a '(' token, the cond_expr method cannot tell from one token of lookahead which branch to take (I'm not sure "ambiguity" is the right term for this). Something like the below:
Expression cond_expr() {
    if (next() == "!") {
        expect("!");
        expect("(");
        auto cond = cond_expr();
        expect(")");
        return cond;
    } else if (next() == "(") {
        // this will fail for e.g (x + 1) == 2
        expect("(");
        auto cond1 = cond_expr();
        expect(")");
        expect("&&");
        expect("(");
        auto cond2 = cond_expr();
        expect(")");
        return Node("&&", cond1, cond2);
    } else {
        return rel_expr();
    }
}
My current strategy is to first validate, using some subroutine, that the expression is accepted by the grammar, and then call the shunting yard algorithm to parse it into the required AST. However, I'm having a lot of trouble writing this validation subroutine. Does anyone have suggestions on how to solve this?
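One possible shape for such a validation subroutine, sketched in Python rather than C++, is recursive descent with backtracking: on '(' try the '&&' alternative first, and if it fails, rewind and retry the same position as a rel_expr. The sketch assumes the left-recursive expr and term rules are rewritten as loops and that rel_factor is effectively expr. All function names (`tokenize`, `chain`, `lit`, `accepts`, and so on) are hypothetical. Each function returns the position after a successful match, or None.

```python
import re

OPERAND = r"[A-Za-z][A-Za-z0-9]*|0|[1-9][0-9]*"  # VAR or INTEGER


def tokenize(text):
    return re.findall(r"[A-Za-z][A-Za-z0-9]*|\d+|==|!=|&&|\S", text)


def lit(toks, i, symbol):
    # Match one literal token; propagates an earlier failure (i is None).
    if i is not None and i < len(toks) and toks[i] == symbol:
        return i + 1
    return None


def chain(toks, i, operand, ops):
    # operand (op operand)*, the loop form of a left-recursive rule.
    i = operand(toks, i)
    while i is not None and i < len(toks) and toks[i] in ops:
        i = operand(toks, i + 1)
    return i


def factor(toks, i):
    if i < len(toks) and re.fullmatch(OPERAND, toks[i]):
        return i + 1
    if i < len(toks) and toks[i] == "(":
        return lit(toks, expr(toks, i + 1), ")")
    return None


def term(toks, i):
    return chain(toks, i, factor, ("*", "/"))


def expr(toks, i):
    return chain(toks, i, term, ("+", "-"))


def rel_expr(toks, i):
    i = expr(toks, i)
    if i is None or i >= len(toks) or toks[i] not in ("==", "!="):
        return None
    return expr(toks, i + 1)


def cond_expr(toks, i):
    if i < len(toks) and toks[i] == "!":
        j = lit(toks, i + 1, "(")
        j = cond_expr(toks, j) if j is not None else None
        return lit(toks, j, ")")
    if i < len(toks) and toks[i] == "(":
        # First try '(' cond_expr ')' '&&' '(' cond_expr ')'.
        j = cond_expr(toks, i + 1)
        j = lit(toks, lit(toks, lit(toks, j, ")"), "&&"), "(")
        j = cond_expr(toks, j) if j is not None else None
        j = lit(toks, j, ")")
        if j is not None:
            return j
        # Backtrack: the '(' may open an arithmetic expr, e.g. (x + 1) == 2.
    return rel_expr(toks, i)


def accepts(text):
    toks = tokenize(text)
    return cond_expr(toks, 0) == len(toks)
```

With this sketch, `accepts("(x + 1) == 2")` and `accepts("(x == 1) && (y != 2)")` are true, while `accepts("(x == 1)")` and `accepts("!(x == 1) || (y == 1)")` are false. Trying the alternatives in a fixed order is safe here because a parenthesized cond_expr always contains a relational operator and a parenthesized expr never does, so the two cannot match the same group.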

Creating Bison File for Simple Grammar

I have the following simple grammar:
E -> T | ^ v . E
T -> F T1
T1 -> F T1 | epsilon
F -> ( E ) | v
I'm pretty new to Bison, so I was hoping someone could help show me how to write it out in that format. All I have so far is the following, but I'm not sure if it's correct:
%left '.'
%left 'v'
%% /* The grammar follows. */
exp:
term {printf("1");}
| '^' 'v' '.' exp {printf("2");}
;
term:
factor term1 {printf("3");}
;
term1:
factor term1 {printf("4");}
| {printf("5");}
;
factor:
'(' exp ')' {printf("6");}
| 'v' {printf("7");}
;
%%
You are missing the closing semicolon from several of the productions. There's nothing in the source grammar to suggest you need the productions about lines.
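To sanity-check test inputs before feeding them to the generated Bison parser, a small recursive-descent recognizer for the same source grammar can help. The following is a sketch in Python, not Bison; the name `recognizes` is hypothetical, and whitespace between symbols is assumed to be insignificant. The rule T1 -> F T1 | epsilon is written as a loop.

```python
def recognizes(text):
    """Return True if text is derivable from E in:
    E -> T | ^ v . E ;  T -> F T1 ;  T1 -> F T1 | epsilon ;  F -> ( E ) | v
    """
    tokens = [c for c in text if not c.isspace()]
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def eat(symbol):
        nonlocal pos
        if peek() != symbol:
            raise SyntaxError(f"expected {symbol!r}, got {peek()!r}")
        pos += 1

    def parse_e():
        if peek() == "^":  # E -> ^ v . E
            eat("^")
            eat("v")
            eat(".")
            parse_e()
        else:  # E -> T
            parse_t()

    def parse_t():
        parse_f()
        while peek() in ("(", "v"):  # T1 -> F T1 | epsilon
            parse_f()

    def parse_f():
        if peek() == "(":  # F -> ( E )
            eat("(")
            parse_e()
            eat(")")
        else:  # F -> v
            eat("v")

    try:
        parse_e()
    except SyntaxError:
        return False
    return pos == len(tokens)
```

For example, `recognizes("^v.vv")` and `recognizes("(^v.v)v")` are true, while `recognizes("^v")` is false because the '.' and the body are missing.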

Antlr parser for and/or logic - how to get expressions between logic operators?

I am using ANTLR to create an and/or parser+evaluator. Expressions will have the format like:
x eq 1 && y eq 10
(x lt 10 && x gt 1) OR x eq -1
I was reading this post on logic expressions in ANTLR, "Looking for advice on project. Parsing logical expression", and I found the grammar posted there a good start:
grammar Logic;
parse
: expression EOF
;
expression
: implication
;
implication
: or ('->' or)*
;
or
: and ('||' and)*
;
and
: not ('&&' not)*
;
not
: '~' atom
| atom
;
atom
: ID
| '(' expression ')'
;
ID : ('a'..'z' | 'A'..'Z')+;
Space : (' ' | '\t' | '\r' | '\n')+ {$channel=HIDDEN;};
However, while getting a tree from the parser works for expressions where the variables are just one character (i.e., "(A || B) AND C"), I am having a hard time adapting this to my case. In the example "x eq 1 && y eq 10" I'd expect one "AND" parent and two children, "x eq 1" and "y eq 10"; see the test case below.
@Test
public void simpleAndEvaluation() throws RecognitionException {
    String src = "1 eq 1 && B";
    LogicLexer lexer = new LogicLexer(new ANTLRStringStream(src));
    LogicParser parser = new LogicParser(new CommonTokenStream(lexer));
    CommonTree tree = (CommonTree)parser.parse().getTree();
    assertEquals("&&",tree.getText());
    assertEquals("1 eq 1",tree.getChild(0).getText());
    assertEquals("a neq a",tree.getChild(1).getText());
}
I believe this is related with the "ID". What would the correct syntax be?
For those interested, I made some improvements to my grammar file (see below).
Current limitations:
It only works with &&/||, not AND/OR (not very problematic).
You can't have spaces between the parentheses and the &&/|| (I work around that by replacing " (" with "(" and ") " with ")" in the source String before feeding the lexer).
grammar Logic;
options {
output = AST;
}
tokens {
AND = '&&';
OR = '||';
NOT = '~';
}
// parser/production rules start with a lower case letter
parse
: expression EOF! // omit the EOF token
;
expression
: or
;
or
: and (OR^ and)* // make `||` the root
;
and
: not (AND^ not)* // make `&&` the root
;
not
: NOT^ atom // make `~` the root
| atom
;
atom
: ID
| '('! expression ')'! // omit both `(` and `)`
;
// lexer/terminal rules start with an upper case letter
ID
:
(
'a'..'z'
| 'A'..'Z'
| '0'..'9' | ' '
| SYMBOL
)+
;
SYMBOL
:
('+'|'-'|'*'|'/'|'_')
;
ID : ('a'..'z' | 'A'..'Z')+;
states that an identifier is a sequence of one or more letters, but does not allow any digits. Try
ID : ('a'..'z' | 'A'..'Z' | '0'..'9')+;
which will allow e.g. abc, 123, 12ab, and ab12. If you don't want the latter types, you'll have to restructure the rule a little bit (left as a challenge...)
In order to accept arbitrarily many identifiers, you could define atom as ID+ instead of ID.
Also, you will likely need to specify AND, OR, -> and ~ as tokens so that, as @Bart Kiers says, the first two won't get classified as ID, and so that the latter two will get recognized at all.
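To make the expected tree shape concrete, here is a small Python sketch (not ANTLR) in which `&&`, `||` and parentheses are separate tokens and every run of other words between them is grouped into one leaf such as "x eq 1". The function names are hypothetical, the `~` operator is left out for brevity, and only the `&&`/`||` spellings are handled.

```python
import re


def tokenize(text):
    # Operators and parentheses are their own tokens; anything else is a word.
    return re.findall(r"&&|\|\||[()]|[^\s()&|]+|\S", text)


def parse(text):
    tokens = tokenize(text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def or_expr():
        node = and_expr()
        while peek() == "||":  # lowest precedence, left associative
            advance()
            node = ("||", node, and_expr())
        return node

    def and_expr():
        node = atom()
        while peek() == "&&":  # binds tighter than ||
            advance()
            node = ("&&", node, atom())
        return node

    def atom():
        if peek() == "(":
            advance()
            node = or_expr()
            if peek() != ")":
                raise SyntaxError("expected ')'")
            advance()
            return node
        words = []
        while peek() not in (None, "&&", "||", "(", ")"):
            words.append(advance())
        if not words:
            raise SyntaxError(f"expected a comparison, got {peek()!r}")
        return " ".join(words)  # one leaf, e.g. "x eq 1"

    tree = or_expr()
    if peek() is not None:
        raise SyntaxError(f"unexpected {peek()!r}")
    return tree
```

Here `parse("x eq 1 && y eq 10")` gives `("&&", "x eq 1", "y eq 10")`, which is the one-parent, two-children shape asked for in the question. Because the words are joined after tokenizing, spaces next to parentheses do not need to be stripped beforehand.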

Assignment as expression in Antlr grammar

I'm trying to extend the grammar of the Tiny Language to treat assignment as expression. Thus it would be valid to write
a = b = 1; // -> a = (b = 1)
a = 2 * (b = 1); // contrived but valid
a = 1 = 2; // invalid
Assignment differs from other operators in two aspects. It's right associative (not a big deal), and its left-hand side has to be a variable. So I changed the grammar like this
statement: assignmentExpr | functionCall ...;
assignmentExpr: Identifier indexes? '=' expression;
expression: assignmentExpr | condExpr;
It doesn't work, because it contains a non-LL(*) decision. I also tried this variant:
assignmentExpr: Identifier indexes? '=' (expression | condExpr);
but I got the same error. I am interested in
This specific question
Given a grammar with a non-LL(*) decision, how to find the two paths that cause the problem
How to fix it
I think you can change your grammar like this to achieve the same, without using syntactic predicates:
statement: Expr ';' | functionCall ';'...;
Expr: Identifier indexes? '=' Expr | condExpr ;
condExpr: .... and so on;
I altered Bart's example with this idea in mind:
grammar TL;
options {
output=AST;
}
tokens {
ROOT;
}
parse
: stat+ EOF -> ^(ROOT stat+)
;
stat
: expr ';'
;
expr
: Id Assign expr -> ^(Assign Id expr)
| add
;
add
: mult (('+' | '-')^ mult)*
;
mult
: atom (('*' | '/')^ atom)*
;
atom
: Id
| Num
| '('! expr ')' !
;
Assign : '=' ;
Comment : '//' ~('\r' | '\n')* {skip();};
Id : 'a'..'z'+;
Num : '0'..'9'+;
Space : (' ' | '\t' | '\r' | '\n')+ {skip();};
And for the input:
a=b=4;
a = 2 * (b = 1);
you get the following parse tree:
The key here is that you need to "assure" the parser that inside an expression, there is something ahead that satisfies the expression. This can be done using a syntactic predicate (the ( ... )=> parts in the add and mult rules).
A quick demo:
grammar TL;
options {
output=AST;
}
tokens {
ROOT;
ASSIGN;
}
parse
: stat* EOF -> ^(ROOT stat*)
;
stat
: expr ';' -> expr
;
expr
: add
;
add
: mult ((('+' | '-') mult)=> ('+' | '-')^ mult)*
;
mult
: atom ((('*' | '/') atom)=> ('*' | '/')^ atom)*
;
atom
: (Id -> Id) ('=' expr -> ^(ASSIGN Id expr))?
| Num
| '(' expr ')' -> expr
;
Comment : '//' ~('\r' | '\n')* {skip();};
Id : 'a'..'z'+;
Num : '0'..'9'+;
Space : (' ' | '\t' | '\r' | '\n')+ {skip();};
which will parse the input:
a = b = 1; // -> a = (b = 1)
a = 2 * (b = 1); // contrived but valid
into the following AST:
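The underlying idea can also be sketched outside ANTLR. The Python below is a hand-written illustration, not generated parser code, and all names in it are hypothetical: an identifier immediately followed by '=' starts an assignment, and recursing on the right-hand side makes '=' right associative. Trees are nested tuples, with numbers converted to int.

```python
import re


def tokenize(text):
    return re.findall(r"[A-Za-z]+|\d+|\S", text)


def parse_statement(text):
    tokens = tokenize(text)
    pos = 0

    def peek(offset=0):
        i = pos + offset
        return tokens[i] if i < len(tokens) else None

    def advance():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def expr():
        # Look one token past an identifier to spot an assignment.
        if peek() is not None and peek().isalpha() and peek(1) == "=":
            name = advance()
            advance()  # consume '='
            return ("=", name, expr())  # recursion gives right associativity
        return add()

    def add():
        node = mult()
        while peek() in ("+", "-"):
            op = advance()
            node = (op, node, mult())
        return node

    def mult():
        node = atom()
        while peek() in ("*", "/"):
            op = advance()
            node = (op, node, atom())
        return node

    def atom():
        tok = peek()
        if tok == "(":
            advance()
            node = expr()
            if peek() != ")":
                raise SyntaxError("expected ')'")
            advance()
            return node
        if tok is not None and tok.isdigit():
            return int(advance())
        if tok is not None and tok.isalpha():
            return advance()
        raise SyntaxError(f"unexpected {tok!r}")

    tree = expr()
    if peek() != ";":
        raise SyntaxError(f"expected ';', got {peek()!r}")
    advance()
    if peek() is not None:
        raise SyntaxError(f"unexpected {peek()!r}")
    return tree
```

In this sketch `parse_statement("a = b = 1;")` yields `("=", "a", ("=", "b", 1))`, and `a = 1 = 2;` is rejected because `1` is not an identifier, so the second '=' is left over when the ';' is expected.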

Why does ANTLR not parse the entire input?

I am quite new to ANTLR, so this is likely a simple question.
I have defined a simple grammar which is supposed to include arithmetic expressions with numbers and identifiers (strings that start with a letter and continue with one or more letters or numbers.)
The grammar looks as follows:
grammar while;
@lexer::header {
package ConFreeG;
}
@header {
package ConFreeG;
import ConFreeG.IR.*;
}
@parser::members {
}
arith:
term
| '(' arith ( '-' | '+' | '*' ) arith ')'
;
term returns [AExpr a]:
NUM
{
int n = Integer.parseInt($NUM.text);
a = new Num(n);
}
| IDENT
{
a = new Var($IDENT.text);
}
;
fragment LOWER : ('a'..'z');
fragment UPPER : ('A'..'Z');
fragment NONNULL : ('1'..'9');
fragment NUMBER : ('0' | NONNULL);
IDENT : ( LOWER | UPPER ) ( LOWER | UPPER | NUMBER )*;
NUM : '0' | NONNULL NUMBER*;
fragment NEWLINE:'\r'? '\n';
WHITESPACE : ( ' ' | '\t' | NEWLINE )+ { $channel=HIDDEN; };
I am using ANTLR v3 with the ANTLR IDE Eclipse plugin. When I parse the expression (8 + a45) using the interpreter, only part of the parse tree is generated:
Why does the second term (a45) not get parsed? The same happens if both terms are numbers.
You'll want to create a parser rule that has an EOF (end of file) token in it so that the parser will be forced to go through the entire token stream.
Add this rule to your grammar:
parse
: arith EOF
;
and let the interpreter start at that rule instead of the arith rule.
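The effect of requiring EOF can be illustrated with a small hand-written Python sketch (an analogy, not ANTLR-generated code; all names are hypothetical). `parse_prefix` behaves like starting at arith: it stops as soon as one complete arith has been read. `parse_all` behaves like the parse rule: it additionally demands that no tokens remain.

```python
import re


def tokenize(text):
    return re.findall(r"[A-Za-z][A-Za-z0-9]*|\d+|\S", text)


def arith(tokens, pos):
    """arith : term | '(' arith ('-' | '+' | '*') arith ')'
    Returns (tree, position after the match)."""
    if pos >= len(tokens):
        raise SyntaxError("unexpected end of input")
    tok = tokens[pos]
    if tok == "(":
        left, pos = arith(tokens, pos + 1)
        if pos >= len(tokens) or tokens[pos] not in ("-", "+", "*"):
            raise SyntaxError("expected an operator")
        op = tokens[pos]
        right, pos = arith(tokens, pos + 1)
        if pos >= len(tokens) or tokens[pos] != ")":
            raise SyntaxError("expected ')'")
        return (op, left, right), pos + 1
    if tok[0].isalnum():  # NUM or IDENT
        return tok, pos + 1
    raise SyntaxError(f"unexpected {tok!r}")


def parse_prefix(text):
    # Like starting at arith: trailing tokens are silently ignored.
    tree, _ = arith(tokenize(text), 0)
    return tree


def parse_all(text):
    # Like parse : arith EOF, where the whole token stream must be consumed.
    tokens = tokenize(text)
    tree, pos = arith(tokens, 0)
    if pos != len(tokens):
        raise SyntaxError(f"unexpected {tokens[pos]!r} after a complete arith")
    return tree
```

For the unparenthesized input `8 + a45`, `parse_prefix` returns just `"8"` and ignores the rest, whereas `parse_all` raises a SyntaxError. For `(8 + a45)` both return `("+", "8", "a45")`.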
