I'm trying to write an ANTLR grammar for data in which newlines can be skipped in one part but may be significant in another. More specifically, I want to skip NewLine inside parentheses and would like to implement that using lexer modes. The problem is that DEFAULT_MODE has a lot of lexer rules, and the tokens described by those rules can appear inside parentheses too. How can I solve this?
Maybe the current state of my code will help clarify the question:
// ...
LPAREN : '(' -> pushMode(InsideParen) ;
// ...
mode InsideParen ;
InsideParenNewLine : ('\r'? '\n') -> skip ;
// here I want somehow recognize all tokens from DEFAULT_MODE without rewriting all rules
RPAREN: ')' -> popMode ;
Thank you in advance.
As soon as I saw this question, I thought that your problem resembled that for Python newline handling. But then I noticed you were using pushMode, which is not an ANTLR4 construct...
If you are willing to upgrade to ANTLR4, however, you can take advantage of something like:
LINENDING: (('\r'? '\n')+ {self._lineContinuation=False}
    | '\\' [ \t]* ('\r'? '\n') {self._lineContinuation=True})
    {
if self._openBRCount == 0 and not self._lineContinuation:
    if not self._suppressNewlines:
        self.emitNewline()
        self._suppressNewlines = True
    la = self._input.LA(1)
    if la not in [ord(' '), ord('\t'), ord('#')]:
        self._suppressNewlines = False
        self.emitFullDedent()
    } -> channel(HIDDEN)
    ;
OPEN_PAREN: '(' {self._openBRCount += 1};
CLOSE_PAREN: ')' {self._openBRCount -= 1};
OPEN_BRACE: '{' {self._openBRCount += 1};
CLOSE_BRACE: '}' {self._openBRCount -= 1};
OPEN_BRACKET: '[' {self._openBRCount += 1};
CLOSE_BRACKET: ']' {self._openBRCount -= 1};
UNKNOWN: . -> skip;
This will make your grammar act like Python with regard to whitespace. You may need some tweaks so that it acts on parentheses instead of line continuation characters. See this python grammar.
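To show the bracket-counting idea on its own, outside ANTLR, here is a minimal Python sketch. The token names and the regular expression are assumptions made up for this example; only the depth counter corresponds to _openBRCount in the rules above.

```python
import re

# Hypothetical token set for this sketch; not taken from any real grammar.
TOKEN_RE = re.compile(
    r"(?P<NEWLINE>\r?\n)|(?P<WS>[ \t]+)|(?P<NAME>[A-Za-z_]\w*)"
    r"|(?P<INT>\d+)|(?P<OPEN>[(\[{])|(?P<CLOSE>[)\]}])|(?P<OP>.)"
)

def tokenize(text):
    """Yield (kind, text) pairs, dropping newlines that occur inside brackets."""
    depth = 0  # plays the role of _openBRCount
    for match in TOKEN_RE.finditer(text):
        kind, value = match.lastgroup, match.group()
        if kind == "WS":
            continue  # spaces and tabs are never significant here
        if kind == "OPEN":
            depth += 1
        elif kind == "CLOSE":
            depth = max(depth - 1, 0)  # tolerate unbalanced closers
        elif kind == "NEWLINE" and depth > 0:
            continue  # newline inside brackets: skip it
        yield kind, value

for kind, value in tokenize("a = (1 +\n 2)\nb = 3\n"):
    print(kind, repr(value))
```

The newline after the + is dropped because the depth is 1 at that point, while the two newlines at depth 0 come through as NEWLINE tokens. All other rules stay in a single mode, so nothing has to be duplicated.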
I'm trying to write a parser that accepts a toy language for a software project class. Part of the production rules relevant to the question in EBNF-like syntax is given here (there's way more relational operators, but I've removed some of them to keep it simple):
cond_expr = rel_expr
          | '!' '(' cond_expr ')'
          | '(' cond_expr ')' '&&' '(' cond_expr ')' ;
rel_expr = rel_factor '==' rel_factor
         | rel_factor '!=' rel_factor ;
rel_factor = VAR | INTEGER | expr ;
expr = expr '+' term
     | expr '-' term
     | term ;
term = term '*' factor
     | term '/' factor
     | factor ;
factor = VAR | INTEGER | '(' expr ')' ;
VAR = [a-zA-Z][a-zA-Z0-9]* ;
INTEGER = '0' | [1-9][0-9]* ;
I've written more or less the entire parser already. I used recursive descent for the majority of the language except for expressions, which I decided to parse with the shunting yard algorithm (because I couldn't get recursive descent to work even after left recursion elimination/left factoring).
The real problem I have is in the cond_expr rule: shunting yard is too permissive for this grammar, i.e. it accepts conditional expressions that the grammar does not. For example, the grammar accepts neither (x == 1) nor !(x == 1) || (y == 1). I would use recursive descent to check whether the expression can be accepted, but the issue is with the rel_expr in cond_expr: rel_expr can be replaced by rel_factor '==' rel_factor or rel_factor '!=' rel_factor, and each rel_factor can in turn be replaced by '(' expr ')'. This leads to ambiguity (I don't know if that's the correct term) when deciding which branch to take in the cond_expr method upon seeing a '(' token. Something like the below:
Expression cond_expr() {
    if (next() == "!") {
        expect("!");
        expect("(");
        auto cond = cond_expr();
        expect(")");
        return cond;
    } else if (next() == "(") {
        // this will fail for e.g (x + 1) == 2
        expect("(");
        auto cond1 = cond_expr();
        expect(")");
        expect("&&");
        expect("(");
        auto cond2 = cond_expr();
        expect(")");
        return Node("&&", cond1, cond2);
    } else {
        return rel_expr();
    }
}
My current strategy is to first validate, using some subroutine, that the expression can be accepted by the grammar, and then call the shunting yard algorithm to parse it into the required AST. However, I'm having a lot of trouble writing this validation subroutine. Does anyone have suggestions for how to solve this?
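One possible way to handle the '(' decision, offered only as a sketch and not as something taken from this thread, is to try the '(' cond_expr ')' '&&' ... alternative first and rewind to the saved position if it fails, then retry the input as a rel_expr. The Python below assumes a simple regex tokenizer (which, unlike the grammar, also accepts integers with leading zeros) and builds tuples as a stand-in AST; all names are hypothetical.

```python
import re

def tokenize(text):
    # Two-character operators first; \S turns any other character into a token
    # so that unknown symbols are rejected by the parser instead of ignored.
    return re.findall(r"==|!=|&&|[A-Za-z][A-Za-z0-9]*|\d+|\S", text)

class Parser:
    def __init__(self, text):
        self.toks = tokenize(text)
        self.pos = 0

    def peek(self):
        return self.toks[self.pos] if self.pos < len(self.toks) else None

    def expect(self, tok):
        if self.peek() != tok:
            raise SyntaxError(f"expected {tok!r}, got {self.peek()!r}")
        self.pos += 1

    def cond_expr(self):
        if self.peek() == "!":
            self.expect("!")
            self.expect("(")
            cond = self.cond_expr()
            self.expect(")")
            return ("!", cond)
        if self.peek() == "(":
            start = self.pos  # remember where this alternative began
            try:
                self.expect("(")
                left = self.cond_expr()
                self.expect(")")
                self.expect("&&")
                self.expect("(")
                right = self.cond_expr()
                self.expect(")")
                return ("&&", left, right)
            except SyntaxError:
                self.pos = start  # backtrack and retry as a rel_expr
        return self.rel_expr()

    def rel_expr(self):
        left = self.expr()
        op = self.peek()
        if op not in ("==", "!="):
            raise SyntaxError(f"expected relational operator, got {op!r}")
        self.pos += 1
        return (op, left, self.expr())

    def expr(self):
        node = self.term()
        while self.peek() in ("+", "-"):
            op = self.peek()
            self.pos += 1
            node = (op, node, self.term())
        return node

    def term(self):
        node = self.factor()
        while self.peek() in ("*", "/"):
            op = self.peek()
            self.pos += 1
            node = (op, node, self.factor())
        return node

    def factor(self):
        tok = self.peek()
        if tok == "(":
            self.expect("(")
            node = self.expr()
            self.expect(")")
            return node
        if tok is not None and re.fullmatch(r"[A-Za-z][A-Za-z0-9]*|\d+", tok):
            self.pos += 1
            return tok
        raise SyntaxError(f"unexpected token {tok!r}")

def parse(text):
    parser = Parser(text)
    tree = parser.cond_expr()
    if parser.peek() is not None:
        raise SyntaxError(f"trailing input at {parser.peek()!r}")
    return tree
```

With this, parse("(x + 1) == 2") and parse("(x == 1) && (y != 2)") both succeed, while parse("(x == 1)") raises SyntaxError, as the grammar requires. Backtracking can be slow on deeply nested parentheses, so this is a trade of speed for simplicity.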
I am using jison and I have read the documentation on EBNF grammars, but I can't make my grammar work.
Here are the images of my grammar, input and error
Judging by the error, the grammar recognizes just one line, but the Kleene star should match zero or more instances.
I am new to jison, so maybe I'm not using EBNF the right way. I'd be grateful for any help.
The minimal complete version of my grammar:
METODO
    : 'void' id '(' ')' '{' INSTR '}'
    ;
INSTR
    : INSTRUCCION*
    ;
INSTRUCCION
    : IF
    | id '=' EXP ';'
    | id ':' INSTR
    ;
Input:
void metodo_1(){
t2 = p + 1;
l2:
t6 = heap[t4];
print("%c", t6);
t5 = t5 + 1;
if t6 != 0 goto l2;
l0: }
Error:
Error
I added %ebnf at the beginning of my parser
I have a grammar for arithmetic expressions which evaluates a number of expressions (one per line) in a text file. When compiling it with YACC I get a message about 2 shift/reduce conflicts, but my calculations are correct. If the parser gives the proper output, how does it resolve the shift/reduce conflict? And in my case, is there any way to solve it in the YACC grammar?
YACC GRAMMAR
Calc : Expr               {printf(" = %d\n",$1);}
     | Calc Expr          {printf(" = %d\n",$2);}
     | error              {yyerror("\nBad Expression\n ");}
     ;
Expr : Term               { $$ = $1; }
     | Expr '+' Term      { $$ = $1 + $3; }
     | Expr '-' Term      { $$ = $1 - $3; }
     ;
Term : Fact               { $$ = $1; }
     | Term '*' Fact      { $$ = $1 * $3; }
     | Term '/' Fact      { if($3==0){
                                yyerror("Divide by Zero Encountered.");
                                break;}
                            else
                                $$ = $1 / $3;
                          }
     ;
Fact : Prim               { $$ = $1; }
     | '-' Prim           { $$ = -$2; }
     ;
Prim : '(' Expr ')'       { $$ = $2; }
     | Id                 { $$ = $1; }
     ;
Id   : NUM                { $$ = yylval; }
     ;
What change should I make to remove such conflicts from my grammar?
Bison/yacc resolves shift-reduce conflicts by choosing to shift. This is explained in the bison manual in the section on Shift-Reduce conflicts.
Your problem is that your input is just a series of Exprs, run together without any delimiter between them. That means that:
4 - 2
could be one expression (4-2) or it could be two expressions (4, -2). Since bison-generated parsers always prefer to shift, the parser will choose to parse it as one expression, even if it were typed on two lines:
4
-2
If you want to allow users to type their expressions like that, without any separator, then you could either live with the conflict (since it is relatively benign) or you could codify it into your grammar, but that's quite a bit more work. To put it into the grammar, you need to define two different types of Expr: one (which is the one you use at the top level) cannot start with a unary minus, and the other one (which you can use anywhere else) is allowed to start with a unary minus.
I suspect that what you really want to do is use newlines or some other kind of expression separator. That's as simple as passing the newline through to your parser and changing Calc to Calc: | Calc '\n' | Calc Expr '\n'.
I'm sure that this appears somewhere else on SO, but I can't find it. So here is how you disallow the use of unary minus at the beginning of an expression, so that you can run expressions together without delimiters. The non-terminals starting n_ cannot start with a unary minus:
input: %empty | input n_expr { /* print $2 */ }
expr: term | expr '+' term | expr '-' term
n_expr: n_term | n_expr '+' term | n_expr '-' term
term: factor | term '*' factor | term '/' factor
n_term: value | n_term '*' factor | n_term '/' factor
factor: value | '-' factor
value: NUM | '(' expr ')'
That parses the same language as your grammar, but without generating the shift-reduce conflict. Since it parses the same language, the input
4
-2
will still be parsed as a single expression; to get the expected result you would need to type
4
(-2)
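To see that behaviour without running bison, here is a rough hand-written recursive-descent evaluator in Python that mirrors the n_ grammar above. It is only a sketch under some assumptions: division is left out, characters outside the token set are silently ignored, and the function names are invented for this example. The minus_ok flag plays the role of the n_ prefix: a top-level expression may not start with a unary minus.

```python
import re

def evaluate_all(text):
    """Evaluate a run of undelimited expressions; return one value per expression."""
    # Whitespace (newlines included) never reaches the parser.
    toks = re.findall(r"\d+|[-+*()]", text)
    pos = 0

    def peek():
        return toks[pos] if pos < len(toks) else None

    def take():
        nonlocal pos
        tok = toks[pos]
        pos += 1
        return tok

    def value():
        if peek() == "(":
            take()
            result = expr(True)  # inside parentheses a unary minus is fine
            if peek() != ")":
                raise SyntaxError("expected ')'")
            take()
            return result
        if peek() is not None and peek().isdigit():
            return int(take())
        raise SyntaxError(f"unexpected {peek()!r}")

    def factor(minus_ok):
        if minus_ok and peek() == "-":
            take()
            return -factor(True)
        return value()

    def term(minus_ok):
        result = factor(minus_ok)
        while peek() == "*":
            take()
            result *= factor(True)
        return result

    def expr(minus_ok):
        result = term(minus_ok)
        # Always extend the current expression when + or - follows; this is
        # the same choice a shift-preferring parser makes.
        while peek() in ("+", "-"):
            if take() == "+":
                result += term(True)
            else:
                result -= term(True)
        return result

    results = []
    while peek() is not None:
        results.append(expr(False))  # top level: no leading unary minus
    return results

print(evaluate_all("4\n-2"))    # [2]
print(evaluate_all("4\n(-2)"))  # [4, -2]
```

The first call treats the two lines as the single expression 4 - 2, and the second yields two separate values, matching the description above.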
I was trying to write a parser using yacc and lex that counts the number of nested loops (while or for). I started by implementing just while loops, but for some reason the parser gives me an error at the end of a closing brace.
Here is the code:
%{
#include<stdio.h>
/*parser for counting while loops*/
extern int yyerror(char* error);
int while_count=0;
extern int yylex();
%}
%token NUMBER
%token VAR
%token WHILE
%%
statement_list : statement'\n'
               | statement_list statement'\n'
               ;
statement :
            while_stmt '\n''{' statement_list '}'
          | VAR '=' NUMBER ';'
          ;
while_stmt :
            WHILE '('condition')' {while_count++;}
          ;
condition :
            VAR cond_op VAR
          ;
cond_op : '>'
        | '<'
        | '=''='
        | '!''='
        ;
%%
int main(void){
    yyparse();
    printf("while count:%d\n",while_count);
}
int yyerror(char *s){
    printf("Error:%s\n",s);
    return 1;
}
What is wrong with that code? And is there a way in yacc to specify optional elements, like the "\n" after while?
Here is the lexer code:
%{
#include"y.tab.h"
/*lexer for scanning nested while loops*/
%}
%%
[\t ] ; /*ignore white spaces*/
"while" {return WHILE;}
[a-zA-Z]+ {return VAR;}
[0-9]+ {return NUMBER;}
'$' {return 0;}
'\n' {return '\n' ;}
. {return yytext[0];}
%%
VAR is a variable name made up of ASCII letters only, and WHILE is the keyword while. Types are not taken into consideration in variable assignments.
The problem you seem to be having is with empty loop bodies, not nested loops. As written, your grammar requires at least one statement in the while loop body. You can fix this by allowing empty statement lists:
statement_list: /* empty */
| statement_list statement '\n'
;
You also ask about making newlines optional. The easiest way is to make the lexer simply discard newlines (as whitespace) rather than returning them. Then just get rid of the newlines in the grammar, and newlines can appear between any two tokens and will be ignored.
If you really must have newlines in the grammar for some reason, you can add a rule like:
opt_newlines: /* empty */ | opt_newlines '\n' ;
and then use this rule wherever you want to allow for newlines (replace all the literal '\n' in your grammar.) You have to be careful not to use it redundantly, however. If you do something like:
statement_list: /* empty */
| statement_list statement opt_newlines
;
while_stmt opt_newlines '{' opt_newlines statement_list opt_newlines '}'
you'll get shift/reduce conflicts, as newlines before the } in a loop could be either part of the opt_newlines in the while or the opt_newlines in the statement_list. It's pretty easy to deal with such conflicts by just removing the redundant opt_newlines.
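For comparison, here is what the simpler option (discarding newlines in the lexer and allowing empty statement lists) looks like as a small hand-written Python sketch. This is not yacc output; the function name and error messages are made up for the example, and it counts every while it parses.

```python
import re

def count_whiles(text):
    """Count while loops; newlines are treated as ordinary whitespace."""
    # No alternative matches whitespace, so it is skipped, newlines included.
    toks = re.findall(r"[A-Za-z]+|\d+|==|!=|\S", text)
    pos = 0
    count = 0

    def expect(pred, what):
        nonlocal pos
        if pos >= len(toks) or not pred(toks[pos]):
            found = toks[pos] if pos < len(toks) else "end of input"
            raise SyntaxError(f"expected {what}, found {found}")
        pos += 1

    def lit(symbol):
        return lambda tok: tok == symbol

    def is_var(tok):
        return tok.isalpha() and tok != "while"

    def statement_list():
        # Zero or more statements, so an empty loop body is accepted.
        while pos < len(toks) and toks[pos] != "}":
            statement()

    def statement():
        nonlocal count
        if toks[pos] == "while":
            expect(lit("while"), "'while'")
            expect(lit("("), "'('")
            expect(is_var, "a variable")
            expect(lambda tok: tok in ("<", ">", "==", "!="), "a comparison")
            expect(is_var, "a variable")
            expect(lit(")"), "')'")
            count += 1
            expect(lit("{"), "'{'")
            statement_list()
            expect(lit("}"), "'}'")
        else:
            expect(is_var, "a variable")
            expect(lit("="), "'='")
            expect(str.isdigit, "a number")
            expect(lit(";"), "';'")

    statement_list()
    if pos != len(toks):
        raise SyntaxError(f"unexpected {toks[pos]}")
    return count

print(count_whiles("while(a<b)\n{\n}\n"))  # 1
```

Because the tokenizer never produces a newline token, the parser has no newline positions to get wrong, which is the main attraction of this option.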
I am quite new to ANTLR, so this is likely a simple question.
I have defined a simple grammar which is supposed to include arithmetic expressions with numbers and identifiers (strings that start with a letter and continue with one or more letters or numbers.)
The grammar looks as follows:
grammar while;
@lexer::header {
package ConFreeG;
}
@header {
package ConFreeG;
import ConFreeG.IR.*;
}
@parser::members {
}
arith:
    term
    | '(' arith ( '-' | '+' | '*' ) arith ')'
    ;
term returns [AExpr a]:
    NUM
    {
        int n = Integer.parseInt($NUM.text);
        a = new Num(n);
    }
    | IDENT
    {
        a = new Var($IDENT.text);
    }
    ;
fragment LOWER : ('a'..'z');
fragment UPPER : ('A'..'Z');
fragment NONNULL : ('1'..'9');
fragment NUMBER : ('0' | NONNULL);
IDENT : ( LOWER | UPPER ) ( LOWER | UPPER | NUMBER )*;
NUM : '0' | NONNULL NUMBER*;
fragment NEWLINE:'\r'? '\n';
WHITESPACE : ( ' ' | '\t' | NEWLINE )+ { $channel=HIDDEN; };
I am using ANTLR v3 with the ANTLR IDE Eclipse plugin. When I parse the expression (8 + a45) using the interpreter, only part of the parse tree is generated.
Why does the second term (a45) not get parsed? The same happens if both terms are numbers.
You'll want to create a parser rule that has an EOF (end of file) token in it so that the parser will be forced to go through the entire token stream.
Add this rule to your grammar:
parse
: arith EOF
;
and let the interpreter start at that rule instead of the arith rule.
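The general idea behind anchoring a rule at EOF can be shown with a small hand-written parser in Python. This does not reproduce what the ANTLR interpreter does internally; it is only a sketch, with invented function names, of the difference between "the rule matched a prefix" and "the rule matched the whole input".

```python
import re

def tokenize(text):
    return re.findall(r"[A-Za-z][A-Za-z0-9]*|\d+|\S", text)

def arith(toks, pos=0):
    """Parse one arith starting at pos; return (tree, position after it)."""
    tok = toks[pos] if pos < len(toks) else None
    if tok == "(":
        left, pos = arith(toks, pos + 1)
        if pos >= len(toks) or toks[pos] not in ("-", "+", "*"):
            raise SyntaxError("expected an operator")
        op = toks[pos]
        right, pos = arith(toks, pos + 1)
        if pos >= len(toks) or toks[pos] != ")":
            raise SyntaxError("expected ')'")
        return (op, left, right), pos + 1
    if tok is not None and re.fullmatch(r"[A-Za-z0-9]+", tok):
        return tok, pos + 1
    raise SyntaxError(f"unexpected {tok!r}")

def parse_prefix(text):
    # Like starting at arith: succeeds as soon as one arith has been matched.
    tree, _ = arith(tokenize(text))
    return tree

def parse(text):
    # Like the parse rule: arith must be followed by the end of the input.
    toks = tokenize(text)
    tree, pos = arith(toks)
    if pos != len(toks):
        raise SyntaxError(f"unexpected {toks[pos]!r} before end of input")
    return tree

print(parse_prefix("8 + a45"))  # 8
print(parse("(8 + a45)"))       # ('+', '8', 'a45')
```

parse_prefix("8 + a45") quietly returns just 8 and ignores the rest, while parse("8 + a45") raises an error because tokens remain after the first arith.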