JISON recursion to read the entire input text after a token - parsing

I'm very new to parsers, CFGs, and jison. What I want my grammar to do is:
1. Read everything after the token ADDRESS up to EOF.
2. Allow multiple ADDRESS tokens between that first ADDRESS and EOF (from step 1).
My sample input looks like:
...abc xyz address 101 My Street, Austin, CO 12345 is abc xyz my name is govind my address is 102 My Street,Austin, CO 12345 and here it is end of file.
The output I'm expecting is:
address 101 My Street, Austin, CO 12345 is abc xyz my name is govind my address is 102 My Street,Austin, CO 12345 and here it is end of file.
The code I'm trying is:
/* lexical grammar */
%lex
%options flex
%{
    if (!('chars' in yy)) {
        yy.temp = 0;
    }
%}
%%
\s+                   /* skip whitespace */
(address|Address)     return 'ADDRESS'
<<EOF>>               return 'EOF'
[A-Za-z0-9]+          return 'VARIABLE'
.                     /* skip */
/lex
%start expressions

%% /* language grammar */

expressions
    : other EOF
        { return $1; }
    ;

other
    : VARIABLE                { $$ = $1; }
    | other ADDRESS other     { $$ = $1 + "-" + $2 + "-" + $3; }
    ;
I think there should be a few more productions to achieve this output, since other ADDRESS other is throwing a shift/reduce conflict. Can anyone suggest how I can skip all the input before the first occurrence of the ADDRESS token and then collect everything after it into $$?
Thanks.

As a general principle, when you want to recognize only the first X in a list of X or Y, you need something like this:
list: head X tail;
tail: | tail X | tail Y;
head: | head Y;
Here head matches any number (including 0) of Y and tail matches any number (including 0) of either X or Y. Consequently, the X matched by list must be the first X in the input, and there is no ambiguity.
The tail non-terminal is unnecessary in this particular case, but it is often useful for producing a correct parse tree. You could write the above grammar as:
list: head X | list X | list Y;
head: | head Y;
If you also wanted to match lists without any X, you could add the production list: head, giving:
list: head | head X tail;
tail: | tail X | tail Y;
head: | head Y;
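Applied to the jison question above, a minimal, untested sketch might look like this (it reuses the ADDRESS and VARIABLE tokens from the question's lexer; the action code is only illustrative):
expressions
    : head ADDRESS tail EOF
        { return $2 + $3; }
    ;

head
    : /* empty: anything before the first ADDRESS is discarded */
    | head VARIABLE
    ;

tail
    : /* empty */              { $$ = ''; }
    | tail VARIABLE            { $$ = $1 + ' ' + $2; }
    | tail ADDRESS             { $$ = $1 + ' ' + $2; }
    ;
Because that lexer skips whitespace and punctuation, the returned string is rebuilt from tokens and will not reproduce commas or the original spacing exactly; echoing the raw text would require capturing it in the lexer instead.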

Related

What decides which production the parser tries?

I am trying to build a parser for a desk calculator and am using the following bison code for it.
%union {
    float f;
    char c;
    // int
}
%token <f> NUM
%token <c> ID
%type <f> S E T F G
%%
C : S ';'
| C S ';'
;
S : ID '=' E {fprintf(debug,"13\n");printf("%c has been assigned the value %f.",$1,$3);symbolTable[$1]=$3;}
| E {fprintf(debug,"12\n");result = $$;}
;
E : E '+' T {fprintf(debug,"11\n");$$ = $1+$3;}
| E '-' T {fprintf(debug,"10\n");$$ = $1-$3;}
| T {fprintf(debug,"9\n");$$ = $1;}
;
T : T '*' F {fprintf(debug,"7\n");$$ = $1*$3;}
| T '/' F {fprintf(debug,"6\n");$$ = $1/$3;}
| F {fprintf(debug,"5\n");$$ = $1;}
;
F : G '#' F {fprintf(debug,"4\n");$$ = pow($1,$3);}
| G {fprintf(debug,"3\n");$$ = $1;}
;
G : '(' E ')' {fprintf(debug,"2\n");$$ = $2;}
| NUM {fprintf(debug,"1\n");$$ = $1;}
| ID {fprintf(debug,"0\n");$$ = symbolTable[$1];}
;
%%
My LEX rules are
digit [0-9]
num {digit}+
alpha [A-Za-z]
id {alpha}({alpha}|{digit})*
white [\ \t]
%%
let {printf("let");return LET;}
{num} {yylval.f = atoi(yytext);return NUM;}
{alpha} {yylval.c = yytext[0];return ID;}
[+\-\*/#\(\)] {return yytext[0];}
. {}
%%
The input I gave is a=2+3
When the lexer returns an ID (for 'a'), the parser goes for the production with fprintf(debug,"0\n"), but I want it to go for the production with fprintf(debug,"13\n").
So I am wondering what made my parser go for a reduction by production 0 instead of shifting = onto the stack, and how do I control it?
What you actually specified is a translation grammar, given by the following:
C → S ';' 14 | C S ';' 8
S → ID '=' E 13 | E 12
E → E '+' T 11 | E '-' T 10 | T 9
T → T '*' F 7 | T '/' F 6 | F 5
F → G '#' F 4 | G 3
G → '(' E ')' 2 | NUM 1 | ID 0
with top-level/start configuration C. (For completeness, I added in 8 and 14).
There is only one word generated from C by this translation grammar that contains ID '=' NUM '+' NUM ';' as its subword of input tokens, and that is ID('a') '=' NUM('2') 1 3 5 9 '+' NUM('3') 1 3 5 11 13 ';' 14, which corresponds to the input-output pair (ID '=' NUM '+' NUM ';', 1 3 5 9 1 3 5 11 13 14). So the sequence 1 3 5 9 1 3 5 11 13 14 is the one and only translation. Provided the grammar is LALR(1), this translation will be produced; and the grammar is LALR(1).
If you're not getting this result, that can only mean that you implemented incorrectly whatever you left out of your description (i.e. the lexer ...), or that your grammar processor has a bug, or your machine has a failure.
And, no; actually what you did is the better way to see what's going on - just stick in a single printf statement to the right hand side of each rule and run it that way to see what translation sequences are produced. The "trace" facility in the parser generator is superfluous for that very reason ... at least the way it is usually implemented (more on that below). In addition, you can get a direct view of everything with the -v option, which produces the LR(0) tables with LALR(1) annotations.
The kind of built-in testing facility that would actually be more helpful - especially for examples like this - is just what I described: one that echoes the inputs interleaved with the output actions. So, when you run it on "a = 2 + 3 ;", it would give you ID('a') '=' NUM('2') 1 3 5 9 '+' NUM('3') 1 3 5 11 13 ';' 14 with echo turned on, and just 1 3 5 9 1 3 5 11 13 14 with echo turned off. That would actually be more useful to have as a built-in capability, instead of the trace mode you normally see in implementations of yacc.
The POSIX specification actually leaves open the issue of how "YYDEBUG", "yydebug" and "-t" are to be implemented in a compliant implementation of yacc, to make room for alternative approaches like this.
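For reference, the table dump mentioned above can be produced like this (assuming the grammar lives in calc.y; the report file name differs between implementations):
bison -v -d calc.y
This writes the state and item-set report to calc.output; a traditional yacc writes the same report to y.output.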
Well, it turns out that the problem is that I am not returning = as a token in my LEX rules.
As silly as it sounds, it points out a very important concept of yacc/Bison. The question of whether to shift or reduce is answered by checking the next symbol, also called the lookahead. In this case, the lookahead was NUM (for 2) and not =, because of my faulty LEX code. Since there is no production involving ID followed by NUM, the parser goes for a reduction to G.
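A minimal sketch of the fix (only the punctuation character class changes; previously = fell through to the catch-all . rule and was silently discarded):
[+\-\*/#\(\)=] {return yytext[0];}
With = now delivered to the parser, the lookahead after ID is '=' and the parser shifts it instead of reducing ID to G. Note that ';' is in the same situation: the grammar's C rule expects it, so it would need the same treatment if it isn't returned elsewhere.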
As for how I figured it out: it turns out Bison has a built-in trace feature. It lays out, neatly like a diary entry, whatever it does while parsing; each and every step is written down.
To enable it:
1. Run bison with the -Dparse.trace option:
bison calc.y -d -Dparse.trace
2. In the main function of the parser, declare the extern yydebug and set it to a non-zero value:
int main() {
    extern int yydebug;
    yydebug = 1;
    .
    .
    .
}

xtext not accepting string constant - expecting RULE_ID

I have tried to cut my problem down to the simplest one I can in Xtext. I would like to use the following grammar:
M: lines += T*;
T:
DT
| BDT
| N
;
BDT:
name = ('a' | 'b' | 'c')
;
DT:
'd' name=ID
('(' (ts += BDT (','ts += BDT)*) ')')?
;
N:
'n' name=ID ':' type=[T]
;
I intend to parse expressions of the form d f(a,b,b), for example, and that works fine. I would also like to be able to parse n g:f, which also works, but not n g:a, where a is part of the BDT rule. The error given is "Missing RULE_ID at 'a'".
I'd like to allow the grammar to parse n g:a for example, and I'd be very grateful if anyone could point out where I'm going wrong here on this very simple grammar.
Lexing is done context-free: a keyword can never be an ID. You can address this through parser rules.
You can introduce a datatype rule
MyID: ID | "a" | ... | "c";
and use it wherever you use ID.
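For this grammar, a sketch of the change might look like this (untested; the [T|MyID] part tells the cross-reference to be parsed with MyID instead of the default ID terminal):
MyID: ID | 'a' | 'b' | 'c';

N:
'n' name=MyID ':' type=[T|MyID]
;
The keywords 'a', 'b' and 'c' remain keywords, but both the name and the reference now accept them, so n g:a can be parsed.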

Antlr4: Another "No Viable Alternative Error"

I have checked similar questions surrounding this issue but none seems to provide a solution to my version of the problem.
I just started Antlr4 recently and all has been going nicely until I hit this particular roadblock.
My grammar is a basic math expression grammar, but for some reason the generated parser is unable to walk from parser rule "equal" to parser rule "expr" in order to reach lexer rule "NAME".
grammar MathCraze;
NUM : [0-9]+ ('.' [0-9]+)?;
WS : [ \t]+ -> skip;
NL : '\r'? '\n' -> skip;
NAME: [a-zA-Z_][a-zA-Z_0-9]*;
ADD: '+';
SUB : '-';
MUL : '*';
DIV : '/';
POW : '^';
equal
: add # add1
| NAME '=' equal # assign
;
add
: mul # mul1
| add op=('+'|'-') mul # addSub
;
mul
: exponent # power1
| mul op=('*'|'/') exponent # mulDiv
;
exponent
: expr # expr1
| expr '^' exponent # power
;
expr
: NUM # num
| NAME # name
| '(' add ')' # parens
;
If I pass a word as input, something like "variable", the parser throws the error above, but if I pass a number as input (say "78"), the parser walks the tree successfully (i.e., from rule "equal" down to "expr").
equal -> add -> mul -> exponent -> expr -> NUM  -> "78"        # No Error
equal -> add -> mul -> exponent -> expr -> NAME -> "variable"  # Error! Tree walk doesn't reach here.
I've checked for every type of ambiguity I know of, so I'm probably missing something here.
I'm using ANTLR 4.6, by the way, and I would appreciate it if this problem gets solved. Thanks in advance.
Your style of expression hierarchy is the one used in parsers written by hand or in ANTLR v3, from low to high precedence.
As Raven said, ANTLR 4 is much more powerful. Note the <assoc = right> specification in the power rule; exponentiation is usually right-associative.
grammar Question;
question
: line+ EOF
;
line
: expr NL
| assign NL
;
assign
: NAME '=' expr # assignSingle
| NAME '=' assign # assignMulti
;
expr // from high to low precedence
: <assoc = right> expr '^' expr # power
| expr op=( '*' | '/' ) expr # mulDiv
| expr op=( '+' | '-' ) expr # addSub
| '(' expr ')' # parens
| atom_r # atom
;
atom_r
: NUM
| NAME
;
NAME: [a-zA-Z_][a-zA-Z_0-9]*;
NUM : [0-9]+ ('.' [0-9]+)?;
WS : [ \t]+ -> skip;
NL : [\r\n]+ ;
Run with the -gui option to see the parse tree:
$ echo $CLASSPATH
.:/usr/local/lib/antlr-4.6-complete.jar
$ alias grun
alias grun='java org.antlr.v4.gui.TestRig'
$ grun Question question -gui data.txt
and this data.txt file:
variable
78
a + b * c
a * b + c
a = 8 + (6 * 9)
a ^ b
a ^ b ^ c
7 * 2 ^ 5
a = b = c = 88
.
Added
Using your original grammar and starting with the equal rule, I get the following error:
$ grun Q2 equal -tokens data.txt
[#0,0:7='variable',<NAME>,1:0]
[#1,9:10='78',<NUM>,2:0]
...
[#41,89:88='<EOF>',<EOF>,10:0]
line 2:0 no viable alternative at input 'variable78'
If I start with rule expr, there is no error:
$ grun Q2 expr -tokens data.txt
[#0,0:7='variable',<NAME>,1:0]
...
[#41,89:88='<EOF>',<EOF>,10:0]
$
Run grun with the -gui option and you'll see the difference:
running with expr, the input token variable is caught by NAME, rule expr is satisfied and terminates;
running with equal, it's all in error. The parser tries the first alternative equal -> add -> mul -> exponent -> expr -> NAME => OK. It consumes the token variable and tries to do something with the next token, 78. It returns through each rule to see whether it can do something with the rule's other alternative, but each alternative requires an operator. Thus it arrives back in equal and starts again with the token variable, this time using the alternative | NAME '='. NAME consumes the token, then the rule requires '=', but the input is 78, which does not satisfy it. As there is no other choice, it says there is no viable alternative.
$ grun Q2 equal -tokens data.txt
[#0,0:7='variable',<NAME>,1:0]
[#1,8:7='<EOF>',<EOF>,1:8]
line 1:8 no viable alternative at input 'variable'
If variable is the only token, the same reasoning applies: first alternative equal -> add -> mul -> exponent -> expr -> NAME => OK, consumes variable, back to equal, tries the alternative that requires '=', but the input is at EOF. That's why it says there is no viable alternative.
$ grun Q2 equal -tokens data.txt
[#0,0:1='78',<NUM>,1:0]
[#1,2:1='<EOF>',<EOF>,1:2]
If 78 is the only token, apply the same reasoning: first alternative equal -> add -> mul -> exponent -> expr -> NUM => OK, consumes 78, back to equal. The NAME '=' alternative is not an option. Satisfied? Oops, what about EOF?
Now let's add a NUM alternative to equal:
equal
: add # add1
| NAME '=' equal # assign
| NUM '=' equal # assignNum
;
$ grun Q2 equal -tokens data.txt
[#0,0:1='78',<NUM>,1:0]
[#1,2:1='<EOF>',<EOF>,1:2]
line 1:2 no viable alternative at input '78'
First alternative equal -> add -> mul -> exponent -> expr -> NUM => OK, consumes 78, back to equal. Now there is also an alternative for NUM, so the parser starts again, this time using the alternative | NUM '='. NUM consumes the token 78, then the parser requires '=', but the input is at EOF, hence the message.
Now let's add a new rule with EOF and run the grammar from all:
all : equal EOF ;
$ grun Q2 all -tokens data.txt
[#0,0:1='78',<NUM>,1:0]
[#1,2:1='<EOF>',<EOF>,1:2]
$ grun Q2 all -tokens data.txt
[#0,0:7='variable',<NAME>,1:0]
[#1,8:7='<EOF>',<EOF>,1:8]
The input now conforms to the grammar, and there are no more error messages.
Although I can't answer your question about why the parser can't reach NAME in expr, I'd like to point out that with ANTLR 4 you can use direct left recursion in your rule specification, which makes your grammar more compact and improves readability.
With that in mind, your grammar could be rewritten as:
math:
assignment
| expression
;
assignment:
NAME '=' (assignment | expression)
;
expression:
expression '^' expression
| expression ('*' | '/') expression
| expression ('+' | '-') expression
| NAME
| NUM
;
That grammar happily takes a NAME as part of an expression, so I guess it would solve your problem.
If you're really interested in why it didn't work with your grammar, I'd first check whether the lexer has matched the input into the expected tokens. Afterwards, I would look at the parse tree to see what the parser makes of the given token sequence, and then try to do the parsing manually according to your grammar; in doing so you should be able to find the point at which the parser does something different from what you'd expect.
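For example, with the TestRig alias shown earlier in this thread, both checks can be done on the original grammar like this (input.txt is just a placeholder file name):
grun MathCraze equal -tokens input.txt
grun MathCraze equal -gui input.txt
The first command prints the token stream the lexer produced; the second opens the parse-tree viewer.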

Problems with left-recursion

I have a little grammar containing a few commands that have to be used with Numbers, and some of these commands return Numbers as well.
My grammar snippet looks like this:
Command:
name Numbers
| Numbers "test"
;
name:
"abs"
| "acos"
;
Numbers:
NUMBER
| numberReturn
;
numberReturn:
name Numbers
;
terminal NUMBER:
('0'..'9')+("."("0".."9")+)?
;
After inserting the Numbers "test" part in rule Command, the compiler complains about non-LL(*) decisions and tells me I have to work around these (left-factoring, syntactic predicates, backtracking), but my problem is that I have no idea what makes the grammar non-LL(*) in this case, nor do I have an idea of how to left-factor my grammar (I don't want to turn on backtracking).
EDIT:
A few examples of what this grammar should match:
abs 3;
acos abs 4; //interpreted as "acos (abs 4)"
acos 3 test; //(acos 3) test
Best regards
Raven
With the grammar you are trying to achieve, the parser does not know how to tell between (acos 10) test and acos (10 test) (without the parentheses). However, you can give the parser some hints so it knows the correct order, such as parenthesized expressions.
This would be a valid Xtext grammar, with the test expressions parenthesized:
grammar org.xtext.example.mydsl.MyDsl with org.eclipse.xtext.common.Terminals
generate myDsl "http://www.xtext.org/example/mydsl/MyDsl"
Model
: operations += UnaryOperation*
;
UnaryOperation returns Expression
: 'abs' exp = Primary
| 'acos' exp = Primary
| '(' exp = Primary 'test' ')'
;
Primary returns Expression
: NumberLiteral
| UnaryOperation
;
NumberLiteral
: value = INT
;
The parser will correctly recognize expressions such as:
(acos abs (20 test) test)
acos abs 20
acos 20
(20 test)
These articles may be helpful for you:
https://dslmeinte.wordpress.com/tag/unary-operator/
http://blog.efftinge.de/2010/08/parsing-expressions-with-xtext.html

Location in syntax trees

When writing a parser, I want to remember the location of lexemes found, so that I can report useful error messages to the programmer, as in “if-less else on line 23”, “unexpected character on line 45, character 6”, “variable not defined”, or something similar. But once I have built the syntax tree, I will transform it in several ways, optimizing or expanding some kinds of macros. The transformations produce or rearrange lexemes which do not have a meaningful location.
Therefore it seems that the type representing the syntax tree should come in two flavors: one with locations decorating lexemes and one without locations. Ideally we would like to work with a purely abstract syntax tree, as defined in the OCaml book:
# type unr_op = UMINUS | NOT ;;
# type bin_op = PLUS | MINUS | MULT | DIV | MOD
| EQUAL | LESS | LESSEQ | GREAT | GREATEQ | DIFF
| AND | OR ;;
# type expression =
ExpInt of int
| ExpVar of string
| ExpStr of string
| ExpUnr of unr_op * expression
| ExpBin of expression * bin_op * expression ;;
# type command =
Rem of string
| Goto of int
| Print of expression
| Input of string
| If of expression * int
| Let of string * expression ;;
# type line = { num : int ; cmd : command } ;;
# type program = line list ;;
We should be allowed to totally forget about locations when working on that tree, and have special functions that map an expression back to its location (for instance), which we could use in case of emergency.
What is the best way to define such a type in OCaml or to handle lexeme positions?
The best way is to always work with AST nodes fully annotated with locations. For example:
type expression = {
expr_desc : expr_desc;
expr_loc : Lexing.position * Lexing.position; (* start and end *)
}
and expr_desc =
ExpInt of int
| ExpVar of string
| ExpStr of string
| ExpUnr of unr_op * expression
| ExpBin of expression * bin_op * expression
Your idea of keeping the AST free of locations and writing a function to retrieve the missing locations is not a good one, I believe. Such a function would require searching by pointer equivalence of AST nodes or something similar, which does not really scale.
I strongly recommend looking through the OCaml compiler's parser.mly, which is a full-scale example of an AST with locations.
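As a small illustration (a sketch only: mk_expr and loc_of_expr are hypothetical helper names, and the fragment assumes a Menhir parser where $startpos and $endpos are available), building and reading the annotation looks like this:
(* Wrap a bare description together with its source span. *)
let mk_expr desc (startp, endp) =
  { expr_desc = desc; expr_loc = (startp, endp) }

(* Recovering a location is a plain field access; no search over the tree is needed. *)
let loc_of_expr e = e.expr_loc

(* In a Menhir semantic action the positions come for free, e.g.
     { mk_expr (ExpBin (e1, PLUS, e2)) ($startpos, $endpos) }
   where PLUS here is the bin_op constructor from the question. *)
These helpers assume the annotated expression type from the answer above.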

Resources