Ambiguity or conflicts in an LL(1) grammar for a shell in C

I'm implementing an LL(1) parser for a shell implementation project.
I'm stuck trying to resolve conflicts in my grammar:
Parsing mode: LL(1).
Grammar:
1. COMMAND_LINE -> COMPLETE_COMMAND PIPED_CMD
2. PIPED_CMD -> PIPE COMPLETE_COMMAND PIPED_CMD
3. | ε
4. COMPLETE_COMMAND -> CMD_PREFIX CMD CMD_SUFFIX
5. CMD_PREFIX -> REDIRECTION CMD_PREFIX
6. | ε
7. CMD_SUFFIX -> REDIRECTION CMD_SUFFIX
8. | CMD_ARG CMD_SUFFIX
9. | ε
10. REDIRECTION -> REDIRECTION_OP WORD
11. | ε
12. CMD -> WORD
13. CMD_ARG -> WORD CMD_ARG
14. | SINGLE_QUOTE WORD DOUBLE_QUOTE CMD_ARG
15. | DOUBLE_QUOTE WORD DOUBLE_QUOTE CMD_ARG
16. | ε
17. REDIRECTION_OP -> HERE_DOC
18. | APPEND
19. | INFILE
20. | OUTFILE
I use syntax-cli to check my grammar, and the LL(1) parser is a homemade implementation; I can link my implementation of the parser if needed.
The conflicts detected by syntax-cli are:
             PIPE  WORD   SINGLE_QUOTE  DOUBLE_QUOTE  HERE_DOC  APPEND  INFILE  OUTFILE  $
CMD_SUFFIX   9     7/8    7/8           7/8           7/8       7/8     7/8     7/8      9
REDIRECTION  11    11     11            11            10/11     10/11   10/11   10/11    11
CMD_ARG      16    13/16  14/16         15/16         16        16      16      16       16
I've also tried this grammar:
COMMAND_LINE
: COMPLETE_COMMAND PIPED_CMD
;
PIPED_CMD
: PIPE COMPLETE_COMMAND PIPED_CMD
|
;
COMPLETE_COMMAND
: REDIRECTION CMD REDIRECTION CMD_ARG REDIRECTION
;
REDIRECTION
: REDIRECTION_OP WORD
|
;
CMD
: WORD
;
CMD_ARG
: WORD REDIRECTION CMD_ARG
| SINGLE_QUOTE WORD DOUBLE_QUOTE REDIRECTION CMD_ARG
| DOUBLE_QUOTE WORD DOUBLE_QUOTE REDIRECTION CMD_ARG
| REDIRECTION
;
REDIRECTION_OP
: HERE_DOC
| APPEND
| INFILE
| OUTFILE
;
But the parser doesn't work when using multiple redirections...

Without a more detailed specification from you, I can't be sure I've covered everything. But this grammar is indeed ambiguous.
To build an LL(1) analyzer, you must be able to say, for any combination of the symbol on top of the analyzer stack (either a terminal or a non-terminal yet to be read) and the next token in the input buffer, which rule should apply.
Put yourself in the situation where your input starts with a WORD (that is, the first thing in the input buffer).
You start by trying to analyze COMMAND_LINE.
If the input buffer starts with WORD, then only one rule can lead to COMMAND_LINE, namely COMPLETE_COMMAND PIPED_CMD (whatever the input, this is the only rule anyway. Either we can apply it, or it is a syntax error. But for now there is no reason to raise a syntax error: this rule is compatible with an input starting with WORD).
So now you have COMPLETE_COMMAND PIPED_CMD on your stack, and still the same WORD in the input buffer.
The only possible rule for the top of the stack is COMPLETE_COMMAND -> CMD_PREFIX CMD CMD_SUFFIX
So now you have CMD_PREFIX CMD CMD_SUFFIX PIPED_CMD on your stack.
And WORD is still waiting in the input buffer.
Two rules can be applied from CMD_PREFIX:
CMD_PREFIX -> REDIRECTION CMD_PREFIX
or CMD_PREFIX -> ε
Neither of them can start with WORD directly. So either we say that what we have here is an empty CMD_PREFIX (followed by a CMD starting with WORD),
or we see it as a REDIRECTION followed by an empty prefix, since REDIRECTION can be REDIRECTION -> ε
So both are possible at this point. Either we have a CMD_PREFIX(ε) or we have a CMD_PREFIX(REDIRECTION(ε), ε) (or even more recursions).
For the grammar to be LL(1), we should not have to go deeper to decide. From this point, knowing only that the next lexeme is WORD, we should be able to choose between those two. We can't.
(In fact, even with a parsing method other than LL(1), we couldn't decide.)
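The reasoning above can be checked mechanically. Here is a minimal Python sketch (not the asker's C parser and not syntax-cli) that computes FIRST and FOLLOW sets and lists the cells of the LL(1) table that receive more than one rule. The grammar fragment is a simplification made up for illustration: the start symbol S is hypothetical, and REDIRECTION_OP is treated as a single terminal.

```python
EPS = "ε"

def first_of_seq(seq, first, nonterms):
    """FIRST set of a symbol sequence (contains EPS if the whole sequence can vanish)."""
    out = set()
    for sym in seq:
        if sym == EPS:
            continue
        sym_first = first[sym] if sym in nonterms else {sym}
        out |= sym_first - {EPS}
        if EPS not in sym_first:
            return out
    out.add(EPS)
    return out

def ll1_conflicts(grammar, start):
    """Map (non-terminal, lookahead) to the alternatives competing for that table cell."""
    nonterms = set(grammar)
    first = {n: set() for n in nonterms}
    follow = {n: set() for n in nonterms}
    follow[start].add("$")
    changed = True
    while changed:  # iterate FIRST and FOLLOW together until nothing grows
        changed = False
        for head, alts in grammar.items():
            for alt in alts:
                f = first_of_seq(alt, first, nonterms)
                if not f <= first[head]:
                    first[head] |= f
                    changed = True
                for i, sym in enumerate(alt):
                    if sym not in nonterms:
                        continue
                    rest = first_of_seq(alt[i + 1:], first, nonterms)
                    add = rest - {EPS}
                    if EPS in rest:
                        add |= follow[head]
                    if not add <= follow[sym]:
                        follow[sym] |= add
                        changed = True
    conflicts = {}
    for head, alts in grammar.items():
        table = {}
        for idx, alt in enumerate(alts):
            f = first_of_seq(alt, first, nonterms)
            lookaheads = (f - {EPS}) | (follow[head] if EPS in f else set())
            for tok in lookaheads:
                table.setdefault(tok, []).append(idx)
        for tok, rules in table.items():
            if len(rules) > 1:
                conflicts[(head, tok)] = rules
    return conflicts

# Simplified fragment of the question's grammar (S is a hypothetical start symbol).
fragment = {
    "S": [["CMD_PREFIX", "WORD"]],
    "CMD_PREFIX": [["REDIRECTION", "CMD_PREFIX"], [EPS]],
    "REDIRECTION": [["REDIRECTION_OP", "WORD"], [EPS]],
}
# Same fragment without REDIRECTION -> ε: emptiness is handled by CMD_PREFIX alone.
fixed = dict(fragment, REDIRECTION=[["REDIRECTION_OP", "WORD"]])

print(ll1_conflicts(fragment, "S"))
print(ll1_conflicts(fixed, "S"))
```

On the fragment, the sketch reports two contested cells: CMD_PREFIX on WORD and REDIRECTION on REDIRECTION_OP. Dropping the empty REDIRECTION alternative leaves no contested cell in this fragment, which suggests (without proving it for the full grammar) that the nullable REDIRECTION is worth removing first.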

Related

What decides which production the parser tries?

I am trying to build a parser for a desk calculator and am using the following bison code for it.
%union{
float f;
char c;
// int
}
%token <f> NUM
%token <c> ID
%type <f> S E T F G
%%
C : S ';'
| C S ';'
;
S : ID '=' E {fprintf(debug,"13\n");printf("%c has been assigned the value %f.",$1,$3);symbolTable[$1]=$3;}
| E {fprintf(debug,"12\n");result = $$;}
;
E : E '+' T {fprintf(debug,"11\n");$$ = $1+$3;}
| E '-' T {fprintf(debug,"10\n");$$ = $1-$3;}
| T {fprintf(debug,"9\n");$$ = $1;}
;
T : T '*' F {fprintf(debug,"7\n");$$ = $1*$3;}
| T '/' F {fprintf(debug,"6\n");$$ = $1/$3;}
| F {fprintf(debug,"5\n");$$ = $1;}
;
F : G '#' F {fprintf(debug,"4\n");$$ = pow($1,$3);}
| G {fprintf(debug,"3\n");$$ = $1;}
;
G : '(' E ')' {fprintf(debug,"2\n");$$ = $2;}
| NUM {fprintf(debug,"1\n");$$ = $1;}
| ID {fprintf(debug,"0\n");$$ = symbolTable[$1];}
;
%%
My LEX rules are
digit [0-9]
num {digit}+
alpha [A-Za-z]
id {alpha}({alpha}|{digit})*
white [\ \t]
%%
let {printf("let");return LET;}
{num} {yylval.f = atoi(yytext);return NUM;}
{alpha} {yylval.c = yytext[0];return ID;}
[+\-\*/#\(\)] {return yytext[0];}
. {}
%%
The input I gave is a=2+3
When the lexer returns an ID(for 'a'), the parser is going for the production with fprintf(debug,"0\n"). But I want it to go for the production fprintf(debug,"13\n").
So, I am wondering what made my parser go for a reduction on production 0, instead of shifting = to stack, and how do I control it?
What you actually specified is a translation grammar, given by the following:
C → S ';' 14 | C S ';' 8
S → ID '=' E 13 | E 12
E → E '+' T 11 | E '-' T 10 | T 9
T → T '*' F 7 | T '/' F 6 | F 5
F → G '#' F 4 | G 3
G → '(' E ')' 2 | NUM 1 | ID 0
with top-level/start configuration C. (For completeness, I added in 8 and 14).
There is only one word generated from C, by this translation grammar, containing ID '=' NUM '+' NUM as the subword of input tokens, and that is ID ('a') '=' NUM('2') 1 3 5 9 '+' NUM('3') 1 3 5 11 13 ';' 14, which is equal to the input-output pair (ID '=' NUM '+' NUM ';', 1 3 5 9 1 3 5 11 13 14). So, the sequence 1 3 5 9 1 3 5 11 13 14 is the one and only translation. Provided the grammar is LALR(1), then this translation will be produced, as a result; and the grammar is LALR(1).
If you're not getting this result, then that can only mean that you implemented wrong whatever you left out of your description: i.e. the lexer ... or that your grammar processor has a bug or your machine has a failure.
And, no; actually what you did is the better way to see what's going on - just stick in a single printf statement to the right hand side of each rule and run it that way to see what translation sequences are produced. The "trace" facility in the parser generator is superfluous for that very reason ... at least the way it is usually implemented (more on that below). In addition, you can get a direct view of everything with the -v option, which produces the LR(0) tables with LALR(1) annotations.
The kind of built-in testing facility that would actually be more helpful - especially for examples like this - is just what I described: one that echoes the inputs interleaved with the output actions. So, when you run it on "a = 2 + 3 ;", it would give you ID('a') '=' NUM('2') 1 3 5 9 '+' NUM('3') 1 3 5 11 13 ';' 14 with echo turned on, and just 1 3 5 9 1 3 5 11 13 14 with echo turned off. That would actually be more useful to have as a built-in capability, instead of the trace mode you normally see in implementations of yacc.
The POSIX specification actually leaves open the issue of how "YYDEBUG", "yydebug" and "-t" are to be implemented in a compliant implementation of yacc, to make room for alternative approaches like this.
Well, it turns out that the problem is I am not identifying = as a token here, in my LEX.
As silly as it sounds, it points out a very important concept of yacc/Bison. The question of whether to shift or reduce is answered by checking the next symbol, also called the lookahead. In this case, the lookahead was NUM(for 2) and not =, because of my faulty LEX code. Since there is no production involving ID followed by NUM, it is going for a reduction to G.
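To see what the parser actually receives, here is a rough Python emulation of the lex rules from the question (longest match wins, ties go to the earliest rule). It is only a sketch of lex's matching behaviour, not generated scanner code, and the FIXED rule list is an assumption about one possible fix: it simply adds '=' and ';' to the single-character operator rule.

```python
import re

# Rules in the order they appear in the question's lex file: (pattern, action).
RULES = [
    (r"let", lambda s: ("LET", s)),
    (r"[0-9]+", lambda s: ("NUM", float(int(s)))),
    (r"[A-Za-z]", lambda s: ("ID", s[0])),
    (r"[+\-*/#()]", lambda s: (s, s)),
    (r".", lambda s: None),  # catch-all rule: the character is silently discarded
]

# Hypothetical fix: let '=' and ';' through as single-character tokens.
FIXED = RULES[:3] + [(r"[+\-*/#()=;]", lambda s: (s, s))] + RULES[4:]

def tokenize(text, rules=RULES):
    """Emulate lex: try every rule at the current position, keep the longest match."""
    pos, tokens = 0, []
    while pos < len(text):
        best_len, best_action = 0, None
        for pattern, action in rules:
            m = re.compile(pattern).match(text, pos)
            if m and len(m.group()) > best_len:  # strict '>' keeps the earliest rule on ties
                best_len, best_action = len(m.group()), action
        if best_action is None:  # nothing matched (e.g. a newline): skip one character
            pos += 1
            continue
        tok = best_action(text[pos:pos + best_len])
        if tok is not None:
            tokens.append(tok)
        pos += best_len
    return tokens

def kinds(text, rules=RULES):
    """Only the token kinds, which is all the parser's lookahead decision depends on."""
    return [kind for kind, _ in tokenize(text, rules)]

print(kinds("a=2+3"))         # the '=' never reaches the parser
print(kinds("a=2+3", FIXED))  # with the assumed fix, '=' follows ID
```

With the original rules the parser sees ID followed directly by NUM, so after shifting ID the lookahead is NUM rather than '='.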
As for how I figured it out: it turns out bison has a built-in trace feature. It lays out, like a diary entry, whatever it does while parsing. Each and every step is written down.
To enable it:
Run bison with the -Dparse.trace option.
bison calc.y -d -Dparse.trace
In the main function of the parser, declare the extern yydebug and set it to a non-zero value.
int main(){
    extern int yydebug;
    yydebug = 1;
    .
    .
    .
}

How can I check if first character of a line is "*" in ANTLR4?

I am trying to write a parser for a relatively simple but idiosyncratic language.
Simply put, one of the rules is that comment lines are denoted by an asterisk only if that asterisk is the first character of the line. How might I go about formalising such a rule in ANTLR4? I thought about using:
START_LINE_COMMENT: '\n*' .*? '\n' -> skip;
But I am certain this won't work with more than one comment line in a row, as the newline at the end will be consumed as part of the START_LINE_COMMENT token, meaning any subsequent comment line will be missing the required initial newline character. Is there a way I can check whether the line starts with a '*' without needing to consume the prior '\n'?
Matching a comment line is not easy. As I write one grammar per year, I had to grab The Definitive ANTLR Reference to refresh my memory. Try this:
grammar Question;
/* Comment line having an * in column 1. */
question
: line+
;
line
// : ( ID | INT )+
: ( ID | INT | MULT )+
;
LINE_COMMENT
: '*' {getCharPositionInLine() == 1}? ~[\r\n]* -> channel(HIDDEN) ;
ID : [a-zA-Z]+ ;
INT : [0-9]+ ;
//WS : [ \t\r\n]+ -> channel(HIDDEN) ;
WS : [ \t\r\n]+ -> skip ;
MULT : '*' ;
Compile and execute :
$ echo $CLASSPATH
.:/usr/local/lib/antlr-4.6-complete.jar:
$ alias
alias a4='java -jar /usr/local/lib/antlr-4.6-complete.jar'
alias grun='java org.antlr.v4.gui.TestRig'
$ a4 Question.g4
$ javac Q*.java
$ grun Question question -tokens data.txt
[#0,0:3='line',<ID>,1:0]
[#1,5:5='1',<INT>,1:5]
[#2,9:12='line',<ID>,2:2]
[#3,14:14='2',<INT>,2:7]
[#4,16:26='* comment 1',<LINE_COMMENT>,channel=1,3:0]
[#5,32:35='line',<ID>,4:4]
[#6,37:37='4',<INT>,4:9]
[#7,39:48='*comment 2',<LINE_COMMENT>,channel=1,5:0]
[#8,51:78='* comment 3 after empty line',<LINE_COMMENT>,channel=1,7:0]
[#9,81:81='*',<'*'>,8:1]
[#10,83:85='not',<ID>,8:3]
[#11,87:87='a',<ID>,8:7]
[#12,89:95='comment',<ID>,8:9]
[#13,97:100='line',<ID>,9:0]
[#14,102:102='9',<INT>,9:5]
[#15,107:107='*',<'*'>,9:10]
[#16,109:110='no',<ID>,9:12]
[#17,112:118='comment',<ID>,9:15]
[#18,120:119='<EOF>',<EOF>,10:0]
with the following data.txt file:
line 1
  line 2
* comment 1
    line 4
*comment 2

* comment 3 after empty line
 * not a comment
line 9    * no comment
Note that without the MULT token or '*' somewhere in a parser rule, the asterisk is not listed in the tokens, and the lexer reports:
line 8:1 token recognition error at: '*'
If you display the parsing tree
$ grun Question question -gui data.txt
you'll see that the whole file is absorbed by one line rule. If you need to recognize lines, change the line and white space rules like so :
line
: ( ID | INT | MULT )+ NL
| NL
;
//WS : [ \t\r\n]+ -> skip ;
NL : [\r\n] ;
WS : [ \t]+ -> skip ;
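For comparison, the rule the predicate enforces ("an asterisk starts a comment only in the first column") can be sketched in a few lines of plain Python. This is not ANTLR code, just an illustration of the intended behaviour; the leading whitespace in the sample data is inferred from the token columns printed above.

```python
DATA = """line 1
  line 2
* comment 1
    line 4
*comment 2

* comment 3 after empty line
 * not a comment
line 9    * no comment
"""

def classify(text):
    """Return (line_number, kind) pairs; a line is a comment only if '*' is its first character."""
    result = []
    for number, line in enumerate(text.splitlines(), start=1):
        if line.startswith("*"):
            result.append((number, "comment"))
        elif line.strip():  # blank lines are ignored
            result.append((number, "code"))
    return result

comment_lines = [n for n, kind in classify(DATA) if kind == "comment"]
print(comment_lines)
```

Lines 3, 5 and 7 are classified as comments, matching the LINE_COMMENT tokens in the grun output; line 8 is not, because its asterisk sits in the second column.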

Antlr4: Another "No Viable Alternative Error"

I have checked similar questions surrounding this issue but none seems to provide a solution to my version of the problem.
I just started Antlr4 recently and all has been going nicely until I hit this particular roadblock.
My grammar is a basic math expression grammar, but for some reason I noticed the generated parser(?) is unable to walk from parser rule "equal" to parser rule "expr" in order to reach lexer rule "NAME".
grammar MathCraze;
NUM : [0-9]+ ('.' [0-9]+)?;
WS : [ \t]+ -> skip;
NL : '\r'? '\n' -> skip;
NAME: [a-zA-Z_][a-zA-Z_0-9]*;
ADD: '+';
SUB : '-';
MUL : '*';
DIV : '/';
POW : '^';
equal
: add # add1
| NAME '=' equal # assign
;
add
: mul # mul1
| add op=('+'|'-') mul # addSub
;
mul
: exponent # power1
| mul op=('*'|'/') exponent # mulDiv
;
exponent
: expr # expr1
| expr '^' exponent # power
;
expr
: NUM # num
| NAME # name
| '(' add ')' # parens
;
If I pass a word as input, something like "variable", the parser throws the error above, but if I pass a number as input (say "78"), the parser walks the tree successfully (i.e., from rule "equal" to "expr").
equal                 equal
  |                     |
 add                   add
  |                     |
 mul                   mul
  |                     |
exponent              exponent
  |                     |
expr                  expr
  |                     |
 NUM                   NAME
  |                     |
"78" # No Error       "variable" # Error! Tree walk doesn't reach here.
I've checked for every type of ambiguity I know of, so I'm probably missing something here.
I'm using Antlr5.6 by the way and I will appreciate if this problem gets solved. Thanks in advance.
Your style of expression hierarchy is the one we use in parsers written by hand or in ANTLR v3, from low to high precedence.
As Raven said, ANTLR 4 is much more powerful. Note the <assoc = right> specification in the power rule, which is usually right-associative.
grammar Question;
question
: line+ EOF
;
line
: expr NL
| assign NL
;
assign
: NAME '=' expr # assignSingle
| NAME '=' assign # assignMulti
;
expr // from high to low precedence
: <assoc = right> expr '^' expr # power
| expr op=( '*' | '/' ) expr # mulDiv
| expr op=( '+' | '-' ) expr # addSub
| '(' expr ')' # parens
| atom_r # atom
;
atom_r
: NUM
| NAME
;
NAME: [a-zA-Z_][a-zA-Z_0-9]*;
NUM : [0-9]+ ('.' [0-9]+)?;
WS : [ \t]+ -> skip;
NL : [\r\n]+ ;
Run with the -gui option to see the parse tree :
$ echo $CLASSPATH
.:/usr/local/lib/antlr-4.6-complete.jar
$ alias grun
alias grun='java org.antlr.v4.gui.TestRig'
$ grun Question question -gui data.txt
and this data.txt file :
variable
78
a + b * c
a * b + c
a = 8 + (6 * 9)
a ^ b
a ^ b ^ c
7 * 2 ^ 5
a = b = c = 88
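To make the precedence and associativity encoded by the expr rule concrete, here is a small precedence-climbing sketch in Python. It is not what ANTLR generates; it only mirrors the ordering above ('^' binds tightest and is right-associative, then '*' and '/', then '+' and '-'). The OPS table and the tuple-based tree shape are assumptions made for this illustration, and assignments are left out.

```python
import re

# Binary operators: (precedence, right_associative). Higher binds tighter.
OPS = {"^": (3, True), "*": (2, False), "/": (2, False), "+": (1, False), "-": (1, False)}

def tokenize(text):
    """Split into numbers, names, operators and parentheses; whitespace is dropped."""
    return re.findall(r"\d+(?:\.\d+)?|[A-Za-z_]\w*|[-+*/^()]", text)

def parse(tokens, min_prec=1):
    """Precedence climbing; returns an atom string or a nested tuple (op, left, right)."""
    tok = tokens.pop(0)
    if tok == "(":
        left = parse(tokens, 1)
        tokens.pop(0)  # consume the matching ')'
    else:
        left = tok
    while tokens and tokens[0] in OPS and OPS[tokens[0]][0] >= min_prec:
        op = tokens.pop(0)
        prec, right_assoc = OPS[op]
        # A right-associative operator lets the same precedence recurse on the right.
        right = parse(tokens, prec if right_assoc else prec + 1)
        left = (op, left, right)
    return left

for line in ["a + b * c", "a * b + c", "a ^ b ^ c", "7 * 2 ^ 5"]:
    print(line, "=>", parse(tokenize(line)))
```

For "a ^ b ^ c" the sketch nests to the right, ('^', 'a', ('^', 'b', 'c')), while "a * b + c" groups the multiplication first, which is the shape to look for in the -gui parse tree.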
Added
Using your original grammar and starting with the equal rule, I have the following error :
$ grun Q2 equal -tokens data.txt
[#0,0:7='variable',<NAME>,1:0]
[#1,9:10='78',<NUM>,2:0]
...
[#41,89:88='<EOF>',<EOF>,10:0]
line 2:0 no viable alternative at input 'variable78'
If I start with rule expr, there is no error :
$ grun Q2 expr -tokens data.txt
[#0,0:7='variable',<NAME>,1:0]
...
[#41,89:88='<EOF>',<EOF>,10:0]
$
Run grun with the -gui option and you'll see the difference :
running with expr, the input token variable is caught by NAME, rule expr is satisfied and terminates;
running with equal, it all ends in error. The parser tries the first alternative equal -> add -> mul -> exponent -> expr -> NAME => OK. It consumes the token variable and tries to do something with the next token 78. It rolls back through each rule to see if it can do something with the other alternative of that rule, but each one requires an operator. Thus it arrives back in equal and starts again with the token variable, this time using the alt | NAME '='. NAME consumes the token, then the rule requires '=', but the input is 78, which does not satisfy it. As there is no other choice, it says there is no viable alternative.
$ grun Q2 equal -tokens data.txt
[#0,0:7='variable',<NAME>,1:0]
[#1,8:7='<EOF>',<EOF>,1:8]
line 1:8 no viable alternative at input 'variable'
If variable is the only token, same reasoning : first alternative equal -> add -> mul -> exponent -> expr -> NAME => OK, consumes variable, back to equal, tries the alt which requires '=', but the input is at EOF. That's why it says there is no viable alternative.
$ grun Q2 equal -tokens data.txt
[#0,0:1='78',<NUM>,1:0]
[#1,2:1='<EOF>',<EOF>,1:2]
If 78 is the only token, apply the same reasoning: first alternative equal -> add -> mul -> exponent -> expr -> NUM => OK, consumes 78, back to equal. The second alternative is not an option because it starts with NAME. Satisfied? Oops, what about EOF?
Now let's add a NUM alt to equal :
equal
: add # add1
| NAME '=' equal # assign
| NUM '=' equal # assignNum
;
$ grun Q2 equal -tokens data.txt
[#0,0:1='78',<NUM>,1:0]
[#1,2:1='<EOF>',<EOF>,1:2]
line 1:2 no viable alternative at input '78'
First alternative equal -> add -> mul -> exponent -> expr -> NUM => OK, consumes 78, back to equal. Now there is also an alt for NUM, starts again, this time using the alt | NUM '='. NUM consumes the token 78,
then the parser requires '=', but the input is at EOF, hence the message.
Now let's add a new rule with EOF and let's run the grammar from all :
all : equal EOF ;
$ grun Q2 all -tokens data.txt
[#0,0:1='78',<NUM>,1:0]
[#1,2:1='<EOF>',<EOF>,1:2]
$ grun Q2 all -tokens data.txt
[#0,0:7='variable',<NAME>,1:0]
[#1,8:7='<EOF>',<EOF>,1:8]
The input corresponds to the grammar, and there is no more message.
Although I can't answer your question about why the parser can't reach NAME in expr, I'd like to point out that with ANTLR 4 you can use direct left recursion in your rule specification, which makes your grammar more compact and improves readability.
With that in mind your grammar could be rewritten as
math:
assignment
| expression
;
assignment:
NAME '=' (assignment | expression)
;
expression:
expression '^' expression
| expression ('*' | '/') expression
| expression ('+' | '-') expression
| NAME
| NUM
;
That grammar happily takes a NAME as part of an expression, so I guess it would solve your problem.
If you're really interested in why it didn't work with your grammar, then I'd first check whether the lexer has matched the input into the expected tokens. Afterwards I would have a look at the parse tree to see what the parser is making of the given token sequence, and then try to do the parsing manually according to your grammar. While doing that, you should be able to find the point at which the parser does something different from what you'd expect.

ANTLR4 - How to tokenize differently inside quotes?

I am defining an ANTLR4 grammar and I'd like it to tokenize certain - but not all - things differently when they appear inside double-quotes than when they appear outside double-quotes. Here's the grammar I have so far:
grammar SimpleGrammar;
AND: '&';
TERM: TERM_CHAR+;
PHRASE_TERM: (TERM_CHAR | '%' | '&' | ':' | '$')+;
TRUNCATION: TERM '!';
WS: WS_CHAR+ -> skip;
fragment TERM_CHAR: 'a' .. 'z' | 'A' .. 'Z';
fragment WS_CHAR: [ \t\r\n];
// Parser rules
expr:
expr AND expr
| '"' phrase '"'
| TERM
| TRUNCATION
;
phrase:
(TERM | PHRASE_TERM | TRUNCATION)+
;
The above grammar works when parsing a! & b, which correctly parses to:
AND
/ \
/ \
a! b
However, when I attempt to parse "a! & b", I get:
line 1:4 extraneous input '&' expecting {'"', TERM, PHRASE_TERM, TRUNCATION}
The error message makes sense, because the & is getting tokenized as AND. What I would like to do, however, is have the & get tokenized as a PHRASE_TERM when it appears inside of double-quotes (inside a "phrase"). Note, I do want the a! to tokenize as TRUNCATION even when it appears inside the phrase.
Is this possible?
It is possible if you use lexer modes. A lexer rule can switch the mode when a specific token is encountered. But rules that use modes must be defined in a separate lexer grammar, not in a combined grammar.
In your case, after encountering a quote you switch to another mode, and after encountering the closing quote you switch back to the default one.
LBRACK : '[' -> pushMode(CharSet);
RBRACK : ']' -> popMode;
For more information google 'ANTLR lexer Mode'
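The idea behind lexer modes can be sketched outside ANTLR. The following Python tokenizer is only an illustration of the mechanism, not an ANTLR grammar: it keeps a current mode, toggles it on each double quote, and uses a different rule set inside the quotes. The mode name PHRASE and the QUOTE token are hypothetical names chosen for this sketch; the other token names and character sets come from the question.

```python
import re

# Token rules per mode, tried in order; the longest match wins, ties go to the earliest rule.
MODES = {
    "DEFAULT": [
        ("QUOTE", r'"'), ("AND", r"&"), ("TRUNCATION", r"[A-Za-z]+!"),
        ("TERM", r"[A-Za-z]+"), ("WS", r"[ \t\r\n]+"),
    ],
    "PHRASE": [  # inside quotes: no AND rule, so '&' falls through to PHRASE_TERM
        ("QUOTE", r'"'), ("TRUNCATION", r"[A-Za-z]+!"), ("TERM", r"[A-Za-z]+"),
        ("PHRASE_TERM", r"[A-Za-z%&:$]+"), ("WS", r"[ \t\r\n]+"),
    ],
}

def tokenize(text):
    """Return (token_name, text) pairs, switching rule sets at each double quote."""
    mode, pos, tokens = "DEFAULT", 0, []
    while pos < len(text):
        best = None
        for name, pattern in MODES[mode]:
            m = re.compile(pattern).match(text, pos)
            if m and (best is None or m.end() > best[1].end()):
                best = (name, m)
        if best is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at {pos}")
        name, m = best
        if name == "QUOTE":  # plays the role of pushMode / popMode
            mode = "PHRASE" if mode == "DEFAULT" else "DEFAULT"
        if name != "WS":     # whitespace is skipped in both modes
            tokens.append((name, m.group()))
        pos = m.end()
    return tokens

print(tokenize('a! & b'))
print(tokenize('"a! & b"'))
```

Outside quotes the '&' comes out as AND; inside quotes it comes out as PHRASE_TERM, while a! is TRUNCATION in both cases, which is the behaviour the question asks for.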

FSLex Unknown Error

I've got a problem with my FSLex that I can't solve... All I know is that fslex.exe exited with code 1...
The F# code at the top was tested in F# Interactive, so the problem isn't there (I can't see how).
Lexer:
http://pastebin.com/qnDnUh59
And Parser.fsi:
http://pastebin.com/sGyLqZbN
Thanks,
Ramon.
A non-zero exit code means the lexer generator failed; usually it also describes the failure. When I compile, I get "exited with code 1" along with this:
Unexpected character '\'
let id = [\w'.']+
----------^
The lexer doesn't accept character literals outside of quotes, and it doesn't understand the meaning of \w either. According to the FsLex source code, FsLex only understands the following escape sequences:
let escape c =
match c with
| '\\' -> '\\'
| '\'' -> '\''
| 'n' -> '\n'
| 't' -> '\t'
| 'b' -> '\b'
| 'r' -> '\r'
| c -> c
This fixed version of your lexer compiles fine for me: http://pastebin.com/QGNk3VKD

Resources