Bison terminating instead of shifting error - parsing

I have a grammar that works well except that it doesn't tolerate syntax errors. I'm trying to work in error tokens so that it can gracefully recover. I've read through the Bison manual on error recovery but something is not adding up.
Here's a snippet from the grammar:
%start start
%token WORD WORDB SP CRLF
%%
start : A B C
| error CRLF start
A : WORD SP WORD CRLF
...
Here's a snippet of the output file that bison produces describing the grammar:
State 0
    0 $accept: . start $end
    error  shift, and go to state 1
    WORD   shift, and go to state 2
    start  go to state 3
    A      go to state 4
State 1
    2 start: error . CRLF start
    CRLF   shift, and go to state 5
State 5
    2 start: error CRLF . start
    error  shift, and go to state 1
    WORD   shift, and go to state 2
    start  go to state 25
    A      go to state 4
Given the input tokens WORDB CRLF WORD SP WORD CRLF ..... I would expect the state transitions to be 0 -> 1 -> 5 -> 2 -> ..., but when I run the parser it actually produces the following:
--(end of buffer or a NUL)
--accepting rule at line 49 ("WORDB")
Starting parse
Entering state 0
Reading a token: Next token is token WORDB ()
syntax error, unexpected WORDB, expecting WORD
As best I can tell, if the parser is in State 0 and it sees a token other than WORD it should interpret the token as if it was error and should go to State 1. In practice it is just hard failing.

The error transition does not suppress the call to yyerror(), so if your yyerror implementation does something like call exit(), error recovery will not be able to proceed.

Related

ANTLR4 lexer rules not matching correct block of text

I am trying to understand how ANTLR4 works based on lexer and parser rules but I am missing something in the following example:
I am trying to parse a file and match all mathematic additions (eg 1+2+3 etc.). My file contains the following text:
start
4 + 5 + 22 + 1
other text other text test test
test test other text
55 other text
another text 2 + 4 + 255
number 44
end
and I would like to match
4 + 5 + 22 + 1
and
2 + 4 + 255
My grammar is as follows:
grammar Hello;
hi : expr+ EOF;
expr : NUM (PLUS NUM)+;
PLUS : '+' ;
NUM : [0-9]+ ;
SPACE : [\n\r\t ]+ ->skip;
OTHER : [a-z]+ ;
My abstract Syntax Tree is visualized as
Why does rule 'expr' match the text 'start'? I also get the error "extraneous input 'start' expecting NUM".
If I make the following change in my grammar
OTHER : [a-z]+ ->skip;
the error is gone. In addition, in the image above, the text '55 other text
another text' matches the expression as a node in the AST. Why is this happening?
Does all of the above have to do with the way the lexer matches input? I know that the lexer looks for the first longest matching rule, but how can I change my grammar so that it matches only the additions?
Why does rule 'expr' match the text 'start'?
It doesn't. When a token shows up red in the tree, that indicates an error. The token did not match any of the possible alternatives, so an error was produced and the parser continued with the next token.
In addition, in the image above, the text '55 other text another text' matches the expression as a node in the AST. Why is this happening?
After you skipped the OTHER tokens, your input basically looks like this:
4 + 5 + 22 + 1 55 2 + 4 + 255 44
4 + 5 + 22 + 1 can be parsed as an expression, no problem. After that the parser either expects a + (continuing the expression) or a number (starting a new expression). So when it sees 55, that indicates the start of a new expression. Now it expects a + (because the grammar says that PLUS NUM must appear at least once after the first number in an expression). What it actually gets is the number 2. So it produces an error and ignores that token. Then it sees a +, which is what it expected. And then it continues that way until the 44, which again starts a new expression. Since that isn't followed by a +, that's another error.
Does all of the above have to do with the way the lexer matches input?
Not really. The token sequence for "start 4 + 5" is OTHER NUM PLUS NUM, or just NUM PLUS NUM if you skip the OTHERs. The token sequence for "55 skippedtext 2 + 4" is NUM NUM PLUS NUM. I assume that's exactly what you'd expect.
Instead what seems to be confusing you is how ANTLR recovers from errors (or maybe that it recovers from errors).
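To make those token sequences concrete, here is a minimal Python sketch. It is not ANTLR, only an approximation of the four lexer rules from the question as one regular expression. The name token_types and its skip parameter are made up for illustration. The sketch relies on the four rules never competing for the same character, so it does not model ANTLR's longest-match behaviour.

```python
import re

# One alternative per lexer rule in the question's grammar (approximation).
TOKEN_RE = re.compile(
    r"(?P<PLUS>\+)|(?P<NUM>[0-9]+)|(?P<SPACE>[\n\r\t ]+)|(?P<OTHER>[a-z]+)"
)

def token_types(text, skip=("SPACE",)):
    """Return the token type names the parser would see, dropping skipped rules."""
    return [m.lastgroup for m in TOKEN_RE.finditer(text) if m.lastgroup not in skip]

# OTHER is a real token here, so the parser sees it and reports it as extraneous.
print(token_types("start 4 + 5"))
# With OTHER skipped, the words vanish and two numbers end up next to each other.
print(token_types("55 other text\nanother text 2 + 4", skip=("SPACE", "OTHER")))
```

The second call shows why skipping OTHER only moves the problem: the parser now sees NUM NUM PLUS NUM and has to recover from the missing + after 55.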

[$end] lookaheads in LALR

I am trying to understand, how bison builds tables for this simple grammar:
input: rule ;
rule: rule '+' '1'
| '1' ;
I was able to calculate LR(1) transition table and item sets, but I don't understand how state 3 is built and works:
State 3
    1 input: rule . [$end]
    2 rule: rule . '+' '1'
    '+'       shift, and go to state 5
    $default  reduce using rule 1 (input)
For the default reduce rule I should put 'r1' into all cells of the GOTO table for each symbol. But for the shift rule I should put 's5' into the column for the '+' terminal (this cell already contains 'r1'). To me this looks like a shift/reduce conflict, but not to bison. Please explain how that '[$end]' lookahead symbol appeared in this state, and how this state is processed overall by the LR state machine.
$default means "everything else", not "everything". In other words, you first fill in the specified actions, and then use the default action for any other lookahead symbol.
If there is no default action, the action for any unspecified lookahead symbol is an error. Default reduce actions are often used to reduce table size where the algorithm would otherwise trigger an error. This optimization may have the result of delaying the reporting of an error, but the error will always be detected before another input is consumed, precisely because an error action is never replaced with a default shift. (Indeed, many parser generators never use default shift actions at all.)
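The lookup order described above can be sketched in a few lines of Python. This is only an illustration: the dictionary layout and the function name lookup are assumptions, not how bison actually stores its tables. Only state 3 from the question is encoded.

```python
# Hypothetical encoding of state 3 from the .output excerpt in the question.
STATE_3 = {
    "actions": {"+": ("shift", 5)},  # lookaheads that are listed explicitly
    "default": ("reduce", 1),        # $default: consulted only if nothing is listed
}

def lookup(state, lookahead):
    """Explicit action first, then the state's default, otherwise an error."""
    explicit = state["actions"].get(lookahead)
    if explicit is not None:
        return explicit
    return state.get("default", ("error", None))

print(lookup(STATE_3, "+"))     # the explicit shift wins, so there is no conflict
print(lookup(STATE_3, "$end"))  # falls through to the default reduce
print(lookup(STATE_3, "1"))     # also reduces; the error is reported in a later state
```

Because '+' has its own entry, the default is never consulted for it, which is why bison reports no shift/reduce conflict here.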
If you look at the grammar shown at the beginning of the .output file, you'll see that it has been augmented with the production:
0 $accept: input $end
Yacc/bison always adds a production like that to ensure that the complete input matches the start symbol. (The non-terminal input will, of course, be replaced by whatever start symbol has been declared with the %start directive, or with the first non-terminal in the grammar.)
There is nothing special about this rule aside from the fact that reducing the $accept symbol causes the input to be accepted. (You can see that in state 4).
$end is a special terminal symbol automatically generated when EOF is detected. (To be more precise, it is the terminal whose token value is 0, which the scanner returns to indicate end of file; (f)lex-generated scanners do this automatically.)

(F)lex warning: rule cannot be matched

EOL \n
WS(" "|\t|\n)
WSS {WS}*
NEWSS {WSS}+
NAME [a-zA-z_][a-zA-z0-9_-]*
WORD [^;]+
IMPORT {NEWSS}'{NAME}'{WSS};
VAL [a-zA-z0-9]+
CONTENT [^}]+
MIX {NEWSS}{NAME}{WSS}[(]
INCLUDE {WSS}{NAME}{WSS}[{]
%s DOTAIM
%s NAMESTATE
%s NAMER
%s CONTENT
%s VALUE
%s INC
%%
${NAME} {key=yytext;BEGIN(NAMESTATE);}
. {output+=yytext;}
\n {output+=yytext;}
45) <NAMESTATE>; {if(var.find(key)==var.end()){output="Unknown variable";return 1;};output+=(var[key]+yytext);BEGIN(INITIAL);}
<NAMESTATE>{WSS}:{WSS} {BEGIN(DOTAIM);}
<DOTAIM>{WORD}{WSS} {val=trim(yytext); var[key]=val;}
48) <DOTAIM>; {BEGIN(INITIAL);}
This is my code and I keep getting this warning:
hello.lex:45: warning, rule cannot be matched
hello.lex:48: warning, rule cannot be matched
Does anyone know why? These rules are in start conditions, so line 43 should not be preventing them from matching.
You declare your start conditions as inclusive (%s): as the manual indicates, "If the start condition is inclusive, then rules with no start conditions at all will also be active."
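So the . rule at line 43 is still active in NAMESTATE and DOTAIM. It matches ; just as well as the rules at lines 45 and 48 do, and when two rules match the same length of input, flex picks the one that appears first in the file. That is why the ; rules can never match.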
Moving the fallback rule to the end of the rules would fix the problem, and it is generally best style even if you have start conditions.
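Here is a rough Python model of how the scanner chooses a rule, assuming the usual flex behaviour: longest match wins, ties go to the earlier rule, and unprefixed rules stay active in inclusive start conditions. The function pick_rule and the (condition, regex) list layout are invented for this sketch, and only the two rules that matter are included.

```python
import re

def pick_rule(rules, text, state, inclusive=True):
    """Return the index of the rule chosen at the start of `text`, or None.

    rules: list of (start_condition_or_None, regex) in file order.
    """
    best = None  # (match_length, rule_index)
    for index, (cond, pattern) in enumerate(rules):
        # Unprefixed rules are active in INITIAL and in every inclusive (%s) condition.
        active = cond == state or (cond is None and (inclusive or state == "INITIAL"))
        if not active:
            continue
        m = re.match(pattern, text)
        # Strict '>' keeps the earlier rule when two matches have equal length.
        if m and m.end() > 0 and (best is None or m.end() > best[0]):
            best = (m.end(), index)
    return None if best is None else best[1]

original = [(None, r"."), ("NAMESTATE", r";")]   # fallback listed first
reordered = [("NAMESTATE", r";"), (None, r".")]  # fallback moved to the end

print(pick_rule(original, ";", "NAMESTATE"))                   # 0: the '.' rule shadows ';'
print(pick_rule(reordered, ";", "NAMESTATE"))                  # 0: now the ';' rule is chosen
print(pick_rule(original, ";", "NAMESTATE", inclusive=False))  # 1: with %x the '.' rule is inactive
```

The last call models declaring the conditions exclusive with %x, which is another way to keep the unprefixed rules out of the way, although every input in that condition then needs a rule of its own.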

Antlr4 token existence messing up parsing

first time poster so my greatest apologies if I break the rules.
I'm using Antlr4 to create a log parser and I'm running into some issues that I don't understand.
I'm trying to parse the following input log sequence:
USA1-RR-SRX240-EDGE-01 created 10.20.30.40/50985->11.12.13.14/443
With the following grammar:
grammar Juniper;
WS : (' '|'\t')+ -> skip ;
NL : '\r'? '\n' -> skip ;
fragment DIGIT : '0'..'9' ;
NUMBER : DIGIT+ ;
IPADDRESS : NUMBER '.' NUMBER '.' NUMBER '.' NUMBER ;
SLASH : '/' -> skip ;
RIGHTARROW : '->' -> skip ;
CREATED: 'created' -> skip ;
HOSTNAME : [a-zA-Z0-9\-]+ ;
/* Input sample for rule: USA1-RR-SRX240-EDGE-01 created 10.20.30.40/50985->11.12.13.14/443 */
testcase : HOSTNAME WS CREATED WS IPADDRESS SLASH NUMBER RIGHTARROW IPADDRESS SLASH NUMBER NL;
It's failing and I can't for the life of me figure out why. I know that the token recognition error has something to do with the token that I've defined for HOSTNAME containing the dash in the character class but I'm not sure how to fix it.
$ antlr4 Juniper.g4 && javac Juniper*.java && grun Juniper testcase -tree
USA1-RR-SRX240-EDGE-01 created 10.20.30.40/50985->11.12.13.14/443
line 1:48 token recognition error at: '>'
line 1:30 mismatched input '10.20.30.40' expecting WS
(testcase SA1-RR-SRX240-EDGE-01 10.20.30.40 50985- 11.12.13.14 443)
Please note the second line of the above output is data that I paste into grun and then hit enter and hit control+D.
Any assistance on this would be highly appreciated, been banging my head against the keyboard on this for a bit now.
The problem with recognizing -> is that HOSTNAME matches any sequence of letters, numbers and dashes, and that includes 50985-. Since that match is longer than what NUMBER would match (50985), HOSTNAME wins. That's evidently not what you want.
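The longest-match effect is easy to reproduce outside ANTLR. The following Python sketch approximates the relevant lexer rules with regular expressions. The WS, NL and CREATED rules are left out, and the names next_token and tokenize are made up for the example. Characters that no rule matches are reported as ERROR, roughly like ANTLR's token recognition error.

```python
import re

# Lexer rules in grammar order; the regexes approximate the ANTLR rules in the question.
RULES = [
    ("NUMBER", re.compile(r"[0-9]+")),
    ("IPADDRESS", re.compile(r"[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+")),
    ("SLASH", re.compile(r"/")),
    ("RIGHTARROW", re.compile(r"->")),
    ("HOSTNAME", re.compile(r"[a-zA-Z0-9\-]+")),
]

def next_token(text, pos):
    """Pick the rule with the longest match at `pos`; earlier rules win ties."""
    best = None
    for name, pattern in RULES:
        m = pattern.match(text, pos)
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best

def tokenize(text):
    pos, tokens = 0, []
    while pos < len(text):
        tok = next_token(text, pos)
        if tok is None:
            tokens.append(("ERROR", text[pos]))  # no rule matches this character
            pos += 1
        else:
            tokens.append(tok)
            pos += len(tok[1])
    return tokens

print(tokenize("50985->11.12.13.14"))
```

HOSTNAME takes 50985- because that match is one character longer than NUMBER's, which leaves a lone > that no rule can match. This lines up with the "token recognition error at: '>'" message and the 50985- token in the grun output.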
Parsing log lines generally requires a context-sensitive scanner, and standard parser generators -- which are more oriented towards parsing programming languages -- are not always the ideal tool. In this case, for example, HOSTNAME cannot appear in the context in which it is being recognized, so it shouldn't even be in the list of possible tokens.
Of course, you could define a token which consisted of an ip number and port separated by a slash, which would solve the ambiguity, but (in my opinion) that would be suboptimal because you'll end up rescanning that token to parse it.

ANTLR 3 grammar omitted part of input source code

I am using this ANTLR 3 grammar and ANTLRWorks for testing that grammar.
But I can't figure out why some parts of my input text are omitted.
I would like to rewrite this grammar and display every element (lparen, keywords, semicolon,..) of the source file (input) in AST / CST.
I've tried everything, but without success. Can someone who is experienced with ANTLR help me?
Parse tree:
I've managed to narrow it down to the semic rule:
/*
This rule handles semicolons reported by the lexer and situations where the ECMA 3 specification states there should be semicolons automatically inserted.
The auto semicolons are not actually inserted but this rule behaves as if they were.
In the following situations an ECMA 3 parser should auto insert absent but grammatically required semicolons:
- the current token is a right brace
- the current token is the end of file (EOF) token
- there is at least one end of line (EOL) token between the current token and the previous token.
The RBRACE is handled by matching it but not consuming it.
The EOF needs no further handling because it is not consumed by default.
The EOL situation is handled by promoting the EOL or MultiLineComment with an EOL present from off channel to on channel
and thus making it parseable instead of handling it as white space. This promoting is done in the action promoteEOL.
*/
semic
@init
{
// Mark current position so we can unconsume a RBRACE.
int marker = input.mark();
// Promote EOL if appropriate
promoteEOL(retval);
}
: SEMIC
| EOF
| RBRACE { input.rewind(marker); }
| EOL | MultiLineComment // (with EOL in it)
;
So, the EVIL semicolon insertion strikes again!
I'm not really sure, but I think these mark/rewind calls are getting out of sync. The @init block is executed when the rule is entered, both for branch selection and for actual matching. It's actually creating a lot of marks but not cleaning them up. But I don't know why it messes up the parse tree like that.
Anyway, here's a working version of the same rule:
semic
@init
{
// Promote EOL if appropriate
promoteEOL(retval);
}
: SEMIC
| EOF
| { int pos = input.index(); } RBRACE { input.seek(pos); }
| EOL | MultiLineComment // (with EOL in it)
;
It's much simpler and doesn't use the mark/rewind mechanism.
But there's a catch: the semic rule in the parse tree will have a child node } in the case of a semicolon insertion before a closing brace. Try to remove the semicolon after i-- and see the result. You'll have to detect this and handle it in your code. semic should either contain a ; token, or contain EOL (which means a semicolon got silently inserted at this point).
