Skipping tokens in yacc - token

I want to have a grammar rule like below in my yacc file:
insert_statement: INSERT INTO NAME (any_token)* ';'
When an error occurs, yacc can skip all tokens up to a given token as follows:
stat: error ';'
Is there any mechanism in yacc to skip an arbitrary amount of input when there is no error?

After some time I solved my problem in the following way, and I'm posting it in case it is helpful to someone:
Add a token definition to lex including the characters that should be in a skipping token:
<*>[A-Za-z0-9_:.-]+ { return SKIPPINGTOKS; }
(this would identify any token like a, 1, hello, hello123 etc.)
Then add rules like the following to the yacc grammar as required:
insert_statement: INSERT INTO NAME skipping_portion ';'
skipping_portion: SKIPPINGTOKS | skipping_portion SKIPPINGTOKS
Hope this may help someone...

I think you would want to do something like this. It skips any and all tokens that are not the semicolon.
insert_statement: INSERT INTO NAME discardable_tokens_or_epsilon ';' ;
discardable_tokens_or_epsilon: discardable_tokens
    | epsilon
    ;
discardable_tokens: discardable_tokens discardable_token
    | discardable_token
    ;
discardable_token: FOO
    | BLETCH /* et cetera: any token other than a semicolon */
    ;
epsilon: ;
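To make the intent of these rules concrete, here is a minimal Python sketch of the language they accept: INSERT INTO, a name, any number of non-semicolon tokens (possibly none), then a semicolon. It is not yacc code; the function name `parse_insert_statement` and the plain-string token representation are assumptions made for illustration only.

```python
def parse_insert_statement(tokens):
    """Return the table name if tokens match
    INSERT INTO NAME (any token except ';')* ';', otherwise None.

    Hypothetical helper: tokens are plain strings, not yacc token codes.
    """
    if len(tokens) < 4 or tokens[0] != "INSERT" or tokens[1] != "INTO":
        return None
    name = tokens[2]
    if name == ";":
        return None
    i = 3
    # Discard every token that is not the terminating semicolon.
    while i < len(tokens) and tokens[i] != ";":
        i += 1
    # The statement must end exactly at the semicolon.
    return name if i == len(tokens) - 1 else None
```

The empty case (the epsilon alternative) corresponds to the loop body never running, as in `["INSERT", "INTO", "t", ";"]`.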

Simply don't specify a production rule containing the tokens you'd like to skip.


Ambiguous ANTLR parser rule

I have a very simple example text which I want to parse with ANTLR, yet I'm getting wrong results due to an ambiguous rule definition.
Here is the grammar:
grammar SimpleExampleGrammar;
prog : event EOF;
event : DEFINE EVT_HEADER eventName=eventNameRule;
eventNameRule : DIGIT+;
DEFINE : '#define';
DIGIT : [0-9a-zA-Z_];
WS : ('' | ' ' | '\r' | '\n' | '\t') -> channel(HIDDEN);
First text example:
#define EVT_EX1
Second text example:
#define EVT_EX1
#define EVT_EX2
So, the first example is parsed correctly.
However, the second example doesn't work: eventNameRule matches the next "#define ...", and the parse tree is incorrect.
I'd appreciate any help changing the grammar to parse this correctly.
Besides the missing loop specifier, you also have a problem in your WS rule: the first alternative is the empty literal ''. Remove that. And, by the way, give your DIGIT rule a different name; it matches more than just digits.
As Adrian pointed out, my main mistake was that the initial rule (prog) used "event" and not "event+". Changing it solves the issue.
Thanks Adrian.

ANTLR 3 grammar omitted part of input source code

I am using this ANTLR 3 grammar and ANTLRWorks for testing that grammar.
But I can't figure out why some parts of my input text are omitted.
I would like to rewrite this grammar and display every element (lparen, keywords, semicolon,..) of the source file (input) in AST / CST.
I've tried everything, but without success. Can someone who is experienced with ANTLR help me?
Parse tree:
I've managed to narrow it down to the semic rule:
This rule handles semicolons reported by the lexer and situations where the ECMA 3 specification states there should be semicolons automatically inserted.
The auto semicolons are not actually inserted but this rule behaves as if they were.
In the following situations an ECMA 3 parser should auto insert absent but grammatically required semicolons:
- the current token is a right brace
- the current token is the end of file (EOF) token
- there is at least one end of line (EOL) token between the current token and the previous token.
The RBRACE is handled by matching it but not consuming it.
The EOF needs no further handling because it is not consumed by default.
The EOL situation is handled by promoting the EOL or MultiLineComment with an EOL present from off channel to on channel
and thus making it parseable instead of handling it as white space. This promoting is done in the action promoteEOL.
// Mark current position so we can unconsume a RBRACE.
int marker = input.mark();
// Promote EOL if appropriate
| RBRACE { input.rewind(marker); }
| EOL | MultiLineComment // (with EOL in it)
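The three insertion conditions listed in the rule's comment can be sketched as a small predicate. This is an illustrative Python model, not ANTLR code: the `Token` class, its `eol_before` flag, and the function `semic_matches` are hypothetical names, and the flag stands in for "at least one EOL between the current token and the previous token".

```python
from dataclasses import dataclass


@dataclass
class Token:
    kind: str                 # e.g. "SEMIC", "RBRACE", "EOF", "IDENT"
    eol_before: bool = False  # assumed flag: an EOL precedes this token


def semic_matches(current: Token) -> bool:
    """True if a semicolon is present or may be treated as inserted."""
    if current.kind == "SEMIC":
        return True           # a real semicolon from the lexer
    if current.kind in ("RBRACE", "EOF"):
        return True           # matched but not consumed, per the comment
    return current.eol_before  # a line break separates the two tokens
```

For example, an identifier on the same line as the previous token does not qualify, while the same identifier on a new line does.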
So, the EVIL semicolon insertion strikes again!
I'm not really sure, but I think these mark/rewind calls are getting out of sync. The @init block is executed when the rule is entered, both for branch selection and for actual matching. It's actually creating a lot of marks but not cleaning them up. But I don't know why it messes up the parse tree like that.
Anyway, here's a working version of the same rule:
// Promote EOL if appropriate
| { int pos = input.index(); } RBRACE { input.seek(pos); }
| EOL | MultiLineComment // (with EOL in it)
It's much simpler and doesn't use the mark/rewind mechanism.
But there's a catch: the semic rule in the parse tree will have a child node } in the case of a semicolon insertion before a closing brace. Try to remove the semicolon after i-- and see the result. You'll have to detect this and handle it in your code. semic should either contain a ; token, or contain EOL (which means a semicolon got silently inserted at this point).

Parsing of optionals with PEG (Grako) falling short?

My colleague PaulS asked me the following:
I'm writing a parser for an existing language (SystemVerilog - an IEEE standard), and the specification has a rule in it that is similar in structure to this:
[[data_type] identifier ':' ] 'coverpoint' identifier ';'
'int' | 'float' | identifier
The problem is that when parsing the following legal string:
anIdentifier: coverpoint another_identifier;
anIdentifier matches with data_type (via its identifier option) successfully, which means Grako is looking for another identifier after it and then fails. It doesn't then try to parse without the data_type part.
I can re-write the rule as follows,
[data_type identifier ':' | identifier ':' ] 'coverpoint' identifier ';'
but I wonder:
- whether this is intentional, and
- whether there's a better syntax.
Is this a PEG-in-general issue, or a tool (Grako) one?
It says here that in PEGs the choice operator is ordered: it avoids the ambiguities of CFGs by always using the first match.
In your first example [data_type] succeeds parsing id, so it fails when it finds : instead of another identifier.
That may be because [data_type] behaves like (data_type | ε) so it will always parse data_type with the first id.
In [data_type identifier ':' | identifier ':' ] the first choice fails when there is no second id, so the parser backtracks and tries with the second choice.
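The behaviour described above can be reproduced with a few hand-written PEG-style combinators. This is a minimal sketch of ordered choice and optionals in general, not Grako's implementation; all names (`lit`, `seq`, `choice`, `opt`, `accepts`) are made up for the example, and input is assumed to be pre-split into whitespace-separated tokens.

```python
KEYWORDS = {"coverpoint", "int", "float"}


def lit(word):
    def parse(toks, i):
        return i + 1 if i < len(toks) and toks[i] == word else None
    return parse


def identifier(toks, i):
    ok = i < len(toks) and toks[i].isidentifier() and toks[i] not in KEYWORDS
    return i + 1 if ok else None


def seq(*parsers):
    def parse(toks, i):
        for p in parsers:
            i = p(toks, i)
            if i is None:
                return None
        return i
    return parse


def choice(*parsers):
    def parse(toks, i):
        for p in parsers:
            j = p(toks, i)
            if j is not None:
                return j  # first success wins; later options are never tried
        return None
    return parse


def opt(parser):
    def parse(toks, i):
        j = parser(toks, i)
        return i if j is None else j  # commits to a match if there is one
    return parse


data_type = choice(lit("int"), lit("float"), identifier)
tail = (lit("coverpoint"), identifier, lit(";"))

# [[data_type] identifier ':'] 'coverpoint' identifier ';'
original = seq(opt(seq(opt(data_type), identifier, lit(":"))), *tail)

# [data_type identifier ':' | identifier ':'] 'coverpoint' identifier ';'
rewritten = seq(opt(choice(seq(data_type, identifier, lit(":")),
                           seq(identifier, lit(":")))), *tail)


def accepts(rule, text):
    toks = text.split()
    return rule(toks, 0) == len(toks)
```

With these definitions, `accepts(original, "anIdentifier : coverpoint another_identifier ;")` is False because `opt(data_type)` consumes `anIdentifier` and is never retried as empty, while `accepts(rewritten, ...)` on the same input is True because the failed first alternative falls through to the second.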

Antlr mismatched '>' for include macro

I started working with ANTLR a few days ago. I'd like to use it to parse #include macros in C. Only the includes are of interest to me; all other parts are irrelevant. Here is the simple grammar file I wrote:
... parser part omitted...
INCLUDE : '#include';
FILE_NAME: ('a'..'z'|'A'..'Z'|'0'..'9'|'_'|'.'|' ')+;
MACROS: '#'('if' | 'ifdef' | 'define' | 'endif' | 'undef' | 'elif' | 'else' );
//MACROS: '#'('a'..'z'|'A'..'Z')+;
OPERATORS: ('+'|'-'|'*'|'/'|'='|'=='|'!='|'>'|'>='|'<'|'<='|'>>'|'<<'|'<<<'|'|'|'&'|','|';'|'.'|'->'|'#');
... other supporting tokens like ID, WS and COMMENT ...
This grammar produces ambiguity when such statements are encountered:
output: mismatched character ';' expecting '>'
It seems to be trying to match INCLUDE_FILE_ANGLE instead of treating the ";" as OPERATORS.
I've heard there's an operator called a syntactic predicate, but I'm not sure how to use it properly in this case.
How can I solve this problem in an ANTLR-encouraged way?
Looks like there's not lots of activity about antlr here.
Anyway i figured this out.
INCLUDE_MACRO: ('#include')=>'#include';
VERSION_MACRO: ('#version')=>'#version';
This only solves the first half of the problem. Secondly, one cannot use INCLUDE_FILE_ANGLE to match the desired string in the #include directive.
The '<' FILE_NAME '>' construct creates ambiguity and must either be broken down into basic lexer tokens or be handled with more advanced context-aware checks. I'm not familiar with the latter technique, so I wrote this in the parser rule:
include_statement :
INCLUDE_MACRO include_file
-> ^(INCLUDE_MACRO include_file);
This works, but it admittedly looks ugly.
I hope experienced users can comment with much better solution.

bison error recovery

I have found out that I can use 'error' in a grammar rule as a mechanism for error recovery. So if there is an error, the parser must discard the current line and resume parsing from the next line. An example from the bison manual to achieve this could be something like this:
|stmts exp
| error '\n'
But I cannot use that, because I had to make flex ignore '\n' in my scanner so that an expression is not restricted to a single line. How can I make the parser, when it encounters an error, continue parsing from the following line, given that there is no special character (e.g. a semicolon) to indicate the end of an expression and there is no 'newline' token?
Since you've eliminated the marker used by the example, you're going to have to pull a stunt to get the equivalent effect.
I think you can use this:
| stmts exp
| error { eat_to_newline(); }
Where eat_to_newline() is a function in the scanner (source file) that arranges to discard any saved tokens and read up to the next newline.
#include <stdio.h>
extern void eat_to_newline(void);
void eat_to_newline(void)
{
    int c;
    while ((c = getchar()) != EOF && c != '\n')
        ; /* discard everything up to and including the newline */
}
It probably needs to be a little more complex than that, but not a lot more complex than that. You might need to use yyerrok; (and, as the comment reminds me, yyclearin; too) after calling eat_to_newline().
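The resynchronisation idea can be tried out in isolation with a small Python model. This is only a sketch of the discard-to-newline behaviour, not scanner code: it assumes the input is an ordinary text stream and ignores any lookahead the real scanner or parser may already have buffered.

```python
import io


def eat_to_newline(stream):
    """Discard characters up to and including the next newline.

    Returns the number of characters discarded. Stops quietly at end of input.
    """
    discarded = 0
    while True:
        ch = stream.read(1)
        if ch == "":       # end of input
            break
        discarded += 1
        if ch == "\n":     # resume point: start of the following line
            break
    return discarded


# Pretend the parser hit an error after reading "1 +" on the first line.
source = io.StringIO("1 + + 2\n3 + 4\n")
source.read(3)
eat_to_newline(source)
remaining = source.read()  # what the parser would see after recovery
```

In a real flex/bison setup the characters would have to be drained from the scanner's own input buffer, which is one reason the C version above may need to be more elaborate.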
