bison syntax for token declaration - token

I have to read a bison grammar file and do not understand the following declaration:
The grammer has a union declaration
%union {
int i;
char *s;
}
The token declaration looks like this:
%token
TOK0 TOK1 TOK2
TOK3 TOK4 TOK5
TOK6
TOK7
%token <s> TOK8
%token <i> TOK9
My expectation is that because of the union declaration a type must be provided for every token declaration. However TOK0 to TOK7 do not have a type provided. Also I was wondering about the tabular layout of the declaration for TOK0 to TOK7. Is any special meaning given to this layout? I was only finding this source of information about the token declaration ( https://www.gnu.org/software/bison/manual/html_node/Token-Decl.html#Token-Decl ) and it seems it does not cover my usecase.

The tabular layout has no meaning AFAIK. You don't need to assign types to tokens if you don't need their type. Almost always tokens like open_bracket, close_bracket or other things whose value you never need are left type less. You can specify open_bracket as <s> but it is not required and for readability I won't do that.

Related

ANTLR4: implicit or explicit token definition

What are the benefits and drawbacks of using explicit token definitions in ANTLR4? I find the text in single parentheses more descriptive and easier to use than creating a separate token and using that in place of the text.
E.g.:
grammar SimpleTest;
top: library | module ;
library: 'library' library_name ';' ;
library_name: IDENTIFIER;
module: MODULE module_name ';' ;
module_name: IDENTIFIER;
MODULE: 'module' ;
IDENTIFIER: [a-zA-Z0-9]+;
The generated tokens are:
T__0=1
T__1=2
MODULE=3
IDENTIFIER=4
'library'=1
';'=2
'module'=3
If I am not interested in the 'library' "token", since the rule already establishes what I am matching against, and I will just skip over it anyway, does it make any sense to replace it with LIBRARY and a token declaration? (The number of tokens then will grow.) Why is this a warning in ANTLRWorks?
Actually, there is a difference between implicit and explicit tokens:
From "The Definitive ANTLR4 Reference", page 76:
ANTLR collects and separates all of the string literals and lexer
rules from the parser rules. Literals such as 'enum' become lexical
rules and go immediately after the parser rules but before the
explicit lexical rules.
ANTLR lexers resolve ambiguities between
lexical rules by favoring the rule specified first.
Highlight from me.
The Antlr (and most compiler/compiler generators) implementation use the concept of a separate lexer and parser, mostly for performance reasons. In this model, the lexer is responsible for reading the actual characters in the input string and returning a list of the tokens found, in a more concise represetations, like an enum or int-codes for each token. The parser will work on these tokens instead of the original input for ease of implementation and performance.
There are two ways to "declare" the usage of a token in Antlr, one is explicit and have a regular pattern expresion, the other is implicit, is always a fixed string.
ExplicitRegExp: [A-Z][a-z]+; // lexer rule starts with uppercase letter
ExplicitFixed: 'fixed';
parserRule: 'implicit' ExplicitRegExp; // parser rules starts with lowercase letter
When declare a token explicitly, it's assigned an int-code to be used in the parsing state machine. Let's say ExplicitRegExp becomes 1 and ExplicitFixed becomes 2. But the parser will also need the implicit tokens to be able to parse the grammar correctly, so the implicit token is assigned the code 3 implicitly.
How is that bad? You may have typos in different parts of the grammar:
a : 'implicit' c;
b : 'implcit' d; // typo here
And your grammar will not work as expected, because implcit will be a valid token, assigned the int-code 4. It also makes your grammar/lexer harder to debug due to Antlr auto-generating names for the implicit rules, like T___0. Another thing is that you lose the ordering of lexer rules, which could make a diference (usually not because implicit tokens are all fixed content).
The Antlr compiler could choose to give you an error message and require you to write the tokens explicitly, but it chooses to let it go and just warn you that you should not to that, probably for prototyping/testing reasons.
To let Antlr be happy, do it the verbose way and declare all of your tokens:
grammar SimpleTest;
top: library | module ;
library: 'library' library_name=IDENTIFIER ';' ; // I'm using aliasing instead of different parser rule here, just a preference
module: 'module' module_name=IDENTIFIER ';' ;
MODULE: 'module' ;
LIBRARY: 'library' ;
IDENTIFIER: [a-zA-Z0-9]+;
Then it makes no difference if you reference a fixed token by it's explicit name (like MODULE) or by its content (like 'module').

Jison Lexer - Detect Certain Keyword as an Identifier at Certain Times

"end" { return 'END'; }
...
0[xX][0-9a-fA-F]+ { return 'NUMBER'; }
[A-Za-z_$][A-Za-z0-9_$]* { return 'IDENT'; }
...
Call
: IDENT ArgumentList
{{ $$ = ['CallExpr', $1, $2]; }}
| IDENT
{{ $$ = ['CallExprNoArgs', $1]; }}
;
CallArray
: CallElement
{{ $$ = ['CallArray', $1]; }}
;
CallElement
: CallElement "." Call
{{ $$ = ['CallElement', $1, $3]; }}
| Call
;
Hello! So, in my grammar I want "res.end();" to not detect end as a keyword, but as an ident. I've been thinking for a while about this one but couldn't solve it. Does anyone have any ideas? Thank you!
edit: It's a C-like programming language.
There's not quite enough information in the question to justify the assumptions I'm making here, so this answer may be inexact.
Let's suppose we have a somewhat Lua-like language in which a.b is syntactic sugar for a["b"]. Furthermore, since the . must be followed by a lexical identifier -- in other words, it is never followed by a syntactic keyword -- we'd like to inhibit keyword recognition in this context.
That's a pretty simple rule. It's simple enough that the lexer could implement it without any semantic information at all; all that it says is that the token which follows a . must be an identifier. In this context, keywords should be treated as identifiers, and anything else other than an identifier is an error.
We can do this with start conditions. Specifically, we define a start condition which is only used after a . token:
%x selector
%%
/* White space and comment rules need to explicitly include
* the selector condition
*/
<INITIAL,selector>\s+ ;
/* Other rules, including keywords, are unmodified */
"end" return "END";
/* The dot rule triggers a new start condition */
"." this.begin("selector"); return ".";
/* Outside of the start condition, identifiers don't change state. */
[A-Za-z_]\w* yylval = yytext; return "ID";
/* Only identifiers are valid in this start condition, and if found
* the start condition is changed back. Anything else is an error.
*/
<selector>[A-Za-z_]\w* yylval = yytext; this.popState(); return "ID";
<selector>. parse_error("Expecting identifier");
Modify your parser, so it always knows what it is expecting to read next (that will be some set of tokens, you can compute this using the notion of First(x) for x being any nonterminal).
When lexing, have the lexer ask the parser what set of tokens it expects next.
Your keywork reconizer for 'end' asks the parser, and it either ways "expecting 'end'" at which pointer the lexer simply hands on the 'end' lexeme, or it says "expecting ID" at which point it hands the parser an ID with name text "end".
This may or may not be convenient to get your parser to do. But you need something like this.
We use a GLR parser; our parser accepts multiple tokens in the same place. Our solution is to generate both the 'end' keyword and and the identifier with text "end" and shove them both into the GLR parser. It can handle local ambiguity; the multiple parses caused by this proceed until the parser with the wrong assumption encounters a syntax error, and then it just vanishes, by fiat. The last standing parser is the one with the right set of assumptions. This scheme is somewhat like the first one, just that we hand the parser the choices and it decides rather than making the lexer decide.
You might be able to send your parser a "two-interpretation" lexeme, e.g., a keyword-in-context lexeme, which in essence claims it it both a keyword and/or an identifier. With a single token lookahead internally, the parser can likely decide easily and restamp the lexeme. Not as general as the GLR solution, but probably works in a lot of cases.

how typedef-name - identifier issue is resolved in C?

I've been recently writing parser for language based on C. I'm using CUP (Yacc for Java).
I want to implement "The lexer hack" (http://eli.thegreenplace.net/2011/05/02/the-context-sensitivity-of-c%E2%80%99s-grammar-revisited/ or https://en.wikipedia.org/wiki/The_lexer_hack), to distinguish typedef names and variable/function names etc. To enable declaring variables of the same name as type declared earlier (example from first link):
typedef int AA;
void foo() {
AA aa; /* OK - define variable aa of type AA */
float AA; /* OK - define variable AA of type float */
}
we have to introduce some new productions, where variable/function name could be either IDENTIFIER or TYPENAME. And this is the moment where difficulties occur - conflicts in grammar.
I was trying not to use this messy Yacc grammar for gcc 3.4 (http://yaxx.googlecode.com/svn-history/r2/trunk/gcc-3.4.0/gcc/c-parse.y), but this time I have no idea how to resolve conflicts on my own. I took a look at Yacc grammar:
declarator:
after_type_declarator
| notype_declarator
;
after_type_declarator:
...
| TYPENAME
;
notype_declarator:
...
| IDENTIFIER
;
fndef:
declspecs_ts setspecs declarator
// some action code
// the rest of production
...
setspecs: /* empty */
// some action code
declspecs_ts means declaration_specifiers where
"Whether a type specifier has been seen; after a type specifier, a typedef name is an identifier to redeclare (_ts or _nots)."
From declspecs_ts we can reach
typespec_nonreserved_nonattr:
TYPENAME
...
;
At the first glance I can't believe how shift/reduce conflicts does not appear!
setspecs is empty, so we have declspecs_ts followed by declarator, so that we can expect that parser should be confused whether TYPENAME is from declspecs_ts or from declarator.
Can anyone explain this briefly (or even precisely). Thanks in advance!
EDIT:
Useful link: http://www.gnu.org/software/bison/manual/bison.html#Semantic-Tokens
I can't speak for the specific code.
But the basic trick is that the C lexer inspects every IDENTIFIER, and decides if might be the name of a typedef. If so, then it changes the lexeme type to TYPEDEF and hands it to the parser.
How is the lexer to know what identifiers are typedefs? The parser must in effect tell it, by capturing typedef information as it runs. Somewhere in the grammar related to declarations, there must be an action to provide this information. I would have expected it to be attached to the grammar rules for, well, typedef declarations.
You didn't show what "setspec" did; maybe that's the place. A common trick used with LR parser generators is to introduce a grammar rule E with an empty right hand (your example "setspec"?), to be invoked in the middle of some other grammar rule (your example "fndef") just to enable access to a semantic action in the middle of processing that rule.
This whole trick is to get around parsing ambiguity if you can't tell typedefs from other identifiers. If your parser tolerates ambiguity, you don't need this hack at all; just parse, and built ASTs with both (sub)parses. After you acquire the AST, a tree walk can find type information and eliminate inconsistent subparses. We do this with GLR for both C and C++, and it beautifully separates parsing from name resolution.

Bison grammar warnings

I am writing a parser with Bison and I am getting the following warnings.
fol.y:42 parser name defined to default :"parse"
fol.y:61: warning: type clash ('' 'pred') on default action
I have been using Google to search for a way to get rid of them, but have pretty much come up empty handed on what they mean (much less how to fix them) since every post I found with them has a compilation error and the warnings them selves aren't addressed. Could someone tell me what they mean and how to fix them? The relevant code is below. Line 61 is the last semicolon. I cut out the rest of the grammar since it is incredibly verbose.
%union {
char* var;
char* name;
char* pred;
}
%token <var> VARIABLE
%token <name> NAME
%token <pred> PRED
%%
fol:
declines clauses {cout << "Done parsing with file" << endl;}
;
declines:
declines decline
|decline
;
decline:
PRED decs
;
The first message is likely just a warning that you didn't include %start parse in the grammar specification.
The second means that somewhere you have rule that is supposed to return a value but you haven't properly specified which type of value it is to return. The PRED returns the pred element of your union; the problem might be that you've not created %type entries for decline and declines. If you have a union, you have to specify the type for most, if not all, rules — or maybe just rules that don't have an explicit action (so as to override the default $$ = $1; action).
I'm not convinced that the problem is in the line you specify, and because we don't have a complete, minimal reproduction of your problem, we can't investigate for you to validate it. The specification for decs may be relevant (I'm not convinced it is, but it might be).
You may get more information from the output of bison -v, which is the y.output file (or something similar).
Finally found it.
To fix this:
fol.y:42 parser name defined to default :"parse"
Add %name parse before %token
Eg:
%name parse
%token NUM
(From: https://bdhacker.wordpress.com/2012/05/05/flex-bison-in-ubuntu/#comment-2669)

Access two or more tokens in one line using Bison

I am using bison to implement a simple parser. And one line of the syntax looks like:
prefix_definition : PREFIX IDENTIFIER IDENTIFIER ABBR IDENTIFIER ';'
I am unsure how to access the 1st, 2nd and 3rd IDENTIFIER separately. My flex file reads the IDENTIFIER like this:
IDENTIFIER_REGEX (_|[A_Za-z])(_|[0-9A-Za-z])*
{IDENTIFIER_REGEX} { yylval.identifier=strdup(yytext); return IDENTIFIER; }
I could not use simply yylval.identifier. I tried $2.identifier or so but it simply does not work(and it is not supposed to be work anyway). Is there any way of solving this problem?
I am considering to use a FIFO queue if the bison/flex does not support such access. Is this a good solution?
You can specify the type of a token while declaring it (in the bison file) the same way you would for nonterminals (where you'd use %type) like so:
%token <identifier> IDENTIFIER
(where identifier is one of the fields declared in the %union). Then $2, $3 and so on will point to the right thing, without needing to go through yylval (i.e. they will be char *s in your case).

Resources