Access two or more tokens in one line using Bison

Access two or more tokens in one line using Bison - parsing

I am using bison to implement a simple parser. And one line of the syntax looks like:
prefix_definition : PREFIX IDENTIFIER IDENTIFIER ABBR IDENTIFIER ';'
I am unsure how to access the 1st, 2nd and 3rd IDENTIFIER separately. My flex file reads the IDENTIFIER like this:
IDENTIFIER_REGEX (_|[A_Za-z])(_|[0-9A-Za-z])*
{IDENTIFIER_REGEX} { yylval.identifier=strdup(yytext); return IDENTIFIER; }
I could not use simply yylval.identifier. I tried $2.identifier or so but it simply does not work(and it is not supposed to be work anyway). Is there any way of solving this problem?
I am considering to use a FIFO queue if the bison/flex does not support such access. Is this a good solution?

You can specify the type of a token while declaring it (in the bison file) the same way you would for nonterminals (where you'd use %type) like so:
%token <identifier> IDENTIFIER
(where identifier is one of the fields declared in the %union). Then $2, $3 and so on will point to the right thing, without needing to go through yylval (i.e. they will be char *s in your case).

Related

bison syntax for token declaration

I have to read a bison grammar file and do not understand the following declaration:
The grammer has a union declaration
%union {
int i;
char *s;
}
The token declaration looks like this:
%token
TOK0 TOK1 TOK2
TOK3 TOK4 TOK5
TOK6
TOK7
%token <s> TOK8
%token <i> TOK9
My expectation is that because of the union declaration a type must be provided for every token declaration. However TOK0 to TOK7 do not have a type provided. Also I was wondering about the tabular layout of the declaration for TOK0 to TOK7. Is any special meaning given to this layout? I was only finding this source of information about the token declaration ( https://www.gnu.org/software/bison/manual/html_node/Token-Decl.html#Token-Decl ) and it seems it does not cover my usecase.

The tabular layout has no meaning AFAIK. You don't need to assign types to tokens if you don't need their type. Almost always tokens like open_bracket, close_bracket or other things whose value you never need are left type less. You can specify open_bracket as <s> but it is not required and for readability I won't do that.

How Lexer lookahead works with greedy and non-greedy matching in ANTLR3 and ANTLR4?

If someone would clear my mind from the confusion behind look-ahead relation to tokenizing involving greery/non-greedy matching i'd be more than glad. Be ware this is a slightly long post because it's following my thought process behind.
I'm trying to write antlr3 grammar that allows me to match input such as:
"identifierkeyword"
I came up with a grammar like so in Antlr 3.4:
KEYWORD: 'keyword' ;
IDENTIFIER
:
(options {greedy=false;}: (LOWCHAR|HIGHCHAR))+
;
/** lowercase letters */
fragment LOWCHAR
: 'a'..'z';
/** uppercase letters */
fragment HIGHCHAR
: 'A'..'Z';
parse: IDENTIFIER KEYWORD EOF;
however it complains about it can never match IDENTIFIER this way, which i don't really understand. (The following alternatives can never be matched: 1)
Basically I was trying to specify for the lexer that try to match (LOWCHAR|HIGHCHAR) non-greedy way so it stops at KEYWORD lookahead. What i've read so far about ANTLR lexers that there supposed to be some kind of precedence of the lexer rules. If i specify KEYWORD lexer rule first in the lexer grammar, any lexer rules that come after shouldn't be able to match the consumed characters.
After some searching I understand that problem here is that it can't tokenize the input the right way because for example for input: "identifierkeyword" the "identifier" part comes first so it decides to start matching the IDENTIFIER rule when there is no KEYWORD tokens matched yet.
Then I tried to write the same grammar in ANTLR 4, to test if the new run-ahead capabilities can match what i want, it looks like this:
KEYWORD: 'keyword' ;
/** lowercase letters */
fragment LOWCHAR
: 'a'..'z';
/** uppercase letters */
fragment HIGHCHAR
: 'A'..'Z';
IDENTIFIER
:
(LOWCHAR|HIGHCHAR)+?
;
parse: IDENTIFIER KEYWORD EOF;
for the input: "identifierkeyword" it produces this error:
line 1:1 mismatched input 'd' expecting 'keyword'
it matches character 'i' (the very first character) as an IDENTIFIER token, and then the parser expects a KEYWORD token which he doesn't get this way.
Isn't the non-greedy matching for the lexer supposed to match till any other possibility is available in the look ahead? Shouldn't it look ahead for the possibility that an IDENTIFIER can contain a KEYWORD and match it that way?
I'm really confused about this, I have watched the video where Terence Parr introduces the new capabilities of ANTLR4 where he talks about run-ahead threads that watch for all "right" solutions till the end while actually matching a rule. I thought it would work for Lexer rules too, where a possible right solution for tokenizing input "identifierkeyword" is matching IDENTIFIER: "identifier" and matching KEYWORD: "keyword"
I think I have lots of wrongs in my head about non-greedy/greedy matching. Could somebody please explain me how it works?
After all this I've found a similar question here: ANTLR trying to match token within longer token and made a grammar corresponding to that:
parse
:
identifier 'keyword'
;
identifier
:
(HIGHCHAR | LOWCHAR)+
;
/** lowercase letters */
LOWCHAR
: 'a'..'z';
/** uppercase letters */
HIGHCHAR
: 'A'..'Z';
This does what I want now, however I can't see why I can't change the identifier rule to a Lexer rule and LOWCHAR and HIGHCHAR to fragments.
A Lexer doesn't know that letters in "keyword" can be matched as an identifier? or vice versa? Or maybe it is that rules are only defined to have a lookahead inside themselves, not all possible matching syntaxes?

The easiest way to resolve this in both ANTLR 3 and ANTLR 4 is to only allow IDENTIFIER to match a single input character, and then create a parser rule to handle sequences of these characters.
identifier : IDENTIFIER+;
IDENTIFIER : HIGHCHAR | LOWCHAR;
This would cause the lexer to skip the input identifier as 10 separate characters, and then read keyword as a single KEYWORD token.
The behavior you observed in ANTLR 4 using the non-greedy operator +? is similar to this. This operator says "match as few (HIGHCHAR|LOWCHAR) blocks as possible while still creating an IDENTIFIER token". Clearly the fewest number to create the token is one, so this was effectively a highly inefficient way of writing IDENTIFIER to match a single character. The reason the parse rule failed to handle this is it only allows a single IDENTIFIER token to appear before the KEYWORD token. By creating a parser rule identifier like I showed above, the parser would be able to treat sequences of IDENTIFIER tokens (which are each a single character), as a single identifier.
Edit: The reason you get the message "The following alternatives can never be matched..." in ANTLR 3 is the static analysis has determined that the positive closure in the rule IDENTIFIER will never match more than 1 character because the rule will always be successful with exactly 1 character.

how typedef-name - identifier issue is resolved in C?

I've been recently writing parser for language based on C. I'm using CUP (Yacc for Java).
I want to implement "The lexer hack" (http://eli.thegreenplace.net/2011/05/02/the-context-sensitivity-of-c%E2%80%99s-grammar-revisited/ or https://en.wikipedia.org/wiki/The_lexer_hack), to distinguish typedef names and variable/function names etc. To enable declaring variables of the same name as type declared earlier (example from first link):
typedef int AA;
void foo() {
AA aa; /* OK - define variable aa of type AA */
float AA; /* OK - define variable AA of type float */
}
we have to introduce some new productions, where variable/function name could be either IDENTIFIER or TYPENAME. And this is the moment where difficulties occur - conflicts in grammar.
I was trying not to use this messy Yacc grammar for gcc 3.4 (http://yaxx.googlecode.com/svn-history/r2/trunk/gcc-3.4.0/gcc/c-parse.y), but this time I have no idea how to resolve conflicts on my own. I took a look at Yacc grammar:
declarator:
after_type_declarator
| notype_declarator
;
after_type_declarator:
...
| TYPENAME
;
notype_declarator:
...
| IDENTIFIER
;
fndef:
declspecs_ts setspecs declarator
// some action code
// the rest of production
...
setspecs: /* empty */
// some action code
declspecs_ts means declaration_specifiers where
"Whether a type specifier has been seen; after a type specifier, a typedef name is an identifier to redeclare (_ts or _nots)."
From declspecs_ts we can reach
typespec_nonreserved_nonattr:
TYPENAME
...
;
At the first glance I can't believe how shift/reduce conflicts does not appear!
setspecs is empty, so we have declspecs_ts followed by declarator, so that we can expect that parser should be confused whether TYPENAME is from declspecs_ts or from declarator.
Can anyone explain this briefly (or even precisely). Thanks in advance!
EDIT:
Useful link: http://www.gnu.org/software/bison/manual/bison.html#Semantic-Tokens

I can't speak for the specific code.
But the basic trick is that the C lexer inspects every IDENTIFIER, and decides if might be the name of a typedef. If so, then it changes the lexeme type to TYPEDEF and hands it to the parser.
How is the lexer to know what identifiers are typedefs? The parser must in effect tell it, by capturing typedef information as it runs. Somewhere in the grammar related to declarations, there must be an action to provide this information. I would have expected it to be attached to the grammar rules for, well, typedef declarations.
You didn't show what "setspec" did; maybe that's the place. A common trick used with LR parser generators is to introduce a grammar rule E with an empty right hand (your example "setspec"?), to be invoked in the middle of some other grammar rule (your example "fndef") just to enable access to a semantic action in the middle of processing that rule.
This whole trick is to get around parsing ambiguity if you can't tell typedefs from other identifiers. If your parser tolerates ambiguity, you don't need this hack at all; just parse, and built ASTs with both (sub)parses. After you acquire the AST, a tree walk can find type information and eliminate inconsistent subparses. We do this with GLR for both C and C++, and it beautifully separates parsing from name resolution.

Bison grammar warnings

I am writing a parser with Bison and I am getting the following warnings.
fol.y:42 parser name defined to default :"parse"
fol.y:61: warning: type clash ('' 'pred') on default action
I have been using Google to search for a way to get rid of them, but have pretty much come up empty handed on what they mean (much less how to fix them) since every post I found with them has a compilation error and the warnings them selves aren't addressed. Could someone tell me what they mean and how to fix them? The relevant code is below. Line 61 is the last semicolon. I cut out the rest of the grammar since it is incredibly verbose.
%union {
char* var;
char* name;
char* pred;
}
%token <var> VARIABLE
%token <name> NAME
%token <pred> PRED
%%
fol:
declines clauses {cout << "Done parsing with file" << endl;}
;
declines:
declines decline
|decline
;
decline:
PRED decs
;

The first message is likely just a warning that you didn't include %start parse in the grammar specification.
The second means that somewhere you have rule that is supposed to return a value but you haven't properly specified which type of value it is to return. The PRED returns the pred element of your union; the problem might be that you've not created %type entries for decline and declines. If you have a union, you have to specify the type for most, if not all, rules — or maybe just rules that don't have an explicit action (so as to override the default $$ = $1; action).
I'm not convinced that the problem is in the line you specify, and because we don't have a complete, minimal reproduction of your problem, we can't investigate for you to validate it. The specification for decs may be relevant (I'm not convinced it is, but it might be).
You may get more information from the output of bison -v, which is the y.output file (or something similar).

Finally found it.
To fix this:
fol.y:42 parser name defined to default :"parse"
Add %name parse before %token
Eg:
%name parse
%token NUM
(From: https://bdhacker.wordpress.com/2012/05/05/flex-bison-in-ubuntu/#comment-2669)

Create a Print Function

I'm learning Bison and at this time the only thing that I do was the rpcalc example, but now I want to implement a print function(like printf of C), but I don't know how to do this and I'm planning to have a syntax like this print ("Something here");, but I don't know how to build the print function and I don't know how to create that ; as a end of line. Thanks for your help.

You first need to ask yourself:
What are the [sub-]parts of my 'print ("something");' syntax ?
Once you identify these parts, "simply" describe them in the form of grammar syntax rules, along with applicable production rules. And then let Bison generate the parser for you; that's about it.
To put you on your way:
The semi-column is probably a element you will use to separate statemements (such a one "call" to print from another).
'print' itself is probably a keyword, or preferably a native function name of your language.
The print statement appears to take a literal string as [one of] its arguments. a literal string starts and ends with a double quote (and probably allow for escaped quotes within itself)
etc.
The bolded and italic expressions above are some of the entities (the 'symbols' in parser lingo) you'll likely need to define in the syntax for your language. For that you'll use Bison grammar rules, such as
stmt : print_stmt ';' | input_stmt ';'| some_other_stmt ';' ;
prnt_stmt : print '(' args ')'
{ printf( $3 ); }
;
args : arg ',' args;
...
Since the question asked about the semi-column, maybe some confusion was from the different uses thereof; see for example above how the ';' belong to your language's syntax whereby the ; (no quotes) at the end of each grammar rule are part of Bison's language.
Note: this is of course a simplistic implementation, aimed at showing the essential. Also the Bison syntax may be a tat off (been there / done it, but a long while back ;-) I then "met" ANTLR never to return to Bison, although I do see how its lightweight and fully self contained nature can make it appropriate in some cases)

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart

Access two or more tokens in one line using Bison - parsing

Related

bison syntax for token declaration

How Lexer lookahead works with greedy and non-greedy matching in ANTLR3 and ANTLR4?

how typedef-name - identifier issue is resolved in C?

Bison grammar warnings

Create a Print Function

Categories

Resources