Use yylex() to get the list of token types from an input string

Use yylex() to get the list of token types from an input string - parsing

I have a CLI that was made using Bison and Flex which has grown large and complicated, and I'm trying to get the complete sequence of tokens (yytokentype or the corresponding yytranslate Bison symbol numbers) for a given input string to the parser.
Ideally, every time yyerror() is called I want to store the sequence of tokens that were identified during parse. I don't need to know the yylval's, states, actions, etc, just the token list resulting from the string input to the buffer.
If a straightforward way of doing this doesn't exist, then just a stand-alone way of going from string --> yytokentypes will work.
The below code just has debugging printouts, which I'll change to storing it in the place I want as soon as I figure out how to get the tokens.
// When an error condition is reached, yylex() to get the yytokentypes
void yyerror(const char *s)
{
std::cerr<<"LEX\n";
int tok; // yytokentype
do
{
tok = yylex();
std::cerr<<tok<<",";
}while(tok);
std::cerr<<"LEX\n";
}

A simpler solution is to just change the name of the lexer using the YY_DECL macro and then add a definition of yylex at the end:
%{
// ...
#include "parser.tab.h"
#define YY_DECL static int wrapped_lexer(void)
%}
%%
/* rules */
%%
int yylex(void) {
int token = wrapped_lexer();
/* do something with the token */
return token;
}
Having said that, unless the source code is read-once for some reason, it's probably faster on the whole to rescan the input only if an error is encountered rather than saving the token list in case an error is an encountered. Lexing is really pretty fast, and in many use cases, syntactically correct inputs are more common than erroneous ones.

OK I figured a way to do this without having to re-tokenize the input string. Flex allows you to define YY_DECL, which by default is found in the generated lexer file to produce the yylex() declaration:
#ifndef YY_DECL
//some other stuff
#define YY_DECL int yylex (void)
#endif /* !YY_DECL */
And this goes in place
/** The main scanner function which does all the work.
*/
YY_DECL
{
// Body of yylex() which returns the yytokentype
}
A tricky thing that I'm able to do is re-define yylex() via YY_DECL to capture every token before it gets returned to the caller. This allows me to store the yytokentype for every call without changing the parser's behavior one bit. Below I'm just printing it out here for testing:
#define YY_DECL \
int yylex2(void); \
int yylex (void) \
{ \
int ret; \
ret = yylex2(); \
std::cerr<<"yylex2 returns: "<<ret<<"\n"; \
return ret; \
} \

Related

Does the using declaration allow for incomplete types in all cases?

I'm a bit confused about the implications of the using declaration. The keyword implies that a new type is merely declared. This would allow for incomplete types. However, in some cases it is also a definition, no? Compare the following code:
#include <variant>
#include <iostream>
struct box;
using val = std::variant<std::monostate, box, int, char>;
struct box
{
int a;
long b;
double c;
box(std::initializer_list<val>) {
}
};
int main()
{
std::cout << sizeof(val) << std::endl;
}
In this case I'm defining val to be some instantiation of variant. Is this undefined behaviour? If the using-declaration is in fact a declaration and not a definition, incomplete types such as box would be allowed to instantiate the variant type. However, if it is also a definition, it would be UB no?
For the record, both gcc and clang both create "32" as output.

Since you've not included language-lawyer, I'm attempting a non-lawyer answer.
Why should that be UB?
With a using delcaration, you're just providing a synonym for std::variant<whatever>. That doesn't require an instantiation of the object, nor of the class std::variant, pretty much like a function declaration with a parameter of that class doesn't require it:
void f(val); // just fine
The problem would occur as soon as you give to that function a definition (if val is still incomplete because box is still incomplete):
void f(val) {}
But it's enough just to change val to val& for allowing a definition,
void f(val&) {}
because the compiler doesn't need to know anything else of val than its name.
Furthermore, and here I'm really inventing, "incomplete type" means that some definition is lacking at the point it's needed, so I expect you should discover such an issue at compile/link time, and not by being hit by UB. As in, how can the compiler and linker even finish their job succesfully if a definition to do something wasn't found?

Can I get bison to make yytname externally visible?

Bison generates at table of tag names when processing my grammar, something like
static const char *const yytname[] =
{
"$end", "error", "$undefined", "TAG", "SCORE",
...
}
The static keyword keeps yytname from being visible to other parts of the code.
This would normally be harmless, but I want to format my own syntax error messages instead of relying on the ones provided to my yyerror function.
My makefile includes the following rule:
chess1.tab.c: chess.tab.c
sed '/^static const.*yytname/s/static//' $? > $#
This works, but it's not what I'd call elegant.
Is there a better way to get at the table of tag names?

You can export the table using a function which you add to your parser file:
%token-table
%code provides {
const char* const* get_yytname(void);
}
...
%%
...
%%
const char* const* get_yytname(void) { return yytname; }
You probably also want to re-export some of the associated constants.
Alternatively, you could write a function which takes a token number and returns the token name. That does a better job of encapsulation; the existence of the string table and its precise type are implementation details.

How to use myname2.lex in the flex examples?

I see the following example in flex examples. I am able to compile it. But I am not sure what input I should give to it. Could anybody let me know? Thanks.
/*
* myname2.lex : A sample Flex program
* that does token replacement.
*/
%{
#include <stdio.h>
%}
%x STRING
%%
\" ECHO; BEGIN(STRING);
<STRING>[^\"\n]* ECHO;
<STRING>\" ECHO; BEGIN(INITIAL);
%NAME { printf("%s",getenv("LOGNAME")); }
%HOST { printf("%s",getenv("HOST")); }
%HOSTTYPE { printf("%s",getenv("HOSTTYPE"));}
%HOME { printf("%s",getenv("HOME")); }

The rules %NAME, %HOST, %HOSTTYPE and %HOME match those exact strings respectively. So you could enter those and see their corresponding actions execute.
You could also enter one of them surrounded by quotes (e.g. "%HOST") and observe that its action will not be executed because the whole thing was seen as a string literal.

Why call yylex() only once in main()

When I write a yylex() for a yacc parser, the yylex() usually return symbol at a time, that is, the yylex() must be called muti-times until the file to an end.
But when I write a main function for a lex scanner, I just call the yylex() once, but the whole file still fully scanned.
void main(int argc, char* argv[]) {
printf("start\n");
yyin = fopen(argv[1], "r");
yylex();
printf("word count: %d\n", wordCount);
fclose(yyin);
}
Why?

Sorry for asking a silly question, I have read the c file generated by lex, and find that the action code is pasted in a switch case segment, so, as #rici said, it is very much depend on what you write in the action, since my code in action does not return, so one call for yylex will go through the stream. When there's a return, I should use a while() to call yylex.

flex 2.5.35 gives error when ctrl-M used in lex file

I have a simple lex file.
%{
#include <stdio.h>
%}
space_char [ \t\^M]
space {space_char}+
%%
%%
int yywrap(void) {
return 1;
}
int main(void) {
yylex();
return 0;
}
When I compile this file with flex-2.5.35, it gives following errors:
lex.l:5: bad character:
lex.l:5: name defined twice
But, with flex-2.5.4, it runs fine.
I understand this error is due to special character ctrl-m (carriage-return). I want to know if flex-2.5.35 doesn't support special characters like ctrl-l, ctrl-m? And if so, then what's the alternate way? Please note, I am restricted with the use of 2.5.35 only.
Thanks.

As in C, you can use \r for the carriage return character.

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart

Use yylex() to get the list of token types from an input string - parsing

Related

Does the using declaration allow for incomplete types in all cases?

Can I get bison to make yytname externally visible?

How to use myname2.lex in the flex examples?

Why call yylex() only once in main()

flex 2.5.35 gives error when ctrl-M used in lex file

Categories

Resources