Can I get bison to make yytname externally visible? - parsing

Bison generates at table of tag names when processing my grammar, something like
static const char *const yytname[] =
{
"$end", "error", "$undefined", "TAG", "SCORE",
...
}
The static keyword keeps yytname from being visible to other parts of the code.
This would normally be harmless, but I want to format my own syntax error messages instead of relying on the ones provided to my yyerror function.
My makefile includes the following rule:
chess1.tab.c: chess.tab.c
sed '/^static const.*yytname/s/static//' $? > $#
This works, but it's not what I'd call elegant.
Is there a better way to get at the table of tag names?

You can export the table using a function which you add to your parser file:
%token-table
%code provides {
const char* const* get_yytname(void);
}
...
%%
...
%%
const char* const* get_yytname(void) { return yytname; }
You probably also want to re-export some of the associated constants.
Alternatively, you could write a function which takes a token number and returns the token name. That does a better job of encapsulation; the existence of the string table and its precise type are implementation details.

Related

Does the using declaration allow for incomplete types in all cases?

I'm a bit confused about the implications of the using declaration. The keyword implies that a new type is merely declared. This would allow for incomplete types. However, in some cases it is also a definition, no? Compare the following code:
#include <variant>
#include <iostream>
struct box;
using val = std::variant<std::monostate, box, int, char>;
struct box
{
int a;
long b;
double c;
box(std::initializer_list<val>) {
}
};
int main()
{
std::cout << sizeof(val) << std::endl;
}
In this case I'm defining val to be some instantiation of variant. Is this undefined behaviour? If the using-declaration is in fact a declaration and not a definition, incomplete types such as box would be allowed to instantiate the variant type. However, if it is also a definition, it would be UB no?
For the record, both gcc and clang both create "32" as output.
Since you've not included language-lawyer, I'm attempting a non-lawyer answer.
Why should that be UB?
With a using delcaration, you're just providing a synonym for std::variant<whatever>. That doesn't require an instantiation of the object, nor of the class std::variant, pretty much like a function declaration with a parameter of that class doesn't require it:
void f(val); // just fine
The problem would occur as soon as you give to that function a definition (if val is still incomplete because box is still incomplete):
void f(val) {}
But it's enough just to change val to val& for allowing a definition,
void f(val&) {}
because the compiler doesn't need to know anything else of val than its name.
Furthermore, and here I'm really inventing, "incomplete type" means that some definition is lacking at the point it's needed, so I expect you should discover such an issue at compile/link time, and not by being hit by UB. As in, how can the compiler and linker even finish their job succesfully if a definition to do something wasn't found?

How to declare const char* in .cpp file

There are many questions about declaring const string in .h files, this is not my case.
I need string (for serialization purposes if it is important) to use in
My current solution is
// file.cpp
static constexpr const char* const str = "some string key";
void MyClass::serialize()
{
// using str
}
void MyClass::deserialize()
{
// using str
}
Does it have any problems? (i.e. memory leaks, redefinitions, UB, side effects)?
P.S. is using #define KEY "key" could be better here (speed/memory/consistency)?
Since you mentioned C++17, the best way to do this is with:
constexpr std::string_view str = "some string key";
str will be substituted by the compiler to the places where it is used at compile time.
Memory-wise you got rid of storing the str in run-time since it is only available at compile time.
Speed-wise this is also marginally better because less indirections to get the data in runtime.
Consistency-wise it is also even better since constexpr is solely used for expressions that are immutable and available at compile time. Also string_view is solely used for immutable strings so you are using the exact data type needed for you.
constexpr implies the latter const, which in turn implies the static (for a namespace-scope variable). Aside from that redundancy, this is fine.

Use yylex() to get the list of token types from an input string

I have a CLI that was made using Bison and Flex which has grown large and complicated, and I'm trying to get the complete sequence of tokens (yytokentype or the corresponding yytranslate Bison symbol numbers) for a given input string to the parser.
Ideally, every time yyerror() is called I want to store the sequence of tokens that were identified during parse. I don't need to know the yylval's, states, actions, etc, just the token list resulting from the string input to the buffer.
If a straightforward way of doing this doesn't exist, then just a stand-alone way of going from string --> yytokentypes will work.
The below code just has debugging printouts, which I'll change to storing it in the place I want as soon as I figure out how to get the tokens.
// When an error condition is reached, yylex() to get the yytokentypes
void yyerror(const char *s)
{
std::cerr<<"LEX\n";
int tok; // yytokentype
do
{
tok = yylex();
std::cerr<<tok<<",";
}while(tok);
std::cerr<<"LEX\n";
}
A simpler solution is to just change the name of the lexer using the YY_DECL macro and then add a definition of yylex at the end:
%{
// ...
#include "parser.tab.h"
#define YY_DECL static int wrapped_lexer(void)
%}
%%
/* rules */
%%
int yylex(void) {
int token = wrapped_lexer();
/* do something with the token */
return token;
}
Having said that, unless the source code is read-once for some reason, it's probably faster on the whole to rescan the input only if an error is encountered rather than saving the token list in case an error is an encountered. Lexing is really pretty fast, and in many use cases, syntactically correct inputs are more common than erroneous ones.
OK I figured a way to do this without having to re-tokenize the input string. Flex allows you to define YY_DECL, which by default is found in the generated lexer file to produce the yylex() declaration:
#ifndef YY_DECL
//some other stuff
#define YY_DECL int yylex (void)
#endif /* !YY_DECL */
And this goes in place
/** The main scanner function which does all the work.
*/
YY_DECL
{
// Body of yylex() which returns the yytokentype
}
A tricky thing that I'm able to do is re-define yylex() via YY_DECL to capture every token before it gets returned to the caller. This allows me to store the yytokentype for every call without changing the parser's behavior one bit. Below I'm just printing it out here for testing:
#define YY_DECL \
int yylex2(void); \
int yylex (void) \
{ \
int ret; \
ret = yylex2(); \
std::cerr<<"yylex2 returns: "<<ret<<"\n"; \
return ret; \
} \

Parse data structures clang/LLVM

I was wondering what is the best solution in order to parse and obtain data structures from C sources files. Suppose that I have:
typedef int M_Int;
typedef float* P_Float;
typedef struct Foo {
M_Int a;
P_Float p_f;
} Foo;
What is the best way to unfold the data structures in order to get the primitives of both variables a and p_f of struct Foo?
Parsing the AST, for very simple examples, could be the best way, but when the code becomes more complex, maybe it's better to work in a more low-level way with IR code?
You can use llvm debug info to grab the information you need. If you compile the C code with -g option, it generates debug info which contains all the information. Understanding llvm debuginfo is tricky mostly because there is not much documentation about their structure and how to access them. Here are some links:
1) http://llvm.org/docs/SourceLevelDebugging.html
2) Here is a link to a project that I am working on which uses debug info. This might not be too useful as there is not much documentation but it might be useful to see the usage of the debuginfo classes. We are trying to get field names for all pointer parameters (including field names in case of structure parameter) of a C function. All of the code related to debuginfo access is in this file: https://github.com/jiten-thakkar/DataStructureAnalysis/blob/dsa_llvm3.8/lib/dsaGenerator/DSAGenerator.cpp
To find the underlying types, the AST is a good level to work at. Clang can automate and scale this process with AST Matchers and Callbacks, used in conjunction with libtooling. For example, the AST matcher combination
fieldDecl( hasType( tyoedefType().bind("typedef") ) ).bind("field")
will match fields in C structs that are declared with a typedef instead of a built-in type. The bind() calls make AST nodes accessible to a Callback. Here's a Callback whose run() method gets the underlying type of the field declaration:
virtual void run(clang::ast_matchers::MatchFinder::MatchResult const & result) override
{
using namespace clang;
FieldDecl * f_decl =
const_cast<FieldDecl *>(result.Nodes.getNodeAs<FieldDecl>("field"));
TypedefType * tt = const_cast<TypedefType *>(
result.Nodes.getNodeAs<TypedefType>("typedef"));
if(f_decl && tt) {
QualType ut = tt->getDecl()->getUnderlyingType();
TypedefNameDecl * tnd = tt->getDecl();
std::string struct_name = f_decl->getParent()->getNameAsString();
std::string fld_name = f_decl->getNameAsString();
std::string ut_name = ut.getAsString();
std::string tnd_name = tnd->getNameAsString();
std::cout << "Struct '" << struct_name << "' declares field '"
<< fld_name << " with typedef name = '" << tnd_name << "'"
<< ", underlying type = '" << ut_name << "'" << std::endl;
}
else {
// error handling
}
return;
} // run
Once this is put into a Clang Tool and built, running
typedef-report Foo.h -- # Note two dashes
produces
Struct 'Foo' declares field 'a' with typedef name = 'M_Int', underlying type = 'int'
Struct 'Foo' declares field 'p_f' with typedef name = 'P_Float', underlying type = 'float *'
I put up a full working example app in a Code Analysis and Refactoring Examples with Clang Tools project (see apps/TypedefFinder.cc).

Aliasing frequently used patterns in Lex

I have one regexp, which is used in a several rules. Can I define alias for it, to keep this regexp definition in one place and just use it across the code?
Example:
[A-Za-z0-9].[A-Za-z0-9_-]* (expression) NAME (alias)
...
%%
NAME[=]NAME {
//Do something.
}
%%
It goes in the definitions section of your lex input file (before the %%) and you use it in a regular expression by putting the name inside curly braces ({…}). For example:
name [A-Za-z0-9][A-Za-z0-9_-]*
%%
{name}[=]{name} { /* Do something */ }

Resources