Why call yylex() only once in main() - flex-lexer

When I write a yylex() for a yacc parser, yylex() usually returns one symbol at a time; that is, yylex() must be called multiple times until the end of the file is reached.
But when I write a main function for a lex scanner, I call yylex() just once, yet the whole file is still scanned.
int main(int argc, char* argv[]) {
    printf("start\n");
    yyin = fopen(argv[1], "r");
    yylex();
    printf("word count: %d\n", wordCount);
    fclose(yyin);
    return 0;
}
Why?

Sorry for asking a silly question. I have read the C file generated by lex and found that the action code is pasted into a switch statement. So, as @rici said, it depends very much on what you write in the actions: since none of my actions return, a single call to yylex() runs through the whole stream. When an action does return, I should call yylex() in a while loop.

Related

How to detect if global variable is a string in LLVM?

In earlier releases of llvm/clang I was able to detect whether a global variable was a string by using, e.g., the GlobalVar->getName() function and checking whether the name ends with ".str". I've tried this in llvm/clang 13 and 14, and it seems that all the names I'm getting are mangled names. Am I missing something?
For example, I have this basic C source code:
//compiled with: clang.exe -std=c99 helloCC.c -o helloCC.exe -mllvm -my_get_strings=1 -flegacy-pass-manager
#include <stdio.h>
char *xmy1 = "hello world";
int main(int argc, char *argv[]) {
printf("%s", xmy1);
return 0;
}
I've manually edited the llvm/clang code to trigger my function as one of the passes (clang is executed with "-flegacy-pass-manager", and I've added my pass to PassManagerBuilder.cpp in the void PassManagerBuilder::populateModulePassManager(legacy::PassManagerBase &MPM) function).
Anyway, my runOnModule handler executes and iterates over the global variables (M.global_begin() to M.global_end()), and all the names returned by GlobalVar->getName() seem to be mangled:
found global = "??_C@_0M@LACCCNMM@hello?5world?$AA@"
Obviously my previous approach to detecting whether this is a string no longer works. Is there a better function to detect whether a global is a string, or am I doing something wrong?
I've tried demangling the name. I can demangle it, but I still don't know how to verify whether this is a string or not. Is there an LLVM function for this?
Well, the main question here is what you mean by "global variable is a string". If you mean C-style strings, then you'd just take the initializer (which is a Constant) and check whether it is a C-style string using the isCString method of ConstantDataSequential (https://llvm.org/doxygen/classllvm_1_1ConstantDataSequential.html#aecff3ad6cfa0e4abfd4fc9484d973e7d)

LLVM, Get first usage of a global variable

I'm new to LLVM and I'm stuck on something that might seem basic.
I'm writing an LLVM pass to apply some transformations to global variables before they are used.
I would like to detect the first use of a global variable so that I apply the transformation only there, and not at every place where the global variable is used. It must be the first use, though, otherwise the program crashes.
I have been reading about the AnalysisManager, and I would say that I want something similar to DominatorTree which is used for basic blocks in a function.
So the idea is to get the DominatorTree of a GlobalVariable to get the first time it is used in the code and apply there my transformation.
Given the following example
#include <stdio.h>
int MyGlobal = 30;
void foo()
{
    printf("%d\n", MyGlobal);
}
int main()
{
    printf("%d\n", MyGlobal);
    foo();
}
In the example above, I only want to apply the transformation just before the first printf in the main function
Given this second example, where foo() is called first
#include <stdio.h>
int MyGlobal = 30;
void foo()
{
    printf("%d\n", MyGlobal);
}
int main()
{
    foo();
    printf("%d\n", MyGlobal);
}
For the example above I would like to apply the transformation inside the foo function.
I want to avoid creating a stub function at the beginning of the program that processes all globals before the program starts running (this is what I'm actually doing now).
Does LLVM provide something that can help me doing this? or what should be the best approach to implement it?

Use yylex() to get the list of token types from an input string

I have a CLI that was made using Bison and Flex which has grown large and complicated, and I'm trying to get the complete sequence of tokens (yytokentype or the corresponding yytranslate Bison symbol numbers) for a given input string to the parser.
Ideally, every time yyerror() is called I want to store the sequence of tokens that were identified during parse. I don't need to know the yylval's, states, actions, etc, just the token list resulting from the string input to the buffer.
If a straightforward way of doing this doesn't exist, then just a stand-alone way of going from string --> yytokentypes will work.
The below code just has debugging printouts, which I'll change to storing it in the place I want as soon as I figure out how to get the tokens.
// When an error condition is reached, yylex() to get the yytokentypes
void yyerror(const char *s)
{
std::cerr<<"LEX\n";
int tok; // yytokentype
do
{
tok = yylex();
std::cerr<<tok<<",";
}while(tok);
std::cerr<<"LEX\n";
}
A simpler solution is to just change the name of the lexer using the YY_DECL macro and then add a definition of yylex at the end:
%{
// ...
#include "parser.tab.h"
#define YY_DECL static int wrapped_lexer(void)
%}
%%
/* rules */
%%
int yylex(void) {
int token = wrapped_lexer();
/* do something with the token */
return token;
}
Having said that, unless the source code is read-once for some reason, it's probably faster on the whole to rescan the input only if an error is encountered rather than saving the token list in case an error is encountered. Lexing is really pretty fast, and in many use cases, syntactically correct inputs are more common than erroneous ones.
OK I figured a way to do this without having to re-tokenize the input string. Flex allows you to define YY_DECL, which by default is found in the generated lexer file to produce the yylex() declaration:
#ifndef YY_DECL
//some other stuff
#define YY_DECL int yylex (void)
#endif /* !YY_DECL */
And this goes in place
/** The main scanner function which does all the work.
*/
YY_DECL
{
// Body of yylex() which returns the yytokentype
}
A tricky thing that I'm able to do is re-define yylex() via YY_DECL to capture every token before it gets returned to the caller. This allows me to store the yytokentype for every call without changing the parser's behavior one bit. Below I'm just printing it out here for testing:
#define YY_DECL \
int yylex2(void); \
int yylex (void) \
{ \
int ret; \
ret = yylex2(); \
std::cerr<<"yylex2 returns: "<<ret<<"\n"; \
return ret; \
} \
int yylex2(void)

flex 2.5.35 gives error when ctrl-M used in lex file

I have a simple lex file.
%{
#include <stdio.h>
%}
space_char [ \t\^M]
space {space_char}+
%%
%%
int yywrap(void) {
return 1;
}
int main(void) {
yylex();
return 0;
}
When I compile this file with flex-2.5.35, it gives following errors:
lex.l:5: bad character:
lex.l:5: name defined twice
But, with flex-2.5.4, it runs fine.
I understand this error is due to special character ctrl-m (carriage-return). I want to know if flex-2.5.35 doesn't support special characters like ctrl-l, ctrl-m? And if so, then what's the alternate way? Please note, I am restricted with the use of 2.5.35 only.
Thanks.
As in C, you can use \r for the carriage return character.

Deferred execution strategy in C++

I have a callback implementation in which an unknown third party calls a function pointer in my code.
However, an issue in a lot of languages is triggering code after a function returns. For instance, when a callback is called and I have to delete the calling object (and, in this case, re-initialize it), returning from the callback would cause an exception.
Assuming I cannot hook and that I do not own/cannot modify the code calling the callback, what is the best way to execute code after a function returns?
The only real way I can think of doing this is to set up some sort of state machine and have a worker thread check the state. However, the issue I foresee with this is that of a race condition, where a callback is called between the time the reset callback returns and the point the calling object is reset.
Is there any sort of functionality I'm not aware of, or would this be the most efficient way of achieving such a result?
This is how I would do it. It requires C++11 or newer, though.
You can rewrite it to use function pointers instead so that it works with older C++ versions.
#include <functional>
#include <iostream>
#include <utility>
#define CONCAT_IMPL(x, y) x##y
#define CONCAT(x, y) CONCAT_IMPL(x, y)
// Declares a uniquely named guard whose destructor runs x at scope exit.
#define deferred(x) Defer CONCAT(deferred_guard_, __COUNTER__)(x)
struct Defer {
    explicit Defer(std::function<void()> f) : func(std::move(f)) {}
    Defer(const Defer &) = delete;             // a copy would run func twice
    Defer &operator=(const Defer &) = delete;
    ~Defer() { func(); }
    std::function<void()> func;
};
int main() {
    deferred([] () {
        std::cout << "deferred" << std::endl;
    });
    std::cout << "now" << std::endl;
}
outputs -->
now
deferred
