Is it possible for LibTooling to not change headers? - clang

I have a LibTooling tool (TimeFlag) which adds a flag to every ForStmt/WhileStmt, and I run ./TimeFlag lalala.cpp -- to insert flags into lalala.cpp.
Unfortunately, this tool also changes the headers, even system headers.
So is there a way to make LibTooling handle only the input file?

Here are two possibilities. If using a RecursiveASTVisitor, one can use the SourceManager to determine whether the expansion location of the statement or declaration is in the main file:
clang::SourceManager &sm(astContext->getSourceManager());
bool const inMainFile(
    sm.isInMainFile(sm.getExpansionLoc(stmt->getLocStart())));
if (inMainFile) {
    /* process decl or stmt */
}
else {
    std::cout << "statement at '"
              << stmt->getLocStart().printToString(sm)
              << "' is not in the main file\n";
}
There are several similar methods in SourceManager, such as isInSystemHeader, to assist with this task.
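For instance, a short illustrative sketch (reusing the sm and stmt variables from above) that skips anything expanded from a system header:

// In a RecursiveASTVisitor's Visit method, returning true continues
// the traversal without doing anything else with this statement.
if (sm.isInSystemHeader(sm.getExpansionLoc(stmt->getLocStart())))
    return true;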
If you are using AST matchers, you can use isExpansionInMainFile to narrow which nodes it matches:
auto matcher = forStmt( isExpansionInMainFile());
There is a similar matcher, isExpansionInSystemHeader.
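As a rough sketch of how that matcher might be wired to a callback (the names LoopFlagCallback and registerMatchers are hypothetical, not part of clang):

#include "clang/ASTMatchers/ASTMatchers.h"
#include "clang/ASTMatchers/ASTMatchFinder.h"

using namespace clang::ast_matchers;

// Hypothetical callback: fires only for for-loops spelled in the main file.
class LoopFlagCallback : public MatchFinder::MatchCallback {
public:
    void run(const MatchFinder::MatchResult &result) override {
        if (const auto *loop = result.Nodes.getNodeAs<clang::ForStmt>("loop")) {
            // insert the timing flag at the loop's location, e.g. via a Rewriter
        }
    }
};

void registerMatchers(MatchFinder &finder, LoopFlagCallback &callback) {
    finder.addMatcher(forStmt(isExpansionInMainFile()).bind("loop"), &callback);
    // whileStmt(isExpansionInMainFile()) can be registered the same way
}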

Related

Build a path of LLVM basic blocks

I have to create an LLVM analysis pass for an exam project which consists of printing the independent paths of a function using the baseline method.
Currently, I am struggling with how to build the baseline path by traversing the various basic blocks. Furthermore, I know that basic blocks are already organized in a CFG, but checking the documentation I can't find any useful method to build a linked list of basic blocks representing a path from the entry point to the end point of a function. I am not an expert in the LLVM environment, and I want to ask if someone with more knowledge knows how to build this kind of path.
Thank you, everyone.
Update: I followed the advice in the answer to this post and wrote this code for building a path:
#include "llvm/Support/raw_ostream.h"
#include "llvm/IR/CFG.h"
#include <set>
#include <list>
using namespace llvm;
using namespace std;
void Build_Baseline_path(BasicBlock *Start, set<BasicBlock *> Explored, list<BasicBlock *> Decision_points, list<BasicBlock *> Path) {
    for (BasicBlock *Successor : successors(Start)) {
        Instruction *Teriminator = Successor->getTerminator();
        const char *Instruction_string = Teriminator->getOpcodeName();
        if (Instruction_string == "br" || Instruction_string == "switch") {
            errs() << "Decision point found" << "\n";
            Decision_points.push_back(Successor);
        }
        if (Instruction_string == "ret") {
            if (Explored.find(Successor) == Explored.end()) {
                errs() << "Added node to the baseline path" << "\n";
                Path.push_back(Successor);
                return;
            }
            return;
        }
        if (Explored.find(Successor) == Explored.end()) {
            Path.push_back(Successor);
            Build_Baseline_path(Successor, Explored, Decision_points, Path);
        }
    }
}
I wrote this code in a separate .cpp file and included it in my function pass, but when I run the pass with this function everything locks up, as if my PC were about to crash. I tried commenting out the call to this function in the pass to see if the problem was somewhere else, but then everything works fine, so the problem is in this code. What is wrong with it? I am sorry, but I am a novice with C++ and I can't figure out how to solve this.
First off, there isn't a single end point. At least four kinds of instructions may be end points: return, unreachable, and in some cases call/invoke (when the called function throws and the exception isn't caught in this function).
Accordingly, there are many possible paths; depending on how you treat loops, the number of paths may not even be finite.
If you regard loops in a simplistic way and ignore exceptions, then it's simple to construct a list of paths. There is an iterator range called successors() which you can use as in this answer: call successors() in a recursive function to process each successor, and when you reach a return or something like that, act on the path you've built.
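As a minimal sketch of that advice (not the asker's full algorithm), the version below passes its containers by reference, which is the main thing the code in the question gets wrong: its by-value set<...> Explored means visited blocks are never remembered across calls, so any loop in the CFG recurses forever, which explains the apparent crash. Comparing getOpcodeName() against string literals with == is also broken, since it compares pointers rather than characters; using isa<> avoids that entirely.

#include "llvm/IR/BasicBlock.h"
#include "llvm/IR/CFG.h"
#include "llvm/IR/Instructions.h"
#include "llvm/Support/raw_ostream.h"
#include <list>
#include <set>

using namespace llvm;

// Follow one path from Start toward a return, collecting blocks as we go.
// Containers are taken by reference so state persists across recursive calls.
void buildBaselinePath(BasicBlock *Start, std::set<BasicBlock *> &Explored,
                       std::list<BasicBlock *> &Path) {
    Explored.insert(Start);
    Path.push_back(Start);
    // A return instruction ends this path.
    if (isa<ReturnInst>(Start->getTerminator())) {
        errs() << "reached a return; path has " << Path.size() << " blocks\n";
        return;
    }
    for (BasicBlock *Succ : successors(Start)) {
        if (Explored.count(Succ) == 0) {
            buildBaselinePath(Succ, Explored, Path);
            return; // follow a single successor: one baseline path
        }
    }
}

The decision-point bookkeeping from the question is left out to keep the sketch short; it can be reinstated with an isa<BranchInst> / isa<SwitchInst> check on the terminator.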

Parse data structures with clang/LLVM

I was wondering what the best solution is for parsing and obtaining data structures from C source files. Suppose that I have:
typedef int M_Int;
typedef float* P_Float;
typedef struct Foo {
M_Int a;
P_Float p_f;
} Foo;
What is the best way to unfold the data structures in order to get the primitive types of both fields a and p_f of struct Foo?
Parsing the AST could be the best way for very simple examples, but when the code becomes more complex, maybe it's better to work at a lower level, with the IR code?
You can use LLVM debug info to grab the information you need. If you compile the C code with the -g option, the compiler generates debug info which contains all the information. Understanding LLVM debug info is tricky, mostly because there is not much documentation about its structure and how to access it. Here are some links:
1) http://llvm.org/docs/SourceLevelDebugging.html
2) Here is a link to a project I am working on which uses debug info. This might not be too useful as there is not much documentation, but it might help to see the usage of the debug-info classes. We are trying to get field names for all pointer parameters (including field names in the case of structure parameters) of a C function. All of the code related to debug-info access is in this file: https://github.com/jiten-thakkar/DataStructureAnalysis/blob/dsa_llvm3.8/lib/dsaGenerator/DSAGenerator.cpp
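As a small, hedged sketch of getting at that information programmatically, LLVM's DebugInfoFinder (in llvm/IR/DebugInfo.h) can enumerate the type descriptions in a module compiled with -g; the exact debug-info class names vary somewhat across LLVM versions:

#include "llvm/IR/DebugInfo.h"
#include "llvm/IR/Module.h"
#include "llvm/Support/raw_ostream.h"

using namespace llvm;

// Print the name of every type description found in the module's debug info.
void dumpDebugTypes(const Module &M) {
    DebugInfoFinder Finder;
    Finder.processModule(M);
    for (const DIType *T : Finder.types()) {
        if (!T->getName().empty())
            errs() << "type: " << T->getName() << "\n";
    }
}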
To find the underlying types, the AST is a good level to work at. Clang can automate and scale this process with AST Matchers and Callbacks, used in conjunction with libtooling. For example, the AST matcher combination
fieldDecl( hasType( typedefType().bind("typedef") ) ).bind("field")
will match fields in C structs that are declared with a typedef instead of a built-in type. The bind() calls make AST nodes accessible to a Callback. Here's a Callback whose run() method gets the underlying type of the field declaration:
virtual void run(clang::ast_matchers::MatchFinder::MatchResult const & result) override
{
    using namespace clang;
    FieldDecl * f_decl = const_cast<FieldDecl *>(
        result.Nodes.getNodeAs<FieldDecl>("field"));
    TypedefType * tt = const_cast<TypedefType *>(
        result.Nodes.getNodeAs<TypedefType>("typedef"));
    if(f_decl && tt) {
        QualType ut = tt->getDecl()->getUnderlyingType();
        TypedefNameDecl * tnd = tt->getDecl();
        std::string struct_name = f_decl->getParent()->getNameAsString();
        std::string fld_name = f_decl->getNameAsString();
        std::string ut_name = ut.getAsString();
        std::string tnd_name = tnd->getNameAsString();
        std::cout << "Struct '" << struct_name << "' declares field '"
                  << fld_name << "' with typedef name = '" << tnd_name << "'"
                  << ", underlying type = '" << ut_name << "'" << std::endl;
    }
    else {
        // error handling
    }
    return;
} // run
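Wiring the matcher and callback into a runnable tool might look roughly like the sketch below; TypedefCallback is a hypothetical name for a MatchCallback class containing the run() method above, and the CommonOptionsParser setup differs slightly across clang versions:

#include "clang/ASTMatchers/ASTMatchFinder.h"
#include "clang/Tooling/CommonOptionsParser.h"
#include "clang/Tooling/Tooling.h"
#include "llvm/Support/CommandLine.h"
#include "llvm/Support/Error.h"

static llvm::cl::OptionCategory Category("typedef-report options");

int main(int argc, const char **argv) {
    using namespace clang::tooling;
    using namespace clang::ast_matchers;

    auto Options = CommonOptionsParser::create(argc, argv, Category);
    if (!Options) {
        llvm::errs() << llvm::toString(Options.takeError());
        return 1;
    }
    ClangTool Tool(Options->getCompilations(), Options->getSourcePathList());

    TypedefCallback Callback; // hypothetical wrapper for run() above
    MatchFinder Finder;
    Finder.addMatcher(
        fieldDecl(hasType(typedefType().bind("typedef"))).bind("field"),
        &Callback);
    return Tool.run(newFrontendActionFactory(&Finder).get());
}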
Once this is put into a Clang Tool and built, running
typedef-report Foo.h -- # Note two dashes
produces
Struct 'Foo' declares field 'a' with typedef name = 'M_Int', underlying type = 'int'
Struct 'Foo' declares field 'p_f' with typedef name = 'P_Float', underlying type = 'float *'
I put up a full working example app in a Code Analysis and Refactoring Examples with Clang Tools project (see apps/TypedefFinder.cc).

Aliasing frequently used patterns in Lex

I have one regexp which is used in several rules. Can I define an alias for it, to keep the regexp definition in one place and just use it across the code?
Example:
[A-Za-z0-9][A-Za-z0-9_-]*    (the expression)
NAME                         (its alias)
...
%%
NAME[=]NAME {
//Do something.
}
%%
It goes in the definitions section of your lex input file (before the first %%), and you use it in a regular expression by putting the name inside curly braces ({...}). For example:
name [A-Za-z0-9][A-Za-z0-9_-]*
%%
{name}[=]{name} { /* Do something */ }

Make a table containing tokens visible to both .mly and .mll with menhir

I would like to define a keyword_table which maps some strings to some tokens, and I would like to make this table visible to both parser.mly and lexer.mll.
It seems that the table has to be defined in parser.mly,
%{
open Utility (* where hash_table is defined to make a table from a list *)
let keyword_table = hash_table [
"Call", CALL; "Case", CASE; "Close", CLOSE; "Const", CONST;
"Declare", DECLARE; "DefBool", DEFBOOL; "DefByte", DEFBYTE ]
%}
However, I could NOT use it in lexer.mll; for instance:
{
open Parser
let x = keyword_table (* doesn't work *)
let x = Parser.keyword_table (* doesn't work *)
let x = Parsing.keyword_table (* doesn't work *)
}
As this comment suggests, menhir has a solution for this; could anyone tell me the details?
The first option is to define the tokens in a separate .mly file. Running menhir on this file with the --only-tokens option generates a module containing the type token, which you can use in your parser by compiling it with the --external-tokens option.
If this solves the problem with tokens, you can put all the other functions used by both parser and lexer in a separate file, as Thomash suggested.
There is an alternative solution as well. You can use a %parameter<module signature> declaration in the parser to parametrize the entire parser over the types and functions specified in the given signature. The main advantage is that this signature goes into the parser's interface file, so the parser can share it with other modules (which can construct modules based on the signature).
I suggest referring to the menhir examples, namely calc-two to learn about external tokens and calc-param to see how to create parametrized parsers.
I usually put the keyword_table in lexer.mll, and I see no reason to put it in parser.mly.
If you need to access it from both lexer.mll and parser.mly (but why would you want to access it from parser.mly?), the easiest solution is to put it in a third file keyword.ml and use Keyword.keyword_table (or open Keyword and keyword_table).

Parse a list of subroutines

I have written parser_sub.mly and lexer_sub.mll, which can parse a subroutine. A subroutine is a block of statements enclosed by Sub and End Sub.
Actually, the raw file I would like to deal with contains a list of subroutines and some useless text. Here is an example:
' a example file
Sub f1()
...
End Sub
haha
' hehe
Sub f2()
...
End Sub
So I need to write parser.mly and lexer.mll which can parse this file, ignoring all the useless text and comments (e.g. haha, ' hehe, etc.), call parser_sub.main, and return a list of subroutines.
Could anyone tell me how to make the parser ignore all the useless sentences (those outside a Sub ... End Sub block)?
Here is a part of parser.mly I tried to write:
%{
open Syntax
%}
%start main
%type <Syntax.ev> main
%%
main:
    subroutine_declaration* { $1 };

subroutine_declaration:
    SUB name = subroutine_name LPAREN RPAREN EOS
    body = procedure_body?
    END SUB
        { { subroutine_name = name;
            procedure_body_EOS_opt = body; } }
The rules for parsing procedure_body are complex and are actually defined in parser_sub.mly and lexer_sub.mll, so how can I avoid repeating their definitions in parser.mly and lexer.mll and instead just call parser_sub.main?
Maybe we can set a flag when we are inside a subroutine:
sub_starts:
SUB { inside:=true };
sub_ends:
ENDSUB { inside:=false };
subroutine_declaration:
sub_starts name body sub_ends { ... }
And when this flag is not set, you just skip any input.
If the stuff you want to skip can have any form (not necessarily valid tokens of your language), you pretty much have to solve this by hacking your lexer, as Kakadu suggests. This may be the easiest approach in any case.
If the filler (stuff to skip) consists of valid tokens, and you want to skip using a grammar rule, it seems to me the main problem is to define a nonterminal that matches any token other than END. This will be unpleasant to keep up to date, but seems possible.
Finally, you have the problem that your end marker is two symbols, END SUB. You have to handle the case where you see END not followed by SUB. This is even trickier because SUB is your beginning marker too. Again, one way to simplify this would be to hack your lexer so that it treats END SUB as a single token. (Usually this is trickier than you'd expect, say if you want to allow comments between END and SUB.)