I understand that OpenMP is in fact just a set of macros which is compiled into pthreads. Is there a way of seeing the pthread code before the rest of the compilation occurs? I am using GCC to compile.
First, OpenMP is not a simple set of macros. It may be seen as a simple transformation into pthread-like code, but OpenMP requires more than that, including runtime support.
Back to your question: at least in GCC, you can't see the pthreaded code, because GCC's OpenMP implementation is done in the compiler back-end (or middle-end). The transformation is done at the IR (intermediate representation) level, so from the programmer's point of view it's not easy to see how the code is actually transformed.
However, there are some references.
(1) An Intel engineer provided a great overview of the implementation of OpenMP in the Intel C/C++ compiler:
http://www.drdobbs.com/parallel/how-do-openmp-compilers-work-part-1/226300148
http://www.drdobbs.com/parallel/how-do-openmp-compilers-work-part-2/226300277
(2) You may take a look at the implementation of GCC's OpenMP:
https://github.com/mirrors/gcc/tree/master/libgomp
You can see that libgomp.h uses pthreads, and that loop.c contains the implementation of the parallel loop construct.
OpenMP is a set of compiler directives, not macros. In C/C++ those directives are implemented with the #pragma extension mechanism, while in Fortran they are implemented as specially formatted comments. These directives instruct the compiler to perform certain code transformations in order to convert the serial code into parallel code.
Although it is possible to implement OpenMP as a transformation to pure pthreads code, this is seldom done. A large part of the OpenMP mechanics is usually built into a separate run-time library, which comes as part of the compiler suite. For GCC this is libgomp. It provides a set of high-level functions that are used to easily implement the OpenMP constructs. It is also internal to the compiler and not intended to be used by user code, i.e. there is no header file provided for it.
With GCC it is possible to get a pseudocode representation of what the code looks like after the OpenMP transformation. You have to supply it the -fdump-tree-all option, which will result in the compiler spewing out a large number of intermediate files for each compilation unit. The most interesting one is filename.017t.ompexp (this comes from GCC 4.7.1; the number might be different in other GCC versions, but the extension will still be .ompexp). This file contains an intermediate representation of the code after the OpenMP constructs have been lowered and then expanded into their proper implementation.
Consider the following example C code, saved as fun.c:
void fun(double *data, int n)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        data[i] += data[i]*data[i];
}
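For reference, a command line along the lines of

gcc -fopenmp -fdump-tree-all -c fun.c

should produce the intermediate dump files, among them the .ompexp one discussed below (the exact pass number in the file name may vary between GCC versions).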
The content of fun.c.017t.ompexp is:
fun (double * data, int n)
{
...
struct .omp_data_s.0 .omp_data_o.1;
...
<bb 2>:
.omp_data_o.1.data = data;
.omp_data_o.1.n = n;
__builtin_GOMP_parallel_start (fun._omp_fn.0, &.omp_data_o.1, 0);
fun._omp_fn.0 (&.omp_data_o.1);
__builtin_GOMP_parallel_end ();
data = .omp_data_o.1.data;
n = .omp_data_o.1.n;
return;
}
fun._omp_fn.0 (struct .omp_data_s.0 * .omp_data_i)
{
int n [value-expr: .omp_data_i->n];
double * data [value-expr: .omp_data_i->data];
...
<bb 3>:
i = 0;
D.1637 = .omp_data_i->n;
D.1638 = __builtin_omp_get_num_threads ();
D.1639 = __builtin_omp_get_thread_num ();
...
<bb 4>:
... this is the body of the loop ...
i = i + 1;
if (i < D.1644)
goto <bb 4>;
else
goto <bb 5>;
<bb 5>:
<bb 6>:
return;
...
}
I have omitted big portions of the output for brevity. This is not exactly C code; it is a C-like representation of the program flow. <bb N> are the so-called basic blocks: collections of statements that are treated as single blocks in the program's control flow.
The first thing one sees is that the parallel region gets extracted into a separate function. This is not uncommon: most OpenMP implementations do more or less the same code transformation. One can also observe that the compiler inserts calls to libgomp functions like GOMP_parallel_start and GOMP_parallel_end, which are used to bootstrap and then finish the execution of the parallel region (the __builtin_ prefix is removed later on). Inside fun._omp_fn.0 there is a for loop, implemented in <bb 4> (note that the loop itself is also expanded). Also, all shared variables are put into a special structure that gets passed to the implementation of the parallel region. <bb 3> contains the code that computes the range of iterations over which the current thread operates.
Well, not quite C code, but this is probably the closest thing that one can get from GCC.
I haven't tested it with OpenMP, but the compiler option -E should give you the code after preprocessing.
If I call R code from Java within GraalVM (using GraalVM's polyglot function), do the R code and the Java code run on the same Java thread (i.e. there's no switching between OS or Java threads, etc.)? Also, is it the same "memory/heap" space? That is, in the example code below (which I took from https://www.baeldung.com/java-r-integration)
public double mean(int[] values) {
    Context polyglot = Context.newBuilder().allowAllAccess(true).build();
    String meanScriptContent = RUtils.getMeanScriptContent();
    polyglot.eval("R", meanScriptContent);
    Value rBindings = polyglot.getBindings("R");
    Value rInput = rBindings.getMember("c").execute(values);
    return rBindings.getMember("customMean").execute(rInput).asDouble();
}
does the call rBindings.getMember("c").execute(values) cause the values object (an array of ints) to be copied? Or is GraalVM smart enough to consider it a pointer to the same memory space? If it is a copy, is the copying time the same (or similar, i.e. within say 20%) as a normal Java clone() operation? Finally, does calling a polyglot function (in this case customMean, implemented in R) have the same overhead as calling a native Java function? Bonus question: can the GraalVM JIT compiler even compile across the layers, e.g. say I had this:
final long sum = IntStream.range(0, 10000)
    .map(x -> x + 4)
    .map(x -> <<<FastR version of the following inverse operation: x-4 >>>)
    .sum();
would the GraalVM compiler be as smart as, say, a normal Java JIT compiler and realize that the whole statement above can simply be written without the two map operations (since they cancel each other out)?
FYI: I'm considering using GraalVM to run both my Java code and my R code, once the issue I identified here is resolved (Why is FastR (i.e. the GraalVM version of R) 10x *slower* compared to normal R despite Oracle's claim of 40x *faster*?), and one of the motivations is that I hope to eliminate the roughly 50% of the time that calling R (using Rserve) from Java spends on network IO (because Java communicates with Rserve over TCP/IP, and Rserve and Java are on different threads and in different memory spaces, etc.).
do the R code and the Java code run on the same Java thread? Also, is it the same "memory/heap" space?
Yes and yes. You can even use GraalVM VisualVM to inspect the heap: it provides standard Java view where you can see instances of FastR internal representations like RIntVector mingled with the rest of the other Java objects, or R view where you can see integer vectors, lists, environments, ...
does the call rBindings.getMember("c").execute(values) cause the values object (an array of ints) to be copied?
In general, no: most objects are passed to R as-is. Inside R you have two choices:
Explicitly convert them to some concrete type, e.g., as.integer(arg). This does not make a copy, but tells R explicitly that you want the value to be treated as a "native" R type, including R's value semantics.
Leave it up to the default rules, which will be applied once your object is passed to some R builtin, e.g., int[] is treated as an integer vector (but note that treating it as a list would also be reasonable in some cases). Again, no copies here, and the object itself keeps its reference semantics.
However, sometimes FastR needs to make a copy:
some builtin functions cannot handle foreign objects yet
the R language often implicitly copies vectors, because of its value semantics, argument coercion, etc.
when a vector is passed to a native R extension, its data needs to be moved to off-heap memory
I would say that if you happen to have a very large vector, say GBs of data, you need to be very careful about it even in regular R. Note: FastR vectors are by default backed by Java arrays, so their size limitations apply to FastR vectors too.
Finally, does calling a polyglot function (in this case customMean implemented in R) have the same overhead as calling a native Java function?
Mostly yes, except that the function cannot be pulled in and inlined into the surrounding Java code(+). The call itself is as fast as a regular Java call. For the example you give: it cannot be optimized as you suggest, because the R function cannot be inlined(+). However, I would be very skeptical that any compiler could optimize this as you suggest even if both functions were pure Java code. That being said, yes: some things that the compiler could otherwise optimize, like eliminating useless computations that it can analyze well, are not going to work because it is impossible to inline code across the Java <-> R boundary(+).
(+) Unless you'd run the Java code with Espresso (Java on Truffle), but then you would not be using Context API but Espresso's interop support.
If I have some class with a field like __m256i* loaded_v, and a method like:
void load() {
loaded_v = &_mm256_load_si256(reinterpret_cast<const __m256i*>(vector));
}
For how long will loaded_v be a valid pointer? Since there are a limited number of registers, I would imagine that eventually loaded_v will refer to a different value, or some other weird behavior will happen. However, I would like to reduce the number of loads I do.
I'm writing a packed bit array class, and I would like to use AVX intrinsics to increase performance. However, it is inefficient to load my array of bits every time I do some operation (and, or, xor, etc.). Therefore, I would like to be able to explicitly call load() before performing some batch of operations. However, I don't understand how exactly AVX registers are handled. Could anyone help me out, or point me to some documentation for this issue?
An optimizing compiler will use registers automatically.
It may put a __m256 variable into memory or into a register, or it may use a register in one part of your code and spill it in another. This can be done not only with a standalone automatic storage (stack) variable, but also with a member of a class, especially if the class instance is itself an automatic storage variable.
When a register is used, a __m256 variable corresponds to one of the ymm registers (one of 16 in x86-64, one of 8 in a 32-bit build, or one of 32 in x86-64 with AVX-512), and there is no need to refer to it indirectly.
The _mm256_load_si256 intrinsic doesn't necessarily compile to vmovdqa. For example, this code:
#include <immintrin.h>

__m256i f(__m256i a, const void* p)
{
    __m256i b = _mm256_load_si256(reinterpret_cast<const __m256i*>(p));
    return _mm256_xor_si256(a, b);
}
Compiles as follows (https://godbolt.org/z/ve67YPn4T):
vpxor ymm0, ymm0, YMMWORD PTR [rdx]
ret 0
C and C++ are high level languages; the intrinsics should be seen as a way to convey the semantic to the compiler, not instruction mnemonics.
You should load a value into a variable,
__m256i loaded_v;
loaded_v = _mm256_load_si256(reinterpret_cast<const __m256i*>(vector));
or a temporary:
__m256_whatever_operation(_mm256_load_si256(reinterpret_cast<const __m256i*>(vector)), other_operand);
And you should follow the usual C or C++ rules.
If you repeatedly load a value indirectly through a pointer, it may be helpful to cache it in a variable, so that the compiler can see that the value does not change between uses and treat this as an optimization opportunity. Of course, the compiler may miss this opportunity anyway, or it may find it even without the cached variable (possibly with the help of the strict aliasing rule).
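For example, here is a minimal sketch of what that might look like for a packed bit array (the names batch_and_xor, bits and other are hypothetical, and the unaligned load/store intrinsics are used so the sketch does not depend on alignment guarantees):

#include <cstddef>
#include <cstdint>
#include <immintrin.h>

// Load each 256-bit block once into local variables, do the whole batch of
// operations on those locals, then store the result. The compiler is free to
// keep v and o in ymm registers for the duration of the loop body.
void batch_and_xor(std::uint64_t* bits, const std::uint64_t* other, std::size_t n_blocks)
{
    for (std::size_t i = 0; i < n_blocks; ++i) {
        __m256i v = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(bits + 4 * i));
        __m256i o = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(other + 4 * i));
        __m256i r = _mm256_and_si256(v, o);  // first operation on the cached value
        r = _mm256_xor_si256(r, v);          // second operation, no extra load needed
        _mm256_storeu_si256(reinterpret_cast<__m256i*>(bits + 4 * i), r);
    }
}

Whether v actually stays in a register is still up to the compiler, but because each load is visible in one place it has every opportunity to avoid redundant loads.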
I am looking to find the number of tasks. How can I get the number of tasks created by an OpenMP program?
void quicksort(int* A, int left, int right)
{
    int i, last;
    if (left >= right)
        return;
    swap(A, left, (left+right)/2);
    last = left;
    for (i = left+1; i <= right; i++)
        if (A[i] < A[left])
            swap(A, ++last, i);
    swap(A, left, last);
    #pragma omp task
    quicksort(A, left, last-1);
    quicksort(A, last+1, right);
    #pragma omp taskwait
}
If you want to gain insight into what your OpenMP program is doing, you should use an OpenMP-task-aware performance analysis tool. For example, Score-P can record all task operations in either a trace with full timing information or a summary profile. The recorded information can then be analysed and visualized with several other tools.
Take a look at this paper for more information for performance analysis of task-based OpenMP applications.
There's no good way of counting the number of OpenMP tasks, as OpenMP does not offer any way to query how many tasks have been created so far. The OpenMP runtime system may or may not keep track of this number, so it would be unfair (and would have performance implications) if such a count had to be maintained by a runtime that is otherwise not interested in this number.
The following is a terrible hack! Be sure you absolutely want to do this!
Having said the above, you can do the counting manually. Assuming that your code deterministically creates the same number of tasks in each execution, you can do this:
int tasks_created;

void quicksort(int* A, int left, int right)
{
    int i, last;
    if (left >= right)
        return;
    swap(A, left, (left+right)/2);
    last = left;
    for (i = left+1; i <= right; i++)
        if (A[i] < A[left])
            swap(A, ++last, i);
    swap(A, left, last);
    #pragma omp task
    {
        #pragma omp atomic
        tasks_created++;
        quicksort(A, left, last-1);
    }
    quicksort(A, last+1, right);
    #pragma omp taskwait
}
I'm saying that this is a terrible hack because:
it requires you to find all the task-creating statements and modify them with the atomic construct and the increment
it does not work well for some task-generating directives, e.g., taskloop
it may horribly slow down execution, so you cannot leave the modification in for production (that's the part about determinism: you need to run once with the counting enabled and then remove it for the production runs)
Another way...
If you are using a reasonably new OpenMP implementation that already supports the OpenMP tools (OMPT) interface of OpenMP 5.0, you can write a small tool that hooks into the OpenMP event for task creation. Then you can do the counting in the tool and attach it to your execution through the OpenMP tools mechanism.
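As a rough illustration of that second approach, here is a minimal, untested sketch of an OMPT tool that counts task-creation events. It assumes a runtime that ships omp-tools.h and implements the OpenMP 5.0 tools interface; such a tool is usually built as a shared library and activated via the OMP_TOOL_LIBRARIES environment variable:

#include <omp-tools.h>
#include <atomic>
#include <cstdio>

static std::atomic<long> tasks_created{0};  // incremented on every task-creation event

static void on_task_create(ompt_data_t *encountering_task_data,
                           const ompt_frame_t *encountering_task_frame,
                           ompt_data_t *new_task_data,
                           int flags, int has_dependences,
                           const void *codeptr_ra) {
    tasks_created.fetch_add(1, std::memory_order_relaxed);
}

static int tool_initialize(ompt_function_lookup_t lookup, int initial_device_num,
                           ompt_data_t *tool_data) {
    // Look up the runtime entry point for registering callbacks and register ours.
    ompt_set_callback_t set_callback = (ompt_set_callback_t) lookup("ompt_set_callback");
    set_callback(ompt_callback_task_create, (ompt_callback_t) on_task_create);
    return 1;  // non-zero keeps the tool active
}

static void tool_finalize(ompt_data_t *tool_data) {
    std::printf("tasks created: %ld\n", tasks_created.load());
}

extern "C" ompt_start_tool_result_t *ompt_start_tool(unsigned int omp_version,
                                                     const char *runtime_version) {
    static ompt_start_tool_result_t result = {tool_initialize, tool_finalize, {0}};
    return &result;
}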
I want to find simple loops in LLVM bytecode, and extract the basic information of the loop.
For example:
for (i=0; i<1000; i++)
sum += i;
I want to extract the bound [0, 1000), the loop variable "i" and the loop body (sum += i).
What should I do?
I read the LLVM API documentation and found some useful classes like "Loop" and "LoopInfo", but I do not know how to use them in detail. Could you please give me some help? A detailed usage example would be very helpful.
If you do not want to use the pass manager, you might need to call the Analyze method of the llvm::LoopInfoBase class on each function in the IR (assuming you are using LLVM 3.4). However, the Analyze method takes the DominatorTree of each function as input, which you have to generate first. The following code is what I tested with LLVM 3.4 (assuming you have read the IR file and converted it into a Module* named module):
for (llvm::Module::iterator func = module->begin(), y = module->end(); func != y; func++) {
    // get the dominator tree of the current function
    llvm::DominatorTree* DT = new llvm::DominatorTree();
    DT->DT->recalculate(*func);
    // generate the LoopInfoBase for the current function
    llvm::LoopInfoBase<llvm::BasicBlock, llvm::Loop>* KLoop = new llvm::LoopInfoBase<llvm::BasicBlock, llvm::Loop>();
    KLoop->releaseMemory();
    KLoop->Analyze(DT->getBase());
}
Basically, with KLoop generated, you get all kinds of loop information at the IR level. You can refer to the APIs of the LoopInfoBase class for details. By the way, you might want to add the following headers:
"llvm/Analysis/LoopInfo.h" "llvm/Analysis/Dominators.h".
Once you get to the LLVM IR level, the information you request may no longer be accurate. For example, clang may have transformed your code so that i goes from -1000 up to 0 instead. Or it may have optimised "i" out entirely, so that there is no explicit induction variable. If you really need to extract the information exactly as it appears at face value in the C code, then you need to look at clang, not LLVM IR. Otherwise, the best you can do is to calculate a loop trip count, in which case have a look at the ScalarEvolution pass.
Check the PowerPC hardware loops transformation pass, which demonstrates the trip count calculation fairly well: http://llvm.org/docs/doxygen/html/PPCCTRLoops_8cpp_source.html
The code is fairly heavy, but should be followable. The interesting function is PPCCTRLoops::convertToCTRLoop. If you have any further questions about that, I can try to answer them.
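For completeness, here is a rough, untested sketch of how a trip count can be queried from ScalarEvolution inside a loop pass; MyLoopPass is a hypothetical pass name, and the exact API spelling differs between LLVM versions (this is roughly the 3.x style, where the pass also needs AU.addRequired<ScalarEvolution>() in its getAnalysisUsage):

#include "llvm/Analysis/ScalarEvolution.h"
#include "llvm/Support/raw_ostream.h"

bool MyLoopPass::runOnLoop(llvm::Loop *L, llvm::LPPassManager &) {
    llvm::ScalarEvolution &SE = getAnalysis<llvm::ScalarEvolution>();
    const llvm::SCEV *BackedgeCount = SE.getBackedgeTakenCount(L);
    if (!llvm::isa<llvm::SCEVCouldNotCompute>(BackedgeCount)) {
        // The loop executes BackedgeCount + 1 times; for the example in the
        // question this is a constant SCEV of 999, i.e. 1000 iterations.
        BackedgeCount->print(llvm::errs());
        llvm::errs() << "\n";
    }
    return false;
}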
LLVM is just a library. You won't find AST nodes there.
I suggest to have a look at Clang, which is a compiler built on top of LLVM.
Maybe this is what you're looking for?
Much like Matteo said, in order for LLVM to be able to recognize the loop variable and condition, the file needs to be in LLVM IR. The question says you have it in LLVM bytecode, but since LLVM IR is written in SSA form, talking about "loop variables" isn't really accurate. If you describe what you're trying to do and what type of result you expect, we can be of further help.
Some code to help you get started:
virtual void getAnalysisUsage(AnalysisUsage &AU) const {
    AU.addRequired<LoopInfo>();
}

bool runOnLoop(Loop *L, LPPassManager &) {
    BasicBlock *h = L->getHeader();
    if (BranchInst *bi = dyn_cast<BranchInst>(h->getTerminator())) {
        Value *loopCond = bi->getCondition();
    }
    return false;
}
This code snippet is from inside a regular LLVM pass.
Just an update on Junxzm's answer: some references, pointers, and methods have changed in LLVM 3.5.
for (llvm::Module::iterator f = m->begin(), fe = m->end(); f != fe; ++f) {
    llvm::DominatorTree DT = llvm::DominatorTree();
    DT.recalculate(*f);
    llvm::LoopInfoBase<llvm::BasicBlock, llvm::Loop>* LoopInfo = new llvm::LoopInfoBase<llvm::BasicBlock, llvm::Loop>();
    LoopInfo->releaseMemory();
    LoopInfo->Analyze(DT);
}
Hi, I have written an MPI quicksort program which works like this:
In my cluster the master will divide the integer data and send it to the slave nodes. Upon receiving the data, each slave performs an individual sorting operation and sends the sorted data back to the master.
Now my problem is that I'm interested in introducing hyper-threading on the slaves.
I have the data coming from the master:
sub (which denotes the array)
count (the size of the array)
Now I have initialized the pthreads as follows, where num_pthreads = 12:
pthread_attr_init(&attr);
pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE);
for (i = 0; i < num_pthreads; i++) {
    if (pthread_create(&thread[i], &attr, new_thread, (void *) &sub[i]))
    {
        printf("error creating a new thread \n");
        exit(1);
    }
    else
    {
        printf(" threading is successful %d at node %d \n \t ", i, rank);
    }
}
and in a new thread function
void * new_thread(int *sub)
{
    quick_sort(sub, 0, count-1);
    return (0);
}
I don't understand whether my way is correct or not. Can anyone help me with this problem?
Your basic idea is correct, except you also need to determine how you're going to get results back from the threads.
Normally you would want to pass all the information the thread needs to complete its task through the *arg argument of pthread_create. In your new_thread() function, the count variable is not passed in to the function and so is global between all threads. A better approach is to pass a pointer to a struct through pthread_create.
typedef struct {
    int *sub;   /* Pointer to first element in array to sort, different for each thread. */
    int count;  /* Number of elements in sub. */
    int done;   /* Flag to indicate that sort is complete. */
} thread_params_t;

void * new_thread(thread_params_t *params)
{
    quick_sort(params->sub, 0, params->count - 1);
    params->done = 1;
    return 0;
}
You would fill in a new thread_params_t for each thread that was spawned.
The other item that has to be managed is the sort results. It would be normal for the main thread to do a pthread_join on each thread, which ensures that each one has completed before continuing. Depending on your requirements, you could either have each thread send its results back to the master directly, or have the main function collect the results from each thread and send them back outside of the worker threads.
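As a rough, untested sketch, the spawning side could look like the following, assuming new_thread is given the void *(*)(void *) signature that pthread_create expects (taking a void * and casting it to thread_params_t * inside), and reusing attr, sub and count from your code; merging the sorted chunks afterwards is still up to you:

const int num_pthreads = 12;
thread_params_t params[num_pthreads];
pthread_t threads[num_pthreads];
int chunk = count / num_pthreads;

for (int i = 0; i < num_pthreads; i++) {
    params[i].sub   = sub + i * chunk;                                     /* start of this thread's chunk */
    params[i].count = (i == num_pthreads - 1) ? count - i * chunk : chunk; /* last thread takes the remainder */
    params[i].done  = 0;
    if (pthread_create(&threads[i], &attr, new_thread, &params[i])) {
        printf("error creating thread %d\n", i);
        exit(1);
    }
}

/* Wait for all workers; once the joins return, every chunk is sorted and the
   main thread can merge the chunks and send the result back over MPI. */
for (int i = 0; i < num_pthreads; i++)
    pthread_join(threads[i], NULL);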
You can use OpenMP instead of pthreads (just for the record, combining MPI with threading is called hybrid programming). It is a lightweight set of compiler pragmas that turn sequential code into parallel code. Support for OpenMP is available in virtually all modern compilers. With OpenMP you introduce so-called parallel regions. A team of threads is created at the beginning of a parallel region, then the code continues to execute concurrently until the end of the parallel region, where all threads are joined and only the main thread continues execution (thread creation and joining is logical, i.e. it doesn't have to be implemented like this in real life, and most implementations actually use thread pools to speed up the creation of threads):
#pragma omp parallel
{
// This code gets executed in parallel
...
}
You can use omp_get_thread_num() inside the parallel block to get the ID of the thread and make it compute something different. Or you can use one of the worksharing constructs of OpenMP like for, sections, etc. to make it divide the work automatically.
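For example, here is a rough sketch of how that could look in your slave code (reusing sub, count and quick_sort from your question and adding #include <omp.h> for the omp_get_* calls; each thread sorts its own chunk, and the sorted chunks still have to be merged afterwards):

#pragma omp parallel
{
    int tid   = omp_get_thread_num();
    int nth   = omp_get_num_threads();
    int chunk = count / nth;
    int begin = tid * chunk;
    int end   = (tid == nth - 1) ? count : begin + chunk; /* last thread takes the remainder */
    quick_sort(sub, begin, end - 1);
}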
The biggest advantage of OpenMP is that it doesn't require deep changes to the source code and it abstracts thread creation/joining away, so you don't have to do it manually. Most of the time you can get by with just a few pragmas. Then you have to enable OpenMP during compilation (with -fopenmp for GCC, -openmp for Intel compilers, -xopenmp for Sun/Oracle compilers, etc.). If you do not enable OpenMP or the particular compiler doesn't support it, you'll get a serial program.
You can find a quick but comprehensive OpenMP tutorial at LLNL.