How to calculate the number of tasks created by an OpenMP program? - task

I am looking to find the number of tasks. How do I get the number of tasks created by an OpenMP program?
void quicksort(int* A, int left, int right)
{
    int i, last;
    if (left >= right)
        return;
    swap(A, left, (left + right) / 2);   /* move the pivot to the front */
    last = left;
    for (i = left + 1; i <= right; i++)
        if (A[i] < A[left])
            swap(A, ++last, i);
    swap(A, left, last);                 /* pivot is now in its final place */
    #pragma omp task                     /* sort the left part as an explicit task */
    quicksort(A, left, last - 1);
    quicksort(A, last + 1, right);
    #pragma omp taskwait
}

If you want to gain an insight into what your OpenMP program is doing, you should use an OpenMP-task-aware performance analysis tool. For example, Score-P can record all task operations, either in a trace with full timing information or in a summary profile. Several other tools can then analyse and visualize the recorded information.
Take a look at this paper for more information on performance analysis of task-based OpenMP applications.
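For example, with Score-P installed, instrumenting and running the program could look roughly like this (a sketch; the scorep compiler wrapper and scorep-score are part of the Score-P installation, and the exact name of the measurement directory will differ per run):
scorep gcc -fopenmp quicksort.c -o quicksort    # build with Score-P instrumentation
./quicksort                                     # writes a scorep-* measurement directory
scorep-score scorep-*/profile.cubex             # summarize the recorded profile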

There's no good way of counting the number of OpenMP tasks, as OpenMP does not offer any way to query how many tasks have been created so far. The OpenMP runtime system may or may not keep track of this number, so it would be unfair (and would have performance implications) to require such a number to be kept by a runtime that is otherwise not interested in it.
The following is a terrible hack! Be sure you absolutely want to do this!
Having said that, you can do the count manually. Assuming that your code deterministically creates the same number of tasks on each execution, you can do this:
int tasks_created;

void quicksort(int* A, int left, int right)
{
    int i, last;
    if (left >= right)
        return;
    swap(A, left, (left + right) / 2);
    last = left;
    for (i = left + 1; i <= right; i++)
        if (A[i] < A[left])
            swap(A, ++last, i);
    swap(A, left, last);
    #pragma omp task
    {
        #pragma omp atomic
        tasks_created++;
        quicksort(A, left, last - 1);
    }
    quicksort(A, last + 1, right);
    #pragma omp taskwait
}
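Note that for any tasks to be created at all, the first call to quicksort still has to happen inside a parallel region, and the counter can be read once that region has ended. A minimal driver might look like this (a sketch; it reuses the quicksort and tasks_created from above and assumes a swap helper is defined):
#include <stdio.h>

int main(void)
{
    int A[] = { 5, 3, 8, 1, 9, 2, 7 };
    int n = sizeof(A) / sizeof(A[0]);

    #pragma omp parallel
    #pragma omp single      /* one thread starts the recursion, the others execute tasks */
    quicksort(A, 0, n - 1);

    printf("tasks created: %d\n", tasks_created);   /* safe to read after the parallel region */
    return 0;
}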
I'm saying that this is a terrible hack, because
it requires you to find all the task-creating statements and modify them with the atomic construct and the increment
it does not work well for some task-generating directives, e.g., taskloop
it may horribly slow down execution, so you cannot leave the modification in for production (that's the part about determinism: you need to run once with the counting and then remove it for production)
Another way...
If you are using a reasonably new OpenMP implementation that already supports the OpenMP tools interface of OpenMP 5.0, you can write a small tool that hooks into the OpenMP event for task creation. Then you can do the count in the tool and attach it to your execution through the OpenMP tools mechanism.
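A minimal sketch of such a tool might look like this (assuming a runtime that implements the OpenMP 5.0 tools interface and ships omp-tools.h; the counter and the printout at shutdown are purely illustrative, and the tool would be linked into the application or loaded via OMP_TOOL_LIBRARIES):
#include <omp-tools.h>   /* OpenMP 5.0 tools interface header */
#include <stdatomic.h>
#include <stdio.h>

static atomic_ulong task_count;

/* Called by the runtime whenever a task is created; count only explicit tasks. */
static void on_task_create(ompt_data_t *encountering_task_data,
                           const ompt_frame_t *encountering_task_frame,
                           ompt_data_t *new_task_data,
                           int flags, int has_dependences,
                           const void *codeptr_ra)
{
    if (flags & ompt_task_explicit)
        task_count++;
}

static int tool_initialize(ompt_function_lookup_t lookup, int initial_device_num,
                           ompt_data_t *tool_data)
{
    ompt_set_callback_t set_callback =
        (ompt_set_callback_t) lookup("ompt_set_callback");
    set_callback(ompt_callback_task_create, (ompt_callback_t) on_task_create);
    return 1;   /* non-zero keeps the tool active */
}

static void tool_finalize(ompt_data_t *tool_data)
{
    printf("tasks created: %lu\n", (unsigned long) task_count);
}

/* Entry point the OpenMP runtime looks for at start-up. */
ompt_start_tool_result_t *ompt_start_tool(unsigned int omp_version,
                                          const char *runtime_version)
{
    static ompt_start_tool_result_t result = { tool_initialize, tool_finalize, { 0 } };
    return &result;
}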

Related

gpu::OpenCV and openMP

I am trying to use OpenMP to parallelise a loop that contains gpu::openCV calls. Although I create an object inside the loop (so it is different for each thread), the results are not correct. They become correct only if I enclose the gpu::openCV functions in a '#pragma omp critical' section, but then I get no speedup from the parallelisation, only overhead.
My question is: why does the concurrent execution of different instances of gpu::openCV calls with OpenMP give wrong results?

Compile openmp into pthreads C code

I understand that OpenMP is in fact just a set of macros which is compiled into pthreads. Is there a way of seeing the pthread code before the rest of the compilation occurs? I am using GCC to compile.
First, OpenMP is not simply a set of macros. It may be seen as a simple transformation into pthread-like code, but OpenMP requires more than that, including runtime support.
Back to your question: in GCC, at least, you can't see the pthread code because GCC's OpenMP implementation is done in the compiler back-end (or middle-end). The transformation is done at the IR (intermediate representation) level. So, from the programmer's viewpoint, it's not easy to see how the code is actually transformed.
However, there are some references.
(1) An Intel engineer provided a great overview of the implementation of OpenMP in the Intel C/C++ compiler:
http://www.drdobbs.com/parallel/how-do-openmp-compilers-work-part-1/226300148
http://www.drdobbs.com/parallel/how-do-openmp-compilers-work-part-2/226300277
(2) You may take a look at the implementation of GCC's OpenMP:
https://github.com/mirrors/gcc/tree/master/libgomp
Note that libgomp.h uses pthreads, and loop.c contains the implementation of the parallel-loop construct.
OpenMP is a set of compiler directives, not macros. In C/C++ these directives are implemented with the #pragma extension mechanism, while in Fortran they are implemented as specially formatted comments. They instruct the compiler to perform certain code transformations in order to convert the serial code into parallel code.
Although it is possible to implement OpenMP as a transformation to pure pthreads code, this is seldom done. A large part of the OpenMP mechanics is usually built into a separate run-time library, which comes as part of the compiler suite. For GCC this is libgomp. It provides a set of high-level functions that are used to easily implement the OpenMP constructs. It is also internal to the compiler and not intended to be used by user code, i.e. there is no header file provided.
With GCC it is possible to get a pseudocode representation of what the code looks like after the OpenMP transformation. You have to supply it the -fdump-tree-all option, which would result in the compiler spewing a large number of intermediate files for each compilation unit. The most interesting one is filename.017t.ompexp (this comes from GCC 4.7.1, the number might be different on other GCC versions, but the extension would still be .ompexp). This file contains an intermediate representation of the code after the OpenMP constructs were lowered and then expanded into their proper implementation.
Consider the following example C code, saved as fun.c:
void fun(double *data, int n)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        data[i] += data[i]*data[i];
}
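The dump below was produced by compiling with the dump option enabled, for example (the -std=c99 is needed here because i is declared inside the for statement):
gcc -std=c99 -fopenmp -fdump-tree-all -c fun.c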
The content of fun.c.017t.ompexp is:
fun (double * data, int n)
{
...
struct .omp_data_s.0 .omp_data_o.1;
...
<bb 2>:
.omp_data_o.1.data = data;
.omp_data_o.1.n = n;
__builtin_GOMP_parallel_start (fun._omp_fn.0, &.omp_data_o.1, 0);
fun._omp_fn.0 (&.omp_data_o.1);
__builtin_GOMP_parallel_end ();
data = .omp_data_o.1.data;
n = .omp_data_o.1.n;
return;
}
fun._omp_fn.0 (struct .omp_data_s.0 * .omp_data_i)
{
int n [value-expr: .omp_data_i->n];
double * data [value-expr: .omp_data_i->data];
...
<bb 3>:
i = 0;
D.1637 = .omp_data_i->n;
D.1638 = __builtin_omp_get_num_threads ();
D.1639 = __builtin_omp_get_thread_num ();
...
<bb 4>:
... this is the body of the loop ...
i = i + 1;
if (i < D.1644)
goto <bb 4>;
else
goto <bb 5>;
<bb 5>:
<bb 6>:
return;
...
}
I have omitted big portions of the output for brevity. This is not exactly C code; it is a C-like representation of the program flow. <bb N> are the so-called basic blocks, collections of statements treated as single blocks in the program's control flow. The first thing that one sees is that the parallel region gets extracted into a separate function. This is not uncommon - most OpenMP implementations do more or less the same code transformation. One can also observe that the compiler inserts calls to libgomp functions like GOMP_parallel_start and GOMP_parallel_end, which are used to bootstrap and then to finish the execution of a parallel region (the __builtin_ prefix is removed later on). Inside fun._omp_fn.0 there is a for loop, implemented in <bb 4> (note that the loop itself is also expanded). Also, all shared variables are put into a special structure that gets passed to the implementation of the parallel region. <bb 3> contains the code that computes the range of iterations over which the current thread operates.
Well, not quite C code, but this is probably the closest thing one can get from GCC.
I haven't tested it with OpenMP, but the compiler option -E should give you the code after preprocessing.

Pthreads in MPI

Hi, I have written an MPI quicksort program which works like this:
In my cluster 'Master' will divide the integer data and send these to 'Slave nodes'. Upon receiving at the Slave nodes, each slave will perform individual sorting operations and send the sorted data back to Master.
Now my problem is I'm interested in introducing hyper-threading for the slaves.
I have data coming from the master:
sub (which denotes the array)
count (the size of the array)
Now I have initialized the pthreads as follows, where num_pthreads = 12:
pthread_attr_init(&attr);
pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE);
for (i = 0; i < num_pthreads; i++) {
    if (pthread_create(&thread[i], &attr, new_thread, (void *) &sub[i]))
    {
        printf("error creating a new thread \n");
        exit(1);
    }
    else
    {
        printf(" threading is successful %d at node %d \n \t ", i, rank);
    }
}
and in a new thread function
void *new_thread(int *sub)
{
    quick_sort(sub, 0, count - 1);
    return 0;
}
I am not sure whether my approach is correct or not. Can anyone help me with this problem?
Your basic idea is correct, except you also need to determine how you're going to get results back from the threads.
Normally you would want to pass all the information the thread needs to complete its task through the *arg argument of pthread_create. In your new_thread() function, the count variable is not passed in to the function and so is shared globally between all threads. A better approach is to pass a pointer to a struct through pthread_create.
typedef struct {
    int *sub;   /* Pointer to first element in array to sort, different for each thread. */
    int count;  /* Number of elements in sub. */
    int done;   /* Flag to indicate that sort is complete. */
} thread_params_t;

void *new_thread(thread_params_t *params)
{
    quick_sort(params->sub, 0, params->count - 1);
    params->done = 1;
    return 0;
}
You would fill in a new thread_params_t for each thread that was spawned.
The other item that has to be managed is the sort results. It would be normal for the main thread to do a pthread_join on each thread, which ensures that it has completed before continuing. Depending on your requirements, you could either have each thread send results back to the master directly, or have the main function collect the results from each thread and send them back outside the worker threads.
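Putting it together, the spawning and joining could look roughly like this (a sketch; num_pthreads, thread[], attr, sub and count come from the question's context, and the even split into chunks is an assumption):
thread_params_t params[num_pthreads];
int chunk = count / num_pthreads;            /* elements per thread, assuming an even split */

for (i = 0; i < num_pthreads; i++) {
    params[i].sub   = sub + i * chunk;       /* each thread sorts its own slice */
    params[i].count = chunk;
    params[i].done  = 0;
    /* pthread_create expects a void *(*)(void *) start routine; either declare
       new_thread with that signature and cast the argument inside, or cast here. */
    pthread_create(&thread[i], &attr, (void *(*)(void *)) new_thread, &params[i]);
}

for (i = 0; i < num_pthreads; i++)
    pthread_join(thread[i], NULL);           /* wait until every slice is sorted */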
You can use OpenMP instead of pthreads (just for the record, combining MPI with threading is called hybrid programming). It is a lightweight set of compiler pragmas that turn sequential code into parallel code. Support for OpenMP is available in virtually all modern compilers. With OpenMP you introduce so-called parallel regions. A team of threads is created at the beginning of a parallel region, then the code continues to execute concurrently until the end of the region, where all threads are joined and only the main thread continues execution (thread creation and joining is logical, i.e. it doesn't have to be implemented like this in reality, and most implementations actually use thread pools to speed up thread creation):
#pragma omp parallel
{
    // This code gets executed in parallel
    ...
}
You can use omp_get_thread_num() inside the parallel block to get the ID of the thread and make it compute something different. Or you can use one of the worksharing constructs of OpenMP like for, sections, etc. to make it divide the work automatically.
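For the question's use case, the per-slice sorting could be expressed with a worksharing loop, for example (a sketch; sub, chunk, num_pthreads and quick_sort are assumed from the question's context):
#pragma omp parallel for
for (int i = 0; i < num_pthreads; i++)
    quick_sort(sub + i * chunk, 0, chunk - 1);   /* each iteration sorts one slice */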
The biggest advantage of OpenMP is that it doesn't introduce deep changes to the source code, and it abstracts thread creation/joining away so you don't have to do it manually. Most of the time you can get away with just a few pragmas. Then you have to enable OpenMP during compilation (with -fopenmp for GCC, -openmp for Intel compilers, -xopenmp for Sun/Oracle compilers, etc.). If you do not enable OpenMP or the particular compiler doesn't support it, you'll get a serial program.
You can find a quick but comprehensive OpenMP tutorial at LLNL.

False autovectorization in Intel C compiler (icc)

I need to vectorize some huge loops in a program with SSE. In order to save time, I decided to let ICC deal with it. For that purpose, I prepare the data properly, taking alignment into account, and I make use of the compiler directives #pragma simd, #pragma aligned, #pragma ivdep. When compiling with the various -vec-report options, the compiler tells me that the loops were vectorized. A quick look at the assembly generated by the compiler seems to confirm that, since you can find plenty of vector instructions there that work with packed single-precision operands (all operations in the serial code handle float operands).
The problem is that when I read hardware counters with PAPI, the number of FP operations I get (PAPI_FP_INS and PAPI_FP_OPS) is pretty much the same in the auto-vectorized code and the original one, when one would expect it to be significantly lower in the auto-vectorized code. What's more, I vectorized by hand a simplified version of the problem in question, and in that case I do get something like 3 times fewer FP operations.
Has anyone experienced something similar with this?
Spills may destroy the advantage of vectorization, so 64-bit mode may gain significantly over 32-bit mode. Also, icc may version a loop, and you may be hitting a scalar version even though a vector version is present. icc versions issued in the last year or two have fixed some problems in this area.
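For reference, a loop set up the way the question describes might look like this (a sketch with ICC-specific pragmas; the array names, the size n and the 32-byte alignment are assumptions for illustration):
float *a = _mm_malloc(n * sizeof(float), 32);   /* aligned allocations, from <xmmintrin.h> */
float *b = _mm_malloc(n * sizeof(float), 32);
/* ... fill a and b ... */
#pragma simd
#pragma vector aligned
#pragma ivdep
for (int i = 0; i < n; i++)
    a[i] = a[i] * b[i] + b[i];                  /* packed single-precision work */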

What are the performance implications of these C# features?

I have been designing a component-based game library, with the overall intention of writing it in C++ (as that is my forte), with Ogre3D as the back-end. Now that I am actually ready to write some code, I thought it would be far quicker to test out my framework under the XNA4.0 framework (somewhat quicker to get results/write an editor, etc). However, whilst I am no newcomer to C++ or C#, I am a bit of a newcomer when it comes to doing things the "XNA" way, so to speak, so I had a few queries before I started hammering out code:
I read about using arrays rather than collections to avoid performance hits, then also read that this was not entirely true and that if you enumerated over, say, a concrete List<> collection (as opposed to an IEnumerable<>), the enumerator is a value-type that is used for each iteration and that there aren't any GC worries here. The article in question was back in 2007. Does this hold true, or do you experienced XNA developers have real-world gotchas about this? Ideally I'd like to go down a chosen route before I do too much.
If arrays truly are the way to go, no questions asked, I assume when it comes to resizing the array, you copy the old one over with new space? Or is this off the mark? Do you attempt to never, ever resize an array? Won't the GC kick in for the old one if this is the case, or is the hit inconsequential?
As the engine was designed for C++, the design allows for use of lambdas and delegates. One design uses the fastdelegate library which is the fastest possible way of using delegates in C++. A more flexible, but slightly slower approach (though hardly noticeable in the world of C++) is to use C++0x lambdas and std::function. Ideally, I'd like to do something similar in XNA, and allow delegates to be used. Does the use of delegates cause any significant issues with regard to performance?
If there are performance considerations with regards to delegates, is there a difference between:
public delegate void myDelegate(int a, int b);
private void myFunction(int a, int b)
{
}
event myDelegate myEvent;
myEvent += myFunction;
vs:
public delegate void myDelegate(int a, int b);
event myDelegate myEvent;
myEvent += (int a, int b) => { /* ... */ };
Sorry if I have waffled on a bit, I prefer to be clear in my questions. :)
Thanks in advance!
Basically, the only major performance issue to be aware of in C# that differs from what you have to be aware of in C++ is the garbage collector. Simply don't allocate memory during your main game loop and you'll be fine. Here is a blog post that goes into detail.
Now to your questions:
1) If a framework collection iterator could be implemented as a value-type (not creating garbage), then it usually (always?) has been. You can safely use foreach on, for example, List<>.
You can verify if you are allocating in your main loop by using the CLR Profiler.
2) Use Lists instead of arrays. They'll handle the resizing for you. You should use the Capacity property to pre-allocate enough space before you start gameplay to avoid GC issues. Using arrays you'd just have to implement all this functionality yourself - ugly!
The GC kicks in on allocations (not when memory becomes free). On Xbox 360 it kicks in for every 1MB allocated and is very slow. On Windows it is a bit more complicated - but also doesn't have such a huge impact on performance.
3) C# delegates are pretty damn fast. And faster than most people expect. They are about on-par with method calls on interfaces. Here and here are questions that provide more details about delegate performance in C#.
I couldn't say how they compare to the C++ options. You'd have to measure it.
4) No. I'm fairly sure this code will produce identical IL. You could disassemble it and check, or profile it, though.
I might add - without checking myself - I suspect that having an event myDelegate will be slower than a plain myDelegate if you don't need all the magic of event.
