Task-based programming: #pragma omp task versus #pragma omp parallel for

Considering:
void saxpy_worksharing(float* x, float* y, float a, int N) {
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        y[i] = y[i] + a*x[i];
    }
}
And
void saxpy_tasks(float* x, float* y, float a, int N) {
    #pragma omp parallel
    {
        for (int i = 0; i < N; i++) {
            #pragma omp task
            {
                y[i] = y[i] + a*x[i];
            }
        }
    }
What is the difference between using tasks and the omp parallel for directive? Why can we write recursive algorithms such as merge sort with tasks, but not with worksharing?

I would suggest that you have a look at the OpenMP tutorial from Lawrence Livermore National Laboratory.
Your particular example is one that should not be implemented using OpenMP tasks in the first place: each task performs only a very simple computation, so the tasking overhead would be gigantic. Besides, the second code is conceptually wrong (it is also missing a closing }). Since there is no worksharing directive, all threads execute all iterations of the loop, so instead of N tasks, N times the number of threads tasks get created. It should be rewritten in one of the following ways:
Single task producer - common pattern, NUMA unfriendly:
void saxpy_tasks(float* x, float* y, float a, int N) {
    #pragma omp parallel
    {
        #pragma omp single
        {
            for (int i = 0; i < N; i++)
                #pragma omp task
                {
                    y[i] = y[i] + a*x[i];
                }
        }
    }
}
The single directive would make the loop run inside a single thread only. All other threads would skip it and hit the implicit barrier at the end of the single construct. As barriers contain implicit task scheduling points, the waiting threads will start processing tasks immediately as they become available.
Parallel task producer - more NUMA friendly:
void saxpy_tasks(float* x, float* y, float a, int N) {
    #pragma omp parallel
    {
        #pragma omp for
        for (int i = 0; i < N; i++)
            #pragma omp task
            {
                y[i] = y[i] + a*x[i];
            }
    }
}
In this case the task creation loop would be shared among the threads.
If you do not know what NUMA is, ignore the comments about NUMA friendliness.
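
As for why recursive algorithms such as merge sort work with tasks but not with worksharing: a worksharing loop divides a fixed iteration space that must be known when the construct is entered, whereas a task may create further tasks at any recursion depth and the runtime balances them dynamically. A minimal sketch of my own follows (not code from the thread; merge(), the cutoff constant, and all names are assumptions):

#include <stdlib.h>
#include <string.h>

/* Merge the two sorted halves a[lo..mid) and a[mid..hi) using tmp. */
static void merge(int *a, int *tmp, int lo, int mid, int hi)
{
    int i = lo, j = mid, k = lo;
    while (i < mid && j < hi)
        tmp[k++] = (a[i] <= a[j]) ? a[i++] : a[j++];
    while (i < mid) tmp[k++] = a[i++];
    while (j < hi)  tmp[k++] = a[j++];
    memcpy(a + lo, tmp + lo, (hi - lo) * sizeof(int));
}

static void merge_sort(int *a, int *tmp, int lo, int hi)
{
    if (hi - lo < 2)
        return;
    int mid = lo + (hi - lo) / 2;
    /* Each recursive call may spawn a new task; the if clause falls
       back to a plain call for small ranges to limit task overhead. */
    #pragma omp task shared(a, tmp) if (hi - lo > 4096)
    merge_sort(a, tmp, lo, mid);
    merge_sort(a, tmp, mid, hi);
    /* Wait for the child task before merging the two halves. */
    #pragma omp taskwait
    merge(a, tmp, lo, mid, hi);
}

void parallel_merge_sort(int *a, int n)
{
    int *tmp = malloc(n * sizeof(int));
    #pragma omp parallel
    #pragma omp single   /* one producer starts the task tree */
    merge_sort(a, tmp, 0, n);
    free(tmp);
}

The taskwait plays the role the implicit barrier plays in worksharing: it orders each merge after both halves have been sorted.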

Related

Implementation of OpenMP and Pthread

I have written a sequential code for a simulation process. Basically the flow goes like the following:
int count = 0;
int i, j;
for (i = 0; i < n; i++)
{
    for (j = 0; j < n; j++)
    {
label:
        if (j == i) continue;
        // Some vector calculations and finding out X value, example: (X = i+j)
    }
    // Some calculations with respect to i value, example: (K = i*0.8+0.2)
    // Example: Z = X+K;
    if (Z == 0.5)
    {
        count++;
        if (count < 3)
        {
            goto label;
        }
    }
}
I am new to parallel programming. How can I implement the above code using OpenMP and Pthreads?
I tried using #pragma omp parallel for collapse(2) before the first for loop, and I am stuck figuring out the next steps.

my process doesn't go to the child process

I have to take a user input number 'n' in the parent process and then pass it to the child process. The child process then takes 'n' user input values and stores them in an array. It then creates a thread and sends this array as an argument. The thread sums all the values in the array and sends the result back to the child process, which prints it.
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <pthread.h>
#include <fcntl.h>

void *sum(void *a)
{
    printf("in thread");
    int *arr = (int *)a;
    int i;
    int sum = 0;
    int size = sizeof(arr)/sizeof(arr[0]);
    for (i = 0; i < size; i++)
    {
        sum = sum + arr[i];
    }
    pthread_exit(sum);
}

int main()
{
    int pipefd[2];
    pid_t childpid;
    pthread_t tid;
    pipe(pipefd);
    int r;
    int n;
    childpid = fork();
    if (0 == childpid)
    {
        printf("in child process");
        close(pipefd[1]);
        read(pipefd[0], r, sizeof(int));
        close(pipefd[0]);
        int *ret;
        int a[r];
        int i;
        for (i = 0; i < r; i++)
        {
            printf("enter values: ");
            scanf("%d", &a[i]);
        }
        pthread_create(&tid, NULL, sum, (void *)a);
        pthread_join(tid, (void *)&ret);
        printf("%d", *ret);
    }
    else
    {
        printf("in parent process");
        printf("enter a number");
        scanf("%d", &n);
        close(pipefd[0]);
        write(pipefd[1], n, sizeof(int));
        close(pipefd[1]);
    }
    return 0;
}
I have checked this code a dozen times and nothing seems to be wrong. The process stops after taking the value of 'n'. The child process never runs.
Both read and write expect to receive the memory address of a buffer to use. You need to take the address of the variables you're trying to fill.
E.g.
write(pipefd[1], &n, sizeof(int));
and
read(pipefd[0], &r, sizeof(int));
By the way, the child process most likely is running. It's just that your value for r is coming back as 0. Simple debugging trick: use fprintf(stderr, "stuff to print"); to check your results at various stages. You can easily verify the child process is running, for example.
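Two more bugs will bite once r arrives correctly, both in how the thread computes and returns the sum. Here is a sketch of a fixed thread function (the job struct, the names, and the malloc'd return value are my own choices, not from the question):

#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

/* sizeof(arr)/sizeof(arr[0]) inside sum() divides the size of a pointer
   by the size of an int, so the array length must travel with the data. */
struct job {
    int *arr;
    int n;
};

void *sum(void *p)
{
    struct job *j = p;
    /* pthread_exit(sum) passed an int where a pointer is expected, and
       *ret then dereferenced that bogus pointer. Returning heap storage
       gives the joiner something valid to dereference (and free). */
    int *total = malloc(sizeof *total);
    *total = 0;
    for (int i = 0; i < j->n; i++)
        *total += j->arr[i];
    return total;
}

In the child, the call site then becomes:

struct job j = { a, r };
pthread_create(&tid, NULL, sum, &j);
pthread_join(tid, (void **)&ret);
printf("%d\n", *ret);
free(ret);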

OpenMP v4.0 GOMP_task()

I'm trying to implement my own version of the libgomp library. Now I'm implementing the GOMP_task() function, but there are some parameters that I don't understand.
void
GOMP_task (void (*fn) (void *), void *data, void (*cpyfn) (void *, void *),
           long arg_size, long arg_align, bool if_clause, unsigned flags,
           void **depend)
I have problems with these parameters:
unsigned flags
void **depend
When I compile this code with my own libgomp library
#pragma omp parallel
{
    #pragma omp for schedule(dynamic,3)
    for (long i = 0; i < 10; i++) {
        #pragma omp task shared(result) depend(in:result) depend(out:result)
        {
            result++;
        }
        #pragma omp task depend(in:result) depend(out:result)
        result++;
    }
    #pragma omp taskwait
    printf("result = %ld\n", result);
}
I printed those two parameters: flags is always 8, and depend is always the same value, even when different tasks depend on different variables.
Is there any document with information about that? I haven't found anything.
Or does anybody know about that function?
Thanks
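
Regarding flags and depend: this is not documented anywhere official, but my reading of the GCC sources (gomp-constants.h and task.c in libgomp) suggests the following layout. Treat it as an assumption to verify against your GCC version:

/* 'flags' is a bit mask. The value 8 you observe matches the depend bit,
   which is set because both of your tasks carry depend clauses. */
#define GOMP_TASK_FLAG_UNTIED     (1 << 0)
#define GOMP_TASK_FLAG_FINAL      (1 << 1)
#define GOMP_TASK_FLAG_MERGEABLE  (1 << 2)
#define GOMP_TASK_FLAG_DEPEND     (1 << 3)

/* When the depend bit is set, 'depend' points to an array laid out as:
     depend[0]     total number of depend items
     depend[1]     number of out/inout items (they are listed first)
     depend[2...]  the address of each dependence variable
   Both of your tasks depend only on 'result', so depend[2] holds the
   same address every time, which is why the values look identical. */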

OpenMP task with and without parallel

Node *head = &node1;
while (head)
{
    #pragma omp task
    cout << head->value << endl;
    head = head->next;
}

#pragma omp parallel
{
    #pragma omp single
    {
        Node *head = &node1;
        while (head)
        {
            #pragma omp task
            cout << head->value << endl;
            head = head->next;
        }
    }
}
In the first block I just created tasks without a parallel directive, while in the second block I used the parallel directive and the single directive, which is the common pattern I have seen in papers.
I wonder what the difference between them is? BTW, I know the basic meaning of these directives.
The code in my comment:
void traverse(node *root)
{
    if (root->left)
    {
        #pragma omp task
        traverse(root->left);
    }
    if (root->right)
    {
        #pragma omp task
        traverse(root->right);
    }
    process(root);
}
The difference is that in the first block you are not really creating any tasks since the block itself is not nested (neither syntactically nor lexically) inside an active parallel region. In the second block the task construct is syntactically nested inside the parallel region and would queue explicit tasks if the region happens to be active at run-time (an active parallel region is one that executes with a team of more than one thread). Lexical nesting is less obvious. Observe the following example:
void foo(void)
{
    int i;
    for (i = 0; i < 10; i++)
        #pragma omp task
        bar();
}

int main(void)
{
    foo();

    #pragma omp parallel num_threads(4)
    {
        #pragma omp single
        foo();
    }

    return 0;
}
The first call to foo() happens outside of any parallel regions. Hence the task directive does (almost) nothing and all calls to bar() happen serially. The second call to foo() comes from inside the parallel region and hence new tasks would be generated inside foo(). The parallel region is active since the number of threads was fixed to 4 by the num_threads(4) clause.
This different behaviour of the OpenMP directives is a design feature. The main idea is to be able to write code that can execute both serially and in parallel.
Still, the presence of the task construct in foo() triggers some code transformation, e.g. foo() is transformed to something like:
void foo_omp_fn_1(void *omp_data)
{
    bar();
}

void foo(void)
{
    int i;
    for (i = 0; i < 10; i++)
        OMP_make_task(foo_omp_fn_1, NULL);
}
Here OMP_make_task() is a hypothetical (not publicly available) function from the OpenMP support library that queues a call to the function supplied as its first argument. If OMP_make_task() detects that it works outside an active parallel region, it would simply call foo_omp_fn_1() instead. This adds some overhead to the call to bar() in the serial case. Instead of main -> foo -> bar, the call goes like main -> foo -> OMP_make_task -> foo_omp_fn_1 -> bar. The implication of this is slower serial code execution.
This is even more obviously illustrated with the worksharing directive:
void foo(void)
{
    int i;
    #pragma omp for
    for (i = 0; i < 12; i++)
        bar();
}

int main(void)
{
    foo();

    #pragma omp parallel num_threads(4)
    {
        foo();
    }

    return 0;
}
The first call to foo() would run the loop in serial. The second call would distribute the 12 iterations among the 4 threads, i.e. each thread would only execute 3 iterations. Once again, some code transformation magic is used to achieve this, and the serial loop would run slower than if no #pragma omp for was present in foo().
The lesson here is to never add OpenMP constructs where they are not really necessary.
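If you cannot avoid such a construct but the serial-path overhead matters, one option (my suggestion, not part of the answer above) is to branch on omp_in_parallel() yourself:

#include <omp.h>

void bar(void);

void foo(void)
{
    int i;
    for (i = 0; i < 10; i++) {
        if (omp_in_parallel()) {
            /* inside an active parallel region: defer bar() as a task */
            #pragma omp task
            bar();
        } else {
            /* serial context: plain call, no tasking machinery involved */
            bar();
        }
    }
}

omp_in_parallel() returns nonzero only when called from within an active parallel region, so the serial call chain stays main -> foo -> bar.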

How does Lua deal with the stack?

I'm trying out Lua and want to know how lua_State works.
Code and result:
state.c
#include <stdio.h>
#include "lua/src/lua.h"
#include "lua/src/lauxlib.h"

static void stackDump(lua_State *L) {
    int i;
    int top = lua_gettop(L);
    for (i = 1; i <= top; i++) {
        int t = lua_type(L, i);
        switch (t) {
            case LUA_TSTRING:
                printf("'%s'", lua_tostring(L, i));
                break;
            case LUA_TBOOLEAN:
                printf(lua_toboolean(L, i) ? "true" : "false");
                break;
            case LUA_TNUMBER:
                printf("%g", lua_tonumber(L, i));
                break;
            default:
                printf("%s", lua_typename(L, t));
                break;
        }
        printf(" ");
    }
    printf("\n");
}

static int divide(struct lua_State *L) {
    double a = lua_tonumber(L, 1);
    double b = lua_tonumber(L, 2);
    printf("%p\n", L);
    stackDump(L);
    int quot = (int)a / (int)b;
    int rem = (int)a % (int)b;
    lua_pushnumber(L, quot);
    lua_pushnumber(L, rem);
    stackDump(L);
    printf("---end div---\n");
    return 2;
}

int main(void) {
    struct lua_State *L = lua_open();
    lua_pushboolean(L, 1);
    lua_pushnumber(L, 10);
    lua_pushnil(L);
    lua_pushstring(L, "hello");
    printf("%p\n", L);
    stackDump(L);
    lua_register(L, "div", divide);
    luaL_dofile(L, "div.lua");
    stackDump(L);
    lua_close(L);
    return 0;
}
div.lua
local c = div(20, 10)
Output:
0x100c009e0
true 10 nil 'hello'
---start div---
0x100c009e0
20 10
20 10 2 0
---end div---
true 10 nil 'hello'
I see that the lua_State in divide is the same as the one in main, but they have different data in their stacks. How is this done?
I know the best way to understand this is to read the Lua source code; maybe you can tell me where the right place to look is.
Think of lua_State as containing the Lua stack, as well as indices delimiting the current visible part of the stack. When you invoke a Lua function, it may look like you have a new stack, but really only the indices have changed. That's the simplified version.
lua_State is defined in lstate.h. I've pulled out the relevant parts for you. stack is the beginning of the big Lua stack containing everything. base is the beginning of the stack for the current function. This is what your function sees as "the stack" when it is executing.
struct lua_State {
    /* ... */
    StkId top;         /* first free slot in the stack */
    StkId base;        /* base of current function */
    /* ... */
    StkId stack_last;  /* last free slot in the stack */
    StkId stack;       /* stack base */
    /* ... */
};
Programming in Lua, 2nd Edition discusses Lua states in chapter 30: Threads and States. You'll find some good information there. For example, lua_State not only represents a Lua state, but also a thread within that state. Furthermore, all threads have their own stack.
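To see the thread/stack relationship concretely, here is a small sketch of my own (same Lua 5.1-era API as your code; lua_newthread creates a state that shares globals with L but owns a separate stack):

#include <stdio.h>
#include "lua/src/lua.h"
#include "lua/src/lauxlib.h"

int main(void) {
    lua_State *L = luaL_newstate();    /* main state, main stack */
    lua_pushnumber(L, 42);             /* lives on L's stack */

    lua_State *co = lua_newthread(L);  /* shares globals with L, but
                                          gets a stack of its own */
    lua_pushstring(co, "hello");       /* lives only on co's stack */

    /* L holds two values (42 plus the thread object that
       lua_newthread pushed); co holds one ("hello"). */
    printf("L top = %d, co top = %d\n", lua_gettop(L), lua_gettop(co));

    lua_close(L);
    return 0;
}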
It gets different data the same way anything gets different data: code changes the data inside of the object.
struct Object
{
    int val;
};

void more_stuff(Object *the_data)
{
    // the_data->val has 5 in it now.
}

void do_stuff(Object *the_data)
{
    int old_val = the_data->val;
    the_data->val = 5;
    more_stuff(the_data);
    the_data->val = old_val;
}

int main()
{
    Object my_data;
    my_data.val = 1;
    // my_data.val has 1.
    do_stuff(&my_data);
    // my_data.val still has 1.
}
When Lua calls a registered C function, it gives it a new stack frame.
