OpenMP task with and without parallel - task

Node *head = &node1;
while (head) {
    #pragma omp task
    { /* work on *head */ }
    head = head->next;
}

#pragma omp parallel
#pragma omp single
{
    Node *head = &node1;
    while (head) {
        #pragma omp task
        { /* work on *head */ }
        head = head->next;
    }
}
In the first block I create tasks without a parallel directive, while in the second block I use the parallel and single directives, which is the common pattern I have seen in papers.
What is the difference between the two? BTW, I know the basic meaning of these directives.
The code in my comment:
void traverse(node *root)
{
    if (root->left)
        #pragma omp task
        traverse(root->left);
    if (root->right)
        #pragma omp task
        traverse(root->right);
}

The difference is that in the first block you are not really creating any deferred tasks, since the block is not nested (neither lexically nor dynamically) inside an active parallel region. In the second block the task construct is lexically nested inside the parallel region and would queue explicit tasks if the region happens to be active at run time (an active parallel region is one that executes with a team of more than one thread). Dynamic nesting, where the construct is reached through a function call made from inside the region, is less obvious. Observe the following example:
void foo(void)
{
   int i;

   for (i = 0; i < 10; i++)
      #pragma omp task
      bar();
}

int main(void)
{
   foo();

   #pragma omp parallel num_threads(4)
   {
      #pragma omp single
      foo();
   }

   return 0;
}
The first call to foo() happens outside of any parallel regions. Hence the task directive does (almost) nothing and all calls to bar() happen serially. The second call to foo() comes from inside the parallel region and hence new tasks would be generated inside foo(). The parallel region is active since the number of threads was fixed to 4 by the num_threads(4) clause.
This different behaviour of the OpenMP directives is a design feature. The main idea is to be able to write code that could execute both as serial and as parallel.
Still, the presence of the task construct in foo() triggers a code transformation, e.g. foo() is transformed into something like:
void foo_omp_fn_1(void *omp_data)
{
   bar();
}

void foo(void)
{
   int i;

   for (i = 0; i < 10; i++)
      OMP_make_task(foo_omp_fn_1, NULL);
}
Here OMP_make_task() is a hypothetical (not publicly available) function from the OpenMP support library that queues a call to the function supplied as its first argument. If OMP_make_task() detects that it is running outside an active parallel region, it simply calls foo_omp_fn_1() directly. This adds some overhead to the call to bar() in the serial case: instead of main -> foo -> bar, the call chain becomes main -> foo -> OMP_make_task -> foo_omp_fn_1 -> bar. The result is slower serial execution.
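The dispatch decision described above can be modelled in a few lines of plain C++. This is only a toy sketch: ToyRuntime, make_task and drain are made-up names, the in_active_parallel flag is set by hand, and no real OpenMP runtime is claimed to look like this.

```cpp
#include <functional>
#include <queue>
#include <utility>

// Toy model only: ToyRuntime, make_task and drain are hypothetical names,
// not part of any OpenMP implementation.
struct ToyRuntime {
    bool in_active_parallel = false;             // set by hand in this model
    std::queue<std::function<void()>> deferred;  // tasks waiting to be executed

    // Defer the call inside an active region, otherwise run it immediately.
    void make_task(std::function<void()> fn) {
        if (in_active_parallel) {
            deferred.push(std::move(fn));
        } else {
            fn();
        }
    }

    // Stand-in for worker threads picking up queued tasks.
    void drain() {
        while (!deferred.empty()) {
            std::function<void()> fn = std::move(deferred.front());
            deferred.pop();
            fn();
        }
    }
};
```

With in_active_parallel left at false, make_task() runs the callable before it returns, which corresponds to the extra indirection in the serial call chain. With the flag set to true, the callable only runs once drain() is called.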
This is even more obviously illustrated with the worksharing directive:
void foo(void)
{
   int i;
   #pragma omp for
   for (i = 0; i < 12; i++)
      bar();
}

int main(void)
{
   foo();
   #pragma omp parallel num_threads(4)
   foo();
   return 0;
}
The first call to foo() would run the loop serially. The second call would distribute the 12 iterations among the 4 threads, i.e. each thread would execute only 3 iterations. Once again, some code transformation magic is used to achieve this, and the serial loop would run slower than if no #pragma omp for were present in foo().
The lesson here is to never add OpenMP constructs where they are not really necessary.
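Applied to the linked-list loop from the question, the parallel + single + task pattern might look like the sketch below. The Node layout and the scale_list name are assumptions made for illustration, since the question does not show them. The function gives the same result whether it is compiled with or without OpenMP support, because unknown pragmas are ignored and the loop then simply runs serially.

```cpp
// Hypothetical node type; the question does not show its definition.
struct Node {
    int value;
    Node* next;
};

// Doubles the value stored in every node, creating one task per node.
// One thread walks the list; the other threads of the team execute tasks
// while they wait at the implicit barrier that ends the single construct.
void scale_list(Node* head) {
    #pragma omp parallel
    #pragma omp single
    for (Node* p = head; p != nullptr; p = p->next) {
        // Each task gets its own copy of the current node pointer.
        #pragma omp task firstprivate(p)
        p->value *= 2;
    }
}
```

All tasks are guaranteed to have finished when the parallel region ends, so the caller can read the list right after scale_list() returns.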


Implementation of OpenMP and Pthread

I have written a sequential code for a simulation process. Basically the flow goes like the following:
int count=0;
int i, j;
if(j==i) continue;
// Some vector calculations and finding out X value example: (X=i+j)
// Some calculations with respect to i value example: (K=i*0.8+0.2)
// Example:Z=X+K;
goto label;
I am new to parallel programming. How can I implement the above code using OpenMP and Pthreads?
I tried putting #pragma omp parallel for collapse(2) before the first for loop, but I am stuck on what to do next.

Async/Await in Dart

I'm making a Flutter app that uses asynchronous code a lot, but it does not behave the way I understand it should. So I have some questions about async and await in Dart. Here is an example:
Future<int> someFunction() async {
  int count = 0;
  for (int i=0; i< 1000000000;i ++) {
    count+= i;
  }
  print("done");
  return count;
}

Future<void> test2() async {
  print("begin");
  var a = await someFunction();
  print("end");
}

void _incrementCounter() {
  print("above");
  test2();
  print("below");
}
test2() will take a long time to finish, right? What I want is for test2() to keep working until it is done while everything else keeps running without waiting for it.
When I run _incrementCounter(), it shows this result:
above begin done below end
The problem is that it does not show "below" right away; it waits until someFunction() is done.
This is the result I want:
above begin below done end
This is the expected behavior since this change in Dart 2.0 which can be found in the changelog:
(Breaking) Functions marked async now run synchronously until the first await statement. Previously, they would return to the event loop once at the top of the function body before any code runs (issue 30345).
Before I give the solution, note that async code does not run in another thread, so the idea of:
keep his work running until done, everything will keep running and not
wait for test2()
is fine, but at some point your application is going to wait for test2() to finish: it is scheduled as a task on the job queue, and no other job can run until it is done. If you want no noticeable slowdown, either split the job into multiple smaller jobs or spawn an isolate (which runs in another thread) to do the calculation and return the result later.
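The "split the job into multiple smaller jobs" idea can be illustrated with a language-neutral model of a single-threaded job queue. The sketch is written in C++ rather than Dart, and every name in it (JobQueue, post, run, sum_in_slices) is made up for illustration; it does not reproduce Dart's actual event loop.

```cpp
#include <algorithm>
#include <cstdint>
#include <deque>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Toy single-threaded job queue; every name in this block is hypothetical.
struct JobQueue {
    std::deque<std::function<void()>> jobs;

    void post(std::function<void()> job) { jobs.push_back(std::move(job)); }

    // Runs jobs in FIFO order until none are left.
    void run() {
        while (!jobs.empty()) {
            std::function<void()> job = std::move(jobs.front());
            jobs.pop_front();
            job();
        }
    }
};

// Adds the integers in [next, limit) to total, at most `slice` of them per job
// (slice must be positive). If work remains, the rest is re-posted at the back
// of the queue, so jobs that were already waiting get to run first.
void sum_in_slices(JobQueue& queue, std::vector<std::string>& log,
                   std::int64_t next, std::int64_t limit,
                   std::int64_t slice, std::int64_t total) {
    const std::int64_t stop = std::min(next + slice, limit);
    for (; next < stop; ++next) {
        total += next;
    }
    if (next < limit) {
        queue.post([&queue, &log, next, limit, slice, total] {
            sum_in_slices(queue, log, next, limit, slice, total);
        });
    } else {
        log.push_back("done " + std::to_string(total));
    }
}
```

If a summing job for the range 0 to 10 with a slice of 5 is posted first and a second job that logs "below" is posted after it, run() processes the first slice, then the "below" job, then the remaining slice. The long computation no longer blocks everything that was queued behind it.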
Here is the solution to get your example to work:
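Future<int> someFunction() async {
  int count = 0;
  for (int i=0; i< 1000000000;i ++) {
    count+= i;
  }
  print("done");
  return count;
}

Future<void> test2() async {
  print("begin");
  var a = await Future.microtask(someFunction);
  print("end");
}

void _incrementCounter() {
  print("above");
  test2();
  print("below");
}

main() {
  _incrementCounter();
}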
By using the Future.microtask constructor we schedule someFunction() to run as a separate task. This way the await in test2() really does wait, because it is the first truly asynchronous step, and control returns to _incrementCounter() before someFunction() starts.

Loop unrolling in Metal kernels

I need to force the Metal compiler to unroll a loop in my kernel compute function. So far I've tried to put #pragma unroll(num_times) before a for loop, but the compiler ignores that statement.
It seems that the compiler doesn't unroll the loops automatically — I compared execution times for 1) a code with for loop 2) the same code but with hand-unrolled loop. The hand-unrolled version was 3 times faster.
E.g.: I want to go from this:
for (int i=0; i<3; i++) {
    do_stuff();
}
to this:
do_stuff();
do_stuff();
do_stuff();
Is there even something like loop unrolling in the Metal C++ language? If yes, how can I possibly let the compiler know I want to unroll a loop?
The Metal shading language is based on a subset of C++11, so you can try template metaprogramming to unroll loops. The following compiled in Metal, though I don't have time to test it properly:
template <unsigned N> struct unroll {
    template<class F>
    static void call(F f) {
        f();
        unroll<N-1>::call(f);
    }
};

template <> struct unroll<0u> {
    template<class F>
    static void call(F f) {}
};

kernel void test() {
    unroll<3>::call(do_stuff);
}
Please let me know if it works! You'll probably have to add some arguments to call to pass arguments to do_stuff.
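The remark about passing arguments could be handled by handing the iteration index to the callable. Below is a host-side C++ sketch of that variant; unroll_indexed is a hypothetical name, the code has not been tried with the Metal compiler, and whether the result is really straight-line code still depends on the compiler inlining the calls.

```cpp
// unroll_indexed is a hypothetical name used for this sketch only.
// unroll_indexed<N>::call(f) expands to f(0); f(1); ... f(N-1);
template <unsigned N>
struct unroll_indexed {
    template <class F>
    static void call(F&& f) {
        unroll_indexed<N - 1>::call(f);  // iterations 0 .. N-2 first
        f(N - 1);                        // then iteration N-1
    }
};

// Specialization that ends the recursion.
template <>
struct unroll_indexed<0u> {
    template <class F>
    static void call(F&&) {}
};
```

A call such as unroll_indexed<3>::call(body) invokes body(0), body(1) and body(2) in that order, so the body can use the index the same way the original loop used i.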
See also: Self-unrolling macro loop in C/C++

Task based programming : #pragma omp task versus #pragma omp parallel for

Considering :
void saxpy_worksharing(float* x, float* y, float a, int N) {
  #pragma omp parallel for
  for (int i = 0; i < N; i++) {
    y[i] = y[i]+a*x[i];
  }
}

void saxpy_tasks(float* x, float* y, float a, int N) {
  #pragma omp parallel
  {
    for (int i = 0; i < N; i++) {
      #pragma omp task
      {
        y[i] = y[i]+a*x[i];
      }
    }
}
What is the difference between using tasks and using the omp parallel for directive? Why can we write recursive algorithms such as merge sort with tasks, but not with worksharing?
I would suggest that you have a look at the OpenMP tutorial from Lawrence Livermore National Laboratory, available here.
Your particular example is one that should not be implemented using OpenMP tasks. Each task performs only a very simple computation, so the tasking overhead would be gigantic, as you can see in my answer to this question. Besides, the second code is conceptually wrong (apart from the missing }). Since there is no worksharing directive, all threads would execute all iterations of the loop, and instead of N tasks, N times the number of threads tasks would get created. It should be rewritten in one of the following ways:
Single task producer - common pattern, NUMA unfriendly:
void saxpy_tasks(float* x, float* y, float a, int N) {
   #pragma omp parallel
   {
      #pragma omp single
      for (int i = 0; i < N; i++) {
         #pragma omp task
         y[i] = y[i]+a*x[i];
      }
   }
}
The single directive would make the loop run inside a single thread only. All other threads would skip it and hit the implicit barrier at the end of the single construct. As barriers contain implicit task scheduling points, the waiting threads will start processing tasks immediately as they become available.
Parallel task producer - more NUMA friendly:
void saxpy_tasks(float* x, float* y, float a, int N) {
   #pragma omp parallel
   {
      #pragma omp for
      for (int i = 0; i < N; i++) {
         #pragma omp task
         y[i] = y[i]+a*x[i];
      }
   }
}
In this case the task creation loop would be shared among the threads.
If you do not know what NUMA is, ignore the comments about NUMA friendliness.
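If tasks are used for a loop like this anyway, one way to reduce the per-task overhead mentioned above is to give each task a block of iterations instead of a single element. The following is a minimal sketch of the single-producer variant under that assumption; saxpy_chunked_tasks and its chunk parameter are hypothetical and not part of the original answer.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical variant of saxpy_tasks: one task per block of `chunk` iterations.
void saxpy_chunked_tasks(const float* x, float* y, float a, int n, int chunk) {
    assert(chunk > 0);  // a non-positive chunk would never advance the loop
    #pragma omp parallel
    #pragma omp single
    for (int start = 0; start < n; start += chunk) {
        int end = std::min(start + chunk, n);
        // Each task receives its own copy of the block bounds.
        #pragma omp task firstprivate(start, end)
        for (int i = start; i < end; ++i) {
            y[i] += a * x[i];
        }
    }
}
```

A suitable chunk size has to be found by measurement. OpenMP 4.5 and later also provide a taskloop construct with a grainsize clause that expresses the same idea directly.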

CUDA cudaMemcpy: invalid argument

Here is my code:
struct S {
    int a, b;
    float c, d;
};

class A {
    S* d;
    S h[3];
public:
    A() {
        cutilSafeCall(cudaMalloc((void**)&d, sizeof(S)*3));
    }
    void Init();
};

void A::Init() {
    for (int i=0;i<3;i++) {
        h[i].a = 0;
        h[i].b = 1;
        h[i].c = 2;
        h[i].d = 3;
    }
    cutilSafeCall(cudaMemcpy(d, h, 3*sizeof(S), cudaMemcpyHostToDevice));
}

A a;
In fact this is part of a complex program that contains CUDA and OpenGL. When I debug it, it fails at cudaMemcpy with the error
cudaSafeCall() Runtime API error 11: invalid argument.
This program was adapted from another one that runs correctly. In that one, though, the two variables S* d and S h[3] lived in the main function instead of in the class. What is even stranger is that when I implement this class A in a small program, it works fine.
I have also updated my driver, but the error still exists.
Could anyone give me a hint on why this happens and how to solve it? Thanks.
Because the memory operations in CUDA are blocking, they act as synchronization points. So errors from earlier operations, if not checked with cudaThreadSynchronize (deprecated in later CUDA releases in favour of cudaDeviceSynchronize), will look like errors from the memory calls.
So if you get an error on a memory operation, place a cudaThreadSynchronize before it and check the result.
Make sure that the first cudaMalloc statement is actually being executed. If this were a CUDA initialization problem, as @Harrism suggests, wouldn't it already fail at that statement? Add printf statements and check that the proper initializations are performed. I think invalid argument errors are generally caused by using uninitialized memory areas.
Add a printf to your constructor that shows the address of the cudaMalloc'ed memory area:
d = NULL;
cutilSafeCall(cudaMalloc((void**)&d, sizeof(S)*3));
printf("D: %p\n", d);
Try the memory copy on an area that is allocated locally, i.e. move the cudaMalloc to just above the cudaMemcpy (just for testing).
void A::Init()
{
    for (int i=0;i<3;i++)
    {
        h[i].a = 0;
        h[i].b = 1;
        h[i].c = 2;
        h[i].d = 3;
    }
    cutilSafeCall(cudaMalloc((void**)&d, sizeof(S)*3)); // here!..
    cutilSafeCall(cudaMemcpy(d, h, 3*sizeof(S), cudaMemcpyHostToDevice));
}
Good luck.
