OpenMP scheduler - task

I have a problem with distributing tasks in OpenMP.
I have the following code:
#include <stdio.h>
#include <unistd.h>

int cnttotal = 0;
int cnt1 = 0, cnt2 = 0;

int main()
{
    int i;

    #pragma omp parallel
    #pragma omp single nowait
    for (i = 0; i < 60; i++) {
        if (cnttotal < 1) {
            cnttotal++;
            #pragma omp task
            {
                #pragma omp atomic
                cnt1++;
                usleep(10);
                cnttotal--;
            }
        } else {
            #pragma omp task
            {
                #pragma omp atomic
                cnt2++;
                sleep(1);
            }
        }
    }
    printf("cnt1 = %d; cnt2 = %d\n", cnt1, cnt2);
    return 0;
}
Whatever I do, cnt1 = 1 and cnt2 = 59. I think the problem is in the OpenMP scheduler.
Or is there something I don't catch?

My feeling is that you are confusing task instantiation with the actual execution of a task. The #pragma omp task construct refers to the instantiation of a task, and that is extremely fast. It is a separate matter that an idle thread of the OpenMP runtime then looks in a list for ready tasks and executes them.
Going into the problem you posted: in this code, a running thread (say T1) enters the first iteration (i = 0), so it takes the first branch of the if, sets cnttotal to 1 and instantiates the first task (the one incrementing cnt1). After that instantiation, T1 keeps instantiating the remaining tasks while an idle thread (say T2) executes the cnt1 task, which takes approximately 10 µs and sets cnttotal back to 0.
So, in brief, the thread that instantiates the tasks gets through the whole loop faster than the ~10 µs spent inside the cnt1 task, so cnttotal rarely drops back to 0 in time for another iteration to take the first branch.
For instance, on my Intel(R) Core(TM) i7-2760QM CPU @ 2.40GHz, if I change the code so that the loop runs until i = 500 and the task sleeps for 1 µs (usleep(1)), I get:
cnt1 = 2; cnt2 = 498
which shows that instantiation of tasks is extremely fast.
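To convince yourself of this, you can time the instantiation of the tasks separately from their completion. Here is a minimal sketch of my own (not from the original post), using the standard omp_get_wtime() routine; the empty task body isolates the pure instantiation cost:

#include <omp.h>
#include <stdio.h>

int main(void)
{
    double t0, t1, t2;

    #pragma omp parallel
    #pragma omp single
    {
        int i;
        t0 = omp_get_wtime();
        for (i = 0; i < 500; i++) {
            #pragma omp task
            { /* empty task: measures pure instantiation cost */ }
        }
        t1 = omp_get_wtime(); // all tasks instantiated by this point
        #pragma omp taskwait  // block until every task has executed
        t2 = omp_get_wtime();
        printf("instantiation: %g s, execution: %g s\n", t1 - t0, t2 - t1);
    }
    return 0;
}

Compiled with -fopenmp, the instantiation loop typically completes in microseconds, which is why the instantiating thread races ahead of the 10 µs task in the question.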

Related

What is the overhead of running an empty block on a queue

I realized that I was queuing a lot of blocks that call empty methods. In the debugger it looks like a lot is happening, when really all the blocks are empty.
Is there any real performance impact from having empty blocks?
The overhead should be negligible. You may check this with Instruments and a simple program like:
#import <Foundation/Foundation.h>

int main(int argc, const char * argv[]) {
    @autoreleasepool {
        dispatch_queue_t q = dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_HIGH, 0);
        void (^b)(void) = ^{ };
        double d = 2.0;
        for (int i = 0; i < 10000000; ++i) {
            dispatch_sync(q, b);
            d = d * 1.5 - 1.0;
        }
        NSLog(@"d = %.3f", d);
    }
    return 0;
}
As you can see in the Instruments stack trace, the calls require 40 ms for 10 million synchronous invocations of an empty block. That's not much overhead.
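If you want a rough number without Instruments, the same measurement can be self-timed. A plain-C sketch of the idea (my adaptation, assuming clang on macOS, where blocks and clock_gettime are available):

#include <dispatch/dispatch.h>
#include <stdio.h>
#include <time.h>

int main(void)
{
    dispatch_queue_t q = dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_HIGH, 0);
    dispatch_block_t b = ^{ }; // empty block
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < 10000000; ++i) {
        dispatch_sync(q, b); // synchronous invocation of the empty block
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ms = (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_nsec - t0.tv_nsec) / 1e6;
    printf("10 million empty dispatch_sync calls: %.1f ms\n", ms);
    return 0;
}

Dividing the total by the iteration count gives the per-call overhead, which should come out in the nanosecond range, consistent with the 40 ms figure above.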

C Application with Pthread crashes

I have a problem with the pthread library in a C application for Linux.
In my application a thread is started over and over again,
but I always wait until the thread has finished before starting it again.
At some point the thread doesn't start anymore and I get an out-of-memory error.
The solution I found is to call pthread_join after the thread has finished.
Can anyone tell me why the thread doesn't end correctly?
Here is example code that causes the same problem.
If pthread_join isn't called, the process stops after about 380 calls of the thread:
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <pthread.h>
#include <unistd.h>
#include <time.h> // for time()

volatile uint8_t check_p1 = 0;
uint32_t stack_start;

void *thread1(void *ch)
{
    static int counter = 0;
    int i;
    int s[100000];
    char stack_end;

    srand(time(NULL) + counter);
    for (i = 0; i < (sizeof(s)/sizeof(int)); i++) // do something
    {
        s[i] = rand();
    }
    counter++;
    printf("Thread %i finished. Stacksize: %u\n", counter,
           ((uint32_t)stack_start - (uint32_t)&stack_end));
    check_p1 = 1; // mark thread as finished
    return 0;
}

int main(int argc, char *argv[])
{
    pthread_t p1;
    int counter = 0;

    stack_start = (uint32_t)&counter; // save the address of counter
    while (1)
    {
        counter++;
        check_p1 = 0;
        printf("Start Thread %i\n", counter);
        pthread_create(&p1, NULL, thread1, 0);
        while (!check_p1) // wait until thread has finished
        {
            usleep(100);
        }
        usleep(1000); // wait a little bit to be really sure that the thread is finished
        //pthread_join(p1, 0); // crash without pthread_join
    }
    return 0;
}
The solution I found is to do a pthread_join after the thread has finished.
That is the correct solution. You must do that, or you leak thread resources.
Can anyone tell me why the Thread doesn't end correctly?
It does end correctly, but you must join it in order for the thread library to know: "yes, he is really done with this thread; no need to hold resources any longer".
This is exactly the same reason you must use wait (or waitpid, etc.) in this loop:
while (1) {
    int status;
    pid_t p = fork();
    if (p == 0) exit(0); // child
    // parent
    wait(&status); // without this wait, you will run out of OS resources
}
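For completeness: if you never need the thread's exit status, the alternative to joining is to create the thread detached, so the system reclaims its resources as soon as it terminates. A sketch of how the pthread_create call in the question could be changed (thread1 as defined above):

pthread_attr_t attr;
pthread_attr_init(&attr);
pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
pthread_create(&p1, &attr, thread1, 0); // no pthread_join needed afterwards
pthread_attr_destroy(&attr);

Note that a detached thread cannot be joined, so the busy-wait on check_p1 would then be the only way the code above learns that the thread has finished.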

Why is RB interrupt routine running twice?

I have some code below that has a slight bug that I don't know how to fix. Essentially, my high-priority ISR runs twice after the flag is set. It runs exactly twice, and this behavior is consistent. The routine should run only once, because the flag is set when the input on RB changes, and yet it runs twice after a single change to RB's input. The testing was conducted in MPLAB v8.6 using the workbook feature.
#include <p18f4550.h>
#include <stdio.h>

void init(void)
{
    RCONbits.IPEN = 1;    // allows priority
    INTCONbits.GIE = 1;   // allows interrupts
    INTCONbits.PEIE = 1;  // allows peripheral interrupts
    INTCONbits.RBIF = 0;  // sets flag to not on
    INTCONbits.RBIE = 1;  // enables RB interrupts
    INTCON2bits.RBPU = 1; // enable pull-up resistors
    INTCON2bits.RBIP = 1; // RB interrupt is high priority
    PORTB = 0x00;
    TRISBbits.RB7 = 1;    // enable RB7 as an input so it can trigger interrupts when it changes
}

#pragma code
#pragma interrupt high_isr
void high_isr(void)
{
    if (INTCONbits.RBIF == 1)
    {
        INTCONbits.RBIF = 0;
        // stuff
    }
}

#pragma code
#pragma code high_isr_entry = 0x08
void high_isr_entry(void)
{
    _asm goto high_isr _endasm
}

void main(void)
{
    init();
    while (1);
}
The RB7 interrupt flag is set based on a comparison of the last latched value with the current state of the pin. The datasheet says: "The pins are compared with the old value latched on the last read of PORTB. The 'mismatch' outputs of RB7:RB4 are ORed together to generate the RB Port Change Interrupt with Flag bit."
On clearing the mismatch condition, the datasheet goes on to say: "Any read or write of PORTB (except with the MOVFF (ANY), PORTB instruction). This will end the mismatch condition."
You then wait one Tcy (execute a nop instruction) and then clear the flag. This is documented on page 116 of the datasheet.
So the simplest fix in this scenario is to declare a dummy variable in the interrupt routine and read RB7 into it, like this:
#pragma interrupt high_isr
void high_isr(void)
{
    unsigned short dummy;

    if (INTCONbits.RBIF == 1)
    {
        dummy = PORTBbits.RB7; // perform read before clearing flag
        Nop();
        INTCONbits.RBIF = 0;
        // rest of your routine here
        // ...
    }
}

pthread_kill ends calling program

I am working on Ubuntu 12.04.2 LTS. I have a strange problem with pthread_kill(): the following program ends after writing only "Create thread 0!" to standard output, with exit status 138.
If I uncomment the usleep(1000);, everything executes properly. Why would this happen?
#include <nslib.h>

void *testthread(void *arg);

int main() {
    pthread_t tid[10];
    int i;

    for (i = 0; i < 10; ++i) {
        printf("Create thread %d!\n", i);
        Pthread_create(&tid[i], testthread, NULL);
        //usleep(1000);
        Pthread_kill(tid[i], SIGUSR1);
        printf("Joining thread %d!\n", i);
        Pthread_join(tid[i]);
        printf("Joined %d!", i);
    }
    return 0;
}

void sighandlertest(int sig) {
    printf("print\n");
    pthread_exit(NULL);
    //return NULL;
}

void *testthread(void *arg) {
    struct sigaction saction;

    memset(&saction, 0, sizeof(struct sigaction));
    saction.sa_handler = &sighandlertest;
    if (sigaction(SIGUSR1, &saction, NULL) != 0) {
        fprintf(stderr, "Sigaction failed!\n");
    }
    printf("Starting while...\n");
    while (true) {
    }
    return 0;
}
If the main thread does not sleep a bit before raising SIGUSR1, the signal handler in the newly created thread most probably has not been set up yet, so the default action for the signal applies, which is terminating the process.
Using sleep()s to synchronise threads is not recommended, as it is not guaranteed to be reliable. Use other mechanics here; a condition/mutex pair would be suitable.
Declare a global state variable int signalhandlersetup = 0 and protect access to it with a mutex. Create the thread, make the main thread wait using pthread_cond_wait(), let the created thread set up the signal handler for SIGUSR1, set signalhandlersetup = 1, and then signal the condition the main thread is waiting on using pthread_cond_signal(). Finally, let the main thread call pthread_kill() as in your posting. A sketch of this handshake follows.
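A minimal sketch of that handshake (names like mtx, cond and signalhandlersetup are illustrative, and error checking is omitted):

#include <pthread.h>

pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
int signalhandlersetup = 0;

/* in the created thread, right after sigaction() succeeds: */
pthread_mutex_lock(&mtx);
signalhandlersetup = 1;     /* handler is now installed */
pthread_cond_signal(&cond);
pthread_mutex_unlock(&mtx);

/* in main, between Pthread_create() and Pthread_kill(): */
pthread_mutex_lock(&mtx);
while (!signalhandlersetup) /* loop guards against spurious wakeups */
    pthread_cond_wait(&cond, &mtx);
pthread_mutex_unlock(&mtx);
/* now it is safe to send SIGUSR1 */

The while loop around pthread_cond_wait() is important, since condition variables may wake up spuriously.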

Multi-threads slower than single threads?

I am trying to find out how to speed up a certain part of my code. I've got three float variables: var1, var2, var3.
In sequential mode ...
double start1, end1, t1;

start1 = (double)cvGetTickCount();
var1 = tester->predict(videocapture, params1, image);
var2 = tester->predict(videocapture, params2, image);
var3 = tester->predict(videocapture, params3, image);
end1 = (double)cvGetTickCount();
t1 = (end1 - start1) / ((double)cvGetTickFrequency() * 1000.);
printf("Time1 = %g ms\n", t1);
... it seems to be slightly faster than with parallel threads ...
double start2, end2, t2;

start2 = (double)cvGetTickCount();
omp_set_dynamic(0);     // explicitly enable/disable dynamic teams
omp_set_num_threads(3); // use 3 threads for all consecutive parallel regions
#pragma omp parallel num_threads(3)
{
    #pragma omp sections //nowait
    {
        #pragma omp section
        {
            #pragma omp critical
            {
                var1 = tester->predict(videocapture, params1, image);
            }
        }
        #pragma omp section
        {
            #pragma omp critical
            {
                var2 = tester->predict(videocapture, params2, image);
            }
        }
        #pragma omp section
        {
            #pragma omp critical
            {
                var3 = tester->predict(videocapture, params3, image);
            }
        }
    }
}
end2 = (double)cvGetTickCount();
t2 = (end2 - start2) / ((double)cvGetTickFrequency() * 1000.);
printf("Time2 = %g ms\n", t2);
Can someone please help me speed up the computation of these three variables and tell me what I am doing wrong?
Look at the specification of the critical pragma here.
There are three crucial points here:
1) If you don't give the critical pragma a name, it's mapped to some unspecified name, and the same name is used for all unnamed critical sections.
2) Only one thread at a time can be in any critical section with a given name.
3) When a thread encounters a critical section, it waits there until no other threads are active in it before moving into it.
Thus your code tells all your threads to wait while they take turns executing your code one item at a time. This is equivalent to doing the operations serially, except that it adds the overhead of creating the threads in the first place (plus whatever overhead comes from the waiting, which might be busy-waiting).
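Dropping the critical constructs lets the three sections actually run concurrently. A sketch of the corrected region (assuming predict() is safe to call from several threads at once, and that <omp.h> is included for the omp_* calls):

omp_set_dynamic(0);     // explicitly disable dynamic teams
omp_set_num_threads(3); // one thread per section
#pragma omp parallel sections
{
    #pragma omp section
    var1 = tester->predict(videocapture, params1, image);
    #pragma omp section
    var2 = tester->predict(videocapture, params2, image);
    #pragma omp section
    var3 = tester->predict(videocapture, params3, image);
}

No synchronization is needed here because each variable is written by exactly one section. Even so, the speedup is limited by the slowest of the three predict() calls plus the cost of creating the thread team.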
