pthread: one printf statement gets printed twice in child thread

this is my first pthread program, and I have no idea why the printf statement gets printed twice in the child thread:
#include <pthread.h>
#include <stdio.h>

int x = 1;

void *func(void *p)
{
    x = x + 1;
    printf("tid %ld: x is %d\n", pthread_self(), x);
    return NULL;
}

int main(void)
{
    pthread_t tid;
    pthread_create(&tid, NULL, func, NULL);
    printf("main thread: %ld\n", pthread_self());
    func(NULL);
}
Observed output on my platform (Linux 3.2.0-32-generic #51-Ubuntu SMP x86_64 GNU/Linux):
1.
main thread: 140144423188224
tid 140144423188224: x is 2
2.
main thread: 140144423188224
tid 140144423188224: x is 3
3.
main thread: 139716926285568
tid 139716926285568: x is 2
tid 139716918028032: x is 3
tid 139716918028032: x is 3
4.
main thread: 139923881056000
tid 139923881056000: x is 3
tid 139923872798464tid 139923872798464: x is 2
For run 3, there are two output lines from the child thread.
For run 4, the same as run 3, and the outputs are even interleaved.

Threading generally occurs by time-division multiplexing. It is generally inefficient for the processor to switch evenly between two threads, as this requires more effort and more context switching. Typically what you'll find is that a thread will execute several times before switching (as is the case with examples 3 and 4). The child thread executes more than once before it is finally terminated (because the main thread exited).
Example 2: I don't know why x is increased by the child thread while there is no output.
Consider this: the main thread executes, calls pthread_create, and a new child thread is created. The new child thread increments x. Before the child thread is able to complete its printf statement, the main thread kicks in and also increments x. The main thread is, however, able to run its printf statement to completion; suddenly x is now equal to 3.
The main thread now terminates (also causing the child thread to exit).
This is likely what happened in your case for example 2.
Example 3 clearly shows that access to the variable x is unsynchronized: the two threads race on it, so the values you see depend on the interleaving.
Also, what you'll find is that because you are using the global variable x, access to this variable is shared amongst the threads. This is bad, VERY bad, as threads accessing the same variable create race conditions, and data can be corrupted by multiple unsynchronized reads and writes to the same memory location.
It is for this reason that mutexes are used which essentially create a lock whilst variables are being updated to prevent multiple threads attempting to modify the same variable at the same time.
Mutex locks will ensure that x is updated sequentially and not sporadically as in your case.
See this link for more about Pthreads in general and mutex locking examples.
Pthreads and Mutex variables
Cheers,
Peter

Hmm, your example uses the same "resources" from different threads. One resource is the variable x; the other is the stdout file. So you should use mutexes, as shown below. Also, a pthread_join at the end waits for the other thread to finish its job. (It is usually also a good idea to check the return codes of all these pthread_* calls.)
#include <pthread.h>
#include <stdio.h>

int x = 1;
pthread_mutex_t mutex;

void *func(void *p)
{
    pthread_mutex_lock(&mutex);
    x = x + 1;
    printf("tid %ld: x is %d\n", pthread_self(), x);
    pthread_mutex_unlock(&mutex);
    return NULL;
}

int main(void)
{
    pthread_mutex_init(&mutex, NULL);
    pthread_t tid;
    pthread_create(&tid, NULL, func, NULL);
    pthread_mutex_lock(&mutex);
    printf("main thread: %ld\n", pthread_self());
    pthread_mutex_unlock(&mutex);
    func(NULL);
    pthread_join(tid, NULL);
}

It looks like the real answer is Michael Burr's comment which references this glibc bug: https://sourceware.org/bugzilla/show_bug.cgi?id=14697
In summary, glibc does not handle the stdio buffers correctly during program exit.


Thread index as a memory location index in CUDA

By definition, a thread is a path of execution within a process.
But during the implementation of a kernel, a thread ID or global index is generated to access an allocated memory location. For instance, in the matrix multiplication code below, ROW and COL are generated to access matrices A and B.
My doubt here is: the index generated isn't pointing to a thread (by definition); instead, it is used to access the location of the data in memory. So why do we refer to it as the thread index or global thread index, and not the memory index or something else?
__global__ void matrixMultiplicationKernel(float* A, float* B, float* C, int N) {
    int ROW = blockIdx.y * blockDim.y + threadIdx.y;
    int COL = blockIdx.x * blockDim.x + threadIdx.x;
    float tmpSum = 0;
    if (ROW < N && COL < N) {
        // each thread computes one element of the block sub-matrix
        for (int i = 0; i < N; i++) {
            tmpSum += A[ROW * N + i] * B[i * N + COL];
        }
        C[ROW * N + COL] = tmpSum;  // store kept inside the bounds check
    }
}
This question seems to be mostly about semantics, so let's start with Wikipedia:
.... a thread of execution is the smallest sequence of programmed
instructions that can be managed independently by a scheduler ....
That pretty much describes exactly what a thread in CUDA is -- the kernel is the sequence of instructions, and the scheduler is the warp/thread scheduler in each streaming multiprocessor on the GPU.
The code in your question is calculating the unique ID of the thread in the kernel launch, as it is abstracted in the CUDA programming/execution model. It has no intrinsic relationship to memory layouts, only to the unique ID within the kernel launch. The fact that it is being used to ensure that each parallel operation is performed on a different memory location is a programming technique and nothing more.
Thread ID seems like a logical moniker to me, but to paraphrase Miles Davis when he was asked what the name of the jam his band just played at the Isle of Wight festival in 1970: "call it whatever you want".

passing file descriptors from the main process to its threads

I have a simple question regarding passing file descriptors from a process to its threads. I'm almost sure, but need confirmation: are file descriptors treated as normal integers, and can they thus be passed, for example through an array of integers, to a thread via the pthread_create() thread argument? Thanks.
The rough definition of the term "process" could be "a memory space with at least one thread". In other words, all threads within the same process share a memory space.
Now, file descriptors are basically indices that reference objects within a table that belongs to the process. Since the objects belong to the process, and the threads operate inside the process, the threads can refer to these objects via their index ("file descriptor").
Yes, file descriptors are just integers and so can be passed as function arguments like any other variable. They will still refer to the same files, because the open files are shared by all the threads in a process.
#include <fcntl.h>
#include <pthread.h>
#include <stdlib.h>

struct files {
    int count;
    int* descriptors;
};

void* worker(void* p)
{
    struct files *f = (struct files*)p;
    // ...
    return NULL;
}

int main(void)
{
    struct files f;
    f.count = 4;
    f.descriptors = malloc(sizeof(int) * f.count);
    f.descriptors[0] = open("...", O_RDONLY);
    // ...
    pthread_t t;
    pthread_create(&t, NULL, worker, &f);
    // ...
    pthread_join(t, NULL);
}

MPI processes causes pthreads to execute sequentially

I wrote an MPI/pthread hybrid code and I execute it on a cluster. Specifically, I compile it using mpicc -lpthread and launch 2 MPI processes on different nodes (6 nodes total, with 8 cores per node) using mpirun -np 2 -bynode, and then create 8 threads on each node. However, the threads do not execute in parallel; they follow a sequential execution. Why?
Parts of my code:
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
MPI_Comm_rank(MPI_COMM_WORLD, &taskid);
MPI_Get_processor_name(hostname, &len);
// more code between these..
pthread_create(&thread_id, NULL, &sort, (void*) arrays[0]);
pthread_create(&thread_id2, NULL, &sort, (void*) arrays[1]);
Finally solved the problem: I just used the latest version of mpicc and it's all okay now.

CUDA kernels and memory access (one kernel doesn't execute entirely and the next doesn't get launched)

I'm having trouble here. I launch two kernels and check if some value is the one expected (with a memcpy to the host); if it is, I stop, and if it isn't, I launch the two kernels again.
The first kernel:
__global__ void aco_step(const KPDeviceData* data)
{
    int obj = threadIdx.x;
    int ant = blockIdx.x;
    int id = threadIdx.x + blockIdx.x * blockDim.x;

    *(data->added) = 1;

    while(*(data->added) == 1)
    {
        *(data->added) = 0;

        // check if obj fits
        int fits = (data->obj_weights[obj] + data->weight[ant] <= data->max_weight);
        fits = fits * !(getElement(data->selections, data->selections_pitch, ant, obj));

        if(obj == 0)
            printf("ant %d going..\n", ant);

        __syncthreads();
        ...
The code goes on after this. But that printf never gets printed; the __syncthreads is there just for debugging purposes.
The "added" variable was shared, but since shared memory is a PITA and usually throws bugs in the code, I just removed it for now. This "added" variable isn't the smartest thing to do, but it's faster than the alternative, which is checking on the host whether any variable within an array has some value and deciding whether to keep iterating.
The getElement function simply does the matrix memory calculation with the pitch to access the right position and returns the element there:
int* el = (int*)((char*)mat + row * pitch) + col;
return *el;
The obj_weights array has the right size, n*sizeof(int). So does the weight array, ants*sizeof(float). So they aren't out of bounds.
The kernel after this one has a printf right at the beginning, and it doesn't get printed either. After the printf it sets a variable in device memory, and this memory is copied to the CPU after the kernel finishes, and it isn't the right value when I print it in the CPU code. So I think this kernel is doing something illegal and the second one doesn't even get launched.
I'm testing some instances: when I launch 8 blocks and 512 threads, it runs OK. 32 blocks and 512 threads, also OK. But with 8 blocks and 1024 threads this happens and the kernel doesn't work, and neither does 32 blocks and 1024 threads.
Am I doing something wrong? Memory access? Am I launching too many threads?
edit: tried removing the "added" variable and the while loop, so it should execute just once. It still doesn't work; nothing gets printed, even when the printf is right after the three initial lines, and the next kernel also doesn't print anything.
edit: another thing: I'm using a GTX 570, so the "Maximum number of threads per block" is 1024 according to http://en.wikipedia.org/wiki/CUDA. Maybe I'll just stick with a maximum of 512, or check how high I can set this value.
__syncthreads() inside conditional code is only allowed if the condition evaluates identically on all threads of a block.
In your case the condition suffers a race condition and is nondeterministic, so it most probably evaluates to different results for different threads.
printf() output is only displayed after the kernel finishes successfully. In this case it doesn't, due to the problem mentioned above, so the output never shows up. You could have figured this out by testing the return codes of all CUDA function calls for errors.

Posix / Thread with join

I am reading a book, which gives the following code:
#include <pthread.h>
#include <stdio.h>

void *printme(void *id) {
    int *i;
    i = (int *)id;
    printf("Hi. I'm thread %d\n", *i);
    return NULL;
}

void main() {
    int i, vals[4];
    pthread_t tids[4];
    void *retval;

    for (i = 0; i < 4; i++) {
        vals[i] = i;
        pthread_create(tids + i, NULL, printme, vals + i);
    }
    for (i = 0; i < 4; i++) {
        printf("Trying to join with tid%d\n", i);
        pthread_join(tids[i], &retval);
        printf("Joined with tid%d\n", i);
    }
}
and the next possible output:
Trying to join with tid0
Hi. I'm thread 0
Hi. I'm thread 1
Hi. I'm thread 2
Hi. I'm thread 3
Joined with tid0
Trying to join with tid1
Joined with tid1
Trying to join with tid2
Joined with tid2
Trying to join with tid3
Joined with tid3
And I don't understand how this is possible. We start with the main thread and create 4 threads: tids[0] ... tids[3]. Then we suspend execution (with the join calls): the main thread would wait for tids[0] to stop executing, tids[0] would wait for tids[1], and so on.
So I'd expect the output to be:
Hi. I'm thread 0
Hi. I'm thread 1
Hi. I'm thread 2
Hi. I'm thread 3
Trying to join with tid0
Trying to join with tid1
Joined with tid0
Trying to join with tid2
Joined with tid1
Trying to join with tid3
Joined with tid2
Joined with tid3
I feel that I don't understand something really basic. Thanks.
I think what you're missing is that pthread_create is very different from fork. The created thread starts at the supplied function (printme, in this case) and exits as soon as that function returns. Hence, none of the newly created threads ever reaches the second for loop.
When you create a new thread with pthread_create, both thread #1 and main run in parallel. Main goes on to the next instruction, which is pthread_join, and hangs until thread #1 finishes. This is why you get Trying to join with tid0 first, then Hi. I'm thread 0.
Please also notice that the main thread will join the child threads in the specified order. This means that when you have thread #1, thread #2 and thread #3, and thread #1 takes 10 seconds to execute, thread #2 takes 6 seconds and thread #3 takes 7 seconds, then the first join will take place after 10 seconds, and then within a few milliseconds you should see the remaining joins, since all the other threads have already finished their jobs.
