pthread_exit return variable static vs global scope - pthreads

I am seeing different behavior when the variable used to receive the return value from pthread_join is declared with global scope versus static scope. I have included a code snippet here.
Static variables
int main()
{
static int r1,r2;
pthread_t t1, t2;
int i1[] = {1,2};
int i2[] = {3,4};
r1 = pthread_create( &t1, NULL, myfn, (void*)i1);
r2 = pthread_create( &t2, NULL, myfn, (void*)i2);
pthread_join( t1, (void *)&r1 );
pthread_join( t2, (void *)&r2 );
printf("Thread 1 returns: %d\n",r1);
printf("Thread 2 returns: %d\n",r2);
return 0;
}
void *myfn( void *intarray )
{
pthread_t t=pthread_self();
int *g = (int *) intarray;
int i=0;
int d=1;
for (i=g[0];i<=g[1];i++)
d*=i;
fprintf(stderr, "TID=%u %d\n",t, d);
pthread_exit((void *)d);
}
Return value
TID=3425117952 12
TID=3433510656 2
Thread 1 returns: 2
Thread 2 returns: 12
Global variables
int r1,r2;
int main()
{
same as above
}
void *myfn( void *intarray )
{
same as above
}
Return value
TID=3425117952 12
TID=3433510656 2
Thread 1 returns: 0 <<<<< it returns 0
Thread 2 returns: 12
Could someone please explain why it behaves differently ?

Almost certainly it's because the sizes of int and void * differ on your platform, so when pthread_join() writes a void * value through the int * pointer you gave it, it overwrites adjacent memory.
The different declaration of r1 and r2 changes the layout of the variables enough to change the effect you see.
Casting an int to void * in order to return it is messy; you're better off either allocating space for a result in the main thread and passing that to the thread when it starts, or have the thread allocate the result and return a pointer to it when it finishes.
However, if you insist on the cast-to-void * method, you can fix it by passing the address of an actual void * object to pthread_join and then converting that value to int:
int main()
{
static int r1,r2;
void *result;
pthread_t t1, t2;
int i1[] = {1,2};
int i2[] = {3,4};
r1 = pthread_create( &t1, NULL, myfn, (void*)i1);
r2 = pthread_create( &t2, NULL, myfn, (void*)i2);
pthread_join( t1, &result );
r1 = (int)result;
pthread_join( t2, &result );
r2 = (int)result;
printf("Thread 1 returns: %d\n",r1);
printf("Thread 2 returns: %d\n",r2);
return 0;
}

Related

BinaryOperator doesn't work when comes to a=function(b,c)?

I want to identify expressions like int a = function(b,c), so I wrote the following code:
void foo(int* a, int *b) {
int x;
int m;
int z;
int *p;
if (a[0] > 1) {
b[0] = 2;
z=10;
x = function( sizeof(char));
}
m = function( sizeof(char));
bar(x,m);
}
void bar(float x, float y);
int function(int size){
return size;
}
Then I used clang -Xclang -ast-dump -fsyntax-only cfunc_with_if.c to get the AST of the code.
From the result I found that the AST node type of int a = function(b,c) is BinaryOperator. To verify this, I used VisitStmt(Stmt *s) to print out the type of every statement.
bool VisitStmt(Stmt *s) {
if(isa<Stmt>(s)) {
Stmt *Statement = dyn_cast<Stmt>(s);
//Statement->dump();
std::string st(Statement->getStmtClassName());
st = st + "\n";
TheRewriter.InsertText(Statement->getLocStart(), st, true, true);
}
return true;
}
But the result is weird: nothing is printed about the type of int a = function(b,c), and I'm confused by that. Is there an error in my code, or is something else going on?
There's no output at bar(x,m); either. Are there any errors when the tool compiles the code being analyzed? As written above, the code would fail to compile at x = function( sizeof(char)); since function has not been declared. Even when compilation has failed due to errors, LibTooling-based tools can still run at least partially, with strange results.
Edit to add: what happens if you run the tool on this code?
void bar(float x, float y);
int function(int size);
void foo(int* a, int *b) {
int x;
int m;
int z;
int *p;
if (a[0] > 1) {
b[0] = 2;
z=10;
x = function( sizeof(char));
}
m = function( sizeof(char));
bar(x,m);
}
void bar(float x, float y);
int function(int size){
return size;
}

histogram kernel memory issue

I am trying to implement an algorithm to compute histograms of images with more than 256 bins.
The main difficulty in that case is that it is impossible to allocate more than 32 KB as a local array on the GPU.
All the algorithms I found for 8-bits-per-pixel images use a fixed-size local array.
The histogram is first computed in that local array, then a barrier is applied, and finally the local counts are added to the output vector.
I am working with IR images, whose dynamic range spans more than 32K bins.
So I cannot allocate a fixed-size array in the GPU's local memory.
My algorithm uses an atomic add to build the output histogram directly.
I am interfacing with OpenCV, so to handle possible saturation my bins use floating point, in single or double precision depending on what the GPU supports.
OpenCV doesn't support unsigned int, long, or unsigned long as matrix element types.
I get an error, and I think it is some kind of segmentation fault.
After several days I still have no idea what is wrong.
Here is my code:
histogram.cl :
#pragma OPENCL EXTENSION cl_khr_fp64: enable
#pragma OPENCL EXTENSION cl_khr_int64_base_atomics: enable
static void Atomic_Add_f64(__global double *val, double delta)
{
union {
double f;
ulong i;
} old;
union {
double f;
ulong i;
} new;
do {
old.f = *val;
new.f = old.f + delta;
}
while (atom_cmpxchg ( (volatile __global ulong *)val, old.i, new.i) != old.i);
}
static void Atomic_Add_f32(__global float *val, double delta)
{
union
{
float f;
uint i;
} old;
union
{
float f;
uint i;
} new;
do
{
old.f = *val;
new.f = old.f + delta;
}
while (atom_cmpxchg ( (volatile __global ulong *)val, old.i, new.i) != old.i);
}
__kernel void khist(
__global const uchar* _src,
const int src_steps,
const int src_offset,
const int rows,
const int cols,
__global uchar* _dst,
const int dst_steps,
const int dst_offset)
{
const int gid = get_global_id(0);
// printf("This message has been printed from the OpenCL kernel %d \n",gid);
if(gid < rows)
{
__global const _Sty* src = (__global const _Sty*)_src;
__global _Dty* dst = (__global _Dty*) _dst;
const int src_step1 = src_steps/sizeof(_Sty);
const int dst_step1 = dst_steps/sizeof(_Dty);
src += mad24(gid,src_step1,src_offset);
dst += mad24(gid,dst_step1,dst_offset);
_Dty one = (_Dty)1;
for(int c=0;c<cols;c++)
{
const _Rty idx = (_Rty)(*(src+c+src_offset));
ATOMIC_FUN(dst+idx+dst_offset,one);
}
}
}
The function Atomic_Add_f64 comes directly from here and there.
main.cpp
#include <opencv2/core.hpp>
#include <opencv2/core/ocl.hpp>
#include <algorithm>
#include <chrono>
#include <cstdlib>
#include <ctime>
#include <fstream>
#include <iostream>
#include <iterator>
#include <sstream>
int main()
{
cv::Mat_<unsigned short> a(480,640);
cv::RNG rng(std::time(nullptr));
std::for_each(a.begin(),a.end(),[&](unsigned short& v){ v = rng.uniform(0,100);});
bool ret = false;
cv::String file_content;
{
std::ifstream file_stream("../test/histogram.cl");
std::ostringstream file_buf;
file_buf<<file_stream.rdbuf();
file_content = file_buf.str();
}
int output_flag = cv::ocl::Device::getDefault().doubleFPConfig() == 0 ? CV_32F : CV_64F;
cv::String atomic_fun = output_flag == CV_32F ? "Atomic_Add_f32" : "Atomic_Add_f64";
cv::ocl::ProgramSource source(file_content);
// std::cout<<source.source()<<std::endl;
cv::ocl::Kernel k;
cv::UMat src;
cv::UMat dst = cv::UMat::zeros(1,65536,output_flag);
a.copyTo(src);
atomic_fun = cv::format("-D _Sty=%s -D _Rty=%s -D _Dty=%s -D ATOMIC_FUN=%s",
cv::ocl::typeToStr(src.depth()),
cv::ocl::typeToStr(src.depth()), // this handles cases like a matrix of unsigned short stored as a matrix of float.
cv::ocl::typeToStr(output_flag),
atomic_fun.c_str());
ret = k.create("khist",source,atomic_fun);
std::cout<<"check create : "<<ret<<std::endl;
k.args(cv::ocl::KernelArg::ReadOnly(src),cv::ocl::KernelArg::WriteOnlyNoSize(dst));
std::size_t sz = a.rows;
ret = k.run(1,&sz,nullptr,false);
std::cout<<"check "<<ret<<std::endl;
cv::Mat b;
dst.copyTo(b);
std::copy_n(b.ptr<double>(0),101,std::ostream_iterator<double>(std::cout," "));
std::cout<<std::endl;
return EXIT_SUCCESS;
}
Hello, I managed to fix it.
I don't really know where the issue came from.
But if I pass the output as a plain pointer rather than as a matrix, it works.
These are the changes I made:
histogram.cl :
__kernel void khist(
__global const uchar* _src,
const int src_steps,
const int src_offset,
const int rows,
const int cols,
__global _Dty* _dst)
{
const int gid = get_global_id(0);
if(gid < rows)
{
__global const _Sty* src = (__global const _Sty*)_src;
__global _Dty* dst = _dst;
const int src_step1 = src_steps/sizeof(_Sty);
src += mad24(gid,src_step1,src_offset);
ulong one = 1;
for(int c=0;c<cols;c++)
{
const _Rty idx = (_Rty)(*(src+c+src_offset));
ATOMIC_FUN(dst+idx,one);
}
}
}
main.cpp
k.args(cv::ocl::KernelArg::ReadOnly(src),cv::ocl::KernelArg::PtrWriteOnly(dst));
The rest of the code is the same in both files.
For me it works fine.
If someone knows why it works when the output is declared as a pointer rather than as a vector (a matrix with one row), I am interested.
Nevertheless, my issue is fixed :).

Output for sample code for an upcoming exam concerning pthread

pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
int token = 2;
int value = 3;
void * red ( void *arg ) {
int myid = * ((int *) arg);
pthread_mutex_lock( &mutex );
while ( myid != token) {
pthread_cond_wait( &cond, &mutex );
}
value = value + (myid + 3);
printf( "RED: id is %d \n", value);
token = (token + 1) % 3;
pthread_cond_broadcast( &cond );
pthread_mutex_unlock( &mutex );
}
void * blue ( void *arg ) {
int myid = * ((int *) arg);
pthread_mutex_lock( &mutex );
while ( myid != token) {
pthread_cond_wait( &cond, &mutex );
}
value = value * (myid + 2);
printf( "BLUE: id is %d \n", value);
token = (token + 1) % 3;
pthread_cond_broadcast( &cond );
pthread_mutex_unlock( &mutex );
}
void * white ( void *arg ) {
int myid = * ((int *) arg);
pthread_mutex_lock( &mutex );
while ( myid != token) {
pthread_cond_wait( &cond, &mutex );
}
value = value * (myid + 1);
printf( "WHITE: id is %d \n", value);
token = (token + 1) % 3;
pthread_cond_broadcast( &cond );
pthread_mutex_unlock( &mutex );
}
main( int argc, char *argv[] ) {
pthread_t tid;
int count = 0;
int id1, id2, id3;
id1 = count;
n = pthread_create( &tid, NULL, red, &id1);
id2 = ++count;
n = pthread_create( &tid, NULL, blue, &id2);
id3 = ++count;
n = pthread_create( &tid, NULL, white, &id3);
if ( n = pthread_join( tid, NULL ) ) {
fprintf( stderr, "pthread_join: %s\n", strerror( n ) );
exit( 1 );
}
}
I am just looking for comments and/or notes on what the output would be. THIS IS FOR AN EXAM AND WAS OFFERED AS AN EXAMPLE. THIS IS NOT HOMEWORK OR GOING TO BE USED FOR ANY TYPE OF SUBMISSION. I am looking to understand what is going on. Any help is greatly appreciated.
I'm going to assume that you know how the locks, condition variables, and waits work. Basically you have three threads that run red, blue, and white respectively. token is initially 2, and value is initially 3.
red is started with id1 = 0, so it stays in the while loop calling pthread_cond_wait() until token is 0.
blue is started with id2 = 1, and stays in the while loop calling pthread_cond_wait() until token is 1.
white is started with id3 = 2, and would wait until token is 2.
So white enters the critical section first, since token is already 2 and it never enters the while loop. So value = 3 * (2 + 1) = 9; token = (2 + 1) % 3 = 0;
Broadcast wakes every waiting thread, but the only one that leaves its loop is red. It adds 0 + 3 = 3 to value, giving 12; token = (0 + 1) % 3 = 1; the next broadcast lets blue proceed.
blue enters the critical section. value = 12 * (1 + 2) = 36; token = 2 (but it doesn't matter anymore).
This is the order in which the threads would execute, which is what I assume the test is really asking. However, what is likely to actually come out is just:
WHITE: id is 9
This is because there is only one pthread_t tid, which every pthread_create() overwrites, so main only joins the last thread created (white). After pthread_join( tid, NULL ) returns, main can immediately exit and the process ends, possibly before red and blue get to print. If you use a different pthread_t for each pthread_create() and join all of them, then all three would print.

cudaFree is not freeing memory

The code below calculates the dot product of two vectors a and b. The correct result is 8192. When I run it for the first time the result is correct. Then when I run it for the second time the result is the previous result + 8192 and so on:
1st iteration: result = 8192
2nd iteration: result = 8192 + 8192
3rd iteration: result = 8192 + 8192 + 8192
and so on.
I checked by printing it on screen, and the device variable dev_c is not freed. What's more, writing to it behaves like a sum: the result is the previous value plus the new one being written. I guess that could have something to do with the atomicAdd() operation, but nonetheless cudaFree(dev_c) should erase it after all.
#define N 8192
#define THREADS_PER_BLOCK 512
#define NUMBER_OF_BLOCKS (N/THREADS_PER_BLOCK)
#include <stdio.h>
__global__ void dot( int *a, int *b, int *c ) {
__shared__ int temp[THREADS_PER_BLOCK];
int index = threadIdx.x + blockIdx.x * blockDim.x;
temp[threadIdx.x] = a[index] * b[index];
__syncthreads();
if( 0 == threadIdx.x ) {
int sum = 0;
for( int i= 0; i< THREADS_PER_BLOCK; i++ ){
sum += temp[i];
}
atomicAdd(c,sum);
}
}
int main( void ) {
int *a, *b, *c;
int *dev_a, *dev_b, *dev_c;
int size = N * sizeof( int);
cudaMalloc( (void**)&dev_a, size );
cudaMalloc( (void**)&dev_b, size );
cudaMalloc( (void**)&dev_c, sizeof(int));
a = (int*)malloc(size);
b = (int*)malloc(size);
c = (int*)malloc(sizeof(int));
for(int i = 0 ; i < N ; i++){
a[i] = 1;
b[i] = 1;
}
cudaMemcpy( dev_a, a, size, cudaMemcpyHostToDevice);
cudaMemcpy( dev_b, b, size, cudaMemcpyHostToDevice);
dot<<< N/THREADS_PER_BLOCK,THREADS_PER_BLOCK>>>( dev_a, dev_b, dev_c);
cudaMemcpy( c, dev_c, sizeof(int) , cudaMemcpyDeviceToHost);
printf("Dot product = %d\n", *c);
cudaFree(dev_a);
cudaFree(dev_b);
cudaFree(dev_c);
free(a);
free(b);
free(c);
return 0;
}
cudaFree doesn't erase anything, it simply returns memory to a pool to be re-allocated. cudaMalloc doesn't guarantee the value of memory that has been allocated. You need to initialize memory (both global and shared) that your program uses, in order to have consistent results. The same is true for malloc and free, by the way.
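From the documentation of cudaMalloc():
The memory is not cleared.
That means that dev_c is not initialized, and your atomicAdd(c,sum); will add to whatever value happens to be stored in memory at the returned address. Setting it to zero before launching the kernel, for example with cudaMemset(dev_c, 0, sizeof(int)), gives a consistent result.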

Double dimension tab count

I've forgotten how to count the rows of a two-dimensional C array, and I don't understand why this code returns a count of 12 instead of 6.
// My tab
static NSString *kStringTab[][2] = {
{@"string1", @"1"},
{@"string2", @"1"},
{@"string3", @"0"},
{@"string4", @"0"},
{@"string5", @"1"},
{@"string6", @"1"},
{nil, nil}
};
// My C func
unsigned int tablen(void **tab)
{
unsigned int i = 0;
while (tab[i] != nil)
i++;
return i;
}
- (void)viewDidLoad
{
NSLog(@"%d", tablen((void **)kStringTab));
}
Your code is not C.
If it were C, tab would be an array of array of pointers to NSStrings (whatever that is).
In C an array of arrays of pointers to NSStrings is not necessarily compatible with a pointer to pointer to void ... so remove the casts and get the types correct.
In C, this works ...
#include <stdio.h>
static char *kStringTab[][2] = {
{"string1", "1"},
{"string2", "1"},
{"string3", "0"},
{"string4", "0"},
{"string5", "1"},
{"string6", "1"},
{NULL, NULL},
};
unsigned int tablen(char *tab[][2]) {
unsigned int i = 0;
while (tab[i][0] != NULL) i++;
return i;
}
int main(void) {
printf("%d\n", tablen(kStringTab));
return 0;
}
Suggestion: increase the warning level of your compiler and mind the warnings.
Edit: new generic version
#include <math.h>
#include <stdio.h>
static double anothertest[][3] = {
{42, 54, -122},
{33, -0.001, 0.001},
{6, 0, 7}, /* 0 in middle: stop condition in nullp2 :) */
{2, 2, 2},
};
static char *kStringTab[][2] = {
{"string1", "1"},
{"string2", "1"},
{"string3", "0"},
{"string4", "0"},
{"string5", "1"},
{"string6", "1"},
{NULL, NULL},
};
int nullp2(const void *elem) {
const double *tmp = elem;
return (fabs(tmp[1]) < 0.000000001);
}
int nullp(const void *elem) {
char (*const *tmp)[2] = elem; /* tmp is a pointer to each element of kStringTab */
return ((*tmp)[0] == NULL);
}
unsigned int tablen(void *x, size_t size,
int (*check)(const void *)) {
char *y = x;
unsigned int i = 0;
while (!check(y)) {
i++;
y += size;
}
return i;
}
int main(void) {
printf("tablen returns %d\n",
tablen(kStringTab, sizeof *kStringTab, nullp));
printf("tablen returns %d\n",
tablen(anothertest, sizeof *anothertest, nullp2));
return 0;
}
You can see it running at ideone.
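tab[i] is just an offset from a memory address. Because of the cast to void **, the two-dimensional array is walked as one flat run of pointers, and there are 12 non-nil pointers stored at that address before the first nil.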
