pthread deadlock issue when using pthread_cond_t - pthreads

I am having a horrible time trying to figure out why my synchronization code deadlocks when using the pthread library. Using WinAPI primitives instead of pthreads works with no problems. Using C++11 threading also works fine (unless compiled with Visual Studio 2012 Service Pack 3, where it just crashes - Microsoft accepted it as a bug). Using pthreads, however, proves to be a problem - at least running on a Linux machine; I haven't had a chance to try a different OS.
I've written a simple program which illustrates the issue. The code just shows the deadlock - I am well aware the design is pretty poor and could be written much better.
typedef struct _pthread_event
{
    pthread_mutex_t Mutex;
    pthread_cond_t Condition;
    unsigned char State;
} pthread_event;

void pthread_event_create( pthread_event * ev , unsigned char init_state )
{
    pthread_mutex_init( &ev->Mutex , 0 );
    pthread_cond_init( &ev->Condition , 0 );
    ev->State = init_state;
}

void pthread_event_destroy( pthread_event * ev )
{
    pthread_cond_destroy( &ev->Condition );
    pthread_mutex_destroy( &ev->Mutex );
}

void pthread_event_set( pthread_event * ev , unsigned char state )
{
    pthread_mutex_lock( &ev->Mutex );
    ev->State = state;
    pthread_mutex_unlock( &ev->Mutex );
    pthread_cond_broadcast( &ev->Condition );
}

unsigned char pthread_event_get( pthread_event * ev )
{
    unsigned char result;
    pthread_mutex_lock( &ev->Mutex );
    result = ev->State;
    pthread_mutex_unlock( &ev->Mutex );
    return result;
}

unsigned char pthread_event_wait( pthread_event * ev , unsigned char state , unsigned int timeout_ms )
{
    struct timeval time_now;
    struct timespec timeout_time;
    unsigned char result;
    gettimeofday( &time_now , NULL );
    timeout_time.tv_sec = time_now.tv_sec + ( timeout_ms / 1000 );
    timeout_time.tv_nsec = time_now.tv_usec * 1000 + ( ( timeout_ms % 1000 ) * 1000000 );
    pthread_mutex_lock( &ev->Mutex );
    while ( ev->State != state )
        if ( ETIMEDOUT == pthread_cond_timedwait( &ev->Condition , &ev->Mutex , &timeout_time ) ) break;
    result = ev->State;
    pthread_mutex_unlock( &ev->Mutex );
    return result;
}

static pthread_t thread_1;
static pthread_t thread_2;
static pthread_event data_ready;
static pthread_event data_needed;

void * thread_fx1( void * c )
{
    for ( ; ; )
    {
        pthread_event_wait( &data_needed , 1 , 90 );
        pthread_event_set( &data_needed , 0 );
        usleep( 100000 );
        pthread_event_set( &data_ready , 1 );
        printf( "t1: tick\n" );
    }
}

void * thread_fx2( void * c )
{
    for ( ; ; )
    {
        pthread_event_wait( &data_ready , 1 , 50 );
        pthread_event_set( &data_ready , 0 );
        pthread_event_set( &data_needed , 1 );
        usleep( 100000 );
        printf( "t2: tick\n" );
    }
}

int main( int argc , char * argv[] )
{
    pthread_event_create( &data_ready , 0 );
    pthread_event_create( &data_needed , 0 );
    pthread_create( &thread_1 , NULL , thread_fx1 , 0 );
    pthread_create( &thread_2 , NULL , thread_fx2 , 0 );
    pthread_join( thread_1 , NULL );
    pthread_join( thread_2 , NULL );
    pthread_event_destroy( &data_ready );
    pthread_event_destroy( &data_needed );
    return 0;
}
Basically, the two threads signal each other to start doing something, and each does its own thing after a short timeout even if not signaled.
Any idea what is going wrong there?
Thanks.

The problem is the timeout_time parameter to pthread_cond_timedwait(). The way you compute it will, quite soon, produce an invalid value whose nanosecond part is greater than or equal to one billion. In that case pthread_cond_timedwait() returns EINVAL, probably before ever waiting on the condition.
The problem can be found very quickly with valgrind --tool=helgrind ./test_prog (quite soon it reported it had already detected 10000000 errors and gave up counting):
bash$ gcc -Werror -Wall -g test.c -o test -lpthread && valgrind --tool=helgrind ./test
==3035== Helgrind, a thread error detector
==3035== Copyright (C) 2007-2012, and GNU GPL'd, by OpenWorks LLP et al.
==3035== Using Valgrind-3.8.1 and LibVEX; rerun with -h for copyright info
==3035== Command: ./test
==3035==
t1: tick
t2: tick
t2: tick
t1: tick
t2: tick
t1: tick
t1: tick
t2: tick
t1: tick
t2: tick
t1: tick
==3035== ---Thread-Announcement------------------------------------------
==3035==
==3035== Thread #2 was created
==3035== at 0x41843C8: clone (clone.S:110)
==3035==
==3035== ----------------------------------------------------------------
==3035==
==3035== Thread #2's call to pthread_cond_timedwait failed
==3035== with error code 22 (EINVAL: Invalid argument)
==3035== at 0x402DB03: pthread_cond_timedwait_WRK (hg_intercepts.c:784)
==3035== by 0x8048910: pthread_event_wait (test.c:65)
==3035== by 0x8048965: thread_fx1 (test.c:80)
==3035== by 0x402E437: mythread_wrapper (hg_intercepts.c:219)
==3035== by 0x407DD77: start_thread (pthread_create.c:311)
==3035== by 0x41843DD: clone (clone.S:131)
==3035==
t2: tick
==3035==
==3035== More than 10000000 total errors detected. I'm not reporting any more.
==3035== Final error counts will be inaccurate. Go fix your program!
==3035== Rerun with --error-limit=no to disable this cutoff. Note
==3035== that errors may occur in your program without prior warning from
==3035== Valgrind, because errors are no longer being displayed.
==3035==
^C==3035==
==3035== For counts of detected and suppressed errors, rerun with: -v
==3035== Use --history-level=approx or =none to gain increased speed, at
==3035== the cost of reduced accuracy of conflicting-access information
==3035== ERROR SUMMARY: 10000000 errors from 1 contexts (suppressed: 412 from 109)
Killed
There are also two other minor comments:
For improved correctness, in pthread_event_set() you could do the condition variable broadcast before unlocking the mutex (the wrong ordering can basically break the determinism of the scheduling; helgrind complains about this issue too);
you could safely drop the mutex locking in pthread_event_get() when returning the value of ev->State - this should be an atomic operation.
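A minimal sketch of a fixed wait function, assuming the rest of the design stays as it is, just normalizes the computed timespec so tv_nsec stays below one billion before waiting:
unsigned char pthread_event_wait( pthread_event * ev , unsigned char state , unsigned int timeout_ms )
{
    struct timeval time_now;
    struct timespec timeout_time;
    unsigned char result;
    gettimeofday( &time_now , NULL );
    timeout_time.tv_sec = time_now.tv_sec + ( timeout_ms / 1000 );
    timeout_time.tv_nsec = time_now.tv_usec * 1000 + ( ( timeout_ms % 1000 ) * 1000000 );
    /* carry any overflow from tv_nsec into tv_sec so the value stays valid */
    if ( timeout_time.tv_nsec >= 1000000000L )
    {
        timeout_time.tv_sec += timeout_time.tv_nsec / 1000000000L;
        timeout_time.tv_nsec %= 1000000000L;
    }
    pthread_mutex_lock( &ev->Mutex );
    while ( ev->State != state )
        if ( ETIMEDOUT == pthread_cond_timedwait( &ev->Condition , &ev->Mutex , &timeout_time ) ) break;
    result = ev->State;
    pthread_mutex_unlock( &ev->Mutex );
    return result;
}
With that change (and the broadcast moved before the unlock in pthread_event_set()), the test program should keep ticking instead of deadlocking.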

What lives above the last accessible address in the stack?

I've asked before about why the stack doesn't start at 0x7fff...c, and was told that typically 0x800... onwards is for the kernel, and that the CLI args and environment variables live at the top of the user's stack, which is why it starts below 0x7fff...c. But I recently tried to examine all the strings with the following program:
#include <stdio.h>
#include <string.h>

int main(int argc, const char **argv) {
    const char *ptr = argv[0];
    while (1) {
        printf("%p: %s\n", ptr, ptr);
        size_t len = strlen(ptr);
        ptr = (void *)ptr + len + 1;
    }
}
However, after displaying all my environment variables, I see the following (I compiled the program to an executable called ./t):
0x7ffc19f84fa0: <final env variable string>
0x7ffc19f84fee: _=./t
0x7ffc19f84ff4: ./t
0x7ffc19f84ff8:
0x7ffc19f84ff9:
0x7ffc19f84ffa:
0x7ffc19f84ffb:
0x7ffc19f84ffc:
0x7ffc19f84ffd:
0x7ffc19f84ffe:
0x7ffc19f84fff:
So it appears there's one extra empty byte after the null terminator for the ./t string at bytes 0x7ffc19f84ff4..0x7ffc19f84ff7, and after that I segfault so I guess that's the base of the stack. What actually lives in the remaining "empty" space before kernel memory starts?
Edit: I also tried the following:
global _start
extern print_hex, fgets, puts, print, exit

section .text
_start:
    pop rdi
    mov rcx, 0
_start_loop:
    mov rdi, rsp
    call print_hex
    pop rdi
    call puts
    jmp _start_loop
    mov rdi, 0
    call exit
where print_hex is a routine I wrote elsewhere. It seems this is all I can get
0x00007ffcd272de28
./bin/main
0x00007ffcd272de30
abc
0x00007ffcd272de38
def
0x00007ffcd272de40
ghi
0x00007ffcd272de48
make: *** [Makefile:47: run] Segmentation fault
so it seems that even in _start we don't begin at 0x7fff...
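For what it's worth, one more way to see where the stack region actually ends (a sketch, Linux-specific) is to dump /proc/self/maps and compare the [stack] line against the addresses printed above:
#include <stdio.h>

int main(int argc, const char **argv) {
    /* print where argv[0] lives, then dump the memory map so the
       [stack] range can be compared against it */
    printf("argv[0] is at %p\n", (void *)argv[0]);
    FILE *maps = fopen("/proc/self/maps", "r");
    if (maps) {
        char line[512];
        while (fgets(line, sizeof line, maps))
            fputs(line, stdout);
        fclose(maps);
    }
    return 0;
}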

cJSON Memory leak when freeing cJSON object

I am facing an issue while using the cJSON library. I am assuming that there is a memory leak that breaks the code after a certain time (40 minutes to 1 hour).
I have copied my code below:
void my_work_handler_5(struct k_work *work)
{
    char *ptr1[6];
    int y=0;
    static int counterdo = 0;
    char *desc6 = "RSRP";
    char *id6 = "dBm";
    char *type6 = "RSRP";
    char rsrp_str[100];
    snprintf(rsrp_str, sizeof(rsrp_str), "%d", rsrp_current);
    sensor5 = cJSON_CreateObject();
    cJSON_AddItemToObject(sensor5, "description", cJSON_CreateString(desc6));
    cJSON_AddItemToObject(sensor5, "Time", cJSON_CreateString(time_string));
    cJSON_AddItemToObject(sensor5, "value", cJSON_CreateNumber(rsrp_current));
    cJSON_AddItemToObject(sensor5, "unit", cJSON_CreateString(id6));
    cJSON_AddItemToObject(sensor5, "type", cJSON_CreateString(type6));
    /* print everything */
    ptr1[counterdo] = cJSON_Print(sensor5);
    printk("Counterdo value is : %d\n", counterdo);
    cJSON_Delete(sensor5);
    counterdo = counterdo + 1;
    if (counterdo==6){
        for(y=0;y<=counterdo;y++){
            free(ptr1[y]);
        }
        counterdo = 0;
    }
    return;
}
I read some other threads regarding freeing up the memory and tried to do the same. Can anyone let me know if this is the right approach to free the space allocated to the cJSON object?
Regards,
Adeel.
Since cJSON is a portable library with no dependencies, it is better to look for a potential issue in your code on a PC: there are specialized tools available in that environment to facilitate the investigation. I am assuming here that you have a Linux system, a Windows system with WSL or WSL2, or a Linux virtual machine available, with gcc and valgrind installed.
A minimal, self-contained, portable version of your code could be:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <cJSON.h>

static int rsrp_current = 1;
static char *time_string = NULL;

void
my_work_handler_5 ()
{
  char *ptr1[6];
  int y = 0;
  static int counterdo = 0;
  char *desc6 = "RSRP";
  char *id6 = "dBm";
  char *type6 = "RSRP";
  char rsrp_str[100];
  snprintf (rsrp_str, sizeof (rsrp_str), "%d", rsrp_current);
  cJSON *sensor5 = cJSON_CreateObject ();
  cJSON_AddItemToObject (sensor5, "description", cJSON_CreateString (desc6));
  cJSON_AddItemToObject (sensor5, "Time", cJSON_CreateString (time_string));
  cJSON_AddItemToObject (sensor5, "value", cJSON_CreateNumber (rsrp_current));
  cJSON_AddItemToObject (sensor5, "unit", cJSON_CreateString (id6));
  cJSON_AddItemToObject (sensor5, "type", cJSON_CreateString (type6));
  /* print everything */
  ptr1[counterdo] = cJSON_Print (sensor5);
  printf ("Counterdo value is : %d\n", counterdo);
  cJSON_Delete (sensor5);
  counterdo = counterdo + 1;
  if (counterdo == 6)
    {
      for (y = 0; y <= counterdo; y++)
        {
          free (ptr1[y]);
        }
      counterdo = 0;
    }
  return;
}

int
main (int argc, char **argv)
{
  time_t curtime;
  time (&curtime);
  for (int n = 0; n < 3 * 6; n++)
    {
      my_work_handler_5 ();
    }
}
Build procedure:
wget https://github.com/DaveGamble/cJSON/archive/v1.7.14.tar.gz
tar zxf v1.7.14.tar.gz
gcc -g -O0 -IcJSON-1.7.14 -o cjson cjson.c cJSON-1.7.14/cJSON.c
Running valgrind on the program:
valgrind --leak-check=full --show-leak-kinds=all --track-origins=yes --verbose ./cjson
...indicates some memory is being freed that was not previously allocated: Invalid free() / delete / delete[] / realloc():
==6747==
==6747== HEAP SUMMARY:
==6747== in use at exit: 0 bytes in 0 blocks
==6747== total heap usage: 271 allocs, 274 frees, 14,614 bytes allocated
==6747==
==6747== All heap blocks were freed -- no leaks are possible
==6747==
==6747== ERROR SUMMARY: 21 errors from 2 contexts (suppressed: 0 from 0)
==6747==
==6747== 3 errors in context 1 of 2:
==6747== Invalid free() / delete / delete[] / realloc()
==6747== at 0x483CA3F: free (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==6747== by 0x1094DA: my_work_handler_5 (cjson.c:42)
==6747== by 0x10955A: main (cjson.c:59)
==6747== Address 0x31 is not stack'd, malloc'd or (recently) free'd
==6747==
==6747==
==6747== 18 errors in context 2 of 2:
==6747== Conditional jump or move depends on uninitialised value(s)
==6747== at 0x483C9F5: free (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==6747== by 0x1094DA: my_work_handler_5 (cjson.c:42)
==6747== by 0x10955A: main (cjson.c:59)
==6747== Uninitialised value was created by a stack allocation
==6747== at 0x109312: my_work_handler_5 (cjson.c:11)
==6747==
==6747== ERROR SUMMARY: 21 errors from 2 contexts (suppressed: 0 from 0)
Replacing:
for (y = 0; y <= counterdo; y++)
  {
    free (ptr1[y]);
  }
by:
for (y = 0; y < counterdo; y++)
  {
    free (ptr1[y]);
  }
and executing valgrind again:
==6834==
==6834== HEAP SUMMARY:
==6834== in use at exit: 1,095 bytes in 15 blocks
==6834== total heap usage: 271 allocs, 256 frees, 14,614 bytes allocated
==6834==
==6834== Searching for pointers to 15 not-freed blocks
==6834== Checked 75,000 bytes
==6834==
==6834== 1,095 bytes in 15 blocks are definitely lost in loss record 1 of 1
==6834== at 0x483DFAF: realloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==6834== by 0x10B161: print (cJSON.c:1209)
==6834== by 0x10B25F: cJSON_Print (cJSON.c:1248)
==6834== by 0x1094AB: my_work_handler_5 (cjson.c:30)
==6834== by 0x10959C: main (cjson.c:59)
==6834==
==6834== LEAK SUMMARY:
==6834== definitely lost: 1,095 bytes in 15 blocks
==6834== indirectly lost: 0 bytes in 0 blocks
==6834== possibly lost: 0 bytes in 0 blocks
==6834== still reachable: 0 bytes in 0 blocks
==6834== suppressed: 0 bytes in 0 blocks
==6834==
==6834== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)
Some memory is definitely being leaked.
The reason is that char *ptr1[6] is not static, and is therefore created on the stack every time my_work_handler_5() is called. The pointers that were returned by cJSON_Print() are therefore lost between two calls, and free() is called on arbitrary pointer values, since ptr1[] is not initialized as it could be:
char *ptr1[6] = { NULL, NULL, NULL, NULL, NULL, NULL };
Since you are freeing memory every 6 calls, this is causing the memory leak you were suspecting.
Replacing:
char *ptr1[6];
by:
static char *ptr1[6];
compiling, running valgrind again:
==6927==
==6927== HEAP SUMMARY:
==6927== in use at exit: 0 bytes in 0 blocks
==6927== total heap usage: 271 allocs, 271 frees, 14,614 bytes allocated
==6927==
==6927== All heap blocks were freed -- no leaks are possible
==6927==
==6927== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
The modified version of the program should now work on your bare-metal system.
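As a side note, if the printed string is only needed within the call that produced it, a simpler shape (a sketch, keeping your handler signature and field names, and assuming cJSON's default malloc/free hooks) avoids keeping an array of pointers across calls at all:
void my_work_handler_5(struct k_work *work)
{
    cJSON *sensor5 = cJSON_CreateObject();
    cJSON_AddItemToObject(sensor5, "description", cJSON_CreateString("RSRP"));
    cJSON_AddItemToObject(sensor5, "Time", cJSON_CreateString(time_string));
    cJSON_AddItemToObject(sensor5, "value", cJSON_CreateNumber(rsrp_current));
    cJSON_AddItemToObject(sensor5, "unit", cJSON_CreateString("dBm"));
    cJSON_AddItemToObject(sensor5, "type", cJSON_CreateString("RSRP"));

    char *printed = cJSON_Print(sensor5);   /* heap-allocated string */
    if (printed != NULL) {
        /* ... use the string here (send it, log it, ...) ... */
        free(printed);                      /* release it before returning */
    }
    cJSON_Delete(sensor5);                  /* release the whole object tree */
}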

mprotect errno 22 iOS

I'm developing a jailbroken app on iOS and getting errno 22 when calling
mprotect(p, 1024, PROT_READ | PROT_EXEC)
errno 22 means invalid arguments, but I can't figure out what's wrong. I've aligned p to a multiple of the page size, and I malloc'd the memory before calling mprotect.
Here's my code and sample output:
#define PAGESIZE 4096

FILE * pFile;
pFile = fopen ("log.txt","w");

uint32_t code[] = {
    0xe2800001, // add r0, r0, #1
    0xe12fff1e, // bx lr
};

fprintf(pFile, "Before Execution\n");
p = (uint32_t *)malloc(1024+PAGESIZE-1);
if (!p) {
    fprintf(pFile, "Couldn't malloc(1024)");
    perror("Couldn't malloc(1024)");
    exit(errno);
}
fprintf(pFile, "Malloced to %p\n", p);
p = (uint32_t *)(((uintptr_t)p + PAGESIZE-1) & ~(PAGESIZE-1));
fprintf(pFile, "Moved pointer to %p\n", p);
fprintf(pFile, "Before Compiling\n");

// copy instructions to function
p[0] = code[0];
p[1] = code[1];
fprintf(pFile, "After Compiling\n");

if (mprotect(p, 1024, PROT_READ | PROT_EXEC)) {
    int err = errno;
    fprintf(pFile, "Couldn't mprotect2: %i\n", errno);
    perror("Couldn't mprotect");
    exit(errno);
}
And output:
Before Execution
Malloced to 0x13611ec00
Moved pointer 0x13611f000
Before Compiling
After Compiling
Couldn't mprotect2: 22
Fixed this by using posix_memalign(). It turns out I wasn't aligning my pointer to the page size correctly.
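For reference, a minimal sketch of that allocation with posix_memalign() (assuming the surrounding code stays as above; PAGESIZE is the same 4096):
uint32_t *p = NULL;
/* allocate 1024 bytes whose start address is page-aligned */
int rc = posix_memalign((void **)&p, PAGESIZE, 1024);
if (rc != 0) {
    fprintf(pFile, "posix_memalign failed: %d\n", rc);
    exit(rc);
}

p[0] = code[0];
p[1] = code[1];

if (mprotect(p, 1024, PROT_READ | PROT_EXEC)) {
    perror("Couldn't mprotect");
    exit(errno);
}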

iOS 64bit ntpdate don't work properly with arm64 flag

I have a function (ANSI C) to retrieve the time from our NTP server.
This code works properly when compiled 32-bit but doesn't work when compiled for arm64.
It works properly on iPhone 4, 4S and 5 (32-bit); it doesn't work properly on iPhone 5S, 6 and 6S (64-bit).
I think that the problem is:
tmit=ntohl((time_t)buf[10]); //# get transmit time
time_t is now 8 bytes when compiled for arm64.
Below you can find the source code...
Correct output with iPhone 5 Simulator (32-bit) ---------------------------
xxx.xxx.xxx.xxx PORT 123
sendto-->48
prima recv
recv-->48
tmit=-661900093
tmit=1424078403
1424078403-->Time: Mon Feb 16 10:20:03 2015
10:20:03 --> 37203
---------------------------------------------------------
Wrong output with iPhone 6 Simulator (64-bit) ---------------------------
xxx.xxx.xxx.xxx PORT 123
sendto-->48
prima recv
recv-->48
tmit=19612797
tmit=2105591293
2105591293-->Time: Tue Nov 19 00:47:09 38239
00:47:09 --> 2829
//---------------------------------------------------------------------------
long ntpdate(char *hostname) {
//ntp1.inrim.it (193.204.114.232)
//ntp2.inrim.it (193.204.114.233)
int portno=NTP_PORT; //NTP is port 123
int maxlen=1024; //check our buffers
int i=0; // misc var i
unsigned char msg[48]={010,0,0,0,0,0,0,0,0}; // the packet we send
unsigned long buf[maxlen]; // the buffer we get back
//struct in_addr ipaddr; //
struct protoent *proto; //
struct sockaddr_in server_addr;
int s; // socket
int tmit; // the time -- This is a time_t sort of
char ora[20]="";
//
//#we use the system call to open a UDP socket
//socket(SOCKET, PF_INET, SOCK_DGRAM, getprotobyname("udp")) or die "socket: $!";
proto=getprotobyname("udp");
s=socket(PF_INET, SOCK_DGRAM, proto->p_proto);
if(s==-1) {
//printf("ERROR socket=%d\n",s);
return -1;
}
//Set the receive timeout --------------------
struct timeval tv;
tv.tv_sec = TIMEOUT_NTP; //sec
tv.tv_usec = 0;
if (setsockopt(s, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(struct timeval)) != 0)
{
//printf("Error assigning socket option");
return -1;
}
memset( &server_addr, 0, sizeof( server_addr ));
server_addr.sin_family=AF_INET;
//convert the hostname to an IP
struct hostent *hp = gethostbyname(hostname);
if (hp == NULL) {
return -1;
} else {
sprintf(hostname_ip, "%s", inet_ntoa( *( struct in_addr*)( hp -> h_addr_list[0])));
}
#ifdef LOG_NTP
printf("%s-->%s PORT %d\n",hostname,hostname_ip,portno);
#endif
server_addr.sin_addr.s_addr = inet_addr(hostname_ip);
server_addr.sin_port=htons(portno);
//printf("ipaddr (in hex): %x\n",server_addr.sin_addr);
/*
* build a message. Our message is all zeros except for a one in the
* protocol version field
* msg[] in binary is 00 001 000 00000000
* it should be a total of 48 bytes long
*/
// send the data
i=sendto(s,msg,sizeof(msg),0,(struct sockaddr *)&server_addr,sizeof(server_addr));
#ifdef LOG_NTP
printf("sendto-->%d\n",i);
#endif
if (i==-1)
return -1;
#ifdef LOG_NTP
printf("prima recv\n");
#endif
// get the data back
i=recv(s,buf,sizeof(buf),0);
#ifdef LOG_NTP
printf("recv-->%d\n",i);
#endif
if (i==-1)
{
#ifdef LOG_NTP
printf("Error: %s (%d)\n", strerror(errno), errno);
#endif
return -1;
}
//printf("recvfr: %d\n",i);
//We get 12 long words back in Network order
//for(i=0;i<12;i++)
//printf("%d\t%-8x\n",i,ntohl(buf[i]));
/*
* The high word of transmit time is the 10th word we get back
* tmit is the time in seconds not accounting for network delays which
* should be way less than a second if this is a local NTP server
*/
tmit=ntohl((time_t)buf[10]); //# get transmit time
#ifdef LOG_NTP
printf("tmit=%d\n",tmit);
#endif
/*
* Convert time to unix standard time NTP is number of seconds since 0000
* UT on 1 January 1900 unix time is seconds since 0000 UT on 1 January
* 1970 There has been a trend to add a 2 leap seconds every 3 years.
* Leap seconds are only an issue the last second of the month in June and
* December if you don't try to set the clock then it can be ignored but
* this is importaint to people who coordinate times with GPS clock sources.
*/
tmit-= 2208988800U;
#ifdef LOG_NTP
printf("tmit=%d\n",tmit);
#endif
/* use unix library function to show me the local time (it takes care
* of timezone issues for both north and south of the equator and places
* that do Summer time/ Daylight savings time.
*/
//#compare to system time
#ifdef LOG_NTP
//printf("%d-->Time: %s\n",tmit,ctime((const time_t)&tmit));
printf("%d-->Time: %s\n",tmit,ctime((const time_t)&tmit));
#endif
//i=time(0);
//printf("%d-%d=%d\n",i,tmit,i-tmit);
//printf("System time is %d seconds off\n",i-tmit);
//Take the time and convert it from HH:MM:SS --> seconds
strftime(ora, 20, "%T", localtime((const time_t)&tmit));
#ifdef LOG_NTP
printf("%s --> %ld\n",ora, C2TIME(ora));
#endif
return C2TIME(ora);
}
I solved the problem! Use:
uint32_t buf[maxlen];
uint32_t tmit;
instead of:
unsigned long buf[maxlen];
int tmit;
On arm64, unsigned long is 8 bytes, so buf[10] no longer lands on the transmit-time word of the 48-byte NTP packet; uint32_t keeps the 4-byte element layout the packet actually uses. Then define a temporary variable of type time_t for the library calls:
time_t tmit_temp = tmit;
printf("%d-->Time: %s\n", tmit, ctime((const time_t *)&tmit_temp));
strftime(ora, 20, "%T", localtime((const time_t *)&tmit_temp));
This works properly! ;-)
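Putting it together, the relevant part of the corrected function would look roughly like this (a sketch of only the changed lines, in context):
uint32_t buf[maxlen];           // NTP fields are 32-bit words; keep 4-byte elements on arm64 too
uint32_t tmit;                  // transmit timestamp, seconds since 1900
...
i = recv(s, buf, sizeof(buf), 0);
...
tmit = ntohl(buf[10]);          // 10th 32-bit word: transmit time
tmit -= 2208988800U;            // NTP epoch (1900) -> Unix epoch (1970)

time_t tmit_temp = tmit;        // widen to time_t only for the libc calls
printf("%u-->Time: %s\n", tmit, ctime(&tmit_temp));
strftime(ora, 20, "%T", localtime(&tmit_temp));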

Where is the global memory replay overhead coming from?

Running the code below, which writes 1 GB to global memory, under the NVIDIA Visual Profiler, I get:
- 100% storage efficiency
- 69.4% (128.6 GB/s) DRAM utilization
- 18.3% total replay overhead
- 18.3% global memory replay overhead.
The memory writes are supposed to be coalesced and there is no divergence in the kernel, so the question is where is the global memory replay overhead coming from? I am running this on Ubuntu 13.04, with nvidia-cuda-toolkit version 5.0.35-4ubuntu1.
#include <cuda.h>
#include <unistd.h>
#include <getopt.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <stdint.h>
#include <ctype.h>
#include <sched.h>
#include <assert.h>
static void
HandleError( cudaError_t err, const char *file, int line )
{
    if (err != cudaSuccess) {
        printf( "%s in %s at line %d\n", cudaGetErrorString(err), file, line);
        exit( EXIT_FAILURE );
    }
}
#define HANDLE_ERROR(err) (HandleError(err, __FILE__, __LINE__))

// Global memory writes
__global__ void
kernel_write(uint32_t *start, uint32_t entries)
{
    uint32_t tid = threadIdx.x + blockIdx.x*blockDim.x;
    while (tid < entries) {
        start[tid] = tid;
        tid += blockDim.x*gridDim.x;
    }
}

int main(int argc, char *argv[])
{
    uint32_t *gpu_mem;             // Memory pointer
    uint32_t n_blocks = 256;       // Blocks per grid
    uint32_t n_threads = 192;      // Threads per block
    uint32_t n_bytes = 1073741824; // Transfer size (1 GB)
    float elapsedTime;             // Elapsed write time

    // Allocate 1 GB of memory on the device
    HANDLE_ERROR( cudaMalloc((void **)&gpu_mem, n_bytes) );

    // Create events
    cudaEvent_t start, stop;
    HANDLE_ERROR( cudaEventCreate(&start) );
    HANDLE_ERROR( cudaEventCreate(&stop) );

    // Write to global memory
    HANDLE_ERROR( cudaEventRecord(start, 0) );
    kernel_write<<<n_blocks, n_threads>>>(gpu_mem, n_bytes/4);
    HANDLE_ERROR( cudaGetLastError() );
    HANDLE_ERROR( cudaEventRecord(stop, 0) );
    HANDLE_ERROR( cudaEventSynchronize(stop) );
    HANDLE_ERROR( cudaEventElapsedTime(&elapsedTime, start, stop) );

    // Report exchange time
    printf("#Delay(ms) BW(GB/s)\n");
    printf("%10.6f %10.6f\n", elapsedTime, 1e-6*n_bytes/elapsedTime);

    // Destroy events
    HANDLE_ERROR( cudaEventDestroy(start) );
    HANDLE_ERROR( cudaEventDestroy(stop) );

    // Free memory
    HANDLE_ERROR( cudaFree(gpu_mem) );

    return 0;
}
The nvprof profiler and the API profiler are giving different results:
$ nvprof --events gst_request ./app
======== NVPROF is profiling app...
======== Command: app
#Delay(ms) BW(GB/s)
13.345920 80.454690
======== Profiling result:
Invocations Avg Min Max Event Name
Device 0
Kernel: kernel_write(unsigned int*, unsigned int)
1 8388608 8388608 8388608 gst_request
$ nvprof --events global_store_transaction ./app
======== NVPROF is profiling app...
======== Command: app
#Delay(ms) BW(GB/s)
9.469216 113.392892
======== Profiling result:
Invocations Avg Min Max Event Name
Device 0
Kernel: kernel_write(unsigned int*, unsigned int)
1 8257560 8257560 8257560 global_store_transaction
I had the impression that global_store_transaction could not be lower than gst_request. What is going on here? I can't ask for both events in the same command, so I had to run two separate commands. Could this be the problem?
Strangely, the API profiler shows different results, with perfect coalescing. Here is the output; I had to run it twice to get the proper counters:
$ cat config.txt
inst_issued
inst_executed
gst_request
$ COMPUTE_PROFILE=1 COMPUTE_PROFILE_CSV=1 COMPUTE_PROFILE_LOG=log.csv COMPUTE_PROFILE_CONFIG=config.txt ./app
$ cat log.csv
# CUDA_PROFILE_LOG_VERSION 2.0
# CUDA_DEVICE 0 GeForce GTX 580
# CUDA_CONTEXT 1
# CUDA_PROFILE_CSV 1
# TIMESTAMPFACTOR fffff67eaca946b8
method,gputime,cputime,occupancy,inst_issued,inst_executed,gst_request,gld_request
_Z12kernel_writePjj,7771.776,7806.000,1.000,4737053,3900426,557058,0
$ cat config2.txt
global_store_transaction
$ COMPUTE_PROFILE=1 COMPUTE_PROFILE_CSV=1 COMPUTE_PROFILE_LOG=log2.csv COMPUTE_PROFILE_CONFIG=config2.txt ./app
$ cat log2.csv
# CUDA_PROFILE_LOG_VERSION 2.0
# CUDA_DEVICE 0 GeForce GTX 580
# CUDA_CONTEXT 1
# CUDA_PROFILE_CSV 1
# TIMESTAMPFACTOR fffff67eea92d0e8
method,gputime,cputime,occupancy,global_store_transaction
_Z12kernel_writePjj,7807.584,7831.000,1.000,557058
Here gst_request and global_store_transaction are exactly the same, showing perfect coalescing. Which one is correct (nvprof or the API profiler)? Why does the NVIDIA Visual Profiler say that I have non-coalesced writes? There are still significant instruction replays, and I have no idea where they are coming from :(
Any ideas? I don't think this is a hardware malfunction, since I have two boards on the same machine and both show the same behavior.
