cJSON memory leak when freeing a cJSON object

I am facing an issue while using the cJSON library. I suspect there is a memory leak that breaks the code after a certain time (40 min to 1 hr).
I have copied my code below:
void my_work_handler_5(struct k_work *work)
{
    char *ptr1[6];
    int y=0;
    static int counterdo = 0;
    char *desc6 = "RSRP";
    char *id6 = "dBm";
    char *type6 = "RSRP";
    char rsrp_str[100];
    snprintf(rsrp_str, sizeof(rsrp_str), "%d", rsrp_current);
    sensor5 = cJSON_CreateObject();
    cJSON_AddItemToObject(sensor5, "description", cJSON_CreateString(desc6));
    cJSON_AddItemToObject(sensor5, "Time", cJSON_CreateString(time_string));
    cJSON_AddItemToObject(sensor5, "value", cJSON_CreateNumber(rsrp_current));
    cJSON_AddItemToObject(sensor5, "unit", cJSON_CreateString(id6));
    cJSON_AddItemToObject(sensor5, "type", cJSON_CreateString(type6));
    /* print everything */
    ptr1[counterdo] = cJSON_Print(sensor5);
    printk("Counterdo value is : %d\n", counterdo);
    cJSON_Delete(sensor5);
    counterdo = counterdo + 1;
    if (counterdo==6){
        for(y=0;y<=counterdo;y++){
            free(ptr1[y]);
        }
        counterdo = 0;
    }
    return;
}
I read some other threads regarding freeing up the memory and tried to do the same. Can anyone let me know if this is the right approach to free up the space allocated to the cJSON Object.
Regards,
Adeel.

Since cJSON is a portable library with no dependencies, it is better to look for a potential issue in your code on a PC: there are specialized tools available in that environment to facilitate the investigation. I am assuming here that you have a Linux system, a Windows system with WSL or WSL2, or a Linux virtual machine available, with gcc and valgrind installed.
A minimal, self-contained, portable version of your code could be:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <cJSON.h>

static int rsrp_current = 1;
static char *time_string = NULL;

void
my_work_handler_5 ()
{
  char *ptr1[6];
  int y = 0;
  static int counterdo = 0;
  char *desc6 = "RSRP";
  char *id6 = "dBm";
  char *type6 = "RSRP";
  char rsrp_str[100];
  snprintf (rsrp_str, sizeof (rsrp_str), "%d", rsrp_current);
  cJSON *sensor5 = cJSON_CreateObject ();
  cJSON_AddItemToObject (sensor5, "description", cJSON_CreateString (desc6));
  cJSON_AddItemToObject (sensor5, "Time", cJSON_CreateString (time_string));
  cJSON_AddItemToObject (sensor5, "value", cJSON_CreateNumber (rsrp_current));
  cJSON_AddItemToObject (sensor5, "unit", cJSON_CreateString (id6));
  cJSON_AddItemToObject (sensor5, "type", cJSON_CreateString (type6));
  /* print everything */
  ptr1[counterdo] = cJSON_Print (sensor5);
  printf ("Counterdo value is : %d\n", counterdo);
  cJSON_Delete (sensor5);
  counterdo = counterdo + 1;
  if (counterdo == 6)
    {
      for (y = 0; y <= counterdo; y++)
        {
          free (ptr1[y]);
        }
      counterdo = 0;
    }
  return;
}

int
main (int argc, char **argv)
{
  time_t curtime;
  time (&curtime);
  for (int n = 0; n < 3 * 6; n++)
    {
      my_work_handler_5 ();
    }
}
Build procedure:
wget https://github.com/DaveGamble/cJSON/archive/v1.7.14.tar.gz
tar zxf v1.7.14.tar.gz
gcc -g -O0 -IcJSON-1.7.14 -o cjson cjson.c cJSON-1.7.14/cJSON.c
Running valgrind on the program:
valgrind --leak-check=full --show-leak-kinds=all --track-origins=yes --verbose ./cjson
...indicates some memory is being freed that was not previously allocated: Invalid free() / delete / delete[] / realloc():
==6747==
==6747== HEAP SUMMARY:
==6747== in use at exit: 0 bytes in 0 blocks
==6747== total heap usage: 271 allocs, 274 frees, 14,614 bytes allocated
==6747==
==6747== All heap blocks were freed -- no leaks are possible
==6747==
==6747== ERROR SUMMARY: 21 errors from 2 contexts (suppressed: 0 from 0)
==6747==
==6747== 3 errors in context 1 of 2:
==6747== Invalid free() / delete / delete[] / realloc()
==6747== at 0x483CA3F: free (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==6747== by 0x1094DA: my_work_handler_5 (cjson.c:42)
==6747== by 0x10955A: main (cjson.c:59)
==6747== Address 0x31 is not stack'd, malloc'd or (recently) free'd
==6747==
==6747==
==6747== 18 errors in context 2 of 2:
==6747== Conditional jump or move depends on uninitialised value(s)
==6747== at 0x483C9F5: free (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==6747== by 0x1094DA: my_work_handler_5 (cjson.c:42)
==6747== by 0x10955A: main (cjson.c:59)
==6747== Uninitialised value was created by a stack allocation
==6747== at 0x109312: my_work_handler_5 (cjson.c:11)
==6747==
==6747== ERROR SUMMARY: 21 errors from 2 contexts (suppressed: 0 from 0)
Replacing:
for (y = 0; y <= counterdo; y++)
  {
    free (ptr1[y]);
  }
by:
for (y = 0; y < counterdo; y++)
  {
    free (ptr1[y]);
  }
and executing valgrind again:
==6834==
==6834== HEAP SUMMARY:
==6834== in use at exit: 1,095 bytes in 15 blocks
==6834== total heap usage: 271 allocs, 256 frees, 14,614 bytes allocated
==6834==
==6834== Searching for pointers to 15 not-freed blocks
==6834== Checked 75,000 bytes
==6834==
==6834== 1,095 bytes in 15 blocks are definitely lost in loss record 1 of 1
==6834== at 0x483DFAF: realloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==6834== by 0x10B161: print (cJSON.c:1209)
==6834== by 0x10B25F: cJSON_Print (cJSON.c:1248)
==6834== by 0x1094AB: my_work_handler_5 (cjson.c:30)
==6834== by 0x10959C: main (cjson.c:59)
==6834==
==6834== LEAK SUMMARY:
==6834== definitely lost: 1,095 bytes in 15 blocks
==6834== indirectly lost: 0 bytes in 0 blocks
==6834== possibly lost: 0 bytes in 0 blocks
==6834== still reachable: 0 bytes in 0 blocks
==6834== suppressed: 0 bytes in 0 blocks
==6834==
==6834== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)
Some memory is definitely being leaked.
The reason is that char *ptr1[6] is not static, and is therefore created on the stack every time my_work_handler_5() is called. The pointers returned by cJSON_Print() are lost between two calls, and free() is called on arbitrary pointer values, since ptr1[] is not initialized as it could be:
char *ptr1[6] = { NULL, NULL, NULL, NULL, NULL, NULL };
Since you are freeing memory every 6 calls, this is causing the memory leak you were suspecting.
Replacing:
char *ptr1[6];
by:
static char *ptr1[6];
compiling, running valgrind again:
==6927==
==6927== HEAP SUMMARY:
==6927== in use at exit: 0 bytes in 0 blocks
==6927== total heap usage: 271 allocs, 271 frees, 14,614 bytes allocated
==6927==
==6927== All heap blocks were freed -- no leaks are possible
==6927==
==6927== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
The modified version of the program should now work on your bare-metal system.
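For reference, here is a sketch of the corrected handler with both fixes applied back to your original Zephyr code. It assumes, as in your snippet, that rsrp_current and time_string are defined elsewhere in your firmware; sensor5 is declared locally here, as in the minimal version above:
void my_work_handler_5(struct k_work *work)
{
    /* static: the strings returned by cJSON_Print() must survive between calls */
    static char *ptr1[6];
    static int counterdo = 0;
    int y = 0;
    char *desc6 = "RSRP";
    char *id6 = "dBm";
    char *type6 = "RSRP";
    char rsrp_str[100];
    snprintf(rsrp_str, sizeof(rsrp_str), "%d", rsrp_current);
    cJSON *sensor5 = cJSON_CreateObject();
    cJSON_AddItemToObject(sensor5, "description", cJSON_CreateString(desc6));
    cJSON_AddItemToObject(sensor5, "Time", cJSON_CreateString(time_string));
    cJSON_AddItemToObject(sensor5, "value", cJSON_CreateNumber(rsrp_current));
    cJSON_AddItemToObject(sensor5, "unit", cJSON_CreateString(id6));
    cJSON_AddItemToObject(sensor5, "type", cJSON_CreateString(type6));
    /* print everything */
    ptr1[counterdo] = cJSON_Print(sensor5);
    printk("Counterdo value is : %d\n", counterdo);
    cJSON_Delete(sensor5);
    counterdo = counterdo + 1;
    if (counterdo == 6) {
        /* free all 6 printed strings, indices 0..5 */
        for (y = 0; y < counterdo; y++) {
            free(ptr1[y]);
        }
        counterdo = 0;
    }
}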

Related

What lives above the last accessible address in the stack?

I've asked before why the stack doesn't start at 0x7fff...c, and was told that typically 0x800... onwards is for the kernel, and that the CLI args and environment variables live at the top of the user's stack, which is why it starts below 0x7fff...c. But I recently tried to examine all the strings with the following program:
#include <stdio.h>
#include <string.h>

int main(int argc, const char **argv) {
    const char *ptr = argv[0];
    while (1) {
        printf("%p: %s\n", ptr, ptr);
        size_t len = strlen(ptr);
        ptr = (void *)ptr + len + 1;
    }
}
However, after displaying all my environment variables, I see the following (I compiled the program to an executable called ./t):
0x7ffc19f84fa0: <final env variable string>
0x7ffc19f84fee: _=./t
0x7ffc19f84ff4: ./t
0x7ffc19f84ff8:
0x7ffc19f84ff9:
0x7ffc19f84ffa:
0x7ffc19f84ffb:
0x7ffc19f84ffc:
0x7ffc19f84ffd:
0x7ffc19f84ffe:
0x7ffc19f84fff:
So it appears there's one extra empty byte after the null terminator for the ./t string at bytes 0x7ffc19f84ff4..0x7ffc19f84ff7, and after that I segfault so I guess that's the base of the stack. What actually lives in the remaining "empty" space before kernel memory starts?
Edit: I also tried the following:
global _start
extern print_hex, fgets, puts, print, exit

section .text
_start:
    pop rdi
    mov rcx, 0
_start_loop:
    mov rdi, rsp
    call print_hex
    pop rdi
    call puts
    jmp _start_loop
    mov rdi, 0
    call exit
where print_hex is a routine I wrote elsewhere. It seems this is all I can get
0x00007ffcd272de28
./bin/main
0x00007ffcd272de30
abc
0x00007ffcd272de38
def
0x00007ffcd272de40
ghi
0x00007ffcd272de48
make: *** [Makefile:47: run] Segmentation fault
so it seems that even in _start we don't begin at 0x7fff...

How to write to shared library code segment loaded into RAM memory?

I would like to ask why I cannot write to the code segment of a shared library loaded into RAM in Linux 2.6.28.9 on a MIPS CPU platform (LG TV). I am able to read bytes but not able to write anything. In the example source code below (cross-compiled with gcc) I get ERROR 22: Invalid argument (EINVAL) when the write() function is called.
// this app tries to replace 4 bytes in code segment memory of loaded shared library
#include <stdio.h>    // printf
#include <stdlib.h>   // off_t
#include <dlfcn.h>    // dlopen, dlclose
#include <fcntl.h>    // open, O_RDWR
#include <unistd.h>   // lseek, close, read
#include <errno.h>    // errno
#include <string.h>   // strerror
#include <sys/mman.h> // mprotect, PROT_READ, PROT_WRITE, PROT_EXEC

#define BYTES_TO_REPLACE 4

int main (int argc, char *argv[])
{
    int fd, pid;
    unsigned *handle;
    unsigned long pagesize;
    off_t fun_addr, pa_fun_addr;
    unsigned char buf[BYTES_TO_REPLACE];
    char s[100];

    // initialize
    pagesize = sysconf(_SC_PAGESIZE); // memory page size from system
    pid = getpid();                   // PID of current process

    // open shared library file, OK
    handle = dlopen("/path_to_library_files/shared_library.so", RTLD_LAZY | RTLD_GLOBAL);

    // get function address, OK
    fun_addr = (off_t)dlsym(handle, "function_name_in_lib");

    // open memory device (pseudo-file), OK
    sprintf(s, "/proc/%d/mem", pid); // memory space of our process (/proc/self/mem)
    //strcpy(s, "/dev/mem"); // in that case when reading from memory ==> ERROR 14: Bad address
    fd = open(s, O_RDWR); // open for reading and writing

    // go to starting address of the library function loaded earlier, OK
    lseek(fd, fun_addr, SEEK_SET);

    // read from memory, OK
    read(fd, buf, BYTES_TO_REPLACE);
    printf("old replaced bytes = [%02X %02X %02X %02X]\n", buf[0], buf[1], buf[2], buf[3]);

    // move back, OK
    lseek(fd, fun_addr, SEEK_SET);

    // unprotect memory page - no error, but does not help
    pa_fun_addr = (fun_addr / pagesize) * pagesize; // page-aligned address
    mprotect((void *)pa_fun_addr, pagesize, PROT_READ | PROT_WRITE | PROT_EXEC);

    // write new data to memory: ERROR 22: Invalid argument
    buf[0] = 0x08; buf[1] = 0x00; buf[2] = 0xE0; buf[3] = 0x03; // replacing 4-byte command: jr $ra (MIPS CPU)
    if (write(fd, buf, BYTES_TO_REPLACE) != BYTES_TO_REPLACE) printf("ERROR %d: %s!\n", errno, strerror(errno));

    // close memory device and shared library
    close(fd);
    dlclose(handle);
    return 0;
}
This is because process code in memory doesn't have write permission by default. To see the permissions of a process's memory, use pmap:
For example, the shared libraries below have at most r-x permissions:
sudo pmap 5869
5869: vim supervisor_meeting-2017-05-22.txt
000055b391f62000 2604K r-x-- vim
000055b3923ed000 56K r---- vim
000055b3923fb000 100K rw--- vim
000055b392414000 60K rw--- [ anon ]
000055b393377000 2868K rw--- [ anon ]
00007fc59ef5a000 40K r-x-- libnss_files-2.24.so
00007fc59ef64000 2048K ----- libnss_files-2.24.so
00007fc59f164000 4K r---- libnss_files-2.24.so
00007fc59f165000 4K rw--- libnss_files-2.24.so
00007fc59f166000 24K rw--- [ anon ]
<..snip..>
I understand that you're trying to change this with mprotect(), but you're not checking the return value of the mprotect() call; it is probably failing for some reason.
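A minimal check could look like this (only a sketch, reusing pa_fun_addr, pagesize, errno and strerror from your program):
// bail out early if the page permissions could not actually be changed
if (mprotect((void *)pa_fun_addr, pagesize, PROT_READ | PROT_WRITE | PROT_EXEC) != 0) {
    printf("mprotect ERROR %d: %s!\n", errno, strerror(errno));
    return 1;
}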
Also, as an aside: write() is not guaranteed to write all bytes given to it, and it is quite within its design to return having written none, or only some, of the bytes. I'd suggest you also change the code to reflect this.
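Something along these lines would handle short writes (again only a sketch, reusing fd and buf from your program):
// keep writing until all BYTES_TO_REPLACE bytes are out, or an error occurs
size_t done = 0;
while (done < BYTES_TO_REPLACE) {
    ssize_t n = write(fd, buf + done, BYTES_TO_REPLACE - done);
    if (n < 0) {
        printf("ERROR %d: %s!\n", errno, strerror(errno));
        break;
    }
    done += (size_t)n;
}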

Memory leak in PKCS12_newpass

I wrote a function which changes a PKCS12 certificate passphrase.
I noticed that the PKCS12_newpass function leaks memory. When I comment out that line, the memory leak does not occur.
How could I fix this memory leak?
- (NSData*)changePKCS12:(NSData*)p12Data
          oldPassphrase:(NSString*)oldPassphrase
          newPassphrase:(NSString*)newPassphrase {
    OpenSSL_add_all_algorithms();
    BIO *bp = NULL;
    PKCS12 *p12 = NULL;
    int status = 0;
    do {
        bp = BIO_new_mem_buf((void *)[p12Data bytes], (int)[p12Data length]);
        p12 = d2i_PKCS12_bio(bp, NULL);
        // MEMORY LEAK in PKCS12_newpass
        status = PKCS12_newpass(p12, (char *)[oldPassphrase UTF8String], (char *)[newPassphrase UTF8String]);
    } while (false);
    if (p12) {
        PKCS12_free(p12);
        p12 = NULL;
    }
    if (bp) {
        BIO_free_all(bp);
        bp = NULL;
    }
    EVP_cleanup();
    return NULL;
}
Here are the two leaks reported by Valgrind:
$ valgrind --leak-check=full ./test.exe
==32547== Memcheck, a memory error detector
==32547== Copyright (C) 2002-2013, and GNU GPL'd, by Julian Seward et al.
==32547== Using Valgrind-3.10.1 and LibVEX; rerun with -h for copyright info
==32547== Command: ./test.exe
==32547==
==32547==
==32547== HEAP SUMMARY:
==32547== in use at exit: 4,044 bytes in 25 blocks
==32547== total heap usage: 3,273 allocs, 3,248 frees, 149,992 bytes allocated
==32547==
==32547== 1,307 (32 direct, 1,275 indirect) bytes in 1 blocks are definitely lost in loss record 22 of 24
==32547== at 0x4C2AB80: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==32547== by 0x408A76: CRYPTO_malloc (mem.c:140)
==32547== by 0x408AA9: CRYPTO_zalloc (mem.c:148)
==32547== by 0x447104: asn1_item_embed_new (tasn_new.c:171)
==32547== by 0x446E66: ASN1_item_ex_new (tasn_new.c:88)
==32547== by 0x4439AA: asn1_item_embed_d2i (tasn_dec.c:333)
==32547== by 0x4431B7: ASN1_item_ex_d2i (tasn_dec.c:162)
==32547== by 0x44314A: ASN1_item_d2i (tasn_dec.c:152)
==32547== by 0x4AB8BA: PKCS12_item_decrypt_d2i (p12_decr.c:159)
==32547== by 0x40CA79: PKCS8_decrypt (p12_p8d.c:69)
==32547== by 0x40C8DC: newpass_bag (p12_npas.c:206)
==32547== by 0x40C868: newpass_bags (p12_npas.c:188)
==32547==
==32547== 2,625 (32 direct, 2,593 indirect) bytes in 1 blocks are definitely lost in loss record 24 of 24
==32547== at 0x4C2AB80: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==32547== by 0x408A76: CRYPTO_malloc (mem.c:140)
==32547== by 0x408AA9: CRYPTO_zalloc (mem.c:148)
==32547== by 0x41258E: sk_new (stack.c:153)
==32547== by 0x41256A: sk_new_null (stack.c:146)
==32547== by 0x40C27E: sk_PKCS7_new_null (pkcs7.h:199)
==32547== by 0x40C489: newpass_p12 (p12_npas.c:118)
==32547== by 0x40C3CC: PKCS12_newpass (p12_npas.c:96)
==32547== by 0x40315F: main (in /home/openssl/test.exe)
The first one is due to 0x40C8DC: newpass_bag (p12_npas.c:206):
X509_SIG_get0(&shalg, NULL, bag->value.shkeybag);
However, the get0 in X509_SIG_get0 does not bump the reference count, so I think it's really the line before it (or a bug in X509_SIG_get0):
if (PKCS12_SAFEBAG_get_nid(bag) != NID_pkcs8ShroudedKeyBag)
PKCS12_SAFEBAG_get_nid is not documented. That means it's a private API, so the OpenSSL devs have to fix the leak caused by it. (I think it's actually due to PKCS12_item_decrypt_d2i a little deeper into the stack, but it's out of reach because of PKCS12_SAFEBAG_get_nid.)
The second one is due to 0x40C489: newpass_p12 (p12_npas.c:118):
if ((newsafes = sk_PKCS7_new_null()) == NULL)
return 0;
sk_PKCS7_new_null is not documented. That means it's a private API, so the OpenSSL devs have to fix the leak caused by it.
How could I fix this memory leak?
Unfortunately, you cannot, because both of the offenders are private APIs. The best you can do is report them at RT, which is the OpenSSL bug tracker.
There was a somewhat unexpected "big deal" about documentation and private APIs and such. See EC_KEY_priv2buf(): check parameter sanity for a discussion and the new rules.
According to a search of the sources, PKCS12_newpass lacks documentation, so it's a private API, too (there's no POD file, which is used to build the man page):
$ grep -IR PKCS12_newpass *
CHANGES: *) New function PKCS12_newpass() which changes the password of a
crypto/pkcs12/pk12err.c: {ERR_FUNC(PKCS12_F_PKCS12_NEWPASS), "PKCS12_newpass"},
crypto/pkcs12/p12_npas.c:int PKCS12_newpass(PKCS12 *p12, const char *oldpass, const char *newpass)
crypto.map: PKCS12_newpass;
include/openssl/pkcs12.h:int PKCS12_newpass(PKCS12 *p12, const char *oldpass, const char *newpass);
util/libcrypto.num:PKCS12_newpass 3204 1_1_0 EXIST::FUNCTION:
A bug report was submitted with documentation at Issue 4478: DOCUMENTATION PKCS12_newpass. It should help get beyond the "documentation else its private" rule.
Below is a cat of test.cc:
#include "openssl/pkcs12.h"
#include "openssl/bio.h"
#include "openssl/engine.h"
#include "openssl/conf.h"
#include "openssl/err.h"
/* openssl req -x509 -newkey rsa:1024 -keyout key.pem -nodes -out cert.pem -days 365 */
/* openssl pkcs12 -export -out pkcs12.p12 -inkey key.pem -in cert.pem */
/* gcc -ansi -I . -I ./include test.cc ./libcrypto.a -o test.exe */
int main(int argc, char* argv[])
{
OpenSSL_add_all_algorithms();
BIO *bp = NULL;
PKCS12 *p12 = NULL;
int rc = -1;
unsigned long err = 0;
char password[] = "passphrase";
bp = BIO_new_file("pkcs12.p12", "r");
if (bp == NULL) goto cleanup;
p12 = d2i_PKCS12_bio(bp, NULL);
if (p12 == NULL) goto cleanup;
/* Use empty string when no password was applies with 'openssl pkcs12' */
rc = PKCS12_newpass(p12, password, password);
cleanup:
if (rc == 1)
{
fprintf(stdout, "Sucessfully changed password\n");
}
else
{
err = ERR_get_error();
fprintf(stderr, "Failed to change password, error %lu\n", err);
}
if (p12)
PKCS12_free(p12);
if (bp)
BIO_free_all(bp);
/* http://wiki.openssl.org/index.php/Library_Initialization#Cleanup */
ENGINE_cleanup();
CONF_modules_unload(1);
EVP_cleanup();
CRYPTO_cleanup_all_ex_data();
#if OPENSSL_API_COMPAT < 0x10000000L
ERR_remove_state(0);
#else
ERR_remove_thread_state();
#endif
ERR_free_strings();
return 0;
}

SIGBUS crash on Solaris 8

Compiled with g++ 4.7.4 on Solaris 8; 32-bit application. The stack trace is:
Core was generated by `./z3'.
Program terminated with signal 10, Bus error.
#0 0x012656ec in vector<unsigned long long, false, unsigned int>::push_back (this=0x2336ef4 <g_prime_generator>, elem=@0xffbff1f0: 2) at ../src/util/vector.h:284
284 new (m_data + reinterpret_cast<SZ *>(m_data)[SIZE_IDX]) T(elem);
(gdb) bt
#0 0x012656ec in vector<unsigned long long, false, unsigned int>::push_back (this=0x2336ef4 <g_prime_generator>, elem=@0xffbff1f0: 2) at ../src/util/vector.h:284
#1 0x00ae66d4 in prime_generator::prime_generator (this=0x2336ef4 <g_prime_generator>) at ../src/util/prime_generator.cpp:24
#2 0x00ae714c in __static_initialization_and_destruction_0 (__initialize_p=1, __priority=65535) at ../src/util/prime_generator.cpp:99
#3 0x00ae71c4 in _GLOBAL__sub_I_prime_generator.cpp(void) () at ../src/util/prime_generator.cpp:130
#4 0x00b16a68 in __do_global_ctors_aux ()
#5 0x00b16aa0 in _init ()
#6 0x00640b10 in _start ()
(gdb) list
279
280 void push_back(T const & elem) {
281 if (m_data == 0 || reinterpret_cast<SZ *>(m_data)[SIZE_IDX] == reinterpret_cast<SZ *>(m_data)[CAPACITY_IDX]) {
282 expand_vector();
283 }
284 new (m_data + reinterpret_cast<SZ *>(m_data)[SIZE_IDX]) T(elem);
285 reinterpret_cast<SZ *>(m_data)[SIZE_IDX]++;
286 }
287
288 void insert(T const & elem) {
(gdb) ptype SZ
type = unsigned int
(gdb) ptype m_data
type = unsigned long long *
A SIGBUS on Solaris is usually indicative of a misaligned access, but I am not sure whether it is due to the casting going on or an endianness issue.
The SPARC data alignment requirements are most likely at issue. The m_data field in the vector class is offset by two fields that are used to store the size and capacity of the vector. You can debug this by displaying (printing, or using the debugger) the pointer m_data and its alignment.
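As a standalone illustration of that check (a sketch; storage and the 8-byte header offset here are hypothetical stand-ins for the allocation behind m_data):
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    unsigned long long storage[4];
    /* hypothetical stand-in for m_data: element area placed after a two-field
       (size/capacity) header of 2 * sizeof(unsigned int) bytes */
    void *m_data = (unsigned char *)storage + 2 * sizeof(unsigned int);

    /* the pointer is safe for 8-byte accesses only if this prints 0 */
    printf("m_data = %p, misalignment modulo 8 = %lu\n",
           m_data, (unsigned long)((uintptr_t)m_data % 8));
    return 0;
}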
One option, for porting this library utility, is to supply a separate vector implementation where the size and capacity are stored directly as fields in the vector.
Z3 interacts with memory alignment in a few other places (but not overly many). The main other potential places are the watch lists (sat_solver and smt_context), the region memory allocators (region.h), and possibly the hash tables.

Where is the global memory replay overhead coming from?

Running the code below to write 1 GB in global memory in the NVIDIA Visual Profiler, I get:
- 100% storage efficiency
- 69.4% (128.6 GB/s) DRAM utilization
- 18.3% total replay overhead
- 18.3% global memory replay overhead.
The memory writes are supposed to be coalesced and there is no divergence in the kernel, so the question is where is the global memory replay overhead coming from? I am running this on Ubuntu 13.04, with nvidia-cuda-toolkit version 5.0.35-4ubuntu1.
#include <cuda.h>
#include <unistd.h>
#include <getopt.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <stdint.h>
#include <ctype.h>
#include <sched.h>
#include <assert.h>
static void
HandleError( cudaError_t err, const char *file, int line )
{
    if (err != cudaSuccess) {
        printf( "%s in %s at line %d\n", cudaGetErrorString(err), file, line);
        exit( EXIT_FAILURE );
    }
}

#define HANDLE_ERROR(err) (HandleError(err, __FILE__, __LINE__))

// Global memory writes
__global__ void
kernel_write(uint32_t *start, uint32_t entries)
{
    uint32_t tid = threadIdx.x + blockIdx.x*blockDim.x;
    while (tid < entries) {
        start[tid] = tid;
        tid += blockDim.x*gridDim.x;
    }
}

int main(int argc, char *argv[])
{
    uint32_t *gpu_mem;               // Memory pointer
    uint32_t n_blocks  = 256;        // Blocks per grid
    uint32_t n_threads = 192;        // Threads per block
    uint32_t n_bytes   = 1073741824; // Transfer size (1 GB)
    float elapsedTime;               // Elapsed write time

    // Allocate 1 GB of memory on the device
    HANDLE_ERROR( cudaMalloc((void **)&gpu_mem, n_bytes) );

    // Create events
    cudaEvent_t start, stop;
    HANDLE_ERROR( cudaEventCreate(&start) );
    HANDLE_ERROR( cudaEventCreate(&stop) );

    // Write to global memory
    HANDLE_ERROR( cudaEventRecord(start, 0) );
    kernel_write<<<n_blocks, n_threads>>>(gpu_mem, n_bytes/4);
    HANDLE_ERROR( cudaGetLastError() );
    HANDLE_ERROR( cudaEventRecord(stop, 0) );
    HANDLE_ERROR( cudaEventSynchronize(stop) );
    HANDLE_ERROR( cudaEventElapsedTime(&elapsedTime, start, stop) );

    // Report exchange time
    printf("#Delay(ms) BW(GB/s)\n");
    printf("%10.6f %10.6f\n", elapsedTime, 1e-6*n_bytes/elapsedTime);

    // Destroy events
    HANDLE_ERROR( cudaEventDestroy(start) );
    HANDLE_ERROR( cudaEventDestroy(stop) );

    // Free memory
    HANDLE_ERROR( cudaFree(gpu_mem) );

    return 0;
}
The nvprof profiler and the API profiler are giving different results:
$ nvprof --events gst_request ./app
======== NVPROF is profiling app...
======== Command: app
#Delay(ms) BW(GB/s)
13.345920 80.454690
======== Profiling result:
Invocations Avg Min Max Event Name
Device 0
Kernel: kernel_write(unsigned int*, unsigned int)
1 8388608 8388608 8388608 gst_request
$ nvprof --events global_store_transaction ./app
======== NVPROF is profiling app...
======== Command: app
#Delay(ms) BW(GB/s)
9.469216 113.392892
======== Profiling result:
Invocations Avg Min Max Event Name
Device 0
Kernel: kernel_write(unsigned int*, unsigned int)
1 8257560 8257560 8257560 global_store_transaction
I had the impression that global_store_transaction could not be lower than gst_request. What is going on here? I can't ask for both events in the same command, so I had to run two separate commands. Could this be the problem?
Strangely, the API profiler shows different results, with perfect coalescing. Here is the output; I had to run it twice to get all the counters:
$ cat config.txt
inst_issued
inst_executed
gst_request
$ COMPUTE_PROFILE=1 COMPUTE_PROFILE_CSV=1 COMPUTE_PROFILE_LOG=log.csv COMPUTE_PROFILE_CONFIG=config.txt ./app
$ cat log.csv
# CUDA_PROFILE_LOG_VERSION 2.0
# CUDA_DEVICE 0 GeForce GTX 580
# CUDA_CONTEXT 1
# CUDA_PROFILE_CSV 1
# TIMESTAMPFACTOR fffff67eaca946b8
method,gputime,cputime,occupancy,inst_issued,inst_executed,gst_request,gld_request
_Z12kernel_writePjj,7771.776,7806.000,1.000,4737053,3900426,557058,0
$ cat config2.txt
global_store_transaction
$ COMPUTE_PROFILE=1 COMPUTE_PROFILE_CSV=1 COMPUTE_PROFILE_LOG=log2.csv COMPUTE_PROFILE_CONFIG=config2.txt ./app
$ cat log2.csv
# CUDA_PROFILE_LOG_VERSION 2.0
# CUDA_DEVICE 0 GeForce GTX 580
# CUDA_CONTEXT 1
# CUDA_PROFILE_CSV 1
# TIMESTAMPFACTOR fffff67eea92d0e8
method,gputime,cputime,occupancy,global_store_transaction
_Z12kernel_writePjj,7807.584,7831.000,1.000,557058
Here gst_request and global_store_transaction are exactly the same, showing perfect coalescing. Which one is correct (nvprof or the API profiler)? Why does the NVIDIA Visual Profiler say that I have non-coalesced writes? There are still significant instruction replays, and I have no idea where they are coming from :(
Any ideas? I don't think this is hardware malfunctioning, since I have two boards on the same machine and both show the same behavior.
