Optimizing C code for ARM-based devices - ios

Recently, I've stumbled upon an interview question where you need to write a code that's optimized for ARM, especially for iphone:
Write a function which takes an array of char (ASCII symbols) and find
the most frequent character.
char mostFrequentCharacter(char* str, int size)
The function should be optimized to run on dual-core ARM-based
processors, and an infinity amount of memory.
On the face of it, the problem itself looks pretty simple and here is the simple implementation of the function, that came out in my head:
#define RESULT_SIZE 127
inline int set_char(char c, int result[])
{
int count = result[c];
result[c] = ++count;
return count;
}
char mostFrequentChar(char str[], int size)
{
int result[RESULT_SIZE] = {0};
char current_char;
char frequent_char = '\0';
int current_char_frequency = 0;
int char_frequency = 0;
for(size_t i = 0; i<size; i++)
{
current_char = str[i];
current_char_frequency = set_char(current_char, result);
if(current_char_frequency >= char_frequency)
{
char_frequency = current_char_frequency;
frequent_char = current_char;
}
}
return frequent_char;
}
Firstly, I did some basic code optimization; I moved the code, that calculates the most frequent char every iteration, to an additional for loop and got a significant increase in speed, instead of evaluating the following block of code size times
if(current_char_frequency >= char_frequency)
{
char_frequency = current_char_frequency;
frequent_char = current_char;
}
we can find a most frequent char in O(RESULT_SIZE) where RESULT_SIZE == 127.
char mostFrequentCharOpt1(char str[], int size)
{
int result[RESULT_SIZE] = {0};
char frequent_char = '\0';
int current_char_frequency = 0;
int char_frequency = 0;
for(int i = 0; i<size; i++)
{
set_char(str[i], result);
}
for(int i = 0; i<RESULT_SIZE; i++)
{
current_char_frequency = result[i];
if(current_char_frequency >= char_frequency)
{
char_frequency = current_char_frequency;
frequent_char = i;
}
}
return frequent_char;
}
Benchmarks: iPhone 5s
size = 1000000
iterations = 500
// seconds = 7.842381
char mostFrequentChar(char str[], int size)
// seconds = 5.905090
char mostFrequentCharOpt1(char str[], int size)
In average, the mostFrequentCharOpt1 works in ~24% faster than basic implementation.
Type optimization
The ARM cores registers are 32-bits long. Therefore, changing all local variables that has a type char to type int prevents the processor from doing additional instructions to account for the size of the local variable after each assignment.
Note: The ARM64 provides 31 registers (x0-x30) where each register is 64 bits wide and also has a 32-bit form (w0-w30). Hence, there is no need to do something special to operate on int data type.
infocenter.arm.com - ARMv8 Registers
While comparing functions in assembly language version, I've noticed a difference between how the ARM works with int type and char type. The ARM uses LDRB instruction to load byte and STRB instruction to store byte into individual bytes in memory. Thereby, from my point of view, LDRB is a bit slower than LDR, because LDRB do zero-extending every time when accessing a memory and load to register. In other words, we can't just load a byte into the 32-bit registers, we should cast byte to word.
Benchmarks: iPhone 5s
size = 1000000
iterations = 500
// seconds = 5.905090
char mostFrequentCharOpt1(char str[], int size)
// seconds = 5.874684
int mostFrequentCharOpt2(char str[], int size)
Changing char type to int didn't give me a significant increase of speed on iPhone 5s, by way of contrast, running the same code on iPhone 4 gave a different result:
Benchmarks: iPhone 4
size = 1000000
iterations = 500
// seconds = 28.853877
char mostFrequentCharOpt1(char str[], int size)
// seconds = 27.328955
int mostFrequentCharOpt2(char str[], int size)
Loop optimization
Next, I did a loop optimization, where, instead of incrementing i value, I decremented it.
before
for(int i = 0; i<size; i++) { ... }
after
for(int i = size; i--) { ... }
Again, comparing assembly code, gave me a clear distinction between the two approaches.
mostFrequentCharOpt2 | mostFrequentCharOpt3
0x10001250c <+88>: ldr w8, [sp, #28] ; w8 = i | 0x100012694 <+92>: ldr w8, [sp, #28] ; w8 = i
0x100012510 <+92>: ldr w9, [sp, #44] ; w9 = size | 0x100012698 <+96>: sub w9, w8, #1 ; w9 = i - 1
0x100012514 <+96>: cmp w8, w9 ; if i<size | 0x10001269c <+100>: str w9, [sp, #28] ; save w9 to memmory
0x100012518 <+100>: b.ge 0x100012548 ; if true => end loop | 0x1000126a0 <+104>: cbz w8, 0x1000126c4 ; compare w8 with 0 and if w8 == 0 => go to 0x1000126c4
0x10001251c <+104>: ... set_char start routine | 0x1000126a4 <+108>: ... set_char start routine
... | ...
0x100012534 <+128>: ... set_char end routine | 0x1000126bc <+132>: ... set_char end routine
0x100012538 <+132>: ldr w8, [sp, #28] ; w8 = i | 0x1000126c0 <+136>: b 0x100012694 ; back to the first line
0x10001253c <+136>: add w8, w8, #1 ; i++ | 0x1000126c4 <+140>: ...
0x100012540 <+140>: str w8, [sp, #28] ; save i to $sp+28 |
0x100012544 <+144>: b 0x10001250c ; back to the first line |
0x100012548 <+148>: str ... |
Here, in place of accessing size from the memory and comparing it with the i variable, where the i variable, was incrementing, we just decremented i by 0x1 and compared the register, where the i is stored, with 0.
Benchmarks: iPhone 5s
size = 1000000
iterations = 500
// seconds = 5.874684
char mostFrequentCharOpt2(char str[], int size) //Type optimization
// seconds = 5.577797
char mostFrequentCharOpt3(char str[], int size) //Loop otimization
Threading optimization
Reading the question accurately gives us at least one more optimization. This line ..optimized to run on dual-core ARM-based processors ... especially, dropped a hint to optimize the code using pthread or gcd.
int mostFrequentCharThreadOpt(char str[], int size)
{
int s;
int tnum;
int num_threads = THREAD_COUNT; //by default 2
struct thread_info *tinfo;
tinfo = calloc(num_threads, sizeof(struct thread_info));
if (tinfo == NULL)
exit(EXIT_FAILURE);
int minCharCountPerThread = size/num_threads;
int startIndex = 0;
for (tnum = num_threads; tnum--;)
{
startIndex = minCharCountPerThread*tnum;
tinfo[tnum].thread_num = tnum + 1;
tinfo[tnum].startIndex = minCharCountPerThread*tnum;
tinfo[tnum].str_size = (size - minCharCountPerThread*tnum) >= minCharCountPerThread ? minCharCountPerThread : (size - minCharCountPerThread*(tnum-1));
tinfo[tnum].str = str;
s = pthread_create(&tinfo[tnum].thread_id, NULL,
(void *(*)(void *))_mostFrequentChar, &tinfo[tnum]);
if (s != 0)
exit(EXIT_FAILURE);
}
int frequent_char = 0;
int char_frequency = 0;
int current_char_frequency = 0;
for (tnum = num_threads; tnum--; )
{
s = pthread_join(tinfo[tnum].thread_id, NULL);
}
for(int i = RESULT_SIZE; i--; )
{
current_char_frequency = 0;
for (int z = num_threads; z--;)
{
current_char_frequency += tinfo[z].resultArray[i];
}
if(current_char_frequency >= char_frequency)
{
char_frequency = current_char_frequency;
frequent_char = i;
}
}
free(tinfo);
return frequent_char;
}
Benchmarks: iPhone 5s
size = 1000000
iterations = 500
// seconds = 5.874684
char mostFrequentCharOpt3(char str[], int size) //Loop optimization
// seconds = 3.758042
// THREAD_COUNT = 2
char mostFrequentCharThreadOpt(char str[], int size) //Thread otimization
Note: mostFrequentCharThreadOpt works slower than mostFrequentCharOpt2 on iPhone 4.
Benchmarks: iPhone 4
size = 1000000
iterations = 500
// seconds = 25.819347
char mostFrequentCharOpt3(char str[], int size) //Loop optimization
// seconds = 31.541066
char mostFrequentCharThreadOpt(char str[], int size) //Thread otimization
Question
How well optimized is the mostFrequentCharOpt3 and mostFrequentCharThreadOpt, in other words: are there any other methods to optimize both methods?
Source code

Alright, the following things you can try, I can't 100% say what will be effective in your situation, but from experience, if you put all possible optimizations off, and looking at the fact that even loop optimization worked for you: your compiler is pretty numb.
It slightly depends a bit on your THREAD_COUNT, you say its 2 at default, but you might be able to spare some time if you are 100% its 2. You know the platform you work on, don't make anything dynamic without a reason if speed is your priority.
If THREAD == 2, num_threads is a unnecessary variable and can be removed.
int minCharCountPerThread = size/num_threads;
And the olden way to many discussed topic about bit-shifting, try it:
int minCharCountPerThread = size >> 1; //divide by 2
The next thing you can try is unroll your loops: multiple loops are only used 2 times, if size isn't a problem, why not remove the loop aspect?
This is really something you should try, look what happens, and if it useful too you. I've seen cases loop unrolling works great, I've seen cases loop unrolling slows down my code.
Last thing: try using unsigned numbers instead if signed/int (unless you really need signed). It is known that some tricks/instruction are only available for unsigned variables.

There are quite a few things you could do, but the results will really depend on which specific ARM hardware the code is running on. For example, older iPhone hardware is completely different than the newer 64 bit devices. Totally different hardware arch and diff instruction set. Older 32 bit arm hardware contained some real "tricks" that could make things a lot faster like multiple register read/write operation. One example optimization, instead of loading bytes you load while 32 bit words and then operate on each byte in the register using bit shifts. If you are using 2 threads, then another approach can be to break up the memory access so that 1 memory page is processed by 1 thread and then the second thread operates on the 2nd memory page and so on. That way different registers in the different processors can do maximum crunching without reading or writing to the same memory page (and memory access is the slow part typically). I would also suggest that you start with a good timing framework, I built a timing framework for ARM+iOS that you might find useful for that purpose.

Related

How to measure CPU memory bandwidth in C RTOS?

I am working on an ARMv7 Embedded system, which uses an RTOS in the cortex-A7 SOC(2 cores, 256KB L2 Cache).
Now I want to measure the memory bandwidth of the CPU in this system, so I wrote following functions to do the measurement.
The function allocates 32MB memory and does memory reading in 10 loops. And measure the time of the reading, to get the memory reading bandwidth.. (I know there is dcache involved in it so the measurement is not precise).
#define T_MEM_SIZE 0x2000000
static void print_summary(char *tst, uint64_t ticks)
{
uint64_t msz = T_MEM_SIZE/1000000;
float msec = (float)ticks/24000;
printf("%s: %.2f MB/Sec\n", tst, 1000 * msz/msec);
}
static void memrd_cache(void)
{
int *mptr = malloc(T_MEM_SIZE);
register uint32_t i = 0;
register int va = 0;
uint32_t s, e, diff = 0, maxdiff = 0;
uint16_t loop = 0;
if (mptr == NULL)
return;
while (loop++ < 10) {
s = read_cntpct();
for (i = 0; i < T_MEM_SIZE/sizeof(int); i++) {
va = mptr[i];
}
e = read_cntpct();
diff = e - s;
if (diff > maxdiff) {
maxdiff = diff;
}
}
free(mptr);
print_summary("memrd", maxdiff);
}
Below is the measurement of reading, which tries to remove the caching effect of Dcache.
It fills 4Bytes in each cache line (CPU may fill the cacheline), until the 256KB L2 cache is full and be flushed/reloaded, so I think the Dcache effect should be minimized. (I may be wrong, correct me please).
#define CLINE_SIZE 16 // 16 * 4B
static void memrd_nocache(void)
{
int *mptr = malloc(T_MEM_SIZE);
register uint32_t col = 0, ln = 0;
register int va = 0;
uint32_t s, e, diff = 0, maxdiff = 0;
uint16_t loop = 0;
if (mptr == NULL)
return;
while (loop++ < 10) {
s = read_cntpct();
for (col = 0; col < CLINE_SIZE; col++) {
for (ln = 0; ln < T_MEM_SIZE/(CLINE_SIZE*sizeof(int)); ln++) {
va = *(mptr + ln * CLINE_SIZE + col);
}
}
e = read_cntpct();
diff = e - s;
if (diff > maxdiff) {
maxdiff = diff;
}
}
free(mptr);
print_summary("memrd_nocache", maxdiff);
}
After running these 2 functions, I found the bandwith is about,
memrd: 1973.04 MB/Sec
memrd_nocache: 1960.67 MB/Sec
The CPU is running at 1GHz, with DDR3 on dieļ¼Œ the two testing has the similar data!? It is a big surprise to me.
I had worked with lmbench in Linux ARM server, but I don't think it can be ran in this embedded system.
So I want to get a software tool to measure the memory bandwidth in this embedded system, get one from community or do it by myself.

Testing memory bandwidth - Odd results

Hy everyone,
Was asking myself the other day how much different access patterns affected memory read speed (mostly thinking about the frequency vs bus size discussion, and the impact of cache hit rate), so made a small program to test memory speed doing sequential and fully random accesses, but the results I got are quite odd, so I'm not trusting my code.
My idea was quite straightforward, just loop on an array and mov the data to a register. Made 3 versions, one moves 128 bits at a time with sse, the other 32 , and the last one 32 again but doing two movs, the first one loading a random number from an array, and the second one reading from the position specified by the prev value.
I got ~40 GB/s for the sse version, that it's reasonable considering i'm using an i7 4790K with DDR3 1600 cl9 memory at dual channel, that gives about 25 GB/s, so add to that cache and it feels ok, but then I got 3.3 GB/s for the normal sequential, and the worst, 15 GB/s for the random one. That last result makes me think that the bench is bogus.
Below is the code, if anyone could shed some light on this it would be appreciated. Did the inner loop in assembly to make sure it only did a mov.
EDIT: Managed to get a bit more performance by using vlddqu ymm0, buffL[esi] (avx) instead of movlps, went from 38 GB/s to 41 GB/s
EDIT 2: Did some more testing, unrolling the inner assembly loop, making a version that loads 4 times per iteration and another one that loads 8 times. Got ~35 GB/s for the x4 version and ~24 GB/s for the x8 version
#define PASSES 1000000
double bw = 0;
int main()
{
cout << "Running : ";
bw = 0;
for(int n = 0; n < PASSES;n++)
{
if(n % 100000 == 0) cout << ".";
const int l = 1 << 16;
int buffL[l];
LARGE_INTEGER frequency; // ticks per second
LARGE_INTEGER t1, t2; // ticks
// get ticks per second
QueryPerformanceFrequency(&frequency);
// start timer
QueryPerformanceCounter(&t1);
int maxByte = l*4;
__asm
{
push esi
mov esi,0
loopL0:
movlps xmm0, buffL[esi]
add esi,16
cmp esi,maxByte
jb loopL0
pop esi
}
// stop timer
QueryPerformanceCounter(&t2);
// compute elapsed time in millisec
double ms = (t2.QuadPart - t1.QuadPart) * 1000.0 / frequency.QuadPart;
bw += (double(4ull*l)/1073741824.0) / (double(ms)*0.001);
}
bw /= double(PASSES);
cout << endl;
cout << " Sequential (SSE) : " << bw << " GB/s " << endl;
cout << "Running : ";
bw = 0;
for(int n = 0; n < PASSES;n++)
{
if(n % 100000 == 0) cout << ".";
const int l = 1 << 16;
int buffL[l];
for(int t = 0;t < l;t++) buffL[t] = (t+1)*4;
LARGE_INTEGER frequency; // ticks per second
LARGE_INTEGER t1, t2; // ticks
// get ticks per second
QueryPerformanceFrequency(&frequency);
// start timer
QueryPerformanceCounter(&t1);
int maxByte = l*4;
__asm
{
push esi
mov esi,0
loopL1:
mov esi, buffL[esi]
cmp esi,maxByte
jb loopL1
pop esi
}
// stop timer
QueryPerformanceCounter(&t2);
// compute elapsed time in millisec
double ms = (t2.QuadPart - t1.QuadPart) * 1000.0 / frequency.QuadPart;
bw += (double(4ull*l)/1073741824.0) / (double(ms)*0.001);
}
bw /= double(PASSES);
cout << endl;
cout << " Sequential : " << bw << " GB/s " << endl;
cout << "Running : ";
bw = 0;
for(int n = 0; n < PASSES;n++)
{
if(n % 100000 == 0) cout << ".";
const int l = 1 << 14;
int buffL[l];
int maxByte = l*4;
int roffset[l];
for(int t = 0;t < l;t++) roffset[t] = (rand()*4) % maxByte;
LARGE_INTEGER frequency; // ticks per second
LARGE_INTEGER t1, t2; // ticks
// get ticks per second
QueryPerformanceFrequency(&frequency);
// start timer
QueryPerformanceCounter(&t1);
__asm
{
push esi
push edi
mov esi,0
loopL2:
mov edi, roffset[esi]
mov edi, buffL[edi]
add esi,4
cmp esi,maxByte
jb loopL2
pop edi
pop esi
}
// stop timer
QueryPerformanceCounter(&t2);
// compute elapsed time in millisec
double ms = (t2.QuadPart - t1.QuadPart) * 1000.0 / frequency.QuadPart;
bw += (double(2*4ull*l)/1073741824.0) / (double(ms)*0.001);
}
bw /= double(PASSES);
cout << endl;
cout << " Random : " << bw << " GB/s " << endl;
return 0;
}
Gathering the measurement code into a Bandwidth class, creating some constants, having all three tests use the same buffer (and size) aligning the tops of the loops and computing random offset into the entire buffer (3rd test):
#include "stdafx.h"
#include "windows.h"
#include <iostream>
#include <vector>
using namespace std;
constexpr size_t passes = 1000000;
constexpr size_t buffsize = 64 * 1024;
constexpr double gigabyte = 1024.0 * 1024.0 * 1024.0;
constexpr double gb_per_test = double(long long(buffsize) * passes) / gigabyte;
struct Bandwidth
{
LARGE_INTEGER pc_tick_per_sec;
LARGE_INTEGER start_pc;
const char* _label;
public:
Bandwidth(const char* label): _label(label)
{
cout << "Running : ";
QueryPerformanceFrequency(&pc_tick_per_sec);
QueryPerformanceCounter(&start_pc);
}
~Bandwidth() {
LARGE_INTEGER end_pc{};
QueryPerformanceCounter(&end_pc);
const auto seconds = double(end_pc.QuadPart - start_pc.QuadPart) / pc_tick_per_sec.QuadPart;
cout << "\n " << _label << ": " << gb_per_test / seconds << " GB/s " << endl;
}
};
int wmain()
{
vector<char> buff(buffsize, 0);
const auto buff_begin = buff.data();
const auto buff_end = buff.data()+buffsize;
{
Bandwidth b("Sequential (SSE)");
for (size_t n = 0; n < passes; ++n) {
__asm {
push esi
push edi
mov esi, buff_begin
mov edi, buff_end
align 16
loopL0:
movlps xmm0, [esi]
lea esi, [esi + 16]
cmp esi, edi
jne loopL0
pop edi
pop esi
}
}
}
{
Bandwidth b("Sequential (DWORD)");
for (int n = 0; n < passes; ++n) {
__asm {
push esi
push edi
mov esi, buff
mov edi, buff_end
align 16
loopL1:
mov eax, [esi]
lea esi, [esi + 4]
cmp esi, edi
jne loopL1
pop edi
pop esi
}
}
}
{
uint32_t* roffset[buffsize];
for (auto& roff : roffset)
roff = (uint32_t*)(buff.data())+(uint32_t)(double(rand()) / RAND_MAX * (buffsize / sizeof(int)));
const auto roffset_end = end(roffset);
Bandwidth b("Random");
for (int n = 0; n < passes; ++n) {
__asm {
push esi
push edi
push ebx
lea edi, roffset //begin(roffset)
mov ebx, roffset_end //end(roffset)
align 16
loopL2:
mov esi, [edi] //fetch the next random offset
mov eax, [esi] //read from the random location
lea edi, [edi + 4] // point to the next random offset
cmp edi, ebx //are we done?
jne loopL2
pop ebx
pop edi
pop esi
}
}
}
}
I have also found more consistent results if I SetPriorityClass(GetCurrentProcess, HIGH_PRIORITY_CLASS); and SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_TIME_CRITICAL);
Your second test has one array on the stack that is 1 << 16 in size. That's 64k. Or more easier to read:
int buffL[65536];
Your third test has two arrays on the stack. Both at `1 << 14' in size. That's 16K each
int buffL[16384];
int roffset[16384];
So right away you are using a much smaller stack size (i.e. fewer pages being cached and swapped out). I think your loop is only iterating half as many times in the third test as it is in the second. Maybe you meant to declare 1 << 15 (or 1 << 16) as the size instead for each array instead?

realloc() in a 64bit iOS device

When I use the C function realloc(p,size) in my project, the code runs well in both the simulator and on an iPhone 5.
However, when the code is running on an iPhone 6 plus, some odd things happen. The contents a points to are changed! What's worse, the output is different every time the function runs!
Here is the test code:
#define LENGTH (16)
- (void)viewDidLoad {
[super viewDidLoad];
char* a = (char*)malloc(LENGTH+1);
a[0] = 33;
a[1] = -14;
a[2] = 76;
a[3] = -128;
a[4] = 25;
a[5] = 49;
a[6] = -45;
a[7] = 56;
a[8] = -36;
a[9] = 56;
a[10] = 125;
a[11] = 44;
a[12] = 26;
a[13] = 79;
a[14] = 29;
a[15] = 66;
//print binary contents pointed by a, print each char in int can also see the differene
[self printInByte:a];
realloc(a, LENGTH);
[self printInByte:a]
free(a);
}
-(void) printInByte: (char*)t{
for(int i = 0;i < LENGTH ;++i){
for(int j = -1;j < 16;(t[i] >> ++j)& 1?printf("1"):printf("0"));
printf(" ");
if (i % 4 == 3) {
printf("\n");
}
}
}
One more thing, when LENGTH is not 16, it runs well with assigning up to a[LENGTH-1]. However, if LENGTH is exactly 16, things go wrong.
Any ideas will be appreciated.
From the realloc man doc:
The realloc() function tries to change the size of the allocation
pointed
to by ptr to size, and returns ptr. If there is not enough room to
enlarge the memory allocation pointed to by ptr, realloc() creates a new
allocation, copies as much of the old data pointed to by ptr as will fit
to the new allocation, frees the old allocation, and returns a pointer to
the allocated memory.
So, you must write: a = realloc(a, LENGTH); because in the case where the already allocated memory block can't be made larger, the zone will be moved to another location, and so, the adress of the zone will change. That is why realloc return a memory pointer.
You need to do:
a = realloc(a, LENGTH);
The reason is that realloc() may move the block, so you need to update your pointer to the new address:
http://www.cplusplus.com/reference/cstdlib/realloc/

cudaFree is not freeing memory

The code below calculates the dot product of two vectors a and b. The correct result is 8192. When I run it for the first time the result is correct. Then when I run it for the second time the result is the previous result + 8192 and so on:
1st iteration: result = 8192
2nd iteration: result = 8192 + 8192
3rd iteration: result = 8192 + 8192
and so on.
I checked by printing it on screen and the device variable dev_c is not freed. What's more writing to it causes something like a sum, the result beeing the previous value plus the new one being written to it. I guess that could be something with the atomicAdd() operation, but nonetheless cudaFree(dev_c) should erase it after all.
#define N 8192
#define THREADS_PER_BLOCK 512
#define NUMBER_OF_BLOCKS (N/THREADS_PER_BLOCK)
#include <stdio.h>
__global__ void dot( int *a, int *b, int *c ) {
__shared__ int temp[THREADS_PER_BLOCK];
int index = threadIdx.x + blockIdx.x * blockDim.x;
temp[threadIdx.x] = a[index] * b[index];
__syncthreads();
if( 0 == threadIdx.x ) {
int sum = 0;
for( int i= 0; i< THREADS_PER_BLOCK; i++ ){
sum += temp[i];
}
atomicAdd(c,sum);
}
}
int main( void ) {
int *a, *b, *c;
int *dev_a, *dev_b, *dev_c;
int size = N * sizeof( int);
cudaMalloc( (void**)&dev_a, size );
cudaMalloc( (void**)&dev_b, size );
cudaMalloc( (void**)&dev_c, sizeof(int));
a = (int*)malloc(size);
b = (int*)malloc(size);
c = (int*)malloc(sizeof(int));
for(int i = 0 ; i < N ; i++){
a[i] = 1;
b[i] = 1;
}
cudaMemcpy( dev_a, a, size, cudaMemcpyHostToDevice);
cudaMemcpy( dev_b, b, size, cudaMemcpyHostToDevice);
dot<<< N/THREADS_PER_BLOCK,THREADS_PER_BLOCK>>>( dev_a, dev_b, dev_c);
cudaMemcpy( c, dev_c, sizeof(int) , cudaMemcpyDeviceToHost);
printf("Dot product = %d\n", *c);
cudaFree(dev_a);
cudaFree(dev_b);
cudaFree(dev_c);
free(a);
free(b);
free(c);
return 0;
}
cudaFree doesn't erase anything, it simply returns memory to a pool to be re-allocated. cudaMalloc doesn't guarantee the value of memory that has been allocated. You need to initialize memory (both global and shared) that your program uses, in order to have consistent results. The same is true for malloc and free, by the way.
From the documentation of cudaMalloc();
The memory is not cleared.
That means that dev_c is not initialized, and your atomicAdd(c,sum); will add to any random value that happens to be stored in memory at the returned position.

Cuda-memcheck not reporting out of bounds shared memory access

I am runnig the follwoing code using shared memory:
__global__ void computeAddShared(int *in , int *out, int sizeInput){
//not made parameters gidata and godata to emphasize that parameters get copy of address and are different from pointers in host code
extern __shared__ float temp[];
int tid = blockIdx.x * blockDim.x + threadIdx.x;
int ltid = threadIdx.x;
temp[ltid] = 0;
while(tid < sizeInput){
temp[ltid] += in[tid];
tid+=gridDim.x * blockDim.x; // to handle array of any size
}
__syncthreads();
int offset = 1;
while(offset < blockDim.x){
if(ltid % (offset * 2) == 0){
temp[ltid] = temp[ltid] + temp[ltid + offset];
}
__syncthreads();
offset*=2;
}
if(ltid == 0){
out[blockIdx.x] = temp[0];
}
}
int main(){
int size = 16; // size of present input array. Changes after every loop iteration
int cidata[] = {1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16};
/*FILE *f;
f = fopen("invertedList.txt" , "w");
a[0] = 1 + (rand() % 8);
fprintf(f, "%d,",a[0]);
for( int i = 1 ; i< N; i++){
a[i] = a[i-1] + (rand() % 8) + 1;
fprintf(f, "%d,",a[i]);
}
fclose(f);*/
int* gidata;
int* godata;
cudaMalloc((void**)&gidata, size* sizeof(int));
cudaMemcpy(gidata,cidata, size * sizeof(int), cudaMemcpyHostToDevice);
int TPB = 4;
int blocks = 10; //to get things kicked off
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, 0);
while(blocks != 1 ){
if(size < TPB){
TPB = size; // size is 2^sth
}
blocks = (size+ TPB -1 ) / TPB;
cudaMalloc((void**)&godata, blocks * sizeof(int));
computeAddShared<<<blocks, TPB,TPB>>>(gidata, godata,size);
cudaFree(gidata);
gidata = godata;
size = blocks;
}
//printf("The error by cuda is %s",cudaGetErrorString(cudaGetLastError()));
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
float elapsedTime;
cudaEventElapsedTime(&elapsedTime , start, stop);
printf("time is %f ms", elapsedTime);
int *output = (int*)malloc(sizeof(int));
cudaMemcpy(output, gidata, sizeof(int), cudaMemcpyDeviceToHost);
//Cant free either earlier as both point to same location
cudaError_t chk = cudaFree(godata);
if(chk!=0){
printf("First chk also printed error. Maybe error in my logic\n");
}
printf("The error by threadsyn is %s", cudaGetErrorString(cudaGetLastError()));
printf("The sum of the array is %d\n", output[0]);
getchar();
return 0;
}
Clearly, the first while loop in computeAddShared is causing out of bounds error because I am allocating 4 bytes to shared memory. Why does cudamemcheck not catch this. Below is the output of cuda-memcheck
========= CUDA-MEMCHECK
time is 12.334816 msThe error by threadsyn is no errorThe sum of the array is 13
6
========= ERROR SUMMARY: 0 errors
Shared memory allocation granularity. The Hardware undoubtedly has a page size for allocations (probably the same as the L1 cache line side). With only 4 threads per block, there will "accidentally" be enough shared memory in a single page to let you code work. If you used a sensible number of threads block (ie. a round multiple of the warp size) the error would be detected because there would not be enough allocated memory.

Resources