realloc() in a 64bit iOS device - ios

When I use the C function realloc(p,size) in my project, the code runs well in both the simulator and on an iPhone 5.
However, when the code is running on an iPhone 6 plus, some odd things happen. The contents a points to are changed! What's worse, the output is different every time the function runs!
Here is the test code:
#define LENGTH (16)
- (void)viewDidLoad {
[super viewDidLoad];
char* a = (char*)malloc(LENGTH+1);
a[0] = 33;
a[1] = -14;
a[2] = 76;
a[3] = -128;
a[4] = 25;
a[5] = 49;
a[6] = -45;
a[7] = 56;
a[8] = -36;
a[9] = 56;
a[10] = 125;
a[11] = 44;
a[12] = 26;
a[13] = 79;
a[14] = 29;
a[15] = 66;
//print binary contents pointed by a, print each char in int can also see the differene
[self printInByte:a];
realloc(a, LENGTH);
[self printInByte:a]
free(a);
}
-(void) printInByte: (char*)t{
for(int i = 0;i < LENGTH ;++i){
for(int j = -1;j < 16;(t[i] >> ++j)& 1?printf("1"):printf("0"));
printf(" ");
if (i % 4 == 3) {
printf("\n");
}
}
}
One more thing, when LENGTH is not 16, it runs well with assigning up to a[LENGTH-1]. However, if LENGTH is exactly 16, things go wrong.
Any ideas will be appreciated.

From the realloc man doc:
The realloc() function tries to change the size of the allocation
pointed
to by ptr to size, and returns ptr. If there is not enough room to
enlarge the memory allocation pointed to by ptr, realloc() creates a new
allocation, copies as much of the old data pointed to by ptr as will fit
to the new allocation, frees the old allocation, and returns a pointer to
the allocated memory.
So, you must write: a = realloc(a, LENGTH); because in the case where the already allocated memory block can't be made larger, the zone will be moved to another location, and so, the adress of the zone will change. That is why realloc return a memory pointer.

You need to do:
a = realloc(a, LENGTH);
The reason is that realloc() may move the block, so you need to update your pointer to the new address:
http://www.cplusplus.com/reference/cstdlib/realloc/

Related

Explicit memory prefetching for Intel Compilers

I have two functions, one which calculates the difference between successive elements of a row and the second calculates the successive difference between values in a column. Therefore one would calculate M[i][j+1] -M[i][j] and second would do M[i+1][j] - M[i][j], M being the matrix. I implement them as follows -
inline void firstFunction(uchar* input, uchar* output, size_t M, size_t N){
for(int i=0; i < M; i++){
for(int j=0; j <=N - 33; j+=32){
auto pos = i*N + j;
_mm256_storeu_epi8(output + pos, _mm256_sub_epi8(_mm256_loadu_epi8(input + pos + 1), _mm256_loadu_epi8(input + pos)));
}
}
}
void secondFunction(uchar* input, uchar* output, size_t M, size_t N){
for(int i = 0; i < M-1; i++){
//#pragma prefetch input : (i+1)*N : (i+1)*N + N
for(int j = 0; j <N-33; j+=32){
auto idx = i * N + j;
auto idx_1 = (i+1)*N + j;
_mm256_storeu_epi8(output + idx, _mm256_sub_epi8(_mm256_loadu_epi8(input + idx_1), _mm256_loadu_epi8(input + idx)));
}
}
However, Benchmarking them, Average runtimes for the first and second function are as follows -
firstFunction = 21.1432ms
secondFunction = 166.851ms
Where the size of matrix is M = 9024 and N = 12032
This is a huge increase in the runtime for a similar operation. I suspect this has something to do with memory accesses and caching, where way more cycles are spent in getting the memory from another row in the second case.
So my question is two-part.
Is my reasoning for the difference in runtimes correct.
How do I alleviate it. My first idea is to prefetch the second row in the memory and go ahead, but I am not able to prefetch a dynamically calculated position. Would _mm_prefetch help if the issue is indeed of the memory access times
I am using the dpcpp compiler. with compile options as -g -O3 -fsycl -fsycl-targets=spir64 -mavx512f -mavx512vl -mavx512bw -qopenmp -liomp5 -lpthread. This compiler has a pragma prefetch but it does not allow runtime calculated prefetches. However, I would really appreciate something which is not specific to the compiler and it could also be spefic to GCC.
Edit1 - Just tried _mm_prefetch, but that too throws error: argument to 'error: argument to '__builtin_prefetch' must be a constant integer _mm_prefetch(input + (i+1) * N, N);. So an additional question, is there any way we can prefetch runtime calculated memory locations ?
TIA

warning X3557: loop only executes for 0 iteration(s), forcing loop to unroll

The compiler produce a "warning X3557: loop only executes for 0 iteration(s), forcing loop to unroll" and I don't understand why.
Here is the source code. It is a revisited itoa() function for HLSL producing resulting ascii codes in an array of uint.
#define ITOA_BUFFER_SIZE 16
// Convert uint to ascii and return number of characters
uint UIntToAscii(
in uint Num, // Number to convert
out uint Buf[ITOA_BUFFER_SIZE], // Where to put resulting ascii codes
in uint Base) // Numeration base for convertion
{
uint I, J, K;
I = 0;
while (I < (ITOA_BUFFER_SIZE - 1)) { // <==== Warning X3557
uint Digit = Num % Base;
if (Digit < 10)
Buf[I++] = '0' + Digit;
else
Buf[I++] = 'A' + Digit - 10;
if ((Num /= Base) == 0)
break;
}
// Reverse buffer
for (K = 0, J = I - 1; K < J; K++, J--) { // <==== Warning X3557
uint T = Buf[K];
Buf[K] = Buf[J];
Buf[J] = T;
}
// Fill remaining of buffer with zeros to make compiler happy
K = I;
while (K < ITOA_BUFFER_SIZE)
Buf[K++] = 0;
return I;
}
I tried to rewrite the while loop but this doesn't change anything. Also tried to use attribute [fastopt] without success. As far as I can see the function produce the correct result.
Any help appreciated.
The warning you are getting is
WAR_TOO_SIMPLE_LOOP 3557 The loop only executes for a limited number
of iterations or doesn't seem to do anything so consider removing it
or forcing it to unroll.
The warning is pretty much self explanatory, if you consider that loops are considered inefficient in GPGPU, so the compiler tries to unroll them when it's possible. What the compiler is telling you is that you created some loops that can run more efficiently if unrolled, or can be removed because they never run. If a loop is unrollable, it means that you can predict at compile time the number of times it will run. Your loops on first look should not fulfill this criterium.
I = 0;
while (I < (ITOA_BUFFER_SIZE - 1)) { // <==== Warning X3557
uint Digit = Num % Base;
if (Digit < 10)
Buf[I++] = '0' + Digit;
else
Buf[I++] = 'A' + Digit - 10;
if ((Num /= Base) == 0)
break;
}
This while loop runs up to 15 times I < (ITOA_BUFFER_SIZE - 1), depending on (Num /= Base) == 0. Final value of I is between 1 and 15, depending on how if ((Num /= Base) == 0) evaluates on each cycle. Nonetheless, it still is unrollable, because the compiler may still insert a conditional jump over the iterations.
// Reverse buffer
for (K = 0, J = I - 1; K < J; K++, J--) { // <==== Warning X3557
uint T = Buf[K];
Buf[K] = Buf[J];
Buf[J] = T;
}
This second loop, instead should not be unrollable, because I should not be known to the compiler.
The warning you reported
warning X3557: loop only executes for 0 iteration(s), forcing loop to unroll
might refer to the first loop if if ((Num /= Base) == 0) always evaluates to true on first iteration. In that case, I would be equal to 1, and J would be equal to 0 on the second loop. That second loop would not run because K < J would evaluate to false on first iteration.
What you get in the end if you let it [unroll] is probably a single iteration on the while loop and the complete removal of the subsequent for loop. I highly suspect this is not your intended behaviour, and while it might suppress the warning, you might want to check the code and see if something does not run the way it should.

Optimizing C code for ARM-based devices

Recently, I've stumbled upon an interview question where you need to write a code that's optimized for ARM, especially for iphone:
Write a function which takes an array of char (ASCII symbols) and find
the most frequent character.
char mostFrequentCharacter(char* str, int size)
The function should be optimized to run on dual-core ARM-based
processors, and an infinity amount of memory.
On the face of it, the problem itself looks pretty simple and here is the simple implementation of the function, that came out in my head:
#define RESULT_SIZE 127
inline int set_char(char c, int result[])
{
int count = result[c];
result[c] = ++count;
return count;
}
char mostFrequentChar(char str[], int size)
{
int result[RESULT_SIZE] = {0};
char current_char;
char frequent_char = '\0';
int current_char_frequency = 0;
int char_frequency = 0;
for(size_t i = 0; i<size; i++)
{
current_char = str[i];
current_char_frequency = set_char(current_char, result);
if(current_char_frequency >= char_frequency)
{
char_frequency = current_char_frequency;
frequent_char = current_char;
}
}
return frequent_char;
}
Firstly, I did some basic code optimization; I moved the code, that calculates the most frequent char every iteration, to an additional for loop and got a significant increase in speed, instead of evaluating the following block of code size times
if(current_char_frequency >= char_frequency)
{
char_frequency = current_char_frequency;
frequent_char = current_char;
}
we can find a most frequent char in O(RESULT_SIZE) where RESULT_SIZE == 127.
char mostFrequentCharOpt1(char str[], int size)
{
int result[RESULT_SIZE] = {0};
char frequent_char = '\0';
int current_char_frequency = 0;
int char_frequency = 0;
for(int i = 0; i<size; i++)
{
set_char(str[i], result);
}
for(int i = 0; i<RESULT_SIZE; i++)
{
current_char_frequency = result[i];
if(current_char_frequency >= char_frequency)
{
char_frequency = current_char_frequency;
frequent_char = i;
}
}
return frequent_char;
}
Benchmarks: iPhone 5s
size = 1000000
iterations = 500
// seconds = 7.842381
char mostFrequentChar(char str[], int size)
// seconds = 5.905090
char mostFrequentCharOpt1(char str[], int size)
In average, the mostFrequentCharOpt1 works in ~24% faster than basic implementation.
Type optimization
The ARM cores registers are 32-bits long. Therefore, changing all local variables that has a type char to type int prevents the processor from doing additional instructions to account for the size of the local variable after each assignment.
Note: The ARM64 provides 31 registers (x0-x30) where each register is 64 bits wide and also has a 32-bit form (w0-w30). Hence, there is no need to do something special to operate on int data type.
infocenter.arm.com - ARMv8 Registers
While comparing functions in assembly language version, I've noticed a difference between how the ARM works with int type and char type. The ARM uses LDRB instruction to load byte and STRB instruction to store byte into individual bytes in memory. Thereby, from my point of view, LDRB is a bit slower than LDR, because LDRB do zero-extending every time when accessing a memory and load to register. In other words, we can't just load a byte into the 32-bit registers, we should cast byte to word.
Benchmarks: iPhone 5s
size = 1000000
iterations = 500
// seconds = 5.905090
char mostFrequentCharOpt1(char str[], int size)
// seconds = 5.874684
int mostFrequentCharOpt2(char str[], int size)
Changing char type to int didn't give me a significant increase of speed on iPhone 5s, by way of contrast, running the same code on iPhone 4 gave a different result:
Benchmarks: iPhone 4
size = 1000000
iterations = 500
// seconds = 28.853877
char mostFrequentCharOpt1(char str[], int size)
// seconds = 27.328955
int mostFrequentCharOpt2(char str[], int size)
Loop optimization
Next, I did a loop optimization, where, instead of incrementing i value, I decremented it.
before
for(int i = 0; i<size; i++) { ... }
after
for(int i = size; i--) { ... }
Again, comparing assembly code, gave me a clear distinction between the two approaches.
mostFrequentCharOpt2 | mostFrequentCharOpt3
0x10001250c <+88>: ldr w8, [sp, #28] ; w8 = i | 0x100012694 <+92>: ldr w8, [sp, #28] ; w8 = i
0x100012510 <+92>: ldr w9, [sp, #44] ; w9 = size | 0x100012698 <+96>: sub w9, w8, #1 ; w9 = i - 1
0x100012514 <+96>: cmp w8, w9 ; if i<size | 0x10001269c <+100>: str w9, [sp, #28] ; save w9 to memmory
0x100012518 <+100>: b.ge 0x100012548 ; if true => end loop | 0x1000126a0 <+104>: cbz w8, 0x1000126c4 ; compare w8 with 0 and if w8 == 0 => go to 0x1000126c4
0x10001251c <+104>: ... set_char start routine | 0x1000126a4 <+108>: ... set_char start routine
... | ...
0x100012534 <+128>: ... set_char end routine | 0x1000126bc <+132>: ... set_char end routine
0x100012538 <+132>: ldr w8, [sp, #28] ; w8 = i | 0x1000126c0 <+136>: b 0x100012694 ; back to the first line
0x10001253c <+136>: add w8, w8, #1 ; i++ | 0x1000126c4 <+140>: ...
0x100012540 <+140>: str w8, [sp, #28] ; save i to $sp+28 |
0x100012544 <+144>: b 0x10001250c ; back to the first line |
0x100012548 <+148>: str ... |
Here, in place of accessing size from the memory and comparing it with the i variable, where the i variable, was incrementing, we just decremented i by 0x1 and compared the register, where the i is stored, with 0.
Benchmarks: iPhone 5s
size = 1000000
iterations = 500
// seconds = 5.874684
char mostFrequentCharOpt2(char str[], int size) //Type optimization
// seconds = 5.577797
char mostFrequentCharOpt3(char str[], int size) //Loop otimization
Threading optimization
Reading the question accurately gives us at least one more optimization. This line ..optimized to run on dual-core ARM-based processors ... especially, dropped a hint to optimize the code using pthread or gcd.
int mostFrequentCharThreadOpt(char str[], int size)
{
int s;
int tnum;
int num_threads = THREAD_COUNT; //by default 2
struct thread_info *tinfo;
tinfo = calloc(num_threads, sizeof(struct thread_info));
if (tinfo == NULL)
exit(EXIT_FAILURE);
int minCharCountPerThread = size/num_threads;
int startIndex = 0;
for (tnum = num_threads; tnum--;)
{
startIndex = minCharCountPerThread*tnum;
tinfo[tnum].thread_num = tnum + 1;
tinfo[tnum].startIndex = minCharCountPerThread*tnum;
tinfo[tnum].str_size = (size - minCharCountPerThread*tnum) >= minCharCountPerThread ? minCharCountPerThread : (size - minCharCountPerThread*(tnum-1));
tinfo[tnum].str = str;
s = pthread_create(&tinfo[tnum].thread_id, NULL,
(void *(*)(void *))_mostFrequentChar, &tinfo[tnum]);
if (s != 0)
exit(EXIT_FAILURE);
}
int frequent_char = 0;
int char_frequency = 0;
int current_char_frequency = 0;
for (tnum = num_threads; tnum--; )
{
s = pthread_join(tinfo[tnum].thread_id, NULL);
}
for(int i = RESULT_SIZE; i--; )
{
current_char_frequency = 0;
for (int z = num_threads; z--;)
{
current_char_frequency += tinfo[z].resultArray[i];
}
if(current_char_frequency >= char_frequency)
{
char_frequency = current_char_frequency;
frequent_char = i;
}
}
free(tinfo);
return frequent_char;
}
Benchmarks: iPhone 5s
size = 1000000
iterations = 500
// seconds = 5.874684
char mostFrequentCharOpt3(char str[], int size) //Loop optimization
// seconds = 3.758042
// THREAD_COUNT = 2
char mostFrequentCharThreadOpt(char str[], int size) //Thread otimization
Note: mostFrequentCharThreadOpt works slower than mostFrequentCharOpt2 on iPhone 4.
Benchmarks: iPhone 4
size = 1000000
iterations = 500
// seconds = 25.819347
char mostFrequentCharOpt3(char str[], int size) //Loop optimization
// seconds = 31.541066
char mostFrequentCharThreadOpt(char str[], int size) //Thread otimization
Question
How well optimized is the mostFrequentCharOpt3 and mostFrequentCharThreadOpt, in other words: are there any other methods to optimize both methods?
Source code
Alright, the following things you can try, I can't 100% say what will be effective in your situation, but from experience, if you put all possible optimizations off, and looking at the fact that even loop optimization worked for you: your compiler is pretty numb.
It slightly depends a bit on your THREAD_COUNT, you say its 2 at default, but you might be able to spare some time if you are 100% its 2. You know the platform you work on, don't make anything dynamic without a reason if speed is your priority.
If THREAD == 2, num_threads is a unnecessary variable and can be removed.
int minCharCountPerThread = size/num_threads;
And the olden way to many discussed topic about bit-shifting, try it:
int minCharCountPerThread = size >> 1; //divide by 2
The next thing you can try is unroll your loops: multiple loops are only used 2 times, if size isn't a problem, why not remove the loop aspect?
This is really something you should try, look what happens, and if it useful too you. I've seen cases loop unrolling works great, I've seen cases loop unrolling slows down my code.
Last thing: try using unsigned numbers instead if signed/int (unless you really need signed). It is known that some tricks/instruction are only available for unsigned variables.
There are quite a few things you could do, but the results will really depend on which specific ARM hardware the code is running on. For example, older iPhone hardware is completely different than the newer 64 bit devices. Totally different hardware arch and diff instruction set. Older 32 bit arm hardware contained some real "tricks" that could make things a lot faster like multiple register read/write operation. One example optimization, instead of loading bytes you load while 32 bit words and then operate on each byte in the register using bit shifts. If you are using 2 threads, then another approach can be to break up the memory access so that 1 memory page is processed by 1 thread and then the second thread operates on the 2nd memory page and so on. That way different registers in the different processors can do maximum crunching without reading or writing to the same memory page (and memory access is the slow part typically). I would also suggest that you start with a good timing framework, I built a timing framework for ARM+iOS that you might find useful for that purpose.

Kiss fft does not work after giving it more than 32 samples

I am trying to take data from an accelerometer and apply Kiss FFT to the samples. I'm using a Freescale Kinetis FRDM-K22F board. I want to use 64 samples, but when I run the program I get an error saying "kiss fft usage error: improper alloc" I started turning down the sample size and saw that the FFT does work with 32 samples, but giving it 33 samples the program just stops and returns no errors. Giving it any more samples gives similar results.
I played around with how I set up the FFT and followed a few websites and forum posts:
KissFFT output of kiss_fftr
http://digiphd.com/programming-reconstruction-fast-fourier-transform-real-signal-kiss-fft-libraries/
Kiss FFT on a dsPIC33
From what I can see, I haven't done anything different from what the above websites and forums have done. I've included my code below. Any help or advice is greatly appreciated.
void Sample_RUN()
{
int size = 64;
kiss_fft_scalar zero;
memset(&zero,0,sizeof(zero));
kiss_fft_cpx fft_in[size];
kiss_fft_cpx fft_out[size];
kiss_fftr_cfg fft = kiss_fftr_alloc(size*2 ,0 ,NULL,NULL);
signed short samples[size];
for (int i = 0; i < size; i++) {
fft_in[i].r = zero;
fft_in[i].i = zero;
fft_out[i].r = zero;
fft_out[i].i = zero;
}
printf("Data Collection Begins \r\n");
for(int j = 0; j < size; j++)
{
for(;;)
{
dr_status = My_I2C_ReadByte(STATUS_REG);
dr_status = (dr_status & 0x04);
if (dr_status == 0x04)
{
//READING FROM ACCEL OUTPUT DATA REGISTERS
AccelData[0] = My_I2C_ReadByte(OUT_X_MSB_REG);
AccelData[1] = My_I2C_ReadByte(OUT_X_LSB_REG);
AccelData[2] = My_I2C_ReadByte(OUT_Y_MSB_REG);
AccelData[3] = My_I2C_ReadByte(OUT_Y_LSB_REG);
AccelData[4] = My_I2C_ReadByte(OUT_Z_MSB_REG);
AccelData[5] = My_I2C_ReadByte(OUT_Z_LSB_REG);
// 14-bit accelerometer data
Xout_Accel_14_bit = ((signed short) (AccelData[0]<<8 | AccelData[1])) >> 2; // Compute 16-bit X-axis acceleration output value
Yout_Accel_14_bit = ((signed short) (AccelData[2]<<8 | AccelData[3])) >> 2; // Compute 16-bit Y-axis acceleration output value
Zout_Accel_14_bit = ((signed short) (AccelData[4]<<8 | AccelData[5])) >> 2; // Compute 16-bit Z-axis acceleration output value
mag_accel = sqrt(pow(Xout_Accel_14_bit, 2) + pow(Yout_Accel_14_bit, 2) + pow(Zout_Accel_14_bit, 2) );
printf("%d \r\n", mag_accel);
samples[j] = mag_accel;
break;
} // end if
} // end infinite for
} // end for
for (int j = 0; j < size; j++)
{
fft_in[j].r = samples[j];
fft_in[j].i = zero;
fft_out[j].r = zero;
fft_out[j].i = zero;
}
printf("Executing FFT\r\n");
kiss_fftr(fft, (kiss_fft_scalar*) fft_in, fft_out);
printf("Printing FFT Outputs\r\n");
for(int j = 0; j < size; j++)
{
printf("%d \r\n", fft_out[j].r);
}
kiss_fft_cleanup();
free(fft);
} // end Sample_RUN
Sounds like you are running out of memory. I am not familiar with that chip, but perhaps you should be using the last arguments of kiss_fft_alloc so you can skip heap allocation.

Create 1D Texture in DirectX 11

I try to create the 1D Texture in DirectX 11 wih this code:
PARAMETER: ID3D11Device* pDevice
D3D11_TEXTURE1D_DESC text1_desc;
::ZeroMemory(&text1_desc, sizeof(D3D11_TEXTURE1D_DESC));
text1_desc.Width = 258
text1_desc.MipLevels = 2;
text1_desc.ArraySize = 2;
text1_desc.Usage = D3D11_USAGE_IMMUTABLE;
text1_desc.BindFlags = D3D11_BIND_SHADER_RESOURCE;
text1_desc.Format = R8G8B8A8_UNORM;
FLOAT* pData = new FLOAT[text1_desc.MipLevels * text1_desc.ArraySize * text1_desc.Width];
D3D11_SUBRESOURCE_DATA sr_data;
::ZeroMemory(&sr_data, sizeof(D3D11_SUBRESOURCE_DATA));
sr_data.pSysMem = pData;
ID3D11Texture1D* pTexture1D = nullptr;
auto hr = pDevice->CreateTexture1D(&text1_desc, &sr_data, &pTexture1D);
When text1_desc.MipLevels = 1 and text1_desc.ArraySize = 1 everything is good.
When text1_desc.MipLevels = 0 or text1_desc.MipLevels > 1 it raises Unhandled exception at 0x000007FEE6D14CC0 (nvwgf2umx.dll): 0xC0000005: Access violation reading location 0xFFFFFFFFFFFFFFFF.
Can anyone help me to solve this problem?
Mip-levels of '0' is a problem since it results in an allocation size of '0'. You need to figure out the number of mip-levels that will get generated for the given input width. So for 0, you need something like:
size_t mipLevels = 1;
size_t width = 258;
while ( width > 1 )
{
if ( width > 1 )
width >>= 1;
++mipLevels;
}
The second thing to note is that you have to pass an array of D3D11_SUBRESOURE_DATA instances, not just one, if you are creating a complex resource. There is one D3D11_SUBRESOURE_DATA per sub-resource, which needs to be mipLevels * text1_desc.ArraySize in length. You only ever allocate 1, which is why you get a runtime fault.
You should look at DirectXTex for code that works with a all kinds of Direct3D 11 textures.

Resources