OpenCV SURF comparing descriptors

The following snippet is from OpenCV's find_obj.cpp, which is a demo for using SURF:
double
compareSURFDescriptors( const float* d1, const float* d2, double best, int length )
{
    double total_cost = 0;
    assert( length % 4 == 0 );
    int i;
    for( i = 0; i < length; i += 4 )
    {
        double t0 = d1[i] - d2[i];
        double t1 = d1[i+1] - d2[i+1];
        double t2 = d1[i+2] - d2[i+2];
        double t3 = d1[i+3] - d2[i+3];
        total_cost += t0*t0 + t1*t1 + t2*t2 + t3*t3;
        if( total_cost > best )
            break;
    }
    return total_cost;
}
As far as I can tell it is computing the (squared) Euclidean distance. What I do not understand is why it is doing it in groups of 4. Why not calculate the whole thing at once?

Usually things like this are done to make SSE optimizations possible. SSE registers are 128 bits wide and can hold 4 floats, so the 4 subtractions can be done with a single instruction, in parallel.
Another upside: the loop counter only has to be checked after every fourth difference. That makes the code faster even if the compiler doesn't take the opportunity to generate SSE code. For example, VS2008 didn't, not even with -O2:
double t0 = d1[i] - d2[i];
00D91666 fld dword ptr [edx-0Ch]
00D91669 fsub dword ptr [ecx-4]
double t1 = d1[i+1] - d2[i+1];
00D9166C fld dword ptr [ebx+ecx]
00D9166F fsub dword ptr [ecx]
double t2 = d1[i+2] - d2[i+2];
00D91671 fld dword ptr [edx-4]
00D91674 fsub dword ptr [ecx+4]
double t3 = d1[i+3] - d2[i+3];
00D91677 fld dword ptr [edx]
00D91679 fsub dword ptr [ecx+8]
total_cost += t0*t0 + t1*t1 + t2*t2 + t3*t3;
00D9167C fld st(2)
00D9167E fmulp st(3),st
00D91680 fld st(3)
00D91682 fmulp st(4),st
00D91684 fxch st(2)
00D91686 faddp st(3),st
00D91688 fmul st(0),st
00D9168A faddp st(2),st
00D9168C fmul st(0),st
00D9168E faddp st(1),st
00D91690 faddp st(2),st
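For comparison, here is a minimal sketch of the same loop body written directly with SSE intrinsics (my illustration, not the OpenCV code; the horizontal sum is kept scalar for clarity, and the partial sums are accumulated in float rather than double):
#include <xmmintrin.h>  // SSE

double compareSURFDescriptorsSSE( const float* d1, const float* d2,
                                  double best, int length )
{
    double total_cost = 0;
    for( int i = 0; i < length; i += 4 )
    {
        __m128 a  = _mm_loadu_ps( d1 + i );   // load 4 floats from d1
        __m128 b  = _mm_loadu_ps( d2 + i );   // load 4 floats from d2
        __m128 d  = _mm_sub_ps( a, b );       // 4 subtractions in one instruction
        __m128 sq = _mm_mul_ps( d, d );       // 4 squares in one instruction
        float tmp[4];
        _mm_storeu_ps( tmp, sq );             // spill and sum the 4 lanes
        total_cost += tmp[0] + tmp[1] + tmp[2] + tmp[3];
        if( total_cost > best )
            break;
    }
    return total_cost;
}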

I think it is because for each subregion we get 4 numbers. In total there are 4x4 subregions, making 4x4x4 = 64 values, i.e. a vector of length 64. So each iteration is basically taking the difference between two corresponding subregions.
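As a rough illustration of that layout (a hypothetical struct view with made-up names, not OpenCV types):
/* The 64-float SURF descriptor viewed as 4x4 spatial subregions, each
   holding 4 statistics: sum(dx), sum(dy), sum(|dx|), sum(|dy|). */
typedef struct { float sum_dx, sum_dy, sum_abs_dx, sum_abs_dy; } SurfSubregion;
typedef struct { SurfSubregion sub[4][4]; } SurfDescriptor;  /* 4*4*4 = 64 floats */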

Related

CLANG is ignoring AVX2 intrinsics in this code

I'm testing LLVM's ability to vectorize some code on https://rust.godbolt.org/
Options: -mavx2 -ffast-math -fno-math-errno -O3
Compiler: LLVM 13, but actually any LLVM version does the same thing.
#include <immintrin.h>
template<class T>
struct V4
{
    T A,B,C,D;
    V4() { };
    V4(T x) : A(x), B(x), C(x), D(x) { };
    V4(T a, T b, T c, T d) : A(a), B(b), C(c), D(d) { };
    void operator +=(const V4& x)
    {
        //A += x.A; B += x.B; C += x.C; D += x.D;
        __m256 f = _mm256_loadu_ps(&A);
        __m256 f2 = _mm256_loadu_ps(&x.A);
        _mm256_store_ps(&A, _mm256_add_ps(f, f2));
    };
    T GetSum() const { return A + B + C + D; };
};
typedef V4<float> V4F;

double FN(float f[4], float g[4], int cnt)
{
    V4F vec1(f[0], f[1], f[2], f[3]), vec2(g[0], g[1], g[2], g[3]);
    for (int i=0; i<cnt; i++)
        vec1 += vec2;
    return vec1.GetSum();
};
This is the resulting disassembly:
FN(float*, float*, int): # #FN(float*, float*, int)
vmovddup xmm0, qword ptr [rdi + 8] # xmm0 = mem[0,0]
vaddps xmm0, xmm0, xmmword ptr [rdi]
vmovshdup xmm1, xmm0 # xmm1 = xmm0[1,1,3,3]
vaddss xmm0, xmm0, xmm1
vcvtss2sd xmm0, xmm0, xmm0
ret
So it is completely ignoring the intrinsics. If I uncomment the part that should be doing the same thing in plain C++, much longer code appears, so apparently it starts understanding it then.
Am I missing something, or is this a bug in LLVM?

Is it possible to convince clang to auto-vectorize this code without using intrinsics?

Imagine I have this naive function to detect sphere overlap. The point of this question is not really to discuss the best way to do hit testing on spheres, so this is just for illustration.
inline bool sphere_hit(float x1, float y1, float z1, float r1,
                       float x2, float y2, float z2, float r2) {
    float xd = (x1 - x2);
    float yd = (y1 - y2);
    float zd = (z1 - z2);
    float max_dist = (r1 + r2);
    return xd * xd + yd * yd + zd * zd < max_dist * max_dist;
}
And I call it in a nested loop, as follows:
std::vector<float> xs, ys, zs, rs;
int n_spheres;
// <snip>
int n_hits = 0;
for (int i = 0; i < n_spheres; ++i) {
    for (int j = i + 1; j < n_spheres; ++j) {
        if (sphere_hit(xs[i], ys[i], zs[i], rs[i],
                       xs[j], ys[j], zs[j], rs[j])) {
            ++n_hits;
        }
    }
}
std::printf("total hits: %d\n", n_hits);
Now, clang (with -O3 -march=native) is smart enough to figure out how to vectorize (and unroll) this loop into 256-bit AVX2 instructions. Awesome!
However, if I do anything more complicated than increment the number of hits, for example calling some arbitrary function handle_hit(i, j), clang instead emits a naive scalar version.
Hits should be very rare, so what I think should happen is checking on every vectorized loop iteration whether the condition is true in any of the lanes, and jumping to some scalar slow path if so. This should be possible with vcmpltps followed by vmovmskps. However, I can't get clang to emit this code, even if I surround the call to sphere_hit with __builtin_expect(..., 0).
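For reference, the manual pattern described above might look roughly like this with AVX intrinsics (a sketch, assuming any remainder of fewer than 8 spheres is handled by a scalar tail that is omitted here; handle_hit is the hypothetical callback from the question):
#include <immintrin.h>

void handle_hit(int i, int j); /* hypothetical slow-path callback */

/* Inner loop for a fixed sphere i, testing 8 candidates per iteration. */
static void sphere_hits_for_i(const float* xs, const float* ys, const float* zs,
                              const float* rs, int n_spheres, int i)
{
    __m256 xi = _mm256_set1_ps(xs[i]);
    __m256 yi = _mm256_set1_ps(ys[i]);
    __m256 zi = _mm256_set1_ps(zs[i]);
    __m256 ri = _mm256_set1_ps(rs[i]);
    for (int j = i + 1; j + 8 <= n_spheres; j += 8) {
        __m256 xd = _mm256_sub_ps(xi, _mm256_loadu_ps(&xs[j]));
        __m256 yd = _mm256_sub_ps(yi, _mm256_loadu_ps(&ys[j]));
        __m256 zd = _mm256_sub_ps(zi, _mm256_loadu_ps(&zs[j]));
        __m256 md = _mm256_add_ps(ri, _mm256_loadu_ps(&rs[j]));
        __m256 d2 = _mm256_add_ps(_mm256_mul_ps(xd, xd),
                    _mm256_add_ps(_mm256_mul_ps(yd, yd), _mm256_mul_ps(zd, zd)));
        /* vcmpltps: per-lane d2 < max_dist^2, then vmovmskps to a bitmask */
        __m256 lt = _mm256_cmp_ps(d2, _mm256_mul_ps(md, md), _CMP_LT_OQ);
        int mask = _mm256_movemask_ps(lt);
        if (mask) {                        /* rare: take the scalar slow path */
            for (int k = 0; k < 8; ++k)
                if (mask & (1 << k))
                    handle_hit(i, j + k);
        }
    }
}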
Indeed it is possible to convince clang to vectorize this code. With the compiler options
-Rpass-analysis=loop-vectorize -Rpass=loop-vectorize -Rpass-missed=loop-vectorize, clang claims that the floating point operations are vectorized, which is confirmed by the Godbolt output. (The red underlined fors are not errors, but vectorization reports.)
Vectorization is possible by storing the results of sphere_hit as chars to a temporary array hitx8.
Afterwards, 8 sphere_hit results are tested per iteration by reading the 8 chars back from memory as one uint64_t a. This should be quite efficient, since the condition a != 0
(see code below) is still rare, as sphere hits are very rare. Moreover, array hitx8 is likely in L1 or L2 cache most of the time.
I didn't test the code for correctness, but at least the auto-vectorization idea should work.
/* clang -Ofast -Wall -march=broadwell -Rpass-analysis=loop-vectorize -Rpass=loop-vectorize -Rpass-missed=loop-vectorize */
#include <string.h>
char sphere_hit(float x1, float y1, float z1, float r1,
                float x2, float y2, float z2, float r2);
void handle_hit(int i, int j);

void vectorized_code(float* __restrict xs, float* __restrict ys, float* __restrict zs, float* __restrict rs, char* __restrict hitx8, int n_spheres){
    unsigned long long int a;
    for (int i = 0; i < n_spheres; ++i) {
        for (int j = i + 1; j < n_spheres; ++j){
            /* Store the boolean results temporarily in char array hitx8. */
            /* The indices of hitx8 are shifted by i+1, so the loop       */
            /* starts with hitx8[0].                                      */
            /* Char array hitx8 should have n_spheres + 8 elements.       */
            hitx8[j-i-1] = sphere_hit(xs[i], ys[i], zs[i], rs[i],
                                      xs[j], ys[j], zs[j], rs[j]);
        }
        for (int j = n_spheres; j < n_spheres+8; ++j){
            /* Add 8 extra zeros at the end of hitx8. */
            hitx8[j-i-1] = 0;   /* hitx8 is 8 elements longer than xs */
        }
        for (int j = i + 1; j < n_spheres; j=j+8){
            memcpy(&a, &hitx8[j-i-1], 8);
            /* Check 8 sphere hits in parallel: one `unsigned long long int a`  */
            /* contains 8 boolean values here. The condition a!=0 is still rare */
            /* since sphere hits are very rare.                                 */
            if (a != 0ull){
                if (hitx8[j-i-1+0] != 0) handle_hit(i,j+0);
                if (hitx8[j-i-1+1] != 0) handle_hit(i,j+1);
                if (hitx8[j-i-1+2] != 0) handle_hit(i,j+2);
                if (hitx8[j-i-1+3] != 0) handle_hit(i,j+3);
                if (hitx8[j-i-1+4] != 0) handle_hit(i,j+4);
                if (hitx8[j-i-1+5] != 0) handle_hit(i,j+5);
                if (hitx8[j-i-1+6] != 0) handle_hit(i,j+6);
                if (hitx8[j-i-1+7] != 0) handle_hit(i,j+7);
            }
        }
    }
}

inline char sphere_hit(float x1, float y1, float z1, float r1,
                       float x2, float y2, float z2, float r2) {
    float xd = (x1 - x2);
    float yd = (y1 - y2);
    float zd = (z1 - z2);
    float max_dist = (r1 + r2);
    return xd * xd + yd * yd + zd * zd < max_dist * max_dist;
}
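Note that, as the comments say, hitx8 needs 8 elements of padding beyond n_spheres. A minimal calling sketch (my addition, not part of the original answer) might look like:
#include <stdlib.h>

/* Hypothetical driver: allocate hitx8 with the 8 bytes of padding that the
   uint64_t reads require, then run the vectorized scan. */
char* hitx8 = malloc(n_spheres + 8);
if (hitx8 != NULL) {
    vectorized_code(xs, ys, zs, rs, hitx8, n_spheres);
    free(hitx8);
}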

Testing memory bandwidth - Odd results

Hi everyone,
I was asking myself the other day how much different access patterns affect memory read speed (mostly thinking about the frequency vs. bus width discussion, and the impact of cache hit rate), so I made a small program to test memory speed doing sequential and fully random accesses, but the results I got are quite odd, so I don't trust my code.
My idea was quite straightforward: just loop over an array and mov the data to a register. I made 3 versions: one moves 128 bits at a time with SSE, the second 32 bits, and the last one 32 bits again but doing two movs, the first one loading a random index from an array, and the second one reading from the position specified by that value.
I got ~40 GB/s for the SSE version, which is reasonable considering I'm using an i7 4790K with DDR3 1600 CL9 memory in dual channel, which gives about 25 GB/s, so add caching on top and it feels OK. But then I got 3.3 GB/s for the normal sequential version, and, worst of all, 15 GB/s for the random one. That last result makes me think the benchmark is bogus.
Below is the code; if anyone could shed some light on this it would be appreciated. I wrote the inner loop in assembly to make sure it only does a mov.
EDIT: Managed to get a bit more performance by using vlddqu ymm0, buffL[esi] (AVX) instead of movlps; went from 38 GB/s to 41 GB/s.
EDIT 2: Did some more testing, unrolling the inner assembly loop and making a version that loads 4 times per iteration and another that loads 8 times. Got ~35 GB/s for the x4 version and ~24 GB/s for the x8 version.
#define PASSES 1000000
double bw = 0;

int main()
{
    cout << "Running : ";
    bw = 0;
    for(int n = 0; n < PASSES; n++)
    {
        if(n % 100000 == 0) cout << ".";
        const int l = 1 << 16;
        int buffL[l];
        LARGE_INTEGER frequency;    // ticks per second
        LARGE_INTEGER t1, t2;       // ticks
        // get ticks per second
        QueryPerformanceFrequency(&frequency);
        // start timer
        QueryPerformanceCounter(&t1);
        int maxByte = l*4;
        __asm
        {
            push esi
            mov esi,0
        loopL0:
            movlps xmm0, buffL[esi]
            add esi,16
            cmp esi,maxByte
            jb loopL0
            pop esi
        }
        // stop timer
        QueryPerformanceCounter(&t2);
        // compute elapsed time in milliseconds
        double ms = (t2.QuadPart - t1.QuadPart) * 1000.0 / frequency.QuadPart;
        bw += (double(4ull*l)/1073741824.0) / (double(ms)*0.001);
    }
    bw /= double(PASSES);
    cout << endl;
    cout << " Sequential (SSE) : " << bw << " GB/s " << endl;

    cout << "Running : ";
    bw = 0;
    for(int n = 0; n < PASSES; n++)
    {
        if(n % 100000 == 0) cout << ".";
        const int l = 1 << 16;
        int buffL[l];
        for(int t = 0; t < l; t++) buffL[t] = (t+1)*4;
        LARGE_INTEGER frequency;    // ticks per second
        LARGE_INTEGER t1, t2;       // ticks
        // get ticks per second
        QueryPerformanceFrequency(&frequency);
        // start timer
        QueryPerformanceCounter(&t1);
        int maxByte = l*4;
        __asm
        {
            push esi
            mov esi,0
        loopL1:
            mov esi, buffL[esi]
            cmp esi,maxByte
            jb loopL1
            pop esi
        }
        // stop timer
        QueryPerformanceCounter(&t2);
        // compute elapsed time in milliseconds
        double ms = (t2.QuadPart - t1.QuadPart) * 1000.0 / frequency.QuadPart;
        bw += (double(4ull*l)/1073741824.0) / (double(ms)*0.001);
    }
    bw /= double(PASSES);
    cout << endl;
    cout << " Sequential : " << bw << " GB/s " << endl;

    cout << "Running : ";
    bw = 0;
    for(int n = 0; n < PASSES; n++)
    {
        if(n % 100000 == 0) cout << ".";
        const int l = 1 << 14;
        int buffL[l];
        int maxByte = l*4;
        int roffset[l];
        for(int t = 0; t < l; t++) roffset[t] = (rand()*4) % maxByte;
        LARGE_INTEGER frequency;    // ticks per second
        LARGE_INTEGER t1, t2;       // ticks
        // get ticks per second
        QueryPerformanceFrequency(&frequency);
        // start timer
        QueryPerformanceCounter(&t1);
        __asm
        {
            push esi
            push edi
            mov esi,0
        loopL2:
            mov edi, roffset[esi]
            mov edi, buffL[edi]
            add esi,4
            cmp esi,maxByte
            jb loopL2
            pop edi
            pop esi
        }
        // stop timer
        QueryPerformanceCounter(&t2);
        // compute elapsed time in milliseconds
        double ms = (t2.QuadPart - t1.QuadPart) * 1000.0 / frequency.QuadPart;
        bw += (double(2*4ull*l)/1073741824.0) / (double(ms)*0.001);
    }
    bw /= double(PASSES);
    cout << endl;
    cout << " Random : " << bw << " GB/s " << endl;
    return 0;
}
Gathering the measurement code into a Bandwidth class, creating some constants, having all three tests use the same buffer (and size), aligning the tops of the loops, and computing random offsets into the entire buffer (3rd test):
#include "stdafx.h"
#include "windows.h"
#include <iostream>
#include <vector>
using namespace std;
constexpr size_t passes = 1000000;
constexpr size_t buffsize = 64 * 1024;
constexpr double gigabyte = 1024.0 * 1024.0 * 1024.0;
constexpr double gb_per_test = double(long long(buffsize) * passes) / gigabyte;
struct Bandwidth
{
LARGE_INTEGER pc_tick_per_sec;
LARGE_INTEGER start_pc;
const char* _label;
public:
Bandwidth(const char* label): _label(label)
{
cout << "Running : ";
QueryPerformanceFrequency(&pc_tick_per_sec);
QueryPerformanceCounter(&start_pc);
}
~Bandwidth() {
LARGE_INTEGER end_pc{};
QueryPerformanceCounter(&end_pc);
const auto seconds = double(end_pc.QuadPart - start_pc.QuadPart) / pc_tick_per_sec.QuadPart;
cout << "\n " << _label << ": " << gb_per_test / seconds << " GB/s " << endl;
}
};
int wmain()
{
vector<char> buff(buffsize, 0);
const auto buff_begin = buff.data();
const auto buff_end = buff.data()+buffsize;
{
Bandwidth b("Sequential (SSE)");
for (size_t n = 0; n < passes; ++n) {
__asm {
push esi
push edi
mov esi, buff_begin
mov edi, buff_end
align 16
loopL0:
movlps xmm0, [esi]
lea esi, [esi + 16]
cmp esi, edi
jne loopL0
pop edi
pop esi
}
}
}
{
Bandwidth b("Sequential (DWORD)");
for (int n = 0; n < passes; ++n) {
__asm {
push esi
push edi
mov esi, buff
mov edi, buff_end
align 16
loopL1:
mov eax, [esi]
lea esi, [esi + 4]
cmp esi, edi
jne loopL1
pop edi
pop esi
}
}
}
{
uint32_t* roffset[buffsize];
for (auto& roff : roffset)
roff = (uint32_t*)(buff.data())+(uint32_t)(double(rand()) / RAND_MAX * (buffsize / sizeof(int)));
const auto roffset_end = end(roffset);
Bandwidth b("Random");
for (int n = 0; n < passes; ++n) {
__asm {
push esi
push edi
push ebx
lea edi, roffset //begin(roffset)
mov ebx, roffset_end //end(roffset)
align 16
loopL2:
mov esi, [edi] //fetch the next random offset
mov eax, [esi] //read from the random location
lea edi, [edi + 4] // point to the next random offset
cmp edi, ebx //are we done?
jne loopL2
pop ebx
pop edi
pop esi
}
}
}
}
I have also found more consistent results if I SetPriorityClass(GetCurrentProcess(), HIGH_PRIORITY_CLASS); and SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_TIME_CRITICAL);
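For example (a minimal sketch of where those standard Win32 calls could go, before any measurements run):
int wmain()
{
    // Reduce scheduler interference while benchmarking:
    SetPriorityClass(GetCurrentProcess(), HIGH_PRIORITY_CLASS);
    SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_TIME_CRITICAL);
    // ... run the three Bandwidth-scoped tests as above ...
}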
Your second test has one array on the stack that is 1 << 16 in size. That's 64K elements, or, easier to read:
int buffL[65536];
Your third test has two arrays on the stack, both 1 << 14 in size. That's 16K elements each:
int buffL[16384];
int roffset[16384];
So right away you are using a much smaller stack footprint (i.e. fewer pages being cached and swapped out), and I think your loop is iterating only a quarter as many times in the third test as it is in the second. Maybe you meant to declare 1 << 15 (or 1 << 16) as the size of each array instead?
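For instance, a sketch of that suggested fix (making the third test's arrays match the second test's element count; whether the larger stack usage is acceptable depends on the build settings):
const int l = 1 << 16;   // same element count as the second test
int buffL[l];
int roffset[l];
int maxByte = l*4;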

MCR and MRC instruction usage

Here I have written code to find the number of cycles taken by a function, but I am getting an error at the first MCR instruction. Can anyone suggest how to solve this problem? This code is written in Xcode and runs on iOS.
#include <stdio.h>

static inline unsigned int get_cyclecount (void)
{
    unsigned int value;
    // Read CCNT Register
    asm volatile ("MRC p15, 0, %0, c9, c13, 0\t\n": "=r"(value));
    return value;
}

static inline void init_perfcounters (int do_reset, int enable_divider)
{
    // in general enable all counters (including cycle counter)
    int value = 1;
    // perform reset:
    if (do_reset)
    {
        value |= 2;     // reset all counters to zero.
        value |= 4;     // reset cycle counter to zero.
    }
    if (enable_divider)
        value |= 8;     // enable "by 64" divider for CCNT.
    value |= 16;
    // program the performance-counter control-register:
    asm volatile ("MCR p15, 0, %0, c9, c12, 0\t\n" :: "r"(value));
    // enable all counters:
    asm volatile ("MCR p15, 0, %0, c9, c12, 1\t\n" :: "r"(0x8000000f));
    // clear overflows:
    asm volatile ("MCR p15, 0, %0, c9, c12, 3\t\n" :: "r"(0x8000000f));
}

int main () {
    float x = 100.0f;
    float y = 0.00000f;
    float inst, cycl, cycl_inst;
    int do_reset = 0;
    int enable_divider = 0;
    init_perfcounters (1, 0);
    // measure the counting overhead:
    unsigned int overhead = get_cyclecount();
    overhead = get_cyclecount() - overhead;
    unsigned int t = get_cyclecount();
    // do some stuff here..
    log_10_c_function(x);
    t = get_cyclecount() - t;
    printf ("Totally %d cycles (including function call) ", t - overhead);
    return 0;
}

Optimizing C code for ARM-based devices

Recently, I stumbled upon an interview question where you need to write code that's optimized for ARM, especially for the iPhone:
Write a function which takes an array of char (ASCII symbols) and find
the most frequent character.
char mostFrequentCharacter(char* str, int size)
The function should be optimized to run on dual-core ARM-based processors, with an infinite amount of memory available.
On the face of it, the problem itself looks pretty simple, and here is the simple implementation of the function that came to mind:
#define RESULT_SIZE 127

inline int set_char(char c, int result[])
{
    int count = result[c];
    result[c] = ++count;
    return count;
}

char mostFrequentChar(char str[], int size)
{
    int result[RESULT_SIZE] = {0};
    char current_char;
    char frequent_char = '\0';
    int current_char_frequency = 0;
    int char_frequency = 0;
    for(size_t i = 0; i < size; i++)
    {
        current_char = str[i];
        current_char_frequency = set_char(current_char, result);
        if(current_char_frequency >= char_frequency)
        {
            char_frequency = current_char_frequency;
            frequent_char = current_char;
        }
    }
    return frequent_char;
}
Firstly, I did some basic code optimization: I moved the code that calculates the most frequent char out of the per-character loop into an additional for loop, and got a significant increase in speed. Instead of evaluating the following block of code size times
if(current_char_frequency >= char_frequency)
{
    char_frequency = current_char_frequency;
    frequent_char = current_char;
}
we can find the most frequent char in O(RESULT_SIZE), where RESULT_SIZE == 127:
char mostFrequentCharOpt1(char str[], int size)
{
    int result[RESULT_SIZE] = {0};
    char frequent_char = '\0';
    int current_char_frequency = 0;
    int char_frequency = 0;
    for(int i = 0; i < size; i++)
    {
        set_char(str[i], result);
    }
    for(int i = 0; i < RESULT_SIZE; i++)
    {
        current_char_frequency = result[i];
        if(current_char_frequency >= char_frequency)
        {
            char_frequency = current_char_frequency;
            frequent_char = i;
        }
    }
    return frequent_char;
}
Benchmarks: iPhone 5s
size = 1000000
iterations = 500
// seconds = 7.842381
char mostFrequentChar(char str[], int size)
// seconds = 5.905090
char mostFrequentCharOpt1(char str[], int size)
On average, mostFrequentCharOpt1 runs ~24% faster than the basic implementation.
Type optimization
The ARM core registers are 32 bits wide. Therefore, changing all local variables that have type char to type int prevents the processor from doing additional instructions to account for the size of the local variable after each assignment.
Note: ARM64 provides 31 registers (x0-x30), where each register is 64 bits wide and also has a 32-bit form (w0-w30). Hence, there is no need to do anything special to operate on the int data type.
infocenter.arm.com - ARMv8 Registers
While comparing the assembly-language versions of the functions, I noticed a difference between how ARM works with the int type and the char type. ARM uses the LDRB instruction to load a byte and the STRB instruction to store a byte to an individual byte in memory. Thereby, from my point of view, LDRB is a bit slower than LDR, because LDRB does zero-extension every time it accesses memory and loads into a register. In other words, we can't just load a byte into a 32-bit register; we have to extend the byte to a word.
Benchmarks: iPhone 5s
size = 1000000
iterations = 500
// seconds = 5.905090
char mostFrequentCharOpt1(char str[], int size)
// seconds = 5.874684
int mostFrequentCharOpt2(char str[], int size)
Changing the char type to int didn't give me a significant increase in speed on the iPhone 5s; by way of contrast, running the same code on an iPhone 4 gave a different result:
Benchmarks: iPhone 4
size = 1000000
iterations = 500
// seconds = 28.853877
char mostFrequentCharOpt1(char str[], int size)
// seconds = 27.328955
int mostFrequentCharOpt2(char str[], int size)
Loop optimization
Next, I did a loop optimization where, instead of incrementing the i value, I decremented it.
before
for(int i = 0; i < size; i++) { ... }
after
for(int i = size; i--; ) { ... }
Again, comparing the assembly code gave a clear distinction between the two approaches.
mostFrequentCharOpt2 | mostFrequentCharOpt3
0x10001250c <+88>: ldr w8, [sp, #28] ; w8 = i | 0x100012694 <+92>: ldr w8, [sp, #28] ; w8 = i
0x100012510 <+92>: ldr w9, [sp, #44] ; w9 = size | 0x100012698 <+96>: sub w9, w8, #1 ; w9 = i - 1
0x100012514 <+96>: cmp w8, w9 ; if i<size | 0x10001269c <+100>: str w9, [sp, #28] ; save w9 to memory
0x100012518 <+100>: b.ge 0x100012548 ; if true => end loop | 0x1000126a0 <+104>: cbz w8, 0x1000126c4 ; compare w8 with 0 and if w8 == 0 => go to 0x1000126c4
0x10001251c <+104>: ... set_char start routine | 0x1000126a4 <+108>: ... set_char start routine
... | ...
0x100012534 <+128>: ... set_char end routine | 0x1000126bc <+132>: ... set_char end routine
0x100012538 <+132>: ldr w8, [sp, #28] ; w8 = i | 0x1000126c0 <+136>: b 0x100012694 ; back to the first line
0x10001253c <+136>: add w8, w8, #1 ; i++ | 0x1000126c4 <+140>: ...
0x100012540 <+140>: str w8, [sp, #28] ; save i to $sp+28 |
0x100012544 <+144>: b 0x10001250c ; back to the first line |
0x100012548 <+148>: str ... |
Here, instead of loading size from memory and comparing it with the incrementing i variable, we just decrement i by 0x1 and compare the register holding i with 0.
Benchmarks: iPhone 5s
size = 1000000
iterations = 500
// seconds = 5.874684
char mostFrequentCharOpt2(char str[], int size) //Type optimization
// seconds = 5.577797
char mostFrequentCharOpt3(char str[], int size) //Loop optimization
Threading optimization
Reading the question carefully reveals at least one more optimization: the line "...optimized to run on dual-core ARM-based processors..." drops a hint to parallelize the code using pthreads or GCD.
int mostFrequentCharThreadOpt(char str[], int size)
{
    int s;
    int tnum;
    int num_threads = THREAD_COUNT; //by default 2
    struct thread_info *tinfo;

    tinfo = calloc(num_threads, sizeof(struct thread_info));
    if (tinfo == NULL)
        exit(EXIT_FAILURE);

    int minCharCountPerThread = size/num_threads;
    int startIndex = 0;
    for (tnum = num_threads; tnum--;)
    {
        startIndex = minCharCountPerThread*tnum;
        tinfo[tnum].thread_num = tnum + 1;
        tinfo[tnum].startIndex = minCharCountPerThread*tnum;
        tinfo[tnum].str_size = (size - minCharCountPerThread*tnum) >= minCharCountPerThread ? minCharCountPerThread : (size - minCharCountPerThread*(tnum-1));
        tinfo[tnum].str = str;
        s = pthread_create(&tinfo[tnum].thread_id, NULL,
                           (void *(*)(void *))_mostFrequentChar, &tinfo[tnum]);
        if (s != 0)
            exit(EXIT_FAILURE);
    }

    int frequent_char = 0;
    int char_frequency = 0;
    int current_char_frequency = 0;
    for (tnum = num_threads; tnum--; )
    {
        s = pthread_join(tinfo[tnum].thread_id, NULL);
    }
    for(int i = RESULT_SIZE; i--; )
    {
        current_char_frequency = 0;
        for (int z = num_threads; z--;)
        {
            current_char_frequency += tinfo[z].resultArray[i];
        }
        if(current_char_frequency >= char_frequency)
        {
            char_frequency = current_char_frequency;
            frequent_char = i;
        }
    }
    free(tinfo);
    return frequent_char;
}
Benchmarks: iPhone 5s
size = 1000000
iterations = 500
// seconds = 5.874684
char mostFrequentCharOpt3(char str[], int size) //Loop optimization
// seconds = 3.758042
// THREAD_COUNT = 2
char mostFrequentCharThreadOpt(char str[], int size) //Thread optimization
Note: mostFrequentCharThreadOpt works slower than mostFrequentCharOpt2 on iPhone 4.
Benchmarks: iPhone 4
size = 1000000
iterations = 500
// seconds = 25.819347
char mostFrequentCharOpt3(char str[], int size) //Loop optimization
// seconds = 31.541066
char mostFrequentCharThreadOpt(char str[], int size) //Thread optimization
Question
How well optimized are mostFrequentCharOpt3 and mostFrequentCharThreadOpt? In other words: are there any other ways to optimize both functions?
Source code
Alright, here are some things you can try. I can't say 100% what will be effective in your situation, but from experience: given that even the loop optimization worked for you, your compiler isn't optimizing very aggressively.
It depends a bit on your THREAD_COUNT. You say it's 2 by default, but you may be able to spare some time if you are 100% sure it's 2. You know the platform you work on; don't make anything dynamic without a reason if speed is your priority.
If THREAD_COUNT == 2, num_threads is an unnecessary variable and can be removed.
int minCharCountPerThread = size/num_threads;
And there is the old, much-discussed trick of bit-shifting; try it:
int minCharCountPerThread = size >> 1; //divide by 2
The next thing you can try is to unroll your loops: several loops run only 2 times; if code size isn't a problem, why not remove the loop aspect entirely?
This is really something you should just try: look at what happens, and whether it is useful to you. I've seen cases where loop unrolling works great, and I've seen cases where it slows down my code.
Last thing: try using unsigned numbers instead of signed ints (unless you really need signed). It is known that some tricks/instructions are only available for unsigned variables.
There are quite a few things you could do, but the results will really depend on which specific ARM hardware the code is running on. For example, older iPhone hardware is completely different from the newer 64-bit devices: totally different hardware architecture and a different instruction set. Older 32-bit ARM hardware contained some real "tricks" that could make things a lot faster, like multiple-register read/write operations.
One example optimization: instead of loading bytes, load whole 32-bit words and then operate on each byte in the register using bit shifts; see the sketch below.
If you are using 2 threads, another approach can be to break up the memory access so that one memory page is processed by one thread while the second thread operates on the second memory page, and so on. That way the registers in the different processors can do maximum crunching without reading or writing to the same memory page (and memory access is typically the slow part).
I would also suggest that you start with a good timing framework; I built a timing framework for ARM+iOS that you might find useful for that purpose.
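To make the word-at-a-time idea concrete, here is a minimal sketch (my illustration, not the answerer's code; it assumes str is 4-byte aligned and size is a multiple of 4, and it counts into a 256-entry table so any byte value is safe):
#include <stdint.h>
#include <string.h>

/* Sketch: histogram bytes by loading one 32-bit word per iteration
   instead of one byte at a time, extracting each byte with shifts. */
void count_chars_wordwise(const unsigned char* str, int size, int result[256])
{
    memset(result, 0, 256 * sizeof(int));
    const uint32_t* words = (const uint32_t*)str;
    for (int i = 0; i < size / 4; i++)
    {
        uint32_t w = words[i];
        result[w & 0xFF]++;          /* byte 0 */
        result[(w >> 8) & 0xFF]++;   /* byte 1 */
        result[(w >> 16) & 0xFF]++;  /* byte 2 */
        result[w >> 24]++;           /* byte 3 */
    }
}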
