FreeRTOS - Hardfault Analysis - Failure

After following the tutorial which can be found here, I have tried, unsuccessfully, to see in which thread and where on the stack the hard fault originated.
Unlike the tutorial, my HardFault_Handler() does not do the register capture directly but forwards to another function (prepareRegistersFromStack()). In the debugger, I can see the jump into prepareRegistersFromStack().
This function, prepareRegistersFromStack(), should then jump to getRegistersFromStack(), but this never happens.
I also tried it exactly as the example states, making HardFault_Handler() the naked function that jumps to getRegistersFromStack(), but unfortunately the same applies: no jump.
Can someone help me out here? Why does getRegistersFromStack() never get called?
Thanks
extern "C" __attribute__((naked)) void HardFault_Handler()
{
    __disable_fault_irq();
    __disable_irq();
    prepareRegistersFromStack();
}
extern "C" void prepareRegistersFromStack()
{
    __asm volatile
    (
        " tst lr, #4                                          \n"
        " ite eq                                              \n"
        " mrseq r0, msp                                       \n"
        " mrsne r0, psp                                       \n"
        " ldr r1, [r0, #24]                                   \n"
        " ldr r2, handler2_address_const                      \n"
        " bx r2                                               \n"
        " handler2_address_const: .word getRegistersFromStack \n"
    );
}
extern "C" void getRegistersFromStack( uint32_t *pulFaultStackAddress )
{
    uint32_t dummy;
    /* These are volatile to try and prevent the compiler/linker optimising them
    away as the variables never actually get used. If the debugger won't show the
    values of the variables, make them global by moving their declaration outside
    of this function. */
    volatile uint32_t r0;
    volatile uint32_t r1;
    volatile uint32_t r2;
    volatile uint32_t r3;
    volatile uint32_t r12;
    volatile uint32_t lr; /* Link register. */
    volatile uint32_t pc; /* Program counter. */
    volatile uint32_t psr;/* Program status register. */

    r0 = pulFaultStackAddress[ 0 ];
    r1 = pulFaultStackAddress[ 1 ];
    r2 = pulFaultStackAddress[ 2 ];
    r3 = pulFaultStackAddress[ 3 ];
    r12 = pulFaultStackAddress[ 4 ];
    lr = pulFaultStackAddress[ 5 ];
    pc = pulFaultStackAddress[ 6 ];
    psr = pulFaultStackAddress[ 7 ];

    /* When the following line is hit, the variables contain the register values. */
    for(;;);

    /* remove warnings */
    dummy = r0;
    dummy = r1;
    dummy = r2;
    dummy = r3;
    dummy = r12;
    dummy = lr;
    dummy = pc;
    dummy = psr;
    dummy = dummy;
}

In your modified approach you have changed the stack frame: by the time prepareRegistersFromStack() runs, the call from HardFault_Handler() has already overwritten LR (the EXC_RETURN value that "tst lr, #4" tests), so it won't work as written. Revert to the original sample. I have used the following sample before, which I believe is what your code snippet is based on:
https://www.freertos.org/Debugging-Hard-Faults-On-Cortex-M-Microcontrollers.html
However, I now prefer to use the feature built into the IDE or the following "C" approach that can utilize a serial port or an IDE:
https://blog.feabhas.com/2013/02/developing-a-generic-hard-fault-handler-for-arm-cortex-m3cortex-m4/
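For reference, here is a sketch of the single-function form from the FreeRTOS page above. The key point is that the naked handler contains only the assembly, so LR still holds the EXC_RETURN value when "tst lr, #4" executes, and the tail jump (bx) does not disturb the stacked fault frame:

```c
/* Sketch of the FreeRTOS-style handler: no C call happens before LR is
 * tested and the faulting stack pointer is captured into r0. */
extern "C" __attribute__((naked)) void HardFault_Handler(void)
{
    __asm volatile
    (
        " tst lr, #4                                          \n" /* which stack was in use? */
        " ite eq                                              \n"
        " mrseq r0, msp                                       \n" /* main stack */
        " mrsne r0, psp                                       \n" /* process stack */
        " ldr r2, handler2_address_const                      \n"
        " bx r2                                               \n" /* tail jump, frame untouched */
        " handler2_address_const: .word getRegistersFromStack \n"
    );
}
```

Any BL instruction emitted for a C call before this sequence clobbers LR, which is exactly what the extra prepareRegistersFromStack() hop introduced.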

Can Lua send extended function Keys? ex F13-F24

I tried sending F13 with kb.stroke("F13"), but it doesn't work; anything F12 and below works fine.
I'm trying to use this in a custom remote in the Unified Remote app, so my only workaround for now is using os.start to run an AHK script that does the key sending, but that is a very slow approach.
Any help will be appreciated.
local ffi = require"ffi"
ffi.cdef[[
typedef struct {
    uintptr_t type;
    uint16_t wVk;
    uint16_t wScan;
    uint32_t dwFlags;
    uint32_t time;
    uintptr_t dwExtraInfo;
    uint32_t x[2];
} INP;
int SendInput(int, void*, int);
]]
local inp_t = ffi.typeof"INP[2]"
local function PressAndReleaseKey(vkey)
    local inp = inp_t()
    for j = 0, 1 do
        inp[j].type = 1
        inp[j].wVk = vkey
        inp[j].dwFlags = j * 2
    end
    ffi.C.SendInput(2, inp, ffi.sizeof"INP")
end
PressAndReleaseKey(0x57) -- W
PressAndReleaseKey(0x7C) -- F13
VKeys:
https://learn.microsoft.com/en-us/windows/win32/inputdev/virtual-key-codes

SIMD Black-Scholes implementation: why is _mm256_set1_pd annihilating my performance? [duplicate]

I have a function in this form (From Fastest Implementation of Exponential Function Using SSE):
__m128 FastExpSse(__m128 x)
{
    static __m128 const a = _mm_set1_ps(12102203.2f); // (1 << 23) / ln(2)
    static __m128i const b = _mm_set1_epi32(127 * (1 << 23) - 486411);
    static __m128 const m87 = _mm_set1_ps(-87);
    // fast exponential function, x should be in [-87, 87]
    __m128 mask = _mm_cmpge_ps(x, m87);
    __m128i tmp = _mm_add_epi32(_mm_cvtps_epi32(_mm_mul_ps(a, x)), b);
    return _mm_and_ps(_mm_castsi128_ps(tmp), mask);
}
I want to make it C compatible, yet the compiler doesn't accept the form static __m128i const b = _mm_set1_epi32(127 * (1 << 23) - 486411); when I use a C compiler.
Still, I don't want the first 3 values to be recalculated on each function call.
One solution is to inline it (but sometimes compilers reject that).
Is there a C style to achieve this in case the function isn't inlined?
Thank you.
Remove static and const.
Also remove them from the C++ version. const is OK, but static is horrible, introducing guard variables that are checked every time, and a very expensive initialization the first time.
__m128 a = _mm_set1_ps(12102203.2f); is not a function call, it's just a way to express a vector constant. No time can be saved by "doing it only once" - it normally happens zero times, with the constant vector being prepared in the data segment of the program and simply being loaded at runtime, without the junk around it that static introduces.
Check the asm to be sure; without static, this is what happens (from Godbolt):
FastExpSse(float __vector(4)):
    movaps   xmm1, XMMWORD PTR .LC0[rip]
    cmpleps  xmm1, xmm0
    mulps    xmm0, XMMWORD PTR .LC1[rip]
    cvtps2dq xmm0, xmm0
    paddd    xmm0, XMMWORD PTR .LC2[rip]
    andps    xmm0, xmm1
    ret
.LC0:
    .long 3266183168
    .long 3266183168
    .long 3266183168
    .long 3266183168
.LC1:
    .long 1262004795
    .long 1262004795
    .long 1262004795
    .long 1262004795
.LC2:
    .long 1064866805
    .long 1064866805
    .long 1064866805
    .long 1064866805
_mm_set1_ps(-87); or any other _mm_set intrinsic is not a valid static initializer with current compilers, because it's not treated as a constant expression.
In C++, it compiles to runtime initialization of the static storage location (copying from a vector literal somewhere else). And if it's a static __m128 inside a function, there's a guard variable to protect it.
In C, it simply refuses to compile, because C doesn't support non-constant initializers / constructors. _mm_set is not like a braced initializer for the underlying GNU C native vector, like #benjarobin's answer shows.
This is really dumb, and seems to be a missed-optimization in all 4 mainstream x86 C++ compilers (gcc/clang/ICC/MSVC). Even if it somehow matters that each static const __m128 var have a distinct address, the compiler could achieve that by using initialized read-only storage instead of copying at runtime.
So it seems like constant propagation fails to go all the way to turning _mm_set into a constant initializer even when optimization is enabled.
Never use static const __m128 var = _mm_set... even in C++; it's inefficient.
Inside a function is even worse, but global scope is still bad.
Instead, avoid static. You can still use const to stop yourself from accidentally assigning something else, and to tell human readers that it's a constant. Without static, it has no effect on where/how your variable is stored. const on automatic storage just does compile-time checking that you don't modify the object.
const __m128 var = _mm_set1_ps(-87); // not static
Compilers are good at this, and will optimize the case where multiple functions use the same vector constant, the same way they de-duplicate string literals and put them in read-only memory.
Defining constants this way inside small helper functions is fine: compilers will hoist the constant-setup out of a loop after inlining the function.
It also lets compilers optimize away the full 16 bytes of storage, and load it with vbroadcastss xmm0, dword [mem], or stuff like that.
This solution is clearly not portable; it works with GCC 8 (only tested with this compiler):
#include <stdio.h>
#include <stdint.h>
#include <emmintrin.h>
#include <string.h>

#define INIT_M128(vFloat) {(vFloat), (vFloat), (vFloat), (vFloat)}
#define INIT_M128I(vU32) {((uint64_t)(vU32) | (uint64_t)(vU32) << 32u), ((uint64_t)(vU32) | (uint64_t)(vU32) << 32u)}

static void print128(const void *p)
{
    unsigned char buf[16];
    memcpy(buf, p, 16);
    for (int i = 0; i < 16; ++i)
    {
        printf("%02X ", buf[i]);
    }
    printf("\n");
}

int main(void)
{
    static __m128 const glob_a = INIT_M128(12102203.2f);
    static __m128i const glob_b = INIT_M128I(127 * (1 << 23) - 486411);
    static __m128 const glob_m87 = INIT_M128(-87.0f);
    __m128 a = _mm_set1_ps(12102203.2f);
    __m128i b = _mm_set1_epi32(127 * (1 << 23) - 486411);
    __m128 m87 = _mm_set1_ps(-87);
    print128(&a);
    print128(&glob_a);
    print128(&b);
    print128(&glob_b);
    print128(&m87);
    print128(&glob_m87);
    return 0;
}
As explained in @harold's answer (in C only), the following code (built with or without WITHSTATIC) produces exactly the same machine code.
#include <stdio.h>
#include <stdint.h>
#include <emmintrin.h>
#include <string.h>

#define INIT_M128(vFloat) {(vFloat), (vFloat), (vFloat), (vFloat)}
#define INIT_M128I(vU32) {((uint64_t)(vU32) | (uint64_t)(vU32) << 32u), ((uint64_t)(vU32) | (uint64_t)(vU32) << 32u)}

__m128 FastExpSse2(__m128 x)
{
#ifdef WITHSTATIC
    static __m128 const a = INIT_M128(12102203.2f);
    static __m128i const b = INIT_M128I(127 * (1 << 23) - 486411);
    static __m128 const m87 = INIT_M128(-87.0f);
#else
    __m128 a = _mm_set1_ps(12102203.2f);
    __m128i b = _mm_set1_epi32(127 * (1 << 23) - 486411);
    __m128 m87 = _mm_set1_ps(-87);
#endif
    __m128 mask = _mm_cmpge_ps(x, m87);
    __m128i tmp = _mm_add_epi32(_mm_cvtps_epi32(_mm_mul_ps(a, x)), b);
    return _mm_and_ps(_mm_castsi128_ps(tmp), mask);
}
So, in summary, it is better to remove the static and const keywords: you get better and simpler code in C++, and in C the code stays portable (the hack I proposed above is not really portable).

MPU not triggering faults in Cortex-M4

I want to protect a memory region from writing. I've configured MPU, but it is not generating any faults.
The base address of the region that I want to protect is 0x20000000. The region size is 64 bytes.
Here's compilable code that demonstrates the issue.
#define MPU_CTRL  (*((volatile unsigned long*) 0xE000ED94))
#define MPU_RNR   (*((volatile unsigned long*) 0xE000ED98))
#define MPU_RBAR  (*((volatile unsigned long*) 0xE000ED9C))
#define MPU_RASR  (*((volatile unsigned long*) 0xE000EDA0))
#define SCB_SHCSR (*((volatile unsigned long*) 0xE000ED24))

void Registers_Init(void)
{
    MPU_RNR   = 0x00000000; // using region 0
    MPU_RBAR  = 0x20000000; // base address is 0x20000000
    MPU_RASR  = 0x0700110B; // size is 64 bytes, no sub-regions, permission=7 (ro,ro), s=b=c=0, tex=0
    MPU_CTRL  = 0x00000001; // enable MPU
    SCB_SHCSR = 0x00010000; // enable MemManage fault
}

void MemManage_Handler(void)
{
    __asm(
        "MOV R4, 0x77777777\n\t"
        "MOV R5, 0x77777777\n\t"
    );
}

int main(void)
{
    Registers_Init();
    __asm(
        "LDR R0, =0x20000000\n\t"
        "MOV R1, 0x77777777\n\t"
        "STR R1, [R0,#0]"
    );
    return (1);
}

void SystemInit(void)
{
}
So, in the main function, I write to the restricted area, i.e. 0x20000000, but the MPU does not generate any fault: instead of MemManage_Handler() being called, the write succeeds.
This looks fine to me. Make sure your hardware has an MPU. The MPU has a read-only register called MPU_TYPE that tells you whether an MPU is present: if bits 15:8 of MPU_TYPE read 0, there is no MPU.
And never use magic numbers when dealing with registers; it makes the code really hard for you and anyone else to read. Instead, define a set of named bit masks. See tutorials on how to do that.

Testing memory bandwidth - Odd results

Hi everyone,
I was asking myself the other day how much different access patterns affect memory read speed (mostly thinking about the frequency vs. bus-width discussion, and the impact of the cache hit rate), so I made a small program to test memory speed doing sequential and fully random accesses, but the results I got are quite odd, so I'm not trusting my code.
My idea was quite straightforward: just loop over an array and mov the data into a register. I made 3 versions: one moves 128 bits at a time with SSE, the second 32 bits, and the last one 32 bits again but doing two movs, the first loading a random index from an array and the second reading from the position given by that index.
I got ~40 GB/s for the SSE version, which is reasonable considering I'm using an i7 4790K with DDR3-1600 CL9 memory in dual channel, which gives about 25 GB/s, so add caching on top of that and it feels OK. But then I got 3.3 GB/s for the normal sequential version and, most suspicious of all, 15 GB/s for the random one. That last result makes me think the benchmark is bogus.
Below is the code; if anyone could shed some light on this it would be appreciated. I did the inner loop in assembly to make sure it only does a mov.
EDIT: Managed to get a bit more performance by using vlddqu ymm0, buffL[esi] (AVX) instead of movlps; went from 38 GB/s to 41 GB/s.
EDIT 2: Did some more testing, unrolling the inner assembly loop and making a version that loads 4 times per iteration and another that loads 8 times. Got ~35 GB/s for the x4 version and ~24 GB/s for the x8 version.
#define PASSES 1000000
double bw = 0;
int main()
{
cout << "Running : ";
bw = 0;
for(int n = 0; n < PASSES;n++)
{
if(n % 100000 == 0) cout << ".";
const int l = 1 << 16;
int buffL[l];
LARGE_INTEGER frequency; // ticks per second
LARGE_INTEGER t1, t2; // ticks
// get ticks per second
QueryPerformanceFrequency(&frequency);
// start timer
QueryPerformanceCounter(&t1);
int maxByte = l*4;
__asm
{
push esi
mov esi,0
loopL0:
movlps xmm0, buffL[esi]
add esi,16
cmp esi,maxByte
jb loopL0
pop esi
}
// stop timer
QueryPerformanceCounter(&t2);
// compute elapsed time in millisec
double ms = (t2.QuadPart - t1.QuadPart) * 1000.0 / frequency.QuadPart;
bw += (double(4ull*l)/1073741824.0) / (double(ms)*0.001);
}
bw /= double(PASSES);
cout << endl;
cout << " Sequential (SSE) : " << bw << " GB/s " << endl;
cout << "Running : ";
bw = 0;
for(int n = 0; n < PASSES;n++)
{
if(n % 100000 == 0) cout << ".";
const int l = 1 << 16;
int buffL[l];
for(int t = 0;t < l;t++) buffL[t] = (t+1)*4;
LARGE_INTEGER frequency; // ticks per second
LARGE_INTEGER t1, t2; // ticks
// get ticks per second
QueryPerformanceFrequency(&frequency);
// start timer
QueryPerformanceCounter(&t1);
int maxByte = l*4;
__asm
{
push esi
mov esi,0
loopL1:
mov esi, buffL[esi]
cmp esi,maxByte
jb loopL1
pop esi
}
// stop timer
QueryPerformanceCounter(&t2);
// compute elapsed time in millisec
double ms = (t2.QuadPart - t1.QuadPart) * 1000.0 / frequency.QuadPart;
bw += (double(4ull*l)/1073741824.0) / (double(ms)*0.001);
}
bw /= double(PASSES);
cout << endl;
cout << " Sequential : " << bw << " GB/s " << endl;
cout << "Running : ";
bw = 0;
for(int n = 0; n < PASSES;n++)
{
if(n % 100000 == 0) cout << ".";
const int l = 1 << 14;
int buffL[l];
int maxByte = l*4;
int roffset[l];
for(int t = 0;t < l;t++) roffset[t] = (rand()*4) % maxByte;
LARGE_INTEGER frequency; // ticks per second
LARGE_INTEGER t1, t2; // ticks
// get ticks per second
QueryPerformanceFrequency(&frequency);
// start timer
QueryPerformanceCounter(&t1);
__asm
{
push esi
push edi
mov esi,0
loopL2:
mov edi, roffset[esi]
mov edi, buffL[edi]
add esi,4
cmp esi,maxByte
jb loopL2
pop edi
pop esi
}
// stop timer
QueryPerformanceCounter(&t2);
// compute elapsed time in millisec
double ms = (t2.QuadPart - t1.QuadPart) * 1000.0 / frequency.QuadPart;
bw += (double(2*4ull*l)/1073741824.0) / (double(ms)*0.001);
}
bw /= double(PASSES);
cout << endl;
cout << " Random : " << bw << " GB/s " << endl;
return 0;
}
Gathering the measurement code into a Bandwidth class, creating some constants, having all three tests use the same buffer (and size), aligning the tops of the loops, and computing the random offsets over the entire buffer (3rd test):
#include "stdafx.h"
#include "windows.h"
#include <iostream>
#include <vector>
using namespace std;
constexpr size_t passes = 1000000;
constexpr size_t buffsize = 64 * 1024;
constexpr double gigabyte = 1024.0 * 1024.0 * 1024.0;
constexpr double gb_per_test = double(long long(buffsize) * passes) / gigabyte;
struct Bandwidth
{
LARGE_INTEGER pc_tick_per_sec;
LARGE_INTEGER start_pc;
const char* _label;
public:
Bandwidth(const char* label): _label(label)
{
cout << "Running : ";
QueryPerformanceFrequency(&pc_tick_per_sec);
QueryPerformanceCounter(&start_pc);
}
~Bandwidth() {
LARGE_INTEGER end_pc{};
QueryPerformanceCounter(&end_pc);
const auto seconds = double(end_pc.QuadPart - start_pc.QuadPart) / pc_tick_per_sec.QuadPart;
cout << "\n " << _label << ": " << gb_per_test / seconds << " GB/s " << endl;
}
};
int wmain()
{
vector<char> buff(buffsize, 0);
const auto buff_begin = buff.data();
const auto buff_end = buff.data()+buffsize;
{
Bandwidth b("Sequential (SSE)");
for (size_t n = 0; n < passes; ++n) {
__asm {
push esi
push edi
mov esi, buff_begin
mov edi, buff_end
align 16
loopL0:
movlps xmm0, [esi]
lea esi, [esi + 16]
cmp esi, edi
jne loopL0
pop edi
pop esi
}
}
}
{
Bandwidth b("Sequential (DWORD)");
for (int n = 0; n < passes; ++n) {
__asm {
push esi
push edi
mov esi, buff
mov edi, buff_end
align 16
loopL1:
mov eax, [esi]
lea esi, [esi + 4]
cmp esi, edi
jne loopL1
pop edi
pop esi
}
}
}
{
uint32_t* roffset[buffsize];
for (auto& roff : roffset)
roff = (uint32_t*)(buff.data())+(uint32_t)(double(rand()) / RAND_MAX * (buffsize / sizeof(int)));
const auto roffset_end = end(roffset);
Bandwidth b("Random");
for (int n = 0; n < passes; ++n) {
__asm {
push esi
push edi
push ebx
lea edi, roffset //begin(roffset)
mov ebx, roffset_end //end(roffset)
align 16
loopL2:
mov esi, [edi] //fetch the next random offset
mov eax, [esi] //read from the random location
lea edi, [edi + 4] // point to the next random offset
cmp edi, ebx //are we done?
jne loopL2
pop ebx
pop edi
pop esi
}
}
}
}
I have also found more consistent results if I SetPriorityClass(GetCurrentProcess(), HIGH_PRIORITY_CLASS); and SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_TIME_CRITICAL);
Your second test has one array on the stack that is 1 << 16 in size. That's 64K, or, easier to read:
int buffL[65536];
Your third test has two arrays on the stack, both 1 << 14 in size. That's 16K each:
int buffL[16384];
int roffset[16384];
So right away you are using a much smaller stack size (i.e. fewer pages being cached and swapped out). And the loop only iterates a quarter as many times in the third test as in the second (16,384 vs. 65,536 iterations). Maybe you meant to declare 1 << 15 (or 1 << 16) as the size for each array instead?

Optimizing C code for ARM-based devices

Recently, I stumbled upon an interview question where you need to write code that's optimized for ARM, especially for the iPhone:
Write a function which takes an array of char (ASCII symbols) and finds
the most frequent character.
char mostFrequentCharacter(char* str, int size)
The function should be optimized to run on dual-core ARM-based
processors, with an infinite amount of memory.
On the face of it, the problem itself looks pretty simple, and here is a simple implementation of the function that came to mind:
#define RESULT_SIZE 127

inline int set_char(char c, int result[])
{
    int count = result[c];
    result[c] = ++count;
    return count;
}

char mostFrequentChar(char str[], int size)
{
    int result[RESULT_SIZE] = {0};
    char current_char;
    char frequent_char = '\0';
    int current_char_frequency = 0;
    int char_frequency = 0;
    for(size_t i = 0; i<size; i++)
    {
        current_char = str[i];
        current_char_frequency = set_char(current_char, result);
        if(current_char_frequency >= char_frequency)
        {
            char_frequency = current_char_frequency;
            frequent_char = current_char;
        }
    }
    return frequent_char;
}
Firstly, I did some basic code optimization: I moved the code that tracks the most frequent char out of the per-character loop and into an additional for loop, and got a significant increase in speed. Instead of evaluating the following block of code size times,
if(current_char_frequency >= char_frequency)
{
    char_frequency = current_char_frequency;
    frequent_char = current_char;
}
we can find the most frequent char in O(RESULT_SIZE), where RESULT_SIZE == 127.
char mostFrequentCharOpt1(char str[], int size)
{
    int result[RESULT_SIZE] = {0};
    char frequent_char = '\0';
    int current_char_frequency = 0;
    int char_frequency = 0;
    for(int i = 0; i<size; i++)
    {
        set_char(str[i], result);
    }
    for(int i = 0; i<RESULT_SIZE; i++)
    {
        current_char_frequency = result[i];
        if(current_char_frequency >= char_frequency)
        {
            char_frequency = current_char_frequency;
            frequent_char = i;
        }
    }
    return frequent_char;
}
Benchmarks: iPhone 5s
size = 1000000
iterations = 500
// seconds = 7.842381
char mostFrequentChar(char str[], int size)
// seconds = 5.905090
char mostFrequentCharOpt1(char str[], int size)
On average, mostFrequentCharOpt1 runs ~24% faster than the basic implementation.
Type optimization
The ARM core's registers are 32 bits wide. Therefore, changing all local variables of type char to type int saves the processor extra instructions that account for the size of the variable after each assignment.
Note: ARM64 provides 31 registers (x0-x30), each 64 bits wide and also available in a 32-bit form (w0-w30). Hence, there is no need to do anything special to operate on the int data type.
infocenter.arm.com - ARMv8 Registers
While comparing the assembly-language versions of the functions, I noticed a difference in how ARM works with the int type versus the char type. ARM uses the LDRB instruction to load a byte and the STRB instruction to store a byte to an individual byte in memory. From my point of view, LDRB is a bit slower than LDR, because LDRB zero-extends on every load from memory into a register. In other words, we can't just load a byte into a 32-bit register; the byte has to be extended to a word.
Benchmarks: iPhone 5s
size = 1000000
iterations = 500
// seconds = 5.905090
char mostFrequentCharOpt1(char str[], int size)
// seconds = 5.874684
int mostFrequentCharOpt2(char str[], int size)
Changing the char type to int didn't give me a significant speed increase on the iPhone 5s; by contrast, running the same code on an iPhone 4 gave a different result:
Benchmarks: iPhone 4
size = 1000000
iterations = 500
// seconds = 28.853877
char mostFrequentCharOpt1(char str[], int size)
// seconds = 27.328955
int mostFrequentCharOpt2(char str[], int size)
Loop optimization
Next, I did a loop optimization, where, instead of incrementing i value, I decremented it.
before
for(int i = 0; i<size; i++) { ... }
after
for(int i = size; i--; ) { ... }
Again, comparing the assembly code gave me a clear distinction between the two approaches.
mostFrequentCharOpt2 | mostFrequentCharOpt3
0x10001250c <+88>: ldr w8, [sp, #28] ; w8 = i | 0x100012694 <+92>: ldr w8, [sp, #28] ; w8 = i
0x100012510 <+92>: ldr w9, [sp, #44] ; w9 = size | 0x100012698 <+96>: sub w9, w8, #1 ; w9 = i - 1
0x100012514 <+96>: cmp w8, w9 ; if i<size | 0x10001269c <+100>: str w9, [sp, #28] ; save w9 to memmory
0x100012518 <+100>: b.ge 0x100012548 ; if true => end loop | 0x1000126a0 <+104>: cbz w8, 0x1000126c4 ; compare w8 with 0 and if w8 == 0 => go to 0x1000126c4
0x10001251c <+104>: ... set_char start routine | 0x1000126a4 <+108>: ... set_char start routine
... | ...
0x100012534 <+128>: ... set_char end routine | 0x1000126bc <+132>: ... set_char end routine
0x100012538 <+132>: ldr w8, [sp, #28] ; w8 = i | 0x1000126c0 <+136>: b 0x100012694 ; back to the first line
0x10001253c <+136>: add w8, w8, #1 ; i++ | 0x1000126c4 <+140>: ...
0x100012540 <+140>: str w8, [sp, #28] ; save i to $sp+28 |
0x100012544 <+144>: b 0x10001250c ; back to the first line |
0x100012548 <+148>: str ... |
Here, instead of loading size from memory and comparing it with i while i was being incremented, we simply decrement i by 0x1 and compare the register holding i with 0.
Benchmarks: iPhone 5s
size = 1000000
iterations = 500
// seconds = 5.874684
char mostFrequentCharOpt2(char str[], int size) //Type optimization
// seconds = 5.577797
char mostFrequentCharOpt3(char str[], int size) //Loop optimization
Threading optimization
Reading the question carefully gives us at least one more optimization: the line "...optimized to run on dual-core ARM-based processors..." is a hint to parallelize the code using pthreads or GCD.
int mostFrequentCharThreadOpt(char str[], int size)
{
    int s;
    int tnum;
    int num_threads = THREAD_COUNT; //by default 2
    struct thread_info *tinfo;

    tinfo = calloc(num_threads, sizeof(struct thread_info));
    if (tinfo == NULL)
        exit(EXIT_FAILURE);

    int minCharCountPerThread = size/num_threads;
    int startIndex = 0;
    for (tnum = num_threads; tnum--;)
    {
        startIndex = minCharCountPerThread*tnum;
        tinfo[tnum].thread_num = tnum + 1;
        tinfo[tnum].startIndex = minCharCountPerThread*tnum;
        tinfo[tnum].str_size = (size - minCharCountPerThread*tnum) >= minCharCountPerThread ? minCharCountPerThread : (size - minCharCountPerThread*(tnum-1));
        tinfo[tnum].str = str;
        s = pthread_create(&tinfo[tnum].thread_id, NULL,
                           (void *(*)(void *))_mostFrequentChar, &tinfo[tnum]);
        if (s != 0)
            exit(EXIT_FAILURE);
    }

    int frequent_char = 0;
    int char_frequency = 0;
    int current_char_frequency = 0;
    for (tnum = num_threads; tnum--; )
    {
        s = pthread_join(tinfo[tnum].thread_id, NULL);
    }
    for(int i = RESULT_SIZE; i--; )
    {
        current_char_frequency = 0;
        for (int z = num_threads; z--;)
        {
            current_char_frequency += tinfo[z].resultArray[i];
        }
        if(current_char_frequency >= char_frequency)
        {
            char_frequency = current_char_frequency;
            frequent_char = i;
        }
    }
    free(tinfo);
    return frequent_char;
}
Benchmarks: iPhone 5s
size = 1000000
iterations = 500
// seconds = 5.577797
char mostFrequentCharOpt3(char str[], int size) //Loop optimization
// seconds = 3.758042
// THREAD_COUNT = 2
char mostFrequentCharThreadOpt(char str[], int size) //Thread optimization
Note: mostFrequentCharThreadOpt works slower than mostFrequentCharOpt2 on iPhone 4.
Benchmarks: iPhone 4
size = 1000000
iterations = 500
// seconds = 25.819347
char mostFrequentCharOpt3(char str[], int size) //Loop optimization
// seconds = 31.541066
char mostFrequentCharThreadOpt(char str[], int size) //Thread optimization
Question
How well optimized are mostFrequentCharOpt3 and mostFrequentCharThreadOpt? In other words: are there any other ways to optimize both methods?
Source code
Alright, here are some things you can try. I can't say 100% what will be effective in your situation, but from experience: given that even the loop optimization helped, your compiler is not optimizing much.
It depends a bit on your THREAD_COUNT. You say it's 2 by default, but you might be able to save some time if you are 100% sure it's 2. You know the platform you are working on; don't make anything dynamic without a reason if speed is your priority.
If THREAD_COUNT == 2, num_threads is an unnecessary variable and can be removed.
int minCharCountPerThread = size/num_threads;
And the old, much-discussed bit-shifting trick; try it:
int minCharCountPerThread = size >> 1; //divide by 2
The next thing you can try is unrolling your loops: several loops run only 2 iterations; if code size isn't a problem, why not remove the loop aspect altogether?
This is really something you should just try: see what happens and whether it is useful to you. I've seen cases where loop unrolling worked great, and cases where it slowed down my code.
Last thing: try using unsigned numbers instead of signed ints (unless you really need signed). It is known that some tricks/instructions are only available for unsigned variables.
There are quite a few things you could do, but the results will really depend on which specific ARM hardware the code runs on. For example, older iPhone hardware is completely different from the newer 64-bit devices: a totally different hardware architecture and a different instruction set. Older 32-bit ARM hardware had some real "tricks" that could make things a lot faster, like multiple-register read/write operations.
One example optimization: instead of loading bytes, load whole 32-bit words and then operate on each byte in the register using bit shifts.
If you are using 2 threads, another approach is to break up the memory accesses so that one memory page is processed by one thread while the second thread operates on the next page, and so on. That way different registers in the different cores can do maximum crunching without reading or writing the same memory page (and memory access is typically the slow part).
I would also suggest that you start with a good timing framework; I built a timing framework for ARM+iOS that you might find useful for that purpose.
