Do repeat string instructions use the extra segment (ES) or not? - pthreads

I'm investigating how the rep string instructions work. According to the instruction descriptions, rep movsl, for example, has the following form:
rep movsl 5+4*(E)CX Move (E)CX dwords from [(E)SI] to ES:[(E)DI]
where ES is the extra segment register, which should hold the base (selector) of the extra segment in memory.
The pseudocode of the operation looks like this (real-mode addressing):
while (CX != 0) {
    *(ES*16 + DI) = *(DS*16 + SI);
    SI++;
    DI++;
    CX--;
}
But in a flat memory model it does not seem true that rep string instructions operate on the extra segment.
For example, I've created a test with 2 threads, each of which copies an array into a TLS (thread-local storage) array using rep movs. Logically that should not work, as the TLS data is kept in the GS segment, not in ES. But it works; at least I see the correct result when running the test.
The Intel compiler produces the following code for the copy:
movl %gs:0, %eax #27.18
movl $1028, %ecx #27.18
movl 32(%esp), %esi #27.18
lea TLS_data1@NTPOFF(%eax), %edi #27.18
movl %ecx, %eax #27.18
shrl $2, %ecx #27.18
rep #27.18
movsl #27.18
movl %eax, %ecx #27.18
andl $3, %ecx #27.18
rep #27.18
movsb #27.18
Here %edi points to the TLS array and rep movs stores into it. If rep movs implicitly used the ES base (which I doubt), such code should not produce the correct result.
Am I missing something here?
Here is the test I created:
#define _MULTI_THREADED
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#define NUMTHREADS 2
#define N 257
typedef struct {
int data1[N];
} threadparm_t;
__thread threadparm_t TLS_data1;
void foo();
void *theThread(void *parm)
{
int rc;
threadparm_t *gData;
pthread_t self = pthread_self();
printf("Thread %u: Entered\n", self);
gData = (threadparm_t *)parm;
TLS_data1 = *gData;
foo();
return NULL;
}
void foo() {
int i;
pthread_t self = pthread_self();
printf("\nThread %u: foo()\n", self/1000000);
for (i=0; i<N; i++) {
printf("%d ", TLS_data1.data1[i]);
}
printf("\n\n");
}
int main(int argc, char **argv)
{
pthread_t thread[NUMTHREADS];
int rc=0;
int i,j;
threadparm_t gData[NUMTHREADS];
printf("Enter Testcase - %s\n", argv[0]);
printf("Create/start threads\n");
for (i=0; i < NUMTHREADS; i++) {
/* Create per-thread TLS data and pass it to the thread */
for (j=0; j < N; j++) {
gData[i].data1[j] = i+1;
}
rc = pthread_create(&thread[i], NULL, theThread, &gData[i]);
}
printf("Wait for the threads to complete, and release their resources\n");
for (i=0; i < NUMTHREADS; i++) {
rc = pthread_join(thread[i], NULL);
}
printf("Main completed\n");
return 0;
}

What you are missing is that these ops:
movl %gs:0, %eax
...
lea TLS_data1@NTPOFF(%eax), %edi
are loading the zero-based (linear) address of the thread-local TLS_data1 into %edi, which works fine with the zero-based ES segment.
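In a flat memory model, DS and ES both have base 0, so once the TLS address has been materialized into %edi, any ordinary pointer works as the rep movs destination. A minimal sketch of the same idea in C with GNU inline assembly (hypothetical names, not the compiler's exact code):

#include <stddef.h>

__thread int tls_buf[256];

/* Copies n_ints 32-bit words into the thread-local buffer with rep movsl.
   The destination is just the flat (zero-based) linear address &tls_buf,
   which is exactly what the %gs:0 + @NTPOFF sequence computes. */
void copy_into_tls(const int *src, size_t n_ints)
{
    int *dst = tls_buf;
    __asm__ volatile ("rep movsl"
                      : "+D" (dst), "+S" (src), "+c" (n_ints)
                      :
                      : "memory");
}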

Related

flex - Simple parser gives error: fatal flex scanner internal error--end of buffer missed

I'm trying to implement a simple parser that calculates addition, subtraction, multiplication and division of fractional numbers. Fractional numbers are written in the form numeratorfdenominator, like 2f3, 4f6, 9f4, etc. The parser should run in REPL mode. To compile and run:
lex lexer.l
yacc -d parser.y
cc lex.yy.c y.tab.c -lm -o main
./main
flex code:
%{
#include "y.tab.h"
extern YYSTYPE yylval;
#include <math.h>
void to_int(char* num, int* arr);
%}
IDENTIFIER_ERROR [0-9][a-zA-Z0-9_]*
COMMENT ";;".*
VALUESTR \"(.*?)\"
%%
[ \t\f\v\n] { ; }
exit { return KW_EXIT; }
"+" { return OP_PLUS; }
"-" { return OP_MINUS; }
"/" { return OP_DIV; }
"*" { return OP_MULT; }
"(" { return OP_OP; }
")" { return OP_CP; }
(0)|([1-9]+"f"[1-9]*) { to_int(yytext, yylval.INT_ARR); return VALUEF; }
[a-zA-Z_]([a-zA-Z0-9_]*) { strcpy(yylval.STR, yytext); return IDENTIFIER; }
{COMMENT} { printf("%s: COMMENT\n", yytext); }
{IDENTIFIER_ERROR} { printf("%s: SYNTAX ERROR\n", yytext); exit(1); }
. { printf("%s: SYNTAX ERROR\n", yytext); exit(1); }
%%
// fractional number taken as a string, converted to: arr[0] = numerator, arr[1] = denominator
void to_int(char* num, int* arr) {
char* nominator, *denominator;
strcpy(nominator, num); // nominator contains whole number for now
strtok_r(nominator, "f", &denominator);
//printf ("lex: NUMS parsed as: %s %s\n", nominator, denominator);
arr[0] = atoi(nominator);
arr[1] = atoi(denominator);
//printf("lex: nom: %d denom: %d\n", arr[0], arr[1]);
}
int yywrap(){
return 1;
}
yacc file:
%{
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>
int yylex(void);
void yyerror(char *str);
void fractional_divide(int* num1, int* num2, int* RESULTF);
void fractional_multiply(int* num1, int* num2, int* RESULTF);
void fractional_sub(int* num1, int* num2, int* RESULTF);
void fractional_sum(int* num1, int* num2, int* RESULTF);
%}
%token KW_EXIT
%token OP_PLUS OP_MINUS OP_DIV OP_MULT OP_OP OP_CP OP_COMMA
%union{
int INTEGER;
int INT_ARR[2];
char STR[20];
};
%start START
%type<INT_ARR> EXP
%token <INT_ARR> VALUEF
%%
START : EXPLIST ;
EXPLIST : EXP | EXPLIST EXP ;
EXP: OP_OP OP_PLUS EXP EXP OP_CP { fractional_sum($3, $4, $$); printf("> %d%c%d\n", $$[0], 'f', $$[1]); }
| OP_OP OP_MINUS EXP EXP OP_CP { fractional_sub($3, $4, $$); printf("> %d%c%d\n", $$[0], 'f', $$[1]); }
| OP_OP OP_DIV EXP EXP OP_CP { fractional_divide($3, $4, $$); printf("> %d%c%d\n", $$[0], 'f', $$[1]); }
| OP_OP OP_MULT EXP EXP OP_CP { fractional_multiply($3, $4, $$); printf("> %d%c%d\n", $$[0], 'f', $$[1]); }
| VALUEF { $$[0] = $1[0]; $$[1] = $1[1];}
| KW_EXIT { printf("exiting...\n"); return 0; }
;
%%
void equalize_denominators(int* num1, int* num2) {
num1[0] *= num2[1];
num1[1] *= num2[1];
num2[0] *= num1[1];
num2[1] *= num1[1];
}
void fractional_sum(int* num1, int* num2, int* RESULTF) {
if (num1[1] != num2[1])
equalize_denominators(num1, num2);
RESULTF[0] = num1[0] + num2[0];
RESULTF[1] = num2[1];
}
void fractional_sub(int* num1, int* num2, int* RESULTF) {
if (num1[1] != num2[1])
equalize_denominators(num1, num2);
RESULTF[0] = num1[0] - num2[0];
RESULTF[1] = num2[1];
}
void fractional_divide(int* num1, int* num2, int* RESULTF) {
RESULTF[0] = num1[0] * num2[1];
RESULTF[1] = num1[1] * num2[0];
}
void fractional_multiply(int* num1, int* num2, int* RESULTF) {
RESULTF[0] = num1[0] * num2[0];
RESULTF[1] = num1[1] * num2[1];
}
void yyerror(char *str) {
printf("yyerror: %s\n", str);
}
int main(int argc, char *argv[]){
if(argc == 1)
yyparse();
else {
printf("Input error. Exiting...\n");
exit(1);
}
return 0;
}
Sample output: the first line is OK, but when I hit enter after the second line I get this error:
(+ 2f3 1f3)
result: 3f3
(* 2f1 2f6)
result: 4f6
fatal flex scanner internal error--end of buffer missed
That error message can occur in some specific circumstances involving the use of yymore() in the last token in the input, but probably the most common cause is memory corruption, which is what you've managed to do here.
It's likely that the issue is in to_int, where you do a strcpy whose destination is an uninitialised pointer:
void to_int(char* num, int* arr) {
char* nominator, *denominator;
strcpy(nominator, num); // FIXME nominator is uninitialised
It's actually not clear to me why you feel the need to make a copy of the argument num, since you are calling it with yytext. You're free to modify the contents of yytext as long as you don't write outside of its memory area. (The variable yyleng tells you how long yytext is.) Since strtok_r does not modify its argument outside of that area, it's safe to apply it to yytext. But if you are going to copy num, you obviously have to copy it into a buffer behind an actual, validly initialized pointer; otherwise chaos will ensue. (Which could include triggering random flex error messages.)
I didn't check your code thoroughly nor did I attempt to run it, so there may be other errors. In particular, I did notice a couple of problems with your token patterns:
(0)|([1-9]+"f"[1-9]*) does not allow 10f2 or 2f103, since you only allow integers written with digits 1 through 9. It also allows 2f, whose meaning is opaque to me, and your to_int function could blow up on it. (At best, it would end up with a denominator of 0, which is also an error.) I'd recommend using two patterns, one for integers and the other for fractions:
0|[1-9][0-9]*   { yylval.INT_ARR[0] = atoi(yytext);
                  yylval.INT_ARR[1] = 1;
                  return VALUEF;
                }
(0|[1-9][0-9]*)f[1-9][0-9]*   {
                  to_int(yytext, yylval.INT_ARR);
                  return VALUEF;
                }
But you might want to add more meaningful error messages for illegal numbers like 03f2 and 3f0.
Although you don't use it anywhere, your pattern for character strings is incorrect, since (f)lex does not implement non-greedy matching. A better pattern would be \"[^"]*\" or \"[^"\n]*\" (which prohibits newlines inside strings); even better would be to allow backslash escapes with something like \"(\\.|[^"\\\n])*\". There are lots of other variants but that basically covers the general principle. (Some of us prefer ["] to \" but that's just stylistic; the meaning is the same.)
Also, it is bad style to call exit from a yylex action. It's better to arrange for some kind of error return. Similarly, you should not use a return statement in a yyparse action, since it leaves the parser's internal state inconsistent and does not allow the parser to free the resources it has allocated. Use YYACCEPT (or YYABORT if you want to signal a syntax error). These are described in the documentation and in any good guide.
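For example, a minimal sketch of to_int that works directly on yytext, along the lines above (strchr and atoi need <string.h> and <stdlib.h> in the scanner prologue; it assumes the token matched the fraction pattern, so an 'f' is present):

void to_int(char *num, int *arr) {
    char *f = strchr(num, 'f');  /* separator between numerator and denominator */
    *f = '\0';                   /* terminate the numerator in place, inside yytext */
    arr[0] = atoi(num);          /* numerator */
    arr[1] = atoi(f + 1);        /* denominator */
}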

CLANG is ignoring AVX2 intrinsics in this code

I'm testing LLVM's ability to vectorize some code on https://rust.godbolt.org/
Options: -mavx2 -ffast-math -fno-math-errno -O3
Compiler: LLVM 13, but actually any LLVM version does the same thing.
#include <immintrin.h>
template<class T>
struct V4
{
T A,B,C,D;
V4() { };
V4(T x) : A(x), B(x), C(x), D(x) { };
V4(T a, T b, T c, T d) : A(a), B(b), C(c), D(d) { };
void operator +=(const V4& x)
{
//A += x.A; B += x.B; C += x.C; D += x.D;
__m256 f = _mm256_loadu_ps(&A);
__m256 f2 = _mm256_loadu_ps(&x.A);
_mm256_store_ps(&A, _mm256_add_ps(f, f2));
};
T GetSum() const { return A + B + C + D; };
};
typedef V4<float> V4F;
double FN(float f[4], float g[4], int cnt)
{
V4F vec1(f[0], f[1], f[2], f[3]), vec2(g[0], g[1], g[2], g[3]);
for (int i=0; i<cnt; i++)
vec1 += vec2;
return vec1.GetSum();
};
This is the resulting disassembly:
FN(float*, float*, int): # #FN(float*, float*, int)
vmovddup xmm0, qword ptr [rdi + 8] # xmm0 = mem[0,0]
vaddps xmm0, xmm0, xmmword ptr [rdi]
vmovshdup xmm1, xmm0 # xmm1 = xmm0[1,1,3,3]
vaddss xmm0, xmm0, xmm1
vcvtss2sd xmm0, xmm0, xmm0
ret
So it is completely ignoring the intrinsics. If I uncomment the commented-out line that does the same thing in plain C++, much longer code appears, so it apparently starts handling it.
Am I missing something or is this a bug in LLVM?
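As a point of comparison: V4<float> holds only four floats (16 bytes), while the 256-bit loads and stores touch 32 bytes. A minimal sketch of a size-matched 128-bit version of the += body (illustrative only, written here as a free C function) would be:

#include <immintrin.h>

/* Adds four packed floats in place; touches exactly 16 bytes on each side. */
static inline void v4f_add_inplace(float *a, const float *b)
{
    __m128 x = _mm_loadu_ps(a);
    __m128 y = _mm_loadu_ps(b);
    _mm_storeu_ps(a, _mm_add_ps(x, y));
}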

How to disable Nvidia AGX Xavier CPU cache?

I would like to disable the CPU cache of the Nvidia AGX Xavier.
I've tried to use menuconfig to control the cache, but it doesn't seem to work on this system. Then I tried to use inline assembly, and I got an 'illegal instruction (core dumped)' error.
#include <stdio.h>
#include <string.h>
#include <linux/kernel.h>
#include <sys/syscall.h>
#include <unistd.h>
void L1_off(void){
    __asm__ volatile (
        "mrs x11, sctlr_el1;"  // read SCTLR_EL1
        "ldr x1, = 0x1000;"    // bit 12 = I (instruction cache enable)
        "bic x11, x11, x1;"    // clear bit 12
        "msr sctlr_el1, x11;"  // write the result back to SCTLR_EL1
        ::: "x1", "x11");
}
void L1_on(void){
    __asm__ volatile (
        "mrs x11, sctlr_el1;"
        "ldr x2, = 0x1000;"
        "orr x11, x11, x2;"    // set bit 12
        "msr sctlr_el1, x11;"
        ::: "x2", "x11");
}
int main(void){
L1_off();
//L1_on();
return 0;
}
Error message: illegal instruction (core dumped)

MPU not triggering faults in cortex M4

I want to protect a memory region from writes. I've configured the MPU, but it is not generating any faults.
The base address of the region that I want to protect is 0x20000000. The region size is 64 bytes.
Here's compilable code that demonstrates the issue.
#define MPU_CTRL (*((volatile unsigned long*) 0xE000ED94))
#define MPU_RNR (*((volatile unsigned long*) 0xE000ED98))
#define MPU_RBAR (*((volatile unsigned long*) 0xE000ED9C))
#define MPU_RASR (*((volatile unsigned long*) 0xE000EDA0))
#define SCB_SHCSR (*((volatile unsigned long*) 0xE000ED24))
void Registers_Init(void)
{
MPU_RNR = 0x00000000; // using region 0
MPU_RBAR = 0x20000000; // base address is 0x20000000
MPU_RASR = 0x0700110B; // Size is 64 bytes, no sub-regions, permission=7(ro,ro), s=b=c= 0, tex=0
MPU_CTRL = 0x00000001; // enable MPU
SCB_SHCSR = 0x00010000; // enable MemManage Fault
}
void MemManage_Handler(void)
{
__asm(
"MOV R4, 0x77777777\n\t"
"MOV R5, 0x77777777\n\t"
);
}
int main(void)
{
Registers_Init();
__asm(
"LDR R0, =0x20000000\n\t"
"MOV R1, 0x77777777\n\t"
"STR R1, [R0,#0]"
);
return (1);
}
void SystemInit(void)
{
}
So, in the main function, I am writing to the restricted area, i.e. 0x20000000, but the MPU is not generating any fault; instead of calling MemManage_Handler(), the write succeeds.
This looks fine to me. Make sure your hardware actually has an MPU: there is a read-only register called MPU_TYPE that tells you whether one is present. If bits 15:8 of MPU_TYPE read 0, there is no MPU.
And never use magic numbers when dealing with registers; it makes the code really hard for you and other people to read. Instead, define named bit masks. See tutorials on how to do that.
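For illustration, a minimal sketch of both suggestions (MPU_TYPE lives at 0xE000ED90 in the ARMv7-M System Control Space; the mask names below are just examples):

#define MPU_TYPE (*((volatile unsigned long*) 0xE000ED90))

/* Named bit masks instead of magic numbers */
#define MPU_CTRL_ENABLE       (1UL << 0)   /* MPU enable bit in MPU_CTRL */
#define SCB_SHCSR_MEMFAULTENA (1UL << 16)  /* MemManage fault enable in SHCSR */

int MPU_IsPresent(void)
{
    /* DREGION (bits 15:8) is the number of supported regions; 0 means no MPU. */
    return ((MPU_TYPE >> 8) & 0xFF) != 0;
}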

Optimizing C code for ARM-based devices

Recently, I stumbled upon an interview question where you need to write code that is optimized for ARM, specifically for the iPhone:
Write a function which takes an array of char (ASCII symbols) and find
the most frequent character.
char mostFrequentCharacter(char* str, int size)
The function should be optimized to run on dual-core ARM-based
processors, and an infinity amount of memory.
On the face of it, the problem itself looks pretty simple and here is the simple implementation of the function, that came out in my head:
#define RESULT_SIZE 127
inline int set_char(char c, int result[])
{
int count = result[c];
result[c] = ++count;
return count;
}
char mostFrequentChar(char str[], int size)
{
int result[RESULT_SIZE] = {0};
char current_char;
char frequent_char = '\0';
int current_char_frequency = 0;
int char_frequency = 0;
for(size_t i = 0; i<size; i++)
{
current_char = str[i];
current_char_frequency = set_char(current_char, result);
if(current_char_frequency >= char_frequency)
{
char_frequency = current_char_frequency;
frequent_char = current_char;
}
}
return frequent_char;
}
First, I did some basic code optimization: I moved the code that determines the most frequent char on every iteration into a separate for loop, and got a significant increase in speed. Instead of evaluating the following block of code size times:
if(current_char_frequency >= char_frequency)
{
char_frequency = current_char_frequency;
frequent_char = current_char;
}
we can find the most frequent char in O(RESULT_SIZE), where RESULT_SIZE == 127:
char mostFrequentCharOpt1(char str[], int size)
{
int result[RESULT_SIZE] = {0};
char frequent_char = '\0';
int current_char_frequency = 0;
int char_frequency = 0;
for(int i = 0; i<size; i++)
{
set_char(str[i], result);
}
for(int i = 0; i<RESULT_SIZE; i++)
{
current_char_frequency = result[i];
if(current_char_frequency >= char_frequency)
{
char_frequency = current_char_frequency;
frequent_char = i;
}
}
return frequent_char;
}
Benchmarks: iPhone 5s
size = 1000000
iterations = 500
// seconds = 7.842381
char mostFrequentChar(char str[], int size)
// seconds = 5.905090
char mostFrequentCharOpt1(char str[], int size)
On average, mostFrequentCharOpt1 runs ~24% faster than the basic implementation.
Type optimization
ARM core registers are 32 bits wide. Therefore, changing all local variables of type char to type int prevents the processor from executing additional instructions to account for the size of the local variable after each assignment.
Note: ARM64 provides 31 general-purpose registers (x0-x30), each 64 bits wide and also available in a 32-bit form (w0-w30). Hence, there is no need to do anything special to operate on the int data type.
infocenter.arm.com - ARMv8 Registers
While comparing the assembly versions of the functions, I noticed a difference between how ARM works with the int type and the char type. ARM uses the LDRB instruction to load a byte and the STRB instruction to store a byte to individual bytes in memory. From my point of view, LDRB is a bit slower than LDR, because LDRB zero-extends on every memory access as it loads into the register. In other words, we can't just load a byte into a 32-bit register; the byte has to be widened to a word.
Benchmarks: iPhone 5s
size = 1000000
iterations = 500
// seconds = 5.905090
char mostFrequentCharOpt1(char str[], int size)
// seconds = 5.874684
int mostFrequentCharOpt2(char str[], int size)
Changing char to int didn't give me a significant speed increase on the iPhone 5s; by contrast, running the same code on an iPhone 4 gave a different result:
Benchmarks: iPhone 4
size = 1000000
iterations = 500
// seconds = 28.853877
char mostFrequentCharOpt1(char str[], int size)
// seconds = 27.328955
int mostFrequentCharOpt2(char str[], int size)
Loop optimization
Next, I did a loop optimization: instead of incrementing the i value, I decremented it.
before
for(int i = 0; i<size; i++) { ... }
after
for(int i = size; i--;) { ... }
Again, comparing the assembly code gave a clear distinction between the two approaches.
mostFrequentCharOpt2 | mostFrequentCharOpt3
0x10001250c <+88>: ldr w8, [sp, #28] ; w8 = i | 0x100012694 <+92>: ldr w8, [sp, #28] ; w8 = i
0x100012510 <+92>: ldr w9, [sp, #44] ; w9 = size | 0x100012698 <+96>: sub w9, w8, #1 ; w9 = i - 1
0x100012514 <+96>: cmp w8, w9 ; if i<size | 0x10001269c <+100>: str w9, [sp, #28] ; save w9 to memory
0x100012518 <+100>: b.ge 0x100012548 ; if true => end loop | 0x1000126a0 <+104>: cbz w8, 0x1000126c4 ; compare w8 with 0 and if w8 == 0 => go to 0x1000126c4
0x10001251c <+104>: ... set_char start routine | 0x1000126a4 <+108>: ... set_char start routine
... | ...
0x100012534 <+128>: ... set_char end routine | 0x1000126bc <+132>: ... set_char end routine
0x100012538 <+132>: ldr w8, [sp, #28] ; w8 = i | 0x1000126c0 <+136>: b 0x100012694 ; back to the first line
0x10001253c <+136>: add w8, w8, #1 ; i++ | 0x1000126c4 <+140>: ...
0x100012540 <+140>: str w8, [sp, #28] ; save i to $sp+28 |
0x100012544 <+144>: b 0x10001250c ; back to the first line |
0x100012548 <+148>: str ... |
Here, instead of loading size from memory and comparing it with the incrementing i variable, we simply decrement i by 1 and compare the register holding i with 0.
Benchmarks: iPhone 5s
size = 1000000
iterations = 500
// seconds = 5.874684
char mostFrequentCharOpt2(char str[], int size) //Type optimization
// seconds = 5.577797
char mostFrequentCharOpt3(char str[], int size) //Loop optimization
Threading optimization
Reading the question carefully gives us at least one more optimization. The line "...optimized to run on dual-core ARM-based processors..." drops a hint to parallelize the code using pthreads or GCD.
int mostFrequentCharThreadOpt(char str[], int size)
{
int s;
int tnum;
int num_threads = THREAD_COUNT; //by default 2
struct thread_info *tinfo;
tinfo = calloc(num_threads, sizeof(struct thread_info));
if (tinfo == NULL)
exit(EXIT_FAILURE);
int minCharCountPerThread = size/num_threads;
int startIndex = 0;
for (tnum = num_threads; tnum--;)
{
startIndex = minCharCountPerThread*tnum;
tinfo[tnum].thread_num = tnum + 1;
tinfo[tnum].startIndex = minCharCountPerThread*tnum;
tinfo[tnum].str_size = (size - minCharCountPerThread*tnum) >= minCharCountPerThread ? minCharCountPerThread : (size - minCharCountPerThread*(tnum-1));
tinfo[tnum].str = str;
s = pthread_create(&tinfo[tnum].thread_id, NULL,
(void *(*)(void *))_mostFrequentChar, &tinfo[tnum]);
if (s != 0)
exit(EXIT_FAILURE);
}
int frequent_char = 0;
int char_frequency = 0;
int current_char_frequency = 0;
for (tnum = num_threads; tnum--; )
{
s = pthread_join(tinfo[tnum].thread_id, NULL);
}
for(int i = RESULT_SIZE; i--; )
{
current_char_frequency = 0;
for (int z = num_threads; z--;)
{
current_char_frequency += tinfo[z].resultArray[i];
}
if(current_char_frequency >= char_frequency)
{
char_frequency = current_char_frequency;
frequent_char = i;
}
}
free(tinfo);
return frequent_char;
}
Benchmarks: iPhone 5s
size = 1000000
iterations = 500
// seconds = 5.874684
char mostFrequentCharOpt3(char str[], int size) //Loop optimization
// seconds = 3.758042
// THREAD_COUNT = 2
char mostFrequentCharThreadOpt(char str[], int size) //Thread optimization
Note: mostFrequentCharThreadOpt works slower than mostFrequentCharOpt2 on iPhone 4.
Benchmarks: iPhone 4
size = 1000000
iterations = 500
// seconds = 25.819347
char mostFrequentCharOpt3(char str[], int size) //Loop optimization
// seconds = 31.541066
char mostFrequentCharThreadOpt(char str[], int size) //Thread optimization
Question
How well optimized are mostFrequentCharOpt3 and mostFrequentCharThreadOpt? In other words, are there any other ways to optimize these functions?
Source code
Alright, here are some things you can try. I can't say 100% what will be effective in your situation, but from experience, and given that even the loop rewrite made a measurable difference for you, your compiler isn't optimizing very aggressively.
It depends a bit on your THREAD_COUNT: you say it is 2 by default, but you might be able to save some time if you are 100% sure it is 2. You know the platform you are working on; don't make anything dynamic without a reason if speed is your priority.
If THREAD_COUNT == 2, num_threads is an unnecessary variable and can be removed.
int minCharCountPerThread = size/num_threads;
And the old, much-discussed bit-shifting trick; try it:
int minCharCountPerThread = size >> 1; //divide by 2
The next thing you can try is unrolling your loops: several loops run only two times, so if code size isn't a problem, why not remove the loop altogether?
This is really something you should just try; see what happens and whether it is useful to you. I've seen cases where loop unrolling works great, and cases where it slows my code down.
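For instance, a hedged sketch of unrolling the counting loop by four (the table is sized 256 here so any byte value stays in range; whether this helps is compiler- and hardware-dependent, as said above):

static void count_chars_unrolled(const char *str, int size, int counts[256])
{
    int i = 0;
    for (; i + 4 <= size; i += 4) {        /* unrolled body, 4 bytes per iteration */
        counts[(unsigned char)str[i]]++;
        counts[(unsigned char)str[i + 1]]++;
        counts[(unsigned char)str[i + 2]]++;
        counts[(unsigned char)str[i + 3]]++;
    }
    for (; i < size; i++)                  /* tail when size is not a multiple of 4 */
        counts[(unsigned char)str[i]]++;
}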
Last thing: try using unsigned numbers instead of signed ints (unless you really need signed). It is known that some tricks/instructions are only available for unsigned variables.
There are quite a few things you could do, but the results will really depend on which specific ARM hardware the code is running on. For example, older iPhone hardware is completely different from the newer 64-bit devices: a totally different hardware architecture and a different instruction set. Older 32-bit ARM hardware had some real "tricks" that could make things a lot faster, like multiple-register read/write operations.
One example optimization: instead of loading bytes, load whole 32-bit words and then operate on each byte in the register using bit shifts (a sketch of this follows below). If you are using 2 threads, another approach is to break up the memory accesses so that one memory page is processed by one thread while the second thread operates on the second memory page, and so on. That way the registers in the different cores can do maximum crunching without reading or writing the same memory page (and memory access is typically the slow part).
I would also suggest that you start with a good timing framework; I built a timing framework for ARM+iOS that you might find useful for that purpose.
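A hedged sketch of the "load whole 32-bit words, then pick the bytes apart with shifts" idea (memcpy is used for the word load to sidestep alignment and aliasing issues; splitting the buffer between threads would wrap around a routine like this):

#include <stdint.h>
#include <string.h>

static void count_chars_wordwise(const char *str, int size, int counts[256])
{
    int i = 0;
    for (; i + 4 <= size; i += 4) {
        uint32_t w;
        memcpy(&w, str + i, sizeof w);     /* one 32-bit load instead of four byte loads */
        counts[w & 0xFF]++;
        counts[(w >> 8) & 0xFF]++;
        counts[(w >> 16) & 0xFF]++;
        counts[(w >> 24) & 0xFF]++;
    }
    for (; i < size; i++)                  /* leftover bytes */
        counts[(unsigned char)str[i]]++;
}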
