AABB frustum culling using NEON - iOS

I'm trying to optimize an AABB-in-frustum test using NEON (iOS, ARMv7 and newer) and I'm confused by the benchmark results.
NEON version (GLKVector4DotProduct using NEON):
FORCE_INLINE bool box_in_view1(const GLKVector4& min, const GLKVector4& max)
{
#define test_plane(i) { \
    const GLKVector4& fp = frustum_plane[i]; \
    if (GLKVector4DotProduct(fp, GLKVector4Make(min.x, min.y, min.z, 1.0f)) <= 0.0f && \
        GLKVector4DotProduct(fp, GLKVector4Make(max.x, min.y, min.z, 1.0f)) <= 0.0f && \
        GLKVector4DotProduct(fp, GLKVector4Make(min.x, max.y, min.z, 1.0f)) <= 0.0f && \
        GLKVector4DotProduct(fp, GLKVector4Make(max.x, max.y, min.z, 1.0f)) <= 0.0f && \
        GLKVector4DotProduct(fp, GLKVector4Make(min.x, min.y, max.z, 1.0f)) <= 0.0f && \
        GLKVector4DotProduct(fp, GLKVector4Make(max.x, min.y, max.z, 1.0f)) <= 0.0f && \
        GLKVector4DotProduct(fp, GLKVector4Make(min.x, max.y, max.z, 1.0f)) <= 0.0f && \
        GLKVector4DotProduct(fp, GLKVector4Make(max.x, max.y, max.z, 1.0f)) <= 0.0f) { \
        return false; \
    } \
}
    test_plane(0);
    test_plane(1);
    test_plane(2);
    test_plane(3);
    test_plane(4);
    test_plane(5);
#undef test_plane
    return true;
}
and without NEON:
FORCE_INLINE bool box_in_view2(const GLKVector4& min, const GLKVector4& max)
{
#define test_plane(i) { \
    const GLKVector4& fp = frustum_plane[i]; \
    float negw = -fp.w; \
    if (fp.x * min.x + fp.y * min.y + fp.z * min.z <= negw && \
        fp.x * max.x + fp.y * min.y + fp.z * min.z <= negw && \
        fp.x * min.x + fp.y * max.y + fp.z * min.z <= negw && \
        fp.x * max.x + fp.y * max.y + fp.z * min.z <= negw && \
        fp.x * min.x + fp.y * min.y + fp.z * max.z <= negw && \
        fp.x * max.x + fp.y * min.y + fp.z * max.z <= negw && \
        fp.x * min.x + fp.y * max.y + fp.z * max.z <= negw && \
        fp.x * max.x + fp.y * max.y + fp.z * max.z <= negw) { \
        return false; \
    } \
}
    test_plane(0);
    test_plane(1);
    test_plane(2);
    test_plane(3);
    test_plane(4);
    test_plane(5);
#undef test_plane
    return true;
}
In a simple benchmark the timings are:
box_in_view1: 1.9704s
box_in_view2: 0.0013s
That's 10,000,000 tests with a static AABB and a static frustum (the cube is inside, so all tests return true).
Tested on an iPad 3 with iOS 7; compiler optimization flags: -Ofast -ffast-math.
I'm sure that GLKVector4DotProduct() really uses NEON intrinsics in its ARM_NEON path. Is there any explanation for why the NEON version is so much slower?
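For context, a commonly cited cost on Cortex-A9 class cores (the iPad 3's A5X) is that GLKVector4Make builds every corner from scalar floats and each scalar result then has to cross back from the NEON/VFP side to the ARM side for the branch, so the data keeps bouncing between pipelines. Purely as a sketch of mine (not GLKit code, the names are made up), dot products for four corners of one plane can stay entirely in NEON registers:

#include <arm_neon.h>

// Sketch: test one plane against four AABB corners at once.
// px/py/pz/pw hold the plane components splatted across all four lanes,
// cx/cy/cz hold the x/y/z coordinates of four corners.
static inline bool four_corners_behind(float32x4_t px, float32x4_t py,
                                       float32x4_t pz, float32x4_t pw,
                                       float32x4_t cx, float32x4_t cy,
                                       float32x4_t cz)
{
    float32x4_t d = vmlaq_f32(pw, px, cx);                 // pw + px*cx
    d = vmlaq_f32(d, py, cy);                              // + py*cy
    d = vmlaq_f32(d, pz, cz);                              // + pz*cz
    uint32x4_t behind = vcleq_f32(d, vdupq_n_f32(0.0f));   // all-ones lane where dot <= 0
    uint32x2_t folded = vand_u32(vget_low_u32(behind), vget_high_u32(behind));
    return (vget_lane_u32(folded, 0) & vget_lane_u32(folded, 1)) != 0;
}

Running it twice per plane (once with the four min.z corners, once with the four max.z corners) covers all eight corners without constructing a GLKVector4 per corner.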

Below is the FULLY OPTIMIZED version, which should be much faster than your current one:
/*
fanicBoxInView
Copyright (C) 2014 Jake Lee
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program. If not, see <http://www.gnu.org/licenses/>.
*/
// int fanicBoxInView(void *pMin, void *pMax, void *pFp, unsigned int count);
// assert : count >= 4
.text
.arm
.global fanicBoxInView
pMin .req r0
pMax .req r1
pFp .req r2
count .req r3
.align 5
.func
fanicBoxInView:
vld1.32 {q12}, [pMin]
vld1.32 {q13}, [pMax]
subs count, count, #4
vmov.i32 q14, #0
vmov.i32 q15, #0
bxmi lr
vpush {q6-q7}
vzip.32 q12, q13
1:
vld1.32 {q0,q1}, [pFp]!
vld1.32 {q2,q3}, [pFp]!
pld [pFp, #64*3]
subs count, count, #4
vdup.32 d20, d1[1]
vdup.32 d21, d3[1]
vdup.32 d22, d5[1]
vdup.32 d23, d7[1]
vmul.f32 d12, d25, d0[1]
vmul.f32 d13, d25, d2[1]
vmul.f32 d14, d25, d4[1]
vmul.f32 d15, d25, d6[1]
vneg.f32 q10, q10
vneg.f32 q11, q11
vmul.f32 d16, d26, d1[0]
vmul.f32 d17, d26, d3[0]
vmul.f32 d18, d26, d5[0]
vmul.f32 d19, d26, d7[0]
vmls.f32 d20, d24, d0[0]
vmls.f32 d21, d24, d2[0]
vmls.f32 d22, d24, d4[0]
vmls.f32 d23, d24, d6[0]
vrev64.32 q0, q6
vrev64.32 q1, q7
vadd.f32 q6, q6, q8
vadd.f32 q7, q7, q9
vadd.f32 q8, q8, q0
vadd.f32 q9, q9, q1
vrev64.32 q0, q6
vrev64.32 q1, q7
vrev64.32 q2, q8
vrev64.32 q3, q9
vcgt.f32 q6, q6, q10
vcgt.f32 q7, q7, q11
vcgt.f32 q8, q8, q10
vcgt.f32 q9, q9, q11
vcgt.f32 q0, q0, q10
vcgt.f32 q1, q1, q11
vcgt.f32 q2, q2, q10
vcgt.f32 q3, q3, q11
vorr q6, q7, q6
vorr q8, q9, q8
vorr q0, q1, q0
vorr q2, q3, q2
vorr q14, q6, q14
vorr q15, q8, q15
vorr q14, q14, q0
vorr q15, q15, q2
bpl 1b
cmp count, #-4
add pFp, pFp, count, lsl #2
bgt 1b
vorr q14, q15, q14
vpop {q6-q7}
vmov r0, r1, d28
vmov r2, r3, d29
orr r0, r1, r0
orr r2, r3, r2
orr r0, r2, r0
bx lr
.endfunc
.end
The syntax above is for Linaro GCC. You'll have to make some changes for Xcode:
How do I use the ARM assembler in XCode?
Note that you have to assert (count >= 4);
Also note that the code runs most efficiently when (count % 4 == 0);
Have fun.
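For reference, a minimal C-side sketch of how the routine might be called, based only on the signature and the notes above (the names box_in_view_asm and FRUSTUM_PLANES are mine, and I'm assuming the return value is non-zero when the box survives all plane tests):

#include <GLKit/GLKMath.h>

extern "C" int fanicBoxInView(void *pMin, void *pMax, void *pFp, unsigned int count);

enum { FRUSTUM_PLANES = 6 };   // must be >= 4, fastest when a multiple of 4

static bool box_in_view_asm(const GLKVector4& mn, const GLKVector4& mx,
                            GLKVector4 (&planes)[FRUSTUM_PLANES])
{
    // min, max and each plane are 16-byte GLKVector4s, which matches
    // the vld1.32 {q} loads in the assembly above.
    return fanicBoxInView((void *)&mn, (void *)&mx, (void *)planes, FRUSTUM_PLANES) != 0;
}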
PS: I'm a professional optimizer with an "optimization on demand" practice, and my clients weren't very pleased with my recent activity of providing fully optimized code here for free. Therefore, I was forced to make it GPL.

Related

LLVM is optimizing the intrinsic code as well

We have some code that is manually written with intrinsics, but LLVM optimizes it further because of the -ffast-math flag. The manual intrinsics perform better than the LLVM-optimized version.
Example source code:
inline __m256 simd_evaluate_polynomial<__m256, APPROX_DEFAULT>(__m256 x, const std::array<__m256, APPROX_DEFAULT + 1>& coeff)
{
    __m256 power = _mm256_set1_ps(1.0f);
    __m256 res = _mm256_set1_ps(0.0f);
    for (unsigned int i = 0; i <= APPROX_DEFAULT; i++) {
        __m256 term = _mm256_mul_ps(coeff[i], power);
        power = _mm256_mul_ps(power, x);
        res = _mm256_add_ps(res, term);
    }
    return res;
}
For the above function, the LLVM-generated assembly (from the profiler) is:
Address Source Line Assembly CPU Time: Total CPU Time: Self
0x1402bbf7d 0 Block 1:
0x1402bbf7d 19 vmovaps ymm5, ymmword ptr [rip+0x50e4b5b] 0.1% 15.584ms
0x1402bbf85 19 vfmadd213ps ymm5, ymm3, ymmword ptr [rip+0x50e4b32] 0.1% 15.595ms
0x1402bbf8e 19 vfmadd213ps ymm5, ymm3, ymmword ptr [rip+0x50e4b09] 0.6% 93.654ms
0x1402bbf97 19 vfmadd213ps ymm5, ymm3, ymmword ptr [rip+0x50e4ae0] 0.2% 31.178ms
0x1402bbfa0 21 vfmadd213ps ymm5, ymm3, ymmword ptr [rip+0x50e4ab7] 0.3% 46.992ms
Can anyone please explain why this is happening?
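One way to keep the hand-written schedule, assuming a reasonably recent Clang (check that your toolchain supports these pragmas), is to turn off FP reassociation and contraction just for this function instead of dropping -ffast-math globally. A sketch, not a drop-in replacement for the code above:

#include <immintrin.h>
#include <array>
#include <cstddef>

// Sketch: same polynomial evaluation, but asking Clang not to reassociate
// or contract the FP operations in this block even under -ffast-math.
template <std::size_t N>
inline __m256 evaluate_polynomial_keep_order(__m256 x, const std::array<__m256, N + 1>& coeff)
{
#pragma clang fp reassociate(off)
#pragma clang fp contract(off)
    __m256 power = _mm256_set1_ps(1.0f);
    __m256 res   = _mm256_set1_ps(0.0f);
    for (std::size_t i = 0; i <= N; i++) {
        res   = _mm256_add_ps(res, _mm256_mul_ps(coeff[i], power));
        power = _mm256_mul_ps(power, x);
    }
    return res;
}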

iOS Neon assembler sample questions

I'm trying http://api.madewithmarmalade.com/ExampleArmASM.html on iOS; the program runs if I comment out the loop, and res is printed as 28. But if I don't comment it out, it aborts without printing res.
Any hint as to why, and how to fix it?
Thanks in advance.
My code is as follows:
#include <stdio.h>
#include <stdlib.h>
#define ARRAY_SIZE 512
#if defined __arm__ && defined __ARM_NEON__
static int computeSumNeon(const int a[])
{
    // Computes the sum of all elements in the input array
    int res = 0;
    asm(".align 4 \n\t" //dennis warning avoiding
        "vmov.i32 q8, #0 \n\t" //clear our accumulator register
        "mov r3, #512 \n\t" //Loop condition n = ARRAY_SIZE
        // ".loop1: \n\t" // No loop add 0-7 works as 28
        "vld1.32 {d0, d1, d2, d3}, [%[input]]! \n\t" //load 8 elements into d0, d1, d2, d3 = q0, q1
        "pld [%[input]] \n\t" // preload next set of elements
        "vadd.i32 q8, q0, q8 \n\t" // q8 += q0
        "vadd.i32 q8, q1, q8 \n\t" // q8 += q1
        "subs r3, r3, #8 \n\t" // n -= 8
        // "bne .loop1 \n\t" // n == 0?
        "vpadd.i32 d0, d16, d17 \n\t" // d0[0] = d16[0] + d16[1], d0[1] = d17[0] + d17[1]
        "vpaddl.u32 d0, d0 \n\t" // d0[0] = d0[0] + d0[1]
        "vmov.32 %[result], d0[0] \n\t"
        : [result] "=r" (res) , [input] "+r" (a)
        :
        : "q0", "q1", "q8", "r3");
    return res;
}
#else
static int computeSumNeon(const int a[])
{
    int i, res = 0;
    for (i = 0; i < ARRAY_SIZE; i++)
        res += a[i];
    return res;
}
#endif
...
#implementation AppDelegate
- (BOOL)application:(UIApplication *)application didFinishLaunchingWithOptions:(NSDictionary *)launchOptions {
// Override point for customization after application launch.
//int* inp;
int inp[ARRAY_SIZE];
//posix_memalign((void**)&inp, 64, ARRAY_SIZE*sizeof(int)); // Align to cache line size (64bytes on a cortex A8)
// Initialise the array with consecutive integers.
int i;
for (i = 0; i < ARRAY_SIZE; i++)
{
inp[i] = i;
}
for (i = 0; i < ARRAY_SIZE; i++)
{
printf("%i,", inp[i]);
}
printf("\n\n sum 0-7:%i\n", 0+1+2+3+4+5+6+7);
int res = 0;
res = computeSumNeon(inp);
printf("res NEO :%i\n", res);
// free(inp); // error pointer being free was not allocated !!!
UISplitViewController *splitViewController = (UISplitViewController *)self.window.rootViewController;
UINavigationController *navigationController = [splitViewController.viewControllers lastObject];
navigationController.topViewController.navigationItem.leftBarButtonItem = splitViewController.displayModeButtonItem;
splitViewController.delegate = self;
return YES;
}
- (void)applicationWillResignActive:(UIApplication *)application {
...
==== assembly code generated
.align 1
.code 16 # #computeSumNeon
.thumb_func _computeSumNeon
_computeSumNeon:
Lfunc_begin3:
.loc 18 133 0 is_stmt 1 # ...
.cfi_startproc
# BB#0:
sub sp, #8
movs r1, #0
str r0, [sp, #4]
.loc 18 135 9 prologue_end # ...
Ltmp18:
str r1, [sp]
.loc 18 136 5 # ...
ldr r0, [sp, #4]
# InlineAsm Start
.align 4
vmov.i32 q8, #0x0
movw r3, #504
.loop1:
vld1.32 {d0, d1, d2, d3}, [r0]!
vadd.i32 q8, q0, q8
vadd.i32 q8, q1, q8
subs r3, #8
bne .loop1
vpadd.i32 d0, d16, d17
vpaddl.u32 d0, d0
vmov.32 r1, d0[0]
# InlineAsm End
str r1, [sp]
str r0, [sp, #4]
.loc 18 155 12 # ...
ldr r0, [sp]
.loc 18 155 5 is_stmt 0 # ...
add sp, #8
bx lr
Ltmp19:
Lfunc_end3:
.cfi_endproc
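If the goal is just the sum, a NEON-intrinsics version of the same reduction (a sketch of mine, not the Marmalade sample) avoids the hand-written labels and clobber lists entirely and lets the compiler schedule the registers:

#include <arm_neon.h>

#define ARRAY_SIZE 512

// Sketch: sum ARRAY_SIZE ints, eight per iteration, with NEON intrinsics.
static int computeSumNeonIntrinsics(const int a[])
{
    int32x4_t acc0 = vdupq_n_s32(0);
    int32x4_t acc1 = vdupq_n_s32(0);
    for (int i = 0; i < ARRAY_SIZE; i += 8) {
        acc0 = vaddq_s32(acc0, vld1q_s32(&a[i]));
        acc1 = vaddq_s32(acc1, vld1q_s32(&a[i + 4]));
    }
    int32x4_t acc = vaddq_s32(acc0, acc1);
    // horizontal add of the four remaining lanes
    int32x2_t lo = vadd_s32(vget_low_s32(acc), vget_high_s32(acc));
    lo = vpadd_s32(lo, lo);
    return vget_lane_s32(lo, 0);
}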

mprotect errno 22 iOS

I'm developing a jailbroken app on iOS and getting errno 22 when calling
mprotect(p, 1024, PROT_READ | PROT_EXEC)
errno 22 means invalid argument, but I can't figure out what's wrong. I've aligned p to a multiple of the page size, and I malloc'd the memory before calling mprotect.
Here's my code and sample output:
#define PAGESIZE 4096
FILE * pFile;
pFile = fopen ("log.txt","w");
uint32_t code[] = {
    0xe2800001, // add r0, r0, #1
    0xe12fff1e, // bx lr
};
uint32_t *p;
fprintf(pFile, "Before Execution\n");
p = (uint32_t *)malloc(1024+PAGESIZE-1);
if (!p) {
    fprintf(pFile, "Couldn't malloc(1024)");
    perror("Couldn't malloc(1024)");
    exit(errno);
}
fprintf(pFile, "Malloced to %p\n", p);
p = (uint32_t *)(((uintptr_t)p + PAGESIZE-1) & ~(PAGESIZE-1));
fprintf(pFile, "Moved pointer to %p\n", p);
fprintf(pFile, "Before Compiling\n");
// copy instructions to function
p[0] = code[0];
p[1] = code[1];
fprintf(pFile, "After Compiling\n");
if (mprotect(p, 1024, PROT_READ | PROT_EXEC)) {
    int err = errno;
    fprintf(pFile, "Couldn't mprotect2: %i\n", errno);
    perror("Couldn't mprotect");
    exit(errno);
}
And output:
Before Execution
Malloced to 0x13611ec00
Moved pointer to 0x13611f000
Before Compiling
After Compiling
Couldn't mprotect2: 22
Fixed this by using posix_memalign(). It turns out I wasn't aligning my pointer to the page size correctly.
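For reference, a minimal sketch of that fix (same jailbroken setup the question describes; error handling trimmed; the function name is mine):

#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

#define PAGESIZE 4096

// Sketch: posix_memalign returns a page-aligned pointer, so mprotect
// no longer rejects the address with EINVAL.
static uint32_t *make_executable_stub(void)
{
    static const uint32_t code[] = {
        0xe2800001, // add r0, r0, #1
        0xe12fff1e, // bx lr
    };
    uint32_t *p = NULL;
    if (posix_memalign((void **)&p, PAGESIZE, PAGESIZE) != 0)
        return NULL;
    memcpy(p, code, sizeof(code));
    if (mprotect(p, PAGESIZE, PROT_READ | PROT_EXEC) != 0) {
        free(p);
        return NULL;
    }
    return p;
}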

How CUDA constant memory allocation works?

I'd like to get some insight into how constant memory is allocated (using CUDA 4.2). I know that the total available constant memory is 64KB, but when is this memory actually allocated on the device? Does this limit apply to each kernel, to each CUDA context, or to the whole application?
Let's say there are several kernels in a .cu file, each using less than 64K constant memory. But the total constant memory usage is more than 64K. Is it possible to call these kernels sequentially? What happens if they are called concurrently using different streams?
What happens if there is a large CUDA dynamic library with lots of kernels each using different amounts of constant memory?
What happens if there are two applications each requiring more than half of the available constant memory? The first application runs fine, but when will the second app fail? At app start, at cudaMemcpyToSymbol() calls or at kernel execution?
Parallel Thread Execution ISA Version 3.1 section 5.1.3 discusses constant banks.
Constant memory is restricted in size, currently limited to 64KB which
can be used to hold statically-sized constant variables. There is an
additional 640KB of constant memory, organized as ten independent 64KB
regions. The driver may allocate and initialize constant buffers in
these regions and pass pointers to the buffers as kernel function
parameters. Since the ten regions are not contiguous, the driver
must ensure that constant buffers are allocated so that each buffer
fits entirely within a 64KB region and does not span a region
boundary.
A simple program can be used to illustrate the use of constant memory.
__constant__ int kd_p1;
__constant__ short kd_p2;
__constant__ char kd_p3;
__constant__ double kd_p4;
__constant__ float kd_floats[8];
__global__ void parameters(int p1, short p2, char p3, double p4, int* pp1, short* pp2, char* pp3, double* pp4)
{
    *pp1 = p1;
    *pp2 = p2;
    *pp3 = p3;
    *pp4 = p4;
    return;
}

__global__ void constants(int* pp1, short* pp2, char* pp3, double* pp4)
{
    *pp1 = kd_p1;
    *pp2 = kd_p2;
    *pp3 = kd_p3;
    *pp4 = kd_p4;
    return;
}
Compile this for compute_30/sm_30 and run cuobjdump -sass <executable or obj> to disassemble it; you should see:
Fatbin elf code:
================
arch = sm_30
code version = [1,6]
producer = cuda
host = windows
compile_size = 32bit
identifier = c:/dev/constant_banks/kernel.cu
code for sm_30
Function : _Z10parametersiscdPiPsPcPd
/*0008*/ /*0x10005de428004001*/ MOV R1, c [0x0] [0x44]; // stack pointer
/*0010*/ /*0x40001de428004005*/ MOV R0, c [0x0] [0x150]; // pp1
/*0018*/ /*0x50009de428004005*/ MOV R2, c [0x0] [0x154]; // pp2
/*0020*/ /*0x0001dde428004005*/ MOV R7, c [0x0] [0x140]; // p1
/*0028*/ /*0x13f0dc4614000005*/ LDC.U16 R3, c [0x0] [0x144]; // p2
/*0030*/ /*0x60011de428004005*/ MOV R4, c [0x0] [0x158]; // pp3
/*0038*/ /*0x70019de428004005*/ MOV R6, c [0x0] [0x15c]; // pp4
/*0048*/ /*0x20021de428004005*/ MOV R8, c [0x0] [0x148]; // p4
/*0050*/ /*0x30025de428004005*/ MOV R9, c [0x0] [0x14c]; // p4
/*0058*/ /*0x1bf15c0614000005*/ LDC.U8 R5, c [0x0] [0x146]; // p3
/*0060*/ /*0x0001dc8590000000*/ ST [R0], R7; // *pp1 = p1
/*0068*/ /*0x0020dc4590000000*/ ST.U16 [R2], R3; // *pp2 = p2
/*0070*/ /*0x00415c0590000000*/ ST.U8 [R4], R5; // *pp3 = p3
/*0078*/ /*0x00621ca590000000*/ ST.64 [R6], R8; // *pp4 = p4
/*0088*/ /*0x00001de780000000*/ EXIT;
/*0090*/ /*0xe0001de74003ffff*/ BRA 0x90;
/*0098*/ /*0x00001de440000000*/ NOP CC.T;
/*00a0*/ /*0x00001de440000000*/ NOP CC.T;
/*00a8*/ /*0x00001de440000000*/ NOP CC.T;
/*00b0*/ /*0x00001de440000000*/ NOP CC.T;
/*00b8*/ /*0x00001de440000000*/ NOP CC.T;
...........................................
Function : _Z9constantsPiPsPcPd
/*0008*/ /*0x10005de428004001*/ MOV R1, c [0x0] [0x44]; // stack pointer
/*0010*/ /*0x00001de428004005*/ MOV R0, c [0x0] [0x140]; // p1
/*0018*/ /*0x10009de428004005*/ MOV R2, c [0x0] [0x144]; // p2
/*0020*/ /*0x0001dde428004c00*/ MOV R7, c [0x3] [0x0]; // kd_p1
/*0028*/ /*0x13f0dc4614000c00*/ LDC.U16 R3, c [0x3] [0x4]; // kd_p2
/*0030*/ /*0x20011de428004005*/ MOV R4, c [0x0] [0x148]; // p3
/*0038*/ /*0x30019de428004005*/ MOV R6, c [0x0] [0x14c]; // p4
/*0048*/ /*0x20021de428004c00*/ MOV R8, c [0x3] [0x8]; // kd_p4
/*0050*/ /*0x30025de428004c00*/ MOV R9, c [0x3] [0xc]; // kd_p4
/*0058*/ /*0x1bf15c0614000c00*/ LDC.U8 R5, c [0x3] [0x6]; // kd_p3
/*0060*/ /*0x0001dc8590000000*/ ST [R0], R7;
/*0068*/ /*0x0020dc4590000000*/ ST.U16 [R2], R3;
/*0070*/ /*0x00415c0590000000*/ ST.U8 [R4], R5;
/*0078*/ /*0x00621ca590000000*/ ST.64 [R6], R8;
/*0088*/ /*0x00001de780000000*/ EXIT;
/*0090*/ /*0xe0001de74003ffff*/ BRA 0x90;
/*0098*/ /*0x00001de440000000*/ NOP CC.T;
/*00a0*/ /*0x00001de440000000*/ NOP CC.T;
/*00a8*/ /*0x00001de440000000*/ NOP CC.T;
/*00b0*/ /*0x00001de440000000*/ NOP CC.T;
/*00b8*/ /*0x00001de440000000*/ NOP CC.T;
.....................................
I have annotated the SASS on the right.
On sm_30 you can see that kernel parameters are passed in constant bank 0, starting at offset 0x140.
User-defined __constant__ variables are placed in constant bank 3.
If you execute cuobjdump --dump-elf <executable or obj> you can find other interesting constant information.
32bit elf: abi=6, sm=30, flags = 0x1e011e
Sections:
Index Offset Size ES Align Type Flags Link Info Name
1 34 142 0 1 STRTAB 0 0 0 .shstrtab
2 176 19b 0 1 STRTAB 0 0 0 .strtab
3 314 d0 10 4 SYMTAB 0 2 a .symtab
4 3e4 50 0 4 CUDA_INFO 0 3 b .nv.info._Z9constantsPiPsPcPd
5 434 30 0 4 CUDA_INFO 0 3 0 .nv.info
6 464 90 0 4 CUDA_INFO 0 3 a .nv.info._Z10parametersiscdPiPsPcPd
7 4f4 160 0 4 PROGBITS 2 0 a .nv.constant0._Z10parametersiscdPiPsPcPd
8 654 150 0 4 PROGBITS 2 0 b .nv.constant0._Z9constantsPiPsPcPd
9 7a8 30 0 8 PROGBITS 2 0 0 .nv.constant3
a 7d8 c0 0 4 PROGBITS 6 3 a00000b .text._Z10parametersiscdPiPsPcPd
b 898 c0 0 4 PROGBITS 6 3 a00000c .text._Z9constantsPiPsPcPd
.section .strtab
.section .shstrtab
.section .symtab
index value size info other shndx name
0 0 0 0 0 0 (null)
1 0 0 3 0 a .text._Z10parametersiscdPiPsPcPd
2 0 0 3 0 7 .nv.constant0._Z10parametersiscdPiPsPcPd
3 0 0 3 0 b .text._Z9constantsPiPsPcPd
4 0 0 3 0 8 .nv.constant0._Z9constantsPiPsPcPd
5 0 0 3 0 9 .nv.constant3
6 0 4 1 0 9 kd_p1
7 4 2 1 0 9 kd_p2
8 6 1 1 0 9 kd_p3
9 8 8 1 0 9 kd_p4
10 16 32 1 0 9 kd_floats
11 0 192 12 10 a _Z10parametersiscdPiPsPcPd
12 0 192 12 10 b _Z9constantsPiPsPcPd
The kernel parameter constant bank is versioned per launch so that concurrent kernels can be executed. The compiler and user constants are per CUmodule. It is the responsibility of the developer to manage coherency of this data; for example, the developer has to ensure that a cudaMemcpyToSymbol is issued in a safe manner.
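To illustrate that last point, here is a host-side sketch (mine, reusing only the symbols declared above; the d_pp* buffers and the function name are made up). Because the copies and the launches are issued into the same default stream, each launch observes the value written by the copy issued before it:

// Sketch only; error checking omitted.
static void update_and_launch(void)
{
    int h_p1 = 42;
    int *d_pp1; short *d_pp2; char *d_pp3; double *d_pp4;
    cudaMalloc(&d_pp1, sizeof(int));
    cudaMalloc(&d_pp2, sizeof(short));
    cudaMalloc(&d_pp3, sizeof(char));
    cudaMalloc(&d_pp4, sizeof(double));

    cudaMemcpyToSymbol(kd_p1, &h_p1, sizeof(h_p1));    // writes constant bank 3
    constants<<<1, 1>>>(d_pp1, d_pp2, d_pp3, d_pp4);   // reads 42

    h_p1 = 43;
    cudaMemcpyToSymbol(kd_p1, &h_p1, sizeof(h_p1));    // ordered after the launch above
    constants<<<1, 1>>>(d_pp1, d_pp2, d_pp3, d_pp4);   // reads 43

    cudaDeviceSynchronize();
}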

SSE2: How to reduce a _m128 to a word

What's the best way (SSE2) to reduce a __m128i (4 words a, b, c, d) to one word?
I want the low byte of each of the __m128i components:
int result = ( _m128.a & 0x000000ff ) << 24
| ( _m128.b & 0x000000ff ) << 16
| ( _m128.c & 0x000000ff ) << 8
| ( _m128.d & 0x000000ff ) << 0
Is there an intrinsic for that? Thanks!
FYI, the SSSE3 intrinsic _mm_shuffle_epi8 does the job (with the mask 0x0004080c in this case).
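For completeness, a sketch of that SSSE3 one-liner (the byte indices below assume a is lane 0 and d is lane 3, matching the 0x0004080c mask; adjust the shuffle control if your lane order differs):

#include <tmmintrin.h>  // SSSE3

static unsigned pack_low_bytes_ssse3(__m128i x)
{
    // Gather byte 0 of each 32-bit lane into the low dword: source bytes
    // 12, 8, 4, 0 become result bytes 0..3; 0x80 lanes zero the rest.
    const __m128i mask = _mm_set_epi8((char)0x80, (char)0x80, (char)0x80, (char)0x80,
                                      (char)0x80, (char)0x80, (char)0x80, (char)0x80,
                                      (char)0x80, (char)0x80, (char)0x80, (char)0x80,
                                      0, 4, 8, 12);
    return (unsigned)_mm_cvtsi128_si32(_mm_shuffle_epi8(x, mask));
}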
The SSE2 answer takes more than one instruction:
unsigned benoit(__m128i x)
{
    __m128i zero = _mm_setzero_si128(), mask = _mm_set1_epi32(255);
    return _mm_cvtsi128_si32(
        _mm_packus_epi16(
            _mm_packus_epi16(
                _mm_and_si128(x, mask), zero), zero));
}
The above amounts to 5 machine ops, given the input in %xmm1 and output in %rax:
pxor %xmm0, %xmm0
pand MASK, %xmm1
packuswb %xmm0, %xmm1
packuswb %xmm0, %xmm1
movd %xmm1, %rax
If you want to see some unusual uses of SSE2, including high-speed bit-matrix transpose, string search and bitonic (GPGPU-style) sort, you might want to check my blog, Coding on the edges.
Anyway, hope that helps.
