I am using inline assembly to do ARM operations in C code.
In the C code I'm allocating memory with calloc. This memory block is divided into different buffers, such that:
int * SCRATCH = (int *)calloc(LEN, sizeof(int));
buffer1 = (int *)SCRATCH;
buffer2 = (short *)((int *) buffer1 + sizeof(* buffer1) * LEN_BUF_1);
where the lengths are all known. Now I'm accessing values from those buffers with
LDRD r4, r4, [r0], #8;
which gives me an error since the memory is not aligned properly. How would I manage to align all buffers that I'm using to be able to use double loads and stores?
Thank you!
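One way to approach this is to compute every buffer's byte offset and round it up to the required alignment before carving the block. The sketch below assumes 8-byte alignment is what the doubleword loads need on your core; the names `Scratch`, `scratch_alloc`, and `align_up` are made up for illustration. Note also that in the snippet above, adding `sizeof(*buffer1) * LEN_BUF_1` to an `int *` advances by that many ints, not bytes, so the offsets here are computed on a byte pointer instead.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define ALIGNMENT 8u  /* assumed requirement for doubleword loads/stores */

/* Round a byte count up to the next multiple of ALIGNMENT (a power of two). */
static size_t align_up(size_t nbytes) {
    return (nbytes + ALIGNMENT - 1u) & ~(size_t)(ALIGNMENT - 1u);
}

typedef struct {
    void  *base;     /* pass this to free() */
    int   *buffer1;  /* len1 ints */
    short *buffer2;  /* len2 shorts, starting on an 8-byte boundary */
} Scratch;

/* Allocate one zeroed block and carve two 8-byte-aligned buffers out of it.
   Returns 0 on success, -1 if the allocation fails. */
static int scratch_alloc(Scratch *s, size_t len1, size_t len2) {
    size_t off2  = align_up(len1 * sizeof(int));
    size_t total = off2 + align_up(len2 * sizeof(short));
    if (total == 0) total = ALIGNMENT;

    /* C11 aligned_alloc: the size must be a multiple of the alignment. */
    unsigned char *base = aligned_alloc(ALIGNMENT, total);
    if (base == NULL) return -1;
    memset(base, 0, total);  /* mimic calloc's zero fill */

    s->base    = base;
    s->buffer1 = (int *)base;
    s->buffer2 = (short *)(base + off2);
    return 0;
}
```

With `len1 = 5`, the first buffer occupies 20 bytes, so the second one starts at byte 24 rather than byte 20. More buffers can be added the same way, rounding each running offset up before placing the next buffer.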
I have a task - to multiply big row vector (10 000 elements) via big column-major matrix (10 000 rows, 400 columns). I decided to go with ARM NEON since I'm curious about this technology and would like to learn more about it.
Here's a working example of vector matrix multiplication I wrote:
//float* vec_ptr - a pointer to vector
//float* mat_ptr - a pointer to matrix
//float* out_ptr - a pointer to output vector
//int matCols - matrix columns
//int vecRows - vector rows, the same as matrix
for (int i = 0, max_i = matCols; i < max_i; i++) {
for (int j = 0, max_j = vecRows - 3; j < max_j; j+=4, mat_ptr+=4, vec_ptr+=4) {
float32x4_t mat_val = vld1q_f32(mat_ptr); //get 4 elements from matrix
float32x4_t vec_val = vld1q_f32(vec_ptr); //get 4 elements from vector
float32x4_t out_val = vmulq_f32(mat_val, vec_val); //multiply vectors
float32_t total_sum = vaddvq_f32(out_val); //sum elements of vector together
out_ptr[i] += total_sum;
}
vec_ptr = &myVec[0]; //switch ptr back again to zero element
}
The problem is that it's taking very long time to compute - 30 ms on iPhone 7+ when my goal is 1 ms or even less if it's possible. Current execution time is understandable since I launch multiplication iteration 400 * (10000 / 4) = 1 000 000 times.
Also, I tried to process 8 elements instead of 4. It seems to help, but the numbers are still very far from my goal.
I understand that I might be making some horrible mistakes since I'm a newbie with ARM NEON, and I would be happy if someone could give me some tips on how to optimize my code.
Also - is it worth doing big vector-matrix multiplication via ARM NEON? Does this technology fit well for such purpose?
Your code is completely flawed: it iterates 16 times assuming both matCols and vecRows are 4. What's the point of SIMD then?
And the major performance problem lies in float32_t total_sum = vaddvq_f32(out_val);:
You should never convert a vector to a scalar inside a loop, since it causes a pipeline hazard that costs around 15 cycles every time.
The solution:
float32x4x4_t myMat;
float32x2_t myVecLow, myVecHigh;
myVecLow = vld1_f32(&pVec[0]);
myVecHigh = vld1_f32(&pVec[2]);
myMat = vld4q_f32(pMat);
myMat.val[0] = vmulq_lane_f32(myMat.val[0], myVecLow, 0);
myMat.val[0] = vmlaq_lane_f32(myMat.val[0], myMat.val[1], myVecLow, 1);
myMat.val[0] = vmlaq_lane_f32(myMat.val[0], myMat.val[2], myVecHigh, 0);
myMat.val[0] = vmlaq_lane_f32(myMat.val[0], myMat.val[3], myVecHigh, 1);
vst1q_f32(pDst, myMat.val[0]);
Compute all the four rows in a single pass
Do a matrix transpose (rotation) on-the-fly by vld4
Do vector-scalar multiply-accumulate instead of vector-vector multiply and horizontal add that causes the pipeline hazards.
You were asking if SIMD is suitable for matrix operations? A simple "yes" would be a monumental understatement. You don't even need a loop for this.
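The advice about keeping the reduction out of the inner loop can be sketched in portable C, with four scalar accumulators standing in for the lanes of a `float32x4_t`. This is not the NEON code from the answer, only an illustration of the same idea: accumulate lane-wise inside the loop and add the lanes together once per column. The function names and the column-major layout (`rows` consecutive floats per column) are assumptions taken from the question.

```c
#include <assert.h>
#include <stddef.h>

/* Dot product that accumulates into four independent "lanes" and
   performs the horizontal add only once, after the loop. */
static float dot_lanes4(const float *a, const float *b, size_t n) {
    float acc[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        for (int lane = 0; lane < 4; lane++) {
            acc[lane] += a[i + lane] * b[i + lane];
        }
    }

    /* Single horizontal reduction per column. */
    float sum = (acc[0] + acc[1]) + (acc[2] + acc[3]);

    /* Scalar tail for lengths that are not a multiple of 4. */
    for (; i < n; i++) {
        sum += a[i] * b[i];
    }
    return sum;
}

/* Row vector (rows elements) times column-major matrix (rows x cols). */
static void vec_mat_mul(const float *vec, const float *mat, float *out,
                        size_t rows, size_t cols) {
    for (size_t c = 0; c < cols; c++) {
        out[c] = dot_lanes4(mat + c * rows, vec, rows);
    }
}
```

In a NEON version the inner lane loop would become a single multiply-accumulate on a vector register, and the reduction would stay outside the loop in the same place.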
I have a buffer of 12-bit data (stored in 16-bit words)
and need to convert it to 8-bit (shift right by 4).
How can NEON accelerate this processing?
Thank you for your help
Brahim
Took the liberty to assume a few things explained below, but this kind of code (untested, may require a few modifications) should provide a good speedup compared to naive non-NEON version:
#include <arm_neon.h>
#include <stdint.h>
void convert(const uint16_t *restrict input, // the buffer to convert
uint8_t *restrict output, // the buffer in which to store result
int sz) { // their (common) size
/* Assuming the buffer size is a multiple of 8 */
for (int i = 0; i < sz; i += 8) {
// Load a vector of 8 16-bit values:
uint16x8_t v = vld1q_u16(input+i);
// Shift it by 4 to the right, narrowing it to 8 bit values.
uint8x8_t shifted = vshrn_n_u16(v, 4);
// Store it in output buffer
vst1_u8(output+i, shifted);
}
}
Things I assumed here:
that you're working with unsigned values. If it's not the case, it will be easy to adapt anyway (uint* -> int*, *_u8->*_s8 and *_u16->*_s16)
as the values are loaded 8 by 8, I assumed the buffer length was a multiple of 8 to avoid edge cases. If that's not the case, you should probably pad it artificially to a multiple of 8.
Finally, the 2 resource pages used from the NEON documentation:
about loads and stores of vectors.
about shifting vectors.
Hope this helps!
prototype : void dataConvert(void * pDst, void * pSrc, unsigned int count);
1:
vld1.16 {q8-q9}, [r1]!
vld1.16 {q10-q11}, [r1]!
vqrshrn.u16 d16, q8, #4
vqrshrn.u16 d17, q9, #4
vqrshrn.u16 d18, q10, #4
vqrshrn.u16 d19, q11, #4
vst1.16 {q8-q9}, [r0]!
subs r2, #32
bgt 1b
q flag : saturation
r flag : rounding
change u16 to s16 in case of signed data.
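The two answers narrow slightly differently: the intrinsic version truncates, while the assembly version rounds and saturates. A scalar model in plain C may make the difference concrete, and it can double as a reference for checking NEON output or for handling a tail that is not a multiple of the vector width. The function names are made up for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of a truncating narrowing shift by 4:
   shift right and keep the low 8 bits. */
static uint8_t narrow_truncate(uint16_t x) {
    return (uint8_t)(x >> 4);
}

/* Scalar model of a rounding, saturating narrowing shift by 4:
   add half an output step, shift, then clamp to the 8-bit range. */
static uint8_t narrow_round_saturate(uint16_t x) {
    uint32_t r = ((uint32_t)x + 8u) >> 4;
    return (uint8_t)(r > 255u ? 255u : r);
}

/* Convert n samples with the truncating variant; n can be any length. */
static void convert_scalar(const uint16_t *in, uint8_t *out, size_t n) {
    for (size_t i = 0; i < n; i++) {
        out[i] = narrow_truncate(in[i]);
    }
}
```

For a full-scale 12-bit sample (0x0FFF) both variants give 255, but only because the rounding variant saturates: (4095 + 8) >> 4 is 256 before clamping.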
I initialized the array like so
CGImageRef imageRef = CGImageCreateWithImageInRect(image.CGImage, bounds);
CGColorSpaceRef colorSpace = CGColorSpaceCreateDeviceRGB();
NSUInteger width = CGImageGetWidth(imageRef);
NSUInteger height = CGImageGetHeight(imageRef);
unsigned char *rawData = malloc(height * width * 4);
NSUInteger bytesPerPixel = 4;
NSUInteger bytesPerRow = bytesPerPixel * width;
NSUInteger bitsPerComponent = 8;
CGContextRef context = CGBitmapContextCreate(rawData, width, height, bitsPerComponent, bytesPerRow, colorSpace, kCGImageAlphaPremultipliedLast | kCGBitmapByteOrder32Big);
However, when I tried checking the count through an NSLog, I always get 4 (4/1, specifically).
int count = sizeof(rawData)/sizeof(rawData[0]);
NSLog(@"%d", count);
Yet when I NSLog the value of individual elements, it returns non zero values.
ex.
CGFloat f1 = rawData[15];
CGFloat f2 = rawData[n], where n is image width*height*4;
//I wasn't expecting this to work since the last element should be n-1
Finally, I tried
int n = lipBorder.size.width *lipBorder.size.height*4*2; //lipBorder holds the image's dimensions, I tried multiplying by 2 because there are 2 pixels for every CGPoint in retina
CGFloat f = rawData[n];
This would return different values each time for the same image, (ex. 0.000, 115.000, 38.000).
How do I determine the count / how are the values being stored into the array?
rawData is a pointer to unsigned char, as such its size is 32 bits (4 bytes)[1]. rawData[0] is an unsigned char, as such its size is 8 bits (1 byte). Hence, 4/1.
You've probably seen this done with arrays before, where it does work as you would expect:
unsigned char temp[10] = {0};
NSLog(@"%d", sizeof(temp)/sizeof(temp[0])); // Prints 10
Note, however, that you are dealing with a pointer to unsigned char, not an array of unsigned char - the semantics are different, hence why this doesn't work in your case.
If you want the size of your buffer, you'll be much better off simply using height * width * 4, since that's what you passed to malloc anyway. If you really must, you could divide that by sizeof(char) or sizeof(rawData[0]) to get the number of elements, but since they're chars you'll get the same number anyway.
Now, rawData is just a chunk of memory somewhere. There's other memory before and after it. So, if you attempt to do something like rawData[height * width * 4], what you're actually doing is attempting to access the next byte of memory after the chunk allocated for rawData. This is undefined behaviour, and can result in random garbage values being returned[2] (as you've observed), some "unassigned memory" marker value being returned, or a segmentation fault occurring.
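[1]: iOS was a 32-bit platform when this was written; on a 64-bit device a pointer is 8 bytes, so the same expression would give 8.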
[2]: probably whatever value was put into that memory location last time it was legitimately used.
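Since a pointer carries no length information, one common approach is to store the element count next to the pointer and check indices against it. The sketch below is plain C with hypothetical names (`PixelBuffer`, `pixel_buffer_alloc`, `pixel_buffer_get`); it assumes 4 bytes per pixel, as in the question.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* A heap buffer together with the number of bytes it holds. */
typedef struct {
    unsigned char *data;
    size_t count;
} PixelBuffer;

/* Allocate a zeroed RGBA buffer; count is 0 if the allocation fails. */
static PixelBuffer pixel_buffer_alloc(size_t width, size_t height) {
    PixelBuffer b;
    b.count = width * height * 4;  /* assumed: 4 bytes per pixel */
    b.data = calloc(b.count, 1);
    if (b.data == NULL) {
        b.count = 0;
    }
    return b;
}

/* Bounds-checked read: returns 1 and stores the byte in *out,
   or returns 0 if i is past the end of the buffer. */
static int pixel_buffer_get(const PixelBuffer *b, size_t i, unsigned char *out) {
    if (i >= b->count) {
        return 0;
    }
    *out = b->data[i];
    return 1;
}
```

Here `sizeof(b.data)` is still just the size of a pointer; the real length lives in `b.count`, and an index equal to `count` is rejected instead of reading the byte after the allocation.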
The pointer returned by malloc is a void* pointer meaning that it returns a pointer to an address in memory. It seems that the width and the height that are being returned are 0. This would explain why you are only being allocated 4 bytes for your array.
You also said that you tried
int n = lipBorder.size.width *lipBorder.size.height*4*2; //lipBorder holds the image's dimensions, I tried multiplying by 2 because there are 2 pixels for every CGPoint in retina
CGFloat f = rawData[n];
and were receiving different values each time. This behavior is to be expected given that your array is only 4 bytes long and you are accessing an area of memory that is much further ahead in memory. The reason that the value was changing was that you were accessing memory that was not in your array, but in a memory location that was
lipBorder.size.width *lipBorder.size.height*4*2 - 4 bytes past the end of your array. C in no way prevents you from accessing any memory within your program. If you had accessed memory that is off limits to your program you would have received a segmentation fault.
You can therefore access n + 1 or n + 2 or n + whatever element. It only means that you are accessing memory that is past the end of your array.
Incrementing the pointer rawData would move the memory address by one byte. Incrementing an int pointer would move the memory address by 4 bytes (sizeof(int)).
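The point about pointer increments can be checked directly. This small sketch measures how many bytes separate `p` and `p + 1` for two pointer types; the helper names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes between p and p + 1 for an unsigned char pointer. */
static size_t byte_stride_uchar(unsigned char *p) {
    return (size_t)((char *)(p + 1) - (char *)p);
}

/* Bytes between p and p + 1 for an int pointer. */
static size_t byte_stride_int(int *p) {
    return (size_t)((char *)(p + 1) - (char *)p);
}
```

The first helper always yields 1 and the second yields `sizeof(int)`, because pointer arithmetic is scaled by the size of the pointed-to type.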
I am working on some CUDA program and I wanted to speed up computation using constant memory but it turned that using constant memory makes my code ~30% slower.
I know that constant memory is good at broadcasting reads to whole warps and I thought that my program could take an advantage of it.
Here is constant memory code:
__constant__ float4 constPlanes[MAX_PLANES_COUNT];
__global__ void faultsKernelConstantMem(const float3* vertices, unsigned int vertsCount, int* displacements, unsigned int planesCount) {
unsigned int blockId = __mul24(blockIdx.y, gridDim.x) + blockIdx.x;
unsigned int vertexIndex = __mul24(blockId, blockDim.x) + threadIdx.x;
if (vertexIndex >= vertsCount) {
return;
}
float3 v = vertices[vertexIndex];
int displacementSteps = displacements[vertexIndex];
//__syncthreads();
for (unsigned int planeIndex = 0; planeIndex < planesCount; ++planeIndex) {
float4 plane = constPlanes[planeIndex];
if (v.x * plane.x + v.y * plane.y + v.z * plane.z + plane.w > 0) {
++displacementSteps;
}
else {
--displacementSteps;
}
}
displacements[vertexIndex] = displacementSteps;
}
The global memory code is the same, but it has one more parameter (a pointer to the array of planes) and uses it instead of the global constPlanes array.
I thought that those first global memory reads
float3 v = vertices[vertexIndex];
int displacementSteps = displacements[vertexIndex];
may cause "desynchronization" of threads, so that they would not take advantage of the broadcasting of constant memory reads. I therefore tried calling __syncthreads(); before reading constant memory, but it did not change anything.
What is wrong? Thanks in advance!
System:
CUDA Driver Version: 5.0
CUDA Capability: 2.0
Parameters:
number of vertices: ~2.5 millions
number of planes: 1024
Results:
constant mem version: 46 ms
global mem version: 35 ms
EDIT:
So I've tried many things how to make the constant memory faster, such as:
1) Comment out the two global memory reads to see if they have any impact and they do not. Global memory was still faster.
2) Process more vertices per thread (from 8 to 64) to take advantage of CM caches. This was even slower than one vertex per thread.
2b) Use shared memory to store displacements and vertices - load all of them at beginning, process and save all displacements. Again, slower than shown CM example.
After this experience I really do not understand how the CM read broadcasting works and how it can be "used" correctly in my code. This code probably cannot be optimized with CM.
EDIT2:
Another day of tweaking, I've tried:
3) Process more vertices (8 to 64) per thread with memory coalescing (every thread goes with increment equal to total number of threads in system) -- this gives better results than increment equal to 1 but still no speedup
4) Replace this if statement
if (v.x * plane.x + v.y * plane.y + v.z * plane.z + plane.w > 0) {
++displacementSteps;
}
else {
--displacementSteps;
}
which is giving 'unpredictable' results with little bit of math to avoid branching using this code:
float dist = v.x * plane.x + v.y * plane.y + v.z * plane.z + plane.w;
int distInt = (int)(dist * (1 << 29)); // distance is in range (0 - 2), stretch it to int range
int sign = 1 | (distInt >> (sizeof(int) * CHAR_BIT - 1)); // compute sign without using ifs
displacementSteps += sign;
Unfortunately this is a lot slower (~30%) than using the if, so ifs are not as big an evil as I thought.
EDIT3:
I am concluding this question that this problem probably can not be improved by using constant memory, those are my results*:
*Times reported as median from 15 independent measurements. When constant memory was not large enough for saving all planes (4096 and 8192), kernel was invoked multiple times.
Although a compute capability 2.0 chip has 64k of constant memory, each of the multi-processors has only 8k of constant-memory cache. Your code has each thread requiring access to all 16k of the constant memory, so you are losing performance through cache misses. To effectively use constant memory for the plane data, you will need to restructure your implementation.
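The "16k" figure follows from the plane count and the size of a `float4`. As a quick check in plain C, using a 16-byte struct as a stand-in for `float4` and taking the 8 KB per-multiprocessor cache size from the paragraph above as given:

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for CUDA's float4: four 32-bit floats, 16 bytes. */
typedef struct {
    float x, y, z, w;
} Plane;

/* Total bytes of plane data that every thread walks through. */
static size_t planes_footprint_bytes(size_t plane_count) {
    return plane_count * sizeof(Plane);
}

/* How many planes fit in a cache of the given size. */
static size_t planes_per_cache(size_t cache_bytes) {
    return cache_bytes / sizeof(Plane);
}
```

With 1024 planes the footprint is 16384 bytes, twice the 8192-byte cache, and only 512 planes fit in the cache at once.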
Is there a faster way to rotate a large bitmap by 90 or 270 degrees than simply doing a nested loop with inverted coordinates?
The bitmaps are 8bpp and typically 2048x2400x8bpp
Currently I do this by simply copying with argument inversion, roughly (pseudo code):
for x = 0 to 2048-1
for y = 0 to 2048-1
dest[x][y]=src[y][x];
(In reality I do it with pointers, for a bit more speed, but that is roughly the same magnitude)
GDI is quite slow with large images, and GPU load/store times for textures (GF7 cards) are in the same magnitude as the current CPU time.
Any tips, pointers? An in-place algorithm would even be better, but speed is more important than being in-place.
Target is Delphi, but it is more an algorithmic question. SSE(2) vectorization no problem, it is a big enough problem for me to code it in assembler
Follow up to Nils' answer
Image 2048x2700 -> 2700x2048
Compiler Turbo Explorer 2006 with optimization on.
Windows: Power scheme set to "Always on". (important!!!!)
Machine: Core2 6600 (2.4 GHz)
time with old routine: 32ms (step 1)
time with stepsize 8 : 12ms
time with stepsize 16 : 10ms
time with stepsize 32+ : 9ms
Meanwhile I also tested on a Athlon 64 X2 (5200+ iirc), and the speed up there was slightly more than a factor four (80 to 19 ms).
The speed up is well worth it, thanks. Maybe during the summer months I'll torture myself with an SSE(2) version. However, I already thought about how to tackle that, and I think I'll run out of SSE2 registers for a straight implementation:
for n:=0 to 7 do
begin
load r0, <source+n*rowsize>
shift byte from r0 into r1
shift byte from r0 into r2
..
shift byte from r0 into r8
end;
store r1, <target>
store r2, <target+1*<rowsize>
..
store r8, <target+7*<rowsize>
So 8x8 needs 9 registers, but 32-bits SSE only has 8. Anyway that is something for the summer months :-)
Note that the pointer thing is something that I do out of instinct, but it could be there is actually something to it, if your dimensions are not hardcoded, the compiler can't turn the mul into a shift. While muls an sich are cheap nowadays, they also generate more register pressure afaik.
The code (validated by subtracting result from the "naive" rotate1 implementation):
const stepsize = 32;
procedure rotatealign(Source: tbw8image; Target:tbw8image);
var stepsx,stepsy,restx,resty : Integer;
RowPitchSource, RowPitchTarget : Integer;
pSource, pTarget,ps1,ps2 : pchar;
x,y,i,j: integer;
rpstep : integer;
begin
RowPitchSource := source.RowPitch; // bytes to jump to next line. Can be negative (includes alignment)
RowPitchTarget := target.RowPitch; rpstep:=RowPitchTarget*stepsize;
stepsx:=source.ImageWidth div stepsize;
stepsy:=source.ImageHeight div stepsize;
// check if mod 16=0 here for both dimensions, if so -> SSE2.
for y := 0 to stepsy - 1 do
begin
psource:=source.GetImagePointer(0,y*stepsize); // gets pointer to pixel x,y
ptarget:=Target.GetImagePointer(target.imagewidth-(y+1)*stepsize,0);
for x := 0 to stepsx - 1 do
begin
for i := 0 to stepsize - 1 do
begin
ps1:=@psource[rowpitchsource*i]; // ( 0,i)
ps2:=@ptarget[stepsize-1-i]; // (maxx-i,0);
for j := 0 to stepsize - 1 do
begin
ps2[0]:=ps1[j];
inc(ps2,RowPitchTarget);
end;
end;
inc(psource,stepsize);
inc(ptarget,rpstep);
end;
end;
// 3 more areas to do, with dimensions
// - stepsy*stepsize * restx // right most column of restx width
// - stepsx*stepsize * resty // bottom row with resty height
// - restx*resty // bottom-right rectangle.
restx:=source.ImageWidth mod stepsize; // typically zero because width is
// typically 1024 or 2048
resty:=source.Imageheight mod stepsize;
if restx>0 then
begin
// one loop less, since we know this fits in one line of "blocks"
psource:=source.GetImagePointer(source.ImageWidth-restx,0); // gets pointer to pixel x,y
ptarget:=Target.GetImagePointer(Target.imagewidth-stepsize,Target.imageheight-restx);
for y := 0 to stepsy - 1 do
begin
for i := 0 to stepsize - 1 do
begin
ps1:=@psource[rowpitchsource*i]; // ( 0,i)
ps2:=@ptarget[stepsize-1-i]; // (maxx-i,0);
for j := 0 to restx - 1 do
begin
ps2[0]:=ps1[j];
inc(ps2,RowPitchTarget);
end;
end;
inc(psource,stepsize*RowPitchSource);
dec(ptarget,stepsize);
end;
end;
if resty>0 then
begin
// one loop less, since we know this fits in one line of "blocks"
psource:=source.GetImagePointer(0,source.ImageHeight-resty); // gets pointer to pixel x,y
ptarget:=Target.GetImagePointer(0,0);
for x := 0 to stepsx - 1 do
begin
for i := 0 to resty- 1 do
begin
ps1:=@psource[rowpitchsource*i]; // ( 0,i)
ps2:=@ptarget[resty-1-i]; // (maxx-i,0);
for j := 0 to stepsize - 1 do
begin
ps2[0]:=ps1[j];
inc(ps2,RowPitchTarget);
end;
end;
inc(psource,stepsize);
inc(ptarget,rpstep);
end;
end;
if (resty>0) and (restx>0) then
begin
// another loop less, since only one block
psource:=source.GetImagePointer(source.ImageWidth-restx,source.ImageHeight-resty); // gets pointer to pixel x,y
ptarget:=Target.GetImagePointer(0,target.ImageHeight-restx);
for i := 0 to resty- 1 do
begin
ps1:=@psource[rowpitchsource*i]; // ( 0,i)
ps2:=@ptarget[resty-1-i]; // (maxx-i,0);
for j := 0 to restx - 1 do
begin
ps2[0]:=ps1[j];
inc(ps2,RowPitchTarget);
end;
end;
end;
end;
Update 2 Generics
I tried to update this code to a generics version in Delphi XE. I failed because of QC 99703, and forum people have already confirmed it also exists in XE2. Please vote for it :-)
Update 3 Generics
Works now in XE10
Update 4
In 2017 I did some work on an assembler version for 8x8 blocks of 8bpp images only, and a related SO question about shuffle bottlenecks where Peter Cordes generously helped me out. This code still has a missed opportunity and still needs another loop-tiling level to aggregate multiple 8x8 block iterations into pseudo-larger ones like 64x64. Now it is whole lines again, and that is wasteful.
Yes, there are faster ways to do this.
Your simple loop spends most of its time in cache misses. This happens because you touch a lot of data at very different places in a tight loop. Even worse: your memory locations are exactly a power of two apart. That's a size where the cache performs worst.
You can improve this rotation algorithm if you improve the locality of your memory accesses.
A simple way to do this would be to rotate each 8x8 pixel block on its own using the same code you've used for your whole bitmap, and wrap another loop that splits the image rotation into chunks of 8x8 pixels each.
E.g. something like this (not checked, and sorry for the C-code. My Delphi skills aren't up to date):
// this is the outer-loop that breaks your image rotation
// into chunks of 8x8 pixels each:
for (int block_x = 0; block_x < 2048; block_x+=8)
{
for (int block_y = 0; block_y < 2048; block_y+=8)
{
// this is the inner-loop that processes a block
// of 8x8 pixels.
for (int x= 0; x<8; x++)
for (int y=0; y<8; y++)
dest[x+block_x][y+block_y] = src[y+block_y][x+block_x];
}
}
There are other ways as well. You could process the data in Hilbert-Order or Morton-Order. That would be in theory even a bit faster, but the code will be much more complex.
Btw - Since you've mentioned that SSE is an option for you. Note that you can rotate a 8x8 byte block within the SSE-registers. It's a bit tricky to get it working, but looking at SSE matrix transpose code should get you started as it's the same thing.
EDIT:
Just checked:
With a block-size of 8x8 pixels the code runs ca. 5 times faster on my machine. With a block-size of 16x16 it runs 10 times faster.
Seems like it's a good idea to experiment with different block-sizes.
Here is the (very simple) test-program I've used:
#include <stdio.h>
#include <windows.h>
char temp1[2048*2048];
char temp2[2048*2048];
void rotate1 (void)
{
int x,y;
for (y=0; y<2048; y++)
for (x=0; x<2048; x++)
temp2[2048*y+x] = temp1[2048*x+y];
}
void rotate2 (void)
{
int x,y;
int bx, by;
for (by=0; by<2048; by+=8)
for (bx=0; bx<2048; bx+=8)
for (y=0; y<8; y++)
for (x=0; x<8; x++)
temp2[2048*(y+by)+x+bx] = temp1[2048*(x+bx)+y+by];
}
void rotate3 (void)
{
int x,y;
int bx, by;
for (by=0; by<2048; by+=16)
for (bx=0; bx<2048; bx+=16)
for (y=0; y<16; y++)
for (x=0; x<16; x++)
temp2[2048*(y+by)+x+bx] = temp1[2048*(x+bx)+y+by];
}
int main (int argc, char **args)
{
int i, t1;
t1 = GetTickCount();
for (i=0; i<20; i++) rotate1();
printf ("%d\n", GetTickCount()-t1);
t1 = GetTickCount();
for (i=0; i<20; i++) rotate2();
printf ("%d\n", GetTickCount()-t1);
t1 = GetTickCount();
for (i=0; i<20; i++) rotate3();
printf ("%d\n", GetTickCount()-t1);
}
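The test program above transposes a square image whose side is a multiple of the block size. As a rough sketch of how the same blocking idea might extend to an actual 90-degree clockwise rotation of a non-square image, the version below clamps each block at the image edges instead of handling the remainders separately. The function name, the tightly packed row-major layout, and the `block` parameter are assumptions for illustration; it has not been benchmarked.

```c
#include <assert.h>
#include <stddef.h>

/* Rotate a w x h 8bpp image 90 degrees clockwise into dst, which must
   hold h x w pixels (row length h). Both images are tightly packed. */
static void rotate90_blocked(const unsigned char *src, unsigned char *dst,
                             size_t w, size_t h, size_t block) {
    for (size_t by = 0; by < h; by += block) {
        /* Clamp the block at the bottom edge. */
        size_t y_end = (by + block < h) ? by + block : h;
        for (size_t bx = 0; bx < w; bx += block) {
            /* Clamp the block at the right edge. */
            size_t x_end = (bx + block < w) ? bx + block : w;
            for (size_t y = by; y < y_end; y++) {
                for (size_t x = bx; x < x_end; x++) {
                    /* Source pixel (x, y) lands at column h-1-y, row x. */
                    dst[x * h + (h - 1 - y)] = src[y * w + x];
                }
            }
        }
    }
}
```

For a 3x2 source holding 1 2 3 / 4 5 6, the rotated 2x3 result is 4 1 / 5 2 / 6 3. Clamping costs two comparisons per block, which keeps the code shorter than handling the three remainder areas separately.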
If you can use C++ then you may want to look at Eigen.
It is a C++ template library that uses SSE (2 and later) and AltiVec instruction sets with graceful fallback to non-vectorized code.
Fast. (See benchmark).
Expression templates allow to intelligently remove temporaries and enable lazy evaluation, when that is appropriate -- Eigen takes care of this automatically and handles aliasing too in most cases.
Explicit vectorization is performed for the SSE (2 and later) and AltiVec instruction sets, with graceful fallback to non-vectorized code. Expression templates allow to perform these optimizations globally for whole expressions.
With fixed-size objects, dynamic memory allocation is avoided, and the loops are unrolled when that makes sense.
For large matrices, special attention is paid to cache-friendliness.
You might be able to improve it by copying in cache-aligned blocks rather than by rows, as at the moment the stride of either src or dest will be a miss (depending on whether Delphi is row major or column major).
If the image isn't square, you can't do in-place. Even if you work in square images, the transform isn't conducive to in-place work.
If you want to try to do things a little faster, you can try to take advantage of the row strides to make it work, but I think the best you would do is to read 4 bytes at a time in a long from the source and then write it into four consecutive rows in the dest. That should cut some of your overhead, but I wouldn't expect more than a 5% improvement.