AVX2: How to move a single element from vector to vector - sse

The problem is:
a1[7] = b[6];
a1[15] = b[14];
a2[7] = b[0];
a2[15] = b[8];
All three vectors are uint8x16_t
On aarch64 NEON, it would be rather trivial:
mov a1.b[7], b.b[6]
mov a1.b[15], b.b[14]
mov a2.b[7], b.b[0]
mov a2.b[15], b.b[8]
How can I do this on AVX2?
I already loaded the vectors accordingly into __m256i a, b; where b contains the same 128bit vector twice, and then:
const __m256i shuffle=_mm256_set_epi64x(0x0808080808080808, 0x0000000000000000, \
0x0e0e0e0e0e0e0e0e, 0x0606060606060606);
const __m256i mask=_mm256_set1_epi64x(0x8000000000000000);
b = _mm256_shuffle_epi8(b, shuffle);
a = _mm256_blendv_epi8(a, b, mask);
Yes, it works the way I want, but I cannot shake the feeling that it's anything but optimal, sacrificing two registers for this kind of trivial operation.
Am I missing something? Are there more efficient ways dealing with this problem?
Shall I modify this to a 64-bit shift then blend? That would need the same number of registers and instructions. Any suggestions?
Please note that I cannot overwrite the other lanes in a.
Thanks in advance.


Performing an "online" linear interpolation

I have a problem where I need to do a linear interpolation on some data as it is acquired from a sensor (it's technically position data, but the nature of the data doesn't really matter). I'm doing this now in matlab, but since I will eventually migrate this code to other languages, I want to keep the code as simple as possible and not use any complicated matlab-specific/built-in functions.
My implementation initially seems OK, but when checking my work against matlab's built-in interp1 function, it seems my implementation isn't perfect, and I have no idea why. Below is the code I'm using on a dataset already fully collected, but as I loop through the data, I act as if I only have the current sample and the previous sample, which mirrors the problem I will eventually face.
%make some dummy data
np = 109; %number of data points for x and y
x_data = linspace(3,98,np) + (normrnd(0.4,0.2,[1,np]));
y_data = normrnd(2.5, 1.5, [1,np]);
%define the query points the data will be interpolated over
qp = [1:100];
kk=2; %indexes through the data
cc = 1; %indexes through the query points
qpi = qp(cc); %qpi is the current query point in the loop
y_interp = qp*nan; %this will hold our solution
while kk<=length(x_data)
kk = kk+1; %update the data counter
%perform online interpolation
if cc<length(qp)-1
if qpi>=y_data(kk-1) %the query point, of course, has to be in-between the current value and the next value of x_data
y_interp(cc) = myInterp(x_data(kk-1), x_data(kk), y_data(kk-1), y_data(kk), qpi);
if qpi>x_data(kk), %if the current query point is already larger than the current sample, update the sample
kk = kk+1;
else %otherwise, update the query point to ensure its in between the samples for the next iteration
cc = cc + 1;
qpi = qp(cc);
%It is possible that if the change in x_data is greater than the resolution of the query
%points, an update like the above wont work. In this case, we must lag the data
if qpi<x_data(kk),
%get the correct interpolation
y_interp_correct = interp1(x_data, y_data, qp);
%plot both solutions to show the difference
plot(y_interp,'displayname','manual-solution'); hold on;
plot(y_interp_correct,'k--','displayname','matlab solution');
leg1 = legend('show');
ylabel('interpolated points');
xlabel('query points');
Note that the "myInterp" function is as follows:
function yi = myInterp(x1, x2, y1, y2, qp)
%linearly interpolate the function value y(x) over the query point qp
yi = y1 + (qp-x1) * ( (y2-y1)/(x2-x1) );
And here is the plot showing that my implementation isn't correct :-(
Can anyone help me find where the mistake is? And why? I suspect it has something to do with ensuring that the query point is in-between the previous and current x-samples, but I'm not sure.
The problem in your code is that you at times call myInterp with a value of qpi that is outside of the bounds x_data(kk-1) and x_data(kk). This leads to invalid extrapolation results.
Your logic of looping over kk rather than cc is very confusing to me. I would write a simple for loop over cc, which are the points at which you want to interpolate. For each of these points, advance kk, if necessary, such that qp(cc) is in between x_data(kk) and x_data(kk+1) (you can use kk-1 and kk instead if you prefer, just initialize kk=2 to ensure that kk-1 exists, I just find starting at kk=1 more intuitive).
To simplify the logic here, I'm limiting the values in qp to be inside the limits of x_data, so that we don't need to test to ensure that x_data(kk+1) exists, nor that x_data(1)<qp(cc). You can add those tests in if you wish.
Here's my code:
qp = [ceil(x_data(1)+0.1):floor(x_data(end)-0.1)];
y_interp = qp*nan; % this will hold our solution
kk=1; % indexes through the data
for cc=1:numel(qp)
   % advance kk to where we can interpolate
   % (this loop is guaranteed to not index out of bounds because x_data(end)>qp(end),
   % but needs to be adjusted if this is not ensured prior to the loop)
   while x_data(kk+1) < qp(cc)
      kk = kk + 1;
   end
   % perform online interpolation
   y_interp(cc) = myInterp(x_data(kk), x_data(kk+1), y_data(kk), y_data(kk+1), qp(cc));
end
As you can see, the logic is a lot simpler this way. The result is identical to y_interp_correct. The inner while x_data... loop serves the same purpose as your outer while loop, and would be the place where you read your data from wherever it's coming from.
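Since the question mentions eventually moving this code to other languages, here is a minimal C sketch of the same loop structure. It is only an illustration: the names lerp_at and interp_online are hypothetical, and it assumes x is strictly increasing, the query points are non-decreasing, and every query point lies within [x[0], x[n-1]]. Unlike the MATLAB version, the inner loop carries an explicit bounds check.

```c
#include <stddef.h>

/* Linear interpolation between (x1, y1) and (x2, y2), evaluated at q.
   Same formula as myInterp in the question. */
static double lerp_at(double x1, double x2, double y1, double y2, double q)
{
    return y1 + (q - x1) * ((y2 - y1) / (x2 - x1));
}

/* Hypothetical helper: fills y_out[c] for each query point qp[c].
   Assumptions (not checked): n >= 2, x strictly increasing,
   qp non-decreasing, and x[0] <= qp[c] <= x[n-1]. */
void interp_online(const double *x, const double *y, size_t n,
                   const double *qp, double *y_out, size_t nq)
{
    size_t k = 0; /* index of the left sample of the current interval */
    for (size_t c = 0; c < nq; ++c) {
        /* Advance k until qp[c] falls inside [x[k], x[k+1]].
           The k + 2 < n test keeps x[k + 1] in bounds. */
        while (k + 2 < n && x[k + 1] < qp[c])
            ++k;
        y_out[c] = lerp_at(x[k], x[k + 1], y[k], y[k + 1], qp[c]);
    }
}
```

In a streaming setting, the inner while loop is where the next sample would be read from the sensor instead of from an array.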

Check that at least 1 element is true in each of multiple vectors of compare results - horizontal OR then AND

I'm looking for an SSE Bitwise OR between components of same vector. (Editor's note: this is potentially an X-Y problem, see below for the real comparison logic.)
I am porting some SIMD logic from SPU intrinsics. It has an instruction
Which according to the docs
spu_orx: OR word across d = spu_orx(a) The four word elements of
vector a are logically Ored. The result is returned in word element 0
of vector d. All other elements (1,2,3) of d are assigned a value of
How can I do that with SSE2 through SSE4 using the minimum number of instructions? _mm_or_ps is what I have so far.
Here is the scenario from SPU based code:
qword res = spu_orx(spu_or(spu_fcgt(x, y), spu_fcgt(z, w)))
So it first ORs two 'greater' comparisons, then ORs its result.
Later, pairs of those results are ANDed to get the final comparison value.
This is effectively doing (A||B||C||D||E||F||G||H) && (I||J||K||L||M||N||O||P) && ... where A..D are the 4x 32-bit elements of the fcgt(x,y) and so on.
Obviously vertical _mm_or_ps of _mm_cmp_ps results is a good way to reduce down to 1 vector, but then what? Shuffle + OR, or something else?
Regarding "but then what?"
I perform
qword res = spu_orx(spu_or(spu_fcgt(x, y), spu_fcgt(z, w)))
several times on different inputs, then AND those all into a single result, which is finally cast to integer 0 or 1 (false/true test).
On SPU it goes like this:
qword aRes = si_and(res, res1);
qword aRes1 = si_and(aRes, res2);
qword aRes2 = si_and(aRes1, res3);
return si_to_uint(aRes2);
SSE4.1 PTEST
bool any_nonzero = !_mm_testz_si128(v,v);
That would be a good way to horizontal OR + booleanize a vector into a 0/1 integer. It will compile to multiple instructions, and ptest same,same is 2 uops on its own. But once you have the result as a scalar integer, scalar AND is even cheaper than any vector instruction, and you can branch on the result directly because it sets integer flags.
#include <immintrin.h>
bool any_nonzero_bit(__m128i v) {
    return !_mm_testz_si128(v,v);
}
On Godbolt with gcc9.1 -O3 -march=nehalem:
any_nonzero(long long __vector(2)):
ptest xmm0, xmm0 # 2 uops
setne al # 1 uop with false dep on old value of RAX
This is only 3 uops on Intel for a horizontal OR into a single bit in an integer register. AMD Ryzen ptest is only 1 uop so it's even better.
The only risk here is if gcc or clang creates false dependencies by not xor-zeroing eax before doing a setcc into AL. Usually gcc is pretty fanatical about spending extra uops to break false dependencies so I don't know why it doesn't here. (I did check with -march=skylake and -mtune=generic in case it was relying on Nehalem partial-register renaming for -march=nehalem. Even -march=znver1 didn't get it to xor-zero EAX before the ptest.)
It would be nice if we could avoid the _mm_or_ps and have PTEST do all the work. But even if we consider inverting the comparisons, the vertical-AND / horizontal-OR behaviour doesn't let us check something about all 8 elements of 2 vectors, or about any of those 8 elements.
e.g. Can PTEST be used to test if two registers are both zero or some other condition?
// 1 if all the vertical pairs AND to zero.
// but 0 if even one vertical AND result is non-zero
_mm_testz_si128( _mm_castps_si128(_mm_cmpngt_ps(x,y)),
I mention this only to rule it out and save you the trouble of considering this optimization idea. (@chtz suggested it in comments. Inverting the comparison is a good idea that can be useful for other ways of doing things.)
Without SSE4.1 / delaying the horizontal OR
We might be able to delay horizontal ORing / booleanizing until after combining some results from multiple vectors. This makes combining more expensive (imul or something), but saves 2 uops in the vector -> integer stage vs. PTEST.
x86 has cheap vector mask->integer bitmap with _mm_movemask_ps. Especially if you ultimately want to branch on the result, this might be a good idea. (But x86 doesn't have a || instruction that booleanizes its inputs either so you can't just & the movemask results).
One thing you can do is integer multiply movemask results: x * y is non-zero iff both inputs are non-zero. Unlike x & y, which can be false for 0b0101 & 0b1010 for example. (Our inputs are 4-bit movemask results and unsigned is 32-bit, so we have some room before we overflow.) AMD Bulldozer family has an integer multiply that isn't fully pipelined so this could be a bottleneck on old AMD CPUs. Using just 32-bit integers is also good for some low-power CPUs with slow 64-bit multiply.
This might be good if throughput is more of a bottleneck than latency, although movmskps can only run on one port.
I'm not sure if there are any cheaper integer operations that let us recover the logical-AND result later. Adding doesn't work; the result is non-zero even if only one of the inputs was non-zero. Concatenating the bits together (shift+or) is also of course like an OR if we eventually just test for any non-zero bit. We can't just bitwise AND because 2 & 1 == 0, unlike 2 && 1.
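As a rough illustration of the multiply idea above, the sketch below uses plain unsigned values as stand-ins for 4-bit _mm_movemask_ps results (0 to 15). The function names are hypothetical and exist only to contrast bitwise AND with multiplication; no SIMD is involved.

```c
#include <stdbool.h>

/* m0 and m1 stand in for 4-bit movemask results (values 0..15). */

/* Bitwise AND: wrong as a logical AND, because two non-zero masks
   with no bits in common (e.g. 0b0101 and 0b1010) AND to zero. */
bool both_nonzero_and(unsigned m0, unsigned m1)
{
    return (m0 & m1) != 0;
}

/* Multiply: the product of two masks in 0..15 is at most 225, so it
   cannot overflow, and it is non-zero iff both masks are non-zero. */
bool both_nonzero_mul(unsigned m0, unsigned m1)
{
    return (m0 * m1) != 0;
}
```

With m0 = 0x5 and m1 = 0xA, the AND version reports false while the multiply version reports true, which is the logical-AND answer wanted here.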
Keeping it in the vector domain
Horizontal OR of 4 elements takes multiple steps.
The obvious way is _mm_movehl_ps + OR, then another shuffle+OR. (See Fastest way to do horizontal float vector sum on x86 but replace _mm_add_ps with _mm_or_ps)
But since we don't actually need an exact bitwise-OR when our inputs are compare results, we just care if any element is non-zero. We can and should think of the vectors as integer, and look at integer instructions like 64-bit element ==. One 64-bit element covers/aliases two 32-bit elements.
__m128i cmp = _mm_castps_si128(cmpps_result); // reinterpret: zero instructions
// SSE4.1 pcmpeqq 64-bit integer elements
__m128i cmp64 = _mm_cmpeq_epi64(cmp, _mm_setzero_si128()); // -1 if both elements were zero, otherwise 0
__m128i swap = _mm_shuffle_epi32(cmp64, _MM_SHUFFLE(1,0, 3,2)); // copy and swap, no movdqa instruction needed even without AVX
__m128i bothzero = _mm_and_si128(cmp64, swap); // both halves have the full result
After this logical inversion, ORing together multiple bothzero results will give you the AND of multiple conditions you're looking for.
Alternatively, SSE4.1 _mm_minpos_epu16(cmp64) (phminposuw) will tell us in 1 uop (but 5 cycle latency) if either qword is zero. It will place either 0 or 0xFFFF in the lowest word (16 bits) of the result in this case.
If we inverted the original compares, we could use phminposuw on that (without pcmpeqq) to check if any are zero. So basically a horizontal AND across the whole vector. (Assuming that it's elements of 0 / -1). I think that's a useful result for inverted inputs. (And saves us from using _mm_xor_si128 to flip the bits).
An alternative to pcmpeqq (_mm_cmpeq_epi64) would be SSE2 psadbw against a zeroed vector to get 0 or non-zero results in the bottom of each 64-bit element. It won't be a mask, though: it's 0xFF * 8 when both 32-bit compare results in that half were true, or 0xFF * 4 when only one was. Those non-zero values share set bits, so you can still AND them. And it doesn't invert.

return floats to objective-c from arm assembly function

I've written an assembly function that runs fine on an iPhone 4 (32-bit code) as well as on an iPhone 6s (64-bit code). I pass in four floating point numbers from a calling function in objective-c.
Here is the structure I use for the 4 floating point numbers and below that is the prototype for the function - as found at the top of my objective-c code.
struct myValues{ // This is a structure. It is used to conveniently group multiple data items logically.
    float A; // I am using it here because I want to return multiple float values from my ASM code
    float B; // They get passed in via S0, S1, S2 etc. and they come back out that way too
    float number;
    float divisor; //gonna be 2.0
};
struct myValues my_asm(float e, float f, float g, float h); // Prototype for the ASM function
Down in my objective-c code I call my assembly function like this:
myValues = my_asm(myValues.A, myValues.B, myValues.number,myValues.divisor); // ASM function
When running against the iPhone 6S the code runs like a champ (64-bit code). The 4 floating point values are passed from the objective-c code to the assembly code via the ARM single-precision float registers S0-S3. The results returned are also passed via S0-S3.
When running against the iPhone 4 the code runs fine as well (32-bit code). The 4 floating point values are passed from the obj-c code to the assembly code via the ARM single-precision float registers S0, S2, S4, and S6 (not sure why it skips the odd registers). The code runs fine but the values that get returned to my obj-c structure are garbage.
Where/how do I pass floating point values from the ARM 32-bit code so they arrive back in the obj-c structure?
P.S. Below is my assembly code from my Xcode S file.
.ios_version_min 9, 0
.globl _my_asm
.align 2
#ifdef __arm__
.thumb_func _my_asm
.syntax unified
.code 16
_my_asm: // 32 bit code
// S0 = A, S2 = B, S4 = Number, S6 = 2.0 - parameters passed in when called by the function
vadd.f32 s0, s0, s2
vdiv.f32 s0, s0, s6
vdiv.f32 s1, s4, s0
vcvt.u32.f32 r0,s0
bx lr
_my_asm: // 64 bit code
//add W0, W0, W1
; S0 = A, S1 = B, S2 = Number, S3 = 2.0 parameters passed in when called by the function
fadd s0, s0, s1
fdiv s0, s0, s3
fdiv s1, s2, s0
Neither of your functions is correctly returning a structure. You need to understand the ARM ABI. You can start by reading Apple's iOS ABI Function Call Guide. If you still do not understand after studying the ABI, ask a new question showing what you've tried.

Fastest way of storing non-adjacent d registers with NEON intrinsics

I am porting 32bit NEON asm code to NEON intrinsics, and I am wondering if this code can be written in a concise way using intrinsics:
vst4.32 {d0[0], d2[0], d4[0], d6[0]}, [%[v1]]!
1) The previous code operates on q registers, but when it comes to storage, instead of using q0, q1, q2 and q3, it has to recreate vectors which have each part in one of the d registers, e.g. v1[0] = d0[0], v1[1] = d2[0] ... v2[0] = d0[1], v2[1] = d2[1] ... v3[0] = d1[0], v3[1] = d3[0] ... etc.
This operation is a one-liner in asm, but with intrinsics I don't know if I can do that without first splitting high and low bits and building a new float32x4x4_t variable to feed to vst4_f32.
Is that possible?
2) I'm not entirely sure of what [%[v1]]! does (yes, I googled quite a bit): it should be a reference to a variable named v1 and the exclamation mark will do writeback, which should mean the pointer is increased by the same amount that was written by the instruction on the same line.
Correct? Any way of replicating that with intrinsics?
After some more investigation I found this specific instruction to store a specific lane of an array of 4 vectors, so no need to split into high and low bits variables:
float32x4x4_t u = { q0, q1, q2, q3 };
vst4q_lane_f32(v1, u, 0);
v1 += 4;
Writeback is just an increased pointer, as @charlesbaylis wrote.
In principle, a sufficiently smart compiler could use the instruction you want for the vst4_f32 intrinsic, but in practice, no compiler is that good.
To get the post-index writeback, you can write
vst4_f32(ptr, v);
ptr += 4;
Some compilers will recognise this. GCC 5.1 (when released) will do this in at least some cases.
[Edit: misread the question, vst4q_lane_f32 does map to the required instruction perfectly]
It seems to be inline assembly.
Anyway, the answers are:
1) No
2) Yes

arm asm/neon optimisation for image processing

I'm currently working on a painting app on iOS.
I draw directly into an NSMutableData buffer and apply blending with my brush like this:
- (void) combineColorDestination:(unsigned char*) dest source:(unsigned char*) src
{
    const unsigned char sra = ((unsigned char *)src)[3];
    const float oneminusalpha = 1.0f - (sra / 255.f);
    int d[4];
    for (int i=0;i<4;i++)
    {
        d[i] = oneminusalpha * ((unsigned char *)dest)[i] + ((unsigned char *)src)[i];
        if (d[i]>255)
            d[i] = 255;
        ((unsigned char *)dest)[i] = (unsigned char)d[i];
    }
}
Any suggestions for optimisations?
I previously tried to use NEON, but I got a bug I wasn't able to fix (the bordering pixels were buggy).
I was iterating pixels 2 by 2 like this :
uint8x8_t va = vld1_u8(dest);
uint8x8_t vb = vld1_u8(src);
uint8x8_t res = vqadd_u8(va,vb);
vst1_u8(dest, res);
Suggestions? Alright. Note that these are valid whichever multimedia manipulation you are doing and are hardly restricted to your case.
First, before you even do NEON, you should change your code to have one function that changes a bunch of pixels (at least a row, a rectangle if you can) at once, instead of a function (or method - even worse) that changes one pixel and is called a bunch of times: somehow I doubt the brush is only 1x1 pixel.
Second, except for the column loop (and eventual row loop), there should be no branch (that is, flow control structures). No for (i=0;i<4;i++); just write the code for the four channels in sequence (use a macro if necessary). No if (d[i]>255); express that as an alternative: dest[i] = (temp>255?255:temp); at the very least, if not replacing it by a more efficient way to do saturation (tricks using subtractions, shifts, and masks exist).
Third, avoid any conversion between floating-point and integer; this is always valid advice, but float->int conversions are particularly devastating on ARM. Since you're manipulating integers, this means foregoing floating-point here.
And once you've done that, surprise, besides making your code faster you have in fact done the preparation work for NEON: NEON is only remotely useful if you process a bunch of pixels at once, if there is no branch, and if you don't convert between floating-point and integer all over the place. So only then will we talk about NEON, if it is even necessary at this point.
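To make the three suggestions concrete, here is a minimal plain-C sketch (no NEON) of the question's blend rewritten to process a whole row, with the four channels unrolled and integer-only arithmetic. The names clamp_u8 and blend_row_rgba are hypothetical, and the sketch assumes 4 bytes per pixel with alpha in byte 3, as in the question's code. The division by 255 truncates, which is intended to mirror the original float-to-int cast but may differ from it by one in rare cases because of float rounding.

```c
#include <stddef.h>
#include <stdint.h>

/* Saturate to 255. Written as a conditional expression rather than an
   if statement, as suggested above; compilers typically turn this into
   a conditional move or a min instruction. */
static inline uint8_t clamp_u8(unsigned v)
{
    return (uint8_t)(v > 255u ? 255u : v);
}

/* Hypothetical row blend: dest = dest * (255 - src_alpha) / 255 + src,
   saturated, for n_pixels pixels of 4 bytes each. Integer math only. */
void blend_row_rgba(uint8_t *dest, const uint8_t *src, size_t n_pixels)
{
    for (size_t p = 0; p < n_pixels; ++p, dest += 4, src += 4) {
        const unsigned inv_a = 255u - src[3]; /* 255 * (1 - alpha) */
        /* Four channels written out in sequence, no inner loop. */
        dest[0] = clamp_u8(dest[0] * inv_a / 255u + src[0]);
        dest[1] = clamp_u8(dest[1] * inv_a / 255u + src[1]);
        dest[2] = clamp_u8(dest[2] * inv_a / 255u + src[2]);
        dest[3] = clamp_u8(dest[3] * inv_a / 255u + src[3]);
    }
}
```

A loop in this shape has one call per row instead of one per pixel, no float conversions, and no per-channel branches.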
