Summing 3 lanes in a NEON float32x4_t - iOS

I'm vectorizing an inner loop with ARM NEON intrinsics (llvm, iOS). I'm generally using float32x4_ts. My computation finishes with the need to sum three of the four floats in this vector.
I can drop back to C floats at this point and vst1q_f32 to get the four values out and add up the three I need. But I figure it may be more effective if there's a way to do it directly with the vector in an instruction or two, and then just grab a single lane result, but I couldn't figure out any clear path to doing this.
I'm new to NEON programming, and the existing "documentation" is pretty horrific. Any ideas? Thanks!

You should be able to use the VFP unit for such a task. NEON and VFP share the same register bank, meaning you don't need to shuffle data between registers to take advantage of one unit or the other, and they can also have different views of the same register bits.
Your float32x4_t is 128 bits, so it must sit in a quad (Q) register. If you are solely using ARM intrinsics, you wouldn't know which one you are using. The problem is that if it is sitting in a register above Q7, VFP can't see it as single-precision lanes, since the S registers only overlay Q0-Q7 (for the curious reader: I kept this simple since there are differences between VFP versions and this is the bare minimum requirement). So it would be best to move your float32x4_t to a fixed register like Q0. After this you can just sum the registers S0, S1, S2 with vadd.f32 and move the result back to an ARM register.
Some warnings... VFP and NEON are theoretically different execution units sharing the same register bank and pipeline. I am not sure whether this approach is any better than the others; it goes without saying that you should benchmark. Also, this approach isn't streamlined with the NEON intrinsics, so you would probably need to craft your code with inline assembly.
I wrote a simple snippet to see what this could look like, and came up with this:
#include "arm_neon.h"
float32_t sum3() {
register float32x4_t v asm ("q0");
float32_t ret;
asm volatile(
"vadd.f32 s0, s1\n"
"vadd.f32 s0, s2\n"
"vmov %[ret], s0\n"
: [ret] "=r" (ret)
:
:);
return ret;
}
The objdump of it looks like this (compiled with gcc -O3 -mfpu=neon -mfloat-abi=softfp):
00000000 <sum3>:
0: ee30 0a20 vadd.f32 s0, s0, s1
4: ee30 0a01 vadd.f32 s0, s0, s2
8: ee10 3a10 vmov r0, s0
c: 4770 bx lr
e: bf00 nop
I really would like to hear your impressions if you give this a go!

Can you zero-out the fourth element? Perhaps just by copying it and using vset_lane_f32?
If so, you can use the answers from Sum all elements in a quadword vector in ARM assembly with NEON like:
float32x2_t r = vadd_f32(vget_high_f32(input), vget_low_f32(input));
return vget_lane_f32(vpadd_f32(r, r), 0); // vpadd adds adjacent elements
Though this actually does a bit more work than you need, so it might be faster to just extract the three floats with vget_lane_f32 and add them.
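Putting those two suggestions together, a minimal sketch (the function name is mine; this assumes you can afford to overwrite the fourth lane):

#include <arm_neon.h>

// Clear lane 3 with vsetq_lane_f32, then reduce all four lanes.
static inline float32_t sum3_lanes(float32x4_t v) {
    v = vsetq_lane_f32(0.0f, v, 3);               // zero out the fourth lane
    float32x2_t r = vadd_f32(vget_high_f32(v),    // {v0 + v2, v1 + 0}
                             vget_low_f32(v));
    r = vpadd_f32(r, r);                          // add adjacent pairs: v0 + v2 + v1
    return vget_lane_f32(r, 0);
}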

It sounds like you want to use (some version of) VLD1 to load zero into your extra lane (unless you can arrange for it to be zero already), followed by two VPADD (pairwise add) operations to sum four lanes into two and then two lanes into one.

Related

What instruction set would be easiest to implement on a homemade ALU?

I'm designing a basic 8- or 16-bit computer (I haven't really decided yet) using EEPROM chips, SRAM, and an ALU made (mostly) out of individual transistors on a PCB using CMOS logic, which I have already partially designed and tested. I thought it would be cool to use an already existing instruction set so I can compile C++ code for it instead of writing everything in machine code.
I looked at the AVR gcc compiler on Compiler Explorer and the machine code it produces; it looks very simple, and I think it is only 8-bit. Or should I go for 32 bits and try to use x86? That would make the ALU a lot bigger. Are there compilers that let you use a limited set of instructions so I don't have to implement every single one? Or would it even be easier to just write an interpreter for a custom instruction set? Any advice is welcome, thank you.
After a bit of research it has become apparent that trying to recreate modern ALUs and instruction sets would be very complicated and time-consuming, so I should definitely make my own simplistic architecture, and if I really want to compile C code for it I could probably just interpret x86 or AVR assembly from gcc.
I would also love some feedback on my design. I came up with a really weird ISA last night that is focused mainly on making the hardware easy to engineer.
There are two input registers in the ALU; all the other registers compute functions of those two values, all at the same time. For instance, there is a register that holds the sum of A and B, one that holds the result of A shifted right by B bits, a "jump if A > B" branch, and so on.
So adding two numbers would take 3 clock cycles: you would move the two values from RAM into A and B, then copy the result back to RAM afterwards. It would look like this:
setA addressInRam1 (6-bit opcode, 18-bit address/value)
setB addressInRam2
copyAddedResult addressInRam1
Program code is executed directly from EEPROM. I don't know whether I should think of it as having two general-purpose registers or as having 2^18 registers. Either way, it is much easier and simpler to build when you're executing instructions one at a time like that. Again, any advice is welcome; I am somewhat of a noob in this field, thank you!
There is also an additional C register that holds a value to be stored into RAM on the next clock cycle, at the address specified by the set instruction. This is what the Fibonacci sequence would look like:
1: setC 1; // setting C reg to 1
2: set 0; // setting address 0 in ram to the C register
3: setA 0; // copying value in address 0 of ram into A reg
// repeat for B reg
4: set 1; // setting this to the same as the other
5: setB 1;
6: jumpIf> 9; // jump to line 9 if A > B
7: getSum 0; // put sum of A and B into address 0 of ram
8: setA 0; // set the A register to address 0 of ram
9: getSum 1; // "else" put the sum into the second variable
10: setB 1;
11: jump 6; // loop back to line 6 forever
I made a C++ equivalent and put it through Compiler Explorer, and despite the many drawbacks of this architecture it uses the same number of clock cycles as x64 in the loop, and only two more in total. But I think this function in particular works pretty well with the design, since I don't have to reassign A and B often.
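For reference, a rough reconstruction of what such a C++ equivalent might look like (this is my own sketch of the loop in lines 6-11, following the comments' if/else intent; it is not the original poster's code):

#include <cstdint>
#include <cstdio>

int main() {
    // a and b mirror the A and B registers after lines 1-5 (both cells start at 1).
    std::uint32_t a = 1, b = 1;
    for (int step = 0; step < 20; ++step) {    // the original loops forever (line 11)
        if (a > b)                             // line 6: jumpIf>
            b = a + b;                         // lines 9-10: sum into the second variable
        else
            a = a + b;                         // lines 7-8: sum into the first variable
        std::printf("%u\n", a > b ? a : b);    // print the latest Fibonacci number
    }
    return 0;
}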

What is the difference between loadu_ps and set_ps when using unformatted data?

I have some data that isn't stored as a structure of arrays. What is the best practice for loading the data into registers?
__m128 _mm_set_ps (float e3, float e2, float e1, float e0)
// or
__m128 _mm_loadu_ps (float const* mem_addr)
With _mm_loadu_ps, I'd copy the data into a temporary stack array first, versus copying the values in directly with _mm_set_ps. Is there a difference?
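To make the comparison concrete, a small sketch of the two alternatives (the scalar arguments are placeholders):

#include <immintrin.h>

__m128 via_set(float a, float b, float c, float d) {
    return _mm_set_ps(d, c, b, a);       // values go straight into a vector register
}

__m128 via_load(float a, float b, float c, float d) {
    float tmp[4] = { a, b, c, d };       // four scalar stores to the stack...
    return _mm_loadu_ps(tmp);            // ...then one vector reload
}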
It can be a tradeoff between latency and throughput, because separate stores into an array will cause a store-forwarding stall when you do a vector load. So it's high latency, but throughput could still be ok, and it doesn't compete with surrounding code for the vector shuffle execution unit. So it can be a throughput win if the surrounding code also has shuffle operations, vs. 3 shuffles to insert 3 elements into an XMM register after a scalar load of the first one. Either way it's still a lot of total uops, and that's another throughput bottleneck.
Most compilers like gcc and clang do a pretty good job with _mm_set_ps() when optimizing with -O3, whether the inputs are in memory or registers. I'd recommend it, except in some special cases.
The most common missed-optimization with _mm_set is when there's some locality between the inputs. e.g. don't do _mm_set_ps(a[i+2], a[i+3], a[i+0], a[i+1]), because many compilers will use their regular pattern without taking advantage of the fact that 2 pairs of elements are contiguous in memory. In that case, use (the intrinsics for) movsd and movhps to load in two 64-bit chunks. (Not movlps: it merges into an existing register instead of zeroing the high elements, so it has a false dependency on the old contents while movsd zeros the high half.) Or a shufps if some reordering is needed between or within the 64-bit chunks.
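For example, a sketch of loading two contiguous pairs with intrinsics (the indexing and the plain, unshuffled lane order are mine; a movq-style zeroing 64-bit load stands in for movsd here, and you'd add a shufps if you need the swapped order from the _mm_set_ps example above):

#include <immintrin.h>

__m128 load_two_pairs(const float *a, int i) {
    // 64-bit load of a[i], a[i+1] into the low half, zeroing the upper half (movq)
    __m128 lo = _mm_castsi128_ps(_mm_loadl_epi64((const __m128i *)(a + i)));
    // insert a[i+2], a[i+3] into the high half (movhps)
    return _mm_loadh_pi(lo, (const __m64 *)(a + i + 2));
}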
The "regular pattern" that compilers use will usually be movss / insertps from memory if compiling with SSE4, or movss loads and unpcklps shuffles to combine pairs and then another unpcklps, unpcklpd, or movlhps to shuffle into one register. Or a shufps or shufpd if the compiler likes to waste code-side on immediate shuffle-control operands instead of using fixed shuffles intelligently.
See also Agner Fog's optimization guides for some handy tables of data-movement instructions to get a better idea of what the compiler has to work with, and how stuff performs. Note that Haswell and later can only do 1 shuffle per clock. Also other links in the x86 tag wiki.
There's no really cheap way for a compiler or human to do this, in the general case when you have 4 separate scalars that aren't contiguous in memory at all. Or for register inputs, where it can't optimize the way they're generated in registers in the first place to have some of them already packed together. (e.g. for function args passed in registers to a function that can't / doesn't inline.)
Anyway, it's not a big deal unless you have this inside an inner loop. In that case, definitely worry about it (and check the compiler's asm output to see if it made a mess or could do better if you program the gather yourself with intrinsics that map to single instructions like _mm_load_ss / _mm_shuffle_ps).
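As an illustration of programming the gather yourself with those single-instruction intrinsics, a sketch for four scattered floats (the four pointers are placeholders; unpcklps/movlhps serve as the fixed shuffles here):

#include <immintrin.h>

__m128 gather4(const float *p0, const float *p1, const float *p2, const float *p3) {
    __m128 a0 = _mm_load_ss(p0);           // movss: {*p0, 0, 0, 0}
    __m128 a1 = _mm_load_ss(p1);
    __m128 a2 = _mm_load_ss(p2);
    __m128 a3 = _mm_load_ss(p3);
    __m128 lo = _mm_unpacklo_ps(a0, a1);   // unpcklps: {*p0, *p1, 0, 0}
    __m128 hi = _mm_unpacklo_ps(a2, a3);   // unpcklps: {*p2, *p3, 0, 0}
    return _mm_movelh_ps(lo, hi);          // movlhps:  {*p0, *p1, *p2, *p3}
}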
If possible, rearrange your data layout to make the data contiguous in at least small chunks / stripes. (See https://stackoverflow.com/tags/sse/info, specifically these slides.) But sometimes one part of the program needs the data one way, and another part needs it a different way. Choose the layout that's good for the case that needs to be faster, or that runs more often, or whatever, and suck it up and do the best you can for the other part of the program. :P Possibly transpose / convert once to set up for multiple SIMD operations, but extra passes over the data with no computation just suck up time and can hurt your computational intensity (how much ALU work you do each time you load data into registers) more than they help.
And BTW, actual gather instructions (like AVX2 vgatherdps) are not very fast; even on Skylake it's probably not worth using a gather instruction for four 32-bit elements at known locations. On Broadwell / Haswell, gather is definitely not worth using for this.
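For comparison, the AVX2 gather being advised against here would look roughly like this (the base pointer and indices are placeholders):

#include <immintrin.h>

__m128 gather4_avx2(const float *base, int i0, int i1, int i2, int i3) {
    __m128i idx = _mm_setr_epi32(i0, i1, i2, i3);   // element indices into base[]
    return _mm_i32gather_ps(base, idx, 4);          // vgatherdps, scale 4 = sizeof(float)
}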

SIMD zero vector test

Is there a quick way to check whether a SIMD vector is a zero vector (all components equal ±zero)? I am currently using an algorithm, based on shifts, that runs in log2(N) time, where N is the dimension of the vector. Is there anything faster? Note that my question is broader than the proposed answer (see the tags): it refers to vectors of all types (integer, float, double, ...).
How about this straightforward AVX code? I think it's O(N), and I don't know how you could possibly do better without making assumptions about the input data: you have to actually read every value to know whether it's 0, so it's about doing as much of that as possible per cycle.
You should be able to massage the code to your needs. It treats both +0 and -0 as zero. It will work for unaligned memory addresses, but aligning to 32-byte addresses will make the loads faster. You may need to add something to deal with the remaining elements if size isn't a multiple of 8.
#include <immintrin.h>
#include <stdint.h>

uint64_t num_non_zero_floats(const float *mem_addr, int size) {
    uint64_t num_non_zero = 0;
    __m256 zeros = _mm256_setzero_ps();
    for (int i = 0; i != size; i += 8) {
        __m256 vec = _mm256_loadu_ps(mem_addr + i);
        __m256 comparison_out = _mm256_cmp_ps(zeros, vec, _CMP_EQ_OQ); // all-ones in lanes equal to +/-0.0 (3 cycles latency, throughput 1)
        unsigned bits_zero = (unsigned)_mm256_movemask_ps(comparison_out); // one bit per zero lane (2-3 cycles latency)
        num_non_zero += 8 - __builtin_popcount(bits_zero);                 // count the lanes that were not zero
    }
    return num_non_zero;
}
If you want to test floats for +/- 0.0, then you can check for all the bits being zero, except the sign bit. Any set-bits anywhere except the sign bit mean the float is non-zero. (http://www.h-schmidt.net/FloatConverter/IEEE754.html)
Agner Fog's asm optimization guide points out that you can test a float or double for zero using integer instructions:
; Example 17.4b
mov eax, [rsi]
add eax, eax ; shift out the sign bit
jz IsZero
For vectors, though, using ptest with a mask of everything except the sign bits is better than using paddd to get rid of the sign bit. Actually, test dword [rsi], 0x7fffffff may be more efficient than Agner Fog's load/add sequence, but the 32-bit immediate probably stops the load from micro-fusing on Intel, and it may have a larger code size.
x86 PTEST (SSE4.1) does a bitwise AND and sets flags based on the result.
movdqa xmm0, [mask]
.loop:
ptest xmm0, [rsi+rcx]
jnz nonzero
add rcx, 16 # count up towards zero
jl .loop # with rsi pointing to past the end of the array
...
nonzero:
Or cmov could be useful to consume the flags set by ptest.
IDK if it'd be possible to use a loop-counter instruction that didn't set the zero flag, so you could do both tests with one jump instruction or something. Probably not. And the extra uop to merge the flags (or the partial-flags stall on earlier CPUs) would cancel out the benefit.
@Iwillnotexist Idonotexist: re one of your comments on the OP: you can't just movemask without doing a pcmpeq first, or a cmpps. The non-zero bit might not be in the high bit! You probably knew that, but one of your comments seemed to leave it out.
I do like the idea of ORing together multiple values before actually testing. You're right that sign-bits would OR with other sign-bits, and then you ignore them the same way you would if you were testing one at a time. A loop that PORs 4 or 8 vectors before each PTEST would probably be faster. (PTEST is 2 uops, and can't macro-fuse with a jcc.)
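A sketch of that idea with intrinsics (the function name and the four-vectors-per-iteration grouping are mine; _mm_testz_si128 is SSE4.1's ptest, and this assumes the element count is a multiple of 16):

#include <immintrin.h>
#include <stddef.h>

// Returns 1 if every float in p[0..n) is +0.0 or -0.0, else 0.
int all_zero_floats(const float *p, size_t n) {
    const __m128i mask = _mm_set1_epi32(0x7fffffff);   // every bit except the sign bit
    for (size_t i = 0; i < n; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(p + i));
        v = _mm_or_si128(v, _mm_loadu_si128((const __m128i *)(p + i + 4)));
        v = _mm_or_si128(v, _mm_loadu_si128((const __m128i *)(p + i + 8)));
        v = _mm_or_si128(v, _mm_loadu_si128((const __m128i *)(p + i + 12)));
        if (!_mm_testz_si128(v, mask))                 // ptest: any non-sign bit set?
            return 0;
    }
    return 1;
}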

Implement SWAP in Forth

I saw that in an interview with Chuck Moore, he says:
The words that manipulate that stack are DUP, DROP and OVER period.
There's no, well SWAP is very convenient and you want it, but it isn't
a machine instruction.
So I tried to implement SWAP in terms of only DUP, DROP and OVER, but couldn't figure out how to do it, without increasing the stack at least.
How is that done, really?
You are right, it seems hard or impossible with just dup, drop, and over.
I would guess the i21 probably also has some kind of return stack manipulation, so this would work:
: swap over 2>r drop 2r> ;
Edit: On the GA144, which also doesn't have a native swap, it's implemented as:
over push over or or pop
Push and pop refer to the return stack, or is actually xor. See http://www.colorforth.com/inst.htm
In Standard Forth it is
: swap ( a b -- b a ) >r >r 2r> ;
or
: swap ( a b -- b a ) 0 rot nip ;
or
: swap ( a b -- b a ) 0 rot + ;
or
: swap ( a b -- b a ) 0 rot or ;
This remark of Charles Moore can easily be misunderstood, because it is in the context of his Forth processors. A SWAP is not a machine instruction for a hardware Forth processor. In general, in Forth some definitions are in terms of other definitions, but this ends with certain so-called primitives. In a Forth processor those are implemented in hardware, but in all Forth implementations on e.g. host systems or single-board computers they are implemented by a sequence of machine instructions, e.g. for Intel:
CODE SWAP pop, ax pop, bx push, ax push, bx END-CODE
He also uses the term "convenient", because SWAP is often avoidable. It is a situation where you need to handle two data items, but they are not in the order you want them. SWAP means a mental burden because you must imagine the stack content changed. One can often keep the stack straight by using an auxiliary stack to temporarily hold an item you don't need right now. Or if you need an item twice OVER is preferable. Or a word can be defined differently, with its parameters in a different order.
Going out of your way to implement SWAP in terms of 4 FORTH words instead of 4 machine instructions is clearly counterproductive, because those FORTH words each must have been implemented by a couple of machine instructions themselves.

Does Z3 have support for optimization problems

I saw in a previous post from last August that Z3 did not support optimizations.
However it also stated that the developers are planning to add such support.
I could not find anything in the source to suggest this has happened.
Can anyone tell me if my assumption that there is no support is correct, or was it added and I somehow missed it?
Thanks,
Omer
If your optimization has an integer valued objective function, one approach that works reasonably well is to run a binary search for the optimal value. Suppose you're solving the set of constraints C(x,y,z), maximizing the objective function f(x,y,z).
Find an arbitrary solution (x0, y0, z0) to C(x,y,z).
Compute f0 = f(x0, y0, z0). This will be your first lower bound.
As long as you don't know any upper bound on the objective value, try to solve the constraints C(x,y,z) ∧ f(x,y,z) > 2 * L, where L is your best lower bound (initially f0, then whatever better value you found). If the formula is satisfiable, update L from the model; if it is unsatisfiable, 2 * L is an upper bound U.
Once you have both an upper and a lower bound, apply binary search: solve C(x,y,z) ∧ 2 * f(x,y,z) > (U + L). If the formula is satisfiable, you can compute a new lower bound using the model. If it is unsatisfiable, (U + L) / 2 is a new upper bound.
Step 3. will not terminate if your problem does not admit a maximum, so you may want to bound it if you are not sure it does.
You should of course use push and pop to solve the succession of problems incrementally. You'll additionally need the ability to extract models for intermediate steps and to evaluate f on them.
We have used this approach in our work on Kaplan with reasonable success.
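For anyone who wants to try this, a sketch of the loop above using the Z3 C++ API (z3++.h); the constraints and the objective here are toy placeholders of my own, and it assumes the objective is a non-negative integer with a finite maximum:

#include "z3++.h"
#include <iostream>

int main() {
    z3::context ctx;
    z3::expr x = ctx.int_const("x");
    z3::expr y = ctx.int_const("y");
    z3::solver s(ctx);
    s.add(x >= 0 && y >= 0 && 3 * x + 4 * y <= 24);   // C(x, y): toy constraints
    z3::expr f = x + 2 * y;                           // objective f(x, y)

    if (s.check() != z3::sat) return 0;               // no solution at all
    int lo = s.get_model().eval(f, true).get_numeral_int();  // first lower bound

    // Phase 1: look for an upper bound by asking for f > 2*lo until that fails.
    int hi;
    for (;;) {
        s.push();
        s.add(f > 2 * lo);
        bool found = (s.check() == z3::sat);
        if (found) lo = s.get_model().eval(f, true).get_numeral_int();
        s.pop();
        if (!found) { hi = 2 * lo; break; }           // f > 2*lo is impossible
    }

    // Phase 2: binary search between lo and hi.
    while (lo < hi) {
        int mid = lo + (hi - lo + 1) / 2;
        s.push();
        s.add(f >= mid);
        bool found = (s.check() == z3::sat);
        if (found) lo = s.get_model().eval(f, true).get_numeral_int();
        s.pop();
        if (!found) hi = mid - 1;
    }
    std::cout << "maximum f = " << lo << "\n";
    return 0;
}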
Z3 currently does not support optimization. This is on the TODO list, but it has not been implemented yet. The following slide decks describe the approach that will be used in Z3:
Exact nonlinear optimization on demand
Computation in Real Closed Infinitesimal and Transcendental Extensions of the Rationals
The library for computing with infinitesimals has already been implemented, and is available in the unstable (work-in-progress) branch, and online at rise4fun.
