How to access full 128 bits in NEON instructions? - arm64

I recently wrote a program that does some floating point calculations in Arm64 Assembly.
Since the numbers I'm dealing with can become really tiny, I now want to optimise the code so that it uses as much precision as possible.
I found out the NEON engine has 128-bit floating point registers instead of the 64 bits I'm currently working with, so I searched for a way to use these for calculations. Every website I looked at tells me this should be possible, but when I try to do something like
fmul v0, v1, v2
I just get "error: invalid operand for instruction".
I'm using the M1 chip that should be capable of working with NEON instructions, and when I change it to
fmul v0.2d, v1.2d, v2.2d
there's no problem at all.
Does anyone have an idea what I'm doing wrong? Or is it just impossible to use all the 128 bits of these registers at once?

You can't.
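True, the NEON registers are 128 bits wide, but the widest floating-point element they hold is 64 bits. The .2d suffix tells the assembler to treat each register as two independent 64-bit doubles, which is why that form assembles and the bare register names do not.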
No consumer architecture known to me is capable of handling any 128-bit data type.
PS : Is there a quad data type to begin with? I'm curious.
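To make the "two lanes, not one wide number" point concrete, here is a minimal sketch in plain C. It only models the behaviour; the vec2d type and fmul_2d function are hypothetical names made up for illustration, not NEON intrinsics.

```c
#include <stddef.h>

/* Hypothetical model of one 128-bit NEON register viewed as .2d:
   two independent 64-bit double-precision lanes. */
typedef struct {
    double lane[2];
} vec2d;

/* Rough equivalent of `fmul v0.2d, v1.2d, v2.2d`: each lane is multiplied
   on its own, so the result is never one combined 128-bit number. */
vec2d fmul_2d(vec2d a, vec2d b)
{
    vec2d r;
    for (size_t i = 0; i < 2; ++i) {
        r.lane[i] = a.lane[i] * b.lane[i];
    }
    return r;
}
```

Each lane keeps ordinary double precision; the extra register width buys throughput, not extra digits.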

Related

OpenCL code behavior is different for AMD vs NVIDIA cards

I have a constant at the top of my code...
__constant uint uintmaxx = (uint)( (((ulong)1)<<32) - 1 );
It compiles fine on AMD and NVIDIA OpenCL compilers... then executes.
(correct) on ATI cards, returns... 4294967295 or (all 32 bits = 1)
(wrong) on NVIDIA cards, returns... 2147483648 or (only the 32nd bit = 1)
I also tried -1 + 1<<32 and it worked on ATI but not NVIDIA.
What gives? Am I just missing something?
While I'm on the topic of OpenCL compiler differences, does anyone know a good resource that lists the compiler differences between AMD and NVIDIA?
OpenCL conveniently provides that for you already. You can use the predefined UINT_MAX in your kernel code and the implementation will guarantee that it holds the correct value.
However, there is also nothing wrong with the method you use. The spec guarantees that uint is 32 bits and ulong is 64 bits, that ints are two's complement, and that everything not explicitly mentioned works exactly as written in the C99 spec.
Even just this should work and give you the correct result:
uint uintmaxx = -1;
It seems that NVIDIA just has a broken compiler; if not, I really hope I'll be corrected on the issue. The really odd part is how on earth the 32nd bit ends up being 1. A shift to the left by 32 moves the original bit to the 33rd place, so what places a bit in the 32nd spot? The only explanation I can think of is that they don't respect operator ordering at all and transform the formula into (ulong)1 << (32-1) or something like that.
You should probably file a bug report. But to be frank, considering that they hate OpenCL as much as Microsoft hates OpenGL, if not even more, I wouldn't anticipate fast response times.
I fully agree with @sharpneli's answer. But just try this:
__constant uint uintmaxx = -1;
And like sharpneli said, use the UINT_MAX macro, it is the safer way.
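For comparison, here is how the same expressions behave in plain C with fixed-width types. This is only a sketch on the host side, assuming uint32_t and uint64_t stand in for OpenCL's uint and ulong; the function names are made up for illustration.

```c
#include <stdint.h>

/* The expression from the question, spelled with fixed-width C types:
   1 is widened to 64 bits before the shift, so (1 << 32) - 1 is
   0x00000000FFFFFFFF, and the cast keeps the low 32 bits. */
uint32_t max_by_shift(void)
{
    return (uint32_t)((((uint64_t)1) << 32) - 1);
}

/* Converting -1 to an unsigned 32-bit type is defined to wrap modulo 2^32. */
uint32_t max_by_wraparound(void)
{
    return (uint32_t)-1;
}
```

Both functions should agree with UINT32_MAX from <stdint.h>, which is the host-side counterpart of the UINT_MAX suggestion above.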

Fast way to swap endianness using opencl

I'm reading and writing lots of FITS and DNG images which may contain data of an endianness different from my platform and/or opencl device.
Currently I swap the byte order in the host's memory if necessary which is very slow and requires an extra step.
Is there a fast way to pass a buffer of int/float/short having the wrong endianness to an OpenCL kernel?
Using an extra kernel run just for fixing the endianness would be ok; using some overhead-free auto-fixing read/write operation would be perfect.
I know about the variable attribute __attribute__((endian(host/device))) but this doesn't help with a big endian FITS file on a little endian platform using a little endian device.
I thought about a solution like this one (neither implemented nor tested, yet):
uint4 mask = (uint4) (3, 2, 1, 0);
uchar4 swappedEndianness = shuffle(originalEndianness, mask);
// to be applied on a float/int-buffer somehow
Hoping there's a better solution out there.
Thanks in advance,
runtimeterror
Sure. Since you have a uchar4 - you can simply swizzle the components and write them back.
output[tid] = input[tid].wzyx;
Swizzling is also very cheap on SIMD architectures, so you should be able to combine it with other operations in your kernel.
Hope this helps!
Most processor architectures perform best when an operation uses instructions that match the register width, for example 32 or 64 bits. When a CPU/GPU performs byte-wise operations such as the .wzyx swizzle on a uchar4, it needs to use a mask to retrieve each byte from the integer, shift the byte, and then combine the results with an integer add or OR. For an endianness swap, the processor has to perform this AND, shift, and add/OR sequence four times because there are four bytes.
The most efficient way is as follows
#define EndianSwap(n) (rotate((n) & 0x00FF00FF, 24U) | (rotate((n), 8U) & 0x00FF00FF))
n can be any integer gentype with 32-bit elements, for example a uint4 variable. Because OpenCL C does not allow C++-style overloading, a macro is the best choice.
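The rotate-and-mask trick can be checked on the host with a small C sketch. It assumes a hand-written rotl32 helper standing in for OpenCL's built-in rotate(); both function names are made up for illustration.

```c
#include <stdint.h>

/* Rotate left by k bits, the role OpenCL's built-in rotate() plays. */
uint32_t rotl32(uint32_t x, unsigned k)
{
    k &= 31u;
    return (x << k) | (x >> ((32u - k) & 31u));
}

/* Two rotates and two masks swap all four bytes of a 32-bit word:
   the first term moves bytes 0 and 2, the second moves bytes 1 and 3. */
uint32_t endian_swap32(uint32_t n)
{
    return rotl32(n & 0x00FF00FFu, 24u) | (rotl32(n, 8u) & 0x00FF00FFu);
}
```

For example, 0x11223344 becomes 0x44332211, and applying the swap twice returns the original value.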

How will 64 bit variable be referenced in a 32 bit process?

I have a 64-bit kernel and I run 32-bit processes in userland. In the user process code, if I declare a 64-bit variable, how will it be referenced? Will it incur 2 memory reads?
Basically the scenario is:
I need to use a 64 bit mask in my user process.
Approach 1 :
-> Use a u64bits variable.
Approach 2 :
-> Use an array of 2 32-bit variables.
First off: the kernel has no bearing on the answer to this question.
Second, I assume this is x86 you're talking about. Where possible, the compiler will place 64-bit values across 2 32-bit registers. For example, if you return a uint64_t from a function, the low 32 bits will be stored in the eax register, and the high bits will be in edx.
The compiler will generally do the right thing for performance and correctness: using an array will likely just confuse it and lead to worse results.
By the way, x86-64 CPUs will normally perform reads of 2 adjacent 32-bit words at the same speed as a single 64-bit read. The advantages of 64-bit mode are that arithmetic can be done directly on 64-bit values (1 64x64 multiplication instruction vs 3-4 32x32 instructions), there is much more space available in registers (16 registers instead of 8, registers are twice as wide), and of course the larger possible virtual address space.
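To see what the two approaches from the question look like side by side, here is a minimal C sketch. All function names are made up for illustration, and both versions assume bit is in the range 0..63.

```c
#include <stdint.h>
#include <stdbool.h>

/* Approach 1: one 64-bit variable; the compiler splits it across
   registers or memory words as needed on a 32-bit target. */
uint64_t mask64_set(uint64_t mask, unsigned bit)
{
    return mask | ((uint64_t)1 << bit);
}

bool mask64_test(uint64_t mask, unsigned bit)
{
    return (mask >> bit) & 1u;
}

/* Approach 2: two 32-bit words, with the word index and shift
   computed by hand. */
void mask2x32_set(uint32_t words[2], unsigned bit)
{
    words[bit / 32u] |= (uint32_t)1 << (bit % 32u);
}

bool mask2x32_test(const uint32_t words[2], unsigned bit)
{
    return (words[bit / 32u] >> (bit % 32u)) & 1u;
}
```

Both give the same bit pattern; the first simply leaves the splitting to the compiler, which is the point made above.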

Inverse Mappings

Saying right now: Yes, this is homework. I'm not asking for an answer, but I would love any help pointing me in a general direction for this problem. I've been working on it for hours now and have not made any real progress.
Can a function with a well-defined inverse be implemented to map 32-bit integers to 64-bit integers? Do all functions from 32-bit to 64-bit integers have well-defined inverses?
Of course not.
Take the identity function for example. Every 32-bit value has a counterpart in the 64-bit value space (just use 0 in the top 32 bits, using only the bottom 32 bits for the value). However, any 64-bit value where the top 32 bits are not 0 will not have a corresponding value in the 32-bit value space.
The above is a layman's explanation, and is probably not rigorous enough as a homework solution (as intended). You'd do well to read up on the pigeonhole principle.
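The zero-extension example can be tried out in a few lines of C. This is only an illustrative sketch; embed and project are hypothetical names for the widening and truncating conversions.

```c
#include <stdint.h>

/* Widen a 32-bit value by putting 0 in the top 32 bits. */
uint64_t embed(uint32_t x)
{
    return (uint64_t)x;
}

/* Keep only the bottom 32 bits of a 64-bit value. */
uint32_t project(uint64_t y)
{
    return (uint32_t)y;
}
```

Going 32 -> 64 -> 32 always returns the starting value, but going 64 -> 32 -> 64 loses anything stored in the top 32 bits, so 0x100000005 and 5 both come back as 5.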

What's the deal with 17- and 40-bit math in TI DSPs?

The TMS320C55x has a 17-bit MAC unit and a 40-bit accumulator. Why the non-power-of-2-width units?
The 40-bit accumulator is common in a few TI DSPs. The idea is basically that you can accumulate up to 256 arbitrary 32-bit products without overflow. (vs. in C where if you take a 32-bit product, you can overflow fairly quickly unless you resort to using 64-bit integers.)
The only way you access these features is by assembly code or special compiler intrinsics. If you use regular C/C++ code, the accumulator is invisible. You can't get a pointer to it.
So there's not any real need to adhere to a power-of-2 scheme. DSP cores have been fairly optimized for power/performance tradeoffs.
I may be talking through my hat here, but I'd expect to see the 17-bit stuff used to avoid the need for a separate carry bit when adding/subtracting 16-bit samples.
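The headroom argument for the 40-bit accumulator can be checked with a little arithmetic in C. This is a sketch only: int64_t stands in for the hardware accumulator, and the function names are made up for illustration rather than taken from any TI intrinsic.

```c
#include <stdint.h>
#include <stdbool.h>

/* True if v is representable as a signed 40-bit value,
   i.e. in the range [-2^39, 2^39 - 1]. */
bool fits_in_40_bits(int64_t v)
{
    const int64_t limit = (int64_t)1 << 39;
    return v >= -limit && v < limit;
}

/* Add the same 32-bit product n times into a wide accumulator. */
int64_t accumulate(int32_t product, int n)
{
    int64_t acc = 0;
    for (int i = 0; i < n; ++i) {
        acc += product;
    }
    return acc;
}
```

With the extra 8 guard bits, 256 worst-case 32-bit products still fit, while the 257th copy of INT32_MAX would not; a plain 32-bit accumulator already overflows on the second one.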
