Scatter/Gather in Xeon Phi - vectorization

I was reading Intel's manual on the Xeon Phi instruction set and wasn't able to understand how the scatter/gather instructions work.
Suppose I have the following vector of doubles:
A-> |b4|a4|b3|a3|b2|a2|b1|a1|
Is it possible to create 4 vectors as follows:
V1->|b1|a1|b1|a1|b1|a1|b1|a1|
V2->|b2|a2|b2|a2|b2|a2|b2|a2|
V3->|b3|a3|b3|a3|b3|a3|b3|a3|
V4->|b4|a4|b4|a4|b4|a4|b4|a4|
using these instructions? Is there any other way to achieve this?

Got this from the Intel Forums (answered by Evgueni Petrov):
__m512d V1 = (__m512d)_mm512_extload_epi32(Addr, _MM_UPCONV_EPI32_NONE, _MM_BROADCAST_4X16, _MM_HINT_NONE);
where Addr is a pointer to the location in memory from which we loaded the doubles into vector A. The _MM_BROADCAST_4X16 mode loads four 32-bit elements (i.e. one a/b pair of doubles) and repeats them across all sixteen 32-bit lanes.
We can do the same for V2, V3 and V4 by passing Addr+2, Addr+4 and Addr+6 respectively (with Addr being a double*).
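In scalar terms, the broadcast load does the following. This is a plain C++ sketch of the semantics only (the function name is ours; no KNC intrinsics are involved):

```cpp
#include <array>

// A holds the pairs interleaved, low element first: a1,b1,a2,b2,a3,b3,a4,b4.
// broadcast_pair(A, k) models the _MM_BROADCAST_4X16 load anchored at pair k
// (k = 0..3): the pair (a, b) at that offset is repeated four times across
// the eight-double vector, which is what the extload produces for V1..V4.
std::array<double, 8> broadcast_pair(const std::array<double, 8>& A, int k) {
    std::array<double, 8> v{};
    for (int i = 0; i < 8; i += 2) {
        v[i]     = A[2 * k];      // a element of pair k
        v[i + 1] = A[2 * k + 1];  // b element of pair k
    }
    return v;
}
```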

Related

Efficient way of copying between std::complex vector and Intel IPP complex array

I'm using Intel IPP for signal processing. The top-level functions use std::vector<std::complex<float>> data types, whereas the Intel IPP equivalent is Ipp32fc[]. The Ipp32fc data type is defined as
typedef struct {
Ipp32f re;
Ipp32f im;
} Ipp32fc;
From what I know, the Ipp32f data type is simply a C/C++ float. So far I have been using a for loop for copying, and it squeezes the processor a lot considering the symbol rate I'm processing. I have tried standard memcpy without much luck.
All suggestions are welcomed.
There is a function named ippsCopy_32f which copies the contents of one vector to another. You could try using it for the copy and see if it helps. The link below gives more details on the relevant functions, which are listed under the Vector Initialization Functions section of the IPP developer reference guide.
https://www.intel.com/content/www/us/en/develop/documentation/ipp-dev-reference/top/volume-1-signal-and-data-processing/vector-initialization-functions/vector-initialization-functions-1/copy.html
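Since std::complex<float> is required by the C++ standard to be layout-compatible with float[2] (re at offset 0, im right after), the copy can in principle be a single memcpy. Below is a sketch with a local stand-in struct instead of the real IPP headers (Ipp32fc_like and the function names are ours):

```cpp
#include <complex>
#include <cstring>
#include <vector>

// Stand-in for IPP's Ipp32fc: two packed floats, re first, im second.
struct Ipp32fc_like { float re; float im; };

// One memcpy moves the whole interleaved buffer; with real IPP,
// ippsCopy_32f over 2*n floats (or ippsCopy_32fc) does the same job.
void copy_to_ipp(const std::vector<std::complex<float>>& src, Ipp32fc_like* dst) {
    std::memcpy(dst, src.data(), src.size() * sizeof(Ipp32fc_like));
}

void copy_from_ipp(const Ipp32fc_like* src, std::size_t n,
                   std::vector<std::complex<float>>& dst) {
    dst.resize(n);
    std::memcpy(dst.data(), src, n * sizeof(Ipp32fc_like));
}
```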

How to access full 128 bits in NEON instructions?

I recently wrote a program that does some floating point calculations in Arm64 Assembly.
Since the numbers I'm dealing with can become really tiny, I now want to optimise the code so that it uses as much precision as possible.
I found out the NEON engine has 128-bit floating point registers instead of the 64-bit ones I'm currently working with, so I searched for a way to use these for calculations. Every website I looked at tells me this should be possible, but when I try something like
fmul v0, v1, v2
I just get "error: invalid operand for instruction".
I'm using the M1 chip that should be capable of working with NEON instructions, and when I change it to
fmul v0.2d, v1.2d, v2.2d
there's no problem at all.
Does anyone have an idea what I'm doing wrong? Or is it just impossible to use all the 128 bits of these registers at once?
You can't.
True, the NEON registers are 128 bits wide, but the maximum element width is 64 bits: v0.2d means two 64-bit doubles processed in parallel, which is also why the assembler rejects the bare form (vector instructions always need an arrangement specifier). No consumer architecture known to me can handle a 128-bit floating-point type in hardware.
PS : Is there a quad data type to begin with? I'm curious.

Does Z3 have support for optimization problems

I saw in a previous post from last August that Z3 did not support optimizations.
However it also stated that the developers are planning to add such support.
I could not find anything in the source to suggest this has happened.
Can anyone tell me if my assumption that there is no support is correct, or was it added and I somehow missed it?
Thanks,
Omer
If your optimization has an integer valued objective function, one approach that works reasonably well is to run a binary search for the optimal value. Suppose you're solving the set of constraints C(x,y,z), maximizing the objective function f(x,y,z).
Find an arbitrary solution (x0, y0, z0) to C(x,y,z).
Compute f0 = f(x0, y0, z0). This will be your first lower bound.
As long as you don't know an upper bound on the objective value, try to solve the constraints C(x,y,z) ∧ f(x,y,z) > 2 * L, where L is your best lower bound so far (initially f0, then whatever better value you find). Note that doubling only drives L upward once it is positive; while L ≤ 0 you can grow it additively instead (e.g. f(x,y,z) > L + 1).
Once you have both an upper bound U and a lower bound L, apply binary search: solve C(x,y,z) ∧ 2 * f(x,y,z) > (U + L). If the formula is satisfiable, you can compute a new lower bound from the model; if it is unsatisfiable, (U + L) / 2 is a new upper bound.
Step 3 will not terminate if your problem does not admit a maximum, so you may want to bound the search if you are not sure it does.
You should of course use push and pop to solve the succession of problems incrementally. You'll additionally need the ability to extract models for intermediate steps and to evaluate f on them.
We have used this approach in our work on Kaplan with reasonable success.
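The doubling-then-bisection scheme above only needs a black-box "is there a solution with f > bound?" query. In the sketch below, a brute-force search over a toy problem (C(x): x*x <= 1000, f(x) = x) stands in for the SMT solver; with real Z3 you would push the extra constraint, check, evaluate f on the model, and pop:

```cpp
#include <optional>

// check(bound) models one solver query "solve C(x) ∧ f(x) > bound":
// it returns a witness objective value if one exists (brute force here,
// purely illustrative; this is not Z3's API).
std::optional<long> check(long bound) {
    for (long x = -100; x <= 100; ++x)
        if (x * x <= 1000 && x > bound)  // C(x): x^2 <= 1000, f(x) = x
            return x;
    return std::nullopt;
}

// Steps 1-4: find any solution, grow an upper bound until "f > hi" is
// unsatisfiable, then binary-search the interval [lo, hi].
long maximize() {
    long lo = *check(-1000000);          // steps 1-2: first lower bound f0
    long hi = lo < 1 ? 1 : 2 * lo;       // step 3: candidate upper bound
    while (auto w = check(hi)) { lo = *w; hi *= 2; }
    while (lo < hi) {                    // step 4: bisection
        long mid = lo + (hi - lo + 1) / 2;
        if (auto w = check(mid - 1)) lo = *w;  // some f >= mid exists
        else hi = mid - 1;                     // unsat, so max <= mid - 1
    }
    return lo;                           // lo == hi == maximum of f
}
```

Each iteration tightens the interval, so the number of solver calls is logarithmic in the range of the objective.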
Z3 currently does not support optimization. This is on the TODO list, but it has not been implemented yet. The following slide decks describe the approach that will be used in Z3:
Exact nonlinear optimization on demand
Computation in Real Closed Infinitesimal and Transcendental Extensions of the Rationals
The library for computing with infinitesimals has already been implemented, and is available in the unstable (work-in-progress) branch, and online at rise4fun.

Difference in output of Intel IPP fir and convolution functions

I am working with Intel IPP 7.1 (Composer XE 2013) and noticed a difference in the tail end of the output samples between the IPP 'fir' and 'convolution' calls.
So in the calls below
status = ippsFIR_Direct_64f(pSrc, pDst_f, N+M-1, pTaps,M, pDlyLine,&pDlyLineIndex);
status = ippsConv_64f(pSrc, N, pTaps, M, pDst);
with M=7, N=11, pDlyLine initialized to all zeros, and everything else the same:
pDst_f and pDst differ in the last three indices, i.e. pDst_f[k] != pDst[k] for k = 14, 15, 16.
I expected them to be exactly equal with the third parameter (number of iterations) = N+M-1 in the fir call. Any ideas?
It looks like there is indeed a problem with the Intel IPP fir function; see this thread on the Intel Developer site:
http://software.intel.com/en-us/forums/topic/331143
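The expectation is right: with a zero-initialized delay line and the input zero-padded out to N+M-1 samples, a direct-form FIR computes exactly the full linear convolution, so every index including the tail should match. A plain C++ sketch of both computations (no IPP; names are ours):

```cpp
#include <cstddef>
#include <vector>

// Full linear convolution: out[k] = sum_j taps[j] * src[k-j],
// output length N + M - 1 (what ippsConv_64f computes).
std::vector<double> conv_full(const std::vector<double>& src,
                              const std::vector<double>& taps) {
    std::vector<double> out(src.size() + taps.size() - 1, 0.0);
    for (std::size_t i = 0; i < src.size(); ++i)
        for (std::size_t j = 0; j < taps.size(); ++j)
            out[i + j] += src[i] * taps[j];
    return out;
}

// Direct-form FIR with a zero delay line, run for N + M - 1 samples with the
// input zero-padded at the tail (the last M-1 iterations only flush the
// delay line), mirroring the ippsFIR_Direct_64f call above.
std::vector<double> fir_zero_state(const std::vector<double>& src,
                                   const std::vector<double>& taps) {
    std::size_t n = src.size(), m = taps.size();
    std::vector<double> out(n + m - 1, 0.0);
    for (std::size_t k = 0; k < n + m - 1; ++k)
        for (std::size_t j = 0; j < m; ++j)
            if (k >= j && k - j < n)
                out[k] += taps[j] * src[k - j];
    return out;
}
```

In exact arithmetic the two outputs are identical for any src/taps, so a tail mismatch between the library calls points at the fir implementation, as the linked thread confirms.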

What's the deal with 17- and 40-bit math in TI DSPs?

The TMS320C55x has a 17-bit MAC unit and a 40-bit accumulator. Why the non-power-of-2-width units?
The 40-bit accumulator is common in a few TI DSPs. The idea is basically that you can accumulate up to 256 arbitrary 32-bit products without overflow (vs. C, where a 32-bit product can overflow fairly quickly unless you resort to 64-bit integers).
The only way you access these features is by assembly code or special compiler intrinsics. If you use regular C/C++ code, the accumulator is invisible. You can't get a pointer to it.
So there's not any real need to adhere to a power-of-2 scheme. DSP cores have been fairly optimized for power/performance tradeoffs.
I may be talking through my hat here, but I'd expect to see the 17-bit stuff used to avoid the need for a separate carry bit when adding/subtracting 16-bit samples.
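The headroom arithmetic behind "up to 256 products" can be checked directly: a 40-bit accumulator has 8 guard bits over a 32-bit product, and 2^8 = 256 worst-case products sum to exactly -2^39, the bottom of the signed 40-bit range. A sketch modelling the accumulator in int64_t (illustrative only, not TI's API):

```cpp
#include <cstdint>

// Signed 40-bit range: [-2^39, 2^39 - 1].
constexpr int64_t ACC40_MIN = -(int64_t{1} << 39);
constexpr int64_t ACC40_MAX = (int64_t{1} << 39) - 1;

bool fits_in_40_bits(int64_t v) { return v >= ACC40_MIN && v <= ACC40_MAX; }

// Sum n copies of the most negative 32-bit product (-2^31); 256 of them
// land exactly on ACC40_MIN, and one more overflows the 40-bit range.
int64_t accumulate_worst_case(int n) {
    int64_t acc = 0;
    for (int i = 0; i < n; ++i)
        acc += static_cast<int64_t>(INT32_MIN);
    return acc;
}
```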
