Efficient way to extract from an SSE vector on AMD processors

I'm looking for an efficient way to extract the lower 64-bit integer from an __m128i on AMD Piledriver. Something like this:
static inline int64_t extractlo_64(__m128i x)
{
    int64_t result;
    // extract into result
    return result;
}
The instruction tables say that the common approach - using _mm_extract_epi64() - is inefficient on this processor: it generates the PEXTRQ instruction, which has a latency of 10 cycles on Piledriver (compared to 2-3 cycles on Intel processors).
Is there any better way to do this?

On x86-64 you can use _mm_cvtsi128_si64, which translates to a single MOVQ r64, xmm instruction.
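As a sketch (this intrinsic is SSE2, so <emmintrin.h> suffices):

#include <emmintrin.h>
#include <stdint.h>

static inline int64_t extractlo_64(__m128i x)
{
    return _mm_cvtsi128_si64(x); // compiles to a single MOVQ r64, xmm
}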

One possibility might be to use MOVDQ2Q, which has a latency of 2 cycles on Piledriver:
static inline int64_t extractlo_64(const __m128i v)
{
    return _m_to_int64(_mm_movepi64_pi64(v)); // MOVDQ2Q + MOVQ
}


How to detect a Xeon Phi (Knights Landing)

Intel engineers wrote that we should use VZEROUPPER/VZEROALL to avoid a costly transition to the non-VEX state on all processors, including future Xeon processors, but not on Xeon Phi: https://software.intel.com/pt-br/node/704023
People have also measured and found that VZEROUPPER and VZEROALL are expensive on Knights Landing:
36 clock cycles for both instructions in 64-bit mode (30 clock cycles in 32-bit mode).
See the above link.
So my code would be the following, if I have only used ymm0 and ymm1:
if [we are running on a Xeon Phi]
    vpxor ymm0,ymm0,ymm0
    vpxor ymm1,ymm1,ymm1
else
    vzeroall
endif
How can I detect Xeon Phi (Knights Landing and later Xeon Phi processors) to implement the above code?
We now have the following situation regarding VZEROUPPER/VZEROALL:
These instructions are not needed, and are very costly, on Xeon Phi Knights Landing: 36 clock cycles for both instructions in 64-bit mode (30 clock cycles in 32-bit mode).
These instructions are very cheap and are needed on Xeon and Core processors (Skylake/Kaby Lake), and will be needed on Xeon for the foreseeable future, to avoid the costly transition to the non-VEX state.
The advertising materials claim that Xeon Phi (Knights Landing) is fully compatible with other Xeon processors.
Is there a reliable way to detect Xeon Phi, for the purpose of avoiding VZEROUPPER/VZEROALL?
There is an article, "How to detect Knights Landing AVX-512 support (Intel® Xeon Phi™ processor)" by James R., updated February 22, 2016, but it only focuses on the specific new instructions that became available on Knights Landing. So it is still not very clear about the VEX transitions.
It would be good to know whether Intel plans to implement a CPUID bit that shows whether non-VEX state transitions are costly. For example:
Bit set to 0 - VEX state transitions are costly, but VZEROUPPER/VZEROALL are cheap and should be used to clear the state;
Bit set to 1 - there is no transition penalty, and VZEROUPPER/VZEROALL are not needed.
The above-mentioned article about detecting Knights Landing suggests checking the AVX-512F+CD+ER+PF bits introduced in Knights Landing.
So the code checks all these bits at once, and if all are set, then we are on Knights Landing:
uint32_t avx2_bmi12_mask = (1 << 16) | // AVX-512F
                           (1 << 26) | // AVX-512PF
                           (1 << 27) | // AVX-512ER
                           (1 << 28);  // AVX-512CD
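A sketch of testing that mask, assuming GCC/Clang's <cpuid.h> (real code should first confirm that leaf 7 exists, e.g. via __get_cpuid_max):

#include <cpuid.h>
#include <stdint.h>

static int looks_like_knl(void)
{
    uint32_t eax, ebx, ecx, edx;
    const uint32_t knl_mask = (1u << 16) | /* AVX-512F  */
                              (1u << 26) | /* AVX-512PF */
                              (1u << 27) | /* AVX-512ER */
                              (1u << 28); /* AVX-512CD */
    __cpuid_count(7, 0, eax, ebx, ecx, edx); /* leaf 7, sub-leaf 0: these flags are in EBX */
    return (ebx & knl_mask) == knl_mask;
}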
It would also be good to know whether Intel plans to add all these bits to plain Xeon (non-Phi) or Core processors in the near future, so that they will also support the AVX-512F+CD+ER+PF features introduced in Knights Landing.
In case Xeon and Core processors do come to support AVX-512F+CD+ER+PF, we won't be able to distinguish Xeon from Xeon Phi.
Please advise.
If you specifically want to check for being on a KNL (rather than the more general "Does the CPU I am running on have feature X?") you can do that by looking at the "Extended Family", "Family" and "Model" fields in %eax after calling cpuid with %eax == 1 and %ecx == 0. C++ code like that shown below will do the job.
However, as others are implicitly pointing out, this is a very specific test, and will, for instance, fail on future Knights cores, so you would likely be better off doing as has been suggested and checking for AVX-512 features that are not in Xeon, i.e. AVX512-ER and AVX512-PF. (Of course, such instructions could appear in future Xeons, so this is not guaranteed in the long term, but, quoting Keynes: "In the long run we are all dead" :-))
#include <cstdint>

class cpuidState
{
    uint32_t orig_eax; /* Values sent in to the cpuid instruction */
    uint32_t orig_ecx;
    uint32_t eax;      /* Values received back from it. */
    uint32_t ebx;
    uint32_t ecx;
    uint32_t edx;

    void cpuid()
    {
        __asm__ __volatile__("cpuid"
                             : "+a" (eax), "=b" (ebx), "+c" (ecx), "=d" (edx));
    }
    void update (uint32_t eaxVal, uint32_t ecxVal)
    {
        orig_eax = eaxVal;
        orig_ecx = ecxVal;
        eax = eaxVal;
        ecx = ecxVal;
        cpuid();
    }
    void ensureCorrectLeaf(uint32_t eaxVal, uint32_t ecxVal)
    {
        if (orig_eax != eaxVal || orig_ecx != ecxVal)
            update (eaxVal, ecxVal);
    }
public:
    cpuidState() : orig_eax (-1), orig_ecx(-1) { }
    // Include the Extended Model in the test. Without it we see some Xeons as KNL :-(
    bool onKNL() { ensureCorrectLeaf(1,0); return (eax & 0x0f0ff0) == 0x50670; }
};
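Usage is then simply:

cpuidState cpu;
if (cpu.onKNL())
{
    // on KNL: zero the ymm registers with vpxor instead of vzeroall
}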

How to declare local memory in OpenCL?

I'm running the OpenCL kernel below with a two-dimensional global work size of 1000000 x 100 and a local work size of 1 x 100.
__kernel void myKernel(
        const int length,
        const int height,
        and a bunch of other parameters) {

    //declare some local arrays to be shared by all 100 work items in this group
    __local float LP [length];
    __local float LT [height];
    __local int bitErrors = 0;
    __local bool failed = false;

    //here come my actual computations which utilize the space in LP and LT
}
This however refuses to compile, since the parameters length and height are not known at compile time. But it is not at all clear to me how to do this correctly. Should I use pointers with memalloc? How do I handle this in such a way that the memory is only allocated once for the entire workgroup, and not once per work item?
All that I need is 2 arrays of floats, 1 int and 1 boolean that are shared among the entire workgroup (so all 100 work items). But I fail to find any method that does this correctly...
It's relatively simple, you can pass the local arrays as arguments to your kernel:
kernel void myKernel(const int length, const int height, local float* LP,
                     local float* LT, a bunch of other parameters)
You then set the kernel argument with a value of NULL and a size equal to the size you want to allocate for the argument (in bytes). Therefore it should be:
clSetKernelArg(kernel, 2, length * sizeof(cl_float), NULL);
clSetKernelArg(kernel, 3, height * sizeof(cl_float), NULL);
Local memory is always shared by the workgroup (as opposed to private memory), so I think the bool and int should be fine, but if not you can always pass those as arguments too.
Not really related to your problem (and not necessarily relevant, since I do not know what hardware you plan to run this on), but at least GPUs don't particularly like work sizes which are not a multiple of a particular power of two (I think it was 32 for NVIDIA and 64 for AMD), meaning that this will probably create workgroups of 128 items, of which the last 28 are basically wasted. So if you are running OpenCL on a GPU, it might help performance if you directly use workgroups of size 128 (and change the global work size appropriately).
As a side note: I never understood why everyone uses the underscore variant for kernel, local and global; it seems much uglier to me.
You could also declare your arrays like this:
__local float LP[LENGTH];
And pass the LENGTH as a define in your kernel compile.
int lp_size = 128; // this is an example; could be dynamically calculated
char compileArgs[64];
sprintf(compileArgs, "-DLENGTH=%d", lp_size);
clBuildProgram(program, 0, NULL, compileArgs, NULL, NULL);
You do not have to allocate all your local memory outside the kernel, especially when it is a simple variable instead of an array.
The reason that your code cannot compile is that OpenCL does not support local memory initialization. This is specified in the documentation (https://www.khronos.org/registry/cl/sdk/1.1/docs/man/xhtml/local.html). It is also not feasible in CUDA (see: Is there a way of setting default value for shared memory array?).
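If you still want the scalars in local memory, the usual workaround is to initialise them at run time from a single work item and then synchronise; here is a sketch using the names from the question:

__kernel void myKernel(const int length, const int height,
                       __local float *LP, __local float *LT)
{
    __local int bitErrors;
    __local bool failed;

    if (get_local_id(0) == 0 && get_local_id(1) == 0) {
        bitErrors = 0;   // one work item writes the initial values...
        failed = false;
    }
    barrier(CLK_LOCAL_MEM_FENCE); // ...and the barrier makes them visible to the group

    // actual computations go here
}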
PS: The answer from Grizzly is good enough, and it would have been better if I could post this as a comment, but I am restricted by the reputation policy. Sorry.

C stream buffer

I am using C and need a stream buffer mechanism that I can write arbitrary bytes to and read bytes from. I would prefer something that is platform independent (or that can at least run on OS X and Linux). Is anyone aware of any permissive lightweight libraries or code that I can drop in?
I've used the buffers within libevent and I may end up going that route, but it seems overkill to have libevent as a dependency when I don't do any sort of event-based IO.
If you don't mind depending on C++ and possibly some bits of STL, you can use std::stringstream. It shouldn't be too difficult to write a thin C wrapper around it.
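A minimal sketch of such a wrapper (the cbuf_* names are invented for illustration):

// stream_buf.cpp - compile as C++, expose a C API
#include <sstream>
#include <cstddef>

extern "C" {

void *cbuf_create(void)     { return new std::stringstream; }
void  cbuf_destroy(void *b) { delete static_cast<std::stringstream *>(b); }

void cbuf_write(void *b, const void *data, size_t n)
{
    static_cast<std::stringstream *>(b)->write(static_cast<const char *>(data), n);
}

// returns the number of bytes actually read
size_t cbuf_read(void *b, void *out, size_t n)
{
    std::stringstream *ss = static_cast<std::stringstream *>(b);
    ss->read(static_cast<char *>(out), n);
    return static_cast<size_t>(ss->gcount());
}

} // extern "C"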
Is setbuf(3) (and its aliases) the 'mechanism' you are searching for?
Please consider the following example:
#include <stdio.h>

int main()
{
    char buf[256];
    setbuffer(stderr, buf, 256);          /* stderr now buffers into buf      */
    fprintf(stderr, "Error: no more oxygen.\n");
    buf[1] = 'R';                         /* patch the buffered text...       */
    buf[2] = 'R';
    buf[3] = 'O';
    buf[4] = 'R';
    fflush(stderr);                       /* ...and "ERROR: ..." is printed   */
}

What's the best way to load 2 unaligned 64-bit values into an SSE register with SSSE3?

There are 2 pointers to 2 unaligned 8-byte chunks to be loaded into an xmm register. If possible, using intrinsics. And if possible, without using an auxiliary register. Without pinsrd. (SSSE3, Core 2)
From the MSVC specs, it looks like you can do the following:
__m128d xx; // an uninitialised xmm register
xx = _mm_loadh_pd(xx, ptra); // load the higher 64 bits from (unaligned) ptra
xx = _mm_loadl_pd(xx, ptrb); // load the lower 64 bits from (unaligned) ptrb
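Wrapped up as a self-contained helper, that might look like the sketch below (load2x64 is a made-up name; _mm_load_sd fills the low half first, so the compiler never reads an uninitialised register):

#include <emmintrin.h>

// low 64 bits from ptrb, high 64 bits from ptra; both may be unaligned
static inline __m128d load2x64(const double *ptra, const double *ptrb)
{
    __m128d lo = _mm_load_sd(ptrb); // MOVSD: loads the low half, zeroes the rest
    return _mm_loadh_pd(lo, ptra);  // MOVHPD: fills the high half
}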
Loading from unaligned storage is (in my experience) very much slower than loading from aligned pointers, so you probably wouldn't want to be doing this type of operation too often - if you really want high performance.
Hope this helps.
Unaligned access is much slower than aligned access (at least pre-Nehalem); you may get better speed by loading the aligned 128-bit words that contain the desired unaligned 64-bit words, then shuffling them to make the result you want.
This assumes:
you have memory read access to the full 128-bit words
the 64-bit words are aligned on at least 32-bit boundaries
e.g. (not tested)
int aoff = (uintptr_t)ptra & 15;  /* offset within the aligned 16-byte block */
int boff = (uintptr_t)ptrb & 15;
__m128 va = _mm_load_ps( (const float*)((const char*)ptra - aoff) );
__m128 vb = _mm_load_ps( (const float*)((const char*)ptrb - boff) );
switch ( (aoff<<4) | boff )
{
    case 0: _mm_shuffle_ps(va,vb, ...
The number of cases depends on whether you can assume 64-bit alignment.

Timeout in CUDA? / Fermi / GTX465

I am using CUDA SDK 3.1 on MS VS2005 with a GTX465 1 GB GPU. I have the following kernel function:
__global__ void CRT_GPU_2(float *A, float *X, float *Y, float *Z, float *pIntensity, float *firstTime, float *pointsNumber)
{
    int holo_x = blockIdx.x*20 + threadIdx.x;
    int holo_y = blockIdx.y*20 + threadIdx.y;
    float k=2.0f*3.14f/0.000000054f;

    if (firstTime[0]==1.0f)
    {
        pIntensity[holo_x+holo_y*MAX_FINAL_X]=0.0f;
    }

    for (int i=0; i<pointsNumber[0]; i++)
    {
        pIntensity[holo_x+holo_y*MAX_FINAL_X]=pIntensity[holo_x+holo_y*MAX_FINAL_X]+A[i]*cosf(k*sqrtf(pow(holo_x-X[i],2.0f)+pow(holo_y-Y[i],2.0f)+pow(Z[i],2.0f)));
    }

    __syncthreads();
}
and this is the function which calls the kernel:
extern "C" void go2(float *pDATA, float *X, float *Y, float *Z, float *pIntensity, float *firstTime, float *pointsNumber)
{
dim3 blockGridRows(MAX_FINAL_X/20,MAX_FINAL_Y/20);
dim3 threadBlockRows(20, 20);
CRT_GPU_2<<<blockGridRows, threadBlockRows>>>(pDATA, X, Y, Z, pIntensity,firstTime, pointsNumber);
CUT_CHECK_ERROR("multiplyNumbersGPU() execution failed\n");
CUDA_SAFE_CALL( cudaThreadSynchronize() );
}
I am loading all the parameters to this function in a loop (for example, 4096 elements per parameter in one loop iteration). In total, I want to run this kernel for 32768 elements per parameter over all loop iterations.
MAX_FINAL_X is 1920 and MAX_FINAL_Y is 1080.
When I start the algorithm, the first iteration goes very fast, and after one or two more iterations I get a CUDA timeout error. I used this algorithm on a GTX260 GPU and it was doing better, as far as I remember...
Could you help me... maybe I am making some mistake related to the new Fermi architecture in this algorithm?
It would be better to call CUT_CHECK_ERROR after cudaThreadSynchronize(), because the kernel runs asynchronously and you must wait for it to finish to know about errors. Maybe in the second iteration you are receiving an error from the first kernel launch.
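In go2 that simply means swapping the two lines after the launch:

CRT_GPU_2<<<blockGridRows, threadBlockRows>>>(pDATA, X, Y, Z, pIntensity, firstTime, pointsNumber);
CUDA_SAFE_CALL( cudaThreadSynchronize() );         // wait for the kernel to finish
CUT_CHECK_ERROR("CRT_GPU_2() execution failed\n"); // now the error belongs to this launch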
Be sure that you have a valid number in the most interesting variable, pointsNumber[0] (it might cause a long internal loop).
You could also improve the speed of your kernel function:
Use better blocks. A 20x20 thread configuration will cause very slow memory usage (see the Programming Guide and Best Practices). Try using 16x16 blocks.
Do not use the pow(..., 2.0f) function. It's faster to use a SQR macro (#define SQR(x) ((x)*(x))).
You don't use shared memory, so __syncthreads() is not required.
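Putting these suggestions together, the kernel might look something like this untested sketch (it also accumulates into a register instead of re-reading pIntensity from global memory on every loop iteration):

#define SQR(x) ((x)*(x))

__global__ void CRT_GPU_2(float *A, float *X, float *Y, float *Z,
                          float *pIntensity, float *firstTime, float *pointsNumber)
{
    // blockDim instead of the hard-coded 20: only the launch config changes
    int holo_x = blockIdx.x * blockDim.x + threadIdx.x;
    int holo_y = blockIdx.y * blockDim.y + threadIdx.y;
    int idx = holo_x + holo_y * MAX_FINAL_X;
    float k = 2.0f * 3.14f / 0.000000054f;

    float acc = (firstTime[0] == 1.0f) ? 0.0f : pIntensity[idx];
    int n = (int)pointsNumber[0];
    for (int i = 0; i < n; i++)
        acc += A[i] * cosf(k * sqrtf(SQR(holo_x - X[i]) + SQR(holo_y - Y[i]) + SQR(Z[i])));
    pIntensity[idx] = acc; // single global write; no __syncthreads() needed
}

Launch it with dim3 threadBlockRows(16, 16) and blockGridRows scaled accordingly.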
PS: You can also pass value parameters to CUDA functions, not only pointers. The speed will be the same.
PPS: Please improve the code's readability... At the moment you must edit six places to change the block configuration. Inside the kernel you could use the blockDim variable, and in the go2 function you could use constants.
You could also use a bool for firstTime - it will be MUCH better than a float.
Is your GPU connected to a display? If so, I believe the default is that kernel execution will be aborted after 5 seconds. You can check whether kernel execution will time out by using cudaGetDeviceProperties - see the reference page.
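A minimal check, assuming device 0 is the device in use:

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
if (prop.kernelExecTimeoutEnabled)
    printf("This device aborts kernels that exceed the watchdog time limit.\n");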
In the kernel's loop you write to the same array from which you read. For global memory usage this is the worst case, because warps from different blocks wait for each other.
