How to sum all 32-bit or 64-bit sub-registers in an SSE XMM, or AVX YMM, and ZMM register?

Say your task results in a subtotal in each floating-point subregister. I'm not seeing an instruction that would sum the subtotals down to one floating-point total. Do I need to store the MM register in plain old memory then do the sum with simple instructions?
(It's unresolved whether these will be double or single-precision, and I plan on coding for every CPU variation up to the forthcoming (?) 512-bit AVX version if I can find the opcodes.)

wget http://www.agner.org/optimize/vectorclass.zip
unzip vectorclass.zip -d vectorclass
cd vectorclass/
This code is GPLv3.
SSE
grep -A11 horizontal_add vectorf128.h
static inline float horizontal_add (Vec4f const & a) {
#if INSTRSET >= 3 // SSE3
__m128 t1 = _mm_hadd_ps(a,a);
__m128 t2 = _mm_hadd_ps(t1,t1);
return _mm_cvtss_f32(t2);
#else
__m128 t1 = _mm_movehl_ps(a,a);
__m128 t2 = _mm_add_ps(a,t1);
__m128 t3 = _mm_shuffle_ps(t2,t2,1);
__m128 t4 = _mm_add_ss(t2,t3);
return _mm_cvtss_f32(t4);
#endif
--
static inline double horizontal_add (Vec2d const & a) {
#if INSTRSET >= 3 // SSE3
__m128d t1 = _mm_hadd_pd(a,a);
return _mm_cvtsd_f64(t1);
#else
__m128 t0 = _mm_castpd_ps(a);
__m128d t1 = _mm_castps_pd(_mm_movehl_ps(t0,t0));
__m128d t2 = _mm_add_sd(a,t1);
return _mm_cvtsd_f64(t2);
#endif
}
AVX
grep -A6 horizontal_add vectorf256.h
static inline float horizontal_add (Vec8f const & a) {
__m256 t1 = _mm256_hadd_ps(a,a);
__m256 t2 = _mm256_hadd_ps(t1,t1);
__m128 t3 = _mm256_extractf128_ps(t2,1);
__m128 t4 = _mm_add_ss(_mm256_castps256_ps128(t2),t3);
return _mm_cvtss_f32(t4);
}
--
static inline double horizontal_add (Vec4d const & a) {
__m256d t1 = _mm256_hadd_pd(a,a);
__m128d t2 = _mm256_extractf128_pd(t1,1);
__m128d t3 = _mm_add_sd(_mm256_castpd256_pd128(t1),t2);
return _mm_cvtsd_f64(t3);
}
AVX512
grep -A3 horizontal_add vectorf512.h
static inline float horizontal_add (Vec16f const & a) {
#if defined(__INTEL_COMPILER)
return _mm512_reduce_add_ps(a);
#else
return horizontal_add(a.get_low() + a.get_high());
#endif
}
--
static inline double horizontal_add (Vec8d const & a) {
#if defined(__INTEL_COMPILER)
return _mm512_reduce_add_pd(a);
#else
return horizontal_add(a.get_low() + a.get_high());
#endif
}
get_high() and get_low()
Vec8f get_high() const {
return _mm256_castpd_ps(_mm512_extractf64x4_pd(_mm512_castps_pd(zmm),1));
}
Vec8f get_low() const {
return _mm512_castps512_ps256(zmm);
}
Vec4d get_low() const {
return _mm512_castpd512_pd256(zmm);
}
Vec4d get_high() const {
return _mm512_extractf64x4_pd(zmm,1);
}
For integers look for horizontal_add in vectori128.h, vectori256.h, and vectori512.h.
You can also use the Vector Class Library (VCL) directly
#include <stdio.h>
#define MAX_VECTOR_SIZE 512
#include "vectorclass.h"
int main(void) {
float x[16]; for(int i=0;i<16;i++) x[i]=i+1;
Vec4f v4 = Vec4f().load(x);
Vec8f v8 = Vec8f().load(x);
Vec16f v16 = Vec16f().load(x);
printf("%f %d\n", horizontal_add(v4), 4*5/2);
printf("%f %d\n", horizontal_add(v8), 8*9/2);
printf("%f %d\n", horizontal_add(v16), 16*17/2);
}
Compile like this (the GCC on my KNL system is too old for AVX512, so the AVX512 build uses ICC):
SSE2: g++ -O3 test.cpp
AVX: g++ -O3 -mavx test.cpp
AVX512ER: icpc -O3 -xMIC-AVX512 test.cpp
output
10.000000 10
36.000000 36
136.000000 136
One nice thing with the VCL library is that if you use e.g. Vec8f with a system that only has SSE2 it will emulate AVX using SSE twice.
See the section "Instruction sets and CPU dispatching" in the vectorclass.pdf manual for how to compile for different instruction sets with MSVC, ICC, Clang, and GCC.
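If you would rather not depend on VCL, the SSE2 fallback pattern from vectorf128.h above can be sketched in plain C intrinsics. This is a minimal sketch, not the library's code: the names hsum_ps_sse2, hsum_pd_sse2 and sum_array are hypothetical, and it assumes an x86 target where SSE2 is available (the x86-64 baseline).

```c
#include <emmintrin.h>  /* SSE2 intrinsics; also pulls in the SSE header */

/* Sum the four floats held in v using only shuffles and adds. */
static float hsum_ps_sse2(__m128 v)
{
    __m128 hi   = _mm_movehl_ps(v, v);            /* [v2, v3, v2, v3] */
    __m128 sum2 = _mm_add_ps(v, hi);              /* [v0+v2, v1+v3, ...] */
    __m128 odd  = _mm_shuffle_ps(sum2, sum2, 1);  /* element 1 moved to slot 0 */
    return _mm_cvtss_f32(_mm_add_ss(sum2, odd));  /* (v0+v2) + (v1+v3) */
}

/* Sum the two doubles held in v. */
static double hsum_pd_sse2(__m128d v)
{
    __m128d hi = _mm_unpackhi_pd(v, v);           /* [v1, v1] */
    return _mm_cvtsd_f64(_mm_add_sd(v, hi));      /* v0 + v1 */
}

/* Hypothetical helper: keep four running subtotals in one register and
   do the horizontal sum once, after the loop. */
static float sum_array(const float *x, int n)
{
    __m128 acc = _mm_setzero_ps();
    int i = 0;
    for (; i + 4 <= n; i += 4)
        acc = _mm_add_ps(acc, _mm_loadu_ps(x + i));
    float total = hsum_ps_sse2(acc);
    for (; i < n; i++)                            /* scalar tail */
        total += x[i];
    return total;
}
```

The horizontal step is deliberately kept outside the loop: it costs a few shuffles, so it is normally done once per reduction rather than once per iteration. The additions happen in a different order than in a scalar loop, so floating-point results can differ slightly in the last bits.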

I have implemented the following inline function for AVX2. It sums all elements and returns the result. Treat it as a suggestion for developing your own function for this purpose.
Note: _mm256_extract_epi32 is not presented for AVX; there you can write your own vmovss-based extraction, such as float _mm256_cvtss_f32 (__m256 a), and build your horizontal addition functions on that.
// my horizontal addition of epi32
inline int _mm256_hadd2_epi32(__m256i a)
{
__m256i a_hi;
a_hi = _mm256_permute2x128_si256(a, a, 1); // imm8 = 1 swaps the two 128-bit halves
a = _mm256_hadd_epi32(a, a_hi);
a = _mm256_hadd_epi32(a, a);
a = _mm256_hadd_epi32(a, a);
return _mm256_extract_epi32(a,0);
}
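An alternative to chained hadd instructions is to narrow first and then use plain shuffles and adds. The sketch below shows that pattern for 32-bit integers with SSE2 only; hsum_epi32_sse2 and hsum_8x32 are hypothetical names. It assumes the 256-bit value has already been split into two 128-bit halves, which with AVX2 could be obtained via _mm256_castsi256_si128(a) and _mm256_extracti128_si256(a, 1).

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Sum the four 32-bit integers in v (wraps on overflow, like paddd). */
static int32_t hsum_epi32_sse2(__m128i v)
{
    /* Swap the 64-bit halves and add: each slot now holds a pair sum. */
    __m128i hi64  = _mm_shuffle_epi32(v, _MM_SHUFFLE(1, 0, 3, 2));
    __m128i sum64 = _mm_add_epi32(v, hi64);
    /* Swap adjacent 32-bit elements and add: slot 0 holds the total. */
    __m128i hi32  = _mm_shuffle_epi32(sum64, _MM_SHUFFLE(2, 3, 0, 1));
    __m128i sum32 = _mm_add_epi32(sum64, hi32);
    return _mm_cvtsi128_si32(sum32);
}

/* Sum eight 32-bit integers given as the low and high 128-bit halves
   of a 256-bit vector: one vertical add, then the 128-bit reduction. */
static int32_t hsum_8x32(__m128i lo, __m128i hi)
{
    return hsum_epi32_sse2(_mm_add_epi32(lo, hi));
}
```

Narrowing to 128 bits as the first step keeps the remaining work in cheap in-lane operations, which is the same order the VCL float versions follow for AVX512 (add low and high halves, then recurse).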

Related

How can I generate SVE vectors with LLVM

clang version 11.0.0
example.c:
#define ARRAYSIZE 1024
int a[ARRAYSIZE];
int b[ARRAYSIZE];
int c[ARRAYSIZE];
void subtract_arrays(int *restrict a, int *restrict b, int *restrict c)
{
for (int i = 0; i < ARRAYSIZE; i++)
{
a[i] = b[i] - c[i];
}
}
int main()
{
subtract_arrays(a, b, c);
}
command:
clang --target=aarch64-linux-gnu -march=armv8-a+sve -O3 -S example.c
LLVM always generates NEON vectors, but I want it to generate SVE vectors.
How can I do this?

Unfortunately Clang version 11 does not support SVE auto-vectorization.
This will come with LLVM 13: Architecture support in LLVM
You can however generate SVE code with intrinsic functions or inline assembly.
Your code with intrinsic functions would look something along the lines of:
#include <arm_sve.h>
void subtract_arrays(int *restrict a, int *restrict b, int *restrict c) {
int i = 0;
svbool_t pg = svwhilelt_b32(i, ARRAYSIZE);
do
{
svint32_t db_vec = svld1(pg, &b[i]);
svint32_t dc_vec = svld1(pg, &c[i]);
svint32_t da_vec = svsub_z(pg, db_vec, dc_vec);
svst1(pg, &a[i], da_vec);
i += svcntw();
pg = svwhilelt_b32(i, ARRAYSIZE);
}
while (svptest_any(svptrue_b32(), pg));
}
I had a similar problem thinking that SVE auto-vectorization is supported.
When targeting SVE with Clang, optimization reports show successful vectorization despite only vectorizing for Neon.

SIMD Black-Scholes implementation: why is _mm256_set1_pd annihilating my performance? [duplicate]

I have a function in this form (From Fastest Implementation of Exponential Function Using SSE):
__m128 FastExpSse(__m128 x)
{
static __m128 const a = _mm_set1_ps(12102203.2f); // (1 << 23) / ln(2)
static __m128i const b = _mm_set1_epi32(127 * (1 << 23) - 486411);
static __m128 const m87 = _mm_set1_ps(-87);
// fast exponential function, x should be in [-87, 87]
__m128 mask = _mm_cmpge_ps(x, m87);
__m128i tmp = _mm_add_epi32(_mm_cvtps_epi32(_mm_mul_ps(a, x)), b);
return _mm_and_ps(_mm_castsi128_ps(tmp), mask);
}
I want to make it C compatible.
Yet the compiler doesn't accept the form static __m128i const b = _mm_set1_epi32(127 * (1 << 23) - 486411); when I use a C compiler.
Yet I don't want the first 3 values to be recalculated in each function call.
One solution is to inline it (But sometimes the compilers reject that).
Is there a C style to achieve it in case the function isn't inlined?
Thank You.

Remove static and const.
Also remove them from the C++ version. const is OK, but static is horrible, introducing guard variables that are checked every time, and a very expensive initialization the first time.
__m128 a = _mm_set1_ps(12102203.2f); is not a function call, it's just a way to express a vector constant. No time can be saved by "doing it only once" - it normally happens zero times, with the constant vector being prepared in the data segment of the program and simply being loaded at runtime, without the junk around it that static introduces.
Check the asm to be sure, without static this is what happens: (from godbolt)
FastExpSse(float __vector(4)):
movaps xmm1, XMMWORD PTR .LC0[rip]
cmpleps xmm1, xmm0
mulps xmm0, XMMWORD PTR .LC1[rip]
cvtps2dq xmm0, xmm0
paddd xmm0, XMMWORD PTR .LC2[rip]
andps xmm0, xmm1
ret
.LC0:
.long 3266183168
.long 3266183168
.long 3266183168
.long 3266183168
.LC1:
.long 1262004795
.long 1262004795
.long 1262004795
.long 1262004795
.LC2:
.long 1064866805
.long 1064866805
.long 1064866805
.long 1064866805

_mm_set1_ps(-87); or any other _mm_set intrinsic is not a valid static initializer with current compilers, because it's not treated as a constant expression.
In C++, it compiles to runtime initialization of the static storage location (copying from a vector literal somewhere else). And if it's a static __m128 inside a function, there's a guard variable to protect it.
In C, it simply refuses to compile, because C doesn't support non-constant initializers / constructors. _mm_set is not like a braced initializer for the underlying GNU C native vector, like @benjarobin's answer shows.
This is really dumb, and seems to be a missed-optimization in all 4 mainstream x86 C++ compilers (gcc/clang/ICC/MSVC). Even if it somehow matters that each static const __m128 var have a distinct address, the compiler could achieve that by using initialized read-only storage instead of copying at runtime.
So it seems like constant propagation fails to go all the way to turning _mm_set into a constant initializer even when optimization is enabled.
Never use static const __m128 var = _mm_set... even in C++; it's inefficient.
Inside a function is even worse, but global scope is still bad.
Instead, avoid static. You can still use const to stop yourself from accidentally assigning something else, and to tell human readers that it's a constant. Without static, it has no effect on where/how your variable is stored. const on automatic storage just does compile-time checking that you don't modify the object.
const __m128 var = _mm_set1_ps(-87); // not static
Compilers are good at this, and will optimize the case where multiple functions use the same vector constant, the same way they de-duplicate string literals and put them in read-only memory.
Defining constants this way inside small helper functions is fine: compilers will hoist the constant-setup out of a loop after inlining the function.
It also lets compilers optimize away the full 16 bytes of storage, and load it with vbroadcastss xmm0, dword [mem], or stuff like that.
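As a concrete illustration of that advice, here is the question's function rewritten as plain C with non-static const locals. It is a sketch under the assumption of an SSE2-capable x86 target; fast_exp_sse and fast_exp_scalar are hypothetical names, and the constants are copied unchanged from the question.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Fast approximate exp() on four floats; x should be in [-87, 87].
   The constants are ordinary const locals, so this is valid C and the
   compiler is free to place them in read-only data. */
static __m128 fast_exp_sse(__m128 x)
{
    const __m128  a   = _mm_set1_ps(12102203.2f);            /* (1 << 23) / ln(2) */
    const __m128i b   = _mm_set1_epi32(127 * (1 << 23) - 486411);
    const __m128  m87 = _mm_set1_ps(-87.0f);

    __m128  mask = _mm_cmpge_ps(x, m87);                     /* all-ones where x >= -87 */
    __m128i tmp  = _mm_add_epi32(_mm_cvtps_epi32(_mm_mul_ps(a, x)), b);
    return _mm_and_ps(_mm_castsi128_ps(tmp), mask);          /* zero below -87 */
}

/* Hypothetical convenience wrapper for checking a single value. */
static float fast_exp_scalar(float x)
{
    return _mm_cvtss_f32(fast_exp_sse(_mm_set1_ps(x)));
}
```

The approximation is coarse (a few percent of relative error), so any check against exp() should use a generous tolerance. To confirm that no runtime initialization remains, inspect the generated assembly as in the godbolt listing earlier in this thread.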

This solution is clearly not portable; it works with GCC 8 (the only compiler I tested):
#include <stdio.h>
#include <stdint.h>
#include <emmintrin.h>
#include <string.h>
#define INIT_M128(vFloat) {(vFloat), (vFloat), (vFloat), (vFloat)}
#define INIT_M128I(vU32) {((uint64_t)(vU32) | (uint64_t)(vU32) << 32u), ((uint64_t)(vU32) | (uint64_t)(vU32) << 32u)}
static void print128(const void *p)
{
unsigned char buf[16];
memcpy(buf, p, 16);
for (int i = 0; i < 16; ++i)
{
printf("%02X ", buf[i]);
}
printf("\n");
}
int main(void)
{
static __m128 const glob_a = INIT_M128(12102203.2f);
static __m128i const glob_b = INIT_M128I(127 * (1 << 23) - 486411);
static __m128 const glob_m87 = INIT_M128(-87.0f);
__m128 a = _mm_set1_ps(12102203.2f);
__m128i b = _mm_set1_epi32(127 * (1 << 23) - 486411);
__m128 m87 = _mm_set1_ps(-87);
print128(&a);
print128(&glob_a);
print128(&b);
print128(&glob_b);
print128(&m87);
print128(&glob_m87);
return 0;
}
As explained in the answer of @harold (in C only), the following code (built with or without WITHSTATIC) produces exactly the same code.
#include <stdio.h>
#include <stdint.h>
#include <emmintrin.h>
#include <string.h>
#define INIT_M128(vFloat) {(vFloat), (vFloat), (vFloat), (vFloat)}
#define INIT_M128I(vU32) {((uint64_t)(vU32) | (uint64_t)(vU32) << 32u), ((uint64_t)(vU32) | (uint64_t)(vU32) << 32u)}
__m128 FastExpSse2(__m128 x)
{
#ifdef WITHSTATIC
static __m128 const a = INIT_M128(12102203.2f);
static __m128i const b = INIT_M128I(127 * (1 << 23) - 486411);
static __m128 const m87 = INIT_M128(-87.0f);
#else
__m128 a = _mm_set1_ps(12102203.2f);
__m128i b = _mm_set1_epi32(127 * (1 << 23) - 486411);
__m128 m87 = _mm_set1_ps(-87);
#endif
__m128 mask = _mm_cmpge_ps(x, m87);
__m128i tmp = _mm_add_epi32(_mm_cvtps_epi32(_mm_mul_ps(a, x)), b);
return _mm_and_ps(_mm_castsi128_ps(tmp), mask);
}
So in summary, it's better to remove the static and const keywords: the C++ code becomes better and simpler, and the C code stays portable, whereas my proposed hack is not really portable.

Data Clauses (output is zero when i use OpenACC)

I want to reduce the runtime of my code by using OpenACC, but unfortunately when I use OpenACC the output becomes zero.
#include <stdio.h>
#include <math.h>
#include <stdlib.h>
#include <assert.h>
#include <openacc.h>
#include<time.h>
#include <string.h>
#include <malloc.h>
#define NX 201
#define NY 101
#define NZ 201
int main(void)
{
int i, j, k, l, m;
static double tr, w;
static double dt = 9.5e-9, t;
static double cu[NZ];
static double AA[NX][NY][NZ] , CC[NX][NY][NZ] , BB[NX][NY][NZ] ;
static double A[NX][NY][NZ] , B[NX][NY][NZ] , C[NX][NY][NZ] ;
FILE *file;
file = fopen("BB-and-A.csv", "w");
t = 0.;
#pragma acc data copyin( tr, w,dt, t),copy(B ,A , C,AA , CC,BB,cu )
{
for (l = 1; l < 65; l++) {
#pragma acc kernels loop private(i, j,k)
for (i = 1; i < NX - 1; i++) {
for (j = 0; j < NY - 1; j++) {
for (k = 1; k < NZ - 1; k++) {
A[i][j][k] = A[i][j][k]
+ 1. * (B[i][j][k] - AA[i][j][k - 1]);
}
}
}
#pragma acc kernels loop private(i, j,k)
for (i = 1; i < NX - 1; i++) { /* BB */
for (j = 1; j < NY - 1; j++) {
for (k = 0; k < NZ - 1; k++) {
B[i][j][k] = B[i][j][k]
+ 1.* (BB[i][j][k] - A[i - 1][j][k]);
}
}
}
#pragma acc kernels
for (m = 1; m < NZ - 1; m++) {
tr = t - (double)(m)*5 / 1.5e8;
if (tr <= 0.)
cu[m] = 0.;
else {
w = (tr / 0.25e-6)*(tr / 0.25e-6);
cu[m] =1666*w / (w + 1.)*exp(-tr / 2.5e-6) ;
cu[m] = 2*cu[m];
}
A[10][60][m] = -cu[m];
}
#pragma acc update self(B)
fprintf(file, "%e, %e \n", t*1e6, -B[22][60][10] );
t = t + dt;
}
}
fclose(file);
}

The problem here is the "copyin( tr, w,dt, t)", and in particular the "t" variable. By putting these scalars in a data clause, you take on the job of synchronizing the host and device copies. Hence, when you update the variable on the host (i.e. "t = t + dt;"), you then need to update the device copy with the new value.
Also, there's a potential race condition on "tr", since the device code will now use the shared device variable instead of a private copy.
The easiest thing to do, though, is to simply not put these scalars in a data clause. By default, OpenACC privatizes scalars, so there's no need to manage them yourself. In t's case, its value will be passed as an argument to the CUDA kernel.
To fix your code change:
#pragma acc data copyin( tr, w,dt, t),copy(B ,A , C,AA , CC,BB,cu )
to:
#pragma acc data copy(B ,A , C,AA , CC,BB,cu )
Note that there's no need to put the loop indices in a private clause since they are implicitly private.
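The pattern can be sketched on a much smaller loop. This is a hypothetical reduction of the question's code, not the original program: run_steps, N and the loop body are made up for illustration. Only the array appears in the data clause, while the scalars t, dt and tr are left to OpenACC's default handling. A compiler without OpenACC support ignores the pragmas and runs the loops serially on the host, which should give the same result.

```c
#define N 64  /* hypothetical array size */

/* Advance a toy time loop and return B[0] after `steps` iterations. */
static double run_steps(int steps)
{
    static double B[N];
    const double dt = 9.5e-9;  /* same time step as in the question */
    double t = 0.0;            /* host scalar, NOT listed in a data clause */

    for (int k = 0; k < N; k++)
        B[k] = 0.0;

    #pragma acc data copy(B)   /* only the array lives in the data region */
    {
        for (int l = 0; l < steps; l++) {
            #pragma acc kernels loop
            for (int k = 0; k < N; k++) {
                /* tr is declared inside the loop, so it is private;
                   t is passed to the kernel by value on every launch. */
                double tr = t - (double)k * 1.0e-9;
                B[k] += (tr > 0.0) ? tr : 0.0;
            }
            #pragma acc update self(B)  /* refresh the host copy for output */
            t += dt;                    /* a host update is now sufficient */
        }
    }
    return B[0];
}
```

If a scalar really must stay in a data clause, the alternative mentioned above is to add "#pragma acc update device(t)" after every host-side change to t.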

Why is calling my C code from F# very slow (compared to native)?

So I wrote some numerical code in C but wanted to call it from F#. However it runs incredibly slowly.
Times:
gcc -O3 : 4 seconds
gcc -O0 : 30 seconds
fsharp code which calls the optimised gcc code: 2 minutes 30 seconds.
For reference, the c code is
int main(int argc, char** argv)
{
setvals(100,100,15,20.0,0.0504);
float* dmats = malloc(sizeof(float) * factor*factor);
MakeDmat(1.4,-1.92,dmats); //dmat appears to be correct
float* arr1 = malloc(sizeof(float)*xsize*ysize);
float* arr2 = malloc(sizeof(float)*xsize*ysize);
randinit(arr1);
for (int i = 0;i < 10000;i++)
{
evolve(arr1,arr2,dmats);
evolve(arr2,arr1,dmats);
if (i==9999) {print(arr1,xsize,ysize);};
}
return 0;
}
I left out the implementation of the functions. The F# code I am using is
open System.Runtime.InteropServices
open Microsoft.FSharp.NativeInterop
[<DllImport("a.dll")>] extern void main (int argc, char* argv)
[<DllImport("a.dll")>] extern void setvals (int _xsize, int _ysize, int _distlimit,float _tau,float _Iex)
[<DllImport("a.dll")>] extern void MakeDmat(float We,float Wi, float*arr)
[<DllImport("a.dll")>] extern void randinit(float* arr)
[<DllImport("a.dll")>] extern void print(float* arr)
[<DllImport("a.dll")>] extern void evolve (float* input, float* output,float* connections)
let dlimit,xsize,ysize = 15,100,100
let factor = (2*dlimit)+1
setvals(xsize,ysize,dlimit,20.0,0.0504)
let dmat = Array.zeroCreate (factor*factor)
MakeDmat(1.4,-1.92,&&dmat.[0])
let arr1 = Array.zeroCreate (xsize*ysize)
let arr2 = Array.zeroCreate (xsize*ysize)
let addr1 = &&arr1.[0]
let addr2 = &&arr2.[0]
let dmataddr = &&dmat.[0]
randinit(&&dmat.[0])
[0..10000] |> List.iter (fun _ ->
evolve(addr1,addr2,dmataddr)
evolve(addr2,addr1,dmataddr)
)
print(&&arr1.[0])
The F# code is compiled with optimisations on.
Is the mono interface for calling C code really that slow (almost 8ms of overhead per function call) or am I just doing something stupid?

It looks like part of the problem is that you are using float on both the F# and C side of the PInvoke signature. In F# float is really System.Double and hence is 8 bytes. In C a float is generally 4 bytes.
If this were running under the CLR I would expect you to see a PInvoke stack unbalanced error during debugging. I'm not sure if Mono has similar checks or not. But it's possible this is related to the problem you're seeing.
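To see why the width mismatch matters, the following C sketch reinterprets the low 32 bits of an 8-byte double as a 4-byte float. That is roughly what a callee expecting float would read if the caller handed it a double in the same register or slot. The helper name low_bits_as_float is hypothetical, and what actually happens in a real P/Invoke call depends on the platform's calling convention and marshalling, so treat this as an illustration rather than a description of Mono's behaviour.

```c
#include <stdint.h>
#include <string.h>

/* Return the float whose bit pattern equals the low 32 bits of d. */
static float low_bits_as_float(double d)
{
    uint64_t bits64;
    uint32_t bits32;
    float f;

    memcpy(&bits64, &d, sizeof bits64);         /* raw 64-bit pattern of the double */
    bits32 = (uint32_t)(bits64 & 0xFFFFFFFFu);  /* keep only the low half */
    memcpy(&f, &bits32, sizeof f);              /* reinterpret as a 4-byte float */
    return f;
}
```

With this helper, 20.0 (the _tau argument passed to setvals) comes out as 0.0f, because the low half of its double representation is all zero bits. Declaring the F# parameters as float32 (an alias for System.Single) instead of float makes the two sides agree on 4-byte values.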

How to solve CUDA Thrust library - for_each synchronization error?

I'm trying to modify a simple dynamic vector in CUDA using the Thrust library. But I'm getting a "launch_closure_by_value" error on the screen, indicating that the error is related to some synchronization process.
A simple 1D dynamic array modification is not possible due to this error.
My code segment which is causing the error is as follows.
From a .cpp file I call setIndexedGridInfo, which is defined in System.cu:
float* a= (float*)(malloc(8*sizeof(float)));
a[0]= 0; a[1]= 1; a[2]= 2; a[3]= 3; a[4]= 4; a[5]= 5; a[6]= 6; a[7]= 7;
float* b = (float*)(malloc(8*sizeof(float)));
setIndexedGridInfo(a,b);
The code segment at System.cu:
void
setIndexedGridInfo(float* a, float*b)
{
thrust::device_ptr<float> d_oldData(a);
thrust::device_ptr<float> d_newData(b);
float c = 0.0;
thrust::for_each(
thrust::make_zip_iterator(thrust::make_tuple(d_oldData,d_newData)),
thrust::make_zip_iterator(thrust::make_tuple(d_oldData+8,d_newData+8)),
grid_functor(c));
}
grid_functor is defined in _kernel.cu
struct grid_functor
{
float a;
__host__ __device__
grid_functor(float grid_Info) : a(grid_Info) {}
template <typename Tuple>
__device__
void operator()(Tuple t)
{
volatile float data = thrust::get<0>(t);
float pos = data + 0.1;
thrust::get<1>(t) = pos;
}
};
I also get these on the Output window (I use Visual Studio):
First-chance exception at 0x000007fefdc7cacd in Particles.exe:
Microsoft C++ exception: cudaError_enum at memory location
0x0029eb60.. First-chance exception at 0x000007fefdc7cacd in
smokeParticles.exe: Microsoft C++ exception:
thrust::system::system_error at memory location 0x0029ecf0.. Unhandled
exception at 0x000007fefdc7cacd in Particles.exe: Microsoft C++
exception: thrust::system::system_error at memory location
0x0029ecf0..
What is causing the problem?

You are trying to use host memory pointers in functions expecting pointers in device memory. This code is the problem:
float* a= (float*)(malloc(8*sizeof(float)));
a[0]= 0; a[1]= 1; a[2]= 2; a[3]= 3; a[4]= 4; a[5]= 5; a[6]= 6; a[7]= 7;
float* b = (float*)(malloc(8*sizeof(float)));
setIndexedGridInfo(a,b);
.....
thrust::device_ptr<float> d_oldData(a);
thrust::device_ptr<float> d_newData(b);
The thrust::device_ptr is intended for "wrapping" a device memory pointer allocated with the CUDA API so that thrust can use it. You are trying to treat a host pointer directly as a device pointer. That is illegal. You could modify your setIndexedGridInfo function like this:
void setIndexedGridInfo(float* a, float*b, const int n)
{
thrust::device_vector<float> d_oldData(a,a+n);
thrust::device_vector<float> d_newData(b,b+n);
float c = 0.0;
thrust::for_each(
thrust::make_zip_iterator(thrust::make_tuple(d_oldData.begin(),d_newData.begin())),
thrust::make_zip_iterator(thrust::make_tuple(d_oldData.end(),d_newData.end())),
grid_functor(c));
}
The device_vector constructor will allocate device memory and then copy the contents of your host memory to the device. That should fix the error you are seeing, although I am not sure what you are trying to do with the for_each iterator and whether the functor you have written is correct.
Edit:
Here is a complete, compilable, runnable version of your code:
#include <cstdlib>
#include <cstdio>
#include <thrust/device_vector.h>
#include <thrust/for_each.h>
#include <thrust/copy.h>
struct grid_functor
{
float a;
__host__ __device__
grid_functor(float grid_Info) : a(grid_Info) {}
template <typename Tuple>
__device__
void operator()(Tuple t)
{
volatile float data = thrust::get<0>(t);
float pos = data + 0.1f;
thrust::get<1>(t) = pos;
}
};
void setIndexedGridInfo(float* a, float*b, const int n)
{
thrust::device_vector<float> d_oldData(a,a+n);
thrust::device_vector<float> d_newData(b,b+n);
float c = 0.0;
thrust::for_each(
thrust::make_zip_iterator(thrust::make_tuple(d_oldData.begin(),d_newData.begin())),
thrust::make_zip_iterator(thrust::make_tuple(d_oldData.end(),d_newData.end())),
grid_functor(c));
thrust::copy(d_newData.begin(), d_newData.end(), b);
}
int main(void)
{
const int n = 8;
float* a= (float*)(malloc(n*sizeof(float)));
a[0]= 0; a[1]= 1; a[2]= 2; a[3]= 3; a[4]= 4; a[5]= 5; a[6]= 6; a[7]= 7;
float* b = (float*)(malloc(n*sizeof(float)));
setIndexedGridInfo(a,b,n);
for(int i=0; i<n; i++) {
fprintf(stdout, "%d (%f,%f)\n", i, a[i], b[i]);
}
return 0;
}
I can compile and run this code on an OS X 10.6.8 host with CUDA 4.1 like this:
$ nvcc -Xptxas="-v" -arch=sm_12 -g -G thrustforeach.cu
./thrustforeach.cu(18): Warning: Cannot tell what pointer points to, assuming global memory space
./thrustforeach.cu(20): Warning: Cannot tell what pointer points to, assuming global memory space
./thrustforeach.cu(18): Warning: Cannot tell what pointer points to, assuming global memory space
./thrustforeach.cu(20): Warning: Cannot tell what pointer points to, assuming global memory space
ptxas info : Compiling entry function '_ZN6thrust6detail7backend4cuda6detail23launch_closure_by_valueINS2_18for_each_n_closureINS_12zip_iteratorINS_5tupleINS0_15normal_iteratorINS_10device_ptrIfEEEESB_NS_9null_typeESC_SC_SC_SC_SC_SC_SC_EEEEi12grid_functorEEEEvT_' for 'sm_12'
ptxas info : Used 14 registers, 160+0 bytes lmem, 16+16 bytes smem, 4 bytes cmem[1]
ptxas info : Compiling entry function '_ZN6thrust6detail7backend4cuda6detail23launch_closure_by_valueINS2_18for_each_n_closureINS_12zip_iteratorINS_5tupleINS0_15normal_iteratorINS_10device_ptrIfEEEESB_NS_9null_typeESC_SC_SC_SC_SC_SC_SC_EEEEj12grid_functorEEEEvT_' for 'sm_12'
ptxas info : Used 14 registers, 160+0 bytes lmem, 16+16 bytes smem, 4 bytes cmem[1]
$ ./a.out
0 (0.000000,0.100000)
1 (1.000000,1.100000)
2 (2.000000,2.100000)
3 (3.000000,3.100000)
4 (4.000000,4.100000)
5 (5.000000,5.100000)
6 (6.000000,6.100000)
7 (7.000000,7.100000)
