SIMD Black-Scholes implementation: why is _mm256_set1_pd annihilating my performance? [duplicate]

SIMD Black-Scholes implementation: why is _mm256_set1_pd annihilating my performance? [duplicate] - vectorization

I have a function in this form (From Fastest Implementation of Exponential Function Using SSE):
__m128 FastExpSse(__m128 x)
{
static __m128 const a = _mm_set1_ps(12102203.2f); // (1 << 23) / ln(2)
static __m128i const b = _mm_set1_epi32(127 * (1 << 23) - 486411);
static __m128 const m87 = _mm_set1_ps(-87);
// fast exponential function, x should be in [-87, 87]
__m128 mask = _mm_cmpge_ps(x, m87);
__m128i tmp = _mm_add_epi32(_mm_cvtps_epi32(_mm_mul_ps(a, x)), b);
return _mm_and_ps(_mm_castsi128_ps(tmp), mask);
}
I want to make it C compatible.
Yet the compiler doesn't accept the form static __m128i const b = _mm_set1_epi32(127 * (1 << 23) - 486411); when I use C compiler.
Yet I don't want the first 3 values to be recalculated in each function call.
One solution is to inline it (But sometimes the compilers reject that).
Is there a C style to achieve it in case the function isn't inlined?
Thank You.

Remove static and const.
Also remove them from the C++ version. const is OK, but static is horrible, introducing guard variables that are checked every time, and a very expensive initialization the first time.
__m128 a = _mm_set1_ps(12102203.2f); is not a function call, it's just a way to express a vector constant. No time can be saved by "doing it only once" - it normally happens zero times, with the constant vector being prepared in the data segment of the program and simply being loaded at runtime, without the junk around it that static introduces.
Check the asm to be sure, without static this is what happens: (from godbolt)
FastExpSse(float __vector(4)):
movaps xmm1, XMMWORD PTR .LC0[rip]
cmpleps xmm1, xmm0
mulps xmm0, XMMWORD PTR .LC1[rip]
cvtps2dq xmm0, xmm0
paddd xmm0, XMMWORD PTR .LC2[rip]
andps xmm0, xmm1
ret
.LC0:
.long 3266183168
.long 3266183168
.long 3266183168
.long 3266183168
.LC1:
.long 1262004795
.long 1262004795
.long 1262004795
.long 1262004795
.LC2:
.long 1064866805
.long 1064866805
.long 1064866805
.long 1064866805

_mm_set1_ps(-87); or any other _mm_set intrinsic is not a valid static initializer with current compilers, because it's not treated as a constant expression.
In C++, it compiles to runtime initialization of the static storage location (copying from a vector literal somewhere else). And if it's a static __m128 inside a function, there's a guard variable to protect it.
In C, it simply refuses to compile, because C doesn't support non-constant initializers / constructors. _mm_set is not like a braced initializer for the underlying GNU C native vector, like #benjarobin's answer shows.
This is really dumb, and seems to be a missed-optimization in all 4 mainstream x86 C++ compilers (gcc/clang/ICC/MSVC). Even if it somehow matters that each static const __m128 var have a distinct address, the compiler could achieve that by using initialized read-only storage instead of copying at runtime.
So it seems like constant propagation fails to go all the way to turning _mm_set into a constant initializer even when optimization is enabled.
Never use static const __m128 var = _mm_set... even in C++; it's inefficient.
Inside a function is even worse, but global scope is still bad.
Instead, avoid static. You can still use const to stop yourself from accidentally assigning something else, and to tell human readers that it's a constant. Without static, it has no effect on where/how your variable is stored. const on automatic storage just does compile-time checking that you don't modify the object.
const __m128 var = _mm_set1_ps(-87); // not static
Compilers are good at this, and will optimize the case where multiple functions use the same vector constant, the same way they de-duplicate string literals and put them in read-only memory.
Defining constants this way inside small helper functions is fine: compilers will hoist the constant-setup out of a loop after inlining the function.
It also lets compilers optimize away the full 16 bytes of storage, and load it with vbroadcastss xmm0, dword [mem], or stuff like that.

This solution is clearly not portable, it's working with GCC 8 (only tested with this compiler):
#include <stdio.h>
#include <stdint.h>
#include <emmintrin.h>
#include <string.h>
#define INIT_M128(vFloat) {(vFloat), (vFloat), (vFloat), (vFloat)}
#define INIT_M128I(vU32) {((uint64_t)(vU32) | (uint64_t)(vU32) << 32u), ((uint64_t)(vU32) | (uint64_t)(vU32) << 32u)}
static void print128(const void *p)
{
unsigned char buf[16];
memcpy(buf, p, 16);
for (int i = 0; i < 16; ++i)
{
printf("%02X ", buf[i]);
}
printf("\n");
}
int main(void)
{
static __m128 const glob_a = INIT_M128(12102203.2f);
static __m128i const glob_b = INIT_M128I(127 * (1 << 23) - 486411);
static __m128 const glob_m87 = INIT_M128(-87.0f);
__m128 a = _mm_set1_ps(12102203.2f);
__m128i b = _mm_set1_epi32(127 * (1 << 23) - 486411);
__m128 m87 = _mm_set1_ps(-87);
print128(&a);
print128(&glob_a);
print128(&b);
print128(&glob_b);
print128(&m87);
print128(&glob_m87);
return 0;
}
As explained in the answer of #harold (in C only), the following code (build with or without WITHSTATIC) produces exactly the same code.
#include <stdio.h>
#include <stdint.h>
#include <emmintrin.h>
#include <string.h>
#define INIT_M128(vFloat) {(vFloat), (vFloat), (vFloat), (vFloat)}
#define INIT_M128I(vU32) {((uint64_t)(vU32) | (uint64_t)(vU32) << 32u), ((uint64_t)(vU32) | (uint64_t)(vU32) << 32u)}
__m128 FastExpSse2(__m128 x)
{
#ifdef WITHSTATIC
static __m128 const a = INIT_M128(12102203.2f);
static __m128i const b = INIT_M128I(127 * (1 << 23) - 486411);
static __m128 const m87 = INIT_M128(-87.0f);
#else
__m128 a = _mm_set1_ps(12102203.2f);
__m128i b = _mm_set1_epi32(127 * (1 << 23) - 486411);
__m128 m87 = _mm_set1_ps(-87);
#endif
__m128 mask = _mm_cmpge_ps(x, m87);
__m128i tmp = _mm_add_epi32(_mm_cvtps_epi32(_mm_mul_ps(a, x)), b);
return _mm_and_ps(_mm_castsi128_ps(tmp), mask);
}
So in summary it's better to remove static and const keywords (better and simpler code in C++, and in C the code is portable since with my proposed hack the code is not really portable)

Related

Can Lua send extended function Keys? ex F13-F24

I tried sending F13 with kb.stroke("F13");
Well it doesn't work, works fine with anything F12 and below.
I'm trying to use this in a custom remote in Unified Remote app, so my only workaround for know is using os.start to run an ahk script that does the key sending but it's a very slow approach.
Any help will be appreciated.

local ffi = require"ffi"
ffi.cdef[[
typedef struct {
uintptr_t type;
uint16_t wVk;
uint16_t wScan;
uint32_t dwFlags;
uint32_t time;
uintptr_t dwExtraInfo;
uint32_t x[2];
} INP;
int SendInput(int, void*, int);
]]
local inp_t = ffi.typeof"INP[2]"
local function PressAndReleaseKey(vkey)
local inp = inp_t()
for j = 0, 1 do
inp[j].type = 1
inp[j].wVk = vkey
inp[j].dwFlags = j * 2
end
ffi.C.SendInput(2, inp, ffi.sizeof"INP")
end
PressAndReleaseKey(0x57) -- W
PressAndReleaseKey(0x7C) -- F13
VKeys:
https://learn.microsoft.com/en-us/windows/win32/inputdev/virtual-key-codes

How to sum all 32-bit or 64-bit sub-registers in an SSE XMM, or AVX YMM, and ZMM register?

Say your task results in a subtotal in each floating-point subregister. I'm not seeing an instruction that would sum the subtotals down to one floating-point total. Do I need to store the MM register in plain old memory then do the sum with simple instructions?
(It's unresolved whether these will be double or single-precision, and I plan on coding for every CPU variation up to the forthcoming (?) 512-bit AVX version if I can find the opcodes.)

wget http://www.agner.org/optimize/vectorclass.zip
unzip vectorclass.zip -d vectorclass
cd vectorclass/
This code is GPLv3.
SSE
grep -A11 horizontal_add vectorf128.h
static inline float horizontal_add (Vec4f const & a) {
#if INSTRSET >= 3 // SSE3
__m128 t1 = _mm_hadd_ps(a,a);
__m128 t2 = _mm_hadd_ps(t1,t1);
return _mm_cvtss_f32(t2);
#else
__m128 t1 = _mm_movehl_ps(a,a);
__m128 t2 = _mm_add_ps(a,t1);
__m128 t3 = _mm_shuffle_ps(t2,t2,1);
__m128 t4 = _mm_add_ss(t2,t3);
return _mm_cvtss_f32(t4);
#endif
--
static inline double horizontal_add (Vec2d const & a) {
#if INSTRSET >= 3 // SSE3
__m128d t1 = _mm_hadd_pd(a,a);
return _mm_cvtsd_f64(t1);
#else
__m128 t0 = _mm_castpd_ps(a);
__m128d t1 = _mm_castps_pd(_mm_movehl_ps(t0,t0));
__m128d t2 = _mm_add_sd(a,t1);
return _mm_cvtsd_f64(t2);
#endif
}
AVX
grep -A6 horizontal_add vectorf256.h
static inline float horizontal_add (Vec8f const & a) {
__m256 t1 = _mm256_hadd_ps(a,a);
__m256 t2 = _mm256_hadd_ps(t1,t1);
__m128 t3 = _mm256_extractf128_ps(t2,1);
__m128 t4 = _mm_add_ss(_mm256_castps256_ps128(t2),t3);
return _mm_cvtss_f32(t4);
}
--
static inline double horizontal_add (Vec4d const & a) {
__m256d t1 = _mm256_hadd_pd(a,a);
__m128d t2 = _mm256_extractf128_pd(t1,1);
__m128d t3 = _mm_add_sd(_mm256_castpd256_pd128(t1),t2);
return _mm_cvtsd_f64(t3);
}
AVX512
grep -A3 horizontal_add vectorf512.h
static inline float horizontal_add (Vec16f const & a) {
#if defined(__INTEL_COMPILER)
return _mm512_reduce_add_ps(a);
#else
return horizontal_add(a.get_low() + a.get_high());
#endif
}
--
static inline double horizontal_add (Vec8d const & a) {
#if defined(__INTEL_COMPILER)
return _mm512_reduce_add_pd(a);
#else
return horizontal_add(a.get_low() + a.get_high());
#endif
}
get_high() and get_low()
Vec8f get_high() const {
return _mm256_castpd_ps(_mm512_extractf64x4_pd(_mm512_castps_pd(zmm),1));
}
Vec8f get_low() const {
return _mm512_castps512_ps256(zmm);
}
Vec4d get_low() const {
return _mm512_castpd512_pd256(zmm);
}
Vec4d get_high() const {
return _mm512_extractf64x4_pd(zmm,1);
}
For integers look for horizontal_add in vectori128.h, vectori256.h, and vectori512.h.
You can also use the Vector Class Library (VCL) directly
#include <stdio.h>
#define MAX_VECTOR_SIZE 512
#include "vectorclass.h"
int main(void) {
float x[16]; for(int i=0;i<16;i++) x[i]=i+1;
Vec4f v4 = Vec4f().load(x);
Vec8f v8 = Vec8f().load(x);
Vec16f v16 = Vec16f().load(x);
printf("%f %d\n", horizontal_add(v4), 4*5/2);
printf("%f %d\n", horizontal_add(v8), 8*9/2);
printf("%f %d\n", horizontal_add(v16), 16*17/2);
}
Compile like this (GCC only my KNL is too old for AVX512)
SSE2: g++ -O3 test.cpp
AVX: g++ -O3 -mavx test.cpp
AVX512ER: icpc -O3 -xMIC-AVX512 test.cpp
output
10.000000 10
36.000000 36
136.000000 136
One nice thing with the VCL library is that if you use e.g. Vec8f with a system that only has SSE2 it will emulate AVX using SSE twice.
See the section "Instruction sets and CPU dispatching" in the vectorclass.pdf manual for how to compile for different instruction sets with MSVC, ICC, Clang, and GCC.

I have implemented the following inline function for AVX2. It sums all elements and returns the result. You can look this as a suggestion answer to develop your own function for this purpose.
Note: _mm256_extract_epi32 is not presented for AVX you can use your own method with vmovss such as float _mm256_cvtss_f32 (__m256 a) instead and develop your horizontal addition functions.
// my horizontal addition of epi32
inline int _mm256_hadd2_epi32(__m256i a)
{
__m256i a_hi;
a_hi = _mm256_permute2x128_si256(a, a, 1); //maybe it should be 4
a = _mm256_hadd_epi32(a, a_hi);
a = _mm256_hadd_epi32(a, a);
a = _mm256_hadd_epi32(a, a);
return _mm256_extract_epi32(a,0);
}

CRC Calculation Of A Mostly Static Data Stream

Background:
I have a section of memory, 1024 bytes. The last 1020 bytes will always be the same. The first 4 bytes will change (serial number of a product). I need to calculate the CRC-16 CCITT (0xFFFF starting, 0x1021 mask) for the entire section of memory, CRC_WHOLE.
Question:
Is it possible to calculate the CRC for only the first 4 bytes, CRC_A, then apply a function such as the one below to calculate the full CRC? We can assume that the checksum for the last 1020 bytes, CRC_B, is already known.
CRC_WHOLE = XOR(CRC_A, CRC_B)
I know that this formula does not work (tried it), but I am hoping that something similar exists.

Yes. You can see how in zlib's crc32_combine(). If you have two sequences A and B, then the pure CRC of AB is the exclusive-or of the CRC of A0 and the CRC of 0B, where the 0's represent a series of zero bytes with the length of the corresponding sequence, i.e. B and A respectively.
For your application, you can pre-compute a single operator that applies 1020 zeros to the CRC of your first four bytes very rapidly. Then you can exclusive-or that with the pre-computed CRC of the 1020 bytes.
Update:
Here is a post of mine from 2008 with a detailed explanation that #ArtemB discovered (that I had forgotten about):
crc32_combine() in zlib is based on two key tricks. For what follows,
we set aside the fact that the standard 32-bit CRC is pre and post-
conditioned. We can deal with that later. Assume for now a CRC that
has no such conditioning, and so starts with the register filled with
zeros.
Trick #1: CRCs are linear. So if you have stream X and stream Y of
the same length and exclusive-or the two streams bit-by-bit to get Z,
i.e. Z = X ^ Y (using the C notation for exclusive-or), then CRC(Z) =
CRC(X) ^ CRC(Y). For the problem at hand we have two streams A and B
of differing length that we want to concatenate into stream Z. What
we have available are CRC(A) and CRC(B). What we want is a quick way
to compute CRC(Z). The trick is to construct X = A concatenated with
length(B) zero bits, and Y = length(A) zero bits concatenated with B.
So if we represent concatenation simply by juxtaposition of the
symbols, X = A0, Y = 0B, then X^Y = Z = AB. Then we have CRC(Z) =
CRC(A0) ^ CRC(0B).
Now we need to know CRC(A0) and CRC(0B). CRC(0B) is easy. If we feed
a bunch of zeros to the CRC machine starting with zero, the register
is still filled with zeros. So it's as if we did nothing at all.
Therefore CRC(0B) = CRC(B).
CRC(A0) requires more work however. Taking a non-zero CRC and feeding
zeros to the CRC machine doesn't leave it alone. Every zero changes
the register contents. So to get CRC(A0), we need to set the register
to CRC(A), and then run length(B) zeros through it. Then we can
exclusive-or the result of that with CRC(B) = CRC(0B), and we get what
we want, which is CRC(Z) = CRC(AB). Voila!
Well, actually the voila is premature. I wasn't at all satisfied with
that answer. I didn't want a calculation that took a time
proportional to the length of B. That wouldn't save any time compared
to simply setting the register to CRC(A) and running the B stream
through. I figured there must be a faster way to compute the effect
of feeding n zeros into the CRC machine (where n = length(B)). So
that leads us to:
Trick #2: The CRC machine is a linear state machine. If we know the
linear transformation that occurs when we feed a zero to the machine,
then we can do operations on that transformation to more efficiently
find the transformation that results from feeding n zeros into the
machine.
The transformation of feeding a single zero bit into the CRC machine
is completely represented by a 32x32 binary matrix. To apply the
transformation we multiply the matrix by the register, taking the
register as a 32 bit column vector. For the matrix multiplication in
binary (i.e. over the Galois Field of 2), the role of multiplication
is played by and'ing, and the role of addition is played by exclusive-
or'ing.
There are a few different ways to construct the magic matrix that
represents the transformation caused by feeding the CRC machine a
single zero bit. One way is to observe that each column of the matrix
is what you get when your register starts off with a single one in
it. So the first column is what you get when the register is 100...
and then feed a zero, the second column comes from starting with
0100..., etc. (Those are referred to as basis vectors.) You can see
this simply by doing the matrix multiplication with those vectors.
The matrix multiplication selects the column of the matrix
corresponding to the location of the single one.
Now for the trick. Once we have the magic matrix, we can set aside
the initial register contents for a while, and instead use the
transformation for one zero to compute the transformation for n
zeros. We could just multiply n copies of the matrix together to get
the matrix for n zeros. But that's even worse than just running the n
zeros through the machine. However there's an easy way to avoid most
of those matrix multiplications to get the same answer. Suppose we
want to know the transformation for running eight zero bits, or one
byte through. Let's call the magic matrix that represents running one
zero through: M. We could do seven matrix multiplications to get R =
MxMxMxMxMxMxMxM. Instead, let's start with MxM and call that P. Then
PxP is MxMxMxM. Let's call that Q. Then QxQ is R. So now we've
reduced the seven multiplications to three. P = MxM, Q = PxP, and R =
QxQ.
Now I'm sure you get the idea for an arbitrary n number of zeros. We
can very rapidly generate transformation matrices Mk, where Mk is the
transformation for running 2k zeros through. (In the
paragraph above M3 is R.) We can make M1 through Mk with only k
matrix multiplications, starting with M0 = M. k only has to be as
large as the number of bits in the binary representation of n. We can
then pick those matrices where there are ones in the binary
representation of n and multiply them together to get the
transformation of running n zeros through the CRC machine. So if n =
13, compute M0 x M2 x M3.
If j is the number of one's in the binary representation of n, then we
just have j - 1 more matrix multiplications. So we have a total of k
j - 1 matrix multiplications, where j <= k = floor(logbase2(n)).
Now we take our rapidly constructed matrix for n zeros, and multiply
that by CRC(A) to get CRC(A0). We can compute CRC(A0) in O(log(n))
time, instead of O(n) time. We exclusive or that with CRC(B) and
Voila! (really this time), we have CRC(Z).
That's what zlib's crc32_combine() does.
I will leave it as an exercise for the reader as to how to deal with
the pre and post conditioning of the CRC register. You just need to
apply the linearity observations above. Hint: You don't need to know
length(A). In fact crc32_combine() only takes three arguments:
CRC(A), CRC(B), and length(B) (in bytes).

Below is example C code for an alternative approach for CRC(A0). Rather than working with a matrix, a CRC can be cycled forward n bits by muliplying (CRC · ((2^n)%POLY)%POLY . So the repeated squaring is performed on an integer rather than a matrix. If n is constant, then (2^n)%POLY can be pre-computed.
/* crcpad.c - crc - data has a large number of trailing zeroes */
#include <stdio.h>
#include <stdlib.h>
typedef unsigned char uint8_t;
typedef unsigned int uint32_t;
#define POLY (0x04c11db7u)
static uint32_t crctbl[256];
void GenTbl(void) /* generate crc table */
{
uint32_t crc;
uint32_t c;
uint32_t i;
for(c = 0; c < 0x100; c++){
crc = c<<24;
for(i = 0; i < 8; i++)
/* assumes twos complement */
crc = (crc<<1)^((0-(crc>>31))&POLY);
crctbl[c] = crc;
}
}
uint32_t GenCrc(uint8_t * bfr, size_t size) /* generate crc */
{
uint32_t crc = 0u;
while(size--)
crc = (crc<<8)^crctbl[(crc>>24)^*bfr++];
return(crc);
}
/* carryless multiply modulo crc */
uint32_t MpyModCrc(uint32_t a, uint32_t b) /* (a*b)%crc */
{
uint32_t pd = 0;
uint32_t i;
for(i = 0; i < 32; i++){
/* assumes twos complement */
pd = (pd<<1)^((0-(pd>>31))&POLY);
pd ^= (0-(b>>31))&a;
b <<= 1;
}
return pd;
}
/* exponentiate by repeated squaring modulo crc */
uint32_t PowModCrc(uint32_t p) /* pow(2,p)%crc */
{
uint32_t prd = 0x1u; /* current product */
uint32_t sqr = 0x2u; /* current square */
while(p){
if(p&1)
prd = MpyModCrc(prd, sqr);
sqr = MpyModCrc(sqr, sqr);
p >>= 1;
}
return prd;
}
/* # data bytes */
#define DAT ( 32)
/* # zero bytes */
#define PAD (992)
/* DATA+PAD */
#define CNT (1024)
int main()
{
uint32_t pmc;
uint32_t crc;
uint32_t crf;
uint32_t i;
uint8_t *msg = malloc(CNT);
for(i = 0; i < DAT; i++) /* generate msg */
msg[i] = (uint8_t)rand();
for( ; i < CNT; i++)
msg[i] = 0;
GenTbl(); /* generate crc table */
crc = GenCrc(msg, CNT); /* generate crc normally */
crf = GenCrc(msg, DAT); /* generate crc for data */
pmc = PowModCrc(PAD*8); /* pmc = pow(2,PAD*8)%crc */
crf = MpyModCrc(crf, pmc); /* crf = (crf*pmc)%crc */
printf("%08x %08x\n", crc, crf);
free(msg);
return 0;
}
Example C code using intrinsic for carryless multiply, pclmulqdq == _mm_clmulepi64_si128:
/* crcpadm.c - crc - data has a large number of trailing zeroes */
/* pclmulqdq intrinsic version */
#include <stdio.h>
#include <stdlib.h>
#include <intrin.h>
typedef unsigned char uint8_t;
typedef unsigned int uint32_t;
typedef unsigned long long uint64_t;
#define POLY (0x104c11db7ull)
#define POLYM ( 0x04c11db7u)
static uint32_t crctbl[256];
static __m128i poly; /* poly */
static __m128i invpoly; /* 2^64 / POLY */
void GenMPoly(void) /* generate __m12i8 poly info */
{
uint64_t N = 0x100000000ull;
uint64_t Q = 0;
for(size_t i = 0; i < 33; i++){
Q <<= 1;
if(N&0x100000000ull){
Q |= 1;
N ^= POLY;
}
N <<= 1;
}
poly.m128i_u64[0] = POLY;
invpoly.m128i_u64[0] = Q;
}
void GenTbl(void) /* generate crc table */
{
uint32_t crc;
uint32_t c;
uint32_t i;
for(c = 0; c < 0x100; c++){
crc = c<<24;
for(i = 0; i < 8; i++)
/* assumes twos complement */
crc = (crc<<1)^((0-(crc>>31))&POLYM);
crctbl[c] = crc;
}
}
uint32_t GenCrc(uint8_t * bfr, size_t size) /* generate crc */
{
uint32_t crc = 0u;
while(size--)
crc = (crc<<8)^crctbl[(crc>>24)^*bfr++];
return(crc);
}
/* carryless multiply modulo crc */
uint32_t MpyModCrc(uint32_t a, uint32_t b) /* (a*b)%crc */
{
__m128i ma, mb, mp, mt;
ma.m128i_u64[0] = a;
mb.m128i_u64[0] = b;
mp = _mm_clmulepi64_si128(ma, mb, 0x00); /* p[0] = a*b */
mt = _mm_clmulepi64_si128(mp, invpoly, 0x00); /* t[1] = (p[0]*((2^64)/POLY))>>64 */
mt = _mm_clmulepi64_si128(mt, poly, 0x01); /* t[0] = t[1]*POLY */
return mp.m128i_u32[0] ^ mt.m128i_u32[0]; /* ret = p[0] ^ t[0] */
}
/* exponentiate by repeated squaring modulo crc */
uint32_t PowModCrc(uint32_t p) /* pow(2,p)%crc */
{
uint32_t prd = 0x1u; /* current product */
uint32_t sqr = 0x2u; /* current square */
while(p){
if(p&1)
prd = MpyModCrc(prd, sqr);
sqr = MpyModCrc(sqr, sqr);
p >>= 1;
}
return prd;
}
/* # data bytes */
#define DAT ( 32)
/* # zero bytes */
#define PAD (992)
/* DATA+PAD */
#define CNT (1024)
int main()
{
uint32_t pmc;
uint32_t crc;
uint32_t crf;
uint32_t i;
uint8_t *msg = malloc(CNT);
GenMPoly(); /* generate __m128 polys */
GenTbl(); /* generate crc table */
for(i = 0; i < DAT; i++) /* generate msg */
msg[i] = (uint8_t)rand();
for( ; i < CNT; i++)
msg[i] = 0;
crc = GenCrc(msg, CNT); /* generate crc normally */
crf = GenCrc(msg, DAT); /* generate crc for data */
pmc = PowModCrc(PAD*8); /* pmc = pow(2,PAD*8)%crc */
crf = MpyModCrc(crf, pmc); /* crf = (crf*pmc)%crc */
printf("%08x %08x\n", crc, crf);
free(msg);
return 0;
}

Why is calling my C code from F# very slow (compared to native)?

So I wrote some numerical code in C but wanted to call it from F#. However it runs incredibly slowly.
Times:
gcc -O3 : 4 seconds
gcc -O0 : 30 seconds
fsharp code which calls the optimised gcc code: 2 minutes 30 seconds.
For reference, the c code is
int main(int argc, char** argv)
{
setvals(100,100,15,20.0,0.0504);
float* dmats = malloc(sizeof(float) * factor*factor);
MakeDmat(1.4,-1.92,dmats); //dmat appears to be correct
float* arr1 = malloc(sizeof(float)*xsize*ysize);
float* arr2 = malloc(sizeof(float)*xsize*ysize);
randinit(arr1);
for (int i = 0;i < 10000;i++)
{
evolve(arr1,arr2,dmats);
evolve(arr2,arr1,dmats);
if (i==9999) {print(arr1,xsize,ysize);};
}
return 0;
}
I left out the implementation of the functions. The F# code I am using is
open System.Runtime.InteropServices
open Microsoft.FSharp.NativeInterop
[<DllImport("a.dll")>] extern void main (int argc, char* argv)
[<DllImport("a.dll")>] extern void setvals (int _xsize, int _ysize, int _distlimit,float _tau,float _Iex)
[<DllImport("a.dll")>] extern void MakeDmat(float We,float Wi, float*arr)
[<DllImport("a.dll")>] extern void randinit(float* arr)
[<DllImport("a.dll")>] extern void print(float* arr)
[<DllImport("a.dll")>] extern void evolve (float* input, float* output,float* connections)
let dlimit,xsize,ysize = 15,100,100
let factor = (2*dlimit)+1
setvals(xsize,ysize,dlimit,20.0,0.0504)
let dmat = Array.zeroCreate (factor*factor)
MakeDmat(1.4,-1.92,&&dmat.[0])
let arr1 = Array.zeroCreate (xsize*ysize)
let arr2 = Array.zeroCreate (xsize*ysize)
let addr1 = &&arr1.[0]
let addr2 = &&arr2.[0]
let dmataddr = &&dmat.[0]
randinit(&&dmat.[0])
[0..10000] |> List.iter (fun _ ->
evolve(addr1,addr2,dmataddr)
evolve(addr2,addr1,dmataddr)
)
print(&&arr1.[0])
The F# code is compiled with optimisations on.
Is the mono interface for calling C code really that slow (almost 8ms of overhead per function call) or am I just doing something stupid?

It looks like part of the problem is that you are using float on both the F# and C side of the PInvoke signature. In F# float is really System.Double and hence is 8 bytes. In C a float is generally 4 bytes.
If this were running under the CLR I would expect you to see a PInvoke stack unbalanced error during debugging. I'm not sure if Mono has similar checks or not. But it's possible this is related to the problem you're seeing.

How to solve CUDA Thrust library - for_each synchronization error?

I'm trying to modify a simple dynamic vector in CUDA using the thrust library of CUDA. But I'm getting "launch_closure_by_value" error on the screen indicatiing that the error is related to some synchronization process.
A simple 1D dynamic array modification is not possible due to this error.
My code segment which is causing the error is as follows.
from a .cpp file I call setIndexedGrid, which is defined in System.cu
float* a= (float*)(malloc(8*sizeof(float)));
a[0]= 0; a[1]= 1; a[2]= 2; a[3]= 3; a[4]= 4; a[5]= 5; a[6]= 6; a[7]= 7;
float* b = (float*)(malloc(8*sizeof(float)));
setIndexedGridInfo(a,b);
The code segment at System.cu:
void
setIndexedGridInfo(float* a, float*b)
{
thrust::device_ptr<float> d_oldData(a);
thrust::device_ptr<float> d_newData(b);
float c = 0.0;
thrust::for_each(
thrust::make_zip_iterator(thrust::make_tuple(d_oldData,d_newData)),
thrust::make_zip_iterator(thrust::make_tuple(d_oldData+8,d_newData+8)),
grid_functor(c));
}
grid_functor is defined in _kernel.cu
struct grid_functor
{
float a;
__host__ __device__
grid_functor(float grid_Info) : a(grid_Info) {}
template <typename Tuple>
__device__
void operator()(Tuple t)
{
volatile float data = thrust::get<0>(t);
float pos = data + 0.1;
thrust::get<1>(t) = pos;
}
};
I also get these on the Output window (I use Visual Studio):
First-chance exception at 0x000007fefdc7cacd in Particles.exe:
Microsoft C++ exception: cudaError_enum at memory location
0x0029eb60.. First-chance exception at 0x000007fefdc7cacd in
smokeParticles.exe: Microsoft C++ exception:
thrust::system::system_error at memory location 0x0029ecf0.. Unhandled
exception at 0x000007fefdc7cacd in Particles.exe: Microsoft C++
exception: thrust::system::system_error at memory location
0x0029ecf0..
What is causing the problem?

You are trying to use host memory pointers in functions expecting pointers in device memory. This code is the problem:
float* a= (float*)(malloc(8*sizeof(float)));
a[0]= 0; a[1]= 1; a[2]= 2; a[3]= 3; a[4]= 4; a[5]= 5; a[6]= 6; a[7]= 7;
float* b = (float*)(malloc(8*sizeof(float)));
setIndexedGridInfo(a,b);
.....
thrust::device_ptr<float> d_oldData(a);
thrust::device_ptr<float> d_newData(b);
The thrust::device_ptr is intended for "wrapping" a device memory pointer allocated with the CUDA API so that thrust can use it. You are trying to treat a host pointer directly as a device pointer. That is illegal. You could modify your setIndexedGridInfo function like this:
void setIndexedGridInfo(float* a, float*b, const int n)
{
thrust::device_vector<float> d_oldData(a,a+n);
thrust::device_vector<float> d_newData(b,b+n);
float c = 0.0;
thrust::for_each(
thrust::make_zip_iterator(thrust::make_tuple(d_oldData.begin(),d_newData.begin())),
thrust::make_zip_iterator(thrust::make_tuple(d_oldData.end(),d_newData.end())),
grid_functor(c));
}
The device_vector constructor will allocate device memory and then copy the contents of your host memory to the device. That should fix the error you are seeing, although I am not sure what you are trying to do with the for_each iterator and whether the functor you have wrttien is correct.
Edit:
Here is a complete, compilable, runnable version of your code:
#include <cstdlib>
#include <cstdio>
#include <thrust/device_vector.h>
#include <thrust/for_each.h>
#include <thrust/copy.h>
struct grid_functor
{
float a;
__host__ __device__
grid_functor(float grid_Info) : a(grid_Info) {}
template <typename Tuple>
__device__
void operator()(Tuple t)
{
volatile float data = thrust::get<0>(t);
float pos = data + 0.1f;
thrust::get<1>(t) = pos;
}
};
void setIndexedGridInfo(float* a, float*b, const int n)
{
thrust::device_vector<float> d_oldData(a,a+n);
thrust::device_vector<float> d_newData(b,b+n);
float c = 0.0;
thrust::for_each(
thrust::make_zip_iterator(thrust::make_tuple(d_oldData.begin(),d_newData.begin())),
thrust::make_zip_iterator(thrust::make_tuple(d_oldData.end(),d_newData.end())),
grid_functor(c));
thrust::copy(d_newData.begin(), d_newData.end(), b);
}
int main(void)
{
const int n = 8;
float* a= (float*)(malloc(n*sizeof(float)));
a[0]= 0; a[1]= 1; a[2]= 2; a[3]= 3; a[4]= 4; a[5]= 5; a[6]= 6; a[7]= 7;
float* b = (float*)(malloc(n*sizeof(float)));
setIndexedGridInfo(a,b,n);
for(int i=0; i<n; i++) {
fprintf(stdout, "%d (%f,%f)\n", i, a[i], b[i]);
}
return 0;
}
I can compile and run this code on an OS 10.6.8 host with CUDA 4.1 like this:
$ nvcc -Xptxas="-v" -arch=sm_12 -g -G thrustforeach.cu
./thrustforeach.cu(18): Warning: Cannot tell what pointer points to, assuming global memory space
./thrustforeach.cu(20): Warning: Cannot tell what pointer points to, assuming global memory space
./thrustforeach.cu(18): Warning: Cannot tell what pointer points to, assuming global memory space
./thrustforeach.cu(20): Warning: Cannot tell what pointer points to, assuming global memory space
ptxas info : Compiling entry function '_ZN6thrust6detail7backend4cuda6detail23launch_closure_by_valueINS2_18for_each_n_closureINS_12zip_iteratorINS_5tupleINS0_15normal_iteratorINS_10device_ptrIfEEEESB_NS_9null_typeESC_SC_SC_SC_SC_SC_SC_EEEEi12grid_functorEEEEvT_' for 'sm_12'
ptxas info : Used 14 registers, 160+0 bytes lmem, 16+16 bytes smem, 4 bytes cmem[1]
ptxas info : Compiling entry function '_ZN6thrust6detail7backend4cuda6detail23launch_closure_by_valueINS2_18for_each_n_closureINS_12zip_iteratorINS_5tupleINS0_15normal_iteratorINS_10device_ptrIfEEEESB_NS_9null_typeESC_SC_SC_SC_SC_SC_SC_EEEEj12grid_functorEEEEvT_' for 'sm_12'
ptxas info : Used 14 registers, 160+0 bytes lmem, 16+16 bytes smem, 4 bytes cmem[1]
$ ./a.out
0 (0.000000,0.100000)
1 (1.000000,1.100000)
2 (2.000000,2.100000)
3 (3.000000,3.100000)
4 (4.000000,4.100000)
5 (5.000000,5.100000)
6 (6.000000,6.100000)
7 (7.000000,7.100000)

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart

SIMD Black-Scholes implementation: why is _mm256_set1_pd annihilating my performance? [duplicate] - vectorization

Related

Can Lua send extended function Keys? ex F13-F24

How to sum all 32-bit or 64-bit sub-registers in an SSE XMM, or AVX YMM, and ZMM register?

CRC Calculation Of A Mostly Static Data Stream

Why is calling my C code from F# very slow (compared to native)?

How to solve CUDA Thrust library - for_each synchronization error?

Categories

Resources