SIMD: Bit-pack signed integers - sse

Unsigned integers can be compressed by using "bit-packing" techniques: Within a block of unsigned integers only the significant bits are stored, resulting in data compression when all integers in a block are "small". The method is known as FOR (frame of reference).
There are SIMD libraries that do this very efficiently.
Now I want to use FOR-like techniques to encode signed integers, e.g. from a differenced sequence of unsorted unsigned integers. The sign of each signed integer needs to be stored somewhere, there are two options:
1. Store the signs in a separate block of data. This adds overhead.
2. Store the sign together with the absolute value of each signed integer.
I'm following path 2 right now. Two's complement has the sign bit in the msb (most significant bit), so that won't work for bit-packing à la FOR. One possibility is to store the sign in the lsb (least significant bit). Storing signed integers this way is very unusual, and as far as I know there are no instructions that support it directly. The question now is: can these lsb-signed integers be encoded/decoded efficiently using SIMD instructions?
I think AVX-512 _mm_testn_epi32_mask can be used to extract the lsb from each uint32, followed by a shift, then two mask_extract of some sort? Quite convoluted.

Untested ZigZag examples in C using SSE2 for 64-bit integers:
(note: SSE2 is missing some 64-bit instructions...)
#include <emmintrin.h>
// from comment by Peter-Cordes
__m128i zigzag_encode_epi64(__m128i v) {
    // copy the high dword of each 64-bit lane into both halves, then smear its sign bit
    __m128i signmask = _mm_shuffle_epi32(v, _MM_SHUFFLE(3,3,1,1));
    signmask = _mm_srai_epi32(signmask, 31);
    return _mm_xor_si128(_mm_add_epi64(v, v), signmask);
}
__m128i zigzag_decode_epi64 (__m128i v) {
    // isolate the lsb of each 64-bit lane and negate it: 0 or all-ones
    __m128i signmask = _mm_and_si128(_mm_set_epi32(0, 1, 0, 1), v);
    signmask = _mm_sub_epi64(_mm_setzero_si128(), signmask);
    return _mm_xor_si128(_mm_srli_epi64(v, 1), signmask);
}
// no constant
__m128i zigzag_decodev3_epi64 (__m128i v) {
    __m128i t = _mm_srli_epi64(v, 1);
    __m128i signmask = _mm_sub_epi64(_mm_slli_epi64(t, 1), v); // -(v & 1)
    return _mm_xor_si128(t, signmask);
}
Zigzag is good for bitwise varints. However, a bytewise group-varint may wish to "sign extend from a variable bit-width".
32-bit examples
I favored compares over arithmetic shifts. I assume - when unrolled - compares will have 1 cycle lower latency.
__m128i zigzag_encode_epi32 (__m128i v) {
    __m128i signmask = _mm_cmpgt_epi32(_mm_setzero_si128(), v); // all-ones where v < 0
    return _mm_xor_si128(_mm_add_epi32(v, v), signmask);
}
__m128i zigzag_decode_epi32 (__m128i v) {
    const __m128i m = _mm_set1_epi32(1);
    __m128i signmask = _mm_cmpeq_epi32(_mm_and_si128(m, v), m); // all-ones where the lsb is set
    return _mm_xor_si128(_mm_srli_epi32(v, 1), signmask);
}
// note: _mm_alignr_epi8 is SSSE3 (<tmmintrin.h>), not SSE2
__m128i delta_encode_epi32 (__m128i v, __m128i prev) {
    return _mm_sub_epi32(v, _mm_alignr_epi8(v, prev, 12));
}
// prefix sum (see the many answers around Stack Overflow...)
__m128i delta_decode_epi32 (__m128i v, __m128i prev) {
    prev = _mm_shuffle_epi32(prev, _MM_SHUFFLE(3,3,3,3)); // [P P P P]
    v = _mm_add_epi32(v, _mm_slli_si128(v, 4)); // [A AB BC CD]
    prev = _mm_add_epi32(prev, v); // [PA PAB PBC PCD]
    v = _mm_slli_si128(v, 8); // [0 0 A AB]
    return _mm_add_epi32(prev, v); // [PA PAB PABC PABCD]
}
__m128i delta_zigzag_encode_epi32 (__m128i v, __m128i prev) {
    return zigzag_encode_epi32(delta_encode_epi32(v, prev));
}
__m128i delta_zigzag_decode_epi32 (__m128i v, __m128i prev) {
    return delta_decode_epi32(zigzag_decode_epi32(v), prev);
}
Note: Delta coding would be faster (for the round trip, and for decoding in particular) if the elements were transposed while encoding and transposed back during decoding, because horizontal prefix sums are really slow. However, determining the optimum number of elements to transpose in each batch seems like a hard problem.


How to Calculate CRC Starting at Last Byte

I'm trying to implement a CRC-CCITT calculator in VHDL. I was able to initially do that; however, I recently found out that data is delivered starting at the least-significant byte. In my code, data is transmitted 7 bytes at a time through a frame. So let's say we have the following data: 123456789 in ASCII or 313233343536373839 in hex. The data would be transmitted as such (with the following CRC):
-- First frame of data
RxFrame.Data <= (
    1 => x"39", -- LSB
    2 => x"38",
    3 => x"37",
    4 => x"36",
    5 => x"35",
    6 => x"34",
    7 => x"33"
);
-- Second/last frame of data
RxFrame.Data <= (
    1 => x"32",
    2 => x"31", -- MSB
    3 => xx, -- "xx" means irrelevant data, not part of CRC calculation.
    4 => xx, -- This occurs only in the last frame, when it is specified in
    5 => xx, -- byte 0 which bytes contain data
    6 => xx,
    7 => xx
);
-- Calculated CRC should be 0x31C3
Another example with data 0x4376669A1CFC048321313233343536373839 and its correct CRC is shown below:
-- First incoming frame of data
RxFrame.Data <= (
    1 => x"39", -- LSB
    2 => x"38",
    3 => x"37",
    4 => x"36",
    5 => x"35",
    6 => x"34",
    7 => x"33"
);
-- Second incoming frame of data
RxFrame.Data <= (
    1 => x"32",
    2 => x"31",
    3 => x"21",
    4 => x"83",
    5 => x"04",
    6 => x"FC",
    7 => x"1C"
);
-- Third/last incoming frame of data
RxFrame.Data <= (
    1 => x"9A",
    2 => x"66",
    3 => x"76",
    4 => x"43", -- MSB
    5 => xx, -- Irrelevant data, specified in byte 0
    6 => xx,
    7 => xx
);
-- Calculated CRC should be 0x2848
Is there a concept I'm missing? Is there a way to calculate the CRC with the data being received in reverse order? I am implementing this for CANopen SDO block protocols. Thanks!
CRC calculation algorithm to verify SDO block transfer from CANopen standard
Example code to generate a CRC16 with the bytes read in reverse (last byte first), using a function to do a carryless multiply modulo the CRC polynomial. An explanation follows.
#include <stdio.h>
#include <stddef.h> /* size_t */
typedef unsigned char uint8_t;
typedef unsigned short uint16_t;
#define POLY (0x1021u)
/* carryless multiply modulo crc polynomial */
uint16_t MpyModPoly(uint16_t a, uint16_t b) /* (a*b)%poly */
{
    uint16_t pd = 0;
    uint16_t i;
    for(i = 0; i < 16; i++){
        /* assumes twos complement */
        pd = (pd<<1)^((0-(pd>>15))&POLY);
        pd ^= (0-(b>>15))&a;
        b <<= 1;
    }
    return pd;
}
/* generate crc in reverse byte order */
uint16_t Crc16R(uint8_t * b, size_t sz)
{
    uint8_t *e = b + sz; /* end of bfr ptr */
    uint16_t crc = 0u; /* crc */
    uint16_t pdm = 0x100u; /* padding multiplier */
    while(e > b){ /* generate crc */
        pdm = MpyModPoly(0x100, pdm);
        crc ^= MpyModPoly( *--e, pdm);
    }
    return crc;
}
/* msg will be processed in reverse order */
/* (second example from the question; expected crc 0x2848) */
static uint8_t msg[] = {0x43,0x76,0x66,0x9A,0x1C,0xFC,0x04,0x83,
                        0x21,0x31,0x32,0x33,0x34,0x35,0x36,0x37,
                        0x38,0x39};
int main()
{
    uint16_t crc;
    crc = Crc16R(msg, sizeof(msg));
    printf("%04x\n", crc);
    return 0;
}
Example code using X86 xmm pclmulqdq and psrlq, to emulate a 16 bit by 16 bit hardware (VHDL) carryless multiply:
#include <stdint.h>
#include <stddef.h>
#include <wmmintrin.h> /* _mm_clmulepi64_si128 (pclmulqdq) */
/* __m128i is the intrinsic type for an X86 128 bit xmm register */
/* note: the .m128i_u16 / .m128i_u32 / .m128i_u64 members are MSVC specific */
static __m128i poly = {.m128i_u32[0] = 0x00011021u}; /* poly */
static __m128i invpoly = {.m128i_u32[0] = 0x00008898u}; /* 2^31 / poly */
/* carryless multiply modulo crc polynomial */
/* using xmm pclmulqdq and psrlq */
uint16_t MpyModPoly(uint16_t a, uint16_t b)
{
    __m128i ma, mb, mp, mt;
    ma.m128i_u64[0] = a;
    mb.m128i_u64[0] = b;
    mp = _mm_clmulepi64_si128(ma, mb, 0x00); /* mp = a*b */
    mt = _mm_srli_epi64(mp, 16); /* mt = mp>>16 */
    mt = _mm_clmulepi64_si128(mt, invpoly, 0x00); /* mt = mt*ipoly */
    mt = _mm_srli_epi64(mt, 15); /* mt = mt>>15 = (a*b)/poly */
    mt = _mm_clmulepi64_si128(mt, poly, 0x00); /* mt = mt*poly */
    return mp.m128i_u16[0] ^ mt.m128i_u16[0]; /* ret mp^mt */
}
/* external code to generate invpoly */
#define POLY (0x11021u)
static __m128i invpoly; /* 2^31 / poly */
void GenMPoly(void) /* generate __m128i invpoly */
{
    uint32_t N = 0x10000u; /* numerator = x^16 */
    uint32_t Q = 0; /* quotient = 0 */
    for(size_t i = 0; i <= 15; i++){ /* 31 - 16 = 15 */
        Q <<= 1;
        if(N & 0x10000u){ /* x^16 term set: emit a quotient bit, subtract poly */
            Q |= 1;
            N ^= POLY;
        }
        N <<= 1;
    }
    invpoly.m128i_u16[0] = Q;
}
Explanation: consider the data as separate strings of ever increasing length, padded with zeroes at the end. For the first few bytes of your example, the logic would calculate
CRC = CRC16({39})
CRC ^= CRC16({38 00})
CRC ^= CRC16({37 00 00})
CRC ^= CRC16({36 00 00 00})
To speed up this calculation, rather than actually pad with n zero bytes, you can do a carryless multiply of a CRC by 2^{n·8} modulo POLY, where POLY is the 17 bit polynomial used for CRC16 (the exponents below are written in hex, so 2^10 means 2^16):
CRC = CRC16({39})
CRC ^= (CRC16({38}) · (2^08 % POLY)) % POLY
CRC ^= (CRC16({37}) · (2^10 % POLY)) % POLY
CRC ^= (CRC16({36}) · (2^18 % POLY)) % POLY
A carryless multiply modulo POLY is equivalent to what CRC16 does, so this translates into pseudo code (all values in hex, 2^8 = 100)
CRC = 0
PDM = 100 ;padding multiplier
PDM = (100 · PDM) % POLY ;main loop (2 lines per byte)
CRC ^= ( 39 · PDM) % POLY
PDM = (100 · PDM) % POLY
CRC ^= ( 38 · PDM) % POLY
PDM = (100 · PDM) % POLY
CRC ^= ( 37 · PDM) % POLY
PDM = (100 · PDM) % POLY
CRC ^= ( 36 · PDM) % POLY
Implementing (A · B) % POLY is based on binary math:
(A · B) % POLY = (A · B) ^ (((A · B) / POLY) · POLY)
Where multiply is carryless (XOR instead of add) and divide is borrowless (XOR instead of subtract). Since the divide is borrowless, and most significant term of POLY is x^16, the quotient
Q = (A · B) / POLY
only depends on the upper 16 bits of (A · B). Dividing by POLY uses multiplication by the 16 bit constant IPOLY = (2^31)/POLY followed by a right shift:
Q = (A · B) / POLY = (((A · B) >> 16) · IPOLY) >> 15
The process uses a 16 bit by 16 bit carryless multiply, producing a 31 bit product.
POLY = 0x11021u ; CRC polynomial (17 bit)
IPOLY = 0x08898u ; 2^31 / POLY
; generated by external software
MpyModPoly(A, B)
MP = A · B ; MP = A · B
MT = MP >> 16 ; MT = MP >> 16
MT = MT · IPOLY ; MT = MT · IPOLY
MT = MT >> 15 ; MT = (A · B) / POLY
MT = MT · POLY ; MT = ((A · B) / POLY) · POLY
return MP xor MT ; (A·B) ^ (((A · B) / POLY) · POLY), lower 16 bits
A hardware based carryless multiply would look something like this 4 bit · 4 bit example.
p[] = [a3 a2 a1 a0] · [b3 b2 b1 b0]
p[] is a 7 bit product generated with 7 parallel circuits.
The time for multiply would be worst case propagation time for p3.
p6 = a3&b3
p5 = a3&b2 ^ a2&b3
p4 = a3&b1 ^ a2&b2 ^ a1&b3
p3 = a3&b0 ^ a2&b1 ^ a1&b2 ^ a0&b3
p2 = a2&b0 ^ a1&b1 ^ a0&b2
p1 = a1&b0 ^ a0&b1
p0 = a0&b0
If the xor gates available only have 2 bit inputs, the logic can
be split up. For example:
p3 = (a3&b0 ^ a2&b1) ^ (a1&b2 ^ a0&b3)
I don't know if your VHDL toolset includes a library for carryless multiply. For a 16 bit by 16 bit multiply resulting in a 31 bit product (p30 to p00), p15 is the worst case: it combines the outputs of 16 AND gates (computed in parallel), which can be XORed using a tree-like structure, with 8 XORs in parallel feeding into 4 XORs in parallel, feeding into 2 XORs in parallel, feeding into a single XOR. So the propagation time would be that of 1 AND plus 4 XORs.
Here is an example in C that you can adapt. Since you mentioned VHDL, this is a bit-wise implementation suitable for casting into gates and flip-flops. However, if cycles are more precious to you than memory and gates, then there is also a byte-wise table-driven version that would run in 1/8 the number of cycles.
What this does is the inverse of what is done in a normal CRC calculation. It then applies the same size input in zeros with a normal CRC to get what the normal CRC would have been on that input. Running the zeros through takes the same number of cycles as the inverse CRC, i.e. O(n) where n is the size of the input. If that latency is too large, that can be reduced to O(log n) cycles, with some investment in gates.
#include <stddef.h>
// Update crc with the CRC-16/XMODEM of n zero bytes. (This can be done in
// O(log n) time or cycles instead of O(n), with a little more effort.)
static unsigned crc16x_zeros_bit(unsigned crc, size_t n) {
    for (size_t i = 0; i < n; i++)
        for (int k = 0; k < 8; k++)
            crc = crc & 0x8000 ? (crc << 1) ^ 0x1021 : crc << 1;
    return crc & 0xffff;
}
// Update crc with the CRC-16/XMODEM of the len bytes at mem in reverse. If mem
// is NULL, then return the initial value for the CRC. When done,
// crc16x_zeros_bit() must be used to apply the total length of zero bytes, in
// order to get what the CRC would have been if it were calculated on the bytes
// fed in the opposite order.
static unsigned crc16x_inverse_bit(unsigned crc, void const *mem, size_t len) {
    unsigned char const *data = mem;
    if (data == NULL)
        return 0;
    crc &= 0xffff;
    for (size_t i = 0; i < len; i++) {
        for (int k = 0; k < 8; k++)
            crc = crc & 1 ? (crc >> 1) ^ 0x8810 : crc >> 1;
        crc ^= (unsigned)data[i] << 8;
    }
    return crc;
}
#include <stdio.h>
int main(void) {
    // Do framed example.
    unsigned crc = crc16x_inverse_bit(0, NULL, 0);
    crc = crc16x_inverse_bit(crc, (void const *)"9876543", 7);
    crc = crc16x_inverse_bit(crc, (void const *)"21", 2);
    crc = crc16x_zeros_bit(crc, 9);
    printf("%04x\n", crc);
    // Do another one.
    crc = crc16x_inverse_bit(0, NULL, 0);
    crc = crc16x_inverse_bit(crc, (void const *)"9876543", 7);
    crc = crc16x_inverse_bit(crc, (void const *)"21!\x83\x04\xfc\x1c", 7);
    crc = crc16x_inverse_bit(crc, (void const *)"\x9a" "fvC", 4);
    crc = crc16x_zeros_bit(crc, 18);
    printf("%04x\n", crc);
    return 0;
}
Here is the O(log n) version of crc16x_zeros_bit():
// Return a(x) multiplied by b(x) modulo p(x), where p(x) is the CRC
// polynomial. For speed, a cannot be zero.
static inline unsigned multmodp(unsigned a, unsigned b) {
    unsigned p = 0;
    for (;;) {
        if (a & 1) {
            p ^= b;
            if (a == 1)
                break; // no more bits left in a
        }
        a >>= 1;
        b = b & 0x8000 ? (b << 1) ^ 0x1021 : b << 1;
    }
    return p & 0xffff;
}
// Return x^(8n) modulo p(x).
static unsigned x2nmodp(size_t n) {
    unsigned p = 1; // x^0 == 1
    unsigned q = 0x10; // x^2^2
    while (n) {
        q = multmodp(q, q); // x^2^k mod p(x), k = 3,4,...
        if (n & 1)
            p = multmodp(q, p);
        n >>= 1;
    }
    return p;
}
// Update crc with the CRC-16/XMODEM of n zero bytes.
static unsigned crc16x_zeros_bit(unsigned crc, size_t n) {
    return multmodp(x2nmodp(n), crc);
}

SIMD Black-Scholes implementation: why is _mm256_set1_pd annihilating my performance? [duplicate]

I have a function in this form (From Fastest Implementation of Exponential Function Using SSE):
__m128 FastExpSse(__m128 x)
{
    static __m128 const a = _mm_set1_ps(12102203.2f); // (1 << 23) / ln(2)
    static __m128i const b = _mm_set1_epi32(127 * (1 << 23) - 486411);
    static __m128 const m87 = _mm_set1_ps(-87);
    // fast exponential function, x should be in [-87, 87]
    __m128 mask = _mm_cmpge_ps(x, m87);
    __m128i tmp = _mm_add_epi32(_mm_cvtps_epi32(_mm_mul_ps(a, x)), b);
    return _mm_and_ps(_mm_castsi128_ps(tmp), mask);
}
I want to make it C compatible.
However, a C compiler doesn't accept the form static __m128i const b = _mm_set1_epi32(127 * (1 << 23) - 486411);.
At the same time, I don't want the first 3 values to be recalculated in each function call.
One solution is to inline the function (but sometimes compilers decline to do that).
Is there a C-style way to achieve this in case the function isn't inlined?
Thank You.
Remove static and const.
Also remove them from the C++ version. const is OK, but static is horrible, introducing guard variables that are checked every time, and a very expensive initialization the first time.
__m128 a = _mm_set1_ps(12102203.2f); is not a function call, it's just a way to express a vector constant. No time can be saved by "doing it only once" - it normally happens zero times, with the constant vector being prepared in the data segment of the program and simply being loaded at runtime, without the junk around it that static introduces.
Check the asm to be sure, without static this is what happens: (from godbolt)
FastExpSse(float __vector(4)):
    movaps xmm1, XMMWORD PTR .LC0[rip]
    cmpleps xmm1, xmm0
    mulps xmm0, XMMWORD PTR .LC1[rip]
    cvtps2dq xmm0, xmm0
    paddd xmm0, XMMWORD PTR .LC2[rip]
    andps xmm0, xmm1
    ret
.LC0:
    .long 3266183168
    .long 3266183168
    .long 3266183168
    .long 3266183168
.LC1:
    .long 1262004795
    .long 1262004795
    .long 1262004795
    .long 1262004795
.LC2:
    .long 1064866805
    .long 1064866805
    .long 1064866805
    .long 1064866805
_mm_set1_ps(-87); or any other _mm_set intrinsic is not a valid static initializer with current compilers, because it's not treated as a constant expression.
In C++, it compiles to runtime initialization of the static storage location (copying from a vector literal somewhere else). And if it's a static __m128 inside a function, there's a guard variable to protect it.
In C, it simply refuses to compile, because C doesn't support non-constant initializers / constructors. _mm_set is not like a braced initializer for the underlying GNU C native vector, like @benjarobin's answer shows.
This is really dumb, and seems to be a missed-optimization in all 4 mainstream x86 C++ compilers (gcc/clang/ICC/MSVC). Even if it somehow matters that each static const __m128 var have a distinct address, the compiler could achieve that by using initialized read-only storage instead of copying at runtime.
So it seems like constant propagation fails to go all the way to turning _mm_set into a constant initializer even when optimization is enabled.
Never use static const __m128 var = _mm_set... even in C++; it's inefficient.
Inside a function is even worse, but global scope is still bad.
Instead, avoid static. You can still use const to stop yourself from accidentally assigning something else, and to tell human readers that it's a constant. Without static, it has no effect on where/how your variable is stored. const on automatic storage just does compile-time checking that you don't modify the object.
const __m128 var = _mm_set1_ps(-87); // not static
Compilers are good at this, and will optimize the case where multiple functions use the same vector constant, the same way they de-duplicate string literals and put them in read-only memory.
Defining constants this way inside small helper functions is fine: compilers will hoist the constant-setup out of a loop after inlining the function.
It also lets compilers optimize away the full 16 bytes of storage, and load it with vbroadcastss xmm0, dword [mem], or stuff like that.
This solution is clearly not portable, it's working with GCC 8 (only tested with this compiler):
#include <stdio.h>
#include <stdint.h>
#include <emmintrin.h>
#include <string.h>
#define INIT_M128(vFloat) {(vFloat), (vFloat), (vFloat), (vFloat)}
#define INIT_M128I(vU32) {((uint64_t)(vU32) | (uint64_t)(vU32) << 32u), ((uint64_t)(vU32) | (uint64_t)(vU32) << 32u)}
static void print128(const void *p)
{
    unsigned char buf[16];
    memcpy(buf, p, 16);
    for (int i = 0; i < 16; ++i)
    {
        printf("%02X ", buf[i]);
    }
    printf("\n");
}
int main(void)
{
    static __m128 const glob_a = INIT_M128(12102203.2f);
    static __m128i const glob_b = INIT_M128I(127 * (1 << 23) - 486411);
    static __m128 const glob_m87 = INIT_M128(-87.0f);
    __m128 a = _mm_set1_ps(12102203.2f);
    __m128i b = _mm_set1_epi32(127 * (1 << 23) - 486411);
    __m128 m87 = _mm_set1_ps(-87);
    /* each pair of lines should print identical bytes */
    print128(&a);
    print128(&glob_a);
    print128(&b);
    print128(&glob_b);
    print128(&m87);
    print128(&glob_m87);
    return 0;
}
As explained in the answer of @harold, the following code (C only, built with or without WITHSTATIC defined) produces exactly the same machine code.
#include <stdio.h>
#include <stdint.h>
#include <emmintrin.h>
#include <string.h>
#define INIT_M128(vFloat) {(vFloat), (vFloat), (vFloat), (vFloat)}
#define INIT_M128I(vU32) {((uint64_t)(vU32) | (uint64_t)(vU32) << 32u), ((uint64_t)(vU32) | (uint64_t)(vU32) << 32u)}
__m128 FastExpSse2(__m128 x)
{
#ifdef WITHSTATIC
    static __m128 const a = INIT_M128(12102203.2f);
    static __m128i const b = INIT_M128I(127 * (1 << 23) - 486411);
    static __m128 const m87 = INIT_M128(-87.0f);
#else
    __m128 a = _mm_set1_ps(12102203.2f);
    __m128i b = _mm_set1_epi32(127 * (1 << 23) - 486411);
    __m128 m87 = _mm_set1_ps(-87);
#endif
    __m128 mask = _mm_cmpge_ps(x, m87);
    __m128i tmp = _mm_add_epi32(_mm_cvtps_epi32(_mm_mul_ps(a, x)), b);
    return _mm_and_ps(_mm_castsi128_ps(tmp), mask);
}
So in summary it's better to remove the static and const keywords: this gives better and simpler code in C++, and in C the code stays portable, whereas my proposed hack is not really portable.

Inner product of two 16bit integer vectors with AVX2 in C++

I am searching for the most efficient way to multiply two aligned int16_t arrays whose length can be divided by 16 with AVX2.
After multiplication into a vector x I started with _mm256_extracti128_si256 and _mm256_castsi256_si128 to have the low and high part of x and added them with _mm_add_epi16.
I copied the result register and applied _mm_move_epi64 to the original register and added both again with _mm_add_epi16. Now, I think that I have:
-, -, -, -, x15+x7+x11+x3, x14+x6+x10+x2, x13+x5+x9+x1, x12+x4+x8+x0
within the 128bit register. But now I am stuck and don't know how to efficiently sum up the remaining four entries and how to extract the 16bit result.
Following the comments and hours of google my working solution:
// AVX multiply
hash = 1;
start1 = std::chrono::high_resolution_clock::now();
for(int i=0; i<2000000; i++) {
    ZTYPE* xv =;
    ZTYPE* yv =;
    __m256i tres = _mm256_setzero_si256();
    for(int ii=0; ii < MAX_SIEVING_DIM; ii = ii+16/*8*/)
    {
        // editor's note: alignment required. Use loadu for unaligned
        __m256i xr = _mm256_load_si256((__m256i*)(xv+ii));
        __m256i yr = _mm256_load_si256((__m256i*)(yv+ii));
        const __m256i tmp = _mm256_madd_epi16 (xr, yr);
        tres = _mm256_add_epi32(tmp, tres);
    }
    // Reduction
    const __m128i x128 = _mm_add_epi32 ( _mm256_extracti128_si256(tres, 1), _mm256_castsi256_si128(tres));
    const __m128i x128_up = _mm_shuffle_epi32(x128, 78);
    const __m128i x64 = _mm_add_epi32 (x128, x128_up);
    const __m128i _x32 = _mm_hadd_epi32(x64, x64);
    const int res = _mm_extract_epi32(_x32, 0);
    hash |= res;
}
finish1 = std::chrono::high_resolution_clock::now();
elapsed1 = finish1 - start1;
std::cout << "AVX multiply: " <<elapsed1.count() << " sec. (" << hash << ")" << std::endl;
It is at least the fastest solution so far:
std::inner_product: 0.819781 sec. (-14335)
std::inner_product (aligned): 0.964058 sec. (-14335)
naive multiply: 0.588623 sec. (-14335)
Unroll multiply: 0.505639 sec. (-14335)
AVX multiply: 0.0488352 sec. (-14335)

interface OpenCV's Mat containers with blas for matrix multiplication

I am processing UHD (2160 x 3840) images.
One of the processing steps consists of applying a Sobel filter along the X and Y axes, multiplying each output matrix by its transpose, and then computing the gradient image as the square root of the sum of the two products.
So: S = sqrt( S_x * S_x^t + S_y * S_y^t ).
Due to the dimensions of the image, OpenCV takes up to twenty seconds to process that without multithreading and ten seconds with multithreading.
I know OpenCV calls OpenCL to speed up the filtering operations, so I think it would take a long time to try to gain performance in the filtering step.
For the matrix multiplication I experience a kind of instability from OpenCV's OpenCL gemm kernel implementation.
So I would like to try to use OpenBLAS instead.
My questions are:
I wrote the following code, but I face some issues interfacing OpenCV's Mat objects:
template<class _Ty>
void mm(cv::Mat& A,cv::Mat& B,cv::Mat& C)
{
    static_assert(true,"support matrix_multiply is only defined for floating precision numbers.");
}
template<>
inline void mm<float>(cv::Mat& A,cv::Mat& B,cv::Mat& C)
{
    const int M = A.rows;
    const int N = B.cols;
    const int K = A.cols;
    cblas_sgemm( CblasRowMajor ,// 1
                 CblasNoTrans, // 2 TRANSA
                 CblasNoTrans, // 3 TRANSB
                 M, // 4 M
                 N, // 5 N
                 K, // 6 K
                 1., // 7 ALPHA
                 A.ptr<float>(),//8 A
                 A.rows, //9 LDA
                 B.ptr<float>(),//10 B
                 B.rows, //11 LDB
                 0., //12 BETA
                 C.ptr<float>(),//13 C
                 C.rows); //14 LDC
}
template<>
inline void mm<double>(cv::Mat& A,cv::Mat& B,cv::Mat& C);
void matrix_multiply(cv::InputArray _src1, cv::InputArray _src2, cv::OutputArray _dst)
{
    CV_DbgAssert( (_src1.isMat() || _src1.isUMat()) && (_src1.kind() == _src2.kind()) &&
                  (_src1.depth() == _src2.depth()) && (_src1.depth() == CV_32F) && (_src1.depth() == _src1.type()) &&
                  (_src1.rows() == _src2.cols())
                );
    cv::Mat src1 = _src1.getMat();
    cv::Mat src2 = _src2.getMat();
    cv::Mat dst;
    bool cpy(false);
    if(_dst.rows() == _src1.rows() && _dst.cols() == _src2.cols() && _dst.type() == _src1.type())
        dst = _dst.getMat();
    else
    {
        dst = cv::Mat::zeros(src1.rows,src2.cols,src1.type());
        cpy = true;
    }
    // ...
}
I tried to organize the data as specified here :
without success.
This is my main issue.
To speed up my implementation a little, I was thinking of applying the divide and conquer approach illustrated here :
But for only four submatrices.
Has anyone tried a similar approach, or found a better way to gain performance in matrix multiplication (without using a GPU)?
Thank you in advance for any help.
I found a solution to question 1).
I based my first implementation on the documentation of the BLAS library.
BLAS was written in Fortran, a language in which indices start at 1 and not at 0 as in C or C++.
Another difference is that many libraries written in Fortran (e.g. BLAS, LAPACK) store matrices in column-major order, whereas most C or C++ libraries (e.g. OpenCV) store them in row-major order.
With CblasRowMajor, the leading dimension arguments (LDA, LDB, LDC) are the row strides in elements, which is what Mat::step1() returns. After taking these properties into account I modified my code to:
template<class _Ty>
void mm(cv::Mat& A,cv::Mat& B,cv::Mat& C)
{
    // sizeof(_Ty) == 0 is never true, so this fires only for unsupported types
    static_assert(sizeof(_Ty) == 0,"The function gemm is only defined for floating precision numbers.");
}
template<>
void mm<float>(cv::Mat& A,cv::Mat& B,cv::Mat& C)
{
    // C (M x N) = A (M x K) * B (K x N)
    const int M = A.rows;
    const int N = B.cols;
    const int K = A.cols;
    cblas_sgemm( CblasRowMajor ,// 1
                 CblasNoTrans, // 2 TRANSA
                 CblasNoTrans, // 3 TRANSB
                 M, // 4 M
                 N, // 5 N
                 K, // 6 K
                 1., // 7 ALPHA
                 A.ptr<float>(),//8 A
                 A.step1(), //9 LDA
                 B.ptr<float>(),//10 B
                 B.step1(), //11 LDB
                 0., //12 BETA
                 C.ptr<float>(),//13 C
                 C.step1()); //14 LDC
}
template<>
void mm<double>(cv::Mat& A,cv::Mat& B,cv::Mat& C)
{
    // C (M x N) = A (M x K) * B (K x N)
    const int M = A.rows;
    const int N = B.cols;
    const int K = A.cols;
    cblas_dgemm( CblasRowMajor ,// 1
                 CblasNoTrans, // 2 TRANSA
                 CblasNoTrans, // 3 TRANSB
                 M, // 4 M
                 N, // 5 N
                 K, // 6 K
                 1., // 7 ALPHA
                 A.ptr<double>(),//8 A
                 A.step1(), //9 LDA
                 B.ptr<double>(),//10 B
                 B.step1(), //11 LDB
                 0., //12 BETA
                 C.ptr<double>(),//13 C
                 C.step1()); //14 LDC
}
And everything works well.
Without additional multithreading or a divide and conquer approach I was able to reduce the processing time of one step of my code from 150 ms to 500 us.
So it fixed everything for me :).

Can Montgomery multiplication be used to speed up the computation of (large number)! % (some prime)

This question originates in a comment I almost wrote below this question, where Zack is computing the factorial of a large number modulo a large number (that we will assume to be prime for the sake of this question). Zack is using the traditional computation of factorial, taking the remainder at each multiplication.
I almost commented that an alternative to consider was Montgomery multiplication, but thinking more about it, I have only seen this technique used to speed up several multiplications by the same multiplicand (in particular, to speed up the computation of a^n mod p).
My question is: can Montgomery multiplication be used to speed up the computation of n! mod p for large n and p?
Naively, no; you need to transform each of the n terms of the product into the "Montgomery space", so you have n full reductions mod m, the same as the "usual" algorithm.
However, a factorial isn't just an arbitrary product of n terms; it's much more structured. In particular, if you already have the "Montgomerized" kr mod m, then you can use a very cheap reduction to get (k+1)r mod m.
So this is perfectly feasible, though I haven't seen it done before. I went ahead and wrote a quick-and-dirty implementation (very untested, I wouldn't trust it very far at all):
#include <assert.h>
#include <stdint.h>
__uint128_t full_product(uint64_t x, uint64_t y); // defined with the benchmark code
// returns m^-1 mod 2**64 via clever 2-adic arithmetic
uint64_t inverse(uint64_t m) {
    assert(m % 2 == 1);
    uint64_t minv = 2 - m;
    uint64_t m_1 = m - 1;
    for (int i=1; i<6; i+=1) { m_1 *= m_1; minv *= (1 + m_1); }
    return minv;
}
uint64_t montgomery_reduce(__uint128_t x, uint64_t minv, uint64_t m) {
    return (x + (__uint128_t)((uint64_t)x*-minv)*m) >> 64;
}
uint64_t montgomery_multiply(uint64_t x, uint64_t y, uint64_t minv, uint64_t m) {
    return montgomery_reduce(full_product(x, y), minv, m);
}
uint64_t montgomery_factorial(uint64_t x, uint64_t m) {
    assert(x < m && m % 2 == 1);
    uint64_t minv = inverse(m); // m^-1 mod 2**64
    uint64_t r_mod_m = -m % m; // 2**64 mod m
    uint64_t mont_term = r_mod_m;
    uint64_t mont_result = r_mod_m;
    for (uint64_t k=2; k<=x; k++) {
        // Compute the montgomerized product term: kr mod m = (k-1)r + r mod m.
        mont_term += r_mod_m;
        if (mont_term >= m) mont_term -= m;
        // Update the result by multiplying in the new term.
        mont_result = montgomery_multiply(mont_result, mont_term, minv, m);
    }
    // Final reduction
    return montgomery_reduce(mont_result, minv, m);
}
and benchmarked it against the usual implementation:
__uint128_t full_product(uint64_t x, uint64_t y) {
    return (__uint128_t)x*y;
}
uint64_t naive_factorial(uint64_t x, uint64_t m) {
    assert(x < m);
    uint64_t result = x ? x : 1;
    while (x --> 2) result = full_product(result,x) % m;
    return result;
}
and against the usual implementation with some inline asm to fix a minor inefficiency:
uint64_t x86_asm_factorial(uint64_t x, uint64_t m) {
    assert(x < m);
    uint64_t result = x ? x : 1;
    while (x --> 2) {
        __asm__("mov %[result], %%rax; mul %[x]; div %[m]"
                : [result] "+d" (result) : [x] "r" (x), [m] "r" (m) : "%rax", "flags");
    }
    return result;
}
Results were as follows on my Haswell laptop for reasonably large x:
implementation speedup
naive 1.00x
x86_asm 1.76x
montgomery 5.68x
So this really does seem to be a pretty nice win. The codegen for the Montgomery implementation is pretty decent, but could probably be improved somewhat further with hand-written assembly as well.
This is an interesting approach for "modest" x and m. Once x gets large, the various approaches that have sub-linear complexity in x will necessarily win out; factorial has so much structure that this method doesn't take advantage of.
