Interface OpenCV's Mat containers with BLAS for matrix multiplication

I am processing UHD (2160 x 3840) images.
One of the processing steps consists of applying a Sobel filter along the X and Y axes; I then have to multiply each output matrix by its transpose, and then I compute the gradient image as the square root of the sum of the gradients.
So : S = sqrt( S_x * S_x^t + S_y * S_y^t).
Due to the dimensions of the image, OpenCV takes up to twenty seconds to process this without multithreading and ten seconds with multithreading.
I know that OpenCV can call OpenCL in order to speed up the filtering operations, so I think it would take a long time to gain any more performance from the filtering step.
For the matrix multiplication I experience a kind of instability with OpenCV's OpenCL gemm kernel implementation.
So I would like to try to use OpenBLAS instead.
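For reference, the baseline I am describing is roughly the following, using only OpenCV calls (just a sketch, with illustrative names):
#include <opencv2/opencv.hpp>

// Sketch of the pipeline described above: Sobel on X and Y, multiply each
// gradient by its transpose, take the square root of the sum.
cv::Mat gradient_magnitude(const cv::Mat& img)
{
    cv::Mat Sx, Sy;
    cv::Sobel(img, Sx, CV_32F, 1, 0);   // Sobel along X
    cv::Sobel(img, Sy, CV_32F, 0, 1);   // Sobel along Y

    cv::Mat G = Sx * Sx.t() + Sy * Sy.t();
    cv::Mat S;
    cv::sqrt(G, S);                     // S = sqrt(S_x * S_x^t + S_y * S_y^t)
    return S;
}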
My questions are :
1.)
I wrote the following code but I face some issues interfacing OpenCV's Mat objects:
template<class _Ty>
void mm(cv::Mat& A,cv::Mat& B,cv::Mat& C)
{
static_assert(true,"support matrix_multiply is only defined for floating precision numbers.");
}
template<>
inline void mm<float>(cv::Mat& A,cv::Mat& B,cv::Mat& C)
{
const int M = A.rows;
const int N = B.cols;
const int K = A.cols;
cblas_sgemm( CblasRowMajor ,// 1
CblasNoTrans, // 2 TRANSA
CblasNoTrans, // 3 TRANSB
M, // 4 M
N, // 5 N
K, // 6 K
1., // 7 ALPHA
A.ptr<float>(),//8 A
A.rows, //9 LDA
B.ptr<float>(),//10 B
B.rows, //11 LDB
0., //12 BETA
C.ptr<float>(),//13 C
C.rows); //14 LDC
}
template<>
inline void mm<double>(cv::Mat& A,cv::Mat& B,cv::Mat& C)
{
cblas_dgemm(CblasRowMajor,CblasNoTrans,CblasNoTrans,A.rows,B.cols,A.cols,1.,A.ptr<double>(),A.rows,B.ptr<double>(),B.cols,0.,C.ptr<double>(),C.rows);
}
void matrix_multiply(cv::InputArray _src1, cv::InputArray _src2, cv::OutputArray _dst)
{
CV_DbgAssert( (_src1.isMat() || _src1.isUMat()) && (_src1.kind() == _src2.kind()) &&
(_src1.depth() == _src2.depth()) && (_src1.depth() == CV_32F) && (_src1.depth() == _src1.type()) &&
(_src1.rows() == _src2.cols())
);
cv::Mat src1 = _src1.getMat();
cv::Mat src2 = _src2.getMat();
cv::Mat dst;
bool cpy(false);
if(_dst.rows() == _src1.rows() && _dst.cols() == _src2.cols() && _dst.type() == _src1.type())
dst = _dst.getMat();
else
{
dst = cv::Mat::zeros(src1.rows,src2.cols,src1.type());
cpy = true;
}
if(cpy)
dst.copyTo(_dst);
}
I tried to organize the data as specified here:
http://www.netlib.org/lapack/explore-html/db/dc9/group__single__blas__level3.html#gafe51bacb54592ff5de056acabd83c260
without success.
This is my main issue
2.)
To try to speed up my implementation a little, I was thinking of applying the divide and conquer approach illustrated here:
https://en.wikipedia.org/wiki/Matrix_multiplication_algorithm
But with only four submatrices.
Has anyone tried a similar approach, or is there a better way to gain performance in matrix multiplication (without using a GPU)?
Thank you in advance for any help.

I found a solution to question 1).
I based my first implementation on the documentation of the BLAS library.
BLAS is written in Fortran; in that language indices start at 1 rather than at 0 as in C or C++.
Another point is that many libraries written in Fortran (e.g. BLAS, LAPACK) organize their memory in column-major order, whereas most C or C++ libraries (e.g. OpenCV) organize memory in row-major order.
After taking these two properties into account I modified my code to:
template<class _Ty>
void mm(cv::Mat& A,cv::Mat& B,cv::Mat& C)
{
static_assert(true,"The function gemm is only defined for floating precision numbers.");
}
template<>
void mm<float>(cv::Mat& A,cv::Mat& B,cv::Mat& C)
{
const int M = A.cols+1;
const int N = B.rows;
const int K = A.cols;
cblas_sgemm( CblasRowMajor ,// 1
CblasNoTrans, // 2 TRANSA
CblasNoTrans, // 3 TRANSB
M, // 4 M
N, // 5 N
K, // 6 K
1., // 7 ALPHA
A.ptr<float>(),//8 A
A.step1(), //9 LDA
B.ptr<float>(),//10 B
B.step1(), //11 LDB
0., //12 BETA
C.ptr<float>(),//13 C
C.step1()); //14 LDC
}
template<>
void mm<double>(cv::Mat& A,cv::Mat& B,cv::Mat& C)
{
const int M = A.cols+1;
const int N = B.rows;
const int K = A.cols;
cblas_dgemm( CblasRowMajor ,// 1
CblasNoTrans, // 2 TRANSA
CblasNoTrans, // 3 TRANSB
M, // 4 M
N, // 5 N
K, // 6 K
1., // 7 ALPHA
A.ptr<double>(),//8 A
A.step1(), //9 LDA
B.ptr<double>(),//10 B
B.step1(), //11 LDB
0., //12 BETA
C.ptr<double>(),//13 C
C.step1()); //14 LDC
}
And everything works well.
Without additional multithreading or a divide-and-conquer approach I was able to reduce the processing time of one step of my code from 150 ms to 500 us.
So it fixes everything for me :).
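For completeness, here is roughly how I call the wrapper for the gradient step described above (just a sketch; the explicit transposes and the preallocated outputs are only illustrative):
#include <opencv2/opencv.hpp>

// Sx, Sy are the CV_32F Sobel outputs; S receives sqrt(Sx*Sx^t + Sy*Sy^t).
void gradient_step(cv::Mat& Sx, cv::Mat& Sy, cv::Mat& S)
{
    cv::Mat Sxt = Sx.t(), Syt = Sy.t();

    cv::Mat P1(Sx.rows, Sxt.cols, CV_32F);
    cv::Mat P2(Sy.rows, Syt.cols, CV_32F);
    mm<float>(Sx, Sxt, P1);   // P1 = Sx * Sx^t
    mm<float>(Sy, Syt, P2);   // P2 = Sy * Sy^t

    cv::Mat G = P1 + P2;
    cv::sqrt(G, S);           // S = sqrt(Sx*Sx^t + Sy*Sy^t)
}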

Related

Eigen FFT library

I am trying to use Eigen's unsupported FFT module with the FFTW backend. Specifically I want to do a 2D FFT. Here's my code:
void fft2(Eigen::MatrixXf * matIn,Eigen::MatrixXcf * matOut)
{
const int nRows = matIn->rows();
const int nCols = matIn->cols();
Eigen::FFT< float > fft;
for (int k = 0; k < nRows; ++k) {
Eigen::VectorXcf tmpOut(nRows);
fft.fwd(tmpOut, matIn->row(k));
matOut->row(k) = tmpOut;
}
for (int k = 0; k < nCols; ++k) {
Eigen::VectorXcf tmpOut(nCols);
fft.fwd(tmpOut, matOut->col(k));
matOut->col(k) = tmpOut;
}
}
I have 2 problems :
First, I get a segmentation fault when using this code on some matrices. This error doesn't happen for all matrices. I guess it's related to an alignment error. I use the function in the following way:
Eigen::MatrixXcf matFFT(mat.rows(),mat.cols());
fft2(&matFloat,&matFFT);
where mat can be any matrix. Funnily, the code only crashes when I compute the FFT over the 2nd dimension, never on the first one. This doesn't happen with the kissFFT backend.
Second, when the function does work, I don't get the same result as Matlab (which uses FFTW). E.g.:
Input Matrix :
[2, 1, 2]
[3, 2, 1]
[1, 2, 3]
Eigen gives :
[ (0,5), (0.5,0.86603), (0,0.5)]
[ (-4.3301,-2.5), (-1,-1.7321), (0.31699,-1.549)]
[ (-1.5,-0.86603), (2,3.4641), (2,3.4641)]
Matlab gives :
17 + 0i 0.5 + 0.86603i 0.5 - 0.86603i
-1 + 0i -1 - 1.7321i 2 - 3.4641i
-1 + 0i 2 + 3.4641i -1 + 1.7321i
Only the central part is the same.
Any help would be welcome.
I failed to activate EIGEN_FFTW_DEFAULT in my first solution; activating it reveals an error in the fftw-support implementation of Eigen. The following works:
#define EIGEN_FFTW_DEFAULT
#include <iostream>
#include <unsupported/Eigen/FFT>
int main(int argc, char *argv[])
{
Eigen::MatrixXf A(3,3);
A << 2,1,2, 3,2,1, 1,2,3;
const int nRows = A.rows();
const int nCols = A.cols();
std::cout << A << "\n\n";
Eigen::MatrixXcf B(3,3);
Eigen::FFT< float > fft;
for (int k = 0; k < nRows; ++k) {
Eigen::VectorXcf tmpOut(nRows);
fft.fwd(tmpOut, A.row(k));
B.row(k) = tmpOut;
}
std::cout << B << "\n\n";
Eigen::FFT< float > fft2; // Workaround: Using the same FFT object for a real and a complex FFT seems not to work with FFTW
for (int k = 0; k < nCols; ++k) {
Eigen::VectorXcf tmpOut(nCols);
fft2.fwd(tmpOut, B.col(k));
B.col(k) = tmpOut;
}
std::cout << B << '\n';
}
I get this output:
2 1 2
3 2 1
1 2 3
(17,0) (0.5,0.866025) (0.5,-0.866025)
(-1,0) (-1,-1.73205) (2,-3.4641)
(-1,0) (2,3.4641) (-1,1.73205)
Which is the same as your Matlab result.
N.B.: FFTW seems to support 2D real->complex FFT natively (without using individual FFTs). This is likely more efficient.
fftwf_plan fftwf_plan_dft_r2c_2d(int n0, int n1,
float *in, fftwf_complex *out, unsigned flags);
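For example, a minimal sketch of the native 2-D r2c transform (assuming a row-major float buffer of nRows x nCols and an output buffer of nRows x (nCols/2 + 1) complex values, which is FFTW's r2c layout):
#include <fftw3.h>

void fft2_r2c(float* in, int nRows, int nCols, fftwf_complex* out)
{
    // Plan and execute a real-to-complex 2-D FFT; FFTW_ESTIMATE leaves the
    // input untouched during planning.
    fftwf_plan plan = fftwf_plan_dft_r2c_2d(nRows, nCols, in, out, FFTW_ESTIMATE);
    fftwf_execute(plan);
    fftwf_destroy_plan(plan);
}
Note that Eigen matrices are column-major by default, so the row/column arguments may need swapping if the buffer comes from an Eigen::MatrixXf.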

Accessing global memory in CUDA is slow

I have a CUDA kernel doing some computation on a local variable (in register), and after it gets computed, its value gets written into a global array p:
__global__ void dd( float* p, int dimX, int dimY, int dimZ )
{
int
i = blockIdx.x*blockDim.x + threadIdx.x,
j = blockIdx.y*blockDim.y + threadIdx.y,
k = blockIdx.z*blockDim.z + threadIdx.z,
idx = j*dimX*dimY + j*dimX +i;
if (i >= dimX || j >= dimY || k >= dimZ)
{
return;
}
float val = 0;
val = SomeComputationOnVal();
p[idx ]= val;
__syncthreads();
}
Unfortunately, this function executes very slowly.
However, it runs very fast if I do this:
__global__ void dd( float* p, int dimX, int dimY, int dimZ )
{
int
i = blockIdx.x*blockDim.x + threadIdx.x,
j = blockIdx.y*blockDim.y + threadIdx.y,
k = blockIdx.z*blockDim.z + threadIdx.z,
idx = j*dimX*dimY + j*dimX +i;
if (i >= dimX || j >= dimY || k >= dimZ)
{
return;
}
float val = 0;
//val = SomeComputationOnVal();
p[idx ]= val;
__syncthreads();
}
It also runs very fast if I do this:
__global__ void dd( float* p, int dimX, int dimY, int dimZ )
{
int
i = blockIdx.x*blockDim.x + threadIdx.x,
j = blockIdx.y*blockDim.y + threadIdx.y,
k = blockIdx.z*blockDim.z + threadIdx.z,
idx = j*dimX*dimY + j*dimX +i;
if (i >= dimX || j >= dimY || k >= dimZ)
{
return;
}
float val = 0;
val = SomeComputationOnVal();
// p[idx ]= val;
__syncthreads();
}
So I am confused, and have no idea how to solve this problem. I have stepped through with NSight and did not find access violations.
Here is how I launch the kernel (dimX: 924; dimY: 16; dimZ: 1120):
dim3
blockSize(8,16,2),
gridSize(dimX/blockSize.x+1,dimY/blockSize.y, dimZ/blockSize.z);
float* dev_p; cudaMalloc((void**)&dev_p, dimX*dimY*dimZ*sizeof(float));
dd<<<gridSize, blockSize>>>( dev_p,dimX,dimY,dimZ);
Could anyone please give some pointers? It does not make much sense to me. All the computation of val is fast, and the final step is just to move val into p. p is never involved in the computation, and it only shows up once. So why is it so slow?
The computation is basically a loop over a 512 x 512 matrix. That is a fair amount of computation, I'd say.
The computations you perform in SomeComputationOnVal are extremely expensive. Each thread reads at least 1 MB of data (a 512 x 512 float matrix is 512 * 512 * 4 bytes = 1 MB), which is off cache (or in L2 at best for a small part, should k vary in a small range); with roughly 924 * 16 * 1120, i.e. about 16.6 million threads, that totals about 16 TB of data for your run. Even on a high-end GPU it would take about 2 minutes to run, at the minimum, not to mention everything else that could slow this down.
In the third case your function does not write any data to global memory and has no side effects, so the compiler may decide to optimize out the call to SomeComputationOnVal since you do not use its output.
Hence cases two and three, which do no real calculation, are very fast. Writing 64 MB to GPU memory with coalesced threads is very fast (milliseconds range).
You can verify the generated PTX to see if the code gets optimized out. Use the --keep option in nvcc and look at the resulting .ptx files.

Can Montgomery multiplication be used to speed up the computation of (large number)! % (some prime)

This question originates in a comment I almost wrote below this question, where Zack is computing the factorial of a large number modulo a large number (that we will assume to be prime for the sake of this question). Zack is using the traditional computation of factorial, taking the remainder at each multiplication.
I almost commented that an alternative to consider was Montgomery multiplication, but thinking more about it, I have only seen this technique used to speed up several multiplications by the same multiplicand (in particular, to speed up the computation of a^n mod p).
My question is: can Montgomery multiplication be used to speed up the computation of n! mod p for large n and p?
Naively, no; you need to transform each of the n terms of the product into the "Montgomery space", so you have n full reductions mod m, the same as the "usual" algorithm.
However, a factorial isn't just an arbitrary product of n terms; it's much more structured. In particular, if you already have the "Montgomerized" kr mod m, then you can use a very cheap reduction to get (k+1)r mod m.
So this is perfectly feasible, though I haven't seen it done before. I went ahead and wrote a quick-and-dirty implementation (very untested, I wouldn't trust it very far at all):
// returns m^-1 mod 2**64 via clever 2-adic arithmetic (http://arxiv.org/pdf/1209.6626.pdf)
uint64_t inverse(uint64_t m) {
assert(m % 2 == 1);
uint64_t minv = 2 - m;
uint64_t m_1 = m - 1;
for (int i=1; i<6; i+=1) { m_1 *= m_1; minv *= (1 + m_1); }
return minv;
}
uint64_t montgomery_reduce(__uint128_t x, uint64_t minv, uint64_t m) {
return x + (__uint128_t)((uint64_t)x*-minv)*m >> 64;
}
uint64_t montgomery_multiply(uint64_t x, uint64_t y, uint64_t minv, uint64_t m) {
return montgomery_reduce(full_product(x, y), minv, m);
}
uint64_t montgomery_factorial(uint64_t x, uint64_t m) {
assert(x < m && m % 2 == 1);
uint64_t minv = inverse(m); // m^-1 mod 2**64
uint64_t r_mod_m = -m % m; // 2**64 mod m
uint64_t mont_term = r_mod_m;
uint64_t mont_result = r_mod_m;
for (uint64_t k=2; k<=x; k++) {
// Compute the montgomerized product term: kr mod m = (k-1)r + r mod m.
mont_term += r_mod_m;
if (mont_term >= m) mont_term -= m;
// Update the result by multiplying in the new term.
mont_result = montgomery_multiply(mont_result, mont_term, minv, m);
}
// Final reduction
return montgomery_reduce(mont_result, minv, m);
}
and benchmarked it against the usual implementation:
__uint128_t full_product(uint64_t x, uint64_t y) {
return (__uint128_t)x*y;
}
uint64_t naive_factorial(uint64_t x, uint64_t m) {
assert(x < m);
uint64_t result = x ? x : 1;
while (x --> 2) result = full_product(result,x) % m;
return result;
}
and against the usual implementation with some inline asm to fix a minor inefficiency:
uint64_t x86_asm_factorial(uint64_t x, uint64_t m) {
assert(x < m);
uint64_t result = x ? x : 1;
while (x --> 2) {
__asm__("mov %[result], %%rax; mul %[x]; div %[m]"
: [result] "+d" (result) : [x] "r" (x), [m] "r" (m) : "%rax", "flags");
}
return result;
}
Results were as follows on my Haswell laptop for reasonably large x:
implementation speedup
---------------------------
naive 1.00x
x86_asm 1.76x
montgomery 5.68x
So this really does seem to be a pretty nice win. The codegen for the Montgomery implementation is pretty decent, but could probably be improved somewhat further with hand-written assembly as well.
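For anyone who wants to try it, a quick sanity check of the two implementations might look like this (a sketch on my part; it assumes everything above sits in one translation unit, with full_product declared before montgomery_multiply, and a compiler that supports __uint128_t):
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

__uint128_t full_product(uint64_t x, uint64_t y);   // defined alongside the naive version above

int main(void)
{
    const uint64_t m = 2147483647u;   // 2^31 - 1, a known odd prime
    const uint64_t x = 1000000u;
    assert(montgomery_factorial(x, m) == naive_factorial(x, m));
    printf("%llu\n", (unsigned long long)montgomery_factorial(x, m));
    return 0;
}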
This is an interesting approach for "modest" x and m. Once x gets large, the various approaches that have sub-linear complexity in x will necessarily win out; factorial has a great deal of structure that this method does not take advantage of.

CRC Calculation Of A Mostly Static Data Stream

Background:
I have a section of memory, 1024 bytes. The last 1020 bytes will always be the same. The first 4 bytes will change (serial number of a product). I need to calculate the CRC-16 CCITT (0xFFFF initial value, 0x1021 polynomial) for the entire section of memory, CRC_WHOLE.
Question:
Is it possible to calculate the CRC for only the first 4 bytes, CRC_A, then apply a function such as the one below to calculate the full CRC? We can assume that the checksum for the last 1020 bytes, CRC_B, is already known.
CRC_WHOLE = XOR(CRC_A, CRC_B)
I know that this formula does not work (tried it), but I am hoping that something similar exists.
Yes. You can see how in zlib's crc32_combine(). If you have two sequences A and B, then the pure CRC of AB is the exclusive-or of the CRC of A0 and the CRC of 0B, where the 0's represent a series of zero bytes with the length of the corresponding sequence, i.e. B and A respectively.
For your application, you can pre-compute a single operator that applies 1020 zeros to the CRC of your first four bytes very rapidly. Then you can exclusive-or that with the pre-computed CRC of the 1020 bytes.
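For the CRC-32 that zlib implements directly, the call looks like this (illustrative only; for your CRC-16 CCITT you would have to port the same operator yourself):
#include <zlib.h>

// serial holds the 4 changing bytes, fixed1020 the constant tail.
uLong combined_crc(const Bytef serial[4], const Bytef fixed1020[1020])
{
    uLong crc_a = crc32(0L, serial, 4);         // CRC of the changing prefix
    uLong crc_b = crc32(0L, fixed1020, 1020);   // CRC of the fixed tail (can be precomputed once)
    return crc32_combine(crc_a, crc_b, 1020);   // CRC of the full 1024-byte block
}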
Update:
Here is a post of mine from 2008 with a detailed explanation that @ArtemB discovered (and that I had forgotten about):
crc32_combine() in zlib is based on two key tricks. For what follows,
we set aside the fact that the standard 32-bit CRC is pre and post-
conditioned. We can deal with that later. Assume for now a CRC that
has no such conditioning, and so starts with the register filled with
zeros.
Trick #1: CRCs are linear. So if you have stream X and stream Y of
the same length and exclusive-or the two streams bit-by-bit to get Z,
i.e. Z = X ^ Y (using the C notation for exclusive-or), then CRC(Z) =
CRC(X) ^ CRC(Y). For the problem at hand we have two streams A and B
of differing length that we want to concatenate into stream Z. What
we have available are CRC(A) and CRC(B). What we want is a quick way
to compute CRC(Z). The trick is to construct X = A concatenated with
length(B) zero bits, and Y = length(A) zero bits concatenated with B.
So if we represent concatenation simply by juxtaposition of the
symbols, X = A0, Y = 0B, then X^Y = Z = AB. Then we have CRC(Z) =
CRC(A0) ^ CRC(0B).
Now we need to know CRC(A0) and CRC(0B). CRC(0B) is easy. If we feed
a bunch of zeros to the CRC machine starting with zero, the register
is still filled with zeros. So it's as if we did nothing at all.
Therefore CRC(0B) = CRC(B).
CRC(A0) requires more work however. Taking a non-zero CRC and feeding
zeros to the CRC machine doesn't leave it alone. Every zero changes
the register contents. So to get CRC(A0), we need to set the register
to CRC(A), and then run length(B) zeros through it. Then we can
exclusive-or the result of that with CRC(B) = CRC(0B), and we get what
we want, which is CRC(Z) = CRC(AB). Voila!
Well, actually the voila is premature. I wasn't at all satisfied with
that answer. I didn't want a calculation that took a time
proportional to the length of B. That wouldn't save any time compared
to simply setting the register to CRC(A) and running the B stream
through. I figured there must be a faster way to compute the effect
of feeding n zeros into the CRC machine (where n = length(B)). So
that leads us to:
Trick #2: The CRC machine is a linear state machine. If we know the
linear transformation that occurs when we feed a zero to the machine,
then we can do operations on that transformation to more efficiently
find the transformation that results from feeding n zeros into the
machine.
The transformation of feeding a single zero bit into the CRC machine
is completely represented by a 32x32 binary matrix. To apply the
transformation we multiply the matrix by the register, taking the
register as a 32 bit column vector. For the matrix multiplication in
binary (i.e. over the Galois Field of 2), the role of multiplication
is played by and'ing, and the role of addition is played by exclusive-
or'ing.
There are a few different ways to construct the magic matrix that
represents the transformation caused by feeding the CRC machine a
single zero bit. One way is to observe that each column of the matrix
is what you get when your register starts off with a single one in
it. So the first column is what you get when the register is 100...
and then feed a zero, the second column comes from starting with
0100..., etc. (Those are referred to as basis vectors.) You can see
this simply by doing the matrix multiplication with those vectors.
The matrix multiplication selects the column of the matrix
corresponding to the location of the single one.
Now for the trick. Once we have the magic matrix, we can set aside
the initial register contents for a while, and instead use the
transformation for one zero to compute the transformation for n
zeros. We could just multiply n copies of the matrix together to get
the matrix for n zeros. But that's even worse than just running the n
zeros through the machine. However there's an easy way to avoid most
of those matrix multiplications to get the same answer. Suppose we
want to know the transformation for running eight zero bits, or one
byte through. Let's call the magic matrix that represents running one
zero through: M. We could do seven matrix multiplications to get R =
MxMxMxMxMxMxMxM. Instead, let's start with MxM and call that P. Then
PxP is MxMxMxM. Let's call that Q. Then QxQ is R. So now we've
reduced the seven multiplications to three. P = MxM, Q = PxP, and R =
QxQ.
Now I'm sure you get the idea for an arbitrary n number of zeros. We
can very rapidly generate transformation matrices Mk, where Mk is the
transformation for running 2^k zeros through. (In the
paragraph above M3 is R.) We can make M1 through Mk with only k
matrix multiplications, starting with M0 = M. k only has to be as
large as the number of bits in the binary representation of n. We can
then pick those matrices where there are ones in the binary
representation of n and multiply them together to get the
transformation of running n zeros through the CRC machine. So if n =
13, compute M0 x M2 x M3.
If j is the number of one's in the binary representation of n, then we
just have j - 1 more matrix multiplications. So we have a total of k +
j - 1 matrix multiplications, where j <= k = floor(logbase2(n)).
Now we take our rapidly constructed matrix for n zeros, and multiply
that by CRC(A) to get CRC(A0). We can compute CRC(A0) in O(log(n))
time, instead of O(n) time. We exclusive or that with CRC(B) and
Voila! (really this time), we have CRC(Z).
That's what zlib's crc32_combine() does.
I will leave it as an exercise for the reader as to how to deal with
the pre and post conditioning of the CRC register. You just need to
apply the linearity observations above. Hint: You don't need to know
length(A). In fact crc32_combine() only takes three arguments:
CRC(A), CRC(B), and length(B) (in bytes).
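To make the matrix trick concrete, here is a sketch modeled on zlib's crc32_combine() for the standard reflected CRC-32 (written from the description above, not copied from zlib; the names are mine):
#include <stddef.h>
#include <stdint.h>

/* Multiply a GF(2) 32x32 matrix (stored as 32 column vectors) by a vector. */
static uint32_t gf2_matrix_times(const uint32_t mat[32], uint32_t vec)
{
    uint32_t sum = 0;
    for (int i = 0; vec; i++, vec >>= 1)
        if (vec & 1)
            sum ^= mat[i];
    return sum;
}

/* square = mat * mat over GF(2). */
static void gf2_matrix_square(uint32_t square[32], const uint32_t mat[32])
{
    for (int i = 0; i < 32; i++)
        square[i] = gf2_matrix_times(mat, mat[i]);
}

/* Cycle crc1 through len2 zero bytes in O(log(len2)) time, then fold in crc2. */
uint32_t crc32_combine_sketch(uint32_t crc1, uint32_t crc2, size_t len2)
{
    uint32_t even[32];   /* operator for an even power-of-two number of zero bits */
    uint32_t odd[32];    /* operator for an odd power-of-two number of zero bits */

    if (len2 == 0)
        return crc1;

    /* Operator for one zero bit (reflected CRC-32 polynomial). */
    odd[0] = 0xedb88320u;
    for (int i = 1; i < 32; i++)
        odd[i] = 1u << (i - 1);

    gf2_matrix_square(even, odd);   /* two zero bits  */
    gf2_matrix_square(odd, even);   /* four zero bits */

    /* Square-and-multiply over the bits of len2; the first squaring below
       yields the operator for one whole zero byte (eight bits). */
    do {
        gf2_matrix_square(even, odd);
        if (len2 & 1)
            crc1 = gf2_matrix_times(even, crc1);
        len2 >>= 1;
        if (len2 == 0)
            break;
        gf2_matrix_square(odd, even);
        if (len2 & 1)
            crc1 = gf2_matrix_times(odd, crc1);
        len2 >>= 1;
    } while (len2);

    return crc1 ^ crc2;   /* CRC(AB) = CRC(A0) ^ CRC(0B) */
}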
Below is example C code for an alternative approach for CRC(A0). Rather than working with a matrix, a CRC can be cycled forward n bits by carryless-multiplying: (CRC · ((2^n) % POLY)) % POLY. So the repeated squaring is performed on an integer rather than a matrix. If n is constant, then (2^n) % POLY can be pre-computed.
/* crcpad.c - crc - data has a large number of trailing zeroes */
#include <stdio.h>
#include <stdlib.h>
typedef unsigned char uint8_t;
typedef unsigned int uint32_t;
#define POLY (0x04c11db7u)
static uint32_t crctbl[256];
void GenTbl(void) /* generate crc table */
{
uint32_t crc;
uint32_t c;
uint32_t i;
for(c = 0; c < 0x100; c++){
crc = c<<24;
for(i = 0; i < 8; i++)
/* assumes twos complement */
crc = (crc<<1)^((0-(crc>>31))&POLY);
crctbl[c] = crc;
}
}
uint32_t GenCrc(uint8_t * bfr, size_t size) /* generate crc */
{
uint32_t crc = 0u;
while(size--)
crc = (crc<<8)^crctbl[(crc>>24)^*bfr++];
return(crc);
}
/* carryless multiply modulo crc */
uint32_t MpyModCrc(uint32_t a, uint32_t b) /* (a*b)%crc */
{
uint32_t pd = 0;
uint32_t i;
for(i = 0; i < 32; i++){
/* assumes twos complement */
pd = (pd<<1)^((0-(pd>>31))&POLY);
pd ^= (0-(b>>31))&a;
b <<= 1;
}
return pd;
}
/* exponentiate by repeated squaring modulo crc */
uint32_t PowModCrc(uint32_t p) /* pow(2,p)%crc */
{
uint32_t prd = 0x1u; /* current product */
uint32_t sqr = 0x2u; /* current square */
while(p){
if(p&1)
prd = MpyModCrc(prd, sqr);
sqr = MpyModCrc(sqr, sqr);
p >>= 1;
}
return prd;
}
/* # data bytes */
#define DAT ( 32)
/* # zero bytes */
#define PAD (992)
/* DATA+PAD */
#define CNT (1024)
int main()
{
uint32_t pmc;
uint32_t crc;
uint32_t crf;
uint32_t i;
uint8_t *msg = malloc(CNT);
for(i = 0; i < DAT; i++) /* generate msg */
msg[i] = (uint8_t)rand();
for( ; i < CNT; i++)
msg[i] = 0;
GenTbl(); /* generate crc table */
crc = GenCrc(msg, CNT); /* generate crc normally */
crf = GenCrc(msg, DAT); /* generate crc for data */
pmc = PowModCrc(PAD*8); /* pmc = pow(2,PAD*8)%crc */
crf = MpyModCrc(crf, pmc); /* crf = (crf*pmc)%crc */
printf("%08x %08x\n", crc, crf);
free(msg);
return 0;
}
Example C code using intrinsic for carryless multiply, pclmulqdq == _mm_clmulepi64_si128:
/* crcpadm.c - crc - data has a large number of trailing zeroes */
/* pclmulqdq intrinsic version */
#include <stdio.h>
#include <stdlib.h>
#include <intrin.h>
typedef unsigned char uint8_t;
typedef unsigned int uint32_t;
typedef unsigned long long uint64_t;
#define POLY (0x104c11db7ull)
#define POLYM ( 0x04c11db7u)
static uint32_t crctbl[256];
static __m128i poly; /* poly */
static __m128i invpoly; /* 2^64 / POLY */
void GenMPoly(void) /* generate __m128i poly info */
{
uint64_t N = 0x100000000ull;
uint64_t Q = 0;
for(size_t i = 0; i < 33; i++){
Q <<= 1;
if(N&0x100000000ull){
Q |= 1;
N ^= POLY;
}
N <<= 1;
}
poly.m128i_u64[0] = POLY;
invpoly.m128i_u64[0] = Q;
}
void GenTbl(void) /* generate crc table */
{
uint32_t crc;
uint32_t c;
uint32_t i;
for(c = 0; c < 0x100; c++){
crc = c<<24;
for(i = 0; i < 8; i++)
/* assumes twos complement */
crc = (crc<<1)^((0-(crc>>31))&POLYM);
crctbl[c] = crc;
}
}
uint32_t GenCrc(uint8_t * bfr, size_t size) /* generate crc */
{
uint32_t crc = 0u;
while(size--)
crc = (crc<<8)^crctbl[(crc>>24)^*bfr++];
return(crc);
}
/* carryless multiply modulo crc */
uint32_t MpyModCrc(uint32_t a, uint32_t b) /* (a*b)%crc */
{
__m128i ma, mb, mp, mt;
ma.m128i_u64[0] = a;
mb.m128i_u64[0] = b;
mp = _mm_clmulepi64_si128(ma, mb, 0x00); /* p[0] = a*b */
mt = _mm_clmulepi64_si128(mp, invpoly, 0x00); /* t[1] = (p[0]*((2^64)/POLY))>>64 */
mt = _mm_clmulepi64_si128(mt, poly, 0x01); /* t[0] = t[1]*POLY */
return mp.m128i_u32[0] ^ mt.m128i_u32[0]; /* ret = p[0] ^ t[0] */
}
/* exponentiate by repeated squaring modulo crc */
uint32_t PowModCrc(uint32_t p) /* pow(2,p)%crc */
{
uint32_t prd = 0x1u; /* current product */
uint32_t sqr = 0x2u; /* current square */
while(p){
if(p&1)
prd = MpyModCrc(prd, sqr);
sqr = MpyModCrc(sqr, sqr);
p >>= 1;
}
return prd;
}
/* # data bytes */
#define DAT ( 32)
/* # zero bytes */
#define PAD (992)
/* DATA+PAD */
#define CNT (1024)
int main()
{
uint32_t pmc;
uint32_t crc;
uint32_t crf;
uint32_t i;
uint8_t *msg = malloc(CNT);
GenMPoly(); /* generate __m128 polys */
GenTbl(); /* generate crc table */
for(i = 0; i < DAT; i++) /* generate msg */
msg[i] = (uint8_t)rand();
for( ; i < CNT; i++)
msg[i] = 0;
crc = GenCrc(msg, CNT); /* generate crc normally */
crf = GenCrc(msg, DAT); /* generate crc for data */
pmc = PowModCrc(PAD*8); /* pmc = pow(2,PAD*8)%crc */
crf = MpyModCrc(crf, pmc); /* crf = (crf*pmc)%crc */
printf("%08x %08x\n", crc, crf);
free(msg);
return 0;
}

NLopt with Armadillo data

The NLopt objective function looks like this:
double myfunc(const std::vector<double> &x, std::vector<double> &grad, void *my_func_data)
x is the data being optimized, grad is a vector of gradients, and my_func_data holds additional data.
I am interested in supplying Armadillo matrices A and B to void *my_func_data.
I fiddled with Armadillo's member functions
mat A(5,5);
mat B(5,5);
double* A_mem = A.memptr();
double* B_mem = B.memptr();
which gives me pointers to the matrices A and B. I was thinking of defining another pointer to these pointers:
double** CombineMat;
int* Arow = A.n_rows; int* Acols = A.n_cols; //obtain dimensions of A
int* Brows = B.n_rows; int* Bcols = B.n_cols; // dim(B)
CombineMat[0] = A_mem; CombineMat[1] = Arows; CombineMat[2] = Acols;
CombineMat[3] = B_mem; CombineMat[4] = Brows; CombineMat[5] = Bcols;
and then passing *CombineMat as my_func_data.
Is this the way to do it? It seems clumsy...
Once CombineMat is passed, how do I re-cast the void type into something usable when I'm inside myfunc?
ANSWER
I answered my own question with help from here.
mat A(2,2);
A << 1 << 2 << endr << 3 << 4;
mat B(2,2);
B << 5 << 6 << endr << 7 << 8;
mat C[2];
C[0] = A;
C[1] = B;
opt.set_min_objective(myfunc, &C);
Once inside myfunc, the data in C can be converted back to Armadillo matrices like this:
mat* pC = (mat*)(my_func_data);
mat A = pC[0];
mat B = pC[1];
You can also use Armadillo's Cube class (a "3D matrix", or third-order tensor).
Each slice in a cube is just a matrix. For example:
cube X(4,5,2);
mat A(4,5);
mat B(4,5);
X.slice(0) = A; // set the individual slices
X.slice(1) = B;
mat& C = X.slice(1); // get the reference to a matrix stored in a cube
