How to vectorize Mersenne Twister loops over arrays - vectorization

Currently I'm working with a custom implementation of the Mersenne Twister, and I'd like to improve my understanding of vector operations.
I have the following code:
#define N 624
#define M 397

for( k = N - 1; k; k-- )
{
    array[i] = (array[i] ^ ((array[i-1] ^ (array[i-1] >> 30)) * 1566083941UL)) - i;
    array[i] &= 0xffffffffUL;
    ++i;
    if ( i >= N )
    {
        array[0] = array[N-1];
        i = 1;
    }
}
Here I'm working with 32-bit integers only, so as I understand it, I could perform 8 times as many operations at once using AVX2 instructions? How can I do that in practice?
I know how to deal with the addition of two vectors, but this case seems more complicated and I don't know how to begin.
For a scalar approach I'd work like this, but I'd like to be sure how to perform these operations in my case:
for (i = 0; i < 1024; i++)
{
    C[i] = A[i]*B[i];
}

for (i = 0; i < 1024; i+=4)
{
    C[i:i+3] = A[i:i+3]*B[i:i+3];
}
Unfortunately there are no lessons about intrinsics at my university, but I'm quite curious and would like to get an improvement.
I'm also thinking about how to create the array using vectors. Maybe a matrix? (Maybe _mm256_setr_epi32?)
I hope to get some advice regarding this topic!

Related

Explicit memory prefetching for Intel Compilers

I have two functions: one calculates the difference between successive elements of a row, and the second calculates the difference between successive values in a column. That is, one computes M[i][j+1] - M[i][j] and the other computes M[i+1][j] - M[i][j], M being the matrix. I implement them as follows:
inline void firstFunction(uchar* input, uchar* output, size_t M, size_t N){
    for(int i=0; i < M; i++){
        for(int j=0; j <= N - 33; j+=32){
            auto pos = i*N + j;
            _mm256_storeu_epi8(output + pos, _mm256_sub_epi8(_mm256_loadu_epi8(input + pos + 1), _mm256_loadu_epi8(input + pos)));
        }
    }
}
void secondFunction(uchar* input, uchar* output, size_t M, size_t N){
    for(int i = 0; i < M-1; i++){
        //#pragma prefetch input : (i+1)*N : (i+1)*N + N
        for(int j = 0; j < N-33; j+=32){
            auto idx = i * N + j;
            auto idx_1 = (i+1)*N + j;
            _mm256_storeu_epi8(output + idx, _mm256_sub_epi8(_mm256_loadu_epi8(input + idx_1), _mm256_loadu_epi8(input + idx)));
        }
    }
}
However, benchmarking them, the average runtimes for the first and second function are as follows:
firstFunction = 21.1432ms
secondFunction = 166.851ms
where the size of the matrix is M = 9024 and N = 12032.
This is a huge increase in runtime for a similar operation. I suspect it has something to do with memory accesses and caching, where far more cycles are spent fetching memory from another row in the second case.
So my question is two-part:
Is my reasoning for the difference in runtimes correct?
How do I alleviate it? My first idea is to prefetch the second row into memory and go ahead, but I am not able to prefetch a dynamically calculated position. Would _mm_prefetch help if the issue is indeed the memory access times?
I am using the dpcpp compiler with the compile options -g -O3 -fsycl -fsycl-targets=spir64 -mavx512f -mavx512vl -mavx512bw -qopenmp -liomp5 -lpthread. This compiler has a pragma prefetch, but it does not allow runtime-calculated prefetches. However, I would really appreciate something that is not specific to this compiler; it could also be specific to GCC.
Edit1 - Just tried _mm_prefetch, but that too throws error: argument to '__builtin_prefetch' must be a constant integer for _mm_prefetch(input + (i+1) * N, N);. So an additional question: is there any way we can prefetch runtime-calculated memory locations?
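For what it's worth, my understanding is that the address passed to _mm_prefetch may be computed at runtime; only the second (hint) argument has to be a compile-time constant, so passing N there is what triggers the constant-integer error. A minimal sketch of what I mean (row and N can be runtime values):
#include <xmmintrin.h>   // _mm_prefetch, _MM_HINT_T0
#include <cstddef>

inline void prefetch_row_start(const unsigned char *input, std::size_t row, std::size_t N)
{
    // Prefetch the first cache line of the given row into L1; only the hint is a constant.
    _mm_prefetch(reinterpret_cast<const char *>(input + row * N), _MM_HINT_T0);
}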
TIA

Storing functions in an array and applying them to an array of numbers

I've prototyped an algorithm for my iOS game in Python, and I need to rewrite it in ObjC. Basically, I have a board of 16 numbers, and I want to loop through every number three times with the four functions I'm using (add, subtract, multiply, exponentiate): 1+2+3, 2*3-4, 3^4-5, 9-4^3, etc., but without order of operations (the first operation is always done first).
What I would like is an overview of how this might be implemented in Objective-C. Specifically, what is the equivalent of an array of functions in Objective-C? Is there an easy way to implement it with selectors? What's the best structure to use for loops with numbers? Array of NSIntegers, array of ints, NSArray/NSMutableArray of NSNumbers?
import random as rand

min = 0
max = 9
max_target = 20
maximum_to_calculate = 100

def multiply(x, y):
    return x * y

def exponate(x, y):
    return x ** y

def add(x, y):
    return x + y

def subtract(x, y):
    return x - y

function_array = [multiply, exponate, add, subtract]

board = [rand.randint(min, max) for i in xrange(0, 16)]

dict_of_frequencies = {}
for a in board:
    for b in board:
        for first_fun in function_array:
            first_result = first_fun(a, b)
            for c in board:
                for second_fun in function_array:
                    final_result = second_fun(first_result, c)
                    if final_result not in dict_of_frequencies:
                        dict_of_frequencies[final_result] = 0
                    dict_of_frequencies[final_result] += 1
The most convenient way in Objective-C to construct an array of functions would be to use Blocks:
typedef NSInteger (^ArithmeticBlock)(NSInteger, NSInteger);

ArithmeticBlock add = ^NSInteger (NSInteger x, NSInteger y){
    return x + y;
};

ArithmeticBlock sub = ^NSInteger (NSInteger x, NSInteger y){
    return x - y;
};

NSArray * operations = @[add, sub];
Since there's no great way to perform arithmetic on NSNumbers, it would probably be best to create and store the board's values as primitives, such as NSIntegers, in a plain C array. You can box them up easily enough later, if necessary -- @(boardValue) gives you an NSNumber.
If you want to do it with straight C function pointers, something like this will do it:
#include <stdio.h>
#include <math.h>

long add(int a, int b) {
    return a + b;
}

long subtract(int a, int b) {
    return a - b;
}

long multiply(int a, int b) {
    return a * b;
}

long exponate(int a, int b) {
    return pow(a, b);
}

int main(void) {
    long (*mfunc[4])(int, int) = {add, subtract, multiply, exponate};
    char ops[4] = {'+', '-', '*', '^'};

    for ( int i = 0; i < 4; ++i ) {
        printf("5 %c 9 = %ld\n", ops[i], mfunc[i](5, 9));
    }

    return 0;
}
and gives the output:
paul@MacBook:~/Documents/src$ ./rndfnc
5 + 9 = 14
5 - 9 = -4
5 * 9 = 45
5 ^ 9 = 1953125
paul@MacBook:~/Documents/src$
Function pointer syntax can be slightly convoluted. long (*mfunc[4])(int, int) basically translates to defining a four-element array, called mfunc, of pointers to functions returning long and taking two arguments of type int.
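A typedef can make that declaration easier to read. As a rough sketch reusing the four functions from the snippet above:
// Hypothetical typedef for the same function-pointer type used above.
typedef long (*arith_fn)(int, int);

// Equivalent to: long (*mfunc[4])(int, int) = {add, subtract, multiply, exponate};
arith_fn mfunc[4] = {add, subtract, multiply, exponate};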
Maddy is right. Anyway, I'll give it a try just for the fun of it.
This has never seen a compiler, so please forgive all the typos and minor syntax errors in advance.
#include <stdlib.h>
#include <math.h>
...
const int MIN = 0;
const int MAX = 9;
const int MAX_TARGET = 20;
const int MAX_TO_CALCULATE = 100;
...
- (int) multiply:(int)x with:(int)y { return x * y; }
- (int) exponate:(int)x with:(int)y { return (int)pow(x, y); } // x ^ y would be bitwise XOR, not exponentiation
- (int) add:(int)x to:(int)y { return x + y; }
- (int) substract:(int)x by:(int)y { return x - y; }

// some method should start here, probably with
- (void) someMethod {
    NSArray *functionArray = [NSArray arrayWithObjects: @selector(multiply:with:), @selector(exponate:with:), @selector(add:to:), @selector(substract:by:), nil]; // there are other ways of generating an array of objects
    NSMutableArray *board = [NSMutableArray arrayWithCapacity:16]; // Again, there are other ways available.
    for (int i = 0; i < 16; i++) {
        [board addObject:@(arc4random() % (MAX-MIN) + MIN)];
    }
    NSMutableDictionary *dictOfFrequencies = [[NSMutableDictionary alloc] init];
    for (NSNumber *a in board)
        for (NSNumber *b in board)
            for (SEL firstFun in functionArray) {
                NSNumber *firstResult = @([self performSelector:firstFun withObject:a withObject:b]);
                NSNumber *countedResults = [dictOfFrequencies objectForKey:firstResult];
                if (countedResults) {
                    [dictOfFrequencies removeObjectForKey:firstResult];
                    countedResults = @(1 + [countedResults intValue]);
                } else {
                    countedResults = @1; // BTW, using @ followed by a numeric expression creates an NSNumber object with that value.
                }
                [dictOfFrequencies setObject:countedResults forKey:firstResult];
            }
}
Well, let me add some comments before others do. :-)
There is no need for Objective-C here. Your Python code is iterative, so you can implement it in plain C, and plain C is available wherever Objective-C is.
If you really want to go for Objective-C, then you should forget your Python code and implement the same logic (aiming for the same result) in Objective-C in an OOP style. My code really tries to translate your code as closely as possible; therefore it is far from being good style, maintainable, or proper OOP. Just keep that in mind before you conclude that ObjC is complicated compared to Python. :-)

How to do classification manually parsing the support vectors from LibSVM model?

As far as I understand, I could parse the support vectors out of the model produced by training LibSVM on a set of data.
What would be the formula to produce the classifier?
Do I need the data in the header of the file, like the following (kernel etc., before the listed support vectors)?
svm_type c_svc
kernel_type rbf
gamma 0.125
nr_class 4
total_sv 1038
rho -0.859244 -0.876628 -0.958343 0.543365 -1.10722 -1.79433
label 2 1 3 0
nr_sv 364 276 242 156
SV
My case is:
I want to do classification from Node.JS, but there aren't any LibSVM bindings for it yet.
Since my models are not going to change, I would like to do the classification in Node.JS, holding the model in memory.
If this proves to be slow, I'd rather write the same classification from scratch in C++ and create a wrapper module, if it's only a matter of a simple computation (as I suspect it is).
Thanks.
You should be able to translate the C function to Javascript.
Here is the relevant code:
double svm_predict_values(const svm_model *model, const svm_node *x, double* dec_values)
{
    int i;
    int nr_class = model->nr_class;
    int l = model->l;

    double *kvalue = Malloc(double,l);
    for(i=0;i<l;i++)
        kvalue[i] = Kernel::k_function(x,model->SV[i],model->param);

    int *start = Malloc(int,nr_class);
    start[0] = 0;
    for(i=1;i<nr_class;i++)
        start[i] = start[i-1]+model->nSV[i-1];

    int *vote = Malloc(int,nr_class);
    for(i=0;i<nr_class;i++)
        vote[i] = 0;

    int p=0;
    for(i=0;i<nr_class;i++)
        for(int j=i+1;j<nr_class;j++)
        {
            double sum = 0;
            int si = start[i];
            int sj = start[j];
            int ci = model->nSV[i];
            int cj = model->nSV[j];

            int k;
            double *coef1 = model->sv_coef[j-1];
            double *coef2 = model->sv_coef[i];
            for(k=0;k<ci;k++)
                sum += coef1[si+k] * kvalue[si+k];
            for(k=0;k<cj;k++)
                sum += coef2[sj+k] * kvalue[sj+k];
            sum -= model->rho[p];
            dec_values[p] = sum;

            if(dec_values[p] > 0)
                ++vote[i];
            else
                ++vote[j];
            p++;
        }

    int vote_max_idx = 0;
    for(i=1;i<nr_class;i++)
        if(vote[i] > vote[vote_max_idx])
            vote_max_idx = i;

    free(kvalue);
    free(start);
    free(vote);
    return model->label[vote_max_idx];
}
Notice that you have to recreate the decision function the code above computes for each pair of classes: sum_i sv_coef[i] * K(SV[i], x) - rho (in LibSVM, sv_coef already contains alpha_i * y_i).
The only difference is that since your model has 4 classes, you also need to implement the voting system, which is basically what the code above does.
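Kernel::k_function isn't shown above; for an RBF model like yours (kernel_type rbf, gamma 0.125) it reduces to exp(-gamma * ||x - z||^2). A minimal sketch in C++, assuming dense feature vectors rather than LibSVM's sparse svm_node lists:
#include <cmath>
#include <cstddef>
#include <vector>

// RBF kernel value between two dense feature vectors of equal length.
double rbf_kernel(const std::vector<double> &x, const std::vector<double> &z, double gamma)
{
    double sq = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        double d = x[i] - z[i];
        sq += d * d;
    }
    return std::exp(-gamma * sq);
}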
Hope it helps.

In OpenCV, what's the difference between CV_8U and CV_8UC1?

In OpenCV, is there a difference between CV_8U and CV_8UC1? Do they both refer to an 8-bit unsigned type with one channel? If so, why are there two names? If not, what's the difference?
You can see from this answer that they evaluate to identical types.
As for why there are two names: if you look at how the #defines are structured (again, see the linked answer), a type in OpenCV has two parts, the depth and the number of channels. The system is flexible enough to let you define new types with up to 512 channels. It just so happens that when you specify 1 channel, the channel component of the type is set to 0, which makes the result equivalent to simply using the depth CV_8U.
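If you want to convince yourself, here is a quick sketch using the CV_MAKETYPE macro from opencv2/core, the macro OpenCV's type #defines are built on:
#include <opencv2/core.hpp>

int main()
{
    // CV_8UC1 is CV_MAKETYPE(CV_8U, 1); with one channel the channel bits are zero,
    // so the combined type is numerically identical to the bare depth constant CV_8U.
    static_assert(CV_8UC1 == CV_MAKETYPE(CV_8U, 1), "composed from depth + channel count");
    static_assert(CV_8U == CV_8UC1, "identical when there is a single channel");
    return 0;
}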
They should be the same. Personally, I prefer to use CV_8UC1, since it makes it clearer in my code how many channels I am working with.
However, if you are dealing with a matrix that has 10 channels or more, you have to specify the number of channels explicitly.
You may want to experiment with the number of channels using the code snippet below.
#define CV_MAT_ELEM_CN( mat, elemtype, row, col ) \
    (*(elemtype*)((mat).data.ptr + (size_t)(mat).step*(row) + sizeof(elemtype)*(col)))
...
CvMat *M = cvCreateMat(4, 4, CV_32FC(10));
for(int ch = 0; ch < 10; ch++) {
    for(int i = 0; i < 4; i++) {
        for(int j = 0; j < 4; j++) {
            CV_MAT_ELEM_CN(*M, float, i, j * CV_MAT_CN(M->type) + ch) = 0.0;
            cout << CV_MAT_ELEM_CN(*M, float, i, j * CV_MAT_CN(M->type) + ch) << " ";
        }
    }
    cout << endl << endl;
}
cvReleaseMat(&M);
credit: http://note.sonots.com/OpenCV/MatrixOperations.html

using HLSL to invisibly stress a graphics card - How to stress the memory?

For a little while now I've been developing an invisible (read: doesn't produce any visual output) stressor to test the capabilities of my graphics card (and as an exploration of DirectCompute in general, with which I'm pretty new). I've got the following code right now that I'm pretty proud of:
RWStructuredBuffer<uint> BufferOut : register(u0);

[numthreads(1, 1, 1)]
void CSMain( uint3 DTid : SV_DispatchThreadID )
{
    uint total = 0;
    float p = 0;
    while(p++ < 40.0){
        float s = 4.0;
        float M = pow(2.0, p) - 1.0;
        for(uint i = 0; i <= p - 2; i++)
        {
            s = ((s*s) - 2) % M;
        }
        if(s < 1.0) total++;
    }
    BufferOut[DTid.x] = total;
}
This runs the Lucas-Lehmer test for the first 40 powers of two. When I dispatch this code in a timed loop and look at my graphics card's stats using GPU-Z, the GPU load shoots to 99% for the duration. I'm pretty happy with this, but I also notice that the heat generated by a fully loaded GPU is actually pretty minimal (I'm getting about a 5 to 10 degree Celsius jump, nowhere near the jump I get when running, say, Borderlands 2). My thought is that most of the heat comes from memory accesses, so I would need to include consistent memory accesses across the run. My initial attempt at that looked like this:
RWStructuredBuffer<uint> BufferOut : register(u0);

groupshared float4 memory_buffer[1024];

[numthreads(1, 1, 1)]
void CSMain( uint3 DTid : SV_DispatchThreadID )
{
    uint total = 0;
    float p = 0;
    while(p++ < 40.0){
        [fastopt] // to lower compile times - code efficiency is strangely not what I'm looking for right now
        for(uint j = 0; j < 1024; ++j)
            memory_buffer[j] = float4(p, p, p, p); // touch the group-shared buffer every pass
        float s = 4.0;
        float M = pow(2.0, p) - 1.0;
        for(uint i = 0; i <= p - 2; i++)
        {
            s = ((s*s) - 2) % M;
        }
        if(s < 1.0) total++;
    }
    BufferOut[DTid.x] = total;
}
Read a lot of non-coherent samples in large textures. Try both DXT1 compressed and non-compressed values. And use render to texture. And MRT. All will beat on the GPU memory systems.
