Explicit memory prefetching for Intel Compilers - vectorization

I have two functions: one calculates the difference between successive elements within a row, and the other calculates the difference between successive elements within a column. That is, the first computes M[i][j+1] - M[i][j] and the second computes M[i+1][j] - M[i][j], where M is the matrix. I implement them as follows:
inline void firstFunction(uchar* input, uchar* output, size_t M, size_t N) {
    for (int i = 0; i < M; i++) {
        for (int j = 0; j <= N - 33; j += 32) {
            auto pos = i * N + j;
            _mm256_storeu_epi8(output + pos,
                _mm256_sub_epi8(_mm256_loadu_epi8(input + pos + 1),
                                _mm256_loadu_epi8(input + pos)));
        }
    }
}
void secondFunction(uchar* input, uchar* output, size_t M, size_t N) {
    for (int i = 0; i < M - 1; i++) {
        //#pragma prefetch input : (i+1)*N : (i+1)*N + N
        for (int j = 0; j < N - 33; j += 32) {
            auto idx   = i * N + j;
            auto idx_1 = (i + 1) * N + j;
            _mm256_storeu_epi8(output + idx,
                _mm256_sub_epi8(_mm256_loadu_epi8(input + idx_1),
                                _mm256_loadu_epi8(input + idx)));
        }
    }
}
However, when benchmarking them, the average runtimes for the first and second function are as follows:
firstFunction = 21.1432ms
secondFunction = 166.851ms
where the size of the matrix is M = 9024 and N = 12032.
This is a huge increase in runtime for a similar operation. I suspect it has something to do with memory access and caching: far more cycles are spent fetching memory from another row in the second case.
So my question is two-part:
Is my reasoning for the difference in runtimes correct?
How do I alleviate it? My first idea is to prefetch the next row into memory and go ahead, but I am not able to prefetch a dynamically calculated position. Would _mm_prefetch help if the issue is indeed memory access times?
I am using the dpcpp compiler, with the compile options -g -O3 -fsycl -fsycl-targets=spir64 -mavx512f -mavx512vl -mavx512bw -qopenmp -liomp5 -lpthread. This compiler has a prefetch pragma, but it does not allow runtime-calculated prefetches. However, I would really appreciate something that is not specific to this compiler; it could also be specific to GCC.
Edit 1 - I just tried _mm_prefetch, but _mm_prefetch(input + (i+1) * N, N); throws error: argument to '__builtin_prefetch' must be a constant integer. So an additional question: is there any way to prefetch runtime-calculated memory locations?
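For illustration, here is a minimal sketch of a call that does compile (this is my assumption about the error: the complaint is about the second argument, the locality hint, which must be a compile-time constant such as _MM_HINT_T0; the address itself may be computed at runtime, and prefetchNextRow below is a hypothetical helper, not code from the original post):

#include <xmmintrin.h>  // _mm_prefetch and the _MM_HINT_* constants

inline void prefetchNextRow(const unsigned char* input, size_t i, size_t N)
{
    // The address is a runtime-computed pointer; only the hint is a constant.
    _mm_prefetch(reinterpret_cast<const char*>(input + (i + 1) * N), _MM_HINT_T0);
}

The same pattern works with GCC's __builtin_prefetch(input + (i + 1) * N), whose optional rw/locality arguments are the parts that must be compile-time constants.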
TIA

Related

How to vectorize Mersenne Twister loops over arrays

I'm currently working with a custom implementation of the Mersenne Twister, and I'd like to improve my understanding of vector operations.
I have the following code:
#define N 624
#define M 397

for (k = N - 1; k; k--)
{
    array[i] = (array[i] ^ ((array[i-1] ^ (array[i-1] >> 30)) * 1566083941UL)) - i;
    array[i] &= 0xffffffffUL;
    ++i;
    if (i >= N)
    {
        array[0] = array[N-1];
        i = 1;
    }
}
Here I'm working with 32-bit integers only, so as I understand it, I could perform 8 times as many operations at the same time using AVX2 instructions? How can I do that in practice?
I know how to deal with the addition of 2 vectors, but this case seems to be more complicated, and I don't know how to begin.
For a scalar approach I'd work like this, but I'd like to be sure how to perform these actions in my case.
for (i = 0; i < 1024; i++)
{
    C[i] = A[i] * B[i];
}

for (i = 0; i < 1024; i += 4)
{
    C[i:i+3] = A[i:i+3] * B[i:i+3];
}
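For reference, a minimal sketch of how that scalar multiply loop could be written with AVX2 intrinsics (my illustration, not part of the original code; it assumes A, B, and C are 32-bit int arrays whose length is a multiple of 8):

#include <immintrin.h>

void mulAvx2(const int* A, const int* B, int* C, int n)
{
    for (int i = 0; i < n; i += 8)                        // 8 x 32-bit lanes per __m256i
    {
        __m256i a = _mm256_loadu_si256((const __m256i*)(A + i));
        __m256i b = _mm256_loadu_si256((const __m256i*)(B + i));
        __m256i c = _mm256_mullo_epi32(a, b);             // per-lane 32-bit multiply (low 32 bits kept)
        _mm256_storeu_si256((__m256i*)(C + i), c);
    }
}

Note that the Mersenne Twister loop above is harder to vectorize directly, since each array[i] depends on the array[i-1] computed in the previous iteration.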
Unfortunately there are no lessons about intrinsics at my university, but I'm quite curious and would like to get an improvement.
I'm also wondering how to set up the array using vectors. Maybe a matrix? (Maybe _mm256_setr_epi32?)
I hope to get some advice regarding this topic!

How to implement Sobel operator

I have implemented the Sobel operator in the vertical direction, but the result I am getting is very poor. I have attached my code below.
int mask_size = 3;
char mask[3][3] = { {-1, 0, 1}, {-2, 0, 2}, {-1, 0, 1} };

void sobel(Mat input_image)
{
    /** Padding m-1 and n-1 zeroes to the result, where m and n are mask_size **/
    Mat result  = Mat::zeros(input_image.rows + (mask_size - 1) * 2,
                             input_image.cols + (mask_size - 1) * 2, CV_8UC1);
    Mat result1 = Mat::zeros(result.rows, result.cols, CV_8UC1);
    int sum = 0;

    /** For loop for copying original values to the new padded image **/
    for (int i = 0; i < input_image.rows; i++)
        for (int j = 0; j < input_image.cols; j++)
            result.at<uchar>(i + (mask_size - 1), j + (mask_size - 1)) = input_image.at<uchar>(i, j);

    GaussianBlur(result, result, Size(5, 5), 0, 0, BORDER_DEFAULT);

    /** For loop to implement the convolution **/
    for (int i = 0; i < result.rows - (mask_size - 1); i++)
        for (int j = 0; j < result.cols - (mask_size - 1); j++)
        {
            int counter = 0;
            int counterX = 0, counterY = 0;
            sum = 0;
            for (int k = i; k < i + mask_size; k++)
            {
                for (int l = j; l < j + mask_size; l++)
                {
                    sum += result.at<uchar>(k, l) * mask[counterX][counterY];
                    counterY++;
                }
                counterY = 0;
                counterX++;
            }
            result1.at<uchar>(i + mask_size / 2, j + mask_size / 2) = sum / (mask_size * mask_size);
        }
    /** Truncating all the extra rows and columns **/
    result = Mat::zeros(result1.rows - (mask_size - 1) * 2, result1.cols - (mask_size - 1) * 2, CV_8UC1);
    for (int i = 0; i < result.rows; i++)
        for (int j = 0; j < result.cols; j++)
            result.at<uchar>(i, j) = result1.at<uchar>(i + (mask_size - 1), j + (mask_size - 1));

    imshow("Input", result);
    imwrite("output2.tif", result);
}
[Images omitted: the input to the algorithm, my output, the output when a Gaussian blur is applied before convolving, and the output I am expecting.]
The guide I am using is: https://www.tutorialspoint.com/dip/sobel_operator.htm
Your convolution looks OK, although I only had a quick look.
Check your output type: it's unsigned char.
Now think about the values your output pixels may have when the kernel contains negative values, and whether it is a good idea to store them in a uchar directly.
If you store -1 in an unsigned char it wraps around and your output is 255. In case you're wondering where all that excess white stuff is coming from: that's actually small negative gradients.
The desired result looks like the absolute value of the Sobel output.
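A minimal sketch of the inner part of the convolution with that fix applied (my illustration, not the original poster's code; it needs <algorithm> for std::min and <cstdlib> for std::abs, and I have left out the division by mask_size * mask_size, which is a separate issue from the one described above):

int sum = 0;
for (int k = 0; k < mask_size; k++)
    for (int l = 0; l < mask_size; l++)
        sum += result.at<uchar>(i + k, j + l) * mask[k][l];

// Store the clamped absolute value instead of the raw (possibly negative) sum:
// abs() turns negative gradients into positive magnitudes, and min() avoids
// wrap-around when the magnitude exceeds 255.
result1.at<uchar>(i + mask_size / 2, j + mask_size / 2) =
    (uchar)std::min(255, std::abs(sum));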

Accessing global memory in CUDA is slow

I have a CUDA kernel doing some computation on a local variable (in register), and after it gets computed, its value gets written into a global array p:
__global__ void dd(float* p, int dimX, int dimY, int dimZ)
{
    int i   = blockIdx.x * blockDim.x + threadIdx.x,
        j   = blockIdx.y * blockDim.y + threadIdx.y,
        k   = blockIdx.z * blockDim.z + threadIdx.z,
        idx = j * dimX * dimY + j * dimX + i;

    if (i >= dimX || j >= dimY || k >= dimZ)
    {
        return;
    }

    float val = 0;
    val = SomeComputationOnVal();
    p[idx] = val;
    __syncthreads();
}
Unfortunately, this function executes very slowly.
However, it runs very fast if I do this:
__global__ void dd(float* p, int dimX, int dimY, int dimZ)
{
    int i   = blockIdx.x * blockDim.x + threadIdx.x,
        j   = blockIdx.y * blockDim.y + threadIdx.y,
        k   = blockIdx.z * blockDim.z + threadIdx.z,
        idx = j * dimX * dimY + j * dimX + i;

    if (i >= dimX || j >= dimY || k >= dimZ)
    {
        return;
    }

    float val = 0;
    //val = SomeComputationOnVal();
    p[idx] = val;
    __syncthreads();
}
It also runs very fast if I do this:
__global__ void dd(float* p, int dimX, int dimY, int dimZ)
{
    int i   = blockIdx.x * blockDim.x + threadIdx.x,
        j   = blockIdx.y * blockDim.y + threadIdx.y,
        k   = blockIdx.z * blockDim.z + threadIdx.z,
        idx = j * dimX * dimY + j * dimX + i;

    if (i >= dimX || j >= dimY || k >= dimZ)
    {
        return;
    }

    float val = 0;
    val = SomeComputationOnVal();
    // p[idx] = val;
    __syncthreads();
}
So I am confused and have no idea how to solve this problem. I have stepped through it with Nsight and did not find any access violations.
Here is how I launch the kernel (dimX: 924, dimY: 16, dimZ: 1120):
dim3 blockSize(8, 16, 2),
     gridSize(dimX / blockSize.x + 1, dimY / blockSize.y, dimZ / blockSize.z);

float* dev_p;
cudaMalloc((void**)&dev_p, dimX * dimY * dimZ * sizeof(float));

dd<<<gridSize, blockSize>>>(dev_p, dimX, dimY, dimZ);
Could anyone please give some pointers? It does not make much sense to me: all the computation of val is fast, and the final step is just moving val into p. p never gets involved in the computation, and it only shows up once. So why is it so slow?
The computations are basically a loop over a 512 x 512 matrix. It is a pretty fair amount of computation, I'd say.
The computations you perform in SomeComputationOnVal are extremely expensive. Each thread reads at least 1 MB of data that is off-cache (or in L2 at best for a small part, should k vary in a small range), which totals about 16 TB of data for your run (presumably 512 x 512 floats, i.e. roughly 1 MB per thread, times roughly 924 x 16 x 1120 ≈ 16.6 million threads). Even on a high-end GPU it would take about 2 minutes to run, at the minimum, not to mention everything else that could slow this down.
Your function does not write any data to global memory and has no boundary effect. The compiler may decide to optimize away the method call if you do not use its output.
Hence cases two and three, where the calculation is effectively not performed, are very fast. Writing 64 MB to GPU memory with coalesced threads is very fast (millisecond range).
You can check the generated PTX to see whether the code gets optimized out. Use the --keep option in nvcc and search for the ptx files.

How does #pragma simd reduction(<operator>:<variable>) work under the hood?

I would like to know in more detail how the simd reduction clause used by Intel compilers works under the hood.
In particular, for a loop of the form
double x = x_initial;
#pragma simd reduction(<operator1>:x)
for( int i = 0; i < N; i++ )
x <operator2> some_value;
my naive guess is as follows:
The compiler initializes a private copy of x for each vector lane, then iterates through the loop one vector width at a time. If the vector width is 4 doubles, for example, this would correspond to N/4 iterations plus a peel loop at the end. At each step of the iteration, each lane's private copy of x is updated using operator2, then at the end, the 4 vector lanes' private copies are combined using operator1. The auto-vectorization guide does not appear to address this directly.
I did some experimentation and found some results that agree with my expectation and some that don't. For example, I tried the case
double x = 1;
#pragma simd reduction(*:x) assert
for( int i = 0; i < 16; i++ )
    x += a[i]; // All elements of a are equal to 3.0
cout << "x after (*:x), x += a[i] loop: " << x << endl;
where operator1 is * and operator2 is +=. When I compile for avx2, which has a vector width of 4 doubles, the output is 28561 = ( 1 + 4*a[i] )^4. This implies that the code first initializes 4 lane-private copies of x to 1, then adds 3 to each of those copies 4 times as the 4-double-wide vector iterates across the trip count of 16. Each lane-private copy of x is now equal to 13. Finally, the lane-private copies are combined (reduced) using operator1, which is *, yielding 13*13*13*13 = 28561.
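As a concrete scalar sketch of that model (my illustration of the hypothesis, not necessarily what the compiler actually emits), for a 4-double-wide vector:

// Hypothetical expansion of the (*:x), x += a[i] loop under the model above:
double lane[4] = { 1.0, 1.0, 1.0, 1.0 };                  // lane-private copies of x
for (int i = 0; i < 16; i += 4)
    for (int l = 0; l < 4; ++l)
        lane[l] += a[i + l];                              // operator2 (+=) applied per lane
double x = lane[0] * lane[1] * lane[2] * lane[3];         // lanes combined with operator1 (*)
// With a[i] == 3.0 everywhere, each lane ends at 13.0, so x == 13^4 == 28561.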
However, when I switch the * and + operators, like so
x = 1;
#pragma simd reduction(+:x) assert
for( int i = 0; i < 16; i++ )
    x *= a[i];
cout << "x after (+:x), x *= a[i] loop: " << x << endl;
and compile again for avx2, the output is 1.0. If my theory were correct, each vector lane should end up containing a value of 1*3^4, which would then be combined using + to yield 4*3^4 = 324. Evidently this is not the case. What am I missing?

CUDA memory limitations

If I try to send to my CUDA device a struct which is larger than the amount of memory available, will CUDA give me any kind of warning or error?
I'm asking because my GPU has 1024 MB (1073414144 bytes) of total global memory, and I don't know how I should handle such an eventual problem.
That's my code:
#define VECSIZE 2250000
#define WIDTH 1500
#define HEIGHT 1500

// Matrices are stored in row-major order:
// M(row, col) = *(M.elements + row * M.width + col)
struct Matrix
{
    int width;
    int height;
    int* elements;
};

int main()
{
    Matrix M;
    M.width = WIDTH;
    M.height = HEIGHT;
    M.elements = (int *) calloc(VECSIZE, sizeof(int));

    int row, col;

    // define Matrix M
    // Matrix generator:
    for (int i = 0; i < M.height; i++)
        for (int j = 0; j < M.width; j++)
        {
            row = i;
            col = j;
            if (i == j)
                M.elements[row * M.width + col] = INFINITY;
            else
            {
                M.elements[row * M.width + col] = (rand() % 2); // because 'rand() % 1' just does not seem to work at all.
                if (M.elements[row * M.width + col] == 0) // can't have zero weight.
                    M.elements[row * M.width + col] = INFINITY;
                else if (M.elements[row * M.width + col] == 2)
                    M.elements[row * M.width + col] = 1;
            }
        }

    // Declare & send device Matrix to Device.
    Matrix d_M;
    d_M.width = M.width;
    d_M.height = M.height;
    size_t size = M.width * M.height * sizeof(int);
    cudaMalloc(&d_M.elements, size);
    cudaMemcpy(d_M.elements, M.elements, size, cudaMemcpyHostToDevice);

    int *d_k = (int*) malloc(sizeof(int));
    cudaMalloc((void**) &d_k, sizeof(int));

    int *d_width = (int*) malloc(sizeof(int));
    cudaMalloc((void**) &d_width, sizeof(int));
    unsigned int *width = (unsigned int*) malloc(sizeof(unsigned int));
    width[0] = M.width;
    cudaMemcpy(d_width, width, sizeof(int), cudaMemcpyHostToDevice);

    int *d_height = (int*) malloc(sizeof(int));
    cudaMalloc((void**) &d_height, sizeof(int));
    unsigned int *height = (unsigned int*) malloc(sizeof(unsigned int));
    height[0] = M.height;
    cudaMemcpy(d_height, height, sizeof(int), cudaMemcpyHostToDevice);

    /*
    et cetera .. */
While you may not currently be sending enough data to the GPU to max out its memory, when you do, your cudaMalloc will return the error code cudaErrorMemoryAllocation, which, as per the CUDA API docs, signals that the memory allocation failed. I note that in your example code you are not checking the return values of the CUDA calls. These return codes need to be checked to make sure your program is running correctly. The CUDA API does not throw exceptions: you must check the return codes. See this article for info on checking the errors and getting meaningful messages about them.
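For example, a common way to do this (a sketch of the usual pattern, not code from the original question) is to wrap every runtime call in a checking macro:

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CUDA_CHECK(call)                                                    \
    do {                                                                    \
        cudaError_t err = (call);                                           \
        if (err != cudaSuccess) {                                           \
            fprintf(stderr, "CUDA error '%s' at %s:%d\n",                   \
                    cudaGetErrorString(err), __FILE__, __LINE__);           \
            exit(EXIT_FAILURE);                                             \
        }                                                                   \
    } while (0)

// Usage with the allocations above, e.g.:
// CUDA_CHECK(cudaMalloc(&d_M.elements, size));
// CUDA_CHECK(cudaMemcpy(d_M.elements, M.elements, size, cudaMemcpyHostToDevice));

An out-of-memory allocation then fails loudly with cudaErrorMemoryAllocation instead of being silently ignored.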
If you are using cutil.h, then it provides two very useful macros:
CUDA_SAFE_CALL (used while issuing functions like cudaMalloc, cudaMemcpy etc.)
and
CUT_CHECK_ERROR (used after executing a kernel to check for errors in kernel execution).
They take care of the errors, if any, by using the error checking mechanism detailed in the article provided by flipchart.
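A hedged usage sketch of those macros (based on the old CUDA SDK's cutil.h, which is deprecated and not shipped with modern toolkits; myKernel is a placeholder name, not from the original code):

CUDA_SAFE_CALL( cudaMalloc((void**)&d_M.elements, size) );
CUDA_SAFE_CALL( cudaMemcpy(d_M.elements, M.elements, size, cudaMemcpyHostToDevice) );

myKernel<<<gridSize, blockSize>>>(d_M.elements, d_M.width, d_M.height);
CUT_CHECK_ERROR("myKernel failed");   // checks the error state after the kernel launch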
