How does #pragma simd reduction(<operator>:<variable>) work under the hood? - vectorization

I would like to know in more detail how the simd reduction clause used by Intel compilers works under the hood.
In particular, for a loop of the form
double x = x_initial;
#pragma simd reduction(<operator1>:x)
for( int i = 0; i < N; i++ )
    x <operator2> some_value;
my naive guess is as follows:
The compiler initializes a private copy of x for each vector lane, then iterates through the loop one vector width at a time. If the vector width is 4 doubles, for example, this would correspond to N/4 iterations plus a remainder loop at the end. At each step of the iteration, each lane's private copy of x is updated using operator2; then at the end, the 4 lanes' private copies are combined using operator1. The auto-vectorization guide does not appear to address this directly.
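In code, this naive guess corresponds to something like the following scalar emulation (a sketch for illustration only; in particular, initializing each private copy from x's current value is part of the guess, not something the guide states):
// naive model: one private copy of x per lane, stepped with operator2,
// combined across lanes with operator1 at the end (remainder loop omitted)
double lane[4] = { x, x, x, x };
for( int i = 0; i < N; i += 4 )
    for( int l = 0; l < 4; l++ )
        lane[l] = lane[l] <operator2> some_value;
x = lane[0] <operator1> lane[1] <operator1> lane[2] <operator1> lane[3];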
I did some experimentation and found some results that agree with my expectation and some that don't. For example, I tried the case
double x = 1;
#pragma simd reduction(*:x) assert
for( int i = 0; i < 16; i++ )
    x += a[i]; // All elements of a are equal to 3.0
cout << "x after (*:x), x += a[i] loop: " << x << endl;
where operator1 is * and operator2 is +=. When I compile for avx2, which has a vector width of 4 doubles, the output is 28561 = ( 1 + 4*a[i] )^4. This implies that the code first initializes 4 lane-private copies of x to 1, then adds 3 to each of those copies 4 times as the 4-double-wide vector iterates across the trip count of 16. Each lane-private copy of x is now equal to 13. Finally, the lane-private copies are combined (reduced) using operator1, which is *, yielding 13*13*13*13 = 28561.
However, when I switch the * and + operators, like so
x = 1;
#pragma simd reduction(+:x) assert
for( int i = 0; i < 16; i++ )
    x *= a[i];
cout << "x after (+:x), x *= a[i] loop: " << x << endl;
and compile again for avx2, the output is 1.0. If my theory were correct, each vector lane should end up containing a value of 1*3^4, which would then be combined using + to yield 4*3^4 = 324. Evidently this is not the case. What am I missing?
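Editor's note: one model consistent with both observed outputs (an inference from standard OpenMP-style reduction semantics, not something the vectorization guide confirms) is that the compiler initializes each lane-private copy with the identity element of operator1 rather than with x's current value, and only folds the original x in during the final combine:
// hypothetical emulation: private copies start at the identity of operator1
// (1 for *, 0 for +); the original x joins only in the final combine
double id = /* 1 for reduction(*:x), 0 for reduction(+:x) */;
double lane[4] = { id, id, id, id };
for( int i = 0; i < 16; i += 4 )
    for( int l = 0; l < 4; l++ )
        lane[l] = lane[l] <operator2> a[i + l];
x = x <operator1> lane[0] <operator1> lane[1] <operator1> lane[2] <operator1> lane[3];
Under this model, reduction(*:x) with += gives x = 1 * 13^4 = 28561, and reduction(+:x) with *= gives x = 1 + 0 + 0 + 0 + 0 = 1, matching both observations.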

Related

Explicit memory prefetching for Intel Compilers

I have two functions: one calculates the difference between successive elements of a row, and the second calculates the difference between successive values in a column. Therefore one would calculate M[i][j+1] - M[i][j] and the second would do M[i+1][j] - M[i][j], M being the matrix. I implement them as follows:
inline void firstFunction(uchar* input, uchar* output, size_t M, size_t N){
    for(int i=0; i < M; i++){
        for(int j=0; j <= N - 33; j+=32){
            auto pos = i*N + j;
            _mm256_storeu_epi8(output + pos, _mm256_sub_epi8(_mm256_loadu_epi8(input + pos + 1), _mm256_loadu_epi8(input + pos)));
        }
    }
}

void secondFunction(uchar* input, uchar* output, size_t M, size_t N){
    for(int i = 0; i < M-1; i++){
        //#pragma prefetch input : (i+1)*N : (i+1)*N + N
        for(int j = 0; j < N-33; j+=32){
            auto idx = i * N + j;
            auto idx_1 = (i+1)*N + j;
            _mm256_storeu_epi8(output + idx, _mm256_sub_epi8(_mm256_loadu_epi8(input + idx_1), _mm256_loadu_epi8(input + idx)));
        }
    }
} // editor's note: closing brace was missing in the original
However, benchmarking them, the average runtimes for the first and second functions are as follows:
firstFunction = 21.1432ms
secondFunction = 166.851ms
where the matrix size is M = 9024 and N = 12032.
This is a huge increase in runtime for a similar operation. I suspect it has something to do with memory access and caching: far more cycles are spent fetching memory from another row in the second case.
So my question is two-part:
Is my reasoning for the difference in runtimes correct?
How do I alleviate it? My first idea is to prefetch the second row and go on, but I am not able to prefetch a dynamically calculated position. Would _mm_prefetch help if the issue is indeed memory access times?
I am using the dpcpp compiler, with compile options -g -O3 -fsycl -fsycl-targets=spir64 -mavx512f -mavx512vl -mavx512bw -qopenmp -liomp5 -lpthread. This compiler has a prefetch pragma, but it does not allow runtime-calculated prefetches. However, I would really appreciate something that is not specific to this compiler; it could also be specific to GCC.
Edit1 - Just tried _mm_prefetch, but that too throws an error on _mm_prefetch(input + (i+1) * N, N);: error: argument to '__builtin_prefetch' must be a constant integer. So an additional question: is there any way to prefetch runtime-calculated memory locations?
TIA
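Editor's note: the error above comes from the second argument, not the address. _mm_prefetch accepts a runtime-computed address; only the locality hint must be a compile-time constant. A minimal sketch (the _MM_HINT_T0 hint is just an example choice):
#include <xmmintrin.h> // _mm_prefetch, _MM_HINT_*

// the address may be computed at runtime; the hint must be a constant
_mm_prefetch(reinterpret_cast<const char*>(input + (i + 1) * N), _MM_HINT_T0);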

Inner product of two 16bit integer vectors with AVX2 in C++

I am searching for the most efficient way to multiply two aligned int16_t arrays, whose length is divisible by 16, with AVX2.
After the multiplication into a vector x, I started with _mm256_extracti128_si256 and _mm256_castsi256_si128 to get the low and high halves of x, and added them with _mm_add_epi16.
I copied the result register, applied _mm_move_epi64 to the original register, and added both again with _mm_add_epi16. Now, I think I have:
-, -, -, -, x15+x7+x11+x3, x14+x6+x10+x2, x13+x5+x9+x1, x12+x4+x8+x0
within the 128-bit register. But now I am stuck and don't know how to efficiently sum up the remaining four entries and how to extract the 16-bit result.
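Editor's note: for the step described above (summing the four remaining 16-bit entries), one standard pattern is two more shuffle-and-add rounds; a sketch, assuming v is the 128-bit register holding the partial sums and that the sums still fit in 16 bits:
__m128i t = _mm_add_epi16(v, _mm_shuffle_epi32(v, _MM_SHUFFLE(0, 0, 0, 1))); // words 0,1 += words 2,3
t = _mm_add_epi16(t, _mm_shufflelo_epi16(t, _MM_SHUFFLE(0, 0, 0, 1)));       // word 0 += word 1
int16_t result = (int16_t)_mm_extract_epi16(t, 0);
The solution below sidesteps the 16-bit overflow problem altogether by letting _mm256_madd_epi16 widen the products to 32 bits during the multiply.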
Following the comments and hours of googling, here is my working solution:
// AVX multiply
hash = 1;
start1 = std::chrono::high_resolution_clock::now();
for(int i=0; i<2000000; i++) {
    ZTYPE* xv = al_entr1.c.data();
    ZTYPE* yv = al_entr2.c.data();
    __m256i tres = _mm256_setzero_si256();
    for(int ii=0; ii < MAX_SIEVING_DIM; ii = ii+16/*8*/)
    {
        // editor's note: alignment required. Use loadu for unaligned
        __m256i xr = _mm256_load_si256((__m256i*)(xv+ii));
        __m256i yr = _mm256_load_si256((__m256i*)(yv+ii));
        // multiply 16-bit pairs, adding adjacent products into 32-bit lanes
        const __m256i tmp = _mm256_madd_epi16 (xr, yr);
        tres = _mm256_add_epi32(tmp, tres);
    }
    // Reduction: fold 256 -> 128 -> 64 -> 32 bits
    const __m128i x128 = _mm_add_epi32 ( _mm256_extracti128_si256(tres, 1), _mm256_castsi256_si128(tres));
    const __m128i x128_up = _mm_shuffle_epi32(x128, 78); // 78 = _MM_SHUFFLE(1,0,3,2): swap 64-bit halves
    const __m128i x64 = _mm_add_epi32 (x128, x128_up);
    const __m128i _x32 = _mm_hadd_epi32(x64, x64);
    const int res = _mm_extract_epi32(_x32, 0);
    hash |= res;
}
finish1 = std::chrono::high_resolution_clock::now();
elapsed1 = finish1 - start1;
std::cout << "AVX multiply: " << elapsed1.count() << " sec. (" << hash << ")" << std::endl;
It is at least the fastest solution so far:
std::inner_product: 0.819781 sec. (-14335)
std::inner_product (aligned): 0.964058 sec. (-14335)
naive multiply: 0.588623 sec. (-14335)
Unroll multiply: 0.505639 sec. (-14335)
AVX multiply: 0.0488352 sec. (-14335)

Preserve Color-Depth of imported Images

When importing images with the loadImage("...") command and iterating over the pixels like this:
img.loadPixels();
int w = img.width;
int h = img.height;
for (int y = 0; y < h; y++) {
    for (int x = 0; x < w; x++) {
        int loc = x + y*w;
        float r = red(img.pixels[loc]);
        float g = green(img.pixels[loc]);
        float b = blue(img.pixels[loc]);
        println(r + ", " + g + ", " + b);
    }
}
the R, G, B values always seem to be between 0 and 255, even if the image file has a depth of 16 bits per channel, where the values should range from 0 to 65535.
Is it possible to preserve the correct color depth?
You haven't said which library the loadImage command is from.
There might be a 16-bit version, but it's unlikely. 24 bits is a sort of standard, suitable for all but very high end work.
What I suggest you do is take a look at my TIFF loader (which, like loadImage, returns 24 bit images), and modify it to return 16-bit channels. It's not difficult, just a case of not discarding the lower bits of the larger channel images (float and 16 bit).
Here's the TIFF loader:
https://github.com/MalcolmMcLean/tiffloader

Fast Gaussian Blur image filter with ARM NEON

I'm trying to make a fast mobile version of the Gaussian blur image filter.
I've read other questions, like: Fast Gaussian blur on unsigned char image- ARM Neon Intrinsics- iOS Dev
For my purpose I need only a fixed size (7x7), fixed sigma (2) Gaussian filter.
So, before optimizing for ARM NEON, I'm implementing a 1D Gaussian kernel in C++ and comparing performance with the OpenCV GaussianBlur() method directly in a mobile environment (Android with NDK). This way the result is much simpler code to optimize.
However, the result is that my implementation is 10 times slower than the OpenCV4Android version. I've read that OpenCV for Tegra has an optimized GaussianBlur implementation, but I don't think standard OpenCV4Android has that kind of optimization, so why is my code so slow?
Here is my implementation (note: reflect101 is used for pixel reflection when applying the filter near the borders):
Mat myGaussianBlur(Mat src){
    Mat dst(src.rows, src.cols, CV_8UC1);
    Mat temp(src.rows, src.cols, CV_8UC1);
    float sum, x1, y1;

    // coefficients of 1D gaussian kernel with sigma = 2
    double coeffs[] = {0.06475879783, 0.1209853623, 0.1760326634, 0.1994711402, 0.1760326634, 0.1209853623, 0.06475879783};
    //Normalize coeffs
    float coeffs_sum = 0.9230247873f;
    for (int i = 0; i < 7; i++){
        coeffs[i] /= coeffs_sum;
    }

    // filter vertically
    for(int y = 0; y < src.rows; y++){
        for(int x = 0; x < src.cols; x++){
            sum = 0.0;
            for(int i = -3; i <= 3; i++){
                y1 = reflect101(src.rows, y - i);
                sum += coeffs[i + 3]*src.at<uchar>(y1, x);
            }
            temp.at<uchar>(y,x) = sum;
        }
    }

    // filter horizontally
    for(int y = 0; y < src.rows; y++){
        for(int x = 0; x < src.cols; x++){
            sum = 0.0;
            for(int i = -3; i <= 3; i++){
                x1 = reflect101(src.cols, x - i); // editor's note: reflect against cols, not rows, in the horizontal pass
                sum += coeffs[i + 3]*temp.at<uchar>(y, x1);
            }
            dst.at<uchar>(y,x) = sum;
        }
    }
    return dst;
}
A big part of the problem here is that the algorithm is overly precise, as @Paul R pointed out. It's usually best to keep your coefficient table no more precise than your data. In this case, since you appear to be processing uchar data, you would use roughly an 8-bit coefficient table.
Keeping these weights small will particularly matter in your NEON implementation because the narrower you have the arithmetic, the more lanes you can process at once.
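Editor's note: a sketch of what narrower arithmetic buys you on NEON (illustrative only; row, x and coeffs8 are hypothetical names). vmull_u8 multiplies eight 8-bit pixels by an 8-bit coefficient in a single instruction, widening the products to 16 bits; and because 8-bit coefficients that sum to roughly 256 keep the total below 65536, the 16-bit accumulator cannot overflow:
#include <arm_neon.h>

uint8x8_t  px  = vld1_u8(row + x);                                // 8 source pixels at once
uint16x8_t acc = vmull_u8(px, vdup_n_u8(coeffs8[3]));             // centre tap, widening multiply
acc = vmlal_u8(acc, vld1_u8(row + x - 1), vdup_n_u8(coeffs8[2])); // multiply-accumulate a neighbour tap
// ...remaining taps accumulate the same way...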
Beyond that, the first major slowdown that stands out is having the image edge reflection code within the main loop. That makes the bulk of the work less efficient, because the interior of the image never needs the special case.
It might work out better if you use a special version of the loop near the edges, and then when you're safe from that you use a simplified inner loop that doesn't call that reflect101() function.
Second (more relevant to prototype code) is that it's possible to add the wings of the window together before applying the weighting function, because the table contains the same coefficients on both sides.
sum = src.at<uchar>(y1, x) * coeffs[3];
for(int i = -3; i < 0; i++) {
    int tmp = src.at<uchar>(y + i, x) + src.at<uchar>(y - i, x);
    sum += coeffs[i + 3] * tmp;
}
This saves you six multiplies per pixel, and it's a step towards some other optimisations around controlling overflow conditions.
Then there are a couple of other problems related to the memory system.
The two-pass approach is good in principle, because it saves you from performing a lot of recomputation. Unfortunately it can push the useful data out of L1 cache, which can make everything quite a lot slower. It also means that when you write the result out to memory, you're quantising the intermediate sum, which can reduce precision.
When you convert this code to NEON, one of the things you will want to focus on is trying to keep your working set inside the register file, but without discarding calculations before they've been fully utilised.
When people do use two passes, it's usual for the intermediate data to be transposed -- that is, a column of input becomes a row of output.
This is because the CPU will really not like fetching small amounts of data across multiple lines of the input image. It works out much more efficient (because of the way the cache works) if you collect together a bunch of horizontal pixels and filter those. If the temporary buffer is transposed, then the second pass also collects together a bunch of horizontal points (which would be vertical in the original orientation) and it transposes its output again so it comes out the right way.
If you optimise to keep your working set localised, then you might not need this transposition trick, but it's worth knowing about so that you can set yourself a healthy baseline performance. Unfortunately, localisation like this does force you to go back to the non-optimal memory fetches, but with the wider data types that penalty can be mitigated.
If this is specifically for 8 bit images then you really don't want floating point coefficients, especially not double precision. Also you don't want to use floats for x1, y1. You should just use integers for coordinates and you can use fixed point (i.e. integer) for the coefficients to keep all the filter arithmetic in the integer domain, e.g.
Mat myGaussianBlur(Mat src){
    Mat dst(src.rows, src.cols, CV_8UC1);
    Mat temp(src.rows, src.cols, CV_16UC1); // <<<
    int sum, x1, y1; // <<<

    // coefficients of 1D gaussian kernel with sigma = 2
    double coeffs[] = {0.06475879783, 0.1209853623, 0.1760326634, 0.1994711402, 0.1760326634, 0.1209853623, 0.06475879783};
    int coeffs_i[7] = { 0 }; // <<<
    //Normalize coeffs
    float coeffs_sum = 0.9230247873f;
    for (int i = 0; i < 7; i++){
        coeffs_i[i] = (int)(coeffs[i] / coeffs_sum * 256); // <<< 8.8 fixed point
    }

    // filter vertically
    for(int y = 0; y < src.rows; y++){
        for(int x = 0; x < src.cols; x++){
            sum = 0; // <<<
            for(int i = -3; i <= 3; i++){
                y1 = reflect101(src.rows, y - i);
                sum += coeffs_i[i + 3]*src.at<uchar>(y1, x); // <<<
            }
            temp.at<ushort>(y,x) = sum; // editor's note: ushort to match CV_16UC1; max sum is 255*256, which fits in 16 bits
        }
    }

    // filter horizontally
    for(int y = 0; y < src.rows; y++){
        for(int x = 0; x < src.cols; x++){
            sum = 0; // <<<
            for(int i = -3; i <= 3; i++){
                x1 = reflect101(src.cols, x - i); // editor's note: cols, not rows
                sum += coeffs_i[i + 3]*temp.at<ushort>(y, x1); // <<<
            }
            dst.at<uchar>(y,x) = sum / (256 * 256); // <<<
        }
    }
    return dst;
}
This is the code after implementing all the suggestions of @Paul R and @sh1, summarized as follows:
1) use only integer arithmetic (with precision to taste)
2) add the values of the pixels at the same distance from the mask center before applying the multiplications, to reduce the number of multiplications
3) apply only horizontal filters, to take advantage of the row-major storage of the matrices
4) separate the loops along the edges from those inside the image, so as not to make unnecessary calls to the reflection functions; I removed the reflection functions entirely, inlining their logic in the edge loops
5) in addition, as a personal observation: to improve rounding without calling a (slow) round or cvRound function, I add 0.5 (= 32768 in the 16.16 fixed-point representation) to both the temporary and the final pixel results, to reduce the error/difference compared to OpenCV
Now the performance is much better: from about 15 times slower than OpenCV down to about 6 times slower.
However, the resulting matrix is not perfectly identical to the one obtained with OpenCV's Gaussian blur. This is not due to the arithmetic precision (which is sufficient); even increasing it, the difference remains. Note that the difference is minimal: between 0 and 2 (in absolute value) of pixel intensity between the matrices produced by the two versions. The coefficients are the same ones used by OpenCV, obtained with getGaussianKernel with the same size and sigma.
Mat myGaussianBlur(Mat src){
    // editor's note: dst is built in the transposed orientation and transposed
    // back at the end; the original declared it (src.rows, src.cols), which
    // only works for square images
    Mat dst(src.cols, src.rows, CV_8UC1);
    Mat temp(src.rows, src.cols, CV_8UC1);
    int sum;
    int x1;
    double coeffs[] = {0.070159, 0.131075, 0.190713, 0.216106, 0.190713, 0.131075, 0.070159};
    int coeffs_i[7] = { 0 };
    for (int i = 0; i < 7; i++){
        coeffs_i[i] = (int)(coeffs[i] * 65536); //65536
    }

    // filter horizontally - inside the image
    for(int y = 0; y < src.rows; y++){
        uchar *ptr = src.ptr<uchar>(y);
        for(int x = 3; x < (src.cols - 3); x++){
            sum = ptr[x] * coeffs_i[3];
            for(int i = -3; i < 0; i++){
                int tmp = ptr[x+i] + ptr[x-i];
                sum += coeffs_i[i + 3]*tmp;
            }
            temp.at<uchar>(y,x) = (sum + 32768) / 65536;
        }
    }

    // filter horizontally - edges - needs reflect
    for(int y = 0; y < src.rows; y++){
        uchar *ptr = src.ptr<uchar>(y);
        for(int x = 0; x <= 2; x++){
            sum = 0;
            for(int i = -3; i <= 3; i++){
                x1 = x + i;
                if(x1 < 0){
                    x1 = -x1;
                }
                sum += coeffs_i[i + 3]*ptr[x1];
            }
            temp.at<uchar>(y,x) = (sum + 32768) / 65536;
        }
    }
    for(int y = 0; y < src.rows; y++){
        uchar *ptr = src.ptr<uchar>(y);
        for(int x = (src.cols - 3); x < src.cols; x++){
            sum = 0;
            for(int i = -3; i <= 3; i++){
                x1 = x + i;
                if(x1 >= src.cols){
                    x1 = 2*src.cols - x1 - 2;
                }
                sum += coeffs_i[i + 3]*ptr[x1];
            }
            temp.at<uchar>(y,x) = (sum + 32768) / 65536;
        }
    }

    // transpose to apply again horizontal filter - better cache data locality
    transpose(temp, temp);

    // editor's note: after the transpose, the loop bounds must follow temp's
    // swapped dimensions (the original reused src.rows/src.cols, which again
    // only works for square images)
    // filter horizontally - inside the image
    for(int y = 0; y < temp.rows; y++){
        uchar *ptr = temp.ptr<uchar>(y);
        for(int x = 3; x < (temp.cols - 3); x++){
            sum = ptr[x] * coeffs_i[3];
            for(int i = -3; i < 0; i++){
                int tmp = ptr[x+i] + ptr[x-i];
                sum += coeffs_i[i + 3]*tmp;
            }
            dst.at<uchar>(y,x) = (sum + 32768) / 65536;
        }
    }

    // filter horizontally - edges - needs reflect
    for(int y = 0; y < temp.rows; y++){
        uchar *ptr = temp.ptr<uchar>(y);
        for(int x = 0; x <= 2; x++){
            sum = 0;
            for(int i = -3; i <= 3; i++){
                x1 = x + i;
                if(x1 < 0){
                    x1 = -x1;
                }
                sum += coeffs_i[i + 3]*ptr[x1];
            }
            dst.at<uchar>(y,x) = (sum + 32768) / 65536;
        }
    }
    for(int y = 0; y < temp.rows; y++){
        uchar *ptr = temp.ptr<uchar>(y);
        for(int x = (temp.cols - 3); x < temp.cols; x++){
            sum = 0;
            for(int i = -3; i <= 3; i++){
                x1 = x + i;
                if(x1 >= temp.cols){
                    x1 = 2*temp.cols - x1 - 2;
                }
                sum += coeffs_i[i + 3]*ptr[x1];
            }
            dst.at<uchar>(y,x) = (sum + 32768) / 65536;
        }
    }

    transpose(dst, dst);
    return dst;
}
According to Google's documentation, on Android devices using float/double is about twice as slow as using int/uchar.
You may find some solutions to speed up your C++ code in this Android document:
https://developer.android.com/training/articles/perf-tips

In OpenCV, what's the difference between CV_8U and CV_8UC1?

In OpenCV, is there a difference between CV_8U and CV_8UC1? Do they both refer to an 8-bit unsigned type with one channel? If so, why are there two names? If not, what's the difference?
You can see from this answer, they evaluate to identical types.
As for why there are two names, if you look at how the #defines are structured (again, see linked answer), a type in OpenCV has 2 parts, the depth, and the number of channels. The system is flexible enough to let you define new types with up to 512 channels. It just so happens that when you specify 1 channel, the channel component of type is set to 0 which makes the result equivalent to simply using the depth CV_8U.
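Editor's note: the defines in question look roughly like this (paraphrased from OpenCV's types_c.h; details vary slightly between versions), which makes the equivalence explicit:
#define CV_CN_SHIFT 3
#define CV_8U       0
#define CV_MAKETYPE(depth,cn) (CV_MAT_DEPTH(depth) + (((cn)-1) << CV_CN_SHIFT))
#define CV_8UC1     CV_MAKETYPE(CV_8U,1) // = 0 + (0 << 3) = 0, i.e. identical to CV_8U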
They should be the same. Personally, I prefer to use CV_8UC1, since it makes it clearer in my code how many channels I am working with.
However, if you are dealing with a matrix that has 10 channels or more, you need to specify the number of channels.
You may want to experiment with the number of channels using the code snippet below.
#define CV_MAT_ELEM_CN( mat, elemtype, row, col ) \
    (*(elemtype*)((mat).data.ptr + (size_t)(mat).step*(row) + sizeof(elemtype)*(col)))
...
CvMat *M = cvCreateMat(4, 4, CV_32FC(10));
for(int ch = 0; ch < 10; ch++) {
    for(int i = 0; i < 4; i++) {
        for(int j = 0; j < 4; j++) {
            CV_MAT_ELEM_CN(*M, float, i, j * CV_MAT_CN(M->type) + ch) = 0.0;
            cout << CV_MAT_ELEM_CN(*M, float, i, j * CV_MAT_CN(M->type) + ch) << " ";
        }
    }
    cout << endl << endl;
}
cvReleaseMat(&M);
credit: http://note.sonots.com/OpenCV/MatrixOperations.html
