Is it possible to convince clang to auto-vectorize this code without using intrinsics? - vectorization

Imagine I have this naive function to detect sphere overlap. The point of this question is not really to discuss the best way to do hit testing on spheres, so this is just for illustration.
inline bool sphere_hit(float x1, float y1, float z1, float r1,
                       float x2, float y2, float z2, float r2) {
    float xd = (x1 - x2);
    float yd = (y1 - y2);
    float zd = (z1 - z2);
    float max_dist = (r1 + r2);
    return xd * xd + yd * yd + zd * zd < max_dist * max_dist;
}
And I call it in a nested loop, as follows:
std::vector<float> xs, ys, zs, rs;
int n_spheres;
// <snip>
int n_hits = 0;
for (int i = 0; i < n_spheres; ++i) {
    for (int j = i + 1; j < n_spheres; ++j) {
        if (sphere_hit(xs[i], ys[i], zs[i], rs[i],
                       xs[j], ys[j], zs[j], rs[j])) {
            ++n_hits;
        }
    }
}
std::printf("total hits: %d\n", n_hits);
Now, clang (with -O3 -march=native) is smart enough to figure out how to vectorize (and unroll) this loop into 256-bit AVX2 instructions. Awesome!
However, if I do anything more complicated than increment the number of hits, for example calling some arbitrary function handle_hit(i, j), clang instead emits a naive scalar version.
Hits should be very rare, so what I think should happen is checking on every vectorized loop iteration if the value is true for any of the lanes, and jumping to some scalar slow path if so. This should be possible with vcmpltps followed by vmovmskps. However, I can't get clang to emit this code, even if I surround the call to sphere_hit with __builtin_expect(..., 0).
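For reference, here is a minimal hand-written sketch of that compare-and-movemask pattern with AVX2 intrinsics (this is what I'd hope the auto-vectorizer would produce; the scalar remainder of the inner loop is omitted):

#include <immintrin.h>

void handle_hit(int i, int j);

void sphere_hits_avx2(const float *xs, const float *ys, const float *zs,
                      const float *rs, int n_spheres) {
    for (int i = 0; i < n_spheres; ++i) {
        __m256 xi = _mm256_set1_ps(xs[i]);
        __m256 yi = _mm256_set1_ps(ys[i]);
        __m256 zi = _mm256_set1_ps(zs[i]);
        __m256 ri = _mm256_set1_ps(rs[i]);
        for (int j = i + 1; j + 8 <= n_spheres; j += 8) {
            __m256 xd = _mm256_sub_ps(xi, _mm256_loadu_ps(&xs[j]));
            __m256 yd = _mm256_sub_ps(yi, _mm256_loadu_ps(&ys[j]));
            __m256 zd = _mm256_sub_ps(zi, _mm256_loadu_ps(&zs[j]));
            __m256 md = _mm256_add_ps(ri, _mm256_loadu_ps(&rs[j]));
            __m256 d2 = _mm256_add_ps(_mm256_mul_ps(xd, xd),
                        _mm256_add_ps(_mm256_mul_ps(yd, yd),
                                      _mm256_mul_ps(zd, zd)));
            __m256 hit = _mm256_cmp_ps(d2, _mm256_mul_ps(md, md), _CMP_LT_OQ); // vcmpltps
            int mask = _mm256_movemask_ps(hit);                                // vmovmskps
            if (__builtin_expect(mask != 0, 0)) {   // rare scalar slow path
                for (int k = 0; k < 8; ++k)
                    if (mask & (1 << k)) handle_hit(i, j + k);
            }
        }
        // remainder of the j loop handled with scalar code (omitted)
    }
}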

Indeed it is possible to convince clang to vectorize this code. With the compiler options
-Rpass-analysis=loop-vectorize -Rpass=loop-vectorize -Rpass-missed=loop-vectorize, clang reports that the floating-point operations are vectorized, which the Godbolt output confirms (the red-underlined fors there are vectorization remarks, not errors).
Vectorization becomes possible by storing the results of sphere_hit as chars in a temporary array hitx8, and then testing 8 sphere_hit results per iteration by reading those 8 chars back from memory as a single uint64_t a. This should be quite efficient: since sphere hits are very rare, the condition a != 0 (see the code below) is rarely true, and the array hitx8 is likely to sit in L1 or L2 cache most of the time.
I didn't test the code for correctness, but at least the auto-vectorization idea should work.
/* clang -Ofast -Wall -march=broadwell -Rpass-analysis=loop-vectorize -Rpass=loop-vectorize -Rpass-missed=loop-vectorize */
#include <string.h>

char sphere_hit(float x1, float y1, float z1, float r1,
                float x2, float y2, float z2, float r2);
void handle_hit(int i, int j);

void vectorized_code(float* __restrict xs, float* __restrict ys,
                     float* __restrict zs, float* __restrict rs,
                     char* __restrict hitx8, int n_spheres)
{
    unsigned long long int a;
    for (int i = 0; i < n_spheres; ++i) {
        /* Store the boolean results temporarily in char array hitx8.  */
        /* The indices of hitx8 are shifted by i+1, so the loop starts */
        /* with hitx8[0]. Array hitx8 must have n_spheres + 8 elements. */
        for (int j = i + 1; j < n_spheres; ++j) {
            hitx8[j-i-1] = sphere_hit(xs[i], ys[i], zs[i], rs[i],
                                      xs[j], ys[j], zs[j], rs[j]);
        }
        for (int j = n_spheres; j < n_spheres + 8; ++j) {
            /* Add 8 extra zeros at the end of hitx8 (hitx8 is 8 elements longer than xs). */
            hitx8[j-i-1] = 0;
        }
        for (int j = i + 1; j < n_spheres; j = j + 8) {
            /* Check 8 sphere hits in parallel: one unsigned long long */
            /* holds 8 boolean values here. The condition a != 0 is    */
            /* still rare since sphere hits are very rare.             */
            memcpy(&a, &hitx8[j-i-1], 8);
            if (a != 0ull) {
                if (hitx8[j-i-1+0] != 0) handle_hit(i, j+0);
                if (hitx8[j-i-1+1] != 0) handle_hit(i, j+1);
                if (hitx8[j-i-1+2] != 0) handle_hit(i, j+2);
                if (hitx8[j-i-1+3] != 0) handle_hit(i, j+3);
                if (hitx8[j-i-1+4] != 0) handle_hit(i, j+4);
                if (hitx8[j-i-1+5] != 0) handle_hit(i, j+5);
                if (hitx8[j-i-1+6] != 0) handle_hit(i, j+6);
                if (hitx8[j-i-1+7] != 0) handle_hit(i, j+7);
            }
        }
    }
}

inline char sphere_hit(float x1, float y1, float z1, float r1,
                       float x2, float y2, float z2, float r2) {
    float xd = (x1 - x2);
    float yd = (y1 - y2);
    float zd = (z1 - z2);
    float max_dist = (r1 + r2);
    return xd * xd + yd * yd + zd * zd < max_dist * max_dist;
}
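A minimal driver for the listing above might look like this (a sketch, appended to the same file; filling the arrays with real sphere data is elided):

#include <stdlib.h>

int main(void) {
    int n_spheres = 1024;
    float *xs = calloc(n_spheres, sizeof(float));
    float *ys = calloc(n_spheres, sizeof(float));
    float *zs = calloc(n_spheres, sizeof(float));
    float *rs = calloc(n_spheres, sizeof(float));
    char *hitx8 = malloc(n_spheres + 8);  /* 8 elements of slack for the tail, as required above */
    /* ... fill xs/ys/zs/rs with sphere data ... */
    vectorized_code(xs, ys, zs, rs, hitx8, n_spheres);
    free(xs); free(ys); free(zs); free(rs); free(hitx8);
    return 0;
}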

Related

Fast Gaussian Blur image filter with ARM NEON

I'm trying to make a fast mobile version of the Gaussian blur image filter.
I've read other questions, like: Fast Gaussian blur on unsigned char image - ARM NEON Intrinsics - iOS Dev
For my purpose I only need a fixed size (7x7), fixed sigma (2) Gaussian filter.
So, before optimizing for ARM NEON, I'm implementing the 1D Gaussian kernel in C++ and comparing performance with the OpenCV GaussianBlur() method directly in a mobile environment (Android with NDK). This way I'll have much simpler code to optimize.
However, the result is that my implementation is 10 times slower than the OpenCV4Android version. I've read that OpenCV4Tegra has an optimized GaussianBlur implementation, but I don't think standard OpenCV4Android has that kind of optimization, so why is my code so slow?
Here is my implementation (note: reflect101 is used for pixel reflection when applying filter near borders):
Mat myGaussianBlur(Mat src){
    Mat dst(src.rows, src.cols, CV_8UC1);
    Mat temp(src.rows, src.cols, CV_8UC1);
    float sum, x1, y1;

    // coefficients of 1D gaussian kernel with sigma = 2
    double coeffs[] = {0.06475879783, 0.1209853623, 0.1760326634, 0.1994711402, 0.1760326634, 0.1209853623, 0.06475879783};
    //Normalize coeffs
    float coeffs_sum = 0.9230247873f;
    for (int i = 0; i < 7; i++){
        coeffs[i] /= coeffs_sum;
    }

    // filter vertically
    for(int y = 0; y < src.rows; y++){
        for(int x = 0; x < src.cols; x++){
            sum = 0.0;
            for(int i = -3; i <= 3; i++){
                y1 = reflect101(src.rows, y - i);
                sum += coeffs[i + 3]*src.at<uchar>(y1, x);
            }
            temp.at<uchar>(y,x) = sum;
        }
    }

    // filter horizontally
    for(int y = 0; y < src.rows; y++){
        for(int x = 0; x < src.cols; x++){
            sum = 0.0;
            for(int i = -3; i <= 3; i++){
                x1 = reflect101(src.cols, x - i); // reflect over columns, not rows
                sum += coeffs[i + 3]*temp.at<uchar>(y, x1);
            }
            dst.at<uchar>(y,x) = sum;
        }
    }
    return dst;
}
A big part of the problem, here, is that the algorithm is overly precise, as @PaulR pointed out. It's usually best to keep your coefficient table no more precise than your data. In this case, since you appear to be processing uchar data, you would use roughly an 8-bit coefficient table.
Keeping these weights small matters particularly in your NEON implementation, because the narrower the arithmetic, the more lanes you can process at once.
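To make the lane-width point concrete, here is a hypothetical NEON fragment (mac16_u8 and its arguments are illustrative, not taken from the question): with 8-bit pixels and an 8-bit coefficient, the u8 x u8 products widen into 16-bit accumulators, so one step covers 16 lanes, where float arithmetic would cover only 4 per 128-bit register.

#include <arm_neon.h>

// One step of a weighted accumulation over 16 pixels at once.
static inline void mac16_u8(uint16x8_t acc[2], const uint8_t *src, uint8_t coeff)
{
    uint8x16_t px = vld1q_u8(src);        // 16 input pixels
    uint8x8_t  w  = vdup_n_u8(coeff);     // broadcast the 8-bit weight
    acc[0] = vmlal_u8(acc[0], vget_low_u8(px),  w);   // widen + accumulate, low 8 lanes
    acc[1] = vmlal_u8(acc[1], vget_high_u8(px), w);   // widen + accumulate, high 8 lanes
}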
Beyond that, the first major slowdown that stands out is having the image-edge reflection code inside the main loop. That makes the bulk of the work less efficient, because the interior pixels generally don't need any special treatment.
It might work out better to use a special version of the loop near the edges, and then, once you're clear of them, a simplified inner loop that doesn't call reflect101() at all.
Second (and more relevant to prototype code): because the table contains the same coefficients on both sides, you can add the two wings of the window together before applying the weighting:
sum = src.at<uchar>(y, x) * coeffs[3];
for(int i = -3; i < 0; i++) {
    int tmp = src.at<uchar>(y + i, x) + src.at<uchar>(y - i, x);
    sum += coeffs[i + 3] * tmp;
}
This saves you six multiplies per pixel, and it's a step towards some other optimisations around controlling overflow conditions.
Then there are a couple of other problems related to the memory system.
The two-pass approach is good in principle, because it saves you from performing a lot of recomputation. Unfortunately it can push the useful data out of L1 cache, which can make everything quite a lot slower. It also means that when you write the result out to memory, you're quantising the intermediate sum, which can reduce precision.
When you convert this code to NEON, one of the things you will want to focus on is trying to keep your working set inside the register file, but without discarding calculations before they've been fully utilised.
When people do use two passes, it's usual for the intermediate data to be transposed -- that is, a column of input becomes a row of output.
This is because the CPU will really not like fetching small amounts of data across multiple lines of the input image. It works out much more efficient (because of the way the cache works) if you collect together a bunch of horizontal pixels and filter those. If the temporary buffer is transposed, then the second pass also collects together a bunch of horizontal points (which would be vertical in the original orientation) and it transposes its output again so it comes out the right way.
If you optimise to keep your working set localised, then you might not need this transposition trick, but it's worth knowing about so that you can set yourself a healthy baseline performance. Unfortunately, localisation like this does force you to go back to the non-optimal memory fetches, but with the wider data types that penalty can be mitigated.
If this is specifically for 8 bit images then you really don't want floating point coefficients, especially not double precision. Also you don't want to use floats for x1, y1. You should just use integers for coordinates and you can use fixed point (i.e. integer) for the coefficients to keep all the filter arithmetic in the integer domain, e.g.
Mat myGaussianBlur(Mat src){
    Mat dst(src.rows, src.cols, CV_8UC1);
    Mat temp(src.rows, src.cols, CV_16UC1); // <<<
    int sum, x1, y1; // <<<

    // coefficients of 1D gaussian kernel with sigma = 2
    double coeffs[] = {0.06475879783, 0.1209853623, 0.1760326634, 0.1994711402, 0.1760326634, 0.1209853623, 0.06475879783};
    int coeffs_i[7] = { 0 }; // <<<
    //Normalize coeffs
    float coeffs_sum = 0.9230247873f;
    for (int i = 0; i < 7; i++){
        coeffs_i[i] = (int)(coeffs[i] / coeffs_sum * 256); // <<<
    }

    // filter vertically
    for(int y = 0; y < src.rows; y++){
        for(int x = 0; x < src.cols; x++){
            sum = 0; // <<<
            for(int i = -3; i <= 3; i++){
                y1 = reflect101(src.rows, y - i);
                sum += coeffs_i[i + 3]*src.at<uchar>(y1, x); // <<<
            }
            temp.at<ushort>(y,x) = sum; // <<< 16-bit intermediate, not yet quantised
        }
    }

    // filter horizontally
    for(int y = 0; y < src.rows; y++){
        for(int x = 0; x < src.cols; x++){
            sum = 0; // <<<
            for(int i = -3; i <= 3; i++){
                x1 = reflect101(src.cols, x - i); // reflect over columns
                sum += coeffs_i[i + 3]*temp.at<ushort>(y, x1); // <<<
            }
            dst.at<uchar>(y,x) = sum / (256 * 256); // <<<
        }
    }
    return dst;
}
This is the code after implementing all the suggestions of @Paul R and @sh1, summarized as follows:
1) use only integer arithmetic (with precision to taste)
2) add the values of the pixels at the same distance from the mask center before applying the multiplications, to reduce the number of multiplications
3) apply only horizontal filters, to take advantage of the row-major storage of the matrices
4) separate the loops around the edges from those inside the image, so as not to make unnecessary calls to the reflection functions; I removed the reflection functions entirely, inlining them in the loops along the edges
5) in addition, as a personal observation: to improve rounding without calling a (slow) round or cvRound function, I add 0.5f (= 32768 in this fixed-point precision) to both the temporary and the final pixel results, which reduces the error/difference compared to OpenCV.
Now the performance is much better: from about 15 times slower than OpenCV to about 6 times slower.
However, the resulting matrix is not perfectly identical to the one produced by OpenCV's Gaussian blur. This is not due to insufficient arithmetic precision: the difference remains even with wider arithmetic. It is a small difference, between 0 and 2 (in absolute value) of pixel intensity, between the matrices produced by the two versions. The coefficients are the same ones used by OpenCV, obtained with getGaussianKernel with the same size and sigma.
Mat myGaussianBlur(Mat src){
    Mat dst(src.rows, src.cols, CV_8UC1);
    Mat temp(src.rows, src.cols, CV_8UC1);
    int sum;
    int x1;
    double coeffs[] = {0.070159, 0.131075, 0.190713, 0.216106, 0.190713, 0.131075, 0.070159};
    int coeffs_i[7] = { 0 };
    for (int i = 0; i < 7; i++){
        coeffs_i[i] = (int)(coeffs[i] * 65536); //65536
    }

    // filter horizontally - inside the image
    for(int y = 0; y < src.rows; y++){
        uchar *ptr = src.ptr<uchar>(y);
        for(int x = 3; x < (src.cols - 3); x++){
            sum = ptr[x] * coeffs_i[3];
            for(int i = -3; i < 0; i++){
                int tmp = ptr[x+i] + ptr[x-i];
                sum += coeffs_i[i + 3]*tmp;
            }
            temp.at<uchar>(y,x) = (sum + 32768) / 65536;
        }
    }

    // filter horizontally - edges - needs reflect
    for(int y = 0; y < src.rows; y++){
        uchar *ptr = src.ptr<uchar>(y);
        for(int x = 0; x <= 2; x++){
            sum = 0;
            for(int i = -3; i <= 3; i++){
                x1 = x + i;
                if(x1 < 0){
                    x1 = -x1;
                }
                sum += coeffs_i[i + 3]*ptr[x1];
            }
            temp.at<uchar>(y,x) = (sum + 32768) / 65536;
        }
    }
    for(int y = 0; y < src.rows; y++){
        uchar *ptr = src.ptr<uchar>(y);
        for(int x = (src.cols - 3); x < src.cols; x++){
            sum = 0;
            for(int i = -3; i <= 3; i++){
                x1 = x + i;
                if(x1 >= src.cols){
                    x1 = 2*src.cols - x1 - 2;
                }
                sum += coeffs_i[i + 3]*ptr[x1];
            }
            temp.at<uchar>(y,x) = (sum + 32768) / 65536;
        }
    }

    // transpose to apply again horizontal filter - better cache data locality
    // (note: as written, the second pass assumes a square image, rows == cols)
    transpose(temp, temp);

    // filter horizontally - inside the image
    for(int y = 0; y < src.rows; y++){
        uchar *ptr = temp.ptr<uchar>(y);
        for(int x = 3; x < (src.cols - 3); x++){
            sum = ptr[x] * coeffs_i[3];
            for(int i = -3; i < 0; i++){
                int tmp = ptr[x+i] + ptr[x-i];
                sum += coeffs_i[i + 3]*tmp;
            }
            dst.at<uchar>(y,x) = (sum + 32768) / 65536;
        }
    }

    // filter horizontally - edges - needs reflect
    for(int y = 0; y < src.rows; y++){
        uchar *ptr = temp.ptr<uchar>(y);
        for(int x = 0; x <= 2; x++){
            sum = 0;
            for(int i = -3; i <= 3; i++){
                x1 = x + i;
                if(x1 < 0){
                    x1 = -x1;
                }
                sum += coeffs_i[i + 3]*ptr[x1];
            }
            dst.at<uchar>(y,x) = (sum + 32768) / 65536;
        }
    }
    for(int y = 0; y < src.rows; y++){
        uchar *ptr = temp.ptr<uchar>(y);
        for(int x = (src.cols - 3); x < src.cols; x++){
            sum = 0;
            for(int i = -3; i <= 3; i++){
                x1 = x + i;
                if(x1 >= src.cols){
                    x1 = 2*src.cols - x1 - 2;
                }
                sum += coeffs_i[i + 3]*ptr[x1];
            }
            dst.at<uchar>(y,x) = (sum + 32768) / 65536;
        }
    }
    transpose(dst, dst);
    return dst;
}
According to Google's documentation, on Android devices using float/double is about twice as slow as using int/uchar.
You may find more tips for speeding up your C++ code on Android in this document:
https://developer.android.com/training/articles/perf-tips

Why do operations with an array corrupt the values?

I'm trying to implement Particle Swarm Optimization on CUDA. I partially initialize data arrays on the host, then I allocate memory on the CUDA device and copy the data there, and then try to proceed with the initialization.
The problem is, when I try to modify an array element like so
__global__ void kernelInit(
    float* X,
    size_t pitch,
    int width,
    float X_high,
    float X_low
) {
    // Silly, but pretty reliable way to address array elements
    unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int r = tid / width;
    int c = tid % width;
    float* pElement = (float*)((char*)X + r * pitch) + c;
    *pElement = *pElement * (X_high - X_low) - X_low;
    //*pElement = (X_high - X_low) - X_low;
}
it corrupts the values and gives me 1.#INF00 as the array element. When I uncomment the last line, *pElement = (X_high - X_low) - X_low;, and comment the previous one, it works as expected: I get values like 15.36 and so on.
I believe the problem is either with my memory allocation and copying, and/or with addressing the specific array element. I read the CUDA manual on both topics, but I can't spot the error: I still get a corrupt array if I do anything with the element of the array. For example, *pElement = *pElement * 2 gives unreasonably big results like 779616...00000000.00000 when the initial pElement is expected to be just a float in [0;1].
Here is the full source. Initialization of the arrays begins in main (bottom of the source), then the f1 function does the work for CUDA and launches the initialization kernel kernelInit:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <math.h>
#include <cuda.h>
#include <cuda_runtime.h>

const unsigned f_n = 3;
const unsigned n = 2;
const unsigned p = 64;

typedef struct {
    unsigned k_max;
    float c1;
    float c2;
    unsigned p;
    float inertia_factor;
    float Ef;
    float X_low[f_n];
    float X_high[f_n];
    float X_min[n][f_n];
} params_t;

typedef void (*kernelWrapperType) (
    float *X,
    float *X_highVec,
    float *V,
    float *X_best,
    float *Y,
    float *Y_best,
    float *X_swarmBest,
    bool &termination,
    const float &inertia,
    const params_t *params,
    const unsigned &f
);

typedef float (*twoArgsFuncType) (
    float x1,
    float x2
);

__global__ void kernelInit(
    float* X,
    size_t pitch,
    int width,
    float X_high,
    float X_low
) {
    // Silly, but pretty reliable way to address array elements
    unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int r = tid / width;
    int c = tid % width;
    float* pElement = (float*)((char*)X + r * pitch) + c;
    *pElement = *pElement * (X_high - X_low) - X_low;
    //*pElement = (X_high - X_low) - X_low;
}

__device__ float kernelF1(
    float x1,
    float x2
) {
    float y = pow(x1, 2.f) + pow(x2, 2.f);
    return y;
}

void f1(
    float *X,
    float *X_highVec,
    float *V,
    float *X_best,
    float *Y,
    float *Y_best,
    float *X_swarmBest,
    bool &termination,
    const float &inertia,
    const params_t *params,
    const unsigned &f
) {
    float *X_d = NULL;
    float *Y_d = NULL;
    unsigned length = n * p;
    const cudaChannelFormatDesc desc = cudaCreateChannelDesc<float4>();
    size_t pitch;
    size_t dpitch;
    cudaError_t err;
    unsigned width = n;
    unsigned height = p;

    err = cudaMallocPitch (&X_d, &dpitch, width * sizeof(float), height);
    pitch = n * sizeof(float);
    err = cudaMemcpy2D(X_d, dpitch, X, pitch, width * sizeof(float), height, cudaMemcpyHostToDevice);
    err = cudaMalloc (&Y_d, sizeof(float) * p);
    err = cudaMemcpy (Y_d, Y, sizeof(float) * p, cudaMemcpyHostToDevice);

    dim3 threads; threads.x = 32;
    dim3 blocks; blocks.x = (length/threads.x) + 1;
    kernelInit<<<threads,blocks>>>(X_d, dpitch, width, params->X_high[f], params->X_low[f]);

    err = cudaMemcpy2D(X, pitch, X_d, dpitch, n*sizeof(float), p, cudaMemcpyDeviceToHost);
    err = cudaFree(X_d);
    err = cudaMemcpy(Y, Y_d, sizeof(float) * p, cudaMemcpyDeviceToHost);
    err = cudaFree(Y_d);
}

float F1(
    float x1,
    float x2
) {
    float y = pow(x1, 2.f) + pow(x2, 2.f);
    return y;
}

/*
 * Generates random float in [0.0; 1.0]
 */
float frand(){
    return (float)rand()/(float)RAND_MAX;
}

/*
 * This is the main routine which declares and initializes the integer vector, moves it to the device,
 * launches the kernel, brings the result vector back to host and dumps it on the console.
 */
int main() {
    const params_t params = {
        100,
        0.5,
        0.5,
        p,
        0.98,
        0.01,
        {-5.12, -2.048, -5.12},
        {5.12, 2.048, 5.12},
        {{0, 1, 0}, {0, 1, 0}}
    };
    float X[p][n];
    float X_highVec[n];
    float V[p][n];
    float X_best[p][n];
    float Y[p] = {0};
    float Y_best[p] = {0};
    float X_swarmBest[n];
    kernelWrapperType F_wrapper[f_n] = {&f1, &f1, &f1};
    twoArgsFuncType F[f_n] = {&F1, &F1, &F1};

    for (unsigned f = 0; f < f_n; f++) {
        printf("Optimizing function #%u\n", f);
        srand ( time(NULL) );
        for (unsigned i = 0; i < p; i++)
            for (unsigned j = 0; j < n; j++)
                X[i][j] = X_best[i][j] = frand();
        for (int i = 0; i < n; i++)
            X_highVec[i] = params.X_high[f];
        for (unsigned i = 0; i < p; i++)
            for (unsigned j = 0; j < n; j++)
                V[i][j] = frand();
        for (unsigned i = 0; i < p; i++)
            Y_best[i] = F[f](X[i][0], X[i][1]);
        for (unsigned i = 0; i < n; i++)
            X_swarmBest[i] = params.X_high[f];
        float y_swarmBest = F[f](X_highVec[0], X_highVec[1]);
        bool termination = false;
        float inertia = 1.;
        for (unsigned k = 0; k < params.k_max; k++) {
            F_wrapper[f]((float *)X, X_highVec, (float *)V, (float *)X_best, Y, Y_best, X_swarmBest, termination, inertia, &params, f);
        }
        for (unsigned i = 0; i < p; i++) {
            for (unsigned j = 0; j < n; j++) {
                printf("%f\t", X[i][j]);
            }
            printf("F = %f\n", Y[i]);
        }
        getchar();
    }
}
Update: I tried adding error handling like so

err = cudaMallocPitch (&X_d, &dpitch, width * sizeof(float), height);
if (err != cudaSuccess) {
    fprintf(stderr, cudaGetErrorString(err));
    exit(1);
}

after each API call, but it printed nothing and never exited early (I still get all the results and the program runs to the end).
This is an unnecessarily complex piece of code for what should be a simple repro case, but this immediately jumps out:
const unsigned n = 2;
const unsigned p = 64;
unsigned length = n * p;

dim3 threads; threads.x = 32;
dim3 blocks; blocks.x = (length/threads.x) + 1;
kernelInit<<<threads,blocks>>>(X_d, dpitch, width, params->X_high[f], params->X_low[f]);
So you are firstly computing the incorrect number of blocks, and then reversing the order of the blocks per grid and threads per block arguments in the kernel launch. That may well lead to out of bounds memory access, either hosing something in GPU memory or causing an unspecified launch failure, which your lack of error handling might not be catching. There is a tool called cuda-memcheck which has been shipped with the toolkit since about CUDA 3.0. If you run it, it will give you valgrind style memory access violation reports. You should get into the habit of using it, if you are not already doing so.
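For illustration, a sketch of a corrected launch for the lines quoted above (CUDA_CHECK is just one common way to wrap the return-code checks; note that with the grid rounded up, kernelInit itself also needs an early-out guard such as if (tid >= length) return;):

#define CUDA_CHECK(call)                                          \
    do {                                                          \
        cudaError_t e = (call);                                   \
        if (e != cudaSuccess) {                                   \
            fprintf(stderr, "CUDA error %s at %s:%d\n",           \
                    cudaGetErrorString(e), __FILE__, __LINE__);   \
            exit(1);                                              \
        }                                                         \
    } while (0)

dim3 threads(32);
dim3 blocks((length + threads.x - 1) / threads.x);  // round up without the off-by-one
// blocks per grid comes first, threads per block second:
kernelInit<<<blocks, threads>>>(X_d, dpitch, width, params->X_high[f], params->X_low[f]);
CUDA_CHECK(cudaGetLastError());       // catches launch configuration errors
CUDA_CHECK(cudaDeviceSynchronize());  // catches errors raised during kernel execution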
As for the infinite values, that is to be expected, isn't it? Your code starts with values in (0,1), and then does

X[i] = X[i] * (5.12 - -5.12) - -5.12

100 times, which is roughly equivalent to multiplying by 10^100; this is then followed by

X[i] = X[i] * (2.048 - -2.048) - -2.048

100 times, which is roughly equivalent to multiplying by 4^100, and finally followed by

X[i] = X[i] * (5.12 - -5.12) - -5.12

again. So your results should be of the order of 1E250, which is much larger than 3.4E38, the rough upper limit of representable numbers in IEEE 754 single precision.
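A minimal host-side demonstration of that blow-up (plain C, using the first pair of bounds from the question):

#include <stdio.h>

int main(void) {
    float x = 0.5f;                            /* typical starting value in (0,1) */
    for (int k = 0; k < 100; ++k)              /* k_max = 100 iterations */
        x = x * (5.12f - -5.12f) - -5.12f;     /* the same update as the kernel */
    printf("%g\n", x);                         /* prints inf: 10.24^100 >> 3.4e38 */
    return 0;
}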

Fast Pixel Count on Binary Image - ARM NEON intrinsics - iOS Dev

Can someone tell me a fast function to count the number of white pixels in a binary image? I need it for an iOS app. I am working directly on the memory of the image, defined as
bool *imageData = (bool *) malloc(noOfPixels * sizeof(bool));
I am implementing the function
int whiteCount = 0;
for (int q = i; q < i + windowHeight; q++)
{
    for (int w = j; w < j + windowWidth; w++)
    {
        if (imageData[q*W + w] == 1)
            whiteCount++;
    }
}
This is obviously the slowest function possible. I heard that ARM NEON intrinsics on iOS can be used to perform several operations in one cycle. Maybe that's the way to go?
The problem is that I am not very familiar with assembly language and don't have enough time to learn it at the moment. So it would be great if anyone could post NEON intrinsics code for the problem mentioned above, or any other fast implementation in C/C++.
The only NEON intrinsics code I've been able to find online is for RGB to gray conversion:
http://computer-vision-talks.com/2011/02/a-very-fast-bgra-to-grayscale-conversion-on-iphone/
Firstly you can speed up the original code a little by factoring out the multiply and getting rid of the branch:
int whiteCount = 0;
for (int q = i; q < i + windowHeight; q++)
{
    const bool * const row = &imageData[q * W];
    for (int w = j; w < j + windowWidth; w++)
    {
        whiteCount += row[w];
    }
}
(This assumes that imageData[] is truly binary, i.e. each element can only ever be 0 or 1.)
Here is a simple NEON implementation:
#include <arm_neon.h>
// ...
int i, w;
int whiteCount = 0;
uint32x4_t v_count = { 0 };
for (q = i; q < i + windowHeight; q++)
{
const bool * const row = &imageData[q * W];
uint16x8_t vrow_count = { 0 };
for (w = j; w <= j + windowWidth - 16; w += 16) // SIMD loop
{
uint8x16_t v = vld1q_u8(&row[j]); // load 16 x 8 bit pixels
vrow_count = vpadalq_u8(vrow_count, v); // accumulate 16 bit row counts
}
for ( ; w < j + windowWidth; ++w) // scalar clean up loop
{
whiteCount += row[j];
}
v_count = vpadalq_u16(v_count, vrow_count); // update 32 bit image counts
} // from 16 bit row counts
// add 4 x 32 bit partial counts from SIMD loop to scalar total
whiteCount += vgetq_lane_s32(v_count, 0);
whiteCount += vgetq_lane_s32(v_count, 1);
whiteCount += vgetq_lane_s32(v_count, 2);
whiteCount += vgetq_lane_s32(v_count, 3);
// total is now in whiteCount
(This assumes that imageData[] is truly binary, that sizeof(bool) == 1, and that imageWidth <= 2^19; the width limit exists because each 16-bit lane of vrow_count can accumulate at most windowWidth/8 ones per row without overflowing.)
Updated version for unsigned char and values of 255 for white, 0 for black:
#include <arm_neon.h>

// ... (i, j, W, windowHeight, windowWidth are assumed defined as in the question)
int q, w;
int whiteCount = 0;
const uint8x16_t v_mask = { 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 };
uint32x4_t v_count = { 0 };
for (q = i; q < i + windowHeight; q++)
{
    const uint8_t * const row = &imageData[q * W];
    uint16x8_t vrow_count = { 0 };
    for (w = j; w <= j + windowWidth - 16; w += 16) // SIMD loop
    {
        uint8x16_t v = vld1q_u8(&row[w]);           // load 16 x 8 bit pixels
        v = vandq_u8(v, v_mask);                    // mask out all but LS bit (255 -> 1)
        vrow_count = vpadalq_u8(vrow_count, v);     // accumulate 16 bit row counts
    }
    for ( ; w < j + windowWidth; ++w)               // scalar clean up loop
    {
        whiteCount += (row[w] == 255);
    }
    v_count = vpadalq_u16(v_count, vrow_count); // update 32 bit image counts
}                                               // from 16 bit row counts
// add 4 x 32 bit partial counts from SIMD loop to scalar total
whiteCount += vgetq_lane_u32(v_count, 0);
whiteCount += vgetq_lane_u32(v_count, 1);
whiteCount += vgetq_lane_u32(v_count, 2);
whiteCount += vgetq_lane_u32(v_count, 3);
// total is now in whiteCount
(This assumes that imageData[] has values of 255 for white and 0 for black, and that imageWidth <= 2^19.)
Note that all the above code is untested and may need some further work.
http://gcc.gnu.org/onlinedocs/gcc/ARM-NEON-Intrinsics.html
Section 6.55.3.6
The vectorized algorithm will do the comparisons and put them in a structure for you, but you'd still need to go through each element of the structure and determine if it's a zero or not.
How fast does that loop currently run and how fast do you need it to run? Also remember that NEON will work in the same registers as the floating point unit, so using NEON here may force an FPU context switch.

CUDA memory limitations

If I try to send to my CUDA device a struct which is larger than the amount of available memory, will CUDA give me any kind of warning or error?
I'm asking because my GPU has 1024 MBytes (1073414144 bytes) of total global memory, and I don't know how I should handle a possible problem.
Here is my code:
#define VECSIZE 2250000
#define WIDTH 1500
#define HEIGHT 1500

// Matrices are stored in row-major order:
// M(row, col) = *(M.elements + row * M.width + col)
struct Matrix
{
    int width;
    int height;
    int* elements;
};

int main()
{
    Matrix M;
    M.width = WIDTH;
    M.height = HEIGHT;
    M.elements = (int *) calloc(VECSIZE, sizeof(int));
    int row, col;

    // define Matrix M
    // Matrix generator:
    for (int i = 0; i < M.height; i++)
        for (int j = 0; j < M.width; j++)
        {
            row = i;
            col = j;
            if (i == j)
                M.elements[row * M.width + col] = INFINITY;
            else
            {
                M.elements[row * M.width + col] = (rand() % 2); // because 'rand() % 1' just does not seem to work at all
                if (M.elements[row * M.width + col] == 0)       // can't have zero weight
                    M.elements[row * M.width + col] = INFINITY;
                else if (M.elements[row * M.width + col] == 2)
                    M.elements[row * M.width + col] = 1;
            }
        }

    // Declare & send device Matrix to Device.
    Matrix d_M;
    d_M.width = M.width;
    d_M.height = M.height;
    size_t size = M.width * M.height * sizeof(int);
    cudaMalloc(&d_M.elements, size);
    cudaMemcpy(d_M.elements, M.elements, size, cudaMemcpyHostToDevice);

    int *d_k = (int*) malloc(sizeof(int));
    cudaMalloc((void**) &d_k, sizeof(int));
    int *d_width = (int*) malloc(sizeof(int));
    cudaMalloc((void**) &d_width, sizeof(int));
    unsigned int *width = (unsigned int*) malloc(sizeof(unsigned int));
    width[0] = M.width;
    cudaMemcpy(d_width, width, sizeof(int), cudaMemcpyHostToDevice);
    int *d_height = (int*) malloc(sizeof(int));
    cudaMalloc((void**) &d_height, sizeof(int));
    unsigned int *height = (unsigned int*) malloc(sizeof(unsigned int));
    height[0] = M.height;
    cudaMemcpy(d_height, height, sizeof(int), cudaMemcpyHostToDevice);
    /*
    et cetera .. */
While you may not currently be sending enough data to the GPU to max out its memory, when you do, your cudaMalloc will return the error code cudaErrorMemoryAllocation, which as per the CUDA API docs signals that the memory allocation failed. Note that in your example code you are not checking the return values of the CUDA calls. These return codes need to be checked to make sure your program is running correctly. The CUDA API does not throw exceptions: you must check the return codes. See this article for info on checking the errors and getting meaningful messages about the errors.
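For illustration, a minimal sketch of that kind of check applied to the allocation in the question (assuming the includes and the Matrix M from the question's code):

size_t size = (size_t)M.width * M.height * sizeof(int);
int *d_elements = NULL;
cudaError_t err = cudaMalloc((void**)&d_elements, size);
if (err != cudaSuccess) {  // e.g. cudaErrorMemoryAllocation when the request doesn't fit
    fprintf(stderr, "cudaMalloc of %zu bytes failed: %s\n",
            size, cudaGetErrorString(err));
    exit(1);
}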
If you are using cutil.h, then it provides two very useful macros:
CUDA_SAFE_CALL (used while issuing functions like cudaMalloc, cudaMemcpy etc.)
and
CUT_CHECK_ERROR (used after executing a kernel to check for errors in kernel execution).
They take care of the errors, if any, by using the error checking mechanism detailed in the article provided by flipchart.

Search for lines with a small range of angles in OpenCV

I'm using the Hough transform in OpenCV to detect lines. However, I know in advance that I only need lines within a very limited range of angles (about 10 degrees or so). I'm doing this in a very performance sensitive setting, so I'd like to avoid the extra work spent detecting lines at other angles, lines I know in advance I don't care about.
I could extract the Hough source from OpenCV and just hack it to take min_theta and max_theta parameters, but I'd like a less fragile approach (having to manually update my code with each OpenCV update, etc.).
What's the best approach here?
Well, I've modified the icvHoughLinesStandard function to scan only a certain range of angles. I'm sure there are cleaner ways, perhaps playing with the memory allocation as well, but I got a speed gain from 100ms to 33ms by narrowing the range from 180deg to 60deg, so I'm happy with that.
Note that this code also outputs the accumulator value. I also only output one line, because that fit my purposes, but there was no real gain there.
/* CvLinePolar plus the accumulator value (votes) -- needed by the code below */
typedef struct CvLinePolar2
{
    float rho;
    float angle;
    int votes;
} CvLinePolar2;

static void
icvHoughLinesStandard2( const CvMat* img, float rho, float theta,
                        int threshold, CvSeq *lines, int linesMax )
{
    cv::AutoBuffer<int> _accum, _sort_buf;
    cv::AutoBuffer<float> _tabSin, _tabCos;
    const uchar* image;
    int step, width, height;
    int numangle, numrho;
    int total = 0;
    float ang;
    int r, n;
    int i, j;
    float irho = 1 / rho;
    double scale;

    CV_Assert( CV_IS_MAT(img) && CV_MAT_TYPE(img->type) == CV_8UC1 );

    image = img->data.ptr;
    step = img->step;
    width = img->cols;
    height = img->rows;

    numangle = cvRound(CV_PI / theta);
    numrho = cvRound(((width + height) * 2 + 1) / rho);

    _accum.allocate((numangle+2) * (numrho+2));
    _sort_buf.allocate(numangle * numrho);
    _tabSin.allocate(numangle);
    _tabCos.allocate(numangle);
    int *accum = _accum, *sort_buf = _sort_buf;
    float *tabSin = _tabSin, *tabCos = _tabCos;

    memset( accum, 0, sizeof(accum[0]) * (numangle+2) * (numrho+2) );

    // find n and ang limits (in our case we want 60 to 120 degrees)
    float limit_min = 60.0/180.0*CV_PI;
    float limit_max = 120.0/180.0*CV_PI;
    //num_steps = (limit_max - limit_min)/theta;
    int start_n = floor(limit_min/theta);
    int stop_n = floor(limit_max/theta);
    for( ang = limit_min, n = start_n; n < stop_n; ang += theta, n++ )
    {
        tabSin[n] = (float)(sin(ang) * irho);
        tabCos[n] = (float)(cos(ang) * irho);
    }

    // stage 1. fill accumulator (only for the restricted angle range)
    for( i = 0; i < height; i++ )
        for( j = 0; j < width; j++ )
        {
            if( image[i * step + j] != 0 )
                for( n = start_n; n < stop_n; n++ )
                {
                    r = cvRound( j * tabCos[n] + i * tabSin[n] );
                    r += (numrho - 1) / 2;
                    accum[(n+1) * (numrho+2) + r+1]++;
                }
        }

    // stage 2. find the single strongest accumulator cell in the angle range
    int max_accum = 0;
    int max_ind = 0;
    for( r = 0; r < numrho; r++ )
    {
        for( n = start_n; n < stop_n; n++ )
        {
            int base = (n+1) * (numrho+2) + r+1;
            if (accum[base] > max_accum)
            {
                max_accum = accum[base];
                max_ind = base;
            }
        }
    }

    CvLinePolar2 line;
    scale = 1./(numrho+2);
    int idx = max_ind;
    n = cvFloor(idx*scale) - 1;
    r = idx - (n+1)*(numrho+2) - 1;
    line.rho = (r - (numrho - 1)*0.5f) * rho;
    line.angle = n * theta;
    line.votes = accum[idx];
    cvSeqPush( lines, &line );
}
If you use the probabilistic Hough transform, the output is a pair of cvPoints for each detected segment. You can get the x and y coordinates of the two points as pt1.x, pt1.y and pt2.x, pt2.y.
Then use the simple formula for the slope of a line, (y2 - y1) / (x2 - x1). Taking the arctangent of that yields the angle in radians. Then simply filter out the desired angles from the values obtained for each Hough line.
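A sketch of that filtering step for HoughLinesP output, using the C++ API (the 90-degree center and 5-degree tolerance are placeholder values; atan2 is used instead of a raw slope so vertical segments don't divide by zero):

#include <opencv2/imgproc/imgproc.hpp>
#include <cmath>
#include <vector>

// Keep only segments whose direction is within +/- 5 degrees of vertical.
std::vector<cv::Vec4i> filterByAngle(const std::vector<cv::Vec4i>& segments)
{
    std::vector<cv::Vec4i> kept;
    for (const cv::Vec4i& s : segments) {   // s = (x1, y1, x2, y2)
        double angle = std::atan2(double(s[3] - s[1]), double(s[2] - s[0])) * 180.0 / CV_PI;
        if (angle < 0) angle += 180.0;      // fold direction into [0, 180)
        if (std::fabs(angle - 90.0) <= 5.0)
            kept.push_back(s);
    }
    return kept;
}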
I think it's more natural to use the standard HoughLines(...) function, which gives a collection of lines directly in rho and theta terms, and to select the necessary angle range from it, rather than recalculating the angle from segment end points.
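In fact, newer OpenCV releases (3.0 and later) expose this directly: cv::HoughLines takes optional min_theta and max_theta parameters, so the accumulator only ever votes on the desired angle range. A sketch (the threshold and the 85-95 degree window are placeholder values):

#include <opencv2/imgproc/imgproc.hpp>
#include <vector>

void detectNearVertical(const cv::Mat& edges, std::vector<cv::Vec2f>& lines)
{
    const double deg = CV_PI / 180.0;
    // Accumulate only theta in [85deg, 95deg]; other angles are never voted on.
    cv::HoughLines(edges, lines, /*rho*/ 1, /*theta step*/ deg, /*threshold*/ 100,
                   /*srn*/ 0, /*stn*/ 0, /*min_theta*/ 85 * deg, /*max_theta*/ 95 * deg);
}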
