How to make a CUDA histogram kernel?

I am writing a CUDA kernel to compute the histogram of an image, but I have no idea how to return an array from the kernel when the array may be changed while other threads are reading it. Is there a possible solution for this?
__global__ void Hist(
    TColor *dst,  // input image
    int imageW,
    int imageH,
    int *data     // histogram bins
){
    const int ix = blockDim.x * blockIdx.x + threadIdx.x;
    const int iy = blockDim.y * blockIdx.y + threadIdx.y;

    if (ix < imageW && iy < imageH)
    {
        // RED value (0-255) of the pixel at (ix, iy)
        int pixel = get_red(dst[imageW * (iy) + (ix)]);
        data[pixel]++; // ?? problem: other threads may update this bin at the same time
    }
}
// param d_dst: input image (TColor is equivalent to float4)
// param data: the array for the histogram, size [255]
extern "C" void
cuda_Hist(TColor *d_dst, int imageW, int imageH,int* data)
{
dim3 threads(BLOCKDIM_X, BLOCKDIM_Y);
dim3 grid(iDivUp(imageW, BLOCKDIM_X), iDivUp(imageH, BLOCKDIM_Y));
Hist<<<grid, threads>>>(d_dst, imageW, imageH, data);
}

Have you looked at the SDK sample? The "histogram" sample is available in the CUDA SDK (currently version 3.0 on the NVIDIA developer site, with a 3.1 beta available for registered developers).
The documentation that comes with the sample explains nicely how to handle the summation, either using global memory atomics on the GPU or by collecting the results for each block separately and then doing a separate reduction (either on the host or on the GPU).
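As a rough, untested sketch of the per-block approach (assuming 256 bins, the same get_red helper, and a data array zeroed with cudaMemset before the launch; here the per-block results are merged into data with global atomics instead of a separate reduction pass):
__global__ void HistShared(TColor *dst, int imageW, int imageH, int *data)
{
    // one partial histogram per block, kept in shared memory
    __shared__ unsigned int partial[256];
    int tid = threadIdx.y * blockDim.x + threadIdx.x;
    for (int i = tid; i < 256; i += blockDim.x * blockDim.y)
        partial[i] = 0;
    __syncthreads();

    const int ix = blockDim.x * blockIdx.x + threadIdx.x;
    const int iy = blockDim.y * blockIdx.y + threadIdx.y;
    if (ix < imageW && iy < imageH)
    {
        int pixel = get_red(dst[imageW * iy + ix]);
        atomicAdd(&partial[pixel], 1u); // shared-memory atomic, much cheaper than a global one
    }
    __syncthreads();

    // merge this block's partial histogram into the global result
    for (int i = tid; i < 256; i += blockDim.x * blockDim.y)
        atomicAdd(&data[i], (int)partial[i]);
}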

Histogramming is not particularly efficient when implemented with CUDA (or with GPGPU in general) - typically you need to generate lots of partial histograms in shared memory and then sum them. You might want to consider keeping this particular task on the CPU.

You will have to either use atomic functions to stop other threads from writing to the same memory location, or use partial histograms. Either way it is not that efficient unless the input image is very, very large.
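As a minimal illustration of the atomic approach (untested, and assuming data points to a device array that was zeroed before the launch), the offending increment in the kernel above could be replaced with an atomicAdd:
if (ix < imageW && iy < imageH)
{
    int pixel = get_red(dst[imageW * (iy) + (ix)]);
    // atomicAdd serialises concurrent updates to the same bin,
    // so no increments are lost when several threads hit the same value
    atomicAdd(&data[pixel], 1);
}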

Related

Fast vectorized pixel-wise operations on images

I want to measure the degree of similarity between two same-sized grayscale images using the mean square error. I can't use any framework that is not part of the macOS SDK (e.g. OpenCV, Eigen). A simple realization of this algorithm without vectorization looks like this:
vImage_Buffer imgA;
vImage_Buffer imgB;

NSUInteger mse = 0;
unsigned char *pxlsA = (unsigned char *)imgA.data;
unsigned char *pxlsB = (unsigned char *)imgB.data;

for (size_t i = 0; i < imgA.height * imgA.width; ++i) {
    NSUInteger d = pxlsA[i] - pxlsB[i];
    mse += d * d;
}
Is there some way to do this without a loop, in a more vectorized way? Maybe something like:
mse = ((imgA - imgB) ^ 2).sum();
The answer to this question lies in the vDSP library, which is part of the macOS SDK.
https://developer.apple.com/documentation/accelerate/vdsp
vDSP - Perform basic arithmetic operations and common digital signal processing routines on large vectors.
In my situation the vectors are not really big, but still.
Firstly, you need to convert unsigned char * to float *; notably, I don't know how to do this without a loop. Then you need two vDSP functions: vDSP_vsbsbm and vDSP_sve.
vDSP_vsbsbm - Multiplies the difference of two single-precision vectors by a second difference of two single-precision vectors.
vDSP_sve - Calculates the sum of values in a single-precision vector.
So the final code looks like this:
float *fpxlsA = (float *)malloc(imgA.height * imgA.width * sizeof(float));
float *fpxlsB = (float *)malloc(imgB.height * imgB.width * sizeof(float));
float *output = (float *)malloc(imgB.height * imgB.width * sizeof(float));

// convert the 8-bit pixels to float (both images have the same size)
for (size_t i = 0; i < imgA.height * imgA.width; ++i) {
    fpxlsA[i] = (float)(pxlsA[i]);
    fpxlsB[i] = (float)(pxlsB[i]);
}

// output[i] = (fpxlsA[i] - fpxlsB[i]) * (fpxlsA[i] - fpxlsB[i])
vDSP_vsbsbm(fpxlsA, 1, fpxlsB, 1, fpxlsA, 1, fpxlsB, 1, output, 1, imgA.height * imgA.width);

// sum of the squared differences
float sum;
vDSP_sve(output, 1, &sum, imgA.height * imgA.width);

free(output);
free(fpxlsA);
free(fpxlsB);
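(Untested aside: vDSP also appears to provide vDSP_vfltu8 for converting unsigned 8-bit values to float, which might replace the conversion loop above; a sketch:)
// untested: convert the 8-bit pixel buffers to float without an explicit loop
vDSP_vfltu8(pxlsA, 1, fpxlsA, 1, imgA.height * imgA.width);
vDSP_vfltu8(pxlsB, 1, fpxlsB, 1, imgA.height * imgA.width);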
So, this code did exactly what I wanted, in a more vectorized form, but the result isn't good enough. Comparing the performance of the loop approach and the vDSP approach, vDSP is twice as fast when there is no additional memory allocation. But in reality, where the additional memory allocation takes place, the loop approach is slightly faster.
This appears to be part of Mac OS: https://developer.apple.com/documentation/accelerate
A nice and fast way to loop, using pointer arithmetic, would be as follows...
int d;
size_t i = imgA.height * imgA.width;

while ( i -- )
{
    d = ( int )(*pxlsA++) - ( int )(*pxlsB++);
    mse += d * d;
}
EDIT
Oops, since those are unsigned chars and we calculate a difference, we need to use signed integers to do so.
And another edit - you must use pxls... here; I don't know what img... is.

OpenCV cuda::meanStdDev support for CV_32FC1

I want to find the mean pixel value and standard deviation of a GpuMat and do this reduction on the GPU rather than having to download the image and compute the mean on the CPU (since this would slow me down considerably in my application). The thing is, the GpuMat images I am dealing with are 32-bit floats; the OpenCV documentation, however, states that
CV_8UC1 matrices are supported for now
I have no trouble compiling the following code:
#include <opencv2/core/core.hpp>
#include <opencv2/core/cuda.hpp>
#include <opencv2/cudaarithm.hpp>

int main(int argc, char** argv)
{
    cv::cuda::GpuMat img = cv::cuda::GpuMat(cv::Mat::zeros(cv::Size(kIWEWidth, kIWEHeight), CV_32FC1));
    cv::Scalar mean, std;
    cv::cuda::meanStdDev(img, mean, std);
}
However, when I try to actually execute this, I'm hit with
error: (-215:Assertion failed) src.type() == CV_8UC1 in function 'meanStdDev'
So, I was wondering if anyone knows whether it's possible to compile OpenCV with 32-bit float support for the meanStdDev method, or if there are any alternative methods that can be recommended. I realise, for example, that I should be able to find the average using cuda::sum, cuda::subtract and cuda::sqrSum, but this requires a bunch of kernel launches, and in my particular case every microsecond counts.
Anyways, thanks in advance for your help!
I find it really weird that the cv::cuda version only supports CV_8UC1, because it literally calls the NPP function nppiMean_StdDev_8u_C1R, and versions for more image types exist.
#include <npp.h>
#include <cuda_runtime.h>
#include <opencv2/core/cuda.hpp>

void meanStdDev_32FC1M(cv::cuda::GpuMat src, cv::cuda::GpuMat mask, double *mean, double *stddev)
{
    CV_Assert(src.type() == CV_32FC1);

    // NPP writes the results to device memory, so allocate two doubles on the GPU
    double *mean_dev, *stddev_dev;
    cudaMalloc((void**)&mean_dev, sizeof(double));
    cudaMalloc((void**)&stddev_dev, sizeof(double));

    NppiSize sz;
    sz.width  = src.cols;
    sz.height = src.rows;

    // scratch buffer required by the NPP reduction
    int bufSize;
    nppiMeanStdDevGetBufferHostSize_32f_C1R(sz, &bufSize); // nppSafeCall
    cv::cuda::BufferPool pool(cv::cuda::Stream::Null());
    cv::cuda::GpuMat buf = pool.getBuffer(1, bufSize, CV_8UC1);

    // masked mean / standard deviation over the 32-bit float image
    nppiMean_StdDev_32f_C1MR(src.ptr<Npp32f>(), static_cast<int>(src.step),
                             mask.ptr<Npp8u>(), static_cast<int>(mask.step),
                             sz, buf.ptr<Npp8u>(), mean_dev, stddev_dev);

    cudaMemcpy(mean,   mean_dev,   sizeof(double), cudaMemcpyDeviceToHost);
    cudaMemcpy(stddev, stddev_dev, sizeof(double), cudaMemcpyDeviceToHost);
    cudaFree(mean_dev);
    cudaFree(stddev_dev);
}
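A hypothetical call site might look like the sketch below; the masked NPP variant requires a mask, so an all-ones CV_8UC1 mask of the same size is passed to include every pixel (the 640x480 size is just for illustration):
cv::cuda::GpuMat img(cv::Mat::zeros(cv::Size(640, 480), CV_32FC1));
cv::cuda::GpuMat mask(img.size(), CV_8UC1);
mask.setTo(cv::Scalar(1)); // include every pixel

double mean = 0.0, stddev = 0.0;
meanStdDev_32FC1M(img, mask, &mean, &stddev);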

Increase speed of OpenCL image processing

I think the execution time of my kernel is too high. Its job is just to blend two images together using either addition, subtraction, division or multiplication.
#define SETUP_KERNEL(name, operator)\
__kernel void name(__read_only image2d_t image1,\
__read_only image2d_t image2,\
__write_only image2d_t output,\
const sampler_t sampler1,\
const sampler_t sampler2,\
float opacity)\
{\
int2 xy = (int2) (get_global_id(0), get_global_id(1));\
float2 normalizedCoords = convert_float2(xy) / (float2) (get_image_width(output), get_image_height(output));\
float4 pixel1 = read_imagef(image1, sampler1, normalizedCoords);\
float4 pixel2 = read_imagef(image2, sampler2, normalizedCoords);\
write_imagef(output, xy, (pixel1 * opacity) operator pixel2);\
}
SETUP_KERNEL(div, /)
SETUP_KERNEL(add, +)
SETUP_KERNEL(mult, *)
SETUP_KERNEL(sub, -)
As you can see, I use macros to quickly define the different kernels. (Would it be better to use functions for that?)
Somehow the kernel managed to take 3ms on a GTX 970.
What can I do to increase the performance of this particular kernel?
Should I split it into different programs?
Bilinear interpolation is 2x-3x slower than nearest neighbor. Are you sure you are not using nearest neighbor in OpenGL?
What it does in the background (via the sampler) is something like:
R1 = ((x2 - x)/(x2 - x1))*Q11 + ((x - x1)/(x2 - x1))*Q21
R2 = ((x2 - x)/(x2 - x1))*Q12 + ((x - x1)/(x2 - x1))*Q22
After the two R values are calculated, the value of P can finally be calculated by a weighted average of R1 and R2.
P = ((y2 - y)/(y2 - y1))*R1 + ((y - y1)/(y2 - y1))*R2
The calculation has to be repeated for the red, green, blue, and optionally the alpha component of each pixel.
http://supercomputingblog.com/graphics/coding-bilinear-interpolation/
Or it may simply be that Nvidia implemented a fast path for OpenGL and the complete path for OpenCL image access. For example, on AMD, image writes are complete path, accesses to data smaller than 32 bits are complete path, and image reads are fast path.
Another option: a Z-order layout is better suited to the access pattern of this image data, and OpenCL's non-Z-order layout (suspicious, maybe not) is worse.
Division is often expensive, so I suggest moving the calculation of normalizedCoords to the host side.
On host side:
float normalized_x[output_width]; // initialize with [0..output_width-1]/output_width
float normalized_y[output_height]; // initialize with [0..output_height-1]/output_height
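A possible sketch of that initialisation (output_width / output_height stand for the output image dimensions; the filled tables then have to be uploaded to device buffers, e.g. with clEnqueueWriteBuffer, and bound to the new kernel arguments):
for (int x = 0; x < output_width; ++x)
    normalized_x[x] = (float)x / (float)output_width;   /* column x -> [0,1) */
for (int y = 0; y < output_height; ++y)
    normalized_y[y] = (float)y / (float)output_height;  /* row y -> [0,1) */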
Change kernel to:
#define SETUP_KERNEL(name, operator)\
__kernel void name(__read_only image2d_t image1,\
__read_only image2d_t image2,\
__write_only image2d_t output,\
global float *normalized_x, \
global float *normalized_y, \
const sampler_t sampler1,\
const sampler_t sampler2,\
float opacity)\
{\
int2 xy = (int2) (get_global_id(0), get_global_id(1));\
float2 normalizedCoords = (float2) (normalized_x[xy.x],normalized_y[xy.y] );\
float4 pixel1 = read_imagef(image1, sampler1, normalizedCoords);\
float4 pixel2 = read_imagef(image2, sampler2, normalizedCoords);\
write_imagef(output, xy, (pixel1 * opacity) operator pixel2);\
}
You can also try not using normalized coordinates, using the same technique; see the sketch below.
This would be more beneficial if the sizes of the input images don't change often.
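For instance, if the inputs already have the same size as the output, a sampler declared with CLK_NORMALIZED_COORDS_FALSE lets the kernel read with integer pixel coordinates and drop both the division and the lookup tables. A sketch under that assumption, shown for the addition variant only:
const sampler_t nn_sampler = CLK_NORMALIZED_COORDS_FALSE |
                             CLK_ADDRESS_CLAMP_TO_EDGE |
                             CLK_FILTER_NEAREST;

__kernel void add_unnormalized(__read_only image2d_t image1,
                               __read_only image2d_t image2,
                               __write_only image2d_t output,
                               float opacity)
{
    int2 xy = (int2)(get_global_id(0), get_global_id(1));
    float4 pixel1 = read_imagef(image1, nn_sampler, xy); /* integer coords, no division */
    float4 pixel2 = read_imagef(image2, nn_sampler, xy);
    write_imagef(output, xy, (pixel1 * opacity) + pixel2);
}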

Why Global memory version is faster than constant memory in my CUDA code?

I am working on a CUDA program and I wanted to speed up computation using constant memory, but it turned out that using constant memory makes my code ~30% slower.
I know that constant memory is good at broadcasting reads to whole warps, and I thought that my program could take advantage of it.
Here is constant memory code:
__constant__ float4 constPlanes[MAX_PLANES_COUNT];

__global__ void faultsKernelConstantMem(const float3* vertices, unsigned int vertsCount, int* displacements, unsigned int planesCount) {
    unsigned int blockId = __mul24(blockIdx.y, gridDim.x) + blockIdx.x;
    unsigned int vertexIndex = __mul24(blockId, blockDim.x) + threadIdx.x;

    if (vertexIndex >= vertsCount) {
        return;
    }

    float3 v = vertices[vertexIndex];
    int displacementSteps = displacements[vertexIndex];

    //__syncthreads();

    for (unsigned int planeIndex = 0; planeIndex < planesCount; ++planeIndex) {
        float4 plane = constPlanes[planeIndex];
        if (v.x * plane.x + v.y * plane.y + v.z * plane.z + plane.w > 0) {
            ++displacementSteps;
        }
        else {
            --displacementSteps;
        }
    }

    displacements[vertexIndex] = displacementSteps;
}
The global memory code is the same, but it has one more parameter (a pointer to the array of planes) and uses it instead of the constant array.
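For reference, a sketch of what that global-memory variant presumably looks like (only the source of the planes differs from the kernel above):
__global__ void faultsKernelGlobalMem(const float3* vertices, unsigned int vertsCount, int* displacements,
                                      const float4* planes, unsigned int planesCount) {
    unsigned int blockId = __mul24(blockIdx.y, gridDim.x) + blockIdx.x;
    unsigned int vertexIndex = __mul24(blockId, blockDim.x) + threadIdx.x;
    if (vertexIndex >= vertsCount) {
        return;
    }

    float3 v = vertices[vertexIndex];
    int displacementSteps = displacements[vertexIndex];

    for (unsigned int planeIndex = 0; planeIndex < planesCount; ++planeIndex) {
        float4 plane = planes[planeIndex]; // read from global memory instead of __constant__
        if (v.x * plane.x + v.y * plane.y + v.z * plane.z + plane.w > 0) {
            ++displacementSteps;
        }
        else {
            --displacementSteps;
        }
    }
    displacements[vertexIndex] = displacementSteps;
}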
I thought that those first global memory reads
float3 v = vertices[vertexIndex];
int displacementSteps = displacements[vertexIndex];
may cause "desynchronization" of threads, which would then not take advantage of the broadcasting of constant memory reads, so I tried to call __syncthreads(); before reading constant memory, but it did not change anything.
What is wrong? Thanks in advance!
System:
CUDA Driver Version: 5.0
CUDA Capability: 2.0
Parameters:
number of vertices: ~2.5 millions
number of planes: 1024
Results:
constant mem version: 46 ms
global mem version: 35 ms
EDIT:
So I've tried many things to make the constant memory version faster, such as:
1) Comment out the two global memory reads to see if they have any impact and they do not. Global memory was still faster.
2) Process more vertices per thread (from 8 to 64) to take advantage of CM caches. This was even slower than one vertex per thread.
2b) Use shared memory to store displacements and vertices - load all of them at the beginning, process them, and save all displacements. Again, slower than the CM example shown.
After this experience I really do not understand how the CM read broadcasting works and how it can be "used" correctly in my code. This code probably cannot be optimized with CM.
EDIT2:
Another day of tweaking, I've tried:
3) Process more vertices (8 to 64) per thread with memory coalescing (every thread advances with a stride equal to the total number of threads in the system) -- this gives better results than a stride of 1, but there is still no speedup.
4) Replace this if statement
if (v.x * plane.x + v.y * plane.y + v.z * plane.z + plane.w > 0) {
++displacementSteps;
}
else {
--displacementSteps;
}
which is giving 'unpredictable' branches, with a little bit of math to avoid branching, using this code:
float dist = v.x * plane.x + v.y * plane.y + v.z * plane.z + plane.w;
int distInt = (int)(dist * (1 << 29)); // distance is in range (0 - 2), stretch it to int range
int sign = 1 | (distInt >> (sizeof(int) * CHAR_BIT - 1)); // compute sign without using ifs
displacementSteps += sign;
Unfortunately this is a lot slower (~30%) than using the if, so ifs are not as big an evil as I thought.
EDIT3:
I am concluding this question: this problem probably cannot be improved by using constant memory. These are my results*:
*Times reported as median from 15 independent measurements. When constant memory was not large enough for saving all planes (4096 and 8192), kernel was invoked multiple times.
Although a compute capability 2.0 chip has 64k of constant memory, each of the multiprocessors has only 8k of constant-memory cache. Your code has each thread requiring access to all 16k of the constant memory (1024 planes * 16 bytes per float4), so you are losing performance through cache misses. To use constant memory effectively for the plane data, you will need to restructure your implementation.

CUDA coalesced access for two-dimensional block

For 1D cases I've pretty much understood the whole coalesced access requirement of global memory in CUDA.
However, I'm a bit stuck on the two-dimensional case (that is, we have a 2D grid made of 2D blocks).
Suppose I have a vector in_vector and in my kernel I want to access it in a coalesced manner. Like so:
__global__ void my_kernel(float* out_matrix, float* in_vector, int size)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    // ...
    float vx = in_vector[i]; // This is good. Here we have coalesced access
    float vy = in_vector[j]; // Not sure about this. All threads in my warp access the same global address. (See explanation)
    // ...
    // Do some calculations... Obtain result
}
In my understanding for this 2D case the threads inside the block are "arranged" in a column-major fashion. Eg: assuming a (threadIdx.x, threadIdx.y) notation:
the first warp would be: (0, 0), (1, 0), (2, 0), ..., (31, 0),
the second warp would be: (0, 1), (1, 1), (2, 1), ..., (31, 1),
and so on...
In this case, calling in_vector[i] gives us coalesced access, because each consecutive thread in the same warp will access consecutive addresses. However, calling in_vector[j] seems like a bad idea, as each consecutive thread will access the same address in global memory (for example, all the threads in warp 0 will access in_vector[0], which would give us 32 separate global memory requests).
Did I understand this correctly? If so, how can I make a coalesced access to global memory using in_vector[j]?
What you have shown in your question is only correct for certain block sizes. Your "coalesced" access:
int i = blockIdx.x * blockDim.x + threadIdx.x;
float vx = in_vector[i];
will result in coalesced access of in_vector from global memory only when blockDim.x is greater than or equal to 32. Even in the coalesced case, each thread within a block which shares the same threadIdx.x value reads the same word from global memory, which seems to be counter-intuitive and wasteful.
The correct way to ensure reads are unique per thread and coalesced is to calculate the thread number within the block and an offset within the grid, perhaps something like:
int tid = threadIdx.x + blockDim.x * threadIdx.y;  // thread index within the block (must use column major order)
int bid = blockIdx.x + gridDim.x * blockIdx.y;     // block index within the grid (can use either column or row major)
int offset = (blockDim.x * blockDim.y) * bid;      // block id * threads per block
float vx = in_vector[tid + offset];
If your intention really isn't to read a unique value per thread, then you can save a lot of memory bandwidth and achieve coalescing using shared memory, something like this:
__shared__ float vx[32], vy[32];

int tid = threadIdx.x + blockDim.x * threadIdx.y;

if (tid < 32) {
    vx[tid] = in_vector[blockIdx.x * blockDim.x + tid];
    vy[tid] = in_vector[blockIdx.y * blockDim.y + tid];
}
__syncthreads();
and you will get a single warp reading unique values into shared memory once. Other threads can then read values from shared memory without requiring any further global memory access. Note that in the above example I followed the conventions of your code, even if it doesn't necessarily make that much sense to read in_vector twice in that way.
