How to convert uint32 to uint8 using SIMD, but not AVX-512? (sse)

Say there are a lot of uint32s stored in aligned memory uint32 *p; how can I convert them to uint8s with SIMD?
I see there is _mm256_cvtepi32_epi8/vpmovdb, but it belongs to AVX-512 and my CPU doesn't support it 😢

If you really have a lot of them, I would do something like this (untested).
The main loop reads 64 bytes per iteration, containing 16 uint32_t values, shuffles the bytes around implementing the truncation, merges the result into a single register, and writes 16 bytes with a vector store instruction.
#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

void convertToBytes( const uint32_t* source, uint8_t* dest, size_t count )
{
    // 4 bytes of the shuffle mask to fetch bytes 0, 4, 8 and 12 from a 16-byte source vector
    constexpr int shuffleScalar = 0x0C080400;
    // Mask to shuffle the first 8 values of the batch, making the first 8 bytes of the result
    const __m256i shuffMaskLow = _mm256_setr_epi32( shuffleScalar, -1, -1, -1, -1, shuffleScalar, -1, -1 );
    // Mask to shuffle the last 8 values of the batch, making the last 8 bytes of the result
    const __m256i shuffMaskHigh = _mm256_setr_epi32( -1, -1, shuffleScalar, -1, -1, -1, -1, shuffleScalar );
    // Indices for the final _mm256_permutevar8x32_epi32
    const __m256i finalPermute = _mm256_setr_epi32( 0, 5, 2, 7, 0, 5, 2, 7 );
    const uint32_t* const sourceEnd = source + count;
    // Vectorized portion, each iteration handles 16 values.
    // Round down the count, making it a multiple of 16.
    const size_t countRounded = count & ~( (size_t)15 );
    const uint32_t* const sourceEndAligned = source + countRounded;
    while( source < sourceEndAligned )
    {
        // Load 16 inputs into 2 vector registers
        const __m256i s1 = _mm256_load_si256( ( const __m256i* )source );
        const __m256i s2 = _mm256_load_si256( ( const __m256i* )( source + 8 ) );
        source += 16;
        // Shuffle bytes into correct positions; this zeroes out the rest of the bytes.
        const __m256i low = _mm256_shuffle_epi8( s1, shuffMaskLow );
        const __m256i high = _mm256_shuffle_epi8( s2, shuffMaskHigh );
        // Unused bytes were zeroed out, so a bitwise OR merges the two, very fast.
        const __m256i res32 = _mm256_or_si256( low, high );
        // Final shuffle of the 32-bit values into correct positions
        const __m256i res16 = _mm256_permutevar8x32_epi32( res32, finalPermute );
        // Store the lower 16 bytes of the result
        _mm_storeu_si128( ( __m128i* )dest, _mm256_castsi256_si128( res16 ) );
        dest += 16;
    }
    // Deal with the remainder
    while( source < sourceEnd )
    {
        *dest = (uint8_t)( *source );
        source++;
        dest++;
    }
}
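Since the question is tagged sse, here is a minimal SSSE3 sketch of the same truncation for CPUs without AVX2 (my addition, equally untested). It handles only 4 values per iteration, shuffling byte 0 of each lane to the front and storing 4 bytes at a time:

#include <tmmintrin.h> // SSSE3
#include <stdint.h>
#include <stddef.h>
#include <string.h>

void convertToBytes_ssse3( const uint32_t* source, uint8_t* dest, size_t count )
{
    // Move bytes 0, 4, 8 and 12 into the first 4 bytes of the vector, zero the rest
    const __m128i mask = _mm_setr_epi8( 0, 4, 8, 12, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1 );
    const size_t countRounded = count & ~(size_t)3;
    size_t i = 0;
    for( ; i < countRounded; i += 4 )
    {
        // Aligned load is fine here since the question says the source is aligned
        const __m128i v = _mm_load_si128( ( const __m128i* )( source + i ) );
        const __m128i packed = _mm_shuffle_epi8( v, mask );
        // Store the lowest 4 bytes of the vector
        const uint32_t bytes4 = (uint32_t)_mm_cvtsi128_si32( packed );
        memcpy( dest + i, &bytes4, 4 );
    }
    // Deal with the remainder
    for( ; i < count; i++ )
        dest[ i ] = (uint8_t)source[ i ];
}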

Related

What are my options to convert an OpenCV reduce loop to native iOS code? SIMD, anyone?

Which native iOS framework is best used to eradicate this CPU hog written in OpenCV?
/// Reduce the channel elements of given Mat to a single channel
static func reduce(input: Mat) throws -> Mat {
    let output = Mat(rows: input.rows(), cols: input.cols(), type: CvType.CV_8UC1)
    for x in 0 ..< input.rows() {
        for y in 0 ..< input.cols() {
            let value = input.get(row: x, col: y)
            let dataValue = value.reduce(0, +)
            try output.put(row: x, col: y, data: [dataValue])
        }
    }
    return output
}
It takes about 20+ seconds to do those gets and puts on the real-world data I put this code through.
Assuming your input matrix is CV_64FC2, call the computeSumX2 C function below for each row.
Untested.
#include <arm_neon.h>
#include <stdint.h>
#include <stddef.h>

// Load 8 FP64 values, add them pairwise, convert to uint64, narrow to uint32, combine into a single vector
inline uint32x4_t reduce4( const double* rsi )
{
    // Load 8 values
    float64x2x4_t f64 = vld1q_f64_x4( rsi );
    // Add them pairwise
    float64x2_t f64_1 = vpaddq_f64( f64.val[ 0 ], f64.val[ 1 ] );
    float64x2_t f64_2 = vpaddq_f64( f64.val[ 2 ], f64.val[ 3 ] );
    // Convert FP64 to uint64
    uint64x2_t i64_1 = vcvtq_u64_f64( f64_1 );
    uint64x2_t i64_2 = vcvtq_u64_f64( f64_2 );
    // Narrow uint64 to uint32 in a single vector, using saturation
    uint32x2_t low = vqmovn_u64( i64_1 );
    return vqmovn_high_u64( low, i64_2 );
}

// Compute pairwise sums of FP64 values, cast to bytes
void computeSumX2( uint8_t* rdi, size_t length, const double* rsi )
{
    const double* const rsiEnd = rsi + length * 2;
    const size_t lengthAligned = ( length / 16 ) * 16;
    const double* const rsiEndAligned = rsi + lengthAligned * 2;
    for( ; rsi < rsiEndAligned; rsi += 16 * 2, rdi += 16 )
    {
        // Each iteration of the loop loads 32 source values and stores 16 bytes
        uint16x4_t low16 = vqmovn_u32( reduce4( rsi ) );
        uint16x8_t u16 = vqmovn_high_u32( low16, reduce4( rsi + 8 ) );
        uint8x8_t low8 = vqmovn_u16( u16 );
        low16 = vqmovn_u32( reduce4( rsi + 8 * 2 ) );
        u16 = vqmovn_high_u32( low16, reduce4( rsi + 8 * 3 ) );
        uint8x16_t res = vqmovn_high_u16( low8, u16 );
        vst1q_u8( rdi, res );
    }
    for( ; rsi < rsiEnd; rsi += 2, rdi++ )
    {
        // Each iteration of the loop loads 2 source values and stores a single byte
        float64x2_t f64 = vld1q_f64( rsi );
        double sum = vaddvq_f64( f64 );
        *rdi = (uint8_t)sum;
    }
}
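A hedged usage sketch from the C++ side (assuming `input` is a CV_64FC2 cv::Mat and `output` a CV_8UC1 cv::Mat with the same dimensions), calling the function once per row as suggested:

#include <opencv2/core.hpp>

void reduceRows( const cv::Mat& input, cv::Mat& output )
{
    for( int r = 0; r < input.rows; r++ )
        computeSumX2( output.ptr<uint8_t>( r ), (size_t)input.cols, input.ptr<double>( r ) );
}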
For folks such as myself with a poor comprehension of ARM intrinsics,
a simpler solution is to bridge into Objective-C code as Soonts did,
ditching the crude Swift API to OpenCV and bypassing the costly memory copies from the gets and puts.
#include <assert.h>

void fasterSumX2( const char *input,
                  int rows,
                  int columns,
                  long step,
                  int channels,
                  char *output,
                  long output_step )
{
    for( int j = 0; j < rows; j++ ) {
        for( int i = 0; i < columns; i++ ) {
            long offset = step * j + i * channels;
            const unsigned char *ptr = (const unsigned char *)( input + offset );
            int res = ptr[0] + ptr[1];
            // The two channels must never sum above 255
            if( res > 255 ) {
                assert( false );
            }
            *( output + output_step * j + i ) = res;
        }
    }
}
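A hedged sketch of the call site from the C++/Objective-C side (my assumption: `input` is an interleaved 2-channel 8-bit cv::Mat and `output` is CV_8UC1), passing the raw Mat buffers so no per-element get/put copies happen:

fasterSumX2( (const char *)input.data, input.rows, input.cols,
             (long)input.step, input.channels(),
             (char *)output.data, (long)output.step );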

simulating dynamic memory allocation in OpenCl

I ran into a problem which is driving me crazy.
I need to simulate dynamic memory allocation in an OpenCL kernel. To that end, I have the following malloc function defined in a *.cl file:
__global void* malloc(size_t size, __global byte *heap, __global uint *next)
{
    uint index = atomic_add(next, size);
    return heap + index;
}
In the host program, I dynamically allocate a large array of type cl_uchar for this virtual heap, as follows:
int MAX_NUM_OF_HEADERS_PROCESSED_IN_PARALLEL = 1000;
cl_uchar* heap = new cl_uchar[1000000];
cl_uint* next = new cl_uint;
*next = 0;
cl_uint* test_result = new cl_uint[MAX_NUM_OF_HEADERS_PROCESSED_IN_PARALLEL];
cl_mem memory[3] = { 0, 0, 0 };
cl_int error;
memory[0] = clCreateBuffer(GPU_context, CL_MEM_READ_WRITE,
    sizeof(cl_uchar) * MAX_HEAP_SIZE, NULL, NULL);
memory[1] = clCreateBuffer(GPU_context, CL_MEM_READ_WRITE,
    sizeof(cl_uint), NULL, &error);
memory[2] = clCreateBuffer(GPU_context, CL_MEM_READ_WRITE,
    sizeof(cl_uint) * MAX_NUM_OF_HEADERS_PROCESSED_IN_PARALLEL, NULL, &error);
clEnqueueWriteBuffer(command_queue, memory[0], CL_TRUE, 0,
    sizeof(cl_uchar) * MAX_HEAP_SIZE, heap, 0, NULL, NULL);
clEnqueueWriteBuffer(command_queue, memory[1], CL_TRUE, 0,
    sizeof(cl_uint), next, 0, NULL, NULL);
error = 0;
error |= clSetKernelArg(kernel, 0, sizeof(cl_mem), &memory[0]);
error |= clSetKernelArg(kernel, 1, sizeof(cl_mem), &memory[1]);
error |= clSetKernelArg(kernel, 2, sizeof(cl_mem), &memory[2]);
size_t globalWorkSize[1] = { MAX_NUM_OF_HEADERS_PROCESSED_IN_PARALLEL };
size_t localWorkSize[1] = { 1 };
error = 0;
error = clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL,
    globalWorkSize, localWorkSize, 0, NULL, NULL);
I also have the following kernel:
__kernel void packet_routing2(__global byte* heap_, __global uint* next, __global uint* test_result)
{
    int gid = get_global_id(0);
    __global uint* xx[100];
    for (int i = 0; i < 100; i++)
    {
        xx[i] = (__global uint*) malloc(sizeof(uint), heap_, next);
        *xx[i] = i * gid;
        test_result[gid] = *(xx[0]);
    }
}
I encountered the following error when I run the program:
" %27 = load i32 addrspace(1)* %26, align 4, !tbaa !17
Illegal pointer which is not from a valid memory space.
Aborting..."
Could you please help me fix this issue? I also found out that if xx has only 10 elements instead of 100, the code works fine!
Edit: The simplest solution: add padding to 'size' before malloc, so that all struct types (smaller in size than the maximum padding) get the alignment they need.
0=struct footprint in memory
*=heap
_=padding
***000_____*****0000____****0_______****00000___*****0000000_*******00______***
|
v
save this unused padded memory space in its thread to use later.
It is important that the first/starting address satisfies the maximum alignment requirement. If there is a struct 256 bytes long, it should start at a multiple of 256.
struct size    malloc size    minimum 'next' value (address, not offset)
1-4            4              multiple of 4
5-8            8              multiple of 8
9-16           16             multiple of 16
17-32          32             multiple of 32
33-64          64             multiple of 64
If there is a 64-byte struct, even an int now needs a 64-byte malloc size. Maybe you can save those leftover areas locally per thread to use later.
That way it doesn't give alignment errors, and it probably works faster for the structs that don't need the padding.
Also, float3 natively needs 16 bytes.
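For example, a minimal sketch of that padded malloc, assuming a maximum alignment requirement of 16 bytes (enough for float4/float3); rounding every request up to a multiple of the alignment keeps every returned offset aligned:

#define MAX_ALIGN 16u // assumption: the largest alignment any allocated type needs

__global void* malloc_padded(uint size, __global uchar *heap, __global uint *next)
{
    // Round the requested size up to a multiple of MAX_ALIGN
    uint padded = (size + MAX_ALIGN - 1u) & ~(MAX_ALIGN - 1u);
    uint index = atomic_add(next, padded);
    return heap + index;
}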

Image Processing: Image has grid lines after applying filter

I'm very new to working with image processing at a low level and have just had a go at implementing a Gaussian kernel with both GPU and CPU - however both yield the same output, an image severely skewed by a grid:
I'm aware I could use OpenCV's pre-built functions to handle the filters, but I wanted to learn the methodology behind it, so I built my own.
Convolution kernel:
// Convolution kernel - this manipulates the given channel and writes out a new blurred channel.
void convoluteChannel_cpu(
    const unsigned char* const channel,          // Input channel
    unsigned char* const channelBlurred,         // Output channel
    const size_t numRows, const size_t numCols,  // Channel width/height (rows, cols)
    const float *filter,                         // The filter weights to convolve with
    const int filterWidth                        // This is normally a sample of 9
)
{
    // Loop through the image's given R, G or B channel
    for(int rows = 0; rows < (int)numRows; rows++)
    {
        for(int cols = 0; cols < (int)numCols; cols++)
        {
            // Declare new pixel colour value
            float newColor = 0.f;
            // Loop for every row along the stencil size (3x3 matrix)
            for(int filter_x = -filterWidth/2; filter_x <= filterWidth/2; filter_x++)
            {
                // Loop for every col along the stencil size (3x3 matrix)
                for(int filter_y = -filterWidth/2; filter_y <= filterWidth/2; filter_y++)
                {
                    // Clamp to the boundary of the image to ensure we don't access an invalid index.
                    int image_x = __min(__max(rows + filter_x, 0), static_cast<int>(numRows - 1));
                    int image_y = __min(__max(cols + filter_y, 0), static_cast<int>(numCols - 1));
                    // Fetch the neighbouring pixel; the channel is row-major, so the index is row * numCols + col
                    float pixel = static_cast<float>(channel[image_x * numCols + image_y]);
                    // Fetch the matching filter weight; without this weighting the image becomes choppy.
                    float sigma = filter[(filter_x + filterWidth / 2) * filterWidth + filter_y + filterWidth/2];
                    //float sigma = 1 / 81.f;
                    // Accumulate the weighted pixel into the new colour value
                    newColor += pixel * sigma;
                }
            }
            // Set the pixel at the current image index to the newly computed colour
            channelBlurred[rows * numCols + cols] = newColor;
        }
    }
}
I call this 3 times from another method which splits the image into respective R, G, B channels, but I don't believe this would cause the image to be so severely mutated.
Has anybody encountered a problem similar to this before, and if so how did you solve it?
EDIT Channel Splitting Func:
void gaussian_cpu(
    const uchar4* const rgbaImage,   // Our input image from the camera
    uchar4* const outputImage,       // The image we are writing back for display
    size_t numRows, size_t numCols,  // Width and height of the input image (rows/cols)
    const float* const filter,       // The filter weights
    const int filterWidth            // The size of the stencil (3x3) 9
)
{
    // Build an array to hold each channel for the given image
    unsigned char *r_c = new unsigned char[numRows * numCols];
    unsigned char *g_c = new unsigned char[numRows * numCols];
    unsigned char *b_c = new unsigned char[numRows * numCols];
    // Build arrays for each of the output (blurred) channels
    unsigned char *r_bc = new unsigned char[numRows * numCols];
    unsigned char *g_bc = new unsigned char[numRows * numCols];
    unsigned char *b_bc = new unsigned char[numRows * numCols];
    // Separate the image into R, G, B channels
    for(size_t i = 0; i < numRows * numCols; i++)
    {
        uchar4 rgba = rgbaImage[i];
        r_c[i] = rgba.x;
        g_c[i] = rgba.y;
        b_c[i] = rgba.z;
    }
    // Convolute each of the channels using our array
    convoluteChannel_cpu(r_c, r_bc, numRows, numCols, filter, filterWidth);
    convoluteChannel_cpu(g_c, g_bc, numRows, numCols, filter, filterWidth);
    convoluteChannel_cpu(b_c, b_bc, numRows, numCols, filter, filterWidth);
    // Recombine the channels to build the output image - 255 for alpha as we want 0 transparency
    for(size_t i = 0; i < numRows * numCols; i++)
    {
        uchar4 rgba = make_uchar4(r_bc[i], g_bc[i], b_bc[i], 255);
        outputImage[i] = rgba;
    }
}
EDIT Calling the kernel
while(gpu_frames > 0)
{
    //cout << gpu_frames << "\n";
    camera >> frameIn;
    // Allocate I/O pointers
    beginStream(&h_inputFrame, &h_outputFrame, &d_inputFrame, &d_outputFrame, &d_redBlurred, &d_greenBlurred, &d_blueBlurred, &_h_filter, &filterWidth, frameIn);
    // Show the source image
    imshow("Source", frameIn);
    g_timer.Start();
    // Allocate mem on the GPU
    allocateMemoryAndCopyToGPU(numRows(), numCols(), _h_filter, filterWidth);
    // Apply the gaussian kernel filter and then free any memory ready for the next iteration
    gaussian_gpu(h_inputFrame, d_inputFrame, d_outputFrame, numRows(), numCols(), d_redBlurred, d_greenBlurred, d_blueBlurred, filterWidth);
    // Output the blurred image
    cudaMemcpy(h_outputFrame, d_frameOut, sizeof(uchar4) * numPixels(), cudaMemcpyDeviceToHost);
    g_timer.Stop();
    cudaDeviceSynchronize();
    gpuTime += g_timer.Elapsed();
    cout << "Time for this kernel " << g_timer.Elapsed() << "\n";
    Mat outputFrame(Size(numCols(), numRows()), CV_8UC1, h_outputFrame, Mat::AUTO_STEP);
    clean_mem();
    imshow("Dest", outputFrame);
    // 1ms delay to prevent the system from being interrupted whilst drawing the new frame
    waitKey(1);
    gpu_frames--;
}
And then within the beginStream() method, images are converted to uchar4:
// Allocate host variables, casting the frameIn and frameOut vars to uchar4 elements, these will
// later be processed by the kernel
*h_inputFrame = (uchar4 *)frameIn.ptr<unsigned char>(0);
*h_outputFrame = (uchar4 *)frameOut.ptr<unsigned char>(0);
There are a few doubts in the problem.
At the start of the code it's mentioned that the filter width is 9, making it a 9x9 kernel, but in some other comments it's said to be 3. So I am guessing that you are actually using a 9x9 kernel and the filter does have 81 weights in it.
But the above output can never be due to that confusion alone.
uchar4 is 4 bytes in size. Thus in gaussian_cpu, while splitting the data by running the loop over rgbaImage[i] on an image that does not contain an alpha value (it can be inferred from the above-mentioned loop that alpha is not present), what actually happens is that you are copying R1, G2, B3, R5, G6, B7 and so on into the red channel. Better to first try the code on a grayscale image and make sure you are using uchar instead of uchar4.
The output image appears to be exactly 1/3rd the width of the original, which supports this assumption.
EDIT 1:
Is the input rgbaImage to the gaussian_cpu function RGBA or RGB? VideoCapture must be giving a 3-channel output. The initialization of *h_inputFrame (to uchar4) is itself wrong, as it's pointing to 3-channel data.
Similarly, the output data is four-channel data, but Mat outputFrame is declared as a single channel pointing to this four-channel data. Try declaring Mat outputFrame as CV_8UC4 and see the result.
Also, how is the code working? The gaussian_cpu() function has 7 input parameters in the definition, but when you call it, 8 parameters are used. Hopefully that is just a typo.
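A hedged sketch of the first fix (my assumption: cv::cvtColor is available and frameIn arrives from VideoCapture as 3-channel BGR): make the frame genuinely 4-channel before the uchar4 cast:

cv::Mat frameRGBA;
cv::cvtColor(frameIn, frameRGBA, cv::COLOR_BGR2RGBA);
// Each pixel is now 4 bytes, so the uchar4 cast is valid
*h_inputFrame = (uchar4 *)frameRGBA.ptr<unsigned char>(0);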

OpenCL :Access proper index by using globalid(.)

Hi,
I am coding in OpenCL.
I am converting a C function that works on a 2D array, with loops starting from i=1 and j=1; please find it below.
cv::Mat input; // Input: has some data in it
// Input image size: input.rows = 288, input.cols = 640
cv::Mat output(input.rows-2, input.cols-2, CV_32F); // Output buffer
// Output image size: output.rows = 286, output.cols = 638
This is the code which I want to convert to OpenCL:
for(int i=1; i<output.rows-1; i++)
{
    for(int j=1; j<output.cols-1; j++)
    {
        float xVal = input.at<uchar>(i-1,j-1) - input.at<uchar>(i-1,j+1) + 2*(input.at<uchar>(i,j-1) - input.at<uchar>(i,j+1)) + input.at<uchar>(i+1,j-1) - input.at<uchar>(i+1,j+1);
        float yVal = input.at<uchar>(i-1,j-1) - input.at<uchar>(i+1,j-1) + 2*(input.at<uchar>(i-1,j) - input.at<uchar>(i+1,j)) + input.at<uchar>(i-1,j+1) - input.at<uchar>(i+1,j+1);
        output.at<float>(i-1,j-1) = xVal*xVal + yVal*yVal;
    }
}
...
Host code:
// Input image size: input.rows = 288, input.cols = 640
// Output image size: output.rows = 286, output.cols = 638
OclStr->global_work_size[0] = (input.cols);
OclStr->global_work_size[1] = (input.rows);
size_t outBufSize = (output.rows) * (output.cols) * 4; // 4 as I am copying all 4 uchar values into one float variable space
cl_mem cl_input_buffer = clCreateBuffer(
    OclStr->context, CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR,
    (input.rows) * (input.cols),
    static_cast<void *>(input.data), &OclStr->returnstatus);
cl_mem cl_output_buffer = clCreateBuffer(
    OclStr->context, CL_MEM_WRITE_ONLY | CL_MEM_USE_HOST_PTR,
    (output.rows) * (output.cols) * sizeof(float),
    static_cast<void *>(output.data), &OclStr->returnstatus);
OclStr->returnstatus = clSetKernelArg(OclStr->objkernel, 0, sizeof(cl_mem), (void *)&cl_input_buffer);
OclStr->returnstatus = clSetKernelArg(OclStr->objkernel, 1, sizeof(cl_mem), (void *)&cl_output_buffer);
OclStr->returnstatus = clEnqueueNDRangeKernel(
    OclStr->command_queue,
    OclStr->objkernel,
    2,
    NULL,
    OclStr->global_work_size,
    NULL,
    0,
    NULL,
    NULL
);
clEnqueueMapBuffer(OclStr->command_queue, cl_output_buffer, true, CL_MAP_READ, 0, outBufSize, 0, NULL, NULL, &OclStr->returnstatus);
Kernel code:
__kernel void Sobel_uchar (__global uchar *pSrc, __global float *pDstImage)
{
    const uint cols = get_global_id(0)+1;
    const uint rows = get_global_id(1)+1;
    const uint width = get_global_size(0);
    uchar Opsoble[8];
    Opsoble[0] = pSrc[(cols-1)+((rows-1)*width)];
    Opsoble[1] = pSrc[(cols+1)+((rows-1)*width)];
    Opsoble[2] = pSrc[(cols-1)+((rows+0)*width)];
    Opsoble[3] = pSrc[(cols+1)+((rows+0)*width)];
    Opsoble[4] = pSrc[(cols-1)+((rows+1)*width)];
    Opsoble[5] = pSrc[(cols+1)+((rows+1)*width)];
    Opsoble[6] = pSrc[(cols+0)+((rows-1)*width)];
    Opsoble[7] = pSrc[(cols+0)+((rows+1)*width)];
    float gx = Opsoble[0]-Opsoble[1]+2*(Opsoble[2]-Opsoble[3])+Opsoble[4]-Opsoble[5];
    float gy = Opsoble[0]-Opsoble[4]+2*(Opsoble[6]-Opsoble[7])+Opsoble[1]-Opsoble[5];
    pDstImage[(cols-1)+(rows-1)*width] = gx*gx + gy*gy;
}
Here I am not able to get the output as expected.
I have some questions:
My for loop starts from i=1 instead of zero; how can I get the proper index using get_global_id() in the x and y directions?
What is going wrong in my above kernel code? :(
I suspect there is a problem with the buffer stride, but I'm not able to break my head over it any further, as I have already broken it throughout a day :(
I have observed that with the below logic the output skips one or two frames after a sequence of some 7/8 frames.
I have added a screenshot of my output compared with the reference output.
My above logic is doing only partial Sobeling on my input. I changed the width as:
const uint width = get_global_size(0)+1;
Your suggestions are most welcome!
It looks like you may be fetching values in (y,x) format in your OpenCL version. Also, you need to add 1 to the global id to replicate your for loops starting from 1 rather than 0.
I don't know why there is an unused iOffset variable; maybe your bug is related to it? I removed it in my version.
Does this kernel work better for you?
__kernel void simple(__global uchar *pSrc, __global float *pDstImage)
{
    const uint i = get_global_id(0) + 1;
    const uint j = get_global_id(1) + 1;
    const uint width = get_global_size(0) + 2;
    uchar Opsoble[8];
    Opsoble[0] = pSrc[(i-1) + (j-1)*width];
    Opsoble[1] = pSrc[(i-1) + (j+1)*width];
    Opsoble[2] = pSrc[i + (j-1)*width];
    Opsoble[3] = pSrc[i + (j+1)*width];
    Opsoble[4] = pSrc[(i+1) + (j-1)*width];
    Opsoble[5] = pSrc[(i+1) + (j+1)*width];
    Opsoble[6] = pSrc[(i-1) + (j)*width];
    Opsoble[7] = pSrc[(i+1) + (j)*width];
    float gx = Opsoble[0]-Opsoble[1]+2*(Opsoble[2]-Opsoble[3])+Opsoble[4]-Opsoble[5];
    float gy = Opsoble[0]-Opsoble[4]+2*(Opsoble[6]-Opsoble[7])+Opsoble[1]-Opsoble[5];
    // The destination image is 2 pixels narrower than the source, so use its own stride here
    pDstImage[(i-1) + (j-1)*(width-2)] = gx*gx + gy*gy;
}
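To match this kernel, the host should launch one work item per output pixel rather than per input pixel (my reading of the kernel above, since it derives the source width as get_global_size(0)+2), e.g.:

OclStr->global_work_size[0] = (output.cols); // 638 = input.cols - 2
OclStr->global_work_size[1] = (output.rows); // 286 = input.rows - 2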
I am a bit apprehensive about posting an answer suggesting optimizations to your kernel, seeing as the original output has not been reproduced exactly as of yet. There is a major improvement available to be made for problems related to image processing/filtering.
Using local memory will help you out by reducing the number of global reads by a factor of eight, as well as grouping the global writes together for potential gains with the single write-per-pixel output.
The kernel below reads a block of up to 34x34 from pSrc, and outputs a 32x32(max) area of the pDstImage. I hope the comments in the code are enough to guide you in using the kernel. I have not been able to give this a complete test, so there could be changes required. Any comments are appreciated as well.
__kernel void sobel_uchar_wlocal (__global uchar *pSrc, __global float *pDstImage, const uint2 dimDstImage)
{
    //call this kernel with a 1-dimensional work group size: 32x1
    //calculates a 32x32 region of output with 32 work items
    //(note: dimDstImage is passed by value; a non-pointer __global argument is not valid OpenCL)
    const uint wid = get_local_id(0);
    const uint wid_1 = wid+1; // corrected for the calculation step
    const uint2 gid = (uint2)(get_group_id(0), get_group_id(1));
    const uint localDim = get_local_size(0);
    const uint2 globalTopLeft = (uint2)(localDim * gid.x, localDim * gid.y); //position in pDstImage this group writes to
    const uint srcWidth = dimDstImage.x + 2; //the source image is 2 pixels wider and taller than the destination
    //dimLocalBuff shrinks at the right and bottom edges of the image, where the work group may run over the border
    uint2 dimLocalBuff = (uint2)(localDim, localDim);
    if(dimDstImage.x - globalTopLeft.x < dimLocalBuff.x){
        dimLocalBuff.x = dimDstImage.x - globalTopLeft.x;
    }
    if(dimDstImage.y - globalTopLeft.y < dimLocalBuff.y){
        dimLocalBuff.y = dimDstImage.y - globalTopLeft.y;
    }
    int i, j;
    //save the region of data into local memory; srcBuff[x][y] holds source pixel (globalTopLeft.x+x, globalTopLeft.y+y)
    __local uchar srcBuff[34][34]; //34^2 uchar = 1156 bytes
    for(j = 0; j < (int)dimLocalBuff.y + 2; j++){
        for(i = (int)wid; i < (int)dimLocalBuff.x + 2; i += localDim){
            srcBuff[i][j] = pSrc[(globalTopLeft.y + j) * srcWidth + globalTopLeft.x + i];
        }
    }
    barrier(CLK_LOCAL_MEM_FENCE); //a barrier, not a mem_fence, so the whole group sees the complete tile
    //compute output and store locally
    __local float dstBuff[32][32]; //32^2 float = 4096 bytes
    if(wid < dimLocalBuff.x){
        for(i = 0; i < (int)dimLocalBuff.y; i++){
            float gx = srcBuff[wid_1-1][i] - srcBuff[wid_1-1][i+2] + 2*(srcBuff[wid_1][i] - srcBuff[wid_1][i+2]) + srcBuff[wid_1+1][i] - srcBuff[wid_1+1][i+2];
            float gy = srcBuff[wid_1-1][i] - srcBuff[wid_1+1][i] + 2*(srcBuff[wid_1-1][i+1] - srcBuff[wid_1+1][i+1]) + srcBuff[wid_1-1][i+2] - srcBuff[wid_1+1][i+2];
            dstBuff[wid][i] = gx*gx + gy*gy;
        }
    }
    barrier(CLK_LOCAL_MEM_FENCE);
    //copy results to output, one row of the tile per loop iteration so the writes stay grouped
    if(wid < dimLocalBuff.x){
        for(j = 0; j < (int)dimLocalBuff.y; j++){
            pDstImage[(globalTopLeft.y + j) * dimDstImage.x + globalTopLeft.x + wid] = dstBuff[wid][j];
        }
    }
}
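A hedged host-side sketch for launching it (assuming the queue/kernel names from the question, and the 638x286 output from above): dimDstImage is passed by value, the local size is 32x1, and the global size is rounded up so every 32x32 output tile gets a work group:

cl_uint2 dimDst;
dimDst.s[0] = 638; // output width
dimDst.s[1] = 286; // output height
OclStr->returnstatus = clSetKernelArg(OclStr->objkernel, 2, sizeof(cl_uint2), &dimDst); // by value, not a cl_mem
size_t global[2] = { ((638 + 31) / 32) * 32, (286 + 31) / 32 }; // 20 groups of 32 in x, 9 groups of 1 in y
size_t local[2] = { 32, 1 };
clEnqueueNDRangeKernel(OclStr->command_queue, OclStr->objkernel, 2, NULL, global, local, 0, NULL, NULL);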

colored image to greyscale image using CUDA parallel processing

I am trying to solve a problem in which I am supposed to change a colour image to a greyscale image. For this purpose I am using a CUDA parallel approach. The kernel code I am invoking on the GPU is as follows.
__global__
void rgba_to_greyscale(const uchar4* const rgbaImage,
                       unsigned char* const greyImage,
                       int numRows, int numCols)
{
    int absolute_image_position_x = blockIdx.x;
    int absolute_image_position_y = blockIdx.y;
    if ( absolute_image_position_x >= numCols ||
         absolute_image_position_y >= numRows )
    {
        return;
    }
    uchar4 rgba = rgbaImage[absolute_image_position_x + absolute_image_position_y];
    float channelSum = .299f * rgba.x + .587f * rgba.y + .114f * rgba.z;
    greyImage[absolute_image_position_x + absolute_image_position_y] = channelSum;
}

void your_rgba_to_greyscale(const uchar4 * const h_rgbaImage,
                            uchar4 * const d_rgbaImage,
                            unsigned char* const d_greyImage,
                            size_t numRows,
                            size_t numCols)
{
    //You must fill in the correct sizes for the blockSize and gridSize
    //currently only one block with one thread is being launched
    const dim3 blockSize(numCols/32, numCols/32, 1); //TODO
    const dim3 gridSize(numRows/12, numRows/12, 1);  //TODO
    rgba_to_greyscale<<<gridSize, blockSize>>>(d_rgbaImage,
                                               d_greyImage,
                                               numRows,
                                               numCols);
    cudaDeviceSynchronize(); checkCudaErrors(cudaGetLastError());
}
I see a line of dots in the first pixel line.
The error I am getting is:
libdc1394 error: Failed to initialize libdc1394
Difference at pos 51 exceeds tolerance of 5
Reference: 255
GPU : 0
My input/output images:
Can anyone help me with this? Thanks in advance.
I recently joined this course and tried your solution, but it didn't work, so I tried my own. You are almost correct. The correct solution is this:
__global__
void rgba_to_greyscale(const uchar4* const rgbaImage,
                       unsigned char* const greyImage,
                       int numRows, int numCols)
{
    int pos_x = (blockIdx.x * blockDim.x) + threadIdx.x;
    int pos_y = (blockIdx.y * blockDim.y) + threadIdx.y;
    if(pos_x >= numCols || pos_y >= numRows)
        return;
    uchar4 rgba = rgbaImage[pos_x + pos_y * numCols];
    greyImage[pos_x + pos_y * numCols] = (.299f * rgba.x + .587f * rgba.y + .114f * rgba.z);
}
The rest is the same as your code.
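For completeness, a hedged sketch of a matching launch configuration (my assumption; the original answer omits it): 16x16 blocks with the grid rounded up so every pixel gets a thread, relying on the bounds check inside the kernel:

const dim3 blockSize(16, 16, 1);
const dim3 gridSize((numCols + blockSize.x - 1) / blockSize.x,
                    (numRows + blockSize.y - 1) / blockSize.y, 1);
rgba_to_greyscale<<<gridSize, blockSize>>>(d_rgbaImage, d_greyImage, numRows, numCols);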
Now, since I posted this question I have been continuously working on this problem, and there are a couple of improvements that should be made to get it correct; I now realize my initial solution was wrong. The changes to be made:
1. absolute_position_x = (blockIdx.x * blockDim.x) + threadIdx.x;
2. absolute_position_y = (blockIdx.y * blockDim.y) + threadIdx.y;
Secondly,
1. const dim3 blockSize(24, 24, 1);
2. const dim3 gridSize((numCols/16), (numRows/16), 1);
In the solution we are using a grid of numCols/16 * numRows/16
and a block size of 24 * 24.
The code executed in 0.040576 ms.
#datenwolf : thanks for answering above!!!
Since you are not aware of the image size, it is best to choose a reasonable dimension for the two-dimensional block of threads and then check two conditions. The first is that the pos_x and pos_y indexes in the kernel do not exceed numCols and numRows. Secondly, the grid should be rounded up so that the total number of threads in all the blocks covers the whole image:
const dim3 blockSize(16, 16, 1);
const dim3 gridSize((numCols%16) ? numCols/16+1 : numCols/16,
                    (numRows%16) ? numRows/16+1 : numRows/16, 1);
libdc1394 error: Failed to initialize libdc1394
I don't think that this is a CUDA problem. libdc1394 is a library used to access IEEE 1394 (aka FireWire, aka iLink) video devices (DV camcorders, the Apple iSight camera). That library doesn't properly initialize, hence you're not getting useful results. Basically it's NINO: Nonsense In, Nonsense Out.
The calculation of the absolute x & y image positions is perfect.
But when you need to access that particular pixel in the coloured image, shouldn't you use the following code?
uchar4 rgba = rgbaImage[absolute_image_position_x + (absolute_image_position_y * numCols)];
I thought so, when comparing it to code you'd write to solve the same problem serially.
Please let me know :)
You will still have a problem at run time - the conversion will not give a proper result.
The lines:
uchar4 rgba = rgbaImage[absolute_image_position_x + absolute_image_position_y];
greyImage[absolute_image_position_x + absolute_image_position_y] = channelSum;
should be changed to:
uchar4 rgba = rgbaImage[absolute_image_position_x + absolute_image_position_y*numCols];
greyImage[absolute_image_position_x + absolute_image_position_y*numCols] = channelSum;
__global__
void rgba_to_greyscale(const uchar4* const rgbaImage,
                       unsigned char* const greyImage,
                       int numRows, int numCols)
{
    int rgba_x = blockIdx.x * blockDim.x + threadIdx.x;
    int rgba_y = blockIdx.y * blockDim.y + threadIdx.y;
    // Guard: the grid below is rounded up, so out-of-range threads must exit
    if (rgba_x >= numCols || rgba_y >= numRows)
        return;
    int pixel_pos = rgba_x + rgba_y*numCols;
    uchar4 rgba = rgbaImage[pixel_pos];
    unsigned char gray = (unsigned char)(0.299f * rgba.x + 0.587f * rgba.y + 0.114f * rgba.z);
    greyImage[pixel_pos] = gray;
}

void your_rgba_to_greyscale(const uchar4 * const h_rgbaImage, uchar4 * const d_rgbaImage,
                            unsigned char* const d_greyImage, size_t numRows, size_t numCols)
{
    //You must fill in the correct sizes for the blockSize and gridSize
    //currently only one block with one thread is being launched
    const dim3 blockSize(24, 24, 1); //TODO
    const dim3 gridSize(numCols/24+1, numRows/24+1, 1); //TODO
    rgba_to_greyscale<<<gridSize, blockSize>>>(d_rgbaImage, d_greyImage, numRows, numCols);
    cudaDeviceSynchronize(); checkCudaErrors(cudaGetLastError());
}
The libdc1394 error is not related to FireWire etc. in this case - it is from the library that Udacity uses to compare the image your program creates to the reference image. What it is saying is that the difference between your image and the reference image has exceeded the specified threshold, for that position, i.e. pixel.
You are running the following numbers of blocks and grids:
const dim3 blockSize(numCols/32, numCols/32, 1); //TODO
const dim3 gridSize(numRows/12, numRows/12, 1);  //TODO
yet you are not using any threads in your kernel code!
int absolute_image_position_x = blockIdx.x;
int absolute_image_position_y = blockIdx.y;
Think of it this way: the width of an image can be divided into absolute_image_position_x parts of columns, and the height of an image can be divided into absolute_image_position_y parts of rows. Each cross-section this creates is a box of pixels that you need to change/redraw in terms of greyImage, in parallel. Enough spoilers for an assignment :)
The same code, with the ability to handle non-standard input image sizes:
__global__
void rgba_to_greyscale(const uchar4* const rgbaImage, unsigned char* const greyImage,
                       int numRows, int numCols)
{
    int idx = blockDim.x*blockIdx.x + threadIdx.x;
    int idy = blockDim.y*blockIdx.y + threadIdx.y;
    // Guard: the grid below is rounded up past the image, so stray threads must exit
    if (idx >= numRows || idy >= numCols)
        return;
    uchar4 rgbcell = rgbaImage[idx*numCols + idy];
    greyImage[idx*numCols + idy] = 0.299f*rgbcell.x + 0.587f*rgbcell.y + 0.114f*rgbcell.z;
}

void your_rgba_to_greyscale(const uchar4 * const h_rgbaImage, uchar4 * const d_rgbaImage,
                            unsigned char* const d_greyImage, size_t numRows, size_t numCols)
{
    //You must fill in the correct sizes for the blockSize and gridSize
    //currently only one block with one thread is being launched
    int totalpixels = numRows*numCols;
    int factors[] = {2, 4, 8, 16, 24, 32};
    std::vector<int> numbers(factors, factors + sizeof(factors)/sizeof(int));
    int factor = 1;
    while(!numbers.empty())
    {
        if(totalpixels % numbers.back() == 0)
        {
            factor = numbers.back();
            break;
        }
        else
        {
            numbers.pop_back();
        }
    }
    const dim3 blockSize(factor, factor, 1); //TODO
    const dim3 gridSize(numRows/factor+1, numCols/factor+1, 1); //TODO
    rgba_to_greyscale<<<gridSize, blockSize>>>(d_rgbaImage, d_greyImage, numRows, numCols);
}
1- int x = (blockIdx.x * blockDim.x) + threadIdx.x;
2- int y = (blockIdx.y * blockDim.y) + threadIdx.y;
And in the grid and block size:
1- const dim3 blockSize(32, 32, 1);
2- const dim3 gridSize((numCols/32+1), (numRows/32+1), 1);
The code executed in 0.036992 ms.
const dim3 blockSize(16, 16, 1); //TODO
const dim3 gridSize((numRows+15)/16, (numCols+15)/16, 1); //TODO

int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;
// With this grid, x indexes rows and y indexes columns, so the row stride is numCols
uchar4 rgba = rgbaImage[x*numCols + y];
float channelSum = .299f * rgba.x + .587f * rgba.y + .114f * rgba.z;
greyImage[x*numCols + y] = channelSum;
