Example usage of a libyuv API MJPGToI420() - image-processing

I'm trying to use a libyuv API, more specifically MJPGToI420().
I want to first take a jpeg image as input to MJPGToI420(), the signature of which is below:
int MJPGToI420(const uint8_t* sample,
size_t sample_size,
uint8_t* dst_y,
int dst_stride_y,
uint8_t* dst_u,
int dst_stride_u,
uint8_t* dst_v,
int dst_stride_v,
int src_width,
int src_height,
int dst_width,
int dst_height);
Then, I want to allocate space for the dst_y, dst_u, and dst_v pointers. However, I don't know how much space to allocate for them. I'm also confused about what the strides should be, i.e., what the parameters dst_stride_y, dst_stride_u and dst_stride_v should be.
Would really appreciate any pointers in the right direction.
EDIT: Here's a snippet of code from the libyuv source unit tests that uses this function. However, the test returns 1 which is failure of the function as the intended behavior. The test also just uses zeroes for the data, instead of an actual MJPG file.
TEST_F(LibYUVConvertTest, MJPGToI420) {
const int kOff = 10;
const int kMinJpeg = 64;
const int kImageSize = benchmark_width_ * benchmark_height_ >= kMinJpeg
? benchmark_width_ * benchmark_height_
: kMinJpeg;
const int kSize = kImageSize + kOff;
align_buffer_page_end(orig_pixels, kSize);
align_buffer_page_end(dst_y_opt, benchmark_width_ * benchmark_height_);
align_buffer_page_end(dst_u_opt, SUBSAMPLE(benchmark_width_, 2) *
SUBSAMPLE(benchmark_height_, 2));
align_buffer_page_end(dst_v_opt, SUBSAMPLE(benchmark_width_, 2) *
SUBSAMPLE(benchmark_height_, 2));
// EOI, SOI to make MJPG appear valid.
memset(orig_pixels, 0, kSize);
orig_pixels[0] = 0xff;
orig_pixels[1] = 0xd8; // SOI.
orig_pixels[kSize - kOff + 0] = 0xff;
orig_pixels[kSize - kOff + 1] = 0xd9; // EOI.
for (int times = 0; times < benchmark_iterations_; ++times) {
int ret =
MJPGToI420(orig_pixels, kSize, dst_y_opt, benchmark_width_, dst_u_opt,
SUBSAMPLE(benchmark_width_, 2), dst_v_opt,
SUBSAMPLE(benchmark_width_, 2), benchmark_width_,
benchmark_height_, benchmark_width_, benchmark_height_);
// Expect failure because image is not really valid.
EXPECT_EQ(1, ret);
}
free_aligned_buffer_page_end(dst_y_opt);
free_aligned_buffer_page_end(dst_u_opt);
free_aligned_buffer_page_end(dst_v_opt);
free_aligned_buffer_page_end(orig_pixels);
}
EDIT 2: Furthermore, this is what I've tried, however, the end yuv files are not even viewable in a yuv viewer (created using the buffers dst_u_opt and dst_y_opt), which makes me believe there might be something that I'm messing up with the function:
int convertMJPGToI420() {
auto fileSize = filesize(IMG_NAME);
// load image into memory
uint8_t* my_img = (uint8_t*) calloc(fileSize, 1);
std::ifstream fin(IMG_NAME, ios::in | ios::binary);
fin.read(reinterpret_cast<char*>(my_img), fileSize);
// exif data offset
// This is the size of the exif data
const int kOff = 4096;
// 4k image is being sent in
int benchmark_width_ = 3840;
int benchmark_height_ = 2160;
const int kSize = fileSize;
// align_buffer_page_end is a macro (look at link posted for unit tests above)
// I'm not sure if the size allocation for these is correct
// I have tried to model it based off the example
align_buffer_page_end(orig_pixels, kSize);
align_buffer_page_end(dst_y_opt, benchmark_width_ * benchmark_height_);
align_buffer_page_end(dst_u_opt, SUBSAMPLE(benchmark_width_, 2) *
SUBSAMPLE(benchmark_height_, 2));
align_buffer_page_end(dst_v_opt, SUBSAMPLE(benchmark_width_, 2) *
SUBSAMPLE(benchmark_height_, 2));
// EOI, SOI to make MJPG appear valid
memset(orig_pixels, 0, kSize);
orig_pixels[0] = 0xff;
orig_pixels[1] = 0xd8; // SOI
memcpy(orig_pixels + 2, my_img, kSize - kOff - 3);
orig_pixels[kSize - kOff + 0] = 0xff;
orig_pixels[kSize - kOff + 1] = 0xd9; // EOI
// using async as this function might be ansynchronous
std::future<int> ret = std::async(libyuv::MJPGToI420, orig_pixels, kSize, dst_y_opt, benchmark_width_,
dst_u_opt, SUBSAMPLE(benchmark_width_, 2),
dst_v_opt, SUBSAMPLE(benchmark_width_, 2),
benchmark_width_, benchmark_height_,
benchmark_width_, benchmark_height_);
ret.wait();
// ret is always one, which means there was a failure
if(ret.get() == 0) {
cout << "return value was zero" << endl;
} else {
cout << "return value was one" << endl;
}
FILE* file = fopen("/data/dst_u_opt", "wb");
fwrite(dst_y_opt, 1, SUBSAMPLE(benchmark_width_, 2) * SUBSAMPLE(benchmark_height_, 2) , file);
fclose(file);
file = fopen("/data/dst_v_opt", "wb");
fwrite(dst_y_opt, 1, SUBSAMPLE(benchmark_width_, 2) * SUBSAMPLE(benchmark_height_, 2), file);
fclose(file);
free_aligned_buffer_page_end(dst_y_opt);
free_aligned_buffer_page_end(dst_u_opt);
free_aligned_buffer_page_end(dst_v_opt);
free_aligned_buffer_page_end(orig_pixels);
return 0;
}

You'll need to know the width and height of the jpeg.
I420 is a 420 sub-sampled YUV.
The Y plane is width * height in bytes.
The dst_stride_y value is width
e.g.
char* dst_y = malloc(width * height);
The U and V planes are half width and height. To handle odd sizes you should round up.
dst_stride_u = (width + 1) / 2;
dst_stride_v = (width + 1) / 2;
The u and v planes are ((width + 1) / 2) * ((height + 1) / 2) bytes.
char* dst_u = malloc(((width + 1) / 2) * ((height + 1) / 2));
char* dst_y = malloc(((width + 1) / 2) * ((height + 1) / 2));
If you'd like to file an issue, including better documentation, post it here:
https://bugs.chromium.org/p/libyuv/issues/list

Related

Separable gaussian blur - optimize vertical pass

I have implemented separable Gaussian blur. Horizontal pass was relatively easy to optimize with SIMD processing. However, I am not sure how to optimize vertical pass.
Accessing elements is not very cache friendly and filling SIMD lane would mean reading many different pixels. I was thinking about transpose the image and run horizontal pass and then transpose image back, however, I am not sure if it will gain any improvement because of two tranpose operations.
I have quite large images 16k resolution and kernel size is 19, so vectorization of vertical pass gain was about 15%.
My Vertical pass is as follows (it is sinde generic class typed to T which can be uint8_t or float):
int yStart = kernelHalfSize;
int xStart = kernelHalfSize;
int yEnd = input.GetWidth() - kernelHalfSize;
int xEnd = input.GetHeigh() - kernelHalfSize;
const T * inData = input.GetData().data();
V * outData = output.GetData().data();
int kn = kernelHalfSize * 2 + 1;
int kn4 = kn - kn % 4;
for (int y = yStart; y < yEnd; y++)
{
size_t yW = size_t(y) * output.GetWidth();
size_t outX = size_t(xStart) + yW;
size_t xEndSimd = xStart;
int len = xEnd - xStart;
len = len - len % 4;
xEndSimd = xStart + len;
for (int x = xStart; x < xEndSimd; x += 4)
{
size_t inYW = size_t(y) * input.GetWidth();
size_t x0 = ((x + 0) - kernelHalfSize) + inYW;
size_t x1 = x0 + 1;
size_t x2 = x0 + 2;
size_t x3 = x0 + 3;
__m128 sumDot = _mm_setzero_ps();
int i = 0;
for (; i < kn4; i += 4)
{
__m128 kx = _mm_set_ps1(kernelDataX[i + 0]);
__m128 ky = _mm_set_ps1(kernelDataX[i + 1]);
__m128 kz = _mm_set_ps1(kernelDataX[i + 2]);
__m128 kw = _mm_set_ps1(kernelDataX[i + 3]);
__m128 dx, dy, dz, dw;
if constexpr (std::is_same<T, uint8_t>::value)
{
//we need co convert uint8_t inputs to float
__m128i u8_0 = _mm_loadu_si128((const __m128i*)(inData + x0));
__m128i u8_1 = _mm_loadu_si128((const __m128i*)(inData + x1));
__m128i u8_2 = _mm_loadu_si128((const __m128i*)(inData + x2));
__m128i u8_3 = _mm_loadu_si128((const __m128i*)(inData + x3));
__m128i u32_0 = _mm_unpacklo_epi16(
_mm_unpacklo_epi8(u8_0, _mm_setzero_si128()),
_mm_setzero_si128());
__m128i u32_1 = _mm_unpacklo_epi16(
_mm_unpacklo_epi8(u8_1, _mm_setzero_si128()),
_mm_setzero_si128());
__m128i u32_2 = _mm_unpacklo_epi16(
_mm_unpacklo_epi8(u8_2, _mm_setzero_si128()),
_mm_setzero_si128());
__m128i u32_3 = _mm_unpacklo_epi16(
_mm_unpacklo_epi8(u8_3, _mm_setzero_si128()),
_mm_setzero_si128());
dx = _mm_cvtepi32_ps(u32_0);
dy = _mm_cvtepi32_ps(u32_1);
dz = _mm_cvtepi32_ps(u32_2);
dw = _mm_cvtepi32_ps(u32_3);
}
else
{
/*
//load 8 consecutive values
auto dd = _mm256_loadu_ps(inData + x0);
//extract parts by shifting and casting to 4 values float
dx = _mm256_castps256_ps128(dd);
dy = _mm256_castps256_ps128(_mm256_permutevar8x32_ps(dd, _mm256_set_epi32(0, 0, 0, 0, 4, 3, 2, 1)));
dz = _mm256_castps256_ps128(_mm256_permutevar8x32_ps(dd, _mm256_set_epi32(0, 0, 0, 0, 5, 4, 3, 2)));
dw = _mm256_castps256_ps128(_mm256_permutevar8x32_ps(dd, _mm256_set_epi32(0, 0, 0, 0, 6, 5, 4, 3)));
*/
dx = _mm_loadu_ps(inData + x0);
dy = _mm_loadu_ps(inData + x1);
dz = _mm_loadu_ps(inData + x2);
dw = _mm_loadu_ps(inData + x3);
}
//calculate 4 dots at once
//[dx, dy, dz, dw] <dot> [kx, ky, kz, kw]
auto mx = _mm_mul_ps(dx, kx); //dx * kx
auto my = _mm_fmadd_ps(dy, ky, mx); //mx + dy * ky
auto mz = _mm_fmadd_ps(dz, kz, my); //my + dz * kz
auto res = _mm_fmadd_ps(dw, kw, mz); //mz + dw * kw
sumDot = _mm_add_ps(sumDot, res);
x0 += 4;
x1 += 4;
x2 += 4;
x3 += 4;
}
for (; i < kn; i++)
{
auto v = _mm_set_ps1(kernelDataX[i]);
auto v2 = _mm_set_ps(
*(inData + x3), *(inData + x2),
*(inData + x1), *(inData + x0)
);
sumDot = _mm_add_ps(sumDot, _mm_mul_ps(v, v2));
x0++;
x1++;
x2++;
x3++;
}
sumDot = _mm_mul_ps(sumDot, _mm_set_ps1(weightX));
if constexpr (std::is_same<V, uint8_t>::value)
{
__m128i asInt = _mm_cvtps_epi32(sumDot);
asInt = _mm_packus_epi32(asInt, asInt);
asInt = _mm_packus_epi16(asInt, asInt);
uint32_t res = _mm_cvtsi128_si32(asInt);
((uint32_t *)(outData + outX))[0] = res;
outX += 4;
}
else
{
float tmpRes[4];
_mm_store_ps(tmpRes, sumDot);
outData[outX + 0] = tmpRes[0];
outData[outX + 1] = tmpRes[1];
outData[outX + 2] = tmpRes[2];
outData[outX + 3] = tmpRes[3];
outX += 4;
}
}
for (int x = xEndSimd; x < xEnd; x++)
{
int kn = kernelHalfSize * 2 + 1;
const T * v = input.GetPixelStart(x - kernelHalfSize, y);
float tmp = 0;
for (int i = 0; i < kn; i++)
{
tmp += kernelDataX[i] * v[i];
}
tmp *= weightX;
outData[outX] = ImageUtils::clamp_cast<V>(tmp);
outX++;
}
}
There’s a well-known trick for that.
While you compute both passes, read them sequentially, use SIMD to compute, but write out the result into another buffer, transposed, using scalar stores. Protip: SSE 4.1 has _mm_extract_ps just don’t forget to cast your destination image pointer from float* into int*. Another thing about these stores, I would recommend using _mm_stream_si32 for that as you want maximum cache space used by your input data. When you’ll be computing the second pass, you’ll be reading sequential memory addresses again, the prefetcher hardware will deal with the latency.
This way both passes will be identical, I usually call same function twice, with different buffers.
Two transposes caused by your 2 passes cancel each other. Here’s an HLSL version, BTW.
There’s more. If your kernel size is only 19, that fits in 3 AVX registers. I think shuffle/permute/blend instructions are still faster than even L1 cache loads, i.e. it might be better to load the kernel outside the loop.

Resize image using nearest neighborhood with cuda

I am implementing a nearest neighborhood kernel function to resize the input image. But the result is wrong and I have no idea.
Here is the input image
the result is wrong.
I use opencv to read the input image.
cv::Mat image = cv::imread("/home/tumh/test.jpg");
unsigned char* data = image.data;
int outH, outW;
float *out_data_host = test(data, image.rows, image.cols, outH, outW);
cv::Mat out_image(outH, outW, CV_32FC3);
memcpy(out_image.data, out_data_host, outH * outW * 3 * sizeof(float));
float* test(unsigned char* in_data_host, const int &inH, const int &inW, int &outH, int &outW) {
// get the output size
int im_size_min = std::min(inW, inH);
int im_size_max = std::max(inW, inH);
float scale_factor = static_cast<float>(640) / im_size_min;
float im_scale_x = std::floor(inW * scale_factor / 64) * 64 / inW;
float im_scale_y = std::floor(inH * scale_factor / 64) * 64 / inH;
outW = inW * im_scale_x;
outH = inH * im_scale_y;
int channel = 3;
unsigned char* in_data_dev;
CUDA_CHECK(cudaMalloc(&in_data_dev, sizeof(unsigned char) * channel * inH * inW));
CUDA_CHECK(cudaMemcpy(in_data_dev, in_data_host, 1 * sizeof(unsigned char) * channel * inH * inW, cudaMemcpyHostToDevice));
// image pre process
const float2 scale = make_float2( im_scale_x, im_scale_y);
float * out_buffer = NULL;
CUDA_CHECK(cudaMalloc(&out_buffer, sizeof(float) * channel * outH * outW));
float *out_data_host = new float[sizeof(float) * channel * outH * outW];
const dim3 threads(32, 32);
const dim3 block(iDivUp(outW, threads.x), iDivUp(outW, threads.y));
gpuPreImageNet<<<block, threads>>>(scale, in_data_dev, inW, out_buffer, outW, outH);
CUDA_CHECK(cudaFree(in_data_dev));
CUDA_CHECK(cudaMemcpy(out_data_host, out_buffer, sizeof(float) * channel * outH * outW, cudaMemcpyDeviceToHost));
CUDA_CHECK(cudaFree(out_buffer));
return out_data_host;
}
Here is the resize kernel function
__global__ void gpuPreImageNet( float2 scale, unsigned char* input, int iWidth, float* output, int oWidth, int oHeight )
{
const int x = blockIdx.x * blockDim.x + threadIdx.x;
const int y = blockIdx.y * blockDim.y + threadIdx.y;
const int n = oWidth * oHeight;
int channel = 3;
if( x >= oWidth || y >= oHeight )
return;
const int dx = ((float)x * scale.x);
const int dy = ((float)y * scale.y);
const unsigned char* px = input + dy * iWidth * channel + dx * channel ;
const float3 bgr = make_float3(*(px + 0), *(px + 1), *(px + 2));
output[channel * y * oWidth + channel * x + 0] = bgr.x;
output[channel * y * oWidth + channel * x + 1] = bgr.y;
output[channel * y * oWidth + channel * x + 2] = bgr.z;
}
Most of the implementation is from https://github.com/soulsheng/ResizeNN/blob/master/resizeCUDA/resizeNN.cu
Any idea?
Maybe you are observing an uninitialized memory problem.
As i understand your code, out_data_host allocation is too big
new float[sizeof(float) * channel * outH * outW];
should be
new float[channel * outH * outW]
Then out_buffer is uninitialized, add a cudaMemset after the cudaMalloc line.
To clarify your code, since you already use OpenCV to load images, why don't you use opencv to resize your images ?
cv::resize // Host side method is probably better since you'll have less data copied through PCI-Express
// or
cv::cuda::resize
It took me around two days to figure out a solution for this problem. Basically, I was building a GPU based image preprocessing pipeline for my project. Here's the custom Cuda Kernel.
For Gray scale Image Resizing, change channel from 3 -> 1 and it should work.
__global__ void resize_kernel( real* pIn, real* pOut, int widthIn, int heightIn, int widthOut, int heightOut)
{
int i = blockDim.y * blockIdx.y + threadIdx.y;
int j = blockDim.x * blockIdx.x + threadIdx.x;
int channel = 3;
if( i < heightOut && j < widthOut )
{
int iIn = i * heightIn / heightOut;
int jIn = j * widthIn / widthOut;
for(int c = 0; c < channel; c++)
pOut[ (i*widthOut + j)*channel + c ] = pIn[ (iIn*widthIn + jIn)*channel + c ];
}
}

Converting cv::Mat to MTLTexture

An intermediate step of my current project requires conversion of opencv's cv::Mat to MTLTexture, the texture container of Metal. I need to store the Floats in the Mat as Floats in the texture; my project cannot quite afford the loss of precision.
This is my attempt at such a conversion.
- (id<MTLTexture>)texForMat:(cv::Mat)image context:(MBEContext *)context
{
id<MTLTexture> texture;
int width = image.cols;
int height = image.rows;
Float32 *rawData = (Float32 *)calloc(height * width * 4,sizeof(float));
int bytesPerPixel = 4;
int bytesPerRow = bytesPerPixel * width;
float r, g, b,a;
for(int i = 0; i < height; i++)
{
Float32* imageData = (Float32*)(image.data + image.step * i);
for(int j = 0; j < width; j++)
{
r = (Float32)(imageData[4 * j]);
g = (Float32)(imageData[4 * j + 1]);
b = (Float32)(imageData[4 * j + 2]);
a = (Float32)(imageData[4 * j + 3]);
rawData[image.step * (i) + (4 * j)] = r;
rawData[image.step * (i) + (4 * j + 1)] = g;
rawData[image.step * (i) + (4 * j + 2)] = b;
rawData[image.step * (i) + (4 * j + 3)] = a;
}
}
MTLTextureDescriptor *textureDescriptor = [MTLTextureDescriptor texture2DDescriptorWithPixelFormat:MTLPixelFormatRGBA16Float
width:width
height:height
mipmapped:NO];
texture = [context.device newTextureWithDescriptor:textureDescriptor];
MTLRegion region = MTLRegionMake2D(0, 0, width, height);
[texture replaceRegion:region mipmapLevel:0 withBytes:rawData bytesPerRow:bytesPerRow];
free(rawData);
return texture;
}
But it doesn't seem to be working. It reads zeroes every time from the Mat, and throws up EXC_BAD_ACCESS. I need the MTLTexture in MTLPixelFormatRGBA16Float to keep the precision.
Thanks for considering this issue.
One problem here is you’re loading up rawData with Float32s but your texture is RGBA16Float, so the data will be corrupted (16Float is half the size of Float32). This shouldn’t cause your crash, but it’s an issue you’ll have to deal with.
Also as “chappjc” noted you’re using ‘image.step’ when writing your data out, but that buffer should be contiguous and not ever have a step that’s not just (width * bytesPerPixel).

colored image to greyscale image using CUDA parallel processing

I am trying to solve a problem in which i am supposed to change a colour image to a greyscale image. For this purpose i am using CUDA parallel approach. The kerne code i am invoking on the GPU is as follows.
__global__
void rgba_to_greyscale(const uchar4* const rgbaImage,
unsigned char* const greyImage,
int numRows, int numCols)
{
int absolute_image_position_x = blockIdx.x;
int absolute_image_position_y = blockIdx.y;
if ( absolute_image_position_x >= numCols ||
absolute_image_position_y >= numRows )
{
return;
}
uchar4 rgba = rgbaImage[absolute_image_position_x + absolute_image_position_y];
float channelSum = .299f * rgba.x + .587f * rgba.y + .114f * rgba.z;
greyImage[absolute_image_position_x + absolute_image_position_y] = channelSum;
}
void your_rgba_to_greyscale(const uchar4 * const h_rgbaImage,
uchar4 * const d_rgbaImage,
unsigned char* const d_greyImage,
size_t numRows,
size_t numCols)
{
//You must fill in the correct sizes for the blockSize and gridSize
//currently only one block with one thread is being launched
const dim3 blockSize(numCols/32, numCols/32 , 1); //TODO
const dim3 gridSize(numRows/12, numRows/12 , 1); //TODO
rgba_to_greyscale<<<gridSize, blockSize>>>(d_rgbaImage,
d_greyImage,
numRows,
numCols);
cudaDeviceSynchronize(); checkCudaErrors(cudaGetLastError());
}
i see a line of dots in the first pixel line.
error i am getting is
libdc1394 error: Failed to initialize libdc1394
Difference at pos 51 exceeds tolerance of 5
Reference: 255
GPU : 0
my input/output images
Can anyone help me with this??? thanks in advance.
I recently joined this course and tried your solution but it don't work so, i tried my own. You are almost correct. The correct solution is this:
__global__`
void rgba_to_greyscale(const uchar4* const rgbaImage,
unsigned char* const greyImage,
int numRows, int numCols)
{`
int pos_x = (blockIdx.x * blockDim.x) + threadIdx.x;
int pos_y = (blockIdx.y * blockDim.y) + threadIdx.y;
if(pos_x >= numCols || pos_y >= numRows)
return;
uchar4 rgba = rgbaImage[pos_x + pos_y * numCols];
greyImage[pos_x + pos_y * numCols] = (.299f * rgba.x + .587f * rgba.y + .114f * rgba.z);
}
The rest is same as your code.
Now, since I posted this question I have been continuously working on this problem there are a couple of improvements that should be done in order to get this problem correct now I realize my initial solution was wrong . Changes to be done:-
1. absolute_position_x =(blockIdx.x * blockDim.x) + threadIdx.x;
2. absolute_position_y = (blockIdx.y * blockDim.y) + threadIdx.y;
Secondly,
1. const dim3 blockSize(24, 24, 1);
2. const dim3 gridSize((numCols/16), (numRows/16) , 1);
In the solution we are using a grid of numCols/16 * numCols/16
and blocksize of 24 * 24
code executed in 0.040576 ms
#datenwolf : thanks for answering above!!!
Since you are not aware of the image size. It is best to choose any reasonable dimension of the two-dimensional block of threads and then check for two conditions. The first one is that the pos_x and pos_y indexes in the kernel do not exceed numRows and numCols. Secondly the grid size should be just above the total number of threads in all the blocks.
const dim3 blockSize(16, 16, 1);
const dim3 gridSize((numCols%16) ? numCols/16+1 : numCols/16,
(numRows%16) ? numRows/16+1 : numRows/16, 1);
libdc1394 error: Failed to initialize libdc1394
I don't think that this is a CUDA problem. libdc1394 is a library used to access IEEE1394 aka FireWire aka iLink video devices (DV camcorders, Apple iSight camera). That library doesn'r properly initialize, hence you're not getting usefull results. Basically it's NINO: Nonsens In Nonsens Out.
the calculation of absolute x & y image positions is perfect.
but when u need to access that particular pixel in the coloured image , shouldn't you u use the following code??
uchar4 rgba = rgbaImage[absolute_image_position_x + (absolute_image_position_y * numCols)];
I thought so, when comparing it to a code you'd write to execute the same problem in serial code.
Please let me know :)
You still should have a problem with run time - the conversion will not give a proper result.
The lines:
uchar4 rgba = rgbaImage[absolute_image_position_x + absolute_image_position_y];
greyImage[absolute_image_position_x + absolute_image_position_y] = channelSum;
should be changed to:
uchar4 rgba = rgbaImage[absolute_image_position_x + absolute_image_position_y*numCols];
greyImage[absolute_image_position_x + absolute_image_position_y*numCols] = channelSum;
__global__
void rgba_to_greyscale(const uchar4* const rgbaImage,
unsigned char* const greyImage,
int numRows, int numCols)
{
int rgba_x = blockIdx.x * blockDim.x + threadIdx.x;
int rgba_y = blockIdx.y * blockDim.y + threadIdx.y;
int pixel_pos = rgba_x+rgba_y*numCols;
uchar4 rgba = rgbaImage[pixel_pos];
unsigned char gray = (unsigned char)(0.299f * rgba.x + 0.587f * rgba.y + 0.114f * rgba.z);
greyImage[pixel_pos] = gray;
}
void your_rgba_to_greyscale(const uchar4 * const h_rgbaImage, uchar4 * const d_rgbaImage,
unsigned char* const d_greyImage, size_t numRows, size_t numCols)
{
//You must fill in the correct sizes for the blockSize and gridSize
//currently only one block with one thread is being launched
const dim3 blockSize(24, 24, 1); //TODO
const dim3 gridSize( numCols/24+1, numRows/24+1, 1); //TODO
rgba_to_greyscale<<<gridSize, blockSize>>>(d_rgbaImage, d_greyImage, numRows, numCols);
cudaDeviceSynchronize(); checkCudaErrors(cudaGetLastError());
}
The libdc1394 error is not related to firewire etc in this case - it is the library that udacity is using to compare the image your program creates to the reference image. And what is is saying is that the difference between your image and the reference image has been been exceeded by a specific threshold, for that position ie. pixel.
You are running following number of block and grids:
const dim3 blockSize(numCols/32, numCols/32 , 1); //TODO
const dim3 gridSize(numRows/12, numRows/12 , 1); //TODO
yet you are not using any threads in your kernel code!
int absolute_image_position_x = blockIdx.x;
int absolute_image_position_y = blockIdx.y;
think this way, the width of an image can be divide into absolute_image_position_x parts of column and the height of an image can be divide into absolute_image_position_y parts of row. Now the box each of the cross section it creates you need to change/redraw all the pixels in terms of greyImage, parallely. Enough spoiler for an assignment :)
same code with with ability to handle non-standard input size images
int idx=blockDim.x*blockIdx.x+threadIdx.x;
int idy=blockDim.y*blockIdx.y+threadIdx.y;
uchar4 rgbcell=rgbaImage[idx*numCols+idy];
greyImage[idx*numCols+idy]=0.299*rgbcell.x+0.587*rgbcell.y+0.114*rgbcell.z;
}
void your_rgba_to_greyscale(const uchar4 * const h_rgbaImage, uchar4 * const d_rgbaImage,
unsigned char* const d_greyImage, size_t numRows, size_t numCols)
{
//You must fill in the correct sizes for the blockSize and gridSize
//currently only one block with one thread is being launched
int totalpixels=numRows*numCols;
int factors[]={2,4,8,16,24,32};
vector<int> numbers(factors,factors+sizeof(factors)/sizeof(int));
int factor=1;
while(!numbers.empty())
{
if(totalpixels%numbers.back()==0)
{
factor=numbers.back();
break;
}
else
{
numbers.pop_back();
}
}
const dim3 blockSize(factor, factor, 1); //TODO
const dim3 gridSize(numRows/factor+1, numCols/factor+1,1); //TODO
rgba_to_greyscale<<<gridSize, blockSize>>>(d_rgbaImage, d_greyImage, numRows, numCols);
1- int x =(blockIdx.x * blockDim.x) + threadIdx.x;
2- int y = (blockIdx.y * blockDim.y) + threadIdx.y;
And in grid and block size
1- const dim3 blockSize(32, 32, 1);
2- const dim3 gridSize((numCols/32+1), (numRows/32+1) , 1);
Code executed in 0.036992 ms.
const dim3 blockSize(16, 16, 1); //TODO
const dim3 gridSize( (numRows+15)/16, (numCols+15)/16, 1); //TODO
int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;
uchar4 rgba = rgbaImage[y*numRows + x];
float channelSum = .299f * rgba.x + .587f * rgba.y + .114f * rgba.z;
greyImage[y*numRows + x] = channelSum;

Implementing Ordered Dithering (24 bit RGB to 3 bit per channel RGB)

I'm writing an image editing programme, and I need functionality to dither any arbitrary 24-bit RGB image (I've taken care of loading it with CoreGraphics and such) to an image with 3 bit colour channels, then displaying it. I've set up my matrices and such, but I've not got any results from the code below besides a simple pattern that is applied to the image:
- (CGImageRef) ditherImageTo16Colours:(CGImageRef)image withDitheringMatrixType:(SQUBayerDitheringMatrix) matrix {
if(image == NULL) {
NSLog(#"Image is NULL!");
return NULL;
}
unsigned int imageWidth = CGImageGetWidth(image);
unsigned int imageHeight = CGImageGetHeight(image);
NSLog(#"Image size: %u x %u", imageWidth, imageHeight);
CGContextRef context = CGBitmapContextCreate(NULL,
imageWidth,
imageHeight,
8,
4 * (imageWidth),
CGColorSpaceCreateWithName(kCGColorSpaceGenericRGB),
kCGImageAlphaNoneSkipLast);
CGContextDrawImage(context, CGRectMake(0, 0, imageWidth, imageHeight), image); // draw it
CGImageRelease(image); // get rid of the image, we don't want it anymore.
unsigned char *imageData = CGBitmapContextGetData(context);
unsigned char ditheringModulusType[0x04] = {0x02, 0x03, 0x04, 0x08};
unsigned char ditheringModulus = ditheringModulusType[matrix];
unsigned int red;
unsigned int green;
unsigned int blue;
uint32_t *memoryBuffer;
memoryBuffer = (uint32_t *) malloc((imageHeight * imageWidth) * 4);
unsigned int thresholds[0x03] = {256/8, 256/8, 256/8};
for(int y = 0; y < imageHeight; y++) {
for(int x = 0; x < imageWidth; x++) {
// fetch the colour components, add the dither value to them
red = (imageData[((y * imageWidth) * 4) + (x << 0x02)]);
green = (imageData[((y * imageWidth) * 4) + (x << 0x02) + 1]);
blue = (imageData[((y * imageWidth) * 4) + (x << 0x02) + 2]);
if(red > 36 && red < 238) {
red += SQUBayer117_matrix[x % ditheringModulus][y % ditheringModulus];
} if(green > 36 && green < 238) {
green += SQUBayer117_matrix[x % ditheringModulus][y % ditheringModulus];
} if(blue > 36 && blue < 238) {
blue += SQUBayer117_matrix[x % ditheringModulus][y % ditheringModulus];
}
// memoryBuffer[(y * imageWidth) + x] = (0xFF0000 + ((x >> 0x1) << 0x08) + (y >> 2));
memoryBuffer[(y * imageWidth) + x] = find_closest_palette_colour(((red & 0xFF) << 0x10) | ((green & 0xFF) << 0x08) | (blue & 0xFF));
}
}
//CGContextRelease(context);
context = CGBitmapContextCreate(memoryBuffer,
imageWidth,
imageHeight,
8,
4 * (imageWidth),
CGColorSpaceCreateWithName(kCGColorSpaceGenericRGB),
kCGImageAlphaNoneSkipLast);
NSLog(#"Created context from buffer: %#", context);
CGImageRef result = CGBitmapContextCreateImage(context);
return result;
}
Note that find_closest_palette_colour doesn't do anything besides returning the original colour right now for testing.
I'm trying to implement the example pseudocode from Wikipedia, and I don't really get anything out of that right now.
Anyone got a clue on how to fix this up?
Use the code that I have provided here: https://stackoverflow.com/a/17900812/342646
This code converts the image to a single-channel gray-scale first. If you want the dithering to be done on a three-channel image, you can just split your image into three channels and call the function three times (once per channel).

Resources