How to decide the step size (row size) for a YUV420 (I420) image when converting YUV420 to BGR with the NVIDIA NPP library

I am new to CUDA and its image and signal processing library, NPP. I am trying to convert YUV420 to BGR using this function:
NppStatus nppiYUV420ToBGR_8u_P3C3R(const Npp8u * const pSrc[3], int rSrcStep[3], Npp8u * pDst, int nDstStep, NppiSize oSizeROI);
but I can't work out what to pass for rSrcStep. I know it is the row size of each component (Y, U, V), but I am not sure I really understand it. The original image size is 1920x1080 (w x h), and I wrap the YUV data in an OpenCV Mat:
cv::Mat(cv::Size(1920,1080*3/2), CV_8UC1, (void*)data)
For the rSrcStep parameter I tried rSrcStep={1920,1920/2,1920/2}, but the function returns NPP_STEP_ERROR.
PS: for nDstStep, I use the function below to allocate the destination buffer and get the step at the same time:
Npp8u * nppiMalloc_8u_C3(int nWidthPixels, int nHeightPixels, int * pStepBytes);
The Mat height is 1080*3/2 because a YUV420 image occupies w*h*3/2 bytes when the original RGB image is w x h pixels.

Set rSrcStep and nDstStep to the following values:
int rSrcStep[3] = { COLS, COLS / 2, COLS / 2 };
int nDstStep = COLS * 3;
Where COLS = 1920.
In the YUV420 planar format (I420), the Y channel has the full image resolution, while the U and V channels have half the resolution along each axis.
Assuming the data is contiguous in memory (no padding between rows), the step (row stride in bytes) of Y equals the image width, and the step of U and V equals width/2. For a 1920-pixel-wide image that is 1920, 960 and 960 bytes.
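As a rough sketch of the arithmetic above, the plane steps and sizes of a tightly packed I420 buffer can be computed as follows. The names `I420Layout` and `i420Layout` are made up for this illustration and are not part of NPP; the sketch assumes even width and height and no row padding.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper, not an NPP type: describes a tightly packed I420 buffer.
struct I420Layout {
    int step[3];          // row stride in bytes of the Y, U and V planes
    std::size_t size[3];  // size in bytes of the Y, U and V planes
    std::size_t total;    // size in bytes of the whole buffer
};

// Assumes even width and height and no padding at the end of each row.
inline I420Layout i420Layout(int width, int height) {
    assert(width > 0 && height > 0 && width % 2 == 0 && height % 2 == 0);
    I420Layout l;
    l.step[0] = width;      // Y: one byte per pixel
    l.step[1] = width / 2;  // U: half the width (and half the height)
    l.step[2] = width / 2;  // V: same as U
    l.size[0] = static_cast<std::size_t>(width) * height;
    l.size[1] = static_cast<std::size_t>(width / 2) * (height / 2);
    l.size[2] = l.size[1];
    l.total = l.size[0] + l.size[1] + l.size[2];
    return l;
}
```

For a 1920x1080 frame this gives steps of 1920, 960 and 960 bytes and a total of 3110400 bytes, which is 1920*1080*3/2.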
Testing:
The testing code uses FFmpeg for building the input image in raw I420 format, and uses FFmpeg for converting the raw BGR output to PNG image.
#include <stdint.h>
#include <stdio.h>
#include <cuda_runtime.h> //cudaMalloc, cudaMemcpy, cudaFree
#include "nppi.h"
#define COLS 192
#define ROWS 108
uint8_t Y[COLS * ROWS]; //Y color channel in host memory
uint8_t U[COLS * ROWS / 4]; //U color channel in host memory
uint8_t V[COLS * ROWS / 4]; //V color channel in host memory
uint8_t BGR[COLS * ROWS * 3]; //BGR output image in host memory
int main()
{
//Read Y, U, V planes to host memory buffers.
//Build input sample using FFmpeg first:
//ffmpeg -y -f lavfi -i testsrc=size=192x108:rate=1:duration=1 -pix_fmt yuvj420p -f rawvideo in.yuv420p
////////////////////////////////////////////////////////////////////////////
FILE* f = fopen("in.yuv420p", "rb");
if (f == NULL)
{
printf("Error: failed to open in.yuv420p\n");
return 1;
}
fread(Y, 1, COLS * ROWS, f);
fread(U, 1, COLS * ROWS / 4, f);
fread(V, 1, COLS * ROWS / 4, f);
fclose(f);
////////////////////////////////////////////////////////////////////////////
//Allocate device memory, and copy Y,U,V from host to device.
////////////////////////////////////////////////////////////////////////////
Npp8u* gpuY, * gpuU, * gpuV, * gpuBGR;
cudaMalloc(&gpuY, COLS * ROWS);
cudaMalloc(&gpuU, COLS * ROWS / 4);
cudaMalloc(&gpuV, COLS * ROWS / 4);
cudaMalloc(&gpuBGR, COLS * ROWS * 3);
cudaMemcpy(gpuY, Y, COLS * ROWS, cudaMemcpyHostToDevice);
cudaMemcpy(gpuU, U, COLS * ROWS / 4, cudaMemcpyHostToDevice);
cudaMemcpy(gpuV, V, COLS * ROWS / 4, cudaMemcpyHostToDevice);
////////////////////////////////////////////////////////////////////////////
//Execute nppiYUV420ToBGR_8u_P3C3R
////////////////////////////////////////////////////////////////////////////
const Npp8u* const pSrc[3] = { gpuY, gpuU, gpuV };
int rSrcStep[3] = { COLS, COLS / 2, COLS / 2 };
int nDstStep = COLS * 3;
NppiSize oSizeROI = { COLS, ROWS };
NppStatus sts = nppiYUV420ToBGR_8u_P3C3R(pSrc, //const Npp8u* const pSrc[3],
rSrcStep, //int rSrcStep[3],
gpuBGR, //Npp8u *pDst,
nDstStep, //int nDstStep,
oSizeROI); //NppiSize oSizeROI);
if (sts != NPP_SUCCESS)
{
printf("Error: nppiYUV420ToBGR_8u_P3C3R status = %d\n", (int)sts);
}
////////////////////////////////////////////////////////////////////////////
// Copy BGR output from device to host, and save BGR output to binary file
// After saving, use FFmpeg to convert the output image from binary to PNG:
// ffmpeg -y -f rawvideo -video_size 192x108 -pixel_format bgr24 -i out.bgr out.png
////////////////////////////////////////////////////////////////////////////
cudaMemcpy(BGR, gpuBGR, COLS * ROWS * 3, cudaMemcpyDeviceToHost);
f = fopen("out.bgr", "wb");
fwrite(BGR, 1, COLS * ROWS * 3, f);
fclose(f);
////////////////////////////////////////////////////////////////////////////
//cudaFree takes the device pointer itself, not the address of the host variable
cudaFree(gpuY);
cudaFree(gpuU);
cudaFree(gpuV);
cudaFree(gpuBGR);
return 0;
}

ImageMagick colors depth conversion

I have several thousand TGA files (without a palette) that contain RGBA4444 data (I know TGA files don't usually contain RGBA4444 data). I would like to convert them to RGBA8888. I use the following command line:
convert -depth 4 woody4.tga -depth 8 woody8.tga
Here woody4.tga is the original RGBA4444 file and woody8.tga is the target RGBA8888 file, but the command doesn't change the colors of my pictures. What am I missing?
Thanks,
Pierre
Edit:
Thanks very much Mark, I have successfully converted more than 10,000 TGA files with your program; the result is very good and faithful to the original TGA! This would have been impossible without the parallel command! One last point: I have around 50 larger TGA files (the game's backgrounds) that are encoded as RGBA5650 rather than RGBA4444. How can I modify your program to handle RGBA5650? Thanks very much!
Oh, I see Eric beat me to it:-)
Hey ho! I did it a different way anyway and got a different answer so you can see which one you like best. I also wrote some C but I didn't rely on any libraries, I just read the TGA and converted it to a PAM format and let ImageMagick make that into PNG afterwards at command-line.
I chose PAM because it is the simplest file to write which supports transparency - see Wikipedia on PAM format.
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/stat.h>
int main(int argc,char* argv[]){
unsigned char buf[64];
FILE* fp=fopen(argv[1],"rb");
if(fp==NULL){
fprintf(stderr,"ERROR: Unable to open %s\n",argv[1]);
exit(1);
}
// Read TGA header of 18 bytes, extract width and height
fread(buf,1,18,fp); // 12 bytes junk, 2 bytes width, 2 bytes height, 2 bytes junk
unsigned short w=buf[12]|(buf[13]<<8);
unsigned short h=buf[14]|(buf[15]<<8);
// Write PAM header
fprintf(stdout,"P7\n");
fprintf(stdout,"WIDTH %d\n",w);
fprintf(stdout,"HEIGHT %d\n",h);
fprintf(stdout,"DEPTH 4\n");
fprintf(stdout,"MAXVAL 255\n");
fprintf(stdout,"TUPLTYPE RGB_ALPHA\n");
fprintf(stdout,"ENDHDR\n");
// Read 2 bytes at a time RGBA4444
while(fread(buf,2,1,fp)==1){
unsigned char out[4];
out[0]=(buf[1]&0x0f)<<4;
out[1]=buf[0]&0xf0;
out[2]=(buf[0]&0x0f)<<4;
out[3]=buf[1]&0xf0;
// Write the 4 modified bytes out RGBA8888
fwrite(out,4,1,stdout);
}
fclose(fp);
return 0;
}
I then compile that with gcc:
gcc targa.c -o targa
Or you could use clang:
clang targa.c -o targa
and run it with
./targa someImage.tga > someImage.pam
and convert the PAM to PNG with ImageMagick at the command-line:
convert someImage.pam someImage.png
If you want to avoid writing the intermediate PAM file to disk, you can pipe it straight into convert like this:
./targa illu_evolution_01.tga | convert - result.png
You can, equally, make a BMP output file if you wish:
./targa illu_evolution_01.tga | convert - result.bmp
If you have thousands of files to do, and you are on a Mac or Linux, you can use GNU Parallel and get them all done in parallel much faster like this:
parallel --eta './targa {} | convert - {.}.png' ::: *.tga
If you have more than a couple of thousand files, you may get "Argument list too long" errors, in which case, use the slightly harder syntax:
find . -name \*tga -print0 | parallel -0 --eta './targa {} | convert - {.}.png'
On a Mac, you would install GNU Parallel with homebrew using:
brew install parallel
For your RGBA5650 images, I will fall back to PPM as my intermediate format because the alpha channel of PAM is no longer needed. The code will now look like this:
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/stat.h>
int main(int argc,char* argv[]){
unsigned char buf[64];
FILE* fp=fopen(argv[1],"rb");
if(fp==NULL){
fprintf(stderr,"ERROR: Unable to open %s\n",argv[1]);
exit(1);
}
// Read TGA header of 18 bytes, extract width and height
fread(buf,1,18,fp); // 12 bytes junk, 2 bytes width, 2 bytes height, 2 bytes junk
unsigned short w=buf[12]|(buf[13]<<8);
unsigned short h=buf[14]|(buf[15]<<8);
// Write PPM header
fprintf(stdout,"P6\n");
fprintf(stdout,"%d %d\n",w,h);
fprintf(stdout,"255\n");
// Read 2 bytes at a time RGBA5650
while(fread(buf,2,1,fp)==1){
unsigned char out[3];
out[0]=buf[1]&0xf8;
out[1]=((buf[1]&7)<<5) | ((buf[0]>>3)&0x1c);
out[2]=(buf[0]&0x1f)<<3;
// Write the 3 modified bytes out RGB888
fwrite(out,3,1,stdout);
}
fclose(fp);
return 0;
}
And will compile and run exactly the same way.
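The programs above place each 4-, 5- or 6-bit value in the top bits of the output byte and leave the low bits at zero, so a full-intensity channel comes out as 0xF0, 0xF8 or 0xFC rather than 0xFF. A common variation, which is not part of the original answer, is to replicate the high bits into the low bits. The sketch below uses illustrative function names and assumes the same byte order as the code above, with the first byte of each pixel being the low byte.

```cpp
#include <cassert>
#include <cstdint>

// Expand a 4-bit value (0..15) to 8 bits by repeating the nibble: 0xF -> 0xFF.
inline std::uint8_t expand4(unsigned v) {
    assert(v <= 0x0F);
    return static_cast<std::uint8_t>((v << 4) | v);
}

// Expand a 5-bit value (0..31) to 8 bits: 0x1F -> 0xFF.
inline std::uint8_t expand5(unsigned v) {
    assert(v <= 0x1F);
    return static_cast<std::uint8_t>((v << 3) | (v >> 2));
}

// Expand a 6-bit value (0..63) to 8 bits: 0x3F -> 0xFF.
inline std::uint8_t expand6(unsigned v) {
    assert(v <= 0x3F);
    return static_cast<std::uint8_t>((v << 2) | (v >> 4));
}

// Convert one RGB565 pixel, given as two little-endian bytes, to RGB888.
inline void rgb565ToRgb888(const std::uint8_t in[2], std::uint8_t out[3]) {
    const unsigned word = in[0] | (static_cast<unsigned>(in[1]) << 8);
    out[0] = expand5((word >> 11) & 0x1F);  // red:   bits 15..11
    out[1] = expand6((word >> 5) & 0x3F);   // green: bits 10..5
    out[2] = expand5(word & 0x1F);          // blue:  bits 4..0
}
```

Whether replication is wanted depends on the artwork: it maps the brightest input to pure white, at the cost of no longer being a plain shift of the source bits.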
Updated answer.
After reading a few documents about TARGA format. I've revised + simplified a C program to convert.
// tga2im.c
#include <stdio.h>
#include <stdlib.h>
#include <wand/MagickWand.h>
typedef struct {
unsigned char idlength;
unsigned char colourmaptype;
unsigned char datatypecode;
short int colourmaporigin;
short int colourmaplength;
unsigned char colourmapdepth;
short int x_origin;
short int y_origin;
short int width;
short int height;
unsigned char bitsperpixel;
unsigned char imagedescriptor;
} HEADER;
typedef struct {
int extensionoffset;
int developeroffset;
char signature[16];
unsigned char p;
unsigned char n;
} FOOTER;
int main(int argc, const char * argv[]) {
HEADER tga_header;
FOOTER tga_footer;
FILE *fd;
size_t tga_data_size, tga_pixel_size, i, j;
unsigned char *tga_data, *buffer;
const char *input, *output;
if (argc != 3) {
printf("Usage:\n\t %s <input> <output>\n", argv[0]);
return 1;
}
input = argv[1];
output = argv[2];
fd = fopen(input, "rb");
if (fd == NULL) {
fprintf(stderr, "Unable to read TGA input\n");
return 1;
}
/********\
* TARGA *
\*********/
#pragma mark TARGA
// Read TGA header
fread(&tga_header.idlength, sizeof(unsigned char), 1, fd);
fread(&tga_header.colourmaptype, sizeof(unsigned char), 1, fd);
fread(&tga_header.datatypecode, sizeof(unsigned char), 1, fd);
fread(&tga_header.colourmaporigin, sizeof( short int), 1, fd);
fread(&tga_header.colourmaplength, sizeof( short int), 1, fd);
fread(&tga_header.colourmapdepth, sizeof(unsigned char), 1, fd);
fread(&tga_header.x_origin, sizeof( short int), 1, fd);
fread(&tga_header.y_origin, sizeof( short int), 1, fd);
fread(&tga_header.width, sizeof( short int), 1, fd);
fread(&tga_header.height, sizeof( short int), 1, fd);
fread(&tga_header.bitsperpixel, sizeof(unsigned char), 1, fd);
fread(&tga_header.imagedescriptor, sizeof(unsigned char), 1, fd);
// Calculate sizes
tga_pixel_size = tga_header.bitsperpixel / 8;
tga_data_size = tga_header.width * tga_header.height * tga_pixel_size;
// Read image data
tga_data = malloc(tga_data_size);
fread(tga_data, 1, tga_data_size, fd);
// Read TGA footer.
fseek(fd, -26, SEEK_END);
fread(&tga_footer.extensionoffset, sizeof( int), 1, fd);
fread(&tga_footer.developeroffset, sizeof( int), 1, fd);
fread(&tga_footer.signature, sizeof( char), 16, fd);
fread(&tga_footer.p, sizeof(unsigned char), 1, fd);
fread(&tga_footer.n, sizeof(unsigned char), 1, fd);
fclose(fd);
buffer = malloc(tga_header.width * tga_header.height * 4);
#pragma mark RGBA4444 to RGBA8888
for (i = 0, j=0; i < tga_data_size; i+= tga_pixel_size) {
buffer[j++] = (tga_data[i+1] & 0x0f) << 4; // Red
buffer[j++] = tga_data[i ] & 0xf0; // Green
buffer[j++] = (tga_data[i ] & 0x0f) << 4; // Blue
buffer[j++] = tga_data[i+1] & 0xf0; // Alpha
}
free(tga_data);
/***************\
* IMAGEMAGICK *
\***************/
#pragma mark IMAGEMAGICK
MagickWandGenesis();
PixelWand * background;
background = NewPixelWand();
PixelSetColor(background, "none");
MagickWand * wand;
wand = NewMagickWand();
MagickNewImage(wand,
tga_header.width,
tga_header.height,
background);
background = DestroyPixelWand(background);
MagickImportImagePixels(wand,
0,
0,
tga_header.width,
tga_header.height,
"RGBA",
CharPixel,
buffer);
free(buffer);
MagickWriteImage(wand, output);
wand = DestroyMagickWand(wand);
MagickWandTerminus();
return 0;
}
Which can be compiled with clang $(MagickWand-config --cflags --libs) -o tga2im tga2im.c, and can be executed simply by ./tga2im N_birthday_0000.tga N_birthday_0000.tga.png.
Original answer.
The only way I can think of converting the images is to author a quick program/script to do the bitwise color-pixel logic.
This answer offers a quick way to read the image data; so combining with MagickWand, can be converted easily. (Although I know there'll be better solutions found on old game-dev forums...)
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include <wand/MagickWand.h>
typedef struct
{
unsigned char imageTypeCode;
short int imageWidth;
short int imageHeight;
unsigned char bitCount;
unsigned char *imageData;
} TGAFILE;
bool LoadTGAFile(const char *filename, TGAFILE *tgaFile);
int main(int argc, const char * argv[]) {
const char
* input,
* output;
if (argc != 3) {
printf("Usage:\n\t%s <input> <output>\n", argv[0]);
return 1;
}
input = argv[1];
output = argv[2];
MagickWandGenesis();
TGAFILE header;
if (LoadTGAFile(input, &header) == true) {
// Build a blank canvas image matching TGA file.
MagickWand * wand;
wand = NewMagickWand();
PixelWand * background;
background = NewPixelWand();
PixelSetColor(background, "NONE");
MagickNewImage(wand, header.imageWidth, header.imageHeight, background);
background = DestroyPixelWand(background);
// Allocate RGBA8888 buffer
unsigned char * buffer = malloc(header.imageWidth * header.imageHeight * 4);
// Iterate over TGA image data, and convert RGBA4444 to RGBA8888;
size_t pixel_size = header.bitCount / 8;
size_t total_bytes = header.imageWidth * header.imageHeight * pixel_size;
for (int i = 0, j = 0; i < total_bytes; i+=pixel_size) {
// Red
buffer[j++] = (header.imageData[i ] & 0x0f) << 4;
// Green
buffer[j++] = (header.imageData[i ] & 0xf0);
// Blue
buffer[j++] = (header.imageData[i+1] & 0x0f) << 4;
// Alpha
buffer[j++] = (header.imageData[i+1] & 0xf0);
}
// Import image data over blank canvas
MagickImportImagePixels(wand, 0, 0, header.imageWidth, header.imageHeight, "RGBA", CharPixel, buffer);
// Write image
MagickWriteImage(wand, output);
wand = DestroyMagickWand(wand);
} else {
fprintf(stderr, "Could not read TGA file %s\n", input);
}
MagickWandTerminus();
return 0;
}
/*
* Method copied verbatim from https://stackoverflow.com/a/7050007/438117
* Show your love by +1 to Wroclai answer.
*/
bool LoadTGAFile(const char *filename, TGAFILE *tgaFile)
{
FILE *filePtr;
unsigned char ucharBad;
short int sintBad;
long imageSize;
int colorMode;
unsigned char colorSwap;
// Open the TGA file.
filePtr = fopen(filename, "rb");
if (filePtr == NULL)
{
return false;
}
// Read the two first bytes we don't need.
fread(&ucharBad, sizeof(unsigned char), 1, filePtr);
fread(&ucharBad, sizeof(unsigned char), 1, filePtr);
// Which type of image gets stored in imageTypeCode.
fread(&tgaFile->imageTypeCode, sizeof(unsigned char), 1, filePtr);
// For our purposes, the type code should be 2 (uncompressed RGB image)
// or 3 (uncompressed black-and-white images).
if (tgaFile->imageTypeCode != 2 && tgaFile->imageTypeCode != 3)
{
fclose(filePtr);
return false;
}
// Read 13 bytes of data we don't need.
fread(&sintBad, sizeof(short int), 1, filePtr);
fread(&sintBad, sizeof(short int), 1, filePtr);
fread(&ucharBad, sizeof(unsigned char), 1, filePtr);
fread(&sintBad, sizeof(short int), 1, filePtr);
fread(&sintBad, sizeof(short int), 1, filePtr);
// Read the image's width and height.
fread(&tgaFile->imageWidth, sizeof(short int), 1, filePtr);
fread(&tgaFile->imageHeight, sizeof(short int), 1, filePtr);
// Read the bit depth.
fread(&tgaFile->bitCount, sizeof(unsigned char), 1, filePtr);
// Read one byte of data we don't need.
fread(&ucharBad, sizeof(unsigned char), 1, filePtr);
// Color mode -> 3 = BGR, 4 = BGRA.
colorMode = tgaFile->bitCount / 8;
imageSize = tgaFile->imageWidth * tgaFile->imageHeight * colorMode;
// Allocate memory for the image data.
tgaFile->imageData = (unsigned char*)malloc(sizeof(unsigned char)*imageSize);
// Read the image data.
fread(tgaFile->imageData, sizeof(unsigned char), imageSize, filePtr);
// Change from BGR to RGB so OpenGL can read the image data.
for (int imageIdx = 0; imageIdx < imageSize; imageIdx += colorMode)
{
colorSwap = tgaFile->imageData[imageIdx];
tgaFile->imageData[imageIdx] = tgaFile->imageData[imageIdx + 2];
tgaFile->imageData[imageIdx + 2] = colorSwap;
}
fclose(filePtr);
return true;
}
The order of the color channels may need to be switched around.
I have been thinking about this some more, and it ought to be possible to reconstruct the image without any special software. I can't quite see my mistake for the moment, but maybe @emcconville can cast an expert eye over it and point it out. Pretty please?
So, my concept is that ImageMagick has read in the image size and pixel data correctly but has allocated the bits according to the standard 16-bit TARGA interpretation (1 alpha bit followed by 5 bits each of red, green and blue) rather than RGBA4444. So, we rebuild the 16 bits of data it read and split them differently.
The first line below does the rebuild into the original 16-bit data, then each subsequent line splits out one of the RGBA channels and then we recombine them:
convert illu_evolution_01.tga -depth 16 -channel R -fx "(((r*255)<<10) | ((g*255)<<5) | (b*255) | ((a*255)<<15))/255" \
\( -clone 0 -channel R -fx "((((r*255)>>12)&15)<<4)/255" \) \
\( -clone 0 -channel R -fx "((((r*255)>>8 )&15)<<4)/255" \) \
\( -clone 0 -channel R -fx "((((r*255) )&15)<<4)/255" \) \
-delete 0 -set colorspace RGB -combine -colorspace sRGB result.png
# The rest is just debug so you can see the reconstructed channels in [rgba].png
convert result.png -channel R -separate r.png
convert result.png -channel G -separate g.png
convert result.png -channel B -separate b.png
convert result.png -channel A -separate a.png
So, the following diagram represents the 16-bits of 1 pixel:
A R R R R R G G G G G B B B B B <--- what IM saw
R R R R G G G G B B B B A A A A <--- what it really meant
Yes, I have disregarded the alpha channel for the moment.
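The bit shuffling in the diagram can also be written out in code. The following sketch only illustrates the idea, assuming each pixel was decoded as 1 alpha bit plus 5 bits each of red, green and blue; the function names are invented for this example.

```cpp
#include <cassert>
#include <cstdint>

// Rebuild the original 16-bit word from fields decoded as A1 R5 G5 B5.
inline std::uint16_t packA1R5G5B5(unsigned a, unsigned r, unsigned g, unsigned b) {
    assert(a <= 1 && r <= 31 && g <= 31 && b <= 31);
    return static_cast<std::uint16_t>((a << 15) | (r << 10) | (g << 5) | b);
}

// Split the same 16 bits as R4 G4 B4 A4, most significant nibble first,
// and move each nibble into the top half of an 8-bit channel.
inline void splitR4G4B4A4(std::uint16_t word, std::uint8_t rgba[4]) {
    rgba[0] = static_cast<std::uint8_t>(((word >> 12) & 15) << 4);  // red
    rgba[1] = static_cast<std::uint8_t>(((word >> 8) & 15) << 4);   // green
    rgba[2] = static_cast<std::uint8_t>(((word >> 4) & 15) << 4);   // blue
    rgba[3] = static_cast<std::uint8_t>((word & 15) << 4);          // alpha
}
```

Following the second row of the diagram, this sketch takes blue from bits 7..4 and alpha from bits 3..0.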

Read OpenCV Mat from raw file [duplicate]

Is there a more efficient way to load a large Mat object into memory than the FileStorage method in OpenCV?
I have a large Mat with 192 columns and 1 million rows that I want to store locally in a file and load into memory when my application starts. FileStorage works without problems, but I was wondering whether a more efficient method exists. At the moment it takes about 5 minutes to load the Mat in Debug mode in Visual Studio and around 3 minutes in Release mode, and the data file is around 1.2 GB.
Is the FileStorage method the only method available to do this task?
Are you ok with a 100x speedup?
You should save and load your images in binary format. You can do that with the matwrite and matread function in the code below.
I tested both loading from a FileStorage and the binary file, and for a smaller image with 250K rows, 192 columns, type CV_8UC1 I got these results (time in ms):
// Mat: 250K rows, 192 cols, type CV_8UC1
Using FileStorage: 5523.45
Using Raw: 50.0879
On an image with 1M rows and 192 cols using the binary mode I got (time in ms):
// Mat: 1M rows, 192 cols, type CV_8UC1
Using FileStorage: (can't load, out of memory)
Using Raw: 197.381
NOTE
Never measure performance in debug.
3 minutes to load a matrix seems way too much, even for FileStorages. However, you'll gain a lot switching to binary mode.
Here is the code with the functions matwrite and matread, and the test:
#include <opencv2/opencv.hpp>
#include <iostream>
#include <fstream>
using namespace std;
using namespace cv;
void matwrite(const string& filename, const Mat& mat)
{
ofstream fs(filename, fstream::binary);
// Header
int type = mat.type();
int channels = mat.channels();
fs.write((char*)&mat.rows, sizeof(int)); // rows
fs.write((char*)&mat.cols, sizeof(int)); // cols
fs.write((char*)&type, sizeof(int)); // type
fs.write((char*)&channels, sizeof(int)); // channels
// Data
if (mat.isContinuous())
{
fs.write(mat.ptr<char>(0), (mat.dataend - mat.datastart));
}
else
{
int rowsz = CV_ELEM_SIZE(type) * mat.cols;
for (int r = 0; r < mat.rows; ++r)
{
fs.write(mat.ptr<char>(r), rowsz);
}
}
}
Mat matread(const string& filename)
{
ifstream fs(filename, fstream::binary);
// Header
int rows, cols, type, channels;
fs.read((char*)&rows, sizeof(int)); // rows
fs.read((char*)&cols, sizeof(int)); // cols
fs.read((char*)&type, sizeof(int)); // type
fs.read((char*)&channels, sizeof(int)); // channels
// Data
Mat mat(rows, cols, type);
fs.read((char*)mat.data, CV_ELEM_SIZE(type) * rows * cols);
return mat;
}
int main()
{
// Save the random generated data
{
Mat m(1024*256, 192, CV_8UC1);
randu(m, 0, 1000);
FileStorage fs("fs.yml", FileStorage::WRITE);
fs << "m" << m;
matwrite("raw.bin", m);
}
// Load the saved matrix
{
// Method 1: using FileStorage
double tic = double(getTickCount());
FileStorage fs("fs.yml", FileStorage::READ);
Mat m1;
fs["m"] >> m1;
double toc = (double(getTickCount()) - tic) * 1000. / getTickFrequency();
cout << "Using FileStorage: " << toc << endl;
}
{
// Method 2: using raw binary data
double tic = double(getTickCount());
Mat m2 = matread("raw.bin");
double toc = (double(getTickCount()) - tic) * 1000. / getTickFrequency();
cout << "Using Raw: " << toc << endl;
}
int dummy;
cin >> dummy;
return 0;
}
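The file layout used by matwrite and matread is simply a small fixed header followed by the raw pixel bytes. For readers without OpenCV at hand, here is a minimal sketch of the same idea using only the standard library. The `RawMatrix` type and the function names are assumptions made for this illustration, it handles only single-channel 8-bit data, and the resulting file is only portable between machines with the same int size and byte order.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical stand-in for a single-channel 8-bit matrix stored row by row.
struct RawMatrix {
    int rows = 0;
    int cols = 0;
    std::vector<std::uint8_t> data;  // rows * cols bytes
};

// Write a two-int header (rows, cols) followed by the raw bytes.
inline bool rawWrite(const std::string& filename, const RawMatrix& m) {
    assert(m.data.size() == static_cast<std::size_t>(m.rows) * m.cols);
    std::ofstream fs(filename, std::ios::binary);
    if (!fs) return false;
    fs.write(reinterpret_cast<const char*>(&m.rows), sizeof(int));
    fs.write(reinterpret_cast<const char*>(&m.cols), sizeof(int));
    fs.write(reinterpret_cast<const char*>(m.data.data()),
             static_cast<std::streamsize>(m.data.size()));
    return static_cast<bool>(fs);
}

// Read the header, size the buffer, then read all bytes in one call.
inline bool rawRead(const std::string& filename, RawMatrix& m) {
    std::ifstream fs(filename, std::ios::binary);
    if (!fs) return false;
    fs.read(reinterpret_cast<char*>(&m.rows), sizeof(int));
    fs.read(reinterpret_cast<char*>(&m.cols), sizeof(int));
    if (!fs || m.rows < 0 || m.cols < 0) return false;
    m.data.resize(static_cast<std::size_t>(m.rows) * m.cols);
    fs.read(reinterpret_cast<char*>(m.data.data()),
            static_cast<std::streamsize>(m.data.size()));
    return static_cast<bool>(fs);
}
```

The speed comes from the single bulk read into a pre-sized buffer: no text is parsed, so the cost is close to that of copying the file into memory.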

Cuda Memory access error : CudaIllegalAddress , Image Processing(Stereo vision)

I'm using CUDA for image processing, but the result is always 'cudaErrorIllegalAddress: an illegal memory access was encountered'.
What I did is below.
First, I load the converted image (RGB to gray) onto the device using 'cudaMallocPitch' and 'cudaMemcpy2D':
unsigned char *dev_srcleft;
size_t dev_srcleftPitch
cudaMallocPitch((void**)&dev_srcleft, &dev_srcleftPitch, COLS * sizeof(int), ROWS));
cudaMemcpy2D(dev_srcleft, dev_srcleftPitch, host_srcConvertL.data, host_srcConvertL.step,
COLS, ROWS, cudaMemcpyHostToDevice);
Next, I allocate a 2D array to store the result. Each result value needs 27 bits, so I use 'int' (4 bytes = 32 bits). Besides giving enough room, this lets me use atomic operations (atomicOr, atomicXor), which I need for performance,
and my device does not support 64-bit atomic operations.
int *dev_leftTrans;
cudaMallocPitch((void**)&dev_leftTrans, &dev_leftTransPitch, COLS * sizeof(int), ROWS);
cudaMemset2D(dev_leftTrans, dev_leftTransPitch, 0, COLS, ROWS);
Memory allocation and cudaMemcpy2D work fine; I checked with:
Mat temp_output(ROWS, COLS, 0);
cudaMemcpy2D(temp_output.data, temp_output.step, dev_srcleft, dev_srcleftPitch, COLS, ROWS, cudaMemcpyDeviceToHost);
imshow("temp", temp_output);
Then I run the kernel:
__global__ void TestKernel(unsigned char *src, size_t src_pitch,
int *dst, size_t dst_pitch,
unsigned int COLS, unsigned int ROWS)
{
const unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
const unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;
unsigned char src_val = src[x + y * src_pitch];
dst[x + y * dst_pitch] = src_val;
}
dim3 dimblock(3, 3);
dim3 dimGrid(ceil((float)COLS / dimblock.x), ceil((float)ROWS / dimblock.y));
TestKernel << <dimGrid, dimblock, dimblock.x * dimblock.y * sizeof(char) >> >
(dev_srcleft, dev_srcleftPitch, dev_leftTrans, dev_leftTransPitch, COLS, ROWS);
The COLS and ROWS parameters are the image dimensions.
I think the error occurs in TestKernel.
Reading src_val from global memory works, but as soon as I try to write to dst it fails with cudaErrorIllegalAddress.
I don't know what is wrong, and I have been stuck on this for 4 days. Please help me.
My full code is below:
#include <cuda.h>
#include <cuda_runtime.h>
#include <cuda_runtime_api.h>
#include <device_functions.h>
#include <cuda_device_runtime_api.h>
#include <device_launch_parameters.h>
#include <math.h>
#include <iostream>
#include <opencv2\opencv.hpp>
#include<string>
#define HANDLE_ERROR(err)(HandleError(err, __FILE__, __LINE__))
static void HandleError(cudaError_t err, const char*file, int line)
{
if (err != cudaSuccess)
{
printf("%s in %s at line %d\n", cudaGetErrorString(err), file, line);
exit(EXIT_FAILURE);
}
}
using namespace std;
using namespace cv;
string imagePath = "Ted";
string imagePathL = imagePath + "imL.png";
string imagePathR = imagePath + "imR.png";
__global__ void TestKernel(unsigned char*src, size_t src_pitch,
int *dst, size_t dst_pitch,
unsigned int COLS, unsigned int ROWS)
{
const unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
const unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;
if ((COLS< x) && (ROWS < y)) return;
unsigned char src_val = src[x + y * src_pitch];
dst[x + y * dst_pitch] = src_val;
}
int main(void)
{
//Print_DeviceProperty();
//Left Image Load
Mat host_srcImgL = imread(imagePathL, CV_LOAD_IMAGE_UNCHANGED);
if (host_srcImgL.empty()){ cout << "Left Image Load Fail!" << endl; return; }
Mat host_srcConvertL;
cvtColor(host_srcImgL, host_srcConvertL, CV_BGR2GRAY);
//Right Image Load
Mat host_srcImgR = imread(imagePathR, CV_LOAD_IMAGE_UNCHANGED);
if (host_srcImgL.empty()){ cout << "Right Image Load Fail!" << endl; return; }
Mat host_srcConvertR;
cvtColor(host_srcImgR, host_srcConvertR, CV_BGR2GRAY);
//Create parameters
unsigned int COLS = host_srcConvertL.cols;
unsigned int ROWS = host_srcConvertR.rows;
unsigned int SIZE = COLS * ROWS;
imshow("Left source image", host_srcConvertL);
imshow("Right source image", host_srcConvertR);
unsigned char *dev_srcleft, *dev_srcright, *dev_disp;
int *dev_leftTrans, *dev_rightTrans;
size_t dev_srcleftPitch, dev_srcrightPitch, dev_dispPitch, dev_leftTransPitch, dev_rightTransPitch;
cudaMallocPitch((void**)&dev_srcleft, &dev_srcleftPitch, COLS, ROWS);
cudaMallocPitch((void**)&dev_srcright, &dev_srcrightPitch, COLS, ROWS);
cudaMallocPitch((void**)&dev_disp, &dev_dispPitch, COLS, ROWS);
cudaMallocPitch((void**)&dev_leftTrans, &dev_leftTransPitch, COLS * sizeof(int), ROWS);
cudaMallocPitch((void**)&dev_rightTrans, &dev_rightTransPitch, COLS * sizeof(int), ROWS);
cudaMemcpy2D(dev_srcleft, dev_srcleftPitch, host_srcConvertL.data, host_srcConvertL.step,
COLS, ROWS, cudaMemcpyHostToDevice);
cudaMemcpy2D(dev_srcright, dev_srcrightPitch, host_srcConvertR.data, host_srcConvertR.step,
COLS, ROWS, cudaMemcpyHostToDevice);
cudaMemset(dev_disp, 255, dev_dispPitch * ROWS);
dim3 dimblock(3, 3);
dim3 dimGrid(ceil((float)COLS / dimblock.x), ceil((float)ROWS / dimblock.y));
cudaEvent_t start, stop;
float elapsedtime;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, 0);
TestKernel << <dimGrid, dimblock, dimblock.x * dimblock.y * sizeof(char) >> >
(dev_srcleft, dev_srcleftPitch, dev_leftTrans, dev_leftTransPitch, COLS, ROWS);
/*TestKernel << <dimGrid, dimblock, dimblock.x * dimblock.y * sizeof(char) >> >
(dev_srcright, dev_srcrightPitch, dev_rightTrans, dev_rightTransPitch, COLS, ROWS);*/
cudaThreadSynchronize();
cudaError_t res = cudaGetLastError();
if (res != cudaSuccess)
printf("%s : %s\n", cudaGetErrorName(res), cudaGetErrorString(res));
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&elapsedtime, start, stop);
cudaEventDestroy(start);
cudaEventDestroy(stop);
cout << elapsedtime << "msec" << endl;
Mat temp_output(ROWS, COLS, 0);
cudaMemcpy2D((int*)temp_output.data, temp_output.step, dev_leftTrans, dev_leftTransPitch, COLS, ROWS, cudaMemcpyDeviceToHost);
imshow("temp", temp_output);
waitKey(0);
return 0;
}
My environment is VS2013 with CUDA v6.5.
The device properties are below:
Major revision number: 3
Minor revision number: 0
Name: GeForce GTX 760 (192-bit)
Total global memory: 1610612736
Total shared memory per block: 49152
Total registers per block: 65536
Warp size: 32
Maximum memory pitch: 2147483647
Maximum threads per block: 1024
Maximum dimension 0 of block: 1024
Maximum dimension 1 of block: 1024
Maximum dimension 2 of block: 64
Maximum dimension 0 of grid: 2147483647
Maximum dimension 1 of grid: 65535
Maximum dimension 2 of grid: 65535
Clock rate: 888500
Total constant memory: 65536
Texture alignment: 512
Concurrent copy and execution: Yes
Number of multiprocessors: 6
Kernel execution timeout: Yes
One problem is that your kernel doesn't do any thread-checking.
When you define a grid of blocks like this:
dim3 dimGrid(ceil((float)COLS / dimblock.x), ceil((float)ROWS / dimblock.y));
you will often be launching extra blocks. If COLS or ROWS is not evenly divisible by the block dimension (3 in this case), an extra block is needed to cover the remainder in that direction.
These extra blocks will have some threads that are doing useful work, and some that will access out-of-bounds. To protect against this, it's customary to put a thread-check in your kernel to prevent out-of-bounds accesses:
const unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
const unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;
if ((x < COLS) && (y < ROWS)) { // add this
unsigned char src_val = src[x + y * src_pitch];
dst[x + y * dst_pitch] = src_val;
} // add this
This means that only the threads that have a valid (in-bounds) x and y will actually do any accesses.
As an aside, (3,3) may not be a particularly good choice of block dimensions for performance reasons. It's usually a good idea to create block dimensions whose product is a multiple of 32, so (32,4) or (16,16) might be examples of better choices.
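The grid sizing and the bounds check can be tried out on the host without a GPU. The sketch below is plain C++ rather than CUDA and uses made-up helper names. It computes the number of blocks with integer ceiling division and counts how many of the launched threads would pass the `(x < COLS) && (y < ROWS)` test.

```cpp
#include <cassert>

// Integer ceiling division: number of blocks of size `block` needed to cover `n`.
inline unsigned ceilDiv(unsigned n, unsigned block) {
    assert(block > 0);
    return (n + block - 1) / block;
}

// Total threads launched for a cols x rows image with bx x by blocks.
inline unsigned long long launchedThreads(unsigned cols, unsigned rows,
                                          unsigned bx, unsigned by) {
    return static_cast<unsigned long long>(ceilDiv(cols, bx)) * bx *
           ceilDiv(rows, by) * by;
}

// Walk over every thread the launch would create and count those that
// pass the same bounds check as the kernel.
inline unsigned long long countInBounds(unsigned cols, unsigned rows,
                                        unsigned bx, unsigned by) {
    unsigned long long inside = 0;
    const unsigned gx = ceilDiv(cols, bx);
    const unsigned gy = ceilDiv(rows, by);
    for (unsigned y = 0; y < gy * by; ++y)
        for (unsigned x = 0; x < gx * bx; ++x)
            if (x < cols && y < rows) ++inside;
    return inside;
}
```

For a 100x50 image with 16x16 blocks the grid is 7x4 blocks, so 112x64 = 7168 threads are launched, of which only 5000 are in bounds. The remaining 2168 are the ones the thread-check has to stop.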
Another problem in your code is how the pitch is used for the dst array.
The pitch is always in bytes, so you first need to cast the dst pointer to char*, compute the row offset, and then cast the result back to int*:
int* dst_row = (int*)(((char*)dst) + y * dst_pitch);
dst_row[x] = src_val;

Creating a Mat object from a YV12 image buffer

I have a buffer which contains an image in YV12 format. Now I want to either convert this buffer to RGB format or create a Mat object from it directly! Can someone help me? I tried this code :
cv::Mat input(widthOfImg, heightOfImg, CV_8UC1, vy12Buffer);
cv::Mat converted;
cv::cvtColor(input, converted, CV_YUV2RGB_YV12);
That's possible.
cv::Mat picYV12 = cv::Mat(nHeight * 3/2, nWidth, CV_8UC1, yv12DataBuffer);
cv::Mat picBGR;
cv::cvtColor(picYV12, picBGR, CV_YUV2BGR_YV12);
cv::imwrite("test.bmp", picBGR); //only for test
Opencv color conversion flags
The height is multiplied by 3/2 because, for every 2x2 square of pixels, 4 Y samples, 1 U sample and 1 V sample are stored. This gives a ratio of 3/2 bytes per pixel:
(4*1 + 1 + 1) samples per 2*2 pixels = 6/4 = 3/2
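The 3/2 factor and the plane order can be spelled out as a small calculation. This is a sketch with invented names (`Yv12Planes`, `yv12Planes`), assuming even dimensions and no row padding. Note that YV12 stores the V plane before the U plane, whereas I420 stores U first.

```cpp
#include <cassert>
#include <cstddef>

// Byte offsets of the three planes inside a tightly packed YV12 buffer.
struct Yv12Planes {
    std::size_t yOffset;
    std::size_t vOffset;
    std::size_t uOffset;
    std::size_t totalBytes;
    int matRows;  // rows of the single-channel Mat wrapping the buffer
};

inline Yv12Planes yv12Planes(int width, int height) {
    assert(width > 0 && height > 0 && width % 2 == 0 && height % 2 == 0);
    const std::size_t ySize = static_cast<std::size_t>(width) * height;
    const std::size_t cSize = ySize / 4;  // one chroma sample per 2x2 block
    Yv12Planes p;
    p.yOffset = 0;
    p.vOffset = ySize;                 // V follows Y in YV12
    p.uOffset = ySize + cSize;         // U follows V
    p.totalBytes = ySize + 2 * cSize;  // = width * height * 3 / 2
    p.matRows = height * 3 / 2;
    return p;
}
```

For a 512x512 image this gives 393216 bytes and a 768-row, 512-column single-channel Mat.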
YV12 Format
Correction: in recent versions of OpenCV (I use the older 2.4.13) the color conversion code has been renamed to COLOR_YUV2BGR_YV12:
cv::cvtColor(picYV12, picBGR, COLOR_YUV2BGR_YV12);
Here is the corresponding version in Java (Android).
This method was faster than other techniques such as RenderScript or OpenGL (glReadPixels) for getting a bitmap from a YV12/I420 data stream (tested with WebRTC I420).
long startTimei = SystemClock.uptimeMillis();
Mat picyv12 = new Mat(768,512,CV_8UC1); //(im_height*3/2,im_width), should be even no...
picyv12.put(0,0,return_buff); // buffer - byte array with i420 data
Imgproc.cvtColor(picyv12,picyv12,COLOR_YUV2RGB_YV12);// or use COLOR_YUV2BGR_YV12 depending on output result
long endTimei = SystemClock.uptimeMillis();
Log.d("i420_time", Long.toString(endTimei - startTimei));
Log.d("picyv12_size", picyv12.size().toString()); // Check size
Log.d("picyv12_type", String.valueOf(picyv12.type())); // Check type
Utils.matToBitmap(picyv12,tbmp2); // Convert mat to bitmap (height, width) i.e (512,512) - ARGB_888
save(tbmp2,"itest"); // Save bitmap
That's impossible.
Y'UV420p is a planar format, meaning that the Y', U, and V values are
grouped together instead of interspersed. The reason for this is that
by grouping the U and V values together, the image becomes much more
compressible. When given an array of an image in the Y'UV420p format,
all the Y' values come first, followed by all the U values, followed
finally by all the V values.
But a color cv::Mat stores its pixels interleaved, arranged like B0 G0 R0 B1 G1 R1..., so we can't create a color Mat from a YV12 buffer directly.
Here is an example:
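cv::Mat Yv12ToRgb(const uchar *pBuffer, long bufferSize, int width, int height)
{
    // Output is a 3-channel Mat in OpenCV's B, G, R byte order.
    cv::Mat result(height, width, CV_8UC3);
    long ySize = (long)width * height;
    long uSize = ySize >> 2; // each chroma plane is (width/2) x (height/2)
    assert(bufferSize == ySize + uSize * 2);
    const uchar *pY = pBuffer;
    const uchar *pV = pY + ySize; // YV12 stores the V plane before the U plane
    const uchar *pU = pV + uSize;
    for (int row = 0; row < height; ++row)
    {
        uchar *output = result.ptr<uchar>(row);
        for (int col = 0; col < width; ++col)
        {
            // Each 2x2 block of pixels shares one U and one V sample.
            long c = (long)(row / 2) * (width / 2) + col / 2;
            int y = pY[(long)row * width + col];
            int cb = pU[c] - 128;
            int cr = pV[c] - 128;
            // ITU-R BT.601 coefficients (full range)
            *output++ = cv::saturate_cast<uchar>(y + 1.772 * cb);
            *output++ = cv::saturate_cast<uchar>(y - 0.344 * cb - 0.714 * cr);
            *output++ = cv::saturate_cast<uchar>(y + 1.402 * cr);
        }
    }
    return result;
}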
You can try treating the buffer as a YUV I420 array:
char filePath[3000];
int width, height;
cout << "file path = ";
cin >> filePath;
cout << "width = ";
cin >> width;
cout << "height = ";
cin >> height;
FILE *pFile = fopen(filePath, "rb");
unsigned char* buff = new unsigned char[width * height *3 / 2];
fread(buff, 1, width * height* 3 / 2, pFile);
fclose(pFile);
cv::Mat imageRGB;
cv::Mat picI420 = cv::Mat(height * 3 / 2, width, CV_8UC1, buff);
cv::cvtColor(picI420, imageRGB, CV_YUV2BGRA_I420);
imshow("imageRGB", imageRGB);
waitKey(0);

Unknown error when inverting image using cuda

I began implementing some simple image processing with CUDA, but there is an error in my code.
The error happens when I copy the pixels from the device to the host.
This is my attempt:
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <opencv2\core\core.hpp>
#include <opencv2\highgui\highgui.hpp>
#include <stdio.h>
using namespace cv;
unsigned char *h_pixels;
unsigned char *d_pixels;
int bufferSize;
int width,height;
const int BLOCK_SIZE = 32;
Mat image;
void get_pixels(const char* fileName)
{
image = imread(fileName);
bufferSize = image.size().width * image.size().height * 3 * sizeof(unsigned char);
width = image.size().width;
height = image.size().height;
h_pixels = new unsigned char[bufferSize];
memcpy(h_pixels,image.data,bufferSize);
}
__global__ void invert_image(unsigned char* pixels,int width,int height)
{
int row = blockIdx.y * BLOCK_SIZE + threadIdx.y;
int col = blockIdx.x * BLOCK_SIZE + threadIdx.x;
int cidx = (row * width + col) * 3;
pixels[cidx] = 255 - pixels[cidx];
pixels[cidx + 1] = 255 - pixels[cidx + 1];
pixels[cidx + 2] = 255 - pixels[cidx + 2];
}
int main()
{
get_pixels("D:\\photos\\z.jpg");
cudaError_t err = cudaMalloc((void**)&d_pixels,bufferSize);
err = cudaMemcpy(d_pixels,h_pixels,bufferSize,cudaMemcpyHostToDevice);
dim3 dimBlock(BLOCK_SIZE,BLOCK_SIZE);
dim3 dimGrid(width/dimBlock.x,height/dimBlock.y);
invert_image<<<dimBlock,dimGrid>>>(d_pixels,width,height);
unsigned char *pixels = new unsigned char[bufferSize];
err= cudaMemcpy(pixels,d_pixels,bufferSize,cudaMemcpyDeviceToHost);// unknown error
const char * errStr = cudaGetErrorString(err);
cudaFree(d_pixels);
image.data = pixels;
namedWindow("display image");
imshow("display image",image);
waitKey();
return 0;
}
Also, how can I find out which errors occur on the CUDA device?
Thanks for your help.
OpenCV images are not guaranteed to be continuous in memory; each row may be padded for alignment (for example to a 4-byte or 8-byte boundary). You should therefore also pass the step field of the Mat (the row size in bytes) to the CUDA kernel, so that cidx is calculated correctly. The generic formula for the index is:
cidx = row * (step/elementSize) + (NumberOfChannels * col);
In your case, it will be:
cidx = row * step + (3 * col);
Taking row alignment into account, your buffer size is equal to image.step * image.size().height.
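To make the step-based indexing concrete, here is a minimal host-side sketch in plain C++ (not a CUDA kernel) that inverts an 8-bit interleaved image whose rows are `step` bytes apart. The names `pixelIndex` and `invertImage` are hypothetical and used only for illustration; a kernel would apply the same index formula once per thread instead of looping.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Byte index of the first channel of pixel (row, col) in an 8-bit interleaved
// image whose rows start `step` bytes apart (step >= width * channels).
inline std::size_t pixelIndex(int row, int col, std::size_t step, int channels)
{
    return static_cast<std::size_t>(row) * step
         + static_cast<std::size_t>(col) * channels;
}

// Inverts every pixel in place; padding bytes at the end of each row are left untouched.
inline void invertImage(std::vector<unsigned char>& pixels, int width, int height,
                        std::size_t step, int channels)
{
    assert(step >= static_cast<std::size_t>(width) * channels);
    assert(pixels.size() >= step * static_cast<std::size_t>(height));
    for (int row = 0; row < height; ++row) {
        for (int col = 0; col < width; ++col) {
            const std::size_t idx = pixelIndex(row, col, step, channels);
            for (int c = 0; c < channels; ++c)
                pixels[idx + c] = static_cast<unsigned char>(255 - pixels[idx + c]);
        }
    }
}
```

With a 2x2 BGR image stored with step = 8, each row holds 6 pixel bytes followed by 2 padding bytes, so pixel (1, 1) starts at byte 1 * 8 + 1 * 3 = 11 rather than at (1 * 2 + 1) * 3 = 9.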
The next issue is the one pointed out by #phoad in the third point: you should create enough thread blocks to cover the whole image.
Here is a generic formula for the grid that creates enough blocks for any image size:
dim3 block(BLOCK_SIZE,BLOCK_SIZE);
dim3 grid((width + block.x - 1)/block.x,(height + block.y - 1)/block.y);
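The rounding-up division in the grid formula can be checked on the host with a small sketch. `blocksNeeded` is a hypothetical helper name, shown here in plain C++ so that it runs without a GPU.

```cpp
#include <cassert>

// Number of blocks of size `block` needed to cover `n` elements, rounded up.
inline int blocksNeeded(int n, int block)
{
    assert(n >= 0 && block > 0);
    return (n + block - 1) / block;
}
```

For a 1080-row image and 32-thread blocks, truncating division (1080 / 32) gives 33 blocks and leaves the last 24 rows unprocessed, while the rounded-up version gives 34. Because the last block then extends past the image, the kernel also needs a bounds check such as `if (row < height && col < width)` before touching memory.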
First of all, make sure that the image file is read correctly.
Check whether the device memory is allocated, using CUDA_SAFE_CALL(cudaMalloc(..)).
Check the dimensions of the image. If the dimensions of the image are not multiples of BLOCK_SIZE, then you might be missing some indices and the image will not be fully inverted.
Call cudaDeviceSynchronize after the kernel call and check its return value.
Do you get any error when you run the code without calling the kernel at all?
You are not freeing h_pixels and might have a memory leak.
Instead of using BLOCK_SIZE in the kernel, you could use blockDim.x, calculating indices like "blockIdx.x * blockDim.x + threadIdx.x".
Try not touching memory in the kernel code: comment out the memory updates in the kernel (the lines where you access the pixels array) and check whether the program still fails. If it no longer fails, you might be accessing memory out of bounds.
Use this statement immediately after the kernel invocation to print any kernel launch error:
printf("error code: %s\n",cudaGetErrorString(cudaGetLastError()));

Resources