I have written a simple OpenCL program whose goal is to copy an input image using the OpenCL image2d_t type. It seemed like a simple job, but I have been stuck on it.
The kernel calls read_imageui, which always returns zero. The input image is an all-white JPEG.
The image is loaded with OpenCV's imread.
Here is the kernel:
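// smp is a sampler_t declared at program scope (its declaration is not shown here)
__kernel void copy(__read_only image2d_t in, __write_only image2d_t out)
{
    int idx = get_global_id(0);
    int idy = get_global_id(1);
    int2 pos = (int2)(idx,idy);
    uint4 pix = read_imageui(in,smp,pos);
    write_imageui(out,pos,pix);
}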
Here is the host code:
int main(){
//get all platforms (drivers)
std::vector<cl::Platform> all_platforms;
cl::Platform::get(&all_platforms);
if(all_platforms.size()==0){
    std::cout<<" No platforms found. Check OpenCL installation!\n";
    exit(1);
}
cl::Platform default_platform=all_platforms[0];
std::cout << "Using platform: "<<default_platform.getInfo<CL_PLATFORM_NAME>()<<"\n";
std::cout <<" Platform Version: "<<default_platform.getInfo<CL_PLATFORM_VERSION>() <<"\n";
//cout << "Image 2D support : " << default_platform.getInfo<CL_DEVICE_IMAGE_SUPPORT>()<<"\n";
//get default device of the default platform
std::vector<cl::Device> all_devices;
default_platform.getDevices(CL_DEVICE_TYPE_ALL, &all_devices);
if(all_devices.size()==0){
    std::cout<<" No devices found. Check OpenCL installation!\n";
    exit(1);
}
cl::Device default_device=all_devices[0];
std::cout<< "Using device: "<<default_device.getInfo<CL_DEVICE_NAME>()<<"\n";
//creating a context
cl::Context context(default_device);
//cl::Program::Sources sources;
//load kernel code
cl::Program program(context,LoadKernel("image_test.cl"));
//build kernel code
if(program.build({default_device})!=CL_SUCCESS){
    std::cout<<" Error building: "<<program.getBuildInfo<CL_PROGRAM_BUILD_LOG>(default_device)<<"\n";
    exit(1);
}
// Determine and show image format support
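vector<cl::ImageFormat > supportedFormats;
// The flags and image type queried here are an assumption; the original call is not shown.
context.getSupportedImageFormats(CL_MEM_READ_ONLY, CL_MEM_OBJECT_IMAGE2D, &supportedFormats);
cout <<"No. of supported formats " <<supportedFormats.size()<<endl;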
Mat white = imread("white_small.jpg");
cvtColor(white, white, CV_BGR2RGBA);
Mat out = Mat(white);
char * inbuffer = reinterpret_cast<char *>(white.data);
char * outbuffer = reinterpret_cast<char *>(out.data);
//cout <<"Type of input : " <<white.type<<endl;
int sizeOfImage = white.cols * white.rows * white.channels();
int outImageSize = white.cols * white.rows * white.channels();
int w = white.cols;
int h = white.rows;
cout <<"Creating Images ... "<<endl;
cout <<"Dimensions ..." <<w << " x "<<h<<endl;
const cl::ImageFormat format(CL_RGBA, CL_UNSIGNED_INT8);
cl::Image2D imageSrc(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, format, white.cols, white.rows,0,inbuffer);
cl::Image2D imageDst(context, CL_MEM_WRITE_ONLY, format , white.cols, white.rows,0,NULL);
cout <<"Creating Kernel Program ... "<<endl;
cl::Kernel kernelCopy(program, "copy");
kernelCopy.setArg(0, imageSrc);
kernelCopy.setArg(1, imageDst);
cout <<"Creating Command Queue ... "<<endl;
cl::CommandQueue queue(context, default_device);
cout <<"Executing Kernel ... "<<endl;
int64 e = getTickCount();
for(int i = 0 ; i < 100 ; i ++)
    queue.enqueueNDRangeKernel(kernelCopy, cl::NullRange, cl::NDRange(w, h), cl::NullRange);
cout <<((getTickCount() - e) / getTickFrequency())/100 <<endl;
cl::size_t<3> origin;
cl::size_t<3> size;
origin[0] = 0;
origin[1] = 0;
origin[2] = 0;
size[0] = w;
size[1] = h;
size[2] = 1;
cout <<"Transfering Images ... "<<endl;
//unsigned char *tmp = new unsigned char (w * h * 4);
//CL_TRUE means that it waits for the entire image to be copied before continuing
queue.enqueueReadImage(imageDst, CL_TRUE, origin, size, 0, 0, outbuffer);
/* OLD CODE ==================================================*/
return 0;
However, if I change the kernel so that it writes a constant value instead:
uint4 pix2 = (uint4)(255,255,255,1);
it outputs a white image, which means something is wrong with how I am using read_imageui.
It turned out to be related to "reference counting" in the Mat copy constructor.
If, instead of using
Mat white = imread("white_small.jpg");
cvtColor(white, white, CV_BGR2RGBA);
Mat out = Mat(white);
you initialize the output matrix "out" as
Mat out = Mat(white.size(), CV_8UC4);
then it works fine.
I do not fully understand the mechanism, but I know the cause is the "reference counting" performed by the Mat copy constructor in the first form.
When you write:
Mat out = Mat(white);
it makes a shallow copy of white. Both white.data and out.data point to the same memory, and the reference count is incremented. So when you call out.setTo, the white Mat sees the same change. Declaring out as below allocates a separate buffer instead:
Mat out = Mat(white.size(), CV_8UC(white.channels()));
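To make the shallow-copy behaviour concrete, here is a minimal sketch that does not use OpenCV at all. The `Image` struct is a hypothetical stand-in for cv::Mat, with a `std::shared_ptr` playing the role of the reference-counted pixel buffer; `clone()` here is modelled on the idea of cv::Mat::clone(), not copied from it.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-in for cv::Mat: a small header that shares its pixel buffer.
struct Image {
    int rows;
    int cols;
    std::shared_ptr<std::vector<unsigned char>> data;  // reference-counted storage

    Image(int r, int c, unsigned char fill)
        : rows(r), cols(c),
          data(std::make_shared<std::vector<unsigned char>>(
              static_cast<std::size_t>(r) * static_cast<std::size_t>(c), fill)) {
        assert(r > 0 && c > 0);
    }

    // The implicit copy constructor copies only the shared_ptr: a shallow copy.

    // Deep copy: allocates a fresh buffer with the same contents.
    Image clone() const {
        Image copy(*this);                                               // still shared
        copy.data = std::make_shared<std::vector<unsigned char>>(*data); // now separate
        return copy;
    }
};
```

With `Image white(2, 2, 255); Image shallow(white);`, writing through `shallow.data` also changes what `white` sees, while a `white.clone()` keeps its own pixels. That is the same reason an output Mat built with the copy constructor cannot serve as an independent destination buffer.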
I want to use clEnqueueReadBufferRect in OpenCL. To do so, I need to define the region that is passed as one of its arguments, but the OpenCL references are inconsistent about it.
The online reference says:
The (width, height, depth) in bytes of the 2D or 3D rectangle being read or written. For a 2D rectangle copy, the depth value given by region [2] should be 1.
but the reference book, on page 77, says:
region defines the (width in bytes, height in rows, depth in slices) of the 2D or 3D rectangle being read or written. For a 2D rectangle copy, the depth value given by region [2] should be 1. The values in region cannot be 0
Unfortunately, neither description worked for me: I had to provide region as (width in columns, height in rows, depth in slices). When I defined the width in bytes rather than in columns, I got the error CL_INVALID_VALUE. Which one is correct?
#define WGX 16
#define WGY 16
#include "misc.hpp"
int main(int argc, char** argv)
int i;
int n = 1000;
int filterWidth = 3;
int filterRadius = (int) filterWidth/2;
int padding = filterRadius * 2;
double h = 1.0 / n;
int width_x[2];
int height_x[2];
int deviceWidth[2];
int deviceHeight[2];
int deviceDataSize[2];
for (i = 0; i < 2; ++i)
set_domain_length(n, n, height_x[i], width_x[i], i);
float* x = new float [height_x[0] * width_x[0]];
init_unknown(x, height_x[0], width_x[0], 0);
set_bndryCond(x, width_x[0], h);
std::vector<cl::Platform> platforms;
cl::Platform::get(&platforms);
assert(platforms.size() > 0);
cl::Platform myPlatform = platforms[0];
std::vector<cl::Device> devices;
myPlatform.getDevices(CL_DEVICE_TYPE_GPU, &devices);
assert(devices.size() > 0);
cl::Device myDevice = devices[0];
cl_display_info(myPlatform, myDevice);
cl::Context context(myDevice);
std::ifstream kernelFile("iterative_scheme.cl");
std::string src(std::istreambuf_iterator<char>(kernelFile), (std::istreambuf_iterator<char>()));
cl::Program::Sources sources(1,std::make_pair(src.c_str(),src.length() + 1));
cl::Program program(context, sources);
cl::CommandQueue queue(context, myDevice);
deviceWidth[0] = roundUp(width_x[0], WGX);
deviceHeight[0] = height_x[0];
deviceDataSize[0] = deviceWidth[0] * deviceHeight[0] * sizeof(float);
cl::Buffer buffer_x;
try {
    buffer_x = cl::Buffer(context, CL_MEM_READ_WRITE, deviceDataSize[0]);
} catch (cl::Error& error) {
    std::cout << " ---> Problem in creating buffer(s) " << std::endl;
    std::cout << " ---> " << getErrorString(error) << std::endl;
}
cl::size_t<3> buffer_origin;
buffer_origin[0] = 0;
buffer_origin[1] = 0;
buffer_origin[2] = 0;
cl::size_t<3> host_origin;
host_origin[0] = 0;
host_origin[1] = 0;
host_origin[2] = 0;
cl::size_t<3> region;
region[0] = (size_t)(deviceWidth[0] * sizeof(float));
region[1] = (size_t)(height_x[0]);
region[2] = 1;
std::cout << "===> Start writing data to device" << std::endl;
try {
    queue.enqueueWriteBufferRect(buffer_x, CL_TRUE, buffer_origin, host_origin, region,
        deviceWidth[0] * sizeof(float), 0, width_x[0] * sizeof(float), 0, x);
} catch (cl::Error& error) {
    std::cout << " ---> Problem in writing data from Host to Device: " << std::endl;
    std::cout << " ---> " << getErrorString(error) << std::endl;
}
// Build the program
std::cout << "===> Start building program" << std::endl;
std::cout << " ---> Build Successfully " << std::endl;
} catch(cl::Error& error)
std::cout << " ---> Problem in building program " << std::endl;
std::cout << " ---> " << getErrorString(error) << std::endl;
std::cout << " ---> " << program.getBuildInfo<CL_PROGRAM_BUILD_LOG>(myDevice) << std::endl;
std::cout << "===> Start reading data from device" << std::endl;
// read result y and residual from the device
buffer_origin[0] = (size_t)(filterRadius * sizeof(float));
buffer_origin[1] = (size_t)filterRadius;
buffer_origin[2] = 0;
host_origin[0] = (size_t)(filterRadius * sizeof(float));
host_origin[1] = (size_t)filterRadius;
host_origin[2] = 0;
// region of x
region[0] = (size_t)((width_x[0] - padding) * sizeof(float));
region[1] = (size_t)(height_x[0] - padding);
region[2] = 1;
try {
    queue.enqueueReadBufferRect(buffer_x, CL_TRUE, buffer_origin, host_origin,
        region, deviceWidth[0] * sizeof(float), 0, deviceWidth[0] * sizeof(float), 0, x);
} catch (cl::Error& error) {
    std::cout << " ---> Problem reading buffer in device: " << std::endl;
    std::cout << " ---> " << getErrorString(error) << std::endl;
}
delete[] (x);
return 0;
The online reference link you provided says:
The (width in bytes, height in rows, depth in slices) of the 2D or 3D rectangle being read or written. For a 2D rectangle copy, the depth value given by region[2] should be 1. The values in region cannot be 0.
This is consistent with what you later quoted from the "reference book". That is because your first link points to OpenCL 2.0 while the second points to 1.2.
The inconsistency you mention exists between the online manual for 1.2 and the PDF for 1.2, but the online manual for 2.0 is consistent with the PDF. So I assume it was a bug in the 1.2 online manual that was fixed in 2.0.
otherwise, when I defined them as byte not rows/columns
What is a "column", and how is it different from bytes?
The "elements" of a buffer rect copy are always bytes. If you read or write a 1D rect from a buffer, the call simply transfers region[0] bytes. The API has "rows" and "slices" because 2D and 3D regions can have padding between rows and between slices, whereas a 1D region cannot have padding between its elements.
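The byte/row distinction can be illustrated on the host with plain memory, without any OpenCL calls. The sketch below is only an analogy for what a 2D rect copy does: `copyRect2D` and its parameter names are hypothetical, the width and the x origins are given in bytes, and the height and the y origins are given in rows.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Copies a 2D rectangle between two byte buffers that may have different row pitches.
// widthBytes and the x origins are in bytes; rows and the y origins are row counts.
void copyRect2D(const unsigned char* src, std::size_t srcPitch,
                std::size_t srcX, std::size_t srcY,
                unsigned char* dst, std::size_t dstPitch,
                std::size_t dstX, std::size_t dstY,
                std::size_t widthBytes, std::size_t rows) {
    // A row of the rectangle must fit inside one row of each buffer.
    assert(srcX + widthBytes <= srcPitch);
    assert(dstX + widthBytes <= dstPitch);
    for (std::size_t r = 0; r < rows; ++r) {
        const unsigned char* s = src + (srcY + r) * srcPitch + srcX;
        unsigned char* d = dst + (dstY + r) * dstPitch + dstX;
        std::memcpy(d, s, widthBytes);  // only widthBytes of each row move; the rest is padding
    }
}
```

For a source of three rows with a pitch of 4 bytes holding the values 0 to 11, copying a 2-byte by 2-row rectangle from origin (1, 1) into a tightly packed destination yields 5, 6, 9, 10. The pitch tells the copy how far apart rows start; the region width tells it how many bytes of each row to move.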
I found the cause of the problem. According to the online reference, the call returns
CL_INVALID_VALUE if host_row_pitch is not 0 and is less than region[0].
so the enqueueWriteBufferRect call should change as follows:
queue.enqueueWriteBufferRect(buffer_x, CL_TRUE, buffer_origin, host_origin, region,
deviceWidth[0] * sizeof(float), 0, deviceWidth[0] * sizeof(float), 0, x);
That is, host_row_pitch = deviceWidth[0] * sizeof(float) instead of host_row_pitch = width_x[0] * sizeof(float).
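The rule quoted above is easy to check before enqueueing anything. The following is a rough sketch, not part of the OpenCL API: `rectArgsValid` is a hypothetical helper that covers only the two conditions mentioned in this thread (no zero entries in region, and a non-zero row pitch must be at least region[0]). The numbers in the usage note are assumed values, since the real width_x[0] comes from set_domain_length.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical pre-check mirroring two documented conditions for the *BufferRect calls.
// region = {width in bytes, height in rows, depth in slices}.
bool rectArgsValid(const std::size_t region[3], std::size_t rowPitch) {
    assert(region != nullptr);
    // The values in region cannot be 0.
    if (region[0] == 0 || region[1] == 0 || region[2] == 0) {
        return false;
    }
    // A row pitch of 0 means "tightly packed"; otherwise it must cover a full row.
    if (rowPitch != 0 && rowPitch < region[0]) {
        return false;
    }
    return true;
}
```

Assuming a host row of 1002 floats and a device row rounded up to 1008 floats, region[0] is 1008 * sizeof(float) bytes, so a host pitch of 1002 * sizeof(float) fails the check while 1008 * sizeof(float) passes. Whichever pitch is passed, each host row must really hold region[0] bytes; if the host rows are only 1002 floats wide, shrinking region[0] to the host row width is the other way to satisfy the same condition.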
I took the code from https://gist.github.com/kyrs/9adf86366e9e4f04addb (which takes an OpenCV cv::Mat image as input and converts it to a tensor) and I use it to label images with the model inception_v3_2016_08_28_frozen.pb from the TensorFlow tutorial (https://www.tensorflow.org/tutorials/image_recognition#usage_with_the_c_api). Everything works fine with a batch size of 1. However, when I increase the batch size to 2 (or greater), the size of finalOutput (which is of type std::vector) is zero.
Here's the code to reproduce the error:
// Only for VisualStudio
#define NOMINMAX
#include <string>
#include <iostream>
#include <fstream>
#include <opencv2/opencv.hpp>
#include <opencv2/imgproc/imgproc.hpp>
#include "tensorflow/core/public/session.h"
#include "tensorflow/core/platform/env.h"
#include "tensorflow/core/framework/tensor.h"
int batchSize = 2;
int height = 299;
int width = 299;
int depth = 3;
int mean = 0;
int stdev = 255;
// Set image paths
cv::String pathFilenameImg1 = "D:/IMGS/grace_hopper.jpg";
cv::String pathFilenameImg2 = "D:/IMGS/lenna.jpg";
// Set model paths
std::string graphFile = "D:/Tensorflow/models/inception_v3_2016_08_28_frozen.pb";
std::string labelfile = "D:/Tensorflow/models/imagenet_slim_labels.txt";
std::string InputName = "input";
std::string OutputName = "InceptionV3/Predictions/Reshape_1";
void read_prepare_image(cv::String pathImg, cv::Mat &imgPrepared) {
    // Read Color image:
    cv::Mat imgBGR = cv::imread(pathImg);
    // Now we resize the image to fit Model's expected sizes
    // (cv::Size takes width first, then height):
    cv::Size s(width, height);
    cv::Mat imgResized;
    cv::resize(imgBGR, imgResized, s, 0, 0, cv::INTER_CUBIC);
    // Convert the image to float and normalize data
    // (convertTo changes only the depth; the 3 channels are kept):
    imgResized.convertTo(imgPrepared, CV_32F);
    imgPrepared = imgPrepared - mean;
    imgPrepared = imgPrepared / stdev;
}
int main()
// Read and prepare images using OpenCV:
cv::Mat img1, img2;
read_prepare_image(pathFilenameImg1, img1);
read_prepare_image(pathFilenameImg2, img2);
// creating a Tensor for storing the data
tensorflow::Tensor input_tensor(tensorflow::DT_FLOAT, tensorflow::TensorShape({ batchSize, height, width, depth }));
auto input_tensor_mapped = input_tensor.tensor<float, 4>();
// Copy images data into the tensor:
for (int b = 0; b < batchSize; ++b) {
    const float * source_data;
    if (b == 0)
        source_data = (float*)img1.data;
    else
        source_data = (float*)img2.data;
    for (int y = 0; y < height; ++y) {
        const float* source_row = source_data + (y * width * depth);
        for (int x = 0; x < width; ++x) {
            const float* source_pixel = source_row + (x * depth);
            // OpenCV stores pixels as BGR; the tensor is filled as RGB.
            const float* source_B = source_pixel + 0;
            const float* source_G = source_pixel + 1;
            const float* source_R = source_pixel + 2;
            input_tensor_mapped(b, y, x, 0) = *source_R;
            input_tensor_mapped(b, y, x, 1) = *source_G;
            input_tensor_mapped(b, y, x, 2) = *source_B;
        }
    }
}
// Load the graph:
tensorflow::GraphDef graph_def;
ReadBinaryProto(tensorflow::Env::Default(), graphFile, &graph_def);
// create a session with the graph
std::unique_ptr<tensorflow::Session> session_inception(tensorflow::NewSession(tensorflow::SessionOptions()));
session_inception->Create(graph_def);
// run the loaded graph
std::vector<tensorflow::Tensor> finalOutput;
session_inception->Run({ { InputName,input_tensor } }, { OutputName }, {}, &finalOutput);
// Get Top 5 classes:
std::cerr << "final output size = " << finalOutput.size() << std::endl;
tensorflow::Tensor output = std::move(finalOutput.at(0));
auto scores = output.flat<float>();
std::cerr << "scores size=" << scores.size() << std::endl;
std::ifstream label(labelfile);
std::string line;
std::vector<std::pair<float, std::string>> sorted;
for (unsigned int i = 0; i <= 1000; ++i) {
    std::getline(label, line);
    sorted.emplace_back(scores(i), line);
}
std::sort(sorted.begin(), sorted.end());
std::reverse(sorted.begin(), sorted.end());
std::cout << "size of the sorted file is " << sorted.size() << std::endl;
for (unsigned int i = 0; i< 5; ++i)
std::cout << "The output of the current graph has category " << sorted[i].second << " with probability " << sorted[i].first << std::endl;
Am I missing anything? Any ideas?
Thanks in advance!
I had the same problem. When I switched to the model used in https://github.com/tensorflow/tensorflow/tree/master/tensorflow/tools/benchmark (a different version of Inception), bigger batch sizes worked correctly.
Note that you need to change the input size from 299,299,3 to 224,224,3 and the input and output layer names to input:0 and output:0.
Probably the graph in the protobuf file has a fixed batch size of 1, and I was only changing the shape of the input, not the graph. To accept a variable batch size, the graph's input shape has to be (None, height, width, channels). This is set when the graph is frozen. Since the graph we have is already frozen, there is no way to change the batch size at this point.
I am new to OpenCV, so please bear with me. I am trying to print the histogram Mat object for a given image, and it fails with the error below. Any help is appreciated.
The first cout in the program below, which prints the loaded image, succeeds, while the second cout, which prints the histogram of the image, fails with this error:
OpenCV Error: Assertion failed (m.dims <= 2) in FormattedImpl, file /mycode/ws/opencv/opencv-3.0.0-beta/modules/core/src/out.cpp, line 86
libc++abi.dylib: terminating with uncaught exception of type cv::Exception: /mycode/ws/opencv/opencv-3.0.0-beta/modules/core/src/out.cpp:86: error: (-215) m.dims <= 2 in function FormattedImpl
Here is the complete code
#include <stdio.h>
#include <string>
#include <opencv2/opencv.hpp>
using namespace std;
using namespace cv;
int main(int argc, char** argv) {
if (argc != 2) {
    printf("usage: opencv.out <Image_Path>\n");
    return -1;
}
string imagePath = (argv[1]);
cout << "loading image..." << imagePath << endl;
Mat image = imread(imagePath, 1);
Mat hist;
int imgCount = 1;
int dims = 3;
const int histSizes[] = {4, 4, 4};
const int channels[] = {0, 1, 2};
float rRange[] = {0, 256};
float gRange[] = {0, 256};
float bRange[] = {0, 256};
const float *ranges[] = {rRange, gRange, bRange};
Mat mask = Mat();
calcHist(&image, imgCount, channels, mask, hist, dims, histSizes, ranges);
cout << image << "Loaded image..." << endl;
cout << "Hist of image..." << hist;
return 0;
Based on the OpenCV 2.4.9 source code:
static inline std::ostream& operator << (std::ostream& out, const Mat& mtx)
{
    Formatter::get()->write(out, mtx);
    return out;
}
is the function you call when you use the << operator. Formatter::get() returns the appropriate
formatter class for the requested output style.
The write() function basically calls:
static void writeMat(std::ostream& out, const Mat& m, char rowsep, char elembrace, bool singleLine)
{
    CV_Assert(m.dims <= 2);
    int type = m.type();
    char crowbrace = getCloseBrace(rowsep);
    char orowbrace = crowbrace ? rowsep : '\0';
    if( orowbrace || isspace(rowsep) )
        rowsep = '\0';
    for( int i = 0; i < m.rows; i++ )
    {
        out << orowbrace;
        if( m.data )
            writeElems(out, m.ptr(i), m.cols, type, elembrace);
        out << crowbrace << (i+1 < m.rows ? ", " : "");
        if(i+1 < m.rows)
        {
            out << rowsep << (singleLine ? " " : "");
            out << "\n ";
        }
    }
}
As you can see, if your Mat has more than two dimensions, the assertion CV_Assert(m.dims <= 2) fails, as it does in your code.
calcHist() with the parameters you gave produces a 3-dimensional Mat, so it cannot be displayed using the << operator.
Calling calcHist() that way gives you a 3-dimensional histogram, and I don't see a simple way to visualize that in OpenCV (which doesn't mean it can't be done). If it's something you must do, I would suggest looking into OpenGL for 3D data visualization. If not, you could simply call the function for each channel separately: you will get three one-dimensional histograms, which you can print using the << operator.
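If all you need is a textual dump, another option is to walk the three bin indices yourself and print one 2D slice at a time. The sketch below shows the idea without OpenCV: it builds a 4x4x4 histogram from interleaved 8-bit pixels into a flat array and formats it slice by slice. The names `colorHistogram` and `formatHistogram` are hypothetical; with a real calcHist() result the same triple loop would read the bins with hist.at<float>(i, j, k).

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

constexpr int kBins = 4;  // bins per channel, matching histSizes = {4, 4, 4}
using Hist3 = std::array<int, kBins * kBins * kBins>;

// Builds a 4x4x4 colour histogram from interleaved 8-bit, 3-channel pixels.
Hist3 colorHistogram(const std::vector<std::uint8_t>& pixels) {
    assert(pixels.size() % 3 == 0);
    Hist3 hist{};  // zero-initialised
    for (std::size_t p = 0; p < pixels.size(); p += 3) {
        const int i = pixels[p] * kBins / 256;      // bin index for channel 0
        const int j = pixels[p + 1] * kBins / 256;  // bin index for channel 1
        const int k = pixels[p + 2] * kBins / 256;  // bin index for channel 2
        ++hist[(i * kBins + j) * kBins + k];
    }
    return hist;
}

// Formats the histogram as kBins two-dimensional slices, one per channel-0 bin.
std::string formatHistogram(const Hist3& hist) {
    std::ostringstream out;
    for (int i = 0; i < kBins; ++i) {
        out << "slice " << i << ":\n";
        for (int j = 0; j < kBins; ++j) {
            for (int k = 0; k < kBins; ++k) {
                out << hist[(i * kBins + j) * kBins + k] << (k + 1 < kBins ? " " : "\n");
            }
        }
    }
    return out.str();
}
```

Each slice is an ordinary 2D table, which is the shape the << formatter expects, so the restriction to two dimensions is sidestepped rather than removed.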
Ok, so I've decided that using a histogram of oriented gradients is a better method for image fingerprinting vs. creating a histogram of sobel derivatives. I think I finally have it mostly figured out but when I test my code I get the following:
OpenCV Error: Assertion failed ((winSize.width - blockSize.width) % blockStride.width == 0 && (winSize.height - blockSize.height) % blockStride.height == 0).
As of now I'm just trying to figure out how to compute the HOG correctly and see the results; but not visually, I just want some very basic output to see if the HOG was created. Then I'll figure out how to use it in image comparison.
Here is my sample code:
using namespace cv;
using namespace std;
int main(int argc, const char * argv[])
// Initialize string variables.
string thePath, img, hogSaveFile;
thePath = "/Users/Mikie/Documents/Xcode/images/";
img = thePath + "HDimage.jpg";
hogSaveFile = thePath + "HDimage.yml";
// Create mats.
Mat src;
// Load image as grayscale.
src = imread(img, CV_LOAD_IMAGE_GRAYSCALE);
// Verify source loaded.
if (src.empty()) {
    cout << "No image data. \n ";
    return -1;
} else {
    cout << "Image loaded. \n" << "Size: " << src.cols << " X " << src.rows << "." << "\n";
}
// Initialize float variables.
float imgWidth, imgHeight, newWidth, newHeight;
imgWidth = src.cols;
imgHeight = src.rows;
newWidth = 320;
newHeight = (imgHeight/imgWidth)*newWidth;
Mat dst = Mat::zeros(newHeight, newWidth, CV_8UC3);
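// The interpolation flag is the sixth argument; the fourth and fifth are the fx and fy scale factors.
resize(src, dst, Size(newWidth, newHeight), 0, 0, INTER_LINEAR);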
// Was resize successful?
if (dst.rows < src.rows && dst.cols < src.cols) {
    cout << "Resize successful. \n" << "New size: " << dst.cols << " X " << dst.rows << "." << "\n";
} else {
    cout << "Resize failed. \n";
    return -1;
}
vector<float>theHOG(Mat dst);{
if (dst.empty()) {
cout << "Image lost. \n";
} else {
cout << "Setting up HOG. \n";
imshow("Image", dst);
bool gammaC = true;
int nlevels = HOGDescriptor::DEFAULT_NLEVELS;
Size winS(newWidth, newHeight);
// int block_size = 16;
// int block_stride= 8;
// int cell_size = 8;
int gbins = 9;
vector<float> descriptorsValues;
vector<Point> locations;
HOGDescriptor hog(Size(320, 412), Size(16, 16), Size(8, 8), Size(8, 8), gbins, -1, HOGDescriptor::L2Hys, 0.2, gammaC, nlevels);
hog.compute(dst, descriptorsValues, Size(0,0), Size(0,0), locations);
printf("descriptorsValues.size() = %ld \n", descriptorsValues.size()); //prints 960
for (int i = 0; i <descriptorsValues.size(); i++) {
cout << descriptorsValues[i] << endl;
return 0;
As you can see, I experimented with different variables to define the sizes, but to no avail, so I commented them out and tried setting them manually. Still nothing. What am I doing wrong? Any help will be greatly appreciated.
Thank you!
You are initializing the HOGDescriptor incorrectly.
The assertion states that the first three constructor parameters (winSize, blockSize and blockStride) must satisfy the constraint:
(winSize - blockSize) % blockStride == 0
in both the height and width dimensions.
The problem is that winSize.height does not satisfy this constraint, given the other parameters you initialize hog with:
(412 - 16) % 8 = 4 //Problem!!
Probably the simplest fix is to change your window height so that the constraint holds, for example from cv::Size(320,412) to cv::Size(320,416), but the right size will depend on your requirements. Just pay attention to what the assertion is saying!
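The constraint is simple enough to check, or to repair, before constructing the descriptor. This is a small sketch with hypothetical helper names (`hogExtentValid`, `nextValidHogExtent`); it only does the arithmetic from the assertion for one dimension and does not touch OpenCV.

```cpp
#include <cassert>

// True when a window extent satisfies (win - block) % stride == 0.
bool hogExtentValid(int win, int block, int stride) {
    assert(block > 0 && stride > 0);
    return win >= block && (win - block) % stride == 0;
}

// Smallest extent that is >= win and satisfies the constraint.
int nextValidHogExtent(int win, int block, int stride) {
    assert(block > 0 && stride > 0);
    if (win <= block) {
        return block;  // a window of exactly one block is always valid
    }
    const int rem = (win - block) % stride;
    return rem == 0 ? win : win + (stride - rem);
}
```

With a block size of 16 and a stride of 8, a width of 320 passes, a height of 412 fails, and `nextValidHogExtent(412, 16, 8)` gives 416. The image passed to compute() would then need to be resized or padded to the same size, which is a choice that depends on the application.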
I have a big problem (on Linux):
I create a buffer with defined data, then an OpenCL kernel takes this data and writes it into an image2d_t. On an AMD C50 (Fusion CPU/GPU) the program works as desired, but on my GeForce 9500 GT the kernel only rarely computes the correct result. Sometimes the outcome depends on very strange changes, such as removing unused variable declarations or adding a newline. I noticed that disabling optimization increases the probability of failure. Both systems have the latest display driver.
Here is my reduced code:
#include <CL/cl.h>
#include <string>
#include <iostream>
#include <sstream>
#include <cmath>
void checkOpenCLErr(cl_int err, std::string name){
const char* errorString[] = {
if (err != CL_SUCCESS) {
std::stringstream str;
str << errorString[-err] << " (" << err << ")";
throw std::string(name)+(str.str());
int main(){
cl_context m_context;
cl_platform_id* m_platforms;
unsigned int m_numPlatforms;
cl_command_queue m_queue;
cl_device_id m_device;
cl_int error = 0; // Used to handle error codes
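// Query the number of platforms first so that m_numPlatforms is initialized.
error = clGetPlatformIDs(0, NULL, &m_numPlatforms);
m_platforms = new cl_platform_id[m_numPlatforms];
error = clGetPlatformIDs(m_numPlatforms,m_platforms,&m_numPlatforms);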
checkOpenCLErr(error, "getPlatformIDs");
// Device
error = clGetDeviceIDs(m_platforms[0], CL_DEVICE_TYPE_GPU, 1, &m_device, NULL);
checkOpenCLErr(error, "getDeviceIDs");
// Context
cl_context_properties properties[] =
{ CL_CONTEXT_PLATFORM, (cl_context_properties)(m_platforms[0]), 0};
m_context = clCreateContextFromType(properties, CL_DEVICE_TYPE_GPU, NULL, NULL, NULL);
// m_private->m_context = clCreateContext(properties, 1, &m_private->m_device, NULL, NULL, &error);
checkOpenCLErr(error, "Create context");
// Command-queue
m_queue = clCreateCommandQueue(m_context, m_device, 0, &error);
checkOpenCLErr(error, "Create command queue");
//Build program and kernel
const char* source = "#pragma OPENCL EXTENSION cl_khr_byte_addressable_store : enable\n"
"__kernel void bufToImage(__global unsigned char* in, __write_only image2d_t out, const unsigned int offset_x, const unsigned int image_width , const unsigned int maxval ){\n"
"\tint i = get_global_id(0);\n"
"\tint j = get_global_id(1);\n"
"\tint width = get_global_size(0);\n"
"\tint height = get_global_size(1);\n"
"\tint pos = j*image_width*3+(offset_x+i)*3;\n"
"\tif( maxval < 256 ){\n"
"\t\tfloat4 c = (float4)(in[pos],in[pos+1],in[pos+2],1.0f);\n"
"\t\tc.x /= maxval;\n"
"\t\tc.y /= maxval;\n"
"\t\tc.z /= maxval;\n"
"\t\twrite_imagef(out, (int2)(i,j), c);\n"
"\t\tfloat4 c = (float4)(255.0f*in[2*pos]+in[2*pos+1],255.0f*in[2*pos+2]+in[2*pos+3],255.0f*in[2*pos+4]+in[2*pos+5],1.0f);\n"
"\t\tc.x /= maxval;\n"
"\t\tc.y /= maxval;\n"
"\t\tc.z /= maxval;\n"
"\t\twrite_imagef(out, (int2)(i,j), c);\n"
"__kernel void imageToBuf(__read_only image2d_t in, __global unsigned char* out, const unsigned int offset_x, const unsigned int image_width ){\n"
"\tint i = get_global_id(0);\n"
"\tint j = get_global_id(1);\n"
"\tint pos = j*image_width*3+(offset_x+i)*3;\n"
"\tfloat4 c = read_imagef(in, imageSampler, (int2)(i,j));\n"
"\tif( c.x <= 1.0f && c.y <= 1.0f && c.z <= 1.0f ){\n"
"\t\tout[pos] = c.x*255.0f;\n"
"\t\tout[pos+1] = c.y*255.0f;\n"
"\t\tout[pos+2] = c.z*255.0f;\n"
"\t\tout[pos] = 200.0f;\n"
"\t\tout[pos+1] = 0.0f;\n"
"\t\tout[pos+2] = 255.0f;\n"
cl_int err;
cl_program prog = clCreateProgramWithSource(m_context,1,&source,NULL,&err);
if( -err != CL_SUCCESS ) throw std::string("clCreateProgramWithSources");
err = clBuildProgram(prog,0,NULL,"-cl-opt-disable",NULL,NULL);
if( -err != CL_SUCCESS ) throw std::string("clBuildProgram(fromSources)");
cl_kernel kernel = clCreateKernel(prog,"bufToImage",&err);
cl_uint imageWidth = 80;
cl_uint imageHeight = 90;
//Initialize datas
cl_uint maxVal = 255;
cl_uint offsetX = 0;
int size = imageWidth*imageHeight*3;
int resSize = imageWidth*imageHeight*4;
cl_uchar* data = new cl_uchar[size];
cl_float* expectedData = new cl_float[resSize];
for( int i = 0,j=0; i < size; i++,j++ ){
data[i] = (cl_uchar)i;
expectedData[j] = (cl_float)((unsigned char)i)/255.0f;
if ( i%3 == 2 ){
expectedData[j] = 1.0f;
cl_mem inBuffer = clCreateBuffer(m_context,CL_MEM_READ_ONLY|CL_MEM_COPY_HOST_PTR,size*sizeof(cl_uchar),data,&err);
checkOpenCLErr(err, "clCreateBuffer()");
cl_image_format imgFormat;
imgFormat.image_channel_order = CL_RGBA;
imgFormat.image_channel_data_type = CL_FLOAT;
cl_mem outImg = clCreateImage2D( m_context, CL_MEM_READ_WRITE, &imgFormat, imageWidth, imageHeight, 0, NULL, &err );
size_t kernelRegion[]={imageWidth,imageHeight};
size_t kernelWorkgroup[]={1,1};
//Fill kernel with data
//Run kernel
err = clEnqueueNDRangeKernel(m_queue,kernel,2,NULL,kernelRegion,kernelWorkgroup,0,NULL,NULL);
//Check resulting data for validity
cl_float* computedData = new cl_float[resSize];
size_t region[]={imageWidth,imageHeight,1};
const size_t offset[] = {0,0,0};
err = clEnqueueReadImage(m_queue,outImg,CL_TRUE,offset,region,0,0,computedData,0,NULL,NULL);
checkOpenCLErr(err, "readDataFromImage()");
for( int i = 0; i < resSize; i++ ){
if( fabs(expectedData[i]-computedData[i])>0.1 ){
std::cout << "Expected: \n";
for( int j = 0; j < resSize; j++ ){
std::cout << expectedData[j] << " ";
std::cout << "\nComputed: \n";
std::cout << "\n";
for( int j = 0; j < resSize; j++ ){
std::cout << computedData[j] << " ";
std::cout << "\n";
throw std::string("Error, computed and expected data are not the same!\n");
}catch(std::string& e){
std::cout << "\nCaught an exception: " << e << "\n";
return 1;
std::cout << "Works fine\n";
return 0;
I also uploaded the source code for you to make it easier to test it:
Can you tell me whether I have done anything wrong?
Is there a mistake in the code, or is this a bug in my driver?
Best regards,
Edit: I changed the program (both here and the linked one) a little to make a mismatch more likely.
I found the bug, and it is an annoying one:
When working under Linux and merely linking the OpenCL program against the latest "OpenCV" library (yes, the computer vision library), the kernel binaries that are compiled and cached in ~/.nv are damaged.
Can you please install the current OpenCV library and execute the following commands:
Generating a bad kernel, which may sometimes lead to bad behaviour:
rm -R ~/.nv && g++ strangeOpenCLError.cpp -lOpenCL -lopencv_gpu -o strangeOpenCLError && ./strangeOpenCLError && ls -la ~/.nv/ComputeCache/*/*
Generating a good kernel, which performs as desired:
rm -R ~/.nv && g++ strangeOpenCLError.cpp -lOpenCL -o strangeOpenCLError && ./strangeOpenCLError && ls -la ~/.nv/ComputeCache/*/*
On my system, when linking with -lopencv_gpu or -lopencv_core, I get a kernel object in ~/.nv with a slightly different size because its binary differs slightly. These smaller kernels computed bad results on my systems.
The problem is that the bug does not always appear; sometimes it shows up only when working on buffers that are big enough. So the more reliable indicator is the differing kernel-cache size. I edited the program in my question so that it is now more likely to produce the bad result.
Best regards,
PS: I also filed a bug report with NVIDIA, and it is in progress. They could reproduce the bug on their system.
To turn off the NVIDIA compiler cache, set the environment variable CUDA_CACHE_DISABLE=1. That may help to avoid the problem in future.
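If you would rather not rely on the shell environment, the variable can also be set from inside the program. This is a minimal sketch under two assumptions: the platform provides POSIX setenv (Linux does), and the driver reads the variable when it initializes, so the call has to happen before the first OpenCL function is used. `disableNvidiaComputeCache` is a hypothetical helper name.

```cpp
#include <stdlib.h>  // setenv (POSIX), getenv
#include <string>

// Sets CUDA_CACHE_DISABLE=1 for the current process.
// Returns true when the variable holds "1" afterwards.
bool disableNvidiaComputeCache() {
    const int rc = setenv("CUDA_CACHE_DISABLE", "1", /*overwrite=*/1);
    const char* value = getenv("CUDA_CACHE_DISABLE");
    return rc == 0 && value != nullptr && std::string(value) == "1";
}
```

Calling it as the first statement of main() keeps the setting next to the code that depends on it. Exporting the variable in the shell before launching the binary remains the simpler route when that is possible.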
In line
m_context = clCreateContextFromType(properties, CL_DEVICE_TYPE_GPU, NULL, NULL, NULL);
you should pass &error as the last parameter to get a meaningful error. Without it I got some nonsensical error messages. (I needed to change the platform to get my GPU board.)
I cannot reproduce the error with my NVIDIA GeForce 8600 GTS. I get 'Works fine'. I tried it more than 20 times without any issue.
I also cannot see any error, apart from the code being a little confusing. You should remove all commented-out code and add some blank lines to group the code.
Do you have the latest drivers? The behavior you describe sounds very much like an uninitialized buffer or variable, but I do not see anything like that.