Unhandled exception when using opencv HOG descriptor compute function with OpenCL - opencv

I've recently been looking into using OpenCL to reduce a bottleneck in our application, which is a call to cv::HOGDescriptor::compute(). Using the OpenCL version seems to be a simple case of using UMat instead of Mat and I've already had some success with the following:
// Replaced the following...
// _descriptor_calculator.compute(image, descriptor_flat);
// with...
cv::UMat input = image.getUMat(cv::ACCESS_RW);
_descriptor_calculator.compute(input, descriptor_flat);
This works for images with resolutions up to and including 2048x1536. However, for an image of 2592x1936 I get an unhandled exception, and my debugger breaks in gshandlereh.c.
GSUnwindInfo = *(PULONG)GSHandlerData;
if (IS_DISPATCHING(ExceptionRecord->ExceptionFlags)
Disposition = __CxxFrameHandler3( // <--------- breaks here
Disposition = ExceptionContinueSearch;
Two important points to mention:
I can fix the error by downsampling the image, so it appears to be related to the size of the image
The CPU version of the code works just fine with 2592x1936 images, so I'm reasonably confident that there isn't an error with any of the input parameters
Downsampling larger images is my fallback option, but I find the conclusion that the OpenCL version of the HOG descriptor compute function can't handle images beyond a certain size a bit unsatisfactory - the fact that there is an unhandled exception rather than an assert and a sensible error message leads me to believe it is more complicated/sinister than this..!
Can anybody shed any light on this? Any input would be appreciated.
Thanks in advance
As requested, a self-contained example. I was able to reproduce the problem by feeding in a random google image that was big enough (searched for jpgs of size 2592x1936):
cv::UMat img = cv::imread("image3.jpg", cv::IMREAD_COLOR).getUMat(cv::ACCESS_RW);
uint16_t image_width = img.cols;
uint16_t image_height = img.rows;
int down_scale = 1;
uint16_t _image_width = (uint16_t)ceil((float)(image_width / down_scale) / 16) * 16;
uint16_t _image_height = (uint16_t)ceil((float)(image_height / down_scale) / 16) * 16;
std::cout << _image_height << " x " << _image_width << std::endl;
std::cout << (_image_width - 16) % 16 << std::endl;
std::cout << (_image_height - 16) % 16 << std::endl;
auto descriptor_calculator = cv::HOGDescriptor(cv::Size(_image_width, _image_height), cv::Size(16, 16), cv::Size(16, 16), cv::Size(16, 16), 9);
std::vector<float> descriptor_flat;
descriptor_calculator.compute(img, descriptor_flat);
Similar to the original problem, works for small enough images, fails for large enough images. Unfortunately did not manage to glean much more information from running outside of the IDE or setting the IDE to break on all exceptions - the only clue that I've got is that the exception is an access violation. I wondered if it was the call to ceil that was causing an access violation, but the resolution is divisible by 16 so not surprisingly changing to floor made no difference.
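For reference, the window-size rounding and the resulting descriptor length in the example above can be worked out with plain integer arithmetic. This is only a sketch: roundUpToMultiple and hogDescriptorLength are hypothetical helper names, not OpenCV functions, and the length formula assumes a single detection window covering the whole image, which is how the example constructs the descriptor.

```cpp
#include <cstddef>

// Hypothetical helper: round value up to the next multiple of `multiple`
// using integer arithmetic only (no float cast, no ceil).
constexpr int roundUpToMultiple(int value, int multiple)
{
    return ((value + multiple - 1) / multiple) * multiple;
}

// Hypothetical helper: number of floats in one HOG window descriptor for
// square blocks, strides and cells (all given in pixels).
constexpr std::size_t hogDescriptorLength(int winWidth, int winHeight,
                                          int block, int stride, int cell,
                                          int nbins)
{
    return static_cast<std::size_t>(nbins)
         * (block / cell) * (block / cell)          // cells per block
         * ((winWidth - block) / stride + 1)        // blocks per row
         * ((winHeight - block) / stride + 1);      // blocks per column
}
```

For the 2592x1936 image with 16x16 blocks, strides and cells and 9 bins, this gives 162 * 121 * 9 = 176418 floats, so the descriptor itself is not unusually large.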
Additional info as requested:
Running in Windows
Compiler/IDE is Visual Studio 2015
OpenCV version is 3.4.0
Thanks for the help, happy to provide more info


MedianBlur() calculate max kernel size

I want to use the medianBlur function with a very large ksize, such as 301 or more. But if I pass a ksize that is too high, the function sometimes crashes. The error message is:
OpenCV Error: (k < 16) in cv::medianBlur_8u_O1, in file ../opencv\modules\imgproc\src\smooth.cpp
(I use opencv4nodejs, but I also tried the original OpenCV 3.4.6.)
I tried reducing the ksize in a try/catch loop, but that is not very effective, since I have to work with videos.
I checked out the OpenCV source code and did some research.
In OpenCV 3.4.6, the crash comes from line 241 of opencv\modules\imgproc\src\median_blur.simd.hpp:
for ( k = 0; k < 16 ; ++k )
{
    sum += H.coarse[k];
    if ( sum > t )
    {
        sum -= H.coarse[k];
        break;
    }
}
CV_Assert( k < 16 ); // Error here
t is calculated based on ksize, but the calculations of sum and the H.coarse array are quite complicated.
After further research, I found a scientific paper about the algorithm: https://www.researchgate.net/publication/321690537_Efficient_Scalable_Median_Filtering_Using_Histogram-Based_Operations
I am trying to read it but, honestly, I don't understand much of it.
How do I calculate the maximum ksize for a given image?
The maximum kernel size is determined from the bit depth of the image. As mentioned in the publication you cited:
"An 8-bit value is limited to a max value of 255. Our goal is to
support larger kernel sizes, including kernels that are greater in
size than 17 × 17, thus the larger 32-bit data type is used"
so for an image of data type CV_8U the maximum kernel size is 255.
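If you need to guard against oversized kernels before calling medianBlur, something along these lines might do. It is a minimal sketch: clampMedianKsize is a hypothetical helper, the upper bound of 255 is simply taken from the answer above rather than checked against the OpenCV source, and the lower bound reflects the documented requirement that ksize be odd and greater than 1.

```cpp
// Hypothetical helper: clamp a requested median kernel size to an odd value
// in [3, maxKsize]. maxKsize is assumed to be odd (255 for CV_8U, per the
// answer above).
constexpr int clampMedianKsize(int requested, int maxKsize = 255)
{
    return requested < 3 ? 3
         : requested > maxKsize ? maxKsize
         : (requested % 2 == 0 ? requested - 1 : requested);
}
```

With this, a request for 301 would be reduced to 255 up front, which avoids relying on a try/catch loop for every video frame.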

OpenCV ConvertTo CV_32SC1 from CV_8UC1

Hello, I am using OpenCV 3.4 and want to read an image (*.pgm) and then convert it to CV_32SC1. I use the following code (excerpt):
img = imread(f, CV_LOAD_IMAGE_GRAYSCALE);
img.convertTo(imgConv, CV_32SC1);
The problem is that all pixels are converted to zero, and I don't understand why. I'm checking with the following (and with imshow("Image", imgConv);):
cout << static_cast<int>(img.at<uchar>(200,100));
cout << static_cast<int32_t>(imgConv.at<int32_t>(200,100)) << endl;
In my example this results in
I tested several points of the image, and all pixels are simply the same. Shouldn't they be converted automatically to the 32-bit range, or do I have to manage that manually?
You have to manage that manually. This is why cv::Mat::convertTo() has another parameter, a scale. For instance, if you want to convert from CV_8U to CV_32F you'd typically
img.convertTo(img2, CV_32F, 1.0/255.0);
to scale to the typical float-valued range. I'm not sure what your expected range for CV_32SC1 is, since you're going from unsigned to signed, but just add the scale factor you feel is right.
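To make the role of the scale factor concrete, here is a rough per-pixel emulation of what convertTo does, namely dst = saturate(src * alpha + beta). It is only a sketch: convertPixelTo32S and saturateToInt32 are hypothetical names, the rounding is a simple round-half-away-from-zero that may differ from OpenCV's in tie cases, and mapping 255 to the largest 32-bit value is just one possible choice of target range.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical helper: round to nearest and clamp to the int32 range.
constexpr std::int32_t saturateToInt32(double x)
{
    return x >= 2147483647.0 ? std::numeric_limits<std::int32_t>::max()
         : x <= -2147483648.0 ? std::numeric_limits<std::int32_t>::min()
         : static_cast<std::int32_t>(x + (x >= 0 ? 0.5 : -0.5));
}

// Hypothetical helper: emulate convertTo for one CV_8U pixel going to CV_32S.
// alpha is the scale and beta the offset, as in cv::Mat::convertTo.
constexpr std::int32_t convertPixelTo32S(std::uint8_t value,
                                         double alpha = 1.0,
                                         double beta = 0.0)
{
    return saturateToInt32(value * alpha + beta);
}
```

For example, convertPixelTo32S(255, 2147483647.0 / 255.0) stretches the brightest 8-bit value to the top of the signed 32-bit range, while the default alpha of 1.0 leaves the numeric value as it is.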

GPU vs CPU end to end latency for dynamic image resizing

I have been benchmarking throughput with OpenCV and ImageMagick, and I am not finding the GPU to be much faster than the CPU. Our use case is to resize a master copy dynamically to whatever size a service call requests, and I am trying to evaluate whether a GPU makes sense for resizing on every service call.
Below is the code I wrote for OpenCV. I run the following function serially over all the images stored in a folder, and I run N such processes in parallel to achieve X image resizes. I want to understand whether my evaluation approach is incorrect or whether the use case simply doesn't fit typical GPU workloads, and what exactly might be limiting GPU performance. GPU utilization does not even get anywhere close to 100%.
// resizeGPU.cpp
cv::Mat::setDefaultAllocator(cv::cuda::HostMem::getAllocator (cv::cuda::HostMem::AllocType::PAGE_LOCKED));
auto t_start = std::chrono::high_resolution_clock::now();
Mat src = imread(input_file,CV_LOAD_IMAGE_COLOR);
auto t_end_read = std::chrono::high_resolution_clock::now();
if (src.empty()) {
    std::cout<<"Image Not Found: "<< input_file << std::endl;
    return;
}
// Host-to-device copy
cuda::GpuMat d_src;
d_src.upload(src, stream);
auto t_end_h2d = std::chrono::high_resolution_clock::now();
// Resize on the GPU
cuda::GpuMat d_dst;
cuda::resize(d_src, d_dst, Size(400, 400),0,0, CV_INTER_AREA,stream);
auto t_end_resize = std::chrono::high_resolution_clock::now();
// Device-to-host copy
Mat dst;
d_dst.download(dst, stream);
auto t_end_d2h = std::chrono::high_resolution_clock::now();
std::cout<<"read,"<<std::chrono::duration<double, std::milli>(t_end_read-t_start).count()<<",host2device,"<<std::chrono::duration<double, std::milli>(t_end_h2d-t_end_read).count()
<<",resize,"<<std::chrono::duration<double, std::milli>(t_end_resize-t_end_h2d).count()
<<",device2host,"<<std::chrono::duration<double, std::milli>(t_end_d2h-t_end_resize).count()
<<",total,"<<std::chrono::duration<double, std::milli>(t_end_d2h-t_start).count()<<endl;
// resizeCPU.cpp
auto t_start = std::chrono::high_resolution_clock::now();
Mat src = imread(input_file,CV_LOAD_IMAGE_COLOR);
auto t_end_read = std::chrono::high_resolution_clock::now();
if (src.empty()) {
    std::cout<<"Image Not Found: "<< input_file << std::endl;
    return;
}
// Resize on the CPU
Mat dst;
resize(src, dst, Size(400, 400),0,0, CV_INTER_AREA);
auto t_end_resize = std::chrono::high_resolution_clock::now();
std::cout<<"read,"<<std::chrono::duration<double, std::milli>(t_end_read-t_start).count()<<",resize,"<<std::chrono::duration<double, std::milli>(t_end_resize-t_end_read).count()
<<",total,"<<std::chrono::duration<double, std::milli>(t_end_resize-t_start).count()<<endl;
Compiling : g++ -std=c++11 resizeCPU.cpp -o resizeCPU `pkg-config --cflags --libs opencv`
I am running each program N number of times controlled by following code : runMultipleGPU.sh
echo $1
for (( c=$START; c<=$END; c++ ))
do
    ./resizeGPU "$c" &#>/dev/null #&disown;
done
echo All done
Run : ./runMultipleGPU.sh
The timers in the code above lead to the following aggregate data (times in milliseconds, as printed by the timers):
No_processes  resizeCPU  resizeGPU  memcpyGPU  totalresizeGPU
1             1.51       0.55       2.13       2.68
10            5.67       0.37       2.43       2.80
15            6.35       2.30       12.45      14.75
20            6.30       2.05       10.56      12.61
30            8.09       4.57       23.97      28.55
No of images run per process : 267
Average size of the image: 624Kb
According to the data above, as we increase the number of processes (and thus the number of simultaneous resizes), the resize time (actual resize plus host-to-device and device-to-host copies) increases significantly on the GPU compared with the CPU.
Similar results after using ImageMagick which uses OpenCL beneath
Code :
setenv("MAGICK_OCL_DEVICE","OFF",1); //Turn it ON to use GPU acceleration
Image image;
auto t_start_read = std::chrono::high_resolution_clock::now();
image.read( full_path );
auto t_end_read = std::chrono::high_resolution_clock::now();
image.resize( Geometry(400,400) );
auto t_end_resize = std::chrono::high_resolution_clock::now();
Results :
No_procs  resizeCPU  resizeGPU
1         63.23      8.54
10        76.16      31.04
15        76.56      50.79
20        76.58      71.68
30        86.29      140.17
Test Machine configuration:
4 GPU (Tesla P100) - but test only utilizes 1 GPU
64 CPU cores (over Intel Xeon 2680 v4 CPU )
OpenCV version : 3.4.0
ImageMagick version : 6.9.9-26 Q16 x86_64 2018-01-17
Cuda Toolkit : 9.0
It is highly probable that this is too late to help you. However, for people looking at this answer, here is my suggestion to improve performance. The way you are setting pinned memory does not give you the boost you are looking for.
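That is, using
//Method 1
cv::Mat::setDefaultAllocator(cv::cuda::HostMem::getAllocator(cv::cuda::HostMem::AllocType::PAGE_LOCKED));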
In the comments of this discussion, somebody suggested doing the same as you, and the person answering said that it was slower. I was timing an implementation of Sobel derivatives close to the one in Coldvision.io Sobel. The main steps are:
1. reading the color image;
2. Gaussian blurring of the color image using a radius of 3 and a delta of 1;
3. grayscale conversion;
4. computing the x and y gradients;
5. merging them into the final output image.
Instead, I implemented a version swapping the order of steps 2 and 3: converting to grayscale first and then denoising the result with a Gaussian blur.
I was running OpenCV 3.4 on Windows 10 with CUDA 9.0. My CPU is an i7-6820HQ and my GPU is a Quadro M1200.
I tried your method and this one:
//Method 2
//allocate pinned memory
cv::cuda::HostMem memory(siz.height, siz.width, CV_8U, cv::cuda::HostMem::PAGE_LOCKED);
//Read input image from the disk
Mat input = imread(input_file, CV_LOAD_IMAGE_COLOR);
if (input.empty())
std::cout << "Image Not Found: " << input_file << std::endl;
// copy the input image from CPU to GPU memory
cuda::GpuMat gpuInput;
cv::cuda::Stream stream;
gpuInput.upload(memory, stream);
//Do your processing...
//allocate pinned memory for output
cv::cuda::HostMem outMemory(siz.height, siz.width, CV_8U, cv::cuda::HostMem::PAGE_LOCKED);
gpuOutput.download(outMemory, stream);
cv::Mat output = outMemory.createMatHeader();
I calculated the gain as (t1-t2)/t1*100, where t1 is the time running the code normally and t2 is the time running it with pinned memory. Negative values mean the method is slower than running with non-pinned memory.
Image size   Gain % Method 1   Gain % Method 2
800x600      2.9               8.2
1280x1024    2.5               15.3
1600x1200    0.2               7.0
2048x1536    -2.3              14.6
4096x3072    -1.0              17.2
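The gain figures in the table follow directly from the formula given above. As a small sketch (gainPercent is a hypothetical helper name, and the timings in the note below are made up for illustration, not measurements):

```cpp
// Hypothetical helper: relative gain in percent of a variant taking t2
// over a baseline taking t1. Both times must use the same unit and t1 > 0.
// A negative result means the variant is slower than the baseline.
constexpr double gainPercent(double t1, double t2)
{
    return (t1 - t2) / t1 * 100.0;
}
```

For instance, gainPercent(4.0, 3.0) yields a gain of 25, whereas gainPercent(2.0, 3.0) yields -50, which corresponds to the negative entries for Method 1 at the larger image sizes.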

Insufficient Memory Error: Bag of Words OpenCV 2.4.6 Visual Studio 2010

I am implementing the bag-of-words model using SURF and SIFT features and an SVM classifier. I want to train it on 80% of 2876 images and test it on the remaining 20%. I have kept dictionarySize set to 1000. My computer configuration is Intel Xeon (2 processors) / 32GB RAM / 500GB HDD. Images are read whenever necessary instead of being stored.
std::ifstream file("C:\\testFiles\\caltech4\\train0.csv", ifstream::in);
if (!file)
{
    string error_message = "No valid input file was given, please check the given filename.";
    CV_Error(CV_StsBadArg, error_message);
}
string line, path, classlabel;
printf("\nReading Training images................\n");
while (getline(file, line))
{
    stringstream liness(line);
    getline(liness, path, separator);
    // Load the image as grayscale (flag 0)
    Mat image = imread(path, 0);
    cout << " " << path << "\n";
    // Detect keypoints, then compute their descriptors
    detector.detect(image, keypoints1);
    detector.compute(image, keypoints1,descriptor1);
}
Here, train0.csv contains the paths to the images with their labels. The program stops in this loop while reading the images, computing the descriptors and adding them to the features to be clustered. The following error appears on the console:
In the code, I resized the images being read to 256*256, which reduces the memory requirement. Ergo, the error disappeared.
Mat image = imread(path, 0);
resize(image, image, Size(256, 256)); // shrink each image to 256x256 before detection
cout << " " << path << "\n";
detector.detect(image, keypoints1);
detector.compute(image, keypoints1,descriptor1);
However, it might appear again with a bigger dataset.
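To get a feel for whether a bigger dataset would run into the same problem, the size of the unclustered descriptor pool can be estimated up front. This is a rough sketch only: descriptorPoolBytes is a hypothetical helper, the keypoint count per image is an assumed value that has to be measured on the real data, and 128 is the float descriptor length of SIFT (SURF uses 64 by default, or 128 in extended mode).

```cpp
#include <cstdint>

// Hypothetical helper: bytes needed to hold every float descriptor of the
// training set at once, before clustering.
constexpr std::uint64_t descriptorPoolBytes(std::uint64_t images,
                                            std::uint64_t keypointsPerImage,
                                            std::uint64_t descriptorLength)
{
    return images * keypointsPerImage * descriptorLength * sizeof(float);
}
```

With about 2300 training images (80% of 2876), an assumed 1000 keypoints per image and 128 floats per descriptor, this comes to roughly 1.18 GB, and it grows linearly with both the number of images and the number of keypoints that each image yields.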

gpu::cvtColor(input_gpu, output_gpu, CV_BGR2GRAY)

I am trying to learn how to use the GPU programs in OpenCV. I have built everything with CUDA and if I run
cout << " Number of devices " << cv::gpu::getCudaEnabledDeviceCount() << endl;
I get the answer 1 device, so at least something seems to work. However, when I try the following piece of code, it just prints out the message and then nothing happens. It gets stuck on
cv::gpu::cvtColor(input_gpu, output_gpu, CV_BGR2GRAY);
Here is the code
#include <opencv2/opencv.hpp>
#include <opencv2/gpu/gpu.hpp>
using std::cout;
using std::endl;
int main(void){
    cv::Mat input = cv::imread("image.jpg");
    if (input.empty()){
        cout << "Image Not Found" << endl;
        return -1;
    }
    cv::Mat output;
    // Declare the input and output GpuMat
    cv::gpu::GpuMat input_gpu;
    cv::gpu::GpuMat output_gpu;
    cout << "Number of devices: " << cv::gpu::getCudaEnabledDeviceCount() << endl;
    // Copy the input cv::Mat to device.
    // Device memory will be allocated automatically according to the parameters of input image
    input_gpu.upload(input);
    // Convert the input image to grayScale on GPU
    cv::gpu::cvtColor(input_gpu, output_gpu, CV_BGR2GRAY);
    // Copy the result from GPU back to host
    output_gpu.download(output);
    cv::imshow("Input", input);
    cv::imshow("Output", output);
    cv::waitKey(0); // keep the windows open until a key is pressed
    return 0;
}
I just found this issue and it seems to be a problem with the Maxwell architecture, but that post is over a year old. Has anybody else experienced the same problem? I am using windows 7, Visual Studio 2013 and an Nvidia Geforce GTX 770.
/ Erik
Ok, I do not really know what the problem was, but in CMake, I changed the CUDA_GENERATION to Kepler, which is the micro architecture of my GPU, then I recompiled it, and now the code works as it should.
Interestingly, there were only Fermi and Kepler to choose, so I do not know if one will get problem with Maxwell.
/ Erik
