How to implement Sobel operator - vision

I have implemented the Sobel operator in the vertical direction, but the result I am getting is very poor. My code is below.
int mask_size = 3;
char mask[3][3] = {{-1,0,1},{-2,0,2},{-1,0,1}};
void sobel(Mat input_image)
{
    /** Padding m-1 and n-1 zeroes to the result where m and n are mask_size **/
    Mat result = Mat::zeros(input_image.rows + (mask_size - 1) * 2, input_image.cols + (mask_size - 1) * 2, CV_8UC1);
    Mat result1 = Mat::zeros(result.rows, result.cols, CV_8UC1);
    int sum = 0;
    /** For loop for copying original values to new padded image **/
    for (int i = 0; i < input_image.rows; i++)
        for (int j = 0; j < input_image.cols; j++)
            result.at<uchar>(i + (mask_size - 1), j + (mask_size - 1)) = input_image.at<uchar>(i, j);
    GaussianBlur(result, result, Size(5,5), 0, 0, BORDER_DEFAULT);
    /** For loop to implement the convolution **/
    for (int i = 0; i < result.rows - (mask_size - 1); i++)
        for (int j = 0; j < result.cols - (mask_size - 1); j++)
        {
            int counter = 0;
            int counterX = 0, counterY = 0;
            sum = 0;
            for (int k = i; k < i + mask_size; k++)
            {
                for (int l = j; l < j + mask_size; l++)
                {
                    sum += result.at<uchar>(k, l) * mask[counterX][counterY];
                    counterY++;
                }
                counterY = 0;
                counterX++;
            }
            result1.at<uchar>(i + mask_size/2, j + mask_size/2) = sum / (mask_size * mask_size);
        }
    /** Truncating all the extra rows and columns **/
    result = Mat::zeros(result1.rows - (mask_size - 1) * 2, result1.cols - (mask_size - 1) * 2, CV_8UC1);
    for (int i = 0; i < result.rows; i++)
        for (int j = 0; j < result.cols; j++)
            result.at<uchar>(i, j) = result1.at<uchar>(i + (mask_size - 1), j + (mask_size - 1));
}
My input to the algorithm is: [image]
My output is: [image]
I have also tried applying a Gaussian blur before convolving the image, and the output I got is: [image]
The output I am expecting is: [image]
The guide I am using is: [link]

Your convolution looks OK, although I only had a quick look.
Check your output type: it is unsigned char.
Now think about the values your output pixels can take when the kernel has negative entries, and whether it is a good idea to store them directly in a uchar.
If you store -1 in an unsigned char, it wraps around to 255. That is where all the excess white in your output comes from: those pixels are actually small negative gradients.
The desired result looks like the absolute value of the Sobel output.
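To make the point about signed sums concrete, here is a minimal sketch that does not use OpenCV at all. It works on a plain row-major 8-bit buffer, accumulates into an int, and only converts to 8 bits after taking the absolute value and clamping. The function name sobelAbsX and the choice to leave border pixels at zero are assumptions made for this illustration, not part of the question's code.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Absolute horizontal-gradient Sobel response of a row-major 8-bit image.
// The 3x3 mask is applied without flipping; after abs() that gives the
// same result as a true convolution. Border pixels are left at 0.
std::vector<std::uint8_t> sobelAbsX(const std::vector<std::uint8_t>& img,
                                    int rows, int cols) {
    assert(static_cast<int>(img.size()) == rows * cols);
    static const int mask[3][3] = {{-1, 0, 1}, {-2, 0, 2}, {-1, 0, 1}};
    std::vector<std::uint8_t> out(img.size(), 0);
    for (int y = 1; y + 1 < rows; ++y) {
        for (int x = 1; x + 1 < cols; ++x) {
            int sum = 0;  // signed accumulator: the response can be negative
            for (int ky = -1; ky <= 1; ++ky)
                for (int kx = -1; kx <= 1; ++kx)
                    sum += mask[ky + 1][kx + 1] * img[(y + ky) * cols + (x + kx)];
            const int mag = std::abs(sum);  // keep the gradient strength, drop the sign
            out[y * cols + x] = static_cast<std::uint8_t>(mag > 255 ? 255 : mag);
        }
    }
    return out;
}
```

A dark-to-bright edge and a bright-to-dark edge of the same strength produce the same output value here, instead of one of them wrapping around to a near-white pixel.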


Is there an OpenCV function to copy all pixels under a mask into an array?

I would like to find the median color in a masked area in OpenCV. Does OpenCV have a function that takes an image and a mask, and puts only the pixels from the image where mask != 0 into an array or Mat?
I don't know of any OpenCV function that creates a vector from masked values. I have written my own function to do that in the past, and you could do the same.
Alternatively, if your data is uint8, you could calculate the histogram and find the median from that.
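The histogram idea can be sketched without OpenCV for a single 8-bit channel. The function name maskedMedian, the flat std::vector layout, and the convention of returning the lower median (and -1 for an empty mask) are assumptions made for this example. For a color image you would run it once per channel.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Lower median of the 8-bit values whose mask entry is non-zero.
// Returns -1 when no pixel is selected.
int maskedMedian(const std::vector<std::uint8_t>& values,
                 const std::vector<std::uint8_t>& mask) {
    assert(values.size() == mask.size());
    std::array<std::size_t, 256> hist{};  // one bin per possible value
    std::size_t count = 0;
    for (std::size_t i = 0; i < values.size(); ++i) {
        if (mask[i]) {
            ++hist[values[i]];
            ++count;
        }
    }
    if (count == 0) return -1;
    const std::size_t target = (count + 1) / 2;  // rank of the lower median
    std::size_t seen = 0;
    for (int v = 0; v < 256; ++v) {
        seen += hist[v];
        if (seen >= target) return v;
    }
    return 255;  // not reached: the bins always sum to count
}
```

No sorting and no intermediate array of masked pixels is needed; the cost is one pass over the image plus a walk over 256 bins.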
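You can use the copyTo method of the Mat class with a mask to copy the masked pixels into another Mat. Note that rst keeps the size of img, with the pixels outside the mask left at zero, so it is not a compact array: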
Mat rst;
img.copyTo(rst, mask);
This post is quite old now, but as there is still no such function in OpenCV, I implemented one for my app. Maybe it will be useful to someone.
cv::Mat extractMaskedData(cv::Mat data, cv::Mat mask)
{
    const bool isContinuous = data.isContinuous() && mask.isContinuous();
    const int nRows = isContinuous ? 1 : data.rows;
    const int nCols = isContinuous ? data.rows * data.cols : data.cols;
    // size of one pixel in bytes (despite the name), derived from depth and channel count
    const size_t pixelBitsize = data.channels() * (data.depth() < 2 ? 1 : data.depth() < 4 ? 2 : data.depth() < 6 ? 4 : 8);
    cv::Mat extractedData(0, 1, data.type());
    uint8_t* m;
    uint8_t* d;
    for (int i = 0; i < nRows; ++i) {
        m = mask.ptr<uint8_t>(i);   // mask is expected to be 8-bit, single channel
        d = data.ptr(i);
        for (int j = 0; j < nCols; ++j) {
            if (m[j]) {
                // wrap the pixel's bytes in a 1x1 Mat header and append it as a new row
                const cv::Mat pixelData(1, 1, data.type(), d + j * pixelBitsize);
                extractedData.push_back(pixelData);
            }
        }
    }
    return extractedData;
}
It returns a single-column cv::Mat of type data.type() with n rows, where n is the number of non-zero elements in mask.
It could be optimised by using an image-type-specific d pointer (e.g. cv::Vec3f* for CV_32FC3) instead of the generic uint8_t* d together with const cv::Mat pixelData(1, 1, data.type(), d + j * pixelBitsize);.

MSE for two Vec3b images in OpenCV

I have two Vec3b images and I want to find the MSE (Mean Square Error) between them. I know how to do it when you have two uchar images, but when you have two Vec3b images where there are 3 different values stored for each pixel how do you calculate it?
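You should compute the Euclidean distance for each pair of pixels:
// img1 and img2 stand for the two Vec3b images being compared
double MSE = 0;
for(int i = 0; i < width; i++)
    for(int j = 0; j < height; j++)
        // with the sqrt this accumulates per-pixel Euclidean distances;
        // drop the sqrt to accumulate squared errors instead
        MSE += sqrt(pow(img1.at<Vec3b>(j, i)[0] - img2.at<Vec3b>(j, i)[0], 2) + pow(img1.at<Vec3b>(j, i)[1] - img2.at<Vec3b>(j, i)[1], 2) + pow(img1.at<Vec3b>(j, i)[2] - img2.at<Vec3b>(j, i)[2], 2));
MSE /= width * height;
This code can be optimized, and depending on what you want to do, converting your images from BGR to HSV first may give better results.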
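For comparison, here is a small OpenCV-free sketch of the mean squared error over three-channel 8-bit pixels, with no square root and with the sum divided by the number of channel samples. The Pixel alias and the function name mse3 are made up for this example, and dividing by 3 times the pixel count is one common convention rather than the only possible one.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Pixel = std::array<std::uint8_t, 3>;  // one B, G, R triple

// Mean squared error over all channel samples of two equally sized images.
double mse3(const std::vector<Pixel>& a, const std::vector<Pixel>& b) {
    assert(a.size() == b.size() && !a.empty());
    double sse = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        for (int c = 0; c < 3; ++c) {
            // subtract in double so that negative differences are kept
            const double d = static_cast<double>(a[i][c]) - static_cast<double>(b[i][c]);
            sse += d * d;
        }
    }
    return sse / (3.0 * static_cast<double>(a.size()));
}
```

Dropping the factor 3.0 gives the mean squared Euclidean distance per pixel instead, which differs only by that constant.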
To calculate the mean square error for single-channel and three-channel images in OpenCV, you can use the following function, which may be faster because scanning the image pixel by pixel takes longer.
double getMSE(Mat& I1, Mat& I2)
{
    Mat s1;
    // save the I1 and I2 types before converting to float
    int im1type = I1.type();
    int im2type = I2.type();
    // convert to float to avoid producing zero for negative numbers
    I1.convertTo(I1, CV_32F);
    I2.convertTo(I2, CV_32F);
    absdiff(I1, I2, s1); // |I1 - I2|
    s1 = s1.mul(s1); // |I1 - I2|^2
    Scalar s = sum(s1); // sum elements per channel
    double sse = s.val[0] + s.val[1] + s.val[2]; // sum channels
    double mse = sse / (double)(I1.channels() * I1.total());
    // return I1 and I2 to their initial types before leaving the function
    I1.convertTo(I1, im1type);
    I2.convertTo(I2, im2type);
    if (sse <= 1e-10) // for small values return zero
        return 0;
    // Instead of returning MSE, the tutorial code returned PSNR:
    // double psnr = 10.0*log10((255*255)/mse);
    // return psnr;
    return mse;
}
The above code returns zero when the summed squared error is very small (under 1e-10). The terms s.val[1] and s.val[2] are zero for single-channel images.
To check a single-channel input, use the following test code (with random unsigned values):
Mat I1(12, 12, CV_8UC1), I2(12, 12, CV_8UC1);
double low = 0;
double high = 255;
cv::randu(I1, Scalar(low), Scalar(high));
cv::randu(I2, Scalar(low), Scalar(high));
double mse = getMSE(I1, I2);
cout << mse << endl;
To check a three-channel input, use the following test code (with random unsigned values):
Mat I1(12, 12, CV_8UC3), I2(12, 12, CV_8UC3);
double low = 0;
double high = 255;
cv::randu(I1, Scalar(low), Scalar(high));
cv::randu(I2, Scalar(low), Scalar(high));
double mse = getMSE(I1, I2);
cout << mse << endl;

OpenCV C++: how access pixel value CV_32F through uchar data pointer

Briefly, I would like to know if it is possible to directly access a pixel value of a CV_32F Mat through the Mat member "uchar* data".
I can do it with no problem if the Mat is CV_8U, for example:
// a matrix 5 columns and 6 rows, values in [0,255], all elements initialised at 12
cv::Mat A;
A.create(5,6, CV_8UC1);
A = cv::Scalar(12);
// here I successfully access pixel [4,5]
uchar *p = A.data;
int value = (uchar) p[4*A.step + 5];
The problem is when I try to do the same operation with the following matrix:
// a matrix 5 columns, 6 rows, values in [0.0, 1.0], all elements initialised at 1.2
cv::Mat B;
B.create(5,6, CV_32FC1);
B = cv::Scalar(1.2);
// this clearly does not work: no syntax error, but an erroneous value is reported!
uchar *p = B.data;
float value = (float) p[4*B.step + 5];
// this works, but it is not what I want to do!
float value = B.at<float>(4,5);
Thanks a lot, Valerio
You can use the ptr method, which returns a pointer to a matrix row:
for (int y = 0; y < mat.rows; ++y)
{
    float* row_ptr = mat.ptr<float>(y);
    for (int x = 0; x < mat.cols; ++x)
    {
        float val = row_ptr[x];
    }
}
If the matrix is continuous, you can also cast the data pointer to float* and use elem_step (the row step counted in elements) instead of step (which is counted in bytes):
float* ptr = (float*) mat.data;
size_t elem_step = mat.step / sizeof(float);
float val = ptr[i * elem_step + j];
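The byte arithmetic behind this can be tried outside OpenCV. The sketch below assumes a raw byte buffer whose rows start stepBytes apart, which plays the role of Mat::data and Mat::step. The function name readFloatAt is invented for the example, and std::memcpy is used instead of a pointer cast so the read is valid even if the address is not float-aligned.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Reads the float stored at (row, col) of a byte buffer whose rows start
// stepBytes apart. Both offsets are computed in bytes, then four bytes are
// copied out as one float.
float readFloatAt(const unsigned char* data, std::size_t stepBytes,
                  std::size_t row, std::size_t col) {
    assert(data != nullptr);
    float value;
    std::memcpy(&value, data + row * stepBytes + col * sizeof(float), sizeof(float));
    return value;
}
```

The key difference from indexing p[4*step + 5] directly is that the column offset is multiplied by the element size and that a whole float, not a single byte, is read.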
Note that CV_32F means the elements are float rather than uchar: the "F" stands for "float", and the "U" in CV_8U stands for unsigned integer. That is why your code doesn't give the right value. B.step is the row size in bytes, so with p declared as uchar*, p[4*B.step + 5] moves to the fifth row but then advances only 5 bytes (5*sizeof(uchar)) instead of 5 floats, and it reads a single byte rather than a whole float. You can compute the byte offset and then reinterpret that address as a float:
float value = *(float*)(p + 4*B.step + 5*B.elemSize());
Here are some ways to read the element at [i, j] into value:
value = B.at<float>(i, j);
value = B.ptr<float>(i)[j];
value = ((float*)B.data)[i*(B.step/sizeof(float)) + j];
The third way is not recommended, though, since it is easy to index out of bounds. Besides, create takes (rows, cols, type), so a matrix with 6 rows and 5 columns should be created with B.create(6, 5, CV_32FC1).

Fast Gaussian Blur image filter with ARM NEON

I'm trying to make a fast mobile version of a Gaussian blur image filter.
I've read other questions, like: Fast Gaussian blur on unsigned char image- ARM Neon Intrinsics- iOS Dev
For my purpose I need only a fixed-size (7x7), fixed-sigma (2) Gaussian filter.
So, before optimizing for ARM NEON, I'm implementing a 1D Gaussian kernel in C++ and comparing its performance with the OpenCV GaussianBlur() method directly in the mobile environment (Android with NDK). This way the code to optimize will be much simpler.
However, my implementation turns out to be 10 times slower than the OpenCV4Android version. I've read that OpenCV4 Tegra has an optimized GaussianBlur implementation, but I don't think the standard OpenCV4Android has that kind of optimization, so why is my code so slow?
Here is my implementation (note: reflect101 is used for pixel reflection when applying the filter near borders):
Mat myGaussianBlur(Mat src){
    Mat dst(src.rows, src.cols, CV_8UC1);
    Mat temp(src.rows, src.cols, CV_8UC1);
    float sum, x1, y1;
    // coefficients of 1D gaussian kernel with sigma = 2
    double coeffs[] = {0.06475879783, 0.1209853623, 0.1760326634, 0.1994711402, 0.1760326634, 0.1209853623, 0.06475879783};
    //Normalize coeffs
    float coeffs_sum = 0.9230247873f;
    for (int i = 0; i < 7; i++){
        coeffs[i] /= coeffs_sum;
    }
    // filter vertically
    for(int y = 0; y < src.rows; y++){
        for(int x = 0; x < src.cols; x++){
            sum = 0.0;
            for(int i = -3; i <= 3; i++){
                y1 = reflect101(src.rows, y - i);
                sum += coeffs[i + 3]*src.at<uchar>(y1, x);
            }
            temp.at<uchar>(y,x) = sum;
        }
    }
    // filter horizontally
    for(int y = 0; y < src.rows; y++){
        for(int x = 0; x < src.cols; x++){
            sum = 0.0;
            for(int i = -3; i <= 3; i++){
                x1 = reflect101(src.cols, x - i);
                sum += coeffs[i + 3]*temp.at<uchar>(y, x1);
            }
            dst.at<uchar>(y,x) = sum;
        }
    }
    return dst;
}
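The reflect101 helper is called above but not shown. A minimal sketch of what it presumably does, assuming the argument order (size, index) seen in the calls and the mirroring rule of OpenCV's BORDER_REFLECT_101 (gfedcb|abcdefgh|gfedcba, the border pixel itself is not repeated):

```cpp
#include <cassert>

// Maps an out-of-range index p into [0, size) by mirroring it about the
// first or last valid index, without repeating the border pixel.
int reflect101(int size, int p) {
    assert(size > 1);
    while (p < 0 || p >= size) {
        if (p < 0)
            p = -p;                 // mirror about index 0
        else
            p = 2 * size - p - 2;   // mirror about index size - 1
    }
    return p;
}
```

For a 7-tap filter on any image at least 4 pixels wide the loop body runs at most once, but it still runs a range check for every tap of every pixel, which is part of why calling it in the inner loop is costly.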
A big part of the problem here is that the algorithm is overly precise, as @PaulR pointed out. It's usually best to keep your coefficient table no more precise than your data. In this case, since you appear to be processing uchar data, you would use roughly an 8-bit coefficient table.
Keeping these weights small will particularly matter in your NEON implementation, because the narrower the arithmetic, the more lanes you can process at once.
Beyond that, the first major slowdown that stands out is having the image edge reflection code within the main loop. That makes the bulk of the work less efficient, because most pixels do not need any special handling.
It might work out better if you use a special version of the loop near the edges, and then, once you're safely away from them, a simplified inner loop that doesn't call the reflect101() function.
Second (more relevant to prototype code), it's possible to add the wings of the window together before applying the weighting function, because the table contains the same coefficients on both sides.
sum = src.at<uchar>(y, x) * coeffs[3];
for(int i = -3; i < 0; i++) {
    int tmp = src.at<uchar>(y + i, x) + src.at<uchar>(y - i, x);
    sum += coeffs[i + 3] * tmp;
}
This saves you six multiplies per pixel (three in each pass), and it's a step towards some other optimisations around controlling overflow conditions.
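The folding of the two wings can be checked in isolation. The sketch below compares a direct 7-tap sum with the folded version on one row of integers. The names direct7 and folded7 are invented for this example, and both functions assume that x is at least 3 samples away from either end of the row.

```cpp
#include <cassert>
#include <vector>

// Direct 7-tap weighted sum centred on x: seven multiplies.
int direct7(const std::vector<int>& row, const int (&c)[7], int x) {
    int sum = 0;
    for (int i = -3; i <= 3; ++i)
        sum += c[i + 3] * row[x + i];
    return sum;
}

// Same sum for a symmetric kernel: samples at equal distance from the
// centre are added first, so only four multiplies are needed.
int folded7(const std::vector<int>& row, const int (&c)[7], int x) {
    assert(c[0] == c[6] && c[1] == c[5] && c[2] == c[4]);
    int sum = c[3] * row[x];
    for (int i = -3; i < 0; ++i)
        sum += c[i + 3] * (row[x + i] + row[x - i]);
    return sum;
}
```

With integer coefficients the two versions agree exactly. With floating-point coefficients the results can differ in the last bits because the additions happen in a different order.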
Then there are a couple of other problems related to the memory system.
The two-pass approach is good in principle, because it saves you from performing a lot of recomputation. Unfortunately it can push the useful data out of L1 cache, which can make everything quite a lot slower. It also means that when you write the result out to memory, you're quantising the intermediate sum, which can reduce precision.
When you convert this code to NEON, one of the things you will want to focus on is trying to keep your working set inside the register file, but without discarding calculations before they've been fully utilised.
When people do use two passes, it's usual for the intermediate data to be transposed -- that is, a column of input becomes a row of output.
This is because the CPU will really not like fetching small amounts of data across multiple lines of the input image. It works out much more efficient (because of the way the cache works) if you collect together a bunch of horizontal pixels and filter those. If the temporary buffer is transposed, then the second pass also collects together a bunch of horizontal points (which would be vertical in the original orientation), and it transposes its output again so it comes out the right way.
If you optimise to keep your working set localised, then you might not need this transposition trick, but it's worth knowing about so that you can set yourself a healthy baseline performance. Unfortunately, localisation like this does force you to go back to the non-optimal memory fetches, but with the wider data types that penalty can be mitigated.
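The transposition trick described above can be sketched in a few lines of plain C++. To keep it short, this example uses a 3-tap [1 2 1] kernel without normalisation and replicates the edge pixel at the borders. Those choices, and the names filterRowsTransposed and blur121, are assumptions for illustration only and not the 7-tap filter from the question.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Applies [1 2 1] along each row of a rows x cols image and stores the
// result transposed (cols x rows). Reads are contiguous along the row.
std::vector<int> filterRowsTransposed(const std::vector<int>& in, int rows, int cols) {
    assert(rows >= 1 && cols >= 1);
    assert(static_cast<int>(in.size()) == rows * cols);
    std::vector<int> out(in.size());
    for (int y = 0; y < rows; ++y) {
        const int* row = &in[static_cast<std::size_t>(y) * cols];
        for (int x = 0; x < cols; ++x) {
            const int left = row[x > 0 ? x - 1 : 0];                 // replicate edge
            const int right = row[x + 1 < cols ? x + 1 : cols - 1];  // replicate edge
            out[static_cast<std::size_t>(x) * rows + y] = left + 2 * row[x] + right;
        }
    }
    return out;
}

// Two identical passes: the second one filters what used to be columns and
// transposes the data back into the original orientation.
std::vector<int> blur121(const std::vector<int>& img, int rows, int cols) {
    const std::vector<int> tmp = filterRowsTransposed(img, rows, cols);  // cols x rows
    return filterRowsTransposed(tmp, cols, rows);                        // rows x cols
}
```

Both passes read along rows, so the strided access moves to the writes, where only one value is stored per output pixel instead of several being fetched.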
If this is specifically for 8 bit images then you really don't want floating point coefficients, especially not double precision. Also you don't want to use floats for x1, y1. You should just use integers for coordinates and you can use fixed point (i.e. integer) for the coefficients to keep all the filter arithmetic in the integer domain, e.g.
Mat myGaussianBlur(Mat src){
    Mat dst(src.rows, src.cols, CV_8UC1);
    Mat temp(src.rows, src.cols, CV_16UC1); // <<<
    int sum, x1, y1; // <<<
    // coefficients of 1D gaussian kernel with sigma = 2
    double coeffs[] = {0.06475879783, 0.1209853623, 0.1760326634, 0.1994711402, 0.1760326634, 0.1209853623, 0.06475879783};
    int coeffs_i[7] = { 0 }; // <<<
    //Normalize coeffs
    float coeffs_sum = 0.9230247873f;
    for (int i = 0; i < 7; i++){
        coeffs_i[i] = (int)(coeffs[i] / coeffs_sum * 256); // <<<
    }
    // filter vertically
    for(int y = 0; y < src.rows; y++){
        for(int x = 0; x < src.cols; x++){
            sum = 0; // <<<
            for(int i = -3; i <= 3; i++){
                y1 = reflect101(src.rows, y - i);
                sum += coeffs_i[i + 3]*src.at<uchar>(y1, x); // <<<
            }
            temp.at<ushort>(y,x) = sum; // 16-bit intermediate, still scaled by 256
        }
    }
    // filter horizontally
    for(int y = 0; y < src.rows; y++){
        for(int x = 0; x < src.cols; x++){
            sum = 0; // <<<
            for(int i = -3; i <= 3; i++){
                x1 = reflect101(src.cols, x - i);
                sum += coeffs_i[i + 3]*temp.at<ushort>(y, x1); // <<<
            }
            dst.at<uchar>(y,x) = sum / (256 * 256); // <<<
        }
    }
    return dst;
}
This is the code after implementing all the suggestions of @Paul R and @sh1, summarized as follows:
1) use only integer arithmetic (with precision to taste)
2) add the values of the pixels at the same distance from the mask center before applying the multiplications, to reduce the number of multiplications.
3) apply only horizontal filters, to take advantage of the row-major storage of the matrices
4) separate the loops along the edges from those inside the image, to avoid unnecessary calls to the reflection function. I removed the reflection function entirely and inlined the reflection in the loops along the edges.
5) In addition, as a personal observation: to improve rounding without calling a (slow) function such as "round" or "cvRound", I add 0.5 (= 32768 at this integer precision) to both the temporary and the final pixel results, which reduces the difference compared to OpenCV.
The performance is now much better: it went from about 15 times to about 6 times slower than OpenCV.
However, the resulting matrix is not perfectly identical to the one obtained with OpenCV's GaussianBlur. This does not seem to be due to the arithmetic width, which is sufficient; the difference remains either way. Note that the difference is minimal: between 0 and 2 (in absolute value) of pixel intensity between the matrices resulting from the two versions. The coefficients are the same as those used by OpenCV, obtained with getGaussianKernel with the same size and sigma.
// One horizontal 7-tap pass with the reflection inlined at the edges.
// The same routine serves both passes; assumes in.cols >= 7.
static void filterRows(const Mat& in, Mat& out, const int* coeffs_i){
    for(int y = 0; y < in.rows; y++){
        const uchar *ptr = in.ptr<uchar>(y);
        uchar *out_ptr = out.ptr<uchar>(y);
        // filter horizontally - inside the image
        for(int x = 3; x < (in.cols - 3); x++){
            int sum = ptr[x] * coeffs_i[3];
            for(int i = -3; i < 0; i++){
                int tmp = ptr[x+i] + ptr[x-i];
                sum += coeffs_i[i + 3]*tmp;
            }
            out_ptr[x] = (sum + 32768) / 65536;
        }
        // filter horizontally - left edge - needs reflect
        for(int x = 0; x <= 2; x++){
            int sum = 0;
            for(int i = -3; i <= 3; i++){
                int x1 = x + i;
                if(x1 < 0){
                    x1 = -x1;
                }
                sum += coeffs_i[i + 3]*ptr[x1];
            }
            out_ptr[x] = (sum + 32768) / 65536;
        }
        // filter horizontally - right edge - needs reflect
        for(int x = (in.cols - 3); x < in.cols; x++){
            int sum = 0;
            for(int i = -3; i <= 3; i++){
                int x1 = x + i;
                if(x1 >= in.cols){
                    x1 = 2*in.cols - x1 - 2;
                }
                sum += coeffs_i[i + 3]*ptr[x1];
            }
            out_ptr[x] = (sum + 32768) / 65536;
        }
    }
}

Mat myGaussianBlur(Mat src){
    double coeffs[] = {0.070159, 0.131075, 0.190713, 0.216106, 0.190713, 0.131075, 0.070159};
    int coeffs_i[7] = { 0 };
    for (int i = 0; i < 7; i++){
        coeffs_i[i] = (int)(coeffs[i] * 65536); //65536
    }
    Mat temp(src.rows, src.cols, CV_8UC1);
    filterRows(src, temp, coeffs_i);
    // transpose to apply again horizontal filter - better cache data locality
    Mat tempT;
    transpose(temp, tempT);
    Mat dstT(tempT.rows, tempT.cols, CV_8UC1);
    filterRows(tempT, dstT, coeffs_i);
    Mat dst;
    transpose(dstT, dst);
    return dst;
}
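The rounding step from point 5 above can be looked at on its own. In this sketch the accumulator is assumed to hold a non-negative value scaled by 65536, as in the code above. The function names truncFixed16 and roundFixed16 are invented for the example.

```cpp
#include <cassert>
#include <cstdint>

// Drops the 16 fractional bits: always rounds down.
std::uint8_t truncFixed16(int acc) {
    assert(acc >= 0);
    const int v = acc / 65536;
    return static_cast<std::uint8_t>(v > 255 ? 255 : v);
}

// Adds half of the scale factor (0.5 in fixed point) before dropping the
// fractional bits, which rounds to the nearest integer.
std::uint8_t roundFixed16(int acc) {
    assert(acc >= 0);
    const int v = (acc + 32768) / 65536;
    return static_cast<std::uint8_t>(v > 255 ? 255 : v);
}
```

An accumulator worth about 10.61 becomes 10 with truncation and 11 with the added half. Since the coefficients themselves are truncated when converted to integers, a residual difference of a unit or two against a floating-point reference is plausible even with this rounding.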
According to Google's documentation, using float/double on Android devices is about twice as slow as using int/uchar.
You may find some ways to speed up your C++ code in the Android documentation.

Implementing Hough Transform for Lines

I am trying to implement the Hough transform for line detection in an already pre-processed image.
My input image is a black-and-white edge image: 0 is background and 255 is foreground. I do not wish to use OpenCV's built-in HoughLines function.
I am stuck on creating the accumulator and increasing its values properly. I can't figure out where I went wrong, so here is my code block:
int diagonal = sqrt(height * height + width * width);
IplImage *acc = cvCreateImage (cvSize(180, 2 * diagonal),IPL_DEPTH_8U, 1);
unsigned char* accData = (unsigned char *)acc->imageData;
for (int i=0; i<height; i++)
{
    for (int j=0; j<step; j++)
    {
        if (data[i*step + j] > 200)
        {
            for (int theta=0; theta<180; theta++)
            {
                int p = j * cos(theta) + i * sin(theta);
                if (p > 0)
                    accData[theta*180 + p] += 1;
            }
        }
    }
}
The output image that I get in acc is not what it should look like. I am not getting any sinusoids, only white patches here and there. Can anyone provide feedback about where I went wrong?
What I see is that you are passing degree values to sin and cos, which expect radians. You could change it as follows:
int p = j * cos((double)theta*CV_PI/180) + i * sin((double)theta*CV_PI/180);
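Beyond the radians fix, the accumulator layout also deserves a look. Here is a minimal OpenCV-free sketch, assuming a row-major 8-bit edge image, 1-degree angle steps, an int vote counter (so cells cannot overflow at 255), and rho shifted by the image diagonal so that negative distances get their own rows. The struct and function names are invented for this example.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

struct HoughAccumulator {
    int numTheta;            // number of angle bins (one per degree)
    int numRho;              // number of distance bins
    int rhoOffset;           // added to rho so that the row index is never negative
    std::vector<int> votes;  // row-major: votes[(rho + rhoOffset) * numTheta + theta]
};

HoughAccumulator houghLines(const std::vector<std::uint8_t>& edges, int width, int height) {
    assert(static_cast<int>(edges.size()) == width * height);
    const double kPi = 3.14159265358979323846;
    const int diag = static_cast<int>(std::ceil(
        std::sqrt(static_cast<double>(width) * width + static_cast<double>(height) * height)));
    HoughAccumulator acc{180, 2 * diag + 1, diag, {}};
    acc.votes.assign(static_cast<std::size_t>(acc.numRho) * acc.numTheta, 0);
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            if (edges[y * width + x] <= 200) continue;  // same threshold as the question
            for (int t = 0; t < acc.numTheta; ++t) {
                const double rad = t * kPi / 180.0;     // degrees to radians
                const int rho = static_cast<int>(
                    std::lround(x * std::cos(rad) + y * std::sin(rad)));
                ++acc.votes[(rho + acc.rhoOffset) * acc.numTheta + t];
            }
        }
    }
    return acc;
}
```

Each edge pixel casts exactly one vote per angle, and the cell index is built as row (rho) times row length (number of angles) plus column (theta). The question's accData[theta*180 + p] swaps those roles for an image that is 180 pixels wide.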
