I use the following code to take an original image and blur it:
#include <opencv2/core/core.hpp>
#include <opencv2/imgproc/imgproc.hpp>
#include <opencv2/highgui/highgui.hpp>
int main(int argc, char** argv) {
    cv::Mat img = cv::imread("lenna_orig.png");
    cv::Mat gray, blurred;
    cv::cvtColor(img, gray, CV_BGR2GRAY);
    cv::GaussianBlur(gray, blurred, cv::Size(21, 21), 2.0);
    cv::imwrite("lenna_blur.png", blurred);
    return 0;
}
But is there a way to save an image of the Gaussian kernel itself? Something like this?
cv::imwrite("gauss.png", cv::GaussianBlur(cv::Size(21, 21), 2.0));
I ask because I eventually want to do a deconvolution problem and compare the computed kernel with the actual Gaussian kernel, so I need to know exactly what the actual Gaussian kernel looks like.
EDIT:
I see that if I try
cv::Mat g = cv::getGaussianKernel(15, 2.0, CV_64F);
cv::imshow("g", g);
cv::imwrite("g.bmp", g);
this won't work, because it returns a 15x1 matrix as the kernel, according to the docs, but I want a 15x15 kernel.
cv::getGaussianKernel returns a 1D Gaussian profile. Since the kernel is symmetric, it only needs to calculate a 1D curve.
If you want a 2D version, you can stack 15 rows of the 1D profile and then multiply each column by the same profile; in other words, take the outer product of the profile with itself.
Edit: e.g. suppose the Gaussian profile were 0.2, 0.4, 1.0, 0.4, 0.2 (a simplified version for less typing).
Create the square array, with each row equal to the profile.
0.2 0.4 1.0 0.4 0.2
0.2 0.4 1.0 0.4 0.2
0.2 0.4 1.0 0.4 0.2
0.2 0.4 1.0 0.4 0.2
0.2 0.4 1.0 0.4 0.2
Now multiply each column by the same profile
0.2
0.4
1.0
0.4
0.2
To get something like
0.04 0.08 0.2 0.08 0.04
0.08 0.16 0.4 0.16 0.08
0.2 0.4 1.0 0.4 0.2
0.08 0.16 0.4 0.16 0.08
0.04 0.08 0.2 0.08 0.04
Only with the actual Gaussian profile and a 15x15 result.
P.S. This demonstrates an important feature of these kernels: they are separable. That means you can apply them in the x and y directions independently and then combine the results, which makes them much more efficient to use.
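For completeness, here is a minimal sketch (using the same OpenCV headers as the question's code; the output file name is made up) of building the 15x15 kernel as the outer product of the 1D profile and saving it as an image:

#include <opencv2/core/core.hpp>
#include <opencv2/imgproc/imgproc.hpp>
#include <opencv2/highgui/highgui.hpp>

int main() {
    // 15x1 Gaussian profile with sigma = 2.0
    cv::Mat g = cv::getGaussianKernel(15, 2.0, CV_64F);

    // Outer product of the profile with itself -> 15x15 separable kernel
    cv::Mat kernel2d = g * g.t();

    // The raw values are tiny, so rescale to 0..255 before saving for viewing
    cv::Mat vis;
    cv::normalize(kernel2d, vis, 0, 255, cv::NORM_MINMAX);
    vis.convertTo(vis, CV_8U);
    cv::imwrite("gauss_kernel.png", vis);
    return 0;
}

For the deconvolution comparison, compare against kernel2d itself rather than the rescaled image, since the rescaling destroys the kernel's normalization.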
Related: a similar question has been asked here; however, I could not understand it clearly.
I understand that SIFT computation has the following steps:
1. Finding scale space extrema
2. Keypoint localization (and filtering)
3. Orientation assignment (using computation of gradient magnitude and orientation)
4. Creating the SIFT descriptor
My question is about the fourth step: how do I set the region over which the SIFT descriptor is computed, and how is the shape of that region determined?
Suppose the scale-space extremum was found at scale "s" in the second octave. I use the gradient orientation to align it to a canonical orientation. How do I set the region for computing the SIFT descriptor using this information? Do I use the scale or the magnitude of the gradient to find the region over which SIFT is to be computed? And how is the shape of the region determined?
So this was surprisingly tricky to find an answer for.
David Lowe's original paper only seems to provide a vague theoretical explanation of how his algorithm works.
And as far as I know, his official implementation never had its feature-descriptor code open-sourced.
So I'm basing my answer on what I consider the next-most canonical implementation of the SIFT algorithm, Rob Hess' OpenSIFT implementation,
which became the basis for OpenCV's official implementation.
Anyway, here is my understanding of how SIFT roughly works:
Once you have located your extremum, you should know which octave & interval of the Gaussian pyramid the extremum belongs to.
Based on Rob's code (the two functions on lines 1026-1112), the feature descriptor is calculated from the blurred image of that octave & interval.
And the region for calculating SIFT is a square surrounding the keypoint. This Medium article also seems to agree (see its illustration).
The SIFT formula for the Gaussian kernel scale, relative to the original image size, is (reference):
base_scale * 2^(octave + interval / intervals_per_octave)
Or this formula if working relative to the halved image in each octave:
base_scale * 2^(interval / intervals_per_octave)
Where the original paper defined the parameters through experiments as:
base_scale = 1.6 and intervals_per_octave = 3
So if your SIFT was set to have 3 intervals per octave, with a base Gaussian scale of 1.6, and the extremum was found at octave 2, interval 3,
the image will have been blurred by a Gaussian kernel of scale: 1.6 * 2^(2 + 3/3) = 12.80 pixels
Now the actual array size of the Gaussian kernel will depend on the code you use, as the scale and the kernel size can be set independently.
In cases like MATLAB, I've found helpful guidelines in this SO thread.
The selected answer recommends a kernel width of 6 times the scale (i.e. the 3-sigma rule), so our kernel width (and height) is 12.80 * 6 ≈ 77 pixels;
thus, a SIFT descriptor region of size 77x77 pixels.
Meanwhile, the OpenCV implementation appears to leave the size of the kernel to be determined by OpenCV's own built-in Gaussian blur function.
Line 246 of OpenCV's code leaves the Gaussian blur parameter ksize as zeroes,
for which the official docs only state that the kernel size will be "computed from sigma", without ever defining how it is actually calculated...
Finally, for Rob's implementation, I have to admit that I couldn't quite understand what was happening in this final step. ¯\_(ツ)_/¯
From lines 1026-1112, Rob defined the code below, which shows how he calculates the orientation histogram for the SIFT descriptor.
The code shows he defined a radius and used nested for-loops over i and j to iterate through the square region around the keypoint, located at point (r, c).
Yet what I don't really understand is:
How he defined radius, with the Gaussian scale scl multiplied by some unknown constant SIFT_DESCR_SCL_FCTR = 3.0
As well as hist_width * sqrt(2) * ( d + 1.0 ) * 0.5 + 0.5, where d = SIFT_DESCR_WIDTH = 4
hist_width = SIFT_DESCR_SCL_FCTR * scl;
radius = hist_width * sqrt(2) * ( d + 1.0 ) * 0.5 + 0.5;

for( i = -radius; i <= radius; i++ )
    for( j = -radius; j <= radius; j++ )
    {
        /*
          Calculate sample's histogram array coords rotated relative to ori.
          Subtract 0.5 so samples that fall e.g. in the center of row 1 (i.e.
          r_rot = 1.5) have full weight placed in row 1 after interpolation.
        */
        c_rot = ( j * cos_t - i * sin_t ) / hist_width;
        r_rot = ( j * sin_t + i * cos_t ) / hist_width;
        rbin = r_rot + d / 2 - 0.5;
        cbin = c_rot + d / 2 - 0.5;

        if( rbin > -1.0 && rbin < d && cbin > -1.0 && cbin < d )
            if( calc_grad_mag_ori( img, r + i, c + j, &grad_mag, &grad_ori ))
            {
                grad_ori -= ori;
                while( grad_ori < 0.0 )
                    grad_ori += PI2;
                while( grad_ori >= PI2 )
                    grad_ori -= PI2;

                obin = grad_ori * bins_per_rad;
                w = exp( -(c_rot * c_rot + r_rot * r_rot) / exp_denom );
                interp_hist_entry( hist, rbin, cbin, obin, grad_mag * w, d, n );
            }
    }
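For illustration only: if scl were the per-octave scale 3.2 from the formulas above (1.6 * 2^(3/3)), the numbers would work out as hist_width = 3.0 * 3.2 = 9.6 and radius = 9.6 * sqrt(2) * (4 + 1.0) * 0.5 + 0.5 ≈ 34.4, i.e. roughly a 69x69-pixel sampling region; I can't say for certain which scale Rob's code actually passes in here.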
But regardless of how the exact size of the region is calculated, I think the general concept is the same: calculate the region size based on the original Gaussian scale.
Besides, given that the features are supposed to be "weighted by a Gaussian window" (original paper, section 6.1, page 15),
as long as the region you define is large enough to contain most of the meaningful orientation histograms, you are fine.
In summary:
The SIFT descriptor is calculated from the halved & blurred image of the same octave/interval as the keypoint (OpenSIFT)
The region for the SIFT descriptor is a square surrounding the keypoint (medium)(image)
The region size is calculated based on the Gaussian kernel scale; the exact method of calculation can vary, but an easy rule of thumb is "a width of 6 times the kernel scale" (thread)
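As a rough, hypothetical sketch of that rule of thumb (the helper name and the rounding are my own; real implementations differ in which scale they use and how they round):

#include <cmath>
#include <cstdio>

// Region side length from the formulas quoted above: sigma relative to the
// original image, then the "6 x scale" (3-sigma) rule, rounded up to an odd size.
int siftRegionSide(double base_scale, int octave, int interval, int intervals_per_octave)
{
    double sigma = base_scale * std::pow(2.0, octave + (double)interval / intervals_per_octave);
    int side = (int)std::ceil(6.0 * sigma);
    return side | 1;
}

int main()
{
    // Example from above: base 1.6, octave 2, interval 3, 3 intervals per octave -> sigma = 12.8
    int side = siftRegionSide(1.6, 2, 3, 3);
    std::printf("descriptor region: %d x %d px\n", side, side);   // prints 77 x 77
    return 0;
}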
I've read about the power-law (gamma) transformation, so let's look at the equation: s = c*r^γ
Suppose I have one pixel with an intensity of 37. If gamma is 0.4 and c is 1, then the output intensity is 37^0.4, which is about 4.2. So it's darker, not brighter. But then why does it look brighter in the example in my textbook?
The gamma transformation applies to data in the range [0,1]. So, for your typical unsigned 8-bit integer image, you would have to scale it first to that range. The equation, including the scaling, then would be:
s = 255 * (r/255)^γ
Now, for r = 37 and γ = 0.4, you'd have: s = 255 * (37/255)^0.4 ≈ 117.8. This is brighter.
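A minimal OpenCV sketch of the scaled transform (the file names are made up; the input is assumed to be a single-channel 8-bit image):

#include <opencv2/core/core.hpp>
#include <opencv2/highgui/highgui.hpp>

int main() {
    // Load as grayscale (flag 0)
    cv::Mat img = cv::imread("input.png", 0);

    // Scale to [0,1], apply s = r^gamma with gamma = 0.4, scale back to [0,255]
    cv::Mat f, out;
    img.convertTo(f, CV_32F, 1.0 / 255.0);
    cv::pow(f, 0.4, f);
    f.convertTo(out, CV_8U, 255.0);

    cv::imwrite("gamma.png", out);   // brighter than the input, since gamma < 1
    return 0;
}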
I am implementing a very simple segmentation algorithm for single channel images. The algorithm works like so:
For a single channel image:
Calculate the standard deviation, i.e., measure how much the luminosity varies across the image.
If the stddev > 15 (aka threshold):
Divide the image into 4 cells/images
For each cell:
Repeat step 1 and step 2 (go recursive)
Else:
Draw a rectangle on the source image to signify a segment lies in these bounds.
My problem is that my threshold is constant, and when I go recursive, 15 is no longer a good indicator of whether a sub-image is homogeneous or not. How can I introduce consistency/normalisation into my homogeneity check?
Should I resize each image to the same size (100x100)? Should my threshold be a formula, say 15 / img.rows * img.cols or 15 / MAX_HISTOGRAM_PEAK?
Edit Code:
void split_mat(const Mat& src, Mat& split1, Mat& split2, Mat& split3, Mat& split4) {
    split1 = Mat(src, Rect(Point(0, 0), Point(src.cols / 2, src.rows / 2)));
    split2 = Mat(src, Rect(Point(src.cols / 2, 0), Point(src.cols, src.rows / 2)));
    split3 = Mat(src, Rect(Point(0, src.rows / 2), Point(src.cols / 2, src.rows)));
    split4 = Mat(src, Rect(Point(src.cols / 2, src.rows / 2), Point(src.cols, src.rows)));
}
void segment_by_homogeny(Mat src, double threshold) {  // Mat header passed by value so putText can draw on the shared pixel data
    Scalar mean, stddev;
    meanStdDev(src, mean, stddev);
    double dev = stddev[0]; // / (src.rows * src.cols) * 100.0;

    if (dev >= threshold) {
        Mat s1, s2, s3, s4;
        split_mat(src, s1, s2, s3, s4);

        // Go recursive and segment each sub-segment where necessary
        segment_by_homogeny(s1, threshold);
        segment_by_homogeny(s2, threshold);
        segment_by_homogeny(s3, threshold);
        segment_by_homogeny(s4, threshold);
    }
    else {
        // Store 'segment' in global vector 'images' and write its std dev on it
        char d[255];
        sprintf_s(d, "Std Dev: %f", stddev[0]);
        putText(src, d, cvPoint(30, 60),
                FONT_HERSHEY_COMPLEX_SMALL, 0.7, cvScalar(200, 200, 250), 1, CV_AA);
        images.push_back(src);
    }
}
// Current usage for the example image results in infinite recursion:
// the green and red segments never have a std dev < 25.
segment_by_homogeny(img, 25);
I am expecting my algorithm to produce the following 5 segments:
You can simplify your algorithm: since you want to divide the given region into 4 subregions anyway, first split it into the 4 subregions, then calculate the average luminosity of each, and apply your threshold to the differences between these neighbouring averages.
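A minimal sketch of that idea (the function and parameter names are made up; it treats a region as homogeneous once its four quadrant means differ by no more than maxDiff):

#include <opencv2/core/core.hpp>
#include <vector>
#include <algorithm>

void segment_by_mean_diff(const cv::Mat& src, double maxDiff, std::vector<cv::Mat>& segments)
{
    // Too small to split further: keep as a segment
    if (src.rows < 2 || src.cols < 2) { segments.push_back(src); return; }

    // The four quadrants (same split as the question's split_mat)
    const int hw = src.cols / 2, hh = src.rows / 2;
    cv::Mat q[4] = {
        src(cv::Rect(0,  0,  hw,            hh)),
        src(cv::Rect(hw, 0,  src.cols - hw, hh)),
        src(cv::Rect(0,  hh, hw,            src.rows - hh)),
        src(cv::Rect(hw, hh, src.cols - hw, src.rows - hh))
    };

    // Average luminosity of each quadrant
    double m[4];
    for (int i = 0; i < 4; ++i)
        m[i] = cv::mean(q[i])[0];

    double lo = *std::min_element(m, m + 4);
    double hi = *std::max_element(m, m + 4);

    if (hi - lo > maxDiff) {
        // Quadrants disagree: go deeper
        for (int i = 0; i < 4; ++i)
            segment_by_mean_diff(q[i], maxDiff, segments);
    } else {
        // Quadrants agree: one homogeneous segment
        segments.push_back(src);
    }
}

Because the recursion stops once a region can no longer be split, this also removes the infinite-recursion problem; and since quadrant means always live in the same 0-255 range, a single threshold keeps the same meaning at every recursion depth.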
The docs for an Embedding Layer in Keras say:
Turns positive integers (indexes) into dense vectors of fixed size. eg. [[4], [20]] -> [[0.25, 0.1], [0.6, -0.2]]
I believe this could also be achieved by encoding the inputs as one-hot vectors of length vocabulary_size, and feeding them into a Dense Layer.
Is an Embedding Layer merely a convenience for this two-step process, or is something fancier going on under the hood?
An embedding layer is faster, because it is essentially the equivalent of a dense layer that makes simplifying assumptions.
Imagine a word-to-embedding layer with these weights:
w = [[0.1, 0.2, 0.3, 0.4],
[0.5, 0.6, 0.7, 0.8],
[0.9, 0.0, 0.1, 0.2]]
A Dense layer will treat these like actual weights with which to perform matrix multiplication. An embedding layer will simply treat these weights as a list of vectors, each vector representing one word; the 0th word in the vocabulary is w[0], 1st is w[1], etc.
For an example, use the weights above and this sentence:
[0, 2, 1, 2]
A naive Dense-based net needs to convert that sentence to a 1-hot encoding
[[1, 0, 0],
[0, 0, 1],
[0, 1, 0],
[0, 0, 1]]
then do a matrix multiplication
[[1 * 0.1 + 0 * 0.5 + 0 * 0.9, 1 * 0.2 + 0 * 0.6 + 0 * 0.0, 1 * 0.3 + 0 * 0.7 + 0 * 0.1, 1 * 0.4 + 0 * 0.8 + 0 * 0.2],
[0 * 0.1 + 0 * 0.5 + 1 * 0.9, 0 * 0.2 + 0 * 0.6 + 1 * 0.0, 0 * 0.3 + 0 * 0.7 + 1 * 0.1, 0 * 0.4 + 0 * 0.8 + 1 * 0.2],
[0 * 0.1 + 1 * 0.5 + 0 * 0.9, 0 * 0.2 + 1 * 0.6 + 0 * 0.0, 0 * 0.3 + 1 * 0.7 + 0 * 0.1, 0 * 0.4 + 1 * 0.8 + 0 * 0.2],
[0 * 0.1 + 0 * 0.5 + 1 * 0.9, 0 * 0.2 + 0 * 0.6 + 1 * 0.0, 0 * 0.3 + 0 * 0.7 + 1 * 0.1, 0 * 0.4 + 0 * 0.8 + 1 * 0.2]]
=
[[0.1, 0.2, 0.3, 0.4],
[0.9, 0.0, 0.1, 0.2],
[0.5, 0.6, 0.7, 0.8],
[0.9, 0.0, 0.1, 0.2]]
However, an Embedding layer simply looks at [0, 2, 1, 2] and takes the weights of the layer at indices zero, two, one, and two to immediately get
[w[0],
w[2],
w[1],
w[2]]
=
[[0.1, 0.2, 0.3, 0.4],
[0.9, 0.0, 0.1, 0.2],
[0.5, 0.6, 0.7, 0.8],
[0.9, 0.0, 0.1, 0.2]]
So it's the same result, just obtained in a hopefully faster way.
The Embedding layer does have limitations:
The input needs to be integers in [0, vocab_length).
No bias.
No activation.
However, none of those limitations should matter if you just want to convert an integer-encoded word into an embedding.
Mathematically, the difference is this:
An embedding layer performs a select operation. In Keras, this layer is equivalent to:
K.gather(self.embeddings, inputs) # just one matrix
A dense layer performs a dot-product operation, plus an optional activation:
outputs = matmul(inputs, self.kernel) # a kernel matrix
outputs = bias_add(outputs, self.bias) # a bias vector
return self.activation(outputs) # an activation function
You can emulate an embedding layer with a fully-connected layer via one-hot encoding, but the whole point of dense embeddings is to avoid the one-hot representation. In NLP, the vocabulary size can be on the order of 100k (sometimes even a million). On top of that, it's often necessary to process sequences of words in batches, and processing a batch of sequences of word indices is much more efficient than a batch of sequences of one-hot vectors. In addition, the gather operation itself is faster than a matrix dot-product, in both the forward and backward pass.
Here I want to improve on the voted answer by providing more details:
When we use an embedding layer, it is generally to reduce one-hot input vectors (sparse) to denser representations.
An embedding layer is much like a table lookup. When the table is small, it is fast.
When the table is large, a table lookup is much slower. In practice, in that case we would use a dense layer as a dimension reducer on the one-hot input instead of an embedding layer.
I aim to flag whether a captured image is blurred. For this I have tried out two methods using OpenCV, and I intend to use a threshold to decide if the image is blurred:
1. Variance of the Laplacian, computed as follows:
Mat dest_lap_image = new Mat();
MatOfDouble mean = new MatOfDouble(), std = new MatOfDouble();
Imgproc.Laplacian(src_gray_image, dest_lap_image, CvType.CV_16S, 3, 1, 0);
Core.meanStdDev(dest_lap_image, mean, std);
double varOfLaplacian = Math.pow(std.get(0, 0)[0], 2);
2. Gradient of the image in the x and y directions:
Mat Gx = new Mat(), Gy = new Mat();
Imgproc.Sobel(image, Gx, CvType.CV_32F, 1, 0);
Imgproc.Sobel(image, Gy, CvType.CV_32F, 0, 1);
// normGx/normGy assumed to be the L2 norms of the gradient images
double normGx = Core.norm(Gx);
double normGy = Core.norm(Gy);
double sumSq = normGx * normGx + normGy * normGy;
double gradient = (float) (1. / (sumSq / image.size().area() + 1e-6));
These values differ a lot when the same scene is captured with different mobile phones.
E.g. Laplacian variance = 79 for camera1 and 5000 for camera2;
gradient value = 2*10^-4 for camera1 and 4*10^-5 for camera2.
Following are the meta data:
camera1:
4096x2304
Exposure time: 1/17
Aperture value: 1.53
ISO speed: 121
Focal length: 4.0
camera2:
1456x2592
Exposure time: 1/50
Aperture value: 2.53
ISO speed: 160
Focal length: 3.5mm
What I am not able to understand is:
1. Which camera parameters decide sharpness and focus, and how do they affect the gradient and Laplacian-variance values? These values are the features, and ideally they should be camera independent.
2. How do we calculate these values so that they are device independent?
3. Is there any other method for quick, basic blur detection that does not depend on image metadata?