Can anybody please show me how to use RANSAC algorithm to select common feature points in two images which have a certain portion of overlap? The problem came out from feature based image stitching.
I implemented a image stitcher a couple of years back. The article on RANSAC on Wikipedia describes the general algortihm well.
When using RANSAC for feature based image matching, what you want is to find the transform that best transforms the first image to the second image. This would be the model described in the wikipedia article.
If you have already got your features for both images and have found which features in the first image best matches which features in the second image, RANSAC would be used something like this.
The input to the algorithm is:
n - the number of random points to pick every iteration in order to create the transform. I chose n = 3 in my implementation.
k - the number of iterations to run
t - the threshold for the square distance for a point to be considered as a match
d - the number of points that need to be matched for the transform to be valid
image1_points and image2_points - two arrays of the same size with points. Assumes that image1_points[x] is best mapped to image2_points[x] accodring to the computed features.
best_model = null
best_error = Inf
for i = 0:k
rand_indices = n random integers from 0:num_points
base_points = image1_points[rand_indices]
input_points = image2_points[rand_indices]
maybe_model = find best transform from input_points -> base_points
consensus_set = 0
total_error = 0
for i = 0:num_points
error = square distance of the difference between image2_points[i] transformed by maybe_model and image1_points[i]
if error < t
consensus_set += 1
total_error += error
if consensus_set > d && total_error < best_error
best_model = maybe_model
best_error = total_error
The end result is the transform that best tranforms the points in image2 to image1, which is exacly what you want when stitching.
Related
I would like to be able to replicate the behaviour of the opencv function warpPerspective which takes as input an image and an homography matrix, and projects the image according to the homography matrix (more details here : https://docs.opencv.org/2.4/modules/imgproc/doc/geometric_transformations.html).
It seems like tf.contrib.image.sparse_image_warp should do the job, but I am unable to replicate the behaviour of warpPerspective. The output I get is distorted in a non-linear fashion despite the use of the parameter interpolation_order=1.
With some further research, I suspect this is due to the fact that tf.contrib.image.interpolate_spline does not perform linear interpolation even when its order is 1 but rather uses some RBF kernels.
I can't see a way around this except encoding it with a dense_image_warp, but it seems a bit overkill and maybe costly. Does anyone has another solution ?
After some research, here is a solution. it uses the tf.contrib.image.dense_image_warp function and is not really pretty, but still, it works :
This first function computes the optical flow needed to perform the homography :
def homography_matrix_to_flow(tf_homography_matrix, im_shape1, im_shape2):
Y, X = np.meshgrid(range(im_shape1), range(im_shape2))
Z = np.ones_like(X)
XYZ = np.stack((X, Y, Z), axis=-1)
tf_XYZ = tf.constant(XYZ.astype("float64"))
tf_XYZ = tf_XYZ[tf.newaxis,:,:, :, tf.newaxis]
tf_homography_matrix = tf.tile(tf_homography_matrix[tf.newaxis, tf.newaxis], (1, im_shape2, im_shape1, 1, 1))
tf_unnormalized_transformed_XYZ = tf.matmul(tf_homography_matrix, tf_XYZ, transpose_b=False)
tf_transformed_XYZ = tf_unnormalized_transformed_XYZ / tf_unnormalized_transformed_XYZ[:,:,:, -1][:,:,:, tf.newaxis]
flow = -tf.squeeze(tf_transformed_XYZ-tf_XYZ)[..., :2]
return flow
Then, it used to warp the original image to the distorted image.
There is one trick : due to how the tf.contrib.image.dense_image_warp function works, you need to pass the inverse of the homography matrix to find the correct optical flow to use.
homography_matrix = np.array([[-4.86219067e-01, -2.20871298e+00, 4.08214879e+02],
[-1.02940133e-01, -5.60378659e+00, 3.87573763e+02],
[-1.35051362e-04, -6.59600583e-03, 2.91244998e-01]])
inv_homography_matrix = np.linalg.inv(homography_matrix)
tf_inv_homography_matrix = tf.constant(inv_homography_matrix)[tf.newaxis]
flow = homography_matrix_to_flow(tf_inv_homography_matrix, img.shape[1], img.shape[2])[tf.newaxis]
flow =tf.tile(flow, (self.bs, 1,1,1))
image_warped = tf.contrib.image.dense_image_warp(tf.transpose(img, (0,2,1,3)), flow)
image_warped = tf.transpose(image_warped, (0,2,1,3))
I still hope to find a better answer (one which does not have to compute a whole tensor of flow), therefore, I leave the question unanswered for now.
I have image of text document. It includes text and block-schemes. The main problem is to detect block-schemes. I think there are two approaches to solve this task: 1) detect geometric primitive that make up the scheme; 2) detect the whole scheme.
How can I solve this task, please, give me some aproaches.
UPDATE 1
I try to detect where in document block-scheme is placed. Example is shown on the picture below. I didn't try to detect text in block-scheme.
UPDATE 2 The main problem is that i should find block-schemes in different varieties. Even part of the block-scheme.
You can either do 1) Object Detection 2) Semantic Segmentation. I would suggest segmentation because boundary extraction is crucial for your application.
I'm assuming you have the pages of the documents as images.
The following are the steps involved in projects involving segmentation.
Dataset
Collect the images of the pages required to solve you problem and do
preprocessing steps such as image resizing to bring all images in
your dataset to a common shape and to reduce the number of computations performed. Be sure to maintain variability in your samples.
Now you have to annotate the regions of the images that you are interested and mark them with a name. Here assigning a class (like classification) to certain regions of the image. You can use the following tools for this.
Labelme -- (my recommendation)
Vgg Annotation tool -- (highly portable tool written in html but has less features than labelme)
Model
You can use U-Net Model for your task. Unet Paper. It is very easy to implement but performs very robustly on most real-world tasks such as yours.
We have done something similar at work. This is the blog post. We have explained in detail the steps involved in the pipe line from the data collection stage to the results.
Literature on Document Layout Analysis.
https://arxiv.org/pdf/1804.10371.pdf -- They have used U-Net with ResNet-50 as encoder. They have achieved very good results compared to previous approaches
https://github.com/leonlulu/DeepLayout-- This is a Python implementation of page layout analysis tool using a Deep Lab v2 model which does semantic segmentation.
Conclusion
The approach presented here might seem tedious and time consuming but it is robust to variability in the documents when you are testing. Comment below if you have any questions.
I would prefer if there were more examples for the types of diagram you are searching for, but based on the example you have given, here is my attempt of solving it naively.
1) Resize image to a manageable size to improve speed and reduce operations.
2) Use morphological open to cluster all the dark objects together.
3) Binarize the dark objects.
4) Label the objects using openCV connected components. This will give us the bounding box of each region.
5) Cluster overlapping bounding box together.
6) Analyze each bounding box to find the one with diagram. Here you can apply a more sophisticated algorithm like box detection or even arrow detection but in your example, i think a simple box ratio is sufficient.
Here is the code for the implementation
import cv2
import numpy as np
# Function to fill all the bounding box
def fill_rects(image, stats):
for i,stat in enumerate(stats):
if i > 0:
p1 = (stat[0],stat[1])
p2 = (stat[0] + stat[2],stat[1] + stat[3])
cv2.rectangle(image,p1,p2,255,-1)
# image name
img_name = 'test_image.png'
# Load image file
diagram = cv2.imread(img_name,0)
diagram = cv2.blur(diagram,(5,5))
fScale = 0.25
# Make it smaller to speed up everything and easier to cluster
small_img = cv2.resize(diagram,(0,0),fx = fScale, fy = fScale)
img_h, img_w = np.shape(small_img)
# Morphological close process to cluster nearby objects
fat_img = cv2.morphologyEx(small_img,cv2.MORPH_OPEN,None,iterations = 1)
# Threshold strong signals
_, bin_img = cv2.threshold(fat_img,210,255,cv2.THRESH_BINARY_INV)
# Analyse connected components
num_labels, labels, stats, centroids = cv2.connectedComponentsWithStats(bin_img)
# Cluster all the intersected bounding box together
rsmall, csmall = np.shape(small_img)
new_img1 = np.zeros((rsmall, csmall), dtype=np.uint8)
fill_rects(new_img1,stats)
# Analyse New connected components to get filled regions
num_labels_new, labels_new, stats_new, centroids_new = cv2.connectedComponentsWithStats(new_img1)
# Check for regions that satifies conditions coresponds to diagram
min_dia_width = img_w * 0.1
dia_regions = []
for i ,stat in enumerate(stats):
if i > 0:
# get basic dimensions
x,y,w,h = stat[0:4]
# calculate ratio
ratio = w / float(h)
# if condition met, save in list
if ratio < 1 and w > min_dia_width:
dia_regions.append((x/fScale,y/fScale,w/fScale,h/fScale))
# For display purpose
diagram_disp = cv2.imread(img_name)
for region in dia_regions:
x,y,w,h = region
x = int(x)
y = int(y)
w = int(w)
h = int(h)
cv2.rectangle(diagram_disp,(x,y),(x+w,y+h),(0,255,0),2)
labels_disp = np.uint8(200*labels/np.max(labels)) + 50
labels_disp2 = np.uint8(200*labels_new/np.max(labels_new)) + 50
cv2.imshow('small_img',small_img)
cv2.imshow('fat_img',fat_img)
cv2.imshow('bin_img',bin_img)
cv2.imshow("labels",labels_disp)
cv2.imshow("labels_disp2",labels_disp2)
cv2.imshow("diagram_disp",diagram_disp)
cv2.waitKey(0)
Here is the result for another type of input.
I am working on a project and I have to make an AR drone follow a path based on a list of checkpoints which are saved in a directory. Each checkpoint is a scene that the drone should detect along its path. There could be differences between the checkpoints and the actual scenes in terms of brightness, small obstacles present in the actual scenes or small variation of the point of view. To detect the checkpoints while the drone was moving, I have decided to use feature matching to get the number of good matches and the ratio between the inliers and the number of good matches and use these parameters to check if the checkpoint has been reached or not.
Algorithm:
convert the image to grayscale
Use a detector to detect the keypoints (I have tried SIFT,SURF, ORB
and AKAZE)
Use an extractor to calculate the feature vectors
Use a matching algorithm to perform the matching (I have tried
Bruteforce and Bruteforce-Hamming)
Keep only the good matches and compute the number of inliers.
Check if the ratio between inliers and good matches is above a
threshold and the number of good_matches is above another threshold.
If this condition holds, then, the checkpoint has been matched.
Results: The checking algorithm is roughly good but sometimes it detects a checkpoint, that was taken from a landed drone, only after crossing it. While, with the same checkpoint, the checking algorithm does not detect it if the drone is slightly shifted to the left compared to the position in which the checkpoint has been taken.
Is it a good approach for this problem or is there a better way to reach my goal? If it is a good way, how can I improve the checking when the drone is close to the checkpoint?
The code that implements the feature matching is shown below:
matcher->knnMatch(desc1, desc2, dmatches, KNN_best_matches);
vector<Point2f> matches, inliers;
if(matches2points_nndr(kp1,kp2,dmatches,matches,DRATIO,MIN_MATCH_COUNT)){
*match = true;
//compute inliers
compute_inliers_ransac(matches, inliers, MAX_H_ERROR, false);
//update stats
stats.matches = (int)matches.size()/2;
stats.inliers = (int)inliers.size()/2;
stats.outliers = stats.matches - stats.inliers;
stats.ratio = (float) stats.inliers * 100.0 / (float) stats.matches;
}
In another class, stats.ratio is compared with a threshold.
if(draw_stats.ratio > threshold_matching){
//move to the next checkpoint
match = true;
}else{
std::cout << "ratio is under the threshold: " << draw_stats.ratio << std::endl;
match = false;
}
Can anyone explain me the similarities and differences, of the Correlation and Convolution ? Please explain the intuition behind that, not the mathematical equation(i.e, flipping the kernel/impulse).. Application examples in the image processing domain for each category would be appreciated too
You will likely get a much better answer on dsp stack exchange but... for starters I have found a number of similar terms and they can be tricky to pin down definitions.
Correlation
Cross correlation
Convolution
Correlation coefficient
Sliding dot product
Pearson correlation
1, 2, 3, and 5 are very similar
4,6 are similar
Note that all of these terms have dot products rearing their heads
You asked about Correlation and Convolution - these are conceptually the same except that the output is flipped in convolution. I suspect that you may have been asking about the difference between correlation coefficient (such as Pearson) and convolution/correlation.
Prerequisites
I am assuming that you know how to compute the dot-product. Given two equal sized vectors v and w each with three elements, the algebraic dot product is v[0]*w[0]+v[1]*w[1]+v[2]*w[2]
There is a lot of theory behind the dot product in terms of what it represents etc....
Notice the dot product is a single number (scalar) representing the mapping between these two vectors/points v,w In geometry frequently one computes the cosine of the angle between two vectors which uses the dot product. The cosine of the angle between two vectors is between -1 and 1 and can be thought of as a measure of similarity.
Correlation coefficient (Pearson)
Correlation coefficient between equal length v,w is simply the dot product of two zero mean signals (subtract mean v from v to get zmv and mean w from w to get zmw - here zm is shorthand for zero mean) divided by the magnitudes of zmv and zmw.
to produce a number between -1 and 1. Close to zero means little correlation, close to +/- 1 is high correlation. it measures the similarity between these two vectors.
See http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient for a better definition.
Convolution and Correlation
When we want to correlate/convolve v1 and v2 we basically are computing a series of dot-products and putting them into an output vector. Let's say that v1 is three elements and v2 is 10 elements. The dot products we compute are as follows:
output[0] = v1[0]*v2[0]+v1[1]*v2[1]+v1[2]*v2[2]
output[1] = v1[0]*v2[1]+v1[1]*v2[2]+v1[2]*v2[3]
output[2] = v1[0]*v2[2]+v1[1]*v2[3]+v1[2]*v2[4]
output[3] = v1[0]*v2[3]+v1[1]*v2[4]+v1[2]*v2[5]
output[4] = v1[0]*v2[4]+v1[1]*v2[5]+v1[2]*v2[6]
output[5] = v1[0]*v2[7]+v1[1]*v2[8]+v1[2]*v2[9]
output[6] = v1[0]*v2[8]+v1[1]*v2[9]+v1[2]*v2[10] #note this is
#mathematically valid but might give you a run time error in a computer implementation
The output can be flipped if a true convolution is needed.
output[5] = v1[0]*v2[0]+v1[1]*v2[1]+v1[2]*v2[2]
output[4] = v1[0]*v2[1]+v1[1]*v2[2]+v1[2]*v2[3]
output[3] = v1[0]*v2[2]+v1[1]*v2[3]+v1[2]*v2[4]
output[2] = v1[0]*v2[3]+v1[1]*v2[4]+v1[2]*v2[5]
output[1] = v1[0]*v2[4]+v1[1]*v2[5]+v1[2]*v2[6]
output[0] = v1[0]*v2[7]+v1[1]*v2[8]+v1[2]*v2[9]
Notice that we have less than 10 elements in the output as for simplicity I am computing the convolution only where both v1 and v2 are defined
Notice also that the convolution is simply a number of dot products. There has been considerable work over the years to be able to speed up convolutions. The sweeping dot products are slow and can be sped up by first transforming the vectors into the fourier basis space and then computing a single vector multiplication then inverting the result, though I won't go into that here...
You might want to look at these resources as well as googling: Calculating Pearson correlation and significance in Python
The best answer I got were from this document:http://www.cs.umd.edu/~djacobs/CMSC426/Convolution.pdf
I'm just going to copy the excerpt from the doc:
"The key difference between the two is that convolution is associative. That is, if F and G are filters, then F*(GI) = (FG)*I. If you don’t believe this, try a simple example, using F=G=(-1 0 1), for example. It is very convenient to have convolution be associative. Suppose, for example, we want to smooth an image and then take its derivative. We could do this by convolving the image with a Gaussian filter, and then convolving it with a derivative filter. But we could alternatively convolve the derivative filter with the Gaussian to produce a filter called a Difference of Gaussian (DOG), and then convolve this with our image. The nice thing about this is that the DOG filter can be precomputed, and we only have to convolve one filter with our image.
In general, people use convolution for image processing operations such as smoothing, and they use correlation to match a template to an image. Then, we don’t mind that correlation isn’t associative, because it doesn’t really make sense to combine two templates into one with correlation, whereas we might often want to combine two filter together for convolution."
Convolution is just like correlation, except that we flip over the filter before correlating
I am creating an AR application that tracks feature , calculates homography and then obtains the object's pose from 3D-2D point correspondences and use that to render any 3D Object.
I am selecting a specific area for detecting features on my source image (by masking). and then matching it with features detected on subsequent frames.Then I filter those matches and estimate Homography of the unmasked region.
The problem is lies in Homography estimation. It differs every time(very slightly, but nonetheless, differs). The effect is : Even on keeping my camera still, I get a vibrating rectangle around my tracked region, which i draw using the estimated homography.
I have already posted a question titled Unstable homography estimation using ORB and got a reassurance of a fact i was considering (not recalculating my homography if the position of the region is similar to its last position).
However, I recently came to know of the Kalman filter, that it gives a better estimate of the position by combining our prior knowledge with our measurement observation.
So ,after looking at various examples (one in particular, http://www.youtube.com/watch?v=GBYW1j9lC1I), I modeled a Kalman filter(rather 4, one for every point of the rectanglular region) for my scenario:
m_KF1.init(4, 2, 1);
setIdentity(m_KF2.transitionMatrix);
m_measurement1 = Mat::zeros(2,1,cv::DataType<float>::type);
m_KF1.statePre.setTo(0);
m_KF1.controlMatrix.setTo(0);
//initialzing filter
m_KF1.statePre.at<float>(0) = m_scene_corners[1].x; //the first reading
m_KF1.statePre.at<float>(1) = m_scene_corners[1].y;
m_KF1.statePre.at<float>(2) = 0;
m_KF1.statePre.at<float>(3) = 0;
setIdentity(m_KF1.measurementMatrix);
setIdentity(m_KF1.processNoiseCov,Scalar::all(.1)) //updated at every step
setIdentity(m_KF1.measurementNoiseCov, Scalar::all(4)); //assuming measurement error of
//not more than 2 pixels
setIdentity(m_KF1.errorCovPost, Scalar::all(.1));
4 state variables (position in x, y and velocity in x,y).
2 measurement variables (position in x,y)
1 control variable (acceleration)
following steps taken at every iteration
//---First,the predicion phase , to update the internal variables-------//
// 'dt' is the time taken between the measurements
//Updating the transitionMatrix
m_KF1.transitionMatrix.at<float>(0,2) = dt;
m_KF1.transitionMatrix.at<float>(1,3) = dt;
//Updating the Control matrix
m_KF1.controlMatrix.at<float>(0,1) = (dt*dt)/2;
m_KF1.controlMatrix.at<float>(1,1) = (dt*dt)/2;
m_KF1.controlMatrix.at<float>(2,1) = dt;
m_KF1.controlMatrix.at<float>(3,1) = dt;
//Updating the processNoiseCovmatrix
m_KF1.processNoiseCov.at<float>(0,0) = (dt*dt*dt*dt)/4;
m_KF1.processNoiseCov.at<float>(0,2) = (dt*dt*dt)/2;
m_KF1.processNoiseCov.at<float>(1,1) = (dt*dt*dt*dt)/4;
m_KF1.processNoiseCov.at<float>(1,3) = (dt*dt*dt)/2;
m_KF1.processNoiseCov.at<float>(2,0) = (dt*dt*dt)/2;
m_KF1.processNoiseCov.at<float>(2,2) = dt*dt;
m_KF1.processNoiseCov.at<float>(3,1) = (dt*dt*dt)/2;
m_KF1.processNoiseCov.at<float>(3,3) = dt*dt;
Mat prediction1 = m_KF1.predict();
Point2f predictPt1(prediction1.at<float>(0),prediction1.at<float>(1));
// Get the measured corner
m_measurement1.at<float>(0,0) = scene_corners[0].x;
m_measurement1.at<float>(0,1) = scene_corners[0].y;
//----Then, the correction phase which uses the predicted value and our measured value
Mat estimated = m_KF1.correct(m_measurement1);
Point2f statePt1(estimated.at<float>(0),estimated.at<float>(1));
This model hardly corrects my measured value
Now My Questions are:
Is Kalman filter suited for my scenario? Will it give me any better results?
If it is, then what's missing? am I modelling it right? Instead creating 4 filters for four points of the rectangle, should I model it in some other manner (for instance, take the 10 strongest matches based on the distance and use those as input to the filter)
If Kalman filter isn't suited, what else can i do to provide more stability to the estimated homography?
Any help would be highly appreciated.
Thanks.
This question is badly titled and after reading your explanation what you really ask is: "Why my OpenCV Kalman filter will still leaves lots of noise?"
Anyway your answers are:
Yes Kalman works for your scenario
You are using it wrong
Modify: KF.processNoiseCov, You can get the code from here: Opencv kalman filter prediction without new observtion there is nice line explaining it.
See:
setIdentity(KF.processNoiseCov, Scalar::all(.005)); //adjust this for faster convergence - but higher noise
For what I see, you have very basic understanding of it, you can go to a naive approach and use four 2D Kalman filters, for this you can use the code here: . It will work, from there grow and adapt until you get a better understanding.
After that, you could model it more closely to your problem or you can keep using four filters, there is no "perfect" implementation so if that works for you, just go for it.