Find centers of circular clusters in a point cloud - image-processing

I need to find a solution for the following problem. I want to find the centers of circular clusters within a point cloud. For example, in the bottom picture I want to identify 3 centers. I was trying to use clustering algorithms (e.g., k-means, k-medians, Gaussian mixture models) to find the centers of the clusters, but they did not give proper results without individual parametrization (e.g., the number of clusters).
Does anyone have a suggestion for which methods can be used to solve this problem?

There are plenty of choices.
Gaussian Mixture Modeling, with a heuristic to optimize the parameters
DBSCAN, with postprocessing to compute cluster centers (see the sketch below)
OPTICS, with postprocessing to compute cluster centers
Mean-Shift, where the densest points are your centers
and many, many more.
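To illustrate the DBSCAN route, here is a minimal sketch assuming scikit-learn and a 2D point cloud; the file name and the eps/min_samples values are placeholders that would need tuning for your data:

```python
import numpy as np
from sklearn.cluster import DBSCAN

points = np.loadtxt("point_cloud.txt")  # hypothetical (N, 2) array of x, y coordinates

# eps and min_samples are assumptions; tune them to your point density.
labels = DBSCAN(eps=0.05, min_samples=10).fit_predict(points)

# Postprocessing: the center of each cluster is simply its centroid.
centers = [points[labels == k].mean(axis=0)
           for k in set(labels) if k != -1]  # label -1 marks noise points
print(centers)
```

Note that the number of clusters is not fixed in advance; DBSCAN infers it from the density structure, which is exactly what k-means could not do here.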

Related

Is location-dependent convolution filter possible in PyTorch or TensorFlow?

Let's pretend that, in addition to having an image, I also have a gradient from left to right on the X axis of the image, and another gradient from top to bottom on the Y axis. Those two gradients are of the same size as the image, and could both range from -0.5 to 0.5.
Now, I'd like to make the convolution kernel (a.k.a. convolution filter, or convolution weights) depend on the (x, y) location in the gradient. So the kernel would be a function of the gradient, as if the kernel were the output of a nested mini neural net. This would make the weights of the filter different at every position, but slightly similar to those of their neighbors. How do I do that within PyTorch or TensorFlow?
Sure, I could compute a Toeplitz matrix (a.k.a. diagonal-constant matrix) myself, but the matrix multiplication would take O(n^3) operations if we assume x == y == n, whereas convolutions can normally be implemented in O(n^2). Or I could iterate over every element myself and do the multiplications in an unvectorized fashion.
Any better ideas? I'd like to see some creativity here, thinking about how this could be implemented neatly. I believe coding this would be an interesting way to build a network layer capable of doing things similar to a simplified version of a Spatial Transformer Network, but whose spatial transformation would be independent of the image.
Here is a solution I thought of for a simplified version of this problem, where a linear combination of weights is used rather than truly using a nested mini neural network:
It may be possible to do 4 different convolution passes so as to have 4 feature maps, then to multiply those 4 maps with the gradients (2 vertical and 2 horizontal gradients), and add them together so that only 1 map remains. However, that would be a linear combination of the different maps, which is simpler than truly using a nested neural network that would alter the kernel in the first place.
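To make that concrete, here is a rough PyTorch sketch of the linear-combination idea (the module and its names are my own invention for illustration, not an existing layer; it assumes the gradient maps are passed in as a (B, 4, H, W) tensor):

```python
import torch
import torch.nn as nn

class GradientMixedConv(nn.Module):
    """Four parallel convolutions whose feature maps are mixed per pixel
    by the four gradient maps, yielding a location-dependent response."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1)
            for _ in range(4))

    def forward(self, x, grads):
        # x: (B, C_in, H, W); grads: (B, 4, H, W) -- 2 vertical + 2 horizontal gradients
        maps = torch.stack([conv(x) for conv in self.convs], dim=1)  # (B, 4, C_out, H, W)
        return (maps * grads.unsqueeze(2)).sum(dim=1)                # (B, C_out, H, W)
```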
Thinking more about it, here is a solution to an equivalent question. The thing with this solution is that it flips the problem around by placing the "mini neural net" after rather than before, and in a quite different way. So it solves the problem, but offers a much different optimization space and convergence behavior, which is less natural for me to think about than how I formulated the problem.
In a sense, a solution to the problem could be very similar to simply concatenating the two gradients to a regular feature map (from a regular convolution), such as having a depth of d_2 = d_1 + 2 after the concatenation, and then performing more convolutions on top of this. I won't prove why this is a valid solution to an equivalent problem, but I thought it through and it seems provable.
The optimization space (for the weights) would be very different here, and I think it wouldn't converge with the same behavior. I'd like to know what you people think about this solution in terms of optimization convergence.
The reason why convolutions are more efficient than fully connected layers is that they are translation invariant. If you wish to have convolutions which are dependent on location, you would need to add two extra parameters to the convolution, i.e. have N+2 input channels where the x, y coordinates are the values of the two additional channels (as in e.g. CoordConv).
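For example, a small sketch of that coordinate-channel (CoordConv-style) trick in PyTorch, using the [-0.5, 0.5] range from the question (the helper name is mine):

```python
import torch

def add_coord_channels(x):
    """Append normalized x and y coordinate channels to a (B, C, H, W) tensor,
    so that an ordinary convolution applied afterwards can become location-aware."""
    b, _, h, w = x.shape
    ys = torch.linspace(-0.5, 0.5, h, device=x.device).view(1, 1, h, 1).expand(b, 1, h, w)
    xs = torch.linspace(-0.5, 0.5, w, device=x.device).view(1, 1, 1, w).expand(b, 1, h, w)
    return torch.cat([x, xs, ys], dim=1)  # (B, C + 2, H, W)
```

Any regular nn.Conv2d applied to the result then sees N+2 input channels and can learn position-dependent behavior.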
As for alternative solutions, is the gradient meaningful? If not, and it is uniform across all images, it might be better to just remove it manually in the pre-processing stage (similar to orientation correction, cropping, etc.). If it is not uniform (e.g. differences in lighting, shadows), then including other layers, under the assumption that they will learn invariance to different lightings, is a common hands-off approach.

Finding displacement between two camera frames

I'm currently working on a visual odometry project. So far I've implemented everything up to the essential matrix decomposition stage, but the resulting translation vector is normalized, so I cannot plot the movement.
Now how can I compute the displacement at some scale? I have seen some suggestions to use planar homography to compute the absolute translation. I didn't quite get the idea, as the outdoor environment is not simply planar. At least, by considering the ground as planar, how can I obtain its translation? I've seen a suggestion here. Is it possible to use this approach to get the displacement between two frames?
What you are referring to is called registration. This is a vast field. There are methods for a single linear transformation across the entire image, and per-pixel methods (the two ends of the spectrum). Naturally, per-pixel methods are typically far slower and have many local errors.
Typically two frames have very little transformation between them, and a simple homography will do to find the general scaling between them, especially if you are talking about aerial photos. If your case is very far from planar, then you may want to use something closer to pixel-wise registration, for example spline fitting: https://www.mathworks.com/matlabcentral/fileexchange/20057-b-spline-grid--image-and-point-based-registration
You cannot recover scale, generally speaking, unless you can recognize one or more objects of known physical size in the scene.
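To make the known-size idea concrete, here is a rough OpenCV sketch (the function and variable names are mine; pts1/pts2 are matched pixel coordinates and K is the intrinsic matrix, all assumed to be available):

```python
import cv2
import numpy as np

def metric_translation(pts1, pts2, K, true_length, estimated_length):
    """Recover R, t between two frames and rescale the unit-length t using one
    object whose physical size (true_length) is known and whose length in the
    unscaled reconstruction is estimated_length."""
    E, mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC, threshold=1.0)
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=mask)  # ||t|| == 1
    scale = true_length / estimated_length
    return R, t * scale  # translation in metric units
```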

Which machine learning model would be feasible for stripping the background from product photos?

My goal is to be able to process a product photo through the model and have it return the same photo with the product against a white background. The product photos will be of varying sizes and product types.
I'd like to feed the model photos of products with backgrounds, and those without. In the future I will also expand on the dataset with partially removed backgrounds.
If you are looking for an easy way of doing this, I'd suggest the K-means clustering algorithm. Assuming that you have a simple plain background and an image (of interest) you can obtain the RGB pixel values and use a K-means clustering algorithm with the number of clusters set to 2.
Let me explain this with the help of an example. Suppose you have an image of dimension 28x28 (just an arbitrary dimension). The total number of pixels in the image would be 784. Each pixel is represented as a combination of 3 RGB values, each ranging from 0 to 255.
The K-means clustering algorithm will cluster the pixel values into K clusters, so each cluster contains pixel values that are more similar to each other than to the pixel values in other clusters. This technique is especially helpful in drawing contours (borders) around objects of interest.
In the K-means clustering algorithm there would be 784 sample points, each represented as a point in a 3-dimensional space for this example. It will cluster these data points into K (2, in this example) clusters.
Here is a very simple implementation of the K-means clustering algorithm.
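For illustration, a minimal sketch along those lines, assuming scikit-learn and Pillow (the file names and the border-based background heuristic are my own assumptions, not a robust solution):

```python
import numpy as np
from PIL import Image
from sklearn.cluster import KMeans

img = np.asarray(Image.open("product.jpg").convert("RGB"))  # hypothetical input photo
h, w, _ = img.shape

# Cluster all pixels into 2 groups based on their RGB values.
labels = KMeans(n_clusters=2, n_init=10).fit_predict(img.reshape(-1, 3)).reshape(h, w)

# Assume the label that dominates the image border is the background.
border = np.concatenate([labels[0], labels[-1], labels[:, 0], labels[:, -1]])
background = np.bincount(border).argmax()

out = img.copy()
out[labels == background] = 255  # paint the background cluster white
Image.fromarray(out).save("product_white_bg.jpg")
```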
If you are looking for an advanced machine learning implementation, then I'd suggest you look into Deep Convolutional Neural Networks for Background Removal in Images. This machine learning technique has been used successfully for the task of background image removal.
Read more about it from here, here and here.

3D reconstruction -- How to create 3D model from 2D image?

If I take a picture with a camera, so that I know the distance from the camera to the object (such as a scale model of a house), I would like to turn this into a 3D model that I can maneuver around so I can comment on different parts of the house.
If I sit down and think about taking more than one picture, labeling direction and distance, I should be able to figure out how to do this, but I thought I would ask if someone has a paper that may help explain more.
What language you explain in doesn't matter, as I am looking for the best approach.
Right now I am considering showing the house, then letting the user put in some assistance for height, such as the distance from the camera to the top of that part of the model; given enough of this, it should be possible to start calculating heights for the rest, especially if there is a top-down image, then pictures from angles on the four sides, to calculate relative heights.
Then I expect that parts will also need to differ in color to help separate out the various parts of the model.
As mentioned, the problem is very hard and is often also referred to as multi-view object reconstruction. It is usually approached by solving the stereo-view reconstruction problem for each pair of consecutive images.
Performing stereo reconstruction requires that pairs of images are taken that have a good amount of visible overlap of physical points. You need to find corresponding points such that you can then use triangulation to find the 3D co-ordinates of the points.
Epipolar geometry
Stereo reconstruction is usually done by first calibrating your camera setup so you can rectify your images using the theory of epipolar geometry. This simplifies finding corresponding points as well as the final triangulation calculations.
If you have:
the intrinsic camera parameters (requiring camera calibration),
the camera's position and rotation (its extrinsic parameters), and
8 or more physical points with matching known positions in two photos (when using the eight-point algorithm)
you can calculate the fundamental and essential matrices using only matrix theory and use these to rectify your images. This requires some theory about co-ordinate projections with homogeneous co-ordinates and also knowledge of the pinhole camera model and camera matrix.
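As a rough illustration of this step, assuming OpenCV and already-matched point arrays pts1 and pts2 (at least 8 correspondences each):

```python
import cv2
import numpy as np

def rectify_pair(pts1, pts2, image_size):
    """pts1, pts2: (N, 2) float arrays of corresponding points, N >= 8;
    image_size: (width, height) of the images."""
    F, inliers = cv2.findFundamentalMat(pts1, pts2, cv2.FM_8POINT)
    ok, H1, H2 = cv2.stereoRectifyUncalibrated(pts1, pts2, F, image_size)
    return F, H1, H2  # warp each image with its homography (cv2.warpPerspective) to rectify
```

With known intrinsics you would instead estimate the essential matrix (cv2.findEssentialMat) and rectify with cv2.stereoRectify.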
If you want a method that doesn't need the camera parameters and works for unknown camera set-ups you should probably look into methods for uncalibrated stereo reconstruction.
Correspondence problem
Finding corresponding points is the tricky part that requires you to look for points of the same brightness or colour, or to use texture patterns or some other features to identify the same points in pairs of images. Techniques for this either work locally by looking for a best match in a small region around each point, or globally by considering the image as a whole.
If you already have the fundamental matrix, it will allow you to rectify the images such that corresponding points in two images will be constrained to a line (in theory). This helps you to use faster local techniques.
There is currently still no ideal technique to solve the correspondence problem, but possible approaches could fall in these categories:
Manual selection: have a person hand-select matching points.
Custom markers: place markers or use specific patterns/colours that you can easily identify.
Sum of squared differences: take a region around a point and find the closest matching region in the other image (a block-matching sketch is given below).
Graph cuts: a global optimisation technique based on optimisation using graph theory.
For specific implementations you can use Google Scholar to search through the current literature. Here is one highly cited paper comparing various techniques:
A Taxonomy and Evaluation of Dense Two-Frame Stereo Correspondence Algorithms.
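For the local route, a very small OpenCV block-matching sketch on an already-rectified pair (the file names and parameter values are placeholders):

```python
import cv2

left = cv2.imread("left_rectified.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("right_rectified.png", cv2.IMREAD_GRAYSCALE)

# Local matching: for each pixel, search along the epipolar line for the
# best-matching block in the other image.
stereo = cv2.StereoBM_create(numDisparities=64, blockSize=15)
disparity = stereo.compute(left, right)
```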
Multi-view reconstruction
Once you have the corresponding points, you can then use epipolar geometry theory for the triangulation calculations to find the 3D co-ordinates of the points.
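For instance, given two 3x4 projection matrices P1 and P2 and the matched points, the triangulation itself is short in OpenCV (a sketch, with my own function name):

```python
import cv2
import numpy as np

def triangulate(P1, P2, pts1, pts2):
    """P1, P2: 3x4 camera projection matrices; pts1, pts2: (N, 2) matched points."""
    X_h = cv2.triangulatePoints(P1, P2, pts1.T.astype(float), pts2.T.astype(float))
    return (X_h[:3] / X_h[3]).T  # convert homogeneous coordinates to (N, 3) points
```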
This whole stereo reconstruction would then be repeated for each pair of consecutive images (implying that you need an order to the images or at least knowledge of which images have many overlapping points). For each pair you would calculate a different fundamental matrix.
Of course, due to noise or inaccuracies at each of these steps you might want to consider how to solve the problem in a more global manner. For instance, if you have a series of images that are taken around an object and form a loop, this provides extra constraints that can be used to improve the accuracy of earlier steps using something like bundle adjustment.
As you can see, both stereo and multi-view reconstruction are far from solved problems and are still actively researched. The less you want to do in an automated manner the more well-defined the problem becomes, but even in these cases quite a bit of theory is required to get started.
Alternatives
If it's within the constraints of what you want to do, I would recommend considering dedicated hardware sensors (such as the XBox's Kinect) instead of only using normal cameras. These sensors use structured light, time-of-flight or some other range imaging technique to generate a depth image which they can also combine with colour data from their own cameras. They practically solve the single-view reconstruction problem for you and often include libraries and tools for stitching/combining multiple views.
Epipolar geometry references
My knowledge is actually quite thin on most of the theory, so the best I can do is to further provide you with some references that are hopefully useful (in order of relevance):
I found a PDF chapter on Multiple View Geometry that contains most of the critical theory. In fact the textbook Multiple View Geometry in Computer Vision should also be quite useful (sample chapters available here).
Here's a page describing a project on uncalibrated stereo reconstruction that seems to include some source code that could be useful. They find matching points in an automated manner using one of many feature detection techniques. If you want this part of the process to be automated as well, then SIFT feature detection is commonly considered to be an excellent non-real-time technique (since it's quite slow).
A paper about Scene Reconstruction from Multiple Uncalibrated Views.
A slideshow on Methods for 3D Reconstruction from Multiple Images (it has some more references below its slides towards the end).
A paper comparing different multi-view stereo reconstruction algorithms can be found here. It limits itself to algorithms that "reconstruct dense object models from calibrated views".
Here's a paper that goes into lots of detail for the case where you have stereo cameras that take multiple images: Towards robust metric reconstruction via a dynamic uncalibrated stereo head. They then find methods to self-calibrate the cameras.
I'm not sure how helpful all of this is, but hopefully it includes enough useful terminology and references to find further resources.
Research has made significant progress, and these days it is possible to obtain pretty good-looking 3D shapes from 2D images. For instance, in our recent research work titled "Synthesizing 3D Shapes via Modeling Multi-View Depth Maps and Silhouettes With Deep Generative Networks", we took a big step toward solving the problem of obtaining 3D shapes from 2D images. In our work, we show that you can not only go from 2D to 3D directly and get a good, approximate 3D reconstruction, but you can also learn a distribution of 3D shapes in an efficient manner and generate/synthesize 3D shapes. Below is an image from our work showing that we are able to do 3D reconstruction even from a single silhouette or depth map (on the left). The ground-truth 3D shapes are shown on the right.
The approach we took has some contributions related to cognitive science, or the way the brain works: the model we built shares parameters for all shape categories instead of being specific to only one category. Also, it obtains consistent representations and takes the uncertainty of the input view into account when producing a 3D shape as output. Therefore, it is able to naturally give meaningful results even for very ambiguous inputs. If you look at the citations of our paper you can see even more progress just in terms of going from 2D images to 3D shapes.
This problem is known as Photogrammetry.
Google will supply you with endless references; just be aware that if you want to roll your own, it's a very hard problem.
Check out The Deadalus Project; although that website does not contain a gallery with illustrative information about the solution, it posts several papers and info about the working method.
I watched a lecture by one of the main researchers of the project (Roger Hubbold), and the image results are quite amazing, although it is a complex and long problem. There are a lot of tricky details to take into account to get an approximation of the 3D data. Take, for example, the 3D information from wall surfaces, for which the heuristic works as follows: take a photo of the scene with normal illumination, then retake the picture from the same position with full flash active, subtract both images and divide the result by a pre-taken flash calibration image, apply a box filter to this new result, and then post-process to estimate depth values. The whole process is explained in detail in this paper (which is also posted/referenced on the project website).
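Just to show the arithmetic of that heuristic (not the paper's actual implementation), here is a rough NumPy/OpenCV sketch with placeholder file names and filter size:

```python
import cv2
import numpy as np

ambient = cv2.imread("scene_no_flash.png").astype(np.float32)
flash = cv2.imread("scene_flash.png").astype(np.float32)
calibration = cv2.imread("flash_calibration.png").astype(np.float32)

# Subtract the images and normalize by the flash calibration image.
ratio = (flash - ambient) / np.maximum(calibration, 1e-6)

# Box filter before post-processing into depth estimates.
smoothed = cv2.blur(ratio, (9, 9))
```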
Google Sketchup (free) has a photo matching tool that allows you to take a photograph and match its perspective for easy modeling.
EDIT: It appears that you're interested in developing your own solution. I thought you were trying to obtain a 3D model of an image in a single instance. If this answer isn't helpful, I apologize.
Hope this helps if you are trying to construct a 3D volume from a 2D stack of images! You can use an open-source tool such as ImageJ Fiji, which comes with a 3D viewer plugin:
https://quppler.com/creating-a-classifier-using-image-j-fiji-for-3d-volume-data-preparation-from-stack-of-images/

Difference between pHash nearest neighbor and color histogram nearest neighbor?

I'm looking to create an animated series of photos that are as similar as possible. While researching, I've come across two methods:
Generate a pHash of the images and do a nearest neighbor using the Hamming distance of the hash.
Create color histograms and do an n-dimensional nearest neighbor using Euclidean distance.
Many people commenting on the https://stackoverflow.com/questions/6971966/how-to-measure-percentage-similarity-between-two-images question claim that the two processes are essentially the same. I'm looking for a little more insight on this. They seem like different processes.
Thoughts?
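To see how different the two measures are in practice, here is a small sketch computing both for a pair of images (assuming the ImageHash, Pillow, and OpenCV packages; file names are placeholders):

```python
import cv2
import imagehash
import numpy as np
from PIL import Image

a, b = "frame_a.jpg", "frame_b.jpg"  # placeholder file names

# 1) Perceptual hash + Hamming distance: sensitive to overall structure/layout.
hamming = imagehash.phash(Image.open(a)) - imagehash.phash(Image.open(b))

# 2) Color histogram + Euclidean distance: sensitive to global color distribution.
def color_hist(path):
    img = cv2.imread(path)
    hist = cv2.calcHist([img], [0, 1, 2], None, [8, 8, 8], [0, 256] * 3)
    return cv2.normalize(hist, hist).flatten()

euclidean = np.linalg.norm(color_hist(a) - color_hist(b))
print(hamming, euclidean)
```

Two photos with the same colors but with the content rearranged will score nearly identically on the histogram distance yet very differently on the pHash distance, which is one way to see that the two processes are not the same.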
