I am looking for an algorithm that can solve this problem.
The problem:
I have the following set of points:
I want to group the points that represent a line (within some epsilon) into one group.
So, the optimal output will be something like:
Some notes:
Each point belongs to one and only one line.
If a point could belong to two lines, it should belong to the stronger one.
A line is considered stronger than another when it has more belonging points.
The algorithm does not need to cover all points, since some of them may be outliers.
The space contains many outliers; they may make up 50% of the total points.
Performance is critical; real-time operation is a must.
The solutions I have found so far:
1) Treating it as a clustering problem:
The main drawback of this method is that there is no direct distance metric between points. The distance metric is defined on the cluster itself (how linear it is). So I cannot use traditional clustering methods, and I would have to (as far as I can tell) use something like clustering via a genetic algorithm, where the evaluation happens on the whole cluster rather than between two points. I also do not want to use anything like a genetic algorithm, since I am aiming for a real-time solution.
2) Accumulating pairs and then clustering them:
Since it is hard to cluster points directly, I thought of extracting pairs of points and then clustering those pairs with others. This gives a distance between two pairs that can represent linearity (two pairs are really four points).
The drawback of this method is how to choose these pairs. If I rely on the Euclidean distance between them, it may not be accurate, because two points may be very near each other and yet very far from forming a line with the others.
I would appreciate any solution, suggestion, clue, or note. Please ask if you need any clarification.
P.S. Feel free to use any ready-made OpenCV function in your solution.
As Micka advised, I used Sequential-RANSAC to solve my problem. The results were fantastic and exactly what I wanted.
The idea is simple:
1) Apply RANSAC with a line-fitting model to the points.
2) Delete all points that are inliers of the RANSAC output.
3) While 2 or more points remain, go to step 1.
I have implemented my own fit-line RANSAC, but unfortunately I cannot share the code because it belongs to the company I work for. However, there is an excellent fit-line RANSAC here on SO implemented by Srinath Sridhar. The link of the post is: RANSAC-like implementation for arbitrary 2D sets.
It is easy to build a Sequential-RANSAC from the three simple steps mentioned above; a rough sketch follows.
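For anyone who wants to try it, here is a minimal sketch of the idea in Python/NumPy. It is not the code mentioned above, just an illustration; the inner `ransac_line` is a bare-bones line-fitting RANSAC, and `eps`, `min_inliers`, and `iters` are made-up parameters you would tune.

```python
import numpy as np

def ransac_line(points, iters=200, eps=1.0):
    """One RANSAC pass: fit a line to 2-D points, return the inlier mask (sketch)."""
    best_mask = np.zeros(len(points), dtype=bool)
    for _ in range(iters):
        i, j = np.random.choice(len(points), 2, replace=False)
        p1, p2 = points[i], points[j]
        d = p2 - p1
        n = np.array([-d[1], d[0]])
        n = n / (np.linalg.norm(n) + 1e-12)        # unit normal of the candidate line
        dist = np.abs((points - p1) @ n)           # point-to-line distances
        mask = dist < eps
        if mask.sum() > best_mask.sum():
            best_mask = mask
    return best_mask

def sequential_ransac(points, min_inliers=5, eps=1.0):
    """Repeatedly fit a line and remove its inliers until too few points remain."""
    lines = []
    remaining = points.copy()
    while len(remaining) >= 2:
        inliers = ransac_line(remaining, eps=eps)
        if inliers.sum() < min_inliers:            # what is left is treated as outliers
            break
        lines.append(remaining[inliers])
        remaining = remaining[~inliers]
    return lines
```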
Here are some results:
Related
I don't understand what multiobjective clustering is. Is it using multiple variables for clustering, or what?
I know that Stack Overflow might not be the best place for this kind of question, but I asked it on another website and did not get a response.
Multiobjective optimization in general means that you have multiple criteria you are interested in, which cannot simply be converted into something comparable. For example, consider the problem where you want a model that is both very fast and very accurate. Time is measured in seconds, accuracy in %. How do you compare (1s, 90%) and (10 days, 92%)? Which one is better? In general there is no answer. Thus what people usually do is look for the Pareto front: you test K models and select M <= K of them such that none of them is clearly "beaten" by any other. For example, if we add (1s, 91%) to the previous example, the Pareto front will be {(1s, 91%), (10 days, 92%)} (as (1s, 90%) < (1s, 91%), and the remaining ones are impossible to compare). A tiny sketch of this idea follows.
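To make that concrete, here is a small Python sketch of a Pareto front computation over made-up (time in seconds, error in %) pairs, both to be minimized:

```python
def pareto_front(candidates):
    """Return the candidates that are not dominated by any other.
    Each candidate is a tuple of criteria, all to be minimized (sketch)."""
    def dominates(a, b):
        return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))
    return [c for c in candidates if not any(dominates(o, c) for o in candidates)]

# The example above, with accuracy turned into error = 100 - accuracy:
models = [(1, 10.0), (10 * 24 * 3600, 8.0), (1, 9.0)]   # (1s, 90%), (10 days, 92%), (1s, 91%)
print(pareto_front(models))                              # -> [(864000, 8.0), (1, 9.0)]
```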
And now you can apply the same idea in a clustering setting. Say, for example, that you want to build a model that is fast at classifying new instances, minimizes the average distance inside each cluster, and does not put too many special instances labeled with X into any one cluster. Then again you get models (clusterings) that are characterized by three incomparable measures, and in multiobjective clustering you try to deal with these problems (for example by finding the Pareto front of such clusterings).
I have spent the last couple of days searching for curve reconstruction implementations and found none, neither as a library nor as a tool.
To describe my problem:
My main concern is contours with gaps:
From the papers I have read in the meantime, I guess the solution will require the use of Delaunay triangulation, and the method referenced most often seems to be the one described in the 1997 paper "The Crust and the β-Skeleton: Combinatorial Curve Reconstruction".
Can someone point me to a curve reconstruction implementation that can help me solve this problem?
The algorithm is implemented in CGAL. An example implementation in C++ can be seen in the CGAL ipelets demo package. Moreover, compiling the demo lets the user apply the algorithm in the Ipe GUI application:
In the above example I selected just part of my image, as the bottom lines did not meet the necessary requirements, so the crust cannot be applied to that part until it is corrected. Further, as can be seen, the image has to be sampled.
If no one provides another implementation example, I'll mark my answer as correct after a couple of days.
Delaunay triangulation uses a discretized curve and thereby loses information. That can cause strange problems where you don't expect them. In your example, the middle part of the lower boundary would probably cause a problem.
In such situations it may be better to collect relevant information from the model and try to make a matching.
Something like this: for each endpoint, collect the contour derivative in a neighbourhood. Then find all endpoints to which that endpoint can be connected with an approximately matching derivative direction and such that the joint does not cross another line. Each possible connection can be weighted by the joint distance and the deviation from the local derivative. These weights define a weighted graph of possible endpoint connections, and a maximum-weight matching in that graph would be a good solution to the problem; a sketch of this follows.
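A rough Python sketch of that idea, assuming the endpoint positions and unit tangent directions have already been extracted from the contours; the scoring is purely illustrative, the crossing test is omitted for brevity, and networkx is used for the matching:

```python
import numpy as np
import networkx as nx

def match_endpoints(endpoints, max_gap=50.0):
    """endpoints: list of (position, unit_tangent) pairs, both NumPy arrays of shape (2,).
    Build a weighted graph of plausible joins and take a maximum-weight matching."""
    G = nx.Graph()
    for i in range(len(endpoints)):
        for j in range(i + 1, len(endpoints)):
            (p_i, t_i), (p_j, t_j) = endpoints[i], endpoints[j]
            gap = np.linalg.norm(p_j - p_i)
            if gap > max_gap:
                continue
            # Endpoints should roughly face each other: tangents close to anti-parallel.
            alignment = -float(np.dot(t_i, t_j))        # ~1 when directions are opposite
            weight = alignment / (1.0 + gap)            # prefer close, well-aligned joins
            if weight > 0:
                G.add_edge(i, j, weight=weight)
    return nx.max_weight_matching(G)                    # set of (i, j) endpoint pairs to connect
```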
There are quite a few ways to solve this:
You could simply write a worm that follows the curves, and when you reach the end of one, take your current direction vector along with the gradient and extrapolate it forward. Find all the other endpoints that could fit, score them, and reconnect with the one with the highest score. Simple, and prone to problems if it is more than a simple break.
A hierarchical waterfall method might also be interesting.
There are threshold methods in waterfall (and level-set) methods that can be used to detect these gaps and fill them in.
I was trying to cluster some documents using the KMeansClustering approach and successfully created the clusters. I saved the cluster id corresponding to a particular document for recommendations. So whenever I wanted to recommend documents similar to a particular document, I would query all the documents in a particular cluster and return n random documents from the cluster. However, returning any random document from the cluster did not seem appropriate and I read somewhere that we should be returning the documents nearest to the document in question.
So I started searching for calculating distance between documents and stumbled upon the RowSimilarity approach which returns 10 most similar documents to each document, ordered by distance. Now this approach relies on a similarity metric like LogLikelihood etc to calculate the distance between documents.
Now my question is this. How is clustering better/worse than RowSimilarity given that both the approaches use a similarity distance metric to calculate the distance between documents?
What I'm trying to achieve is that I'm trying to cluster products on the basis of their titles and other text properties to recommend similar products. Any help is appreciated.
Clustering is not just another variant of classification or recommendation. It is a different discipline.
When you are doing cluster analysis, you want to discover structure in the data. But then, you should actually be analyzing the structure you found.
Now k-means is not really meant for documents. It tries to find a near optimal partitioning of a data set into k Voronoi cells. Unless you have a good reason to believe that Voronoi cells are a good partitioning for your data, the algorithm may be pretty much useless. Just because it returns a result does not at all indicate that the result is useful.
For documents, Euclidean distance (and k-means is in fact optimizing Euclidean distances) is usually pretty much meaningless. The vectors are very sparse, and k-means cluster centers will then often resemble impossible (and thus insensible) "average documents".
And I haven't even started on the need to find an appropriate value of k, on the Mahout implementation likely being just an approximation of Lloyd's k-means approximation, and so on. Did you even check the cluster sizes? In situations like these, k-means will often produce degenerate results, for example almost all clusters containing 1 or 0 elements and one mega-cluster containing the rest. In this situation, you might in fact be returning just random documents from your database...
Just because you can use it does not mean it is helpful. Make sure to validate the individual steps of your approach, for example whether the clusters are in any way useful and sensible! A quick sanity check is sketched below.
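As one such sanity check, here is a quick sketch of inspecting cluster sizes; it uses scikit-learn and toy documents purely for illustration, not Mahout:

```python
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = ["red cotton shirt", "blue cotton shirt", "steel water bottle", "usb charging cable"]
X = TfidfVectorizer().fit_transform(docs)                # sparse tf-idf vectors
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

print(Counter(labels))   # a single huge cluster plus tiny ones is a warning sign
```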
Similarity is not the same thing as distance -- one is big when the other is small. Clustering is not the same as computing distances either. First you should decide whether you have a clustering problem -- it does not sound like you do based on what you say. So, don't use k-means.
I'm pretty new to the field of machine learning (even though I find it extremely interesting), and I wanted to start a small project where I could apply some of it.
Let's say I have a dataset of persons, where each person has N different attributes (only discrete values, each attribute can be pretty much anything).
I want to find clusters of people who exhibit the same behavior, i.e. who have a similar pattern in their attributes ("look-alikes").
How would you go about this? Any thoughts to get me started?
I was thinking about using PCA since we can have an arbitrary number of dimensions, that could be useful to reduce it. K-Means? I'm not sure in this case. Any ideas on what would be most adapted to this situation?
I do know how to code all those algorithms, but I'm truly missing some real world experience to know what to apply in which case.
K-means using the n-dimensional attribute vectors is a reasonable way to get started. You may want to play with your distance metric to see how it affects the results.
The first step in pretty much any clustering algorithm is to find a suitable distance function. Many algorithms, such as DBSCAN, can then be parameterized with this distance function (at least in a decent implementation; some of course only support Euclidean distance...).
So start by considering how to measure object similarity!
In my opinion you should also try the expectation-maximization algorithm (also called EM). On the other hand, you must be careful when using PCA, because it may remove dimensions that are relevant for clustering. A small sketch of both approaches follows.
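A minimal scikit-learn sketch, with made-up discrete attributes that are one-hot encoded before running k-means and an EM-fitted Gaussian mixture:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

# Hypothetical discrete attributes per person: (gender, occupation, area)
people = np.array([["m", "student", "city"],
                   ["f", "student", "city"],
                   ["m", "retired", "rural"],
                   ["f", "retired", "rural"]])

X = OneHotEncoder().fit_transform(people).toarray()      # discrete values -> binary vectors

km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
em_labels = GaussianMixture(n_components=2, random_state=0).fit_predict(X)
print(km_labels, em_labels)
```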
I am currently developing a piece of software using OpenCV and Qt that plots data points. I need to be able to fill in an image from incomplete data, i.e. I want to interpolate between the points I have. Can anyone recommend a library or function that could help me? I thought maybe the OpenCV remap method, but I can't seem to get that to work.
The data is a 2-D matrix of intensity values, and I want to create an image of some sort. It's a school project.
Interpolation is a complex subject. There are infinitely many ways to interpolate a set of points, and this assumes that you truly do wish to do interpolation and not smoothing of any sort. (An interpolant reproduces the original data points exactly.) And of course, the 2-D nature of this problem makes things more difficult.
There are several common schemes for interpolation of scattered data in 2-d. Actually, for those who have access to it, a very nice paper is available (Richard Franke, "Scattered data interpolation: Tests of some methods", Mathematics of Computation, 1982.)
Perhaps the most common method used is based on a triangulation of your data. Merely build a triangulation of the domain from your data points. Then any point inside the convex hull of the data must lie inside exactly one of the triangles, or it will be on a shared edge. This allows you to interpolate linearly inside the triangle. If you are using MATLAB, then the function griddata is available for this express purpose; a Python equivalent is sketched below.
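If you are working in Python instead of MATLAB, scipy.interpolate.griddata does the same triangulation-based job; a minimal sketch with made-up scattered samples:

```python
import numpy as np
from scipy.interpolate import griddata

# Made-up scattered samples: (x, y) positions and their intensity values
pts = np.random.rand(200, 2) * 100.0
vals = np.sin(pts[:, 0] / 10.0) + np.cos(pts[:, 1] / 10.0)

# Regular grid to interpolate onto; points outside the convex hull come back as NaN
gx, gy = np.mgrid[0:100:200j, 0:100:200j]
img = griddata(pts, vals, (gx, gy), method="linear")
```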
The problem when trying to populate a complete rectangular image from scattered points is that the data very likely does not extend to the four corners of the array. In that event, a triangulation-based scheme will fail, since the corners of the array do not lie inside the convex hull of the scattered points. An alternative then is to use "radial basis functions" (often abbreviated RBF). There are many such schemes to be found, including Kriging, as used by the geostatistics community.
http://en.wikipedia.org/wiki/Kriging
Finally, inpainting is the name for a scheme of interpolation where elements are given in an array, but where there are missing elements. The name obviously refers to that done by an art conservator who needs to repair a tear or rip in a valuable piece of artwork.
http://en.wikipedia.org/wiki/Inpainting
The idea behind inpainting is typically to formulate a boundary value problem. That is, define a partial differential equation on the region where there is a hole. Using the known boundary values, fill in the hole by solving the PDE for the unknown elements. This can be computationally intensive if there are a huge number of unknown elements, since it typically requires the solution of at least a massive sparse system of linear equations. If the PDE is a nonlinear one, then it becomes a more intensive problem yet. A simple, reasonably good choice for the PDE is the Laplacian, which results in a linear system that extrapolates well. Again, I can offer a solution for a MATLAB user.
http://www.mathworks.com/matlabcentral/fileexchange/4551
Better choices for the PDE may come from nonlinear PDEs. One such is the Navier-Stokes equation. It is well suited to modeling the types of surfaces typically seen, but it is also more difficult to deal with. As in many facets of life, you get what you pay for.
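Since the question mentions OpenCV: OpenCV also ships an inpainting routine with a Telea variant and a Navier-Stokes based variant, which may be enough here. A minimal sketch, assuming an 8-bit grayscale image and an 8-bit mask that is nonzero at the missing pixels (the file names are placeholders):

```python
import cv2

# Placeholders: an 8-bit grayscale image and an 8-bit mask, nonzero where data is missing
img = cv2.imread("data.png", cv2.IMREAD_GRAYSCALE)
mask = cv2.imread("mask.png", cv2.IMREAD_GRAYSCALE)

filled_ns    = cv2.inpaint(img, mask, 3, cv2.INPAINT_NS)     # Navier-Stokes based
filled_telea = cv2.inpaint(img, mask, 3, cv2.INPAINT_TELEA)  # Telea's fast marching method
```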
Phew! Big subject.
The "right" answer depends a lot on your problem domain and various details of what you're doing.
Interpolating in more than 1 dimension requires making some choices. I'll assume that you are plotting on a regular grid, but that some of your grid points have no data. Big question: are the missing points sparse, or do they make big blobs?
You can't add information, so you're just trying to establish something that will look OK.
Conceptually simple suggestion (but the implementation may be some work):
For each region of missing data, identify all the edge points. That is, find the x's in this figure:
oooxxooo
oox..xoo
oox...xo
ox..xxoo
oox.xooo
oooxoooo
where the .'s are the points missing data, and the x's and o's have data (for a single missing point, this will be the four nearest neighbors). Fill in each missing data point with an average over the edge points around this blob. To make it smooth, weight each point by 1/d, where d is the taxicab distance (delta x + delta y) between the two points. A sketch of this follows.
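A sketch of that suggestion in Python, using SciPy to label the blobs of missing data; the edge detection and the 1/d weighting follow the description above:

```python
import numpy as np
from scipy import ndimage

def fill_blobs(data, missing):
    """Fill each blob of missing pixels with a 1/d (taxicab) weighted average
    of the known edge points surrounding that blob (illustrative sketch)."""
    filled = data.astype(float).copy()
    labels, nblobs = ndimage.label(missing)                  # connected blobs of missing data
    for b in range(1, nblobs + 1):
        blob = labels == b
        edge = ndimage.binary_dilation(blob) & ~missing      # known pixels touching the blob
        ey, ex = np.nonzero(edge)
        vals = filled[ey, ex]
        for y, x in zip(*np.nonzero(blob)):
            d = np.abs(ey - y) + np.abs(ex - x)              # taxicab distance to edge points
            w = 1.0 / d
            filled[y, x] = np.sum(w * vals) / np.sum(w)
    return filled
```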
From before we had any details:
In the absence of that kind of information, have you tried straight ahead linear interpolation? If your data is reasonably dense this might do it for you, and it is simple enough to code in-line when you need it.
Next step is usually a cubic spline, but for that you'll probably want to grab an existing implementation.
When I need something more powerful than a quick linear interpolation, I usually use ROOT (and pick one of the TSpline classes), but this may be more overhead than you need.
As noted in the comments, ROOT is big, and while it is fast, it does try to force you to do things the ROOT way, so it can have a big effect on your program.
A linear interpolation between (or indeed extrapolation from) two points (x1, y1) and (x2, y2) gives you
y_i = y1 + (x_i - x1)*(y2 - y1)/(x2 - x1)
Considering this is a simple school project, probably the easiest interpolation technique to implement is "nearest neighbors".
For each missing data point you find the nearest "filled" data point and use that as the value.
If you want to improve the results a little bit more, you can, say, find the K nearest data points and use their weighted average as the value of your missing data point.
The weight could be inversely proportional to the distance of the point from the missing data point.
There are a zillion other techniques, but nearest neighbor is probably the easiest to implement; a small sketch follows.
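A small sketch of the weighted K-nearest-neighbors variant, using a KD-tree from SciPy; missing is assumed to be a boolean mask of the same shape as data:

```python
import numpy as np
from scipy.spatial import cKDTree

def knn_fill(data, missing, k=4):
    """Fill missing values with an inverse-distance weighted average of the
    k nearest known points (illustrative sketch)."""
    known_yx = np.argwhere(~missing)
    known_vals = data[~missing]
    tree = cKDTree(known_yx)

    miss_yx = np.argwhere(missing)
    dist, idx = tree.query(miss_yx, k=k)                 # k nearest known points per hole
    w = 1.0 / np.maximum(dist, 1e-9)                     # closer points get a larger weight

    filled = data.astype(float).copy()
    filled[missing] = np.sum(w * known_vals[idx], axis=1) / np.sum(w, axis=1)
    return filled
```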
If I understand it, your need is as follows:
I think you have a subset of (x, y, intensity) values for a domain of L by W, and you want to fill in the intensities for all X ranging from 0 to L and Y ranging from 0 to W.
If this is your question, then one solution is to estimate the other intensities by using filters.
I think a Bayer-style (demosaicing) interpolation or a Gaussian filter would do the job for you.
You can Google these filters to find implementations; a sketch of the Gaussian variant follows.
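For the Gaussian variant, a tiny sketch of the usual normalized-convolution trick with OpenCV: blur the data with the unknowns zeroed out, blur the mask of known pixels, and divide; sigma is a made-up parameter to tune.

```python
import cv2
import numpy as np

def gaussian_fill(data, missing, sigma=3.0):
    """Estimate missing intensities as a Gaussian-weighted average of the known
    neighbours (normalized convolution, illustrative sketch)."""
    data = data.astype(np.float32)
    known = (~missing).astype(np.float32)
    blurred_vals = cv2.GaussianBlur(data * known, (0, 0), sigma)   # sum of weighted known values
    blurred_mask = cv2.GaussianBlur(known, (0, 0), sigma)          # sum of weights
    estimate = blurred_vals / np.maximum(blurred_mask, 1e-6)

    filled = data.copy()
    filled[missing] = estimate[missing]
    return filled
```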
Best of luck.