Visualizing DBSCAN in maps for Python - geolocation

i have run a DBSCAN model to cluster geo data points, that is, latitude and longitud. To set the parameter i want to have a visual input as to how the clusters look in the map.
How can i achieve this? Also, in a form that wont crash my computer. For example, drawing 800 K points using gmaps tends to slow my computer significantly.
Thanks!

Tools such as ELKI allow you to draw convex hulls and alpha shapes of clusters. These are a lot simpler than the raw data, and easier to interpret.
See e.g. https://doublebyteblog.wordpress.com/2015/05/11/visualizing-micro-blogging-data/

Related

How to accelerate DFT by specifying region of interest in frequency domain

Note: This question is originally asked at OpenCV forum a couple of days ago.
I'm building an image processing program which extensively uses 2-dimensional dft, discrete Fourier transform. I'm trying to speed up so as to run in real-time.
In that application I only use a part of dft output specified by a rectangular ROI. My current implementation follows the steps below:
Compute dft of the input image f (typically size of 512x512) and get the entire dft result F
Crop F into a pre-specified region of interest (ROI, typically size of 32x32, located arbitrary), R
This process basically works well, but involves useless calculation since I only need partial information of F. I'm looking for a way to accelerate this calculation only by computing necessary part of dft.
I found OpenCV with Intel IPP computes dft with Intel IPP functions, which results in an order of magnitude faster than naive OpenCV implementation. I'm wondering if I can even accelerate this computation by only computing pre-specified frequency domain of dft.
Since I'm new to OpenCV, I've lost my way here, so I hope you could provide a way to do this.
Kindly note that I don't mean to do a dft to an ROI of an image, i.e. dft(ROI(f)), but I want to compute ROI(dft(f)).
Thanks in advance.
Just a partial idea.
The DFT is separable. It is always computed by first applying the FFT algorithm to rows of the image, then to the columns of the result (or the other way around, the order doesn't matter).
If you want only an ROI of the output, in the second step you only need to process the columns that fall within the ROI.
I don't think you'll find a way to compute only a subset of frequencies along each 1D row/column. That would likely entail hacking your own FFT, which will likely be more computationally expensive than using the one in IPP or FFTW.

What methods are best for clustering multidimensional data that has irregular shape?

I am new to machine learning and data analysis and I'm struggling to cluster my data. I'm working with about 40,000 observations with 6 features.
I have tried various clustering methods including K-Means, DBSCAN, and also attempted scipy hierarchical clustering with linkage. During preprocessing missing data is imputed and all of the data is normalized. Once I complete PCA to reduce the dimensions from 4 to 6 my data looks like a crescent moon shape that can be seen below as the blue dots.
I determined that using 10 clusters for K-means would be best based on silhouette coefficient analysis and this is the result:
The result does not change much when performing PCA after the data has been clustered.
DBSCAN itself decides on 4 clusters and gives 4 clusters but with most of the data excluded from these clusters and depicted as noise.
For the hierarchical method the data usage was too much when trying to perform linkage() and kept providing a memory error message.
Is there any way I can cluster my data? Is the shape of my data (a crescent moon) lend itself to other modelling methods?
Don't run clustering without thinking first
Clustering algorithms must not be used as black boxes. They need to be carefully used or you get out only garbage. And to use them right, you need to understand the objective of each algorithm. K-means is a least squares approach. if you use it on badly normalized data, it fails.
Judging from your plot, there is a bad record in your database, largely causing that "moon" shape: everything needs tp be as far away as possible from that bad record.
Apart from that: 1. did you scale the data correctly for your problem? 2. did you choose the appropriate distance measure?

object detection LEDs in simple scene

I am new to opencv, I am guessing that this problem could be somewhat simple: I am trying to detect an object which is almost 25 by 15 pixels in an image which is 470 by 590 pixels.
I am attaching a zoomed image of this object, I have several options to go with:
1 - Two close Circles Detection using hough transformation,
2 - Histogram matching
3 - SURF feature detection
Any advise on which direction should I take? Please consider speed and real-time application. Thanks
I think it should go without explicitly saying so, but there are probably hundreds of things that could be tried, and with only one example image it is quite difficult to advise. For instance are the LED always green? we don't know.
That aside, imho, two good places to start would be with the ol' faithful template matching, or blob detection.
Then if that is not robust enough, you will need to look at some alternative representations of the template/blob, like the classic HoG (good for shape, maybe a bit heavy this app.), or even your own bespoke one that encodes your own domain specific knowledge of this problem.
Then if that is not robust enough, build a dataset of representative +ve and -ve examples, as big as you can, and then train a machine like svm , or a boosted classifier.
Template Matching:
http://docs.opencv.org/doc/tutorials/imgproc/histograms/template_matching/template_matching.html
Blob detection:
https://code.google.com/p/cvblob/
Machine Learning:
http://docs.opencv.org/modules/ml/doc/ml.html
TIPS:
Add as much domain knowledge as possible, i.e. if they are always green, use color in the representation, like hog on g channel for instance. If they are always circular, try to encode that, like use a log-polar grid in the template,rather than a regular grid... and so on.
Machine Learning is not magic, a linear classifier will essentially weight different points in the feature space, so you still require a good representation, so if the Template matching was a total fail, the it is unlikely that simple linear ml with help, but if the Template matching was okay, then ml may well boost the performance to a good level.
step 1: Remove the black background.
step 2: A snake algorithm can be used to find the boundaries of your object

Clustering Method Selection in High-Dimension?

If the data to cluster are literally points (either 2D (x, y) or 3D (x, y,z)), it would be quite intuitive to choose a clustering method. Because we can draw them and visualize them, we somewhat know better which clustering method is more suitable.
e.g.1 If my 2D data set is of the formation shown in the right top corner, I would know that K-means may not be a wise choice here, whereas DBSCAN seems like a better idea.
However, just as the scikit-learn website states:
While these examples give some intuition about the algorithms, this
intuition might not apply to very high dimensional data.
AFAIK, in most of the piratical problems we don't have such simple data. Most probably, we have high-dimensional tuples, which cannot be visualized like such, as data.
e.g.2 I wish to cluster a data set where each data is represented as a 4-D tuple <characteristic1, characteristic2, characteristic3, characteristic4>. I CANNOT visualize it in a coordinate system and observes its distribution like before. So I will NOT be able to say DBSCAN is superior to K-means in this case.
So my question:
How does one choose the suitable clustering method for such an "invisualizable" high-dimensional case?
"High-dimensional" in clustering probably starts at some 10-20 dimensions in dense data, and 1000+ dimensions in sparse data (e.g. text).
4 dimensions are not much of a problem, and can still be visualized; for example by using multiple 2d projections (or even 3d, using rotation); or using parallel coordinates. Here's a visualization of the 4-dimensional "iris" data set using a scatter plot matrix.
However, the first thing you still should do is spend a lot of time on preprocessing, and finding an appropriate distance function.
If you really need methods for high-dimensional data, have a look at subspace clustering and correlation clustering, e.g.
Kriegel, Hans-Peter, Peer Kröger, and Arthur Zimek. Clustering high-dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering. ACM Transactions on Knowledge Discovery from Data (TKDD) 3.1 (2009): 1.
The authors of that survey also publish a software framework which has a lot of these advanced clustering methods (not just k-means, but e.h. CASH, FourC, ERiC): ELKI
There are at least two common, generic approaches:
One can use some dimensionality reduction technique in order to actually visualize the high dimensional data, there are dozens of popular solutions including (but not limited to):
PCA - principal component analysis
SOM - self-organizing maps
Sammon's mapping
Autoencoder Neural Networks
KPCA - kernel principal component analysis
Isomap
After this one goes back to the original space and use some techniques that seems resonable based on observations in the reduced space, or performs clustering in the reduced space itself.First approach uses all avaliable information, but can be invalid due to differences induced by the reduction process. While the second one ensures that your observations and choice is valid (as you reduce your problem to the nice, 2d/3d one) but it loses lots of information due to transformation used.
One tries many different algorithms and choose the one with the best metrics (there have been many clustering evaluation metrics proposed). This is computationally expensive approach, but has a lower bias (as reducting the dimensionality introduces the information change following from the used transformation)
It is true that high dimensional data cannot be easily visualized in an euclidean high dimensional data but it is not true that there are no visualization techniques for them.
In addition to this claim I will add that with just 4 features (your dimensions) you can easily try the parallel coordinates visualization method. Or simply try a multivariate data analysis taking two features at a time (so 6 times in total) to try to figure out which relations intercour between the two (correlation and dependency generally). Or you can even use a 3d space for three at a time.
Then, how to get some info from these visualizations? Well, it is not as easy as in an euclidean space but the point is to spot visually if the data clusters in some groups (eg near some values on an axis for a parallel coordinate diagram) and think if the data is somehow separable (eg if it forms regions like circles or line separable in the scatter plots).
A little digression: the diagram you posted is not indicative of the power or capabilities of each algorithm given some particular data distributions, it simply highlights the nature of some algorithms: for instance k-means is able to separate only convex and ellipsoidail areas (and keep in mind that convexity and ellipsoids exist even in N-th dimensions). What I mean is that there is not a rule that says: given the distributiuons depicted in this diagram, you have to choose the correct clustering algorithm consequently.
I suggest to use a data mining toolbox that lets you explore and visualize the data (and easily transform them since you can change their topology with transformations, projections and reductions, check the other answer by lejlot for that) like Weka (plus you do not have to implement all the algorithms by yourself.
In the end I will point you to this resource for different cluster goodness and fitness measures so you can compare the results rfom different algorithms.
I would also suggest soft subspace clustering, a pretty common approach nowadays, where feature weights are added to find the most relevant features. You can use these weights to increase performance and improve the BMU calculation with euclidean distance, for example.

How to interpolate between data points?

I am currently developing a piece of software using opencv and qt that plots data points. I need to be able fill in an image from incomplete data. I want to interpolate between the points I have. Can anyone recommend a library or function that could help me. I thought maybe the opencv reMap method but I can't seem to get that to work.
The data is a 2-d matrix of intensity values. I want to create an image of some sort. Its a school project.
Interpolation is a complex subject. There are infinitely many ways to interpolate a set of points, and this assuming that you truly do wish to do interpolation, and not smoothing of any sort. (An interpolant reproduces the original data points exactly.) And of course, the 2-d nature of this problem makes things more difficult.
There are several common schemes for interpolation of scattered data in 2-d. Actually, for those who have access to it, a very nice paper is available (Richard Franke, "Scattered data interpolation: Tests of some methods", Mathematics of Computation, 1982.)
Perhaps the most common method used is based on a triangulation of your data. Merely build a triangulation of the domain from your data points. Then any point inside the convex hull of the data must lie inside exactly one of the triangles, or it will be on a shared edge. This allows you to interpolate linearly inside the triangle. If you are using MATLAB, then the function griddata is available for this express purpose.)
The problem when trying to populate a complete rectangular image from scattered points is that very likely the data does not extend to the 4 corners of the array. In that event, a triangulation based scheme will fail, since the corners of the array do not lie inside the convex hull of the scattered points. An alternative then is to use "radial basis functions" (often abbreviated RBF). There are many such schemes to be found, including Kriging, when used by the geostatistics community.
http://en.wikipedia.org/wiki/Kriging
Finally, inpainting is the name for a scheme of interpolation where elements are given in an array, but where there are missing elements. The name obviously refers to that done by an art conservator who needs to repair a tear or rip in a valuable piece of artwork.
http://en.wikipedia.org/wiki/Inpainting
The idea behind inpainting is typically to formulate a boundary value problem. That is, define a partial differential equation on the region where there is a hole. Using the known boundary values, fill in the hole by solving the PDE for the unknown elements. This can be computationally intensive if there are a huge number of unknown elements, since it typically requires the solution of at least a massive sparse system of linear equations. If the PDE is a nonlinear one, then it becomes a more intensive problem yet. A simple, reasonably good choice for the PDE is the Laplacian, which results in a linear system that extrapolates well. Again, I can offer a solution for a MATLAB user.
http://www.mathworks.com/matlabcentral/fileexchange/4551
Better choices for the PDE may come from nonlinear PDEs. Once such is the Navier/Stokes equation. It is well suited to modeling the types of surfaces typically seen, but it is also more difficult to deal with. As in many facets of life, you get what you pay for.
Phew! Big subject.
The "right" answer depends a lot on your problem domain and various details of what you're doing.
Interpolating in more than 1 dimension requires making some choices. I'll assume that you are plotting on a regular grid, but that some of your grid points have no data. Big question: are the missing points sparse, or do they make big blobs?
You can't add information, so you're just trying to establish something that will look OK.
Conceptually simple suggestion (but the implementation may be some work):
For each region on missing data, identify all the edge points. That is find the x's in this figure
oooxxooo
oox..xoo
oox...xo
ox..xxoo
oox.xooo
oooxoooo
where the .'s are the points missing data, and the x's and o's have data (for a single missing point, this will be the four nearest neighbors). Fill in each missing data point with an average over the edge points around this blob. To make it smooth, weight each point by 1/d where d is the taxidriver distance (delta x + delta y) between the two points..
From before we had any details:
In the absence of that kind of information, have you tried straight ahead linear interpolation? If your data is reasonably dense this might do it for you, and it is simple enough to code in-line when you need it.
Next step is usually a cubic spline, but for that you'll probably want to grab an existing implementation.
When I need something more powerful than a quick linear interpolation, I usually use ROOT (and pick one of the TSpline classes), but this may be more overhead than you need.
As noted in the comments, ROOT is big, and while it is fast, it does try to force you to do things the ROOT way, so it can have a big effect on your program.
A linear interpolation between (or indeed extrapolation from) two points (x1, y1) and (x2, y2) gives you
y_i = (x_i-x1)*(y2-y1)/(x2-x1)
Considering this is a simple school project, probably the easiest interpolation technique to implement is the "Nearest Neighbors"
For each missing data point you find the nearest "filled" data point and use that as the value.
If you want to improve the retults a little bit more, then you can lets say, find K nearest data points, and use their weighted average as the value of your missing data point.
the weight could be proportional to the distance of the point from the missing data point.
There are zillion other techniques, but nearest neighbor is probably the easiest to implement.
if I understand that your need is as follows.
I think you have a subset of x,y,Intensity for a dimension of L by W and you want to fill for all X ranging from 0 to L and Y ranging from 0 to W.
If this is your question, then solution is to get other intensities by using Filters.
I think Bayer filter or Gaussian filter would do the job for you.
You can google these filters and you will get answers to implement.
Best of luck.

Resources