I have visualized a dataset in 2D after employing PCA. 1 dimension is time and the Y dimension is First PCA component. As figure shows, there is relatively good separation between points (A, B). But unfortunately clustering methods (DBSCAN, SMO, KMEANS, Hierarchical) are not able to cluster these points in 2 clusters. As you see in section A there is a relative continuity and this continuous process is finished and Section B starts and there is rather big gap in comparison to past data between A and B.
I will be so grateful if you can introduce me any method and algorithm (or devising any metric from data considering its distribution) to be able to do separation between A and B without visualization. Thank you so much.
This is plot of 2 PCA components for the above plot(the first one). The other one is also the plot of components of other dataset which I get bad result,too.
This is a time series, and apparently you are looking for change points or want to segment this time series.
Do not treat this data set as a two dimensional x-y data set, and don't use clustering here; rather choose an algorithm that is actually designed for time series.
As a starter, plot series[x] - series[x-1], i.e. the first derivative. You may need to remove seasonality to improve results. No clustering algorithm will do this, they do not have a notion of seasonality or time.
If PCA gives you a good separation, you can just try to cluster after projecting your data through your PCA eigenvectors. If you don't want to use PCA, then you will need anyway an alternative data projection method, because failing clustering methods imply that your data is not separable in the original dimensions. You can take a look at non linear clustering methods such as the kernel based ones or spectral clustering for example. Or to define your own non-euclidian metric, which is in fact just another data projection method.
But using PCA clearly seems to be the best fit in your case (Occam razor : use the simplest model that fits your data).
I don't know that you'll have an easy time devising an algorithm to handle this case, which is dangerously (by present capabilities) close to "read my mind" clustering. You have a significant alley where you've marked the division. You have one nearly as good around (1700, +1/3), and an isolate near (1850, 0.45). These will make it hard to convince a general-use algorithm to make exactly one division at the spot you want, although that one is (I think) still the most computationally obvious.
Spectral clustering works well at finding gaps; I'd try that first. You might have to ask it for 3 or 4 clusters to separate the one you want in general. You could also try playing with SVM (good at finding alleys in data), but doing that in an unsupervised context is the tricky part.
No, KMeans is not going to work; it isn't sensitive to density or connectivity.
Related
I have a dataset that overlaps a lot. So far my results with SVM are not good. Do you have any recomendations for a model that may be able to differ between these 2 datasets?
Scatter plot from both classes
It is easy to fit the dataset by interpolation of one of the classes and predicting the other one otherwise. The problem with this approach is though, that it will not generalize well. The question you have to ask yourself is, if you can predict the class of a point given its attributes. If not then every ML algorithm will also fail to do so.
Then the only reasonable thing you can do is to collect more data and more attributes for every point. Maybe by adding a third dimension you can seperate the data more easily.
If the data is overlapping so much, both should be of the same class, but we know they are not. So, there is/are some feature(s) or variable(s) that is/are separating these data points into two classes. Try to add more features for data.
And sometimes, just transforming the data into a different scale can help.
Both the classes need not be equally distributed, as skewed data distribution can be handled separately.
First of all, what is your criterion for "good results"? What style of SVM did you use? Simple linear will certainly fail for most concepts of "good", but a seriously convoluted Gaussian kernel might dredge something out of the handfuls of contiguous points in the upper regions of the plot.
I suggest that you run some basic statistics on the data you've presented, to see whether they're actually as separable as you'd want. I suggest a T-test for starters.
If you have other dimensions, I strongly recommend that you use them. Start with the greatest amount of input you can handle, and reduce from there (principal component analysis). Until we know the full shape and distribution of the data, there's not much hope of identifying a useful algorithm.
That said, I'll make a pre-emptive suggestion that you look into spectral clustering algorithms when you add the other dimensions. Some are good with density, some with connectivity, while others key on gaps.
I am new to machine learning and data analysis and I'm struggling to cluster my data. I'm working with about 40,000 observations with 6 features.
I have tried various clustering methods including K-Means, DBSCAN, and also attempted scipy hierarchical clustering with linkage. During preprocessing missing data is imputed and all of the data is normalized. Once I complete PCA to reduce the dimensions from 4 to 6 my data looks like a crescent moon shape that can be seen below as the blue dots.
I determined that using 10 clusters for K-means would be best based on silhouette coefficient analysis and this is the result:
The result does not change much when performing PCA after the data has been clustered.
DBSCAN itself decides on 4 clusters and gives 4 clusters but with most of the data excluded from these clusters and depicted as noise.
For the hierarchical method the data usage was too much when trying to perform linkage() and kept providing a memory error message.
Is there any way I can cluster my data? Is the shape of my data (a crescent moon) lend itself to other modelling methods?
Don't run clustering without thinking first
Clustering algorithms must not be used as black boxes. They need to be carefully used or you get out only garbage. And to use them right, you need to understand the objective of each algorithm. K-means is a least squares approach. if you use it on badly normalized data, it fails.
Judging from your plot, there is a bad record in your database, largely causing that "moon" shape: everything needs tp be as far away as possible from that bad record.
Apart from that: 1. did you scale the data correctly for your problem? 2. did you choose the appropriate distance measure?
If the data to cluster are literally points (either 2D (x, y) or 3D (x, y,z)), it would be quite intuitive to choose a clustering method. Because we can draw them and visualize them, we somewhat know better which clustering method is more suitable.
e.g.1 If my 2D data set is of the formation shown in the right top corner, I would know that K-means may not be a wise choice here, whereas DBSCAN seems like a better idea.
However, just as the scikit-learn website states:
While these examples give some intuition about the algorithms, this
intuition might not apply to very high dimensional data.
AFAIK, in most of the piratical problems we don't have such simple data. Most probably, we have high-dimensional tuples, which cannot be visualized like such, as data.
e.g.2 I wish to cluster a data set where each data is represented as a 4-D tuple <characteristic1, characteristic2, characteristic3, characteristic4>. I CANNOT visualize it in a coordinate system and observes its distribution like before. So I will NOT be able to say DBSCAN is superior to K-means in this case.
So my question:
How does one choose the suitable clustering method for such an "invisualizable" high-dimensional case?
"High-dimensional" in clustering probably starts at some 10-20 dimensions in dense data, and 1000+ dimensions in sparse data (e.g. text).
4 dimensions are not much of a problem, and can still be visualized; for example by using multiple 2d projections (or even 3d, using rotation); or using parallel coordinates. Here's a visualization of the 4-dimensional "iris" data set using a scatter plot matrix.
However, the first thing you still should do is spend a lot of time on preprocessing, and finding an appropriate distance function.
If you really need methods for high-dimensional data, have a look at subspace clustering and correlation clustering, e.g.
Kriegel, Hans-Peter, Peer Kröger, and Arthur Zimek. Clustering high-dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering. ACM Transactions on Knowledge Discovery from Data (TKDD) 3.1 (2009): 1.
The authors of that survey also publish a software framework which has a lot of these advanced clustering methods (not just k-means, but e.h. CASH, FourC, ERiC): ELKI
There are at least two common, generic approaches:
One can use some dimensionality reduction technique in order to actually visualize the high dimensional data, there are dozens of popular solutions including (but not limited to):
PCA - principal component analysis
SOM - self-organizing maps
Sammon's mapping
Autoencoder Neural Networks
KPCA - kernel principal component analysis
Isomap
After this one goes back to the original space and use some techniques that seems resonable based on observations in the reduced space, or performs clustering in the reduced space itself.First approach uses all avaliable information, but can be invalid due to differences induced by the reduction process. While the second one ensures that your observations and choice is valid (as you reduce your problem to the nice, 2d/3d one) but it loses lots of information due to transformation used.
One tries many different algorithms and choose the one with the best metrics (there have been many clustering evaluation metrics proposed). This is computationally expensive approach, but has a lower bias (as reducting the dimensionality introduces the information change following from the used transformation)
It is true that high dimensional data cannot be easily visualized in an euclidean high dimensional data but it is not true that there are no visualization techniques for them.
In addition to this claim I will add that with just 4 features (your dimensions) you can easily try the parallel coordinates visualization method. Or simply try a multivariate data analysis taking two features at a time (so 6 times in total) to try to figure out which relations intercour between the two (correlation and dependency generally). Or you can even use a 3d space for three at a time.
Then, how to get some info from these visualizations? Well, it is not as easy as in an euclidean space but the point is to spot visually if the data clusters in some groups (eg near some values on an axis for a parallel coordinate diagram) and think if the data is somehow separable (eg if it forms regions like circles or line separable in the scatter plots).
A little digression: the diagram you posted is not indicative of the power or capabilities of each algorithm given some particular data distributions, it simply highlights the nature of some algorithms: for instance k-means is able to separate only convex and ellipsoidail areas (and keep in mind that convexity and ellipsoids exist even in N-th dimensions). What I mean is that there is not a rule that says: given the distributiuons depicted in this diagram, you have to choose the correct clustering algorithm consequently.
I suggest to use a data mining toolbox that lets you explore and visualize the data (and easily transform them since you can change their topology with transformations, projections and reductions, check the other answer by lejlot for that) like Weka (plus you do not have to implement all the algorithms by yourself.
In the end I will point you to this resource for different cluster goodness and fitness measures so you can compare the results rfom different algorithms.
I would also suggest soft subspace clustering, a pretty common approach nowadays, where feature weights are added to find the most relevant features. You can use these weights to increase performance and improve the BMU calculation with euclidean distance, for example.
I'm working on an algorithm that makes a guess at K for kmeans clustering. I guess I'm looking for a data set that I could use as a comparison, or maybe a few data sets where the number of clusters is "known" so I could see how my algorithm is doing at guessing K.
I would first check the UCI repository for data sets:
http://archive.ics.uci.edu/ml/datasets.html?format=&task=clu&att=&area=&numAtt=&numIns=&type=&sort=nameUp&view=table
I believe there are some in there with the labels.
There are text clustering data sets that are frequently used in papers as baselines, such as 20newsgroups:
http://qwone.com/~jason/20Newsgroups/
Another great method (one that my thesis chair always advocated) is to construct your own small example data set. The best way to go about this is to start small, try something with only two or three variables that you can represent graphically, and then label the clusters yourself.
The added benefit of a small, homebrew data set is that you know the answers and it is great for debugging.
Since you are focused on k-means, have you considered using the various measures (Silhouette, Davies–Bouldin etc.) to find the optimal k?
In reality, the "optimal" k may not be a good choice. Most often, one does want to choose a much larger k, then analyze the resulting clusters / prototypes in more detail to build clusters out of multiple k-means partitions.
The iris flower dataset is a good one to start with, that clustering works nicely on.
Download here
How should I approach a situtation when I try to apply some ML algorithm (classification, to be more specific, SVM in particular) over some high dimensional input, and the results I get are not quite satisfactory?
1, 2 or 3 dimensional data can be visualized, along with the algorithm's results, so you can get the hang of what's going on, and have some idea how to aproach the problem. Once the data is over 3 dimensions, other than intuitively playing around with the parameters I am not really sure how to attack it?
What do you do to the data? My answer: nothing. SVMs are designed to handle high-dimensional data. I'm working on a research problem right now that involves supervised classification using SVMs. Along with finding sources on the Internet, I did my own experiments on the impact of dimensionality reduction prior to classification. Preprocessing the features using PCA/LDA did not significantly increase classification accuracy of the SVM.
To me, this totally makes sense from the way SVMs work. Let x be an m-dimensional feature vector. Let y = Ax where y is in R^n and x is in R^m for n < m, i.e., y is x projected onto a space of lower dimension. If the classes Y1 and Y2 are linearly separable in R^n, then the corresponding classes X1 and X2 are linearly separable in R^m. Therefore, the original subspaces should be "at least" as separable as their projections onto lower dimensions, i.e., PCA should not help, in theory.
Here is one discussion that debates the use of PCA before SVM: link
What you can do is change your SVM parameters. For example, with libsvm link, the parameters C and gamma are crucially important to classification success. The libsvm faq, particularly this entry link, contains more helpful tips. Among them:
Scale your features before classification.
Try to obtain balanced classes. If impossible, then penalize one class more than the other. See more references on SVM imbalance.
Check the SVM parameters. Try many combinations to arrive at the best one.
Use the RBF kernel first. It almost always works best (computationally speaking).
Almost forgot... before testing, cross validate!
EDIT: Let me just add this "data point." I recently did another large-scale experiment using the SVM with PCA preprocessing on four exclusive data sets. PCA did not improve the classification results for any choice of reduced dimensionality. The original data with simple diagonal scaling (for each feature, subtract mean and divide by standard deviation) performed better. I'm not making any broad conclusion -- just sharing this one experiment. Maybe on different data, PCA can help.
Some suggestions:
Project data (just for visualization) to a lower-dimensional space (using PCA or MDS or whatever makes sense for your data)
Try to understand why learning fails. Do you think it overfits? Do you think you have enough data? Is it possible there isn't enough information in your features to solve the task you are trying to solve? There are ways to answer each of these questions without visualizing the data.
Also, if you tell us what the task is and what your SVM output is, there may be more specific suggestions people could make.
You can try reducing the dimensionality of the problem by PCA or the similar technique. Beware that PCA has two important points. (1) It assumes that the data it is applied to is normally distributed and (2) the resulting data looses its natural meaning (resulting in a blackbox). If you can live with that, try it.
Another option is to try several parameter selection algorithms. Since SVM's were already mentioned here, you might try the approach of Chang and Li (Feature Ranking Using Linear SVM) in which they used linear SVM to pre-select "interesting features" and then used RBF - based SVM on the selected features. If you are familiar with Orange, a python data mining library, you will be able to code this method in less than an hour. Note that this is a greedy approach which, due to its "greediness" might fail in cases where the input variables are highly correlated. In that case, and if you cannot solve this problem with PCA (see above), you might want to go to heuristic methods, which try to select best possible combinations of predictors. The main pitfall of this kind of approaches is the high potential of overfitting. Make sure you have a bunch "virgin" data that was not seen during the entire process of model building. Test your model on that data only once, after you are sure that the model is ready. If you fail, don't use this data once more to validate another model, you will have to find a new data set. Otherwise you won't be sure that you didn't overfit once more.
List of selected papers on parameter selection:
Feature selection for high-dimensional genomic microarray data
Oh, and one more thing about SVM. SVM is a black box. You better figure out what is the mechanism that generate the data and model the mechanism and not the data. On the other hand, if this would be possible, most probably you wouldn't be here asking this question (and I wouldn't be so bitter about overfitting).
List of selected papers on parameter selection
Feature selection for high-dimensional genomic microarray data
Wrappers for feature subset selection
Parameter selection in particle swarm optimization
I worked in the laboratory that developed this Stochastic method to determine, in silico, the drug like character of molecules
I would approach the problem as follows:
What do you mean by "the results I get are not quite satisfactory"?
If the classification rate on the training data is unsatisfactory, it implies that either
You have outliers in your training data (data that is misclassified). In this case you can try algorithms such as RANSAC to deal with it.
Your model(SVM in this case) is not well suited for this problem. This can be diagnozed by trying other models (adaboost etc.) or adding more parameters to your current model.
The representation of the data is not well suited for your classification task. In this case preprocessing the data with feature selection or dimensionality reduction techniques would help
If the classification rate on the test data is unsatisfactory, it implies that your model overfits the data:
Either your model is too complex(too many parameters) and it needs to be constrained further,
Or you trained it on a training set which is too small and you need more data
Of course it may be a mixture of the above elements. These are all "blind" methods to attack the problem. In order to gain more insight into the problem you may use visualization methods by projecting the data into lower dimensions or look for models which are suited better to the problem domain as you understand it (for example if you know the data is normally distributed you can use GMMs to model the data ...)
If I'm not wrong, you are trying to see which parameters to the SVM gives you the best result. Your problem is model/curve fitting.
I worked on a similar problem couple of years ago. There are tons of libraries and algos to do the same. I used Newton-Raphson's algorithm and a variation of genetic algorithm to fit the curve.
Generate/guess/get the result you are hoping for, through real world experiment (or if you are doing simple classification, just do it yourself). Compare this with the output of your SVM. The algos I mentioned earlier reiterates this process till the result of your model(SVM in this case) somewhat matches the expected values (note that this process would take some time based your problem/data size.. it took about 2 months for me on a 140 node beowulf cluster).
If you choose to go with Newton-Raphson's, this might be a good place to start.