I have a number of time series data sets, which I want to transform to dft signals in order to reduce dimensionality. After transforming to dft, I want to cluster the resulting dft data sets using k-means algorithm.
Since dft signals contain an imaginary number how can one cluster them?
You could simply treat the imaginary part as another component in your vectors. In other applications, you will want to ignore it!
But you'll be facing other, more severe challenges.
Data mining, and clustering in particular, is rarely as easy as applying function a (DFT) and function b (k-means) and then you have the result, hooray. Sorry - that is not how exploratory data mining works.
First of all, for many time series, DFT will not be helpful at all. On others, you will first have to do appropriate resampling, or segmentation, or get rid of uninteresting effects such as seasonality. Even if DFT works, it may emphasize artifacts such as the sampling frequency or some interferences.
And then you'll run into one major problem: k-means is based on the assumption that all attributes have the same importance. And DFT is based on the very opposite idea: the first components capture most of the signal, the later ones only minor deviations from it (and that is the very motivation for using this as dimensionality reduction).
So based on this intuition, you maybe never should apply k-means on DFT coefficients at all. At the same time, data mining has repeatedly shown that approaches that are "statistical nonsense" can nevertheless provide useful results... so you can try, but verify your results with care, and avoid being too enthusiastic or optimistic.
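As a minimal sketch of "treat the imaginary part as another component": the snippet below turns each real-valued series into a purely real feature vector by keeping the first few DFT coefficients and placing real and imaginary parts side by side. The function name dft_features, the parameter n_coeffs and the toy sine data are illustrative assumptions, not part of the question.

```python
import numpy as np

def dft_features(series, n_coeffs):
    """Map each real-valued series (one per row) to a real feature vector
    built from its first n_coeffs DFT coefficients (hypothetical helper)."""
    series = np.asarray(series, dtype=float)
    # rfft returns the non-redundant half of the spectrum of a real signal.
    spectrum = np.fft.rfft(series, axis=1)[:, :n_coeffs]
    # Real and imaginary parts become separate, ordinary real-valued columns.
    return np.hstack([spectrum.real, spectrum.imag])

# Toy data (assumed): two sine waves of different frequency, 64 samples each.
t = np.arange(64)
data = np.vstack([np.sin(2 * np.pi * 2 * t / 64),
                  np.sin(2 * np.pi * 5 * t / 64)])
features = dft_features(data, n_coeffs=8)  # shape (2, 16)
```

The resulting rows can be handed to any vector-based clustering routine. Using np.abs(spectrum) instead would give a phase-insensitive representation. Neither choice addresses the weighting concern raised above, so the result still needs careful verification.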
With the help of the FFT, you can convert each data set into DFT signals. It computes the DFT efficiently for each small data set.
I have visualized a dataset in 2D after employing PCA. The X dimension is time and the Y dimension is the first PCA component. As the figure shows, there is relatively good separation between points (A, B). But unfortunately clustering methods (DBSCAN, SMO, KMEANS, Hierarchical) are not able to cluster these points into 2 clusters. As you can see, section A is relatively continuous; this continuous process ends, section B starts, and the gap between A and B is rather big in comparison to past data.
I would be very grateful if you could point me to any method or algorithm (or a way of devising a metric from the data, considering its distribution) that can separate A and B without visualization. Thank you so much.
This is a plot of 2 PCA components for the above plot (the first one). The other one is a plot of the components of another dataset, for which I get bad results too.
This is a time series, and apparently you are looking for change points or want to segment this time series.
Do not treat this data set as a two dimensional x-y data set, and don't use clustering here; rather choose an algorithm that is actually designed for time series.
As a starter, plot series[x] - series[x-1], i.e. the first derivative. You may need to remove seasonality to improve results. No clustering algorithm will do this, they do not have a notion of seasonality or time.
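A minimal sketch of that first step, assuming a plain one-dimensional array: compute series[x] - series[x-1] and look at where the absolute difference is largest. The helper name largest_jump and the synthetic series are illustrative assumptions. Real data would likely need detrending or deseasonalizing first, as noted above.

```python
import numpy as np

def largest_jump(series):
    """Return the index at which the biggest step occurs, plus all
    first differences (hypothetical helper for illustration)."""
    series = np.asarray(series, dtype=float)
    diffs = np.diff(series)  # diffs[i] = series[i + 1] - series[i]
    # +1 so the index points at the first sample after the jump.
    change_index = int(np.argmax(np.abs(diffs))) + 1
    return change_index, diffs

# Synthetic example (assumed): a smooth segment A, then a gap, then segment B.
series = np.concatenate([np.linspace(0.0, 1.0, 50),
                         np.linspace(3.0, 4.0, 50)])
change_index, diffs = largest_jump(series)
```

Plotting diffs against time makes the gap between the two segments show up as a single spike, which is much easier to threshold than the raw 2D scatter.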
If PCA gives you a good separation, you can just try to cluster after projecting your data through your PCA eigenvectors. If you don't want to use PCA, then you will need an alternative data projection method anyway, because failing clustering methods imply that your data is not separable in the original dimensions. You can take a look at non-linear clustering methods such as the kernel-based ones or spectral clustering, for example. Or you can define your own non-Euclidean metric, which is in fact just another data projection method.
But using PCA clearly seems to be the best fit in your case (Occam's razor: use the simplest model that fits your data).
I don't know that you'll have an easy time devising an algorithm to handle this case, which is dangerously (by present capabilities) close to "read my mind" clustering. You have a significant alley where you've marked the division. You have one nearly as good around (1700, +1/3), and an isolate near (1850, 0.45). These will make it hard to convince a general-use algorithm to make exactly one division at the spot you want, although that one is (I think) still the most computationally obvious.
Spectral clustering works well at finding gaps; I'd try that first. You might have to ask it for 3 or 4 clusters to separate the one you want in general. You could also try playing with SVM (good at finding alleys in data), but doing that in an unsupervised context is the tricky part.
No, KMeans is not going to work; it isn't sensitive to density or connectivity.
I have been working through the concepts of principal component analysis in R.
I am comfortable with applying PCA to a (say, labeled) dataset and ultimately extracting out the most interesting first few principal components as numeric variables from my matrix.
The ultimate question is, in a sense, now what? Most of the reading I've come across on PCA immediately halts after the computations are done, especially with regards to machine learning. Pardon my hyperbole, but I feel as if everyone agrees that the technique is useful, but nobody wants to actually use it after they do it.
More specifically, here's my real question:
I respect that principal components are linear combinations of the variables you started with. So, how does this transformed data play a role in supervised machine learning? How could someone ever use PCA as a way to reduce dimensionality of a dataset, and THEN, use these components with a supervised learner, say, SVM?
I'm absolutely confused about what happens to our labels. Once we are in eigenspace, great. But I don't see any way to continue to move forward with machine learning if this transformation blows apart our concept of classification (unless there's some linear combination of "Yes" or "No" I haven't come across!)
Please step in and set me straight if you have the time and wherewithal. Thanks in advance.
Old question, but I don't think it's been satisfactorily answered (and I just landed here myself through Google). I found myself in your same shoes and had to hunt down the answer myself.
The goal of PCA is to represent your data X in an orthonormal basis W; the coordinates of your data in this new basis are Z, as expressed below:
$$Z = XW$$
Because of orthonormality, we can invert W simply by transposing it and write:
$$X = ZW^T$$
Now to reduce dimensionality, let's pick some number of components k < p. Assuming our basis vectors in W are ordered from largest to smallest (i.e., eigenvector corresponding to the largest eigenvalue is first, etc.), this amounts to simply keeping the first k columns of W, which we call W_k:
$$Z_k = XW_k$$
Now we have a k-dimensional representation of our training data X. Now you run some supervised classifier using the new features in Z_k. The labels are untouched: row i of Z_k still describes the same sample as row i of X, so it keeps the same label.
The key is to realize that W_k is in some sense a canonical transformation from our space of p features down to a space of k features (or at least the best transformation we could find using our training data). Thus, we can hit our test data with the same W_k transformation, resulting in a k-dimensional set of test features:
$$Z_{test} = X_{test}W_k$$
We can now use the same classifier trained on the k-dimensional representation of our training data to make predictions on the k-dimensional representation of our test data:
$$\hat{Y}_{test} = \mathrm{classifier}(Z_{test})$$
The point of going through this whole procedure is that you may have thousands of features, but (1) not all of them are going to have a meaningful signal and (2) your supervised learning method may be far too complex to train on the full feature set (either it would take too long or your computer wouldn't have enough memory to process the calculations). PCA allows you to dramatically reduce the number of features it takes to represent your data without eliminating features of your data that truly add value.
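The procedure described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the helper names fit_pca and transform, the random data, and the explicit mean-centering step are mine (the text above implicitly assumes centered data). The basis is learned from the training rows only and then reused unchanged on the test rows.

```python
import numpy as np

def fit_pca(X_train, k):
    """Learn the mean and the first k principal directions from training data."""
    mean = X_train.mean(axis=0)
    # Rows of Vt are principal directions, ordered by decreasing singular value.
    _, _, Vt = np.linalg.svd(X_train - mean, full_matrices=False)
    W_k = Vt[:k].T  # p x k matrix with orthonormal columns
    return mean, W_k

def transform(X, mean, W_k):
    """Project rows of X onto the k directions learned from the training data."""
    return (X - mean) @ W_k

# Random stand-in data (assumed): 100 training and 20 test rows, p = 10 features.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 10))
X_test = rng.normal(size=(20, 10))

mean, W_k = fit_pca(X_train, k=3)
Z_train = transform(X_train, mean, W_k)  # train the classifier on (Z_train, y_train)
Z_test = transform(X_test, mean, W_k)    # predict on Z_test with the same classifier
```

Because the projection never reorders or merges rows, the label vector for the training set is used exactly as before; only the feature columns change.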
After you have used PCA on a portion of your data to compute the transformation matrix, you apply that matrix to each of your data points before submitting them to your classifier.
This is useful when the intrinsic dimensionality of your data is much smaller than the number of components and the gain in performance you get during classification is worth the loss in accuracy and the cost of PCA. Also, keep in mind the limitations of PCA:
In performing a linear transformation, you implicitly assume that all components are expressed in equivalent units.
Beyond variance, PCA is blind to the structure of your data. It may very well happen that the data splits along low-variance dimensions. In that case, the classifier won't learn from transformed data.
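To illustrate the first limitation (units), here is a small sketch with made-up data: one feature in metres and one in grams. The feature names, distributions and the first_pc helper are assumptions for illustration only. Without rescaling, the first principal direction is almost entirely the large-unit feature. After standardizing, both features contribute.

```python
import numpy as np

def first_pc(X):
    """First principal direction of X (hypothetical helper)."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[0]

# Made-up data: heights in metres, weights in grams.
rng = np.random.default_rng(0)
height_m = rng.normal(1.7, 0.1, size=200)
weight_g = rng.normal(70000.0, 10000.0, size=200)
X = np.column_stack([height_m, weight_g])

pc_raw = first_pc(X)  # dominated by the weight column

# Standardize each column to zero mean and unit variance before PCA.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
pc_std = first_pc(X_std)  # both columns now carry weight
```

Whether standardizing is appropriate depends on the data; it is one common way to put features on comparable scales, not the only one.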
If the data to cluster are literally points (either 2D (x, y) or 3D (x, y,z)), it would be quite intuitive to choose a clustering method. Because we can draw them and visualize them, we somewhat know better which clustering method is more suitable.
e.g.1 If my 2D data set is of the formation shown in the right top corner, I would know that K-means may not be a wise choice here, whereas DBSCAN seems like a better idea.
However, just as the scikit-learn website states:
While these examples give some intuition about the algorithms, this
intuition might not apply to very high dimensional data.
AFAIK, in most practical problems we don't have such simple data. Most probably, we have high-dimensional tuples, which cannot be visualized like that.
e.g.2 I wish to cluster a data set where each data point is represented as a 4-D tuple <characteristic1, characteristic2, characteristic3, characteristic4>. I CANNOT visualize it in a coordinate system and observe its distribution like before. So I will NOT be able to say DBSCAN is superior to K-means in this case.
So my question:
How does one choose the suitable clustering method for such an "invisualizable" high-dimensional case?
"High-dimensional" in clustering probably starts at some 10-20 dimensions in dense data, and 1000+ dimensions in sparse data (e.g. text).
4 dimensions are not much of a problem, and can still be visualized; for example by using multiple 2d projections (or even 3d, using rotation); or using parallel coordinates. Here's a visualization of the 4-dimensional "iris" data set using a scatter plot matrix.
However, the first thing you still should do is spend a lot of time on preprocessing, and finding an appropriate distance function.
If you really need methods for high-dimensional data, have a look at subspace clustering and correlation clustering, e.g.
Kriegel, Hans-Peter, Peer Kröger, and Arthur Zimek. Clustering high-dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering. ACM Transactions on Knowledge Discovery from Data (TKDD) 3.1 (2009): 1.
The authors of that survey also publish a software framework which has a lot of these advanced clustering methods (not just k-means, but e.g. CASH, FourC, ERiC): ELKI
There are at least two common, generic approaches:
One can use some dimensionality reduction technique in order to actually visualize the high dimensional data, there are dozens of popular solutions including (but not limited to):
PCA - principal component analysis
SOM - self-organizing maps
Sammon's mapping
Autoencoder Neural Networks
KPCA - kernel principal component analysis
Isomap
After this, one either goes back to the original space and uses techniques that seem reasonable based on observations in the reduced space, or performs clustering in the reduced space itself. The first approach uses all available information, but can be invalid due to differences induced by the reduction process. The second one ensures that your observations and choices are valid (as you reduce your problem to a nice 2D/3D one), but it loses a lot of information due to the transformation used.
One tries many different algorithms and chooses the one with the best metrics (many clustering evaluation metrics have been proposed). This is a computationally expensive approach, but it has a lower bias (as reducing the dimensionality changes the information through the transformation used).
It is true that high-dimensional data cannot be easily visualized in a Euclidean space, but it is not true that there are no visualization techniques for it.
In addition to this claim, I will add that with just 4 features (your dimensions) you can easily try the parallel coordinates visualization method. Or simply try a multivariate data analysis taking two features at a time (so 6 times in total) to figure out which relations occur between the two (correlation and dependency, generally). Or you can even use a 3D space for three at a time.
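The "two features at a time" idea can be sketched as follows. With 4 features there are exactly 6 unordered pairs, and a correlation coefficient per pair is a cheap first look before drawing any scatter plots. The random data, and the artificial dependency between the first two columns, are assumptions made only so the example has something to find.

```python
from itertools import combinations
import numpy as np

names = ["characteristic1", "characteristic2",
         "characteristic3", "characteristic4"]

# Made-up data: 50 tuples with 4 features; column 2 is built from column 1.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
X[:, 1] = 2.0 * X[:, 0] + 0.1 * rng.normal(size=50)

# All unordered pairs of feature indices: C(4, 2) = 6.
pairs = list(combinations(range(4), 2))

# Pearson correlation matrix over columns, then one entry per pair.
corr = np.corrcoef(X, rowvar=False)
pair_corr = {(names[i], names[j]): float(corr[i, j]) for i, j in pairs}
```

Each key in pair_corr also identifies one 2D scatter plot worth inspecting. Correlation only captures linear relations, so a pair with a low value can still hide cluster structure.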
Then, how to get some info from these visualizations? Well, it is not as easy as in an euclidean space but the point is to spot visually if the data clusters in some groups (eg near some values on an axis for a parallel coordinate diagram) and think if the data is somehow separable (eg if it forms regions like circles or line separable in the scatter plots).
A little digression: the diagram you posted is not indicative of the power or capabilities of each algorithm given some particular data distributions; it simply highlights the nature of some algorithms. For instance, k-means is able to separate only convex and ellipsoidal areas (and keep in mind that convexity and ellipsoids exist even in N dimensions). What I mean is that there is no rule that says: given the distributions depicted in this diagram, you have to choose the correct clustering algorithm accordingly.
I suggest using a data mining toolbox that lets you explore and visualize the data (and easily transform it, since you can change its topology with transformations, projections and reductions; check the other answer by lejlot for that), like Weka (plus you do not have to implement all the algorithms yourself).
In the end I will point you to this resource for different cluster goodness and fitness measures so you can compare the results from different algorithms.
I would also suggest soft subspace clustering, a pretty common approach nowadays, where feature weights are added to find the most relevant features. You can use these weights to increase performance and improve the BMU calculation with euclidean distance, for example.
I'm developing an algorithm to classify different types of dogs based off of image data. The steps of the algorithm are:
Go through all training images, detect image features (ie SURF), and extract descriptors. Collect all descriptors for all images.
Cluster within the collected image descriptors and find k "words" or centroids within the collection.
Reiterate through all images, extract SURF descriptors, and match the extracted descriptor with the closest "word" found via clustering.
Represent each image as a histogram of the words found in clustering.
Feed these image representations (feature vectors) to a classifier and train...
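For reference, the matching and histogram steps above can be sketched like this. It is a toy illustration, not the actual pipeline: the descriptors are 2-D instead of real SURF descriptors, the "words" are hard-coded rather than found by clustering, and normalizing the histogram to sum to 1 is my assumption.

```python
import numpy as np

def bow_histogram(descriptors, words):
    """Assign each descriptor to its nearest word and return the
    normalized word-count histogram for one image (hypothetical helper)."""
    # Pairwise Euclidean distances, shape (n_descriptors, n_words).
    dists = np.linalg.norm(descriptors[:, None, :] - words[None, :, :], axis=2)
    nearest = dists.argmin(axis=1)
    counts = np.bincount(nearest, minlength=len(words)).astype(float)
    return counts / counts.sum()

# Toy vocabulary of 3 "words" and 4 descriptors from one image (assumed values).
words = np.array([[0.0, 0.0], [10.0, 10.0], [0.0, 10.0]])
descriptors = np.array([[0.1, 0.2], [9.8, 10.1], [0.3, -0.1], [10.2, 9.9]])
hist = bow_histogram(descriptors, words)
```

The histogram length equals the number of words, so every image yields a feature vector of the same size regardless of how many descriptors it has.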
Now, I have run into a bit of a problem. Finding the "words" within the collection of image descriptors is a very important step. Due to the random nature of clustering, different clusters are found each time I run my program. The unfortunate result is that sometimes the accuracy of my classifier will be very good, and other times, very bad. I have chalked this up to the clustering algorithm finding "good" words sometimes, and "bad" words other times.
Does anyone know how I can hedge against the clustering algorithm from finding "bad" words? Currently I just cluster several times and take the mean accuracy of my classifier, but there must be a better way.
Thanks for taking time to read through this, and thank you for your help!
EDIT:
I am not using KMeans for classification; I am using a Support Vector Machine for classification. I am using KMeans for finding image descriptor "words", and then using these words to create histograms which describe each image. These histograms serve as feature vectors that are fed to the Support Vector Machine for classification.
There are many possible ways of making clustering repeatable:
The most basic method of dealing with k-means randomness is simply running it multiple times and selecting the best one (the one that minimizes the inner cluster distances/maximizes the between clusters distance).
One can use some fixed initialization for your data instead of randomization. There are many heuristics for starting the k-means. Or at least minimize the variance by using algorithms like k-means++.
Use a modification of k-means which guarantees a global minimum of a regularized function, e.g. convex k-means
Use a different clustering method which is deterministic, e.g. Data Nets
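The first option in the list (run several times, keep the best) combined with fixed seeds can be sketched as below. This is a bare-bones Lloyd-style k-means written for illustration, not a library implementation. The names kmeans and best_of, the seed list and the toy blobs are assumptions. "Best" here means lowest within-cluster sum of squared distances.

```python
import numpy as np

def kmeans(X, k, seed, n_iter=50):
    """Plain k-means with a seeded random start; returns centroids,
    labels and the within-cluster sum of squares (inertia)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Keep the old centroid if a cluster happens to become empty.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    inertia = float((dists.min(axis=1) ** 2).sum())
    return centroids, labels, inertia

def best_of(X, k, seeds):
    """Run k-means once per seed and keep the run with the lowest inertia."""
    runs = [kmeans(X, k, seed) for seed in seeds]
    return min(runs, key=lambda run: run[2])

# Toy data (assumed): three well-separated 2-D blobs.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(c, 0.3, size=(40, 2))
               for c in ([0, 0], [3, 3], [0, 3])])
centroids, labels, inertia = best_of(X, k=3, seeds=range(10))
```

Because the seeds are fixed, repeated calls give the same vocabulary, which makes the downstream classifier results repeatable. It does not guarantee that the chosen clustering is the best one for classification.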
I would offer two possible suggestions, in addition to those provided.
K-means optimises an objective related to the distance between cluster points and their centroids. You care about classification accuracy. Depending on the computational cost, a simple brute-force approach is to induce multiple clusterings on a subset of your training data, and evaluate the performance of each on some held-out development set for the task you care about. Then use the highest performing variant as the final model. I don't like the use of non-random initialisation because this is only a solution to avoid the randomness, not find the true global minimum of the objective, and your chosen initialisation may be useless and just produce consistently bad classifiers.
The other approach, which is much harder, is to view the k-means step as a dimensionality reduction to enable classification, and incorporate this into the classifier directly. If you use a deep neural net, the layer(s) closest to the input are essentially dimensionality reducers in the same way as the k-means clustering you induce: the difference is their weights are set wrt the error of the net on the classification problem, rather than some unrelated intermediate step. The downside is that this is much closer to a current research problem: training deep nets is hard. You could start with a standard one-hidden-layer architecture (with binary activations on the hidden layer, and using cross-entropy loss on the output layer with outputs coded as one-of-n categories), and attempt to add layers incrementally, but as far as I'm aware standard training algorithms start to behave poorly beyond the single hidden layer, so you'd need to investigate layer-wise training to initialise, or some of the Hessian-Free stuff coming out of Geoff Hinton's group in Toronto.
That is actually an important problem with the BofW approach, and you should share this prominently. SIFT data may actually not have k-means clusters at all. However, due to the nature of the algorithm, k-means will always produce k clusters. One of the things to test with k-means is to validate that the results are stable. If you get a completely different result each time, they are not much better than random.
Nevertheless, if you just want to get some working results, you can just fix the dictionary once and choose one that is working well.
Or you might look into more advanced clustering (in particular one that is more robust wrt. noise!)
I am wondering how can I claim that I correctly catch the "noise" in my data ?
To be more specific, take Principal Component Analysis as an example: we know that in PCA, after doing SVD, we can zero out the small singular values and reconstruct the original matrix using a low-rank approximation.
Then can I claim what's been ignored is indeed noise in the data ?
Is there any evaluation metric for this ?
The only method I can come up with is simply subtract the original data from the reconstructed data.
Then, try to fit a Gaussian over it, seeing if the fitness is good.
Is that a conventional method in fields like DSP?
BTW, I think in typical machine learning tasks, the measurement would be the follow up classification performance, but since I am doing purely generative model, there are no labels attached.
The way I see it, the definition of noise would depend on the domain of the problem. Therefore the strategy for reducing it would be different on each domain.
For instance, a noisy signal in problems like seismic formation classification, or a noisy image in a face classification problem, is drastically different from the noise produced by improperly tagged data in a medical diagnostic problem, or the noise caused by similar words with different meanings in a language classification problem for documents.
When the noise is due to a given data point (or a set of them), the solution is as simple as ignoring those data points (although identifying those data points is the challenging part most of the time).
From your example I guess you are more concerned about the case where the noise is embedded in the features (like in the seismic example). Sometimes people pre-process the data with a noise reduction filter like the median filter (http://en.wikipedia.org/wiki/Median_filter). In contrast, other people reduce the dimension of the data to reduce noise, and PCA is used in this scenario.
Both strategies are valid, and normally people try both and cross-validate them to see which one gave better results.
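As a small illustration of the filtering strategy, here is a sketch of a one-dimensional median filter with an odd window size. The function is written from scratch for illustration, with edge samples repeated as padding (my assumption); the input values are made up.

```python
import numpy as np

def median_filter(signal, window=3):
    """Replace each sample by the median of its neighbourhood.
    Assumes an odd window; edges are padded by repeating the end values."""
    signal = np.asarray(signal, dtype=float)
    half = window // 2
    padded = np.pad(signal, half, mode="edge")
    return np.array([np.median(padded[i:i + window])
                     for i in range(len(signal))])

# Made-up signal: two flat levels with one spike in each.
noisy = np.array([1.0, 1.0, 9.0, 1.0, 1.0, 2.0, 2.0, -5.0, 2.0])
clean = median_filter(noisy, window=3)
# clean is [1, 1, 1, 1, 1, 2, 2, 2, 2]: both spikes are gone, the step remains.
```

The median removes isolated spikes while keeping the step between the two levels, which is why it is popular for impulse-like noise.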
What you did is a good metric to check for Gaussian noise. However, for non-Gaussian noise your metric can give you false negatives (bad fitness but still good noise reduction).
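The check discussed here (reconstruct with a low-rank approximation, look at what was thrown away) can be sketched as follows. The data are synthetic and the helper names are assumptions. Instead of a full Gaussian fit, the sketch uses sample skewness and excess kurtosis of the residual as a crude stand-in; values near 0 are merely consistent with Gaussian noise, they do not prove it.

```python
import numpy as np

def low_rank_residual(X, k):
    """Return X minus its rank-k SVD approximation, plus the singular values."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    X_k = (U[:, :k] * s[:k]) @ Vt[:k]
    return X - X_k, s

def skew_and_excess_kurtosis(values):
    """Sample skewness and excess kurtosis (both about 0 for a Gaussian)."""
    z = (values - values.mean()) / values.std()
    return float((z ** 3).mean()), float((z ** 4).mean() - 3.0)

# Synthetic data (assumed): a rank-2 signal plus small Gaussian noise.
rng = np.random.default_rng(0)
signal = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 30))
X = signal + 0.1 * rng.normal(size=(200, 30))

residual, s = low_rank_residual(X, k=2)
skew, kurt = skew_and_excess_kurtosis(residual.ravel())
```

The size of the residual is fully determined by the discarded singular values (its Frobenius norm is the square root of the sum of their squares), so the only open question is its shape, which is what the moment check probes.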
Personally, if you want to prove the efficacy of the noise reduction, I'd use a task-based evaluation. I assume you're doing this for some purpose, to solve some problem? If so, solve the task with the original noisy matrix and the new clean one. If the latter works better, what was discarded was noise, for the purposes of the task you're interested in. I think some objective measure of noise is pretty hard to define.
I have found this. It is a very useful resource, but it takes a good amount of time to understand.
https://sci2s.ugr.es/noisydata#Introduction%20to%20Noise%20in%20Data%20Mining