Multi-dimensional time series data clustering

I have a data set which is time-series type and contains three dimensions, namely acceleration, speed and grade. I want to apply clustering to identify clusters that have similar speed (acceleration = 0, positive or negative) varying with grade. I do not know what type of clustering I should use; surely k-means cannot help me, because there is serial correlation between my data points (each point is affected by its previous point). Could you please help me with the type of clustering?

Popular time series similarity metrics such as DTW can be implemented for multiple variates the same way as for a single variate. The most challenging part is normalization.
You can then run hierarchical clustering trivially on the resulting distance matrix. Do not use k-means: it relies on computing Euclidean means, so it cannot work directly with a precomputed DTW distance matrix.
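A minimal sketch of this pipeline, assuming the tslearn package for multivariate DTW and scipy for the hierarchical part; the per-variate z-normalization, the three-column layout, and the choice of average linkage are assumptions for illustration:

```python
# Multivariate DTW distances + hierarchical clustering (sketch).
# `series` is a list of arrays of shape (length_i, 3) holding
# (acceleration, speed, grade), each variate z-normalized beforehand.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform
from tslearn.metrics import dtw

def cluster_series(series, n_clusters=3):
    n = len(series)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d = dtw(series[i], series[j])      # multivariate DTW distance
            dist[i, j] = dist[j, i] = d
    Z = linkage(squareform(dist), method="average")   # average linkage on the condensed matrix
    return fcluster(Z, t=n_clusters, criterion="maxclust")
```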

Related

Hierarchical Clustering

I have read some resources and I found out how hierarchical clustering works. However, when I compare it with k-means clustering, it seems to me that k-means really produces a specific number of clusters, whereas hierarchical analysis shows me how the samples can be clustered. What I mean is that I do not get a specific number of clusters in hierarchical clustering; I only get a scheme of how the clusters can be constituted and some indication of the relations between the samples.
Thus, I cannot understand where I can use this clustering method.
Hierarchical clustering (HC) is just another distance-based clustering method like k-means. The number of clusters can be roughly determined by cutting the dendrogram produced by HC. Determining the number of clusters in a data set is not an easy task for any clustering method, and usually depends on your application. Tuning the thresholds in HC may be more explicit and straightforward for researchers, especially for a very large data set. I think this question is also related.
In k-means clustering, k is a hyperparameter that you need to choose in order to divide your data points into clusters, whereas in hierarchical clustering (let's take one type of hierarchical clustering, i.e. agglomerative) you first consider every point in your dataset as its own cluster, then merge clusters based on a similarity metric, and repeat this until you get a single cluster. I will explain this with an example.
Suppose initially you have 13 points (x_1, x_2, ..., x_13) in your dataset, so at the start you have 13 clusters. Now, in the second step, let's say you get 7 clusters (x_1-x_2, x_4-x_5, x_6-x_8, x_3-x_7, x_11-x_12, x_9-x_10, x_13) based on the similarity between the points. In the third step, let's say you get 4 clusters (x_1-x_2-x_4-x_5, x_6-x_8-x_9-x_10, x_3-x_7-x_13, x_11-x_12). Continuing like this, you arrive at a step where all the points in your dataset form one cluster, which is also the last step of the agglomerative clustering algorithm.
So in agglomerative hierarchical clustering you do not need to fix the number of clusters up front: depending on your problem, if you want 7 clusters stop at the second step, if you want 4 clusters stop at the third step, and so on.
A practical advantage of hierarchical clustering is the possibility of visualizing results using a dendrogram. If you don't know in advance what number of clusters you're looking for (as is often the case...), you can use the dendrogram plot to help you choose k without needing to create separate clusterings. The dendrogram can also give great insight into the data structure, help identify outliers, etc. Hierarchical clustering is also deterministic, whereas k-means with random initialization can give you different results when run several times on the same data.
Hope this helps.
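A minimal sketch of the dendrogram-cut idea, assuming scipy and a hypothetical data array X; the 13 toy points and Ward linkage are just for illustration:

```python
# Agglomerative clustering with a dendrogram cut (sketch).
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X = np.random.rand(13, 2)                  # 13 toy points, as in the example above
Z = linkage(X, method="ward")              # build the merge tree bottom-up

dendrogram(Z)                              # inspect the tree to pick a sensible cut
plt.show()

labels_7 = fcluster(Z, t=7, criterion="maxclust")   # "stop" at 7 clusters
labels_4 = fcluster(Z, t=4, criterion="maxclust")   # or at 4 clusters
print(labels_7, labels_4)
```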

An algorithm for clustering visually separable clusters

I have visualized a dataset in 2D after employing PCA. The X dimension is time and the Y dimension is the first PCA component. As the figure shows, there is relatively good separation between the points (A, B). But unfortunately clustering methods (DBSCAN, SMO, KMEANS, Hierarchical) are not able to cluster these points into 2 clusters. As you can see, in section A there is relative continuity; this continuous process finishes, section B starts, and there is a rather big gap between A and B compared to the past data.
I would be grateful if you could suggest any method or algorithm (or a metric devised from the data, considering its distribution) that can separate A and B without visualization. Thank you so much.
This is a plot of the 2 PCA components for the first plot above. The other one is the plot of the components of another dataset for which I also get a bad result.
This is a time series, and apparently you are looking for change points or want to segment this time series.
Do not treat this data set as a two dimensional x-y data set, and don't use clustering here; rather choose an algorithm that is actually designed for time series.
As a starter, plot series[x] - series[x-1], i.e. the first derivative. You may need to remove seasonality to improve results. No clustering algorithm will do this, they do not have a notion of seasonality or time.
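A minimal sketch of that first-difference starting point; `series` is a hypothetical 1-D array holding the first PCA component ordered by time, and the file name is a placeholder:

```python
# Plot the first difference and flag the largest jump (sketch).
import numpy as np
import matplotlib.pyplot as plt

series = np.loadtxt("pc1.txt")        # placeholder for your own data
diff = np.diff(series)                # series[x] - series[x-1]

plt.plot(diff)
plt.title("First difference of the series")
plt.show()

# A large spike in |diff| is a candidate change point between segments A and B.
candidate = int(np.argmax(np.abs(diff)))
print("largest jump after index", candidate)
```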
If PCA gives you a good separation, you can just try to cluster after projecting your data through your PCA eigenvectors. If you don't want to use PCA, then you will need an alternative data projection method anyway, because the failure of the clustering methods implies that your data is not separable in the original dimensions. You can take a look at non-linear clustering methods such as kernel-based ones or spectral clustering, for example. Or you can define your own non-Euclidean metric, which is in fact just another data projection method.
But using PCA clearly seems to be the best fit in your case (Occam's razor: use the simplest model that fits your data).
I don't know that you'll have an easy time devising an algorithm to handle this case, which is dangerously (by present capabilities) close to "read my mind" clustering. You have a significant alley where you've marked the division. You have one nearly as good around (1700, +1/3), and an isolate near (1850, 0.45). These will make it hard to convince a general-use algorithm to make exactly one division at the spot you want, although that one is (I think) still the most computationally obvious.
Spectral clustering works well at finding gaps; I'd try that first. You might have to ask it for 3 or 4 clusters to separate the one you want in general. You could also try playing with SVM (good at finding alleys in data), but doing that in an unsupervised context is the tricky part.
No, KMeans is not going to work; it isn't sensitive to density or connectivity.
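A minimal sketch of the spectral clustering suggestion, assuming scikit-learn; the file name, the (time, PC1) layout of X, and scaling both columns first are assumptions:

```python
# Spectral clustering on the 2-D (time, PC1) points (sketch).
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import SpectralClustering

X = np.loadtxt("time_pc1.txt")             # placeholder for your own data
X = StandardScaler().fit_transform(X)      # put time and PC1 on comparable scales

# Ask for a few extra clusters, as suggested above, then merge or inspect them.
labels = SpectralClustering(n_clusters=3, affinity="nearest_neighbors",
                            n_neighbors=10).fit_predict(X)
print(np.bincount(labels))
```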

How can I normalize data to have the same average sum of squares?

In a lot of articles in my field, this sentence is repeated: "The 2 matrices have been normalized to have the same average sum-of-squares (computed across all subjects and all voxels for each modality)". Suppose that we have two matrices in which the rows correspond to different subjects and the columns are features (voxels). In these articles, not much explanation of the normalization method can be found. Does anybody know how I should normalize data to have the "same average sum-of-squares"? I don't understand it at all. Thanks
For a start, normalization in this context is also known as feature scaling, which pretty much sums it up. You scale your features, i.e. your data, to remove differences in variance and range of values that would otherwise distort your algorithm and your results.
https://en.wikipedia.org/wiki/Feature_scaling
In data processing, normalization is quite useful (depending on the application). E.g. in distance-based machine learning algorithms you should normalize your features in order to get a proportional contribution to the outcome of your algorithm, independent of the range of values the features comprise.
To do so, you can use different statistical measurements, like the
Sum of squares:
Σ_i (x_i − x̄)²
Other than that you could use the variance or the standard deviation of your data.
https://www.westgard.com/lesson35.htm#4
Those statistical terms can then be used to normalize your data, to improve e.g. the clustering quality of your algorithm. Which term and which method to use depends highly on the algorithms and data you're using and what you're aiming at.
Here is a paper which compares some of the approaches you could choose from for clustering:
http://maxwellsci.com/print/rjaset/v6-3299-3303.pdf
I hope this can help you a little.
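A minimal sketch of one common reading of "same average sum-of-squares", namely rescaling each matrix so that the mean of its squared entries (over all subjects and voxels) equals a common constant; the target value of 1 and the matrix shapes are assumptions:

```python
# Rescale each matrix to unit mean squared entry (sketch).
import numpy as np

def scale_to_unit_mean_square(M):
    mean_sq = np.mean(M ** 2)             # average sum-of-squares per entry
    return M / np.sqrt(mean_sq)

A = np.random.rand(20, 1000)              # hypothetical subjects x voxels matrix
B = np.random.rand(20, 1000) * 5          # second modality on a different scale

A_n, B_n = scale_to_unit_mean_square(A), scale_to_unit_mean_square(B)
print(np.mean(A_n ** 2), np.mean(B_n ** 2))   # both ≈ 1, i.e. the same average sum-of-squares
```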

More accurate approach than k-means clustering

In Radial Basis Function Network (RBF Network), all the prototypes (center vectors of the RBF functions) in the hidden layer are chosen. This step can be performed in several ways:
Centers can be randomly sampled from some set of examples.
Or, they can be determined using k-means clustering.
One of the approaches for making an intelligent selection of prototypes is to perform k-means clustering on our training set and to use the cluster centers as the prototypes.
We all know that k-means clustering is characterized by its simplicity (it is fast), but it is not very accurate.
That is why I would like to know what other approach can be more accurate than k-means clustering.
Any help will be very appreciated.
Several k-means variations exist: k-medians, Partitioning Around Medoids, Fuzzy C-Means clustering, Gaussian mixture models trained with the expectation-maximization algorithm, k-means++, etc.
I use PAM (Partitioning Around Medoids) in order to be more accurate when my dataset contains some "outliers" (noise whose values are very different from the other values) and I don't want the centers to be influenced by these points. In PAM a center is called a medoid.
There is a more statistical approach to cluster analysis, called the Expectation-Maximization Algorithm. It uses statistical analysis to determine clusters. This is probably a better approach when you have a lot of data regarding your cluster centroids and training data.
This link also lists several other clustering algorithms out there in the wild. Obviously, some are better than others, depending on the amount of data you have and/or the type of data you have.
There is a wonderful course on Udacity, Intro to Artificial Intelligence, where one lesson is dedicated to unsupervised learning, and Professor Thrun explains some clustering algorithms in very great detail. I highly recommend that course!
I hope this helps.
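A minimal sketch of the EM-based alternative mentioned above, assuming scikit-learn's GaussianMixture; the data array and the number of components are hypothetical:

```python
# Gaussian mixture model fitted with EM as a k-means alternative (sketch).
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.rand(500, 4)                 # placeholder training data

gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0).fit(X)
labels = gmm.predict(X)                    # hard assignments
proba = gmm.predict_proba(X)               # soft (probabilistic) memberships
print(gmm.means_)                          # the component means play the role of prototypes
```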
In terms of K-Means, you can run it on your sample a number of times (say, 100) and then choose the clustering (and by consequence the centroids) that has the smallest K-Means criterion output (the sum of the squared Euclidean distances between each entity and its respective centroid).
You can also use some initialization algorithms (the intelligent K-Means comes to mind, but you can also google for K-Means++). You can find a very good review of K-Means in a paper by AK Jain called Data clustering: 50 years beyond K-means.
You can also check hierarchical methods, such as the Ward method.
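A minimal sketch of the restart strategy described above, assuming scikit-learn; the data array and k = 3 are placeholders:

```python
# Run k-means from many initializations and keep the best run (sketch).
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(500, 4)

# n_init=100 runs k-means 100 times and keeps the run with the lowest inertia
# (the sum of squared distances of points to their closest centroid).
km = KMeans(n_clusters=3, n_init=100, random_state=0).fit(X)
print(km.inertia_)
print(km.cluster_centers_)

# k-means++ seeding can be requested explicitly (it is also scikit-learn's default).
km_pp = KMeans(n_clusters=3, init="k-means++", n_init=10).fit(X)
```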

Clustering Method Selection in High Dimensions?

If the data to cluster are literally points (either 2D (x, y) or 3D (x, y,z)), it would be quite intuitive to choose a clustering method. Because we can draw them and visualize them, we somewhat know better which clustering method is more suitable.
e.g.1 If my 2D data set is of the formation shown in the right top corner, I would know that K-means may not be a wise choice here, whereas DBSCAN seems like a better idea.
However, just as the scikit-learn website states:
While these examples give some intuition about the algorithms, this intuition might not apply to very high dimensional data.
AFAIK, in most practical problems we don't have such simple data. Most probably, we have high-dimensional tuples, which cannot be visualized in this way.
e.g.2 I wish to cluster a data set where each data point is represented as a 4-D tuple <characteristic1, characteristic2, characteristic3, characteristic4>. I CANNOT visualize it in a coordinate system and observe its distribution as before. So I will NOT be able to say that DBSCAN is superior to K-means in this case.
So my question:
How does one choose the suitable clustering method for such an "invisualizable" high-dimensional case?
"High-dimensional" in clustering probably starts at some 10-20 dimensions in dense data, and 1000+ dimensions in sparse data (e.g. text).
4 dimensions are not much of a problem, and can still be visualized; for example by using multiple 2d projections (or even 3d, using rotation); or using parallel coordinates. Here's a visualization of the 4-dimensional "iris" data set using a scatter plot matrix.
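A minimal sketch of such a scatter-plot matrix, assuming seaborn and its bundled copy of the iris data set:

```python
# Scatter-plot matrix (pair plot) of the 4-dimensional iris data (sketch).
import seaborn as sns
import matplotlib.pyplot as plt

iris = sns.load_dataset("iris")            # 4 numeric features plus the species label
sns.pairplot(iris, hue="species")          # all pairwise 2-D projections at once
plt.show()
```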
However, the first thing you still should do is spend a lot of time on preprocessing, and finding an appropriate distance function.
If you really need methods for high-dimensional data, have a look at subspace clustering and correlation clustering, e.g.
Kriegel, Hans-Peter, Peer Kröger, and Arthur Zimek. Clustering high-dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering. ACM Transactions on Knowledge Discovery from Data (TKDD) 3.1 (2009): 1.
The authors of that survey also publish a software framework which has a lot of these advanced clustering methods (not just k-means, but e.g. CASH, FourC, ERiC): ELKI
There are at least two common, generic approaches:
One can use some dimensionality reduction technique in order to actually visualize the high dimensional data, there are dozens of popular solutions including (but not limited to):
PCA - principal component analysis
SOM - self-organizing maps
Sammon's mapping
Autoencoder Neural Networks
KPCA - kernel principal component analysis
Isomap
After this, one either goes back to the original space and uses techniques that seem reasonable based on observations in the reduced space, or performs clustering in the reduced space itself. The first approach uses all available information, but can be invalid due to differences induced by the reduction process, while the second ensures that your observations and choices are valid (as you reduce your problem to a nice 2D/3D one) but loses a lot of information due to the transformation used.
One tries many different algorithms and chooses the one with the best metrics (there have been many clustering evaluation metrics proposed). This is a computationally expensive approach, but it has a lower bias (as reducing the dimensionality introduces information changes that follow from the transformation used).
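A minimal sketch of the second approach, assuming scikit-learn; the data array, the candidate algorithms, and their parameters are placeholders chosen for illustration:

```python
# Try several clustering algorithms and compare them with the silhouette score (sketch).
import numpy as np
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.metrics import silhouette_score

X = np.random.rand(300, 4)                       # hypothetical 4-D tuples

candidates = {
    "kmeans": KMeans(n_clusters=3, n_init=10),
    "agglomerative": AgglomerativeClustering(n_clusters=3),
    "dbscan": DBSCAN(eps=0.3, min_samples=5),
}

for name, algo in candidates.items():
    labels = algo.fit_predict(X)
    if len(set(labels)) > 1:                     # silhouette needs at least 2 clusters
        print(name, silhouette_score(X, labels))
```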
It is true that high-dimensional data cannot be easily visualized in a Euclidean high-dimensional space, but it is not true that there are no visualization techniques for it.
In addition to this claim, I will add that with just 4 features (your dimensions) you can easily try the parallel coordinates visualization method. Or simply try a multivariate data analysis taking two features at a time (so 6 pairs in total) to try to figure out which relations hold between the two (correlation and dependency, generally). Or you can even use a 3D space for three at a time.
Then, how do you get some information from these visualizations? Well, it is not as easy as in a Euclidean space, but the point is to spot visually whether the data clusters into some groups (e.g. near some values on an axis for a parallel coordinates diagram) and to think about whether the data is somehow separable (e.g. whether it forms regions like circles or linearly separable regions in the scatter plots).
A little digression: the diagram you posted is not indicative of the power or capabilities of each algorithm given some particular data distributions; it simply highlights the nature of some algorithms. For instance, k-means is able to separate only convex and ellipsoidal areas (and keep in mind that convexity and ellipsoids exist even in N dimensions). What I mean is that there is no rule that says: given the distributions depicted in this diagram, you have to choose the correct clustering algorithm accordingly.
I suggest using a data mining toolbox that lets you explore and visualize the data (and easily transform them, since you can change their topology with transformations, projections and reductions; check the other answer by lejlot for that), like Weka (plus you do not have to implement all the algorithms by yourself).
In the end, I will point you to this resource for different cluster goodness and fitness measures so you can compare the results from different algorithms.
I would also suggest soft subspace clustering, a pretty common approach nowadays, where feature weights are added to find the most relevant features. You can use these weights to increase performance and improve the BMU calculation with Euclidean distance, for example.
