I have numerous return time series spanning a couple of years, and I want to see how stable these series are across time. So far I have winsorized and z-scored my data and created histograms and average vs. standard deviation charts. The histograms show what each distribution looks like and whether it has positive or negative skew. With the average vs. standard deviation chart (each data point represents a point in time) I tried to get some kind of density measure for the data set, i.e. a large, diffuse blob means the series is less stable than a tight, dense one.
I am looking for other ways to visualise my data. Any ideas are welcome.
Related
I am trying to perform clustering on a dataset including time series (e.g. sensor recording over a few seconds) and discrete valued variables (e.g. age). I have already tried PCA to combine the original variables and then standard clustering which effectively solves the problem of having time series and discrete valued variables. I would now like to perform time-series clustering using dynamic time warping (DTW) distance but I am not sure how I can incorporate the discrete valued variables.
My first attempt was to calculate DTW distance for the time-series variables, Euclidean distance for the discrete variables and then combine these distances into a single similarity matrix. The issue is that, because of the way DTW is calculated (sum of all the Euclidean distances between optimal matched points in two time series), the scale of the DTW distance is much larger than that of the discrete variables, even after standardising the variables. If I then apply clustering on the resulting distance matrix, the discrete variables would be pretty meaningless, which is not the case in the real world.
I am trying to find similar examples in the literature and cases in all the Stacks but I've not been very lucky. I thought about:
scaling the DTW distance by the length of the series but that can be a bit tricky with time series with different lengths and on initial attempts, it seems it shrinks the distance in the time series variables a lot.
converting the discrete variable into a time series of constant values but I am not sure this is a great idea either.
Does anyone know of any examples or has anyone got any clever ideas?
Thanks
You should be able to leverage any generic stock ticker analysis to get what you want. Here is a link that shows a simple time series analysis of stock data, as well as a few clustering exercises.
https://github.com/ASH-WICUS/Notebooks/blob/master/Clustering%20-%20Historical%20Stock%20Prices.ipynb
I use example code to compare HSV histograms using EMD.
I want to find similar images in people's (mobile) picture library. It's quite common that people take several images of the same subject (in a row) with just slight changes: zooming in/out a bit, different angle, different exposure as a result of changing position, other pose, ....
I selected 4 sets of 4 similar images to test this algorithm. When comparing the images inside the sets, I get 22 EMD-L1 values between roughly 0.25 and 2.25 (average 1.47) and 2 outliers around 7.2.
When I cross-compare between sets, I get values between 2 and 15, with an average around 8.
Yes, there is a significant difference between the two ranges. But I was disappointed that there was no gap between them, and instead a small overlap [2.0, 2.25]. I'm hoping to improve the algorithm.
How can I optimise my comparison for my particular use-case? There are various histogram forms, various histogram comparison algorithms, and then each has various parameters.
Does OpenCV implement the fastest known EMD algorithm? I was surprised that the comparison of some histograms took up to a second; especially with the relatively small bin numbers.
Then, some cross-comparisons give good EMD results, but have totally different RGB histograms. Here are two images:
My current EMD-L1 says 1.95, but the RGB histograms are totally different.
You have probably refined your comparison method already, but in case this isn't obvious: you could divide each image into overlapping subregions (say 4 parts) and compute the EMD for each part separately, rather than once for the whole image.
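A rough sketch of that idea, under several assumptions that are mine and not from the question: a single 8-bit channel, four overlapping corner tiles, and a plain 1-D EMD computed from cumulative histograms instead of OpenCV's EMD. The helper names (`emd_1d`, `quadrants`, `hist`, `tiled_emd`) are made up for illustration.

```python
import numpy as np

def emd_1d(h1, h2):
    """EMD between two 1-D histograms with unit bin spacing, each normalised to sum to 1."""
    p = h1 / h1.sum()
    q = h2 / h2.sum()
    # For 1-D histograms the EMD is the summed absolute difference of the CDFs.
    return np.abs(np.cumsum(p - q)).sum()

def quadrants(img, overlap=0.25):
    """Yield four overlapping corner tiles of a 2-D array (hypothetical helper)."""
    h, w = img.shape
    th, tw = int(h * (0.5 + overlap / 2)), int(w * (0.5 + overlap / 2))
    for r in (0, h - th):
        for c in (0, w - tw):
            yield img[r:r + th, c:c + tw]

def hist(channel, bins=32):
    """Histogram of an 8-bit channel as float counts."""
    counts, _ = np.histogram(channel, bins=bins, range=(0, 256))
    return counts.astype(float)

def tiled_emd(img_a, img_b, bins=32):
    """One EMD value per pair of corresponding tiles."""
    return np.array([emd_1d(hist(ta, bins), hist(tb, bins))
                     for ta, tb in zip(quadrants(img_a), quadrants(img_b))])

# Two synthetic 64x64 "images": same global histogram, mirrored layout.
img_a = np.zeros((64, 64))
img_a[:, 32:] = 200
img_b = np.zeros((64, 64))
img_b[:, :32] = 200

global_emd = emd_1d(hist(img_a), hist(img_b))  # whole-image comparison
per_tile = tiled_emd(img_a, img_b)             # tile-by-tile comparison
```

Here the whole-image EMD cannot tell the two images apart, while every tile reports a clear difference. How to combine the per-tile values (mean, max, ...) is a separate choice you would have to tune on your own sets.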
Can I use a PCA subspace trained on, say, eight features and one thousand time points to evaluate a single reading? That is, if I keep, say, the top six components, my transformation matrix will be 8x6, and using it to transform test data of the same size as the training data would give me a 6x1000 matrix.
But what if I want to look for anomalies at each time point independently? That is, rather than use an 8x1000 test set, can I apply 1000 separate transformations to 8x1 test vectors and get the same result? Each vector will get transformed into the exact same spot as if it were the first row in a much larger data matrix, but the distance of that one vector from the principal axis doesn't appear to be meaningful. When I perform this same procedure on the truncated reference data, this distance isn't zero either; only the sum of all distances over the entire reference data set is zero. So if I can't show that the reference data is not "anomalous", how can I use this on test data?
Is it the case that the size of the data "object" used to train PCA is the size of object that can be evaluated with it?
Thanks for any help you can give.
I'm actually trying to detect characteristics of the time series for a very big region composed of many smaller subregions (in my case pixels). I don't know much about this, so the only way I can come up with is an averaged time series for the entire region, although I know this would definitely conceal many features by averaging.
I'm just wondering whether there are any widely used techniques that can detect the common features of a suite of time series, such as pattern recognition or time series classification?
Any ideas/suggestions are much appreciated!
Thanks!
Some extra explanations: I'm dealing with remote sensing images covering several years with a time step of 7 days. So each pixel has an associated time series, with values extracted from that pixel on different dates. If I define a region consisting of many pixels, is there a way to detect or extract some common features characterizing all or most of the time series of the pixels within this region? Such as the shape of the time series, or a date around which there's an obvious increase in the values?
You could compute the correlation matrix for the pixels. This would simply be:
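    import numpy as np
    # data has shape (npix, nt): one row per pixel, one column per time step
    corr = np.zeros((npix, npix))
    for i in range(npix):
        for j in range(npix):
            corr[i, j] = np.sum(data[i, :] * data[j, :]) / np.sqrt(np.sum(data[i, :]**2) * np.sum(data[j, :]**2))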
If you want more information, you can compute this as a function of time, i.e. divide your time series into blocks (say minutes) and compute the correlation for each of them. Then you can see how the correlation changes over time.
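A minimal sketch of the block-wise version, assuming `data` is a NumPy array of shape (npix, nt). The function names and the block length are my own choices, and I keep the same formula as the loop above, which does not subtract the mean (use `np.corrcoef` on each block if you want the mean-subtracted Pearson correlation instead).

```python
import numpy as np

def uncentered_corr(block):
    """Same formula as the loop above, vectorised: rows are pixels, columns are time."""
    norms = np.sqrt((block ** 2).sum(axis=1, keepdims=True))
    unit = block / norms
    return unit @ unit.T

def blockwise_corr(data, block_len):
    """Return an array of shape (nblocks, npix, npix), one correlation matrix per block."""
    npix, nt = data.shape
    nblocks = nt // block_len  # any incomplete trailing block is dropped
    return np.stack([uncentered_corr(data[:, b * block_len:(b + 1) * block_len])
                     for b in range(nblocks)])

# Toy data: two "pixels" that agree in the first half and are mirrored in the second.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
data = np.vstack([x, np.concatenate([x[:100], -x[100:]])])

corr_t = blockwise_corr(data, block_len=100)
# corr_t[b, i, j] is the correlation between pixels i and j in block b
```

In this toy case the correlation between the two pixels flips from +1 in the first block to -1 in the second, which a single correlation over the whole series would average away.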
If the correlation changes a lot, you may be more interested in the cross-power spectrum of the pixels. This is defined as
    cpow[i, j, :] = np.fft.fft(data[i, :]) * np.conj(np.fft.fft(data[j, :]))
This will tell you how much pixels i and j tend to change together on various time-scales. For example, they could be moving in unison on time-scales of a second (1 Hz), but also have changes on a time-scale of, say, 10 seconds which are not correlated with each other.
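To make that concrete, here is a small synthetic sketch. The sampling rate, the signal length and the two test signals are assumptions chosen purely for illustration: both pixels share a 1 Hz component, while a 0.1 Hz component appears in only one of them.

```python
import numpy as np

fs = 100.0                # assumed sampling rate in Hz
n = 2000                  # 20 seconds of samples
t = np.arange(n) / fs
rng = np.random.default_rng(0)

shared = np.sin(2 * np.pi * 1.0 * t)           # 1 Hz component present in both pixels
pix_i = shared + np.sin(2 * np.pi * 0.1 * t)   # slow 0.1 Hz component only in pixel i
pix_j = shared + 0.5 * rng.normal(size=n)      # white noise only in pixel j

# Cross-power spectrum of the two real-valued series (positive frequencies only).
cpow = np.fft.rfft(pix_i) * np.conj(np.fft.rfft(pix_j))
freqs = np.fft.rfftfreq(n, d=1.0 / fs)

peak_freq = freqs[np.argmax(np.abs(cpow))]     # frequency where they co-vary most
```

The magnitude of `cpow` peaks at the shared 1 Hz component; the 0.1 Hz change in pixel i contributes far less because pixel j has nothing matching it at that frequency.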
It all depends on what you need, really.
I have implemented k-means clustering for determining the clusters in 300 objects. Each of my objects
has about 30 dimensions. The distance is calculated using the Euclidean metric.
I need to know
How would I determine if my algorithm works correctly? I can't have a graph which will
give some idea about the correctness of my algorithm.
Is Euclidean distance the correct method for calculating distances? What if I have 100 dimensions
instead of 30 ?
The two questions in the OP are separate topics (i.e., no overlap in the answers), so I'll try to answer them one at a time, starting with item 1 on the list.
How would I determine if my [clustering] algorithm works correctly?
k-means, like other unsupervised ML techniques, lacks a good selection of diagnostic tests to answer questions like "are the cluster assignments returned by k-means more meaningful for k=3 or k=5?"
Still, there is one widely accepted test that yields intuitive results and that is straightforward to apply. This diagnostic metric is just this ratio:
inter-centroidal separation / intra-cluster variance
As the value of this ratio increases, the quality of your clustering result increases.
This is intuitive. The first of these metrics just measures how far apart each cluster is from the others (measured between the cluster centers).
But inter-centroidal separation alone doesn't tell the whole story, because two clustering algorithms could return results having the same inter-centroidal separation though one is clearly better, because the clusters are "tighter" (i.e., smaller radii); in other words, the cluster edges have more separation. The second metric--intra-cluster variance--accounts for this. This is just the mean variance, calculated per cluster.
In sum, the ratio of inter-centroidal separation to intra-cluster variance is a quick, consistent, and reliable technique for comparing results from different clustering algorithms, or results from the same algorithm run under different parameters--e.g., number of iterations, choice of distance metric, number of centroids (value of k).
The desired result is tight (small) clusters, each one far away from the others.
The calculation is simple:
For inter-centroidal separation:
calculate the pair-wise distance between cluster centers; then
calculate the median of those distances.
For intra-cluster variance:
for each cluster, calculate the distance of every data point in a given cluster from
its cluster center; next
(for each cluster) calculate the variance of the sequence of distances from the step above; then
average these variance values.
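The steps above can be sketched roughly as follows. This assumes you already have the data points, an integer cluster label per point, and the cluster centers as NumPy arrays; the function name and the toy data are mine.

```python
import numpy as np
from itertools import combinations

def separation_ratio(points, labels, centers):
    """Inter-centroidal separation divided by intra-cluster variance."""
    # Inter-centroidal separation: median of the pairwise distances between centers.
    pair_dists = [np.linalg.norm(centers[i] - centers[j])
                  for i, j in combinations(range(len(centers)), 2)]
    inter = np.median(pair_dists)

    # Intra-cluster variance: per cluster, the variance of the point-to-center
    # distances, then averaged over the clusters.
    variances = []
    for k in range(len(centers)):
        d = np.linalg.norm(points[labels == k] - centers[k], axis=1)
        variances.append(d.var())
    intra = np.mean(variances)
    return inter / intra

# Toy comparison: same centers, tight clusters versus loose clusters.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
labels = np.repeat([0, 1, 2], 50)
noise = rng.normal(size=(150, 2))

ratio_tight = separation_ratio(centers[labels] + 0.5 * noise, labels, centers)
ratio_loose = separation_ratio(centers[labels] + 2.0 * noise, labels, centers)
```

With identical centers, the tighter clustering gets the larger ratio, which is the behaviour described above.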
That's my answer to the first question. Here's the second question:
Is Euclidean distance the correct method for calculating distances? What if I have 100 dimensions instead of 30 ?
First, the easy question--is Euclidean distance a valid metric as dimensions/features increase?
Euclidean distance is perfectly scalable--works for two dimensions or two thousand. For any pair of data points:
subtract their feature vectors element-wise,
square each item in that result vector,
sum that result,
take the square root of that scalar.
Nowhere in this sequence of calculations is scale implicated.
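Written out in NumPy, the four steps look like this (a minimal sketch; the function name is mine, and in practice `np.linalg.norm(x - y)` does the same thing in one call):

```python
import numpy as np

def euclidean(x, y):
    """Euclidean distance between two feature vectors of equal length."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)  # subtract element-wise
    squared = diff ** 2                                             # square each item
    total = squared.sum()                                           # sum the result
    return np.sqrt(total)                                           # square root of the scalar

d_2 = euclidean([0, 0], [3, 4])                    # two dimensions
d_2000 = euclidean(np.zeros(2000), np.ones(2000))  # two thousand dimensions
```

The same function handles both calls unchanged; only the length of the vectors differs.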
But whether Euclidean distance is the appropriate similarity metric for your problem depends on your data. For instance, is it purely numeric (continuous)? Or does it have discrete (categorical) variables as well (e.g., gender: M/F)? If one of your dimensions is "current location" and, of the 200 users, 100 have the value "San Francisco" and the other 100 have "Boston", you can't really say that, on average, your users are from somewhere in Kansas, but that's sort of what Euclidean distance would do.
In any event, since we don't know anything about your data, I'll just give you a simple flow diagram so that you can apply it to your data and identify an appropriate similarity metric.
To identify an appropriate similarity metric given your data:
Euclidean distance is good when the dimensions are comparable and on the same scale. If one dimension represents length and another the weight of an item, Euclidean distance should be replaced with a weighted version.
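One common choice of weights, which is my assumption here and not the only option, is the inverse variance of each dimension. That is equivalent to z-scoring every column and then using plain Euclidean distance. The toy "length" and "weight" columns are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy items: column 0 is a length in mm (large numbers), column 1 a weight in kg.
X = np.column_stack([rng.normal(1500.0, 300.0, size=50),
                     rng.normal(2.0, 0.5, size=50)])

# Option 1: weighted Euclidean distance with weights 1 / variance per dimension.
w = 1.0 / X.var(axis=0)
d_weighted = np.sqrt(np.sum(w * (X[0] - X[1]) ** 2))

# Option 2: z-score each column, then use ordinary Euclidean distance.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
d_zscored = np.linalg.norm(Z[0] - Z[1])

# Unweighted distance, dominated by the length column because of its scale.
d_raw = np.linalg.norm(X[0] - X[1])
```

Both options give the same number. For k-means the second form is usually more convenient, because you can feed `Z` to an unmodified implementation.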
Project it to 2D and plot it: this is a good way to check visually whether it works.
Or you may use a sanity check, such as finding the cluster centers and verifying that no item in a cluster is too far from its center.
Can't you just try sum |xi - yi| instead of sum (xi - yi)^2
in your code, and see if it makes much difference?
I can't have a graph which will give some idea about the correctness of my algorithm.
A couple of possibilities:
look at some points midway between 2 clusters in detail
vary k a bit, see what happens (what is your k ?)
use PCA to map 30d down to 2d; see the plots under calculating-the-percentage-of-variance-measure-for-k-means, also SO questions/tagged/pca
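For the PCA suggestion, a minimal sketch using only NumPy's SVD. The function name and the random stand-in data are mine; substitute your own 300 x 30 array.

```python
import numpy as np

def pca_2d(X):
    """Project the rows of X onto their first two principal components."""
    Xc = X - X.mean(axis=0)                            # centre each feature
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)  # rows of Vt are principal axes
    coords = Xc @ Vt[:2].T                             # 2-D coordinates of every object
    explained = (S[:2] ** 2).sum() / (S ** 2).sum()    # variance fraction kept in 2-D
    return coords, explained

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 30))
X[:150, 0] += 8.0   # two artificial groups, separated along the first feature

coords, explained = pca_2d(X)
```

You can then scatter-plot `coords[:, 0]` against `coords[:, 1]`, coloured by the k-means label. If `explained` is small, the 2-D picture hides most of the structure, so treat it with some caution.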
By the way, scipy.spatial.cKDTree
can easily give you say 3 nearest neighbors of each point,
in p=2 (Euclidean) or p=1 (Manhattan, L1), to look at.
It's fast up to ~ 20d, and with early cutoff works even in 128d.
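For example, something like the following, with random data standing in for your 300 x 30 array:

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 30))   # stand-in for the real data

tree = cKDTree(X)
# k=4 because each point's nearest neighbour is itself (distance 0),
# so columns 1..3 hold the 3 nearest other points.
dist_l2, idx_l2 = tree.query(X, k=4, p=2)   # Euclidean
dist_l1, idx_l1 = tree.query(X, k=4, p=1)   # Manhattan / L1
```

Comparing `idx_l2` and `idx_l1` row by row shows how much the choice of metric changes each point's neighbourhood.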
Added: I like Cosine distance in high dimensions; see euclidean-distance-is-usually-not-good-for-sparse-data for why.
Euclidean distance is the intuitive and "normal" distance between continuous variables. It can be inappropriate if the data are too noisy or have a non-Gaussian distribution.
You might want to try the Manhattan distance (or cityblock), which is more robust in those cases (bear in mind that robustness always comes at a cost: a bit of the information is lost, in this case).
There are many further distance metrics for specific problems (for example Bray-Curtis distance for count data). You might want to try some of the distances implemented in pdist from the Python module scipy.spatial.distance.
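A quick way to try several of them side by side. The Poisson counts are toy data of my own, chosen so that Bray-Curtis makes sense; the metric names are the ones `pdist` accepts.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
counts = rng.poisson(5.0, size=(20, 10)).astype(float)   # 20 objects, 10 count features

# Condensed pairwise distances (one value per pair of objects) for each metric.
dists = {m: pdist(counts, metric=m)
         for m in ("euclidean", "cityblock", "cosine", "braycurtis")}

# Square 20 x 20 matrix if the clustering routine expects one.
D_manhattan = squareform(dists["cityblock"])
```

Each entry of `dists` has one value per pair of objects, so you can rerun the clustering with each metric and compare the results.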