Time Series Analysis, ARIMA model - time-series

I'm currently performing a time series analysis using the ARIMA model. I cannot read the PACF and ACF graphs in order to determine P & Q values. Any help would be appreciated.
Thanks

Autocorrelation shows the correlation of past observations (lags) with the time series, which is the correlation of the time series with itself. If you have a time series y(t), then you calculate the correlation of y(t) and y(t-1), y(t) and y(t-2), and so on.
The problem with the autocorrelation is that so-called intermediary effects (indirect correlations) are also included: if y(t) correlates with y(t-1), and y(t-1) correlates with y(t-2), then the correlation between y(t) and y(t-2) is influenced indirectly as well. You can find a more detailed explanation here:
https://otexts.com/fpp2/non-seasonal-arima.html
Partial autocorrelation also shows the correlation of a time series with its lags, but with the intermediary effects removed. That means in the PACF you can only see how y(t) is influenced directly by y(t-1), y(t-2), and so on. Maybe also have a look here:
https://towardsdatascience.com/time-series-from-scratch-autocorrelation-and-partial-autocorrelation-explained-1dd641e3076f
Now, how do you read ACF and PACF plots? The plot shows you the correlation for each lag. A correlation coefficient ranges from -1, a perfect negative relationship, through 0, no relationship at all, to +1, a perfect positive relationship. In your case, y(t) and y(t-1) -> lag 1 correlate with a coefficient of around -0.55, a moderately strong negative relationship, while y(t) and y(t-8) -> lag 8 correlate with a coefficient of +0.3, a weak positive relationship. The confidence limit shows you whether a correlation is statistically significant: every bar that crosses the line can be treated as a "true" correlation rather than random noise, so you can use those correlations. On the other hand, y(t) and y(t-2) -> lag 2 have a very weak correlation that appears to be essentially random, so you cannot use that relationship.
In general, significant spikes in the PACF that cut off after a certain lag indicate an AR term, i.e. an ARIMA(p, d, 0) model, while spikes in the ACF that cut off after a certain lag indicate an MA term, i.e. an ARIMA(0, d, q) model. From your PACF I would use the first, third and fourth lag, maybe also the fifth, since these have at least moderately strong, significant correlations. This suggests an ARIMA([1, 3, 4, 5], d, 0) model.
But you also need the ACF plot to find the best ARIMA order, especially the q value.
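As a minimal sketch of the whole procedure (assuming Python with statsmodels and matplotlib, which the post itself does not specify), this is how you could difference the series, inspect the ACF/PACF plots and fit a candidate ARIMA order:

import numpy as np
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
y = np.cumsum(rng.normal(size=300))      # toy non-stationary series, stand-in for your data

y_diff = np.diff(y)                      # d = 1: difference once to make it stationary

fig, axes = plt.subplots(2, 1, figsize=(8, 6))
plot_acf(y_diff, lags=20, ax=axes[0])    # significant spikes here suggest the MA order q
plot_pacf(y_diff, lags=20, ax=axes[1])   # significant spikes here suggest the AR order p
plt.show()

# e.g. if the PACF cuts off after lag 1 and the ACF tails off, try ARIMA(1, 1, 0)
model = ARIMA(y, order=(1, 1, 0)).fit()
print(model.summary())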

First perform one round of differencing, and then read the number of significant lags from the PACF; for the current plot p is 8, which is quite high.

Related

Determine if one time series forecasts another (in terms of trend only)

I have 2 time series, X_t and Y_t, which are on different scales.
Y_t can range from 0 to infinity, while X_t is limited to 0 to 100.
How can I determine whether the trend of X_t forecasts the trend of Y_t? In other words, if there is a peak in X_t, will a peak in Y_t follow after some lag?
If this is indeed the case, what is the lag?
I am not interested in forecasting the actual value of Y_t.
Using the following chart as an illustration, the red line is X_t (whose values in my data range between 27 and 34), and the black line is Y_t (which is around 40,000).
I tried to use time-lagged Pearson correlation, but I am aware that Pearson correlation (of the two time series) has no concept of time; it simply treats the series as lists of data.
I have read some guides on Granger causality, but it seems this checks whether (the value of) X_t is useful in forecasting the value of Y_t, which is similar to a regression framework, whereas I am mostly interested in forecasting the trend of Y_t.
I am a newbie in time series analysis. Thanks for your time!
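A rough sketch of how one might probe for such a lead/lag relation (assuming Python with scipy and statsmodels, which the question does not specify; x and y below are toy stand-ins for X_t and Y_t):

import numpy as np
from scipy.stats import pearsonr
from statsmodels.tsa.stattools import grangercausalitytests

rng = np.random.default_rng(1)
x = rng.normal(size=500).cumsum()
y = np.roll(x, 5) + rng.normal(scale=0.5, size=500)   # toy Y_t that follows X_t after 5 steps

# time-lagged Pearson correlation: shift x by k steps and correlate with y
best_lag, best_r = 0, -np.inf
for k in range(0, 30):
    r, _ = pearsonr(x[:len(x) - k], y[k:])
    if r > best_r:
        best_lag, best_r = k, r
print(f"max correlation {best_r:.2f} at lag {best_lag}")

# Granger causality on differenced series (closer to "trend" than to raw levels);
# the second column is tested as a predictor of the first
data = np.column_stack([np.diff(y), np.diff(x)])
grangercausalitytests(data, maxlag=10)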

Classifying pattern in time series

I am dealing with a repeating pattern in time series data. My goal is to classify every pattern as 1, and anything that does not follow the pattern as 0. The pattern repeats itself between every two peaks as shown below in the image.
The patterns are not necessarily fixed in sample size but stay within an approximate sample size, let's say 500 samples ±10%. The heights of the peaks can change. The random signal (I call it random, but basically it means not following the pattern shape) can also change in value.
The data is from a sensor. Patterns are when the device is working smoothly. If the device is malfunctioning, then I will not see the patterns and will get something similar to the class 0 I have shown in the image.
What I have done so far is building a logistic regression model. Here are my steps for data preparation:
Grab data between every two consecutive peaks, resample it to a fixed size of 100 samples, scale data to [0-1]. This is class 1.
Repeated step 1 on the data between valleys and called it class 0.
I generated some noise and repeated step 1 on chunks of 500 samples to build extra class 0 data.
The bottom figure shows my predictions on the test dataset. The prediction on the noise chunk is not great. I am worried that on real data I may get even more false positives. Any idea on how I can improve my predictions? Is there a better approach when there is no class 0 data available?
I have seen a similar question here. My understanding of Hidden Markov Models is limited, but I believe they are used to predict future data. My goal is to classify a sliding window of 500 samples throughout my data.
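A rough sketch of the segment preparation step described above (resample each peak-to-peak chunk to 100 samples and scale it to [0, 1]); this assumes Python/NumPy, and the names and toy segment are illustrative only:

import numpy as np

def prepare_segment(segment, n_points=100):
    # resample a variable-length segment to n_points, then min-max scale it to [0, 1]
    segment = np.asarray(segment, dtype=float)
    old_x = np.linspace(0.0, 1.0, num=len(segment))
    new_x = np.linspace(0.0, 1.0, num=n_points)
    resampled = np.interp(new_x, old_x, segment)
    lo, hi = resampled.min(), resampled.max()
    return (resampled - lo) / (hi - lo) if hi > lo else np.zeros(n_points)

segment = 3.0 * np.sin(np.linspace(0, np.pi, 480))   # a ~500-sample toy peak-to-peak chunk
features = prepare_segment(segment)
print(features.shape)                                # (100,)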
I have some proposals that you could try out.
First, I think recurrent neural networks (e.g. LSTMs) are often used in this field. But I have also heard that some people work with tree-based methods like LightGBM (I think Aileen Nielsen uses this approach).
So if you don't want to dive into neural networks, which is probably not necessary because your signals seem to be relatively easy to distinguish, you can give LightGBM (or other tree-ensemble methods) a chance.
If you know the maximum length of a positive sample, you can use it to define the length of your "sliding sample window", which becomes your input vector (each sample in the sliding window becomes one input feature); I would then add an extra attribute with the number of samples since the last peak occurred before the window. Then you can decide in how many steps you let the window slide over the data, which also depends on the memory you have available. A sketch of this windowing idea follows at the end of this answer.
But maybe it would be wise to skip some of the windows around a transition between positive and negative, because those states might not be classifiable unambiguously.
In case memory becomes an issue, neural networks could be the better choice, because they do not need all training data available at once for training, so you can generate your input data in batches. With tree-based methods this possibility does not exist, or only in a very limited way.
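A minimal sketch of the sliding-window idea above, assuming Python with NumPy and LightGBM's scikit-learn-style LGBMClassifier (the window size, step and toy data are placeholders to make the sketch runnable):

import numpy as np
from lightgbm import LGBMClassifier

def make_windows(signal, labels, window=500, step=50):
    X, y = [], []
    for start in range(0, len(signal) - window + 1, step):
        X.append(signal[start:start + window])    # each sample in the window is one feature
        y.append(labels[start + window - 1])      # label taken at the window's last sample
    return np.array(X), np.array(y)

# toy signal and labels, only to make the sketch runnable
rng = np.random.default_rng(0)
signal = rng.normal(size=5000)
labels = (rng.random(5000) > 0.5).astype(int)

X, y = make_windows(signal, labels)
clf = LGBMClassifier(n_estimators=200)
clf.fit(X, y)
print(clf.predict(X[:5]))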
I'm not sure of what you are trying to achieve.
If you want to characterize what is or is not a peak (an after-the-fact classification), then you can use a simple rule to define peaks, such as signal(t) - average(signal, t-N to t) > T, where T is a threshold and N the number of data points to look backwards over.
This would label what is a peak (class 1) and what is not (class 0), and hence classify the patterns.
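A minimal sketch of that rule, assuming Python/NumPy (N and T are placeholders you would tune on your data):

import numpy as np

def label_peaks(signal, N=50, T=2.0):
    signal = np.asarray(signal, dtype=float)
    labels = np.zeros(len(signal), dtype=int)
    for t in range(N, len(signal)):
        if signal[t] - signal[t - N:t].mean() > T:   # signal(t) - average(signal, t-N to t) > T
            labels[t] = 1
    return labels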
If your goal is to predict, a few time units in advance, that a peak is going to happen at time t, using say the data from t-n1 to t-n2 as features, then logistic regression is not necessarily the best choice.
To find the right model, start by visualizing the features you have from t-n1 to t-n2 for every peak at t and see whether there is any pattern. It can be anything:
was there a peak in the n3 days before t?
is there a trend?
was there an outlier (try an exponential transform of your data)?
In order to compare these patterns, consider normalizing them so that, for example, the n2-n1 data points range from 0 to 1.
If you find a pattern visually then you will know what kind of model is likely to work, on which features.
If you don't, then it is likely that the white noise you added will do just as well, so you might not find a good prediction model.
However, your bottom graph is not so bad; you have only 2 major false positives out of more than 15 predictions, which suggests that better feature engineering could close the gap.

Gaussian Process Regression use case

While reading the paper "Tactile-based active object discrimination and target object search in an unknown workspace", there is something that I just cannot understand:
The paper is about finding an object's position and other properties using only tactile information. In section 4.1.2, the author says that he uses GPR to guide the exploratory process, and in section 4.1.4 he describes how he trained his GPR:
Using the example from section 4.1.2, the input is (x, z) and the output is y.
Whenever there is a contact, the corresponding y-value is stored.
This procedure is repeated several times.
The trained GPR is used to estimate the next exploration point, which is the point where the variance is at its maximum.
You can also see the demonstration at the following link: https://www.youtube.com/watch?v=ZiLq3i-BJcA&t=177s . In the first part of the video (0:24-0:29), the first initialization takes place, where the robot samples 4 times. Then, over the next 25 seconds, the robot explores from the corresponding direction. I do not understand how this tiny initialization of the GPR can guide the exploratory process. Could someone please explain how the input points (x, z) for the first exploration phase could be estimated?
Any regression algorithm simply maps the input (x, z) to an output y in some way unique to the specific algorithm. For a new input (x0, z0), the algorithm will likely predict something very close to the true output y0 if many data points similar to this one were included in the training. If training data was only available in a vastly different region, the predictions will likely be very bad.
GPR includes a measure of confidence of the predictions, namely the variance. The variance will naturally be very high in regions where no training data has been seen before and low very close to already seen data points. If the 'experiment' takes much longer than evaluating the Gaussian Process, you can use the Gaussian Process fit to make sure you sample regions where you are very uncertain of your answer.
If the goal is to fully explore the entire input space, you could draw a lot of random values of (x,z) and evaluate the variance at these values. Then you could perform the costly experiment at the input point where you are most uncertain in y. Then you can retrain the GPR with all the explored data so far and repeat the process.
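A sketch of this explore-where-uncertain loop, assuming Python with scikit-learn's GaussianProcessRegressor (the paper itself may use a different GP implementation; expensive_experiment below is a made-up stand-in for the real tactile measurement):

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def expensive_experiment(xz):                  # made-up stand-in for the tactile probe
    return np.sin(3 * xz[0]) * np.cos(2 * xz[1])

rng = np.random.default_rng(0)
X_train = rng.uniform(0, 1, size=(4, 2))       # tiny initialization: 4 contact points (x, z)
y_train = np.array([expensive_experiment(p) for p in X_train])

gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.2))
for _ in range(20):                            # active exploration loop
    gp.fit(X_train, y_train)
    candidates = rng.uniform(0, 1, size=(1000, 2))       # random (x, z) candidates
    _, std = gp.predict(candidates, return_std=True)
    next_point = candidates[np.argmax(std)]              # most uncertain point
    X_train = np.vstack([X_train, next_point])
    y_train = np.append(y_train, expensive_experiment(next_point))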
For optimization problems (Not the OP's question)
If you wish to find the lowest value of y across the input space, you are not interested in doing the experiment in regions that you already know give high values of y, even if you are uncertain exactly how high those values will be. So instead of choosing the (x, z) point with the highest variance, you might choose the point where the predicted value of y minus some multiple of the standard deviation is lowest. Optimizing values this way is called Bayesian Optimization, and this specific scheme is known as the Upper (or, for minimization, Lower) Confidence Bound (UCB/LCB). Expected Improvement (EI), which weighs both the probability and the expected magnitude of improving on the previously best score, is also commonly used.
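For completeness, a hedged sketch of that confidence-bound acquisition rule for minimization (kappa and the fitted gp object are assumptions; any regressor whose predict(X, return_std=True) returns mean and standard deviation, such as scikit-learn's GaussianProcessRegressor, would work):

import numpy as np

def lcb_acquisition(gp, candidates, kappa=1.0):
    # gp: any fitted regressor whose predict(X, return_std=True) returns (mean, std)
    mean, std = gp.predict(candidates, return_std=True)
    score = mean - kappa * std          # optimistic lower confidence bound for minimization
    return candidates[np.argmin(score)]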

Distance measure for categorical attributes for k-Nearest Neighbor

For my class project, I am working on the Kaggle competition - Don't get kicked
The project is to classify test data as good/bad buy for cars. There are 34 features and the data is highly skewed. I made the following choices:
The data is highly skewed: out of 73,000 instances, 64,000 are bad buys and only 9,000 are good buys. Since building a decision tree would overfit the data, I chose to use kNN (k-nearest neighbors).
After trying kNN, I plan to try Perceptron and SVM techniques if kNN doesn't yield good results. Is my understanding of overfitting correct?
Since some features are numeric, I can directly use the Euclidean distance as a measure, but other attributes are categorical. To use these features properly, I need to come up with my own distance measure. I read about the Hamming distance, but I am still unclear on how to merge the two distance measures so that each feature gets equal weight.
Is there a way to find a good approximation for the value of k? I understand that this depends a lot on the use case and varies per problem. But if I am taking a simple vote from each neighbor, what should I set the value of k to? I'm currently trying out various values, such as 2, 3, 10, etc.
I researched around and found these links, but these are not specifically helpful -
a) Metric for nearest neighbor, which says that finding your own distance measure is equivalent to 'kernelizing', but I couldn't make much sense of it.
b) Distance independent approximation of kNN talks about R-trees, M-trees etc. which I believe don't apply to my case.
c) Finding nearest neighbors using Jaccard coeff
Please let me know if you need more information.
Since the data is unbalanced, you should either sample an equal number of good/bad (losing lots of "bad" records), or use an algorithm that can account for this. I think there's an SVM implementation in RapidMiner that does this.
You should use Cross-Validation to avoid overfitting. You might be using the term overfitting incorrectly here though.
You should normalize distances so that all features get the same weight. By normalize I mean force each feature to lie between 0 and 1: subtract the minimum and divide by the range.
The way to find the optimal value of K is to try all possible values of K (while cross-validating) and choose the value of K with the highest accuracy. If a merely "good" value of K is fine, then you can use a genetic algorithm or similar to find it. Or you could try K in steps of, say, 5 or 10, see which K leads to good accuracy (say it's 55), then try steps of 1 near that "good value" (i.e. 50, 51, 52, ...), but this may not be optimal.
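A sketch of this advice using scikit-learn (an assumption, since the answer mentions RapidMiner): min-max scale the numeric columns, one-hot encode the categorical ones so each feature contributes comparably to the Euclidean distance, and grid-search K with cross-validation. The column names are illustrative only:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

numeric_cols = ["VehicleAge", "VehOdo"]              # illustrative column names
categorical_cols = ["Make", "Transmission"]

pre = ColumnTransformer([
    ("num", MinMaxScaler(), numeric_cols),           # force numeric features into [0, 1]
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])
pipe = Pipeline([("pre", pre), ("knn", KNeighborsClassifier())])

search = GridSearchCV(
    pipe,
    param_grid={"knn__n_neighbors": [3, 5, 11, 21, 51]},
    cv=5,
    scoring="balanced_accuracy",                      # accounts for the class imbalance
)
# search.fit(X, y)       # X: DataFrame with the columns above, y: 0/1 good/bad buy label
# print(search.best_params_)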
I'm looking at the exact same problem.
Regarding the choice of k, it is recommended to use an odd value to avoid "tie votes".
I hope to expand this answer in the future.

How many principal components to take?

I know that principal component analysis does an SVD on a matrix and then generates an eigenvalue matrix. To select the principal components we have to take only the first few eigenvalues. Now, how do we decide on the number of eigenvalues that we should take from the eigenvalue matrix?
To decide how many eigenvalues/eigenvectors to keep, you should consider your reason for doing PCA in the first place. Are you doing it for reducing storage requirements, to reduce dimensionality for a classification algorithm, or for some other reason? If you don't have any strict constraints, I recommend plotting the cumulative sum of eigenvalues (assuming they are in descending order). If you divide each value by the total sum of eigenvalues prior to plotting, then your plot will show the fraction of total variance retained vs. number of eigenvalues. The plot will then provide a good indication of when you hit the point of diminishing returns (i.e., little variance is gained by retaining additional eigenvalues).
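A minimal sketch of that cumulative-variance plot, assuming Python with scikit-learn, whose PCA exposes the per-component variance fractions as explained_variance_ratio_:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

X = np.random.default_rng(0).normal(size=(200, 50))   # toy data, stand-in for your matrix
pca = PCA().fit(X)

cumvar = np.cumsum(pca.explained_variance_ratio_)     # cumulative fraction of variance
plt.plot(np.arange(1, len(cumvar) + 1), cumvar, marker="o")
plt.xlabel("number of components")
plt.ylabel("cumulative fraction of variance explained")
plt.axhline(0.95, linestyle="--")                     # e.g. aim for 95% of the variance
plt.show()

print(np.argmax(cumvar >= 0.95) + 1, "components reach 95% of the variance")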
There is no correct answer, it is somewhere between 1 and n.
Think of a principal component as a street in a town you have never visited before. How many streets should you take to get to know the town?
Well, you should obviously visit the main street (the first component), and maybe some of the other big streets too. Do you need to visit every street to know the town well enough? Probably not.
To know the town perfectly, you should visit all of the streets. But what if you could visit, say 10 out of the 50 streets, and have a 95% understanding of the town? Is that good enough?
Basically, you should select enough components to explain enough of the variance that you are comfortable with.
As others said, it doesn't hurt to plot the explained variance.
If you use PCA as a preprocessing step for a supervised learning task, you should cross-validate the whole data processing pipeline and treat the number of PCA dimensions as a hyperparameter, selected by a grid search on the final supervised score (e.g. F1 score for classification or RMSE for regression).
If a cross-validated grid search on the whole dataset is too costly, try it on two subsamples, e.g. one with 1% of the data and a second with 10%, and see whether you arrive at the same optimal value for the PCA dimensions.
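A sketch of treating the number of PCA dimensions as a pipeline hyperparameter, assuming Python with scikit-learn (the classifier, scoring metric and candidate dimensions are placeholders):

from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

pipe = Pipeline([("pca", PCA()), ("clf", LogisticRegression(max_iter=1000))])
grid = GridSearchCV(
    pipe,
    param_grid={"pca__n_components": [5, 10, 20, 40, 80]},
    cv=5,
    scoring="f1",            # the final supervised score (swap for a regression metric if needed)
)
# grid.fit(X, y)             # X, y: your labelled training data
# print(grid.best_params_["pca__n_components"])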
There are a number of heuristics used for that.
E.g. taking the first k eigenvectors that capture at least 85% of the total variance.
However, for high dimensionality, these heuristics usually are not very good.
Depending on your situation, it may be interesting to define a maximum allowed relative error and check how the error behaves when you project your data onto ndim dimensions.
Matlab example
I will illustrate this with a small matlab example. Just skip the code if you are not interested in it.
I will first generate a random matrix of n samples (rows) and p features containing exactly 100 non-zero principal components.
n = 200;
p = 119;
data = zeros(n, p);
for i = 1:100
    % each iteration adds a rank-one component, so the data has at most 100 non-zero principal components
    data = data + rand(n, 1)*rand(1, p);
end
The image will look similar to:
For this sample data, one can calculate the relative error made by projecting the input data onto ndim dimensions as follows:
[coeff, score] = pca(data, 'Economy', true);
relativeError = zeros(p, 1);
for ndim = 1:p
    % reconstruct the data from the first ndim principal components
    reconstructed = repmat(mean(data, 1), n, 1) + score(:, 1:ndim)*coeff(:, 1:ndim)';
    residuals = data - reconstructed;
    relativeError(ndim) = max(max(residuals ./ data));
end
Plotting the relative error as a function of the number of dimensions (principal components) results in the following graph:
Based on this graph, you can decide how many principal components you need to take into account. In this theoretical example, taking 100 components results in an exact representation of the data. So, taking more than 100 components is useless. If you want, for example, at most 5% error, you should take about 40 principal components.
Disclaimer: The obtained values are only valid for my artificial data. So, do not use the proposed values blindly in your situation, but perform the same analysis and make a trade off between the error you make and the number of components you need.
Code reference
The iterative algorithm is based on the source code of pcares
A StackOverflow post about pcares
I highly recommend the following paper by Gavish and Donoho: The Optimal Hard Threshold for Singular Values is 4/sqrt(3).
I posted a longer summary of this on CrossValidated (stats.stackexchange.com). Briefly, they obtain an optimal procedure in the limit of very large matrices. The procedure is very simple, does not require any hand-tuned parameters, and seems to work very well in practice.
They have a nice code supplement here: https://purl.stanford.edu/vg705qn9070
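As a hedged sketch (written from memory of the paper, so double-check against their code supplement): for a square n-by-n matrix with known noise level sigma, the rule keeps singular values above (4/sqrt(3)) * sqrt(n) * sigma; rectangular matrices and unknown noise use correction factors given in the paper.

import numpy as np

def hard_threshold_rank(X, sigma):
    n = X.shape[0]                                   # assumes X is square (n x n)
    tau = (4.0 / np.sqrt(3.0)) * np.sqrt(n) * sigma  # optimal hard threshold for known sigma
    s = np.linalg.svd(X, compute_uv=False)
    return int(np.sum(s > tau))                      # number of singular values to keep

rng = np.random.default_rng(0)
signal = rng.normal(size=(300, 5)) @ rng.normal(size=(5, 300))   # rank-5 signal
X = signal + rng.normal(scale=1.0, size=(300, 300))              # plus unit-variance noise
print(hard_threshold_rank(X, sigma=1.0))                         # should recover about 5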
