Interpreting Seasonality in Time Series

I have a discrete time series covering 49 quarters between January 2007 and March 2019, which I am trying to analyse. Before undertaking various forms of analysis I wanted to check for the existence of seasonality, and I have tried two methods for this in R. In the first I used the wo() function (the Webel-Ollech test) from the seastests package, which informed me that the data did not display seasonality.
library(seastests)
summary(wo(tt))
Test used: WO
Test statistic: 0
P-value: 0.8174965 0.5785041 0.2495668
The WO - test does not identify seasonality
However, I wanted to check again and used the decompose() function, from which I got the output below, which would appear to suggest a seasonal component. Can anyone advise:
Am I reading the decomposed data correctly?
AND
Why is there such disagreement between the decompose and seastests results?

The decompose() function is a simple function that basically estimates the (moving) period average. The volatility of your time series increases strongly in the last years, so the averages may pick up on some random increases. Also, the seasonal component that you obtain using decompose() will basically always look seasonal.
set.seed(1234)
x <- ts(rnorm(80), frequency=4)  # pure white noise with quarterly frequency
seastests::wo(x)                 # the test finds no seasonality
plot(decompose(x))               # yet the "seasonal" panel still shows a regular pattern
Therefore, seasonality tests are preferable for assessing whether a time series really is seasonal.
Still, if you have information that the data-generating process has changed, you may want to run the test on just the last few years of observations.

Related

Setting correct input for RNN

In a database there are time-series data with records:
device - timestamp - temperature - min limit - max limit
device - timestamp - temperature - min limit - max limit
device - timestamp - temperature - min limit - max limit
...
For every device there are 4 hours of time-series data (at 5-minute intervals) before an alarm was raised and 4 hours of time-series data (again at 5-minute intervals) that didn't raise any alarm. [The original question includes a graph illustrating this layout for every device.]
I need to use an RNN in Python for alarm prediction. We define an alarm as the temperature going below the min limit or above the max limit.
After reading the official documentation from TensorFlow here, I'm having trouble understanding how to set up the input to the model. Should I normalise the data beforehand, and if so, how?
Reading the answers here also didn't give me a clear view of how to transform my data into an acceptable format for the RNN model.
Any help on what the X and y in model.fit should look like for my case?
If you see any other issue with this problem, feel free to comment on it.
PS. I have already set up Python in Docker with TensorFlow, Keras, etc., in case this information helps.
You can begin with the snippet that you mention in the question.
Any help on what the X and y in model.fit should look like for my case?
X should be a numpy array of shape [num_samples, sequence_length, D], where D is the number of values per timestamp. I suppose D=1 in your case, because you only pass the temperature value.
y should be a vector of target values (as in the snippet): either binary (alarm/not_alarm) or continuous (e.g. max temperature deviation). In the latter case you'd need to change the sigmoid activation to something else.
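For concreteness, here is a minimal sketch of those shapes with a toy Keras model; the sizes, the random placeholder data, and the LSTM layer are illustrative assumptions, not your actual setup:

import numpy as np
import tensorflow as tf

num_samples, seq_len, D = 200, 48, 1  # 48 steps = 4 hours at 5-minute intervals (assumed)
X = np.random.rand(num_samples, seq_len, D).astype("float32")  # placeholder temperatures
y = np.random.randint(0, 2, size=num_samples)  # placeholder binary alarm labels

model = tf.keras.Sequential([
    tf.keras.layers.LSTM(32, input_shape=(seq_len, D)),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # swap this out for a continuous target
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=2, batch_size=32)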
Should I normalise the data beforehand?
Yes, it's essential to preprocess your raw data. I see two crucial things to do here (a sketch follows this list):
Normalise the temperature values with min-max scaling or standardization (wiki, sklearn preprocessing). Plus, I'd add a bit of smoothing.
Drop some fraction of the last timestamps from all of the time series to avoid information leakage.
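A rough sketch of both steps, assuming the series are stacked in a hypothetical (num_series, seq_len) array called raw; the 10% cut is also just an illustration:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

raw = np.random.rand(200, 48)  # placeholder (num_series, seq_len) temperatures

# 1) min-max normalisation: fit on training data only, scale everything to [0, 1]
scaler = MinMaxScaler()
normalized = scaler.fit_transform(raw.reshape(-1, 1)).reshape(raw.shape)

# 2) drop a fraction of the last timestamps so the model cannot peek at the alarm
drop = int(0.1 * raw.shape[1])
truncated = normalized[:, :-drop]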
Finally, I'd say that this task is more complex than it seems. You might want to either find a good starter tutorial on time-series classification or take a course on machine learning in general. I believe you can find a better method than an RNN.
Yes, you should normalize your data. I would look at differencing by every day, i.e. a difference interval of 24 hours / 5 minutes = 288 steps. You could also try a yearly difference, but that depends on your choice of window size (remember that RNNs don't do well with large windows). You may possibly want to use a log transformation like the above user said, but this data also seems somewhat stationary, so I could see that not being needed.
For your model.fit, you are technically training the equivalent of a language model, where you predict the next output. So your inputs will be the preceding x values and preceding normalized y values of whatever window size you choose, and your target value will be the normalized output at a given time step t. Just so you know, a 1-D conv net is also good for classification, but good call on the RNN because of the temporal aspect of temperature spikes.
Once you have trained a model on the x values and normalized y values, and can tell that it is actually learning (converging), you can use model.predict with the preceding x values and preceding normalized y values. Take the output and un-normalize it to get an actual temperature value, or keep the normalized value and feed it back into the model to get the prediction for time t+2.
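A sketch of that feed-back loop; model and window are assumed to come from a training setup like the one above, and all names are illustrative:

import numpy as np

def roll_forecast(model, window, steps):
    # window: array of shape (seq_len, 1) holding the most recent normalized values
    # predicts `steps` values ahead by feeding each normalized prediction back in
    window = window.copy()
    preds = []
    for _ in range(steps):
        y_hat = float(model.predict(window[np.newaxis], verbose=0)[0, 0])
        preds.append(y_hat)
        window = np.vstack([window[1:], [[y_hat]]])  # slide the window one step forward
    return np.array(preds)  # un-normalize these to recover actual temperatures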

Clustering "access-time" data sequences

I have many sequences of data looking like this:
s1 = t11, t12, ..., t1m_1
s2 = t21, t22, ..., t2m_2
...
si = ti1, ti2, ..., tim_i
where si denotes the i-th sequence and tij denotes the time of the j-th access to sequence i.
Each sequence has a different length (m_1 may not equal m_2), and each sequence's data records the times at which si was accessed: ti1, ti2, ..., tim_i.
My goal is to cluster the similar access-time sequences.
I'm not sure whether I can translate this into a time-series problem.
To my understanding, time-series data (like stock data) gives a value at each point in time, whereas my sequences' values are the times at which the sequence was accessed.
Even if it can be translated into a time-series problem, there is another issue: the access times are very sparse (a sequence may be accessed at 1s, 1000s, 2000s), so the time-series representation would be very large. I don't think I could run a clustering algorithm such as DTW on it; the time complexity would be too high.
As you pointed out, DTW would be quite slow, since comparing just the first two series takes k * m_1 * m_2 operations.
To avoid this, and to compare your sequences more easily, you might somehow hammer them into the same fixed-length format (thereby also losing some information).
Here are some ideas:
1. Differentiate to obtain the times between accesses, and build histograms with fixed bins across all data.
2. Count the number of accesses during each minute of the week (and divide by the number of times that minute-of-week appears in each series). Adapt to the timescales of interest.
3. Count the "number of accesses up until now". So, instead of having data points only when an access was made ("sparse"), you'd get a data point for every timestamp ("dense") showing the number of accesses up to the current one.
Idea #3 would be similar to an "integral image" in computer vision. After this, new summarization techniques open up, like moving averages, or even direct comparison (if the recordings happen in parallel).
In order to pick a more useful representation, you need to think about what is meaningful in your application.
After you get a uniform-length representation, you can use cheaper similarity measures. A typical one is cosine similarity (but be sure to normalize first); a sketch follows.
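To make idea #1 and the cosine comparison concrete, here is a small sketch; the bin edges and access times are made up:

import numpy as np

def to_histogram(access_times, bins):
    gaps = np.diff(np.sort(access_times))  # times between consecutive accesses
    hist, _ = np.histogram(gaps, bins=bins)
    return hist / max(hist.sum(), 1)  # normalize so sequence length doesn't dominate

bins = [0, 10, 60, 300, 1800, 10**9]  # hypothetical gap buckets, in seconds
h1 = to_histogram([1, 1000, 2000, 2100], bins)
h2 = to_histogram([5, 900, 2050], bins)

similarity = h1 @ h2 / (np.linalg.norm(h1) * np.linalg.norm(h2) + 1e-12)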

Time series prediction using GP - training data

I am trying to implement time-series forecasting using genetic programming. I am creating random trees (ramped half-and-half) with s-expressions and evaluating each expression using RMSE to calculate its fitness. My problem is the training process. If I want to predict gold prices and the training data looked like this:
date open high low close
28/01/2008 90.959999 91.889999 90.75 91.75
29/01/2008 91.360001 91.720001 90.809998 91.150002
30/01/2008 90.709999 92.580002 90.449997 92.059998
31/01/2008 90.919998 91.660004 90.739998 91.400002
01/02/2008 91.75 91.870003 89.220001 89.349998
04/02/2008 88.510002 89.519997 88.050003 89.099998
05/02/2008 87.900002 88.690002 87.300003 87.68
06/02/2008 89 89.650002 88.75 88.949997
07/02/2008 88.949997 89.940002 88.809998 89.849998
08/02/2008 90 91 89.989998 91
As I understand it, this data is nonlinear, so my questions are:
1- Do I need to make any changes to this data, like exponential smoothing? And why?
2- When looping over the current population and evaluating the fitness of each expression on the training data, should I calculate the RMSE on just part of this data or on all of it?
3- When the algorithm finishes and I get the expression with the best (lowest) fitness, does this mean that when I apply any row from the training data, the output should be the price of the next day?
I've read some research papers about this, and I noticed some of them mention dividing the training data when calculating the fitness, and some of them apply exponential smoothing. However, I found them a bit difficult to read and understand, and most implementations I've found are in either Python or R, which I am not familiar with.
I appreciate any help on this.
Thank you.

Learning from time-series data to predict time-series (not forecasting)

I have a number of datasets, where each contains a number of input variables (let's say 3) as time series, plus an output variable that is also a time series, all over the same time period.
Each of these series has the same number of data points (say 1000*10 = 10,000 if 10 seconds of data were gathered at 1000 Hz).
I want to learn from this data, and given a new dataset with 3 input time series, I want to predict the time series for the output variable.
I will write the problem below in notation rather than prose. I will avoid using terms like features, samples, and targets because, since I haven't formulated the problem for any particular algorithm, I don't want to speculate about what will be what.
Datasets to learn from look like this:
dataset1:{Inputs=(timSeries1,timSeries2,timSeries3), Output=(timSeriesOut)}
dataset2:{Inputs=(timSeries1,timSeries2,timSeries3), Output=(timSeriesOut)}
dataset3:{Inputs=(timSeries1,timSeries2,timSeries3), Output=(timSeriesOut)}
.
.
datasetn:{Inputs=(timSeries1,timSeries2,timSeries3), Output=(timSeriesOut)}
Now, given a new (timSeries1, timSeries2, timSeries3) I want to predict (timSeriesOut)
datasetPredict:{Inputs=(timeSeries1,timSeries2,timSeries3), Output = ?}
What technique should I use, and how should the problem be formulated? Should I just break it up into a separate learning problem for each timestamp, with three features and one target (either for that timestamp or the next one)? A sketch of this formulation is below.
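For illustration, here is what I have in mind for that per-timestamp formulation (the sizes and random placeholders are made up):

import numpy as np

n_datasets, n_steps = 5, 10000
inputs = np.random.rand(n_datasets, n_steps, 3)  # placeholder for timSeries1..3
outputs = np.random.rand(n_datasets, n_steps)    # placeholder for timSeriesOut

X = inputs.reshape(-1, 3)  # one row per timestamp: (n_datasets * n_steps, 3)
y = outputs.reshape(-1)    # the matching output value per row
# a standard regressor could now be fit to (X, y), though this formulation
# ignores the temporal order within each series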
Thank you all!

How to predict several unlabelled attributes at once using WEKA and NaiveBayes?

I have a binary array of 96 elements; it could look something like this:
[false, true, true, false, true, true, false, false, false, true.....]
Each element represents a 15-minute time interval, starting from 00.00: the first element is 00.15, the second is 00.30, the third 00.45, etc. The boolean tells whether a house was occupied in that time interval.
I want to train a classifier so that it can predict the rest of a day when only part of the day is known. Let's say I have observations for the past 100 days, and I only know the first 20 elements of the current day.
How can I use classification to predict the rest of the day?
I tried creating an ARFF file that looks like this:
@RELATION OccupancyDetection
@ATTRIBUTE Slot1 {true, false}
@ATTRIBUTE Slot2 {true, false}
@ATTRIBUTE Slot3 {true, false}
...
@ATTRIBUTE Slot96 {true, false}
@DATA
false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,true,true,true,true,true,true,true,false,true,true,true,false,true,false,false,false,false,false,false,true,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false
false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,true,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,true,true,true,true,true,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,true,true,true,true,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false
.....
I then ran a Naive Bayes classification on it. The problem is that the results only show the success for a single attribute (the last one, for instance).
A "real" sample taken on a given day might look like this:
true,true,true,true,true,true,true,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?
How can I predict all the unlabelled attributes at once?
I made the following based on WekaManual-3-7-11, and it works, but only for a single attribute:
..
Instances unlabeled = DataSource.read("testWEKA1.arff");
unlabeled.setClassIndex(unlabeled.numAttributes() - 1);
// create copy
Instances labeled = new Instances(unlabeled);
// label instances
for (int i = 0; i < unlabeled.numInstances(); i++) {
    double clsLabel = classifier.classifyInstance(unlabeled.instance(i));
    labeled.instance(i).setClassValue(clsLabel);
}
// write once, after all instances have been labeled
DataSink.write("labeled.arff", labeled);
Sorry, but I don't believe that you can predict multiple attributes at once using Naive Bayes in Weka.
What you could do as an alternative, if running Weka through Java code, is loop through all of the attributes that need to be filled in: build a classifier on the n attributes known so far, fill in the next blank, and repeat until all of the missing data is entered. A sketch of the idea follows.
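This is not Weka, but here is a quick scikit-learn sketch of that fill-in-the-next-blank loop, using BernoulliNB as a stand-in for Weka's Naive Bayes and made-up data:

import numpy as np
from sklearn.naive_bayes import BernoulliNB

days = np.random.randint(0, 2, size=(100, 96))  # 100 historical days, 96 slots each
known = 20                                      # slots already observed today
filled = list(days[0, :known])                  # hypothetical partial current day

for slot in range(known, 96):
    # train on the slots before `slot`, predict `slot`, then treat it as known
    clf = BernoulliNB().fit(days[:, :slot], days[:, slot])
    filled.append(int(clf.predict(np.array(filled).reshape(1, -1))[0]))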
It also appears that what you have is time-based. Perhaps if the model were somewhat restructured, it could all fit within a single model. For example, you could have attributes for prediction time, day of week, and presence over the last few hours, as well as attributes that describe historical presence in the house. It might be going over the top for your problem, but it could also eliminate the need for multiple classifiers.
Hope this helps!
Update!
As per your request, I have taken a couple of minutes to think about the problem at hand. The thing about this time-based prediction is that you want to be able to predict the rest of the day, but the amount of data available to your classifier is dynamic, depending on the time of day. Given the current structure, this would mean that you need a classifier to predict values for each 15-minute time slot, where earlier time slots have far less input data than later ones.
If possible, you could instead use a different approach, where you use an equal amount of historical information for each time slot and possibly share the same classifier for all cases. One possible set of information is outlined below:
The time slot to be estimated
The day of week
The previous hour or two of activity
Other activity over the previous 24 hours
Historical information about the general time slot
If you obtain your information on a daily basis, it may be possible to quantify each of these factors and then use them to predict any time slot. Then, if you wanted to predict a whole day, you could keep feeding the model its previous predictions until you have completed the predictions for the day. A sketch of such a feature row is below.
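As a sketch of one such feature row (all names and encodings here are hypothetical, just to make the list above concrete):

import numpy as np

def make_row(slot, day_of_week, last_8_slots, prev_24h_count, slot_history_rate):
    # one training row for a single (day, time-slot) pair
    return np.concatenate(([slot, day_of_week], last_8_slots,
                           [prev_24h_count, slot_history_rate]))

row = make_row(slot=21, day_of_week=2,
               last_8_slots=np.ones(8),   # previous two hours fully occupied
               prev_24h_count=40,         # occupied slots over the last 24 hours
               slot_history_rate=0.7)     # historical occupancy rate for this slot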
I have done a similar problem, predicting time of arrival based on similar factors (previous behavior, public holidays, day of week, etc.), and the estimates were usually reasonable, though only as accurate as you could expect for a human process.
I can't tell whether there's something wrong with your ARFF file.
However, here's one idea: you can apply the NominalToBinary unsupervised attribute filter to make sure that the attributes Slot1-Slot96 are recognized as binary.
There are two frameworks which provide multi-label learning and work on top of WEKA:
MULAN: http://mulan.sourceforge.net/
MEKA: http://meka.sourceforge.net/
I have only tried MULAN, and it works very well. To get the latest release you need to clone their Git repository and build the project.
