SARIMAX model fitting too slow in statsmodels - time-series

I am running a grid search for model selection, fitting SARIMAX(p, d, q)x(P, D, Q, s) models with the SARIMAX() class in statsmodels. I fix d and D to 1 and s to 7, and iterate over p in {0, 1}, q in {0, 1, 2}, P in {0, 1}, Q in {0, 1}, and trend in {None, 'c'}, which makes for a total of 48 iterations. During the model-fitting phase, if a combination of parameters leads to a non-stationary or non-invertible model, I move on to the next combination.
I have a set of time series, each one representing the performance of an agent over time and consisting of 83 daily measurements with a weekly seasonality. I keep the first 90% of the data for model fitting and the last 10% for forecasting/testing.
What I find is that model fitting during the grid search takes a very long time, about 11 minutes, for a couple of agents, whereas the same 48 iterations take less than 10 seconds for others.
However, if I log-transform the data of the slow agents before running the grid search, the same 48 iterations take about 15 seconds! As much as I love the speed-up, the final forecast turns out to be poorer than with the original (not log-transformed) data, so I'd rather keep the data in its original format.
My questions are the following:
What causes such a slowdown for certain time series?
Is there a way to speed up the model fitting by passing certain arguments to SARIMAX() or SARIMAX.fit()? I have tried simple_differencing=True, which, by building a smaller model in the state space, reduced the time from 11 minutes to 6 minutes, but that is still too long.
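For concreteness, here is a minimal sketch of the kind of loop I mean (the series below is only a dummy stand-in for one agent's training data):
import itertools
import warnings
import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Dummy stand-in for one agent's training data (the real series has ~75
# daily points after the 90/10 split); d = D = 1 and s = 7 are fixed.
series = np.cumsum(np.random.randn(75))

results = {}
for p, q, P, Q, trend in itertools.product([0, 1], [0, 1, 2], [0, 1], [0, 1], [None, 'c']):
    try:
        with warnings.catch_warnings():
            warnings.simplefilter('ignore')
            fit = SARIMAX(series, order=(p, 1, q), seasonal_order=(P, 1, Q, 7),
                          trend=trend,
                          simple_differencing=True  # smaller state space, roughly 2x faster for me
                          ).fit(disp=False)
        results[(p, q, P, Q, trend)] = fit.aic
    except Exception:
        continue  # skip non-stationary / non-invertible combinations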
I'd appreciate any help.

Related

Can someone explain batch-size and timestep for a regression model using RNN?

I am working on a regression model which has 50 datapoints per hour. I am having a hard time deciding on the difference between batch size and timestep. From my understanding, batch size decides how many datapoints we want to consider before making a prediction, and the larger the value, the longer the model takes to converge. If that is the case, I am clear on the definition of batch size. So, if my model isn't taking very long, can I just use the maximum? Would that maximum be the size of the test data?
How about timesteps then? For a model where you measure, say, temperature every minute for 30 hours, what would the timestep be?
I would appreciate it if someone who knows about regression with RNNs could clear up my doubts.
Given:
import numpy as np
x = np.array([[[1], [0], [1]]])
print(x.shape)
Output:
(1, 3, 1)
This is for m samples, s timesteps, and e measurements per timestep:
(m, s, e)
In any case, the number of data points is the size of the array, so:
m * s * e
The number of data points per sample is:
s * e
If you measure temperature every second for an hour on one sample:
(1, 3600, 1)
If you measure, say, temperature and humidity:
(1, 3600, 2)
If you do that simultaneously for 2 samples (in place A and in place B):
(2, 3600, 2)
Batch size is not related to any of this.
For each epoch, it just says how many samples you run at once: with 100 samples and a batch size of 50, you get two weight updates per epoch, for instance.
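A rough numpy sketch of the shapes discussed above (the Keras call is only mentioned in a comment, for illustration):
import numpy as np

# One sample, temperature measured every minute for 30 hours:
# 30 h * 60 min = 1800 timesteps, 1 measurement per timestep.
temperature = np.random.rand(1, 1800, 1)        # (m, s, e) = (1, 1800, 1)

# Same, but temperature and humidity measured together:
temp_and_humidity = np.random.rand(1, 1800, 2)  # (1, 1800, 2)

# Two samples (place A and place B), both variables:
both_places = np.random.rand(2, 1800, 2)        # (2, 1800, 2)

# Batch size is independent of these shapes: with 100 samples and
# batch_size=50 you get two weight updates per epoch, e.g. in Keras:
# model.fit(x, y, batch_size=50, epochs=10)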

Analyzing final value given changing x values through time

I am analyzing data with a Y variable (the final chemical content of a plant when it was harvested at ~8 weeks) and explanatory x variables of light quality measured each week for 8 weeks. I understand that this is not a typical time-series analysis, because my y value is not measured at each of those intervals but only at week 8, yet I want to see how the changes over the 8 weeks influence the final chemical concentration. One possibility would be a nested regression where the treatments (categorical) have the weekly light measurements (numerical) nested within them. However, I'm not sure this is the best approach. Any suggestions would be helpful.
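To make that idea concrete, here is a rough sketch of what I have in mind, using statsmodels with made-up column names (chem, treatment, light_w1..light_w8 are placeholders for my data):
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical layout: one row per plant, with the final chemical content,
# a categorical treatment, and one light-quality reading per week.
rng = np.random.default_rng(0)
n = 40
df = pd.DataFrame({'chem': rng.normal(size=n),
                   'treatment': rng.choice(['A', 'B'], size=n)})
for w in range(1, 9):
    df['light_w%d' % w] = rng.normal(size=n)

# Nested idea: weekly light effects estimated within each treatment
# (a treatment main effect plus treatment-by-weekly-light interactions).
weeks = ' + '.join('light_w%d' % w for w in range(1, 9))
fit = smf.ols('chem ~ C(treatment) + C(treatment):(%s)' % weeks, data=df).fit()
print(fit.summary())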

Genetic Algorithm CrossOver

I have a GA with a population of X.
After I run the genes and get the result for each one, I do a weighted multiplication of the genes (so the better-ranked genes get multiplied the most).
I end up with either x*2 or x*2 + (x*10/100) genes; the extra 10% are new random genes that may or may not be added, depending on the mutation rate.
The problem is, I don't know the best approach to reduce the population back to X.
If the genes are stored in a list, should I just use list[::2] (i.e. take every other item from the list)?
What is common practice when crossing genes?
EDIT:
Example of my GA with a population of 100:
Run the 100 genes through the fitness function and get the results. Current Population: 100
Add 10% new random genes. Current Population: 110
Duplicate the top 10% of genes. Current Population: 121
Remove the 10% worst genes. Current Population: 108
Cross over all possible pairs of genes (no duplicate pairs). Current Population: 5778
Remove genes from the gene pool until the population is back to 100. Current Population: 100
Restart the fitness function.
What I want to know is: how should I do that last step? Currently I have a list with 5778 items and I take one item every '58', or, expressed differently, every len(list)/startpopulation - 1 items.
Or should I use a 'while True' loop with a random delete until len(list) == 100?
Should the new random genes be added before or after the crossover?
Is there a way to do a Gaussian-style weighting of the items from top-rated to lowest-rated?
E.g. the top-rated gene is multiplied by n, the second best by (n-1), the third by (n-2), ..., and the worst by (n-n).
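Something like this sketch is what I mean (ranked is a hypothetical list of genes sorted best to worst):
import random

# The best of n genes is copied n times, the second best n-1 times, ...,
# the worst 0 times, then the pool is cut back down to the target size.
def ranked_replicate(ranked, target_size):
    pool = []
    n = len(ranked)
    for rank, gene in enumerate(ranked):
        pool.extend([gene] * (n - rank))
    return random.sample(pool, target_size)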
I do not really know why you are performing the GA like that; could you give some references?
In any case, here is my typical recipe for a working GA:
1) Run the 100 genes through the fitness function and get the results.
2) Randomly choose 2 genes based on the normalized fitness (treat it as the probability of each gene being chosen from the pool) and cross them over. Repeat this step until you have 90 new genes (45 times in this case). Save the top 5 without modification and duplicate them. Total genes: 100.
3) For the 90 new genes and the 5 duplicates in the new pool, allow them to mutate based on your mutation probability (typically 1%). Total genes: 100.
Repeat from 1) to 3) until convergence, or for X iterations.
Note: you always want to keep the best genes unchanged, so that the best solution found so far is never lost from one iteration to the next.
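A rough Python sketch of steps 1) to 3), assuming genes are lists of bits and fitness() returns a nonnegative score; single-point crossover and bit-flip mutation are just one common choice, not the only one:
import random

def evolve(population, fitness, elite=5, mutation_rate=0.01):
    # 1) rank by fitness; fitness-proportional selection assumes nonnegative scores
    ranked = sorted(population, key=fitness, reverse=True)
    weights = [fitness(g) for g in ranked]

    def pick():
        return random.choices(ranked, weights=weights, k=1)[0]

    # 2) crossover until the pool is refilled (each crossover yields 2 children)
    children = []
    while len(children) < len(population) - 2 * elite:
        a, b = pick(), pick()
        cut = random.randrange(1, len(a))       # single-point crossover
        children.append(a[:cut] + b[cut:])
        children.append(b[:cut] + a[cut:])

    # keep the top `elite` untouched, plus copies of them that may mutate
    new_pop = ranked[:elite] + [g[:] for g in ranked[:elite]] + children

    # 3) bit-flip mutation on everything except the untouched elite
    for gene in new_pop[elite:]:
        for i in range(len(gene)):
            if random.random() < mutation_rate:
                gene[i] = 1 - gene[i]
    return new_pop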
Good luck!

Cross validation is very slow in Grid search (libsvm)

I am using libsvm on 62 classes with 2000 samples each. The problem is that I want to optimize my parameters using a grid search. I set the ranges to C = [0.0313, 0.125, 0.5, 2, 8] and gamma = [0.0313, 0.125, 0.5, 2, 8] with 5 folds. The cross-validation does not even finish for the first two parameter values of each. Is there a faster way to do the optimization? Can I reduce the number of folds to 3, for instance? The number of iterations printed stays in the (1627-1630) range; I don't know if that is related:
optimization finished, #iter = 1629
nu = 0.997175
obj = -81.734944, rho = -0.113838
nSV = 3250, nBSV = 3247
Finding a good model is simply an expensive task. Let's do some calculations:
62 classes x 5 folds x 5 values of C x 5 values of gamma = 7750 SVMs
You can always reduce the number of folds, which will lower the quality of the search but cut the total number of trained SVMs by about 40%.
The most expensive part is the fact that the SVM is not well suited to multi-class classification. It needs to train at least O(log n) models (in the error-correcting-code scheme), O(n) models (one-vs-rest), or even O(n^2) models (one-vs-one, which libsvm uses and which tends to achieve the best results).
Maybe it would be more worthwhile to switch to some fast multi-class model, for example an ELM (Extreme Learning Machine)?
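If you can use scikit-learn's SVC (which wraps libsvm) instead of the command-line tools, a cheaper search could look roughly like this sketch: 3 folds, a coarser grid, and folds run in parallel (the data here is only a dummy stand-in for your 62 classes):
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Dummy stand-in for the real 62-class data, just so the sketch runs.
X, y = make_classification(n_samples=3100, n_features=20, n_informative=15,
                           n_classes=62, n_clusters_per_class=1, random_state=0)

# A coarser grid and 3 folds cut the number of SVMs trained; n_jobs=-1
# runs the cross-validation fits in parallel.
param_grid = {'C': [0.0313, 0.5, 8], 'gamma': [0.0313, 0.5, 8]}
search = GridSearchCV(SVC(kernel='rbf', cache_size=1000),
                      param_grid, cv=3, n_jobs=-1)
search.fit(X, y)
print(search.best_params_, search.best_score_)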

How to evaluate predictions from incomplete data, where not all data is incomplete

I am using non-negative matrix factorization (NMF) and non-negative least squares (NNLS) for predictions, and I want to evaluate how good the predictions are depending on the amount of data given. For example, the original data was
original = [1, 1, 0, 1, 1, 0]
and now I want to see how well I can reconstruct the original data when the given data is incomplete:
incomplete1 = [1, 1, 0, 1, 0, 0],
incomplete2 = [1, 1, 0, 0, 0, 0],
incomplete3 = [1, 0, 0, 0, 0, 0]
I want to do this for every example in a big dataset. The problem is that the original data varies in the number of positive entries: in the example above there are 4, but other examples in the dataset may have more or fewer. Say I run an evaluation round with 4 positives given, but half of my dataset has only 4 positives while the other half has 5, 6 or 7. Should I exclude the half with only 4 positives, because for them no data is missing, which makes the "prediction" much better? On the other hand, excluding data would change the training set. What can I do? Or shouldn't I evaluate with 4 at all in this case?
EDIT:
Basically I want to see how well I can reconstruct the input matrix. For simplicity, say the "original" stands for a user who watched 4 movies. I then want to know how well I can predict each user based on just 1 movie that the user actually watched. I get predictions for lots of movies, then plot a ROC and a precision-recall curve (using the top-k of the predictions). I repeat all of this with n movies that the users actually watched, so I get one ROC curve in the plot for every n. When I get to the point where I use, e.g., 4 movies that a user actually watched to predict all the movies he watched, but he only watched those 4, the results become too good.
The reason I am doing this is to see how many "watched movies" my system needs to make reasonable predictions. If it only returned good results once 3 movies had already been watched, it would not be very useful in my application.
I think it is important first to be clear about what you are trying to measure, and what your input is.
Are you really measuring ability to reconstruct the input matrix? In collaborative filtering, the input matrix itself is, by nature, very incomplete. The whole job of the recommender is to fill in some blanks. If it perfectly reconstructed the input, it would give no answers. Usually, your evaluation metric is something quite different from this when using NNMF for collaborative filtering.
FWIW I am commercializing exactly this -- CF based on matrix factorization -- as Myrrix. It is based on my work in Mahout. You can read the docs about some rudimentary support for tests like Area under curve (AUC) in the product already.
Is "original" here an example of one row of your input matrix, perhaps for one user? When you talk about halves and excluding, what training/test split are you referring to: splitting within each user, or taking a subset across users? You seem to be talking about measuring reconstruction error, but that does not require excluding anything: you just multiply your matrix factors back together and see how close the result is to the input, where "close" means a low L2 / Frobenius norm.
For conventional recommender tests (like AUC or precision-recall), which are something else entirely, you would split your data into training/test either by time (recent data is the test data) or by value (the most-preferred or most-associated items are the test data). If I understand the 0s to be missing elements of the input matrix, then they are not really "data". You would never have a situation where the test data were all 0s, because they are not input to begin with. The question is which 1s are for training and which 1s are for testing.
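As a rough sketch of both kinds of evaluation with scikit-learn's NMF (the matrix here is made up, and the held-out entry is just one example):
import numpy as np
from sklearn.decomposition import NMF

# Tiny made-up user x item matrix (1 = watched, 0 = unknown).
R = np.array([[1, 1, 0, 1, 1, 0],
              [1, 0, 1, 1, 0, 1],
              [0, 1, 1, 0, 1, 1]], dtype=float)

# Reconstruction error: factorize, multiply the factors back together,
# and measure the Frobenius norm of the difference.
nmf = NMF(n_components=2, init='random', random_state=0, max_iter=500)
W = nmf.fit_transform(R)
H = nmf.components_
print('Frobenius reconstruction error:', np.linalg.norm(R - W @ H))

# Recommender-style test: hide one known 1, refit, and see how highly the
# model scores the hidden entry (you hold out 1s, never the 0s).
R_train = R.copy()
R_train[0, 4] = 0
nmf2 = NMF(n_components=2, init='random', random_state=0, max_iter=500)
W2 = nmf2.fit_transform(R_train)
print('Score for the hidden (user 0, item 4) entry:', (W2 @ nmf2.components_)[0, 4])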
