Time series forecasting (DeepAR): Prediction results seem to have a basic flaw

I'm using the DeepAR algorithm to forecast survey response progress over time. I want the model to predict the next 20 data points of the survey progress. Each survey is a time series in my training data, and the length of each time series is the number of days for which the survey ran. For example, the series below indicates that the survey started on 29-June-2011 and the last response was received on 24-Jul-2011 (25 days is the length).
{"start":"2011-06-29 00:00:00", "target": [37, 41.2, 47.3, 56.4, 60.6, 60.6,
61.8, 63, 63, 63, 63.6, 63.6, 64.2, 65.5, 66.1, 66.1, 66.1, 66.1, 66.1, 66.1,
66.1, 66.1, 66.1, 66.1, 66.7], "cat": 3}
As you can see, the values in the time series can only stay the same or increase; the training data never shows a downward trend. Surprisingly, when I generated predictions, I noticed that the predictions had a downward trend. Since there is no trace of a downward trend in the training data, I'm wondering how the model could possibly have learned this. To me, this seems to be a basic flaw in the predictions. Can someone please shed some light on why the model might behave this way?

I built the DeepAR model with the hyperparameters below. The model was tested and the RMSE is about 9. Would it help to change any of the hyperparameters? Any recommendations?
time_freq= 'D',
context_length= 30,
prediction_length= 20,
cardinality= 8,
embedding_dimension= 30,
num_cells= 40,
num_layers= 2,
likelihood= 'student-T',
epochs= 20,
mini_batch_size= 32,
learning_rate= 0.001,
dropout_rate= 0.05,
early_stopping_patience= 10
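
For reference, a minimal sketch of how these hyperparameters would be passed with the SageMaker Python SDK (assuming SDK v2; the region, role ARN, instance type and S3 paths are placeholders, and the hyperparameter values are copied verbatim from the list above):

import sagemaker
from sagemaker.estimator import Estimator

# Placeholders: region, role ARN, instance type and S3 paths are assumptions,
# not taken from the question.
region = "us-east-1"
image_uri = sagemaker.image_uris.retrieve("forecasting-deepar", region)

estimator = Estimator(
    image_uri=image_uri,
    role="arn:aws:iam::111111111111:role/SageMakerRole",   # placeholder role
    instance_count=1,
    instance_type="ml.c5.2xlarge",
    output_path="s3://my-bucket/deepar-output",            # placeholder bucket
)

# Built-in algorithms receive their hyperparameters as strings;
# the values below are copied from the question.
estimator.set_hyperparameters(
    time_freq="D",
    context_length="30",
    prediction_length="20",
    cardinality="8",
    embedding_dimension="30",
    num_cells="40",
    num_layers="2",
    likelihood="student-T",
    epochs="20",
    mini_batch_size="32",
    learning_rate="0.001",
    dropout_rate="0.05",
    early_stopping_patience="10",
)

# estimator.fit({"train": "s3://my-bucket/deepar-train/"})   # placeholder channel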

If there is an upward trend in all time series, the model should have no problem learning this. If your time series usually have a rising period followed by a falling period, the algorithm may learn this and generate a similar pattern, even though the particular example you forecast has only trended upward so far.
How many time series do you have and how long are they on average?
All your hyperparameters look reasonable, and it is a bit hard to tell what to improve without knowing more about the data. If you don't have that many time series, you can try increasing the number of epochs (perhaps to a few hundred) and raising early_stopping_patience to 20-30.

Related

Can you use hypothesis testing for feature selection?

How do you apply hypothesis testing to the features in an ML model? Let's say, for example, that I am doing a regression task and I want to cut some features (once I have trained my model) to increase performance. How do I apply hypothesis testing to decide whether a feature is useful or not? I am a bit confused about what my null hypothesis would be, which level of significance to use, and how to run the experiment to get the p-value of a feature (I have heard that a level of significance of 0.15 is a good threshold, but I am not sure).
For example: I am doing a regression task to predict the cost of my factory, considering the production of three machines (A, B, C). I fit a linear regression on the data and find that the p-value of machine A is greater than my level of significance; hence it is not statistically significant, and I decide to discard that feature from my model.
I have taken this example from a video on YouTube; I put the link below.
The relevant bit runs from about 4:00 to 7:00.
https://www.youtube.com/watch?v=HgfHefwK7VQ
I have tried reading about it, but I haven't been able to understand how he chose that level of significance and how he applied hypothesis testing in this case.
The data looks something like this:
import pandas as pd

d = {'Cost': [44439, 43936, 44464, 41533, 46343],
     'A': [515, 929, 800, 979, 1165],
     'B': [541, 710, 675, 1147, 939],
     'C': [928, 711, 824, 758, 635]}  # extra sixth value of C dropped so all columns have the same length
df = pd.DataFrame(data=d)
After the model has been fit, the weights are as follows:
Bias weight: 35102,
Machine A: 2.066,
Machine B: 4.17,
Machine C: 4.79
Now, the issue is that the p-value for Machine A is 0.23, which was considered too high; therefore this feature was excluded from the predictive model.
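
For what it's worth, a minimal sketch of how such coefficients and p-values can be obtained with statsmodels (assuming a plain ordinary-least-squares fit of the data above; the null hypothesis for each feature is that its coefficient is zero, and the numbers will not necessarily reproduce the quoted weights exactly because the sixth value of C had to be dropped):

import pandas as pd
import statsmodels.api as sm

# Data from the question, with the columns equalized in length.
d = {"Cost": [44439, 43936, 44464, 41533, 46343],
     "A": [515, 929, 800, 979, 1165],
     "B": [541, 710, 675, 1147, 939],
     "C": [928, 711, 824, 758, 635]}
df = pd.DataFrame(data=d)

X = sm.add_constant(df[["A", "B", "C"]])   # adds the bias/intercept term
model = sm.OLS(df["Cost"], X).fit()

print(model.params)    # fitted coefficients: intercept plus one weight per machine
print(model.pvalues)   # per-coefficient p-values for H0: coefficient == 0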

Binary Classification for 1-feature Time Series Data using LSTM

I'm a beginner with RNNs.
My current interest is using an LSTM to implement time series classification. I have seen many examples, such as MNIST classification.
However, I am trying to implement an LSTM (using the TensorFlow framework) to model binary classification for temperature trend prediction (i.e. up or down).
The following are some details of my experimental setting.
Dataset:
The dataset (Daily minimum temperatures in Melbourne) was obtained from here and has 3650 observations. I divided the data into windows of length 10, which produces 3640 instances. The corresponding labels are one-hot vectors indicating whether the value immediately following the window is higher or lower than the window's last value, i.e. [1, 0] for increasing and [0, 1] for decreasing.
For example, the dataset contains
1984/7/30,10
1984/7/31,10.6
1984/8/1,11.5
1984/8/2,10.2
1984/8/3,11.1
1984/8/4,11
1984/8/5,8.9
1984/8/6,9.9
1984/8/7,11.7
1984/8/8,11.6
1984/8/9,9
1984/8/10,6.3
1984/8/11,8.7
1984/8/12,8.5
1984/8/13,8.5
1984/8/14,8
1984/8/15,6
...
Two possible training windows would be (a code sketch of this windowing is given after the list):
(1) [10, 10.6, 11.5, 10.2, 11.1, 11, 8.9, 9.9, 11.7, 11.6] and label: [0, 1],
(2) [10.6, 11.5, 10.2, 11.1, 11, 8.9, 9.9, 11.7, 11.6, 9] and label: [0, 1],
(3) ...
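For what it's worth, a minimal sketch of that windowing in plain NumPy (ties are treated as "decreasing" here, which is an assumption the question does not spell out):

import numpy as np

def make_windows(series, window_len=10):
    """Build sliding windows and up/down one-hot labels as described above."""
    X, y = [], []
    for i in range(len(series) - window_len):
        window = series[i:i + window_len]
        next_value = series[i + window_len]
        # [1, 0] if the next value rises above the last value in the window,
        # [0, 1] otherwise (ties counted as "down", an assumption).
        label = [1, 0] if next_value > window[-1] else [0, 1]
        X.append(window)
        y.append(label)
    return np.asarray(X, dtype=np.float32), np.asarray(y, dtype=np.float32)

# The first rows of the excerpt above:
temps = np.array([10, 10.6, 11.5, 10.2, 11.1, 11, 8.9, 9.9, 11.7, 11.6, 9, 6.3])
X, y = make_windows(temps)
print(X.shape, y.shape)   # (2, 10) and (2, 2) for this short excerpt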
The problem: the model's accuracy wiggles around 49-50%. So my question is: does it make sense to build a binary classification model with a structure like the one described above?
Any help is appreciated.
Thanks.

Convolutional neural networks for sentiment analysis

I was trying to modify Yoon Kim's code for sentiment analysis using CNNs. He applies three filters of heights = [3, 4, 5] and width = 300 on
input=(batch_size, 1, len(sentence_vector), len(wordVector))
I'm stuck after the first Conv,Pool computation. Consider
input=(batch_size, 1, 64, 300)
64 is the length of every sentence vector and 300 is the word embedding size.
map=(20, 1, 3, 300)
In his implementation, he first applies a kernel of height=3 and width=300. Hence the output would be
convolution_output=(batch_size, 20, 62, 1)
After which he downsamples using poolsize=(62, 1). The output after MaxPooling becomes
maxpool_output=(batch_size, 20, 1, 1)
This is where I'm stuck.
In the paper he applies 3 filters of heights [3, 4, 5] and width = 300. But after applying the first filter, there seems to be no input left to convolve. How (and on what) do I apply the second kernel?
Any help or suggestions would be great. The git page contains a link to the paper.
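For reference, in Kim's paper the filters of different heights are all applied in parallel to the same embedded input, not one after another; each branch is max-pooled and the pooled features are concatenated. A minimal sketch of that idea (this uses Keras rather than the original Theano code, keeps the 20 feature maps per branch and the 64x300 input from the question, and the sigmoid output layer is an illustrative assumption):

import tensorflow as tf
from tensorflow.keras import layers, Model

sent_len, emb_dim, n_maps = 64, 300, 20

inp = layers.Input(shape=(sent_len, emb_dim, 1))   # one channel, like the (batch, 1, 64, 300) input

branches = []
for h in (3, 4, 5):
    # Each filter height convolves the *original* input, not the previous branch's output.
    conv = layers.Conv2D(n_maps, kernel_size=(h, emb_dim), activation="relu")(inp)
    # Pool over the full remaining length: (sent_len - h + 1, 1) -> (1, 1)
    pooled = layers.MaxPooling2D(pool_size=(sent_len - h + 1, 1))(conv)
    branches.append(layers.Flatten()(pooled))

features = layers.Concatenate()(branches)          # 3 branches x 20 maps = 60 features per sentence
out = layers.Dense(1, activation="sigmoid")(features)

model = Model(inp, out)
model.summary()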

Which Spark MLlib algorithm to use?

I'm a newbie to machine learning and would like to understand which algorithm (a classification algorithm or a correlation algorithm?) to use in order to understand the relationship between one or more attributes.
For example, consider that I have the following set of attributes:
Bill No, Bill Amount, Tip amount, Waiter Name
and I would like to figure out which attribute(s) contribute to the Tip amount.
The following is a sample set of data:
Bill No, Bill Amount, Tip amount, Waiter detail
1, 100, 10, Sathish
2, 200, 20, Sathish
3, 150, 10, Rahul
4, 200, 10, Simon
5, 100, 10, Sathish
In this case we know the Tip amount is influenced almost entirely (99%) by the Bill Amount. But I want to know which Spark MLlib algorithm I should use to figure this out, so that I could apply the same technique to a longer set of attributes.
One thing you can do is calculate the correlation between columns (attributes). Take a look at the tutorial on summary statistics on the MLlib website.
A more advanced approach would be to use dimensionality reduction. This should discover more complex dependencies.
You can calculate the correlation between different columns. Please refer to Correlations (https://spark.apache.org/docs/latest/mllib-statistics.html#correlations). For example, if you calculate the correlation between Bill Amount and Tip amount, you will most probably get a correlation value near 1.
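For what it's worth, a minimal PySpark sketch of that correlation suggestion (the app name, column names and sample rows are illustrative; this uses the DataFrame-based pyspark.ml.stat.Correlation API rather than the RDD-based API linked above):

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.stat import Correlation

spark = SparkSession.builder.appName("tip-correlation").getOrCreate()

# Sample rows from the question; the waiter name is left out of the numeric correlation.
rows = [(1, 100.0, 10.0, "Sathish"),
        (2, 200.0, 20.0, "Sathish"),
        (3, 150.0, 10.0, "Rahul"),
        (4, 200.0, 10.0, "Simon"),
        (5, 100.0, 10.0, "Sathish")]
df = spark.createDataFrame(rows, ["bill_no", "bill_amount", "tip_amount", "waiter"])

# Pack the numeric columns into a single vector column, as Correlation expects.
assembler = VectorAssembler(inputCols=["bill_amount", "tip_amount"], outputCol="features")
vec_df = assembler.transform(df)

corr_matrix = Correlation.corr(vec_df, "features").head()[0]
print(corr_matrix)   # the off-diagonal entry is the Bill Amount / Tip Amount correlation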

How to evaluate predictions from incomplete data, where not all data is incomplete

I am using Non-negative Matrix Factorization and Non-negative Least Squares for predictions, and I want to evaluate how good the predictions are depending on the amount of data given. For example, the original data was
original = [1, 1, 0, 1, 1, 0]
And now I want to see how good I can reconstruct the original data when the given data is incomplete:
incomplete1 = [1, 1, 0, 1, 0, 0],
incomplete2 = [1, 1, 0, 0, 0, 0],
incomplete3 = [1, 0, 0, 0, 0, 0]
And I want to do this for every example in a big dataset. Now the problem is that the original data varies in the number of positive entries; in the original above there are 4, but other examples in the dataset could have more or fewer. Let's say I run an evaluation round with 4 positives given, but half of my dataset has only 4 positives in total, while the other half has 5, 6 or 7. Should I exclude the half with only 4 positives, because they have no data missing, which makes the "prediction" much better? On the other hand, I would change the training set if I excluded data. What can I do? Or shouldn't I evaluate with 4 at all in this case?
EDIT:
Basically I want to see how well I can reconstruct the input matrix. For simplicity, say the "original" stands for a user who watched 4 movies. I then want to know how well I can predict each user based on just 1 movie that the user actually watched; I get a prediction for lots of movies. Then I plot a ROC and a precision-recall curve (using the top-k of the prediction). I repeat all of this with n movies that the users actually watched, so I get one ROC curve in my plot for every n. When I come to the point where I use e.g. 4 movies that the user actually watched to predict all the movies he watched, but he only watched those 4, the results get too good.
The reason I am doing this is to see how many "watched movies" my system needs to make reasonable predictions. If it only returned good results when 3 movies have already been watched, it would not be very useful in my application.
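A minimal sketch of that per-user hold-out evaluation with scikit-learn's NMF (the toy matrix, the choice of n_given and all variable names are made up for illustration; the hidden positives are scored against the user's unwatched items):

import numpy as np
from sklearn.decomposition import NMF
from sklearn.metrics import roc_auc_score

# Toy user-item matrix: 1 = watched, 0 = unknown (purely illustrative).
R = np.array([[1, 1, 0, 1, 1, 0, 1, 0],
              [1, 0, 1, 0, 1, 1, 0, 1],
              [0, 1, 1, 1, 0, 0, 1, 1],
              [1, 1, 0, 0, 1, 0, 1, 1]], dtype=float)

n_given = 2    # how many watched movies the model is allowed to see
user = 0
rng = np.random.default_rng(0)

watched = np.flatnonzero(R[user])
hidden = rng.choice(watched, size=len(watched) - n_given, replace=False)

masked = R.copy()
masked[user, hidden] = 0    # hide all but n_given of this user's positives

model = NMF(n_components=2, init="random", random_state=0, max_iter=500)
scores = model.fit_transform(masked) @ model.components_

# Rank the hidden positives against the items the user never watched.
candidates = np.concatenate([hidden, np.flatnonzero(R[user] == 0)])
y_true = (R[user, candidates] > 0).astype(int)
print(roc_auc_score(y_true, scores[user, candidates]))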
I think it's first important to be clear what you are trying to measure, and what your input is.
Are you really measuring ability to reconstruct the input matrix? In collaborative filtering, the input matrix itself is, by nature, very incomplete. The whole job of the recommender is to fill in some blanks. If it perfectly reconstructed the input, it would give no answers. Usually, your evaluation metric is something quite different from this when using NNMF for collaborative filtering.
FWIW I am commercializing exactly this -- CF based on matrix factorization -- as Myrrix. It is based on my work in Mahout. You can read the docs about some rudimentary support for tests like Area under curve (AUC) in the product already.
Is "original" here an example of one row, perhaps for one user, in your input matrix? When you talk about half, and excluding, what training/test split are you referring to? splitting each user, or taking a subset across users? Because you seem to be talking about measuring reconstruction error, but that doesn't require excluding anything. You just multiply your matrix factors back together and see how close they are to the input. "Close" means low L2 / Frobenius norm.
But for convention recommender tests (like AUC or precision recall), which are something else entirely, you would either split your data into test/training by time (recent data is the test data) or value (most-preferred or associated items are the test data). If I understand the 0s to be missing elements of the input matrix, then they are not really "data". You wouldn't ever have a situation where the test data were all the 0s, because they're not input to begin with. The question is, which 1s are for training and which 1s are for testing.
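For the reconstruction-error measurement described above, a minimal scikit-learn sketch (the toy matrix is made up; "close" here is the Frobenius norm of the residual):

import numpy as np
from sklearn.decomposition import NMF

# Toy user-item matrix; 1 = watched, 0 = unknown (illustrative only).
R = np.array([[1, 1, 0, 1, 1, 0],
              [1, 0, 1, 0, 1, 1],
              [0, 1, 1, 1, 0, 0]], dtype=float)

model = NMF(n_components=2, init="random", random_state=0, max_iter=500)
W = model.fit_transform(R)   # user factors
H = model.components_        # item factors

reconstruction = W @ H
# Low Frobenius norm of the residual = "close" in the answer's sense.
print(np.linalg.norm(R - reconstruction, ord="fro"))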
