How should I read the sum of the values that RMSPROP produces? - machine-learning

I have a 2D time series data-set with integers ranging in 1,000,000 - 2,000,000 output on any given day. Of course my data is not limited, as I can sum up to weekly values hence the range increasing to over 10,000,000.
I'm able to achieve RMSE = 0.02 whenever I normalize my data, but when I feed the raw(1 million range) data into the algorithm, RSME can equal up to 30k - 150k error range.
Why in one version of the RMSE outputs my "global minima" is 0.02, while the other output in higher ranges? I've been testing with AdaDelta.

The definition of RMSE is:
The scale of this value directly depends on the scale on predictions and actuals, so it's quite normal that you get a higher RMSE value when you don't normalize the dataset.
This is why normalization is important, as it lets us compare error metrics across models and datasets.

Related

MLP Time Series Forecasting Model not learning the correct distribution

The goal is to build a model that can rank the stocks based on their Target Return.
The dataset I am using is structured the following way:
Date
Stock_Id
Volume
Open
Close
High
Low
Target
2022-01-04
8341
103422
2734
2742
2755
2730
0.0007304601899196
The data is chronological, and all the stocks are grouped by day. There are only 2000 stocks.
Using the Open, High, Low, Close, and Volume features, I was able to generate about 65+ new features using talib library
The Neural Net takes in all 74 total features and is meant to predict the Target number of each stock. The Target represents the rate of change of the stock between t+2 and t+1.
What I've done:
I have normalized the data accross time. This means that I take all of the time domain data for any given StockId and use it to compute mean and std in order to normalize the data.
The dataloader indexes into the dataset by day. This acts as a mini batch since when getitem is called a (2000, 74) tensor is returned.
The train dataloader is set to shuffle, so the indexing is random.
The loss function is meant to optimize for the highest returns all while keeping the standard deviation of those returns in check.
Because of the non chronological nature of the dataloader and the fact that I need all outputs to compute mean and std, the training loop using optim.SGD but acts as Gradient Descent.
During training, the loss converges very well, but the predictions are very off.
After training, I use the test set to view the accuracy of the prediction. This is the distribution of the predictions. And this is the distribution of the true targets.
The range of the Targets percentages are mainly between -15 and 15. Where as the predictions have a smaller range, -0.07 to -0.05.
I have tried running the model using different learning rates, changing the hidden layer sizes in the network, multiplying the targets by 100 to represent them as percentages, changing the timeperiod argument in many of the talib features. But I always get a model that poorly represents the data.
More Info:
The data is from 1300 trading days with each day containing data for 2000 stocks.
I have adjusted the Open, Close, High, Low prices before normalizing.
I have tried different neural network sizes. Having input layer length 74, and output being 1, with any arbitrary number of layers and sizes in between.
This is what the training loop looks like.
loss_lst = []
for epoch in range(50): # loop over the dataset multiple times
optimizer.zero_grad()
running_error = torch.tensor(0.0)
running_return = []
running_loss = torch.tensor(0.0)
for i, data in enumerate(trainloader, 0):
# get the inputs; data is a list of [inputs, labels]
inputs, labels, date = data
inputs = inputs.cuda().squeeze()
labels = labels.cuda().squeeze()
# forward pass
outputs = net(inputs).squeeze()
error = error_criterion(outputs, labels) # Error between prediction and label
running_error += error.cpu()
day_return = return_criterion(outputs,labels).cpu()
running_return.append(day_return) # Array of daily returns that will be used to compute the std for the loss function.
del outputs
del inputs
del labels
avg_error = running_error / len(trainloader)
std_return = torch.stack(running_return).cuda().std()
loss = (mean_hp * avg_error) + (std_hp * std_return)
loss.backward()
optimizer.step()
running_loss += loss.item()
loss_lst.append(loss.item())
print(f'[{epoch + 1}] loss: {running_loss:.10f}')
# loss_lst.append(sum(running_loss)/len(trainloader))
running_loss = 0.0
print('Finished Training')
Let me know if any additional clarifications are needed.

How does facebook's prophet librabt compute RMSE , does it takes into account for confidence interval and how?

I am using my own models to predict and compute RMSE value, generally, we take a window size say 10 days (units of time) and 1-day horizon, and then we test the prediction but when working with prophet library it's not very clear as the predictions are not that jittery its smooth line with a confidence interval also plotted so the question is, is that confidence interval also being used in calculating RMSE if yes then how?
output of this code snippet :
m=Prophet(seasonality_mode='multiplicative')
m.fit(train)
future=m.make_future_dataframe(periods=12,freq='MS')
forecast=m.predict(future)
forecast.tail()
# interested in output of below code , which given rmse value
fb_cv=cross_validation(m,initial=initial,period=period,horizon=horizon)
performance_metrics(fb_cv)

Normalize data with outlier inside interval

I have a dataset with some outliers, which are 10 or 100 times greater than the normal values. I cannot throw out these rows, and I want to normalize this data in an interval [0, 1]
First of all, here's what I thought to do:
Simply rank my dataset's rows and use the ranked positions as variable to normalize. Since we have a uniform distribution here, it is easy. The problem is that the value's differences are not measured, so values with a large difference could have similar normalized values if there aren't intermediate value examples in this dataset
Use sklearn.preprocessing.RobustScaler method. But I got normalized values between -0.4 and 300. It is still not good to normalize something in this scale
Distribute normalized values between 0 and 0.8 in a linear way for all values where quantile <= 0.8, and distribute the values between 0.8 and 1.0 among the remaining values in a similar way to the ranking strategy I mentioned above
Make a 1D Kmeans algorithm to locate all near values and get a cluster with non-outlier values. For these values, I just distribute normalized values between 0 and the quantile value it represents, simply by doing (value - mean) / (max - min), and for the remaining outlier values, I distribute the range between values greater than the quantile and 1 with the ranking strategy
Create a filter function, like a sigmoid, and multiply values by it. Smaller values remain unchanged, but the outlier's values are approximated to non-outlier values. Then, I normalize it. But how can I design this sigmoid's parameters?
First of all, I would like to get some feedbacks about these strategies, what do you think about them?
And also, how is this problem normally solved? Is there any references to recommend?
Thank you =)

Parameter tuning/model selection using resampling

I have been trying to get into more details of resampling methods and implemented them on a small data set of 1000 rows. The data was split into 800 training set and 200 validation set. I used K-fold cross validation and repeated K-fold cross validation to train the KNN using the training set. Based on my understanding I have done some interpretations of the results - however, I have certain doubts about them (see questions below):
Results :
10 Fold Cv
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 720, 720, 720, 720, 720, 720, ...
Resampling results across tuning parameters:
k Accuracy Kappa
5 0.6600 0.07010791
7 0.6775 0.09432414
9 0.6800 0.07054371
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was k = 9.
Repeated 10 fold with 10 repeats
Resampling results across tuning parameters:
k Accuracy Kappa
5 0.670250 0.10436607
7 0.676875 0.09288219
9 0.683125 0.08062622
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was k = 9.
10 fold, 1000 repeats
k Accuracy Kappa
5 0.6680438 0.09473128
7 0.6753375 0.08810406
9 0.6831800 0.07907891
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was k = 9.
10 fold with 2000 repeats
k Accuracy Kappa
5 0.6677981 0.09467347
7 0.6750369 0.08713170
9 0.6826894 0.07772184
Doubts:
While selecting the parameter, K=9 is the optimal value for highest accuracy. However, I don't understand how to take Kappa into consideration while finally choosing parameter value?
Repeat number has to be increased until we get stabilised result, the accuracy changes when the repeats are increased from 10 to 1000. However,the results are similar for 1000 repeats and 2000 repeats. Will it be right to consider the results of 1000/2000 repeats to be stabilised performance estimate?
Any thumb rule for the repeat number?
Finally,should I train the model on my complete training data (800 rows) now test the accuracy on the validation set ?
Accuracy and Kappa are just different classification performance metrics. In a nutshell, their difference is that Accuracy does not take possible class imbalance into account when calculating the metrics, while Kappa does. Therefore, with imbalanced classes, you might be better off using Kappa. With R caret you can do so via the train::metric parameter.
You could see a similar effect of slightly different performance results when running e.g. the 10CV with 10 repeats multiple times - you will just get slightly different results for those as well. Something you should look out for is the variance of classification performance over your partitions and repeats. In case you obtain a small variance you can derive that you by training on all your data, you likely obtain a model that will give you similar (hence stable) results on new data. But, in case you obtain a huge variance, you can derive that just by chance (being lucky or unlucky) you might instead obtain a model that either gives you rather good or rather bad performance on new data. BTW: the prediction performance variance is something e.g. R caret::train will give you automatically, hence I'd advice on using it.
See above: look at the variance and increase the repeats until you can e.g. repeat the whole process and obtain a similar average performance and variance of performance.
Yes, CV and resampling methods exist to give you information about how well your model will perform on new data. So, after performing CV and resampling and obtaining this information, you will usually use all your data to train a final model that you use in your e.g. application scenario (this includes both train and test partition!).

Normalizing feature values for SVM

I've been playing with some SVM implementations and I am wondering - what is the best way to normalize feature values to fit into one range? (from 0 to 1)
Let's suppose I have 3 features with values in ranges of:
3 - 5.
0.02 - 0.05
10-15.
How do I convert all of those values into range of [0,1]?
What If, during training, the highest value of feature number 1 that I will encounter is 5 and after I begin to use my model on much bigger datasets, I will stumble upon values as high as 7? Then in the converted range, it would exceed 1...
How do I normalize values during training to account for the possibility of "values in the wild" exceeding the highest(or lowest) values the model "seen" during training? How will the model react to that and how I make it work properly when that happens?
Besides scaling to unit length method provided by Tim, standardization is most often used in machine learning field. Please note that when your test data comes, it makes more sense to use the mean value and standard deviation from your training samples to do this scaling. If you have a very large amount of training data, it is safe to assume they obey the normal distribution, so the possibility that new test data is out-of-range won't be that high. Refer to this post for more details.
You normalise a vector by converting it to a unit vector. This trains the SVM on the relative values of the features, not the magnitudes. The normalisation algorithm will work on vectors with any values.
To convert to a unit vector, divide each value by the length of the vector. For example, a vector of [4 0.02 12] has a length of 12.6491. The normalised vector is then [4/12.6491 0.02/12.6491 12/12.6491] = [0.316 0.0016 0.949].
If "in the wild" we encounter a vector of [400 2 1200] it will normalise to the same unit vector as above. The magnitudes of the features is "cancelled out" by the normalisation and we are left with relative values between 0 and 1.

Resources