Avoiding overfitting with random forest - random-forest

I am training a random forest model for the first time and I find this situation.
My accuracy on the training set, with the default parameters (as in
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html ) is very high, 0.95 or more , which looks a lot like overfitting. On the test set, accuracy goes to 0.66. My goal would be to make the model less overfitting, hoping to improve performance on the test set.
I tried to perform 5-fold cross validation, using random grid search like here ( https://towardsdatascience.com/hyperparameter-tuning-the-random-forest-in-python-using-scikit-learn-28d2aa77dd74 ) with the following grid:
n_estimators = [16,32,64,128]
max_features = ['auto', 'sqrt']
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)
min_samples_split = [2, 5, 10]
min_samples_leaf = [1, 2, 4]
bootstrap = [True, False]
random_grid = {'n_estimators': n_estimators,
'max_features': max_features,
'max_depth': max_depth,
'min_samples_split': min_samples_split,
'min_samples_leaf': min_samples_leaf,
'bootstrap': bootstrap}
The best model had an accuracy of 0.7 across the folds.
I used the best selected parameters in step 2 on the training set and test set, but again accuracy on training set was 0.95 and test set 0.66.
Any suggestion ? What do you think is going on here ? How can I reach the result to avoid overfitting ( and maybe improve model performance ) ?

Over here someone had the same question and received some helpful answers:
https://stats.stackexchange.com/questions/111968/random-forest-how-to-handle-overfitting
Your approach to use 5-fold crossvalidation is already very good and can perhaps be improved by utilizing 10-fold crossvalidation instead.
Another question you can ask yourself is about the quality of your data set. Are your classes balanced? If they aren't you could try to handle a class imbalance issue, because with imbalance comes usually a bias towards the majority class.
It is also possible that the dataset is perhaps not big enough and increasing it could boost your performance as well.
I hope this helps a bit.

Adding this late comment in case it helps others.
In addition to the parameters mentioned above (n_estimators, max_features, max_depth, and min_samples_leaf) consider setting 'min_impurity_decrease'.
You can use 'gini' or 'entropy' for the Criterion, however, I recommend sticking with 'gini', the default. In the majority of cases, they produce the same result but 'entropy' is more computational expensive to compute.
Max depth works well and is an intuitive way to stop a tree from growing, however, just because a node is less than the max depth doesn't always mean it should split. If the information gained from splitting only addresses a single/few misclassification(s) then splitting that node may be supporting overfitting. You may or may not find this parameter useful, depending on the size of your dataset and/or your feature space size and complexity, but it is worth considering while tuning your parameters.
You do not describe how you split your dataset, so consider using a slightly smaller training set. Also make sure you do not have categorical variables in your feature space. If you do, use OneHotEncoding or pd.get_dummies to break those out.
I'm not sure how large your feature space is but you may want to use a smaller subset of your features (depending on how many noise variables you have). You also may want to look at a smaller max_depth. Your depth test all the way up to 110, that's very large. Again, I do not know your feature space but look to the lower end of your range to start and expand from there. i.e. try: [5, 7, 9] if 9 is optimal then adjust to say [9, 11, 13] etc. Although even a depth of 9 can cause overfitting (depending on the data) so be careful not to grow this too much. Possible pair with the gini index.

Related

deep neural network model stops learning after one epoch

I am training a unsupervised NN model and for some reason, after exactly one epoch (80 steps), model stops learning.
]
Do you have any idea why it might happen and what should I do to prevent it?
This is more info about my NN:
I have a deep NN that tries to solve an optimization problem. My loss function is customized and it is my objective function in the optimization problem.
So if my optimization problems is min f(x) ==> loss, now in my DNN loss = f(x). I have 64 input, 64 output, 3 layers in between :
self.l1 = nn.Linear(input_size, hidden_size)
self.relu1 = nn.LeakyReLU()
self.BN1 = nn.BatchNorm1d(hidden_size)
and last layer is:
self.l5 = nn.Linear(hidden_size, output_size)
self.tan5 = nn.Tanh()
self.BN5 = nn.BatchNorm1d(output_size)
to scale my network.
with more layers and nodes(doubles: 8 layers each 200 nodes), I can get a little more progress toward lower error, but again after 100 steps training error becomes flat!
The symptom is that the training loss stops being improved relatively early. Suppose that your problem is learnable at all, there are many reasons for the for this behavior. Following are most relavant:
Improper preprocessing of input: Neural network prefers input with
zero mean. E.g., if the input is all positive, it will restrict the
weights to be updated in the same direction, which may not be
desirable (https://youtu.be/gYpoJMlgyXA).
Therefore, you may want to subtract the mean from all the images (e.g., subtract 127.5 from each of the 3 channels). Scaling to make unit standard deviation in each channel may also be helpful.
Generalization ability of the network: The network is not complicated
or deep enough for the task.
This is very easy to check. You can train the network on just a few
images (says from 3 to 10). The network should be able to overfit the
data and drives the loss to almost 0. If it is not the case, you may
have to add more layers such as using more than 1 Dense layer.
Another good idea is to used pre-trained weights (in applications of Keras documentation). You may adjust the Dense layers at the top to fit with your problem.
Improper weight initialization. Improper weight initialization can
prevent the network from converging (https://youtu.be/gYpoJMlgyXA,
the same video as before).
For the ReLU activation, you may want to use He initialization
instead of the default Glorot initialiation. I find that this may be
necessary sometimes but not always.
Lastly, you can use debugging tools for Keras such as keras-vis, keplr-io, deep-viz-keras. They are very useful to open the blackbox of convolutional networks.
I faced the same problem then I followed the following:
After going through a blog post, I managed to determine that my problem resulted from the encoding of my labels. Originally I had them as one-hot encodings which looked like [[0, 1], [1, 0], [1, 0]] and in the blog post they were in the format [0 1 0 0 1]. Changing my labels to this and using binary crossentropy has gotten my model to work properly. Thanks to Ngoc Anh Huynh and rafaelvalle!

Can intercept and regression coefficients (Beta values) be very high?

I have 38 variables, like oxygen, temperature, pressure, etc and have a task to determine the total yield produced every day from these variables. When I calculate the regression coefficients and intercept value, they seem to be abnormal and very high (Impractical). For example, if 'temperature' coefficient was found to be +375.456, I could not give a meaning to them saying an increase in one unit in temperature would increase yield by 375.456g. That's impractical in my scenario. However, the prediction accuracy seems right. I would like to know, how to interpret these huge intercept( -5341.27355) and huge beta values shown below. One other important point is that I removed multicolinear columns and also, I am not scaling the variables/normalizing them because I need beta coefficients to have meaning such that I could say, increase in temperature by one unit increases yield by 10g or so. Your inputs are highly appreciated!
modl.intercept_
Out[375]: -5341.27354961415
modl.coef_
Out[376]:
array([ 1.38096017e+00, -7.62388829e+00, 5.64611255e+00, 2.26124164e-01,
4.21908571e-01, 4.50695302e-01, -8.15167717e-01, 1.82390184e+00,
-3.32849969e+02, 3.31942553e+02, 3.58830763e+02, -2.05076898e-01,
-3.06404757e+02, 7.86012402e+00, 3.21339318e+02, -7.00817205e-01,
-1.09676321e+04, 1.91481734e+00, 6.02929848e+01, 8.33731416e+00,
-6.23433431e+01, -1.88442804e+00, 6.86526274e+00, -6.76103795e+01,
-1.11406021e+02, 2.48270706e+02, 2.94836048e+01, 1.00279016e+02,
1.42906659e-02, -2.13019683e-03, -6.71427100e+02, -2.03158515e+02,
9.32094007e-03, 5.56457014e+01, -2.91724945e+00, 4.78691176e-01,
8.78121854e+00, -4.93696073e+00])
It's very unlikely that all of these variables are linearly correlated, so I would suggest that you have a look at simple non-linear regression techniques, such as Decision Trees or Kernel Ridge Regression. These are however more difficult to interpret.
Going back to your issue, these high weights might well be due to there being some high amount of correlation between the variables, or that you simply don't have very much training data.
If you instead of linear regression use Lasso Regression, the solution is biased away from high regression coefficients, and the fit will likely improve as well.
A small example on how to do this in scikit-learn, including cross validation of the regularization hyper-parameter:
from sklearn.linear_model LassoCV
# Make up some data
n_samples = 100
n_features = 5
X = np.random.random((n_samples, n_features))
# Make y linear dependent on the features
y = np.sum(np.random.random((1,n_features)) * X, axis=1)
model = LassoCV(cv=5, n_alphas=100, fit_intercept=True)
model.fit(X,y)
print(model.intercept_)
If you have a linear regression, the formula looks like this (y= target, x= features inputs):
y= x1*b1 +x2*b2 + x3*b3 + x4*b4...+ c
where b1,b2,b3,b4... are your modl.coef_. AS you already realized one of your bigges number is 3.319+02 = 331 and the intercept is also quite big with -5431.
As you already mentioned the coeffiecient variables means how much the target variable changes, if the coeffiecient feature changes with 1 unit and all others features are constant.
so for your interpretation, the higher the absoult coeffienct, the higher the influence of your analysis. But it is important to note that the model is using a lot of high coefficient, that means your model is not depending only of one variable

Is Total Error Mean an adequate performance metric for regression models?

I'm working on a regression model and to evaluate the model performance, my boss thinks that we should use this metric:
Total Absolute Error Mean = mean(y_predicted) / mean(y_true) - 1
Where mean(y_predicted) is the average of all the predictions and mean(y_true) is the average of all the true values.
I have never seen this metric being used in machine learning before and I convinced him to add Mean Absolute Percentage Error as an alternative, yet even though my model is performing better regarding MAPE, some areas underperform when we look at Total Absolute Error Mean.
My gut feeling is that this metric is wrong in displaying the real accuracy, but I can't seem to understand why.
Is Total Absolute Error Mean a valid performance metric? If not, then why? If it is, why would a regression model's accuracy increase in terms of MAPE, but not in terms of Total Absolute Error Mean?
Thank you in advance!
I would kindly suggest to inform your boss that, when one wishes to introduce a new metric, it is on him/her to demonstrate why it is useful on top of the existing ones, not the other way around (i.e. us demonstrating why it is not); BTW, this is exactly the standard procedure when someone really comes up with a new proposed metric in a research paper, like the recent proposal of the Maximal Information Coefficient (MIC).
That said, it is not difficult to demonstrate in practice that this proposed metric is a poor one with some dummy data:
import numpy as np
from sklearn.metrics import mean_squared_error
# your proposed metric:
def taem(y_true, y_pred):
return np.mean(y_true)/np.mean(y_pred)-1
# dummy true data:
y_true = np.array([0,1,2,3,4,5,6])
Now, suppose that we have a really awesome model, which predicts perfectly, i.e. y_pred1 = y_true; in this case both MSE and your proposed TAEM will indeed be 0:
y_pred1 = y_true # PERFECT predictions
mean_squared_error(y_true, y_pred1)
# 0.0
taem(y_true, y_pred1)
# 0.0
So far so good. But let's now consider the output of a really bad model, which predicts high values when it should have predicted low ones, and vice versa; in other words, consider a different set of predictions:
y_pred2 = np.array([6,5,4,3,2,1,0])
which is actually y_pred1 in reverse order. Now, it easy to see that here we will also have a perfect TAEM score:
taem(y_true, y_pred2)
# 0.0
while of course MSE would have warned us that we are very far indeed from perfect predictions:
mean_squared_error(y_true, y_pred2)
# 16.0
Bottom line: Any metric that ignores element-wise differences in favor of only averages suffers from similar limitations, namely taking identical values for any permutation of the predictions, a characteristic which is highly undesirable for a useful performance metric.

Is this the right way to predict a time series with the stateful LSTM neural network?

I am using LSTM neural networks (stateful) for time series prediction.
I'm hoping that the stateful LSTM can capture the hidden patterns and make a satisfactory prediction (the physical law that cause the variation of the time series is not clear).
I have a time series X with a length of 1500 (actual observational data), and my purpose is to predict the future 100.
I suppose predict the next 10 will be more promising than predict the next 100 (is that right?).
So, I prepare the training data like this (always using 100 values to predict the next 10; x_n denotes the n-th element in X):
shape of trainX: [140, 100, 1]
shape of trainY: [140, 10, 1]
---
0: [x_0, x_1, ..., x_99] -> [x_100, x_101, ..., x_109]
1: [x_10, x_11, ..., x_109] -> [x_110, x_111, ..., x_119]
2: [x_20, x_21, ..., x_119] -> [x_120, x_121, ..., x_129]
...
139: [x_1390, x_1391, ..., x_1489] -> [x_1490, x_1491, ..., x_1499]
---
After the training, I want to use the model to predict the next 10 values [x_1500 - x_1509] with [x_1400 - x_1499], and then predict the next 10 values [x_1510 - x_1519] with [x_1410 - x_1509].
Is this the right way?
After a lot of reading of documents and examples, I can train a model and make the prediction, but the result seems not satisfactory.
To validate the method, I assume that the last 100 (x_1400 - x_1499) values are unknown, and remove them from trainX and trainY, then try to train a model and predict them. Lastly, compare the predicted values with the observed values.
Any suggestions or comments will be appreciated.
The time series looks like this:
Your question is really complexed. Before I will try to answer it - I'll share my doubts with you about is it sensible to use LSTM for your task. You want to use a really advanced model (LSTM are capable to learn really complexed patterns) to a time series which seems to be relatively easy. Moreover - you have a really small amout of data. To be honest - I would try to train simpler and easier methods first (like ARMA or ARIMA).
To answer your question - if your approach is good - it seems to be reasonable. Other reasonable methods are predicting all 100 steps or e.g. 50 steps twice. With 10 steps you might come across error cumulation - but still it might be a good method.
As I mentioned earlier - I would rather try easier ML method for this task but if you really want to use LSTM you may tackle this problem in a following way:
Define metaparameters like number of steps ahead you want to predict, the size of input fed to network.
Try to use e.g. grid search in order to find the best value of this metaparameters. Evaluate each setup using k-fold crossvalidation.
Retrain final model using the best metaparameter setup.
You have relatively small amount of data so you may easily find the best values of hyperparameters. This will also show you if your approach is good or not - simply check the results provided by the best solution.

Optimal parameter estimation for a classifier with multiple parameters

The image on the left shows a standard ROC curve formed by sweeping a single threshold and recording the corresponding True Positive Rate (TPR) and False Positive Rate (FPR).
The image on the right shows my problem setup where there are 3 parameters, and for each, we have only 2 choices. Together, it produces 8 points as depicted on the graph. In practice, I intend to have thousands of possible combinations of 100s of parameters, but the concept remains the same in this down-scaled case.
I intend to find 2 things here:
Determine the optimum parameter(s) for the given data
Provide an overall performance score for all combinations of parameters
In the case of the ROC curve on the left, this is done easily using the following methods:
Optimal parameter: Maximal difference of TPR and FPR with a cost component (I believe it is called the J-statistic?)
Overall performance: Area under the curve (the shaded portion in the graph)
However, for my case in the image on the right, I do not know if the methods I have chosen are the standard principled methods that are normally used.
Optimal parameter set: Same maximal difference of TPR and FPR
Parameter score = TPR - FPR * cost_ratio
Overall performance: Average of all "parameter scores"
I have found a lot of reference material for the ROC curve with a single threshold and while there are other techniques available to determine the performance, the ones mentioned in this question is definitely considered a standard approach. I found no such reading material for the scenario presented on the right.
Bottomline, the question here is two-fold: (1) Provide methods to evaluate the optimal parameter set and overall performance in my problem scenario, (2) Provide reference that claims the suggested methods to be a standard approach for the given scenario.
P.S.: I had first posted this question on the "Cross Validated" forum, but didn't get any responses, in fact, got only 7 views in 15 hours.
I'm going to expand a little on aberger's previous answer on a Grid Search. As with any tuning of a model it's best to optimise hyper-parameters using one portion of the data and evaluate those parameters using another proportion of the data, so GridSearchCV is best for this purpose.
First I'll create some data and split it into training and test
import numpy as np
from sklearn import model_selection, ensemble, metrics
np.random.seed(42)
X = np.random.random((5000, 10))
y = np.random.randint(0, 2, 5000)
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.3)
This gives us a classification problem, which is what I think you're describing, though the same would apply to regression problems too.
Now it's helpful to think about what parameters you may want to optimise. A cross-validated grid search is a computational expensive process, so the smaller the search space the quicker it gets done. I will show an example for a RandomForestClassifier because it's my go to model.
clf = ensemble.RandomForestClassifier()
parameters = {'n_estimators': [10, 20, 30],
'max_features': [5, 8, 10],
'max_depth': [None, 10, 20]}
So now I have my base estimator and a list of parameters that I want to optimise. Now I just have to think about how I want to evaluate each of the models that I'm going to build. It seems from your question that you're interested in the ROC AUC, so that's what I'll use for this example. Though you can chose from many default metrics in scikit or even define your own.
gs = model_selection.GridSearchCV(clf, param_grid=parameters,
scoring='roc_auc', cv=5)
gs.fit(X_train, y_train)
This will fit a model for all possible combinations of parameters that I have given it, using 5-fold cross-validation evaluate how well those parameters performed using the ROC AUC. Once that's been fit, we can look at the best parameters and pull out the best performing model.
print gs.best_params_
clf = gs.best_estimator_
Outputs:
{'max_features': 5, 'n_estimators': 30, 'max_depth': 20}
Now at this point you may want to retrain your classifier on all of the training data, as currently it's been trained using cross-validation. Some people prefer not to, but I'm a retrainer!
clf.fit(X_train, y_train)
So now we can evaluate how well the model performs on both our training and test set.
print metrics.classification_report(y_train, clf.predict(X_train))
print metrics.classification_report(y_test, clf.predict(X_test))
Outputs:
precision recall f1-score support
0 1.00 1.00 1.00 1707
1 1.00 1.00 1.00 1793
avg / total 1.00 1.00 1.00 3500
precision recall f1-score support
0 0.51 0.46 0.48 780
1 0.47 0.52 0.50 720
avg / total 0.49 0.49 0.49 1500
We can see that this model has overtrained by the poor score on the test set. But this is not surprising as the data is just random noise! Hopefully when performing these methods on data with a signal you will end up with a well-tuned model.
EDIT
This is one of those situations where 'everyone does it' but there's no real clear reference to say this is the best way to do it. I would suggest looking for an example close to the classification problem that you're working on. For example using Google Scholar to search for "grid search" "SVM" "gene expression"
I feeeeel like we're talking about Grid Search in scikit-learn. It (1), provides methods to evaluate optimal (hyper)parameters and (2), is implemented in a massively popular and well referenced statistical software package.

Resources