Terrible performance with h2o xgboost imbalanced data - machine-learning

I have a dataset of around 1M rows with a high imbalance (743 / 1072780). I am training xgboost model in h2o with the following parameters and it looks like it is overfitting
H2OXGBoostEstimator(max_depth=10,
subsample=0.7,
ntrees=200,
learn_rate=0.5,
min_rows=3,
col_sample_rate_per_tree = .75,
reg_lambda=2.0,
reg_alpha=2.0,
sample_rate = .5,
booster='gbtree',
nfolds=10,
keep_cross_validation_predictions = True,
stopping_metric = 'AUCPR',
min_split_improvement= 1e-5,
categorical_encoding = 'OneHotExplicit',
weights_column = "Products"
)
The output is:
Training data AUCPR: 0.6878932664592388 Validation data AUCPR: 0.04033158660014747
Training data AUC: 0.9992170372214433 Validation data AUC: 0.7000804189162043
Training data MSE: 0.0005722912424124134 Validation data MSE: 0.0010002949568585474
Training data RMSE: 0.023922609439866994 Validation data RMSE: 0.03162743993526108
Training data Gini: 0.9984340744428866 Validation data Gini: 0.40016083783240863
Confusion Matrix for Training Data:
Confusion Matrix (Act/Pred) for max f1 # threshold = 0.15900755567210062:
0 1 Error Rate
----- ------ --- ------- ----------------
0 709201 337 0.0005 (337.0/709538.0)
1 189 516 0.2681 (189.0/705.0)
Total 709390 853 0.0007 (526.0/710243.0)
Confusion Matrix (Act/Pred) for max f1 # threshold = 0.047459165255228676:
0 1 Error Rate
----- ------ --- ------- ----------------
0 202084 365 0.0018 (365.0/202449.0)
1 140 52 0.7292 (140.0/192.0)
Total 202224 417 0.0025 (505.0/202641.0)
{'train': , 'valid': }
I am using h2o 3.32.0.1 version (since it's a requirement), xgboost h2o doesnt support balance_classes or scale_pos_weight hyperparameters.
What can cause this to have such performance? Also, What can be improved here for such an imbalanced dataset that might improve the performance?

Training with such severely imbalanced data set is pointless. I would try a combination of up sampling and down sampling to get a more balanced data set that does not get too small.

This may be the worst class imbalance I have ever seen in a problem.
If you can subset your majority class - not until the point that it is balanced - but until the balance is less sever while still being representative (i.e., 15/85% minority/majority), you'll have more luck with other conventional techniques, or a mixture (i.e., up sampling and
augmentation.) Can the data logically be subset to help with the imbalance? For example if data ranges back several years, you could use only the last year's worth of data. I'd also manually optimize the threshold against the minority class, like true positive rate.

Related

Why is train loss and validate loss both in a straight line?

I am using Conv-LSTM for training, and the input features have been proven to be effective in some papers, and I can use CNN+FC networks to extract features and classify them. I change the task to regression here, and I can also achieve model convergence with Conv+FC. Later, I tried to use Conv-LSTM for processing to consider the timing characteristics of the corresponding data. Specifically: return the output of the current moment based on multiple historical inputs and the input of the current moment. The Conv-LSTM code I used: https://github.com/ndrplz/ConvLSTM_pytorch. My Loss is L1-Loss and optimizer is Adam.
A loss curve is below:
Example loss value:
Epoch:1/500 AVG Training Loss:16.40108 AVG Valid Loss:22.40100
Best validation loss: 22.400997797648113
Saving best model for epoch 1
Epoch:2/500 AVG Training Loss:16.42522 AVG Valid Loss:22.40100
Epoch:3/500 AVG Training Loss:16.40599 AVG Valid Loss:22.40100
Epoch:4/500 AVG Training Loss:16.40175 AVG Valid Loss:22.40100
Epoch:5/500 AVG Training Loss:16.42198 AVG Valid Loss:22.40101
Epoch:6/500 AVG Training Loss:16.41907 AVG Valid Loss:22.40101
Epoch:7/500 AVG Training Loss:16.42531 AVG Valid Loss:22.40101
My attempt:
Adjust the data set to only a few samples, verify that it can be overfitted, and the network code should be fine.
Adjusting the learning rate, I tried 1e-3, 1e-4, 1e-5 and 1e-6, but the loss curve is still flat as before, and even the value of the loss curve has not changed much.
Replace the optimizer with SGD, and the training result is also the above problem.
Because my data is wireless data (I-Q), neither CV nor NLP input type, here are some questions to ask about deep learning training.
After some testing, I finally found that my initial learning rate was too small. According to my previous single-point data training, the learning rate of 1e-3 is large enough, so here is preconceived, and it is adjusted from 1e-3 to a small tune, but in fact, the learning rate of 1e-3 is too small, resulting in the network not learning at all. Later, the learning rate was adjusted to 1e-2, and both the train loss and validate loss of the network achieved rapid decline (And the optimizer is Adam). When adjusting the learning rate later, you can start from 1 to the minor, do not preconceive.

Linear Regression test data violating training data.Please explain where i went wrong

This is a part of a dataset containing 1000 entries of pricing of rents of houses at different locations.
after training the model, if i send same training data as test data, i am getting incorrect results. How is this even possible?
X_loc = df[{'area','rooms','location'}]
y_loc = df[:]['price']
X_train, X_test, y_train, y_test = train_test_split(X_loc, y_loc, test_size = 1/3, random_state = 0)
regressor = LinearRegression()
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_train[0:1])
DATASET:
price rooms area location
0 0 22000 3 1339 140
1 1 45000 3 1580 72
3 3 72000 3 2310 72
4 4 40000 3 1800 41
5 5 35000 3 2100 57
expected output (y_pred)should be 220000 but its showing 290000 How can it violate the already trained input?
What you observed is exactly what is referred to as the "training error". Machine learning models are meant to find the "best" fit which minimizes the "total error" (i.e. for all data points and not every data point).
22000 is not very far from 29000, although it is not the exact number. This because linear regression tries compress all the variations in your data to follow one straight line.
Possibly the model is nonlinear and so applying a Linear Regression yields bad results. There are other reasons why a Linear Regression may fail cf. https://stats.stackexchange.com/questions/393706/bad-linear-regression-results
Nonlinear data often appears when there are (statistical) interactions between features.
A generalization of Linear Regression is the Generalized Linear Model (GLM), that is able to handle nonlinearities by its nonlinear link functions : https://en.wikipedia.org/wiki/Generalized_linear_model
In scikit-learn you can use a Support Vector Regression with polynomial or RBF kernel for a nonlinear model https://scikit-learn.org/stable/auto_examples/svm/plot_svm_regression.html
An alternative ansatz is to analyze the data on interactions and apply methods that are described in https://en.wikipedia.org/wiki/Generalized_linear_model#Correlated_or_clustered_data however this is complex. Possibly try Ridge Regression for this assumption because it can handle multicollinearity tht is one form of statistical interactions: https://ncss-wpengine.netdna-ssl.com/wp-content/themes/ncss/pdf/Procedures/NCSS/Ridge_Regression.pdf
https://statisticsbyjim.com/regression/difference-between-linear-nonlinear-regression-models/

Non-linear multivariate time-series response prediction using RNN

I am trying to predict the hygrothermal response of a wall, given the interior and exterior climate. Based on literature research, I believe this should be possible with RNN but I have not been able to get good accuracy.
The dataset has 12 input features (time-series of exterior and interior climate data) and 10 output features (time-series of hygrothermal response), both containing hourly values for 10 years. This data was created with hygrothermal simulation software, there is no missing data.
Dataset features:
Dataset targets:
Unlike most time-series prediction problems, I want to predict the response for the full length of the input features time-series at each time-step, rather than the subsequent values of a time-series (eg financial time-series prediction). I have not been able to find similar prediction problems (in similar or other fields), so if you know of one, references are very welcome.
I think this should be possible with RNN, so I am currently using LSTM from Keras. Before training, I preprocess my data the following way:
Discard first year of data, as the first time steps of the hygrothermal response of the wall is influenced by the initial temperature and relative humidity.
Split into training and testing set. Training set contains the first 8 years of data, the test set contains the remaining 2 years.
Normalise training set (zero mean, unit variance) using StandardScaler from Sklearn. Normalise test set analogously using mean an variance from training set.
This results in: X_train.shape = (1, 61320, 12), y_train.shape = (1, 61320, 10), X_test.shape = (1, 17520, 12), y_test.shape = (1, 17520, 10)
As these are long time-series, I use stateful LSTM and cut the time-series as explained here, using the stateful_cut() function. I only have 1 sample, so batch_size is 1. For T_after_cut I have tried 24 and 120 (24*5); 24 appears to give better results. This results in X_train.shape = (2555, 24, 12), y_train.shape = (2555, 24, 10), X_test.shape = (730, 24, 12), y_test.shape = (730, 24, 10).
Next, I build and train the LSTM model as follows:
model = Sequential()
model.add(LSTM(128,
batch_input_shape=(batch_size,T_after_cut,features),
return_sequences=True,
stateful=True,
))
model.addTimeDistributed(Dense(targets)))
model.compile(loss='mean_squared_error', optimizer=Adam())
model.fit(X_train, y_train, epochs=100, batch_size=batch=batch_size, verbose=2, shuffle=False)
Unfortunately, I don't get accurate prediction results; not even for the training set, thus the model has high bias.
The prediction results of the LSTM model for all targets
How can I improve my model? I have already tried the following:
Not discarding the first year of the dataset -> no significant difference
Differentiating the input features time-series (subtract previous value from current value) -> slightly worse results
Up to four stacked LSTM layers, all with the same hyperparameters -> no significant difference in results but longer training time
Dropout layer after LSTM layer (though this is usually used to reduce variance and my model has high bias) -> slightly better results, but difference might not be statistically significant
Am I doing something wrong with the stateful LSTM? Do I need to try different RNN models? Should I preprocess the data differently?
Furthermore, training is very slow: about 4 hours for the model above. Hence I am reluctant to do an extensive hyperparameter gridsearch...
In the end, I managed to solve this the following way:
Using more samples to train instead of only 1 (I used 18 samples to train and 6 to test)
Keep the first year of data, as the output time-series for all samples have the same 'starting point' and the model needs this information to learn
Standardise both input and output features (zero mean, unit variance). I found this improved prediction accuracy and training speed
Use stateful LSTM as described here, but add reset states after epoch (see below for code). I used batch_size = 6 and T_after_cut = 1460. If T_after_cut is longer, training is slower; if T_after_cut is shorter, accuracy decreases slightly. If more samples are available, I think using a larger batch_size will be faster.
use CuDNNLSTM instead of LSTM, this speed up the training time x4!
I found that more units resulted in higher accuracy and faster convergence (shorter training time). Also I found that the GRU is as accurate as the LSTM tough converged faster for the same number of units.
Monitor validation loss during training and use early stopping
The LSTM model is build and trained as follows:
def define_reset_states_batch(nb_cuts):
class ResetStatesCallback(Callback):
def __init__(self):
self.counter = 0
def on_batch_begin(self, batch, logs={}):
# reset states when nb_cuts batches are completed
if self.counter % nb_cuts == 0:
self.model.reset_states()
self.counter += 1
def on_epoch_end(self, epoch, logs={}):
# reset states after each epoch
self.model.reset_states()
return(ResetStatesCallback)
model = Sequential()
model.add(layers.CuDNNLSTM(256, batch_input_shape=(batch_size,T_after_cut ,features),
return_sequences=True,
stateful=True))
model.add(layers.TimeDistributed(layers.Dense(targets, activation='linear')))
optimizer = RMSprop(lr=0.002)
model.compile(loss='mean_squared_error', optimizer=optimizer)
earlyStopping = EarlyStopping(monitor='val_loss', min_delta=0.005, patience=15, verbose=1, mode='auto')
ResetStatesCallback = define_reset_states_batch(nb_cuts)
model.fit(X_dev, y_dev, epochs=n_epochs, batch_size=n_batch, verbose=1, shuffle=False, validation_data=(X_eval,y_eval), callbacks=[ResetStatesCallback(), earlyStopping])
This gave me very statisfying accuracy (R2 over 0.98):
This figure shows the temperature (left) and relative humidity (right) in the wall over 2 years (data not used in training), prediction in red and true output in black. The residuals show that the error is very small and that the LSTM learns to capture the long-term dependencies to predict the relative humidity.

Training accuracy increases aggresively, test accuracy settles

While training a convolutional neural network following this article, the accuracy of the training set increases too much while the accuracy on the test set settles.
Below is an example with 6400 training examples, randomly chosen at each epoch (so some examples might be seen at the previous epochs, some might be new), and 6400 same test examples.
For a bigger data set (64000 or 100000 training examples), the increase in training accuracy is even more abrupt, going to 98 on the third epoch.
I also tried using the same 6400 training examples each epoch, just randomly shuffled. As expected, the result is worse.
epoch 3 loss 0.54871 acc 79.01
learning rate 0.1
nr_test_examples 6400
TEST epoch 3 loss 0.60812 acc 68.48
nr_training_examples 6400
tb 91
epoch 4 loss 0.51283 acc 83.52
learning rate 0.1
nr_test_examples 6400
TEST epoch 4 loss 0.60494 acc 68.68
nr_training_examples 6400
tb 91
epoch 5 loss 0.47531 acc 86.91
learning rate 0.05
nr_test_examples 6400
TEST epoch 5 loss 0.59846 acc 68.98
nr_training_examples 6400
tb 91
epoch 6 loss 0.42325 acc 92.17
learning rate 0.05
nr_test_examples 6400
TEST epoch 6 loss 0.60667 acc 68.10
nr_training_examples 6400
tb 91
epoch 7 loss 0.38460 acc 95.84
learning rate 0.05
nr_test_examples 6400
TEST epoch 7 loss 0.59695 acc 69.92
nr_training_examples 6400
tb 91
epoch 8 loss 0.35238 acc 97.58
learning rate 0.05
nr_test_examples 6400
TEST epoch 8 loss 0.60952 acc 68.21
This is my model (I'm using RELU activation after each convolution):
conv 5x5 (1, 64)
max-pooling 2x2
dropout
conv 3x3 (64, 128)
max-pooling 2x2
dropout
conv 3x3 (128, 256)
max-pooling 2x2
dropout
conv 3x3 (256, 128)
dropout
fully_connected(18*18*128, 128)
dropout
output(128, 128)
What could be the cause?
I'm using Momentum Optimizer with learning rate decay:
batch = tf.Variable(0, trainable=False)
train_size = 6400
learning_rate = tf.train.exponential_decay(
0.1, # Base learning rate.
batch * batch_size, # Current index into the dataset.
train_size*5, # Decay step.
0.5, # Decay rate.
staircase=True)
# Use simple momentum for the optimization.
optimizer = tf.train.MomentumOptimizer(learning_rate,
0.9).minimize(cost, global_step=batch)
This is very much expected. This problem is called over-fitting. This is when your model starts "memorizing" the training examples without actually learning anything useful for the Test set. In fact, this is exactly why we use a test set in the first place. Since if we have a complex enough model we can always fit the data perfectly, even if not meaningfully. The test set is what tells us what the model has actually learned.
Its also useful to use a Validation set which is like a test set, but you use it to find out when to stop training. When the Validation error stops lowering you stop training. why not use the test set for this? The test set is to know how well your model would do in the real world. If you start using information from the test set to choose things about your training process, than its like your cheating and you will be punished by your test error no longer representing your real world error.
Lastly, convolutional neural networks are notorious for their ability to over-fit. It has been shown the Conv-nets can get zero training error even if you shuffle the labels and even random pixels. That means that there doesn't have to be a real pattern for the Conv-net to learn to represent it. This means that you have to regularize a conv-net. That is, you have to use things like Dropout, batch normalization, early stopping.
I'll leave a few links if you want to read more:
Over-fitting, validation, early stopping
https://elitedatascience.com/overfitting-in-machine-learning
Conv-nets fitting random labels:
https://arxiv.org/pdf/1611.03530.pdf
(this paper is a bit advanced, but its interresting to skim through)
P.S. to actually improve your test accuracy you will need to change your model or train with data augmentation. You might want to try transfer learning as well.

Training accuracy on SGD

How do you compute for the training accuracy for SGD? Do you compute it using the batch data you trained your network with? Or using the entire dataset? (for each batch optimization iteration)
I tried computing the training accuracy for each iteration using the batch data I trained my network with. And it almost always gives me 100% training accuracy (sometimes 100%, 90%, 80%, always multiples of 10%, but the very first iteration gave me 100%). Is this because I am computing the accuracy on the same batch data I trained it with for that iteration? Or is my model overfitting that it gave me 100% instantly, but the validation accuracy is low? (this is the main question here, if this is acceptable, or there is something wrong with the model)
Here are the hyperparameters I used.
batch_size = 64
kernel_size = 60 #from 60 #optimal 2
depth = 15 #from 60 #optimal 15
num_hidden = 1000 #from 1000 #optimal 80
learning_rate = 0.0001
training_epochs = 8
total_batches = train_x.shape[0] // batch_size
Calculating the training accuracy on the batch data during the training process is correct. If the number of the accuracy is always multiple of 10%, then most likely it is because your batch size is 10. For example, if 8 of the training outputs match the labels, then your training accuracy will be 80%. If the training accuracy number goes up and down, there are two main possibilities:
1. If you print out the accuracy numbers multiple time over one epoch, it is normal, especially at the early stage of training, because the model is predicting over different data samples;
2. If you print out the accuracy once each epoch, and if you see the training accuracy goes up and down during the later stage of the training, that means your learning rate is too big. You need to decease that overtime during the training.
If these do not answer your question, please provider more details so that we can help.

Resources