Model training when calculating training error and cross-validation error - machine-learning

I want to calculate training error and cross validation error for the same training set.
Model: RandomForestRegressor
Metrics: Training error -> RMSE, Cross validation error -> k fold cross validation
from sklearn.ensemble import RandomForestRegressor
forest_reg = RandomForestRegressor()
scores = cross_val_score(forest_reg, X_train_transformed, y_train,
scoring="neg_mean_squared_error", cv=10)
tree_rmse_scores = np.sqrt(-scores)
print(tree_rmse_scores.mean())
forest_reg = RandomForestRegressor()
forest_reg.fit(X_train_transformed, y_train)
error = mean_squared_error(y_train, forest_reg.predict(X_train_transformed))
print(error)
My understanding is that model has to be trained explicitly when calculating training error but for cross val score, model is fitted for every k-1 folds and validated on 1 fold for k times. In this case, no explicit fitting is not required before calling cross_val_score().
Any issue with above code? I'm getting huge training error than CV error. Is my above understanding incorrect?

You forgot to take the square root of the MSE in the approach without CV:
forest_reg = RandomForestRegressor()
forest_reg.fit(X_train_transformed, y_train)
error = mean_squared_error(y_train, forest_reg.predict(X_train_transformed))
print(np.sqrt(error)) # <-- take the square root here or already above
The difference was so huge because you compared RMSE with MSE. Now you should see what you expected.

How much larger than validation error?
Isn't it natural that training error is larger than validation error with or without it is just 'validation' or 'cross validataion'.

Related

scikit-learn and imblearn: does GridSearchCV/RandomSearchCV apply preprocessing to the validation set as well?

I'm currently using sklearn for a school project and I have some questions about how GridsearchCV applies preprocessing algorithms such as PCA or Factor Analysis. Let's suppose I perform hold out:
X_tr, X_ts, y_tr, y_ts = train_test_split(X, y, test_size = 0.1, stratify = y)
Then, I declare some hyperparameters and perform a GridSearchCV (it would be the same with RandomSearchCV but whatever):
params = {
'linearsvc__C' : [...],
'linearsvc__tol' : [...],
'linearsvc__degree' : [...]
}
clf = make_pipeline(PCA(), SVC(kernel='linear'))
model = GridSearchCV(clf, params, cv = 5, verbose = 2, n_jobs = -1)
model.fit(X_tr, y_tr)
My issue is: my teacher told me that you should never fit the preprocessing algorithm (here PCA) on the validation set in case of a k fold cv, but only on the train split (here both the train split and validation split are subsets of X_tr, and of course they change at every fold). So if I have PCA() here, it should fit on the part of the fold used for training the model and eventually when I test the resulting model against the validation split, preprocess it using the PCA model obtained fitting it against the training set. This ensures no leaks whatsowever.
Does sklearn account for this?
And if it does: suppose that now I want to use imblearn to perform oversampling on an unbalanced set:
clf = make_pipeline(SMOTE(), SVC(kernel='linear'))
still according to my teacher, you shouldn't perform oversampling on the validation split as well, as this could lead to inaccurate accuracies. So the statement above that held for PCA about transforming the validation set on a second moment does not apply here.
Does sklearn/imblearn account for this as well?
Many thanks in advance

why test data is also involved in lightGBM train() and also used for calculate prediction error?

I would like to use lightGBM to do a machine learning model training.
I checked the example at https://github.com/microsoft/LightGBM/blob/master/examples/python-guide/advanced_example.py
I have some questions about the correctness of the code.
(1) What kind models can be created from lightgbm.train() ?
https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.train.html
It is a regressor or classifier ?
(2) Why test dataset is also used for training ? How this can assure that the test results are still valid ?
# line 31
lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train,
weight=W_test, free_raw_data=False)
# line 52
gbm = lgb.train(params,
lgb_train,
num_boost_round=10,
valid_sets=lgb_train, # eval training data with test data !!!
feature_name=feature_name,
categorical_feature=[21])
# line 84
y_pred = bst.predict(X_test) # why x_test is also used to predict y? X_test has been involved in training the model !!!
Thanks
You can train both regression and classifier models using lgb.train. It depends on the parameters, which you define, namely objective.
Test set (valid_sets) is used only for validation, it isn't used for training.

Validation accuracy fluctuating while training accuracy increase?

I have a multiclassification problem that depends on historical data. I am trying LSTM using loss='sparse_categorical_crossentropy'. The train accuracy and loss increase and decrease respectively. But, my test accuracy starts to fluctuate wildly.
What I am doing wrong?
Input data:
X = np.reshape(X, (X.shape[0], X.shape[1], 1))
X.shape
(200146, 13, 1)
My model
# fix random seed for reproducibility
seed = 7
np.random.seed(seed)
# define 10-fold cross validation test harness
kfold = StratifiedKFold(n_splits=10, shuffle=False, random_state=seed)
cvscores = []
for train, test in kfold.split(X, y):
regressor = Sequential()
# Units = the number of LSTM that we want to have in this first layer -> we want very high dimentionality, we need high number
# return_sequences = True because we are adding another layer after this
# input shape = the last two dimensions and the indicator
regressor.add(LSTM(units=50, return_sequences=True, input_shape=(X[train].shape[1], 1)))
regressor.add(Dropout(0.2))
# Extra LSTM layer
regressor.add(LSTM(units=50, return_sequences=True))
regressor.add(Dropout(0.2))
# 3rd
regressor.add(LSTM(units=50, return_sequences=True))
regressor.add(Dropout(0.2))
#4th
regressor.add(LSTM(units=50))
regressor.add(Dropout(0.2))
# output layer
regressor.add(Dense(4, activation='softmax', kernel_regularizer=regularizers.l2(0.001)))
# Compile the RNN
regressor.compile(optimizer='adam', loss='sparse_categorical_crossentropy',metrics=['accuracy'])
# Set callback functions to early stop training and save the best model so far
callbacks = [EarlyStopping(monitor='val_loss', patience=9),
ModelCheckpoint(filepath='best_model.h5', monitor='val_loss', save_best_only=True)]
history = regressor.fit(X[train], y[train], epochs=250, callbacks=callbacks,
validation_data=(X[test], y[test]))
# plot train and validation loss
pyplot.plot(history.history['loss'])
pyplot.plot(history.history['val_loss'])
pyplot.title('model train vs validation loss')
pyplot.ylabel('loss')
pyplot.xlabel('epoch')
pyplot.legend(['train', 'validation'], loc='upper right')
pyplot.show()
# evaluate the model
scores = regressor.evaluate(X[test], y[test], verbose=0)
print("%s: %.2f%%" % (regressor.metrics_names[1], scores[1]*100))
cvscores.append(scores[1] * 100)
print("%.2f%% (+/- %.2f%%)" % (np.mean(cvscores), np.std(cvscores)))
Results:
trainingmodel
Plot
What you are describing here is overfitting. This means your model keeps learning about your training data and doesn't generalize, or other said it is learning the exact features of your training set. This is the main problem you can deal with in deep learning. There is no solution per se. You have to try out different architectures, different hyperparameters and so on.
You can try with a small model that underfits (that is the train acc and validation are at low percentage) and keep increasing your model until it overfits. Then you can play around with the optimizer and other hyperparameters.
By smaller model I mean one with fewer hidden units or fewer layers.
you seem to have too many LSTM layers stacked over and over again which eventually leads to overfitting. Probably should decrease the num of layers.
Your model seems to be overfitting, since the training error keeps on reducing while validation error fails to. Overall, it fails to generalize.
You should try reducing the model complexity by removing some of the LSTM layers. Also, try varying the batch sizes, it will reduce the number of fluctuations in the loss.
You can also consider varying the learning rate.

Training Error is Lower than Testing error in a Random Forest Model

I have been working on machine learning model and I am confused about which model to choose or if there is any other technique I should try. I am working on Random Forest to predict the propensity to convert with a higly imbalanced data set. The class balance for the target variable is given below.
label count
0 0.0 1,021,095
1 1.0 4459
The two model I trained was using UpSampling and then using Undersampling. Below are the codes I am using for Upsampling and Undersampling
train_initial, test = new_data.randomSplit([0.7, 0.3], seed = 2018)
train_initial.groupby('label').count().toPandas()
test.groupby('label').count().toPandas()
#Sampling Techniques --- Should be done one of these
#Upsampling ----
df_class_0 = train_initial[train_initial['label'] == 0]
df_class_1 = train_initial[train_initial['label'] == 1]
df_class_1_over = df_class_1.sample(True, 100.0, seed=99)
train_up = df_class_0.union(df_class_1_over)
train_up.groupby('label').count().toPandas()
#Down Sampling
stratified_train = train_initial.sampleBy('label', fractions={0: 3091./714840, 1: 1.0}).cache()
stratified_train.groupby('label').count().toPandas()
Below is how I am training my model
labelIndexer = StringIndexer(inputCol='label',
outputCol='indexedLabel').fit(new_data)
featureIndexer = VectorIndexer(inputCol='features',
outputCol='indexedFeatures',
maxCategories=2).fit(new_data)
from pyspark.ml.classification import RandomForestClassifier
rf_model = RandomForestClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures")
labelConverter = IndexToString(inputCol="prediction", outputCol="predictedLabel",
labels=labelIndexer.labels)
# Chain indexers and tree in a Pipeline
pipeline = Pipeline(stages=[labelIndexer, featureIndexer, rf_model, labelConverter])
# Search through random forest maxDepth parameter for best model
paramGrid = ParamGridBuilder() \
.addGrid(rf_model.numTrees, [ 200, 400,600,800,1000]) \
.addGrid(rf_model.impurity,['entropy','gini']) \
.addGrid(rf_model.maxDepth,[2,3,4,5]) \
.build()
# Set up 5-fold cross validation
crossval = CrossValidator(estimator=pipeline,
estimatorParamMaps=paramGrid,
evaluator=BinaryClassificationEvaluator(),
numFolds=5)
train_model = crossval.fit(train_up/stratified_train)
Below are the results from both the methods
#UpSampling - Training
Train Error = 0.184633
precision: 0.8565508112679312
recall: 0.6597217024736883
auroc: 0.9062348758176568
f1 : 0.7453609484359377
#Upsampling - Test
Test Error = 0.0781619
precision: 0.054455645977569946
recall: 0.6503868471953579
auroc: 0.8982212236597943
f1 : 0.10049688048716704
#UnderSampling - Training
Train Error = 0.179293
precision: 0.8468290542023261
recall: 0.781807131280389
f1 : 0.8130201200884863
auroc: 0.9129391668636556
#UnderSamping - Test
Test Error = 0.147874
precision: 0.034453223699706645
recall: 0.778046421663443
f1 : 0.06598453935901905
auroc: 0.8989720777537427
Referring to various articles on StackOverflow I understand that if the test error is lower than the train error there is likely to be error in implementation. However I am not quite sure where am I going error in order to train my models. Also, which sampling is better to use in the case of such an highly imbalanced class. If I do undersampling I am worried if there would be a loss of information.
I was hoping if someone could please help me out with this model and help me to clear my doubts.
Thanks a lot in advance !!
Testing error lower than Training error does not necessarily mean error in implementation. You can increase the iteration for training the model and depending on your dataset the training error may become lower than the testing error. However, you may end up overfitting. Therefore the goal should also be to check other performance metrics of the test set such as accuracy, precision, recall etc.
Oversampling and undersampling are opposite but roughly equivalent techniques. If you have lots of data points then it is better to undersample. Otherwise go for oversampling. SMOTE is a great technique for oversampling by creating synthetic data points instead of repeating the same data points multiple times.
https://imbalanced-learn.readthedocs.io/en/stable/generated/imblearn.over_sampling.SMOTE.html
Another tip, shuffle the data with different seeds and see if the training error stays greater than testing error. I suspect the variance in your data is high. Read about the variance-bias trade-off.
Judging by the results, it seems you have built a pretty decent model. Try using XGBoost as well and compare the result with Random Forest.

Difference between cross_val_score and another way of calculating accuracy

I tried to calculate the accuracy and was puzzled by the fact that cross_val_score gives a rather low result, than by comparing the predicted results with the correct.
First way of counting, that gives
[0.8033333333333333, 0.7908333333333334, 0.8033333333333333, 0.7925,0.8066666666666666]
kf = KFold(shuffle=True, n_splits=5)
scores = []
for train_index, test_index in kf.split(X):
X_train, X_test = X[train_index], X[test_index]
y_train, y_test = y[train_index], y[test_index]
model = KNeighborsClassifier(n_jobs=-1, n_neighbors=5)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
scores.append(np.sum(y_pred == y_test) / len(y_test))
Second way gives array([0.46166667, 0.53583333, 0.40916667, 0.44666667, 0.3775 ]):
model = KNeighborsClassifier(n_jobs=-1, n_neighbors=5)
cross_val_score(model, X, y, cv = 5, scoring='accuracy')
What's my mistake?
cross_val_score will use a StratifiedKFold cv iterator when not specified otherwise. A StratifiedKFold will keep the ratio of classes balanced the same way in train and test split. For more explanation, see my other answer here:-
https://stackoverflow.com/a/48314533/3374996
On the other hand, in your first approach you are using KFold which will not keep the balance of classes. In addition you are doing shuffling of data in that.
So in each fold, there is data difference in your two approaches and hence different results.
The low score in cross_val_score is probably because of the fact that you are providing the complete data to it, instead of breaking it into test and training set. This generally leads to leakage of information which results in your model giving incorrect predictions. See this post for more explanation.
References
Learn the right way to validate models

Resources