Why same paths of xgboost tree give 2 different predictions? - machine-learning

I'm trying to investigate the xgboost prediction.
It seems that 2 inputs with same 2 paths gives 2 different predictions.
I'm running on the following dateset:
f1,f2,f3,f4,f5,f6,f7,f8,y
6,148,72,35,0,33.6,0.627,50,1
1,85,66,29,0,26.6,0.351,31,0
8,183,64,0,0,23.3,0.672,32,1
1,89,66,23,94,28.1,0.167,21,0
0,137,40,35,168,43.1,2.288,33,1
5,116,74,0,0,25.6,0.201,30,0
3,78,50,32,88,31.0,0.248,26,1
10,115,0,0,0,35.3,0.134,29,0
2,197,70,45,543,30.5,0.158,53,1
8,125,96,0,0,0.0,0.232,54,1
4,110,92,0,0,37.6,0.191,30,0
10,168,74,0,0,38.0,0.537,34,1
10,139,80,0,0,27.1,1.441,57,0
1,189,60,23,846,30.1,0.398,59,1
5,166,72,19,175,25.8,0.587,51,1
7,100,0,0,0,30.0,0.484,32,1
0,118,84,47,230,45.8,0.551,31,1
7,107,74,0,0,29.6,0.254,31,1
1,103,30,38,83,43.3,0.183,33,0
1,115,70,30,96,34.6,0.529,32,1
3,126,88,41,235,39.3,0.704,27,0
8,99,84,0,0,35.4,0.388,50,0
7,196,90,0,0,39.8,0.451,41,1
9,119,80,35,0,29.0,0.263,29,1
11,143,94,33,146,36.6,0.254,51,1
10,125,70,26,115,31.1,0.205,41,1
7,147,76,0,0,39.4,0.257,43,1
1,97,66,15,140,23.2,0.487,22,0
13,145,82,19,110,22.2,0.245,57,0
5,117,92,0,0,34.1,0.337,38,0
5,109,75,26,0,36.0,0.546,60,0
3,158,76,36,245,31.6,0.851,28,1
3,88,58,11,54,24.8,0.267,22,0
6,92,92,0,0,19.9,0.188,28,0
10,122,78,31,0,27.6,0.512,45,0
4,103,60,33,192,24.0,0.966,33,0
11,138,76,0,0,33.2,0.420,35,0
9,102,76,37,0,32.9,0.665,46,1
2,90,68,42,0,38.2,0.503,27,1
prediction and tree creation code:
df = pd.read_csv("input.csv")
x = df[['f1','f2','f3', 'f4', 'f5', 'f6','f7','f8']]
y = df[['y']]
X_train, X_test, y_train, y_test = train_test_split( x, y, test_size = 0.33, random_state = 42)
model = XGBClassifier(n_jobs=-1)
model.fit(X_train, y_train)
res = model.predict(X_test)
print ("X_test (first 2 rows:")
print(X_test.head(2))
print("Predictions (first 2 rows:")
print(res[0:2])
plot_tree(model)
plt.show()
Output:
X_test (first 2 rows:
f1 f2 f3 f4 f5 f6 f7 f8
33 6 92 92 0 0 19.9 0.188 28
36 11 138 76 0 0 33.2 0.420 35
Predictions (first 2 rows:
[0 1]
The same 2 inputs have f2<146.5 and f4=0 => get into same leaf (-0.34)
So why the prediction for those 2 are different ? (0 and 1) ?

What you have plotted in not the whole XGBoost model; it is only the first tree of it.
To see why this is so, check the source code of plot_tree:
def plot_tree(booster, fmap='', num_trees=0, rankdir=None, ax=None, **kwargs):
"""Plot specified tree.
and the documentation:
num_trees (int, default 0) – Specify the ordinal number of target tree
from where it is apparent that, when you do not specify the num_trees argument, like here, it takes the default value of 0, i.e. the first tree of the ensemble.
Using different values for num_trees you will get different trees, hence different decision paths for each sample.
You cannot plot all the trees of the boosting ensemble (and even if you could, it would not be of any practical use). plot_tree is just a utility function in order to be able to have a look at individual trees of the model. You can have a look at its use in How to Visualize Gradient Boosting Decision Trees With XGBoost in Python.

Related

How to get individual tree's prediction value for XGBoost Regressor?

I have tried this by reading How to get each individual tree's prediction in xgboost?
model = XGBRegressor(n_estimators=1000)
model.fit(X_train, y_train)
booster_ = model.get_booster()
individual_preds = []
for tree_ in booster_:
individual_preds.append(
tree_.predict(xgb.DMatrix(X_test)),
)
individual_preds = np.vstack(individual_preds)
The results from individual trees are far away from the results of using booster_.predict(xgb.DMatrix(X_test)) (centered at 0.5). How to get the individual tree's prediction value for XGBoost Regressor? And how to make them comparable to the ensembled prediction?
From xgboost api, iteration_range seems to be suitable for this request, if understood the question ok:
iteration_range (Tuple[int, int]) –
Specifies which layer of trees are used in prediction. For example, if a random forest is trained with 100 rounds. Specifying iteration_range=(10, 20), then only the forests built during [10, 20) (half open set) rounds are used in this prediction.
For illustration, I used California housing data to train a XGB regressor model:
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing()
X_train, X_valid, y_train, y_valid = train_test_split(housing.data, housing.target, \
test_size = 0.33, random_state = 11)
dtrain = xgb.DMatrix(data=X_train, label=y_train)
dvalid= xgb.DMatrix(data=X_valid, label=y_valid, feature_names=list(housing.feature_names))
# define model and train
params_reg = {"max_depth":4, "eta":0.3, "objective":"reg:squarederror", "subsample":1}
xgb_model_reg = xgb.train(params=params_reg, dtrain=dtrain, num_boost_round=100, \
early_stopping_rounds=20,evals=[(dtrain, "train")])
# predict
y_pred = xgb_model_reg.predict(dvalid)
The prediction for a random row 500 is 1.9630624. I used iteration_range below to include one tree for prediction and then displayed the prediction results against each tree index:
for tree in range(0,100):
print(a,xgb_model_reg.predict(dvalid,iteration_range=(tree,tree+1))[500])
Here is the output extract:
0 0.9880972
1 0.5706124
2 0.59768033
3 0.51785016
4 0.58512527
5 0.5990092
6 0.6660166
7 0.46186835
8 0.5213114
9 0.5857907
10 0.4683379
11 0.54352343
12 0.46028078
13 0.4823497
14 0.51296484
15 0.49818778
16 0.50080884
...
97 0.5000746
98 0.49949
99 0.5004089

How to combine confusion matrices from the outputs of multiple algorithms?

I want to extract prediction confusion matrices of multiple prediction models in one table for easier comparison.
I collected confusion matrix tables from each model and combined them to created the following table.
Table1 <- rbind(confmat_nb$table, confmat_nb_2$table, confmat_svm$table)
Table1
Reference
excluded included
naive_bayes excluded 6234 46
included 3107 470
naive_bayes_2 excluded 5774 60
included 3567 456
svm excluded 7197 101
included 2144 415
Table 2 was created using the following code
byClass <- rbind(naive_bayes =round(confmat_nb$byClass, 3),
naive_bayes_2=round(confmat_nb_2$byClass, 3),
svm = round(confmat_svm$byClass, 3)
overall <- rbind(naive_bayes =round(confmat_nb$overall, 3),
naive_bayes_2=round(confmat_nb_2$overall, 3),
svm = round(confmat_svm$overall, 3)
Table2 <- rbind(byClass, overall)
Table2
Accuracy Sensitivity Specificity Pos Pred Value Neg Pred Value Precision
naive_bayes 0.680 0.911 0.667 0.131 0.993 0.131
naive_bayes_2 0.632 0.884 0.618 0.113 0.990 0.113
svm 0.772 0.804 0.770 0.162 0.986 0.162
My goal is to merge Table 1 and Table 2 and extract the outputs in one table. Something like this

How to overfit data with Keras?

I'm trying to build a simple regression model using keras and tensorflow. In my problem I have data in the form (x, y), where x and y are simply numbers. I'd like to build a keras model in order to predict y using x as an input.
Since I think images better explains thing, these are my data:
We may discuss if they are good or not, but in my problem I cannot really cheat them.
My keras model is the following (data are splitted 30% test (X_test, y_test) and 70% training (X_train, y_train)):
model = tf.keras.Sequential()
model.add(tf.keras.layers.Dense(32, input_shape=() activation="relu", name="first_layer"))
model.add(tf.keras.layers.Dense(16, activation="relu", name="second_layer"))
model.add(tf.keras.layers.Dense(1, name="output_layer"))
model.compile(loss = "mean_squared_error", optimizer = "adam", metrics=["mse"] )
history = model.fit(X_train, y_train, epochs=500, batch_size=1, verbose=0, shuffle=False)
eval_result = model.evaluate(X_test, y_test)
print("\n\nTest loss:", eval_result, "\n")
predict_Y = model.predict(X)
note: X contains both X_test and X_train.
Plotting the prediction I get (blue squares are the prediction predict_Y)
I'm playing a lot with layers, activation funztions and other parameters. My goal is to find the best parameters to train the model, but the actual question, here, is slightly different: in fact I have hard times to force the model to overfit the data (as you can see from the above results).
Does anyone have some sort of idea about how to reproduce overfitting?
This is the outcome I would like to get:
(red dots are under blue squares!)
EDIT:
Here I provide you the data used in the example above: you can copy paste directly to a python interpreter:
X_train = [0.704619794270697, 0.6779457393024553, 0.8207082120250023, 0.8588819357831449, 0.8692320257603844, 0.6878750931810429, 0.9556331888763945, 0.77677964510883, 0.7211381534179618, 0.6438319113259414, 0.6478339581502052, 0.9710222750072649, 0.8952188423349681, 0.6303124926673513, 0.9640316662124185, 0.869691568491902, 0.8320164648420931, 0.8236399177660375, 0.8877334038470911, 0.8084042532069621, 0.8045680821762038]
y_train = [0.7766424210611557, 0.8210846773655833, 0.9996114311913593, 0.8041331063189883, 0.9980525368790883, 0.8164056182686034, 0.8925487603333683, 0.7758207470960685, 0.37345286573743475, 0.9325789202459493, 0.6060269037514895, 0.9319771743389491, 0.9990691225991941, 0.9320002808310418, 0.9992560731072977, 0.9980241561997089, 0.8882905258641204, 0.4678339275898943, 0.9312152374846061, 0.9542371205095945, 0.8885893668675711]
X_test = [0.9749191829308574, 0.8735366740730178, 0.8882783211709133, 0.8022891400991644, 0.8650601322313454, 0.8697902997857514, 1.0, 0.8165876695985228, 0.8923841531760973]
y_test = [0.975653685270635, 0.9096752789481569, 0.6653736469114154, 0.46367666660348744, 0.9991817903431941, 1.0, 0.9111205717076893, 0.5264993912088891, 0.9989199241685126]
X = [0.704619794270697, 0.77677964510883, 0.7211381534179618, 0.6478339581502052, 0.6779457393024553, 0.8588819357831449, 0.8045680821762038, 0.8320164648420931, 0.8650601322313454, 0.8697902997857514, 0.8236399177660375, 0.6878750931810429, 0.8923841531760973, 0.8692320257603844, 0.8877334038470911, 0.8735366740730178, 0.8207082120250023, 0.8022891400991644, 0.6303124926673513, 0.8084042532069621, 0.869691568491902, 0.9710222750072649, 0.9556331888763945, 0.8882783211709133, 0.8165876695985228, 0.6438319113259414, 0.8952188423349681, 0.9749191829308574, 1.0, 0.9640316662124185]
Y = [0.7766424210611557, 0.7758207470960685, 0.37345286573743475, 0.6060269037514895, 0.8210846773655833, 0.8041331063189883, 0.8885893668675711, 0.8882905258641204, 0.9991817903431941, 1.0, 0.4678339275898943, 0.8164056182686034, 0.9989199241685126, 0.9980525368790883, 0.9312152374846061, 0.9096752789481569, 0.9996114311913593, 0.46367666660348744, 0.9320002808310418, 0.9542371205095945, 0.9980241561997089, 0.9319771743389491, 0.8925487603333683, 0.6653736469114154, 0.5264993912088891, 0.9325789202459493, 0.9990691225991941, 0.975653685270635, 0.9111205717076893, 0.9992560731072977]
Where X contains the list of the x values and Y the corresponding y value. (X_test, y_test) and (X_train, y_train) are two (non overlapping) subset of (X, Y).
To predict and show the model results I simply use matplotlib (imported as plt):
predict_Y = model.predict(X)
plt.plot(X, Y, "ro", X, predict_Y, "bs")
plt.show()
Overfitted models are rarely useful in real life. It appears to me that OP is well aware of that but wants to see if NNs are indeed capable of fitting (bounded) arbitrary functions or not. On one hand, the input-output data in the example seems to obey no discernible pattern. On the other hand, both input and output are scalars in [0, 1] and there are only 21 data points in the training set.
Based on my experiments and results, we can indeed overfit as requested. See the image below.
Numerical results:
x y_true y_pred error
0 0.704620 0.776642 0.773753 -0.002889
1 0.677946 0.821085 0.819597 -0.001488
2 0.820708 0.999611 0.999813 0.000202
3 0.858882 0.804133 0.805160 0.001026
4 0.869232 0.998053 0.997862 -0.000190
5 0.687875 0.816406 0.814692 -0.001714
6 0.955633 0.892549 0.893117 0.000569
7 0.776780 0.775821 0.779289 0.003469
8 0.721138 0.373453 0.374007 0.000554
9 0.643832 0.932579 0.912565 -0.020014
10 0.647834 0.606027 0.607253 0.001226
11 0.971022 0.931977 0.931549 -0.000428
12 0.895219 0.999069 0.999051 -0.000018
13 0.630312 0.932000 0.930252 -0.001748
14 0.964032 0.999256 0.999204 -0.000052
15 0.869692 0.998024 0.997859 -0.000165
16 0.832016 0.888291 0.887883 -0.000407
17 0.823640 0.467834 0.460728 -0.007106
18 0.887733 0.931215 0.932790 0.001575
19 0.808404 0.954237 0.960282 0.006045
20 0.804568 0.888589 0.906829 0.018240
{'me': -0.00015776709314323828,
'mae': 0.00329163070145315,
'mse': 4.0713782563067185e-05,
'rmse': 0.006380735268216915}
OP's code seems good to me. My changes were minor:
Use deeper networks. It may not actually be necessary to use a depth of 30 layers but since we just want to overfit, I didn't experiment too much with what's the minimum depth needed.
Each Dense layer has 50 units. Again, this may be overkill.
Added batch normalization layer every 5th dense layer.
Decreased learning rate by half.
Ran optimization for longer using the all 21 training examples in a batch.
Used MAE as objective function. MSE is good but since we want to overfit, I want to penalize small errors the same way as large errors.
Random numbers are more important here because data appears to be arbitrary. Though, you should get similar results if you change random number seed and let the optimizer run long enough. In some cases, optimization does get stuck in a local minima and it would not produce overfitting (as requested by OP).
The code is below.
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.keras.layers import Input, Dense, BatchNormalization
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam
import matplotlib.pyplot as plt
# Set seed just to have reproducible results
np.random.seed(84)
tf.random.set_seed(84)
# Load data from the post
# https://stackoverflow.com/questions/61252785/how-to-overfit-data-with-keras
X_train = np.array([0.704619794270697, 0.6779457393024553, 0.8207082120250023,
0.8588819357831449, 0.8692320257603844, 0.6878750931810429,
0.9556331888763945, 0.77677964510883, 0.7211381534179618,
0.6438319113259414, 0.6478339581502052, 0.9710222750072649,
0.8952188423349681, 0.6303124926673513, 0.9640316662124185,
0.869691568491902, 0.8320164648420931, 0.8236399177660375,
0.8877334038470911, 0.8084042532069621,
0.8045680821762038])
Y_train = np.array([0.7766424210611557, 0.8210846773655833, 0.9996114311913593,
0.8041331063189883, 0.9980525368790883, 0.8164056182686034,
0.8925487603333683, 0.7758207470960685,
0.37345286573743475, 0.9325789202459493,
0.6060269037514895, 0.9319771743389491, 0.9990691225991941,
0.9320002808310418, 0.9992560731072977, 0.9980241561997089,
0.8882905258641204, 0.4678339275898943, 0.9312152374846061,
0.9542371205095945, 0.8885893668675711])
X_test = np.array([0.9749191829308574, 0.8735366740730178, 0.8882783211709133,
0.8022891400991644, 0.8650601322313454, 0.8697902997857514,
1.0, 0.8165876695985228, 0.8923841531760973])
Y_test = np.array([0.975653685270635, 0.9096752789481569, 0.6653736469114154,
0.46367666660348744, 0.9991817903431941, 1.0,
0.9111205717076893, 0.5264993912088891, 0.9989199241685126])
X = np.array([0.704619794270697, 0.77677964510883, 0.7211381534179618,
0.6478339581502052, 0.6779457393024553, 0.8588819357831449,
0.8045680821762038, 0.8320164648420931, 0.8650601322313454,
0.8697902997857514, 0.8236399177660375, 0.6878750931810429,
0.8923841531760973, 0.8692320257603844, 0.8877334038470911,
0.8735366740730178, 0.8207082120250023, 0.8022891400991644,
0.6303124926673513, 0.8084042532069621, 0.869691568491902,
0.9710222750072649, 0.9556331888763945, 0.8882783211709133,
0.8165876695985228, 0.6438319113259414, 0.8952188423349681,
0.9749191829308574, 1.0, 0.9640316662124185])
Y = np.array([0.7766424210611557, 0.7758207470960685, 0.37345286573743475,
0.6060269037514895, 0.8210846773655833, 0.8041331063189883,
0.8885893668675711, 0.8882905258641204, 0.9991817903431941, 1.0,
0.4678339275898943, 0.8164056182686034, 0.9989199241685126,
0.9980525368790883, 0.9312152374846061, 0.9096752789481569,
0.9996114311913593, 0.46367666660348744, 0.9320002808310418,
0.9542371205095945, 0.9980241561997089, 0.9319771743389491,
0.8925487603333683, 0.6653736469114154, 0.5264993912088891,
0.9325789202459493, 0.9990691225991941, 0.975653685270635,
0.9111205717076893, 0.9992560731072977])
# Reshape all data to be of the shape (batch_size, 1)
X_train = X_train.reshape((-1, 1))
Y_train = Y_train.reshape((-1, 1))
X_test = X_test.reshape((-1, 1))
Y_test = Y_test.reshape((-1, 1))
X = X.reshape((-1, 1))
Y = Y.reshape((-1, 1))
# Is data scaled? NNs do well with bounded data.
assert np.all(X_train >= 0) and np.all(X_train <= 1)
assert np.all(Y_train >= 0) and np.all(Y_train <= 1)
assert np.all(X_test >= 0) and np.all(X_test <= 1)
assert np.all(Y_test >= 0) and np.all(Y_test <= 1)
assert np.all(X >= 0) and np.all(X <= 1)
assert np.all(Y >= 0) and np.all(Y <= 1)
# Build a model with variable number of hidden layers.
# We will use Keras functional API.
# https://www.perfectlyrandom.org/2019/06/24/a-guide-to-keras-functional-api/
n_dense_layers = 30 # increase this to get more complicated models
# Define the layers first.
input_tensor = Input(shape=(1,), name='input')
layers = []
for i in range(n_dense_layers):
layers += [Dense(units=50, activation='relu', name=f'dense_layer_{i}')]
if (i > 0) & (i % 5 == 0):
# avg over batches not features
layers += [BatchNormalization(axis=1)]
sigmoid_layer = Dense(units=1, activation='sigmoid', name='sigmoid_layer')
# Connect the layers using Keras Functional API
mid_layer = input_tensor
for dense_layer in layers:
mid_layer = dense_layer(mid_layer)
output_tensor = sigmoid_layer(mid_layer)
model = Model(inputs=[input_tensor], outputs=[output_tensor])
optimizer = Adam(learning_rate=0.0005)
model.compile(optimizer=optimizer, loss='mae', metrics=['mae'])
model.fit(x=[X_train], y=[Y_train], epochs=40000, batch_size=21)
# Predict on various datasets
Y_train_pred = model.predict(X_train)
# Create a dataframe to inspect results manually
train_df = pd.DataFrame({
'x': X_train.reshape((-1)),
'y_true': Y_train.reshape((-1)),
'y_pred': Y_train_pred.reshape((-1))
})
train_df['error'] = train_df['y_pred'] - train_df['y_true']
print(train_df)
# A dictionary to store all the errors in one place.
train_errors = {
'me': np.mean(train_df['error']),
'mae': np.mean(np.abs(train_df['error'])),
'mse': np.mean(np.square(train_df['error'])),
'rmse': np.sqrt(np.mean(np.square(train_df['error']))),
}
print(train_errors)
# Make a plot to visualize true vs predicted
plt.figure(1)
plt.clf()
plt.plot(train_df['x'], train_df['y_true'], 'r.', label='y_true')
plt.plot(train_df['x'], train_df['y_pred'], 'bo', alpha=0.25, label='y_pred')
plt.grid(True)
plt.xlabel('x')
plt.ylabel('y')
plt.title(f'Train data. MSE={np.round(train_errors["mse"], 5)}.')
plt.legend()
plt.show(block=False)
plt.savefig('true_vs_pred.png')
A problem you may encountering is that you don't have enough training data for the model to be able to fit well. In your example, you only have 21 training instances, each with only 1 feature. Broadly speaking with neural network models, you need on the order of 10K or more training instances to produce a decent model.
Consider the following code that generates a noisy sine wave and tries to train a densely-connected feed-forward neural network to fit the data. My model has two linear layers, each with 50 hidden units and a ReLU activation function. The experiments are parameterized with the variable num_points which I will increase.
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(7)
def generate_data(num_points=100):
X = np.linspace(0.0 , 2.0 * np.pi, num_points).reshape(-1, 1)
noise = np.random.normal(0, 1, num_points).reshape(-1, 1)
y = 3 * np.sin(X) + noise
return X, y
def run_experiment(X_train, y_train, X_test, batch_size=64):
num_points = X_train.shape[0]
model = keras.Sequential()
model.add(layers.Dense(50, input_shape=(1, ), activation='relu'))
model.add(layers.Dense(50, activation='relu'))
model.add(layers.Dense(1, activation='linear'))
model.compile(loss = "mse", optimizer = "adam", metrics=["mse"] )
history = model.fit(X_train, y_train, epochs=10,
batch_size=batch_size, verbose=0)
yhat = model.predict(X_test, batch_size=batch_size)
plt.figure(figsize=(5, 5))
plt.plot(X_train, y_train, "ro", markersize=2, label='True')
plt.plot(X_train, yhat, "bo", markersize=1, label='Predicted')
plt.ylim(-5, 5)
plt.title('N=%d points' % (num_points))
plt.legend()
plt.grid()
plt.show()
Here is how I invoke the code:
num_points = 100
X, y = generate_data(num_points)
run_experiment(X, y, X)
Now, if I run the experiment with num_points = 100, the model predictions (in blue) do a terrible job at fitting the true noisy sine wave (in red).
Now, here is num_points = 1000:
Here is num_points = 10000:
And here is num_points = 100000:
As you can see, for my chosen NN architecture, adding more training instances allows the neural network to better (over)fit the data.
If you do have a lot of training instances, then if you want to purposefully overfit your data, you can either increase the neural network capacity or reduce regularization. Specifically, you can control the following knobs:
increase the number of layers
increase the number of hidden units
increase the number of features per data instance
reduce regularization (e.g. by removing dropout layers)
use a more complex neural network architecture (e.g. transformer blocks instead of RNN)
You may be wondering if neural networks can fit arbitrary data rather than just a noisy sine wave as in my example. Previous research says that, yes, a big enough neural network can fit any data. See:
Universal approximation theorem. https://en.wikipedia.org/wiki/Universal_approximation_theorem
Zhang 2016, "Understanding deep learning requires rethinking generalization". https://arxiv.org/abs/1611.03530
As discussed in the comments, you should make a Python array (with NumPy) like this:-
Myarray = [[0.65, 1], [0.85, 0.5], ....]
Then you would just call those specific parts of the array whom you need to predict. Here the first value is the x-axis value. So you would call it to obtain the corresponding pair stored in Myarray
There are many resources to learn these types of things. some of them are ===>
https://www.geeksforgeeks.org/python-using-2d-arrays-lists-the-right-way/
https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=video&cd=2&cad=rja&uact=8&ved=0ahUKEwjGs-Oxne3oAhVlwTgGHfHnDp4QtwIILTAB&url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DQgfUT7i4yrc&usg=AOvVaw3LympYRszIYi6_OijMXH72

I was training an Ann machine learning model using GridSearchCV and got stuck with an IndexError in gridSearchCV

My model starts to train and while executing for sometime it gives an error :-
IndexError: index 37 is out of bounds for axis 0 with size 37
It executes properly for my model without using gridsearchCV with fixed parameters
Here is my code
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import GridSearchCV
from keras.models import Sequential
from keras.layers import Dense
def build_classifier(optimizer, nb_layers,unit):
classifier = Sequential()
classifier.add(Dense(units = unit, kernel_initializer = 'uniform', activation = 'relu', input_dim = 14))
i = 1
while i <= nb_layers:
classifier.add(Dense(activation="relu", units=unit, kernel_initializer="uniform"))
i += 1
classifier.add(Dense(units = 38, kernel_initializer = 'uniform', activation = 'softmax'))
classifier.compile(optimizer = optimizer, loss = 'sparse_categorical_crossentropy', metrics = ['accuracy'])
return classifier
classifier = KerasClassifier(build_fn = build_classifier)
parameters = {'batch_size': [10,25],
'epochs': [100,200],
'optimizer': ['adam'],
'nb_layers': [5,6,7],
'unit':[48,57,76]
}
grid_search = GridSearchCV(estimator = classifier,
param_grid = parameters,
scoring = 'accuracy',
cv=5,n_jobs=-1)
grid_search = grid_search.fit(X_train, y_train)
best_parameters = grid_search.best_params_
best_accuracy = grid_search.best_score_
The error IndexError: index 37 is out of bounds for axis 0 with size 37 means that there is no element with index 37 in your object.
In python, if you have an object like array or list, which has elements indexed numerically, if it has n elements, indexes will go from 0 to n-1 (this is the general case, with the exception of reindexing in dataframes).
So, if you ahve 37 elements you can only retrieve elements from 0-36.
This is a multi-class classifier with a huge Number of Classes (38 classes). It seems like GridSearchCV isn't spliting your dataset by stratified sampling, may be because you haven't enough data and/or your dataset isn't class-balanced.
According to the documentation:
For integer/None inputs, if the estimator is a classifier and y is
either binary or multiclass, StratifiedKFold is used. In all other
cases, KFold is used.
By using categorical_crossentropy, KerasClassifier will convert targets (a class vector (integers)) to binary class matrix using keras.utils.to_categorical. Since there are 38 classes, each target will be converted to a binary vector of dimension 38 (index from 0 to 37).
I guess that in some splits, the validation set doesn't have samples from all the 38 classes, so targets are converted to vectors of dimension < 38, but since GridSearchCV is fitted with samples from all the 38 classes, it expects vectors of dimension = 38, which causes this error.
Take a look at the shape of your y_train. It need to be a some sort of one hot with shape (,37)

Recursive Feature Elimination (RFE) SKLearn

I created a table to test my understanding
F1 F2 Outcome
0 2 5 1
1 4 8 2
2 6 0 3
3 9 8 4
4 10 6 5
From F1 and F2 I tried to predict Outcome
As you can see F1 have a strong correlation to Outcome,F2 is random noise
I tested
pca = PCA(n_components=2)
fit = pca.fit(X)
print("Explained Variance")
print(fit.explained_variance_ratio_)
Explained Variance
[ 0.57554896 0.42445104]
Which is what I expected and shows that F1 is more important
However when I do RFE (Recursive Feature Elimination)
model = LogisticRegression()
rfe = RFE(model, 1)
fit = rfe.fit(X, Y)
print(fit.n_features_)
print(fit.support_)
print(fit.ranking_)
1
[False True]
[2 1]
It asked me to keep F2 instead? It should ask me to keep F1 since F1 is a strong predictor while F2 is random noise... why F2?
Thanks
It is advisable to do a Recursive Feature Elimination Cross Validation (RFECV) before running the Recursive Feature Elimination (RFE)
Here is an example:
Having columns :
df.columns = ['age', 'id', 'sex', 'height', 'gender', 'marital status', 'income', 'race']
Use RFECV to identify the optimal number of features needed.
from sklearn.ensemble import RandomForestClassifier
rfe = RandomForestClassifier(random_state = 32) # Instantiate the algo
rfecv = RFECV(estimator= rfe, step=1, cv=StratifiedKFold(2), scoring="accuracy") # Instantiate the RFECV and its parameters
fit = rfecv.fit(features(or X), target(or y))
print("Optimal number of features : %d" % rfecv.n_features_)
>>>> Optimal number of output is 4
Now that the optimal number of features has been known, we can use Recursive Feature Elimination to identify the Optimal features
from sklearn.feature_selection import RFE
min_features_to_select = 1
rfc = RandomForestClassifier()
rfe = RFE(estimator=rfc, n_features_to_select= 4, step=1)
fittings1 = rfe.fit(features, target)
for i in range(features.shape[1]):
print('Column: %d, Selected %s, Rank: %.3f' % (i, rfe.support_[i], rfe.ranking_[i]))
output will be something like:
>>> Column: 0, Selected True, Rank: 1.000
>>> Column: 1, Selected False, Rank: 4.000
>>> Column: 2, Selected False, Rank: 7.000
>>> Column: 3, Selected False, Rank: 10.000
>>> Column: 4, Selected True, Rank: 1.000
>>> Column: 5, Selected False, Rank: 3.000
Now display the features to remove based on recursive feature elimination done above
columns_to_remove = features.columns.values[np.logical_not(rfe.support_)]
columns_to_remove
output will be something like:
>>> array(['age', 'id', 'race'], dtype=object)
Now create your new dataset by dropping the un-needed features and selecting the needed one
new_df = df.drop(['age', 'id', 'race'], axis = 1)
Then you can cross validation to know how well this newly selected features (new_df) predicts the target column.
# Check how well the features predict the target variable using cross_validation
cv = ShuffleSplit(n_splits=5, test_size=0.3, random_state=0)
scores = cross_val_score(RandomForestClassifier(), new_df, target, cv= cv)
print("%0.2f accuracy with a standard deviation of %0.2f" % (scores.mean(), scores.std()))
>>> 0.84 accuracy with a standard deviation of 0.01
Don't forget you can also read up on the best Cross-Validation (CV) parameters to use in this documentation
Recursive Feature Elimination (RFE) documentation to learn more and understand better
Recursive Feature Elimination Cross validation (RFECV) documentation
You are using LogisticRegression model. This is a classifier, not a regressor. So your outcome here is treated as labels (not numbers). For good training and prediction, a classifier needs multiple samples of each class. But in your data, only single row is present for each class. Hence the results are garbage and not to be taken seriously.
Try replacing that with any regression model and you will see the outcome which you thought would be.
model = LinearRegression()
rfe = RFE(model, 1)
fit = rfe.fit(X, y)
print(fit.n_features_)
print(fit.support_)
print(fit.ranking_)
# Output
1
[ True False]
[1 2]

Resources