I'm trying to compare accuracy results (on the Titanic dataset) between random forest and XGBoost, and I can't figure out why random forest gives better results.
XGBoost is a kind of optimized tree-based model. It computes an optimized tree at every cycle (every new estimator).
Random forest builds many trees (with different data and different features) and selects the best tree.
I'm working on the Titanic dataset (after handling NaN values and removing some noise). Both models get the same dataset.
For both algorithms I'm tuning the hyperparameters.
XGBoost model:
model = XGBClassifier(n_jobs=-1, random_state=42)
hyperparams = {'max_depth': [2,3,4,5,6,7,8],
'n_estimators': [20, 50, 100, 120],
'learning_rate': [0.1, 0.2, 0.3, 0.4, 0.5]}
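randomized = RandomizedSearchCV(model, hyperparams, n_iter=40, cv=5, random_state=42, scoring='accuracy')
randomized.fit(x, y)
# Best hyperparameter combination found by the search
best_params = randomized.best_params_
model = XGBClassifier(n_jobs=-1, random_state=42, **best_params)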
model.fit(train_df_x, train_df_y)
y_pred = model.predict(test_df_x)
Random forest model:
model = RandomForestClassifier(random_state=42, n_jobs=-1)
hyperparams = {'n_estimators': [20, 50, 100, 120],
'max_depth': [2,3,4,5,6,7,8]}
randomized = RandomizedSearchCV(model, hyperparams, n_iter=20, cv=5, random_state=42, scoring='accuracy')
randomized.fit(x, y)
# Best hyperparameter combination found by the search
best_params = randomized.best_params_
model = RandomForestClassifier(random_state=42, n_jobs=-1, **best_params)
model.fit(train_df_x, train_df_y)
y_pred = model.predict(test_df_x)
As you can see, I'm using more hyperparameter-search iterations for XGBoost (because it has more parameters to tune).
I'm getting the following accuracy results:
Random forest: 86.6
XGBoost: 85.41
Before running the test, I was sure that XGBoost would give me better results.
How can it be that random forest gives better results? What am I missing when using XGBoost?
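One way to check whether a gap of about one point is meaningful is to score both models on the same repeated cross-validation folds and look at the spread across folds. The sketch below is only an illustration: the data is synthetic (an assumption, not the Titanic set), and scikit-learn's GradientBoostingClassifier stands in for XGBClassifier to avoid the extra dependency.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Synthetic stand-in for the cleaned Titanic data (assumption).
X, y = make_classification(n_samples=400, n_features=10, random_state=42)

# The same folds are used for both models, so the comparison is paired.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=42)

models = {
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=42),
    # Stand-in for XGBClassifier (assumption).
    "gradient_boosting": GradientBoostingClassifier(n_estimators=50, random_state=42),
}

scores = {
    name: cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    for name, model in models.items()
}
for name, s in scores.items():
    print(f"{name}: mean={s.mean():.3f} std={s.std():.3f}")
```

If the standard deviation across folds is larger than the difference between the two means, a single train/test split is probably not enough to rank the models.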


Where should the categorical encoding be done in a k-fold CV procedure?

I want to apply a cross validation method in my machine learning models. In these models, I want feature selection and a grid search to be applied as well. Imagine that I want to estimate the performance of a K-Nearest-Neighbor classifier by applying a feature selection technique based on an F-score (ANOVA) that chooses the 10 most relevant features. The code would be as follows:
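import numpy as np
from sklearn.model_selection import GridSearchCV, RepeatedKFold, RepeatedStratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
# 10-times 10-fold cross validation
n_repeats = 10
rkf = RepeatedKFold(n_splits=10, n_repeats=n_repeats, random_state=0)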
# Data standardization
scaler = StandardScaler()
# Variable to contain error measures and counter for the splits
error_knn = []
split = 0
for train_index, test_index in rkf.split(X, y):
    # Print a dot for each train / test partition
    print('.', end='')
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # Standardize the data
    scaler.fit(X_train, y_train)
    X_train = scaler.transform(X_train)
    X_test = scaler.transform(X_test)
    ###- In order to select the best number of neighbors -###
    # Pipeline for training the classifier from previous notebooks
    pipeline = Pipeline([ ('knn', KNeighborsClassifier()) ])
    N_neighbors = [1, 3, 5, 7, 11, 15, 20, 25, 30]
    param_grid = { 'knn__n_neighbors': N_neighbors }
    # Evaluate the performance in a 5-fold cross-validation
    skfold = RepeatedStratifiedKFold(n_splits=5, n_repeats=1, random_state=split)
    # n_jobs = -1 to use all processors
    gridcv = GridSearchCV(pipeline, cv=skfold, n_jobs=-1, param_grid=param_grid)
    result = gridcv.fit(X_train, y_train)
    ###- Results -###
    # Mean accuracy and standard deviation
    accuracies = gridcv.cv_results_['mean_test_score']
    std_accuracies = gridcv.cv_results_['std_test_score']
    # Best value for the number of neighbors
    # Define KNeighbors Classifier with that best value
    # Method fit(X,y) to fit each model according to training data
    best_Nneighbors = N_neighbors[np.argmax(accuracies)]
    knn = KNeighborsClassifier(n_neighbors = best_Nneighbors)
    knn.fit(X_train, y_train)
    # Error for the prediction
    error_knn.append(1.0 - np.mean(knn.predict(X_test) == y_test))
    split += 1
However, my columns are categorical (except the binary label) and I need to do a categorical encoding. I cannot remove these columns because they are essential.
Where would you perform this encoding, and how would the problem of unseen labels in each fold be solved?
Categorical encoding should be performed as the first step, precisely to avoid the problem you mentioned regarding unseen labels in each fold.
Additionally, your current implementation suffers from data leakage.
You're performing feature scaling on the full X_train dataset before performing your inner cross-validation.
This can be solved by including StandardScaler in the pipeline used for your GridSearchCV:
X_train, X_test = X[train_index], X[test_index]
y_train, y_test = y[train_index], y[test_index]
###- In order to select the best number of neighbors -###
# Pipeline for training the classifier from previous notebooks
pipeline = Pipeline(
    [ ('scaler', scaler), ('knn', KNeighborsClassifier()) ]
)
N_neighbors = [1, 3, 5, 7, 11, 15, 20, 25, 30]
param_grid = { 'knn__n_neighbors': N_neighbors }
Another couple of tips:
GridSearchCV has a best_estimator_ attribute that can be used to extract the estimator with the best set of hyperparameters found.
When using GridSearchCV with refit=True (the default), you can use the object directly to perform predictions, e.g. gridcv.predict(X_test).
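To make the two tips concrete, here is a minimal sketch of the nested loop with the scaler inside the pipeline and predictions taken straight from the fitted search object. The data is synthetic and the fold counts are smaller than in the question (both are assumptions for brevity).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic numeric data (assumption).
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# The scaler is a pipeline step, so it is refitted on each inner training fold.
pipeline = Pipeline([("scaler", StandardScaler()), ("knn", KNeighborsClassifier())])
param_grid = {"knn__n_neighbors": [1, 3, 5, 7, 11]}

error_knn = []
outer = KFold(n_splits=5, shuffle=True, random_state=0)
for train_index, test_index in outer.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    inner = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    gridcv = GridSearchCV(pipeline, param_grid=param_grid, cv=inner)
    gridcv.fit(X_train, y_train)

    # refit=True is the default, so gridcv.predict uses the best pipeline
    # refitted on all of X_train; no manual rebuild is needed.
    error_knn.append(1.0 - np.mean(gridcv.predict(X_test) == y_test))

best_k = gridcv.best_estimator_.named_steps["knn"].n_neighbors
print(best_k, np.mean(error_knn))
```

Note that `best_k` here only reflects the last outer fold; each outer fold may pick a different value.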
EDIT: Perhaps I was too general about when to perform categorical encoding. Your approach should depend on your problem/dataset.
If you know beforehand how many categorical features exist and you want to train your inner CV classifiers with this knowledge, you should perform categorical encoding as the first step.
If at training time you do not know how many categorical features you are going to see, or you want to train your CV classifiers without knowledge of the full range of categorical features, you should perform categorical encoding at each fold.
When using the former your classifiers will all be trained on the same feature space while that's not guaranteed for the latter.
If using the latter, the above pipeline can be extended to incorporate categorical encoding:
pipeline = Pipeline([
    ('enc', OneHotEncoder()),
    ('scaler', StandardScaler(with_mean=False)),
    ('knn', KNeighborsClassifier()),
])
I suggest you read the Encoding categorical features section of scikit-learn's User Guide carefully.
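For the per-fold option, one possible way to deal with labels that show up only in a test fold is OneHotEncoder's `handle_unknown='ignore'` setting, which encodes an unseen category as all zeros instead of raising an error. This is a sketch on made-up toy data (the column values and the rare "purple" level are assumptions), not the asker's dataset.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.RandomState(0)
# Toy categorical data (assumption): two string-valued columns.
colors = np.array(rng.choice(["red", "green", "blue"], size=60), dtype=object)
sizes = np.array(rng.choice(["S", "M", "L"], size=60), dtype=object)
colors[0] = "purple"  # occurs once, so it is unseen when row 0 is in the test fold
X = np.column_stack([colors, sizes])
y = (colors == "red").astype(int)

pipeline = Pipeline([
    ("enc", OneHotEncoder(handle_unknown="ignore")),
    ("scaler", StandardScaler(with_mean=False)),  # with_mean=False keeps the matrix sparse
    ("knn", KNeighborsClassifier()),
])

# The encoder is fitted inside each fold, on that fold's training rows only.
scores = cross_val_score(pipeline, X, y, cv=5)
print(scores)

# An unseen category becomes an all-zero block for that column.
enc = OneHotEncoder(handle_unknown="ignore").fit(X[1:])
row = enc.transform(X[:1]).toarray()
print(row)
```

With the default `handle_unknown='error'`, the fold that holds row 0 out would fail at transform time.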

How to train model with features selected by SelectKBest?

I am using SelectKBest() in Sklearn's Pipeline() class to reduce the number of features from 30 down to the 5 best features. When I fit the classifier, I get different test results, as expected with feature selection. However, I spotted an error in my code which doesn't seem to cause an actual error at runtime.
When I call predict(), I realised that it was still being given all 30 features as input as if feature selection wasn't occurring. Even though I only trained the model on the 5 best features. Surely giving 30 features to an SVM to predict a class will crash if it was only trained on the 5 best features?
In my train_model(df) function, my code looks as follows:
def train_model(df):
    x, y = balance_dataset(df)
    X_train, X_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 0)
    feature_selection = SelectKBest()
    pipe = Pipeline([('sc', preprocessing.MinMaxScaler()),
                     ('feature_selection', feature_selection),
                     ('SVM', svm.SVC(decision_function_shape = 'ovr', kernel = 'poly'))])
    candidate_parameters = [{'SVM__C': [0.01, 0.1, 1], 'SVM__gamma': [0.01, 0.1, 1], 'feature_selection__k': [5]}]
    clf = GridSearchCV(estimator = pipe, param_grid = candidate_parameters, cv = 5, n_jobs = -1)
    clf.fit(X_train, y_train)
    return clf
However, this is what happens when I call trade():
def trade(df):
    clf = train_model(df)
    for index, row in trading_set.iterrows():
        features = row[:-3]  # features is now an array of 30 features, even though the model is only trained on 5
        if trade_balance > 0:
            trades[index] = trade_balance
        if clf.predict(features) == 1:  # So this should crash and give an input shape error, but it doesn't
            pass  # Rest of code unnecessary
So my question is, how do I know that the model is really being trained on only the 5 best features?
Your code is correct, and there is no reason why it should throw you any error. You are confused between the pipeline object and the model itself, which is only one block of the pipeline.
In your example, the pipeline is taking 30 features, scaling them, selecting the 5 best, then training an SVM on these 5 best features. So your SVM has been trained on 5 best features, but you still need to pass all 30 features to your pipeline, because your pipeline expects data to come in in the same format as during the training.
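If you want to verify this yourself, you can inspect the fitted steps of the pipeline. The sketch below uses synthetic data with 30 columns (an assumption) and fits the pipeline directly; with a GridSearchCV object like yours, the fitted pipeline would be `clf.best_estimator_`.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic data with 30 features (assumption).
X, y = make_classification(n_samples=200, n_features=30, n_informative=5, random_state=0)

pipe = Pipeline([
    ("sc", MinMaxScaler()),
    ("feature_selection", SelectKBest(k=5)),
    ("SVM", SVC(kernel="poly")),
])
pipe.fit(X, y)

selector = pipe.named_steps["feature_selection"]
svm_step = pipe.named_steps["SVM"]

# Boolean mask over the 30 input columns; True marks a column that was kept.
n_selected = int(selector.get_support().sum())
print(n_selected)

# Number of features the SVM step itself was fitted on.
print(svm_step.n_features_in_)

# The pipeline as a whole still takes all 30 columns.
predictions = pipe.predict(X[:3])
print(predictions.shape)
```

`selector.get_support(indices=True)` gives the positions of the selected columns if you want to know which ones they are.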

Using cross-validation to select optimal threshold: binary classification in Keras

I have a Keras model that takes a transformed vector x as input and outputs probabilities that each input value is 1.
I would like to take the predictions from this model and find an optimal threshold. That is, maybe the cutoff value for "this value is 1" should be 0.23, or maybe it should be 0.78, or something else. I know cross-validation is a good tool for this.
My question is how to work this in to training. For example, say I have the following model (taken from here):
def create_baseline():
    # create model
    model = Sequential()
    model.add(Dense(60, input_dim=60, kernel_initializer='normal', activation='relu'))
    model.add(Dense(1, kernel_initializer='normal', activation='sigmoid'))
    # Compile model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model
I train the model and get some output probabilities:
model = create_baseline()
model.fit(train_x, train_y)
# Predict from the inputs (train_x), not from the labels
predictions = model.predict(train_x)
Now I want to learn the threshold for the value of each entry in predictions that would give the best accuracy, for example. How can I learn this parameter, instead of just choosing one after training is complete?
EDIT: For example, say I have this:
def fake_model(self):
    # Model that returns probability that each of 10 values is 1
    a_input = Input(shape=(2, 10), name='a_input')
    dense_1 = Dense(5)(a_input)
    outputs = Dense(10, activation='sigmoid')(dense_1)
    def hamming_loss(y_true, y_pred):
        return tf.to_float(tf.reduce_sum(abs(y_true - y_pred))) / tf.to_float(tf.size(y_pred))
    fakemodel = Model(a_input, outputs)
    # Use the outputs of the model; find the threshold value that minimizes the Hamming loss
    # Record the final confusion matrix.
How can I train a model like this end-to-end?
If an ROC curve isn't what you are looking for, you could create a custom Keras Layer that takes in the outputs of your original model and tries to learn an optimal threshold given the true outputs and the predicted probabilities.
This layer subtracts the threshold from the predicted probability, multiplies by a relatively large constant (in this case 100) and then applies the sigmoid function. Here is a plot that shows the function at three different thresholds (.3, .5, .7).
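As a quick numeric illustration of that function (plain NumPy rather than Keras, and the helper name `soft_threshold` is just a label made up for this sketch), you can evaluate it at the same three thresholds:

```python
import numpy as np

def soft_threshold(p, threshold, steepness=100.0):
    """Sigmoid of steepness * (p - threshold): a smooth stand-in for p > threshold."""
    return 1.0 / (1.0 + np.exp(-steepness * (p - threshold)))

probs = np.array([0.1, 0.4, 0.6, 0.9])
for t in (0.3, 0.5, 0.7):
    # Values well below the threshold map close to 0, values well above close to 1.
    print(t, np.round(soft_threshold(probs, t), 3))
```

Because the function is smooth, the threshold receives a gradient, which a hard step function would not provide.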
Below is the code for the definition of this layer and the creation of a model that is composed solely of it. After fitting your original model, feed its output probabilities to this model and start training for an optimal threshold.
class ThresholdLayer(keras.layers.Layer):
    def __init__(self, **kwargs):
        super(ThresholdLayer, self).__init__(**kwargs)
    def build(self, input_shape):
        # Single trainable scalar: the threshold to learn
        self.kernel = self.add_weight(name="threshold", shape=(1,), initializer="uniform",
                                      trainable=True)
        super(ThresholdLayer, self).build(input_shape)
    def call(self, x):
        return keras.backend.sigmoid(100*(x-self.kernel))
    def compute_output_shape(self, input_shape):
        return input_shape
# Assumed input: one predicted probability per sample (input_layer was not defined in the post)
input_layer = keras.layers.Input(shape=(1,))
out = ThresholdLayer()(input_layer)
threshold_model = keras.Model(inputs=input_layer, outputs=out)
threshold_model.compile(optimizer="sgd", loss="mse")
First, here's a direct answer to your question. You're thinking of an ROC curve. For example, assuming some data X_test and y_test:
from matplotlib import pyplot as plt
from sklearn.metrics import roc_curve
from sklearn.metrics import auc
y_pred = model.predict(X_test).ravel()
fpr, tpr, thresholds = roc_curve(y_test, y_pred)
my_auc = auc(fpr, tpr)
# Full ROC curve
plt.figure(1)
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr, label='Model_name (area = {:.3f})'.format(my_auc))
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC curve')
plt.legend(loc='best')
plt.show()
# Close-up of the top-left corner
plt.figure(2)
plt.xlim(0, 0.2)
plt.ylim(0.8, 1)
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr, label='Model_name (area = {:.3f})'.format(my_auc))
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC curve close-up')
plt.legend(loc='best')
plt.show()
Second, regarding my comment, here's an example of one attempt. It can be done in Keras, or TF, or anywhere, although he does it with XGBoost.
Hope that helps!
The first idea I have is kind of brute force.
On a test set, you compute a metric separately for each of your inputs and its corresponding predicted output.
Then, for each of them, iterate over threshold values between 0 and 1 until the metric is optimized for the given input/prediction pair.
For many of the popular metrics of classification quality (accuracy, precision, recall, etc) you just cannot learn the optimal threshold while training your neural network.
This is because these metrics are not differentiable, so gradient updates will fail to set the threshold (or any other parameter) correctly. You are therefore forced to optimize a nice smooth loss (like negative log likelihood) when training most of the parameters, and then tune the threshold by grid search.
Of course, you can come up with a smoothed version of your metric and optimize it (and sometimes people do this). But in most cases it is OK to optimize log-likelihood, get a nice probabilistic classifier, and tune the thresholds on top of it. E.g. if you want to optimize accuracy, then you should first estimate class probabilities as accurately as possible (to get close to the perfect Bayes classifier), and then just choose their argmax.
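A minimal sketch of the "train on a smooth loss, then grid-search the threshold" recipe is below. It assumes synthetic data and uses scikit-learn's LogisticRegression as a stand-in probabilistic classifier; with a Keras model, the output of `model.predict` on a held-out validation set would play the role of `val_probs`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic, mildly imbalanced data (assumption).
X, y = make_classification(n_samples=600, n_features=10, weights=[0.7, 0.3], random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# Step 1: fit the probabilistic classifier on a smooth loss (log loss here).
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
val_probs = clf.predict_proba(X_val)[:, 1]

# Step 2: grid-search the threshold on held-out data, not on the training set.
thresholds = np.linspace(0.05, 0.95, 19)
accuracies = [accuracy_score(y_val, (val_probs >= t).astype(int)) for t in thresholds]
best_threshold = thresholds[int(np.argmax(accuracies))]
print(best_threshold, max(accuracies))
```

The same loop works for any metric: swap `accuracy_score` for F1, precision at a fixed recall, or whatever you actually care about.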

GridSearchCV best score drops when using the best parameters to build the model

I'm trying to find the best set of hyperparameters for my Logistic Regression estimator with GridSearchCV and build the model using a pipeline.
My problem is that when I use the best parameters from grid_search.best_params_ to build the Logistic Regression model, the accuracy is different from the one reported by the grid search.
Here is my code:
X_train, X_test, Y_train, Y_test = model_selection.train_test_split(x, y, test_size=.20, random_state=42)
pipeline = Pipeline([
    ('vectorizer', TfidfVectorizer()),
    ('chi', SelectKBest()),
    ('classifier', LogisticRegression())])
grid = {
    'vectorizer__ngram_range': [(1, 1), (1, 2), (1, 3)],
    'vectorizer__stop_words': [None, 'english'],
    'vectorizer__norm': ('l1', 'l2'),
    'vectorizer__use_idf': (True, False),
    'vectorizer__analyzer': ('word', 'char', 'char_wb'),
    'classifier__penalty': ['l1', 'l2'],
    'classifier__C': [1.0, 0.8],
    'classifier__class_weight': [None, 'balanced'],
    'classifier__n_jobs': [-1],
    'classifier__fit_intercept': (True, False),
}
grid_search = GridSearchCV(pipeline, param_grid=grid, scoring='accuracy', n_jobs=-1, cv=10)
grid_search.fit(X_train, Y_train)
When I retrieve the best score and parameters from the grid search, the result is:
{'classifier__C': 1.0, 'classifier__class_weight': None, 'classifier__fit_intercept': True, 'classifier__n_jobs': -1, 'classifier__penalty': 'l1', 'vectorizer__analyzer': 'word', 'vectorizer__ngram_range': (1, 1), 'vectorizer__norm': 'l2', 'vectorizer__stop_words': None, 'vectorizer__use_idf': False}
Now if I use these parameters to build my model
pipeline = Pipeline([
    ('vectorizer', TfidfVectorizer(ngram_range=(1, 1), stop_words=None, norm='l2', use_idf=False, analyzer='word')),
    ('chi', SelectKBest(chi2, k=1000)),
    ('classifier', LogisticRegression(C=1.0, class_weight=None, fit_intercept=True, n_jobs=-1, penalty='l1'))])
model = pipeline.fit(X_train, Y_train)
print(accuracy_score(Y_test, model.predict(X_test)))
the result drops to 0.68.
Also, this is tedious work, so how can I pass the best parameters to the model directly? I could not figure out how to do it as in this (answer), since my approach is slightly different from his.
The reason why your score is lower in the second option is because you are evaluating your pipeline model on the test set, whereas you are evaluating your gridsearch model using cross-validation (in your case, a 10-fold stratified cross-validation). This cross-validation score is the average of 10 models fitted each on 9/10 of your train data and evaluated on the last 1/10 of this train data. Hence, you cannot expect the same score from both evaluations.
As for your second question, why can't you just use grid_search.best_estimator_? This takes the best model from your grid search, and you can evaluate it without rebuilding it from scratch. For instance:
best_model = grid_search.best_estimator_
best_model.score(X_test, Y_test)
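Here is a small self-contained sketch of both points: the cross-validated `best_score_` and the held-out test score are two different measurements, and the best parameters can be copied onto a fresh pipeline with `set_params` instead of being retyped. The data is synthetic and numeric, and a StandardScaler replaces the text vectorizer (both assumptions made to keep the example short).

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic numeric data (assumption).
X, y = make_classification(n_samples=300, n_features=20, random_state=42)
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.2, random_state=42)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(max_iter=1000)),
])
grid = {"classifier__C": [0.1, 0.8, 1.0]}

grid_search = GridSearchCV(pipeline, param_grid=grid, scoring="accuracy", cv=10)
grid_search.fit(X_train, Y_train)

# Mean accuracy over the 10 validation folds carved out of X_train.
cv_score = grid_search.best_score_

# Best pipeline, already refitted on all of X_train (refit=True is the default).
best_model = grid_search.best_estimator_
test_score = best_model.score(X_test, Y_test)

# Equivalent manual rebuild without retyping any parameter values.
rebuilt = clone(pipeline).set_params(**grid_search.best_params_).fit(X_train, Y_train)
rebuilt_score = rebuilt.score(X_test, Y_test)

print(cv_score, test_score, rebuilt_score)
```

`cv_score` and `test_score` will generally differ, while `test_score` and `rebuilt_score` should match because they come from the same parameters and the same training data.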
I put both Logistic Regression and MLPClassifier in a pipeline switching between each classifier. I used GridSearchCV to find the best parameters between the classifiers. I adjusted the parameters then selected the most accurate classifier for the data. Originally the MLPClassifier was more accurate but after adjusting the C value for the logistic regression, it became more accurate.
pipeline = Pipeline([
    #('pca', PCA()),
    ('clf', LogisticRegression(C=5, max_iter=10000, tol=0.1)),
    #('clf', MLPClassifier(hidden_layer_sizes=(25,150,25), max_iter=800, solver='lbfgs', activation='relu', alpha=0.7,
    #                      learning_rate_init=0.001, verbose=False, momentum=0.9, random_state=42))
])

Learning curve is the same for training and validation?

I've been training a neural network with scikit-learn's MLPRegressor using ShuffleSplit with 10 splits and 20% of the data set aside for testing. First I use GridSearchCV to find good parameters. I then instantiate a new (unfitted) estimator with those params, and finally use the plot_learning_curve function, with a MAPE scorer and the same ShuffleSplit cv.
In most of the learning curve examples I've seen, the validation and training curves are distinctly separate. However, I've consistently been getting learning curves where the cross-validation and training curves are almost identical. How should I interpret this: does it seem realistic, or have I made a mistake somewhere?
[Figure: learning curve]
As requested, here's the code:
node_range = list(range(1,16))
layer_range = range(1,6)
hidden_sizes = [(nodes,) * layers for layers in layer_range for nodes in node_range]
param_grid = [{'hidden_layer_sizes': hidden_sizes,
               'activation': ['relu'],
               'learning_rate_init': [0.5]}]
cv = ShuffleSplit(n_splits=10, test_size=0.2)
estimator = MLPRegressor()
search = GridSearchCV(estimator, param_grid, cv=cv, scoring=neg_MAPE, refit=True)
search.fit(X, y)
best_params = search.best_params_
estimator = MLPRegressor().set_params(**best_params)
plot_learning_curve(estimator, X, y, cv=cv, scoring=neg_MAPE)
And here is my scorer:
def mean_absolute_percentage_error(y_true, y_pred):
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100
neg_MAPE = make_scorer(mean_absolute_percentage_error, greater_is_better=False)
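One way to sanity-check a plot like this is to look at the raw numbers behind it with scikit-learn's `learning_curve` function, which returns the per-split training and validation scores. The sketch below assumes synthetic regression data shifted to be strictly positive (so MAPE is well defined) and a small fixed network; it is not the tuned estimator from the post.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import make_scorer
from sklearn.model_selection import ShuffleSplit, learning_curve
from sklearn.neural_network import MLPRegressor

def mean_absolute_percentage_error(y_true, y_pred):
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

neg_MAPE = make_scorer(mean_absolute_percentage_error, greater_is_better=False)

# Synthetic data (assumption); targets shifted so that none are near zero.
X, y = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=0)
y = y - y.min() + 10.0

cv = ShuffleSplit(n_splits=3, test_size=0.2, random_state=0)
estimator = MLPRegressor(hidden_layer_sizes=(10,), max_iter=200, random_state=0)

train_sizes, train_scores, test_scores = learning_curve(
    estimator, X, y, cv=cv, scoring=neg_MAPE,
    train_sizes=np.linspace(0.25, 1.0, 4),
)

# Scores are negated MAPE, so flip the sign to read them as percentages.
train_mape = -train_scores.mean(axis=1)
val_mape = -test_scores.mean(axis=1)
for n, tr, va in zip(train_sizes, train_mape, val_mape):
    print(f"n={n}: train MAPE={tr:.1f}%  validation MAPE={va:.1f}%")
```

If the two columns stay close at every training size, the curves in the plot really do coincide, and the per-split values in `train_scores` and `test_scores` show how much they vary between splits.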
