After average-ensemble classification, I am getting a weird confusion matrix and even weirder metric scores.
x = data_train[categorical_columns + numerical_columns]
y = data_train['target']
from imblearn.over_sampling import SMOTE
# fit_sample was renamed to fit_resample in newer imbalanced-learn releases
x_sample, y_sample = SMOTE().fit_resample(x, y.values.ravel())
x_sample = pd.DataFrame(x_sample)
y_sample = pd.DataFrame(y_sample)
# checking the sizes of the sample data
print("Size of x-sample :", x_sample.shape)
print("Size of y-sample :", y_sample.shape)
# Train-Test split.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(x_sample, y_sample,
Accuracy is 99.9%, but recall, f1-score and precision are 0. I have never faced this problem before. I have used the AdaBoost classifier.
Confusion Matrix for ADB:
[[46399 25]
[ 0 0]]
Accuracy for ADB:
Precision for ADB:
Recall for ADB:
f1_score for ADB:
Since it is an imbalanced dataset, I have used SMOTE. Now I am getting the following results:
Confusion Matrix for ETC:
[[ 0 0]
[ 336 92002]]
Accuracy for ETC:
Precision for ETC:
Recall for ETC:
f1_score for ETC:
This is happening because you have an unbalanced dataset (99.9% 0s and only 0.1% 1s). In such scenarios, using accuracy as a metric can be misleading.
You can read more about which metrics to use in such scenarios here
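To see why accuracy misleads here, the metrics can be recomputed by hand from the ADB confusion matrix in the question. This is only a minimal sketch in plain Python. It assumes scikit-learn's layout (rows are true labels, columns are predictions), so the matrix reads TN=46399, FP=25, FN=0, TP=0. The helper name binary_metrics is made up for illustration.

```python
def binary_metrics(tn, fp, fn, tp):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts.

    Ratios with a zero denominator are reported as 0.0.
    """
    total = tn + fp + fn + tp
    accuracy = (tn + tp) / total
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom > 0 else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Counts read off the ADB matrix, assuming rows = true labels
metrics = binary_metrics(tn=46399, fp=25, fn=0, tp=0)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Under that layout the second row of the matrix is all zeros, so the test split contains no positive samples at all. Recall is then undefined, and precision is 0 because every predicted positive is wrong, while accuracy stays above 99.9%.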
Hi. As the answers above mentioned, this is because of skewed (unbalanced) data. However, I would like to offer a simpler solution: use an SVM.
model = sklearn.svm.SVC(class_weight='balanced').fit(X_train, y_train)
Setting class_weight='balanced' weights each class inversely to its frequency, so all classes get equal importance regardless of how many data points each one has in the dataset. Also, the 'rbf' kernel (the default for SVC) often gives good results, although on data this skewed you should judge it by precision and recall rather than accuracy.
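As a rough illustration of what 'balanced' does, scikit-learn documents the heuristic as n_samples / (n_classes * np.bincount(y)). The sketch below reproduces that formula in plain Python. The label counts are hypothetical and the function name balanced_class_weights is not part of any library.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical skewed label vector: 990 negatives, 10 positives
labels = [0] * 990 + [1] * 10
weights = balanced_class_weights(labels)
print(weights)
```

The rare class ends up with a weight roughly 99 times larger than the common one, so misclassifying a positive costs the model as much as misclassifying 99 negatives.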
I'm trying to find the best set of hyperparameters for my Logistic Regression estimator with GridSearchCV, and I build the model using a pipeline:
my problem is when trying to use the best parameters I get through
grid_search.best_params_ to build the Logistic Regression model, the accuracy is different from the one I get by
Here is my code
X_train, X_test, Y_train, Y_test = model_selection.train_test_split(x, y, test_size=.20, random_state=42)
pipeline = Pipeline([
('vectorizer', TfidfVectorizer()),
('chi', SelectKBest()),
('classifier', LogisticRegression())])
grid = {
'vectorizer__ngram_range': [(1, 1), (1, 2),(1, 3)],
'vectorizer__stop_words': [None, 'english'],
'vectorizer__norm': ('l1', 'l2'),
'vectorizer__use_idf':(True, False),
'vectorizer__analyzer':('word', 'char', 'char_wb'),
'classifier__penalty': ['l1', 'l2'],
'classifier__C': [1.0, 0.8],
'classifier__class_weight': [None, 'balanced'],
'classifier__n_jobs': [-1],
'classifier__fit_intercept':(True, False),
}
grid_search = GridSearchCV(pipeline, param_grid=grid, scoring='accuracy', n_jobs=-1, cv=10)
grid_search.fit(X_train, Y_train)
and when I get the best score and params using
the result is
{'classifier__C': 1.0, 'classifier__class_weight': None, 'classifier__fit_intercept': True, 'classifier__n_jobs': -1, 'classifier__penalty': 'l1', 'vectorizer__analyzer': 'word', 'vectorizer__ngram_range': (1, 1), 'vectorizer__norm': 'l2', 'vectorizer__stop_words': None, 'vectorizer__use_idf': False}
Now if I use these parameters to build my model
pipeline = Pipeline([
('vectorizer',TfidfVectorizer(ngram_range=(1, 1),stop_words=None,norm='l2',use_idf= False,analyzer='word')),
('chi', SelectKBest(chi2,k=1000)),
('classifier', LogisticRegression(C=1.0,class_weight=None,fit_intercept=True,n_jobs=-1,penalty='l1'))])
model = pipeline.fit(X_train, Y_train)
print(accuracy_score(Y_test, model.predict(X_test)))
the result drops to 0.68.
Also, rebuilding the model by hand is tedious, so how can I pass the best parameters to the model directly? I could not figure out how to do it as in this (answer), since my setup is slightly different from the one there.
The reason why your score is lower in the second option is because you are evaluating your pipeline model on the test set, whereas you are evaluating your gridsearch model using cross-validation (in your case, a 10-fold stratified cross-validation). This cross-validation score is the average of 10 models fitted each on 9/10 of your train data and evaluated on the last 1/10 of this train data. Hence, you cannot expect the same score from both evaluations.
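To make the difference concrete, here is a minimal sketch of how a cross-validation score comes about, in plain Python. It uses plain, unshuffled k-fold rather than the stratified variant mentioned above, and both the helper names and the per-fold scores are hypothetical, chosen only to show the averaging step.

```python
def kfold_indices(n_samples, n_folds):
    """Yield (train_indices, validation_indices) for plain, unshuffled k-fold."""
    fold_sizes = [
        n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
        for i in range(n_folds)
    ]
    start = 0
    for size in fold_sizes:
        stop = start + size
        validation = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, validation
        start = stop


def mean_cv_score(fold_scores):
    """The cross-validation score is the average of the per-fold scores."""
    return sum(fold_scores) / len(fold_scores)


splits = list(kfold_indices(n_samples=100, n_folds=10))
print(len(splits), len(splits[0][0]), len(splits[0][1]))

# Hypothetical accuracies of the 10 fold models, each scored on its own held-out tenth
hypothetical_fold_scores = [0.74, 0.70, 0.69, 0.72, 0.71, 0.68, 0.73, 0.70, 0.72, 0.71]
cv_score = mean_cv_score(hypothetical_fold_scores)
print(round(cv_score, 3))
```

None of the ten fold models ever sees X_test, and none of them is the model that is finally refit on all of the training data, so the averaged number and the test-set accuracy measure different things.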
As for your second question, why can't you just use grid_search.best_estimator_? This takes the best model from your grid search, and you can evaluate it without rebuilding it from scratch. For instance:
best_model = grid_search.best_estimator_
best_model.score(X_test, Y_test)
I put both Logistic Regression and MLPClassifier in a pipeline switching between each classifier. I used GridSearchCV to find the best parameters between the classifiers. I adjusted the parameters then selected the most accurate classifier for the data. Originally the MLPClassifier was more accurate but after adjusting the C value for the logistic regression, it became more accurate.
pipeline= Pipeline([
#('pca', PCA()),
('clf',LogisticRegression(C=5,max_iter=10000, tol=0.1)),
#('clf',MLPClassifier(hidden_layer_sizes=(25,150,25), max_iter=800, solver='lbfgs', activation='relu', alpha=0.7,
# learning_rate_init=0.001, verbose=False, momentum=0.9, random_state=42))
])
I use a CatBoostClassifier and my classes are highly imbalanced. I applied a scale_pos_weight parameter to account for that. While training with an evaluation dataset (test) CatBoost shows a high precision on test. However, when I make predictions on test using a predict method, I only get a low precision score (calculated using the sklearn.metrics).
I think this might be related to class weights that I applied. However, I don't quite understand how a precision score is affected by this.
params = frozendict({
'task_type': 'CPU',
'loss_function': 'Logloss',
'eval_metric': 'F1',
'custom_metric': ['F1', 'Precision', 'Recall'],
'iterations': 100,
'random_seed': 20190128,
'scale_pos_weight': 56.88657244809081,
'learning_rate': 0.5412829495147387,
'depth': 7,
'l2_leaf_reg': 9.526905230698302
})
from catboost import CatBoostClassifier
model = CatBoostClassifier(**params)
model.fit(
X_train, y_train,
cat_features=np.where(X_train.dtypes == object)[0],
eval_set=(X_test, y_test),
)
{'learn': {'Recall': 0.9243007537531925,
'Logloss': 0.15892360013680026,
'F1': 0.9416723809244181,
'Precision': 0.9640191600545249},
'validation_0': {'Recall': 0.914252301192093,
'Logloss': 0.1714387314107052,
'F1': 0.9357892623978286,
'Precision': 0.9642642597943112}}
y_test_pred = model.predict(data=X_test)
from sklearn.metrics import balanced_accuracy_score, recall_score, precision_score, f1_score
print('Balanced accuracy: {:.2f}'.format(balanced_accuracy_score(y_test, y_test_pred)))
print('Precision: {:.2f}'.format(precision_score(y_test, y_test_pred)))
print('Recall: {:.2f}'.format(recall_score(y_test, y_test_pred)))
print('F1: {:.2f}'.format(f1_score(y_test, y_test_pred)))
Balanced accuracy: 0.94
Precision: 0.29
Recall: 0.91
F1: 0.44
I expected to get the same precision that CatBoost shows during training, but I don't. What am I doing wrong?
By default, use_weights is set to True, which means the evaluation metrics are computed with the sample weights applied (here, the class weight from scale_pos_weight), e.g. Precision:use_weights=True.
To make CatBoost's precision match your own calculation, change the metric to Precision:use_weights=False.
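The gap between the two precision values can be reproduced with a little arithmetic. The sketch below assumes that, with use_weights=True, each sample counts with the weight of its true class, so positives are multiplied by scale_pos_weight. The counts tp and fp are hypothetical, picked only so that the unweighted precision lands near the 0.29 reported by sklearn.

```python
def precision(tp, fp, positive_weight=1.0, negative_weight=1.0):
    """Precision where every sample counts with the weight of its TRUE class."""
    weighted_tp = positive_weight * tp  # true positives are actual positives
    weighted_fp = negative_weight * fp  # false positives are actual negatives
    return weighted_tp / (weighted_tp + weighted_fp)


# Hypothetical counts, not taken from the question's data
tp, fp = 91, 223
scale_pos_weight = 56.88657244809081  # value from the question's params

unweighted = precision(tp, fp)
weighted = precision(tp, fp, positive_weight=scale_pos_weight)
print(f"unweighted precision: {unweighted:.2f}")
print(f"weighted precision:   {weighted:.2f}")
```

With these made-up counts the weighted figure comes out in the mid 0.9s while the unweighted one stays near 0.29, which is the same kind of gap the question shows between CatBoost's report and sklearn.metrics.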
Also, get_best_score gives the highest score over the iterations, so you need to specify which iteration is used for prediction. You can set use_best_model=True to choose the iteration automatically.
The predict function uses a standard threshold of 0.5 to convert the predicted probabilities into a binary value. When you are dealing with an imbalanced problem, a threshold of 0.5 is not always the best value, which is why you are getting poor precision on the test set.
To find a better threshold, catboost has some utility functions (in catboost.utils) that help you do so, such as get_roc_curve, get_fpr_curve and get_fnr_curve. These three functions help you visualize the true positive, false positive and false negative rates as the prediction threshold changes.
Besides these visualization helpers, catboost has a function called select_threshold, which returns a threshold chosen from one of these curves.
You can check this on their documentation.
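The effect of moving the threshold can be sketched without CatBoost at all. The example below is plain Python. The labels and probabilities are hypothetical stand-ins for y_test and for the positive-class column of model.predict_proba(X_test), and precision_recall_at is an illustrative helper, not a library function.

```python
def precision_recall_at(threshold, y_true, y_prob):
    """Binarize probabilities at `threshold` and return (precision, recall)."""
    tp = fp = fn = 0
    for label, prob in zip(y_true, y_prob):
        predicted = 1 if prob >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 1:
            fn += 1
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall


# Hypothetical test labels and predicted positive-class probabilities
y_true = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
y_prob = [0.05, 0.10, 0.30, 0.55, 0.60, 0.65, 0.20, 0.58, 0.80, 0.95]

results = {t: precision_recall_at(t, y_true, y_prob) for t in (0.5, 0.7, 0.9)}
for threshold, (p, r) in results.items():
    print(f"threshold={threshold}: precision={p:.2f} recall={r:.2f}")
```

Raising the threshold trades recall for precision, so the right value depends on which of the two errors is more expensive for the problem at hand.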
In addition to setting use_best_model=True, ensure that the class balance in both datasets is the same, or use balanced accuracy metrics to account for the different class balance.
If you've done both of these, and you still see much worse accuracy metrics on a test set versus the train set, it is a sign of overfitting. I'd recommend you take advantage of the CatBoost's overfitting detector. The most common first method is to set early_stopping_rounds to an integer like 10, which will stop training once an improvement in the selected loss function isn't achieved after that number of training rounds (see early_stopping_rounds documentation).
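The stopping rule described above can be sketched as a generic patience loop. This is not CatBoost's internal implementation, only an illustration of the idea. The loss values are hypothetical, and the parameter patience plays the role of early_stopping_rounds.

```python
def early_stopping_iteration(losses, patience):
    """Return (best_iteration, stop_iteration) for a sequence of validation losses.

    Training stops once `patience` consecutive iterations fail to improve
    on the best loss seen so far.
    """
    best_loss = float("inf")
    best_iteration = -1
    for iteration, loss in enumerate(losses):
        if loss < best_loss:
            best_loss = loss
            best_iteration = iteration
        elif iteration - best_iteration >= patience:
            return best_iteration, iteration
    # Never triggered: training ran through every iteration
    return best_iteration, len(losses) - 1


# Hypothetical validation losses per iteration
losses = [0.60, 0.45, 0.38, 0.35, 0.36, 0.37, 0.36, 0.39, 0.40]
best, stopped = early_stopping_iteration(losses, patience=3)
print(f"best iteration: {best}, stopped at iteration: {stopped}")
```

A larger patience tolerates noisier validation curves at the cost of a few wasted rounds, which is why a value around 10 is a common starting point.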
I want to understand how the max_samples value for a Bagging classifier affects the number of samples being used for each of the base estimators.
This is the GridSearch output:
GridSearchCV(cv=5, error_score='raise',
estimator=BaggingClassifier(base_estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
presort=False, random_state=1, spl... n_estimators=100, n_jobs=-1, oob_score=False,
random_state=1, verbose=2, warm_start=False),
fit_params={}, iid=True, n_jobs=-1,
param_grid={'max_features': [0.6, 0.8, 1.0], 'max_samples': [0.6, 0.8, 1.0]},
pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=2)
Here I am finding out what the best params were:
print gs5.best_score_, gs5.best_params_
0.828282828283 {'max_features': 0.6, 'max_samples': 1.0}
Now I am picking out the best grid search estimator and trying to see the number of samples that specific Bagging classifier used in its set of 100 base decision tree estimators.
val = []
for i in np.arange(100):
    # count the distinct training samples used by the i-th base estimator
    x = np.bincount(gs5.best_estimator_.estimators_samples_[i])[1]
    val.append(x)
print np.max(val)
print np.mean(val), np.std(val)
563.92 10.3399032877
Now, the size of the training set is 891. Since CV is 5, 891 * 0.8 = 712.8 samples should go into each Bagging classifier evaluation, and since max_samples is 1.0, 891 * 0.8 * 1.0 = 712.8 should be the number of samples for each base estimator, or something close to it?
So, why is the number in the range 564 +/- 10, and maximum value 587, when as per calculation, it should be close to 712 ? Thanks.
After doing more research, I think I've figured out what's going on. GridSearchCV uses cross-validation on the training data to determine the best parameters, but the estimator it returns is fit on the entire training set, not one of the CV-folds. This makes sense because more training data is usually better.
So, the BaggingClassifier you get back from GridSearchCV is fit to the full dataset of 891 data samples. It's true, then, that with max_samples=1.0 each base estimator will randomly draw 891 samples from the training set. However, by default samples are drawn with replacement, so the number of unique samples will be less than the total number of samples because of duplicates. If you want to draw without replacement, set the bootstrap keyword of BaggingClassifier to False.
Now, exactly how close should we expect the number of distinct samples to be to the size of the dataset when drawing with replacement?
Based off this question, the expected number of distinct samples when drawing n samples with replacement from a set of n samples is n * (1 - ((n-1)/n)^n).
When we plug 891 into this, we get
>>> 891 * (1.- (890./891)**891)
The expected number of samples (563.4) is very close to your observed mean (563.92), so it appears that nothing abnormal is going on.
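The formula can also be checked empirically. The sketch below evaluates the closed form and compares it with a small simulation of bootstrap draws in plain Python. The function names, the number of trials and the seed are arbitrary choices for illustration.

```python
import random


def expected_distinct(n):
    """Expected number of distinct items when drawing n times with replacement from n items."""
    return n * (1 - ((n - 1) / n) ** n)


def simulate_distinct(n, trials, seed=0):
    """Average number of distinct indices over repeated bootstrap draws of size n."""
    rng = random.Random(seed)
    counts = [len({rng.randrange(n) for _ in range(n)}) for _ in range(trials)]
    return sum(counts) / trials


n = 891  # size of the training set in the question
expected = expected_distinct(n)
simulated = simulate_distinct(n, trials=200)
print(f"closed form: {expected:.1f}")
print(f"simulated:   {simulated:.1f}")
```

For large n the ratio approaches 1 - 1/e, about 63.2%, which is why a bootstrap sample of the full training set typically contains a bit under two thirds of the distinct rows.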
I'm using an example extracted from the book "Mastering Machine Learning with scikit learn".
It uses a decision tree to predict whether each of the images on a web page is an advertisement or article content. Images that are classified as being advertisements could then be hidden using Cascading Style Sheets. The data is publicly available from the Internet Advertisements Data Set, which contains data for 3,279 images.
The following is the complete code for completing the classification task:
import sys, random
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
# sklearn.cross_validation and sklearn.grid_search were merged into model_selection
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline
def main(argv):
    df = pd.read_csv('ad-dataset/', header=None)
    explanatory_variable_columns = set(df.columns.values)
    response_variable_column = df[len(df.columns.values)-1]
    # The last column holds the target, so exclude it from the features
    explanatory_variable_columns.remove(len(df.columns.values)-1)
    y = [1 if e == 'ad.' else 0 for e in response_variable_column]
    X = df[list(explanatory_variable_columns)]
    # Missing values are marked with '?'; replace them with -1
    X = X.replace(to_replace=r' *\?', value=-1, regex=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=100000)
    pipeline = Pipeline([('clf', DecisionTreeClassifier(criterion='entropy', random_state=20000))])
    parameters = {
        'clf__max_depth': (150, 155, 160),
        'clf__min_samples_split': (1, 2, 3),
        'clf__min_samples_leaf': (1, 2, 3)
    }
    grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1, scoring='f1')
    grid_search.fit(X_train, y_train)
    print('Best score: %0.3f' % grid_search.best_score_)
    print('Best parameters set:')
    best_parameters = grid_search.best_estimator_.get_params()
    for param_name in sorted(parameters.keys()):
        print('\t%s: %r' % (param_name, best_parameters[param_name]))
    predictions = grid_search.predict(X_test)
    print(classification_report(y_test, predictions))
if __name__ == '__main__':
    main(sys.argv[1:])
The RESULTS of using scoring='f1' in GridSearchCV as in the example is:
The RESULTS of using scoring=None (by default Accuracy measure) is the same as using F1 score:
If I'm not wrong optimizing the parameter search by different scoring functions should yield different results. The following case shows that different results are obtained when scoring='precision' is used.
The RESULTS of using scoring='precision' is DIFFERENT than the other two cases. The same would be true for 'recall', etc:
I agree with both answers by Fabian & Sebastian. The problem is probably the small param_grid. But I just wanted to clarify that the problem arose when I was working with a totally different dataset (not the one in the example here), highly imbalanced at 100:1 (which should affect the accuracy), and using Logistic Regression. In this case, too, 'F1' and accuracy gave the same result.
The param_grid that I used, in this case, was the following:
parameters = {"penalty": ("l1", "l2"),
"C": (0.001, 0.01, 0.1, 1, 10, 100),
"solver": ("newton-cg", "lbfgs", "liblinear")}
I guess that the parameter selection is also too small.
I think that the author didn't choose this example very well. I may be missing something here, but min_samples_split=1 doesn't make sense to me: Isn't it the same as setting min_samples_split=2 since you can't split 1 sample -- essentially, it's a waste of computational time.
From the documentation: min_samples_split: "The minimum number of samples required to split an internal node."
Btw. this is a very small grid and there is not much choice anyways, which may explain why accuracy and f1 give you the same parameter combinations and hence the same scoring tables.
As mentioned above, the dataset may be well balanced, which is why F1 and accuracy scores may prefer the same parameter combinations. So, looking further at your GridSearch results using (a) F1 score and (b) accuracy, I conclude that in both cases a depth of 150 works best. Since this is the lower boundary, it gives you a slight hint that lower "depth" values may work even better. However, I suspect that the tree doesn't even go that deep on this dataset (you can end up with "pure" leaves well before reaching the max depth).
So, let's repeat the experiment with a little bit more sensible values using the following parameter grid
parameters = {
'clf__max_depth': list(range(2, 30)),
'clf__min_samples_split': (2,),
'clf__min_samples_leaf': (1,)
}
The optimal "depth" for the best F1 score seems to be around 15.
Best score: 0.878
Best parameters set:
clf__max_depth: 15
clf__min_samples_leaf: 1
clf__min_samples_split: 2
             precision    recall  f1-score   support
          0       0.98      0.99      0.99       716
          1       0.92      0.89      0.91       104
avg / total       0.98      0.98      0.98       820
Next, let's try it using "accuracy" (or None) as our scoring metric:
Best score: 0.967
Best parameters set:
clf__max_depth: 6
clf__min_samples_leaf: 1
clf__min_samples_split: 2
             precision    recall  f1-score   support
          0       0.98      0.99      0.98       716
          1       0.93      0.85      0.88       104
avg / total       0.97      0.97      0.97       820
As you can see, you get different results now, and the "optimal" depth is different if you use "accuracy."
I don't agree that optimizing the parameter search with different scoring functions should necessarily yield different results. If your dataset is balanced (roughly the same number of samples in each class), I would expect model selection by accuracy and by F1 to yield very similar results.
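As a contrast to the balanced case, here is a small sketch of how the two metrics can rank models differently once the classes are skewed. The confusion-matrix counts for both models are hypothetical (1000 samples, 50 of them positive) and are not taken from any dataset in this thread.

```python
def accuracy(tn, fp, fn, tp):
    """Fraction of all samples that are classified correctly."""
    return (tn + tp) / (tn + fp + fn + tp)


def f1(tn, fp, fn, tp):
    """F1 of the positive class; true negatives do not enter the formula."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0


# Hypothetical models evaluated on the same 950 negatives and 50 positives
model_a = dict(tn=945, fp=5, fn=20, tp=30)   # cautious: few false alarms, misses positives
model_b = dict(tn=922, fp=28, fn=5, tp=45)   # eager: finds most positives, more false alarms

for name, counts in (("A", model_a), ("B", model_b)):
    print(f"model {name}: accuracy={accuracy(**counts):.3f} f1={f1(**counts):.3f}")
```

Accuracy prefers model A because the large negative class dominates the count of correct predictions, while F1 prefers model B because it ignores true negatives. On a balanced dataset that asymmetry largely disappears, which is consistent with both scorers picking the same parameters.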
Also, keep in mind that GridSearchCV optimizes over a discrete grid. Using a finer grid of parameters might yield the results that you are looking for.
On an unbalanced dataset use the "labels" parameter of the f1_score scorer to use only the f1 score of the class you are interested in. Or consider using "sample_weight".