Related
I am playing around with the XGBoostClassifier and tuning this with GridSearchCV. I first created the variable xgbc:
xgbc = xgb.XGBClassifier()
I did'nt use any parameters as I wanted to see the default model performance. This gave me accuracy_score = 85.65%, recall_score = 77.91% and roc_auc_score = 84.21%, using the following lines of code:
print("Accuracy: ", accuracy_score(y_test, xgbc.predict(X_test)))
print("Recall: ", recall_score(y_test, xgbc.predict(X_test)))
print("ROC_AUC: ", roc_auc_score(y_test, xgbc.predict(X_test)))
Next, I used GridSearchCV to try to tune the parameters, like this:
Setting up the parameter dictionary:
xgbc_params = {'max_depth': [5, 6, 7], #6
'learning_rate': [0.25, 0.300000012, 0.35], #0.300000012
'gamma':[0, 0.001, 0.1], #0
'reg_lambda': [0.8, 0.95, 1], #1
'scale_pos_weight': [0, 1, 2], #1
'n_estimators': [95, 100, 105]} #100
(The numbers after the # are the default values, which gave me the above scores.)
And now run the GridSearchCV like this:
xgbc_grid = GridSearchCV(xgbc, param_grid=xgbc_params, scoring = make_scorer(accuracy_score), cv = 10, n_jobs = -1)
Next, fit this to the training data:
xgbc_grid.fit(X_train, y_train, verbose = 1, early_stopping_rounds = 10, eval_metric = 'aucpr', eval_set = [(X_test, y_test)])
Finally, run the metrics again:
print("Best Reg estimators: ", xgbc_grid.best_params_)
print("Accuracy: ", accuracy_score(y_test, xgbc_grid.predict(X_test)))
print("Recall: ", recall_score(y_test, xgbc_grid.predict(X_test)))
print("ROC_AUC: ", roc_auc_score(y_test, xgbc_grid.predict(X_test)))
Now, the scores change: accuracy_score = 0.8340807174887892, recall_score = 0.7325581395348837 and roc_auc_score = 0.8420896282464777. Also, here is the best_params_ result:
Best Reg estimators: {'gamma': 0, 'learning_rate': 0.35, 'max_depth': 5, 'n_estimators': 95, 'reg_lambda': 0.8, 'scale_pos_weight': 1}
Here is my problem:
The parameter values that GridSearchCV returns through xgbc_grid.best_params_ are not the most optimal for accuracy, as the accuracy score decreases. Can you please help me figure out why this is happenning?
In the parameter dictionary above, I have provided the default values. If I set the parameters to only these single values, then I get the 85% accuracy, like, 'max_depth': [6]. However, as soon as I add other values, like 'max_depth': [5, 6, 7], then GridSearchCV gives the parameters that are not the highest on accuracy score. Full details below:
Base Reg estimators (acc = 85%): {'gamma': 0, 'learning_rate': 0.35, 'max_depth': 5, 'n_estimators': 95, 'reg_lambda': 0.8, 'scale_pos_weight': 1}
Best Reg estimators (acc = 83%): {'gamma': 0, 'learning_rate': 0.35, 'max_depth': 6, 'n_estimators': 100, 'reg_lambda': 1, 'scale_pos_weight': 1}
I want to do a grid search on OnevsRest Classifier and my model is SVC but it shows me the following error on using the grid search --how to resolve??
Code-
from sklearn.model_selection import GridSearchCV
# defining parameter range
param_grid = {'C': [0.1, 1, 10, 100, 1000],
'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
'kernel': ['rbf']}
svc_model_orc = OneVsRestClassifier(SVC())
grid = GridSearchCV(svc_model_orc, param_grid, refit = True, verbose = 3)
# fitting the model for grid search
grid.fit(X_train, y_train)
# svc_pred_train=grid.predict(X_train)
# svc_pred_test = grid.predict(X_valid)
# print(accuracy_score(y_train, svc_pred_train))
# print(f1_score(y_train, svc_pred_train, average='weighted'))
# print(accuracy_score(y_valid, svc_pred_test))
# print(f1_score(y_valid, svc_pred_test, average='weighted'))
Error-
ValueError: Invalid parameter C for estimator OneVsRestClassifier(estimator=SVC(C=1.0, cache_size=200, class_weight=None,
coef0=0.0, decision_function_shape='ovr',
degree=3, gamma='auto_deprecated',
kernel='rbf', max_iter=-1, probability=False,
random_state=None, shrinking=True, tol=0.001,
verbose=False),
n_jobs=None). Check the list of available parameters with `estimator.get_params().keys()`.
Since you're performing a GridSearch over nested estimators (even though you just have one, OneVsRestClassifier fits a classifier per class), you need to define the parameters with the syntax estimator__some_parameter.
In the case of having nested objects, such as in pipelines for instance, this is the syntax GridSerach expects to access the different model's parameters, i.e. <component>__<parameter> . In such case, you'd name each model and then set their parameters as SVC__some_parameter for example for a SVC parameter. But for this case, the classifier is under estimator, note that the actual model is accessed through the estimator attribute:
print(svc_model_orc.estimator)
SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)
So in this case, you should set the parameter grid as:
param_grid = {'estimator__C': [0.1, 1, 10, 100, 1000],
'estimator__gamma': [1, 0.1, 0.01, 0.001, 0.0001],
'estimator__kernel': ['rbf']}
I'm working on a machine learning model, I have a dataframe with the data
I normalize the data with a standard distribution
scaler = StandardScaler()
df = scaler.fit_transform(df)
I divide the datasets into target and characteristics
X_df = df[X_characteristics_list]
y_df = df[target]
I split into train and test then I train the model
X_train, X_test, y_train, y_test = train_test_split(X_df, y_df, test_size = 0.25)
forest = RandomForestRegressor()
forest.fit(X_train, y_train)
I predict the test to validate the effectiveness
y_test_pred = forest.predict(X2_test)
mse = mean_squared_error(y_test, y_test_pred)
But when is time to test in real life I need to leave the model ready to predict
If i Want to predict just one record
let say [100,20,34]
I can't because I need the record standardized, and transform it with StandardScaler does not work because it depends on standard deviation so I would need the original dataset
What's the best way to solve this problem.
See below:
>>> from sklearn.datasets import make_classification
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.preprocessing import StandardScaler
# Create our input and output matrices
>>> X, y = make_classification()
# Split train-test... "test" will be production/unobserved/"real-life" data
>>> X_train, X_test, y_train, y_test = train_test_split(X, y)
# What does X_train look like?
>>> X_train
array([[-0.08930702, -2.71113991, -0.93849926, ..., 0.21650905,
0.68952722, 0.61365789],
[-0.31143977, -1.87817904, 0.08287492, ..., -0.41332943,
-0.58967179, 1.7239411 ],
[-1.62287589, 1.10691318, -0.630556 , ..., -0.35060008,
1.11270562, 0.08106694],
...,
[-0.59797041, 0.90218081, 0.89983074, ..., -0.54374315,
1.18534841, -0.03397969],
[-1.2006559 , 1.01890955, -1.21617181, ..., 1.76263322,
1.38280423, -1.0192972 ],
[ 0.11883425, 1.42952643, -1.23647358, ..., 1.02509208,
-1.14308885, 0.72096531]])
# Let's scale it
>>> scaler = StandardScaler()
>>> X_train = scaler.fit_transform(X_train)
>>> X_train
array([[ 0.08867642, -1.97950269, -1.1214106 , ..., 0.22075623,
0.57844552, 0.46487917],
[-0.10736984, -1.34896243, 0.00808597, ..., -0.37670234,
-0.6045418 , 1.57819736],
[-1.26479555, 0.91071257, -0.78086855, ..., -0.3171979 ,
0.96979563, -0.06916763],
...,
[-0.36025134, 0.7557329 , 0.91152449, ..., -0.50041152,
1.03697478, -0.18452874],
[-0.89215959, 0.84409499, -1.42847749, ..., 1.68739437,
1.21957946, -1.17253964],
[ 0.27237431, 1.15492649, -1.4509284 , ..., 0.98777012,
-1.116335 , 0.57247992]])
# Fit the model
>>> model = LogisticRegression()
>>> model.fit(X_train, y_train)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, l1_ratio=None, max_iter=100,
multi_class='auto', n_jobs=None, penalty='l2',
random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
warm_start=False)
# Now let's use the already-fitted StandardScaler object to simply transform
# *not fit_transform* the test data
>>> X_test = scaler.transform(X_test)
>>> model.predict(X_test)
array([1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0,
0, 0, 0])
Note that using joblib or pickle you can save the scaler object and re-load it for scaling in "real-time" later on.
I'm using an MLPClassifier for classification of heart diseases. I used imblearn.SMOTE to balance the objects of each class. I was getting very good results (85% balanced acc.), but i was advised that i would not use SMOTE on test data, only for train data. After i made this changes, the performance of my classifier fell down too much (~35% balanced accuracy) and i don't know what can be wrong.
Here is a simple benchmark with training data balanced but test data unbalanced:
And this is the code:
def makeOverSamplesSMOTE(X,y):
from imblearn.over_sampling import SMOTE
sm = SMOTE(sampling_strategy='all')
X, y = sm.fit_sample(X, y)
return X,y
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=20)
## Normalize data
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.fit_transform(X_test)
## SMOTE only on training data
X_train, y_train = makeOverSamplesSMOTE(X_train, y_train)
clf = MLPClassifier(hidden_layer_sizes=(20),verbose=10,
learning_rate_init=0.5, max_iter=2000,
activation='logistic', solver='sgd', shuffle=True, random_state=30)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
I'd like to know what i'm doing wrong, since this seems to be the proper way of preparing data.
The first mistake in your code is when you are transforming data into standard format. You only need to fit StandardScaler once and that is on X_train. You shouldn't refit it on X_test. So the correct code will be:
def makeOverSamplesSMOTE(X,y):
from imblearn.over_sampling import SMOTE
sm = SMOTE(sampling_strategy='all')
X, y = sm.fit_sample(X, y)
return X,y
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=20)
## Normalize data
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
## SMOTE only on training data
X_train, y_train = makeOverSamplesSMOTE(X_train, y_train)
clf = MLPClassifier(hidden_layer_sizes=(20),verbose=10,
learning_rate_init=0.5, max_iter=2000,
activation='logistic', solver='sgd', shuffle=True, random_state=30)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
For the machine learning model, try reducing the learning rate. it is too high. the default learning rate in sklearn is 0.001. Try changing the activation function and the number of layers. Also not every ML model works on every dataset so you might need to look at your data and choose ML model accordingly.
Hope you have already got better result for your model.I tried by changing few parameter, and I getting accuracy of 65%, when I change it to 90:10 sample I got an accuracy of 70%.
But accuracy can mislead,so I calculated F1 score which give you better picture of prediction.
from sklearn.neural_network import MLPClassifier
clf = MLPClassifier(hidden_layer_sizes=(1,),verbose=False,
learning_rate_init=0.001,
max_iter=2000,
activation='logistic', solver='sgd', shuffle=True, random_state=50)
clf.fit(X_train_res, y_train_res)
y_pred = clf.predict(X_test)
from sklearn.metrics import accuracy_score, confusion_matrix ,classification_report
score=accuracy_score(y_test, y_pred, )
print(score)
cr=classification_report(y_test, clf.predict(X_test))
print(cr)
Accuracy = 0.65
classification report :
precision recall f1-score support
0 0.82 0.97 0.89 33
1 0.67 0.31 0.42 13
2 0.00 0.00 0.00 6
3 0.00 0.00 0.00 4
4 0.29 0.80 0.42 5
micro avg 0.66 0.66 0.66 61
macro avg 0.35 0.42 0.35 61
weighted avg 0.61 0.66 0.61 61
confusion_matrix:
array([[32, 0, 0, 0, 1],
[ 4, 4, 2, 0, 3],
[ 1, 1, 0, 0, 4],
[ 1, 1, 0, 0, 2],
[ 1, 0, 0, 0, 4]], dtype=int64)
I am new to the machine learning and TensorFlow. I am trying to train a simple model to recognize gender. I use small data-set of height, weight, and shoe size. However, I have encountered a problem with evaluating model's accuracy.
Here's the entire code:
import tflearn
import tensorflow as tf
import numpy as np
# [height, weight, shoe_size]
X = [[181, 80, 44], [177, 70, 43], [160, 60, 38], [154, 54, 37], [166, 65, 40],
[190, 90, 47], [175, 64, 39], [177, 70, 40], [159, 55, 37], [171, 75, 42],
[181, 85, 43], [170, 52, 39]]
# 0 - for female, 1 - for male
Y = [1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0]
data = np.column_stack((X, Y))
np.random.shuffle(data)
# Split into train and test set
X_train, Y_train = data[:8, :3], data[:8, 3:]
X_test, Y_test = data[8:, :3], data[8:, 3:]
# Build neural network
net = tflearn.input_data(shape=[None, 3])
net = tflearn.fully_connected(net, 32)
net = tflearn.fully_connected(net, 32)
net = tflearn.fully_connected(net, 1, activation='linear')
net = tflearn.regression(net, loss='mean_square')
# fix for tflearn with TensorFlow 12:
col = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES)
for x in col:
tf.add_to_collection(tf.GraphKeys.VARIABLES, x)
# Define model
model = tflearn.DNN(net)
# Start training (apply gradient descent algorithm)
model.fit(X_train, Y_train, n_epoch=100, show_metric=True)
score = model.evaluate(X_test, Y_test)
print('Training test score', score)
test_male = [176, 78, 42]
test_female = [170, 52, 38]
print('Test male: ', model.predict([test_male])[0])
print('Test female:', model.predict([test_female])[0])
Even though model's prediction is not very accurate
Test male: [0.7158362865447998]
Test female: [0.4076206684112549]
The model.evaluate(X_test, Y_test) always returns 1.0. How do I calculate real accuracy on the test data-set using TFLearn?
You want to do binary classification in this case. Your network is set to perform linear regression.
First, transform the labels (gender) to categorical features:
from tflearn.data_utils import to_categorical
Y_train = to_categorical(Y_train, nb_classes=2)
Y_test = to_categorical(Y_test, nb_classes=2)
The output layer of your network needs two output units for the two classes you want to predict. Also the activation needs to be softmax for classification. The tf.learn default loss is cross-entropy and the default metric is accuracy, so this is already correct.
# Build neural network
net = tflearn.input_data(shape=[None, 3])
net = tflearn.fully_connected(net, 32)
net = tflearn.fully_connected(net, 32)
net = tflearn.fully_connected(net, 2, activation='softmax')
net = tflearn.regression(net)
The output will now be a vector with the probability for each gender. For example:
[0.991, 0.009] #female
Bear in mind that you will hopelessly overfit the network with your tiny data set. This means that during training the accuracy will approach 1 while, the accuracy on your test set will be quite poor.