How to conduct dataset balancing whilst using Pipeline in Sklearn? [duplicate] - machine-learning

This question already has answers here: "Balance classes in cross validation" and "Process for oversampling data for imbalanced binary classification".
I have decided to use Sklearn's Pipeline class to ensure that my model is not prone to data leakage.
However, my multi-class classification dataset (3 classes) is extremely imbalanced, and hence I need to implement dataset balancing. Despite researching this, I cannot find an answer as to when and how the rebalancing step should be conducted. Should it be done before scaling or after? Should it be done before the train/test split or after?
For simplicity's sake, I will not be using SMOTE, but rather random minority upsampling. Any answer would be greatly appreciated.
My code is as follows:
#All necessary packages have already been imported
# Select the feature columns (note the double brackets to get a DataFrame)
x = df[['MACD', 'MFI', 'ROC', 'RSI', 'Ultimate Oscillator', 'Williams %R', 'Awesome Oscillator', 'KAMA',
        'Stochastic Oscillator', 'TSI', 'Volume Accumulator', 'ADI', 'CMF', 'EoM', 'FI', 'VPT', 'ADX', 'ADX Negative',
        'ADX Positive', 'EMA', 'CRA']]
y = df['Label']
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 0)
pipe = Pipeline([('sc', StandardScaler()),
                 ('svc', SVC(decision_function_shape='ovr'))])
# Parameters of a pipeline step must be prefixed with the step name ('svc__')
candidate_parameters = [{'svc__C': [0.0001, 0.001, 0.01, 0.1, 1, 2, 3],
                         'svc__gamma': [0.0001, 0.001, 0.01, 0.1, 1, 2, 3],
                         'svc__kernel': ['poly']}]
clf = GridSearchCV(estimator=pipe, param_grid=candidate_parameters, cv=5, n_jobs=-1)
clf.fit(X_train, y_train)

You need to do the rebalancing after the train/test split. In the real world you do not know what your test set will look like, so it is better to keep it in its original distribution. Rebalance only the training set so the model learns better, then evaluate on the untouched test set. (If you use a validation set, keep it in its original distribution as well.)
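A minimal sketch of that workflow, assuming the imbalanced-learn package is available: its Pipeline applies the sampler only when fitting, so the rebalancing happens on training data only, including inside each CV fold of the grid search.
# Sketch using imbalanced-learn (assumption: imblearn is installed).
# The sampler runs only at fit time, so each CV training fold is rebalanced
# while validation folds and the held-out test set keep their original distribution.
from imblearn.pipeline import Pipeline as ImbPipeline
from imblearn.over_sampling import RandomOverSampler
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, train_test_split

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.3,
                                                     random_state=0, stratify=y)  # stratify is optional

pipe = ImbPipeline([('sampler', RandomOverSampler(random_state=0)),  # random minority upsampling
                    ('sc', StandardScaler()),
                    ('svc', SVC(decision_function_shape='ovr'))])

candidate_parameters = [{'svc__C': [0.01, 0.1, 1],        # reduced grid for brevity
                         'svc__gamma': [0.01, 0.1, 1],
                         'svc__kernel': ['poly']}]

clf = GridSearchCV(estimator=pipe, param_grid=candidate_parameters, cv=5, n_jobs=-1)
clf.fit(X_train, y_train)           # resampling happens inside each training fold
print(clf.score(X_test, y_test))    # test set is left untouched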

Related

Where should the categorical encoding be done in a k-fold CV procedure?

I want to apply a cross-validation method to my machine learning models. In these models, I want feature selection and a grid search to be applied as well. Imagine that I want to estimate the performance of a K-Nearest-Neighbors classifier by applying a feature selection technique based on an F-score (ANOVA) that chooses the 10 most relevant features. The code would be as follows:
# 10-times 10-fold cross validation
n_repeats = 10
rkf = RepeatedKFold(n_splits=10, n_repeats=n_repeats, random_state=0)

# Data standardization
scaler = StandardScaler()

# Variable to contain error measures and counter for the splits
error_knn = []
split = 0

for train_index, test_index in rkf.split(X, y):
    # Print a dot for each train / test partition
    sys.stdout.write('.')
    sys.stdout.flush()

    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    # Standardize the data
    scaler.fit(X_train, y_train)
    X_train = scaler.transform(X_train)
    X_test = scaler.transform(X_test)

    ###- In order to select the best number of neighbors -###
    # Pipeline for training the classifier from previous notebooks
    pipeline = Pipeline([('knn', KNeighborsClassifier())])
    N_neighbors = [1, 3, 5, 7, 11, 15, 20, 25, 30]
    param_grid = {'knn__n_neighbors': N_neighbors}

    # Evaluate the performance in a 5-fold cross-validation
    skfold = RepeatedStratifiedKFold(n_splits=5, n_repeats=1, random_state=split)

    # n_jobs = -1 to use all processors
    gridcv = GridSearchCV(pipeline, cv=skfold, n_jobs=-1, param_grid=param_grid,
                          scoring=make_scorer(accuracy_score))
    result = gridcv.fit(X_train, y_train)

    ###- Results -###
    # Mean accuracy and standard deviation
    accuracies = gridcv.cv_results_['mean_test_score']
    std_accuracies = gridcv.cv_results_['std_test_score']

    # Best value for the number of neighbors
    # Define KNeighbors Classifier with that best value
    # Method fit(X,y) to fit each model according to training data
    best_Nneighbors = N_neighbors[np.argmax(accuracies)]
    knn = KNeighborsClassifier(n_neighbors=best_Nneighbors)
    knn.fit(X_train, y_train)

    # Error for the prediction
    error_knn.append(1.0 - np.mean(knn.predict(X_test) == y_test))
    split += 1
However, my columns are categorical (except the binary label) and I need to do categorical encoding. I cannot remove these columns because they are essential.
Where would you perform this encoding, and how would the problem of categories unseen in a given fold be solved?
Categorical encoding should be performed as the first step, precisely to avoid the problem you mentioned regarding unseen labels in each fold.
Additionally, your current implementation suffers from data leakage.
You're performing feature scaling on the full X_train dataset before performing your inner cross-validation.
This can be solved by including StandardScaler on the pipeline used for your GridSearchCV:
...
X_train, X_test = X[train_index], X[test_index]
y_train, y_test = y[train_index], y[test_index]
###- In order to select the best number of neighbors -###
# Pipeline for training the classifier from previous notebooks
pipeline = Pipeline(
    [('scaler', scaler), ('knn', KNeighborsClassifier())]
)
N_neighbors = [1, 3, 5, 7, 11, 15, 20, 25, 30]
param_grid = { 'knn__n_neighbors': N_neighbors }
...
Another couple of tips:
GridSearchCV has a best_estimator_ attribute that can be used to extract the estimator with the best set of hyperparameters found.
When using GridSearchCV with refit=True (the default), you can use the object directly to perform predictions, e.g. gridcv.predict(X_test).
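For instance, a minimal sketch of both tips, reusing the variable names from the loop above:
# Inside the outer loop, instead of refitting a new KNeighborsClassifier by hand:
best_knn = gridcv.best_estimator_        # pipeline refit on X_train with the best n_neighbors
y_pred = gridcv.predict(X_test)          # equivalent to best_knn.predict(X_test)
error_knn.append(1.0 - np.mean(y_pred == y_test))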
EDIT: Perhaps I was too general about when to perform categorical encoding. Your approach should depend on your problem/dataset.
If you know beforehand how many categorical features exist and you want to train your inner CV classifiers with this knowledge, you should perform categorical encoding as the first step.
If at training time you do not know how many categorical features you are going to see, or you want to train your CV classifiers without knowledge of the full range of categorical features, you should perform categorical encoding at each fold.
When using the former, your classifiers will all be trained on the same feature space; that is not guaranteed for the latter.
If using the latter, the above pipeline can be extended to incorporate categorical encoding:
pipeline = Pipeline(
    [
        ('enc', OneHotEncoder()),
        ('scaler', StandardScaler(with_mean=False)),
        ('knn', KNeighborsClassifier()),
    ],
)
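If categories unseen in a fold's training data are a concern, OneHotEncoder can be told to ignore them at transform time. A minimal sketch (whether silently ignoring unknown categories is acceptable depends on your problem):
pipeline = Pipeline(
    [
        # Unknown categories are encoded as all-zeros instead of raising an error
        # when a fold's validation data contains a category not seen during fit.
        ('enc', OneHotEncoder(handle_unknown='ignore')),
        # with_mean=False because OneHotEncoder outputs a sparse matrix
        ('scaler', StandardScaler(with_mean=False)),
        ('knn', KNeighborsClassifier()),
    ],
)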
I suggest you read the Encoding categorical features section of scikit-learn's User Guide carefully.

Why does Random Forest give better results than XGBoost?

I'm trying to compare accuracy results (on the Titanic dataset) between random forest and XGBoost, and I can't figure out why random forest gives better results.
XGBoost is a kind of optimized tree-based model. It builds an optimized tree every cycle (every new estimator).
Random forest builds many trees (with different data and different features) and selects the best tree.
I'm working on the Titanic dataset (after I handle NaNs and remove some noise). Both models get the same dataset.
For both algorithms I'm tuning hyperparameters.
XGBoost model:
model = XGBClassifier(n_jobs=-1, random_state=42)
hyperparams = {'max_depth': [2, 3, 4, 5, 6, 7, 8],
               'n_estimators': [20, 50, 100, 120],
               'learning_rate': [0.1, 0.2, 0.3, 0.4, 0.5]}
randomized = RandomizedSearchCV(model, hyperparams, n_iter=40, cv=5, random_state=42, scoring='accuracy')
randomized.fit(x, y)
best_params = randomized.best_estimator_
model = XGBClassifier(n_jobs=-1,
                      max_depth=best_params.max_depth,
                      n_estimators=best_params.n_estimators,
                      learning_rate=best_params.learning_rate)
model.fit(train_df_x, train_df_y)
y_pred = model.predict(test_df_x)
Random forest model:
model = RandomForestClassifier(random_state=42, n_jobs=-1)
hyperparams = {'n_estimators': [20, 50, 100, 120],
               'max_depth': [2, 3, 4, 5, 6, 7, 8]}
randomized = RandomizedSearchCV(model, hyperparams, n_iter=20, cv=5, random_state=42, scoring='accuracy')
randomized.fit(x, y)
best_params = randomized.best_estimator_
model = RandomForestClassifier(random_state=42,
                               n_jobs=-1,
                               n_estimators=best_params.n_estimators,
                               max_depth=best_params.max_depth)
model.fit(train_df_x, train_df_y)
y_pred = model.predict(test_df_x)
As you can see:
I'm using more hyperparameter search iterations for XGBoost (because it has more parameters to tune).
I'm getting the following accuracy results:
Random forest: 86.6
XGBoost: 85.41
Before running the test, I was sure that XGBoost would give me better results.
How can it be that random forest gives better results? What am I missing when using XGBoost?

How to train model with features selected by SelectKBest?

I am using SelectKBest() in Sklearn's Pipeline() class to reduce the number of features down from 30 to the 5 best features. When I fit the classifier, I get different test results, as expected with feature selection. However, I spotted what looks like an error in my code which doesn't cause an actual error at runtime.
When I call predict(), I realised that it was still being given all 30 features as input, as if feature selection wasn't occurring, even though I only trained the model on the 5 best features. Surely giving 30 features to an SVM to predict a class will crash if it was only trained on the 5 best features?
In my train_model(df) function, my code looks as follows:
def train_model(df):
    x, y = balance_dataset(df)
    X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)

    feature_selection = SelectKBest()
    pipe = Pipeline([('sc', preprocessing.MinMaxScaler()),
                     ('feature_selection', feature_selection),
                     ('SVM', svm.SVC(decision_function_shape='ovr', kernel='poly'))])

    candidate_parameters = [{'SVM__C': [0.01, 0.1, 1], 'SVM__gamma': [0.01, 0.1, 1], 'feature_selection__k': [5]}]
    clf = GridSearchCV(estimator=pipe, param_grid=candidate_parameters, cv=5, n_jobs=-1)
    clf.fit(X_train, y_train)
    return clf
However, this is what happens when I call trade():
def trade(df):
    clf = train_model(df)
    for index, row in trading_set.iterrows():
        features = row[:-3]  # features is now an array of 30 features, even though the model was only trained on 5
        if trade_balance > 0:
            trades[index] = trade_balance
            if clf.predict(features) == 1:  # So this should crash and give an input shape error, but it doesn't
                # Rest of code unnecessary
                ...
So my question is, how do I know that the model is really being trained on only the 5 best features?
Your code is correct, and there is no reason why it should throw an error. You are confusing the pipeline object with the model itself, which is only one block of the pipeline.
In your example, the pipeline takes 30 features, scales them, selects the 5 best, then trains an SVM on these 5 best features. So your SVM has been trained on the 5 best features, but you still need to pass all 30 features to your pipeline, because the pipeline expects data to arrive in the same format as during training.
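If you want to confirm this, a small sketch reusing the step names from train_model above (the attribute access assumes the default refit=True): the fitted SelectKBest step exposes get_support(), and each pipeline step can be inspected via named_steps.
best_pipe = clf.best_estimator_                      # the refit pipeline found by GridSearchCV
mask = best_pipe.named_steps['feature_selection'].get_support()
print(mask.sum())                                    # 5 -> only 5 features reach the SVM
print(mask)                                          # boolean mask over the original 30 columns
print(best_pipe.named_steps['SVM'].support_vectors_.shape[1])  # also 5: the SVM never saw 30 inputs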

Back to basics for the XOR problem - fundamentally confused

This is something that has been bothering me for a while about XOR and MLP; it may be basic (if so, apologies in advance), but I would like to know.
There are many approaches to solving XOR with MLP, but generally they look like this:
from sklearn.neural_network import MLPClassifier
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 0]
model = MLPClassifier(
    activation='relu', max_iter=1000, hidden_layer_sizes=(4, 2))
Now to fit the model:
model.fit(X, y)
And, guess what?
print('score:', model.score(X, y))
outputs a perfect
score: 1.0
But what is being predicted and scored? In the case of XOR we have a dataset which, by definition(!) has four rows, two features and one binary label. There is no standard X_train, y_train, X_test, y_test to work with. By definition, again, there is no unseen data for the model to digest.
The prediction takes place in the form of
model.predict(X)
which is exactly the same X that training was performed on.
So doesn't the model just spit back the y it was trained on? How do we know the model "learned" anything?
EDIT: Just to try to clarify what baffles me - the features have 2 and only 2 unique values; the 2 unique values have 4 and only 4 possible combinations. The right label for each possible combination is already present in the label column. So what is there for the model to "learn" when fit() is called? And how is this "learning" performed? How can the model ever be "wrong" when it has access to the "right" answer for each possible combination of inputs?
Again, sorry for what is probably a very basic question.
The key thing is that the XOR problem was proposed to demonstrate that some models can learn non-linear problems and some models can't.
So when a model gets 1.0 accuracy on the dataset you mentioned, that's notable, since it has learned a non-linear problem. The fact that it has learned the training data is enough for us to know that it can [potentially] learn non-linear problems. Notice that if this wasn't the case, your model would get a low accuracy (0.5 in the example below), since it divides the 2D space into two sub-spaces with a line.
To understand this better, let's see a case where a model can't learn the data under these same circumstances:
import tensorflow as tf
import numpy as np

# Same XOR inputs/labels as above
X = np.array(X)
y = np.array(y)

# Only one Dense layer: the decision boundary is essentially linear
model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Dense(2, activation='relu'))

model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.1),
              loss='sparse_categorical_crossentropy',  # integer labels
              metrics=['accuracy'])
model.fit(X, y, epochs=100)

_, acc = model.evaluate(X, y)
print('acc = ' + str(acc))
which gives:
acc = 0.5
As you can see, this model can't classify the data it has already seen. The reason is that this is non-linear data and our model can only learn a linear decision boundary (here is a link to understand the non-linearity of the XOR problem better). As soon as we add another layer to our network, it will be able to solve this problem:
import tensorflow as tf
import numpy as np

X = np.array(X)
y = np.array(y)

# Add a hidden layer before the output layer
model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Dense(1, activation='relu'))
model.add(tf.keras.layers.Dense(2, activation='relu'))

tb_callback = tf.keras.callbacks.TensorBoard(log_dir='./test/', write_graph=True)

model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.1),
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(X, y, epochs=5, callbacks=[tb_callback])

_, acc = model.evaluate(X, y)
print('acc = ' + str(acc))
which gives:
acc = 1.0
By adding only one neuron (in an extra layer), our model learned to do what it couldn't learn in 100 epochs with one layer (even though it had already seen the data).
So to sum up, it is correct that our dataset is so small that the network can easily memorize it, but the XOR problem is important because it shows there are networks that can't memorize this data no matter what.
Having said that, there are varieties of the XOR problem with proper train and test sets. Here is one (the plot is slightly different):
import numpy as np
import matplotlib.pyplot as plt

# XOR-like clusters: red points in quadrants II and IV, blue points in quadrants I and III
x1 = np.concatenate([np.random.uniform(0, 100, 100), np.random.uniform(-100, 0, 100)])
y1 = np.concatenate([np.random.uniform(-100, 0, 100), np.random.uniform(0, 100, 100)])
x2 = np.concatenate([np.random.uniform(0, 100, 100), np.random.uniform(-100, 0, 100)])
y2 = np.concatenate([np.random.uniform(0, 100, 100), np.random.uniform(-100, 0, 100)])

plt.scatter(x1, y1, c='red')
plt.scatter(x2, y2, c='blue')
plt.show()
hope that helped ;))

GridSearchCV best score drops when using the best parameters to build the model

I'm trying to find the best set of hyperparameters for my Logistic Regression estimator with GridSearchCV and build the model using a pipeline.
My problem is that when I use the best parameters I get through
grid_search.best_params_ to build the Logistic Regression model, the accuracy is different from the one reported by
grid_search.best_score_
Here is my code:
x=tweet["cleaned"]
y=tweet['tag']
X_train, X_test, Y_train, Y_test = model_selection.train_test_split(x, y, test_size=.20, random_state=42)
pipeline = Pipeline([
('vectorizer',TfidfVectorizer()),
('chi', SelectKBest()),
('classifier', LogisticRegression())])
grid = {
'vectorizer__ngram_range': [(1, 1), (1, 2),(1, 3)],
'vectorizer__stop_words': [None, 'english'],
'vectorizer__norm': ('l1', 'l2'),
'vectorizer__use_idf':(True, False),
'vectorizer__analyzer':('word', 'char', 'char_wb'),
'classifier__penalty': ['l1', 'l2'],
'classifier__C': [1.0, 0.8],
'classifier__class_weight': [None, 'balanced'],
'classifier__n_jobs': [-1],
'classifier__fit_intercept':(True, False),
}
grid_search = GridSearchCV(pipeline, param_grid=grid, scoring='accuracy', n_jobs=-1, cv=10)
grid_search.fit(X_train,Y_train)
and when I get the best score and params using
print(grid_search.best_score_)
print(grid_search.best_params_)
the result is
0.7165160230073953
{'classifier__C': 1.0, 'classifier__class_weight': None, 'classifier__fit_intercept': True, 'classifier__n_jobs': -1, 'classifier__penalty': 'l1', 'vectorizer__analyzer': 'word', 'vectorizer__ngram_range': (1, 1), 'vectorizer__norm': 'l2', 'vectorizer__stop_words': None, 'vectorizer__use_idf': False}
Now if I use these parameters to build my model
pipeline = Pipeline([
    ('vectorizer', TfidfVectorizer(ngram_range=(1, 1), stop_words=None, norm='l2', use_idf=False, analyzer='word')),
    ('chi', SelectKBest(chi2, k=1000)),
    ('classifier', LogisticRegression(C=1.0, class_weight=None, fit_intercept=True, n_jobs=-1, penalty='l1'))])

model = pipeline.fit(X_train, Y_train)
print(accuracy_score(Y_test, model.predict(X_test)))
the result drops to 0.68.
Also, it is tedious work, so how can I pass the best parameters to the model directly? I could not figure out how to do it like in this answer, since my approach is slightly different from his.
The reason your score is lower in the second option is that you are evaluating your pipeline model on the test set, whereas you are evaluating your grid search model using cross-validation (in your case, 10-fold stratified cross-validation). The cross-validation score is the average over 10 models, each fitted on 9/10 of your training data and evaluated on the remaining 1/10. Hence, you cannot expect the same score from both evaluations.
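If you want a number that is comparable to grid_search.best_score_, a minimal sketch is to cross-validate the rebuilt pipeline on the training data with the same number of folds, instead of scoring it once on the test set:
from sklearn.model_selection import cross_val_score

# 10-fold CV on the training data, mirroring what GridSearchCV did internally,
# so the mean is comparable to grid_search.best_score_ (up to fold randomness).
cv_scores = cross_val_score(pipeline, X_train, Y_train, cv=10, scoring='accuracy')
print(cv_scores.mean())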
As for your second question: why can't you just use grid_search.best_estimator_? This takes the best model from your grid search and you can evaluate it without rebuilding it from scratch. For instance:
best_model = grid_search.best_estimator_
best_model.score(X_test, Y_test)
I put both LogisticRegression and MLPClassifier in a pipeline, switching between the two classifiers. I used GridSearchCV to find the best parameters for each. I adjusted the parameters and then selected the most accurate classifier for the data. Originally the MLPClassifier was more accurate, but after adjusting the C value, the logistic regression became more accurate.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    # ('pca', PCA()),
    ('clf', LogisticRegression(C=5, max_iter=10000, tol=0.1)),
    # ('clf', MLPClassifier(hidden_layer_sizes=(25, 150, 25), max_iter=800, solver='lbfgs', activation='relu', alpha=0.7,
    #                       learning_rate_init=0.001, verbose=False, momentum=0.9, random_state=42))
])
pipeline.fit(X_train, y_train)

parameter_grid = {'C': np.linspace(5, 100, 5)}
grid_rf_class = GridSearchCV(
    estimator=pipeline['clf'],
    param_grid=parameter_grid,
    scoring='roc_auc',
    n_jobs=2,
    cv=5,
    refit=True,
    return_train_score=True)
grid_rf_class.fit(X_train, y_train)

predictions = grid_rf_class.predict(X_test)
print(accuracy_score(y_test, predictions))
print(grid_rf_class.best_params_)
print(grid_rf_class.best_score_)
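One caveat with the snippet above: passing pipeline['clf'] to GridSearchCV tunes the classifier on unscaled data, because the scaler step is bypassed during the search. A minimal sketch of an alternative (a judgment call, not the original poster's code) is to search over the whole pipeline using step-prefixed parameter names:
# Search over the full pipeline so scaling is refit inside every CV fold
parameter_grid = {'clf__C': np.linspace(5, 100, 5)}
grid_rf_class = GridSearchCV(
    estimator=pipeline,            # the whole Pipeline, not just pipeline['clf']
    param_grid=parameter_grid,
    scoring='roc_auc',
    n_jobs=2,
    cv=5,
    refit=True,
    return_train_score=True)
grid_rf_class.fit(X_train, y_train)
print(grid_rf_class.best_params_)  # e.g. {'clf__C': ...}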
