OneHotEncoder doesn't remove categorical in pipeline [duplicate] - machine-learning

This question already has an answer here:
Apply multiple preprocessing steps to a column in sklearn pipeline
(1 answer)
Closed 1 year ago.
I have a lab assignment on preprocessing data, and I am trying to use ColumnTransformer with the pipeline syntax. My code is below.
preprocess = ColumnTransformer(
    [('imp_mean', SimpleImputer(strategy='mean'), numerics_cols),
     ('imp_mode', SimpleImputer(strategy='most_frequent'), categorical_cols),
     ('onehot', OneHotEncoder(handle_unknown='ignore'), categorical_cols),
     #('stander', StandardScaler(), fewer_cols_train_X_df.columns)
    ])
After running this code and calling the pipeline, the result is:
['female', 1.0, 0.0, 0.0],
['male', 0.0, 1.0, 0.0],
['female', 1.0, 0.0, 0.0],
['male', 0.0, 1.0, 0.0],
['male', 0.0, 1.0, 0.0],
['male', 0.0, 1.0, 0.0],
['male', 0.0, 1.0, 0.0],
['female', 1.0, 0.0, 0.0],
['male', 0.0, 1.0, 0.0],
['male', 0.0, 1.0, 0.0],
['male', 0.0, 1.0, 0.0],
['male', 0.0, 1.0, 0.0],
['male', 0.0, 1.0, 0.0],
['female', 1.0, 0.0, 0.0],
['female', 1.0, 0.0, 0.0],
['male', 0.0, 1.0, 0.0],
As you can see, the categorical column is still in the result. I tried to drop it, but it is still there.
I just want to remove the categorical column from this result so that I can run StandardScaler. I don't understand why it doesn't work.
Thank you for reading.

A ColumnTransformer cannot apply sequential transformations to the same columns. Each transformer in the list is applied independently to the original input columns, and the outputs are concatenated side by side.
Therefore, in your example, the imputed categorical column (still a string) appears in the result next to the one-hot columns, which were computed separately from the raw column rather than from the imputed one.
To apply both steps to the same columns (imputing, then one-hot encoding), put them in a Pipeline so that they run sequentially.
The example below illustrates how to handle different processing for numerical and categorical features.
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.pipeline import Pipeline
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
X = pd.DataFrame({'gender': ['male', 'male', 'female'],
                  'A': [1, 10, 20],
                  'B': [1, 150, 20]})
# Categorical columns: impute first, then one-hot encode the imputed values
categorical_preprocessing = Pipeline(
    [
        ('imp_mode', SimpleImputer(strategy='most_frequent')),
        ('onehot', OneHotEncoder(handle_unknown='ignore')),
    ])
# Numerical columns: impute first, then standardize
numerical_preprocessing = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler()),
])
# Each pipeline is applied to its own group of columns
preprocessing = ColumnTransformer(
    [
        ('categorical', categorical_preprocessing,
         make_column_selector(dtype_include=object)),
        ('numerical', numerical_preprocessing,
         make_column_selector(dtype_include=np.number)),
    ])
preprocessing.fit_transform(X)
Output:
array([[ 0. , 1. , -1.20270298, -0.84570663],
[ 0. , 1. , -0.04295368, 1.40447708],
[ 1. , 0. , 1.24565666, -0.55877045]])
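To see why the layout in the question leaves the string column in the result, here is a minimal sketch that runs two transformers on the same column of a toy frame. The data and the name `parallel` are made up for illustration, and `sparse_threshold=0` is only there to force a dense array that is easy to inspect.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Toy data, assumed for illustration only.
X = pd.DataFrame({"gender": ["male", "male", "female"]})

# Both transformers receive the ORIGINAL 'gender' column; their outputs
# are concatenated side by side, not chained one after the other.
parallel = ColumnTransformer(
    [
        ("imp_mode", SimpleImputer(strategy="most_frequent"), ["gender"]),
        ("onehot", OneHotEncoder(handle_unknown="ignore"), ["gender"]),
    ],
    sparse_threshold=0,  # always return a dense array
)
out = parallel.fit_transform(X)
print(out)
```

The first column of `out` is the imputer's output (still a string) and the remaining columns come from the encoder, which is the same pattern as the rows shown in the question: a string followed by indicator columns.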

Related

Early stopping in XGBoost classification and RandomizedSearchCV

When I try to combine RandomizedSearchCV with early stopping to reduce overfitting, I get this error:
py:372: FitFailedWarning:
300 fits failed out of a total of 300.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.
The code I am trying is like this:
params_dist = {'min_child_weight': [0.1, 1, 5, 10, 50],
               'colsample_bytree': np.arange(0.5, 1.0, 0.1),
               'gamma': [0.5, 1, 1.5, 2, 5],
               'subsample': np.arange(0.5, 1.0, 0.1),
               'max_depth': range(3, 21, 3),
               'learning_rate': [0.0001, 0.001, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 1],
               'n_estimators': [50, 100, 250, 500, 750, 1000],
               'reg_alpha': [0.0001, 0.001, 0.1, 1],
               'reg_lambda': [0.0001, 0.001, 0.1, 1]}
model_with_earlyStopping = xgb.XGBClassifier(objective='binary:logistic',
                                             eval_metric="error",
                                             early_stopping_rounds=13,
                                             seed=42)
random_search = model_selection.RandomizedSearchCV(model_with_earlyStopping,
                                                   param_distributions=params_dist,
                                                   scoring='roc_auc',
                                                   n_jobs=-1,
                                                   verbose=0,
                                                   cv=3,
                                                   random_state=1001,
                                                   n_iter=100)
The code worked fine without using early stopping. However, I am looking for a way to combine these 2 methods together.
Can anyone help me fix it?

RandomizedSearchCV with xgboost classifier is taking too long

I am trying to run this code on my dataset. However, it is taking too long (the code has been running since yesterday, and my dataset is not very large).
# A parameter grid for XGBoost
params = {
    'min_child_weight': [1, 5, 10],
    'gamma': [0.5, 1, 1.5, 2, 5],
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0],
    'max_depth': [3, 4, 5]
}
xgb_classifier = XGBClassifier(learning_rate=0.02, n_estimators=600,
                               objective='binary:logistic', silent=True, nthread=1)
folds = 3
param_comb = 5
skf = StratifiedKFold(n_splits=folds, shuffle=True, random_state=1001)
random_search = RandomizedSearchCV(xgb_classifier, param_distributions=params,
                                   n_iter=param_comb, scoring='roc_auc', n_jobs=4,
                                   cv=skf.split(trainx, trainy),
                                   verbose=3, random_state=1001)
random_search.fit(trainx, trainy)
I am new to the field. Any ideas?
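As a rough way to size the job, the number of model fits can be counted from the settings in the snippet. The sketch below uses only values that appear in the code above, plus the fact that RandomizedSearchCV refits the best candidate once by default (`refit=True`); the variable names are illustrative. It says nothing about how long a single fit takes on this particular dataset.

```python
# Values taken from the snippet above.
n_iter = 5          # param_comb
n_folds = 3         # folds
n_estimators = 600  # boosting rounds per XGBClassifier fit

cv_fits = n_iter * n_folds               # one fit per sampled candidate per fold
total_fits = cv_fits + 1                 # plus the final refit on all training data
total_trees = total_fits * n_estimators  # boosting rounds built across all fits

print(cv_fits, total_fits, total_trees)
```

With `nthread=1`, each of those fits builds its trees on a single thread, while `n_jobs=4` lets up to four fits run at the same time.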

Guidance needed - GridSearchCV returns parameters that decrease the accuracy of the XGBoost model

I am playing around with XGBClassifier and tuning it with GridSearchCV. I first created the variable xgbc:
xgbc = xgb.XGBClassifier()
I didn't set any parameters because I wanted to see the default model's performance. This gave me accuracy_score = 85.65%, recall_score = 77.91% and roc_auc_score = 84.21%, using the following lines of code:
print("Accuracy: ", accuracy_score(y_test, xgbc.predict(X_test)))
print("Recall: ", recall_score(y_test, xgbc.predict(X_test)))
print("ROC_AUC: ", roc_auc_score(y_test, xgbc.predict(X_test)))
Next, I used GridSearchCV to try to tune the parameters, like this:
Setting up the parameter dictionary:
xgbc_params = {'max_depth': [5, 6, 7], #6
               'learning_rate': [0.25, 0.300000012, 0.35], #0.300000012
               'gamma': [0, 0.001, 0.1], #0
               'reg_lambda': [0.8, 0.95, 1], #1
               'scale_pos_weight': [0, 1, 2], #1
               'n_estimators': [95, 100, 105]} #100
(The numbers after the # are the default values, which gave me the above scores.)
And now run the GridSearchCV like this:
xgbc_grid = GridSearchCV(xgbc, param_grid=xgbc_params, scoring = make_scorer(accuracy_score), cv = 10, n_jobs = -1)
Next, fit this to the training data:
xgbc_grid.fit(X_train, y_train, verbose = 1, early_stopping_rounds = 10, eval_metric = 'aucpr', eval_set = [(X_test, y_test)])
Finally, run the metrics again:
print("Best Reg estimators: ", xgbc_grid.best_params_)
print("Accuracy: ", accuracy_score(y_test, xgbc_grid.predict(X_test)))
print("Recall: ", recall_score(y_test, xgbc_grid.predict(X_test)))
print("ROC_AUC: ", roc_auc_score(y_test, xgbc_grid.predict(X_test)))
Now, the scores change: accuracy_score = 0.8340807174887892, recall_score = 0.7325581395348837 and roc_auc_score = 0.8420896282464777. Also, here is the best_params_ result:
Best Reg estimators: {'gamma': 0, 'learning_rate': 0.35, 'max_depth': 5, 'n_estimators': 95, 'reg_lambda': 0.8, 'scale_pos_weight': 1}
Here is my problem:
The parameter values that GridSearchCV returns through xgbc_grid.best_params_ are not optimal for accuracy, since the accuracy score decreases. Can you please help me figure out why this is happening?
In the parameter dictionary above, I have noted the default values. If I restrict each parameter to its single default value, for example 'max_depth': [6], then I get the 85% accuracy. However, as soon as I add other values, for example 'max_depth': [5, 6, 7], GridSearchCV returns parameters that do not give the highest accuracy score. Full details below:
Base Reg estimators (acc = 85%): {'gamma': 0, 'learning_rate': 0.35, 'max_depth': 5, 'n_estimators': 95, 'reg_lambda': 0.8, 'scale_pos_weight': 1}
Best Reg estimators (acc = 83%): {'gamma': 0, 'learning_rate': 0.35, 'max_depth': 6, 'n_estimators': 100, 'reg_lambda': 1, 'scale_pos_weight': 1}
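One thing worth checking here, offered as a sketch and not as a diagnosis: GridSearchCV selects the combination with the highest mean cross-validated score on the training data, and that combination is not guaranteed to score highest on a separate test set. The example below uses a synthetic dataset and a DecisionTreeClassifier as a stand-in for XGBClassifier, so every name and number in it is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, assumed for illustration only.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

grid = GridSearchCV(
    DecisionTreeClassifier(random_state=0),  # stand-in for XGBClassifier
    param_grid={"max_depth": [2, 4, 6]},
    scoring="accuracy",
    cv=5,
)
grid.fit(X_train, y_train)

# The selection criterion: mean accuracy over the CV folds of the training set.
print("best params:", grid.best_params_)
print("mean CV accuracy of best params:", grid.best_score_)

# A different quantity: accuracy of the refitted model on the held-out test set.
test_acc = accuracy_score(y_test, grid.predict(X_test))
print("test accuracy:", test_acc)
```

Comparing `grid.best_score_` and the full `grid.cv_results_` table against the test-set numbers shows which quantity the search actually optimized.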

How can we feed data for multiple stocks to a model and get multiple outputs?

Usually we make predictions on a single dataset, for example USD/GBP, which consists of the columns HIGH, LOW, OPEN and CLOSE. My question is how I can make predictions on multiple datasets such as USD/GBP, EUR/USD, XAU/USD and USD/JPY: feeding all of this data to a single model and getting predictions for every dataset, i.e. multiple outputs. Is this possible? Thank you.
If I have understood your question correctly, you want to feed data into a single model object that contains two models and runs one of them (model1 or model2) depending on the data it receives. For this you can create a model class with a minimal interface that selects the underlying model based on the value of a feature.
As an illustration:
import pickle as pkl
import pandas as pd
from sklearn.linear_model import LinearRegression
# Combined data of both df1 and df2
data = {
    'HIGH': [1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0],
    'LOW': [0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 3.0, 3.0, 3.0],
    'OPEN': [1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0],
    'END': [3.0, 1.0, 3.0, 4.0, 2.0, 3.0, 1.0, 2.0, 6.0, 1.0, 2.0, 1.0],
    'target_price': [0.0, 2.0, 1.5, 0.0, 5.1, 4.0, 0.0, 1.0, 2.0, 0.0, 2.1, 1.5]
}
df = pd.DataFrame(data)
X, y = df.iloc[:, :-1], df.iloc[:, -1]
X = X.astype('float32')
# create two models
model1 = LinearRegression()
model2 = LinearRegression()
# Boolean mask that routes each row to one of the two models.
# The original referred to a column 'x' that is not in this frame;
# 'HIGH' is assumed here purely for illustration.
ser_model1 = X['HIGH'] == 0.0
model1.fit(X[ser_model1], y[ser_model1])
model2.fit(X[~ser_model1], y[~ser_model1])
# define a class that mocks the model interface
class CombinedModel:
    def __init__(self, model1, model2):
        self.model1 = model1
        self.model2 = model2

    def predict(self, X, **kwargs):
        # same routing rule as at training time
        ser_model1 = X['HIGH'] == 0.0
        return pd.concat([
            pd.Series(self.model1.predict(X[ser_model1]), index=X.index[ser_model1]),
            pd.Series(self.model2.predict(X[~ser_model1]), index=X.index[~ser_model1])
        ]).sort_index()
# create a combined model from the two trained sub-models
# and pickle it
model = CombinedModel(model1, model2)
model.predict(X)
with open('model.pkl', 'wb') as fp:
    pkl.dump(model, fp)
model = model1 = model2 = None
# test load it
with open('model.pkl', 'rb') as fp:
    model = pkl.load(fp)
model.predict(X)
Alternatively, if you want an ensemble, use VotingClassifier.
The idea behind the voting classifier is to combine conceptually different machine learning classifiers and use a majority vote (hard vote) or the average predicted probabilities (soft vote) to predict the class labels.
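A minimal sketch of the voting idea, assuming a synthetic binary classification dataset and three arbitrarily chosen base classifiers (none of these choices come from the question):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

# Synthetic data, assumed for illustration only.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("nb", GaussianNB()),
    ],
    voting="soft",  # average predicted probabilities; "hard" takes a majority vote
)
ensemble.fit(X, y)

pred = ensemble.predict(X[:5])         # class labels chosen by the ensemble
proba = ensemble.predict_proba(X[:5])  # averaged class probabilities
print(pred, proba.shape)
```

VotingClassifier applies to class labels. For a continuous target such as a price, scikit-learn's VotingRegressor averages the predictions of several regressors instead.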

If I add a new feature, will the xgboost parameters have to change?

Suppose I got the best accuracy on some tabular data with the parameters below.
If I add a new feature, which parameters should I change to get the new result?
Or at least, which feature?
xgb = XGBClassifier(
    bagging_fraction= 0.8,
    boosting= 'gbdt',
    colsample_bytree= 0.7,
    feature_fraction= 0.9,
    learning_rate= 0.05,
    max_bin= 32,
    max_depth= 10,
    min_child_weight= 11,
    missing= -999,
    n_estimators= 400,
    nthread= 4,
    num_leaves= 80,
    objective= 'multiclass',
    predictor= 'gpu_predictor',
    seed= 1337,
    silent= 1,
    subsample= 0.8,
    tree_method= 'gpu_hist',
    verbose= True
)
