Usually we make predictions on a single dataset, for example USD/GBP with the columns HIGH, LOW, OPEN, and CLOSE. My question is: how can I make predictions on several datasets at once, such as USD/GBP, EUR/USD, XAU/USD, and USD/JPY? That is, can I feed all of this data to a single model and get a prediction for each dataset (multiple outputs)? Is this possible? Thank you.
If I have understood your question correctly, you want to feed data to a single object that wraps two models and runs one of them (model1 or model2) depending on the data passed in. For this you can write a class with a minimal model interface that selects the underlying model based on the value of a feature.
As an illustration:
import pickle as pkl

import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy data standing in for df1 and df2 combined
data = {
    'HIGH': [1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0],
    'LOW': [0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 3.0, 3.0, 3.0],
    'OPEN': [1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0],
    'END': [3.0, 1.0, 3.0, 4.0, 2.0, 3.0, 1.0, 2.0, 6.0, 1.0, 2.0, 1.0],
    'target_price': [0.0, 2.0, 1.5, 0.0, 5.1, 4.0, 0.0, 1.0, 2.0, 0.0, 2.1, 1.5],
}
df = pd.DataFrame(data)
X, y = df.iloc[:, :-1], df.iloc[:, -1]
X = X.astype('float32')

# Train two models, each on its own subset of rows.
# The snippet originally selected on a column 'x' that is not in the data;
# 'HIGH' (which only takes the values 0.0 and 1.0 here) is assumed as the selector.
model1 = LinearRegression()
model2 = LinearRegression()
ser_model1 = X['HIGH'] == 0.0
model1.fit(X[ser_model1], y[ser_model1])
model2.fit(X[~ser_model1], y[~ser_model1])

# Define a class that mimics the model interface
class CombinedModel:
    def __init__(self, model1, model2):
        self.model1 = model1
        self.model2 = model2

    def predict(self, X, **kwargs):
        # Route each row to the model trained on its subset, then restore row order
        ser_model1 = X['HIGH'] == 0.0
        return pd.concat([
            pd.Series(self.model1.predict(X[ser_model1]), index=X.index[ser_model1]),
            pd.Series(self.model2.predict(X[~ser_model1]), index=X.index[~ser_model1]),
        ]).sort_index()

# Create a model from the two trained submodels and pickle it
model = CombinedModel(model1, model2)
model.predict(X)
with open('model.pkl', 'wb') as fp:
    pkl.dump(model, fp)
model = model1 = model2 = None

# Test loading it
with open('model.pkl', 'rb') as fp:
    model = pkl.load(fp)
model.predict(X)
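The same routing idea can be stretched to more than two instruments. The sketch below is only an illustration under several assumptions: the class name PerSymbolModel, the 'symbol' column that tags each row with its currency pair, and the randomly generated prices are all made up for the example and do not come from the question.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression


class PerSymbolModel:
    """Hypothetical wrapper: one regressor per value of a 'symbol' column."""

    def __init__(self, make_model=LinearRegression, key="symbol"):
        self.make_model = make_model  # factory that builds a fresh estimator
        self.key = key                # assumed name of the routing column
        self.models_ = {}

    def fit(self, X, y):
        # Train a separate model on the rows belonging to each symbol
        for sym, idx in X.groupby(self.key).groups.items():
            feats = X.loc[idx].drop(columns=self.key)
            self.models_[sym] = self.make_model().fit(feats, y.loc[idx])
        return self

    def predict(self, X):
        # Send each row to the model of its symbol; keep the original row order
        out = pd.Series(np.nan, index=X.index, dtype="float64")
        for sym, idx in X.groupby(self.key).groups.items():
            feats = X.loc[idx].drop(columns=self.key)
            out.loc[idx] = self.models_[sym].predict(feats)
        return out


# Made-up data: three symbols, eight rows each, random feature values
rng = np.random.default_rng(0)
frames = []
for sym in ["USD/GBP", "EUR/USD", "XAU/USD"]:
    frame = pd.DataFrame(rng.random((8, 3)), columns=["OPEN", "HIGH", "LOW"])
    frame["symbol"] = sym
    frames.append(frame)
X_all = pd.concat(frames, ignore_index=True)
y_all = X_all["OPEN"] + 2 * X_all["HIGH"] - X_all["LOW"]  # synthetic target

multi_model = PerSymbolModel().fit(X_all, y_all)
preds = multi_model.predict(X_all)
print(preds.groupby(X_all["symbol"]).head(2))
```

A single call to predict then returns one prediction per row for every symbol, which is one way to read "multiple outputs" in the question. Whether separate models or one shared model works better depends on the data.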
Alternatively, if you want to ensemble several models, use VotingClassifier (or VotingRegressor when the target is continuous, as with prices).
The idea behind the voting classifier is to combine conceptually different machine learning classifiers and use a majority vote (hard vote) or the average predicted probabilities (soft vote) to predict the class labels.
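As a rough sketch of the difference between hard and soft voting, the example below fits both variants on the iris dataset. The dataset and the three base estimators are arbitrary choices for illustration, not something taken from the question.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

X_iris, y_iris = load_iris(return_X_y=True)

# Three conceptually different classifiers (arbitrary picks for the sketch)
estimators = [
    ("lr", LogisticRegression(max_iter=1000)),
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ("nb", GaussianNB()),
]

# Hard voting: each classifier casts one vote, the majority label wins
hard = VotingClassifier(estimators=estimators, voting="hard").fit(X_iris, y_iris)

# Soft voting: class probabilities are averaged, the highest average wins
soft = VotingClassifier(estimators=estimators, voting="soft").fit(X_iris, y_iris)

hard_pred = hard.predict(X_iris[:5])
soft_proba = soft.predict_proba(X_iris[:5])
print(hard_pred)
print(soft_proba.round(3))
```

Soft voting needs every base estimator to implement predict_proba, which is why predict_proba is only called on the soft variant here.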
When I try to combine RandomizedSearchCV with early stopping to reduce overfitting, I get this error:
py:372: FitFailedWarning:
300 fits failed out of a total of 300.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.
The code I am trying is like this:
params_dist = {'min_child_weight': [0.1, 1, 5, 10, 50],
'colsample_bytree': np.arange(0.5, 1.0, 0.1),
'gamma': [0.5, 1, 1.5, 2, 5],
'subsample': np.arange(0.5, 1.0, 0.1),
'max_depth': range(3, 21, 3),
'learning_rate': [0.0001,0.001, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 1],
'n_estimators': [50, 100, 250, 500, 750, 1000],
'reg_alpha': [0.0001, 0.001, 0.1, 1],
'reg_lambda': [0.0001, 0.001, 0.1, 1]}
model_with_earlyStopping = xgb.XGBClassifier(objective='binary:logistic',
eval_metric="error",
early_stopping_rounds=13,
seed=42)
random_search = model_selection.RandomizedSearchCV(model_with_earlyStopping,
param_distributions=params_dist,
scoring='roc_auc',
n_jobs=-1,
verbose=0,
cv=3,
random_state=1001,
n_iter=100)
The code worked fine without using early stopping. However, I am looking for a way to combine these 2 methods together.
Can anyone help me fix it?
I have a lab assignment on preprocessing data, and I am trying to use ColumnTransformer with the pipeline syntax. My code is below.
preprocess = ColumnTransformer(
[('imp_mean', SimpleImputer(strategy='mean'), numerics_cols),
('imp_mode', SimpleImputer(strategy='most_frequent'), categorical_cols),
('onehot', OneHotEncoder(handle_unknown='ignore'), categorical_cols),
#('stander', StandardScaler(), fewer_cols_train_X_df.columns)
])
After I run this code and call the pipeline, the result is:
['female', 1.0, 0.0, 0.0],
['male', 0.0, 1.0, 0.0],
['female', 1.0, 0.0, 0.0],
['male', 0.0, 1.0, 0.0],
['male', 0.0, 1.0, 0.0],
['male', 0.0, 1.0, 0.0],
['male', 0.0, 1.0, 0.0],
['female', 1.0, 0.0, 0.0],
['male', 0.0, 1.0, 0.0],
['male', 0.0, 1.0, 0.0],
['male', 0.0, 1.0, 0.0],
['male', 0.0, 1.0, 0.0],
['male', 0.0, 1.0, 0.0],
['female', 1.0, 0.0, 0.0],
['female', 1.0, 0.0, 0.0],
['male', 0.0, 1.0, 0.0],
You can see that the categorical column is still in the result. I tried to drop it, but it is still there.
I just want to remove the categorical column from this result so that I can run StandardScaler. I don't understand why it doesn't work.
Thank you for reading.
With ColumnTransformer you cannot chain sequential operations on the same columns. Each transformer in the list is applied independently to the original columns, and the outputs are concatenated side by side.
Therefore, in your example the categorical columns appear twice: once imputed (still as strings) and once one-hot encoded from the un-imputed data. That is why the string column is still in your result.
To apply imputation and then one-hot encoding to the same columns, put both steps in a Pipeline so that they run sequentially.
The example below illustrates how to handle different processing for numerical and categorical features.
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.pipeline import Pipeline
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

X = pd.DataFrame({'gender': ['male', 'male', 'female'],
                  'A': [1, 10, 20],
                  'B': [1, 150, 20]})

# Categorical columns: impute first, then one-hot encode
categorical_preprocessing = Pipeline([
    ('imp_mode', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore')),
])

# Numerical columns: impute first, then standardize
numerical_preprocessing = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler()),
])

# Each pipeline is applied to its own group of columns
preprocessing = ColumnTransformer([
    ('categorical', categorical_preprocessing,
     make_column_selector(dtype_include=object)),
    ('numerical', numerical_preprocessing,
     make_column_selector(dtype_include=np.number)),
])

preprocessing.fit_transform(X)
Output:
array([[ 0. , 1. , -1.20270298, -0.84570663],
[ 0. , 1. , -0.04295368, 1.40447708],
[ 1. , 0. , 1.24565666, -0.55877045]])
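To make the difference concrete, here is a small sketch that puts the two layouts side by side on a single made-up column. The toy array and the variable names parallel and chained are assumptions for the illustration; sparse_threshold=0 is set only so that both results come back as dense arrays.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# One categorical column, addressed by position 0 (toy data)
X_cat = np.array([["male"], ["male"], ["female"]], dtype=object)

# Two separate entries on the same column: both run on the ORIGINAL column
# and their outputs are stacked next to each other.
parallel = ColumnTransformer(
    [
        ("imp_mode", SimpleImputer(strategy="most_frequent"), [0]),
        ("onehot", OneHotEncoder(handle_unknown="ignore"), [0]),
    ],
    sparse_threshold=0,  # force a dense result so it is easy to inspect
).fit_transform(X_cat)

# One entry holding a Pipeline: the imputer's output feeds the encoder.
chained = ColumnTransformer(
    [
        (
            "categorical",
            Pipeline(
                [
                    ("imp_mode", SimpleImputer(strategy="most_frequent")),
                    ("onehot", OneHotEncoder(handle_unknown="ignore")),
                ]
            ),
            [0],
        ),
    ],
    sparse_threshold=0,
).fit_transform(X_cat)

print(parallel.shape)  # string column plus the one-hot columns
print(chained.shape)   # one-hot columns only
```

The first result still carries the string column next to the encoded ones, which matches what the question shows. The second contains only numeric columns.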
I want to run a grid search on a OneVsRestClassifier whose base model is SVC, but the grid search raises the following error. How can I resolve it?
Code:
from sklearn.model_selection import GridSearchCV
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC
# defining parameter range
param_grid = {'C': [0.1, 1, 10, 100, 1000],
'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
'kernel': ['rbf']}
svc_model_orc = OneVsRestClassifier(SVC())
grid = GridSearchCV(svc_model_orc, param_grid, refit = True, verbose = 3)
# fitting the model for grid search
grid.fit(X_train, y_train)
# svc_pred_train=grid.predict(X_train)
# svc_pred_test = grid.predict(X_valid)
# print(accuracy_score(y_train, svc_pred_train))
# print(f1_score(y_train, svc_pred_train, average='weighted'))
# print(accuracy_score(y_valid, svc_pred_test))
# print(f1_score(y_valid, svc_pred_test, average='weighted'))
Error-
ValueError: Invalid parameter C for estimator OneVsRestClassifier(estimator=SVC(C=1.0, cache_size=200, class_weight=None,
coef0=0.0, decision_function_shape='ovr',
degree=3, gamma='auto_deprecated',
kernel='rbf', max_iter=-1, probability=False,
random_state=None, shrinking=True, tol=0.001,
verbose=False),
n_jobs=None). Check the list of available parameters with `estimator.get_params().keys()`.
Since you're performing a grid search over a nested estimator (OneVsRestClassifier wraps the SVC and fits one copy of it per class), you need to define the parameters with the syntax estimator__some_parameter.
For nested objects, such as pipelines, GridSearchCV expects the syntax <component>__<parameter> to reach the inner model's parameters. In a pipeline you would name each step and then set its parameters as, for example, SVC__some_parameter for a step named SVC. In this case, however, the classifier sits under estimator; note that the actual model is accessed through the estimator attribute:
print(svc_model_orc.estimator)
SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)
So in this case, you should set the parameter grid as:
param_grid = {'estimator__C': [0.1, 1, 10, 100, 1000],
'estimator__gamma': [1, 0.1, 0.01, 0.001, 0.0001],
'estimator__kernel': ['rbf']}
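For reference, a minimal end-to-end sketch of the corrected search follows. The iris dataset, the shortened grid, and cv=3 are assumptions chosen only to keep the example fast; they are not taken from the question.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X_demo, y_demo = load_iris(return_X_y=True)

ovr = OneVsRestClassifier(SVC())

# List the parameter names GridSearchCV will accept for this estimator
valid_names = sorted(ovr.get_params().keys())
print(valid_names)

# The SVC parameters are reached through the 'estimator__' prefix
demo_grid = {
    "estimator__C": [0.1, 1, 10],
    "estimator__gamma": [0.1, 0.01],
    "estimator__kernel": ["rbf"],
}

grid_demo = GridSearchCV(ovr, demo_grid, cv=3)
grid_demo.fit(X_demo, y_demo)
print(grid_demo.best_params_)
```

Printing get_params().keys() first is a quick way to check the exact prefix before building the grid, as the error message itself suggests.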
I got the best accuracy on some tabular data with the settings below.
If I add a new feature, which parameters should I change to get the new result?
Or at least, which feature?
xgb = XGBClassifier(
bagging_fraction= 0.8,
boosting= 'gbdt',
colsample_bytree= 0.7,
feature_fraction= 0.9,
learning_rate= 0.05,
max_bin= 32,
max_depth= 10,
min_child_weight= 11,
missing= -999,
n_estimators= 400,
nthread= 4,
num_leaves= 80,
objective= 'multiclass',
predictor= 'gpu_predictor',
seed= 1337,
silent= 1,
subsample= 0.8,
tree_method= 'gpu_hist',
verbose= True
)
I am running the same piece of code with regular XGBoost and with Dask XGBoost,
and I am getting different probabilities from the two models.
Normal XGBoost code:
params = {'objective': 'binary:logistic', 'nround': 1000,
'max_depth': 16, 'eta': 0.01, 'subsample': 0.5,
'min_child_weight': 1, 'tree_method': 'hist',
'grow_policy': 'lossguide'}
model = XGBClassifier(params=params)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
Output: [image: normal XGBoost code output]
Dask XGBoost code:
params = {'objective': 'binary:logistic', 'nround': 1000,
'max_depth': 16, 'eta': 0.01, 'subsample': 0.5,
'min_child_weight': 1, 'tree_method': 'hist',
'grow_policy': 'lossguide'}
bst = dxgb.train(client, params, X_train, y_train)
predictions2 = dxgb.predict(client, bst, X_test).persist()
Output: [image: Dask XGBoost code output]
Can someone please help me here?