How to extract the model after using sklearn.RepeatedStratifiedKFold()

I have the following code:
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from xgboost import XGBClassifier

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
model = XGBClassifier()
scores = cross_val_score(model, X_train, y_train, scoring='f1', cv=cv, n_jobs=2)
I want to test my model on the test set:
model.predict(x_test)
which raises an error: XGBoostError: need to call fit or load_model beforehand.
How do I get the trained model after k-fold cross-validation? Which function should I use? cross_val_score returns only the scores on each split, not the trained model, which I need for further evaluation.
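cross_val_score only evaluates; it never exposes the fitted models. Two common options, sketched below with the question's variable names (X_train, y_train, x_test): keep the per-fold estimators with cross_validate(..., return_estimator=True), or fit one final model on the full training set and evaluate that on the test set.

from sklearn.model_selection import RepeatedStratifiedKFold, cross_validate
from xgboost import XGBClassifier

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

# Option 1: keep every per-fold estimator (30 models here: 10 splits x 3 repeats)
results = cross_validate(XGBClassifier(), X_train, y_train,
                         scoring='f1', cv=cv, n_jobs=2,
                         return_estimator=True)
fold_models = results['estimator']  # list of fitted XGBClassifier objects

# Option 2: fit one final model on all the training data for test-set evaluation
final_model = XGBClassifier()
final_model.fit(X_train, y_train)
predictions = final_model.predict(x_test)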

Related

Creating a random forest function

I am trying to create a function that takes a 2-d numpy array (i.e. the data) and data_indices (a list of (train_indices, test_indices) tuples) as input. For each (train_indices, test_indices) tuple in data_indices, the function should:
Train a new RandomForestRegressor model on the portion of data indexed by train_indices
Evaluate the trained RandomForestRegressor model on the portion of data indexed by test_indices using the mean squared error.
After training and evaluating the RandomForestRegressor models, the function should return the RandomForestRegressor model that obtained the highest test-set mean squared error over its allocated data split, across all trained models.
The RandomForestRegressor models should be trained with random_state equal to 42; all other parameters should be left as default.
This is how I tried to do it in pandas:
def best_k_model(data, data_indices):
    model = RandomForestRegressor(random_state=42)
    for train_indices, test_indices in data:
        X_train, y_train = data[train_indices, 0], data[train_indices, 1]
        X_test, y_test = data[test_indices, 0], data[test_indices, 1]
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        MSE = mean_squared_error(y_test, y_pred)
    return MSE.max()
With these inputs:
best_model = best_k_model(data,data_indices)
best_model.predict([[1960]])
this is what I am supposed to get:
array([8.85170916e+08])
But I'm getting the error that says:
index 1960 is out of bounds for axis 0 with size 58
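A minimal corrected sketch, assuming (as the question's indexing suggests) that column 0 of data holds the feature and column 1 the target: the loop must iterate over data_indices rather than data, scikit-learn expects a 2-D feature matrix, a fresh model should be trained per split, and the function must return the best model rather than a score.

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

def best_k_model(data, data_indices):
    best_mse, best_model = -np.inf, None
    for train_indices, test_indices in data_indices:  # iterate over the index tuples
        # reshape(-1, 1): scikit-learn expects a 2-D feature matrix
        X_train, y_train = data[train_indices, 0].reshape(-1, 1), data[train_indices, 1]
        X_test, y_test = data[test_indices, 0].reshape(-1, 1), data[test_indices, 1]
        model = RandomForestRegressor(random_state=42)  # fresh model per split
        model.fit(X_train, y_train)
        mse = mean_squared_error(y_test, model.predict(X_test))
        if mse > best_mse:  # the question asks for the highest test-set MSE
            best_mse, best_model = mse, model
    return best_model  # return the model itself, not the score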

sklearn cross valid / cross predict

I understand that cross_val_predict / cross_validate trains n out-of-fold models and then aggregates them to produce the final prediction. This happens during the training phase. Now I want to use the fitted models to predict the test data. I could use a for loop to collect predictions on the test data and aggregate them, but first I want to ask if there is a built-in sklearn method for this?
from sklearn import datasets, linear_model
from sklearn.model_selection import cross_val_predict, train_test_split

diabetes = datasets.load_diabetes()
X = diabetes.data[:150]
y = diabetes.target[:150]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
lasso = linear_model.Lasso()
y_train_hat = cross_val_predict(lasso, X_train, y_train, cv=3)
y_test_hat = do_something(lasso, X_test)  # placeholder for the built-in I'm looking for
Thanks
The 3 models from your cross_val_predict are not saved anywhere, so you can't make predictions with them. You can use cross_validate with return_estimator=True instead. You'll still be left with three models that you'll have to use manually to make and aggregate predictions. (You could in principle put those models into an ensemble model like VotingClassifier, but at least for now there is no prefit argument to prevent refitting your estimators. There is some discussion in Issue 7382 and links from there.)
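A minimal sketch of that manual collect-and-aggregate step, reusing the question's lasso, X_train, y_train and X_test; averaging the fold predictions is an assumption that suits a regressor:

import numpy as np
from sklearn.model_selection import cross_validate

# return_estimator=True keeps the three fitted models from the folds
cv_results = cross_validate(lasso, X_train, y_train, cv=3, return_estimator=True)

# collect each fold model's predictions on the test set and average them
fold_preds = [est.predict(X_test) for est in cv_results['estimator']]
y_test_hat = np.mean(fold_preds, axis=0)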

Model validation with separate dataset

Apologies, quite new to sklearn. I'm trying to validate a model using an external dataset for binary classification of text strings. I've trained the model, but I want to use it against another dataset of a different size for prediction rather than include that data in the initial dataset split. Is this even possible?
Initial split
vectorizer = TfidfVectorizer(min_df=0.0, analyzer="char", sublinear_tf=True, ngram_range=(3, 3))
Xprod = vectorizer.fit_transform(prod_good)
X = vectorizer.fit_transform(total_requests)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=21)
Test the model
linear_svm=LinearSVC(C=1)
linear_svm.fit(X_train, y_train)
y_pred = linear_svm.predict(X_test)
score_test = metrics.accuracy_score(y_test, y_pred)
matrix = confusion_matrix(y_test, y_pred)
New prediction
newpred = linear_svm.predict(Xprod)
...
Error:
ValueError: X has 4553 features per sample; expecting 24422
Think I'm misunderstanding some basic concepts here
The fit_transform method performs a fit and then a transform. So this line fits your vectorizer and then transforms total_requests into X:
X = vectorizer.fit_transform(total_requests)
Since your vectorizer must be fitted only once (so that you get the same feature matrix every time you use it), you just need transform to compute Xprod:
Xprod = vectorizer.transform(prod_good)
Also, Xprod must be computed after the vectorizer has been fitted, so compute Xprod after X.
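Putting the answer together with the question's variables, a sketch of the corrected order: fit once on total_requests, then only transform the external data.

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(min_df=0.0, analyzer="char",
                             sublinear_tf=True, ngram_range=(3, 3))

# fit the vocabulary / feature space once, on the training corpus
X = vectorizer.fit_transform(total_requests)

# reuse the already-fitted vectorizer: same feature space, so the model's
# expected input width (24422 here) matches
Xprod = vectorizer.transform(prod_good)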

How to apply cross validation on data?

I want to evaluate a ML model using the average cross validation score.
I am splitting the data in a train and test set.
But I don't know if I have to use the train or test data to evaluate the model using the cross validation score.
Here is a part of my code:
train, test = train_test_split(basic_df, test_size=0.3, random_state=42)
# Separate the labels from the features and convert features & labels to numpy arrays
x_train=train.drop('successful',axis=1)
y_train=train['successful']
x_test=test.drop('successful',axis=1)
y_test=test['successful']
model = RandomForestClassifier()
model_random = RandomizedSearchCV(estimator = model, param_distributions = random_grid, n_iter = 100, cv = 5, verbose=2, random_state=42, n_jobs = -1)
model_random.fit(x_train, y_train)
print('Accuracy score: ', model_random.score(x_test,y_test))
print('Average Cross-Val-Score: ', np.mean(cross_val_score(model_random, x_train, y_train, cv=5))) # 5-Fold Cross validation
Y_predicted = model_random.predict(x_test.values)
print('f1_score (macro): ', f1_score(y_test, Y_predicted, average='macro'))
The main question is on the following code line:
print('Average Cross-Val-Score: ', np.mean(cross_val_score(model_random, x_train, y_train, cv=5))) # 5-Fold Cross validation
Is it right or should I use the test set there like this:
print('Average Cross-Val-Score: ', np.mean(cross_val_score(model_random, x_test, y_test, cv=5))) # 5-Fold Cross validation
You don't have to fit again to know your model's performance on the training data. You can get it using the following command:
import pandas as pd
pd.DataFrame(model_random.cv_results_)
Look at the mean_test_score column. Remember, this is the performance on the test folds of cross-validation. It will give you an idea of how well the model performed for each hyper-parameter combination tried by RandomizedSearchCV. The best hyper-parameter combination and the corresponding model can be extracted using:
model_random.best_params_
model_random.best_estimator_
Coming to your actual test data, people usually don't use cross-validation there.
Just make a prediction, as you do in this part. In the background, it uses model_random.best_estimator_ to do the prediction.
Y_predicted = model_random.predict(x_test.values)
print('f1_score (macro): ', f1_score(y_test, Y_predicted, average='macro'))
Look at the documentation for more explanation.
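To illustrate, a small sketch of that inspection workflow; the column names are standard cv_results_ keys, and the sort is just for readability:

import pandas as pd

# one row per sampled hyper-parameter combination
results = pd.DataFrame(model_random.cv_results_)

# best combinations first, judged by the mean score over the 5 CV test folds
print(results.sort_values('rank_test_score')
             [['params', 'mean_test_score', 'std_test_score']].head())

best_model = model_random.best_estimator_  # refit on the full training set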

Proper way to make prediction with Keras model trained with ImageDataGenerator

I have trained a model applying some image augmentations by using ImageDataGenerator in Keras as follows:
from keras.preprocessing.image import ImageDataGenerator

train_datagen = ImageDataGenerator(
    rotation_range=60,
    width_shift_range=0.1,
    height_shift_range=0.1,
    horizontal_flip=True)
train_datagen.fit(x_train)
history = model.fit_generator(
    train_datagen.flow(x_train, y_train, batch_size=7),
    steps_per_epoch=600,
    epochs=epochs,
    callbacks=callbacks_list
)
How should I make predictions with this model? By using model.predict() as shown below?
predictions = model.predict(x_test)
Or should I use model.predict_generator(), applying an ImageDataGenerator to x_test, where x_test is unlabelled?
If I use predict_generator(): How to do that?
What is the difference between two methods?
predict_generator() is a convenience function that makes it easier to load the images and apply the same preprocessing you used for your training samples. I recommend using it rather than model.predict().
In your case simply do:
test_gen = ImageDataGenerator()
predictions = model.predict_generator(test_gen.flow(# ... your params here ... #))
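For concreteness, a sketch that fills the placeholder with assumed parameters (x_test and a batch size of 32); shuffle=False is the important bit, so predictions stay aligned with the test samples:

import numpy as np

test_gen = ImageDataGenerator()  # same (here: no) preprocessing as training
flow = test_gen.flow(x_test, shuffle=False, batch_size=32)

# steps must cover the whole test set exactly once
steps = int(np.ceil(len(x_test) / 32))
predictions = model.predict_generator(flow, steps=steps)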
