How to do model evaluation with train, validation, test? - machine-learning

Generally if one dataset is given we use
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
y_pred = lr.predict(X_test)
print(confusion_matrix(y_test,y_pred))
print(accuracy_score(y_test,y_pred))
print(classification_report(y_test,y_pred))
If we are doing validation on the training dataset
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.3, random_state=0)
If both Train and Test datasets are given in separate datasets, where do I use Test dataset in the code?

Normally I will split the available data 2 times like this:
#first split: 80% (green+blue) and 20% (orange)
X_model, X_test, y_model, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
#second split: 75% (green) and 25% (orange)
X_train, X_val, y_train, y_val = train_test_split(X_model, y_model, test_size=0.25, random_state=0)
Note the the first split is 80:20 ratio, and the second split is 75:25 ratio, in order to get the 60:20:20 ratio in the overall dataset. And if you're already given separate dataset for model train/val and model test, you can skip the first split.
After this you can proceed to train and evaluate the model (using _train and _val):
lr = some_model()
lr.fit(X_train, y_train)
y_pred = lr.predict(X_val)
print(confusion_matrix(y_val, y_pred))
print(accuracy_score(y_val, y_pred)) #accuracy score (TRAIN)
print(classification_report(y_val, y_pred))
This is repeated using various models, and in model tuning, until you can find the best performing model and hyperparameters. It is highly recommended to do cross validation here to identify a better and more robust model.
After the winning model is found, do a test on the holdout data (using _test):
y_pred = lr.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(accuracy_score(y_test, y_pred)) #accuracy score (TEST)
print(classification_report(y_test, y_pred))
Now you can compare the 2 accuracy scores (TRAIN and TEST) to see if model is overfitting or underfitting.

Related

How to do Multi-step forecasting using XGBoost?

I am currently using XGBoost to predict sales in the future. My time series data is given per week interval. But I am not sure how can I do multistep forcasting using XGBoost. I split my data set into train and test and after training the model I use my test set to predict the sales. But I only get prediction on the actual values that I have not on the future weeks that are beyond the test set. Here are some code for clarification:
# train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.3,
random_state=0,
shuffle=False)
reg = xgb.XGBRegressor(objective='reg:squarederror', n_estimators=1000, nthread=24)
reg.fit(X_train, y_train)
# predicting
predictions_xgb = reg.predict(X_test)
Can I get some help on this?

sklearn cross valid / cross predict

I understand that cross_val_predict / cross_val trains n out-of-folds models and then aggragate them to produce the final prediction. This is done on the train phase. Now, I want to use the fitted models to predict the test data. I can use for loop to collect predictions on the test data and aggregate them but first I want to ask if there is a build-in sklearn method for this?
from sklearn.model_selection import cross_val_predict, train_test_split
diabetes = datasets.load_diabetes()
X = diabetes.data[:150]
y = diabetes.target[:150]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
lasso = linear_model.Lasso()
y_train_hat = cross_val_predict(lasso, X_train, y_train, cv=3)
y_test_hat = do_somthing(lasso, X_test)```
Thanks
The 3 models from your cross_val_predict are not saved anywhere, so you can't make predictions with them. You can use instead cross_validate with return_estimator=True. You'll still be left with three models that you'll have to manually use to make and aggregate predictions. (You could in principle put those models into an ensemble model like VotingClassifier, but at least for now there is no prefit argument to prevent refitting your estimators. There some discussion in Issue 7382 and links from there.)

Prediction in machine learning

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)
clf.fit(X_train, y_train)
y_true, y_pred = y_test, clf.predict(X_test)
acc = accuracy_score(y_pred, y_test)
The above 4 lines of code are from a model I came across. Can somebody tell me the difference between y_test and y_pred.
y_test is the actual label set which is used to evaluate your final model performance in comparison with y_pred, which is the one we get after fitting our machine learning model, i.e. what our model would predict for X_test.

Model validation with separate dataset

Apologies, quite new to sklearn. I'm trying to validate a model using an external dataset for binary classification of text strings. I've trained the model but want to use it against another dataset of a different size for prediction rather than include the data in the initial dataset split. Is this even possible?
Initial split
vectorizer = TfidfVectorizer(min_df=0.0, analyzer="char", sublinear_tf=True, ngram_range=(3, 3))
Xprod = vectorizer.fit_transform(prod_good)
X = vectorizer.fit_transform(total_requests)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=21)
Test the model
linear_svm=LinearSVC(C=1)
linear_svm.fit(X_train, y_train)
y_pred = linear_svm.predict(X_test)
score_test = metrics.accuracy_score(y_test, y_pred)
matrix = confusion_matrix(y_test, y_pred)
New prediction
newpred = linear_svm.predict(Xprod)
...
Error:
ValueError: X has 4553 features per sample; expecting 24422
Think I'm misunderstanding some basic concepts here
The function fit_transform makes a fit and then a transform. So this line fits your vectorizer and then transforms total_requests to X:
X = vectorizer.fit_transform(total_requests)
As your vectorizer must be fitted only one time (in order to have the same matrix of features each time you use your vectorizer), to compute Xprod, you just need to use transform:
Xprod = vectorizer.transform(prod_good)
Also, you need to compute Xprod after the vectorizer is fitted, so compute Xprod after X.

MLPRegressor gives very negative scores

I'm kind of new to machine learning and I am using MLPRegressor. I split my data with
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
then I make and fit the model, using 10-fold validation for test set.
nn = MLPRegressor(hidden_layer_sizes=(100, 100), activation='relu',
solver='lbfgs', max_iter=500)
nn.fit(X_train, y_train)
TrainScore = nn.score(X_train, y_train)
kfold = KFold(n_splits=10, shuffle=True, random_state=0)
print("Cross-validation scores:\t{} ".format(cross_val_score(nn, X_test, y_test, cv=kfold)))
av_corss_val_score = np.mean(cross_val_score(nn, X_test, y_test, cv=kfold))
print("The average cross validation score is: {}".format(av_corss_val_score))
The problem is that the test scores I receive are very negative (-4256). What could be possible be wrong?
To keep syntax the same, sklearn maximizes every metric, whether classification accuracy or regression MSE. Therefore, the objective function is defined in a way that a more positive number is good and more negative number is bad. Hence, a less negative MSE is preferred.
Moving on to why it may be so negative in your case, it could be broadly due to two things: overfitting or underfitting. There are tonnes of resources out there to help you from this point forward.

Resources