I am currently using XGBoost to predict future sales. My time series data is recorded at weekly intervals, but I am not sure how I can do multistep forecasting with XGBoost. I split my dataset into train and test sets, and after training the model I use the test set to predict sales. However, I only get predictions for the weeks covered by the test set, not for future weeks beyond it. Here is some code for clarification:
# train-test split (shuffle=False keeps the chronological order of the weekly data)
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.3,
                                                    random_state=0,
                                                    shuffle=False)
reg = xgb.XGBRegressor(objective='reg:squarederror', n_estimators=1000, nthread=24)
reg.fit(X_train, y_train)
# predicting
predictions_xgb = reg.predict(X_test)
Can I get some help on this?
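For context, one common way to forecast beyond the test set with a regressor like XGBoost is recursive (iterated) forecasting: predict one week ahead, append the prediction to the history, rebuild the features, and predict again. The sketch below is only illustrative; it assumes the features are lagged sales values built by a hypothetical make_lag_features helper, which is not part of the question's code.
import numpy as np

def make_lag_features(history, n_lags=4):
    # hypothetical helper: the last n_lags observations become the feature row
    return np.array(history[-n_lags:]).reshape(1, -1)

history = list(y)            # all observed weekly sales
future_predictions = []
n_future_weeks = 8           # how many weeks beyond the data to forecast

for _ in range(n_future_weeks):
    X_next = make_lag_features(history, n_lags=4)
    y_next = reg.predict(X_next)[0]    # one-step-ahead forecast
    future_predictions.append(y_next)
    history.append(y_next)             # feed the prediction back as input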
Generally, if a single dataset is given, we use:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)
print(confusion_matrix(y_test,y_pred))
print(accuracy_score(y_test,y_pred))
print(classification_report(y_test,y_pred))
If we are doing validation on the training dataset:
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.3, random_state=0)
If both the train and test datasets are given as separate datasets, where do I use the test dataset in the code?
Normally I split the available data twice, like this:
# first split: 80% (train + validation) and 20% (test)
X_model, X_test, y_model, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# second split: 75% (train) and 25% (validation)
X_train, X_val, y_train, y_val = train_test_split(X_model, y_model, test_size=0.25, random_state=0)
Note that the first split uses an 80:20 ratio and the second a 75:25 ratio, which gives a 60:20:20 ratio over the whole dataset (0.75 × 80% = 60% and 0.25 × 80% = 20%). If you are already given separate datasets for model train/validation and for the final test, you can skip the first split.
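For instance, with 1,000 rows this leaves roughly 600/200/200 rows in the three sets, which you can sanity-check after splitting:
print(len(X_train), len(X_val), len(X_test))  # roughly 60% / 20% / 20% of the rows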
After this you can proceed to train and evaluate the model (using _train and _val):
lr = some_model()
lr.fit(X_train, y_train)
y_pred = lr.predict(X_val)
print(confusion_matrix(y_val, y_pred))
print(accuracy_score(y_val, y_pred))  # accuracy score on the validation set
print(classification_report(y_val, y_pred))
This is repeated with various models and during hyperparameter tuning, until you find the best-performing model and hyperparameters. It is highly recommended to do cross-validation at this stage to identify a more robust model.
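For example, a minimal cross-validation sketch on the model-development data (using the X_model/y_model split from above, and assuming a classification task) could look like this:
from sklearn.model_selection import cross_val_score

# 5-fold cross-validation on the data used for model development
scores = cross_val_score(lr, X_model, y_model, cv=5, scoring='accuracy')
print("CV accuracy per fold:", scores)
print("Mean CV accuracy:", scores.mean())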
After the winning model is found, do a test on the holdout data (using _test):
y_pred = lr.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(accuracy_score(y_test, y_pred))  # accuracy score on the holdout test set
print(classification_report(y_test, y_pred))
Now you can compare the two accuracy scores (validation and test) to see whether the model is overfitting or underfitting.
I understand that cross_val_predict / cross_validate trains n out-of-fold models and then aggregates them to produce the final prediction. This is done in the training phase. Now, I want to use the fitted models to predict the test data. I could use a for loop to collect predictions on the test data and aggregate them, but first I want to ask if there is a built-in sklearn method for this?
from sklearn import datasets, linear_model
from sklearn.model_selection import cross_val_predict, train_test_split

diabetes = datasets.load_diabetes()
X = diabetes.data[:150]
y = diabetes.target[:150]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

lasso = linear_model.Lasso()
y_train_hat = cross_val_predict(lasso, X_train, y_train, cv=3)
y_test_hat = do_something(lasso, X_test)  # placeholder: is there a built-in for this?
Thanks
The 3 models fitted inside your cross_val_predict are not saved anywhere, so you can't make predictions with them. You can instead use cross_validate with return_estimator=True. You'll still be left with three models that you have to use manually to make and aggregate predictions. (You could in principle put those models into an ensemble model like VotingClassifier, but at least for now there is no prefit argument to prevent refitting your estimators. There is some discussion in Issue 7382 and the links from there.)
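A rough sketch of that manual approach, continuing the Lasso example above and simply averaging the per-fold predictions:
import numpy as np
from sklearn.model_selection import cross_validate

cv_results = cross_validate(lasso, X_train, y_train, cv=3, return_estimator=True)

# each returned estimator was fitted on a different 2/3 of the training data
fold_predictions = [est.predict(X_test) for est in cv_results['estimator']]
y_test_hat = np.mean(fold_predictions, axis=0)  # aggregate by averaging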
I have the following question:
I have a time series. I have done my preprocessing, and now I have x, which contains multiple features, and y, which contains my output. I have split them into train and test sets: x_train, x_test, y_train, y_test.
I now want to do a regression and a grid search.
Since I have a time series, I can't do ordinary k-fold cross-validation, so I wanted to use TimeSeriesSplit.
But what exactly am I splitting? I thought I would split the training set into train and validation parts to train my model and select my hyperparameters, and then forecast using the test set. Is this correct?
And how do I choose n_splits?
I now have the following code:
pipe = Pipeline(...)  # pipeline definition omitted
pipe.fit(x_train, y_train)
tss = TimeSeriesSplit(n_splits=5)
for train_index, test_index in tss.split(x_train):
    print('train:', train_index, 'test:', test_index)
clf=GridSearchCV(pipe, param_grid, cv=tss)
clf.fit(x_train, y_train)
According to the sklearn documentation:
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html
"Provides train/test indices to split time series data samples that are observed at fixed time intervals, in train/test sets. In each split, test indices must be higher than before, and thus shuffling in cross validator is inappropriate."
The way to go, if you want to validate a time series model, is nested cross-validation; some info about it is in the link below:
https://mlfromscratch.com/nested-cross-validation-python-code/
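As a minimal sketch of how that can be wired up with TimeSeriesSplit (assuming the pipe and param_grid from the question; the inner split tunes the hyperparameters, the outer split estimates generalization error):
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit, cross_val_score

inner_cv = TimeSeriesSplit(n_splits=5)   # for hyperparameter selection
outer_cv = TimeSeriesSplit(n_splits=5)   # for estimating generalization error

clf = GridSearchCV(pipe, param_grid, cv=inner_cv)
nested_scores = cross_val_score(clf, x_train, y_train, cv=outer_cv)
print("Nested CV scores:", nested_scores)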
I'm kind of new to machine learning and I am using MLPRegressor. I split my data with
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
then I build and fit the model, and use 10-fold cross-validation on the test set.
from sklearn.model_selection import KFold, cross_val_score
import numpy as np

nn = MLPRegressor(hidden_layer_sizes=(100, 100), activation='relu',
                  solver='lbfgs', max_iter=500)
nn.fit(X_train, y_train)
TrainScore = nn.score(X_train, y_train)

kfold = KFold(n_splits=10, shuffle=True, random_state=0)
cross_val_scores = cross_val_score(nn, X_test, y_test, cv=kfold)
print("Cross-validation scores:\t{}".format(cross_val_scores))
av_cross_val_score = np.mean(cross_val_scores)
print("The average cross validation score is: {}".format(av_cross_val_score))
The problem is that the test scores I receive are very negative (-4256). What could possibly be wrong?
To keep the syntax consistent, sklearn maximizes every metric, whether it is classification accuracy or a regression error such as MSE. Therefore, scores are defined so that a more positive number is good and a more negative number is bad; error metrics are negated, and a less negative (negated) MSE is preferred.
As for why the score may be so negative in your case, it could broadly be due to two things: overfitting or underfitting. There are plenty of resources out there to help you from this point forward.
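To make the sign convention explicit, you can request an error metric directly. A small sketch, reusing the nn, X_test, y_test, and kfold from the question:
from sklearn.model_selection import cross_val_score

# sklearn reports the *negated* MSE so that "higher is better" still holds
neg_mse_scores = cross_val_score(nn, X_test, y_test, cv=kfold,
                                 scoring='neg_mean_squared_error')
mse_scores = -neg_mse_scores  # flip the sign to get the usual MSE
print("MSE per fold:", mse_scores)
print("Mean MSE:", mse_scores.mean())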
When I read the scikit-learn examples, a typical machine learning flow is preprocessing --> learning --> predicting, as in the code snippet shown below:
steps = [('scaler', StandardScaler()),
         ('knn', KNeighborsClassifier())]
pipeline = Pipeline(steps)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
knn_scaled = pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
Here, both the training and testing datasets are scaled before being fed to the classifier. But in my task, I am going to predict on a single data sample. After training my model, I will receive data from a stream, so each time a single new data point arrives I need the classifier to predict on it and then proceed with my task using the predicted value.
So with only one example available at a time, how do I preprocess it before predicting? Scaling a single example seems to make no sense. How should I deal with this issue?
Just as you train your classifier and use the resulting model to predict individual records, the preprocessing step produces a fitted preprocessing model as well. Say your input is Xi and you have already fitted the preprocessing and classifier models (scaler and clf respectively):
Xi_new=scaler.transform(Xi)
print(clf.predict(Xi_new))
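If you keep the Pipeline from the question, this happens automatically: the fitted scaler inside the pipeline transforms each incoming sample before the classifier sees it. A small sketch (the single sample just has to be reshaped to a 2-D array):
# e.g. treat one row of X_test as the newly arrived sample
x_new = X_test[0]
# reshape to (1, n_features) because sklearn estimators expect 2-D input
print(pipeline.predict(x_new.reshape(1, -1)))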