I am trying to produce out-of-sample forecasts over a horizon with supervised models. The models are multi-output, because there are many univariate time series running in parallel. How can I avoid using X_test samples when predicting with this type of model?
The code looks like this .... (it could be any other regressor: RF, AdaBoost, etc.)
# XGBRegressor takes verbosity rather than verbose
multioutputregressor = MultiOutputRegressor(xgb.XGBRegressor(objective='reg:squarederror', verbosity=1)).fit(X_train, y_train) ....
y_multirf1 = multioutputregressor.predict(X_test)
Here I need to forecast on univariate data. Besides, it looks like there is only 'time' as an exogenous variable. But it is a violation to put it as X(train/test). Are there any special models for supervised forecasting with out-of-sample predictions?
Thanx.
The question proposed reads as follows: Use scikit-learn to split the data into a training and test set. Classify the data as either cat or dog using DBSCAN.
I am trying to figure out how to use DBSCAN to fit a model on training data and then predict the labels of a testing set. I am well aware that DBSCAN is meant for clustering and not prediction. I have also looked at "Use sklearn DBSCAN model to classify new entries" as well as numerous other threads. DBSCAN only comes with fit and fit_predict functions, which don't seem particularly useful when trying to fit the model on the training data and then test it on the testing data.
Is the question worded poorly or am I missing something? I have looked at the scikit-learn documentation as well as looked for examples, but have not had any luck.
# Split the samples into two subsets, use one for training and the other for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
# Instantiate the learning model
dbscan = DBSCAN()
# Fit the model
dbscan.fit(X_train, y_train)
# Predict the response
# Confusion matrix and quantitative metrics
# np.str has been removed from NumPy; the built-in str does the same job
print("The confusion matrix is: " + str(confusion_matrix(y_test, dbscan_pred)))
print("The accuracy score is: " + str(accuracy_score(y_test, dbscan_pred)))
Whoever gave you that assignment has no clue...
DBSCAN will never predict "cat" or "dog". It just can't.
Because it is an unsupervised algorithm, it doesn't use training labels. y_train is ignored (see the parameter documentation), and it is stupid that sklearn will allow you to pass it at all! It will output sets of points that are clusters. Many tools will enumerate these sets as 1, 2, ... But it won't name a set "dogs".
Furthermore it can't predict on new data either - which you need for predicting on "test" data. So it can't work with a train-test split, but that does not really matter because it does not use labels anyway.
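Both points are easy to check on toy data. The sketch below is only an illustration: the blobs from make_blobs stand in for the cat/dog features, and eps and min_samples are arbitrary values picked for this synthetic data.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Synthetic stand-in for the cat/dog features (illustrative data only)
X, y = make_blobs(n_samples=200, centers=[[0, 0], [8, 8]],
                  cluster_std=0.5, random_state=0)

labels_without_y = DBSCAN(eps=1.0, min_samples=5).fit(X).labels_
labels_with_y = DBSCAN(eps=1.0, min_samples=5).fit(X, y).labels_

# y is ignored: both fits produce exactly the same cluster ids
same_labels = np.array_equal(labels_without_y, labels_with_y)
# Cluster ids are plain integers (-1 marks noise), never class names
cluster_ids = sorted(set(labels_with_y.tolist()))
# The estimator exposes no predict method for unseen points
has_predict = hasattr(DBSCAN(), "predict")

print(same_labels, cluster_ids, has_predict)
```

Whatever ids come out, mapping them to "cat" or "dog" would still require the labels, which is exactly what a classifier is for.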
The accepted answer in the question you linked is a pretty good one for you, too: you want to perform classification, not discover structure (which is what clustering does).
DBSCAN, as implemented in scikit-learn, is a transductive algorithm, meaning you can't do predictions on new data. There's an old discussion from 2012 on the scikit-learn repository about this.
Suffice it to say, when you're using a clustering algorithm, the concept of a train/test split is less well defined. Cross-validation usually involves a different metric; for example, in K-means, the cross-validation is often over the hyperparameter k, rather than over mutually exclusive subsets of the data, and the metric that is optimized is the intra-vs-inter cluster variance, rather than F1 or accuracy.
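As a rough sketch of searching over k, the loop below scores each candidate with the silhouette score, which compares within-cluster distances to nearest-cluster distances. That is a stand-in for the intra-vs-inter variance idea rather than the exact quantity mentioned above, and the three blob centers are made up for the example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Three well-separated synthetic clusters (assumed data, for illustration)
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [10, 10], [-10, 10]],
                  cluster_std=0.5, random_state=0)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # Higher is better: tight clusters that are far from each other
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k, scores)
```

Note that no labels and no held-out subset are involved; the whole dataset is clustered for every k.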
Bottom line: trying to perform classification using a clustering technique is effectively square-peg-round-hole. You can jam it through if you really want to, but it'd be considerably easier to just use an off-the-shelf classifier.
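For comparison, here is a minimal sketch of the same train/test workflow with an off-the-shelf classifier. LogisticRegression is just one possible choice, and the two synthetic blobs are an assumption standing in for real cat/dog features.

```python
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic two-class data standing in for cat/dog features (illustrative only)
X, y = make_blobs(n_samples=300, centers=[[0, 0], [6, 6]],
                  cluster_std=1.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

# A classifier uses y_train and can predict on unseen rows
clf = LogisticRegression().fit(X_train, y_train)
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print("The confusion matrix is:\n" + str(confusion_matrix(y_test, y_pred)))
print("The accuracy score is: " + str(accuracy))
```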
"Train/test split does have its dangers - what if the split we make isn't random? What if one subset of our data has only people from a certain state, employees with a certain income level but not other income levels, only women or only people at a certain age? (imagine a file ordered by one of these). This will result in overfitting, even though we're trying to avoid it! This is where cross validation comes in."
Most blogs say something like the above, and I don't understand it. I think the drawback is underfitting, not overfitting. Suppose that after the split, states A and B make up the training set and we try to predict state C, which is completely different from the training data; that would lead to underfitting. Can someone explain why most blogs state that a train/test split leads to overfitting?
It would be more correct to talk about selection bias, which your question describes.
Selection bias is not really tied to overfitting; it is about fitting a biased set, so the model will be unable to generalize or predict correctly.
In other words, whether you call it "fitting" or "overfitting", fitting a biased train set is still wrong.
The semantic strain on the "over" prefix is just that. It implies bias.
Imagine you have no selection bias. In that case, when you overfit even a healthy set, by definition of overfitting, you will still make the model biased towards your train set.
Here, your starting training set is already biased. So any fitting, even "correct fitting", will be biased, just like it happens in overfitting.
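A small sketch of that effect, on made-up data: the rows are sorted by the single feature (the "file ordered by one of these" case), and a linear model is scored once with an ordered split and once with a shuffled split. The quadratic target and the noise level are assumptions chosen to make the gap visible.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 10, size=500))      # rows ordered by the feature
X = x.reshape(-1, 1)
y = x ** 2 + rng.normal(0, 1, size=500)

# Ordered split: train on the first 80% of rows, test on the last 20%
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, shuffle=False)
r2_ordered = r2_score(y_te, LinearRegression().fit(X_tr, y_tr).predict(X_te))

# Shuffled split: train and test cover the same range of x
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
r2_random = r2_score(y_te, LinearRegression().fit(X_tr, y_tr).predict(X_te))

print(r2_ordered, r2_random)
```

The model and the fitting procedure are identical in both runs; only the way the training set was selected differs.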
In fact, a train/test split does involve some randomness. See below with scikit-learn's train_test_split:
from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(data, test_size=0.2, random_state=42)
Here, to get some initial intuition, you can change the random_state value to some other integer and train the model multiple times to see whether you get comparable test accuracies in each run. If the dataset is small (on the order of 100s), the test accuracies may differ significantly. But with a larger dataset (on the order of 10000s), the test accuracies become more or less similar, because the train set will include at least some examples of every kind of sample.
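A sketch of that experiment, under assumptions: the data comes from make_classification, the model is a plain LogisticRegression, and accuracy_spread is a helper name made up for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def accuracy_spread(n_samples, seeds=range(20)):
    """Standard deviation of test accuracy across different random_state values."""
    X, y = make_classification(n_samples=n_samples, n_features=10,
                               n_informative=5, random_state=0)
    scores = []
    for seed in seeds:
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.2, random_state=seed)
        model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        scores.append(model.score(X_te, y_te))
    return float(np.std(scores))

spread_small = accuracy_spread(100)      # on the order of 100s
spread_large = accuracy_spread(10_000)   # on the order of 10000s
print(spread_small, spread_large)
```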
Of course, cross validation is performed to minimize the effect of overfitting and to make the results more generalized. But with very large datasets, cross validation can be really expensive.
The "train_test_split" function will not necessarily be biased if you apply it only once to a data set. What I mean is that by choosing a value for the "random_state" parameter of the function, you can produce different train and test groups.
Imagine you have a data set, and after applying the train_test_split and training your model, you get low accuracy score on your test data.
If you alter the random_state value and retrain your model, you will get a different accuracy score on your data set.
Consequently, you can be tempted to search for the random_state value that gives your model the best accuracy. Well, guess what? You have just introduced bias into your model: you have found a train set that trains your model in a way that happens to work best on that particular test set.
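To make the temptation concrete, here is a small sketch on assumed synthetic regression data: the same model is scored under 50 different random_state values, and the "best" split is compared with the average over all splits.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Small, noisy synthetic dataset (illustrative only)
X, y = make_regression(n_samples=80, n_features=5, noise=30.0, random_state=0)

scores = []
for seed in range(50):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.25, random_state=seed)
    scores.append(LinearRegression().fit(X_tr, y_tr).score(X_te, y_te))

best_seed = int(np.argmax(scores))
best_score = max(scores)              # what you would report after seed hunting
mean_score = float(np.mean(scores))   # a less flattering, more honest summary
print(best_seed, best_score, mean_score)
```

The gap between best_score and mean_score comes purely from choosing the split after looking at the test result.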
However, when we use something such as KFold cross-validation, we break the data set down into five or ten (depending on size) groups of train and test data. Every time we train the model, we see a different score. The average of all the scores will probably be a more realistic estimate for the model when trained on the whole data set. It would look something like this:
import numpy as np
from sklearn import metrics
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
# X and y are assumed to be a pandas DataFrame and Series
kfold = KFold(n_splits=5, shuffle=True, random_state=1)
R_2 = []
for train_index, test_index in kfold.split(X):
    # KFold yields positional indices, so index with .iloc rather than .loc
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    Model = LinearRegression().fit(X_train, y_train)
    r2 = metrics.r2_score(y_test, Model.predict(X_test))
    R_2.append(r2)
R_2mean = np.mean(R_2)
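The same loop can be written more compactly with cross_val_score. This is a sketch on assumed synthetic data from make_regression; with your own X and y you would drop that line.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data so the snippet runs on its own (illustrative only)
X, y = make_regression(n_samples=200, n_features=4, noise=10.0, random_state=0)

kfold = KFold(n_splits=5, shuffle=True, random_state=1)
# One R^2 score per fold; the model is refit from scratch on each training part
r2_scores = cross_val_score(LinearRegression(), X, y, cv=kfold, scoring="r2")
r2_mean = r2_scores.mean()
print(r2_scores, r2_mean)
```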
I would like to check with you if my understanding about ensemble learning (homogeneous vs heterogeneous) is correct.
Is the following statement correct?
A homogeneous ensemble is a set of classifiers of the same type built upon different data, as in a random forest, and a heterogeneous ensemble is a set of classifiers of different types built upon the same data.
If it's not correct, could you please clarify this point?
A homogeneous ensemble consists of members having a single-type base learning algorithm. Popular methods like bagging and boosting generate diversity by sampling from or assigning weights to training examples but generally utilize a single type of base classifier to build the ensemble.
On the other hand, a heterogeneous ensemble consists of members having different base learning algorithms such as SVM, ANN and Decision Trees. A popular heterogeneous ensemble method is stacking, which is similar to boosting.
This table contains examples for both homogeneous and heterogeneous ensemble models.
EDIT:
Homogeneous ensemble methods use the same feature selection method with different training data, distributing the dataset over several nodes, while heterogeneous ensemble methods use different feature selection methods with the same training data.
Heterogeneous Ensembles (HEE) use different fine-tuned algorithms. They usually work well if we have a small number of estimators. Note that the number of algorithms should be odd (3 or more) in order to avoid ties in a binary majority vote. For example, we could combine a decision tree, an SVM and a logistic regression using a voting mechanism to improve the results, and then use their combined wisdom through a majority vote to classify a given sample. Besides voting, we can also use averaging or stacking to aggregate the results of the models. The data for each model is the same.
Homogeneous Ensembles (HOE), such as bagging, work by applying the same algorithm in all the estimators. These algorithms should not be fine-tuned -> They should be weak! In contrast to HEE, we use a large number of estimators. Note that the datasets for the individual models should be sampled separately, so that each model sees different data and the models are as independent as possible. This makes the aggregated result more precise. Bagging reduces variance because each model is trained on a different random sample. By using the ensemble itself, we reduce the risk of over-fitting and create a robust model. Unfortunately, bagging is computationally expensive.
EDIT: Here an example in code
Heterogeneous Ensemble Function:
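from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import VotingClassifier
# Instantiate the individual models
clf_knn = KNeighborsClassifier(5)
clf_decision_tree = DecisionTreeClassifier()
clf_logistic_regression = LogisticRegression()
# Create voting classifier (majority vote by default)
clf_voting = VotingClassifier(
    estimators=[
        ('knn', clf_knn),
        ('dt', clf_decision_tree),
        ('lr', clf_logistic_regression)])
# Fit it to the training set and predict
clf_voting.fit(X_train, y_train)
y_pred = clf_voting.predict(X_test)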
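Stacking, mentioned above as another way to aggregate a heterogeneous ensemble, can be sketched in the same style. This is an illustration under assumptions: the data is synthetic, and the choice of base models and of LogisticRegression as the final estimator is arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic data so the snippet runs on its own (illustrative only)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Different base algorithms, all trained on the same data
clf_stack = StackingClassifier(
    estimators=[
        ("knn", KNeighborsClassifier(5)),
        ("dt", DecisionTreeClassifier(random_state=0)),
    ],
    # The final estimator learns how to combine the base models' predictions
    final_estimator=LogisticRegression(),
    cv=5,
)
clf_stack.fit(X_train, y_train)
y_pred = clf_stack.predict(X_test)
stack_accuracy = clf_stack.score(X_test, y_test)
print(stack_accuracy)
```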
Homogeneous Ensemble Function:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
# Instantiate the base estimator, which is a weak model (we set max depth to 3)
clf_decision_tree = DecisionTreeClassifier(max_depth=3)
# Build the Bagging classifier with 5 estimators (we use 5 decision trees)
clf_bag = BaggingClassifier(
    estimator=clf_decision_tree,  # named base_estimator before scikit-learn 1.2
    n_estimators=5
)
# Fit the Bagging model to the training set
clf_bag.fit(X_train, y_train)
# Make predictions on the test set
y_pred = clf_bag.predict(X_test)
Conclusion: In summary what you say is correct, yes.
In many examples, I see train/cross-validation dataset splits being performed with KFold, StratifiedKFold, or another pre-built dataset splitter. Keras models have a built-in validation_split kwarg that can be used for training.
model.fit(self, x, y, batch_size=32, nb_epoch=10, verbose=1, callbacks=[], validation_split=0.0, validation_data=None, shuffle=True, class_weight=None, sample_weight=None)
(https://keras.io/models/model/)
validation_split: float between 0 and 1: fraction of the training data to be used as validation data. The model will set apart this fraction of the training data, will not train on it, and will evaluate the loss and any model metrics on this data at the end of each epoch.
I am new to the field and tools, so I lack intuition about what the different splitters offer. Mainly, though, I can't find any information on how Keras' validation_split works. Can someone explain it to me, and when a separate method is preferable? The built-in kwarg seems to me like the cleanest and easiest way to split test datasets, without having to architect your training loops much differently.
The difference between the two is quite subtle and they can be used in conjunction.
KFold and similar classes in scikit-learn split your data into k folds (randomly, if you set shuffle=True). You can then train models holding out a single fold each time and testing on that fold.
validation_split takes a fraction of your data non-randomly. According to the Keras documentation it will take the fraction from the end of your data, e.g. 0.1 will hold out the final 10% of rows in the input matrix. The purpose of the validation split is to allow you to assess how the model is performing on the training set and a held out set at every epoch in the training period. If the model continues to improve on the training set but not the validation set then it is a clear sign of potential overfitting.
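The slicing described above can be mimicked in plain NumPy. This is only a sketch of the behaviour as described in the documentation, not Keras code: tail_validation_split is a made-up helper name, and the arrays are toy data.

```python
import numpy as np

def tail_validation_split(x, y, validation_split):
    """Hold out the last fraction of rows, without shuffling.

    Illustrative helper only; this is not a Keras function.
    """
    split_at = int(len(x) * (1.0 - validation_split))
    return (x[:split_at], y[:split_at]), (x[split_at:], y[split_at:])

x = np.arange(100).reshape(50, 2)   # 50 toy samples with 2 features each
y = np.arange(50)                   # toy targets, equal to the row number

(x_train, y_train), (x_val, y_val) = tail_validation_split(x, y, 0.1)
print(len(x_train), len(x_val), y_val)
```

Because the held-out rows are simply the last ones, data that is ordered (for example, sorted by class or by time) should be shuffled beforehand if a representative validation set is wanted.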
You could theoretically use KFold cross-validation to construct a model while also using validation_split to monitor the performance of each model. At each fold you will be generating a new validation_split from the training data.
Imagine a machine learning problem where you have 20 classes and about 7000 sparse boolean features.
I want to figure out what the 20 most unique features per class are. In other words, features that are used a lot in a specific class but aren't used in other classes, or hardly used.
What would be a good feature selection algorithm or heuristic that can do this?
When you train a multi-class Logistic Regression classifier, the trained model is a num_class x num_feature matrix whose [i,j] value is the weight of feature j in class i. The indices of the features are the same as in your input feature matrix.
In scikit-learn you can access the parameters of the model, so if you use scikit-learn classification algorithms you'll be able to find the most important features per class with:
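# regul is the regularization strength you choose (a float)
# loss='log_loss' was called 'log', and max_iter was called n_iter, in older scikit-learn versions
clf = SGDClassifier(loss='log_loss', alpha=regul, penalty='l1', l1_ratio=0.9, learning_rate='optimal', max_iter=10, shuffle=False, n_jobs=3, fit_intercept=True)
clf.fit(X_train, Y_train)
for i in range(0, clf.coef_.shape[0]):
    # indices of the 20 largest weights for class i
    top20_indices = np.argsort(clf.coef_[i])[-20:]
    print(top20_indices)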
clf.coef_ is the matrix containing the weight of each feature in each class so clf.coef_[0][2] is the weight of the third feature in the first class.
If, when you build your feature matrix, you keep track of the index of each feature in a dictionary where dic[id] = feature_name, you'll be able to retrieve the names of the top features using that dictionary.
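Here is a small end-to-end sketch of the index-to-name lookup. Everything about the data is an assumption made for the example: 3 classes, 10 boolean features, and one planted feature per class that fires much more often in that class. LogisticRegression is used instead of SGDClassifier only to keep the example short.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_classes, n_features, n_per_class = 3, 10, 200
y = np.repeat(np.arange(n_classes), n_per_class)

# Background: every feature fires about 10% of the time
X = (rng.random((len(y), n_features)) < 0.1).astype(int)
# Planted signal: feature c fires about 90% of the time in class c
for c in range(n_classes):
    rows = y == c
    X[rows, c] = (rng.random(rows.sum()) < 0.9).astype(int)

# dic[id] = feature_name, built alongside the feature matrix
feature_names = {j: f"feature_{j}" for j in range(n_features)}

clf = LogisticRegression(max_iter=1000).fit(X, y)

top_features = {}
for i in range(clf.coef_.shape[0]):
    top_indices = np.argsort(clf.coef_[i])[::-1][:3]   # largest weights first
    top_features[i] = [feature_names[int(j)] for j in top_indices]

print(top_features)
```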
For more information refer to scikit-learn text classification example
Random Forest and Naive Bayes should be able to handle this for you. Given the sparsity, I'd go for the Naive Bayes first. Random Forest would be better if you're looking for combinations.
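One way the Naive Bayes suggestion could be turned into a ranking is sketched below. It is a heuristic, not an established recipe: for each class it compares the log-probability of a feature being on in that class against the average over the other classes. The synthetic data (3 classes, 10 boolean features, one planted feature per class) is an assumption for the example.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

rng = np.random.default_rng(0)
n_classes, n_features, n_per_class = 3, 10, 200
y = np.repeat(np.arange(n_classes), n_per_class)

# Background: every feature fires about 10% of the time
X = (rng.random((len(y), n_features)) < 0.1).astype(int)
# Planted signal: feature c fires about 90% of the time in class c
for c in range(n_classes):
    rows = y == c
    X[rows, c] = (rng.random(rows.sum()) < 0.9).astype(int)

nb = BernoulliNB().fit(X, y)
log_p = nb.feature_log_prob_    # shape (n_classes, n_features): log P(feature on | class)

top_by_class = {}
for c in range(n_classes):
    others = np.delete(log_p, c, axis=0).mean(axis=0)
    uniqueness = log_p[c] - others          # high: common in class c, rare elsewhere
    top_by_class[c] = np.argsort(uniqueness)[::-1][:3].tolist()

print(top_by_class)
```

With 20 classes and 7000 features the same loop applies unchanged; only the slice `[:3]` would become `[:20]`.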