Using Validation Set in GridSearchCV/RandomizedCV or not? [closed]

As I understand it, cross-validation (in GridSearchCV/RandomizedSearchCV) splits the data into folds, where each fold acts as the validation set exactly once.
But sklearn's documentation recommends:
Model selection by evaluating various parameter settings can be seen as a way to use the labeled data to “train” the parameters of the grid.
When evaluating the resulting model it is important to do it on held-out samples that were not seen during the grid search process: it is recommended to split the data into a development set (to be fed to the GridSearchCV instance) and an evaluation set to compute performance metrics.
This can be done by using the train_test_split utility function.
So we may use train_test_split to split the original data into a training set and a validation set:
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25)
and then fit GridSearchCV/RandomizedSearchCV on X_train, y_train, passing X_val, y_val as the eval_set in fit_params.
Is this really useful?
We split the original data twice (once with train_test_split and again inside SearchCV). Is that necessary?
SearchCV then sees less data (X_train instead of X). Does that hurt training accuracy?

The documentation here refers to the evaluation set as the test set. Therefore you should use train_test_split to split your data into a train set and a test set.
Performing this train_test_split is useful because you can then validate the result of your model on a test set of unseen data.
The train set will be used during GridSearchCV to find the best parameters for your model. As explained in the documentation, you can use the cv parameter to train your model on n-1 folds and validate it on the remaining fold.
I would recommend using cross-validation during GridSearchCV instead of a fixed validation set, as this will give you a better indication of how your model performs on unseen data.
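Putting the two steps together, a minimal sketch of this workflow might look like the following (the estimator, parameter grid, and synthetic data are placeholder assumptions, not from the question):
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=1000, random_state=0)

# Hold out an evaluation set that the grid search never sees.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Cross-validation inside GridSearchCV provides the validation folds.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [100, 200], "max_depth": [None, 5]},
    cv=5,
)
search.fit(X_train, y_train)

# Final, unbiased estimate on data unseen during the search.
print(search.best_params_)
print(search.score(X_test, y_test))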


Why do we store X_test to y_preds variable in Scikit learn? [closed]

I am currently working on a Machine Learning project with no prior hands-on experience of Machine Learning or Python. I have just encountered the following code online, but I don't understand why it works the way it does.
Where is the training data stored? Is it stored in X_train or X_test?
Why do we predict on X_test and store the result in the y_preds variable? Since the variable is named y_preds, I was expecting something like this:
y_preds = clf.predict(y_test)
Code:
from sklearn.model_selection import train_test_split
# Using the train_test_split() function: defining the test data size and storing the resulting train and test splits in variables
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# Fitting the training data to the model (clf) defined above
clf.fit(X_train, y_train)
# Making predictions on the test data with the trained model
y_preds = clf.predict(X_test)
In general, a learning problem considers a set of n samples of data and then tries to predict properties of unknown data. If each sample is more than a single number and, for instance, a multi-dimensional entry (aka multivariate data), it is said to have several attributes or features.
Learning problems fall into a few categories:
A) supervised learning, in which the data comes with additional attributes that we want to predict (Click here to go to the scikit-learn supervised learning page). This problem can be either:
classification: samples belong to two or more classes and we want to learn from already labeled data how to predict the class of unlabeled data. An example of a classification problem would be handwritten digit recognition, in which the aim is to assign each input vector to one of a finite number of discrete categories. Another way to think of classification is as a discrete (as opposed to continuous) form of supervised learning where one has a limited number of categories and for each of the n samples provided, one is to try to label them with the correct category or class.
regression: if the desired output consists of one or more continuous variables, then the task is called regression. An example of a regression problem would be the prediction of the length of a salmon as a function of its age and weight.
B) unsupervised learning, in which the training data consists of a set of input vectors x without any corresponding target values. The goal in such problems may be to discover groups of similar examples within the data, where it is called clustering, or to determine the distribution of data within the input space, known as density estimation, or to project the data from a high-dimensional space down to two or three dimensions for the purpose of visualization (Click here to go to the Scikit-Learn unsupervised learning page).
Basically, machine learning is about learning some properties of a data set and then testing those properties against another data set. A common practice in machine learning is to evaluate an algorithm by splitting a data set into two. We call one of those sets the training set, on which we learn some properties; we call the other set the testing set, on which we test the learned properties.
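To tie this back to the code in the question: predict always takes features (an X), never labels, and the predictions it returns are then compared against the held-out labels. A minimal sketch, assuming a scikit-learn classifier and the built-in iris data:
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf = RandomForestClassifier(random_state=0)
clf.fit(X_train, y_train)          # learn from the training features and labels

y_preds = clf.predict(X_test)      # predict labels from the *features* of the test set
print(accuracy_score(y_test, y_preds))  # compare predictions to the true test labels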
Take a look at the link below.
https://scikit-learn.org/stable/user_guide.html
That is an excellent resource for learning all about Scikit Learn. It's hard to get your mind around some of these things, but it's a great learning experience, and it really does work!

how to learn from each fold in the k-fold cross validation? [closed]

When performing k-fold cross-validation, for every fold we have a different validation set and a slightly changed learning set. Say that you progress from the first fold to the second fold. How is what you learned from the first fold inherited in the second fold's iteration? Currently, it seems as if you only calculate the accuracy, and the learned model is discarded and never retained.
What am I missing? Is such a model retained? If so, how is it retained, and does the method differ between, say, a DQN and a KNN?
K-fold cross-validation does not retrain a model in each iteration. Instead, it trains and evaluates K different independent models (which could be parallelized) on different folds of the dataset, but with the same hyperparameters. The point is not to get a more accurate model, but to get a more accurate (statistically speaking) validation by computing an aggregated validation score (i.e., you can estimate the mean and standard deviation of the model's accuracy).
You can then just keep one of those models and use the aggregated estimation for its metrics (instead of the one computed on that model's specific fold), or train a new model from scratch on the complete dataset. In the latter case, your best estimate for the model's metrics is still the previous aggregated metrics, but a new, unused test set could be used to estimate a fresh metric. So why would you do that? Because cross-validation is normally combined with hyperparameter tuning: every time you tune your hyperparameters, you check the aggregated metric estimate from cross-validation, and only when you finish tuning do you compute the final metrics on the unseen test set.
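As a minimal illustration of this workflow (aggregate the fold scores, then fit the final model from scratch on all the data), a hedged sketch using scikit-learn:
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# K independent fits, one per fold; only the scores are kept.
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean(), scores.std())  # aggregated estimate of generalization

# None of the K fold models survives; train the final model from scratch.
model.fit(X, y)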
Let's consider a case where you want to find a regression model for some data. You have various choices for how many terms your model will use. More terms might mean better accuracy, but they also mean a risk of overfitting. To select the right model, you train each candidate on some training data and test it against some test data, which are usually mutually exclusive.
Now, to get a more precise approximation of how accurate your model is, you can use k-fold cross-validation, which lets you use up to k test datasets. Note that you're using k-fold to evaluate how good your model is given some data, not to train it. In fact, k-fold is rarely used when the cost of training is high (e.g. deep neural networks) or when your dataset is big enough to ensure a good enough approximation of your model's accuracy.
So to answer your last question: no, the model is not necessarily retained. You can re-train it on all the data you have once you're ready for real-life use.

Combining neural networks with different features but same target [closed]

I have a question about how one can go about creating a neural network (or similar) architecture(s) that does the following:
Example
Say I have a model that uses feature 1 and feature 2 to predict a target. That model does not perform well because I am limited to the number of training examples that have both feature 1 and feature 2 populated.
If I were to have another neural network with feature 3 and feature 4, and my goal is to predict the same target, how can I go about combining the learning from both models to make the same target prediction?
This continues for several other similar datasets with different features but a common target.
Explanation
I am only doing this because not every training example has features 1, 2, 3 and 4, so they cannot all be incorporated into a single model. The only thing the datasets have in common is that they are trying to predict the same target.
Question
What machine learning strategy (not just a neural network) would be most appropriate for such a problem?
The model you describe is built out of 2 core sub-models.
Many feature-dependent encoders, one for each feature set. Features 1 and 2 can be combined by part of the model into some hidden representation. Features 3 and 4 would be translated into the same hidden representation, but would have a different sub-model with a different set of parameters to fit.
A single feature-independent decoder on top of the hidden representation, to predict your target.
When it comes to fitting the model, each encoder can only use the data where the desired feature set is available. It is fitting a representation to those features, so it needs to see them. But the decoder can be used for all of your data. This will capture the distribution of the targets, which is common because your targets are common.
This sort of model is appropriate when you believe that there is a meaningful hidden representation. That is, you believe that your feature sets are measuring similar things but in different ways.
That allows you to keep the encoder small, as it is doing a small translation from one way of measuring to another. Translating from the measurements to the target may still be difficult, but because that logic goes in the common decoder it can benefit from all the training data.
To make it concrete, a good example use case for such a model would be if your features were width, height, volume, and weight, and your target were shipping cost.
It is reasonable to say that the intermediate representation is well described by the concept of size. And it is also reasonable to say that translating from size to cost is an interesting problem in its own right, no matter how you measured size originally.
So the model formulation looks something like this:
# Feature encoders.
size ~ width + height
size ~ volume + weight
# Target decoder.
cost ~ size
Now, above I have been careful to describe the model design without any commitment to the type of model. But you did tag this question as relating to neural networks specifically, and I think that's a good choice.
For your simple example, using PyTorch, the model might look something like this:
import torch
import torch.nn.functional as F

class MultiEncoderSingleDecoder(torch.nn.Module):
    def __init__(self, hid_sz):
        super().__init__()
        self.using_encoder = 0
        # One encoder per feature set, each mapping 2 features to the shared hidden size.
        self.encoders = torch.nn.ModuleList([
            torch.nn.Linear(2, hid_sz),
            torch.nn.Linear(2, hid_sz),
        ])
        # A single decoder shared by all feature sets.
        self.decoder = torch.nn.Linear(hid_sz, 1)

    def set_encoder(self, use_encoder):
        self.using_encoder = use_encoder

    def forward(self, inp):
        encoder = self.encoders[self.using_encoder]
        return self.decoder(F.relu(encoder(inp)))
And then usage might look like so:
model = MultiEncoderSingleDecoder(hid_sz=16)  # hidden size is a free choice
model.set_encoder(0)
# Do some training on the first feature set.
model.set_encoder(1)
# Do some more training on the second feature set.
# ...
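Filling in those training steps, a rough sketch of the alternating loop might look like this (the optimizer, loss, and random mini-batches are illustrative assumptions, not part of the original answer):
import torch

model = MultiEncoderSingleDecoder(hid_sz=16)
optimizer = torch.optim.Adam(model.parameters())
loss_fn = torch.nn.MSELoss()

# Hypothetical mini-batches: (features, target) pairs for each feature set.
batch_a = (torch.randn(8, 2), torch.randn(8, 1))  # width + height rows
batch_b = (torch.randn(8, 2), torch.randn(8, 1))  # volume + weight rows

for encoder_idx, (inputs, targets) in [(0, batch_a), (1, batch_b)]:
    model.set_encoder(encoder_idx)
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    optimizer.step()  # updates the active encoder plus the shared decoder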

Why does K-fold cross-validation build K+1 models?

I have read the general steps for K-fold cross-validation at
https://machinelearningmastery.com/k-fold-cross-validation/
It describes the general procedure as follows:
Shuffle the dataset randomly.
Split the dataset into k groups (folds).
For each unique group:
  Take the group as a hold-out or test data set.
  Take the remaining groups as a training data set.
  Fit a model on the training set and evaluate it on the test set.
  Retain the evaluation score and discard the model.
Summarize the skill of the model using the sample of model evaluation scores.
So if it is K-fold, then K models will be built, right? But then why does the following link from H2O say that it builds K+1 models?
https://github.com/h2oai/h2o-3/blob/master/h2o-docs/src/product/tutorials/gbm/gbmTuning.ipynb
Arguably, "I read somewhere else" is too vague a statement (where?), because context does matter.
Most probably, such statements refer to some libraries which, by default, after finishing the CV procedure proper, go on to build a model on the whole training data using the hyperparameters that CV found to give the best performance; see for example the relevant train function of the caret R package, which, apart from performing CV (if requested), also returns the finalModel:
finalModel
A fit object using the best parameters
Similarly, scikit-learn's GridSearchCV also has a relevant parameter, refit:
refit : boolean, or string, default=True
Refit an estimator using the best found parameters on the whole dataset.
[...]
The refitted estimator is made available at the best_estimator_ attribute and permits using predict directly on this GridSearchCV instance.
But even then, the models fitted are almost never just K+1: when you use CV in practice for hyperparameter tuning (and keep in mind that CV has other uses, too), you will end up fitting m*K models, where m is the size of your hyperparameter combination set (all K folds in a single round are run with one single set of hyperparameters).
In other words, if your hyperparameter search grid consists of, say, 3 values for the number of trees and 2 values for the tree depth, you will fit 2*3*K = 6*K models during the CV procedure, and possibly +1 for fitting your final model at the end on the whole data with the best hyperparameters found.
So, to summarize:
By definition, each K-fold CV procedure consists of fitting just K models, one for each fold, with fixed hyperparameters across all folds
In case of CV for hyperparameter search, this procedure will be repeated for each hyperparameter combination of the search grid, leading to m*K fits
Having found the best hyperparameters, you may want to use them for fitting the final model, i.e. 1 more fit
leading to a total of m*K + 1 model fits.
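To make the arithmetic concrete, here is a minimal sketch (the estimator and grid values are illustrative assumptions): with m = 3*2 = 6 hyperparameter combinations and K = 5 folds, scikit-learn reports 30 fits during the search, plus the one refit on the whole data.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(random_state=0)
param_grid = {"n_estimators": [50, 100, 150], "max_depth": [2, 3]}  # m = 3 * 2 = 6

search = GridSearchCV(GradientBoostingClassifier(), param_grid, cv=5, verbose=1)
search.fit(X, y)  # prints "Fitting 5 folds for each of 6 candidates, totalling 30 fits"
# With refit=True (the default), one extra fit on the whole data: m*K + 1 = 31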
Hope this helps...

Machine Learning data preprocessing [closed]

I have a question regarding data preprocessing for machine learning, specifically transforming the data so that it has zero mean and unit variance.
I have split my data into two datasets (I know I should have three, but for the sake of simplicity let's just say I have two). Should I transform the training dataset so that the entire training dataset has zero mean and unit variance, and then, when testing the model, transform each test input vector using those same statistics? Or should I transform the entire dataset (training and testing) together so that the whole thing has zero mean and unit variance? My belief is that I should do the former, so that I don't introduce an undesirable amount of bias into the test dataset. But I am no expert, hence my question.
Fitting your preprocessor should only be done on the training set; the fitted mean and variance transformations are then applied to the test set. Computing these statistics on train and test together leaks information about the test set.
Let me link you to a good course on Deep Learning and show you a citation (both from Andrej Karpathy):
Common pitfall. An important point to make about the preprocessing is that any preprocessing statistics (e.g. the data mean) must only be computed on the training data, and then applied to the validation / test data. E.g. computing the mean and subtracting it from every image across the entire dataset and then splitting the data into train/val/test splits would be a mistake. Instead, the mean must be computed only over the training data and then subtracted equally from all splits (train/val/test).
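In scikit-learn terms, this rule translates to fitting the scaler on the training split only; a minimal sketch, assuming the built-in iris data:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # statistics computed on train only
X_test_scaled = scaler.transform(X_test)        # same train statistics applied to test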
