Should deterministic models be trained with a train/test split? - machine-learning

I'm studying the difference between GLM models (OLS, Logistic Regression, Zero Inflated, etc.), which are deterministic, since we can infer the parameters exactly, and some CART models (Random Forest, LightGBM, CatBoost, etc.) that are based on stochastic prediction.
What I've heard is that for stochastic models we should split into train and test sets to avoid over-fitting, something that supposedly does not happen with deterministic models because they use linear programming to find the best parameters.
I'd like to start some discussion about it.
My opinion is that this is true. Deterministic models are just equations being solved, so they should not over-fit the data at all, unlike stochastic models that rely on randomness to make predictions.
But every course I found says to split every dataset, regardless of whether the model is deterministic or not.

There is confusion over multiple concepts in your question.
Should one use train/test set splits for deterministic models? If you are training a model for prediction, absolutely! The important thing to remember is that a prediction model needs to generalize to data other than the one used for training. This is evaluated using the test set. Even if a model is being learned simply as a means to explore the data, this is still recommended as a way to verify that one isn't just overfitting to the noise.
The second point of confusion is the idea that splitting into train and test sets avoids overfitting. This is not true per se. The separation is there so that one can use the test set to verify whether the model is overfitting. If the performance on the train and test sets differs "dramatically", then the model is likely overfitting and needs to be simplified, regularized, or otherwise constrained.
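As a minimal sketch of that check (scikit-learn on a synthetic dataset, purely for illustration), one simply compares the two scores:
```python
# Minimal sketch: the split does not prevent overfitting by itself,
# it only makes overfitting measurable by comparing the two scores.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# An unconstrained tree can memorize the training data.
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

print("train accuracy:", model.score(X_train, y_train))
print("test accuracy:", model.score(X_test, y_test))
# A large gap between the two suggests the model should be simplified,
# regularized, or otherwise constrained.
```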
The other point pertains to what constitutes a stochastic model. All of the CART models that you mention are actually deterministic in the sense that, once you train them, they always yield exactly the same output for the same input. The stochasticity you may have been referring to is that the training uses random initializations, which may result in quite different final models. If this is a concern (because of local optima, for example), then use multiple initializations (a.k.a. multiple restarts, or Monte Carlo runs) to resolve it.
Finally, you mentioned that deterministic models don't need this split because they cannot overfit. This is not true. Consider an SVM classifier with a Gaussian kernel of sufficiently small bandwidth. If solved to optimality, the training is deterministic and will most assuredly overfit the training data.
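To make the last point concrete, here is a hedged sketch (scikit-learn, synthetic data): the SVM below is trained deterministically, yet with a sufficiently small kernel bandwidth (large gamma) and weak regularization it memorizes the training set and generalizes poorly.
```python
# Sketch: a deterministically trained model can still overfit.
# An RBF-kernel SVM with a tiny bandwidth (large gamma) fits the
# training data almost perfectly but fails on held-out data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=10, flip_y=0.1, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

svm = SVC(kernel="rbf", gamma=100.0, C=1e6).fit(X_train, y_train)
print("train accuracy:", svm.score(X_train, y_train))  # close to 1.0
print("test accuracy:", svm.score(X_test, y_test))     # far lower
```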

Related

For large missingness, what are the advantages of imputation versus training on available subsets for random forest?

I want to train a random forest model on a dataset with large missingness. I am aware of the 'standard method', where we impute missing data in the training set, use the same imputation rules to impute the test set, then train a random forest model on the imputed training set and use the same model to predict on the test set (potentially doing it with multiple imputation).
What I want to understand is how this differs from the following method, which I would like to use:
Subset the dataset according to missing patterns. Train random forest models for each of the missing patterns. Use the random forest model trained on missing pattern A to predict data from the test set with missing pattern A. Use the model trained on pattern B to predict data from the test set with pattern B etc.
What is the name for this method? What are the statistical advantages or disadvantages of the two methods? I would very much appreciate if someone could direct me to some literature on the second method, or the comparison of the two.
The difference between the methods lies in prediction capability.
If you train a different model for each missing-data pattern, each model will be trained on a smaller amount of data (because of the separation by missing pattern) and will be used to predict only the corresponding part of the test set. With this approach you can easily miss patterns that are common across your whole dataset, which you would otherwise detect by using all the data.
It still depends heavily on your particular case and your data. A good test of whether the models trained on particular missing patterns generalize well is to take data with another missing pattern, apply a simple and fast imputation (mean/mode/median, etc.), and check the difference in your metric.
In my opinion, this approach sounds a little extreme, as you are voluntarily cutting your training dataset into much smaller parts than it could be. It might perform better on large amounts of data, where the reduction of the training set does not hurt model performance much.
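To make the contrast concrete, here is a rough sketch of the two approaches on a purely synthetic toy dataset (the DataFrame, column names, and missingness pattern are invented for illustration):
```python
# Rough sketch of the two approaches on a toy DataFrame with one
# missingness pattern (column names are made up for illustration).
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "x1": rng.normal(size=1000),
    "x2": rng.normal(size=1000),
    "y": rng.integers(0, 2, size=1000),
})
df.loc[rng.random(1000) < 0.3, "x2"] = np.nan  # introduce missingness in x2

# Approach 1: impute, then fit a single model on all of the data.
imputer = SimpleImputer(strategy="mean")
X_imp = imputer.fit_transform(df[["x1", "x2"]])
rf_all = RandomForestClassifier(random_state=0).fit(X_imp, df["y"])

# Approach 2: one model per missingness pattern (here only two patterns),
# each trained on the rows and columns available for that pattern.
complete = df[df["x2"].notna()]
partial = df[df["x2"].isna()]
rf_complete = RandomForestClassifier(random_state=0).fit(complete[["x1", "x2"]], complete["y"])
rf_partial = RandomForestClassifier(random_state=0).fit(partial[["x1"]], partial["y"])
# Note that rf_partial sees only ~30% of the rows and a single feature:
# this is the data reduction discussed above.
```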
About the articles: I don't know of any that compare these two approaches, but I can suggest some good ones about various "standard" imputation approaches:
https://towardsdatascience.com/how-to-handle-missing-data-8646b18db0d4
https://towardsdatascience.com/6-different-ways-to-compensate-for-missing-values-data-imputation-with-examples-6022d9ca0779

Dropping accuracy rate by adding more predictors

I have run some prediction models, e.g. Logistic Regression, SVM, decision tree, ... on a dataset. When I add more dimensions (predictors), the accuracy rates of all models drop. How can I interpret this?
Usually it means that the features you are adding are either unimportant or strongly correlated with features you already have. Your model might therefore pick up a "random" signal in the training set from these features and then fail to apply it to the test data, because it is not a real pattern.
However, the interpretation of this kind of problem is very model dependent. Linear models do not behave the same way as decision trees (for example, they are more sensitive to correlated features), so it is surprising that they would all react the same way. Please give more details if you can.
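A quick way to see the effect is to append pure-noise columns to a synthetic dataset and watch the test accuracy of a simple model degrade (a small scikit-learn sketch, not your actual setup):
```python
# Sketch: adding uninformative (noise) features tends to lower test accuracy,
# because a flexible model can latch onto spurious patterns in them.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=300, n_features=5, n_informative=5,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for n_noise in (0, 20, 100):
    Xtr = np.hstack([X_train, rng.normal(size=(X_train.shape[0], n_noise))])
    Xte = np.hstack([X_test, rng.normal(size=(X_test.shape[0], n_noise))])
    acc = DecisionTreeClassifier(random_state=0).fit(Xtr, y_train).score(Xte, y_test)
    print(f"{n_noise:3d} noise features -> test accuracy {acc:.3f}")
```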

Why not optimize hyperparameters on train dataset?

When developing a neural net one typically partitions training data into Train, Test, and Holdout datasets (many people call these Train, Validation, and Test respectively. Same things, different names). Many people advise selecting hyperparameters based on performance in the Test dataset. My question is: why? Why not maximize performance of hyperparameters in the Train dataset, and stop training the hyperparameters when we detect overfitting via a drop in performance in the Test dataset? Since Train is typically larger than Test, would this not produce better results compared to training hyperparameters on the Test dataset?
UPDATE July 6 2016
Terminology change, to match the comment below. Datasets are now termed Train, Validation, and Test in this post. I do not use the Test dataset for training. I am using a GA to optimize hyperparameters. At each iteration of the outer GA training process, the GA chooses a new hyperparameter set, trains on the Train dataset, and evaluates on the Validation and Test datasets. The GA adjusts the hyperparameters to maximize accuracy in the Train dataset. Network training within an iteration stops when network overfitting is detected (in the Validation dataset), and the outer GA training process stops when overfitting of the hyperparameters is detected (again in Validation). The result is hyperparameters pseudo-optimized for the Train dataset. The question is: why do many sources (e.g. https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf, Section B.1) recommend optimizing the hyperparameters on the Validation set, rather than the Train set? Quoting from Srivastava, Hinton, et al. (link above): "Hyperparameters were tuned on the validation set such that the best validation error was produced..."
The reason is that developing a model always involves tuning its configuration: for example, choosing the number of layers or the size of the layers (called the hyper-parameters of the model, to distinguish them from the parameters, which are the network’s weights). You do this tuning by using as a feedback signal the performance of the model on the validation data. In essence, this tuning is a form of learning: a search for a good configuration in some parameter space. As a result, tuning the configuration of the model based on its performance on the validation set can quickly result in overfitting to the validation set, even though your model is never directly trained on it.
Central to this phenomenon is the notion of information leaks. Every time you tune a hyperparameter of your model based on the model’s performance on the validation set, some information about the validation data leaks into the model. If you do this only once, for one parameter, then very few bits of information will leak, and your validation set will remain reliable to evaluate the model. But if you repeat this many times—running one experiment, evaluating on the validation set, and modifying your model as a result—then you’ll leak an increasingly significant amount of information about the validation set into the model.
At the end of the day, you’ll end up with a model that performs artificially well on the validation data, because that’s what you optimized it for. You care about performance on completely new data, not the validation data, so you need to use a completely different, never-before-seen dataset to evaluate the model: the test dataset. Your model shouldn’t have had access to any information about the test set, even indirectly. If anything about the model has been tuned based on test set performance, then your measure of generalization will be flawed.
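The leak can be simulated directly. In the toy example below (purely synthetic; the "configurations" are random predictors with no real signal), repeatedly keeping whichever configuration scores best on the validation set produces a validation accuracy well above chance, while the test accuracy of that same configuration stays at chance level:
```python
# Toy simulation of the "information leak": labels are pure noise, so no
# configuration can genuinely beat 50% accuracy, yet the one selected for its
# validation score appears to do so on validation, but not on the test set.
import numpy as np

rng = np.random.default_rng(0)
n_val, n_test, n_configs = 200, 200, 500

y_val = rng.integers(0, 2, n_val)
y_test = rng.integers(0, 2, n_test)

best_val_acc, best_test_acc = 0.0, 0.0
for _ in range(n_configs):
    # Each "configuration" is just a random predictor.
    val_acc = np.mean(rng.integers(0, 2, n_val) == y_val)
    test_acc = np.mean(rng.integers(0, 2, n_test) == y_test)
    if val_acc > best_val_acc:
        best_val_acc, best_test_acc = val_acc, test_acc

print(f"validation accuracy of selected config: {best_val_acc:.3f}")  # well above 0.5
print(f"test accuracy of the same config:       {best_test_acc:.3f}")  # around 0.5
```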
There are two things you are missing here. The first, minor, point is that the test set is never used to do any training; that is the purpose of the validation set (the test set is just there to assess your final performance). The major misunderstanding is about what it means "to use the validation set to fit hyperparameters". It means exactly what you describe: you train a model with given hyperparameters on the training set and use the validation set simply to check whether you are overfitting (you use it to estimate generalization). You do not really "train" on it; you simply check your scores on this subset (which, as you noticed, is much smaller).
You cannot "stop training hyperparameters", because this is not a continuous process. Hyperparameters are usually just "possible sets of values", and you simply have to test lots of them; there is no valid way of defining a direct training procedure between the actual metric you are interested in (like accuracy) and the hyperparameters (like the size of a hidden layer in a NN, or even the C parameter in an SVM), because the functional link between the two is not differentiable, is highly non-convex, and is in general "ugly" to optimize. If you can define a nice optimization procedure in terms of a hyperparameter, then it is usually not called a hyperparameter but a parameter; the crucial distinction in this naming convention is what makes it hard to optimize directly. We call a hyperparameter a parameter that cannot be directly optimized against, so you need a "meta method" (like simply testing on the validation set) to select it.
However, you can define a "nice" meta-optimization protocol for hyperparameters, but it will still use the validation set as an estimator. For example, Bayesian optimization of hyperparameters does exactly this: it tries to fit a function describing how well your model behaves in the space of hyperparameters, but in order to have any "training data" for this meta-method, you need the validation set to estimate it for any given set of hyperparameters (the input to your meta-method).
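For completeness, a minimal version of the protocol described above (scikit-learn, synthetic data; the grid of C values is arbitrary) looks like this:
```python
# Sketch of the protocol: hyperparameters are not "trained"; each candidate is
# fit on the training set, scored on the validation set, and only the final
# choice is reported on the untouched test set.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, random_state=0)
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=0)

best_C, best_val = None, -1.0
for C in (0.01, 0.1, 1.0, 10.0, 100.0):     # candidate hyperparameter values
    model = SVC(C=C).fit(X_train, y_train)  # ordinary parameters trained here
    val_score = model.score(X_val, y_val)   # hyperparameter selected here
    if val_score > best_val:
        best_C, best_val = C, val_score

final_model = SVC(C=best_C).fit(X_train, y_train)
print("chosen C:", best_C, "test accuracy:", final_model.score(X_test, y_test))
```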
simple answer: we do
In the case of a simple feedforward neural network, you do have to select, e.g., the number of layers and units per layer and the regularization (plus non-continuous choices like the topology, if the network is not feedforward, and the loss function) at the beginning, and you would optimize over those.
So, in summary you optimize:
ordinary parameters only during training but not during validation
hyperparameters during training and during validation
It is very important not to touch the many ordinary parameters (weights and biases) during validation. That's because they have thousands of degrees of freedom, which means they can memorize the data you train them on, and then the model doesn't generalize as well to new data (even when that new data comes from the same distribution). The hyperparameters usually have only a few degrees of freedom and typically control the rigidity of the model (regularization).
This holds true for other machine learning algorithms like decision trees, forests, etc as well.

Model selection with dropout training neural network

I've been studying neural networks for a bit and recently learned about the dropout training algorithm. There are excellent papers out there to understand how it works, including the ones from the authors.
So I built a neural network with dropout training (it was fairly easy), but I'm a bit confused about how to perform model selection. From what I understand, it looks like dropout is a method to be used when training the final model obtained through model selection.
As for the test part, the papers always talk about using the complete network with halved weights, but they do not mention how to use it in the training/validation part (at least not the ones I read).
I was thinking about using the network without dropout for the model-selection part. Say that lets me find that the net performs well with N neurons. Then, for the final training (the one I use to train the network for the test part), I would use 2N neurons with dropout probability p = 0.5. That ensures I have exactly N neurons active on average, thus using the network at the right capacity most of the time.
Is this a correct approach?
By the way, I'm aware of the fact that dropout might not be the best choice with small datasets. The project I'm working on has academic purposes, so I don't strictly need the best model for the data, as long as I stick to good machine-learning practice.
First of all, model selection and the training of a particular model are completely different issues. For model selection, you would usually need a data set that is completely independent of both the training set used to build the model and the test set used to estimate its performance. So if you're doing, for example, a cross-validation, you would need an inner cross-validation to do the model selection and an outer cross-validation to estimate the performance of the resulting procedure in general.
To see why, consider the following thought experiment (shamelessly stolen from this paper). You have a model that makes completely random predictions. It has a number of parameters that you can set, but they have no effect. If you try different parameter settings for long enough, you'll eventually get a model that performs better than all the others, simply because you're sampling from a random distribution. If you use the same data for all of these models, this is the model you will choose. If you have a separate test set, it will quickly tell you that there is no real effect, because the parameter setting that achieves good results during the model-building phase is no better on the separate set.
Now, back to neural networks with dropout. You didn't refer to any particular paper; I'm assuming that you mean Srivastava et al., "Dropout: A Simple Way to Prevent Neural Networks from Overfitting". I'm not an expert on the subject, but the method seems to me similar to what's used in random forests or bagging: it mitigates the flaws an individual learner may exhibit by applying it repeatedly in slightly different contexts. If I understood the method correctly, what you essentially end up with is an average over several possible models, very similar to random forests.
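For reference, that "average over several possible models" can be illustrated with a minimal NumPy sketch (not taken from the paper's code; a toy linear layer with invented sizes): at training time random units are dropped, and at test time the full layer is used with its weights scaled by the keep probability p, which approximates averaging over the thinned networks.
```python
# Toy single linear layer illustrating dropout's train/test behaviour.
# (For a linear layer the weight-scaling rule is exact in expectation;
# with nonlinearities it is only an approximation.)
import numpy as np

rng = np.random.default_rng(0)
p = 0.5                       # keep probability (the "halved weights" case)
x = rng.normal(size=8)        # activations of one hidden layer
W = rng.normal(size=(8, 4))   # weights to the next layer

# Training time: sample a dropout mask and use only the surviving units.
mask = rng.random(8) < p
train_out = (x * mask) @ W

# Test time: keep every unit but scale the weights by p.
test_out = x @ (W * p)

# Averaging the training-time output over many masks approaches the test-time output.
avg_out = np.mean([(x * (rng.random(8) < p)) @ W for _ in range(10000)], axis=0)
print(np.round(test_out, 3))
print(np.round(avg_out, 3))
```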
This is a way to make an individual model better, but not for model selection. The dropout is a way of adjusting the learned weights for a single neural network model.
To do model selection on this, you would need to train and test neural networks with different parameters and then evaluate those on completely different sets of data, as described in the paper I've referenced above.
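A rough scikit-learn sketch of that separation, using nested cross-validation (the parameter grid and network sizes are arbitrary): the inner loop does the model selection, and the outer loop estimates how well the whole procedure generalizes.
```python
# Sketch of keeping model selection and performance estimation separate.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=500, random_state=0)

param_grid = {"hidden_layer_sizes": [(10,), (50,)], "alpha": [1e-4, 1e-2]}
inner = GridSearchCV(MLPClassifier(max_iter=500, random_state=0), param_grid, cv=3)

# Each outer test fold is never seen by the inner selection step.
outer_scores = cross_val_score(inner, X, y, cv=5)
print("estimated accuracy of the whole procedure:", outer_scores.mean())
```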

Cross validation with Training/Testing sets

Is it possible to do the evaluation with cross-validation while also using training/testing sets? I understand cross-validation versus holdout evaluation, but I am confused about whether we can combine them.
Both cross-validation and holdout evaluation are widely used for estimating the accuracy (or some other measure of performance) of a model. Typically, if you have the luxury of a large amount of data available, you might use holdout evaluation, but if you are a bit more restricted, you might use cross-validation.
But they can also be used for other purposes - in particular, model selection and optimization - and one might commonly want to do these things as well as estimating the model's accuracy.
For example, you might wish to carry out feature selection on your model (choose among several versions of the model, each of which has been built with a different subset of variables) and then evaluate the final chosen model. For the final evaluation, you might reserve a test set for holdout validation; but in order to choose the best subset of variables, you might compare the accuracies of the models built on each subset, as estimated by cross-validation on the training set.
Other aspects of models could also be optimized using this mixed approach such as, for example, a complexity parameter from a neural network or the ridge parameter from ridge regression.
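A compact sketch of this mixed approach (scikit-learn, with ridge regression and an arbitrary grid of ridge parameters):
```python
# Sketch of the mixed approach: cross-validation on the training set chooses
# the ridge parameter, and a held-out test set gives the final evaluation.
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_regression(n_samples=400, n_features=30, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Model selection: 5-fold cross-validation on the training set only.
search = GridSearchCV(Ridge(), {"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}, cv=5)
search.fit(X_train, y_train)

# Final evaluation: the chosen model is scored once on the held-out test set.
print("chosen alpha:", search.best_params_["alpha"])
print("holdout R^2:", search.score(X_test, y_test))
```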
Hope that helps!
