How to correctly evaluate a neural network model?

How to correctly evaluate a neural network model? - machine-learning

I am currently doing hyper-parameter optimization for my neural network.
I have train, dev and test file that were given to me. For my hyper-parameter optimization, I am running a complete training using the train and dev sets. In the end, I evaluate on the test set of the training for a given combination of parameters.
I am choosing the parameters that maximize the score on the test set. My issue is that I feel that it is wrong since I am sort of leaking the test set.
Is this procedure bad? Should I use optunity to maximize the accuracy on the dev set and in the end report a score on the test set?

Typically the validation (dev) set is used to compare models with various hyper-parameters. Once your preferred model is chosen and trained, you run it on the test set to measure its performance.
Your intuition is correct; using the test set to chose model parameters is in a sense using that data to aid in the training procedure, which is not advisable.
The division and usage of train/validation/test sets are discussed in more detail in this post and in this video by Andrew Ng.

Related

train, dev set and test set advices for implementation and hyperameter tuning

I have some doubts about the implementation and tuning of parameters and hyperparameters by using the classic train, validation and test set. So it would be of great help if somebody could clarify me these concepts and bring me some hints for its implementation in a language like Python.
For example, if I have a Neural Network, for what I know the parameter tuning (lets consider the number of hidden layers and neurons per layer), could be tuned with the training set. So when it comes to the validation set, which is approximately 20% of the dataset, I can tune my hyperparameters with the following algorithm:
Example: Tuning batch size and learning rate:
hyperListB=[]
hyperListL=[]
//let´s suppose both lists have the same dimensions
for i in range(0,hyperListB):
model=fit(train_set,hyperListB[i],hyperlistL[i]
values[].add(evaluate(model,validation_set) //add scores of each run
end for
for i in range(0,values):
plot_loss_functions(values)
select best set of hyperparameters
model=fit(test_set, selecter_hyperparameters)
evaluate(model)
would this sequence of steps be correct? I have searched thru different pages and did not find something that could help me with this. Please, bear in mind that I do not want to use cross-validation or other library-based techniques such as GridSearchCV.
Thanks

In a Train validation test split, the fit method on the train data.
Validation data is used for hyperparameter tuning. A set of hyperparameters is selected and the model is trained on the train set. Then this model will be evaluated on the validation set. This is repeated until all permutations of the different hyperparameters have been exhausted.
The best set of hyperparameters are the ones that gave the best result on the validation set. This method is called Grid search.
The test set is used to evaluate the model with the best hyperparameters selected. THis gives the final unbiased accuracy and loss.
The fit method will never be called on the validation or test set.
your example will look like:
hyperListB=[]
hyperListL=[]
//let´s suppose both lists have the same dimensions
for hyperB in hyperListB:
for hyperL in hyperListL:
model=fit(train_set,hyperB,hyperL)
values[].add(evaluate(model,validation_set) //add scores of each run
end for
end for
for i in range(0,values):
plot_loss_functions(values)
select best set of hyperparameters
evaluate(model,test_set)

How azure ML give an output for a value which is not used when training the model?

I am trying to predict the price of a house. Therefore I added no-of-rooms as one variable to get the prediction. Previous values for that variable was (3,2,1) when I was training the model. Now I am adding no-of-rooms as "6" to get an output(which was not use before to get the predicted value). How will it give the output for a new value?Is it only consider the variables except no-of-rooms ? I used Boosted decision tree regression as the model.

The short answer is that when you train your model on a set of features and then use a test set to run predictions, yes it will be able to utilize/understand feature values that the model hasn't previously seen during training. If you have large outliers in your test set that would differ significantly from what the model saw during training, it will affect accuracy, but it will still attempt a prediction.
This is less of a Azure Machine Learning question and more machine learning basics (or really just the basics of how regression works). I would do some research on both "linear regression", and the concept of "over-fitting in machine learning". These are two very basic conceptual topics that will help with your understanding. Understanding regression will help you see why a model can use a value it hasn't previously seen to create a prediction.

In Machine Learning, is it okay to add the development set to the training set after development?

Usually we train our models on the training set, evaluate them on the development set, make some changes, train and evaluate again, etc. (the development phase), and in the end evaluate once on the test set.
Assume we have little training data. Then, it could make sense to use training AND development set after the development phase. One could estimate hyperparameters as usual and in the end (the final training) add the dev set to the training set, train the model with the previously estimated hyperparameters and evaluate it once on the test set.
Would this be "cheating" in any way? Do people do this, or do they usually leave out the dev set from any training?

I don't think it's cheating in any way. If it improves your model against real world data and your unseen test data, it should be ok. There are reasons why a training/dev/test set is recommended, but if you have such small training data set, I believe it's a valid strategy. In any case, it's hard to have definitive answer without knowing more details such as nature of data and the task you would like to accomplish. Another approach you might like to have look is data augmentation.
I'd recommend the following course which covers training/dev/test set distribution, among other things:
https://www.coursera.org/learn/machine-learning-projects

Once you decided on the hyperparameter using the dev set, you can use the train + dev to perform the training again. This is method is used quite often.
For example with using GridSearchCV method in sklearn, if you use refit=True, this would perform the training after the hyperparameter search is done. i.e. if cv=4 and refit=True, the model performs training 5 times, (4 times for searching best hyperparameters) + (1 for the final training using the complete training set)

Why not optimize hyperparameters on train dataset?

When developing a neural net one typically partitions training data into Train, Test, and Holdout datasets (many people call these Train, Validation, and Test respectively. Same things, different names). Many people advise selecting hyperparameters based on performance in the Test dataset. My question is: why? Why not maximize performance of hyperparameters in the Train dataset, and stop training the hyperparameters when we detect overfitting via a drop in performance in the Test dataset? Since Train is typically larger than Test, would this not produce better results compared to training hyperparameters on the Test dataset?
UPDATE July 6 2016
Terminology change, to match comment below. Datasets are now termed Train, Validation, and Test in this post. I do not use the Test dataset for training. I am using a GA to optimize hyperparameters. At each iteration of the outer GA training process, the GA chooses a new hyperparameter set, trains on the Train dataset, and evaluates on the Validation and Test datasets. The GA adjusts the hyperparameters to maximize accuracy in the Train dataset. Network training within an iteration stops when network overfitting is detected (in the Validation dataset), and the outer GA training process stops when overfitting of the hyperparameters is detected (again in Validation). The result is hyperparameters psuedo-optimized for the Train dataset. The question is: why do many sources (e.g. https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf, Section B.1) recommend optimizing the hyperparameters on the Validation set, rather than the Train set? Quoting from Srivasta, Hinton, et al (link above): "Hyperparameters were tuned on the validation set such that the best validation error was produced..."

The reason is that developing a model always involves tuning its configuration: for example, choosing the number of layers or the size of the layers (called the hyper-parameters of the model, to distinguish them from the parameters, which are the network’s weights). You do this tuning by using as a feedback signal the performance of the model on the validation data. In essence, this tuning is a form of learning: a search for a good configuration in some parameter space. As a result, tuning the configuration of the model based on its performance on the validation set can quickly result in overfitting to the validation set, even though your model is never directly trained on it.
Central to this phenomenon is the notion of information leaks. Every time you tune a hyperparameter of your model based on the model’s performance on the validation set, some information about the validation data leaks into the model. If you do this only once, for one parameter, then very few bits of information will leak, and your validation set will remain reliable to evaluate the model. But if you repeat this many times—running one experiment, evaluating on the validation set, and modifying your model as a result—then you’ll leak an increasingly significant amount of information about the validation set into the model.
At the end of the day, you’ll end up with a model that performs artificially well on the validation data, because that’s what you optimized it for. You care about performance on completely new data, not the validation data, so you need to use a completely different, never-before-seen dataset to evaluate the model: the test dataset. Your model shouldn’t have had access to any information about the test set, even indirectly. If anything about the model has been tuned based on test set performance, then your measure of generalization will be flawed.

There are two things you are missing here. First, minor, is that test set is never used to do any training. This is a purpose of validation (test is just to asses your final, testing performance). The major missunderstanding is what it means "to use validation set to fit hyperparameters". This means exactly what you describe - to train a model with a given hyperparameters on the training set, and use validation to simply check if you are overfitting (you use it to estimate generalization) , but you do not really "train" on them, you simply check your scores on this subset (which, as you noticed - is way smaller).
You cannot "stop training hyperparamters" because this is not a continuous process, usually hyperparameters are just "possible sets of values", and you have to simply test lots of them, there is no valid way of defining a direct trainingn procedure between actual metric you are interested in (like accuracy) and hyperparameters (like size of the hidden layer in NN or even C parameter in SVM), as the functional link between these two is not differentiable, is highly non convex and in general "ugly" to optimize. If you can define a nice optimization procedure in terms of a hyperparameter than it is usually not called a hyperparameter but a parameter, the crucial distinction in this naming convention is what makes it hard to optimize directly - we call hyperparameter a parameter, than cannot be directly optimized against thus you need a "meta method" (like simply testing on validation set) to select it.
However, you can define a "nice" meta optimization protocol for hyperparameters, but this will still use validation set as an estimator, for example Bayesian optimization of hyperparameters does exactly this - it tries to fit a function saying how well is you model behaving in the space of hyperparameters, but in order to have any "training data" for this meta-method, you need validation set to estimate it for any given set of hyperparameters (input to your meta method)

simple answer: we do
In the case of a simple feedforward neural network you do have to select e.g. layer and unit count per layer, regularization (and non-continuous parameters like topology if not feedforward and loss function) in the beginning and you would optimize on those.
So, in summary you optimize:
ordinary parameters only during training but not during validation
hyperparameters during training and during validation
It is very important not to touch the many ordinary parameters (weights and biases) during validation. That's because there are thousands of degrees of freedom in them which means they can learn the data you train them on. But then the model doesn't generalize to new data as well (even when that new data originated from the same distribution). You usually only have very few degrees of freedom in the hyperparameters which usually control the rigidity of the model (regularization).
This holds true for other machine learning algorithms like decision trees, forests, etc as well.

Classfication accuracy on Weka

I am using Weka GUI for a classification. I am new to Weka and getting confused with the options
Use training Set
Supplied test set
Cross validation
to train my classification algorithm (for example J48), I trained with cross validation 10 folds and the accuracy is pretty good (97%). When I test my classification - the accuracy drops to about 72%. I am so confused. Any tips please? This is how I did it:
I train my model on the training data (For example: train.arff)
I right-click in the Results list on the item which model you want to save
select Save model and save it for example as j48tree.model
and then
I load the test data (for example: test.arff via the Supplied test set button
Right-click in the Results list, I selected Load model and choose j48tree.model
I selected Re-evaluate model on current test set
Is the way i do it wrong? Why the accuracy miserably dropping to 72% from 97%? Or is doing only the cross-validation with 10 folds is enough to train and test the classifier?
Note: my training and testing datasets have the same attributes and labels. The only difference is, I have more data on the testing set which I don't think will be a problem.

I don't think there is any issue with how you use WEKA.
You mentioned that you test set is larger than training? What is the split? The usual rule of thumb is that test set should be one 1/4 of the whole dataset, i.e. 3 times smaller than training and definitely not larger. This alone could explain the drop from 97% to 72% which is by the way not so bad for real life case.
Also it will be helpful if you build the learning curve https://weka.wikispaces.com/Learning+curves as it will explain whether you have a bias or variance issue. Judging by your values sounds like you have a high variance (i.e. too many parameters for your dataset), so adding more examples or changing your split between training and test set will likely help.
Update
I ran a quick analysis of the dataset at question by randomforest and my performance was similar to the one posted by author. Details and code are available on gitpage http://omdv.github.io/2016/03/10/WEKA-stackoverflow

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart