In Machine Learning, is it okay to add the development set to the training set after development?

Usually we train our models on the training set, evaluate them on the development set, make some changes, train and evaluate again, etc. (the development phase), and in the end evaluate once on the test set.
Assume we have little training data. Then, it could make sense to use training AND development set after the development phase. One could estimate hyperparameters as usual and in the end (the final training) add the dev set to the training set, train the model with the previously estimated hyperparameters and evaluate it once on the test set.
Would this be "cheating" in any way? Do people do this, or do they usually leave out the dev set from any training?

I don't think it's cheating in any way. If it improves your model against real-world data and your unseen test data, it should be OK. There are reasons why a training/dev/test split is recommended, but if you have such a small training set, I believe it's a valid strategy. In any case, it's hard to give a definitive answer without knowing more details, such as the nature of the data and the task you would like to accomplish. Another approach you might like to look at is data augmentation.
I'd recommend the following course which covers training/dev/test set distribution, among other things:
https://www.coursera.org/learn/machine-learning-projects

Once you have decided on the hyperparameters using the dev set, you can use train + dev to perform the training again. This method is used quite often.
For example, with GridSearchCV in sklearn, if you use refit=True, the training is performed once more after the hyperparameter search is done. I.e. if cv=4 and refit=True, each hyperparameter candidate is trained 4 times during the search (once per fold), and the best candidate is then trained 1 more time on the complete training set.
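A minimal sketch of that refit behaviour, assuming scikit-learn is installed. The iris data, the SVC estimator and the C grid are illustrative choices of mine, not something from the question:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Illustrative data; the test split is set aside before any tuning.
X, y = load_iris(return_X_y=True)
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Hypothetical grid with three candidate values for C.
param_grid = {"C": [0.1, 1.0, 10.0]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=4, refit=True)
search.fit(X_trainval, y_trainval)

# Each candidate is fitted once per fold, plus one final refit of the
# best candidate on all of X_trainval (train + dev together).
n_candidates = len(search.cv_results_["params"])
n_fits = n_candidates * search.n_splits_ + 1

# The test set is touched exactly once, at the very end.
test_accuracy = search.score(X_test, y_test)
print(search.best_params_, n_fits, test_accuracy)
```

After `fit`, `search.best_estimator_` is the model retrained on the whole train + dev portion, which is the "final training" the question asks about.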

Related

Oversampled train set and test set - machine learning classification

Let's say that I have oversampled my training set after splitting, then I selected the features of interest to be extracted based on the training set analysis.
After this, do I use the oversampled training set together with the testing set to determine the classification performance (accuracy, precision, F1 measure, etc.), OR do I just use the testing set for it?
(Not really a programming question but it's important enough to be clarified imho)
To measure performance reliably you must use the original test set, without any resampling.
This is one of the reasons why the train/test split should always be done first: the test set should be kept "fresh". Resampling the test set would be like cheating, because it makes the problem easier to solve.
Note: in general resampling rarely works, especially with text.
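To make the order of operations concrete, here is a small sketch using only the standard library. The toy data, the deterministic split and the `random_oversample` helper are all hypothetical and only illustrate "split first, resample the training part only":

```python
import random
from collections import Counter

def random_oversample(rows, labels, seed=0):
    """Duplicate minority-class rows at random until every class
    has as many rows as the largest class."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = max(len(group) for group in by_class.values())
    out_rows, out_labels = [], []
    for label, group in by_class.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for row in group + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels

# Hypothetical imbalanced data: every 10th row is a positive.
rows = [[i] for i in range(100)]
labels = [1 if i % 10 == 0 else 0 for i in range(100)]

# 1) Split first (a fixed split here, purely for brevity).
test_rows, test_labels = rows[:20], labels[:20]
train_rows, train_labels = rows[20:], labels[20:]

# 2) Oversample the training part only.
bal_rows, bal_labels = random_oversample(train_rows, train_labels)

train_counts = Counter(bal_labels)   # balanced after oversampling
test_counts = Counter(test_labels)   # original class ratio, untouched
print(train_counts, test_counts)
```

Any metric would then be computed on `test_rows` / `test_labels` only, which still have the original class imbalance.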

(cross) validation of CNN models - when to bring in test data?

For a research project, I am working on using a CNN for defect detection on images of weld beads, using a dataset containing about 500 images. To do so, I am currently testing different models (e.g. ResNet18-50), as well as data augmentation and transfer learning techniques.
After having experimented for a while, I am wondering which way of training/validation/testing is best suited to provide an accurate measurement of performance while at the same time keeping the computational cost at a reasonable level, considering I want to test as many models etc. as possible.
I guess it would be reasonable to perform some sort of cross-validation (cv), especially since the dataset is so small. However, after conducting some research, I could not find a clear best way to apply this. It seems to me that different people follow different strategies, and I am a little confused.
Could somebody maybe clarify:
Do you use the validation set only to find the best performing epoch/weights while training and then directly test every model on the test set (and repeat this k times as a k-fold-cv)?
Or do you find the best performing model by comparing the average validation set accuracies of all models (with k runs per model) and then check its accuracy on the test set? If so, which exact weights do you load for testing, or would one perform another cv for the best model to determine the final test accuracy?
Would it be an option to perform multiple consecutive training-validation-test runs for each model, shuffling the dataset before each run and splitting it into "new" training, validation and test sets to determine an average test accuracy (like Monte Carlo cv, but maybe with fewer runs)?
Thanks a lot!
In your case I would set aside 100 images as your test set. Make sure they resemble what you expect your final model to handle, e.g. they should show objects that are not in the training or validation set, because you probably also want the model to generalize to new weld beads. Then you can do something like 5-fold cross-validation on the remaining 400 images: in each round you randomly or intelligently sample 100 images as your validation set, train your model on the other 300, and validate on those 100. From the resulting scores you can compute a confidence interval on the performance of the model, and this confidence interval is what you use to tune your hyperparameters.
The test set is then useful to predict the performance on novel data, but you should !!!NEVER!!! use it to tune hyperparameters.
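A rough sketch of that splitting scheme, standard library only. It assumes each image shows a distinct object (otherwise you would split by object rather than by image), and the function names are made up for illustration. The t-value 2.776 is the two-sided 95% value for 4 degrees of freedom, i.e. for exactly 5 scores:

```python
import random
import statistics

def split_indices(n_total=500, n_test=100, n_val=100, n_repeats=5, seed=0):
    """Hold out a fixed test set, then draw repeated random
    train/validation splits from the remaining images."""
    rng = random.Random(seed)
    indices = list(range(n_total))
    rng.shuffle(indices)
    test_idx = indices[:n_test]
    pool = indices[n_test:]
    splits = []
    for _ in range(n_repeats):
        shuffled = pool[:]
        rng.shuffle(shuffled)
        # (train indices, validation indices)
        splits.append((shuffled[n_val:], shuffled[:n_val]))
    return test_idx, splits

def mean_with_interval(scores, t_value=2.776):
    """Mean and a t-based confidence interval over the validation scores.
    The default t_value only fits 5 scores; change it for other counts."""
    mean = statistics.mean(scores)
    half_width = t_value * statistics.stdev(scores) / len(scores) ** 0.5
    return mean, mean - half_width, mean + half_width

test_idx, splits = split_indices()
for train_idx, val_idx in splits:
    # Train on train_idx and score on val_idx here; test_idx stays unused.
    pass
```

You would collect one validation score per split, pass the five scores to `mean_with_interval`, and compare hyperparameter settings on those intervals. `test_idx` is only used once, for the final model.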

How to correctly evaluate a neural network model?

I am currently doing hyper-parameter optimization for my neural network.
I have train, dev and test files that were given to me. For my hyper-parameter optimization, I am running a complete training using the train and dev sets. At the end of each training run, I evaluate the given combination of parameters on the test set.
I am choosing the parameters that maximize the score on the test set. My issue is that I feel that it is wrong since I am sort of leaking the test set.
Is this procedure bad? Should I use optunity to maximize the accuracy on the dev set and in the end report a score on the test set?
Typically the validation (dev) set is used to compare models with various hyper-parameters. Once your preferred model is chosen and trained, you run it on the test set to measure its performance.
Your intuition is correct; using the test set to choose model parameters is in a sense using that data to aid in the training procedure, which is not advisable.
The division and usage of train/validation/test sets are discussed in more detail in this post and in this video by Andrew Ng.
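As a toy illustration of that protocol (not the asker's network): below, the "hyper-parameter" is just a decision threshold, and all numbers are made up. The point is only which split is used for which decision:

```python
def accuracy(threshold, xs, ys):
    """Accuracy of the rule 'predict 1 when x >= threshold'."""
    predictions = [1 if x >= threshold else 0 for x in xs]
    return sum(p == y for p, y in zip(predictions, ys)) / len(ys)

# Hypothetical dev and test data.
dev_x, dev_y = [0.1, 0.4, 0.5, 0.7, 0.9], [0, 0, 1, 1, 1]
test_x, test_y = [0.2, 0.3, 0.6, 0.8], [0, 0, 1, 1]

# Hypothetical candidate settings to compare.
candidates = [0.25, 0.45, 0.75]

# Model selection looks at the dev set only.
best = max(candidates, key=lambda t: accuracy(t, dev_x, dev_y))

# The test set is used once, to report the score of the chosen setting.
test_score = accuracy(best, test_x, test_y)
print(best, test_score)
```

If `max` were keyed on the test accuracy instead, the reported `test_score` would no longer be an unbiased estimate, which is exactly the leak described in the question.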

Is it a good practice to use your full data set for predictions?

I know you're supposed to separate your training data from your testing data, but when you make predictions with your model is it OK to use the entire data set?
I assume separating your training and testing data is valuable for assessing the accuracy and prediction strength of different models, but once you've chosen a model I can't think of any downsides to using the full data set for predictions.
You can use the full data for prediction, but you should retain the indexes of the train and test data. Here are the pros and cons:
Pro:
If you retain the indexes of the rows belonging to the train and test data, you only need to predict once (which saves time) to get all results. You can then calculate performance indicators (R2/MAE/AUC/F1/precision/recall etc.) for train and test data separately, after subsetting the actual and predicted values using the train and test indexes.
Cons:
If you calculate a performance indicator for the entire data set (without clearly separating train and test using the indexes), you will get overly optimistic estimates. This happens because the model, having been trained on the train data, gives good results on the train data. Depending on the train/test split percentage, this will give illusory good performance indicator values.
Processing large test data at once may create a memory spike, which can result in a crash in all-objects-in-memory languages like R.
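A small sketch of the "predict once, score by index" idea, with invented numbers in which the model is perfect on the training rows and worse on the test rows:

```python
def mean_absolute_error(actual, predicted, indices):
    """MAE restricted to the rows listed in indices."""
    return sum(abs(actual[i] - predicted[i]) for i in indices) / len(indices)

# Hypothetical targets and one set of predictions for the full data.
actual = [3.0, 5.0, 7.0, 9.0, 11.0, 13.0]
predicted = [3.0, 5.0, 7.0, 9.0, 13.0, 10.0]

# Row indexes retained from the original split.
train_idx = [0, 1, 2, 3]
test_idx = [4, 5]

train_mae = mean_absolute_error(actual, predicted, train_idx)
test_mae = mean_absolute_error(actual, predicted, test_idx)

# Pooling everything hides the gap and looks better than the test score.
pooled_mae = mean_absolute_error(actual, predicted, train_idx + test_idx)
print(train_mae, test_mae, pooled_mae)
```

Here the pooled error sits between the train and test errors, so reporting it alone would overstate how well the model does on unseen rows.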
In general, you're right - when you've finished selecting your model and tuning the parameters, you should use all of your data to actually build the model (exception below).
The reason for dividing data into train and test is that, without out-of-bag samples, high-variance algorithms will do better than low-variance ones, almost by definition. Consequently, it's necessary to split data into train and test parts for questions such as:
deciding whether kernel-SVR is better or worse than linear regression, for your data
tuning the parameters of kernel-SVR
However, once these questions are settled, then, in general, as long as your data is generated by the same process, the more data you use, the better the predictions will be, so you should use all of it.
An exception is the case where the data is, say, non-stationary. Suppose you're training for the stock market, and you have data from 10 years ago. It is unclear that the process hasn't changed in the meantime. You might be harming your prediction, by including more data, in this case.
Yes, there are techniques for doing this, e.g. k-fold cross-validation:
One of the main reasons for using cross-validation instead of using the conventional validation (e.g. partitioning the data set into two sets of 70% for training and 30% for test) is that there is not enough data available to partition it into separate training and test sets without losing significant modelling or testing capability. In these cases, a fair way to properly estimate model prediction performance is to use cross-validation as a powerful general technique.
That said, there may not be a good reason for doing so if you have plenty of data, because it means that the model you're using hasn't actually been tested on real data. You're inferring that it probably will perform well, since models trained using the same methods on less data also performed well. That's not always a safe assumption. Machine learning algorithms can be sensitive in ways you wouldn't expect a priori. Unless you're very starved for data, there's really no reason for it.
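For reference, k-fold cross-validation only needs some index bookkeeping. A minimal sketch (the function name is mine; libraries such as scikit-learn ship their own splitters):

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k folds.
    Every sample lands in exactly one validation fold."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

folds = list(k_fold_indices(10, 3))
for train, validation in folds:
    print(len(train), len(validation))
```

In practice you would shuffle the data first, train one model per fold on `train`, score it on `validation`, and average the k scores.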

Classification accuracy on Weka

I am using the Weka GUI for classification. I am new to Weka and am getting confused by the options
Use training Set
Supplied test set
Cross validation
for training my classification algorithm (for example J48). I trained with 10-fold cross-validation and the accuracy is pretty good (97%). When I test my classifier, the accuracy drops to about 72%. I am so confused. Any tips please? This is how I did it:
I train my model on the training data (For example: train.arff)
I right-click in the Results list on the model I want to save
select Save model and save it for example as j48tree.model
and then
I load the test data (for example: test.arff) via the Supplied test set button
Right-click in the Results list, I selected Load model and choose j48tree.model
I selected Re-evaluate model on current test set
Is the way I do it wrong? Why is the accuracy dropping so badly, from 97% to 72%? Or is doing only the cross-validation with 10 folds enough to train and test the classifier?
Note: my training and testing datasets have the same attributes and labels. The only difference is, I have more data on the testing set which I don't think will be a problem.
I don't think there is any issue with how you use WEKA.
You mentioned that your test set is larger than your training set? What is the split? The usual rule of thumb is that the test set should be 1/4 of the whole dataset, i.e. 3 times smaller than the training set and definitely not larger. This alone could explain the drop from 97% to 72%, which by the way is not so bad for a real-life case.
It will also be helpful to build the learning curve https://weka.wikispaces.com/Learning+curves as it will show whether you have a bias or a variance issue. Judging by your values, it sounds like you have high variance (i.e. too many parameters for your dataset), so adding more examples or changing your split between training and test set will likely help.
Update
I ran a quick analysis of the dataset in question with a random forest and my performance was similar to the one posted by the author. Details and code are available on this GitHub page: http://omdv.github.io/2016/03/10/WEKA-stackoverflow
