Why not train for partial epochs? - machine-learning

Nobody ever seems to run their model for say '10.5' epochs. What is the theoretical reason for this?
It is somewhat intuitive to me that if I had a training set of perfectly unique samples, the optimal knee point between undertraining and overtraining should be between full epochs. However, in most cases individual training samples will often be similar/related in one way or another.
Is there a solid statistics based reason? Or else, did anyone empirically investigate?

I dispute the premise: where I work, we often run for partial epochs, although the range is higher for the large data sets: say, 40.72 epochs.
For small data sets or short training, it's a matter of treating each observation with equal weight, so it's natural to think that one needs to process each the same number of times. As you point out, if the input samples are related, then it's less important to do so.
I would think that one base reason is convenience: integers are easier to interpret and discuss.
For many models, there is no knee at optimal training: it's a gentle curve, such that there is almost certainly an integral number of epochs within the "sweet spot" of accuracy. Thus, it's more convenient to find that 10 epochs is a little better than 11, even if the optimal point (found with multiple training runs at tiny differences in iteration count) happens to be 10.2 epochs. Diminishing returns says that if 9-12 epochs give us very similar, good results, we simply note that 10 is the best performance in the range 8-15 epochs, accept the result, and get on with the rest of life.
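As a side note on why fractional epochs are easy to run in practice: training loops usually count optimizer steps rather than epochs. A minimal sketch, assuming mini-batch training where one epoch is ceil(n_samples / batch_size) steps; the function name and the data set size are illustrative assumptions, not something from the answer above.

```python
import math

def steps_for_epochs(epochs, n_samples, batch_size):
    """Convert a (possibly fractional) epoch count into a whole number of optimizer steps."""
    steps_per_epoch = math.ceil(n_samples / batch_size)
    return round(epochs * steps_per_epoch)

# Hypothetical data set: 60,000 samples in batches of 100 gives 600 steps per epoch.
print(steps_for_epochs(10, 60_000, 100))     # 6000
print(steps_for_epochs(10.5, 60_000, 100))   # 6300
print(steps_for_epochs(40.72, 60_000, 100))  # 24432
```

Seen this way, "10.5 epochs" is just a stopping point of 6300 steps, and the choice between whole and fractional epochs is one of reporting convenience.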

Related

Where do # of epochs and batch size belong in the hyperparameter tuning process?

I'm fairly new to machine learning, and working on optimizing hyperparameters for my model. I'm doing this via a randomized search. My question is: should I be searching over # of epochs and batch size along with my other hyperparameters (e.g. loss function, number of layers, etc.)? If not, should I fix these values first, find the other parameters, then return to tune these?
My concern is a) that searching over many epochs will be extremely time-consuming, so leaving it at one low value for the initial scan would be useful and b) that these parameters, esp. # of epochs, will disproportionately affect the results when the model is behaving well, and won't really give me much information about the rest of my architecture, as there should be a regime where more epochs, up to a point, are better. I know this isn't totally accurate, i.e. # of epochs is a real hyperparameter and too many can lead to overfitting issues, for example. Currently, my model is not clearly improving with # of epochs, though it was suggested by someone working on a similar problem within my area of research that this may be mitigated by implementing batch normalization, which is another parameter I am testing. Finally, I am worried that batch size will be quite affected by the fact that I am scaling my data down to 60% to allow my code to run reasonably (and I think the final model will be trained on vastly more data than the simulated data currently available to me).
I agree with your intuition on epochs. It is common to keep this value as low as possible in order to complete more training "experiments" in the same number of working hours. I don't have a great reference here, but I would welcome one in the comments.
For almost everything else, there is a paper by Leslie N. Smith that I can't recommend enough, A disciplined approach to neural network hyper-parameters: Part 1 -- learning rate, batch size, momentum, and weight decay.
As you can see, batch size is included but epochs are not. You will also notice that the model architecture is not included (number of layers, layer size, etc). Neural Architecture Search is a huge research field in its own right, separate from hyper-parameter tuning.
As for the loss function, I can't think of any reason to "tune" that except in the context of an Auxiliary Loss for training only, which I suspect is not what you are talking about.
The loss function that will be applied to your validation or test set is part of the problem statement. That, along with the data, defines the problem you are solving. You don't change it by tuning; you change it by convincing a product manager that your alternative is better for the business need.

Python/SKlearn: Using KFold Results in big ROC_AUC Variations

Based on data that our business department supplied to us, I used the sklearn decision tree algorithm to determine the ROC_AUC for a binary classification problem.
The data consists of 450 rows and there are 30 features in the data.
I used 10 times StratifiedKFold repetition/split of training and test data. As a result, I got the following ROC_AUC values:
0.624
0.594
0.522
0.623
0.585
0.656
0.629
0.719
0.589
0.589
0.592
As I am new in machine learning, I am unsure whether such a variation in the ROC_AUC values can be expected (with minimum values of 0.522 and maximum values of 0.719).
My questions are:
Is such a big variation to be expected?
Could it be reduced with more data (=rows)?
Will the ROC_AUC variance get smaller, if the ROC_AUC gets better ("closer to 1")?
Well, you do k-fold splits to actually evaluate how well your model generalizes.
Therefore, from your current results I would assume the following:
This is a difficult problem, the AUCs are usually low.
0.71 is an outlier, you were just lucky there (probably).
Important questions that will help us help you:
What is the proportion of the binary classes? Are they balanced?
What are the features? Are they all continuous? If categorical, are they ordinal or nominal?
Why Decision Tree? Have you tried other methods? Logistic Regression for instance is a good start before you move on to more advanced ML methods.
You should run more iterations: instead of k-fold, use ShuffleSplit and run at least 100 iterations, then compute the average AUC with 95% confidence intervals. That will give you a better idea of how well the models perform.
Hope this helps!
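The repeated-split idea can be sketched without any library. This is a rough, standard-library-only illustration in the spirit of ShuffleSplit, not the sklearn API itself. Everything in it is an assumption for demonstration: the data are synthetic (450 rows, balanced labels), the `scores` list stands in for a fixed model's predicted scores, and the model is not retrained per split, so it only shows how much the AUC moves because of small test sets. The interval printed is the central 95% range of the per-split AUCs.

```python
import random
import statistics

def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def repeated_split_aucs(labels, scores, n_splits=100, test_fraction=0.1, seed=0):
    """Score n_splits random test subsets, in the spirit of ShuffleSplit."""
    rng = random.Random(seed)
    n_test = int(len(labels) * test_fraction)
    aucs = []
    while len(aucs) < n_splits:
        idx = rng.sample(range(len(labels)), n_test)
        y = [labels[i] for i in idx]
        if len(set(y)) < 2:  # AUC is undefined without both classes
            continue
        aucs.append(roc_auc(y, [scores[i] for i in idx]))
    return aucs

# Synthetic stand-in for 450 rows: balanced labels and a weakly informative score.
data_rng = random.Random(42)
labels = [i % 2 for i in range(450)]
scores = [0.4 * y + data_rng.gauss(0.0, 1.0) for y in labels]

aucs = repeated_split_aucs(labels, scores)
mean_auc = statistics.mean(aucs)
cuts = statistics.quantiles(aucs, n=40)  # cut points at 2.5%, 5%, ..., 97.5%
low, high = cuts[0], cuts[-1]
print(f"mean AUC {mean_auc:.3f}, 95% range [{low:.3f}, {high:.3f}] over {len(aucs)} splits")
```

With only 45 test rows per split, a wide range is to be expected even when the underlying scorer never changes.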
Is such a big variation to be expected?
This is a textbook case of high variance.
Depending on the difficulty of your problem, 405 training samples may not be enough for it to generalize properly, and the decision tree may be too powerful.
Try adding some regularization, by limiting the number of splits that the trees are allowed to make. This should reduce the variance in your model, though you might expect a potentially lower average performance.
Could it be reduced with more data (=rows)?
Yes, adding data is the other popular way of lowering the variance of your model. If you're familiar with deep learning, you'll know that deep models usually need LOTS of samples to learn properly. That's because they are very powerful models with an intrinsically high variance, and therefore a lot of data is needed for them to generalize.
Will the ROC_AUC variance get smaller, if the ROC_AUC gets better ("closer to 1")?
Variance will decrease with regularization and with more data; it has no relation to the actual performance "number" that you get.
Cheers

Is there a rule-of-thumb for how to divide a dataset into training and validation sets? [closed]

Is there a rule-of-thumb for how to best divide data into training and validation sets? Is an even 50/50 split advisable? Or are there clear advantages of having more training data relative to validation data (or vice versa)? Or is this choice pretty much application dependent?
I have been mostly using an 80% / 20% of training and validation data, respectively, but I chose this division without any principled reason. Can someone who is more experienced in machine learning advise me?
There are two competing concerns: with less training data, your parameter estimates have greater variance. With less testing data, your performance statistic will have greater variance. Broadly speaking you should be concerned with dividing data such that neither variance is too high, which is more to do with the absolute number of instances in each category rather than the percentage.
If you have a total of 100 instances, you're probably stuck with cross validation as no single split is going to give you satisfactory variance in your estimates. If you have 100,000 instances, it doesn't really matter whether you choose an 80:20 split or a 90:10 split (indeed you may choose to use less training data if your method is particularly computationally intensive).
Assuming you have enough data to do proper held-out test data (rather than cross-validation), the following is an instructive way to get a handle on variances:
Split your data into training and testing (80/20 is indeed a good starting point)
Split the training data into training and validation (again, 80/20 is a fair split).
Subsample random selections of your training data, train the classifier with this, and record the performance on the validation set
Try a series of runs with different amounts of training data: randomly sample 20% of it, say, 10 times and observe performance on the validation data, then do the same with 40%, 60%, 80%. You should see both greater performance with more data, but also lower variance across the different random samples
To get a handle on variance due to the size of test data, perform the same procedure in reverse. Train on all of your training data, then randomly sample a percentage of your validation data a number of times, and observe performance. You should now find that the mean performance on small samples of your validation data is roughly the same as the performance on all the validation data, but the variance is much higher with smaller numbers of test samples
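The first half of the procedure above (varying the amount of training data) might look like the following. It is only a sketch under stated assumptions: the data are a synthetic one-dimensional two-class problem, the "classifier" is a simple midpoint threshold between the class means, and all function names are made up for this example.

```python
import random
import statistics

def make_data(n, rng):
    """Synthetic 1-D binary problem: class 1 is shifted by +1 relative to class 0."""
    data = []
    for _ in range(n):
        y = rng.randint(0, 1)
        data.append((rng.gauss(float(y), 1.0), y))
    return data

def fit_threshold(train):
    """Nearest-centroid rule in 1-D: threshold halfway between the class means."""
    mean0 = statistics.mean(x for x, y in train if y == 0)
    mean1 = statistics.mean(x for x, y in train if y == 1)
    return (mean0 + mean1) / 2.0, mean1 >= mean0

def accuracy(model, data):
    threshold, one_is_high = model
    hits = sum(((x > threshold) == one_is_high) == (y == 1) for x, y in data)
    return hits / len(data)

rng = random.Random(0)
data = make_data(1000, rng)
train_full, test = data[:800], data[800:]        # 80/20 train/test (test stays untouched here)
train, val = train_full[:640], train_full[640:]  # 80/20 train/validation

results = {}
for fraction in (0.2, 0.4, 0.6, 0.8):
    k = round(len(train) * fraction)
    # Ten random subsamples of the training data, each scored on the same validation set.
    accs = [accuracy(fit_threshold(rng.sample(train, k)), val) for _ in range(10)]
    results[fraction] = (statistics.mean(accs), statistics.stdev(accs))
    print(f"{fraction:.0%} of training data: mean acc {results[fraction][0]:.3f}, "
          f"stdev {results[fraction][1]:.4f}")
```

The second half (variance due to test-set size) follows the same pattern with the roles swapped: fit once on all of `train`, then score repeated random subsamples of `val`.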
You'd be surprised to find out that 80/20 is quite a commonly occurring ratio, often referred to as the Pareto principle. It's usually a safe bet if you use that ratio.
However, depending on the training/validation methodology you employ, the ratio may change. For example: if you use 10-fold cross validation, then you would end up with a validation set of 10% at each fold.
There has been some research into what is the proper ratio between the training set and the validation set:
The fraction of patterns reserved for the validation set should be
inversely proportional to the square root of the number of free
adjustable parameters.
In their conclusion they specify a formula:
Validation set (v) to training set (t) size ratio, v/t, scales like
ln(N/h-max), where N is the number of families of recognizers and
h-max is the largest complexity of those families.
What they mean by complexity is:
Each family of recognizer is characterized by its complexity, which
may or may not be related to the VC-dimension, the description
length, the number of adjustable parameters, or other measures of
complexity.
Taking the first rule of thumb (i.e. the validation set fraction should be inversely proportional to the square root of the number of free adjustable parameters), you can conclude that if you have 32 adjustable parameters, the square root of 32 is ~5.65, so the fraction should be 1/5.65 or 0.177 (v/t). Roughly 17.7% should be reserved for validation and 82.3% for training.
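The arithmetic for the 32-parameter case can be checked in a couple of lines. Note the assumption carried over from the answer: the proportionality constant is taken to be exactly 1, which the quoted rule does not actually state. The function name is made up for this example.

```python
import math

def validation_fraction(n_free_parameters):
    """Rule of thumb with constant 1 (an assumption): 1 / sqrt(number of free parameters)."""
    return 1.0 / math.sqrt(n_free_parameters)

frac = validation_fraction(32)
print(f"validation: {frac:.1%}, training: {1 - frac:.1%}")  # validation: 17.7%, training: 82.3%
```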
Last year, I took Prof. Andrew Ng’s online machine learning course. His recommendation was:
Training: 60%
Cross-validation: 20%
Testing: 20%
Well, you should think about one more thing.
If you have a really big dataset, like 1,000,000 examples, an 80/10/10 split may be unnecessary, because 10% = 100,000 examples may be far more than you need just to say that the model works fine.
Maybe 99/0.5/0.5 is enough, because 5,000 examples can represent most of the variance in your data, and you can easily tell that the model works well based on these 5,000 examples in test and dev.
Don't use 80/20 just because you've heard it's ok. Think about the purpose of the test set.
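To make the absolute numbers concrete, here is a small helper (a hypothetical name, written for this example) that turns dev/test fractions into counts and gives the remainder to training:

```python
def split_sizes(n_examples, dev_fraction, test_fraction):
    """Return (train, dev, test) counts; the training set takes whatever is left."""
    n_dev = round(n_examples * dev_fraction)
    n_test = round(n_examples * test_fraction)
    return n_examples - n_dev - n_test, n_dev, n_test

print(split_sizes(1_000_000, 0.10, 0.10))    # (800000, 100000, 100000)
print(split_sizes(1_000_000, 0.005, 0.005))  # (990000, 5000, 5000)
```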
Perhaps a 63.2% / 36.8% is a reasonable choice. The reason would be that if you had a total sample size n and wanted to randomly sample with replacement (a.k.a. re-sample, as in the statistical bootstrap) n cases out of the initial n, the probability of an individual case being selected in the re-sample would be approximately 0.632, provided that n is not too small, as explained here: https://stats.stackexchange.com/a/88993/16263
For a sample of n=250, the probability of an individual case being selected for a re-sample to 4 digits is 0.6329.
For a sample of n=20000, the probability is 0.6321.
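These figures come from the probability that a given case is drawn at least once in n draws with replacement, 1 - (1 - 1/n)^n, which tends to 1 - 1/e as n grows. A quick check (the function name is just for illustration):

```python
import math

def prob_selected(n):
    """Probability that a given case appears at least once in n draws with replacement from n cases."""
    return 1.0 - (1.0 - 1.0 / n) ** n

for n in (250, 20000):
    print(n, round(prob_selected(n), 4))   # 250 0.6329, then 20000 0.6321
print(round(1.0 - math.exp(-1.0), 4))      # limit for large n: 0.6321
```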
It all depends on the data at hand. If you have a considerable amount of data, then 80/20 is a good choice, as mentioned above. But if you do not, cross-validation with a 50/50 split might help you a lot more and prevent you from creating a model that over-fits your training data.
If you have less data, I suggest trying 70%, 80% and 90% and testing which gives the better result. With 90%, there is a chance that you get poor accuracy on the 10% test set.

Having trouble understanding neural networks

I am trying to use a neural network to solve a problem. I learned about them from the Machine Learning course offered on Coursera, and was happy to find that FANN is a Ruby implementation of neural networks, so I didn't have to re-invent the airplane.
However, I'm not really understanding why FANN is giving me such strange output. Based on what I learned from the class,
I have a set of training data that's results of matches. The player is given a number, their opponent is given a number, and the result is 1 for a win and 0 for a loss. The data is a little noisy because of upsets, but not terribly so. My goal is to find which rating gaps are more prone to upsets - for instance, my intuition tells me that lower-rated matches tend to entail more upsets because the ratings are less accurate.
So I got a training set of about 100 examples. Each example is (rating, delta) => 1/0. So it's a classification problem, but not really one that I think lends itself to a logistic regression-type chart, and a neural network seemed more correct.
My code begins
training_data = RubyFann::TrainData.new(:inputs => inputs, :desired_outputs => outputs)
I then set up the neural network with
network = RubyFann::Standard.new(
  :num_inputs => 2,
  :hidden_neurons => [8, 8, 8, 8],
  :num_outputs => 1)
In the class, I learned that a reasonable default is to have each hidden layer with the same number of units. Since I don't really know how to work this or what I'm doing yet, I went with the default.
network.train_on_data(training_data, 1000, 1, 0.15)
And then finally, I went through a set of sample input ratings in increments and, at each increment, increased delta until the result switched from being > 0.5 to < 0.5, which I took to be about 0 and about 1, although really they were more like 0.45 and 0.55.
When I ran this once, it gave me 0 for every input. I ran it again twice with the same data and got a decreasing trend of negative numbers and an increasing trend of positive numbers, completely opposite predictions.
I thought maybe I wasn't including enough features, so I added (rating**2 and delta**2). Unfortunately, then I started getting either my starting delta or my maximum delta for every input every time.
I don't really understand why I'm getting such divergent results or what Ruby-FANN is telling me, partly because I don't understand the library but also, I suspect, because I just started learning about neural networks and am missing something big and obvious. Do I not have enough training data, do I need to include more features, what is the problem and how can I either fix it or learn how to do things better?
What about playing a little with the parameters? First, I would highly recommend using only two layers; there should be a mathematical proof somewhere that this is enough for many problems. If you have too many neurons, your NN will not have enough epochs to really learn anything, so you can also play with the number of epochs as well as gamma. I think that in your case it's 0.15; if you use a slightly bigger value your NN should learn a little faster (don't be afraid to try 0.3 or even 0.7). The right value of gamma usually depends on the weight intervals or the input normalization.
Your NN most probably shows such different results because each run starts from a new random initialization, so it is a totally different network that learns in a different way from the previous one (different weights will have higher values, so different parts of the NN will learn the same things).
I am not familiar with this library; I am just sharing some experience with NNs. Hope some of this helps.
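The point about initialization can be seen even before any training happens. Below is a generic, library-free sketch of a small one-hidden-layer network with sigmoid units; it does not reproduce FANN's actual initialization scheme, and the sample input and function names are made up for illustration. Two different seeds give two different starting networks, while reusing a seed reproduces the same one.

```python
import math
import random

def init_network(seed, n_inputs=2, n_hidden=8):
    """Randomly initialise a one-hidden-layer network; the seed fixes the starting weights."""
    rng = random.Random(seed)
    # Each weight vector carries its bias as the last element.
    hidden = [[rng.uniform(-1, 1) for _ in range(n_inputs + 1)] for _ in range(n_hidden)]
    output = [rng.uniform(-1, 1) for _ in range(n_hidden + 1)]
    return hidden, output

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(network, inputs):
    hidden, output = network
    activations = [sigmoid(w[-1] + sum(wi * x for wi, x in zip(w, inputs))) for w in hidden]
    return sigmoid(output[-1] + sum(wo * a for wo, a in zip(output, activations)))

sample = (0.5, -0.2)  # hypothetical, already-normalised (rating, delta) pair
for seed in (1, 2, 3):
    print(seed, round(predict(init_network(seed), sample), 3))
```

Training from such different starting points on about 100 noisy examples can easily end in different solutions, which is consistent with the run-to-run differences described in the question.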

One versus rest classifier

I'm implementing an one-versus-rest classifier to discriminate between neural data corresponding (1) to moving a computer cursor up and (2) to moving it in any of the other seven cardinal directions or no movement. I'm using an SVM classifier with an RBF kernel (created by LIBSVM), and I did a grid search to find the best possible gamma and cost parameters for my classifier. I have tried using training data with 338 elements from each of the two classes (undersampling my large "rest" class) and have used 338 elements from my first class and 7218 from my second one with a weighted SVM.
I have also used feature selection to bring the number of features I'm using down from 130 to 10. I tried using the ten "best" features and the ten "worst" features when training my classifier. I have also used the entire feature set.
Unfortunately, my results are not very good, and moreover, I cannot find an explanation why. I tested with 37759 data points, where 1687 of them came from the "one" (i.e. "up") class and the remaining 36072 came from the "rest" class. In all cases, my classifier is 95% accurate BUT the values that are predicted correctly all fall into the "rest" class (i.e. all my data points are predicted as "rest" and all the values that are incorrectly predicted fall in the "one"/"up" class). When I tried testing with 338 data points from each class (the same ones I used for training), I found that the number of support vectors was 666, which is ten less than the number of data points. In this case, the percent accuracy is only 71%, which is unusual since my training and testing data are the exact same.
Do you have any idea what could be going wrong? If you have any suggestions, please let me know.
Thanks!
Test dataset being same as training data implies your training accuracy was 71%. There is nothing wrong about it as the data was possibly not well separable by the kernel you used.
However, one point of concern is that the high number of support vectors suggests probable overfitting.
Not sure if this amounts to an answer - it would probably be hard to give one without actually seeing the data - but here are some ideas regarding the issue you describe:
In general, SVM tries to find a hyperplane that would best separate your classes. However, since you have opted for one-vs-rest classification, you have no choice but to mix all negative cases together (your 'rest' class). This might make the 'best' separation much less fit to solve your problem. I'm guessing that this might be a major issue here.
To verify if that's the case, I suggest trying to use only one other cardinal direction as the negative set, and see if that improves results. In case it does, you can train 7 classifiers, one for each direction. Another option might be to use the multiclass option of libSVM, or a tool like SVMLight, which is able to classify one against many.
One caveat of most SVM implementations is their inability to support big differences between the positive and negative sets, even with weighting. From my experience, weighting factors of over 4-5 are problematic in many cases. On the other hand, since your variety in the negative side is large, taking equal sizes might also be less than optimal. Thus, I'd suggest using something like 338 positive examples, and around 1000-1200 random negative examples, with weighting.
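The class counts in the question make the imbalance concrete. The short calculation below uses only numbers given there; the metric names are standard, and the variable names are illustrative. It shows that a classifier answering "rest" for everything already reaches about 95% accuracy, and how far the original weighted setup is from the 4-5 ratio mentioned above.

```python
n_up, n_rest = 1687, 36072   # test-set class counts from the question

# A classifier that answers "rest" for every point gets all "rest" cases right and no "up" case.
accuracy = n_rest / (n_up + n_rest)
recall_up = 0 / n_up
recall_rest = n_rest / n_rest
balanced_accuracy = (recall_up + recall_rest) / 2

print(f"accuracy          {accuracy:.3f}")           # 0.955
print(f"recall on 'up'    {recall_up:.3f}")          # 0.000
print(f"balanced accuracy {balanced_accuracy:.3f}")  # 0.500

# Negative-to-positive ratios in training: original weighted run vs. the suggested subsample.
print(f"7218 / 338 = {7218 / 338:.1f}")  # 21.4
print(f"1200 / 338 = {1200 / 338:.1f}")  # 3.6
```

Because plain accuracy hides this failure, per-class recall or balanced accuracy is a more informative number to track while tuning.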
A little off your question, I would have considered also other types of classification. To start with, I'd suggest thinking about knn.
Hope it helps :)
