Say I am using a batch-size of 64 datapoints. During training I update the exponential moving averages for both mean and variance, and use these averages during test time.
I have two test cases:
(1) datapoint-A + 63 other unique datapoints,
(2) datapoint-A repeated 64 times
What I expect to happen:
At test time, the output for datapoint-A should be the same in both cases, since the averaged mean and variance are used for normalization.
What is happening in my implementation:
The output is different for each of the test cases, i.e., the output for each test example depends on the other examples provided in the batch, due to normalization.
Is my expectation incorrect, or is it correct and I need to focus on debugging my implementation?
Normalization should not be adjusted at test time. You need to distinguish between the training and test phases of your network. During training you fit the normalization statistics; once training is finished, you compute the normalization over the whole training set (or at least a representative batch), fix it, and use that fixed normalization in the prediction phase.
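A minimal sketch of that train/test distinction, using a toy 1-D batch-norm layer (the class and parameter names are illustrative, not from any particular framework): during training it normalizes with the current batch statistics and updates exponential moving averages; at test time it only applies the stored averages, so the output for a sample cannot depend on the rest of the batch.

```python
import numpy as np

class SimpleBatchNorm:
    """Toy 1-D batch norm: batch statistics in training, fixed running statistics at test time."""
    def __init__(self, num_features, momentum=0.1, eps=1e-5):
        self.gamma = np.ones(num_features)        # learned scale
        self.beta = np.zeros(num_features)        # learned shift
        self.running_mean = np.zeros(num_features)
        self.running_var = np.ones(num_features)
        self.momentum = momentum
        self.eps = eps

    def forward(self, x, training):
        if training:
            mu, var = x.mean(axis=0), x.var(axis=0)   # statistics of the current batch
            # exponential moving averages, used later at test time
            self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * mu
            self.running_var = (1 - self.momentum) * self.running_var + self.momentum * var
        else:
            mu, var = self.running_mean, self.running_var  # fixed, batch-independent
        return self.gamma * (x - mu) / np.sqrt(var + self.eps) + self.beta

# At test time the output for a sample does not depend on the rest of the batch:
bn = SimpleBatchNorm(3)
a = np.random.randn(1, 3)
batch_1 = np.vstack([a, np.random.randn(63, 3)])   # datapoint-A + 63 other points
batch_2 = np.repeat(a, 64, axis=0)                 # datapoint-A repeated 64 times
assert np.allclose(bn.forward(batch_1, training=False)[0],
                   bn.forward(batch_2, training=False)[0])
```

If the two test cases in the question give different outputs, the layer is still using batch statistics at test time, so the implementation is the place to look.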
Related
Can anyone recommend a more formal method of establishing the optimal number of folds, one that is less than the maximum possible value and does not require time-consuming simulations (which would predictably find the top of the tested range of k values to be the best)?
More info
From theory and simulations we know that model metrics tend to increase (with some variance) as a function of the number of folds k. It is therefore suboptimal to use anything less than the maximum number of folds that is still feasible given data size and time constraints.
So using the standard default values of 5 or 10 folds is in fact an example of hyperparameter optimization too, just one performed collectively, so these values need not be pre-optimized, only chosen according to the time constraints of model training. As a special case, in time-consuming training setups such as deep learning there is no time for multiple folds, so normally only a single validation set is used.
One imperfect solution can be borrowed from PCA scree plots: the so-called elbow point. But it needs formalization, and it requires exactly the simulations over fold numbers that we wanted to avoid.
For example, according to my simulation over hundreds of models (classification on the sklearn breast cancer data), the optimal elbow point would be around 3-5 folds.
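For reference, the kind of simulation referred to above can be sketched roughly as follows (scikit-learn's breast cancer data; the logistic-regression pipeline is just an assumed stand-in model), sweeping the number of folds k and printing the mean score so the elbow can be read off:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Mean cross-validated accuracy as a function of the number of folds k
for k in range(2, 21):
    scores = cross_val_score(model, X, y, cv=k)
    print(f"k={k:2d}  mean accuracy = {scores.mean():.4f}  (std {scores.std():.4f})")
```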
In a post on Quora, someone says:
At test time, the layer is supposed to see only one test data point at a time, hence computing the mean / variance along a whole batch is infeasible (and is cheating).
But as long as the test data have not been seen by the network during training, isn't it OK to use several test images?
I mean, our network has been trained to predict using batches, so what is the issue with giving it batches?
If someone could explain what information our network gets from batches that it is not supposed to have, that would be great :)
Thank you
But as long as the test data have not been seen by the network during training, isn't it OK to use several test images?
First of all, it's ok to use batches for testing. Second, in test mode, batchnorm doesn't compute the mean or variance for the test batch. It takes the mean and variance it already has (let's call them mu and sigma**2), which are computed based solely on the training data. The result of batch norm in test mode is that all tensors x are normalized to (x - mu) / sigma.
At test time, the layer is supposed to see only one test data point at a time, hence computing the mean / variance along a whole batch is infeasible (and is cheating)
I just skimmed through the Quora discussion; maybe this quote has a different context. But taken on its own, it's just wrong. No matter what the batch is, all tensors go through the same transformation, because mu and sigma are not changed during testing, just like all other variables. So there's no cheating there.
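If you want to convince yourself, a small check along these lines (PyTorch here is just one example framework) shows that in eval mode the output for a given sample does not depend on what else is in the batch:

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm1d(4)

# "Train" briefly so the running mean/variance become non-trivial
bn.train()
for _ in range(100):
    bn(torch.randn(64, 4))

# Test mode: the stored running statistics are frozen
bn.eval()
x = torch.randn(1, 4)
batch_a = torch.cat([x, torch.randn(63, 4)])   # x plus 63 other samples
batch_b = x.repeat(64, 1)                      # x repeated 64 times

out_a = bn(batch_a)[0]
out_b = bn(batch_b)[0]
print(torch.allclose(out_a, out_b))            # True: same normalization either way
```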
The claim is very simple: you train your model so that it is useful for some task. In classification the task is usually that you receive a single data point and output its class; there is no batch. Of course, in some practical applications you may have batches (say, many images from the same user, etc.). So that's it: it is application dependent. If you want to claim something "in general" about the learning model, you cannot assume that one has access to batches at test time, that's all.
In Caffe, I am trying to implement a Fully Convolutional Network for semantic segmentation. Is there a specific strategy for setting the 'solver.prototxt' values of the following hyper-parameters:
test_iter
test_interval
iter_size
max_iter
Does it depend on the number of images you have for your training set? If so, how?
In order to set these values in a meaningful manner, you need to have a few more bits of information regarding your data:
1. Training set size: the total number of training examples you have; let's call this quantity T.
2. Training batch size: the number of training examples processed together in a single batch. This is usually set by the input data layer in 'train_val.prototxt'; for example, in this file the train batch size is set to 256. Let's denote this quantity by tb.
3. Validation set size: the total number of examples you set aside for validating your model; let's denote this by V.
4. Validation batch size: the value set in batch_size for the TEST phase. In this example it is set to 50. Let's call this vb.
Now, during training, you would like to get an unbiased estimate of the performance of your net every once in a while. To do so, you run your net on the validation set for test_iter iterations. To cover the entire validation set you need test_iter = V/vb.
How often would you like to get this estimate? It's really up to you. If you have a very large validation set and a slow net, validating too often will make the training process too long. On the other hand, not validating often enough may prevent you from noticing if and when your training process fails to converge. test_interval determines how often you validate: usually for large nets you set test_interval on the order of 5K iterations, while for smaller and faster nets you may choose lower values. Again, it's all up to you.
In order to cover the entire training set (completing an "epoch") you need to run T/tb iterations. Usually one trains for several epochs, thus max_iter=#epochs*T/tb.
Regarding iter_size: this allows you to average gradients over several training mini-batches; see this thread for more information.
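As a sanity check, the arithmetic above is easy to script; the numbers below (T, tb, V, vb and the epoch count) are placeholders to replace with your own values:

```python
# Placeholder numbers: substitute your own dataset and batch sizes.
T = 100_000   # training set size
tb = 256      # training batch size (from train_val.prototxt)
V = 10_000    # validation set size
vb = 50       # validation batch size (batch_size of the TEST phase)
epochs = 30   # desired number of passes over the training data

test_iter = V // vb          # iterations needed to cover the whole validation set once
iters_per_epoch = T // tb    # training iterations per epoch
max_iter = epochs * iters_per_epoch

print("test_iter       =", test_iter)
print("iters per epoch =", iters_per_epoch)
print("max_iter        =", max_iter)
# test_interval remains a judgement call (e.g. ~5K for large nets, or once per epoch).
```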
I am working with a dataset that contains 12 attributes, including a timestamp, plus one attribute as the output. It has about 4000 rows, and there are no duplicate records. I am trying to train a random forest to predict the output. For this purpose I created two different datasets:
ONE: Randomly choose 80% of the data for training and the other 20% for testing.
TWO: Sort the dataset by timestamp, then use the first 80% for training and the last 20% for testing.
Then I removed the timestamp attribute from both datasets and used the other 11 attributes for training and testing (I am sure the timestamp should not be part of the training).
RESULT: I am getting totally different results for these two datasets. For the first one the AUC (area under the curve) is 85%-90% (I repeated the experiment several times), and for the second one it is 45%-50%.
I would appreciate it if someone could help me understand why I see this huge difference.
Also, I need the test dataset to contain the latest timestamps (as in the second experiment). Is there any way to select data from the rest of the dataset for training so as to improve the training?
PS: I already tested random selection from within the first 80% of the timestamps, and it did not improve the performance.
First of all, it is not clear how exactly you're testing. Second, either way, you are doing the testing wrong.
RESULT: I am getting totally different results for these two datasets. For the first one the AUC (area under the curve) is 85%-90% (I repeated the experiment several times), and for the second one it is 45%-50%.
Is this for the training set or the test set? If the test set, that means you have poor generalization.
You are doing it wrong because you are not allowed to tweak your model so that it performs well on the same test set, because it might lead you to a model that does just that, but that generalizes badly.
You should do one of two things:
1. A training-validation-test split
Keep 60% of the data for training, 20% for validation and 20% for testing in a random manner. Train your model so that it performs well on the validation set using your training set. Make sure you don't overfit: the performance on the training set should be close to that on the validation set, if it's very far, you've overfit your training set. Do not use the test set at all at this stage.
Once you're happy, train your selected model on the training set + validation set and test it on the test set you've held out. You should get acceptable performance. You are not allowed to tweak your model further based on the results you get on this test set; if you're not happy, you have to start from scratch.
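A sketch of that split with scikit-learn, assuming X and y are already loaded as arrays of the 11 predictors and the (binary) output, and using a random forest with AUC as in your setup:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# X, y: assumed to be NumPy arrays of the 11 predictor attributes and the output
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

model = RandomForestClassifier(random_state=0)
model.fit(X_train, y_train)

# Tune against the validation set only; the test set stays untouched
val_auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
print(f"validation AUC = {val_auc:.3f}")

# One final evaluation after all model selection is finished
model.fit(np.vstack([X_train, X_val]), np.concatenate([y_train, y_val]))
test_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"test AUC = {test_auc:.3f}")
```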
2. Use cross validation
A popular form is 10-fold cross validation: shuffle your data and split it into 10 groups of equal or almost equal size. For each of the 10 groups, train on the other 9 and test on the remaining one. Average your results on the test groups.
You are allowed to make changes to your model to improve that average score; just run cross validation again after each change (and make sure to reshuffle).
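And a corresponding sketch of shuffled 10-fold cross validation, under the same assumptions about X and y:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# X, y: assumed to be the 11 predictor attributes and the output labels
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)  # shuffle before splitting
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y,
                         cv=cv, scoring="roc_auc")
print(f"mean AUC = {scores.mean():.3f} +/- {scores.std():.3f}")
```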
Personally I prefer cross validation.
I am guessing that what happens is that by sorting based on the timestamp, you make your algorithm generalize poorly. Maybe the 20% you keep for testing differs significantly in some way, and your algorithm is not given a chance to capture this difference? In general, your data should be shuffled randomly in order to avoid such issues.
Of course, you might also have a buggy implementation.
I would suggest you try cross validation and see what results you get then.
I have 20 output neurons in a feed-forward neural network, for which I have already tried varying the number of hidden layers and the number of neurons per hidden layer. When testing, I've noticed that while the outputs are not always exactly the same, they vary very little from test case to test case, especially relative to one another. The network seems to produce nearly the same output (within 0.0005, depending on the initial weights) on every test case; the output that is highest is always the highest. Is there a reason for this?
Note: I'm using a feed-forward neural network with resilient and standard backpropagation, separating training/validation/testing data and shuffling in between training sets.
UPDATE: I'm using the network to categorize patterns from 4 inputs into one of twenty output possibilities. I have 5000 training sets, 800 validation sets, and 1500 testing sets. The number of rounds can vary depending on what I'm doing; in my current training case, the training error seems to converge too quickly (under 20 epochs). However, I have noticed this lack of variance at other times as well, when the error decreases over a period of 1000 epochs. I have also adjusted the learning rate and momentum for regular backpropagation. Resilient propagation does not use a learning rate or momentum for its updates. This is implemented using Encog.
Your dataset seems problematic to begin with. Twenty outputs for 4 inputs seems like too many; the number of outputs is generally much smaller than the number of inputs. Most probably, either the dataset is wrongly formulated or you have misunderstood something about the problem you are trying to solve. Anyway, some remarks regarding your other comments:
First of all, you don't have 5000 training sets; you have one training set with 5000 training patterns. The same goes for validation and testing.
Second, the output can't be exactly the same on each run, since the weights are initialized randomly and the outputs depend on them. However, we want the outputs to be similar from run to run; if they weren't, it would mean they depend too much on the random initialization, and the network wouldn't work well.
In your case, the highest output is the selected category, so if the same output is the highest on every run, your network is working well.
If the network output is almost the same for different input patterns, the network is unable to categorize input well.
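To put a number on that symptom, a check along these lines may help; diagnose_outputs is a hypothetical helper, and outputs is assumed to be the matrix of your network's 20 outputs over the test patterns:

```python
import numpy as np

def diagnose_outputs(outputs):
    """outputs: (n_patterns, n_outputs) array of network outputs on the test patterns."""
    per_output_std = outputs.std(axis=0)   # ~0.0005 everywhere means the outputs barely move
    winner_counts = np.bincount(outputs.argmax(axis=1), minlength=outputs.shape[1])
    print("std of each output across patterns:", np.round(per_output_std, 4))
    print("how often each output wins argmax: ", winner_counts)

# If one output wins on (almost) every pattern and the stds are tiny,
# the network is not discriminating between input patterns.
```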
You say your network has 4 input nodes and 20 output nodes (right?). If the inputs are binary, there are only 2*2*2*2 = 16 different possible input patterns. Why the hell do you need 800 validation sets?
Your training data may be corrupt.