For my thesis, I'm running a four-layer deep network for a sequence-to-sequence translation use case:
150 x Conv(64, 5) x GRU(100) x softmax activation on the last stage, with loss='categorical_crossentropy'.
Training loss and accuracy converge quickly,
whereas validation loss and accuracy seem to be stuck in the val_acc 97 to 98.2 range, unable to go beyond that.
Is my model overfitting?
I have tried a dropout of 0.2 between layers; a rough sketch of the stack is given below.
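For reference, here is a rough Keras sketch of the stack described above. It is purely illustrative: treating 150 as the per-timestep feature size is my reading of the description, and seq_len / num_classes are placeholders, not the real values.
from keras.models import Sequential
from keras.layers import Conv1D, GRU, Dropout, TimeDistributed, Dense

seq_len, feat_dim, num_classes = 20, 150, 50  # placeholders, not the real values

model = Sequential()
model.add(Conv1D(64, 5, padding='same', activation='relu',
                 input_shape=(seq_len, feat_dim)))
model.add(Dropout(0.2))
model.add(GRU(100, return_sequences=True))
model.add(Dropout(0.2))
model.add(TimeDistributed(Dense(num_classes, activation='softmax')))
model.compile(loss='categorical_crossentropy', optimizer='adam',
              metrics=['accuracy'])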
Output after dropout:
Epoch 85/250
[==============================] - 3s - loss: 0.0057 - acc: 0.9996 - val_loss: 0.2249 - val_acc: 0.9774
Epoch 86/250
[==============================] - 3s - loss: 0.0043 - acc: 0.9987 - val_loss: 0.2063 - val_acc: 0.9774
Epoch 87/250
[==============================] - 3s - loss: 0.0039 - acc: 0.9987 - val_loss: 0.2180 - val_acc: 0.9809
Epoch 88/250
[==============================] - 3s - loss: 0.0075 - acc: 0.9978 - val_loss: 0.2272 - val_acc: 0.9774
Epoch 89/250
[==============================] - 3s - loss: 0.0078 - acc: 0.9974 - val_loss: 0.2265 - val_acc: 0.9774
Epoch 90/250
[==============================] - 3s - loss: 0.0027 - acc: 0.9996 - val_loss: 0.2212 - val_acc: 0.9809
Epoch 91/250
[==============================] - 3s - loss: 3.2185e-04 - acc: 1.0000 - val_loss: 0.2190 - val_acc: 0.9809
Epoch 92/250
[==============================] - 3s - loss: 0.0020 - acc: 0.9991 - val_loss: 0.2239 - val_acc: 0.9792
Epoch 93/250
[==============================] - 3s - loss: 0.0047 - acc: 0.9987 - val_loss: 0.2163 - val_acc: 0.9809
Epoch 94/250
[==============================] - 3s - loss: 2.1863e-04 - acc: 1.0000 - val_loss: 0.2190 - val_acc: 0.9809
Epoch 95/250
[==============================] - 3s - loss: 0.0011 - acc: 0.9996 - val_loss: 0.2190 - val_acc: 0.9809
Epoch 96/250
[==============================] - 3s - loss: 0.0040 - acc: 0.9987 - val_loss: 0.2289 - val_acc: 0.9792
Epoch 97/250
[==============================] - 3s - loss: 2.9621e-04 - acc: 1.0000 - val_loss: 0.2360 - val_acc: 0.9792
Epoch 98/250
[==============================] - 3s - loss: 4.3776e-04 - acc: 1.0000 - val_loss: 0.2437 - val_acc: 0.9774
The case you presented is a really complex one. In order to answer the question of whether overfitting is actually happening in your case, you need to answer two questions:
Are the results obtained on the validation set satisfying? The main purpose of a validation set is to give you insight into what will happen when new data arrives. If you are satisfied with the accuracy on the validation set, then you should not think of your model as overfitting too much.
Should I worry about the extremely high accuracy of my model on the training set? You may easily notice that your model is almost perfect on the training set. This could mean that it has learned some patterns by heart. There is usually some noise in your data, and a model that fits the training data perfectly is probably using part of its capacity to learn that noise. To test this, I usually prefer to inspect the positive examples with the lowest scores and the negative examples with the highest scores, since outliers usually end up in these two groups (the model struggles to push them above/below the 0.5 threshold).
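A minimal sketch of that check (purely illustrative; it assumes binary 0/1 labels in y_true and class-1 probabilities in y_score, so adapt it to your label format):
import numpy as np

# indices of the positive examples the model scores lowest
# and the negative examples it scores highest
def hardest_examples(y_true, y_score, k=10):
    pos = np.where(y_true == 1)[0]
    neg = np.where(y_true == 0)[0]
    lowest_scoring_positives = pos[np.argsort(y_score[pos])[:k]]
    highest_scoring_negatives = neg[np.argsort(y_score[neg])[::-1][:k]]
    return lowest_scoring_positives, highest_scoring_negatives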
So, after checking these two concerns, you can tell whether your model overfits. The behaviour you presented is actually quite good, and the likely reason behind it is that there are a few patterns in the validation set which are not properly covered by the training set. But this is something you should always take into account when designing a machine learning solution.
No, this is not overfitting. Overfitting only happens when the training loss is low and the validation loss is high. It can also be seen as a large gap between training and validation accuracy (in the case of classification).
Related
I'm reading back a signal that is a constant-frequency sine wave (f = 1 kHz) with constant phase and changing amplitude.
For example, if the read-back signal is "high", the amplitude will be (say) 1 V.
If the read-back signal is "low", the amplitude will be (say) 0.01 V.
The signal is buried deep in white noise (full spectrum).
How can I improve the SNR to remove the noise while keeping the signal?
Of course I have tried steep band-pass filters at the known frequency.
Any other ideas to remove the noise and get a better SNR?
What does PyTorch SGD do if I feed it the whole dataset and do not specify a batch size? I don't see anything "stochastic" or random in that case.
For example, in the following simple code, I feed the whole dataset (x, y) into the model.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
for epoch in range(5):
    y_pred = model(x_data)
    loss = criterion(y_pred, y_data)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
Suppose there are 100 data pairs (x, y), i.e. x_data and y_data each have 100 elements.
Question: It seems to me that all 100 gradients are calculated before one update of the parameters, so the size of the "mini-batch" is 100, not 1, and there is no randomness. Am I right? At first, I thought SGD meant randomly choosing 1 data point, calculating its gradient, and using that as an approximation of the true gradient from all the data.
The SGD optimizer in PyTorch is just gradient descent. The stochastic part comes from how you usually pass a random subset of your data through the network at a time (i.e. a mini-batch or batch). The code you posted passes the entire dataset through on each epoch before doing backprop and stepping the optimizer, so you're really just doing regular gradient descent.
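If you actually want stochastic (mini-batch) updates, a minimal sketch is to wrap the same data in a DataLoader so that each optimizer.step() sees a random subset; the batch size of 10 below is an arbitrary choice, and x_data, y_data, model and criterion are the same objects as in your question:
import torch
from torch.utils.data import TensorDataset, DataLoader

dataset = TensorDataset(x_data, y_data)
loader = DataLoader(dataset, batch_size=10, shuffle=True)  # random mini-batches each epoch

optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
for epoch in range(5):
    for x_batch, y_batch in loader:
        y_pred = model(x_batch)
        loss = criterion(y_pred, y_batch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()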
I'm learning about regularization in neural networks from the deeplearning.ai course. In the dropout regularization lecture, the professor says that if dropout is applied, the calculated activation values will be smaller than when dropout is not applied (i.e. at test time). So we need to scale the activations up during training in order to keep the testing phase simple.
I understood this fact, but I don't understand how the scaling is done. Here is a code sample which is used to implement inverted dropout.
keep_prob = 0.8  # 0 <= keep_prob <= 1
l = 3  # this code is only for layer 3
# generated numbers that are less than 0.8 are kept: 80% stay, 20% are dropped
d3 = np.random.rand(a[l].shape[0], a[l].shape[1]) < keep_prob
a3 = np.multiply(a3, d3)  # keep only the values where d3 is True
# scale a3 up so that the expected value of the output is not reduced
# (ensures that the expected value of a3 remains the same) - this solves the scaling problem
a3 = a3 / keep_prob
In the above code, why are the activations divided by 0.8, i.e. by the probability of keeping a node in a layer (keep_prob)? A numerical example would help.
I got the answer myself after spending some time understanding inverted dropout. Here is the intuition:
We preserve the neurons in any layer with probability keep_prob. Let's say keep_prob = 0.6. This means shutting down 40% of the neurons in the layer. If the original output of the layer, before shutting down 40% of the neurons, was x, then after applying 40% dropout it is reduced by 0.4 * x, so in expectation it becomes x - 0.4x = 0.6x.
To restore the original output (expected value), we need to divide the output by keep_prob (0.6 here).
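Since the question asked for a numerical example, here is a quick toy check of that expected-value argument (the numbers are made up):
import numpy as np

np.random.seed(0)
a3 = np.ones((1000, 1))                     # pretend every activation is 1.0
keep_prob = 0.6
d3 = np.random.rand(*a3.shape) < keep_prob  # keep roughly 60% of the units
dropped = a3 * d3                           # mean shrinks to about 0.6
scaled = dropped / keep_prob                # mean is back to about 1.0
print(dropped.mean(), scaled.mean())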
Another way of looking at it could be:
TL;DR: Even though due to dropout we have fewer neurons, we want the neurons to contribute the same amount to the output as when we had all the neurons.
With dropout = 0.20, we're "shutting down 20% of the neurons", that's also the same as "keeping 80% of the neurons."
Say the output of the layer is x. After "keeping 80%" it is concretely 0.8 * x. Dividing by keep_prob scales it back to the original value, since (0.8 * x) / 0.8 = x:
x = 0.8 * x # x is 80% of what it used to be
x = x/0.8 # x is scaled back up to its original value
Now, the purpose of the inverting is to ensure that the Z value is not impacted by the reduction in a3 (Coursera).
When we scale down a3 with the dropout mask, we inadvertently also scale down the value of z4 (since z4 = W4 * a3 + b4). To compensate for this scaling, we need to divide by keep_prob to scale it back up (Stack Overflow).
# keep 80% of the neurons
keep_prob = 0.8
d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep_prob
a3 = np.multiply(a3, d3)
# scale it back up
a3 = a3 / keep_prob
# this way the expected value of z4 is not affected
z4 = np.dot(W4, a3) + b4
What happens if you don't scale?
With scaling:
-------------
Cost after iteration 0: 0.6543912405149825
Cost after iteration 10000: 0.061016986574905605
Cost after iteration 20000: 0.060582435798513114
On the train set:
Accuracy: 0.9289099526066351
On the test set:
Accuracy: 0.95
Without scaling:
-------------
Cost after iteration 0: 0.6634619861891963
Cost after iteration 10000: 0.05040089794130624
Cost after iteration 20000: 0.049722351029060516
On the train set:
Accuracy: 0.933649289099526
On the test set:
Accuracy: 0.95
Though this is just a single example with one dataset, I'm not sure if it makes a major difference in shallow neural networks. Perhaps it pertains more to deeper architectures.
I learned from several articles that to compute the gradients for the filters, you just do a convolution with the input volume as the input and the error matrix as the kernel. After that, you subtract the gradients (multiplied by the learning rate) from the filter weights. I implemented this process, but it's not working.
I even tried doing the backpropagation process myself with pen and paper, but the gradients I calculated don't make the filters perform any better. Am I understanding the whole process wrong?
Edit:
I will provide an example of my understanding of the backpropagation in CNNs and the problem with it.
Consider a randomised input matrix for a convolutional layer:
1, 0, 1
0, 0, 1
1, 0, 0
And a randomised weight matrix:
1, 0
0, 1
The output (after applying the ReLU activation) would be:
1, 1
0, 0
The target for this layer is a 2x2 matrix filled with zeros. This way, we know the weight matrix should be filled with zeros also.
Error:
-1, -1
0, 0
By applying the process as stated above, the gradients are:
-1, -1
1, 0
So the new weight matrix is:
2, 1
-1, 1
This is not getting anywhere. If I repeat the process, the filter weights just grow to extremely large values. So I must have made a mistake somewhere. What am I doing wrong?
I'll give you a full example. It's not going to be short, but hopefully you will get it. I'm omitting both the bias and the activation functions for simplicity, but once you get it, it's simple enough to add those too. Remember, backpropagation is essentially the SAME in a CNN as in a simple MLP, but instead of multiplications you have convolutions. So, here's my sample:
Input:
.7 -.3 -.7 .5
.9 -.5 -.2 .9
-.1 .8 -.3 -.5
0 .2 -.1 .6
Kernel:
.1 -.3
-.5 .7
Doing the convolution yields (Result of 1st convolutional layer, and input for the 2nd convolutional layer):
.32 .27 -.59
.99 -.52 -.55
-.45 .64 .13
L2 Kernel:
-.5 .1
.3 .9
L2 activation:
.73 .29
.37 -.63
Here you would have a flatten layer and a standard MLP or SVM to do the actual classification. During backpropagation you'll receive a delta, which for fun let's assume is the following:
-.07 .15
-.09 .02
This will always be the same size as your activation before the flatten layer. Now, to calculate the kernel's delta for the current layer L2, you convolve L1's activation with the above delta. I'm not writing this out again, but the result will be:
.17 .02
-.05 .13
Updating the kernel is done as L2.Kernel -= LR * ROT180(dL2.K), meaning you first rotate the above 2x2 matrix and then update the kernel. This for our toy example turns out to be:
-.51 .11
.3 .9
Now, to calculate the delta for the first convolutional layer, recall that in an MLP you had the following: current_delta * current_weight_matrix. Well, in a conv layer you pretty much have the same thing: you convolve the original kernel (before the update) of the L2 layer with your delta for the current layer, except that this convolution is a full convolution. The result turns out to be:
.04 -.08 .02
.02 -.13 .14
-.03 -.08 .01
With this you move on to the 1st convolutional layer and convolve the original input with this 3x3 delta:
.16 .03
-.09 .16
And update your L1 kernel the same way as above:
.08 -.29
-.5 .68
Then you can start the forward pass over again. The above calculations were rounded to 2 decimal places, and a learning rate of .1 was used for calculating the new kernel values.
TL;DR:
You receive a delta from the layer above.
You calculate the delta that will be passed on to the previous layer as: FullConvolution(Li.W, delta)
You calculate the kernel delta that is used to update the kernel as: Convolution(Li.Input, delta)
Move on to the next layer of the backward pass (the previous layer of the network) and repeat.
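Here is a small numpy/scipy sketch of those three steps on a toy single-channel layer (no bias, no activation). One assumption to note: it uses the cross-correlation convention for the forward pass, so no ROT180 is needed when applying the kernel gradient; with a true (flipped) convolution, as in the worked example above, the rotation comes back in.
import numpy as np
from scipy.signal import correlate2d, convolve2d

x = np.random.randn(4, 4)                  # layer input
k = np.random.randn(2, 2)                  # kernel
out = correlate2d(x, k, mode='valid')      # forward pass -> 3x3 feature map

delta = np.random.randn(*out.shape)        # gradient arriving from the layer above
dK = correlate2d(x, delta, mode='valid')   # kernel gradient: Convolution(Li.Input, delta)
dX = convolve2d(delta, k, mode='full')     # delta for the previous layer: FullConvolution(Li.W, delta)

lr = 0.1
k = k - lr * dK                            # gradient-descent update of the kernel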
I specify a batch size of 500 by doing this in my code:
model.fit(x_train, y_train, validation_data=(x_test, y_test), nb_epoch=100, batch_size=500, verbose=1)
When I run the code, the first batch size is 500, but the batch sizes after that look more like 5,000 or larger. Why does this happen?
The reason I think the batch sizes are larger is that the progress output seems to jump from row 500 to row 6000, which is 5,500 rows.
Epoch 100/100
500/31016 [..............................] - ETA: 0s - loss: 0.1659 - acc: 0.7900
6000/31016 [====>.........................] - ETA: 0s - loss: 0.1679 - acc: 0.7865
11500/31016 [==========>...................] - ETA: 0s - loss: 0.1688 - acc: 0.7850
17000/31016 [===============>..............] - ETA: 0s - loss: 0.1692 - acc: 0.7842
23000/31016 [=====================>........] - ETA: 0s - loss: 0.1694 - acc: 0.7839
29000/31016 [===========================>..] - ETA: 0s - loss: 0.1693 - acc: 0.7841
31016/31016 [==============================] - 0s - loss: 0.1693 - acc: 0.7841 - val_loss: nan - val_acc: 0.6799
This is a really interesting issue. The part of the code responsible for displaying the progress bar is a utility called progbar, and it's defined here. It accepts as a parameter a minimum visual progress update interval, which is set to 0.01 seconds by default. This default is also used when printing the progress bar during fit, and it is probably the reason behind this weird behaviour: the displayed sample count only refreshes after that interval has passed, so several 500-sample batches can complete between two updates and the counter appears to jump by much more than the batch size.
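As a side note, if the jumpy progress bar is distracting, one way to sidestep the display entirely is to log one summary line per epoch instead (same fit call as in the question, only verbose changed):
model.fit(x_train, y_train, validation_data=(x_test, y_test),
          nb_epoch=100, batch_size=500, verbose=2)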