I have a network and I want to update only the first few layers' weights. The catch is that I want to change the number of layers I update dynamically during training.
Is there a way to do this without having to set trainable = False (or requires_grad = False) on each weight individually?
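One possible approach, shown here as a minimal sketch rather than the only way, is to toggle requires_grad per layer (module) and filter which parameters go into the optimizer; the model, the helper name set_trainable_prefix, and the indices below are hypothetical.

import torch
import torch.nn as nn

# Hypothetical model: a small stack of linear layers.
model = nn.Sequential(
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 10),
)

def set_trainable_prefix(model, n_trainable):
    """Keep gradients only for the first n_trainable child modules."""
    for i, child in enumerate(model.children()):
        requires_grad = i < n_trainable
        for p in child.parameters():
            p.requires_grad = requires_grad

# During training you can change n_trainable on the fly, e.g. once per epoch.
set_trainable_prefix(model, n_trainable=2)

# Pass only the currently trainable parameters to the optimizer
# (or rebuild the optimizer whenever the trainable set changes).
optimizer = torch.optim.SGD(
    (p for p in model.parameters() if p.requires_grad), lr=0.01
)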
I am trying to conduct simple feature scaling in PyTorch. For example, I have an image and I want to scale certain pixel values down by 10. I have two options:
Directly divide those features by 10.0 in the __getitem__ function of the dataset used by the dataloader;
Pass the original features into the model's forward function, but scale down the corresponding features before passing them through the trainable layers.
I have conducted several experiments and observed that after the first epoch the validation losses of the two approaches start to diverge slightly, and after a couple of hundred epochs the two trained models differ substantially. Any suggestions on why this happens?
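For concreteness, here is a minimal sketch of the two options described above, assuming a toy dataset and model; the names (ScaledDataset, ScaledModel, mask) and shapes are illustrative, not from the original post.

import torch
from torch import nn
from torch.utils.data import Dataset

# Option 1: scale inside the dataset's __getitem__.
class ScaledDataset(Dataset):
    def __init__(self, images, labels, mask):
        # mask is a boolean tensor marking the pixels to scale down
        self.images, self.labels, self.mask = images, labels, mask

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        x = self.images[idx].clone()
        x[self.mask] = x[self.mask] / 10.0
        return x, self.labels[idx]

# Option 2: scale inside forward(), before any trainable layer.
class ScaledModel(nn.Module):
    def __init__(self, mask):
        super().__init__()
        self.mask = mask
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))

    def forward(self, x):
        x = x.clone()
        x[:, self.mask] = x[:, self.mask] / 10.0  # same scaling, applied per batch
        return self.net(x)

In exact arithmetic both options feed identical values to the trainable layers, so it may be worth checking that the scaling is applied consistently to both training and validation data in the two setups.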
I'm using batch normalization with a batch size of 10 for face detection, and I wanted to know whether it is better to remove the batch norm layers or keep them.
And if it is better to remove them, what can I use instead?
This depends on a few things, the first being the depth of your neural network. Batch normalization is useful for speeding up training when there are a lot of hidden layers. It can decrease the number of epochs it takes to train your model and help regularize your data. By standardizing the inputs to each layer, you reduce the risk of chasing a 'moving target', i.e., your learning algorithm not performing as well as it could.
My advice would be to include batch normalization layers in your code if you have a deep neural network. As a reminder, you should probably include some Dropout in your layers as well.
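As a sketch of that advice (a hypothetical deep convolutional classifier, not the asker's face-detection model), interleaving BatchNorm and Dropout might look like this:

import torch.nn as nn

# Hypothetical conv stack showing where BatchNorm and Dropout typically sit.
model = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1),
    nn.BatchNorm2d(32),
    nn.ReLU(),
    nn.Conv2d(32, 64, kernel_size=3, padding=1),
    nn.BatchNorm2d(64),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Dropout(p=0.5),
    nn.Linear(64, 2),
)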
Let me know if this helps!
Yes, it works for the smaller batch size; it will work even with the smallest possible size you set.
The trick is that the batch size also adds to the regularization effect, not only the batch norm.
I will show you a few pictures:
Both plots are on the same scale, tracking the batch loss. The left-hand side is a module without the batch norm layer (black); the right-hand side is with the batch norm layer.
Note how the regularization effect is evident even for bs=10.
When we set bs=64, the batch loss regularization is super evident. Note the y scale is always [0, 4].
My examination was purely on nn.BatchNorm1d(10, affine=False), without the learnable parameters gamma and beta (i.e., w and b).
This is why, even when you have a low batch size, it makes sense to use the BatchNorm layer.
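For reference, a minimal sketch of the kind of setup described, using nn.BatchNorm1d(10, affine=False); the surrounding layers and data here are made up purely for illustration:

import torch
import torch.nn as nn

# BatchNorm1d without the learnable affine parameters (gamma/beta, i.e. w and b).
bn = nn.BatchNorm1d(10, affine=False)

# Two tiny modules, with and without the norm layer, for comparing the batch loss curves.
with_bn = nn.Sequential(nn.Linear(20, 10), bn, nn.ReLU(), nn.Linear(10, 1))
without_bn = nn.Sequential(nn.Linear(20, 10), nn.ReLU(), nn.Linear(10, 1))

x = torch.randn(10, 20)          # a batch of size 10, as in the question
print(with_bn(x).shape)          # torch.Size([10, 1])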
I'm running an FCN in Keras that uses binary cross-entropy as the loss function. However, I'm not sure how the losses are accumulated.
I know that the loss gets applied at the pixel level, but are the losses for each pixel in the image then summed up to form a single loss per image? Or, instead of being summed, are they averaged?
And furthermore, is the loss of each image simply summed (or is it some other operation) over the batch?
I assume that your question is a general one and not specific to a particular model (if not, can you share your model?).
You are right that if the cross-entropy is used at the pixel level, the results have to be reduced (summed or averaged) over all pixels to get a single value.
Here is an example of a convolutional autoencoder in TensorFlow where this step is explicit:
https://github.com/udacity/deep-learning/blob/master/autoencoder/Convolutional_Autoencoder_Solution.ipynb
The relevant lines are:
loss = tf.nn.sigmoid_cross_entropy_with_logits(labels=targets_, logits=logits)
cost = tf.reduce_mean(loss)
Whether you take the mean or the sum of the cost function does not change where the minimum is. But if you take the mean, the value of the cost function is more easily comparable between experiments when you change the batch size or the image size.
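As a quick numeric sketch (toy predictions and targets, not taken from the linked notebook) of why the mean stays comparable across image sizes while the sum does not:

import numpy as np

def pixelwise_bce(p, y):
    """Binary cross-entropy per pixel for predicted probabilities p and targets y."""
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

rng = np.random.default_rng(0)
for size in (32, 128):                       # two hypothetical image sizes
    y = rng.integers(0, 2, (size, size)).astype(float)
    p = np.clip(rng.random((size, size)), 1e-7, 1 - 1e-7)
    losses = pixelwise_bce(p, y)
    # The sum grows with the number of pixels; the mean stays on the same scale.
    print(size, "sum:", round(losses.sum(), 1), "mean:", round(losses.mean(), 3))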
I am reading Fit generator and data augmentation in Keras, but there are still some things that I am not quite sure about regarding image augmentation in Keras.
(1) In datagen.flow(), we also set a batch_size. I know batch_size is needed if we do mini-batch training, so are these two batch_size values the same? I mean, if we specify batch_size in the flow() generator, are we assuming we will do mini-batch training with the same batch_size?
(2)
Let me assume the size of the training set is 10,000. I guess the only difference between model.fit_generator() and model.fit() at each epoch is that, for the former, we are using 10,000 randomly transformed images rather than the original 10,000. And in the other epochs, we are using another 10,000 images which are totally different from those used in the first epoch, because all the images are randomly generated. Is that right?
It is like we are always using new images at each epoch, which is different from the ordinary case, where the same set of images is used at each epoch.
I am new to this area. Please help!
For the 1st question: the answer is YES.
For the 2nd question: yes, we are always using new images at each epoch, if we use data augmentation in model.fit_generator().
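As a minimal sketch of the setup both answers refer to (the data arrays, augmentation parameters, and tiny model below are placeholders, not from the original question):

import numpy as np
from keras.preprocessing.image import ImageDataGenerator
from keras.models import Sequential
from keras.layers import Flatten, Dense

# Placeholder data standing in for the real 10,000-image training set.
x_train = np.random.rand(10000, 32, 32, 3)
y_train = np.random.randint(0, 2, size=(10000,))

datagen = ImageDataGenerator(rotation_range=15, horizontal_flip=True)

model = Sequential([Flatten(input_shape=(32, 32, 3)), Dense(1, activation='sigmoid')])
model.compile(optimizer='adam', loss='binary_crossentropy')

batch_size = 32  # this batch_size is the mini-batch size used for training
model.fit_generator(
    datagen.flow(x_train, y_train, batch_size=batch_size),
    steps_per_epoch=len(x_train) // batch_size,  # one pass over 10,000 freshly augmented images per epoch
    epochs=10,
)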
For example, for a 3-1-1 layer architecture, if the weights are initialized equally the MLP might not learn well. But why does this happen?
If you only have one neuron in the hidden layer, it doesn't matter. But imagine a network with two neurons in the hidden layer. If they have the same weights for their inputs, then both neurons will always have exactly the same activation; there is no additional information gained by having a second neuron. And in the backpropagation step, those weights change by an equal amount. Hence, in every iteration, those hidden neurons have the same activation.
It looks like you have a typo in your question title. I'm guessing that you mean to ask why the weights of the hidden layer should be random. For the example network you indicate (3-1-1), it won't matter because you only have a single unit in the hidden layer. However, if you had multiple units in the hidden layer of a fully connected network (e.g., 3-2-1), you should randomize the weights, because otherwise all of the weights to the hidden layer will be updated identically. That is not what you want, because each hidden-layer unit would be producing the same hyperplane, which is no different from just having a single unit in that layer.
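To make the symmetry argument concrete, here is a small sketch (a hypothetical 3-2-1 PyTorch network trained on random data) showing that when both hidden units start with identical weights they receive identical updates and stay identical, while the default random initialization breaks this symmetry:

import torch
import torch.nn as nn

def train_steps(hidden_init=None, steps=100):
    torch.manual_seed(0)
    net = nn.Sequential(nn.Linear(3, 2), nn.Tanh(), nn.Linear(2, 1))
    if hidden_init is not None:
        # Give both hidden units identical input weights and biases,
        # and identical output-layer weights (needed for the symmetry to persist).
        with torch.no_grad():
            net[0].weight.fill_(hidden_init)
            net[0].bias.fill_(0.0)
            net[2].weight.fill_(hidden_init)
            net[2].bias.fill_(0.0)
    opt = torch.optim.SGD(net.parameters(), lr=0.1)
    x, y = torch.randn(64, 3), torch.randn(64, 1)
    for _ in range(steps):
        opt.zero_grad()
        loss = ((net(x) - y) ** 2).mean()
        loss.backward()
        opt.step()
    return net[0].weight.detach()

print(train_steps(hidden_init=0.5))  # the two rows (hidden units) remain identical
print(train_steps())                 # with random initialization, the rows differ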