I'm running an FCN in Keras that uses binary cross-entropy as the loss function. However, I'm not sure how the losses are accumulated.
I know that the loss gets applied at the pixel level, but then are the losses for each pixel in the image summed up to form a single loss per image? Or instead of being summed up, is it being averaged?
And furthermore, are the losses of the individual images simply summed (or combined by some other operation) over the batch?
I assume that your question is a general one and not specific to a particular model (if not, can you share your model?).
You are right that if the cross-entropy is used at a pixel level, the results have to be reduced (summed or averaged) over all pixels to get a single value.
Here is an example of a convolutional autoencoder in TensorFlow where this step is explicit:
https://github.com/udacity/deep-learning/blob/master/autoencoder/Convolutional_Autoencoder_Solution.ipynb
The relevant lines are:
loss = tf.nn.sigmoid_cross_entropy_with_logits(labels=targets_, logits=logits)
cost = tf.reduce_mean(loss)
Whether you take the mean or the sum of the cost function does not change the minimizer. But if you take the mean, the value of the cost function is more easily comparable between experiments when you change the batch size or image size.
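For the Keras case in the question, here is a minimal sketch (assuming a TensorFlow 2.x / tf.keras setup; the shapes are made up) showing where the reduction happens:

import tensorflow as tf

# Made-up batch of 4 single-channel 8x8 masks and predictions.
y_true = tf.cast(tf.random.uniform((4, 8, 8, 1), maxval=2, dtype=tf.int32), tf.float32)
y_pred = tf.random.uniform((4, 8, 8, 1))

# The per-pixel loss: binary cross-entropy averaged over the channel axis.
per_pixel = tf.keras.losses.binary_crossentropy(y_true, y_pred)   # shape (4, 8, 8)

# With the default reduction the remaining axes (all pixels and the whole
# batch) are then averaged, giving one scalar per batch.
batch_loss = tf.reduce_mean(per_pixel)

So with the default settings the per-pixel losses are averaged, not summed, both within an image and across the batch.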
I am a little bit confused about how to normalize/standardize image pixel values before training a convolutional autoencoder. The goal is to use the autoencoder for denoising, meaning that my training data consists of noisy images with the original non-noisy images as ground truth.
To my knowledge there are two options to pre-process the images:
- normalization
- standardization (z-score)
When normalizing with the MinMax approach (scaling to the range 0-1) the network works fine, but my question here is:
- When using the min max values of the training set for scaling, should I use the min/max values of the noisy images or of the ground truth images?
The second thing I observed when training my autoencoder:
- Using z-score standardization, the loss decreases for the first two epochs, after which it stops at about 0.030 and stays there (it gets stuck). Why is that? With normalization the loss decreases much more.
Thanks in advance,
cheers,
Mike
[Note: This answer is a compilation of the comments above, for the record]
MinMax is really sensitive to outliers and to some types of noise, so it shouldn't be used in a denoising application. You can use the 5% and 95% quantiles instead, or use the z-score (for which ready-made implementations are more common).
For more realistic training, normalization should be performed on the noisy images.
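As a small numpy sketch of the quantile idea (the array names and data are made up; the percentile cutoffs come from the noisy training images and are then reused for the targets):

import numpy as np

# Made-up data: noisy inputs and their clean ground-truth targets.
noisy_train = np.random.rand(100, 64, 64) + 0.3 * np.random.randn(100, 64, 64)
clean_train = np.random.rand(100, 64, 64)

# Use the 5th/95th percentiles instead of min/max so a few outlier
# pixels (e.g. from the added noise) do not dominate the scaling range.
lo, hi = np.percentile(noisy_train, [5, 95])
scaled_noisy = np.clip((noisy_train - lo) / (hi - lo), 0.0, 1.0)
scaled_clean = np.clip((clean_train - lo) / (hi - lo), 0.0, 1.0)   # same lo/hi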
Because the last layer uses sigmoid activation (info from your comments), the network's outputs will be forced between 0 and 1. Hence it is not suited for an autoencoder on z-score-transformed images (because target intensities can take arbitrary positive or negative values). The identity activation (called linear in Keras) is the right choice in this case.
Note however that this remark on activation only concerns the output layer; any activation function can be used in the hidden layers. Rationale: negative values in the output can still be obtained through negative weights multiplying the ReLU outputs of the hidden layers.
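As an illustration in Keras (a rough sketch, not your actual model), the only line that has to change between [0, 1]-scaled targets and z-scored targets is the output activation:

from tensorflow import keras

inputs = keras.Input(shape=(64, 64, 1))
x = keras.layers.Conv2D(16, 3, activation="relu", padding="same")(inputs)   # hidden ReLUs are fine
x = keras.layers.Conv2D(16, 3, activation="relu", padding="same")(x)
# Use activation="sigmoid" here for targets scaled to [0, 1];
# use activation="linear" for z-scored targets, so negative outputs are possible.
outputs = keras.layers.Conv2D(1, 3, activation="linear", padding="same")(x)
model = keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="mse")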
I am relatively new to the subject and have been doing loads of reading. What I am particularly confused about is how a CNN learns its filters for a particular labeled feature in a training data set.
Is the cost calculated by which outputs should or shouldn't be active on a pixel-by-pixel basis? And if that is the case, how does mapping the activations to the labeled data work after downsampling?
I apologize for any poor assumptions or general misunderstandings. Again, I am new to this field and would appreciate all feedback.
I'll break this up into a few small pieces.
Cost calculation -- cost / error / loss depends only on comparing the final prediction (the last layer's output) to the label (ground truth). This serves as a metric of how right or wrong the prediction is.
Inter-layer structure -- Each input to the prediction is an output of the prior layer. This output has a value; the link between the two has a weight.
Back-prop -- Each weight gets adjusted in proportion to the error and to how much it contributed to the prediction. A connection that contributed to a correct prediction gets rewarded: its weight is increased in magnitude. Conversely, a connection that pushed for a wrong prediction gets its weight reduced.
Pixel-level control -- To clarify the terminology: each layer's feature map is a square grid of float values, each of which is often called a "pixel" (an activation). The filter (or kernel) is a smaller square matrix of weights, and it is these weights that are trained individually. The filter slides across the input feature map, and at each position a dot product is taken between the filter and the corresponding square sub-region of the input; the output of each dot product is the value of a single pixel in the next layer's feature map.
When the strength of a pixel in layer N is increased, this effectively increases the influence of the layer N-1 filter that produced it. That filter's weights are, in turn, tuned by the inputs coming from layer N-2.
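A tiny numpy sketch of that sliding dot product (toy 4x4 input, a 2x2 filter, stride 1, no padding; all values are made up):

import numpy as np

feature_map = np.arange(16, dtype=float).reshape(4, 4)   # input "pixels"
kernel = np.array([[1.0, 0.0],
                   [0.0, -1.0]])                          # learned 2x2 filter

output = np.zeros((3, 3))
for i in range(3):
    for j in range(3):
        window = feature_map[i:i + 2, j:j + 2]
        output[i, j] = np.sum(window * kernel)   # one dot product -> one output pixel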
The FaceNet algorithm (described in this article) uses a convolutional neural network to represent an image in a 128-dimensional Euclidean space.
While reading the article I didn't understand:
How does the loss function impact the convolutional network? (In normal networks, the weights are slightly changed via backpropagation in order to minimize the loss; what happens in this case?)
How are the triplets chosen?
2.1. How do I know a negative image is hard?
2.2. Why am I using the loss function to determine the negative image?
2.3. When do I check my images for hardness with respect to the anchor? I believe that is before I send a triplet to be processed by the network, right?
Here are some answers that may clarify your doubts:
Even here the weights are adjusted to minimise the loss; it's just that the loss term is a little more complicated. The loss has two parts (separated by the + in the equation): the first part compares an image of a person with a different image of the same person, and the second part compares the image of the person with an image of a different person. We want the first part of the loss to be smaller than the second, and the loss equation in essence captures that. So here you basically want to adjust the weights such that the same-person error is small and the different-person error is large.
The loss term involves three images: the image in question (the anchor) x_a, its positive pair x_p and its negative pair x_n. The hardest positive of x_a is the same-person image that is furthest from it in the embedding (the one with the biggest error among all the positives). The hardest negative of x_a is the closest image of a different person. So you want to bring the furthest positives closer together and push the closest negatives further away. This is captured in the loss equation.
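For reference, a short sketch of the triplet loss over a batch of embeddings (alpha is the margin; the function name and shapes are illustrative, not taken from the paper's code):

import tensorflow as tf

def triplet_loss(anchor, positive, negative, alpha=0.2):
    # Squared L2 distances from the anchor to its positive and negative.
    pos_dist = tf.reduce_sum(tf.square(anchor - positive), axis=-1)
    neg_dist = tf.reduce_sum(tf.square(anchor - negative), axis=-1)
    # The loss is zero once the negative is at least `alpha` further
    # away than the positive; otherwise the remaining gap is penalised.
    return tf.reduce_mean(tf.maximum(pos_dist - neg_dist + alpha, 0.0))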
FaceNet selects its triplets during training (online). In each minibatch (which is a set of 40 images) they select the hardest negative for the anchor and, instead of choosing the hardest positive image, they choose all anchor-positive pairs within the batch.
If you are looking to implement face recognition, you might also consider this paper, which uses centre loss; it is much easier to train and has been shown to perform better.
I have built an FCN for image segmentation. The object to be segmented covers only very few pixels relative to the image size (1024x1024). As a result the accuracy is very high, even if I only train with 10 images instead of 18,000 (my full training set).
My approach to solving this is to use some kind of weighted accuracy, so that the metric actually says something about the performance in identifying the small object (right now the accuracy gets high simply because so many pixels are not the object, so even without classifying anything as the object the accuracy is still high).
How do I decide the weight? Anybody with some experience?
As you wrote, use a custom weight function which penalizes misclassification of the underrepresented pixels more. You can derive the weight from the fraction of object pixels among all pixels in the image (weighting the object class by the inverse of that fraction), or you can tune it by hand; just make sure you follow a metric that tells you the accuracy on the object pixels. Hope it helps.
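A sketch of that idea in tf.keras (assuming binary masks of shape (batch, H, W, 1); the helper name and the example fraction are made up):

import tensorflow as tf

def make_weighted_bce(object_fraction):
    # Weight the rare object class by the inverse of its pixel frequency,
    # measured on the training masks (e.g. object_fraction = 0.001).
    w_object = 1.0 / object_fraction
    def loss(y_true, y_pred):
        bce = tf.keras.losses.binary_crossentropy(y_true, y_pred)         # (batch, H, W)
        weights = tf.where(tf.equal(y_true[..., 0], 1.0), w_object, 1.0)  # (batch, H, W)
        return tf.reduce_mean(weights * bce)
    return loss

# model.compile(optimizer="adam", loss=make_weighted_bce(object_fraction=0.001))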
You can use infogain loss layer for a "weighted" loss.
The infogain loss is a generalization of the cross entropy loss commonly used. It is defined using a weight matrix H (of size L-by-L, where L is the number of classes):
L(p) = -sum_l H[y, l] * log(p[l])
where p is the vector of predicted class probabilities and y is the true class label. With a diagonal H this reduces to a per-class weighted cross-entropy.
You can find more details on this loss here.
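A small numpy sketch of that definition for a single pixel (with a diagonal H it reduces to weighting each class's errors; the numbers are made up):

import numpy as np

H = np.diag([1.0, 50.0])           # weight errors on class 1 (the object) 50x more
p = np.array([0.9, 0.1])           # predicted class probabilities for one pixel
y = 1                              # true class label

loss = -np.sum(H[y] * np.log(p))   # here: -50 * log(0.1)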
I have input (r, c) in the range (0, 1] as the coordinates of a pixel of an image, and its color, which is 1 or 2 only.
I have about 6,400 pixels.
My attempt at fitting X=(r,c) to y=color was a failure; the accuracy won't go higher than 70%.
Here's the image:
The first is the actual image, the second is the image I train on (it has only 2 colors), and the last is the image that the neural network generated after training about 500 weights for 50 iterations. The input layer has 2 units, there is one hidden layer of size 100, and the output layer has 2 units. (For binary classification like this I may need only one output unit, but I am just preparing for multi-class classification.)
The classifier failed to fit the training set; why is that? I tried generating high-order polynomial terms of those 2 features, but it doesn't help. I tried using a Gaussian kernel with 20-100 random landmarks on the picture to add more features, and got similar output. I tried logistic regression; it doesn't help either.
Please help me increase the accuracy.
Here's the input: input.txt (you can load it into Octave; the variables are coordinate (the r, c features) and idx (the color)).
You can try plotting it first to make sure you understand the input, then try training on it and tell me if you get a better result.
Your problem is hard to model. You are trying to fit a function from R^2 to R that has a lot of complexity: lots of "spikes" and lots of disconnected regions (pixels that are completely separated from the rest). This is not an easy problem, and not a very useful one either. In order to overfit your network to such a setting you will need plenty of hidden units. So what are the options for doing that?
General things that are missing from the question but are important:
Your output variable should be in {0, 1} if you are fitting your network with a cross-entropy cost (log-likelihood), which is what you should use for classification.
50 iterations (if you mean mini-batch iterations) is orders of magnitude too small, unless you mean 50 epochs (passes over the whole training set).
Actual things that will probably need to be done (at least one of the below):
I assume that you are using ReLU activations (or tanh, hard to say from the output). You could instead use RBF activations and increase the number of hidden neurons to ~5000.
If you do not want to go with RBFs, then you will need 1-2 additional hidden layers to fit a function of this complexity. Try an architecture of the type 100-100-100 instead (a sketch follows after this list).
If the above fails, increase the number of hidden units; that's all you really need, enough capacity.
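To make the 100-100-100 suggestion concrete, a Keras sketch (the (r, c) inputs and {0, 1} labels are as discussed above; all hyperparameters are illustrative):

from tensorflow import keras

model = keras.Sequential([
    keras.Input(shape=(2,)),                        # the (r, c) coordinates
    keras.layers.Dense(100, activation="tanh"),
    keras.layers.Dense(100, activation="tanh"),
    keras.layers.Dense(100, activation="tanh"),
    keras.layers.Dense(1, activation="sigmoid"),    # P(color == 1)
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(coords, labels, epochs=500, batch_size=256)   # far more than 50 iterations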
In general: neural networks are not designed for working with low-dimensional datasets. This is a nice example from the web showing that you can learn a pixel-position-to-color mapping, but it is completely artificial and seems to actually harm people's intuitions.