Weighted accuracy for image segmentation using FCN - machine-learning

I have built a FCN for image segmentation. The object to be segmented is only very few pixels relatively to the image size (1024x1024). This results in that the accuracy is very high, even if I only train with 10 images instead of 18000 (my full training set).
My approach to solve this is to use some kind of weighted accuracy, so that the accuracy actually say something about the performance of identifying the small object (now it gets high accuracy since so many pixels are not the object and by not classifying anything the accuracy still gets high).
How do I decide the weight, anybody with some experience?

As you wrote, use a custom weight function which penalizes misclassification of underrepresented pixels more. You can get the weight by calculating the quotient between the number of object pixels versus all of the pixels in the image, or you can try it by hand - just make sure you follow the metrics which tell you the accuracy of object pixels. Hope it helps.

You can use infogain loss layer for a "weighted" loss.
The infogain loss is a generalization of the cross entropy loss commonly used. It is defined using a weight matrix H (of size L-by-L, where L is the number of classes):
L(p) = -H log(p)
Where p is a vector of class probabilities.
You can find more details on this loss here.

Related

Data normalization Convolutional Autoencoders

Iam a little bit confused about how to normalize/standarize image pixel values before training a convolutional autoencoder. The goal is to use the autoencoder for denoising, meaning that my traning images consists of noisy images and the original non-noisy images used as ground truth.
To my knowledge there are to options to pre-process the images:
- normalization
- standarization (z-score)
When normalizing using the MinMax approach (scaling between 0-1) the network works fine, but my question here is:
- When using the min max values of the training set for scaling, should I use the min/max values of the noisy images or of the ground truth images?
The second thing I observed when training my autoencoder:
- Using z-score standarization, the loss decreases for the two first epochs, after that it stops at about 0.030 and stays there (it gets stuck). Why is that? With normalization the loss decreases much more.
Thanks in advance,
cheers,
Mike
[Note: This answer is a compilation of the comments above, for the record]
MinMax is really sensitive to outliers and to some types of noise, so it shouldn't be used it in a denoising application. You can use quantiles 5% and 95% instead, or use z-score (for which ready-made implementations are more common).
For more realistic training, normalization should be performed on the noisy images.
Because the last layer uses sigmoid activation (info from your comments), the network's outputs will be forced between 0 and 1. Hence it is not suited for an autoencoder on z-score-transformed images (because target intensities can take arbitrary positive or negative values). The identity activation (called linear in Keras) is the right choice in this case.
Note however that this remark on activation only concerns the output layer, any activation function can be used in the hidden layers. Rationale: negative values in the output can be obtained through negative weights multiplying the ReLU output of hidden layers.

How is cross entropy calculated for pixel level prediction

I'm running a FCN in Keras that uses the binary cross-entropy as the loss function. However, im not sure how the losses are accumulated.
I know that the loss gets applied at the pixel level, but then are the losses for each pixel in the image summed up to form a single loss per image? Or instead of being summed up, is it being averaged?
And furthermore, are the loss of each image simply summed(or is it some other operation) over the batch?
I assume that you question is a general one, and to specific to a particular model (if not can you share your model?).
You are right that if the cross-entropy is used at a pixel level, the results have to be reduced (summed or averaged) over all pixels to get a single value.
Here is an example of a convolutional autoencoder in tensorflow where this step is specific:
https://github.com/udacity/deep-learning/blob/master/autoencoder/Convolutional_Autoencoder_Solution.ipynb
The relevant lines are:
loss = tf.nn.sigmoid_cross_entropy_with_logits(labels=targets_, logits=logits)
cost = tf.reduce_mean(loss)
Whether you take the mean or sum of the cost function does not change the value of the minimizer. But If you take the mean, then the value of the cost function is more easily comparable between experiments when you change the batch size or image size.

How to fit a classifier with high accuracy on the training set with low features?

I have input (r,c) in range (0, 1] as the coordinate of a pixel of an image and its color 1 or 2 only.
I have about 6,400 pixels.
My attempt of fitting X=(r,c) and y=color was a failure the accuracy won't go higher than 70%.
Here's the image:
The first is the actual image, the 2nd is the image I use to train on, it has only 2 colors. The last is the image that the neural network generated with about 500 weights training with 50 iterations. Input Layer is 2, one hidden layer of size 100, and the output layer is 2. (for binary classification like this, I may need only one output layer but I am just preparing for multi-class classification)
The classifier failed to fit the training set, why is that? I tried generating high polynomial terms of those 2 features but it doesn't help. I tried using Gaussian kernel and random 20-100 landmarks on the picture to add more features, also got similar output. I tried using logistic regressions, doesn't help.
Please help me increase the accuracy.
Here's the input:input.txt (you can load it into Octave the variable is coordinate (r,c features) and idx (color)
You can try plotting it first to make sure that you understand the input then try training on it and tell me if you get better result.
Your problem is hard to model. You are trying to fit function from R^2 to R, which has lots of complexity - lots of "spikes", lots of discontinuous regions (pixels that are completely separated from the rest). This is not an easy problem, and not usefull one.. In order to overfit your network to such setting you will need plenty of hidden units. Thus, what are the options to do so?
General things that are missing in the question, and are important
Your output variable should be {0, 1} if you are fitting your network through cross entropy cost (log likelihood), which you should use for classification.
50 iteraions (if you are talking about some mini-batch iteraions) is orders of magnitude to small, unless you mean 50 epochs (iterations over whole training set).
Actual things, that will probably need to be done (at least one of the below):
I assume that you are using ReLU activations (or Tanh, hard to say looking at the output) - you can instead use RBF activations, and increase number of hidden neurons to ~5000,
If you do not want to go with RBFs, then you will need 1-2 additional hidden layers to fit function of this complexity. Try architecture of type 100-100-100 instaed.
If the above fails - increase number of hidden units, that's all you need - enough capacity.
In general: neural networks are not designed for working with low dimensional datasets. This is nice example from the web, that you can learn pix-pos to color mapping, but it is completely artificial and seems to actually harm people intuitions.

Word2Vec: Number of Dimensions

I am using Word2Vec with a dataset of roughly 11,000,000 tokens looking to do both word similarity (as part of synonym extraction for a downstream task) but I don't have a good sense of how many dimensions I should use with Word2Vec. Does anyone have a good heuristic for the range of dimensions to consider based on the number of tokens/sentences?
Typical interval is between 100-300. I would say you need at least 50D to achieve lowest accuracy. If you pick lesser number of dimensions, you will start to lose properties of high dimensional spaces. If training time is not a big deal for your application, i would stick with 200D dimensions as it gives nice features. Extreme accuracy can be obtained with 300D. After 300D word features won't improve dramatically, and training will be extremely slow.
I do not know theoretical explanation and strict bounds of dimension selection in high dimensional spaces (and there might not a application-independent explanation for that), but I would refer you to Pennington et. al, Figure2a where x axis shows vector dimension and y axis shows the accuracy obtained. That should provide empirical justification to above argument.
I think that the number of dimensions from word2vec depends on your application. The most empirical value is about 100. Then it can perform well.
The number of dimensions reflects the over/under fitting. 100-300 dimensions is the common knowledge. Start with one number and check the accuracy of your testing set versus training set. The bigger the dimension size the easier it will be overfit on the training set and had bad performance on the test. Tuning this parameter is required in case you have high accuracy on training set and low accuracy on the testing set, this means that the dimension size is too big and reducing it might solve the overfitting problem of your model.

How to visualizate svm weights in hog

In the original paper of HOG (Histogram of Oriented Gradients) http://lear.inrialpes.fr/people/triggs/pubs/Dalal-cvpr05.pdf there are some images, which shows the hog representation of an image (Figure 6).In this figure the f, g part says "HOG descriptor weighted by respectively the positive and the negative SVM weights".
I don't understand what does this mean. I understand that when I train a SVM, I get a Weigth vector, and to classify, I have to use the features (HOG descriptors) as the input of the function. So what do they mean by positive and negative weigths? And how would I plot them like the paper?
Thanks in Advance.
The weights tell you how significant a specific element of the feature vector is for a given class. That means that if you see a high value in your feature vector you can lookup the corresponding weight
If the weight is a high positiv number it's more likely that your object is of the class
If your weight is a high negative number it's more likely that your object is NOT of the class
If your weight is close to zero this position is mostly irrelavant for the classification
Now your using those weights to scale the feature vector you have where the length of the gradients are mapped to the color-intensity. Because you can't display negative color intensities they decided to split the positive and negative visualization. In the visualizations you can now see which parts of the input-image contributes to the class (positiv) and which don't (negative).

Resources