Post-process multi-class predictions for image segmentation? - image-processing

My FCN is trained to detect 10 different classes and produces an output of 500x500x10 with each of the final dimensions being the prediction probabilities for a different class.
Usually, I've seen a uniform threshold, for instance 0.5, used to binarize the probability maps. However, in my case this doesn't quite cut it, because the IoU for some classes peaks at a threshold of 0.3 while for other classes it peaks at 0.8.
Hence, I would rather not pick the threshold for each class arbitrarily, but instead use a more probabilistic approach to finalizing the threshold values. I thought of using CRFs, but this also seems to require the thresholding to have already been done. Any ideas on how to proceed?
Example: consider an image of a forest with 5 different birds. I'm trying to output an image that segments the forest and the five birds, i.e. 6 classes, each with a separate label. The network outputs 6 confidence maps indicating, for each pixel, the confidence that it falls into a particular class. The correct answer for a pixel isn't always the class with the highest confidence value, so a one-size-fits-all threshold or a simple argmax won't work.

CRF Post-processing Approach
You don't need to set thresholds to use a CRF. I'm not familiar with any python libraries for CRFs, but in principle, what you need to define is:
A probability distribution of the 10 classes for each of the nodes
(pixels), which is simply the output of your network.
Pairwise potentials: a 10×10 matrix, where element A_ij denotes the "strength" of the configuration in which one pixel is of class i and a neighboring pixel is of class j. If you set the potentials to have a value alpha (alpha >> 1) on the diagonal and 1 elsewhere, then alpha acts as the regularization force that gives you consistency of the predictions (if pixel X is of class Y, then the neighbors of X are more likely to be of the same class).
This is just one example of how you can define your CRF.
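For what it's worth, one Python package that implements dense CRFs is pydensecrf; a minimal post-processing sketch with it (shapes and parameter values here are assumptions, not part of the original answer) might look like:

import numpy as np
import pydensecrf.densecrf as dcrf
from pydensecrf.utils import unary_from_softmax

# probs: network output as per-class probabilities, shape (10, 500, 500)
probs = np.random.dirichlet(np.ones(10), size=(500, 500)).transpose(2, 0, 1)
probs = np.ascontiguousarray(probs, dtype=np.float32)

d = dcrf.DenseCRF2D(500, 500, 10)             # width, height, number of labels
d.setUnaryEnergy(unary_from_softmax(probs))   # unary potentials = -log(probability)

# Potts-like pairwise term: 'compat' plays the role of the regularization
# strength alpha that encourages neighboring pixels to share a label.
d.addPairwiseGaussian(sxy=3, compat=3)

Q = np.array(d.inference(5))                  # 5 mean-field iterations
labels = Q.argmax(axis=0).reshape(500, 500)   # per-pixel class, no thresholds needed

The final labeling comes from the CRF inference directly, so no per-class thresholds are required.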
End-to-End NN Approach
Add a loss to your network that will penalize pixels that have neighbors of a different class. Please note that you will still end up with a tunable parameter for the weight of the new regularization loss.
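As an illustration of this idea (a minimal PyTorch-style sketch, not from the original answer; the weight value is an assumption), such a regularizer could penalize disagreement between adjacent pixels of the softmax output:

import torch
import torch.nn.functional as F

def neighbor_consistency_loss(logits, weight=0.1):
    # logits: raw network output of shape (batch, n_classes, H, W)
    # weight: tunable strength of the regularization (assumed value)
    probs = F.softmax(logits, dim=1)
    # absolute difference between vertically and horizontally adjacent pixels
    dh = (probs[:, :, 1:, :] - probs[:, :, :-1, :]).abs().mean()
    dw = (probs[:, :, :, 1:] - probs[:, :, :, :-1]).abs().mean()
    return weight * (dh + dw)

# total_loss = cross_entropy_loss + neighbor_consistency_loss(logits)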

Related

What is loss_cls and loss_bbox and why are they always zero in training

I'm trying to train on a custom dataset with faster_rcnn, using the PyTorch implementation of Detectron (linked here). I have made changes to the dataset and configuration according to the guidelines in the repo.
The training process runs successfully, but the loss_cls and loss_bbox values are 0 from the beginning, and even though training completes, the final output cannot be used for evaluation or inference.
I would like to know what these two losses mean and how to get their values to change during training. The exact model I'm using is e2e_faster_rcnn_R-50-FPN_1x.
Any help regarding this would be appreciated. I'm using Ubuntu 16.04 with Python 3.6 on Anaconda, CUDA 9, and cuDNN 7.
What are the two losses?
When training a multi-object detector, you usually have (at least) two types of losses:
loss_bbox: a loss that measures how "tight" the predicted bounding boxes are to the ground-truth objects (usually a regression loss such as L1 or smooth L1).
loss_cls: a loss that measures the correctness of the classification of each predicted bounding box: each box may contain an object class or "background". This is usually a cross-entropy loss.
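In PyTorch terms (a rough sketch, not the actual Detectron code; all tensors below are made up for illustration), the two losses are typically computed along these lines:

import torch
import torch.nn.functional as F

# hypothetical tensors:
# pred_deltas / gt_deltas: box regression targets for the matched (foreground) boxes
# cls_scores: per-box class logits of shape (n_boxes, n_classes + 1), one index reserved for background
pred_deltas = torch.randn(8, 4)
gt_deltas = torch.randn(8, 4)
cls_scores = torch.randn(8, 11)
cls_labels = torch.randint(0, 11, (8,))

loss_bbox = F.smooth_l1_loss(pred_deltas, gt_deltas)   # how "tight" the boxes are
loss_cls = F.cross_entropy(cls_scores, cls_labels)     # correctness of the class label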
Why are the losses always zero?
When training a detector, the model predicts quite a few (~1K) candidate boxes per image. Most of them are empty (i.e. belong to the "background" class). The loss function associates each of the predicted boxes with the ground-truth box annotations of the image.
If a predicted box has a significant overlap with a ground truth box then loss_bbox and loss_cls are computed to see how well the model is able to predict the ground truth box.
On the other hand, if a predicted box has no overlap with any ground-truth box, then only loss_cls is computed, for the "background" class.
However, if a predicted box has only a very partial overlap with the ground truth, it is "discarded" and no loss is computed. I suspect that, for some reason, this is what is happening in your training session.
I suggest you check the parameters that determine the association between predicted boxes and ground-truth annotations. Moreover, look at the parameters of your "anchors": these determine the scales and aspect ratios of the predicted boxes.
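The matching logic described above can be sketched as follows (the threshold values here are assumptions; the real ones come from your Detectron configuration):

def iou(a, b):
    # intersection over union of two boxes given as [x1, y1, x2, y2]
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def assign(pred_box, gt_boxes, fg_thresh=0.5, bg_thresh=0.1):
    best = max((iou(pred_box, gt) for gt in gt_boxes), default=0.0)
    if best >= fg_thresh:
        return "foreground"   # loss_cls and loss_bbox are both computed
    if best < bg_thresh:
        return "background"   # only loss_cls (background class) is computed
    return "ignored"          # partial overlap: the box is discarded, no loss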

Bayes Classification with Multivariate Parzen Window using Spherical Kernel

I'm having a problem implementing a Bayes Classifier with the Parzen window algorithm using a spherical (or isotropic) kernel.
I am running the algorithm with test data containing 2 dimensions and 3 different classes (For each class, I have 10 test points, and 40 training points, all in 2 dimensions). When I change the value of my hyper-parameter (sigma_sq for the spherical Gaussian kernel), I find that there is no effect on how the points are classified.
This is my density estimator. My self.sigma_sq is the same across all the dimensions of my data (2 dimensions)
for i in range(test_data.shape[0]):
    log_prob_intermediate = 0
    for j in range(n):  # n is the size of the training set
        c = -self.n_dims * np.log(2*np.pi)/2.0 - self.n_dims*np.log(self.sigma_sq)/2.0
        log_prob_intermediate += (c - np.sum((test_data[i,:] - self.train_data[j,:])**2.0) / (2.0 * self.sigma_sq))
    log_prob.append(log_prob_intermediate / n)
How I implemented my Bayes Classifier:
There are 3 classes that my Bayes Classifier must distinguish. I created 3 training sets and 3 test sets (one training and test set per class). For each point in my test set, I run the density estimator for each class on the point. This gives me a vector of 3 values: the log probability that my new point is in class1, class2, or class3. I then choose the maximum value and assign the new point to that class.
Since I am using a spherical Gaussian kernel, I am of the understanding that my sigma_sq must be common for each density estimator (one density estimator for each class). Is this correct? If I had a different sigma_sq for each dimension pair, wouldn't this give me somewhat of a diagonal Gaussian kernel?
For my list of 30 test points (10 per class), I find that running the Bayes classifier on these points gives exactly the same classification for each point, regardless of what sigma I use. Is this normal? Since it's a spherical Gaussian kernel and all my dimensions use the same kernel, is increasing or decreasing my sigma_sq just having a proportional effect on my log probability with no change in the classification? Or do I have some problem with my density estimator that I can't figure out?
Let's address each thing separately.
Using the same sigma for each dimension does make your kernel radial, this is true; however, you can (and should!) use a different sigma for each class, as each class distribution usually requires its own density estimator. For simple heuristics, read for example about Scott's rule of thumb for kernel width selection in the Gaussian case, or the later work by Silverman.
It is hard to tell whether, in your particular case, the choice of sigma should change the classification; in general it should, but each dataset has its own properties. However, your data is just 2D, which makes it perfect for visualization. Plot your data, then plot each KDE, and simply investigate visually what is going on.
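As a concrete sketch of this suggestion (per-class bandwidths chosen by Scott's rule, classification by the largest log-density), using scipy instead of a hand-rolled estimator; the data below is synthetic and the shapes merely match the question's description:

import numpy as np
from scipy.stats import gaussian_kde

# three classes, 40 training points each, 2 dimensions; 30 test points in total
rng = np.random.default_rng(0)
train_sets = [rng.normal(loc=3 * k, size=(40, 2)) for k in range(3)]
test_points = rng.normal(loc=3, size=(30, 2))

# one KDE per class; 'scott' picks the bandwidth per class from that class's data
kdes = [gaussian_kde(ts.T, bw_method="scott") for ts in train_sets]

# log-density of every test point under each class model, then argmax over classes
log_dens = np.stack([kde.logpdf(test_points.T) for kde in kdes], axis=0)
predicted_class = log_dens.argmax(axis=0)
print(predicted_class)

Note that scipy's gaussian_kde scales the data covariance by the bandwidth factor, so it is not a strictly spherical kernel; it is shown here only to illustrate per-class bandwidth selection and max-log-density classification.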

Linear Regression :: Normalization (Vs) Standardization

I am using linear regression to predict data, but I am getting totally contrasting results when I normalize versus standardize the variables.
Normalization: x' = (x - xmin) / (xmax - xmin)
Z-score standardization: x' = (x - xmean) / xstd
a) When should I normalize versus standardize?
b) How does normalization affect linear regression?
c) Is it okay if I don't normalize all the attributes/labels in the linear regression?
Note that the results might not necessarily be so different. You might simply need different hyperparameters for the two options to give similar results.
The ideal thing is to test what works best for your problem. If you can't afford this for some reason, most algorithms will probably benefit from standardization more so than from normalization.
See here for some examples of when one should be preferred over the other:
For example, in clustering analyses, standardization may be especially crucial in order to compare similarities between features based on certain distance measures. Another prominent example is the Principal Component Analysis, where we usually prefer standardization over Min-Max scaling, since we are interested in the components that maximize the variance (depending on the question and if the PCA computes the components via the correlation matrix instead of the covariance matrix; but more about PCA in my previous article).
However, this doesn't mean that Min-Max scaling is not useful at all! A popular application is image processing, where pixel intensities have to be normalized to fit within a certain range (i.e., 0 to 255 for the RGB color range). Also, typical neural network algorithms require data on a 0-1 scale.
One disadvantage of normalization over standardization is that it loses some information in the data, especially about outliers.
The linked page also has a picture comparing the original, standardized, and min-max scaled versions of a dataset.
As you can see, scaling clusters all the data very close together, which may not be what you want. It might cause algorithms such as gradient descent to take longer to converge to the same solution they would on a standardized data set, or it might even make it impossible.
"Normalizing variables" doesn't really make sense. The correct terminology is "normalizing / scaling the features". If you're going to normalize or scale one feature, you should do the same for the rest.
Getting different results makes sense, because normalization and standardization do different things:
Normalization transforms your data into a range between 0 and 1
Standardization transforms your data such that the resulting distribution has a mean of 0 and a standard deviation of 1
Normalization and standardization are designed to achieve a similar goal, which is to create features that have similar ranges to each other. We want that so we can be sure we are capturing the true information in a feature, and so that we don't over-weight a particular feature just because its values are much larger than those of other features.
If all of your features are within a similar range of each other, then there's no real need to standardize/normalize. If, however, some features naturally take on values that are much larger or smaller than others, then normalization/standardization is called for.
If you're going to normalize at least one variable/feature, I would do the same thing to all of the others as well.
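A quick way to compare the two on your own data is with scikit-learn's scalers (a small sketch; the feature matrix here is made up):

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 800.0]])          # two features on very different scales

X_norm = MinMaxScaler().fit_transform(X)     # each column rescaled to [0, 1]
X_std = StandardScaler().fit_transform(X)    # each column: mean 0, std 1

print(X_norm)
print(X_std)

Whichever you pick, fit the scaler on the training set only and reuse it to transform the test set, otherwise information leaks from the test data.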
First question: why do we need normalisation/standardisation?
=> Take an example dataset with a salary variable and an age variable. Age ranges from 0 to 90, whereas salary can range from 25,000 to 250,000.
If we compare two people, the age difference will be below 100 while the salary difference will be in the thousands.
So if we don't want one variable to dominate the other, we use either normalisation or standardisation; afterwards, both age and salary are on the same scale.
However, when we standardise or normalise, we lose the original values, as they are transformed to new ones. So there is a loss of interpretability, but this is extremely important when we want to draw inferences from our data.
Normalization rescales the values into a range of [0,1]; this is also called min-max scaling.
Standardization rescales the data to have a mean (μ) of 0 and a standard deviation (σ) of 1. (Note that this only changes the location and scale, not the shape of the distribution.)
For example, in a plot comparing the three versions of one dataset: the actual data (in green) is spread between 1 and 6, the standardised data (in red) is spread roughly from -1 to 3, whereas the normalised data (in blue) is spread between 0 and 1.
Many algorithms require you to standardise/normalise the data before passing it in. In PCA, for example, where we reduce dimensionality (say, projecting 3D data down to 1D), standardisation is usually required.
In image processing, on the other hand, it is common to normalise pixel values before processing.
During normalisation we also compress the outliers (extreme data points, either too low or too high), which is a slight disadvantage.
So which one to choose depends on the data and your preference, though standardisation is the most commonly recommended default.
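To make the age/salary example concrete, here is a small numpy sketch (the numbers are made up) applying both formulas by hand:

import numpy as np

age = np.array([20.0, 35.0, 50.0, 65.0])
salary = np.array([25_000.0, 60_000.0, 120_000.0, 250_000.0])

def min_max(x):
    return (x - x.min()) / (x.max() - x.min())      # normalisation, range [0, 1]

def z_score(x):
    return (x - x.mean()) / x.std()                 # standardisation, mean 0, std 1

print(min_max(age), min_max(salary))   # both now lie in [0, 1]
print(z_score(age), z_score(salary))   # both now have mean 0 and std 1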
None of the mentioned transformations matter for (ordinary) linear regression, since they are all affine transformations.
The fitted coefficients will change, but the explained variance ultimately remains the same. So, from the linear regression perspective, outliers remain outliers (leverage points).
These transformations also do not change the shape of the distribution; it remains the same.
A lot of people use normalisation and standardisation interchangeably. The purpose is the same: to bring features onto the same scale. The approach is to subtract either the min value or the mean from each value and divide by the range (max minus min) or the standard deviation, respectively. One observable difference is that when you subtract the min value all results are positive, whereas when you subtract the mean you get both positive and negative values. This is also one of the factors in deciding which approach to use.

Object classification with Kinect using cascaded classifiers

My project is to create software that recognizes certain objects like an apple or a coin, etc. I want to use the Kinect. My question is: do I need a machine learning algorithm like a Haar classifier to recognize an object, or can the Kinect itself do that?
The Kinect itself cannot recognize objects; it will give you a dense depth map. You can then use depth features along with some simple features (in your case, color or gradient features would probably do the job). You feed those features to a classifier (an SVM or Random Forest, for example) to train the system, and then use the trained model for testing on new samples.
Regarding Haar features, I think they could do the job but you would need a sufficiently large database of features. It all depends on what you want to detect. In the case of an apple and a coin, just color would suffice.
Refer to this paper to get an idea of how to perform human pose recognition using the Kinect camera. You only have to pay attention to their depth features and their classifiers; do not apply their approach directly, as your problem is simpler.
Edit: simple gradient orientations histogram
Gradient orientations can give you a coarse idea of the shape of the object (it is not a dedicated shape feature, and better shape features exist, but this one is extremely fast to calculate).
Code snippet:
%calculate gradient
[dx,dy] = gradient(double(img));
A = (atan(dy./(dx+eps))*180)/pi; %eps added to avoid division by zero.
A will contain the orientation for each pixel. Segment your original image according to the depth values. For a segment with similar depth values, calculate a color histogram. Extract the pixel orientations corresponding to that region, call it A_r, and calculate a 9-bin histogram (you can have more bins; nine bins mean each bin covers 180/9 = 20 degrees). Concatenate the color features and the gradient histogram. Do this for a sufficient number of leaves. Then you can give this to a classifier for training.
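A Python/numpy version of this kind of descriptor (gradient-orientation histogram plus a simple color histogram per depth segment) could look like the sketch below; shapes, bin counts, and the function name are assumptions:

import numpy as np

def region_descriptor(gray_region, color_region, n_orient_bins=9, n_color_bins=8):
    # gray_region: (H, W) grayscale patch; color_region: (H, W, 3) uint8 patch
    dy, dx = np.gradient(gray_region.astype(float))
    # orientation in degrees, folded into [0, 180) like the MATLAB snippet above
    ang = (np.degrees(np.arctan2(dy, dx)) + 180.0) % 180.0
    orient_hist, _ = np.histogram(ang, bins=n_orient_bins, range=(0.0, 180.0))
    color_hists = [np.histogram(color_region[..., c], bins=n_color_bins,
                                range=(0, 256))[0] for c in range(3)]
    return np.concatenate([orient_hist] + color_hists).astype(float)

Stacking these descriptors for many segments, together with their labels, gives the training matrix for the SVM or Random Forest mentioned above.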
Edit: This is a reply to a comment below.
Regarding MaxDepth parameter in opencv_traincascade
The documentation says, "Maximal depth of a weak tree. A decent choice is 1, that is case of stumps". When you perform binary classification, it takes the form:
if yourFeatureValue >= learntThresh
    class = 1;
else
    class = 0;
end
The above type of classifier, which thresholds a single feature value (a scalar), is called a decision stump. There is only one split between the positive and negative class (therefore maxDepth is one). For example, it would work in the following scenario. Imagine you have a 1-D feature:
f=[1 2 3 4 -1 -2 -3 -4]
The first 4 are class 1, the rest are class 0. A decision stump gets 100% accuracy on this data by setting the threshold to zero. Now imagine a more complicated feature space such as:
f=[1 2 3 4 5 6 7 8 9 10 11 12];
The first 4 and the last 4 are class 1, the rest are class 0. Here you cannot get 100% classification with a decision stump; you need two thresholds/splits, so you can construct a tree with depth 2. In general, a tree of depth d can have up to 2^d - 1 thresholds (1 for depth 1, 3 for depth 2, 7 for depth 3, and so on); here I assume a tree with a single node has depth 1.
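You can verify this with a quick scikit-learn experiment (an illustration only, unrelated to opencv_traincascade internals): with this data a stump cannot separate the classes, but a depth-2 tree can.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

f = np.arange(1, 13).reshape(-1, 1)                  # the 1-D feature above
y = np.array([1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1])   # first 4 and last 4 are class 1

stump = DecisionTreeClassifier(max_depth=1).fit(f, y)
tree2 = DecisionTreeClassifier(max_depth=2).fit(f, y)

print(stump.score(f, y))   # < 1.0: one threshold is not enough
print(tree2.score(f, y))   # 1.0: two splits separate the classes perfectly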
You may feel that with more levels you can achieve higher accuracy, but then there is the problem of overfitting (as well as computation, memory storage, etc.). Therefore, you have to choose a good value for the depth; I usually set it to 3.

How to visualize SVM weights in HOG

In the original HOG (Histogram of Oriented Gradients) paper, http://lear.inrialpes.fr/people/triggs/pubs/Dalal-cvpr05.pdf, there are images showing the HOG representation of an image (Figure 6). In this figure, parts (f) and (g) say "HOG descriptor weighted by respectively the positive and the negative SVM weights".
I don't understand what this means. I understand that when I train an SVM, I get a weight vector, and to classify I have to use the features (HOG descriptors) as input to the decision function. So what do they mean by positive and negative weights, and how would I plot them as in the paper?
The weights tell you how significant a specific element of the feature vector is for a given class. That means that if you see a high value in your feature vector, you can look up the corresponding weight:
If the weight is a large positive number, it's more likely that your object is of the class.
If the weight is a large negative number, it's more likely that your object is NOT of the class.
If the weight is close to zero, this position is mostly irrelevant for the classification.
Now you use those weights to scale the feature vector, where the magnitudes of the weighted gradient histogram entries are mapped to color intensity. Because you can't display negative color intensities, they decided to split the visualization into a positive and a negative part. In the visualizations you can then see which parts of the input image contribute towards the class (positive) and which contribute against it (negative).
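A sketch of the idea in Python (the variable and function names are hypothetical; h is a HOG descriptor, e.g. from skimage.feature.hog, and w is the weight vector of a trained linear SVM of the same length):

import numpy as np

def split_weighted_descriptor(h, w):
    # split the SVM-weighted HOG descriptor into positive and negative parts
    w_pos = np.maximum(w, 0.0)        # weights voting for the class
    w_neg = np.maximum(-w, 0.0)       # weights voting against the class
    return h * w_pos, h * w_neg       # per-dimension contributions to the SVM score

Reshaping each part back onto the cell grid and drawing every cell's oriented bins with intensity proportional to its value reproduces visualizations like Figure 6 (f) and (g) of the Dalal & Triggs paper.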

Resources