Selecting an IoU and confidence threshold for evaluation of model performance - machine-learning

mAP is commonly used to evaluate the performance of object detection models. However, there are two variables that need to be set when calculating mAP:
confidence threshold
IoU threshold
Just to clarify, the confidence threshold is the minimum score at which the model treats a prediction as valid (anything below it is ignored entirely). The IoU threshold is the minimum overlap between the ground truth and prediction boxes for the prediction to be considered a true positive.
Setting both of these thresholds to be low would result in a greater mAP. However, the low thresholds would most likely be inconsistent with the mAP scores from other studies. How does one select, and justify, these threshold values?

In YOLOv5, NMS is applied to the network outputs before mAP is calculated. NMS takes its own conf_thres and iou_thres to filter boxes; these are set to 0.001 and 0.6, see: https://github.com/ultralytics/yolov5/blob/2373d5470e386a0c63c6ab77fbee6d699665e27b/val.py#L103.
When calculating mAP, the IoU threshold is set to 0.5 for mAP@0.5, or swept from 0.5 to 0.95 in steps of 0.05 for mAP@0.5:0.95.
I guess the way YOLOv5 calculates mAP is aligned with other frameworks. If I'm wrong, please correct me.
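To make the IoU threshold concrete, here is a minimal sketch of how a single prediction is judged against a ground-truth box at each of the ten thresholds used for mAP@0.5:0.95. The function name `iou`, the (x1, y1, x2, y2) box layout and the example boxes are assumptions for illustration, not YOLOv5's own code.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    # Corners of the overlapping rectangle (empty if the boxes do not touch)
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# IoU thresholds 0.50, 0.55, ..., 0.95 (ten values)
iou_thresholds = [round(0.5 + 0.05 * k, 2) for k in range(10)]

ground_truth = (0, 0, 100, 100)  # hypothetical boxes
prediction = (0, 0, 100, 82)

overlap = iou(prediction, ground_truth)
# Thresholds at which this prediction would count as a true positive
passed = [t for t in iou_thresholds if overlap >= t]
print(overlap, passed)
```

The same prediction is a true positive at the looser thresholds and a false positive at the stricter ones, which is why mAP@0.5:0.95 is always at most mAP@0.5.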

Related

Why don't the false positive rate and true positive rate add up to one in a roc-auc curve?

The false positive rate is the x-axis, spanning from 0 to 1. The true positive rate is the y-axis, spanning from 0 to 1. And the graphs show data points like (.8, .8). If the TPR is .8 and the FPR is .8, they add up to 1.6...
Typically the axes are normalised by the maximum possible number of FPs and TPs in the test/validation set (that is, the total number of negatives and positives respectively). Otherwise the end of the curve wouldn't be at (1, 1). I personally prefer to label the axes by the number of instances.
Why not normalise by the total number? In real applications it gets rather complicated, as you often do not have labels for all examples. The typical example for ROC curves is mass mailings. To normalise the curve correctly you would need to spam the entire world.
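As a short worked note, using the standard confusion-matrix definitions (which the answer above does not spell out), the two rates are normalised by different totals, so nothing forces them to sum to one:

```latex
\mathrm{TPR} = \frac{TP}{TP + FN}, \qquad \mathrm{FPR} = \frac{FP}{FP + TN}

\mathrm{TPR} + \mathrm{FNR} = 1, \qquad \mathrm{FPR} + \mathrm{TNR} = 1
```

For instance, with 10 positives and 10 negatives (made-up numbers), a cutoff that flags 8 of each gives TPR = 8/10 and FPR = 8/10, which is exactly the point (.8, .8).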

How to use ROC curve

For logistic regression we usually follow the approach below:
[1] Randomly initialize the parameters (theta), and choose a cutoff/deciding point (we consider points above this cutoff as one class and points below it as the other class)
[2] Predict output values (h) with theta and the chosen input features
[3] Calculate the cost using the predicted (h) and actual results
[4] Calculate the gradient, so that we can use it to minimize the cost
[5] Recalculate theta using the obtained gradient
[6] Repeat steps 2-5 for a few iterations and then plot the cost values (obtained in the 3rd step of each iteration) against the number of iterations
[7] If the cost values decrease as the number of iterations increases, then our classifier is good; otherwise we have to randomly choose another value of theta and start again
We use the ROC curve to analyse the trade-off between the cutoff point and the true positive and true negative rates. My question is: when can we use the ROC curve? Is it after finding the optimal theta using gradient descent? Please help!!
The ROC curve can be measured for any tunable predictor, no matter how bad. You could measure it straight away after step one. That would get you a very bad curve, of course: simultaneously many FP and FN.
The whole point of all those iterations is to improve the ROC curve, so that every cutoff gives fewer FP and FN.
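As a rough illustration of "measuring" the ROC curve for any scoring predictor, the sketch below sweeps the cutoff over a set of scores and records one (FPR, TPR) point per cutoff. The helper names `roc_points` and `auc` and the toy scores are assumptions for illustration; the scores could come from theta at any stage of training.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points, sweeping the cutoff from high to low.

    Assumes labels are 0/1 and that both classes are present.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]  # cutoff above every score: nothing predicted positive
    for cutoff in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 0)
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the curve by the trapezoid rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Hypothetical predicted probabilities h and true classes
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1, 1, 0, 1, 0, 0]

points = roc_points(scores, labels)
print(points)
print(auc(points))
```

Running this on the scores after each training iteration would show the curve, and its area, changing as theta improves.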

Perceptron training rule, why multiply by x

I was reading Tom Mitchell's Machine Learning book, and he mentions that the formula for the perceptron training rule is
$w_i \leftarrow w_i + \Delta w_i, \quad \Delta w_i = \eta (t - o) x_i$
where
$\eta$: training rate
$t$: expected output
$o$: actual output
$x_i$: ith input
This implies that if $x_i$ is very large then so is $\Delta w_i$, but I don't understand the purpose of a large update when $x_i$ is large.
On the contrary, I feel like if there is a large $x_i$ then the update should be small, since a small fluctuation in $w_i$ will result in a big change in the final output (due to $w_i x_i$).
The adjustments are vector additions and subtractions, which can be thought of as rotating a hyperplane so that class 0 falls on one side and class 1 falls on the other.
Consider a 1xd weight vector $\mathbf{w}$ holding the weights of the perceptron model. Also, consider a 1xd datapoint $\mathbf{x}$. Then the predicted value of the perceptron model, considering a linear threshold without loss of generality, will be
$\mathbf{w} \cdot \mathbf{x}$ -- Eq. 1
Here '.' is a dot product, or $\sum_{i=1}^{d} w_i x_i$
The hyperplane of the above equation is $\mathbf{w} \cdot \mathbf{x} = 0$
(Ignoring the iteration indices for the weight updates for simplicity)
Let us consider two classes, 0 and 1. Again without loss of generality, datapoints labelled 0 fall on the side of the hyperplane where Eq. 1 <= 0, and datapoints labelled 1 fall on the other side, where Eq. 1 > 0.
The vector normal to this hyperplane is $\mathbf{w}$. The angle between $\mathbf{w}$ and the datapoints with label 0 should be more than 90 degrees, and the angle between $\mathbf{w}$ and the datapoints with label 1 should be less than 90 degrees.
There are three possibilities for $t - o$, the target minus the actual output (ignoring the training rate):
$t - o = 0$: implying that this example is classified correctly by the present set of weights. Therefore we do not need any changes for this datapoint.
$t - o = 1$: implying that the target was 1, but the present set of weights classified it as 0. Eq. 1 was supposed to be $> 0$. Eq. 1 in this case is $\le 0$, which indicates that the angle between $\mathbf{w}$ and $\mathbf{x}$ is greater than 90 degrees, when it should have been less. The update rule is $\mathbf{w} \leftarrow \mathbf{w} + \mathbf{x}$. If you imagine a vector addition in 2D, this rotates the hyperplane so that the angle between $\mathbf{w}$ and $\mathbf{x}$ is smaller than before, moving towards less than 90 degrees.
$t - o = -1$: implying that the target was 0, but the present set of weights classified it as 1. Eq. 1 was supposed to be $\le 0$. Eq. 1 in this case is $> 0$, which indicates that the angle between $\mathbf{w}$ and $\mathbf{x}$ is less than 90 degrees, when it should have been greater. The update rule is $\mathbf{w} \leftarrow \mathbf{w} - \mathbf{x}$. Similarly, this rotates the hyperplane so that the angle between $\mathbf{w}$ and $\mathbf{x}$ moves towards more than 90 degrees.
This is iterated over and over, and the hyperplane is rotated and adjusted until its normal makes an angle of less than 90 degrees with the datapoints labelled 1 and greater than 90 degrees with the datapoints labelled 0.
If the magnitude of $\mathbf{x}$ is huge there will be big changes, which can cause problems in the process, and it may take more iterations to converge depending on the magnitude of the initial weights. Therefore it is a good idea to normalise or standardise the datapoints. From this perspective it is easy to visualise what exactly the update rules are doing (consider the bias as a part of the hyperplane in Eq. 1). Now extend this to more complicated networks and/or ones with thresholds.
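The update can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the function names, the toy AND-style dataset and the choice of folding the bias in as a constant first input of 1 are all made up for the example. Note that a single update changes the dot product of the weights with x by eta * (t - o) * |x|^2, which always moves it in the direction of the target; this is what multiplying by x buys you.

```python
def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def predict(w, x):
    # Linear threshold unit: class 1 if w . x > 0, otherwise class 0
    return 1 if dot(w, x) > 0 else 0

def update(w, x, t, eta=1.0):
    # Perceptron training rule: w_i <- w_i + eta * (t - o) * x_i
    o = predict(w, x)
    return [wi + eta * (t - o) * xi for wi, xi in zip(w, x)]

def train(data, eta=1.0, max_epochs=100):
    w = [0.0] * len(data[0][0])
    for _ in range(max_epochs):
        mistakes = 0
        for x, t in data:
            if predict(w, x) != t:
                w = update(w, x, t, eta)
                mistakes += 1
        if mistakes == 0:  # every point is on the correct side
            break
    return w

# Toy linearly separable data; x[0] = 1 plays the role of the bias input
data = [([1, 0, 0], 0), ([1, 0, 1], 0), ([1, 1, 0], 0), ([1, 1, 1], 1)]
w = train(data)
print(w, [predict(w, x) for x, _ in data])
```

Scaling the inputs does not change which side of the hyperplane a point is on, but it does change how far each update swings the weights, which is the practical argument for normalising the data first.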
Recommended reading and reference: Neural Networks: A Systematic Introduction by Raul Rojas, Chapter 4

Loss function representing the euclidean distance from prediction to nearest groundtruth in images?

Is there a loss function that calculates the euclidean distance between a prediction pixel and the nearest groundtruth pixel? Specifically, this is the location distance, not the intensity distance.
This would be on binary predictions and binary groundtruth.
That's the root mean squared error (RMSE), for example:
model.compile(loss='rmse', optimizer='adagrad')
But it might be better to use mean squared error instead because of what is discussed here https://github.com/fchollet/keras/issues/1170:
i.e. Keras computes the loss batch by batch. To avoid inconsistencies I recommend using MSE instead.
As in:
model.compile(loss='mse', optimizer='adagrad')
But since your data has only binary predictions I would advise the binary_crossentropy instead (https://keras.io/losses/#binary_crossentropy):
model.compile(loss='binary_crossentropy', optimizer='adagrad')

Handling zero rows/columns in covariance matrix during em-algorithm

I tried to implement GMMs but I have a few problems during the EM algorithm.
Let's say I've got 3D samples (stat1, stat2, stat3) which I use to train the GMMs.
One of my training sets for one of the GMMs has a "0" for stat1 in nearly every sample. During training I get really small numbers (like "1.4456539880060609E-124") in the first row and column of the covariance matrix, which leads to 0.0 in the first row and column in the next iteration of the EM algorithm.
I get something like this:
0.0 0.0 0.0
0.0 5.0 6.0
0.0 2.0 1.0
I need the inverse covariance matrix to calculate the density, but since one column is zero I can't compute it.
I thought about falling back to the old covariance matrix (and mean) or replacing every 0 with a really small number.
Or is there another simple solution to this problem?
Simply put, your data lies in a degenerate subspace of your actual input space, and GMM in its most generic form is not well suited to such a setting. The problem is that the empirical covariance estimator you use simply fails for such data (as you said, you cannot invert it). What do you usually do? You change the covariance estimator to a constrained/regularized one. Options include:
Constant-based shrinking: instead of using Sigma = Cov(X) you use Sigma = Cov(X) + eps * I, where eps is a predefined small constant and I is the identity matrix. Consequently you never have zero values on the diagonal, and it is easy to prove that for a reasonable epsilon this will be invertible.
Nicely fitted shrinking, like the Oracle Approximating Shrinkage (OAS) estimator or the Ledoit-Wolf covariance estimator, which find the best epsilon based on the data itself.
Constrain your Gaussians to, for example, the spherical family, N(m, sigma I), where sigma = avg_i( cov( X[:, i] ) ) is the mean variance per dimension. This limits you to spherical Gaussians, and also solves the above issue.
There are many more solutions possible, but all are based on the same thing: change the covariance estimator in such a way that you have a guarantee of invertibility.
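The constant-based shrinking option can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions: the helper names `det3` and `shrink`, the value of eps and the example matrix (a symmetric variant of the one in the question) are made up for the example, and real code would more likely use a linear algebra library.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

def shrink(cov, eps=1e-6):
    """Return Cov + eps * I, leaving the input matrix untouched."""
    n = len(cov)
    return [[cov[r][c] + (eps if r == c else 0.0) for c in range(n)]
            for r in range(n)]

# Degenerate covariance: stat1 has (numerically) zero variance
cov = [[0.0, 0.0, 0.0],
       [0.0, 5.0, 2.0],
       [0.0, 2.0, 1.0]]

regularized = shrink(cov)
print(det3(cov))          # singular: determinant is zero
print(det3(regularized))  # small but strictly positive, so invertible
```

Applying `shrink` to every component's covariance at the end of each M-step keeps the matrices invertible throughout EM, at the cost of slightly inflating every variance by eps.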
