How does sklearn actually calculate AUROC? - machine-learning

I understand that the ROC curve for a model is constructed by varying the threshold (which affects TPR and FPR).
Thus my initial understanding was that, to calculate the AUROC, you need to run the model many times with different thresholds to get that curve and finally calculate the area.
But it seems like you just need some probability estimate of the positive class, as in the code example for sklearn's roc_auc_score below, to calculate the AUROC.
>>> import numpy as np
>>> from sklearn.metrics import roc_auc_score
>>> y_true = np.array([0, 0, 1, 1])
>>> y_scores = np.array([0.1, 0.4, 0.35, 0.8])
>>> roc_auc_score(y_true, y_scores)
0.75
How does that work? Any recommended read?
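(A sketch of what happens under the hood, not sklearn's exact implementation: the model is only run once, and the candidate thresholds are simply the distinct predicted scores. Sweeping a threshold over those scores yields the (FPR, TPR) points of the ROC curve, and the trapezoidal area under them is the AUROC.)
import numpy as np

def auroc_from_scores(y_true, y_scores):
    # use every distinct score (plus a sentinel above the maximum) as a threshold
    thresholds = np.r_[np.inf, np.unique(y_scores)[::-1]]
    P = np.sum(y_true == 1)
    N = np.sum(y_true == 0)
    tpr = [np.sum((y_scores >= t) & (y_true == 1)) / P for t in thresholds]
    fpr = [np.sum((y_scores >= t) & (y_true == 0)) / N for t in thresholds]
    return np.trapz(tpr, fpr)   # trapezoidal area under the ROC curve

print(auroc_from_scores(np.array([0, 0, 1, 1]), np.array([0.1, 0.4, 0.35, 0.8])))   # 0.75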

Related

How to prepare confusion matrix from the predicted class probabilities?

There is a Naive Bayes classifier that was trained on given training data. The table below shows the predicted positive class probabilities and the actual class labels. I want to prepare the confusion matrix, but I cannot figure out how to do it knowing only the probabilities.
ID | Actual class label | Predicted positive class probability
 1 | +                  | 0.6
 2 | +                  | 0.8
 3 | -                  | 0.2
 4 | +                  | 0.3
 5 | -                  | 0.4
First, you need discrete class labels to compute a confusion matrix. Define a threshold on the predicted positive class probability and use it to predict class labels (y_pred).
You can then use the actual class labels (y_actual) and y_pred to compute the confusion matrix.
from sklearn.metrics import confusion_matrix
confusion_matrix(y_actual, y_pred)
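For example, with the table above and a threshold of 0.5 (an arbitrary choice for illustration), a minimal sketch:
import numpy as np
from sklearn.metrics import confusion_matrix

y_actual = np.array([1, 1, 0, 1, 0])           # '+' mapped to 1, '-' to 0
probs = np.array([0.6, 0.8, 0.2, 0.3, 0.4])    # predicted positive class probabilities
y_pred = (probs >= 0.5).astype(int)            # apply the threshold
print(confusion_matrix(y_actual, y_pred))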

Critical parameter behind skimage's watershed "over-segmentation"

I have the following mask of cell nuclei, and my goal is to segment them. However, using what seems to be a very standard approach,
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import scipy.ndimage
from skimage.segmentation import watershed
from skimage.feature import peak_local_max
from skimage import measure
# load mask
mask = mpimg.imread('mask.png')
# find distance to nearest border
distance = scipy.ndimage.distance_transform_edt(mask)
# find local maxima based on distance to border
local_maxi = peak_local_max(distance, indices=False, footprint=np.ones((125, 125)), labels=mask)
# generate markers for regions
markers = measure.label(local_maxi)
# watershed segmentation
labeled = watershed(-distance, markers, mask=mask, watershed_line=True)
# plot figure
fig, axs = plt.subplots()
axs.imshow(labeled, cmap='flag')
some large, connected components remain unsegmented while smaller, unconnected components become over-segmented:
Thoroughly browsing answers on StackOverflow, I haven't been able to find a discussion of which parameters drive 'under-segmentation' vs 'over-segmentation' in the skimage.segmentation.watershed algorithm.
Which parameter most strongly influences "over-segmentation" in the watershed algorithm? My intuition tells me it could be the footprint size, or the distance transform. What is the most critical parameter that determines the segmentation neighbourhood?
EDIT1: Below I have included the distance transform, the filtering of which others have pointed out is a critically important step. However, I am still unable to diagnose the symptoms of a "bad" distance transform, and I am unaware of rules of thumb for filtering said transform.
In your particular case, the origin of some of your over-segmentation is in the result of peak_local_max().
If you run the following code, you will be able to see which local maxima are selected for your image. I'm using OpenCV for plotting the dots; you might want to adapt it for another library.
import cv2
import numpy as np
import matplotlib.pyplot as plt

# indices of the detected local maxima (local_maxi is the boolean array from peak_local_max)
localMax_idx = np.where(local_maxi)

# draw the maxima on top of the mask
localMax_img = (mask * 255).astype(np.uint8)   # assumes the mask was loaded as floats in [0, 1]
localMax_img = cv2.cvtColor(localMax_img, cv2.COLOR_GRAY2BGR)
for i in range(localMax_idx[0].shape[0]):
    x = int(localMax_idx[1][i])
    y = int(localMax_idx[0][i])
    localMax_img = cv2.circle(localMax_img, (x, y), radius=5, color=(255, 0, 0), thickness=-1)
plt.imshow(localMax_img)
plt.show()
You will see that there are multiple markers for over-segmented cells. There are some suggested approaches to deal with this issue (for example, this one).
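(Not part of the original answer, but on the "which parameter" question: the footprint / min_distance passed to peak_local_max and the smoothness of the distance transform are the knobs that most directly control how many markers, and therefore how many watershed regions, you get. A rough sketch, with sigma and min_distance values chosen arbitrarily for illustration:)
import numpy as np
import scipy.ndimage as ndi
from skimage.feature import peak_local_max
from skimage import measure
from skimage.segmentation import watershed

# smooth the distance map so that small ridges do not each produce their own marker
distance = ndi.distance_transform_edt(mask)
smoothed = ndi.gaussian_filter(distance, sigma=4)            # sigma is a tuning knob

# a larger min_distance merges nearby duplicate maxima into a single marker
coords = peak_local_max(smoothed, min_distance=50, labels=measure.label(mask > 0))
local_maxi = np.zeros(distance.shape, dtype=bool)
local_maxi[tuple(coords.T)] = True

markers = measure.label(local_maxi)
labeled = watershed(-smoothed, markers, mask=mask > 0, watershed_line=True)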

How to apply a partial derivative Gaussian kernel to an image with OpenCV?

I'm trying to reproduce results from a paper, in which they convolve the image with a horizontal partial derivative of a Gaussian kernel. I haven't found any way to achieve that with OpenCV. Is that possible?
Do I have to get the Gaussian filter and then compute the partial derivatives by hand?
OpenCV doesn't have a built-in function to calculate Gaussian partial derivatives. But you may use cv::getGaussianKernel and cv::filter2D to do so.
For example:
cv::Mat kernel = cv::getGaussianKernel(3, 0.85, CV_32F);
kernel = kernel.reshape(1, 1);
cv::filter2D(img, img, CV_8U, kernel);
Please note that cv::getGaussianKernel returns a column filter, so you need to reshape it to make it horizontal.
As @akarsakov said, OpenCV does not provide a built-in function for this. However, we can still use OpenCV's getGaussianKernel() and then apply a factor to get the derivative.
Since a 2D Gaussian kernel is separable, that function simply returns a 1D kernel and assumes that you will apply a 1D filter along the x-axis and then a 1D filter along the y-axis, which is faster than applying the 2D kernel directly.
Since you want the horizontal Gaussian derivative, you can simply multiply each point of the kernel by -x/sigma^2 (the derivative of a Gaussian with respect to x is the Gaussian itself times -x/sigma^2).
import cv2
import numpy as np
import matplotlib.pyplot as plt

kernel_size = 13
sigma = 0.5

# 1D Gaussian (column vector of length kernel_size)
kernel = cv2.getGaussianKernel(kernel_size, sigma, cv2.CV_64F)
print(kernel)

# multiply each sample by -x / sigma^2 to get the first derivative
gaussian_first_deriv_x = np.zeros_like(kernel)
assert kernel_size % 2 == 1  # I assume you are using an odd kernel_size
half_kernel_size = kernel_size // 2
for i in range(kernel_size):
    x = -half_kernel_size + i
    factor = -x / (sigma**2)
    gaussian_first_deriv_x[i] = kernel[i] * factor
print(gaussian_first_deriv_x)

plt.plot(kernel[:], 'b')
plt.plot(gaussian_first_deriv_x[:], 'r')
plt.show()
Now, you can do your convolution using this kernel.
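For example, to apply it along x only (a sketch; 'input.png' is a placeholder for your own image):
# reshape the column kernel into a row so it is applied horizontally;
# note that filter2D computes a correlation, which for this antisymmetric
# kernel only flips the sign relative to a true convolution
img = cv2.imread('input.png', cv2.IMREAD_GRAYSCALE).astype(np.float64)
deriv_x = cv2.filter2D(img, cv2.CV_64F, gaussian_first_deriv_x.reshape(1, -1))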
Note: if you are using Python, you can alternatively use the function ndimage.gaussian_filter1d(data, sigma=1, order=1, mode='wrap') from scipy (from scipy import ndimage).
When sigma is not positive, OpenCV computes it from the kernel size as sigma = 0.3 * ((kernel_size - 1) * 0.5 - 1) + 0.8.

XY coordinates in an image stored as numpy?

I have a 96x96 pixel numpy array, which is a grayscale image. How do I find and plot the x, y coordinates of the maximum pixel intensity in this image?
image.shape == (96, 96)
Looks simple, but I couldn't find any snippet of code.
Please may you help :)
Use the argmax function, in combination with unravel_index to get the row and column indices:
>>> import numpy as np
>>> a = np.random.rand(96,96)
>>> rowind, colind = np.unravel_index(a.argmax(), a.shape)
As far as plotting goes, if you just want to pinpoint the maximum value using a Boolean mask, this is the way to go:
>>> import matplotlib.pyplot as plt
>>> plt.imshow(a==a.max())
<matplotlib.image.AxesImage object at 0x3b1eed0>
>>> plt.show()
In that case, you don't even need the indices.
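Alternatively, to mark the point on the image itself (a sketch reusing the indices found above; note that x corresponds to the column index and y to the row index):
>>> plt.imshow(a, cmap='gray')
>>> plt.plot(colind, rowind, 'r+')
>>> plt.show()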

Noise Estimation / Noise Measurement in Image

I want to estimate the noise in an image.
Let's assume the model of an Image + White Noise.
Now I want to estimate the Noise Variance.
My method is to calculate the local variance (3×3 up to 21×21 blocks) of the image and then find areas where the local variance is fairly constant (by calculating the local variance of the local variance matrix).
I assume those areas are "flat", hence the variance there is almost "pure" noise.
Yet I don't get consistent results.
Is there a better way?
Thanks.
P.S.
I can't assume anything about the image, only that the noise is independent (which isn't true for real images, but let's assume it).
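(A rough sketch of the procedure described above, assuming a grayscale float image named noisy; the window size and the 10% threshold are arbitrary choices for illustration:)
import numpy as np
from scipy.ndimage import uniform_filter

def local_variance(img, size=7):
    # E[x^2] - (E[x])^2 over a size x size window
    mean = uniform_filter(img.astype(float), size)
    mean_sq = uniform_filter(img.astype(float) ** 2, size)
    return mean_sq - mean ** 2

lv = local_variance(noisy, 7)                     # 'noisy' is your input image (hypothetical name)
lv_of_lv = local_variance(lv, 7)                  # how much the local variance itself varies
flat = lv_of_lv < np.percentile(lv_of_lv, 10)     # keep the 10% "flattest" areas
noise_var_estimate = np.mean(lv[flat])            # average local variance over flat areas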
You can use the following method to estimate the noise standard deviation (this implementation works for grayscale images only):
import math
import numpy as np
from scipy.signal import convolve2d

def estimate_noise(I):
    H, W = I.shape
    M = [[1, -2, 1],
         [-2, 4, -2],
         [1, -2, 1]]
    sigma = np.sum(np.sum(np.absolute(convolve2d(I, M))))
    sigma = sigma * math.sqrt(0.5 * math.pi) / (6 * (W - 2) * (H - 2))
    return sigma
Reference: J. Immerkær, “Fast Noise Variance Estimation”, Computer Vision and Image Understanding, Vol. 64, No. 2, pp. 300-302, Sep. 1996 [PDF]
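Usage (a sketch, assuming I is a 2D grayscale array):
sigma_hat = estimate_noise(I)        # estimated noise standard deviation
noise_variance = sigma_hat ** 2      # squared to get the variance asked about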
The problem of distinguishing signal from noise is not easy. From your question, a first try would be to characterize second-order statistics: natural images are known to have pixel-to-pixel correlations that are, by definition, not present in white noise.
In Fourier space, the correlation corresponds to the energy spectrum. It is known that for natural images it decreases as 1/f^2. To quantify the noise, I would therefore recommend computing the correlation coefficient of your image's spectrum with both hypotheses (flat and 1/f^2), so that you can extract the noise coefficient.
Some functions to start you up:
import numpy

def get_grids(N_X, N_Y):
    from numpy import mgrid
    return mgrid[-1:1:1j*N_X, -1:1:1j*N_Y]

def frequency_radius(fx, fy):
    R2 = fx**2 + fy**2
    (N_X, N_Y) = fx.shape
    R2[N_X//2, N_Y//2] = numpy.inf   # avoid dividing by zero at the DC component
    return numpy.sqrt(R2)

def enveloppe_color(fx, fy, alpha=1.0):
    # 0.0, 0.5, 1.0, 2.0 are resp. white, pink, red, brown noise
    # (see http://en.wikipedia.org/wiki/1/f_noise )
    # enveloppe
    return 1. / frequency_radius(fx, fy)**alpha

from skimage import data
image = data.camera()   # scipy.lena() has been removed from SciPy; any natural test image will do
N_X, N_Y = image.shape
fx, fy = get_grids(N_X, N_Y)
pink_spectrum = enveloppe_color(fx, fy)

from scipy.fftpack import fft2
power_spectrum = numpy.abs(fft2(image))**2
I recommend this wonderful paper for more details.
scikit-image has an estimate_sigma function that works pretty well:
http://scikit-image.org/docs/dev/api/skimage.restoration.html#skimage.restoration.estimate_sigma
It also works with color images; you just need to set multichannel=True and average_sigmas=True:
import cv2
from skimage.restoration import estimate_sigma

def estimate_noise(image_path):
    img = cv2.imread(image_path)
    return estimate_sigma(img, multichannel=True, average_sigmas=True)
Higher values mean more estimated noise (estimate_sigma returns the estimated standard deviation of the Gaussian noise).

Resources