My ultimate goal is to train a network for kidney and tumor segmentation from volumetric CT scans, and I can't figure out which preprocessing approach is better.
In particular, after clipping the Hounsfield units with the numpy.clip function, which of the two options I found should I choose?
1. min-max scaling + rescaling between 0 and 255
2. subtracting the mean value and dividing by the standard deviation
And if I choose the second method, do I still have to scale between 0 and 255?
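For concreteness, here is a minimal sketch of what the two options look like in NumPy. The clip bounds are illustrative placeholders, not values taken from any particular dataset:

import numpy as np

# Illustrative HU clip range; choose bounds appropriate for the organs of interest.
HU_MIN, HU_MAX = -100, 300

def preprocess_minmax(volume):
    # Option 1: clip, min-max scale to [0, 1], then rescale to [0, 255].
    v = np.clip(volume, HU_MIN, HU_MAX).astype(np.float32)
    v = (v - v.min()) / (v.max() - v.min() + 1e-8)
    return v * 255.0

def preprocess_zscore(volume):
    # Option 2: clip, then subtract the mean and divide by the standard deviation.
    v = np.clip(volume, HU_MIN, HU_MAX).astype(np.float32)
    return (v - v.mean()) / (v.std() + 1e-8)

Option 2 produces a roughly zero-mean, unit-variance volume rather than values in a fixed 0-255 range.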
I'm reading an image segmentation paper in which the problem is approached using the paradigm "signal separation", the idea that a signal (in this case, an image) is composed of several signals (objects in the image) as well as noise, and the task is to separate out the signals (segment the image).
The output of the algorithm is an M x T matrix S, which represents a segmentation of the image into M components. T is the total number of pixels in the image, and s_ij is the value of the source component (signal/object) i at pixel j.
In the paper I'm reading, the authors wish to select a component m for which s_m matches certain smoothness and entropy criteria. But I'm failing to understand what entropy is in this case.
Entropy is defined as the following:
H(s_m) = -sum_{n=1}^{256} p_n log(p_n)
and they say that the p_n are probabilities associated with the bins of the histogram of s_m.
The target component is a tumor and the paper reads: "the tumor related component with "almost" constant values is expected to have the lowest value of entropy."
But what does low entropy mean in this context? What does each bin represent? What does a vector with low entropy look like?
link to paper
They are talking about Shannon's entropy. One way to view entropy is to relate it to the amount of uncertainty about an event associated with a given probability distribution. Entropy can serve as a measure of 'disorder'. As the level of disorder rises, the entropy rises and events become less predictable.
Back to the definition of entropy in the paper:
H(s_m) is the entropy of the random variable s_m. Here p_n is the probability of the n-th outcome, and the sum runs over all possible outcomes. The probability density p_n is calculated from the gray-level histogram, which is the reason why the sum runs from 1 to 256. The bins represent the possible states.
So what does this mean? In image processing, entropy can be used to classify textures: a certain texture might have a certain entropy, as certain patterns repeat themselves in approximately certain ways. In the context of the paper, low entropy H(s_m) means low disorder, low variance within the component m. A component with low entropy is more homogeneous than a component with high entropy, which they use in combination with the smoothness criterion to classify the components.
Another way of looking at entropy is to view it as the measure of information content. A vector with relatively 'low' entropy is a vector with relatively low information content. It might be [0 1 0 1 1 1 0]. A vector with relatively 'high' entropy is a vector with relatively high information content. It might be [0 242 124 222 149 13].
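To make this concrete, here is a small sketch (my own illustration, not code from the paper) that computes the histogram-based Shannon entropy of the two example vectors above:

import numpy as np

def histogram_entropy(x, bins=256):
    # Build a histogram, normalize it to probabilities, and compute -sum(p * log2(p)).
    counts, _ = np.histogram(x, bins=bins)
    p = counts / counts.sum()
    p = p[p > 0]                       # drop empty bins; 0 * log(0) is taken as 0
    return -np.sum(p * np.log2(p))     # entropy in bits

print(histogram_entropy([0, 1, 0, 1, 1, 1, 0]))        # few distinct values -> low entropy
print(histogram_entropy([0, 242, 124, 222, 149, 13]))  # spread-out values -> higher entropy

The first vector concentrates its mass in two bins, so its entropy is low; the second spreads across many bins and its entropy is correspondingly higher.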
It's a fascinating and complex subject which really can't be summarised in one post.
Entropy was introduced by Shannon (1948), where a higher value of entropy means more detailed information.
Entropy is a measure of image information content, which is interpreted as the average uncertainty of the information source.
In an image, entropy is defined over the intensity levels that individual pixels can take on.
It is used in the quantitative analysis and evaluation of image details; the entropy value is used because it provides a better comparison of image details.
Perhaps, another way to think about entropy and information content in an image is to consider how much an image can be compressed. Independent of the compression scheme (run length encoding being one of many), you can imagine a simple image having little information (low entropy) can be encoded with fewer bytes of data while completely random images (like white noise) cannot be compressed much, if at all.
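One quick way to see this (a side illustration, not from the original discussion) is to compress a constant array and a random array and compare the output sizes:

import zlib
import numpy as np

flat = np.zeros(10000, dtype=np.uint8)                                     # low entropy
noise = np.random.default_rng(0).integers(0, 256, 10000, dtype=np.uint8)   # high entropy

print(len(zlib.compress(flat.tobytes())))   # a few dozen bytes
print(len(zlib.compress(noise.tobytes())))  # close to the original 10000 bytes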
I'm having a problem implementing a Bayes Classifier with the Parzen window algorithm using a spherical (or isotropic) kernel.
I am running the algorithm with test data containing 2 dimensions and 3 different classes (For each class, I have 10 test points, and 40 training points, all in 2 dimensions). When I change the value of my hyper-parameter (sigma_sq for the spherical Gaussian kernel), I find that there is no effect on how the points are classified.
This is my density estimator. My self.sigma_sq is the same across all the dimensions of my data (2 dimensions)
for i in range(test_data.shape[0]):
    log_prob_intermediate = 0
    for j in range(n):  # n is the size of the training set
        # c is the log normalization constant of a d-dimensional isotropic Gaussian kernel
        c = -self.n_dims * np.log(2 * np.pi) / 2.0 - self.n_dims * np.log(self.sigma_sq) / 2.0
        log_prob_intermediate += c - np.sum((test_data[i, :] - self.train_data[j, :]) ** 2.0) / (2.0 * self.sigma_sq)
    log_prob.append(log_prob_intermediate / n)
How I implemented my Bayes Classifier:
There are 3 classes that my Bayes Classifier must distinguish. I created 3 training sets and 3 test sets (one training and test set per class). For each point in my test set, I run the density estimator for each class on the point. This gives me a vector of 3 values: the log probability that my new point is in class1, class2, or class3. I then choose the maximum value and assign the new point to that class.
Since I am using a spherical Gaussian kernel, I am of the understanding that my sigma_sq must be common for each density estimator (one density estimator for each class). Is this correct? If I had a different sigma_sq for each dimension pair, wouldn't this give me somewhat of a diagonal Gaussian kernel?
For my list of 30 test points (10 for each class), I find that running the bayes classifier on these points continues to give me the exact same classification for each point, regardless of what sigma I use. Is this normal? Since it's a spherical Gaussian kernel, and all my dimensions use the same kernel, is increasing or decreasing my sigma_sq just having a proportional effect on my log probability with no change in the classification? OR do I have some sort of problem with my density estimator that I can't figure out.
Let's address each issue separately.
Using the same sigma for each dimension makes your kernel radial, this is true; however, you can (and should!) use a different sigma for each class, as each distribution usually requires a different density estimator. For simple heuristics, read for example about Scott's rule of thumb for kernel-width selection in the Gaussian case, or the later work by Silverman.
It is hard to tell whether, in your particular case, the choice of sigma should change the classification. In general it should, but each dataset has its own properties. However, your data is just 2D, which makes it perfect for visualization. Draw your data, then draw each KDE, and simply investigate visually what is going on.
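As a starting point, here is a minimal sketch of a per-class kernel density classifier with a separate bandwidth per class, using scipy.stats.gaussian_kde (which applies Scott's rule by default); the toy data and variable names are my own assumptions, not your code:

import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
# Toy 2-D data: 40 training and 10 test points per class (shapes only; the values are made up).
train = {c: rng.normal(loc=3 * c, scale=1.0, size=(40, 2)) for c in range(3)}
test = np.vstack([rng.normal(loc=3 * c, scale=1.0, size=(10, 2)) for c in range(3)])

# gaussian_kde expects data of shape (n_dims, n_samples); with the default bw_method
# each class gets its own bandwidth estimated from its own training points.
kdes = {c: gaussian_kde(train[c].T) for c in range(3)}

# Classify each test point by the class with the highest log-density.
log_dens = np.column_stack([kdes[c].logpdf(test.T) for c in range(3)])
print(log_dens.argmax(axis=1))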
I have some sampled geographical trajectories to analyze. I calculated a histogram of the data in the spatial and temporal dimensions, which yields a time-domain feature for each spatial element. I want to perform a discrete FFT to transform the time-domain feature into a frequency-domain feature (which I think may be more robust), and then apply some classification or clustering algorithms.
But I'm not sure which descriptor to use as the frequency-domain feature, since a signal has an amplitude spectrum, a power spectrum and a phase spectrum, and although I've read some references I'm still confused about their significance. And what distance (similarity) function should be used when running learning algorithms on frequency-domain feature vectors (Euclidean distance? Cosine distance? A Gaussian kernel? A chi-square kernel, or something else?)
I hope someone can give me a clue or some material I can refer to, thanks~
Edit
Thanks to @DrKoch, I chose the spatial element with the largest L1 norm and plotted its log power spectrum in Python, and it does show some prominent peaks. Below are my code and the figure.
import numpy as np
import matplotlib.pyplot as plt
sp = np.fft.fft(signal)
freq = np.fft.fftfreq(signal.shape[-1], d = 1.) # time slot of the histogram is 1 hour
plt.plot(freq, np.log10(np.abs(sp) ** 2))
plt.show()
And I have several trivial questions to ask to make sure I totally understand your suggestion:
In your second suggestion, you said "ignore all these values."
Do you mean that the horizontal line represents the threshold, and all values below it should be set to zero?
"you may search for the two, three largest peaks and use their location and probably widths as 'Features' for further classification."
I'm a little bit confused about the meaning of "location" and "width": does "location" refer to the log value of the power spectrum (y-axis) and "width" to the frequency (x-axis)? If so, how do I combine them into a feature vector, and how do I compare two feature vectors that have "a similar frequency and similar widths"?
Edit
I replaced np.fft.fft with np.fft.rfft to calculate only the positive-frequency part, and plotted both the power spectrum and the log power spectrum.
code:
f, axarr = plt.subplots(2, sharex = True)  # plt.subplots (plural) returns the figure and an array of axes
axarr[0].plot(freq, np.abs(sp) ** 2)
axarr[1].plot(freq, np.log10(np.abs(sp) ** 2))
plt.show()
figure:
Please correct me if I'm wrong:
I think I should keep the last four peaks in the first figure, using power = np.abs(sp) ** 2 and power[power < threshold] = 0, because the log power spectrum reduces the differences among the components. Then I would use the log spectrum of the new power as the feature vector to feed to classifiers.
I also see some references suggesting that a window function (e.g. a Hamming window) should be applied before the FFT to avoid spectral leakage. My raw data is sampled every 5 ~ 15 seconds and I've already binned the sampling times into a histogram; is that equivalent to applying a window function, or do I still need to apply one to the histogram data?
Generally you should extract just a small number of "Features" out of the complete FFT spectrum.
First: use the log power spectrum.
Complex values and phase are useless in these circumstances, because they depend on where you start/stop your data acquisition (among many other things).
Second: you will see a "noise level", i.e. most values are below a certain threshold; ignore all these values.
Third: if you are lucky, i.e. your data has some harmonic content (cycles, repetitions), you will see a few prominent peaks.
If there are clear peaks, it is even easier to identify the noise: everything between the peaks should be considered noise.
Now you may search for the two, three largest peaks and use their location and probably widths as "Features" for further classification.
Location is the x-value of the peak, i.e. the "frequency". It says something about how "fast" the cycles in your input data are.
If your cycles don't have a constant frequency during the measuring interval (or you use a window before calculating the FFT), the peak will be broader than one bin, so the width of the peak says something about the "stability" of your cycles.
Based on this: two patterns are similar if the biggest peaks of both have a similar frequency and similar widths, and so on.
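As a rough illustration of turning peak locations and widths into features, here is a sketch using scipy.signal.find_peaks and peak_widths; the height margin and the number of peaks kept are assumptions you would tune on your data:

import numpy as np
from scipy.signal import find_peaks, peak_widths

def peak_features(signal, n_peaks=3, d=1.0):
    # Return (frequency, width) pairs for the n_peaks largest peaks of the log power spectrum.
    sp = np.fft.rfft(signal)
    freq = np.fft.rfftfreq(len(signal), d=d)
    log_power = np.log10(np.abs(sp) ** 2 + 1e-12)

    # Keep only peaks that stand out above an assumed noise floor (median + 1 decade here).
    peaks, props = find_peaks(log_power, height=np.median(log_power) + 1.0)
    widths = peak_widths(log_power, peaks, rel_height=0.5)[0]  # width at half prominence, in bins

    # Sort by peak height and keep the largest n_peaks as (location, width) features.
    order = np.argsort(props["peak_heights"])[::-1][:n_peaks]
    return np.column_stack([freq[peaks][order], widths[order]])

Two measurements can then be compared by, for example, the Euclidean distance between these small feature matrices, as long as the peaks are matched in a consistent order.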
EDIT
It is very interesting to see a logarithmic power spectrum of one of your examples.
Now it's clear that your input contains a single harmonic (periodic, oscillating) component with a frequency (repetition rate, cycle duration) of about f0 = 0.04.
(This is a relative frequency, proportional to your sampling frequency, which is the inverse of the time between individual measurement points.)
It is not a pure sine wave, but some "interesting" waveform. Such waveforms produce peaks at 1*f0, 2*f0, 3*f0 and so on.
(So using an FFT for further analysis turns out to be a very good idea.)
At this point you should produce spectra of several measurements and see what makes measurements similar and how different measurements differ. What are the "important" features that distinguish your measurements? Things to look out for:
Absolute amplitude: Height of the prominent (leftmost, highest) peaks.
Pitch (Main cycle rate, speed of changes): this is position of first peak, distance between consecutive peaks.
Exact Waveform: Relative amplitude of the first few peaks.
If your most important feature is absolute amplitude, you are better off calculating the RMS (root mean square) level of your input signal.
If pitch is important, you are better off calculating the ACF (auto-correlation function) of your input signal; see the sketch below.
Don't focus on the rightmost peaks; these come from the high-frequency components in your input and tend to vary as much as the noise floor.
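For reference, here is a small sketch of both of those time-domain alternatives (an RMS level and an ACF-based pitch estimate); the minimum-lag parameter is an assumption used to skip the trivial peak at lag 0:

import numpy as np

def rms(x):
    # Root mean square level of the signal (a measure of absolute amplitude).
    return np.sqrt(np.mean(np.asarray(x, dtype=float) ** 2))

def acf_pitch(x, min_lag=1):
    # Autocorrelation of the zero-mean signal; the lag of its largest peak
    # (ignoring lag 0) estimates the main cycle length in samples.
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    acf = np.correlate(x, x, mode="full")[len(x) - 1:]  # lags 0 .. len(x) - 1
    return min_lag + np.argmax(acf[min_lag:])           # cycle length in samples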
Windows
For a high-quality analysis it is important to apply a window to the input data before applying the FFT. This reduces the influence of the "jump" between the end of your input vector and the beginning of your input vector, because the FFT treats the input as a single cycle.
There are several popular windows which represent different choices in an unavoidable trade-off: precision of a single peak vs. level of the sidelobes:
You chose a "rectangular window" (equivalent to no window at all, just start/stop your measurement). This gives excellent precision for your peaks, which now have a width of just one sample. Your sidelobes (the small peaks left and right of your main peaks) are at -21 dB, very tolerable given your input data. In your case this is an excellent choice.
A Hanning window is a single cosine wave. It makes your peaks slightly broader but reduces the side-lobe levels.
The Hamming window (a cosine wave, slightly raised above 0.0) produces even broader peaks, but suppresses side-lobes by -42 dB. This is a good choice if you expect further weak (but important) components between your main peaks, or generally if you have complicated signals like speech, music and so on.
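In NumPy terms, applying a window is just an element-wise multiplication before the FFT. A brief sketch (the choice of np.hanning here is only an example):

import numpy as np

def windowed_log_power(signal, window=np.hanning):
    # Multiply the input by the window, then take the one-sided FFT.
    signal = np.asarray(signal, dtype=float)
    w = window(len(signal))
    sp = np.fft.rfft(signal * w)
    return np.log10(np.abs(sp) ** 2 + 1e-12)

np.hanning, np.hamming and np.blackman are all built in; passing window=lambda n: np.ones(n) reproduces the rectangular (no) window.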
Edit: Scaling
Correct scaling of a spectrum is a complicated thing, because the values of the FFT lines depend on many things like the sampling rate, the length of the FFT, the window, and even implementation details of the FFT algorithm (there exist several different accepted conventions).
After all, the FFT should show the underlying conservation of energy: the RMS of the input signal should be the same as the RMS (energy) of the spectrum.
On the other hand: if used for classification it is enough to maintain relative amplitudes. As long as the parameters mentioned above do not change, the result can be used for classification without further scaling.
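The energy-conservation statement can be checked numerically via Parseval's theorem; a quick sketch under NumPy's default (unnormalized forward) FFT convention:

import numpy as np

x = np.random.default_rng(0).normal(size=1024)
X = np.fft.fft(x)

# Parseval's theorem with NumPy's convention: sum |x|^2 == sum |X|^2 / N
energy_time = np.sum(np.abs(x) ** 2)
energy_freq = np.sum(np.abs(X) ** 2) / len(x)
print(np.allclose(energy_time, energy_freq))  # True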
I'm trying to implement the original and circular Local Binary Pattern (LBP) with uniform pattern mapping for a face recognition application.
I've finished the LBP descriptor extraction and spatial histogram construction steps so far. Now I have to work on the face classification and recognition phases. As the original paper on the subject suggests, the simplest classifier uses the chi-square statistic as a dissimilarity measure between the histograms of 2 face images. The formula seems straightforward, but I don't know how to decide whether 2 histograms represent the same face or different faces based on the resulting chi-square dissimilarity value. So my question is: what is the optimal threshold value that I can use as the border line between same faces and different faces? How can I determine that value?
I've come across some source code on the internet that sets the LBP threshold to 180.0, and I have no idea where this value came from.
I would greatly appreciate your help. Thanks for reading.
In the same/not-same setting, you learn the optimal threshold from the training set. Given, say, 1000 same and 1000 not-same pairs for training, run a for loop over the threshold. For each threshold value, calculate the balanced accuracy as 0.5 * (percent of same pairs with distance < current threshold) + 0.5 * (percent of not-same pairs with distance >= current threshold). Then keep track of the threshold that does best.
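A minimal sketch of that procedure (the chi-square distance below and the candidate-threshold grid are my own assumptions about the details):

import numpy as np

def chi_square_distance(h1, h2, eps=1e-10):
    # Chi-square dissimilarity between two histograms.
    h1, h2 = np.asarray(h1, float), np.asarray(h2, float)
    return np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

def learn_threshold(same_dists, diff_dists):
    # Sweep candidate thresholds and keep the one with the best balanced accuracy.
    same_dists, diff_dists = np.asarray(same_dists), np.asarray(diff_dists)
    best_t, best_acc = None, -1.0
    for t in np.unique(np.concatenate([same_dists, diff_dists])):
        acc = 0.5 * np.mean(same_dists < t) + 0.5 * np.mean(diff_dists >= t)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc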
By the way, for the same/not-same setting, I would recommend considering one-shot similarity.
I am trying to use SIFT for object classification with the Normal Bayes Classifier. When I compute the descriptors for images of variable size, I get feature matrices of different sizes. E.g.:
Feature Size: [128 x 39]
Feature Size: [128 x 54]
Feature Size: [128 x 69]
Feature Size: [128 x 64]
Feature Size: [128 x 14]
For development, I am using 20 training images and therefore I have 20 labels. My classification has only 3 classes: car, book and ball. So my label vector size is [1 x 20].
As far as I understand, to perform machine learning the number of feature vectors and the number of labels should match, so I should get training data of size [__ x 20] while the labels are [1 x 20].
But my problem is that SIFT has a 128-dimensional feature space and each image yields a different number of features, as shown above. How do I convert them all to the same size without losing features?
Or perhaps I am doing this incorrectly, so please help me out here.
PS: I have actually done this with the BOW model and it works, but just for learning purposes I am trying to do it this way out of interest, so any hints and advice are welcome. Thank you.
You are right, the SIFT descriptor is a 128-dimensional feature.
A SIFT descriptor is computed for every key point detected in the image. Before computing descriptors, you probably used a detector (such as the Harris, SIFT or SURF detector) to detect points of interest.
Detecting key-points and computing descriptors are two independent steps!
When you print Feature Size: [128 x Y] in your program, Y represents the number of key-points detected in the current image.
Generally, using BOW allows you to assign to each key-point descriptor the index of the closest cluster in the BOW vocabulary. Depending on your application, you can then make a decision (voting on the presence of one object in the scene, or ...).
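As a concrete illustration of that last point, here is a rough sketch of building fixed-length BOW histograms from variable-sized SIFT descriptor matrices; the vocabulary size of 50 and the use of scikit-learn's KMeans are my own assumptions, not part of your setup:

import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(descriptor_list, k=50):
    # Stack the per-image descriptor matrices (one 128-D row per key point) and cluster them.
    all_desc = np.vstack(descriptor_list)
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit(all_desc)

def bow_histogram(descriptors, vocab):
    # Assign each descriptor to its nearest cluster and count the occurrences:
    # every image ends up with the same fixed-length feature vector of size k.
    words = vocab.predict(descriptors)
    hist = np.bincount(words, minlength=vocab.n_clusters).astype(float)
    return hist / hist.sum()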
If you do not want to use BOW you could try to match the individual SIFT features as described in the original SIFT paper by Lowe.
The basic idea is that you compare two images with each other and decide whether they are similar or not. You do that by comparing the individual SIFT features and deciding whether they match. Then, to check whether the spatial positions are consistent, you verify that it is possible to transform the matched features from one image to the other.
It is described in more detail in the SIFT wikipedia article.
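A hedged sketch of that matching approach using OpenCV (Lowe's ratio test plus a RANSAC homography check); the ratio of 0.75 and the image file names are illustrative assumptions:

import cv2
import numpy as np

sift = cv2.SIFT_create()  # available in the main OpenCV package since 4.4
bf = cv2.BFMatcher()

img1 = cv2.imread("object.png", cv2.IMREAD_GRAYSCALE)  # hypothetical file names
img2 = cv2.imread("scene.png", cv2.IMREAD_GRAYSCALE)

kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

# Lowe's ratio test: keep a match only if it is clearly better than the second-best candidate.
good = [m for m, n in bf.knnMatch(des1, des2, k=2) if m.distance < 0.75 * n.distance]

# Geometric verification: can the matched key points be related by a single homography?
if len(good) >= 4:
    src = np.float32([kp1[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([kp2[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
    H, mask = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    print("inlier matches:", int(mask.sum()) if mask is not None else 0)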