Shape of ROC curve - machine-learning

I did a prediction analysis on a dataset and drew the ROC curve.
The ROC curve looks like below,
I'm not very sure about the shape of the curve. Shouldn't it be a wavy curve? Looking at the curve, can we decide whether there is an issue with it? I got around 71% accuracy, which is OK for me, but I'm worried about the shape of the curve, which is not wavy. For example, it doesn't look like the one below (taken from the internet).

It looks like you only plotted three points. The idea of a ROC curve is to show how the TP and FP rates vary as you tweak the decision threshold, in order to establish the performance at every operating point. Without information about how you plotted this or what parameters you used, it's hard to say anything more.
A typical example would be to tweak the aggressiveness level: if you have a spam scanner which classifies a message as spam above a particular score, how does changing the score threshold change the TP/FP rates? So effectively the x axis will also reveal the threshold setting (though possibly stretched), and the curve at every point will show how many of the samples in your clean collection would be FPs at that threshold, and how many in your spam collection would be correctly blocked. A small sketch of such a threshold sweep is shown after the next paragraph.
("Stretching" means that the threshold setting might not map linearly onto the FP rate. If nothing happens between thresholds 0.950 and 0.975, you don't plot that interval on the x axis at all. The points on the x axis are the threshold values where the TP/FP rate changes; some could be very close to each other in terms of threshold value, and other adjacent points could correspond to a large jump in the threshold value.)
A good ROC curve has a large area underneath it. An ideal ROC curve jumps straight to a TP rate of 1.00 and stays there, but then you don't need the plot to help you decide how to deploy your solution anyway. In reality, they come in all kinds of shapes, from vaguely asymptotic towards the upper left (very good) to straight diagonal (pretty lousy) and even asymptotic towards the lower right (extremely poor; random verdicts would be better). The interesting points are the "knee" where the TP rate's growth slows down and the FP rate starts growing more quickly (that's where you should stop increasing the threshold) and any irregularities, especially any which break monotonicity.
(In your example from the net, there is a spot around TP 0.6 where increasing the threshold will only increase FPs. Why is that? Is there a skew in the samples, or a problem in the implementation? Could it be fixed?)

It looks like you have plotted points using the predicted class of a classifier (the .predict function in Python's sklearn package) rather than the predicted class probability (the .predict_proba function in Python's sklearn package). This means there is only one point where the threshold matters (where the class switches from 0 to 1), rather than a range of thresholds that would give you the smooth curve.
Replace the predicted class with the predicted probability and this should fix your problem.
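For example, a small scikit-learn sketch (the toy dataset and LogisticRegression model are stand-ins, not the asker's setup):
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt

# Toy data and model, standing in for the asker's classifier.
X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Use the positive-class probability, not the hard 0/1 output of .predict().
y_score = clf.predict_proba(X_test)[:, 1]

fpr, tpr, thresholds = roc_curve(y_test, y_score)
plt.plot(fpr, tpr)
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.show()
print("AUC:", roc_auc_score(y_test, y_score))
Because y_score contains many distinct values, roc_curve evaluates many thresholds and the curve is no longer just three points.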

Related

Which power of the feature should I train with? - regression

I have an explanatory variable x and a response variable y. I am trying to find which power of the feature I should train with. You can ignore the colors for my question. The scatter data is from the sensor, and the line plot is the theoretical curve from the lab, which you can also ignore for my question.
For this answer I assume you want to obtain a polynomial curve going through the croissant-shaped zone where the points are dense.
I also assume that the independent variable is on the horizontal axis, while the dependent one is on the vertical axis. Otherwise, as you can see from the blue line, there is no function that could give you this.
Now, to select the degree of the polynomial, you can use stepwise regression.
This means running the regression while adding or removing one feature at a time (i.e. increasing or decreasing the degree of the polynomial in this case) and computing a score such as AIC, BIC, or even adjusted R², to assess whether adding or removing that feature is worth it. A sketch of the procedure is shown below.
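A rough sketch of that procedure (the synthetic data and the use of statsmodels for AIC/BIC are my assumptions, not something given in the question):
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
# Synthetic stand-in for the sensor data: a cubic relationship plus noise.
x = rng.uniform(0, 3, 200)
y = 0.5 * x ** 3 - x ** 2 + 2 * x + rng.normal(0, 0.5, 200)

# Fit polynomials of increasing degree and compare information criteria;
# lower AIC/BIC (and higher adjusted R^2) suggests a better trade-off.
for degree in range(1, 7):
    X = sm.add_constant(np.column_stack([x ** p for p in range(1, degree + 1)]))
    model = sm.OLS(y, X).fit()
    print(f"degree={degree}  AIC={model.aic:.1f}  BIC={model.bic:.1f}  adjR2={model.rsquared_adj:.3f}")
You would stop increasing the degree once the criterion stops improving (here it should level off around degree 3).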

Is it normal for gradients to be extremely large in a deep convnet?

I just finished implementing a convolutional neural network from scratch. This is the first time I've done this. When testing my backpropagation algorithm, the outputted delta values for the weights are extremely large compared to what the original value was. For example, all my weights are initialized to a random number between -0.1 and 0.1, but the delta values outputted are around 75000. This obviously is much too big of a change, and it requires a very small learning rate to even be near functional. A learning rate like 0.01 seems like the convention but mine needs to be at least 0.0000001, leading me to believe I'm doing something wrong. The thing is I don't see how the deltas couldn't be large. To get the derivative of weights with regard to the cost function I convolve the activations of the previous layer (mostly positive due to leaky reLu) with the previous errors (all either 0.1 or 1 due to the derivative of leaky reLu). Obviously the sum of all these positive numbers will get very large as it propagates through the layers. Did I skip a step somewhere? Is this an exploding gradient problem? Should I use gradient clipping or batch normalization?
Depending on the size of the convolutions, -0.1 to 0.1 seems extremely large. Try something like 0.01 or even less.
If you want a more principled initialization, you can take a look at Glorot (http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf?hc_location=ufi) or He (https://www.cv-foundation.org/openaccess/content_iccv_2015/papers/He_Delving_Deep_into_ICCV_2015_paper.pdf) initialization.
The crux is to initialize with either uniform or Gaussian values with mean 0 and a standard deviation that scales with the inverse square root of the number of input units (fan-in).
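As a sketch of the He-style variant (the kernel size and channel counts below are made up for illustration; see the linked papers for the exact formulations):
import numpy as np

def he_init(fan_in, shape, rng=np.random.default_rng(0)):
    # He initialization: Gaussian with mean 0 and std = sqrt(2 / fan_in),
    # suited to ReLU-like activations.
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=shape)

# Example: 3x3 kernels, 64 input channels, 32 output channels.
# fan_in = kernel_height * kernel_width * in_channels
w = he_init(fan_in=3 * 3 * 64, shape=(32, 64, 3, 3))
print(w.std())  # about sqrt(2 / 576), roughly 0.059
Scaling the spread of the weights to the fan-in keeps the activations and gradients from growing layer by layer, which is exactly the symptom described in the question.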

Why do we need to normalize the input to zero mean and unit variance before feeding it to the network?

In deep learning, I have seen many papers apply normalization as a pre-processing step: the input is normalized to zero mean and unit variance before being fed to the convolutional network (which has BatchNorm). Why not use the original intensities? What is the benefit of the normalization step? If I use histogram matching among images, should I still use the normalization step? Thanks
Normalization is important because it brings the features onto the same scale, which makes the network behave much better. Assume there are two features, one measured on a scale of 1 to 10 and the other on a scale of 1 to 10,000. With a squared-error loss, the network will be busy optimizing the weights according to the larger error contributed by the second feature.
Therefore it is better to normalize.
The answer to this can be found in Andrew Ng's tutorial: https://youtu.be/UIp2CMI0748?t=133.
TLDR: If you do not normalize input features, some features can have a very different scale and will slow down Gradient Descent.
Long explanation: Let us consider a model that uses two features Feature1 and Feature2 with the following ranges:
Feature1: [10,10000]
Feature2: [0.00001, 0.001]
The Contour plot of these will look something like this (scaled for easier visibility).
Contour plot of Feature1 and Feature2
When you perform Gradient Descent, you compute the partial derivatives of the loss with respect to each weight in order to move the model weights closer to the minimum. Because the features are on such different scales, the elongated contours above mean that the derivative with respect to the weight of the large-scale feature (Feature1) will be much larger than the one for Feature2. So even if you choose a reasonably moderate learning rate, you will zig-zag along that steep direction and may even miss the global minimum.
Medium value of learning rate
To avoid this, if you choose a very small value of the learning rate, Gradient Descent will take a very long time to converge, and you may stop training before even reaching the global minimum.
Very small Gradient Descent
So, as you can see from the examples above, not scaling your features leads to inefficient Gradient Descent and may keep you from finding the optimal model. A small standardization sketch is shown below.
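Here is one way to do it (the arrays are synthetic, mirroring the Feature1/Feature2 ranges above; the statistics are computed on the training data only):
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical features on very different scales, as in the example above.
X_train = np.column_stack([rng.uniform(10, 10000, 500),
                           rng.uniform(1e-5, 1e-3, 500)])
X_test = np.column_stack([rng.uniform(10, 10000, 100),
                          rng.uniform(1e-5, 1e-3, 100)])

# Compute statistics on the training set only, then apply them everywhere.
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)
X_train_norm = (X_train - mean) / std
X_test_norm = (X_test - mean) / std

print(X_train_norm.mean(axis=0))  # close to 0 for both features
print(X_train_norm.std(axis=0))   # close to 1 for both features
After this transformation the loss contours are much closer to circular, so a single moderate learning rate works for both weights.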

ROC curve shows strange pattern

I have a dataset to which I add 10-30% artificial data and run an algorithm to classify which data points are original and which are artificial. I got the attached ROC curves. I've never seen ROC curves ending like that. Am I doing something wrong, or is such a pattern possible? If so, what would explain it?
Thanks
You could see a ROC curve similar to what you have shown if your target data have an unbalanced bimodal distribution with a noise/background distribution located between the two modes. Initially (like in your plot), you would have a steep increase in the ROC curve as it covers the main peak of the true positive (TP) distribution. Next, you would have a relatively flat region where you accumulate false positives (FP's) without much increase in TP's. Then, you would hit the second cluster of TP's.
I'm guessing that your artificial data is closer to the centroid of the main cluster of TP's, which is why adding more artificial data tends to deemphasize the smaller TP cluster and make it look more like a typical ROC curve.
As I mentioned in my comment, it would be informative to plot the ROC curve without any artificial data. Also, it could be informative to show a version zoomed in on the tail end of the plot where the TP rate approaches 1 (i.e., to see if it flattens as it approaches 1).
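As a purely illustrative sketch (invented distributions, not your data), simulating an unbalanced bimodal positive-score distribution with the negatives located between the two modes produces a ROC curve with the same kind of late jump:
import numpy as np
from sklearn.metrics import roc_curve
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# Positives: a large mode plus a small, well-separated second mode.
pos = np.concatenate([rng.normal(3.0, 0.5, 900), rng.normal(-2.0, 0.5, 100)])
# Negatives: located between the two positive modes.
neg = rng.normal(0.0, 0.7, 1000)

scores = np.concatenate([pos, neg])
labels = np.concatenate([np.ones_like(pos), np.zeros_like(neg)])

fpr, tpr, _ = roc_curve(labels, scores)
plt.plot(fpr, tpr)
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.show()
The curve rises steeply over the main positive mode, flattens while the negatives accumulate, and only reaches a TP rate of 1 once the second positive cluster is passed.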

What FFT descriptors should be used as feature to implement classification or clustering algorithm?

I have some sampled geographical trajectories to analyze, and I calculated a histogram of the data in the spatial and temporal dimensions, which yielded a time-domain feature for each spatial element. I want to perform a discrete FFT to transform the time-domain feature into a frequency-domain feature (which I think may be more robust), and then run some classification or clustering algorithms.
But I'm not sure which descriptor to use as the frequency-domain feature, since there are the amplitude spectrum, the power spectrum, and the phase spectrum of a signal; I've read some references but am still confused about their significance. And what distance (similarity) function should be used when running learning algorithms on the frequency-domain feature vectors (Euclidean distance? Cosine distance? A Gaussian kernel? A chi-square kernel or something else?)
Hope someone can give me a clue or some material I can refer to, thanks~
Edit
Thanks to @DrKoch, I chose the spatial element with the largest L1 norm and plotted its log power spectrum in Python, and it did show some prominent peaks. Below are my code and the figure.
import numpy as np
import matplotlib.pyplot as plt
sp = np.fft.fft(signal)  # `signal` is the time-domain feature (hourly histogram) of the chosen element
freq = np.fft.fftfreq(signal.shape[-1], d = 1.)  # time slot of histogram is 1 hour
plt.plot(freq, np.log10(np.abs(sp) ** 2))
plt.show()
And I have several trivial questions to ask to make sure I totally understand your suggestion:
In your second suggestion, you said "ignore all these values."
Do you mean that the horizontal line represents the threshold and all values below it should be set to zero?
"you may search for the two, three largest peaks and use their location and probably widths as 'Features' for further classification."
I'm a little bit confused about the meaning of "location" and "width": does "location" refer to the log value of the power spectrum (y-axis) and "width" to the frequency (x-axis)? If so, how do I combine them into a feature vector, and how do I compare two feature vectors that have "a similar frequency and similar widths"?
Edit
I replaced np.fft.fft with np.fft.rfft to calculate only the positive-frequency part, and plotted both the power spectrum and the log power spectrum.
code:
f, axarr = plt.subplots(2, sharex = True)  # plt.subplots (not plt.subplot) returns the figure and an array of axes
axarr[0].plot(freq, np.abs(sp) ** 2)
axarr[1].plot(freq, np.log10(np.abs(sp) ** 2))
plt.show()
figure:
Please correct me if I'm wrong:
I think I should keep the last four peaks in the first figure, using power = np.abs(sp) ** 2 and power[power < threshold] = 0, because the log power spectrum reduces the difference between components, and then use the log spectrum of the thresholded power as the feature vector to feed the classifiers.
I also saw some references suggest applying a window function (e.g. a Hamming window) before doing the FFT to avoid spectral leakage. My raw data is sampled every 5~15 seconds and I've applied a histogram over the sampling time; is that equivalent to applying a window function, or do I still need to apply one to the histogram data?
Generally you should extract just a small number of "features" out of the complete FFT spectrum.
First: use the log power spectrum.
Complex numbers and phase are useless in these circumstances, because they depend on where you start/stop your data acquisition (among many other things).
Second: you will see a "noise level", i.e. most values are below a certain threshold; ignore all these values.
Third: if you are lucky, i.e. your data has some harmonic content (cycles, repetitions), you will see a few prominent peaks.
If there are clear peaks, it is even easier to detect the noise: everything between the peaks should be considered noise.
Now you may search for the two or three largest peaks and use their locations and possibly widths as "features" for further classification.
Location is the x-value of the peak, i.e. the "frequency". It says something about how "fast" the cycles in your input data are.
If your cycles don't have a constant frequency during the measuring interval (or you use a window before calculating the FFT), the peak will be broader than one bin. So the width of the peak says something about the "stability" of your cycles.
Based on this: two patterns are similar if the biggest peaks of both have a similar frequency and similar widths, and so on. A small peak-extraction sketch is shown below.
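As a rough sketch of that peak extraction (the synthetic spectrum, the median-based noise-floor estimate, and the use of scipy.signal.find_peaks are my choices, not something prescribed above):
import numpy as np
from scipy.signal import find_peaks, peak_widths

# Synthetic one-sided log power spectrum standing in for the asker's `log_power` and `freq`.
rng = np.random.default_rng(0)
signal = np.sin(2 * np.pi * 0.04 * np.arange(500)) + 0.2 * rng.normal(size=500)
freq = np.fft.rfftfreq(signal.size, d=1.0)
log_power = np.log10(np.abs(np.fft.rfft(signal)) ** 2 + 1e-12)

# Treat values near the median as the noise floor; keep only peaks well above it.
noise_level = np.median(log_power)
peaks, props = find_peaks(log_power, height=noise_level + 1.0)
widths = peak_widths(log_power, peaks, rel_height=0.5)[0]

# Use the locations (frequencies) and widths (in bins) of the largest peaks as features.
order = np.argsort(props["peak_heights"])[::-1][:3]
features = np.concatenate([freq[peaks[order]], widths[order]])
print(features)
The resulting short vector (a few locations and widths) is what you would feed to a classifier or clustering algorithm, instead of the full spectrum.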
EDIT
Very interesting to see a logarithmic power spectrum of one of your examples.
Now it's clear that your input contains a single harmonic (periodic, oscillating) component with a frequency (repetition rate, cycle duration) of about f0 = 0.04.
(This is a relative frequency, proportional to your sampling frequency, the inverse of the time between individual measurement points.)
It is not a pure sine wave, but some "interesting" waveform. Such waveforms produce peaks at 1*f0, 2*f0, 3*f0 and so on.
(So using an FFT for further analysis turns out to be a very good idea.)
At this point you should produce spectra of several measurements and see what makes measurements similar and how different measurements differ. What are the "important" features that distinguish your measurements? Things to look out for:
Absolute amplitude: height of the prominent (leftmost, highest) peaks.
Pitch (main cycle rate, speed of changes): this is the position of the first peak, or the distance between consecutive peaks.
Exact waveform: relative amplitude of the first few peaks.
If your most important feature is absolute amplitude, you're better off calculating the RMS (root mean square) level of your input signal.
If pitch is important, you're better off calculating the ACF (auto-correlation function) of your input signal; a small sketch of both is shown below.
Don't focus on the leftmost peaks, these come from the high frequency components in your input and tend to vary as much as the noise floor.
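A small sketch of both quantities (the synthetic signal and the min_lag cutoff are assumptions for illustration):
import numpy as np

# Synthetic stand-in for the time-domain feature, with a cycle length of 25 samples (f0 = 0.04).
rng = np.random.default_rng(0)
signal = np.sin(2 * np.pi * 0.04 * np.arange(500)) + 0.3 * rng.normal(size=500)

# Absolute amplitude: RMS level of the input signal.
rms = np.sqrt(np.mean(signal ** 2))

# Pitch: lag of the largest autocorrelation peak away from lag 0.
sig = signal - signal.mean()
acf = np.correlate(sig, sig, mode="full")[sig.size - 1:]
acf /= acf[0]                 # normalize so that acf[0] == 1
min_lag = 5                   # skip the very small lags dominated by noise and fast components
lag = np.argmax(acf[min_lag:]) + min_lag
print(rms, lag)               # lag comes out near 25 samples here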
Windows
For a high-quality analysis it is important to apply a window to the input data before applying the FFT. This reduces the influence of the "jump" between the end of your input vector and the beginning of your input vector, because the FFT treats the input as a single cycle.
There are several popular windows which mark different choices of an unavoidable trade-off: Precision of a single peak vs. level of sidelobes:
You chose a "rectangular window" (equivalent to no window at all, just start/stop your measurement). This gives excellent precission of your peaks which now have a width of just one sample. Your sidelobes (the small peaks left and right of your main peaks) are at -21dB, very tolerable given your input data. In your case this is an excellent choice.
A Hanning window is a single cosine wave. It makes your peaks slightly broader but reduces side-lobe levels.
The Hamming window (a cosine wave slightly raised above 0.0) produces even broader peaks, but suppresses side-lobes down to about -42 dB. This is a good choice if you expect further weak (but important) components between your main peaks, or generally if you have complicated signals like speech, music and so on. A small windowing sketch follows below.
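Applying such a window before the FFT could look like this (the synthetic signal is just a placeholder for the hourly histogram from the question):
import numpy as np

# `signal` stands in for the hourly histogram time series (synthetic here).
signal = np.sin(2 * np.pi * 0.04 * np.arange(500))

window = np.hanning(signal.size)             # or np.hamming(signal.size)
sp = np.fft.rfft(signal * window)
freq = np.fft.rfftfreq(signal.size, d=1.0)   # d = 1 hour per histogram bin
log_power = np.log10(np.abs(sp) ** 2 + 1e-12)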
Edit: Scaling
Correct scaling of a spectrum is a complicated thing, because the values of the FFT bins depend on many things like the sampling rate, the length of the FFT, the window, and even implementation details of the FFT algorithm (there exist several different accepted conventions).
After all, the FFT should respect the underlying conservation of energy (Parseval's theorem): the RMS of the input signal should match the RMS (energy) of the spectrum.
On the other hand, if the spectrum is used for classification it is enough to maintain relative amplitudes. As long as the parameters mentioned above do not change, the result can be used for classification without further scaling. A minimal consistency check is sketched below.
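A numerical check of that energy relation with NumPy's FFT convention (the random test signal is arbitrary):
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1024)
X = np.fft.fft(x)

# Parseval's theorem with NumPy's unnormalized forward FFT: sum |x|^2 == sum |X|^2 / N
energy_time = np.sum(np.abs(x) ** 2)
energy_freq = np.sum(np.abs(X) ** 2) / x.size
print(np.allclose(energy_time, energy_freq))  # True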
