How to manually scale data to a normal distribution - machine-learning

I do data normalization as:
X = ( X - X.mean(axis=0) ) / X.std(axis=0)
But some features of X have 0 variance. It gives me Runtime error for ZeroDivision.
I know we can normalize using "StandardScalar" class from sklearn. But how can I normalize data by myself from scratch if std=0 ?

To quote sklearn documentation for StandardScaler:
Per feature relative scaling of the data to achieve zero mean and unit variance. Generally this is calculated using np.sqrt(var_). If a variance is zero, we can’t achieve unit variance, and the data is left as-is, giving a scaling factor of 1.
Therefore, like what the other answer said, you can omit the standard deviation term and just do X - X.mean(axis=0) when standard deviation is 0. However this only works if the whole of X has 0 standard deviation.
To make this work where you have a mix of values with some std dev and values that don't, use this instead:
std = X.std(axis=0)
std = np.where(std == 0, 1, std)
X = ( X - X.mean(axis=0) ) / std
This code checks if standard deviation is zero for each row of values in axis 0, and replaces them with 1 if true.

If standard deviation is 0 for a particular feature, than all of its values are identical. In this case X = X - X.mean(axis=0) should suffice. This would give you 0 mean and 0 standardeviation.

Related

Misconceptions about the Shannon-Nyquist theorem

I am a student working with time-series data which we feed into a neural network for classification (my task is to build and train this NN).
We're told to use a band-pass filter of 10 Hz to 150 Hz since anything outside that is not interesting.
After applying the band-pass, I've also down-sampled the data to 300 samples per second (originally it was 768 Hz). My understanding of the Shannon Nyquist sampling theorem is that, after applying the band-pass, any information in the data will be perfectly preserved at this sample-rate.
However, I got into a discussion with my supervisor who claimed that 300 Hz might not be sufficient even if the signal was band-limited. She says that it is only the minimum sample rate, not necessarily the best sample rate.
My understanding of the sampling theorem makes me think the supervisor is obviously wrong, but I don't want to argue with my supervisor, especially in case I'm actually the one who has misunderstood.
Can anyone help to confirm my understanding or provide some clarification? And how should I take this up with my supervisor (if at all).
The Nyquist-Shannon theorem states that the sampling frequency should at-least be twice of bandwidth, i.e.,
fs > 2B
So, this is the minimal criteria. If the sampling frequency is less than 2B then there will be aliasing. There is no upper limit on sampling frequency, but more the sampling frequency, the better will be the reconstruction.
So, I think your supervisor is right in saying that it is the minimal condition and not the best one.
Actually, you and your supervisor are both wrong. The minimum sampling rate required to faithfully represent a real-valued time series whose spectrum lies between 10 Hz and 150 Hz is 140 Hz, not 300 Hz. I'll explain this, and then I'll explain some of the context that shows why you might want to "oversample", as it is referred to (spoiler alert: Bailian-Low Theorem). The supervisor is mixing folklore into the discussion, and when folklore is not properly-contexted, it tends to telephone tag into fakelore. (That's a common failing even in the peer-reviewed literature, by the way). And there's a lot of fakelore, here, that needs to be defogged.
For the following, I will use the following conventions.
There's no math layout on Stack Overflow (except what we already have with UTF-8), so ...
a^b denotes a raised to the power b.
∫_I (⋯x⋯) dx denotes an integral of (⋯x⋯) taken over all x ∈ I, with the default I = ℝ.
The support supp φ (or supp_x φ(x) to make the "x" explicit) of a function φ(x) is the smallest closed set containing all the x-es for which φ(x) ≠ 0. For regularly-behaving (e.g. continuously differentiable) functions that means a union of closed intervals and/or half-rays or the whole real line, itself. This figures centrally in the Shannon-Nyquist sampling theorem, as its main condition is that a spectrum have bounded support; i.e. a "finite bandwidth".
For the Fourier transform I will use the version that has the 2π up in the exponent, and for added convenience, I will use the convention 1^x = e^{2πix} = cos(2πx) + i sin(2πx) (which I refer to as the Ramanujan Convention, as it is the convention I frequently used in my previous life oops I mean which Ramanujan secretly used in his life to make the math a whole lot simpler).
The set ℤ = {⋯, -2, -1, 0, +1, +2, ⋯ } is the integers, and 1^{x+z} = 1^x for all z∈ℤ - making 1^x the archetype of a periodic function whose period is 1.
Thus, the Fourier transform f̂(ν) of a function f(t) and its inverse are given by:
f̂(ν) = ∫ f(t) 1^{-νt} dt, f(t) = ∫ f̂(ν) 1^{+νt} dν.
The spectrum of the time series given by the function f(t) is the function f̂(ν) of the cyclic frequency ν, which is what is measured in Hertz (Hz.); t, itself, being measured in seconds. A common convention is to use the angular frequency ω = 2πν, instead, but that muddies the picture.
The most important example, with respect to the issue at hand, is the Fourier transform χ̂_Ω of the interval function given by χ_Ω(t) = 1 if t ∈ [-½Ω,+½Ω] and χ_Ω(t) = 0 else:
χ̂_Ω(t) = ∫_[-½Ω,+½Ω] 1^ν dν
= {1^{+½Ω} - 1^{-½Ω}}/{2πi}
= {2i sin πΩ}/{2πi}
= Ω sinc πΩ
which is where the function sinc x = (sin πx)/(πx) comes into play.
The cardinal form of the sampling theorem is that a function f(t) can be sampled over an equally-spaced sampled domain T ≡ { kΔt: k ∈ ℤ }, if its spectrum is bounded by supp f̂ ⊆ [-½Ω,+½Ω] ⊆ [-1/(2Δt),+1/(2Δt)], with the sampling given as
f(t) = ∑_{t'∈T} f(t') Ω sinc(Ω(t - t')) Δt.
So, this generally applies to [over-]sampling with redundancy factors 1/(ΩΔt) ≥ 1. In the special case where the sampling is tight with ΩΔt = 1, then it reduces to the form
f(t) = ∑_{t'∈T} f(t') sinc({t - t'}/Δt).
In our case, supp f̂ = [10 Hz., 150 Hz.] so the tightest fits are with 1/Δt = Ω = 300 Hz.
This generalizes to equally-spaced sampled domains of the form T ≡ { t₀ + kΔt: k ∈ ℤ } without any modification.
But it also generalizes to frequency intervals supp f̂ = [ν₋,ν₊] of width Ω = ν₊ - ν₋ and center ν₀ = ½ (ν₋ + ν₊) to the following form:
f(t) = ∑_{t'∈T} f(t') 1^{ν₀(t - t')} Ω sinc(Ω(t - t')) Δt.
In your case, you have ν₋ = 10 Hz., ν₊ = 150 Hz., Ω = 140 Hz., ν₀ = 80 Hz. with the condition Δt ≤ 1/140 second, a sampling rate of at least 140 Hz. with
f(t) = (140 Δt) ∑_{t'∈T} f(t') 1^{80(t - t')} sinc(140(t - t')).
where t and Δt are in seconds.
There is a larger context to all of this. One of the main places where this can be used is for transforms devised from an overlapping set of windowed filters in the frequency domain - a typical case in point being transforms for the time-scale plane, like the S-transform or the continuous wavelet transform.
Since you want the filters to be smoothly-windowed functions, without sharp corners, then in order for them to provide a complete set that adds up to a finite non-zero value over all of the frequency spectrum (so that they can all be normalized, in tandem, by dividing out by this sum), then their respective supports have to overlap.
(Edit: Generalized this example to cover both equally-spaced and logarithmic-spaced intervals.)
One example of such a set would be filters that have end-point frequencies taken from the set
Π = { p₀ (α + 1)ⁿ + β {(α + 1)ⁿ - 1} / α: n ∈ {0,1,2,⋯} }
So, for interval n (counting from n = 0), you would have ν₋ = p_n and ν₊ = p_{n+1}, where the members of Π are enumerated
p_n = p₀ (α + 1)ⁿ + β {(α + 1)ⁿ - 1} / α,
Δp_n = p_{n+1} - p_n = α p_n + β = (α p₀ + β)(α + 1)ⁿ,
n ∈ {0,1,2,⋯}
The center frequency of interval n would then be ν₀ = p_n + ½ Δp₀ (α + 1)ⁿ and the width would be Ω = Δp₀ (α + 1)ⁿ, but the actual support for the filter would overlap into a good part of the neighboring intervals, so that when you add up the filters that cover a given frequency ν the sum doesn't drop down to 0 as ν approaches any of the boundary points. (In the limiting case α → 0, this produces an equally-spaced frequency domain, suitable for an equalizer, while in the case β → 0, it produces a logarithmic scale with base α + 1, where octaves are equally-spaced.)
The other main place where you may apply this is to time-frequency analysis and spectrograms. Here, the role of a function f and its Fourier transform f̂ are reversed and the role of the frequency bandwidth Ω is now played by the (reciprocal) time bandwidth 1/Ω. You want to break up a time series, given by a function f(t) into overlapping segments f̃(q,λ) = g(λ)* f(q + λ), with smooth windowing given by the functions g(λ) with bounded support supp g ⊆ [-½ 1/Ω, +½ 1/Ω], and with interval spacing Δq much larger than the time sampling Δt (the ratio Δq/Δt is called the "hop" factor). The analogous role of Δt is played, here, by the frequency interval in the spectrogram Δp = Ω, which is now constant.
Edit: (Fixed the numbers for the Audacity example)
The minimum sampling rate for both supp_λ g and supp_λ f(q,λ) is Δq = 1/Ω = 1/Δp, and the corresponding redundancy factor is 1/(ΔpΔq). Audacity, for instance, uses a redundancy factor of 2 for its spectrograms. A typical value for Δp might be 44100/2048 Hz., while the time-sampling rate is Δt = 1/(2×3×5×7)² second (corresponding to 1/Δt = 44100 Hz.). With a redundancy factor of 2, Δq would be 1024/44100 second and the hop factor would be Δq/Δt = 1024.
If you try to fit the sampling windows, in either case, to the actual support of the band-limited (or time-limited) function, then the windows won't overlap and the only way to keep their sum from dropping to 0 on the boundary points would be for the windowing functions to have sharp corners on the boundaries, which would wreak havoc on their corresponding Fourier transforms.
The Balian-Low Theorem makes the actual statement on the matter.
https://encyclopediaofmath.org/wiki/Balian-Low_theorem
And a shout-out to someone I've been talking with, recently, about DSP-related matters and his monograph, which provides an excellent introductory reference to a lot of the issues discussed here.
A Friendly Guide To Wavelets
Gerald Kaiser
Birkhauser 1994
He said it's part of a trilogy, another installment of which is forthcoming.

Dealing with NaN (missing) values for Logistic Regression- Best practices?

I am working with a data-set of patient information and trying to calculate the Propensity Score from the data using MATLAB. After removing features with many missing values, I am still left with several missing (NaN) values.
I get errors due to these missing values, as the values of my cost-function and gradient vector become NaN, when I try to perform logistic regression using the following Matlab code (from Andrew Ng's Coursera Machine Learning class) :
[m, n] = size(X);
X = [ones(m, 1) X];
initial_theta = ones(n+1, 1);
[cost, grad] = costFunction(initial_theta, X, y);
options = optimset('GradObj', 'on', 'MaxIter', 400);
[theta, cost] = ...
fminunc(#(t)(costFunction(t, X, y)), initial_theta, options);
Note: sigmoid and costfunction are working functions I created for overall ease of use.
The calculations can be performed smoothly if I replace all NaN values with 1 or 0. However I am not sure if that is the best way to deal with this issue, and I was also wondering what replacement value I should pick (in general) to get the best results for performing logistic regression with missing data. Are there any benefits/drawbacks to using a particular number (0 or 1 or something else) for replacing the said missing values in my data?
Note: I have also normalized all feature values to be in the range of 0-1.
Any insight on this issue will be highly appreciated. Thank you
As pointed out earlier, this is a generic problem people deal with regardless of the programming platform. It is called "missing data imputation".
Enforcing all missing values to a particular number certainly has drawbacks. Depending on the distribution of your data it can be drastic, for example, setting all missing values to 1 in a binary sparse data having more zeroes than ones.
Fortunately, MATLAB has a function called knnimpute that estimates a missing data point by its closest neighbor.
From my experience, I often found knnimpute useful. However, it may fall short when there are too many missing sites as in your data; the neighbors of a missing site may be incomplete as well, thereby leading to inaccurate estimation. Below, I figured out a walk-around solution to that; it begins with imputing the least incomplete columns, (optionally) imposing a safe predefined distance for the neighbors. I hope this helps.
function data = dnnimpute(data,distCutoff,option,distMetric)
% data = dnnimpute(data,distCutoff,option,distMetric)
%
% Distance-based nearest neighbor imputation that impose a distance
% cutoff to determine nearest neighbors, i.e., avoids those samples
% that are more distant than the distCutoff argument.
%
% Imputes missing data coded by "NaN" starting from the covarites
% (columns) with the least number of missing data. Then it continues by
% including more (complete) covariates in the calculation of pair-wise
% distances.
%
% option,
% 'median' - Median of the nearest neighboring values
% 'weighted' - Weighted average of the nearest neighboring values
% 'default' - Unweighted average of the nearest neighboring values
%
% distMetric,
% 'euclidean' - Euclidean distance (default)
% 'seuclidean' - Standardized Euclidean distance. Each coordinate
% difference between rows in X is scaled by dividing
% by the corresponding element of the standard
% deviation S=NANSTD(X). To specify another value for
% S, use D=pdist(X,'seuclidean',S).
% 'cityblock' - City Block distance
% 'minkowski' - Minkowski distance. The default exponent is 2. To
% specify a different exponent, use
% D = pdist(X,'minkowski',P), where the exponent P is
% a scalar positive value.
% 'chebychev' - Chebychev distance (maximum coordinate difference)
% 'mahalanobis' - Mahalanobis distance, using the sample covariance
% of X as computed by NANCOV. To compute the distance
% with a different covariance, use
% D = pdist(X,'mahalanobis',C), where the matrix C
% is symmetric and positive definite.
% 'cosine' - One minus the cosine of the included angle
% between observations (treated as vectors)
% 'correlation' - One minus the sample linear correlation between
% observations (treated as sequences of values).
% 'spearman' - One minus the sample Spearman's rank correlation
% between observations (treated as sequences of values).
% 'hamming' - Hamming distance, percentage of coordinates
% that differ
% 'jaccard' - One minus the Jaccard coefficient, the
% percentage of nonzero coordinates that differ
% function - A distance function specified using #, for
% example #DISTFUN.
%
if nargin < 3
option = 'mean';
end
if nargin < 4
distMetric = 'euclidean';
end
nanVals = isnan(data);
nanValsPerCov = sum(nanVals,1);
noNansCov = nanValsPerCov == 0;
if isempty(find(noNansCov, 1))
[~,leastNans] = min(nanValsPerCov);
noNansCov(leastNans) = true;
first = data(nanVals(:,noNansCov),:);
nanRows = find(nanVals(:,noNansCov)==true); i = 1;
for row = first'
data(nanRows(i),noNansCov) = mean(row(~isnan(row)));
i = i+1;
end
end
nSamples = size(data,1);
if nargin < 2
dataNoNans = data(:,noNansCov);
distances = pdist(dataNoNans);
distCutoff = min(distances);
end
[stdCovMissDat,idxCovMissDat] = sort(nanValsPerCov,'ascend');
imputeCols = idxCovMissDat(stdCovMissDat>0);
% Impute starting from the cols (covariates) with the least number of
% missing data.
for c = reshape(imputeCols,1,length(imputeCols))
imputeRows = 1:nSamples;
imputeRows = imputeRows(nanVals(:,c));
for r = reshape(imputeRows,1,length(imputeRows))
% Calculate distances
distR = inf(nSamples,1);
%
noNansCov_r = find(isnan(data(r,:))==0);
noNansCov_r = noNansCov_r(sum(isnan(data(nanVals(:,c)'==false,~isnan(data(r,:)))),1)==0);
%
for i = find(nanVals(:,c)'==false)
distR(i) = pdist([data(r,noNansCov_r); data(i,noNansCov_r)],distMetric);
end
tmp = min(distR(distR>0));
% Impute the missing data at sample r of covariate c
switch option
case 'weighted'
data(r,c) = (1./distR(distR<=max(distCutoff,tmp)))' * data(distR<=max(distCutoff,tmp),c) / sum(1./distR(distR<=max(distCutoff,tmp)));
case 'median'
data(r,c) = median(data(distR<=max(distCutoff,tmp),c),1);
case 'mean'
data(r,c) = mean(data(distR<=max(distCutoff,tmp),c),1);
end
% The missing data in sample r is imputed. Update the sample
% indices of c which are imputed.
nanVals(r,c) = false;
end
fprintf('%u/%u of the covariates are imputed.\n',find(c==imputeCols),length(imputeCols));
end
To deal with missing data you can use one of the following three options:
If there are not many instances with missing values, you can just delete the ones with missing values.
If you have many features and it is affordable to lose some information, delete the entire feature with missing values.
The best method is to fill some value (mean, median) in place of missing value. You can calculate the mean of the rest of the training examples for that feature and fill all the missing values with the mean. This works out pretty well as the mean value stays in the distribution of your data.
Note: When you replace the missing values with the mean, calculate the mean only using training set. Also, store that value and use it to change the missing values in the test set also.
If you use 0 or 1 to replace all the missing values then the data may get skewed so it is better to replace the missing values by an average of all the other values.

optimal separating hyperplane objective function confusion

Chapter 4.5.2 of Elements of Statistical Learning
I don't understand what does it mean:
"Since for any β and β0 satisfying these inequalities, any positively scaled
multiple satisfies them too, we can arbitrarily set ||β|| = 1/M." 
Also, how does maximize M becomes minimize 1/2(||β||^2) ?
"Since for any β and β0 satisfying these inequalities, any positively scaled multiple satisfies them too, we can arbitrarily set ||β|| = 1/M." 
y_i(x_i' b + b0) >= M ||b||
thus for any c>0
y_i(x_i' [bc] + [b0c]) >= M ||bc||
thus you can always find such c that ||bc|| = 1/M, so we can focus only on b such that they have such norm (we simply limit the space of possible solutions because we know that scaling does not change much)
Also, how does maximize M becomes minimize 1/2(||β||^2) ?
We put ||b|| = 1/M, thus M=1/||b||
max_b M = max_b 1 / ||b||
now maximization of positive f(b) is equivalent of minimization of 1/f(b), so
min ||b||
and since ||b|| is positive, its minimization is equivalent to minimization of the square, as well as multiplied by 1/2 (this does not change the optimal b)
min 1/2 ||b||^2

Normalize a feature in this table

This has become quite a frustrating question, but I've asked in the Coursera discussions and they won't help. Below is the question:
I've gotten it wrong 6 times now. How do I normalize the feature? Hints are all I'm asking for.
I'm assuming x_2^(2) is the value 5184, unless I am adding the x_0 column of 1's, which they don't mention but he certainly mentions in the lectures when talking about creating the design matrix X. In which case x_2^(2) would be the value 72. Assuming one or the other is right (I'm playing a guessing game), what should I use to normalize it? He talks about 3 different ways to normalize in the lectures: one using the maximum value, another with the range/difference between max and mins, and another the standard deviation -- they want an answer correct to the hundredths. Which one am I to use? This is so confusing.
...use both feature scaling (dividing by the
"max-min", or range, of a feature) and mean normalization.
So for any individual feature f:
f_norm = (f - f_mean) / (f_max - f_min)
e.g. for x2,(midterm exam)^2 = {7921, 5184, 8836, 4761}
> x2 <- c(7921, 5184, 8836, 4761)
> mean(x2)
6676
> max(x2) - min(x2)
4075
> (x2 - mean(x2)) / (max(x2) - min(x2))
0.306 -0.366 0.530 -0.470
Hence norm(5184) = 0.366
(using R language, which is great at vectorizing expressions like this)
I agree it's confusing they used the notation x2 (2) to mean x2 (norm) or x2'
EDIT: in practice everyone calls the builtin scale(...) function, which does the same thing.
It's asking to normalize the second feature under second column using both feature scaling and mean normalization. Therefore,
(5184 - 6675.5) / 4075 = -0.366
Usually we normalize all of them to have zero mean and go between [-1, 1].
You can do that easily by dividing by the maximum of the absolute value and then remove the mean of the samples.
"I'm assuming x_2^(2) is the value 5184" is this because it's the second item in the list and using the subscript _2? x_2 is just a variable identity in maths, it applies to all rows in the list. Note that the highest raw mid-term exam result (i.e. that which is not squared) goes down on the final test and the lowest raw mid-term result increases the most for the final exam result. Theta is a fixed value, a coefficient, so somewhere your normalisation of x_1 and x_2 values must become (EDIT: not negative, less than 1) in order to allow for this behaviour. That should hopefully give you a starting basis, by identifying where the pivot point is.
I had the same problem, in my case the thing was that I was using as average the maximum x2 value (8836) minus minimum x2 value (4761) divided by two, instead of the sum of each x2 value divided by the number of examples.
For the same training set, I got the question as
Q. What is the normalized feature x^(3)_1?
Thus, 3rd training ex and 1st feature makes out to 94 in above table.
Now, normalized form is
x = (x - mean(x's)) / range(x)
Values are :
x = 94
mean(89+72+94+69) / 4 = 81
range = 94 - 69 = 25
Normalized x = (94 - 81) / 25 = 0.52
I'm taking this course at the moment and a really trivial mistake I made first time I answered this question was using comma instead of dot in the answer, since I did by hand and in my country we use comma to denote decimals. Ex:(0,52 instead of 0.52)
So in the second time I tried I used dot and works fine.

Gradient Boost Decision Tree (GBDT) or Multiple Additive Regression Tree(MART): Calculating gradient/pseudo-response

I'm implementing MART from (http://www-stat.stanford.edu/~jhf/ftp/trebst.pdf) algorithm 5,
My algorithm "works" for say less data(3000 training data file, 22 features) and J=5,10,20 (# of leaf nodes) and T = 10, 20. It gives me good result (R-Precision is 0.30 to 0.5 for training) but when I try to run on some what large training data (70K records) it gives me runtime underflow error - which I think it should be - just don't know how workaround this problem?
Underflow err comes here, calculating gradient of cost (or pseudo-response):
here y_i are {1,-1} labels so if I just try: 2/exp(5000) its overflow in denominator!
Just wondering if I can "normalize" this or "threshold" this, but then I'm using this pseudo-response in calculating "label" (gamma in that pdf), and then those gamma to calculate model scores.
You can wrap that expression with an if.
exp_arg = 2 * y_i * F_m_minus_1
if (exp_arg > 700) {
// assume exp() overflow, result of exp() ~= inf, 2 / inf = 0
y_tilda_i = 0
else // standard calculation
I haven't implemented gradient boosting in particular, but I needed to do that trick in some neural network computation.
#rrenaud is close, what I did is: if exp_arg > 16 or exp_arg < -16 make my exp_arg = 16(or -16) and it works! (For 1.2GB data and 700 features too!)

Resources