I do data normalization as:
X = ( X - X.mean(axis=0) ) / X.std(axis=0)
But some features of X have 0 variance. It gives me Runtime error for ZeroDivision.
I know we can normalize using "StandardScalar" class from sklearn. But how can I normalize data by myself from scratch if std=0 ?
To quote sklearn documentation for StandardScaler:
Per feature relative scaling of the data to achieve zero mean and unit variance. Generally this is calculated using np.sqrt(var_). If a variance is zero, we can’t achieve unit variance, and the data is left as-is, giving a scaling factor of 1.
Therefore, like what the other answer said, you can omit the standard deviation term and just do X - X.mean(axis=0) when standard deviation is 0. However this only works if the whole of X has 0 standard deviation.
To make this work where you have a mix of values with some std dev and values that don't, use this instead:
std = X.std(axis=0)
std = np.where(std == 0, 1, std)
X = ( X - X.mean(axis=0) ) / std
This code checks if standard deviation is zero for each row of values in axis 0, and replaces them with 1 if true.
If standard deviation is 0 for a particular feature, than all of its values are identical. In this case X = X - X.mean(axis=0) should suffice. This would give you 0 mean and 0 standardeviation.
I am working with a data-set of patient information and trying to calculate the Propensity Score from the data using MATLAB. After removing features with many missing values, I am still left with several missing (NaN) values.
I get errors due to these missing values, as the values of my cost-function and gradient vector become NaN, when I try to perform logistic regression using the following Matlab code (from Andrew Ng's Coursera Machine Learning class) :
[m, n] = size(X);
X = [ones(m, 1) X];
initial_theta = ones(n+1, 1);
[cost, grad] = costFunction(initial_theta, X, y);
options = optimset('GradObj', 'on', 'MaxIter', 400);
[theta, cost] = ...
fminunc(#(t)(costFunction(t, X, y)), initial_theta, options);
Note: sigmoid and costfunction are working functions I created for overall ease of use.
The calculations can be performed smoothly if I replace all NaN values with 1 or 0. However I am not sure if that is the best way to deal with this issue, and I was also wondering what replacement value I should pick (in general) to get the best results for performing logistic regression with missing data. Are there any benefits/drawbacks to using a particular number (0 or 1 or something else) for replacing the said missing values in my data?
Note: I have also normalized all feature values to be in the range of 0-1.
Any insight on this issue will be highly appreciated. Thank you
As pointed out earlier, this is a generic problem people deal with regardless of the programming platform. It is called "missing data imputation".
Enforcing all missing values to a particular number certainly has drawbacks. Depending on the distribution of your data it can be drastic, for example, setting all missing values to 1 in a binary sparse data having more zeroes than ones.
Fortunately, MATLAB has a function called knnimpute that estimates a missing data point by its closest neighbor.
From my experience, I often found knnimpute useful. However, it may fall short when there are too many missing sites as in your data; the neighbors of a missing site may be incomplete as well, thereby leading to inaccurate estimation. Below, I figured out a walk-around solution to that; it begins with imputing the least incomplete columns, (optionally) imposing a safe predefined distance for the neighbors. I hope this helps.
function data = dnnimpute(data,distCutoff,option,distMetric)
% data = dnnimpute(data,distCutoff,option,distMetric)
%
% Distance-based nearest neighbor imputation that impose a distance
% cutoff to determine nearest neighbors, i.e., avoids those samples
% that are more distant than the distCutoff argument.
%
% Imputes missing data coded by "NaN" starting from the covarites
% (columns) with the least number of missing data. Then it continues by
% including more (complete) covariates in the calculation of pair-wise
% distances.
%
% option,
% 'median' - Median of the nearest neighboring values
% 'weighted' - Weighted average of the nearest neighboring values
% 'default' - Unweighted average of the nearest neighboring values
%
% distMetric,
% 'euclidean' - Euclidean distance (default)
% 'seuclidean' - Standardized Euclidean distance. Each coordinate
% difference between rows in X is scaled by dividing
% by the corresponding element of the standard
% deviation S=NANSTD(X). To specify another value for
% S, use D=pdist(X,'seuclidean',S).
% 'cityblock' - City Block distance
% 'minkowski' - Minkowski distance. The default exponent is 2. To
% specify a different exponent, use
% D = pdist(X,'minkowski',P), where the exponent P is
% a scalar positive value.
% 'chebychev' - Chebychev distance (maximum coordinate difference)
% 'mahalanobis' - Mahalanobis distance, using the sample covariance
% of X as computed by NANCOV. To compute the distance
% with a different covariance, use
% D = pdist(X,'mahalanobis',C), where the matrix C
% is symmetric and positive definite.
% 'cosine' - One minus the cosine of the included angle
% between observations (treated as vectors)
% 'correlation' - One minus the sample linear correlation between
% observations (treated as sequences of values).
% 'spearman' - One minus the sample Spearman's rank correlation
% between observations (treated as sequences of values).
% 'hamming' - Hamming distance, percentage of coordinates
% that differ
% 'jaccard' - One minus the Jaccard coefficient, the
% percentage of nonzero coordinates that differ
% function - A distance function specified using #, for
% example #DISTFUN.
%
if nargin < 3
option = 'mean';
end
if nargin < 4
distMetric = 'euclidean';
end
nanVals = isnan(data);
nanValsPerCov = sum(nanVals,1);
noNansCov = nanValsPerCov == 0;
if isempty(find(noNansCov, 1))
[~,leastNans] = min(nanValsPerCov);
noNansCov(leastNans) = true;
first = data(nanVals(:,noNansCov),:);
nanRows = find(nanVals(:,noNansCov)==true); i = 1;
for row = first'
data(nanRows(i),noNansCov) = mean(row(~isnan(row)));
i = i+1;
end
end
nSamples = size(data,1);
if nargin < 2
dataNoNans = data(:,noNansCov);
distances = pdist(dataNoNans);
distCutoff = min(distances);
end
[stdCovMissDat,idxCovMissDat] = sort(nanValsPerCov,'ascend');
imputeCols = idxCovMissDat(stdCovMissDat>0);
% Impute starting from the cols (covariates) with the least number of
% missing data.
for c = reshape(imputeCols,1,length(imputeCols))
imputeRows = 1:nSamples;
imputeRows = imputeRows(nanVals(:,c));
for r = reshape(imputeRows,1,length(imputeRows))
% Calculate distances
distR = inf(nSamples,1);
%
noNansCov_r = find(isnan(data(r,:))==0);
noNansCov_r = noNansCov_r(sum(isnan(data(nanVals(:,c)'==false,~isnan(data(r,:)))),1)==0);
%
for i = find(nanVals(:,c)'==false)
distR(i) = pdist([data(r,noNansCov_r); data(i,noNansCov_r)],distMetric);
end
tmp = min(distR(distR>0));
% Impute the missing data at sample r of covariate c
switch option
case 'weighted'
data(r,c) = (1./distR(distR<=max(distCutoff,tmp)))' * data(distR<=max(distCutoff,tmp),c) / sum(1./distR(distR<=max(distCutoff,tmp)));
case 'median'
data(r,c) = median(data(distR<=max(distCutoff,tmp),c),1);
case 'mean'
data(r,c) = mean(data(distR<=max(distCutoff,tmp),c),1);
end
% The missing data in sample r is imputed. Update the sample
% indices of c which are imputed.
nanVals(r,c) = false;
end
fprintf('%u/%u of the covariates are imputed.\n',find(c==imputeCols),length(imputeCols));
end
To deal with missing data you can use one of the following three options:
If there are not many instances with missing values, you can just delete the ones with missing values.
If you have many features and it is affordable to lose some information, delete the entire feature with missing values.
The best method is to fill some value (mean, median) in place of missing value. You can calculate the mean of the rest of the training examples for that feature and fill all the missing values with the mean. This works out pretty well as the mean value stays in the distribution of your data.
Note: When you replace the missing values with the mean, calculate the mean only using training set. Also, store that value and use it to change the missing values in the test set also.
If you use 0 or 1 to replace all the missing values then the data may get skewed so it is better to replace the missing values by an average of all the other values.
In a nutshell, I would like to know if there is a tensor command in torch that gives me the indices of elements in a tensor that satisfy a certain criteria.
Here is matlab code that illustrates what I would like to be able to do in torch:
my_mat = magic(3); % returns a 3 by 3 matrix with the numbers 1 through 9
greater_than_fives = find(my_mat > 5); % find indices of all values greater than 5, the " > 5" is a logical elementwise operator that returns a matrix of all 0's and 1's and finally the "find" command picks out the indices with a "1" in them
my_mat(greater_than_fives) = 0; % set all values greater than 5 equal to 0
I understand that I could do this in torch using a for loop, but is there some equivalent to matlab's find command that would allow me to do this more compactly?
x[x:gt(5)] = 0
In general there are x:gt :lt :ge :le :eq
There is also the general :apply function tha takes in an anonymous function and applies it to each element.
I'm currently using MATLAB's bweuler function to find the number of "holes" in an object. I'm having difficulty due to the relatively low resolution of these images, currently only achieving a 50% success rate. Here is an example of an image that will fail (cropped for testing convenience):
Normalized and scaled up so that you can see what's going on:
The number of holes should be 1 here, but I haven't found a reliable way to threshold that small area out while leaving its surroundings intact.
My current method uses an adaptive threshold, but this is problematic because the correct window size is not constant across all samples. For instance, this example will work with a window size of 6, but others will fail.
For reference, here are a couple more failure cases:
N=1 ->
N=2 ->
And a few cases my current method handles:
N=6 ->
N=1 ->
N=1 ->
I'll post a short version of my current code as well, though as I said previously, I do not believe that an adaptive thresholding method will work. I will be playing with some other morphological segmentation methods in the meantime. Just read one of the test images in and call findholes(image).
function N = findholes(I)
bw = imclearborder(adaptivethreshold(I, 9));
N = abs(1 - bweuler(bw2, 4));
end
function bw=adaptivethreshold(IM,ws,C,tm)
%ADAPTIVETHRESHOLD An adaptive thresholding algorithm that seperates the
%foreground from the background with nonuniform illumination.
% bw=adaptivethreshold(IM,ws,C) outputs a binary image bw with the local
% threshold mean-C or median-C to the image IM.
% ws is the local window size.
% C is a constant offset subtracted from the final input image to im2bw (range 0...1).
% tm is 0 or 1, a switch between mean and median. tm=0 mean(default); tm=1 median.
%
% Contributed by Guanglei Xiong (xgl99#mails.tsinghua.edu.cn)
% at Tsinghua University, Beijing, China.
%
% For more information, please see
% http://homepages.inf.ed.ac.uk/rbf/HIPR2/adpthrsh.htm
if (nargin < 2)
error('You must provide IM and ws');
end
if (nargin<3)
C = 0;
tm = 0;
end
if (nargin==3)
tm=0;
elseif (tm~=0 && tm~=1)
error('tm must be 0 or 1.');
end
IM=mat2gray(IM);
if tm==0
mIM=imfilter(IM,fspecial('average',ws),'replicate');
else
mIM=medfilt2(IM,[ws ws]);
end
sIM=mIM-IM-C;
bw=im2bw(sIM,0);
bw=imcomplement(bw);
end
I'm implementing MART from (http://www-stat.stanford.edu/~jhf/ftp/trebst.pdf) algorithm 5,
My algorithm "works" for say less data(3000 training data file, 22 features) and J=5,10,20 (# of leaf nodes) and T = 10, 20. It gives me good result (R-Precision is 0.30 to 0.5 for training) but when I try to run on some what large training data (70K records) it gives me runtime underflow error - which I think it should be - just don't know how workaround this problem?
Underflow err comes here, calculating gradient of cost (or pseudo-response):
here y_i are {1,-1} labels so if I just try: 2/exp(5000) its overflow in denominator!
Just wondering if I can "normalize" this or "threshold" this, but then I'm using this pseudo-response in calculating "label" (gamma in that pdf), and then those gamma to calculate model scores.
You can wrap that expression with an if.
exp_arg = 2 * y_i * F_m_minus_1
if (exp_arg > 700) {
// assume exp() overflow, result of exp() ~= inf, 2 / inf = 0
y_tilda_i = 0
else // standard calculation
I haven't implemented gradient boosting in particular, but I needed to do that trick in some neural network computation.
#rrenaud is close, what I did is: if exp_arg > 16 or exp_arg < -16 make my exp_arg = 16(or -16) and it works! (For 1.2GB data and 700 features too!)