I am working with a dataset of patient information and trying to calculate propensity scores from it using MATLAB. After removing features with many missing values, I am still left with several missing (NaN) values.
These missing values cause errors: the cost function and gradient vector become NaN when I try to perform logistic regression using the following MATLAB code (from Andrew Ng's Coursera Machine Learning class):
[m, n] = size(X);
X = [ones(m, 1) X];
initial_theta = ones(n+1, 1);
[cost, grad] = costFunction(initial_theta, X, y);
options = optimset('GradObj', 'on', 'MaxIter', 400);
[theta, cost] = ...
    fminunc(@(t)(costFunction(t, X, y)), initial_theta, options);
Note: sigmoid and costFunction are working functions I created for overall ease of use.
The calculations run smoothly if I replace all NaN values with 1 or 0. However, I am not sure that is the best way to deal with this issue, and I am also wondering what replacement value I should pick (in general) to get the best results when performing logistic regression with missing data. Are there any benefits or drawbacks to using a particular number (0, 1, or something else) to replace the missing values in my data?
Note: I have also normalized all feature values to be in the range of 0-1.
Any insight on this issue will be highly appreciated. Thank you
As pointed out earlier, this is a generic problem people deal with regardless of the programming platform. It is called "missing data imputation".
Forcing all missing values to a particular number certainly has drawbacks. Depending on the distribution of your data, the effect can be drastic: for example, setting all missing values to 1 in sparse binary data that has far more zeros than ones.
Fortunately, MATLAB has a function called knnimpute (in the Bioinformatics Toolbox) that estimates a missing data point from its nearest neighbor.
In my experience, knnimpute is often useful. However, it may fall short when there are too many missing sites, as in your data: the neighbors of a missing site may be incomplete as well, leading to inaccurate estimates. Below is a workaround I came up with; it begins by imputing the least incomplete columns and (optionally) imposes a safe predefined distance for the neighbors. I hope this helps.
function data = dnnimpute(data,distCutoff,option,distMetric)
% data = dnnimpute(data,distCutoff,option,distMetric)
%
% Distance-based nearest neighbor imputation that imposes a distance
% cutoff to determine nearest neighbors, i.e., avoids those samples
% that are more distant than the distCutoff argument.
%
% Imputes missing data coded by "NaN" starting from the covariates
% (columns) with the least number of missing data. Then it continues by
% including more (complete) covariates in the calculation of pair-wise
% distances.
%
% option,
% 'median' - Median of the nearest neighboring values
% 'weighted' - Weighted average of the nearest neighboring values
% 'mean' - Unweighted average of the nearest neighboring values (default)
%
% distMetric,
% 'euclidean' - Euclidean distance (default)
% 'seuclidean' - Standardized Euclidean distance. Each coordinate
% difference between rows in X is scaled by dividing
% by the corresponding element of the standard
% deviation S=NANSTD(X). To specify another value for
% S, use D=pdist(X,'seuclidean',S).
% 'cityblock' - City Block distance
% 'minkowski' - Minkowski distance. The default exponent is 2. To
% specify a different exponent, use
% D = pdist(X,'minkowski',P), where the exponent P is
% a scalar positive value.
% 'chebychev' - Chebychev distance (maximum coordinate difference)
% 'mahalanobis' - Mahalanobis distance, using the sample covariance
% of X as computed by NANCOV. To compute the distance
% with a different covariance, use
% D = pdist(X,'mahalanobis',C), where the matrix C
% is symmetric and positive definite.
% 'cosine' - One minus the cosine of the included angle
% between observations (treated as vectors)
% 'correlation' - One minus the sample linear correlation between
% observations (treated as sequences of values).
% 'spearman' - One minus the sample Spearman's rank correlation
% between observations (treated as sequences of values).
% 'hamming' - Hamming distance, percentage of coordinates
% that differ
% 'jaccard' - One minus the Jaccard coefficient, the
% percentage of nonzero coordinates that differ
% function - A distance function specified using @, for
% example @DISTFUN.
%
if nargin < 3
    option = 'mean';
end
if nargin < 4
    distMetric = 'euclidean';
end
nanVals = isnan(data);
nanValsPerCov = sum(nanVals,1);
noNansCov = nanValsPerCov == 0;
if isempty(find(noNansCov, 1))
    [~,leastNans] = min(nanValsPerCov);
    noNansCov(leastNans) = true;
    first = data(nanVals(:,noNansCov),:);
    nanRows = find(nanVals(:,noNansCov)==true); i = 1;
    for row = first'
        data(nanRows(i),noNansCov) = mean(row(~isnan(row)));
        i = i+1;
    end
end
nSamples = size(data,1);
if nargin < 2
    dataNoNans = data(:,noNansCov);
    distances = pdist(dataNoNans);
    distCutoff = min(distances);
end
[stdCovMissDat,idxCovMissDat] = sort(nanValsPerCov,'ascend');
imputeCols = idxCovMissDat(stdCovMissDat>0);
% Impute starting from the cols (covariates) with the least number of
% missing data.
for c = reshape(imputeCols,1,length(imputeCols))
    imputeRows = 1:nSamples;
    imputeRows = imputeRows(nanVals(:,c));
    for r = reshape(imputeRows,1,length(imputeRows))
        % Calculate distances
        distR = inf(nSamples,1);
        %
        noNansCov_r = find(isnan(data(r,:))==0);
        noNansCov_r = noNansCov_r(sum(isnan(data(nanVals(:,c)'==false,~isnan(data(r,:)))),1)==0);
        %
        for i = find(nanVals(:,c)'==false)
            distR(i) = pdist([data(r,noNansCov_r); data(i,noNansCov_r)],distMetric);
        end
        tmp = min(distR(distR>0));
        % Impute the missing data at sample r of covariate c
        switch option
            case 'weighted'
                data(r,c) = (1./distR(distR<=max(distCutoff,tmp)))' * data(distR<=max(distCutoff,tmp),c) / sum(1./distR(distR<=max(distCutoff,tmp)));
            case 'median'
                data(r,c) = median(data(distR<=max(distCutoff,tmp),c),1);
            case 'mean'
                data(r,c) = mean(data(distR<=max(distCutoff,tmp),c),1);
        end
        % The missing data in sample r is imputed. Update the sample
        % indices of c which are imputed.
        nanVals(r,c) = false;
    end
    fprintf('%u/%u of the covariates are imputed.\n',find(c==imputeCols),length(imputeCols));
end
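For readers who do not use MATLAB, the core idea of the function above (borrow values only from neighbors that lie within a distance cutoff, but always keep at least the nearest one) can be sketched in plain Python. This is a simplified illustration, not a port of dnnimpute: the names impute_with_cutoff and _distance are made up for this example, only the unweighted mean is implemented, and distances are computed on the original rows over the columns that both rows have observed.

```python
import math

def _distance(a, b):
    """Euclidean distance over the columns observed in both rows."""
    shared = [(x, y) for x, y in zip(a, b)
              if not math.isnan(x) and not math.isnan(y)]
    if not shared:
        return math.inf
    return math.sqrt(sum((x - y) ** 2 for x, y in shared))

def impute_with_cutoff(rows, cutoff):
    """Fill each NaN with the mean of the donors within the cutoff distance."""
    filled = [list(r) for r in rows]
    for r, row in enumerate(rows):
        for c, value in enumerate(row):
            if not math.isnan(value):
                continue
            # Candidate donors: other rows where column c is observed.
            donors = [(_distance(row, other), other[c])
                      for i, other in enumerate(rows)
                      if i != r and not math.isnan(other[c])]
            donors = [(d, v) for d, v in donors if d < math.inf]
            if not donors:
                continue  # nothing to borrow from; leave the NaN in place
            nearest = min(d for d, _ in donors)
            # Keep at least the nearest donor, like max(distCutoff, tmp) above.
            radius = max(cutoff, nearest)
            values = [v for d, v in donors if d <= radius]
            filled[r][c] = sum(values) / len(values)
    return filled

# Hypothetical toy data: the last row is missing its third feature.
data = [
    [0.0, 0.0, 1.0],
    [0.1, 0.0, 3.0],
    [5.0, 5.0, 9.0],
    [0.0, 0.1, math.nan],
]
result = impute_with_cutoff(data, cutoff=0.5)
print(result[3][2])  # mean of the two close donors (1.0 and 3.0)
```

With cutoff=0.5 the distant third row is ignored, so the filled value is not pulled toward 9.0; with a very small cutoff only the single nearest row is used.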
To deal with missing data you can use one of the following three options:
If there are not many instances with missing values, you can simply delete those instances.
If you have many features and can afford to lose some information, delete the entire feature that has missing values.
Usually the best option is to fill in a value (mean or median) in place of each missing value. You can calculate the mean of that feature over the remaining training examples and fill all the missing values with it. This works out pretty well because the mean stays within the distribution of your data.
Note: When you replace the missing values with the mean, calculate the mean using only the training set. Store that value and use it to fill the missing values in the test set as well.
If you use 0 or 1 to replace all the missing values, the data may get skewed, so it is better to replace them with an average of the other values.
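As a minimal sketch of the note above (means computed on the training set only, then reused for the test set), here is a plain Python version. The function names column_means and fill_missing and the toy numbers are made up for illustration; missing values are assumed to be encoded as NaN.

```python
import math

def column_means(rows):
    """Per-column mean over the observed (non-NaN) entries."""
    means = []
    for col in zip(*rows):
        observed = [v for v in col if not math.isnan(v)]
        means.append(sum(observed) / len(observed) if observed else math.nan)
    return means

def fill_missing(rows, fill_values):
    """Replace each NaN with the stored fill value for its column."""
    return [[fill_values[c] if math.isnan(v) else v
             for c, v in enumerate(row)]
            for row in rows]

# Hypothetical data: two features, NaN marks a missing value.
train = [[1.0, 0.0], [3.0, math.nan], [math.nan, 1.0]]
test = [[math.nan, 0.5]]

means = column_means(train)              # learned from the training set only
train_filled = fill_missing(train, means)
test_filled = fill_missing(test, means)  # same stored values reused here
print(means, test_filled)
```

Swapping the mean for the median only requires changing column_means; the train/test separation stays the same.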
For my thesis I have to calculate the number of workers at risk of substitution by machines. I have calculated the probability of substitution (X) and the number of employees at risk (Y) for each occupation category. I have a dataset like this:
X Y
1 0.1300 0
2 0.1000 0
3 0.0841 1513
4 0.0221 287
5 0.1175 3641
....
700 0.9875 4000
I tried to plot a histogram with this command:
hist(dataset1$X,dataset1$Y,xlim=c(0,1),ylim=c(0,30000),breaks=100,main="Distribution",xlab="Probability",ylab="Number of employee")
But I get this error:
In if (freq) x$counts else x$density
length > 1 and only the first element will be used
Can someone tell me what the problem is and give me the right command?
Thank you!
It is worth pointing out that the message displayed is a Warning message, and should not prevent the results being plotted. However, it does indicate there are some issues with the data.
Without the full dataset, it is not 100% obvious what the problem is. The warning itself appears to come from dataset1$Y being passed as an unnamed second argument: since breaks is already given by name, hist() matches it to freq, which expects a single logical value. More fundamentally, I believe the data is not in the format hist() expects, with two potential issues. Firstly, some rows have a Y value of 0, and these won't be plotted on the histogram. Secondly, the observations appear to be inconsistently spaced.
Histograms are best built from one of two kinds of dataset:
A dataframe which has been aggregated into consistently sized bins.
A list of the individual X values, one entry per observation in the data.
I prefer the second technique. As originally shown here, the expandRows() function in the splitstackshape package can be used to repeat each row of the dataframe by its number of observations:
set.seed(123)
dataset1 <- data.frame(X = runif(900, 0, 1), Y = runif(900, 0, 1000))
library(splitstackshape)
dataset2 <- expandRows(dataset1, "Y")
hist(dataset2$X, xlim=c(0,1))
# For the first technique, cut() assigns each X to one of 100 equal-width bins
dataset1$bins <- cut(dataset1$X, breaks = seq(0,1,0.01), labels = FALSE)
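The first technique (aggregating into consistently sized bins) amounts to a weighted histogram: each X is assigned to a bin and the Y values in that bin are summed. A language-neutral sketch in plain Python, using only the six rows shown in the question; the function name weighted_histogram is made up for this example, and 10 bins are used instead of 100 to keep the output short.

```python
def weighted_histogram(xs, weights, n_bins=10, lo=0.0, hi=1.0):
    """Sum the weights of the values falling in each equal-width bin."""
    totals = [0.0] * n_bins
    width = (hi - lo) / n_bins
    for x, w in zip(xs, weights):
        if not lo <= x <= hi:
            continue  # ignore values outside the plotted range
        # The last bin is closed on the right so that x == hi is counted.
        index = min(int((x - lo) / width), n_bins - 1)
        totals[index] += w
    return totals

# Rows shown in the question: probability X and number of employees Y.
probabilities = [0.1300, 0.1000, 0.0841, 0.0221, 0.1175, 0.9875]
employees = [0, 0, 1513, 287, 3641, 4000]

totals = weighted_histogram(probabilities, employees, n_bins=10)
for i, total in enumerate(totals):
    print(f"{i / 10:.1f}-{(i + 1) / 10:.1f}: {total:.0f}")
```

The per-bin totals can then be drawn as a bar chart, which gives the number of employees per probability range without expanding the rows.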
x=torch.Tensor({1,-1,3,-8})
How can I convert x so that all its negative values are replaced with zero, without using a loop, so that the tensor looks like this:
th>x
1
0
3
0
PyTorch supports indexing with boolean masks built from comparison operators:
a = torch.Tensor([1,0,-1])
a[a < 0] = 0
a
tensor([1., 0., 0.])
Actually, this operation is equivalent to applying the ReLU non-linear activation.
Just do this and you're good to go
output = torch.nn.functional.relu(a)
You can also do it in place, which avoids allocating a new tensor:
torch.nn.functional.relu(a, inplace=True)
PyTorch takes care of broadcasting here:
x = torch.max(x,torch.tensor([0.]))
I just want to clear up my confusion. I've tested OpenCV's template matching method to match some numbers. I have this sequence of numbers: 0 1 2 3 4 5 1 2 3 4 5 (after binarization the character widths are probably different). How does template matching work when matching the number '1'? Does it:
slide through all the windows until it finds 2 matches (2 outputs), or
stop after it matches the first '1', or
find the highest correlation among the two '1's and choose one of them?
Edited: The output is attached. It matches only one '1', not both.
[Q] How can I detect both '1's simultaneously?
I know it's an old question but here is an answer.
When you call MatchTemplate, it outputs a grayscale image of match scores. After that, you need to run MinMax on it. Then you can check whether there is a result in the range you are looking for. In the example below, using EmguCV (a C# wrapper for OpenCV), I draw a rectangle around the best match (index 0 of the minValues array) only if its score is below 0.75 (you can adjust this threshold to your needs).
Here is the code:
// Score map: with CV_TM_SQDIFF_NORMED, lower values mean better matches.
Image<Gray, float> result = image.CurrentImage.MatchTemplate(_imageTemplate.CurrentImage, Emgu.CV.CvEnum.TM_TYPE.CV_TM_SQDIFF_NORMED);
double[] minValues;
double[] maxValues;
System.Drawing.Point[] minLocations;
System.Drawing.Point[] maxLocations;
result.MinMax(out minValues, out maxValues, out minLocations, out maxLocations);
if (minValues[0] < 0.75)
{
    // Draw a rectangle of the template's size at the best match location.
    Rectangle rect = new Rectangle(new Point(minLocations[0].X, minLocations[0].Y),
        new Size(_imageTemplate.CurrentImage.Width, _imageTemplate.CurrentImage.Height));
    image.CurrentImage.Draw(rect, new Bgr(0,0,255), 1);
}
else
{
    //Nothing has been found
}
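The code above reports only the single best location, which is why only one '1' is found. One common way to get every occurrence is to keep taking the minimum of the score map, record it if it is below the threshold, and then blank out a template-sized neighborhood around it before looking again. Here is a rough sketch of that loop in plain Python; it works on an already computed score map given as a list of lists, the function name find_all_matches is made up, and the scores are invented for illustration (lower is better, as with the squared-difference method used above).

```python
import math

def find_all_matches(scores, threshold, tmpl_w, tmpl_h):
    """Return (x, y) of every match whose score is below the threshold."""
    scores = [list(row) for row in scores]  # work on a copy
    matches = []
    while True:
        best, bx, by = min((v, x, y)
                           for y, row in enumerate(scores)
                           for x, v in enumerate(row))
        if best >= threshold:
            break
        matches.append((bx, by))
        # Suppress the neighborhood so the same object is not reported twice.
        for y in range(max(0, by - tmpl_h + 1), min(len(scores), by + tmpl_h)):
            for x in range(max(0, bx - tmpl_w + 1),
                           min(len(scores[0]), bx + tmpl_w)):
                scores[y][x] = math.inf
    return matches

# Invented score map with two good matches, at x=1 and x=5 in the top row.
score_map = [
    [0.90, 0.05, 0.40, 0.95, 0.92, 0.08, 0.50],
    [0.95, 0.30, 0.60, 0.97, 0.94, 0.35, 0.70],
]
matches = find_all_matches(score_map, threshold=0.75, tmpl_w=2, tmpl_h=2)
print(matches)  # [(1, 0), (5, 0)]
```

The same loop can be written against the EmguCV result image by repeating MinMax and overwriting the region around each accepted location; that part is left out here because it depends on the EmguCV version in use.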
EDIT
Here is an example of the output:
http://www.cognotics.com/opencv/servo_2007_series/part_5/fig_6_thumb.png
Can anyone explain why nEigens must not exceed nTrainFace-1 in the picture above?
I am using EmguCV. When nEigens and nTrainFace are equal, the last eigen image is returned blank (black).
However, in the EmguCV implementation:
`if (termCrit.max_iter <= 0 || termCrit.max_iter > trainingImages.Length)
termCrit.max_iter = trainingImages.Length;`
A mathematical constraint:
The rank of the covariance matrix is limited by the number of training examples: if there are N training examples, there will be at most N-1 eigenvectors with non-zero eigenvalues.
http://en.wikipedia.org/wiki/Eigenface#Computing_the_eigenvectors
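The reason is that the training images are mean-centered before the covariance matrix is computed, and N centered vectors always sum to zero, so they span at most N-1 dimensions. A small numerical check in plain Python (the helper names matrix_rank and center and the three 5-pixel "faces" are made up for this illustration):

```python
def matrix_rank(rows, tol=1e-9):
    """Rank of a small matrix via Gaussian elimination with partial pivoting."""
    m = [list(r) for r in rows]
    n_rows, n_cols = len(m), len(m[0])
    rank = 0
    for col in range(n_cols):
        if rank == n_rows:
            break
        pivot = max(range(rank, n_rows), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < tol:
            continue  # no usable pivot in this column
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, n_rows):
            factor = m[r][col] / m[rank][col]
            for c in range(col, n_cols):
                m[r][c] -= factor * m[rank][c]
        rank += 1
    return rank

def center(rows):
    """Subtract the mean row (the 'average face') from every row."""
    means = [sum(col) / len(rows) for col in zip(*rows)]
    return [[v - mu for v, mu in zip(row, means)] for row in rows]

# Three hypothetical training images with five pixels each.
faces = [
    [1.0, 2.0, 0.0, 4.0, 1.0],
    [0.0, 1.0, 3.0, 1.0, 2.0],
    [2.0, 0.0, 1.0, 0.0, 5.0],
]
print(matrix_rank(faces))          # 3 linearly independent images
print(matrix_rank(center(faces)))  # 2, i.e. N-1 after mean-centering
```

With N = 3 images only 2 eigenvectors can have non-zero eigenvalues, which matches the blank last eigen image observed when nEigens equals nTrainFace.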