Taking Mean of list of numbers having positive, negative, NaN, zero values and only taking mean of positive numbers only - mean

How to take mean of positive numbers only in the series or in dataframe?
How to take mean of positive numbers only in the series or in dataframe?

Try Something like this :
def mean_positive(L):
# Get all positive numbers into another list
pos_only = [x for x in L if x > 0]
if pos_only:
return sum(pos_only) / len(pos_only)
raise ValueError('No postive numbers in input')

Related

Why we used the sum in the code for the gradient of the bias and why we didn't in the code of `the weight?

The code of the partial derivatives of the mean square error:
w_grad = -(2 / n_samples)*(X.T.dot(y_true - y_pred))
b_grad = -(2 / n_samples)*np.sum(y_true - y_pred)
With n_samples as n, the samples number, y_true as the observations and y_pred as the predictions
My question is, why we used the sum for the gradient in the code of b (b_grad), and why we didn't in the code of w_grad?
The original equation is:
If you have ten features, then you have ten Ws and ten Bs, and the total number of variables are twenty.
But we can just sum all B_i into one variable, and the total number of variables becomes 10+1 = 11. It is done by adding one more dimension and fixing the last x be 1. The calculation becomes below:

A differentiable approach to counting elements in PyTorch

I need to count the number of times a certain element appear in a tensor in a differentiable way.
I have a tensor
a = torch.arange(10, dtype = float, requires_grad=True)
print(a)
>>>tensor([0., 1., 2., 3., 4., 5., 6., 7., 8., 9.], dtype=torch.float64,
requires_grad=True)
Say I'm trying to count the number of times the element 5.0 appear. I found this SO question that is exactly the same, but the accepted answer is non differentiable:
(a == 5).sum()
>>>tensor(1)
(a == 5).sum().requires_grad
>>>False
My goal is to have a loss that enforces the element to appear N times:
loss = N - (a == 5).sum()
What you probably care about is differentiability wrt parameters, so your vector [1,2,3,4,5] is actually an output of f(x | theta). Sicne you casted everything onto integers, this will never create a meaningful gradient for theta, you have two paths:
Change your output, so that you do not output numbers, but rather distributions over number sequences, so instead of having a vector of integers, output a matrix of probabilities, N x K, where K is the maximum number and N number of integers, and an entry p_nk is a probability of nth number to be equal to k. Then, you can just write a nice smooth loss that will take expected number of each digit, lets call it Z (which is of length K) and then we can do
loss(P, Z) := - SUM_k [ || Z_k - [ SUM_n P_nk ] || ]
Treat the whole setup as RL problem, and then you do not need a "differentiable" loss. Just use a difference between expected occurences, and actual occurences as negative reward

Dealing with NaN (missing) values for Logistic Regression- Best practices?

I am working with a data-set of patient information and trying to calculate the Propensity Score from the data using MATLAB. After removing features with many missing values, I am still left with several missing (NaN) values.
I get errors due to these missing values, as the values of my cost-function and gradient vector become NaN, when I try to perform logistic regression using the following Matlab code (from Andrew Ng's Coursera Machine Learning class) :
[m, n] = size(X);
X = [ones(m, 1) X];
initial_theta = ones(n+1, 1);
[cost, grad] = costFunction(initial_theta, X, y);
options = optimset('GradObj', 'on', 'MaxIter', 400);
[theta, cost] = ...
fminunc(#(t)(costFunction(t, X, y)), initial_theta, options);
Note: sigmoid and costfunction are working functions I created for overall ease of use.
The calculations can be performed smoothly if I replace all NaN values with 1 or 0. However I am not sure if that is the best way to deal with this issue, and I was also wondering what replacement value I should pick (in general) to get the best results for performing logistic regression with missing data. Are there any benefits/drawbacks to using a particular number (0 or 1 or something else) for replacing the said missing values in my data?
Note: I have also normalized all feature values to be in the range of 0-1.
Any insight on this issue will be highly appreciated. Thank you
As pointed out earlier, this is a generic problem people deal with regardless of the programming platform. It is called "missing data imputation".
Enforcing all missing values to a particular number certainly has drawbacks. Depending on the distribution of your data it can be drastic, for example, setting all missing values to 1 in a binary sparse data having more zeroes than ones.
Fortunately, MATLAB has a function called knnimpute that estimates a missing data point by its closest neighbor.
From my experience, I often found knnimpute useful. However, it may fall short when there are too many missing sites as in your data; the neighbors of a missing site may be incomplete as well, thereby leading to inaccurate estimation. Below, I figured out a walk-around solution to that; it begins with imputing the least incomplete columns, (optionally) imposing a safe predefined distance for the neighbors. I hope this helps.
function data = dnnimpute(data,distCutoff,option,distMetric)
% data = dnnimpute(data,distCutoff,option,distMetric)
%
% Distance-based nearest neighbor imputation that impose a distance
% cutoff to determine nearest neighbors, i.e., avoids those samples
% that are more distant than the distCutoff argument.
%
% Imputes missing data coded by "NaN" starting from the covarites
% (columns) with the least number of missing data. Then it continues by
% including more (complete) covariates in the calculation of pair-wise
% distances.
%
% option,
% 'median' - Median of the nearest neighboring values
% 'weighted' - Weighted average of the nearest neighboring values
% 'default' - Unweighted average of the nearest neighboring values
%
% distMetric,
% 'euclidean' - Euclidean distance (default)
% 'seuclidean' - Standardized Euclidean distance. Each coordinate
% difference between rows in X is scaled by dividing
% by the corresponding element of the standard
% deviation S=NANSTD(X). To specify another value for
% S, use D=pdist(X,'seuclidean',S).
% 'cityblock' - City Block distance
% 'minkowski' - Minkowski distance. The default exponent is 2. To
% specify a different exponent, use
% D = pdist(X,'minkowski',P), where the exponent P is
% a scalar positive value.
% 'chebychev' - Chebychev distance (maximum coordinate difference)
% 'mahalanobis' - Mahalanobis distance, using the sample covariance
% of X as computed by NANCOV. To compute the distance
% with a different covariance, use
% D = pdist(X,'mahalanobis',C), where the matrix C
% is symmetric and positive definite.
% 'cosine' - One minus the cosine of the included angle
% between observations (treated as vectors)
% 'correlation' - One minus the sample linear correlation between
% observations (treated as sequences of values).
% 'spearman' - One minus the sample Spearman's rank correlation
% between observations (treated as sequences of values).
% 'hamming' - Hamming distance, percentage of coordinates
% that differ
% 'jaccard' - One minus the Jaccard coefficient, the
% percentage of nonzero coordinates that differ
% function - A distance function specified using #, for
% example #DISTFUN.
%
if nargin < 3
option = 'mean';
end
if nargin < 4
distMetric = 'euclidean';
end
nanVals = isnan(data);
nanValsPerCov = sum(nanVals,1);
noNansCov = nanValsPerCov == 0;
if isempty(find(noNansCov, 1))
[~,leastNans] = min(nanValsPerCov);
noNansCov(leastNans) = true;
first = data(nanVals(:,noNansCov),:);
nanRows = find(nanVals(:,noNansCov)==true); i = 1;
for row = first'
data(nanRows(i),noNansCov) = mean(row(~isnan(row)));
i = i+1;
end
end
nSamples = size(data,1);
if nargin < 2
dataNoNans = data(:,noNansCov);
distances = pdist(dataNoNans);
distCutoff = min(distances);
end
[stdCovMissDat,idxCovMissDat] = sort(nanValsPerCov,'ascend');
imputeCols = idxCovMissDat(stdCovMissDat>0);
% Impute starting from the cols (covariates) with the least number of
% missing data.
for c = reshape(imputeCols,1,length(imputeCols))
imputeRows = 1:nSamples;
imputeRows = imputeRows(nanVals(:,c));
for r = reshape(imputeRows,1,length(imputeRows))
% Calculate distances
distR = inf(nSamples,1);
%
noNansCov_r = find(isnan(data(r,:))==0);
noNansCov_r = noNansCov_r(sum(isnan(data(nanVals(:,c)'==false,~isnan(data(r,:)))),1)==0);
%
for i = find(nanVals(:,c)'==false)
distR(i) = pdist([data(r,noNansCov_r); data(i,noNansCov_r)],distMetric);
end
tmp = min(distR(distR>0));
% Impute the missing data at sample r of covariate c
switch option
case 'weighted'
data(r,c) = (1./distR(distR<=max(distCutoff,tmp)))' * data(distR<=max(distCutoff,tmp),c) / sum(1./distR(distR<=max(distCutoff,tmp)));
case 'median'
data(r,c) = median(data(distR<=max(distCutoff,tmp),c),1);
case 'mean'
data(r,c) = mean(data(distR<=max(distCutoff,tmp),c),1);
end
% The missing data in sample r is imputed. Update the sample
% indices of c which are imputed.
nanVals(r,c) = false;
end
fprintf('%u/%u of the covariates are imputed.\n',find(c==imputeCols),length(imputeCols));
end
To deal with missing data you can use one of the following three options:
If there are not many instances with missing values, you can just delete the ones with missing values.
If you have many features and it is affordable to lose some information, delete the entire feature with missing values.
The best method is to fill some value (mean, median) in place of missing value. You can calculate the mean of the rest of the training examples for that feature and fill all the missing values with the mean. This works out pretty well as the mean value stays in the distribution of your data.
Note: When you replace the missing values with the mean, calculate the mean only using training set. Also, store that value and use it to change the missing values in the test set also.
If you use 0 or 1 to replace all the missing values then the data may get skewed so it is better to replace the missing values by an average of all the other values.

if (freq) x$counts else x$density length > 1 and only the first element will be used

for my thesis I have to calculate the number of workers at risk of substitution by machines. I have calculated the probability of substitution (X) and the number of employee at risk (Y) for each occupation category. I have a dataset like this:
X Y
1 0.1300 0
2 0.1000 0
3 0.0841 1513
4 0.0221 287
5 0.1175 3641
....
700 0.9875 4000
I tried to plot a histogram with this command:
hist(dataset1$X,dataset1$Y,xlim=c(0,1),ylim=c(0,30000),breaks=100,main="Distribution",xlab="Probability",ylab="Number of employee")
But I get this error:
In if (freq) x$counts else x$density
length > 1 and only the first element will be used
Can someone tell me what is the problem and write me the right command?
Thank you!
It is worth pointing out that the message displayed is a Warning message, and should not prevent the results being plotted. However, it does indicate there are some issues with the data.
Without the full dataset, it is not 100% obvious what may be the problem. I believe it is caused by the data not being in the correct format, with two potential issues. Firstly, some values have a value of 0, and these won't be plotted on the histogram. Secondly, the observations appear to be inconsistently spaced.
Histograms are best built from one of two datasets:
A dataframe which has been aggregated grouped into consistently sized bins.
A list of values X which in the data
I prefer the second technique. As originally shown here The expandRows() function in the package splitstackshape can be used to repeat the number of rows in the dataframe by the number of observations:
set.seed(123)
dataset1 <- data.frame(X = runif(900, 0, 1), Y = runif(900, 0, 1000))
library(splitstackshape)
dataset2 <- expandRows(dataset1, "Y")
hist(dataset2$X, xlim=c(0,1))
dataset1$bins <- cut(dataset1$X, breaks = seq(0,1,0.01), labels = FALSE)

How to process % to negative number in Visual Foxpro

How to do % to negative number in VF?
MOD(10,-3) = -2
MOD(-10,3) = 2
MODE(-10,-3) = -1
Why?
It is a regular modulo:
The mod function is defined as the amount by which a number exceeds
the largest integer multiple of the divisor that is not greater than
that number.
You can think of it like this:
10 % -3:
The largest multiple of 10 that is less than -3 is -2.
So 10 % -3 is -2.
-10 % 3:
Now, why -10 % 3 is 2?
The easiest way to think about it is to add to the negative number a multiple of 2 so that the number becomes positive.
-10 + (4*3) = 2 so -10 % 3 = (-10 + 12) % 3 = 2 % 3 = 3
Here's what we said about this in The Hacker's Guide to Visual FoxPro:
MOD() and % are pretty straightforward when dealing with positive numbers, but they get interesting when one or both of the numbers is negative. The key to understanding the results is the following equation:
MOD(x,y) = x - (y * FLOOR(x/y))
Since the mathematical modulo operation isn't defined for negative numbers, it's a pleasure to see that the FoxPro definitions are mathematically consistent. However, they may be different from what you'd initially expect, so you may want to check for negative divisors or dividends.
A little testing (and the manuals) tells us that a positive divisor gives a positive result while a negative divisor gives a negative result.

Resources