How to apply % to negative numbers in Visual FoxPro - modulo

How do I apply % to negative numbers in VFP?
MOD(10,-3) = -2
MOD(-10,3) = 2
MOD(-10,-3) = -1
Why?

It is a regular modulo:
For a positive divisor, the mod function is the amount by which a number exceeds the largest integer multiple of the divisor that is not greater than that number. More generally (and this is what VFP implements), MOD(x,y) = x - y * FLOOR(x/y), so the result always takes the sign of the divisor.
You can think of it like this:
10 % -3:
FLOOR(10 / -3) = FLOOR(-3.33...) = -4, and 10 - (-3 * -4) = 10 - 12 = -2.
So 10 % -3 is -2.
-10 % 3:
Now, why is -10 % 3 equal to 2?
The easiest way to think about it is to add a multiple of the divisor to the negative number until it becomes positive:
-10 + (4 * 3) = 2, so -10 % 3 = (-10 + 12) % 3 = 2 % 3 = 2.
-10 % -3:
FLOOR(-10 / -3) = FLOOR(3.33...) = 3, and -10 - (-3 * 3) = -10 + 9 = -1.

Here's what we said about this in The Hacker's Guide to Visual FoxPro:
MOD() and % are pretty straightforward when dealing with positive numbers, but they get interesting when one or both of the numbers is negative. The key to understanding the results is the following equation:
MOD(x,y) = x - (y * FLOOR(x/y))
Since the mathematical modulo operation isn't defined for negative numbers, it's a pleasure to see that the FoxPro definitions are mathematically consistent. However, they may be different from what you'd initially expect, so you may want to check for negative divisors or dividends.
A little testing (and the manuals) tells us that a positive divisor gives a positive result while a negative divisor gives a negative result.
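A quick way to see the identity in action: this sketch uses Python, whose % operator happens to follow the same floored-division convention as VFP's MOD(), and compares it against the formula above.

import math

def vfp_mod(x, y):
    # MOD(x, y) = x - y * FLOOR(x / y)
    return x - y * math.floor(x / y)

for x, y in [(10, -3), (-10, 3), (-10, -3)]:
    # The formula and Python's built-in % agree: -2, 2, -1
    print(x, y, vfp_mod(x, y), x % y)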

Related

Dealing with NaN (missing) values for Logistic Regression - Best practices?

I am working with a dataset of patient information and trying to calculate the propensity score from the data using MATLAB. After removing features with many missing values, I am still left with several missing (NaN) values.
I get errors due to these missing values, as my cost function and gradient vector become NaN when I try to perform logistic regression using the following MATLAB code (from Andrew Ng's Coursera Machine Learning class):
[m, n] = size(X);
X = [ones(m, 1) X];
initial_theta = ones(n+1, 1);
[cost, grad] = costFunction(initial_theta, X, y);
options = optimset('GradObj', 'on', 'MaxIter', 400);
[theta, cost] = ...
    fminunc(@(t)(costFunction(t, X, y)), initial_theta, options);
Note: sigmoid and costFunction are working functions I created for overall ease of use.
The calculations can be performed smoothly if I replace all NaN values with 1 or 0. However I am not sure if that is the best way to deal with this issue, and I was also wondering what replacement value I should pick (in general) to get the best results for performing logistic regression with missing data. Are there any benefits/drawbacks to using a particular number (0 or 1 or something else) for replacing the said missing values in my data?
Note: I have also normalized all feature values to be in the range of 0-1.
Any insight on this issue will be highly appreciated. Thank you
As pointed out earlier, this is a generic problem people deal with regardless of the programming platform. It is called "missing data imputation".
Forcing all missing values to a particular number certainly has drawbacks. Depending on the distribution of your data the effect can be drastic, for example, setting all missing values to 1 in sparse binary data that has many more zeroes than ones.
Fortunately, MATLAB has a function called knnimpute that estimates a missing data point by its closest neighbor.
From my experience, I have often found knnimpute useful. However, it may fall short when there are too many missing sites, as in your data: the neighbors of a missing site may be incomplete as well, leading to inaccurate estimates. Below is a workaround I came up with; it begins by imputing the least incomplete columns and (optionally) imposes a safe predefined distance for the neighbors. I hope this helps.
function data = dnnimpute(data,distCutoff,option,distMetric)
% data = dnnimpute(data,distCutoff,option,distMetric)
%
% Distance-based nearest neighbor imputation that imposes a distance
% cutoff to determine nearest neighbors, i.e., avoids those samples
% that are more distant than the distCutoff argument.
%
% Imputes missing data coded by "NaN" starting from the covariates
% (columns) with the least number of missing data. Then it continues by
% including more (complete) covariates in the calculation of pair-wise
% distances.
%
% option,
%    'median'   - Median of the nearest neighboring values
%    'weighted' - Weighted average of the nearest neighboring values
%    'mean'     - Unweighted average of the nearest neighboring values (default)
%
% distMetric,
% 'euclidean' - Euclidean distance (default)
% 'seuclidean' - Standardized Euclidean distance. Each coordinate
% difference between rows in X is scaled by dividing
% by the corresponding element of the standard
% deviation S=NANSTD(X). To specify another value for
% S, use D=pdist(X,'seuclidean',S).
% 'cityblock' - City Block distance
% 'minkowski' - Minkowski distance. The default exponent is 2. To
% specify a different exponent, use
% D = pdist(X,'minkowski',P), where the exponent P is
% a scalar positive value.
% 'chebychev' - Chebychev distance (maximum coordinate difference)
% 'mahalanobis' - Mahalanobis distance, using the sample covariance
% of X as computed by NANCOV. To compute the distance
% with a different covariance, use
% D = pdist(X,'mahalanobis',C), where the matrix C
% is symmetric and positive definite.
% 'cosine' - One minus the cosine of the included angle
% between observations (treated as vectors)
% 'correlation' - One minus the sample linear correlation between
% observations (treated as sequences of values).
% 'spearman' - One minus the sample Spearman's rank correlation
% between observations (treated as sequences of values).
% 'hamming' - Hamming distance, percentage of coordinates
% that differ
% 'jaccard' - One minus the Jaccard coefficient, the
% percentage of nonzero coordinates that differ
%    function      - A distance function specified using @, for
%                    example @DISTFUN.
%
if nargin < 3
    option = 'mean';
end
if nargin < 4
    distMetric = 'euclidean';
end

nanVals = isnan(data);
nanValsPerCov = sum(nanVals,1);
noNansCov = nanValsPerCov == 0;
if isempty(find(noNansCov, 1))
    [~,leastNans] = min(nanValsPerCov);
    noNansCov(leastNans) = true;
    first = data(nanVals(:,noNansCov),:);
    nanRows = find(nanVals(:,noNansCov)==true); i = 1;
    for row = first'
        data(nanRows(i),noNansCov) = mean(row(~isnan(row)));
        i = i+1;
    end
end

nSamples = size(data,1);
if nargin < 2
    dataNoNans = data(:,noNansCov);
    distances = pdist(dataNoNans);
    distCutoff = min(distances);
end

[stdCovMissDat,idxCovMissDat] = sort(nanValsPerCov,'ascend');
imputeCols = idxCovMissDat(stdCovMissDat>0);
% Impute starting from the cols (covariates) with the least number of
% missing data.
for c = reshape(imputeCols,1,length(imputeCols))
    imputeRows = 1:nSamples;
    imputeRows = imputeRows(nanVals(:,c));
    for r = reshape(imputeRows,1,length(imputeRows))
        % Calculate distances
        distR = inf(nSamples,1);
        %
        noNansCov_r = find(isnan(data(r,:))==0);
        noNansCov_r = noNansCov_r(sum(isnan(data(nanVals(:,c)'==false,~isnan(data(r,:)))),1)==0);
        %
        for i = find(nanVals(:,c)'==false)
            distR(i) = pdist([data(r,noNansCov_r); data(i,noNansCov_r)],distMetric);
        end
        tmp = min(distR(distR>0));
        % Impute the missing data at sample r of covariate c
        switch option
            case 'weighted'
                data(r,c) = (1./distR(distR<=max(distCutoff,tmp)))' * data(distR<=max(distCutoff,tmp),c) / sum(1./distR(distR<=max(distCutoff,tmp)));
            case 'median'
                data(r,c) = median(data(distR<=max(distCutoff,tmp),c),1);
            case 'mean'
                data(r,c) = mean(data(distR<=max(distCutoff,tmp),c),1);
        end
        % The missing data in sample r is imputed. Update the sample
        % indices of c which are imputed.
        nanVals(r,c) = false;
    end
    fprintf('%u/%u of the covariates are imputed.\n',find(c==imputeCols),length(imputeCols));
end
To deal with missing data you can use one of the following three options:
If there are not many instances with missing values, you can just delete the ones with missing values.
If you have many features and can afford to lose some information, delete the entire feature that has missing values.
The most effective simple method is to fill in a value (the mean or median) in place of the missing one. You can calculate the mean of that feature over the rest of the training examples and fill all of its missing values with that mean. This works out pretty well, as the mean stays within the distribution of your data.
Note: When you replace missing values with the mean, calculate the mean using only the training set. Also, store that value and use it to fill the missing values in the test set as well, as sketched below.
If you use 0 or 1 to replace all the missing values, the data may get skewed, so it is better to replace the missing values with the average of all the other values.
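As a minimal illustration of the train/test point above (the arrays and column layout here are toy examples), mean imputation might look like this in Python:

import numpy as np

# Toy feature matrices; in practice these are your train/test splits.
X_train = np.array([[1.0, 2.0],
                    [np.nan, 4.0],
                    [3.0, np.nan]])
X_test = np.array([[np.nan, 5.0]])

# Compute the per-feature mean on the training set only, ignoring NaNs...
train_means = np.nanmean(X_train, axis=0)

# ...then reuse those same training means to fill both splits.
X_train_filled = np.where(np.isnan(X_train), train_means, X_train)
X_test_filled = np.where(np.isnan(X_test), train_means, X_test)

print(X_train_filled)
print(X_test_filled)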

Lua decimal precision loss

Can someone explain why, in Lua, running:
return 256.65 * 1000000 + .000000005 - 256 * 1000000 gives 649999.99999997
whereas
return 255.65 * 1000000 + .000000005 - 255 * 1000000 and
return 268.65 * 1000000 + .000000005 - 268 * 1000000 give 650000.0 ?
From what I can see, it seems to be an issue strictly for the decimal part .65 (and, it seems, also .15), and for whole numbers within the range 256 - 267. I know this is related to doing these calculations with floating point, but I'm still curious as to what is special about these values in particular.
What is special about these values is that 0.65 is not a binary fraction (even though it is a decimal fraction), and so cannot be represented exactly in floating point.
For the record, this is not specific to Lua. The same thing will happen in C.
For the same reason that 10/3 is a repeating fraction in base 10. In base 3, dividing by 3 would give a terminating representation. In base 2 -- which is used to represent numbers in a computer -- the numbers you're producing similarly result in fractions that cannot be represented exactly.
Further reading.
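Since Lua numbers are IEEE 754 doubles, you can inspect the exact values involved from any language that uses the same doubles. As an illustration, this Python sketch prints the exact double each literal is rounded to (decimal.Decimal(x) shows a float's exact stored value), along with the result of the original expression:

from decimal import Decimal

for x in (255.65, 256.65, 268.65):
    print(Decimal(x))    # exact stored value of the literal
    print(x * 1000000 + 0.000000005 - int(x) * 1000000)

The stored values differ slightly from the written literals, and the products with 1000000 round differently as a result, which is what separates 256.65 from the other two.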

Erlang Calculating Pi to X decimal places

I have been given this question to work on. I'm struggling to get my head around the recursion; a breakdown of the question would be very helpful.
Given that Pi can be estimated using the function 4 * (1 – 1/3 + 1/5 – 1/7 + …) with more terms giving greater accuracy, write a function that calculates Pi to an accuracy of 5 decimal places.
I have some example code, however I really don't understand where/why the variables are entered like this. A breakdown of this code, and of why it is not accurate, would be appreciated.
-module (pi).
-export ([pi/0]).

pi() -> 4 * pi(0,1,1).

pi(T,M,D) ->
    A = 1 / D,
    if
        A > 0.00001 -> pi(T+(M*A), M*-1, D+2);
        true -> T
    end.
The formula comes from the evaluation of tan(pi/4), which is equal to 1. Inverting:
pi/4 = arctan(1)
so
pi = 4 * arctan(1).
Using the Taylor series expansion:
arctan(x) = x - x^3/3 + x^5/5 - ... + (-1)^n * x^(2n+1)/(2n+1) + o(x^(2n+1))
so when x = 1 you get your formula:
pi = 4 * (1 - 1/3 + 1/5 - 1/7 + ...)
The problem is to find an approximation of pi with an accuracy of 0.00001 (5 decimal places). Looking at the formula, you can notice that at each step the new term to add (1/3, 1/5, ...):
is smaller than the previous one,
has the opposite sign.
This means that each term bounds the error (the o(x^(2n+1)) term) between the real value of pi/4 and the partial sum of all terms before it.
So it can be used to stop the recursion at a level where the approximation is guaranteed to be better than this term. Strictly speaking, though, the program you propose multiplies the final result of the recursion by 4, so the error of the final answer is no longer guaranteed to be smaller than the last term.
Looking at the code:
pi() -> 4 * pi(0,1,1).
% T = 0 is the initial estimate of the sum
% M = 1 is the sign
% D = 1 is the first denominator of the Taylor series
pi(T,M,D) ->
    A = 1 / D,
    % evaluate the current term
    if
        A > 0.00001 -> pi(T+(M*A), M*-1, D+2);
        % if the precision is not reached, recurse with:
        %   the new partial sum (the previous one + sign * term): T+(M*A)
        %   the inverted sign: M*-1
        %   the next denominator: D+2
        true -> T
        % if the precision is reached, return the result T
    end.
To be sure that you have reached the right accuracy, I propose to replace A > 0.00001 by A > 0.0000025 (= 0.00001/4)
I can't find any error in this code, but I can't test it right now; anyway:
T is probably "total", M is "multiplicator", and D is "divisor".
At every step you:
check (the 'if' works somewhat like a switch/case in C/C++/Java) whether the next term (A = 1/D) is bigger than 0.00001. If it is not, you can stop the recursion; you've got the 5 decimal places you were looking for. So "if true (default case) -> return T".
if it is bigger, you multiply A by M, add it to the total, then multiply M by -1, add 2 to D, and repeat (so you get the next term, add it again, and so on).
pi(T,M,D) ->
    A = 1 / D,
    if
        A > 0.00001 -> pi(T+(M*A), M*-1, D+2);
        true -> T
    end.
I don't know Erlang myself, but from the looks of it you are checking whether 1/D is smaller than 0.00001, when in reality you should be checking 4 * 1/D, because that 4 is going to be multiplied through. For example, in your case if 1/D were 0.000003 you would stop your function, but your total would actually have changed by 0.000012. Hope this helps.
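For anyone who wants to experiment with the cutoff, here is a direct transcription of the same iteration in Python (the name leibniz_pi and the eps parameter are mine); it prints the approximation and its actual error against math.pi for both the original cutoff and the tightened one suggested above:

import math

def leibniz_pi(eps):
    # Mirror of the Erlang recursion: accumulate terms while 1/D > eps.
    total, sign, d = 0.0, 1.0, 1.0
    while 1.0 / d > eps:
        total += sign / d
        sign, d = -sign, d + 2.0
    return 4.0 * total

for eps in (0.00001, 0.0000025):
    approx = leibniz_pi(eps)
    print(eps, approx, abs(approx - math.pi))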

Finding standard deviation using only mean, min, max?

I want to find the standard deviation:
Minimum = 5
Mean = 24
Maximum = 84
Overall score = 90
I just want to find out my grade by using the standard deviation
Thanks,
A standard deviation cannot, in general, be computed from just the min, max, and mean. This can be demonstrated with two sets of scores that have the same min, max, and mean but different standard deviations:
1 2 4 5 : min=1 max=5 mean=3 stdev≈1.5811
1 3 3 5 : min=1 max=5 mean=3 stdev≈1.4142
(both computed as population standard deviations)
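A couple of lines of Python confirm this (statistics.pstdev is the population standard deviation used above):

import statistics

for scores in ([1, 2, 4, 5], [1, 3, 3, 5]):
    # Same min, max, and mean; different standard deviation.
    print(min(scores), max(scores),
          statistics.mean(scores), statistics.pstdev(scores))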
Also, what does an 'overall score' of 90 mean if the maximum is 84?
I actually did a quick-and-dirty calculation of the type M Rad mentions. It involves assuming that the distribution is Gaussian or "normal." This does not apply to your situation but might help others asking the same question. (You can tell your distribution is not normal because the distances from mean to max and from mean to min are far from equal.) Even if it were normal, you would need something you don't mention: the number of samples (the number of tests taken, in your case).
Those readers who DO have a normal population can use the table below to get a rough estimate of the standard deviation by dividing the difference between the measured minimum and the calculated mean by the expected distance for their sample size. On average, the estimate will be off by roughly the fraction shown in the "Expected error" column. (I have no idea whether it is biased - change the code below and calculate the error without the abs to get a guess.)
Num Samples    Expected distance    Expected error
     10              1.55                0.25
     20              1.88                0.20
     30              2.05                0.18
     40              2.16                0.17
     50              2.26                0.15
     60              2.33                0.15
     70              2.38                0.14
     80              2.43                0.14
     90              2.47                0.13
    100              2.52                0.13
This experiment shows that the "rule of thumb" of dividing the range by 4 to get the standard deviation is in general incorrect -- even for normal populations. In my experiment it only holds for sample sizes between 20 and 40 (and then loosely). This rule may have been what the OP was thinking about.
You can modify the following Python code to generate the table for different values (change max_sample_size), for more accuracy (change num_simulations), or to get rid of the limitation to multiples of 10 (change the parameters to xrange in the for loop for idx).
#!/usr/bin/python
import random

# Return the distance of the minimum of samples from its mean
#
# Samples must have at least one entry
def min_dist_from_estd_mean(samples):
    total = 0
    sample_min = samples[0]
    for sample in samples:
        total += sample
        sample_min = min(sample, sample_min)
    estd_mean = total / len(samples)
    return estd_mean - sample_min # Pos bec min cannot be greater than mean

num_simulations = 4095
max_sample_size = 100

# Calculate expected distances
sum_of_dists=[0]*(max_sample_size+1) # +1 so can index by sample size
for iternum in xrange(num_simulations):
    samples=[random.normalvariate(0,1)]
    while len(samples) <= max_sample_size:
        sum_of_dists[len(samples)] += min_dist_from_estd_mean(samples)
        samples.append(random.normalvariate(0,1))
expected_dist = [total/num_simulations for total in sum_of_dists]

# Calculate average error using that distance
sum_of_errors=[0]*len(sum_of_dists)
for iternum in xrange(num_simulations):
    samples=[random.normalvariate(0,1)]
    while len(samples) <= max_sample_size:
        ave_dist = expected_dist[len(samples)]
        if ave_dist > 0:
            sum_of_errors[len(samples)] += \
                abs(1 - (min_dist_from_estd_mean(samples)/ave_dist))
        samples.append(random.normalvariate(0,1))
expected_error = [total/num_simulations for total in sum_of_errors]

cols=" {0:>15}{1:>20}{2:>20}"
print(cols.format("Num Samples","Expected distance","Expected error"))
cols=" {0:>15}{1:>20.2f}{2:>20.2f}"
for idx in xrange(10,len(expected_dist),10):
    print(cols.format(idx, expected_dist[idx], expected_error[idx]))
You can obtain an estimate of the geometric mean, sometimes called the geometric mean of the extremes or GME, from the Min and the Max by calculating GME = $\sqrt{Min \cdot Max}$. The SD can then be estimated from your arithmetic mean (AM) and the GME as:
SD = $\frac{AM}{GME}\sqrt{AM^2 - GME^2}$
This approach works well for log-normal distributions, or as long as the GME, GM, or median is smaller than the AM.
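For what it's worth, this identity is exact for a log-normal distribution when the true geometric mean is used (the GME from min and max is only an estimate of it). A quick numeric check in Python, with mu and sigma chosen arbitrarily for illustration:

import math

mu, sigma = 0.0, 0.5                                # arbitrary log-normal parameters
AM = math.exp(mu + sigma**2 / 2)                    # arithmetic mean of the log-normal
GM = math.exp(mu)                                   # geometric mean of the log-normal
SD_true = AM * math.sqrt(math.exp(sigma**2) - 1)    # exact standard deviation
SD_formula = (AM / GM) * math.sqrt(AM**2 - GM**2)   # the formula above
print(SD_true, SD_formula)                          # the two values match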
In principle you can make an estimate of the standard deviation from the mean/min/max and the number of elements in the sample. The min and max of a sample are, if you assume normality, random variables whose statistics follow from the mean/stddev/number of samples. So given the latter, one can compute (after slogging through the math or running a bunch of Monte Carlo scripts) a confidence interval for the former (e.g., it is 80% probable that the stddev is between 20 and 40, or something like that).
That said, it probably isn't worth doing except in extreme situations.

Strange for loop problem

I'm not sure if this is a bug or not, so I thought that maybe you folks might want to take a look.
The problem lies with this code:
for i=0,1,.05 do
    print(i)
end
The output should be:
0
.05
.1
--snip--
.95
1
Instead, the output is:
0
.05
.1
--snip--
.95
This same problem happened with a while loop:
w = 0
while w <= 1 do
    print(w)
    w = w + .05
end
--output:
0
.05
.1
--snip--
.95
The value of w is 1, which can be verified by a print statement after the loop.
I have verified as much as possible that any step that is less than or equal to .05 will produce this error, while any step above .05 seems to be fine. I verified that 1/19 (0.052631579) does print a 1. (Obviously, a decimal denominator like 19.9 or 10.5 will not produce output covering [0,1] inclusive.) Is there a possibility that this is not an error in the language? Both the interpreter and a regular Lua file produce this behavior.
This is a rounding problem. The issue is that 0.05 is stored as a binary floating-point number, and it does not have an exact representation in binary. In base 2 (binary), it is a repeating fraction, much like 1/3 is in base 10. When it is added repeatedly, the rounding errors accumulate into a result which is slightly more than 1. It is only very, very slightly more than 1, so if you print it out it shows 1 as the output, but it is not exactly 1.
> x=0.05+0.05+0.05+0.05+0.05+0.05+0.05+0.05+0.05+0.05+0.05+0.05+0.05+0.05+0.05+0.05+0.05+0.05+0.05+0.05
> print(x)
1
> print(1==x)
false
> print(x-1)
2.2204460492503e-16
So, as you can see, although really close to 1, it is actually slightly more.
A similar situation can come up in decimal when we have repeating fractions. If we were to add together 1/3 + 1/3 + 1/3, but we had to round to six digits to work with, we would add 0.333333 + 0.333333 + 0.333333 and get 0.999999 which is not actually 1. This is an analogous case for binary math. 1/20 cannot be precisely represented in binary.
Note that the rounding is slightly different for multiplication so
> print(0.05*20-1)
0
> print(0.05*20==1)
true
As a result, you could rewrite your code to say
for i=0,20,1 do
    print(i*0.05)
end
And it would work correctly. In general, it's advisable not to use floating point numbers (that is, numbers with decimal points) for controlling loops when it can be avoided.
This is a result of floating-point inaccuracy. A binary64 floating-point number cannot store 0.05 exactly, so the literal is rounded to a number very slightly more than 0.05. That rounding error persists through the repeated additions, so the final value ends up slightly more than 1.0 and is therefore never printed.
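If you want to inspect the exact doubles involved, the same IEEE 754 arithmetic can be reproduced from Python, where decimal.Decimal(x) prints a float's exact stored value:

from decimal import Decimal

x = 0.05
print(Decimal(x))            # the exact double stored for the literal 0.05

s = 0.0
for _ in range(20):          # repeated addition, as in the loop
    s += x
print(Decimal(s), s > 1.0)   # the accumulated sum overshoots 1.0

print(Decimal(x * 20))       # a single multiplication rounds back to exactly 1.0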
This is a floating-point thing. Computers don't represent floating-point numbers exactly. Tiny rounding errors make it so that 20 additions of 0.05 do not result in precisely 1.0.
Check out this article: "What every programmer should know about floating-point arithmetic."
To get your desired behavior, you could loop i over 0..20 and set f = i*0.05.
This is not a bug in Lua. The same thing happens in the C program below. As others have explained, it's due to floating-point inaccuracy, more precisely to the fact that 0.05 is not a binary fraction (that is, it does not have a finite binary representation).
#include <stdio.h>

int main(void)
{
    double i;
    for (i = 0; i <= 1; i += 0.05) printf("%g\n", i);
    return 0;
}
