Finding standard deviation using only mean, min, max?

I want to find the standard deviation:
Minimum = 5
Mean = 24
Maximum = 84
Overall score = 90
I just want to find out my grade by using the standard deviation
Thanks,

A standard deviation cannot in general be computed from just the min, max, and mean. This can be demonstrated with two sets of scores that have the same min, max, and mean but different standard deviations:
1 2 4 5 : min=1 max=5 mean=3 stdev≈1.5811
1 3 3 5 : min=1 max=5 mean=3 stdev≈1.4142
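A quick Python check (using the population standard deviation, to match the figures above) confirms the point:

from statistics import pstdev  # population standard deviation

a = [1, 2, 4, 5]
b = [1, 3, 3, 5]
print(min(a), max(a), sum(a) / len(a))            # 1 5 3.0
print(min(b), max(b), sum(b) / len(b))            # 1 5 3.0
print(round(pstdev(a), 4), round(pstdev(b), 4))   # 1.5811 1.4142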
Also, what does an 'overall score' of 90 mean if the maximum is 84?

I actually did a quick-and-dirty calculation of the type M Rad mentions. It involves assuming that the distribution is Gaussian or "normal." This does not apply to your situation, but it might help others asking the same question. (You can tell your distribution is not normal because the distances from the mean to the max and from the mean to the min are far from equal.) Even if it were normal, you would need something you don't mention: the number of samples (the number of tests taken, in your case).
Those readers who DO have a normal population can use the table below for a rough estimate: divide the difference between your measured minimum and your calculated mean by the expected distance for your sample size. On average, the estimate will be off by the given number of standard deviations. (I have no idea whether it is biased; change the code below and calculate the error without the abs to get a guess.)
Num Samples   Expected distance   Expected error
         10                1.55             0.25
         20                1.88             0.20
         30                2.05             0.18
         40                2.16             0.17
         50                2.26             0.15
         60                2.33             0.15
         70                2.38             0.14
         80                2.43             0.14
         90                2.47             0.13
        100                2.52             0.13
This experiment shows that the "rule of thumb" of dividing the range by 4 to get the standard deviation is in general incorrect -- even for normal populations. In my experiment it only holds for sample sizes between 20 and 40 (and then loosely). This rule may have been what the OP was thinking about.
You can modify the following Python code to generate the table for different values (change max_sample_size), to get more accuracy (change num_simulations), or to get rid of the limitation to multiples of 10 (change the parameters to range in the for loop for idx):
#!/usr/bin/python
import random

# Return the distance of the minimum of samples from its mean.
#
# samples must have at least one entry.
def min_dist_from_estd_mean(samples):
    total = 0
    sample_min = samples[0]
    for sample in samples:
        total += sample
        sample_min = min(sample, sample_min)
    estd_mean = total / len(samples)
    return estd_mean - sample_min  # Positive because the min cannot exceed the mean

num_simulations = 4095
max_sample_size = 100

# Calculate expected distances
sum_of_dists = [0] * (max_sample_size + 1)  # +1 so we can index by sample size
for iternum in range(num_simulations):
    samples = [random.normalvariate(0, 1)]
    while len(samples) <= max_sample_size:
        sum_of_dists[len(samples)] += min_dist_from_estd_mean(samples)
        samples.append(random.normalvariate(0, 1))
expected_dist = [total / num_simulations for total in sum_of_dists]

# Calculate average error using that distance
sum_of_errors = [0] * len(sum_of_dists)
for iternum in range(num_simulations):
    samples = [random.normalvariate(0, 1)]
    while len(samples) <= max_sample_size:
        ave_dist = expected_dist[len(samples)]
        if ave_dist > 0:
            sum_of_errors[len(samples)] += \
                abs(1 - (min_dist_from_estd_mean(samples) / ave_dist))
        samples.append(random.normalvariate(0, 1))
expected_error = [total / num_simulations for total in sum_of_errors]

cols = " {0:>15}{1:>20}{2:>20}"
print(cols.format("Num Samples", "Expected distance", "Expected error"))
cols = " {0:>15}{1:>20.2f}{2:>20.2f}"
for idx in range(10, len(expected_dist), 10):
    print(cols.format(idx, expected_dist[idx], expected_error[idx]))
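For example, here is a minimal sketch of how the table might be applied (the sample size of 50 is an assumed value for illustration, and this only makes sense for a roughly normal population):

mean_score = 24.0
min_score = 5.0
expected_dist_n50 = 2.26  # from the table row for 50 samples
sd_estimate = (mean_score - min_score) / expected_dist_n50
print(sd_estimate)  # about 8.4, with roughly 15% expected error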

You can obtain an estimate of the geometric mean, sometimes called the geometric mean of the extremes or GME, using the Min and the Max, by calculating $GME = \sqrt{Min \times Max}$. The SD can then be calculated using your arithmetic mean (AM) and the GME as:
$$SD = \frac{AM}{GME} \sqrt{AM^2 - GME^2}$$
This approach works well for log-normal distributions, or as long as the GME, GM, or median is smaller than the AM.
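Applied to the numbers in the question (purely as an illustration, since nothing guarantees the scores are log-normal), a short Python check gives:

from math import sqrt

am = 24.0
gme = sqrt(5 * 84)                       # sqrt(Min * Max), about 20.49
sd = (am / gme) * sqrt(am**2 - gme**2)   # about 14.6
print(gme, sd)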

In principle, you can make an estimate of the standard deviation from the mean/min/max and the number of elements in the sample. The min and max of a sample are, if you assume normality, random variables whose statistics follow from the mean/stddev/number of samples. So given the latter, one can compute (after slogging through the math or running a bunch of Monte Carlo scripts) a confidence interval for the stddev (e.g., it is 80% probable that the stddev is between 20 and 40, or something like that).
That said, it probably isn't worth doing except in extreme situations.
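As a very rough sketch of that Monte Carlo idea in Python (the sample size n = 30 and the tolerance of 5 are assumptions for illustration; it conditions on the observed range rather than the min and max separately, and this accept/reject filter is not a rigorous confidence interval):

import random

observed_range = 84.0 - 5.0   # max - min from the question
n = 30                        # assumed number of tests
accepted = []
for _ in range(20000):
    sigma = random.uniform(1, 40)   # candidate standard deviation
    samples = [random.normalvariate(24.0, sigma) for _ in range(n)]
    # keep sigma if the simulated range lands near the observed one
    if abs((max(samples) - min(samples)) - observed_range) < 5:
        accepted.append(sigma)
accepted.sort()
if accepted:
    print("80%% of accepted sigmas lie between %.1f and %.1f"
          % (accepted[int(0.1 * len(accepted))], accepted[int(0.9 * len(accepted))]))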

Related

Coefficients and Confidence Intervals - GLM Binomial (Logit)

I've run an Interrupted Time Series Analysis using a Binomial logistic regression in R.
glm(`Subject Refused Ratio` ~ Quarter + int2 + time_since_intervention2 , df, family = "binomial"(link='logit'), weights = sub_weight)
I want to derive the coefficients and confidence intervals for each of my outcomes and am currently doing so with the margins package, with the following outcome:
summary(margins(rrfit1a))
                   factor     AME     SE       z      p   lower  upper
                     int2  0.0963 0.1064  0.9050 0.3654 -0.1122 0.3047
                  Quarter -0.0006 0.0049 -0.1162 0.9075 -0.0101 0.0089
 time_since_intervention2 -0.0056 0.0209 -0.2695 0.7875 -0.0466 0.0353
These seem largely consistent with the modelled data. For example, it suggests the intervention (int2) could range between a 0.11 decrease and a 0.30 increase.
However, I really need to add similar coefficient values and confidence intervals for the original intercept. I have tried to do so using simple exp(coefficients) and the confint function within the MASS package, but the outcome doesn't quite tie in with what I would anticipate seeing.
exp(coefficients(rrfit1a))
             (Intercept)     Quarter        int2 time_since_intervention2
               0.9093160   0.9977377   1.4720697                0.9776187
For context, the fitted value of the model at the first observation is around 0.47, which looks correct. So I wonder whether this is just a case of me misinterpreting the above, or whether there is something more fundamentally wrong with it?
Secondly, the confint outcome is:
> confint(rrfit1a, level = 0.90)
Waiting for profiling to be done...
                                 5 %        95 %
(Intercept)              -0.38990085  0.19896064
Quarter                  -0.03437363  0.02981353
int2                     -0.31682909  1.09669529
time_since_intervention2 -0.16144941  0.11569710
This isn't what we'd expect to see, and it doesn't look anything like our plotted confidence intervals.

Calculating mean of normal distribution

How can I calculate the mean of a normal distribution knowing the sd, a percentile, and its value?
I've got a question where:
sd = 100
the 18th percentile's value is 1200
I standardized the distribution, converting it to a Z score and using the phi function and a Z table,
then tried to calculate via P(Z > ((1200 - mean)/100)) = 0.18.
I got mean = 1142.858, but it is the wrong answer.
What did I do wrong?
There are two different solutions to this question, depending on how the percentile is interpreted.
If we assume the data are arranged in ascending order and take 1200 as the value below which 18% of the data lie, the appropriate probability statement is P(Z < (1200 - mean)/100) = 0.18. Applying the InvNorm function (the inverse normal cumulative distribution function), the z-score corresponding to a cumulative probability of 0.18 is -0.915, which makes the equation:
P(Z < -0.915) = 0.18 -> -0.915 = (1200 - mean)/100 -> mean = 1291.5
If instead we take 1200 as the value above which 18% of the data lie, the appropriate probability statement is P(Z > (1200 - mean)/100) = 0.18. The corresponding z-score is 0.915, which makes the equation:
P(Z > 0.915) = 0.18 -> 0.915 = (1200 - mean)/100 -> mean = 1108.5
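Both cases are easy to verify numerically, for example with SciPy (assuming it is available):

from scipy.stats import norm

sd, x = 100.0, 1200.0
print(x - sd * norm.ppf(0.18))   # ≈ 1291.5 (18% of the data lie below 1200)
print(x - sd * norm.ppf(0.82))   # ≈ 1108.5 (18% of the data lie above 1200)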

Select an integer number of periods

Suppose we have a sinusoid with a frequency of 100 Hz and a sampling frequency of 1000 Hz. This means our signal has 100 periods per second and we are taking 1000 samples per second. Therefore, in order to select one complete period, I'll have to take fs/f = 10 samples. Right?
What if the sampling frequency is not a multiple of the signal frequency (say, 550 Hz)? Do I have to find the least common multiple M of f and fs, and then take M samples?
My goal is to select an integer number of periods in order to be able to replicate them without changes.
You have f periods a second, and fs samples a second.
If you take M samples, they cover M/fs of a second, that is P = f * (M/fs) periods. You want this number to be an integer.
So you need to take M = fs / gcd(f, fs) samples.
For your example, M = 1000 / gcd(100, 1000) = 1000 / 100 = 10 samples, which cover exactly one period.
If you have a 60 Hz frequency and an 80 Hz sampling frequency, it gives M = 80 / gcd(60, 80) = 80 / 20 = 4: those 4 samples cover 4 * 1/80 = 1/20 of a second, which is exactly 3 periods.
If you have a 113 Hz frequency and a 512 Hz sampling frequency, you are out of luck, since gcd(113, 512) = 1: you'll need 512 samples, covering a whole second and 113 periods.
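A small Python helper (a sketch of the arithmetic above) makes these cases easy to check:

from math import gcd

def samples_for_whole_periods(f, fs):
    # Smallest number of samples covering an integer number of periods
    m = fs // gcd(f, fs)     # samples needed
    periods = f * m // fs    # whole periods those samples cover
    return m, periods

print(samples_for_whole_periods(100, 1000))  # (10, 1)
print(samples_for_whole_periods(60, 80))     # (4, 3)
print(samples_for_whole_periods(113, 512))   # (512, 113)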
In general, an arbitrary frequency will not yield an integer number of periods, and irrational frequency ratios never repeat at all. So some means other than concatenating buffers one period in length will be needed to synthesize exactly periodic waveforms of arbitrary frequency. Approximation by interpolation for fractional phase offsets is one possibility.

vowpalwabbit strange features count

I have found that while training my model, vw reports in its log a much bigger feature count than my actual number of features.
I have tried to reproduce it using some small example:
simple.test:
-1 | 1 2 3
1 | 3 4 5
then "vw simple.test" command says that it have used 8 features. +one feature is constant but what are the other ? And in my real exmaple difference between my features and features used in wv is abot x10 more.
....
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
using no cache
Reading datafile = t
num sources = 1
average since example example current current current
loss last counter weight label predict features
finished run
number of examples = 2
weighted example sum = 2
weighted label sum = 3
average loss = 1.9179
best constant = 1.5
total feature number = 8 !!!!
total feature number displays the sum of feature counts from all observed examples. So it's 2 * (3 + 1 constant) = 8 in your case. The number of features in the current example is displayed in the current features column. Note that only every 2^Nth example is printed on screen by default. In general, observations can have unequal numbers of features.
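A tiny sanity check of that accounting (hypothetical, just mirroring how vw counts):

# features per line of simple.test, before vw adds its constant feature
examples = [["1", "2", "3"], ["3", "4", "5"]]
total = sum(len(feats) + 1 for feats in examples)  # +1 constant per example
print(total)  # 8, matching "total feature number = 8" in the log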

How to apply % to negative numbers in Visual FoxPro

How does % work with negative numbers in VFP?
MOD(10,-3) = -2
MOD(-10,3) = 2
MOD(-10,-3) = -1
Why?
It is a regular modulo:
The mod function is defined as the amount by which a number exceeds
the largest integer multiple of the divisor that is not greater than
that number.
You can think of it like this:
10 % -3:
The relevant multiple of -3 is the one obtained by rounding 10/-3 down: FLOOR(10/-3) = -4, giving -3 * -4 = 12.
So 10 % -3 = 10 - 12 = -2.
-10 % 3:
Now, why is -10 % 3 equal to 2?
The easiest way to think about it is to add a multiple of 3 to the negative number so that it becomes positive.
-10 + (4*3) = 2, so -10 % 3 = (-10 + 12) % 3 = 2 % 3 = 2
Here's what we said about this in The Hacker's Guide to Visual FoxPro:
MOD() and % are pretty straightforward when dealing with positive numbers, but they get interesting when one or both of the numbers is negative. The key to understanding the results is the following equation:
MOD(x,y) = x - (y * FLOOR(x/y))
Since the mathematical modulo operation isn't defined for negative numbers, it's a pleasure to see that the FoxPro definitions are mathematically consistent. However, they may be different from what you'd initially expect, so you may want to check for negative divisors or dividends.
A little testing (and the manuals) tells us that a positive divisor gives a positive result while a negative divisor gives a negative result.
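Here is a small Python sketch of that equation (note that Python's own % operator happens to follow the same sign-of-divisor convention):

import math

def foxpro_mod(x, y):
    # MOD(x, y) = x - (y * FLOOR(x / y)), the definition quoted above
    return x - y * math.floor(x / y)

for x, y in [(10, -3), (-10, 3), (-10, -3)]:
    print("MOD(%d,%d) = %d" % (x, y, foxpro_mod(x, y)))   # -2, 2, -1
    assert foxpro_mod(x, y) == x % y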
