I made a survey where users could vote on a subject. They were allowed to either yay it (+1), nay it (-1), or say they don't care (0).
I only have the aggregate results in Google Sheets, like:
yay nay dontcare
Option A: 32 14 23
Option B: 12 37 20
Option C: 40 17 12
Option D: 64 3 2
The total number of votes is the same for every option.
Now I need to find out how controversial the answers are. I thought about STDEVP, but I do not have a list of cells, just the aggregates.
How do I find the standard deviation here with Google Sheets?
Assuming you ignore the don't-cares, you can just take the proportion p of yays and use sd = sqrt(p(1-p)),
so if the yays are in column B and the nays in column C, you use
=SQRT(B2/SUM(B2:C2) * (C2/SUM(B2:C2)))
Note that this is the standard deviation for a population.
If you want to include the don't-cares, you can calculate the mean in E2 with
=SUMPRODUCT(B2:D2, {1, -1, 0}) / SUM(B2:D2)
Then you can calculate variance like this in F2
=SUMPRODUCT(ArrayFormula({1, -1, 0}-E2)^2, B2:D2) / (SUM(B2:D2)-1)
which just takes every 1, -1, or 0, subtracts the mean, squares that deviation, and averages the squared deviations over n-1 (for a sample; leave the -1 out if you assume you have the whole population).
The standard deviation is then
=SQRT(F2)
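If it helps to sanity-check the spreadsheet formulas, here is the same arithmetic as a small Python sketch (purely illustrative; the counts are Option A from the table in the question, and the n vs n-1 choice mirrors the population/sample note above):

from math import sqrt

def controversy(yay, nay, dontcare, sample=True):
    # Standard deviation of votes coded +1 / -1 / 0, from the aggregate counts
    counts = {1: yay, -1: nay, 0: dontcare}
    n = sum(counts.values())
    mean = sum(v * c for v, c in counts.items()) / n
    ss = sum(c * (v - mean) ** 2 for v, c in counts.items())
    return sqrt(ss / ((n - 1) if sample else n))

# Option A: 32 yays, 14 nays, 23 don't-cares
print(controversy(32, 14, 23))                # sample SD, as in =SQRT(F2)
print(controversy(32, 14, 23, sample=False))  # population SD

# Ignoring the don't-cares, as in the sqrt(p*(1-p)) formula above
p = 32 / (32 + 14)
print(sqrt(p * (1 - p)))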
I am trying to conduct a repeated-measures mixed-effects test with lmer and lmerTest, but I am not sure if I am doing it appropriately.
I have 6 sites with 3 plots per site that have been sampled once per year for 24 consecutive years. I have several environmental and species variables, but for simplicity, let's say I have two environmental variables (depth and temperature) and two species (species 1 and species 2). I am not interested in the time variable, changes with time, or the interactions, as this system has strong wet/dry seasonality where the effects of the dry season outweigh carry over effects of species from the prior year. I do not necessarily have data for all variables and plots every year, with some plots not sampled at times.
The question is whether species2 (a predator) has any effect on populations of species1, relative to the environmental variables.
Is it appropriate to include year as its own random effect in the model, along with plot within site?
model1 <- lmer(species1 ~ depth + temperature + species2 + (1|year) + (1|site/plot), data=data)
For this particular analysis there were 435 total observations (plot/year), but I worry that the model is not appropriately handling the repeated measures.
anova(model1)
Type III Analysis of Variance Table with Satterthwaite's method
Sum Sq Mean Sq NumDF DenDF F value Pr(>F)
depth 0.0221 0.0221 1 145.75 0.0908 0.7635
temperature 9.0213 9.0213 1 422.19 37.0429 2.596e-09 ***
species2 0.0597 0.0597 1 418.95 0.2450 0.6208
This does not seem right. Is there a better way to incorporate year, or should I include year at all?
If I exclude year, why does the DenDF for depth change so drastically?
model1 <- lmer(species1 ~ depth + temperature + species2 + (1|year) + (1|site/plot), data=data)
Type III Analysis of Variance Table with Satterthwaite's method
Sum Sq Mean Sq NumDF DenDF F value Pr(>F)
depth 2.599 2.599 1 431.77 7.1096 0.007955 **
temperature 58.788 58.788 1 432.10 160.7955 < 2.2e-16 ***
species2 0.853 0.853 1 429.62 2.3336 0.127343
summary(M1)
Linear mixed model fit by maximum likelihood . t-tests use Satterthwaite's method ['lmerModLmerTest']
Formula: species1 ~ depth + temperature + species2 + (1 | site/plot)
Data: data
AIC BIC logLik deviance df.resid
833.4 861.9 -409.7 819.4 428
Scaled residuals:
Min 1Q Median 3Q Max
-2.20675 -0.66119 -0.07051 0.52722 2.99942
Random effects:
Groups Name Variance Std.Dev.
plot:site (Intercept) 0.0003221 0.01795
site (Intercept) 0.2051143 0.45290
Residual 0.3656072 0.60465
Number of obs: 435, groups: plot:site, 24; site, 6
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) -0.538258 0.325072 50.071940 -1.656 0.10401
depth 0.006338 0.002377 431.768539 2.666 0.00796 **
temperature 0.391023 0.030837 432.101095 12.681 < 2e-16 ***
species2 -0.353264 0.231252 429.615226 -1.528 0.12734
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation of Fixed Effects:
(Intr) depth temp
depth -0.316
temperature -0.467 -0.204
specie2 -0.544 0.040 0.007
I may have asked more questions than I answered, but I hope some of this is helpful.
"The question is whether species2 (a predator) has any effect on populations of species1, relative to the environmental variables."
I think when you word it this way, it is not entirely clear. Are you interested in the effect that species2 has on species1 depending on what the environmental variables are (in other words, can the effect of species2 on species1 change depending on depth or temperature)? Or do you mean you would like to compare the effect of species2 on species1 to the effects of depth or temperature on species1? Or what do you mean, exactly, by "relative to the environmental variables"?
Yes, (1|year) + (1|site/plot) is a random intercept for both year and for plot within site. If you wanted a variable to be able to vary over each group (i.e. have a random slope) you would do something like (Temperature|year) + (1|site/plot) if you thought the effect of temperature on species1 might be different in different years.
Exactly how you specify the model is going to be based on your knowledge of the biological system and your knowledge of statistics. Based on the information in your question, the random effects formulation you have suggested appears completely reasonable to me. Yes, this allows you to account for grouped data (grouped by each year and by each plot within site). With only 435 observations it is possible you will run into convergence issues with an overly complex model - just something to look out for.
I am not sure what you mean by "this does not seem right" - what are you expecting to see? What is missing?
I am seeing the same model twice (below) with different output values. Is there a copy-and-paste error here, or am I missing something? The values shouldn't differ with the same model structure.
model1 <- lmer(species1 ~ depth + temperature + species2 + (1|year) + (1|site/plot), data=data)
You haven't removed year in the above line, but have below this in the summary(M1) call.
My simple answer about the year question would be yes, I would include year. Every year is so different in any biological dataset I have seen that it is worth including, at least as a random intercept - exactly as you have done. If the variance of the year random effect is estimated to be zero, then the term behaves as if it weren't there in the first place. At that point you can choose to fit year as a fixed effect instead if you would still like to account for the grouped nature of the data.
Also, there are lots of resources on this. Some examples:
Bolker, Benjamin M., Mollie E. Brooks, Connie J. Clark, Shane W. Geange, John R. Poulsen, M. Henry H. Stevens, and Jada-Simone S. White. "Generalized linear mixed models: a practical guide for ecology and evolution." Trends in ecology & evolution 24, no. 3 (2009): 127-135.
Harrison, Xavier A., Lynda Donaldson, Maria Eugenia Correa-Cano, Julian Evans, David N. Fisher, Cecily E. D. Goodwin, Beth S. Robinson, David J. Hodgson, and Richard Inger. "A brief introduction to mixed effects modelling and multi-model inference in ecology." PeerJ 6 (2018): e4794. https://peerj.com/articles/4794/
Given an input signal x (e.g. a voltage, sampled a thousand times per second for a couple of minutes), I'd like to calculate e.g.
/ this is not q
y[3] = -3*x[0] - x[1] + x[2] + 3*x[3]
y[4] = -3*x[1] - x[2] + x[3] + 3*x[4]
. . .
I'm aiming for a variable window length and weight coefficients. How can I do it in q? I'm aware of mavg, signal processing in q, and the moving-sum q idiom.
In the DSP world this is called applying a filter kernel by doing convolution. The weight coefficients define the kernel, which makes a high- or low-pass filter. The example above calculates the slope of the last four points, fitting a straight line by the least squares method.
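For reference, the same computation in Python/numpy terms (just to make the expected output concrete; numpy is not part of the q question):

import numpy as np

def weighted_window(x, weights):
    # out[i] = sum_j weights[j] * x[i+j], i.e. out[i] is y[i + len(weights) - 1]
    # in the indexing above (correlation = convolution with the reversed kernel)
    return np.correlate(np.asarray(x, float), np.asarray(weights, float), mode="valid")

rng = np.random.default_rng(0)
x = 10 + np.cumsum(rng.uniform(-1.0, 1.0, 1000))  # synthetic random-walk signal for illustration
print(weighted_window(x, [-3, -1, 1, 3])[:10])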
Something like this would work for parameterisable coefficients:
q)x:10+sums -1+1000?2f
q)f:{sum x*til[count x]xprev\:y}
q)f[3 1 -1 -3] x
0n 0n 0n -2.385585 1.423811 2.771659 2.065391 -0.951051 -1.323334 -0.8614857 ..
Specific cases can be made a bit faster (running 0 xprev is not the best thing)
q)g:{prev[deltas x]+3*x-3 xprev x}
q)g[x]~f[3 1 -1 -3]x
1b
q)\t:100000 f[3 1 -1 -3] x
4612
q)\t:100000 g x
1791
There's a kx white paper of signal processing in q if this area interests you: https://code.kx.com/q/wp/signal-processing/
This may be a bit old but I thought I'd weigh in. There is a paper I wrote last year on signal processing that may be of some value. Working purely within KDB, depending on the signal sizes you are using, you will see much better performance with an FFT-based convolution between the kernel/window and the signal.
However, I've only written up a simple radix-2 FFT, although in my github repo I do have the untested work for a more flexible Bluestein algorithm which will allow for more variable signal length. https://github.com/callumjbiggs/q-signals/blob/master/signal.q
If you wish to go down the path of performing a full manual convolution by a moving sum, then the best method would be to break it up into blocks equal to the kernel/window size (which was based on some work Arthur W did many years ago)
q)vec:10000?100.0
q)weights:30?1.0
q)wsize:count weights
q)(weights$(((wsize-1)#0.0),vec)til[wsize]+) each til count vec
32.5931 75.54583 100.4159 124.0514 105.3138 117.532 179.2236 200.5387 232.168.
If your input list is not big, then you could use the technique mentioned here:
https://code.kx.com/q/cookbook/programming-idioms/#how-do-i-apply-a-function-to-a-sequence-sliding-window
That uses the 'scan' adverb. That process creates multiple lists, which might be inefficient for big lists.
Another solution using scan is:
q)f:{sum y*next\[z;x]} / x-input list, y-weights, z-window size-1
q)f[x;-3 -1 1 3;3]
This function also creates multiple lists so again might not be very efficient for big lists.
Another option is to use indices to fetch the target items from the input list and perform the calculation. This operates only on the input list.
q) f:{[l;w;i]sum w*l i+til 4} / w- weight, l- input list, i-current index
q) f[x;-3 -1 1 3] each til count x
This is a very basic function. You can add more variables to it as per your requirements.
I am trying to compute the similarity between n entities that are being described by entity_id, type_of_order, total_value.
An example of the data might look like:
NR entity_id type_of_order total_value
1 1 A 10
2 1 B 90
3 1 C 70
4 2 B 20
5 2 C 40
6 3 A 10
7 3 B 50
8 3 C 20
9 4 B 50
10 4 C 80
My question would be: what is a good way of measuring the similarity between, for example, entity_id 1 and 2 with regard to the type_of_order and the total_value for that type of order?
Would a simple KNN give satisfactory results or should I consider other algorithms?
Any suggestion would be much appreciated.
The similarity metric is a heuristic to capture a relationship between two data rows, with respect to the data semantics and the purpose of the training. We don't know your data; we don't know your usage. It would be irresponsible to suggest metrics to solve a problem when we have no idea what problem we're solving.
You have to address this question to the person you find in the mirror. You've given us three features with no idea of what they mean or how they relate. You need to quantify ...
relative distances within features: under type_of_order, what is the relationship (distance) between any two measurements? If we arbitrarily assign d(A, B) = 1, then what is d(B, C)? We have no information to help you construct this. Further, if we give that some value c, then what is d(A, C)? In various popular metrics, it could be 1+c, |1-c|, all distances could be 1, or perhaps it's something else -- even more than 1+c in some applications.
Even in the last column, we cannot assume that d(10, 20) = d(40, 50); the actual difference could be a ratio, difference of squares, etc. Again, this depends on the semantics behind these labels.
relative weights between features: How do the differences in the various columns combine to provide a similarity? For instance, how does d([A, 10], [B, 20]) compare to d([A, 10], [C, 30])? That's two letters in the left column, two steps of 10 in the right column. How about d([A, 10], [A, 20]) vs d([A, 10], [B, 10])? Are the distances linear, or do the relationships change as we slide up the alphabet or to higher numbers?
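To make those choices concrete, here is a minimal Python sketch of one possible hand-built encoding (every step marked as a choice is an assumption to be justified by your data semantics, not a recommendation):

import numpy as np

# entity_id -> {type_of_order: total_value}, from the example data
orders = {
    1: {"A": 10, "B": 90, "C": 70},
    2: {"B": 20, "C": 40},
    3: {"A": 10, "B": 50, "C": 20},
    4: {"B": 50, "C": 80},
}
types = sorted({t for vals in orders.values() for t in vals})          # choice: fixed order-type columns
matrix = np.array([[orders[e].get(t, 0.0) for t in types]              # choice: missing type -> 0
                   for e in sorted(orders)])
scaled = (matrix - matrix.min(axis=0)) / (matrix.max(axis=0) - matrix.min(axis=0))  # choice: min-max scaling
weights = np.array([1.0, 1.0, 1.0])                                    # choice: equal feature weights

def distance(i, j):
    # choice: weighted Euclidean distance between the scaled rows
    return np.sqrt(np.sum(weights * (scaled[i] - scaled[j]) ** 2))

print(distance(0, 1))  # entity 1 vs entity 2

Whether any of these choices are sensible depends entirely on what the order types and values mean, which is the point made above.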
This has become quite a frustrating question, but I've asked in the Coursera discussions and they won't help. Below is the question (it refers to a table of midterm and final-exam scores, with features x_1 = midterm score and x_2 = (midterm score)^2):
I've gotten it wrong 6 times now. How do I normalize the feature? Hints are all I'm asking for.
I'm assuming x_2^(2) is the value 5184, unless I am supposed to add the x_0 column of 1's, which they don't mention but he certainly mentions in the lectures when talking about creating the design matrix X - in which case x_2^(2) would be the value 72. Assuming one or the other is right (I'm playing a guessing game), what should I use to normalize it? He talks about three different ways to normalize in the lectures: one using the maximum value, another using the range (the difference between max and min), and another using the standard deviation - and they want an answer correct to the hundredths. Which one am I to use? This is so confusing.
...use both feature scaling (dividing by the
"max-min", or range, of a feature) and mean normalization.
So for any individual feature f:
f_norm = (f - f_mean) / (f_max - f_min)
e.g. for x2 = (midterm exam)^2 = {7921, 5184, 8836, 4761}
> x2 <- c(7921, 5184, 8836, 4761)
> mean(x2)
6676
> max(x2) - min(x2)
4075
> (x2 - mean(x2)) / (max(x2) - min(x2))
0.306 -0.366 0.530 -0.470
Hence norm(5184) = -0.366
(using R language, which is great at vectorizing expressions like this)
I agree it's confusing they used the notation x2 (2) to mean x2 (norm) or x2'
EDIT: in practice everyone calls the builtin scale(...) function, which does the same thing.
It's asking to normalize the second feature under second column using both feature scaling and mean normalization. Therefore,
(5184 - 6675.5) / 4075 = -0.366
Usually we normalize all of them to have zero mean and lie roughly between [-1, 1].
You can do that easily by dividing by the maximum absolute value and then subtracting the mean of the (scaled) samples.
"I'm assuming x_2^(2) is the value 5184" is this because it's the second item in the list and using the subscript _2? x_2 is just a variable identity in maths, it applies to all rows in the list. Note that the highest raw mid-term exam result (i.e. that which is not squared) goes down on the final test and the lowest raw mid-term result increases the most for the final exam result. Theta is a fixed value, a coefficient, so somewhere your normalisation of x_1 and x_2 values must become (EDIT: not negative, less than 1) in order to allow for this behaviour. That should hopefully give you a starting basis, by identifying where the pivot point is.
I had the same problem. In my case the issue was that, for the average, I was using the maximum x2 value (8836) minus the minimum x2 value (4761), divided by two, instead of the sum of the x2 values divided by the number of examples.
For the same training set, I got the question as
Q. What is the normalized feature x^(3)_1?
Thus, the 3rd training example and 1st feature works out to 94 in the table above.
Now, normalized form is
x = (x - mean(x's)) / range(x)
Values are :
x = 94
mean = (89 + 72 + 94 + 69) / 4 = 81
range = 94 - 69 = 25
Normalized x = (94 - 81) / 25 = 0.52
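If you want to double-check the arithmetic for both sub-questions, here is a short Python sketch (the midterm scores and their squares are the values quoted in the answers above):

midterm = [89, 72, 94, 69]               # raw midterm scores (x_1)
midterm_sq = [x ** 2 for x in midterm]   # squared scores (x_2): 7921, 5184, 8836, 4761

def normalize(values, x):
    mean = sum(values) / len(values)
    rng = max(values) - min(values)      # feature scaling by the range
    return (x - mean) / rng              # mean normalization

print(round(normalize(midterm, 94), 3))       # x_1^(3): (94 - 81) / 25 = 0.52
print(round(normalize(midterm_sq, 5184), 3))  # x_2^(2): (5184 - 6675.5) / 4075 ≈ -0.366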
I'm taking this course at the moment, and a really trivial mistake I made the first time I answered this question was using a comma instead of a dot in the answer, since I did it by hand and in my country we use a comma to denote decimals (e.g. 0,52 instead of 0.52).
The second time I tried, I used a dot and it worked fine.
I want to find the standard deviation:
Minimum = 5
Mean = 24
Maximum = 84
Overall score = 90
I just want to find out my grade by using the standard deviation
Thanks,
A standard deviation cannot in general be computed from just the min, max, and mean. This can be demonstrated with two sets of scores that have the same min, max, and mean but different (population) standard deviations:
1 2 4 5 : min=1 max=5 mean=3 stdev≈1.5811
1 3 3 5 : min=1 max=5 mean=3 stdev≈1.4142
Also, what does an 'overall score' of 90 mean if the maximum is 84?
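You can verify this with a couple of lines of Python (population standard deviation, matching the figures above):

from statistics import pstdev

for scores in ([1, 2, 4, 5], [1, 3, 3, 5]):
    print(min(scores), max(scores), sum(scores) / len(scores), round(pstdev(scores), 4))
# both print min=1, max=5, mean=3.0, but the standard deviations are 1.5811 and 1.4142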
I actually did a quick-and-dirty calculation of the type M Rad mentions. It involves assuming that the distribution is Gaussian or "normal." This does not apply to your situation but might help others asking the same question. (You can tell your distribution is not normal because the distances from the mean to the max and from the mean to the min are not close to equal.) Even if it were normal, you would need something you don't mention: the number of samples (the number of tests taken, in your case).
Those readers who DO have a normal population can use the table below to get a rough estimate: divide the difference between your calculated mean and your measured minimum by the expected distance for your sample size. On average the resulting estimate will be off by the given relative error. (I have no idea whether it is biased - change the code below to calculate the error without the abs to get a guess.)
Num Samples Expected distance Expected error
10 1.55 0.25
20 1.88 0.20
30 2.05 0.18
40 2.16 0.17
50 2.26 0.15
60 2.33 0.15
70 2.38 0.14
80 2.43 0.14
90 2.47 0.13
100 2.52 0.13
This experiment shows that the "rule of thumb" of dividing the range by 4 to get the standard deviation is in general incorrect -- even for normal populations. In my experiment it only holds for sample sizes between 20 and 40 (and then loosely). This rule may have been what the OP was thinking about.
You can modify the following Python code to generate the table for different values (change max_sample_size), for more accuracy (change num_simulations), or to get rid of the limitation to multiples of 10 (change the parameters to range in the final for loop).
#!/usr/bin/env python3
import random

# Return the distance of the minimum of samples from its mean.
# samples must have at least one entry.
def min_dist_from_estd_mean(samples):
    total = 0
    sample_min = samples[0]
    for sample in samples:
        total += sample
        sample_min = min(sample, sample_min)
    estd_mean = total / len(samples)
    return estd_mean - sample_min  # positive, because the min cannot exceed the mean

num_simulations = 4095
max_sample_size = 100

# Calculate the expected distance (mean minus minimum) for each sample size
sum_of_dists = [0] * (max_sample_size + 1)  # +1 so we can index by sample size
for iternum in range(num_simulations):
    samples = [random.normalvariate(0, 1)]
    while len(samples) <= max_sample_size:
        sum_of_dists[len(samples)] += min_dist_from_estd_mean(samples)
        samples.append(random.normalvariate(0, 1))
expected_dist = [total / num_simulations for total in sum_of_dists]

# Calculate the average relative error when using that expected distance
sum_of_errors = [0] * len(sum_of_dists)
for iternum in range(num_simulations):
    samples = [random.normalvariate(0, 1)]
    while len(samples) <= max_sample_size:
        ave_dist = expected_dist[len(samples)]
        if ave_dist > 0:
            sum_of_errors[len(samples)] += \
                abs(1 - (min_dist_from_estd_mean(samples) / ave_dist))
        samples.append(random.normalvariate(0, 1))
expected_error = [total / num_simulations for total in sum_of_errors]

cols = " {0:>15}{1:>20}{2:>20}"
print(cols.format("Num Samples", "Expected distance", "Expected error"))
cols = " {0:>15}{1:>20.2f}{2:>20.2f}"
for idx in range(10, len(expected_dist), 10):
    print(cols.format(idx, expected_dist[idx], expected_error[idx]))
You can obtain an estimate of the geometric mean, sometimes called the geometric mean of the extremes or GME, from the Min and the Max by calculating GME = $\sqrt{Min \cdot Max}$. The SD can then be calculated from your arithmetic mean (AM) and the GME as:
$$SD = \frac{AM}{GME}\sqrt{AM^2 - GME^2}$$
This approach works well for log-normal distributions, or as long as the GME, GM, or median is smaller than the AM.
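As a rough illustration in Python, using the numbers from the question (and keeping in mind that this only makes sense if the distribution is roughly log-normal, which is an assumption here):

from math import sqrt

minimum, maximum, am = 5, 84, 24        # values from the question
gme = sqrt(minimum * maximum)           # geometric mean of the extremes, about 20.5
sd_estimate = (am / gme) * sqrt(am ** 2 - gme ** 2)
print(gme, sd_estimate)                 # roughly 20.5 and 14.6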
In principle you can make an estimate of the standard deviation from the mean/min/max and the number of elements in the sample. The min and max of a sample are, if you assume normality, random variables whose distributions follow from the mean, stddev, and number of samples. So, going in the other direction, one can compute (after slogging through the math or running a bunch of Monte Carlo scripts) a confidence interval for the stddev given the observed min, max, mean, and sample size (e.g. it is 80% probable that the stddev is between 20 and 40, or something like that).
That said, it probably isn't worth doing except in extreme situations.