Why is my p-value too low with > 1,000,000 observations (p << 0.01)?

I know that generally a low p-value is good, since I want to reject the null hypothesis (H0). But my problem is an odd one, and I would appreciate any help or insight you may give me.
I work with huge data sets (n > 1,000,000), each representing data for one year. I am required to analyse the data and find out whether the mean of one year is significantly different from the mean of the previous year. Yet everyone would prefer the result to be non-significant rather than significant.
By "significant" I mean that I want to be able to tell my boss, "look, these non-significant changes are noise, while these significant changes represent something real to consider."
The problem is that simply comparing the two averages with a t-test always yields a significant difference, even when the difference is very, very small (probably due to the huge sample size) and well within the range we would consider acceptable in practice. So, the way I see it, a p-value does not serve my needs well.
What do you think I should do?

There is nothing wrong with the p-value. With this many observations, even slight effects will be flagged as significant. You have rightly noted that the effect size for such a sample is very small. That essentially nullifies any argument for using the p-value alone to judge "significance": while the effect can be determined not to be due to chance, its actual usefulness in the real world is likely low, since it carries little predictive value.
For a comprehensive treatment of this subject, see Jacob Cohen's often-cited book on power analysis. You can also check out my recent post on Cross Validated about two regression models with significant p-values for their predictors but radically different predictive power.
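To make this concrete, here is a minimal Python sketch with simulated data (the means, spread and sample size are made up, not taken from the question): two "years" whose true means differ by a practically irrelevant amount still produce a tiny p-value, while Cohen's d shows the effect is negligible.

# Simulated illustration only; all numbers are assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
year1 = rng.normal(loc=100.00, scale=15, size=1_000_000)
year2 = rng.normal(loc=100.15, scale=15, size=1_000_000)  # tiny real difference

t_stat, p_value = stats.ttest_ind(year1, year2)
pooled_sd = np.sqrt((year1.var(ddof=1) + year2.var(ddof=1)) / 2)
cohens_d = (year2.mean() - year1.mean()) / pooled_sd

print(f"p = {p_value:.2e}")           # far below 0.01, i.e. "significant"
print(f"Cohen's d = {cohens_d:.4f}")  # about 0.01: a negligible effect size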

Related

adjusted fitness in NEAT algorithm

I'm learning about NEAT from the following paper: http://nn.cs.utexas.edu/downloads/papers/stanley.ec02.pdf
I'm having trouble understanding how adjusted fitness penalizes large species and prevents them from dominating the population. I'll demonstrate my current understanding through an example, and hopefully someone will correct me.
Let's say we have two species, A and B. Species A did really well last generation and was given more children, so this generation it has 4 members with fitnesses [8, 10, 10, 12], while B has 2 members with fitnesses [9, 9]. Their adjusted fitnesses are then A: [2, 2.5, 2.5, 3] and B: [4.5, 4.5].
Now, onto distributing children. The paper states: "Every species is assigned a potentially different number of offspring in proportion to the sum of adjusted fitnesses f'_i of its member organisms"
So the sum of adjusted fitnesses is 10 for A and 9 for B, thus A gets more children and keeps growing. How does this process penalize large species and prevent them from dominating the population?
Great question! I completely agree that this paper (specifically the part you quoted) says that the offspring are assigned based on the sum of adjusted fitnesses within a species. Since adjusted fitness is calculated by dividing fitness by the number of members of a species, this would be mathematically equivalent to assigning offspring based on the average fitness of each species (as in your example). As you say, that, in and of itself, should not have the effect of curtailing the growth of large species.
Unless I'm missing something, there is not enough information in the paper to determine whether A) There are additional implementation details not mentioned in the paper that cause this selection scheme to have the stated effect, B) This is a mistake in the writing of the paper, or C) This is how the algorithm was actually implemented and speciation wasn't helpful for the reasons the authors thought it was.
Regarding option A: Immediately after the line you quoted, the paper says "Species then reproduce by first eliminating the lowest performing members from the population. The entire population is then replaced by the offspring of the remaining organisms in each species." This could be implemented such that each species primarily replaces its own weakest organisms, which would make competition primarily occur within species. This is a technique called crowding (introduced in the Mahfoud, 1995 paper that this paper cites) and it can have similar effects to fitness sharing, especially if it were combined with certain other implementation decisions. However, it would be super weird for them to have done this, not mentioned it, and then said they were using fitness sharing rather than crowding. So I think this explanation is unlikely.
Regarding option B: Most computer science journal papers, like this one, are based on groups of conference papers where the work was originally presented. The conference paper where most of the speciation research on NEAT was presented is here: https://pdfs.semanticscholar.org/78cc/6d52865d2eab817aaa3efd04fd8f46ca8b61.pdf. In the explanation of fitness sharing, that paper says: "Species then grow or shrink depending on whether their average adjusted fitness is above or below the population average" (emphasis mine). This is different from the sum of adjusted fitness referred to in the paper you linked to. If they were actually using the average (and mistakenly said sum), they'd effectively be dividing by the number of members of each species twice, which would make all of the other claims accurate, and make the data make sense.
Regarding option C: This one seems unlikely, since Figure 7 makes it look like there's definitely stable coexistence for longer than you'd expect without some sort of negative frequency dependence. Also, they clearly put a lot of effort into dissecting the effect of speciation, so I wouldn't expect them to miss something like that. Especially in such an impactful paper that so many people have built on.
So, on the whole, I'd say my money is on explanation B - that this is a one-word mistake that changes the meaning substantially. But it's hard to know for sure.
The resolution is simple: the population size is constant. All your calculations are correct, but your population size is 6, and 10:9 is roughly even, which results in 3 offspring for A and 3 for B. So species A is actually shrinking, while species B is growing (as intended).
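To make the arithmetic concrete, here is a minimal Python sketch (my own illustration, not the NEAT reference code) of this allocation, using the example species from the question and a constant population size of 6:

# Illustration only; not taken from the NEAT source code.
raw_fitness = {"A": [8, 10, 10, 12], "B": [9, 9]}
population_size = 6  # total population is held constant

# Explicit fitness sharing: divide each member's fitness by its species size.
adjusted = {s: [f / len(members) for f in members] for s, members in raw_fitness.items()}
print(adjusted)  # {'A': [2.0, 2.5, 2.5, 3.0], 'B': [4.5, 4.5]}

# Offspring proportional to the sum of adjusted fitnesses (i.e. the species'
# mean raw fitness), normalised so the population size stays constant.
sums = {s: sum(v) for s, v in adjusted.items()}              # {'A': 10.0, 'B': 9.0}
total = sum(sums.values())
offspring = {s: round(population_size * v / total) for s, v in sums.items()}
print(offspring)  # {'A': 3, 'B': 3} -> A shrinks from 4 to 3, B grows from 2 to 3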

SPSS two way repeated measures ANOVA

I am fairly new to statistics.
I ran an experiment and used a two-way ANOVA with repeated measures. The calculation was done in SPSS. In most papers I have seen, the F-value and the degrees of freedom were reported as well. Is it normal to report those values too? If so, which values do I take from the SPSS output?
How do I interpret these values? What do they mean?
When does the F-value support a significant result, and when does it not?
What are good values for the F-value and the degrees of freedom?
In some articles I also read about critical F-values; how do I get this value?
Most articles describe how to calculate those values but do not explain their meaning for the experiment.
Some clarification on these issues would be greatly appreciated.
My English is not very good, but I will try to answer your question.
The main purpose of ANOVA is to obtain statistical evidence about whether the measured groups have the same mean or not. So we set up a null hypothesis and an alternative hypothesis, then apply a test statistic to the data. You can use ANOVA if the groups have the same variance (the squared standard deviation).
You need to test this. It is a hypothesis test too: the null hypothesis is that the groups have the same variance, and the alternative hypothesis is that they don't.
You make the decision from the Sig. value: if the value is higher than 0.05, we usually accept the null hypothesis. If the variances are equal, we can use ANOVA. (I assume the data follow a normal distribution.) The null hypothesis is that the groups have equal means; the alternative hypothesis is that at least one group has a different mean. You make your decision from the Sig. value, as I said before: if the value is higher than 0.05, we accept the null hypothesis. The critical F-value is not important if you are calculating on a computer. You can build an acceptance interval from the lower and upper critical F-values, and if the F-value falls inside the interval you accept the null hypothesis, but I only used this method in statistics class. You don't need the F-value and the df in the report, because they don't explain anything on their own.
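For intuition about where the F-value, the degrees of freedom and the Sig. (p) value come from, here is a simplified one-way example in Python with made-up data (not a two-way repeated-measures ANOVA, and not SPSS output; just the same underlying logic):

# Made-up data, for illustration only.
from scipy import stats

group1 = [23.1, 25.4, 22.8, 24.9, 26.0]
group2 = [27.2, 26.8, 28.1, 25.9, 27.5]
group3 = [24.0, 23.5, 25.2, 24.8, 23.9]

# Levene's test: H0 = equal variances (the homogeneity check mentioned above).
lev_stat, lev_p = stats.levene(group1, group2, group3)
print("Levene p =", round(lev_p, 3))   # if p > 0.05, the equal-variance assumption is reasonable

# One-way ANOVA: H0 = all group means are equal.
f_stat, p_value = stats.f_oneway(group1, group2, group3)
df_between = 3 - 1        # number of groups - 1
df_within = 3 * 5 - 3     # total observations - number of groups
print(f"F({df_between}, {df_within}) = {f_stat:.2f}, p = {p_value:.4f}")

# The critical F-value for alpha = 0.05, if you want it:
f_crit = stats.f.ppf(0.95, df_between, df_within)
print("critical F at alpha 0.05:", round(f_crit, 2))   # reject H0 when F exceeds this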

Are heuristic functions that produce negative values inadmissible?

As far as I understand, admissibility for a heuristic means staying within the bounds of the actual cost to the goal for a given, evaluated node. I've had to design some heuristics for an A* search over state spaces and have seen a large efficiency gain using a heuristic that may sometimes return negative values, thereby making certain nodes that are more 'closely formed' to the goal state rank higher in the frontier.
However, I worry that this is inadmissible, but I can't find enough information online to verify it. I did find one paper from the University of Texas that seems to mention, in one of the later proofs, that "...since heuristic functions are nonnegative". Can anyone confirm this? I assume it is because a negative heuristic value can pull a node's f-cost below its g-cost (and therefore interfere with the 'default' Dijkstra-esque behavior of A*).
Conclusion: Heuristic functions that produce negative values are not inadmissible, per se, but have the potential to break the guarantees of A*.
Interesting question. Fundamentally, the only requirement for admissibility is that a heuristic never over-estimates the distance to the goal. This is important, because an overestimate in the wrong place could artificially make the best path look worse than another path, and prevent it from ever being explored. Thus a heuristic that can provide overestimates loses any guarantee of optimality. Underestimating does not carry the same costs. If you underestimate the cost of going in a certain direction, eventually the edge weights will add up to be greater than the cost of going in a different direction, so you'll explore that direction too. The only problem is loss of efficiency.
If all of your edges have positive costs, a negative heuristic value can only ever be an underestimate. In theory, a looser underestimate should only ever be worse than a more precise estimate, because it provides strictly less information about the potential cost of a path and is likely to result in more nodes being expanded. Nevertheless, it is not inadmissible.
However, here is an example that demonstrates that it is theoretically possible for negative heuristic values to break the guaranteed optimality of A*:
In this graph, it is obviously better to go through nodes A and B: that path has a cost of three, as opposed to six, the cost of going through nodes C and D. However, the negative heuristic values for C and D will cause A* to reach the end through them before exploring nodes A and B. In essence, the heuristic function keeps suggesting that this path is going to get drastically better, until it is too late. In most implementations of A*, this will return the wrong answer, although you can correct for the problem by continuing to explore other nodes until every remaining f(n) in the frontier is greater than the cost of the path you found. Note that there is nothing inadmissible or inconsistent about this heuristic. I'm actually really surprised that non-negativity is not more frequently mentioned as a rule for A* heuristics.
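The original figure isn't reproduced here, so the Python sketch below uses my own guess at a graph with the properties described: a cost-3 path S-A-B-G, a cost-6 path S-C-D-G, and negative heuristic values on C and D (all specific numbers are assumptions). The A* variant sketched stops as soon as the goal is generated, which is the kind of common simplification in which this failure actually shows up:

# Graph and heuristic values are assumed for illustration.
import heapq

graph = {                      # node -> list of (neighbor, edge cost)
    "S": [("A", 1), ("C", 2)],
    "A": [("B", 1)],
    "B": [("G", 1)],
    "C": [("D", 2)],
    "D": [("G", 2)],
    "G": [],
}
h = {"S": 0, "A": 2, "B": 1, "C": -5, "D": -5, "G": 0}

def a_star_stop_on_generation(start, goal):
    frontier = [(h[start], start, 0, [start])]   # (f, node, g, path)
    expanded = set()
    while frontier:
        f, node, g, path = heapq.heappop(frontier)
        if node in expanded:
            continue
        expanded.add(node)
        for nbr, cost in graph[node]:
            if nbr == goal:                      # goal test at generation time
                return g + cost, path + [nbr]
            heapq.heappush(frontier, (g + cost + h[nbr], nbr, g + cost, path + [nbr]))
    return None

print(a_star_stop_on_generation("S", "G"))
# -> (6, ['S', 'C', 'D', 'G']): the negative values on C and D pull their
# f-values below A's, so the expensive path reaches the goal first. Testing
# the goal only when it is popped, or continuing until every remaining f
# exceeds the best cost found, recovers the optimal cost-3 path.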
Of course, all that this demonstrates is that you can't freely use heuristics that return negative values without fear of consequences. It is entirely possible that a given heuristic for a given problem would happen to work out really well despite being negative. For your particular problem, it's unlikely that something like this is happening (and I find it really interesting that it works so well for your problem, and still want to think more about why that might be).

Genetic Algorithm, large population vs small one

I'm wondering if there is a general rule of thumb for population sizing. I've read in a book that 2x the chromosome length is a good starting point. Am I correct in assuming, then, that if I had an equation with 5 variables, I should have a population of 10?
I'm also wondering if the following is correct:
Larger Population Size.
Pros:
Larger diversity so more likely to pick up on traits which return a good fitness.
Cons:
Requires longer to process.
vs
Smaller Population Size.
Pros:
Larger number of generations experienced per unit time.
Cons:
Mutation will have to be more prominent in order to compensate for smaller population??
EDIT
A little additional info: say I have an equation which has 5 unknown parameters. For each parameter I have anywhere between 10 and 50 values I would like to try assigning. So, for example:
variable1 = 20 different values
variable2 = 15 different values
...
I thought a GA would be a decent approach to such a problem, as the search space is quite large; the worst case for the above would be 50 * 50 * 50 * 50 * 50 = 312,500,000 combinations (unless I have screwed up?).
Unfortunately the number of parameters and the range of values to check can vary a lot, so I was looking for some sort of rule of thumb as to how large I should set the population.
Thanks for your help, and if there is any more info you need or would prefer to discuss in one of the chatrooms, just give me a shout.
I'm not sure where you read that 2x the chromosome length is a good starting point, but I'm guessing it's a book that concentrated on larger problems.
If you only have five variables, a genetic algorithm is probably not the right choice for converging on a solution. With a chromosome length of five you're probably going to find that you very quickly reach a non-deterministic (it will change in subsequent runs) local minimum and then slowly iterate around that space until you find the true local minimum.
However, if you are insistent on using a GA, I would suggest abandoning that rule of thumb for this problem and instead thinking about the starting population as a measure of how far from the final solution you expect a random solution to be.
The reason many rules of thumb depend on chromosome length is that it's a decent proxy for this: if I have a hundred variables, a randomly generated DNA sequence is going to be further from ideal than if I had only one.
Additionally, if you're worried about computational cost, I'm going to go ahead and say that it shouldn't be an issue, since you're dealing with such a small solution set. I think a better rule of thumb for smaller sets like this would be along the lines of:
(ln(chromosome_length * (solution_space / granularity) / mutation_rate))^2
Probably with a constant thrown in to scale for the particular problem.
It's definitely not a great rule of thumb (no rule is) but here's my logic for it:
Chromosome length is just a proxy for size of solution space, so taking into account the size of the solution space will necessarily increase the accuracy of this proxy
A smaller mutation rate necessitates a larger population size to compensate for the fact that you are more prone to get caught in local minima
Any rule of thumb should scale logarithmically since a genetic algorithm is akin to a tree search of your solution space.
The squared term was mostly the result of trying this out, but it looks like the logarithmic scaling was a little aggressive, though the general shape seemed right.
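For what it's worth, here is a rough Python sketch of that rule of thumb applied to the question's problem. The formula doesn't pin down exactly what solution_space / granularity means, so I interpret it here as the number of distinct candidate solutions, and the 0.01 mutation rate is just an assumed value:

# Rough sketch; the interpretation of the formula's terms is my assumption.
import math

def rough_population_size(chromosome_length, candidate_solutions, mutation_rate):
    return math.log(chromosome_length * candidate_solutions / mutation_rate) ** 2

# The question's worst case: 5 variables with up to 50 values each.
candidates = 50 ** 5                                         # 312,500,000 combinations
print(round(rough_population_size(5, candidates, 0.01)))     # ~664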
However I think a better choice would be to start at a reasonable number (100) and try iterating up and down until you find a population size that seems to balance accuracy with execution speed.
As with most genetic algorithm parameters, population size is highly dependent on the problem. There are certain factors that can help point in the direction of whether you should have a large or small population size, but a lot of the time testing different values against a known solution before running it on your real problem is a good idea (if this is possible, of course).
A population size of 10 does seem rather small though. You say you have an equation with five variables. Is your problem represented by a chromosome of 5 values? It seems small for a chromosome and if this is the case it's likely that using a genetic algorithm may not be the best way to solve the problem. Perhaps if you give a bit more detail on your problem and how you are representing it people may have a better idea of how to advise you.
I'd also add that your cons for large and small population sizes aren't exactly correct. A larger population size does take longer to process per generation than a small one, but since it can often solve the problem in fewer generations, the overall processing time isn't necessarily longer. Again, it's highly dependent on the problem. With a smaller population size, mutation shouldn't have to be more prominent. Mutation is generally used to stop the genetic algorithm from becoming stuck in a local maximum and should usually be a very small value. A small population is more likely to become stuck in a local maximum, but if you have a mutation rate which is too high you may be nullifying the natural improvement of the genetic algorithm.

When to use geometric vs arithmetic mean?

So I guess this isn't technically a code question, but it's something that I'm sure will come up for other folks as well as myself while writing code, so hopefully it's still a good one to post on SO.
The Google has directed me to plenty of nice lengthy explanations of when to use one or the other as regards financial numbers, and things like that.
But my particular context doesn't fit in, and I'm wondering if anyone here has some insight. I need to take a whole bunch of individual users' votes on how "good" a particular item is. I.e., some number of users each give a particular item a score between 0 and 10, and I want to report on what the 'typical' score is. What would be the intuitive reasons to report the geometric and/or arithmetic mean as the typical response?
Or, for that matter, would I be better off reporting the median instead?
I imagine there's some psychology involved in what the "best" method might be...
Anyway, there you have it.
Thanks!
Generally speaking, the arithmetic mean will suffice. It is much less computationally intensive than the geometric mean (which involves taking an n-th root).
As for the psychology involved, the geometric mean is never greater than the arithmetic mean, so arithmetic is the best choice if you'd prefer higher scores in general.
The median is most useful when the data set is relatively small and the chance of a massive outlier relatively high. Depending on how much precision these votes can take, the median can sometimes end up being a bit arbitrary.
If you really, really want the most accurate answer possible, you could go for calculating the arithmetic-geometric mean. However, this involves calculating both arithmetic and geometric means repeatedly, so it is very computationally intensive in comparison.
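Here is a quick comparison of the three options on made-up vote data. One practical caveat worth noting: a single 0 vote drags the geometric mean to 0, and Python's statistics.geometric_mean rejects zeros outright.

# Made-up votes on a 0-10 scale (none are 0 here, see the caveat above).
import statistics

votes = [7, 8, 9, 6, 10, 7, 8, 2]

print(statistics.mean(votes))            # arithmetic mean: 7.125
print(statistics.geometric_mean(votes))  # geometric mean: ~6.55, pulled down by the 2
print(statistics.median(votes))          # median: 7.5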
You want the arithmetic mean, since you aren't averaging rates of change or anything like that.
Arithmetic mean is correct.
Your scale is artificial:
It is bounded, from 0 to 10
8.5 is intuitively between 8 and 9
But for other scales, you would need to consider the correct mean to use.
Some other examples
In counting money, it has been argued that wealth has logarithmic utility. So the geometric mean of Bill Gates' wealth and that of a bum in the inner city would be a moderately successful business person. (The arithmetic average would give you Larry Page.)
In measuring sound level, the decibel scale is already logarithmic, so you can take the arithmetic average of decibel values.
But if you are measuring volume in watts, then use the quadratic mean (RMS).
The answer depends on the context and your purpose. Percent changes were mentioned as a good time to use the geometric mean. I use the geometric mean when calculating antennas and frequencies, since the percentage change matters more than the middle of the frequency range or the average size of the antenna.
If you have wildly varying numbers, especially if most are similar but one or two are "flyers" (far from the range of the others), the geometric mean will "smooth" the results (it won't let the odd ones pull the result further than they should). This method is used to calculate bullet group sizes (the "flyer" was probably human error, not the equipment, so the plain average is "unfair" in that case).
Another related variation is the root mean square (RMS): you square the numbers, take the mean of those squares, and then take the square root of that mean. Note that this weights the larger values more heavily, rather than smoothing them. It is often used in electrical calculations, and most electrical meters are calibrated to read RMS rather than average values.
Hope this helps a little. Here is a web site that explains it pretty well: standardwisdom.com
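To illustrate, a small Python sketch with made-up numbers containing one "flyer": the geometric mean damps its influence, while the RMS actually gives it extra weight.

# Made-up numbers, for illustration only.
import math

values = [2.0, 2.1, 1.9, 2.0, 6.0]   # the 6.0 is the flyer

arithmetic = sum(values) / len(values)
geometric = math.prod(values) ** (1 / len(values))
rms = math.sqrt(sum(v * v for v in values) / len(values))

print(f"arithmetic = {arithmetic:.2f}")  # 2.80
print(f"geometric  = {geometric:.2f}")   # ~2.49, flyer damped
print(f"rms        = {rms:.2f}")         # ~3.23, flyer amplified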
