adjusted fitness in NEAT algorithm - machine-learning

I'm learning about NEAT from the following paper: http://nn.cs.utexas.edu/downloads/papers/stanley.ec02.pdf
I'm having trouble understanding how adjusted fitness penalizes large species and prevents them from dominating the population. I'll demonstrate my current understanding through an example, and hopefully someone will correct it.
Let's say we have two species, A and B. Species A did really well last generation and was given more children; this generation it has 4 children with fitnesses [8, 10, 10, 12], while B has 2 children with fitnesses [9, 9]. So their adjusted fitnesses will be A = [2, 2.5, 2.5, 3] and B = [4.5, 4.5].
Now onto distributing children, the paper states: "Every species is assigned a potentially different number of offspring in proportion to the sum of adjusted fitnesses f'_i of its member organisms"
So the sum of adjusted fitnesses is 10 for A and 9 for B, thus A gets more children and keeps growing. So how does this process penalize large species and prevent them from dominating the population?
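To make my understanding concrete, here is the same calculation as a short Python sketch (the numbers are just my made-up example):

    # raw fitnesses per species (my example numbers)
    species = {"A": [8, 10, 10, 12], "B": [9, 9]}

    for name, fitnesses in species.items():
        n = len(fitnesses)                      # species size
        adjusted = [f / n for f in fitnesses]   # explicit fitness sharing
        print(name, adjusted, "sum =", sum(adjusted))

    # A [2.0, 2.5, 2.5, 3.0] sum = 10.0
    # B [4.5, 4.5] sum = 9.0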

Great question! I completely agree that this paper (specifically the part you quoted) says that the offspring are assigned based on the sum of adjusted fitnesses within a species. Since adjusted fitness is calculated by dividing fitness by the number of members of a species, this would be mathematically equivalent to assigning offspring based on the average fitness of each species (as in your example). As you say, that, in and of itself, should not have the effect of curtailing the growth of large species.
Unless I'm missing something, there is not enough information in the paper to determine whether A) There are additional implementation details not mentioned in the paper that cause this selection scheme to have the stated effect, B) This is a mistake in the writing of the paper, or C) This is how the algorithm was actually implemented and speciation wasn't helpful for the reasons the authors thought it was.
Regarding option A: Immediately after the line you quoted, the paper says "Species then reproduce by first eliminating the lowest performing members from the population. The entire population is then replaced by the offspring of the remaining organisms in each species." This could be implemented such that each species primarily replaces its own weakest organisms, which would make competition primarily occur within species. This is a technique called crowding (introduced in the Mahfoud, 1995 paper that this paper cites) and it can have similar effects to fitness sharing, especially if it were combined with certain other implementation decisions. However, it would be super weird for them to have done this, not mentioned it, and then said they were using fitness sharing rather than crowding. So I think this explanation is unlikely.
Regarding option B: Most computer science journal papers, like this one, are based on groups of conference papers where the work was originally presented. The conference paper where most of the speciation research on NEAT was presented is here: https://pdfs.semanticscholar.org/78cc/6d52865d2eab817aaa3efd04fd8f46ca8b61.pdf. In its explanation of fitness sharing, that paper says: "Species then grow or shrink depending on whether their average adjusted fitness is above or below the population average" (emphasis mine). This is different from the sum of adjusted fitness referred to in the paper you linked to. If they were actually using the average (and mistakenly said sum), they'd effectively be dividing by the number of members of each species twice, which would make all of the other claims accurate, and make the data make sense.
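To make the difference between the two wordings concrete, here is a quick sketch using the numbers from the question (this is just my reading of the two phrasings, not code from the paper):

    fitness = {"A": [8, 10, 10, 12], "B": [9, 9]}

    for name, f in fitness.items():
        n = len(f)
        adj = [x / n for x in f]
        # "sum of adjusted fitnesses" = mean raw fitness:      A = 10,  B = 9   -> A still favored
        # "average adjusted fitness"  = mean raw fitness / n:  A = 2.5, B = 4.5 -> B favored
        print(name, "sum:", sum(adj), "average:", sum(adj) / n)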
Regarding option C: This one seems unlikely, since Figure 7 makes it look like there's definitely stable coexistence for longer than you'd expect without some sort of negative frequency dependence. Also, they clearly put a lot of effort into dissecting the effect of speciation, so I wouldn't expect them to miss something like that. Especially in such an impactful paper that so many people have built on.
So, on the whole, I'd say my money is on explanation B - that this is a one-word mistake that changes the meaning substantially. But it's hard to know for sure.

The solution is simple: the population size is constant. All your calculations are correct, but your population size is 6, and 10:9 is roughly even, which results in 3 offspring for A and 3 for B. So species A is actually shrinking, while species B is growing (as intended).
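For illustration, a minimal sketch of that allocation under a constant population size (my own illustration, not code from the paper; rounding details vary between implementations):

    adjusted_sums = {"A": 10.0, "B": 9.0}   # sums of adjusted fitness from the question
    pop_size = 6                            # population size is held constant

    total = sum(adjusted_sums.values())
    offspring = {s: round(pop_size * v / total) for s, v in adjusted_sums.items()}
    print(offspring)  # {'A': 3, 'B': 3} -> A shrinks from 4 to 3, B grows from 2 to 3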

Related

Optimize deep Q network with long episode

I am working on a problem that we aim to solve with deep Q-learning. However, the issue is that training takes too long for each episode, roughly 83 hours. We are envisioning solving the problem within, say, 100 episodes.
So we are gradually learning a matrix (100 * 10), and within each episode we need to perform 100*10 iterations of certain operations. Basically, we select a candidate from a pool of 1000 candidates, put this candidate in the matrix, and compute a reward function by feeding the whole matrix as the input.
The central hurdle is that the reward function computation at each step is costly, roughly 2 minutes, and each time we update one entry in the matrix.
All the elements in the matrix depend on each other in the long term, so the whole procedure does not seem suitable for a "distributed" system, if I understood correctly.
Could anyone shed some light on potential optimization opportunities here, such as extra engineering effort? Any suggestions and comments would be appreciated very much. Thanks.
======================= update of some definitions =================
0. initial stage:
a 100 * 10 matrix, with every element as empty
1. action space:
At each step I select one element from a candidate pool of 1000 elements and insert it into the matrix.
2. environment:
At each step I have an updated matrix to learn from.
An oracle function F returns a quantitative value ranging from 5000 to 30000, the higher the better (one computation of F takes roughly 120 seconds).
This function F takes the matrix as the input, performs a very costly computation, and returns a quantitative value to indicate the quality of the synthesized matrix so far.
This function is essentially used to measure some performance of the system, so it does take a while to compute a reward value at each step.
3. episode:
By saying "we are envisioning to solve it within 100 episodes", that's just an empirical estimation. But it shouldn't be less than 100 episode, at least.
4. constraints
Ideally, like I mentioned, "All the elements in the matrix depend on each other in the long term", and that's why the reward function F computes the reward by taking the whole matrix as the input rather than the latest selected element.
Indeed, as more and more elements are appended to the matrix, the reward could increase, or it could decrease as well.
5. goal
The synthesized matrix should make the oracle function F return a value greater than 25000. Whenever it reaches this goal, I will terminate the learning step. (A rough sketch of the overall loop is shown below.)
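To make the setup more concrete, here is how I picture the loop (F is just a cheap placeholder for the expensive oracle, and the learning agent is omitted):

    import numpy as np

    ROWS, COLS, POOL = 100, 10, 1000

    def F(matrix):
        # placeholder for the expensive oracle (in reality one call takes ~120 seconds)
        return float(matrix.sum())

    matrix = np.zeros((ROWS, COLS))          # 0. initial stage: empty matrix
    candidates = np.random.rand(POOL)        # candidate pool of 1000 elements

    for step in range(ROWS * COLS):          # one episode = 100*10 placements
        row, col = divmod(step, COLS)
        action = np.random.randint(POOL)     # 1. action: pick a candidate (random here)
        matrix[row, col] = candidates[action]
        reward = F(matrix)                   # 2. costly reward computed on the whole matrix
        if reward > 25000:                   # 5. goal reached
            break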
Honestly, there is no effective way to know how to optimize this system without knowing specifics such as which computations are in the reward function or which programming design decisions you have made that we can help with.
You are probably right that the episodes are not suitable for distributed calculation, meaning we cannot parallelize this, as they depend on previous search steps. However, it might be possible to throw more computing power at the reward function evaluation, reducing the total time required to run.
I would encourage you to share more details on the problem, for example by profiling the code to see which component takes up most of the time, by sharing a code excerpt or, as the standard for doing science gets higher, by sharing a reproducible code base.
Not a solution to your question, just some general thoughts that may be relevant:
One of the biggest obstacles to applying Reinforcement Learning to "real world" problems is the astoundingly large amount of data/experience required to achieve acceptable results. For example, OpenAI's Dota 2 agent collected experience equivalent to 900 years of play per day. In the original Deep Q-network paper, hundreds of millions of game frames were required to achieve performance close to that of a typical human, depending on the specific game. In other benchmarks where the inputs are not raw pixels, such as MuJoCo, the situation isn't a lot better. So, if you don't have a simulator that can generate samples (state, action, next state, reward) cheaply, maybe RL is not a good choice. On the other hand, if you have a ground-truth model, maybe other approaches can easily outperform RL, such as Monte Carlo Tree Search (e.g., Deep Learning for Real-Time Atari Game Play Using Offline Monte-Carlo Tree Search Planning or Simple random search provides a competitive approach to reinforcement learning). All these ideas and much more are discussed in this great blog post.
The previous point is especially true for deep RL. Approximating value functions or policies with a deep neural network that has millions of parameters usually implies that you'll need a huge quantity of data, or experience.
And regarding your specific question:
In the comments, I've asked a few questions about the specific features of your problem. I was trying to figure out whether you really need RL to solve the problem, since it's not the easiest technique to apply. On the other hand, if you really need RL, it's not clear whether you should use a deep neural network as the approximator or whether a shallower model (e.g., random trees) would suffice. However, these questions and other potential optimizations require more domain knowledge. Here, it seems you are not able to share the domain of the problem, which could be due to numerous reasons, and I perfectly understand.
You have estimated the number of required episodes to solve the problem based on some empirical studies using a smaller version with a 20*10 matrix. Just a cautionary note: due to the curse of dimensionality, the complexity of the problem (or the experience needed) could grow exponentially as the state space dimensionality grows, although maybe that is not your case.
That said, I'm looking forward to seeing an answer that really helps you solve your problem.

Why is my p value too low with > 1,000,000 obs (p<<0.01)

I know that generally a low p value is good, since I want to reject the null hypothesis H0. But my problem is an odd one, and I would appreciate any help or insight you may give me.
I work with huge data sets (n > 1,000,000), each representing data for one year. I am required to analyse the data and find out whether the mean of the year is significantly different from the mean of the previous year. Yet everyone would prefer it to be non-significant instead of significant.
By "significant" I mean that I want to be able to tell my boss, "look, these non-significant changes are noise, while these significant changes represent something real to consider."
The problem is that simply comparing the two averages with a t-test always results in a significant difference, even if the difference is very, very small (probably due to the huge sample size) and falls within an acceptable range in practice. So basically, the way I perceive it, a p value does not serve my needs well.
What do you think I should do?
There is nothing wrong with the p value. Even slight effects with this number of observations will be flagged as significant. You have rightfully asserted that the effect size for such a sample is very weak. This basically nullifies whatever argument can be made for using the p value alone for "significance": while the effect can be determined not to be due to chance, its actual usefulness in the real world is likely low, given that it doesn't produce anything predictable.
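A quick simulation illustrates the point: with a million observations per group, a negligible mean difference is flagged as highly significant even though the effect size (Cohen's d) is tiny. This is just an illustration with made-up numbers:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n = 1_000_000
    a = rng.normal(loc=100.0, scale=15, size=n)   # last year
    b = rng.normal(loc=100.1, scale=15, size=n)   # this year: a tiny real shift

    t, p = stats.ttest_ind(a, b)
    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    cohens_d = (b.mean() - a.mean()) / pooled_sd

    print(f"p = {p:.2e}")         # typically far below 0.01
    print(f"d = {cohens_d:.4f}")  # around 0.007: practically negligible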
For a comprehensive book on this subject, see the often-cited book by Jacob Cohen on power analysis. You can also check out my recent post on Cross Validated regarding two regression models with significant p values for predictors, but with radically different predictive power.

Naive Bayes without Naive assumption

I'm trying to understand why the naive Bayes classifier is linearly scalable with the number of features, in comparison to the same idea without the naive assumption. I understand how the classifier works and what's so "naive" about it. I'm unclear as to why the naive assumption gives us linear scaling, whereas lifting that assumption is exponential. I'm looking for a walk-through of an example that shows the algorithm under the "naive" setting with linear complexity, and the same example without that assumption that will demonstrate the exponential complexity.
The problem here lies in the following quantity
P(x1, x2, x3, ..., xn | y)
which you have to estimate. When you assume "naiveness" (feature independence) you get
P(x1, x2, x3, ..., xn | y) = P(x1 | y)P(x2 | y) ... P(xn | y)
and you can estimate each P(xi | y) independently. In a natural way, this approach scales linearly, since if you add another k features you need to estimate another k probabilities, each using some very simple technique (like counting objects with a given feature).
Now, without naiveness you do not have any decomposition. Thus you have to keep track of all probabilities of the form
P(x1=v1, x2=v2, ..., xn=vn | y)
for every possible combination of values vi. In the simplest case, vi is just "true" or "false" (the event happened or not), and this already gives you 2^n probabilities to estimate (one for each possible assignment of "true" and "false" to a series of n boolean variables). Consequently you have exponential growth of the algorithm's complexity. However, the biggest issue here is usually not a computational one, but rather the lack of data. Since there are 2^n probabilities to estimate, you need more than 2^n data points to have any estimate for all possible events. In real life you will never encounter a dataset of roughly 1,000,000,000,000 points... and that is the number of required (unique!) points for 40 boolean features with such an approach (2^40 is about 1.1 * 10^12).
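A small sketch of this counting argument for binary features (my own illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    n_samples, n_features = 1000, 10
    X = rng.integers(0, 2, size=(n_samples, n_features))  # binary features
    y = rng.integers(0, 2, size=n_samples)                # binary class label

    # Naive: estimate P(x_i = 1 | y) separately per feature -> n_features numbers per class
    naive = [[X[y == c, i].mean() for i in range(n_features)] for c in (0, 1)]

    # Without the assumption: one probability per configuration -> 2**n_features per class
    n_joint = 2 ** n_features
    n_seen = len({tuple(row) for row in X})
    print("parameters per class, naive:", n_features)   # 10, grows linearly
    print("parameters per class, joint:", n_joint)      # 1024, grows exponentially
    print("distinct configurations observed:", n_seen)  # far fewer than 1024 -> many zero counts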
Candy Selection
On the outskirts of Mumbai, there lived an old Grandma, whose quantitative outlook towards life had earned her the moniker Statistical Granny. She lived alone in a huge mansion, where she practised sound statistical analysis, shielded from the barrage of hopelessly flawed biases peddled as common sense by mass media and so-called pundits.
Every year on her birthday, her entire family would visit her and stay at the mansion. Sons, daughters, their spouses, her grandchildren. It would be a big bash every year, with a lot of fanfare. But what Grandma loved the most was meeting her grandchildren and getting to play with them. She had ten grandchildren in total, all of them around 10 years of age, and she would lovingly call them "random variables".
Every year, Grandma would present a candy to each of the kids. Grandma had a large box full of candies of ten different kinds. She would give a single candy to each one of the kids, since she didn't want to spoil their teeth. But, as she loved the kids so much, she took great efforts to decide which candy to present to which kid, such that it would maximize their total happiness (the maximum likelihood estimate, as she would call it).
But that was not an easy task for Grandma. She knew that each type of candy had a certain probability of making a kid happy. That probability was different for different candy types, and for different kids. Rakesh liked the red candy more than the green one, while Sheila liked the orange one above all else.
Each of the 10 kids had different preferences for each of the 10 candies.
Moreover, their preferences largely depended on external factors which were unknown (hidden variables) to Grandma.
If Sameer had seen a blue building on the way to the mansion, he'd want a blue candy, while Sandeep always wanted the candy that matched the colour of his shirt that day. But the biggest challenge was that their happiness depended on what candies the other kids got! If Rohan got a red candy, then Niyati would want a red candy as well, and anything else would make her go crying into her mother's arms (conditional dependency). Sakshi always wanted what the majority of kids got (positive correlation), while Tanmay would be happiest if nobody else got the kind of candy that he received (negative correlation). Grandma had concluded long ago that her grandkids were completely mutually dependent.
It was computationally a big task for Grandma to get the candy selection right. There were too many conditions to consider and she could not simplify the calculation. Every year before her birthday, she would spend days figuring out the optimal assignment of candies, by enumerating all configurations of candies for all the kids together (which was an exponentially expensive task). She was getting old, and the task was getting harder and harder. She used to feel that she would die before figuring out the optimal selection of candies that would make her kids the happiest all at once.
But an interesting thing happened. As the years passed and the kids grew up, they finally left their teenage years behind and turned into independent adults. Their choices became less and less dependent on each other, and it became easier to figure out what each one's most preferred candy was (all of them still loved candies, and Grandma).
Grandma was quick to realise this, and she joyfully began calling them "independent random variables". It was much easier for her to figure out the optimal selection of candies - she just had to think of one kid at a time and, for each kid, assign a happiness probability to each of the 10 candy types for that kid. Then she would pick the candy with the highest happiness probability for that kid, without worrying about what she would assign to the other kids. This was a super easy task, and Grandma was finally able to get it right.
That year, the kids were finally the happiest all at once, and Grandma had a great time at her 100th birthday party. A few months following that day, Grandma passed away, with a smile on her face and a copy of Sheldon Ross clutched in her hand.
Takeaway: In statistical modelling, having mutually dependent random variables makes it really hard to find out the optimal assignment of values for each variable that maximises the cumulative probability of the set.
You need to enumerate over all possible configurations (which increases exponentially in the number of variables). However, if the variables are independent, it is easy to pick out the individual assignments that maximise the probability of each variable, and then combine the individual assignments to get a configuration for the entire set.
In Naive Bayes, you make the assumption that the variables are independent (even if they are actually not). This simplifies your calculation, and it turns out that in many cases, it actually gives estimates that are comparable to those which you would have obtained from a more (computationally) expensive model that takes into account the conditional dependencies between variables.
I have not included any math in this answer, but hopefully this made it easier to grasp the concept behind Naive Bayes, and to approach the math with confidence. (The Wikipedia page is a good start: Naive Bayes).
Why is it "naive"?
The Naive Bayes classifier assumes that X|Y is normally distributed with zero covariance between any of the components of X. Since this is a completely implausible assumption for any real problem, we refer to it as naive.
Naive Bayes will make the following assumption:
If you like Pickles, and you like Ice Cream, Naive Bayes will assume independence, give you a Pickle Ice Cream, and think that you'll like it.
Which may not be true at all.
For a mathematical example see: https://www.analyticsvidhya.com/blog/2015/09/naive-bayes-explained/

is Q-learning without a final state even possible?

I have to solve this problem with Q-learning.
Well, actually I have to evaluate a Q-learning-based policy on it.
I am a tourist manager.
I have n hotels, each of which can contain a different number of persons.
For each person I put in a hotel, I get a reward, based on which room I have chosen.
If I want, I can also murder the person, so they go in no hotel but give me a different reward.
(OK, that's a joke... but it's to say that I can have a self-transition, so the number of people in my rooms doesn't change after that action.)
My state is a vector containing the number of persons in each hotel.
My action is a vector of zeroes and ones which tells me where I put the new person.
My reward matrix is formed by the rewards I get for each transition between states (even the self-transition one).
Now, since I can get an unlimited number of people (i.e. I can fill the hotels but I can go on killing people), how can I build the Q matrix? Without the Q matrix I can't get a policy, and so I can't evaluate it...
What am I seeing wrongly? Should I choose a random state as final? Have I missed the point entirely?
This question is old, but I think it merits an answer.
One of the issues is that there is not necessarily the notion of an episode, and a corresponding terminal state. Rather, this is a continuing problem. Your goal is to maximize your reward forever into the future. In this case, there is a discount factor gamma less than one that essentially specifies how far you look into the future on each step. The return is specified as the cumulative discounted sum of future rewards. For episodic problems, it is common to use a discount of 1, with the return being the cumulative sum of future rewards until the end of an episode is reached.
To learn the optimal Q, which is the expected return for following the optimal policy, you have to have a way to perform the off-policy Q-learning updates. If you are using sample transitions to get Q-learning updates, then you will have to specify a behavior policy that takes actions in the environment to get those samples. To understand more about Q-learning, you should read the standard introductory RL textbook: "Reinforcement Learning: An Introduction", Sutton and Barto.
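As a minimal illustration of tabular Q-learning on a continuing task (gamma < 1, epsilon-greedy behavior policy; the environment here is a made-up placeholder, not the hotel problem):

    import numpy as np

    n_states, n_actions = 10, 4
    Q = np.zeros((n_states, n_actions))
    alpha, gamma, epsilon = 0.1, 0.95, 0.1   # gamma < 1: discounted continuing task
    rng = np.random.default_rng(0)

    def step(s, a):
        # placeholder environment: random next state, reward depends on the state-action pair
        return int(rng.integers(n_states)), float(a == s % n_actions)

    s = 0
    for _ in range(10_000):                  # no terminal state, just keep interacting
        a = int(rng.integers(n_actions)) if rng.random() < epsilon else int(np.argmax(Q[s]))
        s_next, r = step(s, a)
        # off-policy Q-learning update
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
        s = s_next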
RL problems don't need a final state per se. What they need are reward states. So, as long as you have some rewards, you are good to go, I think.
I don't have a lot of experience with RL problems like this one. As a commenter suggests, this sounds like a really huge state space. If you are comfortable with using a discrete approach, you would get a good start and learn something about your problem by limiting the scope of the problem (a finite number of people and hotels/rooms) and turning Q-learning loose on the smaller state matrix.
Or you could jump right into a method that can handle an infinite state space, like a neural network.
In my experience if you have the patience of trying the smaller problem first, you will be better prepared to solve the bigger one next.
Maybe it isn't an answer to "is it possible?", but... read about R-learning. To solve this particular problem you may want to learn not only the Q- or V-function, but also rho, the expected reward per time step. Joint learning of Q and rho results in a better strategy.
To build on the above response: with an infinite state space, you should definitely consider some sort of generalization for your Q function. You will get more value out of your Q function in an infinite space. You could experiment with several different function approximators, whether that is simple linear regression or a neural network.
Like Martha said, you will need a gamma less than one to account for the infinite horizon. Otherwise, you would be trying to compare policies whose returns all equal infinity, which means you will not be able to identify the optimal policy.
The main thing I wanted to add here though for anyone reading this later is the significance of reward shaping. In an infinite problem, where there isn't that final large reward, sub-optimal reward loops can occur, where the agent gets "stuck", since maybe a certain state has a reward higher than any of its neighbors in a finite horizon (which was defined by gamma). To account for that, you want to make sure you penalize the agent for landing in the same state multiple times to avoid these suboptimal loops. Obviously, exploration is extremely important as well, and when the problem is infinite, some amount of exploration will always be necessary.
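As a tiny sketch of the kind of revisit penalty described above (the penalty scale is arbitrary and would need tuning for a real problem):

    from collections import defaultdict

    visit_counts = defaultdict(int)
    PENALTY = 0.1  # arbitrary scale, tune per problem

    def shaped_reward(state, raw_reward):
        # subtract a penalty that grows each time the same state is revisited
        visit_counts[state] += 1
        return raw_reward - PENALTY * (visit_counts[state] - 1)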

How to deal with feature vector of variable length?

Say you're trying to classify houses based on certain features:
Total area
Number of rooms
Garage area
But not all houses have garages. When they do, though, garage area makes for a very discriminating feature. What's a good approach to leveraging the information contained in this feature?
You could incorporate a zero/one dummy variable indicating whether there is a garage, as well as the cross-product of the garage area with the dummy (for houses with no garage, set the area to zero).
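For example, with pandas this could look like the following (the column names are made up):

    import pandas as pd

    houses = pd.DataFrame({
        "total_area":  [100, 300, 125],
        "num_rooms":   [2, 2, 1],
        "garage_area": [None, 5, 1.5],  # None = no garage
    })

    houses["has_garage"] = houses["garage_area"].notna().astype(int)  # 0/1 dummy
    houses["garage_area_x_dummy"] = houses["garage_area"].fillna(0) * houses["has_garage"]
    print(houses)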
The best approach is to build your dataset with all the features; in most cases it is just fine to fill the unavailable columns with zeroes.
Using your example, it would be something like:
Total area    Number of rooms    Garage area
100           2                  0
300           2                  5
125           1                  1.5
Often, the learning algorithm that you chose will be powerful enough to use those zeroes to classify that entry properly. After all, the absence of a value is still information for the algorithm. This could only become a problem if your data is skewed, but in that case you need to address the skewness anyway.
EDIT:
I just realized there was another answer with a comment from you about being afraid to use zeroes, since they could be confused with small garages. While I still don't see a problem with that (there should be enough difference between a small garage and zero), you can still use the same structure and mark the non-existent garage area with a negative number (let's say -1).
The solution indicated in the other answer is perfectly plausible too; having an extra feature indicating whether the house has a garage or not would work fine (especially in decision-tree-based algorithms). I just prefer to keep the dimensionality of the data as low as possible, but in the end this is more a preference than a technical decision.
You'll want to incorporate a zero indicator feature. That is, a feature which is 1 when the garage size is 0, and 0 for any other value.
Your feature vector will then be:
area | num_rooms | garage_size | garage_exists
Your machine learning algorithm will then be able to see this (non-linear) feature of garage size.

Resources