Genetic Algorithm, large population vs small one - machine-learning

Im wondering if there is a general rule of thumb for population sizing. Ive read in a book that 2x the chromosome length is a good starting point. Am i correct in assuming then that if i had an equation with 5 variables, i should have a population of 10?
Im also wondering if the following is correct:
Larger Population Size.
Pros:
Larger diversity so more likely to pick up on traits which return a good fitness.
Cons:
Requires longer to process.
vs
Smaller Population Size.
Pros:
Larger number of generations experienced per unit time.
Cons:
Mutation will have to be more prominent in order to compensate for smaller population??
EDIT
A little additional info, say i have an equation which has 5 unknown parameters. For each parameter i have anywhere between 10-50 values i would like to try assign to each of these variables. So for example
variable1 = 20 different values
variable2 = 15 different values
...
I thought a GA would be a decent approach to such a problem as the search space is quite large, ie worst case for the above would be 312,500,000 permutations (unless i have screwed up?) n!/(n-k)! where n = 50 and k = 1 => 50 * 50 * 50 * 50 * 50
unfortunately the number of parameters/range of values to check can vary alot so i was looking for some sort of rule of thumb as to how large i should set the population.
Thanks for ur help + if there is any more info you need/prefer to discuss in one of the chatrooms, just give me a shout.

I'm not sure where you read that 2x the chromosome length is a good starting point, but I'm guessing it's a book that concentrated on larger problems.
If you only have five variables, a genetic algorithm is probably not the right choice for converging upon a solution. With a chromosome length of five you're probably going to find that you very quickly reach a non-deterministic(this will change in subsequent runs) local minimum and slowly iterate around that space until you find the true local minimum.
However, if you are insistent on using a GA I would suggest abandoning that rule of thumb for this problem and really think about starting population as a measure of how far from the final solution you expect a random solution to be.
The reason that many rule of thumbs is dependent on chromosome length is because that's a decent proxy for this, if I have a hundred variables, and given randomly generating dna sequence is going to be further from ideal than if I had only one variable.
Additionally, if you're worried about computation intensity I'm going to go ahead and say that it shouldn't be an issue since you're dealing with such a small solution set. I think a better rule of thumb for smaller sets like this would be along the lines of:
(ln(chromosome_length*(solution_space/granularity)/mutation_rate))^2
Probably with a constant thrown in to scale for the particular problem.
It's definitely not a great rule of thumb (no rule is) but here's my logic for it:
Chromosome length is just a proxy for size of solution space, so taking into account the size of the solution space will necessarily increase the accuracy of this proxy
A smaller mutation rate necessitates a larger population size to compensate for the fact that you are more prone to get caught in local minima
Any rule of thumb should scale logarithmically since a genetic algorithm is akin to a tree search of your solution space.
The squared term was mostly the result of trying this out, but it looks like the logarithmic scaling was a little aggressive, though the general shape seemed right.
However I think a better choice would be to start at a reasonable number (100) and try iterating up and down until you find a population size that seems to balance accuracy with execution speed.

As with most genetic algorithm parameters population size is highly dependant on the problem. There are certain factors that can help to point in the direction of whether you should have a large or small population size but a lot of the time testing different values against a known solution before running it on your problem is a good idea (if this is possible of course).
A population size of 10 does seem rather small though. You say you have an equation with five variables. Is your problem represented by a chromosome of 5 values? It seems small for a chromosome and if this is the case it's likely that using a genetic algorithm may not be the best way to solve the problem. Perhaps if you give a bit more detail on your problem and how you are representing it people may have a better idea of how to advise you.
I'd also add that your cons for large and small population sizes aren't exactly correct. A larger population size does take longer to process than a small one but since it can often solve the problem quicker then overall the processing time isn't necessarily longer. gain, it's highly dependant on the problem. With a smaller population size mutation shouldn't have to be more prominent. Mutation is generally used to stop the genetic algorithm from becoming stuck in a local maximum and should usually be a very small value. A small population is more likely to become stuck in a local maximum but if you have a mutation value which is too high you may be nullifying the natural improvement of the genetic algorithm.

Related

Optimize deep Q network with long episode

I am working on a problem for which we aim to solve with deep Q learning. However, the problem is that training just takes too long for each episode, roughly 83 hours. We are envisioning to solve the problem within, say, 100 episode.
So we are gradually learning a matrix (100 * 10), and within each episode, we need to perform 100*10 iterations of certain operations. Basically we select a candidate from a pool of 1000 candidates, put this candidate in the matrix, and compute a reward function by feeding the whole matrix as the input:
The central hurdle is that the reward function computation at each step is costly, roughly 2 minutes, and each time we update one entry in the matrix.
All the elements in the matrix depend on each other in the long term, so the whole procedure seems not suitable for some "distributed" system, if I understood correctly.
Could anyone shed some lights on how we look at the potential optimization opportunities here? Like some extra engineering efforts or so? Any suggestion and comments would be appreciated very much. Thanks.
======================= update of some definitions =================
0. initial stage:
a 100 * 10 matrix, with every element as empty
1. action space:
each step I will select one element from a candidate pool of 1000 elements. Then insert the element into the matrix one by one.
2. environment:
each step I will have an updated matrix to learn.
An oracle function F returns a quantitative value range from 5000 ~ 30000, the higher the better (roughly one computation of F takes 120 seconds).
This function F takes the matrix as the input and perform a very costly computation, and it returns a quantitative value to indicate the quality of the synthesized matrix so far.
This function is essentially used to measure some performance of system, so it do takes a while to compute a reward value at each step.
3. episode:
By saying "we are envisioning to solve it within 100 episodes", that's just an empirical estimation. But it shouldn't be less than 100 episode, at least.
4. constraints
Ideally, like I mentioned, "All the elements in the matrix depend on each other in the long term", and that's why the reward function F computes the reward by taking the whole matrix as the input rather than the latest selected element.
Indeed by appending more and more elements in the matrix, the reward could increase, or it could decrease as well.
5. goal
The synthesized matrix should let the oracle function F returns a value greater than 25000. Whenever it reaches this goal, I will terminate the learning step.
Honestly, there is no effective way to know how to optimize this system without knowing specifics such as which computations are in the reward function or which programming design decisions you have made that we can help with.
You are probably right that the episodes are not suitable for distributed calculation, meaning we cannot parallelize this, as they depend on previous search steps. However, it might be possible to throw more computing power at the reward function evaluation, reducing the total time required to run.
I would encourage you to share more details on the problem, for example by profiling the code to see which component takes up most time, by sharing a code excerpt or, as the standard for doing science gets higher, sharing a reproduceable code base.
Not a solution to your question, just some general thoughts that maybe are relevant:
One of the biggest obstacles to apply Reinforcement Learning in "real world" problems is the astoundingly large amount of data/experience required to achieve acceptable results. For example, OpenAI in Dota 2 game colletected the experience equivalent to 900 years per day. In the original Deep Q-network paper, in order to achieve a performance close to a typicial human, it was required hundres of millions of game frames, depending on the specific game. In other benchmarks where the input are not raw pixels, such as MuJoCo, the situation isn't a lot better. So, if you don't have a simulator that can generate samples (state, action, next state, reward) cheaply, maybe RL is not a good choice. On the other hand, if you have a ground-truth model, maybe other approaches can easily outperform RL, such as Monte Carlo Tree Search (e.g., Deep Learning for Real-Time Atari Game Play Using Offline Monte-Carlo Tree Search Planning or Simple random search provides a competitive approach to reinforcement learning). All these ideas a much more are discussed in this great blog post.
The previous point is specially true for deep RL. The fact of approximatting value functions or policies using a deep neural network with millions of parameters usually implies that you'll need a huge quantity of data, or experience.
And regarding to your specific question:
In the comments, I've asked a few questions about the specific features of your problem. I was trying to figure out if you really need RL to solve the problem, since it's not the easiest technique to apply. On the other hand, if you really need RL, it's not clear if you should use a deep neural network as approximator or you can use a shallow model (e.g., random trees). However, these questions an other potential optimizations require more domain knowledge. Here, it seems you are not able to share the domain of the problem, which could be due a numerous reasons and I perfectly understand.
You have estimated the number of required episodes to solve the problem based on some empirical studies using a smaller version of size 20*10 matrix. Just a caution note: due to the curse of the dimensionality, the complexity of the problem (or the experience needed) could grow exponentially when the state space dimensionalty grows, although maybe it is not your case.
That said, I'm looking forward to see an answer that really helps you to solve your problem.

How to differentiate between real improvement and random noise?

I am building an automatic translator in moses. To improve its performance, I use log-linear weight optimisation. This technique has a random component, which can affect slightly the final result (but I do not know exactly how much).
Suppose that the current performance of the model is 25 BLEU.
Suppose now I modify the language model (e.g. change the smoothing), and I get a performance of 26 BLEU.
My question is: how can I know if the improvement is because the modification, or is just noise from the random component?
This is pretty much what statistics is all about. You can basically do one of the two things (from the basic set of solutions, of course there are many more advanced):
try to measure/model/quantify the effect of randomness, if you know what is causing it, you might be able to actually compute how much it can affect your model. If analytical solution is not possible, you can always train 20 models with the same data/settings, gather results and estimate noise distribution. Once you have this you can perform statistical tests to check whether the improvement is statistically significant (for example by ANOVA tests).
simpler approach (but more expensive in terms of data/time) is to simply reduce the variance by averaging. In short - instead of training one model (or evaluating model once) which has this hard to determine noise component - do it many times, 10, 20, and average the results. This way you reduce the variance of the results in your analysis. This can (and should) be combined with the previous option - since now you have 20 results per run, thus you can again use statistical testes to see whether these are significantly different things.

Input selection for neural networks

I am going to use ANN for my work in which I have a large dataset, let say input[600x40] and output[600x6]. As one can see, the number of inputs (40) is too high for ANN and it may trap in local minimum and/or increases the CPU time dramatically. Is there any way to select the most informative input?
As my first try, I used the following code in Matlab to find the cross-correlation between each two inputs:
[rho, ~] = corr(inputs, 'rows','pairwise')
However, I think this simple correlation cannot identify some hidden complex relation between the inputs.
Any ideas?
First of all 40 inputs is a very small space and it should not be reduced. Large number of inputs is 100,000, not 40. Also, 600x40 is not a big dataset, nor the one "increasing the CPU time dramaticaly", if it learns slowly than check your code because it appears to be the problem, not your data.
Furthermore, feature selection is not a good way to go, you should use it only when gathering features is actually expensive. In any other scenario you are looking for dimensionality reduction, such as PCA, LDA etc. although as said before - your data should not be reduced, rather - you should consider getting more of it (new samples/new features).
Disclaimer: I'm with lejlot on this - you should get more data and
more features instead of trying to remove features. Still, that doesn't answer your question, so here we go.
Try most basic greedy approach - try removing each feature and retrain your ANN (several times, of course) and see if your results got better or worse. Choose this situation where results got better and improvement was the best. Repeat until you'll get no improvement by removing features. This will take a lot of time, so you may want to try doing it on some subset of your data (for example on 3 folds of dataset splitted into 10 folds).
It's ugly, but sometimes it works.
I repeat what I've said in disclaimer - this is not the way to go.

Image labeling performance using CRF

I need to develop an image labeling application, for this task I'm considering using Conditional Random Fields (CRF) over a set of superpixels, there exists quite a few papers that point out this technology as the state of the art for this task. As usual the task could be devided into two tasks:
Training model: which for this problem would be obtaining the parameter vector 'w', using for example
Testing: which would be obtaining the most feasible label assignment of a given set of superpixels, i.e argmax(P(y|x))
I'm aware of training-time to be quite high, however I have not found anything about testing-time nor performance, does anyone have and idea of how much time could take the testing problem? I suppose it will depend on the number of labels, image size, implementation, hardware, etc
Testing is slowish because you still have to solve a graph cuts problem (but nothing like training). There is an implementation you can try out at http://drwn.anu.edu.au/drwnProjMultiSeg.html (you have probably seen Stephen Gould's papers).
I still have the log file. but it is a bit hard to interpret so the following may not be totally accurate. On a super fast machine, I think it took about:
4.5 hours cpu time to train 20 classes on 276 images from MSRC dataset
50 mins cpu time to classify 256 images, most of which was spent doing alpha expansion

When to use geometric vs arithmetic mean?

So I guess this isn't technically a code question, but it's something that I'm sure will come up for other folks as well as myself while writing code, so hopefully it's still a good one to post on SO.
The Google has directed me to plenty of nice lengthy explanations of when to use one or the other as regards financial numbers, and things like that.
But my particular context doesn't fit in, and I'm wondering if anyone here has some insight. I need to take a whole bunch of individual users' votes on how "good" a particular item is. I.e., some number of users each give a particular item a score between 0 and 10, and I want to report on what the 'typical' score is. What would be the intuitive reasons to report the geometric and/or arithmetic mean as the typical response?
Or, for that matter, would I be better off reporting the median instead?
I imagine there's some psychology involved in what the "best" method might be...
Anyway, there you have it.
Thanks!
Generally speaking, the arithmetic mean will suffice. It is much less computationally intensive than the geometric mean (which involves taking an n-th root).
As for the psychology involved, the geometric mean is never greater than the arithmetic mean, so arithmetic is the best choice if you'd prefer higher scores in general.
The median is most useful when the data set is relatively small and the chance of a massive outlier relatively high. Depending on how much precision these votes can take, the median can sometimes end up being a bit arbitrary.
If you really really want the most accurate answer possible, you could go for calculating the arithmetic-geomtric mean. However, this involved calculating both arithmetic and geometric means repeatedly, so it is very computationally intensive in comparison.
you want the arithmetic mean. since you aren't measuring the average change in average or something.
Arithmetic mean is correct.
Your scale is artificial:
It is bounded, from 0 and 10
8.5 is intuitively between 8 and 9
But for other scales, you would need to consider the correct mean to use.
Some other examples
In counting money, it has been argued that wealth has logarithmic utility. So the median between Bill Gates' wealth and a bum in the inner city would be a moderately successful business person. (Arithmetic average would hive you Larry Page.)
In measuring sound level, decibels already normalizes the effect. So you can take arithmetic average of decibels.
But if you are measuring volume in watts, then use quadratic means (RMS).
The answer depends on the context and your purpose. Percent changes were mentioned as a good time to use geometric mean. I use geometric mean when calculating antennas and frequencies since the percentage change is more important than the average or middle of the frequency range or average size of the antenna is concerned. If you have wildly varying numbers, especially if most are similar but one or two are "flyers" (far from the range of the others) the geometric mean will "smooth" the results (not let the different ones exert a change in the results more than they should). This method is used to calculate bullet group sizes (the "flyer" was probably human error, not the equipment, so the average is ""unfair" in that case). Another variation similar to geometric mean is the root mean square method. First you take the square root of the numbers, take THAT mean, and then square your answer (this provides even more smoothing). This is often used in electrical calculations and most electical meters are calculated in "RMS" (root mean square), not average readings. Hope this helps a little. Here is a web site that explains it pretty well. standardwisdom.com

Resources