Genetic Algorithm CrossOver - machine-learning

I have a GA of population X.
After I run the gene and get the result for each gene I do some weighted multiply for the genes(so the better ranked genes get multiplied the most)
I get either x*2 or x*2+(x*100/10) genes. The 10% is random new genes it may or may not trigger depending on the mutation rate.
The problem is, I don' know what is the best approach to reduce the population to X again.
If the gene is a List should I just use list[::2] (or get every even index item from list)
What is a common practice when crossing genes?
EDIT:
Example of my GA with a population of 100;
Run the the 100 genes in the fitness function and get the result. Current Population: 100
Add 10% new random genes. Current Population: 110
Duplicate top 10% genes. Current Population: 121
Remove 10% worst genes. Current Population: 108
Combine all possible genes(no duplicates). Current Population: 5778
Remove genes from genepool until Population = 100. Current Population: 100
Restart the fitness function
What I want to know is: How should I do the last step? Currently I have a list with 5778 items and I take one every '58' or expressed as len(list)/startpopulation-1
Or should I use a 'while True' with a random.delete until len(list) == 100?
The new random genes should be added before or after the crossover?
Is there a way to make a gausian multiplication of the top-to-lowest rated items?
e.g: the top rated are multiplied by n, the second best by (n-1), the third by (n-2) ..., the worst rated multiplied by (n-n).

I do not really know why you are performing GA like that, could you give some references?
In any case here goes my typical solution for implementing a functional GA method:
Run the the 100 genes in the fitness function and get the result.
Randomly choose 2 genes based on the normalized fitness function
(consider this the probability of each gene to be chosen from the
pool) and cross-over. Repeat this step until you have 90 new genes
(45 times for this case). Save the top 5 without modification and
duplicate. Total genes: 100.
For the 90 new genes and the 5 duplicates on the new pool allow
them to mutate based on your mutation probability (typically 1%).
Total genes: 100.
Repeat from 1) to 3) until convergence, or X number of
iterations.
Note: You always want to keep unchanged the best genes such as you always get a better solution in each iteration.
Good luck!

Related

In the algorithm LambdaRank (in Learning to Rank) what does |∆ NDCG| means?

This Article describes the LambdaRank algorithm for information retrieval. In formula 8 page 6, the authors propose to multiply the gradient (lambda) by a term called |∆NDCG|.
I do understand that this term is the difference of two NDCGs when swapping two elements in the list:
the size of the change in NDCG (|∆NDCG|) given by swapping the rank positions of U1 and U2
(while leaving the rank positions of all other urls unchanged)
However, I do not understand which ordered list is considered when swapping U1 and U2. Is it the list ordered by the predictions from the model at the current iteration ? Or is it the list ordered by the ground-truth labels of the documents ? Or maybe, the list of the predictions from the model at the previous iteration as suggested by Tie-Yan Liu in his book Learning to Rank for Information Retrieval ?
Short answer: It's the list ordered by the predictions from the model at the current iteration.
Let's see why it makes sense.
At each training iteration, we perform the following steps (these steps are standard for all Machine Learning algorithms, whether it's classification or regression or ranking tasks):
Calculate scores s[i] = f(x[i]) returned by our model for each document i.
Calculate the gradients of model's weights ∂C/∂w, back-propagated from RankNet's cost C. This gradient is the sum of all pairwise gradients ∂C[i, j]/∂w, calculated for each document's pair (i, j).
Perform a gradient ascent step (i.e. w := w + u * ∂C/∂w where u is step size).
In "Speeding up RankNet" paragraph, the notion λ[i] was introduced as contributions of each document's computed scores (using the model weights at current iteration) to the overall gradient ∂C/∂w (at current iteration). If we order our list of documents by the scores from the model at current iteration, each λ[i] can be thought of as "arrows" attached to each document, the sign of which tells us to which direction, up or down, that document should be moved to increase NDCG. Again, NCDG is computed from the order, predicted by our model.
Now, the problem is that the lambdas λ[i, j] for the pair (i, j) contributes equally to the overall gradient. That means the rankings of documents below, let’s say, 100th position is given equal improtance to the rankings of the top documents. This is not what we want: we should prioritize having relevant documents at the very top much more than having correct ranking below 100th position.
That's why we multiply each of those "arrows" by |∆NDCG| to emphasise on top ranking more than the ranking at the bottom of our list.

what is epsilon/k how did that come in epsilon greedy algorithm

As it was told it would choose the arm having highest emperical mean with probability 1-epsilon how did epsilon/k add to it (and also epsilon/k for random probability selection)in the equation written for probability in the page no:6 of the paperAlgorithms for multi armed bandits.What does that mean epsilon/k writing there in the equation
This answer was taken from here:
Suppose you are standing in front of k = 3 slot machines. Each machine pays out according to a different probability distribution, and these distributions are unknown to you. And suppose you can play a total of 100 times.
You have two goals. The first goal is to experiment with a few coins to try and determine which machine pays out the best. The second, related, goal is to get as much money as possible. The terms “explore” and “exploit” are used to indicate that you have to use some coins to explore to find the best machine, and you want to use as many coins as possible on the best machine to exploit your knowledge.
Epsilon-greedy is almost too simple. As you play the machines, you keep track of the average payout of each machine. Then, you select the machine with the highest current average payout with probability = (1 – epsilon) + (epsilon / k) where epsilon is a small value like 0.10. And you select machines that don’t have the highest current payout average with probability = epsilon / k.
It much easier to understand with a concrete example. Suppose, after your first 12 pulls, you played machine #1 four times and won $1 two times and $0 two times. The average for machine #1 is $2/4 = $0.50.
And suppose you’ve played machine #2 five times and won $1 three times and $0 two times. The average payout for machine #2 is $3/5 = $0.60.
And suppose you’ve played machine #3 three times and won $1 one time and $0 two times. The average payout for machine #3 is $1/3 = $0.33.
Now you have to select a machine to play on try number 13. You generate a random number p, between 0.0 and 1.0. Suppose you have set epsilon = 0.10. If p > 0.10 (which it will be 90% of the time), you select machine #2 because it has the current highest average payout. But if p < 0.10 (which it will be only 10% of the time), you select a random machine, so each machine has a 1/3 chance of being selected.
Notice that machine #2 might get picked anyway because you select randomly from all machines.
Over time, the best machine will be played more and more often because it will pay out more often. In short, epsilon-greedy means pick the current best option ("greedy") most of the time, but pick a random option with a small (epsilon) probability sometimes.
There are many other algorithms for the multi-armed bandit problem. But epsilon-greedy is incredibly simple, and often works as well as, or even better than, more sophisticated algorithms such as UCB ("upper confidence bound") variations.
Let me try giving my point of view here.
Lets consider the similar example of 3 machines: A, B and C and assume B has the highest pay out.
If epsilon is 0.1, then what is the probability of choosing B?
Recall Epsilon Greedy algo, it says:
r = random() # any random number between 0 and 1(uniform distribution)
if r > epsilon:
choose "Best pay out at current time" #(currently it is B)
else:
choose randomly between three machines
so what is the probable number of chances of choosing B in 100 chances:
It will be sum of the below two:
1) 90 chances out of 100 (if condition)
2) one third chances out of remaining 10(else condition) as there are 3 machines(equal possibility of choosing each of them)
So the total chances can be 90 + 10* (1/3) (approx) = 93.33
But wait what if epsilon is 0.5?
Then total chances would be 95 + 5*(1/3)= 96.67
Thats y we say, probability of selecting the machine with the highest current average payout with probability = (1 – epsilon) + (epsilon / k).
I hope this helps.

How to calculate the accuracy of classes from a 7x7 confusion matrix?

So I've got the following results from Naïves Bayes classification on my data set:
I am stuck however on understanding how to interpret the data. I am wanting to find and compare the accuracy of each class (a-g).
I know accuracy is found using this formula:
However, lets take the class a. If I take the number of correctly classified instances - 313 - and divide it by the total number of 'a' (4953) from the row a, this gives ~6.32%. Would this be the accuracy?
EDIT: if we use the column instead of the row, we get 313/1199 which gives ~26.1% which seems a more reasonable number.
EDIT 2: I have done a calculation of the accuracy of a in excel which gives me 84% as the accuracy, using the accuracy calculation shown above:
This doesn't seem right, as the overall accuracy of classification successfully is ~24%
No -- all you've calculated is tp/(tp+fn), the total correct identifications of class a, divided by the total of actual a examples. This is recall, not accuracy. You need to include the other two figures.
fp is the rest of the a column; tn is all of the other figures in the non-a rows and columns, the 6x6 sub-matrix. This will reduce all 35K+ trials to a 2x2 matrix with labels a and not a, the 2x2 confusion matrix with which you're already familiar.
Yes, you get to repeat that reduction for each of the seven features. I recommend doing it programmatically.
RESPONSE TO OP UPDATE
Your accuracy is that high: you have a huge quantity of true negatives, not-a samples that were properly classified as not-a.
Perhaps it doesn't feel right because our experience focuses more on the class in question. There are [other statistics that handle that focus.
Recall is tp / (tp+fn) -- of all items actually in class a, what percentage did we properly identify? This is the 6.32% figure.
Precision is tp / (tp + fp) -- of all items identified as class a, what percentage were actually in that class. This is the 26.1% figure you calculated.

Association Rule - Non-Binary Items

I have studied association rules and know how to implement the algorithm on the classic basket of goods problem, such as:
Transaction ID Potatoes Eggs Milk
A 1 0 1
B 0 1 1
In this problem each item has a binary identifier. 1 indicates the basket contains the good, 0 indicates it does not.
But what would be the best way to model a basket which can contain many of the same good? E.g., take the below, very unrealistic example.
Transaction ID Potatoes Eggs Milk
A 5 0 178
B 0 35 7
Using binary indicators in this case would obviously be losing a lot of information and I am seeking a model which takes into account not only the presence of items in the basket, but also the frequency that the items occur.
What would be a suitable algorithm for this problem?
In my actual data there are over one hundred items and, based on the profile of a user's basket, I would like to calculate the probabilities of the customer consuming the other available items.
An alternative is to use binary indicators but constructing them in a more clever way.
The idea is to set the indicator when an amount is more than the central value, which means that it shall be significant. If everyone buys 3 breads on average, does it make sense to flag someone as a "bread-lover" for buying two or three?
Central value can a plain arithmetic mean, one with outliers removed, or the median.
Instead of:
binarize(x) = 0 if x = 0
1 otherwise
you can use
binarize*(x) = 0 if x <= central(X)
1 otherwise
I think if you really want to have probabilities is to encode your data in a probabilistic way. Bayesian or Markov networks might be a feasible way. Nevertheless without having a reasonable structure this will be computational extremely expansive. For three item types this, however, seems to be feasible
I would try to go for a Neural Network Autoencoder if you have many more item types. If there is some dependency in the data it will discover that.
For the above example you could use a network with three input, two hidden and three output neurons.
A little bit more fancy would be to use 3 fully connected layers with drop out in the middle layer.

Need help maximizing 3 factors in multiple, similar objects and ordering appropriately

I need to write an algorithm in any language that would order an array based on 3 factors. I use resorts as an example (like Hipmunk). Let's say I want to go on vacation. I want the cheapest spot, with the best reviews, and the most attractions. However, there is obviously no way I can find one that is #1 in all 3.
Example (assuming there are 20 important attractions):
Resort A: $150/night...98/100 in favorable reviews...18 of 20 attractions
Resort B: $99/night...85/100 in favorable reviews...12 of 20 attractions
Resort C: $120/night...91/100 in favorable reviews...16 of 20 attractions
Resort B looks the most appealing in price, but is 3rd in the other 2 categories. Wherein, I can choose resort C for only $21 more a night and get more attractions and better reviews. Price is still important to me, but Resort A has outstanding reviews and a ton of attractions: Is $51 more worth the splurge?
I want to be able to populate a list that will order a lit from "best to worst" (I quote bc it is subjective to the consumer). How would I go about maximizing the value for each resort?
Should I put a weight for each factor (ie: 55% price, 30% reviews, 15% amenities) and come to the result of a set number and order them that way?
Do I need a mode, median and range for all the hotels and determine the average price, and have the hotels around the average price hold the most weight?
If it is a little confusing then check out www.hipmunk.com. They have an airplane sort they call Agony (and a hotel sort which is similar to my question) that they use as their own. I used resorts as an example to make my question hopefully make a little more sense. How does one put math to a problem like this?
I was about to ask the same question about multiple-factor weighted sorting, because my research only came up with answers (e.g. formulas with explanations) for two-factor sorting.
Even though we're both asking about 3 factors, I'll list the possibilities I've found in case they're helpful.
Possibilities:
Note: S is the "sorting score", which is what you'd sort by (asc or desc).
"Linearly weighted" - use a function like: S = (w1 * F1) + (w2 * F2) + (w3 * F3), where wx are arbitrarily assigned weights, and Fx are the values of the factors. You'd also want to normalize F (i.e. Fx_n = Fx / Fmax).
"Base-N weighted" - more like grouping than weighting, it's just a linear weighting where weights are increasing multiples of base-10 (a similar principle to CSS selector specificity), so that more important factors are significantly higher: S = 1000 * F1 + 100 * F2 ....
Estimated True Value (ETV) - this is apparently what Google Analytics introduced in their reporting, where the value of one factor influences (weights) another factor - the consequence being to sort on more "statistically significant" values. The link explains it pretty well, so here's just the equation: S = (F2 / F2_max * F1) + ((1 - (F2 / F2_max)) * F1_avg), where F1 is the "more important" factor ("bounce rate" in the article), and F2 is the "significance modifying" factor ("visits" in the article).
Bayesian Estimate - looks really similar to ETV, this is how IMDb calculates their rating. See this StackOverflow post for explanation; equation: S = (F2 / (F2+F2_lim)) * F1 + (F2_lim / (F2+F2_lim)) × F1_avg, where Fx are the same as #3, and F2_lim is the minimum threshold limit for the "significance" factor (i.e. any value less than X shouldn't be considered).
Options #3 and #4 look really promising, since you don't really have to choose an arbitrary weighting scheme like you do in #1 and #2, but then the problem is how do you do this for more than two factors?
In your case, assigning the weights in #1 would probably be fine. You'll need to fine-tune the algorithm depending on what your users consider more important - you could expose the weights wx as a filter (like 1-10 dropdown) so your users can adjust their search on the fly. Or if you wanted to get clever you could poll your users before they're searching ("Which is more important to you?") and then assign a weighting set based on the response, and after tracking enough polls you could autosuggest the weighting scheme based on most responses.
Hope that gets you on the right track.
What about having variable weights, and letting the user adjust it through some input like levers, so that the sort order will be dynamically updated?

Resources