What should be called the majority verdict in Random Forest?

Let us say I implemented a random forest algorithm with 20 trees, trained on 20 random subsets of the training data, and there are 4 different class labels that can be predicted.
So what exactly should be called a majority verdict?
If there are 20 trees in total, does a majority verdict require that the highest-voted class label get at least 10 votes, or does it simply need more votes than the other labels?
Example:
Total Trees = 20, Class Labels are {A,B,C,D}
Scenario 1:
A = 10 votes
B = 4 votes
C = 3 votes
D = 3 votes
Clearly, A is the winner here.
Scenario 2:
A = 6 votes
B = 5 votes
C = 5 votes
D = 4 votes
Can A be called the winner here?

If you are making a hard decision, meaning you are asked to return a single best guess, then yes, A is the winner: it only needs a plurality (more votes than any other label), not an absolute majority of at least 10 votes.
To capture the difference between these two cases, you can consider a soft-decision system instead, where you return the winner together with a confidence value. An example confidence in this case is the winner's share of the votes. Then the first case (10/20) would be a more confident estimate than the latter (6/20).
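For illustration, here is a minimal sketch of both schemes in Python; tree_votes is a hypothetical list of the per-tree predictions for a single sample (scenario 2 above).

from collections import Counter

# Hypothetical per-tree predictions for one sample (scenario 2 above).
tree_votes = ["A"] * 6 + ["B"] * 5 + ["C"] * 5 + ["D"] * 4

counts = Counter(tree_votes)
winner, votes = counts.most_common(1)[0]

print("winner:", winner)                       # hard decision: plurality winner
print("confidence:", votes / len(tree_votes))  # soft decision: 6/20 = 0.3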


Multiple output MLP

I have an MLP model and my goal is to predict 2 variables (so ideally I should have 2 neurons in my final layer). However, the first variable (let's call it var1) has sub-values (up to 12) and the second variable (var2) has just a single value. Does it make sense to have 13 neurons in my final layer and, during backprop, to calculate 2 losses (one w.r.t. var1 and the second w.r.t. var2), then sum them up?
For better intuition (my scenario is a bit complex), I'll use a house-prediction analogy.
Imagine we're trying to predict the price of houses and the number of rooms in each house (just for the sake of intuition); we would have just 2 neurons in the final layer. However, let's say we want to be more specific (and we have enough data) and predict the prices of houses in 12 different states alongside the number of rooms. So we'll have 13 neurons in our final layer (12 for the prices and 1 for the number of rooms).
Does this architecture make sense?
Does it make sense then to compute our loss w.r.t. the 2 variables independently and sum them up?
Something like this:
import torch.nn as nn

input_dims = 491776
output_dims = 13                 # 12 state prices + 1 room count

mse_loss = nn.MSELoss()
model = MLP(input_dims, output_dims)

output = model(x)                            # tensor of shape (batch, 13)
l1 = mse_loss(output[:, :-1], true_prices)   # first 12 outputs: prices
l2 = mse_loss(output[:, -1], true_rooms)     # last output: number of rooms
loss = l1 + l2

optimizer.zero_grad()
loss.backward()
optimizer.step()
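The MLP class itself isn't shown in the question; as a hypothetical minimal sketch, a module that produces the 13 outputs assumed above might look like this (the hidden size is arbitrary):

import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, input_dims, output_dims, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dims, hidden),
            nn.ReLU(),
            nn.Linear(hidden, output_dims),  # 12 prices + 1 room count
        )

    def forward(self, x):
        return self.net(x)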

Naive Bayes for forecasting a grade

I have a data set of grades in four lessons (for example lesson a, lesson b, lesson c, lesson d) for 100 students, and let's imagine these grades are associated with the grade of lesson f.
I want to implement naive Bayes to forecast the grade of lesson f from those four grades, but I don't know how to use the input for this.
I read about naive Bayes for spam mail detection, and there the probability of each word is calculated.
But for grades I do not know what probability I must calculate.
I have tried it like the spam case, but in this example I have just four names (one for each lesson).
In order to do a good classification, you need more information about the students than just the classes they are taking. Following your example, spam detection is based on words, such as words that generally signal spam (buy, promotion, money), or on the origin in the HTTP headers.
To predict a student's grade, you could imagine having information about the student such as social class, whether they do sport, male or female, and so on.
Getting back to your question, it is not the names of the lessons that are interesting but the grades each student got in these lessons. You need to take the grades of the four lessons and of lesson f to train the naive Bayes classifier.
Your input data might look like this:
StudentID  gradeA  gradeB  gradeC  gradeD  gradeF
1          10      9       8       5       8
2          3       5       3       8       8
3          5       3       1       1       2
4          10      10      10      5       4
After training your classifier, you will pass in an entry for a new student like this:
StudentID  gradeA  gradeB  gradeC  gradeD
1058       1       5       8       4
The classifier will be able to predict the grade for lesson f taking the preceding grades into consideration.
You might have noticed that I intentionally made a training data set where gradeF is highly correlated with gradeD. That is what the Bayes classifier will try to learn, just in a more complex way.
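As a minimal sketch (assuming scikit-learn is available, and treating each grade as a categorical feature), training and prediction on the toy table above could look like this:

import numpy as np
from sklearn.naive_bayes import CategoricalNB

# Training data: one row per student, grades in lessons a..d.
X_train = np.array([[10,  9,  8, 5],
                    [ 3,  5,  3, 8],
                    [ 5,  3,  1, 1],
                    [10, 10, 10, 5]])
y_train = np.array([8, 8, 2, 4])  # grade in lesson f (the class label)

clf = CategoricalNB()
clf.fit(X_train, y_train)

# Predict the lesson-f grade for the new student above.
print(clf.predict(np.array([[1, 5, 8, 4]])))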

Genetic Algorithm Crossover

I have a GA with a population of X.
After I run the genes and get the result for each one, I do some weighted multiplication of the genes (so the better-ranked genes get multiplied the most).
I get either x*2 or x*2+(x*100/10) genes. The 10% is new random genes, which may or may not be added depending on the mutation rate.
The problem is, I don't know the best approach to reduce the population back to X.
If the gene pool is a list, should I just use list[::2] (i.e., take every even-indexed item from the list)?
What is a common practice when crossing genes?
EDIT:
Example of my GA with a population of 100:
Run the 100 genes through the fitness function and get the results. Current population: 100
Add 10% new random genes. Current population: 110
Duplicate the top 10% of genes. Current population: 121
Remove the 10% worst genes. Current population: 108
Combine all possible gene pairs (no duplicates). Current population: 5778
Remove genes from the gene pool until the population is 100 again. Current population: 100
Restart the fitness function.
What I want to know is: how should I do the last step? Currently I have a list with 5778 items and I take one out of every '58', or expressed differently, one every len(list)/startpopulation-1.
Or should I use a 'while True' loop with a random delete until len(list) == 100?
Should the new random genes be added before or after the crossover?
Is there a way to do a Gaussian multiplication of the items from top-rated to lowest-rated?
E.g.: the top-rated gene is multiplied by n, the second best by (n-1), the third by (n-2), ..., the worst by (n-n).
I do not really know why you are performing the GA like that; could you give some references?
In any case, here is my typical solution for implementing a functional GA:
1) Run the 100 genes through the fitness function and get the results.
2) Randomly choose 2 genes based on the normalized fitness (consider this the probability of each gene being chosen from the pool) and cross them over (see the sketch after this list). Repeat this step until you have 90 new genes (45 times for this case). Save the top 5 genes without modification and duplicate them. Total genes: 100.
3) For the 90 new genes and the 5 duplicates in the new pool, allow them to mutate based on your mutation probability (typically 1%). Total genes: 100.
4) Repeat from 1) to 3) until convergence, or for X iterations.
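Not from the question's code, but a minimal sketch of steps 2) and 3) in Python, assuming genes are lists of floats in [0, 1) and a hypothetical fitness() that returns a non-negative score:

import random

POP_SIZE = 100
N_ELITE = 5           # best genes kept (plus one duplicate of each)
MUTATION_RATE = 0.01

def next_generation(population, fitness):
    # score and rank the current pool
    scored = sorted(zip(map(fitness, population), population),
                    key=lambda sp: sp[0], reverse=True)
    genes = [g for _, g in scored]
    total = sum(s for s, _ in scored)
    weights = [s / total for s, _ in scored]   # normalized fitness

    # elitism: top 5 unchanged, plus a duplicate of each
    new_pop = [g[:] for g in genes[:N_ELITE]] + [g[:] for g in genes[:N_ELITE]]

    # fitness-proportionate selection + one-point crossover (2 children each)
    while len(new_pop) < POP_SIZE:
        a, b = random.choices(genes, weights=weights, k=2)
        cut = random.randrange(1, len(a))
        new_pop += [a[:cut] + b[cut:], b[:cut] + a[cut:]]

    # mutation; the first 5 elites are left untouched, as the note below advises
    for g in new_pop[N_ELITE:]:
        for i in range(len(g)):
            if random.random() < MUTATION_RATE:
                g[i] = random.random()         # assumed encoding: floats in [0, 1)
    return new_pop[:POP_SIZE]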
Note: you always want to keep the best genes unchanged, so that the best solution in each iteration is at least as good as in the last.
Good luck!

Wilson scoring doesn't factor in negative votes?

I'm using the Wilson scoring algorithm (code below) and realized it doesn't factor in negative votes.
Example:
Upvotes  Downvotes  Score
1        0          0.2070
0        0          0
0        1          0       <--- this is wrong
That isn't correct, as negative net votes should result in a lower score.
require 'cmath'

# Lower bound of the Wilson score confidence interval for the fraction of
# positive votes, at z = 1.96 (~95% confidence).
def calculate_wilson_score(up_votes, down_votes)
  total_votes = up_votes + down_votes
  return 0 if total_votes == 0

  z = 1.96
  positive_ratio = (1.0 * up_votes) / total_votes
  score = (positive_ratio + z * z / (2 * total_votes) -
           z * CMath.sqrt((positive_ratio * (1 - positive_ratio) +
                           z * z / (4 * total_votes)) / total_votes)) /
          (1 + z * z / total_votes)
  score.round(3)
end
Update:
Here is a description of the Wilson scoring confidence interval on Wikipedia.
The Wilson score lower confidence bound posted does take negative votes into account, although the lower confidence bound will not go below zero, which is perfectly fine. This approximation for ranking items is generally used to identify the highest-ranked items on a best-rated list. It may thus have undesirable properties when looking at the lowest-ranked items, which is what you are describing.
This method of ranking items was popularized by Evan Miller in a post on how not to sort by average rating, although he later stated:
The solution I proposed previously — using the lower bound of a confidence interval around the mean — is what computer programmers call a hack. It works not because it is a universally optimal solution, but because it roughly corresponds to our intuitive sense of what we'd like to see at the top of a best-rated list: items with the smallest probability of being bad, given the data.
If you are genuinely interested in analyzing the lowest-ranked items on a list, I would suggest either using the upper confidence bound, or using a Bayesian rating system as described in: https://stackoverflow.com/a/30111531/3884938
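For comparison, here is a sketch of the same interval in Python (the function name wilson_bounds is mine) returning both bounds; note how the upper bound, unlike the lower, distinguishes 0 up / 1 down from 0 up / 0 down:

from math import sqrt

def wilson_bounds(up, down, z=1.96):
    n = up + down
    if n == 0:
        return (0.0, 1.0)  # no data: the interval spans everything
    p = up / n
    centre = p + z * z / (2 * n)
    margin = z * sqrt((p * (1 - p) + z * z / (4 * n)) / n)
    denom = 1 + z * z / n
    return ((centre - margin) / denom, (centre + margin) / denom)

print(wilson_bounds(1, 0))  # (~0.207, 1.0)
print(wilson_bounds(0, 1))  # (0.0, ~0.793): the downvote lowers the upper bound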

What does recall mean in Machine Learning?

What's the meaning of the recall of a classifier, e.g. a Bayes classifier? Please give an example.
For example, precision = correct / (correct + wrong) docs for the test data. How should I understand recall?
Recall literally is how many of the true positives were recalled (found), i.e. how many of the correct hits were also found.
Precision (your formula is incorrect) is how many of the returned hits were true positives, i.e. how many of the found items were correct hits.
I found the explanation of Precision and Recall from Wikipedia very useful:
Suppose a computer program for recognizing dogs in photographs identifies 8 dogs in a picture containing 12 dogs and some cats. Of the 8 dogs identified, 5 actually are dogs (true positives), while the rest are cats (false positives). The program's precision is 5/8 while its recall is 5/12. When a search engine returns 30 pages only 20 of which were relevant while failing to return 40 additional relevant pages, its precision is 20/30 = 2/3 while its recall is 20/60 = 1/3.
So, in this case, precision is "how useful the search results are", and recall is "how complete the results are".
Precision in ML is the same as in Information Retrieval.
recall = TP / (TP + FN)
precision = TP / (TP + FP)
(Where TP = True Positive, TN = True Negative, FP = False Positive, FN = False Negative).
It makes sense to use these notations for a binary classifier; usually the "positive" is the less common class. Note that the precision/recall metrics are actually the specific form of the more general confusion matrix for #classes = 2.
Also, your formula for "precision" is actually accuracy, which is (TP + TN) / ALL.
Let me give you an example. Imagine we have a machine learning model which can detect cat vs dog. The actual label, which is provided by a human, is called the ground truth, and the output of your model is called the prediction. Now look at the following table:
ExampleNo  Ground-truth  Model's Prediction
0          Cat           Cat
1          Cat           Dog
2          Cat           Cat
3          Dog           Cat
4          Dog           Dog
Say we want to find the recall for the class cat. By definition, recall is the percentage of a certain class correctly identified (out of all of the given examples of that class). For the class cat the model correctly identified it 2 times (in examples 0 and 2). But does that mean there are actually only 2 cats? No! In reality there are 3 cats in the ground truth (human-labeled). So what is the percentage of correct identifications of this class? 2 out of 3, that is (2/3) * 100% = 66.67%, or 0.667 if you normalize it to within 1. There is another prediction of cat in example 3, but it is not a correct prediction, and hence we do not count it.
Now, coming to the mathematical formulation, first understand two terms:
TP (true positive): predicting something as positive when it is actually positive. If cat is our positive class, then predicting something is a cat when it actually is a cat.
FN (false negative): predicting something as negative when it is actually positive. If cat is our positive class, then predicting something is not a cat when it actually is a cat.
Now, for a certain class, this classifier's output can be of two types: Cat or Dog (not Cat). So the number of correct identifications is the number of true positives (TP). Likewise, the total number of examples of that class in the ground truth is TP + FN, because out of all cats the model either detected them correctly (TP) or didn't (FN, i.e., the model falsely said negative (non-cat) when the example was actually positive (cat)). So for a certain class, TP + FN is the total number of examples of that class available in the ground truth, and the formula is:
Recall = TP / (TP + FN)
Similarly, recall can be calculated for Dog as well. In that case, think of Dog as the positive class and Cat as the negative class.
So, for any number of classes, to find the recall of a certain class, take that class as the positive class and the rest of the classes as the negative classes, and use the formula. Repeat the process for each of the classes to find the recall for all of them.
If you want to learn about precision as well then go here: https://stackoverflow.com/a/63121274/6907424
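To make this concrete, here is a quick check of the cat/dog table above in plain Python, computing recall (and precision) per class:

ground_truth = ["Cat", "Cat", "Cat", "Dog", "Dog"]
prediction   = ["Cat", "Dog", "Cat", "Cat", "Dog"]

pairs = list(zip(ground_truth, prediction))
for cls in ("Cat", "Dog"):
    tp = sum(1 for gt, pr in pairs if gt == cls and pr == cls)
    fn = sum(1 for gt, pr in pairs if gt == cls and pr != cls)
    fp = sum(1 for gt, pr in pairs if gt != cls and pr == cls)
    print(cls, "recall:", tp / (tp + fn), "precision:", tp / (tp + fp))
# Cat recall: 2/3 ~ 0.667 (as derived above); Dog recall: 1/2 = 0.5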
In very simple language: for example, in a series of photos showing politicians, how many of the photos recognised as showing A. Merkel actually show her and not some other politician?
Precision accounts for how many times ANOTHER person was wrongly recognized as her (false positives): (correct hits) / ((correct hits) + (false positives)).
Recall accounts for how many of her photos were actually found ('recalled') rather than missed (false negatives): (correct hits) / ((correct hits) + (false negatives)).
