Some questions about SAMME.R - machine-learning

The picture of the algorithm
The paper about SAMME.R algorithm
Firstly, in step 2a, it fits a classifier T(x) to the training data using the weights, but I don't know how the algorithm uses the classifier T(x) in the following steps.
Secondly, in step 2b, I don't know how to obtain the weighted class probability estimates. It just says we can use a decision tree to estimate the probabilities, but I don't know how to do it.
Thanks in advance. My English is poor and my question may be vague. I am really sorry for this. If you can't understand my question, just comment it and I will try my best to expound my question clearly! Thank you very much!!

I also found myself contemplating the same problem not too long ago. For what it's worth, here is my opinion on the matter:
Step 2a
Train a DecisionTree or any other classifier that can supply probability estimates. You can find an interesting article on estimating probabilities with DecisionTrees here. This classifier will be used in step 2b.
Step 2b
One way to look at this is to expand the formula:
In words, to compute the weighted probability for some label i, you multiply the probability estimated at the previous step for label i by the sum of the weights of the samples that have label i. In practice, the classifier from step 2a may use the weights in some other way and, in the end, only supply the weighted probability estimates. A nice post on this for decision trees is here.
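To make steps 2a and 2b concrete, here is a minimal sketch using scikit-learn. The synthetic data, the tree depth, and the variable names are my own assumptions for illustration, and the last step reflects my reading of how the paper turns the probabilities into a per-class contribution, so treat it as a sketch rather than the reference implementation.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class data (illustrative only, not from the paper).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int) + (X[:, 0] > 1).astype(int)

# Start from uniform observation weights.
w = np.full(len(y), 1.0 / len(y))

# Step 2a: fit a shallow tree to the training data using the weights.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y, sample_weight=w)

# Step 2b: weighted class probability estimates p_k(x). For a tree this is
# the weighted fraction of each class in the leaf that x falls into.
proba = tree.predict_proba(X)

# Next step (as I read the paper): per-class contribution
# h_k(x) = (K - 1) * (log p_k(x) - mean over k' of log p_k'(x)).
K = proba.shape[1]
log_p = np.log(np.clip(proba, 1e-12, None))  # avoid log(0) in pure leaves
h = (K - 1) * (log_p - log_p.mean(axis=1, keepdims=True))
```

So T(x) is never used as a hard classifier here: only its probability output `proba` feeds into the later steps.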
I hope you find this answer helpful!

Related

Is there a loss function considering bias and variance?

I'm trying to understand bias and variance better.
I'm wondering if there is a loss function that takes bias and variance into account.
As far as I know, high bias leads to underfitting, and high variance leads to overfitting.
the image from here
If we could consider bias and variance in the loss, it could look like this: bias(x) + variance(x) + some_other_loss(x). My question has two parts:
1) Is there a loss function that considers bias and variance?
2) If the losses we normally use already account for bias and variance, how can I measure the bias and the variance separately as scores?
This kind of question could be a fundamental mathematical question, I think. If you have any hints, I'd really appreciate them.
Thank you for reading my weird question.
After writing the question, I realized that regularization is one way to reduce variance. Then, 3) is there a way to measure the bias as a score?
Thank you again.
Update on Jan 16th, 2022
I have searched a little and am answering my own question. If I have misunderstood anything, please comment below.
Bias is reflected in the loss value during training, so we don't need an additional bias loss function.
But for the variance, there is no way to score it, because to measure it we would need both the training loss and the loss on unseen data. Once we use the unseen data in the training loss, it becomes seen data, so it is no longer unseen from the model's point of view. So, as far as I understand, there is no way to measure variance in the training loss.
I hope this helps other people; please comment with your thoughts if you have any.
As you have clearly stated, high bias means the model is underfitting compared to a good fit, and high variance means it is overfitting compared to a good fit.
Measuring either of them requires you to know the good fit in advance, which happens to be the end goal of training a model. Hence, it is not possible to measure underfitting or overfitting during training itself. However, if you have an idea of a target loss, you can use an early stopping callback to stop around the good fit.
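A common practical proxy, even without knowing the good fit in advance, is to hold out a validation set and compare its loss with the training loss. The sketch below is only an illustration under my own assumptions (synthetic data, decision trees of different depths as stand-ins for simple and complex models); it is not a formal bias/variance decomposition.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy data (illustrative only).
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

results = {}
for depth in (1, 4, None):  # None lets the tree grow until it fits the training set
    model = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    train_loss = log_loss(y_tr, model.predict_proba(X_tr))
    val_loss = log_loss(y_val, model.predict_proba(X_val))
    results[depth] = (train_loss, val_loss)
    print(f"max_depth={depth}: train={train_loss:.3f} "
          f"val={val_loss:.3f} gap={val_loss - train_loss:.3f}")
```

A high training loss hints at underfitting, while a low training loss with a much higher validation loss hints at overfitting. The validation data is never used for fitting, so it stays unseen from the model's point of view.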

How do sample weights work in classification models?

What does it mean to provide weights to each sample for classification? How does a classification algorithm like logistic regression or SVMs use weights to emphasize certain examples more than others? I would love to go into detail and unpack how these algorithms leverage sample weights.
If you look at the sklearn documentation for logistic regression, you can see that the fit function has an optional sample_weight parameter which is defined as an array of weights assigned to individual samples.
This option is meant for imbalanced datasets. Let's take an example: I've got a lot of data and some of it is just noise, but other points are really important to me and I'd like my algorithm to give them much more consideration than the rest. So I assign a weight to each of them to make sure they are dealt with properly.
It changes the way the loss is calculated. The error (residual) of each point is multiplied by the weight of that point, and thus the minimum of the objective function is shifted. I hope that's clear enough. I don't know if you're familiar with the math behind it, so I am providing a short introduction here to have everything at hand (apologies if this was not needed):
https://perso.telecom-paristech.fr/rgower/pdf/M2_statistique_optimisation/Intro-ML-expanded.pdf
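To illustrate how a weight enters the loss, here is a small NumPy sketch. The helper `weighted_log_loss` and the numbers are hypothetical, made up for this example; real libraries may normalize or regularize differently.

```python
import numpy as np

def weighted_log_loss(y, p, w):
    """Weighted average of the per-sample log losses."""
    per_sample = -(y * np.log(p) + (1 - y) * np.log(1 - p))
    return np.sum(w * per_sample) / np.sum(w)

y = np.array([1, 0, 1, 0])           # true labels
p = np.array([0.9, 0.2, 0.4, 0.7])   # predicted P(y = 1)
w = np.array([1, 1, 3, 1])           # the third sample counts three times

loss_unweighted = weighted_log_loss(y, p, np.ones(4))
loss_weighted = weighted_log_loss(y, p, w)

# An integer weight of 3 acts like repeating that sample three times.
y_dup = np.array([1, 0, 1, 1, 1, 0])
p_dup = np.array([0.9, 0.2, 0.4, 0.4, 0.4, 0.7])
loss_dup = weighted_log_loss(y_dup, p_dup, np.ones(6))
```

Because the third sample is poorly predicted and carries more weight, `loss_weighted` is larger than `loss_unweighted`, so an optimizer would work harder to fix that sample. In scikit-learn the weights are passed as `fit(X, y, sample_weight=w)`.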
See a good explanation here: https://www.kdnuggets.com/2019/11/machine-learning-what-why-how-weighting.html .

Machine Learning strategy needed

I have a group of 20 yes/no/na questions that my company uses to assess whether or not to bid for an opportunity. To date, we have filled out the questionnaire 634 times.
The current algorithm simply divides yes / (yes + no) and a score over 50% recommends that we pursue the opportunity. n/a answers are disregarded.
We have tracked win/loss data on all of the pursuits, so I have a labeled dataset and I'm considering a supervised machine learning algorithm to replace our crude yes/no calculation.
I'm looking for a suggested method of supervised machine learning in Python (I'm most familiar with SKLearn). Decision Tree Classifier?
Thank you in advance.
You have 20 y/n answers as features. Let yes be 1 and no be 0. So there are 20 binary features.
You also have target variable (win/loss) data. Let win be 1 and loss be 0. You can use an SVM or a NN right away. In my experience, SVM and logistic regression give similar accuracies.
But if you are looking to explain each feature's contribution in shaping the decision, you should use Naive Bayes or decision trees.
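As a rough sketch of that setup in scikit-learn: the answers and outcomes below are random placeholders standing in for the real questionnaire, and mapping n/a to 0.5 is my own assumption (the encoding above only covers yes and no), so the printed accuracies are meaningless in themselves.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Placeholder data: 634 questionnaires x 20 questions, plus win/loss labels.
rng = np.random.default_rng(0)
answers = rng.choice(["yes", "no", "na"], size=(634, 20))
won = rng.integers(0, 2, size=634)

# yes -> 1, no -> 0; n/a -> 0.5 is an assumption, not part of the advice above.
X = np.select([answers == "yes", answers == "no"], [1.0, 0.0], default=0.5)

# Baseline: the current rule, yes / (yes + no) > 50%, ignoring n/a.
yes = (answers == "yes").sum(axis=1)
no = (answers == "no").sum(axis=1)
baseline_pred = (yes / np.maximum(yes + no, 1) > 0.5).astype(int)
baseline_acc = (baseline_pred == won).mean()

# Candidate models, compared with 5-fold cross-validated accuracy.
models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(max_depth=3, random_state=0),
}
scores = {name: cross_val_score(m, X, won, cv=5).mean() for name, m in models.items()}
print("baseline:", round(baseline_acc, 3), scores)
```

Comparing the cross-validated scores against the baseline rule on the real data would show whether a learned model actually improves on the current yes/no calculation.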
It is important to know who is saying yes and no, so if you have 10 experts answering those 20 questions with yes/no/na, you have 10x20x3 states or binary features, where every expert has 60 features.
Besides, you can use features of the project itself, such as whether the project is in the oil, mining, or manufacturing industry, etc. Some experts might be better at prediction in one industry than in others.
For classification, you can try random forests from sklearn.
Note that instead of classification (labelling if the project was pursued or disregarded) you can change the problem into a regression task by labelling the samples with the amount of profit or loss the company achieved from either pursuing (- or +) or disregarding (0) the project.
Hope this helps.

How can I get the relative importance of features of a logistic regression for a particular prediction?

I am using a Logistic Regression (in scikit) for a binary classification problem, and am interested in being able to explain each individual prediction. To be more precise, I'm interested in predicting the probability of the positive class, and having a measure of the importance of each feature for that prediction.
Using the coefficients (Betas) as a measure of importance is generally a bad idea as answered here, but I'm yet to find a good alternative.
So far the best I have found are the following 3 options:
Monte Carlo Option: Fixing all other features, re-run the prediction replacing the feature we want to evaluate with random samples from the training set. Do this a large number of times. This would establish a baseline probability for the positive class. Then compare with the probability of the positive class of the original run. The difference is a measure of Importance of the feature.
"Leave-one-out" classifiers: To evaluate the importance of a feature, first create a model which uses all features, and then another that uses all features except the one being tested. Predict the new observation using both models. The difference between the two would be the importance of the feature.
Adjusted betas: Based on this answer, ranking the importance of the features by 'the magnitude of its coefficient times the standard deviation of the corresponding parameter in the data.'
All options (using betas, Monte Carlo and "Leave-one-out") seem like poor solutions to me.
The Monte Carlo is dependent on the distribution of the training set, and I cannot find any literature to support it.
The "leave one out" would be easily tricked by two correlated features (when one were absent, the other one would step in to compensate, and both would be given 0 importance).
The adjusted betas sounds plausible, but I cannot find any literature to support it.
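For reference, the Monte Carlo option could be sketched roughly as follows. The function `monte_carlo_importance`, the synthetic data, and the number of draws are hypothetical choices for illustration, not an established method from the literature.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=5, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)

def monte_carlo_importance(clf, X_train, x, n_draws=1000, seed=0):
    """For each feature, compare the original positive-class probability with
    the average probability when that feature is resampled from the training set."""
    rng = np.random.default_rng(seed)
    p_orig = clf.predict_proba(x.reshape(1, -1))[0, 1]
    importance = np.empty(x.shape[0])
    for j in range(x.shape[0]):
        X_mc = np.tile(x, (n_draws, 1))                     # all other features fixed
        X_mc[:, j] = rng.choice(X_train[:, j], size=n_draws)
        p_baseline = clf.predict_proba(X_mc)[:, 1].mean()
        importance[j] = p_orig - p_baseline
    return importance

importance = monte_carlo_importance(clf, X, X[0])
print(importance)
```

The result depends on the empirical distribution of each training column, which is exactly the dependence on the training set noted above.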
Actual question: What is the best way to interpret the importance of each feature, at the moment of a decision, with a linear classifier?
Quick note #1: for Random Forests this is trivial, we can simply use the prediction + bias decomposition, as explained beautifully in this blog post. The problem here is how to do something similar with linear classifiers such as Logistic Regression.
Quick note #2: there are a number of related questions on stackoverflow (1 2 3 4 5). I have not been able to find an answer to this specific question.
If you want the importance of the features for a particular decision, why not simulate the decision_function (which is provided by scikit-learn, so you can test whether you get the same value) step by step? The decision function for linear classifiers is simply:
intercept_ + coef_[0]*feature[0] + coef_[1]*feature[1] + ...
The importance of a feature i is then just coef_[i]*feature[i]. Of course this is similar to looking at the magnitude of the coefficients, but since it is multiplied with the actual feature and it is also what happens under the hood it might be your best bet.
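A minimal sketch of that check with scikit-learn, on synthetic data chosen only for illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=5, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)

x = X[0]                                  # the observation to explain
contributions = clf.coef_[0] * x          # coef_[i] * feature[i] for each i
score = clf.intercept_[0] + contributions.sum()

# The manual sum should match scikit-learn's own decision_function.
expected = clf.decision_function(x.reshape(1, -1))[0]
print(np.isclose(score, expected))

# Features ordered by the absolute size of their contribution to this prediction.
ranking = np.argsort(-np.abs(contributions))
print(ranking, contributions[ranking])
```

Note that the contributions are on the log-odds scale, not the probability scale, and their size depends on how each feature is scaled and centered.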
I suggest using eli5, which already has similar things implemented.
For your question:
Actual question: What is the best way to interpret the importance of each feature, at the moment of a decision, with a linear classifier?
I would say the answer comes from the function show_weights() in eli5.
Furthermore, this can be implemented with many other classifiers.
For more information, see this question in the related questions.

SGD model "overconfidence"

I'm working on a binary classification problem using Apache Mahout. The algorithm I use is OnlineLogisticRegression, and the model I currently have strongly tends to produce predictions that are either 1 or 0, without any middle values.
Please suggest a way to tune or tweak the algorithm to make it produce more intermediate values in predictions.
Thanks in advance!
What is the test error rate of the classifier? If it's near zero then being confident is a feature, not a bug.
If the test error rate is high (or at least not low), then the classifier might be overfitting the training set: measure the difference between the training error and the test error. In that case, increasing regularization as rrenaud suggested might help.
If your classifier is not overfitting, then there might be an issue with the probability calibration. Logistic Regression models (e.g. using the logit link function) should yield good enough probability calibrations (if the problem is approximately linearly separable and the label not too noisy). You can check the calibration of the probabilities with a plot as explained in this paper. If this is really a calibration issue, then implementing a custom calibration based on Platt scaling or isotonic regression might help fix the issue.
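Mahout itself is Java, so the following is only an analogy in scikit-learn, not Mahout code. It sketches how one might check calibration and apply Platt scaling; the synthetic data and the weakly regularized model are my own assumptions.

```python
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Noisy synthetic data (illustrative only).
X, y = make_classification(n_samples=2000, n_features=20, flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

# A weakly regularized model standing in for the overconfident classifier.
raw = LogisticRegression(C=1e4, max_iter=5000).fit(X_tr, y_tr)
p_raw = raw.predict_proba(X_te)[:, 1]

# Reliability data: observed fraction of positives vs. mean predicted probability per bin.
frac_pos, mean_pred = calibration_curve(y_te, p_raw, n_bins=10)

# Platt scaling is method="sigmoid"; method="isotonic" gives isotonic regression.
calibrated = CalibratedClassifierCV(
    LogisticRegression(C=1e4, max_iter=5000), method="sigmoid", cv=5
).fit(X_tr, y_tr)
p_cal = calibrated.predict_proba(X_te)[:, 1]
```

Plotting `mean_pred` against `frac_pos` gives the calibration plot: a well-calibrated model lies close to the diagonal.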
From reading the Mahout AbstractOnlineLogisticRegression docs, it looks like you can control the regularization parameter lambda. Increasing lambda should mean your weights are closer to 0, and hence your predictions are more hedged.
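The effect of the regularization strength can be illustrated outside Mahout. This sketch uses scikit-learn's LogisticRegression as a stand-in (an assumption on my part, since the question is about Mahout), where the parameter C is the inverse of the regularization strength, so a small C plays the role of a large lambda.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

def mean_confidence(C):
    """Average distance of the predicted probabilities from 0.5."""
    clf = LogisticRegression(C=C, max_iter=5000).fit(X, y)
    p = clf.predict_proba(X)[:, 1]
    return np.abs(p - 0.5).mean()

weak_reg = mean_confidence(1e3)     # small lambda: weights can grow large
strong_reg = mean_confidence(1e-4)  # large lambda: weights pulled towards 0
print(weak_reg, strong_reg)
```

With strong regularization the probabilities cluster around the class prior, so the predictions become less extreme. Whether that also makes them more accurate has to be checked on held-out data.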

Resources