Determining propensity scores with various classifiers - machine-learning

I've read a few papers which consider using classifiers besides logistic regression to determine propensity scores, but none of them explain how exactly to get the propensity scores from the model. Specifically, I'm interested in using J48, Bagging, and AdaBoostM1 (the latter two most likely either with logistic regression or J48). For J48 I was thinking I could just use the probability of that node being classified as yes to determine propensity and for the other two, take the probability from each tree or equation and average them. But I'm wondering if there is a correct way to do this. I have very little statistical background and no ML background. Any good resources would be greatly appreciated. Thanks.
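One way to sketch the idea from the question, assuming scikit-learn stand-ins rather than the Weka classes (DecisionTreeClassifier is CART, not J48/C4.5, and AdaBoostClassifier is not identical to AdaBoostM1). The data, model settings, and the names X, t, and propensity are made up for illustration. In scikit-learn, a single tree's predict_proba returns the class fraction in the leaf an instance falls into, and BaggingClassifier averages the predicted probabilities of its base estimators, which matches the averaging described above. Boosted probabilities come from the weighted ensemble score and may be poorly calibrated, so treat them with more caution.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: X holds covariates, t is the 0/1 treatment indicator.
X, t = make_classification(n_samples=500, n_features=6, random_state=0)

# Illustrative model settings; none of these come from the question.
models = {
    "logistic": LogisticRegression(max_iter=1000),
    "tree": DecisionTreeClassifier(min_samples_leaf=20, random_state=0),
    "bagged_trees": BaggingClassifier(
        DecisionTreeClassifier(min_samples_leaf=20),
        n_estimators=50,
        random_state=0,
    ),
    "adaboost": AdaBoostClassifier(n_estimators=50, random_state=0),
}

propensity = {}
for name, model in models.items():
    model.fit(X, t)
    # Columns of predict_proba follow model.classes_, so find the column for class 1.
    treated_col = list(model.classes_).index(1)
    propensity[name] = model.predict_proba(X)[:, treated_col]

for name, scores in propensity.items():
    print(f"{name}: min={scores.min():.3f}, max={scores.max():.3f}")
```

Requiring a minimum leaf size keeps a single tree from returning only 0 or 1 as its estimate, which would make the scores useless for matching or weighting.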

Related

Regression Model Comparison

I'm looking for metrics to compare various regression models (e.g. SVM, Decision Tree, Neural Network, etc.), to decide the merits of each for solving a specific problem.
For my problem I have just over 80,000 training samples with 12 variables, all of which are independent and identically distributed.
I've done most of my research into neural networks but I'm drawing a blank when trying to compare them against other models.
Any input (including reading suggestions) would be greatly appreciated, thanks!
You can compare regression models by calculating the mean squared error for each model over a test set. The best model will simply be the one with the least error.
Sadly, there is nothing like ROC curves for regression models, unless your output is a binary variable, as with logistic regression.
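A minimal sketch of that comparison with scikit-learn, assuming a synthetic dataset (make_friedman1) and three arbitrary candidate models with default-ish settings; the names candidates and test_mse are made up for illustration.

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

# Synthetic stand-in for the real data: 12 input variables, one continuous target.
X, y = make_friedman1(n_samples=1000, n_features=12, noise=1.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Illustrative candidates; swap in whichever models you want to compare.
candidates = {
    "linear_regression": LinearRegression(),
    "svr": SVR(),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

# Fit each model on the training split and score it on the held-out test split.
test_mse = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    test_mse[name] = mean_squared_error(y_test, model.predict(X_test))

for name, mse in sorted(test_mse.items(), key=lambda item: item[1]):
    print(f"{name}: test MSE = {mse:.3f}")

best_model = min(test_mse, key=test_mse.get)
```

The important part is that every model is scored on the same held-out samples, which none of them saw during training.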

Random forest is worse than linear regression? Is it normal and what is the reason?

I am trying to use machine learning to make predictions on a dataset. It is a regression problem with 180 input features and 1 continuously-valued output. I am trying to compare deep neural networks, random forest regression, and linear regression.
As I expected, the 3-hidden-layer deep neural network outperforms the other two approaches with a root mean square error (RMSE) of 0.1. However, I did not expect to see that random forest performs even worse than linear regression (RMSE 0.29 vs. 0.27). I expected the random forest to discover more complex dependencies between features and thereby decrease the error. I have tried tuning the parameters of the random forest (number of trees, maximum features, max_depth, etc.). I also tried K-fold cross-validation with different values of K, but the performance is still worse than linear regression.
I searched online, and one answer says linear regression may perform better if features have a smooth, nearly linear dependence on the covariates. I do not fully get the point, because if that were the case, deep neural networks should not give much of a performance gain either.
I am struggling to find an explanation. In what situation is random forest worse than linear regression while deep neural networks perform much better?
If your features have a linear relationship with the target variable, then a linear model usually performs better than a random forest model. It depends entirely on how linear the relationship between your features and the target is.
That said, linear models are not inherently superior, nor is random forest inherently inferior.
Try scaling and transforming the data using MinMaxScaler() from scikit-learn to see if the linear model improves further.
Pro Tips
If a linear model is working like a charm, you need to ask yourself why and how, and get into the basics of both models to understand why it worked on your data. These questions will lead you to better feature engineering. As a matter of fact, Kaggle Grandmasters do use linear models in stacking to get that top 1% score by capturing the linear relations in the dataset.
So at the end of the day, linear models can work wonders too.
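To see the effect on a case where the answer is known, here is a small sketch on synthetic data whose target is an exactly linear function of the features plus noise. The data, sizes, and noise level are assumptions chosen for illustration and are not the asker's dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data: the target is a linear combination of 20 features plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 20))
true_coef = rng.normal(size=20)
y = X @ true_coef + rng.normal(scale=0.3, size=2000)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

models = {
    "linear": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

# Test-set RMSE for each model.
rmse = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    rmse[name] = float(np.sqrt(mean_squared_error(y_test, predictions)))

print(rmse)
```

A random forest predicts piecewise-constant values, so it needs many splits to approximate a slope that the linear model captures with a single coefficient per feature. On data like this, the linear model should end up close to the noise level while the forest stays well above it.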

How can I get the relative importance of features of a logistic regression for a particular prediction?

I am using a Logistic Regression (in scikit) for a binary classification problem, and am interested in being able to explain each individual prediction. To be more precise, I'm interested in predicting the probability of the positive class, and having a measure of the importance of each feature for that prediction.
Using the coefficients (Betas) as a measure of importance is generally a bad idea as answered here, but I'm yet to find a good alternative.
So far the best I have found are the following 3 options:
Monte Carlo Option: Fixing all other features, re-run the prediction replacing the feature we want to evaluate with random samples from the training set. Do this a large number of times. This would establish a baseline probability for the positive class. Then compare with the probability of the positive class of the original run. The difference is a measure of Importance of the feature.
"Leave-one-out" classifiers: To evaluate the importance of a feature, first create a model which uses all features, and then another that uses all features except the one being tested. Predict the new observation using both models. The difference between the two would be the importance of the feature.
Adjusted betas: Based on this answer, ranking the importance of the features by 'the magnitude of its coefficient times the standard deviation of the corresponding parameter in the data.'
All options (using betas, Monte Carlo and "Leave-one-out") seem like poor solutions to me.
The Monte Carlo is dependent on the distribution of the training set, and I cannot find any literature to support it.
The "leave one out" would be easily tricked by two correlated features (when one were absent, the other one would step in to compensate, and both would be given 0 importance).
The adjusted betas sounds plausible, but I cannot find any literature to support it.
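For concreteness, the Monte Carlo option could be sketched roughly as follows. The helper monte_carlo_importance and its parameters are hypothetical names, the data is synthetic, and this is only one reading of the description above, not an established method.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data and a fitted binary classifier (classes are 0 and 1).
X_train, y_train = make_classification(n_samples=400, n_features=5, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)


def monte_carlo_importance(model, X_ref, x, n_draws=500, seed=0):
    """Hypothetical helper: for each feature, the original positive-class
    probability minus the mean probability obtained when that feature is
    replaced by random draws from the reference (training) set."""
    rng = np.random.default_rng(seed)
    original = model.predict_proba(x.reshape(1, -1))[0, 1]
    importance = np.empty(x.shape[0])
    for j in range(x.shape[0]):
        # Copy the instance n_draws times and overwrite only feature j.
        perturbed = np.tile(x, (n_draws, 1))
        perturbed[:, j] = rng.choice(X_ref[:, j], size=n_draws)
        baseline = model.predict_proba(perturbed)[:, 1].mean()
        importance[j] = original - baseline
    return importance


importance = monte_carlo_importance(clf, X_train, X_train[0])
print(importance)
```

As noted above, the result depends on the distribution of the training set, because that is where the replacement values are drawn from.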
Actual question: What is the best way to interpret the importance of each feature, at the moment of a decision, with a linear classifier?
Quick note #1: for Random Forests this is trivial, we can simply use the prediction + bias decomposition, as explained beautifully in this blog post. The problem here is how to do something similar with linear classifiers such as Logistic Regression.
Quick note #2: there are a number of related questions on stackoverflow (1 2 3 4 5). I have not been able to find an answer to this specific question.
If you want the importance of the features for a particular decision, why not simulate the decision_function step by step? It is provided by scikit-learn, so you can test whether you get the same value. The decision function for linear classifiers is simply:
intercept_ + coef_[0]*feature[0] + coef_[1]*feature[1] + ...
The importance of a feature i is then just coef_[i]*feature[i]. Of course this is similar to looking at the magnitude of the coefficients, but since it is multiplied with the actual feature and it is also what happens under the hood it might be your best bet.
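A minimal sketch of this check with scikit-learn's LogisticRegression on synthetic data (the data and variable names are made up for illustration). Note that for a binary problem scikit-learn stores the coefficients as an array of shape (1, n_features), so the weight vector is coef_[0].

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data and a fitted binary classifier.
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)

x = X[0]  # the single instance whose prediction we want to explain

# Per-feature contribution to the decision score: coefficient times feature value.
contributions = clf.coef_[0] * x
score = clf.intercept_[0] + contributions.sum()

# The manual score should match scikit-learn's decision_function.
library_score = clf.decision_function(x.reshape(1, -1))[0]

# The positive-class probability is the logistic function of the score.
probability = 1.0 / (1.0 + np.exp(-score))

# Print features from largest to smallest absolute contribution.
for i in np.argsort(-np.abs(contributions)):
    print(f"feature {i}: contribution {contributions[i]:+.3f}")
```

The contributions add up to the log-odds of the prediction minus the intercept, so their signs show which features pushed this particular instance toward or away from the positive class.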
I suggest using eli5, which already has similar things implemented.
For your question:
Actual question: What is the best way to interpret the importance of each feature, at the moment of a decision, with a linear classifier?
I would say the answer comes from the function show_weights() from eli5.
Furthermore, this can be used with many other classifiers.
For more info, see this related question.

Naive Bayes and Logistic Regression Error Rate

I have been trying to figure out the correlation between the error rate and the number of features in both of these models. I watched some videos, and the creator of the video said that a simple model can be better than a complicated model. So I figured that the more features I had, the greater the error rate would be. This did not prove to be true in my work: when I had fewer features, the error rate went up. I'm not sure if I'm doing this incorrectly, or if the guy in the video made a mistake. Would someone care to explain? I am also curious how features relate to Logistic Regression's error rate.
Naive Bayes and Logistic Regression are a "generative-discriminative pair," meaning they have the same model form (a linear classifier), but they estimate parameters in different ways.
For feature x and label y, naive Bayes estimates a joint probability p(x,y) = p(y)*p(x|y) from the training data (that is, it builds a model that could "generate" the data), and uses Bayes' rule to predict p(y|x) for new test instances. On the other hand, logistic regression estimates p(y|x) directly from the training data by minimizing an error function (which is more "discriminative").
These differences have implications for error rate:
When there are very few training instances, logistic regression might "overfit," because there isn't enough data to estimate p(y|x) reliably. Naive Bayes might do better because it models the entire joint distribution.
When the feature set is large (and sparse, like word features in text classification) naive Bayes might "double count" features that are correlated with each other, because it assumes that each p(x|y) event is independent, when they are not. Logistic regression can do a better job by naturally "splitting the difference" among these correlated features.
If the features really are (mostly) conditionally independent, both models might actually improve with more and more features, provided there are enough data instances. The problem comes when the training set size is small relative to the number of features. Priors on naive Bayes feature parameters, or regularization methods (like L1/Lasso or L2/Ridge) on logistic regression can help in these cases.
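A quick way to look at this empirically is to measure the test error of both models while varying the number of features. The sketch below assumes synthetic data from make_classification and GaussianNB as the naive Bayes variant (the question does not say which variant was used); the feature counts and the name error_rates are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# With shuffle=False the columns are ordered: 10 informative features,
# then 10 redundant (correlated) ones, then 20 pure-noise features.
X, y = make_classification(
    n_samples=600,
    n_features=40,
    n_informative=10,
    n_redundant=10,
    shuffle=False,
    random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

error_rates = {}
for k in (5, 10, 20, 40):
    # Use only the first k columns for both models.
    nb = GaussianNB().fit(X_train[:, :k], y_train)
    # LogisticRegression applies L2 regularization by default (C=1.0).
    lr = LogisticRegression(max_iter=1000).fit(X_train[:, :k], y_train)
    error_rates[k] = {
        "naive_bayes": 1.0 - nb.score(X_test[:, :k], y_test),
        "logistic_regression": 1.0 - lr.score(X_test[:, :k], y_test),
    }

for k, errors in error_rates.items():
    print(k, errors)
```

Going from 5 to 10 columns adds informative features, going to 20 adds features correlated with the existing ones, and going to 40 adds noise, so the three cases discussed above can be examined separately.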

What's the meaning of logistic regression dataset labels?

I've been learning Logistic Regression for a few days, and I think the labels in a logistic regression dataset need to be 1 or 0. Is that right?
But when I looked at the libSVM library's regression datasets, I saw that the label values are continuous numbers (e.g. 1.0086, 1.0089, ...). Did I miss something?
Note that the libSVM library can be used for regression problems.
Thanks so much!
Contrary to its name, logistic regression is a classification algorithm and it outputs class probability conditioned on the data point. Therefore the training set labels need to be either 0 or 1. For the dataset you mentioned, logistic regression is not a suitable algorithm.
SVM is a classification algorithm and it uses the input labels -1 or 1. It is not a probabilistic algorithm and it doesn't output class probabilities. It can also be adapted to regression.
Are you using a 3rd party library or programming this yourself? Generally the labels are used as ground truth so you can see how effective your approach was.
For example, if your algorithm is trying to predict what a particular instance is, it might output -1 while the ground truth label is +1, which means you did not successfully classify that particular instance.
Note that "regression" is a general term. To say someone will perform regression analysis doesn't necessarily tell you what algorithm they will be using, nor all of the nature of the data sets. All it really tells you is that you have a set of samples with features which you want to use to predict a single outcome value (a model for conditional probability).
One major difference between logistic regression and linear regression is that the former is usually trained on categorical, binary-labeled sample sets; while the latter is trained on real-labeled (ℝ) sample sets.
Any time your labels are real valued, it means you're probably going to use linear regression or similar, or else convert those real valued labels to categorical labels (e.g. via thresholds or bins) if you want to in fact use logistic regression. There is potentially a big difference in the quality and interpretation of your results though, if you try to convert from one such problem setup to another.
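As a rough illustration of the two setups, the sketch below fits a linear regression on real-valued labels and then thresholds the same labels to train a logistic regression. The data is synthetic and the median threshold is an arbitrary choice made only for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Synthetic data with real-valued labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y_real = 1.0 + X @ np.array([0.5, -0.2, 0.1]) + rng.normal(scale=0.1, size=200)

# Real-valued labels: fit a regressor directly.
reg = LinearRegression().fit(X, y_real)
predicted_values = reg.predict(X[:5])

# To use logistic regression instead, first convert the labels to classes.
threshold = np.median(y_real)  # illustrative threshold, not a recommendation
y_binary = (y_real > threshold).astype(int)

clf = LogisticRegression().fit(X, y_binary)
positive_proba = clf.predict_proba(X[:5])[:, 1]

print(predicted_values)
print(positive_proba)
```

The regressor predicts the label value itself, while the classifier predicts the probability that the value lies above the threshold, so the two outputs answer different questions.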
See also Regression Analysis.

Resources