I am trying to understand the difference in assumptions required for Naive Bayes and Logistic Regression.
To my knowledge, both Naive Bayes and Logistic Regression require the features to be independent of each other, i.e. the predictors should not exhibit multicollinearity,
and only Logistic Regression additionally requires linearity between the independent variables and the log-odds.
Correct me if I am wrong, and are there any other assumptions or differences between Naive Bayes and Logistic Regression?
You are right, durga. The two often have similar performance as well.
One difference is that (Gaussian) NB assumes the continuous features follow normal distributions within each class, whereas logistic regression makes no such distributional assumption. As for speed, NB training is much faster.
Logistic regression, according to this source:
1) Requires observations to be independent of each other. In other words, the observations should not come from repeated measurements or matched data.
2) Requires the dependent variable to be binary; ordinal logistic regression requires the dependent variable to be ordinal.
3) Requires little or no multicollinearity among the independent variables. This means that the independent variables should not be too highly correlated with each other.
4) Assumes linearity of independent variables and log odds.
5) Typically requires a large sample size. A general guideline is that you need a minimum of 10 cases with the least frequent outcome for each independent variable in your model.
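For what it's worth, assumption (3) is something you can check directly. Below is a small sketch on synthetic toy data (not from any of the sources above) that uses the variance inflation factor from statsmodels to flag a near-duplicate predictor:

```python
# Toy sketch: checking assumption (3), little or no multicollinearity,
# with variance inflation factors (VIF). All data here is synthetic.
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))                                    # three independent predictors
X = np.column_stack([X, X[:, 0] + 0.01 * rng.normal(size=500)])  # plus a near-copy of the first

for i in range(X.shape[1]):
    print(f"feature {i}: VIF = {variance_inflation_factor(X, i):.1f}")
# The near-duplicate column (and the column it copies) will show a very
# large VIF, i.e. multicollinearity that would hurt a logistic regression.
```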
tl;dr:
Naive Bayes requires conditional independence of the variables. The regression family needs the features not to be highly correlated in order to have an interpretable/well-fit model.
Naive Bayes requires the features to meet the "conditional independence" requirement, which means that, given the class label, each feature is independent of every other feature: p(x_i | y, x_j) = p(x_i | y) for all i ≠ j.
This is very different from the "regression family" requirements. What they need is that the variables are not (highly) correlated. Even if the features are correlated, the regression model might only become overfit or harder to interpret, so if you use proper regularization you would still get a good prediction.
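To make the "conditional independence" point concrete, here is a tiny made-up numerical example (the probabilities are arbitrary) of how the naive Bayes factorization goes wrong when two features are actually copies of each other:

```python
# Toy illustration of the conditional independence assumption. Suppose x2 is an
# exact copy of x1, so within each class P(x1=1, x2=1 | y) = P(x1=1 | y).
# Naive Bayes instead multiplies the per-feature conditionals and gets
# P(x1=1 | y)**2, i.e. it "assumes away" the correlation.
p_x1_given_y = {0: 0.2, 1: 0.7}              # made-up P(x1 = 1 | y); x2 is a copy of x1

for y in (0, 1):
    true_joint = p_x1_given_y[y]             # since x2 == x1
    nb_joint = p_x1_given_y[y] ** 2          # naive Bayes' factorized estimate
    print(f"y={y}: true joint = {true_joint:.2f}, naive Bayes assumes = {nb_joint:.2f}")
```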
Related
I am sure this question may not be in the brilliant category, but somehow, to learn machine learning, I have to start with a possibly stupid question. So please bear with me.
I have partially understood the terminology of regression.
Regression essentially gives an idea of the relationship between the dependent and independent variables.
If the dependent variable is continuous and you see a linear relation between the dependent and independent variables, then linear regression is the way to go.
A slight change now: if the dependent variable is something like a binary value (Y/N), i.e. the output follows a binomial distribution, then logistic regression is the way to go, which models a non-linear relationship between the dependent and independent variables.
So far, so good? Please correct me if I am wrong.
Now my question is with respect to ordinal logistic regression.
I have started looking at the below link for reference
https://statistics.laerd.com/spss-tutorials/ordinal-regression-using-spss-statistics.php
Where it is mentioned that " It can be considered as either a generalisation of multiple linear regression or as a generalisation of binomial logistic regression".
Could someone help me understand this above statement with examples?
Logistic regression can be considered an extension of linear regression. But instead of predicting a continuous variable, it predicts a discrete variable by introducing the computation of an activation function. So you are asked to produce a discriminant function that, based on X, outputs a value in [1, 2, ..., k], where k is the number of classes your problem presents. Now X can be composed of features that are either continuous or discrete. It does not matter, just make sure you apply the appropriate pre-processing to them.
The base case for logistic regression is finding the decision boundary that divides two classes. But in order to add more classes, you have to implement another approach. There are several: softmax (https://en.wikipedia.org/wiki/Softmax_function), one-vs-all (https://en.wikipedia.org/wiki/Multiclass_classification), etc.
Finally, to answer your question: ordinal logistic regression is an extension of logistic regression that also takes into account the order of the output categories, such as grades on a test. Take a look online for examples.
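If you want to try it out, here is a minimal sketch of ordinal logistic regression using statsmodels' OrderedModel, assuming a reasonably recent statsmodels version. The "hours_studied" predictor and the fail/pass/distinction grades are entirely made up for illustration:

```python
# Toy ordinal logistic regression: an ordered outcome (fail < pass < distinction)
# driven by a single made-up predictor.
import numpy as np
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

rng = np.random.default_rng(0)
X = pd.DataFrame({"hours_studied": rng.normal(size=300)})
latent = 1.5 * X["hours_studied"] + rng.logistic(size=300)
grade = pd.cut(latent, bins=[-np.inf, -1, 1, np.inf],
               labels=["fail", "pass", "distinction"], ordered=True)

model = OrderedModel(grade, X, distr="logit")     # "ordinal logistic regression"
result = model.fit(method="bfgs", disp=False)
print(result.summary())                           # one slope plus ordered thresholds
```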
I am participating in the Kaggle San Francisco Crime competition and I am currently trying a number of different classifiers to establish benchmark performances. I am using LogisticRegression from sklearn, without any parameter tuning, and I noticed from sklearn.metrics.classification_report that it is only predicting the predominant classes, i.e. the classes which have the highest number of occurrences in my training set.
Intuition tells me that this comes down to parameter tuning, but I am not sure which parameters I have to tweak in order to make the classifier more aware of the less predominant classes (LogisticRegression has quite a few). At the moment it is predicting only 3 classes out of 38 or so, so it definitely needs improvement.
Any ideas?
If your model is classifying only the predominant classes, then you are facing a class imbalance problem. Here are some good reads on tackling this in machine learning.
Logistic regression is by default a binary classifier and uses the one-vs-all or one-vs-one technique for multiclass classification, which is not good if you have a high number of output classes (38 in your case). Try using another classifier. For a start, use a softmax classifier, which is an extension of the logistic classifier with support for multi-class classification. In scikit-learn, set the multi_class parameter to "multinomial" to use softmax regression, as in the sketch below.
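As a rough illustration, with toy imbalanced data from make_classification standing in for the Kaggle features, something along these lines; class_weight="balanced" is an extra knob worth trying for the imbalance mentioned above:

```python
# Illustrative only: multinomial (softmax) logistic regression on imbalanced toy data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# stand-in for the Kaggle features/labels: 5 classes, heavily imbalanced
X, y = make_classification(n_samples=5000, n_features=20, n_informative=10,
                           n_classes=5, weights=[0.6, 0.2, 0.1, 0.05, 0.05],
                           random_state=0)

clf = LogisticRegression(solver="lbfgs", class_weight="balanced", max_iter=1000)
# On older scikit-learn versions you may want to pass multi_class="multinomial"
# explicitly; recent versions already use the softmax formulation for multiclass y.
clf.fit(X, y)
print(sorted(set(clf.predict(X))))   # with balanced weights, minority classes get predicted too
```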
Another way to improve your model could be to use grid search (GridSearchCV) for parameter tuning, for example:
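This is a small, illustrative grid over the regularization strength C and the class weighting (the values are arbitrary), scored with macro-F1 so the rare classes count:

```python
# Sketch of parameter tuning with GridSearchCV on synthetic imbalanced data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=5000, n_features=20, n_informative=10,
                           n_classes=5, weights=[0.6, 0.2, 0.1, 0.05, 0.05],
                           random_state=0)

grid = GridSearchCV(
    LogisticRegression(solver="lbfgs", max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0],
                "class_weight": [None, "balanced"]},
    scoring="f1_macro",   # macro-F1 rewards getting the rare classes right too
    cv=3)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```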
On a side note, I would recommend trying other models as well.
I have read these lines in one of the IEEE Transactions on Software Engineering papers:
"Researchers have adopted a myriad of different techniques to construct software fault prediction models. These include various statistical techniques such as logistic regression and Naive Bayes which explicitly construct an underlying probability model. Furthermore, different machine learning techniques such as decision trees, models based on the notion of perceptrons, support vector machines, and techniques that do not explicitly construct a prediction model but instead look at a set of most similar
known cases have also been investigated.
Can anyone explain what they really want to convey?
Please give an example.
Thanks in advance.
The authors seem to distinguish probabilistic vs non-probabilistic models, that is models that produce a distribution p(output | data) vs those that just produce an output output = f(data).
The description of the non-probabilistic algorithms is a bit odd to my taste, though. From the model and algorithmic perspective, the difference between a (linear) support vector machine, a perceptron, and logistic regression is not that large. Implying that the former "look at a set of most similar known cases" while the latter doesn't seems strange.
The authors seem to be distinguishing models which compute per-class probabilities (from which you can derive a classification rule to assign an input to the most probable class, or, more complicated, assign an input to the class which has the least misclassification cost) and those which directly assign inputs to classes without passing through the per-class probability as an intermediate result.
A classification task can be viewed as a decision problem; in this case one needs per-class probabilities and a misclassification cost matrix. I think this approach is described in many current texts on machine learning, e.g., Brian Ripley's "Pattern Recognition and Neural Networks" and Hastie, Tibshirani, and Friedman, "Elements of Statistical Learning".
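If it helps, the distinction is easy to see in scikit-learn terms on toy data: a probabilistic classifier such as logistic regression exposes p(class | data) via predict_proba, while a plain linear SVM (LinearSVC) only gives you a score and a class assignment:

```python
# Sketch of probabilistic vs non-probabilistic classifiers on synthetic data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.random((100, 4))
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

prob_model = LogisticRegression().fit(X, y)
print(prob_model.predict_proba(X[:2]))   # explicit per-class probabilities p(y | x)

svm = LinearSVC().fit(X, y)
print(svm.decision_function(X[:2]))      # just a signed score, not a probability
print(svm.predict(X[:2]))                # direct class assignment
```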
As a meta-comment, you might get more traction for this question on stats.stackexchange.com.
I have been trying to figure out the relationship between the error rate and the number of features in both of these models. I watched some videos, and the creator of the video said that a simple model can be better than a complicated model. So I figured that the more features I had, the higher the error rate would be. This did not prove to be true in my work: when I had fewer features, the error rate went up. I'm not sure if I'm doing something incorrectly, or if the guy in the video made a mistake. Could someone explain? I am also curious how the number of features relates to logistic regression's error rate.
Naive Bayes and Logistic Regression are a "generative-discriminative pair," meaning they have the same model form (a linear classifier), but they estimate parameters in different ways.
For feature x and label y, naive Bayes estimates a joint probability p(x,y) = p(y)*p(x|y) from the training data (that is, it builds a model that could "generate" the data), and uses Bayes rule to predict p(y|x) for new test instances. On the other hand, logistic regression estimates p(y|x) directly from the training data by minimizing an error function (which is more "discriminative").
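As a tiny made-up numerical example of the generative route (the "spam" prior and likelihood below are invented numbers, just to show the mechanics of Bayes rule):

```python
# Toy sketch of how naive Bayes turns the generative quantities p(y) and
# p(x|y) into p(y|x) via Bayes rule. All numbers are made up.
p_y = {"spam": 0.3, "ham": 0.7}                     # class priors p(y)
p_x_given_y = {"spam": 0.8, "ham": 0.1}             # p(x = "contains 'free'" | y)

p_x = sum(p_y[c] * p_x_given_y[c] for c in p_y)     # evidence p(x)
posterior = {c: p_y[c] * p_x_given_y[c] / p_x for c in p_y}
print(posterior)                                    # roughly {'spam': 0.77, 'ham': 0.23}
```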
These differences have implications for error rate:
When there are very few training instances, logistic regression might "overfit," because there isn't enough data to estimate p(y|x) reliably. Naive Bayes might do better because it models the entire joint distribution.
When the feature set is large (and sparse, like word features in text classification) naive Bayes might "double count" features that are correlated with each other, because it assumes that each p(x|y) event is independent, when they are not. Logistic regression can do a better job by naturally "splitting the difference" among these correlated features.
If the features really are (mostly) conditionally independent, both models might actually improve with more and more features, provided there are enough data instances. The problem comes when the training set size is small relative to the number of features. Priors on naive Bayes feature parameters, or regularization methods (like L1/Lasso or L2/Ridge) on logistic regression can help in these cases.
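Here is a small sketch of the "double counting" effect on synthetic data: one informative feature copied ten times makes Gaussian naive Bayes far more (over)confident, while an L2-regularized logistic regression spreads the weight across the copies:

```python
# Demonstration of feature double counting: copy one informative feature 10 times.
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000
y = rng.integers(0, 2, size=n)
signal = y + rng.normal(scale=1.0, size=n)          # one informative feature
X = np.column_stack([signal] * 10)                  # ... duplicated 10 times

nb = GaussianNB().fit(X, y)
lr = LogisticRegression(penalty="l2", C=1.0).fit(X, y)

print("NB mean max-probability:", nb.predict_proba(X).max(axis=1).mean())
print("LR mean max-probability:", lr.predict_proba(X).max(axis=1).mean())
# NB's probabilities are pushed towards 0/1 because it counts the same evidence
# ten times; the regularized LR stays closer to what a single copy would justify.
```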
I've been learning about logistic regression for a few days, and I think the labels of a logistic regression dataset need to be 1 or 0. Is that right?
But when I look at the libSVM library's regression datasets, I see the label values are continuous numbers (e.g. 1.0086, 1.0089, ...). Did I miss something?
Note that the libSVM library can be used for regression problems.
Thanks so much!
Contrary to its name, logistic regression is a classification algorithm and it outputs class probabilities conditioned on the data point. Therefore the training set labels need to be either 0 or 1. For the dataset you mentioned, logistic regression is not a suitable algorithm.
SVM is a classification algorithm and it uses input labels of -1 or 1. It is not a probabilistic algorithm and it doesn't output class probabilities. It can also be adapted to regression.
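A quick contrast on toy data, in scikit-learn terms: logistic regression wants discrete class labels, while support vector regression (SVR) is the variant you'd reach for when the targets are continuous numbers like 1.0086, 1.0089, ...:

```python
# Classification needs discrete labels; regression (here SVR) takes continuous targets.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))

y_class = (X[:, 0] > 0).astype(int)               # 0/1 labels -> classification
LogisticRegression().fit(X, y_class)

y_cont = X @ np.array([1.0, 0.5, -0.2]) + 1.0     # continuous targets -> regression
SVR().fit(X, y_cont)
```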
Are you using a 3rd party library or programming this yourself? Generally the labels are used as ground truth so you can see how effective your approach was.
For example, if your algorithm is trying to predict what a particular instance is, it might output -1 while the ground truth label is +1, which means you did not successfully classify that particular instance.
Note that "regression" is a general term. To say someone will perform regression analysis doesn't necessarily tell you what algorithm they will be using, nor all of the nature of the data sets. All it really tells you is that you have a set of samples with features which you want to use to predict a single outcome value (a model for conditional probability).
One major difference between logistic regression and linear regression is that the former is usually trained on categorical, binary-labeled sample sets; while the latter is trained on real-labeled (ℝ) sample sets.
Any time your labels are real-valued, you're probably going to use linear regression or similar, or else convert those real-valued labels to categorical labels (e.g. via thresholds or bins) if you in fact want to use logistic regression. There is potentially a big difference in the quality and interpretation of your results, though, if you convert from one such problem setup to another.
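If you do want to go the "thresholds or bins" route, a hedged sketch (the cut points below are arbitrary) might look like this:

```python
# Discretize a real-valued target so that logistic regression applies.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 2))
y_real = X @ np.array([0.8, -0.5]) + rng.normal(scale=0.3, size=300)

y_binned = np.digitize(y_real, bins=[-0.5, 0.5])   # 3 ordered bins -> labels 0, 1, 2
LogisticRegression(max_iter=1000).fit(X, y_binned)
# Note: binning throws away information, so the quality and interpretation of
# the results differ from a regression fit on y_real directly.
```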
See also Regression Analysis.