The official XGBoost documentation describes the capability to enforce monotonic constraints through its interface, and uses a regression problem to illustrate it:
https://xgboost.readthedocs.io/en/stable/tutorials/monotonic.html
The same parameter can also be passed to xgboost when training a binary classifier. Is the monotonic constraint conceptually equivalent to its usage in the regression case? Are there any caveats or considerations one should take into account?
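For reference, here is a minimal sketch of passing the constraint to a binary classifier (synthetic data and made-up feature roles, only to show the mechanics). The parameter is specified exactly as in the regression tutorial; with binary:logistic the constraint is enforced on the raw margin (log-odds), and since the sigmoid is monotone the predicted probability is monotone in the constrained feature as well.

    # Hypothetical sketch: monotone constraint on a binary XGBoost classifier.
    # Feature 0 is constrained to have a non-decreasing effect, feature 1 is free.
    import numpy as np
    import xgboost as xgb

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 2))
    y = (X[:, 0] + 0.5 * rng.normal(size=1000) > 0).astype(int)

    clf = xgb.XGBClassifier(
        objective="binary:logistic",
        monotone_constraints="(1,0)",   # same syntax as in the regression tutorial
        n_estimators=50,
    )
    clf.fit(X, y)

    # Predicted probability should be non-decreasing as feature 0 increases
    # (feature 1 held fixed at 0).
    grid = np.column_stack([np.linspace(-3, 3, 20), np.zeros(20)])
    print(clf.predict_proba(grid)[:, 1])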
I am trying to understand the difference in the assumptions required for Naive Bayes and Logistic Regression.
As far as I know, both Naive Bayes and Logistic Regression require the features to be independent of each other, i.e. the predictors should not exhibit multicollinearity,
and only Logistic Regression requires linearity between the independent variables and the log-odds.
Correct me if I am wrong, and are there any other assumptions or differences between Naive Bayes and Logistic Regression?
You are right, durga. The two often have similar performance as well.
A difference is that Gaussian Naive Bayes assumes the features are normally distributed within each class, whereas logistic regression makes no such distributional assumption. As for speed, NB is much faster to train.
Logistic regression, according to this source:
1) Requires observations to be independent of each other. In other words, the observations should not come from repeated measurements or matched data.
2) Requires the dependent variable to be binary; ordinal logistic regression requires the dependent variable to be ordinal.
3) Requires little or no multicollinearity among the independent variables. This means that the independent variables should not be too highly correlated with each other (a quick check for this is sketched after the list).
4) Assumes linearity of independent variables and log odds.
5) Typically requires a large sample size. A general guideline is that you need a minimum of 10 cases with the least frequent outcome for each independent variable in your model.
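For point 3, one quick way to check multicollinearity is the variance inflation factor (VIF). Here is a minimal sketch using statsmodels; the predictors are synthetic and purely illustrative, and a VIF above roughly 5-10 is commonly taken as a warning sign.

    # Multicollinearity check via variance inflation factors (VIF).
    # The DataFrame built here is made up; x2 is nearly a copy of x1.
    import numpy as np
    import pandas as pd
    from statsmodels.stats.outliers_influence import variance_inflation_factor
    from statsmodels.tools.tools import add_constant

    rng = np.random.default_rng(0)
    x1 = rng.normal(size=200)
    X = pd.DataFrame({
        "x1": x1,
        "x2": 0.9 * x1 + rng.normal(scale=0.1, size=200),  # highly correlated with x1
        "x3": rng.normal(size=200),
    })

    Xc = add_constant(X)  # include an intercept column before computing VIFs
    for i, col in enumerate(Xc.columns):
        if col != "const":
            print(col, variance_inflation_factor(Xc.values, i))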
tl;dr:
Naive Bayes requires conditional independence of the variables. The regression family needs the features to not be highly correlated in order to produce an interpretable, well-fit model.
Naive Bayes requires the features to meet the "conditional independence" requirement, which means that, given the class, the features are independent of each other: p(x_i | x_j, C) = p(x_i | C) for all i ≠ j.
This is very different from the "regression family" requirement, which is that the variables are not highly "correlated". Even if the features are correlated, the regression model might only overfit or become harder to interpret, so if you use proper regularization you can still get good predictions.
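To make the last point concrete, here is a small sketch on synthetic data (illustrative only) comparing Gaussian Naive Bayes with L2-regularized logistic regression when two features are nearly duplicates. Accuracy can come out similar, but NB tends to become overconfident because it counts the nearly identical features as independent evidence, while the regularized regression simply shares the weight between them.

    # Correlated features: Gaussian NB vs. L2-regularized logistic regression.
    # Synthetic data; exact numbers will vary.
    import numpy as np
    from sklearn.naive_bayes import GaussianNB
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    n = 2000
    signal = rng.normal(size=n)
    X = np.column_stack([
        signal + 0.1 * rng.normal(size=n),   # feature 1
        signal + 0.1 * rng.normal(size=n),   # feature 2, nearly a copy of feature 1
    ])
    y = (signal + rng.normal(scale=0.5, size=n) > 0).astype(int)

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    nb = GaussianNB().fit(X_tr, y_tr)
    lr = LogisticRegression(penalty="l2", C=1.0).fit(X_tr, y_tr)

    print("NB accuracy:", nb.score(X_te, y_te))
    print("LR accuracy:", lr.score(X_te, y_te))
    # NB's probabilities get pushed toward 0/1 because the duplicated evidence
    # is treated as independent given the class.
    print("NB mean max prob:", nb.predict_proba(X_te).max(axis=1).mean())
    print("LR mean max prob:", lr.predict_proba(X_te).max(axis=1).mean())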
So as I understand it, to implement unsupervised Naive Bayes, we assign a random probability to each class for each instance, then run it through the normal Naive Bayes algorithm. I understand that, through each iteration, the random estimates get better, but I can't for the life of me figure out exactly how that works.
Anyone care to shed some light on the matter?
The variant of Naive Bayes in unsupervised learning that I've seen is basically an application of a Gaussian Mixture Model (GMM), fitted with the Expectation Maximization (EM) algorithm, to determine the clusters in the data.
In this setting, it is assumed that the data can be classified, but the classes are hidden. The problem is to determine the most probable classes by fitting a Gaussian distribution per class. The Naive Bayes assumption defines the particular probabilistic model to use, in which the attributes are conditionally independent given the class.
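In practice this is easy to try with an off-the-shelf mixture model. Here is a minimal sketch on synthetic data using scikit-learn's GaussianMixture with a diagonal covariance, which for Gaussian features is exactly the "attributes conditionally independent given the (hidden) class" assumption. EM alternates between assigning soft class probabilities to each instance (E-step) and re-estimating each class's parameters from those soft assignments (M-step), which is how the "random estimates get better" at each iteration.

    # Unsupervised "Gaussian Naive Bayes" clustering via EM.
    # covariance_type="diag" means each cluster treats its features as
    # independent Gaussians given the hidden class -- the NB assumption.
    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    # Two synthetic groups whose labels we pretend not to know.
    X = np.vstack([
        rng.normal(loc=[0.0, 0.0], scale=1.0, size=(300, 2)),
        rng.normal(loc=[4.0, 4.0], scale=1.0, size=(300, 2)),
    ])

    gmm = GaussianMixture(n_components=2, covariance_type="diag", random_state=0)
    gmm.fit(X)                   # EM: E-step soft assignments, M-step parameter updates
    soft = gmm.predict_proba(X)  # per-instance class probabilities that EM refines
    hard = gmm.predict(X)        # most probable cluster for each instance
    print(gmm.means_)            # estimated per-cluster feature means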
From "Unsupervised naive Bayes for data clustering with mixtures of
truncated exponentials" paper by Jose A. Gamez:
From the previous setting, probabilistic model-based clustering is
modeled as a mixture of models (see e.g. (Duda et al., 2001)), where
the states of the hidden class variable correspond to the components
of the mixture (the number of clusters), and the multinomial
distribution is used to model discrete variables while the Gaussian
distribution is used to model numeric variables. In this way we move
to a problem of learning from unlabeled data and usually the EM
algorithm (Dempster et al., 1977) is used to carry out the learning
task when the graphical structure is fixed and structural EM
(Friedman, 1998) when the graphical structure also has to be
discovered (Pena et al., 2000). In this paper we focus on the
simplest model with fixed structure, the so-called Naive Bayes
structure (fig. 1) where the class is the only root variable and all
the attributes are conditionally independent given the class.
See also this discussion on CV.SE.
Okay, I have a lot of confusion regarding the way likelihood functions are defined in the context of different machine learning algorithms. For this discussion, I will reference Andrew Ng's CS229 lecture notes.
Here is my understanding thus far.
In the context of classification, we have two types of algorithms: discriminative and generative. The goal in both cases is to determine the posterior probability p(C_k|x;w), where w is the parameter vector, x is the feature vector, and C_k is the kth class. The approaches differ: in the discriminative case we model the posterior probability directly as a function of x, while in the generative case we model the class-conditional distributions p(x|C_k) and the class priors p(C_k), and use Bayes' theorem to obtain p(C_k|x;w).
From my understanding, Bayes' theorem takes the form p(parameters|data) = p(data|parameters)p(parameters)/p(data), where the likelihood function is p(data|parameters), the posterior is p(parameters|data), and the prior is p(parameters).
Now in the context of linear regression, we have the likelihood function:
p(y|X;w), where y is the vector of target values and X is the design matrix.
This makes sense according to how we defined the likelihood function above.
Now moving over to classification, the likelihood is still defined as p(y|X;w). Will the likelihood always be defined this way?
The posterior probability we want is p(y_i|x;w) for each class, which is very confusing, since this apparently has the same form as the likelihood function.
When reading through different texts, it seems the likelihood is defined in different ways, which confuses me profusely. Is there a difference in how the likelihood function should be interpreted for regression vs. classification, or for generative vs. discriminative models? For instance, the way the likelihood is defined in Gaussian discriminant analysis looks very different.
If anyone can recommend resources that go over this in detail, I would appreciate it.
A quick answer is that the likelihood function is a function proportional to the probability of seeing the data, conditional on all the parameters in your model. As you said, in linear regression it is p(y|X,w), where w is your vector of regression coefficients and X is your design matrix.
In a classification context, your likelihood would be proportional to P(y|X,w) where y is your vector of observed class labels. You do not have a y_i for each class, because your training data was observed to be in one particular class. Given your model specification and your model parameters, for each observed data point you should be able to calculate the probability of seeing the observed class. This is your likelihood.
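For instance, in binary logistic regression the likelihood of the training labels is a product of Bernoulli probabilities, one per observed label. A tiny numeric sketch (made-up data and weights), only to show that "the probability of seeing the observed class" is one number per training point and the likelihood is their product:

    # Log-likelihood of observed binary labels under logistic regression
    # with fixed, made-up weights w.
    import numpy as np

    X = np.array([[1.0, 0.5], [0.2, -1.0], [-1.5, 0.3]])  # toy design matrix
    y = np.array([1, 0, 0])                                # observed class labels
    w = np.array([0.8, -0.4])                              # some parameter vector

    p = 1.0 / (1.0 + np.exp(-X @ w))   # P(y_i = 1 | x_i, w) for each row
    log_lik = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    print(log_lik)                     # log p(y | X, w): the (log-)likelihood of w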
The posterior predictive distribution, p(y_new_i|X,y), is the probability you want in paragraph 4. This is distinct from the likelihood because it is the probability for some unobserved case, rather than the likelihood, which relates to your training data. Note that I removed w because typically you would want to marginalize over it rather than condition on it because there is still uncertainty in the estimate after training your model and you would want your predictions to marginalize over that rather than condition on one particular value.
As an aside, not all classification methods aim to find a posterior distribution; only Bayesian methods are really concerned with a posterior, and those methods are necessarily generative. There are plenty of non-Bayesian methods and plenty of non-probabilistic discriminative models out there.
Any function proportional to p(a|b) where a is fixed is a likelihood function for b. Note that p(a|b) might be called something else, depending on what's interesting at the moment. For example, p(a|b) can also be called the posterior for a given b. The names don't really matter.
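To tie this back to the Gaussian discriminant analysis confusion: the likelihoods look different because the two model families treat different things as "the data". Roughly (this is the standard textbook contrast, not specific to any one source):

$$
L_{\text{discriminative}}(w) = \prod_i p(y_i \mid x_i; w),
\qquad
L_{\text{generative}} = \prod_i p(x_i \mid y_i)\, p(y_i),
$$

where in GDA the class-conditional factors p(x|y=k) are Gaussians and p(y) is a Bernoulli/multinomial prior. Both are "the probability of the observed data as a function of the parameters"; they differ only in whether the data being modeled is y given x, or the pair (x, y) jointly.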
I am doing a project that involves some Natural Language Processing, and I am using the Stanford MaxEnt Classifier for the purpose. But I am not sure whether the Maximum Entropy model and logistic regression are one and the same, or whether MaxEnt is some special kind of logistic regression?
Can anyone come up with an explanation?
This is exactly the same model. The NLP community prefers the name Maximum Entropy and uses the sparse formulation, which allows everything to be computed without an explicit projection into R^n (it is common in NLP to have a huge number of features and very sparse vectors).
You may want to read the attachment in this post, which gives a simple derivation:
http://www.win-vector.com/blog/2011/09/the-equivalence-of-logistic-regression-and-maximum-entropy-models/
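As a concrete illustration that it is the same model under a different name, here is a toy sketch (made-up documents and labels): a bag-of-words vectorizer produces a sparse feature matrix, and fitting a multinomial logistic regression on it is what an NLP toolkit would call training a MaxEnt classifier.

    # Toy "MaxEnt" text classifier: sparse bag-of-words features plus
    # multinomial logistic regression. Documents and labels are made up.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression

    docs = [
        "the cat sat on the mat", "the dog barked at the cat",
        "stocks fell sharply today", "the market rallied",
        "heavy rain expected tomorrow", "sunny skies this afternoon",
    ]
    labels = ["pets", "pets", "finance", "finance", "weather", "weather"]

    vec = CountVectorizer()
    X = vec.fit_transform(docs)   # scipy sparse matrix; huge and sparse in real NLP
    clf = LogisticRegression()    # with the default lbfgs solver this is a softmax model
    clf.fit(X, labels)

    print(clf.predict(vec.transform(["my dog chased a cat"])))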
An explanation is quoted from "Speech and Language Processing" by Daniel Jurafsky & James H. Martin:
Each feature is an indicator function, which picks out a subset of the training observations. For each feature we add a constraint on our total distribution, specifying that our distribution for this subset should match the empirical distribution we saw in our training data. We then choose the maximum entropy distribution which otherwise accords with these constraints.
Berger et al. (1996) show that the solution to this optimization problem turns out to be exactly the probability distribution of a multinomial logistic regression model whose weights W maximize the likelihood of the training data!
In Maximum Entropy, a feature is represented as f(x,y), which means you can design features using both the label y and the observable input x, while if f(x,y) = x you are in the logistic regression situation.
In NLP tasks like POS tagging, it is common to design features that combine labels. For example, "the current word ends with 'ous' and the next word is a noun" can be a feature for predicting whether the current word is an adjective.
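To make the f(x,y) point explicit, the MaxEnt / multinomial-logistic form is (standard notation, not specific to the Stanford implementation):

$$
p(y \mid x; \lambda) = \frac{\exp\big(\sum_i \lambda_i f_i(x, y)\big)}{\sum_{y'} \exp\big(\sum_i \lambda_i f_i(x, y')\big)}
$$

If you choose f_i(x, y) to be "the j-th component of x times an indicator that y equals class k", the weights decompose into one vector per class and this is exactly multinomial (softmax) logistic regression; richer feature functions, such as the "ends with 'ous' and the next word is a noun, and the label is adjective" example above, are what the f(x,y) notation buys you.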
I am a novice at machine learning. I have read about HMMs, but I still have a few questions:
When applying an HMM for machine learning, how can the initial, emission, and transition probabilities be obtained?
Currently I have a set of values (consisting the angles of a hand which I would like to classify via an HMM), what should my first step be?
I know that there are three classical problems for an HMM (solved by the Forward-Backward, Baum-Welch, and Viterbi algorithms), but what should I do with my data?
In the literature that I have read, I have never encountered the use of distribution functions within an HMM, yet the constructor that JaHMM uses for an HMM consists of:
number of states
Probability Distribution Function factory
Constructor Description:
Creates a new HMM. Each state has the same pi value and the transition probabilities are all equal.
Parameters:
nbStates The (strictly positive) number of states of the HMM.
opdfFactory A pdf generator that is used to build the pdfs associated to each state.
What is this used for? And how can I use it?
Thank you
You have to somehow model and learn the initial, emission, and transition probabilities such that they represent your data.
In the case of discrete distributions and not too many variables/states, you can obtain them from maximum-likelihood fitting, or you can train a discriminative classifier that gives you a probability estimate, like Random Forests or Naive Bayes. For continuous distributions, have a look at Gaussian Processes or other methods such as Gaussian Mixture Models or Regression Forests.
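As a concrete starting point for the first two questions, here is a minimal sketch using the Python hmmlearn library rather than JaHMM (so the API differs, but the idea is the same): you choose the number of hidden states and a per-state emission distribution (here a Gaussian over the angle values), the library initializes the start/transition/emission parameters, and Baum-Welch (EM) re-estimates them from your unlabeled sequences. The angle data below is synthetic and stands in for your own recordings.

    # Fit an HMM with Gaussian emissions to angle sequences via Baum-Welch,
    # then decode the hidden states. Synthetic data for illustration only.
    import numpy as np
    from hmmlearn import hmm

    rng = np.random.default_rng(0)
    seq1 = rng.normal(loc=30.0, scale=5.0, size=(100, 1))   # one recorded sequence
    seq2 = rng.normal(loc=80.0, scale=5.0, size=(120, 1))   # another sequence
    X = np.vstack([seq1, seq2])          # hmmlearn takes concatenated sequences
    lengths = [len(seq1), len(seq2)]     # ...plus the length of each one

    model = hmm.GaussianHMM(n_components=3, covariance_type="diag", n_iter=100)
    model.fit(X, lengths)                # Baum-Welch (EM) learns the parameters

    print(model.startprob_)              # learned initial-state probabilities
    print(model.transmat_)               # learned transition matrix
    states = model.predict(X, lengths)   # Viterbi decoding of the state sequence
    print(model.score(X, lengths))       # log-likelihood via the forward algorithm

If the goal is to classify gestures, a common pattern is to train one such HMM per gesture class and then label a new sequence with whichever model assigns it the highest score.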
Regarding your second and third questions: they are too general and fuzzy to be answered here. I would kindly refer you to the following books: "Pattern Recognition and Machine Learning" by Bishop and "Probabilistic Graphical Models" by Koller/Friedman.