Suppose I have a training set made of (x, y) samples.
To apply a generative algorithm, let's say Gaussian discriminant analysis, must I assume that p(x|y) ~ Normal(mu, sigma) for every possible sigma, or do I just need to know that x ~ Normal(mu, sigma) given y?
How can I evaluate whether p(x|y) follows a multivariate Normal distribution well enough (up to a threshold) for me to use a generative algorithm?
That's a lot of questions.
To apply a generative algorithm, let's say Gaussian discriminant analysis, must I assume that p(x|y) ~ Normal(mu, sigma) for every possible sigma?
No, you must assume that's true for some mu, sigma pair. In practice you won't know what mu and sigma are, so you'll need to either estimate them (frequentist: maximum likelihood or maximum a posteriori estimates), or, even better, incorporate uncertainty about your estimates of the parameters into your predictions (Bayesian methodology).
How can I evaluate whether p(x|y) follows a multivariate Normal distribution?
Classically, using a goodness of fit test. If the dimensionality of x is more than a handful, though, this won't work because standard tests involve the number of items in bins, and the number of bins you need in high dimensions is astronomical so you have very low expected counts.
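One bin-free sanity check worth sketching (a minimal illustration, not a formal test): under multivariate normality, the squared Mahalanobis distances of the samples approximately follow a chi-square distribution with d degrees of freedom, which can be compared directly.

    # Sketch: compare squared Mahalanobis distances against chi-square(d).
    # Toy data stands in for the class-conditional samples x | y.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    X = rng.multivariate_normal(mean=[0, 0, 0], cov=np.eye(3), size=500)

    mu = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    diff = X - mu
    d2 = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(cov), diff)

    # If X really is multivariate normal, d2 ~ chi-square with d = X.shape[1] dof
    # (approximately, since mu and cov are estimated from the same data).
    ks_stat, p_value = stats.kstest(d2, 'chi2', args=(X.shape[1],))
    print(f"KS statistic = {ks_stat:.3f}, p-value = {p_value:.3f}")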
A better idea is to say the following: what are my options for modelling the (conditional) distribution of x? You can compare between these options using model comparison techniques. Read up on model checking and comparison.
Finally, your last point:
well enough (up to a threshold) for me to use a generative algorithm?
The paradox of many generative methods, including Fisher's Linear Discriminant Analysis and the Naive Bayes classifier, is that the classifier can work very well even though the model is a poor fit for the data. There's no particularly sound reason why this should be the case, but many have observed it to be empirically true. Whether it works can be checked much more easily than whether the assumed distribution explains the data well: just split your data into training and testing sets and find out!
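For concreteness, here is a minimal sketch of that check, with scikit-learn's Gaussian Naive Bayes standing in as the generative classifier (an illustrative choice, not the one the question commits to):

    # Fit the generative classifier on a training split and score it on held-out data.
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import GaussianNB

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0)

    clf = GaussianNB().fit(X_train, y_train)
    # Held-out accuracy tells you whether the classifier works,
    # regardless of how well the Gaussian assumption fits.
    print("test accuracy:", clf.score(X_test, y_test))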
I am a beginner in machine learning, so any help or suggestion would be of great help.
I have read that putting weights on features before predicting is a very bad idea. But what if a few features need to be weighted?
In a classification problem, let's say it is common knowledge that age is the most influential feature; how do I give weight to this feature? I was thinking of normalizing it, but with a variance of 1.5 or 2 (the other features with variance 1); I believe this feature would then carry more weight. Is this fundamentally wrong? If so, is there another method?
Does it affect classification and regression problems differently?
If we are talking specifically about random forests (as you tagged), then you can use the Weighted Subspace Random Forest algorithm (the wsrf package in R). The algorithm determines a weight for each variable and then uses these weights during model building.
The informativeness of a variable with respect to the class is measured by an information gain ratio. The measure is used as the probability of that variable being selected for inclusion in the variable subspace when splitting a specific node during the tree building process. Therefore, variables with higher values by the measure are more likely to be chosen as candidates during variable selection and a stronger tree can be built.
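The same idea can be sketched outside of wsrf (a rough Python illustration of the concept only, not the package's implementation, using mutual information as a stand-in for the information gain ratio): score each feature, normalise the scores into selection probabilities, and draw the candidate subspace for a split accordingly.

    # Sketch: informativeness scores -> per-feature selection probabilities.
    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn.feature_selection import mutual_info_classif

    X, y = load_breast_cancer(return_X_y=True)
    scores = mutual_info_classif(X, y, random_state=0)   # informativeness of each feature
    probs = scores / scores.sum()                        # probability of entering a node's subspace

    rng = np.random.default_rng(0)
    subspace = rng.choice(X.shape[1], size=5, replace=False, p=probs)
    print("candidate features for this split:", subspace)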
Generally, if a feature is more important than the others and the model is expressive enough, then with enough training samples your model will automatically give it more importance: backpropagation computes a partial derivative for each connection, so the optimization of the weight matrices learns to emphasize that feature on its own. If you don't just normalize the feature but scale it to a larger range, you may end up overstating its importance.
In practice a neural network works best if the inputs are centered and whitened. That means their covariance is diagonal and their mean is the zero vector. This improves optimization of the neural net, since the hidden activation functions don't saturate as quickly and thus do not give you near-zero gradients early on in learning.
If you scale just one feature up by some factor, it may or may not have the desired effect, but the more likely outcome is saturated activations and vanishing gradients, so we avoid it.
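A minimal sketch of that preprocessing (my own example values): center each feature and give it unit variance before feeding it to the network.

    # Standardize features: zero mean, unit variance per column.
    # (Full whitening would additionally decorrelate the features, e.g. PCA with whiten=True.)
    import numpy as np
    from sklearn.preprocessing import StandardScaler

    X = np.array([[25.0, 50000.0],
                  [32.0, 64000.0],
                  [47.0, 52000.0]])   # e.g. age and income on very different scales

    X_std = StandardScaler().fit_transform(X)
    print(X_std.mean(axis=0))   # ~[0, 0]
    print(X_std.std(axis=0))    # ~[1, 1]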
I'm trying to modify a standard kNN algorithm to obtain the probability of belonging to a class instead of just the usual classification. I haven't found much information about probabilistic kNN, but as far as I understand, it works similarly to kNN, with the difference that it calculates the percentage of examples of every class inside the given radius.
So I wonder, what's the difference then between Naive Bayes and probabilistic kNN? The only difference I can spot is that Naive Bayes takes the prior probability into consideration, while PkNN does not. Am I getting it wrong?
Thanks in advance!
To be honest there is nearly no similarity.
Naive Bayes assumes that each class is distributed according to a simple distribution, independently on a per-feature basis. In the continuous case it fits a Normal distribution with diagonal covariance (one independent Gaussian per feature) to each class, and then makes a decision through argmax_y p(y) N(x; m_y, Sigma_y).
kNN, on the other hand, is not a probabilistic model. The modification you are referring to is simply a "smooth" version of the original idea, where you return the ratio of each class among the nearest neighbours (and this is not really any "probabilistic kNN", it is just regular kNN with a rough estimate of the class probabilities). It assumes nothing about the data distribution (besides being locally smooth). In particular, it is a nonparametric model which, given enough training samples, will fit perfectly to any dataset, whereas Naive Bayes will fit perfectly only to K Gaussians (where K is the number of classes).
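A small sketch of the contrast (my own illustration using scikit-learn): Gaussian Naive Bayes answers from its fitted per-class Gaussians and priors, while kNN's "probability" is just the fraction of each class among the k nearest neighbours.

    # Compare the two models' class "probabilities" at a single query point.
    from sklearn.datasets import make_moons
    from sklearn.naive_bayes import GaussianNB
    from sklearn.neighbors import KNeighborsClassifier

    X, y = make_moons(n_samples=500, noise=0.2, random_state=0)

    nb = GaussianNB().fit(X, y)
    knn = KNeighborsClassifier(n_neighbors=15).fit(X, y)

    x_query = [[0.5, 0.25]]
    print("Naive Bayes P(y|x):", nb.predict_proba(x_query))   # from Gaussians + priors
    print("kNN class ratios:  ", knn.predict_proba(x_query))  # neighbour counts / 15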
(I don't know how to format math formulas. For more details and clear representations, please see this.)
I would like to propose an opposite view: that KNN is a kind of simplified Naive Bayes (NB), by viewing KNN as a means of density estimation.
To perform density estimation, we attempt to estimate p(x) = k/NV, where k is the number of samples lying in a region R, N is the total sample number, and V is the volume of the region R. Usually, there are two ways to estimate it: (1) fixing V, calculate k, which is known as kernel density estimation or Parzen window; (2) fixing k, calculate V, which is the KNN-based density estimation. The latter one is much less famous than the former one due to its many drawbacks.
Yet, we can use KNN-based density estimation to connect KNN and NB. Given N total samples, with Ni samples for class ci, we can write NB in the form of KNN-based density estimation by considering a region containing x:
P(ci|x) = P(x|ci)P(ci)/P(x) = (ki/NiV)(Ni/N)/(k/NV) = ki/k,
where ki is the sample number of class ci lying in the region. The final form ki/k is actually the KNN classifier.
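A quick numeric check of the ki/k identity (my own sketch): the per-class neighbour counts divided by k reproduce kNN's predicted probabilities exactly.

    # Count the classes among the k nearest neighbours and compare with predict_proba.
    import numpy as np
    from sklearn.datasets import make_blobs
    from sklearn.neighbors import KNeighborsClassifier

    X, y = make_blobs(n_samples=300, centers=3, random_state=0)
    k = 20
    knn = KNeighborsClassifier(n_neighbors=k).fit(X, y)

    x_query = np.array([[0.0, 0.0]])
    _, idx = knn.kneighbors(x_query)           # indices of the k nearest training points
    ki = np.bincount(y[idx[0]], minlength=3)   # per-class counts among those neighbours

    print("ki / k        :", ki / k)
    print("predict_proba :", knn.predict_proba(x_query)[0])   # identical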
I've got a set of F features, e.g. Lab color space, entropy. By concatenating all features together, I obtain a feature vector of dimension d (between 12 and 50, depending on which features are selected).
I usually get between 1000 and 5000 new samples, denoted x. A Gaussian Mixture Model is then trained with these vectors, but I don't know which class the features are from. What I do know is that there are only 2 classes. Based on the GMM prediction I get a probability of a feature vector belonging to class 1 or 2.
My question now is: how do I obtain the best subset of features, for instance only entropy and normalized RGB, that will give me the best classification accuracy? I guess this is achieved if the feature subset selection increases class separability.
Maybe I can utilize Fisher's linear discriminant analysis, since I already have the means and covariance matrices from the GMM? But wouldn't I then have to calculate the score for each combination of features?
It would be nice to get some feedback on whether this is an unrewarding approach and I'm on the wrong track, and/or any other suggestions.
One way of finding "informative" features is to use the features that will maximise the log likelihood. You could do this with cross validation.
https://www.cs.cmu.edu/~kdeng/thesis/feature.pdf
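A rough sketch of that first idea (my own, assuming "log likelihood" means the held-out log-likelihood of the fitted GMM): score every feature subset of a fixed size by the average log-likelihood a 2-component GMM assigns to unseen data.

    # Exhaustively score same-sized feature subsets by held-out GMM log-likelihood.
    from itertools import combinations
    import numpy as np
    from sklearn.mixture import GaussianMixture
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.normal(size=(2000, 6))     # placeholder for the d-dimensional feature vectors
    X_train, X_val = train_test_split(X, test_size=0.3, random_state=0)

    best = None
    for subset in combinations(range(X.shape[1]), 3):   # all 3-feature subsets
        cols = list(subset)
        gmm = GaussianMixture(n_components=2, random_state=0).fit(X_train[:, cols])
        ll = gmm.score(X_val[:, cols])                   # mean log-likelihood per sample
        if best is None or ll > best[0]:
            best = (ll, cols)

    print("best subset:", best[1], "held-out log-likelihood:", best[0])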
Another idea might be to use another unsupervised algorithm that automatically selects features, such as a clustering forest:
http://research.microsoft.com/pubs/155552/decisionForests_MSR_TR_2011_114.pdf
In that case the clustering algorithm will automatically split the data based on information gain.
Fisher LDA will not select features but will project your original data into a lower-dimensional subspace. If you are looking into subspace methods, other interesting approaches might be spectral clustering, which also operates in a subspace, or unsupervised neural networks such as autoencoders.
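To make the projection-versus-selection distinction concrete, a minimal sketch (my own illustration):

    # LDA projects to a lower-dimensional subspace; it does not pick a feature subset.
    from sklearn.datasets import load_wine
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    X, y = load_wine(return_X_y=True)                    # 13 original features, 3 classes
    lda = LinearDiscriminantAnalysis(n_components=2).fit(X, y)
    X_proj = lda.transform(X)

    print("original dimension:", X.shape[1])             # 13
    print("projected dimension:", X_proj.shape[1])       # 2, each a mix of all 13 features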
Okay, I have a lot of confusion regarding the way likelihood functions are defined in the context of different machine learning algorithms. For the context of this discussion, I will reference Andrew Ng's CS229 lecture notes.
Here is my understanding thus far.
In the context of classification, we have two different types of algorithms: discriminative and generative. The goal in both cases is to determine the posterior probability, that is p(C_k|x;w), where w is the parameter vector, x is the feature vector, and C_k is the kth class. The approaches differ in that, in the discriminative case, we try to model the posterior probability directly given x, while in the generative case we determine the class-conditional distributions p(x|C_k) and the class priors p(C_k), and use Bayes' theorem to determine p(C_k|x;w).
From my understanding, Bayes' theorem takes the form p(parameters|data) = p(data|parameters)p(parameters)/p(data), where the likelihood function is p(data|parameters), the posterior is p(parameters|data), and the prior is p(parameters).
Now in the context of linear regression, we have the likelihood function:
p(y|X;w), where y is the vector of target values and X is the design matrix.
This makes sense according to how we defined the likelihood function above.
Now moving over to classification, the likelihood is still defined as p(y|X;w). Will the likelihood always be defined like this?
The posterior probability we want is p(y_i|x;w) for each class, which is very weird since this is apparently the likelihood function as well.
When reading through a text, it just seems the likelihood is always defined in different ways, which confuses me profusely. Is there a difference in how the likelihood function should be interpreted for regression vs. classification, or for generative vs. discriminative models? E.g., the way the likelihood is defined in Gaussian discriminant analysis looks very different.
If anyone can recommend resources that go over this in detail, I would appreciate it.
A quick answer is that the likelihood function is a function proportional to the probability of seeing the data conditional on all the parameters in your model. As you said in linear regression it is p(y|X,w) where w is your vector of regression coefficients and X is your design matrix.
In a classification context, your likelihood would be proportional to P(y|X,w) where y is your vector of observed class labels. You do not have a y_i for each class, because your training data was observed to be in one particular class. Given your model specification and your model parameters, for each observed data point you should be able to calculate the probability of seeing the observed class. This is your likelihood.
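A small sketch of that last point (my own, with logistic regression standing in for "your model"): the likelihood of the training labels multiplies, over the observed points, the probability the model assigns to the class that was actually seen.

    # Log-likelihood of the observed labels under a fitted classifier.
    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression

    X, y = load_iris(return_X_y=True)
    model = LogisticRegression(max_iter=1000).fit(X, y)

    probas = model.predict_proba(X)                    # P(class | x_i, w) for every class
    prob_observed = probas[np.arange(len(y)), y]       # probability of the class actually seen
    log_likelihood = np.log(prob_observed).sum()       # log p(y | X, w)
    print("log-likelihood of the training labels:", log_likelihood)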
The posterior predictive distribution, p(y_new_i|X,y), is the probability you want in paragraph 4. This is distinct from the likelihood: it is the probability for some unobserved case, whereas the likelihood relates to your training data. Note that I removed w because you would typically want to marginalize over it rather than condition on it: there is still uncertainty in the estimate after training your model, and you would want your predictions to account for that rather than condition on one particular value.
As an aside, the goal of all classification methods is not to find a posterior distribution; only Bayesian methods are really concerned with a posterior, and those methods are necessarily generative. There are plenty of non-Bayesian methods and plenty of non-probabilistic discriminative models out there.
Any function proportional to p(a|b) where a is fixed is a likelihood function for b. Note that p(a|b) might be called something else, depending on what's interesting at the moment. For example, p(a|b) can also be called the posterior for a given b. The names don't really matter.
Here on SO I found the following explanation of generative and discriminative algorithms:
"A generative algorithm models how the data was generated in order to categorize a signal. It asks the question: based on my generation assumptions, which category is most likely to generate this signal?
A discriminative algorithm does not care about how the data was generated, it simply categorizes a given signal."
And here is the definition for parametric and nonparametric algorithms
"Parametric: data are drawn from a probability distribution of specific form up to unknown parameters.
Nonparametric: data are drawn from a certain unspecified probability distribution."
So essentially, can we say that generative and parametric algorithms assume an underlying model, whereas discriminative and nonparametric algorithms don't assume any model?
Thanks.
Say you have inputs X (probably a vector) and output Y (probably univariate). Your goal is to predict Y given X.
A generative method uses a model of the joint probability p(X,Y) to determine P(Y|X). It is thus possible, given a generative model with known parameters, to sample jointly from the distribution p(X,Y) to produce new samples of both input X and output Y (note they are distributed according to the assumed, not the true, distribution if you do this). Contrast this with discriminative approaches, which only have a model of the form p(Y|X). Thus, provided with an input X they can sample Y; however, they cannot sample new X.
Both assume a model. However, discriminative approaches assume only a model of how Y depends on X, not a model of X itself. Generative approaches model both. Thus, given a fixed number of parameters, you might argue (and many have) that it's easier to use them to model the thing you care about, p(Y|X), than the distribution of X, since you'll always be provided with the X for which you wish to know Y.
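A toy sketch of that contrast (my own illustration, fitting the class-wise Gaussians by hand): once a generative model of p(X,Y) is fitted, brand-new (X,Y) pairs can be drawn from it, which no purely discriminative model of p(Y|X) lets you do.

    # Fit a simple generative model (class prior + one diagonal Gaussian per class),
    # then sample a new (x, y) pair from the fitted joint distribution.
    import numpy as np
    from sklearn.datasets import load_iris

    X, y = load_iris(return_X_y=True)
    rng = np.random.default_rng(0)

    classes, counts = np.unique(y, return_counts=True)
    priors = counts / counts.sum()
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    stds = np.array([X[y == c].std(axis=0) for c in classes])

    y_new = rng.choice(classes, p=priors)           # draw Y from the prior
    x_new = rng.normal(means[y_new], stds[y_new])   # then draw X from p(X | Y = y_new)
    print("sampled class:", y_new, "sampled features:", np.round(x_new, 2))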
Useful references: this (very short) paper by Tom Minka. This seminal paper by Andrew Ng and Michael Jordan.
The distinction between parametric and non-parametric models is probably going to be harder to grasp until you have more stats experience. A parametric model has a fixed and finite number of parameters regardless of how many data points are observed. Most probability distributions are parametric: consider a variable z which is the height of people, assumed to be normally distributed. As you observe more people, your estimates of the parameters \mu and \sigma, the mean and standard deviation of z, become more accurate, but you still only have two parameters.
In contrast, the number of parameters in a non-parametric model can grow with the amount of data. Consider an induced distribution over people's heights which places a normal distribution over each observed sample, with mean given by the measurement and fixed standard deviation. The marginal distribution over new heights is then a mixture of normal distributions, and the number of mixture components increases with each new data point. This is a non-parametric model of people's height. This specific example is called a kernel density estimator. Popular (but more complicated) non-parametric models include Gaussian Processes for regression and Dirichlet Processes.
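A minimal sketch of that height example (my own numbers): the kernel density estimate places one Gaussian per observed height, so the number of mixture components grows with every new data point.

    # Kernel density estimate: a mixture with one component per observation.
    import numpy as np
    from scipy.stats import norm

    heights = np.array([162.0, 171.0, 168.0, 180.0, 175.0])   # observed heights in cm
    bandwidth = 3.0                                            # fixed std. dev. of each kernel

    def kde_pdf(x):
        # Average of len(heights) Gaussians, one centred on each observation.
        return norm.pdf(x, loc=heights, scale=bandwidth).mean()

    print("density at 170 cm:", kde_pdf(170.0))
    print("mixture components so far:", len(heights))          # grows with the data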
A pretty good tutorial on non-parametrics can be found here, which constructs the Chinese Restaurant Process as the limit of a finite mixture model.
I don't think you can say that. E.g. linear regression is a discriminative algorithm: you make an assumption about P(Y|X) and then estimate the parameters directly from the data, without making any assumption about P(X) or P(X|Y), as you would in the case of generative models. But at the same time, any inference based on linear regression, including the properties of the parameters, is a parametric estimation, as there is an assumption about the behaviour of the unobserved errors.
Here I'm only talking about parametric/non-parametric; generative/discriminative is a separate concept.
Non-parametric means you don't make any assumptions about the distribution of your data. For example, in the real world, data will not 100% follow theoretical distributions like Gaussian, beta, Poisson, Weibull, etc. Those distributions were developed out of our need to model data.
On the other hand, parametric models try to completely explain our data using parameters. In practice, this way is preferred because it makes it easier to define how the model should behave in different circumstances (for example, we already know the derivatives/gradients of the model, what happens when we set the rate too high/too low in a Poisson, etc.).