How to derive a marginal likelihood function? - machine-learning

I'm a little confused about the integral over theta in the marginal likelihood function (http://en.wikipedia.org/wiki/Marginal_likelihood, Section "Applications" > "Bayesian model comparison", the third equation on that page):
Why does the probability of X given M equal that integral, and how is the equation derived?

This integral is nothing more than the law of total probability in continuous form, so it can be derived directly from the probability axioms. Given the second formula in the link (Wikipedia), the only thing you have to do to arrive at the formula you are looking for is to replace the sum over discrete states by an integral.
So, what does it mean intuitively? You assume a model for your data X, which depends on a variable theta. For a given theta, the probability of a dataset X is thus p(X|theta). As you are not sure of the exact value of theta, you choose it to follow a distribution p(theta|alpha) specified by a (constant) parameter alpha. Now, the distribution of X is directly determined by alpha (this should be clear: just ask yourself whether there is anything else it might depend on, and you will find nothing). Therefore, you can calculate its exact influence by integrating out the variable theta. This is what the law of total probability states.
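In symbols (just restating the argument above in the notation of the Wikipedia page), replacing the discrete sum by an integral reads:

P(X \mid \alpha) = \sum_i P(X \mid \theta_i)\, P(\theta_i \mid \alpha)
\quad \longrightarrow \quad
p(X \mid \alpha) = \int p(X \mid \theta)\, p(\theta \mid \alpha)\, d\theta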
If this explanation doesn't get you there, I suggest you play around a bit with conditional probabilities for discrete states, which often leads to obvious results. The extension to the continuous case is then straightforward.
EDIT: The third equation shows the same thing I tried to explain above. You have a model M. This model has parameters theta distributed according to p(theta|M) -- you could also write this as p_M(theta), for example.
These parameters determine the distribution of the data X via p(X|theta, M) ... i.e. each theta gives a different distribution of X (for a chosen model M). This form, however, is not convenient to work with. What you want is a summarized statement about the model M, not about its various possible choices of theta. So, in a way, you now want to know the average of X given a model M (note that the model M also includes a chosen distribution of its parameters. For example, M does not simply mean "Neural Network", but rather something like "Neural Network with weights uniformly distributed in [-1,1]").
Obtaining this "average" requires only basic statistics: take the model p(X|theta, M), multiply it by the density p(theta|M), and integrate over theta. This is essentially what you do for any average in statistics. Altogether, you arrive at the marginalization p(X|M).
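To make this concrete, here is a minimal numeric sketch (my own example, not from the question): for a Bernoulli model M with a Beta(a, b) prior on theta, p(X|M) has a closed form that we can check against brute-force integration of p(X|theta) p(theta|M):

    # Marginal likelihood of a coin-toss dataset under a Beta prior.
    import numpy as np
    from scipy import integrate
    from scipy.special import betaln

    a, b = 2.0, 2.0          # prior p(theta|M) = Beta(a, b)
    heads, tails = 7, 3      # observed data X (a specific sequence)

    def integrand(theta):
        # p(X|theta, M) * p(theta|M)
        likelihood = theta**heads * (1 - theta)**tails
        prior = theta**(a - 1) * (1 - theta)**(b - 1) / np.exp(betaln(a, b))
        return likelihood * prior

    numeric, _ = integrate.quad(integrand, 0.0, 1.0)
    closed_form = np.exp(betaln(a + heads, b + tails) - betaln(a, b))
    print(numeric, closed_form)   # the two agree: this is p(X|M)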

Related

What does the y-axis of a Partial Dependence Plot (PDP) for binary classification mean?

I am not sure what the y-axis of my PDP implies. Is it the probability of my target being 1 (binary classification), or something else?
If you do the partial dependence plot of column a and you want to interpret the y value at x = 0.0, the y-axis value represents the average probability of class 1, computed by:
changing the value of column a in all rows of your dataset to 0.0
predicting all the changed rows with your fitted model
averaging the probabilities given by the model
I may not be good at explaining this, but you can read more about PDPs at https://christophm.github.io/interpretable-ml-book/pdp.html. Hope this helps :)
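A minimal sketch of those three steps, assuming a scikit-learn-style classifier with predict_proba (the function name pdp_point is mine, just for illustration):

    import numpy as np

    def pdp_point(model, X, column, grid_value):
        """Average P(class 1) with X[:, column] forced to grid_value."""
        X_mod = X.copy()
        X_mod[:, column] = grid_value            # step 1: overwrite the column
        proba = model.predict_proba(X_mod)[:, 1] # step 2: predict each changed row
        return proba.mean()                      # step 3: average the probabilities

    # e.g. the PDP value of column a (index 0) at x = 0.0:
    # pdp_point(clf, X, column=0, grid_value=0.0)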
Generally speaking, we can produce a classifier from a function f producing a real-valued output, plus a threshold. We call the output an 'activation'. If the activation meets the threshold condition, we say the class is detected:
is_class := ( f(x0, x1, ...) > threshold )
and
activation = f(x0, x1, ...)
PDP plots simply show activation values as they change in response to changes in an input value (we ignore the threshold). That is, we might plot:
f(x0, x, x2, x3, ...)
as a single input x varies. Typically, we hold the others constant, although we can also plot in 2d and 3d.
Sometimes we're interested in:
how a single input changes the activation
how multiple inputs independently change the activation
how multiple activations change based on different inputs, and so on.
Strictly speaking, we need not even be talking about a classifier when looking at PDP plots. Any function that produces a real-valued output (an activation) in response to one or more real-valued feature inputs that we can vary allows us to produce PDP plots.
Classifier activations need not be, and often should not be, interpreted as probabilities, as others have written. In many cases that interpretation is simply incorrect. Nevertheless, the analysis of activation levels is of interest to us, independently of whether the activations represent probabilities: in PDP plots we can see, for example, which feature values produce strong change - a more horizontal plot may imply a worthless feature.
Similarly, in ROC plots, we explicitly examine the true-positive and false-positive detection rates that result from varying the threshold on activation values.
In both cases, there's no necessity that the classifier produce probabilities as its activation.
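For instance, here is a small sketch (with made-up labels and raw activation scores, not probabilities) of the threshold sweep behind an ROC plot, using scikit-learn's roc_curve:

    import numpy as np
    from sklearn.metrics import roc_curve

    y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
    scores = np.array([-2.1, 0.3, 1.7, 0.9, 2.4, -0.5, 0.1, 1.2])  # raw activations

    # roc_curve happily accepts arbitrary scores; no probabilities required.
    fpr, tpr, thresholds = roc_curve(y_true, scores)
    for f, t, th in zip(fpr, tpr, thresholds):
        print(f"threshold {th:+.2f}: TPR={t:.2f}, FPR={f:.2f}")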
Interpretation of PDP plots is fraught with dangers. At a minimum, you need to be clear about what is being held constant as an input feature is varied. Were the other features set to zero (a good choice for linear models)? Did we set them to their most common values in the test set? Or to the most common values for a known class in a sample? Without this information, the vertical axis may be less helpful.
Knowing that an activation is a probability also doesn't seem too helpful in PDP plots -- you can't expect the area under the curve to sum to one. Perhaps the most useful thing you might find is error cases, where output "probabilities" fall outside the range 0..1.

Full-Rank Assumption in Least Squares Estimation (Linear Regression)

In Ordinary Least Squares estimation, the assumption is that the samples matrix X (of shape N_samples x N_features) has "full column rank".
This is apparently needed so that the linear regression can be reduced to a simple algebraic equation using the Moore–Penrose inverse. See this section of the Wikipedia article for OLS:
https://en.wikipedia.org/wiki/Ordinary_least_squares#Estimation
In theory this means that if all columns of X (i.e. features) are linearly independent we can make an assumption that makes OLS simple to calculate, correct?
What does this mean in practice?
Does this mean that OLS is not calculable and will result in an error for such input data X? Or will the result just be bad?
Are there any classical datasets for which linear regression fails due to this assumption not being true?
The full-rank assumption is only needed if you use the inverse (or Cholesky decomposition, or QR, or any other method that is (mathematically) equivalent to computing the inverse). If you use the Moore-Penrose inverse you will still compute an answer. When the full-rank assumption is violated there is no longer a unique answer, i.e. there are many x that minimise
||A*x-b||
The one you will compute with the Moore-Penrose inverse will be the x of minimum norm. See here, for example.
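A quick sketch of this behaviour (illustrative data; the second column is an exact duplicate of the first, so X is rank-deficient):

    import numpy as np

    rng = np.random.default_rng(0)
    x1 = rng.normal(size=50)
    X = np.column_stack([x1, 2 * x1, np.ones(50)])  # second column = 2 * first: rank-deficient
    y = 3 * x1 + 1 + 0.1 * rng.normal(size=50)

    # Naive normal equations fail here: X^T X is singular.
    # np.linalg.solve(X.T @ X, X.T @ y)  # raises LinAlgError (or returns garbage)

    beta = np.linalg.pinv(X) @ y                   # Moore-Penrose: minimum-norm solution
    beta2, *_ = np.linalg.lstsq(X, y, rcond=None)  # lstsq picks the same solution
    print(beta, beta2)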

Implementing a Naive Bayes Gaussian classifier on number classification data

I am trying to implement a Naive Bayes Gaussian classifier on number classification data, where each feature represents a pixel.
While implementing this I hit a bump: I've noticed that some of the feature variances equal 0.
This is an issue because I cannot divide by 0 when evaluating the probability.
What can I do to work around this?
The very short answer is that you cannot: even though you can usually try to fit a Gaussian distribution to any data (no matter its true distribution), there is one exception - the constant case (0 variance). So what can you do? There are three main solutions:
Ignore 0-variance pixels. I do not recommend this approach as it loses information, but if a pixel has 0 variance for each class (which is a common case for MNIST - some pixels are black, independently of class) then it is actually fully justified mathematically. Why? The answer is really simple: if, for each class, a given feature is constant (equal to some single value), then it carries literally no information for classification, so ignoring it will not affect a model which assumes conditional independence of features (such as NB).
Instead of the MLE estimate (i.e. using N(mean(X), std(X))), use a regularised estimator, for example of the form N(mean(X), std(X) + eps), which is equivalent to adding eps-noise independently to each pixel. This is a very generic approach that I would recommend (see the sketch after this list).
Use a better distribution class. If your data is images (and since you have 0 variance I assume these are binary images, maybe even MNIST), you have K features, each in the [0, 1] interval. You can use a multinomial distribution with bucketing, so P(x in B_i | y) = #{ x in B_i | y } / #{ x | y }. This is usually the best thing to do (though it requires some knowledge of your data): the problem is that you are trying to use a model which is not suited to the data provided, and I can assure you that a proper distribution will always give better results with NB. So how can you find a good distribution? Plot the conditional marginals P(x_i | y) for each feature and see what they look like; based on that, choose the distribution class which matches the behaviour. I can assure you these will not look like Gaussians.
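For option 2, note that scikit-learn's GaussianNB already implements exactly this kind of regularisation via its var_smoothing parameter; here is a minimal sketch on the classic digits data (my choice of dataset, just for illustration - its corner pixels are constant):

    from sklearn.naive_bayes import GaussianNB
    from sklearn.datasets import load_digits
    from sklearn.model_selection import train_test_split

    X, y = load_digits(return_X_y=True)   # 8x8 digit images, one pixel per feature
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # var_smoothing adds (var_smoothing * largest feature variance) to every
    # per-class variance, so 0-variance pixels no longer cause division by zero.
    clf = GaussianNB(var_smoothing=1e-9)
    clf.fit(X_train, y_train)
    print(clf.score(X_test, y_test))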

What does Maximum Likelihood Estimation exactly mean?

When we are training our model we usually use MLE to estimate its parameters. I know it means that the most probable data for such a learned model is our training set. But I'm wondering whether its probability is exactly 1 or not?
You almost have it right. The Likelihood of a model (theta) for the observed data (X) is the probability of observing X, given theta:
L(theta|X) = P(X|theta)
For Maximum Likelihood Estimation (MLE), you choose the value of theta that provides the greatest value of P(X|theta). This does not necessarily mean that the observed value of X is the most probable for the MLE estimate of theta. It just means that there is no other value of theta that would provide a higher probability for the observed value of X.
In other words, if T1 is the MLE estimate of theta, and T2 is any other possible value of theta, then P(X|T1) >= P(X|T2). However, there could still be another possible value of the data (Y), different from the observed data (X), such that P(Y|T1) > P(X|T1).
The probability of X for the MLE estimate of theta is not necessarily 1 (and probably never is except for trivial cases). This is expected since X can take multiple values that have non-zero probabilities.
To build on what bogatron said with an example, the parameters learned from MLE are the ones that explain the data you see (and nothing else) the best. And no, the probability is not 1 (except in trivial cases).
As an example (one that has been used billions of times) of what MLE does:
If you have a simple coin-toss problem, and you observe 5 results of coin tosses (H, H, H, T, H) and you do MLE, you will end up giving p(coin_toss == H) a high value (0.80) because you see Heads far more often. There are good and bad things about MLE, obviously...
Pros: It is an optimization problem, so it is generally quite fast to solve (even if there isn't an analytical solution).
Cons: It can overfit when there isn't a lot of data (like our coin-toss example).
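A tiny sketch of the coin example above: the Bernoulli MLE is just the empirical frequency, which a brute-force grid search over the likelihood confirms:

    import numpy as np

    tosses = np.array([1, 1, 1, 0, 1])              # H, H, H, T, H
    heads = tosses.sum()
    tails = len(tosses) - heads

    p_grid = np.linspace(0.001, 0.999, 999)         # candidate values of p
    likelihood = p_grid**heads * (1 - p_grid)**tails

    print(p_grid[np.argmax(likelihood)])            # ~0.80 = 4/5, the MLE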
The example I got in my stat classes was as follows:
A suspect is on the run! Nothing is known about them, except that they're approximately 1.80 m tall. Should the police look for a man or a woman?
The idea here is that you have a parameter for your model (M/F), and probabilities given that parameter. There are tall men, tall women, short men and short women. However, in the absence of any other information, the probability of a man being 1m80 is larger than the probability of a woman being 1m80. Likelihood (as bogatron very well explained) is a formalisation of that, and maximum likelihood is the estimation method based on favouring parameters which are more likely to result in the actual observations.
But that's just a toy example, with a single binary variable... Let's expand it a bit: I threw two identical dice, and the sum of their values is 7. How many sides do my dice have? Well, we all know that the probability of two D6 summing to 7 is quite high. But it might as well be D4, D20, D100, ... However, P(7 | 2D6) > P(7 | 2D20), and P(7 | 2D6) > P(7 | 2D100), ..., so you might estimate that my dice are 6-sided. That doesn't mean it's true, but it's a reasonable estimation, in the absence of any additional information.
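A short sketch of the dice example (brute-force enumeration; the candidate side counts are my own):

    from itertools import product

    def p_sum_seven(n_sides):
        """P(sum = 7) for two fair n-sided dice, by enumerating all outcomes."""
        outcomes = list(product(range(1, n_sides + 1), repeat=2))
        return sum(a + b == 7 for a, b in outcomes) / len(outcomes)

    for n in (4, 6, 20, 100):
        print(n, p_sum_seven(n))   # n = 6 maximises the likelihood of observing 7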
That's better, but we're not in machine-learning territory yet... Let's get there: if you want to fit your umptillion-layer neural network on some empirical data, you can consider all possible parameterisations, and how likely each of them is to return the empirical data. That's exploring an umptillion-dimensional space, each dimension having infinitely many possibilities, but you can map every single one of these points to a likelihood. It is then reasonable to fit your network using these parameters: given that the empirical data did occur, it is reasonable to assume that they should be likely under your model.
That doesn't mean that your parameters are likely! Just that, under these parameters, the observed value is likely. Statistical estimation is usually not a closed problem with a single solution (like solving an equation might be, where you would have a probability of 1); instead, we need to find the best solution according to some metric. Likelihood is such a metric, and it is used widely because it has some interesting properties:
It makes intuitive sense
It's reasonably simple to compute, fit and optimise, for a large family of models
For normal variables (which tend to crop up everywhere), MLE gives the same results as other methods, such as least-squares estimation
Its formulation in terms of conditional probabilities makes it easy to use/manipulate it in Bayesian frameworks

Query about SVM mapping of input vector? And SVM optimization equation

I have read through a lot of papers and understand the basic concept of a support vector machine at a very high level. You give it a training input vector with a set of features - let's call it x - and based on how the "optimization function" evaluates x (let's say we're talking about text classification), the text associated with the input vector x is classified into one of two pre-defined classes; this is the case of binary classification.
So my first question: throughout the procedure described above, all the papers say that this training input vector x is first mapped to a higher (maybe infinite) dimensional space. What does this mapping achieve, and why is it required? Let's say the input vector x has 5 features; who decides which "higher dimension" x is going to be mapped to?
Second question is about the following optimization equation:
min (1/2) w^T w + C * sum_{i=1..n} ξ_i
so I understand that w has something to do with the margins of the hyperplane from the support vectors in the graph, and I know that C is some sort of penalty, but I don't know what it is a penalty for. Also, what does ξ_i represent in this case?
A simple explanation of the second question would be much appreciated as I have not had much luck understanding it by reading technical papers.
When they talk about mapping to a higher-dimensional space, they mean that the kernel accomplishes the same thing as mapping the points to a higher-dimensional space and then taking dot products there. SVMs are fundamentally a linear classifier, but if you use kernels, they're linear in a space that's different from the original data space.
To be concrete, let's talk about the kernel
K(x, y) = (xy + 1)^2 = (xy)^2 + 2xy + 1,
where x and y are each real numbers (one-dimensional). Note that
(x^2, sqrt(2) x, 1) • (y^2, sqrt(2) y, 1) = x^2 y^2 + 2 x y + 1
has the same value. So K(x, y) = phi(x) • phi(y), where phi(a) = (a^2, sqrt(2) a, 1), and doing an SVM with this kernel (the inhomogeneous polynomial kernel of degree 2) is the same as if you first mapped your 1d points into this 3d space and then used a linear kernel.
The popular Gaussian RBF kernel function is equivalent to mapping your points into an infinite-dimensional Hilbert space.
You're the one who decides what feature space the data is mapped into, when you pick a kernel. You don't necessarily need to think about the explicit mapping when you do that, though, and it's important to note that the data is never actually transformed into that high-dimensional space explicitly - otherwise infinite-dimensional points would be hard to represent. :)
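If you want to convince yourself numerically, here is a quick check (toy values) that the kernel above really equals a dot product in that 3d feature space:

    import numpy as np

    def K(x, y):
        # inhomogeneous polynomial kernel of degree 2, for 1d inputs
        return (x * y + 1) ** 2

    def phi(a):
        # the explicit 3d feature map from above
        return np.array([a**2, np.sqrt(2) * a, 1.0])

    x, y = 1.5, -0.7
    print(K(x, y), phi(x) @ phi(y))   # identical values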
The ξ_i are the "slack variables". Without them, SVMs would never be able to account for training sets that aren't linearly separable -- which most real-world datasets aren't. The ξ are, in some sense, the amounts by which you need to push data points on the wrong side of the margin over to the correct side. C is a parameter that determines how much it costs you to increase the ξ (that's why it's multiplied there).
1) The higher-dimensional space comes about through the kernel mechanism. However, when evaluating a test sample, the higher-dimensional space need not be explicitly computed. (Clearly this must be the case, because we cannot represent infinite dimensions on a computer.) For instance, radial basis function kernels imply infinite-dimensional spaces, yet we don't need to map into this infinite-dimensional space explicitly. We only need to compute K(x_sv, x_test), where x_sv is one of the support vectors and x_test is the test sample.
The specific higher dimensional space is chosen by the training procedure and parameters, which choose a set of support vectors and their corresponding weights.
2) C is the weight associated with the cost of not being able to classify the training set perfectly. The optimization equation says to trade off between the two undesirable cases of non-perfect classification and low margin. The ξ_i variables represent by how much we're unable to classify instance i of the training set, i.e., the training error on instance i.
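For reference, scikit-learn's SVC exposes exactly this C; here is a small sketch (synthetic data) showing that a smaller C tolerates more slack, which typically shows up as more support vectors:

    from sklearn.svm import SVC
    from sklearn.datasets import make_classification

    X, y = make_classification(n_samples=200, n_features=5, random_state=0)
    for C in (0.01, 1.0, 100.0):
        clf = SVC(kernel="rbf", C=C).fit(X, y)
        print(C, clf.n_support_)   # support vectors per class, typically fewer as C grows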
See Chris Burges' tutorial on SVMs for about the most intuitive explanation of this stuff you're going to get anywhere (IMO).
