Iterated conditional modes (ICM) as an approximate E step in EM - machine-learning

I wanted to know what the mathematical justification is for using ICM (iterated conditional modes) as an approximation to the E step in an EM algorithm.
As I understand it, the idea of the E step is either to find a distribution equal to the posterior distribution of the latent variable, which guarantees that the likelihood increases, or to find the best possible distribution from some simpler family of distributions, which guarantees that a lower bound on the likelihood function increases.
How does one mathematically justify the use of ICM in such an E step? Any references/derivations/notes would be very helpful.

Let's consider a simple CRF which represents the likelihood of a labelling y given an observation x, and assume the likelihood also depends on a parameter \theta. At inference time you only know x and are trying to infer y. What you do is apply EM in a hard (point-estimate) fashion: the E step finds the labelling y (argmax_y P(y|x,\theta)) and the M step finds the parameter \theta (argmax_\theta P(\theta|x,y)). The M step can be carried out with any standard optimization algorithm, because \theta is in general not high dimensional (at least not as high dimensional as y). The E step is then simply inference over an MRF/CRF with no hidden variables, since \theta is optimized separately in the M step. ICM is one algorithm for performing that inference: it updates one label at a time to the value that maximizes its conditional probability given the rest, so each sweep is a coordinate ascent step that cannot decrease P(y|x,\theta). If you want a reference, you can read Murphy's book http://www.cs.ubc.ca/~murphyk/MLbook/; I think Chapter 26 is quite relevant.
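For concreteness, here is a minimal sketch (my own notation: `unary` holds negative log-likelihoods per site and label under the current \theta, `beta` is an assumed Potts smoothness weight) of what such an ICM E step looks like on a grid MRF; the M step would then re-fit \theta on (x, y_hat):

    import numpy as np

    def icm_estep(unary, beta, n_sweeps=5):
        """Approximate hard E step: ICM for y_hat = argmax_y P(y | x, theta).

        unary[i, j, k] = -log P(x_ij | y_ij = k, theta)   (given, from the current theta)
        beta           = strength of the Potts pairwise smoothness term (assumed)
        """
        H, W, K = unary.shape
        y = unary.argmin(axis=2)                 # start from the unary-only labelling
        for _ in range(n_sweeps):
            for i in range(H):
                for j in range(W):
                    costs = unary[i, j].astype(float)
                    for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                        ni, nj = i + di, j + dj
                        if 0 <= ni < H and 0 <= nj < W:
                            # pay beta for every label that disagrees with this neighbour
                            costs += beta * (np.arange(K) != y[ni, nj])
                    y[i, j] = costs.argmin()     # coordinate-wise maximization = ICM
        return y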

Related

Full-Rank Assumption in Least Squares Estimation (Linear Regression)

In Ordinary Least Square Estimation, the assumption is for the Samples matrix X (of shape N_samples x N_features) to have "full column rank".
This is apparently needed so that the linear regression can be reduced to a simple algebraic equation using the Moore–Penrose inverse. See this section of the Wikipedia article for OLS:
https://en.wikipedia.org/wiki/Ordinary_least_squares#Estimation
In theory this means that if all columns of X (i.e. the features) are linearly independent, OLS reduces to a simple closed-form calculation, correct?
What does this mean in practice?
Does this mean that OLS is not calculable and will result in an error for such input data X? Or will the result just be bad?
Are there any classical datasets for which linear regression fails due to this assumption not being true?
The full-rank assumption is only needed if you use the inverse (or the Cholesky decomposition, QR, or any other method that is mathematically equivalent to computing the inverse). If you use the Moore-Penrose pseudoinverse you will still compute an answer. When the full-rank assumption is violated there is no longer a unique answer, i.e. there are many x that minimise
||A*x - b||
The one you will compute with the Moore-Penrose pseudoinverse is the x of minimum norm. See here, for example.
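A small NumPy sketch of both behaviours (the data and the duplicated column are made up for illustration): solving the normal equations fails when X lacks full column rank, while an SVD/pseudoinverse-based solver still returns the minimum-norm solution.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 2))
    X = np.column_stack([X, X[:, 0]])          # duplicate a column: no full column rank
    y = X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.1, size=100)

    # Normal equations: X^T X is singular, so a direct solve raises an error.
    try:
        beta = np.linalg.solve(X.T @ X, X.T @ y)
    except np.linalg.LinAlgError as err:
        print("normal equations failed:", err)

    # SVD-based least squares (equivalent to applying the Moore-Penrose
    # pseudoinverse) still works and returns the minimum-norm minimiser.
    beta_min_norm, *_ = np.linalg.lstsq(X, y, rcond=None)
    print("minimum-norm solution:", beta_min_norm)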

Why is inference in Markov Random Fields hard?

I'm studying Markov Random Fields, and, apparently, inference in MRF is hard / computationally expensive. Specifically, Kevin Murphy's book Machine Learning: A Probabilistic Perspective says the following:
"In the first term, we fix y to its observed values; this is sometimes called the clamped term. In the second term, y is free; this is sometimes called the unclamped term or contrastive term. Note that computing the unclamped term requires inference in the model, and this must be done once per gradient step. This makes training undirected graphical models harder than training directed graphical models."
Why are we performing inference here? I understand that we're summing over all y's, which seems expensive, but I don't see where we're actually estimating any parameters. Wikipedia also talks about inference, but only about calculating the conditional distribution and needing to sum over all non-specified nodes... but that's not what we're doing here, is it?
Alternatively, does anyone have good intuition for why inference in MRFs is difficult?
Sources:
Chapter 19 of ML:PP: https://www.cs.ubc.ca/~murphyk/MLbook/pml-print3-ch19.pdf
(The specific passage referred to, around Equations 19.38 and 19.41, is quoted above.)
When training your CRF, you want to estimate your parameters, \theta.
In order to do this, you can differentiate your loss function (Equation 19.38) with respect to \theta, set it to 0, and solve for \theta.
You can't solve that equation for \theta analytically, though. You can, however, minimise Equation 19.38 by gradient descent, and since the loss function is convex, gradient descent is guaranteed to reach the globally optimal solution when it converges.
Equation 19.41 is the gradient you need in order to do gradient descent. The first term is easy (and computationally cheap) to compute, because you are just summing over the observed values of y. The second term, however, requires inference: you are no longer summing over the observed y, but over every possible configuration of y, weighted by its probability under the current model. In other words, you have to compute an expectation under the model (its marginals), which is exactly inference, and it has to be redone at every gradient step.
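To make the cost concrete, here is a brute-force sketch (the names `theta` and `feats` are mine) of the "unclamped" model expectation for a tiny binary MRF defined by p(y) ∝ exp(theta · feats(y)); the sum runs over all 2^n labellings, which is exactly what makes each gradient step expensive for realistic graphs:

    import itertools
    import numpy as np

    def model_feature_expectation(theta, feats, n_nodes):
        """E_p[feats(y)] under p(y) ∝ exp(theta · feats(y)), by full enumeration."""
        configs = list(itertools.product([0, 1], repeat=n_nodes))
        scores = np.array([theta @ feats(y) for y in configs])
        probs = np.exp(scores - scores.max())
        probs /= probs.sum()                    # normalisation = partition function
        return sum(p * feats(y) for p, y in zip(probs, configs))

    # Example: 3-node chain with node features and edge-agreement features.
    feats = lambda y: np.array([y[0], y[1], y[2],
                                float(y[0] == y[1]), float(y[1] == y[2])])
    theta = np.zeros(5)
    print(model_feature_expectation(theta, feats, 3))   # uniform model: all 0.5

The gradient in Equation 19.41 is then the empirical feature expectation (from the clamped, observed y) minus this model expectation.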

max-margin linear separator using libsvm

I have a set of N data points X with +/- labels for which I'd like to calculate the max-margin linear separator (aka classifier, hyperplane) or fail if no such linear separator exist.
I do not want to avoid overfitting in the context of this question, as I handle that elsewhere. So no slack variables, no cross-validation, no limits on the number of support vectors; just find the max-margin separator or fail.
How do I use libsvm to do so? I believe you can't give c=0 in C-SVM and you can't give nu=1 in nu-SVM.
Related question (which I think didn't provide an answer):
Which of the parameters in LibSVM is the slack variable?
In the case of C-SVM, use a linear kernel and a very large C value (for nu-SVM, use a very small nu instead: nu is an upper bound on the fraction of margin errors, so pushing it towards 0 plays the role that a huge C plays in C-SVM). If you still have nonzero slacks with this setting, your data is probably not linearly separable.
Quick explanation: the C-SVM objective tries to find the hyperplane with maximum margin and lowest misclassification cost at the same time. The misclassification cost in the C-SVM formulation is the distance from each margin-violating point to its correct side of the margin, multiplied by C. If you increase C, every violation becomes very costly, and a hyperplane that separates the data perfectly will be preferred by the optimizer whenever one exists.
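A sketch of how this looks in practice, using libsvm through scikit-learn's SVC wrapper (the function name and tolerance are mine; with the libsvm command-line tools the equivalent is roughly svm-train -s 0 -t 0 -c <huge value>). Labels are assumed to be +/-1:

    import numpy as np
    from sklearn.svm import SVC

    def hard_margin_or_fail(X, y, tol=1e-3):
        """Approximate a hard-margin linear SVM by making slack prohibitively
        expensive, then verify that no training point actually uses slack."""
        clf = SVC(kernel="linear", C=1e10)       # huge C: slack is effectively forbidden
        clf.fit(X, y)
        # Separation means y_i * f(x_i) >= 1 for every training point;
        # anything below that corresponds to a nonzero slack xi_i.
        margins = y * clf.decision_function(X)
        if np.any(margins < 1 - tol):
            raise ValueError("data is not linearly separable (nonzero slack needed)")
        return clf.coef_.ravel(), clf.intercept_[0]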

How to derive a marginal likelihood function?

I'm a little confused about the integral over theta in the marginal likelihood function (http://en.wikipedia.org/wiki/Marginal_likelihood, Section: "Applications" / "Bayesian model comparison", the third equation on that page):
Why does the probability of X given M equal that integral, and how is the equation derived?
This integral is nothing more than the law of total probability in continuous form, so it can be derived directly from the probability axioms. Given the second formula in the link (Wikipedia), the only thing you have to do to arrive at the formula you are looking for is replace the sum over discrete states by an integral.
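Written out (in LaTeX, just restating that step), this is:

    \begin{align}
      p(X \mid M) &= \sum_{\theta} p(X \mid \theta, M)\, p(\theta \mid M)
        && \text{(law of total probability, discrete } \theta)\\
      p(X \mid M) &= \int p(X \mid \theta, M)\, p(\theta \mid M)\, d\theta
        && \text{(continuous } \theta \text{: the sum becomes an integral)}
    \end{align}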
So, what does it mean intuitively? You assume a model for your data X which depends on a variable theta. For a given theta, the probability of a dataset X is thus p(X|theta). As you are not sure of the exact value of theta, you let it follow a distribution p(theta|alpha) specified by a (constant) parameter alpha. Now, the distribution of X is directly determined by alpha (this should be clear: just ask yourself whether there is anything else it might depend on, and you will find nothing). Therefore, you can calculate alpha's exact influence by integrating out the variable theta. This is what the law of total probability states.
If you don't get it from this explanation, I suggest you play around a bit with conditional probabilities for discrete states, which often leads to obvious results. The extension to the continuous case is then straightforward.
EDIT: The third equation shows the same thing I tried to explain above. You have a model M. This model has parameters theta distributed according to p(theta|M); you could also write this p_M(theta), for example.
These parameters determine the distribution of the data X via p(X|theta, M), i.e. each theta gives a different distribution of X (for a chosen model M). This form, however, is not convenient to work with. What you want is a summarized statement about the model M itself, not about its various possible choices of theta. So, in a way, you want the average behaviour of X given a model M (note that the model M also includes a chosen distribution over its parameters; for example, M does not simply mean "neural network", but rather something like "neural network with weights uniformly distributed in [-1,1]").
Obtaining this "average" requires only basic statistics: take the model p(X|theta, M), multiply it by the density p(theta|M), and integrate over theta, just as you would for any average in statistics. Altogether, you arrive at the marginal likelihood p(X|M).

Query about SVM mapping of input vector? And SVM optimization equation

I have read through a lot of papers and understand the basic concept of a support vector machine at a very high level. You give it a training input vector with a set of features; call it x. Based on how the "optimization function" evaluates x (say we're talking about text classification), the text associated with the input vector x is classified into one of two predefined classes; this is the binary-classification case.
So my first question: all the papers say that this training input vector x is first mapped to a higher (maybe infinite) dimensional space. What does this mapping achieve, and why is it required? Say the input vector x has 5 features; who decides which "higher dimension" x is going to be mapped to?
Second question is about the following optimization equation:
min over w, b, ξ:   (1/2) w^T w + C Σ_{i=1..n} ξ_i
I understand that w has something to do with the margin of the hyperplane relative to the support vectors, and I know that C is some sort of penalty, but I don't know what it is a penalty for. Also, what does ξi represent in this case?
A simple explanation of the second question would be much appreciated as I have not had much luck understanding it by reading technical papers.
When they talk about mapping to a higher-dimensional space, they mean that the kernel accomplishes the same thing as mapping the points to a higher-dimensional space and then taking dot products there. SVMs are fundamentally a linear classifier, but if you use kernels, they're linear in a space that's different from the original data space.
To be concrete, let's talk about the kernel
K(x, y) = (xy + 1)^2 = (xy)^2 + 2xy + 1,
where x and y are each real numbers (one-dimensional). Note that
(x^2, sqrt(2) x, 1) • (y^2, sqrt(2) y, 1) = x^2 y^2 + 2 x y + 1
has the same value. So K(x, y) = phi(x) • phi(y), where phi(a) = (a^2, sqrt(2) a, 1), and doing an SVM with this kernel (the inhomogeneous polynomial kernel of degree 2) is the same as if you first mapped your 1d points into this 3d space and then used a linear kernel there.
The popular Gaussian RBF kernel function is equivalent to mapping your points into an infinite-dimensional Hilbert space.
You're the one who decides what feature space the data is mapped into, when you pick a kernel. You don't necessarily need to think about the explicit mapping when you do that, though, and it's important to note that the data is never actually transformed into that high-dimensional space explicitly (otherwise the infinite-dimensional points would be impossible to represent). :)
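You can check the degree-2 example above numerically; this small script just verifies that the kernel value equals the dot product under the explicit 3-d feature map:

    import numpy as np

    def K(x, y):
        # inhomogeneous polynomial kernel of degree 2 on 1-d inputs
        return (x * y + 1) ** 2

    def phi(a):
        # explicit feature map: phi(a) = (a^2, sqrt(2)*a, 1)
        return np.array([a ** 2, np.sqrt(2) * a, 1.0])

    rng = np.random.default_rng(1)
    for _ in range(5):
        x, y = rng.normal(size=2)
        assert np.isclose(K(x, y), phi(x) @ phi(y))
    print("kernel values match the explicit feature-space dot products")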
The ξ_i are the "slack variables". Without them, SVMs would never be able to account for training sets that aren't linearly separable -- which most real-world datasets aren't. The ξ in some sense are the amount you need to push data points on the wrong side of the margin over to the correct side. C is a parameter that determines how much it costs you to increase the ξ (that's why it's multiplied there).
1) The higher-dimensional space comes in through the kernel mechanism. However, when evaluating a test sample, the higher-dimensional space need never be computed explicitly. (Clearly this must be the case, because we cannot represent infinite dimensions on a computer.) For instance, radial basis function kernels imply an infinite-dimensional space, yet we never map into it explicitly; we only need to compute K(x_sv, x_test), where x_sv is one of the support vectors and x_test is the test sample.
The specific higher-dimensional space is determined by the kernel and its parameters; the training procedure then chooses, within that space, a set of support vectors and their corresponding weights.
2) C is the weight associated with the cost of not being able to classify the training set perfectly. The optimization equation says to trade-off between the two undesirable cases of non-perfect classification and low margin. The ξi variables represent by how much we're unable to classify instance i of the training set, i.e., the training error of instance i.
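If it helps to see the ξi concretely: after training, the slack of each training point can be read off from the learned decision function as ξ_i = max(0, 1 - y_i f(x_i)). A quick sketch (the function name is mine; scikit-learn's SVC wraps libsvm, and labels are assumed to be +/-1):

    import numpy as np
    from sklearn.svm import SVC

    def training_slacks(X, y, C=1.0):
        """Fit a linear C-SVM and return the slack xi_i of every training point."""
        clf = SVC(kernel="linear", C=C).fit(X, y)
        xi = np.maximum(0.0, 1.0 - y * clf.decision_function(X))
        return xi   # xi_i > 0: point i is inside the margin or misclassified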
See Chris Burges' tutorial on SVMs for about the most intuitive explanation of this stuff you're going to get anywhere (IMO).
