Write Dirichlet Log Likelihood with DCP ruleset - cvxpy

I would like to write the log likelihood of the Dirichlet density as a disciplined convex programming (DCP) optimization problem with respect to the parameters of the Dirichlet distribution alpha. However, the log likelihood
import numpy as np
import scipy.special

def dirichlet_log_likelihood(p, alpha):
    """Log of the Dirichlet density.
    p: Numpy array of shape (K,) that sums to 1.
    alpha: Numpy array of shape (K,) with positive elements.
    """
    L = np.log(scipy.special.gamma(alpha.sum()))
    L -= np.log(scipy.special.gamma(alpha)).sum()
    L += np.sum((alpha - 1) * np.log(p))
    return L
despite being concave in alpha, is not DCP-compliant as written because it involves the difference of the two convex terms np.log(gamma(alpha.sum())) and np.log(gamma(alpha)).sum(). I would like, if possible, to formulate this function of alpha so that it follows the DCP ruleset, so that maximum-likelihood estimation of alpha can be performed with cvxpy.
Is this possible, and if so how might I do it?

As you note, np.log(gamma(alpha.sum())) and -np.log(gamma(alpha)).sum() have different curvature, so you need to combine them as
np.log(gamma(alpha.sum()) / gamma(alpha).prod())
to have any chance of modelling them under the DCP ruleset. The combined expression above is the negative logarithm of the multivariate beta function, and since the multivariate beta function can be written as a product of bivariate beta functions (see here), you can expand the log-product into a sum of logs where each term is of the form
np.log(beta(x,y))
and this is the convex atom you need in your DCP ruleset. What remains for you, to use it in practice, is to feed an approximation of this atom into cvxpy. The np.log(gamma(x)) approximation here will serve as a good starting point.
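For intuition, here is a small numerical check (plain NumPy/SciPy, outside cvxpy, with variable names of my own choosing) of the product decomposition referred to above: the log of the multivariate beta function equals a sum of bivariate log-beta terms over partial sums of alpha.

import numpy as np
from scipy.special import gammaln, betaln

alpha = np.array([0.5, 2.0, 3.5])

# log of the multivariate beta function, computed directly
log_B_direct = gammaln(alpha).sum() - gammaln(alpha.sum())

# the same quantity as a sum of bivariate log-beta terms over partial sums
log_B_chain = sum(betaln(alpha[:k].sum(), alpha[k]) for k in range(1, len(alpha)))

print(np.isclose(log_B_direct, log_B_chain))  # True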
Please see math.stackexchange.com for more details.

Related

Dimensions of LSTM variant in Deep Mind's Differentiable Neural Computer (DNC)

I'm trying to implement Deep Mind's DNC - Nature paper- with PyTorch 0.4.0.
When implementing the variant of LSTM they used I encountered some troubles with dimensions.
To simplify suppose BATCH=1.
The equations they list in the paper are these:
where [x;h] means a concatenation of x and h into one single vector, and i, f and o are column vectors.
My question is about how the state s_t is computed.
The second addend is obtained by multiplying i with a column vector, so the result is either a scalar (transpose i first, then take the scalar product) or ill-defined (two column vectors multiplied).
So the state results in a single scalar...
With the same reasoning the hidden state h_t is a scalar too, but it has to be a column vector.
Obviously I'm wrong somewhere, but I can't figure out where.
By looking at Wikipedia LSTM Article I think I figured it out.
This is the formal implementation of standard LSTM found in the article:
The circle represents element-by-element product.
By using this product in the corresponding parts of DNC equations (s_t and o_t) the dimensions work.
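To make the dimensions explicit, here is a minimal sketch (my own, in PyTorch, with BATCH = 1; the W_* and b_* names are placeholders, not the paper's notation) of the standard LSTM update using elementwise products, so that the state and hidden state come out as vectors rather than scalars:

import torch

H, X = 4, 3                        # hidden size, input size
x = torch.randn(X)
h_prev, s_prev = torch.randn(H), torch.randn(H)
xh = torch.cat([x, h_prev])        # [x; h], shape (X + H,)

W_i, W_f, W_s, W_o = (torch.randn(H, X + H) for _ in range(4))
b_i, b_f, b_s, b_o = (torch.zeros(H) for _ in range(4))

i = torch.sigmoid(W_i @ xh + b_i)                  # input gate, shape (H,)
f = torch.sigmoid(W_f @ xh + b_f)                  # forget gate, shape (H,)
o = torch.sigmoid(W_o @ xh + b_o)                  # output gate, shape (H,)
s = f * s_prev + i * torch.tanh(W_s @ xh + b_s)    # elementwise products keep shape (H,)
h = o * torch.tanh(s)                              # hidden state is again a vector of shape (H,)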

Why does TensorFlow's documentation call a softmax's input "logits"?

TensorFlow calls each of the inputs to a softmax a logit. They go on to define the softmax's inputs/logits as: "Unscaled log probabilities."
Wikipedia and other sources say that a logit is the log of the odds, and the inverse of the sigmoid/logistic function. I.e., if sigmoid(x) = p(x), then logit( p(x) ) = log( p(x) / (1-p(x)) ) = x.
Is there a mathematical or conventional reason for TensorFlow to call a softmax's inputs "logits"? Shouldn't they just be called "unscaled log probabilities"?
Perhaps TensorFlow just wanted to keep the same variable name for binary logistic regression (where it makes sense to use the term logit) and categorical logistic regression...
This question was covered a little bit here, but no one seemed bothered by the use of the word "logit" to mean "unscaled log probability".
Logit is nowadays used in the ML community for any non-normalised probability distribution (basically anything that gets mapped to a probability distribution by a parameter-less transformation, such as the sigmoid function for a binary variable or softmax for a multinomial one). It is not a strict mathematical term, but it has gained enough popularity to be included in the TF documentation.
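To make the "unscaled log probabilities" reading concrete, a small sketch (mine, not from the TF docs): softmax maps any real-valued vector to a probability distribution, and shifting all logits by a constant leaves the probabilities unchanged.

import numpy as np

def softmax(z):
    z = z - z.max()              # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, -1.0])
print(softmax(logits))                                        # a valid probability distribution
print(np.allclose(softmax(logits), softmax(logits + 5.0)))    # True: logits are only defined up to a constant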

Practicing Kernel trick in SVM

I am reading the theory of SVMs. In the kernel trick, what I understand is: if we have data which are not linearly separable in the original n dimensions, we use the kernel to map the data to a higher-dimensional space where they become linearly separable (we have to choose the right kernel depending on the data set, etc.). However, when I watched this video of Andrew Ng on kernel SVMs, what I understood is that we can map the original data into a smaller space, which confuses me. Any explanation?
Could you explain how the RBF kernel maps each original data sample x1(x11,x12,x13,....,x1n) to a higher space (with dimension m) to become X1(X11,X12,X13,...,X1m), with a concrete example? Also, what I understand is that the kernel computes the inner product of the transformed data (so there is another transformation before the RBF, which means that the RBF implicitly transforms the data to a higher space, but how?).
Another thing: the kernel is a function k(x,x1): (R^n)^2 -> R with k(x,x1) = g(x).g(x1), where g is a transformation function; how is g defined in the case of the RBF kernel?
Suppose that we are at test time. What I understand is that x is the sample to be classified and x1 is a support vector (because only the support vectors will be used to evaluate the hyperplane). In the case of RBF,
k(x,x1) = exp(-||x-x1||^2 / (2*sigma^2)), so where is the transformation?
Last question: assuming that the RBF does map to a higher dimension m, is it possible to exhibit this m? I want to see the theoretical reality.
I want to implement an SVM with an RBF kernel. What is m here and how do I choose it? How is the kernel trick implemented in practice?
Could you explain how the RBF kernel maps each original data sample x1(x11,x12,x13,....,x1n) to a higher space (with dimension m) to become X1(X11,X12,X13,...,X1m), with a concrete example? Also, what I understand is that the kernel computes the inner product of the transformed data (so there is another transformation before the RBF, which means that the RBF implicitly transforms the data to a higher space, but how?).
Exactly as you said - the kernel is the inner product in the projected space, not the projection itself. The whole trick is that you never actually transform your data, because it would be computationally too expensive to do so.
Another thing: the kernel is a function k(x,x1): (R^n)^2 -> R with k(x,x1) = g(x).g(x1), where g is a transformation function; how is g defined in the case of the RBF kernel?
For the RBF kernel, g is actually a mapping from R^n into a space of continuous functions (L2), and each point is mapped to an unnormalized Gaussian with mean x and variance sigma^2. Thus (up to some normalizing constant A that we will drop)
g(x) = N(x, sigma^2)[z] / A # notice this is not a number but a function of z!
and now inner product in the space of functions is the integral of products over the whole domain thus
K(x, y) = <g(x), g(y)>
= INT_{R^n} N(x, sigma^2)[z] N(y, sigma^2)[z] / A^2 dz
      = B exp(-||x-y||^2 / (4*sigma^2))
where B is some constant factor (normalization) depending solely on sigma^2, so we can drop it (scaling does not really matter here) for computational simplicity; this is an RBF kernel whose bandwidth is just a rescaling of the sigma used in g.
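A quick 1-D numerical sanity check of this (my own sketch, not part of the original answer): approximating the integral on a grid shows that the inner product of the two Gaussian feature maps is a constant times exp(-(x - y)^2 / (4*sigma^2)).

import numpy as np

sigma, x, y = 1.0, 0.3, 1.7
z = np.linspace(-20.0, 20.0, 200001)              # fine grid standing in for the real line
dz = z[1] - z[0]

g_x = np.exp(-(z - x) ** 2 / (2 * sigma ** 2))    # feature map of x, as a function of z
g_y = np.exp(-(z - y) ** 2 / (2 * sigma ** 2))    # feature map of y

inner = np.sum(g_x * g_y) * dz                    # <g(x), g(y)> as an integral over z
rbf = np.exp(-(x - y) ** 2 / (4 * sigma ** 2))
print(inner / rbf)                                # roughly sqrt(pi) * sigma, independent of x and y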
Suppose that we are at test time. What I understand is that x is the sample to be classified and x1 is a support vector (because only the support vectors will be used to evaluate the hyperplane). In the case of RBF, k(x,x1) = exp(-||x-x1||^2 / (2*sigma^2)), so where is the transformation?
As said before - the transformation is never used explicitly; you simply show that the inner product of your hyperplane with the transformed point can be expressed again as inner products with the support vectors, so you never transform anything, you just use kernels:
<w, g(x)> = < SUM_{i=1}^N alpha_i y_i g(sv_i), g(x)>
= SUM_{i=1}^N alpha_i y_i <g(sv_i), g(x)>
= SUM_{i=1}^N alpha_i y_i K(sv_i, x)
where sv_i is i'th support vector, alpha_i is the per-sample weight (Lagrange multiplier) found during the optimization process and y_i is label of i'th support vector.
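In code form, that decision function might look like the following minimal sketch (sv, alpha, y_sv and b are hypothetical arrays you would obtain from training; this is not from the original answer):

import numpy as np

def rbf_kernel(a, b, sigma=1.0):
    return np.exp(-np.sum((a - b) ** 2) / (2 * sigma ** 2))

def decision_function(x, sv, alpha, y_sv, b=0.0):
    # <w, g(x)> + b = sum_i alpha_i * y_i * K(sv_i, x) + b
    return sum(a_i * y_i * rbf_kernel(s_i, x) for a_i, y_i, s_i in zip(alpha, y_sv, sv)) + b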
Last question: assuming that the RBF does map to a higher dimension m, is it possible to exhibit this m? I want to see the theoretical reality.
In this case m is infinite, as your new space is the space of continuous functions R^n -> R, so a single vector (function) is defined by a continuum of values - one per possible input coming from R^n (it is a simple set-theory result that R^n, for any positive n, has the cardinality of the continuum). Thus in terms of pure mathematics m = |R|, which in set theory is the so-called Beth_1 (https://en.wikipedia.org/wiki/Beth_number).
I want to implement an SVM with an RBF kernel. What is m here and how do I choose it? How is the kernel trick implemented in practice?
You do not choose m, it is defined by the kernel itself. Implementing the kernel trick in practice requires expressing all of your optimization routines in a form where the training points are used solely inside inner products, and then replacing those inner products with kernel calls. This is too complex to describe fully in SO form.
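As a practical starting point (a sketch assuming scikit-learn is acceptable; it is not from the original answer), you can either let an off-the-shelf SVM use its built-in RBF kernel, or pass a precomputed Gram matrix, which is exactly "replace every inner product with a kernel call":

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1).astype(int)     # a toy, non-linearly-separable labelling

clf = SVC(kernel="rbf", gamma=0.5).fit(X, y)           # gamma plays the role of 1 / (2*sigma^2)
print(clf.score(X, y))

# equivalently, compute the Gram matrix yourself and use kernel="precomputed"
gram = np.exp(-0.5 * ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
clf_pre = SVC(kernel="precomputed").fit(gram, y)
print(clf_pre.score(gram, y))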

How are the following types of neural network-like techniques called?

The neural network applications I've seen always learn the weights of their inputs and use fixed "hidden layers".
But I'm wondering about the following techniques:
1) fixed inputs, but the hidden layers are no longer fixed, in the sense that the functions of the input they compute can be tweaked (learned)
2) fixed inputs, but the hidden layers are no longer fixed, in the sense that although they have clusters which compute fixed functions (multiplication, addition, etc... just like ALUs in a CPU or GPU) of their inputs, the weights of the connections between them and between them and the input can be learned (this should in some ways be equivalent to 1) )
These could be used to model systems for which we know the inputs and the output but not how the input is turned into the output (figuring out what is inside a "black box"). Do such techniques exist and if so, what are they called?
For part (1) of your question, there are a couple of relatively recent techniques that come to mind.
The first one is a type of feedforward layer called "maxout" which computes a piecewise linear output function of its inputs.
Consider a traditional neural network unit with d inputs and a linear transfer function. We can describe the output of this unit as a function of its input z (a vector with d elements) as g(z) = w z, where w is a vector with d weight values.
In a maxout unit, the output of the unit is described as
g(z) = max_k w_k z
where w_k is a vector with d weight values, and there are k such weight vectors [w_1 ... w_k] per unit. Each of the weight vectors in the maxout unit computes some linear function of the input, and the max combines all of these linear functions into a single, convex, piecewise linear function. The individual weight vectors can be learned by the network, so that in effect each linear transform learns to model a specific part of the input (z) space.
You can read more about maxout networks at http://arxiv.org/abs/1302.4389.
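A minimal numpy sketch of a single maxout unit as described above (illustrative only; the variable names are mine):

import numpy as np

def maxout_unit(z, W, b):
    # W has shape (k, d): k linear functions of the d-dimensional input,
    # combined with a max into one convex, piecewise linear output.
    return np.max(W @ z + b)

rng = np.random.default_rng(0)
d, k = 5, 3
W, b, z = rng.normal(size=(k, d)), rng.normal(size=k), rng.normal(size=d)
print(maxout_unit(z, W, b))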
The second technique that has recently been developed is the "parametric relu" unit. In this type of unit, all neurons in a network layer compute an output g(z) = max(0, w z) + a min(w z, 0), as compared to the more traditional rectified linear unit, which computes g(z) = max(0, w z). The parameter a is shared across all neurons in a layer in the network and is learned along with the weight vector w.
The prelu technique is described by http://arxiv.org/abs/1502.01852.
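And a corresponding sketch of the parametric ReLU nonlinearity applied to a pre-activation x = w z (again just an illustration; in a real network the slope a is learned rather than fixed):

import numpy as np

def prelu(x, a):
    # max(0, x) + a * min(x, 0): identity for positive inputs,
    # slope a for negative inputs
    return np.maximum(0.0, x) + a * np.minimum(x, 0.0)

print(prelu(np.array([-2.0, -0.5, 0.0, 1.5]), a=0.25))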
Maxout units have been shown to work well for a number of image classification tasks, particularly when combined with dropout to prevent overfitting. It's unclear whether parametric relu units are as broadly useful for modeling images, but the prelu paper reports very strong results on what has for a while been considered the benchmark task in image classification.

Convolution Vs Correlation

Can anyone explain the similarities and differences between correlation and convolution? Please explain the intuition behind them, not the mathematical equations (i.e., flipping the kernel/impulse). Application examples in the image-processing domain for each would be appreciated too.
You will likely get a much better answer on the DSP Stack Exchange, but for starters: there are a number of similar terms, and their definitions can be tricky to pin down.
1. Correlation
2. Cross correlation
3. Convolution
4. Correlation coefficient
5. Sliding dot product
6. Pearson correlation
1, 2, 3, and 5 are very similar
4,6 are similar
Note that all of these terms have dot products rearing their heads
You asked about Correlation and Convolution - these are conceptually the same except that in convolution the filter is flipped (reversed) before sliding. I suspect that you may have been asking about the difference between the correlation coefficient (such as Pearson's) and convolution/correlation.
Prerequisites
I am assuming that you know how to compute the dot-product. Given two equal sized vectors v and w each with three elements, the algebraic dot product is v[0]*w[0]+v[1]*w[1]+v[2]*w[2]
There is a lot of theory behind the dot product in terms of what it represents etc....
Notice the dot product is a single number (scalar) representing the mapping between these two vectors/points v, w. In geometry one frequently computes the cosine of the angle between two vectors, which uses the dot product. The cosine of the angle between two vectors is between -1 and 1 and can be thought of as a measure of similarity.
Correlation coefficient (Pearson)
The correlation coefficient between equal-length v, w is simply the dot product of the two zero-mean signals (subtract the mean of v from v to get zmv and the mean of w from w to get zmw - here zm is shorthand for zero mean), divided by the product of the magnitudes of zmv and zmw.
This produces a number between -1 and 1: close to zero means little correlation, close to +/- 1 means high correlation. It measures the similarity between these two vectors.
See http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient for a better definition.
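In code, that recipe is short enough to write out directly (a sketch of the description above, not a library implementation):

import numpy as np

def pearson(v, w):
    zmv, zmw = v - v.mean(), w - w.mean()          # zero-mean both signals
    return np.dot(zmv, zmw) / (np.linalg.norm(zmv) * np.linalg.norm(zmw))

v = np.array([1.0, 2.0, 3.0, 4.0])
w = np.array([2.0, 4.1, 5.9, 8.2])
print(pearson(v, w))                               # close to +1 for strongly correlated signals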
Convolution and Correlation
When we want to correlate/convolve v1 and v2 we basically are computing a series of dot-products and putting them into an output vector. Let's say that v1 is three elements and v2 is 10 elements. The dot products we compute are as follows:
output[0] = v1[0]*v2[0]+v1[1]*v2[1]+v1[2]*v2[2]
output[1] = v1[0]*v2[1]+v1[1]*v2[2]+v1[2]*v2[3]
output[2] = v1[0]*v2[2]+v1[1]*v2[3]+v1[2]*v2[4]
output[3] = v1[0]*v2[3]+v1[1]*v2[4]+v1[2]*v2[5]
output[4] = v1[0]*v2[4]+v1[1]*v2[5]+v1[2]*v2[6]
output[5] = v1[0]*v2[5]+v1[1]*v2[6]+v1[2]*v2[7]
output[6] = v1[0]*v2[6]+v1[1]*v2[7]+v1[2]*v2[8]
output[7] = v1[0]*v2[7]+v1[1]*v2[8]+v1[2]*v2[9]
If a true convolution is needed, the filter v1 is flipped (reversed) before the same sliding dot products are computed:
output[0] = v1[2]*v2[0]+v1[1]*v2[1]+v1[0]*v2[2]
output[1] = v1[2]*v2[1]+v1[1]*v2[2]+v1[0]*v2[3]
output[2] = v1[2]*v2[2]+v1[1]*v2[3]+v1[0]*v2[4]
output[3] = v1[2]*v2[3]+v1[1]*v2[4]+v1[0]*v2[5]
output[4] = v1[2]*v2[4]+v1[1]*v2[5]+v1[0]*v2[6]
output[5] = v1[2]*v2[5]+v1[1]*v2[6]+v1[0]*v2[7]
output[6] = v1[2]*v2[6]+v1[1]*v2[7]+v1[0]*v2[8]
output[7] = v1[2]*v2[7]+v1[1]*v2[8]+v1[0]*v2[9]
Notice that we have only 8 output elements rather than 10, because for simplicity I am computing the result only where v1 fully overlaps v2.
Notice also that the convolution is simply a number of dot products. There has been considerable work over the years to speed up convolutions. The sweeping dot products are slow and can be sped up by first transforming the vectors into the Fourier domain, computing a single elementwise vector multiplication, and then inverting the transform, though I won't go into that here...
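You can check the flip relationship above directly with NumPy's built-ins (a quick sketch, using the same v1/v2 sizes as the example):

import numpy as np

v1 = np.array([1.0, 2.0, 3.0])
v2 = np.arange(10, dtype=float)

corr = np.correlate(v2, v1, mode="valid")          # the 8 sliding dot products above
conv = np.convolve(v2, v1[::-1], mode="valid")     # flip the filter, then convolve
print(np.allclose(corr, conv))                     # True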
You might want to look at these resources as well as googling: Calculating Pearson correlation and significance in Python
The best answer I got was from this document: http://www.cs.umd.edu/~djacobs/CMSC426/Convolution.pdf
I'm just going to copy the excerpt from the doc:
"The key difference between the two is that convolution is associative. That is, if F and G are filters, then F*(GI) = (FG)*I. If you don’t believe this, try a simple example, using F=G=(-1 0 1), for example. It is very convenient to have convolution be associative. Suppose, for example, we want to smooth an image and then take its derivative. We could do this by convolving the image with a Gaussian filter, and then convolving it with a derivative filter. But we could alternatively convolve the derivative filter with the Gaussian to produce a filter called a Difference of Gaussian (DOG), and then convolve this with our image. The nice thing about this is that the DOG filter can be precomputed, and we only have to convolve one filter with our image.
In general, people use convolution for image processing operations such as smoothing, and they use correlation to match a template to an image. Then, we don’t mind that correlation isn’t associative, because it doesn’t really make sense to combine two templates into one with correlation, whereas we might often want to combine two filter together for convolution."
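The associativity claim is easy to verify numerically in 1-D (a sketch, using full-mode discrete convolution):

import numpy as np

F = np.array([-1.0, 0.0, 1.0])             # a derivative-like filter
G = np.array([1.0, 4.0, 6.0, 4.0, 1.0])    # a Gaussian-like filter
I = np.arange(12, dtype=float)             # a toy 1-D "image"

lhs = np.convolve(F, np.convolve(G, I))    # F * (G * I)
rhs = np.convolve(np.convolve(F, G), I)    # (F * G) * I
print(np.allclose(lhs, rhs))               # True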
Convolution is just like correlation, except that we flip over the filter before correlating
