What does the capital letter 'J' mean in cost function J(θ)?

I saw many people using 'J(θ)' to represent cost function, what does the capital letter 'J' exactly mean?

As far as I know it comes from the Jacobian matrix, which represents the derivative in all directions. And because it has become the standard notation in machine learning, it has stuck.

But in machine learning J(θ) is not a matrix; it is a multidimensional real-valued function that we want to minimize.
So the letter J may well be a reference to Jacobi, but why it is used to name a cost function remains a mystery that no MOOC or blog has revealed.

Related

Polynomial regression in machine learning coursera

While going through Andrew Ng's Coursera course on machine learning, I found that the price of a house might go down after a certain value of x in the quadratic regression equation. Can anyone explain why that is?
Andrew Ng is trying to show that a quadratic function doesn't really make sense for representing the price of houses.
This is what the graph of a quadratic function might look like:
[figure: quadratic curve y = ax² + bx + c; the values of a, b and c were chosen randomly for this example]
As you can see in the figure, the graph first rises to a maximum and then begins to dip. This isn't representative of the real world, since the price of a house wouldn't normally come down as the house gets larger.
He recommends that we use a different polynomial function to represent this problem better, such as a cubic function.
[figure: cubic curve y = ax³ + bx² + cx + d; the values of a, b, c and d were chosen randomly for this example]
In reality, we would use a different method altogether for choosing the best polynomial function to fit a problem: we would try different polynomial functions on a cross-validation dataset and have an algorithm choose the best-suited one, as in the sketch below. We could also manually choose a polynomial function for a dataset if we already know the trend the data should follow (due to prior mathematical or physical knowledge).
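A minimal sketch of that cross-validation selection, assuming scikit-learn; the dataset and all names here are illustrative, not from the course:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic "house size -> price" data that keeps rising (illustrative only).
rng = np.random.RandomState(0)
X = rng.uniform(500, 4000, size=(200, 1))
y = 50 + 0.1 * X.ravel() ** 0.8 + rng.normal(0, 5, size=200)

# Score polynomial degrees 1..4 by 5-fold cross-validation; keep the best.
scores = {}
for degree in range(1, 5):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    scores[degree] = cross_val_score(model, X, y, cv=5).mean()

best_degree = max(scores, key=scores.get)
print(scores, "-> best degree:", best_degree)
```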

logistic regression cost function scikit learn

So I've done the Coursera ML course, and now I am looking at scikit-learn's logistic regression, which is a little bit different. I've been using the sigmoid function, and the cost function was divided into two separate cases, for y=0 and y=1. But scikit-learn has a single expression (I've found out that this is the generalised logistic function), which really doesn't make sense to me.
http://scikit-learn.org/stable/_images/math/760c999ccbc78b72d2a91186ba55ce37f0d2cf37.png
I am sorry, I don't have enough reputation to post the image.
My main concern is the case when y=0: then the cost function always has this log(e^0+1) value, so it does not matter what X or w were. Can someone explain that to me?
If I am correct, the $y_i$ you have in this formula can only assume the values -1 or 1 (as derived in the link given by #jinyu0310). Normally you use this cost function (with regularisation):
(sorry for the equation inserted as an image, but I cannot use LaTeX here; the image comes from goo.gl/M53u44)
So you always have two terms that play a role, one when $y_i = 0$ and one when $y_i = 1$. I am trying to find a better explanation of the notation that scikit-learn uses in this formula, but so far no luck.
Hope that helps. It is just another way of writing the cost function, with a regularisation factor; the sketch below checks numerically that the two forms agree. Keep in mind also that a constant factor in front of everything will not play a role in the optimisation process, since you want to find the minimum and are not interested in overall factors that multiply everything.
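A minimal numerical check that the Coursera-style two-case loss (labels in {0, 1}) and the scikit-learn-style single expression (labels mapped to {-1, 1}) are the same function; the names below are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def two_case_loss(y, z):
    """Coursera-style loss, y in {0, 1}: separate terms for y=0 and y=1."""
    return -(y * np.log(sigmoid(z)) + (1 - y) * np.log(1 - sigmoid(z)))

def single_expression_loss(y, z):
    """Single-expression loss with labels mapped to {-1, 1}: log(exp(-y'z) + 1)."""
    y_signed = 2 * y - 1
    return np.log(np.exp(-y_signed * z) + 1)

z = np.linspace(-5, 5, 11)  # sample values of the margin z = X.dot(w) + c
for y in (0, 1):
    assert np.allclose(two_case_loss(y, z), single_expression_loss(y, z))
print("the two forms agree for y = 0 and y = 1")
```

Note that for $y_i = -1$ the exponent is $+(X_i w + c)$, not $0$, so X and w do matter; the confusion in the question comes from plugging $y_i = 0$ into a formula whose labels never take that value.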

Does Bishop's book imply that a unit neuron feeds into itself in backpropagation?

I have just read Bishop's book Pattern Recognition and Machine Learning, chapter 5.3 on backpropagation. The book says that in a general feed-forward network, each unit computes a weighted sum of its inputs of the form $$a_j=\sum\limits_{i}w_{ji}z_i$$
The book then says that the sum in the above equation is transformed by a non-linear activation function $h(\cdot)$ to give the activation $z_j$ of unit $j$ in the form $$z_j=h(a_j)$$
I think the notation is somewhat awkward. Suppose I want to compute $a_2$; then
$$a_2=w_{21}z_1+w_{22}z_2+\dots$$
which would mean that $$a_2=w_{21}z_1+w_{22}h(a_2)+\dots$$ so that neuron $2$ is connected to itself?
Please correct me if I am wrong.
You are wrong, but I agree that the notation is easy to misread. The tricky bit is that the sum is over $i$, but Bishop never claims that it is a sum over all natural numbers. In other words, in each equation he is omitting the index set that denotes which neurons are connected to which; it should be something like $\sum_{i \in \text{set\_of\_neurons\_connected\_to}(j)}$, and obviously this does not imply any self-loops unless you specify such a graph. This notation is quite common in ML, not only in Bishop's book.
No, the neuron does not connect to itself. If you are referring to the example in 5.3.2, in particular formulas (5.62)-(5.64), the confusion comes from the fact that you are reading about a shallow network. If you add a layer and compute the next layer's activations, you will see that this time the $x_i$ is indeed $z_j$.
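A minimal forward-pass sketch in NumPy (layer sizes and names are illustrative): the sum for $a_j$ runs only over units in the previous layer, so no unit ever feeds into itself.

```python
import numpy as np

rng = np.random.RandomState(0)
x = rng.randn(3)       # inputs z_i of layer 0
W1 = rng.randn(4, 3)   # w_ji: j indexes layer 1, i indexes layer 0
W2 = rng.randn(2, 4)   # w_kj: k indexes layer 2, j indexes layer 1

# a_j = sum_i w_ji z_i, where i runs ONLY over the previous layer:
a1 = W1 @ x            # layer-1 pre-activations
z1 = np.tanh(a1)       # z_j = h(a_j), with h = tanh here
a2 = W2 @ z1           # the "inputs" z_i are now the previous layer's z_j
z2 = np.tanh(a2)

print(z2)              # no unit appears on both sides of its own sum
```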

Is there hope for using Lasagne's Adam implementation for Probabilistic Matrix Factorization?

I am implementing Probabilistic Matrix Factorization models in theano and would like to make use of Adam gradient descent rules.
My goal is to have code that is as uncluttered as possible, which means that I do not want to explicitly keep track of the 'm' and 'v' quantities from Adam's algorithm.
It would seem that this is possible, especially after seeing how Lasagne's Adam is implemented: it hides the 'm' and 'v' quantities inside the update rules of theano.function.
This works when each term of the negative log-likelihood involves a different quantity. But in Probabilistic Matrix Factorization each term contains a dot product of one latent user vector and one latent item vector. This way, if I make an instance of Lasagne's Adam for each term, I will have multiple 'm' and 'v' quantities for the same latent vector, which is not how Adam is supposed to work.
I also posted on Lasagne's group, actually twice, where there is some more detail and some examples.
I thought about two possible implementations:
Each existing rating (which means each existing term in the global NLL objective function) has its own Adam, updated via a dedicated call to theano.function. Unfortunately this leads to an incorrect use of Adam, as the same latent vector will be associated with different 'm' and 'v' quantities used by the Adam algorithm.
Calling Adam on the entire NLL objective, which makes the update mechanism behave like plain Gradient Descent instead of SGD, with all the known disadvantages (high computation time, getting stuck in local minima, etc.).
My questions are:
Is there something that I did not understand correctly about how Lasagne's Adam works?
Would option 2 actually behave like SGD, in the sense that every update to a latent vector influences another update (in the same Adam call) that makes use of that updated vector?
Do you have any other suggestions on how to implement it?
Any ideas on how to solve this issue and avoid manually keeping replicated vectors and matrices for the 'm' and 'v' quantities?
It looks like in the paper they are suggesting that you optimise the entire function at once using gradient descent:
We can then perform gradient descent in Y, V, and W to minimize the objective function given by Eq. 10.
So I would say your option 2 is the correct way to implement what they have done.
There aren't many complexities or non-linearities in there (besides the sigmoid), so you probably aren't likely to run into the typical optimisation problems associated with neural networks that call for something like Adam. So as long as it all fits in memory, I guess this approach will work.
If it doesn't fit in memory, perhaps there's some way you could devise a minibatch version of the loss to optimise over. I would also be interested to see whether you could add one or more neural networks to replace some of these conditional probabilities, but that's getting a bit off-topic...
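For what it's worth, option 2 amounts to a single Adam instance over the whole objective. A minimal sketch, assuming Theano and Lasagne; the shapes are illustrative and a masked squared-error NLL stands in for the full PMF likelihood:

```python
import numpy as np
import theano
import theano.tensor as T
import lasagne

n_users, n_items, k = 100, 80, 10  # illustrative sizes
floatX = theano.config.floatX

R = T.matrix('R')  # observed ratings
M = T.matrix('M')  # mask: 1 where a rating was observed, 0 elsewhere

U = theano.shared(0.1 * np.random.randn(n_users, k).astype(floatX))  # latent users
V = theano.shared(0.1 * np.random.randn(n_items, k).astype(floatX))  # latent items

# One global NLL over all observed entries: squared error plus Gaussian priors.
pred = T.dot(U, V.T)
nll = T.sum(M * (R - pred) ** 2) + 0.1 * (T.sum(U ** 2) + T.sum(V ** 2))

# A single Adam call for the whole objective: each of U and V gets exactly one
# pair of 'm' and 'v' accumulators, hidden inside the returned update dictionary.
updates = lasagne.updates.adam(nll, [U, V], learning_rate=0.01)
train_step = theano.function([R, M], nll, updates=updates)
```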

Approximate string matching with a letter confusion matrix?

I'm trying to model a phonetic recognizer that has to isolate instances of words (strings of phones) out of a long stream of phones that doesn't have gaps between each word. The stream of phones may have been poorly recognized, with letter substitutions/insertions/deletions, so I will have to do approximate string matching.
However, I want the matching to be phonetically-motivated, e.g. "m" and "n" are phonetically similar, so the substitution cost of "m" for "n" should be small, compared to say, "m" and "k". So, if I'm searching for [mein] "main", it would match the letter sequence [meim] "maim" with, say, cost 0.1, whereas it would match the letter sequence [meik] "make" with, say, cost 0.7. Similarly, there are differing costs for inserting or deleting each letter. I can supply a confusion matrix that, for each letter pair (x,y), gives the cost of substituting x with y, where x and y are any letter or the empty string.
I know that there are tools available that do approximate matching, such as agrep, but as far as I can tell they do not take a confusion matrix as input; that is, the cost of every insertion/substitution/deletion is 1. My question is: are there any open-source tools already available that can do approximate matching with confusion matrices, and if not, what is a good algorithm that I can implement to accomplish this?
EDIT: just to be clear, I'm trying to isolate approximate instances of a word such as [mein] from a longer string, e.g. [aiammeinlimeiking...]. Ideally, the algorithm/tool should report instances such as [mein] with cost 0.0 (exact match), [meik] with cost 0.7 (near match), etc, for all approximate string matches with a cost below a given threshold.
I'm not aware of any phonetic recognizers that use confusion matrices. I know of Soundex and match rating.
I think that the K-nearest neighbour algorithm might be useful for the type of approximations you are interested in.
Peter Kleiweg's Rug/L04 (for computational dialectology) includes an implementation of Levenshtein distance which allows you to specify non-uniform insertion, deletion, and substitution costs.
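If you end up rolling your own, the usual approach is dynamic programming over a weighted edit distance, with the first DP row set to zero so a match may start anywhere in the stream (Sellers' approximate substring matching). A minimal sketch; the cost values and the `similar` set below are hypothetical, not from the question:

```python
def weighted_substring_match(word, stream, sub_cost, ins_cost, del_cost):
    """Min-cost alignment of `word` anywhere inside `stream`,
    with per-letter substitution/insertion/deletion costs."""
    n, m = len(word), len(stream)
    # dp[i][j]: min cost of matching word[:i] against a substring ending at j.
    # Row 0 stays all zeros so a match may start at any stream position.
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = dp[i - 1][0] + del_cost(word[i - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = min(
                dp[i - 1][j - 1] + sub_cost(word[i - 1], stream[j - 1]),
                dp[i][j - 1] + ins_cost(stream[j - 1]),  # extra phone in stream
                dp[i - 1][j] + del_cost(word[i - 1]),    # phone missing from stream
            )
    return min(dp[n])  # best cost over all end positions

# Hypothetical costs: identical phones are free, similar pairs are cheap.
similar = {('m', 'n'), ('n', 'm')}
sub = lambda x, y: 0.0 if x == y else (0.1 if (x, y) in similar else 0.7)
print(weighted_substring_match('mein', 'aiammeinlimeiking', sub,
                               lambda c: 0.5, lambda c: 0.5))  # -> 0.0 (exact hit)
```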
