Maximum Likelihood Estimation-MLE - machine-learning

I have this question related to "Maximum Likelihood Estimation"...I've tried to solve it but I couldn't ... would you please help!
Suppose for an event X, there are three possible
values, A, B and C. Now we repeat X for N times. The number of times that we observe A or B
is N1, the number of times that we observe A or C is N2. Let pA be the unknown frequency of
value A. Please give the maximum likelihood estimation of pA

This question doesn't really belong in Stackoverflow, but I will answer it anyway.
You can look at the number of observations of A (I am calling this N_A) as a hidden variable and maximize with respect to the marginal distribution (the sum over all possible values of N_A).
There is no closed form solution, in general, for the MLE parameters - solutions are the zeros of a polynomial constrained to the simplex. Below, I have derived an Expectation Maximization updates.

Related

Consider three mutually independent classifiers, A, B, C, with equal error probabilities:

Here's the problem:
Consider three mutually independent classifiers, A, B, C, with equal error probabilities:
Pr(errA) = Pr(errB) = Pr(errC) = t
Let D be another classifier that takes the majority vote of A, B, and C.
• What is Pr(errD)?
• Plot Pr(errD) as a function of t.
• For what values of t, the performance of D is better than any of the other three classifiers?
My questions are:
(1) I couldn't figure out the error probability of D. I thought it would be 1 minus alpha (1 - α), but I am not sure.
(2) How to plot t(Pr(errD))? I assume without finding Pr(errD) then I can plot it.
(3) Here as well, I couldn't figure it out. Comparatively, how should I determine the performance of D?
If I understand well, your problem can be formulated with simple terms without any ensemble learning.
Given that D is the result of a vote by 3 classifiers, D is wrong if and only if at most one of the estimators is right.
A,B,C are independent, so:
the probability of none being right is t^3
the probability of one being right while the other two are wrong is 3(1-t)t^2 (the factor 3 is because there are three ways to achieve this)
So P(errD) = t^3 + 3(1-t)t^2 = -2t^3 + 3t^2
You should be able to plot this as a function of t in the interval [0:1] without too many difficulties.
As for your third question, just solve P(errA) - P(errD) >0 (this means that the error probability of D is smaller than for A and so that its performance is better). If you solve this, you should find that the condition is t<0.5.
To come back to ensemble learning, note that the assumption of independence between your estimators is usually not verified in practice.

Could you explain this question? i am new to ML, and i faced this problem, but its solution is not clear to me

The problem is in the picture
Question's image:
Question 2
Many substances that can burn (such as gasoline and alcohol) have a chemical structure based on carbon atoms; for this reason they are called hydrocarbons. A chemist wants to understand how the number of carbon atoms in a molecule affects how much energy is released when that molecule combusts (meaning that it is burned). The chemists obtains the dataset below. In the column on the right, kj/mole is the unit measuring the amount of energy released. examples.
You would like to use linear regression (h a(x)=a0+a1 x) to estimate the amount of energy released (y) as a function of the number of carbon atoms (x). Which of the following do you think will be the values you obtain for a0 and a1? You should be able to select the right answer without actually implementing linear regression.
A) a0=−1780.0, a1=−530.9 B) a0=−569.6, a1=−530.9
C) a0=−1780.0, a1=530.9 D) a0=−569.6, a1=530.9
Since all a0s are negative but two a1s are positive lets figure out the latter first.
As you can see by increasing the number of carbon atoms the energy is become more and more negative, so the relation cannot be positively correlated which rules out options c and d.
Then for the intercept the value that produces the least error is the correct one. For the 1 and 10 (easier to calculate) the outputs are about -2300 and -7000 for a, -1100 and -5900 for b, so one would prefer b over a.
PS: You might be thinking there should be obvious values for a0 and a1 from the data, it's not. The intention of the question is to give you a general understanding of the best fit. Also this way of solving is kinda machine learning as well

How to combine various distance functions into one given the following dataset?

I have a few distance functions which return distance between two images , I want to combine these distance into a single distance, using weighted scoring e.g. ax1+bx2+cx3+dx4 etc i want to learn these weights automatically such that my test error is minimised.
For this purpose i have a labeled dataset which has various triplets of images such that (a,b,c) , a has less distance to b than it has to c.
i.e. d(a,b)<d(a,c)
I want to learn such weights so that this ordering of triplets can be as accurate as possible.(i.e. the weighted linear score given is less for a&b and more for a&c).
What sort of machine learning algorithm can be used for the task,and how the desired task can be achieved?
Hopefully I understand your question correctly, but it seems that this could be solved more easily with constrained optimization directly, rather than classical machine learning (the algorithms of which are often implemented via constrained optimization, see e.g. SVMs).
As an example, a possible objective function could be:
argmin_{w} || e ||_2 + lambda || w ||_2
where w is your weight vector (Oh god why is there no latex here), e is the vector of errors (one component per training triplet), lambda is some tunable regularizer constant (could be zero), and your constraints could be:
max{d(I_p,I_r)-d(I_p,I_q),0} <= e_j for jth (p,q,r) in T s.t. d(I_p,I_r) <= d(I_p,I_q)
for the jth constraint, where I_i is image i, T is the training set, and
d(u,v) = sum_{w_i in w} w_i * d_i(u,v)
with d_i being your ith distance function.
Notice that e is measuring how far your chosen weights are from satisfying all the chosen triplets in the training set. If the weights preserve ordering of label j, then d(I_p,I_r)-d(I_p,I_q) < 0 and so e_j = 0. If they don't, then e_j will measure the amount of violation of training label j. Solving the optimization problem would give the best w; i.e. the one with the lowest error.
If you're not familiar with linear/quadratic programming, convex optimization, etc... then start googling :) Many libraries exist for this type of thing.
On the other hand, if you would prefer a machine learning approach, you may be able to adapt some metric learning approaches to your problem.

Log likelihood of a markov network

I am having trouble understanding the following figure from Coursera class:
From as far as I understand, the equation corresponds the factor table:
And therefore the likelihood of a sample data (a = 0, b=0, c=1) for example would be:
It doesn't look like the graph at any way. Can you please explain the graph for me?
I think you're confusing probability and likelihood.
You have a probability distribution p, parameterised by \theta, which has support on (A, B, C). The probability distribution is a function of A, B, C for fixed theta. The likelihood function, which is what's being graphed in the figure above, is a function of \theta for fixed A, B, C. It's a function which says how probable fixed observations are given different values for the parameters.
In popular usage likelihood and probability are synonymous. In technical use they are not.
With the likelihood/probability issue sorted, that likelihood function is telling you that the joint probability of (A, B, C) is the product of pairwise potentials between all connected pairs, in this case (A, B) and (B, C). I{a^1, b^1) is an indicator function which is 1 when a=1 and b=1 and zero otherwise. \theta_{a^1, b^1} is the parameter corresponding to this outcome.
If I had to guess (I can't see the whole class), I would say there are four \thetas for each pairwise relationship, representing the four possible states (both 1, both 0, or one of each), and we've just dropped the ones where the corresponding indicator function is zero and so the parameters are irrelevant.
Your derivation of the equation is not correct. The form of the MRF basically says add together the parameters corresponding to the correct state of each of the pairs, exponentiate, and normalise. The normalising constant is the sum of the joint probability over all possible configurations.

How to overcome Drawback of Inverse Document Frequency (IDF)

Please tell me how to overcome the problem of negative weighting in IDF. Can someone give a small example?
IDF is defined as N/n(t) where n(t) is the number of documents that a term 't' occurs in and N is the total number of documents in the collection. Sometimes, a log() is applied around this fraction.
Please observe that this fraction N/n(t) is always >= 1. For a word which appears in all documents, a likely case of which is the English word "the", the value of idf is 1. Even if a log is applied around this fraction, the value is always >= zero. (Recall the graph of the log function which monotonically increases from -inf to +inf with log(x)<0 if x<1 log(1)=0 and log(x)>0 if x>1).
So, there's no way in which a standard definition of idf can be negative.
The answer given by Debasis is entirely correct. Negative idf might still resulf, however, if a small +1 term is added to the divisor to avoid divide-by-zero errors. One source that suggests this is the Wikipedia article on tf-idf. The problem is that the LOG operation then results in a negative value if the occurrence count n(t) is equal to the count of documents N (i.e. it appears in all of them). I just ran into this issue when implementing tf-idf on a toy problem, where a document count N = 3 and an occurrence count of 3 would ordinarily = 0, but resulted in an idf of -0.287682072451781 because of the +1 correction term increased the divisor to 4, > than the count of documents. Perhaps this was the culprit behind the negative weights the O.P. experienced. I figured I'd post this just in case someone else runs into this initially baffling problem again. The fix is simple: remove the +1 term and find another way to avoid divide-by-zero errors.

Resources