Consider three mutually independent classifiers, A, B, C, with equal error probabilities

Here's the problem:
Consider three mutually independent classifiers, A, B, C, with equal error probabilities:
Pr(errA) = Pr(errB) = Pr(errC) = t
Let D be another classifier that takes the majority vote of A, B, and C.
• What is Pr(errD)?
• Plot Pr(errD) as a function of t.
• For what values of t is the performance of D better than that of each of the other three classifiers?
My questions are:
(1) I couldn't figure out the error probability of D. I thought it would be 1 minus alpha (1 - α), but I am not sure.
(2) How do I plot Pr(errD) as a function of t? I assume I need to find Pr(errD) first before I can plot it.
(3) I couldn't figure this one out either. How should I compare the performance of D with that of the others?

If I understand correctly, your problem can be formulated in simple terms without any ensemble learning.
Given that D is the result of a vote by 3 classifiers, D is wrong if and only if at most one of the estimators is right.
A,B,C are independent, so:
the probability of none being right is t^3
the probability of one being right while the other two are wrong is 3(1-t)t^2 (the factor 3 is because there are three ways to achieve this)
So P(errD) = t^3 + 3(1-t)t^2 = -2t^3 + 3t^2
You should be able to plot this as a function of t on the interval [0, 1] without much difficulty.
As for your third question, just solve P(errA) - P(errD) > 0 (this means that the error probability of D is smaller than that of A, so its performance is better). Since P(errA) - P(errD) = t - 3t^2 + 2t^3 = t(1 - t)(1 - 2t), the condition is 0 < t < 0.5.
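As a quick sanity check, here is a minimal Python sketch (the function names are illustrative, not from the question). It compares the closed form with a brute-force sum over the eight right/wrong outcomes of A, B, C, and prints a small table that could be fed to any plotting tool.

```python
from itertools import product


def majority_error(t: float) -> float:
    """Closed form derived above: Pr(errD) = 3t^2 - 2t^3."""
    return 3 * t**2 - 2 * t**3


def majority_error_enumerated(t: float) -> float:
    """Same quantity, obtained by summing over all outcomes of A, B, C."""
    total = 0.0
    for wrong in product([False, True], repeat=3):
        if sum(wrong) >= 2:  # D errs when at least two classifiers err
            p = 1.0
            for w in wrong:
                p *= t if w else (1 - t)
            total += p
    return total


if __name__ == "__main__":
    for i in range(11):
        t = i / 10
        err_d = majority_error(t)
        print(f"t={t:.1f}  Pr(errD)={err_d:.3f}  D better than A: {err_d < t}")
```

The table should show D beating a single classifier only for t strictly between 0 and 0.5, with equality at t = 0, 0.5 and 1.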
To come back to ensemble learning, note that the assumption of independence between your estimators usually does not hold in practice.

Related

How can we efficiently implement maximum coverage with a fixed number of arcs?

I am working on the following problem and implementing the solution in C++.
Let us assume that we have a directed weighted graph G = (V, A, w) and a set of persons P.
We receive a number of queries; each query gives a person p and two vertices s and d, and asks for the minimum-weight path between s and d for person p. One person can have multiple paths.
After all the queries, I am given a number k <= |A| and must return k arcs such that the number of persons using at least one of the k arcs is maximal (this is a maximum coverage problem).
To solve the first part, I implemented Dijkstra's algorithm using priority_queue and compute the minimum weight between s and d. (Is this a good approach?)
To solve the second part, I store for every arc the set of persons that use it, and I use a greedy algorithm to choose the arcs (at each stage, I pick the arc used by the largest number of uncovered persons). (Is this a good approach?)
Finally, if my algorithms are good, how can I implement them efficiently in C++?
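The greedy step described above can be sketched as follows. It is written in Python for brevity rather than C++, and the names (greedy_max_coverage, arc_users) are illustrative assumptions. Greedy selection is the standard heuristic for maximum coverage and is known to reach at least a (1 - 1/e) fraction of the optimal coverage, so it is a reasonable starting point rather than an exact method.

```python
def greedy_max_coverage(arc_users, k):
    """Pick up to k arcs, each time taking the arc that covers the most
    persons not covered yet.

    arc_users maps an arc (any hashable key) to the set of persons whose
    minimum-weight paths use that arc.
    """
    covered = set()
    chosen = []
    remaining = dict(arc_users)  # shallow copy, so the input is not modified
    for _ in range(k):
        best_arc, best_gain = None, 0
        for arc, users in remaining.items():
            gain = len(users - covered)
            if gain > best_gain:
                best_arc, best_gain = arc, gain
        if best_arc is None:  # no arc adds a new person
            break
        chosen.append(best_arc)
        covered |= remaining.pop(best_arc)
    return chosen, covered


if __name__ == "__main__":
    # Hypothetical toy data: arcs are (tail, head) pairs, persons are ints.
    demo = {
        ("a", "b"): {1, 2, 3},
        ("b", "c"): {3, 4},
        ("c", "d"): {4, 5},
        ("a", "d"): {1},
    }
    print(greedy_max_coverage(demo, 2))
```

Each round rescans every remaining arc, so the cost is roughly k times the total size of the stored sets. A C++ version could keep the same structure with std::unordered_set per arc.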

Could you explain this question? I am new to ML and came across this problem, but its solution is not clear to me

The problem is in the picture
Question 2
Many substances that can burn (such as gasoline and alcohol) have a chemical structure based on carbon atoms; for this reason they are called hydrocarbons. A chemist wants to understand how the number of carbon atoms in a molecule affects how much energy is released when that molecule combusts (meaning that it is burned). The chemist obtains the dataset below. In the column on the right, kJ/mole is the unit measuring the amount of energy released.
You would like to use linear regression (h_a(x) = a0 + a1 x) to estimate the amount of energy released (y) as a function of the number of carbon atoms (x). Which of the following do you think will be the values you obtain for a0 and a1? You should be able to select the right answer without actually implementing linear regression.
A) a0=−1780.0, a1=−530.9
B) a0=−569.6, a1=−530.9
C) a0=−1780.0, a1=530.9
D) a0=−569.6, a1=530.9
Since all the a0 values are negative but two of the a1 values are positive, let's figure out a1 first.
As you can see, as the number of carbon atoms increases, the energy becomes more and more negative, so the relationship cannot be positively correlated, which rules out options C and D.
Then, for the intercept, the correct value is the one that produces the least error. At x = 1 and x = 10 (the easiest to calculate), the outputs are about -2300 and -7000 for A, and about -1100 and -5900 for B, so one would prefer B over A.
PS: You might be thinking there should be obvious values for a0 and a1 in the data; there aren't. The intention of the question is to give you a general understanding of the best fit. Also, this way of solving it is kind of machine learning as well.
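To make the arithmetic above reproducible, here is a small Python sketch that evaluates all four candidate lines at x = 1 and x = 10. The dataset itself is only in the original image and is not reproduced here, so the code prints predictions only. Comparing them against the table is left to the reader, and the helper names are illustrative.

```python
# Candidate (a0, a1) pairs copied from the answer options.
CANDIDATES = {
    "A": (-1780.0, -530.9),
    "B": (-569.6, -530.9),
    "C": (-1780.0, 530.9),
    "D": (-569.6, 530.9),
}


def predict(a0, a1, x):
    """Linear hypothesis h_a(x) = a0 + a1 * x."""
    return a0 + a1 * x


def predictions(xs=(1, 10)):
    """Predicted energy for each option at the given carbon counts."""
    return {
        name: [round(predict(a0, a1, x), 1) for x in xs]
        for name, (a0, a1) in CANDIDATES.items()
    }


if __name__ == "__main__":
    for name, values in predictions().items():
        print(name, values)
```

Whichever option stays closest to the observed energies at both ends of the table is the better fit, which is the comparison the answer performs by hand.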

How to reduce training error for a dense matrix factorization task?

This problem may seem very different from the usual matrix factorization task that is widely used in recommender systems.
My problem is as follows:
Given a dense matrix M
(approximately 55000*200; it may contain many negative elements, and 0.1 < abs(M[i][j]) < 1),
I have to find two matrices A (55000*1400) and B (1400*200) such that:
AB = M
However, we have some knowledge about A. We have another matrix C: if C[i][j] = 0, then A[i][j] must be zero; otherwise (C[i][j] = 1) it can take any value.
In practice, I use machine learning to solve the problem. My loss function is:
|| (A * C) B - M ||_2, where * is the element-wise product and || . ||_2 is the L2 norm
I have tried AdaGrad, momentum, AdaDelta and some other optimization methods, but the training error remains high and decreases slowly (learning_rate = 0.1).
UP1:
Well, actually I have a machine with 32GB of memory and each epoch takes only 2 minutes. I decompose an element of M only if its corresponding element in C is annotated as 1. In practice, I only decompose M[i][j] when C[i][j] = 1, and after I decompose M[i][j], I compute the gradient for M[i][j] and update A[i, :] and B[:, j] at once. So the batch I use is very small: it contains just one element. I should also mention that C is a very sparse matrix. In each row of C, only 2-3 elements are annotated as 1.
After struggling with it for nearly half a month, I finally found the answer: I should update matrix A much more often, that is, update its parameters in smaller, more frequent steps. I originally updated every element of A only once per epoch, much less often than B. However, after I changed the code so that A is updated as often as B, a surprise: it worked pretty well!
Maybe smaller steps help SGD work better? I am not really convinced of it mathematically.
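For reference, here is a minimal pure-Python sketch of per-element SGD on the masked loss, in which A and B are updated on every sample, in line with the fix described above. This is not the original code: the tiny synthetic problem, the function names, the learning rate and the epoch count are all assumptions chosen so that the example runs in a moment.

```python
import random


def make_demo(n=20, m=5, rank=6, per_row=2, seed=1):
    """Build a small synthetic M = (A_true * C) B_true with a sparse mask C."""
    rng = random.Random(seed)
    C = [[0] * rank for _ in range(n)]
    for row in C:
        for k in rng.sample(range(rank), per_row):
            row[k] = 1
    A_true = [[rng.uniform(-1, 1) * C[i][k] for k in range(rank)] for i in range(n)]
    B_true = [[rng.uniform(-1, 1) for _ in range(m)] for _ in range(rank)]
    M = [[sum(A_true[i][k] * B_true[k][j] for k in range(rank)) for j in range(m)]
         for i in range(n)]
    return M, C


def masked_sgd(M, C, lr=0.05, epochs=200, seed=0):
    """SGD on sum_ij (((A * C) B)[i][j] - M[i][j])^2, one element at a time."""
    rng = random.Random(seed)
    n, m, rank = len(M), len(M[0]), len(C[0])
    A = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n)]
    B = [[rng.uniform(-0.1, 0.1) for _ in range(m)] for _ in range(rank)]
    # Only entries of A where C is 1 are ever read or updated;
    # the effective factor is A * C.
    active = [[k for k in range(rank) if C[i][k]] for i in range(n)]

    def loss():
        return sum(
            (sum(A[i][k] * B[k][j] for k in active[i]) - M[i][j]) ** 2
            for i in range(n) for j in range(m)
        )

    history = [loss()]
    for _ in range(epochs):
        for i in range(n):
            for j in range(m):
                err = sum(A[i][k] * B[k][j] for k in active[i]) - M[i][j]
                for k in active[i]:
                    a, b = A[i][k], B[k][j]
                    A[i][k] -= lr * err * b  # A and B move on the same sample
                    B[k][j] -= lr * err * a
        history.append(loss())
    return A, B, history


if __name__ == "__main__":
    M, C = make_demo()
    _, _, history = masked_sgd(M, C)
    print("initial loss:", history[0], "final loss:", history[-1])
```

On real data of the size described, a vectorised implementation would be needed; the point here is only the update schedule.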

How to combine various distance functions into one given the following dataset?

I have a few distance functions, each of which returns a distance between two images. I want to combine these distances into a single distance using weighted scoring, e.g. a*x1 + b*x2 + c*x3 + d*x4, and I want to learn the weights automatically so that my test error is minimised.
For this purpose I have a labeled dataset of image triplets (a, b, c) such that a is closer to b than to c,
i.e. d(a,b) < d(a,c).
I want to learn the weights so that the ordering in these triplets is reproduced as accurately as possible (i.e. the weighted linear score is smaller for a and b and larger for a and c).
What sort of machine learning algorithm can be used for this task, and how can it be achieved?
Hopefully I understand your question correctly, but it seems that this could be solved more easily with constrained optimization directly, rather than classical machine learning (the algorithms of which are often implemented via constrained optimization, see e.g. SVMs).
As an example, a possible objective function could be:
argmin_{w} || e ||_2 + lambda || w ||_2
where w is your weight vector (Oh god why is there no latex here), e is the vector of errors (one component per training triplet), lambda is some tunable regularizer constant (could be zero), and your constraints could be:
max{d(I_p,I_r)-d(I_p,I_q),0} <= e_j for jth (p,q,r) in T s.t. d(I_p,I_r) <= d(I_p,I_q)
for the jth constraint, where I_i is image i, T is the training set, and
d(u,v) = sum_{w_i in w} w_i * d_i(u,v)
with d_i being your ith distance function.
Notice that e is measuring how far your chosen weights are from satisfying all the chosen triplets in the training set. If the weights preserve ordering of label j, then d(I_p,I_r)-d(I_p,I_q) < 0 and so e_j = 0. If they don't, then e_j will measure the amount of violation of training label j. Solving the optimization problem would give the best w; i.e. the one with the lowest error.
If you're not familiar with linear/quadratic programming, convex optimization, etc... then start googling :) Many libraries exist for this type of thing.
On the other hand, if you would prefer a machine learning approach, you may be able to adapt some metric learning approaches to your problem.
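As one possible concrete starting point, here is a hedged Python sketch that swaps the constrained program above for a simpler technique: subgradient descent on a hinge loss with a margin, using the question's convention d(a,b) < d(a,c). The margin, the non-negativity clipping, the hyperparameters and all names are assumptions. The margin is there because, without it, w = 0 would trivially give zero error.

```python
def learn_weights(diffs, margin=1.0, lam=0.01, lr=0.05, epochs=200):
    """Learn non-negative weights w so that w . x >= margin for each triplet.

    diffs[j][i] = d_i(a, c) - d_i(a, b) for the j-th triplet (a, b, c),
    so a positive weighted sum means the triplet is ordered correctly.
    Minimises sum_j max(0, margin - w . x_j) + lam * ||w||^2.
    """
    n_dist = len(diffs[0])
    w = [1.0 / n_dist] * n_dist
    for _ in range(epochs):
        for x in diffs:
            score = sum(wi * xi for wi, xi in zip(w, x))
            for i in range(n_dist):
                grad = 2 * lam * w[i]
                if score < margin:  # violated, or inside the margin
                    grad -= x[i]
                w[i] = max(0.0, w[i] - lr * grad)  # keep weights non-negative
    return w


def triplet_accuracy(w, diffs):
    """Fraction of triplets whose ordering the weighted distance reproduces."""
    ok = sum(1 for x in diffs if sum(wi * xi for wi, xi in zip(w, x)) > 0)
    return ok / len(diffs)


if __name__ == "__main__":
    # Hypothetical data: distance 1 is informative, distance 2 is misleading.
    diffs = [[1.0, -0.5], [0.8, -1.2], [1.2, 0.3], [0.9, -2.0]]
    w = learn_weights(diffs)
    print("weights:", w, "accuracy:", triplet_accuracy(w, diffs))
```

On this toy data the misleading distance should end up with zero weight. For real data, held-out triplets should be used to choose lam and the margin.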

Log likelihood of a markov network

I am having trouble understanding the following figure from a Coursera class:
As far as I understand, the equation corresponds to the factor table:
And therefore the likelihood of a sample data point (a = 0, b = 0, c = 1), for example, would be:
It doesn't look like the graph in any way. Can you please explain the graph to me?
I think you're confusing probability and likelihood.
You have a probability distribution p, parameterised by \theta, which has support on (A, B, C). The probability distribution is a function of A, B, C for fixed theta. The likelihood function, which is what's being graphed in the figure above, is a function of \theta for fixed A, B, C. It's a function which says how probable fixed observations are given different values for the parameters.
In popular usage likelihood and probability are synonymous. In technical use they are not.
With the likelihood/probability issue sorted, that likelihood function is telling you that the joint probability of (A, B, C) is the product of pairwise potentials between all connected pairs, in this case (A, B) and (B, C). I{a^1, b^1} is an indicator function which is 1 when a=1 and b=1 and zero otherwise. \theta_{a^1, b^1} is the parameter corresponding to this outcome.
If I had to guess (I can't see the whole class), I would say there are four \thetas for each pairwise relationship, representing the four possible states (both 1, both 0, or one of each), and we've just dropped the ones where the corresponding indicator function is zero and so the parameters are irrelevant.
Your derivation of the equation is not correct. The form of the MRF basically says: add together the parameters corresponding to the observed state of each of the pairs, exponentiate, and normalise. The normalising constant is the sum of these exponentiated (unnormalised) values over all possible configurations.
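The "add, exponentiate, normalise" recipe can be illustrated with a short Python sketch for binary A, B, C with pairwise factors on (A, B) and (B, C). The theta values below are made up purely for illustration and are not taken from the course.

```python
from itertools import product
from math import exp

# Hypothetical parameters: one theta per joint state of each connected pair.
THETA_AB = {(0, 0): 0.5, (0, 1): -0.2, (1, 0): 0.1, (1, 1): 1.0}
THETA_BC = {(0, 0): 0.3, (0, 1): 0.0, (1, 0): -0.4, (1, 1): 0.8}


def unnormalised(a, b, c):
    """exp of the sum of the parameters selected by the indicator functions."""
    return exp(THETA_AB[(a, b)] + THETA_BC[(b, c)])


def partition():
    """Normalising constant: sum of unnormalised values over all 8 states."""
    return sum(unnormalised(a, b, c) for a, b, c in product([0, 1], repeat=3))


def prob(a, b, c):
    """Joint probability of one configuration under the pairwise MRF."""
    return unnormalised(a, b, c) / partition()


if __name__ == "__main__":
    print("P(a=0, b=0, c=1) =", prob(0, 0, 1))
```

Holding the data fixed and varying one of the thetas in this code traces out the likelihood as a function of that parameter, which is the kind of curve the figure plots.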

Resources