confusion about apprenticeship learning algorithm step - machine-learning

I've been following the paper here http://ai.stanford.edu/~ang/papers/icml04-apprentice.pdf but cannot figure out what operation the division symbol in section 3.1 indicates. All of the mu vectors have the same dimensionality; how are we supposed to perform division with them?

It looks like an ordinary division of numbers. You have something of the form
A^T B
-----
C^T D
where A, B, C and D are vectors, so A^T B is a number (it is just a dot product) and so is C^T D.
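For example, a quick numpy sketch with made-up vectors A, B, C, D (stand-ins for the mu terms in the paper) shows that the whole expression is just one scalar divided by another:

import numpy as np

# hypothetical vectors of equal dimensionality, standing in for the mu terms
A = np.array([0.5, -0.2, 0.1])
B = np.array([0.3, 0.4, -0.1])
C = np.array([0.2, 0.1, 0.6])
D = np.array([0.7, -0.3, 0.2])

numerator = A @ B                 # A^T B, a scalar (dot product)
denominator = C @ D               # C^T D, also a scalar
result = numerator / denominator  # ordinary division of two numbers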

Consider three mutually independent classifiers, A, B, C, with equal error probabilities:

Here's the problem:
Consider three mutually independent classifiers, A, B, C, with equal error probabilities:
Pr(errA) = Pr(errB) = Pr(errC) = t
Let D be another classifier that takes the majority vote of A, B, and C.
• What is Pr(errD)?
• Plot Pr(errD) as a function of t.
• For what values of t is the performance of D better than that of any of the three individual classifiers?
My questions are:
(1) I couldn't figure out the error probability of D. I thought it would be 1 minus alpha (1 - α), but I am not sure.
(2) How do I plot Pr(errD) as a function of t? I assume I cannot plot it without first finding Pr(errD).
(3) I couldn't figure this one out either. How should I compare the performance of D with that of the other classifiers?
If I understand correctly, your problem can be formulated in simple terms, without any ensemble learning.
Given that D is the result of a vote by 3 classifiers, D is wrong if and only if at most one of the estimators is right.
A,B,C are independent, so:
the probability of none being right is t^3
the probability of one being right while the other two are wrong is 3(1-t)t^2 (the factor 3 is because there are three ways to achieve this)
So P(errD) = t^3 + 3(1-t)t^2 = -2t^3 + 3t^2
You should be able to plot this as a function of t in the interval [0:1] without too many difficulties.
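For instance, a minimal sketch (assuming numpy and matplotlib are available) that plots Pr(errD) next to the single-classifier error t:

import numpy as np
import matplotlib.pyplot as plt

t = np.linspace(0, 1, 200)
err_single = t                      # Pr(errA) = Pr(errB) = Pr(errC) = t
err_majority = 3 * t**2 - 2 * t**3  # Pr(errD) derived above

plt.plot(t, err_single, label="single classifier")
plt.plot(t, err_majority, label="majority vote D")
plt.xlabel("t")
plt.ylabel("error probability")
plt.legend()
plt.show()

The two curves meet at t = 0, 0.5 and 1, which matches the threshold discussed just below.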
As for your third question, just solve P(errA) - P(errD) >0 (this means that the error probability of D is smaller than for A and so that its performance is better). If you solve this, you should find that the condition is t<0.5.
To come back to ensemble learning, note that the assumption of independence between your estimators is usually not verified in practice.

Decision tree completeness and unclassified data

I made a program that trains a decision tree built on the ID3 algorithm, using an information gain function (Shannon entropy) for feature selection (splitting).
Once I trained a decision tree I tested it to classify unseen data and I realized that some data instances cannot be classified: there is no path on the tree that classifies the instance.
An example (an illustrative one, but I encounter the same problem with a larger and more complex data set):
With f1 and f2 being the predictor variables (features) and y the categorical target variable, the value ranges are:
f1: [a1; a2; a3]
f2: [b1; b2; b3]
y : [y1; y2; y3]
Training data:
("a1", "b1", "y1");
("a1", "b2", "y2");
("a2", "b3", "y3");
("a3", "b3", "y1");
Trained tree:
         [f2]
       /   |   \
     b1    b2    b3
     /     |      \
   y1     y2      [f1]
                  /    \
                a2      a3
                /        \
              y3          y1
The instance ("a1", "b3") cannot be classified with the given tree.
Several questions came to mind:
1. Does this situation have a name? Tree incompleteness or something like that?
2. Is there a way to know whether a decision tree will cover all combinations of unseen instances (all combinations of feature values)?
3. Does the reason for this "incompleteness" lie in the structure of the data set, in the algorithm used to train the decision tree (ID3 in this case), or in something else?
4. Is there a method to classify these unclassifiable instances with the given decision tree, or must one use another tool (random forest, neural networks, ...)?
This situation cannot occur with the ID3 decision-tree learner---regardless of whether it uses information gain or some other heuristic for split selection. (See, for example, ID3 algorithm on Wikipedia.)
The "trained tree" in your example above could not have been returned by the ID3 decision-tree learning algorithm.
This is because when the algorithm selects a d-valued attribute (i.e. an attribute with d possible values) on which to split the given leaf, it will create d new children (one per attribute value). In particular, in your example above, the node [f1] would have three children, corresponding to attribute values a1, a2, and a3.
It follows from the previous paragraph (and, in general, from the way the ID3 algorithm works) that any well-formed vector---of the form (v1, v2, ..., vn, y), where vi is a value of i-th attribute and y is the class value---should be classifiable by the decision tree that the algorithm learns on a given train set.
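To make the argument concrete, here is a minimal sketch (not the poster's program) of an ID3-style learner in which the split-selection heuristic is replaced by a trivial choice of attribute; the point is only that every declared attribute value gets its own branch, so classification can never fall off the tree:

def id3(examples, attributes, values):
    # examples: list of tuples (v1, ..., vn, y); attributes: list of column indices;
    # values: dict mapping each attribute index to its declared value list
    classes = [e[-1] for e in examples]
    majority = max(set(classes), key=classes.count)
    if len(set(classes)) == 1 or not attributes:
        return majority                           # leaf node
    best = attributes[0]                          # stand-in for the information-gain choice
    children = {}
    for v in values[best]:                        # one branch per *declared* value of `best`
        subset = [e for e in examples if e[best] == v]
        if not subset:
            children[v] = majority                # empty branch -> majority-class leaf
        else:
            children[v] = id3(subset, [a for a in attributes if a != best], values)
    return (best, children)

def classify(tree, instance):
    while isinstance(tree, tuple):                # every declared value has a branch,
        attr, children = tree                     # so this walk never gets stuck
        tree = children[instance[attr]]
    return tree

values = {0: ["a1", "a2", "a3"], 1: ["b1", "b2", "b3"]}
data = [("a1", "b1", "y1"), ("a1", "b2", "y2"), ("a2", "b3", "y3"), ("a3", "b3", "y1")]
print(classify(id3(data, [0, 1], values), ("a1", "b3")))   # classifiable, no missing path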
Would you mind providing a link to the software you used to learn the "incomplete" trees?
To answer your questions:
1. Not that I know of. It does not make sense to learn such "incomplete" trees: if we knew that some attribute values could never occur, we would not include them in the specification (the file where you list the attributes and their values) in the first place.
2. With the ID3 algorithm you can prove, as I sketched in the answer, that every tree returned by the algorithm covers all possible combinations.
3. You're using the wrong algorithm; the data has nothing to do with it.
4. There is no such thing as an unclassifiable instance in decision-tree learning. One usually defines the decision-tree learning problem as follows: given a training set S of examples x_1, x_2, ..., x_m, each of the form x_i = (v_{1i}, v_{2i}, ..., v_{ni}, y_i), where v_{ji} is the value of the j-th attribute and y_i is the class value in example x_i, learn a function (represented by a decision tree) f: X -> Y, where X is the space of all possible well-formed vectors (i.e. all possible combinations of attribute values) and Y is the space of all possible class values, which minimizes an error function (e.g. the number of misclassified training examples). From this definition you can see that f is required to map any combination of attribute values to a class value; thus, by definition, every possible instance is classifiable.

Log likelihood of a Markov network

I am having trouble understanding the following figure from a Coursera class:
As far as I understand, the equation corresponds to the factor table:
And therefore the likelihood of a data sample (a=0, b=0, c=1), for example, would be:
It doesn't look like the graph in any way. Can you please explain the graph to me?
I think you're confusing probability and likelihood.
You have a probability distribution p, parameterised by \theta, which has support on (A, B, C). The probability distribution is a function of A, B, C for fixed \theta. The likelihood function, which is what's being graphed in the figure above, is a function of \theta for fixed A, B, C. It's a function which says how probable fixed observations are given different values for the parameters.
In popular usage likelihood and probability are synonymous. In technical use they are not.
With the likelihood/probability issue sorted, that likelihood function is telling you that the joint probability of (A, B, C) is the product of pairwise potentials between all connected pairs, in this case (A, B) and (B, C). I{a^1, b^1} is an indicator function which is 1 when a=1 and b=1 and zero otherwise. \theta_{a^1, b^1} is the parameter corresponding to this outcome.
If I had to guess (I can't see the whole class), I would say there are four \thetas for each pairwise relationship, representing the four possible states (both 1, both 0, or one of each), and we've just dropped the ones where the corresponding indicator function is zero and so the parameters are irrelevant.
Your derivation of the equation is not correct. The form of the MRF basically says add together the parameters corresponding to the correct state of each of the pairs, exponentiate, and normalise. The normalising constant is the sum of the joint probability over all possible configurations.
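A small numpy sketch of that recipe, with made-up parameter values for the two pairwise factors over (A, B) and (B, C) (the actual numbers in the course will differ):

import numpy as np
from itertools import product

# made-up log-potentials: theta_ab[a, b] is the parameter selected by the
# indicator I{A=a, B=b}, and similarly for theta_bc
theta_ab = np.array([[0.3, -0.2],
                     [0.5,  1.1]])
theta_bc = np.array([[0.0,  0.8],
                     [-0.4, 0.6]])

def unnormalized(a, b, c):
    # add the parameters picked out by the indicators, then exponentiate
    return np.exp(theta_ab[a, b] + theta_bc[b, c])

# normalizing constant: sum over all 2^3 configurations of (A, B, C)
Z = sum(unnormalized(a, b, c) for a, b, c in product([0, 1], repeat=3))

p = unnormalized(0, 0, 1) / Z   # probability of the sample (a=0, b=0, c=1)
print(p)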

Non binary decision tree to binary decision tree (Machine learning)

This is a homework question, so I just need a hint; maybe a yes/no and a few comments would be appreciated!
Prove: an arbitrary (non-binary) decision tree can be converted into an equivalent binary decision tree.
My answer:
Every multi-way decision can be reproduced using only binary decisions, hence so can the whole decision tree.
I don't know the formal proof. I can argue in terms of entropy (information gain, actually): for a binary node the gain will be E(S) - E(L) - E(R), and before the conversion it may be E(S) - E(Y|X=t1) - E(Y|X=t2) - and so on.
But I don't know how to state this formally.
You can give a constructive proof of something like this, demonstrating how to convert an arbitrary decision tree into a binary decision tree.
Imagine that you are sitting at node A, and you have a choice of traversing to B, C, and D based on whether or not your example satisfies requirements B, C or D. If this is a proper decision tree, B, C and D are mutually exclusive and cover all cases.
A -> B
  -> C
  -> D
Since they're mutually exclusive, you could imagine splitting your tree into a binary decision: B or not B; on the not B branch, we know that either C or D has to be true, since B, C, and D were mutually exclusive and cover all cases. In other words:
A -> B
  -> ~B
     ---> C
     ---> D
Then you can copy whatever was going to go after B onto the branch that follows B, performing the same simplification. Same for C and D.
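Here is a minimal sketch of that construction, assuming a multiway node is represented as a tuple (attribute, {value: subtree}) and a leaf is just a class label (this representation is my own, not from the question):

def to_binary(tree):
    if not isinstance(tree, tuple):                # leaf: nothing to convert
        return tree
    attribute, children = tree
    return chain(attribute, list(children.items()))

def chain(attribute, items):
    value, subtree = items[0]
    if len(items) == 1:
        # branches are mutually exclusive and exhaustive, so once all other
        # values are ruled out this one must hold -- no test needed
        return to_binary(subtree)
    # binary test "attribute == value": the yes-branch keeps the original
    # subtree, the no-branch handles the remaining values (B vs ~B above)
    return ((attribute, value), to_binary(subtree), chain(attribute, items[1:]))

Each multiway node with d children becomes a chain of d-1 binary tests, so the converted tree classifies every instance exactly as the original did.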

Neural Networks: What does "linearly separable" mean?

I am currently reading the Machine Learning book by Tom Mitchell. When talking about neural networks, Mitchell states:
"Although the perceptron rule finds a successful weight vector when
the training examples are linearly separable, it can fail to converge
if the examples are not linearly separable. "
I am having problems understanding what he means with "linearly separable"? Wikipedia tells me that "two sets of points in a two-dimensional space are linearly separable if they can be completely separated by a single line."
But how does this apply to the training set for neural networks? How can inputs (or action units) be linearly separable or not?
I'm not the best at geometry and maths - could anybody explain it to me as though I were 5? ;) Thanks!
Suppose you want to write an algorithm that decides, based on two parameters, size and price, whether a house will sell within the same year it was put on sale or not. So you have 2 inputs, size and price, and one output, will sell or will not sell. Now, when you receive your training sets, it could happen that the outputs are not clustered in a way that makes our prediction easy (can you tell, based on the first graph, whether X will be an N or an S? How about in the second graph?):
 ^
 | N S N
s| S X N
i| N N S
z| S N S N
e| N S S N
 +----------->
     price

 ^
 | S S N
s| X S N
i| S N N
z| S N N N
e| N N N
 +----------->
     price
Where:
S-sold,
N-not sold
As you can see in the first graph, you can't really separate the two possible outputs (sold/not sold) by a straight line: no matter how you try, there will always be both S and N on both sides of the line. This means that your algorithm will have a lot of possible lines but no ultimate, correct line to split the two outputs (and, of course, to predict new ones, which is the goal from the very beginning). That's why linearly separable data sets (the second graph) are much easier to predict.
This means that there is a hyperplane (which splits your input space into two half-spaces) such that all points of the first class are in one half-space and those of the second class are in the other half-space.
In two dimensions, that means that there is a line which separates points of one class from points of the other class.
EDIT: for example, in this image, if blue circles represent points from one class and red circles represent points from the other class, then these points are linearly separable.
In three dimensions, it means that there is a plane which separates points of one class from points of the other class.
In higher dimensions, it's similar: there must exist a hyperplane which separates the two sets of points.
You mention that you're not good at math, so I'm not writing the formal definition, but let me know (in the comments) if that would help.
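For the curious, a small numpy sketch of that definition: a hyperplane w.x + b = 0 separates two classes exactly when every point of one class lands on its positive side and every point of the other class on its negative side (the points and the hyperplane here are made up for illustration):

import numpy as np

def separated_by(w, b, class_one, class_two):
    # True if the hyperplane w.x + b = 0 puts class_one strictly on the
    # positive side and class_two strictly on the negative side
    class_one = np.asarray(class_one, dtype=float)
    class_two = np.asarray(class_two, dtype=float)
    return bool((class_one @ w + b > 0).all() and (class_two @ w + b < 0).all())

blue = [(1, 3), (2, 2.5), (3, 1.5)]   # toy points above the line x + y = 3
red  = [(0.5, 1), (1, 0.5), (2, 0)]   # toy points below it
print(separated_by(np.array([1.0, 1.0]), -3.0, blue, red))   # True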
Look at the following two data sets:
^                   ^
|  X    O           |  AA      /
|                   |  A      /
|                   |        /   B
|  O    X           |  A    /   BB
|                   |      /    B
+----------->       +----------->
The left data set is not linearly separable (without using a kernel). The right one is separable into two parts, A and B, by the indicated line.
I.e., you cannot draw a straight line in the left image such that all the X are on one side and all the O are on the other. That is why it is called "not linearly separable": there exists no linear manifold separating the two classes.
Now the famous kernel trick (which will certainly be discussed in the book next) actually allows many linear methods to be used for non-linear problems by virtually adding additional dimensions to make a non-linear problem linearly separable.
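As a tiny illustration of that idea (a hand-rolled feature map rather than an actual kernelized method), the XOR-style data below is not linearly separable in two dimensions, but adding a third, product feature x1*x2 makes it separable by an ordinary hyperplane:

import numpy as np

# XOR-like data: no straight line in the (x1, x2) plane separates the classes
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])

# lift into 3-D by appending the product feature x1*x2
X3 = np.column_stack([X, X[:, 0] * X[:, 1]])

# in the lifted space this fixed hyperplane separates the two classes
w, b = np.array([1.0, 1.0, -2.0]), -0.5
print((X3 @ w + b > 0).astype(int))   # [0 1 1 0] -- matches y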
