Neural Networks: What does "linearly separable" mean? - machine-learning

I am currently reading the Machine Learning book by Tom Mitchell. When talking about neural networks, Mitchell states:
"Although the perceptron rule finds a successful weight vector when
the training examples are linearly separable, it can fail to converge
if the examples are not linearly separable. "
I am having problems understanding what he means with "linearly separable"? Wikipedia tells me that "two sets of points in a two-dimensional space are linearly separable if they can be completely separated by a single line."
But how does this apply to the training set for neural networks? How can inputs (or action units) be linearly separable or not?
I'm not the best at geometry and maths - could anybody explain it to me as though I were 5? ;) Thanks!

Suppose you want to write an algorithm that decides, based on two parameters, size and price, whether a house will sell in the same year it was put on sale. So you have two inputs, size and price, and one output: will sell or will not sell. Now, when you receive your training set, the outputs may not be grouped in a way that makes the prediction easy. Compare the two graphs below. Based on the first graph, can you tell whether X will be an N or an S? How about the second graph?
^
| N S N
s| S X N
i| N N S
z| S N S N
e| N S S N
+----------->
price
^
| S S N
s| X S N
i| S N N
z| S N N N
e| N N N
+----------->
price
Where:
S-sold,
N-not sold
As you can see in the first graph, you can't separate the two possible outputs (sold/not sold) by a straight line: no matter how you try, there will always be both S and N on both sides of the line. That means your algorithm has many possible lines to choose from but no single correct line that splits the two outputs (and, of course, predicts new ones, which is the goal from the very beginning). That's why linearly separable data sets (the second graph) are much easier to predict.

This means that there is a hyperplane (which splits your input space into two half-spaces) such that all points of the first class are in one half-space and those of the second class are in the other half-space.
In two dimensions, that means that there is a line which separates points of one class from points of the other class.
EDIT: for example, in this image, if blue circles represent points from one class and red circles represent points from the other class, then these points are linearly separable.
In three dimensions, it means that there is a plane which separates points of one class from points of the other class.
In higher dimensions, it's similar: there must exist a hyperplane which separates the two sets of points.
You mention that you're not good at math, so I'm not writing the formal definition, but let me know (in the comments) if that would help.
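For readers who prefer code to formulas, the half-space idea above can be sketched as a small check. The points, labels, and the line x1 + x2 = 6 are made-up illustration values, not anything from the question.

```python
def side(w, b, x):
    """Return +1 or -1 depending on which side of the hyperplane w.x + b = 0 the point x lies on."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score > 0 else -1


def separates(w, b, points, labels):
    """True if every point falls in the half-space that matches its label."""
    return all(side(w, b, x) == y for x, y in zip(points, labels))


# Hypothetical 2-D data: class +1 near the origin, class -1 further out.
points = [(1, 1), (2, 1), (4, 5), (5, 4)]
labels = [1, 1, -1, -1]

# Hypothetical candidate line x1 + x2 = 6, written as w.x + b = 0.
w, b = (-1.0, -1.0), 6.0
print(separates(w, b, points, labels))
```

A data set is linearly separable exactly when at least one such (w, b) pair passes this check; the same code works unchanged for points with three or more coordinates.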

Look at the following two data sets:
^                   ^
|  X        O       |  AA     /
|                   |  A     /
|                   |       /   B
|  O        X       |  A   /  BB
|                   |     /    B
+----------->       +----------->
The left data set is not linearly separable (without using a kernel). The right one is separable into two parts, A and B, by the indicated line.
I.e., you cannot draw a straight line in the left image so that all the X are on one side and all the O are on the other. That is why it is called "not linearly separable": there exists no linear manifold separating the two classes.
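To tie this back to Mitchell's sentence about convergence, here is a minimal sketch of the perceptron rule run on an X/O layout like the left picture and on a separable toy set. The coordinates, the epoch limit, and the function name are assumptions made for illustration.

```python
def train_perceptron(points, labels, max_epochs=100):
    """Plain perceptron rule. Returns (w, b, converged)."""
    w = [0.0] * len(points[0])
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(points, labels):
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:  # misclassified (or exactly on the line)
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                mistakes += 1
        if mistakes == 0:  # a full pass without errors: a separating line was found
            return w, b, True
    return w, b, False


# Left picture: X on one diagonal (+1), O on the other (-1).
xor_points = [(0, 0), (1, 1), (0, 1), (1, 0)]
xor_labels = [1, 1, -1, -1]

# A separable toy set: only the point (1, 1) is positive.
sep_points = [(0, 0), (0, 1), (1, 0), (1, 1)]
sep_labels = [-1, -1, -1, 1]

print(train_perceptron(xor_points, xor_labels)[2])
print(train_perceptron(sep_points, sep_labels)[2])
```

On the first data set the loop keeps making mistakes until the epoch limit is reached, because no line exists for it to find; on the second it stops early with a separating line.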
Now the famous kernel trick (which will certainly be discussed in the book next) actually allows many linear methods to be used for non-linear problems by virtually adding additional dimensions to make a non-linear problem linearly separable.
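The "additional dimensions" idea can be shown with an explicit extra feature. This is only a sketch: a real kernel method would add the dimensions implicitly, and the feature x1*x2 and the plane below are hand-picked assumptions.

```python
def lift(x):
    """Map a 2-D point to 3-D by appending the product of its coordinates."""
    x1, x2 = x
    return (x1, x2, x1 * x2)


def score(w, b, x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b


# The X/O layout that no straight line can separate in 2-D.
points = [(0, 0), (1, 1), (0, 1), (1, 0)]
labels = [1, 1, -1, -1]

# Hand-picked plane in the lifted 3-D space (an assumption, not a learned result).
w, b = (-2.0, -2.0, 4.0), 1.0

separable_after_lift = all(y * score(w, b, lift(x)) > 0 for x, y in zip(points, labels))
print(separable_after_lift)
```

In the lifted space a single plane puts both X points on one side and both O points on the other, which is what "making a non-linear problem linearly separable" means.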

Related

How to combine various distance functions into one given the following dataset?

I have a few distance functions, each of which returns a distance between two images. I want to combine these distances into a single distance using weighted scoring, e.g. ax1+bx2+cx3+dx4, and I want to learn these weights automatically so that my test error is minimised.
For this purpose I have a labeled dataset of triplets of images (a,b,c) such that a is closer to b than it is to c,
i.e. d(a,b)<d(a,c)
I want to learn weights that reproduce this ordering of the triplets as accurately as possible (i.e. the weighted linear score is lower for a&b and higher for a&c).
What sort of machine learning algorithm can be used for this task, and how can the desired result be achieved?
Hopefully I understand your question correctly, but it seems that this could be solved more easily with constrained optimization directly, rather than classical machine learning (the algorithms of which are often implemented via constrained optimization, see e.g. SVMs).
As an example, a possible objective function could be:
argmin_{w} || e ||_2 + lambda || w ||_2
where w is your weight vector (Oh god why is there no latex here), e is the vector of errors (one component per training triplet), lambda is some tunable regularizer constant (could be zero), and your constraints could be:
max{d(I_p,I_r)-d(I_p,I_q),0} <= e_j for jth (p,q,r) in T s.t. d(I_p,I_r) <= d(I_p,I_q)
for the jth constraint, where I_i is image i, T is the training set, and
d(u,v) = sum_{w_i in w} w_i * d_i(u,v)
with d_i being your ith distance function.
Notice that e is measuring how far your chosen weights are from satisfying all the chosen triplets in the training set. If the weights preserve ordering of label j, then d(I_p,I_r)-d(I_p,I_q) < 0 and so e_j = 0. If they don't, then e_j will measure the amount of violation of training label j. Solving the optimization problem would give the best w; i.e. the one with the lowest error.
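As a rough illustration of the idea, here is a sketch that minimises the triplet violations with plain subgradient descent instead of a constrained solver. That swap, the margin of 1 (added so that w = 0 is not a trivial solution), the non-negativity clipping, and the toy distance values are all assumptions made for this example.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def learn_weights(triplets, n_dist, lr=0.1, epochs=200, margin=1.0):
    """Each triplet is (near, far): the vectors of all distance functions
    evaluated on the pair that should be closer and on the pair that should
    be farther apart. Violated triplets push the weights downhill."""
    w = [1.0] * n_dist
    for _ in range(epochs):
        for near, far in triplets:
            diff = [n - f for n, f in zip(near, far)]
            violation = margin + dot(w, diff)  # > 0 means the ordering is (nearly) broken
            if violation > 0:
                # Subgradient step, clipped so weights stay non-negative.
                w = [max(0.0, wi - lr * di) for wi, di in zip(w, diff)]
    return w


# Toy data with two distance functions: the first agrees with the labels,
# the second is often misleading.
triplets = [
    ((1.0, 5.0), (3.0, 2.0)),
    ((2.0, 4.0), (5.0, 1.0)),
    ((0.5, 3.0), (2.5, 3.5)),
]

w = learn_weights(triplets, n_dist=2)
print(w)
print(all(dot(w, near) < dot(w, far) for near, far in triplets))
```

The learned weights end up favouring the first distance function, since it is the one that agrees with the labeled orderings. For real data a proper LP/QP solver or a metric-learning library is likely the better tool.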
If you're not familiar with linear/quadratic programming, convex optimization, etc... then start googling :) Many libraries exist for this type of thing.
On the other hand, if you would prefer a machine learning approach, you may be able to adapt some metric learning approaches to your problem.

Machine Learning: Why xW+b instead of Wx+b?

I have started to learn machine learning, and I am now playing around with TensorFlow.
I often see examples like this:
pred = tf.add(tf.mul(X, W), b)
I also saw such a line in a plain numpy implementation. Why is x*W+b always used instead of W*x+b? Is there an advantage to multiplying the matrices this way? I see that it is possible (if X, W and b are transposed), but I do not see an advantage. In math class at school we only ever used Wx+b.
Thank you very much
This is the reason:
By default w is a vector of weights, and in maths a vector is considered a column, not a row.
X is a collection of data: an n x d matrix, where n is the number of data points and d is the number of features. (Upper-case X is the full n x d matrix; lower-case x is a single data point, a 1 x d matrix.)
To multiply both correctly and apply the correct weight to the correct feature you must use X*w+b:
With X*w you multiply every feature by its corresponding weight, and by adding b you add the bias term to every prediction.
If you multiply w*X you multiply a (d x 1) matrix by an (n x d) matrix; the inner dimensions do not match, so it makes no sense.
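The shape argument can be checked directly in numpy. The sizes n = 4 and d = 3 and the numeric values are arbitrary choices for illustration.

```python
import numpy as np

n, d = 4, 3                                        # 4 data points, 3 features (arbitrary)
X = np.arange(n * d, dtype=float).reshape(n, d)    # data matrix, one row per data point
w = np.array([[0.5], [-1.0], [2.0]])               # weights as a column vector, shape (d, 1)
b = 0.1                                            # scalar bias, broadcast to every prediction

pred = X @ w + b                                   # (n, d) @ (d, 1) -> (n, 1)
print(pred.shape)

# The textbook "Wx + b" form is the same computation with everything transposed.
pred_T = w.T @ X.T + b                             # (1, d) @ (d, n) -> (1, n)
print(np.allclose(pred, pred_T.T))

# Multiplying in the other order without transposing fails on the shapes.
try:
    w @ X                                          # (d, 1) @ (n, d): inner dimensions 1 and n differ
    wrong_order_failed = False
except ValueError:
    wrong_order_failed = True
print(wrong_order_failed)
```

So the two conventions give the same numbers; X*w+b is simply the one that fits data stored with one row per example.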
I was also confused by this. I guess it may be a matter of dimensions. For an n x m matrix W and an n-dimensional vector x, xW+b can easily be viewed as mapping an n-dimensional feature to an m-dimensional feature, i.e., you can think of W as an n-dimension -> m-dimension operation, whereas Wx+b (x must now be an m-dimensional vector) becomes an m-dimension -> n-dimension operation, which looks less comfortable in my opinion. :D

How to concatenate word vectors to form sentence vector

I have read in some papers (Tomas Mikolov...) that a better way of forming the vector for a sentence is to concatenate the word vectors.
But because I am weak at mathematics, I am still not sure about the details.
For example,
suppose that the dimension of a word vector is m, and that a sentence has n words.
What is the correct result of the concatenation operation?
Is it a row vector of 1 x m*n, or a matrix of m x n?
There are at least three common ways to combine embedding vectors: (a) summing, (b) summing & averaging, or (c) concatenating. So in your case, concatenating would give you a 1 x m*n vector, where n is the number of words in the sentence. In the other cases, the vector length stays the same (m). See gensim.models.doc2vec.Doc2Vec, dm_concat and dm_mean - it allows you to use any of those three options [1,2].
[1] http://radimrehurek.com/gensim/models/doc2vec.html#gensim.models.doc2vec.LabeledLineSentence
[2] https://github.com/piskvorky/gensim/blob/develop/gensim/models/doc2vec.py
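A tiny sketch of the three options using plain Python lists, with the question's notation (n words, each an m-dimensional vector). The numbers are made up and no embedding library is involved.

```python
# n = 2 words, each with an m = 3 dimensional vector (made-up values).
word_vectors = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
]

# (c) Concatenating: one flat vector of length m * n.
concatenated = [value for vector in word_vectors for value in vector]

# (a) Summing: add the vectors component by component, length stays m.
summed = [sum(column) for column in zip(*word_vectors)]

# (b) Summing and averaging: length also stays m.
averaged = [total / len(word_vectors) for total in summed]

print(len(concatenated), len(summed), len(averaged))
```

Note that the concatenated length grows with the number of words, so sentences of different lengths give vectors of different sizes, while the sum and the average always have length m.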

Decision tree completeness and unclassified data

I made a program that trains a decision tree built on the ID3 algorithm using an information gain function (Shannon entropy) for feature selection (split).
Once I trained a decision tree I tested it to classify unseen data and I realized that some data instances cannot be classified: there is no path on the tree that classifies the instance.
An example (this is an illustration example but I encounter the same problem with a larger and more complex data set):
With f1 and f2 as the predictor variables (features) and y as the categorical variable, the value ranges are:
f1: [a1; a2; a3]
f2: [b1; b2; b3]
y : [y1; y2; y3]
Training data:
("a1", "b1", "y1");
("a1", "b2", "y2");
("a2", "b3", "y3");
("a3", "b3", "y1");
Trained tree:
          [f2]
        /  |  \
      b1   b2   b3
      /    |     \
    y1     y2    [f1]
                 /  \
               a2    a3
               /      \
             y3        y1
The instance ("a1", "b3") cannot be classified with the given tree.
Several questions came up to me:
1. Does this situation have a name? Tree incompleteness or something like that?
2. Is there a way to know if a decision tree will cover all combinations of unknown instances (all features values combinations)?
3. Does the reason of this "incompleteness" lie on the topology of the data set or on the algorithm used to train the decision tree (ID3 in this case) (or other)?
4. Is there a method to classify these unclassifiable instances with the given decision tree? Or must one use another tool (random forest, neural networks...)?
This situation cannot occur with the ID3 decision-tree learner---regardless of whether it uses information gain or some other heuristic for split selection. (See, for example, ID3 algorithm on Wikipedia.)
The "trained tree" in your example above could not have been returned by the ID3 decision-tree learning algorithm.
This is because when the algorithm selects a d-valued attribute (i.e. an attribute with d possible values) on which to split the given leaf, it will create d new children (one per attribute value). In particular, in your example above, the node [f1] would have three children, corresponding to attribute values a1,a2, and a3.
It follows from the previous paragraph (and, in general, from the way the ID3 algorithm works) that any well-formed vector---of the form (v1, v2, ..., vn, y), where vi is a value of i-th attribute and y is the class value---should be classifiable by the decision tree that the algorithm learns on a given train set.
Would you mind providing a link to the software you used to learn the "incomplete" trees?
To answer your questions:
1. Not that I know of. It doesn't make sense to learn such "incomplete trees." If we knew that some attribute values will never occur then we would not include them in the specification (the file where you list attributes and their values) in the first place.
2. With the ID3 algorithm, you can prove---as I sketched in the answer---that every tree returned by the algorithm will cover all possible combinations.
3. You're using the wrong algorithm. Data has nothing to do with it.
4. There is no such thing as an unclassifiable instance in decision-tree learning. One usually defines a decision-tree learning problem as follows. Given a train set S of examples x1,x2,...,xn of the form xi=(v1i,v2i,...,vni,yi) where vji is the value of the j-th attribute and yi is the class value in example xi, learn a function (represented by a decision tree) f: X -> Y, where X is the space of all possible well-formed vectors (i.e. all possible combinations of attribute values) and Y is the space of all possible class values, which minimizes an error function (e.g. the number of misclassified examples). From this definition, you can see that one requires that the function f is able to map any combination to a class value; thus, by definition, each possible instance is classifiable.
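The point about creating one child per declared attribute value can be sketched as follows. This is a compact ID3-style illustration, not the asker's program: the data layout, the function names, and the tie-breaking by attribute order are assumptions. Branches with no training examples get the majority class of the parent node, as in the textbook description of ID3.

```python
import math
from collections import Counter


def entropy(rows):
    counts = Counter(r[-1] for r in rows)
    return -sum(c / len(rows) * math.log2(c / len(rows)) for c in counts.values())


def id3(rows, attrs, domains):
    """rows: tuples (v_0, ..., v_k, y); attrs: attribute indices still unused;
    domains: {attribute index: every declared value of that attribute}."""
    classes = Counter(r[-1] for r in rows)
    majority = classes.most_common(1)[0][0]
    if len(classes) == 1 or not attrs:
        return majority  # leaf

    def gain(a):
        remainder = 0.0
        for v in domains[a]:
            subset = [r for r in rows if r[a] == v]
            if subset:
                remainder += len(subset) / len(rows) * entropy(subset)
        return entropy(rows) - remainder

    best = max(attrs, key=gain)
    rest = [a for a in attrs if a != best]
    children = {}
    for v in domains[best]:  # one child per declared value, seen in training or not
        subset = [r for r in rows if r[best] == v]
        children[v] = id3(subset, rest, domains) if subset else majority
    return (best, children)


def classify(tree, x):
    while isinstance(tree, tuple):  # inner nodes are (attribute, children) tuples
        attr, children = tree
        tree = children[x[attr]]
    return tree


rows = [("a1", "b1", "y1"), ("a1", "b2", "y2"), ("a2", "b3", "y3"), ("a3", "b3", "y1")]
domains = {0: ["a1", "a2", "a3"], 1: ["b1", "b2", "b3"]}

# f1 and f2 tie on information gain here; listing f2 (index 1) first mirrors the question's tree.
tree = id3(rows, [1, 0], domains)
print(classify(tree, ("a1", "b3")))
```

With the training data from the question, the [f1] node under b3 now has an a1 branch as well, so the instance ("a1", "b3") receives a class label instead of falling off the tree.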

Intuition about the kernel trick in machine learning

I have successfully implemented a kernel perceptron classifier that uses an RBF kernel. I understand that the kernel trick maps features to a higher dimension so that a linear hyperplane can be constructed to separate the points. For example, if you have features (x1,x2) and map them to a 3-dimensional feature space you might get: phi(x1,x2) = (x1^2, sqrt(2)*x1*x2, x2^2).
If you plug that into the perceptron decision function w'x+b = 0, you end up with: w1*x1^2 + w2*sqrt(2)*x1*x2 + w3*x2^2 + b = 0, which can give you a circular decision boundary.
While the kernel trick itself is very intuitive, I am not able to understand the linear algebra aspect of this. Can someone help me understand how we are able to map all of these additional features without explicitly specifying them, using just the inner product?
Thanks!
Simple.
Give me the numeric result of (x+y)^10 for some values of x and y.
What would you rather do: "cheat" by summing x+y and then taking that value to the 10th power, or expand out the exact result by writing out
x^10+10 x^9 y+45 x^8 y^2+120 x^7 y^3+210 x^6 y^4+252 x^5 y^5+210 x^4 y^6+120 x^3 y^7+45 x^2 y^8+10 x y^9+y^10
And then compute each term and then add them together? Clearly we can evaluate the dot product between degree 10 polynomials without explicitly forming them.
Valid kernels are dot products where we can "cheat" and compute the numeric result between two points without having to form their explicit feature values. There are many such possible kernels, though only a few are widely used in papers and in practice.
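The same "cheat" can be checked numerically for a small case. The sketch below uses the standard degree-2 feature map phi(x1, x2) = (x1^2, sqrt(2)*x1*x2, x2^2) and the kernel K(x, y) = (x . y)^2; the two sample points are arbitrary.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def phi(x):
    """Explicit degree-2 feature map for a 2-D point."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)


def poly_kernel(x, y):
    """Same inner product, computed without ever building phi(x) or phi(y)."""
    return dot(x, y) ** 2


x, y = (1.0, 2.0), (3.0, -1.0)   # arbitrary sample points

explicit = dot(phi(x), phi(y))   # map both points to 3-D, then take the dot product
implicit = poly_kernel(x, y)     # one 2-D dot product and one squaring

print(math.isclose(explicit, implicit))
```

For degree 2 the saving is small, but for higher degrees or for the RBF kernel the explicit feature vector becomes very large or infinite-dimensional, while the kernel evaluation stays cheap.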
I'm not sure if I'm answering your question, but as I remember the "trick" is that you don't explicitly calculate inner products in the transformed space. The perceptron calculates a straight line that separates the clusters. To get curved lines or even circles, instead of changing the perceptron you can change the space that contains the clusters. This is done by using a transformation, usually called phi, that transforms coordinates from one space to another. The perceptron algorithm is then applied in the new space, where it produces a straight line, but when that line is transformed back to the original space it can be curved.
The trick is that the perceptron only needs to know the inner product of the points of the clusters it is trying to separate. This means that we only need to be able to calculate the inner product of the transformed points. This is what the kernel does: K(x,y) = <phi(x), phi(y)>, where < . , . > is the inner product in the new space. This means that there is no need to do all the transformations to the new space and back; we don't even need to explicitly know what the transformation phi() is. All that is needed is that K defines an inner product in some space, and we hope that this inner product and space are useful for separating our clusters.
I think that there was some theorem that says that if the space represented by the kernel has higher dimensionality than the original space it is likely that it will separate the clusters better.
There is really not much to it
The weight in the higher space is
w = sum_i{a_i * Phi(x_i)}
and the input vector in the higher space is
Phi(x)
so that the linear classification in the higher space is
w^t * Phi(x) + c > 0
so if you put these together
(sum_i{a_i * Phi(x_i)})^t * Phi(x) + c = sum_i{a_i * Phi(x_i)^t * Phi(x)} + c > 0
The computational cost of the last dot product is linear in the number of dimensions (often intractable, or not wanted).
We solve this by switching to the kernel, the "magic answer to the dot product":
K(x_i, x) = Phi(x_i)^t * Phi(x)
which gives
sum_i{a_i * K(x_i, x)} + c > 0
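The final decision rule can be turned into a small kernel perceptron. This is a sketch under assumptions: the coefficient a_i absorbs the class label of x_i, c is updated like an ordinary bias, gamma = 1 is an arbitrary choice, and the four training points are the X/O layout that no straight line separates.

```python
import math


def rbf(x, y, gamma=1.0):
    """RBF kernel: exp(-gamma * squared distance)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def decision(alpha, c, points, kernel, x):
    """sum_i a_i * K(x_i, x) + c, with Phi never computed explicitly."""
    return sum(a * kernel(xi, x) for a, xi in zip(alpha, points)) + c


def train_kernel_perceptron(points, labels, kernel, max_epochs=100):
    alpha = [0.0] * len(points)   # one coefficient a_i per training point
    c = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for j, (x, y) in enumerate(zip(points, labels)):
            if y * decision(alpha, c, points, kernel, x) <= 0:
                alpha[j] += y     # same as adding y * Phi(x_j) to w in the higher space
                c += y
                mistakes += 1
        if mistakes == 0:
            break
    return alpha, c


points = [(0, 0), (1, 1), (0, 1), (1, 0)]
labels = [1, 1, -1, -1]

alpha, c = train_kernel_perceptron(points, labels, rbf)
predictions = [1 if decision(alpha, c, points, rbf, x) > 0 else -1 for x in points]
print(predictions == labels)
```

Only kernel values K(x_i, x) appear in the code; the weight vector w in the higher space is never stored, it is represented entirely by the coefficients a_i.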
