What does the "support" mean in Support Vector Machine? - machine-learning

What the meaning of the word "support" in the context of Support Vector Machine, which is a supervised learning model?

Copy-pasted from Wikipedia:
Maximum-margin hyperplane and margins for an SVM trained with samples from two classes. Samples on the margin are called the support vectors.

In SVMs the resulting separating hyper-plane is attributed to a sub-set of data feature vectors (i.e., the ones that their associated Lagrange multipliers are greater than 0). These feature vectors were named support vectors because intuitively you could say that they "support" the separating hyper-plane or you could say that for the separating hyper-plane the support vectors play the same role as the pillars to a building.
Now formally, paraphrasing Bernhard Schoelkopf 's and Alexander J. Smola's book titled "Learning with Kernels" page 6:
"In the searching process of the unique optimal hyper-plane we consider hyper-planes with normal vectors w that can be represented as general linear combinations (i.e., with non-uniform coefficients) of the training patterns. For instance, we might want to remove the influence of patterns that are very far away from the decision boundary, either since we expect that they will not improve the generalization error of the decision function, or since we would like to reduce computational cost of evaluating the decision function. The hyper-plane will then only depend on a sub-set of the training patterns called Support Vectors."
That is, the separating hyper-plane depends on those training data feature vectors, they influence it, it's based on them, consequently they support it.

In a kernel space, the simplest way to represent the separating hyperplane is by the distance to data instances. These data instances are called "support vectors".
The kernel space could be infinite. But as long as you can compute the kernel similarity to the support vectors, you can test which side of the hyperplane an object is, without actually knowing what this infinite dimensional hyperplane looks like.
In 2d, you could of course just produce an equation for the hyperplane. But this doesn't yield any actual benefits, except for understanding the SVM.

Related

What's the major difference between glove and word2vec?

What is the difference between word2vec and glove?
Are both the ways to train a word embedding? if yes then how can we use both?
Yes, they're both ways to train a word embedding. They both provide the same core output: one vector per word, with the vectors in a useful arrangement. That is, the vectors' relative distances/directions roughly correspond with human ideas of overall word relatedness, and even relatedness along certain salient semantic dimensions.
Word2Vec does incremental, 'sparse' training of a neural network, by repeatedly iterating over a training corpus.
GloVe works to fit vectors to model a giant word co-occurrence matrix built from the corpus.
Working from the same corpus, creating word-vectors of the same dimensionality, and devoting the same attention to meta-optimizations, the quality of their resulting word-vectors will be roughly similar. (When I've seen someone confidently claim one or the other is definitely better, they've often compared some tweaked/best-case use of one algorithm against some rough/arbitrary defaults of the other.)
I'm more familiar with Word2Vec, and my impression is that Word2Vec's training better scales to larger vocabularies, and has more tweakable settings that, if you have the time, might allow tuning your own trained word-vectors more to your specific application. (For example, using a small-versus-large window parameter can have a strong effect on whether a word's nearest-neighbors are 'drop-in replacement words' or more generally words-used-in-the-same-topics. Different downstream applications may prefer word-vectors that skew one way or the other.)
Conversely, some proponents of GLoVe tout that it does fairly well without needing metaparameter optimization.
You probably wouldn't use both, unless comparing them against each other, because they play the same role for any downstream applications of word-vectors.
Word2vec is a predictive model: trains by trying to predict a target word given a context (CBOW method) or the context words from the target (skip-gram method). It uses trainable embedding weights to map words to their corresponding embeddings, which are used to help the model make predictions. The loss function for training the model is related to how good the model’s predictions are, so as the model trains to make better predictions it will result in better embeddings.
The Glove is based on matrix factorization techniques on the word-context matrix. It first constructs a large matrix of (words x context) co-occurrence information, i.e. for each “word” (the rows), you count how frequently (matrix values) we see this word in some “context” (the columns) in a large corpus. The number of “contexts” would be very large, since it is essentially combinatorial in size. So we factorize this matrix to yield a lower-dimensional (word x features) matrix, where each row now yields a vector representation for each word. In general, this is done by minimizing a “reconstruction loss”. This loss tries to find the lower-dimensional representations which can explain most of the variance in the high-dimensional data.
Before GloVe, the algorithms of word representations can be divided into two main streams, the statistic-based (LDA) and learning-based (Word2Vec). LDA produces the low dimensional word vectors by singular value decomposition (SVD) on the co-occurrence matrix, while Word2Vec employs a three-layer neural network to do the center-context word pair classification task where word vectors are just the by-product.
The most amazing point from Word2Vec is that similar words are located together in the vector space and arithmetic operations on word vectors can pose semantic or syntactic relationships, e.g., “king” - “man” + “woman” -> “queen” or “better” - “good” + “bad” -> “worse”. However, LDA cannot maintain such linear relationship in vector space.
The motivation of GloVe is to force the model to learn such linear relationship based on the co-occurreence matrix explicitly. Essentially, GloVe is a log-bilinear model with a weighted least-squares objective. Obviously, it is a hybrid method that uses machine learning based on the statistic matrix, and this is the general difference between GloVe and Word2Vec.
If we dive into the deduction procedure of the equations in GloVe, we will find the difference inherent in the intuition. GloVe observes that ratios of word-word co-occurrence probabilities have the potential for encoding some form of meaning. Take the example from StanfordNLP (Global Vectors for Word Representation), to consider the co-occurrence probabilities for target words ice and steam with various probe words from the vocabulary:
As one might expect, ice co-occurs more frequently with solid than it
does with gas, whereas steam co-occurs more frequently with gas than
it does with solid.
Both words co-occur with their shared property water frequently, and both co-occur with the unrelated word fashion infrequently.
Only in the ratio of probabilities does noise from non-discriminative words like water and fashion cancel out, so that large values (much greater than 1) correlate well with properties specific to ice, and small values (much less than 1) correlate well with properties specific of steam.
However, Word2Vec works on the pure co-occurrence probabilities so that the probability that the words surrounding the target word to be the context is maximized.
In the practice, to speed up the training process, Word2Vec employs negative sampling to substitute the softmax fucntion by the sigmoid function operating on the real data and noise data. This emplicitly results in the clustering of words into a cone in the vector space while GloVe’s word vectors are located more discretely.

Is doc vector learned through PV-DBOW equivalent to the average/sum of the word vectors contained in the doc?

I've seen some posts say that the average of the word vectors perform better in some tasks than the doc vectors learned through PV_DBOW. What is the relationship between the document's vector and the average/sum of its words' vectors? Can we say that vector d
is approximately equal to the average or sum of its word vectors?
Thanks!
No. The PV-DBOW vector is calculated by a different process, based on how well the PV-DBOW-vector can be incrementally nudged to predict each word in the text in turn, via a concurrently-trained shallow neural network.
But, a simple average-of-word-vectors often works fairly well as a summary vector for a text.
So, let's assume both the PV-DBOW vector and the simple-average-vector are the same dimensionality. Since they're bootstrapped from exactly the same inputs (the same list of words), and the neural-network isn't significantly more sophisticated (in its internal state) than a good set of word-vectors, the performance of the vectors on downstream evaluations may not be very different.
For example, if the training data for the PV-DBOW model is meager, or meta-parameters not well optimized, but the word-vectors used for the average-vector are very well-fit to your domain, maybe the simple-average-vector would work better for some downstream task. On the other hand, a PV-DBOW model trained on sufficient domain text could provide vectors that outperform a simple-average based on word-vectors from another domain.
Note that FastText's classification mode (and similar modes in Facebook's StarSpace) actually optimizes word-vectors to work as parts of a simple-average-vector used to predict known text-classes. So if your end-goal is to have a text-vector for classification, and you have a good training dataset with known-labels, those techniques are worth considering as well.

Machine Learning - SVM

If one trains a model using a SVM from kernel data, the resultant trained model contains support vectors. Now consider the case of training a new model using the old data already present plus a small amount of new data as well.
SO:
Should the new data just be combined with the support vectors from the previously formed model to form the new training set. (If yes, then how to combine the support vectors with new graph data? I am working on libsvm)
Or:
Should the new data and the complete old data be combined together and form the new training set and not just the support vectors?
Which approach is better for retraining, more doable and efficient in terms of accuracy and memory?
You must always retrain considering the entire, newly concatenated, training set.
The support vectors from the "old" model might not be support vectors anymore in case some "new points" are closest to the decision boundary. Behind the SVM there is an optimization problem that must be solved, keep that in mind. With a given training set, you find the optimal solution (i.e. support vectors) for that training set. As soon as the dataset changes, such solution might not be optimal anymore.
The SVM training is nothing more than a maximization problem where the geometrical and functional margins are the objective function. Is like maximizing a given function f(x)...but then you change f(x): by adding/removing points from the training set you have a better/worst understanding of the decision boundary since such decision boundary is known via sampling where the samples are indeed the patterns from your training set.
I understand your concerned about time and memory efficiency, but that's a common problem: indeed training the SVMs for the so-called big data is still an open research topic (there are some hints regarding backpropagation training) because such optimization problem (and the heuristic regarding which Lagrange Multipliers should be pairwise optimized) are not easy to parallelize/distribute on several workers.
LibSVM uses the well-known Sequential Minimal Optimization algorithm for training the SVM: here you can find John Pratt's article regarding the SMO algorithm, if you need further information regarding the optimization problem behind the SVM.
Idea 1 has been already examined & assessed by research community
anyone interested in faster and smarter aproach (1) -- re-use support-vectors and add new data -- kindly review research materials published by Dave MUSICANT and Olvi MANGASARIAN on such their method referred as "Active Support Vector Machine"
MATLAB implementation: available from http://research.cs.wisc.edu/dmi/asvm/
PDF:[1] O. L. Mangasarian, David R. Musicant; Active Support Vector Machine Classification; 1999
[2] David R. Musicant, Alexander Feinberg; Active Set Support Vector Regression; IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 15, NO. 2, MARCH 2004
This is a purely theoretical thought on your question. The idea is not bad. However, it needs to be extended a bit. I'm looking here purely at the goal to sparsen the training data from the first batch.
The main problem -- which is why this is purely theoretical -- is that your data is typically not linear separable. Then the misclassified points are very important. And they will spoil what I write below. Furthermore the idea requires a linear kernel. However, it might be possible to generalise to other kernels
To understand the problem with your approach lets look at the following support vectors (x,y,class): (-1,1,+),(-1,-1,+),(1,0,-). The hyperplane is the a vertical line going trough zero. If you would have in your next batch the point (-1,-1.1,-) the max margin hyperplane would tilt. This could now be exploited for sparsening. You calculate the - so to say - minimal margin hyperplane between the two pairs ({(-1,1,+),(1,0,-)}, {(-1,-1,+),(1,0,-)}) of support vectors (in 2d only 2 pairs. higher dimensions or non-linear kernel might be more). This is basically the line going through these points. Afterwards you classify all data points. Then you add all misclassified points in either of the models, plus the support vectors to the second batch. Thats it. The remaining points can't be relevant.
Besides the C/Nu problem mentioned above. The curse of dimensionality will obviously kill you here
An image to illustrate. Red: support vectors, batch one, Blue, non-support vector batch one. Green new point batch two.
Redline first Hyperplane, Green minimal margin hyperplane which misclassifies blue point, blue new hyperplane (it's a hand fit ;) )

Fuzzy clustering using unsupervised dimensionality reduction

An unsupervised dimensionality reduction algorithm is taking as input a matrix NxC1 where N is the number of input vectors and C1 is the number of components for each vector (the dimensionality of the vector). As a result, it returns a new matrix NxC2 (C2 < C1) where each vector has a lower number of component.
A fuzzy clustering algorithm is taking as input a matrix N*C1 where N, here again, is the number of input vectors and C1 is the number of components for each vector. As a result, it returns a new matrix NxC2 (C2 usually lower than C1) where each component of each vector is indicating the degree to which the vector belongs to the corresponding cluster.
I noticed that input and output of both classes of algorithms are the same in structure, only the interpretation of the results changes. Moreover, there no fuzzy clustering implementation in scikit-learn, hence the following question:
Does it make sense to use a dimensionality reduction algorithm to perform fuzzy clustering?
For instance, is it a non-sense to apply FeatureAgglomeration or TruncatedSVD to a dataset built from TF-IDF vectors extracted from textual data, and interpret the results as a fuzzy clustering?
In some sense, sure. It kind of depends on how you want to use the results downstream.
Consider SVD truncation or excluding principal components. We have projected into a new, variance-preserving space with essentially few other restrictions on the structure of the new manifold. The new coordinate representations of the original data points could have large negative numbers for some elements, which is a little weird. But one could shift and rescale the data without much difficulty.
One could then interpret each dimension as a cluster membership weight. But consider a common use for fuzzy clustering, which is to generate a hard clustering. Notice how easy this is with fuzzy cluster weights (e.g. just take the max). Consider a set of points in the new dimensionally-reduced space, say <0,0,1>,<0,1,0>,<0,100,101>,<5,100,99>. A fuzzy clustering would given something like {p1,p2}, {p3,p4} if thresholded, but if we took the max here (i.e. treat the dimensionally reduced axes as membership, we get {p1,p3},{p2,p4}, for k=2, for instance. Of course, one could use a better algorithm than max to derive hard memberships (say by looking at pairwise distances, which would work for my example); such algorithms are called, well, clustering algorithms.
Of course, different dimensionality reduction algorithms may work better or worse for this (e.g. MDS which focuses on preserving distances between data points rather than variances is more naturally cluster-like). But fundamentally, many dimensionality reduction algorithms implicitly preserve data about the underlying manifold that the data lie on, whereas fuzzy cluster vectors only hold information about the relations between data points (which may or may not implicitly encode that other information).
Overall, the purpose is a little different. Clustering is designed to find groups of similar data. Feature selection and dimensionality reduction are designed to reduce the noise and/or redundancy of the data by changing the embedding space. Often we use the latter to help with the former.

non linear svm kernel dimension

I have some problems with understanding the kernels for non-linear SVM.
First what I understood by non-linear SVM is: using kernels the input is transformed to a very high dimension space where the transformed input can be separated by a linear hyper-plane.
Kernel for e.g: RBF:
K(x_i, x_j) = exp(-||x_i - x_j||^2/(2*sigma^2));
where x_i and x_j are two inputs. here we need to change the sigma to adapt to our problem.
(1) Say if my input dimension is d, what will be the dimension of the
transformed space?
(2) If the transformed space has a dimension of more than 10000 is it
effective to use a linear SVM there to separate the inputs?
Well it is not only a matter of increasing the dimension. That's the general mechanism but not the whole idea, if it were true that the only goal of the kernel mapping is to increase the dimension, one could conclude that all kernels functions are equivalent and they are not.
The way how the mapping is made would make possible a linear separation in the new space.
Talking about your example and just to extend a bit what greeness said, RBF kernel would order the feature space in terms of hyperspheres where an input vector would need to be close to an existing sphere in order to produce an activation.
So to answer directly your questions:
1) Note that you don't work on feature space directly. Instead, the optimization problem is solved using the inner product of the vectors in the feature space, so computationally you won't increase the dimension of the vectors.
2) It would depend on the nature of your data, having a high dimensional pattern would somehow help you to prevent overfitting but not necessarily will be linearly separable. Again, the linear separability in the new space would be achieved because the way the map is made and not only because it is in a higher dimension. In that sense, RBF would help but keep in mind that it might not perform well on generalization if your data is not locally enclosed.
The transformation usually increases the number of dimensions of your data, not necessarily very high. It depends. The RBF Kernel is one of the most popular kernel functions. It adds a "bump" around each data point. The corresponding feature space is a Hilbert space of infinite dimensions.
It's hard to tell if a transformation into 10000 dimensions is effective or not for classification without knowing the specific background of your data. However, choosing a good mapping (encoding prior knowledge + getting right complexity of function class) for your problem improves results.
For example, the MNIST database of handwritten digits contains 60K training examples and 10K test examples with 28x28 binary images.
Linear SVM has ~8.5% test error.
Polynomial SVM has ~ 1% test error.
Your question is a very natural one that almost everyone who's learned about kernel methods has asked some variant of. However, I wouldn't try to understand what's going on with a non-linear kernel in terms of the implied feature space in which the linear hyperplane is operating, because most non-trivial kernels have feature spaces that it is very difficult to visualise.
Instead, focus on understanding the kernel trick, and think of the kernels as introducing a particular form of non-linear decision boundary in input space. Because of the kernel trick, and some fairly daunting maths if you're not familiar with it, any kernel function satisfying certain properties can be viewed as operating in some feature space, but the mapping into that space is never performed. You can read the following (fairly) accessible tutorial if you're interested: from zero to Reproducing Kernel Hilbert Spaces in twelve pages or less.
Also note that because of the formulation in terms of slack variables, the hyperplane does not have to separate points exactly: there's an objective function that's being maximised which contains penalties for misclassifying instances, but some misclassification can be tolerated if the margin of the resulting classifier on most instances is better. Basically, we're optimising a classification rule according to some criteria of:
how big the margin is
the error on the training set
and the SVM formulation allows us to solve this efficiently. Whether one kernel or another is better is very application-dependent (for example, text classification and other language processing problems routinely show best performance with a linear kernel, probably due to the extreme dimensionality of the input data). There's no real substitute for trying a bunch out and seeing which one works best (and make sure the SVM hyperparameters are set properly---this talk by one of the LibSVM authors has the gory details).

Resources