I have a set of N data points X with +/- labels for which I'd like to calculate the max-margin linear separator (aka classifier, hyperplane) or fail if no such linear separator exists.
I do not want to avoid overfitting in the context of this question, as I do so elsewhere. So no slack variables; no cross-validation; no limits on the number of support vectors; just find the max-margin separator or fail.
How do I use libsvm to do so? I believe you can't give c=0 in C-SVM and you can't give nu=1 in nu-svm.
Related question (which I think didn't provide an answer):
Which of the parameters in LibSVM is the slack variable?
In the case of C-SVM, you should use a linear kernel and a very large C value (or nu = 0.999... for nu-SVM). If you still have slack with this setting, your data is probably not linearly separable.
Quick explanation: the C-SVM optimization function tries to find the hyperplane with maximum margin and lowest misclassification cost at the same time. The misclassification cost in the C-SVM formulation is defined as the distance from the misclassified point to its correct side of the hyperplane, multiplied by C. If you increase the C value (or nu value for nu-SVM), every misclassified point becomes too costly, and a hyperplane that separates the data perfectly will be preferred by the optimization function.
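As a rough illustration of the "do you still have slack" check, here is a minimal sketch in plain Python. It assumes you have already extracted the weight vector w and bias b from a linear model trained with a very large C; the helper names margins and has_slack are hypothetical and not part of libsvm.

```python
def margins(w, b, X, y):
    """Functional margins y_i * (w . x_i + b) for every training point."""
    return [yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            for xi, yi in zip(X, y)]

def has_slack(w, b, X, y, tol=1e-6):
    """True if any point violates the hard-margin constraint y_i (w . x_i + b) >= 1."""
    return any(m < 1.0 - tol for m in margins(w, b, X, y))

# Toy 1-D data: negatives at -2 and -1, positives at 1 and 2.
X = [[-2.0], [-1.0], [1.0], [2.0]]
y = [-1, -1, 1, 1]
separable = has_slack([1.0], 0.0, X, y)

# Adding a positive point among the negatives forces this hyperplane to need slack.
with_outlier = has_slack([1.0], 0.0, X + [[-1.5]], y + [1])
print(separable, with_outlier)
```

If the check reports slack even for a very large C, that is the "fail" case from the question.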
I am not sure what the y-axis of my PDP implies. Is it the probability of my target feature being 1 (binary classification), or something else?
If you make the partial dependence plot of column a and want to interpret the y value at x = 0.0, the y-axis value represents the average probability of class 1, computed by
changing the value of column a in all rows of your dataset to 0.0
predicting all changed rows with your fitted model
averaging the probabilities given by the model
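The three steps above can be sketched in a few lines of plain Python. The predict_proba function here is a hypothetical stand-in for your fitted model (a fixed logistic function of two columns), and partial_dependence is an illustrative helper, not a library call.

```python
import math

def predict_proba(row):
    """Hypothetical fitted model: probability of class 1 for one row (a, b)."""
    a, b = row
    return 1.0 / (1.0 + math.exp(-(2.0 * a - 1.0 * b)))

def partial_dependence(rows, column, value):
    """Average predicted probability after forcing `column` to `value` in every row."""
    # Step 1: overwrite the chosen column in a copy of every row.
    changed = [row[:column] + [value] + row[column + 1:] for row in rows]
    # Step 2: predict each changed row.
    probs = [predict_proba(r) for r in changed]
    # Step 3: average the predicted probabilities.
    return sum(probs) / len(probs)

data = [[0.5, 1.0], [-1.0, 0.0], [2.0, -1.0]]
pd_at_zero = partial_dependence(data, column=0, value=0.0)
print(pd_at_zero)
```

Repeating this for a grid of values of column a gives the full curve.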
I may not be good at explaining, but you can read more about PDP at https://christophm.github.io/interpretable-ml-book/pdp.html. Hope this helps :)
Generally speaking, we can produce a classifier from a function, f, that produces a real-valued output, plus a threshold. We call the output an 'activation'. If the activation meets the threshold condition, then we say the class is detected:
is_class := ( f(x0, x1, ...) > threshold )
and
activation = f(x0, x1, ...)
PDP plots simply show activation values as they change in response to changes in an input value (we ignore the threshold). That is, we might plot:
f(x0, x, x2, x3, ...)
as a single input x varies. Typically, we hold the others constant, although we can also plot in 2d and 3d.
Sometimes we're interested in:
how a single input changes the activation
how multiple inputs independently change the activation
how multiple activations change based on different inputs, and so on.
Strictly speaking, we need not even be talking about a classifier when looking at PDP plots. Any function that produces a real-valued output (an activation) in response to one or more real-valued feature inputs that we can vary allows us to produce PDP plots.
Classifier activations need not be, and often should not be, interpreted as probabilities, as others have written. In very many cases, this is simply incorrect. Nevertheless, the analysis of the activation levels is of interest to us, independently of whether the activations represent probabilities: in PDP plots, we can see, for example, which feature values produce strong changes; more horizontal plots may imply a worthless feature.
Similarly, in ROC plots, we explicitly examine the true-positive and false-positive detection rates that result from varying the threshold on the activation values.
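To illustrate the ROC point, here is a small sketch that sweeps a threshold over raw activations and records the resulting rates. The activations are deliberately outside the range 0..1 to show that nothing here requires probabilities. The function name roc_points and the toy data are assumptions made for the example.

```python
def roc_points(activations, labels, thresholds):
    """Return (false-positive rate, true-positive rate) for each threshold."""
    pos = sum(1 for l in labels if l == 1)
    neg = len(labels) - pos
    points = []
    for t in thresholds:
        # The class is detected when the activation exceeds the threshold.
        detected = [a > t for a in activations]
        tp = sum(1 for d, l in zip(detected, labels) if d and l == 1)
        fp = sum(1 for d, l in zip(detected, labels) if d and l == 0)
        points.append((fp / neg, tp / pos))
    return points

acts = [-3.0, -0.5, 0.2, 4.0]   # raw activations, not probabilities
labs = [0, 1, 0, 1]
pts = roc_points(acts, labs, thresholds=[-5.0, 0.0, 5.0])
print(pts)
```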
In both cases, there's no necessity that the classifier produce probabilities as its activation.
Interpretation of PDP plots is fraught with dangers. At a minimum, you need to be clear about what is being held constant as an input feature is varied. Were the other features set to zero (a good choice for linear models)? Did we set them to their most common values in the test set? Or the most common values for a known class in a sample? Without this information, the vertical axis may be less helpful.
Knowing that an activation is a probability also doesn't seem too helpful in PDP plots: you can't expect the area under it to sum to one. Perhaps the most useful thing you might find is error cases, where output probabilities are not in the range 0..1.
I came up with the following result, tested on many data sets, but I do not have a formal proof yet:
Theorem: The width L of any confidence interval is asymptotically equal (as n tends to infinity) to a power function of n, namely L=A / n^B where A and B are two positive constants depending on the data set, and n is the sample size.
See here and here for details. The B exponent seems to be very similar to the Hurst exponent in time series, not only in terms of what it represents, but also in the values that it takes: B=1/2 corresponds to perfect data (no auto-correlation or undesirable features) and B=1 corresponds to "bad data" typically with strong auto-correlations.
Note that B=1/2 is what everyone uses nowadays, assuming observations are independently and identically distributed, with an underlying normal distribution. I also devised a method to make the interval width converge to zero faster: O(1/n) rather than O(1/SQRT(n)). This is also described in section 3.3 of my article on re-sampling (here), and my approach in this context seems very much related to what are called second-order accurate intervals (usually achieved with modern versions of bootstrapping, see here).
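To make the claimed power law concrete, here is a minimal sketch of how A and B could be estimated from observed interval widths by a least-squares fit on the log-log scale. It is not the method from the linked articles. The widths are generated from the textbook 95% interval for a mean with known sigma = 1 (an assumption chosen so the expected exponent is 1/2), and fit_power_law is a hypothetical helper.

```python
import math

def fit_power_law(ns, widths):
    """Least-squares fit of log L = log A - B log n; returns (A, B)."""
    xs = [math.log(n) for n in ns]
    ys = [math.log(w) for w in widths]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return math.exp(my - slope * mx), -slope

# Textbook 95% interval for a mean with known sigma = 1: L = 2 * 1.96 / sqrt(n).
ns = [100, 1000, 10000, 100000]
widths = [2 * 1.96 / math.sqrt(n) for n in ns]
A, B = fit_power_law(ns, widths)
print(A, B)
```

On real data the widths would come from repeated interval computations at increasing sample sizes, and a fitted B noticeably above 1/2 would be the signal discussed above.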
My question is whether my theorem is original, ground-breaking, and correct, and how someone would prove it (or refute it).
Example of Confidence Interval
Perl code to produce confidence intervals for the correlation
The first problem is, what do you mean by confidence interval?
Let's say I do nonparametric estimation of a probability density function with a kernel density estimator.
A confidence interval has no meaning in this setting. However, you can compute the "speed" of convergence of your kernel density estimator to your target function. Depending on the distance you choose between functions, you get different speeds of convergence. For example, the best speed with the $L^{\infty}$ distance involves a $\log(n)$ factor.
By the way, you give a counterexample yourself in your first article.
So for me your theorem cannot hold, for two reasons:
It is not clear: you need to specify exactly what you mean by confidence interval, and what you mean by depending on the data set (does it depend on $N$, the number of observations?).
There is a counterexample, since the asymptotic speed of convergence of estimators can be more complicated than what you describe.
I'm reading the paper below and I have some trouble understanding the concept of negative sampling.
http://arxiv.org/pdf/1402.3722v1.pdf
Can anyone help, please?
The idea of word2vec is to maximise the similarity (dot product) between the vectors for words which appear close together (in the context of each other) in text, and minimise the similarity of words that do not. In equation (3) of the paper you link to, ignore the exponentiation for a moment. You have
v_c . v_w
-------------------
sum_i(v_ci . v_w)
The numerator is basically the similarity between the words c (the context) and w (the target). The denominator computes the similarity of all other contexts ci and the target word w. Maximising this ratio ensures words that appear closer together in text have more similar vectors than words that do not. However, computing this can be very slow, because there are many contexts ci. Negative sampling is one of the ways of addressing this problem: just select a couple of contexts ci at random. The end result is that if cat appears in the context of food, then the vector of food is more similar to the vector of cat (as measured by their dot product) than the vectors of several other randomly chosen words (e.g. democracy, greed, Freddy), instead of all other words in the language. This makes word2vec much, much faster to train.
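To show where the cost comes from, here is a toy sketch contrasting the full ratio (with the exponentiation put back in) against drawing a handful of random contexts. The vectors are random placeholders, the function names are hypothetical, and the negatives are drawn uniformly, which is a simplifying assumption rather than how word2vec itself picks them.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax_prob(v_w, contexts, c):
    """Full softmax: touches every context vector to build the denominator."""
    scores = [math.exp(dot(v_c, v_w)) for v_c in contexts]
    return scores[c] / sum(scores)

def sample_negatives(num_contexts, positive, k, rng):
    """Pick k distinct context indices at random, excluding the observed one."""
    candidates = [i for i in range(num_contexts) if i != positive]
    return rng.sample(candidates, k)

rng = random.Random(0)
contexts = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(1000)]
v_w = [0.1, -0.2, 0.3, 0.4]

p_full = softmax_prob(v_w, contexts, c=7)                      # 1000 dot products
negatives = sample_negatives(len(contexts), 7, k=5, rng=rng)   # only 5 comparisons
print(p_full, negatives)
```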
Computing the softmax (the function that determines which words are similar to the current target word) is expensive, since it requires summing over all words in V (the denominator), and V is generally very large.
What can be done?
Different strategies have been proposed to approximate the softmax. These approaches can be grouped into softmax-based and sampling-based approaches. Softmax-based approaches keep the softmax layer intact but modify its architecture to improve its efficiency (e.g. hierarchical softmax). Sampling-based approaches, on the other hand, do away with the softmax layer completely and instead optimise some other loss function that approximates the softmax (they do this by approximating the normalization in the denominator of the softmax with some other loss that is cheap to compute, like negative sampling).
The loss function in Word2vec is something like:
which the logarithm decomposes into:
With some mathematics and the gradient formula (see more details at 6), it is converted to:
As you can see, it is converted into a binary classification task (y=1 positive class, y=0 negative class). Since we need labels to perform the binary classification, we designate all context words c as true labels (y=1, positive samples), and k words randomly selected from the corpus as false labels (y=0, negative samples).
Look at the following paragraph. Assume our target word is "Word2vec". With a window of 3, our context words are: The, widely, popular, algorithm, was, developed. These context words are treated as positive labels. We also need some negative labels, so we randomly pick some words from the corpus (produce, software, Collobert, margin-based, probabilistic) and treat them as negative samples. This technique of picking random examples from the corpus is called negative sampling.
References:
(1) C. Dyer, "Notes on Noise Contrastive Estimation and Negative Sampling", 2014
(2) http://sebastianruder.com/word-embeddings-softmax/
I wrote a tutorial article about negative sampling here.
Why do we use negative sampling? -> to reduce computational cost
The cost functions for vanilla Skip-Gram (SG) and Skip-Gram negative sampling (SGNS) look like this:
Note that T is the number of words in the vocabulary. It is equivalent to V. In other words, T = V.
The probability distribution p(w_t+j|w_t) in SG is computed for all V words in the vocabulary with:
V can easily exceed tens of thousands when training a Skip-Gram model. The probability needs to be computed V times, making it computationally expensive. Furthermore, the normalization factor in the denominator requires an extra V computations.
On the other hand, the probability distribution in SGNS is computed with:
c_pos is the word vector for the positive word, and W_neg holds the word vectors for all K negative samples in the output weight matrix. With SGNS, the probability needs to be computed only K + 1 times, where K is typically between 5 and 20. Furthermore, no extra iterations are necessary to compute the normalization factor in the denominator.
With SGNS, only a fraction of weights are updated for each training sample, whereas SG updates all millions of weights for each training sample.
How does SGNS achieve this? -> by transforming a multi-class classification task into a binary classification task.
With SGNS, word vectors are no longer learned by predicting context words of a center word. It learns to differentiate the actual context words (positive) from randomly drawn words (negative) from the noise distribution.
In real life, you don't usually observe regression with random words like Gangnam-Style, or pimples. The idea is that if the model can distinguish between the likely (positive) pairs vs unlikely (negative) pairs, good word vectors will be learned.
In the above figure, current positive word-context pair is (drilling, engineer). K=5 negative samples are randomly drawn from the noise distribution: minimized, primary, concerns, led, page. As the model iterates through the training samples, weights are optimized so that the probability for positive pair will output p(D=1|w,c_pos)≈1, and probability for negative pairs will output p(D=1|w,c_neg)≈0.
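The binary objective described here can be sketched as a logistic loss over one positive pair and K negative pairs. This is a minimal illustration with made-up two-dimensional vectors, not the code from the tutorial; sgns_loss is a hypothetical name, and p(D=1|w,c) is taken to be the sigmoid of the dot product.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sgns_loss(w, c_pos, c_negs):
    """-log p(D=1|w,c_pos) minus the sum over negatives of log p(D=0|w,c_neg)."""
    loss = -math.log(sigmoid(dot(w, c_pos)))
    for c_neg in c_negs:
        loss -= math.log(1.0 - sigmoid(dot(w, c_neg)))
    return loss

w = [1.0, 0.5]
# Positive context aligned with w, negatives pointing away: low loss.
good = sgns_loss(w, c_pos=[2.0, 1.0], c_negs=[[-2.0, -1.0], [-1.0, -1.0]])
# The reverse arrangement: high loss.
bad = sgns_loss(w, c_pos=[-2.0, -1.0], c_negs=[[2.0, 1.0], [1.0, 1.0]])
print(good < bad)
```

Only the K + 1 vectors passed in are touched, which is why each training sample updates just a fraction of the weights.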
When using SVMlight or LIBSVM in order to classify phrases as positive or negative (Sentiment Analysis), is there a way to determine which are the most influential words that affected the algorithm's decision? For example, finding that the word "good" helped determine a phrase as positive, etc.
If you use the linear kernel then yes - simply compute the weight vector:
w = SUM_i y_i alpha_i sv_i
Where:
sv - support vector
alpha - coefficient found with SVMlight
y - corresponding class (+1 or -1)
(in some implementations alpha's are already multiplied by y_i and so they are positive/negative)
Once you have w, which is of dimensions 1 x d where d is your data dimension (number of words in the bag of words/tfidf representation) simply select the dimensions with high absolute value (no matter positive or negative) in order to find the most important features (words).
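A minimal sketch of this computation in plain Python, assuming you have already read the support vectors, their alpha coefficients, and their labels out of the model file. The three-word vocabulary and the numbers are invented for illustration, and the helper names are hypothetical.

```python
def weight_vector(support_vectors, alphas, labels):
    """w = sum_i y_i * alpha_i * sv_i for a linear-kernel SVM."""
    d = len(support_vectors[0])
    w = [0.0] * d
    for sv, alpha, y in zip(support_vectors, alphas, labels):
        for j in range(d):
            w[j] += y * alpha * sv[j]
    return w

def top_features(w, names, k):
    """The k (name, weight) pairs with the largest absolute weight."""
    ranked = sorted(zip(names, w), key=lambda pair: abs(pair[1]), reverse=True)
    return ranked[:k]

vocab = ["good", "bad", "the"]
svs = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]   # bag-of-words rows
w = weight_vector(svs, alphas=[0.8, 0.8], labels=[+1, -1])
top = top_features(w, vocab, k=2)
print(top)
```

The sign of each weight tells you which class the word pushes towards, while the absolute value tells you how strongly.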
If you use some kernel (like RBF) then the answer is no, there is no direct method of taking out the most important features, as the classification process is performed in a completely different way.
As @lejlot mentioned, with a linear kernel in SVM, one of the feature ranking strategies is based on the absolute values of the weights in the model. Another simple and effective strategy is based on F-score. It considers each feature separately and therefore cannot reveal mutual information between features. You can also determine how important a feature is by removing that feature and observing the classification performance.
You can see this article for more details on feature ranking.
With other kernels in SVM, the feature ranking is not that straightforward, yet still feasible. You can construct an orthogonal set of basis vectors in the kernel space, and calculate the weights by kernel relief. Then the implicit feature ranking can be done based on the absolute value of weights. Finally the data is projected into the learned subspace.
I have read through a lot of papers and understand the basic concept of a support vector machine at a very high level. You give it a training input vector, which has a set of features, and based on how the "optimization function" evaluates this input vector (let's call it x), the text associated with x (let's say we're talking about text classification) is classified into one of two pre-defined classes. This is only in the case of binary classification.
So my first question is about the procedure described above: all the papers say that this training input vector x is first mapped to a higher (maybe infinite) dimensional space. What does this mapping achieve, and why is it required? Let's say the input vector x has 5 features; who decides which "higher dimension" x is going to be mapped to?
Second question is about the following optimization equation:
min 1/2 w^T w + C Σ_{i=1..n} ξ_i
So I understand that w has something to do with the margins of the hyperplane from the support vectors in the graph, and I know that C is some sort of penalty, but I don't know what it is a penalty for. Also, what does ξ_i represent in this case?
A simple explanation of the second question would be much appreciated as I have not had much luck understanding it by reading technical papers.
When they talk about mapping to a higher-dimensional space, they mean that the kernel accomplishes the same thing as mapping the points to a higher-dimensional space and then taking dot products there. SVMs are fundamentally a linear classifier, but if you use kernels, they're linear in a space that's different from the original data space.
To be concrete, let's talk about the kernel
K(x, y) = (xy + 1)^2 = (xy)^2 + 2xy + 1,
where x and y are each real numbers (one-dimensional). Note that
(x^2, sqrt(2) x, 1) • (y^2, sqrt(2) y, 1) = x^2 y^2 + 2 x y + 1
has the same value. So K(x, y) = phi(x) • phi(y), where phi(a) = (a^2, sqrt(2) a, 1), and doing an SVM with this kernel (the inhomogeneous polynomial kernel of degree 2) is the same as if you first mapped your 1d points into this 3d space and then did a linear kernel.
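The identity is easy to check numerically. The short sketch below evaluates the kernel directly and via the explicit 3d feature map taken from the dot-product identity above; the function names are just for illustration.

```python
import math

def kernel(x, y):
    """Inhomogeneous polynomial kernel of degree 2 on real numbers."""
    return (x * y + 1) ** 2

def phi(a):
    """Explicit 3-d feature map whose dot product reproduces the kernel."""
    return (a * a, math.sqrt(2) * a, 1.0)

def dot(u, v):
    return sum(p * q for p, q in zip(u, v))

x, y = 1.5, -2.0
k_direct = kernel(x, y)             # one multiply, one add, one square
k_mapped = dot(phi(x), phi(y))      # map to 3-d first, then a linear dot product
print(k_direct, k_mapped)
```

The two values agree up to floating-point rounding, which is exactly why the SVM never has to build phi(x) itself.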
The popular Gaussian RBF kernel function is equivalent to mapping your points into an infinite-dimensional Hilbert space.
You're the one who decides what feature space it's mapped into, when you pick a kernel. You don't necessarily need to think about the explicit mapping when you do that, though, and it's important to note that the data is never actually transformed into that high-dimensional space explicitly; otherwise, infinite-dimensional points would be hard to represent. :)
The ξ_i are the "slack variables". Without them, SVMs would never be able to account for training sets that aren't linearly separable -- which most real-world datasets aren't. The ξ in some sense are the amount you need to push data points on the wrong side of the margin over to the correct side. C is a parameter that determines how much it costs you to increase the ξ (that's why it's multiplied there).
1) The higher dimension space happens through the kernel mechanism. However, when evaluating the test sample, the higher dimension space need not be explicitly computed. (Clearly this must be the case because we cannot represent infinite dimensions on a computer.) For instance, radial basis function kernels imply infinite dimensional spaces, yet we don't need to map into this infinite dimension space explicitly. We only need to compute K(x_sv, x_test), where x_sv is one of the support vectors and x_test is the test sample.
The specific higher dimensional space is chosen by the training procedure and parameters, which choose a set of support vectors and their corresponding weights.
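As a sketch of what evaluating a test sample looks like in practice, the snippet below computes the usual kernel SVM decision value, a weighted sum of K(x_sv, x_test) over the support vectors plus a bias, with an RBF kernel. The support vectors, coefficients, bias, and gamma are invented values; in a real model they come out of training.

```python
import math

def rbf(u, v, gamma):
    """Gaussian RBF kernel exp(-gamma * ||u - v||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))

def decision_value(x_test, support_vectors, coefs, b, gamma):
    """sum_i coef_i * K(x_sv_i, x_test) + b, where coef_i = y_i * alpha_i."""
    return sum(c * rbf(sv, x_test, gamma)
               for sv, c in zip(support_vectors, coefs)) + b

svs = [[0.0, 0.0], [2.0, 2.0]]
coefs = [1.0, -1.0]                 # hypothetical y_i * alpha_i values
near_first = decision_value([0.1, 0.0], svs, coefs, b=0.0, gamma=0.5)
near_second = decision_value([1.9, 2.0], svs, coefs, b=0.0, gamma=0.5)
print(near_first > 0, near_second < 0)
```

Nothing infinite-dimensional is ever stored: each prediction costs one kernel evaluation per support vector.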
2) C is the weight associated with the cost of not being able to classify the training set perfectly. The optimization equation says to trade-off between the two undesirable cases of non-perfect classification and low margin. The ξi variables represent by how much we're unable to classify instance i of the training set, i.e., the training error of instance i.
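To see the trade-off numerically, here is a small sketch that evaluates the objective for a fixed hyperplane. It uses the standard fact that, at the optimum, each slack equals max(0, 1 - y_i (w . x_i + b)). The toy data and the helper names slacks and objective are assumptions made for the example.

```python
def slacks(w, b, X, y):
    """xi_i = max(0, 1 - y_i (w . x_i + b)): zero when the point clears the margin."""
    return [max(0.0, 1.0 - yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b))
            for xi, yi in zip(X, y)]

def objective(w, b, X, y, C):
    """1/2 w.w + C * sum_i xi_i."""
    return 0.5 * sum(wj * wj for wj in w) + C * sum(slacks(w, b, X, y))

X = [[2.0], [0.5], [-1.0]]
y = [1, 1, -1]
xi = slacks([1.0], 0.0, X, y)          # only the point at 0.5 sits inside the margin
low_c = objective([1.0], 0.0, X, y, C=1.0)
high_c = objective([1.0], 0.0, X, y, C=100.0)
print(xi, low_c, high_c)
```

With a small C the margin term dominates, while with a large C the same single margin violation dominates the objective, pushing the optimizer towards a hyperplane that fits the training set more strictly.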
See Chris Burges' tutorial on SVM's for about the most intuitive explanation you're going to get of this stuff anywhere (IMO).