svm conceptual query - machine-learning

I have some basic conceptual queries on SVM - it will be great if any one can guide me on this. I have been studying books and lectures for a while but have not been able to get answers for these queries correctly
Suppose I have m featured data points - m > 2. How will I know if the data points are linearly separable or not?. If I have understood correctly, linearly separable data points - will not need any special kernel for finding the hyper plane as there is no need to increase the dimension.
Say, I am not sure whether the data is linearly separable or not. I try to get a hyper plane with linear kernel, once with slackness and once without slackness on the lagrange multipliers. What difference will I see on the error rates on training and test data for these two hyper planes. If I understood correctly, if the data is not linearly separable, and if I am not using slackness then there cannot be any optimal plane. If that is the case, should the svm algorithm give me different hyper planes on different runs. Now when I introduce slackness - should I always get the same hyper plane, every run ? And how exactly can I find out from the lagrange multipliers of a hyper plane, whether the data was linearly separable or not.
Now say from 2 I came to know somehow that the data was not linearly separable at m dimensions. So I will try to increase the dimensions and see if it is separable at a higher dimension. How do I know how high I will need to go ? I know the calculations do not go into that space - but is there any way to find out from 2 what should be the best kernel for 3 (i.e I want to find a linearly separating hyper plane).
What is the best way to visualize hyper planes and data points in Matlab where the feature dimensions can be as big as 60 - and the hyperplane is at > 100 dimensions (i,e data points in few hundreds and using Gaussian Kernels the feature vector changes to > 100 dimensions).
I will really appreciate if someone clears these doubts
Regards

I'm going to try to focus on your questions (1), (2) and (3). In practice the most important concern is not if the problem becomes linearly separable but how well the classifier performs on unseen data (i.e. how well it classifies). It seems you want to find a good kernel for which data is linearly separable, and you will always be able to do this (consider putting at each training point an extremely narrow gaussian RBF), but what you really want is good performance on unseen data. That being said:
If the problem is not linearly separable and not using slacks the optimization will fail. It depends on the implementation and the specific optimization algorithm how it fails, does it not converge?, does it not find a descent direction? does it run into numerical difficulties? Even if you wanted to determine cases with slacks, you can still run into numerical difficulties that would make and that alone would make your algorithm of linear separability unreliable
How high do you need to go? Well that is a fundamental question. It is called the problem of data representation. For straight forward solutions people use held out data (people don't care about linear separability they care about good performance on held out data) and do parameter search (for example an RBF kernel can is strictly more expressive than a linear kernel) under the correct gammas. So the problem becomes finding a good gamma for your data. See for example this paper: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.141.880
I don't think there is a trivial connection between the values of the lagrangian multipliers and linear separability. You can try an high alphas whose value is C, but I'm not sure you'll be able to say much.

Related

Intuition behind data preprocessing in ML

I'm going through CS231n to understand the basics of neural networks.
Attached is the slide in which Justin (the tutor) gives the reasoning for why data preprocessing is required and I don't completely understand. The explanation given is similar to the one given on the slide and I don't get it. The slide is below.
The second question I have is: is it actually normalisation or standardisation? This link implies that it is standardisation, whereas the course material says it is normalisation.
Any help will be appreciated.
A) The meaning of "less sensitive to small changes in weights" can easily be visualized. Imagine to operate a little change in the weights of the drawn hyperplane, i.e. rotate it a bit. If the samples are located around the origin, you'll notice that they can still be correctly classified. If they're far away from the origin, the same little change in weights will lead to bigger misclassifications.
B) Sometimes standardization and normalization are used interchangeably.
Standardization: I quote from Machine Learning and Pattern Recognition by Bishop : "For the purposes of this example, we have made a linear re-scaling of the data, known as standardizing, such that each of the variables has zero mean and unit standard deviation."
Normalization could be e.g. min-max normalization when you scale all feature values to the [0,1] range, or feature vector normalization when you divide the feature vector by its modulus.

Possible/maybe category in deep learning

I'm interested in taking advantage of some partially labeled data that I have in a deep learning task. I'm using a fully convolutional approach, not sampling patches from the labeled regions.
I have masks that outline regions of definite positive examples in an image, but the unmasked regions in the images are not necessarily negative - they may be positive. Does anyone know of a way to incorporate this type of class in a deep learning setting?
Triplet/contrastive loss seems like it may be the way to go, but I'm not sure how to accommodate the "fuzzy" or ambiguous negative/positive space.
Try label smoothing as described in section 7.5.1 of Deep Learning book:
We can assume that for some small constant eps, the training set label y is correct with probability 1 - eps, and otherwise any of the other possible labels might be correct.
Label smoothing regularizes a model based on a softmax with k output values by replacing the hard 0 and 1 classification targets with targets of eps / k and 1 - (k - 1) / k * eps, respectively.
See my question about implementing label smoothing in Pandas.
Otherwise if you know for sure, that some areas are negative, other are positive while some are uncertain, then you can introduce a third uncertain class. I have worked with data sets that contained uncertain class, which corresponded to samples that could belong to any of the available classes.
I'm assuming that you are struggling with a data segmantation task with a problem of a ill-definied background (e.g. you are not sure if all examples are correctly labeled). Recently I came across the similiar problem and this is what I came across during my research:
In old days before deep learning and at the begining of deep learning era - the common way to deal with that is to smooth your output with some kind of a probability model which would take into account the possibility of a noisy labels (you could read about this in a Learning to Label from Noisy Data chapter from this book. It's important to discriminate this probabilistic models from models used to smooth your labels w.r.t. to image or label structure like classical CRFs for bilateral smoothing.
What we finally used (and worked really well) is the Channel Inhibited Softmax idea from this paper. In terms of a mathematical properties - it makes your network much more robust to some objects not labeled - because it makes your network to output much higher positive valued logits at correctly labeled objects.
You could treat this as a semi-supervised problem. Use the full dataset without labels to train a bottleneck autoencoder structure (or a GAN approach). This pretrained model can then be adjusted (e.g. removing the last layers, adding a better layer structure at the end on top of the bottleneck features) and finetuned on the labeled data.

Convergence and regularization in linear regression classifier

I am trying to implement a binary classifier using logistic regression for data drawn from 2 point sets (classes y (-1, 1)). As seen below, we can use the parameter a to prevent overfitting.
Now I am not sure, how to choose the "good" value for a.
Another thing I am not sure about is how to choose a "good" convergence criterion for this sort of problem.
Value of 'a'
Choosing "good" things is a sort of meta-regression: pick any value for a that seems reasonable. Run the regression. Try again with a values larger and smaller by a factor of 3. If either works better than the original, try another factor of 3 in that direction -- but round it from 9x to 10x for readability.
You get the idea ... play with it until you get in the right range. Unless you're really trying to optimize the result, you probably won't need to narrow it down much closer than that factor of 3.
Data Set Partition
ML folks have spent a lot of words analysing the best split. The optimal split depends very much on your data space. As a global heuristic, use half or a bit more for training; of the rest, no more than half should be used for testing, the rest for validation. For instance, 50:20:30 is a viable approximation for train:test:validate.
Again, you get to play with this somewhat ... except that any true test of the error rate would be entirely new data.
Convergence
This depends very much on the characteristics of your empirical error space near the best solution, as well as near local regions of low gradient.
The first consideration is to choose an error function that is likely to be convex and have no flattish regions. The second is to get some feeling for the magnitude of the gradient in the region of a desired solution (normalizing your data will help with this); use this to help choose the convergence radius; you might want to play with that 3x scaling here, too. The final one is to play with the learning rate, so that it's scaled to the normalized data.
Does any of this help?

How to fit a classifier with high accuracy on the training set with low features?

I have input (r,c) in range (0, 1] as the coordinate of a pixel of an image and its color 1 or 2 only.
I have about 6,400 pixels.
My attempt of fitting X=(r,c) and y=color was a failure the accuracy won't go higher than 70%.
Here's the image:
The first is the actual image, the 2nd is the image I use to train on, it has only 2 colors. The last is the image that the neural network generated with about 500 weights training with 50 iterations. Input Layer is 2, one hidden layer of size 100, and the output layer is 2. (for binary classification like this, I may need only one output layer but I am just preparing for multi-class classification)
The classifier failed to fit the training set, why is that? I tried generating high polynomial terms of those 2 features but it doesn't help. I tried using Gaussian kernel and random 20-100 landmarks on the picture to add more features, also got similar output. I tried using logistic regressions, doesn't help.
Please help me increase the accuracy.
Here's the input:input.txt (you can load it into Octave the variable is coordinate (r,c features) and idx (color)
You can try plotting it first to make sure that you understand the input then try training on it and tell me if you get better result.
Your problem is hard to model. You are trying to fit function from R^2 to R, which has lots of complexity - lots of "spikes", lots of discontinuous regions (pixels that are completely separated from the rest). This is not an easy problem, and not usefull one.. In order to overfit your network to such setting you will need plenty of hidden units. Thus, what are the options to do so?
General things that are missing in the question, and are important
Your output variable should be {0, 1} if you are fitting your network through cross entropy cost (log likelihood), which you should use for classification.
50 iteraions (if you are talking about some mini-batch iteraions) is orders of magnitude to small, unless you mean 50 epochs (iterations over whole training set).
Actual things, that will probably need to be done (at least one of the below):
I assume that you are using ReLU activations (or Tanh, hard to say looking at the output) - you can instead use RBF activations, and increase number of hidden neurons to ~5000,
If you do not want to go with RBFs, then you will need 1-2 additional hidden layers to fit function of this complexity. Try architecture of type 100-100-100 instaed.
If the above fails - increase number of hidden units, that's all you need - enough capacity.
In general: neural networks are not designed for working with low dimensional datasets. This is nice example from the web, that you can learn pix-pos to color mapping, but it is completely artificial and seems to actually harm people intuitions.

One-class Support Vector Machine Sensitivity Drops when the number of training sample increase

I am using One-Class SVM for outlier detections. It appears that as the number of training samples increases, the sensitivity TP/(TP+FN) of One-Class SVM detection result drops, and classification rate and specificity both increase.
What's the best way of explaining this relationship in terms of hyperplane and support vectors?
Thanks
The more training examples you have, the less your classifier is able to detect true positive correctly.
It means that the new data does not fit correctly with the model you are training.
Here is a simple example.
Below you have two classes, and we can easily separate them using a linear kernel.
The sensitivity of the blue class is 1.
As I add more yellow training data near the decision boundary, the generated hyperplane can't fit the data as well as before.
As a consequence we now see that there is two misclassified blue data point.
The sensitivity of the blue class is now 0.92
As the number of training data increase, the support vector generate a somewhat less optimal hyperplane. Maybe because of the extra data a linearly separable data set becomes non linearly separable. In such case trying different kernel, such as RBF kernel can help.
EDIT: Add more informations about the RBF Kernel:
In this video you can see what happen with a RBF kernel.
The same logic applies, if the training data is not easily separable in n-dimension you will have worse results.
You should try to select a better C using cross-validation.
In this paper, the figure 3 illustrate that the results can be worse if the C is not properly selected :
More training data could hurt if we did not pick a proper C. We need to
cross-validate on the correct C to produce good results

Resources