One-class Support Vector Machine Sensitivity Drops when the number of training sample increase - machine-learning

I am using One-Class SVM for outlier detections. It appears that as the number of training samples increases, the sensitivity TP/(TP+FN) of One-Class SVM detection result drops, and classification rate and specificity both increase.
What's the best way of explaining this relationship in terms of hyperplane and support vectors?
Thanks

The more training examples you have, the less your classifier is able to detect true positive correctly.
It means that the new data does not fit correctly with the model you are training.
Here is a simple example.
Below you have two classes, and we can easily separate them using a linear kernel.
The sensitivity of the blue class is 1.
As I add more yellow training data near the decision boundary, the generated hyperplane can't fit the data as well as before.
As a consequence we now see that there is two misclassified blue data point.
The sensitivity of the blue class is now 0.92
As the number of training data increase, the support vector generate a somewhat less optimal hyperplane. Maybe because of the extra data a linearly separable data set becomes non linearly separable. In such case trying different kernel, such as RBF kernel can help.
EDIT: Add more informations about the RBF Kernel:
In this video you can see what happen with a RBF kernel.
The same logic applies, if the training data is not easily separable in n-dimension you will have worse results.
You should try to select a better C using cross-validation.
In this paper, the figure 3 illustrate that the results can be worse if the C is not properly selected :
More training data could hurt if we did not pick a proper C. We need to
cross-validate on the correct C to produce good results

Related

Data normalization Convolutional Autoencoders

Iam a little bit confused about how to normalize/standarize image pixel values before training a convolutional autoencoder. The goal is to use the autoencoder for denoising, meaning that my traning images consists of noisy images and the original non-noisy images used as ground truth.
To my knowledge there are to options to pre-process the images:
- normalization
- standarization (z-score)
When normalizing using the MinMax approach (scaling between 0-1) the network works fine, but my question here is:
- When using the min max values of the training set for scaling, should I use the min/max values of the noisy images or of the ground truth images?
The second thing I observed when training my autoencoder:
- Using z-score standarization, the loss decreases for the two first epochs, after that it stops at about 0.030 and stays there (it gets stuck). Why is that? With normalization the loss decreases much more.
Thanks in advance,
cheers,
Mike
[Note: This answer is a compilation of the comments above, for the record]
MinMax is really sensitive to outliers and to some types of noise, so it shouldn't be used it in a denoising application. You can use quantiles 5% and 95% instead, or use z-score (for which ready-made implementations are more common).
For more realistic training, normalization should be performed on the noisy images.
Because the last layer uses sigmoid activation (info from your comments), the network's outputs will be forced between 0 and 1. Hence it is not suited for an autoencoder on z-score-transformed images (because target intensities can take arbitrary positive or negative values). The identity activation (called linear in Keras) is the right choice in this case.
Note however that this remark on activation only concerns the output layer, any activation function can be used in the hidden layers. Rationale: negative values in the output can be obtained through negative weights multiplying the ReLU output of hidden layers.

Why does having too many principal components for handwritten digits classification result in less accuracy

I'm currently using PCA to do handwritten digits recognition for MNIST database (each digit has about 1000 observations and 784 features). One thing I have found confusing is that the accuracy is the highest when it has 40 PCs. If the number of PCs grows from this point, the accuracy starts to drop continuously.
From my understanding of PCA, I thought the more components I have, the better I can describe a dataset. Why does the accuracy becomes less if I have too many PCs?
In order to identify the optimum number of components, you need to plot the elbow curve
https://en.wikipedia.org/wiki/Elbow_method_(clustering)
The idea behind PCA is to reduce the dimensionality of the data by finding the principal components.
Lastly, I do not think that PCA can overfit the data as it is not a learning/ fitting algorithm.
You are just trying to project the data based on eigen-vectors to capture most of the variance along an axis.
This video should help: https://www.youtube.com/watch?v=_UVHneBUBW0

Logistic Regression is sensitive to outliers? Using on synthetic 2D dataset

I am currently using sklearn's Logistic Regression function to work on a synthetic 2d problem. The dataset is shown as below:
I'm basic plugging the data into sklearn's model, and this is what I'm getting (the light green; disregard the dark green):
The code for this is only two lines; model = LogisticRegression(); model.fit(tr_data,tr_labels). I've checked the plotting function; that's fine as well. I'm using no regularizer (should that affect it?)
It seems really strange to me that the boundaries behave in this way. Intuitively I feel they should be more diagonal, as the data is (mostly) located top-right and bottom-left, and from testing some things out it seems a few stray datapoints are what's causing the boundaries to behave in this manner.
For example here's another dataset and its boundaries
Would anyone know what might be causing this? From my understanding Logistic Regression shouldn't be this sensitive to outliers.
Your model is overfitting the data (The decision regions it found perform indeed better on the training set than the diagonal line you would expect).
The loss is optimal when all the data is classified correctly with probability 1. The distances to the decision boundary enter in the probability computation. The unregularized algorithm can use large weights to make the decision region very sharp, so in your example it finds an optimal solution, where (some of) the outliers are classified correctly.
By a stronger regularization you prevent that and the distances play a bigger role. Try different values for the inverse regularization strength C, e.g.
model = LogisticRegression(C=0.1)
model.fit(tr_data,tr_labels)
Note: the default value C=1.0 corresponds already to a regularized version of logistic regression.
Let us further qualify why logistic regression overfits here: After all, there's just a few outliers, but hundreds of other data points. To see why it helps to note that
logistic loss is kind of a smoothed version of hinge loss (used in SVM).
SVM does not 'care' about samples on the correct side of the margin at all - as long as they do not cross the margin they inflict zero cost. Since logistic regression is a smoothed version of SVM, the far-away samples do inflict a cost but it is negligible compared to the cost inflicted by samples near the decision boundary.
So, unlike e.g. Linear Discriminant Analysis, samples close to the decision boundary have disproportionately more impact on the solution than far-away samples.

How to fit a classifier with high accuracy on the training set with low features?

I have input (r,c) in range (0, 1] as the coordinate of a pixel of an image and its color 1 or 2 only.
I have about 6,400 pixels.
My attempt of fitting X=(r,c) and y=color was a failure the accuracy won't go higher than 70%.
Here's the image:
The first is the actual image, the 2nd is the image I use to train on, it has only 2 colors. The last is the image that the neural network generated with about 500 weights training with 50 iterations. Input Layer is 2, one hidden layer of size 100, and the output layer is 2. (for binary classification like this, I may need only one output layer but I am just preparing for multi-class classification)
The classifier failed to fit the training set, why is that? I tried generating high polynomial terms of those 2 features but it doesn't help. I tried using Gaussian kernel and random 20-100 landmarks on the picture to add more features, also got similar output. I tried using logistic regressions, doesn't help.
Please help me increase the accuracy.
Here's the input:input.txt (you can load it into Octave the variable is coordinate (r,c features) and idx (color)
You can try plotting it first to make sure that you understand the input then try training on it and tell me if you get better result.
Your problem is hard to model. You are trying to fit function from R^2 to R, which has lots of complexity - lots of "spikes", lots of discontinuous regions (pixels that are completely separated from the rest). This is not an easy problem, and not usefull one.. In order to overfit your network to such setting you will need plenty of hidden units. Thus, what are the options to do so?
General things that are missing in the question, and are important
Your output variable should be {0, 1} if you are fitting your network through cross entropy cost (log likelihood), which you should use for classification.
50 iteraions (if you are talking about some mini-batch iteraions) is orders of magnitude to small, unless you mean 50 epochs (iterations over whole training set).
Actual things, that will probably need to be done (at least one of the below):
I assume that you are using ReLU activations (or Tanh, hard to say looking at the output) - you can instead use RBF activations, and increase number of hidden neurons to ~5000,
If you do not want to go with RBFs, then you will need 1-2 additional hidden layers to fit function of this complexity. Try architecture of type 100-100-100 instaed.
If the above fails - increase number of hidden units, that's all you need - enough capacity.
In general: neural networks are not designed for working with low dimensional datasets. This is nice example from the web, that you can learn pix-pos to color mapping, but it is completely artificial and seems to actually harm people intuitions.

svm conceptual query

I have some basic conceptual queries on SVM - it will be great if any one can guide me on this. I have been studying books and lectures for a while but have not been able to get answers for these queries correctly
Suppose I have m featured data points - m > 2. How will I know if the data points are linearly separable or not?. If I have understood correctly, linearly separable data points - will not need any special kernel for finding the hyper plane as there is no need to increase the dimension.
Say, I am not sure whether the data is linearly separable or not. I try to get a hyper plane with linear kernel, once with slackness and once without slackness on the lagrange multipliers. What difference will I see on the error rates on training and test data for these two hyper planes. If I understood correctly, if the data is not linearly separable, and if I am not using slackness then there cannot be any optimal plane. If that is the case, should the svm algorithm give me different hyper planes on different runs. Now when I introduce slackness - should I always get the same hyper plane, every run ? And how exactly can I find out from the lagrange multipliers of a hyper plane, whether the data was linearly separable or not.
Now say from 2 I came to know somehow that the data was not linearly separable at m dimensions. So I will try to increase the dimensions and see if it is separable at a higher dimension. How do I know how high I will need to go ? I know the calculations do not go into that space - but is there any way to find out from 2 what should be the best kernel for 3 (i.e I want to find a linearly separating hyper plane).
What is the best way to visualize hyper planes and data points in Matlab where the feature dimensions can be as big as 60 - and the hyperplane is at > 100 dimensions (i,e data points in few hundreds and using Gaussian Kernels the feature vector changes to > 100 dimensions).
I will really appreciate if someone clears these doubts
Regards
I'm going to try to focus on your questions (1), (2) and (3). In practice the most important concern is not if the problem becomes linearly separable but how well the classifier performs on unseen data (i.e. how well it classifies). It seems you want to find a good kernel for which data is linearly separable, and you will always be able to do this (consider putting at each training point an extremely narrow gaussian RBF), but what you really want is good performance on unseen data. That being said:
If the problem is not linearly separable and not using slacks the optimization will fail. It depends on the implementation and the specific optimization algorithm how it fails, does it not converge?, does it not find a descent direction? does it run into numerical difficulties? Even if you wanted to determine cases with slacks, you can still run into numerical difficulties that would make and that alone would make your algorithm of linear separability unreliable
How high do you need to go? Well that is a fundamental question. It is called the problem of data representation. For straight forward solutions people use held out data (people don't care about linear separability they care about good performance on held out data) and do parameter search (for example an RBF kernel can is strictly more expressive than a linear kernel) under the correct gammas. So the problem becomes finding a good gamma for your data. See for example this paper: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.141.880
I don't think there is a trivial connection between the values of the lagrangian multipliers and linear separability. You can try an high alphas whose value is C, but I'm not sure you'll be able to say much.

Resources