SVM hard margin: why may an imbalanced dataset cause bad results? - machine-learning

I can understand why soft-margin SVMs are subject to an imbalanced training set: minimizing the error term of the optimization problem can drive the classifier to label all training data as negative (if |negative examples| >> |positive examples|).
But in hard-margin SVM there are no slack variables and no C constant, so I am not minimizing an error term, because hard-margin SVM expects no errors (by the definition of the problem)! Hard-margin SVM just searches for the support vectors and maximizes the margin between the class support hyperplanes "identified" by the support vectors. Now, if I have a lot of points (or as many as there are positive points) "behind" the negative support vectors (i.e. behind the negative class support hyperplane), these do not affect my margin or separating hyperplane;
it stays the same, since it depends only on the support vectors, and those stay the same regardless of how many points I add. So why are hard-margin SVMs subject to imbalanced datasets, or where is my reasoning wrong?
Thanks!

For a true hard-margin SVM there are two options for any data set, regardless of how it is balanced:
The training data is perfectly separable in feature space: you get a resulting model with 0 training errors.
The training data is not separable in feature space: you will not get anything (no model).
Additionally, note that you could train a hard-margin SVM on any data set given a kernel that is complex enough (RBF with a very large gamma, for instance). The resulting model is generally bad, though, as it is a total overfit of the training data.
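As an illustration of both points, here is a minimal sketch (not from the original answer) of how a hard margin is commonly approximated in scikit-learn, namely with a very large C; the data and parameter values are made up for the example:

```python
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable 2-D data: many negatives, few positives.
rng = np.random.default_rng(0)
X_neg = rng.normal(loc=[-2, -2], scale=0.5, size=(200, 2))
X_pos = rng.normal(loc=[+2, +2], scale=0.5, size=(10, 2))
X = np.vstack([X_neg, X_pos])
y = np.array([0] * 200 + [1] * 10)

# "Hard margin" approximated with a huge C: slack is punished so heavily
# that the solver behaves as if no training error were allowed.
hard_linear = SVC(kernel="linear", C=1e10).fit(X, y)
print("training errors:", (hard_linear.predict(X) != y).sum())  # 0 if separable

# An RBF kernel with a very large gamma can separate almost any training set,
# but the resulting model is usually a severe overfit.
hard_rbf = SVC(kernel="rbf", C=1e10, gamma=1e3).fit(X, y)
print("support vectors per class:", hard_rbf.n_support_)
```

Note that if the data were truly non-separable, scikit-learn's soft-margin solver would still return a model; the huge C only approximates the strict hard-margin formulation described above.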

Related

Accuracy below 50% for binary classification

I am training a Naive Bayes classifier on a balanced dataset with equal numbers of positive and negative examples. At test time I compute the accuracy in turn for the examples in the positive class, the negative class, and the subsets which make up the negative class. However, for some subsets of the negative class I get accuracy values lower than 50%, i.e. worse than random guessing. I am wondering: should I worry about these results being much lower than 50%? Thank you!
It's impossible to fully answer this question without specific details, so here instead are guidelines:
If you have a dataset with equal amounts of classes, then random guessing would give you 50% accuracy on average.
To be clear, are you certain your model has learned something on your training dataset? Is the training dataset accuracy higher than 50%? If yes, continue reading.
Assuming that your validation set is large enough to rule out statistical fluctuations, then lower than 50% accuracy suggests that something is indeed wrong with your model.
For example, are your classes accidentally switched somehow in the validation dataset? Notice that if that were the case, using 1 - model.predict(x) instead would bring your accuracy above 50%.
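A quick hedged sketch of that check, with placeholder predictions and labels standing in for the real model output (none of these names come from the original question):

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical validation labels and predictions standing in for
# y_val and model.predict(x_val); here the predictions are systematically flipped.
y_val = np.array([0, 0, 1, 1, 0, 1, 1, 0])
y_pred = np.array([1, 1, 0, 0, 1, 0, 0, 1])

acc_raw = accuracy_score(y_val, y_pred)
acc_flipped = accuracy_score(y_val, 1 - y_pred)
print(f"raw accuracy: {acc_raw:.2f}, flipped accuracy: {acc_flipped:.2f}")
# A large gap in favor of the flipped predictions suggests swapped class labels.
```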

How to fit a classifier with high accuracy on the training set with few features?

I have inputs (r, c) in the range (0, 1], the coordinates of a pixel of an image, and its color, which is 1 or 2 only.
I have about 6,400 pixels.
My attempt at fitting X=(r,c) and y=color was a failure; the accuracy won't go higher than 70%.
Here's the image:
The first is the actual image, the second is the image I train on; it has only 2 colors. The last is the image that the neural network generated after training about 500 weights for 50 iterations. The input layer has 2 units, there is one hidden layer of size 100, and the output layer has 2 units. (For binary classification like this I may need only one output unit, but I am just preparing for multi-class classification.)
The classifier failed to fit the training set; why is that? I tried generating high-order polynomial terms of those 2 features, but it doesn't help. I tried using a Gaussian kernel with 20-100 random landmarks on the picture to add more features, and also got similar output. I tried using logistic regression; it doesn't help either.
Please help me increase the accuracy.
Here's the input: input.txt (you can load it into Octave; the variables are coordinate (the r,c features) and idx (the color)).
You can try plotting it first to make sure that you understand the input, then try training on it, and tell me if you get a better result.
Your problem is hard to model. You are trying to fit a function from R^2 to R that has lots of complexity: lots of "spikes", lots of discontinuous regions (pixels that are completely separated from the rest). This is not an easy problem, and not a useful one either. In order to overfit your network to such a setting you will need plenty of hidden units. So what are the options for doing so?
General things that are missing in the question and are important:
Your output variable should be in {0, 1} if you are fitting your network with the cross-entropy cost (log likelihood), which is what you should use for classification.
50 iterations (if you are talking about mini-batch iterations) is orders of magnitude too small, unless you mean 50 epochs (iterations over the whole training set).
Things that will probably actually need to be done (at least one of the below; a rough sketch follows at the end of this answer):
I assume that you are using ReLU activations (or tanh; it is hard to say just by looking at the output). You could instead use RBF activations and increase the number of hidden neurons to ~5000.
If you do not want to go with RBFs, then you will need 1-2 additional hidden layers to fit a function of this complexity. Try an architecture of the type 100-100-100 instead.
If the above fails, increase the number of hidden units; that is all you need: enough capacity.
In general: neural networks are not designed for working with low-dimensional datasets. This is a nice example from the web showing that you can learn a pixel-position-to-color mapping, but it is completely artificial and seems to actually harm people's intuitions.
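As a hedged illustration of the deeper-architecture suggestion (this is not the answerer's code; it assumes scikit-learn and uses a synthetic stand-in for the pixel data, since input.txt is not reproduced here):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the (r, c) -> color data: ~6,400 points whose label
# depends on position in a non-trivial way.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(6400, 2))            # (r, c) in (0, 1]
y = ((np.sin(10 * X[:, 0]) + np.cos(10 * X[:, 1])) > 0).astype(int)

# Three hidden layers of 100 units (the "100-100-100" suggestion) instead of a
# single hidden layer, trained for many more epochs than 50.
clf = MLPClassifier(hidden_layer_sizes=(100, 100, 100),
                    activation="relu",
                    max_iter=2000,
                    random_state=0)
clf.fit(X, y)
print("training accuracy:", clf.score(X, y))
```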

How can I make Weka classify the smaller class, with a 2:1 class imbalance?

How can I make Weka classify the smaller class? I have a data set where the positive class is 35% of the data and the negative class is 65%. I want Weka to predict the positive class, but in some cases the resulting model predicts all instances to be the negative class. Regardless, it keeps classifying everything as the negative (larger) class. How can I force it to classify the positive (smaller) class?
One simple solution is to adjust your training set to be more balanced (50% positive, 50% negative) to encourage classification for both cases. I would guess that more of your cases are negative in the problem space, and therefore you would need to find some way to ensure that the negative cases still represent the problem well.
Since the ratio of positive to negative is 1:2, you could also try duplicating the positive cases in the training set to make it 2:2 and see how that goes.
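A minimal hedged sketch of that duplication (oversampling) idea, using scikit-learn rather than Weka; the classifier and data here are placeholders for illustration:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

# Toy imbalanced data: roughly 65% negative (0), 35% positive (1).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (rng.uniform(size=1000) < 0.35).astype(int)

# Duplicate (resample with replacement) the positive cases until the classes
# are roughly balanced, then train on the combined set.
X_pos, y_pos = X[y == 1], y[y == 1]
X_neg, y_neg = X[y == 0], y[y == 0]
X_pos_up, y_pos_up = resample(X_pos, y_pos, replace=True,
                              n_samples=len(y_neg), random_state=0)
X_bal = np.vstack([X_neg, X_pos_up])
y_bal = np.concatenate([y_neg, y_pos_up])

# DecisionTreeClassifier is used here as a rough stand-in for Weka's J48.
clf = DecisionTreeClassifier(random_state=0).fit(X_bal, y_bal)
```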
Use stratified sampling (e.g. train on a 50%/50% sample) or class weights/class priors. It would also help greatly if you told us which specific classifier you are using; Weka seems to have at least 50.
Is the penalty for Type I errors equal to the penalty for Type II errors?
This is a special case of receiver operating characteristic (ROC) analysis.
If the penalties are not equal, experiment with the cutoff value and the AUC.
You probably also want to read the sister site Cross Validated for the statistics side.
Use CostSensitiveClassifier, which is available under the "meta" classifiers.
You will need to change "classifier" to your J48 and (!) change the cost matrix
to something like [(0,1), (2,0)]. This tells J48 that misclassification of a positive instance is twice as costly as misclassification of a negative instance. Of course, you should adjust the cost matrix according to your business values.
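For readers outside Weka, a hedged scikit-learn analogue of that 2:1 cost matrix is the class_weight parameter; this is a stand-in sketch on made-up data, not the Weka CostSensitiveClassifier setup itself:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Imbalanced toy data: class 1 (positive) is the minority (~35%).
X, y = make_classification(n_samples=1000, weights=[0.65, 0.35],
                           random_state=0)

# Weight errors on the positive class twice as heavily as on the negative
# class, mirroring the [(0,1),(2,0)] cost matrix from the answer above.
clf = DecisionTreeClassifier(class_weight={0: 1, 1: 2}, random_state=0)
clf.fit(X, y)
```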

Word2Vec: Number of Dimensions

I am using Word2Vec with a dataset of roughly 11,000,000 tokens, looking to do word similarity (as part of synonym extraction for a downstream task), but I don't have a good sense of how many dimensions I should use with Word2Vec. Does anyone have a good heuristic for the range of dimensions to consider, based on the number of tokens/sentences?
The typical interval is 100-300 dimensions. I would say you need at least 50 dimensions to achieve the lowest acceptable accuracy. If you pick a smaller number of dimensions, you will start to lose properties of high-dimensional spaces. If training time is not a big deal for your application, I would stick with 200 dimensions, as it gives nice features. Extreme accuracy can be obtained with 300 dimensions. Beyond 300 dimensions, word features won't improve dramatically, and training will be extremely slow.
I do not know a theoretical explanation or strict bounds for dimension selection in high-dimensional spaces (and there might not be an application-independent explanation for that), but I would refer you to Pennington et al., Figure 2a, where the x axis shows the vector dimension and the y axis shows the accuracy obtained. That should provide empirical justification for the argument above.
I think that the number of dimensions for word2vec depends on your application. Empirically, the most common value is about 100, and it tends to perform well.
The number of dimensions affects over/underfitting. 100-300 dimensions is the common wisdom. Start with one number and check the accuracy on your test set versus your training set. The bigger the dimension size, the easier it is to overfit the training set and get bad performance on the test set. Tuning this parameter is required: if you have high accuracy on the training set and low accuracy on the test set, the dimension size is probably too big, and reducing it might solve your model's overfitting problem.
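A hedged gensim sketch of the kind of sweep the last answer suggests (the corpus below is a tiny placeholder for the real ~11M-token corpus; vector_size is gensim 4.x's name for the dimension parameter):

```python
from gensim.models import Word2Vec

# Placeholder corpus: in practice this would be your tokenized sentences.
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
    ["cats", "and", "dogs", "are", "animals"],
] * 100

# Train at a few dimensionalities and compare on a similarity/downstream task.
for dim in (50, 100, 200, 300):
    model = Word2Vec(sentences, vector_size=dim, window=5,
                     min_count=1, workers=2, epochs=10, seed=0)
    print(dim, model.wv.similarity("cat", "dog"))
```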

One-class Support Vector Machine sensitivity drops when the number of training samples increases

I am using a One-Class SVM for outlier detection. It appears that as the number of training samples increases, the sensitivity TP/(TP+FN) of the One-Class SVM detection result drops, while the classification rate and specificity both increase.
What's the best way of explaining this relationship in terms of hyperplane and support vectors?
Thanks
The more training examples you have, the less your classifier is able to detect true positives correctly.
It means that the new data does not fit well with the model you are training.
Here is a simple example.
Below you have two classes, and we can easily separate them using a linear kernel.
The sensitivity of the blue class is 1.
As I add more yellow training data near the decision boundary, the generated hyperplane can't fit the data as well as before.
As a consequence, we now see that there are two misclassified blue data points.
The sensitivity of the blue class is now 0.92.
As the amount of training data increases, the support vectors generate a somewhat less optimal hyperplane. Perhaps, because of the extra data, a linearly separable data set becomes non-linearly separable. In such a case, trying a different kernel, such as the RBF kernel, can help.
EDIT: more information about the RBF kernel:
In this video you can see what happens with an RBF kernel.
The same logic applies: if the training data is not easily separable in n dimensions, you will have worse results.
You should try to select a better C using cross-validation.
In this paper, Figure 3 illustrates that the results can be worse if C is not properly selected:
More training data could hurt if we did not pick a proper C. We need to
cross-validate on the correct C to produce good results
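A hedged sketch of selecting C (and gamma) by cross-validation for an RBF-kernel SVM, as the answer recommends; it assumes scikit-learn and toy two-class data rather than the asker's one-class setup:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Toy imbalanced two-class data standing in for the real training set.
X, y = make_classification(n_samples=500, n_features=10, weights=[0.9, 0.1],
                           random_state=0)

# Cross-validate over C and gamma; recall (sensitivity) is the metric the
# question cares about, so use it as the selection criterion.
param_grid = {"C": [0.1, 1, 10, 100], "gamma": ["scale", 0.01, 0.1, 1]}
search = GridSearchCV(SVC(kernel="rbf", class_weight="balanced"),
                      param_grid, scoring="recall", cv=5)
search.fit(X, y)
print("best params:", search.best_params_)
print("best CV recall:", search.best_score_)
```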
