What do the features given by a feature selection method mean in a binary classifier which has a cross validation accuracy of 0? - machine-learning

So I know that given a binary classifier, the farther away you are from an accuracy of 0.5 the better your classifier is. (I.e. A binary classifier that gets everything wrong can be converted to one which gets everything right by always inverting its decisions.)
However, I have an inner feature selection process, which provides me "good" features to use (I'm trying out recursive feature elimination, and another based on Spearman's rank correlation coefficient). Given that the classifier using these "good" features gets a cross validation accuracy of 0, can I still conclude that the features selected are useful and are predictive of the class in this binary prediction problem?

To simplify, let's assume you're testing on some balanced set. Half the testing data is positive and half the testing data is negative.
I would say that something strange is happening that is flipping the sign of your decision. That classifier you're evaluating is very useful, but you would need to flip the decision it makes. You should probably check your code to make sure you're not flipping the class of the training data. Some libraries (LIBSVM for example) require that the first training example is from the positive class.
To summarize: It seems the features you're selecting are useful, but it seems you have a bug that is flipping the classes.

Related

Probabilistic classification with Gaussian Bayes Classifier vs Logistic Regression

I have a binary classification problem where I have a few great features that have the power to predict almost 100% of the test data because the problem is relatively simple.
However, as the nature of the problem requires, I have no luxury to make mistake(let's say) so instead of giving a prediction I am not sure of, I would rather have the output as probability, set a threshold and would be able to say, "if I am less than %95 sure, I will call this "NOT SURE" and act accordingly". Saying "I don't know" rather than making a mistake is better.
So far so good.
For this purpose, I tried Gaussian Bayes Classifier(I have a cont. feature) and Logistic Regression algorithms, which provide me the probability as well as the prediction for the classification.
Coming to my Problem:
GBC has around 99% success rate while Logistic Regression has lower, around 96% success rate. So I naturally would prefer to use GBC.
However, as successful as GBC is, it is also very sure of itself. The odds I get are either 1 or very very close to 1, such as 0.9999997, which makes things tough for me, because in practice GBC does not provide me probabilities now.
Logistic Regression works poor, but at least gives better and more 'sensible' odds.
As nature of my problem, the cost of misclassifying is by the power of 2 so if I misclassify 4 of the products, I lose 2^4 more (it's unit-less but gives an idea anyway).
In the end; I would like to be able to classify with a higher success than Logistic Regression, but also be able to have more probabilities so I can set a threshold and point out the ones I am not sure of.
Any suggestions?
Thanks in advance.
If you have enough data, you can simply retune the probabilities. For example, given the "predicted probability" output of your gaussian classifier, you can go back through (on a held out dataset) and at different prediction values, estimate the probability of the positive class.
Further, you can simply set up an optimization on your holdout set to determine the best threshold(without actually estimating the probability). Since it's one dimensional, you shouldn't even need to do anything fancy for optimization-- test like 500 different thresholds and pick the one which minimizes the costs associated with misclassifications.

Prediction of Stock Returns with ML Algrithm

I am working on a prediction model for stock returns over a fixed period of time (say n days). I am was hoping to gather a few ideas ahead of time. My questions are:
1) Would it be best to turn this into a classification problem, say create a dummy variable with returns larger than x%? Then I could try the entire arsenal of ML Algorithms.
2) If I don't turn it into a classification problem but use say a regression model, would it make sense or be necessary to transform the returns into logs?
Any thoughts are appreciated.
EDIT: My goal with this is relatively broadly defined, in the sense that I would simple like to improve performance of the selection process (pick positive returns and avoid negative ones)
Best under what quality? Turning it into a thresholding problem simply means translating the problem space to a much simpler one. Your problem definition is your own; you can turn it into a binary classification problem (>x or not), a multi-class classification problem (binning into ranges) or simply keep it as a prediction task. If you do the latter, you can still apply binning or classification as a post-processing step.
Classification is just a subclass of prediction. The log transformation employed by logistic regression is no more than a neat trick to turn the outputs into something that resembles a probability distribution; don't put too much thought into it. That said, applying transformations on your output is not necessarily bad (you could for instance apply some normalization to keep your output within the range of some activation function).

cross validating a train set where the class variable has a different distribution than the actual population

(noob in ML, be patient)
I want to test the performance of my scikit-learn SVMLinear classifier. My train-set has a different class distribution than the actual population, but my test-set is a representative, and distributes like the actual population.
I noticed that there's a class-weight parameter, and I want to try giving my classifier the actual population distribution, and see if it helps it perform better.
However - as my train-set distribution is different, so will be my validation set, right? So should I expect an improvement on the validation, or must I use my test-set to see the improvement? And if so - isn't it against the rules to calibrate using the test-set which will lead to burning the test-set or overfitting?
I've thought about bootstrap re-sampling of my train-set: making it distribute the same as the general population, and only then training and validating my model. Is this a good solution?
Thanks!
It seems that you have some good ideas which are mostly worth trying. The answers mostly depend on the application and the size of your train/test set.
It is against the rules to calibrate based on test set and again use the whole test set for evaluation. However, if your test set is large enough, you can always divide your test set to two sets: validation set, and actual test set. Then, your final evaluation will be based on a smaller test set, which might be still acceptable depending on the application.
For your training set that you believe it has different class distribution than the actual population, there might be several things worth trying. Usually the most acceptable approach is to use a classifier that can handle these differences (usually with fewer parameters to avoid over-fitting). There is a whole topic of classification and regression on skewed datasets that you can look through. Other than the classifier, provided that you did not derive the actual population from your test set, the methods below might help too:
1- One of them can be (as you said) bootstrap re-sampling in case that your training set is large enough for that.
2- Another approach can be generating more training samples by adding some noise to the current samples of the training set. For example if you are classifying images of birds, you can randomly make images darker or brighter, or randomly move them a few pixels to the sides or up and down (select values randomly in a small enough range). This way, you can add to the training set in a way to get the desired distribution.

Predictive features with high presence in one class

I am doing a logistic regression to predict the outcome of a binary variable, say whether a journal paper gets accepted or not. The dependent variable or predictors are all the phrases used in these papers - (unigrams, bigrams, trigrams). One of these phrases has a skewed presence in the 'accepted' class. Including this phrase gives me a classifier with a very high accuracy (more than 90%), while removing this phrase results in accuracy dropping to about 70%.
My more general (naive) machine learning question is:
Is it advisable to remove such skewed features when doing classification?
Is there a method to check skewed presence for every feature and then decide whether to keep it in the model or not?
If I understand correctly you ask whether some feature should be removed because it is a good predictor (it makes your classifier works better). So the answer is short and simple - do not remove it in fact, the whole concept is to find exactly such features.
The only reason to remove such feature would be that this phenomena only occurs in the training set, and not in real data. But in such case you have wrong data - which does not represnt the underlying data density and you should gather better data or "clean" the current one so it has analogous characteristics as the "real ones".
Based on your comments, it sounds like the feature in your documents that's highly predictive of the class is a near-tautology: "paper accepted on" correlates with accepted papers because at least some of the papers in your database were scraped from already-accepted papers and have been annotated by the authors as such.
To me, this sounds like a useless feature for trying to predict whether a paper will be accepted, because (I'd imagine) you're trying to predict paper acceptance before the actual acceptance has been issued ! In such a case, none of the papers you'd like to test your algorithm with will be annotated with "paper accepted on." So, I'd remove it.
You also asked about how to determine whether a feature correlates strongly with one class. There are three things that come to mind for this problem.
First, you could just compute a basic frequency count for each feature in your dataset and compare those values across classes. This is probably not super informative, but it's easy.
Second, since you're using a log-linear model, you can train your model on your training dataset, and then rank each feature in your model by its weight in the logistic regression parameter vector. Features with high positive weight are indicative of one class, while features with large negative weight are strongly indicative of the other.
Finally, just for the sake of completeness, I'll point out that you might also want to look into feature selection. There are many ways of selecting relevant features for a machine learning algorithm, but I think one of the most intuitive from your perspective might be greedy feature elimination. In such an approach, you train a classifier using all N features in your model, and measure the accuracy on some held-out validation set. Then, train N new models, each with N-1 features, such that each model eliminates one of the N features, and measure the resulting drop in accuracy. The feature with the biggest drop was probably strongly predictive of the class, while features that have no measurable difference can probably be omitted from your final model. As larsmans points out correctly in the comments below, this doesn't scale well at all, but it can be a useful method sometimes.

Can one conduct training an SVM with detected false positives iteratively?

I'm working on a machine learning problem in image processing. I want to get the location of an object in an image by using Histogram of Oriented Gradients (HOG) and a support vector machine (SVM). I've read a couple of articles and tutorials about training the SVM. The setup is pretty standard. I have labeled positive training images and now need to generate a set of negative training samples.
In literature, the approach to generate negative training samples by randomly choosing a position is found very often. I've also seen some approaches where in a successive step to choosing random negative samples, the false-positives of a detection are used as negative training samples once again.
However, I'm wondering if one could not use this approach generally from the start. So one would generate only one false training sample randomly, run the detection and put false-positives in the negative training set again. This seems quite an obvious strategy to me, but I wonder if I'm missing something.
The theory behind this method is laid out in Object Detection with Discriminatively Trained Part Based Models by P. Felzenszwalb, R. Girshick, D. McAllester, D. Ramanan in their PAMI paper. In essence, your starting negative set does not matter, you will always converge to the same classifier if you iteratively add hard samples (with an SVM margin > -1). Starting with a single negative would simply make this convergence slower.
To me it sounds like you want to train the SVM classifier online/incrementally, i.e. updating the classifier with new samples. Such methods are generally only used if new data comes available over time. In your case it seems that you can generate a whole set of negative training samples, so there would be no need to train it incrementally. I'm inclined to say that training the classifier in one run will be better than doing this incrementally (as hinted at by larsmans).
(Again, I'm not an image processing specialist, so take this with a grain of salt.)
I'm wondering if one could not use this approach generally from the start.
You'd need some way to detect the false positives from a classification run. To do so, you need a ground truth, that is, you need a human in the loop. In effect, you'd be doing active learning. If that's what you want to do, you could just as well start with a bunch of hand-labeled negative examples.
Alternatively, you could set this up as a PU learning problem. I have no idea whether that works well with images, but for text classification, it sometimes works.

Resources