Using a Convolutional Neural Network as a Binary Classifier - image-processing

Given any image, I want my classifier to tell whether it is a sunflower or not. How can I go about creating the second class? Putting the set of all possible images minus {Sunflower} into the second class is overkill. Is there any research in this direction? Currently my classifier uses a neural network in the final layer. I have based it on the following tutorial:
https://github.com/torch/tutorials/tree/master/2_supervised
I am using 254x254 images as the input.
Would an SVM help in the final layer? I am also open to using any other classifier/features that might help here.

The standard approach in ML is:
1) Build a model.
2) Train it on some data with positive/negative examples (start with a 50/50 split of pos/neg in the training set).
3) Validate it on a test set (again, try a 50/50 split of pos/neg examples in the test set).
If the results are not fine:
a) Try a different model.
b) Get more data.
For case (b), when deciding which additional data you need, the rule of thumb that works nicely for me is:
1) If the classifier gives lots of false positives (says an image is a sunflower when it is actually not a sunflower at all), get more negative examples.
2) If the classifier gives lots of false negatives (says an image is not a sunflower when it actually is one), get more positive examples.
Generally, start with a reasonable amount of data, check the results, and if the results on the training set or test set are bad, get more data. Stop collecting data once the results are good enough.
Another thing you need to consider: if your results with the current data and current classifier are not good, you need to understand whether the problem is high bias (bad results on both the training set and the test set) or high variance (good results on the training set but bad results on the test set). If you have a high-bias problem, more data or a more powerful classifier will definitely help. If you have a high-variance problem, a more powerful classifier is not needed; you need to think about generalization instead: introduce regularization, or maybe remove a couple of layers from your ANN. Another possible way of fighting high variance is getting much, MUCH more data.
So to sum up, use an iterative approach and increase the amount of data step by step until you get good results. There is no magic-wand classifier, and there is no simple answer to how much data you should use.
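As an illustration of the bias/variance check above, here is a minimal sketch (assuming scikit-learn; synthetic features stand in for your real image features) that compares training and validation accuracy:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Stand-in for your real feature matrix and binary labels (sunflower / not sunflower).
X, y = make_classification(n_samples=1000, n_features=50, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_train, y_train)
print("train accuracy:      %.3f" % clf.score(X_train, y_train))
print("validation accuracy: %.3f" % clf.score(X_val, y_val))

# Rough reading, following the answer above:
#   low train AND low validation accuracy -> high bias (stronger model / better features)
#   high train but much lower validation  -> high variance (regularize, or get much more data)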

It is a good idea to use the CNN as a feature extractor: peel off the original fully connected layer that was used for classification and add a new classifier on top. This is also known as transfer learning, a technique that has been widely used in the Deep Learning research community. For your problem, using a one-class SVM as the added classifier is a good choice.
Specifically,
a good CNN feature extractor can be trained on a large dataset, e.g. ImageNet,
the one-class SVM can then be trained using your 'sunflower' dataset.
The essential part of solving your problem is the implementation of the one-class SVM, which is also known as anomaly detection or novelty detection. You may refer to http://scikit-learn.org/stable/modules/outlier_detection.html for some insight into the method.
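To make the added-classifier step concrete, here is a minimal sketch (assuming scikit-learn; random vectors stand in for the CNN features, which you would extract with your own network):

import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
sunflower_features = rng.normal(loc=1.0, size=(200, 512))  # stand-in for CNN features of sunflowers
query_features = rng.normal(loc=0.0, size=(10, 512))       # stand-in for features of new images

# nu roughly bounds the fraction of training points treated as outliers.
ocsvm = OneClassSVM(kernel="rbf", nu=0.1, gamma="scale").fit(sunflower_features)

# predict() returns +1 for "looks like the training class" and -1 for anything else.
print(ocsvm.predict(query_features))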

Related

How to actually use a validation set when using support vector machines in sklearn

While working with SVMs, I have seen that it is good practice to perform a three-way split of the original data set, something along the lines of, say, a 70/15/15 split.
This split would correspond to 70% for training, 15% for testing, and 15% for what is referred to as "validation."
I'm fairly clear on why this is a good practice, but I'm not sure about the nuts and bolts needed to actually perform this. Lots of online sources discuss the importance, but I can't seem to find a definite (or at least algorithmic) description of the process. For example, sklearn discusses it here but stops before giving any solid tools.
Here's my idea:
Train the algorithm using the training set
Find the error rate using the testing set
?? tweak parameters
Get the error rate again using the validation set
If anyone could point me in the direction of a good resource, I'd be grateful.
The role of the validation set in all supervised learning algorithms is to find the optimum values for the parameters of the algorithm (if there are any).
After splitting your data into training/validation/test sets, the best practice for training an algorithm is as follows:
choose initial learning parameters
train the algorithm using the training set and those parameters
get the (validation) accuracy using the validation set (the cross-validation test)
change the parameters and repeat from step 2 until you find the parameters that lead to the best validation accuracy
get the (test) accuracy using the test set, which represents the actual expected accuracy of your trained algorithm on new, unseen data
There are some more advanced approaches to performing the cross-validation test. Some libraries, such as libsvm, have them built in: k-fold cross-validation.
In k-fold cross-validation you randomly split your training data into k equally sized portions. You train on k-1 portions and validate on the remaining portion. You do this k times with different subsets and finally take the average.
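Here is a minimal sketch of this procedure (assuming scikit-learn; the iris data stands in for your own). GridSearchCV performs steps 2-4 with k-fold cross-validation over a parameter grid, and the held-out test set gives the final estimate from step 5:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1, 1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)  # 5-fold cross-validation
search.fit(X_train, y_train)

print("best parameters:", search.best_params_)
print("cross-validation accuracy:", search.best_score_)
print("test accuracy (expected on unseen data):", search.score(X_test, y_test))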
Wikipedia is a good source:
http://en.wikipedia.org/wiki/Supervised_learning
http://en.wikipedia.org/wiki/Cross-validation_%28statistics%29

How to continue to train SVM based on the previous model

We all know that the objective function of an SVM is optimized iteratively. To continue training on the same dataset, we could at least store all the variables used in the iterations.
However, if we want to train on a slightly different dataset, what should we do to make full use of the previously trained model? Or does this kind of idea even make sense? I think it is quite reasonable if we train a K-means model, but I am not sure whether it still makes sense for the SVM problem.
There is some literature on this topic:
Alpha-seeding, in which the training data is divided into chunks. After you train an SVM on the i-th chunk, you take the resulting alpha values and use them as the starting point when training on the (i+1)-th chunk.
Incremental SVM, a form of online learning in which you update the classifier with new examples rather than retraining on the entire data set.
The SVM heavy package, which offers online SVM training as well.
What you are describing is what an online learning algorithm does; unfortunately, the classic SVM formulation is a batch one.
However, there are several SVM solvers that produce a quasi-optimal hypothesis for the underlying optimization problem in an online learning fashion. In particular, my favourite is Pegasos-SVM, which can find a good near-optimal solution in linear time:
http://ttic.uchicago.edu/~nati/Publications/PegasosMPB.pdf
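This is not the Pegasos code itself, but a sketch in the same spirit (assuming scikit-learn): SGDClassifier with hinge loss optimizes the linear SVM objective stochastically and can be updated chunk by chunk via partial_fit, which is the online behaviour the question asks about.

import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
classes = np.unique(y)

clf = SGDClassifier(loss="hinge", alpha=1e-4, random_state=0)

# Feed the data in chunks, as it would arrive in an online setting.
for start in range(0, len(X), 200):
    clf.partial_fit(X[start:start + 200], y[start:start + 200], classes=classes)

print("accuracy on all data seen so far:", clf.score(X, y))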
In general this doesn't make sense. SVM training is an optimization process with regard to every training set vector. Each training vector has an associated coefficient, which as a result is either 0 (irrelevant) or > 0 (a support vector). Adding another training vector imposes another, different, optimization problem.
The only way I can think of to reuse information from previous training is to pick the support vectors from the previous training and add them to the new training set. I'm not sure, but this will probably affect generalization negatively: the VC dimension of an SVM is related to the number of support vectors, so adding previous support vectors to the new dataset is likely to increase the support vector count.
Apparently, there are more possibilities, as noted in lennon310's answer.
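For what it's worth, here is a minimal sketch of the support-vector-reuse idea above (assuming scikit-learn; synthetic data stands in for the old and new training sets):

import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_classification

X_old, y_old = make_classification(n_samples=500, n_features=20, random_state=0)
X_new, y_new = make_classification(n_samples=500, n_features=20, random_state=1)

old_model = SVC(kernel="rbf", gamma="scale").fit(X_old, y_old)

# support_ holds the indices of the old training points that became support vectors.
sv_X, sv_y = X_old[old_model.support_], y_old[old_model.support_]

# Retrain on the new data plus the old support vectors instead of the full old set.
new_model = SVC(kernel="rbf", gamma="scale").fit(
    np.vstack([X_new, sv_X]), np.concatenate([y_new, sv_y]))
print("support vectors in the new model:", len(new_model.support_))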

progressive random forest?

I am considering using a random forest for a classification problem. The data comes in sequences. I plan to use the first N (500) samples to train the classifier, then use the classifier to classify the data that comes after that. It will make mistakes, and those mistakes can sometimes be recorded.
My question is: can I use that misclassified data to retrain the original classifier, and how? If I simply add the misclassified samples to the original training set of size N, their importance will be exaggerated because the correctly classified ones are ignored. Do I have to retrain the classifier using all the data? What other classifiers can do this kind of learning?
What you describe is a basic version of the Boosting meta-algorithm.
It works best if your underlying learner has a natural way to handle sample weights. I have not tried boosting random forests (generally boosting is used on individual shallow decision trees with a depth limit between 1 and 3); it might work, but it will likely be very CPU intensive.
Alternatively, you can train several independent boosted decision stumps in parallel with different PRNG seed values and then aggregate the final decision function as you would with a random forest (e.g. voting or averaging class probability assignments).
If you are using Python, you should have a look at the scikit-learn documentation on the topic.
Disclaimer: I am a scikit-learn contributor.
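A minimal sketch of the boosting setup described above (assuming scikit-learn; AdaBoost's default base learner is a depth-1 decision tree, i.e. a stump):

from sklearn.ensemble import AdaBoostClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Boost 200 decision stumps; each round re-weights the samples the previous
# rounds got wrong, which is the behaviour the question is after.
boosted = AdaBoostClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
print("test accuracy:", boosted.score(X_test, y_test))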
Here is my understanding of your problem.
You have a dataset and you create two subsets from it, say a training dataset and an evaluation dataset. How can you use the evaluation dataset to improve classification performance?
The point of this problem isn't to find a better classifier as such, but to find a good way of evaluating, and then to have a good classifier in the production environment.
Evaluation purpose
Since the evaluation dataset has been set aside for evaluation, there is no way to use it for this. You must use another scheme for training and evaluation.
A common way to do this is cross-validation:
Shuffle the samples in your dataset and create ten partitions from the initial dataset. Then do ten iterations of the following:
take all partitions except the n-th for training and evaluate on the n-th.
After this, take the median of the errors over the ten runs.
This gives you the error rate of your classifier.
The worst of the ten runs gives you the worst case.
Production purpose
(no more evaluation)
You no longer care about evaluation, so take all the samples in your dataset and give them to your classifier for training (re-run a complete, simple training). The result can be used in the production environment, but it can no longer be evaluated with any of your data. The result is at least as good as the worst case measured on the previous partitions.
Flow sample processing
(production or learning)
When you are in a flow where new samples are produced over time, you will face cases where some samples correct earlier errors. This is the desired behaviour, because we want the system to improve itself. If you simply correct the erroneous leaves in place, after some time your classifier will have nothing in common with the original random forest: you will be doing a form of greedy learning, like a meta tabu search. Clearly we don't want this.
If we try to reprocess the whole dataset plus the new sample every time a new sample becomes available, we will suffer from terrible latency. The solution is, much like with humans, to let a background process run occasionally (when the service has low usage) in which all the data gets a complete re-learning, and at the end to swap the old and new classifiers.
Sometimes the quiet period is too short for a complete re-learning, so you have to distribute the computation over a cluster of nodes. That costs a lot of development effort, because you will probably need to rewrite the algorithms, but by then you already have the biggest computer you could have found.
Note: the swap process is very important to master. You should already have it in your production plan. What do you do if you want to change algorithms? Backups? Benchmarks? Power cuts? etc.
I would simply add the new data and retrain the classifier periodically if it weren't too expensive.
A simple way to keep things in balance is to add weights.
If you weight all positive samples by 1/n_positive and all negative samples by 1/n_negative (including all the new negative samples you're getting), then you don't have to worry about the classifier getting out of balance.
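A minimal sketch of that weighting (assuming a scikit-learn random forest; synthetic, imbalanced data stands in for yours):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

# 90% negatives, 10% positives to mimic an unbalanced stream.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

n_pos, n_neg = np.sum(y == 1), np.sum(y == 0)
sample_weight = np.where(y == 1, 1.0 / n_pos, 1.0 / n_neg)  # each class gets equal total weight

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y, sample_weight=sample_weight)
print("training accuracy:", clf.score(X, y))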

Can one conduct training an SVM with detected false positives iteratively?

I'm working on a machine learning problem in image processing. I want to get the location of an object in an image by using Histogram of Oriented Gradients (HOG) and a support vector machine (SVM). I've read a couple of articles and tutorials about training the SVM. The setup is pretty standard. I have labeled positive training images and now need to generate a set of negative training samples.
In the literature, the approach of generating negative training samples by randomly choosing positions is found very often. I've also seen approaches where, in a step following the random choice of negative samples, the false positives of a detection run are used as negative training samples once again.
However, I'm wondering whether one could not use this approach from the very start: generate only one negative training sample randomly, run the detection, and put the false positives back into the negative training set. This seems like quite an obvious strategy to me, but I wonder if I'm missing something.
The theory behind this method is laid out in Object Detection with Discriminatively Trained Part Based Models by P. Felzenszwalb, R. Girshick, D. McAllester, D. Ramanan in their PAMI paper. In essence, your starting negative set does not matter, you will always converge to the same classifier if you iteratively add hard samples (with an SVM margin > -1). Starting with a single negative would simply make this convergence slower.
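This is not the paper's implementation, just a rough sketch of the iterative hard-negative loop it describes (assuming a scikit-learn linear SVM; random vectors stand in for HOG windows, and the score threshold of -1 follows the margin criterion mentioned above):

import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.RandomState(0)
pos = rng.normal(loc=1.0, size=(200, 100))         # stand-in for positive HOG windows
neg_pool = rng.normal(loc=0.0, size=(5000, 100))   # large pool of candidate negatives

neg = neg_pool[rng.choice(len(neg_pool), 200, replace=False)]  # small random initial set
for round_ in range(5):
    X = np.vstack([pos, neg])
    y = np.concatenate([np.ones(len(pos)), -np.ones(len(neg))])
    clf = LinearSVC(C=0.01).fit(X, y)

    scores = clf.decision_function(neg_pool)
    hard = neg_pool[scores > -1]                   # negatives inside the margin are "hard"
    neg = np.unique(np.vstack([neg, hard]), axis=0)
    print("round %d: %d negatives in the training set" % (round_, len(neg)))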
To me it sounds like you want to train the SVM classifier online/incrementally, i.e. updating the classifier with new samples. Such methods are generally only used if new data comes available over time. In your case it seems that you can generate a whole set of negative training samples, so there would be no need to train it incrementally. I'm inclined to say that training the classifier in one run will be better than doing this incrementally (as hinted at by larsmans).
(Again, I'm not an image processing specialist, so take this with a grain of salt.)
I'm wondering if one could not use this approach generally from the start.
You'd need some way to detect the false positives from a classification run. To do so, you need a ground truth, that is, you need a human in the loop. In effect, you'd be doing active learning. If that's what you want to do, you could just as well start with a bunch of hand-labeled negative examples.
Alternatively, you could set this up as a PU learning problem. I have no idea whether that works well with images, but for text classification, it sometimes works.

How to approach machine learning problems with high dimensional input space?

How should I approach a situation where I try to apply some ML algorithm (classification, to be more specific, an SVM in particular) to some high-dimensional input, and the results I get are not quite satisfactory?
Data in 1, 2, or 3 dimensions can be visualized, along with the algorithm's results, so you can get a feel for what's going on and have some idea of how to approach the problem. Once the data has more than 3 dimensions, other than intuitively playing around with the parameters, I am not really sure how to attack it.
What do you do to the data? My answer: nothing. SVMs are designed to handle high-dimensional data. I'm working on a research problem right now that involves supervised classification using SVMs. Along with finding sources on the Internet, I did my own experiments on the impact of dimensionality reduction prior to classification. Preprocessing the features using PCA/LDA did not significantly increase classification accuracy of the SVM.
To me, this totally makes sense from the way SVMs work. Let x be an m-dimensional feature vector. Let y = Ax where y is in R^n and x is in R^m for n < m, i.e., y is x projected onto a space of lower dimension. If the classes Y1 and Y2 are linearly separable in R^n, then the corresponding classes X1 and X2 are linearly separable in R^m. Therefore, the original subspaces should be "at least" as separable as their projections onto lower dimensions, i.e., PCA should not help, in theory.
Here is one discussion that debates the use of PCA before SVM: link
What you can do is change your SVM parameters. For example, with libsvm link, the parameters C and gamma are crucially important to classification success. The libsvm faq, particularly this entry link, contains more helpful tips. Among them:
Scale your features before classification.
Try to obtain balanced classes. If impossible, then penalize one class more than the other. See more references on SVM imbalance.
Check the SVM parameters. Try many combinations to arrive at the best one.
Use the RBF kernel first. It almost always works best (computationally speaking).
Almost forgot... before testing, cross validate!
EDIT: Let me just add this "data point." I recently did another large-scale experiment using the SVM with PCA preprocessing on four exclusive data sets. PCA did not improve the classification results for any choice of reduced dimensionality. The original data with simple diagonal scaling (for each feature, subtract mean and divide by standard deviation) performed better. I'm not making any broad conclusion -- just sharing this one experiment. Maybe on different data, PCA can help.
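A minimal sketch combining several of those tips (assuming scikit-learn rather than the libsvm command-line tools): scale the features, balance the classes, use the RBF kernel, and cross-validate over C and gamma.

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=100, weights=[0.8, 0.2], random_state=0)

# Scaling lives inside the pipeline so it is re-fit on each cross-validation fold.
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf", class_weight="balanced"))
grid = {"svc__C": [0.1, 1, 10, 100], "svc__gamma": [0.001, 0.01, 0.1, 1]}
search = GridSearchCV(pipe, grid, cv=5).fit(X, y)
print(search.best_params_, search.best_score_)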
Some suggestions:
Project data (just for visualization) to a lower-dimensional space (using PCA or MDS or whatever makes sense for your data)
Try to understand why learning fails. Do you think it overfits? Do you think you have enough data? Is it possible there isn't enough information in your features to solve the task you are trying to solve? There are ways to answer each of these questions without visualizing the data.
Also, if you tell us what the task is and what your SVM output is, there may be more specific suggestions people could make.
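For the first suggestion, a minimal sketch (assuming scikit-learn and matplotlib; the digits dataset stands in for your own high-dimensional data) of projecting to 2D purely for inspection:

import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, y = load_digits(return_X_y=True)            # 64-dimensional stand-in data
X_2d = PCA(n_components=2).fit_transform(X)    # projection used for visualization only

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, s=5, cmap="tab10")
plt.title("2D PCA projection (visualization only, not for training)")
plt.show()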
You can try reducing the dimensionality of the problem with PCA or a similar technique. Beware that PCA has two important caveats: (1) it assumes that the data it is applied to is normally distributed, and (2) the resulting data loses its natural meaning (resulting in a black box). If you can live with that, try it.
Another option is to try several parameter selection algorithms. Since SVMs were already mentioned here, you might try the approach of Chang and Li (Feature Ranking Using Linear SVM), in which they used a linear SVM to pre-select "interesting features" and then used an RBF-based SVM on the selected features. If you are familiar with Orange, a Python data mining library, you will be able to code this method in less than an hour. Note that this is a greedy approach which, due to its "greediness", might fail in cases where the input variables are highly correlated. In that case, and if you cannot solve the problem with PCA (see above), you might want to turn to heuristic methods, which try to select the best possible combinations of predictors. The main pitfall of this kind of approach is the high potential for overfitting. Make sure you have a set of "virgin" data that was not seen during the entire process of model building. Test your model on that data only once, after you are sure the model is ready. If it fails, don't use this data again to validate another model; you will have to find a new data set. Otherwise you won't be sure that you haven't overfitted once more.
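A rough sketch of that linear-SVM feature ranking idea (assuming scikit-learn rather than Orange; the feature count and cut-off are arbitrary):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC, SVC

X, y = make_classification(n_samples=600, n_features=200, n_informative=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Rank features by the magnitude of the linear SVM weights and keep the top 20.
ranker = LinearSVC(C=0.1, max_iter=5000).fit(X_train, y_train)
top = np.argsort(np.abs(ranker.coef_[0]))[::-1][:20]

# Train an RBF SVM on the selected "interesting" features only.
rbf = SVC(kernel="rbf", gamma="scale").fit(X_train[:, top], y_train)
print("test accuracy on selected features:", rbf.score(X_test[:, top], y_test))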
Oh, and one more thing about SVMs: an SVM is a black box. You would do better to figure out what mechanism generates the data and to model the mechanism rather than the data. On the other hand, if that were possible, most probably you wouldn't be here asking this question (and I wouldn't be so bitter about overfitting).
List of selected papers on parameter selection
Feature selection for high-dimensional genomic microarray data
Wrappers for feature subset selection
Parameter selection in particle swarm optimization
I worked in the laboratory that developed this stochastic method to determine, in silico, the drug-like character of molecules.
I would approach the problem as follows:
What do you mean by "the results I get are not quite satisfactory"?
If the classification rate on the training data is unsatisfactory, it implies that either:
You have outliers in your training data (data that is misclassified). In this case you can try algorithms such as RANSAC to deal with them.
Your model (the SVM in this case) is not well suited to the problem. This can be diagnosed by trying other models (AdaBoost etc.) or adding more parameters to your current model.
The representation of the data is not well suited to your classification task. In this case, preprocessing the data with feature selection or dimensionality reduction techniques would help.
If the classification rate on the test data is unsatisfactory, it implies that your model overfits the data:
Either your model is too complex (too many parameters) and needs to be constrained further,
Or you trained it on a training set that is too small and you need more data.
Of course it may be a mixture of the above. These are all "blind" methods for attacking the problem. To gain more insight into it, you may use visualization methods, projecting the data into lower dimensions, or look for models better suited to the problem domain as you understand it (for example, if you know the data is normally distributed you can use GMMs to model it...).
If I'm not wrong, you are trying to see which parameters for the SVM give you the best result. Your problem is model/curve fitting.
I worked on a similar problem a couple of years ago. There are tons of libraries and algorithms for doing this. I used the Newton-Raphson algorithm and a variation of a genetic algorithm to fit the curve.
Generate/guess/obtain the result you are hoping for through a real-world experiment (or, if you are doing simple classification, just do it yourself). Compare this with the output of your SVM. The algorithms I mentioned earlier iterate this process until the result of your model (the SVM in this case) roughly matches the expected values (note that this process can take some time depending on your problem/data size; it took about two months for me on a 140-node Beowulf cluster).
If you choose to go with Newton-Raphson's, this might be a good place to start.
