Support vector machines for multiple object categorization - image-processing

I am trying to use linear SVMs for multi-class object category recognition. So far what I have understood is that there are mainly two approaches used: one-vs-all (OVA) and one-vs-one (OVO).
But I am having difficulty understanding the implementation. The steps that I think are used are:
First, the feature descriptors are prepared from, let's say, SIFT. So I have a 128 x N feature matrix.
Next, to prepare an SVM classifier model for a particular object category (say, car), I take 50 images of cars as the positive training set and 50 images in total of the remaining categories, taken randomly from each category (is this part correct?). I prepare such models for all categories (say 5 of them).
Next, when I have an input image, do I need to feed the image into all 5 models and then check their values (+1/-1) for each of these models? I am having difficulty understanding this part.

In the one-vs-all approach, you have to check all 5 models. Then you can take the decision with the highest confidence value. LIBSVM gives probability estimates.
In the one-vs-one approach, you can take the majority vote. For example, you test 1 vs. 2, 1 vs. 3, 1 vs. 4 and 1 vs. 5, and suppose the image is classified as 1 in 3 of those cases. You do the same for the other 4 classes. Suppose for the other four classes the vote counts are [0, 1, 1, 2]. Class 1 was obtained the most times, so it becomes the final class. In this case, you could also sum the probability estimates and take the maximum. That works unless in one pair the classification goes extremely wrong. For example, in 1 vs. 4 it classifies 4 (the true class is 1) with a confidence of 0.7. Then, just because of this one decision, the summed probability estimates may be skewed and give the wrong result. This issue can be examined experimentally.
LIBSVM uses one vs. one. You can check the reasoning here. You can also read this paper, where the authors defend the one-vs-all approach and conclude that it is not necessarily worse than one vs. one.
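For concreteness, here is a minimal sketch of the two decision rules using scikit-learn (whose SVC wraps LIBSVM); the descriptor matrix X and labels y below are hypothetical placeholders, not your actual data:

```python
import numpy as np
from sklearn.svm import LinearSVC, SVC
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier

X = np.random.rand(100, 128)          # hypothetical descriptors, one row per image
y = np.repeat(np.arange(5), 20)       # hypothetical labels for 5 categories

# One-vs-all: 5 binary classifiers; pick the class with the highest decision value.
ova = OneVsRestClassifier(LinearSVC()).fit(X, y)
scores = ova.decision_function(X[:1])      # shape (1, 5), one confidence per class
print("OvA prediction:", scores.argmax(axis=1))

# One-vs-one: 5*4/2 = 10 pairwise classifiers; the class winning most pairwise votes wins.
ovo = OneVsOneClassifier(LinearSVC()).fit(X, y)
print("OvO prediction:", ovo.predict(X[:1]))

# LIBSVM-style probability estimates (SVC with probability=True) can be summed
# instead of counting votes, as described above.
svc = SVC(kernel="linear", probability=True).fit(X, y)
print("class probabilities:", svc.predict_proba(X[:1]))
```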

In short, your positive training samples are always the same. In one vs. one you train a separate classifier for each of the other classes, with negative samples taken from that class alone. In one vs. all you lump all negative samples together and train a single classifier. The problem with the former approach is that you have to consider all the pairwise outcomes to decide on the class. The problem with the latter approach is that lumping all negative object classes together may create a non-homogeneous class that is hard to process and analyse.
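If it helps, here is a small sketch of how the training labels are constructed in the two schemes, assuming integer labels y in {0..4} and class 0 as the positive class of interest:

```python
import numpy as np

y = np.repeat(np.arange(5), 20)   # hypothetical labels for 5 classes
pos = 0                           # the class of interest

# One-vs-all: one classifier, positives vs. all remaining classes lumped together.
ova_labels = (y == pos).astype(int)          # 1 for class 0, 0 for classes 1-4

# One-vs-one: one classifier per negative class, each trained on a pairwise subset.
for neg in range(1, 5):
    mask = (y == pos) | (y == neg)           # keep only the two classes involved
    pair_labels = (y[mask] == pos).astype(int)
    # ...train a binary SVM on X[mask] and pair_labels
```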

Cross validation and Improvement

I was wondering how the cross validation process can improve a model. I am totally new to this field and keen to learn.
I understood the principle of cross-validation but I don't understand how it improves a model. Let's say the data is divided into 4 folds; if I train my model on the first three folds and test on the last one, the model will train well. But when I repeat this step by training the model on the last three folds and testing on the first one, most of the training data has already been "seen" by the model. The model won't improve with data it has already seen, right? Or is the result a "mean" of the models built with the different training data sets?
Thank you in advance for your time!
Cross-validation doesn't actually improve the model, but it helps you to accurately score its performance.
Let's say at the beginning of your training you divide your data into 80% train and 20% test sets. Then you train on the said 80% and test on 20% and get the performance metric.
The problem is that when separating the data in the beginning, you did so (hopefully) randomly, or otherwise arbitrarily, and as a result the model performance you obtained depends to some extent on the pseudo-random number generator you used or on your judgement.
So instead you divide your data into, for example, 5 random equal sets. Then you take set 1, put it aside, train on sets 2-5, test on set 1 and record the performance metric. Then you put aside set 2, and train a fresh (not trained) model on sets 1, 3-5, test on set 2, record the metric and so on.
After 5 sets you will have 5 performance metrics. If you take their average (of the most appropriate kind) it would be a better representation of your model performance, because you are 'averaging out' the random effects of data splitting.
I think it is explained well in this blog with some code in Python.
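As a concrete illustration (a minimal sketch using scikit-learn on a toy dataset, not the blog's code): 5-fold cross-validation returns 5 scores that are averaged into a single performance estimate, and no model is improved along the way.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores)          # one score per held-out fold
print(scores.mean())   # the averaged performance estimate
```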
With 4-fold cross-validation you are effectively training 4 different models. There's no dependency between the models and one does not train on top of the other.
What will happen later depends on the implementation. Typically you can access all models that were trained and it's left to you what to do with that.
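For example, scikit-learn's cross_validate can hand back the individually fitted models if you ask for them (a sketch on a toy dataset):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

X, y = load_iris(return_X_y=True)
result = cross_validate(LogisticRegression(max_iter=1000), X, y, cv=4,
                        return_estimator=True)
models = result["estimator"]                 # 4 independently trained models
print(len(models), result["test_score"])
```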

Cleveland heart disease dataset - can’t describe the class

I’m using the Cleveland Heart Disease dataset from UCI for classification but I don’t understand the target attribute.
The dataset description says that the values go from 0 to 4 but the attribute description says:
0: < 50% coronary disease
1: > 50% coronary disease
I’d like to know how to interpret this: is this dataset meant to be a multiclass or a binary classification problem? And must I group values 1-4 into a single class (presence of disease)?
If you are working on an imbalanced dataset, you should use a re-sampling technique to get better results. On imbalanced datasets the classifier tends to always "predict" the most common class without performing any real analysis of the features.
You should try SMOTE; it synthesizes elements for the minority class based on those that already exist. It works by randomly picking a point from the minority class, computing its k nearest neighbours, and creating new points along the lines to those neighbours.
I also used K-fold cross-validation along with SMOTE; cross-validation helps ensure that the model learns the correct patterns from the data.
When measuring the performance of the model, the accuracy metric can mislead: it shows high accuracy even when the minority class is largely misclassified. Use metrics such as the F1-score and MCC.
References:
https://www.kaggle.com/rafjaa/resampling-strategies-for-imbalanced-datasets
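A hedged sketch of this recipe using imbalanced-learn and scikit-learn (the synthetic data below merely stands in for the heart-disease features); note that SMOTE is placed inside a pipeline so that oversampling happens only on the training folds:

```python
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Hypothetical imbalanced data standing in for the heart-disease features.
X, y = make_classification(n_samples=300, weights=[0.85, 0.15], random_state=0)

pipe = Pipeline([("smote", SMOTE(random_state=0)),
                 ("clf", RandomForestClassifier(random_state=0))])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_validate(pipe, X, y, cv=cv,
                        scoring=["f1", "matthews_corrcoef"])
print(scores["test_f1"].mean(), scores["test_matthews_corrcoef"].mean())
```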
It basically means that the presence of different degrees of heart disease is denoted by 1, 2, 3, 4, while absence is simply denoted by 0. Now, most of the experiments that have been conducted on this dataset have been based on binary classification, i.e. presence (1, 2, 3, 4) vs. absence (0). One reason for this might be the class imbalance problem (0 has about 160 samples and the rest, 1, 2, 3 and 4, make up the other half) and the small number of samples (only around 300 in total). So it makes sense to treat this as a binary classification problem instead of a multi-class one, given the constraints that we have.
is this dataset meant to be a multiclass or a binary classification problem?
Without changes, the dataset is ready to be used for a multi-class classification problem.
And must i group values 1-4 to a single class (presence of disease)?
Yes, you must, as long as you are interested in using the dataset for a binary classification problem.
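If you go the binary route, the grouping is a one-liner; here is a small sketch assuming the raw target column is named num (as in the UCI files) and lives in a pandas DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"num": [0, 1, 2, 3, 4, 0, 2]})   # hypothetical target values
df["target"] = (df["num"] > 0).astype(int)          # 0 = no disease, 1 = disease (1-4 grouped)
print(df)
```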

How to understand output from a Multiclass Neural Network

I built a flow in Azure ML using the Multiclass Neural Network module (for settings, see the picture).
Some more info about the multiclass setup:
The data flow is simple, with an 80/20 split.
Preparation of the data is done before it goes into Azure. The data looks like this:
My problem comes when I want to make sense of the output and, if possible, transform/calculate the output into probabilities. The output looks like this:
My question: If scored probabilities output for my model is 0.6 and scored labels = 1, how sure is the model of the scored labels 1? And how sure can I be that actual outcome will be a 1?
Can I safely assume that a scored probabilities of 0.80 = 80% chance of outcome? Or what type of outcomes should I watch out for?
To start with, you are in a binary classification setting, not a multi-class one (we normally use that term when the number of classes is > 2).
If scored probabilities output for my model is 0.6 and scored labels = 1, how sure is the model of the scored labels 1?
In practice, the scored probabilities are routinely interpreted as the confidence of the model; so, in this example, we would say that your model has 60% confidence that the particular sample belongs to class 1 (and, complementarily, 40% confidence that it belongs to class 0).
And how sure can I be that actual outcome will be a 1?
If you don't have any alternate means of computing such outcomes yourself (e.g. a different model), I cannot see how this question is different from your previous one.
Can I safely assume that a scored probabilities of 0.80 = 80% chance of outcome?
This is the kind of statement that would drive a professional statistician mad; nevertheless, the clarifications above regarding the confidence should be enough for your purposes (they are enough indeed for ML practitioners).
My answer in Predict classes or class probabilities? should also be helpful.
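To make the complementarity above concrete, here is a minimal sketch with a generic scikit-learn classifier standing in for the Azure ML module (the data is synthetic): the two class probabilities sum to 1, so a scored probability of 0.6 for class 1 implies 0.4 for class 0.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)
proba = clf.predict_proba(X[:1])     # e.g. [[0.4, 0.6]] -> P(class 0), P(class 1)
label = proba.argmax(axis=1)         # the "scored label" is the more probable class
print(proba, label)
```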

caret: using random forest and include cross-validation

I used the caret package to train a random forest, including repeated cross-validation. I’d like to know whether the OOB, as in the original RF by Breiman, is used or whether this is replaced by the cross-validation. If it is replaced, do I have the same advantages as described in Breiman 2001, like increased accuracy by reducing the correlation between input data? As OOB is drawn with replacement and CV is drawn without replacement, are both procedures comparable? What is the OOB estimate of error rate (based on CV)?
How are the trees grown? Is CART used?
As this is my first thread, please let me know if you need more details. Many thanks in advance.
There are a lot of basic questions here and you would be better served by reading a book on machine learning or predictive modeling. That's probably why you haven't gotten much of a response.
For caret you should also consult the package website where some of these questions are answered.
Here are some notes:
CV and OOB estimation for RF are somewhat different. This post might help explain how. For this application, the OOB rate from random forest is computed while the model is being built, whereas CV uses holdout samples that are predicted after the random forest model is computed.
The original random forest model (used here) uses unpruned CART trees. Again, this is in many text books and papers.
Max
I recently got a little confused with this too, but reading chapter 4 in Applied Predictive Modeling by Max Kuhn helped me to understand the difference.
If you use randomForest in R, you grow a number of decision trees by sampling N cases with replacement (N is the number of cases in the training set). You then sample m variables at each node where m is less than the number of predictors. Each tree is then grown fully and terminal nodes are assigned to a class based on the mode of cases in that node. New cases are classified by sending them down all the trees and then taking a vote; the majority vote wins.
The key points to note here are:
how the trees are grown - sampling WITH replacement (a bootstrap). This means that some cases will be represented many times in your bootstrap sample and others may not be represented at all. The bootstrap sample will be the same size as your training dataset.
The cases that are not selected for building a tree are referred to as the OOB samples; an OOB error estimate is calculated by classifying the cases that weren't selected when building that tree. About 63% of the original data points appear at least once in a bootstrap sample (a quick numerical check follows below).
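Here is that quick numerical check of the 63% figure, as a Python sketch (the exact limit is 1 − 1/e ≈ 0.632):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10_000
boot = rng.integers(0, N, size=N)    # sample N cases with replacement (a bootstrap)
print(len(np.unique(boot)) / N)      # ≈ 0.632; the remaining cases are the OOB samples
```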
If you use caret in R, you will normally call caret::train(...), specify method = "rf" and pass trControl = trainControl(method = "repeatedcv", ...). You can switch the resampling method to "oob" if you want out-of-bag estimation instead. The way repeated CV works is as follows (I'm going to use the simple example of a 10-fold CV repeated 5 times): the training dataset is split into 10 folds of roughly equal size, and a forest is built using only 9 of them, omitting the 1st fold (which is held out). The held-out fold is predicted by running its cases through the trees, and those predictions are used to estimate performance measures. The first fold is then returned to the training set and the procedure repeats with the 2nd fold held out, and so on, until the process has run 10 times. This whole procedure can itself be repeated (in my example, 5 times); for each of the 5 runs the training dataset will be split into 10 slightly different folds. In total, 50 different held-out sets are used to assess model efficacy.
The key points to note are:
this involves sampling WITHOUT replacement - you split the training data, build a model on 9 of the folds, predict the held-out fold (the remaining 1 of the 10) and repeat this process as above
the model is built using a dataset that is smaller than the training dataset; this is different to the bootstrap method discussed above
You are using 2 different resampling techniques which will yield different results, so they are not directly comparable. Repeated k-fold CV tends to have low bias (for large k); where k is 2 or 3, bias is high and comparable to the bootstrap method. K-fold CV tends to have higher variance, though...
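Although your question is about caret in R, the OOB-vs-CV distinction is easy to see in any random forest implementation; here is a hedged sketch using scikit-learn as a stand-in: the OOB score falls out of fitting a single forest, while the CV score comes from refitting the forest on each training split and predicting the held-out fold.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# OOB estimate: computed from the bootstrap leftovers while the forest is built.
rf = RandomForestClassifier(n_estimators=500, oob_score=True, random_state=0).fit(X, y)
print("OOB accuracy:", rf.oob_score_)

# CV estimate: held-out folds predicted by forests trained on the remaining folds.
cv_scores = cross_val_score(RandomForestClassifier(n_estimators=500, random_state=0),
                            X, y, cv=10)
print("10-fold CV accuracy:", cv_scores.mean())
```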

How to select training data for naive bayes classifier

I want to double-check some concepts I am uncertain about regarding the training set for classifier learning. When we select records for our training data, should we select an equal number of records per class, summing to N, or should we randomly pick N records regardless of class?
Intuitively I was thinking of the former, but then the prior class probabilities would all be equal, and wouldn't that make them unhelpful?
It depends on the distribution of your classes, and the determination can only be made with domain knowledge of the problem at hand.
You can ask the following questions:
Are there any two classes that are very similar and does the learner have enough information to distinguish between them?
Is there a large difference in the prior probabilities of each class?
If so, you should probably redistribute the classes.
In my experience, there is no harm in redistributing the classes, but it's not always necessary.
It really depends on the distribution of your classes. In the case of fraud or intrusion detection, the class to be predicted can make up less than 1% of the data.
In this case you must distribute the classes evenly in the training set if you want the classifier to learn the differences between them. Otherwise it will produce a classifier that correctly classifies over 99% of the cases without ever correctly identifying a fraud case, which defeats the whole point of creating the classifier in the first place.
Once you have a set of evenly distributed classes you can use any technique, such as k-fold, to perform the actual training.
Another example where class distributions need to be adjusted, but not necessarily in an equal number of records for each, is the case of determining upper-case letters of the alphabet from their shapes.
If you take a distribution of letters commonly used in the English language to train the classifier, there will be almost no cases, if any, of the letter Q. On the other hand, the letter O is very common. If you don't redistribute the classes to allow for the same number of Q's and O's, the classifier doesn't have enough information to ever distinguish a Q. You need to feed it enough information (i.e. more Qs) so it can determine that Q and O are indeed different letters.
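One simple way to even out the classes is to up-sample the rare one; here is a hedged sketch using sklearn.utils.resample, with made-up counts of O's and Q's:

```python
import pandas as pd
from sklearn.utils import resample

df = pd.DataFrame({"label": ["O"] * 95 + ["Q"] * 5})   # hypothetical skewed data
rare = df[df["label"] == "Q"]
common = df[df["label"] == "O"]

# Sample the rare class with replacement until it matches the common class.
rare_up = resample(rare, replace=True, n_samples=len(common), random_state=0)
balanced = pd.concat([common, rare_up])
print(balanced["label"].value_counts())                 # now 95 of each
```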
The preferred approach is to use K-fold cross-validation for selecting the training and testing data.
Quote from Wikipedia:
K-fold cross-validation
In K-fold cross-validation, the original sample is randomly partitioned into K subsamples. Of the K subsamples, a single subsample is retained as the validation data for testing the model, and the remaining K − 1 subsamples are used as training data. The cross-validation process is then repeated K times (the folds), with each of the K subsamples used exactly once as the validation data. The K results from the folds can then be averaged (or otherwise combined) to produce a single estimation. The advantage of this method over repeated random sub-sampling is that all observations are used for both training and validation, and each observation is used for validation exactly once. 10-fold cross-validation is commonly used.
In stratified K-fold cross-validation, the folds are selected so that the mean response value is approximately equal in all the folds. In the case of a dichotomous classification, this means that each fold contains roughly the same proportions of the two types of class labels.
You should always take the common approach in order to have comparable results with other scientific data.
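For reference, here is a minimal sketch of stratified K-fold as implemented in scikit-learn, matching the description quoted above (toy labels only): each held-out fold preserves the class proportions.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.arange(20).reshape(-1, 1)
y = np.array([0] * 15 + [1] * 5)            # imbalanced toy labels (3:1 ratio)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in skf.split(X, y):
    # each held-out fold keeps roughly the same 3:1 class ratio
    print(np.bincount(y[test_idx]))
```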
I built an implementation of a Bayesian classifier to determine whether a sample is NSFW (not safe for work) by examining the occurrence of words in examples. When training the classifier for NSFW detection I tried making each class in the training set contain the same number of examples. This didn't work out as well as I had planned, because one of the classes had many more words per example than the other.
Since I was computing the likelihood of NSFW based on these words, I found that balancing out the classes based on their actual size (in MB) worked better. I tried 10-fold cross-validation for both approaches (balancing by number of examples and by size of the classes) and found that balancing by the size of the data worked well.
