I'm working on image classification using Bag Of Visual Words, SIFT and SVM with RBF kernel.
I'm doing some cross-validation tuning the following parameters: k (dictionary size); c, gamma (kernel values).
I've tried 1100 different (k, c, gamma) combinations, but several of them tie for the best accuracy (which is around 95.3%).
How can I choose the combination that will generalize best?
(The one that'll give me the best result on the test set)
The only data I have are the 1100 values of accuracy relative to each combination.
EDIT
I'm sorry, I forgot to say that I have 6 classes. I know ROC can be extended to multiclass problems, but I'm looking for something else, if it exists.
My validation set has 4 images for each class so the accuracy is just:
(number of correctly classified images)/(24)
Related
I'm fairly new to data analysis and machine learning. I've been carrying out some KNN classification analysis on a breast cancer dataset using Python's sklearn module. I have the following code, which attempts to find the optimal k for classification of a target variable.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
import matplotlib.pyplot as plt
breast_cancer_data = load_breast_cancer()
training_data, validation_data, training_labels, validation_labels = train_test_split(breast_cancer_data.data, breast_cancer_data.target, test_size = 0.2, random_state = 40)
results = []
for k in range(1,101):
    classifier = KNeighborsClassifier(n_neighbors = k)
    classifier.fit(training_data, training_labels)
    results.append(classifier.score(validation_data, validation_labels))
k_list = range(1,101)
plt.plot(k_list, results)
plt.ylim(0.85,0.99)
plt.xlabel("k")
plt.ylabel("Accuracy")
plt.title("Breast Cancer Classifier Accuracy")
plt.show()
The code loops through 1 to 100 and generates 100 KNN models with 'k' set to incremental values in the range 1 to 100. The performance of each of those models is saved to a list and a plot is generated showing 'k' on the x-axis and model performance on the y-axis.
The problem I have is that when I change the random_state parameter when splitting the data into training and testing partitions, I get completely different plots, indicating varying model performance for different 'k' values for different dataset partitions.
For me this makes it difficult to decide which 'k' is optimal as the algorithm performs differently for different 'k's using different random states. Surely this doesn't mean that, for this particular dataset, 'k' is arbitrary? Can anyone help shed some light on this?
Thanks in anticipation
This is completely expected. When you do the train-test-split, you are effectively sampling from your original population. This means that when you fit a model, any statistic (such as a model parameter estimate, or a model score) will itself be a sample estimate taken from some distribution. What you really want is a confidence interval around this score, and the easiest way to get that is to repeat the sampling and remeasure the score.
But you have to be very careful how you do this. Here are some robust options:
1. Cross Validation
The most common solution to this problem is to use k-fold cross-validation. In order not to confuse this k with the k from KNN, I'm going to use a capital K for cross-validation (but bear in mind this is not normal nomenclature). This is a scheme that implements the suggestion above but without a target leak. Instead of creating many splits at random, you split the data into K parts (called folds). You then train K models, each time on K-1 folds of the data, leaving aside a different fold as your test set each time. Now each model is scored only on data it never saw during training, so there is no target leak. It turns out that the mean of whatever success score you use from these K models on their K separate test sets is a good estimate for the performance of training a model with those hyperparameters on the whole set. So now you should get a more stable score for each of your different values of k (small k for KNN) and you can choose a final k this way.
Some extra notes:
Accuracy is a bad measure for classification performance. Look at scores like precision vs recall, AUROC or F1.
Don't try to program CV yourself; use sklearn's GridSearchCV.
If you are doing any preprocessing on your data that calculates some sort of state using the data, that needs to be done on only the training data in each fold. For example, if you are scaling your data, you can't include the test data when you do the scaling. You need to fit (and transform) the scaler on the training data and then use that same scaler to transform your test data (don't fit again). To get this to work in CV you need to use sklearn Pipelines. This is very important, so make sure you understand it.
You might get more stability if you stratify your train-test-split based on the output class. See the stratify argument on train_test_split.
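To make the notes above concrete, here is a minimal sketch of how they could fit together for the KNN example from the question. The grid of odd k values, the 5 stratified folds, the F1 scoring and the step names "scale" and "knn" are my own assumptions for illustration, not something prescribed above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# The scaler lives inside the pipeline, so it is re-fitted on the
# training folds only and never sees the held-out fold.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Assumed search space: odd k from 1 to 31 (16 candidates).
param_grid = {"knn__n_neighbors": list(range(1, 32, 2))}

# Stratified folds keep the class ratio similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

search = GridSearchCV(pipe, param_grid, cv=cv, scoring="f1")
search.fit(X, y)

best_k = search.best_params_["knn__n_neighbors"]
mean_scores = search.cv_results_["mean_test_score"]  # one mean per candidate k
std_scores = search.cv_results_["std_test_score"]    # spread across the folds
print(best_k, search.best_score_)
```

Looking at std_scores next to mean_scores gives a rough idea of whether the difference between two values of k is larger than the fold-to-fold noise.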
Note that CV is the industry standard and that's what you should do, but there are other options:
2. Bootstrapping
You can read about this in detail in An Introduction to Statistical Learning, section 5.2 (pg 187), with examples in section 5.3.4.
The idea is to take your training set and draw a random sample from it with replacement. This means you end up with some repeated records. You take this new training set, train a model and then score it on the records that didn't make it into the bootstrapped sample (often called out-of-bag samples). You repeat this process multiple times. You can now get a distribution of your score (e.g. accuracy) which you can use to choose your hyper-parameter rather than just the point estimate you were using before.
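A rough sketch of that bootstrap loop, assuming the breast cancer data from the question stands in for the training set. The helper name oob_scores, the 30 rounds and the 95% percentile interval are my own choices for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
rng = np.random.default_rng(0)
n = len(X)

def oob_scores(k, n_rounds=30):
    """Accuracy of a k-NN model on the out-of-bag rows of each bootstrap sample."""
    scores = []
    for _ in range(n_rounds):
        boot = rng.integers(0, n, size=n)        # row indices drawn with replacement
        oob = np.setdiff1d(np.arange(n), boot)   # rows that were never drawn
        model = KNeighborsClassifier(n_neighbors=k).fit(X[boot], y[boot])
        scores.append(model.score(X[oob], y[oob]))
    return np.array(scores)

scores_k5 = oob_scores(5)
low, high = np.percentile(scores_k5, [2.5, 97.5])  # rough 95% interval
print(scores_k5.mean(), low, high)
```

Running oob_scores for each candidate k gives one score distribution per k, which can be compared instead of single point estimates.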
3. Making sure your validation set is representative of your test set
Jeremy Howard has a very interesting suggestion on how to calibrate your validation set to be a good representation of your test set. You only need to watch about 5 minutes from where that link starts. The idea is to split into three sets (which you should be doing anyway to choose a hyper-parameter like k), train a bunch of very different but simple, quick models on your train set and then score them on both your validation and test set. It is OK to use the test set here because these aren't real models that will influence your final model. Then plot the validation scores vs the test scores. They should fall roughly on a straight line (the y=x line). If they do, this means the validation set and test set are both either good or bad, i.e. performance on the validation set is representative of performance on the test set. If they don't fall on this straight line, it means the model scores you get from your validation set are not indicative of the score you'll get on unseen data, and thus you can't use that split to train a sensible model.
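Here is one way this check could look in code. The particular quick models, the 60/20/20 split and the breast cancer data are assumptions on my part; the sketch only collects the score pairs, which you would then plot against each other.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Three-way split: 60% train, 20% validation, 20% test.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=0)

# Deliberately different, cheap models (hypothetical selection).
quick_models = [
    GaussianNB(),
    KNeighborsClassifier(n_neighbors=1),
    KNeighborsClassifier(n_neighbors=25),
    DecisionTreeClassifier(max_depth=1, random_state=0),
    DecisionTreeClassifier(max_depth=4, random_state=0),
]

val_scores, test_scores = [], []
for model in quick_models:
    model.fit(X_train, y_train)
    val_scores.append(model.score(X_val, y_val))
    test_scores.append(model.score(X_test, y_test))

# Points close to the y = x line have a small |validation - test| gap.
gaps = np.abs(np.array(val_scores) - np.array(test_scores))
for model, v, t in zip(quick_models, val_scores, test_scores):
    print(type(model).__name__, round(v, 3), round(t, 3))
```

If the gaps are large or erratic for some split, that is the signal to try a different way of splitting.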
4. Get a larger data set
This is obviously not very practical for your situation but I thought I'd mention it for completeness. As your sample size increases, your standard error drops (i.e. you can get tighter bounds on your confidence intervals). But you'll need more training and more test data. While you might not have access to that here, it's worth keeping in mind for real world situations where you can assess the trade-off of the cost of gathering new data vs the desired accuracy in assessing your model performance (and probably the performance itself too).
This "behavior" is to be expected. Of course you get different results when the training and test sets are split differently.
You can approach the problem statistically, by repeating each 'k' several times with new train-validation-splits. Then take the median performance for each k. Or even better: look at the performance distribution and the median. A narrow performance distribution for a given 'k' is also a good sign that the 'k' is chosen well.
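As a sketch of that idea, assuming a test set is held out first and using a handful of candidate k values and 20 repeats that I picked only for illustration:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)

# Hold out a test set once; it is not touched while choosing k.
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

k_values = [1, 5, 15, 45]   # assumed candidates
n_repeats = 20              # assumed number of re-splits

scores = np.zeros((len(k_values), n_repeats))
for j in range(n_repeats):
    # A new train-validation split for every repeat.
    X_tr, X_val, y_tr, y_val = train_test_split(
        X_dev, y_dev, test_size=0.2, random_state=j)
    for i, k in enumerate(k_values):
        model = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
        scores[i, j] = model.score(X_val, y_val)

medians = np.median(scores, axis=1)
q25, q75 = np.percentile(scores, [25, 75], axis=1)
spread = q75 - q25          # narrow spread = stable performance for that k

best_k = k_values[int(np.argmax(medians))]
final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_dev, y_dev)
test_score = final_model.score(X_test, y_test)
print(best_k, medians, spread, test_score)
```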
Afterwards you can use the test set to test your model.
Let's suppose I have a noisy 2d data set where one person watching the data could easily draw a straight line in the data so that the mean squared error is minimized.
The model of the line has the form y = mx + b, where x is the input value, y is the predicted value of the model and m and b are trained variables to minimize the cost.
My question is that if we plug some input x1 to the model, it will always output the same number, not taking into account how sparse the data is. How can a model like this predict different values from same inputs?
Maybe this could be done taking all the errors from the model line to the points, making a distribution of them, taking an expected value of such distribution and then adding that value to y?
If the data is 2d, and it can be perfectly modeled with a straight line then there is no data-based nor statistical-based reason not to claim that the process is fully deterministic, and you should output one value.
However, if you have many more dimensions, or your fit is not perfect (error is minimised but not 0), then what you are after is either predicting a distribution of values or at least confidence bounds. There are many probabilistic models that can model the distribution of the outputs rather than a single value. In particular, linear regression does that: it assumes that you have a Gaussian error around your predictions, so once you obtain MSE "A" you can draw predictions from N(mx+b, A), which, as you can easily see, degenerates to a deterministic model when A=0. These predictions are optimal in expectation, and they are simply your way of "simulating observations" according to the model. There are also meta methods if you treat your predictor as a black box: you can train multiple models on subsets of data and treat their predictions as samples to fit a distribution (again, for simplicity it could be a single Gaussian).
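A small sketch of drawing predictions from N(mx+b, A), reading A as the variance of the Gaussian. The synthetic data (true slope 2, intercept 1, noise standard deviation 1.5) and the helper name sample_predictions are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical noisy data around the line y = 2x + 1.
x = rng.uniform(0, 10, size=200)
y = 2.0 * x + 1.0 + rng.normal(0.0, 1.5, size=200)

# Least-squares fit of y = m*x + b (polyfit returns the slope first).
m, b = np.polyfit(x, y, deg=1)

# "A" in the text: mean squared error of the fitted line.
residuals = y - (m * x + b)
mse = np.mean(residuals ** 2)

def sample_predictions(x1, n_samples):
    """Draw simulated observations at x1 from N(m*x1 + b, mse)."""
    return rng.normal(loc=m * x1 + b, scale=np.sqrt(mse), size=n_samples)

draws = sample_predictions(4.0, 1000)
print(m * 4.0 + b, draws.mean(), draws.std())
```

The same input x1 now yields different outputs on every draw, while their average stays close to the deterministic prediction m*x1 + b.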
I have a dataset with 1900 rows and 22 columns. 21 of the columns are just numbers, but the crucial one, which I want to predict, has 3 stages: a, b, and c.
I have tried both decision trees/jungles, and neural networks and no matter how I set them up I can't get more than 55% precision.
Usually it's around 50% accuracy and the best I was ever able to get was 55% overall accuracy and around 70% average.
Should I even use NN on such a small dataset? As I said, I tried other ML algorithms but they don't yield anything better.
I think that there is no clear answer to your question. Low accuracy score may come from a few reasons. I will state some of them in the following points :
When you use decision trees / neural networks, low accuracy may be the result of a wrong setup of metaparameters (like the maximum height of a tree or the number of trees in the DT case, or a wrong topology or data preparation in the NN case). I advise you to use a grid or random search for both NN and DT to look for the best metaparameters for your algorithm (for "static", i.e. non-sequential, data, packages like h2o in R or scikit-learn in Python may do a great job) and, in the neural network case, to normalize your data properly (e.g. subtract the mean and divide by the standard deviation for every x column of your data).
Your dataset might be inconsistent. If, for example, your data does not have the property that there exists a functional dependency between x and y (meaning that y = f(x) for some f), then what is learnt during training is the probability that a given x belongs to some specified class. This inconsistency might seriously harm your accuracy. In this case I advise you to try to determine whether that phenomenon occurs and then, e.g., try to segment your data to solve the problem.
Your data set might be simply too small. Try to get more data in this case.
Suppose I have a training set made by (x, y) samples.
To apply a generative algorithm, let's say the Gaussian discriminative, I must assume that
p(x|y) ~ Normal(mu, sigma) for every possible sigma
or do I just need to know whether x ~ Normal(mu, sigma) given y?
How can I evaluate if p(x|y) follows a multivariate Normal distribution well enough (up to a threshold) to me to use generative algorithm?
That's a lot of questions.
To apply a generative algorithm, let's say the Gaussian
discriminative, I must assume that
p(x|y) ~ Normal(mu, sigma) for every possible sigma
No, you must assume that's true for some mu, sigma pair. In practice you won't know what mu and sigma are, so you'll need to either estimate them (frequentist: Maximum Likelihood / Maximum A Posteriori estimates) or, even better, incorporate uncertainty about your estimates of the parameters into predictions (Bayesian methodology).
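For the frequentist route, a minimal sketch of the maximum likelihood estimates could look like this. It assumes a separate mu and sigma per class and uses synthetic 2-d data that I generated for the example; the Bayesian treatment is not shown.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)

# Hypothetical data: two classes, each drawn from its own 2-d Gaussian.
X0 = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.3], [0.3, 1.0]], size=300)
X1 = rng.multivariate_normal([3.0, 3.0], [[1.0, -0.2], [-0.2, 1.5]], size=200)
X = np.vstack([X0, X1])
y = np.array([0] * 300 + [1] * 200)

# Maximum likelihood estimates of p(y) and of (mu, sigma) for p(x | y).
params = {}
for label in np.unique(y):
    Xc = X[y == label]
    params[int(label)] = {
        "prior": len(Xc) / len(X),
        "mu": Xc.mean(axis=0),
        "sigma": np.cov(Xc, rowvar=False, bias=True),  # bias=True divides by N (ML)
    }

def predict(points):
    """Pick the class with the largest log p(y) + log p(x | y)."""
    log_post = np.column_stack([
        np.log(p["prior"])
        + multivariate_normal.logpdf(points, mean=p["mu"], cov=p["sigma"])
        for p in params.values()
    ])
    return log_post.argmax(axis=1)  # column order matches labels 0, 1

train_accuracy = (predict(X) == y).mean()
print(params[0]["mu"], params[1]["mu"], train_accuracy)
```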
How can I evaluate if p(x|y) follows a multivariate Normal distribution?
Classically, using a goodness of fit test. If the dimensionality of x is more than a handful, though, this won't work because standard tests involve the number of items in bins, and the number of bins you need in high dimensions is astronomical so you have very low expected counts.
A better idea is to say the following: what are my options for modelling the (conditional) distribution of x? You can compare between these options using model comparison techniques. Read up on model checking and comparison.
Finally, your last point:
well enough (up to a threshold) to me to use generative algorithm?
The paradox of many generative methods, including Fisher's Linear Discriminant Analysis for example, as well as the Naive Bayes classifier, is that the classifier can work very well even though the model is poor for the data. There's no particularly sound reason why this should be the case, but many have observed it to be empirically true. Whether it works can be checked much more easily than whether the assumed distribution explains the data very well: just split your data into training and testing and find out!
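The "split and find out" check is short in scikit-learn. This sketch assumes the breast cancer data as a stand-in and a 70/30 stratified split; the point is only that held-out accuracy is cheap to measure, whatever the fit of the Gaussian assumption.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Two generative classifiers that both assume Gaussian p(x | y).
candidates = [
    ("LDA", LinearDiscriminantAnalysis()),
    ("Gaussian naive Bayes", GaussianNB()),
]

test_accuracy = {}
for name, model in candidates:
    # Fit on the training part, score on data the model has not seen.
    test_accuracy[name] = model.fit(X_train, y_train).score(X_test, y_test)

print(test_accuracy)
```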
I have a binary class dataset (0 / 1) with a large skew towards the "0" class (about 30000 vs 1500). There are 7 features for each instance, no missing values.
When I use the J48 or any other tree classifier, I get almost all of the "1" instances misclassified as "0".
Setting the classifier to "unpruned", setting minimum number of instances per leaf to 1, setting confidence factor to 1, adding a dummy attribute with instance ID number - all of this didn't help.
I just can't create a model that overfits my data!
I've also tried almost all of the other classifiers Weka provides, but got similar results.
Using IB1 gets 100% accuracy (trainset on trainset) so it's not a problem of multiple instances with the same feature values and different classes.
How can I create a completely unpruned tree?
Or otherwise force Weka to overfit my data?
Thanks.
Update: Okay, this is absurd. I've used only about 3100 negative and 1200 positive examples, and this is the tree I got (unpruned!):
J48 unpruned tree
------------------
F <= 0.90747: 1 (201.0/54.0)
F > 0.90747: 0 (4153.0/1062.0)
Needless to say, IB1 still gives 100% precision.
Update 2: Don't know how I missed it - unpruned SimpleCart works and gives 100% accuracy train on train; pruned SimpleCart is not as biased as J48 and has a decent false positive and negative ratio.
Weka contains two meta-classifiers of interest:
weka.classifiers.meta.CostSensitiveClassifier
weka.classifiers.meta.MetaCost
They allow you to make any algorithm cost-sensitive (not restricted to SVM) and to specify a cost matrix (the penalty for the various errors); you would give a higher penalty for misclassifying 1 instances as 0 than you would give for erroneously classifying 0 as 1.
The result is that the algorithm would then try to:
minimize expected misclassification cost (rather than predict the most likely class)
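To show what minimizing expected cost means, here is a small numeric sketch. It is not Weka code: the cost values, the class probabilities and the convention cost[i, j] = cost of predicting j when the truth is i are all assumptions made for the illustration.

```python
import numpy as np

# cost[i, j]: cost of predicting class j when the true class is i.
# Missing a "1" (predicting 0) is assumed to be 10 times worse than a false alarm.
cost = np.array([[0.0, 1.0],
                 [10.0, 0.0]])

# Hypothetical class probabilities [p(0), p(1)] for three instances.
probs = np.array([[0.95, 0.05],
                  [0.85, 0.15],
                  [0.40, 0.60]])

# expected_cost[n, j] = sum_i probs[n, i] * cost[i, j]
expected_cost = probs @ cost

min_cost_pred = expected_cost.argmin(axis=1)   # cost-sensitive decision
most_likely_pred = probs.argmax(axis=1)        # plain decision

print(min_cost_pred)     # [0 1 1]
print(most_likely_pred)  # [0 0 1]
```

The second instance is the interesting one: class 0 is more likely, but predicting 0 has expected cost 1.5 against 0.85 for predicting 1, so the cost-sensitive rule flips the decision.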
The quick and dirty solution is to resample. Throw away all but 1500 of your negative ("0") examples and train on a balanced data set. I am pretty sure there is a resample component in Weka to do this.
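Outside Weka, undersampling the majority class takes only a few lines. This sketch uses random placeholder features with the class counts from the question (30000 vs 1500, 7 features); it is not the Weka resample component.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder data with the class counts from the question.
X = rng.normal(size=(31500, 7))
y = np.array([0] * 30000 + [1] * 1500)

minority_idx = np.flatnonzero(y == 1)
majority_idx = np.flatnonzero(y == 0)

# Keep a random subset of the majority class, as large as the minority class.
kept_majority = rng.choice(majority_idx, size=len(minority_idx), replace=False)

# Shuffle so the two classes are interleaved.
balanced_idx = rng.permutation(np.concatenate([minority_idx, kept_majority]))
X_bal, y_bal = X[balanced_idx], y[balanced_idx]

print(X_bal.shape, np.bincount(y_bal))
```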
The other solution is to use a classifier with a variable cost for each class. I'm pretty sure libSVM allows you to do this and I know Weka can wrap libSVM. However I haven't used Weka in a while so I can't be of much practical help here.