Correctly splitting the dataset - machine-learning

I have downloaded a dataset of 10 object classes for object detection. The dataset is not divided into training, validation, and testing sets. However, the author mentions in his paper that the dataset should be divided into 20% training, 20% validation, and 60% testing, with images chosen randomly.
Following the author's criteria, I have randomly selected 20% of the images for training, 20% for validation, and 60% for testing.
I want to know a couple of things:
1) Should the difficult images go into the training set, the validation set, or the testing set? For example, currently there are 41 difficult images in the test set, 30 in the training set, and 20 in the validation set.
2) How can I ensure that all ten object classes are equally distributed?
Updated
3) Ideally, for a balanced split, should the difficult images be distributed equally? And how much does it affect the result if the testing set has more of the difficult images, or the training set, or the validation set?
Ten classes: Airplane, Storage tank, Baseball ground, Tennis Court, Basketball court, ground track field, Bridge, Ship, Harbor, and Vehicle.
I have 650 images in total; 466 of them contain exactly one class (an image may still contain more than one object of that class):
Airplane = 88 images, Storage tank = 10 images, Baseball ground = 46 images, Tennis Court = 29 images, Basketball court = 32 images, Ground track field = 55 images, Bridge = 58 images, Ship = 36 images, Harbor = 27 images, and Vehicle = 85 images.
The remaining 184 images contain multiple classes.
In total there are 757 airplanes, 302 ships, 655 storage tanks, 390 baseball diamonds, 524 tennis courts, 159 basketball courts, 163 ground track fields, 224 harbors, 124 bridges, and 477 vehicles.

The most common technique is random selection. For example, if you have 1000 images, you can create an array containing the name of every file and shuffle its elements with a random permutation. Then you use the first 200 elements for training, the next 200 for validation, and the remaining elements for testing (in the case of a 20%/20%/60% split).
If there is an extremely unbalanced class, you can force the same proportion of classes in every set. To do that, apply the procedure above class by class, as sketched below.
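For concreteness, here is a minimal sketch of that class-by-class random split in Python (the files_by_class dictionary and the filenames in it are placeholders, not the real dataset files; images that contain several classes would need to be assigned to one primary class or handled separately):

import random

# Hypothetical structure: one entry per class mapping to its image filenames.
files_by_class = {
    "Airplane": [f"airplane_{i:03d}.jpg" for i in range(88)],
    "Vehicle": [f"vehicle_{i:03d}.jpg" for i in range(85)],
    # ... add the remaining classes the same way
}

rng = random.Random(0)        # fixed seed so the split can be reproduced
train, val, test = [], [], []

for cls, files in files_by_class.items():
    files = list(files)
    rng.shuffle(files)                    # random permutation within the class
    n_train = round(0.2 * len(files))     # 20% training
    n_val = round(0.2 * len(files))       # 20% validation
    train += files[:n_train]
    val += files[n_train:n_train + n_val]
    test += files[n_train + n_val:]       # remaining ~60% testing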
You shouldn't choose images by hand. Even if you know that there are some difficult images in your dataset, you should not hand-pick which of the training, validation, and test sets they go into.
If a few images can strongly change the accuracy and you want a fair comparison of your algorithm, you can repeat the random split several times. In some cases many of the difficult images will land in the training set, in other cases in the validation or test set. You can then report the mean and standard deviation of your accuracy (or whatever metric you are using).
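A rough sketch of that repeated-split idea, with make_split as a simple random 20/20/60 split and train_and_evaluate as a stub standing in for your own detector training and evaluation (both names and the filenames are placeholders):

import random
import numpy as np

def make_split(images, seed):
    """Plain random 20/20/60 split (use the per-class version above if needed)."""
    rng = random.Random(seed)
    images = list(images)
    rng.shuffle(images)
    n = len(images)
    return images[:n // 5], images[n // 5:2 * n // 5], images[2 * n // 5:]

def train_and_evaluate(train, val, test):
    """Placeholder: train the detector and return the metric you report (e.g. mAP)."""
    return 0.0

all_images = [f"img_{i:03d}.jpg" for i in range(650)]   # placeholder filenames

scores = [train_and_evaluate(*make_split(all_images, seed=s)) for s in range(5)]
print(f"mean = {np.mean(scores):.3f}  std = {np.std(scores):.3f}")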
UPDATED:
I see; from your description you have more than one object per image, don't you?
For example, can you have two ships and one bridge?
I usually work with datasets that contain a single object in every image; to detect several objects in an image, I then scan different parts of the image looking for single objects.
Probably the author of the paper you mentioned divided the dataset randomly. If you use a more complex division in a research paper, you should mention it.
Regarding your question about the effect of having more difficult images in one set or another, the answer is complex. It depends on the algorithm and on how similar the training images are to the validation and test images.
With a complex model (for example a neural net with many layers and neurons) you can reach whatever accuracy you want on the training set (for example 100%). If the training images are very similar to the images in the validation and test sets, the accuracy on those sets will be similar. But if they are not very similar, you have overfitted, and the accuracy will be lower on the validation and test sets. To fix that you need a simpler model (for example fewer neurons, or a good regularization technique); the training accuracy will then be lower, but the validation and test accuracy will be closer to it.
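As a purely illustrative sketch (not the author's method), this is roughly what "use a simpler model or a regularization technique" can look like in Keras; the layer sizes, L2 strength, and dropout rate below are arbitrary choices:

import tensorflow as tf

# A deliberately constrained classifier head: few units, an L2 weight penalty
# and dropout, trading some training accuracy for a smaller train/validation gap.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu",
                          kernel_regularizer=tf.keras.regularizers.l2(1e-4)),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(10, activation="softmax"),   # e.g. the 10 object classes
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])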


Why does object detection result in multiple found objects?

I trained an object detector with CreateML, and when I test the model in CreateML I get a high number of identified objects.
Notes:
The model was trained on a small data set of ~30 images, with that particular label face-gendermale occurring ~20 times.
Each training image has 1-3 labelled objects.
There are 5 labels in total.
Questions:
Is that expected or is there something wrong with the model?
If this is expected, how should I evaluate these multiple results or even count the number of objects found in the model?
Cross-posted in Apple Developer Forums. Photo of man © Jason Stitt | Dreamstime.com
A typical object detection model will make about 1000 predictions for every image (although it can be many more, depending on the model architecture). Most of these predictions have very low confidence, so they are filtered out. The ones that are left over are then sent through non-maximum suppression (NMS), which removes bounding boxes that overlap too much.
In your case, it seems that the NMS threshold is set too permissively (boxes are allowed to overlap too much before being suppressed), because many overlapping boxes survive.
However, it also seems that the model hasn't been trained very well yet, probably because you used very few images.
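For intuition, here is a minimal greedy NMS sketch (not CreateML's implementation, just the standard algorithm), assuming boxes is an (N, 4) NumPy array of [x1, y1, x2, y2] corners and scores an (N,) array of confidences; iou_threshold is the knob discussed above:

import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression; returns indices of the boxes to keep."""
    order = np.argsort(scores)[::-1]            # highest confidence first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # IoU of the best remaining box with all the others still in play
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        order = order[1:][iou <= iou_threshold]  # drop boxes that overlap too much
    return keep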

Can I reuse test data as training data?

I am using a CNN to classify images. I have 1000 images to begin my journey, so I use 900 as the training dataset and 100 as the testing dataset. I got a model with ~70% accuracy.
Then I got another 150 images today, so I have two ideas for how to continue:
(1) Can I combine the previous 100 test images + 900 training images into a "new" training set, so I have 1000 training images and possibly a better model, and then use the new 150 images as the new "test" data?
(2) Can I combine the new 150 images + 900 training images into a "new" training set to train a better model, and still use the previous 100 test images to test the new model?
Obviously I am going to try both, but I am not sure which one is better in theory... Any comments? Thanks.
You should train on as much data as possible if you want the best CNN possible. Theory says that the more training data you have, the closer your test error will be to your training error. That means your CNN will be better at classifying examples it wasn't trained on. On the other hand, you don't want too little test data because you need to be confident in your accuracy measurement. So you should ideally get more training and more testing data.
If your data is IID, then you shouldn't worry about which of the 1150 images are used to train your model.
The only danger of reusing the same test data is that you might change the model (e.g., adding another layer, and/or adding more units to an existing layer) because it gives you a better result on your test data. When you alter your model in response to observations of the test error, you risk overfitting to your test data. You can mitigate this problem by using a third data set, known as a validation set, for tweaking your model.
IID: The total 1150 images are independently drawn from an identical distribution. In other words, roughly speaking, there's nothing differentiating the 150 from the 1000 aside from the fact that they're new to you, and each image's selection wasn't affected by the selection of any other image.
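A small sketch of that three-set workflow with scikit-learn; the random arrays below are placeholders standing in for the 1150 images and their labels, and the split fractions are only examples:

import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(1150, 32)        # placeholder features, one row per image
y = np.random.randint(0, 2, 1150)   # placeholder labels

# Carve off a test set first, then a validation set from the remainder.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.13, random_state=0)        # ~150 test images
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.1, random_state=0)

# Tune the model (layers, units, ...) against (X_val, y_val) as often as needed,
# and touch (X_test, y_test) only once, for the final reported accuracy.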
It does not matter as long as the new 150 images are from the same distribution as that of the previous 1000 samples.

Why do Tensorflow tf.learn classification results vary a lot?

I use the TensorFlow high-level API tf.learn to train and evaluate a DNN classifier for a series of binary text classifications (actually I need multi-label classification, but at the moment I check every label separately). My code is very similar to the tf.learn tutorial:
classifier = tf.contrib.learn.DNNClassifier(
    hidden_units=[10],
    n_classes=2,
    dropout=0.1,
    feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(training_set.data))
classifier.fit(x=training_set.data, y=training_set.target, steps=100)
val_accuracy_score = classifier.evaluate(x=validation_set.data,
                                         y=validation_set.target)["accuracy"]
The accuracy score varies roughly from 54% to 90%, even though the 21 documents in the validation (test) set are always the same.
What does this very significant deviation mean? I understand there are some random factors (e.g. dropout), but to my understanding the model should converge towards an optimum.
I use words (lemmas), bi- and trigrams, sentiment scores and LIWC scores as features, so I do have a very high-dimensional feature space, with only 28 training and 21 validation documents. Can this cause problems? How can I consistently improve the results apart from collecting more training data?
Update: To clarify, I generate a dictionary of occurring words and n-grams and discard those that occur only 1 time, so I only use words (n-grams) that exist in the corpus.
This has nothing to do with TensorFlow. The dataset is ridiculously small, so you can obtain almost any result. You have 28 + 21 points in a space with an "infinite" number of dimensions (there are around 1,000,000 English words, and thus about 10^18 possible trigrams; many of them do not exist, and for sure most do not occur in your 49 documents, but you still have at least 1,000,000 dimensions). For such a problem you have to expect huge variance in the results.
How can I consistently improve the results apart from collecting more training data?
You pretty much cannot. This is simply far too small a sample to do any meaningful statistical analysis.
Consequently, the best you can do is change the evaluation scheme: instead of splitting the data 28/21, do 10-fold cross-validation. With ~50 points this means running 10 experiments, each with roughly 45 training documents and 4-5 test documents, and averaging the results. This is the only thing you can do to reduce the variance; however, remember that even with CV, a dataset this small gives you no guarantees about how well your model will actually behave "in the wild" (once applied to never-before-seen data).
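A sketch of that scheme with scikit-learn; the random arrays stand in for the ~49 documents, and LogisticRegression is just an example classifier, not the tf.learn model from the question:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X = np.random.rand(49, 1000)        # placeholder document features
y = np.random.randint(0, 2, 49)     # placeholder binary labels

cv = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print(f"accuracy: {scores.mean():.2f} +/- {scores.std():.2f}")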

Optimizing Neural Network Input for Convergence

I'm building a neural network for image classification/recognition. There are 1000 images (30x30 greyscale) for each of the 10 classes. Images of different classes are placed in different folders. I'm planning to use the back-propagation algorithm to train the net.
Does the order in which I feed training examples into the net affect its convergence?
Should I feed training examples in random order?
First I will answer your questions:
Yes, it will affect its convergence.
Yes, it is encouraged to do that; it's called randomized arrangement (shuffling).
But why?
referenced from here
A common example in most ANN software is the IRIS data, where you have 150 instances comprising your dataset. These are three different types of Iris flowers (Versicolor, Virginica, and Setosa). The data set contains measurements of four variables (sepal length and width, and petal length and width). The cases are arranged so that the first 50 cases belong to Setosa, cases 51-100 belong to Versicolor, and the rest belong to Virginica. Now, what you do not want to do is present them to the network in that order. In other words, you do not want the network to see all 50 instances of the Versicolor class, then all 50 of Virginica, then all 50 of Setosa. Without randomization, the network is effectively trained on one class at a time, so it will not converge and will fail to generalize.
Another example: in the past I also had 100 images for each letter of the alphabet (26 classes).
When I trained on them in order (letter by letter), training failed to converge, but after I randomized the order it converged easily, because the neural network could generalize across the letters.
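A minimal sketch of per-epoch shuffling; X and y below are random placeholders standing in for the flattened 30x30 images and their class labels, stored class by class the way they would be when read folder by folder:

import numpy as np

rng = np.random.default_rng(0)
X = np.random.rand(10000, 900)       # 10 classes x 1000 images, flattened 30x30
y = np.repeat(np.arange(10), 1000)   # labels come grouped by class, as on disk

for epoch in range(5):
    order = rng.permutation(len(X))  # a new random order every epoch
    for i in order:
        x_i, y_i = X[i], y[i]
        # feed (x_i, y_i) to the back-propagation update here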

Leave one out accuracy for multi class classification

I am a bit confused about how to use the leave-one-out (LOO) method for calculating accuracy in the case of multi-class, one-vs-rest classification.
I am working on the YUPENN Dynamic Scene Recognition dataset, which contains 14 categories with 30 videos in each category (a total of 420 videos). Let's name the 14 classes {A,B,C,D,E,F,G,H,I,J,K,L,M,N}.
I am using a linear SVM for one-vs-rest classification.
Let's say I want to find the accuracy for class 'A'. When I perform 'A' vs 'rest', I need to exclude one video while training and test the model on the excluded video. Should this excluded video be from class A, or can it come from any of the classes?
In other words, to find the accuracy of class 'A', should I perform SVM with LOO 30 times (leaving each video from class 'A' out exactly once), or should I perform it 420 times (leaving each video from every class out exactly once)?
I have a feeling that I have got this all mixed up. Can anyone provide a short schematic of the right way to perform multi-class classification using LOO?
Also, how do I perform this using libsvm in Matlab?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The number of videos in the dataset is small, so I can't afford to create a separate TEST set (the one that was supposed to be sent to Neptune). Instead I have to make full use of the dataset, because each video provides some new/unique information. In scenarios like this, I have read that people use LOO as a measure of accuracy (when an isolated TEST set is unaffordable). They call it the Leave-One-Video-Out (LOVO) experiment.
The people who have worked on dynamic scene recognition have used this methodology for testing accuracy. In order to compare the accuracy of my method against theirs, I need to use the same evaluation process. But they have only mentioned that they use LOVO for accuracy; not much detail is provided beyond that. I am a newbie in this field, so it is a bit confusing.
As far as I can tell, LOVO can be done in two ways:
1) Leave one video out of the 420 videos. Train 14 one-vs-rest classifiers ('A' vs 'rest', 'B' vs 'rest', ..., 'N' vs 'rest') using the remaining 419 videos as the training set.
Evaluate the left-out video with the 14 classifiers and label it with the class that gives the maximum confidence score. Thus one video is classified. Follow the same procedure to label all 420 videos. Using these 420 labels we can compute the confusion matrix, false positives/negatives, precision, recall, etc. (a code sketch of this scheme is given after point 2 below).
2) From each of the 14 classes I leave out one video, which means I choose 406 videos for training and 14 for testing. Using the 406 videos I train the 14 one-vs-rest classifiers. I evaluate each of the 14 test videos and give them labels based on the maximum confidence score. In the next round I again leave out 14 videos, one from each class, but this time a set of 14 such that none of them was left out in a previous round. I again train, evaluate the 14 videos, and obtain labels. I carry on this process 30 times, with a non-repeating set of 14 videos each time. In the end all 420 videos are labelled. In this case as well, I calculate the confusion matrix, accuracy, precision, recall, etc.
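For reference, a sketch of scheme 1 with scikit-learn, where OneVsRestClassifier(LinearSVC()) plays the role of the 14 one-vs-rest SVMs and each held-out video is labelled by the maximum decision score; the feature vectors below are random placeholders:

import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import LeaveOneOut
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

X = np.random.rand(420, 100)         # placeholder per-video feature vectors
y = np.repeat(np.arange(14), 30)     # 14 classes, 30 videos each

pred = np.empty_like(y)
for train_idx, test_idx in LeaveOneOut().split(X):
    clf = OneVsRestClassifier(LinearSVC(C=1.0, max_iter=5000))
    clf.fit(X[train_idx], y[train_idx])
    pred[test_idx] = clf.predict(X[test_idx])   # label = argmax of the 14 scores

print(confusion_matrix(y, pred))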
Apart from these two methods, LOVO could be done in many other ways. The papers on dynamic scene recognition do not mention how they perform LOVO. Is it safe to assume that they use the first method? Is there any way of deciding which method is better? Would there be a significant difference between the accuracies obtained by the two methods?
Following are some recent papers on dynamic scene recognition for reference; their evaluation sections mention LOVO.
1) http://www.cse.yorku.ca/vision/publications/FeichtenhoferPinzWildesCVPR2014.pdf
2) http://www.cse.yorku.ca/~wildes/wildesBMVC2013b.pdf
3) http://www.seas.upenn.edu/~derpanis/derpanis_lecce_daniilidis_wildes_CVPR_2012.pdf
4) http://webia.lip6.fr/~thomen/papers/Theriault_CVPR_2013.pdf
5) http://www.umiacs.umd.edu/~nshroff/DynScene.pdf
When using cross-validation, it is good to keep in mind that it applies to training a model, and not usually to the honest-to-god, end-of-the-whole-thing measure of accuracy, which is instead reserved for a testing set that has not been touched at all or involved in any way during training.
Let's focus on just one classifier that you plan to build: the "A vs. rest" classifier. You are going to separate all of the data into a training set and a testing set, and then you are going to put the testing set in a cardboard box, staple it shut, cover it with duct tape, place it in a titanium vault, and attach it to a NASA rocket that will deposit it in the ice-covered oceans of Neptune.
Then let's look at the training set. When we train with the training set, we'd like to leave some of the training data to the side, just for calibration, but not as part of the official Neptune-ocean test set.
So what we can do is let every data point (in your case, a data point appears to be a video) sit out once. We don't care whether it comes from class A or not. So if there are 420 videos to be used in the training set for just the "A vs. rest" classifier, then yes, you're going to fit 420 different SVMs.
And in fact, if you are tuning parameters for the SVM, this is where you'll do it. For example, if you're trying to choose a penalty term or a coefficient in a polynomial kernel, you will repeat the entire training process (yes, all 420 different trained SVMs) for every combination of parameters you want to search over. For each collection of parameters, you associate with it the sum of the accuracy scores from the 420 LOO-trained classifiers.
Once that's all done, you choose the parameter set with the best LOO score, and voilà, that is your 'A vs. rest' classifier. Rinse and repeat for 'B vs. rest' and so on.
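That LOO-based parameter sweep can be written compactly with scikit-learn's grid search over a leave-one-out splitter; the C grid and the random training data below are placeholders for one "A vs. rest" problem:

import numpy as np
from sklearn.model_selection import GridSearchCV, LeaveOneOut
from sklearn.svm import LinearSVC

X_train = np.random.rand(100, 50)            # placeholder features
y_train = np.random.randint(0, 2, 100)       # 1 = class A, 0 = rest

search = GridSearchCV(
    LinearSVC(max_iter=5000),
    param_grid={"C": [0.01, 0.1, 1, 10]},    # the parameters being swept
    cv=LeaveOneOut(),                        # one fit per left-out sample
    scoring="accuracy",
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)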
With all of this going on, there is rightfully a big worry that you are overfitting the data, especially if many of the "negative" samples have to be repeated from class to class.
But this is why you sent that testing set to Neptune. Once you finish with all of the LOO-based, parameter-swept SVMs and you've got the final classifier in place, you execute that classifier on your actual test set (from Neptune), and that will tell you whether the whole thing shows efficacy in predicting unseen data.
This whole exercise is obviously computationally expensive, so instead people will sometimes use leave-P-out, where P is much larger than 1. And instead of repeating that process until all of the samples have spent some time in a left-out group, they will just repeat it a "reasonable" number of times, for various definitions of reasonable.
In the leave-P-out situation, there are some algorithms which allow you to sample which points are left out in a way that represents the classes fairly. So if the "A" samples make up 40% of the data, you might want them to make up about 40% of the left-out set.
This doesn't really apply to LOO, for two reasons: (1) you're almost always going to perform LOO on every training data point, so trying to sample them in a fancy way would be irrelevant if they are all going to end up being used exactly once; (2) if you plan to use LOO for some number of times that is smaller than the sample size (not usually recommended), then drawing points randomly from the set will naturally reflect the relative frequencies of the classes, so if you planned to do LOO K times, simply taking a random size-K subsample of the training set and doing regular LOO on those would suffice.
In short, the papers you mentioned use the second scheme, i.e. leaving one video out from each class, which makes 14 videos for testing and the rest for training.
