Why is Monk's problems' test set bigger than their train set?


I noticed that every one of the Monk's problems has a test set bigger than its train set.
Why is this dataset organized like this? It seems strange to me, even for a dummy dataset meant for model comparison.
Monk1
Train samples: 124
Test samples: 432
Monk2
Train samples: 169
Test samples: 432
Monk3
Train samples: 122
Test samples: 432

From the machine learning point of view, it absolutely doesn't matter how big the test set is. Why does it bother you? The real world looks the exact same way: you have N labeled samples for training, but there are N*10, N*1000, N*10^9 or more real cases out there so each (manually labeled, fixed) test set will necessarily be too small. The goal is to have a representative set, covering everything we expect in the real world, and if it means to have a YUGE™ test set, then the best thing you can do is to have a test set larger than training set.
In this particular case (and I'm not familiar with this task), the website you cited reads:
There are three MONK's problems. The domains for all MONK's problems are the same (described below). One of the MONK's problems has noise added. For each problem, the domain has been partitioned into a train and test set.
The paper linked below
Wnek, J. and Michalski, R.S., "Comparing Symbolic and Subsymbolic Learning: Three Studies," in Machine Learning: A Multistrategy Approach, Vol. 4., R.S. Michalski and G. Tecuci (Eds.), Morgan Kaufmann, San Mateo, CA, 1993.
on page 20 reads as follows:
So in this particular scenario, the authors have chosen different training conditions, thus the three training sets. According to
Leondes, Cornelius T. Image processing and pattern recognition. Vol. 5. Elsevier, 1998, pp 307
they used all 432 available samples for testing and trained on a subset of this data.
Having an overlap between training and test data is considered bad practice, but who am I to judge research from 25 years ago in a field I'm not familiar with? Maybe it was too difficult to obtain more data and have a clean split.
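To see where the number 432 comes from, here is a small sketch that enumerates the whole MONK's domain. The attribute cardinalities are an assumption taken from the UCI dataset description (they are not stated in the text above), and the names a1 to a6 are only labels.

```python
from itertools import product

# Assumed attribute cardinalities (UCI description, not from the post above):
# a1, a2, a4 take 3 values; a3 and a6 take 2; a5 takes 4.
ATTRIBUTE_SIZES = {"a1": 3, "a2": 3, "a3": 2, "a4": 3, "a5": 4, "a6": 2}

# Every possible combination of attribute values.
domain = list(product(*(range(1, n + 1) for n in ATTRIBUTE_SIZES.values())))
print(len(domain))  # 432

# Training set sizes as quoted in the question.
train_sizes = {"Monk1": 124, "Monk2": 169, "Monk3": 122}
for name, n_train in train_sizes.items():
    print(name, f"{n_train / len(domain):.1%} of the domain is used for training")
```

Under that assumption the test set is simply the complete domain, which is why it has the same size for all three problems.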

Related

Different composition for training and test sets

Training and test sets in machine learning are normally discussed as though they will have the same composition, e.g. take X% of your examples as the training set, and the rest are the test set.
However, suppose you are trying to solve a classification problem - for simplicity, say binary classification, like distinguishing between photographs of horses and zebras. The classes are not equally common. Say 95% of photos are horses and the other 5% are zebras. If you feed that mix into a neural network, or any other machine learning algorithm, it will quickly settle on classifying everything as a horse and thereby achieving 95% accuracy.
There are such things as cost-sensitive neural networks, that can penalize a false negative more heavily than a false positive. But the added complexity increases development time and creates more opportunities for bugs to creep in.
A simpler, more general method is resampling, where you train the network on equal quantities of each class. If you have 10,000 pictures, take 250 zebra pictures, combine them with 250 horse pictures, and use that as your training set. The other 250 zebras can go with another 4,750 horses to form your test set. That way, you can calculate a confusion matrix on the test set that will reflect the performance that can be expected of the trained network in the wild.
This means the training set and test set have deliberately different composition.
So my question: is it indeed normal for training set and test set to have different composition, and this just isn't often mentioned? Or am I missing something?
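The resampling scheme described above can be sketched as follows. This is only an illustration with made-up picture ids; the seed and the tuple representation are assumptions, not part of the question.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical dataset: 10,000 picture ids, 5% zebras and 95% horses.
zebras = [("zebra", i) for i in range(500)]
horses = [("horse", i) for i in range(9500)]
rng.shuffle(zebras)
rng.shuffle(horses)

# Balanced training set: 250 pictures of each class.
train = zebras[:250] + horses[:250]

# Test set keeps the natural 5% / 95% mix: 250 zebras + 4,750 horses.
test = zebras[250:500] + horses[250:5000]

def class_share(samples, label):
    """Fraction of samples carrying the given label."""
    return sum(1 for lbl, _ in samples if lbl == label) / len(samples)

print(class_share(train, "zebra"))  # 0.5
print(class_share(test, "zebra"))   # 0.05
```

Note that with these numbers 4,500 horse pictures end up in neither set.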

Overfitting my model over my training data of a single sample

I am trying to over-fit my model over my training data that consists of only a single sample. The training accuracy comes out to be 1.00. But, when I predict the output for my test data which consists of the same single training input sample, the results are not accurate. The model has been trained for 100 epochs and the loss ~ 1e-4.
What could be the possible sources of error?

As mentioned in the comments of your post, it isn't possible to give specific advice without you first providing more details.
Generally speaking, your approach to overfitting a tiny batch (in your case one image) is in essence providing three sanity checks, i.e. that:
backprop is functioning
the weight updates are doing their job
the learning rate is in the correct order of magnitude
As is pointed out by Andrej Karpathy in Lecture 5 of CS231n course at Stanford - "if you can't overfit on a tiny batch size, things are definitely broken".
This means, given your description, that your implementation is incorrect. I would start by checking each of the three points listed above. For example, alter your test somehow by picking several different images or a batch size of 5 images instead of one. You could also review your predict function, as that is where the discrepancy most likely lies, given that you are getting near-zero error during training (and so on validation?).
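As a minimal sketch of the sanity check, here is a tiny linear model (a stand-in, not your network) overfitted on a single sample. The point it illustrates is that training and prediction should go through the same forward function; the data, learning rate and function names are all assumptions made for the example.

```python
def predict(weights, bias, x):
    # Single forward pass, shared by training and inference.
    return sum(w * xi for w, xi in zip(weights, x)) + bias

def overfit_single_sample(x, y, lr=0.1, epochs=100):
    """Plain gradient descent on the loss 0.5 * (prediction - y) ** 2."""
    weights, bias = [0.0] * len(x), 0.0
    for _ in range(epochs):
        error = predict(weights, bias, x) - y  # d(loss) / d(prediction)
        weights = [w - lr * error * xi for w, xi in zip(weights, x)]
        bias -= lr * error
    return weights, bias

x, y = [1.0, 2.0], 3.0  # the single (made-up) training sample
weights, bias = overfit_single_sample(x, y)

train_loss = 0.5 * (predict(weights, bias, x) - y) ** 2
test_prediction = predict(weights, bias, x)  # "test" on the same sample
print(train_loss, test_prediction)
```

If the training loss is tiny but the prediction for the very same input is off, the two code paths differ somewhere (preprocessing, dropout left on, a different set of weights being loaded, and so on).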

Why do Tensorflow tf.learn classification results vary a lot?

I use the TensorFlow high-level API tf.learn to train and evaluate a DNN classifier for a series of binary text classifications (actually I need multi-label classification but at the moment I check every label separately). My code is very similar to the tf.learn Tutorial
classifier = tf.contrib.learn.DNNClassifier(
    hidden_units=[10],
    n_classes=2,
    dropout=0.1,
    feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(training_set.data))
classifier.fit(x=training_set.data, y=training_set.target, steps=100)
val_accuracy_score = classifier.evaluate(x=validation_set.data, y=validation_set.target)["accuracy"]
Accuracy score varies roughly from 54% to 90%, with 21 documents in the validation (test) set which are always the same.
What does this very large variation mean? I understand there are some random factors (e.g. dropout), but to my understanding the model should converge towards an optimum.
I use words (lemmas), bi- and trigrams, sentiment scores and LIWC scores as features, so I do have a very high-dimensional feature space, with only 28 training and 21 validation documents. Can this cause problems? How can I consistently improve the results apart from collecting more training data?
Update: To clarify, I generate a dictionary of occurring words and n-grams and discard those that occur only 1 time, so I only use words (n-grams) that exist in the corpus.

This has nothing to do with TensorFlow. The dataset is so small that you can obtain almost any result. You have 28 + 21 points in a space with an effectively "infinite" number of dimensions (there are around 1,000,000 English words, thus 10^18 possible trigrams; some of them do not exist, and most certainly do not occur in your 49 documents, but you still have at least 1,000,000 dimensions). For such a problem, you have to expect huge variance in the results.
How can I consistently improve the results apart from collecting more training data?
You pretty much cannot. This is simply way too small a sample to do any statistical analysis.
Consequently, the best you can do is change the evaluation scheme: instead of splitting the data 28/21, do 10-fold cross-validation. With ~50 points this means running 10 experiments, each with about 45 training documents and 4 to 5 test ones, and averaging the results. This is the only thing you can do to reduce the variance; however, remember that even with CV, a dataset this small gives you no guarantee of how well your model will actually behave "in the wild" (once applied to never-before-seen data).
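A rough sketch of that evaluation scheme, using only the standard library. The labels and the majority-class "model" below are placeholders made up for the example; in practice `fit_and_score` would train and evaluate your real classifier on the given indices.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs; every sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    for fold in range(k):
        test_idx = indices[fold::k]  # every k-th shuffled index, starting at `fold`
        held_out = set(test_idx)
        train_idx = [i for i in indices if i not in held_out]
        yield train_idx, test_idx

def cross_val_accuracy(n_samples, fit_and_score, k=10, seed=0):
    """Average the accuracy returned by fit_and_score(train_idx, test_idx) over k folds."""
    scores = [fit_and_score(tr, te) for tr, te in k_fold_indices(n_samples, k, seed)]
    return sum(scores) / len(scores)

# Hypothetical binary labels for the 28 + 21 = 49 documents.
labels = [0] * 25 + [1] * 24

def majority_baseline(train_idx, test_idx):
    # Placeholder model: always predict the most common training label.
    ones = sum(labels[i] for i in train_idx)
    majority = 1 if ones * 2 > len(train_idx) else 0
    return sum(labels[i] == majority for i in test_idx) / len(test_idx)

folds = list(k_fold_indices(len(labels)))
mean_accuracy = cross_val_accuracy(len(labels), majority_baseline)
print(mean_accuracy)
```

With 49 documents and 10 folds, each fold holds out 4 or 5 documents and trains on the remaining 45 or 44.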

Which Machine Learning technique is most valid in this scenario?

I am fairly new to Machine Learning and have recently been working on a new classification problem to which I'm giving the link below. Since cars interest me, I decided to go with a dataset that deals with the classification of cars based on several attributes.
http://archive.ics.uci.edu/ml/datasets/Car+Evaluation
Now, I understand that there might be a number of ways to go about this particular case, but the real issue here is - Which particular algorithm might be most effective?
I am considering Regression, SVM, KNN, and Hidden Markov Models. Any suggestions at all would be greatly appreciated.

You have a multi-class classification problem with 1728 samples. The features fall into 6 groups:
buying: v-high, high, med, low
maint: v-high, high, med, low
doors: 2, 3, 4, 5-more
persons: 2, 4, more
lug_boot: small, med, big
safety: low, med, high
What you need to do is create one binary feature per attribute value, like this:
buying_v-high, buying_high, buying_med, buying_low, maint_v-high, ...
In the end you'll have
4+4+4+3+3+3 = 21
features. The output classes are:
class     N      N[%]
-----------------------------
unacc     1210   (70.023 %)
acc        384   (22.222 %)
good        69   ( 3.993 %)
v-good      65   ( 3.762 %)
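The feature expansion described above could look roughly like this. The value spellings follow the list in this answer (the raw UCI file may spell them differently), and the column naming scheme is just an assumption for the example.

```python
# Category values as listed in this answer (raw file spellings may differ).
FEATURES = {
    "buying": ["v-high", "high", "med", "low"],
    "maint": ["v-high", "high", "med", "low"],
    "doors": ["2", "3", "4", "5-more"],
    "persons": ["2", "4", "more"],
    "lug_boot": ["small", "med", "big"],
    "safety": ["low", "med", "high"],
}

# One binary column per (attribute, value) pair.
COLUMN_NAMES = [f"{name}_{value}" for name, values in FEATURES.items() for value in values]

def one_hot(row):
    """Turn a dict like {"buying": "low", ...} into a list of 21 zeros and ones."""
    return [1 if row[name] == value else 0
            for name, values in FEATURES.items() for value in values]

example = {"buying": "low", "maint": "med", "doors": "4",
           "persons": "more", "lug_boot": "big", "safety": "high"}
encoded = one_hot(example)
print(len(COLUMN_NAMES))  # 21
```

Each encoded row contains exactly six ones, one per original attribute.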
You need to try several classification algorithms to see which one works better. For evaluation you can use cross-validation, or you can put away, say, 728 of the samples and evaluate on those.
For the models themselves, you can iterate over, say, 10 different classification models available in machine learning libraries and check which one performs best. I suggest using scikit-learn for simplicity.
You can find a simple iterator over several classifiers in this script.
Remember that you need to tune some parameters for each model and you shouldn't tune them on the test set. So it is better to divide your samples into 1000 (training set), 350 (development set), 378 (test set). Use the development set to tune your parameters and to choose the best performing model and then use the test set to evaluate that model over unseen data.
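A minimal sketch of that three-way split, working on row indices only. The seed is arbitrary and the variable names are made up for the example.

```python
import random

N_SAMPLES = 1728
indices = list(range(N_SAMPLES))
random.Random(42).shuffle(indices)  # arbitrary fixed seed for reproducibility

train_idx = indices[:1000]     # fit the models
dev_idx = indices[1000:1350]   # tune parameters and pick the best model
test_idx = indices[1350:]      # final evaluation on unseen data only

print(len(train_idx), len(dev_idx), len(test_idx))  # 1000 350 378
```

Because good and v-good each make up under 4% of the data, it may be worth checking that every split contains some examples of each class.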

Leave one out accuracy for multi class classification

I am a bit confused about how to use the leave-one-out (LOO) method for calculating accuracy in the case of multi-class, one-vs-rest classification.
I am working on the YUPENN Dynamic Scene Recognition dataset, which contains 14 categories with 30 videos in each category (a total of 420 videos). Let's name the 14 classes {A,B,C,D,E,F,G,H,I,J,K,L,M,N}.
I am using linear SVM for one v/s rest classification.
Let's say I want to find the accuracy for class 'A'. When I perform 'A' v/s 'rest', I need to exclude one video while training and test the model on the video I excluded. Should this excluded video come from class A, or can it come from any of the classes?
In other words, to find the accuracy for class 'A', should I perform SVM with LOO 30 times (leaving out each video from class 'A' exactly once) or 420 times (leaving out each video from all the classes exactly once)?
I have a feeling that I got this all mixed up. Can anyone provide a short schematic of the right way to perform multi-class classification using LOO?
Also, how do I perform this using libsvm in MATLAB?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The number of videos in the dataset is small, and thus I can't afford to create a separate TEST set (the one that was supposed to be sent to Neptune). Instead, I have to make full use of the dataset, because each video provides some new/unique information. In scenarios like this, I have read that people use LOO as a measure of accuracy (when an isolated TEST set is not affordable). They call it the Leave-One-Video-Out (LOVO) experiment.
The people who have worked on Dynamic Scene Recognition have used this methodology for testing accuracy. In order to compare the accuracy of my method against their method, I need to use the same evaluation process. But they have just mentioned that they are using LOVO for accuracy. Not much detail apart from that is provided. I am a newbie in this field and thus it is a bit confusing.
According to what I can think of, LOVO can be done in two ways:
1) leave one video out of 420 videos. Train 14 'one-v/s-rest' classifiers using 419 videos as the training set.('A' v/s 'rest', 'B' v/s 'rest', ........'N' v/s 'rest').
Evaluate the left out video using the 14 classifiers. Label it with the class which gives maximum confidence score. Thus one video is classified. We follow the same procedure for labelling all the 420 videos. Using these 420 labels we can find the confusion matrix, find out the false positives/negatives, precision,recall, etc.
2) From each of the 14 classes I leave one video. Which means I choose 406 videos for training and 14 for testing. Using the 406 videos I find out the 14 'one-v/s-rest' classifiers. I evaluate each of the 14 videos in the test set and give them labels based on maximum confidence score. In the next round I again leave out 14 videos, one from each class. But this time the set of 14 is such that, none of them were left out in the previous round. I again train and evaluate the 14 videos and find out labels. In this way, I carry on this process 30 times, with a non-repeating set of 14 videos each time. In the end all 420 videos are labelled. In this case as well, I calculate confusion matrix, accuracy, precision, and recall, etc.
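To make the bookkeeping of the second scheme concrete, here is a small sketch that only generates the train/test partitions (no classifier involved). Picking the r-th video of every class in round r is just one possible way to get non-repeating sets of 14; the identifiers are made up for the example.

```python
CLASSES = "ABCDEFGHIJKLMN"   # 14 categories
VIDEOS_PER_CLASS = 30

# Identify each video by (class, index within class).
all_videos = [(c, i) for c in CLASSES for i in range(VIDEOS_PER_CLASS)]

def one_per_class_rounds():
    """Yield (train, test) pairs: round r holds out video r of every class."""
    for r in range(VIDEOS_PER_CLASS):
        test = [(c, r) for c in CLASSES]
        held_out = set(test)
        train = [v for v in all_videos if v not in held_out]
        yield train, test

rounds = list(one_per_class_rounds())
print(len(rounds), len(rounds[0][0]), len(rounds[0][1]))  # 30 406 14
```

Over the 30 rounds every one of the 420 videos is held out exactly once.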
Apart from these two methods, LOVO could be done in many other ways. The papers on Dynamic Scene Recognition do not say how they perform LOVO. Is it safe to assume that they are using the 1st method? Is there any way of deciding which method would be better? Would there be a significant difference in the accuracies obtained by the two methods?
Following are some of the recent papers on Dynamic Scene Recognition, for reference. They mention LOVO in their evaluation sections.
1)http://www.cse.yorku.ca/vision/publications/FeichtenhoferPinzWildesCVPR2014.pdf
2)http://www.cse.yorku.ca/~wildes/wildesBMVC2013b.pdf
3)http://www.seas.upenn.edu/~derpanis/derpanis_lecce_daniilidis_wildes_CVPR_2012.pdf
4)http://webia.lip6.fr/~thomen/papers/Theriault_CVPR_2013.pdf
5)http://www.umiacs.umd.edu/~nshroff/DynScene.pdf

When using cross validation it is good to keep in mind that it applies to training a model, and not usually to the honest-to-god, end-of-the-whole-thing measures of accuracy, which are instead reserved for measures of classification accuracy on a testing set that has not been touched at all or involved in any way during training.
Let's focus just on one single classifier that you plan to build. The "A vs. rest" classifier. You are going to separate all of the data into a training set and a testing set, and then you are going to put the testing set in a cardboard box, staple it shut, cover it with duct tape, place it in a titanium vault, and attach it to a NASA rocket that will deposit it in the ice covered oceans of Neptune.
Then let's look at the training set. When we train with the training set, we'd like to leave some of the training data to the side, just for calibrating, but not as part of official Neptune ocean test set.
So what we can do is tell every data point (in your case it appears that a data point is a video-valued object) to sit out once. We don't care if it comes from class A or not. So if there are 420 videos which would be used in the training set for just the "A vs. rest" classifier, then yeah, you're going to fit 420 different SVMs.
And in fact, if you are tweaking parameters for the SVM, this is where you'll do it. For example, if you're trying to choose a penalty term or a coefficient in a polynomial kernel or something, then you will repeat the entire training process (yep, all 420 different trained SVMs) for all of the combinations of parameters you want to search through. And for each collection of parameters, you will associate with it the sum of the accuracy scores from the 420 LOO trained classifiers.
Once that's all done, you choose the parameter set with the best LOO score, and voila, that is your "A vs. rest" classifier. Rinse and repeat for "B vs. rest" and so on.
With all of this going on, there is rightfully a big worry that you are overfitting the data. Especially if many of the "negative" samples have to be repeated from class to class.
But this is why you sent that testing set to Neptune. Once you finish with all of the LOO-based parameter-swept SVMs and you've got the final classifier in place, you execute that classifier across your actual test set (from Neptune), and that will tell you if the entire thing is showing efficacy in predicting on unseen data.
This whole exercise is obviously computationally expensive. So instead people will sometimes use Leave-P-Out, where P is much larger than 1. And instead of repeating that process until all of the samples have spent some time in a left-out group, they will just repeat it a "reasonable" number of times, for various definitions of reasonable.
In the Leave-P-Out situation, there are some algorithms which allow you to sample which points are left out in a way that represents the classes fairly. So if the "A" samples make up 40% of the data, you might want them to take up about 40% of the leave-out set.
This doesn't really apply to LOO, for two reasons: (1) you're almost always going to perform LOO on every training data point, so trying to sample them in a fancy way would be irrelevant if they are all going to end up being used exactly once. (2) If you plan to use LOO for some number of times that is smaller than the sample size (not usually recommended), then just drawing points randomly from the set will naturally reflect the relative frequencies of the classes, and so if you planned to do LOO K times, then simply taking a random size-K subsample of the training set, and doing regular LOO on those, would suffice.
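The mechanics of "every data point sits out once, whatever its class" can be sketched as below. To keep the example dependency-free it uses a simple linear scorer that separates class means, which is a stand-in and NOT an SVM; the toy data and all function names are assumptions made for illustration.

```python
def mean(points):
    return [sum(col) / len(points) for col in zip(*points)]

def train_one_vs_rest(X, y, positive):
    """Stand-in linear scorer (NOT an SVM): hyperplane halfway between the class means."""
    pos = mean([x for x, lbl in zip(X, y) if lbl == positive])
    neg = mean([x for x, lbl in zip(X, y) if lbl != positive])
    w = [p - n for p, n in zip(pos, neg)]
    b = -sum(wi * (p + n) / 2 for wi, p, n in zip(w, pos, neg))
    return w, b

def score(model, x):
    w, b = model
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def leave_one_out_predictions(X, y):
    classes = sorted(set(y))
    predictions = []
    for i in range(len(X)):
        # Hold out sample i, regardless of which class it belongs to.
        X_train, y_train = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        models = {c: train_one_vs_rest(X_train, y_train, c) for c in classes}
        # Label the held-out sample with the class whose scorer is most confident.
        predictions.append(max(classes, key=lambda c: score(models[c], X[i])))
    return predictions

# Toy data: three well-separated classes with four 2-D points each.
X = [[0, 0], [1, 0], [0, 1], [1, 1],
     [10, 0], [11, 0], [10, 1], [11, 1],
     [0, 10], [1, 10], [0, 11], [1, 11]]
y = ["A"] * 4 + ["B"] * 4 + ["C"] * 4

predictions = leave_one_out_predictions(X, y)
accuracy = sum(p == t for p, t in zip(predictions, y)) / len(y)
print(accuracy)
```

In the 420-video setting the same loop would run 420 times and train 14 one-vs-rest models per iteration, repeated again for every parameter combination being searched.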

In short, the papers you mentioned use the second approach, i.e. leaving out one video from each class, which gives 14 videos for testing and the rest for training.
