Machine Learning Training & Test data split method

I was running a random forest classification model and initially split the data into train (80%) and test (20%). However, the predictions had too many false positives, which I think was because there was too much noise in the training data, so I decided to split the data differently. Here is how I did it.
Since I thought the high false-positive count was due to noise in the training data, I made the training data contain an equal number of rows for each value of the target variable. For example, if I have 10,000 rows and the target variable is 8,000 (0) and 2,000 (1), I made the training data a total of 4,000 rows, 2,000 (0) and 2,000 (1), so that the training data now has more signal.
When I tried this new splitting method, it predicted much better, increasing the recall for the positive class from 14% to 70%.
I would love to hear your feedback if I am doing anything wrong here. I am concerned that I am making my training data biased.

When you have an unequal number of data points per class in the training set, the baseline (random prediction) changes.
By noisy data, I think you mean that the number of training points for one class is larger than for the other. This is not really called noise. It is actually bias.
For example: you have 10,000 data points in the training set, 8,000 of class 1 and 2,000 of class 0. I can predict class 1 all the time and already get 80% accuracy. This induces a bias, and the baseline for 0-1 classification is no longer 50%.
To remove this bias, you can either intentionally balance the training set as you did, or change the error function by giving each class a weight inversely proportional to its number of points in the training set.
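As a rough illustration of the two ideas above (the shifted baseline and inverse-frequency weights), here is a minimal sketch in plain Python. The function names are made up for this example, and the counts are the ones from the question (8,000 zeros, 2,000 ones); the weights are scaled so that every class contributes the same total weight, which is one common convention and not the only possible one.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent class."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)

def inverse_frequency_weights(labels):
    """Weight per class, inversely proportional to its count.

    Scaled so that every class contributes the same total weight.
    """
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}

# The question's example: 8,000 rows of class 0 and 2,000 rows of class 1.
labels = [0] * 8000 + [1] * 2000
baseline = majority_baseline_accuracy(labels)
weights = inverse_frequency_weights(labels)
print(baseline)  # 0.8
print(weights)   # {0: 0.625, 1: 2.5}
```

A model that scores 80% accuracy on this data has therefore learned nothing beyond the class frequencies, and the minority class gets four times the weight of the majority class.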

Actually, what you did is right, and this process is similar to "stratified sampling".
In your first model, where recall was very low, the model did not pick up enough correlation between the features and the target for the positive class (1). It might also have somewhat overfitted the negative class. This is called a "high bias, high variance" situation.
"Stratified sampling" simply means that when you extract a sample from a large population, you control how much of each class ends up in the sample (here, roughly equal proportions), so that the model's training assumptions are more accurate and reliable.
In the second case, the model was able to learn the relationships between the features and the target, and the characteristics of the positive and negative classes were clearly distinguishable.
Eliminating noise is part of data preparation and should be done before the data goes into a model.
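The balancing discussed in this thread can be sketched as follows. This is only an illustration in plain Python with hypothetical names (`split_then_undersample`, `label_of`). It assumes you want to hold out the test set first and balance only the training part, so that the test set keeps the original 80/20 class mix and the reported metrics reflect real-life data.

```python
import random

def split_then_undersample(rows, label_of, test_fraction=0.2, seed=0):
    """Hold out a test set first, then balance only the training part."""
    rng = random.Random(seed)
    rows = list(rows)
    rng.shuffle(rows)
    n_test = int(len(rows) * test_fraction)
    test, train = rows[:n_test], rows[n_test:]

    # Group training rows by class and cut each class down to the rarest one.
    by_class = {}
    for row in train:
        by_class.setdefault(label_of(row), []).append(row)
    n_min = min(len(group) for group in by_class.values())
    balanced = [row for group in by_class.values()
                for row in rng.sample(group, n_min)]
    rng.shuffle(balanced)
    return balanced, test

# Toy data shaped like the question: (row id, label), 8,000 zeros and 2,000 ones.
data = [(i, 0) for i in range(8000)] + [(i, 1) for i in range(8000, 10000)]
train, test = split_then_undersample(data, label_of=lambda row: row[1])
print(len(train), len(test))
```

Evaluating on a test set that was balanced the same way would overstate precision, because false positives come mostly from the majority class that was thinned out.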


How to estimate the accuracy on a large dataset?

I have a deep learning model (handed over from a former colleague). For some reason, the train/dev set is missing.
In my situation, I want to classify my dataset into 100 categories. The dataset is extremely imbalanced, and its size is in the tens of millions of records.
First, I ran the model and got predictions on the whole dataset.
Then I sampled 100 records per category (according to the prediction) and got a 10,000-record test set.
Next, I labeled the ground truth of each record in the test set, calculated the precision, recall, and F1 for each category, and got the micro-F1 and macro-F1.
How can I estimate the accuracy or other metrics on the whole dataset? Is it correct to use the weighted sum of each category's precision (the weight being that category's share of the predictions on the whole dataset) as the estimate?
Since the distribution of predicted categories is not the same as the distribution of real categories, I guess the weighted approach does not work. Can anyone explain it?
The issue with a weighted average is that if your classifier performs well on the majority class but poorly on the minority classes (which is the typical scenario), this will not be reflected in the score.
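One way to see this, as a sketch: assume the labeled records were drawn uniformly at random within each predicted category, and write $w_k$ for the share of the whole dataset that the model predicts as category $k$ (this notation is introduced here and is not from the question). By the law of total probability, the prediction-weighted sum of precisions is an estimate of plain overall accuracy:

```latex
\text{accuracy}
  = P(\hat{y} = y)
  = \sum_{k} P(\hat{y} = k)\, P(y = k \mid \hat{y} = k)
  = \sum_{k} w_k \cdot \text{precision}_k
```

Because the $w_k$ of frequently predicted categories are large, those categories dominate the sum, and poor performance on rare categories barely moves it.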
One of the recommended approaches is instead to use the balanced accuracy score (scikit-learn has an implementation). Basically, it is the average of the recall scores: for each class, it looks at how many of its observations were correctly classified, and averages this across all classes. This will give you a sensible overall score to report.
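To make the definition concrete, here is a small sketch of balanced accuracy in plain Python, computed as the mean of per-class recall. The function below is written for illustration; in scikit-learn the corresponding function is `balanced_accuracy_score` in `sklearn.metrics`. The toy labels are invented for the example.

```python
from collections import defaultdict

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every class counts equally."""
    hits, totals = defaultdict(int), defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        totals[truth] += 1
        hits[truth] += int(truth == pred)
    recalls = [hits[c] / totals[c] for c in totals]
    return sum(recalls) / len(recalls)

# 8 majority-class items, all correct; 2 minority-class items, one missed.
y_true = ["a"] * 8 + ["b"] * 2
y_pred = ["a"] * 8 + ["b", "a"]
plain = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
balanced = balanced_accuracy(y_true, y_pred)
print(plain, balanced)  # 0.9 0.75
```

Plain accuracy hides the missed minority item, while balanced accuracy drops to 0.75 because the minority class has a recall of only 0.5.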

Machine Learning - Huge Only positive text dataset

I have a dataset with thousands of sentences belonging to a subject. I would like to know the best way to create a classifier that will predict a text as "True" or "False" depending on whether it talks about that subject or not.
I've been using solutions with Weka (basic classifiers) and Tensorflow (neural network approaches).
I use string to word vector to preprocess the data.
Since there are no negative samples, I am dealing with a single class. I've tried a one-class classifier (libSVM in Weka), but the number of false positives is so high that I cannot use it.
I also tried adding negative samples, but when the text to predict does not fall in the negative space, the classifiers I've tried (NB, CNN, ...) tend to predict it as a false positive. I guess it's because of the sheer amount of positive samples.
I'm open to discarding ML as the tool to predict the new incoming data if necessary.
Thanks for any help
I have eventually added data for the negative class and built a Multinomial Naive Bayes classifier, which is doing the job as expected.
(the size of the data added is around one million samples :) )
My answer is based on the assumption that adding at least 100 negative samples to your dataset of 1,000 positive samples is acceptable to you, since I have not yet had a reply to my question about it.
Since detecting a specific topic looks like a particular case of topic classification, I would recommend starting with a classification approach with two simple classes: one class for your topic and another for all other topics.
I succeeded with the same approach on a face recognition task. At first I built a model with one output neuron, with a high output level when a face was detected and a low one when no face was detected.
However, that approach gave me accuracy that was too low: less than 80%.
But when I tried using 2 output neurons, one class for a face being present in the image and another for no face detected, it gave me more than 90% accuracy with an MLP, even without using a CNN.
The key point here is using the softmax function for the output layer. It gives a significant increase in accuracy. In my experience, it increased accuracy on the MNIST dataset, even for an MLP, from 92% up to 97% for the same model.
About the dataset. Most supervised classification algorithms, at least in my experience, are more efficient with an equal number of samples for each class in the training set. In fact, if one class has less than 10% of the average number of samples of the other classes, the model becomes almost useless at detecting that class. So if you have 1,000 samples for your topic, I suggest creating 1,000 negative samples covering as many different topics as possible.
Alternatively, if you don't want to create such a big set of negative samples, you can create a smaller set of negative samples and use batch training with a batch size of 2x your number of negative samples. To do so, split your positive samples into n chunks, each about the size of the negative set, and then train your NN with n batches per iteration of the training process, where batch i consists of chunk[i] of the positive samples plus all of your negative samples. Just be aware that lower accuracy will be the price of this trade-off.
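The batching scheme just described can be sketched like this. It is only an illustration in plain Python: `balanced_batches` is a hypothetical helper, the sample names are placeholders, and the actual training step on each batch is left out.

```python
def balanced_batches(positives, negatives):
    """Yield batches made of one chunk of positives plus all negatives."""
    if not negatives:
        raise ValueError("need at least one negative sample")
    chunk = len(negatives)  # each positive chunk is about as large as the negative set
    for start in range(0, len(positives), chunk):
        yield positives[start:start + chunk] + negatives

positives = [f"pos_{i}" for i in range(1000)]
negatives = [f"neg_{i}" for i in range(100)]
batches = list(balanced_batches(positives, negatives))
print(len(batches), len(batches[0]))  # 10 200
```

Every positive sample is seen once per pass, while each negative sample is repeated in all ten batches, which is why the model may end up memorizing the small negative set.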
Also, you could consider creating a more generic topic detector: figure out all the topics that can appear in the texts your model should analyze, for example 10 topics, and create a training dataset with 1,000 samples per topic. This can also give higher accuracy.
One more point about the dataset. The best practice is to train your model on only part of the dataset, for example 80%, and use the remaining 20% for validation. Validating on data the model has not seen before gives you a good estimate of your model's accuracy in real life, rather than on the training set, and helps you spot overfitting.
About building the model. I like a "from simple to complex" approach. So I would suggest starting with a simple MLP with a softmax output and a dataset with 1,000 positive and 1,000 negative samples. After reaching 80%-90% accuracy, you can consider using a CNN for your model, and I would also suggest increasing the size of the training dataset, because deep learning algorithms are more efficient with bigger datasets.
For text data you can use Spy EM.
The basic idea is to combine your positive set with a whole bunch of random samples, some of which you hold out. You initially treat all the random documents as the negative class, and train a classifier with your positive samples and these negative samples.
Now some of those random samples will actually be positive, and you can conservatively relabel any documents that are scored higher than the lowest scoring held out true positive samples.
Then you iterate this process until it stabilizes.
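A minimal sketch of the loop as described in this answer, not a reference implementation of Spy EM. It assumes a toy one-dimensional scorer in place of a real text classifier, and the names `train_scorer` and `spy_relabel` are hypothetical. The held-out positives act as the "spies" whose lowest score sets the relabeling threshold.

```python
def train_scorer(pos, neg):
    """Toy stand-in for a classifier: higher score means closer to the positives."""
    mean_pos = sum(pos) / len(pos)
    mean_neg = sum(neg) / len(neg)
    return lambda x: abs(x - mean_neg) - abs(x - mean_pos)

def spy_relabel(positives, unlabeled, n_spies, max_iter=10):
    """Iteratively move unlabeled items that outscore the weakest spy to the positives."""
    spies, pos = positives[:n_spies], list(positives[n_spies:])
    neg = list(unlabeled)  # initially treat every random sample as negative
    for _ in range(max_iter):
        score = train_scorer(pos, neg)
        threshold = min(score(s) for s in spies)
        moved = [x for x in neg if score(x) > threshold]
        # Stop when nothing changes, or when relabeling would empty the negatives.
        if not moved or len(moved) == len(neg):
            break
        pos += moved
        neg = [x for x in neg if score(x) <= threshold]
    return pos, neg

positives = [9.0, 9.5, 10.0, 10.5, 11.0, 10.2]
unlabeled = [0.0, 0.5, 1.0, 1.5, 10.1, 10.3, 0.2]  # two hidden positives
pos, neg = spy_relabel(positives, unlabeled, n_spies=2)
print(sorted(neg))  # [0.0, 0.2, 0.5, 1.0, 1.5]
```

With real text you would replace `train_scorer` with an actual classifier that returns a score per document; the relabeling rule and the stopping condition stay the same.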

What do inconsistent test results mean?

I'm doing some research on CNNs for text classification using tensorflow. When I run my model I get a very high training accuracy (around 100%). However, on the test split I get inconsistent accuracy results (sometimes 11% and sometimes 90%).
Moreover, I also noticed that the training loss decreases until it reaches small numbers like 0.000499564048368, while the test loss does not, and sometimes it reaches high values like 70. What does this mean? Any ideas?
If you get very high training accuracy and bad testing accuracy, you are almost definitely overfitting. To get a better picture of what your model's real accuracy is, use cross-validation.
Cross-validation splits the dataset into a training set and a validation set, and does this multiple times, changing which data goes into each set every time. This is beneficial because it can prevent scenarios where you train your model on one label and it cannot accurately identify another one. For example, picture a training set like this:
Feature1  Feature2  Label
x         y         0
a         y         0
b         c         1
If we train the model only on the first two data points, it will not be able to identify the third one, because it has only ever seen label 0 and so does not generalize.
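As a rough sketch of how cross-validation rotates the data, the plain-Python generator below produces the train/validation index pairs for k folds. `k_fold_indices` is a name made up for this example; in practice you would shuffle the data first (and usually rely on a library implementation) and train one model per split.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs; each index is validated exactly once."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i, val_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, val_idx

splits = list(k_fold_indices(n_samples=10, k=5))
for train_idx, val_idx in splits:
    # Fit on train_idx and score on val_idx here; only the indices are shown.
    print(val_idx, train_idx)
```

Averaging the validation score over all splits gives a steadier estimate than a single test split, which is helpful when one split reports 11% and another 90%.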

Correct ratio of positive to negative training examples for training a random forest-based binary classifier

I realized that the related question Positives/negatives proportion in train set suggested that a 1-to-1 ratio of positive to negative training examples is favorable for the Rocchio algorithm.
However, this question differs from the related question in that it concerns a random forest model and also in the following two ways.
1) I have plenty of training data to work with, and the main bottleneck on using more training examples is training iteration time. That is, I'd prefer not to take more than a night to train one ranker because I want to iterate quickly.
2) In practice, the classifier will probably see 1 positive example for every 4 negative examples.
In this situation, should I train using more negative examples than positive examples, or still equal numbers of positive and negative examples?
See the section titled "Balancing prediction error" from the official documentation on random forests, quoted below.
In summary, this seems to suggest that your training and test data should either:
1) reflect the 1:4 ratio of classes that your real-life data will have, or
2) have a 1:1 mix, in which case you should carefully adjust the weights per class, as demonstrated below, until the OOB error rate on your desired (smaller) class is lowered.
Hope that helps.
In some data sets, the prediction error between classes is highly
unbalanced. Some classes have a low prediction error, others a high.
This occurs usually when one class is much larger than another. Then
random forests, trying to minimize overall error rate, will keep the
error rate low on the large class while letting the smaller classes
have a larger error rate. For instance, in drug discovery, where a
given molecule is classified as active or not, it is common to have
the actives outnumbered by 10 to 1, up to 100 to 1. In these
situations the error rate on the interesting class (actives) will be
very high.
The user can detect the imbalance by outputs the error rates for the
individual classes. To illustrate 20 dimensional synthetic data is
used. Class 1 occurs in one spherical Gaussian, class 2 on another. A
training set of 1000 class 1's and 50 class 2's is generated, together
with a test set of 5000 class 1's and 250 class 2's.
The final output of a forest of 500 trees on this data is:
500 3.7 0.0 78.4
There is a low overall test set error (3.73%) but class 2 has over 3/4
of its cases misclassified.
The error balancing can be done by setting different weights for
the classes.
The higher the weight a class is given, the more its error rate is
decreased. A guide as to what weights to give is to make them
inversely proportional to the class populations. So set weights to 1
on class 1, and 20 on class 2, and run again. The output is:
500 12.1 12.7 0.0
The weight of 20 on class 2 is too high. Set it to 10 and try again,
500 4.3 4.2 5.2
This is pretty close to balance. If exact balance is wanted, the
weight on class 2 could be jiggled around a bit more.
Note that in getting this balance, the overall error rate went up.
This is the usual result - to get better balance, the overall error
rate will be increased.
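The numbers in the quoted passage can be checked with a little arithmetic. The sketch below assumes that each output row reads as number of trees, overall error %, class 1 error %, class 2 error % (inferred from the surrounding sentences, not stated explicitly in the quote), and recomputes the overall error as the size-weighted mean of the per-class errors on the 5000/250 test set.

```python
def overall_error(class_errors_pct, class_sizes):
    """Overall error rate (%) as the size-weighted mean of per-class error rates (%)."""
    total = sum(class_sizes)
    return sum(e * n for e, n in zip(class_errors_pct, class_sizes)) / total

test_sizes = [5000, 250]  # class 1 and class 2 in the quoted test set
unweighted = overall_error([0.0, 78.4], test_sizes)  # no class weights
weight_20 = overall_error([12.7, 0.0], test_sizes)   # weight 20 on class 2
weight_10 = overall_error([4.2, 5.2], test_sizes)    # weight 10 on class 2
print(round(unweighted, 2), round(weight_20, 2), round(weight_10, 2))
```

The results agree with the quoted overall figures (3.7, 12.1, 4.3) up to rounding of the per-class rates, and they show why the overall error is almost entirely determined by the large class.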
This might seem like a trivial answer, but the best thing I can suggest is to try on a small subset of your data (small enough that the algorithm trains quickly) and observe what your accuracy is when you use 1-1, 1-2, 1-3, etc.
Plot the results as you gradually increase the total number of examples for each ratio and see how the performance responds. Very often you'll find that a fraction of the data gets very close to the performance of training on the full dataset, in which case you can make an informed decision about your question.
Hope that helps.
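A minimal sketch of how the subsets for such an experiment could be drawn, in plain Python. `sample_with_ratio` is a hypothetical helper, the integer lists stand in for real examples, and the training and scoring of the random forest on each subset is left as a comment.

```python
import random

def sample_with_ratio(positives, negatives, n_pos, neg_per_pos, seed=0):
    """Draw n_pos positives and neg_per_pos times as many negatives."""
    rng = random.Random(seed)
    return rng.sample(positives, n_pos), rng.sample(negatives, n_pos * neg_per_pos)

positives = list(range(1000))
negatives = list(range(1000, 5000))
subsets = {
    f"1-{r}": sample_with_ratio(positives, negatives, n_pos=100, neg_per_pos=r)
    for r in (1, 2, 3, 4)
}
for name, (pos, neg) in subsets.items():
    # Train and score a model on each subset here; only the sizes are shown.
    print(name, len(pos), len(neg))
```

Repeating this for growing values of `n_pos` gives the learning curves per ratio that the answer suggests plotting.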

Machine Learning - Support Vector Machines

I came across an SVM example, but I didn't understand. I would appreciate it if somebody could explain how the prediction works. Please see the explanation below:
The dataset has 10,000 observations with 5 attributes (Sepal Width, Sepal Length, Petal Width, Petal Length, Label). The label is positive if the observation belongs to the I.setosa class, and negative if it belongs to some other class.
There are 6000 observations for which the outcome is known (i.e. they belong to the I.setosa class, so they get positive for the label attribute). The labels for the remaining 4000 are unknown, so the label was assumed to be negative. The 6000 observations and 2500 randomly selected observations from the remaining 4000 form the set for the 10-fold cross validation. SVM (10 fold cross validation) is then used for machine learning on the 8500 observations and the ROC is plotted.
What are we predicting here? The set has 6000 observations for which the values are already known. How did the remaining 2500 get negative labels? When SVM is used, some observations that are positive get a negative prediction. The prediction didn't make any sense to me here. Why are those 1500 observations excluded?
I hope my explanation is clear. Please let me know if I haven't explained anything clearly.
I think that the issue is a semantic one: you refer to the set of 4000 samples as being both "unknown" and "negative" -- which of these apply is the critical difference.
If the labels for the 4000 samples are truly unknown, then I'd do a 1-class SVM using the 6000 labelled samples [c.f. validation below]. And then the predictions would be generated by testing the N=4000 set to assess whether or not they belong to the setosa class.
If instead, we have 6000 setosa, and 4000 (known) non-setosa, we could construct a binary classifier on the basis of this data [c.f. validation below], and then use it to predict setosa vs. non on any other available non-labelled data.
Validation: Usually as part of the model construction process you will take only a subset of your labelled training data and use it to configure the model. For the unused subset, you apply the model to the data (ignoring the labels), and compare what your model predicts against what the true labels are in order to assess error rates. This applies both to the 1-class and the 2-class situations above.
Summary: if all of your data are labelled, then usually one will still make predictions for a subset of them (ignoring the known labels) as part of the model validation process.
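The validation step described above can be sketched as follows. This is a toy illustration in plain Python, not an SVM: `fit_threshold` is a made-up one-feature classifier that assumes positives have larger feature values, and the numbers are invented. The point is only that predictions on the held-out rows are made from the features alone and then compared with the known labels.

```python
def fit_threshold(train):
    """Toy 1-D classifier: threshold halfway between the two class means."""
    pos = [x for x, y in train if y == 1]
    neg = [x for x, y in train if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def holdout_error(train, held_out):
    """Fraction of held-out rows whose prediction disagrees with the known label."""
    threshold = fit_threshold(train)
    # Predict from the feature only, then compare with the label.
    wrong = sum((x > threshold) != bool(y) for x, y in held_out)
    return wrong / len(held_out)

train = [(5.0, 1), (5.2, 1), (4.8, 1), (1.0, 0), (1.4, 0)]
held_out = [(5.1, 1), (4.9, 1), (1.2, 0), (3.5, 0)]
error = holdout_error(train, held_out)
print(error)  # 0.25
```

Here one of the four held-out rows is misclassified even though its true label was known all along, which is the kind of "wrong prediction on labelled data" the question asks about.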
Your SVM classifier is trained to tell whether a new (unknown) instance is or is not an instance of I. Setosa. In other words, you are predicting whether the new, unlabeled instance is I.Setosa or not.
You found the incorrectly classified results probably because your training data has many more instances of the positive case than of the negative one. Also, it's common to have some error margin.
Summarizing: your SVM classifier learned how to identify I.Setosa instances; however, it was provided with too few examples of non-I.Setosa instances, which is likely to give you a biased model.
