How to estimate the accuracy on a large dataset?

I have a deep learning model (a handover from a former colleague). For some reason, the train/dev sets are missing.
I want to classify my dataset into 100 categories. The dataset is extremely imbalanced, and its size is in the tens of millions of records.
First of all, I ran the model and got predictions for the whole dataset.
Then I sampled 100 records per category (according to the prediction) and got a 10,000-record test set.
Next, I labeled the ground truth of each record in the test set, calculated precision, recall, and F1 for each category, and computed F1-micro and F1-macro.
How can I estimate accuracy or other metrics on the whole dataset? Is it correct to use the weighted sum of each category's precision (with the weight being that category's proportion of the predictions on the whole dataset) as the estimate?
Since the distribution of predicted categories is not the same as the distribution of real categories, I guess the weighted approach does not work. Can anyone explain this?

The issue if you take a weighted average is that if your classifier performs well on the majority class but poorly on minority classes (which is the typical scenario), this will not be reflected in the score.
One of the recommended approaches is rather to use the balanced accuracy score (see here for the scikit-learn implementation). Basically, it is an average of all the recall scores: for each class it looks at how many of its observations were correctly classified, and then averages this across all classes. This will give you a sensible overall score to report.
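A minimal sketch of computing balanced accuracy alongside the per-class metrics with scikit-learn (the label arrays here are toy stand-ins for the hand-labeled test set):

```python
from sklearn.metrics import balanced_accuracy_score, classification_report

# Toy stand-ins: replace with the ground-truth labels and model predictions
# for the sampled 10,000-record test set.
y_true = [0, 0, 0, 1, 1, 2, 2, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 2, 2, 2, 0, 2]

# Per-class precision/recall/F1 plus macro and weighted averages.
print(classification_report(y_true, y_pred, digits=3))

# Balanced accuracy is the unweighted mean of the per-class recalls.
print("Balanced accuracy:", balanced_accuracy_score(y_true, y_pred))
```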

Related

Semi-supervised learning for Regression task with tabular data?

Semi-supervised learning for a regression task with tabular data? I have about 20K labelled records and 200K unlabelled records.
A model gets better and better with semi-supervised learning, e.g. self-training. On top of that, if you have just 20K labelled records and are making predictions on 200K, you assume the variance and distribution of the features remain the same, which is not always true, so you end up making predictions with a sizeable error even if your model reached a mean R² of 0.90 on k-fold CV. I hope that explains why I am searching for a semi-supervised approach for regression!
A key assumption for most semi-supervised learning (SSL) is that nearby points (e.g. between an unlabelled and labelled point) are likely to share the same label. It seems you're expecting the variance and distribution of your unlabelled set to be different to your labelled set which may violate the above assumption.
If you still wish to go ahead with SSL, there are no conventional techniques for regression. But...
The literature on less conventional methods has been reviewed: https://www.researchgate.net/publication/325798856_Semi-supervised_regression_A_recent_review
And...
You could always convert to a classification approach (helps if the labels are relatively uniformly distributed). For instance, just choose the median prediction value in your labelled set, then everything less than that is 0 and everything greater is 1. You can then apply a conventional classification SSL technique, such as label propagation. Then, you get the predicted probabilities on a test set (e.g. use predict_proba()) and scale these back to your regression range. For example, if the minimum value in your labelled set was 0 and the max was 10, then just multiply the predicted probabilities by 10.
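A rough sketch of that conversion, assuming a tabular feature matrix and using scikit-learn's LabelPropagation; the toy data, kernel settings, and min/max rescaling are illustrative choices rather than a prescribed recipe:

```python
import numpy as np
from sklearn.semi_supervised import LabelPropagation

# Toy stand-ins for the ~20K labelled and ~200K unlabelled rows.
rng = np.random.default_rng(0)
X_labelled = rng.normal(size=(200, 5))
y_labelled = X_labelled[:, 0] * 3.0 + rng.normal(size=200)   # toy regression target
X_unlabelled = rng.normal(size=(800, 5))

# 1. Binarise the labelled targets around the median.
median = np.median(y_labelled)
y_binary = (y_labelled > median).astype(int)

# 2. Stack labelled and unlabelled data; unlabelled points get the label -1.
X_all = np.vstack([X_labelled, X_unlabelled])
y_all = np.concatenate([y_binary, -np.ones(len(X_unlabelled), dtype=int)])

# 3. Run a conventional classification SSL technique (label propagation).
lp = LabelPropagation(kernel="knn", n_neighbors=7)
lp.fit(X_all, y_all)

# 4. Map predicted probabilities back to the original target range.
proba_high = lp.predict_proba(X_unlabelled)[:, 1]
y_min, y_max = y_labelled.min(), y_labelled.max()
y_est = y_min + proba_high * (y_max - y_min)
```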
Hope the above at least provides food for thought.

Machine Learning - Huge positive-only text dataset

I have a dataset with thousands of sentences belonging to a subject. I would like to know what would be best for creating a classifier that predicts a text as "True" or "False" depending on whether it talks about that subject or not.
I've been using solutions with Weka (basic classifiers) and Tensorflow (neural network approaches).
I use a string-to-word-vector transform to preprocess the data.
Since there are no negative samples, I am dealing with a single class. I've tried a one-class classifier (libSVM in Weka), but the number of false positives is so high that I cannot use it.
I also tried adding negative samples, but when the text to predict does not fall in the negative space, the classifiers I've tried (NB, CNN, ...) tend to predict it as a false positive. I guess this is because of the sheer amount of positive samples.
I'm open to discarding ML as the tool to predict the new incoming data if necessary.
Thanks for any help
I have eventually added data for the negative class and built a Multinomial Naive Bayes classifier, which is doing the job as expected.
(The size of the data added is around one million samples.)
My answer is based on the assumption that adding at least 100 negative samples to the author's dataset of 1,000 positive samples is acceptable, since I have not yet received an answer to my question about this.
Since detecting a specific topic looks like a particular case of topic classification, I would recommend starting with a classification approach using two simple classes: class 1 is your topic, and the other class is all other topics.
I succeeded with the same approach for a face recognition task. At the beginning I built a model with one output neuron: a high output level if a face was detected, low if not.
Nevertheless, this approach gave me too low an accuracy: less than 80%.
But when I tried 2 output neurons (one class for a face being present in the image, another for no face detected), it gave me more than 90% accuracy with an MLP, even without using a CNN.
The key point here is using a SoftMax function for the output layer. It gives a significant increase in accuracy. In my experience, it increased accuracy on the MNIST dataset from 92% to 97% for the same MLP model.
About the dataset: most classification algorithms with a trainer, at least in my experience, are more efficient with an equal quantity of samples per class in the training set. In fact, if one class has less than 10% of the average quantity of the other classes, the model becomes almost useless for detecting that class. So if you have 1,000 samples for your topic, I suggest creating 1,000 negative samples covering as many different topics as possible.
Alternatively, if you don't want to create such a big set of negative samples, you can create a smaller negative set and use batch training with a batch size of 2x your negative sample quantity. To do so, split your positive samples into n chunks, each roughly the size of the negative set, and train your NN with n batches per training iteration, where each batch combines chunk[i] of the positive samples with all of your negative samples (see the sketch below). Just be aware that lower accuracy will be the price of this trade-off.
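A rough sketch of that chunked batching scheme; here scikit-learn's SGDClassifier with partial_fit stands in for the neural network, and X_pos / X_neg are placeholder feature arrays:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Placeholder data: X_pos are positive samples, X_neg the (smaller) negative set.
rng = np.random.default_rng(0)
X_pos = rng.normal(size=(1000, 50))
X_neg = rng.normal(size=(200, 50))

clf = SGDClassifier(loss="log_loss")

# Split the positives into chunks roughly the size of the negative set.
n_chunks = int(np.ceil(len(X_pos) / len(X_neg)))
chunks = np.array_split(X_pos, n_chunks)

for epoch in range(10):
    for chunk in chunks:
        # Each batch = one positive chunk + all the negatives (batch ~ 2x negatives).
        X_batch = np.vstack([chunk, X_neg])
        y_batch = np.concatenate([np.ones(len(chunk)), np.zeros(len(X_neg))])
        clf.partial_fit(X_batch, y_batch, classes=[0, 1])
```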
Also, you could consider building a more generic topic detector: figure out all the topics that can appear in the texts your model should analyze (for example, 10 topics) and create a training dataset with 1,000 samples per topic. This can also give higher accuracy.
One more point about the dataset: best practice is to train your model on only part of the dataset, for example 80%, and use the remaining 20% for cross-validation. Validating on data the model has not seen before gives you a good estimate of your model's accuracy in real life, not just on the training set, and helps you avoid overfitting.
About building the model: I like the "from simple to complex" approach, so I would suggest starting with a simple MLP with a SoftMax output and a dataset of 1,000 positive and 1,000 negative samples. After reaching 80-90% accuracy you can consider using a CNN, and I would also suggest increasing the training dataset size, because deep learning algorithms are more efficient with bigger datasets.
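A minimal sketch of that starting point (an MLP with a 2-neuron SoftMax output), assuming the texts have already been vectorised; the layer sizes and toy data are placeholders:

```python
import numpy as np
from tensorflow import keras

# Placeholder vectorised data: 1,000 positive + 1,000 negative samples.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 300))
y = np.concatenate([np.ones(1000, dtype=int), np.zeros(1000, dtype=int)])

model = keras.Sequential([
    keras.Input(shape=(300,)),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(2, activation="softmax"),  # 2 output neurons + SoftMax
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Hold out 20% for validation, as suggested above.
model.fit(X, y, epochs=10, batch_size=32, validation_split=0.2)
```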
For text data you can use Spy EM.
The basic idea is to combine your positive set with a whole bunch of random samples, some of which you hold out. You initially treat all the random documents as the negative class, and train a classifier with your positive samples and these negative samples.
Now some of those random samples will actually be positive, and you can conservatively relabel any documents that are scored higher than the lowest scoring held out true positive samples.
Then you iterate this process until it stabilizes.
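A rough sketch of a single round of that idea (not the full Spy EM algorithm) using scikit-learn; the vectoriser, classifier, and 10% spy fraction are illustrative choices:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def spy_step(pos_texts, unlabeled_texts, spy_frac=0.1, seed=0):
    """One spy iteration: return indices of unlabeled docs to relabel as positive."""
    rng = np.random.default_rng(seed)

    # Hold out a fraction of the positives as "spies" hidden in the unlabeled pool.
    n_spies = max(1, int(len(pos_texts) * spy_frac))
    spy_idx = set(rng.choice(len(pos_texts), size=n_spies, replace=False).tolist())
    spies = [t for i, t in enumerate(pos_texts) if i in spy_idx]
    train_pos = [t for i, t in enumerate(pos_texts) if i not in spy_idx]

    # Treat the unlabeled pool plus the spies as the negative class for now.
    texts = train_pos + unlabeled_texts + spies
    labels = [1] * len(train_pos) + [0] * (len(unlabeled_texts) + len(spies))

    vec = TfidfVectorizer()
    X = vec.fit_transform(texts)
    clf = LogisticRegression(max_iter=1000).fit(X, labels)

    # Threshold: the lowest score among the held-out true positives (the spies).
    threshold = clf.predict_proba(vec.transform(spies))[:, 1].min()

    # Conservatively relabel unlabeled docs scoring at or above that threshold.
    unl_scores = clf.predict_proba(vec.transform(unlabeled_texts))[:, 1]
    return np.where(unl_scores >= threshold)[0]

# Example usage with toy sentences:
positives = ["solar panels cut energy bills", "wind turbines generate clean power"]
unlabeled = ["new solar farm opens", "stock market rises", "rain expected tomorrow"]
print(spy_step(positives, unlabeled))
```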

What does this learning curve show? And how do I handle the non-representativeness of a sample?

[link to the learning curve plots]
I am trying a random forest regressor for a machine learning problem (price estimation of spatial points). I have a sample of spatial points in a city. The sample is not randomly drawn since there are very few observations downtown. And I want to estimate prices for all addresses in the city.
I have a good cross-validation score (absolute mean squared error) and also a good test score after splitting the training set, but the predictions are very bad.
What could explain these results?
I plotted the learning curves (link above): the cross-validation score increases with the number of instances (that sounds logical), while the training score remains high (should it decrease?). What do these learning curves show? And in general, how do we "read" learning curves?
Moreover, I suppose that the sample is not representative. I tried to make the dataset for which I want predictions spatially similar to the training set by drawing without replacement according to the proportions of observations in each district of the training set, but this didn't change the result. How can I handle this non-representativeness?
Thanks in advance for any help
There are a few common cases that pop up when looking at training and cross-validation scores:
Overfitting: When your model has a very high training score but a poor cross-validation score. Generally this occurs when your model is too complex, allowing it to fit the training data exceedingly well but giving it poor generalization to the validation dataset.
Underfitting: When neither the training nor the cross-validation scores are high. This occurs when your model is not complex enough.
Ideal fit: When both the training and cross-validation scores are fairly high. Your model not only learns to represent the training data, but it also generalizes well to new data.
Here's a nice graphic from this Quora post showing how model complexity and error relate to the type of fit a model exhibits.
In the plot above, the errors for a given complexity are the errors found at equilibrium. In contrast, learning curves show how the score progresses throughout the entire training process. Generally you never want to see the score decreasing during training, as this usually means your model is diverging. But the difference between the training and validation scores as they move forward in time (towards equilibrium) indicates how well your model is fitting.
Notice that even when you have an ideal fit (middle of complexity axis) it is common to see a training score that's higher than the cross-validation score, since the model's parameters are updated using the training data. But since you're getting poor predictions, and since validation score is ~10% lower than training score (assuming the score is out of 1), I would guess that your model is overfitting and could benefit from less complexity.
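A minimal sketch of plotting these curves with scikit-learn's learning_curve helper, which you can compare against the cases above (the toy regression data and scoring choice are placeholders for your spatial price data):

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import learning_curve

# Toy data standing in for the spatial price dataset.
X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=0)

train_sizes, train_scores, val_scores = learning_curve(
    RandomForestRegressor(random_state=0), X, y,
    cv=5, scoring="neg_mean_absolute_error",
    train_sizes=np.linspace(0.1, 1.0, 5),
)

plt.plot(train_sizes, train_scores.mean(axis=1), label="training score")
plt.plot(train_sizes, val_scores.mean(axis=1), label="cross-validation score")
plt.xlabel("number of training instances")
plt.ylabel("negative MAE")
plt.legend()
plt.show()
```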
To answer your second point: models will generalize better if the training data is a good representation of the validation data. So when splitting the data into training and validation sets, I recommend finding a way to randomly segregate the data. For example, you could generate a list of all the points in the city, iterate over the list, and for each point draw from a uniform distribution to decide which dataset that point belongs to.
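In code, that per-point assignment might look like this (a toy sketch; scikit-learn's train_test_split does the same thing in one call):

```python
import random

random.seed(0)
all_city_points = [(i % 100, i // 100) for i in range(1000)]  # placeholder (x, y) addresses

train_points, validation_points = [], []
for point in all_city_points:
    # Draw from a uniform distribution: ~80% training, ~20% validation.
    if random.random() < 0.8:
        train_points.append(point)
    else:
        validation_points.append(point)

print(len(train_points), len(validation_points))
```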

Machine Learning Training & Test data split method

I was running a random forest classification model and initially divided the data into train (80%) and test (20%). However, the predictions had too many false positives, which I think was because there was too much noise in the training data, so I decided to split the data with a different method. Here's how I did it.
Since I thought the high false positive rate was due to the noise in the training data, I made the training data have an equal number of each target class. For example, if I have 10,000 rows of data and the target variable is 8,000 (0) and 2,000 (1), I made the training data a total of 4,000 rows, including 2,000 (0) and 2,000 (1), so that the training data now has more signal.
When I tried this new splitting method, it predicted much better, increasing the recall for the positive class from 14% to 70%.
I would love to hear your feedback if I am doing anything wrong here. I am concerned if I am making my training data biased.
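Roughly, the balancing step looks like this (a sketch with illustrative column names, using scikit-learn's resample to downsample the majority class):

```python
import pandas as pd
from sklearn.utils import resample

# Toy frame standing in for the 10,000-row dataset with an imbalanced target.
df = pd.DataFrame({
    "feature": range(10_000),
    "target": [0] * 8_000 + [1] * 2_000,
})

minority = df[df["target"] == 1]
majority = df[df["target"] == 0]

# Downsample the majority class to the size of the minority class.
majority_down = resample(majority, replace=False,
                         n_samples=len(minority), random_state=0)

balanced_train = pd.concat([majority_down, minority]).sample(frac=1, random_state=0)
print(balanced_train["target"].value_counts())
```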
When you have an unequal number of data points in each class in the training set, the baseline (random prediction) changes.
By noisy data, I think you mean that the number of training points for one class is larger than for the other. This is not really called noise; it is actually bias.
For example: you have 10,000 data points in the training set, 8,000 of class 1 and 2,000 of class 0. I can predict class 1 all the time and already get 80% accuracy. This induces a bias, and the baseline for 0-1 classification will not be 50%.
To remove this bias, you can either intentionally balance the training set as you did, or change the error function by giving each class a weight inversely proportional to its number of points in the training set.
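A minimal sketch of the weighting alternative in scikit-learn, where class_weight='balanced' applies exactly this inverse-frequency weighting (the data is a toy stand-in for the 80/20 split described above):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Toy imbalanced dataset: ~80% class 0, ~20% class 1.
X, y = make_classification(n_samples=10_000, weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    stratify=y, random_state=0)

# class_weight='balanced' weights each class inversely to its frequency,
# instead of physically discarding majority-class rows.
clf = RandomForestClassifier(class_weight="balanced", random_state=0)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```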
Actually, what you did is right, and this process is similar to "stratified sampling".
In your first model, where accuracy was very low, the model did not capture enough correlation between the features and the target for the positive class (1). The model also might have somewhat over-fitted the negative class. This is called a "high bias - high variance" situation.
"Stratified sampling" simply means that when you extract a sample from a big population, you make sure that all classes have roughly equal proportions, to make the model's training assumptions more accurate and reliable.
In the second case, the model was able to capture the relationships between the features and the target, and the characteristics of the positive and negative classes were well distinguishable.
Eliminating noise is a part of data preparation that should obviously be done before putting data into a model.

Machine Learning Experiment Design with Small Positive Sample Set in Sci-kit Learn

I am interested in any tips on how to train a classifier with a very limited positive set and a large negative set.
I have about 40 positive examples (quite lengthy articles about a particular topic) and about 19,000 negative samples (most drawn from the scikit-learn newsgroups dataset). I also have about 1,000,000 tweets that I could work with, all negative with respect to the topic I am trying to train on. Is the size of the negative set relative to the positive set going to negatively influence training a classifier?
I would like to use cross-validation in scikit-learn. Do I need to break this into train / dev / test sets? I know there are some pre-built utilities in scikit-learn. Any implementation examples that you recommend or have used previously would be helpful.
Thanks!
The answer to your first question is yes; the amount by which it will affect your results depends on the algorithm. My advice would be to keep an eye on the class-based statistics such as recall and precision (found in classification_report).
For RandomForest() you can look at this thread, which discusses the sample_weight parameter. In general, sample_weight is what you're looking for in scikit-learn.
For SVMs, have a look at either this example or this example.
For NB classifiers, this should be handled implicitly by Bayes' rule; however, in practice you may see some poor performance.
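A small sketch of passing per-sample weights and then checking the per-class statistics in scikit-learn (toy data mimics the roughly 40-vs-19,000 imbalance):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Toy stand-in for ~40 positives vs ~19,000 negatives.
X, y = make_classification(n_samples=19_040, weights=[0.998, 0.002],
                           n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y,
                                                    test_size=0.25, random_state=0)

# Weight each sample inversely to its class frequency.
class_counts = np.bincount(y_train)
sample_weight = np.where(y_train == 1,
                         len(y_train) / (2 * class_counts[1]),
                         len(y_train) / (2 * class_counts[0]))

clf = RandomForestClassifier(random_state=0)
clf.fit(X_train, y_train, sample_weight=sample_weight)

# Per-class precision/recall shows how the minority class is actually doing.
print(classification_report(y_test, clf.predict(X_test), digits=3))
```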
For your second question, it's up for discussion. Personally, I break my data into a training and test split, perform cross-validation on the training set for parameter estimation, retrain on all the training data, and then evaluate on my test set. However, the amount of data you have may influence the way you split your data (more data means more options).
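That workflow looks roughly like this with scikit-learn's GridSearchCV (the estimator and parameter grid are placeholder choices):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=2_000, weights=[0.9, 0.1], random_state=0)

# 1. Hold out a final test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    stratify=y, random_state=0)

# 2. Cross-validate on the training set for parameter estimation.
grid = GridSearchCV(RandomForestClassifier(random_state=0),
                    param_grid={"n_estimators": [100, 300],
                                "max_depth": [None, 10]},
                    cv=5, scoring="f1")
grid.fit(X_train, y_train)   # refit=True retrains the best model on all training data

# 3. Evaluate once on the held-out test set.
print("Test F1 score:", grid.score(X_test, y_test))
```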
You could probably use Random Forest for your classification problem. There are basically three parameters for dealing with data imbalance: class weight, sample size, and cutoff.
Class weight: the higher the weight given to a class, the more its error rate is reduced.
Sample size: oversample the minority class to improve the class balance when sampling the data for each tree (not sure if scikit-learn supports this; it used to be a parameter in R).
Cutoff: if more than x% of the trees vote for the minority class, classify the sample as the minority class. By default x is 1/2 in random forests for a 2-class problem. You can set it to a lower value for the minority class, as in the sketch below.
Check out "balancing prediction error" at https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
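scikit-learn's random forest does not expose a cutoff parameter directly, but you can emulate one by thresholding predict_proba yourself; a sketch with an arbitrary 0.3 cutoff:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5_000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Fraction of trees (averaged probability) voting for the minority class.
minority_proba = clf.predict_proba(X_test)[:, 1]

# Lower the cutoff from the default 0.5 to favour the minority class.
y_pred = (minority_proba >= 0.3).astype(int)
print("Predicted minority fraction:", y_pred.mean())
```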
For the second question: if you are using Random Forest, you do not need to keep separate train/validation/test sets. Random Forest does not choose any parameters based on a validation set, so a validation set is unnecessary.
Also, during the training of a Random Forest, the data for each individual tree is obtained by sampling with replacement from the training data, so each training sample is left out of roughly 1/3 of the trees. We can use the votes of that 1/3 of the trees to compute the out-of-bag (OOB) prediction of the Random Forest. Thus, with OOB accuracy you only need a training set, not validation or test data, to estimate performance on unseen data. Check the out-of-bag error section at https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm for further study.
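In scikit-learn the OOB estimate is available via oob_score=True; a minimal sketch on toy data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=5_000, weights=[0.9, 0.1], random_state=0)

# oob_score=True computes accuracy on the samples each tree did not see.
clf = RandomForestClassifier(n_estimators=300, oob_score=True, random_state=0)
clf.fit(X, y)

print("Out-of-bag accuracy estimate:", clf.oob_score_)
```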
