Artificially increasing the size of a dataset through duplication? - machine-learning

I'm working on a machine learning project where I'm using a neural network to solve a binary classification problem; however, my dataset (in .csv format) is relatively small. It only has around 60 yes/no cases, and although the network was able to train, the accuracy wasn't very good. My solution was to duplicate the dataset and, on each duplication, make tiny changes to the numbers, e.g. adding ±1 to each value or multiplying it by 0.999. By doing this I grew the dataset to around 1100 cases, and the network achieved much higher accuracy. I was wondering whether this is an actual technique used by ML researchers and, if it is, whether it has an official/academic name.
Thank You!

Yes, the process you are referring to is called data augmentation.
However, I would highly recommend not using neural networks on datasets with only a few hundred to a thousand rows. Neural networks are best suited to training models over large datasets.
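For tabular data like a .csv, the kind of augmentation described in the question can be done by jittering the feature columns. Below is a minimal sketch, assuming pandas/NumPy; the column name "label", the number of copies, and the noise scale are illustrative assumptions, not part of the question.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

def augment(df, label_col="label", copies=20, noise_scale=0.01):
    """Return the original rows plus `copies` jittered duplicates of each row."""
    feature_cols = [c for c in df.columns if c != label_col]
    duplicates = []
    for _ in range(copies):
        jittered = df.copy()
        # Multiplicative noise close to 1.0; the label column is left untouched.
        noise = rng.normal(0.0, noise_scale, size=jittered[feature_cols].shape)
        jittered[feature_cols] = jittered[feature_cols] * (1.0 + noise)
        duplicates.append(jittered)
    return pd.concat([df] + duplicates, ignore_index=True)

# df = pd.read_csv("data.csv")   # ~60 labelled rows
# augmented = augment(df)        # ~60 * 21 rows
```

One caveat worth flagging: split into train and test sets before augmenting and evaluate only on un-augmented held-out rows; otherwise near-duplicates of the same case leak across the split and the accuracy estimate is inflated.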

Related

What machine learning algorithm could be better for this scenario

I have a dataset comprising roughly 15M observations, with approximately 3% of them belonging to the class of interest. I can train the model on a PC, but I need to deploy the classifier on a Raspberry Pi 3. Since the Raspberry Pi has such limited memory, which algorithms represent the smallest load for it?
Additional info: the dataset is hard to separate. For example, ANNs can't get past an 80% detection rate for the class of interest, no matter the architecture or activation function. Random forests have shown great performance, but the number of trees and nodes required isn't feasible for implementation on a microcontroller.
Thank you, in advance.
You could potentially prune the trees in the Random Forest approach so as to balance classifier performance against memory and processing-power requirements.
Also, I suspect you have strongly imbalanced train/test sets, so I wonder whether you have tried any of the approaches suggested for that case (e.g. SMOTE, ADASYN, etc.). For Python, I strongly suggest reviewing the imbalanced-learn library. Such an approach could lead to a smaller classifier with acceptably good performance that you would be able to run on the target device.
Last but not least, this question could easily go to Cross Validated or Data Science sites.
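To illustrate the imbalanced-learn suggestion, here is a minimal sketch; `X` and `y` stand in for your feature matrix and labels (they are assumed to already exist), and the small forest size is only a guess at what might fit on a Pi.

```python
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# X, y = your features and labels (~3% positive class)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.2, random_state=42
)

# Oversample only the training split; the test split keeps its natural balance.
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)
print(Counter(y_train), "->", Counter(y_res))

# A deliberately small forest to limit the memory footprint on the device
# (tree count and depth are placeholders to tune against accuracy).
clf = RandomForestClassifier(n_estimators=50, max_depth=8, random_state=42)
clf.fit(X_res, y_res)
print(clf.score(X_test, y_test))
```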

How can I know whether my training data is enough for machine learning

For example: if I want to train a classifier (maybe an SVM), how many samples do I need to collect? Is there a method to measure this?
It is not easy to know how many samples you need to collect, but you can follow these steps.
For solving a typical ML problem:
Build a dataset with a few samples. How many? That depends on the kind of problem you have; don't spend a lot of time on it now.
Split your dataset into train, cross-validation, and test sets, and build your model.
Now that you've built the ML model, evaluate how good it is: calculate your test error.
If your test error doesn't meet your expectations, collect new data and repeat steps 1-3 until you hit a test error rate you are comfortable with.
This method will work as long as your model is not suffering from high bias.
This video from Coursera's Machine Learning course explains it.
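As a rough illustration of that collect-evaluate-repeat loop (not part of the answer above), a learning curve shows whether adding data is still improving validation performance; the synthetic dataset and SVM below are placeholders.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.svm import SVC

# Placeholder data; substitute your own X, y.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

sizes, train_scores, val_scores = learning_curve(
    SVC(kernel="rbf"), X, y, cv=5, train_sizes=np.linspace(0.1, 1.0, 5)
)

for n, tr, va in zip(sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"{n:4d} samples: train accuracy={tr:.3f}  cv accuracy={va:.3f}")

# If the cv score is still climbing at the largest size, more data should help;
# if it has flattened, collecting more is unlikely to move the needle.
```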
Unfortunately, there is no simple method for this.
The rule of thumb is the bigger, the better, but in practice you have to gather a sufficient amount of data. By sufficient I mean covering as large a part of the modeled space as you consider acceptable.
Also, quantity is not everything. The quality of the samples is very important too; e.g. training samples should not contain duplicates.
Personally, when I don't have all possible training data at once, I gather some training data and then train a classifier. If the classifier quality is not acceptable, I gather more data, and so on.
Here is some research on estimating training set quality.
This depends a lot on the nature of the data and the prediction you are trying to make, but as a simple rule to start with, your training data should be roughly 10X the number of your model parameters. For instance, while training a logistic regression with N features, try to start with 10N training instances.
For an empirical derivation of the "rule of 10", see
https://medium.com/@malay.haldar/how-much-training-data-do-you-need-da8ec091e956
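A toy illustration of that rule of 10 for a logistic regression (the feature counts below are arbitrary examples):

```python
def rule_of_ten(n_features):
    n_params = n_features + 1      # N weights plus one intercept
    return 10 * n_params           # rough lower bound on training instances

for n in (10, 100, 1000):
    print(f"{n} features -> roughly {rule_of_ten(n)} training instances")
```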

How to deal with low frequency examples in classification?

I'm facing a text classification problem, and I need to classify examples into 34 groups.
The problem is that the training data for the 34 groups is not balanced: for some groups I have 2000+ examples, while for others I only have 100+.
For some small groups the classification accuracy is quite high; I guess those groups have specific keywords that make them easy to recognize and classify. For others the accuracy is low, and the predictions always go to the large groups.
I want to know how to deal with this "low-frequency example" problem. Would simply copying and duplicating the small-group data work? Or do I need to select the training data so as to expand and balance the class sizes? Any suggestions?
Regularization can sometimes help with imbalanced class problems by reducing the effect of spurious correlations, but that depends on your data. One solution is simply to over-sample the smaller classes, or to increase the weights of the data points in the smaller classes to force the classifier to pay more attention to them.
You can find more advanced techniques by searching for "class imbalance" problems, although not as many of them have been applied to or created for text classification, since it is very common to have huge amounts of data when working with text. So I'm not sure how many of them work well in such a high-dimensional space.
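As a minimal sketch of the re-weighting idea for text (the pipeline and the variable names `texts`/`labels` are assumptions, not from the question):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# class_weight="balanced" makes each document in a rare group count for more
# in the loss, without physically duplicating any rows.
clf = make_pipeline(
    TfidfVectorizer(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)

# texts: list of documents, labels: one of the 34 group ids per document
# clf.fit(texts, labels)
```

For many loss-minimizing models, duplicating a minority-class document k times is roughly equivalent to weighting it by k, so class weights are usually the simpler place to start before hand-curating or expanding the data.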

How easy/fast are support vector machines to create/update?

If I provided you with data sufficient to classify a bunch of objects as either apples, oranges or bananas, how long might it take you to build an SVM that could make that classification? I appreciate that it probably depends on the nature of the data, but are we more likely talking hours, days or weeks?
Ok. Now that you have that SVM, and you have an understanding of how the data behaves, how long would it likely take you to upgrade that SVM (or build a new one) to classify an extra class (tomatoes) as well? Seconds? Minutes? Hours?
The motivation for the question is trying to assess the practical suitability of SVMs to a situation in which not all data is available to be sampled at any time. Fruit are an obvious case - they change colour and availability with the season.
If you would expect SVMs to be too fiddly to be able to create inside 5 minutes on demand, despite experience with the problem domain, then suggestions of a more user-friendly form of classifier for such a situation would be appreciated.
Generally, adding a class to a one-vs-rest (1 vs. many) SVM classifier requires retraining over all classes. On large data sets this can turn out to be quite expensive. In the real world, when facing very large data sets, if performance and flexibility are more important than state-of-the-art accuracy, Naive Bayes is quite widely used (adding a class to an NB classifier requires training the new class only).
However, according to your comment, the data has tens of dimensions and up to 1000s of samples, so the problem is relatively small and in practice the SVM retrain can be performed very quickly (probably on the order of seconds to tens of seconds).
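To give a feel for that claim, here is a quick sketch with scikit-learn; the shapes roughly match "tens of dimensions, thousands of samples", and the random data is purely illustrative.

```python
import time

import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(3000, 30))
y = rng.integers(0, 3, size=3000)          # apples / oranges / bananas

clf = OneVsRestClassifier(SVC(kernel="rbf"))
start = time.perf_counter()
clf.fit(X, y)
print(f"3-class fit: {time.perf_counter() - start:.2f}s")

# A fourth class (tomatoes) arrives: the whole one-vs-rest model is refit.
X4 = np.vstack([X, rng.normal(loc=1.0, size=(1000, 30))])
y4 = np.concatenate([y, np.full(1000, 3)])
start = time.perf_counter()
clf.fit(X4, y4)
print(f"4-class refit: {time.perf_counter() - start:.2f}s")
```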
You need to give us more details about your problem, since there are many different scenarios: an SVM can be trained fairly quickly (I could train one in real time in a third-person shooter game without any latency), or it can take much longer (I have a face-detector case where training took an hour).
As a rule of thumb, training time is proportional to the number of samples and the dimensionality of each vector.

OpenCV Iterative random forest training

I'm using the random forest algorithm as the classifier for my thesis project. The training set consists of thousands of images, and for each image about 2000 pixels get sampled. For each pixel, I have hundreds of thousands of features. With my current hardware limitations (8 GB of RAM, possibly extendable to 16 GB) I'm able to fit in memory the samples (i.e. the features per pixel) for only one image. My question is: is it possible to call the train method multiple times, each time with a different image's samples, and have the statistical model automatically updated on each call? I'm particularly interested in the variable importance, since, after I train on the full training set with the whole feature set, my idea is to reduce the number of features from hundreds of thousands to about 2000, keeping only the most important ones.
Thank you for any advice,
Daniele
I don't think the algorithm supports incremental training. You could consider reducing the size of your descriptors prior to training using another feature-reduction method, or estimating the variable importance on a random subset of pixels taken from across all your training images, as many as you can fit into memory.
See my answer to this post. There are incremental versions of random forests, and they will let you train on much larger data.
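If incremental training turns out not to be available, here is a hedged sketch (in scikit-learn rather than OpenCV, with made-up, scaled-down shapes) of the subsample-then-rank-features idea from the first answer:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Stand-in for a random subset of pixels drawn across many images,
# small enough to fit in RAM (real case: ~100k features, keep top ~2000).
rng = np.random.default_rng(0)
X_subset = rng.normal(size=(5000, 2000))
y_subset = rng.integers(0, 2, size=5000)

rf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
rf.fit(X_subset, y_subset)

# Rank features by importance and keep only the top ones for the full run.
top = np.argsort(rf.feature_importances_)[::-1][:200]
# X_reduced = X_full[:, top]   # X_full is hypothetical: the full feature matrix
```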
