OpenCV iterative random forest training

I'm using the random forest algorithm as the classifier for my thesis project.
The training set consists of thousands of images, and for each image about 2000
pixels get sampled. For each pixel, I have hundreds of thousands of features. With
my current hardware limitations (8 GB of RAM, possibly extendable to 16 GB) I can
fit in memory the samples (i.e. the features per pixel) for only one image. My
question is: is it possible to call the train method multiple times, each time
with a different image's samples, and have the statistical model automatically
updated at each call? I'm particularly interested in the variable importance since, after I
train on the full training set with the whole feature set, my idea is to reduce
the number of features from hundreds of thousands to about 2000, keeping only the
most important ones.
Thank you for any advice,
Daniele

I don't think the algorithm supports incremental training. You could consider reducing the size of your descriptors prior to training, using another feature-reduction method. Or estimate the variable importance on a random subset of pixels taken from all your training images, as many as you can fit into your memory...
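As a rough sketch of that last suggestion: OpenCV's RTrees can report variable importance when setCalculateVarImportance is enabled, so training once on whatever random subset of pixels fits in RAM gives you importances you can use to pre-select features. The data below is a synthetic stand-in, and the tree count, depth and feature counts are placeholders, not recommendations.

    import cv2
    import numpy as np

    # Stand-in data: 2000 sampled pixels x 1000 features (the real problem would be
    # hundreds of thousands of features, reduced afterwards to ~2000).
    rng = np.random.default_rng(0)
    samples = rng.random((2000, 1000), dtype=np.float32)          # shape (n_pixels, n_features)
    labels = rng.integers(0, 2, size=(2000, 1)).astype(np.int32)  # int32 responses -> classification

    rf = cv2.ml.RTrees_create()
    rf.setCalculateVarImportance(True)                            # ask OpenCV to track importance
    rf.setMaxDepth(20)
    rf.setTermCriteria((cv2.TERM_CRITERIA_MAX_ITER, 100, 0))      # grow 100 trees

    rf.train(samples, cv2.ml.ROW_SAMPLE, labels)

    importance = rf.getVarImportance().ravel()                    # one value per feature
    top = np.argsort(importance)[::-1][:100]                      # indices of the most important features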

See my answer to this post. There are incremental versions of random forests, and they will let you train on much larger data.

Related

Artificially increasing the size of a dataset through duplication?

I'm working on a machine learning project where I'm using a neural network to solve a binary classification problem; however, my dataset (in .csv format) is relatively small. It only has around 60 yes/no cases, and although it was able to train, the accuracy wasn't very good. My solution was simply to duplicate the dataset and, on each duplication, make tiny changes to the numbers, i.e., adding ±1 to each number or multiplying it by 0.999. By doing this I grew the dataset to around 1100 cases and it achieved much higher levels of accuracy. I was wondering if this is an actual technique used by ML researchers and, if it is, whether it has an official/academic name?
Thank You!
Yes, the process you are referring to is called data augmentation.
However, I would highly recommend not using neural networks on datasets with merely hundreds to a few thousand rows. Neural networks are ideally used to train models over large datasets.
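If you want to see what that kind of numeric jitter augmentation looks like in practice, here is a minimal sketch with NumPy/pandas; the file name, label column name, noise scales and number of copies are illustrative assumptions only.

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(42)

    df = pd.read_csv("cases.csv")             # hypothetical file: numeric features plus a yes/no "label" column
    features = df.drop(columns=["label"])
    labels = df["label"]

    augmented = [df]
    for _ in range(10):                        # ten jittered copies: ~60 original rows -> ~660 rows total
        jitter = features * rng.normal(1.0, 0.01, size=features.shape)  # multiply each value by ~0.99-1.01
        jitter += rng.integers(-1, 2, size=features.shape)              # add -1, 0 or +1
        augmented.append(pd.concat([jitter, labels], axis=1))

    df_aug = pd.concat(augmented, ignore_index=True)

One caveat: if near-duplicates of the same original row end up in both the training and test splits, the accuracy estimate becomes optimistic, so it is safer to augment only the training split after splitting.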

What machine learning algorithm would be better for this scenario?

I have a dataset comprising roughly 15M observations, approximately 3% of which are from the class of interest. I can train the model on a PC, but I need to deploy the classifier on a Raspberry Pi 3. Since the Raspberry Pi has such limited memory, which algorithms represent the lightest load for it?
Additional info: the dataset is hard to separate. For example, ANNs can't get past an 80% detection rate for the class of interest, no matter the architecture or activation function. Random forest has demonstrated great performance, but the number of trees and nodes required isn't feasible for implementation on such a constrained device.
Thank you, in advance.
You could potentially trim the trees in the random forest approach so as to balance classifier performance against memory and processing-power requirements.
Also, I suspect you have strongly imbalanced train/test sets, so I wonder if you have used any of the approaches suggested for this case (e.g. SMOTE, ADASYN, etc.). In the case of Python, I strongly suggest reviewing the imbalanced-learn library. Such an approach could lead to a smaller classifier with acceptably good performance that you would be able to run on the target device.
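A rough sketch of that combination with imbalanced-learn and scikit-learn follows; the synthetic data and the forest size are assumptions chosen only to illustrate keeping the model small.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from imblearn.over_sampling import SMOTE

    # Stand-in for the real data: 100k rows, ~3% positive class (the real set has ~15M rows).
    X, y = make_classification(n_samples=100_000, n_features=20,
                               weights=[0.97, 0.03], random_state=0)

    # Oversample the minority class before training.
    X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)

    # Deliberately small forest so the model can fit on a Raspberry Pi.
    clf = RandomForestClassifier(n_estimators=20, max_depth=10, random_state=0)
    clf.fit(X_res, y_res)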
Last but not least, this question could easily go to Cross Validated or Data Science sites.

OpenCV poor performance of Haar classifier trained by me

I would like to use the Haar classifier to detect the presence of vehicles in a scene (trying with only cars so far). Since I have not found many trained XML files online, I decided to generate my own.
I found some image sets of vehicles that have been used for similar purposes (training computer vision algorithms) and used these to create my own XML files. It has been almost a week and some of them have finished, so I tried using them, but the results were terrible. The classifiers I found online work decently: at least they appear to be trying to detect vehicles, and they run fast enough for real-time application (maybe 5-10 FPS or so).
Mine, by contrast, can take several minutes to analyze a frame with detectMultiScale() using the same parameters, and if I pass different parameters (e.g. increase the min size, decrease the max size, increase the scale factor) it runs faster (maybe 1 FPS) but detects absolutely nothing of note: it never detects any vehicle and randomly flags patches of asphalt as vehicles.
Where did I go wrong in generating my files? I have limited time to complete this task and these classifiers can take a whole week to train so I have very few attempts remaining. For reference, my methodology is (following this tutorial):
- Take all positive and negative images; if no negative images are supplied, take negatives from another data set, using at least as many negatives as positives
- Generate as many samples as the number of positives
- Use the same parameters as suggested, except the image size (set to the size of the images in the given data set) and nstages (set to 10 because 20 takes far too long)
- For the npos parameter, I use 1/10th of the number of samples; using the full number of samples resulted in an "assertion failed" error after a few hours. Apparently the number of samples cannot be the same as npos according to this, so I gave myself a safety margin.
TL;DR Haar classifier I trained myself performs much worse than one found online (in terms of time and accuracy), need advice on how to improve it and not waste another week training it.
There are two problems here. One, the accuracy of the classifier is low. The other, the classifier runs too slowly.
There seems to be no problem with the reference that you used. The steps seem accurate, and I have personally tried them in that order and managed to get good results.
As @Micka mentions, an nPos of around 90% of the original sample count is good enough. minHitRate is a parameter that you can change. Did you observe the numbers that are displayed while training? How was the accuracy improving, and did your classifier stop training (or are you using the trained parameters before learning ends)?
For the low speed in detection, the most likely reason is that your training data did not have simple features to learn quickly. Did you try detection on the data that you used for training? How were the results in that case? Compiler settings or high image resolution can be a problem too, but if you tried the same inputs and settings with other classifiers, this is unlikely.
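In case it helps, here is a minimal detection sketch (OpenCV Python) you could run on one of your own training images to sanity-check the cascade; the file names and parameter values are placeholders, not tuned recommendations.

    import cv2

    cascade = cv2.CascadeClassifier("cars.xml")       # your trained cascade (placeholder name)
    img = cv2.imread("training_frame.jpg")            # one of the images used for training
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

    # Tighter min/max sizes and a larger scale factor cut the number of windows evaluated,
    # which is usually the first thing to tune when detection is slow.
    cars = cascade.detectMultiScale(gray, scaleFactor=1.2, minNeighbors=4,
                                    minSize=(48, 24), maxSize=(300, 150))

    for (x, y, w, h) in cars:
        cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)
    cv2.imwrite("detections.jpg", img)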
If you would like to try a different approach (and have a GPU), YOLO v2 should be much faster and more accurate for this task.

How to deal with low frequency examples in classification?

I'm facing a text classification problem, and I need to classify examples into 34 groups.
The problem is that the amount of training data across the 34 groups is not balanced. For some groups I have 2000+ examples, while for some I only have 100+ examples.
For some small groups the classification accuracy is quite high; I guess those groups may have specific keywords that make them easy to recognize and classify. For others the accuracy is low, and the prediction always goes to the large groups.
I want to know how to deal with this "low-frequency example" problem. Would simply copying and duplicating the small-group data work? Or do I need to select the training data to expand and balance the data sizes? Any suggestions?
Regularization can sometimes help imbalanced-class problems by reducing the effect of spurious correlations, but that depends on your data. One solution is to simply over-sample the smaller classes, or to increase the weights of the data points in the smaller classes to force the classifier to pay more attention to them.
You can find more advanced techniques by searching for the "class imbalance" problem. Not as many of them have been applied to or created for text classification, though, since it is very common to have huge amounts of data when working with text, so I'm not sure how many work well in such a high-dimensional space.
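As a minimal sketch of the re-weighting option with scikit-learn (the TF-IDF plus logistic regression setup and the toy data are assumptions, just one common baseline for text):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Placeholders: in the real problem, texts are the documents and labels one of the 34 group ids.
    texts = ["example document one", "another example", "yet another text"]
    labels = [0, 1, 0]

    # class_weight='balanced' reweights each class inversely to its frequency,
    # so the small groups are not drowned out by the large ones.
    clf = make_pipeline(TfidfVectorizer(),
                        LogisticRegression(class_weight="balanced", max_iter=1000))
    clf.fit(texts, labels)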

How easy/fast are support vector machines to create/update?

If I provided you with data sufficient to classify a bunch of objects as either apples, oranges or bananas, how long might it take you to build an SVM that could make that classification? I appreciate that it probably depends on the nature of the data, but are we more likely talking hours, days or weeks?
Ok. Now that you have that SVM, and you have an understanding of how the data behaves, how long would it likely take you to upgrade that SVM (or build a new one) to classify an extra class (tomatoes) as well? Seconds? Minutes? Hours?
The motivation for the question is trying to assess the practical suitability of SVMs to a situation in which not all data is available to be sampled at any time. Fruit are an obvious case - they change colour and availability with the season.
If you would expect SVMs to be too fiddly to be able to create inside 5 minutes on demand, despite experience with the problem domain, then suggestions of a more user-friendly form of classifier for such a situation would be appreciated.
Generally, adding a class to a one-vs-rest SVM classifier requires retraining all classes. In the case of large data sets, this might turn out to be quite expensive. In the real world, when facing very large data sets, if performance and flexibility are more important than state-of-the-art accuracy, Naive Bayes is quite widely used (adding a class to an NB classifier requires training only the new class).
However, according to your comment, which states that the data has tens of dimensions and up to thousands of samples, the problem is relatively small, so in practice an SVM retrain can be performed very fast (probably on the order of seconds to tens of seconds).
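To put a rough number on that, here is a quick timing sketch with scikit-learn on a synthetic problem of roughly that size (thousands of samples, tens of dimensions); your data will of course behave differently.

    import time
    from sklearn.datasets import make_classification
    from sklearn.svm import SVC

    # Synthetic stand-in: ~5000 samples, 30 dimensions, 4 classes (apples/oranges/bananas/tomatoes).
    X, y = make_classification(n_samples=5000, n_features=30, n_informative=20,
                               n_classes=4, random_state=0)

    start = time.perf_counter()
    clf = SVC(kernel="rbf")          # full retrain, including the newly added class
    clf.fit(X, y)
    print(f"training took {time.perf_counter() - start:.2f} s")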
You need to give us more details about your problem, since there are too many different scenarios: an SVM can be trained fairly quickly (I could train one in real time in a third-person shooter game without any latency), or training can take considerably longer (I have a case of a face detector whose training took an hour).
As a rule of thumb, the training time is proportional to the number of samples and to the dimension of each vector.
