Stratified sampling for regression - machine-learning

I need to do regression analysis using SVM kernels on the large sets of data. My laptop is not able to handle and it takes hours to finish running. Is there any good way to reduce the dataset size without affecting the (much) quality of the model? Will stratified sampling work?

There are dozens of ways of reducing SVM complexity, probably the easiest ones involve approximating Kernel space projection. In particular libraries such as scikit-learn provides functions to do this kind of explicit projection, which followed by a linear SVM - can be trained realatively fast.

Related

The impact of number of negative samples used in a highly imbalanced dataset (XGBoost)

I am trying to model a classifier using XGBoost on a highly imbalanced data-set, with a limited number of positive samples and practically infinite number of negative samples.
Is it possible that having too many negative samples (making the data-set even more imbalanced) will weaken the model's predictive power? Is there a reason to limit the number of negative samples aside from running time?
I am aware of the scale_pos_weight parameter which should address the issue but my intuition says even this method has its limits.
To answer your question directly: adding more negative examples will likely decrease the decision power of the trained classifier. For the negative class choose the most representative examples and discard the rest.
Learning from imbalanced dataset can influence the predictive power and even an ability of a classifier to converge at all. Generally recommended strategy is to maintain similar sizes of training examples per each of the classes. Imbalance of classes effect on learning depends on the shape of the decision space and the width of boundaries between classes. The wider they are, and the simpler the decision space the more successful training even for imbalanced datasets.
TL;DR
For a quick overview of the methods of imbalanced learning I recommend these two articles:
SMOTE and AdaSyn by example
How to Handle Imbalanced Data: An Overview
Dealing with Imbalanced Classes in Machine Learning
Learning from Imbalanced Data by Prof. Haibo He (more scientific)
There is a Python package called imbalanced-learn which has an extensive documentation of algorithms that I recommend for in-depth review.

what machine learning algorithm could be better for this scenario

I have a dataset comprised of roughly 15M observations, with approximately 3% of it being from the interest class. I can train the model in a pc, but i need to implement the classifier in a raspberry pi3. Since the raspberry has such a limited memory, what algorithms represent the least load for it?.
Additional info: the dataset is hard to differentiate. For example, ANNs can't get past the 80% detection rate for the interest class, no matter the architecture or activation function. Random forest has demonstrated great performance but the number of trees and nodes required aren't feasible for the implementation on a microcontroller.
Thank you, in advance.
You could potentially trim the trees in Random Forest approach so that to balance the classifier performance with memory / processing power requirements.
Also, I am suspecting you have a strongly imbalanced train/test sets so I wonder if you used any of the approaches suggested in this case (e.g. SMOTE, ADASYN, etc.). In case of python I strongly suggest reviewing imbalanced-learn library. Using such an approach could lead to a reduced size of classifier with acceptably good performance that you would be able to fit to run on the target device.
Last but not least, this question could easily go to Cross Validated or Data Science sites.

Image classification using Convolutional neural network

I'm trying to classify hotel image data using Convolutional neural network..
Below are some highlights:
Image preprocessing:
converting to gray-scale
resizing all images to same resolution
normalizing image data
finding pca components
Convolutional neural network:
Input- 32*32
convolution- 16 filters, 3*3 filter size
pooling- 2*2 filter size
dropout- dropping with 0.5 probability
fully connected- 256 units
dropout- dropping with 0.5 probability
output- 8 classes
Libraries used:
Lasagne
nolearn
But, I'm getting less accuracy on test data which is around 28% only.
Any possible reason for such less accuracy? Any suggested improvement?
Thanks in advance.
There are several possible reasons for low accuracy on test data, so without more information and a healthy amount of experimentation, it will be impossible to provide a concrete answer. Having said that, there are a few points worth mentioning:
As #lejlot mentioned in the comments, the PCA pre-processing step is suspicious. The fundamental CNN architecture is designed to require minimal pre-processing, and it's crucial that the basic structure of the image remains intact. This is because CNNs need to be able to find useful, spatially-local features.
For detecting complex objects from image data, it's likely that you'll benefit from more convolutional layers. Chances are, given the simple architecture you've described, that it simply doesn't possess the necessary expressiveness to handle the classification task.
Also, you mention you apply dropout after the convolutional layer. In general, the research I've seen indicates that dropout is not particularly effective on convolutional layers. I personally would recommend removing it to see if it has any impact. If you do wind up needing regularization on your convolutional layers, (which in my experience is often unnecessary since the shared kernels often already act as a powerful regularizer), you might consider stochastic pooling.
Among the most important tips I can give is to build a solid mechanism for measuring the quality of the model and then experiment. Try modifying the architecture and then tuning hyper-parameters to see what yields the best results. In particular, make sure to monitor training loss vs. validation loss so that you can identify when the model begins overfitting.
After 2012 Imagenet, all convolutional neural networks which performs good(state of the art) are adding more convolutional neural network, they even use zero padding to increase the convolutional neural network.
Increase the number of convolutional neural network.
Some says that dropout is not that effective on CNN, however it is not bad to use, but
You should lower the dropout value, you should try it(May be 0.2).
Data should be analysed. If it is low,
You should use data augmentation techniques.
If you have more data in one of the labels,
You are stuck with the imbalanced data problem. But you should not consider it for now.
You can
Fine-Tune from VGG-Net or some other CNN's should be considered.
Also, don't convert to grayscale, after image-to-array transformation, you should just divide 225.
I think that you learned CNN from some tutorial(MNIST) and you think that you should turn it to grayscale.

OpenCV: Training a soft cascade classifier

I've built an algorithm for pedestrian detection using openCV tools. To perform classification I use a boosted classifier trained with the CvBoost class.
The problem of this implementation is that I need to feed my classifier the whole set of features I used for training. This makes the algorithm extremely slow, so much that each image takes around 20 seconds to be fully analysed.
I need a different detection structure, and openCV has this Soft Cascade class that seems like exactly what I need. Its basic principle is that there is no need to examine all the features of a testing sample, since a detector can reject most negative samples using a small number of features. The problem is that I have no idea how to train one given a fully labeled set of negative and positive examples.
I find no information about this online, so I am looking for any tips you can give me on how to use this soft cascade to make classification.
Best regards

Testing an image processing algorithm on noisy data

I wrote an image processing program that train some classifier to recognize some object in the image. now I want to test the response of my algorithm to noise. I wish the algorithm have some robustness to noise.
My question is that, should I train the classifier using noisy version of train dataset, or train the classifier using original version of dataset, and see its performance on noisy data.
Thank you.
to show robustness of classifier one might use highly noisy test data on the originally trained classifier. depending on that performance, one can train again using noisy data and then test again. obviously for an application development, if including extremely noisy samples increase accuracy then that's the way to go. literature says to have as large a range of training samples as possible. however sometimes this degrades performances in specific cases.

Resources