Any reason not to use oversampling and undersampling together? - machine-learning

This has been bothering me for quite some time.
If oversampling and undersampling both have their pros and cons, why not use them together to minimize their weaknesses?
I just couldn't find a paper or an article saying either that both were used together or that they shouldn't be used simultaneously.
Joint usage would allow oversampling the minority less and undersampling the majority less, wouldn't it?
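For what it's worth, combining the two is easy to try in practice. Below is a minimal sketch (not from any specific paper) using the imbalanced-learn package, where SMOTE oversamples the minority moderately and a random undersampler then shrinks the majority; the sampling_strategy values are illustrative assumptions, not recommendations.

```python
# Sketch: moderate oversampling of the minority followed by moderate
# undersampling of the majority, using imbalanced-learn.
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline

# Toy imbalanced dataset: roughly 5% minority class.
X, y = make_classification(n_samples=10_000, weights=[0.95, 0.05],
                           random_state=0)

# Oversample the minority up to 30% of the majority size,
# then undersample the majority down to a 2:1 ratio.
resample = Pipeline(steps=[
    ("over", SMOTE(sampling_strategy=0.3, random_state=0)),
    ("under", RandomUnderSampler(sampling_strategy=0.5, random_state=0)),
])

X_res, y_res = resample.fit_resample(X, y)
print("before:", Counter(y), "after:", Counter(y_res))
```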

Related

How to handle imbalanced data in general

I have been working on a case study where the data is highly imbalanced. We have been taught that we can handle imbalanced data by either undersampling the majority class or oversampling the minority class.
I wanted to ask if there is any other way/method that can be used to handle imbalanced data?
This question is more on the conceptual side than on programming.
For example, I was thinking we could put some weight on the minority class (conceptually) to make the model emphasize identifying patterns in the minority class.
I don't know how that can be done, but this concept should work in theory.
Feel free to put forward crazy ideas too.
Your weights idea is not far off. This can be done. In fact, most sklearn models give you the option to specify class weights. However, this is often not enough for very extreme cases (e.g. a 95%/5% split or worse).
There are specific oversampling techniques such as SMOTE (and related techniques) which go a step further than classic oversampling and generate synthetic samples based on a k-nearest-neighbours algorithm.
If the classes are extremely imbalanced, a "classic" classification approach may not be enough and you might have to look into anomaly detection algorithms.
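To illustrate the class-weight option mentioned above, here is a minimal sketch with scikit-learn; logistic regression is just a stand-in, and any estimator exposing class_weight works the same way.

```python
# Sketch: penalizing mistakes on the minority class via class weights.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

X, y = make_classification(n_samples=5_000, weights=[0.95, 0.05],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# class_weight="balanced" reweights each class inversely to its frequency;
# an explicit dict such as {0: 1, 1: 10} also works.
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te)))
```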
Just use a strong, consistent classifier. See https://arxiv.org/abs/2201.08528
Generally speaking, before diving into a technical solution (under-/oversampling, SMOTE, ...), I think you need to consider the business KPI you are predicting and whether there is a proxy that could help reduce the disparity between the classes.
You can also think about models that have weight parameters and can penalize the majority class.
You can check this article; it explains from a conceptual point of view how to deal with imbalanced data in general.

The impact of number of negative samples used in a highly imbalanced dataset (XGBoost)

I am trying to train a classifier using XGBoost on a highly imbalanced dataset, with a limited number of positive samples and a practically infinite number of negative samples.
Is it possible that having too many negative samples (making the data-set even more imbalanced) will weaken the model's predictive power? Is there a reason to limit the number of negative samples aside from running time?
I am aware of the scale_pos_weight parameter, which should address the issue, but my intuition says even this method has its limits.
To answer your question directly: adding more negative examples will likely decrease the decision power of the trained classifier. For the negative class, choose the most representative examples and discard the rest.
Learning from an imbalanced dataset can affect a classifier's predictive power and even its ability to converge at all. The generally recommended strategy is to maintain similar numbers of training examples for each of the classes. How much class imbalance affects learning depends on the shape of the decision space and the width of the boundaries between classes: the wider the boundaries and the simpler the decision space, the more successful training will be, even on imbalanced datasets.
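As a complement, a common starting point for the scale_pos_weight parameter mentioned in the question is the negative-to-positive ratio of the training data. The sketch below assumes a recent xgboost version with its scikit-learn wrapper and uses toy data in place of the real dataset.

```python
# Sketch: weighting positives instead of (or in addition to) dropping negatives.
import numpy as np
import xgboost as xgb
from sklearn.datasets import make_classification

# Toy imbalanced data standing in for "few positives, many negatives".
X, y = make_classification(n_samples=20_000, weights=[0.97, 0.03],
                           random_state=0)

n_neg, n_pos = np.sum(y == 0), np.sum(y == 1)

model = xgb.XGBClassifier(
    n_estimators=300,
    scale_pos_weight=n_neg / n_pos,  # common heuristic: negatives / positives
    eval_metric="aucpr",             # PR-based metric is more informative here
)
model.fit(X, y)
```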
TL;DR
For a quick overview of the methods of imbalanced learning I recommend these two articles:
SMOTE and AdaSyn by example
How to Handle Imbalanced Data: An Overview
Dealing with Imbalanced Classes in Machine Learning
Learning from Imbalanced Data by Prof. Haibo He (more scientific)
There is a Python package called imbalanced-learn, which has extensive documentation of these algorithms; I recommend it for an in-depth review.

Data sampling technique and questions

I'm just a little confused about data sampling. What distribution should I expect for my sampled data? In general, do I want my sampled data to have the same distribution as my whole dataset? What is a reasonable sampling technique and approach?
There are many factors to consider in choosing a sampling technique, such as the purpose or objective of the work, your budget, time, and even the sample size.
Probability sampling techniques are usually more involved, while non-probability sampling techniques may be less demanding.
The sampling technique chosen strongly affects the interpretation of the data, as well as the overall outcome of your work.
These notes may be of interest:
Simple Random Sampling and Other Sampling Methods
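On the "same distribution" point: if you want the sample to preserve the class (or group) proportions of the full dataset, stratified sampling does exactly that. A minimal sketch with scikit-learn, using a synthetic labelled dataset purely for illustration:

```python
# Sketch: simple random sample vs. a stratified sample that preserves
# the label distribution of the full dataset.
from collections import Counter

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=10_000, weights=[0.9, 0.1],
                           random_state=0)

# Simple random sample: class proportions only match in expectation.
_, X_simple, _, y_simple = train_test_split(X, y, test_size=0.1, random_state=0)

# Stratified sample: class proportions match the full dataset by construction.
_, X_strat, _, y_strat = train_test_split(X, y, test_size=0.1,
                                          stratify=y, random_state=0)

print("full:", Counter(y), "simple:", Counter(y_simple),
      "stratified:", Counter(y_strat))
```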
I did not understand your question well, but I will try to answer.
Student's t distribution is essentially a normal distribution (with an approximately bell-shaped curve); that is why statistical programs often work with the Student's t distribution instead of the normal distribution.
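To make the t-versus-normal remark concrete: the Student's t density approaches the standard normal density as the degrees of freedom grow, which is easy to check numerically. A small sketch assuming SciPy:

```python
# Sketch: the Student's t pdf converges to the standard normal pdf
# as the degrees of freedom increase.
import numpy as np
from scipy.stats import norm, t

x = np.linspace(-4, 4, 9)
for df in (2, 10, 100):
    max_gap = np.max(np.abs(t.pdf(x, df) - norm.pdf(x)))
    print(f"df={df:>3}: max |t - normal| pdf gap = {max_gap:.4f}")
```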

What machine learning algorithm could be better for this scenario?

I have a dataset of roughly 15M observations, approximately 3% of which belong to the class of interest. I can train the model on a PC, but I need to deploy the classifier on a Raspberry Pi 3. Since the Raspberry Pi has such limited memory, which algorithms put the least load on it?
Additional info: the classes are hard to separate. For example, ANNs can't get past an 80% detection rate for the class of interest, no matter the architecture or activation function. Random forests have shown great performance, but the number of trees and nodes required isn't feasible for implementation on a microcontroller.
Thank you in advance.
You could potentially trim the trees in the random forest approach so as to balance classifier performance against memory/processing-power requirements.
Also, I suspect you have strongly imbalanced train/test sets, so I wonder if you used any of the approaches suggested for this case (e.g. SMOTE, ADASYN, etc.). In the case of Python, I strongly suggest reviewing the imbalanced-learn library. Such an approach could lead to a smaller classifier with acceptably good performance that you would be able to run on the target device.
Last but not least, this question could easily go to the Cross Validated or Data Science sites.
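To make the tree-trimming idea concrete, here is a hedged sketch: capping n_estimators and max_leaf_nodes directly bounds the forest's memory footprint, and the specific numbers below are placeholders to tune against your accuracy budget.

```python
# Sketch: a deliberately small random forest whose size is bounded
# by n_estimators and max_leaf_nodes, so it can fit on a small device.
import pickle

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the real imbalanced dataset.
X, y = make_classification(n_samples=50_000, weights=[0.97, 0.03],
                           random_state=0)

clf = RandomForestClassifier(
    n_estimators=25,          # few trees -> small model
    max_leaf_nodes=64,        # hard cap on each tree's size
    class_weight="balanced",  # compensate for the ~3% class of interest
    n_jobs=-1,
    random_state=0,
)
clf.fit(X, y)

# Rough size check before exporting to the Pi (pickle size is an approximation).
print(f"model size ~ {len(pickle.dumps(clf)) / 1024:.0f} KiB")
```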

Liblinear vs Pegasos

I have been looking for a fast linear SVM library and came across two of the most important ones, LIBLINEAR and Pegasos. From the LIBLINEAR paper, it looks like LIBLINEAR outperforms Pegasos; however, Pegasos claims to be fast when the data is sparse.
As Pegasos came earlier, there is no comparison in its documentation.
So for sparse data, what should I choose?
As far as I know, sparse data is handled fine by both. The question is more about the number of data points. LIBLINEAR has solvers for both the primal and the dual, and these solve the problem to high precision without any need to tune parameters. For Pegasos or similar subgradient descent solvers (if you want one of these, I'd recommend Leon Bottou's sgd), the result is strongly dependent on the initial learning rate and the learning rate schedule, which can be tricky to tune.
As a rule of thumb, if I have fewer than 10k data points, I'd always use LIBLINEAR (with the primal solver), maybe even up to 100k. Above that, I'd consider using SGD if I feel LIBLINEAR is too slow. Even if LIBLINEAR is slightly slower, I prefer using it, as it means I don't have to think about the learning rate, learning rate decay, and the number of epochs.
Btw, you can very easily compare these different solvers using a framework like scikit-learn, which includes SGD, Liblinear and LibSVM solvers, or lightning, which includes A LOT of solvers.
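As a concrete version of that suggestion: in scikit-learn, LinearSVC is backed by LIBLINEAR and SGDClassifier with hinge loss is a Pegasos-style stochastic solver, so a rough comparison on sparse data can look like the sketch below (the dataset is downloaded on first use and is only a stand-in for your own data).

```python
# Sketch: comparing a LIBLINEAR-backed linear SVM with an SGD/Pegasos-style
# linear SVM on sparse bag-of-words features.
from time import perf_counter

from sklearn.datasets import fetch_20newsgroups_vectorized
from sklearn.svm import LinearSVC
from sklearn.linear_model import SGDClassifier

# Sparse feature matrix with a large number of dimensions.
data = fetch_20newsgroups_vectorized(subset="train")
X, y = data.data, data.target

for name, clf in [("LinearSVC (liblinear)", LinearSVC()),
                  ("SGDClassifier (hinge)", SGDClassifier(loss="hinge"))]:
    start = perf_counter()
    clf.fit(X, y)
    print(f"{name}: {perf_counter() - start:.1f}s, "
          f"train accuracy = {clf.score(X, y):.3f}")
```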
Both LIBLINEAR and Pegasos are linear classification techniques that were specifically developed to deal with large sparse data with a huge number of instances and features. They are only faster than the traditional SVM on this kind of data.
I never used Pegasos before, but I can assure you that LIBLINEAR is very fast with this kind of data and the authors say that "it is competitive with or even faster than state of the art linear classifiers such as Pegasos".
