Whenever I start having a bigger number of classes (1000 and more) MultinominalNB gets super slow and takes Gigabytes of RAM. The same is true for all the scikit learn classification algorithms that support .partial_fit() (SGDClassifier, Perceptron).
When working with convolutional neural networks 10000 classes are no problem. But when I want to train MultinominalNB on the same data my 12GB of RAM are not enough and it is very very slow.
From my understanding of Naive Bayes, even with a lot of classes, it should be a lot faster.
Might this be a problem of the scikit-learn implementation (maybe of the .partial_fit() function) ? How can I train MultinominalNB/SGDClassifier/Perceptron on 10000+ classes (batchwise)?
Short answer without much information:
The MultinomialNB fits an independent model to each of the classes, thus, if you have C=10000+ classes it will fit C=10000+ models and therefore, only the model parameters will be [n_classes x n_features], which is quite a lot of memory if n_features is large.
The SGDClassifier of scikits-learn uses OVA (one-versus-all) strategy to train a multiclass model (as the SGDC is not inherently multiclass) and therefore, another C=10000+ models need to be trained.
And Perceptron, from the documentation of scikits-learn:
Perceptron and SGDClassifier share the same underlying implementation. In fact, Perceptron() is equivalent to SGDClassifier(loss=”perceptron”, eta0=1, learning_rate=”constant”, penalty=None).
So, all the 3 classifiers you mention don't work well with high number of classes, as an independent model needs to be trained for each of the classes. I would recommend you to try something that inherently support multiclass classification, such as RandomForestClassifier.
Related
I am training a machine learning model on a classification problem. My dataset is 10000 observations with 37 categorical class. But the data is imbalanced, I have some classes with 100 observations and some other classes with 3000 and 4000 observations.
After searching on how to do some feature engineering on this type of data to improve the performance of the algorithm. I found 2 solutions:
upsampling which means to get more data about the minority classes
downsampling which means to remove data about the majority classes
According to the first solution:
I have many classes with a few observations so it will require much more data and long time. so it will be a hard for me!
And by applying the second one:
I think all classes will have a few observations and the data will be very small so that it will be hard for the algorithm to generalize.
So Is there another solution I can try for this problem?
You can change the weights in your loss function so that the smaller classes have larger importance when optimizing. In keras you can use weighted_cross_entropy_with_logits, for example.
You could use a combination of both.
It sounds like you are worried about getting a dataset that is too large if you upsample all minority classes to match the majority classes. If this is the case, you can downsample the majority classes to something like 25% or 50%, and at the same time upsample the minority classes. An alternative to upsampling is synthesising samples for the minority classes using an algorithm like SMOTE.
If you are training a neural network in batch, it is good to make sure that the training set is properly shuffled and that you have an even-is distribution of minority/majority samples across the mini batches.
In many training datasets, the class distribution is skewed, I.e. one class is very frequent (e.g. 95%), while another is rare (e.g. 5%). In such applications, the Naive classifier that assigns all test cases to the majority class achieves very high accuracy. However, the Naive classifier does not predict any cases of the minority class, although it is often more important than the majority class.
So are there any ways to improve the accuracy on the minority class just by modifying the training dataset? or Do we have to modify the classification algorithm as well?
This is the issue of the Imbalanced classification. Below are the methods used to treat the Imbalanced Datasets.
1.Undersampling
2.Oversampling
3.Synthetic Data Generation
4.Cost-Sensitive Learning
1.Undersampling
This method works with the majority class. It reduces the number of observations from the majority class to make the data set balanced. This method is best to use when the data set is huge and reducing the number of training samples helps to improve run time and storage troubles.
Undersampling methods are of 2 types: Random and Informative.
Oversampling
This method works with minority class. It replicates the observations from minority class to balance the data. It is also known as upsampling. Similar to undersampling, this method also can be divided into two types: Random Oversampling and Informative Oversampling.
Synthetic Data Generation
In simple words, instead of replicating and adding observations from the minority class, it overcomes imbalances by generates artificial data. It is also a type of oversampling technique.
In regards to synthetic data generation, synthetic minority oversampling technique (SMOTE) is a powerful and widely used method. SMOTE algorithm creates artificial data based on feature space (rather than data space) similarities from minority samples. We can also say, it generates a random set of minority class observations to shift the classifier learning bias towards minority class.
Cost-Sensitive Learning (CSL)
It is another commonly used method to handle classification problems with imbalanced data. It’s an interesting method. In simple words, this method evaluates the cost associated with misclassifying observations.
It does not create a balanced data distribution. Instead, it highlights the imbalanced learning problem by using cost matrices which describe the cost for misclassification in a particular scenario. Recent researches have shown that cost-sensitive learning has many times outperformed sampling methods. Therefore, this method provides a likely alternative to sampling methods.
To build these in R refer the link https://www.analyticsvidhya.com/blog/2016/03/practical-guide-deal-imbalanced-classification-problems/
There are many ways. You can upsample miority class, downsample majority, do some weighting based on loss, or some other measures. As you've tagged your post to be related to TF I suggest you look at Importance Sampling for Keras, http://idiap.ch/~katharas/importance-sampling/ it has some ready to use wrappers that you put around your model.
If a dataset contains multi categories, e.g. 0-class, 1-class and 2-class. Now the goal is to divide new samples into 0-class or non-0-class.
One can
combine 1,2-class into a unified non-0-class and train a binary classifier,
or train a multi-class classifier to do binary classification.
How is the performance of these two approaches?
I think more categories will bring about a more accurate discriminant surface, however the weights of 1- and 2- classes are both lower than non-0-class, resulting in less samples be judged as non-0-class.
Short answer: You would have to try both and see.
Why?: It would really depend on your data and the algorithm you use (just like for many other machine learning questions..)
For many classification algorithms (e.g. SVM, Logistic Regression), even if you want to do a multi-class classification, you would have to perform a one-vs-all classification, which means you would have to treat class 1 and class 2 as the same class. Therefore, there is no point running a multi-class scenario if you just need to separate out the 0.
For algorithms such as Neural Networks, where having multiple output classes is more natural, I think training a multi-class classifier might be more beneficial if your classes 0, 1 and 2 are very distinct. However, this means you would have to choose a more complex algorithm to fit all three. But the fit would possibly be nicer. Therefore, as already mentioned, you would really have to try both approaches and use a good metric to evaluate the performance (e.g. confusion matrices, F-score, etc..)
I hope this is somewhat helpful.
I have data set for classification problem. I have in total 50 classes.
Class1: 10,000 examples
Class2: 10 examples
Class3: 5 examples
Class4: 35 examples
.
.
.
and so on.
I tried to train my classifier using SVM ( both linear and Gaussian kernel). My accurate is very bad on test data 65 and 72% respectively. Now I am thinking to go for a neural network. Do you have any suggestion for any machine learning model and algorithm for large imbalanced data? It would be extremely helpful to me
You should provide more information about the data set features and the class distribution, this would help others to advice you.
In any case, I don't think a neural network fits here as this data set is too small for it.
Assuming 50% or more of the samples are of class 1 then I would first start by looking for a classifier that differentiates between class 1 and non-class 1 samples (binary classification). This classifier should outperform a naive classifier (benchmark) which randomly chooses a classification with a prior corresponding to the training set class distribution.
For example, assuming there are 1,000 samples, out of which 700 are of class 1, then the benchmark classifier would classify a new sample as class 1 in a probability of 700/1,000=0.7 (like an unfair coin toss).
Once you found a classifier with good accuracy, the next phase can be classifying the non-class 1 classified samples as one of the other 49 classes, assuming these classes are more balanced then I would start with RF, NB and KNN.
There are multiple ways to handle with imbalanced datasets, you can try
Up sampling
Down Sampling
Class Weights
I would suggest either Up sampling or providing class weights to balance it
https://towardsdatascience.com/5-techniques-to-work-with-imbalanced-data-in-machine-learning-80836d45d30c
You should think about your performance metric, don't use Accuracy score as your performance metric , You can use Log loss or any other suitable metric
https://machinelearningmastery.com/failure-of-accuracy-for-imbalanced-class-distributions/
From my experience the most successful ways to deal with unbalanced classes are :
Changing distribiution of inputs: 20000 samples (the approximate number of examples which you have) is not a big number so you could change your dataset distribiution simply by using every sample from less frequent classes multiple times. Depending on a number of classes you could set the number of examples from them to e.g. 6000 or 8000 each in your training set. In this case remember to not change distribiution on test and validation set.
Increase the time of training: in case of neural networks, when changing distribiution of your input is impossible I strongly advise you trying to learn network for quite a long time (e.g. 1000 epochs). In this case you have to remember about regularisation. I usually use dropout and l2 weight regulariser with their parameters learnt by random search algorithm.
Reduce the batch size: In neural networks case reducing a batch size might lead to improving performance on less frequent classes.
Change your loss function: using MAPE insted of Crossentropy may also improve accuracy on less frequent classes.
Feel invited to test different combinations of approaches shown by e.g. random search algorithm.
Data-level methods:
Undersampling runs the risk of losing important data from removing data. Oversampling runs the risk of overfitting on training data, especially if the added copies of the minority class are replicas of existing data. Many sophisticated sampling techniques have been developed to mitigate these risks.
One such technique is two-phase learning. You first train your model on the resampled data. This resampled data can be achieved by randomly undersampling large classes until each class has only N instances. You then fine-tune your model on the original data.
Another technique is dynamic sampling: oversample the low-performing classes and undersample the high-performing classes during the training process. Introduced by Pouyanfar et al., the method aims to show the model less of what it has already learned and more of what it has not.
Algorithm-level methods
Cost-sensitive learning
Class-balanced loss
Focal loss
References:
esigning Machine Learning Systems
Survey on deep learning with class imbalance
I am a deep-learning newbie and working on creating a vehicle classifier for images using Caffe and have a 3-part question:
Are there any best practices in organizing classes for training a
CNN? i.e. number of classes and number of samples for each class?
For example, would I be better off this way:
(a) Vehicles - Car-Sedans/Car-Hatchback/Car-SUV/Truck-18-wheeler/.... (note this could mean several thousand classes), or
(b) have a higher level
model that classifies between car/truck/2-wheeler and so on...
and if car type then query the Car Model to get the car type
(sedan/hatchback etc)
How many training images per class is a typical best practice? I know there are several other variables that affect the accuracy of
the CNN, but what rough number is good to shoot for in each class?
Should it be a function of the number of classes in the model? For
example, if I have many classes in my model, should I provide more
samples per class?
How do we ensure we are not overfitting to class? Is there way to measure heterogeneity in training samples for a class?
Thanks in advance.
Well, the first choice that you mentioned corresponds to a very challenging task in computer vision community: fine-grained image classification, where you want to classify the subordinates of a base class, say Car! To get more info on this, you may see this paper.
According to the literature on image classification, classifying the high-level classes such as car/trucks would be much simpler for CNNs to learn since there may exist more discriminative features. I suggest to follow the second approach, that is classifying all types of cars vs. truck and so on.
Number of training samples is mainly proportional to the number of parameters, that is if you want to train a shallow model, much less samples are required. That also depends on your decision to fine-tune a pre-trained model or train a network from scratch. When sufficient samples are not available, you have to fine-tune a model on your task.
Wrestling with over-fitting has been always a problematic issue in machine learning and even CNNs are not free of them. Within the literature, some practical suggestions have been introduced to reduce the occurrence of over-fitting such as dropout layers and data-augmentation procedures.
May not included in your questions, but it seems that you should follow the fine-tuning procedure, that is initializing the network with pre-computed weights of a model on another task (say ILSVRC 201X) and adapt the weights according to your new task. This procedure is known as transfer learning (and sometimes domain adaptation) in community.