SVC classifier taking too much time for training

I am using SVC classifier with Linear kernel to train my model.
Train data: 42000 records
model = SVC(probability=True)
model.fit(self.features_train, self.labels_train)
y_pred = model.predict(self.features_test)
train_accuracy = model.score(self.features_train,self.labels_train)
test_accuracy = model.score(self.features_test, self.labels_test)
It takes more than 2 hours to train my model.
Am I doing something wrong?
Also, what can be done to improve the training time?
Thanks in advance

There are several possibilities to speed up your SVM training. Let n be the number of records, and d the embedding dimensionality. I assume you use scikit-learn.
Reducing training set size. Quoting the docs:
The fit time complexity is more than quadratic with the number of samples which makes it hard to scale to dataset with more than a couple of 10000 samples.
O(n^2) complexity will most likely dominate other factors. Sampling fewer records for training will thus have the largest impact on time. Besides random sampling, you could also try instance selection methods. For example, principal sample analysis has been proposed recently.
Reducing dimensionality. As others have hinted at in their comments, embedding dimension also impacts runtime. Computing inner products for the linear kernel is in O(d). Dimensionality reduction can, therefore, also reduce runtime. In another question, latent semantic indexing was suggested specifically for TF-IDF representations.
Parameters. Use SVC(probability=False) unless you need the probabilities, because they "will slow down that method." (from the docs).
Implementation. To the best of my knowledge, scikit-learn just wraps around LIBSVM and LIBLINEAR. I am speculating here, but you may be able to speed this up by using efficient BLAS libraries, such as in Intel's MKL.
Different classifier. You may try sklearn.svm.LinearSVC, which is...
[s]imilar to SVC with parameter kernel=’linear’, but implemented in terms of liblinear rather than libsvm, so it has more flexibility in the choice of penalties and loss functions and should scale better to large numbers of samples.
Moreover, a scikit-learn dev suggested the kernel_approximation module in a similar question.
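Putting a few of these suggestions together, a minimal sketch (variable names follow the question's, without the self. prefix; parameters are illustrative, not a recommendation from the thread):
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV
# LinearSVC uses liblinear and scales much better than SVC on ~42000 samples
base = LinearSVC(C=1.0)
# Wrap with calibration only if you actually need predicted probabilities
model = CalibratedClassifierCV(base, cv=3)
model.fit(features_train, labels_train)
test_accuracy = model.score(features_test, labels_test)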

I had the same issue, but scaling the data solved the problem
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

You can try using accelerated implementations of the algorithms, such as scikit-learn-intelex - https://github.com/intel/scikit-learn-intelex
For SVM you should see a noticeably higher compute efficiency.
First, install the package:
pip install scikit-learn-intelex
Then add the following to your Python script (before importing the scikit-learn estimators you want patched):
from sklearnex import patch_sklearn
patch_sklearn()

Try using the following code. I had a similar issue with a similar amount of training data.
I changed it to the following and the response was much faster:
model = SVC(gamma='auto')

Related

How to Customize Metric for GridSearchCV in Scikit Learn to tune for specific class?

I have a use case in ML where I have 2 classes, 0 and 1 for a given text.
Class-0: Can afford some misclassifications
Class-1: Very Important, can't afford any misclassifications
There's a huge imbalance in samples between the classes: about 30000 for class-0, and only 1000 for class-1.
While doing the train-test split, I'm stratifying the split based on the labels, so that the 70% train / 30% test ratio is maintained for each label class.
I want to tune parameters in such a way that Precision or Recall for class-1 is improved. I tried using 'f1_macro', 'precision', and 'recall' as individual metrics, and all of them combined, to tune with GridSearchCV, but it's not very helpful because the majority of samples are class-0.
I'm exploring safer ways to reduce the class-0 data, although there's only so much it can be reduced. In any case, even without tuning, and with any parameters, class-0 always has an f1-score above 98%.
So all I care about tuning for is class-1.
Can you please suggest a customized callable metric that only focuses on class-1's Precision, Recall, or F1-score?
I'm using scikit-learn latest stable version.
There's a similar problem here, where the author is trying to tune class-1's F1 score using neural networks (MLP) in Keras.
It's been suggested there to customize the metric, but without explaining how.
Whoever can answer this for scikit-learn can probably also answer the Keras question linked below:
Hyperparameter tuning in Keras (MLP) via RandomizedSearchCV
Using class_weight='balanced' helped here.
I referred to these pages in scikit-learn's official documentation (and related answers):
Understanding how the class_weight parameter works:
https://scikit-learn.org/stable/modules/svm.html#unbalanced-problems
https://stackoverflow.com/a/30982811/3149277
Understanding what values to use for class_weight:
https://scikit-learn.org/stable/modules/svm.html#tips-on-practical-use
How does the class_weight parameter in scikit-learn work?
Due to time limits, I didn't bother defining a custom scoring function, as this worked close enough to my expectations.
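If you do want a callable metric that focuses only on class-1, a minimal sketch using make_scorer (the estimator and parameter grid are illustrative, not from the thread):
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
# pos_label=1 makes f1_score report class-1 performance only
class1_f1 = make_scorer(f1_score, pos_label=1)
param_grid = {'C': [0.1, 1, 10], 'class_weight': ['balanced', None]}
search = GridSearchCV(SVC(), param_grid, scoring=class1_f1, cv=5)
# search.fit(X_train, y_train); search.best_params_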

One Class SVM and Isolation Forest for novelty detection

My question is regarding the Novelty detection algorithms - Isolation Forest and One Class SVM.
I have a training dataset (with 4-5 features) where all the sample points are inliers, and I need to classify any new data as an inlier or outlier and ingest it into another dataframe accordingly.
While trying to use Isolation Forest or One Class SVM, I have to input the contamination percentage (nu) during the training phase. However, as the training dataset doesn't have any contamination, do I need to add outliers to the training dataframe and set that outlier fraction as nu?
Also, while using Isolation Forest, I noticed that the outlier percentage changes every time I predict, even though I don't change the model. Is there a way to take care of this problem apart from going to the Extended Isolation Forest algorithm?
Thanks in advance.
Regarding contamination for isolation forest:
If you are training on normal instances only (all inliers), you should set contamination to zero. If you don't specify it, contamination defaults to 0.1 (as of scikit-learn 0.20).
The following is a simple example to show this.
1- Import libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import IsolationForest
rng = np.random.RandomState(42)
2- Generate a 2D dataset
X = 0.3 * rng.randn(1000, 2)
3- Train iForest model and predict the outliers
clf = IsolationForest(random_state=rng, contamination=0)
clf.fit(X)
y_pred_train = clf.predict(X)
4- Print # of anomalies
print(sum(y_pred_train==-1))
This would give you 0 anomalies. Now if you change the contamination to 0.15, the program specifies 150 anomalies out of the same dataset you already had (same because of RandomState(42)).
[References]:
[1] Liu, Fei Tony, Ting, Kai Ming and Zhou, Zhi-Hua. "Isolation Forest." Data Mining, 2008. ICDM '08. Eighth IEEE International Conference on Data Mining.
[2] Liu, Fei Tony, Ting, Kai Ming and Zhou, Zhi-Hua. "Isolation-Based Anomaly Detection." ACM Transactions on Knowledge Discovery from Data (TKDD), 2012.
"Training with normal data(inliers) only".
This goes against the nature of Isolation Forest. Training here is completely different from training a neural network. Because people use these algorithms without clarifying what is going on, and write blogs based on a superficial understanding of ML, we end up with questions like this.
clf = IsolationForest(random_state=rng, contamination=0)
clf.fit(X)
What does fit do here? Is it training? If yes, what is trained?
In Isolation Forest:
First, we build trees,
Then, we pass each data point through each tree,
Then, we calculate the average path that is required to isolate the point.
The shorter the path, the higher the anomaly score.
contamination will determine your threshold. If it is 0, then what is your threshold?
Please read the original paper first to understand the logic behind it. Not every anomaly detection algorithm suits every occasion.
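To see how contamination becomes a score threshold in scikit-learn's implementation, a minimal sketch (reusing the data from the answer above):
import numpy as np
from sklearn.ensemble import IsolationForest
rng = np.random.RandomState(42)
X = 0.3 * rng.randn(1000, 2)
clf = IsolationForest(random_state=rng, contamination=0.15).fit(X)
scores = clf.score_samples(X)          # the lower the score, the more abnormal
print(clf.offset_)                     # cutoff derived from the contamination fraction
print(np.mean(scores < clf.offset_))   # roughly 0.15 of the points fall below it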

Training on imbalanced data using TensorFlow

The Situation:
I am wondering how to use TensorFlow optimally when my training data is imbalanced in label distribution between 2 labels. For instance, suppose the MNIST tutorial is simplified to only distinguish between 1's and 0's, where all images available to us are either 1's or 0's. This is straightforward to train using the provided TensorFlow tutorials when we have roughly 50% of each type of image to train and test on. But what about the case where 90% of the images available in our data are 0's and only 10% are 1's? I observe that in this case, TensorFlow routinely predicts my entire test set to be 0's, achieving an accuracy of a meaningless 90%.
One strategy I have used to some success is to pick random batches for training that do have an even distribution of 0's and 1's. This approach ensures that I can still use all of my training data and produced decent results, with less than 90% accuracy, but a much more useful classifier. Since accuracy is somewhat useless to me in this case, my metric of choice is typically area under the ROC curve (AUROC), and this produces a result respectably higher than .50.
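A minimal sketch of such balanced batch sampling (an illustrative helper, assuming binary 0/1 labels and NumPy arrays; not from the original post):
import numpy as np
def balanced_batch(X, y, batch_size, rng):
    # Draw half of the batch from each class so every batch is evenly distributed
    pos = np.where(y == 1)[0]
    neg = np.where(y == 0)[0]
    half = batch_size // 2
    idx = np.concatenate([rng.choice(pos, half), rng.choice(neg, half)])
    rng.shuffle(idx)
    return X[idx], y[idx]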
Questions:
(1) Is the strategy I have described an accepted or optimal way of training on imbalanced data, or is there one that might work better?
(2) Since the accuracy metric is not as useful in the case of imbalanced data, is there another metric that can be maximized by altering the cost function? I can certainly calculate AUROC post-training, but can I train in such a way as to maximize AUROC?
(3) Is there some other alteration I can make to my cost function to improve my results for imbalanced data? Currently, I am using a default suggestion given in TensorFlow tutorials:
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(pred, y))
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(cost)
I have heard this may be possible by up-weighting the cost of miscategorizing the smaller label class, but I am unsure of how to do this.
(1) It's OK to use your strategy. I work with imbalanced data as well; I try down-sampling and up-sampling methods first to make the training set evenly distributed, or use an ensemble method that trains each classifier on an evenly distributed subset.
(2) I haven't seen any method to maximise AUROC directly. My thought is that AUROC is based on the true positive and false positive rates, which don't tell you how well the model works on each instance. Thus, maximising it may not necessarily maximise the capability to separate the classes.
(3) Regarding weighting the cost by the ratio of class instances, it is similar to Loss function for class imbalanced binary classifier in Tensor flow and its answer.
Regarding imbalanced datasets, the first two methods that come to mind are upweighting positive samples and sampling to achieve balanced batch distributions.
Upweighting positive samples
This refers to increasing the losses of misclassified positive samples when training on datasets that have far fewer positive samples. It incentivizes the ML algorithm to learn parameters that are better for positive samples. For binary classification, there is a simple API in TensorFlow that achieves this; see weighted_cross_entropy_with_logits, referenced below:
https://www.tensorflow.org/api_docs/python/tf/nn/weighted_cross_entropy_with_logits
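A minimal sketch of how it can be used (TF 2.x keyword names; argument names differ slightly in older versions, and the values here are illustrative):
import tensorflow as tf
labels = tf.constant([[1.0], [0.0], [0.0], [0.0]])
logits = tf.constant([[0.2], [-1.0], [0.5], [-2.0]])
# pos_weight > 1 upweights the loss contribution of positive (minority) samples
loss = tf.nn.weighted_cross_entropy_with_logits(labels=labels, logits=logits, pos_weight=9.0)
mean_loss = tf.reduce_mean(loss)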
Batch Sampling
This involves sampling the dataset so that each batch of training data has an even distribution of positive and negative samples. This can be done using the rejection sampling API provided by TensorFlow:
https://www.tensorflow.org/api_docs/python/tf/contrib/training/rejection_sample
I'm someone who struggles with imbalanced data as well. My strategies to counter imbalanced data are as below.
1) Use a cost function that accounts for the 0 and 1 labels at the same time, like below.
cost = tf.reduce_mean(-tf.reduce_sum(y*tf.log(_pred) + (1-y)*tf.log(1-_pred), reduction_indices=1))
2) Use SMOTE, an oversampling method that makes the number of 0 and 1 labels similar. Refer to here: http://comments.gmane.org/gmane.comp.python.scikit-learn/5278
Both strategies worked when I tried to build a credit rating model.
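For (2), a minimal sketch using the imbalanced-learn package (not mentioned in the thread, which links a mailing-list discussion instead), assuming the usual X_train/y_train arrays:
from imblearn.over_sampling import SMOTE
# Oversample the minority class until the 0 and 1 label counts are similar
X_resampled, y_resampled = SMOTE(random_state=42).fit_resample(X_train, y_train)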
Logistic regression is a typical method for handling imbalanced data in binary classification, such as predicting default rates. AUROC is one of the best metrics for imbalanced data.
1) Yes. This is a well-received strategy for countering imbalanced data, but for neural nets it works well only if you are using SGD.
Another easy way to balance the training data is to use weighted examples: just amplify the per-instance loss with a larger (or smaller) weight when seeing minority (or majority) examples. If you use online gradient descent, it can be as simple as using a larger (or smaller) learning rate for those examples.
Not sure about 2.

Logistic regression on huge dataset

I need to run a logistic regression on a huge dataset (many GBs of data). I am currently using Julia's GLM package for this. Although my regression works on subsets of the data, I run out of memory when I try to run it on the full dataset.
Is there a way to compute logistic regressions on huge, non-sparse datasets without using a prohibitive amount of memory? I thought about separating the data into chunks, calculating regressions on each of these and aggregating them somehow, but I'm not sure this would give valid results.
Vowpal Wabbit is designed for that: linear models when the data (or even the model) does not fit in memory.
You can do the same thing by hand, using stochastic gradient descent (SGD): write the "loss function" of your logistic regression (the negative log-likelihood), minimize it just a bit on a chunk of the data (perform a single gradient descent step), do the same thing on another chunk of data, and continue.
After several passes over the data, you should have a good solution. It works better if the data arrives in a random order.
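A minimal sketch of that by-hand approach in Python (the chunk reader is hypothetical; labels are assumed to be 0/1):
import numpy as np
def sgd_logistic_step(w, X_chunk, y_chunk, lr=0.01):
    # One gradient step on the negative log-likelihood for a single chunk
    p = 1.0 / (1.0 + np.exp(-X_chunk @ w))
    grad = X_chunk.T @ (p - y_chunk) / len(y_chunk)
    return w - lr * grad
# w = np.zeros(n_features)
# for X_chunk, y_chunk in stream_chunks():   # hypothetical lazy chunk reader
#     w = sgd_logistic_step(w, X_chunk, y_chunk)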
Another idea (ADMM, I think), similar to what you suggest, would be to split the data into chunks and minimize the loss function on each chunk. Of course, the solutions on the different chunks are not the same. To address this problem, we can change the objective functions by adding a small penalty for the difference between the solution on a chunk of data and the average solution, and re-optimize everything. After a few iterations, the solutions become closer and closer and eventually converge. This has the added advantage of being parallelizable.
I have not personally used it, but the StreamStats.jl package is designed for this use case. It supports linear and logistic regression, as well as other streaming statistic functions.
Keep an eye on Josh Day's awesome package OnlineStats. In addition to tons of online algorithms for various statistics, regression, classification, dimensionality reduction, and distribution estimation tasks, we are also actively working on porting all missing functionality from StreamStats and merging the two packages.
Also, I've been working on a very experimental package OnlineAI (extending OnlineStats) which will extend some online algorithms into the machine learning space.
Several scikit-learn estimators, including logistic regression, implement partial_fit, which allows batch-wise training on large, out-of-core datasets.
such models can be used for classification using an out-of-core approach: learning from data that doesn’t fit into main memory.
Pseudo-code:
from sklearn.linear_model import SGDClassifier
clf = SGDClassifier(loss='log')
classes = [0, 1]  # partial_fit needs the full set of classes on the first call
for batch_x, batch_y in some_generator:  # lazily read data in chunks
    clf.partial_fit(batch_x, batch_y, classes=classes)
To add to Tom's answer, OnlineStats.jl has a statistical learning type (StatLearn) which relies on stochastic approximation algorithms, each of which uses O(1) memory. Logistic regression and support vector machines are available for binary response data. The model can be updated with new batches of data, so you don't need to load your whole dataset at once. It's also extremely fast. Here's a basic example:
using OnlineStats, StatsBase
o = StatLearn(n_predictors, LogitMarginLoss())
# load batch 1
fit!(o, (x1, y1))
# load batch 2
fit!(o, (x2, y2))
# load batch 3
fit!(o, (x3, y3))
...
coef(o)

Scalable or online out-of-core multi-label classifiers

I have been blowing my brains out over the past 2-3 weeks on this problem.
I have a multi-label (not multi-class) problem where each sample can belong to several of the labels.
I have around 4.5 million text documents as training data and around 1 million as test data. The labels are around 35K.
I am using scikit-learn. For feature extraction I was previously using TfidfVectorizer, which didn't scale at all; now I am using HashingVectorizer, which is better but still not scalable enough given the number of documents that I have.
vect = HashingVectorizer(strip_accents='ascii', analyzer='word', stop_words='english', n_features=(2 ** 10))
scikit-learn provides a OneVsRestClassifier into which I can feed any estimator. For multi-label, I found only LinearSVC & SGDClassifier to work correctly. According to my benchmarks, SGD outperforms LinearSVC in both memory & time. So I have something like this:
clf = OneVsRestClassifier(SGDClassifier(loss='log', penalty='l2', n_jobs=-1), n_jobs=-1)
But this suffers from some serious issues:
OneVsRest does not have a partial_fit method which makes it impossible for out-of-core learning. Are there any alternatives for that?
HashingVectorizer/Tfidf both work on a single core and don't have an n_jobs parameter. It's taking too much time to hash the documents. Any alternatives/suggestions? Also, is the value of n_features correct?
I tested on 1 million documents. The hashing takes 15 minutes, and when it comes to clf.fit(X, y), I get a MemoryError because OvR internally uses LabelBinarizer and tries to allocate a matrix of dimensions (y x classes), which is fairly impossible to allocate. What should I do?
Are there any other libraries out there which have reliable & scalable multi-label algorithms? I know of gensim & Mahout, but neither of them has anything for multi-label situations.
I would do the multi-label part by hand. The OneVsRestClassifier treats the labels as independent problems anyhow. You can just create n_labels many classifiers and then call partial_fit on them. You can't use a pipeline if you only want to hash once (which I would advise), though.
Not sure about speeding up the hashing vectorizer. You gotta ask #Larsmans and #ogrisel for that ;)
Having partial_fit on OneVsRestClassifier would be a nice addition, and I don't see a particular problem with it, actually. You could also try to implement that yourself and send a PR.
The algorithm that OneVsRestClassifier implements is very simple: it just fits K binary classifiers when there are K classes. You can do this in your own code instead of relying on OneVsRestClassifier. You can also do this on at most K cores in parallel: just run K processes. If you have more classes than processors in your machine, you can schedule training with a tool such as GNU parallel.
Multi-core support in scikit-learn is work in progress; fine-grained parallel programming in Python is quite tricky. There are potential optimizations for HashingVectorizer, but I (one of the hashing code's authors) haven't come round to it yet.
If you follow my (and Andreas') advice to do your own one-vs-rest, this shouldn't be a problem anymore.
The trick in (1.) applies to any classification algorithm.
As for the number of features, it depends on the problem, but for large scale text classification 2^10 = 1024 seems very small. I'd try something around 2^18 - 2^22. If you train a model with L1 penalty, you can call sparsify on the trained model to convert its weight matrix to a more space-efficient format.
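A minimal sketch of the hand-rolled one-vs-rest that both answers suggest (one partial_fit-capable binary classifier per label, reusing the same hashed features for all of them; some_generator is a hypothetical lazy batch reader):
from sklearn.linear_model import SGDClassifier
n_labels = 35000                               # roughly the number of labels in the question
clfs = [SGDClassifier(loss='log') for _ in range(n_labels)]
# some_generator lazily yields (hashed_features, label_matrix) chunks,
# where label_matrix has shape (n_samples, n_labels) with 0/1 entries
for batch_x, batch_y in some_generator:
    for j, clf in enumerate(clfs):
        clf.partial_fit(batch_x, batch_y[:, j], classes=[0, 1])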
My argument for scalability is that instead of using OneVsRest, which is just the simplest of baselines, you should use a more advanced ensemble of problem-transformation methods. In my paper I provide a scheme for dividing the label space into subspaces and transforming the subproblems into multi-class single-label classifications using Label Powerset. To try this, just use the following code, which utilizes a multi-label library built on top of scikit-learn - scikit-multilearn:
from skmultilearn.ensemble import LabelSpacePartitioningClassifier
from skmultilearn.cluster import IGraphLabelCooccurenceClusterer
from skmultilearn.problem_transform import LabelPowerset
from sklearn.linear_model import SGDClassifier
# base multi-class classifier SGD
base_classifier = SGDClassifier(loss='log', penalty='l2', n_jobs=-1)
# problem transformation from multi-label to single-label multi-class
transformation_classifier = LabelPowerset(base_classifier)
# clusterer dividing the label space using fast greedy modularity maximizing scheme
clusterer = IGraphLabelCooccurenceClusterer('fastgreedy', weighted=True, include_self_edges=True)
# ensemble
clf = LabelSpacePartitioningClassifier(transformation_classifier, clusterer)
clf.fit(x_train, y_train)
prediction = clf.predict(x_test)
A partial_fit() method was recently added to OneVsRestClassifier in sklearn, so it should hopefully be available in the upcoming release (it's in the master branch already).
The size of your problem makes it attractive to tackle with neural networks. Have a look at magpie; it should give much better results than linear classifiers.
