I am wondering if there is a best practice for exponentially weighting the training samples for a random forest by time (putting more weight on more recent samples)? One way I can think of is to sample the full dataset with replacement, with probabilities determined by the time-based weights. Are there any other methods I should consider? It would be great if anyone knows some Python packages that could help me accomplish this goal. Any help is much appreciated!
The sklearn implementation of Random Forests allows you to specify sample weights in the fit function.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# fill sample_weight with the desired weighting (uniform placeholder here)
sample_weight = np.ones(y.shape)
estimator = RandomForestClassifier()
estimator.fit(X, y, sample_weight=sample_weight)
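For the time-based weighting you describe, one possible sketch is to decay the weight exponentially with sample age. Here timestamps and halflife are hypothetical names: timestamps is assumed to be a numeric array aligned with y, and halflife a decay constant in the same time units.
age = timestamps.max() - timestamps   # age relative to the newest sample
# weight halves every `halflife` time units; the newest sample gets weight 1
sample_weight = np.exp(-np.log(2) * age / halflife)
estimator.fit(X, y, sample_weight=sample_weight)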
My question is regarding the novelty detection algorithms Isolation Forest and One-Class SVM.
I have a training dataset (with 4-5 features) where all the sample points are inliers, and I need to classify any new data as an inlier or outlier and ingest it into another dataframe accordingly.
While trying to use Isolation Forest or One-Class SVM, I have to input the contamination percentage (nu) during the training phase. However, as the training dataset doesn't have any contamination, do I need to add outliers to the training dataframe and set that outlier fraction as nu?
Also, while using the Isolation Forest, I noticed that the outlier percentage changes every time I predict, even though I don't change the model. Is there a way to take care of this problem apart from moving to the Extended Isolation Forest algorithm?
Thanks in advance.
Regarding contamination for Isolation Forest:
If you are training on normal instances only (all inliers), you should set contamination to zero. If you don't specify it, contamination defaults to 0.1 (in scikit-learn 0.20).
The following simple example shows this:
1- Import libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import IsolationForest
rng = np.random.RandomState(42)
2- Generate a 2D dataset
X = 0.3 * rng.randn(1000, 2)
3- Train iForest model and predict the outliers
clf = IsolationForest(random_state=rng, contamination=0)
clf.fit(X)
y_pred_train = clf.predict(X)
4- Print # of anomalies
print(sum(y_pred_train==-1))
This gives 0 anomalies. Now, if you change the contamination to 0.15, the model flags 150 of the same 1000 points as anomalies (the same dataset because of RandomState(42)).
References:
[1] Liu, Fei Tony, Ting, Kai Ming and Zhou, Zhi-Hua. "Isolation Forest." Data Mining, 2008. ICDM '08. Eighth IEEE International Conference on Data Mining.
[2] Liu, Fei Tony, Ting, Kai Ming and Zhou, Zhi-Hua. "Isolation-Based Anomaly Detection." ACM Transactions on Knowledge Discovery from Data (TKDD), 2012.
"Training with normal data(inliers) only".
This is against the nature of Isolation Forest. The training is here completely different than training in the Neural Networks. Because everyone is using these without clarifying what is going on, and writing blogs with 20% of ML knowledge, we are having questions like this.
clf = IsolationForest(random_state=rng, contamination=0)
clf.fit(X)
What does fit do here? Is it training? If yes, what is trained?
In Isolation Forest:
First, we build the trees;
then, we pass each data point through each tree;
then, we calculate the average path length required to isolate the point.
The shorter the path, the higher the anomaly score.
contamination determines your threshold on that score. If it is 0, then what is your threshold?
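One way to see this concretely: you can inspect the raw scores yourself and pick your own threshold instead of relying on contamination. A minimal sketch (the 1% quantile cutoff is an illustrative assumption, not part of the algorithm):
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
X = 0.3 * rng.randn(1000, 2)

clf = IsolationForest(random_state=rng).fit(X)
scores = clf.score_samples(X)           # lower score = more anomalous
threshold = np.quantile(scores, 0.01)   # assumption: flag the lowest 1%
outliers = scores < threshold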
Please read the original paper first to understand the logic behind it. Not every anomaly detection algorithm suits every occasion.
I am using an SVC classifier with a linear kernel to train my model.
Training data: 42000 records
model = SVC(probability=True)
model.fit(self.features_train, self.labels_train)
y_pred = model.predict(self.features_test)
train_accuracy = model.score(self.features_train,self.labels_train)
test_accuracy = model.score(self.features_test, self.labels_test)
It takes more than 2 hours to train my model.
Am I doing something wrong?
Also, what can be done to improve the training time?
Thanks in advance.
There are several possibilities to speed up your SVM training. Let n be the number of records, and d the embedding dimensionality. I assume you use scikit-learn.
Reducing training set size. Quoting the docs:
The fit time complexity is more than quadratic with the number of samples which makes it hard to scale to dataset with more than a couple of 10000 samples.
This quadratic-or-worse scaling in n will most likely dominate the other factors. Sampling fewer records for training will thus have the largest impact on runtime. Besides random sampling, you could also try instance selection methods; for example, principal sample analysis has been proposed recently.
Reducing dimensionality. As others have hinted at in their comments, embedding dimension also impacts runtime. Computing inner products for the linear kernel is in O(d). Dimensionality reduction can, therefore, also reduce runtime. In another question, latent semantic indexing was suggested specifically for TF-IDF representations.
Parameters. Use SVC(probability=False) unless you need the probabilities, because they "will slow down that method." (from the docs).
Implementation. To the best of my knowledge, scikit-learn just wraps around LIBSVM and LIBLINEAR. I am speculating here, but you may be able to speed this up by using efficient BLAS libraries, such as in Intel's MKL.
Different classifier. You may try sklearn.svm.LinearSVC, which is...
[s]imilar to SVC with parameter kernel=’linear’, but implemented in terms of liblinear rather than libsvm, so it has more flexibility in the choice of penalties and loss functions and should scale better to large numbers of samples.
Moreover, a scikit-learn dev suggested the kernel_approximation module in a similar question.
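A minimal sketch combining two of these suggestions, the LinearSVC swap and the kernel_approximation route (names like X_train and y_train are assumed from the question; n_components=300 is an arbitrary illustrative choice):
from sklearn.svm import LinearSVC
from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

# Option 1: liblinear-based linear SVM; scales better with sample count
linear_clf = LinearSVC(dual=False)
linear_clf.fit(X_train, y_train)

# Option 2: approximate an RBF kernel, then fit a linear model on the map
approx_clf = make_pipeline(
    Nystroem(n_components=300, random_state=0),
    SGDClassifier(loss="hinge"),
)
approx_clf.fit(X_train, y_train)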
I had the same issue, but scaling the data solved the problem. SVMs are not scale-invariant, and unscaled features can make the optimizer converge very slowly.
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
You can try an accelerated implementation of the algorithms, such as scikit-learn-intelex: https://github.com/intel/scikit-learn-intelex
For SVM in particular you should be able to get noticeably higher compute efficiency.
First, install the package:
pip install scikit-learn-intelex
Then add the following to your Python script, before importing anything from scikit-learn:
from sklearnex import patch_sklearn
patch_sklearn()
Try the following. I had a similar issue with a similar size of training data.
I changed it to the following and the response was much faster:
model = SVC(gamma='auto')
The dataset is extremely imbalanced: the positives are only about 10% of the negatives, e.g. class 0: 11401 samples, class 1: 1280 samples.
I have tried:
1. RandomForestClassifier with GridSearchCV for hyperparameter tuning.
2. Weighted random forest with class_weight="balanced".
3. Penalised SVC.
4. Upsampling and downsampling.
Still, I don't get good precision or recall with any of the above methods.
I'm aware prevalence is related to PPV, and my dataset has a very low prevalence of class 1. Also, a random forest may lean towards the majority class.
I was hoping resampling would work, but it didn't. Am I missing something? Any suggestion would be really appreciated.
A few methods should help you:
Predict probabilities and apply a manual threshold (see the sketch after this list).
Change the loss/metric you are using.
For a heavily imbalanced dataset like this (closer to outlier detection), you shouldn't just use class_weight="balanced"; put more weight on the minority class explicitly.
Try other algorithms to see if some do better (XGBoost, CatBoost, LightGBM if you want to stick with tree-based solutions).
You can also use TPOT to search for the best sklearn pipeline for your particular dataset.
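A minimal sketch of the manual-thresholding idea; the 0.3 cutoff and the explicit class_weight values are illustrative assumptions, to be tuned on a validation set for your precision/recall target:
from sklearn.ensemble import RandomForestClassifier

# assumption: X_train, y_train, X_test already exist as in the question
clf = RandomForestClassifier(class_weight={0: 1, 1: 9}, random_state=0)
clf.fit(X_train, y_train)

proba = clf.predict_proba(X_test)[:, 1]  # probability of the positive class
y_pred = (proba >= 0.3).astype(int)      # lower cutoff trades precision for recall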
Tell me if any of those helped you.
I have a OneVsRestClassifier (scikit-learn) which has been trained.
clf = OneVsRestClassifier(LogisticRegression(C=1.2, penalty='l1')).fit(X_train, y_train)
I want to find out the loss for my test data. I used the log_loss function, but it does not seem to work because I have multiple labels as output for each test case. What do I do?
The classification problem that you are referring to is known as a multi-label classification problem. Using OneVsRestClassifier for this purpose is a good decision. By default the score method uses subset accuracy, which is a very harsh metric since it requires you to guess the entire set of labels for a sample correctly.
Some other loss functions, provided by scikit-learn, that you can use are as follows:
Hamming Loss - This measures the Hamming distance between your predicted labels and the true labels, i.e. the fraction of labels that are incorrectly predicted.
Jaccard Similarity Coefficient Score - This measures the Jaccard similarity between your predicted labels and the true labels.
Precision, Recall and F-Measures - In the case of multi-label classification, the notion of Precision, Recall and F-Measures can be applied to each class independently. The following guide explains how to combine them across all labels in multi-label classification.
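A short sketch of computing these with scikit-learn (assuming y_test and the predictions are binary indicator matrices, as OneVsRestClassifier produces; the choice of average is up to you):
from sklearn.metrics import hamming_loss, jaccard_score, f1_score

y_pred = clf.predict(X_test)  # binary indicator matrix in the multi-label case

print(hamming_loss(y_test, y_pred))
print(jaccard_score(y_test, y_pred, average="samples"))
print(f1_score(y_test, y_pred, average="micro"))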
If you need to also rank the labels as it is done in multi-label ranking problems, then there are other more advanced techniques available in scikit-learn which are very well documented with examples here. If you are dealing with this kind of a problem, then let me know in the comments, I will explain each of these metrics in more details.
Hope this helps!
I have trained and tested the data using two classifiers, Naive Bayes and SMO. Now I need to combine them using stacking. I need to know how I can perform stacking and what should be my base-level classifier and meta-level classifier.
It sounds like what you want is ensemble learning rather than stacking. In an ensemble, you would use both classifiers to make decisions and combine those decisions, as sketched below.
Stacking is a process where the output of one level of classifiers is used as input for the next level. That is, the predictions of some classifiers are the features for other classifiers. For this, you would need to retrain one of the models with the outputs of the first classifier as input.
Which classifier should be used where depends on your specific application. Similarly, how to do it depends on what system you used to train these classifiers.
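For the ensemble (voting) route, a minimal scikit-learn sketch. The question mentions Naive Bayes and SMO (Weka's SVM trainer); GaussianNB and a linear SVC are used here as stand-ins, which is an assumption on my part, and X_train/y_train are assumed to exist:
from sklearn.ensemble import VotingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# hard voting: each classifier votes, the majority label wins
ensemble = VotingClassifier(
    estimators=[("nb", GaussianNB()), ("svm", SVC(kernel="linear"))],
    voting="hard",
)
ensemble.fit(X_train, y_train)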
To choose the base-level classifiers for stacking, consider diverse classifiers that can potentially learn from a subset of the features or a subset of the data. For example, your base-level classifiers could be k-NN, Random Forest and Naive Bayes. For the meta-level classifier, we would like one that learns well from the base-level predictions as features; a good candidate is Logistic Regression.
Using mlxtend library as an example, we have:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from mlxtend.classifier import StackingClassifier
clf1 = KNeighborsClassifier(n_neighbors=1)
clf2 = RandomForestClassifier(random_state=1)
clf3 = GaussianNB()
lr = LogisticRegression()
sclf = StackingClassifier(classifiers=[clf1, clf2, clf3],
                          meta_classifier=lr)
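The stacked model is then fit and used like any other scikit-learn estimator; a minimal usage sketch (X_train, y_train, X_test are assumed names):
sclf.fit(X_train, y_train)
predictions = sclf.predict(X_test)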
For examples of stacked classifiers and regressors, see the mlxtend documentation page.