I recently used SelectKBest(score_func=, k=20) to pick the best features. f_classif computes the ANOVA F-value between each feature and the class label for classification tasks. I used it and got the best results. I learned that the ANOVA F-test computes the ratio of 'between-class variance' to 'within-class variance'. But I have the following questions:
1) Does f_classif use combinations of features to compute the F-score?
2) Can I have pseudocode for how the fit function of SelectKBest works?
3) How does f_classif work in sklearn? Thanks in advance.
f_classif is computed on a feature-by-feature basis, i.e. each feature is scored individually against the class labels. So no combinations of features are involved.
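To make the per-feature behaviour concrete, here is a minimal sketch, assuming the iris dataset as stand-in data and k=2 (both are illustrative choices, not from the question). It recomputes each feature's F-value with scipy's one-way ANOVA, one column at a time, and then mimics what SelectKBest's fit roughly amounts to: score every column, keep the k columns with the highest scores.

```python
import numpy as np
from scipy.stats import f_oneway
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

# Stand-in data (assumption): 150 samples, 4 features, 3 classes.
X, y = load_iris(return_X_y=True)

# f_classif returns one F-value and one p-value per column of X.
f_sklearn, p_sklearn = f_classif(X, y)

# Same scores by hand: a separate one-way ANOVA for each feature,
# where the groups are that feature's values split by class label.
f_manual = np.array([
    f_oneway(*(X[y == label, j] for label in np.unique(y))).statistic
    for j in range(X.shape[1])
])

# Rough equivalent of SelectKBest.fit followed by get_support:
# score all features, then keep the indices of the k largest scores.
k = 2
top_k_manual = np.sort(np.argsort(f_manual)[-k:])

selector = SelectKBest(score_func=f_classif, k=k).fit(X, y)
top_k_selector = np.flatnonzero(selector.get_support())

print(np.allclose(f_sklearn, f_manual))
print(top_k_manual, top_k_selector)
```

Since every column is scored in isolation, two features that are only informative together will both receive low scores from this kind of filter.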
I have a use case in ML where I have 2 classes, 0 and 1 for a given text.
Class-0: Can afford some misclassifications
Class-1: Very Important, can't afford any misclassifications
There's a huge imbalance in samples for both classes,
about 30000 for class-0, and only 1000 for class-1
While doing the train-test split, I'm stratifying the split based on the labels, such that, the ratio of 70% train and 30% test is maintained for each label class.
I want to tune parameters in such a way that Precision or Recall for class-1 is improved. I tried using 'f1_macro', 'precision', 'recall' as individual metrics and all combined as well to tune using GridSearchCV, but it's less helpful due to majority samples being Class-0.
I'm exploring safe ways to reduce the class-0 data, although it can only be reduced by a small degree. In any case, even without tuning, or with any parameters, class-0 always has an F1-score above 98%.
So all I care about tuning is for class-1.
Can you please suggest, perhaps a customized callable metric such that it only focuses on Class-1's Precision, Recall or F1-Score?
I'm using scikit-learn latest stable version.
There's a similar problem here, where the author is trying to tune class-1's F1 score using neural networks (MLP) in Keras.
It's been suggested there to try a custom metric, but nobody mentioned how.
Whoever can answer this for scikit-learn can probably also answer the linked Keras question below.
Hyperparameter tuning in Keras (MLP) via RandomizedSearchCV
Using class_weight='balanced' is helping here.
I referred to these pages in scikit-learn's official documentation.
Understanding how the class_weight parameter works:
https://scikit-learn.org/stable/modules/svm.html#unbalanced-problems
https://stackoverflow.com/a/30982811/3149277
Understanding what values to use for class_weight:
https://scikit-learn.org/stable/modules/svm.html#tips-on-practical-use
How does the class_weight parameter in scikit-learn work?
However, due to time limits, I didn't bother defining the custom function, as this seemed to work close to my expectations.
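For completeness, a custom scorer restricted to class 1 could look roughly like the sketch below. The synthetic data, the LogisticRegression estimator and the parameter grid are assumptions for illustration only; the relevant part is make_scorer with pos_label=1. Note that for binary targets the built-in 'precision', 'recall' and 'f1' scoring strings already score label 1 as the positive class, so the explicit scorers mainly make that choice visible.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, make_scorer, precision_score, recall_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic imbalanced data (assumption): roughly 95% class 0, 5% class 1.
X, y = make_classification(
    n_samples=3000, n_features=20, weights=[0.95, 0.05], random_state=0
)

# Scorers that only look at class 1; zero_division=0 avoids warnings
# when a candidate never predicts class 1 on some fold.
scoring = {
    "precision_class1": make_scorer(precision_score, pos_label=1, zero_division=0),
    "recall_class1": make_scorer(recall_score, pos_label=1, zero_division=0),
    "f1_class1": make_scorer(f1_score, pos_label=1, zero_division=0),
}

# Hypothetical grid: regularisation strength and class weighting.
param_grid = {"C": [0.1, 1.0, 10.0], "class_weight": [None, "balanced"]}

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    scoring=scoring,
    refit="f1_class1",  # the metric used to pick the best candidate
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```

With several metrics in scoring, refit decides which one selects the final model; the other class-1 metrics remain available in search.cv_results_ under keys such as mean_test_recall_class1.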
I have a binary classification problem with around 15 features. I chose these features using some other model. Now I want to perform Bayesian logistic regression on these features. My target classes are highly imbalanced (the minority class is 0.001%) and I have around 6 million records. I want to build a model that can be trained nightly or on weekends using Bayesian logistic regression.
Currently, I have divided the data into 15 parts. I train my model on the first part and test on the last part, then I update my priors using the Interpolated distribution of pymc3 and rerun the model on the 2nd part of the data. I check the accuracy and other metrics (ROC, F1-score) after each run.
Problems:
My score is not improving.
Am I using the right approach?
This process is taking too much time.
If someone can guide me to the right approach with code snippets, it would be very helpful.
You can use variational inference (VI). It is faster than sampling and usually gives similar results. pymc3 itself provides methods for VI, so you can explore that.
I can only answer this part of the question. If you can elaborate on your problem a bit further, maybe I can help you.
What does it mean to provide weights to each sample for classification? How does a classification algorithm like logistic regression or SVMs use weights to emphasize certain examples more than others? I would love going into details to unpack how these algorithms leverage sample weights.
If you look at the sklearn documentation for logistic regression, you can see that the fit function has an optional sample_weight parameter which is defined as an array of weights assigned to individual samples.
This option is meant for imbalanced datasets. Let's take an example: I've got a lot of data and some of it is just noise, but other points are really important to me and I'd like my algorithm to give them much more consideration than the rest. So I assign a larger weight to those points to make sure they are dealt with properly.
It changes the way the loss is calculated. Each sample's error term (residual) is multiplied by the weight of that point, and thus the minimum of the objective function is shifted. I hope it's clear enough. I don't know if you're familiar with the math behind it, so here is a small introduction to have everything at hand (apologies if this was not needed):
https://perso.telecom-paristech.fr/rgower/pdf/M2_statistique_optimisation/Intro-ML-expanded.pdf
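One way to see the effect of the weights is the sketch below. It assumes synthetic data and integer weights: because each sample's loss term is multiplied by its weight, giving a sample weight 3 should lead to (numerically almost) the same fitted model as including that sample three times. The tight tol and high max_iter are only there so both fits converge to the same minimum.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic, noisy binary problem (assumption for illustration).
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# Emphasise the first 20 samples: their loss terms count three times.
weights = np.ones(200)
weights[:20] = 3.0

weighted = LogisticRegression(tol=1e-8, max_iter=10_000)
weighted.fit(X, y, sample_weight=weights)

# Same emphasis without weights: physically repeat those 20 samples
# so that each of them appears three times in the training set.
X_rep = np.vstack([X, X[:20], X[:20]])
y_rep = np.concatenate([y, y[:20], y[:20]])

repeated = LogisticRegression(tol=1e-8, max_iter=10_000)
repeated.fit(X_rep, y_rep)

print(weighted.coef_, weighted.intercept_)
print(repeated.coef_, repeated.intercept_)
```

Non-integer weights work the same way in the objective; they just have no literal "repeat the sample" interpretation.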
See a good explanation here: https://www.kdnuggets.com/2019/11/machine-learning-what-why-how-weighting.html .
sklearn provides two SVM-based regressors, SVR and NuSVR. The latter claims to be using libsvm. However, other than that I don't see any description of when to use which.
Does anyone have an idea?
I am trying to do regression on a 3M x 21 matrix using 5-fold cross-validation with SVR, but it is taking forever to finish. I've aborted the job and I'm now considering using NuSVR. But I'm not sure what advantage it provides.
NuSVR - http://scikit-learn.org/stable/modules/generated/sklearn.svm.NuSVR.html#sklearn.svm.NuSVR
SVR - http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html#sklearn.svm.SVR
They are equivalent but slightly different parametrizations of the same implementation.
Most people use SVR.
You cannot use that many samples with a kernel SVR. You could try SVR(kernel="linear") but that would probably also be infeasible. I recommend using SGDRegressor. You might need to adjust the learning rate and number of epochs, though.
You can also try RandomForestRegressor which should work just fine.
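A minimal sketch of the SGDRegressor route, assuming synthetic data with 21 features as a small stand-in for the 3M x 21 matrix. The scaling step and the eta0/max_iter values are assumptions to be tuned, not recommendations; SGD is sensitive to feature scale, hence the StandardScaler in the pipeline.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Small synthetic stand-in (assumption) with the same number of features.
X, y = make_regression(n_samples=5000, n_features=21, noise=5.0, random_state=0)

# eta0 is the initial learning rate, max_iter the maximum number of epochs.
model = make_pipeline(
    StandardScaler(),
    SGDRegressor(eta0=0.01, max_iter=1000, tol=1e-3, random_state=0),
)

# 5-fold cross-validation, as in the question, scored with R^2.
scores = cross_val_score(model, X, y, cv=5, scoring="r2")
print(scores.mean())
```

Training cost grows roughly linearly with the number of samples here, which is what makes this feasible where a kernel SVR is not.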
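Check out the GitHub code for NuSVR. It says that it is based on libsvm as well. NuSVR lets you control the number of support vectors through its nu parameter, which is a lower bound on the fraction of support vectors and an upper bound on the fraction of training errors.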
I have been blowing my brains out over the past 2-3 weeks on this problem.
I have a multi-label (not multi-class) problem where each sample can belong to several of the labels.
I have around 4.5 million text documents as training data and around 1 million as test data. The labels are around 35K.
I am using scikit-learn. For feature extraction I was previously using TfidfVectorizer, which didn't scale at all; now I am using HashingVectorizer, which is better but still not that scalable given the number of documents that I have.
vect = HashingVectorizer(strip_accents='ascii', analyzer='word', stop_words='english', n_features=(2 ** 10))
scikit-learn provides a OneVsRestClassifier into which I can feed any estimator. For multi-label I found only LinearSVC and SGDClassifier to be working correctly. According to my benchmarks, SGD outperforms LinearSVC in both memory and time. So, I have something like this:
clf = OneVsRestClassifier(SGDClassifier(loss='log', penalty='l2', n_jobs=-1), n_jobs=-1)
But this suffers from some serious issues:
1) OneVsRest does not have a partial_fit method, which makes out-of-core learning impossible. Are there any alternatives for that?
2) HashingVectorizer/Tfidf both work on a single core and don't have any n_jobs parameter. It's taking too much time to hash the documents. Any alternatives/suggestions? Also, is the value of n_features correct?
3) I tested on 1 million documents. The hashing takes 15 minutes, and when it comes to clf.fit(X, y), I receive a MemoryError because OvR internally uses LabelBinarizer and it tries to allocate a matrix of dimensions (y x classes), which is fairly impossible to allocate. What should I do?
4) Are there any other libraries out there which have reliable and scalable multi-label algorithms? I know of gensim and mahout, but neither of them has anything for multi-label situations.
I would do the multi-label part by hand. The OneVsRestClassifier treats them as independent problems anyhow. You can just create the n_labels many classifiers and then call partial_fit on them. You can't use a pipeline if you only want to hash once (which I would advise), though.
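A rough sketch of that by-hand approach, assuming a toy in-memory corpus in place of the real document stream; the documents, labels, batch size and number of passes are all made up for illustration. Each batch is hashed once and the same hashed matrix is fed to one binary SGDClassifier per label via partial_fit.

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# Toy stand-in for a document stream; each document has a set of labels.
docs = [
    "python pandas dataframe merge",
    "python numpy array broadcasting",
    "java spring dependency injection",
    "java jvm garbage collection",
    "pandas numpy python plotting",
    "spring java rest controller",
]
doc_labels = [
    {"python", "pandas"},
    {"python", "numpy"},
    {"java", "spring"},
    {"java"},
    {"python", "pandas", "numpy"},
    {"java", "spring"},
]
all_labels = sorted(set().union(*doc_labels))

vect = HashingVectorizer(n_features=2 ** 18)  # stateless, so no fit needed
# One independent binary classifier per label (default hinge loss).
classifiers = {label: SGDClassifier(random_state=0) for label in all_labels}

batch_size = 2
n_passes = 5  # passes over the toy data; a real stream may allow only one
for _ in range(n_passes):
    for start in range(0, len(docs), batch_size):
        batch_docs = docs[start:start + batch_size]
        batch_labels = doc_labels[start:start + batch_size]
        X_batch = vect.transform(batch_docs)  # hash the batch once
        for label, clf in classifiers.items():
            # Binary target: does this document carry this label?
            y_batch = np.array([int(label in labels) for labels in batch_labels])
            clf.partial_fit(X_batch, y_batch, classes=np.array([0, 1]))

# Predict an indicator matrix with one column per label.
X_new = vect.transform(["python numpy tutorial", "java spring boot"])
predictions = np.column_stack(
    [classifiers[label].predict(X_new) for label in all_labels]
)
print(all_labels)
print(predictions)
```

Because each label only needs its own 0/1 vector for the current batch, the full (documents x labels) matrix never has to be allocated; with tens of thousands of labels the per-label loop is the part to distribute across processes.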
Not sure about speeding up the hashing vectorizer. You gotta ask @larsmans and @ogrisel for that ;)
Having partial_fit on OneVsRestClassifier would be a nice addition, and I don't see a particular problem with it, actually. You could also try to implement that yourself and send a PR.
1) The algorithm that OneVsRestClassifier implements is very simple: it just fits K binary classifiers when there are K classes. You can do this in your own code instead of relying on OneVsRestClassifier. You can also do this on at most K cores in parallel: just run K processes. If you have more classes than processors in your machine, you can schedule training with a tool such as GNU parallel.
2) Multi-core support in scikit-learn is work in progress; fine-grained parallel programming in Python is quite tricky. There are potential optimizations for HashingVectorizer, but I (one of the hashing code's authors) haven't come round to it yet.
3) If you follow my (and Andreas') advice to do your own one-vs-rest, this shouldn't be a problem anymore.
4) The trick in (1.) applies to any classification algorithm.
As for the number of features, it depends on the problem, but for large scale text classification 2^10 = 1024 seems very small. I'd try something around 2^18 - 2^22. If you train a model with L1 penalty, you can call sparsify on the trained model to convert its weight matrix to a more space-efficient format.
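The sparsify step can be sketched as follows, assuming a tiny made-up corpus and an arbitrary alpha; only HashingVectorizer, the L1 penalty and sparsify() come from the discussion above. After the call, coef_ is stored as a scipy sparse matrix, which only keeps the non-zero weights out of the 2^18 hashed features.

```python
from scipy import sparse
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# Tiny made-up corpus (assumption) for a single binary label.
docs = [
    "python pandas dataframe merge",
    "python numpy array broadcasting",
    "java spring dependency injection",
    "java jvm garbage collection",
]
is_python = [1, 1, 0, 0]

vect = HashingVectorizer(n_features=2 ** 18)
X = vect.transform(docs)

# The L1 penalty drives many weights to exactly zero.
clf = SGDClassifier(penalty="l1", alpha=1e-4, random_state=0)
clf.fit(X, is_python)
dense_shape = clf.coef_.shape  # (1, 262144) as a dense NumPy array

# Convert the weight matrix in place to a scipy sparse matrix.
clf.sparsify()
print(sparse.issparse(clf.coef_), clf.coef_.nnz)

# The sparsified model can still be used for prediction.
pred = clf.predict(vect.transform(["numpy array tutorial"]))
```

With one classifier per label, the memory saved per model adds up quickly, since each dense weight vector would otherwise hold all n_features entries.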
My argument for scalability is that instead of using OneVsRest, which is just the simplest of baselines, you should use a more advanced ensemble of problem-transformation methods. In my paper I provide a scheme for dividing the label space into subspaces and transforming the subproblems into multi-class single-label classifications using Label Powerset. To try this, just use the following code, which utilizes scikit-multilearn, a multi-label library built on top of scikit-learn:
from skmultilearn.ensemble import LabelSpacePartitioningClassifier
from skmultilearn.cluster import IGraphLabelCooccurenceClusterer
from skmultilearn.problem_transform import LabelPowerset
from sklearn.linear_model import SGDClassifier
# base multi-class classifier SGD
base_classifier = SGDClassifier(loss='log', penalty='l2', n_jobs=-1)
# problem transformation from multi-label to single-label multi-class
transformation_classifier = LabelPowerset(base_classifier)
# clusterer dividing the label space using fast greedy modularity maximizing scheme
clusterer = IGraphLabelCooccurenceClusterer('fastgreedy', weighted=True, include_self_edges=True)
# ensemble
clf = LabelSpacePartitioningClassifier(transformation_classifier, clusterer)
clf.fit(x_train, y_train)
prediction = clf.predict(x_test)
The partial_fit() method was recently added to sklearn so hopefully it should be available in the upcoming release (it's in the master branch already).
The size of your problem makes neural networks an attractive way to tackle it. Have a look at magpie; it may give much better results than linear classifiers.