PyTorch class weights for multi-class classification - machine-learning

I am using class weights for multi-class classification, computed with sklearn's compute_class_weight function, and PyTorch for training the model. To compute the class weights, should we use all the data (training, validation and test) or only the training set? Thanks

When training your model, you can only assume that the training data is available to you.
Estimating the class weights is part of training: it defines your loss function.
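
As a minimal sketch of this, the weights can be derived from the training labels alone. The example below re-implements the "balanced" heuristic (n_samples / (n_classes * class_count)) in plain Python; the label list train_labels and the helper name balanced_class_weights are hypothetical and only serve the illustration.

```python
from collections import Counter

def balanced_class_weights(train_labels):
    """Return {class: weight} using n_samples / (n_classes * class_count)."""
    counts = Counter(train_labels)
    n_samples = len(train_labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * counts[c]) for c in sorted(counts)}

# Hypothetical training labels only; validation and test labels are never used.
train_labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 2]
weights = balanced_class_weights(train_labels)
```

In PyTorch, the weights ordered by class index can then be converted to a float tensor and passed as the weight argument of nn.CrossEntropyLoss.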

Related

xgboost trained model makes random predictions for the same data

I have a problem with xgboost predictions.
I have trained an xgboost model for my regression problem in Python, but when the max_depth parameter is set to something other than its default value, some of the predictions change when the same data is predicted again with the same model.
So far I have tried changing basic parameters such as the learning rate, reg_lambda and so on, but only max_depth causes randomness in the predictions for the same data.

What Does tf.estimator.LinearClassifier() Do?

In the TensorFlow library, what does the tf.estimator.LinearClassifier class do in linear regression models? (In other words, what is it used for?)
Linear Classifier is nothing but Logistic Regression.
According to the TensorFlow documentation, tf.estimator.LinearClassifier is used to "Train a linear model to classify instances into one of multiple possible classes. When number of possible classes is 2, this is binary classification."
Linear regression predicts a value while the linear classifier predicts a class. Classification aims at predicting the probability of each class given a set of inputs.
For implementation of tf.estimator.LinearClassifier, please follow this tutorial by guru99.
To know about the linear classifiers, read this article.
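
The difference between predicting a value and predicting a class can be sketched in a few lines of plain Python. This is not the TensorFlow implementation; the parameters, the input, and the function names are hypothetical and chosen only to show that a binary linear classifier passes the same linear score through a sigmoid and a threshold.

```python
import math

def linear_score(weights, bias, x):
    """Weighted sum of the inputs plus the bias."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias

def predict_value(weights, bias, x):
    """Linear regression: the score itself is the prediction."""
    return linear_score(weights, bias, x)

def predict_class(weights, bias, x, threshold=0.5):
    """Binary linear classifier (logistic regression): sigmoid, then threshold."""
    p = 1.0 / (1.0 + math.exp(-linear_score(weights, bias, x)))
    return (1 if p >= threshold else 0), p

weights, bias = [2.0, -1.0], 0.5  # hypothetical model parameters
x = [1.0, 0.5]                    # hypothetical input
value = predict_value(weights, bias, x)
label, prob = predict_class(weights, bias, x)
```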

Knn classifier for Imbalanced dataset

I want to get an estimate of how well classifiers would work on an imbalanced dataset of mine. When I try to fit the KNN classifier from sklearn, it learns nothing for the minority class. So what I did was fit the classifier with k = R (where R is the imbalance ratio 1:R), predict probabilities for each test point, and assign a point to the minority class if the classifier's probability output for the minority class is greater than R. I do this to get an estimate of how the classifier performs (F1-score). I don't need the classifier in production. Is what I'm doing right?
Since you mentioned in the comments that you don't want to use resampling, one way out is batching. Create multiple datasets from your majority class so that each is in a 1:1 ratio with the minority class. Train multiple models, with each model getting one part of the majority set and all of the minority set. Make a prediction with all the models, take a vote among them, and decide your final outcome.
But I would suggest using SMOTE over this method.
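
The batching scheme described above can be sketched as follows. This is only an outline under stated assumptions: the helper names balanced_batches and majority_vote and the toy (feature, label) pairs are hypothetical, and the model training step itself is left out.

```python
import random
from collections import Counter

def balanced_batches(majority, minority, seed=0):
    """Split the majority samples into chunks of len(minority) and pair each chunk with all minority samples."""
    rng = random.Random(seed)
    shuffled = majority[:]
    rng.shuffle(shuffled)
    size = len(minority)
    # A trailing chunk smaller than the minority set is dropped to keep the 1:1 ratio.
    chunks = [shuffled[i:i + size] for i in range(0, len(shuffled) - size + 1, size)]
    return [chunk + minority for chunk in chunks]

def majority_vote(predictions):
    """Return the most common label among the per-model predictions."""
    return Counter(predictions).most_common(1)[0][0]

majority = [(i, 0) for i in range(9)]        # hypothetical (feature, label) pairs
minority = [(100 + i, 1) for i in range(3)]
batches = balanced_batches(majority, minority)  # one training set per model
final_label = majority_vote([1, 0, 1])          # votes from three hypothetical models
```

Each element of batches would be used to train one model; at prediction time the per-model labels for a test point are passed to majority_vote.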

How to perform incremental training of large data set using (scikit) Adaboost classifier?

I have a large training dataset, so in order to fit the AdaBoost classifier to it, I would like to do incremental training.
Just as xgb has a parameter called xgb_model for continuing to fit a trained xgb model on new data, I am looking for a similar parameter in the AdaBoost classifier.
Currently, I am trying to use the fit function to train the model iteratively, but it seems my classifier is not using the previous weights. How can I solve this?
It's not possible out-of-the-box. sklearn supports incremental/online training in some estimators, but not AdaBoostClassifier.
The estimators that support incremental training are listed here and have a special method named partial_fit().
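
As a rough illustration of what partial_fit() looks like in practice, the sketch below trains an SGDClassifier (one of the estimators that does support it, swapped in here for AdaBoost) on synthetic data in chunks. The data, the chunk size, and the choice of estimator are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Synthetic data standing in for a dataset too large to fit at once.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

clf = SGDClassifier(random_state=0)
classes = np.unique(y)  # all classes must be declared on the first partial_fit call

chunk_size = 200
for start in range(0, len(X), chunk_size):
    X_chunk = X[start:start + chunk_size]
    y_chunk = y[start:start + chunk_size]
    # Each call updates the existing weights instead of refitting from scratch.
    clf.partial_fit(X_chunk, y_chunk, classes=classes)

accuracy = clf.score(X, y)
```

Calling fit() repeatedly, by contrast, discards the previously learned state each time, which matches the behavior described in the question.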
See gaenari. It is a C++ incremental decision tree. It supports insert(csv), update(), rebuild(), and report().

Determine most important feature per class

Imagine a machine learning problem where you have 20 classes and about 7000 sparse boolean features.
I want to figure out what the 20 most unique features per class are. In other words, features that are used a lot in a specific class but aren't used in other classes, or hardly used.
What would be a good feature selection algorithm or heuristic that can do this?
When you train a multi-class Logistic Regression classifier, the trained model is a num_class x num_feature matrix whose [i,j] value is the weight of feature j in class i. The feature indices are the same as in your input feature matrix.
In scikit-learn you can access the parameters of the model.
If you use scikit-learn's linear classification algorithms, you can find the most important features per class by:
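import numpy as np
from sklearn.linear_model import SGDClassifier

# regul is the regularization strength you choose; loss='log_loss' is spelled 'log' in scikit-learn < 1.1
clf = SGDClassifier(loss='log_loss', alpha=regul, penalty='l1', l1_ratio=0.9, learning_rate='optimal', max_iter=10, shuffle=False, n_jobs=3, fit_intercept=True)
clf.fit(X_train, Y_train)
for i in range(clf.coef_.shape[0]):
    # indices of the 20 largest weights for class i, in ascending order of weight
    top20_indices = np.argsort(clf.coef_[i])[-20:]
    print(top20_indices)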
clf.coef_ is the matrix containing the weight of each feature in each class so clf.coef_[0][2] is the weight of the third feature in the first class.
If, when you build your feature matrix, you keep track of the index of each feature in a dictionary where dic[id] = feature_name, you will be able to retrieve the names of the top features using that dictionary.
For more information, refer to the scikit-learn text classification example.
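
The index-to-name lookup described above can be sketched in plain Python. The weight matrix, the feature names, and the helper top_features_per_class are hypothetical; a real coefficient matrix would have one row per class and one column per feature.

```python
def top_features_per_class(coef, index_to_name, k=20):
    """For each class row in coef, return the names of the k largest-weight features, largest first."""
    result = []
    for row in coef:
        top = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        result.append([index_to_name[j] for j in top])
    return result

# Hypothetical 2-class, 4-feature weight matrix and index-to-name dictionary.
coef = [[0.1, 2.0, -0.5, 0.0],
        [1.5, -1.0, 0.3, 0.9]]
dic = {0: "has_url", 1: "has_price", 2: "has_date", 3: "has_email"}
top2 = top_features_per_class(coef, dic, k=2)
```

With a fitted scikit-learn linear model, clf.coef_ could be passed in place of the hypothetical coef.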
Random Forest and Naive Bayes should be able to handle this for you. Given the sparsity, I'd go for the Naive Bayes first. Random Forest would be better if you're looking for combinations.
