Adding classif.imbalanced.rfsrc in mlr3

First of all, many thanks to the folks behind #mlr3!
The package randomForestSRC in R has a new function called imbalanced.rfsrc to help deal with class imbalance in classification. Will this learner be accessible in mlr3? imbalanced.rfsrc seems to work very well and also seems to implement state-of-the-art approaches to dealing with class imbalance.
Thank you

If you open a learner request issue in mlr3extralearners and fill in the details then we'd be happy to consider adding this implementation!
https://github.com/mlr-org/mlr3extralearners/issues/new?assignees=&labels=new+learner&template=learner-request-template.md&title=%5BLRNRQ%5D+Add+%3Calgorithm%3E+from+package+%3Cpackage%3E

Related

CV or train/predict in mlr3

In a post "The "Cross-Validation - Train/Predict" misunderstanding" by Patrick Schratz
https://mlr-org.com/docs/cv-vs-predict/
mentioned that:
(a) CV is done to get an estimate of a model’s performance.
(b) Train/predict is done to create the final predictions (which your boss might use to make some decisions on).
Does this mean that in mlr3, if we are in academia and need to publish papers, we should use CV, since we intend to compare the performance of different algorithms? And that in industry, if the plan is to train a model and then use it again and again on new data to make predictions, we should use the train/predict methods provided by mlr3?
Or have I understood this completely wrong?
Thank you
You always need a CV if you want to make a statement about a model's performance.
If you want to use the model to make predictions on unknown data, do a single fit and then predict.
So in practice, you need both: CV + "train+predict".
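This two-step pattern is library-agnostic (in mlr3 the CV estimate comes from resample() and the final model from the learner's $train()/$predict()). Here is a minimal sketch of the idea in scikit-learn terms, with an arbitrary dataset and model chosen purely for illustration:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
model = RandomForestClassifier(random_state=0)

# (a) CV: estimate how well this kind of model performs.
scores = cross_val_score(model, X, y, cv=5)
print("estimated performance:", scores.mean())

# (b) Train/predict: a single fit on all available data gives the model
# you actually deploy; its expected quality is the estimate from (a).
model.fit(X, y)
# predictions = model.predict(X_new)  # X_new = future, unseen data
```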
PS: Your post does not really fit Stack Overflow since it is not related to a coding problem. For statistical questions, please see https://stats.stackexchange.com/.
PS2: If you talk about a post, please include the link. I am the author of the post in this case but most other people might not know what you are talking about ;)

Which data visualization techniques to use to analyse data while solving a classification problem?

I am solving a classification problem and I cannot find a good visualization method to analyse my data. Usually, when dealing with prediction problems, I use barplot, distplot, scatterplot, line graph, etc. I want to know some common data visualization techniques for classification problems.
Hi guys, I figured out that countplot is the equivalent of a histogram: https://seaborn.pydata.org/generated/seaborn.countplot.html
[image: example of countplot]
[image: example of catplot]
Update: catplot is actually the combination of FacetGrid and countplot.
So if you want something simple, countplot will do the job for you, but if you want grids, use catplot.
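As a minimal sketch (using seaborn's bundled "titanic" dataset and its columns purely for illustration), the two calls look like this:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# seaborn's bundled "titanic" dataset, used purely for illustration
titanic = sns.load_dataset("titanic")

# countplot: bar chart of category frequencies, the categorical analogue of a histogram
sns.countplot(data=titanic, x="class", hue="survived")
plt.show()

# catplot(kind="count"): the same plot wrapped in a FacetGrid,
# so it can be split into a grid of panels (here one column per "sex")
sns.catplot(data=titanic, x="class", hue="survived", col="sex", kind="count")
plt.show()
```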

Feature Selection Process For Regression

I am trying to solve a regression problem (determining next month's expected revenue). I have come across different feature selection techniques such as:
Filter Method
Wrapper Method
Embedded Method
Q1: Now the problem is, I think those methods are for classification-type problems. So how can we use feature selection for a regression problem?
Q2: I have also come across "regularization". Is it the only way to do feature selection for a regression problem?
I don't know the filter methods you mentioned, but you can use:
sklearn.feature_selection.RFE (Recursive Feature Elimination)
or
sklearn.decomposition.PCA (Principal Component Analysis)
I'm pretty sure you can use them for classification or regression.
Here's an example of using RFE with LinearRegression: https://towardsdatascience.com/feature-selection-with-pandas-e3690ad8504b
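A minimal sketch of that combination on synthetic data (the dataset and parameter choices below are placeholders, not recommendations):

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic regression data: 100 samples, 10 features, 5 of them informative
X, y = make_regression(n_samples=100, n_features=10, n_informative=5, random_state=0)

# RFE repeatedly fits the estimator and drops the weakest feature
# until n_features_to_select remain; it works with any estimator that
# exposes coef_ or feature_importances_, so regressors are fine.
selector = RFE(estimator=LinearRegression(), n_features_to_select=5)
selector.fit(X, y)

print(selector.support_)   # boolean mask of the selected features
print(selector.ranking_)   # rank 1 = selected
```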

How can I re-train my logistic model using pymc3?

I have a binary classification problem with around 15 features, which I have chosen using another model. Now I want to perform Bayesian logistic regression on these features. My target classes are highly imbalanced (the minority class is 0.001%) and I have around 6 million records. I want to build a model that can be re-trained nightly or on weekends using Bayesian logistic regression.
Currently, I have divided the data into 15 parts. I train my model on the first part and test on the last part, then I update my priors using the Interpolated method of pymc3 and rerun the model with the 2nd set of data. I check the accuracy and other metrics (ROC, F1-score) after each run.
Problems:
My score is not improving.
Am I using the right approach?
This process is taking too much time.
If someone can guide me towards the right approach, with code snippets, it would be very helpful.
You can use variational inference (VI). It is faster than sampling and produces broadly similar results. pymc3 itself provides methods for VI; you can explore those.
I only know the answer to this part of the question. If you elaborate on your problem a bit further, maybe I can help with the rest.
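As a rough sketch of what VI could look like for a Bayesian logistic regression (the X/y arrays, priors, batch size, and iteration count below are placeholder assumptions, not tuned values), minibatch ADVI lets the fit stream over a large dataset instead of loading everything into each gradient step:

```python
import numpy as np
import pymc3 as pm

# Placeholder data: replace with your real feature matrix and 0/1 labels.
X = np.random.randn(10_000, 15)
y = np.random.binomial(1, 0.01, size=10_000)

# Minibatches let ADVI iterate over subsets instead of all rows at once.
X_mb = pm.Minibatch(X, batch_size=512)
y_mb = pm.Minibatch(y, batch_size=512)

with pm.Model() as model:
    intercept = pm.Normal("intercept", mu=0.0, sigma=10.0)
    coefs = pm.Normal("coefs", mu=0.0, sigma=1.0, shape=X.shape[1])
    p = pm.math.sigmoid(intercept + pm.math.dot(X_mb, coefs))
    # total_size tells pymc3 how to rescale the minibatch likelihood.
    pm.Bernoulli("obs", p=p, observed=y_mb, total_size=len(y))

    # Variational inference (ADVI) instead of MCMC sampling.
    approx = pm.fit(n=20_000, method="advi")
    trace = approx.sample(1_000)  # samples from the fitted approximation
```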

Mallet Trained Model Load

Has anyone had any luck loading a previously trained model? Looking through the API, the CRFWriter class is half of the puzzle, but how exactly do you read the CRF back in? (A corresponding CRFReader class doesn't exist.)
Thanks for the help.
Depending on the trainer that you used, you should be able to cast the object to a CRF or ACRF. I just posted a question that might help you too: How do I load and use a CRF trained with Mallet?
