Is It Possible to Find an Optimal Cut Point that Maximizes C-index - mlr3

If I am using a survival model that is not tree-based and I want to dichotomize influential continuous variables (age, weight) to simplify my final model for clinical use, is that possible to do within the mlr3 framework?
If so, I would appreciate an example.
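Not mlr3-specific, but as a framework-agnostic sketch of the underlying idea (scan candidate cut points, dichotomize the variable, keep the cut with the best C-index), here is what it could look like in Python with lifelines; the synthetic data and the quantile grid are assumptions for illustration only:

```python
import numpy as np
from lifelines.utils import concordance_index

# Placeholder survival data: times, event indicators, and a continuous covariate.
rng = np.random.default_rng(0)
age = rng.uniform(40, 80, size=200)
time = rng.exponential(scale=10 + (age < 60) * 5)   # synthetic survival times
event = rng.integers(0, 2, size=200)                # 1 = event observed

best_cut, best_cindex = None, -np.inf
for cut in np.quantile(age, np.linspace(0.1, 0.9, 17)):
    risk = (age > cut).astype(float)            # dichotomised covariate used as a risk score
    c = concordance_index(time, -risk, event)   # negate: higher risk => shorter survival
    if c > best_cindex:
        best_cut, best_cindex = cut, c

print(f"best cut point: {best_cut:.1f}, C-index: {best_cindex:.3f}")
```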

Related

Evaluation of generative models like variational autoencoder

I hope everyone is doing well.
I need some help with generative models.
I'm working on a project where the main task is to build a binary classification model. The dataset contains 300,000 samples and 100 features, and there is an imbalance between the two classes: the majority class is much larger than the minority class.
To handle this problem, I'm using a VAE (variational autoencoder).
I started by training the VAE on the minority class, then used the decoder part of the VAE to generate new (fake) samples that are similar to the minority class, and concatenated this new data with the training set to obtain a new, balanced training set.
My question is: is there any way to evaluate generative models like VAEs, i.e. is there a way to know whether the generated data is similar to the real data?
I have read that there are metrics for evaluating generated data, such as the Inception Score and the Fréchet Inception Distance, but I have only seen them used on image data.
Can I use them on my dataset too?
Thanks in advance.
I believe your data is not image data, since you say there are 100 features. What you can do is check the similarity between the synthesised samples and the original ones (those belonging to the minority class), and keep only the synthesised samples above a certain similarity. The cosine similarity index would be useful for this.
It would also be very useful to look at a scatter plot of the synthesised samples together with the original ones to see whether they are close to each other; t-SNE would be useful at this point.
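A minimal sketch of both checks; the arrays, the 0.8 threshold, and the plot are placeholders, assuming the real minority-class samples and the VAE-generated samples are NumPy arrays with 100 columns:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical arrays: real minority-class samples and VAE-generated samples.
real = np.random.randn(500, 100)   # placeholder for the real minority class
fake = np.random.randn(500, 100)   # placeholder for the generated samples

# For each generated sample, its highest cosine similarity to any real sample.
sims = cosine_similarity(fake, real).max(axis=1)

# Keep only generated samples above an (arbitrary) similarity threshold.
threshold = 0.8
kept = fake[sims >= threshold]
print(f"kept {len(kept)} of {len(fake)} generated samples")

# 2-D t-SNE embedding of real and kept generated samples for a visual check.
emb = TSNE(n_components=2, random_state=0).fit_transform(np.vstack([real, kept]))
plt.scatter(emb[:len(real), 0], emb[:len(real), 1], s=5, label="real")
plt.scatter(emb[len(real):, 0], emb[len(real):, 1], s=5, label="generated")
plt.legend()
plt.show()
```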

VotingRegressor vs. StackingRegressor

Here's one: in what situation would you use one vs. the other? Let me run a hypothetical.
Let's say I'm training a few different regressors, and I get the final score from each regressor's training run. If I wanted to use the VotingRegressor to ensemble the models, I could use those scores as weight parameters to get a weighted average of each model's prediction, right?
So what's the benefit of doing that vs. using the StackingRegressor to get the final prediction? As I understand it, a final model makes its prediction based on each individual model's predictions, so in effect, wouldn't that final StackingRegressor model learn that some predictions are better than others? Almost like it's doing a sort of weighted voting of its own?
Short of running both examples and seeing the differences in predictions, I'm wondering if anyone has experience with both of these and could provide some insight as to which might be the better way to go. I don't see a question like this on SO yet. Thanks!
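For concreteness, a minimal scikit-learn sketch of the two setups described above, with arbitrary base models and placeholder weights: VotingRegressor applies fixed weights to the base predictions, while StackingRegressor fits a meta-model on out-of-fold predictions of the base models.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import (GradientBoostingRegressor, RandomForestRegressor,
                              StackingRegressor, VotingRegressor)
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=1000, n_features=20, noise=10.0, random_state=0)

base = [
    ("rf", RandomForestRegressor(n_estimators=100, random_state=0)),
    ("gbm", GradientBoostingRegressor(random_state=0)),
    ("ridge", Ridge()),
]

# VotingRegressor: a fixed weighted average of the base predictions.
# The weights here are arbitrary placeholders; in the scenario above they
# would come from each model's validation score.
voting = VotingRegressor(estimators=base, weights=[0.5, 0.3, 0.2])

# StackingRegressor: a meta-model (here Ridge) is trained on out-of-fold
# predictions of the base models, so it learns how to combine them.
stacking = StackingRegressor(estimators=base, final_estimator=Ridge(), cv=5)

for name, model in [("voting", voting), ("stacking", stacking)]:
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(name, scores.mean())
```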

Incorporating feedback to retrain WordToVec for finding document similarity

I have trained Gensim's WordToVec on a text corpus, converted it to DocToVec, and then used cosine similarity to find the similarity between documents. I need to suggest similar documents. Now suppose that among the top 5 suggestions for a particular document, we manually find that 3 of them are not similar. Can this feedback be incorporated in retraining the model?
It's not quite clear what you mean by "converted [a Word2Vec model] to DocToVec". The gensim Doc2Vec class doesn't use or require a Word2Vec model as input.
But, if you have many sets of hand-curated "this is a good suggestion" or "this is a bad suggestion" pairs for your corpus, you can use the model's scoring against all those to compare models, and train many variant models (with different model parameter values like size, window, min_count, sample, etc), picking the one that scores best on your tests.
That sort of automated-parameter-search is the most straightforward way to use performance on real evaluation data to adjust an unsupervised model like Word2Vec.
(Depending on the specifics of your data and problem-domain, you might also start to notice patterns in where the model is better or worse, that help you hand-tune parts of the data preprocessing. For example, a different handling of capitalization or tokenization might be suggested by error cases.)
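A rough sketch of that kind of parameter search with gensim's Doc2Vec, assuming the feedback has been collected as lists of (document index, document index) pairs judged good or bad; the corpus, the pairs, the 0.5 similarity threshold, and the tiny parameter grid are placeholders (note that gensim 4.x calls the size parameter vector_size):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Hypothetical corpus and hand-curated feedback pairs (placeholders).
corpus = [["some", "tokenised", "document"], ["another", "tokenised", "document"]]
good_pairs = [(0, 1)]   # pairs of doc indices judged "similar" by reviewers
bad_pairs = []          # pairs judged "not similar"

tagged = [TaggedDocument(words, [i]) for i, words in enumerate(corpus)]

def feedback_score(model):
    """Fraction of curated judgements the model agrees with,
    using an arbitrary cosine-similarity threshold of 0.5."""
    hits = 0
    for i, j in good_pairs:
        hits += model.dv.similarity(i, j) > 0.5
    for i, j in bad_pairs:
        hits += model.dv.similarity(i, j) <= 0.5
    return hits / max(1, len(good_pairs) + len(bad_pairs))

best = None
for vector_size in (50, 100):
    for window in (2, 5):
        model = Doc2Vec(tagged, vector_size=vector_size, window=window,
                        min_count=1, epochs=20)
        score = feedback_score(model)
        if best is None or score > best[0]:
            best = (score, vector_size, window)
print("best (score, vector_size, window):", best)
```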

Find Important predictors in a model

I want to analyze and solve a few questions from the well-known red wine quality analysis project, which is freely available at the following link:
https://www.kaggle.com/piyushgoyal443/red-wine-analysis/data
The problem is to find the 2 most important predictors of red wine quality.
I have proceeded using the ols_step_all_possible() function from the olsrr package in R. It returns a data frame with every combination of the predictors and, for each resulting model, its R-square, adjusted R-square, AIC, FPE, and so on.
From the results I found that alcohol and volatile acidity are the two best predictors, based on a high adjusted R-square and low AIC/FPE:
[Results image]
My question is whether looking at the R-square and AIC of the model is enough to say that those variables (which are included in the model and whose p-values are significant) are important predictors. Or do we have to split the data into train and test sets, look at the test MAPE, and then decide whether a predictor is important or not?
I believe you are asking about methods to find the best predictors. There are various methods you can use; for finding the predictors, you should use feature selection. You can follow this link:
https://machinelearningmastery.com/feature-selection-with-the-caret-r-package/
One more thing: R-square and adjusted R-square are parameters that describe the quality of the model, not of the individual predictors. For individual predictors, you can look at the p-values.
The same goes for AIC. These metrics are most useful for choosing between two models.
A model with a higher R-square is better than one with a lower R-square.
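To make the train/test + MAPE check from the question concrete, here is a rough sketch in Python with scikit-learn (the thread itself uses R; the file name, separator, and 70/30 split are assumptions):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_percentage_error
from sklearn.model_selection import train_test_split

# Assumes the Kaggle/UCI file has been downloaded as "winequality-red.csv";
# depending on the version, you may need sep=";" in read_csv.
df = pd.read_csv("winequality-red.csv")

X = df[["alcohol", "volatile acidity"]]   # the two candidate predictors
y = df["quality"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

model = LinearRegression().fit(X_train, y_train)
print("test MAPE:", mean_absolute_percentage_error(y_test, model.predict(X_test)))
```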

How to discover new classes in a classification machine learning algorithm?

I'm using a multiclass classifier (a Support Vector Machine, via One-Vs-All) to classify data samples. Let's say I currently have n distinct classes.
However, in the scenario I'm facing, it is possible that a new data sample may belong to a new class n+1 that hasn't been seen before.
So I guess you can say that I need a form of Online Learning, as there is no distinct training set in the beginning that suits all data appearing later. Instead I need the SVM to adapt dynamically to new classes that may appear in the future.
So I'm wondering if and how I can...
identify that a new data sample does not quite fit into the existing classes but instead should result in creating a new class.
integrate that new class into the existing classifier.
I can vaguely think of a few ideas that might be approaches to solve this problem:
If none of the binary SVM classifiers (as I have one for each class in the OVA case) predicts a fairly high probability (e.g. > 0.5) for the new data sample, I could assume that this new data sample may represent a new class.
I could train a new binary classifier for that new class and add it to the multiclass SVM.
However, these are just my naive thoughts. I'm wondering if there is some "proper" approach for this instead, e.g. using a clustering algorithm to find all classes.
Or maybe trying to use an SVM is not even appropriate for this kind of problem?
Help on this is greatly appreciated.
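A rough sketch of the first idea above (flag a sample as a potential new class when none of the one-vs-all classifiers is confident), with random placeholder data and an arbitrary 0.5 threshold:

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# Placeholder training data with 3 known classes.
X_train = np.random.randn(300, 10)
y_train = np.random.randint(0, 3, size=300)

ova = OneVsRestClassifier(SVC(probability=True)).fit(X_train, y_train)

# For a new sample, ask each binary (one-vs-all) classifier for its
# probability that the sample belongs to "its" class.
x_new = np.random.randn(1, 10)
per_class_probs = np.array(
    [est.predict_proba(x_new)[0, 1] for est in ova.estimators_])

# If no classifier is reasonably confident (0.5 is an arbitrary threshold),
# treat the sample as a candidate member of a new, unseen class.
if per_class_probs.max() < 0.5:
    print("possible new class; per-class probabilities:", per_class_probs)
else:
    print("assigned to known class", ova.classes_[per_class_probs.argmax()])
```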
As in any other machine learning problem, if you do not have a quality criterion, you suck.
When people say "classification", they have supervised learning in mind: there is some ground truth against which you can train and check your algorithms. If new classes can appear, this ground truth is ambiguous. Imagine one class is "horse", and you see many horses: black horses, brown horses, even white ones. And suddenly you see a zebra. Whoa! Is it a new class or just an unusual horse? The answer will depend on how you are going to use your class labels. The SVM itself cannot decide, because SVM does not use these labels, it only produces them. The decision is up to a human (or to some decision-making algorithm which knows what is "good" and "bad", that is, has its own "loss function" or "utility function").
So you need a supervisor. But how can you assist this supervisor? Two options come to mind:
Anomaly detection. This can help you with early occurrences of new classes. After the very first zebra your algorithm sees, it can raise an alarm: "There is something unusual!". For example, in sklearn various algorithms, from isolation forest to one-class SVM, can be used to detect unusual observations. Your supervisor can then look at them and decide whether they deserve to form an entirely new class.
Clustering. It can help you make decisions about splitting your classes. For example, after the first zebra, you decided it was not worth making a new class. But over time, your algorithm has accumulated dozens of zebra images. So if you run a clustering algorithm on all the observations labeled as "horses", you might end up with two well-separated clusters. And it will again be up to the supervisor to decide whether the striped horses should be split off from the plain ones into a new class.
If you want this decision to be purely automatic, you can split classes when the ratio of the within-cluster mean distance to the between-cluster distance is low enough. But this will work well only if you have a good distance metric in the first place. And what counts as "good" is again defined by how you use your algorithms and what your ultimate goal is.
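A rough scikit-learn sketch of both options, with random placeholder data; the nu value, the choice of two clusters, and the distance-ratio check at the end are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances
from sklearn.svm import OneClassSVM

# Placeholder data: samples of one known class, plus a batch of new samples.
known_class = np.random.randn(300, 10)
new_samples = np.random.randn(20, 10) + 3.0   # deliberately shifted

# 1) Anomaly detection: flag new samples that look unlike the known class.
detector = OneClassSVM(nu=0.05).fit(known_class)
flags = detector.predict(new_samples)          # +1 = normal, -1 = unusual
print("flagged as unusual:", int((flags == -1).sum()))

# 2) Clustering the accumulated observations of one label and checking
#    whether they actually form two well-separated groups.
accumulated = np.vstack([known_class, new_samples])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(accumulated)

a, b = accumulated[labels == 0], accumulated[labels == 1]
within = np.mean([pairwise_distances(a).mean(), pairwise_distances(b).mean()])
between = pairwise_distances(a, b).mean()
print("within/between distance ratio:", within / between)   # low => consider splitting
```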

Resources