I need good model serialization. Default R serialization is neither safe nor efficient in terms of model size - mlr3

An mlr3 model includes a lot of redundant data that is not needed when applying the model. The traditional R approach is to save all the data used for model training, which leads to growth in memory usage. In a traditional R model this can usually be fixed easily by just assigning NULL to the redundant fields, but it is not so clear how to do this for mlr3.

You can directly access the underlying model in mlr3 using the $model slot; see e.g. the basics chapter in the mlr3 book. This is where the trained model is stored and what is used to make predictions, so you can modify it in exactly the same way as you would modify the model directly.
Of course, some of this may break other mlr3 functionality, e.g. information on feature importance that is used by some other functions. But in principle, you can perform exactly the same model customization that you can do for the "raw" model.

Related

Incorporating feedback to retrain WordToVec for finding document similarity

I have trained Gensim's WordToVec on a text corpus, converted it to DocToVec and then used cosine similarity to find the similarity between documents. I need to suggest similar documents. Now suppose among the top 5 suggestions for a particular document, we manually find that 3 of them are not similar. Can this feedback be incorporated in retraining the model?
It's not quite clear what you mean by "converted [a Word2Vec model] to DocToVec". The gensim Doc2Vec class doesn't use or require a Word2Vec model as input.
But, if you have many sets of hand-curated "this is a good suggestion" or "this is a bad suggestion" pairs for your corpus, you can use the model's scoring against all those to compare models, and train many variant models (with different model parameter values like size, window, min_count, sample, etc), picking the one that scores best on your tests.
That sort of automated-parameter-search is the most straightforward way to use performance on real evaluation data to adjust an unsupervised model like Word2Vec.
(Depending on the specifics of your data and problem-domain, you might also start to notice patterns in where the model is better or worse, that help you hand-tune parts of the data preprocessing. For example, a different handling of capitalization or tokenization might be suggested by error cases.)
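A minimal sketch of such an automated parameter search, assuming tokenized documents in corpus and hand-curated good_pairs/bad_pairs of document-index tuples (all three names are assumptions), and gensim 4.x, where the size parameter was renamed to vector_size:

```python
import itertools
import numpy as np
from gensim.models import Word2Vec  # gensim 4.x: `size` is now `vector_size`

def doc_vector(model, tokens):
    """Average the vectors of in-vocabulary tokens (zero vector if none)."""
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

def pair_score(model, corpus, good_pairs, bad_pairs):
    """Toy metric: fraction of curated good pairs whose cosine similarity
    exceeds the average similarity of the curated bad pairs."""
    def sim(i, j):
        a, b = doc_vector(model, corpus[i]), doc_vector(model, corpus[j])
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        return float(a @ b) / denom if denom else 0.0
    cutoff = np.mean([sim(i, j) for i, j in bad_pairs])
    return np.mean([sim(i, j) > cutoff for i, j in good_pairs])

best_score, best_model = -1.0, None
for vector_size, window, min_count in itertools.product([50, 100], [5, 10], [2, 5]):
    model = Word2Vec(corpus, vector_size=vector_size, window=window,
                     min_count=min_count, sample=1e-3, epochs=10)
    s = pair_score(model, corpus, good_pairs, bad_pairs)
    if s > best_score:
        best_score, best_model = s, model
```

The scoring metric here is deliberately simple; any evaluation that ranks your curated good pairs above your bad pairs would serve the same purpose.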

When is entity replacement necessary for relation extraction?

In this tutorial for "Training a Machine Learning Classifier for Relation Extraction from Medical Literature", the author does entity replacement because "we don’t want the model to learn according to a specific entity name, but we want it to learn according to the structure of the text".
Is this generally true or does it depend on the dataset or the used models?
Entity replacement, much like other text transformation techniques such as stemming and lemmatization, is usually part of the relationship extraction process because it increases the number of observations per feature. Whether that increase helps your problem depends on the size of the dataset, the quality of the features, the type of feature extraction, and the complexity of the model.
A good rule of thumb is to define your objective, and subsequently your acceptable representation, based on your understanding of the dataset. For instance, the given tutorial sets out to understand the relationship between miRNA and genes, and the author is okay with grouping miRNA-335, miRNA-342, miRNA-100 and others under the same entity name.
In scenarios where you don't have domain understanding of the corpora, you can start out without entity replacement, inspect the results, and understand the model's bias-variance tradeoff. Then, if required, try entity replacement after experimenting with some clustering techniques.
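As a minimal illustration of entity replacement in the tutorial's domain (the regex and the <MIRNA> placeholder token are assumptions, not the tutorial's exact code):

```python
import re

# Match specific miRNA mentions such as "miRNA-335" or "miRNA-100".
MIRNA_PATTERN = re.compile(r"\bmiRNA-\d+\b", flags=re.IGNORECASE)

def replace_entities(text: str) -> str:
    """Collapse every specific miRNA mention into one generic entity token."""
    return MIRNA_PATTERN.sub("<MIRNA>", text)

print(replace_entities("miRNA-335 suppresses the gene, unlike miRNA-100."))
# -> "<MIRNA> suppresses the gene, unlike <MIRNA>."
```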

How to update Logistic Regression Model?

I have trained a logistic regression model. Now I have to update (partial fit) the model with a new set of training data. Is it possible?
You cannot use partial_fit on LogisticRegression.
But you can:
use warm_start=True, which reuses the solution of the previous call to fit as initialization, to speed up convergence.
use SGDClassifier with loss='log', which is equivalent to LogisticRegression and supports partial_fit.
Note the difference between partial_fit and warm_start. Both methods start from the previous model and update it, but partial_fit only slightly updates the model, while warm_start goes all the way to convergence on the new training data, forgetting the previous model. warm_start is only used to speed up convergence.
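A minimal sketch of both options against scikit-learn's standard API (the toy arrays are placeholders):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, SGDClassifier

X_old, y_old = np.random.randn(100, 5), np.random.randint(0, 2, 100)
X_new, y_new = np.random.randn(20, 5), np.random.randint(0, 2, 20)

# Option 1: warm_start - refit to convergence on the new data, reusing the
# previous coefficients only as the starting point (speeds up convergence).
clf = LogisticRegression(warm_start=True, max_iter=1000)
clf.fit(X_old, y_old)
clf.fit(X_new, y_new)  # starts from the previous solution

# Option 2: SGDClassifier with log loss - true incremental updates.
# Note: scikit-learn >= 1.1 spells the loss 'log_loss' ('log' in older versions).
sgd = SGDClassifier(loss="log_loss")
sgd.partial_fit(X_old, y_old, classes=np.array([0, 1]))  # classes needed on first call
sgd.partial_fit(X_new, y_new)  # small update on the new mini-batch only
```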
See also the glossary:
warm_start
When fitting an estimator repeatedly on the same dataset, but for multiple parameter values (such as to find the value maximizing performance as in grid search), it may be possible to reuse aspects of the model learnt from the previous parameter value, saving time. When warm_start is true, the existing fitted model attributes are used to initialise the new model in a subsequent call to fit.
Note that this is only applicable for some models and some parameters, and even some orders of parameter values. For example, warm_start may be used when building random forests to add more trees to the forest (increasing n_estimators) but not to reduce their number.
partial_fit also retains the model between calls, but differs: with warm_start the parameters change and the data is (more-or-less) constant across calls to fit; with partial_fit, the mini-batch of data changes and model parameters stay fixed.
There are cases where you want to use warm_start to fit on different, but closely related data. For example, one may initially fit to a subset of the data, then fine-tune the parameter search on the full dataset. For classification, all data in a sequence of warm_start calls to fit must include samples from each class.
partial_fit
Facilitates fitting an estimator in an online fashion. Unlike fit, repeatedly calling partial_fit does not clear the model, but updates it with respect to the data provided. The portion of data provided to partial_fit may be called a mini-batch. Each mini-batch must be of consistent shape, etc.
partial_fit may also be used for out-of-core learning, although usually limited to the case where learning can be performed online, i.e. the model is usable after each partial_fit and there is no separate processing needed to finalize the model. cluster.Birch introduces the convention that calling partial_fit(X) will produce a model that is not finalized, but the model can be finalized by calling partial_fit() i.e. without passing a further mini-batch.
Generally, estimator parameters should not be modified between calls to partial_fit, although partial_fit should validate them as well as the new mini-batch of data. In contrast, warm_start is used to repeatedly fit the same estimator with the same data but varying parameters.
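For example, the glossary's random-forest case looks like this (a minimal sketch using scikit-learn's documented warm_start behavior):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, random_state=0)

rf = RandomForestClassifier(n_estimators=50, warm_start=True, random_state=0)
rf.fit(X, y)            # fits 50 trees

rf.n_estimators = 100   # increase the parameter between calls to fit
rf.fit(X, y)            # fits only the 50 additional trees, keeping the first 50
```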

What is the use of train_on_batch() in keras?

How is train_on_batch() different from fit()? What are the cases when we should use train_on_batch()?
For this question, it's a simple answer from the primary author:
With fit_generator, you can use a generator for the validation data as
well. In general, I would recommend using fit_generator, but using
train_on_batch works fine too. These methods only exist for the sake of
convenience in different use cases, there is no "correct" method.
train_on_batch allows you to expressly update weights based on a collection of samples you provide, without regard to any fixed batch size. You would use this in cases where that is what you want: to train on an explicit collection of samples. You could use that approach to maintain your own iteration over multiple batches of a traditional training set, but allowing fit or fit_generator to iterate over batches for you is likely simpler.
One case when it might be nice to use train_on_batch is for updating a pre-trained model on a single new batch of samples. Suppose you've already trained and deployed a model, and sometime later you've received a new set of training samples previously never used. You could use train_on_batch to directly update the existing model only on those samples. Other methods can do this too, but it is rather explicit to use train_on_batch for this case.
Apart from special cases like this (either where you have some pedagogical reason to maintain your own cursor across different training batches, or else for some type of semi-online training update on a special batch), it is probably better to just always use fit (for data that fits in memory) or fit_generator (for streaming batches of data as a generator).
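A minimal sketch of that update-a-deployed-model case (the file name and batch shapes are hypothetical):

```python
import numpy as np
from tensorflow import keras

# Reload the previously trained, compiled model (hypothetical path).
model = keras.models.load_model("deployed_model.h5")

# A single new batch of previously unseen samples (placeholder shapes).
x_new = np.random.randn(32, 10).astype("float32")
y_new = np.random.randint(0, 2, size=(32, 1))

loss = model.train_on_batch(x_new, y_new)  # one gradient update, no epoch loop
model.save("deployed_model.h5")
```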
train_on_batch() gives you greater control of the state of the LSTM, for example, when using a stateful LSTM and controlling calls to model.reset_states() is needed. You may have multi-series data and need to reset the state after each series, which you can do with train_on_batch(), but if you used .fit() then the network would be trained on all the series of data without resetting the state. There's no right or wrong, it depends on what data you're using, and how you want the network to behave.
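A minimal sketch of that per-series reset pattern, assuming tf.keras and a series_list (an assumed name) holding lists of (x, y) batches, one list per independent series:

```python
from tensorflow import keras

model = keras.Sequential([
    keras.Input(batch_shape=(1, 10, 3)),  # stateful LSTMs need a fixed batch size
    keras.layers.LSTM(16, stateful=True),
    keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

for epoch in range(5):
    for series in series_list:            # each entry: list of (x, y) batches
        for x, y in series:
            model.train_on_batch(x, y)    # state carries across batches here
        model.reset_states()              # drop state between independent series
```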
train_on_batch will also see a performance increase over fit and fit_generator if you're using large datasets and don't have easily serializable data (like high-rank numpy arrays) to write to TFRecords.
In this case you can save the arrays as numpy files (traina.npy, trainb.npy, etc.) and load smaller subsets of them into memory when the whole set won't fit. You can then build a dataset with tf.data.Dataset.from_tensor_slices, call train_on_batch on that sub-dataset, load another one and call train_on_batch again, and so on; you've then trained on your entire set and can control exactly how much of your dataset trains your model, and when. You can define your own epochs, batch sizes, etc. with simple loops and functions that grab from your dataset.
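A minimal sketch of that chunked approach (the label file names and the batch size are assumptions; model is assumed to be a compiled Keras model):

```python
import numpy as np
import tensorflow as tf

# Hypothetical (features, labels) file pairs, each small enough for memory.
chunk_files = [("traina.npy", "labelsa.npy"), ("trainb.npy", "labelsb.npy")]

for x_path, y_path in chunk_files:        # one chunk in memory at a time
    x, y = np.load(x_path), np.load(y_path)
    ds = tf.data.Dataset.from_tensor_slices((x, y)).batch(64)
    for x_batch, y_batch in ds:
        model.train_on_batch(x_batch, y_batch)
```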
Indeed nbro's answer helps. Just to add a few more scenarios: let's say you are training a seq2seq model or a large network with one or more encoders. We can create custom training loops using train_on_batch and use a part of our data to validate the encoder directly, without using callbacks. Writing callbacks for a complex validation process can be difficult. There are several cases where we wish to train on a batch.
Regards,
Karthick
From Keras - Model training APIs:
fit: Trains the model for a fixed number of epochs (iterations on a dataset).
train_on_batch: Runs a single gradient update on a single batch of data.
We can use it in a GAN, where we update the discriminator and the generator using one batch of our training data set at a time. I saw Jason Brownlee use train_on_batch in one of his tutorials (How to Develop a 1D Generative Adversarial Network From Scratch in Keras); a sketch of that pattern follows below.
Tip for quick search: press Ctrl+F and type the term you want to find (train_on_batch, for example).
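A minimal sketch of the GAN update pattern mentioned above (generator, discriminator, the combined gan model, latent_dim, and real_batch are all assumptions for illustration, with the discriminator compiled trainable and frozen inside gan):

```python
import numpy as np

n = real_batch.shape[0]
noise = np.random.randn(n, latent_dim)
fake_batch = generator.predict(noise, verbose=0)

# 1) One gradient update for the discriminator on real, then fake, samples.
discriminator.train_on_batch(real_batch, np.ones((n, 1)))
discriminator.train_on_batch(fake_batch, np.zeros((n, 1)))

# 2) One gradient update for the generator through the combined model,
#    with inverted labels so the generator is rewarded for fooling it.
gan.train_on_batch(np.random.randn(n, latent_dim), np.ones((n, 1)))
```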

Penalties on variables in scikit-learn GradientBoostingClassifier?

Is there a way to penalize a feature so that it doesn't dominate the model? (In Salford Predictive Modeller, there is a setting called "Penalties on Variables".)
The situation is that I have one categorical feature which I want to include in the model, but I don't want to have as the most important feature, since then the model doesn't properly capture the variance explained by the other predictors.
I think you cannot do that. Although I don't really understand why you would want to do that, you could try the following:
Train one model on the whole dataset, and train a separate model on the dataset after removing this feature. Then combine the results of the two models (maybe simple averaging or stacking, etc.).
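A minimal sketch of that suggestion with scikit-learn (X as a pandas DataFrame, y, and the column name dominant_col are assumptions):

```python
from sklearn.ensemble import GradientBoostingClassifier

X_reduced = X.drop(columns=[dominant_col])  # same data, minus the dominant feature

full = GradientBoostingClassifier().fit(X, y)
reduced = GradientBoostingClassifier().fit(X_reduced, y)

# Simple averaging of the two models' predicted probabilities.
proba = (full.predict_proba(X) + reduced.predict_proba(X_reduced)) / 2
pred = full.classes_[proba.argmax(axis=1)]
```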
