I've trained a RandomForestClassifier model with the sklearn library and saved it with joblib. Now I have a joblib file of nearly 1GB which I'm deploying on an Nginx/Flask/Gunicorn stack. The issue is that I have to find an efficient way to load this model from file and serve API requests. Is it possible to save the model without the datasets when doing:
joblib.dump(model, '/kaggle/working/mymodel.joblib')
print("random classifier saved")
The persistent representation of Scikit-Learn estimators DOES NOT include any training data.
Speaking about decision trees and their ensembles (such as random forests), the size of the estimator object scales with the total number of nodes in the trees, and that node count can grow roughly exponentially with tree depth (a fully grown binary tree has up to 2^(max_depth + 1) - 1 nodes). Each node is stored as a fixed-size record (split feature, threshold, child indices, plus float64 per-class value counts), so a few deep, fully grown trees can dominate the file size.
You can make your random forest objects smaller by limiting the max_depth parameter. If you're worried about a potential loss of predictive performance, you can increase the number of member trees (the n_estimators parameter) to compensate.
Longer term, you may wish to explore alternative representations for Scikit-Learn models, for example converting them to the PMML data format using the SkLearn2PMML package.
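Beyond shrinking the trees themselves, joblib can also compress the pickle on disk, which often cuts forest file sizes substantially at some dump/load-time cost. A minimal sketch of a compressed dump, assuming a fitted model (the toy data, compress level, and path are illustrative, not from the question):

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Stand-in model; in the question this would be the already-fitted classifier.
X, y = make_classification(n_samples=1000, random_state=0)
model = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=0).fit(X, y)

# compress=3 trades slower dump/load for a much smaller file on disk.
joblib.dump(model, "mymodel.joblib", compress=3)
model = joblib.load("mymodel.joblib")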
I am trying to create one knowledge base (a single source of truth) gathered from multiple web sources (e.g. wiki <-> fandom).
So I want to try a Siamese network, or to calculate cosine similarity between BERT-embedded documents.
Then, can I ignore those JSON structures and train on them anyway?
Although BERT wasn't specifically trained to find similarity between JSON data, you could always extract and concatenate the values of your JSON into a long sentence and leave it to BERT to capture the context as you expect.
Alternatively, you could generate a cosine similarity score for each key-value dependency between the JSONs and aggregate them to generate a net similarity score for the JSON data pair.
Also, see Sentence-BERT (SBERT), a modification of the pre-trained BERT network that uses siamese and triplet network structures to derive semantically meaningful sentence embeddings that can be compared using cosine similarity.
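A minimal sketch of that SBERT approach, assuming the sentence-transformers package; the model name, JSON layout, and json_to_text helper are illustrative:

import json
from sentence_transformers import SentenceTransformer, util

def json_to_text(obj):
    # Concatenate all leaf values of a (possibly nested) JSON object.
    if isinstance(obj, dict):
        return " ".join(json_to_text(v) for v in obj.values())
    if isinstance(obj, list):
        return " ".join(json_to_text(v) for v in obj)
    return str(obj)

model = SentenceTransformer("all-MiniLM-L6-v2")

doc_a = json.loads('{"title": "Master Sword", "origin": "Skyward Sword"}')
doc_b = json.loads('{"name": "Master Sword", "first_game": "Skyward Sword"}')

emb = model.encode([json_to_text(doc_a), json_to_text(doc_b)])
print(util.cos_sim(emb[0], emb[1]))  # cosine similarity of the two documents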
I have already trained an XGBoost model with about X trees. I want to create some replicas of the model with the exact same hyperparameters, but prune the number of trees; for example, I want to create a model with the same weights and parameters but just half the number of trees. Is it possible to do this using the XGBoost API?
I tried a naive quick approach of deserializing a trained XGBoost model and resetting booster_params['num_boost_round'] to half of what it was. But this didn't seem to impact the model quality or prediction scores at all, implying this parameter is not used when scoring/evaluating.
The only option left is to dump a text or PMML file and parse it back with a subset of trees. I'm wondering if it is possible to do this with the XGBoost API itself (like changing a parameter that would have the same effect) without converting to a separate representation/format and parsing it myself.
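For what it's worth, a hedged sketch of one possibility (not from the thread): recent XGBoost versions support slicing a trained Booster to keep only a subset of trees (model slicing, since 1.3), and predict() accepts an iteration_range to score with only the first n rounds (since 1.4). The toy data and tree counts here are illustrative:

import numpy as np
import xgboost as xgb

# Stand-in model; in the question this would be the already-trained booster.
X = np.random.rand(200, 5)
y = (X[:, 0] > 0.5).astype(int)
dtrain = xgb.DMatrix(X, label=y)
booster = xgb.train({"objective": "binary:logistic"}, dtrain, num_boost_round=100)

half = booster[:50]  # a new Booster containing only the first 50 trees
p_half = half.predict(dtrain)

# Equivalent without slicing: limit the rounds used at prediction time.
p_limited = booster.predict(dtrain, iteration_range=(0, 50))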
I have a classic NLP problem: I have to classify news articles as fake or real.
I have created two sets of features:
A) Bigram Term Frequency-Inverse Document Frequency
B) Approximately 20 features associated with each document, obtained using pattern.en (https://www.clips.uantwerpen.be/pages/pattern-en), such as the subjectivity of the text, polarity, #stopwords, #verbs, #subjects, grammatical relations, etc.
What is the best way to combine the TF-IDF features with the other features for a single prediction?
Thanks a lot to everyone.
Not sure if you're asking technically how to combine the two objects in code, or what to do theoretically afterwards, so I will try to answer both.
Technically, your TF-IDF is just a matrix where the rows are records and the columns are features. To combine them, you can append your new features as columns to the end of the matrix. Your matrix is probably a sparse matrix (from SciPy) if you built it with sklearn, so you will have to make sure your new features are a sparse matrix as well (or make the other one dense), as in the sketch below.
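A minimal sketch of that append step with SciPy's hstack; the documents and extra features are illustrative:

import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["fake news spreads fast online", "real news is verified by editors"]
extra = np.array([[0.9, 0.1], [0.2, 0.8]])  # e.g. subjectivity, polarity

# Sparse bigram TF-IDF matrix, as in the question.
tfidf = TfidfVectorizer(ngram_range=(2, 2)).fit_transform(docs)

# Convert the dense features to sparse and append them as extra columns.
X = hstack([tfidf, csr_matrix(extra)]).tocsr()
print(X.shape)  # (n_docs, n_bigrams + 2)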
That gives you your training data; in terms of what to do with it, it is a little more tricky. Your features from a bigram frequency matrix will be sparse (I'm not talking about data structures here, I just mean that you will have a lot of 0s) and binary, whilst your other data is dense and continuous. This will run in most machine learning algorithms as is, although the prediction will probably be dominated by the dense variables. However, with a bit of feature engineering I have built several classifiers in the past using tree ensembles that take a combination of term-frequency variables enriched with some other, denser variables and give boosted results (for example, a classifier that looks at Twitter profiles and classifies them as companies or people). Usually I found better results when I could at least bin the dense variables into binary (or categorical, and then one-hot encoded into binary) so that they didn't dominate; a sketch of that binning follows.
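A hedged sketch of that binning idea using scikit-learn's KBinsDiscretizer (one of several ways to do it; the data and bin count are illustrative): discretizing the dense features into one-hot bins puts them on the same 0/1 scale as the term features.

import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

rng = np.random.default_rng(0)
dense = rng.random((100, 2))  # e.g. subjectivity and polarity scores

# Quantile bins, one-hot encoded: each feature becomes 4 binary columns.
binner = KBinsDiscretizer(n_bins=4, encode="onehot", strategy="quantile")
binary = binner.fit_transform(dense)  # sparse 0/1 matrix
print(binary.shape)  # (100, 8)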
Another option: use a classifier on the TF-IDF features alone, then use its prediction and the associated probabilities as new features for a second model to get a better result. Here is a picture from an AutoML blueprint showing the same idea. The results were > 90 percent for this setup versus 80 percent for the two separate classifiers.
I have to solve a 2-class classification problem.
I have 2 classifiers that output probabilities. Both of them are neural networks of different architecture.
Those 2 classifiers are trained and saved into 2 files.
Now I want to build meta classifier that will take probabilities as input and learn weights of those 2 classifiers.
So it will automatically decide how much should I "trust" each of my classifiers.
This model is described here:
http://rasbt.github.io/mlxtend/user_guide/classifier/StackingClassifier/#stackingclassifier
I plan to use the mlxtend library, but it seems that StackingClassifier refits the models.
I do not want to refit because it takes a huge amount of time.
On the other hand, I understand that refitting is necessary to "coordinate" the work of each classifier and "tune" the whole system.
What should I do in such situation?
I won't talk about mlxtend because I haven't worked with it, but I'll tell you the general idea.
You don't have to refit these models to the training set but you have to refit them to parts of it so you can create out-of-fold predictions.
Specifically, split your training data in a few pieces (usually 3 to 10). Keep one piece (i.e. fold) as validation data and train both models on the other folds. Then, predict the probabilities for the validation data using both models. Repeat the procedure treating each fold as a validation set. In the end, you should have the probabilities for all data points in the training set.
Then, you can train a meta-classifier using these probabilities and the ground truth labels. You can use the trained meta-classifier on your new data.
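A minimal sketch of that out-of-fold procedure with scikit-learn's cross_val_predict; the two base models here are stand-ins for the trained neural networks (in the real setting you would retrain your own networks on each fold):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=500, random_state=0)
base_a = RandomForestClassifier(random_state=0)
base_b = LogisticRegression(max_iter=1000)

# Out-of-fold probabilities: each point is predicted by a model that never saw it.
p_a = cross_val_predict(base_a, X, y, cv=5, method="predict_proba")[:, 1]
p_b = cross_val_predict(base_b, X, y, cv=5, method="predict_proba")[:, 1]

# The meta-classifier learns how much to "trust" each base model.
meta = LogisticRegression().fit(np.column_stack([p_a, p_b]), y)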
When creating a Bag of Words, you need to create a Vocabulary to give to the BOWImgDescriptorExtractor, which you then use on the images you wish to input. This creates the Testing Data.
So where does the Training Data come from, and where do you use it?
What's the difference between Vocabulary and Training Data?
Isn't the Vocabulary the same thing as the Training Data?
Training data is a set of images you collected for your application as the input of the BOWTrainer, and the vocabulary is the output of the BOWTrainer. Once you have the vocabulary, you can extract features of images using BOWImgDescriptorExtractor with the words defined in the vocabulary.
An image can be described by tons of features (words), however only some of them are important. The first job to do is to find those important words, that is, to train a vocabulary. After the vocabulary is obtained, images can be described more precisely.
So where does the Training Data come from, and where do you use it?
You should provide the training data, and use it to train the vocabulary with the BOWTrainer. The training data is a set of images (descriptors), depending on your application domain.
What's the difference between Vocabulary and Training Data?
Vocabulary is cooked, while training data is raw, unorganized.
Isn't the Vocabulary the same thing as the Training Data?
No.
There is an add function that is used to specify the training data; see the docs on the OpenCV BoW module.
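A minimal sketch of the whole pipeline in Python, assuming SIFT is available (opencv-contrib-python); the image paths and vocabulary size are illustrative:

import cv2

sift = cv2.SIFT_create()
trainer = cv2.BOWKMeansTrainer(100)  # vocabulary size (number of "words")

# Training data: descriptors extracted from your own image collection.
for path in ["img1.jpg", "img2.jpg"]:
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    _, desc = sift.detectAndCompute(img, None)
    if desc is not None:
        trainer.add(desc)  # the add function mentioned above

vocabulary = trainer.cluster()  # k-means centroids become the vocabulary

# The extractor describes new images as histograms over the vocabulary.
extractor = cv2.BOWImgDescriptorExtractor(sift, cv2.BFMatcher(cv2.NORM_L2))
extractor.setVocabulary(vocabulary)

query = cv2.imread("query.jpg", cv2.IMREAD_GRAYSCALE)
hist = extractor.compute(query, sift.detect(query, None))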