Scikit-learn: Extracting Feature Names from FeatureUnion inside a Pipeline

I'm using scikit-learn's Pipeline to extract several features and combine them into a single feature matrix, which is then passed to a random forest regressor. Feature extractors may be added or removed later. Consider the following structure:
model = Pipeline([
    ('feature_extract',
     FeatureUnion([
         ('feature A', ExtractorA()),
         ('feature B', ExtractorB()),
         ('feature C', FeatureUnion([
             ('c1', C1Extractor()),
             ('c2', C2Extractor())]))
     ])),
    ('random_forest', RandomForestRegressor(...))])
I would like to improve the predictions of the random forest by inspecting the feature_importances_ property of the RandomForestRegressor.
I managed to get the list using:
model._final_estimator.feature_importances_
Now I would like to dynamically link each column index in feature_importances_ to the corresponding feature name/step in the pipeline.
Is there a preferred way to save or retrieve the feature names inside a FeatureUnion? How would you address this issue?

To keep everything dynamic, you can use the function below as the transform implementation of a separate class and make an instance of that class part of your pipeline. You can also change the scoring parameter. I think grid search as part of the pipeline is what you are looking for.
from sklearn.model_selection import GridSearchCV

def best_config(model, parameters, train_instances, judgements):
    # 5-fold cross-validated search over the parameter grid.
    clf = GridSearchCV(model, parameters, cv=5,
                       scoring="accuracy", verbose=5, n_jobs=4)
    clf.fit(train_instances, judgements)
    best_estimator = clf.best_estimator_
    return [str(clf.best_params_), clf.best_score_,
            best_estimator]
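For the original question of linking importance columns to feature names, one possible approach is to walk the nested union and collect the names in column order. The following is only a minimal sketch: flatten_feature_names and rank_importances are hypothetical helper names, and it assumes that union-like objects expose transformer_list (as scikit-learn's FeatureUnion does) and that every leaf extractor implements get_feature_names_out(), which custom extractors would have to add themselves. Entries such as 'drop' or 'passthrough' are not handled.

```python
def flatten_feature_names(transformer, prefix=""):
    """Recursively collect output column names from a (possibly nested) union.

    Assumption: unions expose `transformer_list` as a list of
    (step_name, transformer) pairs, and leaves implement
    `get_feature_names_out()`.
    """
    if hasattr(transformer, "transformer_list"):
        names = []
        # A union concatenates its outputs in transformer_list order,
        # so walking the list in order preserves the column order.
        for step_name, step in transformer.transformer_list:
            names.extend(flatten_feature_names(step, f"{prefix}{step_name}__"))
        return names
    return [f"{prefix}{name}" for name in transformer.get_feature_names_out()]


def rank_importances(names, importances):
    """Pair each column name with its importance, highest first."""
    return sorted(zip(names, importances), key=lambda pair: pair[1], reverse=True)
```

With the pipeline from the question, this would be called on the fitted model as flatten_feature_names(model.named_steps['feature_extract']), and the result paired with model.named_steps['random_forest'].feature_importances_.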


Categorical features encoding in H2O

I train GBM models with H2O and want to use them in my backend (not Java). To do so, I download the MOJOs, convert them to ONNX, and run them in my apps.
To run inference, I need to know how categorical columns are transformed into their one-hot encoded versions. I was able to find this in the POJO:
static final void fill(String[] sa) {
    sa[0] = "Age";
    sa[1] = "Fare";
    sa[2] = "Pclass.1";
    sa[3] = "Pclass.2";
    sa[4] = "Pclass.3";
    sa[5] = "Pclass.missing(NA)";
    sa[6] = "Sex.female";
    sa[7] = "Sex.male";
    sa[8] = "Sex.missing(NA)";
}
So, here is the workflow for a non-Java backend as I see it:
1. Encode categorical features with OneHotExplicit.
2. Train the GBM model.
3. Download the MOJO and convert it to ONNX.
4. Download the POJO and find the feature alignment in the source code.
5. Implement the inference in your backend.
Is this the most straightforward and correct way?
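As an illustration of the last step, the feature alignment from the POJO can be turned into an input vector along these lines. This is a rough sketch, not H2O's own logic: one_hot_row is a hypothetical helper, and it assumes that names containing a dot follow the "<column>.<level>" pattern seen above, that "missing(NA)" marks an absent value, that column names themselves contain no dots, and that numeric columns are always present. How unseen levels should be handled is not covered here; they simply produce all zeros.

```python
# Feature order copied from the POJO excerpt above.
FEATURES = [
    "Age", "Fare",
    "Pclass.1", "Pclass.2", "Pclass.3", "Pclass.missing(NA)",
    "Sex.female", "Sex.male", "Sex.missing(NA)",
]


def one_hot_row(row, feature_names=FEATURES):
    """Build the model input vector in the order given by feature_names."""
    vector = []
    for name in feature_names:
        column, sep, level = name.partition(".")
        if not sep:
            # No dot: assumed to be a numeric column, copied as-is.
            vector.append(float(row[name]))
        elif level == "missing(NA)":
            vector.append(1.0 if row.get(column) is None else 0.0)
        else:
            # Compare as strings so that the integer 3 matches "Pclass.3".
            vector.append(1.0 if str(row.get(column)) == level else 0.0)
    return vector
```

For example, one_hot_row({"Age": 22, "Fare": 7.25, "Pclass": 3, "Sex": "male"}) sets only the Pclass.3 and Sex.male indicators.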
Thank you for your question.
Can you access the stored categorical values here?
https://github.com/h2oai/h2o-3/blob/master/h2o-genmodel/src/main/java/hex/genmodel/algos/tree/SharedTreeMojoModel.java#L72
https://github.com/h2oai/h2o-3/blob/master/h2o-genmodel/src/main/java/hex/genmodel/algos/tree/SharedTreeMojoReader.java#L34
https://github.com/h2oai/h2o-3/blob/master/h2o-algos/src/main/java/hex/tree/SharedTreeMojoWriter.java#L61
The index in the array means the translated categorical value.
The EasyPredictModelWrapper did it this way:
https://github.com/h2oai/h2o-3/blob/master/h2o-genmodel/src/main/java/hex/genmodel/easy/RowToRawDataConverter.java#L44
Can you access the model.ini inside the zip? It has a [domains] section that lists the files in the domains/ directory holding the categorical encoding for each feature.
e.g.:
[columns]
AGE
RACE
DPROS
DCAPS
PSA
VOL
GLEASON
CAPSULE
[domains]
7: 2 d000.txt
This means the column at index 7 (CAPSULE, counting from 0) has 2 categorical levels, stored in d000.txt.
Alternatively, there is an experimental/modelDetails.json file that has the categorical values under output.domains. The index in that list corresponds to the feature at the same index in the output.names list,
e.g. output.domains[7] holds the domains for the output.names[7] feature.
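Reading that mapping outside Java could look roughly like the sketch below. It assumes the model.ini layout is exactly as in the excerpt above (a [columns] section with one name per line, and [domains] entries of the form "index: count file"); parse_model_ini is a hypothetical helper name, and the text would first have to be read from the MOJO zip.

```python
def parse_model_ini(text):
    """Return (columns, domains) parsed from model.ini-style text.

    domains maps a column name to (number_of_levels, domain_file_name).
    """
    sections = {}
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            # A new section header such as [columns] or [domains].
            current = line[1:-1]
            sections[current] = []
        elif current is not None:
            sections[current].append(line)

    columns = sections.get("columns", [])
    domains = {}
    for entry in sections.get("domains", []):
        # Entry format assumed from the excerpt: "7: 2 d000.txt".
        index, rest = entry.split(":", 1)
        n_levels, file_name = rest.split()
        domains[columns[int(index)]] = (int(n_levels), file_name)
    return columns, domains
```

Each domain file named in the result would then be read from the domains/ directory of the zip to get the ordered list of levels for that column.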

How to create a language model with 2 different heads in huggingface?

I know I can create a language model with 1 head:
from transformers import AutoModelForMultipleChoice
model = AutoModelForMultipleChoice.from_pretrained("distilbert-base-cased").to(device)
But how can I create the same base model structure (e.g., distilbert-base-cased) with 2 heads? Say, one is AutoModelForMultipleChoice and the second is AutoModelForSequenceClassification. I need the only difference between the 2 models (1 head vs 2 heads) to be the additional head (from parameters perspective).
So now my input for the 2 heads model is something like [sequence_label, multiple_choice_labels]
In the general case, you will need to create a custom class derived from DistilBertPreTrainedModel. Inside __init__() you define the architecture of each head. Then you write your own forward() function, define inside it a custom loss involving both heads, and return the result.
But if you are talking specifically about DistilBertForMultipleChoice and DistilBertForSequenceClassification, there is a shortcut, as the head architectures happen to be identical (see source) and the only difference is in the loss function. So you can try to train your model as a multi-label sequence classification problem, where the label per sequence will be [sequence_label, multiple_choice_label_0, multiple_choice_label_1, ...]. For example, if you have an entry like {sequence, choice0, choice1, seq_label:True, correct_choice:0},
your dataset will be
[ {'text':(sequence, choice0), 'label':(1 1 0)},
{'text':(sequence, choice1), 'label':(1 0 0)} ]
This way the result of the sequence classification will be in the first position and to get the correct choice probability you will need to apply softmax function on the rest of the logits.
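The decoding step described here can be sketched in plain Python. This is only an illustration under assumptions: split_head_outputs is a hypothetical helper, the logits are taken to be a flat list for one input, and a sigmoid is assumed for the sequence label because the model is trained as a multi-label problem.

```python
import math


def split_head_outputs(logits):
    """Split one logit vector into (sequence_probability, choice_probabilities).

    The first logit is treated as the sequence label and passed through a
    sigmoid; the remaining logits are normalised with a softmax.
    """
    sequence_prob = 1.0 / (1.0 + math.exp(-logits[0]))
    choice_logits = logits[1:]
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(choice_logits)
    exps = [math.exp(x - peak) for x in choice_logits]
    total = sum(exps)
    return sequence_prob, [e / total for e in exps]
```

The choice with the highest probability in the second return value would be the predicted answer.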

Can you search for related database tables/fields using text similarity?

I am doing a college project where I need to compare a string with a list of other strings. I want to know whether there is a library that can do this.
Suppose I have a table called DOCTORS_DETAILS.
The other table names are HOSPITAL_DEPARTMENTS, DOCTOR_APPOINTMENTS, PATIENT_DETAILS, PAYMENTS, etc.
Now I want to calculate which of those are most relevant to DOCTOR_DETAILS.
The expected output would be:
DOCTOR_APPOINTMENTS - more relevant, because the term DOCTOR appears in both strings
PATIENT_DETAILS - the term DETAILS is present in both strings
HOSPITAL_DEPARTMENTS - least relevant
PAYMENTS - least relevant
So I want to compute relevance based on the number of terms the two strings have in common.
Ex: DOCTOR_DETAILS -> DOCTOR_APPOINTMENT (1/2) > DOCTOR_ADDRESS_INFORMATION (1/3) > DOCTOR_SPECIALIZATION_DEGREE_INFORMATION (1/4) > PATIENT_INFO (0/2)
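The scoring in that example (shared terms divided by the number of terms in the candidate name) can be written directly without any library. The sketch below is one reading of the example, with hypothetical function names; it matches terms exactly, so DOCTORS and DOCTOR would count as different terms unless some stemming is added.

```python
def tokens(name):
    """Split an identifier such as DOCTOR_DETAILS into upper-cased terms."""
    return [term for term in name.upper().split("_") if term]


def relevance(main, candidate):
    """Shared terms divided by the number of terms in the candidate."""
    main_terms = set(tokens(main))
    candidate_terms = tokens(candidate)
    if not candidate_terms:
        return 0.0
    shared = sum(1 for term in candidate_terms if term in main_terms)
    return shared / len(candidate_terms)


def rank_by_relevance(main, candidates):
    """Order candidate names from most to least relevant."""
    return sorted(candidates, key=lambda c: relevance(main, c), reverse=True)
```

Candidates with equal scores keep their original relative order, because Python's sort is stable.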
Semantic similarity is a common NLP problem. There are multiple approaches to look into, but at their core they all boil down to:
1. Turn each piece of text into a vector.
2. Measure the distance between vectors, and call closer vectors more similar.
Three possible ways to do step 1 are:
- tf-idf
- fasttext
- bert-as-service
To do step 2, you almost certainly want to use cosine distance. It is pretty straightforward in Python; here is an implementation from a blog post:
import numpy as np

def cos_sim(a, b):
    """Takes 2 vectors a, b and returns the cosine similarity according
    to the definition of the dot product
    """
    dot_product = np.dot(a, b)
    norm_a = np.linalg.norm(a)
    norm_b = np.linalg.norm(b)
    return dot_product / (norm_a * norm_b)
For your particular use case, my instincts say to use fasttext. The official site shows how to download some pretrained word vectors, but you will want to download a pretrained model instead (see this GH issue; use https://dl.fbaipublicfiles.com/fasttext/vectors-english/wiki-news-300d-1M-subword.bin.zip).
You'd then want to do something like:
import fasttext

model = fasttext.load_model("model_filename.bin")

def order_tables_by_name_similarity(main_table, candidate_tables):
    '''Note: we use a fasttext model, not just pretrained vectors, so we get subword information
    you can modify this to also output the distances if you need them
    '''
    main_v = model[main_table]
    similarity_to_main = lambda w: cos_sim(main_v, model[w])
    return sorted(candidate_tables, key=similarity_to_main, reverse=True)

order_tables_by_name_similarity("DOCTORS_DETAILS", ["HOSPITAL_DEPARTMENTS", "DOCTOR_APPOINTMENTS", "PATIENT_DETAILS", "PAYMENTS"])
# outputs: ['PATIENT_DETAILS', 'DOCTOR_APPOINTMENTS', 'HOSPITAL_DEPARTMENTS', 'PAYMENTS']
If you need to put this in production, the giant model size (6.7GB) might be an issue. At that point, you'd want to build your own model, and constrain the model size. You can probably get roughly the same accuracy out of a 6MB model!

Azure Machine Learning Studio Conditional Training Data

I have built a Microsoft Azure ML Studio workspace predictive web service, and have a scenario where I need to be able to run the service with different training datasets.
I know I can set up multiple web services via Azure ML, each with a different training set attached, but I am trying to find a way to do it all within the same workspace, passing a Web Input Parameter as the input value that chooses which training set to use.
I have found this article, which describes almost my scenario. However, the article relies on the training dataset pulled in by the Load Trained Data module having a static endpoint (or blob storage location). I don't see any way to dynamically (or conditionally) change this location based on a Web Input Parameter.
Basically, does Azure ML support a "conditional training data" loading?
Or, might there be a way to combine training datasets, then filter based on the passed Web Input Parameter?
This probably isn't exactly what you need, but hopefully, it helps you out.
To combine data sets, you can use the Join Data module.
To filter, that may be accomplished by executing a Python script. Here's an example.
Using the Adult Census Income Binary Classification dataset, on the age column, there's a minimum age of 17.
To filter the data set by age, connect it to an Execute Python Script module. Here's the filtering code, using the pandas query method:
# The script MUST contain a function named azureml_main
# which is the entry point for this module.
import pandas as pd

def azureml_main(dataframe1 = None, dataframe2 = None):
    # Return value must be a sequence of pandas.DataFrame,
    # hence the trailing comma (a one-element tuple).
    return dataframe1.query("age >= 25"),
Looking at the output, the data set has been filtered so that the minimum age is now 25.
Sure, you can do that. What you would want is to use an Execute R Script or SQL Transformation module to determine, based on your input data, what model to use. Something like this:
Notice, your input data is cleaned/updated/feature engineered, then it's passed to two different SQL transforms which will tell it to go to one of two paths.
Each path has its own training data.
Note: I am not exactly sure what your use case is, but if it were me, I would instead train two different models on the two training datasets, then just use those models in my web service rather than training inside the web service, as that would likely be quite slow.

save binarizer together with sklearn model

I'm trying to build a service that has 2 components. In component 1, I train a machine learning model using sklearn by creating a Pipeline. This model gets serialized using joblib.dump (really numpy_pickle.dump). Component 2 runs in the cloud, loads the model trained by (1), and uses it to label text that it gets as input.
I'm running into an issue where, during training (component 1) I need to first binarize my data since it is text data, which means that the model is trained on binarized input and then makes predictions using the mapping created by the binarizer. I need to get this mapping back when (2) makes predictions based on the model so that I can output the actual text labels.
I tried adding the binarizer to the pipeline like this, thinking that the model would then have the mapping itself:
p = Pipeline([
    ('binarizer', MultiLabelBinarizer()),
    ('vect', CountVectorizer(min_df=min_df, ngram_range=ngram_range)),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(clf))
])
But I get the following error:
model = p.fit(training_features, training_tags)
*** TypeError: fit_transform() takes 2 positional arguments but 3 were given
My goal is to make sure the binarizer and model are tied together so that the consumer knows how to decode the model's output.
What are some existing paradigms for doing this? Should I be serializing the binarizer together with the model in some other object that I create? Is there some other way of passing the binarizer to Pipeline so that I don't have to do that, and would I be able to get the mappings back from the model if I did that?
Your intuition that you should add the MultiLabelBinarizer to the pipeline was the right way to solve this problem. It would have worked, except that MultiLabelBinarizer.fit_transform does not take the fit_transform(self, X, y=None) method signature which is now standard for sklearn estimators. Instead, it has a unique fit_transform(self, y) signature which I had never noticed before. As a result of this difference, when you call fit on the pipeline, it tries to pass training_tags as a third positional argument to a function with two positional arguments, which doesn't work.
The solution to this problem is tricky. The cleanest way I can think of to work around it is to create your own MultiLabelBinarizer that overrides fit_transform and ignores its third argument. Try something like the following.
class MyMLB(MultiLabelBinarizer):
    def fit_transform(self, X, y=None):
        # Ignore y so the Pipeline's fit_transform(X, y) call is accepted.
        return super().fit_transform(X)
Try adding this to your pipeline in place of the MultiLabelBinarizer and see what happens. If you're able to fit() the pipeline, the last problem that you'll have is that your new MyMLB class has to be importable on any system that will de-pickle your now trained, pickled pipeline object. The easiest way to do this is to put MyMLB into its own module and place a copy on the remote machine that will be de-pickling and executing the model. That should fix it.
I misunderstood how the MultiLabelBinarizer worked. It is a transformer of outputs, not of inputs. Not only does this explain the alternative fit_transform() method signature for that class, but it also makes it fundamentally incompatible with the idea of inclusion in a single classification pipeline which is limited to transforming inputs and making predictions of outputs. However, all is not lost!
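To make the "transformer of outputs" point concrete, here is a plain-Python stand-in for the mapping a label binarizer keeps. It is not scikit-learn's implementation, only a sketch of the idea with hypothetical function names: the sorted list of distinct labels defines the column order, and that same list is what you need later to decode predictions.

```python
def fit_classes(label_sets):
    """Return the sorted distinct labels; this defines the column order."""
    return sorted({label for labels in label_sets for label in labels})


def binarize(label_sets, classes):
    """Encode each label set as a 0/1 row over the given classes."""
    return [[1 if c in labels else 0 for c in classes] for labels in label_sets]


def inverse(rows, classes):
    """Decode 0/1 rows back into tuples of labels."""
    return [tuple(c for c, flag in zip(classes, row) if flag) for row in rows]
```

Because the targets are encoded before training and the predictions are decoded after prediction, the mapping sits around the pipeline rather than inside it.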
Based on your question, you're already comfortable with serializing your model to disk as [some form of] a .pkl file. You should also be able to serialize a trained MultiLabelBinarizer, then load it and use it to decode the outputs from your pipeline. I know you're using joblib, but I'll write this sample code as if you're using pickle. I believe the idea will still apply.
import pickle
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer

X = <training_data>
y = <training_labels>
# Binarize the class labels into a multi-label indicator matrix.
mlb = MultiLabelBinarizer()
multilabel_y = mlb.fit_transform(y)
p = Pipeline([
    ('vect', CountVectorizer(min_df=min_df, ngram_range=ngram_range)),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(clf))
])
# Use multilabel classes to fit the pipeline.
p.fit(X, multilabel_y)
# Serialize both the pipeline and binarizer to disk.
with open('my_sklearn_objects.pkl', 'wb') as f:
    pickle.dump((mlb, p), f)
Then, after shipping the .pkl files to the remote box...
import pickle

# Hydrate the serialized objects.
with open('my_sklearn_objects.pkl', 'rb') as f:
    mlb, p = pickle.load(f)
X = <input data> # Get your input data from somewhere.
# Predict the classes using the pipeline
mlb_predictions = p.predict(X)
# Turn those classes into labels using the binarizer.
classes = mlb.inverse_transform(mlb_predictions)
# Do something with predicted classes.
<...>
Is this the paradigm for doing this? As far as I know, yes. Not only that, but if you want to keep them together (which is a good idea, I think) you can serialize them as a tuple, as I did in the example above, so they stay in a single file. No need to serialize a custom object or anything like that.
Model serialization via pickle et al. is the sklearn-approved way to save estimators between runs and move them between computers. I've used this process successfully many times before, including in production systems.
