Why the following partial fit is not working property? - machine-learning

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
Hello I have the following list of comments:
comments = ['I am very agry','this is not interesting','I am very happy']
These are the corresponding labels:
sents = ['angry','indiferent','happy']
I am using tfidf to vectorize these comments as follows:
tfidf_vectorizer = TfidfVectorizer(analyzer='word')
tfidf = tfidf_vectorizer.fit_transform(comments)
from sklearn import preprocessing
I am using label encoder to vectorize the labels:
le = preprocessing.LabelEncoder()
le.fit(sents)
labels = le.transform(sents)
print(labels.shape)
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.model_selection import train_test_split
with open('tfidf.pickle','wb') as idxf:
pickle.dump(tfidf, idxf, pickle.HIGHEST_PROTOCOL)
with open('tfidf_vectorizer.pickle','wb') as idxf:
pickle.dump(tfidf_vectorizer, idxf, pickle.HIGHEST_PROTOCOL)
Here I am using passive aggressive to fit the model:
clf2 = PassiveAggressiveClassifier()
with open('passive.pickle','wb') as idxf:
pickle.dump(clf2, idxf, pickle.HIGHEST_PROTOCOL)
with open('passive.pickle', 'rb') as infile:
clf2 = pickle.load(infile)
with open('tfidf_vectorizer.pickle', 'rb') as infile:
tfidf_vectorizer = pickle.load(infile)
with open('tfidf.pickle', 'rb') as infile:
tfidf = pickle.load(infile)
Here I am trying to test the usage of partial fit as follows with three new comments and their corresponding labels:
new_comments = ['I love the life','I hate you','this is not important']
new_labels = [1,0,2]
vec_new_comments = tfidf_vectorizer.transform(new_comments)
print(clf2.predict(vec_new_comments))
clf2.partial_fit(vec_new_comments, new_labels)
The problem is that I am not getting the right results after the partial fit as follows:
print('AFTER THIS UPDATE THE RESULT SHOULD BE 1,0,2??')
print(clf2.predict(vec_new_comments))
however I am getting this output:
[2 2 2]
So I really appreciate support to find, why the model is not being updated if I am testing it with the same examples that it has used to be trained the desired output should be:
[1,0,2]
I would like to appreciate support to ajust maybe the hyperparameters to see the desired output.
this is the complete code, to show the partial fit:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
import sys
from sklearn.metrics.pairwise import cosine_similarity
import random
comments = ['I am very agry','this is not interesting','I am very happy']
sents = ['angry','indiferent','happy']
tfidf_vectorizer = TfidfVectorizer(analyzer='word')
tfidf = tfidf_vectorizer.fit_transform(comments)
#print(tfidf.shape)
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(sents)
labels = le.transform(sents)
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.model_selection import train_test_split
with open('tfidf.pickle','wb') as idxf:
pickle.dump(tfidf, idxf, pickle.HIGHEST_PROTOCOL)
with open('tfidf_vectorizer.pickle','wb') as idxf:
pickle.dump(tfidf_vectorizer, idxf, pickle.HIGHEST_PROTOCOL)
clf2 = PassiveAggressiveClassifier()
clf2.fit(tfidf, labels)
with open('passive.pickle','wb') as idxf:
pickle.dump(clf2, idxf, pickle.HIGHEST_PROTOCOL)
with open('passive.pickle', 'rb') as infile:
clf2 = pickle.load(infile)
with open('tfidf_vectorizer.pickle', 'rb') as infile:
tfidf_vectorizer = pickle.load(infile)
with open('tfidf.pickle', 'rb') as infile:
tfidf = pickle.load(infile)
new_comments = ['I love the life','I hate you','this is not important']
new_labels = [1,0,2]
vec_new_comments = tfidf_vectorizer.transform(new_comments)
clf2.partial_fit(vec_new_comments, new_labels)
print('AFTER THIS UPDATE THE RESULT SHOULD BE 1,0,2??')
print(clf2.predict(vec_new_comments))
However I got:
AFTER THIS UPDATE THE RESULT SHOULD BE 1,0,2??
[2 2 2]

Well there are multiple problems with your code. I will start by stating the obvious ones to more complex ones:
You are pickling the clf2 before it has learnt anything. (ie. you pickle it as soon as it is defined, it doesnt serve any purpose). If you are only testing, then fine. Otherwise they should be pickled after the fit() or equivalent calls.
You are calling clf2.fit() before the clf2.partial_fit(). This defeats the whole purpose of partial_fit(). When you call fit(), you essentially fix the classes (labels) that the model will learn about. In your case it is acceptable, because on your subsequent call to partial_fit() you are giving the same labels. But still it is not a good practice.
See this for more details
In a partial_fit() scenario, dont call the fit() ever. Always call the partial_fit() with your starting data and new coming data. But make sure that you supply all the labels you want the model to learn in the first call to parital_fit() in a parameter classes.
Now the last part, about your tfidf_vectorizer. You call fit_transform()(which is essentially fit() and then transformed() combined) on tfidf_vectorizer with comments array. That means that it on subsequent calls to transform() (as you did in transform(new_comments)), it will not learn new words from new_comments, but only use the words which it saw during the call to fit()(words present in comments).
Same goes for LabelEncoder and sents.
This again is not prefereble in a online learning scenario. You should fit all the available data at once. But since you are trying to use the partial_fit(), we assume that you have very large dataset which may not fit in memory at once. So you would want to apply some sort of partial_fit to TfidfVectorizer as well. But TfidfVectorizer doesnt support partial_fit(). In fact its not made for large data. So you need to change your approach. See the following questions for more details:-
Updating the feature names into scikit TFIdfVectorizer
How can i reduce memory usage of Scikit-Learn Vectorizers?
All things aside, if you change just the tfidf part of fitting the whole data (comments and new_comments at once), you will get your desired results.
See the below code changes (I may have organized it a bit and renamed vec_new_comments to new_tfidf, please go through it with attention):
comments = ['I am very agry','this is not interesting','I am very happy']
sents = ['angry','indiferent','happy']
new_comments = ['I love the life','I hate you','this is not important']
new_sents = ['happy','angry','indiferent']
tfidf_vectorizer = TfidfVectorizer(analyzer='word')
le = preprocessing.LabelEncoder()
# The below lines are important
# I have given the whole data to fit in tfidf_vectorizer
tfidf_vectorizer.fit(comments + new_comments)
# same for `sents`, but since the labels dont change, it doesnt matter which you use, because it will be same
# le.fit(sents)
le.fit(sents + new_sents)
Below is the Not so preferred code (which you are using, and about which I talked in point 2), but results are good as long as you make the above changes.
tfidf = tfidf_vectorizer.transform(comments)
labels = le.transform(sents)
clf2.fit(tfidf, labels)
print(clf2.predict(tfidf))
# [0 2 1]
new_tfidf = tfidf_vectorizer.transform(new_comments)
new_labels = le.transform(new_sents)
clf2.partial_fit(new_tfidf, new_labels)
print(clf2.predict(new_tfidf))
# [1 0 2] As you wanted
Correct approach, or the way partial_fit() is intended to be used:
# Declare all labels that you want the model to learn
# Using classes learnt by labelEncoder for this
# In any calls to `partial_fit()`, all labels should be from this array only
all_classes = le.transform(le.classes_)
# Notice the parameter classes here
# It needs to present first time
clf2.partial_fit(tfidf, labels, classes=all_classes)
print(clf2.predict(tfidf))
# [0 2 1]
# classes is not present here
clf2.partial_fit(new_tfidf, new_labels)
print(clf2.predict(new_tfidf))
# [1 0 2]

Related

How to save sentence-Bert output vectors to a file?

I am using Bert to get similarity between multi term words.here is my code that I used for embedding :
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('bert-large-uncased-whole-word-masking')
words = [
"Artificial intelligence",
"Data mining",
"Political history",
"Literature book"]
I also have a dataset which contains 540000 other words.
Vocabs = [
"Winter flooding",
"Cholesterol diet", ....]
the problem is when I want to embed Vocabs to vectors it takes time forever.
words_embeddings = model.encode(words)
Vocabs_embeddings = model.encode(Vocabs)
is there any way to make it faster? or I want to embed Vocabs in for loops and save the output vectors in a file so I don't have to embed 540000 vocabs every time I need it. is there a way to save embeddings to a file and use it again?
I will really appreciate you for your time trying help me.
You can pickle your corpus and embeddings like this, you can also pickle a dictionary instead, or write them to file in any other format you prefer.
import pickle
with open("my-embeddings.pkl", "wb") as fOut:
pickle.dump({'sentences': words, 'embeddings': word_embeddings},fOut)
Or more generally like below, so you encode when the embeddings don't exist but after that any time you need them you load from file, instead of re-encoding your corpus:
if not os.path.exists(embedding_cache_path):
# read your corpus etc
corpus_sentences = ...
print("Encoding the corpus. This might take a while")
corpus_embeddings = model.encode(corpus_sentences, show_progress_bar=True, convert_to_numpy=True)
corpus_embeddings = corpus_embeddings / np.linalg.norm(corpus_embeddings, axis=1, keepdims=True)
print("Storing file on disc")
with open(embedding_cache_path, "wb") as fOut:
pickle.dump({'sentences': corpus_sentences, 'embeddings': corpus_embeddings}, fOut)
else:
print("Loading pre-computed embeddings from disc")
with open(embedding_cache_path, "rb") as fIn:
cache_data = pickle.load(fIn)
corpus_sentences = cache_data['sentences']
corpus_embeddings = cache_data['embeddings']

Why I get different expected_value when I include the training data in TreeExplainer?

Including the training data in SHAP TreeExplainer gives different expected_value in scikit-learn GBT Regressor.
Reproducible example (run in Google Colab):
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
import numpy as np
import shap
shap.__version__
# 0.37.0
X, y = make_regression(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
gbt = GradientBoostingRegressor(random_state=0)
gbt.fit(X_train, y_train)
# mean prediction:
mean_pred_gbt = np.mean(gbt.predict(X_train))
mean_pred_gbt
# -11.534353657511172
# explainer without data
gbt_explainer = shap.TreeExplainer(gbt)
gbt_explainer.expected_value
# array([-11.53435366])
np.isclose(mean_pred_gbt, gbt_explainer.expected_value)
# array([ True])
# explainer with training data
gbt_data_explainer = shap.TreeExplainer(model=gbt, data=X_train) # specifying feature_perturbation does not change the result
gbt_data_explainer.expected_value
# -23.564797322079635
So, the expected value when including the training data gbt_data_explainer.expected_value is quite different from the one calculated without supplying the data (gbt_explainer.expected_value).
Both approaches are additive and consistent when used with the (obviously different) respective shap_values:
np.abs(gbt_explainer.expected_value + gbt_explainer.shap_values(X_train).sum(1) - gbt.predict(X_train)).max() < 1e-4
# True
np.abs(gbt_data_explainer.expected_value + gbt_data_explainer.shap_values(X_train).sum(1) - gbt.predict(X_train)).max() < 1e-4
# True
but I wonder why they do not provide the same expected_value, and why gbt_data_explainer.expected_value is so different from the mean value of predictions.
What am I missing here?
Apparently shap subsets to 100 rows when data is passed, then runs those rows through the trees to reset the sample counts for each node. So the -23.5... being reported is the average model output for those 100 rows.
The data is passed to an Independent masker, which does the subsampling:
https://github.com/slundberg/shap/blob/v0.37.0/shap/explainers/_tree.py#L94
https://github.com/slundberg/shap/blob/v0.37.0/shap/explainers/_explainer.py#L68
https://github.com/slundberg/shap/blob/v0.37.0/shap/maskers/_tabular.py#L216
Running
from shap import maskers
another_gbt_explainer = shap.TreeExplainer(
gbt,
data=maskers.Independent(X_train, max_samples=800),
feature_perturbation="tree_path_dependent"
)
another_gbt_explainer.expected_value
gets back to
-11.534353657511172
Though #Ben did a great job in digging out how the data gets passed through Independent masker, his answer does not show exactly (1) how base values are calculated and where do we get the different base value from and (2) how to choose/lower the max_samples param
Where the different value comes from
The masker object has a data attribute that holds data after masking process. To get the value showed in gbt_explainer.expected_value:
from shap.maskers import Independent
gbt = GradientBoostingRegressor(random_state=0)
# mean prediction:
mean_pred_gbt = np.mean(gbt.predict(X_train))
mean_pred_gbt
# -11.534353657511172
# explainer without data
gbt_explainer = shap.TreeExplainer(gbt)
gbt_explainer.expected_value
# array([-11.53435366])
gbt_explainer = shap.TreeExplainer(gbt, Independent(X_train,100))
gbt_explainer.expected_value
# -23.56479732207963
one would need to do:
masker = Independent(X_train,100)
gbt.predict(masker.data).mean()
# -23.56479732207963
What about choosing max_samples?
Setting max_samples to the original dataset length seem to work with other explainers too:
import sklearn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
import shap
from shap.maskers import Independent
from scipy.special import logit, expit
corpus,y = shap.datasets.imdb()
corpus_train, corpus_test, y_train, y_test = train_test_split(corpus, y, test_size=0.2, random_state=7)
vectorizer = TfidfVectorizer(min_df=10)
X_train = vectorizer.fit_transform(corpus_train)
model = sklearn.linear_model.LogisticRegression(penalty="l2", C=0.1)
model.fit(X_train, y_train)
explainer = shap.Explainer(model
,masker = Independent(X_train,100)
,feature_names=vectorizer.get_feature_names()
)
explainer.expected_value
# -0.18417413671991964
This value comes from:
masker=Independent(X_train,100)
logit(model.predict_proba(masker.data.mean(0).reshape(1,-1))[...,1])
# array([-0.18417414])
max_samples=100 seem to be a bit off for a true base_value (just feeding feature means):
logit(model.predict_proba(X_train.mean(0).reshape(1,-1))[:,1])
array([-0.02938039])
By increasing max_samples one might get reasonably close to true baseline, while keeping num of samples low:
masker = Independent(X_train,1000)
logit(model.predict_proba(masker.data.mean(0).reshape(1,-1))[:,1])
# -0.05957302658674238
So, to get base value for an explainer of interest (1) pass explainer.data (or masker.data) through your model and (2) choose max_samples so that base_value on sampled data is close enough to the true base value. You may also try to observe if the values and order of shap importances converge.
Some people may notice that to get to the base values sometimes we average feature inputs (LogisticRegression) and sometimes outputs (GBT)

Integrate the ImageDataGenerator in own customized fit_generator

I want to fit a Siamese CNN with multiple inputs that are stored in my memory and no label (just an arbitrary dummy label). Therefore, I had to write my own data_generator function for using a CNN model in Keras.
My data generator is of the following form
class DataGenerator(keras.utils.Sequence):
def __init__(self, train_data, train_triplets, batch_size=32, dim=(128,128), n_channels=3, shuffle=True):
self.dim = dim
self.batch_size = batch_size
#Added
self.train_data = train_data
self.train_triplets = train_triplets
self.n_channels = n_channels
self.shuffle = shuffle
self.on_epoch_end()
def __len__(self):
'Denotes the number of batches per epoch'
n_row = self.train_triplets.shape[0]
return int(np.floor(n_row / self.batch_size))
def __getitem__(self, index):
'Generate one batch of data'
# Generate indexes of the batch
#print(index)
indexes = self.indexes[index*self.batch_size:(index+1)*self.batch_size]
# Find list of IDs
list_IDs_temp = self.train_triplets.iloc[indexes,]
# Generate data
[anchor, positive, negative] = self.__data_generation(list_IDs_temp)
y_train = np.random.randint(2, size=(1,2,self.batch_size)).T
return [anchor,positive, negative], y_train
def on_epoch_end(self):
'Updates indexes after each epoch'
n_row = self.train_triplets.shape[0]
self.indexes = np.arange(n_row)
if self.shuffle == True:
np.random.shuffle(self.indexes)
def __data_generation(self, list_IDs_temp):
'Generates data containing batch_size samples'
# anchor positive and negatives: (n_samples, *dim, n_channels)
# Initialization
anchor = np.zeros((self.batch_size,*self.dim,self.n_channels))
positive = np.zeros((self.batch_size,*self.dim,self.n_channels))
negative = np.zeros((self.batch_size,*self.dim,self.n_channels))
nrow_temp = list_IDs_temp.shape[0]
# Generate data
for i in range(nrow_temp):
list_ind = list_IDs_temp.iloc[i,]
anchor[i] = self.train_data[list_ind[0]]
positive[i] = self.train_data[list_ind[1]]
negative[i] = self.train_data[list_ind[2]]
return [anchor, positive, negative]
where train_data is a list of all images and train triplets a data frame containing image indices to create my inputs containing of a triplet of images.
Now, I want to do some data augmenting for each mini batch supplied to my CNN. I have tried to integrate the ImageDataGenarator of Keras but I couldn't implement it in my code. Is it somehow possible to do it ? I am not very experienced with python and would appreciate any help.
Does this article answer your question?
To put it in a nutshell, Kera's ImageDataGenerator lacks flexibility when it comes to personalized batch generators, and the easiest way to still use data augmentation is simply to switch to another data augmentation tool (like the albumentations library described in the previous article, but you could also use imgaug as well).
I just want to warn you that I encountered several issues with albumentations (that I described in this question on GitHub, but for now I still have had no answers), so maybe using imgaug is a better idea.
Hope this helps, good luck with your model !

How to do a single value prediction in NLP

My dataset was restaurants review with two columns review and liked.
Based on the review it shows if they liked the restaurant or not
I cleaned up the data in NLP as the first step.Then as second step used bag of words model as below.
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features = 1500)
X = cv.fit_transform(corpus).toarray()
y = dataset.iloc[:, 1].values
This above gave X as 1500 columns with 0 and 1 with 1000 rows according to my dataset.
I predicted as below
y_pred = classifier.predict(X_test)
So now I have review as "Food was good",how do I predict if they like it or not.A single value to predict.
Please can you help me out.Please let me know if additional information is required.
Thanks
All you need is to apply cv.transform first just like so:
>>> test = ['Food was good']
>>> test_vec = cv.transform(test)
>>> classifier.predict(test_vec)
# returns predicted class
For training and testing here is simple example:
Training:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
text = ["This is good place","Hyatt is awesome hotel"]
count_vect = CountVectorizer()
tfidf_transformer = TfidfTransformer()
X_train_counts = count_vect.fit_transform(text)
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
pd.DataFrame(X_train_tfidf.todense(), columns = count_vect.get_feature_names())
# Now apply any classification u want to on top of this data-set
Now Testing:
Note: use the same transformation as done in training:
new = ["I like the ambiance of this hotel "]
pd.DataFrame(tfidf_transformer.transform(count_vect.transform(new)).todense(),
columns = count_vect.get_feature_names())
Apply model.predict on top of this now.
you can also use sklearn pipeline.
from sklearn.pipeline import Pipeline
model_pipeline = Pipeline([('vect', CountVectorizer()),('tfidf', TfidfTransformer()), ('model', classifier())]) #call the Model which you want to use
model_pipeline.fit_transform(x,y) # here x is your text data, and y is going to be your target
model_pipeline.predict(['Food was good"']) # predict your new sentence

'Pipeline' object has no attribute 'get_feature_names' in scikit-learn

I am basically clustering some of my documents using mini_batch_kmeans and kmeans algorithm. I simply followed the tutorial is the scikit-learn website the link for that is given below:
http://scikit-learn.org/stable/auto_examples/text/document_clustering.html
They are using some of the method for the vectorizing one of which is HashingVectorizer. In the hashingVectorizer they are making a pipeline with TfidfTransformer() method.
# Perform an IDF normalization on the output of HashingVectorizer
hasher = HashingVectorizer(n_features=opts.n_features,
stop_words='english', non_negative=True,
norm=None, binary=False)
vectorizer = make_pipeline(hasher, TfidfTransformer())
Once doing so, the vectorizer what I get from that does not have the method get_feature_names(). But since I am using it for clustering, I need to get the "terms" using this "get_feature_names()"
terms = vectorizer.get_feature_names()
for i in range(true_k):
print("Cluster %d:" % i, end='')
for ind in order_centroids[i, :10]:
print(' %s' % terms[ind], end='')
print()
How do I solve this error?
My whole code is show below:
X_train_vecs, vectorizer = vector_bow.count_tfidf_vectorizer(_contents)
mini_kmeans_batch = MiniBatchKmeansTechnique()
# MiniBatchKmeans without the LSA dimensionality reduction
mini_kmeans_batch.mini_kmeans_technique(number_cluster=8, X_train_vecs=X_train_vecs,
vectorizer=vectorizer, filenames=_filenames, contents=_contents, is_dimension_reduced=False)
The count vectorizor piped with tfidf.
def count_tfidf_vectorizer(self,contents):
count_vect = CountVectorizer()
vectorizer = make_pipeline(count_vect,TfidfTransformer())
X_train_vecs = vectorizer.fit_transform(contents)
print("The count of bow : ", X_train_vecs.shape)
return X_train_vecs, vectorizer
and the mini_batch_kmeans class is as below:
class MiniBatchKmeansTechnique():
def mini_kmeans_technique(self, number_cluster, X_train_vecs, vectorizer,
filenames, contents, svd=None, is_dimension_reduced=True):
km = MiniBatchKMeans(n_clusters=number_cluster, init='k-means++', max_iter=100, n_init=10,
init_size=1000, batch_size=1000, verbose=True, random_state=42)
print("Clustering sparse data with %s" % km)
t0 = time()
km.fit(X_train_vecs)
print("done in %0.3fs" % (time() - t0))
print()
cluster_labels = km.labels_.tolist()
print("List of the cluster names is : ",cluster_labels)
data = {'filename':filenames, 'contents':contents, 'cluster_label':cluster_labels}
frame = pd.DataFrame(data=data, index=[cluster_labels], columns=['filename', 'contents', 'cluster_label'])
print(frame['cluster_label'].value_counts(sort=True,ascending=False))
print()
grouped = frame['cluster_label'].groupby(frame['cluster_label'])
print(grouped.mean())
print()
print("Top Terms Per Cluster :")
if is_dimension_reduced:
if svd != None:
original_space_centroids = svd.inverse_transform(km.cluster_centers_)
order_centroids = original_space_centroids.argsort()[:, ::-1]
else:
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names()
for i in range(number_cluster):
print("Cluster %d:" % i, end=' ')
for ind in order_centroids[i, :10]:
print(' %s' % terms[ind], end=',')
print()
print("Cluster %d filenames:" % i, end='')
for file in frame.ix[i]['filename'].values.tolist():
print(' %s,' % file, end='')
print()
Pipeline doesn't have get_feature_names() method, as it is not straightforward to implement this method for Pipeline - one needs to consider all pipeline steps to get feature names. See https://github.com/scikit-learn/scikit-learn/issues/6424, https://github.com/scikit-learn/scikit-learn/issues/6425, etc. - there is a lot of related tickets and several attempts to fix it.
If your pipeline is simple (TfidfVectorizer followed by MiniBatchKMeans) then you can get feature names from TfidfVectorizer.
If you want to use HashingVectorizer, it is more complicated, as HashingVectorizer doesn't provide feature names by design. HashingVectorizer doesn't store vocabulary, and uses hashes instead - it means it can be applied in online setting, and that it dosn't require any RAM - but the tradeoff is exactly that you don't get feature names.
It is still possible to get feature names from HashingVectorizer though; to do this you need to apply it for a sample of documents, store which hashes correspond to which words, and this way learn what these hashes mean, i.e. what are the feature names. There may be collisions, so it is not possible to be 100% sure the feature name is correct, but usually this approach works ok. This approach is implemented in eli5 library; see http://eli5.readthedocs.io/en/latest/tutorials/sklearn-text.html#debugging-hashingvectorizer for an example. You will have to do something like this, using InvertableHashingVectorizer:
from eli5.sklearn import InvertableHashingVectorizer
ivec = InvertableHashingVectorizer(vec) # vec is a HashingVectorizer instance
# X_sample is a sample from contents; you can use the
# whole contents array, or just e.g. every 10th element
ivec.fit(content_sample)
hashing_feat_names = ivec.get_feature_names()
Then you can use hashing_feat_names as your feature names, as TfidfTransformer doesn't change input vector size and just scales the same features.
From the make_pipeline documentation:
This is a shorthand for the Pipeline constructor; it does not require, and
does not permit, naming the estimators. Instead, their names will be set
to the lowercase of their types automatically.
so, in order to access the feature names, after you have fitted to data, you can:
# Perform an IDF normalization on the output of HashingVectorizer
from sklearn.feature_extraction.text import HashingVectorizer, TfidfVectorizer
from sklearn.pipeline import make_pipeline
hasher = HashingVectorizer(n_features=10,
stop_words='english', non_negative=True,
norm=None, binary=False)
tfidf = TfidfVectorizer()
vectorizer = make_pipeline(hasher, tfidf)
# ...
# fit to the data
# ...
# use the instance's class name to lower
terms = vectorizer.named_steps[tfidf.__class__.__name__.lower()].get_feature_names()
# or to be more precise, as used in `_name_estimators`:
# terms = vectorizer.named_steps[type(tfidf).__name__.lower()].get_feature_names()
# btw TfidfTransformer and HashingVectorizer do not have get_feature_names afaik
Hope this helps, good luck!
Edit: After seeing your updated question with the example you follow, #Vivek Kumar is correct, this code terms = vectorizer.get_feature_names() will not run for the pipeline but only when:
vectorizer = TfidfVectorizer(max_df=0.5, max_features=opts.n_features,
min_df=2, stop_words='english',
use_idf=opts.use_idf)

Resources