What is the difference between CountVectorizer+TfidfTransformer and TfidfVectorizer - machine-learning

To apply an ML algorithm to text, the text first has to be represented numerically. Some ways to do this with sklearn are:
CountVectorizer
CountVectorizer + TfidfTransformer
TfidfVectorizer
What is the difference between CountVectorizer+TfidfTransformer and TfidfVectorizer?

None, see the top of the documentation page:
sklearn.feature_extraction.text.TfidfVectorizer
...
Equivalent to CountVectorizer followed by TfidfTransformer.

With TfidfTransformer you systematically compute the word counts using CountVectorizer, then compute the Inverse Document Frequency (IDF) values, and only then compute the Tf-idf scores.
With TfidfVectorizer, on the contrary, you do all three steps at once: under the hood, it computes the word counts, IDF values, and Tf-idf scores using the same dataset.
So now you may be wondering why you would use more steps than necessary if you can get everything done in two. Well, there are cases where you want to use TfidfTransformer over TfidfVectorizer, and it is sometimes not that obvious. Here is a general guideline:
If you need the term frequency (term count) vectors for different tasks, use TfidfTransformer (see the sketch after this list).
If you need to compute tf-idf scores on documents within your “training” dataset, use TfidfVectorizer.
If you need to compute tf-idf scores on documents outside your “training” dataset, use either one; both will work.
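For instance, here is a small sketch of the first case on a toy corpus (not taken from the reference): the raw count matrix is kept around for other uses, while TfidfTransformer produces the tf-idf scores from the very same counts.
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

docs = ["the cat sat", "the dog sat", "the cat ran"]  # toy corpus

cv = CountVectorizer()
counts = cv.fit_transform(docs)        # raw term counts, reusable for other tasks
word_totals = counts.sum(axis=0)       # e.g. corpus-wide word frequencies

tfidf = TfidfTransformer().fit_transform(counts)  # tf-idf computed from the same counts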
Reference: https://kavita-ganesan.com/tfidftransformer-tfidfvectorizer-usage-differences/#.YHybLOhKhPY

The following code demonstrates what the documentation quoted in mbatchkarov's answer means by "followed by": you multiply the output of CountVectorizer by the IDF values and then normalize each row.
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer
)

corpus = ['apple banana orange onion corn',
          'banana banana orange pineapple coffee',
          'orange lemon lime orange',
          'lime vodka gin orange apple apple',
          'potato potato tomato pineapple',
          'coffee']

# Two-step route: raw term counts, then fit the IDF weights on those counts
tf = CountVectorizer()
idf = TfidfTransformer()
tf_ft = tf.fit_transform(corpus)
idf.fit(tf_ft)

# Vocabulary terms ordered by their column index
vocab = [ti[0] for ti in sorted(list(tf.vocabulary_.items()),
                                key=lambda x: x[1])]

tf = pd.DataFrame(tf_ft.toarray(), columns=vocab)
idf = pd.Series(idf.idf_, index=vocab)

# Multiply counts by IDF values, then L2-normalize each row
tfidf_manual = tf * idf
tfidf_manual /= np.sqrt(np.sum(np.square(tfidf_manual.values),
                               axis=1,
                               keepdims=True))

# One-step route: TfidfVectorizer does all of the above at once
tfidf_function = pd.DataFrame(TfidfVectorizer()
                              .fit_transform(corpus)
                              .toarray(),
                              columns=vocab)

assert np.allclose(tfidf_manual, tfidf_function)
tfidf_manual

Related

SpaCy-transformers regression output

I would like to have a regression output instead of the classification. For instance: instead of n classes I want a floating point output value from 0 to 1.
Here is the minimalistic example from the package github page:
import spacy
from spacy.util import minibatch
import random
import torch

is_using_gpu = spacy.prefer_gpu()
if is_using_gpu:
    torch.set_default_tensor_type("torch.cuda.FloatTensor")

nlp = spacy.load("en_trf_bertbaseuncased_lg")
print(nlp.pipe_names)  # ["sentencizer", "trf_wordpiecer", "trf_tok2vec"]
textcat = nlp.create_pipe("trf_textcat", config={"exclusive_classes": True})
for label in ("POSITIVE", "NEGATIVE"):
    textcat.add_label(label)
nlp.add_pipe(textcat)

optimizer = nlp.resume_training()
for i in range(10):
    random.shuffle(TRAIN_DATA)
    losses = {}
    for batch in minibatch(TRAIN_DATA, size=8):
        texts, cats = zip(*batch)
        nlp.update(texts, cats, sgd=optimizer, losses=losses)
    print(i, losses)

nlp.to_disk("/bert-textcat")
Is there an easy way to make trf_textcat work as a regressor? Or would it mean extending the library?
I have figured out a workaround: extract vector representations from the nlp pipeline as:
vector_repres = nlp('Test text').vector
After doing so for all the text entries, you end up with a fixed-dimensional representation of the texts. Assuming you have the continuous output values, feel free to use any estimator, including a neural network with a linear output.
Note that the vector representation is an average of the vector embeddings of all the words in the text, so it might be a sub-optimal solution for your case.
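For completeness, here is a minimal sketch of that workaround, fitting a scikit-learn regressor on top of the extracted document vectors. The texts, targets, and choice of Ridge are made up for illustration; any estimator with a continuous output would do.
import numpy as np
import spacy
from sklearn.linear_model import Ridge

# Hypothetical data: raw texts with continuous targets between 0 and 1
texts = ["first example text", "second example text", "third example text"]
targets = np.array([0.1, 0.7, 0.4])

nlp = spacy.load("en_trf_bertbaseuncased_lg")

# Fixed-dimensional document vectors (average of the word embeddings)
X = np.vstack([nlp(text).vector for text in texts])

# Any regressor can sit on top of these features; Ridge is just one choice
reg = Ridge().fit(X, targets)
predictions = reg.predict(X)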

How do I indicate that certain features (words) and certain documents are more important than others in Naive Bayes?

I am developing a binary document classifier using sklearn. I want to indicate that certain features (words) are more (or less) important and certain documents are more (or less) important for learning.
If you are using CountVectorizer you can get the number of occurrences (frequency) of each word in the text corpus:
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer().fit(corpus)
bag_of_words = vec.transform(corpus)
sum_words = bag_of_words.sum(axis=0)
words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
You can also get the importance of each feature (word) based on a univariate statistical test, using SelectKBest:
from sklearn.feature_selection import SelectKBest, chi2
...
skb = SelectKBest(chi2, k="all").fit(X, y)
feature_importance = skb.scores_
You can also get feature importances by training a RandomForestClassifier:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier()
model = clf.fit(X, y)
# Calculate feature importances
importances = model.feature_importances_
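To tie either set of scores back to actual words, you can pair them with the vectorizer's vocabulary. A minimal sketch, assuming X was produced by the CountVectorizer vec from above:
# Works the same for skb.scores_ or model.feature_importances_
feature_names = vec.get_feature_names_out()  # vec.get_feature_names() on older scikit-learn
word_importance = sorted(zip(feature_names, importances),
                         key=lambda x: x[1], reverse=True)
print(word_importance[:10])  # ten most important words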

scikit-learn: How to include other features after performing fit and transform of TfidfVectorizer?

Just a brief idea of my situation:
I have 4 columns of input: id, text, category, label.
I used TfidfVectorizer on the text, which gives me a matrix of TF-IDF scores over the word tokens for each instance.
Now I'd like to include the category (no need to pass it through TF-IDF) as another feature alongside the data output by the vectorizer.
Also note that prior to the vectorization, the data has already been passed through train_test_split.
How could I achieve this?
Initial code:
#initialization
import pandas as pd
path = r'data\data.csv'
rappler= pd.read_csv(path)
X = rappler.text
y = rappler.label
#rappler.category - contains category for each instance
#split train test data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
#feature extraction
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()
X_train_dtm = vect.fit_transform(X_train)
#after or even prior to perform fit_transform, how can I properly add category as a feature?
X_test_dtm = vect.transform(X_test)
#actual classification
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
nb.fit(X_train_dtm, y_train)
y_pred_class = nb.predict(X_test_dtm)
#display result
from sklearn import metrics
print(metrics.accuracy_score(y_test,y_pred_class))
I would suggest doing your train-test split after feature extraction.
Once you have the TF-IDF feature lists, just add the other feature for each sample.
You will have to encode the category feature; a good choice would be sklearn's LabelEncoder. Then you should have two sets of numpy arrays that can be joined.
Here is a toy example:
import numpy as np

X_tfidf = np.array([[0.1, 0.4, 0.2], [0.5, 0.4, 0.6]])
X_category = np.array([[1], [2]])
X = np.concatenate((X_tfidf, X_category), axis=1)
At this point you would continue as you were, starting with the train test split.
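One practical caveat (not from the original answer): the vectorizer returns a sparse matrix, so converting it to a dense array can be costly for large corpora. If you want to keep it sparse, scipy.sparse.hstack can join the blocks instead of np.concatenate. A rough sketch, where category_encoded is a hypothetical array of LabelEncoder outputs for the same rows as X_train_dtm:
import numpy as np
from scipy.sparse import csr_matrix, hstack

# Reshape the encoded categories into a single sparse column and append it
category_column = csr_matrix(np.asarray(category_encoded).reshape(-1, 1))
X_train_combined = hstack([X_train_dtm, category_column])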
You should use FeatureUnion, as explained in the documentation:
FeatureUnion combines several transformer objects into a new transformer that combines their output. A FeatureUnion takes a list of transformer objects. During fitting, each of these is fit to the data independently. For transforming data, the transformers are applied in parallel, and the sample vectors they output are concatenated end-to-end into larger vectors.
Another good example on how to use FeatureUnions can be found here: http://scikit-learn.org/stable/auto_examples/hetero_feature_union.html
Just concatenating different matrices like @AlexG suggests is probably the easier option, but FeatureUnion is the scikit-learn way to do these things.
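A rough sketch of what such a FeatureUnion could look like for this data, assuming the input is a DataFrame with text and category columns. The ColumnSelector helper follows the pattern in the linked example and is not part of scikit-learn itself:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import OneHotEncoder


class ColumnSelector(BaseEstimator, TransformerMixin):
    """Select one DataFrame column (helper, not part of scikit-learn)."""
    def __init__(self, key, as_frame=False):
        self.key = key
        self.as_frame = as_frame

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # TfidfVectorizer wants a 1-D iterable of strings,
        # OneHotEncoder wants a 2-D array-like
        return X[[self.key]] if self.as_frame else X[self.key]


features = FeatureUnion([
    ('text', Pipeline([
        ('select', ColumnSelector('text')),
        ('tfidf', TfidfVectorizer()),
    ])),
    ('category', Pipeline([
        ('select', ColumnSelector('category', as_frame=True)),
        ('onehot', OneHotEncoder(handle_unknown='ignore')),
    ])),
])

model = Pipeline([('features', features), ('clf', MultinomialNB())])
# model.fit(train_df, y_train) and model.predict(test_df), where train_df and
# test_df are DataFrames holding the text and category columns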

Why is `sklearn.manifold.MDS` random when `skbio's pcoa` is not?

I'm trying to figure out how to implement Principal Coordinate Analysis with various distance metrics. I stumbled across implementations in both skbio and sklearn. I don't understand why sklearn's implementation gives different results every time while skbio's is the same. Is there a degree of randomness to Multidimensional Scaling, and in particular Principal Coordinate Analysis? I see that all of the clusters are very similar, but why are they different? Am I implementing this correctly?
Running Principal Coordinate Analysis using Scikit-bio (i.e. Skbio) always gives the same results:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn import decomposition
import seaborn as sns; sns.set_style("whitegrid", {'axes.grid': False})
import skbio
from scipy.spatial import distance
%matplotlib inline
np.random.seed(0)

# Iris dataset
DF_data = pd.DataFrame(load_iris().data,
                       index=["iris_%d" % i for i in range(load_iris().data.shape[0])],
                       columns=load_iris().feature_names)
n, m = DF_data.shape
# print(n, m)
# 150 4
Se_targets = pd.Series(load_iris().target,
                       index=["iris_%d" % i for i in range(load_iris().data.shape[0])],
                       name="Species")

# Scaling mean = 0, var = 1
DF_standard = pd.DataFrame(StandardScaler().fit_transform(DF_data),
                           index=DF_data.index,
                           columns=DF_data.columns)

# Distance Matrix
Ar_dist = distance.squareform(distance.pdist(DF_data, metric="braycurtis"))  # (n x n) distance measure
DM_dist = skbio.stats.distance.DistanceMatrix(Ar_dist, ids=DF_standard.index)
PCoA = skbio.stats.ordination.pcoa(DM_dist)
Now with sklearn's Multidimensional Scaling:
from sklearn.manifold import MDS
fig, ax = plt.subplots(ncols=5, figsize=(12, 3))
for rs in range(5):
    M = MDS(n_components=2, metric=True, random_state=rs, dissimilarity='precomputed')
    A = M.fit(Ar_dist).embedding_
    ax[rs].scatter(A[:, 0], A[:, 1], c=[{0: "b", 1: "g", 2: "r"}[t] for t in Se_targets])
scikit-bio's PCoA (skbio.stats.ordination.pcoa) and scikit-learn's MDS (sklearn.manifold.MDS) use entirely different algorithms to transform the data. scikit-bio directly solves a symmetric eigenvalue problem and scikit-learn uses an iterative minimization procedure [1].
scikit-bio's PCoA is deterministic, though it is possible to receive different (arbitrary) rotations of the transformed coordinates depending on the system it is executed on [2]. scikit-learn's MDS is stochastic by default unless a fixed random_state is used. random_state appears to be used to initialize the iterative minimization procedure (the scikit-learn docs say that random_state is used to "initialize the centers" [3] though I don't know exactly what that means). Each random_state may produce slightly different embeddings with arbitrary rotation [4].
References: [1], [2], [3], [4]
MDS is a probabilistic algorithm; there is a parameter random_state that you can use to fix the random seed so you get the same results each time. PCA, on the other hand, is a deterministic algorithm; if you use sklearn.decomposition.PCA, you should get the same results each time.
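As a quick sketch of that point, reusing the precomputed distance matrix Ar_dist from the question: fixing random_state makes repeated MDS fits reproducible, while a different seed gives a different (arbitrarily rotated) embedding.
import numpy as np
from sklearn.manifold import MDS

# Same seed twice -> identical embeddings
A1 = MDS(n_components=2, dissimilarity='precomputed', random_state=0).fit_transform(Ar_dist)
A2 = MDS(n_components=2, dissimilarity='precomputed', random_state=0).fit_transform(Ar_dist)
assert np.allclose(A1, A2)

# A different seed typically yields a different (arbitrarily rotated) embedding
A3 = MDS(n_components=2, dissimilarity='precomputed', random_state=1).fit_transform(Ar_dist)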

Bagging using random forest classifier in sklearn

I built a random forest and I want to find the out-of-bag score. But my out-of-bag score comes out to be 1.0, when it should be less than 1. My sample size consists of 20000 elements. Here is the Python code. Please tell me what changes need to be made. Here X is a numpy array of the data and Z contains the true labels.
import csv
import numpy as np
from sklearn import preprocessing
from sklearn import cross_validation
from sklearn.ensemble import RandomForestClassifier

with open('C:\Users\Harsh Bhandari\Desktop\letter.csv') as f:
    reader = csv.reader(f, delimiter='\t')
    data = [(col1, int(col2), int(col3), int(col4), int(col5), int(col6), int(col7), int(col8), int(col9), int(col10), int(col11), int(col12), int(col13), int(col14), int(col15), int(col16), int(col17))
            for col1, col2, col3, col4, col5, col6, col7, col8, col9, col10, col11, col12, col13, col14, col15, col16, col17 in reader]

X = []
Y = []
i = 0
while i < 20000:
    t = data[i][1:]
    X.append(t)
    t = data[i][0]
    Y.append(t)
    i = 1 + i

X = np.asarray(X)
Y = np.asarray(Y)
le = preprocessing.LabelEncoder()
Z = le.fit_transform(Y)
clf = RandomForestClassifier(n_estimators=100, oob_score=True)
clf = clf.fit(X, Z)
a = clf.predict(X)
scores = clf.score(X, a)
print scores
http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
To score you pass the test data and its actual labels; here you are passing the predicted labels themselves, which of course match the prediction, hence you are getting a score of 1.0.
I see a couple of things here.
You are doing clf.score(X, a),
but you should be doing clf.score(X, Z),
where Z contains the true labels for X.
The score method is defined as clf.score(X, true_labels_for_X);
you instead passed the values that you predicted as y_true, which doesn't make sense, since sklearn already runs predict on X internally, so you don't need to pass a.
Also, you can find the out-of-bag score by doing
print clf.oob_score_
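Putting both points together, a minimal sketch of the corrected ending of the script above (same variable names and Python 2 print style as the question):
clf = RandomForestClassifier(n_estimators=100, oob_score=True)
clf = clf.fit(X, Z)

# Accuracy against the true labels, not against the model's own predictions
print clf.score(X, Z)

# Out-of-bag estimate computed during fitting
print clf.oob_score_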

Resources