Related
I did embeddings with fasttext and I have clusters thanks to KMeans.
I would like to calculate similarities inside each cluster to check if the sentences inside are well clustered. I want to keep sentences with good similarities in each clusters. If the similarity is not good, I want to exit sentence that not belong to a cluster, and next group similar sentences not belonging to clusters.
How can I do it in a good manner ? I thought using cosine similarity but don't know how to compare all sentences inside a cluster
Maybe something like this...
# clustering words into similar groups:
import numpy as np
from sklearn.cluster import AffinityPropagation
import distance
words = 'XYZ,LDPELDKSL,DFKLKSLFD,ABC,DLFKFKDLD,XYZ,LDPELDKSL,DFKLKSLFD,ABC,DLFKFKDLD,XYZ,LDPELDKSL,XYZ,LDPELDKSL,DFKLKSLFD,ABC,DLFKFKDLD,XYZ,LDPELDKSL,DFKLKSLFD,ABC,DLFKFKDLD,XYZ,LDPELDKSL'.split(',') #Replace this line
words = np.asarray(words) #So that indexing with a list will work
lev_similarity = -1*np.array([[distance.levenshtein(w1,w2) for w1 in words] for w2 in words])
affprop = AffinityPropagation(affinity="precomputed", damping=0.5)
affprop.fit(lev_similarity)
for cluster_id in np.unique(affprop.labels_):
exemplar = words[affprop.cluster_centers_indices_[cluster_id]]
cluster = np.unique(words[np.nonzero(affprop.labels_==cluster_id)])
cluster_str = ", ".join(cluster)
print(" - *%s:* %s" % (exemplar, cluster_str))
Result:
- *LDPELDKSL:* LDPELDKSL
- *DFKLKSLFD:* DFKLKSLFD
- *XYZ:* ABC, XYZ
- *DLFKFKDLD:* DLFKFKDLD
See these links for additional guidance on how to cluster text.
https://towardsdatascience.com/applying-machine-learning-to-classify-an-unsupervised-text-document-e7bb6265f52
https://scikit-learn.org/stable/auto_examples/text/plot_document_clustering.html
https://pythonprogramminglanguage.com/kmeans-text-clustering/
http://brandonrose.org/clustering
Here are a couple examples using Cosine Similarity.
d1 = "plot: two teen couples go to a church party, drink and then drive."
d2 = "films adapted from comic books have had plenty of success , whether they're about superheroes ( batman , superman , spawn ) , or geared toward kids ( casper ) or the arthouse crowd ( ghost world ) , but there's never really been a comic book like from hell before . "
d3 = "every now and then a movie comes along from a suspect studio , with every indication that it will be a stinker , and to everybody's surprise ( perhaps even the studio ) the film becomes a critical darling . "
d4 = "damn that y2k bug . "
documents = [d1, d2, d3, d4]
import nltk, string, numpy
nltk.download('punkt') # first-time use only
stemmer = nltk.stem.porter.PorterStemmer()
def StemTokens(tokens):
return [stemmer.stem(token) for token in tokens]
remove_punct_dict = dict((ord(punct), None) for punct in string.punctuation)
def StemNormalize(text):
return StemTokens(nltk.word_tokenize(text.lower().translate(remove_punct_dict)))
nltk.download('wordnet') # first-time use only
lemmer = nltk.stem.WordNetLemmatizer()
def LemTokens(tokens):
return [lemmer.lemmatize(token) for token in tokens]
remove_punct_dict = dict((ord(punct), None) for punct in string.punctuation)
def LemNormalize(text):
return LemTokens(nltk.word_tokenize(text.lower().translate(remove_punct_dict)))
from sklearn.feature_extraction.text import CountVectorizer
LemVectorizer = CountVectorizer(tokenizer=LemNormalize, stop_words='english')
LemVectorizer.fit_transform(documents)
print(LemVectorizer.vocabulary_)
tf_matrix = LemVectorizer.transform(documents).toarray()
print(tf_matrix)
tf_matrix.shape
from sklearn.feature_extraction.text import TfidfTransformer
tfidfTran = TfidfTransformer(norm="l2")
tfidfTran.fit(tf_matrix)
print(tfidfTran.idf_)
import math
def idf(n,df):
result = math.log((n+1.0)/(df+1.0)) + 1
return result
print("The idf for terms that appear in one document: " + str(idf(4,1)))
print("The idf for terms that appear in two documents: " + str(idf(4,2)))
tfidf_matrix = tfidfTran.transform(tf_matrix)
print(tfidf_matrix.toarray())
cos_similarity_matrix = (tfidf_matrix * tfidf_matrix.T).toarray()
print(cos_similarity_matrix)
from sklearn.feature_extraction.text import TfidfVectorizer
TfidfVec = TfidfVectorizer(tokenizer=LemNormalize, stop_words='english')
def cos_similarity(textlist):
tfidf = TfidfVec.fit_transform(textlist)
return (tfidf * tfidf.T).toarray()
cos_similarity(documents)
https://sites.temple.edu/tudsc/2017/03/30/measuring-similarity-between-texts-in-python/
# Define the documents
doc_trump = "Mr. Trump became president after winning the political election. Though he lost the support of some republican friends, Trump is friends with President Putin"
doc_election = "President Trump says Putin had no political interference is the election outcome. He says it was a witchhunt by political parties. He claimed President Putin is a friend who had nothing to do with the election"
doc_putin = "Post elections, Vladimir Putin became President of Russia. President Putin had served as the Prime Minister earlier in his political career"
documents = [doc_trump, doc_election, doc_putin]
# Scikit Learn
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
# Create the Document Term Matrix
count_vectorizer = CountVectorizer(stop_words='english')
count_vectorizer = CountVectorizer()
sparse_matrix = count_vectorizer.fit_transform(documents)
# OPTIONAL: Convert Sparse Matrix to Pandas Dataframe if you want to see the word frequencies.
doc_term_matrix = sparse_matrix.todense()
df = pd.DataFrame(doc_term_matrix,
columns=count_vectorizer.get_feature_names(),
index=['doc_trump', 'doc_election', 'doc_putin'])
df
# Compute Cosine Similarity
from sklearn.metrics.pairwise import cosine_similarity
print(cosine_similarity(df, df))
https://www.machinelearningplus.com/nlp/cosine-similarity/
I am performing a classification task which is essentially doing algorithm configuration, i.e. trying to pick a configuration (or 'mode') which is likely to make the problem-solving algorithm finish in the quickest time.
I am learning to classify the "best" configuration based on features of problem instances. I see that scikit-learn enables you to create your own scoring function to use in tuning the models. However the score_func only takes the true label and the predicted label as input.
Is it possible to identify which row in the dataset a prediction came from (when passing to this custom scorer)? That way I could figure out the performance hit of a predicted ("wrong") config and score the model accordingly. Basically sometimes a "wrong" selection can still be very good and close to the best, but a naive classification has no way of knowing this when the classification labels are purely based on the best config.
Here's a contrived example to illustrate what I'm trying to do
import random as rnd
import pandas as pd
rnd.seed('hello')
probs = [f'instance_{i}' for i in range(6)]
confs = ('analytic', 'bruteforce', 'hybrid')
times = [(p,c,60*rnd.random()) for p in probs for c in confs]
df_alltimes = pd.DataFrame(times, columns=('problem', 'config', 'time'))
print(df_alltimes)
bestrows = df_alltimes.groupby(['problem'])['time'].idxmin()
dataset = df_alltimes.loc[bestrows,['config']].\
rename(columns={'config':'best_config'})
feats = [[rnd.random() for p in range(len(probs))] for f in range(5) ]
for i in range(len(feats)):
dataset[f'feature_{i}'] = feats[i]
print(dataset)
df_alltimes:
problem config time
0 instance_0 analytic 15.307044
1 instance_0 bruteforce 36.742846
2 instance_0 hybrid 35.053416
3 instance_1 analytic 57.781358
4 instance_1 bruteforce 31.723275
5 instance_1 hybrid 8.080238
6 instance_2 analytic 4.211297
7 instance_2 bruteforce 24.034830
8 instance_2 hybrid 39.073023
9 instance_3 analytic 36.325485
10 instance_3 bruteforce 14.717841
11 instance_3 hybrid 57.103908
12 instance_4 analytic 7.358539
13 instance_4 bruteforce 10.805536
14 instance_4 hybrid 2.605044
15 instance_5 analytic 0.489870
16 instance_5 bruteforce 42.888858
17 instance_5 hybrid 58.634073
dataset:
best_config feature_0 feature_1 feature_2 feature_3 feature_4
0 analytic 0.645388 0.641626 0.975619 0.680713 0.209235
5 hybrid 0.993443 0.221038 0.893763 0.408532 0.254791
6 analytic 0.263872 0.142887 0.264538 0.166985 0.800054
10 bruteforce 0.155023 0.601300 0.258767 0.614732 0.850529
14 hybrid 0.766183 0.993692 0.597047 0.401482 0.275133
15 analytic 0.386327 0.065699 0.349115 0.370136 0.357329
I am using sklearn with the dataset where the X would be the feature columns and the y would be the best_config column. In this example, the "bad" choices for instance_0 are both almost equally bad, but for instance_1, the two wrong choices are not equally bad. So I'd like my custom scorer to be able to reflect this somehow. Is that possible?
In the end I did find a way to get the information I was after in the original question. If you're passing a pandas.Series as your target labels, the index attribute is available, so you can look up whatever you want in the full dataset.
In the solution below, the first part is pretty much the same as the original minimal working example - i.e. generating a fake dataset.
In the second part, a custom scorer function is defined, which is then passed to the cross-validating hyperparameter tuner, RandomizedSearchCV. Please bear in mind the data is garbage, so the "results" are meaningless; this is just a demo of how to refer back to a fuller set of results so that you can evaluate the quality of predictions made during hyperparameter tuning based on more specialised information rather than just "match / fail" when doing a classification.
import numpy as np
import pandas as pd
import random as rnd
INSTANCES = 200
FEATURES = 5
HP_ITER = 10
SEED = 1984
# invent timings for some problems run with different configurations
rnd.seed(SEED)
probs = [f'p_{i:03d}' for i in range(INSTANCES)]
confs = ('analytic', 'bruteforce', 'hybrid')
times = [(p,c,60*rnd.random()) for p in probs for c in confs]
df_times = pd.DataFrame(times, columns=('problem', 'config', 'time'))
# pick out the fastest config for each problem
bestrows = df_times.groupby(['problem'])['time'].idxmin()
dataset = df_times.loc[bestrows,['config','problem']]\
.rename(columns={'config':'target'})\
.reset_index(drop=True)
# invent some features for each problem
feats = [[rnd.random() for _ in probs] for f in range(FEATURES) ]
for i in range(len(feats)):
dataset[f'feature_{i}'] = feats[i]
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import train_test_split
# split our data into training and test sets
df_trn = dataset.sample(frac=0.8, replace=False, random_state=SEED)
df_tst = dataset.loc[~dataset.index.isin(df_trn.index)]
def _vb_loss(xvals, yvals, validation=False):
"""A custom scorer for cross-validation which uses distance to Virtual Best"""
# use the .index attribute to access the relevant rows in the
# timing data frame
source = df_tst if validation else df_trn
data = source.loc[xvals.index].reindex(columns=['problem','target'])
data['truevals'] = xvals
data['predvals'] = yvals
# what's the best time available for each problem?
data = data.merge(
df_times, left_on=['problem','truevals'], right_on=['problem', 'config']
).rename(columns={'time' : 'best_time'}).drop(columns=['config'])
# what's the time for our predicted choices?
data = data.merge(
df_times, left_on=['problem','predvals'], right_on=['problem','config']
).rename(columns={'time' : 'pred_time'}).drop(columns=['config'])
# how far away were the predictions in total?
residual_seconds = np.sum( data['pred_time'] - data['best_time'] )
return residual_seconds
def fitAndPredict(use_custom_scorer=False):
"""Fit a model and make some predictions """
our_scorer = make_scorer(_vb_loss, greater_is_better=False)
hyperparameters = {'criterion' : ['gini', 'entropy'],
'n_estimators' : list(range(50,250)),
'max_depth' : list(range(2,32))
}
model = RandomizedSearchCV(
RandomForestClassifier(random_state=SEED),
hyperparameters,
n_iter = HP_ITER,
scoring = our_scorer if use_custom_scorer else None,
verbose = 1,
random_state = SEED,
)
model.fit(
df_trn.drop(columns=['target','problem']),
df_trn['target']
)
preds = model.predict(df_tst.drop(columns=['target','problem']))
return _vb_loss(df_tst['target'], preds, validation=True)
print("Timings for all configs:", df_times, "", sep="\n")
print("Labelled dataset:", dataset, "", sep="\n")
print("Test loss with default CV scorer :", fitAndPredict(False))
print("Test loss with custom CV scorer :", fitAndPredict(True))
Here's the output:
** Timings for all configs **
problem config time
0 p_000 analytic 21.811701
1 p_000 bruteforce 29.652341
2 p_000 hybrid 20.376605
3 p_001 analytic 12.989269
4 p_001 bruteforce 51.759137
.. ... ... ...
595 p_198 bruteforce 10.874092
596 p_198 hybrid 14.723661
597 p_199 analytic 24.984775
598 p_199 bruteforce 4.899111
599 p_199 hybrid 36.188729
[600 rows x 3 columns]
** Labelled dataset **
target problem feature_0 feature_1 feature_2 feature_3 feature_4
0 hybrid p_000 0.864952 0.487293 0.946654 0.863503 0.310866
1 analytic p_001 0.514093 0.007643 0.948784 0.582419 0.258159
2 bruteforce p_002 0.319059 0.872320 0.321495 0.807644 0.158471
3 analytic p_003 0.421063 0.955742 0.114808 0.980013 0.900057
4 hybrid p_004 0.325935 0.125824 0.697967 0.037196 0.923626
.. ... ... ... ... ... ... ...
195 hybrid p_195 0.179126 0.578338 0.391535 0.632501 0.442677
196 bruteforce p_196 0.827637 0.641567 0.710201 0.833341 0.215357
197 hybrid p_197 0.116661 0.480170 0.253893 0.623913 0.465419
198 bruteforce p_198 0.670555 0.037084 0.954332 0.408546 0.935973
199 bruteforce p_199 0.371541 0.463060 0.549176 0.581093 0.391114
[200 rows x 7 columns]
Fitting 5 folds for each of 10 candidates, totalling 50 fits
[Parallel(n_jobs=None)]: Done 50 out of 50 | elapsed: 8.8s finished
Test loss with default CV scorer : 542.5191014477357
Fitting 5 folds for each of 10 candidates, totalling 50 fits
[Parallel(n_jobs=None)]: Done 50 out of 50 | elapsed: 9.1s finished
Test loss with custom CV scorer : 522.3236277796698
I am trying to classify a dataset of reviews in to two classes say class A and class B. I am using LightGBM to classify.
I have changed the parameters for the classifier many times but I can't get a huge difference in the results.
I think the problem is with the pre-processing step. I defined a function as shown below to take care of pre-processing. I used Stemming and removed stopwords. I don't know what I am missing. I have tried LancasterStemmer and PorterStemmer
stops = set(stopwords.words("english"))
def cleanData(text, lowercase = False, remove_stops = False, stemming = False, lemm = False):
txt = str(text)
txt = re.sub(r'[^A-Za-z0-9\s]',r'',txt)
txt = re.sub(r'\n',r' ',txt)
if lowercase:
txt = " ".join([w.lower() for w in txt.split()])
if remove_stops:
txt = " ".join([w for w in txt.split() if w not in stops])
if stemming:
st = PorterStemmer()
txt = " ".join([st.stem(w) for w in txt.split()])
if lemm:
wordnet_lemmatizer = WordNetLemmatizer()
txt = " ".join([wordnet_lemmatizer.lemmatize(w) for w in txt.split()])
return txt
Are there any more pre-processing steps to be done to get a better accuracy.?
URL for the dataset : Dataset
EDIT :
Parameters that I used are as mentioned below.
params = {'task': 'train',
'boosting_type': 'gbdt',
'objective': 'binary',
'metric': 'binary_logloss',
'learning_rate': 0.01,
'max_depth': 22,
'num_leaves': 78,
'feature_fraction': 0.1,
'bagging_fraction': 0.4,
'bagging_freq': 1}
I have altered the depth and num_leaves parameters along with others. But the accuracy is kind of stuck at a certain level..
There are a few things to consider. First of all your training set is not balanced - the class distribution is ~ 70%/30%. You need to consider this fact in training. What types of features are you using? Using the right set of features could improve your performance.
I am basically clustering some of my documents using mini_batch_kmeans and kmeans algorithm. I simply followed the tutorial is the scikit-learn website the link for that is given below:
http://scikit-learn.org/stable/auto_examples/text/document_clustering.html
They are using some of the method for the vectorizing one of which is HashingVectorizer. In the hashingVectorizer they are making a pipeline with TfidfTransformer() method.
# Perform an IDF normalization on the output of HashingVectorizer
hasher = HashingVectorizer(n_features=opts.n_features,
stop_words='english', non_negative=True,
norm=None, binary=False)
vectorizer = make_pipeline(hasher, TfidfTransformer())
Once doing so, the vectorizer what I get from that does not have the method get_feature_names(). But since I am using it for clustering, I need to get the "terms" using this "get_feature_names()"
terms = vectorizer.get_feature_names()
for i in range(true_k):
print("Cluster %d:" % i, end='')
for ind in order_centroids[i, :10]:
print(' %s' % terms[ind], end='')
print()
How do I solve this error?
My whole code is show below:
X_train_vecs, vectorizer = vector_bow.count_tfidf_vectorizer(_contents)
mini_kmeans_batch = MiniBatchKmeansTechnique()
# MiniBatchKmeans without the LSA dimensionality reduction
mini_kmeans_batch.mini_kmeans_technique(number_cluster=8, X_train_vecs=X_train_vecs,
vectorizer=vectorizer, filenames=_filenames, contents=_contents, is_dimension_reduced=False)
The count vectorizor piped with tfidf.
def count_tfidf_vectorizer(self,contents):
count_vect = CountVectorizer()
vectorizer = make_pipeline(count_vect,TfidfTransformer())
X_train_vecs = vectorizer.fit_transform(contents)
print("The count of bow : ", X_train_vecs.shape)
return X_train_vecs, vectorizer
and the mini_batch_kmeans class is as below:
class MiniBatchKmeansTechnique():
def mini_kmeans_technique(self, number_cluster, X_train_vecs, vectorizer,
filenames, contents, svd=None, is_dimension_reduced=True):
km = MiniBatchKMeans(n_clusters=number_cluster, init='k-means++', max_iter=100, n_init=10,
init_size=1000, batch_size=1000, verbose=True, random_state=42)
print("Clustering sparse data with %s" % km)
t0 = time()
km.fit(X_train_vecs)
print("done in %0.3fs" % (time() - t0))
print()
cluster_labels = km.labels_.tolist()
print("List of the cluster names is : ",cluster_labels)
data = {'filename':filenames, 'contents':contents, 'cluster_label':cluster_labels}
frame = pd.DataFrame(data=data, index=[cluster_labels], columns=['filename', 'contents', 'cluster_label'])
print(frame['cluster_label'].value_counts(sort=True,ascending=False))
print()
grouped = frame['cluster_label'].groupby(frame['cluster_label'])
print(grouped.mean())
print()
print("Top Terms Per Cluster :")
if is_dimension_reduced:
if svd != None:
original_space_centroids = svd.inverse_transform(km.cluster_centers_)
order_centroids = original_space_centroids.argsort()[:, ::-1]
else:
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names()
for i in range(number_cluster):
print("Cluster %d:" % i, end=' ')
for ind in order_centroids[i, :10]:
print(' %s' % terms[ind], end=',')
print()
print("Cluster %d filenames:" % i, end='')
for file in frame.ix[i]['filename'].values.tolist():
print(' %s,' % file, end='')
print()
Pipeline doesn't have get_feature_names() method, as it is not straightforward to implement this method for Pipeline - one needs to consider all pipeline steps to get feature names. See https://github.com/scikit-learn/scikit-learn/issues/6424, https://github.com/scikit-learn/scikit-learn/issues/6425, etc. - there is a lot of related tickets and several attempts to fix it.
If your pipeline is simple (TfidfVectorizer followed by MiniBatchKMeans) then you can get feature names from TfidfVectorizer.
If you want to use HashingVectorizer, it is more complicated, as HashingVectorizer doesn't provide feature names by design. HashingVectorizer doesn't store vocabulary, and uses hashes instead - it means it can be applied in online setting, and that it dosn't require any RAM - but the tradeoff is exactly that you don't get feature names.
It is still possible to get feature names from HashingVectorizer though; to do this you need to apply it for a sample of documents, store which hashes correspond to which words, and this way learn what these hashes mean, i.e. what are the feature names. There may be collisions, so it is not possible to be 100% sure the feature name is correct, but usually this approach works ok. This approach is implemented in eli5 library; see http://eli5.readthedocs.io/en/latest/tutorials/sklearn-text.html#debugging-hashingvectorizer for an example. You will have to do something like this, using InvertableHashingVectorizer:
from eli5.sklearn import InvertableHashingVectorizer
ivec = InvertableHashingVectorizer(vec) # vec is a HashingVectorizer instance
# X_sample is a sample from contents; you can use the
# whole contents array, or just e.g. every 10th element
ivec.fit(content_sample)
hashing_feat_names = ivec.get_feature_names()
Then you can use hashing_feat_names as your feature names, as TfidfTransformer doesn't change input vector size and just scales the same features.
From the make_pipeline documentation:
This is a shorthand for the Pipeline constructor; it does not require, and
does not permit, naming the estimators. Instead, their names will be set
to the lowercase of their types automatically.
so, in order to access the feature names, after you have fitted to data, you can:
# Perform an IDF normalization on the output of HashingVectorizer
from sklearn.feature_extraction.text import HashingVectorizer, TfidfVectorizer
from sklearn.pipeline import make_pipeline
hasher = HashingVectorizer(n_features=10,
stop_words='english', non_negative=True,
norm=None, binary=False)
tfidf = TfidfVectorizer()
vectorizer = make_pipeline(hasher, tfidf)
# ...
# fit to the data
# ...
# use the instance's class name to lower
terms = vectorizer.named_steps[tfidf.__class__.__name__.lower()].get_feature_names()
# or to be more precise, as used in `_name_estimators`:
# terms = vectorizer.named_steps[type(tfidf).__name__.lower()].get_feature_names()
# btw TfidfTransformer and HashingVectorizer do not have get_feature_names afaik
Hope this helps, good luck!
Edit: After seeing your updated question with the example you follow, #Vivek Kumar is correct, this code terms = vectorizer.get_feature_names() will not run for the pipeline but only when:
vectorizer = TfidfVectorizer(max_df=0.5, max_features=opts.n_features,
min_df=2, stop_words='english',
use_idf=opts.use_idf)
I have a bunch of already human-classified documents in some groups.
Is there a modified version of lda which I can use to train a model and then later classify unknown documents with it?
For what it's worth, LDA as a classifier is going to be fairly weak because it's a generative model, and classification is a discriminative problem. There is a variant of LDA called supervised LDA which uses a more discriminative criterion to form the topics (you can get source for this in various places), and there's also a paper with a max margin formulation that I don't know the status of source-code-wise. I would avoid the Labelled LDA formulation unless you're sure that's what you want, because it makes a strong assumption about the correspondence between topics and categories in the classification problem.
However, it's worth pointing out that none of these methods use the topic model directly to do the classification. Instead, they take documents, and instead of using word-based features use the posterior over the topics (the vector that results from inference for the document) as its feature representation before feeding it to a classifier, usually a Linear SVM. This gets you a topic model based dimensionality reduction, followed by a strong discriminative classifier, which is probably what you're after. This pipeline is available
in most languages using popular toolkits.
You can implement supervised LDA with PyMC that uses Metropolis sampler to learn the latent variables in the following graphical model:
The training corpus consists of 10 movie reviews (5 positive and 5 negative) along with the associated star rating for each document. The star rating is known as a response variable which is a quantity of interest associated with each document. The documents and response variables are modeled jointly in order to find latent topics that will best predict the response variables for future unlabeled documents. For more information, check out the original paper.
Consider the following code:
import pymc as pm
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
train_corpus = ["exploitative and largely devoid of the depth or sophistication ",
"simplistic silly and tedious",
"it's so laddish and juvenile only teenage boys could possibly find it funny",
"it shows that some studios firmly believe that people have lost the ability to think",
"our culture is headed down the toilet with the ferocity of a frozen burrito",
"offers that rare combination of entertainment and education",
"the film provides some great insight",
"this is a film well worth seeing",
"a masterpiece four years in the making",
"offers a breath of the fresh air of true sophistication"]
test_corpus = ["this is a really positive review, great film"]
train_response = np.array([3, 1, 3, 2, 1, 5, 4, 4, 5, 5]) - 3
#LDA parameters
num_features = 1000 #vocabulary size
num_topics = 4 #fixed for LDA
tfidf = TfidfVectorizer(max_features = num_features, max_df=0.95, min_df=0, stop_words = 'english')
#generate tf-idf term-document matrix
A_tfidf_sp = tfidf.fit_transform(train_corpus) #size D x V
print "number of docs: %d" %A_tfidf_sp.shape[0]
print "dictionary size: %d" %A_tfidf_sp.shape[1]
#tf-idf dictionary
tfidf_dict = tfidf.get_feature_names()
K = num_topics # number of topics
V = A_tfidf_sp.shape[1] # number of words
D = A_tfidf_sp.shape[0] # number of documents
data = A_tfidf_sp.toarray()
#Supervised LDA Graphical Model
Wd = [len(doc) for doc in data]
alpha = np.ones(K)
beta = np.ones(V)
theta = pm.Container([pm.CompletedDirichlet("theta_%s" % i, pm.Dirichlet("ptheta_%s" % i, theta=alpha)) for i in range(D)])
phi = pm.Container([pm.CompletedDirichlet("phi_%s" % k, pm.Dirichlet("pphi_%s" % k, theta=beta)) for k in range(K)])
z = pm.Container([pm.Categorical('z_%s' % d, p = theta[d], size=Wd[d], value=np.random.randint(K, size=Wd[d])) for d in range(D)])
#pm.deterministic
def zbar(z=z):
zbar_list = []
for i in range(len(z)):
hist, bin_edges = np.histogram(z[i], bins=K)
zbar_list.append(hist / float(np.sum(hist)))
return pm.Container(zbar_list)
eta = pm.Container([pm.Normal("eta_%s" % k, mu=0, tau=1.0/10**2) for k in range(K)])
y_tau = pm.Gamma("tau", alpha=0.1, beta=0.1)
#pm.deterministic
def y_mu(eta=eta, zbar=zbar):
y_mu_list = []
for i in range(len(zbar)):
y_mu_list.append(np.dot(eta, zbar[i]))
return pm.Container(y_mu_list)
#response likelihood
y = pm.Container([pm.Normal("y_%s" % d, mu=y_mu[d], tau=y_tau, value=train_response[d], observed=True) for d in range(D)])
# cannot use p=phi[z[d][i]] here since phi is an ordinary list while z[d][i] is stochastic
w = pm.Container([pm.Categorical("w_%i_%i" % (d,i), p = pm.Lambda('phi_z_%i_%i' % (d,i), lambda z=z[d][i], phi=phi: phi[z]),
value=data[d][i], observed=True) for d in range(D) for i in range(Wd[d])])
model = pm.Model([theta, phi, z, eta, y, w])
mcmc = pm.MCMC(model)
mcmc.sample(iter=1000, burn=100, thin=2)
#visualize topics
phi0_samples = np.squeeze(mcmc.trace('phi_0')[:])
phi1_samples = np.squeeze(mcmc.trace('phi_1')[:])
phi2_samples = np.squeeze(mcmc.trace('phi_2')[:])
phi3_samples = np.squeeze(mcmc.trace('phi_3')[:])
ax = plt.subplot(221)
plt.bar(np.arange(V), phi0_samples[-1,:])
ax = plt.subplot(222)
plt.bar(np.arange(V), phi1_samples[-1,:])
ax = plt.subplot(223)
plt.bar(np.arange(V), phi2_samples[-1,:])
ax = plt.subplot(224)
plt.bar(np.arange(V), phi3_samples[-1,:])
plt.show()
Given the training data (observed words and response variables), we can learn the global topics (beta) and regression coefficients (eta) for predicting the response variable (Y) in addition to topic proportions for each document (theta).
In order to make predictions of Y given the learned beta and eta, we can define a new model where we do not observe Y and use the previously learned beta and eta to obtain the following result:
Here we predicted a positive review (approx 2 given review rating range of -2 to 2) for the test corpus consisting of one sentence: "this is a really positive review, great film" as shown by the mode of the posterior histogram on the right.
See ipython notebook for a complete implementation.
Yes you can try the Labelled LDA in the stanford parser at
http://nlp.stanford.edu/software/tmt/tmt-0.4/