In following code, I know that my naivebayes classifier is working correctly because it is working correctly on trainset1 but why is it not working on trainset2? I even tried it on two classifiers, one from TextBlob and other directly from nltk.
from textblob.classifiers import NaiveBayesClassifier
from textblob import TextBlob
from nltk.tokenize import word_tokenize
import nltk
trainset1 = [('I love this sandwich.', 'pos'),
('This is an amazing place!', 'pos'),
('I feel very good about these beers.', 'pos'),
('This is my best work.', 'pos'),
("What an awesome view", 'pos'),
('I do not like this restaurant', 'neg'),
('I am tired of this stuff.', 'neg'),
("I can't deal with this", 'neg'),
('He is my sworn enemy!', 'neg'),
('My boss is horrible.', 'neg')]
trainset2 = [('hide all brazil and everything plan limps to anniversary inflation plan initiallyis limping its first anniversary amid soaring prices', 'class1'),
('hello i was there and no one came', 'class2'),
('all negative terms like sad angry etc', 'class2')]
def nltk_naivebayes(trainset, test_sentence):
all_words = set(word.lower() for passage in trainset for word in word_tokenize(passage[0]))
t = [({word: (word in word_tokenize(x[0])) for word in all_words}, x[1]) for x in trainset]
classifier = nltk.NaiveBayesClassifier.train(t)
test_sent_features = {word.lower(): (word in word_tokenize(test_sentence.lower())) for word in all_words}
return classifier.classify(test_sent_features)
def textblob_naivebayes(trainset, test_sentence):
cl = NaiveBayesClassifier(trainset)
blob = TextBlob(test_sentence,classifier=cl)
return blob.classify()
test_sentence1 = "he is my horrible enemy"
test_sentence2 = "inflation soaring limps to anniversary"
print nltk_naivebayes(trainset1, test_sentence1)
print nltk_naivebayes(trainset2, test_sentence2)
print textblob_naivebayes(trainset1, test_sentence1)
print textblob_naivebayes(trainset2, test_sentence2)
Output:
neg
class2
neg
class2
Although test_sentence2 clearly belongs to class1.
I will assume your understand that you cannot expect a classifier to learn a good model with only 3 examples, and that your question is more to understand why it does that in this specific example.
The likely reason it does that is that naive bayes classifier uses a prior class probability. That is, the probability of neg vs pos, regardless of the text. In your case, 2/3 of the examples are negative, thus the prior is 66% for neg and 33% for pos. The positive words in your single positive instance are 'anniversary' and 'soaring', which are unlikely to be enough to compensate this prior class probability.
In particular, be aware that the calculation of word probabilities involve various 'smoothing' functions (for instance, it will be log10(Term Frequency + 1) in each class, not log10(Term Frequency) to prevent low frequency words to impact too much the classification results, divisions by zero, etc. Thus the probabilities for "anniversary" and "soaring" are not 0.0 for neg and 1.0 for pos, unlike what you may have expected.
Related
I have a pretrained word2vec model in pyspark and I would like to know how big is its vocabulary (and perhaps get a list of words in the vocabulary).
Is this possible? I would guess it has to be stored somewhere since it can predict for new data, but I couldn't find a clear answer in the documentation.
I tried w2v_model.getVectors().count() but the result (970) seem too small for my use case. In case it may be relevant, I'm using short-text data and my dataset has tens of millions of messages each having from 10 to 30/40 words. I am using min_count=50.
Not quite sure why you doubt the result of .getVectors().count(), which gives the desired result indeed, as shown in the documentation link you have provided yourself.
Here is the example posted there, with a vocabulary of just three (3) tokens - a, b, and c:
from pyspark.ml.feature import Word2Vec
sent = ("a b " * 100 + "a c " * 10).split(" ") # 3-token vocabulary
doc = spark.createDataFrame([(sent,), (sent,)], ["sentence"])
word2Vec = Word2Vec(vectorSize=5, seed=42, inputCol="sentence", outputCol="model")
model = word2Vec.fit(doc)
So, unsurprisingly, it is
model.getVectors().count()
# 3
and asking for the vectors themselves
model.getVectors().show()
gives
+----+--------------------+
|word| vector|
+----+--------------------+
| a|[0.09511678665876...|
| b|[-1.2028766870498...|
| c|[0.30153277516365...|
+----+--------------------+
In your case, with min_count=50, every word that appears less than 50 times in your corpus will not be represented; reducing this number will result in more vectors.
Here's my code:
# Load libraries
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
# Create text
text_data = np.array(['Tim is smart!',
'Joy is the best',
'Lisa is dumb',
'Fred is lazy',
'Lisa is lazy'])
# Create target vector
y = np.array([1,1,0,0,0])
# Create bag of words
count = CountVectorizer()
bag_of_words = count.fit_transform(text_data) #
# Create feature matrix
X = bag_of_words.toarray()
mnb = MultinomialNB(alpha = 1, fit_prior = True, class_prior = None)
mnb.fit(X,y)
print(count.get_feature_names())
# output:['best', 'dumb', 'fred', 'is', 'joy', 'lazy', 'lisa', 'smart', 'the', 'tim']
print(mnb.feature_log_prob_)
# output
[[-2.94443898 -2.2512918 -2.2512918 -1.55814462 -2.94443898 -1.84582669
-1.84582669 -2.94443898 -2.94443898 -2.94443898]
[-2.14006616 -2.83321334 -2.83321334 -1.73460106 -2.14006616 -2.83321334
-2.83321334 -2.14006616 -2.14006616 -2.14006616]]
My question is:
Let's say for word: "best": the probability for class 1 : -2.14006616.
What is the formula to calculate to get this score.
I am using LOG (P(best|y=class=1)) -> Log(1/2) -> can't get the -2.14006616
From the documentation we can infer that feature_log_prob_ corresponds to the empirical log probability of features given a class. Let's take an example feature "best" for the purpose of this illustration, the log probability of this feature for class 1 is -2.14006616 (as you pointed out), now if we were to convert it into actual probability score it will be np.exp(1)**-2.14006616 = 0.11764. Let's take one more step back to see how and why the probability of "best" in class 1 is 0.11764. As per the documentation of Multinomial Naive Bayes, we see that these probabilities are computed using the formula below:
Where, the numerator roughly corresponds to the number of times feature "best" appears in the class 1 (which is of our interest in this example) in the training set, and the denominator corresponds to the total count of all features for class 1. Also, we add a small smoothing value, alpha to prevent from the probabilities going to zero and n corresponds to the total number of features i.e. size of vocabulary. Computing these numbers for the example we have,
N_yi = 1 # "best" appears only once in class `1`
N_y = 7 # There are total 7 features (count of all words) in class `1`
alpha = 1 # default value as per sklearn
n = 10 # size of vocabulary
Required_probability = (1+1)/(7+1*10) = 0.11764
You can do the math in a similar fashion for any given feature and class.
After running MultinomialNB multiple times I'm getting same features for +ve and -ve class BoW, TfIdf.
I even tried it on bi-grams, tri-grams still the same features for both classes.
best_alpha = 6
clf = MultinomialNB( alpha=best_alpha )
clf.fit(X_tr, y_train)
y_train_pred = batch_predict(clf, X_tr)
y_test_pred = batch_predict(clf, X_te)
train_fpr, train_tpr, tr_thresholds = roc_curve(y_train, y_train_pred)
test_fpr, test_tpr, te_thresholds = roc_curve(y_test, y_test_pred)
This is the code for getting top 10 features for positive and negative classes of text data Tf-Idf.
feats_tfidf contains the features of categorical, numerical and text data.
For Positive class
sorted_idx = np.argsort( clf.feature_log_prob_[1] )[-10:]
for p,q in zip(feats_tfidf[ sorted_idx ], clf.feature_log_prob_[1][ sorted_idx ]):
print('{:45}:{}'.format(p,q))
Output:
Mathematics :-7.134937347073638
Literacy :-6.910334729871051
Grades_3_5 :-6.832969821702653
Ms :-6.791634814736902
Math_Science :-6.748584860699069
Grades_PreK_2 :-6.664767807632341
Literacy_Language :-6.4833650280402875
Mrs :-6.404885953106168
Teacher number of previously posted projects :-3.285663623429455
price :-0.09775430166978438
For negative class
sorted_idx = np.argsort( clf.feature_log_prob_[0] )[-10:]
for p,q in zip(feats_tfidf[ sorted_idx ], clf.feature_log_prob_[0][ sorted_idx ]):
print('{:45}:{}'.format(p,q))
Output:
Literacy :-7.31906682336635
Mathematics :-7.318545582802034
Grades_3_5 :-7.088236519755028
Ms :-6.970453484098645
Math_Science :-6.887189615718408
Grades_PreK_2 :-6.85882128589294
Literacy_Language :-6.8194613665941155
Mrs :-6.648860662073821
Teacher number of previously posted projects :-4.008908256269724
price :-0.08131982830664697
Please help me someone is it correct way of doing.
It should be like this
sorted_idx = np.argsort(-1 * clf_bow.feature_log_prob_[0] )[0:11]
for i in sorted_idx:
print(count_vect.get_feature_names()[i])
When you say [-10:] you would be printing elements in position (n-10), (n-9)....n
but we would want elements to be printed are n, n-1, n-2,... n-10
I'm working on the same problem, and yes I too got many top features that are common in both the classes, though it's not exactly in same order as yours.
Here's how I did it -
I first chained all the features and the probability values(exponential of log-probability) together and then sorted in descending order.
top 10 Positive class features
top 10 Negative class features
So yes, I think what you're getting is correct.
I experience big discrepancies when calculating melting temperature of RNA 7-mers with Biopython over values generated by a popular algorithm.
I tried the nearest neighbour algorithm with RNA and salt concentrations as described in a respective paper (thermodynamic table used as in paper below from: Freier et al 1986). Yet, the values largely differ (execute code below to see).
I tried all seven salt correction methods provided by Biopython, still I never get close to the values generated by siRNA design algorithm for the same 7-mers.
Can someone tell me how accurate Biopython's melting temperature nearest neighbour algorithm is? Especially for short oligomers like my 7-mers? Is there maybe something I am implementing wrong? Any suggestions?
Values derived from executing sample input:
http://sidirect2.rnai.jp/
Tm is given for the seed duplex of the guide strand: bases 2-7
Literature:
"Thermodynamic stability and Watson–Crick
base pairing in the seed duplex are major
determinants of the efficiency of the
siRNA-based off-target effect"
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2602766/pdf/gkn902.pdf
from Bio.Seq import Seq
from Bio.SeqUtils import MeltingTemp
test_list = [
('GGAUUUG', 21.5),
('CUCAUUG', 18.1),
('CAUAUUC', 8.7),
('UUUGAGU', 19.2),
('UUUUGAG', 12.2),
('GUUUCAA', 14.9),
('AGUUUCG', 19.7),
('GAAGUUU', 13.3)
]
for t in test_list:
myseq = Seq(t[0])
tm = MeltingTemp.Tm_NN(myseq, dnac1=100, Na=100, nn_table=MeltingTemp.RNA_NN1, saltcorr=7) # NN1 = Freier et al (1986)
tm = round(tm, 1) # round to one decimal
print 'BioPython Tm: ' + str(tm) + ' siDirect Tm: ' + str(t[1])
I answered the question at biology.stackexchange and Biostars. In short: It seems that siDirect calculates the Tm wrong due to using a 1000fold higher primer concentration.
I have a bunch of already human-classified documents in some groups.
Is there a modified version of lda which I can use to train a model and then later classify unknown documents with it?
For what it's worth, LDA as a classifier is going to be fairly weak because it's a generative model, and classification is a discriminative problem. There is a variant of LDA called supervised LDA which uses a more discriminative criterion to form the topics (you can get source for this in various places), and there's also a paper with a max margin formulation that I don't know the status of source-code-wise. I would avoid the Labelled LDA formulation unless you're sure that's what you want, because it makes a strong assumption about the correspondence between topics and categories in the classification problem.
However, it's worth pointing out that none of these methods use the topic model directly to do the classification. Instead, they take documents, and instead of using word-based features use the posterior over the topics (the vector that results from inference for the document) as its feature representation before feeding it to a classifier, usually a Linear SVM. This gets you a topic model based dimensionality reduction, followed by a strong discriminative classifier, which is probably what you're after. This pipeline is available
in most languages using popular toolkits.
You can implement supervised LDA with PyMC that uses Metropolis sampler to learn the latent variables in the following graphical model:
The training corpus consists of 10 movie reviews (5 positive and 5 negative) along with the associated star rating for each document. The star rating is known as a response variable which is a quantity of interest associated with each document. The documents and response variables are modeled jointly in order to find latent topics that will best predict the response variables for future unlabeled documents. For more information, check out the original paper.
Consider the following code:
import pymc as pm
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
train_corpus = ["exploitative and largely devoid of the depth or sophistication ",
"simplistic silly and tedious",
"it's so laddish and juvenile only teenage boys could possibly find it funny",
"it shows that some studios firmly believe that people have lost the ability to think",
"our culture is headed down the toilet with the ferocity of a frozen burrito",
"offers that rare combination of entertainment and education",
"the film provides some great insight",
"this is a film well worth seeing",
"a masterpiece four years in the making",
"offers a breath of the fresh air of true sophistication"]
test_corpus = ["this is a really positive review, great film"]
train_response = np.array([3, 1, 3, 2, 1, 5, 4, 4, 5, 5]) - 3
#LDA parameters
num_features = 1000 #vocabulary size
num_topics = 4 #fixed for LDA
tfidf = TfidfVectorizer(max_features = num_features, max_df=0.95, min_df=0, stop_words = 'english')
#generate tf-idf term-document matrix
A_tfidf_sp = tfidf.fit_transform(train_corpus) #size D x V
print "number of docs: %d" %A_tfidf_sp.shape[0]
print "dictionary size: %d" %A_tfidf_sp.shape[1]
#tf-idf dictionary
tfidf_dict = tfidf.get_feature_names()
K = num_topics # number of topics
V = A_tfidf_sp.shape[1] # number of words
D = A_tfidf_sp.shape[0] # number of documents
data = A_tfidf_sp.toarray()
#Supervised LDA Graphical Model
Wd = [len(doc) for doc in data]
alpha = np.ones(K)
beta = np.ones(V)
theta = pm.Container([pm.CompletedDirichlet("theta_%s" % i, pm.Dirichlet("ptheta_%s" % i, theta=alpha)) for i in range(D)])
phi = pm.Container([pm.CompletedDirichlet("phi_%s" % k, pm.Dirichlet("pphi_%s" % k, theta=beta)) for k in range(K)])
z = pm.Container([pm.Categorical('z_%s' % d, p = theta[d], size=Wd[d], value=np.random.randint(K, size=Wd[d])) for d in range(D)])
#pm.deterministic
def zbar(z=z):
zbar_list = []
for i in range(len(z)):
hist, bin_edges = np.histogram(z[i], bins=K)
zbar_list.append(hist / float(np.sum(hist)))
return pm.Container(zbar_list)
eta = pm.Container([pm.Normal("eta_%s" % k, mu=0, tau=1.0/10**2) for k in range(K)])
y_tau = pm.Gamma("tau", alpha=0.1, beta=0.1)
#pm.deterministic
def y_mu(eta=eta, zbar=zbar):
y_mu_list = []
for i in range(len(zbar)):
y_mu_list.append(np.dot(eta, zbar[i]))
return pm.Container(y_mu_list)
#response likelihood
y = pm.Container([pm.Normal("y_%s" % d, mu=y_mu[d], tau=y_tau, value=train_response[d], observed=True) for d in range(D)])
# cannot use p=phi[z[d][i]] here since phi is an ordinary list while z[d][i] is stochastic
w = pm.Container([pm.Categorical("w_%i_%i" % (d,i), p = pm.Lambda('phi_z_%i_%i' % (d,i), lambda z=z[d][i], phi=phi: phi[z]),
value=data[d][i], observed=True) for d in range(D) for i in range(Wd[d])])
model = pm.Model([theta, phi, z, eta, y, w])
mcmc = pm.MCMC(model)
mcmc.sample(iter=1000, burn=100, thin=2)
#visualize topics
phi0_samples = np.squeeze(mcmc.trace('phi_0')[:])
phi1_samples = np.squeeze(mcmc.trace('phi_1')[:])
phi2_samples = np.squeeze(mcmc.trace('phi_2')[:])
phi3_samples = np.squeeze(mcmc.trace('phi_3')[:])
ax = plt.subplot(221)
plt.bar(np.arange(V), phi0_samples[-1,:])
ax = plt.subplot(222)
plt.bar(np.arange(V), phi1_samples[-1,:])
ax = plt.subplot(223)
plt.bar(np.arange(V), phi2_samples[-1,:])
ax = plt.subplot(224)
plt.bar(np.arange(V), phi3_samples[-1,:])
plt.show()
Given the training data (observed words and response variables), we can learn the global topics (beta) and regression coefficients (eta) for predicting the response variable (Y) in addition to topic proportions for each document (theta).
In order to make predictions of Y given the learned beta and eta, we can define a new model where we do not observe Y and use the previously learned beta and eta to obtain the following result:
Here we predicted a positive review (approx 2 given review rating range of -2 to 2) for the test corpus consisting of one sentence: "this is a really positive review, great film" as shown by the mode of the posterior histogram on the right.
See ipython notebook for a complete implementation.
Yes you can try the Labelled LDA in the stanford parser at
http://nlp.stanford.edu/software/tmt/tmt-0.4/