Log likelihood function for GDA(Gaussian Discriminative analysis)

Log likelihood function for GDA(Gaussian Discriminative analysis) - machine-learning

I am having trouble understanding the likelihood function for GDA given in Andrew Ng's CS229 notes.
l(φ,µ0,µ1,Σ) = log (product from i to m) {p(x(i)|y(i);µ0,µ1,Σ)p(y(i);φ)}
The link is http://cs229.stanford.edu/notes/cs229-notes2.pdf Page 5.
For Linear regression the function was product from i to m p(y(i)|x(i);theta)
which made sense to me.
Why is there a change here saying it is given by p(x(i)|y(i) and that is multiplied by p(y(i);phi)?
Thanks in advance

The starting formula on page 5 is
l(φ,µ0,µ1,Σ) = log <product from i to m> p(x_i, y_i;µ0,µ1,Σ,φ)
leaving out the parameters φ,µ0,µ1,Σ for now, that can be simplified to
l = log <product> p(x_i, y_i)
using the chain rule you can convert that to either
l = log <product> p(x_i|y_i)p(y_i)
or
l = log <product> p(y_i|x_i)p(x_i).
In the page 5 formula, the φ is moved to p(y_i), because only p(y) depends on it.
The likelihood starts with the joint probability distribution p(x,y) instead of the conditional probability distribution p(y|x), which is why GDA is called a generative model (models from x to y and from y to x), while logistic regression is considered a discriminatory model (models from x to y, one-way). Both have their advantages and disadvantages. There seems to be a chapter about that further below.

Related

Where is perplexity calculated in the Huggingface gpt2 language model code?

I see some github comments saying the output of the model() call's loss is in the form of perplexity:
https://github.com/huggingface/transformers/issues/473
But when I look at the relevant code...
https://huggingface.co/transformers/_modules/transformers/modeling_openai.html#OpenAIGPTLMHeadModel.forward
if labels is not None:
# Shift so that tokens < n predict n
shift_logits = lm_logits[..., :-1, :].contiguous()
shift_labels = labels[..., 1:].contiguous()
# Flatten the tokens
loss_fct = CrossEntropyLoss()
loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))
outputs = (loss,) + outputs
return outputs # (loss), lm_logits, (all hidden states), (all attentions)
I see cross entropy being calculated, but no transformation into perplexity. Where does the loss finally get transformed? Or is there a transformation already there that I'm not understanding?

Ah ok, I found the answer. The code is actually returning cross entropy. In the github comment where they say it is perplexity...they are saying that because the OP does
return math.exp(loss)
which transforms entropy to perplexity :)

No latex no problem. By definition the perplexity (triple P) is:
PP(p) = e^(H(p))
Where H stands for chaos (Ancient Greek: χάος) or entropy. In general case we have the cross entropy:
PP(p) = e^(H(p,q))
e is the natural base of the logarithm which is how PyTorch prefers to compute the entropy and cross entropy.

Confusion about probability density about p(x;w) and p(x|w), where w is the parameter

I can't understand the difference between p(x|w) and p(x;w),like this:
p(x|mu,sigma)=N(x|mu,sigma) and p(x;mu,sigma)=N(x;mu,sigma), where N is normal distribution.

How tf.gradients work in TensorFlow

Given I have a linear model as the following I would like to get the gradient vector with regards to W and b.
# tf Graph Input
X = tf.placeholder("float")
Y = tf.placeholder("float")
# Set model weights
W = tf.Variable(rng.randn(), name="weight")
b = tf.Variable(rng.randn(), name="bias")
# Construct a linear model
pred = tf.add(tf.mul(X, W), b)
# Mean squared error
cost = tf.reduce_sum(tf.pow(pred-Y, 2))/(2*n_samples)
However if I try something like this where cost is a function of cost(x,y,w,b) and I only want to gradients with respect to w and b:
grads = tf.gradients(cost, tf.all_variable())
My placeholders will also be included (X and Y).
Even if I do get a gradient with [x,y,w,b] how do I know which element in the gradient that belong to each parameter since it is just a list without names to which parameter the derivative has be taken with regards to?
In this question I'm using parts of this code and I build on this question.

Quoting the docs for tf.gradients
Constructs symbolic partial derivatives of sum of ys w.r.t. x in xs.
So, this should work:
dc_dw, dc_db = tf.gradients(cost, [W, b])
Here, tf.gradients() returns the gradient of cost wrt each tensor in the second argument as a list in the same order.
Read tf.gradients for more information.

Naive Bayes classifier performance is unexpected

I have just started using the Naive Bayes for text classification. I have coded it from the pseudo code snapshot attached.
I have two classes i.e. positive and negative. I have total of 2000 samples(IMDB Movie Reviews) out of which 1800 (900 positive, 900 negative) are used to train the classifier whereas 200 (100 negative, 100 positive) are used to test the system.
It marks the positive class documents but failed for classifying negative class documents propely. All documents belonging from negative classes are misclassified into positive class and thereby give accuracy of 50%.
If i documents from each class individually like first test all document belonging from negative classes and then from positive test samples then it give me accuracy of 100% but when i feed it mixed test samples it fails and classify all in one class (in my case positive).
Is there any mistake i am doing or is unavailable in this algorithm ?
Are training sample too less and classifier performance will increase upon increase training samples?
I have tested same samples with weka and rapid miner both are giving much better accuracy. I know that i have made a mistake but what is that i can't grab it ?Its the most simple one in understanding but accuracy result was totally unexpected and driving me crazy.Here is my code algorithm pseudo code. I have generating document vector using tf-idf for term weighting and document vector is used for calculations.
TrainMultinomialNB(C, D)
1. V = ExtractVocabulary(D)
2. N = CountDocs(D)
3. For each c E C
4. Do Nc = CountDocsInClass (D, c)
5. Prior[c] = Nc/N
6. Textc = ConcatenateTextOfAllDocsInClass (D,c)
7. For each t E V
8. Do Tct = CountTokensOfTerm(textc, t)
9. For each t E V
10. Do condprob[t][c] = (Tct + 1) /(Sum(Tct) + |V|)
11. Return V, prior, condprob
ApplyMultinomialNB(C, V, prior, condprob, d)
1. W = ExtractTokensFromDoc (V, d)
2. For each c E C
3. Do score [c] = log (prior)
4. For each t E W
5. Do score [c] + = log (condprob[t][c])
6. Return argmax(cEC) score [c]

Supervised Latent Dirichlet Allocation for Document Classification?

I have a bunch of already human-classified documents in some groups.
Is there a modified version of lda which I can use to train a model and then later classify unknown documents with it?

For what it's worth, LDA as a classifier is going to be fairly weak because it's a generative model, and classification is a discriminative problem. There is a variant of LDA called supervised LDA which uses a more discriminative criterion to form the topics (you can get source for this in various places), and there's also a paper with a max margin formulation that I don't know the status of source-code-wise. I would avoid the Labelled LDA formulation unless you're sure that's what you want, because it makes a strong assumption about the correspondence between topics and categories in the classification problem.
However, it's worth pointing out that none of these methods use the topic model directly to do the classification. Instead, they take documents, and instead of using word-based features use the posterior over the topics (the vector that results from inference for the document) as its feature representation before feeding it to a classifier, usually a Linear SVM. This gets you a topic model based dimensionality reduction, followed by a strong discriminative classifier, which is probably what you're after. This pipeline is available
in most languages using popular toolkits.

You can implement supervised LDA with PyMC that uses Metropolis sampler to learn the latent variables in the following graphical model:
The training corpus consists of 10 movie reviews (5 positive and 5 negative) along with the associated star rating for each document. The star rating is known as a response variable which is a quantity of interest associated with each document. The documents and response variables are modeled jointly in order to find latent topics that will best predict the response variables for future unlabeled documents. For more information, check out the original paper.
Consider the following code:
import pymc as pm
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
train_corpus = ["exploitative and largely devoid of the depth or sophistication ",
"simplistic silly and tedious",
"it's so laddish and juvenile only teenage boys could possibly find it funny",
"it shows that some studios firmly believe that people have lost the ability to think",
"our culture is headed down the toilet with the ferocity of a frozen burrito",
"offers that rare combination of entertainment and education",
"the film provides some great insight",
"this is a film well worth seeing",
"a masterpiece four years in the making",
"offers a breath of the fresh air of true sophistication"]
test_corpus = ["this is a really positive review, great film"]
train_response = np.array([3, 1, 3, 2, 1, 5, 4, 4, 5, 5]) - 3
#LDA parameters
num_features = 1000 #vocabulary size
num_topics = 4 #fixed for LDA
tfidf = TfidfVectorizer(max_features = num_features, max_df=0.95, min_df=0, stop_words = 'english')
#generate tf-idf term-document matrix
A_tfidf_sp = tfidf.fit_transform(train_corpus) #size D x V
print "number of docs: %d" %A_tfidf_sp.shape[0]
print "dictionary size: %d" %A_tfidf_sp.shape[1]
#tf-idf dictionary
tfidf_dict = tfidf.get_feature_names()
K = num_topics # number of topics
V = A_tfidf_sp.shape[1] # number of words
D = A_tfidf_sp.shape[0] # number of documents
data = A_tfidf_sp.toarray()
#Supervised LDA Graphical Model
Wd = [len(doc) for doc in data]
alpha = np.ones(K)
beta = np.ones(V)
theta = pm.Container([pm.CompletedDirichlet("theta_%s" % i, pm.Dirichlet("ptheta_%s" % i, theta=alpha)) for i in range(D)])
phi = pm.Container([pm.CompletedDirichlet("phi_%s" % k, pm.Dirichlet("pphi_%s" % k, theta=beta)) for k in range(K)])
z = pm.Container([pm.Categorical('z_%s' % d, p = theta[d], size=Wd[d], value=np.random.randint(K, size=Wd[d])) for d in range(D)])
#pm.deterministic
def zbar(z=z):
zbar_list = []
for i in range(len(z)):
hist, bin_edges = np.histogram(z[i], bins=K)
zbar_list.append(hist / float(np.sum(hist)))
return pm.Container(zbar_list)
eta = pm.Container([pm.Normal("eta_%s" % k, mu=0, tau=1.0/10**2) for k in range(K)])
y_tau = pm.Gamma("tau", alpha=0.1, beta=0.1)
#pm.deterministic
def y_mu(eta=eta, zbar=zbar):
y_mu_list = []
for i in range(len(zbar)):
y_mu_list.append(np.dot(eta, zbar[i]))
return pm.Container(y_mu_list)
#response likelihood
y = pm.Container([pm.Normal("y_%s" % d, mu=y_mu[d], tau=y_tau, value=train_response[d], observed=True) for d in range(D)])
# cannot use p=phi[z[d][i]] here since phi is an ordinary list while z[d][i] is stochastic
w = pm.Container([pm.Categorical("w_%i_%i" % (d,i), p = pm.Lambda('phi_z_%i_%i' % (d,i), lambda z=z[d][i], phi=phi: phi[z]),
value=data[d][i], observed=True) for d in range(D) for i in range(Wd[d])])
model = pm.Model([theta, phi, z, eta, y, w])
mcmc = pm.MCMC(model)
mcmc.sample(iter=1000, burn=100, thin=2)
#visualize topics
phi0_samples = np.squeeze(mcmc.trace('phi_0')[:])
phi1_samples = np.squeeze(mcmc.trace('phi_1')[:])
phi2_samples = np.squeeze(mcmc.trace('phi_2')[:])
phi3_samples = np.squeeze(mcmc.trace('phi_3')[:])
ax = plt.subplot(221)
plt.bar(np.arange(V), phi0_samples[-1,:])
ax = plt.subplot(222)
plt.bar(np.arange(V), phi1_samples[-1,:])
ax = plt.subplot(223)
plt.bar(np.arange(V), phi2_samples[-1,:])
ax = plt.subplot(224)
plt.bar(np.arange(V), phi3_samples[-1,:])
plt.show()
Given the training data (observed words and response variables), we can learn the global topics (beta) and regression coefficients (eta) for predicting the response variable (Y) in addition to topic proportions for each document (theta).
In order to make predictions of Y given the learned beta and eta, we can define a new model where we do not observe Y and use the previously learned beta and eta to obtain the following result:
Here we predicted a positive review (approx 2 given review rating range of -2 to 2) for the test corpus consisting of one sentence: "this is a really positive review, great film" as shown by the mode of the posterior histogram on the right.
See ipython notebook for a complete implementation.

Yes you can try the Labelled LDA in the stanford parser at
http://nlp.stanford.edu/software/tmt/tmt-0.4/

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart

Log likelihood function for GDA(Gaussian Discriminative analysis) - machine-learning

Related

Where is perplexity calculated in the Huggingface gpt2 language model code?

Confusion about probability density about p(x;w) and p(x|w), where w is the parameter

How tf.gradients work in TensorFlow

Naive Bayes classifier performance is unexpected

Supervised Latent Dirichlet Allocation for Document Classification?

Categories

Resources