Where is perplexity calculated in the Huggingface gpt2 language model code? - machine-learning

I see some github comments saying the output of the model() call's loss is in the form of perplexity:
https://github.com/huggingface/transformers/issues/473
But when I look at the relevant code...
https://huggingface.co/transformers/_modules/transformers/modeling_openai.html#OpenAIGPTLMHeadModel.forward
if labels is not None:
    # Shift so that tokens < n predict n
    shift_logits = lm_logits[..., :-1, :].contiguous()
    shift_labels = labels[..., 1:].contiguous()
    # Flatten the tokens
    loss_fct = CrossEntropyLoss()
    loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))
    outputs = (loss,) + outputs
return outputs  # (loss), lm_logits, (all hidden states), (all attentions)
I see cross entropy being calculated, but no transformation into perplexity. Where does the loss finally get transformed? Or is there a transformation already there that I'm not understanding?

Ah ok, I found the answer. The code actually returns the cross entropy. The GitHub comment calls it perplexity because the OP then does
return math.exp(loss)
which transforms the cross entropy into perplexity :)

No LaTeX, no problem. By definition the perplexity (triple P) is:
PP(p) = e^(H(p))
where H stands for chaos (Ancient Greek: χάος), or entropy. In the general case we have the cross entropy:
PP(p) = e^(H(p,q))
e is the base of the natural logarithm, which is what PyTorch uses to compute the entropy and cross entropy.
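To make the relation concrete, here is a minimal pure-Python sketch (no PyTorch). The token probabilities are made up for illustration, and `cross_entropy` is a hypothetical helper that mimics what a mean cross-entropy loss returns in nats.

```python
import math

def cross_entropy(probs_of_targets):
    """Mean negative log-likelihood (in nats) of the correct tokens."""
    return -sum(math.log(p) for p in probs_of_targets) / len(probs_of_targets)

# Hypothetical probabilities a model assigned to each correct next token.
token_probs = [0.25, 0.5, 0.125, 0.5]

loss = cross_entropy(token_probs)   # the kind of value a cross-entropy loss returns
perplexity = math.exp(loss)         # the extra step the caller has to do

print(f"cross entropy: {loss:.4f} nats")
print(f"perplexity:    {perplexity:.4f}")
```

Because the loss uses the natural logarithm, exponentiating with e (and not 2) is what recovers the perplexity.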

Related

FedProx with TensorFlow Federated

Would anyone know how to implement the FedProx optimisation algorithm with TensorFlow Federated? The only implementation that seems to be available online was developed directly with TensorFlow. A TFF implementation would enable an easier comparison with experiments that utilise FedAvg which the framework supports.
This is the link to the FedProx repo: https://github.com/litian96/FedProx
Link to the paper: https://arxiv.org/abs/1812.06127
At this moment, FedProx implementation is not available. I agree it would be a valuable algorithm to have.
If you are interested in contributing FedProx, the best place to start would be simple_fedavg which is a minimal implementation of FedAvg meant as a starting point for extensions -- see the readme there for more details.
I think the major change would need to happen to the client_update method, where you would add the proximal term depending on model_weights and initial_weights to the loss computed in forward pass.
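As a rough illustration of that change, the sketch below adds the proximal term (mu/2)·||w − w_global||² to a plain SGD step. It is pure Python and does not use the TFF API; `proximal_term`, `proximal_grad`, `fedprox_step` and all the numbers are hypothetical.

```python
def proximal_term(local_w, global_w, mu):
    """(mu / 2) * ||local_w - global_w||^2 for flat lists of weights."""
    return 0.5 * mu * sum((w - g) ** 2 for w, g in zip(local_w, global_w))

def proximal_grad(local_w, global_w, mu):
    """Gradient of the proximal term with respect to local_w."""
    return [mu * (w - g) for w, g in zip(local_w, global_w)]

def fedprox_step(local_w, global_w, task_grad, mu, lr):
    """One SGD step on task_loss + proximal_term."""
    prox = proximal_grad(local_w, global_w, mu)
    return [w - lr * (g + p) for w, g, p in zip(local_w, task_grad, prox)]

global_w = [1.0, -2.0]    # weights broadcast by the server (made up)
local_w = [1.5, -1.0]     # client weights after some local steps (made up)
task_grad = [0.2, -0.4]   # gradient of the client's own loss (made up)

print(proximal_term(local_w, global_w, mu=0.2))
print(fedprox_step(local_w, global_w, task_grad, mu=0.2, lr=0.1))
```

With mu = 0 the step reduces to the ordinary FedAvg client update, which is why client_update is the natural place for the change.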
Below is my implementation of FedProx in TFF. I am not 100% sure that it is correct; I am also posting this answer so that an actual code example can be discussed.
I tried to follow the suggestions in Jacub Konecny's answer and comment.
Starting from simple_fedavg (in the TFF GitHub repo), I only modified the client_update method, specifically the quantity passed to the GradientTape for computing the gradient: instead of just outputs.loss, the tape computes the gradient of outputs.loss + proximal_term, where the proximal term is recomputed at every iteration.
@tf.function
def client_update(model, dataset, server_message, client_optimizer):
  """Performs client local training of "model" on "dataset".

  Args:
    model: A "tff.learning.Model".
    dataset: A "tf.data.Dataset".
    server_message: A "BroadcastMessage" from server.
    client_optimizer: A "tf.keras.optimizers.Optimizer".

  Returns:
    A "ClientOutput".
  """

  def difference_model_norm_2_square(global_model, local_model):
    """Calculates the squared l2 norm of a model difference (i.e.
    local_model - global_model).

    Args:
      global_model: the model broadcast by the server
      local_model: the current, in-training model

    Returns: the squared norm
    """
    model_difference = tf.nest.map_structure(lambda a, b: a - b,
                                             local_model,
                                             global_model)
    squared_norm = tf.square(tf.linalg.global_norm(model_difference))
    return squared_norm

  model_weights = model.weights
  initial_weights = server_message.model_weights
  tf.nest.map_structure(lambda v, t: v.assign(t), model_weights,
                        initial_weights)

  num_examples = tf.constant(0, dtype=tf.int32)
  loss_sum = tf.constant(0, dtype=tf.float32)
  # Explicit use `iter` for dataset is a trick that makes TFF more robust in
  # GPU simulation and slightly more performant in the unconventional usage
  # of large number of small datasets.
  for batch in iter(dataset):
    with tf.GradientTape() as tape:
      outputs = model.forward_pass(batch)
      # ------ FedProx ------
      # The proximal term must be computed inside the tape so that it
      # contributes to the gradient.
      mu = tf.constant(0.2, dtype=tf.float32)
      prox_term = (mu / 2) * difference_model_norm_2_square(
          initial_weights.trainable, model_weights.trainable)
      fedprox_loss = outputs.loss + prox_term
    # Let GradientTape differentiate the FedProx loss
    grads = tape.gradient(fedprox_loss, model_weights.trainable)
    client_optimizer.apply_gradients(zip(grads, model_weights.trainable))
    batch_size = tf.shape(batch['x'])[0]
    num_examples += batch_size
    loss_sum += outputs.loss * tf.cast(batch_size, tf.float32)

  weights_delta = tf.nest.map_structure(lambda a, b: a - b,
                                        model_weights.trainable,
                                        initial_weights.trainable)
  client_weight = tf.cast(num_examples, tf.float32)
  return ClientOutput(weights_delta, client_weight, loss_sum / client_weight)

Importance weighted autoencoder doing worse than VAE

I've been implementing VAE and IWAE models on the caltech silhouettes dataset and am having an issue where the VAE outperforms IWAE by a modest margin (test LL ~120 for VAE, ~133 for IWAE!). I don't believe this should be the case, according to both theory and experiments produced here.
I'm hoping someone can find some issue in how I'm implementing that's causing this to be the case.
The network I'm using to approximate q and p is the same as that detailed in the appendix of the paper above. The calculation part of the model is below:
data_k_vec = data.repeat_interleave(K,0) # Generate K samples (in my case K=50 is producing this behavior)
mu, log_std = model.encode(data_k_vec)
z = model.reparameterize(mu, log_std) # z = mu + torch.exp(log_std)*epsilon (epsilon ~ N(0,1))
decoded = model.decode(z) # this is the sigmoid output of the model
log_prior_z = torch.sum(-0.5 * z ** 2, 1)-.5*z.shape[1]*T.log(torch.tensor(2*np.pi))
log_q_z = compute_log_probability_gaussian(z, mu, log_std) # Definitions below
log_p_x = compute_log_probability_bernoulli(decoded,data_k_vec)
if model_type == 'iwae':
    log_w_matrix = (log_prior_z + log_p_x - log_q_z).view(-1, K)
elif model_type == 'vae':
    log_w_matrix = (log_prior_z + log_p_x - log_q_z).view(-1, 1)*1/K
log_w_minus_max = log_w_matrix - torch.max(log_w_matrix, 1, keepdim=True)[0]
ws_matrix = torch.exp(log_w_minus_max)
ws_norm = ws_matrix / torch.sum(ws_matrix, 1, keepdim=True)
ws_sum_per_datapoint = torch.sum(log_w_matrix * ws_norm, 1)
loss = -torch.sum(ws_sum_per_datapoint) # value of loss that gets returned to training function. loss.backward() will get called on this value
Here are the likelihood functions. I had to fuss with the bernoulli LL in order to not get nan during training
def compute_log_probability_gaussian(obs, mu, logstd, axis=1):
    return torch.sum(-0.5 * ((obs-mu) / torch.exp(logstd)) ** 2 - logstd, axis)-.5*obs.shape[1]*T.log(torch.tensor(2*np.pi))

def compute_log_probability_bernoulli(theta, obs, axis=1): # Add 1e-18 to avoid nan appearances in training
    return torch.sum(obs*torch.log(theta+1e-18) + (1-obs)*torch.log(1-theta+1e-18), axis)
In this code there's a "shortcut" being used in that the row-wise importance weights are being calculated in the model_type=='iwae' case for the K=50 samples in each row, while in the model_type=='vae' case the importance weights are being calculated for the single value left in each row, so that it just ends up calculating a weight of 1. Maybe this is the issue?
Any and all help is huge - I thought that addressing the nan issue would permanently get me out of the weeds but now I have this new problem.
EDIT:
Should add that the training scheme is the same as that in the paper linked above. That is, for each of i=0....7 rounds train for 2**i epochs with a learning rate of 1e-4 * 10**(-i/7)
The K-sample importance weighted ELBO is
$$ \textrm{IW-ELBO}(x,K) = \log \frac{1}{K} \sum_{k=1}^K \frac{p(x \vert z_k) p(z_k)}{q(z_k;x)}$$
For the IWAE there are K samples originating from each datapoint x, so you want the same latent statistics mu_z, Sigma_z obtained through the amortized inference network, but sample z K times for each x.
So it's computationally wasteful to compute the forward pass for data_k_vec = data.repeat_interleave(K,0); you should compute the forward pass once for each original datapoint, then repeat the statistics output by the inference network for sampling:
mu = torch.repeat_interleave(mu,K,0)
log_std = torch.repeat_interleave(log_std,K,0)
Then sample z_k. And now repeat your datapoints data_k_vec = data.repeat_interleave(K,0), and use the resulting tensor to efficiently evaluate the conditional p(x |z_k) for each importance sample z_k.
Note you may also want to use the logsumexp operation when calculating the IW-ELBO for numerical stability. I can't quite figure out what's going on with the log_w_matrix calculation in your post, but this is what I would do:
log_pz = ...
log_qzCx = ....
log_pxCz = ...
log_iw = log_pxCz + log_pz - log_qzCx
log_iw = log_iw.reshape(-1, K)
iwelbo = torch.logsumexp(log_iw, dim=1) - np.log(K)
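The reason for logsumexp can be seen without PyTorch. The sketch below is a pure-Python stand-in for `torch.logsumexp(log_iw, dim=1) - np.log(K)` on a single datapoint; `log_mean_exp` is a hypothetical helper and the log weights are made up.

```python
import math

def log_mean_exp(log_weights):
    """log((1/K) * sum_k exp(log_w_k)), computed stably by subtracting the max."""
    m = max(log_weights)
    total = sum(math.exp(lw - m) for lw in log_weights)
    return m + math.log(total) - math.log(len(log_weights))

# Hypothetical log importance weights for one datapoint (K = 4).
log_iw = [-1000.0, -1001.0, -1002.5, -999.0]

iwelbo = log_mean_exp(log_iw)
elbo = sum(log_iw) / len(log_iw)   # plain average of the same log weights

print(math.exp(-1000.0))           # underflows, so a naive log(sum(exp(.))) fails
print(iwelbo, elbo)
```

By Jensen's inequality the log-mean-exp is never smaller than the plain average of the log weights, which is the sense in which the IWAE bound is at least as tight as the VAE one for the same samples.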
EDIT: Actually, after thinking about it a bit and using the score function identity, you can interpret the IWAE gradient as an importance weighted estimate of the standard single-sample gradient, so the method in the OP for calculating the importance weights is equivalent (if a bit wasteful), provided you place a stop_gradient operator around the normalized importance weights, which you call ws_norm. So I think the main problem is the absence of this stop_gradient operator.
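The effect of the missing stop_gradient can be checked numerically on a toy problem. In this pure-Python sketch (all names and numbers are hypothetical) the log weights depend linearly on one parameter theta, and finite differences stand in for autograd.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def logsumexp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

# Toy log weights: log_w_k(theta) = a_k * theta + b_k.
a = [0.5, -1.0, 2.0]
b = [0.1, 0.3, -0.7]

def log_w(theta):
    return [ak * theta + bk for ak, bk in zip(a, b)]

def numeric_grad(f, theta, eps=1e-6):
    return (f(theta + eps) - f(theta - eps)) / (2 * eps)

theta0 = 0.4
w_fixed = softmax(log_w(theta0))   # normalized weights treated as constants

# Gradient of the actual IWAE objective (up to the constant log K).
true_grad = numeric_grad(lambda t: logsumexp(log_w(t)), theta0)
# Weighted sum with the weights held fixed, i.e. with stop_gradient.
stopped = numeric_grad(
    lambda t: sum(w * l for w, l in zip(w_fixed, log_w(t))), theta0)
# Weighted sum where the weights are also differentiated, as in the OP.
unstopped = numeric_grad(
    lambda t: sum(w * l for w, l in zip(softmax(log_w(t)), log_w(t))), theta0)

print(true_grad, stopped, unstopped)
```

Only the version with fixed weights reproduces the gradient of the log-sum-exp objective. In PyTorch, holding the weights fixed would typically be done with `ws_norm.detach()`.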

DL4J Prediction Formatting

I have two questions on deeplearning4j that are somewhat related.
When I execute “INDArray predicted = model.output(features,false);” to generate a prediction, I get the label predicted by the model; it is either 0 or 1. I tried to search for a way to have a probability (value between 0 and 1) instead of strictly 0 or 1. This is useful when you need to set a threshold for what your model should consider as a 0 and what it should consider as a 1. For example, you may want your model to output '1' for any prediction that is higher than or equal to 0.9 and output '0' otherwise.
My second question is that I am not sure why the output is represented as a two-dimensional array (shown after the code below) even though there are only two possibilities, so it would be better to represent it with one value - especially if we want it as a probability (question #1) which is one value.
PS: in case relevant to the question, in the Schema the output column is defined using ".addColumnInteger". Below are snippets of the code used.
Part of the code:
MultiLayerConfiguration conf = new NeuralNetConfiguration.Builder()
        .seed(seed)
        .iterations(1)
        .optimizationAlgo(OptimizationAlgorithm.STOCHASTIC_GRADIENT_DESCENT)
        .learningRate(learningRate)
        .updater(org.deeplearning4j.nn.conf.Updater.NESTEROVS).momentum(0.9)
        .list()
        .layer(0, new DenseLayer.Builder()
                .nIn(numInputs)
                .nOut(numHiddenNodes)
                .weightInit(WeightInit.XAVIER)
                .activation("relu")
                .build())
        .layer(1, new OutputLayer.Builder(LossFunctions.LossFunction.NEGATIVELOGLIKELIHOOD)
                .weightInit(WeightInit.XAVIER)
                .activation("softmax")
                .weightInit(WeightInit.XAVIER)
                .nIn(numHiddenNodes)
                .nOut(numOutputs)
                .build()
        )
        .pretrain(false).backprop(true).build();

MultiLayerNetwork model = new MultiLayerNetwork(conf);
model.init();
model.setListeners(new ScoreIterationListener(10));

for (int n=0; n<nEpochs; n++) {
    model.fit(trainIter);
}

Evaluation eval = new Evaluation(numOutputs);
while (testIter.hasNext()){
    DataSet t = testIter.next();
    INDArray features = t.getFeatureMatrix();
    System.out.println("Input features: " + features);
    INDArray labels = t.getLabels();
    INDArray predicted = model.output(features,false);
    System.out.println("Predicted output: "+ predicted);
    System.out.println("Desired output: "+ labels);
    eval.eval(labels, predicted);
    System.out.println();
}
System.out.println(eval.stats());
Output from running the code above:
Input features: [0.10, 0.34, 1.00, 0.00, 1.00]
Predicted output: [1.00, 0.00]
Desired output: [1.00, 0.00]
What I want the output to look like (i.e. a one-value probability):
Input features: [0.10, 0.34, 1.00, 0.00, 1.00]
Predicted output: 0.14
Desired output: 0.0
I will answer your questions inline but I just want to note:
I would suggest taking a look at our docs and examples:
https://github.com/deeplearning4j/dl4j-examples
http://deeplearning4j.org/quickstart
A 100% 0 or 1 is just a badly tuned neural net. That's not at all how things work. A softmax by default returns probabilities. Your neural net is just badly tuned. Look at updating dl4j too. I'm not sure what version you're on but we haven't used strings in activations for at least a year now? You seem to have skipped a lot of steps when starting with us. I'll reiterate again, at least take a look above for a starting point rather than using year old code.
What you're seeing there is just standard deep learning 101. So the advice I'm about to give you can be found on the internet and is applicable to any deep learning software. A two-label softmax makes each row sum to 1. If you want 1 label, use sigmoid with 1 output and a different loss function. We use softmax because it works for any number of outputs: all you have to do is change the number of outputs, rather than having to change the loss function and activation function on top of that.
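The relationship between the two-column softmax output and a single probability can be sketched in a few lines of plain Python. This is not DL4J code; the logits are made up and the 0.9 threshold is the one from the question.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

logits = [0.3, 2.1]                # hypothetical pre-activation outputs for classes 0 and 1
p0, p1 = softmax(logits)           # two-column output, the row sums to 1
p1_single = sigmoid(logits[1] - logits[0])   # the same probability as one number

threshold = 0.9                    # cutoff from the question
label = 1 if p1 >= threshold else 0

print(p0, p1, p1_single, label)
```

So with a two-output softmax, the second column already is the probability of class 1, and a custom threshold can be applied to it directly.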

Supervised Latent Dirichlet Allocation for Document Classification?

I have a bunch of already human-classified documents in some groups.
Is there a modified version of lda which I can use to train a model and then later classify unknown documents with it?
For what it's worth, LDA as a classifier is going to be fairly weak because it's a generative model, and classification is a discriminative problem. There is a variant of LDA called supervised LDA which uses a more discriminative criterion to form the topics (you can get source for this in various places), and there's also a paper with a max margin formulation that I don't know the status of source-code-wise. I would avoid the Labelled LDA formulation unless you're sure that's what you want, because it makes a strong assumption about the correspondence between topics and categories in the classification problem.
However, it's worth pointing out that none of these methods use the topic model directly to do the classification. Instead, they take documents, and instead of using word-based features use the posterior over the topics (the vector that results from inference for the document) as its feature representation before feeding it to a classifier, usually a Linear SVM. This gets you a topic model based dimensionality reduction, followed by a strong discriminative classifier, which is probably what you're after. This pipeline is available in most languages using popular toolkits.
You can implement supervised LDA with PyMC, using a Metropolis sampler to learn the latent variables in the following graphical model:
The training corpus consists of 10 movie reviews (5 positive and 5 negative) along with the associated star rating for each document. The star rating is known as a response variable which is a quantity of interest associated with each document. The documents and response variables are modeled jointly in order to find latent topics that will best predict the response variables for future unlabeled documents. For more information, check out the original paper.
Consider the following code:
# Written against the PyMC 2 API (pm.Container, pm.MCMC)
import pymc as pm
import numpy as np
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer

train_corpus = ["exploitative and largely devoid of the depth or sophistication ",
                "simplistic silly and tedious",
                "it's so laddish and juvenile only teenage boys could possibly find it funny",
                "it shows that some studios firmly believe that people have lost the ability to think",
                "our culture is headed down the toilet with the ferocity of a frozen burrito",
                "offers that rare combination of entertainment and education",
                "the film provides some great insight",
                "this is a film well worth seeing",
                "a masterpiece four years in the making",
                "offers a breath of the fresh air of true sophistication"]
test_corpus = ["this is a really positive review, great film"]
train_response = np.array([3, 1, 3, 2, 1, 5, 4, 4, 5, 5]) - 3
#LDA parameters
num_features = 1000 #vocabulary size
num_topics = 4 #fixed for LDA
tfidf = TfidfVectorizer(max_features = num_features, max_df=0.95, min_df=0, stop_words = 'english')
#generate tf-idf term-document matrix
A_tfidf_sp = tfidf.fit_transform(train_corpus) #size D x V
print("number of docs: %d" % A_tfidf_sp.shape[0])
print("dictionary size: %d" % A_tfidf_sp.shape[1])
#tf-idf dictionary
tfidf_dict = tfidf.get_feature_names()
K = num_topics # number of topics
V = A_tfidf_sp.shape[1] # number of words
D = A_tfidf_sp.shape[0] # number of documents
data = A_tfidf_sp.toarray()
#Supervised LDA Graphical Model
Wd = [len(doc) for doc in data]
alpha = np.ones(K)
beta = np.ones(V)
theta = pm.Container([pm.CompletedDirichlet("theta_%s" % i, pm.Dirichlet("ptheta_%s" % i, theta=alpha)) for i in range(D)])
phi = pm.Container([pm.CompletedDirichlet("phi_%s" % k, pm.Dirichlet("pphi_%s" % k, theta=beta)) for k in range(K)])
z = pm.Container([pm.Categorical('z_%s' % d, p = theta[d], size=Wd[d], value=np.random.randint(K, size=Wd[d])) for d in range(D)])
@pm.deterministic
def zbar(z=z):
    zbar_list = []
    for i in range(len(z)):
        # fix the bin range so bin k always counts topic k, even if some topics are unused
        hist, bin_edges = np.histogram(z[i], bins=K, range=(0, K))
        zbar_list.append(hist / float(np.sum(hist)))
    return pm.Container(zbar_list)
eta = pm.Container([pm.Normal("eta_%s" % k, mu=0, tau=1.0/10**2) for k in range(K)])
y_tau = pm.Gamma("tau", alpha=0.1, beta=0.1)
@pm.deterministic
def y_mu(eta=eta, zbar=zbar):
    y_mu_list = []
    for i in range(len(zbar)):
        y_mu_list.append(np.dot(eta, zbar[i]))
    return pm.Container(y_mu_list)
#response likelihood
y = pm.Container([pm.Normal("y_%s" % d, mu=y_mu[d], tau=y_tau, value=train_response[d], observed=True) for d in range(D)])
# cannot use p=phi[z[d][i]] here since phi is an ordinary list while z[d][i] is stochastic
w = pm.Container([pm.Categorical("w_%i_%i" % (d,i), p = pm.Lambda('phi_z_%i_%i' % (d,i), lambda z=z[d][i], phi=phi: phi[z]),
                                 value=data[d][i], observed=True) for d in range(D) for i in range(Wd[d])])
model = pm.Model([theta, phi, z, eta, y, w])
mcmc = pm.MCMC(model)
mcmc.sample(iter=1000, burn=100, thin=2)
#visualize topics
phi0_samples = np.squeeze(mcmc.trace('phi_0')[:])
phi1_samples = np.squeeze(mcmc.trace('phi_1')[:])
phi2_samples = np.squeeze(mcmc.trace('phi_2')[:])
phi3_samples = np.squeeze(mcmc.trace('phi_3')[:])
ax = plt.subplot(221)
plt.bar(np.arange(V), phi0_samples[-1,:])
ax = plt.subplot(222)
plt.bar(np.arange(V), phi1_samples[-1,:])
ax = plt.subplot(223)
plt.bar(np.arange(V), phi2_samples[-1,:])
ax = plt.subplot(224)
plt.bar(np.arange(V), phi3_samples[-1,:])
plt.show()
Given the training data (observed words and response variables), we can learn the global topics (beta) and regression coefficients (eta) for predicting the response variable (Y) in addition to topic proportions for each document (theta).
In order to make predictions of Y given the learned beta and eta, we can define a new model where we do not observe Y and use the previously learned beta and eta to obtain the following result:
Here we predicted a positive review (approx 2 given review rating range of -2 to 2) for the test corpus consisting of one sentence: "this is a really positive review, great film" as shown by the mode of the posterior histogram on the right.
See ipython notebook for a complete implementation.
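The prediction step comes down to the linear-Gaussian part of the model: the mean of the response is the dot product of eta with the empirical topic frequencies of the document. A minimal pure-Python sketch of just that step, with made-up coefficients and topic assignments and hypothetical helper names (no PyMC involved):

```python
def topic_frequencies(z, num_topics):
    """Empirical topic proportions (z-bar) from per-word topic assignments."""
    counts = [0] * num_topics
    for topic in z:
        counts[topic] += 1
    return [c / len(z) for c in counts]

def predict_response(z, eta):
    """Mean of the response under the linear-Gaussian model: eta . z-bar."""
    zbar = topic_frequencies(z, len(eta))
    return sum(e * f for e, f in zip(eta, zbar))

eta = [-2.0, -0.5, 0.5, 2.0]       # made-up regression coefficients, one per topic
z_doc = [3, 3, 2, 3, 0, 3, 2, 3]   # made-up topic assignment for each word

print(topic_frequencies(z_doc, 4))
print(predict_response(z_doc, eta))
```

In the full model the assignments z are themselves sampled, so the prediction is a posterior distribution over this quantity, not a single number.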
Yes, you can try the Labelled LDA in the Stanford Topic Modeling Toolbox at
http://nlp.stanford.edu/software/tmt/tmt-0.4/

How to do multi class classification using Support Vector Machines (SVM)

Every book and example shows only binary classification (two classes), where a new vector belongs to one of the two classes.
My problem is that I have 4 classes (c1, c2, c3, c4), with training data for all 4.
For a new vector the output should look like
c1 80% (the winner)
c2 10%
c3 6%
c4 4%
How do I do this? I'm planning to use libsvm (because it is the most popular). I don't know much about it. If any of you have used it before, please tell me the specific commands I'm supposed to use.
LibSVM uses the one-against-one approach for multi-class learning problems. From the FAQ:
Q: What method does libsvm use for multi-class SVM ? Why don't you use the "1-against-the rest" method ?
It is one-against-one. We chose it after doing the following comparison: C.-W. Hsu and C.-J. Lin. A comparison of methods for multi-class support vector machines, IEEE Transactions on Neural Networks, 13(2002), 415-425.
"1-against-the rest" is a good method whose performance is comparable to "1-against-1." We do the latter simply because its training time is shorter.
Commonly used methods are One vs. Rest and One vs. One.
In the first method you get n classifiers and the resulting class will have the highest score.
In the second method the resulting class is obtained by majority votes of all classifiers.
AFAIR, libsvm supports both strategies of multiclass classification.
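The two decision rules can be sketched as follows. The scores and pairwise outcomes are invented for illustration; in practice they would come from the trained binary SVMs.

```python
from collections import Counter
from itertools import combinations

classes = ["c1", "c2", "c3", "c4"]

# Hypothetical one-vs-rest scores: one real-valued score per class.
ovr_scores = {"c1": 1.7, "c2": -0.4, "c3": -0.9, "c4": -1.3}
ovr_winner = max(ovr_scores, key=ovr_scores.get)

# Hypothetical one-vs-one outcomes: the winner of each pairwise classifier.
ovo_outcomes = {
    ("c1", "c2"): "c1", ("c1", "c3"): "c1", ("c1", "c4"): "c1",
    ("c2", "c3"): "c2", ("c2", "c4"): "c2", ("c3", "c4"): "c4",
}
# n classes need n * (n - 1) / 2 pairwise classifiers (6 for n = 4).
assert set(ovo_outcomes) == set(combinations(classes, 2))

votes = Counter(ovo_outcomes.values())
ovo_winner = votes.most_common(1)[0][0]

print(ovr_winner, ovo_winner, dict(votes))
```

Note that vote counts are not calibrated probabilities. For percentages like those in the question, LIBSVM's `-b 1` option (probability estimates, passed to both svm-train and svm-predict) is the place to look.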
You can always reduce a multi-class classification problem to a binary problem by choosing random partitions of the set of classes, recursively. This is not necessarily any less effective or efficient than learning all at once, since the sub-learning problems require fewer examples because the partitioning problem is smaller. (It may require at most a constant factor more time, e.g. twice as long.) It may also lead to more accurate learning.
I'm not necessarily recommending this, but it is one answer to your question, and is a general technique that can be applied to any binary learning algorithm.
Use the SVM Multiclass library. Find it at the SVM page by Thorsten Joachims
It does not have a specific switch (command) for multi-class prediction. It automatically handles multi-class prediction if your training dataset contains more than two classes.
Nothing special compared with binary prediction. See the following example for 3-class prediction based on SVM.
install.packages("e1071")
library("e1071")
data(iris)
attach(iris)
## classification mode
# default with factor response:
model <- svm(Species ~ ., data = iris)
# alternatively the traditional interface:
x <- subset(iris, select = -Species)
y <- Species
model <- svm(x, y)
print(model)
summary(model)
# test with train data
pred <- predict(model, x)
# (same as:)
pred <- fitted(model)
# Check accuracy:
table(pred, y)
# compute decision values and probabilities:
pred <- predict(model, x, decision.values = TRUE)
attr(pred, "decision.values")[1:4,]
# visualize (classes by color, SV by crosses):
plot(cmdscale(dist(iris[,-5])),
     col = as.integer(iris[,5]),
     pch = c("o","+")[1:150 %in% model$index + 1])
data=load('E:\dataset\scene_categories\all_dataset.mat');
meas = data.all_dataset;
species = data.dataset_label;
[g gn] = grp2idx(species); %# nominal class to numeric
%# split training/testing sets
[trainIdx testIdx] = crossvalind('HoldOut', species, 1/10);
%# 1-vs-1 pairwise models
num_labels = length(gn);
clear gn;
num_classifiers = num_labels*(num_labels-1)/2;
pairwise = zeros(num_classifiers ,2);
row_end = 0;
for i=1:num_labels - 1
    row_start = row_end + 1;
    row_end = row_start + num_labels - i -1;
    pairwise(row_start : row_end, 1) = i;
    count = 0;
    for j = i+1 : num_labels
        pairwise( row_start + count , 2) = j;
        count = count + 1;
    end
end
clear row_start row_end count i j num_labels num_classifiers;
svmModel = cell(size(pairwise,1),1); %# store binary-classifers
predTest = zeros(sum(testIdx),numel(svmModel)); %# store binary predictions
%# classify using one-against-one approach, SVM with RBF kernel
for k=1:numel(svmModel)
    %# get only training instances belonging to this pair
    idx = trainIdx & any( bsxfun(@eq, g, pairwise(k,:)) , 2 );
    %# train
    svmModel{k} = svmtrain(meas(idx,:), g(idx), ...
        'Autoscale',true, 'Showplot',false, 'Method','QP', ...
        'BoxConstraint',2e-1, 'Kernel_Function','rbf', 'RBF_Sigma',1);
    %# test
    predTest(:,k) = svmclassify(svmModel{k}, meas(testIdx,:));
end
pred = mode(predTest,2); %# voting: classify as the class receiving most votes
%# performance
cmat = confusionmat(g(testIdx),pred);
acc = 100*sum(diag(cmat))./sum(cmat(:));
fprintf('SVM (1-against-1):\naccuracy = %.2f%%\n', acc);
fprintf('Confusion Matrix:\n'), disp(cmat)
For multi-class classification using SVM, this approach is neither one-vs-one nor one-vs-rest.
Instead, learn a two-class classifier where the feature vector is (x, y), where x is the data and y is the correct label associated with the data.
The margin during training is the gap between the value for the correct class and the value of the nearest other class.
At inference, choose the y that has the maximum value of (x, y):
y = arg_max(y') W.(x,y') [W is the weight vector and (x,y) is the feature vector]
See:
https://nlp.stanford.edu/IR-book/html/htmledition/multiclass-svms-1.html
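A small sketch of the inference rule y = arg_max(y') W.(x,y'). It assumes one common choice of joint feature map, copying x into the block of the vector that belongs to class y. That choice, the helper names and the weights are assumptions for illustration, not something the linked page prescribes.

```python
def joint_features(x, label_index, num_classes):
    """Phi(x, y): copy x into the block that belongs to class y, zeros elsewhere."""
    phi = [0.0] * (len(x) * num_classes)
    start = label_index * len(x)
    phi[start:start + len(x)] = x
    return phi

def score(w, x, label_index, num_classes):
    phi = joint_features(x, label_index, num_classes)
    return sum(wi * fi for wi, fi in zip(w, phi))

def predict(w, x, num_classes):
    """Pick the label whose joint feature vector scores highest."""
    return max(range(num_classes), key=lambda y: score(w, x, y, num_classes))

def margin(w, x, true_label, num_classes):
    """Score of the correct class minus the best score among the other classes."""
    others = [score(w, x, y, num_classes)
              for y in range(num_classes) if y != true_label]
    return score(w, x, true_label, num_classes) - max(others)

# Made-up weight vector for 3 classes and 2 input features.
w = [1.0, 0.0, 0.0, 1.0, -1.0, -1.0]
x = [0.5, 2.0]

print(predict(w, x, 3), margin(w, x, 1, 3))
```

Training then tries to make this margin large for every training pair, which is what distinguishes the approach from voting over independent binary classifiers.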
