KL Divergence for two probability distributions in PyTorch

KL Divergence for two probability distributions in PyTorch - machine-learning

I have two probability distributions. How should I find the KL-divergence between them in PyTorch? The regular cross entropy only accepts integer labels.

Yes, PyTorch has a method named kl_div under torch.nn.functional to directly compute KL-devergence between tensors. Suppose you have tensor a and b of same shape. You can use the following code:
import torch.nn.functional as F
out = F.kl_div(a, b)
For more details, see the above method documentation.

function kl_div is not the same as wiki's explanation.
I use the following:
# this is the same example in wiki
P = torch.Tensor([0.36, 0.48, 0.16])
Q = torch.Tensor([0.333, 0.333, 0.333])
(P * (P / Q).log()).sum()
# tensor(0.0863), 10.2 µs ± 508
F.kl_div(Q.log(), P, None, None, 'sum')
# tensor(0.0863), 14.1 µs ± 408 ns
compare to kl_div, even faster

If you have two probability distribution in form of pytorch distribution object. Then you are better off using the function torch.distributions.kl.kl_divergence(p, q). For documentation follow the link

If working with Torch distributions
mu = torch.Tensor([0] * 100)
sd = torch.Tensor([1] * 100)
p = torch.distributions.Normal(mu,sd)
q = torch.distributions.Normal(mu,sd)
out = torch.distributions.kl_divergence(p, q).mean()
out.tolist() == 0
True

If you are using the normal distribution, then the following code will directly compare the two distributions themselves:
p = torch.distributions.normal.Normal(p_mu, p_std)
q = torch.distributions.normal.Normal(q_mu, q_std)
loss = torch.distributions.kl_divergence(p, q)
p and q are two tensor objects.
This code will work and won't give any NotImplementedError.

Related

MCMC for multi coefficients with normal gaussian distribution

I have a linear model as follows: Acceleration= (C1V)+ C2(X- D) where D= alpha-(betaV)+(gammaM).
note that the values for V,X,M are given in dataset.
My goal is to run 350 times MCMC for each following coefficients: C1,C2,alpha,beta,gamma.
1- I have mean and standard deviation for C1,C2,alpha,beta,gamma.
2- All coefficients (C1,C2,alpha,beta,gamma) are normal distribution.
I have tried two methods to find MCMC for each of coefficients, one is by pymc3(which I'm not sure if I have done correctly), another is defining likelihood function based on the method mentioned in following link by Jonny Homfmeister (in my case, I changed the distribution from binomial to normal gaussian):
https://towardsdatascience.com/bayesian-inference-and-markov-chain-monte-carlo-sampling-in-python-bada1beabca7
The problem is that, after running MCMC for C1,C2,alpha,beta and gamma, and using the mean of posterior (output of MCMC) in my main model, I see that the absolute error has increased! it means the coefficients have not optimized by MCMC and my method does not work properly!
I do appreciate it if someone help me with correct algorithm for MCMC for Normal distribution.
####First method: pymc3##########
import pymc3 as pm
import scipy.stats as st
import arviz as az
for row in range(350):
X_c1 = st.norm(loc=-0.06, scale=0.47).rvs(size=100)
with pm.Model() as model:
prior = pm.Normal('c1', mu=-0.06, sd=0.47) ###### prior #####weights
obs = pm.Normal('obs', mu=prior, sd=0.47, observed=X_c1) #######likelihood
step = pm.Metropolis()
trace_c1 = pm.sample(draws=30, chains=2, step=step, return_inferencedata=True)
###calculate the mean of output (posterior distribution)#################3
mean_c1= az.summary(trace_c1, var_names=["c1"], round_to=2).iloc[0][['mean']]
mean_c1 = mean_c1.to_numpy()
Acceleration= (mean_C1*V)+ C2*(X- D) ######apply model
######second method in the given link##########
## Define the Likelihood P(x|p) - normal distribution
def likelihood(p):
return scipy.stats.norm.cdf(C1, loc=-0.06, scale=0.47)
def prior(p):
return scipy.stats.norm.pdf(p)
def acceptance_ratio(p, p_new):
# Return R, using the functions we created before
return min(1, ((likelihood(p_new) / likelihood(p)) * (prior(p_new) / prior(p))))
p = np.random.normal(C1, 0.47) # Initialzie a value of p
#### Define model parameters
n_samples = 790 ################# I HAVE NO IDEA HOW TO CHOOSE THIS VALUE????###########
burn_in = 99
lag = 2
##### Create the MCMC loop
for i in range(n_samples):
p_new = np.random.random_sample() ###Propose a new value of p randomly from a normal distribution
R = acceptance_ratio(p, p_new) #### Compute acceptance probability
u= np.random.uniform(0, 1) ####Draw random sample to compare R to
if u < R: ##### If R is greater than u, accept the new value of p
p = p_new
if i > burn_in and i%lag == 0: #### Record values after burn in - how often determined by lag
results_1.append(p)

Efficient pseudo-inverse for PyTorch 2D convolution

Background:
Thanks for your attention! I am learning the basic knowledge of 2D convolution, linear algebra and PyTorch. I encounter the implementation problem about the psedo-inverse of the convolution operator. Specifically, I have no idea about how to implement it in an efficient way. Please see the following problem statements for details. Any help/tip/suggestion is welcomed.
(Thanks a lot for your attention!)
The Original Problem:
I have an image feature x with shape [b,c,h,w] and a 3x3 convolutional kernel K with shape [c,c,3,3]. There is y = K * x. How to implement the corresponding pseudo-inverse on y in an efficient way?
There is [y = K * x = Ax], how to implement [x_hat = (A^+)y]?
I guess that there should be some operations using torch.fft. However, I still have no idea about how to implement it. I do not know if there exists an implementation previously.
import torch
import torch.nn.functional as F
c = 32
K = torch.randn(c, c, 3, 3)
x = torch.randn(1, c, 128, 128)
y = F.conv2d(x, K, padding=1)
print(y.shape)
# How to implement pseudo-inverse for y = K * x in an efficient way?
Some of My Efforts:
I may know that the 2D convolution is a linear operator. It is equivalent to a "matrix product" operator. We can actually write out the matrix form of the convolution and calculate its psedo-inverse. However, I think this type of operation will be inefficient. And I have no idea about how to implement it in an efficient way.
According to Wikipedia, the psedo-inverse may satisfy the property of A(A_pinv(x))=x, where A is the convolutional operator, A_pinv is its psedo-inverse, and x may be any image feature.
(Thanks again for reading such a long post!)

This takes the problem to another level.
The convolution itself is a linear operation, you can determine the matrix of the operation and solve a least square problem directly [1], or compute the pseudo-inverse as you mentioned, and then apply to different outputs and predicting a projection of the input.
I am changing your code to using padding=0
import torch
import torch.nn.functional as F
# your code
c = 32
K = torch.randn(c, c, 1, 1)
x = torch.randn(4, c, 128, 128)
y = F.conv2d(x, K, bias=torch.zeros((c,)))
Also, as you probably already suggested the convolution can be computed as ifft(fft(h)*fft(x)). However, the conv2d function is a cross-correlation, so you have to conjugate the filter leading to ifft(fft(h)*fft(x)), also you have to apply this to two axes, and you have to make sure the FFT is calcuated using the same representation (size), since the data is real, we can apply multi-dimensional real FFT. To be complete, conv2d works on multiple channels, so we have to calculate summations of convolutions. Since the FFT is linear, we can simply compute the summations on the frequency domain
using einsum.
s = y.shape[-2:]
K_f = torch.fft.rfftn(K, s)
x_f = torch.fft.rfftn(x, s)
y_f = torch.einsum('jkxy,ikxy->ijxy', K_f.conj(), x_f)
y_hat = torch.fft.irfftn(y_f, s)
Except for the borders it should be accurate (remember FFT computes a cyclic convolution).
torch.max(abs(y_hat[:,:,:-2,:-2] - y[:,:,:,:]))
Now, notice the pattern jk,ik->ij on the einsum, that means y_f[i,j] = sum(K_f[j,k] * x_f[i,k]) = x_f # K_f.T, if # is the matrix product on the first two dimensions. So to invert this operation we have to can interpret the first two dimensions as matrices. The function pinv will compute pseudo-inverses on the last two axes, so in order to use that we have to permute the axes. If we right multiply the output by the pseudo-inverse of transposed K_f we should invert this operation.
s = 128,128
K_f = torch.fft.rfftn(K, s)
K_f_inv = torch.linalg.pinv(K_f.T).T
y_f = torch.fft.rfftn(y_hat, s)
x_f = torch.einsum('jkxy,ikxy->ijxy', K_f_inv.conj(), y_f)
x_hat = torch.fft.irfftn(x_f, s)
print(torch.mean((x - x_hat)**2) / torch.mean((x)**2))
Nottice that I am using the full convolution, but the conv2d actually cropped the images. Let's apply that
y_hat[:,:,128-(k-1):,:] = 0
y_hat[:,:,:,128-(k-1):] = 0
Repeating the calculation you will see that the input is not accurate anymore, so you have to be careful about what you do with your convolution, but in some situations where you can get this to work it will be in fact efficient.
s = 128,128
K_f = torch.fft.rfftn(K, s)
K_f_inv = torch.linalg.pinv(K_f.T).T
y_f = torch.fft.rfftn(y_hat, s)
x_f = torch.einsum('jkxy,ikxy->ijxy', K_f_inv.conj(), y_f)
x_hat = torch.fft.irfftn(x_f, s)
print(torch.mean((x - x_hat)**2) / torch.mean((x)**2))

FedProx with TensorFlow Federated

Would anyone know how to implement the FedProx optimisation algorithm with TensorFlow Federated? The only implementation that seems to be available online was developed directly with TensorFlow. A TFF implementation would enable an easier comparison with experiments that utilise FedAvg which the framework supports.
This is the link to the FedProx repo: https://github.com/litian96/FedProx
Link to the paper: https://arxiv.org/abs/1812.06127

At this moment, FedProx implementation is not available. I agree it would be a valuable algorithm to have.
If you are interested in contributing FedProx, the best place to start would be simple_fedavg which is a minimal implementation of FedAvg meant as a starting point for extensions -- see the readme there for more details.
I think the major change would need to happen to the client_update method, where you would add the proximal term depending on model_weights and initial_weights to the loss computed in forward pass.

I provide below my implementation of FedProx in TFF. I am not 100% sure that this is the right implementation; I post this answer also for discussing on actual code example.
I tried to follow the suggestions in the Jacub Konecny's answer and comment.
Starting from the simple_fedavg (referring to the TFF Github repo), I just modified the client_update method, and specifically changing the input argument for calculating the gradient with the GradientTape, i.e. instaead of just passing in input the outputs.loss, the tape calculates the gradient considering the outputs.loss + proximal_term previosuly (and iteratively) calculated.
#tf.function
def client_update(model, dataset, server_message, client_optimizer):
"""Performans client local training of "model" on "dataset".Args:
model: A "tff.learning.Model".
dataset: A "tf.data.Dataset".
server_message: A "BroadcastMessage" from server.
client_optimizer: A "tf.keras.optimizers.Optimizer".
Returns:
A "ClientOutput".
"""
def difference_model_norm_2_square(global_model, local_model):
"""Calculates the squared l2 norm of a model difference (i.e.
local_model - global_model)
Args:
global_model: the model broadcast by the server
local_model: the current, in-training model
Returns: the squared norm
"""
model_difference = tf.nest.map_structure(lambda a, b: a - b,
local_model,
global_model)
squared_norm = tf.square(tf.linalg.global_norm(model_difference))
return squared_norm
model_weights = model.weights
initial_weights = server_message.model_weights
tf.nest.map_structure(lambda v, t: v.assign(t), model_weights,
initial_weights)
num_examples = tf.constant(0, dtype=tf.int32)
loss_sum = tf.constant(0, dtype=tf.float32)
# Explicit use `iter` for dataset is a trick that makes TFF more robust in
# GPU simulation and slightly more performant in the unconventional usage
# of large number of small datasets.
for batch in iter(dataset):
with tf.GradientTape() as tape:
outputs = model.forward_pass(batch)
# ------ FedProx ------
mu = tf.constant(0.2, dtype=tf.float32)
prox_term =(mu/2)*difference_model_norm_2_square(model_weights.trainable, initial_weights.trainable)
fedprox_loss = outputs.loss + prox_term
# Letting GradientTape dealing with the FedProx's loss
grads = tape.gradient(fedprox_loss, model_weights.trainable)
client_optimizer.apply_gradients(zip(grads, model_weights.trainable))
batch_size = tf.shape(batch['x'])[0]
num_examples += batch_size
loss_sum += outputs.loss * tf.cast(batch_size, tf.float32)
weights_delta = tf.nest.map_structure(lambda a, b: a - b,
model_weights.trainable,
initial_weights.trainable)
client_weight = tf.cast(num_examples, tf.float32)
return ClientOutput(weights_delta, client_weight, loss_sum / client_weight)

How should I teach machine learning algorithm using data with big disproportion of classes? (SVM)

I am trying to teach my SVM algorithm using data of clicks and conversion by people who see the banners. The main problem is that the clicks is around 0.2% of all data so it's big disproportion in it. When I use simple SVM in testing phase it always predict only "view" class and never "click" or "conversion". In average it gives 99.8% right answers (because of disproportion), but it gives 0% right prediction if you check "click" or "conversion" ones. How can you tune the SVM algorithm (or select another one) to take into consideration the disproportion?

The most basic approach here is to use so called "class weighting scheme" - in classical SVM formulation there is a C parameter used to control the missclassification count. It can be changed into C1 and C2 parameters used for class 1 and 2 respectively. The most common choice of C1 and C2 for a given C is to put
C1 = C / n1
C2 = C / n2
where n1 and n2 are sizes of class 1 and 2 respectively. So you "punish" SVM for missclassifing the less frequent class much harder then for missclassification the most common one.
Many existing libraries (like libSVM) supports this mechanism with class_weight parameters.
Example using python and sklearn
print __doc__
import numpy as np
import pylab as pl
from sklearn import svm
# we create 40 separable points
rng = np.random.RandomState(0)
n_samples_1 = 1000
n_samples_2 = 100
X = np.r_[1.5 * rng.randn(n_samples_1, 2),
0.5 * rng.randn(n_samples_2, 2) + [2, 2]]
y = [0] * (n_samples_1) + [1] * (n_samples_2)
# fit the model and get the separating hyperplane
clf = svm.SVC(kernel='linear', C=1.0)
clf.fit(X, y)
w = clf.coef_[0]
a = -w[0] / w[1]
xx = np.linspace(-5, 5)
yy = a * xx - clf.intercept_[0] / w[1]
# get the separating hyperplane using weighted classes
wclf = svm.SVC(kernel='linear', class_weight={1: 10})
wclf.fit(X, y)
ww = wclf.coef_[0]
wa = -ww[0] / ww[1]
wyy = wa * xx - wclf.intercept_[0] / ww[1]
# plot separating hyperplanes and samples
h0 = pl.plot(xx, yy, 'k-', label='no weights')
h1 = pl.plot(xx, wyy, 'k--', label='with weights')
pl.scatter(X[:, 0], X[:, 1], c=y, cmap=pl.cm.Paired)
pl.legend()
pl.axis('tight')
pl.show()
In particular, in sklearn you can simply turn on the automatic weighting by setting class_weight='auto'.

This paper describes a variety of techniques. One simple (but very bad method for SVM) is just replicating the minority class(s) until you have a balance:
http://www.ele.uri.edu/faculty/he/PDFfiles/ImbalancedLearning.pdf

How to do multi class classification using Support Vector Machines (SVM)

In every book and example always they show only binary classification (two classes) and new vector can belong to any one class.
Here the problem is I have 4 classes(c1, c2, c3, c4). I've training data for 4 classes.
For new vector the output should be like
C1 80% (the winner)
c2 10%
c3 6%
c4 4%
How to do this? I'm planning to use libsvm (because it most popular). I don't know much about it. If any of you guys used it previously please tell me specific commands I'm supposed to use.

LibSVM uses the one-against-one approach for multi-class learning problems. From the FAQ:
Q: What method does libsvm use for multi-class SVM ? Why don't you use the "1-against-the rest" method ?
It is one-against-one. We chose it after doing the following comparison: C.-W. Hsu and C.-J. Lin. A comparison of methods for multi-class support vector machines, IEEE Transactions on Neural Networks, 13(2002), 415-425.
"1-against-the rest" is a good method whose performance is comparable to "1-against-1." We do the latter simply because its training time is shorter.

Commonly used methods are One vs. Rest and One vs. One.
In the first method you get n classifiers and the resulting class will have the highest score.
In the second method the resulting class is obtained by majority votes of all classifiers.
AFAIR, libsvm supports both strategies of multiclass classification.

You can always reduce a multi-class classification problem to a binary problem by choosing random partititions of the set of classes, recursively. This is not necessarily any less effective or efficient than learning all at once, since the sub-learning problems require less examples since the partitioning problem is smaller. (It may require at most a constant order time more, e.g. twice as long). It may also lead to more accurate learning.
I'm not necessarily recommending this, but it is one answer to your question, and is a general technique that can be applied to any binary learning algorithm.

Use the SVM Multiclass library. Find it at the SVM page by Thorsten Joachims

It does not have a specific switch (command) for multi-class prediction. it automatically handles multi-class prediction if your training dataset contains more than two classes.

Nothing special compared with binary prediction. see the following example for 3-class prediction based on SVM.
install.packages("e1071")
library("e1071")
data(iris)
attach(iris)
## classification mode
# default with factor response:
model <- svm(Species ~ ., data = iris)
# alternatively the traditional interface:
x <- subset(iris, select = -Species)
y <- Species
model <- svm(x, y)
print(model)
summary(model)
# test with train data
pred <- predict(model, x)
# (same as:)
pred <- fitted(model)
# Check accuracy:
table(pred, y)
# compute decision values and probabilities:
pred <- predict(model, x, decision.values = TRUE)
attr(pred, "decision.values")[1:4,]
# visualize (classes by color, SV by crosses):
plot(cmdscale(dist(iris[,-5])),
col = as.integer(iris[,5]),
pch = c("o","+")[1:150 %in% model$index + 1])

data=load('E:\dataset\scene_categories\all_dataset.mat');
meas = data.all_dataset;
species = data.dataset_label;
[g gn] = grp2idx(species); %# nominal class to numeric
%# split training/testing sets
[trainIdx testIdx] = crossvalind('HoldOut', species, 1/10);
%# 1-vs-1 pairwise models
num_labels = length(gn);
clear gn;
num_classifiers = num_labels*(num_labels-1)/2;
pairwise = zeros(num_classifiers ,2);
row_end = 0;
for i=1:num_labels - 1
row_start = row_end + 1;
row_end = row_start + num_labels - i -1;
pairwise(row_start : row_end, 1) = i;
count = 0;
for j = i+1 : num_labels
pairwise( row_start + count , 2) = j;
count = count + 1;
end
end
clear row_start row_end count i j num_labels num_classifiers;
svmModel = cell(size(pairwise,1),1); %# store binary-classifers
predTest = zeros(sum(testIdx),numel(svmModel)); %# store binary predictions
%# classify using one-against-one approach, SVM with 3rd degree poly kernel
for k=1:numel(svmModel)
%# get only training instances belonging to this pair
idx = trainIdx & any( bsxfun(#eq, g, pairwise(k,:)) , 2 );
%# train
svmModel{k} = svmtrain(meas(idx,:), g(idx), ...
'Autoscale',true, 'Showplot',false, 'Method','QP', ...
'BoxConstraint',2e-1, 'Kernel_Function','rbf', 'RBF_Sigma',1);
%# test
predTest(:,k) = svmclassify(svmModel{k}, meas(testIdx,:));
end
pred = mode(predTest,2); %# voting: clasify as the class receiving most votes
%# performance
cmat = confusionmat(g(testIdx),pred);
acc = 100*sum(diag(cmat))./sum(cmat(:));
fprintf('SVM (1-against-1):\naccuracy = %.2f%%\n', acc);
fprintf('Confusion Matrix:\n'), disp(cmat)

For multi class classification using SVM;
It is NOT (one vs one) and NOT (one vs REST).
Instead learn a two-class classifier where the feature vector is (x, y) where x is data and y is the correct label associated with the data.
The training gap is the Difference between the value for the correct class and the value of the nearest other class.
At Inference choose the "y" that has the maximum
value of (x,y).
y = arg_max(y') W.(x,y') [W is the weight vector and (x,y) is the feature Vector]
Please Visit link:
https://nlp.stanford.edu/IR-book/html/htmledition/multiclass-svms-1.html#:~:text=It%20is%20also%20a%20simple,the%20label%20of%20structural%20SVMs%20.

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart

KL Divergence for two probability distributions in PyTorch - machine-learning

I have two probability distributions. How should I find the KL-divergence between them in PyTorch? The regular cross entropy only accepts integer labels.

If you have two probability distribution in form of pytorch distribution object. Then you are better off using the function torch.distributions.kl.kl_divergence(p, q). For documentation follow the link

If working with Torch distributions mu = torch.Tensor([0] * 100) sd = torch.Tensor([1] * 100) p = torch.distributions.Normal(mu,sd) q = torch.distributions.Normal(mu,sd) out = torch.distributions.kl_divergence(p, q).mean() out.tolist() == 0 True

Related

MCMC for multi coefficients with normal gaussian distribution

Efficient pseudo-inverse for PyTorch 2D convolution

FedProx with TensorFlow Federated

How should I teach machine learning algorithm using data with big disproportion of classes? (SVM)

How to do multi class classification using Support Vector Machines (SVM)

Categories

Resources