FedProx with TensorFlow Federated - tensorflow-federated

Would anyone know how to implement the FedProx optimisation algorithm with TensorFlow Federated? The only implementation that seems to be available online was developed directly with TensorFlow. A TFF implementation would enable an easier comparison with experiments that utilise FedAvg which the framework supports.
This is the link to the FedProx repo: https://github.com/litian96/FedProx
Link to the paper: https://arxiv.org/abs/1812.06127

At this moment, FedProx implementation is not available. I agree it would be a valuable algorithm to have.
If you are interested in contributing FedProx, the best place to start would be simple_fedavg which is a minimal implementation of FedAvg meant as a starting point for extensions -- see the readme there for more details.
I think the major change would need to happen to the client_update method, where you would add the proximal term depending on model_weights and initial_weights to the loss computed in forward pass.

I provide below my implementation of FedProx in TFF. I am not 100% sure that this is the right implementation; I post this answer also for discussing on actual code example.
I tried to follow the suggestions in the Jacub Konecny's answer and comment.
Starting from the simple_fedavg (referring to the TFF Github repo), I just modified the client_update method, and specifically changing the input argument for calculating the gradient with the GradientTape, i.e. instaead of just passing in input the outputs.loss, the tape calculates the gradient considering the outputs.loss + proximal_term previosuly (and iteratively) calculated.
#tf.function
def client_update(model, dataset, server_message, client_optimizer):
"""Performans client local training of "model" on "dataset".Args:
model: A "tff.learning.Model".
dataset: A "tf.data.Dataset".
server_message: A "BroadcastMessage" from server.
client_optimizer: A "tf.keras.optimizers.Optimizer".
Returns:
A "ClientOutput".
"""
def difference_model_norm_2_square(global_model, local_model):
"""Calculates the squared l2 norm of a model difference (i.e.
local_model - global_model)
Args:
global_model: the model broadcast by the server
local_model: the current, in-training model
Returns: the squared norm
"""
model_difference = tf.nest.map_structure(lambda a, b: a - b,
local_model,
global_model)
squared_norm = tf.square(tf.linalg.global_norm(model_difference))
return squared_norm
model_weights = model.weights
initial_weights = server_message.model_weights
tf.nest.map_structure(lambda v, t: v.assign(t), model_weights,
initial_weights)
num_examples = tf.constant(0, dtype=tf.int32)
loss_sum = tf.constant(0, dtype=tf.float32)
# Explicit use `iter` for dataset is a trick that makes TFF more robust in
# GPU simulation and slightly more performant in the unconventional usage
# of large number of small datasets.
for batch in iter(dataset):
with tf.GradientTape() as tape:
outputs = model.forward_pass(batch)
# ------ FedProx ------
mu = tf.constant(0.2, dtype=tf.float32)
prox_term =(mu/2)*difference_model_norm_2_square(model_weights.trainable, initial_weights.trainable)
fedprox_loss = outputs.loss + prox_term
# Letting GradientTape dealing with the FedProx's loss
grads = tape.gradient(fedprox_loss, model_weights.trainable)
client_optimizer.apply_gradients(zip(grads, model_weights.trainable))
batch_size = tf.shape(batch['x'])[0]
num_examples += batch_size
loss_sum += outputs.loss * tf.cast(batch_size, tf.float32)
weights_delta = tf.nest.map_structure(lambda a, b: a - b,
model_weights.trainable,
initial_weights.trainable)
client_weight = tf.cast(num_examples, tf.float32)
return ClientOutput(weights_delta, client_weight, loss_sum / client_weight)

Related

Where is perplexity calculated in the Huggingface gpt2 language model code?

I see some github comments saying the output of the model() call's loss is in the form of perplexity:
https://github.com/huggingface/transformers/issues/473
But when I look at the relevant code...
https://huggingface.co/transformers/_modules/transformers/modeling_openai.html#OpenAIGPTLMHeadModel.forward
if labels is not None:
# Shift so that tokens < n predict n
shift_logits = lm_logits[..., :-1, :].contiguous()
shift_labels = labels[..., 1:].contiguous()
# Flatten the tokens
loss_fct = CrossEntropyLoss()
loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))
outputs = (loss,) + outputs
return outputs # (loss), lm_logits, (all hidden states), (all attentions)
I see cross entropy being calculated, but no transformation into perplexity. Where does the loss finally get transformed? Or is there a transformation already there that I'm not understanding?
Ah ok, I found the answer. The code is actually returning cross entropy. In the github comment where they say it is perplexity...they are saying that because the OP does
return math.exp(loss)
which transforms entropy to perplexity :)
No latex no problem. By definition the perplexity (triple P) is:
PP(p) = e^(H(p))
Where H stands for chaos (Ancient Greek: χάος) or entropy. In general case we have the cross entropy:
PP(p) = e^(H(p,q))
e is the natural base of the logarithm which is how PyTorch prefers to compute the entropy and cross entropy.

Importance weighted autoencoder doing worse than VAE

I've been implementing VAE and IWAE models on the caltech silhouettes dataset and am having an issue where the VAE outperforms IWAE by a modest margin (test LL ~120 for VAE, ~133 for IWAE!). I don't believe this should be the case, according to both theory and experiments produced here.
I'm hoping someone can find some issue in how I'm implementing that's causing this to be the case.
The network I'm using to approximate q and p is the same as that detailed in the appendix of the paper above. The calculation part of the model is below:
data_k_vec = data.repeat_interleave(K,0) # Generate K samples (in my case K=50 is producing this behavior)
mu, log_std = model.encode(data_k_vec)
z = model.reparameterize(mu, log_std) # z = mu + torch.exp(log_std)*epsilon (epsilon ~ N(0,1))
decoded = model.decode(z) # this is the sigmoid output of the model
log_prior_z = torch.sum(-0.5 * z ** 2, 1)-.5*z.shape[1]*T.log(torch.tensor(2*np.pi))
log_q_z = compute_log_probability_gaussian(z, mu, log_std) # Definitions below
log_p_x = compute_log_probability_bernoulli(decoded,data_k_vec)
if model_type == 'iwae':
log_w_matrix = (log_prior_z + log_p_x - log_q_z).view(-1, K)
elif model_type =='vae':
log_w_matrix = (log_prior_z + log_p_x - log_q_z).view(-1, 1)*1/K
log_w_minus_max = log_w_matrix - torch.max(log_w_matrix, 1, keepdim=True)[0]
ws_matrix = torch.exp(log_w_minus_max)
ws_norm = ws_matrix / torch.sum(ws_matrix, 1, keepdim=True)
ws_sum_per_datapoint = torch.sum(log_w_matrix * ws_norm, 1)
loss = -torch.sum(ws_sum_per_datapoint) # value of loss that gets returned to training function. loss.backward() will get called on this value
Here are the likelihood functions. I had to fuss with the bernoulli LL in order to not get nan during training
def compute_log_probability_gaussian(obs, mu, logstd, axis=1):
return torch.sum(-0.5 * ((obs-mu) / torch.exp(logstd)) ** 2 - logstd, axis)-.5*obs.shape[1]*T.log(torch.tensor(2*np.pi))
def compute_log_probability_bernoulli(theta, obs, axis=1): # Add 1e-18 to avoid nan appearances in training
return torch.sum(obs*torch.log(theta+1e-18) + (1-obs)*torch.log(1-theta+1e-18), axis)
In this code there's a "shortcut" being used in that the row-wise importance weights are being calculated in the model_type=='iwae' case for the K=50 samples in each row, while in the model_type=='vae' case the importance weights are being calculated for the single value left in each row, so that it just ends up calculating a weight of 1. Maybe this is the issue?
Any and all help is huge - I thought that addressing the nan issue would permanently get me out of the weeds but now I have this new problem.
EDIT:
Should add that the training scheme is the same as that in the paper linked above. That is, for each of i=0....7 rounds train for 2**i epochs with a learning rate of 1e-4 * 10**(-i/7)
The K-sample importance weighted ELBO is
$$ \textrm{IW-ELBO}(x,K) = \log \sum_{k=1}^K \frac{p(x \vert z_k) p(z_k)}{q(z_k;x)}$$
For the IWAE there are K samples originating from each datapoint x, so you want to have the same latent statistics mu_z, Sigma_z obtained through the amortized inference network, but sample multiple z K times for each x.
So its computationally wasteful to compute the forward pass for data_k_vec = data.repeat_interleave(K,0), you should compute the forward pass once for each original datapoint, then repeat the statistics output by the inference network for sampling:
mu = torch.repeat_interleave(mu,K,0)
log_std = torch.repeat_interleave(log_std,K,0)
Then sample z_k. And now repeat your datapoints data_k_vec = data.repeat_interleave(K,0), and use the resulting tensor to efficiently evaluate the conditional p(x |z_k) for each importance sample z_k.
Note you may also want to use the logsumexp operation when calculating the IW-ELBO for numerical stability. I can't quite figure out what's going on with the log_w_matrix calculation in your post, but this is what I would do:
log_pz = ...
log_qzCx = ....
log_pxCz = ...
log_iw = log_pxCz + log_pz - log_qzCx
log_iw = log_iw.reshape(-1, K)
iwelbo = torch.logsumexp(log_iw, dim=1) - np.log(K)
EDIT: Actually after thinking about it a bit and using the score function identity, you can interpret the IWAE gradient as an importance weighted estimate of the standard single-sample gradient, so the method in the OP for calculation of the importance weights is equivalent (if a bit wasteful), provided you place a stop_gradient operator around the normalized importance weights, which you call w_norm. So I the main problem is the absence of this stop_gradient operator.

Xgboost-How to use "mae" as objective function?

I know xgboost need first gradient and second gradient, but anybody else has used "mae" as obj function?
A little bit of theory first, sorry! You asked for the grad and hessian for MAE, however, the MAE is not continuously twice differentiable so trying to calculate the first and second derivatives becomes tricky. Below we can see the "kink" at x=0 which prevents the MAE from being continuously differentiable.
Moreover, the second derivative is zero at all the points where it is well behaved. In XGBoost, the second derivative is used as a denominator in the leaf weights, and when zero, creates serious math-errors.
Given these complexities, our best bet is to try to approximate the MAE using some other, nicely behaved function. Let's take a look.
We can see above that there are several functions that approximate the absolute value. Clearly, for very small values, the Squared Error (MSE) is a fairly good approximation of the MAE. However, I assume that this is not sufficient for your use case.
Huber Loss is a well documented loss function. However, it is not smooth so we cannot guarantee smooth derivatives. We can approximate it using the Psuedo-Huber function. It can be implemented in python XGBoost as follows,
import xgboost as xgb
dtrain = xgb.DMatrix(x_train, label=y_train)
dtest = xgb.DMatrix(x_test, label=y_test)
param = {'max_depth': 5}
num_round = 10
def huber_approx_obj(preds, dtrain):
d = preds - dtrain.get_labels() #remove .get_labels() for sklearn
h = 1 #h is delta in the graphic
scale = 1 + (d / h) ** 2
scale_sqrt = np.sqrt(scale)
grad = d / scale_sqrt
hess = 1 / scale / scale_sqrt
return grad, hess
bst = xgb.train(param, dtrain, num_round, obj=huber_approx_obj)
Other function can be used by replacing the obj=huber_approx_obj.
Fair Loss is not well documented at all but it seems to work rather well. The fair loss function is:
It can be implemented as such,
def fair_obj(preds, dtrain):
"""y = c * abs(x) - c**2 * np.log(abs(x)/c + 1)"""
x = preds - dtrain.get_labels()
c = 1
den = abs(x) + c
grad = c*x / den
hess = c*c / den ** 2
return grad, hess
This code is taken and adapted from the second place solution in the Kaggle Allstate Challenge.
Log-Cosh Loss function.
def log_cosh_obj(preds, dtrain):
x = preds - dtrain.get_labels()
grad = np.tanh(x)
hess = 1 / np.cosh(x)**2
return grad, hess
Finally, you can create your own custom loss functions using the above functions as templates.
Warning: Due to API changes newer versions of XGBoost may require loss functions for the form:
def custom_objective(y_true, y_pred):
...
return grad, hess
For the Huber loss above, I think the gradient is missing a negative sign upfront. Should be as
grad = - d / scale_sqrt
I am running the huber/fair metric from above on ~normally distributed Y, but for some reason with alpha <0 (and all the time for fair) the result prediction will equal to zero...

LSTM RNN Backpropagation

Could someone give a clear explanation of backpropagation for LSTM RNNs?
This is the type structure I am working with. My question is not posed at what is back propagation, I understand it is a reverse order method of calculating the error of the hypothesis and output used for adjusting the weights of neural networks. My question is how LSTM backpropagation is different then regular neural networks.
I am unsure of how to find the initial error of each gates. Do you use the first error (calculated by hypothesis minus output) for each gate? Or do you adjust the error for each gate through some calculation? I am unsure how the cell state plays a role in the backprop of LSTMs if it does at all. I have looked thoroughly for a good source for LSTMs but have yet to find any.
That's a good question. You certainly should take a look at suggested posts for details, but a complete example here would be helpful too.
RNN Backpropagaion
I think it makes sense to talk about an ordinary RNN first (because LSTM diagram is particularly confusing) and understand its backpropagation.
When it comes to backpropagation, the key idea is network unrolling, which is way to transform the recursion in RNN into a feed-forward sequence (like on the picture above). Note that abstract RNN is eternal (can be arbitrarily large), but each particular implementation is limited because the memory is limited. As a result, the unrolled network really is a long feed-forward network, with few complications, e.g. the weights in different layers are shared.
Let's take a look at a classic example, char-rnn by Andrej Karpathy. Here each RNN cell produces two outputs h[t] (the state which is fed into the next cell) and y[t] (the output on this step) by the following formulas, where Wxh, Whh and Why are the shared parameters:
In the code, it's simply three matrices and two bias vectors:
# model parameters
Wxh = np.random.randn(hidden_size, vocab_size)*0.01 # input to hidden
Whh = np.random.randn(hidden_size, hidden_size)*0.01 # hidden to hidden
Why = np.random.randn(vocab_size, hidden_size)*0.01 # hidden to output
bh = np.zeros((hidden_size, 1)) # hidden bias
by = np.zeros((vocab_size, 1)) # output bias
The forward pass is pretty straightforward, this example uses softmax and cross-entropy loss. Note each iteration uses the same W* and h* arrays, but the output and hidden state are different:
# forward pass
for t in xrange(len(inputs)):
xs[t] = np.zeros((vocab_size,1)) # encode in 1-of-k representation
xs[t][inputs[t]] = 1
hs[t] = np.tanh(np.dot(Wxh, xs[t]) + np.dot(Whh, hs[t-1]) + bh) # hidden state
ys[t] = np.dot(Why, hs[t]) + by # unnormalized log probabilities for next chars
ps[t] = np.exp(ys[t]) / np.sum(np.exp(ys[t])) # probabilities for next chars
loss += -np.log(ps[t][targets[t],0]) # softmax (cross-entropy loss)
Now, the backward pass is performed exactly as if it was a feed-forward network, but the gradient of W* and h* arrays accumulates the gradients in all cells:
for t in reversed(xrange(len(inputs))):
dy = np.copy(ps[t])
dy[targets[t]] -= 1
dWhy += np.dot(dy, hs[t].T)
dby += dy
dh = np.dot(Why.T, dy) + dhnext # backprop into h
dhraw = (1 - hs[t] * hs[t]) * dh # backprop through tanh nonlinearity
dbh += dhraw
dWxh += np.dot(dhraw, xs[t].T)
dWhh += np.dot(dhraw, hs[t-1].T)
dhnext = np.dot(Whh.T, dhraw)
Both passes above are done in chunks of size len(inputs), which corresponds to the size of the unrolled RNN. You might want to make it bigger to capture longer dependencies in the input, but you pay for it by storing all outputs and gradients per each cell.
What's different in LSTMs
LSTM picture and formulas look intimidating, but once you coded plain vanilla RNN, the implementation of LSTM is pretty much same. For example, here is the backward pass:
# Loop over all cells, like before
d_h_next_t = np.zeros((N, H))
d_c_next_t = np.zeros((N, H))
for t in reversed(xrange(T)):
d_x_t, d_h_prev_t, d_c_prev_t, d_Wx_t, d_Wh_t, d_b_t = lstm_step_backward(d_h_next_t + d_h[:,t,:], d_c_next_t, cache[t])
d_c_next_t = d_c_prev_t
d_h_next_t = d_h_prev_t
d_x[:,t,:] = d_x_t
d_h0 = d_h_prev_t
d_Wx += d_Wx_t
d_Wh += d_Wh_t
d_b += d_b_t
# The step in each cell
# Captures all LSTM complexity in few formulas.
def lstm_step_backward(d_next_h, d_next_c, cache):
"""
Backward pass for a single timestep of an LSTM.
Inputs:
- dnext_h: Gradients of next hidden state, of shape (N, H)
- dnext_c: Gradients of next cell state, of shape (N, H)
- cache: Values from the forward pass
Returns a tuple of:
- dx: Gradient of input data, of shape (N, D)
- dprev_h: Gradient of previous hidden state, of shape (N, H)
- dprev_c: Gradient of previous cell state, of shape (N, H)
- dWx: Gradient of input-to-hidden weights, of shape (D, 4H)
- dWh: Gradient of hidden-to-hidden weights, of shape (H, 4H)
- db: Gradient of biases, of shape (4H,)
"""
x, prev_h, prev_c, Wx, Wh, a, i, f, o, g, next_c, z, next_h = cache
d_z = o * d_next_h
d_o = z * d_next_h
d_next_c += (1 - z * z) * d_z
d_f = d_next_c * prev_c
d_prev_c = d_next_c * f
d_i = d_next_c * g
d_g = d_next_c * i
d_a_g = (1 - g * g) * d_g
d_a_o = o * (1 - o) * d_o
d_a_f = f * (1 - f) * d_f
d_a_i = i * (1 - i) * d_i
d_a = np.concatenate((d_a_i, d_a_f, d_a_o, d_a_g), axis=1)
d_prev_h = d_a.dot(Wh.T)
d_Wh = prev_h.T.dot(d_a)
d_x = d_a.dot(Wx.T)
d_Wx = x.T.dot(d_a)
d_b = np.sum(d_a, axis=0)
return d_x, d_prev_h, d_prev_c, d_Wx, d_Wh, d_b
Summary
Now, back to your questions.
My question is how is LSTM backpropagation different then regular Neural Networks
The are shared weights in different layers, and few more additional variables (states) that you need to pay attention to. Other than this, no difference at all.
Do you use the first error (calculated by hypothesis minus output) for each gate? Or do you adjust the error for each gate through some calculation?
First up, the loss function is not necessarily L2. In the example above it's a cross-entropy loss, so initial error signal gets its gradient:
# remember that ps is the probability distribution from the forward pass
dy = np.copy(ps[t])
dy[targets[t]] -= 1
Note that it's the same error signal as in ordinary feed-forward neural network. If you use L2 loss, the signal indeed equals to ground-truth minus actual output.
In case of LSTM, it's slightly more complicated: d_next_h = d_h_next_t + d_h[:,t,:], where d_h is the upstream gradient the loss function, which means that error signal of each cell gets accumulated. But once again, if you unroll LSTM, you'll see a direct correspondence with the network wiring.
I think your questions could not be answered in a short response. Nico's simple LSTM has a link to a great paper from Lipton et.al., please read this. Also his simple python code sample helps to answer most of your questions.
If you understand Nico's last sentence
ds = self.state.o * top_diff_h + top_diff_s
in detail, please give me a feed back. At the moment I have a final problem with his "Putting all this s and h derivations together".

Implementing a linear, binary SVM (support vector machine)

I want to implement a simple SVM classifier, in the case of high-dimensional binary data (text), for which I think a simple linear SVM is best. The reason for implementing it myself is basically that I want to learn how it works, so using a library is not what I want.
The problem is that most tutorials go up to an equation that can be solved as a "quadratic problem", but they never show an actual algorithm! So could you point me either to a very simple implementation I could study, or (better) to a tutorial that goes all the way to the implementation details?
Thanks a lot!
Some pseudocode for the Sequential Minimal Optimization (SMO) method can be found in this paper by John C. Platt: Fast Training of Support Vector Machines using Sequential Minimal Optimization. There is also a Java implementation of the SMO algorithm, which is developed for research and educational purpose (SVM-JAVA).
Other commonly used methods to solve the QP optimization problem include:
constrained conjugate gradients
interior point methods
active set methods
But be aware that some math knowledge is needed to understand this things (Lagrange multipliers, Karush–Kuhn–Tucker conditions, etc.).
Are you interested in using kernels or not? Without kernels, the best way to solve these kinds of optimization problems is through various forms of stochastic gradient descent. A good version is described in http://ttic.uchicago.edu/~shai/papers/ShalevSiSr07.pdf and that has an explicit algorithm.
The explicit algorithm does not work with kernels but can be modified; however, it would be more complex, both in terms of code and runtime complexity.
Have a look at liblinear and for non linear SVM's at libsvm
The following paper "Pegasos: Primal Estimated sub-GrAdient SOlver for SVM" top of page 11 describes the Pegasos algorithm also for kernels.It can be downloaded from http://ttic.uchicago.edu/~nati/Publications/PegasosMPB.pdf
It appears to be a hybrid of coordinate descent and subgradient descent. Also, line 6 of the algorithm is wrong. In the predicate the second appearance of y_i_t should be replaced with y_j instead.
I would like to add a little supplement to the answer about original Platt's work.
There is a bit simplified version presented in Stanford Lecture Notes, but derivation of all the formulas should be found somewhere else (e.g. this random notes I found on the Internet).
If it's ok to deviate from original implementations, I can propose you my own variation of the SMO algorithm that follows.
class SVM:
def __init__(self, kernel='linear', C=10000.0, max_iter=100000, degree=3, gamma=1):
self.kernel = {'poly':lambda x,y: np.dot(x, y.T)**degree,
'rbf':lambda x,y:np.exp(-gamma*np.sum((y-x[:,np.newaxis])**2,axis=-1)),
'linear':lambda x,y: np.dot(x, y.T)}[kernel]
self.C = C
self.max_iter = max_iter
def restrict_to_square(self, t, v0, u):
t = (np.clip(v0 + t*u, 0, self.C) - v0)[1]/u[1]
return (np.clip(v0 + t*u, 0, self.C) - v0)[0]/u[0]
def fit(self, X, y):
self.X = X.copy()
self.y = y * 2 - 1
self.lambdas = np.zeros_like(self.y, dtype=float)
self.K = self.kernel(self.X, self.X) * self.y[:,np.newaxis] * self.y
for _ in range(self.max_iter):
for idxM in range(len(self.lambdas)):
idxL = np.random.randint(0, len(self.lambdas))
Q = self.K[[[idxM, idxM], [idxL, idxL]], [[idxM, idxL], [idxM, idxL]]]
v0 = self.lambdas[[idxM, idxL]]
k0 = 1 - np.sum(self.lambdas * self.K[[idxM, idxL]], axis=1)
u = np.array([-self.y[idxL], self.y[idxM]])
t_max = np.dot(k0, u) / (np.dot(np.dot(Q, u), u) + 1E-15)
self.lambdas[[idxM, idxL]] = v0 + u * self.restrict_to_square(t_max, v0, u)
idx, = np.nonzero(self.lambdas > 1E-15)
self.b = np.sum((1.0-np.sum(self.K[idx]*self.lambdas, axis=1))*self.y[idx])/len(idx)
def decision_function(self, X):
return np.sum(self.kernel(X, self.X) * self.y * self.lambdas, axis=1) + self.b
In simple cases it works not much worth than sklearn.svm.SVC, comparison shown below (I have posted code that generates these images on GitHub)
I used quite different approach to derive formulas, you may want to check my preprint on ResearchGate for details.

Resources