How does gradient update work in SimOTA label assignment in YOLOX? - machine-learning

I am confused about how gradient update works for the SimOTA label assignment part in YOLOX.
In Megvii's implementation of the yolo_head.py, there is get_losses function.
A part of the function calls get_assignments function, which implements the SimOTA label assignment strategy mentioned in the original YOLOX paper:
try:
(
gt_matched_classes,
fg_mask,
pred_ious_this_matching,
matched_gt_inds,
num_fg_img,
) = self.get_assignments( # noqa
batch_idx,
num_gt,
total_num_anchors,
gt_bboxes_per_image,
gt_classes,
bboxes_preds_per_image,
expanded_strides,
x_shifts,
y_shifts,
cls_preds,
bbox_preds,
obj_preds,
labels,
imgs,
)
My understanding is:
The get_assignments function has the #torch.no_grad() decorator which would prevent the gradient calculation from taking place in this function during back propagation.
(I believe) This would mean that the return values of the get_assignments function would be treated as pre-computed constants, except that they will be varying for each image & groundtruth input.
Above points suggest that the neural network would be trying to learn something from an (paradoxically) ever-changing pre-computed "constants" for every image input which does not seem to make much sense. Intuition leads me to think that whatever calculation( that could vary across inputs) that results in a loss should be differentiable & BP'ed.
Is there something inaccurate in my understanding of the YOLOX architecture / how BP works?

Upon thinking over my question, I realized that the matching cost matrix obtained from dynamic_k_matching() (inside get_assignments) serves merely as another proxy groundtruth target. There is no reason to compute gradients within a function that creates a target groundtruth.

Related

Using sample_weights with fit_generator()

Inside an autoregressive continuous problem, when the zeros take too much place, it is possible to treat the situation as a zero-inflated problem (i.e. ZIB). In other words, instead of working to fit f(x), we want to fit g(x)*f(x) where f(x) is the function we want to approximate, i.e. y, and g(x) is a function which output a value between 0 and 1 depending if a value is zero or non-zero.
Currently, I have two models. One model which gives me g(x) and another model which fits g(x)*f(x).
The first model gives me a set of weights. This is where I need your help. I can use the sample_weights arguments with model.fit(). As I work with tremendous amount of data, then I need to work with model.fit_generator(). However, fit_generator() does not have the argument sample_weights.
Is there a work around to work with sample_weights inside fit_generator()? Otherwise, how can I fit g(x)*f(x) knowing that I have already a trained model for g(x)?
You can provide sample weights as the third element of the tuple returned by the generator. From Keras documentation on fit_generator:
generator: A generator or an instance of Sequence (keras.utils.Sequence) object in order to avoid duplicate data when using multiprocessing. The output of the generator must be either
a tuple (inputs, targets)
a tuple (inputs, targets, sample_weights).
Update: Here is a rough sketch of a generator that returns the input samples and targets as well as the sample weights obtained from model g(x):
def gen(args):
while True:
for i in range(num_batches):
# get the i-th batch data
inputs = ...
targets = ...
# get the sample weights
weights = g.predict(inputs)
yield inputs, targets, weights
model.fit_generator(gen(args), steps_per_epoch=num_batches, ...)

Polynomial regression for one input feature

I am new to machine learning. I am having a question regarding polynomial regression using one feature.
My understanding is that if there is one input feature, we can create a hypothesis function by taking the squares and cubes the feature.
Suppose x1 is the input feature and our hypothesis function becomes something like this :
htheta(x) = theta0 + (theta1)x1 + (theta2)x1^2 + (theta3)x1^3.
My question is what is the use case of such scenario ? In what type of data, this type of hypothesis function will help ?
This scenario is for simple curve fitting problems. For example, you might have a spring and want to know how far the spring is stretched as a function of how much force you apply (the spring needn't be a linear spring obeying Hooke's law). You could build a model by collecting a bunch of measurements of different forces applied on the spring (measured in Newtons) and the resulting spring extension (also called displacement) in centimeters. You could then build a model of the form F(x) = theta_1 * x + theta_2 * x^3 + theta_3 * x^5 and fit the three theta parameters. You could of course do this with any other single variable problem (height vs. age, weight vs. blood pressure, current vs. voltage). In practice, you generally have many more than just a one dependent variable though.
Also worth pointing out that the transformations needn't be polynomial in the dependent variable (x in this case). You could just as well try logs, square roots, exponentials etc. If you're asking why is it always a parameter times a function of the input variable, this is more of a modeling choice than anything (specifically a linear model since it's linear in theta). It does not have to be this way and is a simple assumption that restricts the class of functions. Linear models also satisfy some intuitive statistical properties which also justify their use (see here)

What is the way to understand Proximal Policy Optimization Algorithm in RL?

I know the basics of Reinforcement Learning, but what terms it's necessary to understand to be able read arxiv PPO paper ?
What is the roadmap to learn and use PPO ?
To better understand PPO, it is helpful to look at the main contributions of the paper, which are: (1) the Clipped Surrogate Objective and (2) the use of "multiple epochs of stochastic gradient ascent to perform each policy update".
From the original PPO paper:
We have introduced [PPO], a family of policy optimization methods that use multiple epochs of stochastic gradient ascent to perform each policy update. These methods have the stability and reliability of trust-region [TRPO] methods but are much simpler to implement, requiring only a few lines of code change to a vanilla policy gradient implementation, applicable in more general settings (for example, when using a joint architecture for the policy and value function), and have better overall performance.
1. The Clipped Surrogate Objective
The Clipped Surrogate Objective is a drop-in replacement for the policy gradient objective that is designed to improve training stability by limiting the change you make to your policy at each step.
For vanilla policy gradients (e.g., REINFORCE) --- which you should be familiar with, or familiarize yourself with before you read this --- the objective used to optimize the neural network looks like:
This is the standard formula that you would see in the Sutton book, and other resources, where the A-hat could be the discounted return (as in REINFORCE) or the advantage function (as in GAE) for example. By taking a gradient ascent step on this loss with respect to the network parameters, you will incentivize the actions that led to higher reward.
The vanilla policy gradient method uses the log probability of your action (log π(a | s)) to trace the impact of the actions, but you could imagine using another function to do this. Another such function, introduced in this paper, uses the probability of the action under the current policy (π(a|s)), divided by the probability of the action under your previous policy (π_old(a|s)). This looks a bit similar to importance sampling if you are familiar with that:
This r(θ) will be greater than 1 when the action is more probable for your current policy than it is for your old policy; it will be between 0 and 1 when the action is less probable for your current policy than for your old.
Now to build an objective function with this r(θ), we can simply swap it in for the log π(a|s) term. This is what is done in TRPO:
But what would happen here if your action is much more probable (like 100x more) for your current policy? r(θ) will tend to be really big and lead to taking big gradient steps that might wreck your policy. To deal with this and other issues, TRPO adds several extra bells and whistles (e.g., KL Divergence constraints) to limit the amount the policy can change and help guarantee that it is monotonically improving.
Instead of adding all these extra bells and whistles, what if we could build these stabilizing properties into the objective function? As you might guess, this is what PPO does. It gains the same performance benefits as TRPO and avoids the complications by optimizing this simple (but kind of funny looking) Clipped Surrogate Objective:
The first term (blue) inside the minimization is the same (r(θ)A) term we saw in the TRPO objective. The second term (red) is a version where the (r(θ)) is clipped between (1 - e, 1 + e). (in the paper they state a good value for e is about 0.2, so r can vary between ~(0.8, 1.2)). Then, finally, the minimization of both of these terms is taken (green).
Take your time and look at the equation carefully and make sure you know what all the symbols mean, and mathematically what is happening. Looking at the code may also help; here is the relevant section in both the OpenAI baselines and anyrl-py implementations.
Great.
Next, let's see what effect the L clip function creates. Here is a diagram from the paper that plots the value of the clip objective for when the Advantage is positive and negative:
On the left half of the diagram, where (A > 0), this is where the action had an estimated positive effect on the outcome. On the right half of the diagram, where (A < 0), this is where the action had an estimated negative effect on the outcome.
Notice how on the left half, the r-value gets clipped if it gets too high. This will happen if the action became a lot more probable under the current policy than it was for the old policy. When this happens, we do not want to get greedy and step too far (because this is just a local approximation and sample of our policy, so it will not be accurate if we step too far), and so we clip the objective to prevent it from growing. (This will have the effect in the backward pass of blocking the gradient --- the flat line causing the gradient to be 0).
On the right side of the diagram, where the action had an estimated negative effect on the outcome, we see that the clip activates near 0, where the action under the current policy is unlikely. This clipping region will similarly prevent us from updating too much to make the action much less probable after we already just took a big step to make it less probable.
So we see that both of these clipping regions prevent us from getting too greedy and trying to update too much at once and leaving the region where this sample offers a good estimate.
But why are we letting the r(θ) grow indefinitely on the far right side of the diagram? This seems odd as first, but what would cause r(θ) to grow really large in this case? r(θ) growth in this region will be caused by a gradient step that made our action a lot more probable, and it turning out to make our policy worse. If that was the case, we would want to be able to undo that gradient step. And it just so happens that the L clip function allows this. The function is negative here, so the gradient will tell us to walk the other direction and make the action less probable by an amount proportional to how much we screwed it up. (Note that there is a similar region on the far left side of the diagram, where the action is good and we accidentally made it less probable.)
These "undo" regions explain why we must include the weird minimization term in the objective function. They correspond to the unclipped r(θ)A having a lower value than the clipped version and getting returned by the minimization. This is because they were steps in the wrong direction (e.g., the action was good but we accidentally made it less probable). If we had not included the min in the objective function, these regions would be flat (gradient = 0) and we would be prevented from fixing mistakes.
Here is a diagram summarizing this:
And that is the gist of it. The Clipped Surrogate Objective is just a drop-in replacement you could use in the vanilla policy gradient. The clipping limits the effective change you can make at each step in order to improve stability, and the minimization allows us to fix our mistakes in case we screwed it up. One thing I didn't discuss is what is meant by PPO objective forming a "lower bound" as discussed in the paper. For more on that, I would suggest this part of a lecture the author gave.
2. Multiple epochs for policy updating
Unlike vanilla policy gradient methods, and because of the Clipped Surrogate Objective function, PPO allows you to run multiple epochs of gradient ascent on your samples without causing destructively large policy updates. This allows you to squeeze more out of your data and reduce sample inefficiency.
PPO runs the policy using N parallel actors each collecting data, and then it samples mini-batches of this data to train for K epochs using the Clipped Surrogate Objective function. See full algorithm below (the approximate param values are: K = 3-15, M = 64-4096, T (horizon) = 128-2048):
The parallel actors part was popularized by the A3C paper and has become a fairly standard way for collecting data.
The newish part is that they are able to run K epochs of gradient ascent on the trajectory samples. As they state in the paper, it would be nice to run the vanilla policy gradient optimization for multiple passes over the data so that you could learn more from each sample. However, this generally fails in practice for vanilla methods because they take too big of steps on the local samples and this wrecks the policy. PPO, on the other hand, has the built-in mechanism to prevent too much of an update.
For each iteration, after sampling the environment with π_old (line 3) and when we start running the optimization (line 6), our policy π will be exactly equal to π_old. So at first, none of our updates will be clipped and we are guaranteed to learn something from these examples. However, as we update π using multiple epochs, the objective will start hitting the clipping limits, the gradient will go to 0 for those samples, and the training will gradually stop...until we move on to the next iteration and collect new samples.
....
And that's all for now. If you are interested in gaining a better understanding, I would recommend digging more into the original paper, trying to implement it yourself, or diving into the baselines implementation and playing with the code.
[edit: 2019/01/27]: For a better background and for how PPO relates to other RL algorithms, I would also strongly recommend checking out OpenAI's Spinning Up resources and implementations.
PPO, and including TRPO tries to update the policy conservatively, without affecting its performance adversely between each policy update.
To do this, you need a way to measure how much the policy has changed after each update. This measurement is done by looking at the KL divergence between the updated policy and the old policy.
This becomes a constrained optimization problem, we want change the policy in the direction of maximum performance, following the constraints that the KL divergence between my new policy and old do not exceed some pre defined (or adaptive) threshold.
With TRPO, we compute the KL constraint during update and finds the learning rate for this problem (via Fisher Matrix and conjugate gradient). This is somewhat messy to implement.
With PPO, we simplify the problem by turning the KL divergence from a constraint to a penalty term, similar to for example to L1, L2 weight penalty (to prevent a weights from growing large values). PPO makes additional modifications by removing the need to compute KL divergence all together, by hard clipping the policy ratio (ratio of updated policy with old) to be within a small range around 1.0, where 1.0 means the new policy is same as old.
PPO is a simple algorithm, which falls into policy optimization algorithms class (as opposed to value-based methods such as DQN). If you "know" RL basics (I mean if you have at least read thoughtfully some first chapters of Sutton's book for example), then a first logical step is to get familiar with policy gradient algorithms. You can read this paper or chapter 13 of Sutton's book new edition. Additionally, you may also read this paper on TRPO, which is a previous work from PPO's first author (this paper has numerous notational mistakes; just note). Hope that helps. --Mehdi
I think implementing for a discrete action space such as Cartpole-v1 is easier than for continuous action spaces. But for continuous action spaces, this is the most straight-forward implementation I found in Pytorch as you can clearly see how they get mu and std where as I could not with more renowned implementations such as Openai Baselines and Spinning up or Stable Baselines.
RL-Adventure PPO
These lines from the link above:
class ActorCritic(nn.Module):
def __init__(self, num_inputs, num_outputs, hidden_size, std=0.0):
super(ActorCritic, self).__init__()
self.critic = nn.Sequential(
nn.Linear(num_inputs, hidden_size),
nn.ReLU(),
nn.Linear(hidden_size, 1)
)
self.actor = nn.Sequential(
nn.Linear(num_inputs, hidden_size),
nn.ReLU(),
nn.Linear(hidden_size, num_outputs),
)
self.log_std = nn.Parameter(torch.ones(1, num_outputs) * std)
self.apply(init_weights)
def forward(self, x):
value = self.critic(x)
mu = self.actor(x)
std = self.log_std.exp().expand_as(mu)
dist = Normal(mu, std)
return dist, value
and the clipping:
def ppo_update(ppo_epochs, mini_batch_size, states, actions, log_probs, returns, advantages, clip_param=0.2):
for _ in range(ppo_epochs):
for state, action, old_log_probs, return_, advantage in ppo_iter(mini_batch_size, states, actions, log_probs, returns, advantages):
dist, value = model(state)
entropy = dist.entropy().mean()
new_log_probs = dist.log_prob(action)
ratio = (new_log_probs - old_log_probs).exp()
surr1 = ratio * advantage
surr2 = torch.clamp(ratio, 1.0 - clip_param, 1.0 + clip_param) * advantage
I found the link above the comments on this video on Youtube:
arxiv insights PPO

Caffe: what backward function should calculate?

I'm trying to define custom loss function for Caffe using Python layer but I can't clarify what is a required output.
Let's a function for the layer is defined as L = sum(F(xi, yi))/batch_size, where L is loss function to be minimized (i.e. top[0]), x is a network output (bottom[0]), y is ground truth label (i.e. bottom[1]) and xi,yi are i-th samples in a batch.
Widely known example with EuclideanLossLayer (https://github.com/BVLC/caffe/blob/master/examples/pycaffe/layers/pyloss.py) shows that backward level in this case must return bottom[0].diff[i] = dL(x,y)/dxi. Another reference I've found shows the same: Implement Bhattacharyya loss function using python layer Caffe
But in other examples I have seen that it should be multiplied by top[0].diff.
1. What is correct? bottom[0][i] = dL/dx or bottom[0].diff[i] = dL/dxi*top[0].diff[i]
Each loss layer may have loss_weight: indicating the "importance" of this specific loss (in case there are several loss layers for the net). Caffe implements this weight as top[0].diff to be multiplied by the gradients.
Let's back off to basic principles: the purpose of back-propagation is to adjust the layer weights according to the ground-truth feedback. The most basic parts of this include "how far off is my current guess" and "how hard should I yank the change lever?" These are formalized as top.diff and learning_rate, respectively.
At a micro level, the ground truth for each layer is that top feedback, so top.diff is the local avatar of "how far off ...". Thus at some point, you need to include top[0].diff as a primary factor in your adjustment computation.
I know this isn't a complete, direct answer -- but I hope it continues to help even after you solve the immediate problem.

How does keras(or any other ML framework) calculate the gradient of a lambda function layer for backpropagation?

Keras enables adding a layer which calculates a user defined lambda function.
What I don't get is how Keras knows to calculate the gradient of this user defined function for the backpropagation.
That one of the benefit of using Theano/Tensorflow and libraries build on top of them. They can give you automatic gradient calculation of the mathematical functions and operations.
Keras gets them by calling:
# keras/theano_backend.py
def gradients(loss, variables):
return T.grad(loss, variables)
# keras/tensorflow_backend.py
def gradients(loss, variables):
'''Returns the gradients of `variables` (list of tensor variables)
with regard to `loss`.
'''
return tf.gradients(loss, variables, colocate_gradients_with_ops=True)
which are in turn called by the optimizers(keras/optimizers.py) grads = self.get_gradients(loss, params) to get the gradients which are used to write the update rule for all the params. params here are the trainable weights of the layers. But layers created by the Lambda functional layers don't have any trainable weights. But they affect the loss function though the forward prob and hence indirectly affect the calculation of the gradients of trainable weights of other layers.
The only time you need to write new gradient calculation is when you are defining a new basic mathematical operation/function. Also, when you write a custom loss function the auto grad almost always takes care of the gradient calculation. But optionally you can optimize training (not always) if you implement analytical gradient of your custom functions. For example softwax function can be expressed in exp, sum and div and auto grad can take care of it, but its analytical/symbolic grad is usually implemented in Theano/Tensorflow.
For implementing new Ops you can see the below links for that:
http://deeplearning.net/software/theano/extending/extending_theano.html
https://www.tensorflow.org/versions/r0.12/how_tos/adding_an_op/index.html

Resources