I am working on WGAN and would like to implement WGAN-GP.
In the original paper, WGAN-GP enforces the 1-Lipschitz constraint with a gradient penalty. But packages out there like Keras can clip the gradient norm at 1 (which by definition is equivalent to the 1-Lipschitz constraint), so why do we bother to penalize the gradient? Why don't we just clip it?
The reason is that clipping is, in general, a pretty hard constraint in a mathematical sense, not in the sense of implementation complexity. If you check the original WGAN paper, you'll notice that the clipping procedure takes the model's weights and some hyperparameter c, which controls the clipping range.
If c is small, the weights get squeezed into a tiny range of values. The question is how to determine an appropriate value of c. It depends on your model, the dataset in question, the training procedure, and so on. So why not try soft penalizing instead of hard clipping? That's why the WGAN-GP paper adds an extra term to the loss function that pushes the gradient's norm as close to 1 as possible, avoiding a hard collapse to predefined values.
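For concreteness, here is a minimal sketch of that soft penalty (assuming TensorFlow 2 and an image-shaped critic; the function name, gp_weight default and tensor shapes are illustrative): the critic's gradient is taken with respect to points interpolated between real and generated samples, and its norm is pushed towards 1.

    import tensorflow as tf

    def gradient_penalty(critic, real, fake, gp_weight=10.0):
        batch_size = tf.shape(real)[0]
        # Random interpolation between real and generated samples.
        eps = tf.random.uniform([batch_size, 1, 1, 1], 0.0, 1.0)
        interpolated = eps * real + (1.0 - eps) * fake
        with tf.GradientTape() as tape:
            tape.watch(interpolated)
            scores = critic(interpolated, training=True)
        grads = tape.gradient(scores, interpolated)
        # Per-sample norm of the critic's gradient w.r.t. its input.
        norms = tf.sqrt(tf.reduce_sum(tf.square(grads), axis=[1, 2, 3]) + 1e-12)
        # Soft penalty: push the norm towards 1 instead of hard-clipping weights.
        return gp_weight * tf.reduce_mean(tf.square(norms - 1.0))

This term is simply added to the critic's loss, so the constraint is encouraged rather than enforced by construction.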
The answer by CaptainTrunky is correct, but I also wanted to point out one really important aspect.
Citing the original WGAN-GP paper:
Implementing k-Lipschitz constraint via weight clipping biases the critic towards much simpler functions. As stated previously in [Corollary 1], the optimal WGAN critic has unit gradient norm almost everywhere under Pr and Pg; under a weight-clipping constraint, we observe that our neural network architectures which try to attain their maximum gradient norm k end up learning extremely simple functions.
So as you can see, weight clipping may lead to undesired behaviour (it depends on the data you want to generate; the authors of the article state that it doesn't always behave like that). When you try to train a WGAN to generate more complex data, the task has a high probability of failure.
Related
If I want to optimize a function with respect to some constrained value, I can find a bijective map between an unconstrained space and the constrained space, then optimize the composition of the original function and the bijective map with respect to the unconstrained value.
Does optimizing in a different space affect the performance or accuracy of optimization? And does it vary between bijective maps?
My use case is training constrained Gaussian process model hyperparameters in GPflow using TensorFlow Probability's bijectors.
If I understand you correctly, you might have for example some variable that is constrained to be positive and want to optimize it. And for that you train the variable in the unconstrained space?
That would be pretty common in machine learning where you for example enforce a variance (of let's say the likelihood) to be positive by taking the exponent of the unconstrained value.
I guess the effect on the optimization very much depends on how you optimize it. For gradient based methods it does have an effect, and sometimes small tricks are helpful to improve those issues (e.g. shifting, so that your transformation is tf.exp(shift_val + unconstrained_variable) ).
And yes, afaik it varies between different mappings. In my example, the softplus and exponential transformations result in different gradients. Though I'm not sure there's a consensus on which one is preferable.
I'd just try a few different ones. As long as it doesn't lead to numerical issues, either transformation/bijection should be fine.
EDIT: just to clarify. The bijection should not affect the solution space, just the optimization path itself.
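To make that concrete, here is a toy sketch (assuming TensorFlow 2; the target value and squared-error loss are stand-ins for a real objective such as a GP marginal likelihood): a positive "variance" is optimized through an unconstrained variable, and you can swap the softplus transform for exp to compare optimization paths.

    import tensorflow as tf

    unconstrained = tf.Variable(0.0)

    def to_positive(u, kind="softplus", shift=0.0):
        # Both transforms map the real line to positive values,
        # but they give different gradients and different paths.
        if kind == "exp":
            return tf.exp(shift + u)
        return tf.math.softplus(shift + u)

    target = 2.5  # pretend this is the minimizer of the real objective
    opt = tf.keras.optimizers.Adam(0.1)
    for _ in range(200):
        with tf.GradientTape() as tape:
            variance = to_positive(unconstrained, kind="softplus")
            loss = (variance - target) ** 2  # stand-in objective
        grads = tape.gradient(loss, [unconstrained])
        opt.apply_gradients(zip(grads, [unconstrained]))
    # to_positive(unconstrained) is now close to target; rerunning with
    # kind="exp" also converges, just along a different path.

Either way the set of reachable constrained values is the same, which is the point of the EDIT above.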
I'm going through CS231n to understand the basics of neural networks.
Attached is the slide in which Justin (the tutor) gives the reasoning for why data preprocessing is required. His verbal explanation is essentially the same as what's on the slide, and I don't get it. The slide is below.
The second question I have is: is it actually normalisation or standardisation? This link implies that it is standardisation, whereas the course material says it is normalisation.
Any help will be appreciated.
A) The meaning of "less sensitive to small changes in weights" can easily be visualized. Imagine making a little change in the weights of the drawn hyperplane, i.e. rotating it a bit. If the samples are located around the origin, you'll notice that they can still be correctly classified. If they're far away from the origin, the same little change in weights will lead to bigger misclassifications.
B) Sometimes standardization and normalization are used interchangeably.
Standardization: I quote from Pattern Recognition and Machine Learning by Bishop: "For the purposes of this example, we have made a linear re-scaling of the data, known as standardizing, such that each of the variables has zero mean and unit standard deviation."
Normalization could be e.g. min-max normalization when you scale all feature values to the [0,1] range, or feature vector normalization when you divide the feature vector by its modulus.
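A quick NumPy illustration of the three operations mentioned above (the toy feature matrix is made up):

    import numpy as np

    X = np.random.randn(100, 3) * 5.0 + 10.0  # toy feature matrix

    # Standardization: zero mean, unit standard deviation per feature.
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)

    # Min-max normalization: each feature scaled into [0, 1].
    X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

    # Feature-vector normalization: each sample divided by its own norm (modulus).
    X_unit = X / np.linalg.norm(X, axis=1, keepdims=True)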
I am currently training an LSTM RNN for time-series forecasting. I understand that it is common practice to clip the gradients of the RNN when it crosses a certain threshold. However, I am not completely clear on whether or not this includes the output layer.
If we call the hidden layer of an RNN h, then the output is sigmoid(connected_weights*h + bias). I know that the gradients for the weights for determining the hidden layer are clipped, but does the same go for the output layer?
In other words, are the gradients for the connected_weights also clipped in gradient clipping?
While nothing prevents you from clipping them as well, there is no reason to do so. A nice paper with reasons is here, I'll try to give you an overview.
The problem we're trying to solve by gradient clipping is that of exploding gradients: Let's assume that your RNN layer is computed like this:
h_t = sigmoid(U * x_t + W * h_{t-1} + b)
So forgetting about the nonlinearity for a while, you could say that the current state h_t depends on some earlier state h_{t-T} as h_t = W^T * h_{t-T} + inputs (here W^T means W multiplied by itself T times, not the transpose). So if the matrix W inflates the hidden state, the influence of that old hidden state grows exponentially with time. The same happens as you backpropagate the gradient, resulting in gradients that will most likely take you to some useless point in the parameter space.
On the other hand, the output layer is applied just once during both forward and backward pass, so while it may complicate the learning, it will only be by a 'constant' factor, independent of the unrolling in time.
To get a bit more technical: The crucial quantity which determines whether you get exploding gradients is the largest eigenvalue of W. If it is larger than one (or smaller than -1, in which case it's real fun :-)), you get exploding gradients. Conversely, if it's smaller than one in absolute value, you'll suffer from vanishing gradients, making it difficult to learn long-term dependencies. You can find a nice discussion of these phenomena here, with pointers to classical literature.
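A tiny NumPy demonstration of that point (the 4-dimensional state and 50 steps are arbitrary): repeatedly multiplying by a W whose largest eigenvalue is just above or just below 1 makes the backpropagated vector explode or vanish.

    import numpy as np

    # Repeatedly applying W mimics backpropagating through T steps of a linear RNN.
    rng = np.random.default_rng(0)
    h = rng.standard_normal(4)

    for scale in (0.9, 1.1):
        W = scale * np.eye(4)            # largest eigenvalue = scale
        g = h.copy()
        for _ in range(50):              # 50 unrolled time steps
            g = W.T @ g
        print(scale, np.linalg.norm(g))  # 0.9 -> vanishes, 1.1 -> explodes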
If we take the sigmoid back into the picture, it becomes more difficult to get exploding gradients, as the gradients get dampened by at least a factor of 4 when being backpropagated through it. But still, have an eigenvalue larger than 4 and you'll have adventures :-) It's rather important to initialize carefully; the second paper gives some hints. With tanh, there is little dampening around zero, and ReLU just propagates the gradient through, so these are rather prone to gradient explosions and thus sensitive to initialization and gradient clipping.
Overall, LSTMs have better learning properties than vanilla RNNs, esp. with regard to the vanishing gradients. Though from my experience, gradient clipping is usually necessary with them as well.
EDIT: When to clip?
Right before the update of the weights, i.e. you do the backprop unaltered. The thing is that gradient clipping is kind of a dirty hack. You still want your gradient to be as precise as possible, so you'd better not distort it in the middle of the backprop. Just that if you see the gradient become very large, you say "Nah, this smells. I'd better make a tiny step," and clipping is an easy way to do it (it may be that only some elements of the gradient have exploded while the others are still well behaved and informative). With most of the toolkits, you don't have the choice anyway, because the backpropagation happens atomically.
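As a sketch of "clip right before the update" (assuming TensorFlow 2 / Keras; the model, loss and clip_norm value are just placeholders):

    import tensorflow as tf

    model = tf.keras.Sequential([
        tf.keras.layers.LSTM(32, input_shape=(None, 8)),
        tf.keras.layers.Dense(1),
    ])
    optimizer = tf.keras.optimizers.Adam(1e-3)

    def train_step(x, y, clip_norm=1.0):
        with tf.GradientTape() as tape:
            pred = model(x, training=True)
            loss = tf.reduce_mean(tf.square(pred - y))
        grads = tape.gradient(loss, model.trainable_variables)  # unaltered backprop
        grads, _ = tf.clip_by_global_norm(grads, clip_norm)     # clip only at the end
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
        return loss

Note that the clipping is applied jointly to all gradients, including the output layer's, simply because that is what the toolkit gives you; as argued above, it is the recurrent weights that actually need it.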
I'm interested in taking advantage of some partially labeled data that I have in a deep learning task. I'm using a fully convolutional approach, not sampling patches from the labeled regions.
I have masks that outline regions of definite positive examples in an image, but the unmasked regions in the images are not necessarily negative - they may be positive. Does anyone know of a way to incorporate this type of class in a deep learning setting?
Triplet/contrastive loss seems like it may be the way to go, but I'm not sure how to accommodate the "fuzzy" or ambiguous negative/positive space.
Try label smoothing as described in section 7.5.1 of Deep Learning book:
We can assume that for some small constant eps, the training set label y is correct with probability 1 - eps, and otherwise any of the other possible labels might be correct.
Label smoothing regularizes a model based on a softmax with k output values by replacing the hard 0 and 1 classification targets with targets of eps / k and 1 - (k - 1) / k * eps, respectively.
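A minimal NumPy sketch of that formula (eps = 0.1 and the 3-class example are arbitrary):

    import numpy as np

    def smooth_labels(y_onehot, eps=0.1):
        # Replace hard 0/1 targets with eps/k and 1 - (k-1)/k * eps.
        k = y_onehot.shape[-1]
        return y_onehot * (1.0 - eps) + eps / k

    y = np.eye(3)[[0, 2, 1]]          # hard one-hot targets for 3 classes
    print(smooth_labels(y, eps=0.1))  # off-targets ~0.033, targets ~0.933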
See my question about implementing label smoothing in Pandas.
Otherwise, if you know for sure that some areas are negative and others are positive while some are uncertain, then you can introduce a third, "uncertain" class. I have worked with data sets that contained an uncertain class, which corresponded to samples that could belong to any of the available classes.
I'm assuming that you are struggling with a segmentation task with an ill-defined background (e.g. you are not sure whether all examples are correctly labeled). I recently came across a similar problem, and this is what I found during my research:
In the old days before deep learning, and at the beginning of the deep learning era, the common way to deal with this was to smooth your output with some kind of probability model that takes into account the possibility of noisy labels (you can read about this in the "Learning to Label from Noisy Data" chapter of this book). It's important to distinguish these probabilistic models from models used to smooth your labels w.r.t. image or label structure, like classical CRFs for bilateral smoothing.
What we finally used (and it worked really well) is the Channel Inhibited Softmax idea from this paper. In terms of mathematical properties, it makes your network much more robust to some objects not being labeled, because it pushes the network to output much higher positive-valued logits at correctly labeled objects.
You could treat this as a semi-supervised problem. Use the full dataset without labels to train a bottleneck autoencoder structure (or a GAN approach). This pretrained model can then be adjusted (e.g. removing the last layers, adding a better layer structure at the end on top of the bottleneck features) and finetuned on the labeled data.
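A rough Keras sketch of that idea (the input size, layer widths and the 1x1 head are placeholders, and the fit calls are commented out because they depend on your data): pretrain an autoencoder on all images without labels, then reuse the encoder and finetune a head on the labeled subset.

    import tensorflow as tf

    inputs = tf.keras.Input(shape=(64, 64, 1))
    x = tf.keras.layers.Conv2D(16, 3, strides=2, padding="same", activation="relu")(inputs)
    bottleneck = tf.keras.layers.Conv2D(8, 3, strides=2, padding="same", activation="relu")(x)
    x = tf.keras.layers.Conv2DTranspose(16, 3, strides=2, padding="same", activation="relu")(bottleneck)
    recon = tf.keras.layers.Conv2DTranspose(1, 3, strides=2, padding="same", activation="sigmoid")(x)

    # Stage 1: unsupervised pretraining on the full, unlabeled dataset.
    autoencoder = tf.keras.Model(inputs, recon)
    autoencoder.compile(optimizer="adam", loss="mse")
    # autoencoder.fit(unlabeled_images, unlabeled_images, ...)

    # Stage 2: reuse the encoder, add a (coarse) prediction head, finetune on labels.
    head = tf.keras.layers.Conv2D(1, 1, activation="sigmoid")(bottleneck)
    finetuned = tf.keras.Model(inputs, head)
    finetuned.compile(optimizer="adam", loss="binary_crossentropy")
    # finetuned.fit(labeled_images, masks, ...)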
I understand how to compute the forward pass in deep learning. Now I want to understand the backward pass. Let's take X(2,2) as an example. The backward pass at position X(2,2) can be computed as in the figure below.
My question is: where does dE/dY (such as dE/dY(1,1), dE/dY(1,2), ...) come from in the formula? How do I compute it at the first iteration?
SHORT ANSWER
Those terms are in the final expansion at the bottom of the slide; they contribute to the summation for dE/dX(2,2). In your first back-propagation, you start at the end and work backwards (hence the name), and at the last layer the Y values are the ground-truth labels. So much for computing them. :-)
LONG ANSWER
I'll keep this in more abstract, natural-language terms. I'm hopeful that the alternate explanation will help you see the big picture as well as sort out the math.
You start the training with assigned weights that may or may not be at all related to the ground truth (labels). You move blindly forward, making predictions at each layer based on naive faith in those weights. The Y(i,j) values are the resulting meta-pixels from that faith.
Then you hit the labels at the end. You work backward, adjusting each weight. Note that, at the last layer, the Y values are the ground-truth labels.
At each layer, you mathematically deal with two factors:
How far off was this prediction?
How heavily did this parameter contribute to that prediction?
You adjust the X-to-Y weight by "off * contribution * learning_rate", where a weight's contribution is the activation it multiplies.
When you complete that for layer N, you back up to layer N-1 and repeat.
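A toy numeric sketch of those two factors for a single linear weight (everything here is made up; with E = 0.5 * off^2, the weight's contribution is the input it multiplies):

    import numpy as np

    rng = np.random.default_rng(0)
    w = rng.standard_normal()      # guess-work starting weight
    lr = 0.1
    x, y_true = 2.0, 3.0           # one training example

    for _ in range(20):
        y_pred = w * x
        off = y_pred - y_true      # factor 1: how far off was this prediction?
        grad = off * x             # factor 2: how heavily did this weight contribute?
        w -= lr * grad             # adjust by off * contribution * learning_rate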
PROGRESSION
Whether you initialize your weights with fixed or random values (I generally recommend the latter), you'll notice that there's really not much progress in the early iterations. Since this is slow adjustment from guess-work weights, it takes several iterations to get a glimmer of useful learning into the last layers. The first layers are still cluelessly thrashing at this point. The loss function will bounce around close to its initial values for a while. For instance, with GoogLeNet's image recognition, this flailing lasts for about 30 epochs.
Then, finally, you get some valid learning in the latter layers, the patterns stabilize enough that some consistency percolates back to the early layers. At this point, you'll see the loss function drop to a "directed experimentation" level. From there, the progression depends a lot on the paradigm and texture of the problem: some have a sharp drop, then a gradual convergence; others have a more gradual drop, almost an exponential decay to convergence; more complex topologies have additional sharp drops as middle or early phases "get their footing".