I have a cost function of the following form as part of the computational graph:
cost = term_1 - alpha * term_2
I want to dynamically anneal the value of alpha during training but I cannot find a straightforward way to do it. Do you have any suggestions?
Thanks
You can create a placeholder for alpha and feed it a new value at each training step. Your problem is similar to setting an adaptive learning rate, so check out this question: How to set adaptive learning rate for GradientDescentOptimizer?
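A minimal sketch of the idea in plain Python, assuming a linear annealing schedule. The names `annealed_alpha`, `alpha_start`, `alpha_end` and `anneal_steps` are illustrative and not from the question. In a TensorFlow 1.x graph, the computed value would be supplied to the placeholder through `feed_dict` on each training step.

```python
def annealed_alpha(step, alpha_start=1.0, alpha_end=0.0, anneal_steps=1000):
    """Linearly move alpha from alpha_start to alpha_end over anneal_steps.

    The linear schedule is an assumption; any function of the step would do.
    """
    fraction = min(max(step / anneal_steps, 0.0), 1.0)
    return alpha_start + fraction * (alpha_end - alpha_start)


def cost(term_1, term_2, alpha):
    """cost = term_1 - alpha * term_2, with alpha supplied from outside."""
    return term_1 - alpha * term_2


# Hypothetical training loop: alpha is recomputed and passed in at every step.
for step in (0, 500, 1000):
    alpha = annealed_alpha(step)
    print(step, alpha, cost(2.0, 1.0, alpha))
```

Because alpha is an input rather than a constant baked into the graph, the schedule can be changed without rebuilding the cost.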
I am working on WGAN and would like to implement WGAN-GP.
In the original paper, WGAN-GP is implemented with a gradient penalty to enforce the 1-Lipschitz constraint. But packages like Keras can clip the gradient norm at 1 (which by definition is equivalent to the 1-Lipschitz constraint), so why do we bother to penalize the gradient? Why don't we just clip the gradient?
The reason is that clipping is, in general, a very hard constraint in the mathematical sense, not in the sense of implementation complexity. If you check the original WGAN paper, you'll notice that the clipping procedure takes the model's weights and a hyperparameter c, which controls the clipping range.
If c is small, the weights are severely clipped to a tiny range of values. The question is how to determine an appropriate value of c. It depends on your model, the dataset in question, the training procedure, and so on. So why not try a soft penalty instead of hard clipping? That's why the WGAN-GP paper adds a term to the loss function that pushes the gradient norm to be as close to 1 as possible, instead of hard-clipping the weights to a predefined range.
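The difference between the two constraints can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: the weights and the gradient are plain lists, whereas in WGAN-GP the gradient is that of the critic's output with respect to its input. The function names and the default `lam=10.0` are illustrative.

```python
import math


def clip_weights(weights, c):
    """Hard constraint (WGAN style): force every weight into [-c, c]."""
    return [max(-c, min(c, w)) for w in weights]


def gradient_penalty(grad, lam=10.0):
    """Soft constraint (WGAN-GP style): lam * (||grad||_2 - 1)^2."""
    norm = math.sqrt(sum(g * g for g in grad))
    return lam * (norm - 1.0) ** 2


# Clipping discards information about weights outside the range.
print(clip_weights([0.5, -0.003, 0.02], c=0.01))

# The penalty grows smoothly as the gradient norm moves away from 1.
print(gradient_penalty([0.6, 0.8]))  # norm close to 1, penalty close to 0
print(gradient_penalty([3.0, 4.0]))  # norm 5, large penalty
```

The penalty is added to the critic's loss, so the optimizer trades it off against the rest of the objective instead of being forced into a fixed range.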
The answer by CaptainTrunky is correct, but I also want to point out one really important aspect.
Citing the original WGAN-GP paper:
Implementing k-Lipshitz constraint via weight clipping biases the critic towards much simpler functions. As stated previously in [Corollary 1], the optimal WGAN critic has unit gradient norm almost everywhere under Pr and Pg; under a weight-clipping constraint, we observe that our neural network architectures which try to attain their maximum gradient norm k end up learning extremely simple functions.
So, as you can see, weight clipping may lead to undesired behaviour (it depends on the data you want to generate; the authors of the paper state that it doesn't always behave like that). When you try to train a WGAN to generate more complex data, the task has a high chance of failure.
I'm interested in taking advantage of some partially labeled data that I have in a deep learning task. I'm using a fully convolutional approach, not sampling patches from the labeled regions.
I have masks that outline regions of definite positive examples in an image, but the unmasked regions in the images are not necessarily negative - they may be positive. Does anyone know of a way to incorporate this type of class in a deep learning setting?
Triplet/contrastive loss seems like it may be the way to go, but I'm not sure how to accommodate the "fuzzy" or ambiguous negative/positive space.
Try label smoothing, as described in section 7.5.1 of the Deep Learning book:
We can assume that for some small constant eps, the training set label y is correct with probability 1 - eps, and otherwise any of the other possible labels might be correct.
Label smoothing regularizes a model based on a softmax with k output values by replacing the hard 0 and 1 classification targets with targets of eps / k and 1 - (k - 1) / k * eps, respectively.
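The targets from the quoted passage can be computed directly. A minimal sketch in plain Python; the function name `smooth_labels` and the default `eps=0.1` are illustrative choices, not from the book.

```python
def smooth_labels(true_class, k, eps=0.1):
    """Return smoothed targets for one sample with k classes.

    The true class gets 1 - (k - 1) / k * eps and every other class
    gets eps / k, so the targets still sum to 1.
    """
    off_value = eps / k
    on_value = 1.0 - (k - 1) / k * eps
    return [on_value if i == true_class else off_value for i in range(k)]


targets = smooth_labels(true_class=2, k=4, eps=0.1)
print(targets)
print(sum(targets))
```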
See my question about implementing label smoothing in Pandas.
Otherwise, if you know for sure that some areas are negative, others are positive, and some are uncertain, you can introduce a third, uncertain class. I have worked with datasets that contained an uncertain class, which corresponded to samples that could belong to any of the available classes.
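As a rough sketch of the three-class idea, assuming you have one mask of definite positives and one of definite negatives per image, a per-pixel label map could be built like this. The class ids and the function name are hypothetical.

```python
NEGATIVE, POSITIVE, UNCERTAIN = 0, 1, 2  # hypothetical class ids


def build_label_map(positive_mask, negative_mask):
    """Combine two binary masks into a per-pixel map with three classes.

    Pixels in neither mask are labeled UNCERTAIN. If a pixel is in both
    masks, the positive mask wins (an arbitrary choice for this sketch).
    """
    labels = []
    for pos_row, neg_row in zip(positive_mask, negative_mask):
        row = []
        for pos, neg in zip(pos_row, neg_row):
            if pos:
                row.append(POSITIVE)
            elif neg:
                row.append(NEGATIVE)
            else:
                row.append(UNCERTAIN)
        labels.append(row)
    return labels


positive = [[1, 0, 0], [0, 0, 0]]
negative = [[0, 0, 1], [0, 1, 0]]
print(build_label_map(positive, negative))
```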
I'm assuming that you are struggling with a segmentation task with an ill-defined background (e.g. you are not sure whether all examples are correctly labeled). I recently came across a similar problem, and this is what I found during my research:
In the old days before deep learning, and at the beginning of the deep learning era, the common way to deal with this was to smooth your output with some kind of probabilistic model that takes into account the possibility of noisy labels (you can read about this in the Learning to Label from Noisy Data chapter of this book). It's important to distinguish these probabilistic models from models used to smooth your labels w.r.t. image or label structure, like classical CRFs for bilateral smoothing.
What we finally used (and it worked really well) is the Channel Inhibited Softmax idea from this paper. In terms of its mathematical properties, it makes your network much more robust to unlabeled objects, because it makes the network output much higher positive logits at correctly labeled objects.
You could treat this as a semi-supervised problem. Use the full dataset without labels to train a bottleneck autoencoder structure (or a GAN approach). This pretrained model can then be adjusted (e.g. removing the last layers, adding a better layer structure at the end on top of the bottleneck features) and finetuned on the labeled data.
I want to apply an SVM to an audio dataset. I am extracting different features from the speech signal. After reducing the dimension of this matrix, I still get the features in matrix form. Can anyone help me with the data formatting?
Do I have to convert the feature matrix into a row vector? Can I assign the same label to each row of one feature matrix and another label to the rows of the other matrix?
The question is a little ambiguous, but let me try to help. For feature selection, you can use filter methods, wrapper methods, etc. One popular method is principal component analysis. Once you have selected your features, you can feed them directly to the classifier. In your case, I guess you are getting a lower-dimensional representation of your training data (for example, if you have used SVD). In that case it's fine; you can now use it for SVM classification.
What did you mean by adding a label to the feature matrix? You add labels to the training instances, not to the features. I guess you are talking about a separate matrix for each class label. If that is the case, yes, you can use them as you want, but remember that it depends on the model design.
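The two formatting options from the question can be sketched as follows, assuming each feature matrix has one row per frame (instance) and one column per feature. The helper names are hypothetical. The resulting `X` and `y` have the samples-by-features layout that SVM implementations such as scikit-learn's `SVC.fit(X, y)` expect.

```python
def stack_rows_with_labels(matrix_by_label):
    """Treat every row as one training instance and give it the label
    of the matrix it came from."""
    X, y = [], []
    for label, matrix in matrix_by_label.items():
        for row in matrix:
            X.append(list(row))
            y.append(label)
    return X, y


def flatten_to_row_vector(matrix):
    """Alternative: turn a whole matrix into a single instance.

    This only works if all matrices have the same shape.
    """
    return [value for row in matrix for value in row]


speech_a = [[0.1, 0.2], [0.3, 0.4]]  # made-up features, class "a"
speech_b = [[0.9, 0.8], [0.7, 0.6]]  # made-up features, class "b"

X, y = stack_rows_with_labels({"a": speech_a, "b": speech_b})
print(X)
print(y)
print(flatten_to_row_vector(speech_a))
```

Which option fits depends on whether a single frame or a whole recording is the thing being classified.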
What will happen if I multiply the loss function by a constant? I think I will get a larger gradient, right? Is it equivalent to having a larger learning rate?
Basically, it depends on several things:
1. If you use classic stochastic / mini-batch / full-batch gradient descent with the update rule
new_weights = old_weights - learning_rate * gradient
then, because multiplication is commutative, your claim is true.
2. If you are using a learning method with an adaptive learning rate (like Adam or RMSprop), things change a little. Your gradients are still scaled by the constant, but the effective step size may not be affected at all, because these methods normalize the update by a running estimate of the gradient magnitude. It depends on how the new values of the cost function interact with the learning algorithm.
3. If you use a learning method that adapts the gradient but not the learning rate (e.g. momentum methods), the learning rate is usually affected in the same way as in point 1.
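The first two cases can be checked numerically on a toy problem. This is a sketch under simple assumptions: a one-dimensional loss `scale * (w - 3)^2` and a textbook Adam update written out by hand. All names and hyperparameter values are illustrative.

```python
import math


def grad(w, scale=1.0):
    """Gradient of scale * (w - 3)^2 with respect to w."""
    return scale * 2.0 * (w - 3.0)


def sgd(w, lr, scale, steps):
    """Plain gradient descent: w <- w - lr * gradient."""
    for _ in range(steps):
        w -= lr * grad(w, scale)
    return w


def adam(w, lr, scale, steps, b1=0.9, b2=0.999, eps=1e-8):
    """Textbook Adam update with bias correction."""
    m = v = 0.0
    for t in range(1, steps + 1):
        g = grad(w, scale)
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g * g
        m_hat = m / (1 - b1 ** t)
        v_hat = v / (1 - b2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w


# SGD: scaling the loss by 10 behaves like scaling the learning rate by 10.
print(sgd(0.0, lr=0.01, scale=10.0, steps=20), sgd(0.0, lr=0.1, scale=1.0, steps=20))

# Adam: the scale largely cancels, so both runs end up almost identical.
print(adam(0.0, lr=0.01, scale=10.0, steps=20), adam(0.0, lr=0.01, scale=1.0, steps=20))
```

The cancellation in Adam is not exact because of the small `eps` term in the denominator.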
Yes, you are right. It is equivalent to changing the learning rate.
I have two grayscale images, say G1 and G2. I also have their statistics (min, max, mean, and standard deviation). I would like to change G2 so that its statistics (min, max, mean, and SD) match those of G1. I have tried arithmetic scaling and got the min and max values of G1 and G2 to match, but the mean and SD are still different. I have also tried histogram fitting of G2 to G1, but that did not do what I wanted either. I am using software called SPIDER, but this question applies to image processing in general, which can be performed with different software packages (OpenCV, MATLAB, etc.). Any ideas and suggestions would be greatly appreciated.
The easiest thing to do is to apply histogram equalization to both images (histeq in MATLAB). If you do not want to change both images, then you can do histogram matching, but that's a bit more complicated.
You can generate a mapping of input to output based on a simple curve. Start with the values that don't have any dependencies, min and max - those will set the ends of the curve. Now map the mean values to create a single point in the middle of the curve. To modify the standard deviation, you change the shape of the curve between the mean and the endpoints - a curve that is flatter in the middle will give less deviation, and a curve that is flatter towards the ends but steeper in the middle will magnify it.
Edit: I haven't given this enough thought yet. Changing the shape of the curve will also change the mean, but I think it can be worked into something usable.
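For comparison, the simplest such curve is a straight line. A linear map has only two free parameters, so it can match the mean and SD exactly, or the min and max, but in general not all four at once. A minimal sketch for the mean/SD case, assuming the image is given as a flat list of pixel values (the function name is hypothetical):

```python
import statistics


def match_mean_sd(pixels, target_mean, target_sd):
    """Linearly rescale pixels so their mean and (population) standard
    deviation equal the targets. Min and max are not controlled."""
    mean = statistics.mean(pixels)
    sd = statistics.pstdev(pixels)
    if sd == 0:
        raise ValueError("constant image: standard deviation is zero")
    gain = target_sd / sd
    return [(p - mean) * gain + target_mean for p in pixels]


g1 = [10, 20, 30, 40]  # made-up pixel values standing in for G1
g2 = [0, 1, 5, 6]      # made-up pixel values standing in for G2

g2_matched = match_mean_sd(g2, statistics.mean(g1), statistics.pstdev(g1))
print(statistics.mean(g2_matched), statistics.pstdev(g2_matched))
print(min(g2_matched), max(g2_matched))
```

The rescaled values may fall outside the valid intensity range, so clamping or a non-linear curve as described above would still be needed to also pin the min and max.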
I marked the histogram equalization answer as right because it gave me the best results. However, I was unable to make the two images exactly statistically equivalent as such.