While training the two networks, the generator converges slowly, but then it suddenly enters a dead zone where every output is zero.
I believe the problem is caused by the learning rate and by the discriminator converging too fast, which forces the generator to output zeros everywhere so that it can't find a good solution.
I reduced the learning rate for both networks and tried different implementations, but every time I reach 2000 epochs or more the generator enters the dead zone.
Any advice from GAN experts/users?
It is natural for a GAN to die; how soon it happens depends on your particular problem. A GAN is doomed to die because the point at which the discriminator is satisfied can never be reached, by definition. That means you end up with either oscillations (gen "power" >> disc) or mode collapse (gen << disc).
Try several techniques. In my experience, the best and easiest known approach is weight norm + PReLU.
Try to make your disc weaker (smaller state, dropout, smoothing categories, skipping training, everything reasonable you can imagine is fine)
Again, I have to promote a paper about stochastic deconvolutions that we are going to publish soon. It is a simple trick that improves stability considerably. Details here: Training GAN on small dataset of images
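Two of the "weaker discriminator" ideas above (smoothing the categories and skipping training steps) can be sketched in a few lines. This is only a rough illustration: the function names, the smoothing range [0.8, 1.0] and the loss ratio of 0.5 are assumptions chosen for the example, not values from the answer.

```python
import random

def smooth_real_labels(n, low=0.8, high=1.0, rng=random):
    """One-sided label smoothing: targets for real samples are drawn
    from [low, high] instead of being exactly 1.0 (range is an assumption)."""
    return [rng.uniform(low, high) for _ in range(n)]

def should_train_discriminator(d_loss, g_loss, ratio=0.5):
    """Skip the discriminator update while its loss is much lower than the
    generator's, i.e. while it is far ahead (the ratio is an assumption)."""
    return d_loss >= ratio * g_loss

# Example: the discriminator is far ahead, so its update is skipped.
print(should_train_discriminator(d_loss=0.1, g_loss=2.0))  # False
print(should_train_discriminator(d_loss=1.0, g_loss=1.2))  # True
```

In a real training loop the two losses would come from the previous step, and the smoothed labels would replace the all-ones target in the discriminator's loss for real images.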
I am training a deep model for MRI segmentation. The models I am using are U-Net++ and UNet3+. However, when plotting the validation and training losses of these models over time, I find that they all end with a sudden drop in loss, and a permanent plateau. Any ideas for what could be causing this plateau? or any ideas for how I could surpass it?
Here are the plots for the training and validation loss curves, and the corresponding segmentation performance (dice score) on the validation set. The drop in loss occurs at around epoch 80 and is pretty obvious in the graphs.
In regard to the things I've tried:
Perhaps a local minimum is being found that is hard to escape, so I tried resuming training at epoch 250 with the learning rate increased by a factor of 10, but the plateau stays exactly the same no matter how many epochs I keep training. I also tried resuming with the LR reduced by a factor of 10 and of 100, with no change either.
Perhaps the model has too many parameters, i.e. the plateau is happening due to over-fitting. So I tried training models that have fewer parameters. This changed the actual loss value (Y-axis value) at which the plateau occurs, but the general shape of a sudden drop followed by a plateau stays the same. I also tried increasing the number of parameters (because it was easy to do), and the same problem is observed.
Any ideas for what could be causing this plateau? or any ideas for how I could surpass it?
Due to the high number of parameters it is hard if not impossible to reason about the optimization landscape, so any speculations are really just that, speculations.
If you assume that the model got stuck somewhere, that is, that the gradient is getting very small (it's sometimes worth plotting the distribution of the entries of the gradient over time too, or at least its magnitude), it is sometimes worth artificially forcing the optimizer to adapt by changing the environment. One popular way to do so is weight decay: for instance, use ordinary weight decay with SGD or, if you're using Adam, switch to AdamW. Warm restarts are an alternative based on a similar idea.
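To make the two suggestions concrete, here is a small framework-free sketch: one helper that computes the global gradient magnitude you could log every epoch, and one that produces a cosine learning-rate schedule with warm restarts. The fixed restart period of 50 epochs and the learning-rate bounds are assumptions for illustration; in practice you would probably use the scheduler that ships with your framework.

```python
import math

def grad_norm(grads):
    """Global L2 norm over a list of per-parameter gradient lists."""
    return math.sqrt(sum(g * g for layer in grads for g in layer))

def warm_restart_lr(epoch, lr_max=1e-3, lr_min=1e-5, period=50):
    """Cosine annealing that jumps back to lr_max every `period` epochs.
    A fixed period is a simplification; it can also grow after each restart."""
    t = epoch % period
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t / period))

# The rate decays within a cycle and resets at the start of the next one.
for epoch in (0, 25, 49, 50):
    print(epoch, warm_restart_lr(epoch))
```

If the logged gradient norm collapses towards zero right at the sudden drop, that supports the "stuck" hypothesis; if it stays roughly constant, the plateau is more likely a limit of the model or the data.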
Finally, it might very well be that you have reached the limits of what your model can achieve. A dice score in the neighbourhood of 0.9 is already quite good in many of today's segmentation tasks.
I'm fairly new to machine learning, and working on optimizing hyperparameters for my model. I'm doing this via a randomized search. My question is: should I be searching over # of epochs and batch size along with my other hyperparameters (e.g. loss function, number of layers, etc.)? If not, should I fix these values first, find the other parameters, then return to tune these?
My concern is a) that searching over many epochs will be extremely time-consuming, so leaving it at one low value for the initial scan would be useful and b) that these parameters, esp. # of epochs, will disproportionately affect the results when the model is behaving well, and won't really give me much information about the rest of my architecture, as there should be a regime where more epochs, up to a point, are better. I know this isn't totally accurate, i.e. # of epochs is a real hyperparameter and too many can lead to overfitting issues, for example. Currently, my model is not clearly improving with # of epochs, though it was suggested by someone working on a similar problem within my area of research that this may be mitigated by implementing batch normalization, which is another parameter I am testing. Finally, I am worried that batch size will be quite affected by the fact that I am scaling my data down to 60% to allow my code to run reasonably (and I think the final model will be trained on vastly more data than the simulated data currently available to me).
I agree with your intuition on epochs. It is common to keep this value as low as possible in order to complete more training "experiments" in the same number of working hours. I don't have a great reference here, but I would welcome one in the comments.
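As a rough sketch of what this looks like in a randomized search: the number of epochs is pinned to a small constant while the other hyperparameters are sampled. Everything here is illustrative; the search ranges, the value of `FIXED_EPOCHS`, and the user-supplied `evaluate` function (assumed to train a model and return its validation loss) are assumptions, not part of the question.

```python
import random

FIXED_EPOCHS = 5  # assumption: a small budget kept constant across trials

def sample_config(rng):
    """Draw one configuration; learning rate and weight decay are log-uniform."""
    return {
        "learning_rate": 10 ** rng.uniform(-5, -1),
        "batch_size": rng.choice([16, 32, 64, 128]),
        "weight_decay": 10 ** rng.uniform(-6, -2),
        "use_batch_norm": rng.choice([True, False]),
        "epochs": FIXED_EPOCHS,
    }

def random_search(evaluate, n_trials=20, seed=0):
    """Return (best_loss, best_config); `evaluate` maps a config to a loss."""
    rng = random.Random(seed)
    trials = []
    for _ in range(n_trials):
        cfg = sample_config(rng)
        trials.append((evaluate(cfg), cfg))
    return min(trials, key=lambda t: t[0])
```

Once a promising region is found, the best few configurations can be retrained with a larger epoch budget (ideally with early stopping) instead of searching over epochs directly.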
For almost everything else, there is a paper by Leslie N. Smith that I can't recommend enough, A disciplined approach to neural network hyper-parameters: Part 1 -- learning rate, batch size, momentum, and weight decay.
As you can see, batch size is included but epochs are not. You will also notice that the model architecture is not included (number of layers, layer size, etc). Neural Architecture Search is a huge research field in its own right, separate from hyper-parameter tuning.
As for the loss function, I can't think of any reason to "tune" that except in the context of an Auxiliary Loss for training only, which I suspect is not what you are talking about.
The loss function that will be applied to your validation or test set is part of the problem statement. That, along with the data, defines the problem you are solving. You don't change it by tuning; you change it by convincing a product manager that your alternative is better for the business need.
tl;dr - I use an autoencoder to try to reduce input dimensions for a reinforcement-learning (RL) agent to learn how to play Atari-KungFu. But it fails at encoding/decoding thrown knives, because they are only a couple pixels and getting them wrong probably has negligible impact on the autoencoder MSE loss (see green arrows in bottom left of image). This will probably permanently hobble the results. I want to figure out if there is a way to solve this -- preferably with a generalized solution, but I'd be happy for now with something specific to this problem.
Background:
I am working on Week5 of the "Practical Reinforcement Learning" course on Coursera (National Research University HSE), and I decided to spend extra time trying to expand performance on the Atari-KungFu assignment using Actor-Critic architecture. This post is not about actor-critic, but more about an interesting sub-problem I ran into related to autoencoders.
I create an encoder which outputs a tanh-64-neuron layer, which is used as a common input to the decoder, policy learner (actor), and value learner (critic). During training, the simulator returns batches of four sequential frames (64 x 144 x 4) and rewards from the last action. Then images are first used to train the autoencoder, then used with the rewards to train the actor & critic branches.
I display some metrics and example frames every 25000 iterations to see how it's doing. If the reconstructed images are accurate, then the inputs to the actor & critic branches should be getting good distilled information for efficient learning.
You can see below that the autoencoder is pretty good except for the thrown knives (see bottom-left). Arguably this is because missing those couple of pixels barely increases the MSE loss of the reconstructed image, so the autoencoder has little incentive to learn them (and there are also not many frames that contain knives). Yet seeing those knives is critical for the RL agent to learn how to survive.
I haven't seen this kind of problem addressed before. A tiny artifact in the input images is crucial for learning, but is unlikely to be learned by the autoencoder. Can we fix/improve this?
IMO your problem is loss-specific. Some things that would probably help the autoencoder reconstruct the knives as well:
Find knives in the input image using image processing techniques. Regions where knives are present should get a higher weight in the MSE loss, say 10 times more. One way to find them semi-automatically could be a convolution with a big kernel that responds strongly when there are white pixels at the exact center and only zeros around them. Something along these lines should fire only on regions containing knives (the knife-throwing characters would not trigger it, as they contain too many white pixels and holes). Thresholding the kernel response at an empirically chosen value should be enough to find them correctly.
Lower the loss for images in which no knife was found, say by halving it. This would make the autoencoder focus harder on the rarely seen cases where a knife is visible.
On the downside, I suppose this could introduce some artifacts. In that case you may think about using a pretrained encoder (like some version of ResNet) to increase the model's capacity.
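A minimal NumPy sketch of the weighting scheme described above. The detector is a simplified stand-in for the big-kernel convolution: it marks bright pixels that have very few bright neighbours in a surrounding window. All thresholds (`bright`, `radius`, `max_bright`), the function names, and the assumption that frames are single-channel arrays scaled to [0, 1] are illustrative guesses that would need tuning on real frames.

```python
import numpy as np

def isolated_bright_mask(frame, bright=0.8, radius=3, max_bright=6):
    """Mark bright pixels whose (2*radius+1)^2 neighbourhood contains at most
    `max_bright` bright pixels, so that large bright shapes are rejected."""
    b = (frame > bright).astype(float)
    padded = np.pad(b, radius)  # zero padding keeps the output shape
    h, w = b.shape
    size = 2 * radius + 1
    counts = sum(padded[dy:dy + h, dx:dx + w]
                 for dy in range(size) for dx in range(size))
    return (b > 0) & (counts <= max_bright)

def weighted_mse(target, recon, knife_mask, knife_weight=10.0, no_knife_scale=0.5):
    """MSE with extra weight on masked pixels; frames with an empty mask
    have their loss scaled down."""
    weights = np.where(knife_mask, knife_weight, 1.0)
    loss = np.mean(weights * (target - recon) ** 2)
    if not knife_mask.any():
        loss *= no_knife_scale
    return loss
```

In a training framework the same weight map would be computed per frame and multiplied into the per-pixel squared error before averaging; the mask itself does not need gradients.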
When people try to solve the task of semantic segmentation with CNNs, they usually use a softmax cross-entropy loss during training (see Fully conv. - Long). But when it comes to comparing the performance of different approaches, measures like intersection-over-union are reported.
My question is why don't people train directly on the measure they want to optimize? Seems odd to me to train on some measure during training, but evaluate on another measure for benchmarks.
I can see that the IOU has problems for training samples, where the class is not present (union=0 and intersection=0 => division zero by zero). But when I can ensure that every sample of my ground truth contains all classes, is there another reason for not using this measure?
Check out this paper, where they come up with a way to make the concept of IoU differentiable. I implemented their solution with amazing results!
It is like asking "why do we train on log loss and not accuracy for classification?". The reason is simple: you cannot train directly on most metrics, because they are not differentiable w.r.t. your parameters (or at least do not produce a nice error surface). Log loss (softmax cross-entropy) is a valid surrogate for accuracy. You are completely right that it is plain wrong to train with something that is not a valid surrogate for the metric you are interested in, and the linked paper does not do a good job here, since for at least a few of the metrics they consider we could easily show a good surrogate (for weighted accuracy, for example, all you have to do is weight the log loss as well).
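The last point, weighting the log loss to obtain a surrogate for weighted accuracy, can be sketched as follows. The function name and the normalisation by the sum of weights are choices made for this example, not something specified in the answer.

```python
import numpy as np

def weighted_log_loss(probs, labels, class_weights):
    """probs: (n, k) predicted class probabilities; labels: (n,) integer
    class ids; class_weights: (k,) weight per class."""
    probs = np.clip(probs, 1e-12, 1.0)  # avoid log(0)
    w = class_weights[labels]
    nll = -np.log(probs[np.arange(len(labels)), labels])
    return np.sum(w * nll) / np.sum(w)

probs = np.array([[0.9, 0.1], [0.2, 0.8]])
labels = np.array([0, 1])
print(weighted_log_loss(probs, labels, np.array([1.0, 3.0])))
```

With equal class weights this reduces to the ordinary mean log loss, so the weights only shift how much each class's errors count.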
Here's another way to think about this in a simple manner.
Remember that it is not sufficient to simply evaluate a metric such as accuracy or IoU while solving an image problem. Evaluating the metric must also tell the network in which direction the weights should be nudged, so that it can learn effectively over iterations and epochs.
Being able to compute this direction is what the earlier answers mean when they say the loss must be differentiable. As far as I can tell, there is nothing in the IoU metric itself that the network can use to say: "hey, it's not exactly here, but I have to maybe move my bounding box a little to the left!"
Just a trickle of an explanation, but hope it helps..
I always use mean IOU for training a segmentation model. More exactly, -log(MIOU). Plain -MIOU as a loss function will easily trap your optimizer around 0 because of its narrow range (0,1) and thus its steep surface. By taking its log scale, the loss surface becomes slow and good for training.
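For reference, one common way to write such a loss is a "soft" IoU computed on predicted probabilities instead of hard labels, followed by -log of the class mean. The sketch below is an assumption about what this could look like, not the exact formulation used above or in the paper linked earlier; the `eps` term also sidesteps the 0/0 case for classes that are absent from a sample.

```python
import numpy as np

def soft_iou(probs, target, eps=1e-7):
    """probs, target: arrays of shape (n_classes, H, W); target is one-hot.
    Returns one soft IoU value per class."""
    inter = np.sum(probs * target, axis=(1, 2))
    union = np.sum(probs + target - probs * target, axis=(1, 2))
    return (inter + eps) / (union + eps)

def neg_log_miou_loss(probs, target):
    """-log of the mean soft IoU over classes; 0 for a perfect prediction."""
    return -np.log(np.mean(soft_iou(probs, target)))
```

Note that the gradient of -log(mIoU) is the gradient of -mIoU scaled by 1/mIoU, so updates are amplified while the IoU is still close to zero.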
I have been working on Support Vector Machines for about 2 months now. I have coded the SVM myself, and for the SVM optimization problem I have used Sequential Minimal Optimization (SMO) by Dr. John Platt.
Right now I am at the stage of running a grid search to find the optimal C value for my dataset. (Please find details of my project application and dataset here: SVM Classification - minimum number of input sets for each class)
I have successfully checked the accuracy of my custom SVM implementation for C values ranging from 2^0 to 2^6. But now I am having issues with the convergence of SMO for C > 128.
For example, I tried to find the alpha values for C=128, and it takes a long time before it actually converges and gives the alpha values.
The time taken for SMO to converge is about 5 hours for C=100. This is huge, I think (because SMO is supposed to be fast), even though I'm getting good accuracy.
I am screwed right now because I cannot test the accuracy for higher values of C.
I am displaying the number of alphas changed in every pass of SMO, and I keep getting 10, 13, 8... alphas changing continuously. The KKT conditions assure convergence, so what weird thing is happening here?
Please note that my implementation works fine for C<=100 with good accuracy, though the execution time is long.
Please give me inputs on this issue.
Thank You and Cheers.
For most SVM implementations, training time can increase dramatically with larger values of C. To get a sense of how training time in a reasonably good implementation of SMO scales with C, take a look at the log-scale line for libSVM in the graph below.
[Figure: SVM training time vs. C, from Sentelle et al.'s A Fast Revised Simplex Method for SVM Training. Image: http://dmcer.net/StackOverflowImages/svm_scaling.png]
You probably have two easy ways and one not so easy way to make things faster.
Let's start with the easy stuff. First, you could try loosening your convergence criterion. A strict criterion like epsilon = 0.001 will take much longer to train, while typically resulting in a model that is no better than one trained with a looser criterion like epsilon = 0.01. Second, you should try to profile your code to see if there are any obvious bottlenecks.
The not so easy fix would be to switch to a different optimization algorithm (e.g., SVM-RSQP from Sentelle et al.'s paper above). But if you have a working implementation of SMO, you should probably only do that as a last resort.
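To see what loosening the convergence criterion means in practice, here is a small sketch of the KKT check that SMO typically uses to decide whether an example still needs optimizing. The function name and the toy numbers are made up for illustration; your own implementation may organise this check differently.

```python
import numpy as np

def count_kkt_violations(alpha, y, f, C, tol):
    """alpha: multipliers; y: labels in {-1, +1}; f: decision values f(x_i).
    An example violates the KKT conditions (within `tol`) if its margin is
    too small while alpha can still grow, or too large while alpha is positive."""
    r = y * f - 1.0
    violates = ((r < -tol) & (alpha < C)) | ((r > tol) & (alpha > 0))
    return int(np.sum(violates))

alpha = np.array([0.0, 0.5, 1.0, 0.5])
y = np.array([1.0, -1.0, 1.0, -1.0])
f = np.array([1.5, -1.005, 0.7, -0.9])
print(count_kkt_violations(alpha, y, f, C=1.0, tol=0.001))  # 2
print(count_kkt_violations(alpha, y, f, C=1.0, tol=0.01))   # 1
```

With the looser tolerance, the example that is only marginally off (margin 1.005) no longer counts as a violator, so SMO stops touching it. Near convergence most remaining violators tend to be of this kind, which is why a looser tolerance can cut the number of passes substantially.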
If you want complete convergence, it takes a very long time, especially when C is large. You can consider using a looser stopping criterion and setting a maximum number of iterations (the default in Libsvm is 1000000). Allowing more iterations multiplies the training time, and the gain is usually not worth the cost. With early stopping the result may not fully satisfy the KKT conditions: some support vectors lie inside the band and some non-support vectors lie outside it, but the error is small and acceptable. In my opinion, if higher accuracy is required, it is better to use another quadratic programming algorithm instead of SMO.