TensorFlow - GradientDescentOptimizer - are we actually finding global optimum? - machine-learning

I was playing with tensorflow for quite a while, and I have more of a theoretical question. In general, when we train a network we usually use GradientDescentOptimizer (probably its variations like adagrad or adam) to minimize the loss function. In general it looks like we are trying to adjust weights and biases so that we get the global minimum of this loss function. But the issue is that I assume that this function has an extremely complicated look if you plot it, with lots of local optimums. What I wonder is how can we be sure that Gradient Descent finds global optimum and that we are not getting instantly stuck in some local optimum instead which is far away from global optimum?
I recollect that for example when you are performing clustering in sklearn it usually runs clustering algorithm several times with random initialization of cluster centers, and by doing this we ensure that we are not getting stuck with not optimal result. But we are not doing something like this while training ANNs in tensorflow - we start with some random weights and just travel along the slope of the function.
So, any insight into this? Why we can be more or less sure that the results of training via gradient descent are close to global minimum once the loss stops to decrease significantly?
Just to clarify, why I am wondering about this matter is that if we can't be sure that we get at least close to global minimum we can't easily judge which of 2 different models is actually better. Because we could run experiment, get some model evaluation which shows that model is not good... But actually it just stuck in local minimum shortly after training started. While other model which seemed for us to be better was just more lucky to start training from a better starting point and didn't stuck in local minimum fast. Moreover, this issue means that we can't even be sure that we get maximum from the network architecture we currently could be testing. For example, it could have really good global minimum but it is hard to find it and we mostly get stuck with poor solutions at local minimums, which would be far away from global optimum and never see the full potential of network at hand.

Gradient descent, by its nature, is looking at the function locally (local gradient). Hence, there is absolutely no guarantee that it will be the global minima. In fact, it probably will not be unless the function is convex. This is also the reason GD like methods are sensitive to initial position you start from. Having said that, there was a recent paper which said that in high-dimensional solution spaces, the number of maximas/minimas are not as many as previously thought.
Finding global minimas in high dimensional space in a reasonable way seems very much an unsolved problem. However, you might wanna focus more on saddle points rather than minimas. See this post for example:
High level description for saddle point problem
A more detailed paper is here (https://arxiv.org/pdf/1406.2572.pdf)

Your intuition is quite right. Complex models such as neural networks are typically applied to problems with high dimensional input, where the error surface has a very complex landscape.
Neural networks are not guaranteed to find the global optimum and getting stuck in local minima is a problem where a lot of research has been focussed. If you’re interested in finding out more about this, it would be good to look at techniques such as online learning and momentum, which have traditionally been used to avoid the problem of local minima. However, these techniques in themselves bring further difficulties e.g. integrating online learning is not possible for some optimisation techniques and the addition of a momentum hyper-parameter to the backpropagation algorithm brings further difficulties in training.
A really good video for visualising the influence of momentum (and how it overcomes local minma) during backpropagation can be found here.
Added after question edit - see comments
It’s the aforementioned nature of the problems neural networks are applied to that means we often can’t find a globally optimal solution, because (in the general case) traversing the entire search space for the optimal solution would be intractable using classical computing technology (quantum computers could change this for some problems). As such neural networks are trained to achieve what is hopefully a ‘good’ local optimum.
If you're interested in reading more detailed information about the techniques employed to find good local optima (i.e. something that approximates a global solution) a good paper to read might be this

No. Gradient descent method helps to find out the local minima. In case if global minima and local minima are same then only we get the actual result i.e. global minima.

Related

Best Reinforcement Learner Optimizer

I'm running a SAC reinforcement learner for a robotics application with some pretty decent results. One of the reasons I opted for reinforcement learning is for the ability for learning in the field, e.g. to adjust to a mechanical change, such as worn tires or a wheel going a little out of alignment.
My reinforcement learner restores it's last saved weights and replay buffer upon startup, so it doesn't need to retrain every time I turn it on. However, one concern I have is with respect to the optimizer.
Optimizers have come a long way since ADAM, but everything I read and all the RL code samples I see still seem to use ADAM with a fixed learning rate. I'd like to take advantage of some of the advances in optimizers, e.g. one cycle AdamW. However, a one-cycle optimizer seems inappropriate for a continuous real-world reinforcement learning problem: I imagine it's pretty good for the initial training/calibration, but I expect the low final learning rate would react too slowly to mechanical changes.
One thought I had was perhaps to do a one-cycle approach for initial training, and triggering a smaller one-cycle restart if a change in error that indicates something has changed (perhaps the size of the restart could be based on the size of the change in error).
Has anyone experimented with optimizers other than ADAM for reinforcement learning or have any suggestions for dealing with this sort of problem?
Reinforcement learning is very different from traditional supervised learning because the training data distribution changes as the policy improves. In optimization terms, the objective function can be said to be non-stationary. For this reason, I suspect your intuition is likely correct -- that a "one-cycle" optimizer would perform poorly after a while in your application.
My question is, what is wrong with Adam? Typically, the choice of optimizer is a minor detail for deep reinforcement learning; other factors like the exploration policy, algorithmic hyperparameters, or network architecture tend to have a much greater impact on performance.
Nevertheless, if you really want to try other optimizers, you could experiment with RMSProp, Adadelta, or Nesterov Momentum. However, my guess is that you will see incremental improvements, if any. Perhaps searching for better hyperparameters to use with Adam would be a more effective use of time.
EDIT: In my original answer, I made the claim that the choice of a particular optimizer is not primarily important for reinforcement learning speed, and neither is generalization. I want to add some discussion that helps illustrate these points.
Consider how most deep policy gradient methods operate: they sample a trajectory of experience from the environment, estimate returns, and then conduct one or more gradient steps to improve the parameterized policy (e.g. a neural network). This process repeats until convergence (to a locally optimal policy).
Why must we continuously sample new experience from the environment? Because our current data can only provide a reasonable first-order approximation within a small trust region around the policy parameters that were used to collect that data. Hence, whenever we update the policy, we need to sample more data.
A good way to visualize this is to consider an MM algorithm. At each iteration, a surrogate objective is constructed based on the data we have now and then maximized. Each time, we will get closer to the true optimum, but the speed at which we approach it is determined only by the number of surrogates we construct -- not by the specific optimizer we use to maximize each surrogate. Adam might maximize each surrogate in fewer gradient steps than, say, RMSProp does, but this does not affect the learning speed of the agent (with respect to environment samples). It just reduces the number of minibatch updates you need to conduct.
SAC is a little more complicated than this, as it learns Q-values in an off-policy manner and conducts updates using experience replay, but the general idea holds. The best attainable policy is subject to whatever the current data in our replay memory are; regardless of the optimizer we use, we will need to sample roughly the same amount of data from the environment to converge to the optimal policy.
So, how do you make a faster (more sample-efficient) policy gradient method? You need to fundamentally change the RL algorithm itself. For example, PPO almost always learns faster than TRPO, because John Schulman and co-authors found a different and empirically better way to generate policy gradient steps.
Finally, notice that there is no notion of generalization here. We have an objective function that we want to optimize, and once we do optimize it, we have solved the task as well as we can. This is why I suspect that the "Adam-generalizes-worse-than-SGD" issue is actually irrelevant for RL.
My initial testing suggest the details of the optimizer and it's hyperparameters matter, at least for off-policy techniques. I haven't had the chance to experiment much with PPO or on-policy techniques, so I can't speak for those unfortunately.
To speak to #Brett_Daley's thoughtful response a bit: the optimizer is certainly one of the less important characteristics. The means of exploration, and the use of a good prioritized replay buffer are certainly critical factors, especially with respect to achieving good initial results. However, my testing seems to show that the optimizer becomes important for the fine-tuning.
The off-policy methods I have been using have been problematic with fine-grained stability. In other words, the RL finds the mostly correct solution, but never really hones in on the perfect solution (or if it does find it briefly, it drifts off). I suspect the optimizer is at least partly to blame.
I did a bit of testing and found that varying the ADAM learning rate has an obvious effect. Too high and both the actor and critic bounce around the minimum and never converge on the optimal policy. In my robotics application this looks like the RL consistently makes sub-optimal decisions, as though there's a bit of random exploration with every action that always misses the mark a little bit.
OTOH, a lower learning rate tends to get stuck in sub-optimal solutions and is unable to adapt to changes (e.g. slower motor response due to low battery).
I haven't yet run any tests of single-cycle schedule or AdamW for the learning rate, but I did a very basic test with a two stage learning rate adjustment for both Actor and Critic (starting with a high rate and dropping to a low rate) and the results were a clearly more precise solution that converged quickly during the high learning rate and then honed in better with the low-learning rate.
I imagine AdamW's better weight decay regularization may result in similarly better results for avoiding overfitting training batches contributing to missing the optimal solution.
Based on the improvement I saw, it's probably worth trying single-cycle methods and AdamW for the actor and critic networks for tuning the results. I still have some concerns for how the lower learning rate at the end of the cycle will adapt to changes in the environment, but a simple solution for that may be to monitor the loss and do a restart of the learning rate if it drifts too much. In any case, more testing seems warranted.

How to deal with ill-conditioned neural networks?

When dealing with ill-conditioned neural networks, is the current state of the art to use an adaptive learning rate, some very sophisticated algorithm to deal with the problem, or to eliminate the ill conditioning by preprocessing/scaling of the data?
The problem can be illustrated with the simplest of scenarios: one input and one output where the function to be learned is y=x/1000, so a single weight whose value needs to be 0.001. One data point (0,0). It turns out to matter a great deal, if you are using gradient descent, whether the second data point is (1000,1) or (1,0.001).
Theoretical discussion of the problem, with expanded examples.
Example in TensorFlow
Of course, straight gradient descent is not the only available algorithm. Other possibilities are discussed at here - however, as that article observes, the alternative algorithms it lists that are good at handling ill condition, are not so good when it comes time to handle a large number of weights.
Are new algorithms available? Yes, but these aren't clearly advertised as solutions for this problem, are perhaps intended to solve a different set of problems; swapping in Adagrad in place of GradientDescent does prevent overshoot, but still converges very slowly.
At one time, there were some efforts to develop heuristics to adaptively tweak the learning rate, but then instead of being just a number, the learning rate hyperparameter is a function, much harder to get right.
So these days, is the state of the art to use a more sophisticated algorithm to deal with ill condition, or to just preprocess/scale the data to avoid the problem in the first place?

Should I adapt loss weights for misclassified samples during epochs?

I am using FCN (Fully Convolutional Networks) and trying to do image segmentation. When training, there are some areas which are mislabeled, however further training doesn't help much to make them go away. I believe this is because network learns about some features which might not be completely correct ones, but because there are enough correctly classified examples, it is stuck in local minimum and can't get out.
One solution I can think of is to train for an epoch, then validate the network on training images, and then adjust weights for mismatched parts to penalize mismatch more there in next epoch.
Intuitively, this makes sense to me - but I haven't found any writing on this. Is this a known technique? If yes, how is it called? If no, what am I missing (what are the downsides)?
It highly depends on your network structure. If you are using the original FCN, due to the pooling operations, the segmentation performance on the boundary of your objects is degraded. There have been quite some variants over the original FCN for image segmentation, although they didn't go the route you're proposing.
Just name a couple of examples here. One approach is to use Conditional Random Field (CRF) on top of the FCN output to refine the segmentation. You may search for the relevant papers to get more idea on that. In some sense, it is close to your idea but the difference is that CRF is separated from the network as a post-processing approach.
Another very interesting work is U-net. It employs some idea from the residual network (RES-net), which enables high resolution features from lower levels can be integrated into high levels to achieve more accurate segmentation.
This is still a very active research area. So you may bring the next break-through with your own idea. Who knows! Have fun!
First, if I understand well you want your network to overfit your training set ? Because that's generally something you don't want to see happening, because this would mean that while training your network have found some "rules" that enables it to have great results on your training set, but it also means that it hasn't been able to generalize so when you'll give it new samples it will probably perform poorly. Moreover, you never talk about any testing set .. have you divided your dataset in training/testing set ?
Secondly, to give you something to look into, the idea of penalizing more where you don't perform well made me think of something that is called "AdaBoost" (It might be unrelated). This short video might help you understand what it is :
https://www.youtube.com/watch?v=sjtSo-YWCjc
Hope it helps

Is there any technique to know in advance the amount of training examples you need to make deep learning get good performance?

Deep learning has been a revolution recently and its success is related with the huge amount of data that we can currently manage and the generalization of the GPUs.
So here is the problem I'm facing. I know that deep neural nets have the best performance, there is no doubt about it. However, they have a good performance when the number of training examples is huge. If the number of training examples is low it is better to use a SVM or decision trees.
But what is huge? what is low? In this paper of face recognition (FaceNet by Google) they show the performance vs the flops (which can be related with the number of training examples)
They used between 100M and 200M training examples, which is huge.
My question is:
Is there any method to predict in advance the number of training examples I need to have a good performance in deep learning??? The reason I ask this is because it is a waste of time to manually classify a dataset if the performance is not going to be good.
My question is: Is there any method to predict in advance the number of training examples I need to have a good performance in deep learning??? The reason I ask this is because it is a waste of time to manually classify a dataset if the performance is not going to be good.
The short answer is no. You do not have this kind of knowledge, furthermore you will never have. These kind of problems are impossible to solve, ever.
What you can have are just some general heuristics/empirical knowledge, which will say if it is probable that DL will not work well (as it is possible to predict fail of the method, while nearly impossible to predict the success), nothing more. In current research, DL rarely works well for datasets smaller than hundreads thousands/milions of samples (I do not count MNIST because everything works well on MNIST). Furthermore, DL is heavily studied actually in just two types of problems - NLP and image processing, thus you cannot really extraplate it to any other kind of problems (no free lunch theorem).
Update
Just to make it a bit more clear. What you are asking about is to predit whether given estimator (or set of estimators) will yield a good results given a particular training set. In fact you even restrict just to the size.
The simpliest proof (based on your simplification) is as follows: for any N (sample size) I can construct N-mode (or N^2 to make it even more obvious) distribution which no estimator can reasonably estimate (including deep neural network) and I can construct trivial data with just one label (thus perfect model requires just one sample). End of proof (there are two different answers for the same N).
Now let us assume that we do have access to the training samples (without labels for now) and not just sample size. Now we are given X (training samples) of size N. Again I can construct N-mode labeling yielding impossible to estimate distribution (by anything) and trivial labeling (just a single label!). Again - two different answers for the exact same input.
Ok, so maybe given training samples and labels we can predict what will behave well? Now we cannot manipulate samples nor labels to show that there are no such function. So we have to get back to statistics and what we are trying to answer. We are asking about expected value of loss function over whole probability distribution which generated our training samples. So now again, the whole "clue" is to see, that I can manipulate the underlying distributions (construct many different ones, many of which impossible to model well by deep neural network) and still expect that my training samples come from them. This is what statisticians call the problem of having non-representible sample from a pdf. In particular, in ML, we often relate to this problem with curse of dimensionality. In simple words - in order to estimate the probability well we need enormous number of samples. Silverman shown that even if you know that your data is just a normal distribution and you ask "what is value in 0?" You need exponentialy many samples (as compared to space dimensionality). In practise our distributions are multi-modal, complex and unknown thus this amount is even higher. We are quite safe to say that given number of samples we could ever gather we cannot ever estimate reasonably well distributions with more than 10 dimensions. Consequently - whatever we do to minimize the expected error we are just using heuristics, which connect the empirical error (fitting to the data) with some kind of regularization (removing overfitting, usually by putting some prior assumptions on distributions families). To sum up we cannot construct a method able to distinguish if our model will behave good, because this would require deciding which "complexity" distribution generated our samples. There will be some simple cases when we can do it - and probably they will say something like "oh! this data is so simple even knn will work well!". You cannot have generic tool, for DNN or any other (complex) model though (to be strict - we can have such predictor for very simple models, because they simply are so limited that we can easily check if your data follows this extreme simplicity or not).
Consequently, this boils down nearly to the same question - to actually building a model... thus you will need to try and validate your approach (thus - train DNN to answer if DNN works well). You can use cross validation, bootstraping or anything else here, but all essentialy do the same - build multiple models of your desired type and validate it.
To sum up
I do not claim we will not have a good heuristics, heuristic drive many parts of ML quite well. I only answer if there is a method able to answer your question - and there is no such thing and cannot exist. There can be many rules of thumb, which for some problems (classes of problems) will work well. And we already do have such:
for NLP/2d images you should have ~100,000 samples at least to work with DNN
having lots of unlabeled instances can partially substitute the above number (thus you can have like 30,000 labeled ones + 70,000 unlabeled) with pretty reasonable results
Furthermore this does not mean that given this size of data DNN will be better than kernelized SVM or even linear model. This is exactly what I was refering to earlier - you can easily construct counterexamples of distributions where SVM will work the same or even better despite number of samples. The same applies for any other technique.
Yet still, even if you are just interested if DNN will work well (and not better than others) these are just empirical, trivial heuristics, which are based on at most 10 (!) types of problems. This could be very harmfull to treat these as rules or methods. This are just rough, first intuitions gained through extremely unstructured, random research that happened in last decade.
Ok, so I am lost now... when should I use DL? And the answer is exteremly simple:
Use deep learning only if:
You already tested "shallow" techniques and they do not work well
You have large amounts of data
You have huge computational resources
You have experience with neural networks (this are very tricky and ungreatful models, really)
You have great amount of time to spare, even if you will just get a few % better results as an effect.

Relating Machine learning Techniques to solve optimization Problem

Consider an optimization problem of some dimension n, Given some linear set of equations(inequalities) or constraints on the inputs which form a convex region, finding the maximum\minimum value of some expression which is some linear combination of inputs(or dimensions).
For larger dimension, these optimization problems take much time to give the exact answer.
So, can we use machine learning techniques, to get some approximate solution in lesser time.
if we can use machine learning techniques in this context, How the Training set should be??
Do you mean "How big should the training set be?" If so, then that is very much a "how long is a piece of string" question. It needs to be large enough for the algorithm being used, and to represent the data that is being modeled.
This doesn't strike me as being especially focused on machine learning, as is typically meant by the term anyway. It's just a straightforward constrained optimization problem. You say that it takes too long to find solutions now, but you don't mention how you're trying to solve the problem.
The simplex algorithm is designed for this sort of problem, but it's exponential in the worst case. Is that what you're trying that's taking too long? If so, there are tons of metaheuristics that might perform well. Tabu search, simulated annealing, evolutionary algorithms, variable depth search, even simple multistart hill climbers. I would probably try something along those lines before I tried anything exotic.

Resources