How to deal with ill-conditioned neural networks? - machine-learning

When dealing with ill-conditioned neural networks, is the current state of the art to use an adaptive learning rate, some very sophisticated algorithm to deal with the problem, or to eliminate the ill conditioning by preprocessing/scaling of the data?
The problem can be illustrated with the simplest of scenarios: one input and one output, where the function to be learned is y = x/1000, so there is a single weight whose value needs to be 0.001. One data point is (0,0). If you are using gradient descent, it turns out to matter a great deal whether the second data point is (1000,1) or (1,0.001).
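To make this concrete, here is a minimal toy sketch of my own (not taken from the linked examples) of plain gradient descent on this one-weight problem. It shows how the choice of second data point changes the usable learning rate by roughly six orders of magnitude:

```python
def train(x, y, lr, steps=200):
    """Plain gradient descent on a single weight w for the model y_hat = w * x,
    with squared-error loss L(w) = (w*x - y)**2."""
    w = 0.0
    for _ in range(steps):
        grad = 2 * x * (w * x - y)  # dL/dw
        w -= lr * grad
    return w

# Target function is y = x/1000, so the true weight is 0.001.

# Well-conditioned second data point: curvature is x**2 = 1, lr = 0.1 works.
print(train(x=1.0, y=0.001, lr=0.1))     # converges to ~0.001

# Ill-conditioned second data point: curvature is x**2 = 1e6, the same lr diverges.
print(train(x=1000.0, y=1.0, lr=0.1))    # blows up (ends in inf/nan)

# A stable learning rate must be roughly a million times smaller.
print(train(x=1000.0, y=1.0, lr=1e-7))   # converges to ~0.001, but only with a tiny lr
```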
Theoretical discussion of the problem, with expanded examples.
Example in TensorFlow
Of course, straight gradient descent is not the only available algorithm. Other possibilities are discussed here; however, as that article observes, the alternative algorithms it lists that are good at handling ill-conditioning are not so good when it comes time to handle a large number of weights.
Are new algorithms available? Yes, but they aren't clearly advertised as solutions for this problem and are perhaps intended to solve a different set of problems; swapping in Adagrad in place of GradientDescent does prevent overshoot, but convergence is still very slow.
At one time there were efforts to develop heuristics that adaptively tweak the learning rate, but then the learning-rate hyperparameter is no longer just a number; it is a function, which is much harder to get right.
So these days, is the state of the art to use a more sophisticated algorithm to deal with ill-conditioning, or to just preprocess/scale the data to avoid the problem in the first place?
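For reference, the preprocessing route can be as simple as standardizing the inputs before training. A minimal scikit-learn sketch (the data and the downstream model here are placeholders, not a recommendation of a specific pipeline):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Placeholder data on wildly different scales (the ill-conditioned case above).
X = np.array([[0.0], [1000.0]])
y = np.array([0.0, 1.0])

# Standardizing the inputs makes the loss surface roughly isotropic, so a
# single "ordinary" learning rate works again.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)   # zero mean, unit variance per feature

# ... train any gradient-based model on X_scaled instead of X ...
# At prediction time, apply the same transform to new inputs:
# model.predict(scaler.transform(X_new))
```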

Related

Best Reinforcement Learner Optimizer

I'm running a SAC reinforcement learner for a robotics application with some pretty decent results. One of the reasons I opted for reinforcement learning is the ability to learn in the field, e.g. to adjust to a mechanical change such as worn tires or a wheel going slightly out of alignment.
My reinforcement learner restores its last saved weights and replay buffer upon startup, so it doesn't need to retrain every time I turn it on. However, one concern I have is with respect to the optimizer.
Optimizers have come a long way since ADAM, but everything I read and all the RL code samples I see still seem to use ADAM with a fixed learning rate. I'd like to take advantage of some of the advances in optimizers, e.g. one-cycle AdamW. However, a one-cycle optimizer seems inappropriate for a continuous real-world reinforcement learning problem: I imagine it's pretty good for the initial training/calibration, but I expect the low final learning rate would react too slowly to mechanical changes.
One thought I had was to use a one-cycle approach for initial training, and then to trigger a smaller one-cycle restart if a change in error indicates something has changed (perhaps the size of the restart could be based on the size of the change in error).
Has anyone experimented with optimizers other than ADAM for reinforcement learning or have any suggestions for dealing with this sort of problem?
Reinforcement learning is very different from traditional supervised learning because the training data distribution changes as the policy improves. In optimization terms, the objective function can be said to be non-stationary. For this reason, I suspect your intuition is likely correct -- that a "one-cycle" optimizer would perform poorly after a while in your application.
My question is, what is wrong with Adam? Typically, the choice of optimizer is a minor detail for deep reinforcement learning; other factors like the exploration policy, algorithmic hyperparameters, or network architecture tend to have a much greater impact on performance.
Nevertheless, if you really want to try other optimizers, you could experiment with RMSProp, Adadelta, or Nesterov Momentum. However, my guess is that you will see incremental improvements, if any. Perhaps searching for better hyperparameters to use with Adam would be a more effective use of time.
EDIT: In my original answer, I claimed that the choice of a particular optimizer is not of primary importance for reinforcement learning speed, and neither is generalization. I want to add some discussion that helps illustrate these points.
Consider how most deep policy gradient methods operate: they sample a trajectory of experience from the environment, estimate returns, and then conduct one or more gradient steps to improve the parameterized policy (e.g. a neural network). This process repeats until convergence (to a locally optimal policy).
Why must we continuously sample new experience from the environment? Because our current data can only provide a reasonable first-order approximation within a small trust region around the policy parameters that were used to collect that data. Hence, whenever we update the policy, we need to sample more data.
A good way to visualize this is to consider an MM algorithm. At each iteration, a surrogate objective is constructed based on the data we have now and then maximized. Each time, we will get closer to the true optimum, but the speed at which we approach it is determined only by the number of surrogates we construct -- not by the specific optimizer we use to maximize each surrogate. Adam might maximize each surrogate in fewer gradient steps than, say, RMSProp does, but this does not affect the learning speed of the agent (with respect to environment samples). It just reduces the number of minibatch updates you need to conduct.
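For concreteness, the MM picture can be written as follows (my own notation, for a maximization problem with objective $f$):

$$\theta_{k+1} = \arg\max_{\theta}\, g(\theta \mid \theta_k), \quad \text{where } g(\theta_k \mid \theta_k) = f(\theta_k) \text{ and } g(\theta \mid \theta_k) \le f(\theta) \text{ for all } \theta.$$

Each inner maximization of $g$ can be carried out by Adam, RMSProp, or anything else; the agent's learning curve measured in environment samples is paced by the outer iteration over $k$, i.e. by how many surrogates we build from fresh data.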
SAC is a little more complicated than this, as it learns Q-values in an off-policy manner and conducts updates using experience replay, but the general idea holds. The best attainable policy is subject to whatever the current data in our replay memory are; regardless of the optimizer we use, we will need to sample roughly the same amount of data from the environment to converge to the optimal policy.
So, how do you make a faster (more sample-efficient) policy gradient method? You need to fundamentally change the RL algorithm itself. For example, PPO almost always learns faster than TRPO, because John Schulman and co-authors found a different and empirically better way to generate policy gradient steps.
Finally, notice that there is no notion of generalization here. We have an objective function that we want to optimize, and once we do optimize it, we have solved the task as well as we can. This is why I suspect that the "Adam-generalizes-worse-than-SGD" issue is actually irrelevant for RL.
My initial testing suggests the details of the optimizer and its hyperparameters matter, at least for off-policy techniques. I haven't had the chance to experiment much with PPO or on-policy techniques, so I can't speak to those, unfortunately.
To speak to @Brett_Daley's thoughtful response a bit: the optimizer is certainly one of the less important characteristics. The means of exploration and the use of a good prioritized replay buffer are certainly critical factors, especially with respect to achieving good initial results. However, my testing seems to show that the optimizer becomes important for the fine-tuning.
The off-policy methods I have been using have been problematic with fine-grained stability. In other words, the RL finds the mostly correct solution, but never really homes in on the perfect solution (or if it does find it briefly, it drifts off). I suspect the optimizer is at least partly to blame.
I did a bit of testing and found that varying the ADAM learning rate has an obvious effect. Too high and both the actor and critic bounce around the minimum and never converge on the optimal policy. In my robotics application this looks like the RL consistently makes sub-optimal decisions, as though there's a bit of random exploration with every action that always misses the mark a little bit.
OTOH, a lower learning rate tends to get stuck in sub-optimal solutions and is unable to adapt to changes (e.g. slower motor response due to low battery).
I haven't yet run any tests of a one-cycle schedule or AdamW, but I did a very basic test with a two-stage learning-rate adjustment for both actor and critic (starting with a high rate and dropping to a low rate). The result was a clearly more precise solution: it converged quickly during the high-rate stage and then homed in better during the low-rate stage.
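For illustration, here is a minimal PyTorch-style sketch of that two-stage schedule; the actor/critic networks, milestone, and rates are placeholders rather than my actual setup:

```python
import torch
import torch.nn as nn

# Placeholder actor/critic networks; substitute your SAC models.
actor = nn.Linear(8, 2)
critic = nn.Linear(10, 1)

actor_opt = torch.optim.Adam(actor.parameters(), lr=3e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=3e-4)

# Two-stage schedule: keep the high rate for the first 50k updates,
# then drop both rates by 100x for fine-tuning.
actor_sched = torch.optim.lr_scheduler.MultiStepLR(actor_opt, milestones=[50_000], gamma=0.01)
critic_sched = torch.optim.lr_scheduler.MultiStepLR(critic_opt, milestones=[50_000], gamma=0.01)

for step in range(100_000):
    # ... compute actor_loss and critic_loss from a replay-buffer batch,
    # then call zero_grad()/backward()/step() on each optimizer here ...
    actor_sched.step()
    critic_sched.step()
```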
I imagine AdamW's better weight-decay regularization may give similarly better results, by keeping overfitting to individual training batches from pulling the solution away from the optimum.
Based on the improvement I saw, it's probably worth trying one-cycle methods and AdamW for the actor and critic networks to tune the results. I still have some concerns about how the lower learning rate at the end of the cycle will adapt to changes in the environment, but a simple solution may be to monitor the loss and restart the learning rate if it drifts too much. In any case, more testing seems warranted.
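As a sketch of that restart idea (all names and thresholds here are hypothetical, not a tested recipe), the trigger could be as simple as comparing a running loss average against a baseline:

```python
# Hypothetical helper: bump the learning rate back up when a running loss
# estimate drifts upward, suggesting the environment (or the robot) changed.
def maybe_restart_lr(optimizer, loss, state, high_lr=3e-4, drift_factor=2.0, beta=0.99):
    # Exponential moving average of the recent loss, and its best level so far.
    state['ema'] = beta * state.get('ema', loss) + (1 - beta) * loss
    state['baseline'] = min(state.get('baseline', loss), state['ema'])
    if state['ema'] > drift_factor * state['baseline']:
        for group in optimizer.param_groups:   # PyTorch-style optimizer assumed
            group['lr'] = high_lr              # jump back to the high rate
        state['baseline'] = state['ema']       # reset the baseline so we don't re-trigger
    return state
```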

TensorFlow - GradientDescentOptimizer - are we actually finding global optimum?

I have been playing with TensorFlow for quite a while, and I have more of a theoretical question. In general, when we train a network we usually use GradientDescentOptimizer (or one of its variations such as Adagrad or Adam) to minimize the loss function. It looks like we are trying to adjust the weights and biases so that we reach the global minimum of this loss function. But the issue is that I assume this function looks extremely complicated if you plot it, with lots of local optima. What I wonder is: how can we be sure that gradient descent finds the global optimum, and that we are not instantly getting stuck in some local optimum that is far away from the global optimum?
I recall that, for example, when you perform clustering in sklearn it usually runs the clustering algorithm several times with random initialization of the cluster centers, and by doing this we ensure we don't get stuck with a suboptimal result. But we don't do anything like this when training ANNs in TensorFlow - we start with some random weights and just travel along the slope of the function.
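(For reference, in scikit-learn this is the n_init parameter; a tiny example with made-up data:)

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(200, 2)   # placeholder data

# n_init controls how many times k-means is rerun with different random
# centroid initializations; the run with the lowest inertia is kept.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.inertia_)
```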
So, any insight into this? Why can we be more or less sure that the results of training via gradient descent are close to the global minimum once the loss stops decreasing significantly?
Just to clarify why I am wondering about this: if we can't be sure that we get at least close to the global minimum, we can't easily judge which of two different models is actually better. We could run an experiment and get an evaluation showing the model is not good, when in fact it just got stuck in a local minimum shortly after training started, while another model that seemed better was simply lucky enough to start from a better initialization and didn't get stuck in a local minimum early on. Moreover, this means we can't even be sure we are getting the most out of the network architecture we are testing: it could have a really good global minimum that is hard to find, so we mostly end up with poor solutions at local minima far from the global optimum and never see the full potential of the network at hand.
Gradient descent, by its nature, looks at the function locally (the local gradient). Hence, there is absolutely no guarantee that it will find the global minimum. In fact, it probably will not, unless the function is convex. This is also the reason GD-like methods are sensitive to the initial position you start from. Having said that, there was a recent paper which argued that in high-dimensional solution spaces the number of local maxima/minima is not as large as previously thought.
Finding global minima in high-dimensional spaces in a reasonable way seems very much an unsolved problem. However, you might want to focus more on saddle points rather than minima. See this post, for example:
High level description for saddle point problem
A more detailed paper is here (https://arxiv.org/pdf/1406.2572.pdf)
Your intuition is quite right. Complex models such as neural networks are typically applied to problems with high dimensional input, where the error surface has a very complex landscape.
Neural networks are not guaranteed to find the global optimum, and getting stuck in local minima is a problem on which a lot of research has been focused. If you're interested in finding out more about this, it would be good to look at techniques such as online learning and momentum, which have traditionally been used to avoid the problem of local minima. However, these techniques bring further difficulties of their own, e.g. integrating online learning is not possible for some optimisation techniques, and adding a momentum hyper-parameter to the backpropagation algorithm brings further difficulties in training.
A really good video for visualising the influence of momentum (and how it overcomes local minima) during backpropagation can be found here.
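For reference, the classical momentum update can be written as follows (standard notation, not specific to any one library), where $\mu$ is the momentum coefficient and $\eta$ the learning rate:

$$v_{t+1} = \mu\, v_t - \eta\, \nabla L(w_t), \qquad w_{t+1} = w_t + v_{t+1}.$$

The accumulated velocity $v$ can carry the weights through shallow local minima or flat regions that would stall plain gradient descent.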
Added after question edit - see comments
It’s the aforementioned nature of the problems neural networks are applied to that means we often can’t find a globally optimal solution, because (in the general case) traversing the entire search space for the optimal solution would be intractable using classical computing technology (quantum computers could change this for some problems). As such neural networks are trained to achieve what is hopefully a ‘good’ local optimum.
If you're interested in reading more detailed information about the techniques employed to find good local optima (i.e. something that approximates a global solution) a good paper to read might be this
No. Gradient descent only helps find a local minimum. Only when the global minimum and the local minimum we find coincide do we get the actual result, i.e. the global minimum.

How Did Neural Networks Overcome the Bias/Variance Dilemma?

Deep learning has been seen as a rebranding of Neural Networks.
Were the issues presented in the paper "Neural Networks and the Bias/Variance Dilemma" by Stuart Geman ever resolved in the architectures in use today?
We have learned a lot about NNs, in particular:
we now learn better representations, thanks to progress in unsupervised/autoregressive learning such as restricted Boltzmann machines, autoencoders, denoising autoencoders, and variational autoencoders, which help us stabilize the process and learn from reasonable representations
we have better priors - not necessarily in the strict probabilistic sense, but we know, for example, that in image processing a good architecture is the convolutional one, so we get models that are smaller (in terms of parameters) yet better suited to the problem. Consequently we are less prone to overfitting.
we have better optimization techniques and activation functions, which help us with underfitting (we can learn larger networks); in particular, we can learn deeper networks. Why is deep often better than wide? Because, again, this is another prior - the assumption that the representation should be hierarchical - and it seems to be a valid prior for many modern problems (even if not all of them).
dropout and other techniques brought us better regularization methods (than the simple weight priors previously known and used), which again limits the problem of overfitting (variance).
There are many more things that changed, but in general we were simply able to find better architectures and better assumptions, so we now search in a narrower class of hypotheses. Consequently we overfit less (variance) and underfit less (bias) - yet there is still lots to be done!
The next thing is, as @david pointed out, the amount of data. We have huge datasets now; we often have access to more data than we can process in a reasonable time, and obviously more data means less variance - even highly overfitting models start to behave well.
Last, but not least - hardware. This is something that every single deep learning expert will tell you - our computers got stronger. We still use the same algorithms, the same architectures (with many little tweaks, but the core is the same), but our hardware is exponentially faster, and this changes a lot.
@lejlot gave a good overview. I want to point to two specific parts of the whole process.
First, neural networks are universal approximators. That means, their bias in principle can be made arbitrarily small. The problem that was rather thought to be severe was overfitting -- too large variance.
Now, a common and successful way in Machine Learning to deal with too-large variance is to "average it away" over many different predictions, which should be as uncorrelated as possible. This worked in Random Forests, for instance, and this is how I tend to understand current Neural Networks as well (particularly the maxout+dropout stuff). Of course, this is a narrow view - there is also the whole representational learning stuff, the not-explaining-away property, etc. - but it is one I find suitable for your question regarding the bias/variance tradeoff.
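One way to make the "averaging" view concrete is Monte-Carlo-style dropout averaging at prediction time. Here is a minimal PyTorch sketch with a placeholder network (my own illustration of the idea, not the only way to read dropout): each random dropout mask defines a different "thinned" network, and we average their outputs.

```python
import torch
import torch.nn as nn

# Placeholder network; the point is the Dropout layer, not the architecture.
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Dropout(p=0.5), nn.Linear(64, 1))

x = torch.randn(5, 20)   # a small batch of placeholder inputs

model.train()            # keeps Dropout stochastic at prediction time
with torch.no_grad():
    preds = torch.stack([model(x) for _ in range(100)])  # 100 thinned networks

print(preds.mean(dim=0))  # ensemble-style averaged prediction
print(preds.std(dim=0))   # spread across the thinned networks
```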
Second point: there is no better way to prevent overfitting than having a lot of data. And currently we are in a situation where we can gather a lot of data.

What should I look for in the analysis of the attached signals?

I'm looking to analyze and compare the following 'signals':
(Edit: better renderings here: oscillations good and here: oscillations bad)
What you see are plots of neuron activations from a type of artificial neural network plotted against time. Each line in the plot is a neuron's activation over time which can have a value between -1 and 1.
In the first plot, the activities are stable and consistent, while the second exemplifies more chaotic activity (for want of a better term) - some kind of destructive interference seems to occur every so often.
Anyhow, I would like to do some kind of 'clever' analysis but since signal analysis is really not my strong point, thought I'd ask for some advice here...
EDIT: Let me clarify a bit. Ultimately, I would like to characterize the data. This could, for example, involve pinpointing correlations between the individual signals contained in each plot. I would like to measure 'regularity' or data invariance: in the above examples, the upper plot is more regular than the lower plot. I guess I could therefore compute the variance of each signal and take that as a measure, but I was wondering whether some more comprehensive signal-processing technique could be better suited (I'm not sure). In fact, I'm not even sure signal processing is what I really want, now that I think about it. Perhaps some kind of wavelet or Fourier analysis...
For those interested, I am working on the computational modelling of worm locomotion.
You should consult some good books on nonlinear time-series analysis. For instance, a measure of the regularity of your signal could be the Lyapunov spectrum. Another possibility would be entropy. If you are interested in the correlation between signals, you could use transfer entropy or Granger causality, or, for neurons, it would be good to look at some measure of phase synchronization. The Bayesian stuff could also be worth trying.
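As one concrete example of a phase-synchronization measure, here is a minimal sketch of the phase-locking value computed via the Hilbert transform; the two signals below are synthetic stand-ins for your activation traces:

```python
import numpy as np
from scipy.signal import hilbert

# Toy stand-ins for two neuron-activation traces; replace with your real signals.
t = np.linspace(0, 10, 2000)
s1 = np.sin(2 * np.pi * 1.5 * t)
s2 = np.sin(2 * np.pi * 1.5 * t + 0.4) + 0.1 * np.random.randn(t.size)

# Instantaneous phase via the analytic signal (Hilbert transform).
phi1 = np.angle(hilbert(s1))
phi2 = np.angle(hilbert(s2))

# Phase-locking value: ~1 means the phases are synchronized, ~0 means they are not.
plv = np.abs(np.mean(np.exp(1j * (phi1 - phi2))))
print(plv)
```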
But, most importantly, you first need a proper question about what you really want to know. Once you've got that, it is far easier to pick the right tool.
And one final hint. Look for tools outside the engineering community. Their tools are mostly linear, but you are dealing with a highly nonlinear system. Wavelets, FFT and stuff are useful if you don't know anything about your signal and you want to have another perspective on it, but they are not suited for your kind of problem.

Relating Machine learning Techniques to solve optimization Problem

Consider an optimization problem of some dimension n: given a set of linear equations (or inequalities) constraining the inputs, which together form a convex region, find the maximum/minimum value of some expression that is a linear combination of the inputs (or dimensions).
For larger dimensions, these optimization problems take a long time to solve exactly.
So, can we use machine learning techniques to get an approximate solution in less time?
If we can use machine learning techniques in this context, what should the training set look like?
Do you mean "How big should the training set be?" If so, then that is very much a "how long is a piece of string" question. It needs to be large enough for the algorithm being used, and to represent the data that is being modeled.
This doesn't strike me as being especially focused on machine learning, as is typically meant by the term anyway. It's just a straightforward constrained optimization problem. You say that it takes too long to find solutions now, but you don't mention how you're trying to solve the problem.
The simplex algorithm is designed for this sort of problem, but it's exponential in the worst case. Is that what you're trying that's taking too long? If so, there are tons of metaheuristics that might perform well. Tabu search, simulated annealing, evolutionary algorithms, variable depth search, even simple multistart hill climbers. I would probably try something along those lines before I tried anything exotic.
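For reference, if the problem really is a linear program, an off-the-shelf exact solver is often fast enough in practice; here is a minimal scipy sketch on a made-up toy LP (the numbers are placeholders):

```python
from scipy.optimize import linprog

# Toy LP: maximize 3x + 2y subject to x + y <= 4, x + 3y <= 6, x >= 0, y >= 0.
# linprog minimizes, so we negate the objective to maximize.
c = [-3, -2]
A_ub = [[1, 1], [1, 3]]
b_ub = [4, 6]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)], method="highs")
print(res.x, -res.fun)   # optimal point and the maximized objective value
```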
