Do trained weights depend on the order in which training data has been input? - machine-learning

Suppose one makes a neural network using Keras. Do the trained weights depend on the order in which the training data has been fed into the system? Is it OK to feed data belonging to one category first and then data belonging to another category, or should the samples be fed in random order?

Since training is done in batches, which means the weights are optimized on the data chunk by chunk, the main assumption is that each batch is somewhat representative of the dataset. To make the batches representative, it is better to sample the data randomly.
Bottom line: the network will, in theory, learn better if you feed it the data in random order. I strongly advise you to shuffle your dataset when you feed it in training mode (there is an option for this in the .fit() function).
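For example, in Keras the `shuffle` argument of `.fit()` controls this. A minimal sketch with placeholder data (the model and array names are only for illustration):

```python
import numpy as np
from tensorflow import keras

# Tiny illustrative model; the architecture is arbitrary.
model = keras.Sequential([
    keras.Input(shape=(10,)),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

x_train = np.random.rand(1000, 10)              # placeholder features
y_train = np.random.randint(0, 2, size=1000)    # placeholder labels

# shuffle=True reshuffles the samples before every epoch,
# so each batch mixes the categories instead of seeing one class at a time.
model.fit(x_train, y_train, batch_size=32, epochs=5, shuffle=True)
```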
In inference mode, if you only want to make a forward pass on the neural net, then the order doesn't matter at all since you don't change the weights.
I hope this clarifies things a bit for you :-)

Nassim's answer is believed to be true for small networks and datasets, but recent articles (see e.g. this one) suggest that for deeper networks (with more than 4 layers), not shuffling your dataset might act as a kind of regularization, since poor minima are expected to be deep but small, while good minima are expected to be wide and hard to leave.
At inference time, the only case where this might harm your inference process is when you use the training distribution of your data in a highly coupled manner, e.g. running BatchNormalization or Dropout in training mode (as is sometimes done for certain kinds of Bayesian deep learning).
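As an illustration of that last point, in Keras a dropout layer can be kept active at inference by calling the model with `training=True`; this is the usual mechanism behind Monte Carlo dropout. A hedged sketch with a made-up model:

```python
import numpy as np
from tensorflow import keras

model = keras.Sequential([
    keras.Input(shape=(10,)),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dropout(0.5),
    keras.layers.Dense(1),
])

x = np.random.rand(5, 10)  # placeholder inputs

# Normal inference: dropout is disabled, the output is deterministic.
y_det = model(x, training=False)

# Training-mode inference: dropout stays active, so repeated calls give
# different outputs; averaging them is the Monte Carlo dropout idea.
y_samples = np.stack([model(x, training=True).numpy() for _ in range(20)])
y_mean, y_std = y_samples.mean(axis=0), y_samples.std(axis=0)
```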

Related

In what circumstances might using biases in a neural network not be beneficial?

I am currently working through Michael Nielsen's e-book Neural Networks and Deep Learning and have run the code found at the end of chapter 1, which trains a neural network to recognize hand-written digits (with a slight modification that makes backpropagation over a mini-batch matrix-based).
However, having run this code and achieving a classification accuracy of just under 94%, I decided to remove the use of biases from the network. After re-training the modified network, I found no difference in classification accuracy!
NB: The output layer of this network contains ten neurons; if the ith of these neurons has the highest activation then the input is classified as being the digit i.
This got me wondering why it is necessary to use biases in a neural network, rather than just weights, and what differentiates a task where biases will improve the performance of a network from a task where they will not.
My code can be found here: https://github.com/pipthagoras/neural-network-1
Biases are used to account for the fact that your underlying data might not be centered. It is clearer to see in the case of a linear regression.
If you do a regression without an intercept (or bias), you are forcing the underlying model to pass through the origin, which will result in a poor model if the underlying data is not centered (for example, if the true generating process is Y=3000). If, on the other hand, your data is centered or close to centered, then eliminating the bias is fine, since you won't introduce a term that is, in fact, independent of your predictive variable (it's like selecting a simpler model, which will tend to generalize better PROVIDED that it actually reflects the underlying data).
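A quick illustration of this idea (a sketch with made-up data; only the comparison matters): fit a line with and without an intercept to data whose values sit far from the origin.

```python
import numpy as np

# Data generated around y = 2*x + 3000, i.e. nowhere near the origin.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=(200, 1))
y = 2 * x + 3000 + rng.normal(0, 0.1, size=(200, 1))

# Least-squares fit WITHOUT an intercept: the line is forced through the origin.
w_no_bias, *_ = np.linalg.lstsq(x, y, rcond=None)
mse_no_bias = np.mean((x @ w_no_bias - y) ** 2)

# Least-squares fit WITH an intercept: append a constant column of ones.
X = np.hstack([x, np.ones_like(x)])
w_bias, *_ = np.linalg.lstsq(X, y, rcond=None)
mse_bias = np.mean((X @ w_bias - y) ** 2)

print(mse_no_bias)  # enormous: the model cannot represent the offset of ~3000
print(mse_bias)     # tiny: slope ~2, intercept ~3000
```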

How does pre-training improve classification in neural networks?

Many of the papers I have read so far have this mentioned "pre-training network could improve computational efficiency in terms of back-propagating errors", and could be achieved using RBMs or Autoencoders.
If I have understood correctly, autoencoders work by learning the identity function, and if they have fewer hidden units than the size of the input data, they also perform compression. But what does this even have to do with improving the computational efficiency of propagating the error signal backwards? Is it because the weights of the pre-trained hidden units do not diverge much from their initial values?
Data scientists reading this will already know that autoencoders take their inputs as target values, since they are learning the identity function, which is regarded as unsupervised learning. But can such a method be applied to convolutional neural networks, for which the first hidden layer is a feature map? Each feature map is created by convolving a learned kernel with a receptive field in the image. How could this learned kernel be obtained by pre-training (in an unsupervised fashion)?
One thing to note is that autoencoders try to learn a non-trivial identity function, not the identity function itself; otherwise they wouldn't be useful at all. Pre-training helps move the weight vectors towards a good starting point on the error surface. The backpropagation algorithm, which is basically doing gradient descent, is then used to improve upon those weights. Note that gradient descent gets stuck in the closest local minimum.
[Ignore the term Global Minimum in the image posted and think of it as another, better local minimum.]
Intuitively speaking, suppose you are looking for an optimal path to get from origin A to destination B. Having a map with no routes marked on it (the errors you obtain at the last layer of the neural network model) tells you roughly where to go, but you may end up on a route with a lot of obstacles and steep climbs and descents. Then suppose someone tells you about a route and a direction they have taken before (the pre-training) and hands you a new map (the pre-training phase's starting point).
This could be an intuitive reason why starting with random weights and immediately optimizing the model with backpropagation may not necessarily achieve the performance you obtain with a pre-trained model. However, note that many models achieving state-of-the-art results do not use pre-training at all, and they may use backpropagation in combination with other optimization methods (e.g. Adagrad, RMSProp, Momentum, etc.) to hopefully avoid getting stuck in a bad local minimum.
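A rough sketch of the idea in code (hypothetical layer sizes and placeholder data, not the setup of any particular paper): train an undercomplete autoencoder first, then reuse its encoder weights as the starting point for a supervised classifier.

```python
import numpy as np
from tensorflow import keras

x_unlabelled = np.random.rand(1000, 784)   # placeholder unlabelled data
x_train = np.random.rand(100, 784)         # small labelled set
y_train = keras.utils.to_categorical(np.random.randint(0, 10, 100), 10)

# 1) Unsupervised pre-training: an undercomplete autoencoder
#    (the inputs are also the targets).
encoder = keras.Sequential([keras.Input(shape=(784,)),
                            keras.layers.Dense(64, activation="relu")])
decoder = keras.layers.Dense(784, activation="sigmoid")
autoencoder = keras.Sequential([encoder, decoder])
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(x_unlabelled, x_unlabelled, epochs=10, verbose=0)

# 2) Supervised fine-tuning: the classifier starts from the pre-trained
#    encoder weights instead of a random initialisation.
classifier = keras.Sequential([encoder, keras.layers.Dense(10, activation="softmax")])
classifier.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
classifier.fit(x_train, y_train, epochs=10, verbose=0)
```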
I don't know a lot about autoencoder theory, but I've done a bit of work with RBMs. What an RBM does is estimate the probability of seeing a specific type of data, in order to get the weights initialized into the right ballpark; it is an (unsupervised) probabilistic model, so you don't correct it using the known labels. Basically, the idea is that a learning rate that is too big will never lead to convergence, while one that is too small will take forever to train. Thus, by "pre-training" in this way you find the ballpark of the weights and can then set the learning rate to be small in order to bring them down to their optimal values.
As for the second question: no, you don't generally pre-learn kernels, at least not in an unsupervised fashion. I suspect that what is meant by pre-training here is a bit different from your first question; that is, they are taking a pre-trained model (say, from a model zoo) and fine-tuning it on a new set of data.
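For that second sense of "pre-training", fine-tuning usually looks roughly like the sketch below (a hedged example using Keras' built-in ImageNet weights; the dataset, class count and epoch numbers are placeholders):

```python
import numpy as np
from tensorflow import keras

# Pre-trained convolutional base (the "model zoo" part), without its original classifier head.
base = keras.applications.MobileNetV2(weights="imagenet", include_top=False,
                                      pooling="avg", input_shape=(224, 224, 3))
base.trainable = False                     # freeze the pre-trained kernels at first

# New head for our own task (say, 5 classes - a made-up number).
model = keras.Sequential([base, keras.layers.Dense(5, activation="softmax")])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])

x_new = np.random.rand(32, 224, 224, 3)    # placeholder images
y_new = keras.utils.to_categorical(np.random.randint(0, 5, 32), 5)
model.fit(x_new, y_new, epochs=3, verbose=0)

# Optionally unfreeze the base and continue with a small learning rate,
# so the pre-trained kernels are only nudged rather than destroyed.
base.trainable = True
model.compile(optimizer=keras.optimizers.Adam(1e-5), loss="categorical_crossentropy")
model.fit(x_new, y_new, epochs=3, verbose=0)
```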
Which model you use generally depends on the type of data you have and the task at hand. I've found convnets to train faster and more efficiently, but not all data has meaning when convolved, in which case DBNs may be the way to go. If, say, you only have a small amount of data, I'd use something other than neural networks entirely.
Anyway, I hope this helps clear up some of your questions.

Model selection with dropout training neural network

I've been studying neural networks for a bit and recently learned about the dropout training algorithm. There are excellent papers out there to understand how it works, including the ones from the authors.
So I built a neural network with dropout training (it was fairly easy), but I'm a bit confused about how to perform model selection. From what I understand, it looks like dropout is a method to be used when training the final model obtained through model selection.
As for the test part, the papers always talk about using the complete network with halved weights, but they do not mention how to handle it in the training/validation part (at least the ones I read).
I was thinking about using the network without dropout for the model selection part. Say that leads me to find that the net performs well with N neurons. Then, for the final training (the one I use to train the network for the test part), I use 2N neurons with dropout probability p=0.5. That ensures I have exactly N neurons active on average, thus using the network at the right capacity most of the time.
Is this a correct approach?
By the way, I'm aware that dropout might not be the best choice with small datasets. The project I'm working on has academic purposes, so it's not strictly necessary that I use the best model for the data, as long as I stick to good machine-learning practices.
First of all, model selection and the training of a particular model are completely different issues. For model selection, you usually need data that is completely independent of both the training set used to build the model and the test set used to estimate its performance. So if you're doing, for example, a cross-validation, you need a nested scheme: an inner cross-validation to do the model selection (train the candidate models and compare their performance) and an outer cross-validation to estimate the performance of the whole procedure on unseen data.
To see why, consider the following thought experiment (shamelessly stolen from this paper). You have a model that makes a completely random prediction. It has a number of parameters that you can set, but they have no effect. If you try different parameter settings for long enough, you will eventually get a model that performs better than all the others, simply because you're sampling from a random distribution. If you use the same data for all of these models, this is the model you will choose. If you have a separate test set, it will quickly tell you that there is no real effect, because the parameter setting that achieved good results during the model-building phase is no better on the separate set.
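A minimal sketch of the nested cross-validation described above, using scikit-learn (the estimator and parameter grid are arbitrary placeholders):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# Inner loop: model selection (hyperparameter search) on each training fold.
inner = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=3)

# Outer loop: unbiased performance estimate of the whole selection procedure.
outer_scores = cross_val_score(inner, X, y, cv=5)
print(outer_scores.mean())
```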
Now, back to neural networks with dropout. You didn't refer to any particular paper; I'm assuming you mean Srivastava et al., "Dropout: A Simple Way to Prevent Neural Networks from Overfitting". I'm not an expert on the subject, but the method seems to me similar to what's used in random forests or bagging: mitigating the flaws an individual learner may exhibit by applying it repeatedly in slightly different contexts. If I understood the method correctly, what you essentially end up with is an average over several possible models, very similar to random forests.
This is a way to make an individual model better, but not a method for model selection. Dropout is a way of adjusting the learned weights of a single neural network model.
To do model selection on this, you would need to train and test neural networks with different parameters and then evaluate those on completely different sets of data, as described in the paper I've referenced above.
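In practice, with a library like Keras, the dropout is simply part of each candidate model's definition, and model selection compares those candidates on held-out validation data. A rough sketch (layer sizes, the validation split, and the placeholder data are arbitrary):

```python
import numpy as np
from tensorflow import keras

x_train = np.random.rand(500, 20)              # placeholder data
y_train = np.random.randint(0, 2, size=500)

def build_model(n_hidden, dropout_rate):
    model = keras.Sequential([
        keras.Input(shape=(20,)),
        keras.layers.Dense(n_hidden, activation="relu"),
        # Keras uses "inverted" dropout: active only during training,
        # so no manual weight-halving is needed at test time.
        keras.layers.Dropout(dropout_rate),
        keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

# Model selection: train each candidate and compare validation scores.
# The test set stays untouched until the very end.
for n_hidden in (64, 128, 256):
    model = build_model(n_hidden, dropout_rate=0.5)
    history = model.fit(x_train, y_train, validation_split=0.2, epochs=20, verbose=0)
    print(n_hidden, max(history.history["val_accuracy"]))
```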

extrapolation with recurrent neural network

I wrote a simple recurrent neural network (7 neurons, each one initially connected to all the others) and trained it using a genetic algorithm to learn "complicated", non-linear functions like 1/(1+x^2). As the training set, I used 20 values within the range [-5,5] (I tried using more than 20, but the results did not change dramatically).
The network can learn this range pretty well, and when given other points within this range, it can predict the value of the function. However, it cannot extrapolate correctly and predict the values of the function outside the range [-5,5]. What are the reasons for that, and what can I do to improve its extrapolation abilities?
Thanks!
Neural networks are not extrapolation methods (no matter whether recurrent or not); this is completely outside their capabilities. They are used to fit a function to the provided data, and they are completely free in how they build the model outside the subspace populated by training points. So, in a not very strict sense, one should think of them as an interpolation method.
To make things clear: a neural network should be capable of generalizing the function inside the subspace spanned by the training samples, but not outside of it.
A neural network is trained only in the sense of consistency with the training samples, while extrapolation is something completely different. A simple example from "H. Lohninger: Teach/Me Data Analysis, Springer-Verlag, Berlin-New York-Tokyo, 1999. ISBN 3-540-14743-8" shows how NNs behave in this context:
All of these networks are consistent with the training data, but can do anything outside of this subspace.
You should rather reconsider your problem's formulation, and if it can be expressed as a regression or classification problem then you can use NN, otherwise you should think about some completely different approach.
The only things that can be done to somehow "correct" what happens outside the training set are to (see the sketch after this list):
add artificial training points in the desired subspace (but this simply grows the training set, and again, outside of this new set the network's behaviour is "random");
add strong regularization, which forces the network to create a very simple model; but a model's low complexity does not guarantee any extrapolation strength, as two models of exactly the same complexity can have, for example, completely different limits at -/+ infinity.
Combining the above two steps can help build a model which to some extent "extrapolates", but this, as stated before, is not the purpose of a neural network.
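A small experiment that makes this visible (a sketch using a plain feed-forward net rather than your recurrent one, with arbitrary layer sizes): train on 1/(1+x^2) inside [-5,5] only, then query points far outside that range.

```python
import numpy as np
from tensorflow import keras

# Training data only inside [-5, 5].
x_train = np.linspace(-5, 5, 200).reshape(-1, 1)
y_train = 1.0 / (1.0 + x_train ** 2)

model = keras.Sequential([
    keras.Input(shape=(1,)),
    keras.layers.Dense(32, activation="tanh"),
    keras.layers.Dense(32, activation="tanh"),
    keras.layers.Dense(1),
])
model.compile(optimizer=keras.optimizers.Adam(0.01), loss="mse")
model.fit(x_train, y_train, epochs=500, verbose=0)

# Inside the training range the fit should be good; outside it the network is unconstrained.
print(model.predict(np.array([[0.5], [3.0]])))    # typically close to the true values
print(model.predict(np.array([[8.0], [20.0]])))   # typically far from 1/(1+x^2)
```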
As far as I know, this is only possible with networks that have the echo state property. See Echo State Networks on scholarpedia.org.
These networks are designed for arbitrary signal learning and are capable of remembering their behavior.
You can also take a look at this tutorial.
The nature of your post(s) suggests that what you're referring to as "extrapolation" would be more accurately described as "sequence recognition and reproduction." Training networks to recognize a data sequence, with or without a time series (dt), is pretty much the purpose of Recurrent Neural Networks (RNNs).
The training function shown in your post has outputs bounded between 0 and 1 (and x is effectively abs(x) in the context of that function). So, first things first, be certain your input layer can easily distinguish between negative and positive inputs (if it must).
Next, the number of neurons is not nearly as important as how they're layered and interconnected. How many of the 7 were used for the sequence inputs? What type of network was used, and how was it configured? Network feedback will reveal the ratios, proportions, relationships, etc., and aid in adjusting the network weights to match the sequence. Feedback can also take the form of a forward feed, depending on the type of network used to create the RNN.
Producing an 'observable' network for the decaying function 1/(1+x^2) should be a decent exercise to cut your teeth on RNNs. 'Observable' here means the network is capable of producing results for any input value(s), even though its training data is (far) smaller than the set of all possible inputs. I can only assume that this was your actual objective, as opposed to "extrapolation."

scaling inputs data to neural network

Do we have to scale input data for a neural network? How does it affect the final solution of the neural network?
I've tried to find some reliable sources on this. The book "The Elements of Statistical Learning" (page 400) says it helps in choosing reasonable initial random weights to start with.
Aren't the final weights deterministic regardless of the initial random weights we use?
Thank you.
Firstly, there are many types of ANNs; I will assume you are talking about the simplest one, a multilayer perceptron with backpropagation.
Secondly, in your question you are mixing up data scaling (normalization) and weight initialization.
You need to initialize the weights randomly to avoid symmetry during learning (if all weights are initially the same, their updates will also be the same). In general, the concrete values don't matter, but values that are too large can cause slower convergence.
You are not required to normalize your data, but normalization can make the learning process faster. See this question for more details.
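A typical way to do this (a sketch; StandardScaler is just one common choice) is to standardise each input feature using statistics computed on the training set only:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

x_train = np.random.rand(100, 5) * 1000           # placeholder features on very different scales
x_test = np.random.rand(20, 5) * 1000

scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train)     # fit mean/std on the training data only
x_test_scaled = scaler.transform(x_test)           # reuse the same statistics for new data

# The scaled arrays are then what gets fed to model.fit() / model.predict().
```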
