Scaling input data for a neural network - machine-learning

Do we have to scale input data for a neural network? How does it affect the final solution of the neural network?
I've tried to find some reliable sources on that. The book "The Elements of Statistical Learning" (page 400) says that scaling helps in choosing reasonable initial random weights to start with.
Aren't the final weights deterministic regardless of the initial random weights we use?
Thank you.

Firstly, there are many types of ANNs; I will assume you are talking about the simplest one, a multilayer perceptron trained with backpropagation.
Secondly, in your question you are mixing up data scaling (normalization) and weight initialization.
You need to randomly initialize weights to avoid symmetry while learning (if all weights are initially the same, their update will also be the same). In general, concrete values don't matter, but too large values can cause slower convergence.
You are not required to normalize your data, but normalization can make the learning process faster. See this question for more details.
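The symmetry argument above can be checked numerically. The following is only a minimal sketch, assuming a one-hidden-layer MLP with two sigmoid hidden units, a linear output and a squared-error loss; the function name hidden_gradients and all the numbers are made up for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def hidden_gradients(w_hidden, w_out, x, target):
    """Gradient of 0.5 * (y - target)^2 w.r.t. each hidden unit's weights.

    Hypothetical toy network: sigmoid hidden units, linear output, no biases.
    """
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    y = sum(v * hj for v, hj in zip(w_out, h))
    err = y - target
    grads = []
    for v, hj in zip(w_out, h):
        # Backpropagate the error through the output weight and the sigmoid.
        delta = err * v * hj * (1.0 - hj)
        grads.append([delta * xi for xi in x])
    return grads


x, target = [0.5, -1.2], 1.0

# All weights equal: both hidden units receive exactly the same update.
same = hidden_gradients([[0.3, 0.3], [0.3, 0.3]], [0.3, 0.3], x, target)

# Different (e.g. random) weights: the updates differ, so the units can specialize.
rand = hidden_gradients([[0.3, -0.1], [-0.4, 0.2]], [0.5, -0.3], x, target)

print(same[0] == same[1])  # identical gradients
print(rand[0] != rand[1])  # symmetry is broken
```

With identical weights the two hidden units stay copies of each other after every update, which is why the initial weights need to differ.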

Related

How to deal with dataset of different features?

I am working on an MLP model for a CEA classification dataset (binary classification). Each sample contains 4 different features, such as resistance and other values, each in its own range (resistance in the hundreds, another in the micro range, etc.). I am still new to machine learning and this is the first real model I am building. How can I deal with such data? I have tried feeding each sample to the neural network with a sigmoid activation function, but I am not getting accurate results. My assumption is that I should scale the data. If so, what are some useful resources to look at, since I do not quite understand when scaling is required?
Scaling your data can be an important step in building a machine-learning model, especially when working with neural networks. Scaling can help to ensure that all of the features in your dataset are on a similar scale, which can make it easier for the model to learn.
There are a few different ways to scale your data, such as normalization and standardization. Normalization is the process of scaling the data so that it has a minimum value of 0 and a maximum value of 1. Standardization is the process of scaling the data so that it has a mean of 0 and a standard deviation of 1.
When working with your CEA Classification dataset, it might be helpful to try both normalization and standardization to see which one works better for your specific dataset. You can use the scikit-learn preprocessing classes MinMaxScaler and StandardScaler for normalization and standardization respectively.
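To make the two definitions concrete, here is a small sketch in plain Python that applies them to one feature column at a time. The helper names min_max_scale and standardize and the sample values are assumptions for illustration only; in practice the scikit-learn scalers mentioned above do the same per-column arithmetic for you.

```python
from statistics import mean, pstdev


def min_max_scale(column):
    """Normalization: map the column to the range [0, 1].

    Assumes the column is not constant (max > min).
    """
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]


def standardize(column):
    """Standardization: zero mean and unit (population) standard deviation.

    Assumes the column is not constant (standard deviation > 0).
    """
    mu, sigma = mean(column), pstdev(column)
    return [(v - mu) / sigma for v in column]


# Made-up values on very different scales, as in the question.
resistance = [120.0, 340.0, 560.0, 780.0]    # hundreds
micro_feature = [2e-6, 4e-6, 6e-6, 8e-6]     # micros

resistance_01 = min_max_scale(resistance)
micro_01 = min_max_scale(micro_feature)
resistance_std = standardize(resistance)
micro_std = standardize(micro_feature)
```

After either transformation both columns live on a comparable scale, so neither dominates the weighted sums in the first layer. When you split into training and test sets, compute the minimum/maximum or mean/standard deviation on the training set only and reuse them for the test set.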
Additionally, it might be helpful to try different activation functions, such as ReLU or LeakyReLU, to see if they lead to more accurate results. Also, you can try adding more layers and neurons in your neural network to see if it improves the performance.
It's also important to remember that feature engineering, which includes the process of selecting the most important features, can be more important than scaling.

In layman's terms, what's the difference between a loss function and an optimization algorithm?

I get that training a network is all about finding the right weights, with the optimization algorithm deciding how the weights are updated until the ones needed to get the right prediction are found.
So the million-dollar questions are:
(1.) If optimization algorithms update the weights, what do loss functions do to the weights of the network?
(2.) Are loss functions only specific to the output layer of a neural network? (most examples I see with the deeplearning4j framework implement it at the output layer).
P.S: I really want to understand the basic difference between these two in the simplest way possible. I am not looking for anything complex or with some mathematical explosions.
The loss function measures how wrong the network's predictions are for a given set of weights; it does not change the weights itself. The optimization algorithm tries to find the minimum of the loss function, at which point the weights are ideal.
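A tiny sketch may help separate the two roles. It assumes the simplest possible "network", a single weight w in the model y = w * x, a mean squared error loss, and plain gradient descent as the optimizer; all names and numbers are illustrative.

```python
def mse_loss(w, xs, ys):
    """Loss function: scores how wrong the predictions w * x are."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def mse_gradient(w, xs, ys):
    """Slope of the loss with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def gradient_descent(w, xs, ys, learning_rate=0.05, steps=200):
    """Optimization algorithm: repeatedly moves w downhill on the loss."""
    for _ in range(steps):
        w -= learning_rate * mse_gradient(w, xs, ys)
    return w


xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]   # generated by y = 2 * x, so the ideal weight is 2

w_start = 0.0
w_final = gradient_descent(w_start, xs, ys)

print(mse_loss(w_start, xs, ys), mse_loss(w_final, xs, ys))
```

The loss only produces a number (and a slope); the optimizer is the part that actually changes w, using that slope to decide the direction and size of each update.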

Do trained weights depend on the order in which trained data has been input?

Suppose one builds a neural network using Keras. Do the trained weights depend on the order in which the training data has been fed into the system? Is it OK to feed data belonging to one category first and then data belonging to another category, or should the order be random?
As the training will be done in batches, which means optimizing the weights on data chunk by chunk, the main assumption is that the batches of data are somewhat representative of the dataset. To make it representative it is thus better to randomly sample the data.
Bottom line: the network will theoretically learn better if you feed it the data in random order. I strongly advise you to shuffle your dataset when you feed it in training mode (there is a shuffle option in the .fit() function).
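To see why class-sorted data makes batches unrepresentative, here is a small sketch that only looks at the labels. The batches helper, the batch size and the 8 + 8 label split are assumptions for illustration; with Keras you would normally just rely on the shuffle argument of .fit() rather than shuffling by hand.

```python
import random


def batches(samples, batch_size):
    """Split a list into consecutive chunks of batch_size."""
    return [samples[i:i + batch_size] for i in range(0, len(samples), batch_size)]


def class_one_fraction(batch_list):
    """Fraction of class-1 labels in each batch."""
    return [sum(b) / len(b) for b in batch_list]


# Labels fed category by category: all of class 0 first, then all of class 1.
labels_sorted = [0] * 8 + [1] * 8

# Same labels in random order (fixed seed only to make the run repeatable).
rng = random.Random(0)
labels_shuffled = labels_sorted[:]
rng.shuffle(labels_shuffled)

sorted_fractions = class_one_fraction(batches(labels_sorted, 4))
shuffled_fractions = class_one_fraction(batches(labels_shuffled, 4))

print(sorted_fractions)    # every batch contains a single class
print(shuffled_fractions)  # batches typically mix both classes
```

In the sorted case each weight update only ever sees one category, so the network is pushed towards "always class 0" and later towards "always class 1", instead of seeing batches that resemble the whole dataset.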
In inference mode, if you only want to make a forward pass on the neural net, then the order doesn't matter at all since you don't change the weights.
I hope this clarifies things a bit for you :-)
Nassim's answer is believed to be true for small networks and datasets, but recent articles (e.g. this one) suggest that for deeper networks (with more than 4 layers) not shuffling your data set might act as a kind of regularization, as poor minima are expected to be deep but narrow, while good minima are expected to be wide and hard to leave.
At inference time, the only case where the order might harm you is when the model depends on the distribution of the data it is fed in a highly coupled manner, e.g. when using BatchNormalization or Dropout as in the training phase (this is sometimes done for some kinds of Bayesian deep learning).

How does pre-training improve classification in neural networks?

Many of the papers I have read so far mention that "pre-training a network could improve computational efficiency in terms of back-propagating errors", and that this could be achieved using RBMs or autoencoders.
If I have understood correctly, autoencoders work by learning the identity function, and if the network has fewer hidden units than the size of the input data, it also performs compression. BUT what does this have to do with improving computational efficiency in propagating the error signal backwards? Is it because the weights of the pre-trained hidden units do not diverge much from their initial values?
Data scientists reading this will already know that autoencoders take the inputs as target values, since they are learning the identity function, which is regarded as unsupervised learning. But can such a method be applied to convolutional neural networks, for which the first hidden layer is a feature map? Each feature map is created by convolving a learned kernel with a receptive field in the image. How could this learned kernel be obtained by pre-training (in an unsupervised fashion)?
One thing to note is that autoencoders try to learn a non-trivial identity function, not the identity function itself. Otherwise they wouldn't be useful at all. Pre-training helps move the weight vectors towards a good starting point on the error surface. Then the backpropagation algorithm, which is basically doing gradient descent, is used to improve upon those weights. Note that gradient descent gets stuck in the closest local minimum.
[Ignore the term Global Minima in the image posted and think of it as another, better, local minima]
Intuitively speaking, suppose you are looking for an optimal path to get from origin A to destination B. Having a map with no routes shown on it (the errors you obtain at the last layer of the neural network model) kind of tells you where to go. But you may put yourself on a route which has a lot of obstacles, uphills and downhills. Then suppose someone tells you about a route and a direction he has gone through before (the pre-training) and hands you a new map (the pre-training phase's starting point).
This could be an intuitive reason why starting with random weights and immediately optimizing the model with backpropagation may not necessarily achieve the performance you obtain with a pre-trained model. However, note that many models achieving state-of-the-art results do not necessarily use pre-training, and they may use backpropagation in combination with other optimization methods (e.g. Adagrad, RMSProp, momentum, ...) to hopefully avoid getting stuck in a bad local minimum.
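The "starting point matters" intuition can be reproduced on a toy error surface. This is only a sketch: the one-dimensional function below is an arbitrary stand-in for a real loss surface, and the "good" start at -2.0 merely plays the role that pre-trained weights would play; nothing here is an actual autoencoder or RBM.

```python
def error(x):
    """Toy error surface with two local minima (near x = -1.30 and x = 1.13)."""
    return x ** 4 - 3 * x ** 2 + x


def error_slope(x):
    return 4 * x ** 3 - 6 * x + 1


def descend(x, learning_rate=0.01, steps=500):
    """Plain gradient descent: always walks downhill from where it starts."""
    for _ in range(steps):
        x -= learning_rate * error_slope(x)
    return x


x_from_bad_start = descend(2.0)     # ends in the nearby, shallower minimum
x_from_good_start = descend(-2.0)   # ends in the deeper minimum

print(x_from_bad_start, error(x_from_bad_start))
print(x_from_good_start, error(x_from_good_start))
```

Both runs use the same optimizer and the same surface; only the starting point differs, and each run settles in the minimum closest to where it began.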
Here's the source for the second image.
I don't know a lot about autoencoder theory, but I've done a bit of work with RBMs. What RBMs do is model the probability of seeing a specific type of data in order to get the weights initialized to the right ballpark. An RBM is considered an (unsupervised) probabilistic model, so you don't correct it using the known labels. Basically, the idea here is that a learning rate that is too big will never lead to convergence, while one that is too small will take forever to train. Thus, by "pretraining" in this way you find the ballpark of the weights and can then set the learning rate to be small in order to bring them down to the optimal values.
As for the second question, no, you don't generally pre-learn kernels, at least not in an unsupervised fashion. I suspect that what is meant by pretraining here is a bit different from what is meant in your first question, that is to say, they are taking a pretrained model (say from a model zoo) and fine-tuning it with a new set of data.
Which model you use generally depends on the type of data you have and the task at hand. I've found convnets to train faster and more efficiently, but not all data has meaning when convolved, in which case DBNs may be the way to go. If, say, you only have a small amount of data, I'd use something other than neural networks entirely.
Anyways, I hope this helps clear some of your questions.

How to continue to train SVM based on the previous model

We all know that the objective function of SVM is iteratively trained. In order to continue training, at least we can store all the variables used in the iterations if we want to continue on the same training dataset.
However, if we want to train on a slightly different dataset, what should we do to make full use of the previously trained model? Does this kind of idea even make sense? I think it is quite reasonable if we train a K-means model, but I am not sure if it still makes sense for the SVM problem.
There is some literature on this topic:
Alpha-seeding, in which the training data is divided into chunks. After you train an SVM on the ith chunk, you take the resulting alpha values and use them to seed the training of your SVM on the (i+1)th chunk.
Incremental SVM, a form of online learning in which you update a classifier with new examples rather than retraining on the entire data set.
The SVM heavy package, which supports online SVM training as well.
What you are describing is what an online learning algorithm does, and unfortunately the classic SVM formulation is trained in a batch fashion.
However, there are several SVM solvers that produce quasi-optimal hypotheses for the underlying optimization problem in an online learning way. My favourite in particular is Pegasos-SVM, which can find a good near-optimal solution in linear time:
http://ttic.uchicago.edu/~nati/Publications/PegasosMPB.pdf
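For a rough idea of how this looks in code, below is a minimal sketch of the basic Pegasos update for a linear SVM without a bias term and without the optional projection step. The function signature, the w / t_start warm-start parameters and the toy data are my own assumptions for illustration, not something taken from the paper's reference implementation; resuming from a previous w and step counter is just one possible way to "continue" training on slightly different data.

```python
import random


def pegasos(data, lam=0.1, steps=1000, w=None, t_start=0, seed=0):
    """Stochastic sub-gradient descent on the regularized hinge loss.

    data: list of (x, y) pairs, x a list of floats, y in {-1, +1}.
    Returns (w, t) so that a later call can resume from this state.
    """
    rng = random.Random(seed)
    dim = len(data[0][0])
    w = list(w) if w is not None else [0.0] * dim
    t = t_start
    for _ in range(steps):
        t += 1
        x, y = rng.choice(data)          # one random training example
        eta = 1.0 / (lam * t)            # decaying step size
        margin = y * sum(wi * xi for wi, xi in zip(w, x))
        w = [(1.0 - eta * lam) * wi for wi in w]   # shrink (regularization)
        if margin < 1.0:                 # example violates the margin
            w = [wi + eta * y * xi for wi, xi in zip(w, x)]
    return w, t


def predict(w, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1


# Made-up, linearly separable toy data.
old_data = [([2.0, 2.0], 1), ([3.0, 1.0], 1), ([1.0, 3.0], 1),
            ([-2.0, -2.0], -1), ([-3.0, -1.0], -1), ([-1.0, -3.0], -1)]
w, t = pegasos(old_data)

# Slightly different dataset: continue from the previous weights and step count.
new_data = old_data + [([2.0, 3.0], 1), ([-3.0, -2.0], -1)]
w_new, t_new = pegasos(new_data, w=w, t_start=t)

print(all(predict(w_new, x) == y for x, y in new_data))
```

Because each step touches a single example, new examples can simply be added to the pool that the updates sample from, which is what makes this style of solver convenient for the "train a bit more" scenario in the question.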
In general this doesn't make sense. SVM training is an optimization process with regard to every training set vector. Each training vector has an associated coefficient, which as a result is either 0 (irrelevant) or > 0 (support vector). Adding another training vector imposes another, different, optimization problem.
The only way to reuse information from previous training I can think of is to choose support vectors from the previous training and add them to the new training set. I'm not sure, but this probably will negatively affect generalization - VC dimension of an SVM is related to the number of support vectors, so adding previous support vectors to the new dataset is likely to increase the support vector count.
Apparently, there are more possibilities, as noted in lennon310's answer.
