Updating the weights in a 2-layer neural network - machine-learning

I am trying to simulate a XOR gate using a neural network similar to this:
Now I understand that each neuron has certain number of weights and a bias. I am using a sigmoid function to determine whether a neuron should fire or not in each state (since this uses a sigmoid rather than a step function, I use firing in a loose sense as it actually spits out real values).
I successfully ran the simulation for feed-forwarding part, and now I want to use the backpropagation algorithm to update the weights and train the model. The question is, for each value of x1 and x2 there is a separate result (4 different combinations in total) and under different input pairs, separate error distances (the difference between the desired output and the actual result) could be be computed and subsequently a different set of weight updates will eventually be achieved. This means we would get 4 different sets of weight updates for each separate input pairs by using backpropagation.
How should we decide about the right weight updates?
Say we repeat the back propagation for a single input pair until we converge, but what if we would converge to a different set of weights if we choose another pair of inputs?

Now I understand that each neuron has certain weights. I am using a sigmoid function to determine a neuron should fire or not in each state.
You do not really "decide" this, typical MLP do not "fire", they output real values. There are neural networks which actually fire (like RBMs) but this is a completely different model.
This means we would get 4 different sets of weight updates for each input pairs by using back propagation.
This is actually a feature. Lets start from the beggining. You try to minimize some loss function on your whole training set (in your case - 4 samples), which is of form:
L(theta) = SUM_i l(f(x_i), y_i)
where l is some loss function, f(x_i) is your current prediction and y_i true value. You do this by gradient descent, thus you try to compute the gradient of L and go against it
grad L(theta) = grad SUM_i l(f(x_i), y_i) = SUM_i grad l(f(x_i), y_i)
what you now call "a single update" is grad l(f(x_i) y_i) for a single training pair (x_i, y_i). Usually you would not use this, but instead you would sum (or taken average) of updates across whole dataset, as this is your true gradient. Howver, in practise this might be computationaly not feasible (training set is usualy quite large), furthermore, it has been shown empirically that more "noise" in training is usually better. Thus another learning technique emerged, called stochastic gradient descent, which, in short words, shows that under some light assumptions (like additive loss function etc.) you can actually do your "small updates" independently, and you will still converge to local minima! In other words - you can do your updates "point-wise" in random order and you will still learn. Will it be always the same solution? No. But this is also true for computing whole gradient - optimization of non-convex functions is nearly always non-deterministic (you find some local solution, not global one).

Related

Minibatch SGD gradient computation- average or sum

I am trying to understand how the gradients are computed when using miinibatch SGD. I have implemented it in CS231 online course, but only came to realize that in intermediate layers the gradient is basically the sum over all the gradients computed for each sample (the same for the implementations in Caffe or Tensorflow). It is only in the last layer (the loss) that they are averaged by the number of samples.
Is this correct? if so, does it mean that since in the last layer they are averaged, when doing backprop, all the gradients are also averaged automatically?
Thanks!
It is best to understand why SGD works first.
Normally, what a neural network actually is, a very complex composite function of an input vector x, a label y(or target variable, changes according to whether the problem is classification or regression) and some parameter vector, w. Assume that we are working on classification. We are actually trying to do a maximum likelihood estimation (actually MAP estimation since we are certainly going to use L2 or L1 regularization, but this is too much technicality for now) for variable vector w. Assuming that samples are independent; then we have the following cost function:
p(y1|w,x1)p(y2|w,x2) ... p(yN|w,xN)
Optimizing this wrt to w is a mess due to the fact that all of these probabilities are multiplicated (this will produce an insanely complicated derivative wrt w). We use log probabilities instead (taking log does not change the extreme points and we divide by N, so we can treat our training set as a empirical probability distribution, p(x) )
J(X,Y,w)=-(1/N)(log p(y1|w,x1) + log p(y2|w,x2) + ... + log p(yN|w,xN))
This is the actual cost function we have. What the neural network actually does is to model the probability function p(yi|w,xi). This can be a very complex 1000+ layered ResNet or just a simple perceptron.
Now the derivative for w is simple to state, since we have an addition now:
dJ(X,Y,w)/dw = -(1/N)(dlog p(y1|w,x1)/dw + dlog p(y2|w,x2)/dw + ... + dlog p(yN|w,xN)/dw)
Ideally, the above is the actual gradient. But this batch calculation is not easy to compute. What if we are working on a dataset with 1M training samples? Worse, the training set may be a stream of samples x, which has an infinite size.
The Stochastic part of the SGD comes into play here. Pick m samples with m << N randomly and uniformly from the training set and calculate the derivative by using them:
dJ(X,Y,w)/dw =(approx) dJ'/dw = -(1/m)(dlog p(y1|w,x1)/dw + dlog p(y2|w,x2)/dw + ... + dlog p(ym|w,xm)/dw)
Remember that we had an empirical (or actual in the case of infinite training set) data distribution p(x). The above operation of drawing m samples from p(x) and averaging them actually produces the unbiased estimator, dJ'/dw, for the actual derivative dJ(X,Y,w)/dw. What does that mean? Take many such m samples and calculate different dJ'/dw estimates, average them as well and you get dJ(X,Y,w)/dw very closely, even exactly, in the limit of infinite sampling. It can be shown that these noisy but unbiased gradient estimates will behave like the original gradient in the long run. On the average, SGD will follow the actual gradient's path (but it can get stuck at a different local minima, all depends on the selection of the learning rate). The minibatch size m is directly related to the inherent error in the noisy estimate dJ'/dw. If m is large, you get gradient estimates with low variance, you can use larger learning rates. If m is small or m=1 (online learning), the variance of the estimator dJ'/dw is very high and you should use smaller learning rates, or the algorithm may easily diverge out of control.
Now enough theory, your actual question was
It is only in the last layer (the loss) that they are averaged by the number of samples. Is this correct? if so, does it mean that since in the last layer they are averaged, when doing backprop, all the gradients are also averaged automatically? Thanks!
Yes, it is enough to divide by m in the last layer, since the chain rule will propagate the factor (1/m) to all parameters once the lowermost layer is multiplied by it. You don't need to do separately for each parameter, this will be invalid.
In the last layer they are averaged, and in the previous are summed. The summed gradients in previous layers are summed across different nodes from the next layer, not by the examples. This averaging is done only to make the learning process behave similarly when you change the batch size -- everything should work the same if you sum all the layers, but decrease the learning rate appropriately.

Why use softmax only in the output layer and not in hidden layers?

Most examples of neural networks for classification tasks I've seen use the a softmax layer as output activation function. Normally, the other hidden units use a sigmoid, tanh, or ReLu function as activation function. Using the softmax function here would - as far as I know - work out mathematically too.
What are the theoretical justifications for not using the softmax function as hidden layer activation functions?
Are there any publications about this, something to quote?
I haven't found any publications about why using softmax as an activation in a hidden layer is not the best idea (except Quora question which you probably have already read) but I will try to explain why it is not the best idea to use it in this case :
1. Variables independence : a lot of regularization and effort is put to keep your variables independent, uncorrelated and quite sparse. If you use softmax layer as a hidden layer - then you will keep all your nodes (hidden variables) linearly dependent which may result in many problems and poor generalization.
2. Training issues : try to imagine that to make your network working better you have to make a part of activations from your hidden layer a little bit lower. Then - automaticaly you are making rest of them to have mean activation on a higher level which might in fact increase the error and harm your training phase.
3. Mathematical issues : by creating constrains on activations of your model you decrease the expressive power of your model without any logical explaination. The strive for having all activations the same is not worth it in my opinion.
4. Batch normalization does it better : one may consider the fact that constant mean output from a network may be useful for training. But on the other hand a technique called Batch Normalization has been already proven to work better, whereas it was reported that setting softmax as activation function in hidden layer may decrease the accuracy and the speed of learning.
Actually, Softmax functions are already used deep within neural networks, in certain cases, when dealing with differentiable memory and with attention mechanisms!
Softmax layers can be used within neural networks such as in Neural Turing Machines (NTM) and an improvement of those which are Differentiable Neural Computer (DNC).
To summarize, those architectures are RNNs/LSTMs which have been modified to contain a differentiable (neural) memory matrix which is possible to write and access through time steps.
Quickly explained, the softmax function here enables a normalization of a fetch of the memory and other similar quirks for content-based addressing of the memory. About that, I really liked this article which illustrates the operations in an NTM and other recent RNN architectures with interactive figures.
Moreover, Softmax is used in attention mechanisms for, say, machine translation, such as in this paper. There, the Softmax enables a normalization of the places to where attention is distributed in order to "softly" retain the maximal place to pay attention to: that is, to also pay a little bit of attention to elsewhere in a soft manner. However, this could be considered like to be a mini-neural network that deals with attention, within the big one, as explained in the paper. Therefore, it could be debated whether or not Softmax is used only at the end of neural networks.
Hope it helps!
Edit - More recently, it's even possible to see Neural Machine Translation (NMT) models where only attention (with softmax) is used, without any RNN nor CNN: http://nlp.seas.harvard.edu/2018/04/03/attention.html
Use a softmax activation wherever you want to model a multinomial distribution. This may be (usually) an output layer y, but can also be an intermediate layer, say a multinomial latent variable z. As mentioned in this thread for outputs {o_i}, sum({o_i}) = 1 is a linear dependency, which is intentional at this layer. Additional layers may provide desired sparsity and/or feature independence downstream.
Page 198 of Deep Learning (Goodfellow, Bengio, Courville)
Any time we wish to represent a probability distribution over a discrete variable with n possible values, we may use the softmax function. This can be seen as a generalization of the sigmoid function which was used to represent a probability
distribution over a binary variable.
Softmax functions are most often used as the output of a classifier, to represent the probability distribution over n different classes. More rarely, softmax functions can be used inside the model itself, if we wish the model to choose between one of n different options for some internal variable.
Softmax function is used for the output layer only (at least in most cases) to ensure that the sum of the components of output vector is equal to 1 (for clarity see the formula of softmax cost function). This also implies what is the probability of occurrence of each component (class) of the output and hence sum of the probabilities(or output components) is equal to 1.
Softmax function is one of the most important output function used in deep learning within the neural networks (see Understanding Softmax in minute by Uniqtech). The Softmax function is apply where there are three or more classes of outcomes. The softmax formula takes the e raised to the exponent score of each value score and devide it by the sum of e raised the exponent scores values. For example, if I know the Logit scores of these four classes to be: [3.00, 2.0, 1.00, 0.10], in order to obtain the probabilities outputs, the softmax function can be apply as follows:
import numpy as np
def softmax(x):
z = np.exp(x - np.max(x))
return z / z.sum()
scores = [3.00, 2.0, 1.00, 0.10]
print(softmax(scores))
Output: probabilities (p) = 0.642 0.236 0.087 0.035
The sum of all probabilities (p) = 0.642 + 0.236 + 0.087 + 0.035 = 1.00. You can try to substitute any value you know in the above scores, and you will get a different values. The sum of all the values or probabilities will be equal to one. That’s makes sense, because the sum of all probability is equal to one, thereby turning Logit scores to probability scores, so that we can predict better. Finally, the softmax output, can help us to understand and interpret Multinomial Logit Model. If you like the thoughts, please leave your comments below.

Questions about Q-Learning using Neural Networks

I have implemented Q-Learning as described in,
http://web.cs.swarthmore.edu/~meeden/cs81/s12/papers/MarkStevePaper.pdf
In order to approx. Q(S,A) I use a neural network structure like the following,
Activation sigmoid
Inputs, number of inputs + 1 for Action neurons (All Inputs Scaled 0-1)
Outputs, single output. Q-Value
N number of M Hidden Layers.
Exploration method random 0 < rand() < propExplore
At each learning iteration using the following formula,
I calculate a Q-Target value then calculate an error using,
error = QTarget - LastQValueReturnedFromNN
and back propagate the error through the neural network.
Q1, Am I on the right track? I have seen some papers that implement a NN with one output neuron for each action.
Q2, My reward function returns a number between -1 and 1. Is it ok to return a number between -1 and 1 when the activation function is sigmoid (0 1)
Q3, From my understanding of this method given enough training instances it should be quarantined to find an optimal policy wight? When training for XOR sometimes it learns it after 2k iterations sometimes it won't learn even after 40k 50k iterations.
Q1. It is more efficient if you put all action neurons in the output. A single forward pass will give you all the q-values for that state. In addition, the neural network will be able to generalize in a much better way.
Q2. Sigmoid is typically used for classification. While you can use sigmoid in other layers, I would not use it in the last one.
Q3. Well.. Q-learning with neural networks is famous for not always converging. Have a look at DQN (deepmind). What they do is solving two important issues. They decorrelate the training data by using memory replay. Stochastic gradient descent doesn't like when training data is given in order. Second, they bootstrap using old weights. That way they reduce non-stationary.

Can someone explain to me the difference between a cost function and the gradient descent equation in logistic regression?

I'm going through the ML Class on Coursera on Logistic Regression and also the Manning Book Machine Learning in Action. I'm trying to learn by implementing everything in Python.
I'm not able to understand the difference between the cost function and the gradient. There are examples on the net where people compute the cost function and then there are places where they don't and just go with the gradient descent function w :=w - (alpha) * (delta)w * f(w).
What is the difference between the two if any?
Whenever you train a model with your data, you are actually producing some new values (predicted) for a specific feature. However, that specific feature already has some values which are real values in the dataset. We know the closer the predicted values to their corresponding real values, the better the model.
Now, we are using cost function to measure how close the predicted values are to their corresponding real values.
We also should consider that the weights of the trained model are responsible for accurately predicting the new values. Imagine that our model is y = 0.9*X + 0.1, the predicted value is nothing but (0.9*X+0.1) for different Xs.
[0.9 and 0.1 in the equation are just random values to understand.]
So, by considering Y as real value corresponding to this x, the cost formula is coming to measure how close (0.9*X+0.1) is to Y.
We are responsible for finding the better weight (0.9 and 0.1) for our model to come up with a lowest cost (or closer predicted values to real ones).
Gradient descent is an optimization algorithm (we have some other optimization algorithms) and its responsibility is to find the minimum cost value in the process of trying the model with different weights or indeed, updating the weights.
We first run our model with some initial weights and gradient descent updates our weights and find the cost of our model with those weights in thousands of iterations to find the minimum cost.
One point is that gradient descent is not minimizing the weights, it is just updating them. This algorithm is looking for minimum cost.
A cost function is something you want to minimize. For example, your cost function might be the sum of squared errors over your training set. Gradient descent is a method for finding the minimum of a function of multiple variables. So you can use gradient descent to minimize your cost function. If your cost is a function of K variables, then the gradient is the length-K vector that defines the direction in which the cost is increasing most rapidly. So in gradient descent, you follow the negative of the gradient to the point where the cost is a minimum. If someone is talking about gradient descent in a machine learning context, the cost function is probably implied (it is the function to which you are applying the gradient descent algorithm).
It's strange to think about it, but there is more than one measure for how "accurately" a line fits to data points.
To access how accurately a line fits the data, we have a "cost" function which which can compare predicted vs. actual values and provide a "penalty" for how wrong it is.
penalty = cost_funciton(predicted, actual)
A naive cost function might just take the difference between the predicted and actual.
More sophisticated functions will square the value, since we'd rather have many small errors than one large error.
Additionally, each point has a different "sensitivity" to moving the line. Some points react very strongly to movement. Others react less strongly.
Often, you can make a tradeoff, and move TOWARD a point that is sensitive, and AWAY from a point that is NOT sensitive. In that scenario , you get more than you give up.
The "gradient" is a way of measuring how sensitive each point is to moving the line.
This article does a good job of describing WHY there is more than one measure, and WHY some points are more sensitive than others:
https://towardsdatascience.com/wrapping-your-head-around-gradient-descent-with-pictures-3fbd810235f5?source=friends_link&sk=7117e5de8c66bd4a4c2bb2a87a928773
Let's take an example of logistic regression model for binary classification. Output(Predicted Value) of the model for any given input will be offset(deviation) with respect to the actual output(Expected Value) while training. So, the model needs to be trained with minimal error(loss) so that model can perform well with high accuracy.
The function used to find the parameters(m and c in case of linear equation, y = mx+c) value at which the minimal error(loss) occurs is called Cost Function/Loss Function. Loss function is a term used to find the loss for single row/record of the training sample and Cost function is a term used to find the loss for the entire training dataset.
Now, How do we find the parameter(m and c in our case) values at which the minimum loss occurs? Its by using gradient descent algorithm using the equation, which helps us to find the points at which the minimum loss occurs and the parameters values at this points are considered for model building (let say y = 0.5x + 2) where m=.5 and c=2 are the points at which the loss is minimum.
Cost function is something is like at what cost you are building your model for a good model that cost should be minimum. To find the minimum cost function we use gradient descent method. That give value of coefficients to determine minimum cost function

Echo state neural network?

Is anyone here who is familiar with echo state networks? I created an echo state network in c#. The aim was just to classify inputs into GOOD and NOT GOOD ones. The input is an array of double numbers. I know that maybe for this classification echo state network isn't the best choice, but i have to do it with this method.
My problem is, that after training the network, it cannot generalize. When i run the network with foreign data (not the teaching input), i get only around 50-60% good result.
More details: My echo state network must work like a function approximator. The input of the function is an array of 17 double values, and the output is 0 or 1 (i have to classify the input into bad or good input).
So i have created a network. It contains an input layer with 17 neurons, a reservoir layer, which neron number is adjustable, and output layer containing 1 neuron for the output needed 0 or 1. In a simpler example, no output feedback is used (i tried to use output feedback as well, but nothing changed).
The inner matrix of the reservoir layer is adjustable too. I generate weights between two double values (min, max) with an adjustable sparseness ratio. IF the values are too big, it normlites the matrix to have a spectral radius lower then 1. The reservoir layer can have sigmoid and tanh activaton functions.
The input layer is fully connected to the reservoir layer with random values. So in the training state i run calculate the inner X(n) reservor activations with training data, collecting them into a matrix rowvise. Using the desired output data matrix (which is now a vector with 1 ot 0 values), i calculate the output weigths (from reservoir to output). Reservoir is fully connected to the output. If someone used echo state networks nows what im talking about. I ise pseudo inverse method for this.
The question is, how can i adjust the network so it would generalize better? To hit more than 50-60% of the desired outputs with a foreign dataset (not the training one). If i run the network again with the training dataset, it gives very good reults, 80-90%, but that i want is to generalize better.
I hope someone had this issue too with echo state networks.
If I understand correctly, you have a set of known, classified data that you train on, then you have some unknown data which you subsequently classify. You find that after training, you can reclassify your known data well, but can't do well on the unknown data. This is, I believe, called overfitting - you might want to think about being less stringent with your network, reducing node number, and/or training based on a hidden dataset.
The way people do it is, they have a training set A, a validation set B, and a test set C. You know the correct classification of A and B but not C (because you split up your known data into A and B, and C are the values you want the network to find for you). When training, you only show the network A, but at each iteration, to calculate success you use both A and B. So while training, the network tries to understand a relationship present in both A and B, by looking only at A. Because it can't see the actual input and output values in B, but only knows if its current state describes B accurately or not, this helps reduce overfitting.
Usually people seem to split 4/5 of data into A and 1/5 of it into B, but of course you can try different ratios.
In the end, you finish training, and see what the network will say about your unknown set C.
Sorry for the very general and basic answer, but perhaps it will help describe the problem better.
If your network doesn't generalize that means it's overfitting.
To reduce overfitting on a neural network, there are two ways:
get more training data
decrease the number of neurons
You also might think about the features you are feeding the network. For example, if it is a time series that repeats every week, then one feature is something like the 'day of the week' or the 'hour of the week' or the 'minute of the week'.
Neural networks need lots of data. Lots and lots of examples. Thousands. If you don't have thousands, you should choose a network with just a handful of neurons, or else use something else, like regression, that has fewer parameters, and is therefore less prone to overfitting.
Like the other answers here have suggested, this is a classic case of overfitting: your model performs well on your training data, but it does not generalize well to new test data.
Hugh's answer has a good suggestion, which is to reduce the number of parameters in your model (i.e., by shrinking the size of the reservoir), but I'm not sure whether it would be effective for an ESN, because the problem complexity that an ESN can solve grows proportional to the logarithm of the size of the reservoir. Reducing the size of your model might actually make the model not work as well, though this might be necessary to avoid overfitting for this type of model.
Superbest's solution is to use a validation set to stop training as soon as performance on the validation set stops improving, a technique called early stopping. But, as you noted, because you use offline regression to compute the output weights of your ESN, you cannot use a validation set to determine when to stop updating your model parameters---early stopping only works for online training algorithms.
However, you can use a validation set in another way: to regularize the coefficients of your regression! Here's how it works:
Split your training data into a "training" part (usually 80-90% of the data you have available) and a "validation" part (the remaining 10-20%).
When you compute your regression, instead of using vanilla linear regression, use a regularized technique like ridge regression, lasso regression, or elastic net regression. Use only the "training" part of your dataset for computing the regression.
All of these regularized regression techniques have one or more "hyperparameters" that balance the model fit against its complexity. The "validation" dataset is used to set these parameter values: you can do this using grid search, evolutionary methods, or any other hyperparameter optimization technique. Generally speaking, these methods work by choosing values for the hyperparameters, fitting the model using the "training" dataset, and measuring the fitted model's performance on the "validation" dataset. Repeat N times and choose the model that performs best on the "validation" set.
You can learn more about regularization and regression at http://en.wikipedia.org/wiki/Least_squares#Regularized_versions, or by looking it up in a machine learning or statistics textbook.
Also, read more about cross-validation techniques at http://en.wikipedia.org/wiki/Cross-validation_(statistics).

Resources