I am working on replicating a neural network. I'm trying to get an understanding of how the standard layer types work. In particular, I'm having trouble finding a description anywhere of how cross-channel normalisation layers behave on the backward-pass.
Since the normalization layer has no parameters, I could guess two possible options:
The error gradients from the next (i.e. later) layer are passed backwards without doing anything to them.
The error gradients are normalized in the same way the activations are normalized across channels in the forward pass.
I can't think of a reason why you'd do one over the other based on any intuition, hence why I'd like some help on this.
EDIT1:
The layer is a standard layer in caffe, as described here http://caffe.berkeleyvision.org/tutorial/layers.html (see 'Local Response Normalization (LRN)').
The layer's implementation in the forward pass is described in section 3.3 of the alexNet paper: http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf
EDIT2:
I believe the forward and backward pass algorithms are described in both the Torch library here: https://github.com/soumith/cudnn.torch/blob/master/SpatialCrossMapLRN.lua
and in the Caffe library here: https://github.com/BVLC/caffe/blob/master/src/caffe/layers/lrn_layer.cpp
Please could anyone who is familiar with either/both of these translate the method for the backward pass stage into plain english?
It uses the chain rule to propagate the gradient backwards through the local response normalization layer. It is somewhat similar to a nonlinearity layer in this sense (which also doesn't have trainable parameters on its own, but does affect gradients going backwards).
From the code in Caffe that you linked to I see that they take the error in each neuron as a parameter, and compute the error for the previous layer by doing following:
First, on the forward pass they cache a so-called scale, that is computed (in terms of AlexNet paper, see the formula from section 3.3) as:
scale_i = k + alpha / n * sum(a_j ^ 2)
Here and below sum is sum indexed by j and goes from max(0, i - n/2) to min(N, i + n/2)
(note that in the paper they do not normalize by n, so I assume this is something that Caffe does differently than AlexNet). Forward pass is then computed as b_i = a_i + scale_i ^ -beta.
To backward propagate the error, let's say that the error coming from the next layer is be_i, and the error that we need to compute is ae_i. Then ae_i is computed as:
ae_i = scale_i ^ -b * be_i - (2 * alpha * beta / n) * a_i * sum(be_j * b_j / scale_j)
Since you are planning to implement it manually, I will also share two tricks that Caffe uses in their code that makes the implementation simpler:
When you compute the addends for the sum, allocate an array of size N + n - 1, and pad it with n/2 zeros on each end. This way you can compute the sum from i - n/2 to i + n/2, without caring about going below zero and beyond N.
You don't need to recompute the sum on each iteration, instead compute the the addends in advance (a_j^2 for the front pass, be_j * b_j / scale_j for the backward pass), then compute the sum for i = 0, and then for each consecutive i just add addend[i + n/2] and subtract addend[i - n/2 - 1], it will give you the value of the sum for the new value of i in constant time.
Of cause,you can either print the variables to observe the changes with them or use the debug model to see how errors change during passing the net.
I have an alternative formulation of the backward and I don't know if it is equivalent to caffe's:
So caffe's is :
ae_i = scale_i ^ -b * be_i - (2 * alpha * beta / n) * a_i * sum(be_j * b_j / scale_j)
by differentiating the original expression
b_i = a_i/(scale_i^-b)
I get
ae_i = scale_i ^ -b * be_i - (2 * alpha * beta / n) * a_i * be_i*sum(ae_j)/scale_i^(-b-1)
Related
We are trying an alteration optimization strategy to solve Lyapunov problems.
We break down our decision variables into two sets, Set 1 and Set 2.
We were perplexed how it was possible that, after getting a solution to Set 1, and plugging in those solved variables into the optimization over Set 2, the transferred variables would not be feasible.
The constraints that fail are those due to the SOS coefficient matching equality constraints.
Here, we print in each row the constraint that failed, and the value of our Initial Guess. We can see that the Initial Guess is off only a very small amount compared to the constraints.
LinearEqualityConstraint
(2 * Symmetric(97,40) + 2 * Symmetric(96,41)) == 9.50028
[9.50027496]
LinearEqualityConstraint
(2 * Symmetric(97,47) + 2 * Symmetric(96,48)) == 234.465
[234.4647013]
LinearEqualityConstraint
(2 * Symmetric(97,54) + 2 * Symmetric(96,55)) == -234.463
[-234.46336504]
LinearEqualityConstraint
(2 * Symmetric(97,61) + 2 * Symmetric(96,62)) == 12.7962
[12.79618825]
LinearEqualityConstraint
(2 * Symmetric(97,68) + 2 * Symmetric(96,69)) == -12.7964
[-12.79637068]
LinearEqualityConstraint
(2 * Symmetric(97,75) + 2 * Symmetric(96,76)) == -51.4061
[-51.40605828]
LinearEqualityConstraint
(2 * Symmetric(97,81) + 2 * Symmetric(96,82)) == 51.406
[51.40604213]
LinearEqualityConstraint
(2 * Symmetric(97,86) + 2 * Symmetric(96,87)) == 192.794
[192.79430158]
LinearEqualityConstraint
(2 * Symmetric(97,90) + 2 * Symmetric(96,91)) == -141.924
[-141.92366183]
LinearEqualityConstraint
(2 * Symmetric(97,93) + 2 * Symmetric(96,94)) == -37.6674
[-37.66740401]
InitialGuess V_sos and
Our guess for what's happening is:
When you extract the solution from one optimization using result.GetSolution(var), you lose some precision.
Or, when you set the previous solution using prog.SetInitialGuess(np_array) you lose some precision.
What's the solution here? Should we just keep feeding the solution back in even though it says infeasible?
This is a partial cookbook when I debug SOS problem, especially when working with Lyapunov problems:
Choose the right monomial basis
The main idea is to remove the 0-th order monomial 1 from the monomial basis of the sos polynomial. Here is a quick explanation:
The mathematical problem is
Find λ(x)
−Vdot − λ(x) * (ρ − V) is sos
λ(x) is sos
Namely you want to prove that V≤ρ ⇒ Vdot ≤ 0
So first I would suggest to re-write your dynamics to make sure that 0 is the goal state (you can always shift your state).
Second you can see that since x=0 is the equilibrium point, then both V(0) = 0 and Vdot(0) = 0 (Because x=0 is the global minimal of V(x), hence ∂V/∂x=0 at x=0, indicating Vdot(0) = 0), now your sos polynomial p(x) = −Vdot − λ(x) * (ρ − V) must satisfy p(0) = -λ(0) * ρ. But λ(x) >= 0 and ρ > 0, so we know λ(0) = 0.
Lemma
If a sos polynomial s(x) satisfies s(0) = 0, then its monomial basis cannot contain the 0-th order monomial (namely 1).
Proof
Remember that s(x) is a sos polynomial, namely
s(x) = m(x)ᵀQm(x)
where m(x) contains the monomial basis, and Q is a psd matrix. Now let's decompose the monomial basis m(x) into two parts, the 0-th order monomial 1 and the remaining monomials mbar(x). For example, if m(x) = [x1, x2, 1], then mbar(x) = [x1, x2]. We also decompose the psd matrix Q accordingly
s(x) = [mbar(x)]ᵀ [Q11 Q10] [mbar(x)]
[ 1] [Q10 Q00] [ 1]
Since s(0) = Q00 = 0, we also know that Q10 = 0, so now we can use a smaller psd matrix Q11 rather than Q. Equivalently we write s(x) = mbar(x)ᵀ * Q11 * mbar(x), where mbar(x) is the monomial basis that doesn't contain the 0-th order monomial, QED.
So why removing the 0-th order monomial from the monomial basis and use the smaller psd matrix Q11 is a good idea when your sos polynomial s(x) satisfies s(0) = 0? The reason is that if the monomial basis contains 1, then your psd matrix Q has to be on the boundary of the psd cone, namely your SDP problem doesn't have a strict interior. This could leads to violation of Slater's condition, which also breaks the strong duality. One example is that if your s(x) = x², by including the 0'th order monomial, it is written as
x² = [x] [1 0] [x]
[1] [0 0] [1]
And you see that the Gram matrix [[1 0], [0, 0]] is on the boundary of the psd cone (with one eigen value equal to 0). But if you remove 1 from the monomial basis, then its Gram matrix is just Q11=1, strictly in the interior of the psd cone.
In Drake, after removing 1 from the monomial basis, you can create your sos polynomial λ(x) as
lambda_poly, lambda_gram = prog.NewSosPolynomial(monomial_basis)
whee monomial_basis doesn't contain the 0-th order monomial.
Backoff during bilinear alternation
This is a typical problem in bilinear alternation. The issue is that when you solve a conic optimization problem with an objective function, the optimal solution always occurs at the boundary of the cone, namely it is very close to being infeasible. Then when you fix some variables to this solution at the cone boundary, in the next iteration the problem is very likely infeasible due to numerical roundoff error.
A typical solution is that after solving the optimization problem on variable Set 1 with an objective, now "backoff" a little bit by solving a feasibility problem on variable Set 1. This new solution is often strictly feasible (namely it is inside the strict interior of the cone), now pass this strictly feasible solution Set 1 to the next iteration and search for Set 2.
More concretely, suppose at one iteration you solve the following optimization problem
min c'*x
s.t constraint_on_x
and denote the optimal cost as p. Now solve a new feasibility problem
find x
s.t c'*x <= p + epsilon
constraint_on_x
where epsilon can be a small positive number. This new solution will be used in the next iteration to search for a different set of variables.
You can check if your solution is on the boundary of the positive semidefinite cone by checking the Eigen value of your psd matrix. Here is the pseudo-code
for binding : prog.positive_semidefinite_constraints():
psd_sol = result.GetSolution(binding.variables())
psd_sol.reshape((binding.evaluator().matrix_rows(), binding.evaluator().matrix_rows()))
print(f"minimal eigenvalue {np.linalg.eig(psd_sol)[0].min()}")
You should see that before doing this "backoff" some of the minimal eigen value is almost 0. After "backoff" the minimal eigen value gets larger.
I have implemented a very simple linear regression with gradient descent algorithm in JavaScript, but after consulting multiple sources and trying several things, I cannot get it to converge.
The data is absolutely linear, it's just the numbers 0 to 30 as inputs with x*3 as their correct outputs to learn.
This is the logic behind the gradient descent:
train(input, output) {
const predictedOutput = this.predict(input);
const delta = output - predictedOutput;
this.m += this.learningRate * delta * input;
this.b += this.learningRate * delta;
}
predict(x) {
return x * this.m + this.b;
}
I took the formulas from different places, including:
Exercises from Udacity's Deep Learning Foundations Nanodegree
Andrew Ng's course on Gradient Descent for Linear Regression (also here)
Stanford's CS229 Lecture Notes
this other PDF slides I found from Carnegie Mellon
I have already tried:
normalizing input and output values to the [-1, 1] range
normalizing input and output values to the [0, 1] range
normalizing input and output values to have mean = 0 and stddev = 1
reducing the learning rate (1e-7 is as low as I went)
having a linear data set with no bias at all (y = x * 3)
having a linear data set with non-zero bias (y = x * 3 + 2)
initializing the weights with random non-zero values between -1 and 1
Still, the weights (this.b and this.m) do not approach any of the data values, and they diverge into infinity.
I'm obviously doing something wrong, but I cannot figure out what it is.
Update: Here's a little bit more context that may help figure out what my problem is exactly:
I'm trying to model a simple approximation to a linear function, with online learning by a linear regression pseudo-neuron. With that, my parameters are:
weights: [this.m, this.b]
inputs: [x, 1]
activation function: identity function z(x) = x
As such, my net will be expressed by y = this.m * x + this.b * 1, simulating the data-driven function that I want to approximate (y = 3 * x).
What I want is for my network to "learn" the parameters this.m = 3 and this.b = 0, but it seems I get stuck at a local minima.
My error function is the mean-squared error:
error(allInputs, allOutputs) {
let error = 0;
for (let i = 0; i < allInputs.length; i++) {
const x = allInputs[i];
const y = allOutputs[i];
const predictedOutput = this.predict(x);
const delta = y - predictedOutput;
error += delta * delta;
}
return error / allInputs.length;
}
My logic for updating my weights will be (according to the sources I've checked so far) wi -= alpha * dError/dwi
For the sake of simplicity, I'll call my weights this.m and this.b, so we can relate it back to my JavaScript code. I'll also call y^ the predicted value.
From here:
error = y - y^
= y - this.m * x + this.b
dError/dm = -x
dError/db = 1
And so, applying that to the weight correction logic:
this.m += alpha * x
this.b -= alpha * 1
But this doesn't seem correct at all.
I finally found what's wrong, and I'm answering my own question in hopes it will help beginners in this area too.
First, as Sascha said, I had some theoretical misunderstandings. It may be correct that your adjustment includes the input value verbatim, but as he said, it should already be part of the gradient. This all depends on your choice of the error function.
Your error function will be the measure of what you use to measure how off you were from the real value, and that measurement needs to be consistent. I was using mean-squared-error as a measurement tool (as you can see in my error method), but I was using a pure-absolute error (y^ - y) inside of the training method to measure the error. Your gradient will depend on the choice of this error function. So choose only one and stick with it.
Second, simplify your assumptions in order to test what's wrong. In this case, I had a very good idea what the function to approximate was (y = x * 3) so I manually set the weights (this.b and this.m) to the right values and I still saw the error diverge. This means that weight initialization was not the problem in this case.
After searching some more, my error was somewhere else: the function that was feeding data into the network was mistakenly passing a 3 hardcoded value into the predicted output (it was using a wrong index in an array), so the oscillation I saw was because of the network trying to approximate to y = 0 * x + 3 (this.b = 3 and this.m = 0), but because of the small learning rate and the error in the error function derivative, this.b wasn't going to get near to the right value, making this.m making wild jumps to adjust to it.
Finally, keep track of the error measurement as your network trains, so you can have some insight into what's going on. This helps a lot to identify a difference between simple overfitting, big learning rates and plain simple mistakes.
Could someone give a clear explanation of backpropagation for LSTM RNNs?
This is the type structure I am working with. My question is not posed at what is back propagation, I understand it is a reverse order method of calculating the error of the hypothesis and output used for adjusting the weights of neural networks. My question is how LSTM backpropagation is different then regular neural networks.
I am unsure of how to find the initial error of each gates. Do you use the first error (calculated by hypothesis minus output) for each gate? Or do you adjust the error for each gate through some calculation? I am unsure how the cell state plays a role in the backprop of LSTMs if it does at all. I have looked thoroughly for a good source for LSTMs but have yet to find any.
That's a good question. You certainly should take a look at suggested posts for details, but a complete example here would be helpful too.
RNN Backpropagaion
I think it makes sense to talk about an ordinary RNN first (because LSTM diagram is particularly confusing) and understand its backpropagation.
When it comes to backpropagation, the key idea is network unrolling, which is way to transform the recursion in RNN into a feed-forward sequence (like on the picture above). Note that abstract RNN is eternal (can be arbitrarily large), but each particular implementation is limited because the memory is limited. As a result, the unrolled network really is a long feed-forward network, with few complications, e.g. the weights in different layers are shared.
Let's take a look at a classic example, char-rnn by Andrej Karpathy. Here each RNN cell produces two outputs h[t] (the state which is fed into the next cell) and y[t] (the output on this step) by the following formulas, where Wxh, Whh and Why are the shared parameters:
In the code, it's simply three matrices and two bias vectors:
# model parameters
Wxh = np.random.randn(hidden_size, vocab_size)*0.01 # input to hidden
Whh = np.random.randn(hidden_size, hidden_size)*0.01 # hidden to hidden
Why = np.random.randn(vocab_size, hidden_size)*0.01 # hidden to output
bh = np.zeros((hidden_size, 1)) # hidden bias
by = np.zeros((vocab_size, 1)) # output bias
The forward pass is pretty straightforward, this example uses softmax and cross-entropy loss. Note each iteration uses the same W* and h* arrays, but the output and hidden state are different:
# forward pass
for t in xrange(len(inputs)):
xs[t] = np.zeros((vocab_size,1)) # encode in 1-of-k representation
xs[t][inputs[t]] = 1
hs[t] = np.tanh(np.dot(Wxh, xs[t]) + np.dot(Whh, hs[t-1]) + bh) # hidden state
ys[t] = np.dot(Why, hs[t]) + by # unnormalized log probabilities for next chars
ps[t] = np.exp(ys[t]) / np.sum(np.exp(ys[t])) # probabilities for next chars
loss += -np.log(ps[t][targets[t],0]) # softmax (cross-entropy loss)
Now, the backward pass is performed exactly as if it was a feed-forward network, but the gradient of W* and h* arrays accumulates the gradients in all cells:
for t in reversed(xrange(len(inputs))):
dy = np.copy(ps[t])
dy[targets[t]] -= 1
dWhy += np.dot(dy, hs[t].T)
dby += dy
dh = np.dot(Why.T, dy) + dhnext # backprop into h
dhraw = (1 - hs[t] * hs[t]) * dh # backprop through tanh nonlinearity
dbh += dhraw
dWxh += np.dot(dhraw, xs[t].T)
dWhh += np.dot(dhraw, hs[t-1].T)
dhnext = np.dot(Whh.T, dhraw)
Both passes above are done in chunks of size len(inputs), which corresponds to the size of the unrolled RNN. You might want to make it bigger to capture longer dependencies in the input, but you pay for it by storing all outputs and gradients per each cell.
What's different in LSTMs
LSTM picture and formulas look intimidating, but once you coded plain vanilla RNN, the implementation of LSTM is pretty much same. For example, here is the backward pass:
# Loop over all cells, like before
d_h_next_t = np.zeros((N, H))
d_c_next_t = np.zeros((N, H))
for t in reversed(xrange(T)):
d_x_t, d_h_prev_t, d_c_prev_t, d_Wx_t, d_Wh_t, d_b_t = lstm_step_backward(d_h_next_t + d_h[:,t,:], d_c_next_t, cache[t])
d_c_next_t = d_c_prev_t
d_h_next_t = d_h_prev_t
d_x[:,t,:] = d_x_t
d_h0 = d_h_prev_t
d_Wx += d_Wx_t
d_Wh += d_Wh_t
d_b += d_b_t
# The step in each cell
# Captures all LSTM complexity in few formulas.
def lstm_step_backward(d_next_h, d_next_c, cache):
"""
Backward pass for a single timestep of an LSTM.
Inputs:
- dnext_h: Gradients of next hidden state, of shape (N, H)
- dnext_c: Gradients of next cell state, of shape (N, H)
- cache: Values from the forward pass
Returns a tuple of:
- dx: Gradient of input data, of shape (N, D)
- dprev_h: Gradient of previous hidden state, of shape (N, H)
- dprev_c: Gradient of previous cell state, of shape (N, H)
- dWx: Gradient of input-to-hidden weights, of shape (D, 4H)
- dWh: Gradient of hidden-to-hidden weights, of shape (H, 4H)
- db: Gradient of biases, of shape (4H,)
"""
x, prev_h, prev_c, Wx, Wh, a, i, f, o, g, next_c, z, next_h = cache
d_z = o * d_next_h
d_o = z * d_next_h
d_next_c += (1 - z * z) * d_z
d_f = d_next_c * prev_c
d_prev_c = d_next_c * f
d_i = d_next_c * g
d_g = d_next_c * i
d_a_g = (1 - g * g) * d_g
d_a_o = o * (1 - o) * d_o
d_a_f = f * (1 - f) * d_f
d_a_i = i * (1 - i) * d_i
d_a = np.concatenate((d_a_i, d_a_f, d_a_o, d_a_g), axis=1)
d_prev_h = d_a.dot(Wh.T)
d_Wh = prev_h.T.dot(d_a)
d_x = d_a.dot(Wx.T)
d_Wx = x.T.dot(d_a)
d_b = np.sum(d_a, axis=0)
return d_x, d_prev_h, d_prev_c, d_Wx, d_Wh, d_b
Summary
Now, back to your questions.
My question is how is LSTM backpropagation different then regular Neural Networks
The are shared weights in different layers, and few more additional variables (states) that you need to pay attention to. Other than this, no difference at all.
Do you use the first error (calculated by hypothesis minus output) for each gate? Or do you adjust the error for each gate through some calculation?
First up, the loss function is not necessarily L2. In the example above it's a cross-entropy loss, so initial error signal gets its gradient:
# remember that ps is the probability distribution from the forward pass
dy = np.copy(ps[t])
dy[targets[t]] -= 1
Note that it's the same error signal as in ordinary feed-forward neural network. If you use L2 loss, the signal indeed equals to ground-truth minus actual output.
In case of LSTM, it's slightly more complicated: d_next_h = d_h_next_t + d_h[:,t,:], where d_h is the upstream gradient the loss function, which means that error signal of each cell gets accumulated. But once again, if you unroll LSTM, you'll see a direct correspondence with the network wiring.
I think your questions could not be answered in a short response. Nico's simple LSTM has a link to a great paper from Lipton et.al., please read this. Also his simple python code sample helps to answer most of your questions.
If you understand Nico's last sentence
ds = self.state.o * top_diff_h + top_diff_s
in detail, please give me a feed back. At the moment I have a final problem with his "Putting all this s and h derivations together".
I have a convolutional neural network whose output is a 4-channel 2D image. I want to apply sigmoid activation function to the first two channels and then use BCECriterion to computer the loss of the produced images with the ground truth ones. I want to apply squared loss function to the last two channels and finally computer the gradients and do backprop. I would also like to multiply the cost of the squared loss for each of the two last channels by a desired scalar.
So the cost has the following form:
cost = crossEntropyCh[{1, 2}] + l1 * squaredLossCh_3 + l2 * squaredLossCh_4
The way I'm thinking about doing this is as follow:
criterion1 = nn.BCECriterion()
criterion2 = nn.MSECriterion()
error = criterion1:forward(model.output[{{}, {1, 2}}], groundTruth1) + l1 * criterion2:forward(model.output[{{}, {3}}], groundTruth2) + l2 * criterion2:forward(model.output[{{}, {4}}], groundTruth3)
However, I don't think this is the correct way of doing it since I will have to do 3 separate backprop steps, one for each of the cost terms. So I wonder, can anyone give me a better solution to do this in Torch?
SplitTable and ParallelCriterion might be helpful for your problem.
Your current output layer is followed by nn.SplitTable that splits your output channels and converts your output tensor into a table. You can also combine different functions by using ParallelCriterion so that each criterion is applied on the corresponding entry of output table.
For details, I suggest you read documentation of Torch about tables.
After comments, I added the following code segment solving the original question.
M = 100
C = 4
H = 64
W = 64
dataIn = torch.rand(M, C, H, W)
layerOfTables = nn.Sequential()
-- Because SplitTable discards the dimension it is applied on, we insert
-- an additional dimension.
layerOfTables:add(nn.Reshape(M,C,1,H,W))
-- We want to split over the second dimension (i.e. channels).
layerOfTables:add(nn.SplitTable(2, 5))
-- We use ConcatTable in order to create paths accessing to the data for
-- numereous number of criterions. Each branch from the ConcatTable will
-- have access to the data (i.e. the output table).
criterionPath = nn.ConcatTable()
-- Starting from offset 1, NarrowTable will select 2 elements. Since you
-- want to use this portion as a 2 dimensional channel, we need to combine
-- then by using JoinTable. Without JoinTable, the output will be again a
-- table with 2 elements.
criterionPath:add(nn.Sequential():add(nn.NarrowTable(1, 2)):add(nn.JoinTable(2)))
-- SelectTable is simplified version of NarrowTable, and it fetches the desired element.
criterionPath:add(nn.SelectTable(3))
criterionPath:add(nn.SelectTable(4))
layerOfTables:add(criterionPath)
-- Here goes the criterion container. You can use this as if it is a regular
-- criterion function (Please see the examples on documentation page).
criterionContainer = nn.ParallelCriterion()
criterionContainer:add(nn.BCECriterion())
criterionContainer:add(nn.MSECriterion())
criterionContainer:add(nn.MSECriterion())
Since I used almost every possible table operation, it looks a little bit nasty. However, this is the only way I could solve this problem. I hope that it helps you and others suffering from the same problem. This is how the result looks like:
dataOut = layerOfTables:forward(dataIn)
print(dataOut)
{
1 : DoubleTensor - size: 100x2x64x64
2 : DoubleTensor - size: 100x1x64x64
3 : DoubleTensor - size: 100x1x64x64
}
according to wikipedia, with the delta rule we adjust the weight by:
dw = alpha * (ti-yi)*g'(hj)xi
when alpha = learning constant, ti - true answer, yi - perceptron's guess,g' = the derivative of the activation function g with respect to the weighted sum of the perceptron's inputs, xi - input.
The part that I don't understand in this formula is the multiplication by the derivative g'. let g = sign(x) (the sign of the weighted sum). so g' is always 0, and dw = 0. However, in code examples I saw in the internet, the writers just omitted the g' and used the formula:
dw = alpha * (ti-yi)*(hj)xi
I will be glad to read a proper explanation!
thank you in advance.
You're correct that if you use a step function for your activation function g, the gradient is always zero (except at 0), so the delta rule (aka gradient descent) just does nothing (dw = 0). This is why a step-function perceptron doesn't work well with gradient descent. :)
For a linear perceptron, you'd have g'(x) = 1, for dw = alpha * (t_i - y_i) * x_i.
You've seen code that uses dw = alpha * (t_i - y_i) * h_j * x_i. We can reverse-engineer what's going on here, because apparently g'(h_j) = h_j, which means remembering our calculus that we must have g(x) = e^x + constant. So apparently the code sample you found uses an exponential activation function.
This must mean that the neuron outputs are constrained to be on (0, infinity) (or I guess (a, infinity) for any finite a, for g(x) = e^x + a). I haven't run into this before, but I see some references online. Logistic or tanh activations are more common for bounded outputs (either classification or regression with known bounds).