Updating weights via backpropagation - machine-learning

How exactly are the theta parameters (AKA weights) updated after the errors have been calculated?
I do not know how to implement LaTeX here so I will display the pictures (from Andrew Ng's machine learning course) of what I know has to be done before theta/weights can be updated.
What is done next using the calculated delta "accumulator" matrix, or rather, how do I apply the accumulator matrix to update my weights?
In the course, Andrew feeds the computations shown below into an optimizer such as fminunc, which does the parameter updating automatically, but I want to know clearly how to do the updating that the optimizer hides. I want to know how to compute the update manually.
Do I simply apply New Weights = Old Weights - (learning-rate x D matrix)(from step 5 below)?
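A minimal sketch of what such a manual update could look like, assuming plain batch gradient descent as a stand-in for the optimizer (fminunc uses a more elaborate search internally, so this is not what it does under the hood). It also assumes the usual course convention that D is the accumulator divided by the number of examples m, with a regularization term added to every column except the bias column. The names gradient_step, Theta, Delta, alpha, and lam are illustrative.

```python
import numpy as np

def gradient_step(Theta, Delta, m, alpha=0.1, lam=0.0):
    """One batch gradient-descent update for a single layer's weight matrix.

    Theta : weights, shape (units_out, units_in + 1); column 0 is the bias.
    Delta : gradient accumulated over all m training examples, same shape.
    alpha : learning rate (assumed value); lam : regularization strength.
    """
    D = Delta / m                         # average the accumulator (new array)
    D[:, 1:] += (lam / m) * Theta[:, 1:]  # regularize everything but the bias
    return Theta - alpha * D              # new weights = old weights - alpha * D

# Hypothetical numbers, only to show the call.
Theta = np.array([[0.5, -0.2, 0.1]])
Delta = np.array([[2.0, 4.0, -6.0]])
print(gradient_step(Theta, Delta, m=2, alpha=0.1))
```

The same step would be applied to every layer's matrix, and the accumulator would be recomputed with a fresh forward and backward pass before each new step.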

Related

Learning a Sin function

I'm new to Machine Learning.
I'm building a simple model that should be able to predict a simple sin function.
I generated some sin values and am feeding them into my model.
# Assumed imports; the original snippet omitted them.
import numpy as np
from keras.models import Sequential
from keras.layers import Dense

# Training data: sin(x) sampled on [-10, 40)
xs = np.arange(-10, 40, 0.1)
ys = np.sin(xs)  # np.sin is already vectorized, so no np.vectorize wrapper is needed

model = Sequential()
model.add(Dense(units=256, input_shape=(1,), activation="tanh"))
model.add(Dense(units=256, activation="tanh"))
# ..a number of layers here
model.add(Dense(units=256, activation="tanh"))
model.add(Dense(units=1))
model.compile(optimizer="sgd", loss="mse")
model.fit(xs, ys, epochs=500, verbose=0)
I then generate some test data, which overlaps my training data but also covers some new inputs:
import matplotlib.pyplot as plt  # assumed import; omitted in the original snippet

test_xs = np.arange(-15, 45, 0.01)
test_ys = model.predict(test_xs)
plt.plot(xs, ys)
plt.plot(test_xs, test_ys)
The predicted data and the training data look as follows. The more layers I add, the more curves the network is able to learn, but the training time increases.
Is there a way to make it predict sin for any number of curves? Preferably with a small number of layers.
With a fully connected network I guess you won't be able to get arbitrarily long sequences, but with an RNN it looks like people have achieved this. A Google search will turn up many such efforts; I found this one quickly: http://goelhardik.github.io/2016/05/25/lstm-sine-wave/
An RNN learns a sequence based on a history of inputs, so it's designed to pick up these kinds of patterns.
I suspect the limitation you observed is akin to performing a polynomial fit. If you increase the degree of polynomial you can better fit a function like this, but a polynomial can only represent a fixed number of inflection points depending on the degree you choose. Your observation here appears the same. As you increase layers you add more non-linear transitions. However, you are limited by a fixed number of layers you chose as the architecture in a fully connected network.
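The polynomial analogy can be checked numerically. The sketch below is only an illustration of that analogy, not of the Keras model from the question: it fits a degree-7 polynomial (an arbitrary choice) to one period of the sine wave and compares the error inside the fitted interval with the error one period further out.

```python
import numpy as np
from numpy.polynomial import Polynomial

# Fit a degree-7 polynomial to one period of sin(x).
train_x = np.linspace(0.0, 2.0 * np.pi, 200)
poly = Polynomial.fit(train_x, np.sin(train_x), deg=7)

# Error inside the fitted interval versus one period beyond it.
inside_err = np.max(np.abs(poly(train_x) - np.sin(train_x)))
outside_x = np.linspace(2.0 * np.pi, 4.0 * np.pi, 200)
outside_err = np.max(np.abs(poly(outside_x) - np.sin(outside_x)))

print(f"max error inside  [0, 2*pi]:    {inside_err:.4f}")
print(f"max error outside [2*pi, 4*pi]: {outside_err:.4f}")
```

The fit is close where data was available and diverges quickly outside it, which is the same qualitative behaviour as the fully connected network on inputs beyond the training range.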
An RNN does not work on the same principles: it maintains a state, and it can use the state passed forward along the sequence to learn the pattern of a single period of the sine wave and then repeat that pattern.

Why the hypothesis has to introduce two parameters, namely θ0 and θ1

I was learning Machine Learning from this course on Coursera taught by Andrew Ng. The instructor defines the hypothesis as a linear function of the "input" (x, in my case) like the following:
hθ(x) = θ0 + θ1(x)
In supervised learning, we have some training data and based on that we try to "deduce" a function which closely maps the inputs to the corresponding outputs. To deduce the function, we introduce the hypothesis as a linear function of the input (x). My question is: why is a function involving two θs chosen? Why can't it be as simple as y(i) = a * x(i), where a is a coefficient? Then we could go about finding a "good" value of a for the given examples (i) using an algorithm. This question might look very stupid. I apologize, but I'm not very good at machine learning; I am just a beginner. Please help me understand this.
Thanks!
The a corresponds to θ1. Your proposed linear model is leaving out the intercept, which is θ0.
Consider an output function y equal to the constant 5, or perhaps equal to a constant plus some tiny fraction of x which never exceeds .01. Driving the error function to zero is going to be difficult if your model doesn't have a θ0 that can soak up the D.C. component.
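To make this concrete, here is a small sketch with made-up data following the example above (a constant 5 plus a tiny fraction of x). It fits the model once without and once with an intercept by ordinary least squares; the variable names are illustrative.

```python
import numpy as np

x = np.linspace(0.0, 10.0, 50)
y = 5.0 + 0.001 * x                      # constant plus a tiny slope

# Model without intercept: y ~ a * x
a = np.sum(x * y) / np.sum(x * x)        # closed-form least-squares slope
mse_no_intercept = np.mean((a * x - y) ** 2)

# Model with intercept: y ~ theta0 + theta1 * x
X = np.column_stack([np.ones_like(x), x])
theta, *_ = np.linalg.lstsq(X, y, rcond=None)
mse_with_intercept = np.mean((X @ theta - y) ** 2)

print("without theta0:", mse_no_intercept)
print("with theta0:   ", mse_with_intercept, theta)
```

Without θ0 the line is forced through the origin and the error stays large however a is chosen; with θ0 the fit recovers the constant and the small slope.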

Continuous Regression in Machine Learning

Suppose we have a set of inputs (named x1, x2, ..., xn) that give us the output y. The goal is to predict y from values of x1 ... xn that have not been seen yet. It's clear to me that this problem can be modelled as a regression problem in the realm of Machine Learning.
However, let's say data keeps coming. I'm able to predict y from x1 ... xn. Furthermore, I'm able to check afterwards whether or not that prediction was a good one. If it was, everything is fine. If, on the other hand, the prediction deviates a lot from the real y, I would like to update my model. The one way I can see to do this is to insert the new data into my training set and train the regression algorithm again. Two problems arise from that. First, it may cost more than I can afford to recompute my model from scratch from time to time. Second, I may already have so much data in my training set that the newly arriving data is negligible. However, the new data might be more important than the older data due to the nature of my problem.
It seems that a good solution would be to compute a kind of continuous regression that depends more on the new data than on the older data. I have searched for such an approach but have not found anything relevant. Perhaps I'm looking in the wrong direction. Does anyone have a clue on how to do it?
If you want to consider the newer data more important, you have to use weights. In scikit-learn (if you use this library) this is usually the sample_weight argument of the fit() function.
Weights can be defined as 1 / (time elapsed since the observation).
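A minimal sketch of the recency weighting, written in plain NumPy rather than scikit-learn so that it is self-contained; passing the same weights as sample_weight to a scikit-learn regressor's fit() is the library route mentioned above. The function name recency_weighted_fit and the synthetic drifting data are assumptions made for illustration.

```python
import numpy as np

def recency_weighted_fit(X, y, age):
    """Least-squares fit in which each sample is weighted by 1 / age.

    X   : (n, m) design matrix (add a column of ones for an intercept).
    age : time elapsed since each observation, must be > 0.
    """
    w = 1.0 / np.asarray(age, dtype=float)
    sw = np.sqrt(w)   # scaling rows by sqrt(w) weights the squared errors by w
    coef, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return coef

# Synthetic data: the true slope drifts from 1 (old samples) to 3 (recent ones).
x = np.tile(np.linspace(0.0, 1.0, 10), 10)
age = np.arange(100, 0, -1, dtype=float)   # first sample is the oldest
y = np.where(age > 20, 1.0, 3.0) * x

weighted = recency_weighted_fit(x[:, None], y, age)
unweighted = recency_weighted_fit(x[:, None], y, np.ones_like(age))
print("unweighted slope:", unweighted[0])
print("weighted slope:  ", weighted[0])
```

The weighted slope lands much closer to the recent value than the unweighted one, which is dominated by the larger number of old samples.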
Now about the second problem. If the recalculation takes too much time, you can trim your observations and use only the latest ones. Fit your model on the whole data set and on the fresh data plus some part of the old data, and check how much your weights change. I suppose that if there really is a dependence between {x_i} and {y}, you don't need the whole dataset.
Otherwise you can use weights again, but this time you weight the coefficients of the models:
model for old data: w1*x1 + w2*x2 + ...
model for new data: ~w1*x1 + ~w2*x2 + ...
common model: (w1*a1_1 + ~w1*a1_2)*x1 + (w2*a2_1 + ~w2*a2_2)*x2 + ...
Here a1_1, a2_1 are the weights for the old model and a1_2, a2_2 for the new one; w1, w2 are the coefficients of the old model and ~w1, ~w2 those of the new one.
Parameters {a} can be estimated as in the first bullet (by hand), but you can also create another linear model to estimate them. But my advice: don't use non-linear regression for {a}, or you will overfit.
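The common model above amounts to a per-feature blend of the two coefficient vectors. A short sketch, with hand-picked blending weights and coefficients that are purely hypothetical:

```python
import numpy as np

def blend_coefficients(w_old, w_new, a_old, a_new):
    """Per-feature blend of two linear models: a_old * w_old + a_new * w_new."""
    return (np.asarray(a_old) * np.asarray(w_old)
            + np.asarray(a_new) * np.asarray(w_new))

w_old = np.array([1.0, 2.0])   # coefficients fitted on the old data
w_new = np.array([3.0, 0.0])   # coefficients fitted on the new data

# Trust the new model three times as much as the old one (arbitrary choice).
common = blend_coefficients(w_old, w_new, a_old=[0.25, 0.25], a_new=[0.75, 0.75])
print(common)
```

Prediction with the common model is then just the dot product of these blended coefficients with the features.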

How to train a neural network in forward manner and using it in backward manner

I have a neural network with an input layer of 10 nodes, some hidden layers, and an output layer with only 1 node. I put a pattern into the input layer and, after some processing, the output neuron yields a value, which is a number from 1 to 10. After training, this model is able to produce the output for a given input pattern.
Now, my question is whether it is possible to calculate the inverse model. That is, I provide a number on the output side (i.e. use the output side as input) and then get a random pattern from those 10 input neurons (i.e. use the input side as output).
I want to do this because I will first train a network on the basis of pattern difficulty (the input is the pattern and the output is how difficult the pattern is to understand). Then I want to feed the network a number so that it creates random patterns of that difficulty.
I hope I understood your problem correctly, so I will summarize it in my own words: You have a given model, and want to determine the input which yields a given output.
Supposing this is correct, there is at least one way I know of to do this approximately. It is very easy to implement but might take a while to calculate a value. There are probably better ways to do this, but I am not sure. (I needed this technique some weeks ago in the context of reinforcement learning and did not find anything better.)
Let's assume that your model maps an input of dimension m to an output. We now create a new model that will later compute the inverse of the original model, so that it gives you an input which yields a specific output. To construct it, we create one plain Dense layer which has the same dimension m as the input and connect it to the input of the original model. Next, you make all weights of the original model non-trainable (this is very important!).
Now we are already set up to find an inverse value. Assume you want to find an input corresponding to the output y (corresponding means here that it produces the output, but it is not unique). You create a new input v which is the unity, and then an input-output data pair (v, y). Now you use any optimizer you wish to propagate this training pair through your network until the error converges to zero. Once this has happened, you can calculate the real input which gives the output y: supposing that the weights of the new input layer are called w and the bias is b, the desired input u is u = w*1 + b (where 1 is the unity input v).
You might be asking why this equation holds, so let me try to answer it. Your model will try to learn the weights of your new input layer so that the unity as an input creates the given output. As only the newly added input layer is trainable, only these weights will be changed. Therefore, each weight in this vector will represent the corresponding component of the desired input vector. By using an optimizer and minimizing the l^2 distance between the wanted output and the output of our inverse model, we finally determine a set of weights which gives you a good approximation of the input vector.
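A rough NumPy sketch of the idea. Instead of wrapping the model with an extra trainable layer, it runs gradient descent directly on the input vector u, which plays the role of w*1 + b above, while the model weights stay frozen. The small random tanh network is only a stand-in for a trained model, and the step size and iteration count are assumed values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the trained, frozen model: 10 inputs -> 8 tanh units -> 1 output.
W1 = rng.normal(scale=0.3, size=(8, 10))
W2 = rng.normal(size=8)

def model(u):
    return W2 @ np.tanh(W1 @ u)

def invert(y_target, steps=5000, lr=0.01):
    """Gradient descent on the input u; W1 and W2 are never modified."""
    u = np.zeros(10)
    for _ in range(steps):
        h = np.tanh(W1 @ u)
        residual = W2 @ h - y_target
        # Gradient of 0.5 * residual**2 with respect to u (chain rule).
        grad = residual * (W1.T @ (W2 * (1.0 - h ** 2)))
        u -= lr * grad
    return u

# Pick a target the model can certainly produce, then search for an input.
y_target = model(rng.normal(scale=0.3, size=10))
u_found = invert(y_target)
print("remaining error:", abs(model(u_found) - y_target))
```

Because many inputs map to the same output, u_found is one possible pattern, not the pattern; starting the search from different initial vectors would yield different patterns for the same target.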

Gradient descent stochastic update - Stopping criterion and update rule - Machine Learning

My dataset has m features and n data points. Let w be a vector (to be estimated). I'm trying to implement gradient descent with the stochastic update method. The function I am minimizing is the least mean square error.
The update algorithm is shown below:
for i = 1 ... n data:
    for t = 1 ... m features:
        w_t = w_t - alpha * (<w>.<x_i> - <y_i>) * x_t
where <x> is a row vector of m features, <y> is a column vector of true labels, and alpha is a constant.
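For reference, the loop above could be written in NumPy roughly as follows. This is a sketch, not a definitive implementation: the function name sgd_epoch, the random visiting order, and the synthetic data are assumptions. The residual is computed once per data point, so all m components are updated from the same value of w.

```python
import numpy as np

def sgd_epoch(w, X, y, alpha=0.01, rng=None):
    """One pass of stochastic gradient descent for least squares.

    X : (n, m) array, one row per data point; y : (n,) array of targets.
    """
    rng = np.random.default_rng() if rng is None else rng
    for i in rng.permutation(len(X)):     # visit the points in random order
        error = X[i] @ w - y[i]           # residual, computed once per point
        w = w - alpha * error * X[i]      # updates all m components at once
    return w

# Noise-free synthetic data generated from known coefficients.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true

w = np.zeros(3)
for _ in range(50):
    w = sgd_epoch(w, X, y, alpha=0.01, rng=rng)
print(w)
```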
My questions:
Now according to wiki, I don't need to go through all data points and I can stop when error is small enough. Is it true?
I don't understand what should be the stopping criterion here. If anyone can help with this that would be great.
With this formula - which I used in for loop - is it correct? I believe (<w>.<x_i> - <y_i>) * x_t is my ∆Q(w).
Now according to wiki, I don't need to go through all data points and I can stop when error is small enough. Is it true?
This is especially true when you have a really huge training set and going through all the data points is expensive. Then, you would check the convergence criterion after K stochastic updates (i.e. after processing K training examples). While it's possible, it doesn't make much sense to do this with a small training set. Another thing people do is randomize the order in which training examples are processed, to avoid having too many correlated examples in a row, which may result in "fake" convergence.
I don't understand what should be the stopping criterion here. If anyone can help with this that would be great.
There are a few options. I recommend trying as many of them as possible and deciding based on empirical results.
1. The difference in the objective function for the training data is smaller than a threshold.
2. The difference in the objective function for held-out data (aka development data, validation data) is smaller than a threshold. The held-out examples should NOT include any of the examples used for training (i.e. for stochastic updates) nor any of the examples in the test set used for evaluation.
3. The total absolute difference in the parameters w is smaller than a threshold.
4. In 1, 2, and 3 above, instead of specifying a threshold, you could specify a percentage. For example, a reasonable stopping criterion is to stop training when |squared_error(w) - squared_error(previous_w)| < 0.01 * squared_error(previous_w).
5. Sometimes, we don't care if we have the optimal parameters. We just want to improve the parameters we originally had. In such a case, it's reasonable to preset a number of iterations over the training data and stop after that regardless of whether the objective function actually converged.
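A sketch combining the percentage criterion (option 4, applied to the training error of option 1) with the iteration cap of option 5. The function name, the tolerance, the learning rate, and the noisy synthetic data are all assumptions chosen for illustration.

```python
import numpy as np

def train_until_converged(X, y, alpha=0.005, rel_tol=0.01, max_epochs=1000, seed=0):
    """SGD for least squares that stops on a small relative change in error."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    prev_error = np.sum((X @ w - y) ** 2)
    for epoch in range(1, max_epochs + 1):          # option 5: hard cap
        for i in rng.permutation(len(X)):           # shuffled order each epoch
            w -= alpha * (X[i] @ w - y[i]) * X[i]
        error = np.sum((X @ w - y) ** 2)
        # Option 4: change in squared error below rel_tol of the previous error.
        if abs(error - prev_error) < rel_tol * prev_error:
            break
        prev_error = error
    return w, epoch

# Noisy synthetic data, so the error levels off at a non-zero value.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + rng.normal(scale=0.1, size=200)

w, epochs_used = train_until_converged(X, y)
print(epochs_used, w)
```

Here the check runs once per pass over the data; with a very large training set the same test could be run every K updates instead, on a held-out subset (option 2).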
With this formula - which I used in for loop - is it correct? I believe (w.x_i - y_i) * x_t is my ∆Q(w).
It should be 2 * (w.x_i - y_i) * x_t but it's not a big deal given that you're multiplying by the learning rate alpha anyway.
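The factor of 2 comes from differentiating the squared error of a single data point. Written out, assuming Q_i denotes that per-point squared error and x_{i,t} denotes the t-th feature of x_i (written x_t in the question):

```latex
Q_i(w) = (w \cdot x_i - y_i)^2
\quad\Longrightarrow\quad
\frac{\partial Q_i}{\partial w_t} = 2\,(w \cdot x_i - y_i)\,x_{i,t}

w_t \leftarrow w_t - \alpha \cdot 2\,(w \cdot x_i - y_i)\,x_{i,t}
    = w_t - \alpha'\,(w \cdot x_i - y_i)\,x_{i,t},
\qquad \alpha' = 2\alpha
```

So dropping the 2 is the same as running the exact update with a learning rate half as large.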
