I'm programming a simple perceptron with one hidden layer.
For example, I have 3 neurons in the first layer, 2 in the second, and 2 in the output layer.
I have to solve a binary classification problem, so I have 10 weights (3×2 + 2×2).
But I want to visualize the function that I get from these weights, e.g. I want to plot a function of the form y = w0 + w1*x.
So the question is: which w0 and w1 should I use for this purpose?
You do not get a single function of that form. You can derive the decision function from the weights, but it is not a simple combination of any pair of inputs; rather, it is the composition of the two layers of functions. You then subtract the value computed for class 1 from the value computed for class 2: positive means one class, negative means the other.
If you want to plot the input weighting function of a particular unit, then you just use that unit's own w values, but I don't think this is what you're asking.
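As a minimal sketch of what that decision function looks like for the 3-2-2 network (the weight values and the sigmoid activations here are assumptions, since neither is specified in the question, and biases are ignored):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights: W1 is 2x3 (input -> hidden), W2 is 2x2
# (hidden -> output), for the 10 weights mentioned in the question.
W1 = np.array([[0.5, -0.2, 0.1],
               [0.3,  0.8, -0.4]])
W2 = np.array([[ 0.7, -0.6],
               [-0.3,  0.9]])

def decision(x):
    h = sigmoid(W1 @ x)   # hidden layer activations
    o = sigmoid(W2 @ h)   # output layer activations, one per class
    return o[0] - o[1]    # positive -> class 1, negative -> class 2

print(decision(np.array([1.0, 0.0, 2.0])))
```

Note that decision is a function of all three inputs at once, so there is no single (w0, w1) pair to plot; you can only visualize slices of it.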
I have some trouble understanding the theory of loss functions, and I hope someone can help me.
Usually when people try to explain gradient descent, they show you a loss function that looks like the very first image in the post "Gradient descent: all you need to know". I understand that the whole idea of gradient descent is to adjust the weights so as to minimize the loss function.
My question is: will the shape of the loss function change during training, or will it stay as shown in the image in that post? I know that the weights are what we are constantly tuning, so the quantities that determine the shape of the loss function should be the inputs X = {x1, x2, ..., xn}. Let's take an easy example: suppose our inputs are [[1,2,3,4,5],[5,4,3,2,1]] and the labels are [1,0] (only two training samples for simplicity, with the batch size set to 1). Then the loss function for the first training sample should be something like
L = (1-nonlinear(1*w1+2*w2+3*w3+4*w4+5*w5+b))^2
and for the second training sample the loss function should be:
L = (0-nonlinear(5*w1+4*w2+3*w3+2*w4+1*w5+b))^2
Apparently these two loss functions don't look the same if we plot them, so does that mean the shape of the loss function changes during training? If so, why do people still use that single image (a point sliding down the loss surface to the global minimum) to explain gradient descent?
Note: I'm not changing the loss function itself; it is still the mean squared error. What I'm trying to say is that the shape of the loss function seems to be changing.
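To show what I mean, here is a small sketch (assuming a sigmoid for nonlinear, which I haven't specified above) that evaluates both per-sample losses at the same weights:

```python
import numpy as np

def nonlinear(z):
    return 1.0 / (1.0 + np.exp(-z))   # assuming a sigmoid nonlinearity

def sample_loss(x, label, w, b):
    return (label - nonlinear(np.dot(x, w) + b)) ** 2

x1, label1 = np.array([1, 2, 3, 4, 5]), 1
x2, label2 = np.array([5, 4, 3, 2, 1]), 0
w, b = np.full(5, 0.1), 0.0

# The two per-sample losses give different values at the same weights.
print(sample_loss(x1, label1, w, b))
print(sample_loss(x2, label2, w, b))
```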
Update: I know where my problem came from! I thought that we were unable to plot a function such as f(x,y) = xy that has no constants in it, but we actually can! I searched Google for the graph of f(x,y) = xy, and it can indeed be plotted. So now I understand: as long as we have the loss function, we can plot it. Thanks, guys.
The function stays the same. The point of gradient descent is to find the lowest point on a given loss function that you define.
Generally, the loss function you are training to minimize does not change throughout the course of a training session. The flaw in your reasoning is that you are assuming the loss function is characterized by the weights of the network, when in fact the weights of the network are a sort of input to the loss function.
To clarify, let us assume we are predicting some N-dimensional piece of information, that we have a ground truth vector, call it p, and a loss function L, taking in a prediction vector p_hat, which we define as
L(p_hat) := norm(p - p_hat).
This is a very primitive (and quite ineffective) loss function, but it is one nonetheless. Once we begin training, this loss function is the function we will try to minimize to make our network perform as well as possible. Notice that this loss function attains different values for different inputs p_hat; this does not mean the loss function is changing! In the end, the loss function is an N-dimensional hypersurface in an (N+1)-dimensional hyperspace that stays the same no matter what (similar to the thing you see in the image, where it is a 2-dimensional surface in a 3-dimensional space).
Gradient descent tries to find a minimum on this surface constructed by the loss function, but we do not really know what the surface looks like as a whole; instead, we learn small things about the surface by evaluating the loss function at the values of p_hat we give it.
Note, this is all a huge oversimplification, but it can be a useful way to think about it when getting started.
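As a tiny sketch of this (with the ground truth p invented here), L attains different values at different p_hat, while L itself never changes:

```python
import numpy as np

p = np.array([1.0, 0.0, 2.0])            # fixed ground truth

def L(p_hat):
    return np.linalg.norm(p - p_hat)     # the loss function itself never changes

print(L(np.array([0.0, 0.0, 0.0])))      # one point on the fixed loss surface
print(L(np.array([1.0, 0.0, 1.5])))      # another point, lower on the same surface
```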
A Loss Function is a metric that measures the distance from your predictions to your targets.
The idea is to choose the weights so that your predictions are close to your targets, that is, so that your model has learned/memorized the input.
The loss function should usually not be changed during training, because the minimum of the original function might not coincide with that of the new one, and gradient descent's work would then be lost.
I'm trying to understand SVR model.
To do so, I looked at SVM, and it's pretty clear to me. But there are not many explanations of SVR.
The first question is: why is it called Support Vector Regression? How do we use vectors to predict numerical values?
Also, I don't understand some parameters, such as epsilon and gamma. How do they influence the predicted result?
An SVM learns a so-called decision function from your features, such that features from your positive class produce positive real numbers and features from the negative class produce negative numbers (at least most of the time, depending on your data).
For two features, you can visualize this in a 2D plane. The function assigns a real value to each point in the plane; this value can be depicted as a color. This plot shows the values as different shades of blue.
The feature values resulting in zero form the so-called decision boundary.
This function has two kinds of parameters:
Kernel-dependent parameters, which you set before learning. In your case, with a radial basis function kernel, gamma is the kernel parameter; epsilon belongs to the SVR loss itself (the width of the insensitive tube) rather than to the kernel.
And the so-called support vectors, which are determined during learning. Support vectors are just parameters of your decision function.
Learning is nothing but determining good support vectors (parameters!).
In this 2D example video, the colors don't show the actual function values, only their sign. You can see how gamma influences the smoothness of the decision function.
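As a hedged sketch (using scikit-learn, with toy data invented here), you can inspect both the signed decision values and the learned support vectors:

```python
import numpy as np
from sklearn.svm import SVC

# Toy 2D data: positive class around (1,1), negative class around (-1,-1).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(1, 0.5, (20, 2)), rng.normal(-1, 0.5, (20, 2))])
y = np.array([1] * 20 + [-1] * 20)

clf = SVC(kernel="rbf", gamma=0.5).fit(X, y)

print(clf.decision_function([[1.0, 1.0], [-1.0, -1.0]]))  # positive / negative values
print(clf.support_vectors_[:3])  # support vectors: the learned parameters
```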
To answer your question:
SVR builds such a function, but with a different goal. The function does not try to assign positive outcomes to your positive examples and negative outcomes to the negative examples.
Instead, the function is built to approximate the given numeric outcomes.
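For example (a sketch with made-up 1D data), an SVR with an RBF kernel approximates numeric targets, and you can vary epsilon (the width of the tube within which errors are ignored) and gamma (the kernel width) to see their effect:

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, (40, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(0, 0.1, 40)

# epsilon: width of the insensitive tube; gamma: RBF kernel width.
for eps, gamma in [(0.01, 0.1), (0.01, 10.0), (0.5, 1.0)]:
    model = SVR(kernel="rbf", epsilon=eps, gamma=gamma).fit(X, y)
    print(eps, gamma, len(model.support_), model.score(X, y))
```

A larger epsilon tolerates more error and typically leaves fewer support vectors; gamma controls how wiggly the fitted function is.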
I know the form of softmax regression, but I am curious why it has such a name. Or is it just for historical reasons?
The maximum of two numbers, max(x,y), has a sharp corner, which is sometimes an unwanted property (e.g. if you want to compute gradients).
To soften that corner, one can use a variant with softer edges: the softmax function. It's still a max function at its core (well, to be precise, an approximation of it), but smoothed out.
If it's still unclear, here's a good read.
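A quick numeric sketch of the smoothing, using log-sum-exp as one common smooth stand-in for max:

```python
import numpy as np

def smooth_max(x, y):
    # log-sum-exp: a smooth, differentiable approximation of max(x, y)
    return np.log(np.exp(x) + np.exp(y))

print(max(3.0, 1.0))          # 3.0, with a sharp corner where x == y
print(smooth_max(3.0, 1.0))   # ~3.127, slightly above but smooth everywhere
```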
Let's say you have a set of scalars xi and you want to calculate a weighted sum of them, giving a weight wi to each xi such that the weights sum up to 1 (like a discrete probability distribution). One way to do it is to set wi = exp(a*xi) for some positive constant a, and then normalize the weights to one. If a = 0 you get just a regular sample average. On the other hand, for a very large value of a you get the max operator, that is, the weighted sum will just be the largest xi. Therefore, varying the value of a gives you a "soft", continuous way to go from regular averaging to selecting the max. The functional form of this weighted average should look familiar to you if you already know what softmax regression is.
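Here is a small sketch of that weighted average: as the constant a grows, the result moves from the plain mean toward max(xi):

```python
import numpy as np

def soft_weighted_sum(x, a):
    w = np.exp(a * (x - x.max()))   # subtracting the max keeps exp from overflowing
    w /= w.sum()                    # normalize the weights to sum to one
    return np.dot(w, x)

x = np.array([1.0, 2.0, 5.0])
for a in [0.0, 1.0, 10.0, 100.0]:
    print(a, soft_weighted_sum(x, a))   # a=0 -> mean (~2.67), large a -> max (5.0)
```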
I'm new to neural networks and trying to get the hang of it by solving the following task:
Given a semicircle which defines an area above the x-axis, I would like to teach an ANN to output the length of a vector pointing to any position within that area. In addition, I would also like to know the angle between the vector and the x-axis.
I thought of this as a classical example of supervised learning and used backpropagation to train a feed-forward network. The network consists of two input neurons, two output neurons, and a variable number of hidden neurons organised in a variable number of hidden layers.
My training data is a random, unsorted sample of points within that area together with the respective desired values. The coordinates of the points serve as the input to the net, while I use the calculated values to minimise the error.
However, even after thousands of training iterations and empirical changes to the network's topology, I am unable to produce results with an error below ~0.2 (radius: 20.0, topology: 2/4/2).
Are there any obvious pitfalls I'm failing to see or does the chosen approach just not fit the task? Which other network types and/or learning techniques could be used to complete the task?
I wouldn't use a variable number of hidden layers; I would use just one.
Then, I wouldn't use two output neurons; I would use two separate ANNs, one for each of the values you're after. This should do better, since your two outputs aren't clearly related, in my opinion.
Then, I would experiment with the number of hidden neurons, between 2 and 10, and with different activation functions (logistic and tanh, maybe ReLUs).
After that, do you scale your data? It might be worth scaling both your inputs and outputs. Sigmoid units return small numbers, so it is good if you can adapt your outputs to be small as well (in [-1, 1] or [0, 1]). For example, if you want your angles in degrees, divide all of your targets by 360 before training the ANN on them. Then, when the ANN returns a result, multiply it by 360 and see if that helps.
Finally, there are a number of ways to train your neural network. Gradient descent is the classic, but probably not the best. Better methods include conjugate gradient, BFGS, etc. See here for optimizers if you're using Python - even if you're not, they might give you an idea of what to search for in your language.
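A sketch of the data preparation (assuming the radius of 20 from your question, and using scikit-learn's MLPRegressor only as a convenient stand-in for your own backprop implementation):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

R = 20.0
rng = np.random.default_rng(0)

# Sample points uniformly over the upper half-disc of radius R.
r = R * np.sqrt(rng.uniform(0, 1, 1000))
theta = rng.uniform(0, np.pi, 1000)
X = np.column_stack([r * np.cos(theta), r * np.sin(theta)]) / R  # inputs scaled to [-1, 1]

# One separate net per target, both targets scaled into [0, 1].
len_net = MLPRegressor(hidden_layer_sizes=(10,), max_iter=2000).fit(X, r / R)
ang_net = MLPRegressor(hidden_layer_sizes=(10,), max_iter=2000).fit(X, theta / np.pi)

x = np.array([[5.0 / R, 5.0 / R]])                          # query point (5, 5), scaled
print(len_net.predict(x) * R, ang_net.predict(x) * np.pi)   # unscale the outputs
```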
I am trying to implement the stochastic diagonal Levenberg-Marquardt method for a convolutional neural network in order to backpropagate for learning the weights.
I am new to this and quite confused by it, so I have a few questions; I hope you can help me.
1) How can I calculate the second-order derivative at the output layer from the two outputs?
For the first-order derivative, I have to subtract the output from the desired output and multiply it by the derivative of the output.
But how do I do that for the second derivative?
2) In the max-pooling layer of a convolutional neural network, I select the max value in a 2x2 window and multiply it by the weight. Do I have to pass it through an activation function or not?
Can someone explain how to do this in OpenCV, or give a mathematical explanation or any reference that shows the mathematics?
Thanks in advance.
If you have already calculated the Jacobian matrix J (the matrix of first-order partial derivatives), then you can obtain an approximation of the Hessian (the matrix of second-order partial derivatives) as J^T*J (provided the residuals are small).
You can calculate the second derivative from the two outputs, y and f(X), and the Jacobian this way: with residuals r = y - f(X), the exact Hessian of the squared error is J^T*J plus a term weighted by the residuals r.
In other words, the Hessian approximation B is chosen to satisfy B = J^T*J, which drops the residual-weighted term and is therefore accurate when the residuals are small.
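Here is a tiny numeric sketch of forming that approximation (with a made-up two-output model f and a finite-difference Jacobian, purely for illustration):

```python
import numpy as np

def f(w, x):
    # toy model with two outputs, standing in for the network's output layer
    return np.array([np.tanh(w[0] * x), np.tanh(w[1] * x)])

def jacobian(w, x, eps=1e-6):
    # finite-difference Jacobian of the outputs with respect to the weights
    J = np.zeros((2, len(w)))
    for j in range(len(w)):
        dw = np.zeros_like(w)
        dw[j] = eps
        J[:, j] = (f(w + dw, x) - f(w - dw, x)) / (2 * eps)
    return J

w, x, y = np.array([0.5, -0.3]), 1.2, np.array([1.0, 0.0])
J = jacobian(w, x)
r = y - f(w, x)    # residuals between desired and actual outputs
B = J.T @ J        # Gauss-Newton approximation of the Hessian
g = -J.T @ r       # gradient of E = 0.5 * ||y - f(w, x)||^2
print(B)
print(g)
```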
You can find more about it in this paper:
Ananth Ranganathan: The Levenberg-Marquardt Algorithm