I am trying to find an appropriate neural network structure to learn a function of the following form: F(x1,x2,x3,x4,x5)= a*x1+b*(x2-x4)/(x3-x4) + c*x5.
I am using MATLAB's Neural Network Toolbox to create a feedforwardnet, but without any luck.
Is it even possible to learn this kind of function using a neural network?
If yes, what can be an appropriate structure?
If no, are there any other models that can learn this kind of function?
Thanks.
I suggest that you start by preparing a training dataset in which you have the following:
1- Dataset: the features x1, x6, x5, where x6 = (x2 - x4) / (x3 - x4)
2- Target label Y = f(x1, x6, x5); you may assume some values for a, b, and c
So, you have 3 input variables or features with one target variable Y.
Then, define the ANN to have only a single layer (a single-layer perceptron) and make sure that the output activation function is linear.
Finally, train the ANN, give it new values of x1, x5, and x6, and compare its predictions with the actual function.
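As a rough sketch of the dataset preparation (the values of a, b, and c below are just assumed constants for generating synthetic data, and the input ranges are arbitrary):

```python
import numpy as np

# Assumed coefficients, only for generating synthetic training data
a, b, c = 2.0, -1.5, 0.7

rng = np.random.default_rng(0)
n = 1000

# Random inputs; keep x3 - x4 away from zero so the ratio does not blow up
x1, x2, x5 = rng.uniform(-1, 1, (3, n))
x4 = rng.uniform(-1, 1, n)
x3 = x4 + rng.uniform(0.5, 1.5, n)   # guarantees x3 - x4 >= 0.5

# Derived feature and target
x6 = (x2 - x4) / (x3 - x4)
Y = a * x1 + b * x6 + c * x5

# Training matrix with the three features x1, x6, x5
X = np.column_stack([x1, x6, x5])
```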
If I understand correctly, you are trying to estimate the values of a, b, and c. Although the function is not linear with respect to its input, it is linear with respect to a, b, and c. So you should be able to solve your problem with linear regression.
More precisely, if you define x6 = (x2 - x4) / (x3 - x4), then you get F(x1, x5, x6) = a * x1 + b * x6 + c * x5, which is linear.
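For instance, a quick scikit-learn sketch (assuming a training matrix X with columns [x1, x6, x5] and a target Y, generated as in the snippet above) recovers a, b, and c directly as the regression coefficients:

```python
from sklearn.linear_model import LinearRegression

# Ordinary least squares on the transformed features [x1, x6, x5]
reg = LinearRegression(fit_intercept=False)  # F has no constant term
reg.fit(X, Y)

print(reg.coef_)  # should be close to [a, b, c]
```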
Where does the graph of the loss function in machine learning come from?
I am studying about machine learning. I sometimes don't understand models that have been optimized using regularization terms.
In the explanation of regularization, the following figure may appear.
Here is an example with the L1 regularization term. I have assumed that the model has two weight parameters w1, w2. That is, the model y is expressed by the following equation.
y = w1x1 + w2x2
For simplicity, I ignored the bias term.
The red squares represent the regularization term, and the blue ellipses represent the loss function without the regularization term.
The regularization term is given by
|w1|^q + |w2|^q = r^q (r is a constant)
Therefore, the equation of the graph for w1 > 0 and w2 > 0 is expressed as follows.
w2 = (r^q - |w1|^q)^(1/q)
By substituting values of w1 into this equation (q = 1 for the Lasso), you can draw a graph of the regularization term.
On the other hand, I could not draw a graph of the loss function. Perhaps you need more than one piece of data to draw this graph. For simplicity, I have assumed that I have only two pieces of data. I define them as (x11, x12, t1), (x21, x22, t2). When the loss function is MSE, it is expressed by the following equation.
Ed = 1/2 * {(t1 - w1x11 - w2x12)^2 + (t2 - w1x21 - w2x22)^2}
If I simplify this, it is expressed as
Ed = a*w1^2 + b*w1 + c*w2^2 + d*w2 + e*w1*w2 + f
Here, a, b, c, d, e, and f are determined by the data values x11, x12, x21, x22, t1, and t2. After finding a, b, c, d, e, and f, I thought that if we substitute values of w1 into this equation, we could draw a graph of the loss function. However, I cannot draw it well.
Is the above understanding correct? Thank you.
To visualize the loss function Ed, which is a function of w1 and w2, we should plot it as a 3-dimensional surface. For example, you can use GeoGebra to draw a 3-dimensional surface plot.
Here is an example, where a = 3, b = -1, c = 1, d = -1, e = 2.
The 2D plot that you see is called a contour plot. This link enables you to draw it online.
To draw a contour plot manually, fix the value of Ed; you then obtain a quadratic equation in w2. As you vary w1, you can solve for w2: for each w1, you can obtain up to two values of w2, since the equation is quadratic.
Remark: If you are looking for closed form expression in terms of arbitrary q, that could be more challenging.
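For illustration, here is a small matplotlib sketch of the idea, using the example coefficients a = 3, b = -1, c = 1, d = -1, e = 2 and assuming f = 0, drawing contours of Ed together with the L1 constraint |w1| + |w2| = r:

```python
import numpy as np
import matplotlib.pyplot as plt

# Example coefficients from above; f is assumed to be 0 here
a, b, c, d, e, f = 3, -1, 1, -1, 2, 0

w1, w2 = np.meshgrid(np.linspace(-2, 2, 400), np.linspace(-2, 2, 400))
Ed = a * w1**2 + b * w1 + c * w2**2 + d * w2 + e * w1 * w2 + f

fig, ax = plt.subplots()
ax.contour(w1, w2, Ed, levels=15)          # elliptical contours of the loss
r = 1.0
ax.contour(w1, w2, np.abs(w1) + np.abs(w2),
           levels=[r], colors="red")       # L1 constraint |w1| + |w2| = r
ax.set_xlabel("w1")
ax.set_ylabel("w2")
plt.show()
```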
As I understand it, in a deep neural network we apply an activation function (g) after applying the weights (w) and bias (b) (z := w * X + b | a := g(z)). So there is a composition (g o z), and the activation function is what allows our model to learn functions other than linear ones. I see how the sigmoid and tanh activation functions make our model non-linear, but I have trouble seeing how ReLU (which takes the max of 0 and z) can make a model non-linear...
Say every z is always positive; then it would be as if there were no activation function at all...
So why does ReLu make a neural network model non-linear?
Deciding if a function is linear or not is of course not a matter of opinion or debate; there is a very simple definition of a linear function, which is roughly:
f(a*x + b*y) = a*f(x) + b*f(y)
for every x & y in the function domain and a & b constants.
The requirement "for every" means that, if we are able to find even a single example where the above condition does not hold, then the function is nonlinear.
Assuming for simplicity that a = b = 1, let's try x=-5, y=1 with f being the ReLU function:
f(-5 + 1) = f(-4) = 0
f(-5) + f(1) = 0 + 1 = 1
so, for these x & y (in fact for every x & y with x*y < 0) the condition f(x + y) = f(x) + f(y) does not hold, hence the function is nonlinear...
The fact that we may be able to find subdomains (e.g. both x and y being either negative or positive here) where the linearity condition holds is what defines some functions (such as ReLU) as piecewise-linear, which are still nonlinear nevertheless.
Now, to be fair to your question, if in a particular application the inputs happened to be always either all positive or all negative, then yes, in this case the ReLU would in practice end up behaving like a linear function. But for neural networks this is not the case, hence we can rely on it indeed to provide our necessary non-linearity...
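You can check the counterexample above numerically with a few lines of Python:

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

x, y = -5, 1
print(relu(x + y))         # 0
print(relu(x) + relu(y))   # 1 -> additivity fails, so ReLU is nonlinear
```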
In the traditional residual block, is the "addition" of layer N to the output of layer N+2 (prior to non-linearity) element-wise addition or concatenation?
The literature indicates something like this:
X1 = X
X2 = relu(conv(X1))
X3 = conv(X2)
X4 = relu(X3 + X1)
It has to be element-wise; with concatenation you don't get a residual function. One also has to be aware of using the proper padding mode so the convolutions produce outputs with the same spatial dimensions as the block input.
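As a minimal sketch (using PyTorch, and assuming the block keeps the same number of channels so the shapes match for the element-wise sum), such a block might look like this:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # padding=1 with a 3x3 kernel keeps the spatial dimensions unchanged
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x):
        out = F.relu(self.conv1(x))   # X2 = relu(conv(X1))
        out = self.conv2(out)         # X3 = conv(X2)
        return F.relu(out + x)        # X4 = relu(X3 + X1), element-wise addition

block = ResidualBlock(channels=64)
y = block(torch.randn(1, 64, 32, 32))  # output shape matches the input
```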
I am new to machine learning and statistics (well, I studied math at university, but that was about 10-12 years ago).
Could you please explain the meaning of the following passage from page 4 (page 5 of the book) of the paper here (https://www.researchgate.net/publication/227612766_An_Empirical_Comparison_of_Machine_Learning_Models_for_Time_Series_Forecasting):
The multilayer perceptron (often simply called neural network) is perhaps the most popular network architecture in use today both for classification and regression (Bishop [5]). The MLP is given as follows:

ŷ = v0 + Σ_{j=1}^{N_H} vj * g(wj^T x′)    (1)

where x′ is the input vector x, augmented with 1, i.e. x′ = (1, x^T)^T, wj is the weight vector for the j-th hidden node, v0, v1, ..., v_{N_H} are the weights for the output node, and ŷ is the network output. The function g represents the hidden node output, and it is given in terms of a squashing function, for example (and that is what we used) the logistic function: g(u) = 1/(1 + exp(−u)).
For instance, say we have the vector x = [0.2, 0.3, 0.4, 0.5].
How do I transform it to get the augmented vector x′ = (1, x)?
This is part of the isomorphism between matrices and systems of equations. What you have at the moment is a row equivalent to a right-hand-side expression, such as
w1 = 0.2*x1 + 0.3*x2 + 0.4*x3 + 0.5*x4
w2 = ...
w3 = ...
w4 = ...
When we want to solve the system, we need to augment the matrix. This requires adding the coefficient of each w[n] variable. They are trivially all ones:
1*w1 = 0.2*x1 + 0.3*x2 + 0.4*x3 + 0.5*x4
1*w2 = ...
1*w3 = ...
1*w4 = ...
... and that's where we get the augmented matrix. When we assume the variables by position -- w by row, x by column -- what remains are the coefficients alone, in a nice matrix.
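In terms of the quoted formula itself, the augmentation is just prepending a 1 to the input vector so that the bias becomes the first component of each weight vector. A small NumPy sketch (the weight vector w below is hypothetical, just to show the shapes):

```python
import numpy as np

x = np.array([0.2, 0.3, 0.4, 0.5])     # original input vector
x_aug = np.concatenate(([1.0], x))     # x' = (1, x^T)^T from the quoted formula

# Hypothetical weight vector for one hidden node: its first entry plays
# the role of the bias, the remaining entries multiply x1..x4.
w = np.array([0.1, 1.0, -2.0, 0.5, 0.3])

u = w @ x_aug                          # w_j^T x'  (bias absorbed into w_j)
g = 1.0 / (1.0 + np.exp(-u))           # logistic squashing function g(u)
```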
I'm currently studying Machine Learning but I don't have a statistics background. Everywhere I've seen the logistic function, it has always been:
wx + b
but this example in Theano documentation used:
wx - b
Which one is it, please? I'm new to this and don't want to get confused.
The example on your linked page is not using wx - b. Here is the formula I assume you are referencing:
p_1 = 1 / (1 + T.exp(-T.dot(x, w) - b))
You can break this up into the sigmoid argument and the sigmoid function:
arg = T.dot(x, w) + b # sigmoid argument
p_1 = 1 / (1 + T.exp(-arg)) # sigmoid function
So there are two issues. The first is that you didn't factor the sign of the b variable properly (the formula is using wx + b). The second is that the formula you quoted isn't actually the sigmoid function; rather, it is the argument (a linear weighted sum of the input variables) that is passed to the sigmoid function.
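If it helps, a quick NumPy check (with arbitrary made-up values for x, w, and b) shows that the two ways of writing the exponent are the same:

```python
import numpy as np

rng = np.random.default_rng(0)
x, w = rng.normal(size=3), rng.normal(size=3)
b = 0.5

p_a = 1 / (1 + np.exp(-np.dot(x, w) - b))    # as written in the Theano example
p_b = 1 / (1 + np.exp(-(np.dot(x, w) + b)))  # sigmoid of wx + b

print(np.isclose(p_a, p_b))  # True: both compute sigmoid(wx + b)
```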