Logistic function Addition or subtraction - machine-learning

I'm currently studying Machine Learning but I don't have a statistics background. Everywhere I've seen the logistic function, it has always been:
wx + b
but this example in Theano documentation used:
wx - b
Please which one is it? I'm new to this and I don't want to get confused.

The example on your linked page is not using wx - b. Here is the formula I assume you are referencing:
p_1 = 1 / (1 + T.exp(-T.dot(x, w) - b))
You can break this up into the sigmoid argument and the sigmoid function:
arg = T.dot(x, w) + b # sigmoid argument
p_1 = 1 / (1 + T.exp(-arg)) # sigmoid function
So there are two issues. The first is that the sign of the b variable was not factored properly: -T.dot(x, w) - b is the same as -(T.dot(x, w) + b), so the formula is using wx + b. The second is that the formula you quoted isn't actually the sigmoid function; rather, it is the argument (a linear weighted sum of the input variables plus a bias) that is passed to the sigmoid function.
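The sign argument above can be checked numerically. This is a minimal sketch in plain Python rather than Theano; the values of x, w, and b and the helper names sigmoid and dot are made up for illustration.

```python
import math

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def dot(x, w):
    """Plain-Python stand-in for T.dot on two equal-length vectors."""
    return sum(xi * wi for xi, wi in zip(x, w))

# Hypothetical values, chosen only for illustration.
x = [0.5, -1.2, 3.0]
w = [0.1, 0.4, -0.3]
b = 0.7

# Form used in the Theano example: the leading minus applies to both terms.
p_theano = 1.0 / (1.0 + math.exp(-dot(x, w) - b))
# Conventional form: sigmoid applied to wx + b.
p_standard = sigmoid(dot(x, w) + b)
# What an actual "wx - b" argument would give instead.
p_minus = sigmoid(dot(x, w) - b)

print(math.isclose(p_theano, p_standard))  # True
print(math.isclose(p_theano, p_minus))     # False
```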

Related

Where does the graph of the loss function in machine learning come from?

I am studying about machine learning. I sometimes don't understand models that have been optimized using regularization terms.
In the explanation of regularization, the following figure may appear.
Here is an example of the L1 regularization term. I have assumed that the model has two weight parameters w1, w2. That is, the equation of model y is expressed by the following equation.
y = w1x1 + w2x2
For simplicity, I ignored the bias term.
The red squares represent the regularization term, and the blue ellipses represent the loss function without the regularization term.
The regularization term is given by
| w1 | ^ q + | w2 | ^ q = r ^ q (r is const.)
Therefore, the equation of the graph at w1> 0 and w2> 0 is expressed as follows.
w2 = (r ^ q-| w1 | ^ q) ^ (1 / q)
By substituting values of w1 into this equation (q = 1 for Lasso), you can draw a graph of the regularization term.
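As a rough sketch of that substitution, the first-quadrant boundary can be tabulated point by point. The function name lq_boundary_w2 and the choice r = 1 are assumptions for illustration; q = 1 corresponds to the Lasso (L1) penalty and q = 2 to the ridge (L2) penalty.

```python
def lq_boundary_w2(w1, r, q):
    """Return w2 >= 0 on the curve |w1|^q + |w2|^q = r^q, or None if |w1| > r."""
    if abs(w1) > r:
        return None
    return (r ** q - abs(w1) ** q) ** (1.0 / q)

r = 1.0  # assumed constant, for illustration only
w1_values = [i / 4 for i in range(5)]  # 0, 0.25, 0.5, 0.75, 1

# q = 1 gives a straight edge (one side of the diamond), q = 2 a quarter circle.
lasso_curve = [(w1, lq_boundary_w2(w1, r, q=1)) for w1 in w1_values]
ridge_curve = [(w1, lq_boundary_w2(w1, r, q=2)) for w1 in w1_values]
```

The other three quadrants follow by symmetry, flipping the signs of w1 and w2.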
On the other hand, I could not draw a graph of the loss function. Perhaps you need more than one piece of data to draw this graph. For simplicity, I have assumed that I have only two pieces of data. I define them as (x11, x12, t1), (x21, x22, t2). When the loss function is MSE, it is expressed by the following equation.
Ed = 1/2 * {(t1 - w1*x11 - w2*x12)^2 + (t2 - w1*x21 - w2*x22)^2}
If I simplify this, it is expressed as
Ed = a*w1^2 + b*w1 + c*w2^2 + d*w2 + e*w1*w2 + f
Here, a, b, c, d, e, and f are functions of all or part of x11, x12, x21, x22, t1, and t2. After finding a, b, c, d, e, and f, I thought that if we substitute values of w1 into this equation, we could draw a graph of the loss function. However, I cannot draw it properly.
Is the above understanding correct? Thank you.
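For reference, here is a sketch of how the six coefficients could be computed, assuming the loss is the sum of squared residuals Ed = 1/2 * sum_i (t_i - w1*x_i1 - w2*x_i2)^2. The data values and function names are hypothetical. Note that under this assumption b, d, and f also depend on the targets t_i.

```python
def quadratic_coefficients(data):
    """Expand Ed = 1/2 * sum_i (t_i - w1*x_i1 - w2*x_i2)^2 into
    a*w1^2 + b*w1 + c*w2^2 + d*w2 + e*w1*w2 + f.
    `data` is a list of (x_i1, x_i2, t_i) tuples."""
    a = 0.5 * sum(x1 * x1 for x1, x2, t in data)
    b = -sum(t * x1 for x1, x2, t in data)
    c = 0.5 * sum(x2 * x2 for x1, x2, t in data)
    d = -sum(t * x2 for x1, x2, t in data)
    e = sum(x1 * x2 for x1, x2, t in data)
    f = 0.5 * sum(t * t for x1, x2, t in data)
    return a, b, c, d, e, f

def loss_direct(data, w1, w2):
    """Evaluate the squared-error loss straight from the data."""
    return 0.5 * sum((t - w1 * x1 - w2 * x2) ** 2 for x1, x2, t in data)

def loss_expanded(coeffs, w1, w2):
    """Evaluate the same loss from the expanded quadratic form."""
    a, b, c, d, e, f = coeffs
    return a * w1**2 + b * w1 + c * w2**2 + d * w2 + e * w1 * w2 + f

# Hypothetical data points (x_i1, x_i2, t_i), for illustration only.
data = [(1.0, 2.0, 3.0), (2.0, -1.0, 0.5)]
coeffs = quadratic_coefficients(data)
```

Both evaluation routes should agree for any (w1, w2), which is a quick way to check the algebra.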
To visualize the loss function Ed, which is a function of w1 and w2, plot it as a 3-dimensional surface. For example, you can use Geogebra to draw a 3-dimensional surface plot.
Here is an example, where a=3, b=-1, c=1, d =-1 , e=2.
The 2D plot that you see is called a contour plot. This link enables you to draw it online.
To draw a contour plot manually, fix the value of Ed, which gives a quadratic equation in w1 and w2. Then, as you vary w1, solve for w2; because the equation is quadratic in w2, each w1 yields up to two values of w2.
Remark: If you are looking for closed form expression in terms of arbitrary q, that could be more challenging.
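The manual procedure can be sketched as follows, using the example coefficients a=3, b=-1, c=1, d=-1, e=2. The constant f is not given in the example, so f = 0 is an assumption here, as are the contour level 1.0 and the function name contour_w2.

```python
import math

def contour_w2(w1, level, a, b, c, d, e, f=0.0):
    """Solve a*w1^2 + b*w1 + c*w2^2 + d*w2 + e*w1*w2 + f = level for w2.
    Returns the real roots in ascending order (0, 1, or 2 values).
    Assumes c != 0 so the equation is genuinely quadratic in w2."""
    qa = c
    qb = d + e * w1
    qc = a * w1**2 + b * w1 + f - level
    disc = qb * qb - 4 * qa * qc
    if disc < 0:
        return []  # this w1 lies outside the contour
    if disc == 0:
        return [-qb / (2 * qa)]
    root = math.sqrt(disc)
    return sorted([(-qb - root) / (2 * qa), (-qb + root) / (2 * qa)])

a, b, c, d, e = 3, -1, 1, -1, 2  # from the example; f = 0 is assumed

# Sweep w1 and collect the (w1, w2) points lying on the contour Ed = 1.0.
points = []
for i in range(-20, 21):
    w1 = i / 10
    for w2 in contour_w2(w1, level=1.0, a=a, b=b, c=c, d=d, e=e):
        points.append((w1, w2))
```

Plotting the collected points for several levels traces out the nested ellipses seen in the 2D figure.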

Why is ReLU a non-linear activation function?

As I understand it, in a deep neural network we apply an activation function (g) after applying the weights (w) and bias (b): z := w * X + b, a := g(z). So the layer is the composition (g o z), and the activation function is what lets the model learn functions other than linear ones. I can see that the sigmoid and tanh activation functions make the model non-linear, but I have trouble seeing how ReLU (which takes the max of 0 and z) can make a model non-linear...
Say every z is always positive; then it would be as if there were no activation function...
So why does ReLU make a neural network model non-linear?
Deciding if a function is linear or not is of course not a matter of opinion or debate; there is a very simple definition of a linear function, which is roughly:
f(a*x + b*y) = a*f(x) + b*f(y)
for every x & y in the function domain and a & b constants.
The requirement "for every" means that, if we are able to find even a single example where the above condition does not hold, then the function is nonlinear.
Assuming for simplicity that a = b = 1, let's try x=-5, y=1 with f being the ReLU function:
f(-5 + 1) = f(-4) = 0
f(-5) + f(1) = 0 + 1 = 1
so, for these x & y (in fact for every x & y with x*y < 0) the condition f(x + y) = f(x) + f(y) does not hold, hence the function is nonlinear...
The fact that we may be able to find subdomains (e.g. both x and y being either negative or positive here) where the linearity condition holds is what defines some functions (such as ReLU) as piecewise-linear, which are still nonlinear nevertheless.
Now, to be fair to your question, if in a particular application the inputs happened to be always either all positive or all negative, then yes, in this case the ReLU would in practice end up behaving like a linear function. But for neural networks this is not the case, hence we can rely on it indeed to provide our necessary non-linearity...
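The counterexample is easy to reproduce. A minimal sketch follows; the helper names is_additive and identity are made up for illustration, and only the a = b = 1 case of the definition is checked.

```python
def relu(z):
    """ReLU: the max of 0 and z."""
    return max(0.0, z)

def identity(z):
    """A genuinely linear function, for comparison."""
    return z

def is_additive(f, x, y, tol=1e-12):
    """Check f(x + y) == f(x) + f(y) for one pair (the a = b = 1 case)."""
    return abs(f(x + y) - (f(x) + f(y))) <= tol

print(is_additive(relu, -5.0, 1.0))      # False: the counterexample above
print(is_additive(relu, 2.0, 3.0))       # True: both inputs positive
print(is_additive(relu, -2.0, -3.0))     # True: both inputs negative
print(is_additive(identity, -5.0, 1.0))  # True
```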

How to create the predict model for Multiple input on keras using LSTM

When there is only one input, I can use an LSTM to make the forecast. In the following two cases, however, I am confused and do not know how to build the neural network.
The data format is shown in the picture.
The first case:
Use a, b, c, d to predict d (t + 1)
The second case:
d= f (a, b, c) f is an unknown nonlinear function, using a, b, c, d to predict d (t + 1)
Simply concatenate the inputs in an array with the following dimensions:
(number_of_samples, timesteps, number_of_features)
Where number_of_features in your case is 4 as you have a,b,c,d. Your input_shape of the first layer will be (timesteps, number_of_features).
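One way to build such an array from a table of a, b, c, d values is a sliding window. This is only a sketch with made-up data in plain Python lists; the function name make_windows and the choice of 3 timesteps are assumptions, and the target is taken to be d at the step right after each window.

```python
def make_windows(series, timesteps):
    """series: list of rows [a, b, c, d], one row per time step.
    Returns (X, y): X[i] holds `timesteps` consecutive rows,
    y[i] is the value of d at the step right after that window."""
    X, y = [], []
    for start in range(len(series) - timesteps):
        window = series[start:start + timesteps]
        X.append([list(row) for row in window])
        y.append(series[start + timesteps][3])  # column 3 is d
    return X, y

# Hypothetical data: 6 time steps, 4 features (a, b, c, d) each.
series = [[t, 10 + t, 20 + t, 30 + t] for t in range(6)]
X, y = make_windows(series, timesteps=3)

# (number_of_samples, timesteps, number_of_features)
shape = (len(X), len(X[0]), len(X[0][0]))
print(shape)  # (3, 3, 4)
print(y)      # [33, 34, 35]
```

Converted to a 3D array, X has the layout described above, and the first layer's input_shape would be (3, 4) in this example.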

what does that mean "vector augmented to 1"?

I am new to machine learning and statistics (well, I've been learning math in my university but that was about 10-12 years ago)
Could you please explain the meaning of the following passage from page 4 (page 5 in the book) of the paper here ( https://www.researchgate.net/publication/227612766_An_Empirical_Comparison_of_Machine_Learning_Models_for_Time_Series_Forecasting ):
The multilayer perceptron (often simply called neural network) is perhaps the most popular network architecture in use today both for classification and regression (Bishop [5]). The MLP is given as follows:
$$\hat{y} = v_0 + \sum_{j=1}^{N_H} v_j \, g(w_j^T x') \qquad (1)$$
where $x'$ is the input vector $x$, augmented with 1, i.e. $x' = (1, x^T)^T$, $w_j$ is the weight vector for the $j$th hidden node, $v_0, v_1, \ldots, v_{N_H}$ are the weights for the output node, and $\hat{y}$ is the network output. The function $g$ represents the hidden node output, and it is given in terms of a squashing function, for example (and that is what we used) the logistic function: $g(u) = 1/(1 + \exp(-u))$. A related model in the econometrics literature is
For instance, we have a vector x = [0.2, 0.3, 0.4, 0.5]
How do I transform it to get the vector x′ augmented with 1?
x′ = (1, x)
This is part of the isomorphism between matrices and systems of equations. What you have at the moment is a row equivalent to a right-hand-side expression, such as
w1 = 0.2*x1 + 0.3*x2 + 0.4*x3 + 0.5*x4
w2 = ...
w3 = ...
w4 = ...
When we want to solve the system, we need to augment the matrix. This requires adding the coefficient of each w[n] variable. They are trivially all ones:
1*w1 = 0.2*x1 + 0.3*x2 + 0.4*x3 + 0.5*x4
1*w2 = ...
1*w3 = ...
1*w4 = ...
... and that's where we get the augmented matrix. When we assume the variables by position -- w by row, x by column -- what remains is the coefficients alone, in a nice matrix.
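Read literally, the quoted definition of x′ simply prepends a constant 1 to the input vector, so that the first component of each weight vector wj acts as a bias term. A small sketch of that reading follows; the weight values are hypothetical and only there to show the equivalence.

```python
def augment(x):
    """Return x' = (1, x): the input with a constant 1 prepended."""
    return [1.0] + list(x)

x = [0.2, 0.3, 0.4, 0.5]
x_aug = augment(x)
print(x_aug)  # [1.0, 0.2, 0.3, 0.4, 0.5]

# Hypothetical weight vector for one hidden node; w_j[0] plays the role of the bias.
w_j = [0.5, 1.0, -1.0, 2.0, 0.0]

# w_j^T x' computed on the augmented vector ...
pre_activation = sum(w * xi for w, xi in zip(w_j, x_aug))
# ... equals bias + weights . x computed on the original vector.
bias_form = w_j[0] + sum(w * xi for w, xi in zip(w_j[1:], x))
```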

Weighing Samples in a Decision Tree

I've constructed a decision tree that weights every sample equally. Now I want to construct a decision tree that gives different weights to different samples. Is the only change I need to make in computing the expected entropy before calculating the information gain? I'm a little confused about how to proceed; please explain.
For example: consider a node containing p positive samples and n negative samples. The node's entropy will be -p/(p+n)log(p/(p+n)) -n/(p+n)log(n/(p+n)). Now suppose a split is found that divides the parent node into two child nodes, where child 1 contains p' positives and n' negatives (so child 2 contains p-p' and n-n'). For child 1 we calculate the entropy in the same way as for the parent and take the probability of reaching it, i.e. (p'+n')/(p+n). The expected reduction in entropy is then entropy(parent)-(prob of reaching child1*entropy(child1)+prob of reaching child2*entropy(child2)), and the split with the maximum information gain is chosen.
How do I follow the same procedure when a weight is available for each sample? What changes need to be made? What changes need to be made specifically for AdaBoost (using stumps only)?
(I guess this is the same idea as in some comments, e.g., @Alleo)
Suppose you have p positive examples and n negative examples. Let's denote the weights of examples to be:
a1, a2, ..., ap ---------- weights of the p positive examples
b1, b2, ..., bn ---------- weights of the n negative examples
Suppose
a1 + a2 + ... + ap = A
b1 + b2 + ... + bn = B
As you pointed out, if the examples have unit weights, the entropy would be:
$$-\frac{p}{p+n}\log\frac{p}{p+n} - \frac{n}{p+n}\log\frac{n}{p+n}$$
Now you only need to replace p with A and replace n with B and then you can obtain the new instance-weighted entropy.
$$-\frac{A}{A+B}\log\frac{A}{A+B} - \frac{B}{A+B}\log\frac{B}{A+B}$$
Note: there is nothing fancy here. We simply work out the weighted importance of the group of positive examples and the group of negative examples. When examples are equally weighted, the importance of the positive examples is the fraction of positive examples among all examples, p/(p+n). When examples are not equally weighted, we use the weighted share A/(A+B) instead.
Then you follow the same logic to choose the attribute with largest Information Gain by comparing entropy before splitting and after splitting on an attribute.
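The substitution of A and B for p and n can be sketched as below. Two choices here are assumptions and not part of the answer above: logarithms are taken in base 2, and the probability of reaching a child is taken to be that child's share of the parent's total weight. The function names are illustrative.

```python
import math

def weighted_entropy(pos_weights, neg_weights):
    """Entropy with p and n replaced by the weight sums A and B (log base 2 assumed)."""
    A, B = sum(pos_weights), sum(neg_weights)
    total = A + B
    h = 0.0
    for mass in (A, B):
        if mass > 0:  # an empty class contributes 0 to the entropy
            frac = mass / total
            h -= frac * math.log2(frac)
    return h

def weighted_info_gain(parent, child1, child2):
    """Each argument is a (pos_weights, neg_weights) pair.
    Assumes each child is weighted by its share of the parent's total weight,
    and that the parent's total weight is positive."""
    total = sum(parent[0]) + sum(parent[1])
    gain = weighted_entropy(*parent)
    for pos, neg in (child1, child2):
        share = (sum(pos) + sum(neg)) / total
        gain -= share * weighted_entropy(pos, neg)
    return gain

# Hypothetical weights: positives sum to A = 0.5, negatives to B = 0.5.
parent = ([0.4, 0.1], [0.25, 0.25])
# A split that separates the classes perfectly.
gain = weighted_info_gain(parent, ([0.4, 0.1], []), ([], [0.25, 0.25]))
```

With unit weights the functions reduce to the usual unweighted entropy and information gain, which gives a simple sanity check.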
