Equation of the hyperplane for machine learning - machine-learning

I'm doing an ML MOOC through edX (archived) and am trying to figure out what theta means for hyperplanes.
So the equation is
When Theta = [-1, 1.5] and Theta_0 = [3], we have:
How do I interpret theta? I thought it was [change in x, change in y], but the line in the image looks like it has a positive slope (2/3). Also, I thought Theta_0 represented the y-intercept, but the intercept seems to be at -2. What is the equation representing if not the hyperplane?
The scale on both axes is one.

As far as I understand your question:
Theta = [-1, 1.5] (that is, Theta_1 = -1 and Theta_2 = 1.5) and Theta_0 = 3, which simply means you have two features, while you are interpreting this as a one-feature equation with a slope and an intercept.
The main equation becomes:
Theta_0 + Theta_1*X1 + Theta_2*X2 = 0
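To connect this with the slope and intercept seen in the plot, here is a minimal sketch that solves Theta_0 + Theta_1*X1 + Theta_2*X2 = 0 for X2. The helper name `line_from_hyperplane` is made up for illustration. Theta itself is the normal vector of the line, not its direction, which is why it does not read as [change in x, change in y].

```python
import numpy as np

def line_from_hyperplane(theta, theta_0):
    """Rewrite theta[0]*x1 + theta[1]*x2 + theta_0 = 0 as x2 = slope*x1 + intercept."""
    theta = np.asarray(theta, dtype=float)
    if theta[1] == 0:
        # The line is vertical, so x2 is not a function of x1.
        raise ValueError("theta[1] is zero: the line is vertical")
    slope = -theta[0] / theta[1]
    intercept = -theta_0 / theta[1]
    return slope, intercept

# Values from the question: Theta = [-1, 1.5], Theta_0 = 3.
slope, intercept = line_from_hyperplane([-1.0, 1.5], 3.0)
print(slope, intercept)
```

With the values from the question this gives a slope of 2/3 and an intercept of -2, which matches what the asker read off the image.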

Related

Can BDT do squares?

I am trying to separate background from signal where it is known that the quantity x^2 - y^2 is the physical reason why the background and signal are different. If I provide x and y as input variables, the BDT has a hard time figuring out how to achieve the separation. Is a BDT unable to do squares?
No, a binary decision tree is unable to take squares of input features. Given input features x, y, it will try to approximate the desired function by subdividing the x,y plane along vertical and horizontal lines. Let us take a look at an example: I fit a decision tree classifier to a square grid of points, and plot the decision boundary.
from sklearn.tree import DecisionTreeClassifier
import numpy as np
import matplotlib.pyplot as plt
x = np.arange(-5.5, 5.5, 1)
y = np.arange(-5.0, 6.0, 1)
xx, yy = np.meshgrid(x,y)
#the function we want to learn:
target = xx.ravel()**2 - yy.ravel()**2 > 0
data = np.c_[xx.ravel(), yy.ravel()]
#Fit a decision tree:
clf = DecisionTreeClassifier()
clf.fit(data, target)
#Plot the decision boundary:
xxplot, yyplot = np.meshgrid(np.arange(-7, 7, 0.1),
                             np.arange(-7, 7, 0.1))
Z = clf.predict(np.c_[xxplot.ravel(), yyplot.ravel()])
# Put the result into a color plot
Z = Z.reshape(xxplot.shape)
plt.contourf(xxplot, yyplot, Z, cmap=plt.cm.hot)
# Plot also the training points
plt.scatter(xx.ravel(), yy.ravel(), c=target, cmap=plt.cm.flag)
plt.xlabel("x")
plt.ylabel("y")
plt.title("Decision boundary for a binary decision tree learning a function x**2 - y**2 > 0")
plt.show()
Here you can see what kind of boundaries a decision tree can learn: piecewise-rectangular. They are not going to approximate your function well, especially in the area where there are few training points. Since you know that x^2 - y^2 is the quantity that determines the answer, you can just add it as a new feature instead of trying to learn it.
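As a rough sketch of the "add it as a new feature" suggestion, the snippet below feeds x^2 - y^2 to a tree limited to a single split. It reuses the training grid from the example above; the names `feature` and `stump` and the choice of `max_depth=1` are assumptions made for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Same training grid and target as in the example above.
x = np.arange(-5.5, 5.5, 1)
y = np.arange(-5.0, 6.0, 1)
xx, yy = np.meshgrid(x, y)
target = xx.ravel()**2 - yy.ravel()**2 > 0

# Engineered feature: the quantity known to separate the classes.
feature = (xx.ravel()**2 - yy.ravel()**2).reshape(-1, 1)

# A tree restricted to one split (a "stump") is enough on this feature.
stump = DecisionTreeClassifier(max_depth=1, random_state=0)
stump.fit(feature, target)
train_accuracy = stump.score(feature, target)
print(stump.get_depth(), train_accuracy)
```

One threshold on the engineered feature separates the training grid, whereas the tree on raw x and y needs many rectangular cuts. The learned threshold lies between the closest training values on either side of zero, so it will generally not be exactly 0.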

What would be a good loss function to penalize the magnitude and sign difference

I'm in a situation where I need to train a model to predict a scalar value, and it's important that the predicted value has the same sign as the true value while the squared error is kept to a minimum.
What would be a good choice of loss function for that?
For example:
Let's say the predicted value is -1 and the true value is 1. The loss between the two should be a lot greater than the loss between 3 and 1, even though the squared error of (3, 1) and (-1, 1) is equal.
Thanks a lot!
This turned out to be a really interesting question - thanks for asking it! First, remember that you want your loss function to be defined entirely in terms of differentiable operations, so that you can backpropagate through it. This means that arbitrary logic won't necessarily do. To restate your problem: you want to find a differentiable function of two variables that increases sharply when the two variables take on values of different signs, and more slowly when they share the same sign. Additionally, you want some control over how sharply these values increase, relative to one another. Thus, we want something with two configurable constants. I started constructing a function that met these needs, but then remembered one you can find in any high school geometry textbook: the elliptic paraboloid!
The standard formulation doesn't meet the requirement of sign agreement symmetry, so I had to introduce a rotation. The plot above is the result. Note that it increases more sharply when the signs don't agree, and less sharply when they do, and that the input constants controlling this behaviour are configurable. The code below is all that was needed to define and plot the loss function. I don't think I've ever used a geometric form as a loss function before - really neat.
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # registers the 3D projection on older Matplotlib
from matplotlib import cm
def elliptic_paraboloid_loss(x, y, c_diff_sign, c_same_sign):
    # Compute an elliptic paraboloid rotated by 45 degrees.
    t = np.pi / 4
    x_rot = (x * np.cos(t)) + (y * np.sin(t))
    y_rot = (x * -np.sin(t)) + (y * np.cos(t))
    z = ((x_rot**2) / c_diff_sign) + ((y_rot**2) / c_same_sign)
    return z
c_diff_sign = 4
c_same_sign = 2
a = np.arange(-5, 5, 0.1)
b = np.arange(-5, 5, 0.1)
# Evaluate the loss on the grid: loss_map[i, j] corresponds to (a[i], b[j]).
loss_map = np.zeros((len(a), len(b)))
for i, a_i in enumerate(a):
    for j, b_j in enumerate(b):
        loss_map[i, j] = elliptic_paraboloid_loss(a_i, b_j, c_diff_sign, c_same_sign)
fig = plt.figure()
# fig.gca(projection='3d') no longer works in recent Matplotlib; add_subplot does.
ax = fig.add_subplot(projection='3d')
# indexing='ij' makes X[i, j] = a[i] and Y[i, j] = b[j], matching loss_map[i, j].
X, Y = np.meshgrid(a, b, indexing='ij')
surf = ax.plot_surface(X, Y, loss_map, cmap=cm.coolwarm,
                       linewidth=0, antialiased=False)
plt.show()
From what I understand, your current loss function is something like:
loss = mean_squared_error(y, y_pred)
What you could do is add a second component to your loss that penalizes negative predictions and leaves positive ones alone, with a coefficient that controls how strongly it penalizes. For that, we can use a mirrored ReLU, i.e. a ReLU applied to the negated value. Something like this:
Let's call this component "Neg_ReLU". Then your loss function will be:
loss = mean_squared_error(y, y_pred) + Neg_ReLU(y_pred)
So for example, if your result is -1, then the total error would be:
mean_squared_error(1, -1) + 1
And if your result is 3, then the total error would be:
mean_squared_error(1, 3) + 0
(Note that Neg_ReLU(3) = 0 and Neg_ReLU(-1) = 1.)
If you want to penalize negative values more, you can add a coefficient:
coeff_negative_value = 2
loss = mean_squared_error(y, y_pred) + coeff_negative_value * Neg_ReLU(y_pred)
Now negative values are penalized more heavily.
We can build the negative ReLU function like this:
tf.nn.relu(tf.math.negative(value))
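So, to summarize, your total loss will be:
coeff = 1
# The penalty is applied to the prediction and averaged so that it is a scalar like the MSE term.
Neg_ReLU = tf.reduce_mean(tf.nn.relu(tf.math.negative(y_pred)))
total_loss = mean_squared_error(y, y_pred) + coeff * Neg_ReLU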
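The arithmetic above can be checked with a small NumPy sketch (NumPy rather than TensorFlow, purely to keep it dependency-light). The function names are made up for illustration. The second function is a hypothetical variant, not part of the answer above: it penalizes any sign disagreement by applying the same mirrored ReLU to the product `y_true * y_pred`, so it also covers negative targets.

```python
import numpy as np

def neg_relu(v):
    # Mirrored ReLU: equals -v for negative v and 0 otherwise.
    return np.maximum(0.0, -v)

def mse_with_negative_penalty(y_true, y_pred, coeff=1.0):
    # Squared error plus a penalty on negative predictions, averaged over samples.
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2 + coeff * neg_relu(y_pred)))

def mse_with_sign_penalty(y_true, y_pred, coeff=1.0):
    # Hypothetical variant: penalize whenever y_true and y_pred differ in sign.
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2 + coeff * neg_relu(y_true * y_pred)))

loss_wrong_sign = mse_with_negative_penalty([1.0], [-1.0])  # 4 + 1
loss_right_sign = mse_with_negative_penalty([1.0], [3.0])   # 4 + 0
print(loss_wrong_sign, loss_right_sign)
```

With the example from the question, the prediction -1 now costs more than the prediction 3 even though both have the same squared error; raising `coeff` widens the gap.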

Simple RNN example showing numerics

I'm trying to understand RNNs and I would like to find a simple example that actually shows the one-hot vectors and the numerical operations, preferably conceptual, since actual code may make it even more confusing. Most examples I find on Google just show boxes with loops coming out of them, and it's really difficult to understand what exactly is going on. In the rare cases where they do show the vectors, it's still difficult to see how they are getting the values.
For example, I don't know where the values are coming from in this picture: https://i1.wp.com/karpathy.github.io/assets/rnn/charseq.jpeg
If the example could integrate LSTMs and other popular extensions that would be cool too.
In the simple RNN case, a network accepts an input sequence x and produces an output sequence y, while a hidden sequence h stores the network's dynamic state. At timestep t, x(t) ∊ ℝ^M, h(t) ∊ ℝ^N and y(t) ∊ ℝ^P are real-valued vectors of M, N and P dimensions corresponding to the input, hidden and output values respectively. The RNN changes its state and emits output based on the state equations:
h(t) = tanh([x(t); h(t-1)] * Wxh), where Wxh is a linear map ℝ^(M+N) → ℝ^N, * is matrix multiplication and ; is the concatenation operation. Concretely, to obtain h(t) you concatenate x(t) with h(t-1), you multiply the concatenated row vector (of shape M+N) by Wxh (of shape (M+N, N)), and you apply a tanh non-linearity to each element of the resulting vector (of shape N).
y(t) = sigmoid(h(t) * Why), where Why is a linear map ℝ^N → ℝ^P. Concretely, you multiply h(t) (of shape N) by Why (of shape (N, P)) to obtain a P-dimensional output vector, to which the sigmoid function is applied.
In other words, obtaining the output at time t requires iterating through the above equations for i=0,1,...,t. Therefore, the hidden state acts as a finite memory for the system, allowing for context-dependent computation (i.e. h(t) fully depends on both the history of the computation and the current input, and so does y(t)).
In the case of gated RNNs (GRU or LSTM), the state equations get somewhat harder to follow, due to the gating mechanisms which essentially allow selection between the input and the memory, but the core concept remains the same.
Numeric Example
Let's follow your example; we have M = 4, N = 3, P = 4, so Wxh is of shape (7, 3) and Why of shape (3, 4). We of course do not know the values of either W matrix, so we cannot reproduce the same results; we can still follow the process though.
At timestep t<0, we have h(t) = [0, 0, 0].
At timestep t=0, we receive input x(0) = [1, 0, 0, 0]. Concatenating x(0) with h(0-), we get [x(t); h(t-1)] = [1, 0, 0 ..., 0] (let's call this vector u to ease notation). We apply u * Wxh (i.e. multiplying a 7-dimensional vector with a 7 by 3 matrix) and get a vector v = [v1, v2, v3], where vi = Σj uj Wji = u1 W1i + u2 W2i + ... + u7 W7i. Finally, we apply tanh on v, obtaining h(0) = [tanh(v1), tanh(v2), tanh(v3)] = [0.3, -0.1, 0.9]. From h(0) we can also get y(0) via the same process; multiply h(0) with Why (i.e. 3 dimensional vector with a 3 by 4 matrix), get a vector s = [s1, s2, s3, s4], apply sigmoid on s and get σ(s) = y(0).
At timestep t=1, we receive input x(1) = [0, 1, 0, 0]. We concatenate x(1) with h(0) to get a new u = [0, 1, 0, 0, 0.3, -0.1, 0.9]. u is again multiplied with Wxh, and tanh is again applied on the result, giving us h(1) = [1, 0.3, 1]. Similarly, h(1) is multiplied by Why, giving us a new s vector on which we apply the sigmoid to obtain σ(s) = y(1).
This process continues until the input sequence finishes, ending the computation.
Note: I have ignored bias terms in the above equations because they do not affect the core concept and they make the notation much harder to follow.
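The steps above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the weights are random (the real values behind the picture are unknown), the names `W_xh`, `W_hy` and `rnn_forward` are made up, the last two inputs of the sequence are arbitrary, and biases are left out as in the text.

```python
import numpy as np

M, N, P = 4, 3, 4  # input, hidden and output sizes from the example
rng = np.random.default_rng(0)
W_xh = rng.normal(scale=0.5, size=(M + N, N))  # random stand-in weights
W_hy = rng.normal(scale=0.5, size=(N, P))

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def rnn_forward(inputs, W_xh, W_hy):
    h = np.zeros(W_hy.shape[0])      # h(t) = 0 for t < 0
    hidden_states, outputs = [], []
    for x in inputs:
        u = np.concatenate([x, h])   # u = [x(t); h(t-1)], length M + N
        h = np.tanh(u @ W_xh)        # new hidden state, length N
        y = sigmoid(h @ W_hy)        # output, length P
        hidden_states.append(h)
        outputs.append(y)
    return np.array(hidden_states), np.array(outputs)

# One-hot inputs: the first two match x(0) and x(1) above, the rest are arbitrary.
inputs = np.eye(M)[[0, 1, 2, 2]]
hs, ys = rnn_forward(inputs, W_xh, W_hy)
print(hs)
print(ys)
```

Because x(0) is one-hot and the initial hidden state is zero, the first hidden state is just tanh of the first row of `W_xh`, which shows concretely where such numbers come from once the weights are known.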

Logistic Regression using Gradient Descent with OCTAVE

I've gone through a few of Professor Andrew Ng's machine learning courses and read the transcript for logistic regression using Newton's method. However, when implementing logistic regression using gradient descent, I ran into an issue.
The graph generated is not convex.
My code goes as follows:
I am using the vectorized implementation of the equation.
% 1. Load the data files into Octave.
x = load('ex4x.dat');
y = load('ex4y.dat');
% 2. Prepend the intercept column x0 = 1 to the matrix.
m = length(y); % number of training examples
x = [ones(m, 1), x];
alpha = 0.1;
max_iter = 100;
g = @(z) 1.0 ./ (1.0 + exp(-z)); % sigmoid as an anonymous function (inline is obsolete)
theta = zeros(size(x, 2), 1); % theta must be 3x1 so that it can be multiplied by x, which is m x 3
j = zeros(max_iter, 1); % j(k) stores the cost J(theta) at iteration k
for num_iter = 1:max_iter
    % The hypothesis is recomputed in every iteration because theta changes.
    z = x * theta;
    h = g(z);
    % Vectorized form of the cost function J(theta).
    j(num_iter) = (1/m) * (-y' * log(h) - (1 - y)' * log(1 - h));
    % Gradient of J(theta) with respect to theta.
    grad = (1/m) * x' * (h - y);
    % Gradient descent update of theta.
    theta = theta - alpha .* grad;
end
% Display the cost history and the final parameters.
j
theta
The code per se doesn't give any error, but it does not produce a proper convex graph.
I would be glad if anybody could point out the mistake or share insight into what's causing the problem.
thanks
2 things you need to look into:
Machine Learning involves learning patterns from data. If your files ex4x.dat and ex4y.dat are randomly generated, it won't have patterns that you can learn.
You have used variables like g, h, i, j which make debugging difficult. Since it's a very small program, it might be a better idea to rewrite it.
Here's my code that gives the convex plot
clc; clear; close all;
load q1x.dat;
load q1y.dat;
X = [ones(size(q1x, 1),1) q1x]; % prepend the intercept column
Y = q1y;
m = size(X,1); % number of training examples
n = size(X,2)-1; % number of features, excluding the intercept
%initialize
theta = zeros(n+1,1);
thetaold = ones(n+1,1);
% iterate until the squared change in theta is negligible
while ( ((theta-thetaold)'*(theta-thetaold)) > 0.0000001 )
    %calculate dellltheta (gradient of the log-likelihood)
    dellltheta = zeros(n+1,1);
    for j=1:n+1,
        for i=1:m,
            dellltheta(j,1) = dellltheta(j,1) + [Y(i,1) - (1/(1 + exp(-theta'*X(i,:)')))]*X(i,j);
        end;
    end;
    %calculate hessian
    H = zeros(n+1, n+1);
    for j=1:n+1,
        for k=1:n+1,
            for i=1:m,
                H(j,k) = H(j,k) -[1/(1 + exp(-theta'*X(i,:)'))]*[1-(1/(1 + exp(-theta'*X(i,:)')))]*[X(i,j)]*[X(i,k)];
            end;
        end;
    end;
    % Newton update
    thetaold = theta;
    theta = theta - inv(H)*dellltheta;
    (theta-thetaold)'*(theta-thetaold)
end
I get the following values of the squared change in theta, (theta-thetaold)'*(theta-thetaold), after successive iterations:
2.8553
0.6596
0.1532
0.0057
5.9152e-06
6.1469e-12
Which when plotted looks like:
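Returning to the gradient descent loop in the question: one common reason for a cost curve that does not decrease smoothly is a step size that is too large for unscaled features. Whether that applies to ex4x.dat is an assumption, since the file is not shown. The sketch below uses synthetic stand-in data (made-up "exam scores" between 20 and 80), standardizes the features, and records the cost at every iteration with the same vectorized formulas.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 80
# Synthetic stand-in for the data file: two "exam scores" per example.
raw = rng.uniform(20.0, 80.0, size=(m, 2))
true_logit = 0.1 * (raw[:, 0] - 50.0) + 0.1 * (raw[:, 1] - 50.0)
y = (rng.uniform(size=m) < 1.0 / (1.0 + np.exp(-true_logit))).astype(float)

# Standardize each feature, then prepend the intercept column.
features = (raw - raw.mean(axis=0)) / raw.std(axis=0)
X = np.hstack([np.ones((m, 1)), features])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

alpha, max_iter = 0.1, 100
theta = np.zeros(X.shape[1])
costs = np.zeros(max_iter)
for k in range(max_iter):
    h = sigmoid(X @ theta)
    # Vectorized cost J(theta), same formula as in the question.
    costs[k] = np.mean(-y * np.log(h) - (1.0 - y) * np.log(1.0 - h))
    grad = X.T @ (h - y) / m
    theta = theta - alpha * grad

print(costs[0], costs[-1])
```

Plotting `costs` against the iteration number is the quickest check: with standardized features and this step size the curve should fall steadily, and if it oscillates or produces NaN on the raw features, scaling or a smaller alpha is the first thing to try.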

Vectorized gradient descent basics

I'm implementing simple gradient descent in Octave but it's not working. Here is the data I'm using:
X = [1 2 3
1 4 5
1 6 7]
y = [10
11
12]
theta = [0
0
0]
alpha = 0.001 and itr = 50
This is my gradient descent implementation:
function theta = Gradient(X, y, theta, alpha, itr)
    m = length(y); % number of training examples
    for i = 1:itr,
        % Compute all three updates from the same theta (simultaneous update).
        th1 = theta(1) - alpha * (1/m) * sum((X * theta - y) .* X(:, 1));
        th2 = theta(2) - alpha * (1/m) * sum((X * theta - y) .* X(:, 2));
        th3 = theta(3) - alpha * (1/m) * sum((X * theta - y) .* X(:, 3));
        theta(1) = th1;
        theta(2) = th2;
        theta(3) = th3;
    end
end
Questions are:
It produces some values of theta which I use in theta * [1 2 3] and expect an output near about 10 (from y). Is that the correct way to test the hypothesis? [h(x) = theta' * x]
How can I determine how many times it should iterate? If I give it 1500 iterations, theta gets extremely small (in e-notation).
If I use double digit numbers in X, theta gets too small again. Even with < 5 iterations.
I've been struggling with these things for a long time now. Unable to resolve it myself.
Sorry for bad formatting.
Your batch gradient descent implementation looks fine to me. Can you be more specific about the error you are facing? That said, regarding your first question, "Is that the correct way to test the hypothesis? [h(x) = theta' * x]":
Given the dimensions of your data, where each row of X is one example, you should evaluate it as h(x) = X*theta.
For your second question, the number of iterations depends on the data set provided. To decide on a good number of iterations, plot the cost function against the number of iterations. As the iterations increase, the value of the cost function should decrease; once the curve flattens out, further iterations bring little benefit. You might also try increasing alpha in steps such as 0.001, 0.003, 0.01, 0.03, 0.1, ... to find the best value.
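The cost-versus-iterations check described above can be sketched as follows, using the X, y, alpha and iteration count from the question. The function name `gradient_descent` and the 1/(2m) scaling of the cost are assumptions; the scaling is the usual convention for this update rule, which the question does not state.

```python
import numpy as np

# Data from the question.
X = np.array([[1.0, 2.0, 3.0],
              [1.0, 4.0, 5.0],
              [1.0, 6.0, 7.0]])
y = np.array([10.0, 11.0, 12.0])

def gradient_descent(X, y, theta, alpha, n_iters):
    m = len(y)
    costs = np.zeros(n_iters)
    for k in range(n_iters):
        residual = X @ theta - y                  # h(x) - y for every example
        costs[k] = residual @ residual / (2 * m)  # cost before this update
        theta = theta - alpha * (X.T @ residual) / m  # all components updated together
    return theta, costs

theta, costs = gradient_descent(X, y, np.zeros(3), alpha=0.001, n_iters=50)
predictions = X @ theta  # hypothesis evaluated on all rows at once
print(costs[0], costs[-1])
print(predictions)
```

Plotting `costs` shows whether the chosen alpha and iteration count are enough: the curve should keep falling, and the predictions only approach y once it has flattened, which takes far more than 50 iterations at this step size.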
For your third question, I guess you are trying to model the data from this question directly. This data set is very small: it contains just 3 training examples. You are presumably trying to implement the linear regression algorithm. For that, you need a proper training set that contains enough data to train your model. Then you can test your model with your test data.
Refer to Andrew Ng's Machine Learning course at www.coursera.org. You will find more information in that course.
