Rescaling after feature scaling, linear regression

Rescaling after feature scaling, linear regression - machine-learning

Seems like a basic question, but I need to use feature scaling (take each feature value, subtract the mean then divide by the standard deviation) in my implementation of linear regression with gradient descent. After I'm finished, I'd like the weights and regression line rescaled to the original data. I'm only using one feature, plus the y-intercept term. How would I change the weights, after I get them using the scaled data, so that they apply to the original unscaled data?

Suppose your regression is y = W*x + b with x the scaled data, with the original data it is
y = W/std * x0 + b - u/std * W
where u and std are mean value and standard deviation of x0. Yet I don't think you need to transform back the data. Just use the same u and std to scale the new test data.

Related

Why would I use a Non Linear activation function in CNN convolutional layer?

I was going through one of the deep learning lectures from MIT on CNN. It said when multiplying weights with pixel values, a non linear activation function like relu can be applied on every pixel. I understand why it should be applied in a simple neural network, since it introduces non linearity in our input data. But why would I want to apply it on a single pixel ? Or am I getting it wrong ?

You may have got it a little wrong.
When they say "multiplying weights with pixel values" - they refer to the linear operation of multiplying the filter (weights + bias) with the pixels of the image. If you think about it, each filter in a CNN essentially represents a linear equation.
For example - if we're looking at a 4*4 filter, the filter is essentially computing x1 * w1 + x2 * w2 + x3 * w3 + x4 * w4 + b for every 4*4 patch of the image it goes over. (In the above equation, x1,x2,x4,x4 refer to pixels of the image, while w1,w2,w3,w4 refer to the weights present in the CNN filter)
Now, hopefully it's fairly clear that the filter is essentially computing a linear equation. To be able to perform a task like let's say image classification, we require some amount of non-linearity. This is achieved by using, most popularly, the ReLU activation function.
So you aren't applying non linearity to a "pixel" per se, you're still applying it to a linear operation (like in a vanilla neural network) - which consists of pixel values multiplied by the weights present in a filter.
Hope this cleared your doubt, feel free to reach out for more help!

how to calculate theta in univariate linear regression model?

I have hypothesis function h(x) = theta0 + theta1*x.
How can I select theta0 and theta1 value for the linear regression model?

The question is unclear whether you would like to do this by hand (with the underlying math), use a program like Excel, or solve in a language like MATLAB or Python.
To start, here is a website offering a summary of the math involved for a univariate calculation: http://www.statisticshowto.com/probability-and-statistics/regression-analysis/find-a-linear-regression-equation/
Here, there is some discussion of the matrix formulation of the multivariate problem (I know you asked for univariate but some people find the matrix formulation helps them conceptualize the problem): https://onlinecourses.science.psu.edu/stat501/node/382
We should start with a bit of an intuition, based on the level of the question. The goal of a linear regression is to find a set of variables, in your case thetas, that minimize the distance between the line formed and the data points observed (often, the square of this distance). You have two "free" variables in the equation you defined. First, theta0: this is the intercept. The intercept is the value of the response variable (h(x)) when the input variable (x) is 0. This visually is the point where the line will cross the y axis. The second variable you have defined is the slope (theta1), this variable expresses how much the response variable changes when the input changes. If theta1 = 0, h(x) does not change when x changes. If theta1 = 1, h(x) increases and decreases at the same rate as x. If theta1 = -1, h(x) responds in the opposite direction: if x increases, h(x) decreases by the same amount; if x decreases, h(x) increases by the quantity.
For more information, Mathworks provides a fairly comprehensive explanation: https://www.mathworks.com/help/symbolic/mupad_ug/univariate-linear-regression.html
So after getting a handle on what we are doing conceptually, lets take a stab at the math. We'll need to calculate the standard deviation of our two variables, x and h(x). WTo calculate the standard deviation, we will calculate the mean of each variable (sum up all the x's and then divide by the number of x's, do the same for h(x)). The standard deviation captures how much a variable differs from its mean. For each x, subtract the mean of x. Sum these differences up and then divide by the number of x's minus 1. Finally, take the square root. This is your standard deviation.
Using this, we can normalize both variables. For x, subtract the mean of x and divide by the standard deviation of x. Do this for h(x) as well. You will now have two lists of normalized numbers.
For each normalized number, multiply the value by its pair (the first normalized x value with its h(x) pair, for all values). Add these products together and divide by N. This gives you the correlation. To get the least squares estimate of theta1, calculate this correlation value times the standard deviation of h(x) divided by the standard deviation of x.
Given all this information, calculating the intercept (theta0) is easy, all we'll have to do is take the mean of h(x) and subtract the product (multiply!) of our calculated theta1 and the average of x.
Phew! All taken care of! We have our least squares solution for those two variables. Let me know if you have any questions! One last excellent resource: https://people.duke.edu/~rnau/mathreg.htm

If you are asking about the hypothesis function in linear regression, then those theta values are selected by an algorithm called gradient descent. This helps in finding the theta values to minimize the cost function.

Handling zero rows/columns in covariance matrix during em-algorithm

I tried to implement GMMs but I have a few problems during the em-algorithm.
Let's say I've got 3D Samples (stat1, stat2, stat3) which I use to train the GMMs.
One of my training sets for one of the GMMs has in nearly every sample a "0" for stat1. During training I get really small Numbers (like "1.4456539880060609E-124") in the first row and column of the covariance matrix which leads in the next iteration of the EM-Algorithm to 0.0 in the first row and column.
I get something like this:
0.0 0.0 0.0
0.0 5.0 6.0
0.0 2.0 1.0
I need the inverse covariance matrix to calculate the density but since one column is zero I can't do this.
I thought about falling back to the old covariance matrix (and mean) or to replace every 0 with a really small number.
Or is there a another simple solution to this problem?

Simply your data lies in degenerated subspace of your actual input space, and GMM is not well suited in most generic form for such setting. THe problem is that empirical covariance estimator that you use simply fail for such data (as you said - you cannot inverse it). What you usually do? You chenge covariance estimator to the constrained/regularized ones, which contain:
Constant-based shrinking, thus instead of using Sigma = Cov(X) you do Sigma = Cov(X) + eps * I, where eps is prefedefined small constant, and I is identity matrix. Consequently you never have a zero values on the diagonal, and it is easy to prove that for reasonable epsilon, this will be inversible
Nicely fitted shrinking, like Oracle Covariance Estimator or Ledoit-Wolf Covariance Estimator which find best epsilon based on the data itself.
Constrain your gaussians to for example spherical family, thus N(m, sigma I), where sigma = avg_i( cov( X[:, i] ) is the mean covariance per dimension. This limits you to spherical gaussians, and also solves the above issue
There are many more solutions possible, but all based on the same thing - chenge covariance estimator in such a way, that you have a guarantee of invertability.

normalization in image processing

What is the correct mean of normalization in image processing? I googled it but i had different definition. I'll try to explain in detail each definition.
Normalization of a kernel matrix
If normalization is referred to a matrix (such as a kernel matrix for convolution filter), usually each value of the matrix is divided by the sum of the values of the matrix in order to have the sum of the values of the matrix equal to one (if all values are greater than zero). This is useful because a convolution between an image matrix and our kernel matrix give an output image with values between 0 and the max value of the original image. But if we use a sobel matrix (that have some negative values) this is not true anymore and we have to stretch the output image in order to have all values between 0 and max value.
Normalization of an image
I basically find two definition of normalization. The first one is to "cut" values too high or too low. i.e. if the image matrix has negative values one set them to zero and if the image matrix has values higher than max value one set them to max values. The second one is to linear stretch all the values in order to fit them into the interval [0, max value].

I will extend a bit the answer from #metsburg. There are several ways of normalizing an image (in general, a data vector), which are used at convenience for different cases:
Data normalization or data (re-)scaling: the data is projected in to a predefined range (i.e. usually [0, 1] or [-1, 1]). This is useful when you have data from different formats (or datasets) and you want to normalize all of them so you can apply the same algorithms over them. Is usually performed as follows:
Inew = (I - I.min) * (newmax - newmin)/(I.max - I.min) + newmin
Data standarization is another way of normalizing the data (used a lot in machine learning), where the mean is substracted to the image and dividied by its standard deviation. It is specially useful if you are going to use the image as an input for some machine learning algorithm, as many of them perform better as they assume features to have a gaussian form with mean=0,std=1. It can be performed easyly as:
Inew = (I - I.mean) / I.std
Data stretching or (histogram stretching when you work with images), is refereed as your option 2. Usually the image is clamped to a minimum and maximum values, setting:
Inew = I
Inew[I < a] = a
Inew[I > b] = b
Here, image values that are lower than a are set to a, and the same happens inversely with b. Usually, values of a and b are calculated as percentage thresholds. a= the threshold that separates bottom 1% of the data and b=the thredhold that separates top 1% of the data. By doing this, you are removing outliers (noise) from the image.
This is similar (simpler) to histogram equalization, which is another used preprocessing step.
Data normalization, can also be refereed to a normalization of a vector respect to a norm (l1 norm or l2/euclidean norm). This, in practice, is translated as to:
Inew = I / ||I||
where ||I|| refeers to a norm of I.
If the norm is choosen to be the l1 norm, the image will be divided by the sum of its absolute values, making the sum of the whole image be equal to 1. If the norm is choosen to be l2 (or euclidean), then image is divided by the sum of the square values of I, making the sum of square values of I be equal to 1.
The first 3 are widely used with images (not the 3 of them, as scaling and standarization are incompatible, but 1 of them or scaling + streching or standarization + stretching), the last one is not that useful. It is usually applied as a preprocess for some statistical tools, but not if you plan to work with a single image.

Answer by #Imanol is great, i just want to add some examples:
Normalize the input either pixel wise or dataset wise. Three normalization schemes are often seen:
Normalizing the pixel values between 0 and 1:
img /= 255.0
Normalizing the pixel values between -1 and 1 (as Tensorflow does):
img /= 127.5
img -= 1.0
Normalizing according to the dataset mean & standard deviation (as Torch does):
img /= 255.0
mean = [0.485, 0.456, 0.406] # Here it's ImageNet statistics
std = [0.229, 0.224, 0.225]
for i in range(3): # Considering an ordering NCHW (batch, channel, height, width)
img[i, :, :] -= mean[i]
img[i, :, :] /= std[i]

In data science, there are two broadly used normalization types:
1) Where we try to shift the data so that there sum is a particular value, usually 1 (https://stats.stackexchange.com/questions/62353/what-does-it-mean-to-use-a-normalizing-factor-to-sum-to-unity)
2) Normalize data to fit it within a certain range (usually, 0 to 1): https://stats.stackexchange.com/questions/70801/how-to-normalize-data-to-0-1-range

Geometric representation of Perceptrons (Artificial neural networks)

I am taking this course on Neural networks in Coursera by Geoffrey Hinton (not current).
I have a very basic doubt on weight spaces.
https://d396qusza40orc.cloudfront.net/neuralnets/lecture_slides%2Flec2.pdf
Page 18.
If I have a weight vector (bias is 0) as [w1=1,w2=2] and training case as {1,2,-1} and {2,1,1}
where I guess {1,2} and {2,1} are the input vectors. How can it be represented geometrically?
I am unable to visualize it? Why is training case giving a plane which divides the weight space into 2? Could somebody explain this in a coordinate axes of 3 dimensions?
The following is the text from the ppt:
1.Weight-space has one dimension per weight.
2.A point in the space has particular setting for all the weights.
3.Assuming that we have eliminated the threshold each hyperplane could be represented as a hyperplane through the origin.
My doubt is in the third point above. Kindly help me understand.

It's probably easier to explain if you look deeper into the math. Basically what a single layer of a neural net is performing some function on your input vector transforming it into a different vector space.
You don't want to jump right into thinking of this in 3-dimensions. Start smaller, it's easy to make diagrams in 1-2 dimensions, and nearly impossible to draw anything worthwhile in 3 dimensions (unless you're a brilliant artist), and being able to sketch this stuff out is invaluable.
Let's take the simplest case, where you're taking in an input vector of length 2, you have a weight vector of dimension 2x1, which implies an output vector of length one (effectively a scalar)
In this case it's pretty easy to imagine that you've got something of the form:
input = [x, y]
weight = [a, b]
output = ax + by
If we assume that weight = [1, 3], we can see, and hopefully intuit that the response of our perceptron will be something like this:
With the behavior being largely unchanged for different values of the weight vector.
It's easy to imagine then, that if you're constraining your output to a binary space, there is a plane, maybe 0.5 units above the one shown above that constitutes your "decision boundary".
As you move into higher dimensions this becomes harder and harder to visualize, but if you imagine that that plane shown isn't merely a 2-d plane, but an n-d plane or a hyperplane, you can imagine that this same process happens.
Since actually creating the hyperplane requires either the input or output to be fixed, you can think of giving your perceptron a single training value as creating a "fixed" [x,y] value. This can be used to create a hyperplane. Sadly, this cannot be effectively be visualized as 4-d drawings are not really feasible in browser.
Hope that clears things up, let me know if you have more questions.

I have encountered this question on SO while preparing a large article on linear combinations (it's in Russian, https://habrahabr.ru/post/324736/). It has a section on the weight space and I would like to share some thoughts from it.
Let's take a simple case of linearly separable dataset with two classes, red and green:
The illustration above is in the dataspace X, where samples are represented by points and weight coefficients constitutes a line. It could be conveyed by the following formula:
w^T * x + b = 0
But we can rewrite it vice-versa making x component a vector-coefficient and w a vector-variable:
x^T * w + b = 0
because dot product is symmetrical. Now it could be visualized in the weight space the following way:
where red and green lines are the samples and blue point is the weight.
More possible weights are limited to the area below (shown in magenta):
which could be visualized in dataspace X as:
Hope it clarifies dataspace/weightspace correlation a bit. Feel free to ask questions, will be glad to explain in more detail.

The "decision boundary" for a single layer perceptron is a plane (hyper plane)
where n in the image is the weight vector w, in your case w={w1=1,w2=2}=(1,2) and the direction specifies which side is the right side. n is orthogonal (90 degrees) to the plane)
A plane always splits a space into 2 naturally (extend the plane to infinity in each direction)
you can also try to input different value into the perceptron and try to find where the response is zero (only on the decision boundary).
Recommend you read up on linear algebra to understand it better:
https://www.khanacademy.org/math/linear-algebra/vectors_and_spaces

For a perceptron with 1 input & 1 output layer, there can only be 1 LINEAR hyperplane. And since there is no bias, the hyperplane won't be able to shift in an axis and so it will always share the same origin point. However, if there is a bias, they may not share a same point anymore.

I think the reason why a training case can be represented as a hyperplane because...
Let's say
[j,k] is the weight vector and
[m,n] is the training-input
training-output = jm + kn
Given that a training case in this perspective is fixed and the weights varies, the training-input (m, n) becomes the coefficient and the weights (j, k) become the variables.
Just as in any text book where z = ax + by is a plane,
training-output = jm + kn is also a plane defined by training-output, m, and n.

Equation of a plane passing through origin is written in the form:
ax+by+cz=0
If a=1,b=2,c=3;Equation of the plane can be written as:
x+2y+3z=0
So,in the XYZ plane,Equation: x+2y+3z=0
Now,in the weight space;every dimension will represent a weight.So,if the perceptron has 10 weights,Weight space will be 10 dimensional.
Equation of the perceptron: ax+by+cz<=0 ==> Class 0
ax+by+cz>0 ==> Class 1
In this case;a,b & c are the weights.x,y & z are the input features.
In the weight space;a,b & c are the variables(axis).
So,for every training example;for eg: (x,y,z)=(2,3,4);a hyperplane would be formed in the weight space whose equation would be:
2a+3b+4c=0
passing through the origin.
I hope,now,you understand it.

Consider we have 2 weights. So w = [w1, w2]. Suppose we have input x = [x1, x2] = [1, 2]. If you use the weight to do a prediction, you have z = w1*x1 + w2*x2 and prediction y = z > 0 ? 1 : 0.
Suppose the label for the input x is 1. Thus, we hope y = 1, and thus we want z = w1*x1 + w2*x2 > 0. Consider vector multiplication, z = (w ^ T)x. So we want (w ^ T)x > 0. The geometric interpretation of this expression is that the angle between w and x is less than 90 degree. For example, the green vector is a candidate for w that would give the correct prediction of 1 in this case. Actually, any vector that lies on the same side, with respect to the line of w1 + 2 * w2 = 0, as the green vector would give the correct solution. However, if it lies on the other side as the red vector does, then it would give the wrong answer.
However, suppose the label is 0. Then the case would just be the reverse.
The above case gives the intuition understand and just illustrates the 3 points in the lecture slide. The testing case x determines the plane, and depending on the label, the weight vector must lie on one particular side of the plane to give the correct answer.

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart