Feature scaling a linear regression model and how it affects the output - machine-learning

I have input data that looks like:
col1        col2        col3  col4  col5  col6
-0.1144887  -0.1717161  3847  3350  2823  2243
 0.3534122   0.53008300 4230  3520  2421  3771
...
So columns 1 and 2 range from -1 to 1, and columns 3-6 range from 2000 to 5000.
The output data ranges from 5.0 to 10.0. I expect to predict a single real-valued output for each input vector and am using a linear regression dense neural network with an 'mse' loss function.
I'm thinking I should scale columns 3-6 to between 0 and 1 and leave columns 1 and 2 as is. Is that correct or should I also scale columns 1 and 2 to be between 0 and 1? If I scale the input, does that affect my predicted output value or does it only speed up the learning? Is there any need to scale the output?

You should scale all the features to the same range. The standard way is to center each feature on its mean value and scale it by its standard deviation:
1) Compute the mean and the standard deviation of each feature using the training set (e.g. col1_av = average(col1_train), col2_av = average(col2_train), ...).
2) From each feature, subtract the corresponding mean and divide by the corresponding standard deviation (e.g. [x1 = -0.1144887, x2 = 0.3534122, ...] -> (x1 - col1_av)/col1_std), as sketched in the code below. The samples in the test set must be scaled using the values estimated on the training set.
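A minimal sketch of this procedure in Python/NumPy (the array names and the test sample are hypothetical, just to illustrate the idea):

import numpy as np

# Training samples from the question, one column per feature
X_train = np.array([[-0.1144887, -0.1717161, 3847, 3350, 2823, 2243],
                    [ 0.3534122,  0.5300830, 4230, 3520, 2421, 3771]])
# A hypothetical test sample with the same 6 columns
X_test = np.array([[0.1, -0.3, 4000, 3400, 2600, 3000]])

# Statistics are estimated on the training set only
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)

# Both sets are scaled with the training-set statistics
X_train_scaled = (X_train - mu) / sigma
X_test_scaled = (X_test - mu) / sigma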
Having features so different in magnitude will affect not only the learning process but also the output, since features with larger magnitude will carry more weight in the model.
In general there is no need to scale the output.
An interesting read: https://medium.com/greyatom/why-how-and-when-to-scale-your-features-4b30ab09db5e

Related

Perceptron training rule, why multiply by x

I was reading Tom Mitchell's machine learning book and he mentions the formula for the perceptron training rule:

w_i <- w_i + Δw_i, where Δw_i = η (t - o) x_i

where
η: training rate
t: expected output
o: actual output
x_i: ith input

This implies that if x_i is very large then so is Δw_i, but I don't understand the purpose of a large update when x_i is large.
On the contrary, I feel like if there is a large x_i then the update should be small, since a small fluctuation in w_i will result in a big change in the final output (due to the large x_i).
The adjustments are vector additions and subtractions, which can be thought of as rotating a hyperplane so that class 0 falls on one side and class 1 falls on the other side.
Consider a 1xd weight vector w holding the weights of the perceptron model. Also, consider a 1xd datapoint x. Then the predicted value of the perceptron model, considering a linear threshold without loss of generality, will be

o = 1 if w · x > 0, and o = 0 otherwise -- Eq. 1

Here '·' is the dot product: w · x = sum_i w_i x_i.
The hyperplane of the above equation is w · x = 0.
(Ignoring the iteration indices for the weight updates for simplicity.)
Let us consider we have two classes 0 and 1. Again without loss of generality, datapoints labelled 0 fall on the side of the hyperplane where w · x <= 0, and datapoints labelled 1 fall on the other side where w · x > 0.
The vector normal to this hyperplane is w. The angle between w and the datapoints with label 0 should be more than 90 degrees, and the angle between w and the datapoints with label 1 should be less than 90 degrees.
There are three possibilities for t - o (ignoring the training rate):

t - o = 0: this example is classified correctly by the present set of weights, so no change is needed for this datapoint.

t - o = 1: the target was 1, but the present set of weights classified it as 0. Eq. 1, which was supposed to be > 0, is in this case <= 0, indicating that the angle between w and x is greater than 90 degrees when it should have been smaller. The update rule is w <- w + x. If you imagine a vector addition in 2d, this rotates the hyperplane so that the angle between w and x becomes smaller than before and eventually less than 90 degrees.

t - o = -1: the target was 0, but the present set of weights classified it as 1. Eq. 1, which was supposed to be <= 0, is in this case > 0, indicating that the angle between w and x is less than 90 degrees when it should have been greater. The update rule is w <- w - x. Similarly, this rotates the hyperplane so that the angle between w and x becomes greater than 90 degrees.
This is iterated over and over, and the hyperplane is rotated and adjusted so that its normal makes an angle of less than 90 degrees with the datapoints labelled 1 and of more than 90 degrees with the datapoints labelled 0.
If the magnitude of x is huge there will be big changes, which can disturb the process and may take more iterations to converge, depending on the magnitude of the initial weights. Therefore it is a good idea to normalise or standardise the datapoints. From this perspective it is easy to visualise what exactly the update rules are doing (consider the bias as a part of the hyperplane in Eq. 1). Now extend this to more complicated networks and/or networks with thresholds.
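A minimal sketch of this training rule in Python/NumPy; the toy dataset, learning rate and iteration count are made up for illustration and are not from the book:

import numpy as np

# Toy linearly separable data: two 2d points per class (hypothetical values)
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -2.0], [-2.0, -1.5]])
t = np.array([1, 1, 0, 0])  # target labels

w = np.zeros(2)  # weight vector (bias omitted for simplicity)
eta = 1.0        # training rate

for _ in range(10):  # sweep over the data a few times
    for x_i, t_i in zip(X, t):
        o_i = 1 if np.dot(w, x_i) > 0 else 0  # Eq. 1: linear threshold
        w += eta * (t_i - o_i) * x_i          # Δw = η (t - o) x

Each update adds or subtracts the misclassified datapoint, rotating the separating hyperplane exactly as described above.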
Recommended reading and reference: Neural Network, A Systematic Introduction by Raul Rojas: Chapter 4

Architecture MNIST, fully connected layer 1, output size

I don't understand part of this (quora: How does the last layer of a ConvNet connect to the first fully connected layer):
Make a one-hot representation of feature maps. So we would have 64 * 7 * 7 = 3136 input features, which is again processed by 3136 neurons, reducing it to 1024 features. The matrix multiplication in this layer would be (1x3136) * (3136x1024) => 1x1024
I mean, what is the process to reduce 3136 inputs using 3136 neurons to 1024 features?
I'll explain it in layman's terms, as I understand it.
A one-hot representation of feature maps is a way for categorical values to be represented by a matrix using 1s and 0s. This is a way for machines to read/process the data (in your example, an image or a picture). It then makes computations using matrix algebra.
Now, the computation is the multiplication of a matrix with 1 row and 3136 columns of binary values (1 or 0) by another matrix with 3136 rows and 1024 columns. When you multiply these two matrices, the resulting matrix has 1 row and 1024 columns. This matrix now represents your image or picture.
Hope I got your question right.
You need to understand matrix multiplication. (1x3136) * (3136x1024) is an example of matrix multiplication where the first multiplier's (1x3136) column count must equal the second multiplier's (3136x1024) row count. The result is (1x1024) because the first multiplier's row count becomes the result's row count, while the second multiplier's column count becomes the result's column count.
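A quick NumPy sketch of the shapes involved (random values, just to illustrate the dimensions):

import numpy as np

x = np.random.rand(1, 3136)     # flattened feature maps: 64 * 7 * 7 = 3136
W = np.random.rand(3136, 1024)  # weights of the fully connected layer

y = x @ W       # matrix multiplication
print(y.shape)  # (1, 1024)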
Also, check this:
https://www.khanacademy.org/math/precalculus/precalc-matrices/multiplying-matrices-by-matrices/v/multiplying-a-matrix-by-a-matrix

Why represent neural network quality as 1 minus the ratio of the mean absolute error in prediction to the range of the predicted values?

The documentation for IBM's SPSS Modeler defines neural network quality as:
For a continuous target, this is 1 minus the ratio of the mean absolute error in prediction (the average of the absolute values of the predicted values minus the observed values) to the range of predicted values (the maximum predicted value minus the minimum predicted value).
Is this calculation standard?
I'm having trouble understanding how quality is derived from this.
The main point here is to make the network quality measure independent of the range of the output values. The proposed measure is 1 - relative_error. This means that for a perfect network you will get the maximum quality of 1, and the quality only drops below 0 if the mean absolute error exceeds the range of the predicted values.
Example:
If you want to predict values in the range 0 to 1, an absolute error of 0.2 would mean 20%. When predicting values in the range 0 to 100, you could have a much larger absolute error of 20 for the same accuracy of 20%.
When using the formula you describe, you get these relative errors:
1 - 0.2 / (1 - 0) = 0.8
1 - 20 / (100 - 0) = 0.8
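As a small sketch, this quality measure could be computed in Python as follows (the function name is my own, not from SPSS):

import numpy as np

def network_quality(y_pred, y_true):
    # 1 minus the ratio of the mean absolute error to the range of predicted values
    mae = np.mean(np.abs(y_pred - y_true))
    return 1 - mae / (y_pred.max() - y_pred.min())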

normalization in image processing

What is the correct meaning of normalization in image processing? I googled it but found different definitions. I'll try to explain each definition in detail.
Normalization of a kernel matrix
If normalization refers to a matrix (such as a kernel matrix for a convolution filter), usually each value of the matrix is divided by the sum of the values of the matrix, so that the sum of the values of the matrix equals one (if all values are greater than zero). This is useful because a convolution between an image matrix and such a kernel matrix gives an output image with values between 0 and the max value of the original image. But if we use a Sobel matrix (which has some negative values) this is no longer true, and we have to stretch the output image so that all values lie between 0 and the max value.
Normalization of an image
I basically found two definitions of normalization. The first one is to "cut" values that are too high or too low: i.e., if the image matrix has negative values, set them to zero, and if the image matrix has values higher than the max value, set them to the max value. The second one is to linearly stretch all the values so that they fit into the interval [0, max value].
I will extend the answer from #metsburg a bit. There are several ways of normalizing an image (in general, a data vector), which are used at convenience in different cases:
Data normalization or data (re-)scaling: the data is projected into a predefined range (usually [0, 1] or [-1, 1]). This is useful when you have data from different formats (or datasets) and you want to normalize all of them so you can apply the same algorithms over them. It is usually performed as follows:
Inew = (I - I.min) * (newmax - newmin)/(I.max - I.min) + newmin
Data standardization is another way of normalizing the data (used a lot in machine learning), where the mean is subtracted from the image and the result is divided by its standard deviation. It is especially useful if you are going to use the image as an input for some machine learning algorithm, as many of them perform better, since they assume features to have a Gaussian form with mean = 0, std = 1. It can be performed easily as:
Inew = (I - I.mean) / I.std
Data stretching (or histogram stretching when you work with images) is what your option 2 refers to. Usually the image is clamped to minimum and maximum values, setting:
Inew = I
Inew[I < a] = a
Inew[I > b] = b
Here, image values that are lower than a are set to a, and the same happens inversely with b. Usually, the values of a and b are calculated as percentile thresholds: a = the threshold that separates the bottom 1% of the data, and b = the threshold that separates the top 1% of the data. By doing this, you are removing outliers (noise) from the image. A short sketch of this step is given below.
This is similar (but simpler) to histogram equalization, which is another common preprocessing step.
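A short NumPy sketch of this clamping step, using the 1%/99% percentile thresholds described above:

import numpy as np

def stretch(I, low=1, high=99):
    # a and b are the percentile thresholds cutting off the bottom and top 1%
    a, b = np.percentile(I, (low, high))
    return np.clip(I, a, b)  # values below a become a, values above b become b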
Data normalization can also refer to the normalization of a vector with respect to a norm (the l1 norm or the l2/Euclidean norm). In practice, this translates to:
Inew = I / ||I||
where ||I|| refers to a norm of I.
If the norm is chosen to be the l1 norm, the image is divided by the sum of its absolute values, making the sum of the whole image equal to 1. If the norm is chosen to be the l2 (or Euclidean) norm, the image is divided by the square root of the sum of the squared values of I, making the sum of the squared values of the result equal to 1.
The first 3 are widely used with images (not all three together, as rescaling and standardization are incompatible, but one of them alone, or rescaling + stretching, or standardization + stretching); the last one is not that useful. It is usually applied as a preprocessing step for some statistical tools, but not if you plan to work with a single image.
The answer by #Imanol is great, I just want to add some examples:
Normalize the input either pixel-wise or dataset-wise. Three normalization schemes are often seen:
Normalizing the pixel values between 0 and 1:
img /= 255.0
Normalizing the pixel values between -1 and 1 (as TensorFlow does):
img /= 127.5
img -= 1.0
Normalizing according to the dataset mean & standard deviation (as Torch does):
img /= 255.0
mean = [0.485, 0.456, 0.406] # Here it's ImageNet statistics
std = [0.229, 0.224, 0.225]
for i in range(3):  # assuming a CHW ordering (channel, height, width)
    img[i, :, :] -= mean[i]
    img[i, :, :] /= std[i]
In data science, there are two broadly used types of normalization:
1) Scaling the data so that its sum is a particular value, usually 1 (https://stats.stackexchange.com/questions/62353/what-does-it-mean-to-use-a-normalizing-factor-to-sum-to-unity)
2) Scaling the data to fit within a certain range (usually 0 to 1): https://stats.stackexchange.com/questions/70801/how-to-normalize-data-to-0-1-range

K means clustering for multidimensional data

If the dataset has 440 objects and 8 attributes (the dataset is taken from the UCI machine learning repository), how do we calculate the centroids for such a dataset? (Wholesale customers data)
https://archive.ics.uci.edu/ml/datasets/Wholesale+customers
If I calculate the mean of the values of each row, will that be the centroid?
And how do I plot the resulting clusters in Matlab?
OK, first of all: in the dataset, 1 row corresponds to a single example, and you have 440 rows, which means the dataset consists of 440 examples. Each column contains the values for that specific feature (or attribute, as you call it); e.g. column 1 in your dataset contains the values for the feature Channel, column 2 the values for the feature Region, and so on.
K-Means
Now for K-Means Clustering, you need to specify the number of clusters (the K in K-Means). Say you want K=3 clusters, then the simplest way to initialise K-Means is to randomly choose 3 examples from your dataset (that is 3 rows, randomly drawn from the 440 rows you have) as your centroids. Now these 3 examples are your centroids.
You can think of your centroids as 3 bins, and you want to put every example from the dataset into the closest bin (closeness is usually measured by the Euclidean distance; check the function norm in Matlab).
After the first round of putting all examples into the closest bin, you recalculate the centroids by calculating the mean of all examples in their respective bins. You repeat the process of putting all the examples into the closest bin until no example in your dataset moves to another bin.
Some Matlab starting points
You load the data by X = load('path/to/the/dataset', '-ascii');
In your case X will be a 440x8 matrix.
You can calculate the Euclidean distance from an example to a centroid by
distance = norm(example - centroid1);
where both example and centroid1 have dimensionality 1x8.
Recalculating the centroids would work as follows, suppose you have done 1 iteration of K-Means and have put all examples into their respective closest bin. Say Bin1 now contains all examples that are closest to centroid1 and therefore Bin1 has dimensionality 127x8, which means that 127 examples out of 440 are in this bin. To calculate the centroid position for the next iteration you can then do centroid1 = mean(Bin1);. You would do similar things to your other bins.
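For completeness, here is a compact sketch of the whole loop described above in Python/NumPy; the same logic ports directly to the Matlab fragments, and K=3, the iteration cap and the convergence test are illustrative choices (empty bins are not handled in this sketch):

import numpy as np

def kmeans(X, K=3, max_iter=100):
    # Initialise: pick K random rows of X as the starting centroids
    rng = np.random.default_rng(0)
    centroids = X[rng.choice(len(X), K, replace=False)]
    for _ in range(max_iter):
        # Put every example into the bin of the closest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of the examples in its bin
        new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        if np.allclose(new_centroids, centroids):  # no centroid moved: done
            break
        centroids = new_centroids
    return centroids, labels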
As for plotting, note that your dataset contains 8 features, which means 8 dimensions, so it cannot be visualised directly. I'd suggest you create or look for a (dummy) dataset which only consists of 2 features and which is therefore visualisable using Matlab's plot() function.
