when using RobustScaler should you transform y train? [closed] - machine-learning

I am using RobustScaler to fit and transform x_train and x_test. Should I also transform y_train and y_test as well? I am asking because my neural net gives a weird validation loss: sometimes the val loss is small and good, but sometimes it is high and bad. Maybe it is just the initialized weights of the neural net, but I just want to make sure.

No, you shouldn't. You should scale your Xs because otherwise your neural network can start treating some features as more important only because they have larger values.
y is the result. Scaling it is a pointless activity here; neural networks can produce big numbers.
Actually, NNs can process big values when all features have the same "weight". Using scalers is just good practice.
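A minimal sketch of the usual pattern, assuming scikit-learn's RobustScaler and made-up x_train/x_test arrays as placeholders:

from sklearn.preprocessing import RobustScaler
import numpy as np

# Hypothetical feature matrices; y_train / y_test are left untouched.
x_train = np.random.rand(100, 5)
x_test = np.random.rand(20, 5)

scaler = RobustScaler()
x_train_scaled = scaler.fit_transform(x_train)  # fit the scaler on training data only
x_test_scaled = scaler.transform(x_test)        # reuse the same statistics on the test data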

Related

Batch normalization [closed]

Why does batch normalization work on different samples of the same feature instead of different features of the same sample? Shouldn't it be the normalization of different features? In the diagram, why do we use the first row and not the first column?
Could someone help me?
Because different features of the same object mean different things, and it is not logical to calculate statistics over these values. They can have different ranges, means, standard deviations, etc. For example, one of your features could be the age of a person and another the height of the person. If you calculate the mean of these values you will not get a meaningful number.
In classic machine learning (especially in linear models and KNN) you should normalize your features, i.e. calculate the mean and std of a specific feature over the entire dataset and transform your features to (X - mean(X)) / std(X). Batch normalization is the analogue of this applied to stochastic optimization methods like SGD (it is not meaningful to use global statistics on a mini-batch, and furthermore you want to use batch norm more often than just before the first layer). More fundamental ideas can be found in the original paper.
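A toy numpy illustration of the per-feature direction the answer describes (the numbers are made up purely for illustration):

import numpy as np

# Rows are samples in the batch, columns are features (e.g. age, height).
batch = np.array([[25.0, 1.75],
                  [40.0, 1.62],
                  [31.0, 1.80]])

# Batch-norm style statistics are computed per feature, i.e. down each column
# (axis=0), so every feature is scaled by its own mean/std across the batch.
mean = batch.mean(axis=0)
std = batch.std(axis=0)
normalized = (batch - mean) / (std + 1e-8)  # epsilon avoids division by zero
print(normalized)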

In locally weighted regression, how do you determine the distance from a query point with more than one dimension? [closed]

If the query point in a locally weighted regression is multidimensional (with different features), how do we determine whether there are points close to the query point? This is especially tricky if the features have different units.
If x is a vector of the individual differences for each feature, one could use a few different norms to measure the "size" of x (and hence the distance between any two points). The most commonly used norm is the L2 norm. Different normalization schemes could be used, but if you scale each feature so that about 80% of the points fall between -10 and 10 you should be OK.
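A rough sketch of that idea, with made-up data and a simple per-feature scaling before taking the L2 norm:

import numpy as np

# Hypothetical points (rows) and a query point, with features in mixed units.
X = np.array([[1.70, 70000.0],   # e.g. height in metres, income in currency units
              [1.60, 52000.0],
              [1.80, 91000.0]])
query = np.array([1.75, 65000.0])

# Scale each feature so its units no longer dominate the distance,
# here simply by dividing by the per-feature standard deviation.
scale = X.std(axis=0)
diff = (X - query) / scale

# L2 norm of the scaled differences gives the distance to the query point.
distances = np.linalg.norm(diff, axis=1)
print(distances)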

Find the simplest polynomial kernel for which this data becomes linearly separable [closed]

This question covers the (kernel) perceptron and requires you to refer to the following training data for parts (a)-(c). You are only permitted to make use of numpy and matplotlib. You are not permitted to make use of any existing numpy implementations of perceptrons (if they exist).
dataset
Recall that the polynomial kernel is defined as
k(x, y) = (m + x^T y)^d
Each such kernel corresponds to a feature representation of the original data. Find the simplest polynomial kernel for which this data becomes linearly separable (note: simplest in this case is defined as the polynomial kernel with the smallest values of both m and d).
The answer is m = 1, d = 3.
Use logistic regression to solve it; the weights of the final model will give you the values you need.
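Since the original dataset is only available as an image, here is a rough numpy-only sketch (with made-up 2-D points, not the real data) of how one could check separability with the kernel k(x, z) = (m + x.z)^d via a kernel perceptron, which stops making mistakes only if the data is separable under that kernel:

import numpy as np

def poly_kernel(x, z, m=1, d=3):
    # Polynomial kernel k(x, z) = (m + x.z)^d with the values from the answer.
    return (m + np.dot(x, z)) ** d

# Hypothetical data with labels in {-1, +1}; replace with the real dataset.
X = np.array([[0.0, 1.0], [1.0, 0.0], [0.5, 0.5], [-1.0, -1.0]])
y = np.array([1, 1, -1, -1])

n = len(X)
alpha = np.zeros(n)  # dual coefficients: mistake counts per training point

for _ in range(100):  # epochs
    mistakes = 0
    for i in range(n):
        score = sum(alpha[j] * y[j] * poly_kernel(X[j], X[i]) for j in range(n))
        if y[i] * score <= 0:  # misclassified (or undecided) point
            alpha[i] += 1
            mistakes += 1
    if mistakes == 0:  # a full clean pass: the data is separable under this kernel
        break

print(alpha)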

Difference between Cost Function and Activation Function? [closed]

I would like to understand the difference between a cost function and an activation function in machine learning problems.
Can you please help me understand the difference?
A cost function is a measure of error between the value your model predicts and the actual value. For example, say we wish to predict the value yi for a data point xi. Let fθ(xi) represent the prediction or output of some arbitrary model for the point xi with parameters θ. One common cost function is the sum of (yi − fθ(xi))² (this is only an example; it could be the absolute value instead of the square). Training the hypothetical model we stated above would be the process of finding the θ that minimizes this sum.
An activation function transforms the shape/representation of the data as it moves through the model. A simple example could be max(0, x), a function which outputs 0 if the input x is negative and x if the input x is positive. This function is known as the "Rectified Linear Unit" (ReLU) activation function. These transformations are essential for making high-dimensional data linearly separable, which is one of the many uses of a neural network. The choice of these functions depends on your use case, the kind of layer (hidden/output), and the model architecture.
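A small sketch contrasting the two, using mean squared error as the cost function and ReLU as the activation function (the numbers are made up):

import numpy as np

def mse_cost(y_true, y_pred):
    # Cost function: a single number measuring how wrong the predictions are.
    return np.mean((y_true - y_pred) ** 2)

def relu(x):
    # Activation function: transforms a layer's output element-wise.
    return np.maximum(0, x)

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.1, 1.9, 2.5])
print(mse_cost(y_true, y_pred))      # scalar error for the whole prediction
print(relu(np.array([-2.0, 0.5])))   # [0.0, 0.5]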

Does the order of terms in MSE matter in case of differentiation? [closed]

Mean squared error is a popular cost function used in machine learning:
(1/n) * sum((y - pred)**2)
Basically the order of subtraction terms doesn't matter as the whole expression is squared.
But if we differentiate this function, it will no longer be squared:
2 * (y - pred)
Would the order make a difference for a neural network?
In most cases, reversing the order of the terms y and pred would change the sign of the result. Since we use this result to compute the weight updates, would it influence the way the neural network converges?
Well, actually
d/dy_i (y_i - \hat{y}_i)^2 = 2 (y_i - \hat{y}_i)
and
d/dy_i (\hat{y}_i - y_i)^2 = -2 (\hat{y}_i - y_i) = 2 (y_i - \hat{y}_i),
so they're the same.
(I took the derivative w.r.t. y_i assuming those are the network outputs but of course the same holds if you derive by \hat{y}_i.)
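A quick numerical check of this with numpy (toy values, gradient taken with respect to pred here):

import numpy as np

y = np.array([1.0, 2.0, 3.0])       # targets (made up)
pred = np.array([1.5, 1.8, 2.2])    # network outputs (made up)

# Gradient of mean((y - pred)**2) with respect to pred, for both orderings.
grad_a = -2 * (y - pred) / len(y)   # from writing the square as (y - pred)^2
grad_b = 2 * (pred - y) / len(y)    # from writing the square as (pred - y)^2
print(np.allclose(grad_a, grad_b))  # True: the order inside the square doesn't matter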
