What happens when I normalize the dependent variable but not the independent variables in a linear regression? How should I interpret the model, as opposed to normalizing both the dependent and independent variables?
Thank you!
What happens when I normalize the dependent variable but not the independent variables in a linear regression?
Nothing.
How should I interpret the model, as opposed to normalizing both the dependent and independent variables?
If you normalize the independent variables, you will be able to compare and interpret their weights after fitting.
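For instance (a minimal sketch with scikit-learn on synthetic data; not from the original answer), two features that matter equally can get wildly different raw coefficients purely because of their scales, and standardizing makes the weights comparable:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(0, 1, 500),      # small-scale feature
                     rng.normal(0, 1000, 500)])  # large-scale feature
y = 2.0 * X[:, 0] + 0.002 * X[:, 1] + rng.normal(0, 0.1, 500)

raw = LinearRegression().fit(X, y)
print(raw.coef_)    # roughly [2.0, 0.002]: scales make the weights incomparable

X_std = StandardScaler().fit_transform(X)
std = LinearRegression().fit(X_std, y)
print(std.coef_)    # both roughly 2.0: equal importance once scales are removed
```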
I am working on optimizing a manufacturing dataset that consists of a huge number of controllable parameters. The goal is to find the best run settings for these parameters.
I familiarized myself with several predictive algorithms while doing my research. If I, say, use Random Forest to predict my dependent variable to understand how important each independent variable is, is there a way to extract the final equation/relationship the algorithm uses?
I'm not sure if my question was clear enough, please let me know if there's anything else I can add here.
There is no general way to extract an interpretable equation from a random forest that explains how your covariates affect the dependent variable. For that you can use a more suitable model, e.g., linear regression (perhaps with kernel functions) or a decision tree. Note that you can use one model for prediction and another for descriptive analysis; there is no inherent reason to stick with a single model.
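As a sketch of that "second model for description" idea (assuming scikit-learn; the data and parameter names here are synthetic stand-ins), a shallow decision tree can be fit alongside the forest and its rules printed directly:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(200, 3))             # stand-ins for controllable parameters
y = 3 * X[:, 0] + 10 * (X[:, 1] > 5) + rng.normal(0, 1, 200)

# Keep the tree shallow so the printed rules stay readable
tree = DecisionTreeRegressor(max_depth=2).fit(X, y)
print(export_text(tree, feature_names=["param_1", "param_2", "param_3"]))
```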
use Random Forest to predict my dependent variable to understand how important each independent variable is
Understanding how important each independent variable is does not necessarily require what the title of your question asks for, namely the actual relationship. Most random forest packages have a method that quantifies how much each covariate affected the model over the training set.
There are a number of methods for estimating feature importance from a trained model. For Random Forest, the best-known are MDI (Mean Decrease in Impurity) and MDA (Mean Decrease in Accuracy). Many popular ML libraries support feature importance estimation out of the box for Random Forest.
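A sketch of both flavours in scikit-learn, one library that supports them (the data is synthetic): `feature_importances_` gives MDI, while `permutation_importance` gives an MDA-style estimate:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = 5 * X[:, 0] + 2 * X[:, 1] + rng.normal(0, 0.5, 500)   # features 2 and 3 are pure noise

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)

print(rf.feature_importances_)                     # MDI, computed from the training data
perm = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=0)
print(perm.importances_mean)                       # MDA-style, on held-out data
```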
I am performing logistic regression and have a question.
I have categorical (0/1) as well as continuous variables in my data set.
Do I need to scale my continuous variables to between 0 and 1?
I ask because a few of my continuous variables have values up to 10k.
Does it make sense to keep such continuous values alongside the categorical variables when performing logistic regression?
Theoretically it is not necessary. But your resulting model will probably end up with very small coefficients for the inputs with a large range. This can be a problem if you want to run the model at reduced numeric precision (for example, 16-bit).
I am not sure why you are asking whether you should use the continuous values in your model at all. If there is any possibility that they are correlated with the outcome, keep them; ignore them only if you are sure they are uncorrelated.
For simple linear/logistic regression (without regularization): no need to scale variables.
For linear/logistic regression with regularization: you should scale, because the penalty treats all coefficients alike, so features on larger scales (which end up with smaller coefficients) are effectively penalized less.
For linear/logistic regression without regularization, you need to scale features only if you want to interpret or compare the weights after fitting; otherwise features with larger values will tend to get smaller weights, making the weights incomparable across features.
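Returning to the regularized case, one common pattern (a sketch with scikit-learn; the column names are hypothetical placeholders) is to scale only the continuous columns and pass the 0/1 indicators through untouched:

```python
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

continuous = ["pressure", "volume"]      # hypothetical continuous columns
categorical = ["valve_open"]             # already 0/1, left untouched

pre = ColumnTransformer(
    [("scale", StandardScaler(), continuous)],
    remainder="passthrough",             # pass the 0/1 dummies through as-is
)
model = make_pipeline(pre, LogisticRegression(penalty="l2", C=1.0))
# model.fit(df[continuous + categorical], df["outcome"])   # df: your DataFrame
```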
You can scale by variance and by location; there are many options. My advice is to consider scaling if your variables vary a lot both between and within. You can try the following:
Everything below represents a vector, so by $X$ I mean $X = (x_1, \dots, x_n)$; all the quantities written below are either vectors or matrices.
Scaling by range: $X' = X / R$, where $R$ is the range of the variable, i.e. $R = \max(X) - \min(X)$.
Scaling by location (centering) and spread (scaling): $X' = \frac{X - \bar{X}}{s}$, where $\bar{X}$ and $s$ are the sample mean and sample standard deviation of $X$, respectively.
The latter one provides centering as well, so make sure that you select the proper formula for your data. There is no rule of thumb here, but intuition and inference are key. You can also try different combinations of scale and location measures.
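In code, the two transformations above look like this (a NumPy sketch on a toy vector):

```python
import numpy as np

X = np.array([3.0, 7.0, 10.0, 1.0, 4.0])

# Scaling by range: divide by R = max(X) - min(X)
R = X.max() - X.min()
X_range = X / R

# Centering and scaling (z-score): subtract the mean,
# divide by the sample standard deviation
X_z = (X - X.mean()) / X.std(ddof=1)
```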
I know that auto-regression is a regression on lagged values of the variable itself. But we also know that in linear regression we should not use correlated independent variables.
How, then, does auto-regression work, and what is the difference between it and ordinary linear regression?
I have a machine learning problem where the dependent variable is binomial (Yes/No) and some of the independent variables are categorical (with more than 100 levels). I'm not sure whether dummy coding these categorical variables and then passing them to the machine learning model is an optimal solution.
Is there a way to deal with this problem?
Thanks!
You may try creating dummy variables from the categorical variables. Before that, try to combine some of the categories, for example by lumping rare levels together; a sketch follows below.
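A minimal sketch of that lumping step (assuming pandas; `df` and the `"city"` column are hypothetical):

```python
import pandas as pd

def lump_rare(series, min_count=30):
    """Replace levels seen fewer than min_count times with 'other'."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    return series.where(~series.isin(rare), "other")

# df["city"] = lump_rare(df["city"])                       # df: your DataFrame
# dummies = pd.get_dummies(df["city"], prefix="city")      # then dummy-code the lumped column
```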
I saw someone create dummy variables from nominal variables for machine learning classification models, and then use both the original nominal variables and the newly created dummy variables in decision tree, SVM, and NN models.
I don't see the point of this; using the nominal variables together with their derived dummy variables seems redundant.
Am I correct, or is it necessary to use both the original nominal variables and their dummy indicators?
Depends on what kind of model you're training. Simple models (such as linear ones) can be too "dumb" to "see" how the derived features relate to the original ones.
In the linear regression case, introducing a new feature that is the square of another is enough to "trick" the model: it can only "see" linear relationships, so the quadratic feature simply looks like another independent input.
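A quick sketch of that effect (assuming scikit-learn, with synthetic data): the model happily puts all the weight on the derived squared column, never "noticing" that the two columns are related:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 300)
y = 1.5 * x ** 2 + rng.normal(0, 0.2, 300)        # purely quadratic target

X = np.column_stack([x, x ** 2])                  # original plus derived feature
model = LinearRegression().fit(X, y)
print(model.coef_)                                # roughly [0.0, 1.5]: the squared column does the work
```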