Most of the LSTM tutorials I have seen use the dependent variable as one of the independent variables of the LSTM model as well.
For instance, I want to predict pollution. Variables such as dew, temp, press, wnd_dir, wnd_spd, snow, and rain are independent variables for pollution. But in this tutorial, the "pollution" variable itself is used as one of the independent variables for the LSTM.
How can we adjust the code so that we predict using only the other independent variables, without the lagged "pollution" variable?
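A hedged sketch of one way to do it, assuming the data is already loaded into a pandas DataFrame `df` with the columns named above (the window length, model size, and training settings here are placeholders, not the tutorial's exact values): build the input windows from the feature columns only, and keep pollution solely as the target.

```python
import numpy as np
import pandas as pd
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, LSTM, Dense

# df is assumed to hold the hourly data; wnd_dir is categorical in the
# original dataset and would need to be encoded numerically first
feature_cols = ["dew", "temp", "press", "wnd_dir", "wnd_spd", "snow", "rain"]
target_col = "pollution"

def make_windows(df, n_steps=24):
    """Build (samples, n_steps, n_features) inputs from the feature
    columns only; the target is pollution at the step after each window."""
    X_vals = df[feature_cols].to_numpy(dtype="float32")  # no pollution here
    y_vals = df[target_col].to_numpy(dtype="float32")
    X, y = [], []
    for i in range(len(df) - n_steps):
        X.append(X_vals[i : i + n_steps])
        y.append(y_vals[i + n_steps])
    return np.array(X), np.array(y)

X, y = make_windows(df)

model = Sequential([
    Input(shape=(X.shape[1], X.shape[2])),
    LSTM(50),
    Dense(1),
])
model.compile(optimizer="adam", loss="mae")
model.fit(X, y, epochs=10, batch_size=72, validation_split=0.2)
```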
I am working on optimizing a manufacturing-based dataset that consists of a huge number of controllable parameters. The goal is to find the best run settings for these parameters.
I familiarized myself with several predictive algorithms while doing my research. If I, say, use Random Forest to predict my dependent variable to understand how important each independent variable is, is there a way to extract the final equation/relationship the algorithm uses?
I'm not sure if my question was clear enough, please let me know if there's anything else I can add here.
There is no general way to get an interpretable equation from a random forest explaining how your covariates affect the dependent variable. For that you can use a different, more suitable model, e.g., linear regression (perhaps with kernel functions) or a decision tree. Note that you can use one model for prediction and another for descriptive analysis - there's no inherent reason to stick with a single model.
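If you go the descriptive-model route, here is a minimal scikit-learn sketch (on placeholder data; `X` and `y` stand in for your own parameters and dependent variable) showing how a shallow decision tree can be dumped as human-readable rules:

```python
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor, export_text

# placeholder data; substitute your own X, y
X, y = make_regression(n_samples=500, n_features=4, noise=0.1, random_state=0)

# a shallow tree stays interpretable; deeper trees quickly stop being so
tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)
print(export_text(tree, feature_names=[f"x{i}" for i in range(4)]))
```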
use Random Forest to predict my dependent variable to understand how important each independent variable is
Understanding how important each independent variable is does not necessarily require what the title of your question asks for, namely the actual relationship. Most random forest packages have a method for quantifying how much each covariate affected the model over the training set.
There are a number of methods to estimate feature importance from a trained model. For Random Forest, the best-known are MDI (Mean Decrease in Impurity) and MDA (Mean Decrease in Accuracy). Many popular ML libraries support feature importance estimation out of the box for Random Forest.
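As a sketch of what that looks like in scikit-learn (placeholder data again; `feature_importances_` is the MDI measure, and `permutation_importance` is the library's MDA-style measure):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# placeholder data; substitute your own X, y
X, y = make_regression(n_samples=500, n_features=4, noise=0.1, random_state=0)

rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

print(rf.feature_importances_)  # MDI: impurity-based importances

# MDA-style: how much shuffling each feature degrades the score
result = permutation_importance(rf, X, y, n_repeats=10, random_state=0)
print(result.importances_mean)
```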
What's your approach to solving a machine learning problem with multiple data sets that have different parameters, columns, and lengths/widths? Only one of them has a dependent variable; the rest of the files contain supporting data.
Your query is quite generic, and the concern about column counts and dataset lengths is not by itself a problem when building an ML model. Given that only one of the datasets has the dependent variable, you will need to merge the datasets on keys that are common across them. The process typically followed before modelling is:
Step 0: Identify the dependent variable and decide whether to do regression or classification (assuming you are predicting its value).
Step 1: Clean up the provided data by handling duplicates and spelling mistakes.
Step 2: Scan through the categorical variables to handle any discrepancies.
Step 3: Merge the datasets into a single dataset that has all the independent variables and the dependent variable to be predicted (a sketch follows this list).
Step 4: Do exploratory data analysis to understand the dependent variable's behavior with the other independent variables.
Step 5: Create the model and refine it based on VIF (Variance Inflation Factor) and p-values.
Step 6: Iterate, reducing the variables until you get a model in which all remaining variables are significant and the R^2 value is stable. Finalize that model.
Step 7: Apply the trained model to the test dataset and compare the predicted values against the actual values of the dependent variable.
Following these steps at a high level will help you build models.
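For the merge in Step 3, a minimal pandas sketch (the file names and the key column `id` are hypothetical placeholders for your own):

```python
import pandas as pd

# hypothetical files; only main.csv has the dependent variable
main = pd.read_csv("main.csv")          # contains 'id' and the target
support1 = pd.read_csv("support1.csv")  # contains 'id' plus extra features
support2 = pd.read_csv("support2.csv")

# left-join the supporting data onto the dataset with the target,
# keyed on the column common to all files
merged = main.merge(support1, on="id", how="left") \
             .merge(support2, on="id", how="left")
```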
What happens when I normalize the dependent variable but not the independent variables in a linear regression? How would I interpret the model, as opposed to normalizing both the dependent and independent variables?
Thank you!
What happens when I normalize the dependent variable but not the independent variables in a linear regression?
Nothing substantive: the fitted coefficients and predictions are simply rescaled by the same factor, so the quality of the fit and its interpretation are unchanged.
How would I interpret the model, as opposed to normalizing both the dependent and independent variables?
If you normalize the independent variables, you will be able to compare and interpret their weights after fitting, since they are then all on the same scale.
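A small illustrative sketch with scikit-learn on synthetic data (the names and numbers are made up for the example):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# two predictors on very different scales but with equal "true" influence
x1 = rng.normal(0, 1, 500)
x2 = rng.normal(0, 100, 500)
y = 2 * x1 + 0.02 * x2 + rng.normal(0, 0.1, 500)

X = np.column_stack([x1, x2])

raw = LinearRegression().fit(X, y)
print(raw.coef_)  # ~[2.0, 0.02] -- not comparable across scales

X_std = StandardScaler().fit_transform(X)
std = LinearRegression().fit(X_std, y)
print(std.coef_)  # roughly equal -- comparable after standardizing
```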
I have a machine learning problem where the dependent variable is binomial (Yes/No) and some of the independent variables are categorical (with more than 100 levels). I'm not sure whether dummy coding these categorical variables and then passing them to the machine learning model is an optimal solution.
Is there a way to deal with this problem?
Thanks!
You may try creating dummy variables from the categorical variables. Before that, try combining some of the levels, for example by grouping rare levels into an "Other" category, so the number of dummy columns stays manageable.
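A pandas sketch of that idea (the column name `category` and the frequency threshold are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"category": ["a", "a", "b", "c", "d", "d", "e"]})

# lump levels seen fewer than 2 times into a single "Other" level
counts = df["category"].value_counts()
rare = counts[counts < 2].index
df["category"] = df["category"].where(~df["category"].isin(rare), "Other")

# then one-hot encode the reduced set of levels
dummies = pd.get_dummies(df["category"], prefix="category")
print(dummies.head())
```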
I saw someone create dummy variables from nominal variables for classification models, and then use both the original nominal variables and the newly created dummy variables in decision tree, SVM, and NN models.
I don't see the point of it. I feel the use of nominal variables with their derived dummy variables is redundant.
Am I correct or is it necessary to use both the original nominal variable and their dummy indicators?
It depends on what kind of model you're training. Simple models (such as linear ones) can be too "dumb" to "see" how the derived features relate to the original ones.
In the linear regression case, introducing a new feature that is the square of another is enough to "trick" the model; it can only "see" linear relationships, so the quadratic one looks independent.
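A minimal sketch of that point on synthetic data: a linear model cannot recover y = x² from x alone, but does fine once the derived squared feature is supplied.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 500)
y = x**2 + rng.normal(0, 0.1, 500)

# original feature only: a line cannot capture the parabola
lin = LinearRegression().fit(x.reshape(-1, 1), y)
print(lin.score(x.reshape(-1, 1), y))  # R^2 near 0

# original feature plus the derived square: near-perfect fit
X2 = np.column_stack([x, x**2])
quad = LinearRegression().fit(X2, y)
print(quad.score(X2, y))  # R^2 near 1
```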