How to insert covariates in logistic regression in SPSS?

I am running a binary logistic regression in SPSS and have the following setup:
One dichotomous DV
Two dichotomous IVs
Two covariates that were measured on 7 point Likert scales
When inserting the covariates into the regression model, SPSS asks you to define the reference category for categorical variables. Must I also do this for the two covariates?

No, you don't need to define a reference category for the IVs (because they are dichotomous) or for the covariates (because they are ordinal). I don't remember the SPSS dialogue box offhand, but you can simply add the covariates as continuous predictors (like any other IV).

Bernoulli and Categorical Naive Bayes in scikit-learn

Is sklearn.naive_bayes.CategoricalNB the same as sklearn.naive_bayes.BernoulliNB, but with one hot encoding in the columns?
I couldn't quite tell from the documentation, and CategoricalNB has that one extra parameter, alpha, whose purpose I don't understand.
The categorical distribution is the Bernoulli distribution, generalized to more than two categories. Stated another way, the Bernoulli distribution is a special case of the categorical distribution, with exactly 2 categories.
In the Bernoulli model, each feature is assumed to have exactly 2 categories, often denoted as 1 and 0 or True and False. In the categorical model, each feature is assumed to have at least 2 categories, and each feature may have a different total number of categories.
One-hot encoding is unrelated to either model. It is a technique for encoding a categorical variable in a numerical matrix. It has no bearing on the actual distribution used to model that categorical variable, although it is natural to model categorical variables using the categorical distribution.
The "alpha" parameter is called the Laplace smoothing parameter. I will not go into detail about it here, because that is better suited for CrossValidated, e.g. https://stats.stackexchange.com/q/192233/36229. From a computational perspective, it exists in order to prevent "poisoning" the calculations with 0s, which propagate multiplicatively throughout the model. This is a practical concern that arises whenever some combination of class label and feature category is not present in your data set. It's fine to leave it at the default value of 1.
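A quick sketch of the equivalence described above: on strictly binary (0/1) features with the same alpha, BernoulliNB and CategoricalNB estimate the same per-feature conditional probabilities, so they should agree. The data here is randomly generated for illustration.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, CategoricalNB

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 4))   # binary features only
y = rng.integers(0, 2, size=200)

# With binary features, categorical smoothing (alpha added to each of the
# 2 categories) matches Bernoulli smoothing (alpha added to both outcomes).
bnb = BernoulliNB(alpha=1.0).fit(X, y)
cnb = CategoricalNB(alpha=1.0).fit(X, y)

print((bnb.predict(X) == cnb.predict(X)).all())
```

The equivalence breaks as soon as a feature has more than two categories, which is exactly the case CategoricalNB generalizes to.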

How to deal with multiple categorical variables each with different cardinality?

I'm working with an auto dataset I found on Kaggle. Besides numerical values like horsepower, car length, car weight etc., it has multiple categorical variables such as:
car type (sedan, SUV, hatchback etc.): cardinality=5
car brand (Toyota, Nissan, BMW etc.): cardinality=21
doors (2-door and 4-door): cardinality=2
fuel type (gas and diesel): cardinality=2
I would like to use a random forest classifier to perform feature selection with all these variables as input. I’m aware that the categorical variables need to be encoded before doing so. What is the best approach to handling data with such varying cardinalities?
Can I apply different encoding techniques to different variables? Say for example, one hot encoding on fuel type and label encoding on car type?
You can apply different encoding techniques to different variables. However, label encoding imposes a hierarchy/order, which doesn't look appropriate for any of the predictors you mention. For this example it looks like one-hot encoding all the categorical predictors would be better, unless there are some ordinal variables you haven't mentioned.
EDIT: in response to your comment. I would only ever use label encoding if a categorical predictor was ordinal. If it isn't, I would not try to enforce an order, and would use one-hot encoding if the model type couldn't cope with categorical predictors. Whether this causes an issue with sparse trees and too many predictors depends entirely on your dataset. If you still have many rows relative to predictors, it generally isn't a problem. You can run into issues with random forests if you have many predictors that aren't correlated with the target variable at all: since predictors are chosen randomly at each split, you can end up with lots of trees that don't contain any relevant predictors, creating noise. In that case you could try removing non-relevant predictors before running the random forest model, or try a different type of model, e.g. penalized regression.
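One way to wire this up is a ColumnTransformer that one-hot encodes the categorical columns and passes numeric columns through unchanged. The toy frame below is hypothetical, just mirroring the columns from the question.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical miniature version of the auto dataset
df = pd.DataFrame({
    "car_type":   ["sedan", "suv", "hatchback", "sedan", "suv", "sedan"],
    "fuel_type":  ["gas", "diesel", "gas", "gas", "diesel", "gas"],
    "horsepower": [120, 180, 95, 110, 200, 130],
    "target":     [0, 1, 0, 0, 1, 1],
})

categorical = ["car_type", "fuel_type"]
pre = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), categorical)],
    remainder="passthrough",   # numeric columns pass through unchanged
)
model = Pipeline([("pre", pre),
                  ("rf", RandomForestClassifier(random_state=0))])
model.fit(df.drop(columns="target"), df["target"])
print(model.predict(df.drop(columns="target")))
```

Mixing encoders is just as easy: add a second transformer (e.g. an OrdinalEncoder for genuinely ordinal columns) to the ColumnTransformer list.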

How to get the final equation that the Random Forest algorithm uses on your independent variables to predict your dependent variable?

I am working on optimizing a manufacturing based dataset which consists of a huge number of controllable parameters. The goal is to attain the best run settings of these parameters.
I familiarized myself with several predictive algorithms while doing my research and if I say, use Random Forest to predict my dependent variable to understand how important each independent variable is, is there a way to extract the final equation/relationship the algorithm uses?
I'm not sure if my question was clear enough, please let me know if there's anything else I can add here.
There is no general way to get an interpretable equation from a random forest, explaining how your covariates affect the dependent variable. For that you can use a different model more suitable, e.g., linear regression (perhaps with kernel functions), or a decision tree. Note that you can use one model for prediction, and one model for descriptive analysis - there's no inherent reason to stick with a single model.
use Random Forest to predict my dependent variable to understand how important each independent variable is
Understanding how important each independent variable is does not necessarily require what the title of your question asks for, namely extracting the actual relationship. Most random forest packages have a method that quantifies how much each covariate affected the model over the training set.
There are a number of methods to estimate feature importance from a trained model. For random forests, the best-known are MDI (Mean Decrease in Impurity) and MDA (Mean Decrease in Accuracy). Many popular ML libraries support feature importance estimation out of the box for random forests.

Random forest in sklearn

I was trying to fit a random forest model using the random forest classifier from sklearn. However, my data set contains columns with string values ('country'), and sklearn's random forest classifier does not accept string values; it needs numerical values for all features. I thought of creating dummy variables in place of such columns, but I am confused about what the feature importance plot will now look like. There will be variables like country_India, country_usa, etc. How can I get the consolidated importance of the country variable, as I would if I had done my analysis in R?
You will have to do it by hand. There is no support in sklearn for mapping classifier-specific methods back through an inverse transform of the feature mappings. R calculates importances based on multi-valued splits (as @Soren explained); with scikit-learn you are limited to binary splits and have to approximate the actual importance. One of the simplest solutions (although biased) is to record which features are binary encodings of your categorical variable and sum the corresponding elements of the feature importance vector. This is not fully justified from a mathematical perspective, but it is the simplest way to get a rough estimate. To do it correctly you would have to reimplement feature importance from scratch: when counting "for how many samples the feature is active during classification", you would use your mapping to attribute each sample only once to the actual feature (summing dummy importances counts every dummy variable on the classification path, whereas you want min(1, #dummies on path) instead).
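The "sum the dummy importances" approximation can be sketched as below. The data and column names are hypothetical; the only assumption is that get_dummies prefixes each dummy column with the source variable's name, which makes the grouping easy.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

df = pd.DataFrame({
    "country": ["india", "usa", "usa", "india", "uk", "uk", "usa", "india"],
    "age":     [23, 45, 31, 52, 37, 29, 41, 36],
})
y = np.array([0, 1, 1, 0, 1, 0, 1, 0])

# country -> country_india, country_uk, country_usa
X = pd.get_dummies(df, columns=["country"])
rf = RandomForestClassifier(random_state=0).fit(X, y)

importances = pd.Series(rf.feature_importances_, index=X.columns)
# Consolidate: sum the importances of all dummies derived from "country"
country_importance = importances[importances.index.str.startswith("country_")].sum()
print("country:", country_importance)
print("age:", importances["age"])
```

As noted above, this overstates the importance of high-cardinality variables, since every dummy on a classification path contributes separately.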
A random enumeration (assigning some integer to each category) of the countries will sometimes work quite well, especially if there are few categories and the training set is large. It can sometimes beat one-hot encoding.
Some threads discussing the two options with sklearn:
https://github.com/scikit-learn/scikit-learn/issues/5442
How to use dummy variable to represent categorical data in python scikit-learn random forest
You can also choose an RF implementation that truly supports categorical data, such as Arborist (Python and R front ends), extraTrees (R, Java, RF-ish) or randomForest (R). Why sklearn chose not to support categorical splits, I don't know; perhaps convenience of implementation.
The number of possible categorical splits blows up beyond about 10 categories: the search becomes slow and the splits may become greedy. Arborist and extraTrees only try a limited selection of splits in each node.

SPSS and ordinary least squares

I am doing regression using SPSS/PASW, but it doesn't seem to support ordinary least squares; it only offers partial least squares and two-stage least squares. Any suggestions about what to do?
This link mentions SPSS weighted least squares. I think if you make all the weights equal to 1.0 you've got what you're calling "ordinary" least squares.
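The claim that weighted least squares with all weights equal to 1 reduces to ordinary least squares is easy to verify. A sketch in numpy (rather than SPSS) on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=50)])  # intercept + slope
y = 2.0 + 3.0 * X[:, 1] + rng.normal(size=50)

# OLS: solve the normal equations X'X b = X'y
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

# WLS with W = I: X'WX b = X'Wy collapses to the same system
W = np.eye(50)
beta_wls = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)

print(np.allclose(beta_ols, beta_wls))
```

With the identity weight matrix, the WLS normal equations are term-by-term identical to the OLS ones, so the estimates coincide exactly.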
I agree with Barry - OLS is 'standard' in SPSS/PASW - the least squares method is used in standard linear regressions and in PASW if you select "Analyze>Regression>Linear" that will give you what you are calling OLS.
This is taken from SPSS/PASW's help documents. It does not directly say OLS under standard linear regression, but it implies OLS via this passage:
"Standard linear regression models assume that errors in the dependent variable are uncorrelated with the independent variable(s). When this is not the case (for example, when relationships between variables are bidirectional), linear regression using ordinary least squares (OLS) no longer provides optimal model estimates. Two-stage least-squares regression uses instrumental variables that are uncorrelated with the error terms to compute estimated values of the problematic predictor(s) (the first stage), and then uses those computed values to estimate a linear regression model of the dependent variable (the second stage). Since the computed values are based on variables that are uncorrelated with the errors, the results of the two-stage model are optimal."
SPSS should default to OLS unless you are doing something to make it switch; I think that the problem is that the default is assumed, and not explicitly mentioned.