I see that a lot of feature engineering applies a get_dummies step to the object (categorical) features. For example, the sex column containing 'M' and 'F' is dummied into two columns with a one-hot representation.
Why not simply encode 'M' and 'F' as 0 and 1 directly in the sex column?
Does the dummy method have a positive impact on machine learning models, in both classification and regression?
If so, why?
Thanks.
In general, encoding a categorical variable with N distinct values directly as (0, 1, ..., N-1) and treating it as numerical won't work with many algorithms, because you are assigning arbitrary meaning to the different categories. The gender example happens to work because it is binary, but think of a price-estimation example with car models. If there are N distinct models and you encode model A as 3 and model B as 6, then for OLS linear regression this would mean that model B affects the response variable twice as much as model A. You can't assign such arbitrary numbers to categorical values; the resulting model would be meaningless. To prevent this numerical ambiguity, the most common approach is to encode a categorical variable with N distinct values as N-1 binary, one-hot variables.
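As a quick sketch of that recommendation in pandas (the column and values below are made up), get_dummies with drop_first=True yields the N-1 dummy columns:

```python
import pandas as pd

# Hypothetical toy data: a nominal car-model column and a price target.
df = pd.DataFrame({
    "model": ["A", "B", "C", "A", "C"],
    "price": [20000, 35000, 27000, 21000, 26000],
})

# Problematic for nominal data: an arbitrary integer code implies an
# ordering and a magnitude ("C" is not three times "A").
df["model_label"] = df["model"].astype("category").cat.codes

# Better: N-1 one-hot (dummy) columns, dropping one level as the baseline.
dummies = pd.get_dummies(df["model"], prefix="model", drop_first=True)
print(dummies)
```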
To one-hot-encode a feature with N possible values you only need N-1 columns with 0 / 1 values. So you are right: binary sex can be encoded with a single binary feature.
Using dummy coding with N features instead of N-1 shouldn't really add performance to any machine learning model, and it complicates some statistical analyses such as ANOVA.
See the patsy docs on contrasts for reference.
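As a small illustration (with made-up data) of the treatment/dummy coding those docs describe, patsy's dmatrix turns a 3-level categorical into an intercept plus 2 dummy columns:

```python
import pandas as pd
from patsy import dmatrix

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# Treatment (dummy) coding: 3 levels -> intercept + 2 binary columns,
# i.e. N-1 dummies, with one level absorbed into the baseline.
design = dmatrix("C(color)", df, return_type="dataframe")
print(design.columns.tolist())
```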
Is sklearn.naive_bayes.CategoricalNB the same as sklearn.naive_bayes.BernoulliNB, but with one hot encoding in the columns?
I couldn't quite tell from the documentation, and CategoricalNB has that one extra parameter, alpha, whose purpose I don't understand.
The categorical distribution is the Bernoulli distribution, generalized to more than two categories. Stated another way, the Bernoulli distribution is a special case of the categorical distribution, with exactly 2 categories.
In the Bernoulli model, each feature is assumed to have exactly 2 categories, often denoted as 1 and 0 or True and False. In the categorical model, each feature is assumed to have at least 2 categories, and each feature may have a different total number of categories.
One-hot encoding is unrelated to either model. It is a technique for encoding a categorical variable in a numerical matrix. It has no bearing on the actual distribution used to model that categorical variable, although it is natural to model categorical variables using the categorical distribution.
The "alpha" parameter is called the Laplace smoothing parameter. I will not go into detail about it here, because that is better suited for CrossValidated, e.g. https://stats.stackexchange.com/q/192233/36229. From a computational perspective, it exists in order to prevent "poisoning" the calculations with 0s, which propagate multiplicatively throughout the model. This is a practical concern that arises whenever some combination of class label and feature category is not present in your data set. It's fine to leave it at the default value of 1.
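A minimal sketch with scikit-learn, using made-up data, to show the two estimators side by side:

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, CategoricalNB

# Three categorical features, each encoded as small non-negative integers.
X = np.array([[0, 2, 1],
              [1, 0, 0],
              [0, 1, 1],
              [1, 2, 0]])
y = np.array([0, 1, 0, 1])

# CategoricalNB handles multi-category features directly.
cat_nb = CategoricalNB(alpha=1.0)  # alpha = Laplace smoothing parameter
cat_nb.fit(X, y)

# BernoulliNB assumes each feature is binary (0/1); multi-category features
# would need one-hot encoding first. Crude binarization here, just for the sketch.
X_bin = (X > 0).astype(int)
bin_nb = BernoulliNB(alpha=1.0)
bin_nb.fit(X_bin, y)

print(cat_nb.predict(X), bin_nb.predict(X_bin))
```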
I’m working with an auto dataset I found on Kaggle. Besides numerical values like horsepower, car length, car weight, etc., it has multiple categorical variables such as:
car type (sedan, SUV, hatchback, etc.): cardinality = 5
car brand (Toyota, Nissan, BMW, etc.): cardinality = 21
doors (2-door and 4-door): cardinality = 2
fuel type (gas and diesel): cardinality = 2
I would like to use a random forest classifier to perform feature selection with all these variables as input. I’m aware that the categorical variables need to be encoded before doing so. What is the best approach to handling data with such varying cardinalities?
Can I apply different encoding techniques to different variables? Say for example, one hot encoding on fuel type and label encoding on car type?
You can apply different encoding techniques to different variables. However, label encoding introduces hierarchy/order, which doesn't look appropriate for any of the predictors you mention. For this example it looks like one-hot-encoding all the categorical predictors would be better, unless there are some ordinal variables you haven't mentioned.
EDIT: in response to your comment. I would only ever use label encoding if a categorical predictor was ordinal. If they aren't, I would not try and enforce it, and would use one-hot-encoding if the model type couldn't cope with categorical predictors. Whether this causes an issue regarding sparse trees and too many predictors depends entirely on your dataset. If you still have many rows compared to predictors then it generally isn't a problem. You can run into issues with random forests if you have a lot of predictors that aren't correlated at all with the target variable. In this case, as predictors are chosen randomly, you can end up with lots of trees that don't contain any relevant predictors, creating noise. In this case you could try and remove non-relevant predictors before running the random forest model. Or you could try using a different type of model, e.g. penalized regression.
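A minimal sketch of that approach with scikit-learn; the column names, file name, and target below are hypothetical:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

categorical_cols = ["car_type", "brand", "doors", "fuel_type"]
numeric_cols = ["horsepower", "length", "weight"]

# One-hot encode all nominal predictors; numeric columns pass through unchanged.
preprocess = ColumnTransformer(
    transformers=[("onehot", OneHotEncoder(handle_unknown="ignore"), categorical_cols)],
    remainder="passthrough",
)

model = Pipeline([
    ("prep", preprocess),
    ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
])

# df = pd.read_csv("autos.csv")  # hypothetical file
# model.fit(df[categorical_cols + numeric_cols], df["target"])
# importances = model.named_steps["rf"].feature_importances_
```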
I am trying to encode a gender feature containing two values, Male and Female. I created two one-hot features from the main feature, is_male and is_female, containing boolean values. But while applying a model, I realized they are complements of each other. Does this impact model performance, since they appear to be perfectly correlated?
One-hot encoding (creating a separate column for each value of the column) should not be used with binary-valued variables (Male/Female in your case).
Doing so causes the dummy variable trap: the two columns are perfectly collinear (is_female = 1 - is_male), so one of them is redundant and can break models such as linear or logistic regression. Keep a single 0/1 column instead.
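A quick illustration of the redundancy, using made-up data:

```python
import pandas as pd

df = pd.DataFrame({"sex": ["M", "F", "F", "M"]})
df["is_male"] = (df["sex"] == "M").astype(int)
df["is_female"] = (df["sex"] == "F").astype(int)

# The two columns are perfect complements: is_female == 1 - is_male,
# so they are perfectly (negatively) correlated.
print(df[["is_male", "is_female"]].corr())

# A single 0/1 column carries exactly the same information.
df = df.drop(columns=["is_female"])
```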
I am trying to build a logistic regression model and a lot of my features are ordered categorical variables. I think dummy variables may not be useful, because they treat each category with equal weight. So do I need to treat ordered categorical variables as numerical?
Thanks in advance.
Ordered categorical values are termed "ordinal" attributes in data mining: one value is less than or greater than another. You can treat these values either as nominal values or as continuous values (numbers).
Some of the pros and cons of treating them as numbers (continuous) are:
Pros:
This gives you a lot of flexibility in your choice of analysis and preserves the information in the ordering. More importantly to many analysts, it allows you to analyze the data easily.
Cons:
This approach requires the assumption that the numerical distance between each pair of consecutive categories is equal. If that assumption doesn't hold, you can, depending on the domain, assign larger intervals between some categories.
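A small sketch of both options in pandas, using a hypothetical satisfaction rating:

```python
import pandas as pd

# Hypothetical ordinal feature with a known order.
df = pd.DataFrame({"satisfaction": ["low", "high", "medium", "low"]})
order = ["low", "medium", "high"]

# Equal spacing (0, 1, 2): assumes adjacent categories are equally far apart.
cat = pd.Categorical(df["satisfaction"], categories=order, ordered=True)
df["satisfaction_code"] = cat.codes

# Domain-informed spacing if the equal-distance assumption doesn't hold.
df["satisfaction_score"] = df["satisfaction"].map({"low": 1, "medium": 2, "high": 5})
print(df)
```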
So I read a paper that said that preprocessing your dataset correctly can increase LibSVM classification accuracy dramatically. I'm using the Weka implementation and would like some help making sure my dataset is optimal.
Here are my (example) attributes:
Power Numeric (real numbers, range is from 0 to 1.5132, 9000+ unique values)
Voltage Numeric (similar to Power)
Light Numeric (0 and 1 are the only 2 possible values)
Day Numeric (1 through 20 are the possible values, equal number of each value)
Range Nominal {1,2,3,4,5} <----these are the classes
My question is: which Weka pre-processing filters should I apply to make this dataset more effective for LibSVM?
Should I normalize and/or standardize the Power and Voltage data values?
Should I use a Discretization filter on anything?
Should I be binning the Power/Voltage values into a much smaller number of bins?
Should I make the Light value Binary instead of numeric?
Should I normalize the Day values? Does it even make sense to do that?
Should I be using the Nominal to Binary or Nominal to some thing else filter for the classes "Range"?
Please advise on these questions and anything else you think I might have missed.
Thanks in advance!!
Normalization is very important, as it influences the concept of distance which is used by SVM. The two main approaches to normalization are:
Scale each input dimension to the same interval, for example [0, 1]. This is the most common approach by far. It is necessary to prevent some input dimensions from completely dominating others. It is recommended by the LIBSVM authors in their beginner's guide (see Appendix B for examples).
Scale each instance to a given length. This is common in text mining / computer vision.
As to handling types of inputs:
Continuous: no work needed, SVM works on these implicitly.
Ordinal: treat as continuous variables. For example, cold, lukewarm, hot could be modeled as 1, 2, 3; since the ordering is natural, this does not impose an artificial structure.
Nominal: perform one-hot encoding, e.g. for an input with N levels, generate N new binary input dimensions. This is necessary because you must avoid implicitly defining a varying distance between nominal levels. For example, modelling cat, dog, bird as 1, 2 and 3 implies that a dog and bird are more similar than a cat and bird which is nonsense.
Normalization must be done after substituting inputs where necessary.
To answer your questions:
Should I normalize and/or standardize the Power and Voltage data values?
Yes, scale all (final) input dimensions to the same interval (including the dummies!).
Should I use a Discretization filter on anything?
No.
Should I be binning the Power/Voltage values into a much smaller number of bins?
No. Treat them as continuous variables (e.g. one input each).
Should I make the Light value Binary instead of numeric?
No, SVM has no concept of binary variables and treats everything as numeric. So converting it will just lead to an extra type-cast internally.
Should I normalize the Day values? Does it even make sense to do that?
If you want to use 1 input dimension, you must normalize it just like all others.
Should I be using the Nominal to Binary or Nominal to some thing else filter for the classes "Range"?
Nominal to binary, using one-hot encoding.
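Outside of Weka, the overall recipe can be sketched in scikit-learn terms; the values below are made up to mirror the attributes in the question:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Columns: Power, Voltage, Light (0/1), Day (1-20) -- all treated as numeric.
X = np.array([
    [0.81, 0.43, 1, 3],
    [1.20, 0.97, 0, 12],
    [0.05, 0.10, 1, 20],
    [0.66, 0.55, 0, 7],
])
y = np.array([1, 3, 5, 2])  # the nominal class "Range"

# Scale every (final) input dimension to [0, 1]; any dummy columns created
# for nominal inputs would go through this same scaling step.
clf = Pipeline([
    ("scale", MinMaxScaler(feature_range=(0, 1))),
    ("svm", SVC(kernel="rbf", C=1.0, gamma="scale")),
])
clf.fit(X, y)
print(clf.predict(X))
```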