One-hot encoding for binary categorical variable - machine-learning

I am trying to encode a gender feature containing two values, Male and Female. I created two one-hot features from the main feature, is_male and is_female, containing boolean values. But while applying the model, I realized they are complements of each other. Does this impact model performance, since they appear to be perfectly correlated?

One-hot encoding (creating a separate column for each value of a variable) should not be used as-is with binary-valued variables (Male/Female in your case).
Keeping both columns causes the dummy variable trap: is_male and is_female are perfectly collinear, so one of them carries no extra information. Drop one and keep a single indicator column.
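For instance, a minimal pandas sketch (my own illustration, assuming a DataFrame with a single gender column):

import pandas as pd

# Hypothetical toy data; column and value names are illustrative only
df = pd.DataFrame({"gender": ["Male", "Female", "Female", "Male"]})

# drop_first=True keeps a single indicator column (e.g. gender_Male)
# instead of two perfectly collinear columns, avoiding the dummy variable trap
encoded = pd.get_dummies(df, columns=["gender"], drop_first=True)
print(encoded)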

Related

How to deal with multiple categorical variables each with different cardinality?

I’m working with an auto dataset I found on kaggle. Besides numerical values like horsepower, car length, car weight etc., it has multiple categorical variables such as:
car type (sedan, suv, hatchback etc.): cardinality=5
car brand (toyota, Nissan, bmw etc.): cardinality=21
Doors (2-door and 4-door): cardinality=2
Fuel type (gas and diesel): cardinality=2
I would like to use a random forest classifier to perform feature selection with all these variables as input. I’m aware that the categorical variables need to be encoded before doing so. What is the best approach to handling data with such varying cardinalities?
Can I apply different encoding techniques to different variables? Say, for example, one-hot encoding on fuel type and label encoding on car type?
You can apply different encoding techniques to different variables. However, label encoding introduces hierarchy/order, which doesn't look appropriate for any of the predictors you mention. For this example it looks like one-hot-encoding all the categorical predictors would be better, unless there are some ordinal variables you haven't mentioned.
EDIT: in response to your comment. I would only ever use label encoding if a categorical predictor was ordinal. If it isn't, I would not try to enforce an order, and would use one-hot encoding if the model type couldn't cope with categorical predictors directly. Whether this causes an issue with sparse trees and too many predictors depends entirely on your dataset. If you still have many rows compared to predictors, it generally isn't a problem. You can run into issues with random forests if you have a lot of predictors that aren't correlated with the target variable at all: since predictors are chosen randomly at each split, you can end up with lots of trees that don't contain any relevant predictors, which adds noise. In that case you could try to remove the non-relevant predictors before running the random forest model, or try a different type of model, e.g. penalized regression.
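As a rough illustration of mixing encodings in scikit-learn (my own sketch, not from the original answer; the column names are hypothetical):

from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical column names matching the dataset described in the question
nominal_cols = ["car_type", "car_brand", "doors", "fuel_type"]  # no natural order, so one-hot
# An ordinal column (e.g. a trim level with a real order) would instead get an
# OrdinalEncoder with an explicit category order; none of the listed columns qualify.

preprocess = ColumnTransformer(
    transformers=[("onehot", OneHotEncoder(handle_unknown="ignore"), nominal_cols)],
    remainder="passthrough",  # numeric columns (horsepower, weight, ...) pass through unchanged
)

model = Pipeline([("prep", preprocess), ("rf", RandomForestClassifier(n_estimators=200))])
# model.fit(X_train, y_train)  # X_train: a DataFrame containing the columns above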

One-hot encoding in random forest classifier

Is one-hot encoding necessary for a random forest classifier in Python? I want to understand logically whether a random forest can handle categorical features with label encoding rather than one-hot encoding.
Encoding is necessary in machine learning because it converts non-numeric features into numeric ones that a model can understand.
Any type of encoding can be applied to any non-numeric feature; which one to use depends on the nature of the feature.
Now, coming to your question of when to use label encoding and when to use one-hot encoding:
Use label encoding when you want to preserve the ordinal nature of a feature. For example, say you have an education level feature with string values "Bachelor", "Master", "Ph.D". Here you want to preserve the order Ph.D > Master > Bachelor, so you would map Bachelor to 1, Master to 2, Ph.D to 3.
Use one-hot encoding when you want to treat all categories as equal, with no order. For example, a color variable with values "red", "yellow", "orange" has no value taking precedence over another, so one-hot encoding is the right choice here.
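As a small illustration of both cases (my own sketch, not part of the original answer; the column names and data are made up):

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder

df = pd.DataFrame({
    "education": ["Bachelor", "Ph.D", "Master"],   # ordinal
    "color": ["red", "yellow", "orange"],          # nominal
})

# Ordinal: an explicit category order preserves Bachelor < Master < Ph.D
ord_enc = OrdinalEncoder(categories=[["Bachelor", "Master", "Ph.D"]])
df["education_enc"] = ord_enc.fit_transform(df[["education"]]).ravel()

# Nominal: one-hot, no order implied between the colors
oh_enc = OneHotEncoder()
color_dummies = oh_enc.fit_transform(df[["color"]]).toarray()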
NOTE: one-hot encoding increases the number of features, which is not ideal for tree-based algorithms like decision trees and random forests. That's why label encoding is often preferred in this case. Still, if you do use one-hot encoding, you can check the importance of the resulting categorical features via the feature_importances_ attribute in sklearn, and drop features with low importance.
Random forests are built on decision trees, which are sensitive to one-hot encoding. Sensitive here means that one-hot encoding can lead to sparse, unbalanced splits: at every split on a dummy variable there are only two values (0 or 1), so the trees tend to grow in one direction, towards the zeros of the dummy variables.
You may now be wondering how to handle categorical values without one-hot encoding. For that, you can look into the hashing trick; you can also look into the H2O random forest, which handles categorical features natively.
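A minimal sketch of the hashing trick using scikit-learn's FeatureHasher (my own example; the original answer only points to the idea, and the data are made up):

from sklearn.feature_extraction import FeatureHasher

# Hash each category string into a fixed number of columns (here 8),
# so the encoded width no longer grows with the cardinality of the variable
hasher = FeatureHasher(n_features=8, input_type="string")
brands = [["toyota"], ["nissan"], ["bmw"], ["toyota"]]
hashed = hasher.transform(brands).toarray()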

How change numeric column to categorical data on Driverless AI

I have tried out Driverless AI using the Docker version. When I import my data, I have a problem getting it to recognize which columns are really numeric and which are categorical variables.
How can I fix this?
The handling of categorical variables, and how the user can control it, is described in the DAI documentation FAQ. I will repost it here for your convenience:
How does Driverless AI deal with categorical variables? What if an integer column should really be treated as categorical?
If a column has string values, then Driverless AI will treat it as a categorical feature. There are multiple methods for how Driverless AI converts the categorical variables to numeric. These include:
One Hot Encoding: creating dummy variables for each value
Frequency Encoding: replace category with how frequently it is seen in the data
Target Encoding: replace category with the average target value (additional steps included to prevent overfitting)
Weight of Evidence: calculate weight of evidence for each category (http://ucanalytics.com/blogs/information-value-and-weight-of-evidencebanking-case/)
Driverless AI will try multiple methods for representing the column and determine which representation(s) are best.
If the column has integers, Driverless AI will try treating the column as both a categorical and a numeric column. It will treat any integer column as both categorical and numeric if the number of unique values is less than 50.
This is configurable in the config.toml file:
# Whether to treat some numerical features as categorical
# For instance, sometimes an integer column may not represent a numerical feature but
# represent different numerical codes instead.
num_as_cat = true
# Max number of unique values for integer/real columns to be treated as categoricals (test applies to first statistical_threshold_data_size_small rows only)
max_int_as_cat_uniques = 50
(Note: Driverless AI will also check if the distribution of any numeric column differs significantly from the distribution of typical numerical data using Benford’s Law. If the column distribution does not obey Benford’s Law, we will also try to treat it as categorical even if there are more than 50 unique values.)
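To make the frequency and target encodings listed above concrete, here is a rough pandas sketch (my own illustration, not DAI's internal implementation; as noted above, DAI adds extra steps to target encoding to prevent overfitting):

import pandas as pd

df = pd.DataFrame({
    "brand": ["toyota", "bmw", "toyota", "nissan", "bmw", "toyota"],
    "price": [20, 45, 22, 25, 48, 21],
})

# Frequency encoding: replace each category with how often it appears in the data
freq = df["brand"].value_counts(normalize=True)
df["brand_freq"] = df["brand"].map(freq)

# Target encoding: replace each category with the mean target value
# (a real pipeline would compute this out-of-fold to avoid leakage)
target_mean = df.groupby("brand")["price"].mean()
df["brand_target"] = df["brand"].map(target_mean)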

H2O Flow: How does H2O flow UI treat data types differently

Specifically, what is the difference in how H2O treats enum and string data types in contrast to 'int's and 'numerical' types?
For example, say I have a binary classifier that takes input samples that have features
x1=(1 of 10 possible favorite ice cream flavors (enum))
x2=(some random phrase (string))
x3=(some number (int))
What would be the difference in how the classifier treats these types during training?
When uploading data into h2o Flow UI, I get the option to convert certain data types (like enum) to 'numerical.' This makes me think that there is more than just string-to-number mapping going on when I just leave the 'enum' as an 'enum' (not converting to 'numerical' type), but I can't find information on what that difference is.
Clarification would be appreciated, thanks.
The "enum" type is the type of encoding you'll want to use for categorical features. If the categorical features are encoded as "enum", then the tree-based algorithms like Random Forest and GBM will be able to handle these features in a smart way. Most other implementations of RFs and GBM force you to do a one-hot expansion of the categorical features (into K dummy columns), but in H2O, the tree-based methods can use these features without any expansion. The exact whay that the variables are handled can be controlled using the categorical_encoding argument.
If you have an ordered categorical variable, then it might be okay to encode that as "int", however, the effect of doing that on model performance will depend on the data.
If you were to convert an "enum" column to "numeric" that would simply encode each category as an integer and you'd lose the notion that those numbers represent categories (so it's not recommended).
You should not use the "string" type in H2O unless you are going to exclude that column from the set of predictors. It would make sense to use a "string" column for text, but you'll probably want to parse (e.g. tokenize) that text to generate new numeric or enum features that will be included in the set of predictors.
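For reference, a minimal sketch of what this looks like in the H2O Python API (my own example; the file and column names are hypothetical):

import h2o
from h2o.estimators import H2ORandomForestEstimator

h2o.init()
frame = h2o.import_file("cars.csv")            # hypothetical dataset

# Make sure a categorical column is treated as enum rather than int or string
frame["car_type"] = frame["car_type"].asfactor()

# Tree-based methods handle enum columns natively; categorical_encoding controls how
rf = H2ORandomForestEstimator(ntrees=100, categorical_encoding="enum")
rf.train(x=["car_type", "horsepower"], y="price", training_frame=frame)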

Does using dummy variables make a model's performance better?

I see that many feature engineering pipelines include a get_dummies step for the object features. For example, dummying the sex column, which contains 'M' and 'F', into two columns with a one-hot representation.
Why not directly map 'M' and 'F' to 0 and 1 in the sex column?
Does the dummy method have a positive impact on machine learning models, in both classification and regression?
If so, why?
Thanks.
In general, directly encoding a categorical variable with N different values as (0, 1, ..., N-1) and treating it as numerical won't work with many algorithms, because you are giving ad hoc numeric meaning to the different categories. The gender example works since it is binary, but think of a price estimation example with car models. If there are N distinct models and you encode model A as 3 and model B as 6, this would mean, for OLS linear regression, that model B affects the response variable twice as much as model A. You can't simply give such arbitrary meanings to different categorical values; the resulting model would be meaningless. To prevent this numerical ambiguity, the most common approach is to encode a categorical variable with N distinct values using N-1 binary, one-hot variables.
To one-hot-encode a feature with N possible values you only need N-1 columns with 0 / 1 values. So you are right: binary sex can be encoded with a single binary feature.
Using dummy coding with N features instead of N-1 shouldn't really add performance to any machine learning model, and it complicates some statistical analyses such as ANOVA.
See the patsy docs on contrasts for reference.
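For a concrete view of the N-1 (treatment) coding described in those docs, a small patsy sketch (my own example; the data are made up):

import pandas as pd
from patsy import dmatrix

df = pd.DataFrame({"sex": ["M", "F", "F", "M"]})

# Patsy's default treatment coding drops one level ("F" becomes the reference),
# leaving a single 0/1 column plus the intercept
design = dmatrix("C(sex)", df, return_type="dataframe")
print(design)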
