How to change a numeric column to categorical data in Driverless AI - machine-learning

I have been trying out Driverless AI using the Docker version. When I import my data, I have a problem getting it to recognize which columns are truly numeric and which are categorical variables.
How can I fix this?

The handling of categorical variables, and the user controls for it, are described in the DAI documentation FAQ. I will repost it here for your convenience:
How does Driverless AI deal with categorical variables? What if an integer column should really be treated as categorical?
If a column has string values, then Driverless AI will treat it as a categorical feature. There are multiple methods for how Driverless AI converts the categorical variables to numeric. These include:
One Hot Encoding: creating dummy variables for each value
Frequency Encoding: replace category with how frequently it is seen in the data
Target Encoding: replace category with the average target value (additional steps included to prevent overfitting)
Weight of Evidence: calculate weight of evidence for each category (http://ucanalytics.com/blogs/information-value-and-weight-of-evidencebanking-case/)
Driverless AI will try multiple methods for representing the column and determine which representation(s) are best.
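For illustration only, here is a minimal pandas sketch of frequency and target encoding; the toy column names are made up, and Driverless AI performs these transformations internally, not via this code:

import pandas as pd

df = pd.DataFrame({
    "color":  ["red", "blue", "red", "green", "blue", "red"],
    "target": [1, 0, 1, 0, 1, 0],
})

# Frequency encoding: replace each category with how often it occurs
freq = df["color"].value_counts(normalize=True)
df["color_freq"] = df["color"].map(freq)

# Target encoding: replace each category with the mean target value
# (in practice, add smoothing / out-of-fold estimates to avoid overfitting)
target_mean = df.groupby("color")["target"].mean()
df["color_target"] = df["color"].map(target_mean)

print(df)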
If the column has integers, Driverless AI will try treating the column as both a categorical column and a numeric column. It will treat any integer column as both categorical and numeric if the number of unique values is less than 50.
This is configurable in the config.toml file:
# Whether to treat some numerical features as categorical
# For instance, sometimes an integer column may not represent a numerical feature but
# represent different numerical codes instead.
num_as_cat = true
# Max number of unique values for integer/real columns to be treated as categoricals (test applies to first statistical_threshold_data_size_small rows only)
max_int_as_cat_uniques = 50
(Note: Driverless AI will also check if the distribution of any numeric column differs significantly from the distribution of typical numerical data using Benford’s Law. If the column distribution does not obey Benford’s Law, we will also try to treat it as categorical even if there are more than 50 unique values.)
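The exact heuristic DAI uses is internal, but the idea can be sketched: compare the column's first-digit frequencies against the frequencies Benford's Law predicts, and treat a large deviation as a hint that the column is code-like rather than truly numeric. This is a rough illustration, not DAI's actual implementation:

import numpy as np

def benford_deviation(values):
    # Compare observed first-digit frequencies with Benford's expected frequencies.
    values = np.abs(np.asarray(values, dtype=float))
    values = values[values > 0]
    first_digits = np.array([int(f"{v:e}"[0]) for v in values])  # leading significant digit
    observed = np.array([(first_digits == d).mean() for d in range(1, 10)])
    expected = np.log10(1 + 1 / np.arange(1, 10))  # Benford's Law
    return np.abs(observed - expected).sum()       # large value -> "not numeric-looking"

# Uniform ID-like codes deviate far more from Benford's Law than typical measurements do
print(benford_deviation(np.random.randint(100, 1000, size=10_000)))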

Related

How to deal with multiple categorical variables each with different cardinality?

I’m working with an auto dataset I found on kaggle. Besides numerical values like horsepower, car length, car weight etc., it has multiple categorical variables such as:
car type (sedan, SUV, hatchback, etc.): cardinality = 5
car brand (Toyota, Nissan, BMW, etc.): cardinality = 21
doors (2-door and 4-door): cardinality = 2
fuel type (gas and diesel): cardinality = 2
I would like to use a random forest classifier to perform feature selection with all these variables as input. I’m aware that the categorical variables need to be encoded before doing so. What is the best approach to handling data with such varying cardinalities?
Can I apply different encoding techniques to different variables? Say for example, one hot encoding on fuel type and label encoding on car type?
You can apply different encoding techniques to different variables. However, label encoding introduces hierarchy/order, which doesn't look appropriate for any of the predictors you mention. For this example it looks like one-hot-encoding all the categorical predictors would be better, unless there are some ordinal variables you haven't mentioned.
EDIT: in response to your comment. I would only ever use label encoding if a categorical predictor was ordinal. If they aren't, I would not try and enforce it, and would use one-hot-encoding if the model type couldn't cope with categorical predictors. Whether this causes an issue regarding sparse trees and too many predictors depends entirely on your dataset. If you still have many rows compared to predictors then it generally isn't a problem. You can run into issues with random forests if you have a lot of predictors that aren't correlated at all with the target variable. In this case, as predictors are chosen randomly, you can end up with lots of trees that don't contain any relevant predictors, creating noise. In this case you could try and remove non-relevant predictors before running the random forest model. Or you could try using a different type of model, e.g. penalized regression.
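To make the one-hot plus random forest approach concrete, here is a minimal scikit-learn sketch; the toy data and column names are made up to mirror the question:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Tiny synthetic stand-in for the auto dataset described in the question
df = pd.DataFrame({
    "horsepower": [110, 200, 95, 150, 120, 180],
    "car_type":   ["sedan", "suv", "hatchback", "sedan", "suv", "sedan"],
    "fuel_type":  ["gas", "diesel", "gas", "gas", "diesel", "gas"],
    "label":      [0, 1, 0, 1, 1, 0],
})

# One-hot encode the categorical predictors, keep numeric ones as-is
X = pd.get_dummies(df.drop(columns="label"),
                   columns=["car_type", "fuel_type"])
y = df["label"]

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Importances over the expanded dummy columns can guide feature selection
print(pd.Series(rf.feature_importances_, index=X.columns)
        .sort_values(ascending=False))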

One-hot encoding for binary categorical variable

I am trying to encode a gender feature containing two values, Male and Female. I created two one-hot features from the main feature, is_male and is_female, containing boolean values. But while applying the model, I realized they are complements of each other. Does this impact model performance, since they appear to be correlated?
One-hot encoding (creating a separate column for each value of a column) should not be used with binary-valued variables (Male/Female in your case).
Doing so causes the dummy variable trap: the two columns are perfectly collinear, so one of them is redundant.
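As a sketch, pandas' get_dummies with drop_first=True keeps a single column for a binary variable, which avoids the perfectly correlated complement column (the toy data here are made up):

import pandas as pd

df = pd.DataFrame({"gender": ["Male", "Female", "Female", "Male"]})

# drop_first=True keeps one column (effectively "is_male"),
# dropping the redundant, perfectly correlated complement
encoded = pd.get_dummies(df, columns=["gender"], drop_first=True)
print(encoded)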

Categorical variables with large numbers of categories in XGBoost/CatBoost

I have a question regarding random forests. Imagine that I have data on users interacting with items. The number of items is large, around 10 000. The output of the random forest should be the items that the user is likely to interact with (like a recommender system). For any user, I want to use a feature that describes the items that the user has interacted with in the past. However, mapping the categorical product feature as a one-hot encoding seems very memory-inefficient, as a user interacts with no more than a couple of hundred of the items at most, and sometimes as few as 5.
How would you go about constructing a random forest when one of the input features is a categorical variable with ~10 000 possible values and the output is a categorical variable with ~10 000 possible values? Should I use CatBoost with the features as categorical? Or should I use one-hot encoding, and if so, do you think XGBoost or CatBoost does better?
You could also try entity embeddings to reduce hundreds of boolean features into vectors of small dimension.
It is similar to word embeddings for categorical features. In practical terms, you define an embedding of your discrete space of features into a vector space of low dimension. It can enhance your results and save on memory. The downside is that you do need to train a neural network model beforehand to define the embedding.
Check this article for more information.
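As a rough sketch of the idea with Keras, where the embedding dimension and layer sizes are arbitrary assumptions; after training, the Embedding layer's weight matrix gives a dense vector per item that can replace the one-hot columns:

import tensorflow as tf

n_items = 10_000   # roughly the item cardinality from the question
embed_dim = 32     # small embedding dimension (assumption)

item_id = tf.keras.Input(shape=(1,), dtype="int32")
embedded = tf.keras.layers.Embedding(input_dim=n_items, output_dim=embed_dim)(item_id)
flat = tf.keras.layers.Flatten()(embedded)
out = tf.keras.layers.Dense(1, activation="sigmoid")(flat)

model = tf.keras.Model(item_id, out)
model.compile(optimizer="adam", loss="binary_crossentropy")
# Train on (item_id, interacted) pairs, then extract the learned item vectors:
# item_vectors = model.layers[1].get_weights()[0]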
XGBoost doesn't support categorical features directly; you need to do the preprocessing yourself to use it with categorical features. For example, you could do one-hot encoding. One-hot encoding usually works well if there are some frequent values of your categorical feature.
CatBoost does have support for categorical features: both one-hot encoding and the calculation of different statistics on categorical features. To use one-hot encoding you need to enable it with the one_hot_max_size parameter; by default, statistics are calculated. Statistics usually work better for categorical features with many values.
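For example, a minimal CatBoost sketch; the synthetic data and the one_hot_max_size threshold are arbitrary choices for illustration:

import pandas as pd
from catboost import CatBoostClassifier

# Tiny synthetic example; in practice item_id would have ~10 000 levels
X = pd.DataFrame({
    "item_id": ["a", "b", "a", "c", "b", "c", "a", "b"],
    "count":   [3, 1, 5, 2, 4, 1, 6, 2],
})
y = [1, 0, 1, 0, 1, 0, 1, 0]

model = CatBoostClassifier(
    one_hot_max_size=10,  # one-hot only for categoricals with <= 10 unique values;
                          # higher-cardinality features use CatBoost's target statistics
    iterations=50,
    verbose=False,
)
model.fit(X, y, cat_features=["item_id"])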
Assuming you have enough domain expertise, you could create a new categorical column from the existing column.
For example, if your column has the values
A, B, C, D, E, F, G, H
and you are aware that A, B, C are similar, D, E, F are similar, and G, H are similar,
your new column would be
Z, Z, Z, Y, Y, Y, X, X.
In your random forest model you should remove the previous column and include only this new column. Note that by transforming your features like this you lose some of the explainability of your model.
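In pandas, that grouping might look like the following sketch of the mapping described above (column names are made up):

import pandas as pd

df = pd.DataFrame({"category": ["A", "B", "C", "D", "E", "F", "G", "H"]})

# Domain-knowledge grouping: A/B/C -> Z, D/E/F -> Y, G/H -> X
grouping = {"A": "Z", "B": "Z", "C": "Z",
            "D": "Y", "E": "Y", "F": "Y",
            "G": "X", "H": "X"}
df["category_grouped"] = df["category"].map(grouping)

# Use only the grouped column in the model; drop the original
df = df.drop(columns="category")
print(df)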

H2O Flow: How does the H2O Flow UI treat data types differently?

Specifically, what is the difference in how H2O treats enum and string data types in contrast to 'int's and 'numerical' types?
For example, say I have a binary classifier that takes input samples that have features
x1=(1 of 10 possible favorite ice cream flavors (enum))
x2=(some random phrase (string))
x3=(some number (int))
What would be the difference in how the classifier treats these types during training?
When uploading data into the H2O Flow UI, I get the option to convert certain data types (like enum) to 'numerical.' This makes me think that there is more going on than just a string-to-number mapping when I leave 'enum' as 'enum' (not converting it to the 'numerical' type), but I can't find information on what that difference is.
Clarification would be appreciated, thanks.
The "enum" type is the type of encoding you'll want to use for categorical features. If the categorical features are encoded as "enum", then the tree-based algorithms like Random Forest and GBM will be able to handle these features in a smart way. Most other implementations of RFs and GBM force you to do a one-hot expansion of the categorical features (into K dummy columns), but in H2O, the tree-based methods can use these features without any expansion. The exact whay that the variables are handled can be controlled using the categorical_encoding argument.
If you have an ordered categorical variable, then it might be okay to encode that as "int", however, the effect of doing that on model performance will depend on the data.
If you were to convert an "enum" column to "numeric" that would simply encode each category as an integer and you'd lose the notion that those numbers represent categories (so it's not recommended).
You should not use the "string" type in H2O unless you are going to exclude that column from the set of predictors. It would make sense to use a "string" column for text, but you'll probably want to parse (e.g. tokenize) that text to generate new numeric or enum features that will be included in the set of predictors.
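To make the enum handling above concrete, here is a minimal h2o Python sketch of forcing a column to "enum" and setting categorical_encoding; the toy data are made up, and the Flow UI exposes the same type conversion interactively:

import h2o
from h2o.estimators import H2OGradientBoostingEstimator

h2o.init()

# Small illustrative frame; "flavor" plays the role of the enum feature
frame = h2o.H2OFrame({
    "flavor": ["vanilla", "chocolate", "vanilla", "mint", "chocolate", "mint"],
    "price":  [3, 4, 3, 5, 4, 5],
    "bought": [1, 0, 1, 0, 1, 0],
})
frame["flavor"] = frame["flavor"].asfactor()   # ensure categorical ("enum")
frame["bought"] = frame["bought"].asfactor()   # binary classification target

gbm = H2OGradientBoostingEstimator(categorical_encoding="enum", ntrees=10)
gbm.train(x=["flavor", "price"], y="bought", training_frame=frame)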

Does using dummy values make a model's performance better?

I see that many feature-engineering pipelines have a get_dummies step on the object features. For example, the sex column, which contains 'M' and 'F', is dummied into two columns labeled in a one-hot representation.
Why do we not directly encode 'M' and 'F' as 0 and 1 in the sex column?
Does the dummy method have a positive impact on machine learning models, both for classification and for regression? If so, why?
Thanks.
In general, directly encoding a categorical variable with N different values as (0, 1, ..., N-1) and turning it into a numerical variable won't work with many algorithms, because you are giving ad hoc meaning to the different category values. The gender example works since it is binary, but think of a price-estimation example with car models. If there are N distinct models and you encode model A with 3 and model B with 6, this would mean, for example in an OLS linear regression, that model B affects the response variable twice as much as model A. You can't simply give such arbitrary meanings to different categorical values; the generated model would be meaningless. To prevent such numerical ambiguity, the most common way is to encode a categorical variable with N distinct values as N-1 binary, one-hot variables.
To one-hot-encode a feature with N possible values you only need N-1 columns with 0 / 1 values. So you are right: binary sex can be encoded with a single binary feature.
Using dummy coding with N features instead of N-1 shouldn't really add performance to any machine learning model, and it complicates some statistical analyses such as ANOVA.
See the patsy docs on contrasts for reference.
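A small sketch of N-1 (treatment) coding with patsy and pandas, using a made-up three-level column:

import pandas as pd
from patsy import dmatrix

df = pd.DataFrame({"car_model": ["A", "B", "C", "A", "C"]})

# Treatment (dummy) coding: N = 3 levels -> intercept plus N-1 = 2 columns
design = dmatrix("C(car_model)", df, return_type="dataframe")
print(design)

# Equivalent N-1 encoding with pandas
print(pd.get_dummies(df["car_model"], drop_first=True))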
