I have a machine learning problem where the dependent variable is binomial (Yes/No) and some of the independent variables are categorical (with more than 100 levels). I'm not sure whether dummy coding these categorical variables and then passing them to the machine learning model is an optimal solution.
Is there a way to deal with this problem?
Thanks!
You may try creating dummy variables from the categorical variables. Before that, try to combine some of the categories (for example, grouping rare levels together) so the number of dummy columns stays manageable.
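As a rough sketch of that idea in pandas (the column name and rarity threshold here are made up for illustration):

```python
import pandas as pd

# Hypothetical example: 'city' stands in for a 100+ level categorical feature.
df = pd.DataFrame({"city": ["NYC", "LA", "NYC", "Springfield", "Smallville", "LA"]})

# Combine rare levels into a single 'Other' bucket before dummy coding,
# so the one-hot matrix stays at a manageable width:
counts = df["city"].value_counts()
rare = counts[counts < 2].index          # threshold is illustrative
df["city_grouped"] = df["city"].where(~df["city"].isin(rare), "Other")

# Standard dummy coding on the reduced level set:
dummies = pd.get_dummies(df["city_grouped"], prefix="city")
print(dummies.columns.tolist())          # ['city_LA', 'city_NYC', 'city_Other']
```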
I am trying to perform K-means clustering on a dataset using scikit-learn. One of my categorical features has 96 possible values. Would this be too many features for one variable to generate?
The alternative would be either to convert it to a numerical variable through weight of evidence, or simply to drop it. What do you guys think?
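For concreteness, this is the weight-of-evidence idea I mean (a rough sketch with synthetic data; note it assumes a binary target y is available, which plain K-means clustering does not have):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: 'cat' has 96 levels, 'y' is a hypothetical binary target.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "cat": rng.choice([f"level_{i}" for i in range(96)], size=5000),
    "y": rng.integers(0, 2, size=5000),
})

# Weight of evidence per level: ln(P(level | y=1) / P(level | y=0)),
# smoothed so rare levels don't divide by zero.
levels = df["cat"].unique()
pos = df.loc[df["y"] == 1, "cat"].value_counts().reindex(levels, fill_value=0)
neg = df.loc[df["y"] == 0, "cat"].value_counts().reindex(levels, fill_value=0)
eps = 0.5
woe = np.log(((pos + eps) / (pos.sum() + eps)) / ((neg + eps) / (neg.sum() + eps)))

# The 96-level categorical collapses into one numeric column:
df["cat_woe"] = df["cat"].map(woe)
```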
I am trying to determine the optimal group of variables for a classification task. Sometimes, instead of a group of variables, only a single variable ends up being selected (the data looked pretty weak when each variable was examined alone).
I used several classifiers (random forest, logistic regression, SVM) and I have a small problem understanding the results (the best results were achieved using the RF).
Can someone with a deeper conceptual understanding of random forests than mine please explain what a random forest built on one variable is doing? Since it is only one variable, it is hard for me to see how the random forest can achieve better sensitivity/specificity than that single variable can ever achieve alone (which it does). Is the RF in this case just a decision tree? I suspected it might be, and after testing I observed that all the scores (accuracy, F1, precision, recall) were identical for the two.
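For reference, a minimal sketch of the test I mention (scikit-learn assumed; synthetic data stands in for my actual dataset):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 1))                        # a single predictor
y = (X[:, 0] + rng.normal(scale=0.5, size=500) > 0).astype(int)

rf = RandomForestClassifier(random_state=0)
dt = DecisionTreeClassifier(random_state=0)
print(cross_val_score(rf, X, y, cv=5, scoring="accuracy").mean())
print(cross_val_score(dt, X, y, cv=5, scoring="accuracy").mean())
```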
Thanks for the help.
What happens when I normalize the dependent variable but not the independent variables in a linear regression? How would I interpret the model, as opposed to normalizing both the dependent and independent variables?
Thank you!
What happens when I normalize the dependent variable but not the independent variables in a linear regression?
Nothing, essentially: normalizing y only rescales the coefficients (and the intercept) by a constant, so the fitted model is equivalent.
How would I interpret the model, as opposed to normalizing both the dependent and independent variables?
If you normalize the independent variables, you will be able to compare and interpret their weights after fitting, since they are all on the same scale.
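A minimal sketch of that point (scikit-learn assumed; the data and coefficients are illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) * [1.0, 100.0]   # two features on very different scales
y = 3.0 * X[:, 0] + 0.05 * X[:, 1] + rng.normal(size=200)

# Raw coefficients are not comparable: the second one looks tiny only
# because its feature has a huge scale.
print(LinearRegression().fit(X, y).coef_)

# After standardizing, each coefficient is "effect per standard deviation",
# so their magnitudes can be compared directly.
Xs = StandardScaler().fit_transform(X)
print(LinearRegression().fit(Xs, y).coef_)
```

Here the raw coefficients suggest the first feature dominates, but measured per standard deviation the second one actually matters more.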
Could anyone please let me know the purpose of marking a variable as Categorical in the Edit Metadata module of Azure Machine Learning? I would appreciate an example. Also, is it applicable to both features and labels?
There are cases where variables are treated as strings instead of as categorical. This impacts the accuracy of the model, depending on how the variable is treated. In these cases, the user may want the system to treat those variables as categorical instead of string, and the Edit Metadata module helps enforce that behavior.
https://msdn.microsoft.com/library/azure/370b6676-c11c-486f-bf73-35349f842a66?f=255&MSPPError=-2147217396
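The Edit Metadata module itself is a drag-and-drop step in Azure ML Studio, but the same idea can be sketched in pandas (an analogy, not Azure ML's API):

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "red", "green"]})

# As a plain string column there is no fixed set of levels:
print(df["color"].dtype)            # object

# Declared categorical, downstream steps can treat it as a finite level set
# (analogous to marking the column Categorical in Edit Metadata):
df["color"] = df["color"].astype("category")
print(df["color"].cat.categories)   # Index(['blue', 'green', 'red'], dtype='object')
```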
I saw someone create dummy variables from nominal variables for classification models, and then use both the original nominal variables and the newly created dummy variables in decision tree, SVM, and NN models.
I don't see the point of it. Using nominal variables together with their derived dummy variables seems redundant to me.
Am I correct, or is it necessary to use both the original nominal variables and their dummy indicators?
Depends on what kind of model you're training. Simple models (such as linear ones) can be too "dumb" to "see" how the derived features relate to the original ones.
In the linear regression case, introducing a new feature that is the square of another is enough to "trick" the model; it can only "see" linear relationships, so the quadratic one looks independent.
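A minimal sketch of that quadratic case (scikit-learn assumed; the data is synthetic):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=300)
y = x**2 + rng.normal(scale=0.3, size=300)     # a purely quadratic relationship

# On the raw feature alone, the linear model sees almost nothing:
print(LinearRegression().fit(x.reshape(-1, 1), y).score(x.reshape(-1, 1), y))

# Adding the derived feature x**2 lets the same linear model fit it well:
X2 = np.column_stack([x, x**2])
print(LinearRegression().fit(X2, y).score(X2, y))
```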