Does one-hot encoding cause issues of unbalanced feature? - machine-learning

We know that in data mining, we often need one-hot encoding to encode categorical features, thus, one categorical feature will be encoded to a few "0/1" features.
There is a special case that confused me:
Now I have one categorical feature and one numerical feature in my dataset.I encode the categorical feature to 300 new "0/1" features, and then Normalized the numerical feature using MinMaxScaler, so all my features value is in the range of 0 to 1.But the suspicious phenomenon is that The ratio of categorical feature and numerical feature is seems to changed from 1:1 to 300:1.
Is my method of encoding correct?This made me doubt about one-hot encoding,I think this may lead to the issue of unbalanced features.
Can anybody tell me the truth? Any word will be appreciated! Thanks!!!

As each record only has one category, only one of them will be 1.
Effectively, with such preprocessing, the weight on the categoricial features will only be about 2 times the weight of a standardized feature. (2 times, if you consider distances and objects of two different categories).
But in essence you are right: one-hot encoding is not particularly smart. It's an ugly hack to make programs run on data they do not support. Things get worse when algorithms such as k-means are used, that assume we can take the mean and need to minimize squared errors on these variables... The statistical value of the results will be limited.

Related

One-hot encoding in random forest classifier

Is one-hot encoding necessary for random forest classifier in python? I want to understand logically if random forest can handle categorical features with label encoding rather that one-hot-encoding.
The concept of encoding is necessary in machine learning because with the help of it, we can convert non-numeric features into numeric ones which is understandable by any model.
Any type of encoding can be done on any non-numeric features, it solely depends on intution.
Now, coming to your question when to use label-encoding and when to use One-hot encoding:
Use Label-encoding - Use this when, you want to preserve the ordinal nature of your feature. For example, you have a feature of education level, which has string values like "Bachelor","Master","Ph.D". In this case, you want to preserve the ordinal nature that, Ph.D > Master > Bachelor hence you'll map using label-encoding like - Bachelor-1, Master-2, Ph.D-3.
Use One-hot encoding - Use this when, you want to treat your categorical variable with equal order. For example, you have colors variable which has values "red","yellow", "orange". Now, in this case any value has no precedence over other values, hence you'll use One hot encoding here.
NOTE: In One-hot encoding your number of features will increase, which is not good for any tree based algorithm like Decision-trees, Random Forest etc. That's why Label encoding is mostly preferred in this case, but still if you use one hot encoding, you can check the importance of categorical features by using feature_importances_ hyperparameter in sklearn. If the feature is having low importance you can drop it off.
Random forest is based on the principle of Decision Trees which are sensitive to one-hot encoding. Now here sensitive means like if we induce one-hot to a decision tree splitting can result in sparse decision tree. The trees generally tend to grow in one direction because at every split of a categorical variable there are only two values (0 or 1). The tree grows in the direction of zeroes in the dummy variables.
Now you must be wondering how will you tackle the categorical values without one-hot encoding? For that you can refer to this Hashing Trick further you can also look into h2o Random Forest.

one hot encoding of output labels

While I understand the need to one hot encode features in the input data, how does one hot encoding of output labels actually help? The tensor flow MNIST tutorial encourages one hot encoding of output labels. The first assignment in CS231n(stanford) however does not suggest one hot encoding. What's the rationale behind choosing / not choosing to one hot encode output labels?
Edit: Not sure about the reason for the downvote, but just to elaborate more, I missed out mentioning the softmax function along with the cross entropy loss function, which is normally used in multinomial classification. Does it have something to do with the cross entropy loss function?
Having said that, one can calculate the loss even without the output labels being one hot encoded.
One hot vector is used in cases where output is not cardinal. Lets assume you encode your output as integer giving each label a number.
The integer values have a natural ordered relationship between each other and machine learning algorithms may be able to understand and harness this relationship, but your labels may be unrelated. There may be no similarity in your labels. For categorical variables where no such ordinal relationship exists, the integer encoding is not good.
In fact, using this encoding and allowing the model to assume a natural ordering between categories may result in unexpected results where model predictions are halfway between categories categories.
What a mean by that?
The idea is that if we train an ML algorithm - for example a neural network - it’s going to think that a cat (which is 1) is halfway between a dog and a bird, because they are 0 and 2 respectively. We don’t want that; it’s not true and it’s an extra thing for the algorithm to learn.
The same may happen when data is encoded in n dimensional space and vector has a continuous value. The result may be hard to interpret and map back to labels.
In this case, a one-hot encoding can be applied to label representation as it has clear interpretation and its values are separated each is in different dimension.
If you need more information or would like to see the reason for one-hot encoding for the perspective of loss function see https://www.linkedin.com/pulse/why-using-one-hot-encoding-classifier-training-adwin-jahn/

best practices for using Categorical Variables in H2O?

I'm trying to use H2O's Random Forest for a multinominal classification into 71 classes with 38,000 training set examples. I have one features that is a string that in many cases are predictive, so I want to use it as a categorical feature.
The hitch is that even after canonicalizing the strings (uppercase, stripping out numbers, punctuation, etc.), I still have 7,000 different strings (some due to spelling or OCR errors, etc.) I have code to remove strings that are relatively rare, but I'm not sure what a reasonable cut off value is. (I can't seem to find any help in the documentation.)
I'm also not sure what to due with nbin_cats hyperparameter. Should I make it equal to the number of different categorical variables I have? [added: default for nbin_cats is 1024 and I'm well below that at around 300 different categorical values, so I guess I don't have to do anything with this parameter]
I'm also thinking perhaps if a categorical value is associated with too many different categories that I'm trying to predict, maybe I should drop it as well.
I'm also guessing I need to increase the tree depth to handle this better.
Also, is there a special value to indicate "don't know" for the strings that I am filtering out? (I'm mapping it to a unique string but I'm wondering if there is a better value that indicates to H2O that the categorical value is unknown.)
Many thanks in advance.
High cardinality categorical predictors can sometimes hurt model performance, and specifically in the case of tree-based models, the tree ensemble (GBM or Random Forest) ends up memorizing the training data. The model has a poor time generalizing on validation data.
A good indication of whether this is happening is if your string/categorical column has very high variable importance. This means that the trees are continuing to split on this column to memorize the training data. Another indication is if you see much smaller error on your training data than on your validation data. This means the trees are overfitting to the training data.
Some methods for handling high cardinality predictors are:
removing the predictor from the model
performing categorical encoding [pdf]
performing grid search on nbins_cats and categorical_encoding
There is a Python example in the H2O tutorials GitHub repo that showcases the effects of removing the predictor from the model and performing grid search here.

Standardization before or after categorical encoding?

I'm working on a regression algorithm, in this case k-NearestNeighbors to predict a certain price of a product.
So I have a Training set which has only one categorical feature with 4 possible values. I've dealt with it using a one-to-k categorical encoding scheme which means now I have 3 more columns in my Pandas DataFrame with a 0/1 depending the value present.
The other features in the DataFrame are mostly distances like latitud - longitude for locations and prices, all numerical.
Should I standardize (Gaussian distribution with zero mean and unit variance) and normalize before or after the categorical encoding?
I'm thinking it might be benefitial to normalize after encoding so that every feature is to the estimator as important as every other when measuring distances between neighbors but I'm not really sure.
Seems like an open problem, thus I'd like to answer even though it's late. I am also unsure how much the similarity between the vectors would be affected, but in my practical experience you should first encode your features and then scale them. I have tried the opposite with scikit learn preprocessing.StandardScaler() and it doesn't work if your feature vectors do not have the same length: scaler.fit(X_train) yields ValueError: setting an array element with a sequence. I can see from your description that your data have a fixed number of features, but I think for generalization purposes (maybe you have new features in the future?), it's good to assume that each data instance has a unique feature vector length. For instance, I transform my text documents into word indices with Keras text_to_word_sequence (this gives me the different vector length), then I convert them to one-hot vectors and then I standardize them. I have actually not seen a big improvement with the standardization. I think you should also reconsider which of your features to standardize, as dummies might not need to be standardized. Here it doesn't seem like categorical attributes need any standardization or normalization. K-nearest neighbors is distance-based, thus it can be affected by these preprocessing techniques. I would suggest trying either standardization or normalization and check how different models react with your dataset and task.
After. Just imagine that you have not numerical variables in your column but strings. You can't standardize strings - right? :)
But given what you wrote about categories. If they are represented with values, I suppose there is some kind of ranking inside. Probably, you can use raw column rather than one-hot-encoded. Just thoughts.
You generally want to standardize all your features so it would be done after the encoding (that is assuming that you want to standardize to begin with, considering that there are some machine learning algorithms that do not need features to be standardized to work well).
So there is 50/50 voting on whether to standardize data or not.
I would suggest, given the positive effects in terms of improvement gains no matter how small and no adverse effects, one should do standardization before splitting and training estimator

Categorical variables with large amounts of categories in XGBoost/CatBoost

I have a question regarding random forests. Imagine that I have data on users interacting with items. The number of items is large, around 10 000. My output of the random forest should be the items that the user is likely to interact with (like a recommender system). For any user, I want to use a feature that describes the items that the user has interacted with in the past. However, mapping the categorical product feature as a one-hot encoding seems very memory inefficient as a user interacts with no more than a couple of hundred of the items at most, and sometimes as little as 5.
How would you go about constructing a random forest when one of the input features is a categorical variable with ~10 000 possible values and the output is a categorical variable with ~10 000 possible values? Should I use CatBoost with the features as categorical? Or should I use one-hot encoding, and if so, do you think XGBoost or CatBoost does better?
You could also try entity embeddings to reduce hundreds of boolean features into vectors of small dimension.
It is similar to word embedings for categorical features. In practical terms you define an embedding of your discrete space of features into a vector space of low dimension. It can enhance your results and save on memory. The downside is that you do need to train a neural network model to define the embedding before hand.
Check this article for more information.
XGBoost doesn't support categorical features directly, you need to do the preprocessing to use it with catfeatures. For example, you could do one-hot encoding. One-hot encoding usually works well if there are some frequent values of your cat feature.
CatBoost does have categorical features support - both, one-hot encoding and calculation of different statistics on categorical features. To use one-hot encoding you need to enable it with one_hot_max_size parameter, by default statistics are calculated. Statistics usually work better for categorical features with many values.
Assuming you have enough domain expertise, you could create a new categorical column from existing column.
ex:-
if you column has below values
A,B,C,D,E,F,G,H
if you are aware that A,B,C are similar D,E,F are similar and G,H are similar
your new column would be
Z,Z,Z,Y,Y,Y,X,X.
In your random forest model you should removing previous column and only include this new column. By transforming your features like this you would loose explainability of your mode.

Resources