What kind of preprocessing was used to encode categorical variables in CatBoost benchmarks? - machine-learning

I've recently started to use CatBoost for rapid prototyping of machine learning models, inspired by the outstanding performance benchmarks of CatBoost compared to XGBoost, LightGBM and h2o.
Since XGBoost can only accept numeric features, a comparison between CatBoost and XGBoost needs a common preprocessing of categorical features. It is not entirely clear to me what kind of preprocessing was used to encode categorical features in the benchmarking experiments, and the rationale for not using a simple one-hot encoding.
I've tried to read the documentation of the experiments. As far as I understand it, the procedure to encode categorical feature j is about equivalent to the following:
1. On the train set, group the response y by j, aggregating with the mean function. Let's call the result df_agg_j.
2. Left join the train set and df_agg_j on the categorical column j, drop the original categorical column j and use the new numeric column instead.
3. Left join the valid set and df_agg_j on the categorical column j, drop the original categorical column j and use the new numeric column instead.
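The three steps above can be sketched in pandas roughly as follows. This is only an illustration of the procedure as I describe it, not the benchmark's actual code; the toy data and the column names `j`, `y` and `j_enc` are assumptions made for the example.

```python
import pandas as pd

# Toy data; column names "j" (categorical feature) and "y" (response) are illustrative.
train = pd.DataFrame({"j": ["a", "a", "b", "b", "b"], "y": [1, 0, 1, 1, 0]})
valid = pd.DataFrame({"j": ["a", "b", "c"]})

# Step 1: mean of the response per category, computed on the train set only.
df_agg_j = (
    train.groupby("j", as_index=False)["y"]
    .mean()
    .rename(columns={"y": "j_enc"})
)

# Steps 2 and 3: left join on j, then drop the original categorical column.
train_enc = train.merge(df_agg_j, on="j", how="left").drop(columns="j")
valid_enc = valid.merge(df_agg_j, on="j", how="left").drop(columns="j")

print(train_enc)
print(valid_enc)  # category "c" never appears in train, so its encoding is NaN
```

Note that every train row is encoded with a mean that includes its own response, which is one way such an encoding can leak the target.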
Still I don't understand the need for "a random permutation of the objects for j-th categorical feature and i-th object", and for adding 1 at the numerator and 2 to the denominator in the final formula under the section "Preparation of Splits" of the documentation.
The code for splitting and preprocessing the data can be found here.
Is there an explanation (or some reference in the literature) about the method used to encode categorical features in this experiment, and a comparison between this method and one-hot encoding?

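For categorical features, target-based statistics were used. This is currently the best way we know of to preprocess categorical features for GBDT, and it works better than one-hot encoding. It is similar to target encoding, but it uses a random permutation of the objects, so that the statistic for each object is computed only from the objects that precede it in the permutation. This keeps an object's own label out of its encoding and so avoids overfitting.
Details and comparisons about this approach can be found in the NIPS 2018 paper "CatBoost: unbiased boosting with categorical features" (https://arxiv.org/abs/1706.09516).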
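A minimal sketch of a permutation-based target statistic, to make the idea concrete. It is not CatBoost's implementation: the function name `ordered_target_stat` is hypothetical, and the smoothing (add 1 to the numerator and 2 to the denominator) is taken from the formula quoted in the question, which is one instance of the more general prior-weighted form discussed in the paper.

```python
import numpy as np

def ordered_target_stat(categories, targets, seed=0):
    """Encode each object's category using only the objects that precede it
    in a random permutation (illustrative sketch, not CatBoost's code)."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(categories))
    sums, counts = {}, {}
    encoded = np.empty(len(categories), dtype=float)
    for idx in perm:
        c = categories[idx]
        s, n = sums.get(c, 0.0), counts.get(c, 0)
        # Smoothed mean over *earlier* objects only: the object's own target is excluded.
        encoded[idx] = (s + 1.0) / (n + 2.0)
        # Only now is this object's target added to the running statistics.
        sums[c] = s + targets[idx]
        counts[c] = n + 1
    return encoded

cats = ["a", "a", "b", "b", "b"]
y = [1, 0, 1, 1, 0]
enc = ordered_target_stat(cats, y)
print(enc)
```

The first object of each category in the permutation has no history, so it gets the prior value 1/2. That is what the extra 1 and 2 are for: they keep the ratio defined and pull estimates based on few objects toward 0.5.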

Related

One-hot encoding in random forest classifier

Is one-hot encoding necessary for a random forest classifier in Python? I want to understand, logically, whether random forest can handle categorical features with label encoding rather than one-hot encoding.
Encoding is necessary in machine learning because it lets us convert non-numeric features into numeric ones, which any model can work with.
Any type of encoding can be applied to any non-numeric feature; the choice depends on your intuition about the feature.
Now, coming to your question when to use label-encoding and when to use One-hot encoding:
Use label encoding - Use this when you want to preserve the ordinal nature of your feature. For example, you have an education-level feature with string values like "Bachelor", "Master", "Ph.D". In this case you want to preserve the order Ph.D > Master > Bachelor, so you map the values with label encoding: Bachelor-1, Master-2, Ph.D-3.
Use one-hot encoding - Use this when the values of your categorical variable have no order. For example, you have a color variable with values "red", "yellow", "orange". Here no value takes precedence over the others, so you use one-hot encoding.
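As a small illustration of the two cases, here is a pandas sketch. The DataFrame, the column names and the `education_order` mapping are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "education": ["Bachelor", "Ph.D", "Master"],
    "color": ["red", "yellow", "orange"],
})

# Ordinal feature: an explicit mapping keeps Bachelor < Master < Ph.D.
education_order = {"Bachelor": 1, "Master": 2, "Ph.D": 3}
df["education_enc"] = df["education"].map(education_order)

# Nominal feature: one indicator column per value, no order implied.
color_dummies = pd.get_dummies(df["color"], prefix="color")

encoded = pd.concat([df[["education_enc"]], color_dummies], axis=1)
print(encoded)
```

An explicit mapping is used for the ordinal column because an automatic label encoder typically assigns integers in alphabetical order, which need not match the order you intend.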
NOTE: One-hot encoding increases the number of features, which is not good for tree-based algorithms such as decision trees and random forests. That's why label encoding is mostly preferred in this case. If you still use one-hot encoding, you can check the importance of the categorical features with the feature_importances_ attribute of the fitted model in sklearn. If a feature has low importance, you can drop it.
Random forest is built on decision trees, which are sensitive to one-hot encoding. Here, sensitive means that one-hot encoding can lead to sparse decision trees. The trees tend to grow in one direction because every split on a dummy variable has only two values (0 or 1), and the tree grows in the direction of the zeroes.
Now you must be wondering how to handle categorical values without one-hot encoding. For that, you can refer to the hashing trick; you can also look into h2o Random Forest.
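For reference, the hashing trick can be sketched in a few lines. The function `hash_encode` and the bucket count are assumptions made for this example; scikit-learn also ships a FeatureHasher class for the same purpose.

```python
import hashlib
import numpy as np

def hash_encode(values, n_buckets=8):
    """Map each category string to one of n_buckets columns via a stable hash.
    Illustrative sketch: distinct categories can collide in the same bucket."""
    X = np.zeros((len(values), n_buckets))
    for row, value in enumerate(values):
        digest = hashlib.md5(value.encode("utf-8")).hexdigest()
        X[row, int(digest, 16) % n_buckets] = 1.0
    return X

X = hash_encode(["India", "usa", "India"])
print(X)
```

The number of columns is fixed in advance and does not grow with the number of categories. The price is that two categories may share a column.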

Predicting over data that has categorical, numerical and text

I am trying to build a classifier for my dataset. Each observation in the data has categorical and numerical values, as well as a more general description in free text. I understand how to build a boosting algorithm to handle the categorical and numerical values, and I have already trained a neural network that predicts from the text quite successfully. What I'm trying to wrap my head around is how to integrate both approaches.
Embed your free text using a Language Model (e.g. averaging fasttext wordembeddings, or using google-universal-sentence-encoder) into an N-dim vector of floats. One hot encode the categorical stuff. Concatenate [embedding, one_hot_encoding, numericals] and badabing badaboom, you've got yourself 1 vector representing your datapoint.
Tensorflow hub's KerasLayer + https://tfhub.dev/google/universal-sentence-encoder/4 is def a good starting point. If you need to train something yourself, you could look into tf.keras.layers.Embedding.
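The concatenation step might look like the sketch below. `embed_text` is a hypothetical stand-in that returns a deterministic pseudo-random vector so the example runs without downloading a model; in practice you would replace it with a real sentence encoder. The category list and the numerical values are also made up.

```python
import numpy as np

CATEGORIES = ["red", "yellow", "orange"]  # illustrative category vocabulary

def embed_text(text, dim=4):
    """Hypothetical placeholder for a real sentence encoder: returns a
    fixed-size float vector that depends only on the input text."""
    seed = sum(text.encode("utf-8"))
    return np.random.default_rng(seed).normal(size=dim)

def one_hot(value):
    vec = np.zeros(len(CATEGORIES))
    vec[CATEGORIES.index(value)] = 1.0
    return vec

def featurize(text, category, numericals):
    # [embedding, one_hot_encoding, numericals] -> one flat vector per datapoint
    return np.concatenate(
        [embed_text(text), one_hot(category), np.asarray(numericals, dtype=float)]
    )

x = featurize("a short description", "yellow", [3.5, 12])
print(x.shape)
```

The resulting vectors can be stacked row by row and fed to whatever model you prefer, a boosting algorithm included.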

What does the ranker in Weka PCA tell us about feature selection?

I have a data set that is 31000 rows with 13 attributes. But because most are categorical I had to use NominalToBinary for those attributes so the attributes grew to 61.
I have sampled the data to 18000 rows and applied the PCA with ranker in Weka. centerData is false so it should normalise it for me.
This is my result:
0.945 1 -0.367Marial_Status= Married-civ-spouse-0.365Relationship= Husband+0.298Marial_Status= Never-married+0.244Age=0_23+0.232Gender= Female...
I understand that the ranking is the variance. So rank 1 is 94.5%? Now the issue I have with feature selection is: how do I know which ones to keep? Most of these attributes are categorical and were changed to numeric for the PCA. So, for the original data set with both categorical and numeric attributes, what does this output say about feature selection?
PCA assumes numerical data. If you binary encode your categorical variables, you basically take a hammer and make your data fit your model's assumption.
Another way to deal with categorical features is to use non-linear feature transformations, which find a suitable way to represent distances between categories. A quick Google search turned up Categorical Principal Components Analysis (CATPCA) for me. Maybe have a look at this.

How to combine TFIDF features with other features

I have a classic NLP problem: I have to classify a news item as fake or real.
I have created two sets of features:
A) Bigram Term Frequency-Inverse Document Frequency
B) Approximately 20 features associated with each document, obtained using pattern.en (https://www.clips.uantwerpen.be/pages/pattern-en), such as subjectivity of the text, polarity, #stopwords, #verbs, #subject, grammatical relations etc ...
Which is the best way to combine the TFIDF features with the other features for a single prediction?
Thanks a lot to everyone.
Not sure if you're asking how to combine two objects in code or what to do theoretically afterwards, so I will try to answer both.
Technically your TFIDF is just a matrix where the rows are records and the columns are features. As such to combine you can append your new features as columns to the end of the matrix. Probably your matrix is a sparse matrix (from Scipy) if you did this with sklearn so you will have to make sure your new features are a sparse matrix as well (or make the other dense).
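A minimal sketch of that column-wise append, assuming scikit-learn and SciPy are available. The documents and the two extra features (standing in for things like subjectivity and #verbs) are made up for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the market fell sharply today",
    "aliens endorse local mayor",
    "the mayor spoke today",
]
# Hypothetical dense features, one row per document.
extra = np.array([[0.1, 5.0], [0.9, 3.0], [0.2, 4.0]])

# Bigram TF-IDF: a SciPy sparse matrix, rows are documents, columns are bigrams.
tfidf = TfidfVectorizer(ngram_range=(2, 2)).fit_transform(docs)

# Convert the dense block to sparse and append it as extra columns.
X = hstack([tfidf, csr_matrix(extra)]).tocsr()
print(tfidf.shape, X.shape)
```

Going the other way (`tfidf.toarray()` plus `np.hstack`) also works, but with a realistic vocabulary the dense matrix can get very large.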
That gives you your training data; what to do with it next is a little more tricky. Your features from a bigram frequency matrix will be sparse (I'm not talking about data structures here, I just mean that you will have a lot of 0s), and they will be binary, whilst your other data is dense and continuous. This will run in most machine learning algorithms as is, although the prediction will probably be dominated by the dense variables. However, with a bit of feature engineering I have built several classifiers in the past using tree ensembles that take a combination of term-frequency variables enriched with some other, denser variables and give boosted results (for example, a classifier that looks at Twitter profiles and classifies them as companies or people). Usually I found better results when I could at least bin the dense variables into binary (or into categorical, then one-hot encoded into binary) so that they didn't dominate.
What if you train a classifier on the TF-IDF features alone, and then add its predictions (the predicted probabilities) as a new feature alongside the other features? An AutoML blueprint I looked at does the same. The results were > 90 percent for this combined approach vs 80 percent for the two separate classifiers.
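One way to read this suggestion is as a simple form of stacking. The sketch below assumes scikit-learn; the documents, labels and the extra feature are invented, and logistic regression is just one possible choice for the text model.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

docs = [
    "officials confirm budget figures", "miracle cure hidden by doctors",
    "council votes on road repairs", "secret moon base leaked",
    "court publishes annual report", "celebrity clone spotted again",
    "school opens new library", "shocking trick banks hate",
]
y = np.array([0, 1, 0, 1, 0, 1, 0, 1])      # 0 = real, 1 = fake (toy labels)
extra = np.array([[0.1], [0.9], [0.2], [0.8], [0.1], [0.7], [0.3], [0.9]])

# For brevity the vectorizer is fitted on all documents; a stricter setup
# would fit it inside each fold (for example with a Pipeline).
tfidf = TfidfVectorizer().fit_transform(docs)

# Out-of-fold probabilities: each document is scored by a model that never saw its label.
text_proba = cross_val_predict(
    LogisticRegression(), tfidf, y, cv=2, method="predict_proba"
)[:, 1]

# The text model's probability becomes one more column next to the other features.
X_stacked = np.column_stack([extra, text_proba])
print(X_stacked.shape)
```

Using out-of-fold predictions rather than predictions on the training rows themselves matters here: otherwise the new feature would look far more informative in training than it will be on unseen data.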

Random forest in sklearn

I was trying to fit a random forest model using the random forest classifier from sklearn. However, my data set has columns with string values ('country'). The random forest classifier here does not take string values; it needs numerical values for all the features. I thought of creating dummy variables in place of such columns, but I am confused about how the feature importance plot will look then. There will be variables like country_India, country_usa etc. How can I get the consolidated importance of the country variable, as I would get if I had done my analysis in R?
You will have to do it by hand. There is no support in sklearn for mapping classifier-specific methods through the inverse transform of feature mappings. R calculates importances based on multi-valued splits (as @Soren explained); with scikit-learn you are limited to binary splits and have to approximate the actual importance. One of the simplest solutions (although biased) is to store which features are actually binary encodings of your categorical variable and sum the corresponding elements of the feature importance vector. This is not fully justified from a mathematical perspective, but it is the simplest way to get a rough estimate. To do it correctly you would have to reimplement feature importance from scratch and, when computing "for how many samples the feature is active during classification", use your mapping to assign each sample only once to the actual feature (adding dummy importances counts each dummy variable on the classification path, whereas you want min(1, #dummy on path) instead).
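The simple (biased) summing approach could be sketched like this, assuming pandas and scikit-learn. The toy data and the `country_` prefix convention come from pd.get_dummies; this gives only the rough estimate described above, not R's importance.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

df = pd.DataFrame({
    "country": ["India", "usa", "India", "uk", "usa", "uk", "India", "usa"],
    "age": [25, 40, 31, 52, 47, 38, 29, 60],
})
y = [0, 1, 0, 1, 1, 1, 0, 1]

# One-hot encode: produces country_India, country_uk, country_usa next to age.
X = pd.get_dummies(df, columns=["country"])
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

importances = pd.Series(model.feature_importances_, index=X.columns)

# Map each dummy column back to its source variable, then sum per variable.
source = {
    col: ("country" if col.startswith("country_") else col) for col in X.columns
}
consolidated = importances.groupby(source).sum()
print(consolidated)
```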
A random enumeration of the countries (assigning some integer to each category) will sometimes work quite well, especially if the categories are few and the training set is large. Sometimes it works better than one-hot encoding.
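A random enumeration takes only a few lines of pandas. The series below is illustrative, and the seed is fixed just to make the example repeatable.

```python
import numpy as np
import pandas as pd

countries = pd.Series(["India", "usa", "uk", "India", "usa"])

# Assign each distinct category an arbitrary integer code.
rng = np.random.default_rng(0)
levels = countries.unique()
codes = dict(zip(levels, rng.permutation(len(levels))))

country_enc = countries.map(codes)
print(codes)
print(country_enc.tolist())
```

Keep the `codes` mapping around so that the same integers can be applied to validation and test data.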
Some threads discussing the two options with sklearn:
https://github.com/scikit-learn/scikit-learn/issues/5442
How to use dummy variable to represent categorical data in python scikit-learn random forest
You can also choose an RF algorithm that truly supports categorical data, such as Arborist (Python and R front ends), extraTrees (R, Java, RF-ish) or randomForest (R). Why sklearn chose not to support categorical splits, I don't know. Perhaps convenience of implementation.
The number of possible categorical splits to try blows up after 10 categories and the search becomes slow and the splits may become greedy. Arborist and extraTrees will only try a limited selection of splits in each node.
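To see how quickly the search space grows: splitting k categories into two non-empty groups can be done in 2^(k-1) - 1 ways if every partition is tried. The helper below just evaluates that count; real implementations may use shortcuts, so treat it as an upper bound for exhaustive search.

```python
def n_binary_splits(k):
    """Number of ways to split k categories into two non-empty groups."""
    return 2 ** (k - 1) - 1

for k in (3, 10, 20):
    print(k, n_binary_splits(k))
# 3 3
# 10 511
# 20 524287
```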
