Random forest in sklearn - machine-learning

I was trying to fit a random forest model using the random forest classifier package from sklearn. However, my data set consists of columns with string values ('country'). The random forest classifier here does not take string values. It needs numerical values for all the features. I thought of getting some dummy variables in place of such columns. But, I am confused as to how will the feature importance plot now look like. There will be variables like country_India, country_usa etc. How can get the consolidated importance of the country variable as I would get if I had done my analysis using R.

You will have to do it by hand. There is no support in sklearn for mapping classifier specific methods through inverse transform of feature mappings. R is calculating importances based on multi-valued splits (as #Soren explained) - when using scikit-learn you are limtied to binary splits and you have to approximate actual importance. One of the simpliest solutions (although biased) is to store which features are actually binary encodings of your categorical variable and sum these resulting elements from feature importance vector. This will not be fully justified from mathematical perspective, but the simpliest thing to do to get some rough estimate. To do it correctly you should reimplement feature importance from scratch, and simply during calculation "for how many samples the feature is active during classification", you would have to use your mapping to correctly asses each sample only once to the actual feature (as adding dummy importances will count each dummy variable on the classification path, and you want to do min(1, #dummy on path) instead).

A random enumeration(assigning some integer to each category) of the countries will work quite well sometimes. Especially if categories are few and training set size is large. Sometimes better than one-hot encoding.
Some threads discussing the two options with sklearn:
https://github.com/scikit-learn/scikit-learn/issues/5442
How to use dummy variable to represent categorical data in python scikit-learn random forest
You can also choose to use an RF algorithm that truly supports categorical data such as Arborist(python and R front end), extraTrees(R, Java, RF'isch) or randomForest(R). Why sklearn chose not to support categorical splits, I don't know. Perhaps convenience of implementation.
The number of possible categorical splits to try blows up after 10 categories and the search becomes slow and the splits may become greedy. Arborist and extraTrees will only try a limited selection of splits in each node.

Related

How to deal with multiple categorical variables each with different cardinality?

I’m working with an auto dataset I found on kaggle. Besides numerical values like horsepower, car length, car weight etc., it has multiple categorical variables such as:
car type (sedan, suv, hatchback etc.): cardinality=5
car brand (toyota, Nissan, bmw etc.): cardinality=21
Doors (2door and 4door): cardinality=2
Fuel type (gas and diesel): cardinality =2
I would like to use a random forest classifier to perform feature selection with all these variables as input. I’m aware that the categorical variables need to be encoded before doing so. What is the best approach to handling data with such varying cardinalities?
Can I apply different encoding techniques to different variables? Say for example, one hot encoding on fuel type and label encoding on car type?
You can apply different encoding techniques to different variables. However, label encoding introduces hierarchy/order, which doesn't look appropriate for any of the predictors you mention. For this example it looks like one-hot-encoding all the categorical predictors would be better, unless there are some ordinal variables you haven't mentioned.
EDIT: in response to your comment. I would only ever use label encoding if a categorical predictor was ordinal. If they aren't, I would not try and enforce it, and would use one-hot-encoding if the model type couldn't cope with categorical predictors. Whether this causes an issue regarding sparse trees and too many predictors depends entirely on your dataset. If you still have many rows compared to predictors then it generally isn't a problem. You can run into issues with random forests if you have a lot of predictors that aren't correlated at all with the target variable. In this case, as predictors are chosen randomly, you can end up with lots of trees that don't contain any relevant predictors, creating noise. In this case you could try and remove non-relevant predictors before running the random forest model. Or you could try using a different type of model, e.g. penalized regression.

How to combine TFIDF features with other features

I have a classic NLP problem, I have to classify a news as fake or real.
I have created two sets of features:
A) Bigram Term Frequency-Inverse Document Frequency
B) Approximately 20 Features associated to each document obtained using pattern.en (https://www.clips.uantwerpen.be/pages/pattern-en) as subjectivity of the text, polarity, #stopwords, #verbs, #subject, relations grammaticals etc ...
Which is the best way to combine the TFIDF features with the other features for a single prediction?
Thanks a lot to everyone.
Not sure if your asking technically how to combine two objects in code or what to do theoretically after so I will try and answer both.
Technically your TFIDF is just a matrix where the rows are records and the columns are features. As such to combine you can append your new features as columns to the end of the matrix. Probably your matrix is a sparse matrix (from Scipy) if you did this with sklearn so you will have to make sure your new features are a sparse matrix as well (or make the other dense).
That gives you your training data, in terms of what to do with it then it is a little more tricky. Your features from a bigram frequency matrix will be sparse (im not talking data structures here I just mean that you will have a lot of 0s), and it will be binary. Whilst your other data is dense and continuous. This will run in most machine learning algorithms as is although the prediction will probably be dominated by the dense variables. However with a bit of feature engineering I have built several classifiers in the past using tree ensambles that take a combination of term-frequency variables enriched with some other more dense variables and give boosted results (for example a classifier that looks at twitter profiles and classifies them as companies or people). Usually I found better results when I could at least bin the dense variables into binary (or categorical and then hot encoded into binary) so that they didn't dominate.
What if you do use a classifier for the tfidf but use the pred to add a new feature say tfidf and the probabilities of it to give a better result, here is a pic from auto ml blueprint to show you the same The results were > 90 percent vs 80 percent for current vs the two separate classifier ones

Standardization before or after categorical encoding?

I'm working on a regression algorithm, in this case k-NearestNeighbors to predict a certain price of a product.
So I have a Training set which has only one categorical feature with 4 possible values. I've dealt with it using a one-to-k categorical encoding scheme which means now I have 3 more columns in my Pandas DataFrame with a 0/1 depending the value present.
The other features in the DataFrame are mostly distances like latitud - longitude for locations and prices, all numerical.
Should I standardize (Gaussian distribution with zero mean and unit variance) and normalize before or after the categorical encoding?
I'm thinking it might be benefitial to normalize after encoding so that every feature is to the estimator as important as every other when measuring distances between neighbors but I'm not really sure.
Seems like an open problem, thus I'd like to answer even though it's late. I am also unsure how much the similarity between the vectors would be affected, but in my practical experience you should first encode your features and then scale them. I have tried the opposite with scikit learn preprocessing.StandardScaler() and it doesn't work if your feature vectors do not have the same length: scaler.fit(X_train) yields ValueError: setting an array element with a sequence. I can see from your description that your data have a fixed number of features, but I think for generalization purposes (maybe you have new features in the future?), it's good to assume that each data instance has a unique feature vector length. For instance, I transform my text documents into word indices with Keras text_to_word_sequence (this gives me the different vector length), then I convert them to one-hot vectors and then I standardize them. I have actually not seen a big improvement with the standardization. I think you should also reconsider which of your features to standardize, as dummies might not need to be standardized. Here it doesn't seem like categorical attributes need any standardization or normalization. K-nearest neighbors is distance-based, thus it can be affected by these preprocessing techniques. I would suggest trying either standardization or normalization and check how different models react with your dataset and task.
After. Just imagine that you have not numerical variables in your column but strings. You can't standardize strings - right? :)
But given what you wrote about categories. If they are represented with values, I suppose there is some kind of ranking inside. Probably, you can use raw column rather than one-hot-encoded. Just thoughts.
You generally want to standardize all your features so it would be done after the encoding (that is assuming that you want to standardize to begin with, considering that there are some machine learning algorithms that do not need features to be standardized to work well).
So there is 50/50 voting on whether to standardize data or not.
I would suggest, given the positive effects in terms of improvement gains no matter how small and no adverse effects, one should do standardization before splitting and training estimator

Categorical variables with large amounts of categories in XGBoost/CatBoost

I have a question regarding random forests. Imagine that I have data on users interacting with items. The number of items is large, around 10 000. My output of the random forest should be the items that the user is likely to interact with (like a recommender system). For any user, I want to use a feature that describes the items that the user has interacted with in the past. However, mapping the categorical product feature as a one-hot encoding seems very memory inefficient as a user interacts with no more than a couple of hundred of the items at most, and sometimes as little as 5.
How would you go about constructing a random forest when one of the input features is a categorical variable with ~10 000 possible values and the output is a categorical variable with ~10 000 possible values? Should I use CatBoost with the features as categorical? Or should I use one-hot encoding, and if so, do you think XGBoost or CatBoost does better?
You could also try entity embeddings to reduce hundreds of boolean features into vectors of small dimension.
It is similar to word embedings for categorical features. In practical terms you define an embedding of your discrete space of features into a vector space of low dimension. It can enhance your results and save on memory. The downside is that you do need to train a neural network model to define the embedding before hand.
Check this article for more information.
XGBoost doesn't support categorical features directly, you need to do the preprocessing to use it with catfeatures. For example, you could do one-hot encoding. One-hot encoding usually works well if there are some frequent values of your cat feature.
CatBoost does have categorical features support - both, one-hot encoding and calculation of different statistics on categorical features. To use one-hot encoding you need to enable it with one_hot_max_size parameter, by default statistics are calculated. Statistics usually work better for categorical features with many values.
Assuming you have enough domain expertise, you could create a new categorical column from existing column.
ex:-
if you column has below values
A,B,C,D,E,F,G,H
if you are aware that A,B,C are similar D,E,F are similar and G,H are similar
your new column would be
Z,Z,Z,Y,Y,Y,X,X.
In your random forest model you should removing previous column and only include this new column. By transforming your features like this you would loose explainability of your mode.

How to include words as numerical feature in classification

Whats the best method to use the words itself as the features in any machine learning algorithm ?
The problem I have to extract word related feature from a particular paragraph. Should I use the index in the dictionary as the numerical feature ? If so, how will I normalize these ?
In general, How are words itself used as features in NLP ?
There are several conventional techniques by which words are mapped to features (columns in a 2D data matrix in which the rows are the individual data vectors) for input to machine learning models.classification:
a Boolean field which encodes the presence or absence of that word in a given document;
a frequency histogram of a
predetermined set of words, often the X most commonly occurring words from among all documents comprising the training data (more about this one in the
last paragraph of this Answer);
the juxtaposition of two or more
words (e.g., 'alternative' and
'lifestyle' in consecutive order have
a meaning not related either
component word); this juxtaposition can either be captured in the data model itself, eg, a boolean feature that represents the presence or absence of two particular words directly adjacent to one another in a document, or this relationship can be exploited in the ML technique, as a naive Bayesian classifier would do in this instanceemphasized text;
words as raw data to extract latent features, eg, LSA or Latent Semantic Analysis (also sometimes called LSI for Latent Semantic Indexing). LSA is a matrix decomposition-based technique which derives latent variables from the text not apparent from the words of the text itself.
A common reference data set in machine learning is comprised of frequencies of 50 or so of the most common words, aka "stop words" (e.g., a, an, of, and, the, there, if) for published works of Shakespeare, London, Austen, and Milton. A basic multi-layer perceptron with a single hidden layer can separate this data set with 100% accuracy. This data set and variations on it are widely available in ML Data Repositories and academic papers presenting classification results are likewise common.
Standard approach is the "bag-of-words" representation where you have one feature per word, giving "1" if the word occurs in the document and "0" if it doesn't occur.
This gives lots of features, but if you have a simple learner like Naive Bayes, that's still OK.
"Index in the dictionary" is a useless feature, I wouldn't use it.
tf-idf is a pretty standard way of turning words into numeric features.
You need to remember to use a learning algorithm that supports numeric featuers, like SVM. Naive Bayes doesn't support numeric features.

Resources