Is sklearn.naive_bayes.CategoricalNB the same as sklearn.naive_bayes.BernoulliNB, but with one hot encoding in the columns?
I couldn't quite tell from the documentation, and CategoricalNB has that one extra parameter alpha whose purpose I don't understand.
The categorical distribution is the Bernoulli distribution, generalized to more than two categories. Stated another way, the Bernoulli distribution is a special case of the categorical distribution, with exactly 2 categories.
In the Bernoulli model, each feature is assumed to have exactly 2 categories, often denoted as 1 and 0 or True and False. In the categorical model, each feature is assumed to have at least 2 categories, and each feature may have a different total number of categories.
One-hot encoding is unrelated to either model. It is a technique for encoding a categorical variable in a numerical matrix. It has no bearing on the actual distribution used to model that categorical variable, although it is natural to model categorical variables using the categorical distribution.
The "alpha" parameter is called the Laplace smoothing parameter. I will not go into detail about it here, because that is better suited for CrossValidated, e.g. https://stats.stackexchange.com/q/192233/36229. From a computational perspective, it exists in order to prevent "poisoning" the calculations with 0s, which propagate multiplicatively throughout the model. This is a practical concern that arises whenever some combination of class label and feature category is not present in your data set. It's fine to leave it at the default value of 1.
I’m working with an auto dataset I found on Kaggle. Besides numerical values like horsepower, car length, car weight etc., it has multiple categorical variables such as:
car type (sedan, SUV, hatchback, etc.): cardinality = 5
car brand (Toyota, Nissan, BMW, etc.): cardinality = 21
doors (2-door and 4-door): cardinality = 2
fuel type (gas and diesel): cardinality = 2
I would like to use a random forest classifier to perform feature selection with all these variables as input. I’m aware that the categorical variables need to be encoded before doing so. What is the best approach to handling data with such varying cardinalities?
Can I apply different encoding techniques to different variables? Say for example, one hot encoding on fuel type and label encoding on car type?
You can apply different encoding techniques to different variables. However, label encoding introduces hierarchy/order, which doesn't look appropriate for any of the predictors you mention. For this example it looks like one-hot-encoding all the categorical predictors would be better, unless there are some ordinal variables you haven't mentioned.
EDIT: in response to your comment. I would only ever use label encoding if a categorical predictor were ordinal. If it isn't, I would not try to enforce an order, and would use one-hot encoding if the model type couldn't cope with categorical predictors.
Whether this causes an issue with sparse trees and too many predictors depends entirely on your dataset. If you still have many rows compared to predictors, it generally isn't a problem. You can run into issues with random forests if you have a lot of predictors that aren't correlated at all with the target variable: because predictors are chosen randomly at each split, you can end up with lots of trees that don't contain any relevant predictors, which creates noise. In that case you could try to remove non-relevant predictors before running the random forest model, or you could try a different type of model, e.g. penalized regression.
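If you do end up mixing encoders for different columns, scikit-learn's ColumnTransformer keeps it tidy. A minimal sketch, with column names assumed from the question:

```python
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

categorical_cols = ["car_type", "car_brand", "doors", "fuel_type"]   # assumed names
numeric_cols = ["horsepower", "car_length", "car_weight"]

preprocess = ColumnTransformer(
    transformers=[
        # Nominal predictors (no natural order): one-hot encode.
        ("onehot", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
        # Numeric columns pass through unchanged.
        ("num", "passthrough", numeric_cols),
        # If you did have an ordinal column, you could add e.g.
        # ("ordinal", OrdinalEncoder(categories=[["low", "mid", "high"]]), ["trim_level"]),
    ]
)

model = Pipeline([("prep", preprocess), ("rf", RandomForestClassifier(n_estimators=200))])
# model.fit(X_train, y_train)   # X_train: a DataFrame with the columns above
```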
My model is based on a decision-tree algorithm, hence I want to avoid one-hot encoding as it will increase training time. I know of a technique where, instead of one-hot encoding, I can encode each category by its probability with respect to my classified output, but I don't know how to apply that probability part.
Decision Trees can handle both numerical and categorical variables, therefore there is no need to encode your categorical variables.
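If you still want the probability-style encoding described in the question (commonly called target or mean encoding), a minimal pandas sketch might look like this; the column names and data are made up:

```python
import pandas as pd

df = pd.DataFrame({
    "category": ["a", "a", "b", "b", "b", "c"],
    "target":   [ 1,   0,   1,   1,   0,   0 ],
})

# For each category, the mean of a 0/1 target is P(target = 1 | category).
encoding = df.groupby("category")["target"].mean()
df["category_encoded"] = df["category"].map(encoding)

# In practice, compute the encoding on the training fold only and map it onto
# the validation/test folds to avoid target leakage.
print(df)
```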
I have a question regarding random forests. Imagine that I have data on users interacting with items. The number of items is large, around 10 000. My output of the random forest should be the items that the user is likely to interact with (like a recommender system). For any user, I want to use a feature that describes the items that the user has interacted with in the past. However, mapping the categorical product feature as a one-hot encoding seems very memory inefficient, as a user interacts with no more than a couple of hundred of the items, and sometimes as few as 5.
How would you go about constructing a random forest when one of the input features is a categorical variable with ~10 000 possible values and the output is a categorical variable with ~10 000 possible values? Should I use CatBoost with the features as categorical? Or should I use one-hot encoding, and if so, do you think XGBoost or CatBoost does better?
You could also try entity embeddings to reduce hundreds of boolean features into vectors of small dimension.
It is similar to word embeddings for categorical features. In practical terms you define an embedding of your discrete space of features into a vector space of low dimension. It can enhance your results and save on memory. The downside is that you do need to train a neural network model to define the embedding beforehand.
Check this article for more information.
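As an illustration, a minimal PyTorch sketch of such an embedding; the library choice, the sizes and the next-item objective here are assumptions for illustration, not a prescription:

```python
import torch
import torch.nn as nn

NUM_ITEMS = 10_000   # ~10 000 possible item IDs
EMB_DIM = 16         # small embedding dimension (a tunable choice)

class ItemModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.embedding = nn.Embedding(NUM_ITEMS, EMB_DIM)
        self.head = nn.Linear(EMB_DIM, NUM_ITEMS)   # e.g. predict the next item

    def forward(self, item_ids):
        # item_ids: LongTensor of item indices, shape (batch,)
        return self.head(self.embedding(item_ids))

model = ItemModel()
scores = model(torch.randint(0, NUM_ITEMS, (4,)))   # logits over items for 4 users

# After training, model.embedding.weight (10 000 x 16) gives a dense vector per
# item that can replace 10 000 one-hot columns as input to a tree model.
```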
XGBoost doesn't support categorical features directly; you need to do the preprocessing yourself to use it with categorical features. For example, you could do one-hot encoding. One-hot encoding usually works well if your categorical feature has some frequent values.
CatBoost does support categorical features: both one-hot encoding and the calculation of different statistics on categorical features. To use one-hot encoding you need to enable it with the one_hot_max_size parameter; by default, the statistics are calculated. Statistics usually work better for categorical features with many values.
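For example, a minimal CatBoost sketch (toy data; the parameter values are only illustrative):

```python
from catboost import CatBoostClassifier

# Toy data: two categorical columns (as strings) and a binary target.
X = [["red", "small"], ["blue", "large"], ["red", "large"], ["green", "small"],
     ["blue", "small"], ["green", "large"], ["red", "small"], ["blue", "large"]]
y = [0, 1, 1, 0, 1, 0, 0, 1]

model = CatBoostClassifier(
    one_hot_max_size=10,   # one-hot encode categorical features with <= 10 unique values;
                           # higher-cardinality ones get target statistics instead
    iterations=50,
    verbose=False,
)
model.fit(X, y, cat_features=[0, 1])   # tell CatBoost which columns are categorical
print(model.predict([["red", "small"]]))
```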
Assuming you have enough domain expertise, you could create a new categorical column from an existing column.
For example, if your column has the values
A, B, C, D, E, F, G, H
and you are aware that A, B, C are similar, D, E, F are similar, and G, H are similar, your new column would be
Z, Z, Z, Y, Y, Y, X, X.
In your random forest model you should remove the previous column and include only this new column. Note that by transforming your features like this you lose some explainability of your model.
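A minimal pandas sketch of that grouping (the column names here are made up):

```python
import pandas as pd

df = pd.DataFrame({"old_col": list("ABCDEFGH")})

grouping = {"A": "Z", "B": "Z", "C": "Z",
            "D": "Y", "E": "Y", "F": "Y",
            "G": "X", "H": "X"}

df["new_col"] = df["old_col"].map(grouping)
df = df.drop(columns=["old_col"])   # keep only the grouped column for the model
print(df)
```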
I was trying to fit a random forest model using the random forest classifier from sklearn. However, my data set consists of columns with string values ('country'). The random forest classifier here does not take string values; it needs numerical values for all the features. I thought of creating dummy variables in place of such columns, but I am confused about what the feature importance plot will now look like. There will be variables like country_India, country_usa, etc. How can I get the consolidated importance of the country variable, as I would get if I had done my analysis in R?
You will have to do it by hand. There is no support in sklearn for mapping classifier-specific methods through an inverse transform of the feature mappings. R calculates importances based on multi-valued splits (as @Soren explained); with scikit-learn you are limited to binary splits and you have to approximate the actual importance.
One of the simplest solutions (although biased) is to store which features are actually binary encodings of your categorical variable and sum the corresponding elements of the feature importance vector. This is not fully justified from a mathematical perspective, but it is the simplest way to get a rough estimate.
To do it correctly you would have to reimplement the feature importance from scratch: when counting "for how many samples the feature is active during classification", you would have to use your mapping so that each sample is attributed to the actual feature only once (summing dummy importances counts every dummy variable on the classification path, whereas you want min(1, #dummies on path) instead).
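A rough sketch of that "sum the dummy importances" approximation (toy data, column names made up):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Toy data: one numeric column and one categorical "country" column.
df = pd.DataFrame({
    "age": [25, 32, 47, 51, 62, 23, 43, 36],
    "country": ["India", "usa", "India", "usa", "India", "usa", "India", "usa"],
})
y = [0, 1, 0, 1, 0, 1, 0, 1]

X_dummies = pd.get_dummies(df, columns=["country"])   # country_India, country_usa, ...
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_dummies, y)

importances = pd.Series(rf.feature_importances_, index=X_dummies.columns)

# Collapse every "country_*" dummy back into a single "country" importance.
grouped = importances.groupby(
    lambda col: "country" if col.startswith("country_") else col
).sum()
print(grouped.sort_values(ascending=False))
```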
A random enumeration (assigning some integer to each category) of the countries will sometimes work quite well, especially if the categories are few and the training set is large. Sometimes it works better than one-hot encoding.
Some threads discussing the two options with sklearn:
https://github.com/scikit-learn/scikit-learn/issues/5442
How to use dummy variable to represent categorical data in python scikit-learn random forest
You can also choose to use an RF algorithm that truly supports categorical data, such as Arborist (Python and R front ends), extraTrees (R, Java, RF'isch) or randomForest (R). Why sklearn chose not to support categorical splits, I don't know; perhaps convenience of implementation.
The number of possible categorical splits to try blows up beyond roughly 10 categories, the search becomes slow, and the splits may become greedy. Arborist and extraTrees will only try a limited selection of splits in each node.
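To see why it blows up: a binary split on a feature with k categories has to choose one of 2^(k-1) - 1 ways of dividing the categories into two non-empty groups, for example:

```python
# Number of distinct binary splits of a categorical feature with k levels.
for k in (5, 10, 20, 30):
    print(k, 2 ** (k - 1) - 1)
# 5 -> 15, 10 -> 511, 20 -> 524287, 30 -> 536870911
```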
I am a newbie in machine learning and natural language processing.
I am always confused about these three terms: class, features, and parameters.
From my understanding:
class: the various categories our model outputs. For example, given the name of a person, identify whether he/she is male or female.
Let's say I am using a Naive Bayes classifier.
What would be my features and parameters?
Also, what are some aliases for the above terms that are used interchangeably?
Thank you
Let's use the example of classifying the gender of a person. Your understanding about class is correct! Given an input observation, our Naive Bayes Classifier should output a category. The class is that category.
Features: Features in a Naive Bayes Classifier, or any general ML Classification Algorithm, are the data points we choose to define our input. For the example of a person, we can't possibly input all data points about a person; instead, we pick a few features to define a person (say "Height", "Weight", and "Foot Size"). Specifically, in a Naive Bayes Classifier, the key assumption we make is that these features are independent (they don't affect each other): a person's height doesn't affect weight, which doesn't affect foot size. This assumption may or may not be true, but for a Naive Bayes, we assume that it is true. In the particular case of your example where the input is just the name, features might be the frequency of letters, the number of vowels, the length of the name, or suffixes/prefixes.
Parameters: Parameters in Naive Bayes are the estimates of the true distribution of whatever we're trying to classify. For example, we could say that roughly 50% of people are male, and the distribution of male height is a Gaussian distribution with mean 5' 7" and standard deviation 3". The parameters would be the 50% estimate, the 5' 7" mean estimate, and the 3" standard deviation estimate.
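To make the parameters concrete, here is a minimal scikit-learn sketch using the classic height/weight/foot-size toy example (the numbers are made up); the fitted attributes are exactly those estimates:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Features: [height (inches), weight (lbs), foot size (inches)] -- made-up values.
X = np.array([[72, 180, 12], [69, 175, 11], [70, 190, 12],
              [62, 120,  7], [65, 130,  8], [63, 115,  7]])
y = np.array(["male", "male", "male", "female", "female", "female"])

clf = GaussianNB().fit(X, y)

# The fitted attributes are the "parameters": estimated class priors and the
# per-class mean and variance of each feature's Gaussian.
print(clf.class_prior_)   # e.g. [0.5, 0.5]
print(clf.theta_)         # per-class feature means
print(clf.var_)           # per-class feature variances (sigma_ in older versions)
```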
Aliases: Features are also referred to as attributes. I'm not aware of any common replacements for 'parameters'.
I hope that was helpful!
@txizzle explained the case of Naive Bayes well. In a more general sense:
Class: The output category of your data. You can also simply call these categories. The labels on your data will point to one of the classes (if it's a classification problem, of course).
Features: The characteristics that define your problem. These are also called attributes.
Parameters: The variables your algorithm is trying to tune to build an accurate model.
As an example, let us say you are trying to decide whether to admit a student to grad school based on various factors like his/her undergrad GPA, test scores, scores on recommendations, projects, etc. In this case, the factors mentioned above are your features/attributes, whether or not the student is admitted gives your 2 classes, and the numbers that decide how these features combine to produce your output are your parameters. What the parameters actually represent depends on your algorithm. For a neural net, it's the weights on the synaptic links. Similarly, for a regression problem, the parameters are the coefficients of your features when they are combined.
Take a simple linear classification problem:
y = 0 if 5x - 3 >= 0, else 1
Here y is the class, x is the feature, and 5 and 3 are the parameters.
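The same toy rule written as a couple of lines of Python, just to make the roles concrete:

```python
def predict(x, w=5, b=3):                # w and b play the role of the parameters
    return 0 if w * x - b >= 0 else 1    # the returned value is the class

print(predict(1))   # 5*1 - 3 >= 0  ->  class 0
print(predict(0))   # 5*0 - 3 <  0  ->  class 1
```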
I just wanted to add a definition that distinguishes between attributes and features, as these are often used interchangeably, and it may not be correct to do so. I'm quoting 'Hands-On Machine Learning with SciKit-Learn and TensorFlow'.
In Machine Learning an attribute is a data type (e.g., “Mileage”), while a feature has several meanings depending on the context, but generally means an attribute plus its value (e.g., “Mileage = 15,000”). Many people use the words attribute and feature interchangeably, though.
I like the definition in “Hands-On Machine Learning with Scikit-Learn and TensorFlow” (by Aurélien Géron) where
ATTRIBUTE = DATA TYPE (e.g., Mileage)
FEATURE = DATA TYPE + VALUE (e.g., Mileage = 50000)
Regarding FEATURE versus PARAMETER, based on the definition in Geron’s book I used to interpret FEATURE as the variable and the PARAMETER as the weight or coefficient, such as in the model below
Y = a + b*X
X is the FEATURE
a, b are the PARAMETERS
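Under that reading, a minimal scikit-learn sketch (toy numbers) where the fitted intercept and coefficient are what the model learns:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0], [4.0]])   # the FEATURE
Y = np.array([3.1, 5.0, 6.9, 9.1])           # roughly Y = 1 + 2*X

reg = LinearRegression().fit(X, Y)
print(reg.intercept_)   # a -- learned during fitting (the intercept)
print(reg.coef_)        # b -- learned during fitting (some texts call this a WEIGHT)
```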
However, in some publications I have seen the following interpretation:
X is the PARAMETER
a, b are the WEIGHTS
So, lately, I’ve begun to use the following definitions:
FEATURE = variables of the RAW DATA (e.g., all columns in the spreadsheet)
PARAMETER = variables used in the MODEL (i.e., after selecting the features that will be in the model)
WEIGHT = coefficients of the parameters of the MODEL
Thoughts ?
Let's see if this works :)
Imagine you have an Excel spreadsheet which has data about a specific product and the presence of 7 atomic elements in it.
[product] [calcium] [magnesium] [zinc] [iron] [potassium] [nitrogen] [carbon]
Features are each column except the product, because all the other columns are independent, coexist, and have a measurable impact on the target, i.e. the product. You can even choose to combine some of them into something like "essential elements", i.e. dimension reduction, to make the data more appropriate for analysis. The term "dimension reduction" is used strictly for explanation here, not to be confused with the PCA technique in unsupervised learning. Features are relevant for supervised learning techniques.
Now, imagine a cool machine that has the capability of looking at the data above and inferring what the product is.
Parameters are like levers and stopcocks specific to that machine, which you can juggle with to make sure that when the machine says "it's soap scum", it really/truly is. If you think of yourself doing dartboard practice, what are the things you'd adjust in yourself to get closer to the bullseye (balancing bias/variance)?
Hyperparameters are like parameters, BUT external to the machine we're talking about. What if the machine's parts/mechanical elements were made of a specific compound, e.g. carbon fibre or a magnesium poly-alloy? How would that change what the machine can and can't do well?
I suppose it's an oversimplification of what things are, but hopefully acceptable?
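In scikit-learn terms, a minimal sketch of that split (the estimator and numbers are only illustrative): what you pass to the constructor are the hyperparameters, and what fit learns are the parameters.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# Hyperparameters: settings you choose from the outside, before training.
clf = LogisticRegression(C=1.0, max_iter=200)

# Parameters: values the algorithm tunes during training.
clf.fit(X, y)
print(clf.coef_, clf.intercept_)
```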