Feature Importances for categorical features after one hot encoding? - machine-learning

When we one-hot encode a categorical feature with multiple categories, each resulting column gets its own importance score. If we want to combine these into a single feature importance for the original categorical feature, can we simply add the importances of the one-hot encoded columns?
For example, let's say we one hot encode a categorical feature called Department.
Resulting values:
Department_0: 0.03
Department_1: 0.08
Department_2: 0.12
To combine these into a single feature importance for ‘Department’, can we sum the values?
Department: 0.03 + 0.08 + 0.12 = 0.23
Is there any drawback to this approach? If so, what is the best approach?

Some examples include:
A “pet” variable with the values “dog” and “cat”.
A “color” variable with the values “red”, “green”, and “blue”.
A “place” variable with the values “first”, “second”, and “third”.
Each value represents a different category.
Some categories may have a natural relationship to each other, such as a natural ordering.
The “place” variable above does have a natural ordering of values. This type of categorical variable is called an ordinal variable because the values can be ordered or ranked.
A numerical variable can be converted to an ordinal variable by dividing the range of the numerical variable into bins and assigning values to each bin. For example, a numerical variable between 1 and 10 can be divided into an ordinal variable with 5 labels with an ordinal relationship: 1-2, 3-4, 5-6, 7-8, 9-10. This is called discretization.
Nominal Variable (Categorical). Variable comprises a finite set of discrete values with no relationship between values.
Ordinal Variable. Variable comprises a finite set of discrete values with a ranked ordering between values.
The key is to distinguish the two kinds of categorical feature: nominal and ordinal variables.
A “place” variable with the values “first”, “second”, and “third” is ordinal, so you can keep it as one feature.
A “color” variable with the values “red”, “green”, and “blue” is nominal, so you cannot keep it as one feature.

When you read the feature_importances_ attribute of an sklearn estimator that exposes it (for example a tree ensemble), you will notice that the values sum to 1, i.e. 100%.
Therefore I would say it makes sense to sum the importances when you regroup features:
feature_importance(featureA & featureB) = feature_importance(featureA) + feature_importance(featureB)
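A minimal sketch of that bookkeeping, assuming a random forest and made-up data (the column names, the toy target, and the rule that the text before the first "_" names the original feature are all assumptions for illustration, not part of the question):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical toy data: one categorical and one numeric feature.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Department": rng.choice(["sales", "hr", "it"], size=200),
    "Age": rng.integers(20, 60, size=200),
})
y = ((df["Department"] == "it") | (df["Age"] > 45)).astype(int)

# One-hot encode: Department becomes Department_hr, Department_it, Department_sales.
X = pd.get_dummies(df, columns=["Department"])
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

importances = pd.Series(model.feature_importances_, index=X.columns)

# Map every encoded column back to its source feature (text before the first "_"),
# then sum the importances within each group.
source = np.array([col.split("_")[0] for col in X.columns])
grouped = importances.groupby(source).sum()
print(grouped)
```

The grouped values still sum to 1, so the regrouped importances stay on the same scale as the per-column ones. Whether the sum is a fair measure of the original feature is a separate question that this sketch does not settle.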

Related

Normalize data with outlier inside interval

I have a dataset with some outliers that are 10 or 100 times greater than the normal values. I cannot throw out these rows, and I want to normalize this data into the interval [0, 1].
Here are the strategies I have considered:
1. Simply rank my dataset's rows and use the ranked positions as the variable to normalize. Since we have a uniform distribution here, it is easy. The problem is that the differences between values are not measured, so values with a large difference could have similar normalized values if there aren't intermediate value examples in this dataset.
2. Use the sklearn.preprocessing.RobustScaler method. But I got normalized values between -0.4 and 300. It is still not good to normalize something in this scale.
3. Distribute normalized values between 0 and 0.8 in a linear way for all values where quantile <= 0.8, and distribute the values between 0.8 and 1.0 among the remaining values in a similar way to the ranking strategy I mentioned above.
4. Make a 1D Kmeans algorithm to locate all near values and get a cluster with non-outlier values. For these values, I just distribute normalized values between 0 and the quantile value it represents, simply by doing (value - mean) / (max - min), and for the remaining outlier values, I distribute the range between values greater than the quantile and 1 with the ranking strategy.
5. Create a filter function, like a sigmoid, and multiply values by it. Smaller values remain unchanged, but the outliers' values are approximated to non-outlier values. Then, I normalize it. But how can I design this sigmoid's parameters?
First of all, I would like some feedback on these strategies. What do you think of them?
Also, how is this problem normally solved? Are there any references you would recommend?
Thank you =)
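To make the piecewise strategy (linear up to the 0.8 quantile, rank-based above it) concrete, here is a rough sketch. The function name piecewise_normalize, the sample values, and the choice to use the largest observed value at or below the quantile as the cut point are my own assumptions, not an established recipe:

```python
import numpy as np

def piecewise_normalize(x, q=0.8):
    """Map the bulk of the data linearly onto [0, q] and the tail by rank onto (q, 1]."""
    x = np.asarray(x, dtype=float)
    # Cut point: largest observed value that is at or below the q-quantile.
    cut = x[x <= np.quantile(x, q)].max()
    lo = x.min()
    out = np.empty_like(x)

    body = x <= cut
    span = cut - lo
    # Linear map for the non-outlier body (all zeros if the body is constant).
    out[body] = q * (x[body] - lo) / span if span > 0 else 0.0

    tail = x[~body]
    if tail.size:
        # Ranks 1..k inside the tail; ties are broken arbitrarily by argsort.
        ranks = tail.argsort().argsort() + 1
        out[~body] = q + (1 - q) * ranks / tail.size
    return out

values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 100, 1000])
scaled = piecewise_normalize(values)
print(scaled)
```

The body keeps its relative distances, while the outliers only keep their order, which is exactly the trade-off described for the pure ranking strategy.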

replace missing values in categorical data

Let's suppose I have a column with categorical data "red" "green" "blue" and empty cells
red
green
red
blue
NaN
I'm sure that the NaN is one of red, green, or blue. Should I replace the NaN with the average of the colors, or is that too strong an assumption? It would be
col1 | col2 | col3
1 0 0
0 1 0
1 0 0
0 0 1
0.5 0.25 0.25
Or should I even scale the last row down while keeping the ratio, so that these values have less influence? What is the usual best practice?
0.25 0.125 0.125
The simplest strategy for handling missing data is to remove records that contain a missing value.
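The scikit-learn library provides an imputer class for pre-processing that can be used to replace missing values. Older releases exposed it as sklearn.preprocessing.Imputer; that class has since been removed, and current releases provide sklearn.impute.SimpleImputer instead. Since this is categorical data, using the mean as the replacement value is not recommended. You can use the most frequent value:
import numpy as np
from sklearn.impute import SimpleImputer
imp = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
SimpleImputer accepts a NumPy array or a DataFrame, but by default it returns a NumPy array rather than a DataFrame.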
Last but not least, not all ML algorithms fail on missing values. Some handle them natively, and different implementations of the same algorithm also differ in this respect.
It depends on what you want to do with the data.
Is the average of these colors useful for your purpose?
You are creating a new possible value doing that, that is probably not wanted. Especially since you are talking about categorical data, and you are handling it as if it was numeric data.
In Machine Learning you would replace the missing values with the most common categorical value regarding a target attribute (what you want to predict).
Example: You want to predict if a person is male or female by looking at their car, and the color feature has some missing values. If most of the cars from male(female) drivers are blue(red), you would use that value to fill missing entries of cars from male(female) drivers.
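A small sketch of that idea in pandas, using invented data (the column names gender and color and the values are purely illustrative). Note that this relies on knowing the target, so it only applies to training rows:

```python
import numpy as np
import pandas as pd

# Hypothetical training data: the target is "gender", the feature with gaps is "color".
df = pd.DataFrame({
    "gender": ["male", "male", "male", "female", "female", "female"],
    "color":  ["blue", "blue", np.nan, "red", np.nan, "red"],
})

# Within each target class, fill missing colors with that class's most common color.
df["color_filled"] = df.groupby("gender")["color"].transform(
    lambda s: s.fillna(s.mode().iloc[0])
)
print(df)
```

If a class has no observed color at all, s.mode() is empty and this would fail, so a real pipeline would need a fallback such as the overall mode.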
In addition to Lan's answer's approach, which seems most commonly used, you can use something based on matrix factorization. For example there is a variant of Generalized Low Rank Models that can impute such data, just as probabilistic matrix factorization is used to impute continuous data.
GLRMs can be used from H2O which provides bindings for both Python and R.

Categorical and ordinal feature data difference in regression analysis?

I am trying to fully understand the difference between categorical and ordinal data when doing regression analysis. For now, this is what is clear:
Categorical feature and data example:
Color: red, white, black
Why categorical: red < white < black is logically incorrect
Ordinal feature and data example:
Condition: old, renovated, new
Why ordinal: old < renovated < new is logically correct
Categorical-to-numeric and ordinal-to-numeric encoding methods:
One-Hot encoding for categorical data
Arbitrary numbers for ordinal data
Example for categorical:
data = {'color': ['blue', 'green', 'green', 'red']}
Numeric format after One-Hot encoding:
   color_blue  color_green  color_red
0           1            0          0
1           0            1          0
2           0            1          0
3           0            0          1
Example for ordinal:
data = {'con': ['old', 'new', 'new', 'renovated']}
Numeric format after using mapping: Old < renovated < new → 0, 1, 2
0 0
1 2
2 2
3 1
In my data, price increases as condition changes from "old" to "new". "Old" was encoded as 0 and "new" as 2. So, as the condition value increases, the price also increases. Correct.
Now let's have a look at the 'color' feature. In my case, different colors also affect price. For example, 'black' will be more expensive than 'white'. But in the numeric representation of categorical data shown above, I do not see an increasing dependency like the one for the 'condition' feature. Does this mean that a change in color does not affect price in a regression model if one-hot encoding is used? Why use one-hot encoding for regression if it does not affect price anyway? Can you clarify this?
UPDATE TO QUESTION:
First, I introduce the formula for linear regression: price = theta_0 + theta_1*x_1 + ... + theta_n*x_n.
Let's have a look at the data representations for color:
Let's predict the price of the 1st and 2nd items using the formula for both data representations:
One-hot encoding:
In this case different thetas for different colors will exist and prediction will be:
Price (1 item) = 0 + 20*1 + 50*0 + 100*0 = 20$ (thetas are assumed for example)
Price (2 item) = 0 + 20*0 + 50*1 + 100*0 = 50$ (thetas are assumed for example)
Ordinal encoding for color:
In this case all colors have common theta but multipliers differ:
Price (1 item) = 0 + 20*10 = 200$ (theta assumed for example)
Price (2 item) = 0 + 20*20 = 400$ (theta assumed for example)
In my model, White < Red < Black in price. The predictions seem logical in both cases, for the ordinal and the categorical representation. So can I use any encoding for my regression regardless of the data type (categorical or ordinal)? Is this division just a matter of convention and software-oriented representation rather than a matter of regression logic itself?
You will not see an increasing dependency. The whole point of this distinction is that colour is not a feature you can meaningfully place on a continuum, as you've already noted.
The one-hot encoding makes it very convenient for the software to analyze this dimension. Instead of having a feature "colour" with the listed values, you have a set of boolean (present / not-present) features. For instance, your row 0 above has features color_blue = true, color_green = false, and color_red = false.
The prediction data you get should show each of these as a separate dimension. For instance, presence of color_blue may be worth $200, while green is -$100.
Summary: don't look for a linear regression line running across a (non-existent) color axis; rather, look for color_* factors, one for each color. As far as your analysis algorithm is concerned, these are utterly independent features; the "one-hot" encoding (a term from digital circuit design) is merely our convention for dealing with this.
Does this help your understanding?
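As a rough illustration of the "one factor per color" point, here is a sketch with invented prices. I fit without an intercept (an assumption made so that each coefficient is directly the price attached to that color; with an intercept the one-hot columns are collinear and the individual coefficients are not unique):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data: price depends only on color, with no ordering among colors.
df = pd.DataFrame({
    "color": ["blue", "green", "green", "red", "blue", "red"],
    "price": [200.0, 50.0, 50.0, 120.0, 200.0, 120.0],
})

# One boolean-like column per color.
X = pd.get_dummies(df["color"], prefix="color", dtype=float)

# No intercept, so each coefficient is simply the fitted price for that color.
model = LinearRegression(fit_intercept=False).fit(X, df["price"])
coefs = dict(zip(X.columns, model.coef_))
print(coefs)
```

Each color_* column gets its own coefficient, so color clearly does affect the predicted price even though no single "color axis" exists.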
After your edit of the question 02:03 Z 04 Dec 2015:
No, your assumption is not correct: the two representations are not merely a matter of convenience. The ordering of colors works for this example -- because the effect happens to be a neat, linear function of the chosen encoding. As your example shows, your simpler encoding assumes that White-to-Red-to-Black pricing is a linear progression. What do you do when Green, Blue, and Brown are all $25, the rare Yellow is worth $500, and Transparent reduces the price by $1,000?
Also, how is it that you know in advance that Black is worth more than White, in turn worth more than Red?
Consider the case of housing prices based on elementary school district, with 50 districts in the area. If you use a numerical coding -- school district number, ordinal position alphabetically, or some other arbitrary ordering -- the regression software will have great trouble finding a correlation between that number and the housing price. Is PS 107 a more expensive district than PS 32 or PS 15? Are Addington and Bendemeer preferred to Union City and Ventura?
Splitting these into 50 different features under the one-hot principle decouples the feature from the encoding, and allows the analysis software to deal with them in a mathematically meaningful manner. It's not perfect by any means -- expanding from, say, 20 features to 70 means that it will take longer to converge -- but we do get meaningful results for the school district.
If you wish, you could now encode that feature in the expected order of value, and get a reasonable fit with little loss of accuracy and faster prediction from your model (fewer variables).
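The Green/Blue/Brown/Yellow/Transparent example can be checked numerically. In this sketch the base price of 2000 and the arbitrary integer codes 0 to 4 are assumptions of mine; the per-color effects are the ones quoted in the answer:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

colors = ["green", "blue", "brown", "yellow", "transparent"]
effect = {"green": 25, "blue": 25, "brown": 25, "yellow": 500, "transparent": -1000}

# Hypothetical data set: four items of each color, base price 2000 plus the color effect.
df = pd.DataFrame({"color": colors * 4})
df["price"] = 2000 + df["color"].map(effect)

# Arbitrary ordinal codes 0..4 versus one column per color.
X_ord = df["color"].map({c: i for i, c in enumerate(colors)}).to_frame("color_code")
X_hot = pd.get_dummies(df["color"], dtype=float)

r2_ord = LinearRegression().fit(X_ord, df["price"]).score(X_ord, df["price"])
r2_hot = LinearRegression().fit(X_hot, df["price"]).score(X_hot, df["price"])
print(f"ordinal R^2: {r2_ord:.3f}, one-hot R^2: {r2_hot:.3f}")
```

The single ordinal code forces one slope through effects that are not linear in the code, so it explains far less of the price variation than the one-hot columns do.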
You cannot use ordinal encoding for a categorical variable where order doesn't matter. The main purpose of building a regression model is to see how much a change in one variable affects the response variable. When you obtain the regression formula, this is how you read it: "a 1-unit change in variable X causes a theta_x change in the response variable".
For example, let's say you built a regression model on housing prices and you got this: price = 1000 + (-50)*age_of_house. This means a 1-year increase in the age of the house causes the price to go down by 50.
When you have a categorical variable, you cannot speak of a unit change in that variable. You cannot say "a 1-unit increase/decrease in the color", etc. So one-hot encoding, as Prune said in his/her answer, is merely a convention for dealing with categorical variables. It allows you to interpret the results like this: if the coefficient of color_white in your final model is +200, then a white house has $200 added to its value. If the house is not white, that variable has no impact on your response variable because its value will be 0.
Don't forget that "Linear Regression" models can only explain linear relations between variables.
I hope this helps.

Having trouble creating my Neural Network inputs

I'm currently working on a neural network that should take N parameters as input. Each parameter can take one of M different discrete values, let's say {A,B,C,…,M}. It also has a discrete number of outputs.
How can I create my inputs from this situation? Should I have N×M inputs (having 0 or 1 as value), or should I think of a different approach?
You can either have N×M boolean inputs or have N inputs where each one is a float between 0 and 1. In the latter case, the k-th of the M values is mapped to k/M, so the float values would be {1/M, 2/M, 3/M, ..., 1}. For example, if you have 4 inputs, each with discrete values {1,2,3,4}, then you can change the domain values to {0.25, 0.50, 0.75, 1.00}.
Actually there are a lot of ways to encode your inputs, but I have found better results when my inputs lie in the domain [0,1] (as there are some ML functions that expect that).
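Both options can be sketched in a few lines of NumPy. The sizes (N = 3 parameters, M = 4 values) and the sample rows are made up, and the values are assumed to be stored as integer indices 0..M-1:

```python
import numpy as np

M = 4  # number of discrete values per parameter (assumed)
# Two samples with N = 3 parameters each, values stored as indices 0..M-1.
X = np.array([[0, 3, 1],
              [2, 2, 0]])

# Option 1: N*M boolean inputs. Row k of the identity matrix is the one-hot
# vector for value k; flatten the N vectors of each sample into one row.
one_hot = np.eye(M)[X].reshape(len(X), -1)

# Option 2: N float inputs in (0, 1], mapping the k-th value to k/M.
scaled = (X + 1) / M

print(one_hot.shape)
print(scaled)
```

Option 1 makes no assumption about ordering between values, while option 2 implicitly treats the values as ordered and evenly spaced.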

Numerically representing Nominal Data whilst retaining data semantics

I have a dataset of nominal and numerical features. I want to be able to represent this dataset entirely numerically if possible.
Ideally I would be able to do this for an n-ary nominal feature. I realize that in the binary case, one could represent the two nominal values with integers. However, when a nominal feature can take many different values, how would this be possible, if at all?
There are a number of techniques to "embed" categorical attributes as numbers.
For example, given a categorical variable that can take the values red, green and blue, we can trivially encode this as three attributes isRed={0,1}, isGreen={0,1} and isBlue={0,1}.
While this is popular, and will obviously "work", many people fall for the fallacy of assuming that afterwards numerical processing techniques will produce sensible results.
If you run e.g. k-means on a dataset encoded this way, the result will likely not be too meaningful afterwards. In particular, if you get a mean such as isRed=.3 isGreen=.2 isBlue=.5 - you cannot reasonably map this back to the original data. Worse, with some algorithms you may even get isRed=0 isGreen=0 isBlue=0.
I suggest that you try to work on your actual data, and avoid encoding as much as possible. If you have a good tool, it will allow you to use mixed data types. Don't try to make everything a numerical vector. This mathematical view of data is quite limited and the data will not give you all the mathematical assumptions that you need to benefit from this view (e.g. metric spaces).
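To see where a centroid such as isRed=.3 isGreen=.2 isBlue=.5 comes from, here is a tiny sketch with invented data. It only computes the mean of the encoded rows, which is what k-means does for each cluster centre:

```python
import pandas as pd

# Hypothetical cluster of ten points: 3 red, 2 green, 5 blue.
colors = pd.Series(["red"] * 3 + ["green"] * 2 + ["blue"] * 5)
encoded = pd.get_dummies(colors, dtype=float)

# The cluster centre is the column-wise mean of the one-hot rows.
centroid = encoded.mean()
print(centroid)
```

The resulting vector is not a one-hot row, so it does not correspond to any actual color.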
Don't do this: "I'm trying to encode certain nominal attributes as integers."
The exception is a nominal feature with only two values. There it is OK to use any two different integers (for example 1 and 3).
But if there are more than two values, integers cannot be used. Let's say we assigned 1, 2 and 3 to three values. The numeric differences then suggest a closer relation between 1 and 2, and between 2 and 3, than between 1 and 3, which is meaningless for nominal data.
Rather, use a separate binary feature for each value of each nominal attribute. Thus, the answer to your question: representing a nominal feature as a single number is not possible, or at least not wise.
If you use pandas, you can call pd.get_dummies() on your nominal column. This turns a column with N unique values into N new columns (or N-1 if you pass drop_first=True), each indicating with a 1 or a 0 whether that value is present. Recent pandas versions return booleans by default, so pass dtype=int to get 1/0.
Example:
import pandas as pd
s = pd.Series(list('abca'))
pd.get_dummies(s, dtype=int)
   a  b  c
0  1  0  0
1  0  1  0
2  0  0  1
3  1  0  0
