I am working on basic Machine Learning Linear Regression Model creation.
I have Categorical Features which are having kind of skewed counts like
AllPub 1459
NoSeWa 1
Name: Utilities, dtype: int64
As one can see that AllPub is the one which is contributed more. So is it useful in model creation? Shall i use it or not??
As you can see most of the values are of AllPub, only one value is of NoSeWa. It will not make much difference if you keep or remove.
Another way of thinking might be a outlier. Since there is count of only one, it might have entered incorrectly. You can impute that value with mode.
While I understand the need to one hot encode features in the input data, how does one hot encoding of output labels actually help? The tensor flow MNIST tutorial encourages one hot encoding of output labels. The first assignment in CS231n(stanford) however does not suggest one hot encoding. What's the rationale behind choosing / not choosing to one hot encode output labels?
Edit: Not sure about the reason for the downvote, but just to elaborate more, I missed out mentioning the softmax function along with the cross entropy loss function, which is normally used in multinomial classification. Does it have something to do with the cross entropy loss function?
Having said that, one can calculate the loss even without the output labels being one hot encoded.
One hot vector is used in cases where output is not cardinal. Lets assume you encode your output as integer giving each label a number.
The integer values have a natural ordered relationship between each other and machine learning algorithms may be able to understand and harness this relationship, but your labels may be unrelated. There may be no similarity in your labels. For categorical variables where no such ordinal relationship exists, the integer encoding is not good.
In fact, using this encoding and allowing the model to assume a natural ordering between categories may result in unexpected results where model predictions are halfway between categories categories.
What a mean by that?
The idea is that if we train an ML algorithm - for example a neural network - it’s going to think that a cat (which is 1) is halfway between a dog and a bird, because they are 0 and 2 respectively. We don’t want that; it’s not true and it’s an extra thing for the algorithm to learn.
The same may happen when data is encoded in n dimensional space and vector has a continuous value. The result may be hard to interpret and map back to labels.
In this case, a one-hot encoding can be applied to label representation as it has clear interpretation and its values are separated each is in different dimension.
If you need more information or would like to see the reason for one-hot encoding for the perspective of loss function see https://www.linkedin.com/pulse/why-using-one-hot-encoding-classifier-training-adwin-jahn/
I'm working on a regression algorithm, in this case k-NearestNeighbors to predict a certain price of a product.
So I have a Training set which has only one categorical feature with 4 possible values. I've dealt with it using a one-to-k categorical encoding scheme which means now I have 3 more columns in my Pandas DataFrame with a 0/1 depending the value present.
The other features in the DataFrame are mostly distances like latitud - longitude for locations and prices, all numerical.
Should I standardize (Gaussian distribution with zero mean and unit variance) and normalize before or after the categorical encoding?
I'm thinking it might be benefitial to normalize after encoding so that every feature is to the estimator as important as every other when measuring distances between neighbors but I'm not really sure.
Seems like an open problem, thus I'd like to answer even though it's late. I am also unsure how much the similarity between the vectors would be affected, but in my practical experience you should first encode your features and then scale them. I have tried the opposite with scikit learn preprocessing.StandardScaler() and it doesn't work if your feature vectors do not have the same length: scaler.fit(X_train) yields ValueError: setting an array element with a sequence. I can see from your description that your data have a fixed number of features, but I think for generalization purposes (maybe you have new features in the future?), it's good to assume that each data instance has a unique feature vector length. For instance, I transform my text documents into word indices with Keras text_to_word_sequence (this gives me the different vector length), then I convert them to one-hot vectors and then I standardize them. I have actually not seen a big improvement with the standardization. I think you should also reconsider which of your features to standardize, as dummies might not need to be standardized. Here it doesn't seem like categorical attributes need any standardization or normalization. K-nearest neighbors is distance-based, thus it can be affected by these preprocessing techniques. I would suggest trying either standardization or normalization and check how different models react with your dataset and task.
After. Just imagine that you have not numerical variables in your column but strings. You can't standardize strings - right? :)
But given what you wrote about categories. If they are represented with values, I suppose there is some kind of ranking inside. Probably, you can use raw column rather than one-hot-encoded. Just thoughts.
You generally want to standardize all your features so it would be done after the encoding (that is assuming that you want to standardize to begin with, considering that there are some machine learning algorithms that do not need features to be standardized to work well).
So there is 50/50 voting on whether to standardize data or not.
I would suggest, given the positive effects in terms of improvement gains no matter how small and no adverse effects, one should do standardization before splitting and training estimator
I know decision tree has feature_importance attribute calculated by Gini and it could be used to check which features are more important.
However, for application in scikit-learn or Spark, it only accepts numeric attribute, so I have to transfer string attribute to numeric attribute and then do one-hot encoder on that. When features are put into decision tree model, it's 0-1 encoded other than original format, my question is, how to explain feature importance for original attributes? should I avoid one-hot encoder when try to explain feature importance?
Thanks.
Conceptually, you may want to use something along the lines of permutation importance. The basic idea, is that you take your original dataset, and randomly shuffle the values of each column 1 at a time. Then, you score your perturbed data with the model and compare the performance to the original performance. If done 1 column at a time, you can assess the performance hit you take by destroying each variable, indexing it to the variable that had the most loss (which would become 1, or 100%). If you can do this to your original dataset, prior to the 1 hot encoding, then you'll be getting an importance measure that groups them together overall.
My dataset is composed of millions of row and a couple (10's) of features.
One feature is a label composed of 1000 differents values (imagine each row is a user and this feature is the user's firstname :
Firstname,Feature1,Feature2,....
Quentin,1,2
Marc,0,2
Gaby,1,0
Quentin,1,0
What would be the best representation for this feature (to perform clustering) :
I could convert the data as integer using a LabelEncoder, but it doesn't make sense here since there is no logical "order" between two differents label
Firstname,F1,F2,....
0,1,2
1,0,2
2,1,0
0,1,0
I could split the feature in 1000 features (one for each label) with 1 when the label match and 0 otherwise. However this would result in a very big matrix (too big if I can't use sparse matrix in my classifier)
Quentin,Marc,Gaby,F1,F2,....
1,0,0,1,2
0,1,0,0,2
0,0,1,1,0
1,0,0,1,0
I could represent the LabelEncoder value as a binary in N columns, this would reduce the dimension of the final matrix compared to the previous idea, but i'm not sure of the result :
LabelEncoder(Quentin) = 0 = 0,0
LabelEncoder(Marc) = 1 = 0,1
LabelEncoder(Gaby) = 2 = 1,0
A,B,F1,F2,....
0,0,1,2
0,1,0,2
1,0,1,0
0,0,1,0
... Any other idea ?
What do you think about solution 3 ?
Edit for some extra explanations
I should have mentioned in my first post, but In the real dataset, the feature is the more like the final leaf of a classification tree (Aa1, Aa2 etc. in the example - it's not a binary tree).
A B C
Aa Ab Ba Bb Ca Cb
Aa1 Aa2 Ab1 Ab2 Ab3 Ba1 Ba2 Bb1 Bb2 Ca1 Ca2 Cb1 Cb2
So there is a similarity between 2 terms under the same level (Aa1 Aa2 and Aa3are quite similar, and Aa1 is as much different from Ba1 than Cb2).
The final goal is to find similar entities from a smaller dataset : We train a OneClassSVM on the smaller dataset and then get a distance of each term of the entiere dataset
This problem is largely one of one-hot encoding. How do we represent multiple categorical values in a way that we can use clustering algorithms and not screw up the distance calculation that your algorithm needs to do (you could be using some sort of probabilistic finite mixture model, but I digress)? Like user3914041's answer, there really is no definite answer, but I'll go through each solution you presented and give my impression:
Solution 1
If you're converting the categorical column to an numerical one like you mentioned, then you face that pretty big issue you mentioned: you basically lose meaning of that column. What does it really even mean if Quentin in 0, Marc 1, and Gaby 2? At that point, why even include that column in the clustering? Like user3914041's answer, this is the easiest way to change your categorical values into numerical ones, but they just aren't useful, and could perhaps be detrimental to the results of the clustering.
Solution 2
In my opinion, depending upon how you implement all of this and your goals with the clustering, this would be your best bet. Since I'm assuming you plan to use sklearn and something like k-Means, you should be able to use sparse matrices fine. However, like imaluengo suggests, you should consider using a different distance metric. What you can consider doing is scaling all of your numeric features to the same range as the categorical features, and then use something like cosine distance. Or a mix of distance metrics, like I mention below. But all in all this will likely be the most useful representation of your categorical data for your clustering algorithm.
Solution 3
I agree with user3914041 in that this is not useful, and introduces some of the same problems as mentioned with #1 -- you lose meaning when two (probably) totally different names share a column value.
Solution 4
An additional solution is to follow the advice of the answer here. You can consider rolling your own version of a k-means-like algorithm that takes a mix of distance metrics (hamming distance for the one-hot encoded categorical data, and euclidean for the rest). There seems to be some work in developing k-means like algorithms for mixed categorical and numerical data, like here.
I guess it's also important to consider whether or not you need to cluster on this categorical data. What are you hoping to see?
Solution 3:
I'd say it has the same kind of drawback as using a 1..N encoding (solution 1), in a less obvious fashion. You'll have names that both give a 1 in some column, for no other reason than the order of the encoding...
So I'd recommend against this.
Solution 1:
The 1..N solution is the "easy way" to solve the format issue, as you noted it's probably not the best.
Solution 2:
This looks like it's the best way to do it but it is a bit cumbersome and from my experience the classifier does not always performs very well with a high number of categories.
Solution 4+:
I think the encoding depends on what you want: if you think that names that are similar (like John and Johnny) should be close, you could use characters-grams to represent them. I doubt this is the case in your application though.
Another approach is to encode the name with its frequency in the (training) dataset. In this way what you're saying is: "Mainstream people should be close, whether they're Sophia or Jackson does not matter".
Hope the suggestions help, there's no definite answer to this so I'm looking forward to see what other people do.