Why do we need to convert integer encoding to binary encoding? I have checked different websites and textbooks but couldn't work out what exactly it does or why it's needed. We have categorical data and we can convert it to integers, so far so good. However, why do we then need binary encoding?
I have checked the answer at
Why does one hot encoding improve machine learning performance?
However, it's still not clear. It says each category can get its own weight, but wasn't that possible even with integer values?
If you encode, e.g., categorical values A, B, C as integers 1, 2, 3, many classifiers will "assume" that A (=1) is less than B (=2) or C (=3). This is simply a wrong assumption about the relationship between your categoricals.
Therefore you have to one-hot encode.
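For illustration, here's a minimal sketch (pandas; the category values A, B, C come from the example above) contrasting the two encodings:

import pandas as pd

df = pd.DataFrame({"category": ["A", "B", "C", "A"]})

# Integer encoding: imposes an artificial order A (=1) < B (=2) < C (=3)
df["category_int"] = df["category"].map({"A": 1, "B": 2, "C": 3})

# One-hot encoding: one independent 0/1 column per category,
# so a linear model can learn a separate weight for each
one_hot = pd.get_dummies(df["category"], prefix="category").astype(int)
print(one_hot)
#    category_A  category_B  category_C
# 0           1           0           0
# 1           0           1           0
# 2           0           0           1
# 3           1           0           0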
Related
I am a beginner learning machine learning.
I am trying to build a model (an FNN), and it has too many output labels to use one-hot encoding.
Could you help me?
I want to solve this problem:
The label data is for fruits:
Type (Apple, Grapes, Peach), Quality(Good, Normal, Bad), Price(Expensive, Normal, Cheap), Size(Big, Normal, Small)
So if I one-hot encode every combination, the data size goes up to 3*3*3*3 = 81.
I think the label data looks like a sequence of 4 one-hot encodings.
Is there any way to make the label data low-dimensional, rather than an 81-dimensional one-hot encoding?
I think binary encoding could also be used, but I've read it has some shortcomings in neural networks.
Thanks :D
If you one-hot encode your 4 variables you will have 3+3+3+3 = 12 variables, not 81.
The concept is that you need to create a binary variable for every category in a categorical feature, not one for every possible combination of categories across the four features.
Nevertheless, other possible approaches are Numerical Encoding, Binary Encoding (as you mentioned), or Frequency Encoding (change every category with its frequency in the dataset). The results often depend on the problem, so try different approaches and see what best fits yours!
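A quick sketch (pandas, using the fruit labels from the question) confirms the count:

import pandas as pd

df = pd.DataFrame({
    "type":    ["Apple", "Grapes", "Peach"],
    "quality": ["Good", "Normal", "Bad"],
    "price":   ["Expensive", "Normal", "Cheap"],
    "size":    ["Big", "Normal", "Small"],
})

encoded = pd.get_dummies(df)   # one 0/1 column per (feature, category) pair
print(encoded.shape[1])        # 12, i.e. 3+3+3+3 -- not 3*3*3*3 = 81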
But even if you use one-hot encoding, as @DavideDn pointed out, you will have 12 features, not 81, which isn't a concerning number.
However, let's say the number was indeed 81, you could still use dimensionality reduction techniques (like Principal Component Analysis) to solve the problem.
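If the dimensionality really were a problem, a reduction could look like this (a sketch with scikit-learn; the data and the number of components are made up):

import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 81)    # pretend we had 100 samples with 81 one-hot columns
pca = PCA(n_components=10)     # arbitrary target dimension
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)         # (100, 10)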
While I understand the need to one-hot encode features in the input data, how does one-hot encoding of output labels actually help? The TensorFlow MNIST tutorial encourages one-hot encoding of output labels. The first assignment in CS231n (Stanford), however, does not suggest one-hot encoding. What's the rationale behind choosing or not choosing to one-hot encode output labels?
Edit: Not sure about the reason for the downvote, but just to elaborate more: I missed mentioning the softmax function along with the cross-entropy loss function, which is normally used in multinomial classification. Does it have something to do with the cross-entropy loss function?
Having said that, one can calculate the loss even without the output labels being one hot encoded.
A one-hot vector is used in cases where the output values have no numeric (cardinal) relationship. Let's assume you encode your output as integers, giving each label a number.
Integer values have a natural ordered relationship to each other, and machine learning algorithms may pick up and exploit that relationship, but your labels may be unrelated, with no similarity between them. For categorical variables where no such ordinal relationship exists, integer encoding is a poor fit.
In fact, using this encoding and allowing the model to assume a natural ordering between categories may result in unexpected results, with model predictions halfway between categories.
What do I mean by that?
The idea is that if we train an ML algorithm - for example a neural network - it’s going to think that a cat (which is 1) is halfway between a dog and a bird, because they are 0 and 2 respectively. We don’t want that; it’s not true and it’s an extra thing for the algorithm to learn.
The same may happen when labels are encoded as continuous values in an n-dimensional space: the result may be hard to interpret and to map back to labels.
In this case, one-hot encoding can be applied to the label representation, as it has a clear interpretation and its values are well separated, each in its own dimension.
If you need more information, or would like to see the rationale for one-hot encoding from the perspective of the loss function, see https://www.linkedin.com/pulse/why-using-one-hot-encoding-classifier-training-adwin-jahn/
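On the loss-function point above: with softmax probabilities, the cross-entropy computed from a one-hot label is exactly the negative log-probability at the integer label's index, which is why libraries can accept either form (e.g. Keras offers both categorical_crossentropy and sparse_categorical_crossentropy). A NumPy sketch:

import numpy as np

probs = np.array([0.7, 0.2, 0.1])   # softmax output for one sample
label = 0                           # integer-encoded true class

one_hot = np.zeros(3)
one_hot[label] = 1.0                # one-hot version of the same label

loss_one_hot = -np.sum(one_hot * np.log(probs))  # dot product with log-probs
loss_integer = -np.log(probs[label])             # simple index lookup

print(np.isclose(loss_one_hot, loss_integer))    # True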
Specifically, what is the difference in how H2O treats the enum and string data types, in contrast to the 'int' and 'numerical' types?
For example, say I have a binary classifier that takes input samples that have features
x1=(1 of 10 possible favorite ice cream flavors (enum))
x2=(some random phrase (string))
x3=(some number (int))
What would be the difference in how the classifier treats these types during training?
When uploading data into the H2O Flow UI, I get the option to convert certain data types (like enum) to 'numerical'. This makes me think that there is more going on than just a string-to-number mapping when I leave an 'enum' as an 'enum' (not converting it to a 'numerical' type), but I can't find information on what that difference is.
Clarification would be appreciated, thanks.
The "enum" type is the type of encoding you'll want to use for categorical features. If the categorical features are encoded as "enum", then the tree-based algorithms like Random Forest and GBM will be able to handle these features in a smart way. Most other implementations of RFs and GBM force you to do a one-hot expansion of the categorical features (into K dummy columns), but in H2O, the tree-based methods can use these features without any expansion. The exact whay that the variables are handled can be controlled using the categorical_encoding argument.
If you have an ordered categorical variable, then it might be okay to encode it as "int"; however, the effect of doing so on model performance will depend on the data.
If you were to convert an "enum" column to "numeric" that would simply encode each category as an integer and you'd lose the notion that those numbers represent categories (so it's not recommended).
You should not use the "string" type in H2O unless you are going to exclude that column from the set of predictors. It would make sense to use a "string" column for text, but you'll probably want to parse (e.g. tokenize) that text to generate new numeric or enum features that will be included in the set of predictors.
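For example, a minimal sketch with the h2o Python API (the file name and column names here are made up for illustration), assuming a GBM model:

import h2o
from h2o.estimators import H2OGradientBoostingEstimator

h2o.init()
df = h2o.import_file("icecream.csv")   # hypothetical dataset

# Force the flavor column to be treated as categorical ("enum")
df["flavor"] = df["flavor"].asfactor()

# Tree-based algorithms then handle it without one-hot expansion;
# the exact treatment can be tuned via categorical_encoding
model = H2OGradientBoostingEstimator(categorical_encoding="enum")
model.train(x=["flavor", "amount"], y="label", training_frame=df)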
I met a tricky issue when trying to vectorize one of my features. I have a feature like this:
most of it is numeric, like 0, 1, 33.3, 100, etc.
some of it is empty, which represents "not provided".
some of it is "auto", which means it adapts to the context.
Now my question is: how do I encode this feature into vectors effectively? One thing I could do is treat all the numeric values as categorical too, but that would cause an explosion of the feature space, and it's also bad at representing similarity between data points. What should I do?
Thanks!
--- THE ALGORITHM/MODEL I'M USING ---
It's an LSTM (Long Short-Term Memory) neural network. Currently I'm going with the following approach. Say I have 2 data points:
col1
entry1: 1.0
entry2: auto
It'll be encoded into:
col1-a col1-b
entry1: 1.0 0
entry2: dummy 1
So col1-b will represent whether it's auto or not. The dummy number will be the median of all the numeric data. Will this work?
Also, each numeric value has a unit associated with it, so there's another column with values like 'px' and 'pt'. In this case, does the numeric value still have meaning if I extract the unit into a separate column? They only have actual meaning when combined (numeric + unit), but can the NN notice that if they live in different dimensions?
That depends on what type of algorithm you will be using. If you want to use something like association rule classification, then you will have to treat all of your variables as categorical data. If you want to use logistic regression, then that isn't needed. You'd have to provide more details to get a better answer.
edit
I made some edits after reading your edit.
It sounds like what you have is at least reasonable. I've read books where people use the mean/median/mode to fill in missing values for numeric data. As for which specific one works best for you, I don't know. Can you try training your classifier with each version?
As for your issue with the "auto" column, it sounds like you want to do something similar to running a regression with categorical data. I don't have much experience with neural networks, but I know that if you were to use something like logistic regression then this is the approach you would want to use. Hopefully this gives you an idea of what you have to research.
As for treating all of your numeric data as categorical, you can do that as well, but you have to normalize it first. You can do something like min-max normalization and then just take the integer part of the number. Now your data will effectively be categorical.
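Putting the median-fill-plus-indicator idea from the question into code, a sketch with pandas (column names follow the question):

import pandas as pd

col1 = pd.Series(["1.0", "auto", "", "33.3", "100"])

# Indicator columns for the special values
is_auto    = (col1 == "auto").astype(int)
is_missing = (col1 == "").astype(int)

# Numeric part: parse what parses, fill "auto"/empty with the median
numeric = pd.to_numeric(col1, errors="coerce")
numeric = numeric.fillna(numeric.median())

encoded = pd.DataFrame({
    "col1_value":   numeric,      # "auto" and "" become the median, 33.3
    "col1_auto":    is_auto,
    "col1_missing": is_missing,
})
print(encoded)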
What's the best way to include nominal values, as opposed to real or boolean ones, in a feature vector for machine learning?
Should I map each nominal value to a real value?
For example, say I want my program to learn a predictive model for a web service's users, whose input features may include
{ gender(boolean), age(real), job(nominal) }
where the dependent variable may be the number of website logins.
The variable job may be one of
{ PROGRAMMER, ARTIST, CIVIL SERVANT... }.
Should I map PROGRAMMER to 0, ARTIST to 1, etc.?
Do a one-hot encoding, if anything.
If your data has categorical attributes, it is recommended to use an algorithm that can deal with such data directly, without the encoding hack, e.g. decision trees and random forests.
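If you do go the one-hot route for the job feature, a sketch with scikit-learn (values taken from the question):

import numpy as np
from sklearn.preprocessing import OneHotEncoder

jobs = np.array([["PROGRAMMER"], ["ARTIST"], ["CIVIL SERVANT"], ["PROGRAMMER"]])

encoder = OneHotEncoder()                        # categories are sorted alphabetically
encoded = encoder.fit_transform(jobs).toarray()  # dense 0/1 matrix
print(encoder.categories_)   # [array(['ARTIST', 'CIVIL SERVANT', 'PROGRAMMER'], ...)]
print(encoded)
# [[0. 0. 1.]
#  [1. 0. 0.]
#  [0. 1. 0.]
#  [0. 0. 1.]]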
If you read the book "Machine Learning with Spark", the author wrote:

Categorical features

Categorical features cannot be used as input in their raw form, as they are not numbers; instead, they are members of a set of possible values that the variable can take. In the example mentioned earlier, user occupation is a categorical variable that can take the value of student, programmer, and so on.

[…]

To transform categorical variables into a numerical representation, we can use a common approach known as 1-of-k encoding. An approach such as 1-of-k encoding is required to represent nominal variables in a way that makes sense for machine learning tasks. Ordinal variables might be used in their raw form but are often encoded in the same way as nominal variables.

[…]
I had exactly the same thought.
I think that if there is a meaningful (well-designed) transformation function that maps categorical (nominal) values to real values, I can also use learning algorithms that only take numerical vectors.
Actually, I've done some projects that way, and no issues were raised concerning the performance of the learning system.
To whoever voted against my question, please reconsider your vote.