Difference between one-hot encoding and label encoding of target/output label

I have a problem where there are 20 classes. I have designed a neural network and am using categorical_crossentropy as the loss.
When dealing with categorical cross-entropy, the output label must be one-hot encoded.
So, when I one-hot encoded the output label, each row's label became a one-hot vector, giving a matrix, while with the label encoder I got the same encoding as a flat array. Below is the snippet of one-hot encoding:
from sklearn.preprocessing import OneHotEncoder
import numpy as np

oht = OneHotEncoder()
y_train_oht = oht.fit_transform(np.array(y_train).reshape(-1, 1))
Below is the snippet of label encoding:
from sklearn.preprocessing import LabelEncoder
from keras.utils import to_categorical

le = LabelEncoder()
y_train_le = le.fit_transform(y_train)
y_train_le_cat = to_categorical(y_train_le)
[one-hot encoding sample output: a matrix with one column per class]
[label encoding sample output: a 1-D array of integer labels]
I find that one-hot encoding gives a matrix while label encoding gives an array. If one-hot encoding does the same job, why do we have a label encoder at all? What kind of optimization does the label encoder bring in?
If using the label encoder happens to be more optimal, then why do we not use the label encoder to encode categorical input data instead of one-hot encoding?

Label encoding imposes artificial order: if you label-encode your pet target as 'Dog':0, 'Cat':1, 'Turtle':2, 'Golden Fish':3, then you get the awkward situation where 'Dog' < 'Cat' and 'Turtle' is the average of 'Cat' and 'Golden Fish'.
In the case of predictor features (not the target), this is a problem, since your Random Forest may end up learning something like "if it is less than 'Turtle', then...".
Also, you may have categories in the test set (or, even worse, in new data during deployment) that were not present in training; the transformer doesn't know what to do with them, so it throws an error. Whether this matters depends on the particular problem and the particular feature you are encoding, and obviously it does not apply to the target variable.
With one-hot encoding, if a category absent from training is present at prediction time, it just gets encoded as 0 in each of the encoded features (the new columns representing each category), so you don't get an error. Your model still has the other features to make a reasonable guess.
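As a minimal sketch of that difference (the pet categories are made up, and OneHotEncoder only behaves this way when constructed with handle_unknown='ignore'):

from sklearn.preprocessing import LabelEncoder, OneHotEncoder
import numpy as np

train = np.array(['Dog', 'Cat', 'Turtle']).reshape(-1, 1)
test = np.array(['Golden Fish']).reshape(-1, 1)  # category unseen in training

ohe = OneHotEncoder(handle_unknown='ignore').fit(train)
print(ohe.transform(test).toarray())  # [[0. 0. 0.]] -- all zeros, no error

le = LabelEncoder().fit(train.ravel())
try:
    le.transform(test.ravel())
except ValueError as e:
    print(e)  # LabelEncoder refuses the unseen label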
As a general rule, you want to use label encoding for target variables and OHE for predictor features. Note that in general you don't care about artificial order in the target, since the prediction is usually categorical as well (a forest will choose a number, not a range of numbers; a network will have one activation unit per category...).
I don't think optimization should be part of the discussion here, since the two are used for different scenarios demanding different outputs: surely it's more efficient to use the OHE transformer than to hack it by performing label encoding and then some pandas trickery to create the same result as one-hot encoding.
Here there are useful comments about the different scenarios (type of model, type of data) and some issues related to efficiency.
Here there's an example on why label encoding is a bad practice for input features.
And let's not forget that the goal of the model is to make predictions, so in the end what's important is not just the output of <transformer>.fit_transform, but also the fitted transformer itself that is going to be applied to new observations. OHE deals with new cases differently than the label encoder (e.g. when the value of the feature in an observation was not present in the training set). That's, in my opinion, enough reason to have different methods, even when they act similarly enough that, for some inputs, you may be able to force them to give similar outputs.

Related

String classification, how to encode character-by-character and train?

I am trying to build a classifier to classify some files into 150 categories based on the name of those files. Here are some examples of file names in my dataset (~700k files):
104932489 - urgent - contract validation for xyz limited.msg
treatment - an I l - contract n°4934283 received by partner.pdf
- invoice_8843238_1_europe services_business 8592342sid paris.xls
140159498736656.txt
140159498736843.txt
fsk_000000001090296_sdiselacrusefeyre_2000912.xls
fsk_000000001091293_lidlsnd1753mdeas_2009316.xls
You can see that the filenames can really be anything, but there is always some pattern that is respected within the same category. It can be in the numbers (which are sometimes close), in the special characters (spaces, -, °), sometimes in the length, etc.
Extracting all those patterns one by one would take ages because I have approximately 700k documents. Also, I am not interested in 100% accuracy; 70% would be good enough.
The real problem is that I don't know how to encode this data. I have tried many methods:
Tokenizing character by character and feeding them to an LSTM model with an embedding layer. However, I wasn't able to implement it and got dimension errors.
Adapting Word2Vec to convert the characters into vectors. However, this automatically drops all punctuation and space characters, and I also lose the numeric data. Another problem is that it creates useless dimensions: with size 20, my data lives in 20 dimensions, but looking closely there are always the same 150 vectors in those 20 dimensions, so it's really useless. I could use a size of 2 dimensions, but I still need the numeric data and the special characters.
Generating n-grams from each path in the range 1-4, then using a CountVectorizer to compute the frequencies. I checked that special characters were not dropped, but it gave me about 400,000 features! I am running a dimensionality reduction using UMAP (n_components=5, metric='hellinger'), but the reduction runs for 2 hours and then the kernel crashes.
Any ideas?
I am currently also working on a character-level LSTM, and it works exactly the same as when you use words. You need a vocabulary, for example a-z, and then you just take the index of each letter as its integer representation. For example:
"bad" -> "b", "a", "d" -> [1, 0, 3]
Now you can create an embedding lookup table (for example using PyTorch's nn.Embedding module). You just have to create a random vector for every index of your vocab. For example:
"a" -> 0 > [-0.93, 0.024, -.0.73, ..., -0.12]
You said that you tried this but encountered dimension errors? Maybe show us the code!
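For reference, here is a minimal sketch of such a lookup table with nn.Embedding (the a-z vocabulary and the embedding size of 32 are arbitrary choices):

import torch
import torch.nn as nn

vocab = "abcdefghijklmnopqrstuvwxyz"
char_to_idx = {c: i for i, c in enumerate(vocab)}

# one random, trainable 32-dimensional vector per character in the vocab
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=32)

# "bad" -> indices [1, 0, 3] -> three 32-dimensional vectors
indices = torch.tensor([char_to_idx[c] for c in "bad"])
vectors = embedding(indices)
print(vectors.shape)  # torch.Size([3, 32])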
Or you could create non-random embeddings with word2vec, using the Gensim library:
from gensim.models import Word2Vec

# 'total_words' is a list containing every word of your dataset
# split into its characters, e.g. [['b', 'a', 'd'], ['c', 'a', 't'], ...]
total_words = [...]
save_model_file = "char_embeddings.model"

model = Word2Vec(total_words, min_count=1, size=32)
model.save(save_model_file)

# let's test it for the character 'a'
embedder = Word2Vec.load(save_model_file)
v = embedder.wv["a"]
# v is now the embedding vector of 'a', with shape (32,)
I hope this makes clear how to create embeddings for characters.
You can treat characters in single-word classification exactly the same way you would treat words in sentence classification.

Using sklearn DictVectorizer in real-time systems

Any binary one-hot encoding is aware only of values seen in training, so features not encountered during fitting will be silently ignored. For real-time systems, where you receive millions of records per second and features have very high cardinality, you need to keep your hasher/mapper updated with the data.
How can we update the hasher incrementally (rather than recomputing the entire fit() every time we encounter a new feature-value pair)? What is the suggested approach to tackle this?
It depends on the learning algorithm that you are using. If you are using a method designed for sparse data sets (FTRL, FFM, linear SVM), one possible approach is the following (note that it will introduce collisions between features and a lot of constant columns).
First, allocate for each sample a vector V of fixed length D (as large as possible).
For each categorical variable, evaluate hash(var_name + "_" + var_value) % D. This gives you an integer i, and you can store V[i] = 1.
Therefore, V never grows as new features appear. However, as soon as the number of features is large enough, some features will collide (i.e. be written to the same place), and this may result in an increased error rate...
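A minimal sketch of this hashing trick in plain Python (D and the sample record are illustrative; in practice you would use a stable hash such as murmurhash, since Python's built-in hash is salted per process):

import numpy as np

D = 2 ** 20  # fixed vector length, as large as memory allows

def hash_vectorize(record, D=D):
    # record is a dict of {var_name: var_value} for one sample
    V = np.zeros(D, dtype=np.uint8)
    for var_name, var_value in record.items():
        i = hash(var_name + "_" + str(var_value)) % D  # may collide
        V[i] = 1
    return V

v = hash_vectorize({"city": "Paris", "browser": "firefox"})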
Edit: you can write your own vectorizer to avoid collisions. First, call L the current number of features. Prepare the same vector V, of length 2L (the factor of 2 will allow you to avoid collisions as new features arrive, at least for some time, depending on the arrival rate of new features).
Starting with an empty dictionary<input_type, int>, associate an integer to each feature. If you have already seen the feature, return the int corresponding to it. If not, create a new entry with an integer corresponding to the new index. I think (but I am not sure) this is what LabelEncoder does for you.
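A sketch of that incremental mapping (class and method names are mine):

class IncrementalEncoder:
    """Assigns a fresh integer index the first time a feature is seen."""

    def __init__(self):
        self.index = {}  # feature -> int

    def encode(self, feature):
        if feature not in self.index:
            # new feature-value pair: take the next free index, no refit needed
            self.index[feature] = len(self.index)
        return self.index[feature]

Unlike refitting a LabelEncoder, a new feature-value pair here simply takes the next free slot.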

How to decide the numClasses parameter to be passed to the Random Forest algorithm in Spark MLlib with pySpark

I am working on classification using the Random Forest algorithm in Spark and have a sample dataset that looks like this:
Level1,Male,New York,New York,352.888890
Level1,Male,San Fransisco,California,495.8001345
Level2,Male,New York,New York,-495.8001345
Level1,Male,Columbus,Ohio,165.22352099
Level3,Male,New York,New York,495.8
Level4,Male,Columbus,Ohio,652.8
Level5,Female,Stamford,Connecticut,495.8
Level1,Female,San Fransisco,California,495.8001345
Level3,Male,Stamford,Connecticut,-552.8234
Level6,Female,Columbus,Ohio,7000
Here the last value in each row will serve as the label and the rest serve as features. But I want to treat the label as a category and not a number, so 165.22352099 will denote a category and so will -552.8234. For this I have encoded my features as well as my label into categorical data. What I am having difficulty with is deciding what I should pass for the numClasses parameter of the Random Forest algorithm in Spark MLlib. Should it be equal to the number of unique values in my label? My label has about 10000 unique values, so if I put 10000 as the value of numClasses, wouldn't that decrease the performance dramatically?
Here is the typical signature for building a Random Forest model in MLlib:
from pyspark.mllib.tree import RandomForest

model = RandomForest.trainClassifier(trainingData, numClasses=2,
                                     categoricalFeaturesInfo={},
                                     numTrees=3, featureSubsetStrategy="auto",
                                     impurity='gini', maxDepth=4, maxBins=32)
The confusion comes from the fact that you are doing something that you should not do. Your problem is clearly regression/ranking, not classification. Why would you think about it as classification? Try to answer these two questions:
Do you have at least 100 samples for each value (10,000 values * 100 = 1,000,000 samples)?
Is there really no structure in the classes? For example, are objects with value "200" no more similar to those with value "100" or "300" than to those with value "-1000" or "+2300"?
If at least one answer is no, then you should not treat this as a classification problem.
If for some weird reason you answered yes twice, then the answer is: "yes, you should encode each distinct value as a different class", leading to 10000 unique classes, which in turn leads to:
an extremely imbalanced classification (RF, without a balancing meta-learner, will nearly always fail in such a scenario)
an extreme number of classes (there are no models able to solve this; RF certainly will not)
an extremely small dimensionality of the problem: given how few features you have, I would be surprised if you could predict even a binary classification from them. And you can see how irregular these values are: you have three points that differ only in the first feature yet get completely different results:
Level1,Male,New York,New York,352.888890
Level2,Male,New York,New York,-495.8001345
Level3,Male,New York,New York,495.8
So to sum up: with nearly 100% certainty this is not a classification problem. You should either:
regress on the last value (keyword: regression)
build a ranking (keyword: learn to rank)
bucket your values into at most 10 different values and then classify (keywords: imbalanced classification, sparse binary representation); see the sketch below
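A minimal sketch of the bucketing option with pandas, using the label values from the question (the bin count of 3 is just for illustration):

import pandas as pd

df = pd.DataFrame({"last_value": [352.888890, 495.8001345, -495.8001345,
                                  165.22352099, 495.8, 652.8]})
# qcut puts roughly the same number of samples into each bucket,
# turning the continuous target into a handful of ordered classes
df["label_bucket"] = pd.qcut(df["last_value"], q=3, labels=False)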

How to convert plain text into feature/value pair format

I checked various SVM classifiers that use a feature/value pair format for classification purposes (I am focusing on SVMlight - http://svmlight.joachims.org/). The format is like this:
-1 1:0.43 3:0.12 9284:0.2 # abcdef
But as I am getting user input in the form of plain text, to classify it using SVMlight I need to convert the plain text to this format.
How can this be done?
You have to use some real-valued embedding. In other words, you have data in the space of texts, which is more or less the space of variable-length sequences of words. There are numerous approaches, each better for a different purpose. The most simple ones include:
encoding on the word level: each word is a "dimension", so in your case you create a dictionary of words and assign each word a consecutive integer. Each document can then be encoded as a vector, where each feature's value is, for example, "whether the word is in the document" (set of words), "how many times the word occurs" (bag of words, also known as term frequency, tf), or some more complex statistic (such as tf-idf: term frequency multiplied by inverse document frequency).
encoding on the level of n-grams: similar to the previous one, but instead of enumerating each word you enumerate each n-gram (an n-gram is any sequence of n words). This is a more syntactic feature, but it requires significantly more data to train on.
using some "magical encoding" or specialized "string kernels".
The first two approaches can easily be done using scikit-learn's tf-idf vectorizer; see http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html. The last one requires more complex software.
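As an illustration (the documents and labels here are made up), the tf-idf output can be written directly in the svmlight format shown above using sklearn's dump_svmlight_file:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.datasets import dump_svmlight_file

docs = ["the cat sat on the mat", "the dog barked loudly"]
labels = [-1, 1]

X = TfidfVectorizer().fit_transform(docs)
# zero_based=False gives 1-based feature indices, as svmlight expects
dump_svmlight_file(X, labels, "train.dat", zero_based=False)
# train.dat now contains lines in the format: <label> <index>:<tf-idf value> ...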

Vowpal Wabbit Readable Model

I was using Vowpal Wabbit and saved the trained classifier as a readable model.
My dataset had 22 features, and the readable model gave this output:
Version 7.2.1
Min label:-50.000000
Max label:50.000000
bits:18
0 pairs:
0 triples:
rank:0
lda:0
0 ngram:
0 skip:
options:
:0
101143:0.035237
101144:0.033885
101145:0.013357
101146:-0.007537
101147:-0.039093
101148:-0.013357
101149:0.001748
116060:0.499471
157941:-0.037318
157942:0.008038
157943:-0.011337
196772:0.138384
196773:0.109454
196774:0.118985
196775:-0.022981
196776:-0.301487
196777:-0.118985
197006:-0.000514
197007:-0.000373
197008:-0.000288
197009:-0.004444
197010:-0.006072
197011:0.000270
Can somebody please explain to me how to interpret the last portion of the file (after options:)? I was using logistic regression and I need to check how iterating over the training data updates my classifier, so that I can understand when I reach convergence...
Thanks in advance :)
The values you see are the hash values and weights of all your 22 features, plus one additional "Constant" feature (its hash value is 116060), in the resulting trained model.
The format is:
hash_value:weight
In order to see your original feature names instead of the hash values, you may use one of two methods:
Use the utl/vw-varinfo (in the source tree) utility on your training set with the same options you used for training. Try utl/vw-varinfo for a help/usage message
Use the relatively new --invert_hash readable.model option
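For example, a training invocation using the second method (file names here are placeholders):

vw -d train.vw -f model.bin --invert_hash readable.model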
BTW: inverting the hash values back to the original feature names is not done by default due to the large performance penalty. By default, vw applies a one-way hash to each feature string it sees; it doesn't maintain a hash map between feature names and their hash values at all.
Edit:
Another little tidbit that may be of interest is the first entry after options: which reads:
:0
It essentially means that any "other" feature (all those which are not in the training set, and thus not hashed into the weight vector) defaults to a weight of 0. This means it is redundant in vowpal-wabbit to train on features with values of zero, since zero is the default anyway; explicit :0 value features simply won't contribute anything to the model. Conversely, when you leave out a value in your training set, as in feature_name without a trailing :<value>, vowpal wabbit implicitly assumes it is a binary feature with a TRUE value. IOW: it defaults all value-less features to a value of one (:1) rather than a value of zero (:0). HTH.
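For example (feature names made up), these two input lines are equivalent, because the value-less color_red defaults to :1:

1 | color_red size:2.5
1 | color_red:1 size:2.5

while writing color_red:0 instead would contribute nothing to the model.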
Vowpal Wabbit also now has an --invert_hash option, which will give you a readable model with the actual variable names, not just the hashes.
It consumes a LOT more memory, but since your model seems to be pretty small, it will probably work.
