I am trying to build a classifier to classify some files into 150 categories based on the name of those files. Here are some examples of file names in my dataset (~700k files):
104932489 - urgent - contract validation for xyz limited.msg
treatment - an I l - contract n°4934283 received by partner.pdf
- invoice_8843238_1_europe services_business 8592342sid paris.xls
As you can see, the file names can be almost anything, but files in the same category always follow some pattern. The pattern can be in the numbers (which are sometimes close to each other), in the special characters (spaces, -, °), sometimes in the length, etc.
Extracting all those patterns one by one will take ages because I have approximately 700k documents. Also, I am not interested in 100% accuracy, 70% can be good enough.
The real problem is that I don't know how to encode this data. I have tried many methods:
Tokenizing character by character and feeding them to an LSTM model with an embedding layer. However, I wasn't able to implement it and got dimension errors.
Adapting Word2Vec to convert the characters into vectors. However, this automatically drops all punctuation and space characters, and I also lose the numeric data. Another problem is that it creates useless extra dimensions: with a size of 20, my data ends up in 20 dimensions, but on closer inspection it is always the same 150 vectors in those 20 dimensions, so it is not informative. I could use a size of 2, but I would still need the numeric data and the special characters.
Generating n-grams from each path, in the range 1-4, then using a CountVectorizer to compute the frequencies. I checked and special characters were not dropped but it gave me like 400,000 features! I am running a dimensionality reduction using UMAP (n_components=5, metric='hellinger') but the reduction runs for 2 hours and then the kernel crashes.
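The n-gram features described in the last item can be reproduced on a small scale without any library. This is only a minimal sketch of character n-gram counting, not CountVectorizer's actual implementation; `char_ngrams` is a hypothetical helper name and the file name is a shortened, made-up variant of the examples above.

```python
from collections import Counter

def char_ngrams(name, n_min=1, n_max=4):
    """Count every character n-gram of length n_min..n_max in a file name."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(name) - n + 1):
            counts[name[i:i + n]] += 1
    return counts

# Special characters and digits are kept as ordinary characters
features = char_ngrams("- invoice_8843238_1.xls")
print(features["_"], features["88"], features[".xls"])
print(len(features), "distinct n-grams for a single short name")
```

Even one short name produces dozens of distinct n-grams, which is how 700k names can end up with hundreds of thousands of features.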
I am currently also working on a character-level LSTM, and it works exactly the same way as it would with words. You need a vocabulary, for example a-z, and then you take the index of each letter as its integer representation. For example:
"bad" -> "b", "a", "d" -> [1, 0, 3]
Now you can create an embedding lookup table (for example with PyTorch's nn.Embedding module), which holds one randomly initialized, trainable vector for every index of your vocabulary. For example:
"a" -> 0 -> [-0.93, 0.024, -0.73, ..., -0.12]
You said that you tried this but encountered dimension errors? Maybe show us the code!
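Without seeing the code I can only guess, but a common source of dimension errors is that the encoded names have different lengths. Here is a minimal sketch of the encoding step that pads or truncates every name to a fixed length. The alphabet, the reserved ids for padding and unknown characters, and `max_len=16` are my assumptions, so the ids differ from the a-z example above.

```python
import string

# Assumed convention: id 0 is reserved for padding, id 1 for unknown characters
PAD, UNK = 0, 1
ALPHABET = string.ascii_lowercase + string.digits + " -_.°"
CHAR_TO_ID = {ch: i + 2 for i, ch in enumerate(ALPHABET)}

def encode(name, max_len=16):
    """Map a file name to a list of exactly max_len integer ids."""
    ids = [CHAR_TO_ID.get(ch, UNK) for ch in name.lower()[:max_len]]
    return ids + [PAD] * (max_len - len(ids))

batch = [encode(n) for n in ["bad.msg", "contract n°4934283 received.pdf"]]
print([len(row) for row in batch])  # every row has the same length
```

A batch built this way is rectangular, so it can be turned into a single integer tensor and fed to an embedding layer whose vocabulary size is `len(ALPHABET) + 2`.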
Or you could create non-random embeddings with word2vec using the Gensim library:
from gensim.models import Word2Vec
# 'total_words' is a list containing every word of your dataset split into its characters
total_words = [...]
# note: gensim >= 4.0 names this parameter vector_size; older versions call it size
model = Word2Vec(total_words, min_count=1, vector_size=32)
# let's test it for the character 'a'
# (a model saved with model.save(path) can be reloaded with Word2Vec.load(path))
v = model.wv["a"]
# v is now the embedding vector of 'a', a NumPy array of shape (32,)
I hope I could make clear how to create embeddings for characters.
You can treat characters in single-word-classification the exact same way you would treat words in sentence-classification.


Difference between One hot encoding and Label Encoding of target/output label

I have a problem with 20 classes. I have designed a neural network and am using categorical_crossentropy as the loss.
When dealing with categorical cross-entropy, the output label must be one-hot encoded.
When I one-hot encoded the output labels, the label in every row was one-hot encoded in a matrix, while with the label encoder I got the same encoding in an array.
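import numpy as np
from sklearn.preprocessing import OneHotEncoder
oht = OneHotEncoder()
y_train_oht = oht.fit_transform(np.array(y_train).reshape(-1,1))
Below is the snippet for label encoding:
from sklearn.preprocessing import LabelEncoder
from keras.utils import to_categorical
le = LabelEncoder()
y_train_le = le.fit_transform(y_train)
y_train_le_cat = to_categorical(y_train_le)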
One-hot encoding sample output: (image not included)
Label encoding sample output: (image not included)
I find that one-hot encoding gives a matrix while label encoding gives an array. If one-hot encoding does the same job, why do we have a label encoder? What kind of optimization does the label encoder bring?
If the label encoder happens to be more efficient, then why do we not use it to encode categorical input data instead of one-hot encoding?
Label encoding imposes an artificial order: if you label-encode your pet target as 'Dog':0, 'Cat':1, 'Turtle':2, 'Golden Fish':3, then you get the awkward situation where 'Dog' < 'Cat' and 'Turtle' is the average of 'Cat' and 'Golden Fish'.
In the case of predictor features (not the target), this is a problem since your Random Forest can end up learning something like "if it is less than 'Turtle', then...".
Also, you may have categories in the testing set (or even worse, in new data during deployment) that were not present in the training set, and the transformer doesn't know what to do with them, so it throws an error. Whether this can happen depends on the particular problem and the particular feature you are encoding; it obviously does not apply to the target variable.
With one-hot encoding, a category that is absent from the training set but shows up at prediction time can simply be encoded as 0 in each of the encoded features (the new columns representing each category), so you don't get an error. Note that scikit-learn's OneHotEncoder only does this when it is created with handle_unknown='ignore'; its default is to raise an error as well. Your model still has the other features to make a reasonable guess.
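The difference in how the two encodings treat an unseen category can be illustrated with a plain-Python stand-in. This is a sketch of the behaviour described above, not scikit-learn's implementation; the function names are hypothetical, and the all-zeros row corresponds to what OneHotEncoder does with handle_unknown='ignore'.

```python
def fit_categories(values):
    """Learn the sorted list of distinct categories from the training values."""
    return sorted(set(values))

def label_encode(categories, value):
    # Raises ValueError for a category that was not seen during fitting
    return categories.index(value)

def one_hot_encode(categories, value):
    # An unseen category matches no column, so the whole row is zeros
    return [1 if value == c else 0 for c in categories]

cats = fit_categories(["Dog", "Cat", "Turtle", "Cat"])
print(label_encode(cats, "Turtle"))         # an integer position in cats
print(one_hot_encode(cats, "Turtle"))       # one column set to 1
print(one_hot_encode(cats, "Golden Fish"))  # unseen category: all zeros
```

Calling `label_encode(cats, "Golden Fish")` fails instead, which mirrors the error a fitted label encoder raises on new categories.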
As a general rule, you want to use label encoding for target variables and OHE for predictor features. Note that in general you don't care about artificial order in the target, since the prediction is usually categorical also (A forest will choose a number, not a range of numbers; a network will have one activation unit per category...)
I don't think optimization should be part of the discussion here since they are used for different scenarios demanding different outputs: surely it's more efficient to use the OHE transformer than trying to hack it by performing label encoding and then some pandas trickery to create the same result as with one hot encoding.
Here there are useful comments about the different scenarios (type of model, type of data) and some issues related to efficiency.
Here there's an example on why label encoding is a bad practice for input features.
And let's not forget that the goal of the model is to make predictions, so at the end what's important is not just the output of <transformer>.fit_transform, but also the fitted transformer itself that's going to be applied to the new observations. OHE will deal with new cases differently than label-encoder (e.g. when the value of the feature in the observation was not present in the training set). That's in my opinion enough reason to have different methods, even when they act in a way similar enough so, for some inputs, you may be able to force them to give similar outputs.

How to verify if two text datasets are from different distribution?

I have two text datasets. Each dataset consists of multiple sequences and each sequence can have more than one sentence.
How do I measure if both datasets are from same distribution?
The purpose is to verify transfer learning from one distribution to another only if the difference between the distributions is statistically significant.
I am planning to use a chi-square test, but I am not sure it will help for text data, considering the high degrees of freedom.
Suppose I want to train a sentiment classification model. I train a model on the IMDb dataset and evaluate it on the IMDb and Yelp datasets. I found that my model trained on IMDb still does well on Yelp. But the question is: how different are these datasets?
Train Dataset :
Eval 1:
Eval 2:
How different are train and eval 1?
How different are train and eval 2?
Is the dissimilarity between train and eval 2 due to chance? What are the statistical significance and the p-value?
The question "are text A and text B coming from the same distribution?" is somewhat poorly defined. For example, these two questions (1,2) can be viewed as generated from the same distribution (the distribution of all questions on StackExchange) or from different distributions (the distributions of two different subdomains of StackExchange). So it's not clear which property you want to test.
Anyway, you can come up with any test statistic of your choice, approximate its distribution in case of "single source" by simulation, and calculate the p-value of your test.
As a toy example, let's take two small corpora: two random articles from English Wikipedia. I'll do it in Python:
import requests
from bs4 import BeautifulSoup
# the two article URLs are missing from this copy of the answer
urls = [...]
texts = [
    BeautifulSoup(requests.get(u).text, 'html.parser').find('div', {'class': 'mw-parser-output'}).text
    for u in urls
]
Now I use a primitive tokenizer to count individual words in texts, and use root mean squared difference in word relative frequencies as my test statistic. You can use any other statistic, as long as you calculate it consistently.
import re
from collections import Counter
from copy import deepcopy
TOKEN = re.compile(r'([^\W\d]+|\d+|[^\w\s])')
counters = [Counter(re.findall(TOKEN, t)) for t in texts]
print([sum(c.values()) for c in counters])
# [5068, 4053]: texts are of approximately the same size
def word_freq_rmse(c1, c2):
    result = 0
    vocab = set(c1.keys()).union(set(c2.keys()))
    n1, n2 = sum(c1.values()), sum(c2.values())
    n = len(vocab)
    for word in vocab:
        result += (c1[word]/n1 - c2[word]/n2)**2 / n
    return result**0.5
# rmse is 0.001178, but is this a small or large difference?
I get a value of 0.001178, but I don't know whether it's a large difference. So I need to simulate the distribution of this test statistic under the null hypothesis: when both texts are from the same distribution. To simulate it, I merge two texts into one, and then split them randomly, and calculate my statistic when comparing these two random parts.
import random
tokens = [tok for t in texts for tok in re.findall(TOKEN, t)]
split = sum(counters[0].values())
distribution = []
for i in range(1000):
    # reshuffle the pooled tokens so that every iteration is a new random split
    random.shuffle(tokens)
    c1 = Counter(tokens[:split])
    c2 = Counter(tokens[split:])
    distribution.append(word_freq_rmse(c1, c2))
Now I can see how unusual is the value of my observed test statistic under the null hypothesis:
observed = word_freq_rmse(*counters)
p_value = sum(x >= observed for x in distribution) / len(distribution)
print(p_value) # it is 0.0
print(observed, max(distribution), sum(distribution) / len(distribution)) # 0.0011 0.0006 0.0004
We see that when texts are from the same distribution, my test statistic is on average 0.0004 and almost never exceeds 0.0006, so the value of 0.0011 is very unusual, and the null hypothesis that my two texts originate from the same distribution should be rejected.
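If you want to try the procedure without downloading anything, the same permutation test can be condensed into one self-contained function. This is a sketch under my own assumptions: the two toy "corpora" are made-up sentences repeated 20 times, `permutation_p_value` is a hypothetical name, and the seed is fixed only to make runs repeatable.

```python
import random
from collections import Counter

def freq_rmse(c1, c2):
    """Root mean squared difference of relative word frequencies."""
    vocab = set(c1) | set(c2)
    n1, n2 = sum(c1.values()), sum(c2.values())
    return (sum((c1[w] / n1 - c2[w] / n2) ** 2 for w in vocab) / len(vocab)) ** 0.5

def permutation_p_value(tokens_a, tokens_b, n_rounds=1000, seed=0):
    """Share of random re-splits whose statistic is at least the observed one."""
    rng = random.Random(seed)
    observed = freq_rmse(Counter(tokens_a), Counter(tokens_b))
    pooled = list(tokens_a) + list(tokens_b)
    split = len(tokens_a)
    hits = 0
    for _ in range(n_rounds):
        rng.shuffle(pooled)  # new random split of the merged corpus
        stat = freq_rmse(Counter(pooled[:split]), Counter(pooled[split:]))
        hits += stat >= observed
    return hits / n_rounds

# Made-up toy corpora, only to exercise the function
a = "the film was great and the cast was great".split() * 20
b = "the food was cold and the service was slow".split() * 20
print(permutation_p_value(a, b))  # small: the corpora use different words
print(permutation_p_value(a, a))  # large: identical corpora
```

Any other statistic can be dropped in place of `freq_rmse`, as long as the observed value and the simulated values are computed the same way.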
I wrote an article which is similar to your problem but not exactly the same.
The problem that I was trying to solve is to check if a word has different (significant) distributions across categories or labels.
There are a few similarities between your problem and the one I had mentioned above.
You want to compare two sources of datasets, which can be taken as two different categories
Also, to compare the data sources, you will have to compare the words as sentences can't be directly compared
So, my proposed solution to this will be as:
Create word features across the two datasets using a count vectorizer and take the top X words from each.
Say the total number of distinct words is N. Initialize count=0 and compare the distribution of each word across the two datasets; if the difference is significant, increment the counter. There can also be cases where a word exists in only one of the datasets. That is good news, in the sense that it is a distinguishing feature, so increment the count for it as well.
Say the final count is n. The lower the n/N ratio, the more similar the two texts are, and vice versa.
To verify this methodology, split the data from a single source into two (by random sampling) and run the above analysis. The n/N ratio should come out close to 0, indicating that the two parts are similar, which is indeed the case.
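The steps above can be sketched in plain Python. The steps do not fix which per-word significance test to use, so the pooled two-proportion z-test here is my assumption, as are the function names, `top_x=50`, `alpha=0.05` and the made-up toy data.

```python
import math
from collections import Counter

def two_proportion_p_value(k1, n1, k2, n2):
    """Two-sided p-value of a pooled two-proportion z-test."""
    pooled = (k1 + k2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        return 1.0
    z = abs(k1 / n1 - k2 / n2) / se
    return math.erfc(z / math.sqrt(2))

def differing_word_ratio(tokens_a, tokens_b, top_x=50, alpha=0.05):
    """Return n/N: the share of top words whose frequency differs significantly."""
    c1, c2 = Counter(tokens_a), Counter(tokens_b)
    n1, n2 = len(tokens_a), len(tokens_b)
    # N: union of the top X words of each dataset
    words = {w for c in (c1, c2) for w, _ in c.most_common(top_x)}
    count = 0
    for w in words:
        only_in_one = (c1[w] == 0) != (c2[w] == 0)
        if only_in_one or two_proportion_p_value(c1[w], n1, c2[w], n2) < alpha:
            count += 1
    return count / len(words)

# Made-up toy data
a = "the film was great and the cast was great".split() * 20
b = "the food was cold and the service was slow".split() * 20
print(differing_word_ratio(a, b))  # high ratio: many distinguishing words
print(differing_word_ratio(a, a))  # same source: ratio of 0
```

With many words tested at once, some will look significant by chance, so on real data the ratio for a single-source split is a useful baseline to compare against.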
Please let me know if this approach worked or not, also if you think there are any flaws in this, I would love to think and try evolving it.

NLP Transformers: Best way to get a fixed sentence embedding-vector shape?

I'm loading a language model from torch hub (CamemBERT, a French RoBERTa-based model) and using it to embed some French sentences:
import torch
camembert = torch.hub.load('pytorch/fairseq', 'camembert.v0')
camembert.eval() # disable dropout (or leave in train mode to finetune)
def embed(sentence):
    tokens = camembert.encode(sentence)
    # Extract all layer's features (layer 0 is the embedding layer)
    all_layers = camembert.extract_features(tokens, return_all_hiddens=True)
    embeddings = all_layers[0]
    return embeddings
# Here we see that the shape of the embedding vector depends on the number of tokens in the sentence
u = embed(sentence="Bonjour, ça va ?")
u.shape # torch.Size([1, 7, 768])
v = embed(sentence="Salut, comment vas-tu ?")
v.shape # torch.Size([1, 9, 768])
Imagine now in order to do some semantic search, I want to calculate the cosine distance between the vectors (tensors in our case) u and v :
cos = torch.nn.CosineSimilarity(dim=1)
cos(u, v) # will throw an error since the shape of `u` is different from the shape of `v`
What is the best method to always get the same embedding shape for a sentence, regardless of its token count?
=> The first solution I'm thinking of is calculating the mean on axis=1 (the embedding of a sentence is the mean of the embeddings of its tokens), since axis=0 and axis=2 always have the same size:
cos = torch.nn.CosineSimilarity(dim=1)
cos(u.mean(axis=1), v.mean(axis=1)) # works now and gives 0.7269
But, I'm afraid that I'm hurting the embedding of the sentence when calculating the mean since it gives the same weight for each token (maybe multiplying by TF-IDF?).
=> The second solution is to pad shorter sentences out. That means:
giving a list of sentences to embed at a time (instead of embedding sentence by sentence)
finding the sentence with the most tokens and embedding it to get its shape S
embedding the rest of the sentences, then padding them with zeros up to the same shape S (the sentence has 0 in the remaining positions)
What are your thoughts?
What other techniques would you use and why?
Thanks in advance!
This is quite a general question, as there is no one specific right answer.
As you found out, of course the shapes differ because you get one output per token (depending on the tokenizer, those can be subword units). In other words, you have encoded all tokens into their own vector. What you want is a sentence embedding, and there are a number of ways to get those (with not one specifically right answer).
Particularly for sentence classification, we'd often use the output of the special classification token when the language model has been trained on it (CamemBERT uses <s>). Note that depending on the model, this can be the first token (mostly BERT and children; also CamemBERT) or the last token (CTRL, GPT2, OpenAI, XLNet). I would suggest using this option when available, because that token is trained exactly for this purpose.
If a [CLS] (or <s> or similar) token is not available, there are some other options that fall under the term pooling. Max and mean pooling are often used. What this means is that, for each dimension, you take either the maximum value over all token vectors or the mean over all token vectors. As you say, the "danger" is that you then reduce the vector value of the whole sentence to "some average" or "some max" that might not be very representative of the sentence. However, the literature shows that this works quite well.
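To make the three options concrete, here is a dependency-free sketch that treats a sentence as a list of token vectors. The numbers are made up and the function names are mine. On a torch tensor of shape [1, n_tokens, dim], the corresponding calls would be `output.mean(dim=1)`, `output.max(dim=1).values` and `output[:, 0]`.

```python
def mean_pool(token_vectors):
    """Average each dimension over all tokens."""
    n = len(token_vectors)
    return [sum(dim) / n for dim in zip(*token_vectors)]

def max_pool(token_vectors):
    """Take the element-wise maximum over all tokens."""
    return [max(dim) for dim in zip(*token_vectors)]

def cls_pool(token_vectors):
    """Use the vector of the first token (<s> / [CLS])."""
    return list(token_vectors[0])

# Made-up 3-dimensional vectors for sentences of 2 and 3 tokens
u = [[1.0, 0.0, 2.0], [3.0, 4.0, 0.0]]
v = [[0.0, 1.0, 1.0], [2.0, 2.0, 2.0], [4.0, 0.0, 3.0]]
for pool in (mean_pool, max_pool, cls_pool):
    # Both results have length 3, whatever the number of tokens
    print(pool.__name__, pool(u), pool(v))
```

Whichever pooling you pick, the result has the same size for every sentence, so the cosine similarity between two sentences is always defined.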
As another answer suggests, the layer whose output you use can make a difference as well. IIRC the Google paper on BERT reports that they got the best score when concatenating the last four layers. This is more advanced and I will not go into it here unless requested.
I have no experience with fairseq, but using the transformers library, I'd write something like this (CamemBERT is available in the library from v2.2.0):
import torch
from transformers import CamembertModel, CamembertTokenizer
text = "Salut, comment vas-tu ?"
tokenizer = CamembertTokenizer.from_pretrained('camembert-base')
# encode() automatically adds the classification token <s>
token_ids = tokenizer.encode(text)
tokens = tokenizer.convert_ids_to_tokens(token_ids)
print(tokens)
# unsqueeze token_ids because batch_size=1
token_ids = torch.tensor(token_ids).unsqueeze(0)
print(token_ids)
# load model
model = CamembertModel.from_pretrained('camembert-base')
# the forward method returns a tuple-like output; its first element is the last hidden state (not logits)
# squeeze() because batch_size=1
output = model(token_ids)[0].squeeze()
# only grab output of CLS token (<s>), which is the first token
cls_out = output[0]
print(cls_out.size())
Printed output is (in order) the tokens after tokenisation, the token IDs, and the final size.
['<s>', '▁Salut', ',', '▁comment', '▁vas', '-', 'tu', '▁?', '</s>']
tensor([[ 5, 5340, 7, 404, 4660, 26, 744, 106, 6]])
torch.Size([768])
Bert-as-service is a great example of doing exactly what you are asking about.
They use padding. But read the FAQ regarding which layer to take the representation from and how to pool it: long story short, it depends on the task.
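When padding is combined with mean pooling, the padded positions should be left out of the average, which is usually done with a mask. Here is a minimal plain-Python sketch of that idea. It is not Bert-as-service's code; the function names, the token ids and `pad_id=0` are assumptions (real models define their own padding id).

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad token-id lists to the longest length and return a 0/1 mask."""
    max_len = max(len(s) for s in sequences)
    padded = [s + [pad_id] * (max_len - len(s)) for s in sequences]
    mask = [[1] * len(s) + [0] * (max_len - len(s)) for s in sequences]
    return padded, mask

def masked_mean(token_vectors, mask_row):
    """Average token vectors, skipping positions where the mask is 0."""
    kept = [vec for vec, m in zip(token_vectors, mask_row) if m]
    return [sum(dim) / len(kept) for dim in zip(*kept)]

# Made-up token ids for two sentences of different length
padded, mask = pad_batch([[5, 5340, 7, 6], [5, 404, 6]])
print(padded)
print(mask)

# Made-up vectors: the third position is padding and must not affect the mean
print(masked_mean([[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]], [1, 1, 0]))
```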
EDIT: I am not saying "use Bert-as-service"; I am saying "rip off what Bert-as-service does."
In your example, you are getting word embeddings (because of the layer you are extracting from). Here is how Bert-as-service does that. So, it actually shouldn't surprise you that this depends on sentence length.
You then talk about getting sentence embeddings by mean pooling over word embeddings. That is... a way to do it. But, using Bert-as-service as a guide for how to get a fixed-length representation from Bert...
Q: How do you get the fixed representation? Did you do pooling or something?
A: Yes, pooling is required to get a fixed representation of a sentence. In the default strategy REDUCE_MEAN, I take the second-to-last hidden layer of all of the tokens in the sentence and do average pooling.
So, to do Bert-as-service's default behavior, you'd do
def embed(sentence):
    tokens = camembert.encode(sentence)
    # Extract all layer's features (layer 0 is the embedding layer)
    all_layers = camembert.extract_features(tokens, return_all_hiddens=True)
    pooling_layer = all_layers[-2]
    embedded = pooling_layer.mean(1)  # 1 is the dimension you want to average over
    # note, using numpy to take the mean is bad if you want to stay on GPU
    return embedded
Take a look at sentence-transformers. Your model can be implemented as:
from sentence_transformers import SentenceTransformer, models
# models.CamemBERT is the wrapper from older releases; if yours no longer has it,
# the generic models.Transformer('camembert-base') plays the same role
word_embedding_model = models.CamemBERT('camembert-base')
dim = word_embedding_model.get_word_embedding_dimension()
pooling_model = models.Pooling(dim, pooling_mode_mean_tokens=True, pooling_mode_cls_token=False, pooling_mode_max_tokens=False)
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])
sentences = ['sentence 1', 'sentence 2', 'sentence 3']
sentence_embeddings = model.encode(sentences)
In the benchmark section you can see a comparison to several embedding methods such as Bert as a Service which I wouldn't recommend for similarity tasks. Additionally you can fine tune the embeddings for your task.
Also interesting to try a multilingual model:
model = SentenceTransformer('distiluse-base-multilingual-cased')
Will probably yield better results than mean pooling CamemBert.

Stanford GloVe's lack of punctuation?

I understand that GloVe trains vectors by noticing what frequently co-occurs, etc., but how come commas and periods are not included? For anything NLP, it seems like it would be important to have a vector representation for them. I realize that something like (king - man = queen) would make no sense with (word - , = ?), but is there a way to represent punctuation marks and numbers?
Is there a pre-made data set that includes such things? Would this even work?
I tried training GloVe with my own data set, but I ran into a problem with separating the punctuation (with a blank space) between words, etc.
Pre-trained GloVe vectors do have punctuation; what makes you think they don't? At least the Wikipedia 2014 + Gigaword 5 (6B tokens) set has embeddings for ",", ".", "-" and others. Just download these word vectors and verify it yourself; they are in plain text format, so it's easy to do.
I have worked a bit with the word vectors used by Senna, and I am looking at the vocab list.
I definitely see entries for punctuation.
A trick for handling numbers is to replace every digit with 0, and then learn a distribution for each pattern. For instance 1999 is mapped to 0000, 01-01-2015 is mapped to 00-00-0000, etc...
Senna has entries for these patterns like 0000, etc...
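The digit trick is a one-line regular expression. A minimal sketch (the function name is mine):

```python
import re

def normalize_digits(text):
    """Replace every digit with 0 so that numbers collapse into shape patterns."""
    return re.sub(r"\d", "0", text)

print(normalize_digits("1999"))        # 0000
print(normalize_digits("01-01-2015"))  # 00-00-0000
```

All four-digit years then share the single vocabulary entry 0000, and all dates written in the same layout share 00-00-0000.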
I will look over GloVe and try to update this answer soon...
It is totally OK, and also common, to handle punctuation marks as single tokens for word vector generation. See for example the word2vec papers. I assume that the prebuilt word2vec datasets include punctuation, and I'm sure the prebuilt GloVe vectors do as well.
There are a lot of tokenizers that split punctuation marks off as separate tokens. One I know for sure is the ARK Tweet Tokenizer.
I have used the following conversion for numbers and punctuation. It is not a great approach, but it can be slightly useful.
For numbers, I convert all numbers to "NUM".
ex: 178 = "NUM" or 654 = "NUM"
For punctuation, I convert every mark to "PUNC".
ex: apple, orange, banana = apple "PUNC" orange "PUNC" banana
This is not a good solution, but it works to some extent.
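A rough sketch of that conversion, assuming a simple regular-expression tokenizer of my own choosing (runs of digits, runs of other word characters, and single punctuation marks):

```python
import re

# Assumed tokenizer: digit runs, other word-character runs, single punctuation marks
TOKEN = re.compile(r"\d+|[^\W\d]+|[^\w\s]")

def replace_num_punc(text):
    """Replace numbers with NUM and punctuation marks with PUNC."""
    tokens = []
    for tok in TOKEN.findall(text):
        if tok.isdigit():
            tokens.append("NUM")
        elif re.fullmatch(r"[^\w\s]", tok):
            tokens.append("PUNC")
        else:
            tokens.append(tok)
    return " ".join(tokens)

print(replace_num_punc("apple, orange, banana"))  # apple PUNC orange PUNC banana
print(replace_num_punc("178 apples."))            # NUM apples PUNC
```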

How to convert plain text into feature/value pair format

I checked various SVM classifiers that use a feature/value pair format for classification. I am focusing on svmlight, whose format looks like this:
-1 1:0.43 3:0.12 9284:0.2 # abcdef
But as I am getting user input in the form of plain text, I need to convert the plain text to this format in order to classify it with svmlight.
How could this be done?
You have to use some real-valued embedding. In other words, your data lives in the space of texts, which is more or less the space of variable-length sequences of words. There are numerous approaches, each better suited to some purposes than others. The simplest ones include:
encode on the word level, so that each word is a "dimension": you create a dictionary of words and assign each word a consecutive integer. Each document can then be encoded as a vector, where each feature's value is, for example, "is the word in the document" (set of words), "how many times does the word occur" (bag of words; also known as term frequency, tf), or some more complex statistic (for example tf-idf: term frequency multiplied by inverse document frequency).
encode on the level of n-grams: similar to the previous one, but instead of enumerating each word you enumerate each n-gram (an n-gram is any sequence of n words). This is a more syntactic feature, but it requires significantly more data to train on.
use some "magical encoding" or specialized "string kernels".
The first two approaches can be easily done using scikit-learn's TfidfVectorizer. The last one requires more complex software.
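For the first approach with plain term counts, the conversion can be sketched without any library. This is only an illustration under my own assumptions: the helper names and example documents are made up, words are split with a simple regular expression, and words missing from the dictionary are dropped. Feature indices start at 1 and are written in increasing order, as svmlight expects.

```python
import re
from collections import Counter

def build_vocab(documents):
    """Assign each distinct word a consecutive index, starting at 1."""
    vocab = {}
    for doc in documents:
        for word in re.findall(r"\w+", doc.lower()):
            vocab.setdefault(word, len(vocab) + 1)
    return vocab

def to_svmlight(label, text, vocab):
    """Format one document as '<label> <index>:<count> ... # <text>'."""
    # Words that are not in the vocabulary are ignored
    counts = Counter(w for w in re.findall(r"\w+", text.lower()) if w in vocab)
    pairs = sorted((vocab[w], c) for w, c in counts.items())
    features = " ".join(f"{i}:{c}" for i, c in pairs)
    return f"{label} {features} # {text}"

docs = ["good good movie", "bad movie"]
vocab = build_vocab(docs)
print(to_svmlight(1, docs[0], vocab))          # 1 1:2 2:1 # good good movie
print(to_svmlight(-1, "bad bad plot", vocab))  # -1 3:2 # bad bad plot
```

Swapping the raw counts for tf-idf weights only changes the value written after each colon; the line format stays the same.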
