How to train a word embedding representation with gensim fasttext wrapper? - machine-learning

I would like to train my own word embeddings with fastText. However, after following the tutorial I cannot manage to do it properly. So far I have tried:
In:
from gensim.models.fasttext import FastText as FT_gensim
# Set file names for train and test data
corpus = df['sentences'].values.tolist()
model_gensim = FT_gensim(size=100)
# build the vocabulary
model_gensim.build_vocab(sentences=corpus)
model_gensim
Out:
<gensim.models.fasttext.FastText at 0x7f6087cc70f0>
In:
# train the model
model_gensim.train(
    sentences=corpus,
    epochs=model_gensim.epochs,
    total_examples=model_gensim.corpus_count,
    total_words=model_gensim.corpus_total_words
)
print(model_gensim)
Out:
FastText(vocab=107, size=100, alpha=0.025)
However, when I try to look up a word in the vocabulary:
print('return' in model_gensim.wv.vocab)
I get False, even though the word is present in the sentences I am passing to the fastText model. Also, when I check the most similar words to "return", I get single characters:
model_gensim.most_similar("return")
[('R', 0.15871645510196686),
('2', 0.08545402437448502),
('i', 0.08142799884080887),
('b', 0.07969795912504196),
('a', 0.05666942521929741),
('w', 0.03705815598368645),
('c', 0.032348938286304474),
('y', 0.0319858118891716),
('o', 0.027745068073272705),
('p', 0.026891689747571945)]
What is the correct way of using gensim's fasttext wrapper?

The gensim FastText class doesn't take plain strings as its training texts. It expects lists-of-words, instead. If you pass plain strings, they will look like lists-of-single-characters, and you'll get a stunted vocabulary like you're seeing.
Tokenize each item of your corpus into a list-of-word-tokens and you'll get closer-to-expected results. One super-simple way to do this might just be:
corpus = [s.split() for s in corpus]
But, usually you'd want to do other things to properly tokenize plain-text as well – perhaps case-flatten, or do something else with punctuation, etc.
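For a slightly more complete pass, gensim's own simple_preprocess helper lowercases and strips punctuation while tokenizing; a minimal sketch of applying it to the DataFrame column from the question (assuming df['sentences'] holds plain-text strings):
from gensim.utils import simple_preprocess
# tokenize each sentence into a lowercased, punctuation-free list of word tokens
corpus = [simple_preprocess(s) for s in df['sentences'].values.tolist()]
After rebuilding the vocabulary and retraining on this tokenized corpus, 'return' in model_gensim.wv.vocab should come back True, provided the word occurs at least min_count times.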

If you want to inspect the vocabulary words, you can write them out to a text file and look through them there. Something like this could be helpful:
with open("vocab.txt", "w", encoding="utf8") as vocab_out:
for word in model_gensim.wv.vocab:
vocab_out.write(word + "\n")

Related

Sentence embeddings BERT

I need some info.
I used this: https://towardsdatascience.com/improving-sentence-embeddings-with-bert-and-representation-learning-dfba6b444f6b to extract features, but I got word embeddings.
If I want sentence embeddings using BERT trained on my data, how can I do that?
Example: sentence "I want running" --> result [1,768] array embeddings
Thanks.
I can recommend several approaches. If you use HuggingFace, try the following suggestion:
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # batch size 1
outputs = model(input_ids)
last_hidden_states = outputs[0]  # the last hidden states are the first element of the output tuple
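If you need a fixed-size [1, 768] sentence vector from that output, one simple (if crude) option is to average the token vectors; a minimal sketch reusing last_hidden_states from the snippet above:
# last_hidden_states has shape [1, num_tokens, 768];
# averaging over the token axis gives one [1, 768] vector per sentence
sentence_embedding = last_hidden_states.mean(dim=1)
print(sentence_embedding.shape)  # torch.Size([1, 768])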
I invite you to use Sentence_Transformers. The project fine-tunes BERT / RoBERTa / DistilBERT / ALBERT / XLNet with a siamese or triplet network structure to produce semantically meaningful sentence embeddings. You can employ Flair to test the Sentence Transformer.
Alternatively you may try Flair TransformerDocumentEmbeddings. See examples.
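As a rough sketch of the Sentence_Transformers route (the pretrained model name below is just one of the library's published examples, not something from the question):
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('bert-base-nli-mean-tokens')
embeddings = model.encode(["I want running"])
print(embeddings.shape)  # (1, 768): one fixed-size vector per sentence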

what is meant by empty rows as feature vectors in text analysis?

I am doing movie review sentiment analysis using the data available in the Kaggle dataset here, using Python: https://www.kaggle.com/c/movie-review-sentiment-analysis-kernels-only/data
I do not get errors here, but I am trying to understand why the rows of feature vectors are empty in some cases.
After preprocessing my text (removing stop words, removing missing data, removing punctuation), I tokenize it into sequences using the following code:
from keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer(num_words=nwords, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', lower=True, split=' ')
tokenizer.fit_on_texts(df_train["cleansed_txt"].values)
tokenizer.fit_on_texts(df_test["cleansed_txt"].values)
X_train = tokenizer.texts_to_sequences(df_train["cleansed_txt"].values)
X_test = tokenizer.texts_to_sequences(df_test["cleansed_txt"].values)
And when I check what my X_train looks like, I find it looks like this. My question is: what should I understand from these numbers? Why are some of them empty?
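As an aside (and an assumption about this data rather than something shown in the post): texts_to_sequences silently drops any word that is not among the num_words most frequent words seen during fit_on_texts, so a review consisting only of rare or unseen words maps to an empty list. A tiny self-contained illustration:
from keras.preprocessing.text import Tokenizer
toy = Tokenizer(num_words=3)  # keep only the two most frequent words
toy.fit_on_texts(["good good movie", "bad plot"])
print(toy.texts_to_sequences(["plot twist"]))  # [[]] -- every word was dropped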

NLP Transformers: Best way to get a fixed sentence embedding-vector shape?

I'm loading a language model from torch hub (CamemBERT, a French RoBERTa-based model) and using it to embed some French sentences:
import torch
camembert = torch.hub.load('pytorch/fairseq', 'camembert.v0')
camembert.eval() # disable dropout (or leave in train mode to finetune)
def embed(sentence):
    tokens = camembert.encode(sentence)
    # Extract all layer's features (layer 0 is the embedding layer)
    all_layers = camembert.extract_features(tokens, return_all_hiddens=True)
    embeddings = all_layers[0]
    return embeddings
# Here we see that the shape of the embedding vector depends on the number of tokens in the sentence
u = embed(sentence="Bonjour, ça va ?")
u.shape # torch.Size([1, 7, 768])
v = embed(sentence="Salut, comment vas-tu ?")
v.shape # torch.Size([1, 9, 768])
Imagine now in order to do some semantic search, I want to calculate the cosine distance between the vectors (tensors in our case) u and v :
cos = torch.nn.CosineSimilarity(dim=1)
cos(u, v) # will throw an error since the shape of `u` is different from the shape of `v`
I'm asking: what is the best method to use in order to always get the same embedding shape for a sentence, regardless of the number of its tokens?
=> The first solution I'm thinking of is calculating the mean over axis=1 (the embedding of a sentence being the mean embedding of its tokens), since axis=0 and axis=2 always have the same size:
cos = torch.nn.CosineSimilarity(dim=1)
cos(u.mean(axis=1), v.mean(axis=1)) # works now and gives 0.7269
But I'm afraid that I'm hurting the sentence embedding by calculating the mean, since it gives the same weight to each token (maybe weighting by TF-IDF would help?).
=> The second solution is to pad the shorter sentences. That means:
giving a list of sentences to embed at a time (instead of embedding sentence by sentence)
looking up the sentence with the most tokens, embedding it, and getting its shape S
embedding the rest of the sentences and zero-padding them to the same shape S (so each shorter sentence has 0 in the remaining positions)
What are your thoughts?
What other techniques would you use and why?
Thanks in advance!
This is quite a general question, as there is no one specific right answer.
As you found out, of course the shapes differ because you get one output per token (depending on the tokenizer, those can be subword units). In other words, you have encoded all tokens into their own vector. What you want is a sentence embedding, and there are a number of ways to get those (with not one specifically right answer).
Particularly for sentence classification, we'd often use the output of the special classification token when the language model has been trained on it (CamemBERT uses <s>). Note that depending on the model, this can be the first (mostly BERT and children; also CamemBERT) or the last token (CTRL, GPT2, OpenAI, XLNet). I would suggest to use this option when available, because that token is trained exactly for this purpose.
If a [CLS] (or <s> or similar) token is not available, there are some other options that fall under the term pooling. Max and mean pooling are often used. What this means is that you take the max value token or the mean over all tokens. As you say, the "danger" is that you then reduce the vector value of the whole sentence to "some average" or "some max" that might not be very representative of the sentence. However, literature shows that this works quite well as well.
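As a minimal, self-contained sketch of what mean and max pooling over token vectors look like (the tensor shape here is just an assumption matching the examples above, batch x tokens x hidden):
import torch
token_vectors = torch.randn(1, 9, 768)        # stand-in for a model's per-token output
mean_pooled = token_vectors.mean(dim=1)       # shape [1, 768]
max_pooled = token_vectors.max(dim=1).values  # shape [1, 768]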
As another answer suggests, the layer whose output you use can make a difference as well. IIRC the Google paper on BERT suggests that they got the best score when concatenating the last four layers. This is more advanced and I will not go into it here unless requested.
I have no experience with fairseq, but using the transformers library, I'd write something like this (CamemBERT is available in the library from v2.2.0):
import torch
from transformers import CamembertModel, CamembertTokenizer
text = "Salut, comment vas-tu ?"
tokenizer = CamembertTokenizer.from_pretrained('camembert-base')
# encode() automatically adds the classification token <s>
token_ids = tokenizer.encode(text)
tokens = [tokenizer._convert_id_to_token(idx) for idx in token_ids]
print(tokens)
# unsqueeze token_ids because batch_size=1
token_ids = torch.tensor(token_ids).unsqueeze(0)
print(token_ids)
# load model
model = CamembertModel.from_pretrained('camembert-base')
# forward method returns a tuple (we only want the last hidden states)
# squeeze() because batch_size=1
output = model(token_ids)[0].squeeze()
# only grab output of CLS token (<s>), which is the first token
cls_out = output[0]
print(cls_out.size())
Printed output is (in order) the tokens after tokenisation, the token IDs, and the final size.
['<s>', '▁Salut', ',', '▁comment', '▁vas', '-', 'tu', '▁?', '</s>']
tensor([[ 5, 5340, 7, 404, 4660, 26, 744, 106, 6]])
torch.Size([768])
Bert-as-service is a great example of doing exactly what you are asking about.
They use padding. But read the FAQ about which layer to get the representation from and how to pool it: long story short, it depends on the task.
EDIT: I am not saying "use Bert-as-service"; I am saying "rip off what Bert-as-service does."
In your example, you are getting word embeddings (because of the layer you are extracting from). Here is how Bert-as-service does that. So, it actually shouldn't surprise you that this depends on sentence length.
You then talk about getting sentence embeddings by mean pooling over word embeddings. That is... a way to do it. But, using Bert-as-service as a guide for how to get a fixed-length representation from Bert...
Q: How do you get the fixed representation? Did you do pooling or something?
A: Yes, pooling is required to get a fixed representation of a sentence. In the default strategy REDUCE_MEAN, I take the second-to-last hidden layer of all of the tokens in the sentence and do average pooling.
So, to do Bert-as-service's default behavior, you'd do
def embed(sentence):
    tokens = camembert.encode(sentence)
    # Extract all layer's features (layer 0 is the embedding layer)
    all_layers = camembert.extract_features(tokens, return_all_hiddens=True)
    pooling_layer = all_layers[-2]
    embedded = pooling_layer.mean(1)  # 1 is the dimension you want to average over
    # note, using numpy to take the mean is bad if you want to stay on GPU
    return embedded
Take a look at sentence-transformers. Your model can be implemented as:
from sentence_transformers import SentenceTransformer, models
word_embedding_model = models.CamemBERT('camembert-base')
dim = word_embedding_model.get_word_embedding_dimension()
pooling_model = models.Pooling(dim, pooling_mode_mean_tokens=True, pooling_mode_cls_token=False, pooling_mode_max_tokens=False)
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])
sentences = ['sentence 1', 'sentence 2', 'sentence 3']
sentence_embeddings = model.encode(sentences)
In the benchmark section you can see a comparison to several embedding methods such as Bert as a Service which I wouldn't recommend for similarity tasks. Additionally you can fine tune the embeddings for your task.
Also interesting to try a multilingual model:
model = SentenceTransformer('distiluse-base-multilingual-cased')
model.encode([...])
This will probably yield better results than mean-pooling CamemBERT.

How to use bigrams + trigrams + word-marks vocabulary in countVectorizer?

I'm using text classification with naive Bayes and CountVectorizer to classify dialects. I read a research paper in which the author used a combination of:
bigrams + trigrams + word-marks vocabulary
By word-marks he means the words that are specific to a certain dialect.
How can I tweak those parameters in CountVectorizer?
(image with examples of word marks)
Those are examples of word marks, but they aren't what I have, because mine are Arabic, so I translated them:
word_marks=['love', 'funny', 'happy', 'amazing']
Those are used to classify a text.
Also, in this post:
Understanding the `ngram_range` argument in a CountVectorizer in sklearn
There was this answer :
>>> v = CountVectorizer(ngram_range=(1, 2), vocabulary={"keeps", "keeps the"})
>>> v.fit_transform(["an apple a day keeps the doctor away"]).toarray()
array([[1, 1]]) # unigram and bigram found
I couldn't understand the output. What does [1, 1] mean here? And how was he able to use ngrams together with a vocabulary? Aren't the two mutually exclusive?
You want to use the ngram_range argument to use bigrams and trigrams. In your case, it would be CountVectorizer(ngram_range=(1, 3)).
See the accepted answer to this question for more details.
Please provide an example of "word-marks" for the other part of your question.
You may have to run CountVectorizer twice - once for n-grams and once for your custom word-mark vocabulary. You can then concatenate the two outputs from the two CountVectorizers to get a single feature set of n-gram counts and custom vocabulary counts. The answer to the above question also explains how to specify a custom vocabulary for this second use of CountVectorizer.
Here's a SO answer on concatenating arrays
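A minimal sketch of that two-vectorizer idea (the example texts are made up; the word-mark list is the translated one from the question):
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer
texts = ["a funny and amazing movie", "such a sad plot"]
word_marks = ['love', 'funny', 'happy', 'amazing']
ngram_vec = CountVectorizer(ngram_range=(2, 3))    # bigrams + trigrams
mark_vec = CountVectorizer(vocabulary=word_marks)  # fixed word-mark vocabulary
# each row is one text; columns are the n-gram counts followed by the word-mark counts
X = hstack([ngram_vec.fit_transform(texts), mark_vec.fit_transform(texts)])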

How to construct a character based seq2seq model in tensorflow

What changes are required to the existing seq2seq model in tensorflow so that I can use character units rather than the existing word units for the seq2seq task? And would this be a good configuration for a predictive text application?
The following function signatures may need modification for this task:
def embedding_rnn_seq2seq(encoder_inputs, decoder_inputs, cell,
                          num_encoder_symbols, num_decoder_symbols,
                          output_projection=None, feed_previous=False,
                          dtype=dtypes.float32, scope=None):
Apart from the reduced input/output vocabulary, what other parameter changes would be required to implement such a character-level seq2seq model?
I think you could use the existing seq2seq model in tensorflow without any code changes for character-based units if you prepare your input data files by whitespace-separating the characters of your training examples, like this:
The quick brown fox.
Becomes:
T h e _SPACE_ q u i c k _SPACE_ b r o w n _SPACE_ f o x .
Then your vocabulary naturally becomes characters not words.
You can experiment with vocab sizes, embedding sizes, eliminating the embedding layer, etc. to see what works best for your data.
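A minimal sketch of that preprocessing step (the _SPACE_ marker is the one from the example above; the function name is just illustrative):
def to_char_tokens(line):
    # replace real spaces with a _SPACE_ marker, then put whitespace between the characters
    return ' '.join('_SPACE_' if ch == ' ' else ch for ch in line)
print(to_char_tokens("The quick brown fox."))
# T h e _SPACE_ q u i c k _SPACE_ b r o w n _SPACE_ f o x .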
