Sentence embeddings with BERT

I need some information.
I used this guide: https://towardsdatascience.com/improving-sentence-embeddings-with-bert-and-representation-learning-dfba6b444f6b to extract features, but I got word embeddings.
If I want sentence embeddings from a BERT model trained on my data, how can I do that?
Example: the sentence "I want running" --> a [1, 768] embedding array.
Thanks.

I can recommend several approaches. If you use Hugging Face transformers, try the following:
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # batch size 1
outputs = model(input_ids)
last_hidden_states = outputs[0]  # the last hidden state is the first element of the output tuple
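The tensor above has shape [1, sequence_length, 768], i.e. one vector per token. A common way (one option among several) to collapse it into a single [1, 768] sentence embedding is to pool over the token dimension, for example:
sentence_embedding = last_hidden_states.mean(dim=1)  # mean pooling -> shape [1, 768]
cls_embedding = last_hidden_states[:, 0, :]          # or take the [CLS] token -> shape [1, 768]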
I also invite you to look at Sentence-Transformers. The project fine-tunes BERT / RoBERTa / DistilBERT / ALBERT / XLNet with a siamese or triplet network structure to produce semantically meaningful sentence embeddings. The Sentence Transformer models can also be used from within Flair.
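A minimal sketch with the sentence-transformers package (the model name below is just one example of a pre-trained model with 768-dimensional output):
from sentence_transformers import SentenceTransformer

st_model = SentenceTransformer('bert-base-nli-mean-tokens')  # example pre-trained model
embeddings = st_model.encode(["I want running"])             # numpy array of shape (1, 768)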
Alternatively, you may try Flair's TransformerDocumentEmbeddings; see the examples in its documentation.
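For instance, a minimal sketch assuming the flair package is installed:
from flair.data import Sentence
from flair.embeddings import TransformerDocumentEmbeddings

doc_embedding = TransformerDocumentEmbeddings('bert-base-uncased')
sentence = Sentence("I want running")
doc_embedding.embed(sentence)
print(sentence.embedding.shape)  # a single 768-dimensional vector for the whole sentence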

Related

How to update the vocabulary of a pre-trained BERT model while doing my own training task?

I am working on a task of predicting a masked word using a BERT model. Unlike the usual setting, the answer has to be chosen from a specific set of options.
For instance:
sentence: "In my daily [MASK], ..."
options: A. word1  B. word2  C. word3  D. word4
The predicted word must be chosen from the four given words.
I use Hugging Face's BertForMaskedLM for this task. The model gives me a probability matrix representing every vocabulary word's probability of appearing in the [MASK] position, and I just need to compare the probabilities of the option words to select the answer.
# Predict all tokens
with torch.no_grad():
    predictions = model(tokens_tensor, segments_tensors)

# predicted_index = torch.argmax(predictions[0, masked_index]).item()
# predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]

A = predictions[0, masked_pos][tokenizer.convert_tokens_to_ids([option1])]
B = predictions[0, masked_pos][tokenizer.convert_tokens_to_ids([option2])]
C = predictions[0, masked_pos][tokenizer.convert_tokens_to_ids([option3])]
D = predictions[0, masked_pos][tokenizer.convert_tokens_to_ids([option4])]
# and then select from A, B, C, D
But the problem is:
If an option is not in "bert-vocabulary.txt", the above method does not work, since the output matrix gives no probability for it. The same problem appears when an option is not a single word.
Should I update the vocabulary, and how do I do that? Or how can I further train the model to add new words on top of the pre-training?
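A rough sketch of one possible direction, assuming the Hugging Face transformers API: new tokens can be added to the tokenizer and the model's embedding matrix resized to match; the newly added embeddings would still have to be learned, for example by continuing masked-language-model training on your own data.
num_added = tokenizer.add_tokens(['newword1', 'newword2'])  # hypothetical new words to add
model.resize_token_embeddings(len(tokenizer))               # grow the embedding matrix accordingly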

How to train a word embedding representation with the gensim fastText wrapper?

I would like to train my own word embeddings with fastText. However, after following the tutorial I cannot manage to do it properly. So far I have tried:
In:
from gensim.models.fasttext import FastText as FT_gensim

# build the corpus from a dataframe column
corpus = df['sentences'].values.tolist()
model_gensim = FT_gensim(size=100)

# build the vocabulary
model_gensim.build_vocab(sentences=corpus)
model_gensim
Out:
<gensim.models.fasttext.FastText at 0x7f6087cc70f0>
In:
# train the model
model_gensim.train(
    sentences=corpus,
    epochs=model_gensim.epochs,
    total_examples=model_gensim.corpus_count,
    total_words=model_gensim.corpus_total_words
)
print(model_gensim)
Out:
FastText(vocab=107, size=100, alpha=0.025)
However, when I look for a word in the vocabulary:
print('return' in model_gensim.wv.vocab)
I get False, even though the word is present in the sentences I am passing to the fastText model. Also, when I check the words most similar to "return", I get single characters:
model_gensim.most_similar("return")
[('R', 0.15871645510196686),
('2', 0.08545402437448502),
('i', 0.08142799884080887),
('b', 0.07969795912504196),
('a', 0.05666942521929741),
('w', 0.03705815598368645),
('c', 0.032348938286304474),
('y', 0.0319858118891716),
('o', 0.027745068073272705),
('p', 0.026891689747571945)]
What is the correct way of using gensim's fasttext wrapper?
The gensim FastText class doesn't take plain strings as its training texts. It expects lists-of-words, instead. If you pass plain strings, they will look like lists-of-single-characters, and you'll get a stunted vocabulary like you're seeing.
Tokenize each item of your corpus into a list-of-word-tokens and you'll get closer-to-expected results. One super-simple way to do this might just be:
corpus = [s.split() for s in corpus]
But, usually you'd want to do other things to properly tokenize plain-text as well – perhaps case-flatten, or do something else with punctuation, etc.
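Putting it together, a sketch that reuses the names from the question (whether a particular word such as 'return' ends up in the vocabulary also depends on the min_count setting):
from gensim.models.fasttext import FastText as FT_gensim

corpus = [s.split() for s in df['sentences'].values.tolist()]  # tokenize each sentence

model_gensim = FT_gensim(size=100)
model_gensim.build_vocab(sentences=corpus)
model_gensim.train(
    sentences=corpus,
    epochs=model_gensim.epochs,
    total_examples=model_gensim.corpus_count,
    total_words=model_gensim.corpus_total_words
)
print('return' in model_gensim.wv.vocab)  # True, provided 'return' meets the min_count threshold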
If you want to inspect the vocabulary, you can also write the words out to a text file:
with open("vocab.txt", "w", encoding="utf8") as vocab_out:
    for word in model_gensim.wv.vocab:
        vocab_out.write(word + "\n")

Implementing Luong and Manning's hybrid model

(Image: hybrid word-character model)
As shown in the image above, I need to create a hybrid encoder-decoder (seq2seq) network which takes both word and character embeddings as input.
As shown in the image, consider the sentence:
A cute cat
Hypothetically, the in-vocabulary words are:
a, cat
and the out-of-vocabulary word is:
cute
We feed the words a and cat as their respective embeddings, but since cute is out of vocabulary, we would normally feed it with the embedding of a universal unknown token.
Instead, in this case I need to pass the out-of-vocabulary word (cute) through another seq2seq layer character by character to generate its embedding on the fly.
Both seq2seq layers must be trained jointly, end to end.
The following is a snippet of my code for the main encoder-decoder network, which takes word-based inputs, in Keras:
model = Sequential()
model.add(Embedding(X_vocab_len + y_vocab_len, 300, weights=[embedding_matrix],
                    input_length=X_max_len, mask_zero=True))
for i in range(num_layers):
    return_sequences = i != num_layers - 1
    model.add(LSTM(hidden_size, return_sequences=return_sequences))
model.add(RepeatVector(y_max_len))
# Creating decoder network
for _ in range(num_layers):
    model.add(LSTM(hidden_size, return_sequences=True))
model.add(TimeDistributed(Dense(y_vocab_len)))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy',
              optimizer='rmsprop',
              metrics=['accuracy'])
Here X is my input sentence and y is the sentence to be generated. The vocabulary size is fixed to contain the frequent words; rare words are considered out of vocabulary based on that vocabulary size.
I created a Sequential model in Keras and added embeddings from pre-trained GloVe vectors (embedding_matrix).
How should I model the input to achieve such a scenario?
The reference paper is :
http://aclweb.org/anthology/P/P16/P16-1100.pdf
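As a rough sketch (not the paper's full architecture), the character-level part could be a small sub-model that maps a padded sequence of character ids to a vector of the same size as the word embeddings (300 here); names such as char_vocab_size and max_word_len are placeholders:
from keras.layers import Input, Embedding, LSTM
from keras.models import Model

char_vocab_size = 100   # placeholder: size of the character inventory
max_word_len = 20       # placeholder: maximum word length in characters

char_input = Input(shape=(max_word_len,), name='char_ids')
char_emb = Embedding(char_vocab_size, 64, mask_zero=True)(char_input)
char_word_vector = LSTM(300)(char_emb)   # final LSTM state serves as the word's embedding
char_encoder = Model(char_input, char_word_vector)
The output of char_encoder would then replace the unknown-token embedding for out-of-vocabulary words, with both networks trained jointly as in the paper; wiring it into the word-level seq2seq model requires the Keras functional API rather than Sequential.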

How to get paragraph vector for a new paragraph?

I have a set of users and their content (one document per user, containing that user's tweets). I plan to use a distributed vector representation of some size N for each user. One way is to take word vectors pre-trained on Twitter data and average them to get a distributed vector for a user. I plan to use doc2vec for better results, but I am not quite sure I understood the DM model given in Distributed Representations of Sentences and Documents.
I understand that we assign one vector per paragraph and, while predicting the next word, we use that vector and then backpropagate the error to update the paragraph vector as well as the word vectors. How can I use this to predict the paragraph vector of a new paragraph?
Edit : Any toy code for gensim to compute paragraph vector of new document would be appreciated.
The following code is based on gensim's doc2vec tutorial. We can instantiate and train a doc2vec model to generate embeddings of size 300 with a context window of size 10 as follows:
from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec(size=300, window=10, min_count=2, iter=64, workers=16)
model.build_vocab(train_corpus)  # the vocabulary must be built before training
model.train(train_corpus, total_examples=model.corpus_count, epochs=model.iter)
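Here train_corpus is assumed to be a list of TaggedDocument objects, for example (raw_documents being a hypothetical list of text strings):
from gensim.models.doc2vec import TaggedDocument

train_corpus = [TaggedDocument(words=doc.split(), tags=[i])
                for i, doc in enumerate(raw_documents)]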
Having trained our model, we can compute a vector for a new, unseen document as follows:
import random

doc_id = random.randint(0, len(test_corpus) - 1)
inferred_vector = model.infer_vector(test_corpus[doc_id])
sims = model.docvecs.most_similar([inferred_vector], topn=len(model.docvecs))
This returns a 300-dimensional representation of our test document and computes the top-N most similar documents from the training set based on cosine similarity.

Using pre-trained word2vec with LSTM for word generation

LSTM/RNN can be used for text generation.
This shows a way to use pre-trained GloVe word embeddings in a Keras model.
How can I use pre-trained Word2Vec word embeddings with a Keras LSTM model? This post did help.
How can I predict / generate the next word when the model is provided with a sequence of words as its input?
Sample approach tried:
# Sample code to prepare word2vec word embeddings
import gensim

documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]
sentences = [[word for word in document.lower().split()] for document in documents]
word_model = gensim.models.Word2Vec(sentences, size=200, min_count=1, window=5)
# Code tried to prepare LSTM model for word generation
from keras.layers.recurrent import LSTM
from keras.layers.embeddings import Embedding
from keras.models import Model, Sequential
from keras.layers import Dense, Activation

embedding_layer = Embedding(input_dim=word_model.syn0.shape[0],
                            output_dim=word_model.syn0.shape[1],
                            weights=[word_model.syn0])
model = Sequential()
model.add(embedding_layer)
model.add(LSTM(word_model.syn0.shape[1]))
model.add(Dense(word_model.syn0.shape[0]))
model.add(Activation('softmax'))
model.compile(optimizer='sgd', loss='mse')
Sample code / pseudocode to train the LSTM and predict the next word would be appreciated.
I've created a gist with a simple generator that builds on top of your initial idea: it's an LSTM network wired to the pre-trained word2vec embeddings, trained to predict the next word in a sentence. The data is the list of abstracts from the arXiv website.
I'll highlight the most important parts here.
Gensim Word2Vec
Your code is fine, except for the number of iterations to train it. The default iter=5 seems rather low. Besides, it's definitely not the bottleneck -- LSTM training takes much longer. iter=100 looks better.
word_model = gensim.models.Word2Vec(sentences, size=100, min_count=1,
                                    window=5, iter=100)
pretrained_weights = word_model.wv.syn0
vocab_size, embedding_size = pretrained_weights.shape
print('Result embedding shape:', pretrained_weights.shape)
print('Checking similar words:')
for word in ['model', 'network', 'train', 'learn']:
    most_similar = ', '.join('%s (%.2f)' % (similar, dist)
                             for similar, dist in word_model.most_similar(word)[:8])
    print(' %s -> %s' % (word, most_similar))

def word2idx(word):
    return word_model.wv.vocab[word].index

def idx2word(idx):
    return word_model.wv.index2word[idx]
The resulting embedding matrix is saved into the pretrained_weights array, which has shape (vocab_size, embedding_size).
Keras model
Your code is almost correct, except for the loss function. Since the model predicts the next word, it's a classification task, hence the loss should be categorical_crossentropy or sparse_categorical_crossentropy. I've chosen the latter for efficiency reasons: this way it avoids one-hot encoding, which is pretty expensive for a big vocabulary.
model = Sequential()
model.add(Embedding(input_dim=vocab_size, output_dim=embedding_size,
                    weights=[pretrained_weights]))
model.add(LSTM(units=embedding_size))
model.add(Dense(units=vocab_size))
model.add(Activation('softmax'))
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
Note that the pre-trained weights are passed in via the weights argument.
Data preparation
In order to work with the sparse_categorical_crossentropy loss, both sentences and labels must be word indices. Short sentences must be padded with zeros to a common length.
import numpy as np

train_x = np.zeros([len(sentences), max_sentence_len], dtype=np.int32)
train_y = np.zeros([len(sentences)], dtype=np.int32)
for i, sentence in enumerate(sentences):
    for t, word in enumerate(sentence[:-1]):
        train_x[i, t] = word2idx(word)
    train_y[i] = word2idx(sentence[-1])
Sample generation
This is pretty straightforward: the model outputs a vector of probabilities, from which the next word is sampled and appended to the input. Note that the generated text is better and more diverse if the next word is sampled rather than picked as the argmax. The temperature-based random sampling I've used is described here.
def sample(preds, temperature=1.0):
    if temperature <= 0:
        return np.argmax(preds)
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)

def generate_next(text, num_generated=10):
    word_idxs = [word2idx(word) for word in text.lower().split()]
    for i in range(num_generated):
        prediction = model.predict(x=np.array(word_idxs))
        idx = sample(prediction[-1], temperature=0.7)
        word_idxs.append(idx)
    return ' '.join(idx2word(idx) for idx in word_idxs)
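For example, a call like the following would produce a continuation of a seed phrase:
print(generate_next('deep convolutional'))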
Examples of generated text
deep convolutional... -> deep convolutional arithmetic initialization step unbiased effectiveness
simple and effective... -> simple and effective family of variables preventing compute automatically
a nonconvex... -> a nonconvex technique compared layer converges so independent onehidden markov
a... -> a function parameterization necessary both both intuitions with technique valpola utilizes
It doesn't make too much sense, but it is able to produce sentences that look at least grammatically sound (sometimes).
The complete runnable script is available in the gist mentioned above.
