Using pre-trained word2vec with LSTM for word generation - machine-learning

LSTM/RNN can be used for text generation.
This shows a way to use pre-trained GloVe word embeddings with a Keras model.
How can I use pre-trained Word2Vec word embeddings with a Keras LSTM model? This post did help.
How can I predict / generate the next word when the model is given a sequence of words as its input?
Sample approach tried:
# Sample code to prepare word2vec word embeddings
import gensim
documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]
sentences = [[word for word in document.lower().split()] for document in documents]
word_model = gensim.models.Word2Vec(sentences, size=200, min_count=1, window=5)
# Code tried to prepare LSTM model for word generation
from keras.layers.recurrent import LSTM
from keras.layers.embeddings import Embedding
from keras.models import Model, Sequential
from keras.layers import Dense, Activation
embedding_layer = Embedding(input_dim=word_model.syn0.shape[0],
                            output_dim=word_model.syn0.shape[1],
                            weights=[word_model.syn0])
model = Sequential()
model.add(embedding_layer)
model.add(LSTM(word_model.syn0.shape[1]))
model.add(Dense(word_model.syn0.shape[0]))
model.add(Activation('softmax'))
model.compile(optimizer='sgd', loss='mse')
Sample code / pseudocode to train the LSTM and predict will be appreciated.

I've created a gist with a simple generator that builds on top of your initial idea: it's an LSTM network wired to the pre-trained word2vec embeddings, trained to predict the next word in a sentence. The data is a list of abstracts from the arXiv website.
I'll highlight the most important parts here.
Gensim Word2Vec
Your code is fine, except for the number of iterations to train it. The default iter=5 seems rather low. Besides, it's definitely not the bottleneck -- LSTM training takes much longer. iter=100 looks better.
word_model = gensim.models.Word2Vec(sentences, size=100, min_count=1,
                                    window=5, iter=100)
# (gensim < 4.0 API; in gensim >= 4.0 these arguments are vector_size= and epochs=)
pretrained_weights = word_model.wv.syn0
vocab_size, embedding_size = pretrained_weights.shape
print('Result embedding shape:', pretrained_weights.shape)
print('Checking similar words:')
for word in ['model', 'network', 'train', 'learn']:
    most_similar = ', '.join('%s (%.2f)' % (similar, dist)
                             for similar, dist in word_model.wv.most_similar(word)[:8])
    print('  %s -> %s' % (word, most_similar))

def word2idx(word):
    return word_model.wv.vocab[word].index

def idx2word(idx):
    return word_model.wv.index2word[idx]
The resulting embedding matrix is saved in the pretrained_weights array, which has shape (vocab_size, embedding_size).
Keras model
Your code is almost correct, except for the loss function. Since the model predicts the next word, it's a classification task, hence the loss should be categorical_crossentropy or sparse_categorical_crossentropy. I've chosen the latter for efficiency reasons: this way it avoids one-hot encoding, which is pretty expensive for a big vocabulary.
model = Sequential()
model.add(Embedding(input_dim=vocab_size, output_dim=embedding_size,
                    weights=[pretrained_weights]))
model.add(LSTM(units=embedding_size))
model.add(Dense(units=vocab_size))
model.add(Activation('softmax'))
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
Note that the pre-trained weights are passed in via the weights argument.
Data preparation
In order to work with sparse_categorical_crossentropy loss, both sentences and labels must be word indices. Short sentences must be padded with zeros to the common length.
import numpy as np

max_sentence_len = max(len(sentence) for sentence in sentences)
train_x = np.zeros([len(sentences), max_sentence_len], dtype=np.int32)
train_y = np.zeros([len(sentences)], dtype=np.int32)
for i, sentence in enumerate(sentences):
    for t, word in enumerate(sentence[:-1]):
        train_x[i, t] = word2idx(word)
    train_y[i] = word2idx(sentence[-1])  # the label is the last word of the sentence
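Training is then a single call; the batch size and number of epochs below are illustrative rather than tuned:

model.fit(train_x, train_y, batch_size=128, epochs=20)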
Sample generation
This is pretty straightforward: the model outputs a vector of probabilities, from which the next word is sampled and appended to the input. Note that the generated text is better and more diverse if the next word is sampled rather than picked as the argmax. The temperature-based random sampling I've used is described here.
def sample(preds, temperature=1.0):
    if temperature <= 0:
        return np.argmax(preds)
    preds = np.asarray(preds).astype('float64')
    # rescale the log-probabilities by the temperature and renormalize
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)

def generate_next(text, num_generated=10):
    word_idxs = [word2idx(word) for word in text.lower().split()]
    for i in range(num_generated):
        prediction = model.predict(x=np.array(word_idxs))
        idx = sample(prediction[-1], temperature=0.7)  # distribution after the last word
        word_idxs.append(idx)
    return ' '.join(idx2word(idx) for idx in word_idxs)
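For example, generating a continuation from a seed text (presumably how the samples below were produced; the seed is arbitrary):

print(generate_next('deep convolutional'))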
Examples of generated text
deep convolutional... -> deep convolutional arithmetic initialization step unbiased effectiveness
simple and effective... -> simple and effective family of variables preventing compute automatically
a nonconvex... -> a nonconvex technique compared layer converges so independent onehidden markov
a... -> a function parameterization necessary both both intuitions with technique valpola utilizes
It doesn't make too much sense, but it is able to produce sentences that look at least grammatically sound (sometimes).
The link to the complete runnable script.

Related

How to use pretrained BERT word embedding vector to finetune (initialize) other networks?

When I used to do classification work with textcnn, I had experience fine-tuning textcnn using pre-trained word embeddings such as Word2Vec and fastText. I used this process:
Create an embedding layer in textcnn
Load the embedding matrix of the words used this time from Word2Vec or fastText
Since the vector values of the embedding layer change during training, the network is being fine-tuned.
Recently I also wanted to try BERT for this. I thought: 'There should be few differences in using a pre-trained BERT embedding to initialize another network's embedding layer and fine-tune it, so it should be easy!' But in fact I tried all day yesterday and still cannot do it.
What I found is that, since BERT's embedding is a contextual embedding, the vector of a given word varies from sentence to sentence when the word embeddings are extracted, so there seems to be no way to use that embedding to initialize the embedding layer of another network in the usual way...
Finally, I thought up one method of 'fine-tuning', with the following steps:
First, do not define an embedding layer in textcnn.
Instead of using an embedding layer, in the network training part I first pass the sequence tokens to the pre-trained BERT model and get the word embeddings for each sentence.
Put the BERT word embeddings from step 2 into textcnn and train the textcnn network.
Using this method I was finally able to train, but thinking about it seriously, I don't think I'm doing any fine-tuning at all...
Because, as you can see, every time I start a new training loop the word embeddings generated by BERT are always the same vectors, so just feeding these unchanged vectors to the textcnn wouldn't really fine-tune BERT at all, right?
UPDATE:
I thought up a new method to use the BERT embeddings and 'train' BERT and textcnn together.
Some part of my code is:
BERTmodel = AutoModel.from_pretrained('bert-base-uncased',
                                      output_hidden_states=True).to(device)
TextCNNmodel = TextCNN(EMBD_DIM, CLASS_NUM, KERNEL_NUM, KERNEL_SIZES).to(device)
optimizer = torch.optim.Adam(TextCNNmodel.parameters(), lr=LR)
loss_func = nn.CrossEntropyLoss()
for epoch in range(EPOCH):
    TextCNNmodel.train()
    BERTmodel.train()
    for step, (token_batch, seg_batch, y_batch) in enumerate(train_loader):
        token_batch = token_batch.to(device)
        y_batch = y_batch.to(device)
        BERToutputs = BERTmodel(token_batch)
        # I want to use the second-to-last hidden layer as the embedding, so
        x_batch = BERToutputs[2][-2]
        output = TextCNNmodel(x_batch)
        output = output.squeeze()
        loss = loss_func(output, y_batch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
I think that by enabling BERTmodel.train() and removing torch.no_grad() when getting the embeddings, the loss gradient can be back-propagated to the BERTmodel, too. The training process of the TextCNNmodel also went smoothly.
To use this model later, I saved the parameters of both TextCNNmodel and BERTmodel.
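A minimal sketch of that saving step (the file names are just examples):

torch.save(TextCNNmodel.state_dict(), 'textcnn.pt')
torch.save(BERTmodel.state_dict(), 'bert_finetuned.pt')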
Then, to check whether the BERTmodel was really being trained and changed, in another program I loaded the BERTmodel and fed it a sentence to test whether it had really been trained.
However, I found that the outputs (the embeddings) of the original 'bert-base-uncased' model and my 'BERTmodel' are the same, which is disappointing...
I really have no idea why the BERTmodel part did not change...
Here I would like to thank @Jindřich for giving me the important hint!
I was almost there with my updated code, but I had forgotten to set an optimizer for the BERTmodel.
After I set the optimizer and ran the training process again, this time, when I load my BERTmodel, the outputs (the embeddings) of the original 'bert-base-uncased' model and my 'BERTmodel' are finally different, which means this BERTmodel has changed and should have been fine-tuned.
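A minimal sketch of such a check (the file path is an example, and token_batch is a hypothetical batch of input ids, as in the training loop):

original = AutoModel.from_pretrained('bert-base-uncased', output_hidden_states=True).eval()
finetuned = AutoModel.from_pretrained('bert-base-uncased', output_hidden_states=True).eval()
finetuned.load_state_dict(torch.load('bert_finetuned.pt', map_location='cpu'))
with torch.no_grad():
    emb_original = original(token_batch)[2][-2]   # second-to-last hidden layer, as above
    emb_finetuned = finetuned(token_batch)[2][-2]
print(torch.allclose(emb_original, emb_finetuned))  # False once BERT has actually been updated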
Here is my final code; I hope it can help you, too.
BERTmodel = AutoModel.from_pretrained('bert-base-uncased',
                                      output_hidden_states=True).to(device)
TextCNNmodel = TextCNN(EMBD_DIM, CLASS_NUM, KERNEL_NUM, KERNEL_SIZES).to(device)
optimizer = torch.optim.Adam(TextCNNmodel.parameters(), lr=LR)
optimizer_bert = torch.optim.AdamW(BERTmodel.parameters(), lr=2e-5, weight_decay=1e-2)
loss_func = nn.CrossEntropyLoss()
for epoch in range(EPOCH):
    TextCNNmodel.train()
    BERTmodel.train()
    for step, (token_batch, seg_batch, y_batch) in enumerate(train_loader):
        token_batch = token_batch.to(device)
        y_batch = y_batch.to(device)
        BERToutputs = BERTmodel(token_batch)
        # I want to use the second-to-last hidden layer as the embedding, so
        x_batch = BERToutputs[2][-2]
        output = TextCNNmodel(x_batch)
        output = output.squeeze()
        loss = loss_func(output, y_batch)
        optimizer.zero_grad()
        optimizer_bert.zero_grad()
        loss.backward()
        optimizer.step()
        optimizer_bert.step()
I will continue my experiments to see if my BERTmodel is really being finetuned.

SpaCy-transformers regression output

I would like to have a regression output instead of classification. For instance, instead of n classes I want a floating-point output value from 0 to 1.
Here is the minimalistic example from the package github page:
import spacy
from spacy.util import minibatch
import random
import torch

is_using_gpu = spacy.prefer_gpu()
if is_using_gpu:
    torch.set_default_tensor_type("torch.cuda.FloatTensor")

nlp = spacy.load("en_trf_bertbaseuncased_lg")
print(nlp.pipe_names)  # ["sentencizer", "trf_wordpiecer", "trf_tok2vec"]
textcat = nlp.create_pipe("trf_textcat", config={"exclusive_classes": True})
for label in ("POSITIVE", "NEGATIVE"):
    textcat.add_label(label)
nlp.add_pipe(textcat)

optimizer = nlp.resume_training()
for i in range(10):
    random.shuffle(TRAIN_DATA)
    losses = {}
    for batch in minibatch(TRAIN_DATA, size=8):
        texts, cats = zip(*batch)
        nlp.update(texts, cats, sgd=optimizer, losses=losses)
    print(i, losses)
nlp.to_disk("/bert-textcat")
Is there an easy way to make trf_textcat work as a regressor? Or would it mean extending the library?
I have figured out a workaround: extract vector representations from the nlp pipeline as:
vector_repres = nlp('Test text').vector
After doing so for all the text entries, you end up with a fixed-dimensional representation of the texts. Assuming you have continuous output values, feel free to use any estimator on top, including a neural network with a linear output.
Note that the vector representation is an average of the vector embeddings of all the words in the text; it might be a sub-optimal solution for your case.
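For completeness, a minimal sketch of the whole workaround (texts and targets are placeholder names; any sklearn regressor would do):

import numpy as np
from sklearn.linear_model import Ridge

X = np.array([nlp(text).vector for text in texts])  # fixed-dimensional text vectors
reg = Ridge().fit(X, targets)                       # targets: floats in [0, 1]
predictions = reg.predict(X)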

How to load unlabelled data for sentiment classification after training SVM model?

I am trying to do sentiment classification using an sklearn SVM model. I trained the model on labeled data and got 89% accuracy. Now I want to use the model to predict the sentiment of unlabeled data. How can I do that? And after classifying the unlabeled data, how can I see whether each item is classified as positive or negative?
I used python 3.7. Below is the code.
import random
import pandas as pd

data = pd.read_csv("label data for testing .csv", header=0)
sentiment_data = list(zip(data['Articles'], data['Sentiment']))
random.shuffle(sentiment_data)
train_x, train_y = zip(*sentiment_data[:350])
test_x, test_y = zip(*sentiment_data[350:])

from nltk import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn import metrics

clf = Pipeline([
    ('vectorizer', CountVectorizer(analyzer="word",
                                   tokenizer=word_tokenize,
                                   preprocessor=lambda text: text.replace("<br />", " "),
                                   max_features=None)),
    ('classifier', LinearSVC())
])
clf.fit(train_x, train_y)
pred_y = clf.predict(test_x)
print("Accuracy : ", metrics.accuracy_score(test_y, pred_y))
print("Precision : ", metrics.precision_score(test_y, pred_y))
print("Recall : ", metrics.recall_score(test_y, pred_y))
When I run this code, I get the output:
ConvergenceWarning: Liblinear failed to converge, increase the number of iterations.
Accuracy : 0.8977272727272727
Precision : 0.8604651162790697
Recall : 0.925
What is the meaning of ConvergenceWarning?
Thanks in advance!
What is the meaning of ConvergenceWarning?
As Pavel already mentioned, ConvergenceWarning means that max_iter was reached; you can suppress the warning as shown here: How to disable ConvergenceWarning using sklearn?
Now I want to use the model to predict the sentiment of unlabeled data. How can I do that?
You do it with the command pred_y = clf.predict(test_x); the only things you will adjust are pred_y (the name is your free choice) and test_x, which should be your new, unseen data. It has to have the same form (raw text) as your test_x and train_x.
In your case, as you are doing:
sentiment_data = list(zip(data['Articles'], data['Sentiment']))
you are forming a list of tuples: check this out.
Then you shuffle it and unzip the first 350 rows:
train_x, train_y = zip(*sentiment_data[:350])
Here your train_x is the column data['Articles'], so all you have to do if you have new data is:
new_data = pd.read_csv("new_data.csv", header=0)
new_y = clf.predict(new_data['Articles'])
how to see whether it is classified as positive or negative?
You can then inspect pred_y: each entry will be either a 1 or a 0. Normally 0 is negative, but it depends on how your dataset is set up.
Check out this page about model persistence. Then you just load the model and call its predict method; the model returns the predicted labels. If you used any encoder (LabelEncoder, OneHotEncoder), you need to dump and load it separately.
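A minimal sketch with joblib (the file name is an example; note that the lambda preprocessor in the pipeline above would have to be replaced with a named function, because lambdas cannot be pickled):

import joblib

joblib.dump(clf, 'sentiment_clf.joblib')   # after training
clf = joblib.load('sentiment_clf.joblib')  # in another script
pred_y = clf.predict(new_data['Articles'])  # raw articles, as before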
If I were you, I'd rather take a fully data-driven approach and use some pretrained embedder. It also works out of the box for dozens of languages, which is quite neat.
There's LASER from Facebook. There's also a pypi package, though unofficial. It works just fine.
Nowadays there are a lot of pretrained models, so it shouldn't be that hard to reach scores close to the state of the art.
Now I want to use the model to predict the sentiment of unlabeled data. How can I do that? and after classification of unlabeled data, how to see whether it is classified as positive or negative?
Basically, you prepare the unlabeled data the same way train_x or test_x was generated. Probably it's a 2D matrix of shape n_samples x 1, which you then pass to clf.predict to obtain predictions. clf.predict outputs the most probable class. In your case 0 is negative and 1 is positive, but it's hard to tell without the dataset.
What is the meaning of ConvergenceWarning?
The LinearSVC model is optimized using an iterative algorithm. There is an argument max_iter (1000 by default) that controls the maximum number of iterations. If the stopping criterion isn't met during this process, you get a ConvergenceWarning. It shouldn't bother you much, as long as you have acceptable performance in terms of accuracy or other metrics.
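If you do want the warning to go away, one option is to raise the limit on the pipeline's classifier and refit; the value below is illustrative:

clf.set_params(classifier__max_iter=10000)  # default is 1000
clf.fit(train_x, train_y)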

Implementing Luong and Manning's hybrid model

[Image: hybrid word-character model]
As shown in the above image, I need to create a hybrid encoder-decoder (seq2seq) network that takes both word and character embeddings as input.
As shown in the image, consider the sentence:
A cute cat
Hypothetically, the words in the vocabulary are:
a, cat
and the out-of-vocabulary word is:
cute
We feed the words a and cat as their respective embeddings,
but since cute is out of vocabulary, we would normally feed it the embedding of a universal token.
Instead, in this case I need to pass that unique out-of-vocabulary word (cute) through another seq2seq layer, character by character, to generate its embedding on the fly; a sketch of what I have in mind follows.
Both seq2seq layers must be trained jointly, end to end.
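For concreteness, a minimal sketch of the kind of character-level encoder I have in mind (char_vocab_size and max_word_len are hypothetical names; 300 matches the GloVe embedding size used below):

from keras.layers import Input, Embedding, LSTM
from keras.models import Model

char_input = Input(shape=(max_word_len,))              # character indices of one word
char_emb = Embedding(char_vocab_size, 50)(char_input)  # character embeddings
word_vector = LSTM(300)(char_emb)                      # word embedding built from characters
char_encoder = Model(char_input, word_vector)

The idea would be that char_encoder produces a 300-dimensional vector for each out-of-vocabulary word, to be spliced into the word sequence in place of the universal token. What I don't know is how to wire this into the model below so that both parts train jointly.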
The following is a snippet of my code for the main encoder-decoder network, which takes word-based inputs, in Keras:
model = Sequential()
model.add(Embedding(X_vocab_len + y_vocab_len, 300, weights=[embedding_matrix],
                    input_length=X_max_len, mask_zero=True))
for i in range(num_layers):
    return_sequences = i != num_layers - 1
    model.add(LSTM(hidden_size, return_sequences=return_sequences))
model.add(RepeatVector(y_max_len))
# Creating decoder network
for _ in range(num_layers):
    model.add(LSTM(hidden_size, return_sequences=True))
model.add(TimeDistributed(Dense(y_vocab_len)))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy',
              optimizer='rmsprop',
              metrics=['accuracy'])
Here X is my input sentence and y is the sentence to be generated. The vocabulary size is something I fixed, consisting of frequent words; rare words are considered out of vocabulary, based on the vocabulary size.
Here I created a sequential model in Keras, where I added embeddings from the pre-trained vectors generated by GloVe (embedding_matrix).
How do I model the input to achieve such a scenario?
The reference paper is :
http://aclweb.org/anthology/P/P16/P16-1100.pdf

Keras: model with one input and two outputs, trained jointly on different data (semi-supervised learning)

I would like to code, with Keras, a neural network that acts both as an autoencoder AND a classifier for semi-supervised learning. Take for example this dataset, where there are a few labeled images and a lot of unlabeled images: https://cs.stanford.edu/~acoates/stl10/
Some papers listed here achieved that, or very similar things, successfully.
To sum up: the model would have the same input data shape and the same "encoding" convolutional layers, but would split into two heads (fork-style), a classification head and a decoding head, in such a way that the unsupervised autoencoder contributes to good learning for the classification head.
With TensorFlow there would be no problem doing that as we have full control over the computational graph.
But with Keras, things are more high-level, and I feel that every call to ".fit" must always provide all the data at once (so it would force me to tie the classification head and the autoencoding head together into one training step).
One way in keras to almost do that would be with something that goes like this:
input = Input(shape=(32, 32, 3))
cnn_feature_map = sequential_cnn_trunk(input)
classification_predictions = Dense(10, activation='sigmoid')(cnn_feature_map)
autoencoded_predictions = decode_cnn_head_sequential(cnn_feature_map)
model = Model(inputs=[input], outputs=[classification_predictions, autoencoded_predictions])
model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['accuracy'])
model.fit([images], [labels, images], epochs=10)
However, I think and I fear that if I just try to fit things in that way, it will fail and ask for the missing head:
for epoch in range(10):
    # classification step
    model.fit([images], [labels, None], epochs=1)
    # "semi-unsupervised" autoencoding step
    model.fit([images], [None, images], epochs=1)
    # note: ".train_on_batch" could probably be used rather than ".fit"
    # to avoid doing a whole epoch each time.
How should one implement that behavior with Keras? And could the training be done jointly without having to split the two calls to the ".fit" function?
Sometimes, when you don't have a label, you can pass a zero vector instead of a one-hot encoded vector. It should not change your result, because a zero vector produces no error signal with the categorical cross-entropy loss.
My custom to_categorical function looks like this:
import numpy as np

def tricky_to_categorical(y, translator_dict):
    encoded = np.zeros((y.shape[0], len(translator_dict)))
    for i in range(y.shape[0]):
        if y[i] in translator_dict:
            encoded[i][translator_dict[y[i]]] = 1
    return encoded
Here y contains the labels, and translator_dict is a Python dictionary which maps labels to unique indices, like this:
{'unisex': 2, 'female': 1, 'male': 0}
If an unknown label can't be found in this dictionary, then its encoded label will be a zero vector.
If you use this trick, you also have to modify your accuracy function to see real accuracy numbers: you have to filter out all zero vectors from the metrics.
import tensorflow as tf
from keras import backend as K

def tricky_accuracy(y_true, y_pred):
    mask = K.not_equal(K.sum(y_true, axis=-1), K.constant(0))  # mask out zero-vector labels
    y_true = tf.boolean_mask(y_true, mask)
    y_pred = tf.boolean_mask(y_pred, mask)
    return K.cast(K.equal(K.argmax(y_true, axis=-1), K.argmax(y_pred, axis=-1)), K.floatx())
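A minimal example of wiring this metric into the model (the optimizer and loss are illustrative):

model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=[tricky_accuracy])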
Note: you have to use larger batches (e.g. 32) in order to prevent zero-matrix updates, because they can make your accuracy metrics go crazy; I don't know why.
Alternative solution
Use Pseudo Labeling :)
You can train jointly; you have to pass an array of labels instead of a single label.
I used fit_generator, e.g.:
model.fit_generator(
    batch_generator(),
    steps_per_epoch=len(dataset) // batch_size,
    epochs=epochs)
def batch_generator():
    batch_x = np.empty((batch_size, img_height, img_width, 3))
    gender_label_batch = np.empty((batch_size, len(gender_dict)))
    category_label_batch = np.empty((batch_size, len(category_dict)))
    while True:
        i = 0
        for idx in np.random.choice(len(dataset), batch_size):
            image_id = dataset[idx][0]
            batch_x[i] = load_and_convert_image(image_id)
            gender_label_batch[i] = gender_labels[idx]
            category_label_batch[i] = category_labels[idx]
            i += 1
        yield batch_x, [gender_label_batch, category_label_batch]
