How to use nn.LookupTable in Torch (Lua)

How do I use nn.LookupTable to convert my vector of vocab indices to a tensor of embedding vectors?
(https://github.com/torch/nn/blob/master/LookupTable.lua)
The network is set up as:
self.llstm = LSTM
self.rlstm = LSTM
local modules = nn.Parallel()
:add(nn.LookupTable(self.vocab_size, self.emb_size))
:add(nn.Collapse(2))
:add(self.llstm)
:add(self.my_module)
self.params, self.grad_params = modules:getParameters
In the train step:
input = dataset[i]
emb_input = ?
self.llstm:forward(emb_input, reverse)
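A minimal sketch of the lookup step (assuming dataset[i] is a torch.LongTensor of word indices; variable names follow the snippet above):

-- nn.LookupTable maps a 1D LongTensor of vocabulary indices to a
-- (seq_len x emb_size) tensor of embedding vectors.
local lookup = nn.LookupTable(self.vocab_size, self.emb_size)
local input = dataset[i]                   -- e.g. torch.LongTensor{3, 17, 42}
local emb_input = lookup:forward(input)    -- seq_len x emb_size tensor
self.llstm:forward(emb_input, reverse)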

Related

Inverse embedding layer does not update weights

In this post (https://stackoverflow.com/a/64526124/15693663) I see a very simple and short solution to invert the embedding layer.
I used the inverse embedding layer, but it does not update the weights in the network. The proposed inverse embedding layer is copied from that post (below):
import torch
embeddings = torch.nn.Embedding(1000, 100)
my_sample = torch.randn(1, 100)
distance = torch.norm(embeddings.weight.data - my_sample, dim=1)
nearest = torch.argmin(distance)
What I did:
I used an embedding layer that takes one input and generates a 16D output. Then I added two hidden dense layers (64 -> 16) and one inverse embedding layer. In short:
X -> embedding -> dense layer (64D) -> dense layer (16D) -> inverse embedding -> X'
X and X' are integers.
To compute the loss, I used torch.norm(X - X'), but it does not update the weights. I cannot figure out why the weights are not updated.
A short implementation is shown below:
# lS_o = offset, lS_i = input number
optimizer = opts['sgd'](parameters, lr=args.learning_rate)
# --------------------------------------
def forward(self, lS_o, lS_i):
    out_emb1 = self.embl_inp(lS_o, lS_i)  # 16D == embedding layer
    out_dl1 = self.DLyr1(out_emb1)        # 64D == dense layer 1
    out_dl2 = self.DLyr2(out_dl1)         # 16D == dense layer 2
    ly = out_dl2
    out = []
    for i in range(ly.shape[0]):
        # subtract each row of ly from every row of the inverse-embedding weight matrix
        distance = self.emb_out.weight.data - ly[i, None]
        out.append(torch.argmin(torch.norm(distance, dim=1), dim=0))
    return torch.stack(out)
train_ds = Dataset(...)
train_ld = DataLoader(train_ds, ...)
pbar = tq.tqdm(enumerate(train_ld), total=len(train_ld))
for j, inputBatch in pbar:
    lS_o, lS_i = unpack_batch(inputBatch)
    ae_out = model(lS_o, lS_i, use_gpu=True)
    loss = torch.norm(ae_out - lS_i)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
I have looked into gumbel_softmax because argmin and argmax are not differentiable, so I changed the code in forward() as follows:
distance = torch.norm(distance, dim=1)
out = 1 - torch.nn.functional.gumbel_softmax(distance)
Because I am looking for the minimum, I subtract the output of gumbel_softmax from 1.
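For what it's worth, two things in the snippet above stop gradients from reaching the embedding weights: self.emb_out.weight.data detaches the weights from the graph, and argmin has no gradient. Below is a minimal sketch (not the original code; emb_out, ly and tau are illustrative names) of a soft nearest-neighbour lookup that stays differentiable:

import torch
import torch.nn.functional as F

def soft_inverse_embedding(ly, emb_out, tau=1.0):
    # ly: (batch, 16) dense-layer output; emb_out: nn.Embedding with a (vocab, 16) weight.
    # Use emb_out.weight (not .weight.data) so gradients can flow into the embedding table.
    distance = torch.cdist(ly, emb_out.weight)        # (batch, vocab) pairwise L2 distances
    soft_assign = F.softmax(-distance / tau, dim=1)   # differentiable surrogate for argmin
    idx = torch.arange(emb_out.num_embeddings, dtype=ly.dtype, device=ly.device)
    return soft_assign @ idx                          # expected index, shape (batch,)

At evaluation time you can still take a hard argmin; the soft version is only needed so the backward pass has something to propagate.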

ValueError: Data cardinality is ambiguous. Make sure all arrays contain the same number of samples?

I am trying to use a MobileNet model but am facing the above-mentioned issue. I don't know whether it is occurring due to train_test_split or something else. The architecture is shown below.
Can I use model.fit instead of model.fit_generator here?
mobilenet = MobileNet(input_shape=(224,224,3) , weights='imagenet', include_top=False)
# don't train existing weights
for layer in mobilenet.layers:
    layer.trainable = False
folders = glob('/content/drive/MyDrive/AllClasses/*')
print("Total number of classes are",len(folders))
x = Flatten()(mobilenet.output)
prediction = Dense(len(folders), activation='softmax')(x)
model = Model(inputs=mobilenet.input, outputs=prediction)
model.summary()
model.compile(loss='categorical_crossentropy',optimizer='adam',metrics=['accuracy'])
dataset = ImageDataGenerator(rescale=1./255)
dataset = dataset.flow_from_directory('/content/drive/MyDrive/AllClasses',target_size=(224, 224),batch_size=32,class_mode='categorical',color_mode='grayscale')
train_data, test_data = train_test_split(dataset,random_state=42, test_size=0.20,shuffle=True)
r = model.fit(train_data,validation_data=(test_data),epochs=5)
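Two details worth checking: sklearn's train_test_split is not designed to split the DirectoryIterator returned by flow_from_directory, which is a likely source of this cardinality error, and color_mode='grayscale' yields 1-channel images while the MobileNet input_shape=(224,224,3) expects 3 channels. A minimal sketch of one possible workaround, letting ImageDataGenerator do the split itself (the 0.20 fraction mirrors the original test_size):

from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Split inside the generator instead of with train_test_split.
datagen = ImageDataGenerator(rescale=1./255, validation_split=0.20)

train_data = datagen.flow_from_directory(
    '/content/drive/MyDrive/AllClasses',
    target_size=(224, 224), batch_size=32,
    class_mode='categorical', color_mode='rgb',   # MobileNet expects 3 channels
    subset='training')

test_data = datagen.flow_from_directory(
    '/content/drive/MyDrive/AllClasses',
    target_size=(224, 224), batch_size=32,
    class_mode='categorical', color_mode='rgb',
    subset='validation')

r = model.fit(train_data, validation_data=test_data, epochs=5)

Recent Keras versions accept generators in model.fit directly, so fit_generator is not needed here.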

Concatenating two pre-trained BERT models

max_length = 50
tokenizer = RobertaTokenizer.from_pretrained('roberta-large', do_lower_case=True)
encodings = tokenizer.batch_encode_plus(comments,max_length=max_length,pad_to_max_length=True, truncation=True) # tokenizer's encoding method
train_inputs = encodings['input_ids']
train_masks = encodings['attention_mask']
train_inputs = torch.tensor(train_inputs)
train_labels = torch.tensor(train_labels)
train_masks = torch.tensor(train_masks)
batch_size = 48
train_data = TensorDataset(train_inputs, train_masks, train_labels)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)
model = RobertaForSequenceClassification.from_pretrained('roberta-large', num_labels=num_labels)
model.cuda()
Hi, I'm using the HuggingFace library for classification, and I want to concatenate two types of BERT together. This is not the entire code; I just want to show how I've used the tokenizer and encodings. Now I have two questions:
1. How can I see the created vectors, i.e. their dimensions and the vectors themselves?
2. At which step should I concatenate the two BERTs: their vectors, or maybe their outputs (logits)?
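A minimal sketch of one common arrangement (not from the question; the second model name and the hidden sizes are illustrative): run each encoder separately, take the [CLS] vector from each, concatenate them, and put a single classification head on top. Printing vec_a.shape, vec_b.shape and fused.shape is also the easiest way to inspect the created vectors.

import torch
import torch.nn as nn
from transformers import RobertaModel, BertModel

class DualEncoderClassifier(nn.Module):
    def __init__(self, num_labels):
        super().__init__()
        self.enc_a = RobertaModel.from_pretrained('roberta-large')    # hidden size 1024
        self.enc_b = BertModel.from_pretrained('bert-base-uncased')   # hidden size 768
        self.classifier = nn.Linear(1024 + 768, num_labels)

    def forward(self, ids_a, mask_a, ids_b, mask_b):
        # [CLS] token of each encoder's last hidden state as a sentence vector
        vec_a = self.enc_a(ids_a, attention_mask=mask_a).last_hidden_state[:, 0]
        vec_b = self.enc_b(ids_b, attention_mask=mask_b).last_hidden_state[:, 0]
        fused = torch.cat([vec_a, vec_b], dim=-1)      # concatenated sentence representation
        return self.classifier(fused)                  # logits over num_labels classes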

I want to calculate cross entropy at all outputs of LSTM

I am writing a classification program using an LSTM.
However, I do not know how to calculate the cross entropy over all of the LSTM's outputs.
Here is a part of my program.
cell_fw = tf.nn.rnn_cell.LSTMCell(num_hidden)
cell_bw = tf.nn.rnn_cell.LSTMCell(num_hidden)
outputs, _ = tf.nn.bidirectional_dynamic_rnn(cell_fw,cell_bw,inputs = inputs3, dtype=tf.float32,sequence_length= seq_len)
outputs = tf.concat(outputs,axis=2)
#outputs [batch_size,max_timestep,num_features]
outputs = tf.reshape(outputs, [-1, num_hidden*2])
W = tf.Variable(tf.truncated_normal([num_hidden*2, num_classes], stddev=0.1))
b = tf.Variable(tf.constant(0., shape=[num_classes]))
logits = tf.matmul(outputs, W) + b
How can I apply cross-entropy loss to this?
Should I create a label vector that repeats the same class max_timestep times for each batch entry and compute the error against that?
Have you looked at the cross_entropy documentation: https://www.tensorflow.org/api_docs/python/tf/losses/softmax_cross_entropy ?
The dimension of onehot_labels should answer your question.
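Roughly, yes: repeat each sequence label over the timesteps so the label tensor lines up with the reshaped logits, and mask the padded steps. A minimal sketch in the same TF1 style as the snippet above (it assumes an integer labels tensor of shape [batch_size]; max_timestep, num_classes, seq_len and logits come from the question):

labels_per_step = tf.tile(tf.expand_dims(labels, 1), [1, max_timestep])       # [batch, time]
onehot_labels = tf.one_hot(tf.reshape(labels_per_step, [-1]), num_classes)    # [batch*time, classes]
mask = tf.sequence_mask(seq_len, max_timestep, dtype=tf.float32)              # 0 for padded steps
loss = tf.losses.softmax_cross_entropy(onehot_labels=onehot_labels,
                                       logits=logits,
                                       weights=tf.reshape(mask, [-1]))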

Stacked RNN model setup in TensorFlow

I'm kind of lost building a stacked LSTM model for text classification in TensorFlow.
My input data was something like:
x_train = [[1.,1.,1.],[2.,2.,2.],[3.,3.,3.],...,[0.,0.,0.],[0.,0.,0.],
...... #I trained the network in batch with batch size set to 32.
]
y_train = [[1.,0.],[1.,0.],[0.,1.],...,[1.,0.],[0.,1.]]
# binary classification
The skeleton of my code looks like:
self._input = tf.placeholder(tf.float32, [self.batch_size, self.max_seq_length, self.vocab_dim], name='input')
self._target = tf.placeholder(tf.float32, [self.batch_size, 2], name='target')
lstm_cell = rnn_cell.BasicLSTMCell(self.vocab_dim, forget_bias=1.)
lstm_cell = rnn_cell.DropoutWrapper(lstm_cell, output_keep_prob=self.dropout_ratio)
self.cells = rnn_cell.MultiRNNCell([lstm_cell] * self.num_layers)
self._initial_state = self.cells.zero_state(self.batch_size, tf.float32)
inputs = tf.nn.dropout(self._input, self.dropout_ratio)
inputs = [tf.reshape(input_, (self.batch_size, self.vocab_dim)) for input_ in tf.split(1, self.max_seq_length, inputs)]
outputs, states = rnn.rnn(self.cells, inputs, initial_state=self._initial_state)
# We only care about the output of the last RNN cell...
y_pred = tf.nn.xw_plus_b(outputs[-1], tf.get_variable("softmax_w", [self.vocab_dim, 2]), tf.get_variable("softmax_b", [2]))
loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(y_pred, self._target))
correct_pred = tf.equal(tf.argmax(y_pred, 1), tf.argmax(self._target, 1))
accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))
train_op = tf.train.AdamOptimizer(self.lr).minimize(loss)
init = tf.initialize_all_variables()
with tf.Session() as sess:
    initializer = tf.random_uniform_initializer(-0.04, 0.04)
    with tf.variable_scope("model", reuse=True, initializer=initializer):
        sess.run(init)
        # generate batches here (omitted for clarity)
        print sess.run([train_op, loss, accuracy], feed_dict={self._input: batch_x, self._target: batch_y})
The problem is that no matter how large the dataset is, the loss and accuracy show no sign of improvement (they look completely stochastic). Am I doing anything wrong?
Update:
# First, load Word2Vec model in Gensim.
model = Doc2Vec.load(word2vec_path)
# Second, build the dictionary.
gensim_dict = Dictionary()
gensim_dict.doc2bow(model.vocab.keys(), allow_update=True)
w2indx = {v: k + 1 for k, v in gensim_dict.items()}
w2vec = {word: model[word] for word in w2indx.keys()}
# Third, read data from a text file.
for fname in fnames:
    i = 0
    with codecs.open(fname, 'r', encoding='utf8') as fr:
        for line in fr:
            tmp = []
            for t in line.split():
                tmp.append(t)
            X_train.append(tmp)
            i += 1
            if i == samples_count:
                break
# Fourth, convert words into vectors, and pad each sentence with ZERO arrays to a fixed length.
result = np.zeros((len(data), self.max_seq_length, self.vocab_dim), dtype=np.float32)
for rowNo in xrange(len(data)):
    rowLen = len(data[rowNo])
    for colNo in xrange(rowLen):
        word = data[rowNo][colNo]
        if word in w2vec:
            result[rowNo][colNo] = w2vec[word]
        else:
            result[rowNo][colNo] = [0] * self.vocab_dim
    for colPadding in xrange(rowLen, self.max_seq_length):
        result[rowNo][colPadding] = [0] * self.vocab_dim
return result
# Fifth, generate batches and feed them to the model.
... (remaining details omitted) ...
Here are a few reasons it may not be training, and suggestions to try:
You are not allowing the word vectors to be updated, so the space of pre-learned vectors may not be working properly.
RNNs really need gradient clipping when trained. You can try adding something like the sketch after this list.
Unit-scale initialization seems to work better, as it accounts for the size of the layer and allows the gradient to be scaled properly as it goes deeper.
You should try removing dropout and the second layer, just to check whether your data pipeline is correct and your loss is going down at all.
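A minimal sketch of gradient clipping in the same TF style as the question, replacing the plain AdamOptimizer(...).minimize(loss) call (the 5.0 clip norm is an illustrative value):

optimizer = tf.train.AdamOptimizer(self.lr)
grads_and_vars = optimizer.compute_gradients(loss)
# Clip each gradient's norm before applying the update.
clipped = [(tf.clip_by_norm(g, 5.0), v) for g, v in grads_and_vars if g is not None]
train_op = optimizer.apply_gradients(clipped)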
I can also recommend trying this example with your data: https://github.com/tensorflow/skflow/blob/master/examples/text_classification.py
It trains word vectors from scratch, already has gradient clipping, and uses GRUCells, which are usually easier to train. You can also see nice visualizations of the loss and other metrics by running tensorboard --logdir=/tmp/tf_examples/word_rnn.

Resources