What is the initial value of Embedding layer? - machine-learning

I am studying embeddings for word representations. Many DNN libraries provide an embedding layer, and this is a really nice tutorial:
Word Embeddings: Encoding Lexical Semantics
But I am still not sure how the embedding values are calculated. In the example below, the layer outputs values even before any training. Does it use random weights? I understand the purpose of Embedding(2, 5), but not how its values are initially calculated. I am also not sure how the weights of the Embedding are learned.
import torch
import torch.nn as nn
import torch.autograd as autograd
word_to_ix = {"hello": 0, "world": 1}
embeds = nn.Embedding(2, 5) # 2 words in vocab, 5 dimensional embeddings
lookup_tensor = torch.LongTensor([word_to_ix["hello"]])
hello_embed = embeds(autograd.Variable(lookup_tensor))
print(hello_embed)
--------
Variable containing:
-2.9718 1.7070 -0.4305 -2.2820 0.5237
[torch.FloatTensor of size 1x5]
Let me break down my understanding to be sure. First of all, the Embedding(2, 5) above is a matrix of shape (2, 5).
Embedding(2, 5) =
[[0.1,-0.2,0.3,0.4,0.1],
[-0.2,0.1,0.8,0.2,0.3]] # initialized by some function, such as a random normal distribution
Then hello is the one-hot vector [1, 0], and its representation is calculated as [1, 0].dot(Embedding(2, 5)) = [0.1,-0.2,0.3,0.4,0.1]. This is just the first row of the Embedding matrix. Is my understanding right?
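To check the reasoning above, here is a minimal sketch in plain Python (not PyTorch, so it runs without dependencies). The names weight, one_hot and vec_mat are hypothetical helpers for illustration, and the random normal initialization is an assumption that stands in for whatever the library does by default. It shows that multiplying a one-hot vector by the matrix gives the same result as reading the row directly.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

vocab_size, dim = 2, 5
# Hypothetical stand-in for the layer's weight matrix: one row per word,
# each entry drawn from a standard normal distribution.
weight = [[random.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(vocab_size)]


def one_hot(index, size):
    """Return a one-hot list with 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def vec_mat(vec, mat):
    """Multiply a row vector by a matrix stored as a list of rows."""
    return [sum(v * row[j] for v, row in zip(vec, mat)) for j in range(len(mat[0]))]


word_to_ix = {"hello": 0, "world": 1}
hello_by_matmul = vec_mat(one_hot(word_to_ix["hello"], vocab_size), weight)
hello_by_lookup = weight[word_to_ix["hello"]]

# Both routes give the same vector: the embedding is a row lookup.
print(hello_by_matmul == hello_by_lookup)
```

In practice a library would skip the multiplication and index the row directly, since that is cheaper and gives the same values.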
Updates
I found an embedding implementation that does use a normal distribution for its initial values. However, that is just the default, and we can set arbitrary weights for embedding layers.
https://github.com/chainer/chainer/blob/adba7b846d018b9dc7d19d52147ef53f5e555dc8/chainer/links/connection/embed_id.py#L58

Initializations define the way to set the initial random weights of layers. You can use any values for this, but the initial values do affect the resulting word embeddings. That is why there are many approaches that use pre-trained word embeddings to provide better initial values, like this.
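As a rough illustration of starting from pre-trained values, here is a sketch in plain Python. The pretrained dictionary, its numbers, and the helper build_initial_weights are all made up for this example. Rows for words that have a pre-trained vector are copied, and the remaining rows fall back to random normal values.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical pre-trained vectors. In practice these would be loaded from a
# published embedding file; the numbers here are invented for illustration.
pretrained = {
    "hello": [0.1, -0.2, 0.3, 0.4, 0.1],
}

word_to_ix = {"hello": 0, "world": 1}
dim = 5


def build_initial_weights(word_to_ix, pretrained, dim):
    """Copy pre-trained rows where available, otherwise draw random normal values."""
    weights = [None] * len(word_to_ix)
    for word, ix in word_to_ix.items():
        if word in pretrained:
            weights[ix] = list(pretrained[word])
        else:
            weights[ix] = [random.gauss(0.0, 1.0) for _ in range(dim)]
    return weights


weights = build_initial_weights(word_to_ix, pretrained, dim)
print(weights[word_to_ix["hello"]])  # the copied pre-trained row
```

In PyTorch, a matrix built this way could be converted to a tensor and handed to nn.Embedding.from_pretrained, whose freeze argument controls whether the rows keep training.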

Yes. You start off with random weights. I think it is more common to use a truncated normal distribution instead of the regular normal distribution. But, that probably doesn't make much of a difference.

Related

How do I match samples with their predictions when doing inference with PyTorch's DistributedSampler?

I have trained a torch model for NLP tasks and would like to perform some inference using a multi GPU machine (in this case with two GPUs).
Inside the processing code, I use this
dataset = TensorDataset(encoded_dict['input_ids'], encoded_dict['attention_mask'])
sampler = DistributedSampler(
    dataset, num_replicas=args.nodes * args.gpus, rank=args.node_rank * args.gpus + gpu_number, shuffle=False
)
dataloader = DataLoader(dataset, batch_size=batch_size, sampler=sampler)
For those familiar with NLP, encoded_dict is the output from the tokenizer.batch_encode_plus function where the tokenizer is an instance of transformers.BertTokenizer.
The issue I’m having is that when I call the code through the torch.multiprocessing.spawn function, each GPU is doing predictions (i.e. inference) on a subset of the full dataset, and saving the predictions separately; for example, if I have a dataset with 1000 samples to predict, each GPU is predicting 500 of them. As a result, I have no way of knowing which samples out of the 1000 were predicted by which GPU, as their order is not preserved, therefore the model predictions are meaningless as I cannot trace each of them back to their input sample.
I have tried to save the dataloader instance (as a pickle) together with the predictions and then extract the input_ids by using dataloader.dataset.tensors. However, this requires a tokeniser decoding step which I would rather avoid, as the tokenizer will have slightly changed the text (for example, double whitespaces would be removed, words with dashes will have been split, and so on).
What is the cleanest way to save the input text samples together with their predictions when doing inference in distributed mode, or alternatively to keep track of which prediction refers to which sample?
As I understand it, basically your dataset returns for an index idx [data,label] during training and [data] during inference. The issue with this is that the idx is not preserved by the dataloader object, so there is no way to obtain the idx values for the minibatch after the fact.
One way to handle this issue is to define a very simple custom dataset object that also returns [data,id] instead of only data during inference. Probably the easiest way to do this is to make the dataset return a dictionary object with keys id and data. The dictionary return type is convenient because Pytorch collates (converts data structures to batches) this type automatically, otherwise you'd have to define a custom collate_fn and pass it to the dataloader object, which is itself not very hard but is an extra step.
In any case, here is how I would define a new dataset object, which should be almost a one-to-one substitute for your current dataset (I believe):
import torch

class TensorDictDataset(torch.utils.data.Dataset):
    def __init__(self, ids, attention_mask):
        self.ids = ids
        self.mask = attention_mask

    def __len__(self):
        return len(self.ids)

    def __getitem__(self, idx):
        # Return a dict so the default collate function batches each key.
        datum = {
            "mask": self.mask[idx],
            "id": self.ids[idx],
        }
        return datum
The only change you'll then have to make is that rather than returning mask your dataset will now return dict{"mask":mask,"id":id} so you'll have to parse that appropriately.
Thanks for your answer. I have done further debugging, found another solution, and wanted to post it.
Your solution is quite elegant (there was one minor misunderstanding, in that the predictions contain only the predicted labels and not the data, contrary to what you understood, but this doesn't affect your answer anyway). A mask in NLP is also something else, and instead of having the mask tokens together with the predictions I would like to have the untokenized text string. This is not so easy to achieve because the splitting of the data across GPUs happens AFTER the tokenisation; however, I believe that with a slight adaptation your answer could work.
However, I’ve done some further debugging and I’ve noticed that the data are not actually randomly split across GPUs as I thought. If I set shuffle=False in the DistributedSampler then this happens:
in the case of two GPUs, GPU 0 and GPU 1, all the samples with even index (starting from 0) will be passed to GPU 0, and all those with odd index will be passed to GPU 1.
So for example, if you have 10 samples, whose indices are [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], then samples 0, 2, 4, 6, 8 will go to GPU 0 and samples 1, 3, 5, 7, 9 will go to GPU 1. This allows me to map the predictions back to the original text string samples just by using this ordering. Not sure if this is the best solution, as keeping the original text string next to its prediction would be ideal, but at least it works.
N.B. Special case: as the two GPUs must be passed the SAME number of inputs, if the number of inputs is odd, for example 9 samples with indices [0, 1, 2, 3, 4, 5, 6, 7, 8], then GPU 0 will be passed samples 0, 2, 4, 6, 8 and GPU 1 will be passed samples 1, 3, 5, 7, 0 (in this exact order). In other words, the first sample, with index 0, is repeated at the very end of the dataset to make sure each GPU has the same number of samples. In that case we can write some code that drops the last prediction from GPU 1, as it is redundant.
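The ordering described above can be sketched in plain Python. This is only a re-implementation of the index pattern as I understand it (shuffle=False, tail padded by wrapping around rather than dropped), not the sampler itself, and the function names shard_indices and merge_predictions are hypothetical. It reproduces the per-GPU index lists and then interleaves per-GPU predictions back into the original order, discarding the padded duplicate.

```python
import math


def shard_indices(num_samples, num_replicas, rank):
    """Indices one rank receives with shuffle=False and wrap-around padding."""
    per_replica = math.ceil(num_samples / num_replicas)
    total = per_replica * num_replicas
    indices = list(range(num_samples))
    padding = total - num_samples
    # Pad by wrapping around to the start so every rank gets the same count.
    indices += (indices * math.ceil(padding / num_samples))[:padding]
    return indices[rank:total:num_replicas]


def merge_predictions(per_rank_preds, num_samples):
    """Interleave per-rank predictions into original order, dropping padding."""
    num_replicas = len(per_rank_preds)
    merged = [None] * num_samples
    for rank, preds in enumerate(per_rank_preds):
        for pos, pred in enumerate(preds):
            global_pos = rank + pos * num_replicas
            if global_pos < num_samples:  # positions past the end are padding
                merged[global_pos] = pred
    return merged


num_samples = 9
shards = [shard_indices(num_samples, 2, rank) for rank in range(2)]
# Fake "predictions": each rank just labels the samples it was given.
per_rank_preds = [[f"pred_{i}" for i in shard] for shard in shards]
merged = merge_predictions(per_rank_preds, num_samples)
print(shards)
print(merged)
```

Once the predictions are back in dataset order, they can be zipped with the original, untokenized text strings.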

keras model output vector of one hot vectors, is it possible? are there any alternatives if not?

So I have an output vector of dim=7 and 4 possible classes for each position, so my question is, is it possible to feed the keras model a vector of one hot vectors, where each position of the vector is a one hot vector? something like this [[1000],[1000],[0100],[0010],[0001],[0001],[0010]].
If this is not possible are there any alternatives?
If you want the output of your model to be exactly like that when model = keras.models.Model(...), the answer is that it is not possible, because a hard one-hot output behaves like a step function ("[1000] => [0000]"): its gradient is infinite at the step and 0 everywhere else, so the model cannot be trained by gradient descent.
What people do instead is create a model that outputs a probability distribution over the classes, select the class with the highest probability as the predicted value, and use cross-entropy loss to optimize the model. For example, instead of [1,0,0,0] your output will be something like [0.9,0.01,0.01,0.08]. You can then pick the first class as the predicted value. This makes sure that your model has a gradient at every point.
So if you really want your model to have dim = 7 and 4 different values per position, you can create an output of size 28 = 7*4 with sigmoid activation, then treat the first 4 values as position 1 (for example with np.argmax(output[0:4])), and so on.
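The slicing step could look roughly like this in plain Python. The scores below are invented, and decode_positions and to_one_hot are hypothetical helper names. The flat vector of 28 scores is cut into 7 groups of 4, and the argmax of each group is turned back into a one-hot vector.

```python
def decode_positions(flat_output, num_positions=7, num_classes=4):
    """Split a flat score vector into per-position groups and argmax each one."""
    assert len(flat_output) == num_positions * num_classes
    predicted = []
    for p in range(num_positions):
        group = flat_output[p * num_classes:(p + 1) * num_classes]
        predicted.append(max(range(num_classes), key=lambda k: group[k]))
    return predicted


def to_one_hot(class_ids, num_classes=4):
    """Turn class indices back into one-hot lists."""
    return [[1 if k == c else 0 for k in range(num_classes)] for c in class_ids]


# Made-up model output: high score for the intended class, low elsewhere.
target_classes = [0, 0, 1, 2, 3, 3, 2]
flat_output = []
for c in target_classes:
    flat_output += [0.9 if k == c else 0.05 for k in range(4)]

decoded = decode_positions(flat_output)
print(decoded)
print(to_one_hot(decoded))
```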

NLP Transformers: Best way to get a fixed sentence embedding-vector shape?

I'm loading a language model from torch hub (CamemBERT, a French RoBERTa-based model) and using it to embed some French sentences:
import torch
camembert = torch.hub.load('pytorch/fairseq', 'camembert.v0')
camembert.eval() # disable dropout (or leave in train mode to finetune)
def embed(sentence):
    tokens = camembert.encode(sentence)
    # Extract all layer's features (layer 0 is the embedding layer)
    all_layers = camembert.extract_features(tokens, return_all_hiddens=True)
    embeddings = all_layers[0]
    return embeddings
# Here we see that the shape of the embedding vector depends on the number of tokens in the sentence
u = embed(sentence="Bonjour, ça va ?")
u.shape # torch.Size([1, 7, 768])
v = embed(sentence="Salut, comment vas-tu ?")
v.shape # torch.Size([1, 9, 768])
Imagine now in order to do some semantic search, I want to calculate the cosine distance between the vectors (tensors in our case) u and v :
cos = torch.nn.CosineSimilarity(dim=1)
cos(u, v) # will throw an error since the shape of `u` is different from the shape of `v`
My question is: what is the best method to always get the same embedding shape for a sentence, regardless of the number of tokens it contains?
=> The first solution I'm thinking of is calculating the mean on axis=1 (the embedding of a sentence is the mean of the embeddings of its tokens), since axis=0 and axis=2 always have the same size:
cos = torch.nn.CosineSimilarity(dim=1)
cos(u.mean(axis=1), v.mean(axis=1)) # works now and gives 0.7269
But, I'm afraid that I'm hurting the embedding of the sentence when calculating the mean since it gives the same weight for each token (maybe multiplying by TF-IDF?).
=> The second solution is to pad shorter sentences out. That means:
giving a list of sentences to embed at a time (instead of embedding sentence by sentence)
looking up the sentence with the most tokens, embedding it, and getting its shape S
embedding the rest of the sentences and padding them with zeros to the same shape S (each shorter sentence has 0 in the remaining positions)
What are your thoughts?
What other techniques would you use and why?
Thanks in advance!
This is quite a general question, as there is no one specific right answer.
As you found out, of course the shapes differ because you get one output per token (depending on the tokenizer, those can be subword units). In other words, you have encoded all tokens into their own vector. What you want is a sentence embedding, and there are a number of ways to get those (with not one specifically right answer).
Particularly for sentence classification, we'd often use the output of the special classification token when the language model has been trained on it (CamemBERT uses <s>). Note that depending on the model, this can be the first (mostly BERT and children; also CamemBERT) or the last token (CTRL, GPT2, OpenAI, XLNet). I would suggest to use this option when available, because that token is trained exactly for this purpose.
If a [CLS] (or <s> or similar) token is not available, there are some other options that fall under the term pooling. Max and mean pooling are often used. What this means is that you take the element-wise maximum or the mean over all token vectors. As you say, the "danger" is that you then reduce the vector value of the whole sentence to "some average" or "some max" that might not be very representative of the sentence. However, the literature shows that this works quite well in practice.
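To make the pooling idea concrete, here is a small sketch in plain Python with made-up 3-dimensional token vectors (real models would use 768 dimensions and tensors). The helper names mean_pool, max_pool and cosine_similarity are hypothetical. Two sentences with different token counts are reduced to vectors of the same size, which can then be compared.

```python
import math


def mean_pool(token_vectors):
    """Average over the token axis, one value per embedding dimension."""
    n = len(token_vectors)
    return [sum(col) / n for col in zip(*token_vectors)]


def max_pool(token_vectors):
    """Element-wise maximum over the token axis."""
    return [max(col) for col in zip(*token_vectors)]


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up token vectors: sentence u has 2 tokens, sentence v has 3.
u_tokens = [[1.0, 0.0, 2.0], [3.0, 4.0, 0.0]]
v_tokens = [[1.0, 1.0, 1.0], [2.0, 0.0, 1.0], [0.0, 2.0, 1.0]]

u_mean, v_mean = mean_pool(u_tokens), mean_pool(v_tokens)
u_max, v_max = max_pool(u_tokens), max_pool(v_tokens)

# Both pooled vectors have length 3 regardless of the token count.
sim = cosine_similarity(u_mean, v_mean)
print(u_mean, v_mean, sim)
```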
As another answer suggests, the layer whose output you use can make a difference as well. IIRC the Google paper on BERT suggests that they got the best score when concatenating the last four layers. This is more advanced and I will not go into it here unless requested.
I have no experience with fairseq, but using the transformers library, I'd write something like this (CamemBERT is available in the library from v2.2.0):
import torch
from transformers import CamembertModel, CamembertTokenizer
text = "Salut, comment vas-tu ?"
tokenizer = CamembertTokenizer.from_pretrained('camembert-base')
# encode() automatically adds the classification token <s>
token_ids = tokenizer.encode(text)
tokens = [tokenizer._convert_id_to_token(idx) for idx in token_ids]
print(tokens)
# unsqueeze token_ids because batch_size=1
token_ids = torch.tensor(token_ids).unsqueeze(0)
print(token_ids)
# load model
model = CamembertModel.from_pretrained('camembert-base')
# forward method returns a tuple (we only want element 0, the last hidden state)
# squeeze() because batch_size=1
output = model(token_ids)[0].squeeze()
# only grab output of CLS token (<s>), which is the first token
cls_out = output[0]
print(cls_out.size())
Printed output is (in order) the tokens after tokenisation, the token IDs, and the final size.
['<s>', '▁Salut', ',', '▁comment', '▁vas', '-', 'tu', '▁?', '</s>']
tensor([[ 5, 5340, 7, 404, 4660, 26, 744, 106, 6]])
torch.Size([768])
Bert-as-service is a great example of doing exactly what you are asking about.
They use padding. But read the FAQ regarding which layer to get the representation from and how to pool it: long story short, it depends on the task.
EDIT: I am not saying "use Bert-as-service"; I am saying "rip off what Bert-as-service does."
In your example, you are getting word embeddings (because of the layer you are extracting from). Here is how Bert-as-service does that. So, it actually shouldn't surprise you that this depends on sentence length.
You then talk about getting sentence embeddings by mean pooling over word embeddings. That is... a way to do it. But, using Bert-as-service as a guide for how to get a fixed-length representation from Bert...
Q: How do you get the fixed representation? Did you do pooling or something?
A: Yes, pooling is required to get a fixed representation of a sentence. In the default strategy REDUCE_MEAN, I take the second-to-last hidden layer of all of the tokens in the sentence and do average pooling.
So, to do Bert-as-service's default behavior, you'd do
def embed(sentence):
    tokens = camembert.encode(sentence)
    # Extract all layer's features (layer 0 is the embedding layer)
    all_layers = camembert.extract_features(tokens, return_all_hiddens=True)
    pooling_layer = all_layers[-2]
    embedded = pooling_layer.mean(1)  # 1 is the dimension you want to average over
    # note, using numpy to take the mean is bad if you want to stay on GPU
    return embedded
Take a look at sentence-transformers. Your model can be implemented as:
from sentence_transformers import SentenceTransformer, models
word_embedding_model = models.CamemBERT('camembert-base')
dim = word_embedding_model.get_word_embedding_dimension()
pooling_model = models.Pooling(dim, pooling_mode_mean_tokens=True, pooling_mode_cls_token=False, pooling_mode_max_tokens=False)
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])
sentences = ['sentence 1', 'sentence 2', 'sentence 3']
sentence_embeddings = model.encode(sentences)
In the benchmark section you can see a comparison to several embedding methods such as Bert as a Service which I wouldn't recommend for similarity tasks. Additionally you can fine tune the embeddings for your task.
Also interesting to try a multilingual model:
model = SentenceTransformer('distiluse-base-multilingual-cased')
model.encode([...])
Will probably yield better results than mean pooling CamemBert.

LSTM-RNN : How to shape multivariate Inputs

Hi everybody, I am struggling with the TensorFlow RNN implementation:
The problem:
I want to train an LSTM implementation of an RNN to detect malicious connections in the KDD99 dataset. It's a dataset with 41 features and (after some preprocessing) a label vector of size 5.
[
[x1, x2, x3, .....x40, x41],
...
[x1, x2, x3, .....x40, x41]
]
[
[0, 1, 0, 0, 0],
...
[0, 0, 1, 0, 0]
]
As a basic architecture, I would like to implement the following:
cell = tf.nn.rnn_cell.LSTMCell(num_units=64, state_is_tuple=True)
cell = tf.nn.rnn_cell.DropoutWrapper(cell=cell, output_keep_prob=0.5)
cell = tf.nn.rnn_cell.MultiRNNCell(cells=[cell] * 3, state_is_tuple=True)
My question is: in order to feed the data to the model, how would I need to reshape the input features?
Would I have to not just reshape the input features, but also build sliding-window sequences?
What I mean by that:
Assuming a sequence length of ten, the first sequence would contain data points 0 - 9, the second one data points 1 - 10, the third 2 - 11, and so on.
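The sliding-window idea could be sketched like this in plain Python. The data is a placeholder (12 time steps with 41 identical values each), and sliding_windows is a hypothetical helper name. Each window holds seq_len consecutive rows, and consecutive windows overlap by all but one row.

```python
def sliding_windows(rows, seq_len, step=1):
    """Return overlapping windows of `seq_len` consecutive rows."""
    return [rows[start:start + seq_len]
            for start in range(0, len(rows) - seq_len + 1, step)]


# Placeholder data: 12 time steps with 41 features each.
rows = [[float(t)] * 41 for t in range(12)]
windows = sliding_windows(rows, seq_len=10)

# Result is a list of windows, each seq_len x 41:
# window 0 covers rows 0-9, window 1 covers rows 1-10, window 2 covers rows 2-11.
print(len(windows), len(windows[0]), len(windows[0][0]))
```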
Thanks!
I do not know the dataset, but I think that your problem is the following: you have a very long sequence and you want to know how to shape it in order to feed it to the network.
The 'tf.contrib.rnn.static_rnn' has the following signature:
tf.contrib.rnn.static_rnn(cell, inputs, initial_state=None, dtype=None, sequence_length=None, scope=None)
where
inputs: A length T list of inputs, each a Tensor of shape [batch_size, input_size], or a nested tuple of such elements.
So the inputs need to be shaped into a list, where each element of the list is the element of the input sequence at one time step.
The length of this list depends on your problem and/or on computational constraints.
In Natural Language Processing, for example, the length of this list can be the maximum sentence length in your document, where shorter sentences are padded to that length. As in this case, in many domains the length of the sequence is driven by the problem.
However, your problem may offer no such natural length, or the natural sequence may still be very long. Long sequences are very heavy from a computational point of view. The BPTT algorithm, used to optimize these models, "unfolds" the recurrent network into a very deep feedforward network with shared parameters and backpropagates over it. In these cases, it is still convenient to "cut" the sequence to a fixed length.
And here we arrive at your question, given this fixed length, let us say 10, how do I shape my input?
Usually, what is done is to cut the dataset into non-overlapping windows (in your example, we will have 0-9, 10-19, 20-29, etc.). What happens here is that the network only looks at the last 10 elements of the sequence each time it updates the weights with BPTT.
However, since the sequence has been arbitrarily cut, it is likely that predictions need to exploit evidence that is far back in the sequence, outside the current window. To do this, we initialize the state of the RNN at window i with the final state of window i-1 using the parameter:
initial_state: (optional) An initial state for the RNN.
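Here is a toy sketch of the state carry-over in plain Python. The toy_cell function is a made-up stand-in for a real RNN cell (it is just a decaying running sum), and the other names are hypothetical too. The point is only that feeding each window's final state into the next window gives the same forward state as processing the whole sequence in one go.

```python
def toy_cell(state, x):
    """Hypothetical stand-in for an RNN cell: a decaying running sum."""
    return 0.5 * state + x


def run_window(window, initial_state):
    """Run the cell over one window, starting from `initial_state`."""
    state = initial_state
    for x in window:
        state = toy_cell(state, x)
    return state


def split_windows(seq, length):
    """Cut a sequence into non-overlapping windows of `length` elements."""
    return [seq[i:i + length] for i in range(0, len(seq), length)]


sequence = [float(t) for t in range(30)]
chunks = split_windows(sequence, 10)

# Carry the final state of window i-1 into window i.
state = 0.0
for window in chunks:
    state = run_window(window, initial_state=state)

# Same forward result as running over the uncut sequence.
full_state = run_window(sequence, initial_state=0.0)
print(state == full_state)
```

Note that this only covers the forward pass. During training, gradients would still stop at each window boundary, which is what keeps the cost of BPTT bounded.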
Finally, I give you two sources to go into more details:
RNN Tutorial: This is the official TensorFlow tutorial. It is applied to the task of language modeling. At a certain point in the code, you will see that the final state is fed to the network from one run to the next, in order to implement what was described above.
feed_dict = {}
for i, (c, h) in enumerate(model.initial_state):
    feed_dict[c] = state[i].c
    feed_dict[h] = state[i].h
DevSummit 2017: This is a video of a talk from the TensorFlow DevSummit 2017 where the first section (Reading and Batching Sequence Data) explains how, and with which functions, you should shape your sequence inputs.
Hope this helps :)

Artificial Neural Network for formula classification/calculation

I am trying to create an ANN for calculating/classifying a/any formula.
I initially tried to replicate the Fibonacci sequence, using the inputs:
[1,2] output [3]
[2,3] output [5]
[3,5] output [8]
etc...
The issue I am trying to overcome is how to normalize the data that could be potentially infinite or scale exponentially? I then tried to create an ANN to calculate the slope-intercept formula y = mx+b (2x+2) with inputs
[1] output [4]
[2] output [6]
etc...
Again I do not know how to normalize the data. If I normalize only the training data how would the network be able to calculate or classify with inputs outside of what was used for normalization?
So would it be possible to create an ANN to calculate/classify the formula ((a+2b+c^2+3d-5e) modulo 2), where the formula is unknown, but (some of) the inputs a, b, c, d, and e are given as well as the output? Essentially, classifying whether the calculation's output is odd or even, where the inputs range between -infinity and +infinity...
Okay, I think I understand what you're trying to do now. Basically, you are going to have a set of inputs representing the coefficients of a function. You want the ANN to tell you whether the function, with those coefficients, will produce an even or an odd output. Let me know if that's wrong. There are a few potential issues here:
First, while it is possible to use a neural network to do addition, it is not generally very efficient. You also need to set your ANN up in a very specific way, either by using a different node type than is usually used, or by setting up complicated recurrent topologies. This would explain your lack of success with the Fibonacci sequence and the line equation.
But there's a more fundamental problem. You might have heard that ANNs are general function approximators. However, in this case, the function that the ANN is learning won't be your formula. When you have an ANN that is learning to output either 0 or 1 in response to a set of inputs, it's actually trying to learn a function for a line (or set of lines, or hyperplane, depending on the topology) that separates all of the inputs for which the output should be 0 from all of the inputs for which the output should be 1. (see the answers to this question for a more thorough explanation, with pictures). So the question, then, is whether or not there is a hyperplane that separates coefficients that will result in an even output from coefficients that will result in an odd output.
I'm inclined to say that the answer to that question is no. If you consider the a coefficient in your example, for instance, you will see that every time you increment or decrement it by 1, the correct output switches. The same is true for the c, d, and e terms. This means that there aren't big clumps of relatively similar inputs that all return the same output.
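A quick check of that claim in plain Python, assuming integer inputs (the starting values in base are arbitrary). For each input, it tests whether adding 1 flips the parity of the formula from the question.

```python
def parity(a, b, c, d, e):
    """Parity (0 = even, 1 = odd) of the formula from the question."""
    return (a + 2 * b + c ** 2 + 3 * d - 5 * e) % 2


# Arbitrary integer starting point.
base = dict(a=3, b=7, c=-2, d=4, e=10)

flips_by_input = {}
for name in base:
    bumped = dict(base, **{name: base[name] + 1})
    flips_by_input[name] = parity(**base) != parity(**bumped)

print(flips_by_input)
```

Incrementing a, c, d, or e flips the parity every time, while b never matters because it is multiplied by 2. Neighbouring inputs therefore keep alternating between the two classes.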
Why do you need to know whether the output of an unknown function is even or odd? There might be other, more appropriate techniques.
