How does the Transformer Model Compute Self Attention? - machine-learning

In the transformer model, https://arxiv.org/pdf/1706.03762.pdf there is self-attention which is computed using softmax on Query (Q) and Key (K) vectors:
I am trying to understand the matrix multiplications:
Q = batch_size x seq_length x embed_size
K = batch_size x seq_length x embed_size
QK^T = batch_size x seq_length x seq_length
Softmax QK^T = Softmax (batch_size x seq_length x seq_length)
How is the softmax computed since there are seq_length x seq_length values per batch element?
A reference to Pytorch computation will be very helpful.
Cheers!

How is the softmax computed since there are seq_length x seq_length values per batch element?
The softmax is performed on w.r.t the last axis (torch.nn.Softmax(dim=-1)(tensor) where tensor is of shape batch_size x seq_length x seq_length) to get the probability of attending to every element for each element in the input sequence.
Let's assume, we have a text sequence "Thinking Machines", so we have a matrix of shape "2 x 2" (where seq_length = 2) after performing QK^T.
I am using the following illustration (reference) to explain self-attention computation. As you know, first scaled-dot-product is performed QK^T/square_root(d_k) and then softmax is computed for each sequence element.
Here, Softmax is performed for the first sequence element "Thinking". The raw score of 14 and 12 is turned into a probability of 0.88 and 0.12 by doing softmax. These probability indicates that the token "Thinking" would attend itself with 88% probability, and the token "Machines" with 12% probability. Similarly, the attention probability is computed for the token "Machines" too.
Note. I strongly suggest reading this excellent article on Transformer. For implementation, you can take a look at OpenNMT.

The QKᵀ multiplication is a batched matrix multiplication -- it's doing a separate seq_length x embed_size by embed_size x seq_length multiplication batch_size times. Each one gives a result of size seq_length x seq_length, which is how we end up with QKᵀ having the shape batch_size x seq_length x seq_length.
Gabriela Melo's suggested resource uses the following PyTorch code for this operation:
torch.matmul(query, key.transpose(-2, -1))
This works because torch.matmul does a batched matrix multiplication when an input has at least 3 dimensions (see https://pytorch.org/docs/stable/torch.html#torch.matmul).

Related

In multi-class logistic regression, does SGD one training example update all the weights?

In multi-class logistic regression, lets say we use softmax and cross entropy.
Does SGD one training example update all the weights or only a portion of the weights which are associated to the label ?
For example, the label is one-hot [0,0,1]
Does the whole matrix W_{feature_dim \times num_class} updated or only W^{3}_{feature_dim \times 1} updated ?
Thanks
All of your weights are updated.
You have y = Softmax(W x + β), so to predict a y out of a single x you are making use of all your W weights. If something is used during the forward pass (prediction), then it also gets updated during the backward pass (SGD). Perhaps a more intuitive way of thinking about it is that you are essentially predicting the class membership probability for your features; assigning weight to some class means removing weight from another, so you need to update both.
Take for instance the simple case of x ∈ ℝ, y ∈ ℝ3. Then W ∈ ℝ1×3. Before activation, your prediction for some given x would look like: y= [y1 = W11x + β1, y2 = W12x + β2, y3 = W13x + β3]. You have an error signal for all of these mini-predictions, coming out of categorical crossentropy, for which you must then compute the derivative wrt the W, β terms.
I hope this is clear

Why not logistic regression uses multiplication instead of addition for error matrix?

This is a very basic question but I cannot could not find enough reasons to convince myself. Why must logistic regression use multiplication instead of addition for the likelihood function l(w)?
Your question is more general than just joint likelihood for logistic regression. You're asking why we multiply probabilities instead of add them to represent a joint probability distribution. Two notes:
This applies when we assume random variables are independent. Otherwise we need to calculate conditional probabilities using the chain rule of probability. You can look at wikipedia for more information.
We multiply because that's how the joint distribution is defined. Here is a simple example:
Say we have two probability distributions:
X = 1, 2, 3, each with probability 1/3
Y = 0 or 1, each with probability 1/2
We want to calculate the joint likelihood function, L(X=x,Y=y), which is that X takes on values x and Y takes on values y.
For example, L(X=1,Y=0) = P(X=1) * P(Y=0) = 1/6. It wouldn't make sense to write P(X=1) + P(Y=0) = 1/3 + 1/2 = 5/6.
Now it's true that in maximum likelihood estimation, we only care about those values of some parameter, theta, which maximizes the likelihood function. In this case, we know that if theta maximizes L(X=x,Y=y) then the same theta will also maximize log L(X=x,Y=y). This is where you may have seen addition of probabilities come into play.
Hence we can take the log P(X=x,Y=y) = log P(X=x) + log P(Y=y)
In short
This could be summarized as "joint probabilities represent an AND". When X and Y are independent, P(X AND Y) = P(X,Y) = P(X)P(Y). Not to be confused with P(X OR Y) = P(X) + P(Y) - P(X,Y).
Let me know if this helps.

Why input is scaled in tf.nn.dropout in tensorflow?

I can't understand why dropout works like this in tensorflow. The blog of CS231n says that, "dropout is implemented by only keeping a neuron active with some probability p (a hyperparameter), or setting it to zero otherwise." Also you can see this from picture(Taken from the same site)
From tensorflow site, With probability keep_prob, outputs the input element scaled up by 1 / keep_prob, otherwise outputs 0.
Now, why the input element is scaled up by 1/keep_prob? Why not keep the input element as it is with probability and not scale it with 1/keep_prob?
This scaling enables the same network to be used for training (with keep_prob < 1.0) and evaluation (with keep_prob == 1.0). From the Dropout paper:
The idea is to use a single neural net at test time without dropout. The weights of this network are scaled-down versions of the trained weights. If a unit is retained with probability p during training, the outgoing weights of that unit are multiplied by p at test time as shown in Figure 2.
Rather than adding ops to scale down the weights by keep_prob at test time, the TensorFlow implementation adds an op to scale up the weights by 1. / keep_prob at training time. The effect on performance is negligible, and the code is simpler (because we use the same graph and treat keep_prob as a tf.placeholder() that is fed a different value depending on whether we are training or evaluating the network).
Let's say the network had n neurons and we applied dropout rate 1/2
Training phase, we would be left with n/2 neurons. So if you were expecting output x with all the neurons, now you will get on x/2. So for every batch, the network weights are trained according to this x/2
Testing/Inference/Validation phase, we dont apply any dropout so the output is x. So, in this case, the output would be with x and not x/2, which would give you the incorrect result. So what you can do is scale it to x/2 during testing.
Rather than the above scaling specific to Testing phase. What Tensorflow's dropout layer does is that whether it is with dropout or without (Training or testing), it scales the output so that the sum is constant.
Here is a quick experiment to disperse any remaining confusion.
Statistically the weights of a NN-layer follow a distribution that is usually close to normal (but not necessarily), but even in the case when trying to sample a perfect normal distribution in practice, there are always computational errors.
Then consider the following experiment:
DIM = 1_000_000 # set our dims for weights and input
x = np.ones((DIM,1)) # our input vector
#x = np.random.rand(DIM,1)*2-1.0 # or could also be a more realistic normalized input
probs = [1.0, 0.7, 0.5, 0.3] # define dropout probs
W = np.random.normal(size=(DIM,1)) # sample normally distributed weights
print("W-mean = ", W.mean()) # note the mean is not perfect --> sampling error!
# DO THE DRILL
h = defaultdict(list)
for i in range(1000):
for p in probs:
M = np.random.rand(DIM,1)
M = (M < p).astype(int)
Wp = W * M
a = np.dot(Wp.T, x)
h[str(p)].append(a)
for k,v in h.items():
print("For drop-out prob %r the average linear activation is %r (unscaled) and %r (scaled)" % (k, np.mean(v), np.mean(v)/float(k)))
Sample output:
x-mean = 1.0
W-mean = -0.001003985674840264
For drop-out prob '1.0' the average linear activation is -1003.985674840258 (unscaled) and -1003.985674840258 (scaled)
For drop-out prob '0.7' the average linear activation is -700.6128015029908 (unscaled) and -1000.8754307185584 (scaled)
For drop-out prob '0.5' the average linear activation is -512.1602655283492 (unscaled) and -1024.3205310566984 (scaled)
For drop-out prob '0.3' the average linear activation is -303.21194422742315 (unscaled) and -1010.7064807580772 (scaled)
Notice that the unscaled activations diminish due to the statistically imperfect normal distribution.
Can you spot an obvious correlation between the W-mean and the average linear activation means?
If you keep reading in cs231n, the difference between dropout and inverted dropout is explained.
In a network with no dropout, the activations in layer L will be aL. The weights of next layer (L+1) will be learned in such a manner that it receives aL and produces output accordingly. But with a network containing dropout (with keep_prob = p), the weights of L+1 will be learned in such a manner that it receives p*aL and produces output accordingly. Why p*aL? Because the Expected value, E(aL), will be probability_of_keeping(aL)*aL + probability_of_not_keeping(aL)*0 which will be equal to p*aL + (1-p)*0 = p*aL. In the same network, during testing time there will be no dropout. Hence the layer L+1 will receive aL simply. But its weights were trained to expect p*aL as input. Therefore, during testing time you will have to multiply the activations with p. But instead of doing this, you can multiply the activations with 1/p during training only. This is called inverted dropout.
Since we want to leave the forward pass at test time untouched (and tweak our network just during training), tf.nn.dropout directly implements inverted dropout, scaling the values.

Probability for a vector x

In machine learning suppose we have a GDA (Gaussian Discriminant Analysis) model for classification.
If y can take values 0 or 1 and x represents the vector with n features(n x 1 dimensional)
What does p(x| y=0) or p(x|y=1) signify for a particular training example?
x is actually a vector..how is conditional probability defined for this?
Any help would be much appreciated.
Say that X0 is the set of vectors x that mapped to output 0, and X1 is the set of vectors x that mapped to output 1. Take the mean of each set's vectors, and, similarly, approximate the covariance.
Now build two multivariate normal distributions, with these means and covariances, respectively.
Once you have these distribution, simply plug in the vector you want into the PDF to obtain its density. Note that since the probabilities are continuous, the probability about which you asked is 0, in general.

Neural Networks: Why does the perceptron rule only work for linearly separable data?

I previously asked for an explanation of linearly separable data. Still reading Mitchell's Machine Learning book, I have some trouble understanding why exactly the perceptron rule only works for linearly separable data?
Mitchell defines a perceptron as follows:
That is, it is y is 1 or -1 if the sum of the weighted inputs exceeds some threshold.
Now, the problem is to determine a weight vector that causes the perceptron to produce the correct output (1 or -1) for each of the given training examples. One way of achieving this is through the perceptron rule:
One way to learn an acceptable weight vector is to begin with random
weights, then iteratively apply the perceptron to each training
example, modify- ing the perceptron weights whenever it misclassifies
an example. This process is repeated, iterating through the training
examples as many times as needed until the perceptron classifies all
training examples correctly. Weights are modified at each step
according to the perceptron training rule, which revises the weight wi
associated with input xi according to the rule:
So, my question is: Why does this only work with linearly separable data? Thanks.
Because the dot product of w and x is a linear combination of xs, and you, in fact, split your data into 2 classes using a hyperplane a_1 x_1 + … + a_n x_n > 0
Consider a 2D example: X = (x, y) and W = (a, b) then X * W = a*x + b*y. sgn returns 1 if its argument is greater than 0, that is, for class #1 you have a*x + b*y > 0, which is equivalent to y > -a/b x (assuming b != 0). And this equation is linear and divides a 2D plane into 2 parts.

Resources