I do understand conceptually what an LSTM or GRU should do (thanks to the question What's the difference between "hidden" and "output" in PyTorch LSTM?), BUT when I inspect the output of the GRU, h_n and output are NOT the same, while they should be...
(Pdb) rnn_output
tensor([[[ 0.2663, 0.3429, -0.0415, ..., 0.1275, 0.0719, 0.1011],
[-0.1272, 0.3096, -0.0403, ..., 0.0589, -0.0556, -0.3039],
[ 0.1064, 0.2810, -0.1858, ..., 0.3308, 0.1150, -0.3348],
...,
[-0.0929, 0.2826, -0.0554, ..., 0.0176, -0.1552, -0.0427],
[-0.0849, 0.3395, -0.0477, ..., 0.0172, -0.1429, 0.0153],
[-0.0212, 0.1257, -0.2670, ..., -0.0432, 0.2122, -0.1797]]],
grad_fn=<StackBackward>)
(Pdb) hidden
tensor([[[ 0.1700, 0.2388, -0.4159, ..., -0.1949, 0.0692, -0.0630],
[ 0.1304, 0.0426, -0.2874, ..., 0.0882, 0.1394, -0.1899],
[-0.0071, 0.1512, -0.1558, ..., -0.1578, 0.1990, -0.2468],
...,
[ 0.0856, 0.0962, -0.0985, ..., 0.0081, 0.0906, -0.1234],
[ 0.1773, 0.2808, -0.0300, ..., -0.0415, -0.0650, -0.0010],
[ 0.2207, 0.3573, -0.2493, ..., -0.2371, 0.1349, -0.2982]],
[[ 0.2663, 0.3429, -0.0415, ..., 0.1275, 0.0719, 0.1011],
[-0.1272, 0.3096, -0.0403, ..., 0.0589, -0.0556, -0.3039],
[ 0.1064, 0.2810, -0.1858, ..., 0.3308, 0.1150, -0.3348],
...,
[-0.0929, 0.2826, -0.0554, ..., 0.0176, -0.1552, -0.0427],
[-0.0849, 0.3395, -0.0477, ..., 0.0172, -0.1429, 0.0153],
[-0.0212, 0.1257, -0.2670, ..., -0.0432, 0.2122, -0.1797]]],
grad_fn=<StackBackward>)
They look like some transpose of each other... why?
They are not really the same. Consider that we have the following Unidirectional GRU model:
import torch.nn as nn
import torch
gru = nn.GRU(input_size = 8, hidden_size = 50, num_layers = 3, batch_first = True)
Please make sure you observe the input shape carefully.
inp = torch.randn(1024, 112, 8)
out, hn = gru(inp)
Definitely,
torch.equal(out, hn)
False
One of the most effective ways that helped me understand output vs. hidden states was to view hn as hn.view(num_layers, num_directions, batch, hidden_size), where num_directions = 2 for bidirectional recurrent networks (and 1 otherwise, i.e., our case). Thus,
hn_conceptual_view = hn.view(3, 1, 1024, 50)
As the documentation states:
h_n of shape (num_layers * num_directions, batch, hidden_size): tensor containing the hidden state for t = seq_len (i.e., for the last timestep)
In our case, h_n contains the hidden vectors for the last timestep t = 112, whereas the output is described as:
output of shape (seq_len, batch, num_directions * hidden_size): tensor containing the output features h_t from the last layer of the GRU, for each t. If a torch.nn.utils.rnn.PackedSequence has been given as the input, the output will also be a packed sequence. For the unpacked case, the directions can be separated using output.view(seq_len, batch, num_directions, hidden_size), with forward and backward being direction 0 and 1 respectively.
So, consequently, one can do:
torch.equal(out[:, -1], hn_conceptual_view[-1, 0, :, :])
True
Explanation: I compare the last timestep from all batches in out[:, -1] to the last-layer hidden vectors from hn[-1, 0, :, :].
For Bidirectional GRU (requires reading the unidirectional first):
gru = nn.GRU(input_size = 8, hidden_size = 50, num_layers = 3, batch_first = True, bidirectional = True)
inp = torch.randn(1024, 112, 8)
out, hn = gru(inp)
View is changed to (since we have two directions):
hn_conceptual_view = hn.view(3, 2, 1024, 50)
If you try the exact code:
torch.equal(out[:, -1], hn_conceptual_view[-1, 0, :, :])
False
Explanation: This is because we are comparing tensors of the wrong shapes;
out[:, 0].shape
torch.Size([1024, 100])
hn_conceptual_view[-1, 0, :, :].shape
torch.Size([1024, 50])
Remember that for bidirectional networks, the hidden states of the two directions are concatenated at each timestep: the first hidden_size elements (i.e., out[:, 0, :50]) are the hidden states of the forward network, and the remaining hidden_size elements (i.e., out[:, 0, 50:]) are those of the backward network. The correct comparison for the forward network is then:
torch.equal(out[:, -1, :50], hn_conceptual_view[-1, 0, :, :])
True
If you want the hidden states of the backward network, remember that a backward network processes the sequence from timestep n down to 1. So you compare the first timestep of the sequence but the last hidden_size elements, and you change the hn_conceptual_view direction to 1:
torch.equal(out[:, 0, 50:], hn_conceptual_view[-1, 1, :, :])
True
In a nutshell, generally speaking:
Unidirectional:
rnn_module = nn.RECURRENT_MODULE(input_size = E, hidden_size = H, num_layers = X, batch_first = True)
inp = torch.rand(B, S, E)
output, hn = rnn_module(inp)
hn_conceptual_view = hn.view(X, 1, B, H)
Where RECURRENT_MODULE is either GRU or LSTM (at the time of writing this post), B is the batch size, S sequence length, and E embedding size.
torch.equal(output[:, -1, :], hn_conceptual_view[-1, 0, :, :])
True
Again, we index the last timestep (position S - 1, i.e., index -1) since the rnn_module is forward (i.e., unidirectional) and its final hidden state is produced at the last timestep of the sequence.
Bidirectional:
rnn_module = nn.RECURRENT_MODULE(input_size = E, hidden_size = H, num_layers = X, batch_first = True, bidirectional = True)
inp = torch.rand(B, S, E)
output, hn = rnn_module(inp)
hn_conceptual_view = hn.view(X, 2, B, H)
Comparison
torch.equal(output[:, -1, :H], hn_conceptual_view[-1, 0, :, :])
True
Above is the forward network comparison; we used :H because the forward direction stores its hidden vector in the first H elements of each timestep.
For the backward network:
torch.equal(output[:, 0, H:], hn_conceptual_view[-1, 1, :, :])
True
We changed the direction in hn_conceptual_view to 1 to get hidden vectors for the backward network.
For all examples we used hn_conceptual_view[-1, ...] because we are only interested in the last layer.
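To make the recipe above concrete, here is a minimal runnable sketch; the sizes X = 2, B = 4, S = 6, E = 5, H = 8 are arbitrary choices for illustration:
import torch
import torch.nn as nn
X, B, S, E, H = 2, 4, 6, 5, 8   # layers, batch, sequence length, input size, hidden size
# Unidirectional: last timestep of the output == last layer of h_n
uni = nn.GRU(input_size = E, hidden_size = H, num_layers = X, batch_first = True)
inp = torch.rand(B, S, E)
out, hn = uni(inp)
print(torch.equal(out[:, -1, :], hn.view(X, 1, B, H)[-1, 0]))    # True
# Bidirectional: forward half (first H) at the last timestep, backward half (last H) at the first timestep
bi = nn.GRU(input_size = E, hidden_size = H, num_layers = X, batch_first = True, bidirectional = True)
out, hn = bi(inp)
hn_conceptual_view = hn.view(X, 2, B, H)
print(torch.equal(out[:, -1, :H], hn_conceptual_view[-1, 0]))    # forward direction: True
print(torch.equal(out[:, 0, H:], hn_conceptual_view[-1, 1]))     # backward direction: True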
There are three things you have to remember to make sense of this in PyTorch.
This answer is written on the assumption that you are using something like torch.nn.GRU or the like, and that if you are making a multi-layer RNN with it, that you are using the num_layers argument to do so (rather than building one from scratch out of individual layers yourself.)
The output will give you the hidden layer outputs of the network for each time-step, but only for the final layer. This is useful in many applications, particularly encoder-decoders using attention. (These architectures build up a 'context' layer from all the hidden outputs, and it is extremely useful to have them sitting around as a self-contained unit.)
The h_n will give you the hidden layer outputs for the last time-step only, but for all the layers. Therefore, if and only if you have a single layer architecture, h_n is a strict subset of output. Otherwise, output and h_n intersect, but are not strict subsets of one another. (You will often want these, in an encoder-decoder model, from the encoder in order to jumpstart the decoder.)
If you are using a bidirectional output and you want to actually verify that part of h_n is contained in output (and vice-versa) you need to understand what PyTorch does behind the scenes in the organization of the inputs and outputs. Specifically, it concatenates a time-reversed input with the time-forward input and runs them together. This is literal. This means that the 'forward' output at time T is in the final position of the output tensor sitting right next to the 'reverse' output at time 0; if you're looking for the 'reverse' output at time T, it is in the first position.
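As a quick check of this third point, here is a minimal sketch with an arbitrary single-layer bidirectional GRU (the sizes are only for illustration):
import torch
import torch.nn as nn
gru = nn.GRU(input_size = 5, hidden_size = 7, num_layers = 1, batch_first = True, bidirectional = True)
x = torch.randn(2, 9, 5)
out, hn = gru(x)                            # out: (2, 9, 14), hn: (2, 2, 7)
print(torch.equal(out[:, -1, :7], hn[0]))   # True: 'forward' output at time T == h_n, direction 0
print(torch.equal(out[:, 0, 7:], hn[1]))    # True: 'reverse' output at time 0 == h_n, direction 1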
The third point in particular drove me absolutely bonkers for about three hours the first time I was playing with RNNs and GRUs. In fairness, it is also why h_n is provided as an output, so once you figure it out, you don't have to worry about it any more; you just get the right stuff from the return value.
It is not a transpose. When the LSTM has a single layer, you can recover the last timestep of rnn_output from hidden[-1].
hidden is the output of every cell of every layer: for a specific input timestep it is a 2D array, while the LSTM's output returns every timestep, so the final layer's output is hidden[-1].
This discussion assumes a batch size of 1; otherwise both output and hidden get one extra dimension.
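As a quick check of this claim, a minimal sketch with a single-layer LSTM and arbitrary sizes:
import torch
import torch.nn as nn
lstm = nn.LSTM(input_size = 4, hidden_size = 6, num_layers = 1, batch_first = True)
x = torch.randn(3, 10, 4)                       # batch of 3, 10 timesteps
output, (h_n, c_n) = lstm(x)                    # output: (3, 10, 6), h_n: (1, 3, 6)
print(torch.equal(output[:, -1, :], h_n[-1]))   # True: last timestep of output == final hidden state of the only layer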
I have a data set of 144 student feedback entries, with 72 positive and 72 negative feedbacks respectively. The data set has two attributes, data and target, which contain the sentence and the sentiment (positive or negative) respectively.
Consider the following code:
import pandas as pd
feedback_data = pd.read_csv('output.csv')
print(feedback_data)
data target
0 facilitates good student teacher communication. positive
1 lectures are very lengthy. negative
2 the teacher is very good at interaction. positive
3 good at clearing the concepts. positive
4 good at clearing the concepts. positive
5 good at teaching. positive
6 does not shows test copies. negative
7 good subjective knowledge. positive
8 good communication skills. positive
9 good teaching methods. positive
10 posseses very good and thorough knowledge of t... positive
11 posseses superb ability to provide a lots of i... positive
12 good conceptual skills and knowledge for subject. positive
13 no commuication outside class. negative
14 rude behaviour. negative
15 very negetive attitude towards students. negative
16 good communication skills, lacks time punctual... positive
17 explains in a better way by giving practical e... positive
18 hardly comes on time. negative
19 good communication skills. positive
20 to make students comfortable with the subject,... negative
21 associated to original world. positive
22 lacks time punctuality. negative
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(binary = True)
cv.fit(feedback_data['data'].values)
X = feedback_data['data'].apply(lambda X : cv.transform([X])).values
X_test = cv.transform(feedback_data_test)
from sklearn import svm
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
target = [1 if i<72 else 0 for i in range(144)]
print(target)
X_train, X_val, y_train, y_val = train_test_split(X, target, train_size = 0.50)
clf = svm.SVC(kernel = 'linear', gamma = 0.001, C = 0.05)
#The below line gives the error
clf.fit(X , target)
I do not know what is wrong. Please help
The error comes from the way X has been built. You cannot use X directly in the fit method; you first need to transform it a little bit more (I could not have told you that for the other problem, as I did not have the info).
right now you have the following:
array([<1x23 sparse matrix of type '<class 'numpy.int64'>'
with 5 stored elements in Compressed Sparse Row format>,
...
<1x23 sparse matrix of type '<class 'numpy.int64'>'
with 3 stored elements in Compressed Sparse Row format>], dtype=object)
Which is enough to do a split.
We are just going to transform it so that you can understand it, and so will the fit method:
X = list([list(x.toarray()[0]) for x in X])
What we do is convert each sparse matrix to a numpy array, take its first element (it has only one row) and then convert it to a list to make sure it has the right dimension.
Now why are we doing this:
X is something like that
>>>X[0]
<1x23 sparse matrix of type '<class 'numpy.int64'>'
with 5 stored elements in Compressed Sparse Row format>
so we transform it to see what it really is:
>>>X[0].toarray()
array([[0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0,
0]], dtype=int64)
and then, as you see, there is a slight issue with the dimension, so we take the first element.
Converting back to a list does nothing extra; it is just so you can clearly see what you get (you can drop that step for speed).
Your code is now this:
cv = CountVectorizer(binary = True)
cv.fit(feedback_data['data'].values)
X = feedback_data['data'].apply(lambda X : cv.transform([X])).values
X = list([list(x.toarray()[0]) for x in X])
clf = svm.SVC(kernel = 'linear', gamma = 0.001, C = 0.05)
clf.fit(X, target)
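As a side note, a simpler alternative (not required for the fix above, just a sketch assuming you want a single 2-D feature matrix, reusing feedback_data and target from the question) is to call transform on the whole column at once; the resulting sparse matrix can be passed straight to SVC:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import svm
cv = CountVectorizer(binary = True)
X = cv.fit_transform(feedback_data['data'].values)    # sparse matrix of shape (144, vocabulary_size)
clf = svm.SVC(kernel = 'linear', gamma = 0.001, C = 0.05)
clf.fit(X, target)                                     # fit accepts 2-D sparse input directly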
I am implementing a custom connection between two different keras layers. The neural network begins something like below:
model = tf.keras.Sequential()
c1 = model.add(Conv2D(6, kernel_size=[5,5], strides=(stride,stride), padding="valid", input_shape=(32,32,1),
activation = 'tanh'))
s2 = model.add(AveragePooling2D(pool_size=2, strides=2, padding='valid'))
Now, the output of s2 has a size of 14*14*6
Here, I want to apply my custom connection to convolution layer c3, which has an output size of 10*10*16 (that is, 16 filters need to be applied on s2 of size 14*14*6 to get an output of 10*10*16). For this, I need to use kernel_size = 5*5, filters = 16, stride = 1, and padding = 'valid'.
However, not all 6 feature maps of s2 are connected to the 16 feature maps of c3. The connections are explained in the link given here.
For example (from the explanation in the link above), to build your first feature map of C3, you convolve 3 of your input maps (of s2, which has size 14*14*6) with 5x5 filters, which gives you three 10x10 maps that are summed up to give your first feature map, which is then of size 10x10.
I read somewhere that, we need to use Functional API to build this.
But I am not sure how to proceed further. Can someone help with implementing this?
My initial approach of implementing this is as follows:
from keras.models import Model
from keras.layers import Conv2D, Input, Concatenate, Lambda, Add
inputTensor = Input(shape=(14, 14,6))
stride =1
group0_a = Lambda(lambda x: x[:,:,0])(inputTensor)
group0_b = Lambda(lambda x: x[:,:,1])(inputTensor)
group0_c = Lambda(lambda x: x[:,:,2])(inputTensor) # Take 0,1,2 feature map of s2
conv_group0_a = Conv2D(1, kernel_size=[5,5], strides=(stride,stride), padding="valid", activation = 'tanh')(group0_a)
conv_group0_b = Conv2D(1, kernel_size=[5,5], strides=(stride,stride), padding="valid", activation = 'tanh')(group0_b)
conv_group0_c = Conv2D(1, kernel_size=[5,5], strides=(stride,stride), padding="valid", activation = 'tanh')(group0_c) #Applying convolution on each of the 0, 1, 2 feature maps of s2 with distinct kernels
added_0 = Add()([conv_group0_a, conv_group0_b, conv_group0_c]) #adding all the three to get one of the 10*10*16
#Repeat this for 16 neurons of c3 and then finally
output_layer = Concatenate()([]) #concatenate them
Mymodel = Model(inputTensor,output_layer)
I want to know if my approach is correct (I know it is not, because I am getting too many errors). So I need help in recreating the custom connection as explained above. Any help is appreciated.
The above code is correct; the only change I made is group0_a = Lambda(lambda x: x[:,:,0:1])(inputTensor). That is, instead of passing x as x[:,:,0] I passed it as x[:,:,0:1], so the slice keeps the dimension that plain indexing would drop and the tensor stays 4-D.
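For completeness, here is a minimal sketch of the whole C3-style block built that way. It is not the exact LeNet-5 connection table (the groups below follow a hypothetical, regular pattern; substitute the real table), it slices the channel axis with x[..., s:s+3] (channels-last), and it uses one Conv2D with a single filter per group, which already convolves each selected channel with its own 5x5 kernel and sums the results (equivalent to the per-channel convolutions plus Add, up to bias terms):
from keras.models import Model
from keras.layers import Conv2D, Input, Concatenate, Lambda
inputTensor = Input(shape=(14, 14, 6))
feature_maps = []
for i in range(16):
    start = i % 4   # hypothetical grouping: 3 consecutive channels per C3 map
    sliced = Lambda(lambda x, s=start: x[..., s:s + 3])(inputTensor)   # (None, 14, 14, 3); default arg avoids late binding
    fmap = Conv2D(1, kernel_size=[5, 5], strides=(1, 1), padding="valid", activation='tanh')(sliced)   # (None, 10, 10, 1)
    feature_maps.append(fmap)
output_layer = Concatenate()(feature_maps)   # (None, 10, 10, 16)
Mymodel = Model(inputTensor, output_layer)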
I'm trying to understand RNNs and I would like to find a simple example that actually shows the one hot vectors and the numerical operations. Preferably conceptual since actual code may make it even more confusing. Most examples I google just show boxes with loops coming out of them and its really difficult to understand what exactly is going on. In the rare case where they do show the vectors its still difficult to see how they are getting the values.
for example I don't know where the values are coming from in this picture https://i1.wp.com/karpathy.github.io/assets/rnn/charseq.jpeg
If the example could integrate LSTMs and other popular extensions that would be cool too.
In the simple RNN case, a network accepts an input sequence x and produces an output sequence y, while a hidden sequence h stores the network's dynamic state, such that at timestep i: x(i) ∊ ℝ^M, h(i) ∊ ℝ^N, y(i) ∊ ℝ^P, the real-valued vectors of M/N/P dimensions corresponding to input, hidden and output values respectively. The RNN changes its state and emits output based on the state equations:
h(t) = tanh(Wxh * [x(t); h(t-1)]), where Wxh is a linear map ℝ^(M+N) ↦ ℝ^N, * is matrix multiplication and ; is the concatenation operation. Concretely, to obtain h(t) you concatenate x(t) with h(t-1), you apply matrix multiplication between Wxh (of shape (M+N, N)) and the concatenated vector (of shape M+N), and you use a tanh non-linearity on each element of the resulting vector (of shape N).
y(t) = sigmoid(Why * h(t)), where Why is a linear map ℝ^N ↦ ℝ^P. Concretely, you apply matrix multiplication between Why (of shape (N, P)) and h(t) (of shape N) to obtain a P-dimensional output vector, on which the sigmoid function is applied.
In other words, obtaining the output at time t requires iterating through the above equations for i=0,1,...,t. Therefore, the hidden state acts as a finite memory for the system, allowing for context-dependent computation (i.e. h(t) fully depends on both the history of the computation and the current input, and so does y(t)).
In the case of gated RNNs (GRU or LSTM), the state equations get somewhat harder to follow, due to the gating mechanisms which essentially allow selection between the input and the memory, but the core concept remains the same.
Numeric Example
Let's follow your example; we have M = 4, N = 3, P = 4, so Wxh is of shape (7, 3) and Why of shape (3, 4). We of course do not know the values of either W matrix, so we cannot reproduce the same results; we can still follow the process though.
At timestep t<0, we have h(t) = [0, 0, 0].
At timestep t=0, we receive input x(0) = [1, 0, 0, 0]. Concatenating x(0) with the initial hidden state, we get [x(t); h(t-1)] = [1, 0, 0, ..., 0] (let's call this vector u to ease notation). We apply u * Wxh (i.e., multiplying a 7-dimensional vector with a 7 by 3 matrix) and get a vector v = [v_1, v_2, v_3], where v_i = Σ_j u_j W_ji = u_1 W_1i + u_2 W_2i + ... + u_7 W_7i. Finally, we apply tanh on v, obtaining h(0) = [tanh(v_1), tanh(v_2), tanh(v_3)] = [0.3, -0.1, 0.9]. From h(0) we can also get y(0) via the same process: multiply h(0) with Why (i.e., a 3-dimensional vector with a 3 by 4 matrix), get a vector s = [s_1, s_2, s_3, s_4], apply sigmoid on s, and get σ(s) = y(0).
At timestep t=1, we receive input x(1) = [0, 1, 0, 0]. We concatenate x(1) with h(0) to get a new u = [0, 1, 0, 0, 0.3, -0.1, 0.9]. u is again multiplied with Wxh, and tanh is again applied on the result, giving us h(1) = [1, 0.3, 1]. Similarly, h(1) is multiplied by Why, giving us a new s vector on which we apply the sigmoid to obtain σ(s) = y(1).
This process continues until the input sequence finishes, ending the computation.
Note: I have ignored bias terms in the above equations because they do not affect the core concept and they make notation impossible to follow
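To make the arithmetic tangible, here is a minimal NumPy sketch of the same forward pass with M = 4, N = 3, P = 4. The weights are random, so the numbers will not match the figure, but the shapes and operations do (bias terms omitted, as above; the one-hot input sequence is an arbitrary example):
import numpy as np
M, N, P = 4, 3, 4                          # input, hidden, output dimensions
rng = np.random.default_rng(0)
Wxh = rng.standard_normal((M + N, N))      # maps [x(t); h(t-1)] (7,) -> hidden (3,)
Why = rng.standard_normal((N, P))          # maps hidden (3,) -> output (4,)
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))
xs = [np.eye(M)[i] for i in [0, 1, 2, 2]]  # arbitrary one-hot inputs, one per timestep
h = np.zeros(N)                            # h(t) = [0, 0, 0] for t < 0
for t, x in enumerate(xs):
    u = np.concatenate([x, h])             # [x(t); h(t-1)], shape (7,)
    h = np.tanh(u @ Wxh)                   # h(t), shape (3,)
    y = sigmoid(h @ Why)                   # y(t), shape (4,)
    print(t, h.round(2), y.round(2))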
According to the original paper on Dropout, said regularisation method can be applied to convolutional layers, often improving their performance. The TensorFlow function tf.nn.dropout supports this with a noise_shape parameter that lets the user choose which parts of the tensor drop out independently. However, neither the paper nor the documentation gives a clear explanation of which dimensions should be kept independent, and the TensorFlow explanation of how noise_shape works is rather unclear:
only dimensions with noise_shape[i] == shape(x)[i] will make independent decisions.
I would assume that for a typical CNN layer output of the shape [batch_size, height, width, channels] we don't want individual rows or columns to drop out by themselves, but rather whole channels (which would be equivalent to a node in a fully connected NN) independently of the examples (i.e. different channels could be dropped for different examples in a batch). Am I correct in this assumption?
If so, how would one go about implementing dropout with such specificity using the noise_shape parameter? Would it be:
noise_shape=[batch_size, 1, 1, channels]
or:
noise_shape=[1, height, width, 1]
from here,
For example, if shape(x) = [k, l, m, n] and noise_shape = [k, 1, 1, n], each batch and channel component will be kept independently and each row and column will be kept or not kept together.
The code may help explain this.
noise_shape = noise_shape if noise_shape is not None else array_ops.shape(x)
# uniform [keep_prob, 1.0 + keep_prob)
random_tensor = keep_prob
random_tensor += random_ops.random_uniform(noise_shape,
                                           seed=seed,
                                           dtype=x.dtype)
# 0. if [keep_prob, 1.0) and 1. if [1.0, 1.0 + keep_prob)
binary_tensor = math_ops.floor(random_tensor)
ret = math_ops.div(x, keep_prob) * binary_tensor
ret.set_shape(x.get_shape())
return ret
The line random_tensor += random_ops.random_uniform(...) relies on broadcasting. When noise_shape[i] is set to 1, all elements along that dimension are offset by the same random value drawn from [0, 1). So when noise_shape = [k, 1, 1, n], each row and column in a feature map is kept or dropped together, while each example in the batch and each channel receives its own random value and is therefore kept or dropped independently.
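As a quick sanity check, here is a minimal sketch using the TF 2.x signature tf.nn.dropout(x, rate, noise_shape=...) (the source above uses the older keep_prob API); with noise_shape = [batch_size, 1, 1, channels], each channel of each example is kept or dropped as a whole:
import tensorflow as tf
x = tf.ones([2, 4, 4, 3])                                   # [batch, height, width, channels]
y = tf.nn.dropout(x, rate=0.5, noise_shape=[2, 1, 1, 3])    # independent per example and per channel
# Every (height, width) map is either all zeros or all 1/(1-rate) = 2.0,
# i.e. each channel of each example was kept or dropped as a unit.
per_map = tf.reshape(y, [2, -1, 3])
print(tf.reduce_all(tf.reduce_min(per_map, axis=1) == tf.reduce_max(per_map, axis=1)))   # tf.Tensor(True, ...)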