PyTorch's GRUCell and inputs of higher dimension - machine-learning

The documentation for PyTorch's GRUCell says that in torch.nn.GRUCell(input_size, hidden_size, bias=True), input_size is the number of expected features in the input. So it should be an int, and you can think of the input as having shape (batch_size, input_size). I am building a model whose input has the form (batch_size, seq_len, input_size), i.e. a batch of time series of length seq_len where at each time step we have an input_size-dimensional vector. Is it possible to implement something like this using GRUCell? (Naively changing input_size to a tuple raises a type error.)
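One common way to handle this is to keep input_size as an int and loop over the time axis, feeding GRUCell one (batch_size, input_size) slice per step. The following is a minimal sketch under that assumption; the sizes are illustrative and not taken from the question.

```python
import torch
import torch.nn as nn

# Illustrative sizes (assumptions, not from the question)
batch_size, seq_len, input_size, hidden_size = 4, 7, 10, 16

cell = nn.GRUCell(input_size, hidden_size)
x = torch.randn(batch_size, seq_len, input_size)
h = torch.zeros(batch_size, hidden_size)  # initial hidden state

outputs = []
for t in range(seq_len):
    # Each step consumes one (batch_size, input_size) slice of the sequence
    h = cell(x[:, t, :], h)
    outputs.append(h)

# Stack the per-step hidden states: (batch_size, seq_len, hidden_size)
outputs = torch.stack(outputs, dim=1)
```

If no custom logic is needed between time steps, nn.GRU with batch_first=True accepts the whole (batch_size, seq_len, input_size) tensor in one call.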

Related

sequence encoding by MultiHeadAttention

I am trying to encode a sequence of image embeddings into one bigger embedding using MultiheadAttention (order doesn't matter).
I want an operation similar to passing a sequence of shape (batch_size, sequence_length, embedding_dim) to an LSTM layer and taking the last hidden state as an embedding holding all the important information about the sequence, in other words a sequence embedding.
I want to implement that using attention to get rid of the recurrent behavior.
# embeddings: shape(batch_size, sequence_length, embedding_dim)
multihead_attn = nn.MultiheadAttention(embedding_dim, num_heads)
attn_output, attn_output_weights = multihead_attn(embeddings, embeddings, embeddings)
But in this case the attention output will also have a shape of (batch_size, sequence_length, embedding_dim).
Should I be doing
attn_output = attn_output.mean(1)
Or what if I pass embeddings.mean(1) as the query? Will that give the intended behavior? Will the output act as a sequence embedding?
# embeddings: shape(batch_size, sequence_length, embedding_dim)
multihead_attn = nn.MultiheadAttention(embedding_dim, num_heads)
seq_emb, attn_output_weights = multihead_attn(embeddings.mean(1), embeddings, embeddings)
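The second idea can be sketched as follows. Two details are assumptions on top of the snippets above: batch_first=True (available in PyTorch 1.9 and later), because nn.MultiheadAttention otherwise reads inputs as (sequence_length, batch_size, embedding_dim), and keepdim=True, so that the query keeps a sequence axis of length 1. The sizes are illustrative.

```python
import torch
import torch.nn as nn

# Illustrative sizes (assumptions, not from the question)
batch_size, sequence_length, embedding_dim, num_heads = 4, 6, 32, 4
embeddings = torch.randn(batch_size, sequence_length, embedding_dim)

# batch_first=True makes the module read inputs as (batch, seq, feature)
multihead_attn = nn.MultiheadAttention(embedding_dim, num_heads, batch_first=True)

# keepdim=True keeps a length-1 sequence axis: (batch_size, 1, embedding_dim)
query = embeddings.mean(dim=1, keepdim=True)

seq_emb, attn_weights = multihead_attn(query, embeddings, embeddings)
seq_emb = seq_emb.squeeze(1)  # (batch_size, embedding_dim)
```

Here the single query attends over every element of the sequence, so the output is a learned weighted combination of the value projections rather than a plain average. Whether that works better than attn_output.mean(1) is something to check empirically.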

Can a dense layer on many inputs be represented as a single matrix multiplication?

Denote a[2, 3] to be a matrix of dimension 2x3. Say there are 10 elements in each input and the network is a two-element classifier (cat or dog, for example). Say there is just one dense layer. For now I am ignoring the bias vector. I know this is an over-simplified neural net, but it is just for this example. Each output in a dense layer of a neural net can be calculated as
output = matmul(input, weights)
Where weights is a weight matrix 10x2, input is an input vector 1x10, and output is an output vector 1x2.
My question is this: Can an entire series of inputs be computed at the same time with a single matrix multiplication? It seems like you could compute
output = matmul(input, weights)
Where there are 100 inputs total, and input is 100x10, weights is 10x2, and output is 100x2.
In back propagation, you could do something similar:
input_err = matmul(output_err, transpose(weights))
weights_err = matmul(transpose(input), output_err)
weights -= learning_rate*weights_err
Where weights is the same, output_err is 100x2, and input is 100x10.
However, I tried to implement a neural network in this way from scratch and I am currently unsuccessful. I am wondering if I have some other error or if my approach is fundamentally wrong.
Thanks!
If anyone else is wondering, I found the answer to my question. This does not behave like training on one input at a time. Computing all inputs in this way is equivalent to running the network with a batch size equal to the number of inputs: the weights are not updated between inputs, but once for the whole batch. So while the calculation itself is valid, each input no longer influences the training step by step. With a reasonable batch size, however, you can still use 2D matrix multiplications, where the input is batch_size by input_size, to speed up training.
In addition, if predicting on many inputs (in the test stage, for example), since no weights are updated, an entire matrix multiplication of num_inputs by input_size can be run to compute all inputs in parallel.
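As a quick check of the shapes discussed above, here is a minimal NumPy sketch with random data standing in for real inputs and errors. It compares the single batched multiplication with a row-by-row computation, and shows that the batched weights_err is the sum of the per-input gradients.

```python
import numpy as np

rng = np.random.default_rng(0)
inputs = rng.normal(size=(100, 10))   # 100 inputs, 10 elements each
weights = rng.normal(size=(10, 2))    # dense layer without bias

# Forward pass: one matmul for the whole batch vs. one matmul per input
batched = inputs @ weights                              # (100, 2)
row_by_row = np.stack([x @ weights for x in inputs])    # (100, 2)

# Backward pass: output_err is a random stand-in for the real error signal
output_err = rng.normal(size=(100, 2))
weights_err = inputs.T @ output_err                     # (10, 2)
summed = sum(np.outer(x, e) for x, e in zip(inputs, output_err))
```

Because weights_err is a sum over all 100 inputs, its magnitude grows with the batch size. Dividing it by the batch size (or lowering learning_rate accordingly) is a common way to keep the update scale comparable to per-input training.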

What is Sequence length in LSTM?

In TensorFlow, the input data for an LSTM has dimensions [Batch Size, Sequence Length, Input Dimension].
What do Sequence Length and Input Dimension mean?
How do we assign values to them if my input data has the form:
[[[1.23] [2.24] [5.68] [9.54] [6.90] [7.74] [3.26]]] ?
LSTMs are a subclass of recurrent neural networks. Recurrent neural nets are by definition applied to sequential data, which without loss of generality means data samples that change along a time axis. A full history of a data sample is then described by the sample values over a finite time window, i.e. if your data live in an N-dimensional space and evolve over t time steps, your input representation must be of shape (num_samples, t, N).
Your data does not fit the above description. I assume, however, that this representation means you have a scalar value x which evolves over 7 time instances, such that x[0] = 1.23, x[1] = 2.24, etc.
If that is the case, you need to reshape your input so that, instead of a list of 7 elements, you have an array of shape (7, 1). Then your full data can be described by a third-order tensor of shape (num_samples, 7, 1), which an LSTM can accept.
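The reshape can be sketched with NumPy, using the seven values from the question and assuming they form a single sample (num_samples = 1):

```python
import numpy as np

values = [1.23, 2.24, 5.68, 9.54, 6.90, 7.74, 3.26]

# 7 time steps, 1 feature per step
series = np.array(values).reshape(7, 1)

# Add the leading sample axis: (num_samples, seq_len, input_dim) = (1, 7, 1)
batch = series[np.newaxis, ...]
```

Here Sequence Length is 7 and Input Dimension is 1.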
Simply put, seq_len is the number of time steps that will be fed into the LSTM network. Let's understand this with an example.
Suppose you are doing sentiment classification using an LSTM.
Your input sentence to the network is ["I hate to eat apples"]. One token is fed as input at each time step, so here seq_len is the total number of tokens in the sentence, which is 5.
Coming to input_dim: we can't feed words directly to the network, so you need to encode those words as numbers. In PyTorch/TensorFlow this is done with embedding layers, for which we have to specify an embedding dimension.
Suppose your embedding dimension is 50. That means the embedding layer takes the index of each token and converts it into a vector representation of size 50, so the input dimension of the LSTM network becomes 50.
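The example above can be sketched in PyTorch as follows. The toy vocabulary and the hidden size of 32 are assumptions made only for illustration; a real model would use a proper tokenizer and vocabulary.

```python
import torch
import torch.nn as nn

tokens = "I hate to eat apples".split()   # 5 tokens -> seq_len = 5

# Hypothetical toy vocabulary: token -> integer index
vocab = {tok: idx for idx, tok in enumerate(tokens)}
ids = torch.tensor([[vocab[tok] for tok in tokens]])   # (1, 5)

embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=50)
lstm = nn.LSTM(input_size=50, hidden_size=32, batch_first=True)

embedded = embedding(ids)            # (batch_size, seq_len, input_dim) = (1, 5, 50)
output, (h_n, c_n) = lstm(embedded)  # output: (1, 5, 32)
```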

Weird accuracy in multilabel classification keras

I have a multilabel classification problem. I used the following code, but the validation accuracy jumps to 99% in the first epoch, which is weird given the complexity of the data: the inputs are 2048 features extracted from the Inception model's (pool3:0) layer and the labels are vectors of length 1000 (here is a link to a file containing samples of the features and labels: https://drive.google.com/file/d/0BxI_8PO3YBPPYkp6dHlGeExpS1k/view?usp=sharing ).
Is there something I am doing wrong here?
Note: the labels are sparse vectors containing only 1 to 10 entries set to 1; the rest are zeros.
model.compile(optimizer='adadelta', loss='binary_crossentropy', metrics=['accuracy'])
The predictions are all zeros!
What am I doing wrong in training the model that breaks the predictions?
# input is the features file and labels file
def generate_arrays_from_file(path, batch_size=100):
    x = np.empty([batch_size, 2048])
    y = np.empty([batch_size, 1000])
    while True:
        f = open(path)
        i = 1
        for line in f:
            # create Numpy arrays of input data
            # and labels, from each line in the file
            words = line.split(',')
            words = map(float, words[1:])
            x_ = np.array(words[0:2048])
            y_ = words[2048:]
            y_ = np.array(map(int, y_))
            x_ = x_.reshape((1, -1))
            #print np.squeeze(x_)
            y_ = y_.reshape((1, -1))
            x[i] = x_
            y[i] = y_
            i += 1
            if i == batch_size:
                i = 1
                yield (x, y)
        f.close()

model = Sequential()
model.add(Dense(units=2048, activation='sigmoid', input_dim=2048))
model.add(Dense(units=1000, activation="sigmoid",
                kernel_initializer="uniform"))
model.compile(optimizer='adadelta', loss='binary_crossentropy',
              metrics=['accuracy'])
model.fit_generator(generate_arrays_from_file('train.txt'),
                    validation_data=generate_arrays_from_file('test.txt'),
                    validation_steps=1000, epochs=100, steps_per_epoch=1000,
                    verbose=1)
I think the problem with the accuracy is that your outputs are sparse.
Keras computes accuracy using this formula:
K.mean(K.equal(y_true, K.round(y_pred)), axis=-1)
So, in your case, with only 1 to 10 nonzero labels out of 1000, a prediction of all zeros will yield an accuracy between 99% and 99.9%.
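The arithmetic can be checked with a small NumPy sketch that mirrors the formula above (np.round and np.mean standing in for K.round and K.mean). The label vectors are synthetic, built only to match the 1 to 10 positives described in the question.

```python
import numpy as np

num_labels = 1000

def binary_accuracy(y_true, y_pred):
    # NumPy stand-in for K.mean(K.equal(y_true, K.round(y_pred)), axis=-1)
    return np.mean(y_true == np.round(y_pred), axis=-1)

y_pred = np.zeros(num_labels)      # the network predicts all zeros

y_true_10 = np.zeros(num_labels)   # synthetic sample with 10 positive labels
y_true_10[:10] = 1

y_true_1 = np.zeros(num_labels)    # synthetic sample with 1 positive label
y_true_1[0] = 1

acc_10 = binary_accuracy(y_true_10, y_pred)   # 990 of 1000 entries match
acc_1 = binary_accuracy(y_true_1, y_pred)     # 999 of 1000 entries match
```

Since the metric counts every correctly predicted zero, it says very little about whether the few positive labels are found. Per-label precision and recall are more informative here.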
As for the network not learning, I think the problem is that you are using a sigmoid as the last activation with 0 or 1 as target values. This is bad practice because, for the sigmoid to return 0 or 1, its input must be very large or very small, which leads to the net having very large (in absolute value) weights. Furthermore, since each training target contains far fewer 1s than 0s, the network will soon reach a stationary point at which it simply outputs all zeros (the loss in this case is not very large either; it should be around 0.016~0.16).
What you can do is scale your output labels so that they lie in a range such as (0.2, 0.8), so that the weights of the net won't become too big or too small. Alternatively, you can use a relu as the activation function.
Did you try to use the cosine similarity as loss function?
I had the same multi-label + high dimensionality problem.
The cosine distance takes into account the orientation of the model output (prediction) vector relative to the desired output (true class) vector.
It is based on the normalized dot product of the two vectors.
In Keras, the cosine_proximity function is -1 * cosine similarity, meaning that -1 corresponds to two vectors with the same orientation (their magnitudes do not matter once they are normalized).
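A minimal NumPy sketch of this quantity, assuming nonzero vectors (the Keras implementation normalizes with a small epsilon, which is omitted here):

```python
import numpy as np

def cosine_proximity(y_true, y_pred):
    # Negative cosine similarity: normalize both vectors, then take the dot product
    y_true = y_true / np.linalg.norm(y_true)
    y_pred = y_pred / np.linalg.norm(y_pred)
    return -np.sum(y_true * y_pred)

same_direction = cosine_proximity(np.array([1.0, 0.0, 1.0]), np.array([2.0, 0.0, 2.0]))
orthogonal = cosine_proximity(np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0]))
opposite = cosine_proximity(np.array([1.0, 1.0]), np.array([-1.0, -1.0]))
```

The three values are approximately -1, 0 and 1 respectively, so minimizing this loss pushes the prediction to point in the same direction as the label vector.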

Label Propagation in sklearn is classifying every vector as 1

I have 2000 labelled data (7 different labels) and about 100K unlabeled data and I am trying to use sklearn.semi_supervised.LabelPropagation. The data has 1024 dimensions. My problem is that the classifier is labeling everything as 1. My code looks like this:
X_unlabeled = X_unlabeled[:10000, :]
X_both = np.vstack((X_train, X_unlabeled))
y_both = np.append(y_train, -np.ones((X_unlabeled.shape[0],)))
clf = LabelPropagation(max_iter=100).fit(X_both, y_both)
y_pred = clf.predict(X_test)
y_pred is all ones. Also, X_train is 2000x1024 and X_unlabeled is a 10000x1024 subset of the unlabeled data.
I also get this warning when calling fit on the classifier:
/usr/local/lib/python2.7/site-packages/sklearn/semi_supervised/label_propagation.py:255: RuntimeWarning: invalid value encountered in divide
self.label_distributions_ /= normalizer
Have you tried different values for the gamma parameter? The graph is constructed by computing an rbf kernel, so the computation includes an exponential, and Python's exponential functions return 0 when the argument is a negative number of very large magnitude (see http://computer-programming-forum.com/56-python/ef71e144330ffbc2.htm). If the graph is filled with 0, label_distributions_ is filled with "nan" (because of the normalization) and a warning appears. (Be careful: in the scikit-learn implementation, gamma multiplies the squared Euclidean distance, which is not the same parameterization as in the Zhu paper.)
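The underflow can be reproduced in a few lines of NumPy. This is only a sketch of the mechanism, not scikit-learn's actual code path: gamma = 20 is taken to be the LabelPropagation default, and the squared distance of 50 is a hypothetical value for high-dimensional data.

```python
import numpy as np

gamma = 20.0     # assumed default gamma of LabelPropagation's rbf kernel
sq_dist = 50.0   # hypothetical squared Euclidean distance between two points

# rbf weight exp(-gamma * ||x - y||^2): exp(-1000) underflows to exactly 0.0
weight = np.exp(-gamma * sq_dist)

# If every weight is 0, normalizing divides 0 by 0 and produces nan
label_distributions = np.array([weight, weight])
normalizer = label_distributions.sum()
with np.errstate(invalid="ignore"):   # silence the RuntimeWarning for the demo
    normalized = label_distributions / normalizer

# A much smaller gamma keeps the weight away from zero
weight_small_gamma = np.exp(-0.01 * sq_dist)
```

Scaling the features or lowering gamma until typical kernel values are clearly above zero is one way to avoid the nan distributions.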
The LabelPropagation will finally be fixed in version 0.19
