Sampled softmax loss over variable sequence batches? - machine-learning

Background info: I'm working on sequence-to-sequence models, and right now my model accepts variable-length input tensors (not lists) with input shapes corresponding to [batch size, sequence length]. However, in my implementation, sequence length is unspecified (set to None) to allow for variable-length inputs. Specifically, input sequence batches are padded only to the length of the longest sequence in that batch. This has sped up my training time considerably, so I'd prefer to keep it this way, as opposed to going back to bucketed models and/or padding all sequences in the training data to the same length. I'm using TensorFlow 1.0.0.
Problem: I'm currently using the following to compute the loss (which runs just fine).
loss = tf.losses.sparse_softmax_cross_entropy(
    labels=target_labels,            # shape: [batch size, None]
    logits=outputs[:, :-1, :],       # shape: [batch size, None, vocab size]
    weights=target_weights[:, :-1])  # shape: [batch size, None]
where vocab size is typically about 40,000. I'd like to use a sampled softmax, but I've run into an issue due to the unspecified nature of the input shape. According to the documentation for tf.nn.sampled_softmax_loss, it requires the inputs to be fed separately for each timestep. However, I can't call, for example,
tf.unstack(target_labels, axis=1)
since the size of that axis is unknown beforehand. Does anyone know how I might go about implementing this? One would assume that, since both dynamic_rnn and tf.losses.sparse_softmax_cross_entropy seem to have no issue handling this, a workaround could be implemented for the sampled softmax loss somehow. After digging around in the source code and even the models repository, I've come up empty-handed. Any help/suggestions would be greatly appreciated.
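One potential workaround (an untested sketch, not something confirmed by this thread) is to fold the dynamic time dimension into the batch dimension, since tf.nn.sampled_softmax_loss only needs 2-D inputs and labels. The names rnn_outputs, output_proj_w, output_proj_b, hidden_size, and num_sampled below are assumptions for illustration: rnn_outputs stands for the decoder hidden states before the output projection, and output_proj_w/output_proj_b for the projection weights of shape [vocab_size, hidden_size] and [vocab_size].
# Assumed shapes: rnn_outputs is [batch size, None, hidden_size],
# target_labels and target_weights are [batch size, None].
flat_inputs = tf.reshape(rnn_outputs[:, :-1, :], [-1, hidden_size])   # [batch*time, hidden_size]
flat_labels = tf.reshape(target_labels, [-1, 1])                      # [batch*time, 1]
flat_weights = tf.reshape(target_weights[:, :-1], [-1])               # [batch*time]

sampled_losses = tf.nn.sampled_softmax_loss(
    weights=output_proj_w,
    biases=output_proj_b,
    labels=flat_labels,
    inputs=flat_inputs,
    num_sampled=512,        # illustrative value
    num_classes=vocab_size)

# Mask out padded positions and average, mirroring the weighted dense loss above.
loss = tf.reduce_sum(sampled_losses * flat_weights) / tf.maximum(tf.reduce_sum(flat_weights), 1.0)
Because the reshape only flattens the leading dimensions, the sequence length never needs to be known statically.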

Related

What is the purpose of having the same input and output in PyTorch nn.Linear function?

I think this is a comprehension issue, but I would appreciate any help.
I'm trying to learn how to use PyTorch for autoencoding. In the nn.Linear function, there are two specified parameters,
nn.Linear(input_size, hidden_size)
When reshaping a tensor to its minimum meaningful representation, as one would in autoencoding, it makes sense that the hidden_size would be smaller. However, in the PyTorch tutorial there is a line specifying identical input_size and hidden_size:
class NeuralNetwork(nn.Module):
    def __init__(self):
        super(NeuralNetwork, self).__init__()
        self.flatten = nn.Flatten()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(28*28, 512),
            nn.ReLU(),
            nn.Linear(512, 512),
            nn.ReLU(),
            nn.Linear(512, 10),
        )
I guess my question is, what is the purpose of having the same input and hidden size? Wouldn't this just return an identical tensor?
I suspect that this is just a requirement after calling the nn.ReLU() activation function.
As well stated by Wikipedia:
An autoencoder is a type of artificial neural network used to learn efficient codings of unlabeled data. The encoding is validated and refined by attempting to regenerate the input from the encoding.
In other words, the idea of the autoencoder is to learn an identity. This identity function will be learned only for particular inputs (i.e. without anomalies). From this, the following points derive:
1. The input will have the same dimensions as the output.
2. Autoencoders are (generally) built to learn the essential features of the input.
Because of point (1), the autoencoder consists of a series of layers (e.g. a series of nn.Linear() or nn.Conv2d()) whose final output matches the input dimensions.
Because of point (2), you generally have an Encoder which compresses the information (as in your code snippet, you go from 28x28 down to 10) and a Decoder that decompresses it (10 -> 28x28). Across most implementations of this architecture, the latent space dimensionality (10) is much smaller than the input (28x28). With the end goal of the Encoder clear, you can also see that intermediate layers may keep the representation larger than the final code (nn.Linear(28*28, 512) still outputs 512 values) before the series of layers produces the final compressed output (10).
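To make the encoder/decoder split concrete, here is a minimal sketch (my own illustration, not from the tutorial) that mirrors the dimensions in the snippet above, with a decoder added so the output matches the input size:
import torch.nn as nn

class TinyAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        # Encoder: 28*28 -> 512 -> 10 (progressive compression to the latent code)
        self.encoder = nn.Sequential(
            nn.Flatten(),
            nn.Linear(28*28, 512),
            nn.ReLU(),
            nn.Linear(512, 10),
        )
        # Decoder: 10 -> 512 -> 28*28 (decompression back to the input size)
        self.decoder = nn.Sequential(
            nn.Linear(10, 512),
            nn.ReLU(),
            nn.Linear(512, 28*28),
        )

    def forward(self, x):
        z = self.encoder(x)      # latent code, much smaller than the input
        return self.decoder(z)   # reconstruction with the same size as the flattened input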
Note that because the model in your question includes a nonlinearity after the linear layer, the model will not learn an identity transform between the input and output. In the specific case of the relu nonlinearity, the model could learn an identity transform if all of the input values were positive, but in general this won't be the case.
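A quick check of the ReLU caveat (my own example, not from the answer): ReLU passes positive values through unchanged but zeroes out negatives, which is why an identity can only be recovered when all inputs are positive.
import torch

x = torch.tensor([1.0, 2.0, -3.0])
print(torch.relu(x))   # tensor([1., 2., 0.]) -- negatives are clipped, so they cannot pass through unchanged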
I find it a little easier to imagine the issue if we had an even smaller model consisting of Linear --> Sigmoid --> Linear. In such a case, the input will be mapped through the first matrix transform and then "squashed" into the space [0, 1] as the "hidden" layer representation. The next ("output") layer would need to take this squashed view of the input and come up with some way of "unsquashing" it back into the original. But with an affine output layer, it's not possible to do this, so the model will have to learn some other, non-identity, transforms for the two matrices.
There are some neat visualizations of this concept on Chris Olah's blog that are well worth a look.

Can a dense layer on many inputs be represented as a single matrix multiplication?

Denote a[2, 3] to be a matrix of dimension 2x3. Say there are 10 elements in each input and the network is a two-element classifier (cat or dog, for example). Say there is just one dense layer. For now I am ignoring the bias vector. I know this is an over-simplified neural net, but it is just for this example. Each output in a dense layer of a neural net can be calculated as
output = matmul(input, weights)
Where weights is a weight matrix 10x2, input is an input vector 1x10, and output is an output vector 1x2.
My question is this: Can an entire series of inputs be computed at the same time with a single matrix multiplication? It seems like you could compute
output = matmul(input, weights)
Where there are 100 inputs total, and input is 100x10, weights is 10x2, and output is 100x2.
In back propagation, you could do something similar:
input_err = matmul(output_err, transpose(weights))
weights_err = matmul(transpose(input), output_err)
weights -= learning_rate*weights_err
Where weights is the same, output_err is 100x2, and input is 100x10.
However, I tried to implement a neural network in this way from scratch and I am currently unsuccessful. I am wondering if I have some other error or if my approach is fundamentally wrong.
Thanks!
If anyone else is wondering, I found the answer to my question. This does not in fact do what I intended, for a simple reason: computing all inputs this way is equivalent to running the network with a batch size equal to the total number of inputs. The weights are not updated between inputs, but all at once after the whole pass. So while the matrix algebra is valid, each input no longer influences the training step by step. However, with a reasonable batch size you can still use 2-D matrix multiplications, where the input is batch_size by input_size, to speed up training.
In addition, if predicting on many inputs (in the test stage, for example), since no weights are updated, an entire matrix multiplication of num_inputs by input_size can be run to compute all inputs in parallel.
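For illustration, here is a minimal NumPy sketch (my own, with an assumed learning rate and squared-error loss) of the batched forward and backward passes described above, using the same shapes (100 inputs of size 10, 2 outputs):
import numpy as np

rng = np.random.default_rng(0)
inputs = rng.normal(size=(100, 10))    # 100 examples, 10 features each
targets = rng.normal(size=(100, 2))    # 2 outputs per example
weights = rng.normal(size=(10, 2))     # single dense layer, no bias
learning_rate = 0.01

# Forward pass for the whole batch in one matmul: (100x10) @ (10x2) -> 100x2
outputs = inputs @ weights

# Backward pass for a mean squared error loss over the batch
output_err = (outputs - targets) / len(inputs)   # 100x2
input_err = output_err @ weights.T               # 100x10 (gradient w.r.t. the inputs)
weights_err = inputs.T @ output_err              # 10x2  (gradient w.r.t. the weights)

weights -= learning_rate * weights_err           # one update for the whole batch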

Autoencoders: Find the important neurons

I have implemented an autoencoder using Keras that takes 112*112*3 values as input and has 100 neurons as the compressed/encoded state. I want to find which of these 100 neurons learn the important features. So far I have calculated eigenvalues (e) and eigenvectors (v) using the following steps, and I found that roughly the first 30 values of e are greater than 0. Does that mean the first 30 modes are the important ones? Is there any other method that could find the important neurons?
Thanks in advance.
x_enc = enc_model.predict(x_train, batch_size=BATCH_SIZE) # shape (3156,100)
x_mean = np.mean(x_enc, axis=0) # shape (100,)
x_stds = np.std(x_enc, axis=0) # shape (100,)
x_cov = np.cov((x_enc - x_mean).T) # shape (100,100)
e, v = np.linalg.eig(x_cov) # shape (100,) and (100,100) respectively
I don't know if the approach you are using will actually give you any useful results, since how the network learns and what exactly it learns aren't known. I suggest you use a different kind of autoencoder, one that automatically learns disentangled representations of the data in a latent space. That way you can be sure that all the parameters you find are actually contributing to the representation of your data. Check this article.
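As a side note on the eigenvalue approach in the question: a common way to quantify the "importance" of the principal directions (note these are directions in the latent space, not individual neurons) is the explained-variance ratio. A small sketch continuing from the e computed above:
import numpy as np

# e comes from np.linalg.eig(x_cov) as in the question; take real parts and sort descending
e_real = np.real(e)
e_sorted = e_real[np.argsort(e_real)[::-1]]

explained = e_sorted / e_sorted.sum()   # fraction of variance per component
cumulative = np.cumsum(explained)
print(cumulative[:30])                  # how much variance the first ~30 modes capture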

What is Sequence length in LSTM?

The dimensions of the input data for an LSTM are [Batch Size, Sequence Length, Input Dimension] in TensorFlow.
What is the meaning of Sequence Length & Input Dimension?
How do we assign values to them if my input data is of the form:
[[[1.23] [2.24] [5.68] [9.54] [6.90] [7.74] [3.26]]] ?
LSTMs are a subclass of recurrent neural networks. Recurrent neural nets are by definition applied to sequential data, which without loss of generality means data samples that change over a time axis. A full history of a data sample is then described by the sample values over a finite time window, i.e. if your data live in an N-dimensional space and evolve over t time steps, your input representation must be of shape (num_samples, t, N).
Your data does not fit the above description. I assume, however, that this representation means you have a scalar value x which evolves over 7 time instances, such that x[0] = 1.23, x[1] = 2.24, etc.
If that is the case, you need to reshape your input such that instead of a list of 7 elements, you have an array of shape (7, 1). Then your full data can be described by a 3rd-order tensor of shape (num_samples, 7, 1), which can be accepted by an LSTM.
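A minimal sketch of that reshape (my own example, assuming a single sample and NumPy arrays):
import numpy as np

x = np.array([1.23, 2.24, 5.68, 9.54, 6.90, 7.74, 3.26])   # 7 scalar time steps
x = x.reshape(1, 7, 1)    # (num_samples, sequence length, input dimension)
print(x.shape)            # (1, 7, 1) -- ready to be fed to an LSTM layer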
Simply put, seq_len is the number of time steps that will be fed into the LSTM network. Let's understand this with an example.
Suppose you are doing sentiment classification using an LSTM.
Your input sentence to the network is ["I hate to eat apples"]. One token is fed as input at each timestep, so here seq_len would be the total number of tokens in the sentence, which is 5.
Coming to input_dim: we can't feed words to the network directly, so we need to encode them as numbers. In PyTorch/TensorFlow this is done with embedding layers, for which we have to specify an embedding dimension.
Suppose your embedding dimension is 50. The embedding layer will take the index of each token and convert it into a vector representation of size 50, so the input dimension to the LSTM network becomes 50.
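To make this concrete, a minimal PyTorch sketch (the vocabulary size, batch size, and hidden size below are made-up values for illustration):
import torch
import torch.nn as nn

vocab_size, embedding_dim, hidden_size = 10_000, 50, 128
embedding = nn.Embedding(vocab_size, embedding_dim)
lstm = nn.LSTM(input_size=embedding_dim, hidden_size=hidden_size, batch_first=True)

token_ids = torch.randint(0, vocab_size, (1, 5))   # batch of 1 sentence, seq_len = 5
embedded = embedding(token_ids)                    # shape (1, 5, 50): [batch, seq_len, input_dim]
outputs, (h_n, c_n) = lstm(embedded)               # outputs: (1, 5, 128)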

Torch, how to get a tensor of loss values during batch optimization

I am training a network with batch optimization over my training set, and I would like to get a loss vector containing the loss of each of my training examples.
More specifically I am using images (of size 3x64x64) in a batch of size 64. Therefore my input is a tensor of size 64x3x64x64.
During training when I write
output = net:forward(input)
loss = criterion:forward(output, target)
loss is a number, but I would like to get a tensor (of size 64) with one entry per image in my batch, corresponding to the loss value of this precise image.
Is there a way to do that without looping on the first dimension of my input tensor?
The forward method calls another method, updateOutput, which can be overridden.
For example, in the case of MSECriterion(), you can change the method by commenting out the call to the THNN library and writing your own version of how you want the criterion to behave, i.e. do a normal element-wise subtraction, square it (again element-wise), and divide by the number of data points per example; then return the output as a tensor.
You will also need to recompile the nn package once you have changed this, by navigating to the nn folder and running luarocks make rocks/[the scm file in the folder].
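To illustrate the per-example computation described above (shown here as a standalone NumPy sketch rather than an actual patch to MSECriterion), each image's loss is the squared error averaged over everything except the batch dimension:
import numpy as np

output = np.random.rand(64, 3, 64, 64)   # network predictions for a batch of 64 images
target = np.random.rand(64, 3, 64, 64)   # corresponding targets

# Element-wise subtraction, element-wise square, then average over each image's
# 3*64*64 values -- leaving one loss value per image.
per_image_loss = ((output - target) ** 2).reshape(64, -1).mean(axis=1)
print(per_image_loss.shape)              # (64,)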
