Time Distributed LSTM

I want to use a time-distributed LSTM for binary classification of a series of patient notes. Please note I am aware that I could concatenate these notes into one, but I don't want sequences that long.
Each patient has several notes (timesteps, e.g. 30), each with an arbitrary max sequence length (e.g. 200). I want to embed these notes with an embedding layer in some number of dimensions (e.g. 300) and then use the resulting data to train a time-distributed LSTM classifier. I think this is a many-to-one sequence problem.
So, my data shape is:
patient number, timesteps, max sequence length, embedding dimensions
e.g. (167, 30, 200, 300)
embedding_layer = Embedding(
sequence_input = Input(
embedding_sequences = embedding_layer(sequence_input)
x = TimeDistributed(LSTM(30))(x)
outputs = Dense(1, activation="sigmoid")(x)
model = Model(sequence_input, outputs)
metrics=["accuracy", tf.keras.metrics.Precision()],
From the error message I think I am doing something wrong conceptually, as it’s expecting a different output shape:
ValueError: `logits` and `labels` must have the same shape, received ((None, 30, 1) vs (None,)).


Pytorch VNet final softmax activation layer for segmentation. Different channel dimensions to labels. How do I get prediction output?

I am trying to build a V-Net. When I pass the images to segment during training, the output has 2 channels after the softmax activation (as specified in the architecture in the attached image), but the labels and inputs have 1. How do I convert this so that the output is the segmented image? Do I just take one of the channels as the final output when training (e.g. output = output[:, 0, :, :, :]), with the other channel being background?
outputs = network(inputs)
batch_size = 32
outputs.shape: [32, 2, 64, 128, 128]
inputs.shape: [32, 1, 64, 128, 128]
labels.shape: [32, 1, 64, 128, 128]
Here is my Vnet forward pass:
def forward(self, x):
    # Initial input transition
    out = self.in_tr(x)
    # Downward transitions
    out, residual_0 = self.down_depth0(out)
    out, residual_1 = self.down_depth1(out)
    out, residual_2 = self.down_depth2(out)
    out, residual_3 = self.down_depth3(out)
    # Bottom layer
    out = self.up_depth4(out)
    # Upward transitions
    out = self.up_depth3(out, residual_3)
    out = self.up_depth2(out, residual_2)
    out = self.up_depth1(out, residual_1)
    out = self.up_depth0(out, residual_0)
    # Pass to convert to 2 channels
    out = self.final_conv(out)
    # Softmax over the channel dimension (the implicit dim argument is deprecated)
    out = F.softmax(out, dim=1)
    return out  # shape: [batch_size, 2, 64, 128, 128]
V Net architecture as described in (https://arxiv.org/pdf/1606.04797.pdf)
That paper has two outputs as they predict two classes:
The network predictions, which consist of two volumes having the same resolution as the original input data, are processed through a soft-max layer which outputs the probability of each voxel to belong to foreground and to background.
Therefore this is not an autoencoder, where your inputs are passed back through the model as outputs. They use a set of labels which distinguish between their pixels of interest (foreground) and other (background). You will need to change your data if you wish to use the V-net in this manner.
It won't be as simple as designating a channel as output because this will be a classification task rather than a regression task. You will need annotated labels to work with this model architecture.
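As a rough illustration of the classification view described above, here is a minimal sketch of turning a two-channel output into a one-channel mask. It assumes that channel 1 is foreground and that the labels hold integer class ids (0 = background, 1 = foreground); neither is stated in the question. Random tensors with reduced sizes stand in for the network output, and cross-entropy is used only because it is built in (the paper itself trains with a Dice-based loss).

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Stand-in for the raw (pre-softmax) output of final_conv: [batch, 2, D, H, W]
logits = torch.randn(2, 2, 4, 8, 8)
# Stand-in labels: one channel holding class ids 0 (background) or 1 (foreground)
labels = torch.randint(0, 2, (2, 1, 4, 8, 8))

# Per-voxel class probabilities over the channel dimension
probs = F.softmax(logits, dim=1)

# Prediction: most probable class per voxel, keeping a channel axis of size 1
pred_mask = probs.argmax(dim=1, keepdim=True)

# Training loss: cross_entropy takes raw logits [N, C, ...] and class ids [N, ...]
loss = F.cross_entropy(logits, labels.squeeze(1))

print(pred_mask.shape, loss.item())
```

The predicted mask then has the same shape as the labels, so it can be compared with them voxel by voxel.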

LSTM sequence prediction overfits on one specific value only

Hello, I am new to machine learning. I am implementing federated learning with an LSTM to predict the next label in a sequence. My sequences look like this: [2,3,5,1,4,2,5,7]. For example, the intention is to predict the 7 in this sequence. So I tried simple federated learning with Keras. I used this approach for another model (not an LSTM) and it worked for me, but here it always overfits on 2: it always predicts 2 for any input. I made the input data balanced, meaning there is an almost equal number of samples for each label at the last index (here, 7). I tested this data with a simple deep learning model and it works well. So it seems to me that this data may not be suitable for an LSTM, or there is some other issue. Please help me. This is my code for federated learning. Please let me know if more information is needed. Thanks.
def get_lstm(units):
    """LSTM (Long Short-Term Memory)
    Build LSTM Model.
    # Arguments
        units: List(int), number of input, output and hidden units.
    # Returns
        model: Model, nn model.
    """
    inp = layers.Input((units[0], 1))
    x = layers.LSTM(units[1], return_sequences=True)(inp)
    x = layers.LSTM(units[2])(x)
    x = layers.Dropout(0.2)(x)
    out = layers.Dense(units[3], activation='softmax')(x)
    model = Model(inp, out)
    return model
optimizer = keras.optimizers.Adam(learning_rate=0.01)
seqLen=8 -1;
global_model = Mymodel.get_lstm([seqLen, 64, 64, 15]) # 14 categories we have , array start from 0 but never can predict zero class
global_model.compile(loss="sparse_categorical_crossentropy", optimizer=optimizer, metrics=tf.keras.metrics.SparseTopKCategoricalAccuracy(k=1))
def main(argv):
    for comm_round in range(comms_round):
        print("round_%d" %( comm_round))
        scaled_local_weight_list = list()
        global_weights = global_model.get_weights()
        temp_data = train[:]
        # data divided among ten users and shuffled
        for user in range(10):
            user_data = temp_data[user * userDataSize: (user+1)*userDataSize]
            X_train = user_data[:, 0:seqLen]
            X_train = np.asarray(X_train).astype(np.float32)
            Y_train = user_data[:, seqLen]
            Y_train = np.asarray(Y_train).astype(np.float32)
            local_model = Mymodel.get_lstm([seqLen, 64, 64, 15])
            X_train = np.reshape(X_train, (X_train.shape[0], X_train.shape[1], 1))
            local_model.compile(loss="sparse_categorical_crossentropy", optimizer=optimizer, metrics=tf.keras.metrics.SparseTopKCategoricalAccuracy(k=1))
            local_model.fit(X_train, Y_train)
            scaling_factor = 1 / 10 # 10 is number of users
            scaled_weights = scale_model_weights(local_model.get_weights(), scaling_factor)
        average_weights = sum_scaled_weights(scaled_local_weight_list)
    for i in range(len(X_test)):
        print('%d,%d' % ((np.argmax(predictions[i])), Y_test[i]),file=f2 )
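The weight-averaging step in the loop above can be sketched on its own. The helpers scale_model_weights and sum_scaled_weights are not shown in the question, so the versions below are hypothetical stand-ins written with NumPy. They operate on lists of arrays like those returned by get_weights(), and the toy "clients" use constant weights so the average is easy to check. Note that each client's scaled weights have to be appended to the list before it is summed.

```python
import numpy as np

def scale_model_weights(weights, scalar):
    """Multiply every weight array of one model by a scalar (hypothetical helper)."""
    return [scalar * w for w in weights]

def sum_scaled_weights(scaled_weight_list):
    """Sum the scaled weights of all clients, layer by layer (hypothetical helper)."""
    return [np.sum(layer_group, axis=0) for layer_group in zip(*scaled_weight_list)]

# Toy "models": each client has two weight arrays, like get_weights() would return
client_weights = [
    [np.full((2, 2), float(c)), np.full((2,), float(c))]
    for c in range(1, 4)  # three clients with constant weights 1, 2 and 3
]

scaling_factor = 1 / len(client_weights)
scaled_local_weight_list = []
for weights in client_weights:
    scaled_local_weight_list.append(scale_model_weights(weights, scaling_factor))

average_weights = sum_scaled_weights(scaled_local_weight_list)
print(average_weights[0])  # every entry is the mean of 1, 2 and 3
```

In a Keras loop, the averaged list would typically be handed to global_model.set_weights(average_weights) at the end of each round.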
I found some reasons for my problem, so I thought I would share them:
1- The proportions of the different items in the sequences are not balanced. For example, I have 1000 of "2" and 100 of other numbers, so after a few rounds the model fits on 2 because there is much more data for that specific number.
2- I changed my sequences so that no two items in a sequence have the same value. This let me remove some repetitive data from the sequences and make them more balanced. It may not be a complete representation of the activities, but in my case it makes sense.
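Both points can be checked with the standard library alone. A minimal sketch, reading point 2 as "collapse consecutive repeats" (an assumption; the answer does not say exactly how duplicates were removed) and using made-up toy sequences:

```python
from collections import Counter
from itertools import groupby

def drop_consecutive_repeats(seq):
    """Collapse runs of the same value, e.g. [2, 2, 3, 3, 2] -> [2, 3, 2]."""
    return [value for value, _ in groupby(seq)]

def item_frequencies(sequences):
    """Count how often each value occurs across all sequences."""
    return Counter(value for seq in sequences for value in seq)

# Toy data in which the value 2 dominates
sequences = [
    [2, 2, 2, 3, 5, 2, 2, 7],
    [2, 2, 1, 4, 2, 2, 2, 5],
]

before = item_frequencies(sequences)
cleaned = [drop_consecutive_repeats(seq) for seq in sequences]
after = item_frequencies(cleaned)

print(before[2], after[2])  # prints: 10 4
```

Counting the values inside the input sequences, not only the final labels, is what exposes the kind of imbalance described in point 1.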

Is hidden and output the same for a GRU unit in Pytorch?

I do understand conceptually what an LSTM or GRU should do (thanks to the question "What's the difference between "hidden" and "output" in PyTorch LSTM?"), BUT when I inspect the output of the GRU, h_n and output are NOT the same while they should be...
(Pdb) rnn_output
tensor([[[ 0.2663, 0.3429, -0.0415, ..., 0.1275, 0.0719, 0.1011],
[-0.1272, 0.3096, -0.0403, ..., 0.0589, -0.0556, -0.3039],
[ 0.1064, 0.2810, -0.1858, ..., 0.3308, 0.1150, -0.3348],
[-0.0929, 0.2826, -0.0554, ..., 0.0176, -0.1552, -0.0427],
[-0.0849, 0.3395, -0.0477, ..., 0.0172, -0.1429, 0.0153],
[-0.0212, 0.1257, -0.2670, ..., -0.0432, 0.2122, -0.1797]]],
(Pdb) hidden
tensor([[[ 0.1700, 0.2388, -0.4159, ..., -0.1949, 0.0692, -0.0630],
[ 0.1304, 0.0426, -0.2874, ..., 0.0882, 0.1394, -0.1899],
[-0.0071, 0.1512, -0.1558, ..., -0.1578, 0.1990, -0.2468],
[ 0.0856, 0.0962, -0.0985, ..., 0.0081, 0.0906, -0.1234],
[ 0.1773, 0.2808, -0.0300, ..., -0.0415, -0.0650, -0.0010],
[ 0.2207, 0.3573, -0.2493, ..., -0.2371, 0.1349, -0.2982]],
[[ 0.2663, 0.3429, -0.0415, ..., 0.1275, 0.0719, 0.1011],
[-0.1272, 0.3096, -0.0403, ..., 0.0589, -0.0556, -0.3039],
[ 0.1064, 0.2810, -0.1858, ..., 0.3308, 0.1150, -0.3348],
[-0.0929, 0.2826, -0.0554, ..., 0.0176, -0.1552, -0.0427],
[-0.0849, 0.3395, -0.0477, ..., 0.0172, -0.1429, 0.0153],
[-0.0212, 0.1257, -0.2670, ..., -0.0432, 0.2122, -0.1797]]],
they are some transpose of each other...why?
They are not really the same. Consider that we have the following Unidirectional GRU model:
import torch.nn as nn
import torch
gru = nn.GRU(input_size = 8, hidden_size = 50, num_layers = 3, batch_first = True)
Please make sure you observe the input shape carefully.
inp = torch.randn(1024, 112, 8)
out, hn = gru(inp)
torch.equal(out, hn)  # False: out has shape (1024, 112, 50), hn has shape (3, 1024, 50)
One of the most effective ways that helped me understand output vs. hidden states was to view hn as hn.view(num_layers, num_directions, batch, hidden_size), where num_directions = 2 for bidirectional recurrent networks (and 1 otherwise, i.e., our case). Thus,
hn_conceptual_view = hn.view(3, 1, 1024, 50)
As the doc states:
h_n of shape (num_layers * num_directions, batch, hidden_size): tensor containing the hidden state for t = seq_len (i.e., for the last timestep)
In our case, this contains the hidden vector for the last timestep, t = 112. The output, in contrast, is documented as follows (with batch_first = True, as here, its first two dimensions are swapped to (batch, seq_len, ...)):
output of shape (seq_len, batch, num_directions * hidden_size): tensor containing the output features h_t from the last layer of the GRU, for each t. If a torch.nn.utils.rnn.PackedSequence has been given as the input, the output will also be a packed sequence. For the unpacked case, the directions can be separated using output.view(seq_len, batch, num_directions, hidden_size), with forward and backward being direction 0 and 1 respectively.
So, consequently, one can do:
torch.equal(out[:, -1], hn_conceptual_view[-1, 0, :, :])
Explanation: I compare the last timestep of every sequence in the batch, out[:, -1], to the last layer's hidden vectors, hn_conceptual_view[-1, 0, :, :].
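The unidirectional case can be reproduced end to end with smaller sizes. This is only a sketch: the dimensions are arbitrary, and torch.allclose is used instead of torch.equal so that the check tolerates any floating-point rounding.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

num_layers, hidden_size, batch, seq_len, input_size = 3, 5, 4, 7, 8
gru = nn.GRU(input_size=input_size, hidden_size=hidden_size,
             num_layers=num_layers, batch_first=True)

inp = torch.randn(batch, seq_len, input_size)
with torch.no_grad():
    out, hn = gru(inp)

print(out.shape)  # (batch, seq_len, hidden_size): every timestep, last layer only
print(hn.shape)   # (num_layers, batch, hidden_size): last timestep, every layer

# The last timestep of the last layer appears in both tensors
same_last = torch.allclose(out[:, -1], hn[-1])
# The lower layers stored in hn are not present in out at all
same_first_layer = torch.allclose(out[:, -1], hn[0])
print(same_last, same_first_layer)
```

Only the slice of hn belonging to the last layer lines up with out; the other layers are extra information that out does not carry.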
For Bidirectional GRU (requires reading the unidirectional first):
gru = nn.GRU(input_size = 8, hidden_size = 50, num_layers = 3, batch_first = True, bidirectional = True)
inp = torch.randn(1024, 112, 8)
out, hn = gru(inp)
View is changed to (since we have two directions):
hn_conceptual_view = hn.view(3, 2, 1024, 50)
If you try the exact same comparison, it returns False:
torch.equal(out[:, -1], hn_conceptual_view[-1, 0, :, :])
Explanation: This is because we are comparing tensors of different shapes:
out[:, 0].shape
torch.Size([1024, 100])
hn_conceptual_view[-1, 0, :, :].shape
torch.Size([1024, 50])
Remember that for bidirectional networks, the hidden states of both directions get concatenated at each time step: the first hidden_size features (i.e., out[:, 0, :50]) are the hidden states of the forward network, and the remaining hidden_size features (i.e., out[:, 0, 50:]) are those of the backward network. The correct comparison for the forward network is then:
torch.equal(out[:, -1, :50], hn_conceptual_view[-1, 0, :, :])
If you want the hidden states for the backward network: since a backward network processes the sequence from time step n down to 1, you compare the first timestep of the sequence, take the last hidden_size features, and change the hn_conceptual_view direction to 1:
torch.equal(out[:, 0, 50:], hn_conceptual_view[-1, 1, :, :])
In a nutshell, generally speaking:
rnn_module = nn.RECURRENT_MODULE(input_size = E, num_layers = X, hidden_size = H, batch_first = True)
inp = torch.rand(B, S, E)
output, hn = rnn_module(inp)
hn_conceptual_view = hn.view(X, 1, B, H)
Where RECURRENT_MODULE is either GRU or LSTM (at the time of writing this post), B is the batch size, S sequence length, and E embedding size.
torch.equal(output[:, S - 1, :], hn_conceptual_view[-1, 0, :, :])
Again we used S - 1, the zero-based index of the last timestep, since the rnn_module is forward (i.e., unidirectional) and the last timestep sits at the end of the sequence of length S.
rnn_module = nn.RECURRENT_MODULE(input_size = E, num_layers = X, hidden_size = H, batch_first = True, bidirectional = True)
inp = torch.rand(B, S, E)
output, hn = rnn_module(inp)
hn_conceptual_view = hn.view(X, 2, B, H)
torch.equal(output[:, S - 1, :H], hn_conceptual_view[-1, 0, :, :])
Above is the forward network comparison, we used :H because the forward stores its hidden vector in the first H elements for each timestep.
For the backward network:
torch.equal(output[:, 0, H:], hn_conceptual_view[-1, 1, :, :])
We changed the direction in hn_conceptual_view to 1 to get hidden vectors for the backward network.
For all examples we used hn_conceptual_view[-1, ...] because we are only interested in the last layer.
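The bidirectional comparisons can be checked together in one small script. This is a sketch with arbitrary small sizes, using nn.GRU as the recurrent module and torch.allclose in place of torch.equal to tolerate floating-point rounding; the variable names follow the symbols X, H, B, S and E used above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

X, H, B, S, E = 2, 5, 4, 6, 3  # layers, hidden size, batch, sequence length, input size
gru = nn.GRU(input_size=E, hidden_size=H, num_layers=X,
             batch_first=True, bidirectional=True)

inp = torch.rand(B, S, E)
with torch.no_grad():
    output, hn = gru(inp)

# hn has shape (X * 2, B, H); split the first axis into (layers, directions)
hn_view = hn.view(X, 2, B, H)

# Forward direction: last timestep, first H features of output
forward_ok = torch.allclose(output[:, S - 1, :H], hn_view[-1, 0])
# Backward direction: first timestep, last H features of output
backward_ok = torch.allclose(output[:, 0, H:], hn_view[-1, 1])
# Mismatched pairing: the backward half at the last timestep is a different state
mismatch = torch.allclose(output[:, S - 1, H:], hn_view[-1, 1])

print(forward_ok, backward_ok, mismatch)
```

The first two checks should hold and the third should not, which is the asymmetry between the forward and backward halves described above.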
There are three things you have to remember to make sense of this in PyTorch.
This answer is written on the assumption that you are using something like torch.nn.GRU or the like, and that if you are making a multi-layer RNN with it, that you are using the num_layers argument to do so (rather than building one from scratch out of individual layers yourself.)
1. The output will give you the hidden layer outputs of the network for each time-step, but only for the final layer. This is useful in many applications, particularly encoder-decoders using attention. (These architectures build up a 'context' layer from all the hidden outputs, and it is extremely useful to have them sitting around as a self-contained unit.)
2. The h_n will give you the hidden layer outputs for the last time-step only, but for all the layers. Therefore, if and only if you have a single layer architecture, h_n is a strict subset of output. Otherwise, output and h_n intersect, but are not strict subsets of one another. (You will often want these, in an encoder-decoder model, from the encoder in order to jumpstart the decoder.)
3. If you are using a bidirectional output and you want to actually verify that part of h_n is contained in output (and vice-versa) you need to understand what PyTorch does behind the scenes in the organization of the inputs and outputs. Specifically, it concatenates a time-reversed input with the time-forward input and runs them together. This is literal. This means that the 'forward' output at time T is in the final position of the output tensor sitting right next to the 'reverse' output at time 0; if you're looking for the 'reverse' output at time T, it is in the first position.
The third point in particular drove me absolute bonkers for about three hours the first time I was playing with RNNs and GRUs. In fairness, it is also why h_n is provided as an output, so once you figure it out, you don't have to worry about it any more, you just get the right stuff from the return value.
It is not the transpose.
You get rnn_output = hidden[-1] when the LSTM has a single layer.
hidden holds the output of every cell in every layer. For a specific input time step it should be a 2D array, but the LSTM returns all the time steps, so the output of a layer should be hidden[-1].
This discussion assumes a batch size of 1; otherwise output and hidden each need one more dimension.

Transfer learning with CNTK and pre-trained ONNX model fails

I'm trying to use the ResNet-50 model from the ONNX model zoo and load and train it in CNTK for an image classification task. The first thing that confuses me is, that the batch axis (not sure what's the official name for it, dynamic axis?) is set to 1 in this model:
Why is that? Couldn't it simply be [3x224x224]? In this model for example, the input looks like this:
To load the model and use my own Dense layer, I use the following code:
def create_model(num_classes, input_features, freeze=False):
    base_model = load_model("restnet-50.onnx", format=ModelFormat.ONNX)
    feature_node = find_by_name(base_model, "gpu_0/data_0")
    last_node = find_by_uid(base_model, "Reshape2959")
    substitutions = {
        feature_node : placeholder(name='new_input')
    }
    cloned_layers = last_node.clone(CloneMethod.clone, substitutions)
    cloned_out = cloned_layers(input_features)
    z = Dense(num_classes, activation=softmax, name="prediction") (cloned_out)
    return z
For training I use (shortened):
# datasets = list of classes
feature = input_variable(shape=(1, 3, 224, 224))
label = input_variable(shape=(1,3))
model = create_model(len(datasets), feature)
loss = cross_entropy_with_softmax(model, label)
# some definitions for learner, epochs, ProgressPrinters missing
for epoch in range(epochs):
    loss.train((X_current,y_current), parameter_learners=[learner], callbacks=[progress_printer])
X_current is a single image and y_current the corresponding class label, both encoded as numpy arrays with the following shapes:
(1, 3, 224, 224)
(1, 3)
When I try to train the model, I get
"ValueError: ToBatchAxis7504 ToBatchAxisNode operation can only operate on tensor without minibatch data (no layout)"
What's wrong here?

How to fix Activation layer dimensions for LSTM in keras with masked layer

After looking at the following gist, and doing some basic tests, I am trying to create a NER system using a LSTM in keras. I am using a generator and calling fit_generator.
Here is my basic keras model:
model = Sequential([
Embedding(input_dim=max_features, output_dim=embedding_size, input_length=maxlen, mask_zero=True),
Bidirectional(LSTM(hidden_size, return_sequences=True)),
model.compile(loss='binary_crossentropy', optimizer='adam')
My input dimension seem right:
>>> generator = generate()
>>> i,t = next(generator)
>>> print( "Inputs: {}".format(model.input_shape))
>>> print( "Outputs: {}".format(model.output_shape))
>>> print( "Actual input: {}".format(i.shape))
Inputs: (None, 3949)
Outputs: (None, 3949, 1)
Actual input: (45, 3949)
However when I call:
model.fit_generator(generator, steps_per_epoch=STEPS_PER_EPOCH, epochs=EPOCHS)
I seem to get the following error:
Error when checking target:
expected activation_1 to have 3 dimensions,
but got array with shape (45, 3949)
I have seen a few other examples of similar issues, which leads me to believe I need to Flatten() my inputs before the Activation() but if I do so I get the following error.
Layer flatten_1 does not support masking,
but was passed an input_mask:
Tensor("embedding_37/NotEqual:0", shape=(?, 3949), dtype=bool)
As per previous questions, my generator is functionally equivalent to:
def generate():
    while True:
        inputs = np.random.randint(55604, size=maxlen)
        targets = np.random.randint(2, size=maxlen)
        yield inputs, targets
I am not assuming that I need to Flatten and I am open to additional suggestions.
You either need to return only the last element of the sequence (return_sequences=False):
model = Sequential([
Embedding(input_dim=max_features, output_dim=embedding_size, input_length=maxlen, mask_zero=True),
Or remove the masking (mask_zero=False) to be able to use Flatten:
model = Sequential([
Embedding(input_dim=max_features, output_dim=embedding_size, input_length=maxlen),
Bidirectional(LSTM(hidden_size, return_sequences=True)),
*Be careful that the output will be out_size x maxlen.
And I think you want the first option.
Edit 1: Looking at the example diagram, it makes a prediction on every timestep, so the softmax activation also needs to be TimeDistributed. The target dimension should be (None, maxlen, out_size):
model = Sequential([
Embedding(input_dim=max_features, output_dim=embedding_size, input_length=maxlen, mask_zero=True),
Bidirectional(LSTM(hidden_size, return_sequences=True)),
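The target side of this fix can be illustrated with NumPy alone. A minimal sketch, assuming out_size = 1 (as the reported output shape (None, 3949, 1) suggests) and a small maxlen for readability; generate_batches is a hypothetical batched variant of the question's generate(), not code from the thread.

```python
import numpy as np

maxlen, batch_size, out_size = 10, 45, 1  # small maxlen for illustration
vocab_size = 55604                        # upper bound used in the question's generator

def generate_batches(rng):
    """Yield (inputs, targets) batches whose targets have a trailing class axis."""
    while True:
        inputs = rng.integers(vocab_size, size=(batch_size, maxlen))
        targets = rng.integers(2, size=(batch_size, maxlen))
        # (batch, maxlen) -> (batch, maxlen, out_size) to match per-timestep outputs
        yield inputs, np.expand_dims(targets, axis=-1)

rng = np.random.default_rng(0)
inputs, targets = next(generate_batches(rng))
print(inputs.shape, targets.shape)  # (45, 10) (45, 10, 1)
```

With targets shaped this way, they have the three dimensions the per-timestep output layer expects, while the inputs stay two-dimensional for the Embedding layer.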
