Related
I tried to build an RNN by myself, following this tutorial: https://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial. I built my own version with the following network architecture, which is different from the tutorial's (a stands for the input layer, h for the hidden layer, o for the output layer). Here's my code:
class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size, initial_hidden):
        super(RNN, self).__init__()
        self.linear1 = nn.Linear(input_size, hidden_size)
        self.linear2 = nn.Linear(hidden_size, hidden_size, bias=False)
        self.linear3 = nn.Linear(hidden_size, output_size)
        self.prev_hidden = initial_hidden

    def forward(self, X):
        input = torch.add(self.linear1(X).view(1, -1), self.linear2(self.prev_hidden.to(device)))
        hidden = nn.ReLU()(input)
        self.prev_hidden = hidden.detach()
        output = self.linear3(hidden)
        return output
This model stops improving at a loss of about 12000 over all samples and doesn't really drop any further. However, after switching to the model described in the tutorial, in which the input and the hidden state are concatenated and share the same linear layers, the loss drops to 4000 with the same hyperparameters. Here's the code:
class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(RNN, self).__init__()
        self.hidden_size = hidden_size
        self.i2h = nn.Linear(input_size + hidden_size, hidden_size)
        self.i2o = nn.Linear(input_size + hidden_size, output_size)
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, input, hidden):
        combined = torch.cat((input, hidden), 1)
        hidden = self.i2h(combined)
        output = self.i2o(combined)
        output = self.softmax(output)
        return output, hidden

    def initHidden(self):
        return torch.zeros(1, self.hidden_size)
Why does the model architecture in the tutorial outperform my version by so much?
This line
self.prev_hidden = hidden.detach()
means you never backpropagate through time in your RNN. That is a pretty non-standard way to train a recurrent network, and it definitely limits the model's ability to learn.
Another obvious difference is that your implementation does not output probabilities (it lacks a projection onto the simplex, e.g. a softmax); how much of an issue that is is hard to verify, since the full training code is missing.
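To illustrate both points, here is a rough sketch (keeping your layer sizes, but not claiming to be the tutorial's exact model) that passes the hidden state in and out instead of detaching it on the module, and adds a LogSoftmax so the output can be fed to NLLLoss:
import torch
import torch.nn as nn

class MyRNN(nn.Module):  # hypothetical rework of the asker's architecture
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.linear1 = nn.Linear(input_size, hidden_size)
        self.linear2 = nn.Linear(hidden_size, hidden_size, bias=False)
        self.linear3 = nn.Linear(hidden_size, output_size)
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, x, hidden):
        # the hidden state stays in the graph, so the loss can backpropagate through time
        hidden = torch.relu(self.linear1(x).view(1, -1) + self.linear2(hidden))
        output = self.softmax(self.linear3(hidden))
        return output, hidden
The hidden state would then be initialized to zeros at the start of each sequence and only reset between sequences, so gradients flow through all of a sequence's timesteps before the backward call.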
I have a series of vectors representing a signal over time. I'd like to classify parts of the signal into two categories: 1 or 0. The reason for using LSTM is that I believe the network will need knowledge of the entire signal to classify.
My problem is developing the PyTorch model. Below is the class I've come up with.
class LSTMClassifier(nn.Module):
    def __init__(self, input_dim, hidden_dim, label_size, batch_size):
        self.lstm = nn.LSTM(input_dim, hidden_dim)
        self.hidden2label = nn.Linear(hidden_dim, label_size)
        self.hidden = self.init_hidden()

    def init_hidden(self):
        return (torch.zeros(1, self.batch_size, self.hidden_dim),
                torch.zeros(1, self.batch_size, self.hidden_dim))

    def forward(self, x):
        lstm_out, self.hidden = self.lstm(x, self.hidden)
        y = self.hidden2label(lstm_out[-1])
        log_probs = F.log_softmax(y)
        return log_probs
However, this model is giving a bunch of shape errors, and I'm having trouble understanding everything that is going on. I looked at this SO question first.
You should always follow the PyTorch documentation, especially the inputs and outputs sections.
This is how the classifier should look:
import torch
import torch.nn as nn


class LSTMClassifier(nn.Module):
    def __init__(self, input_dim, hidden_dim, label_size):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True)
        self.hidden2label = nn.Linear(hidden_dim, label_size)

    def forward(self, x):
        _, (h_n, _) = self.lstm(x)
        return self.hidden2label(h_n.reshape(x.shape[0], -1))


clf = LSTMClassifier(100, 200, 1)
inputs = torch.randn(64, 10, 100)
clf(inputs)
Points to consider:
Always use super().__init__(), as it registers modules in your neural network and allows for hooks etc.
Use batch_first=True so you can pass inputs of shape (batch, timesteps, n_features).
There is no need to initialize the hidden state with zeros via init_hidden; that is the default when it is left unspecified.
There is no need to pass self.hidden to the LSTM each time. Moreover, you should not do that: it would mean that elements of consecutive batches are somehow next steps of each other in time, while batch elements should be disjoint, and you probably do not need that.
_, (h_n, _) returns the last hidden state from the last timestep, of shape (num_layers * num_directions, batch, hidden_size). In our case num_layers and num_directions are both 1, so we get a (1, batch, hidden_size) tensor as output.
Reshape to (batch, hidden_size) so it can be passed through the linear layer.
Return logits without an activation, and only one of them in the binary case. Use torch.nn.BCEWithLogitsLoss as the loss for the binary case and torch.nn.CrossEntropyLoss for the multiclass case. Also, sigmoid is the proper activation for the binary case, while softmax or log_softmax is appropriate for multiclass.
For binary classification only one output is needed. Any value below 0 (when returning unnormalized scores, as here) is considered negative, anything above 0 positive; see the training sketch after these points.
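A minimal training sketch for the binary case, reusing the clf and inputs defined above; the random labels and the optimizer settings are made up purely for illustration:
import torch

labels = torch.randint(0, 2, (64, 1)).float()            # hypothetical 0/1 targets, shape (batch, 1)
criterion = torch.nn.BCEWithLogitsLoss()                  # combines sigmoid + binary cross-entropy
optimizer = torch.optim.Adam(clf.parameters(), lr=1e-3)

logits = clf(inputs)                                      # shape (batch, 1), raw scores
loss = criterion(logits, labels)
loss.backward()
optimizer.step()
optimizer.zero_grad()

predictions = (logits > 0).long()                         # logit > 0 corresponds to probability > 0.5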
I have a basic neural network model in pytorch like this:
import torch
import torch.nn as nn
import torch.nn.functional as F
class Net(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.sigmoid = nn.Sigmoid()
        self.fc2 = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        out = self.fc1(x)
        out = self.sigmoid(out)
        out = self.fc2(out)
        return out


net = Net(400, 512, 10)
How can I extract bias/intercept term from net.parameters()?
And is this model equivalent to using sequential()?
net = nn.Sequential(nn.Linear(input_dim, hidden_dim[0]),
                    nn.Sigmoid(),
                    nn.Linear(hidden_dim[0], hidden_dim[1]),
                    nn.Sigmoid(),
                    nn.Linear(hidden_dim[1], output_dim))
Is nn.Softmax() optional at the end of either model for multi-class classification? If I understand correctly, with softmax the model outputs the probability of each class, while without it it returns the raw predicted output?
Thanks in advance for answering my newbie questions.
Let's answer the questions one by one. Is this model equivalent to using Sequential()?
Short answer: no. Your Sequential version has two Sigmoid layers and three Linear layers, while Net has only one Sigmoid and two Linear layers. You can print both networks and see the result:
net = Net(400, 512, 10)
print(net.parameters())
print(net)

input_dim = 400
hidden_dim = 512
output_dim = 10
model = Net(400, 512, 10)
net = nn.Sequential(nn.Linear(input_dim, hidden_dim),
                    nn.Sigmoid(),
                    nn.Linear(hidden_dim, hidden_dim),
                    nn.Sigmoid(),
                    nn.Linear(hidden_dim, output_dim))
print(net)
The output is:
Net(
  (fc1): Linear(in_features=400, out_features=512, bias=True)
  (sigmoid): Sigmoid()
  (fc2): Linear(in_features=512, out_features=10, bias=True)
)
Sequential(
  (0): Linear(in_features=400, out_features=512, bias=True)
  (1): Sigmoid()
  (2): Linear(in_features=512, out_features=512, bias=True)
  (3): Sigmoid()
  (4): Linear(in_features=512, out_features=10, bias=True)
)
I hope you can see where they differ.
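For comparison, a Sequential that would actually match Net (with the same 400/512/10 sizes) would just be the same three modules in order; this is only an illustrative sketch, not code from the question:
equivalent = nn.Sequential(nn.Linear(400, 512),
                           nn.Sigmoid(),
                           nn.Linear(512, 10))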
Your first question: how can I extract the bias/intercept term from net.parameters()?
The answer:
model = Net(400, 512,10)
bias = model.fc1.bias
print(bias)
The output is:
tensor([ 3.4078e-02, 3.1537e-02, 3.0819e-02, 2.6163e-03, 2.1002e-03,
4.6842e-05, -1.6454e-02, -2.9456e-02, 2.0646e-02, -3.7626e-02,
3.5531e-02, 4.7748e-02, -4.6566e-02, -1.3317e-02, -4.6593e-02,
-8.9996e-03, -2.6568e-02, -2.8191e-02, -1.9806e-02, 4.9720e-02,
...
-4.6214e-02, -3.2799e-02, -3.3605e-02, -4.9720e-02, -1.0293e-02,
3.2559e-03, -6.6590e-03, -1.2456e-02, -4.4547e-02, 4.2101e-02,
-2.4981e-02, -3.6840e-03], requires_grad=True)
You can use state_dict() to extract the bias of each layer or module in the model.
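For example, assuming the Net class defined above, a quick sketch:
model = Net(400, 512, 10)

# keys are the parameter names, e.g. 'fc1.weight', 'fc1.bias', 'fc2.weight', 'fc2.bias'
print(model.state_dict().keys())

fc1_bias = model.state_dict()['fc1.bias']

# or iterate over named parameters and keep only the bias terms
biases = {name: p for name, p in model.named_parameters() if name.endswith('bias')}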
Writing the network as a Module subclass or as a Sequential gives the same result in principle, but if you want to extend the network later I would suggest using the Net class rather than the Sequential one.
Without a softmax the model just outputs raw logits; even a sigmoid only squashes each output into (0, 1) independently, so it does not give you a probability distribution over the classes to read a prediction from.
Anyway, you should ask your questions separately instead of putting three questions in one post. Good luck.
I want to implement a ResNet network (or rather, residual blocks) but I really want it to be in the sequential network form.
What I mean by sequential network form is the following:
## mdl5, from cifar10 tutorial
mdl5 = nn.Sequential(OrderedDict([
    ('pool1', nn.MaxPool2d(2, 2)),
    ('relu1', nn.ReLU()),
    ('conv1', nn.Conv2d(3, 6, 5)),
    ('pool2', nn.MaxPool2d(2, 2)),
    ('relu2', nn.ReLU()),
    ('conv2', nn.Conv2d(6, 16, 5)),
    ('relu3', nn.ReLU()),
    ('Flatten', Flatten()),
    ('fc1', nn.Linear(1024, 120)),  # figure out equation properly
    ('relu4', nn.ReLU()),
    ('fc2', nn.Linear(120, 84)),
    ('relu5', nn.ReLU()),
    ('fc3', nn.Linear(84, 10))
]))
but of course with the NN lego blocks being residual ("ResNet") blocks.
I know the equation is something like y = f(x) + x, i.e. the block's output added to a skip connection from its input,
but I am not sure how to do it in PyTorch AND Sequential. Sequential is key for me!
Bounty:
I'd like to see an example with a fully connected net and where the BN layers would have to go (and the drop out layers would go too). Ideally on a toy example/data if possible.
Cross-posted:
https://discuss.pytorch.org/t/how-to-have-residual-network-using-only-sequential-blocks/51541
https://www.quora.com/unanswered/How-does-one-implement-my-own-ResNet-with-torch-nn-Sequential-in-Pytorch
https://www.reddit.com/r/pytorch/comments/uyyr28/how_to_implement_my_own_resnet_with/
You can't do it solely with torch.nn.Sequential, as it requires operations to go, as the name suggests, sequentially, while a residual block needs a parallel skip path.
You could, in principle, construct your own block really easily like this:
import torch


class ResNet(torch.nn.Module):
    def __init__(self, module):
        super().__init__()
        self.module = module

    def forward(self, inputs):
        return self.module(inputs) + inputs
which you can use like this:
model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 32, kernel_size=7),
    # 32 filters in and out, no max pooling, and padding=1 inside the blocks so the shapes can be added
    ResNet(
        torch.nn.Sequential(
            torch.nn.Conv2d(32, 32, kernel_size=3, padding=1),
            torch.nn.ReLU(),
            torch.nn.BatchNorm2d(32),
            torch.nn.Conv2d(32, 32, kernel_size=3, padding=1),
            torch.nn.ReLU(),
            torch.nn.BatchNorm2d(32),
        )
    ),
    # Another ResNet block, you could make more of them
    # Downsampling using maxpool and others could be done in between etc. etc.
    ResNet(
        torch.nn.Sequential(
            torch.nn.Conv2d(32, 32, kernel_size=3, padding=1),
            torch.nn.ReLU(),
            torch.nn.BatchNorm2d(32),
            torch.nn.Conv2d(32, 32, kernel_size=3, padding=1),
            torch.nn.ReLU(),
            torch.nn.BatchNorm2d(32),
        )
    ),
    # Pool all 32 feature maps down to 1x1, then flatten (batch, 32, 1, 1) to (batch, 32)
    torch.nn.AdaptiveAvgPool2d(1),
    torch.nn.Flatten(),
    # 32 features in, 10 classes out
    torch.nn.Linear(32, 10),
)
A fact that is usually overlooked (without real consequences when it comes to shallower networks) is that the skip connection should be left without any nonlinearities like ReLU or convolutional layers, and that's what you can see above (source: Identity Mappings in Deep Residual Networks).
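As for the bounty's fully connected case: the same wrapper works. A common (but by no means the only) placement is Linear -> BatchNorm1d -> ReLU -> Dropout inside each block, keeping the skip path clean; a toy sketch under those assumptions, with made-up sizes and data:
import torch

fc_model = torch.nn.Sequential(
    torch.nn.Linear(20, 64),           # project toy 20-dim inputs to the block width
    ResNet(
        torch.nn.Sequential(
            torch.nn.Linear(64, 64),
            torch.nn.BatchNorm1d(64),
            torch.nn.ReLU(),
            torch.nn.Dropout(p=0.5),
            torch.nn.Linear(64, 64),
            torch.nn.BatchNorm1d(64),
            torch.nn.ReLU(),
            torch.nn.Dropout(p=0.5),
        )
    ),
    torch.nn.Linear(64, 10),           # 10-class toy output
)

toy_batch = torch.randn(8, 20)         # random data just to check the shapes
print(fc_model(toy_batch).shape)       # torch.Size([8, 10])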
This is the first time I'm using TensorBoard, and I am getting a weird bug with my graph.
This is what I get if I open up the 'STEP' window.
However, this is what I get if I open up the 'RELATIVE' window (similarly when opening the 'WALL' window).
In addition, to test the performance of the model I apply cross-validation every few steps. The accuracy of this cross-validation drops from ~10% (random guessing) to 0% after some time. I am not sure where I have made a mistake, as I am not a pro with TensorFlow, but I suspect my problem is in the graph building. The code looks as follows:
def initialize_parameters():
    global_step = tf.get_variable("global_step", shape=[], trainable=False,
                                  initializer=tf.constant_initializer(1), dtype=tf.int64)

    Weights = {
        "W_Conv1": tf.get_variable("W_Conv1", shape=[3, 3, 1, 64],
                                   initializer=tf.random_normal_initializer(mean=0.00, stddev=0.01),
                                   ),
        ...
        "W_Affine3": tf.get_variable("W_Affine3", shape=[128, 10],
                                     initializer=tf.random_normal_initializer(mean=0.00, stddev=0.01),
                                     )
    }
    Bias = {
        "b_Conv1": tf.get_variable("b_Conv1", shape=[1, 16, 8, 64],
                                   initializer=tf.random_normal_initializer(mean=0.00, stddev=0.01),
                                   ),
        ...
        "b_Affine3": tf.get_variable("b_Affine3", shape=[1, 10],
                                     initializer=tf.random_normal_initializer(mean=0.00, stddev=0.01),
                                     )
    }
    return Weights, Bias, global_step
def build_model(W, b, global_step):
    keep_prob = tf.placeholder(tf.float32)
    learning_rate = tf.placeholder(tf.float32)
    is_training = tf.placeholder(tf.bool)

    ## 0. Layer: Input
    X_input = tf.placeholder(shape=[None, 16, 8], dtype=tf.float32, name="X_input")
    y_input = tf.placeholder(shape=[None, 10], dtype=tf.int8, name="y_input")

    inputs = tf.reshape(X_input, (-1, 16, 8, 1))  # must be a 4D input into the CNN layer
    inputs = tf.contrib.layers.batch_norm(
        inputs,
        center=False,
        scale=False,
        is_training=is_training
    )

    ## 1. Layer: Conv1 (64, stride=1, 3x3)
    inputs = layer_conv(inputs, W['W_Conv1'], b['b_Conv1'], is_training)
    ...

    ## 7. Layer: Affine 3 (128 units)
    logits = layer_affine(inputs, W['W_Affine3'], b['b_Affine3'], is_training)

    ## 8. Layer: Softmax, or loss otherwise
    predict = tf.nn.softmax(logits)  # should be an argmax, or should this even go through

    ## Output: Loss functions and model trainers
    loss = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(
            labels=y_input,
            logits=logits
        )
    )
    trainer = tf.train.GradientDescentOptimizer(
        learning_rate=learning_rate
    )
    updateModel = trainer.minimize(loss, global_step=global_step)

    ## Test Accuracy
    correct_pred = tf.equal(tf.argmax(y_input, 1), tf.argmax(predict, 1))
    acc_op = tf.reduce_mean(tf.cast(correct_pred, "float"))

    return X_input, y_input, loss, predict, updateModel, keep_prob, learning_rate, is_training
Now I suspect my error is in the definition of the graph's loss function, but I am not sure. Any idea what the problem could be? Or does the model converge correctly and are all those errors to be expected?
Yes, I think you are running the same model more than once with your cross-validation implementation.
Just try, at the end of every loop:
session.close()
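Equivalently, you could wrap each cross-validation run in a context manager so the session is always closed; a sketch, where num_folds and run_fold are hypothetical stand-ins for your own loop and per-fold training code:
import tensorflow as tf

num_folds = 5  # placeholder for however many folds you use

for fold in range(num_folds):
    # the session is closed automatically when the with-block exits
    with tf.Session() as session:
        session.run(tf.global_variables_initializer())
        run_fold(session, fold)  # hypothetical per-fold training/evaluation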
I suspect you are getting such strange output (and I have seen similar myself) because you are running the same model more than once and it is saving the TensorBoard output to exactly the same place. I can't see in your code how you name the file where you put the output. Try to make the file path in this part of the code unique:
`summary_writer = tf.summary.FileWriter(unique_path_to_log, sess.graph)`
You can also try to locate the directory where your existing output has been put and remove the files with the older (or newer?) timestamps; this way TensorBoard will not be confused as to which one to use.
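For example, one simple sketch of a unique path, assuming a logs/ base directory and the existing sess from your training code (the naming scheme is just an illustration):
import time
import tensorflow as tf

# one directory per run, e.g. logs/run_1700000000, so runs never overwrite each other
log_dir = "logs/run_{}".format(int(time.time()))
summary_writer = tf.summary.FileWriter(log_dir, sess.graph)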