Why does the implementation of ResNet50 in Keras forbid images smaller than 32x32x3? - machine-learning

I am trying to understand why the implementation of ResNet50 in Keras forbids images smaller than 32x32x3.
Based on their implementation: https://github.com/keras-team/keras-applications/blob/master/keras_applications/resnet50.py
The function that enforces this check is _obtain_input_shape.
To get around this, I made my own implementation based on their code and removed the check on the minimal size. In my implementation I also added the possibility to use the pre-trained weights with inputs that have more than three channels, by replicating the RGB weights of the first conv1 layer.
def ResNet50(load_weights=True,
             input_shape=None,
             pooling=None,
             classes=1000):
    img_input = Input(shape=input_shape, name='tuned_input')
    x = ZeroPadding2D(padding=(3, 3), name='conv1_pad')(img_input)

    # Stage 1 (conv1_x)
    x = Conv2D(64, (7, 7),
               strides=(2, 2),
               padding='valid',
               kernel_initializer=KERNEL_INIT,
               name='tuned_conv1')(x)
    x = BatchNormalization(axis=CHANNEL_AXIS, name='bn_conv1')(x)
    x = Activation('relu')(x)
    x = ZeroPadding2D(padding=(1, 1), name='pool1_pad')(x)
    x = MaxPooling2D((3, 3), strides=(2, 2))(x)

    # Stage 2 (conv2_x)
    x = _convolution_block(x, 3, [64, 64, 256], stage=2, block='a', strides=(1, 1))
    for block in ['b', 'c']:
        x = _identity_block(x, 3, [64, 64, 256], stage=2, block=block)

    # Stage 3 (conv3_x)
    x = _convolution_block(x, 3, [128, 128, 512], stage=3, block='a')
    for block in ['b', 'c', 'd']:
        x = _identity_block(x, 3, [128, 128, 512], stage=3, block=block)

    # Stage 4 (conv4_x)
    x = _convolution_block(x, 3, [256, 256, 1024], stage=4, block='a')
    for block in ['b', 'c', 'd', 'e', 'f']:
        x = _identity_block(x, 3, [256, 256, 1024], stage=4, block=block)

    # Stage 5 (conv5_x)
    x = _convolution_block(x, 3, [512, 512, 2048], stage=5, block='a')
    for block in ['b', 'c']:
        x = _identity_block(x, 3, [512, 512, 2048], stage=5, block=block)

    # Condition on the last layer
    if pooling == 'avg':
        x = layers.GlobalAveragePooling2D()(x)
    elif pooling == 'max':
        x = layers.GlobalMaxPooling2D()(x)

    inputs = img_input

    # Create model.
    model = models.Model(inputs, x, name='resnet50')

    if load_weights:
        weights_path = get_file(
            'resnet50_weights_tf_dim_ordering_tf_kernels_notop.h5',
            WEIGHTS_PATH_NO_TOP,
            cache_subdir='models',
            md5_hash='a268eb855778b3df3c7506639542a6af')
        model.load_weights(weights_path, by_name=True)

        f = h5py.File(weights_path, 'r')
        d = f['conv1']

        # Used to work with more than 3 channels with the pre-trained model:
        # replicate the RGB kernels of conv1 along the channel axis.
        if input_shape[2] % 3 == 0:
            model.get_layer('tuned_conv1').set_weights(
                [d['conv1_W_1:0'][:].repeat(input_shape[2] / 3, axis=2),
                 d['conv1_b_1:0']])
        else:
            m = (3 * int(input_shape[2] / 3)) + 3
            model.get_layer('tuned_conv1').set_weights(
                [d['conv1_W_1:0'][:].repeat(m, axis=2)[:, :, 0:input_shape[2], :],
                 d['conv1_b_1:0']])

    return model
I ran my implementation with 10x10x3 images and it seems to work, so I do not understand why they put this minimal bound.
They do not provide any information about this choice. I also checked the original paper and did not find any restriction on a minimal input shape mentioned there. I suppose there is a reason for this bound, but I do not know it.
Thus I would like to know why such a restriction was placed on the ResNet implementation.

ResNet50 has 5 stages of downsampling, between the stride-2 max pooling and the convolutions with strides of 2 px in each direction. This means that the minimum input size is 2^5 = 32, and this value is also the overall downsampling factor: a 32x32 input is reduced to a 1x1 feature map by the last stage.
There is not much point in using images smaller than 32x32, since then the downsampling stops doing anything and this changes the behaviour of the network. For such small images it is better to use another network with less downsampling (like DenseNet) or with less depth.
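As a rough sanity check (a sketch, not part of the Keras code), one can trace how the spatial size shrinks through the five stride-2 stages; with a 32x32 input the final feature map is already 1x1, and anything smaller collapses to 1x1 before the last stages, so those stages no longer downsample anything:
# Approximate the spatial size after each of ResNet50's five stride-2 stages
# (conv1, pool1 and the first blocks of stages 3-5); each stage roughly halves
# the feature map, never going below 1x1.
def feature_map_sizes(input_size, stages=5):
    sizes = []
    for _ in range(stages):
        input_size = max(1, (input_size + 1) // 2)  # stride-2 halving, floored at 1
        sizes.append(input_size)
    return sizes

print(feature_map_sizes(224))  # [112, 56, 28, 14, 7]
print(feature_map_sizes(32))   # [16, 8, 4, 2, 1]
print(feature_map_sizes(10))   # [5, 3, 2, 1, 1] -- the last stages do nothing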

Related

Deep Convolutional GAN (DCGAN) works really well on one set of data but not another

I am working on using PyTorch to create a DCGAN that generates trajectory data for a robot, based on a dataset of 20000 other simple paths.
The DCGAN works great on the MNIST dataset, but does not work well on my custom dataset. I am trying to tweak/tune the DCGAN to give good results after being trained with my custom dataset.
Below is an example of output from the GAN (top), along with example training data for MNIST (bottom) after 20 epochs. The loss for the Generator and Discriminator each plateau at around 0.7.
Below is the output for my custom trajectory dataset after a similar number of epochs. The top figure shows the output, and bottom figure shows the training set of the batch.
It is clear that the same GAN is much better at making predictions for the MNIST dataset than for my custom dataset. It is interesting to note that the Discriminator and Generator losses also plateau at similar values of about 0.7 for this dataset. This makes me think that there is some limit in the network to how low the loss can go.
Discriminator Code:
class Discriminator(nn.Module):
    def __init__(self, channels_img, features_d):
        super(Discriminator, self).__init__()
        self.disc = nn.Sequential(
            # input: N x channels_img x 64 x 64
            nn.Conv2d(
                channels_img, features_d, kernel_size=4, stride=2, padding=1
            ),
            nn.LeakyReLU(0.2),
            # _block(in_channels, out_channels, kernel_size, stride, padding)
            self._block(features_d, features_d * 2, 4, 2, 1),
            self._block(features_d * 2, features_d * 4, 4, 2, 1),
            self._block(features_d * 4, features_d * 8, 4, 2, 1),
            # After all _block calls the output is 4x4 (Conv2d below makes it 1x1)
            nn.Conv2d(features_d * 8, 1, kernel_size=4, stride=2, padding=0),
            nn.Sigmoid(),
        )

    def _block(self, in_channels, out_channels, kernel_size, stride, padding):
        return nn.Sequential(
            nn.Conv2d(
                in_channels,
                out_channels,
                kernel_size,
                stride,
                padding,
                bias=False,
            ),
            nn.BatchNorm2d(out_channels),
            nn.LeakyReLU(0.2),
        )

    def forward(self, x):
        return self.disc(x)
Generator Code:
class Generator(nn.Module):
    def __init__(self, channels_noise, channels_img, features_g):
        super(Generator, self).__init__()
        self.net = nn.Sequential(
            # Input: N x channels_noise x 1 x 1
            self._block(channels_noise, features_g * 16, 4, 1, 0),  # img: 4x4
            self._block(features_g * 16, features_g * 8, 4, 2, 1),  # img: 8x8
            self._block(features_g * 8, features_g * 4, 4, 2, 1),   # img: 16x16
            self._block(features_g * 4, features_g * 2, 4, 2, 1),   # img: 32x32
            nn.ConvTranspose2d(
                features_g * 2, channels_img, kernel_size=4, stride=2, padding=1
            ),
            # Output: N x channels_img x 64 x 64
            nn.Tanh(),
        )

    def _block(self, in_channels, out_channels, kernel_size, stride, padding):
        return nn.Sequential(
            nn.ConvTranspose2d(
                in_channels,
                out_channels,
                kernel_size,
                stride,
                padding,
                bias=False,
            ),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(),
        )

    def forward(self, x):
        return self.net(x)
Training loop:
opt_gen = optim.Adam(gen.parameters(), lr=LEARNING_RATE_GEN, betas=(0.5, 0.999))
opt_disc = optim.Adam(disc.parameters(), lr=LEARNING_RATE_DISC, betas=(0.5, 0.999))
criterion = nn.BCELoss()

for epoch in range(NUM_EPOCHS):
    # Target labels not needed! <3 unsupervised
    # for batch_idx, (real, _) in enumerate(dataloader):
    for batch_idx, real in enumerate(dataloader):
        real = real.to(device)
        noise = torch.randn(BATCH_SIZE, NOISE_DIM, 1, 1).to(device)
        fake = gen(noise)

        ### Train Discriminator: max log(D(x)) + log(1 - D(G(z)))
        disc_real = disc(real.float()).reshape(-1)
        loss_disc_real = criterion(disc_real, torch.ones_like(disc_real))
        disc_fake = disc(fake.detach()).reshape(-1)
        loss_disc_fake = criterion(disc_fake, torch.zeros_like(disc_fake))
        loss_disc = (loss_disc_real + loss_disc_fake) / 2
        disc.zero_grad()
        loss_disc.backward()
        opt_disc.step()

        ### Train Generator: min log(1 - D(G(z))) <-> max log(D(G(z)))
        output = disc(fake).reshape(-1)
        loss_gen = criterion(output, torch.ones_like(output))
        gen.zero_grad()
        loss_gen.backward()
        opt_gen.step()

        # Print losses occasionally and print to tensorboard
        if batch_idx % 100 == 0:
            print(
                f"Epoch [{epoch}/{NUM_EPOCHS}] Batch {batch_idx}/{len(dataloader)} \
                  Loss D: {loss_disc:.4f}, loss G: {loss_gen:.4f}"
            )
            with torch.no_grad():
                fake = gen(fixed_noise)
                # take out (up to) 32 examples
                img_grid_real = torchvision.utils.make_grid(
                    real[:BATCH_SIZE], normalize=True
                )
                img_grid_fake = torchvision.utils.make_grid(
                    fake[:BATCH_SIZE], normalize=True
                )
                writer_real.add_image("Real", img_grid_real, global_step=step)
                writer_fake.add_image("Fake", img_grid_fake, global_step=step)
            step += 1
(Not enough reputation to comment yet, so I'll reply)
Since your code works fine on the MNIST dataset, the main problem here is that the trajectories in your training data are practically imperceptible, even to a human observer. Note that the trajectory lines are much thinner than the digit lines in the MNIST dataset.
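If that diagnosis is right, one possible preprocessing step (a hypothetical sketch, not from the original answer) is to thicken the trajectories before training, for example with a grayscale morphological dilation implemented as a stride-1 max pool. This assumes the batches are N x 1 x 64 x 64 tensors with bright lines on a dark background, scaled to [-1, 1]:
import torch.nn.functional as F

def thicken(batch, iterations=2):
    """Dilate thin trajectory lines so they cover more pixels.

    Assumes batch has shape N x 1 x 64 x 64 with bright lines on a dark
    background, scaled to [-1, 1] (the Tanh range used by the generator).
    """
    x = (batch + 1) / 2                                          # map to [0, 1]
    for _ in range(iterations):
        x = F.max_pool2d(x, kernel_size=3, stride=1, padding=1)  # 3x3 dilation
    return x * 2 - 1                                             # back to [-1, 1]

# e.g. inside the training loop: real = thicken(real.to(device).float())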

Pytorch Unfold and Fold: How do I put this image tensor back together again?

I am trying to filter a single channel 2D image of size 256x256 using unfold to create 16x16 blocks with an overlap of 8. This is shown below:
# I = [256, 256] image
kernel_size = 16
stride = bx/2
patches = I.unfold(1, kernel_size, int(stride)).unfold(0, kernel_size, int(stride))
# size = [31, 31, 16, 16]
I have started to attempt to put the image back together with fold but I’m not quite there yet. I’ve tried to use view to get the image to ‘fit’ the way it’s supposed to but I don’t see how this would preserve the original image. Perhaps I’m overthinking this.
# patches.shape = [31, 31, 16, 16]
patches = filt_data_block.contiguous().view(-1, kernel_size*kernel_size)  # size = [961, 256]
patches = patches.permute(1, 0)  # size = [256, 961]
Any help would be greatly appreciated. Thanks very much.
I believe you will benefit from using torch.nn.functional.fold and torch.nn.functional.unfold in this case, as these functions are built specifically for images (or any 4D tensors, that is with shape B X C X H X W).
Let's start with unfolding the image:
import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt
from sklearn.datasets import load_sample_image #Used to load a sample image
dtype = torch.cuda.FloatTensor if torch.cuda.is_available() else torch.FloatTensor
#Load a flower image from sklearn.datasets, crop it to shape 1 X 3 X 256 X 256:
I = torch.from_numpy(load_sample_image('flower.jpg')).permute(2,0,1).unsqueeze(0).type(dtype)[...,128:128+256,256:256+256]
kernel_size = 16
stride = kernel_size//2
I_unf = F.unfold(I, kernel_size, stride=stride)
Here we obtain all the 16x16 image patches with a stride of 8 by using the F.unfold function. This results in a 3D tensor with shape torch.Size([1, 768, 961]), i.e. 961 patches with 768 = 16 x 16 x 3 pixel values in each.
Now, say we wish to fold it back to I:
I_f = F.fold(I_unf,I.shape[-2:],kernel_size,stride=stride)
norm_map = F.fold(F.unfold(torch.ones(I.shape).type(dtype),kernel_size,stride=stride),I.shape[-2:],kernel_size,stride=stride)
I_f /= norm_map
We use F.fold where we tell it the original shape of I, the kernel_size we used to unfold and the stride used. After folding I_unf we will obtain a summation with overlaps. This means that the resulting image will appear saturated. As a result, we need to compute a normalization map which will normalize multiple summation of pixels due to overlaps. A way to do this efficiently is to take a ones tensor and use unfold followed by fold - to mimic the summation with overlaps. This gives us the normalization map by which we normalize I_f to recover I.
Now, we wish to plot I_f and I to prove content is preserved:
#Plot I:
plt.imshow(I[0,...].permute(1,2,0).cpu()/255)
#Plot I_f:
plt.imshow(I_f[0,...].permute(1,2,0).cpu()/255)
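As a quick numerical sanity check (a small addition reusing the answer's I and I_f), one can also verify that the normalized fold recovers the original image:
# Folding then dividing by norm_map recovers I up to floating-point error
# (256 is divisible by the stride of 8, so every pixel is covered by a patch).
print(torch.allclose(I, I_f, atol=1e-3))   # True
print((I - I_f).abs().max().item())        # close to 0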
This whole process also works for single-channel images. One thing to notice is that if the spatial dimensions of the image are not divisible by the stride, norm_map will contain zeros (at the edges) because some pixels are not covered by any patch, but you can easily handle this case as well.
A slightly less elegant solution than that proposed by Gil:
I took inspiration from this post on the Pytorch forums, formatting my image tensor to be of standard shape B x C x H x W (1 x 1 x 256 x 256). Unfolding:
# CREATE THE UNFOLDED IMAGE SLICES
I = image # shape [256, 256]
kernel_size = bx #shape [16]
stride = int(bx/2) #shape [8]
I2 = I.unsqueeze(0).unsqueeze(0) #shape [1, 1, 256, 256]
patches2 = I2.unfold(2, kernel_size, stride).unfold(3, kernel_size, stride)
#shape [1, 1, 31, 31, 16, 16]
Following this, I do some transforms and filtering to my tensor stack. Before doing this I apply a cosine window and normalise:
# NORMALISE AND WINDOW
Pvv = torch.mean(torch.pow(win, 2))*torch.numel(win)*(noise_std**2)
Pvv = Pvv.double()
mean_patches = torch.mean(patches2, (4, 5), keepdim=True)
mean_patches = mean_patches.repeat(1, 1, 1, 1, 16, 16)
window_patches = win.unsqueeze(0).unsqueeze(0).unsqueeze(0).unsqueeze(0).repeat(1, 1, 31, 31, 1, 1)
zero_mean = patches2 - mean_patches
windowed_patches = zero_mean * window_patches
#SOME FILTERING ....
#ADD MEAN AND WINDOW BEFORE FOLDING BACK TOGETHER.
filt_data_block = (filt_data_block + mean_patches*window_patches) * window_patches
The above code works for me, but a mask would be simpler. Next, I prepare my tensor of shape [1, 1, 31, 31, 16, 16] to be transformed back into the original [1, 1, 256, 256]:
# REASSEMBLE THE IMAGE USING FOLD
patches = filt_data_block.contiguous().view(1, 1, -1, kernel_size*kernel_size)
patches = patches.permute(0, 1, 3, 2)
patches = patches.contiguous().view(1, kernel_size*kernel_size, -1)
IR = F.fold(patches, output_size=(256, 256), kernel_size=kernel_size, stride=stride)
IR = IR.squeeze()
This allowed me to create an overlapping sliding window and seamlessly stitch the image back together. If I cut out the filtering, the reconstructed image is identical to the original.

Getting RuntimeError: Graph disconnected: cannot obtain value for tensor

I want to create a custom model from ResNet101 by retrieving one of its layers, called 'avg_pool', and attaching my own custom layers to it. I have done a similar thing with another pre-trained ImageNet model, ResNet50, but I am getting an error with ResNet101. I am a newbie in transfer learning; please point out my mistake.
def resnet101_model(weights_path=None):
    eps = 1.1e-5

    # Handle Dimension Ordering for different backends
    global bn_axis
    if K.image_dim_ordering() == 'tf':
        bn_axis = 3
        img_input = Input(shape=(224, 224, 3), name='data')
    else:
        bn_axis = 1
        img_input = Input(shape=(3, 224, 224), name='data')

    x = ZeroPadding2D((3, 3), name='conv1_zeropadding')(img_input)
    x = Convolution2D(64, 7, 7, subsample=(2, 2), name='conv1', bias=False)(x)
    x = BatchNormalization(epsilon=eps, axis=bn_axis, name='bn_conv1')(x)
    x = Scale(axis=bn_axis, name='scale_conv1')(x)
    x = Activation('relu', name='conv1_relu')(x)
    x = MaxPooling2D((3, 3), strides=(2, 2), name='pool1')(x)

    x = conv_block(x, 3, [64, 64, 256], stage=2, block='a', strides=(1, 1))
    x = identity_block(x, 3, [64, 64, 256], stage=2, block='b')
    x = identity_block(x, 3, [64, 64, 256], stage=2, block='c')

    x = conv_block(x, 3, [128, 128, 512], stage=3, block='a')
    for i in range(1, 3):
        x = identity_block(x, 3, [128, 128, 512], stage=3, block='b' + str(i))

    x = conv_block(x, 3, [256, 256, 1024], stage=4, block='a')
    for i in range(1, 23):
        x = identity_block(x, 3, [256, 256, 1024], stage=4, block='b' + str(i))

    x = conv_block(x, 3, [512, 512, 2048], stage=5, block='a')
    x = identity_block(x, 3, [512, 512, 2048], stage=5, block='b')
    x = identity_block(x, 3, [512, 512, 2048], stage=5, block='c')

    x_fc = AveragePooling2D((7, 7), name='avg_pool')(x)
    x_fc = Flatten()(x_fc)
    x_fc = Dense(1000, activation='softmax', name='fc1000')(x_fc)

    model = Model(img_input, x_fc)

    # load weights
    if weights_path:
        model.load_weights(weights_path, by_name=True)

    return model
im = cv2.resize(cv2.imread('human.jpg'), (224, 224)).astype(np.float32)

# Remove train image mean
im[:, :, 0] -= 103.939
im[:, :, 1] -= 116.779
im[:, :, 2] -= 123.68

# Transpose image dimensions (Theano uses the channels as the 1st dimension)
if K.image_dim_ordering() == 'th':
    im = im.transpose((2, 0, 1))
    weights_path = 'resnet101_weights_th.h5'
else:
    weights_path = 'resnet101_weights_tf.h5'

im = np.expand_dims(im, axis=0)

image_input = Input(shape=(224, 224, 3))
model = resnet101_model(weights_path)
model.summary()

last_layer = model.get_layer('avg_pool').output
x = Flatten(name='flatten')(last_layer)
out = Dense(num_classes, activation='softmax', name='fc1000')(x)
custom_resnet_model = Model(inputs=image_input, outputs=out)
custom_resnet_model.summary()
A "graph disconnected" error happens when your inputs and outputs are not connected. In your case image_input is not connected to out: the ResNet model is built on its own img_input tensor, so the separate image_input never enters the graph. You should pass your input through the ResNet model (or build the new head on the model's own input), and then it should work.
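A minimal sketch of that fix (assuming num_classes is defined as in the asker's code): build the new head on the ResNet model's own input tensor instead of the separate, unconnected image_input:
# Sketch of the fix: reuse the input tensor of the ResNet101 model itself,
# so the graph is connected all the way from input to the new output.
model = resnet101_model(weights_path)

last_layer = model.get_layer('avg_pool').output
x = Flatten(name='flatten')(last_layer)
# 'fc_custom' is a hypothetical name, chosen so it does not clash with the
# pre-trained 'fc1000' layer when loading weights by name.
out = Dense(num_classes, activation='softmax', name='fc_custom')(x)

# model.input is the 'data' Input tensor that actually feeds 'avg_pool'
custom_resnet_model = Model(inputs=model.input, outputs=out)
custom_resnet_model.summary()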

keras convolutional neural network - output shape

Please forgive my ignorance, as I am really new to this area. I am trying to get the correct output shape from my neural network, which has 3 Conv2D layers followed by 2 Dense layers. My input shape is (140, 140, 4), which is 4 stacked grayscale images. When I feed in 1 input, I expect an output of shape (1, 4), but I am getting a shape of (14, 14, 4). What am I doing wrong here? Thank you very much for your help in advance!
meta_layers = [Conv2D, Conv2D, Conv2D, Dense, Dense]
meta_inits = ['lecun_uniform'] * 5
meta_nodes = [32, 64, 64, 512, 4]
meta_filters = [(8, 8), (4, 4), (3, 3), None, None]
meta_strides = [(4, 4), (2, 2), (1, 1), None, None]
meta_activations = ['relu'] * 5
meta_loss = "mean_squared_error"
meta_optimizer = RMSprop(lr=0.00025, rho=0.9, epsilon=1e-06)
meta_n_samples = 1000
meta_epsilon = 1.0

meta = Sequential()
meta.add(meta_layers[0](meta_nodes[0], init=meta_inits[0], input_shape=(140, 140, 4),
                        kernel_size=meta_filters[0], strides=meta_strides[0]))
meta.add(Activation(meta_activations[0]))
for layer, init, node, activation, kernel, stride in list(zip(
        meta_layers, meta_inits, meta_nodes, meta_activations, meta_filters, meta_strides))[1:]:
    if layer == Conv2D:
        meta.add(layer(node, init=init, kernel_size=kernel, strides=stride))
        meta.add(Activation(activation))
    elif layer == Dense:
        meta.add(layer(node, init=init))
        meta.add(Activation(activation))
        print("meta node: " + str(node))
meta.compile(loss=meta_loss, optimizer=meta_optimizer)
Your problem lies in the fact that in Keras version >= 2.0, a Dense layer is applied to the last axis of its input (you may read about it here). So if you apply:
Dense(512)
to a Conv2D output with shape (14, 14, 64), you'll get an output with shape (14, 14, 512), and Dense(4) applied to that will give you an output with shape (14, 14, 4). You can call the model.summary() method to confirm this.
In order to solve this, you need to apply one of the following layers to the output of the last convolutional layer, before the Dense layers: GlobalMaxPooling2D, GlobalAveragePooling2D or Flatten. This squashes the output down to 2 dimensions, with shape (batch_size, features).
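To illustrate the point, here is a minimal sketch (assuming tensorflow.keras; the layer sizes are illustrative only, not the asker's exact configuration) comparing the output shapes with and without Flatten:
# Dense applied to a 4D conv output keeps the spatial dimensions, while
# inserting Flatten first collapses them, giving the expected (batch, 4) output.
from tensorflow.keras import Input, Sequential
from tensorflow.keras.layers import Conv2D, Dense, Flatten

without_flatten = Sequential([
    Input(shape=(140, 140, 4)),
    Conv2D(64, (3, 3), strides=(2, 2)),
    Dense(512),
    Dense(4),
])
print(without_flatten.output_shape)  # (None, 69, 69, 4) -- still has spatial axes

with_flatten = Sequential([
    Input(shape=(140, 140, 4)),
    Conv2D(64, (3, 3), strides=(2, 2)),
    Flatten(),
    Dense(512),
    Dense(4),
])
print(with_flatten.output_shape)     # (None, 4)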

Keras model predict train/test shape

I am training a CNN with Keras on 30x30 patches from an image. I want to test the network with a full image, but I get the following error:
ValueError: GpuElemwise. Input dimension mis-match. Input 2 (indices start at 0) has shape[1] == 30, but the output's size on that axis is 100.
Apply node that caused the error: GpuElemwise{Composite{((i0 + i1) - i2)}}[(0, 0)](GpuDimShuffle{0,2,3,1}.0, GpuReshape{4}.0, GpuFromHost.0)
Toposort index: 79
Inputs types: [CudaNdarrayType(float32, 4D), CudaNdarrayType(float32, (True, True, True, False)), CudaNdarrayType(float32, 4D)]
Inputs shapes: [(10, 100, 100, 3), (1, 1, 1, 3), (10, 30, 30, 3)]
Inputs strides: [(30000, 100, 1, 10000), (0, 0, 0, 1), (2700, 90, 3, 1)]
Inputs values: ['not shown', CudaNdarray([[[[ 0.01060364 0.00988821 0.00741314]]]]), 'not shown']
Outputs clients: [[GpuCAReduce{pre=sqr,red=add}{0,1,1,1}(GpuElemwise{Composite{((i0 + i1) - i2)}}[(0, 0)].0)]]
This is my model.predict:
predict_image = model.predict(np.array([test_images[1]]), batch_size=1)[0]
It seems like the issue is that the input size cannot be anything other than 30x30, but the input shape for the first layer of my network is (None, None, 3).
model.add(Convolution2D(n1, f1, f1, border_mode='same', input_shape=(None, None, 3), activation='relu'))
Is it simply not possible to test an image with different dimensions to the ones I trained with?
As fchollet himself described here, you should be able to define the input like so:
input_shape=(1, None, None)
However, this will fail if you have layers that use the Flatten operation.
This suggests that you should be able to accomplish your goal with a fully convolutional network.
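For reference, here is a minimal sketch (hypothetical layer sizes, using the channels-last layout from the question rather than the asker's architecture) of a fully convolutional model that accepts inputs of any spatial size, because no Flatten or Dense layer pins the spatial dimensions:
import numpy as np
from tensorflow.keras import Input, Sequential
from tensorflow.keras.layers import Conv2D

model = Sequential([
    Input(shape=(None, None, 3)),                       # any height/width, 3 channels
    Conv2D(32, (3, 3), padding='same', activation='relu'),
    Conv2D(3, (3, 3), padding='same'),
])

print(model.predict(np.zeros((1, 30, 30, 3))).shape)    # (1, 30, 30, 3)
print(model.predict(np.zeros((1, 100, 100, 3))).shape)  # (1, 100, 100, 3)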