I read this article about using "resize convolutions" rather than the "deconvolution" (i.e. transposed convolution) method for generating images with neural networks. It's clear how this works with a stride size of 1, but how would you implement it for a stride size >1?
Here is how I've implemented this in TensorFlow. Note: This is the second "deconvolution" layer in the decoder part of an autoencoder network.
h_d_upsample2 = tf.image.resize_images(images=h_d_conv3,
size=(int(self.c2_size), int(self.c2_size)),
h_d_conv2 = tf.layers.conv2d(inputs=h_d_upsample2,
kernel_size=(FLAGS.c2_kernel, FLAGS.c2_kernel),

Resizing images is not really a viable option for the intermediate layers of a network. You may try conv2d_transpose instead.
how would you implement it for a stride size >1?
# best practice is to use the tf.nn.conv2d_transpose function; it works with stride > 1.
# output_shape_width_height = stride * input_shape_width_height
# input_shape = [32, 32, 48], output_shape = [64, 64, 128]
stride = 2
filter_size_w =filter_size_h= 2
shape = [filter_size_w, filter_size_h, output_shape[-1], input_shape[-1]]
w = tf.get_variable(
output = tf.nn.conv2d_transpose(
input, w, output_shape=output_shape, strides=[1, stride, stride, 1])
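For the resize-convolution variant the question asks about, the usual reading is that the resize step takes over the role of the stride: the feature map is upsampled by the factor you would otherwise use as the stride, and the convolution that follows keeps stride 1. The NumPy sketch below shows only the nearest-neighbour resize step. The helper name upsample_nearest and the NHWC layout are assumptions made for illustration; it is not the TensorFlow resize op itself.

```python
import numpy as np

def upsample_nearest(x, factor):
    """Nearest-neighbour upsampling of a (batch, height, width, channels) array."""
    x = np.repeat(x, factor, axis=1)     # repeat rows
    return np.repeat(x, factor, axis=2)  # repeat columns

x = np.arange(4, dtype=np.float32).reshape(1, 2, 2, 1)
up = upsample_nearest(x, factor=2)       # plays the role of "stride 2"
print(up.shape)                          # (1, 4, 4, 1)
print(up[0, :, :, 0])
```

A stride-1 convolution with 'SAME' padding applied to up would then keep the 4x4 spatial size.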


What are the shapes of beta and gamma parameters in layer normalization layer?

In layer normalization, we compute the mean and variance across the input layer (instead of across the batch, which is what we do in batch normalization). We then normalize the input layer using that mean and variance, and return gamma times the normalized layer plus beta.
My question is: are gamma and beta scalars with shape (1, 1) each, or do they each have shape (1, number of hidden units)?
Here is how I have implemented layer normalization. Is this correct?
def layernorm(layer, gamma, beta):
    mean = np.mean(layer, axis = 1, keepdims = True)
    variance = np.mean((layer - mean) ** 2, axis=1, keepdims = True)
    layer_hat = (layer - mean) * 1.0 / np.sqrt(variance + 1e-8)
    outputs = gamma * layer_hat + beta
    return outputs
where gamma and beta are defined as below:
gamma = np.random.normal(1, 128)
beta = np.random.normal(1, 128)
According to TensorFlow's implementation, if the input has shape [B, rest], then gamma and beta have shape rest. Here rest could be (h,) for a 2-dimensional input or (h, w, c) for a 4-dimensional input.
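To make those shapes concrete, here is a minimal NumPy sketch for a 2-dimensional input of shape [B, h]. It assumes the common initialization of gamma to ones and beta to zeros, which the question does not state. Note that np.random.normal(1, 128), as used in the question, returns a single scalar drawn with mean 1 and standard deviation 128, not an array of 128 values.

```python
import numpy as np

def layernorm(layer, gamma, beta, eps=1e-8):
    # Statistics are taken per example, across the hidden units (axis 1).
    mean = np.mean(layer, axis=1, keepdims=True)
    variance = np.var(layer, axis=1, keepdims=True)
    layer_hat = (layer - mean) / np.sqrt(variance + eps)
    # gamma and beta have shape (hidden,) and broadcast over the batch axis.
    return gamma * layer_hat + beta

hidden = 128
rng = np.random.default_rng(0)
layer = rng.normal(size=(4, hidden))  # [B, h]
gamma = np.ones(hidden)               # assumed initialization: scale of 1
beta = np.zeros(hidden)               # assumed initialization: shift of 0
out = layernorm(layer, gamma, beta)
print(out.shape)                      # (4, 128)
```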

How to use readNet (or readFromDarknet) instead of readNetFromCaffe?

I did object detection using OpenCV by loading a pre-trained MobileNet SSD model, following this post.
It reads a video and detects objects without any problem. But I would like to use readNet (or readFromDarknet) instead of readNetFromCaffe
net = cv2.dnn.readNetFromCaffe(args["prototxt"], args["model"])
because I have pre-trained weights and cfg file of my own objects only in Darknet framework. Therefore I simply changed readNetFromCaffe into readNet in above post and got an error:
Traceback (most recent call last):
File "", line 124, in <module>
for i in np.arange(0, detections.shape[2]):
IndexError: tuple index out of range
Here detections is an output from
blob = cv2.dnn.blobFromImage(frame, 1.0/255.0, (416, 416), True, crop=False)
detections = net.forward()
Its shape is (1, 1, 100, 7) tuple (when using readNetFromCaffe).
I was kinda expecting it wouldn't work just by changing the model. Then I decided to look for an object detector code where readNet was used and I found it here. I read through the code and found the same lines as follows:
blob = cv2.dnn.blobFromImage(image, scale, (416,416), (0,0,0), True, crop=False)
outs = net.forward(get_output_layers(net))
Here, the shape of outs is (1, 845, 6) list. But in order for me to be able to use it right away (here), outs should be of the same size with detections. I've come up to this part and have no clue about how I should proceed.
If something isn't clear, I just need help to use readNet (or readFromDarknet) instead of readNetFromCaffe in this post
If we look at the code closely, we can see that everything depends on the output detections (line 121), and we need to adapt the code to the outs produced by this (line 63). After spending almost a day, I came to a reasonable (not perfect) solution. Basically, it is all about the output blobs of readNetFromCaffe and readFromDarknet, which have shapes 1x1xNx7 and NxC, respectively. In both cases N is the number of detections, but each detection is a vector of a different size. In 1x1xNx7, every detection is a vector of values
[batchId, classId, confidence, left, top, right, bottom]. In NxC, N is the number of
detected objects and each row starts with [center_x, center_y, width, height], followed by the scores (the replacement code below reads the class scores from index 5 onward). After analyzing these, we may replace (lines 124-130)
for i in np.arange(0, detections.shape[2]):
    confidence = detections[0, 0, i, 2]
    if confidence > args["confidence"]:
        idx = int(detections[0, 0, i, 1])
        if CLASSES[idx] != "person":
            continue
        box = detections[0, 0, i, 3:7] * np.array([W, H, W, H])
        (startX, startY, endX, endY) = box.astype("int")
with equivalent lines
for i in np.arange(0, detections.shape[0]):
    # class scores start at index 5 of each detection row
    scores = detections[i][5:]
    classId = np.argmax(scores)
    confidence = scores[classId]
    if confidence > args["confidence"]:
        idx = int(classId)
        if CLASSES[idx] != "person":
            continue
        # scale the normalized box by 416 (the blob size used above)
        center_x = int(detections[i][0] * 416)
        center_y = int(detections[i][1] * 416)
        width = int(detections[i][2] * 416)
        height = int(detections[i][3] * 416)
        left = int(center_x - width / 2)
        top = int(center_y - height / 2)
        right = width + left - 1
        bottom = height + top - 1
        # corner format, matching (startX, startY, endX, endY)
        box = [left, top, right, bottom]
        (startX, startY, endX, endY) = box
This way we can keep track of the "person" class using Darknet's cfg and weights and count people up/down with a visualization line.
Again, there might be simpler ways of tracking the detections from a Darknet weights file, but this works for this particular case.
A reference:
more about blobs output by readNetFromCaffe and readFromDarknet
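The box conversion used in the replacement loop can be isolated into a small helper, which makes it easier to check by hand. This is only a sketch: the function name center_to_corners is hypothetical, and it expects pixel values that have already been scaled.

```python
def center_to_corners(center_x, center_y, width, height):
    """Convert a (center_x, center_y, width, height) box to (left, top, right, bottom)."""
    left = int(center_x - width / 2)
    top = int(center_y - height / 2)
    right = left + width - 1
    bottom = top + height - 1
    return left, top, right, bottom

box = center_to_corners(208, 208, 100, 50)
print(box)  # (158, 183, 257, 232)
```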

How labelling works in image segmentation [SegNet]

I am trying to understand image segmentation using the SegNet implementation in Keras. I have read the original paper, which uses the Conv and Deconv architecture, and also the variant using dilated conv layers. However, I have trouble understanding how the labelling of the pixels works.
I am considering the following implementation:
Here the pascal dataset attributes are used:
21 Classes:
# 0=background
# 1=aeroplane, 2=bicycle, 3=bird, 4=boat, 5=bottle
# 6=bus, 7=car, 8=cat, 9=chair, 10=cow
# 11=diningtable, 12=dog, 13=horse, 14=motorbike, 15=person
# 16=potted plant, 17=sheep, 18=sofa, 19=train, 20=tv/monitor
The classes are represented by:
pascal_nclasses = 21
pascal_palette = np.array([(0, 0, 0)
, (128, 0, 0), (0, 128, 0), (128, 128, 0), (0, 0, 128), (128, 0, 128)
, (0, 128, 128), (128, 128, 128), (64, 0, 0), (192, 0, 0), (64, 128, 0)
, (192, 128, 0), (64, 0, 128), (192, 0, 128), (64, 128, 128), (192, 128, 128)
, (0, 64, 0), (128, 64, 0), (0, 192, 0), (128, 192, 0), (0, 64, 128)], dtype=np.uint8)
I was trying to open the labelled images for cat and boat, since cat is only in the red channel and boat only in the blue channel. I used the following to show the labelled images:
For boat:
label = cv2.imread("2008_000120.png")
label = np.multiply(label, 100)
cv2.imshow("kk", label[:,:,2])
For cat:
label = cv2.imread("2008_000056.png")
label = np.multiply(label, 100)
cv2.imshow("kk", label[:,:,0])
However, it doesn't matter which channel I choose: both images always give the same results, i.e. the following code also gives the same results
For boat:
label = cv2.imread("2008_000120.png")
label = np.multiply(label, 100)
cv2.imshow("kk", label[:,:,1]) # changed to Green space
For cat:
label = cv2.imread("2008_000056.png")
label = np.multiply(label, 100)
cv2.imshow("kk", label[:,:,1]) # changed to Green space
My assumption was that I will see the cat only in Red color space and boat only in blue. However, the output in all cases:
I am confused now about how these pixels are labelled, and how they are read and uniquely paired with categories in the process of creating the logits.
It will be great if someone can explain or put some relevant links to understand this process. I tried to search but most of the tutorials only discuss the CNN architecture, not the labelling process or how these labels are used within the CNN.
I have attached the labelled images of cat and boat for reference.
The labels are just single-channel image masks, not color images. The pixel value at each location of your label image depends on the class present at that pixel: it is 0 where there is no object and 1-20, according to the class, otherwise.
Semantic segmentation is a classification task, so you are trying to assign a class to each pixel (in this case class labels 0-20).
Your model will produce an output image, and you want to compute the softmax cross entropy between each output image pixel and the corresponding label image pixel.
In the multiclass case where you have K classes (here K=21), each output pixel has K channels and you compute the softmax cross entropy across the channels at each pixel. Why a channel for each class? Think about ordinary classification: we produce a vector of length K for K classes, and this is compared to a one-hot vector of length K.
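As a rough illustration of the per-pixel loss, the NumPy sketch below turns an integer label mask into one-hot vectors and computes the softmax cross entropy across the K channels of each pixel. The function names and the tiny 2x2 mask are made up for the example, and real frameworks provide fused versions of this computation.

```python
import numpy as np

def one_hot(labels, num_classes):
    """(H, W) integer mask -> (H, W, K) one-hot array."""
    return np.eye(num_classes, dtype=np.float32)[labels]

def pixelwise_cross_entropy(logits, labels):
    """logits: (H, W, K) raw scores, labels: (H, W) class indices."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    targets = one_hot(labels, logits.shape[-1])
    return float(-(targets * log_probs).sum(axis=-1).mean())

labels = np.array([[0, 8], [4, 4]])  # 0=background, 8=cat, 4=boat
logits = np.zeros((2, 2, 21))        # an untrained model: all classes equally likely
loss = pixelwise_cross_entropy(logits, labels)
print(one_hot(labels, 21).shape)     # (2, 2, 21)
print(loss)                          # equals log(21) for uniform predictions
```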

Training convolutional nets on multi-channel image data sets

I am trying to implement a convolutional neural network from scratch, and I am not able to figure out how to perform (vectorized) operations on multi-channel images like RGB, which have 3 dimensions. Following articles and tutorials such as this CS231n tutorial, it's pretty clear how to implement a network for a single input, as the input layer will be a 3D matrix, but there are always multiple data points in a dataset. So I cannot figure out how to implement these networks for vectorized operation on entire datasets.
I have implemented a network which takes a 3D matrix as input, but now I have realized that it will not work on an entire dataset; I will have to propagate one input at a time. I don't really know whether conv nets are vectorized over the entire dataset or not. But if they are, how can I vectorize my convolutional network for multi-channel images?
If I got your question right, you're basically asking how to implement a convolutional layer for a mini-batch, which will be a 4-D tensor.
To put it simply, you want to treat each input in a batch independently and apply convolution to each one. It's fairly straightforward to code without vectorization using a loop.
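The loop-based version can be sketched as follows, assuming the (N, C, H, W) input layout and (F, C, HH, WW) filter layout. The name conv_forward_naive is chosen for illustration. It is slow, but it is a handy reference to check a vectorized implementation against.

```python
import numpy as np

def conv_forward_naive(x, w, b, stride=1, pad=0):
    """Loop-based convolution. x: (N, C, H, W), w: (F, C, HH, WW), b: (F,)."""
    N, C, H, W = x.shape
    F, _, HH, WW = w.shape
    x_padded = np.pad(x, ((0, 0), (0, 0), (pad, pad), (pad, pad)), mode="constant")
    out_h = (H + 2 * pad - HH) // stride + 1
    out_w = (W + 2 * pad - WW) // stride + 1
    out = np.zeros((N, F, out_h, out_w))
    for n in range(N):                  # each example is processed independently
        for f in range(F):              # each filter produces one output channel
            for i in range(out_h):
                for j in range(out_w):
                    hs, ws = i * stride, j * stride
                    window = x_padded[n, :, hs:hs + HH, ws:ws + WW]
                    out[n, f, i, j] = np.sum(window * w[f]) + b[f]
    return out

x = np.ones((2, 3, 5, 5))  # a mini-batch of two 3-channel 5x5 inputs
w = np.ones((4, 3, 3, 3))  # four 3x3 filters
b = np.zeros(4)
out = conv_forward_naive(x, w, b, stride=1, pad=1)
print(out.shape)           # (2, 4, 5, 5)
```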
A vectorized implementation is often based on the im2col technique, which basically transforms the 4-D input tensor into a giant matrix and performs a matrix multiplication. Here's an implementation of a forward pass using numpy.lib.stride_tricks in Python:
import numpy as np

def conv_forward(x, w, b, stride, pad):
    N, C, H, W = x.shape
    F, _, HH, WW = w.shape
    # Check dimensions
    assert (W + 2 * pad - WW) % stride == 0, 'width does not work'
    assert (H + 2 * pad - HH) % stride == 0, 'height does not work'
    # Pad the input
    p = pad
    x_padded = np.pad(x, ((0, 0), (0, 0), (p, p), (p, p)), mode='constant')
    # Figure out output dimensions (integer division, so the shapes stay ints)
    H += 2 * pad
    W += 2 * pad
    out_h = (H - HH) // stride + 1
    out_w = (W - WW) // stride + 1
    # Perform an im2col operation by picking clever strides
    shape = (C, HH, WW, N, out_h, out_w)
    strides = (H * W, W, 1, C * H * W, stride * W, stride)
    strides = x.itemsize * np.array(strides)
    x_stride = np.lib.stride_tricks.as_strided(x_padded,
                                               shape=shape, strides=strides)
    x_cols = np.ascontiguousarray(x_stride)
    x_cols.shape = (C * HH * WW, N * out_h * out_w)
    # Now all our convolutions are a big matrix multiply
    res = w.reshape(F, -1).dot(x_cols) + b.reshape(-1, 1)
    # Reshape the output
    res.shape = (F, N, out_h, out_w)
    out = res.transpose(1, 0, 2, 3)
    out = np.ascontiguousarray(out)
    return out
Note that it uses some non-trivial features of the linear algebra library that are implemented in numpy but may not be in your library.
BTW, you generally don't want to push the entire data set through as one batch; split it into several batches.
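Splitting into batches can be as simple as slicing along the first axis. A minimal sketch, where the generator name iterate_minibatches and the sizes are arbitrary choices for the example:

```python
import numpy as np

def iterate_minibatches(data, batch_size):
    """Yield consecutive slices of at most batch_size examples."""
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]

dataset = np.zeros((10, 3, 32, 32))  # 10 RGB images of 32x32, layout (N, C, H, W)
batch_sizes = [batch.shape[0] for batch in iterate_minibatches(dataset, 4)]
print(batch_sizes)                   # [4, 4, 2]
```

Each yielded batch is itself a 4-D tensor, so it can be fed to a batched forward pass directly.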

Intuitive understanding of 1D, 2D, and 3D convolutions in convolutional neural networks

Can anyone please clearly explain the difference between 1D, 2D, and 3D convolutions in convolutional neural networks (in deep learning) with the use of examples?
I want to explain this with pictures from C3D.
In a nutshell, the convolution direction and the output shape are what matter!
↑↑↑↑↑ 1D Convolutions - Basic ↑↑↑↑↑
just 1 direction (time axis) to calculate conv
input = [W], filter = [k], output = [W]
ex) input = [1,1,1,1,1], filter = [0.25,0.5,0.25], output = [1,1,1,1,1] (away from the borders; with zero 'SAME' padding the two edge values are 0.75)
output-shape is 1D array
example) graph smoothing
tf.nn.conv1d code Toy Example
import tensorflow as tf
import numpy as np
sess = tf.Session()
ones_1d = np.ones(5)
weight_1d = np.ones(3)
strides_1d = 1
in_1d = tf.constant(ones_1d, dtype=tf.float32)
filter_1d = tf.constant(weight_1d, dtype=tf.float32)
in_width = int(in_1d.shape[0])
filter_width = int(filter_1d.shape[0])
input_1d = tf.reshape(in_1d, [1, in_width, 1])
kernel_1d = tf.reshape(filter_1d, [filter_width, 1, 1])
output_1d = tf.squeeze(tf.nn.conv1d(input_1d, kernel_1d, strides_1d, padding='SAME'))
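For readers without a TF1 session at hand, the smoothing example from the list above can be reproduced with plain NumPy. This is a sketch using np.convolve, not the TensorFlow op; mode="same" zero-pads so the output keeps the input length. For a symmetric filter like this one, convolution and the cross-correlation that tf.nn.conv1d computes give the same result.

```python
import numpy as np

signal = np.ones(5)
kernel = np.array([0.25, 0.5, 0.25])  # smoothing filter from the example above
smoothed = np.convolve(signal, kernel, mode="same")
print(smoothed.shape)  # (5,)
print(smoothed)        # interior values are 1.0, the two edge values are 0.75
```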
↑↑↑↑↑ 2D Convolutions - Basic ↑↑↑↑↑
2-direction (x,y) to calculate conv
output-shape is 2D Matrix
input = [W, H], filter = [k,k] output = [W,H]
example) Sobel Edge Filter
tf.nn.conv2d - Toy Example
ones_2d = np.ones((5,5))
weight_2d = np.ones((3,3))
strides_2d = [1, 1, 1, 1]
in_2d = tf.constant(ones_2d, dtype=tf.float32)
filter_2d = tf.constant(weight_2d, dtype=tf.float32)
in_width = int(in_2d.shape[0])
in_height = int(in_2d.shape[1])
filter_width = int(filter_2d.shape[0])
filter_height = int(filter_2d.shape[1])
input_2d = tf.reshape(in_2d, [1, in_height, in_width, 1])
kernel_2d = tf.reshape(filter_2d, [filter_height, filter_width, 1, 1])
output_2d = tf.squeeze(tf.nn.conv2d(input_2d, kernel_2d, strides=strides_2d, padding='SAME'))
↑↑↑↑↑ 3D Convolutions - Basic ↑↑↑↑↑
3-direction (x,y,z) to calculate conv
output-shape is 3D Volume
input = [W,H,L], filter = [k,k,d] output = [W,H,M]
d < L is important! for making volume output
example) C3D
tf.nn.conv3d - Toy Example
ones_3d = np.ones((5,5,5))
weight_3d = np.ones((3,3,3))
strides_3d = [1, 1, 1, 1, 1]
in_3d = tf.constant(ones_3d, dtype=tf.float32)
filter_3d = tf.constant(weight_3d, dtype=tf.float32)
in_width = int(in_3d.shape[0])
in_height = int(in_3d.shape[1])
in_depth = int(in_3d.shape[2])
filter_width = int(filter_3d.shape[0])
filter_height = int(filter_3d.shape[1])
filter_depth = int(filter_3d.shape[2])
input_3d = tf.reshape(in_3d, [1, in_depth, in_height, in_width, 1])
kernel_3d = tf.reshape(filter_3d, [filter_depth, filter_height, filter_width, 1, 1])
output_3d = tf.squeeze(tf.nn.conv3d(input_3d, kernel_3d, strides=strides_3d, padding='SAME'))
↑↑↑↑↑ 2D Convolutions with 3D input - LeNet, VGG, ..., ↑↑↑↑↑
Even though input is 3D ex) 224x224x3, 112x112x32
output-shape is not 3D Volume, but 2D Matrix
because filter depth = L must match input channels = L
2-direction (x,y) to calculate conv! not 3D
input = [W,H,L], filter = [k,k,L] output = [W,H]
output-shape is 2D Matrix
what if we want to train N filters (N is number of filters)
then output shape is (stacked 2D) 3D = 2D x N matrix.
conv2d - LeNet, VGG, ... for 1 filter
in_channels = 32 # 3 for RGB, 32, 64, 128, ...
ones_3d = np.ones((5,5,in_channels)) # input is 3d, in_channels = 32
# filter must have 3d-shape with in_channels
weight_3d = np.ones((3,3,in_channels))
strides_2d = [1, 1, 1, 1]
in_3d = tf.constant(ones_3d, dtype=tf.float32)
filter_3d = tf.constant(weight_3d, dtype=tf.float32)
in_width = int(in_3d.shape[0])
in_height = int(in_3d.shape[1])
filter_width = int(filter_3d.shape[0])
filter_height = int(filter_3d.shape[1])
input_3d = tf.reshape(in_3d, [1, in_height, in_width, in_channels])
kernel_3d = tf.reshape(filter_3d, [filter_height, filter_width, in_channels, 1])
output_2d = tf.squeeze(tf.nn.conv2d(input_3d, kernel_3d, strides=strides_2d, padding='SAME'))
conv2d - LeNet, VGG, ... for N filters
in_channels = 32 # 3 for RGB, 32, 64, 128, ...
out_channels = 64 # 128, 256, ...
ones_3d = np.ones((5,5,in_channels)) # input is 3d, in_channels = 32
# filter must have 3d-shape x number of filters = 4D
weight_4d = np.ones((3,3,in_channels, out_channels))
strides_2d = [1, 1, 1, 1]
in_3d = tf.constant(ones_3d, dtype=tf.float32)
filter_4d = tf.constant(weight_4d, dtype=tf.float32)
in_width = int(in_3d.shape[0])
in_height = int(in_3d.shape[1])
filter_width = int(filter_4d.shape[0])
filter_height = int(filter_4d.shape[1])
input_3d = tf.reshape(in_3d, [1, in_height, in_width, in_channels])
kernel_4d = tf.reshape(filter_4d, [filter_height, filter_width, in_channels, out_channels])
#output stacked shape is 3D = 2D x N matrix
output_3d = tf.nn.conv2d(input_3d, kernel_4d, strides=strides_2d, padding='SAME')
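To see why a 2D convolution over a multi-channel input collapses the channel axis, here is a loop-based NumPy sketch of the same computation. It assumes stride 1, an odd kernel size and zero 'SAME' padding, and the helper name conv2d_same is made up for this example.

```python
import numpy as np

def conv2d_same(x, kernels):
    """x: (H, W, C_in); kernels: (k, k, C_in, N) with odd k. Returns (H, W, N)."""
    k = kernels.shape[0]
    p = k // 2
    H, W, _ = x.shape
    xp = np.pad(x, ((p, p), (p, p), (0, 0)), mode="constant")
    out = np.zeros((H, W, kernels.shape[-1]))
    for i in range(H):
        for j in range(W):
            window = xp[i:i + k, j:j + k, :]                   # (k, k, C_in)
            # Sum over height, width AND channels: one number per filter.
            out[i, j] = np.tensordot(window, kernels, axes=3)  # (N,)
    return out

in_channels, out_channels = 32, 64
x = np.ones((5, 5, in_channels))
one_filter = conv2d_same(x, np.ones((3, 3, in_channels, 1)))
n_filters = conv2d_same(x, np.ones((3, 3, in_channels, out_channels)))
print(one_filter.shape)  # (5, 5, 1): a single 2D map
print(n_filters.shape)   # (5, 5, 64): N stacked 2D maps
```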
↑↑↑↑↑ Bonus 1x1 conv in CNN - GoogLeNet, ..., ↑↑↑↑↑
1x1 conv is confusing when you think of it as a 2D image filter like Sobel
for 1x1 conv in CNN, input is 3D shape as in the picture above.
it computes a weighted sum across the channels (depth) at each pixel
input = [W,H,L], filter = [1,1,L] output = [W,H]
output stacked shape is 3D = 2D x N matrix.
tf.nn.conv2d - special case 1x1 conv
in_channels = 32 # 3 for RGB, 32, 64, 128, ...
out_channels = 64 # 128, 256, ...
ones_3d = np.ones((5,5,in_channels)) # input is 3d, in_channels = 32
# filter must have 3d-shape x number of filters = 4D; its spatial size is 1x1
weight_4d = np.ones((1,1,in_channels, out_channels))
strides_2d = [1, 1, 1, 1]
in_3d = tf.constant(ones_3d, dtype=tf.float32)
filter_4d = tf.constant(weight_4d, dtype=tf.float32)
in_width = int(in_3d.shape[0])
in_height = int(in_3d.shape[1])
filter_width = int(filter_4d.shape[0])
filter_height = int(filter_4d.shape[1])
input_3d = tf.reshape(in_3d, [1, in_height, in_width, in_channels])
kernel_4d = tf.reshape(filter_4d, [filter_height, filter_width, in_channels, out_channels])
#output stacked shape is 3D = 2D x N matrix
output_3d = tf.nn.conv2d(input_3d, kernel_4d, strides=strides_2d, padding='SAME')
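Because a 1x1 kernel has no spatial extent, the operation reduces to the same matrix multiplication over the channel axis applied at every pixel. A NumPy sketch of that equivalence, using an assumed 5x5 input:

```python
import numpy as np

in_channels, out_channels = 32, 64
x = np.ones((5, 5, in_channels))           # input [W, H, L]
w = np.ones((in_channels, out_channels))   # 1x1 kernel with its spatial dims dropped
out = x @ w                                # channel mixing at every pixel
print(out.shape)                           # (5, 5, 64)
```

The spatial size is unchanged; only the number of channels goes from 32 to 64.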
Animation (2D Conv with 3D-inputs)
Original Link : LINK
The author: Martin Görner
Twitter: @martin_gorner
Bonus 1D Convolutions with 2D input
↑↑↑↑↑ 1D Convolutions with 1D input ↑↑↑↑↑
↑↑↑↑↑ 1D Convolutions with 2D input ↑↑↑↑↑
Even though input is 2D ex) 20x14
output-shape is not 2D, but 1D Matrix
because filter height = L must match input height = L
1-direction (x) to calculate conv! not 2D
input = [W,L], filter = [k,L] output = [W]
output-shape is 1D Matrix
what if we want to train N filters (N is number of filters)
then output shape is (stacked 1D) 2D = 1D x N matrix.
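The same idea one dimension down can be sketched in NumPy: the kernel spans the full input height L, so it only slides along the width. The helper name conv1d_same is an assumption for this example, as are stride 1, an odd kernel width and zero padding.

```python
import numpy as np

def conv1d_same(x, kernels):
    """x: (W, L); kernels: (k, L, N) with odd k. Returns (W, N)."""
    k = kernels.shape[0]
    p = k // 2
    W = x.shape[0]
    xp = np.pad(x, ((p, p), (0, 0)), mode="constant")
    out = np.zeros((W, kernels.shape[-1]))
    for i in range(W):
        # The window covers k steps along the width and all L rows at once.
        out[i] = np.tensordot(xp[i:i + k], kernels, axes=2)
    return out

x = np.ones((20, 14))                           # input [W, L] = 20x14
one_filter = conv1d_same(x, np.ones((3, 14, 1)))
n_filters = conv1d_same(x, np.ones((3, 14, 8)))
print(one_filter.shape)  # (20, 1): a single 1D output
print(n_filters.shape)   # (20, 8): N stacked 1D outputs
```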
Bonus C3D
in_channels = 32 # 3, 32, 64, 128, ...
out_channels = 64 # 3, 32, 64, 128, ...
ones_4d = np.ones((5,5,5,in_channels))
weight_5d = np.ones((3,3,3,in_channels,out_channels))
strides_3d = [1, 1, 1, 1, 1]
in_4d = tf.constant(ones_4d, dtype=tf.float32)
filter_5d = tf.constant(weight_5d, dtype=tf.float32)
in_width = int(in_4d.shape[0])
in_height = int(in_4d.shape[1])
in_depth = int(in_4d.shape[2])
filter_width = int(filter_5d.shape[0])
filter_height = int(filter_5d.shape[1])
filter_depth = int(filter_5d.shape[2])
input_4d = tf.reshape(in_4d, [1, in_depth, in_height, in_width, in_channels])
kernel_5d = tf.reshape(filter_5d, [filter_depth, filter_height, filter_width, in_channels, out_channels])
output_4d = tf.nn.conv3d(input_4d, kernel_5d, strides=strides_3d, padding='SAME')
Input & Output in Tensorflow
Following the answer from @runhani, I am adding a few more details to make the explanation a bit clearer (and of course with examples from TF1 and TF2).
The main additional bits I'm including are:
Emphasis on applications
Usage of tf.Variable
Clearer explanation of inputs/kernels/outputs 1D/2D/3D convolution
The effects of stride/padding
1D Convolution
Here's how you might do 1D convolution using TF 1 and TF 2.
And to be specific my data has following shapes,
1D vector - [batch size, width, in channels] (e.g. 1, 5, 1)
Kernel - [width, in channels, out channels] (e.g. 5, 1, 4)
Output - [batch size, width, out_channels] (e.g. 1, 5, 4)
TF1 example
import tensorflow as tf
import numpy as np
inp = tf.placeholder(shape=[None, 5, 1], dtype=tf.float32)
kernel = tf.Variable(tf.initializers.glorot_uniform()([5, 1, 4]), dtype=tf.float32)
out = tf.nn.conv1d(inp, kernel, stride=1, padding='SAME')
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(out, feed_dict={inp: np.array([[[0],[1],[2],[3],[4]],[[5],[4],[3],[2],[1]]])}))
TF2 Example
import tensorflow as tf
import numpy as np
inp = np.array([[[0],[1],[2],[3],[4]],[[5],[4],[3],[2],[1]]]).astype(np.float32)
kernel = tf.Variable(tf.initializers.glorot_uniform()([5, 1, 4]), dtype=tf.float32)
out = tf.nn.conv1d(inp, kernel, stride=1, padding='SAME')
It's way less work with TF2 as TF2 does not need Session and variable_initializer for example.
What might this look like in real-life?
So let's understand what this is doing using a signal smoothing example. On the left you have the original signal and on the right the output of a 1D convolution which has 3 output channels.
What do multiple channels mean?
Multiple channels are basically multiple feature representations of an input. In this example you have three representations obtained by three different filters. The first channel is the equally-weighted smoothing filter. The second is a filter that weights the middle of the filter more than the boundaries. The final filter does the opposite of the second. So you can see how these different filters bring about different effects.
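The three filters described above can be imitated in a few lines of NumPy. The specific weights below are illustrative guesses rather than the values used for the figure, and the input is a single impulse so that each output channel simply reproduces its filter.

```python
import numpy as np

filters = {
    "equal": np.array([1 / 3, 1 / 3, 1 / 3]),  # equally weighted smoothing
    "middle_heavy": np.array([0.25, 0.5, 0.25]),  # weights the centre more
    "edge_heavy": np.array([0.4, 0.2, 0.4]),      # weights the boundaries more
}
signal = np.array([0.0, 0.0, 1.0, 0.0, 0.0])      # a single impulse

# One output channel per filter, stacked on the last axis: shape (width, channels).
channels = np.stack(
    [np.convolve(signal, f, mode="same") for f in filters.values()], axis=-1
)
print(channels.shape)  # (5, 3)
print(channels)
```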
Deep learning applications of 1D convolution
1D convolution has been successfully used for the sentence classification task.
2D Convolution
Off to 2D convolution. If you are a deep learning person, the chances that you haven't come across 2D convolution are … well, about zero. It is used in CNNs for image classification, object detection, etc., as well as in NLP problems that involve images (e.g. image caption generation).
Let's try an example. I have a convolution kernel with the following filters:
Edge detection kernel (3x3 window)
Blur kernel (3x3 window)
Sharpen kernel (3x3 window)
And to be specific my data has following shapes,
Image (black and white) - [batch_size, height, width, 1] (e.g. 1, 340, 371, 1)
Kernel (aka filters) - [height, width, in channels, out channels] (e.g. 3, 3, 1, 3)
Output (aka feature maps) - [batch_size, height, width, out_channels] (e.g. 1, 340, 371, 3)
TF1 Example,
import tensorflow as tf
import numpy as np
from PIL import Image
im = np.array(Image.open(<some image>).convert('L'))#/255.0
image_height, image_width = im.shape
# kernel shape is [height, width, in channels, out channels] = (3, 3, 1, 3)
kernel_init = np.array(
    [
        [[[-1, 1.0/9, 0]],[[-1, 1.0/9, -1]],[[-1, 1.0/9, 0]]],
        [[[-1, 1.0/9, -1]],[[8, 1.0/9,5]],[[-1, 1.0/9,-1]]],
        [[[-1, 1.0/9,0]],[[-1, 1.0/9,-1]],[[-1, 1.0/9, 0]]]
    ])
inp = tf.placeholder(shape=[None, image_height, image_width, 1], dtype=tf.float32)
kernel = tf.Variable(kernel_init, dtype=tf.float32)
out = tf.nn.conv2d(inp, kernel, strides=[1,1,1,1], padding='SAME')
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    res = sess.run(out, feed_dict={inp: np.expand_dims(np.expand_dims(im,0),-1)})
TF2 Example
import tensorflow as tf
import numpy as np
from PIL import Image
im = np.array(Image.open(<some image>).convert('L'))#/255.0
# add batch and channel axes; cast to float32 to match the kernel dtype
x = np.expand_dims(np.expand_dims(im,0),-1).astype(np.float32)
# kernel shape is [height, width, in channels, out channels] = (3, 3, 1, 3)
kernel_init = np.array(
    [
        [[[-1, 1.0/9, 0]],[[-1, 1.0/9, -1]],[[-1, 1.0/9, 0]]],
        [[[-1, 1.0/9, -1]],[[8, 1.0/9,5]],[[-1, 1.0/9,-1]]],
        [[[-1, 1.0/9,0]],[[-1, 1.0/9,-1]],[[-1, 1.0/9, 0]]]
    ])
kernel = tf.Variable(kernel_init, dtype=tf.float32)
out = tf.nn.conv2d(x, kernel, strides=[1,1,1,1], padding='SAME')
What might this look like in real life?
Here you can see the output produced by the above code. The first image is the original and, going clockwise, you have the outputs of the 1st filter, 2nd filter and 3rd filter.
What do multiple channels mean?
In the context of 2D convolution, it is much easier to understand what these multiple channels mean. Say you are doing face recognition. You can think of each filter as representing an eye, mouth, nose, etc. (this is a very unrealistic simplification, but it gets the point across). Each feature map would then be a binary representation of whether that feature is present in the image you provided. I don't think I need to stress that for a face recognition model those are very valuable features. More information in this article.
This is an illustration of what I'm trying to articulate.
Deep learning applications of 2D convolution
2D convolution is very prevalent in the realm of deep learning.
CNNs (Convolution Neural Networks) use 2D convolution operation for almost all computer vision tasks (e.g. Image classification, object detection, video classification).
3D Convolution
Now it becomes increasingly difficult to illustrate what's going on as the number of dimensions increases. But with a good understanding of how 1D and 2D convolution work, it's very straightforward to generalize that understanding to 3D convolution. So here goes.
And to be specific my data has following shapes,
3D data (LIDAR) - [batch size, height, width, depth, in channels] (e.g. 1, 200, 200, 200, 1)
Kernel - [height, width, depth, in channels, out channels] (e.g. 5, 5, 5, 1, 3)
Output - [batch size, height, width, depth, out_channels] (e.g. 1, 200, 200, 200, 3)
TF1 Example
import tensorflow as tf
import numpy as np
inp = tf.placeholder(shape=[None, 200, 200, 200, 1], dtype=tf.float32)
kernel = tf.Variable(tf.initializers.glorot_uniform()([5,5,5,1,3]), dtype=tf.float32)
out = tf.nn.conv3d(inp, kernel, strides=[1,1,1,1,1], padding='SAME')
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    res = sess.run(out, feed_dict={inp: np.random.normal(size=(1,200,200,200,1))})
TF2 Example
import tensorflow as tf
import numpy as np
x = np.random.normal(size=(1,200,200,200,1)).astype(np.float32)  # match the float32 kernel
kernel = tf.Variable(tf.initializers.glorot_uniform()([5,5,5,1,3]), dtype=tf.float32)
out = tf.nn.conv3d(x, kernel, strides=[1,1,1,1,1], padding='SAME')
Deep learning applications of 3D convolution
3D convolution has been used when developing machine learning applications involving LIDAR (Light Detection and Ranging) data which is 3 dimensional in nature.
What... more jargon?: Stride and padding
Alright, you're nearly there, so hold on. Let's see what stride and padding are. They are quite intuitive if you think about them.
If you stride across a corridor, you get there faster, in fewer steps. But it also means that you observe less of your surroundings than if you had walked across it. Let's now reinforce our understanding with a pretty picture too! Let's understand these via 2D convolution.
Understanding stride
When you use tf.nn.conv2d, for example, you need to set the stride as a vector of 4 elements. There's no reason to be intimidated by this. It just contains the strides in the following order.
2D Convolution - [batch stride, height stride, width stride, channel stride]. Here, batch stride and channel stride you just set to one (I've been implementing deep learning models for 5 years and never had to set them to anything except one). So that leaves you only with 2 strides to set.
3D Convolution - [batch stride, height stride, width stride, depth stride, channel stride]. Here you worry about height/width/depth strides only.
Understanding padding
Now, you notice that no matter how small your stride is (i.e. 1), there is an unavoidable dimension reduction happening during convolution (e.g. the width is 3 after convolving a 4-unit-wide image). This is undesirable, especially when building deep convolutional neural networks. This is where padding comes to the rescue. The two most commonly used padding types are SAME and VALID.
Below you can see the difference.
Final word: If you are very curious, you might be wondering: we just dropped a bomb on the whole automatic dimension reduction, and now we are talking about having different strides. But the best thing about stride is that you control when, where and how the dimensions get reduced.
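How stride and padding interact can be summarized with the output-size rules TensorFlow documents for its two padding modes. The sketch below applies them along one spatial dimension. The function name is hypothetical, and the kernel width of 2 in the first call is an assumption chosen to reproduce the "4 units wide becomes 3" example above.

```python
import math

def conv_output_size(input_size, kernel_size, stride, padding):
    """Output length along one spatial dimension."""
    if padding == "VALID":  # no padding: the kernel must fit entirely inside the input
        return math.ceil((input_size - kernel_size + 1) / stride)
    if padding == "SAME":   # zero padding: the size depends only on the stride
        return math.ceil(input_size / stride)
    raise ValueError("padding must be 'VALID' or 'SAME'")

print(conv_output_size(4, 2, 1, "VALID"))   # 3
print(conv_output_size(4, 2, 1, "SAME"))    # 4
print(conv_output_size(200, 5, 2, "SAME"))  # 100
```

With SAME padding and stride 1 the size is preserved, so any reduction comes only from the stride you choose.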
In summary, In 1D CNN, kernel moves in 1 direction. Input and output data of 1D CNN is 2 dimensional. Mostly used on Time-Series data.
In 2D CNN, kernel moves in 2 directions. Input and output data of 2D CNN is 3 dimensional. Mostly used on Image data.
In 3D CNN, kernel moves in 3 directions. Input and output data of 3D CNN is 4 dimensional. Mostly used on 3D Image data (MRI, CT Scans).
You can find more details here:
CNN 1D,2D, or 3D refers to convolution direction, rather than input or filter dimension.
For 1-channel input, CNN2D is equivalent to CNN1D if kernel length = input length (1 conv direction).
