Choice of Convolution Operator in Spatial Transformation Networks for Point Clouds

In many deep learning models for shape analysis, the input image or shape is first passed through a Spatial Transformer Network (STN) that aligns it to a canonical space, which helps learning and performance. I am considering an STN for my own application, whose inputs are 3D point clouds, so I am building an STN for point clouds.
So I looked at some existing models for reference.
For example, here is the beginning (the localization network part) of the STN in PointNet:
def input_transform_net(point_cloud, is_training, bn_decay=None, K=3):
    """ Input (XYZ) Transform Net, input is BxNx3 gray image
        Transformation matrix of size 3xK """
    batch_size = point_cloud.get_shape()[0].value
    num_point = point_cloud.get_shape()[1].value
    input_image = tf.expand_dims(point_cloud, -1)
    net = tf_util.conv2d(input_image, 64, [1,3],
                         padding='VALID', stride=[1,1],
                         bn=True, is_training=is_training,
                         scope='tconv1', bn_decay=bn_decay)
    net = tf_util.conv2d(net, 128, [1,1],
                         padding='VALID', stride=[1,1],
                         bn=True, is_training=is_training,
                         scope='tconv2', bn_decay=bn_decay)
    net = tf_util.conv2d(net, 1024, [1,1],
                         padding='VALID', stride=[1,1],
                         bn=True, is_training=is_training,
                         scope='tconv3', bn_decay=bn_decay)
    net = tf_util.max_pool2d(net, [num_point,1],
                             padding='VALID', scope='tmaxpool')
    # and more...
Let me try to summarize the steps taken here (I hope I have this right):
1. Expand the input so its dimensions change from BxNx3 --> BxNx3x1.
2. The 1st conv. op. has kernel size [1,3] with 'VALID' padding, so after it the dimensions become BxNx3x1 --> BxNx1x64.
3. The 2nd conv. op. has kernel size [1,1], so after it the dimensions become BxNx1x64 --> BxNx1x128.
4. ... and so on.
Out of curiosity, I am wondering whether the following setting would be equivalent to the one adopted in PointNet (shown above) and in many other models, i.e. whether I can replace Conv2d with Conv1d and write something like:
# Input dimension: BxNx3
# not expanding dims this time
net = point_cloud
# After 1st conv. dims: BxNx3 --> BxNx64
net = conv1d(net, 64, kernel_size=[1], stride=[1], ...)
# After 2nd conv. dims: BxNx64 --> BxNx128
net = conv1d(net, 128, kernel_size=[1], stride=[1], ...)
# ... and it goes on.
Following on from this, I also wonder whether the conv. op. here is actually equivalent to a matrix multiplication, i.e. a fully connected layer (as in TensorFlow) or a linear layer (as in PyTorch):
import torch
# Reshape input dims: BxNx3 --> (B*N)x3
# so that the points of all data entries in the batch are "stacked together".
net = point_cloud.reshape(-1, 3)
# torch.nn.Linear from PyTorch is used here for easy explanation.
fc1 = torch.nn.Linear(3, 64)
fc2 = torch.nn.Linear(64, 128)
# After 1st linear: (B*N)x3 --> (B*N)x64
net = fc1(net)
# After 2nd linear: (B*N)x64 --> (B*N)x128
net = fc2(net)
# ... and it goes on.
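One way to check the equivalence numerically is to implement the three variants naively and compare them. The NumPy sketch below assumes TensorFlow-style layouts (NHWC input, kernel shaped [height, width, in_channels, out_channels]) and cross-correlation without kernel flipping, which is what `tf.nn.conv2d` computes. Batch norm and activations are left out because they act on each point independently anyway. The helpers `conv2d_valid` and `conv1d_valid` are illustrative, not library functions.

```python
import numpy as np

rng = np.random.default_rng(0)
B, N, F = 2, 5, 64                            # batch size, points per cloud, filters
point_cloud = rng.standard_normal((B, N, 3))

def conv2d_valid(x, w, b):
    """Naive 'VALID', stride-1 cross-correlation: x is BxHxWxC, w is khxkwxCxF."""
    kh, kw, _, f = w.shape
    bsz, h, wd, _ = x.shape
    out = np.zeros((bsz, h - kh + 1, wd - kw + 1, f))
    for i in range(h - kh + 1):
        for j in range(wd - kw + 1):
            patch = x[:, i:i + kh, j:j + kw, :]          # B x kh x kw x C
            out[:, i, j, :] = np.tensordot(patch, w, axes=([1, 2, 3], [0, 1, 2]))
    return out + b

def conv1d_valid(x, w, b):
    """Naive 'VALID', stride-1 1-D cross-correlation: x is BxNxC, w is kxCxF."""
    k, _, f = w.shape
    bsz, n, _ = x.shape
    out = np.zeros((bsz, n - k + 1, f))
    for i in range(n - k + 1):
        out[:, i, :] = np.tensordot(x[:, i:i + k, :], w, axes=([1, 2], [0, 1]))
    return out + b

W = rng.standard_normal((3, F))               # one shared 3 -> 64 weight matrix
bias = rng.standard_normal(F)

# 1) PointNet style: expand to BxNx3x1, conv2d with a [1,3] kernel -> BxNx1x64
out_2d = conv2d_valid(point_cloud[..., np.newaxis], W.reshape(1, 3, 1, F), bias)
# 2) conv1d with kernel size 1 on BxNx3 -> BxNx64
out_1d = conv1d_valid(point_cloud, W.reshape(1, 3, F), bias)
# 3) Linear layer on the stacked points: (B*N)x3 -> (B*N)x64
out_fc = point_cloud.reshape(-1, 3) @ W + bias

print(out_2d.shape, out_1d.shape, out_fc.shape)
print(np.allclose(out_2d.reshape(-1, F), out_fc), np.allclose(out_1d.reshape(-1, F), out_fc))
```

The [1,1] layers that follow reduce to the same per-point matrix multiply, so in this part of the network all three settings compute the same shared per-point MLP; it is the max pooling over points where the batch layout matters again.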
I would prefer the linear-layer setting, if it is correct, because the first dimension of the input could then be (N_1+N_2+...+N_B) instead of B*N. That means I would not have to make every data entry the same size.
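With points from different clouds stacked into one (N_1+...+N_B) x C array, the per-point layers work unchanged, but the max pooling over points (`tmaxpool` in the PointNet code) then has to be done per cloud instead of over a fixed N. A minimal NumPy sketch, assuming the clouds are stored contiguously, their sizes are known, and each has at least one point; `segment_max` is a hypothetical helper name:

```python
import numpy as np

def segment_max(features, sizes):
    """Max-pool a stacked sum(sizes) x C array separately for each cloud.

    Assumes every entry of sizes is >= 1 (reduceat misbehaves on empty segments).
    """
    offsets = np.concatenate(([0], np.cumsum(sizes)[:-1]))   # start row of each cloud
    return np.maximum.reduceat(features, offsets, axis=0)    # len(sizes) x C

rng = np.random.default_rng(0)
sizes = [4, 7, 2]                          # N_1, N_2, N_3 points per cloud
stacked = rng.standard_normal((sum(sizes), 3))
W = rng.standard_normal((3, 64))
features = np.maximum(stacked @ W, 0.0)    # shared per-point layer + ReLU: 13 x 64
global_feat = segment_max(features, sizes)
print(global_feat.shape)                   # (3, 64)

# Same result as pooling each cloud on its own
start = 0
for b, n in enumerate(sizes):
    assert np.allclose(global_feat[b], features[start:start + n].max(axis=0))
    start += n
```

Frameworks usually offer an equivalent segment/scatter max operation for this, so the per-cloud loop is only there to confirm the result.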


How to create custom (convolution) connection between two different keras layers

I am implementing a custom connection between two different Keras layers. The neural network begins like this:
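import tensorflow as tf
from tensorflow.keras.layers import Conv2D, AveragePooling2D
stride = 1
model = tf.keras.Sequential()
# c1: 32x32x1 --> 28x28x6 (model.add returns None, so its result is not stored)
model.add(Conv2D(6, kernel_size=[5,5], strides=(stride,stride), padding="valid", input_shape=(32,32,1),
                 activation='tanh'))
# s2: 28x28x6 --> 14x14x6
model.add(AveragePooling2D(pool_size=2, strides=2, padding='valid'))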
Now, the output of s2 has size 14x14x6.
Next, I want to apply my custom connection to the convolutional layer c3, which has an output size of 10x10x16 (that is, 16 filters are applied to s2 of size 14x14x6 to get an output of 10x10x16). For this, I need kernel_size = 5x5, filters = 16, stride = 1 and padding = valid.
However, not all 6 feature maps of s2 are connected to each of the 16 feature maps of c3. The connections are explained here.
For example (from the explanation at the link above), to build the first feature map of C3, you convolve 3 of the input maps (of s2, size 14x14x6) with 5x5 filters, which gives three 10x10 maps that are summed to give the first feature map, of size 10x10.
I read somewhere that we need to use the Functional API to build this, but I am not sure how to proceed. Can someone help me implement this?
My initial attempt is as follows:
from keras.models import Model
from keras.layers import Conv2D, Input, Concatenate, Lambda, Add
inputTensor = Input(shape=(14, 14,6))
stride =1
group0_a = Lambda(lambda x: x[:,:,0])(inputTensor)
group0_b = Lambda(lambda x: x[:,:,1])(inputTensor)
group0_c = Lambda(lambda x: x[:,:,2])(inputTensor) # Take 0,1,2 feature map of s2
conv_group0_a = Conv2D(1, kernel_size=[5,5], strides=(stride,stride), padding="valid", activation = 'tanh')(group0_a)
conv_group0_b = Conv2D(1, kernel_size=[5,5], strides=(stride,stride), padding="valid", activation = 'tanh')(group0_b)
conv_group0_c = Conv2D(1, kernel_size=[5,5], strides=(stride,stride), padding="valid", activation = 'tanh')(group0_c) #Applying convolution on each of 0, 1, 2 feature maps of s2 with distinct kernals
added_0 = Add()([conv_group0_a, conv_group0_b, conv_group0_c]) #adding all the three to get one of the 10*10*16
#Repeat this for 16 neurons of c3 and then finally
output_layer = Concatenate()([]) #concatenate them
Mymodel = Model(inputTensor,output_layer)
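I want to know whether my approach is correct (I know it is not, because I am getting many errors), so I need help recreating the custom connection explained above. Any help is appreciated.
The approach above is correct; the main change is in the slicing. Take each feature map with a range on the channel (last) axis instead of a single index, i.e. group0_a = Lambda(lambda x: x[:, :, :, 0:1])(inputTensor) instead of x[:,:,0]. Indexing with x[:,:,0] selects along the width axis and drops a dimension, while the range 0:1 on the channel axis keeps each group 4-D, (batch, 14, 14, 1), which is the shape Conv2D expects. The Concatenate call must also receive the 16 summed maps (added_0, ..., added_15) rather than an empty list.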
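Another way to see this layer: a sparsely connected convolution is an ordinary 6 -> 16 convolution whose kernel has the unconnected input/output pairs fixed to zero. The NumPy sketch below checks that idea for the first C3 map. The `connections` table is hypothetical: only entry 0 (S2 maps 0, 1, 2) comes from the question, and the rest would have to be filled in from the linked scheme. It also applies tanh after the sum, as in the description; note that the question's code applies tanh inside each branch before the Add, which is not the same thing. In Keras the masked-kernel idea would need a custom layer that multiplies its kernel by the mask; that part is not shown.

```python
import numpy as np

def conv2d_valid(x, w):
    """Naive 'VALID', stride-1 cross-correlation: x is HxWxC, w is khxkwxCxF."""
    kh, kw, _, f = w.shape
    h, wd, _ = x.shape
    out = np.zeros((h - kh + 1, wd - kw + 1, f))
    for i in range(h - kh + 1):
        for j in range(wd - kw + 1):
            out[i, j] = np.tensordot(x[i:i + kh, j:j + kw, :], w, axes=3)
    return out

rng = np.random.default_rng(0)
s2 = rng.standard_normal((14, 14, 6))          # one S2 output, 14x14x6
kernel = rng.standard_normal((5, 5, 6, 16))    # full 6 -> 16 kernel
bias = rng.standard_normal(16)

# Hypothetical connection table: C3 map -> list of S2 maps it reads.
# Only map 0 comes from the question; fill in the remaining 15 entries.
connections = {0: [0, 1, 2]}
mask = np.zeros((6, 16))
for out_map, in_maps in connections.items():
    mask[in_maps, out_map] = 1.0

# Sparse C3 as one dense convolution with a masked kernel, tanh after the sum
c3 = np.tanh(conv2d_valid(s2, kernel * mask) + bias)
print(c3.shape)                                # (10, 10, 16)

# The first map built as the question describes: convolve maps 0, 1, 2
# separately with their own 5x5 kernels, sum, then apply tanh.
parts = [conv2d_valid(s2[:, :, i:i + 1], kernel[:, :, i:i + 1, 0:1]) for i in [0, 1, 2]]
first_map = np.tanh(sum(parts) + bias[0])
print(np.allclose(c3[:, :, 0:1], first_map))
```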

Weights from Conv Layer when applied to image layer gives saturated output

I am visualizing the output of my first convolutional layer on an input image, using the trained weights. However, the visualization gives me white (saturated) images.
Ignore the last four; the filters are 7x7 and there are 32 of them.
The model is built on the following architecture (code attached):
import numpy as np
import tensorflow as tf
import cv2
from matplotlib import pyplot as plt
%matplotlib inline
model_path = "T_set_4/Model/model.ckpt"
# Define the model parameters
# Convolutional Layer 1.
filter_size1 = 7          # Convolution filters are 7 x 7 pixels.
num_filters1 = 32         # There are 32 of these filters.
# Convolutional Layer 2.
filter_size2 = 7          # Convolution filters are 7 x 7 pixels.
num_filters2 = 64         # There are 64 of these filters.
# Fully-connected layer.
fc_size = 512             # Number of neurons in fully-connected layer.
# Define the data dimensions
# The images are 48 pixels in each dimension.
img_size = 48
# Images are stored in one-dimensional arrays of this length.
img_size_flat = img_size * img_size
# Tuple with height and width of images used to reshape arrays.
img_shape = (img_size, img_size)
# Number of colour channels for the images: 1 channel for gray-scale.
num_channels = 1
# Number of classes.
num_classes = 2
def new_weights(shape):
    return tf.Variable(tf.truncated_normal(shape, stddev=0.05))

def new_biases(length):
    return tf.Variable(tf.constant(0.05, shape=[length]))

def new_conv_layer(input,              # The previous layer.
                   num_input_channels, # Num. channels in prev. layer.
                   filter_size,        # Width and height of each filter.
                   num_filters,        # Number of filters.
                   use_pooling=True):  # Use 2x2 max-pooling.
    # Shape of the filter-weights for the convolution.
    # This format is determined by the TensorFlow API.
    shape = [filter_size, filter_size, num_input_channels, num_filters]
    # Create new weights aka. filters with the given shape.
    weights = new_weights(shape=shape)
    # Create new biases, one for each filter.
    biases = new_biases(length=num_filters)
    # Create the TensorFlow operation for convolution.
    # Note the strides are set to 1 in all dimensions.
    # The first and last stride must always be 1,
    # because the first is for the image-number and
    # the last is for the input-channel.
    # But e.g. strides=[1, 2, 2, 1] would mean that the filter
    # is moved 2 pixels across the x- and y-axis of the image.
    # The padding is set to 'SAME' which means the input image
    # is padded with zeroes so the size of the output is the same.
    layer = tf.nn.conv2d(input=input,
                         filter=weights,
                         strides=[1, 1, 1, 1],
                         padding='SAME')
    # Add the biases to the results of the convolution.
    # A bias-value is added to each filter-channel.
    layer += biases
    # Rectified Linear Unit (ReLU).
    # It calculates max(x, 0) for each input pixel x.
    # This adds some non-linearity to the formula and allows us
    # to learn more complicated functions.
    layer = tf.nn.relu(layer)
    # Use pooling to down-sample the image resolution?
    if use_pooling:
        # This is 2x2 max-pooling, which means that we
        # consider 2x2 windows and select the largest value
        # in each window. Then we move 2 pixels to the next window.
        layer = tf.nn.max_pool(value=layer,
                               ksize=[1, 2, 2, 1],
                               strides=[1, 2, 2, 1],
                               padding='SAME')
    # norm1 (computed here but not used below)
    norm1 = tf.nn.lrn(layer, 4, bias=1.0, alpha=0.001 / 9.0, beta=0.75,
                      name='norm1')
    # Note that ReLU is applied before the pooling here. Since
    # relu(max_pool(x)) == max_pool(relu(x)), max-pooling first would give
    # the same result with 75% fewer relu-operations.
    # We return both the resulting layer and the filter-weights
    # because we will plot the weights later.
    return layer, weights
def flatten_layer(layer):
    # Get the shape of the input layer.
    layer_shape = layer.get_shape()
    # The shape of the input layer is assumed to be:
    # layer_shape == [num_images, img_height, img_width, num_channels]
    # The number of features is: img_height * img_width * num_channels
    # We can use a function from TensorFlow to calculate this.
    num_features = layer_shape[1:4].num_elements()
    # Reshape the layer to [num_images, num_features].
    # Note that we just set the size of the second dimension
    # to num_features and the size of the first dimension to -1
    # which means the size in that dimension is calculated
    # so the total size of the tensor is unchanged from the reshaping.
    layer_flat = tf.reshape(layer, [-1, num_features])
    # The shape of the flattened layer is now:
    # [num_images, img_height * img_width * num_channels]
    # Return both the flattened layer and the number of features.
    return layer_flat, num_features

def new_fc_layer(input,          # The previous layer.
                 num_inputs,     # Num. inputs from prev. layer.
                 num_outputs,    # Num. outputs.
                 use_relu=True): # Use Rectified Linear Unit (ReLU)?
    # Create new weights and biases.
    weights = new_weights(shape=[num_inputs, num_outputs])
    biases = new_biases(length=num_outputs)
    # Calculate the layer as the matrix multiplication of
    # the input and weights, and then add the bias-values.
    layer = tf.matmul(input, weights) + biases
    # Use ReLU?
    if use_relu:
        layer = tf.nn.relu(layer)
    return layer
# Create the model
x = tf.placeholder(tf.float32, shape=[None, img_size_flat], name='x')
x_image = tf.reshape(x, [-1, img_size, img_size, num_channels])
y_true = tf.placeholder(tf.float32, shape=[None, num_classes], name='y_true')
y_true_cls = tf.argmax(y_true, axis=1)
# Create the model footprint
layer_conv1, weights_conv1 = new_conv_layer(input=x_image,
                                            num_input_channels=num_channels,
                                            filter_size=filter_size1,
                                            num_filters=num_filters1,
                                            use_pooling=True)
layer_conv2, weights_conv2 = new_conv_layer(input=layer_conv1,
                                            num_input_channels=num_filters1,
                                            filter_size=filter_size2,
                                            num_filters=num_filters2,
                                            use_pooling=True)
layer_flat, num_features = flatten_layer(layer_conv2)
layer_fc1 = new_fc_layer(input=layer_flat,
                         num_inputs=num_features,
                         num_outputs=fc_size,
                         use_relu=True)
layer_fc2 = new_fc_layer(input=layer_fc1,
                         num_inputs=fc_size,
                         num_outputs=num_classes,
                         use_relu=False)
y_pred = tf.nn.softmax(layer_fc2)
y_pred_cls = tf.argmax(y_pred, axis=1)
# Restore the model
saver = tf.train.Saver()
session = tf.Session()
saver.restore(session, model_path)
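When judging whether this network is "too shallow", it can help to trace the tensor shapes and parameter counts by hand. A small pure-Python sketch, assuming both conv layers use 2x2 pooling with stride 2 and 'SAME' padding as in the code above (with 48x48 inputs the padding choice does not change the sizes):

```python
img_size, num_channels = 48, 1
filter_size1, num_filters1 = 7, 32
filter_size2, num_filters2 = 7, 64
fc_size, num_classes = 512, 2

def same_out(size, stride):
    """Spatial size after a 'SAME'-padded op: ceil(size / stride)."""
    return -(-size // stride)

h = img_size
h = same_out(h, 1)    # conv1, stride 1: 48
h = same_out(h, 2)    # 2x2 max-pool, stride 2: 24
h = same_out(h, 1)    # conv2, stride 1: 24
h = same_out(h, 2)    # 2x2 max-pool, stride 2: 12
num_features = h * h * num_filters2    # what flatten_layer computes: 12 * 12 * 64

# Weights + biases per layer
params = {
    "conv1": filter_size1 ** 2 * num_channels * num_filters1 + num_filters1,
    "conv2": filter_size2 ** 2 * num_filters1 * num_filters2 + num_filters2,
    "fc1": num_features * fc_size + fc_size,
    "fc2": fc_size * num_classes + num_classes,
}
print(num_features, params, sum(params.values()))
```

Almost all of the roughly 4.8 million parameters sit in fc1, while conv1 has only 1,600, so conv1's filters can learn slowly relative to the rest of the model.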
To visualize the weights, I followed the code from:
Source code
Can someone tell me whether the training is insufficient or the network is too shallow?
This is a perfectly fine visualization of the feature maps (not the weights) produced by the first convolutional layer in the early stages of training.
The first layers learn to extract simple features. Learning is somewhat slow, so at first the network learns to "blur" the input images; once it starts to converge, you'll see the first layers extract meaningful low-level features (edges and so on).
Just monitor the training process and let the network train a bit longer.
If, instead, you get bad performance (always look at the validation accuracy), your feature maps will keep looking noisy, and you should start tuning the hyperparameters (lowering the learning rate, adding regularization, ...) in order to extract meaningful features and get good results.
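When plotting feature maps it is also worth ruling out display saturation. After ReLU the maps are non-negative floats with an arbitrary range, and a viewer that assumes floats lie in [0, 1] shows everything above 1.0 as white; `cv2.imshow`, for example, multiplies float pixel values by 255. A minimal NumPy sketch that rescales each map to [0, 1] independently before display; `normalize_maps` is a hypothetical helper, and the random array only stands in for one image's `layer_conv1` output (24x24x32 after pooling):

```python
import numpy as np

def normalize_maps(maps, eps=1e-8):
    """Rescale each feature map of an H x W x C array to [0, 1] independently."""
    lo = maps.min(axis=(0, 1), keepdims=True)
    hi = maps.max(axis=(0, 1), keepdims=True)
    return (maps - lo) / (hi - lo + eps)

rng = np.random.default_rng(0)
# Stand-in for one image's layer_conv1 output, with values well above 1.0
maps = np.maximum(rng.normal(3.0, 2.0, size=(24, 24, 32)), 0.0)
scaled = normalize_maps(maps)
print(scaled.min(), scaled.max())
```

Per-map scaling makes each map's own contrast visible, at the cost of hiding how strongly the maps respond relative to one another.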

How do I share weights across Parallel-streams?

Is there a way to share weights across parallel streams of a torch-model?
For example, I have the following model.
require 'nn'
mlp = nn.Sequential();
c = nn.Parallel(1,2)       -- Parallel container will associate a module to each slice of dimension 1
                           -- (row space), and concatenate the outputs over the 2nd dimension.
for i=1,10 do              -- Add 10 Linear+Reshape modules in parallel (input = 3, output = 2x1)
  local t=nn.Sequential()
  t:add(nn.Linear(3,2))    -- Linear module (input = 3, output = 2)
  t:add(nn.Reshape(2,1))   -- Reshape 1D Tensor of size 2 to 2D Tensor of size 2x1
  c:add(t)
end
mlp:add(c)                 -- Add the Parallel container to the Sequential container
Now I want the nn.Linear layers for different values of i to share everything (weights, bias and gradients), e.g. the nn.Linear(3,2) of stream 1 with the nn.Linear(3,2) of stream 9. What options do I have to share those?
Or is it better to use a different container or approach?
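You can create the module that will be repeated once:
t = nn.Sequential()
t:add(nn.Linear(3,2))
t:add(nn.Reshape(2,1))
Then use Torch's clone function with additional arguments, which makes each clone share the listed parameters with t. To share the gradients as well, include gradWeight and gradBias:
mlp = nn.Sequential()
c = nn.Parallel(1,2)
for i = 1, 10 do
  c:add(t:clone('weight', 'bias', 'gradWeight', 'gradBias'))
end
mlp:add(c)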
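Sharing gradWeight and gradBias matters because the gradient of a shared weight is the sum of the gradients contributed by every stream; with shared gradient buffers, each clone accumulates into the same tensor during backward. A NumPy sketch of that fact, assuming ten streams that each apply the same 3 -> 2 linear map to one row of a 10x3 input (mirroring the Parallel(1,2) example without the Reshape) and an arbitrary loss 0.5 * sum(y^2):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((10, 3))       # row i is the input of stream i
W = rng.standard_normal((2, 3))        # one Linear(3, 2) shared by all streams
b = rng.standard_normal(2)

def loss(W, b):
    Y = X @ W.T + b                    # stream i computes W @ X[i] + b
    return 0.5 * np.sum(Y ** 2)

# Gradient each stream would compute for its own copy of the layer
Y = X @ W.T + b
per_stream_gradW = [np.outer(Y[i], X[i]) for i in range(10)]
per_stream_gradb = [Y[i] for i in range(10)]

# With shared parameters, the total gradient is the sum over the streams
gradW = np.sum(per_stream_gradW, axis=0)
gradb = np.sum(per_stream_gradb, axis=0)

# Central finite-difference check on one shared weight
eps = 1e-6
W_up, W_down = W.copy(), W.copy()
W_up[1, 2] += eps
W_down[1, 2] -= eps
numeric = (loss(W_up, b) - loss(W_down, b)) / (2 * eps)
print(np.isclose(numeric, gradW[1, 2]))
```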

Transposed convolution on feature maps using Theano

I asked a similar question on Cross Validated about interpreting the images; I'm moving the detailed question here to include some code details.
The results I'm getting are not what I expected, so maybe you have faced this issue before and can help me figure it out.
It is a fully convolutional neural network (no fully connected part).
Training part
First, the images are transposed to match the input layout of the convolution function, (batch_no, img_channels, height, width):
input.transpose(0, 3, 1, 2)
Training uses a learning rate of 3e-6, He uniform initialization and Nesterov momentum for 500 epochs, until it converged at:
Training cost: 1.602449
Training loss: 4.610442
validation error: 5.126761
Test loss: 5.885714
Backward part
Loading Image
jpgfile = np.array(,img_name)))
Reshape to one batch
batch = jpgfile.reshape(1, jpgfile.shape[0], jpgfile.shape[1], 3)
Run the model to extract the first-layer feature maps after the ReLU activation:
output = classifier.layer0.output
Test_model = theano.function(
layer_Fmaps = Test_model(test_set_x)
Apply the backward model to reconstruct the image using only the first activated feature map:
bch, ch, row, col = layer_Fmaps.shape
output_grad_reshaped = layer_Fmaps.reshape((-1, 1, row, col))
output_grad_reshaped = output_grad_reshaped[0].reshape(1,1,row,col)
input_shape = (1, 3, 226, 226)
W = classifier.layer0.W.get_value()[0].reshape(1,3,7,7)
kernel = theano.shared(W)
inp = T.tensor4('inp')
deconv_out = T.nnet.abstract_conv.conv2d_grad_wrt_inputs(
output_grad = inp,
input_shape= input_shape,
f = theano.function(
inputs = [inp],
outputs= deconv_out)
f_out = f(output_grad_reshaped)
deconved_relu = T.nnet.relu(f_out)[0].transpose(1,2,0)
deconved = f_out[0].transpose(1,2,0)
There are two resulting images: the first is the transposed-convolution output without activation, and the second is with ReLU applied, since the kernels may have some negative weights.
It is clear from the transposed-convolution image that this kernel has learned to detect a useful feature of this image. But the reconstruction breaks the image's color scheme during the transposed convolution. It might be because the pixel values are small floating-point numbers. Do you see where the problem is?
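For reference, `conv2d_grad_wrt_inputs` computes a transposed (adjoint) convolution: every activation spreads a copy of the kernel, scaled by that activation, back onto the input grid. The NumPy sketch below implements this for one feature map and a 3-channel 7x7 kernel and checks the adjoint identity <conv(x), z> = <x, conv_T(z)>. It assumes stride 1, no padding and cross-correlation; Theano's conv2d flips filters by default (`filter_flip=True`), which changes which kernel entry lands where but not the structure. The helper names are illustrative, and a 32x32 random array stands in for the 226x226 image.

```python
import numpy as np

def conv_valid(x, w):
    """Forward: x is C x H x W, w is C x k x k -> one (H-k+1) x (W-k+1) map."""
    k = w.shape[1]
    _, h, wd = x.shape
    out = np.zeros((h - k + 1, wd - k + 1))
    for i in range(h - k + 1):
        for j in range(wd - k + 1):
            out[i, j] = np.sum(x[:, i:i + k, j:j + k] * w)
    return out

def conv_transpose(y, w):
    """Adjoint of conv_valid: scatter y[i, j] * w back onto a C x H x W grid."""
    c, k, _ = w.shape
    oh, ow = y.shape
    x = np.zeros((c, oh + k - 1, ow + k - 1))
    for i in range(oh):
        for j in range(ow):
            x[:, i:i + k, j:j + k] += y[i, j] * w
    return x

rng = np.random.default_rng(0)
w = rng.standard_normal((3, 7, 7))       # one first-layer kernel, 3 x 7 x 7
x = rng.standard_normal((3, 32, 32))     # small stand-in for a 3-channel image
y = np.maximum(conv_valid(x, w), 0.0)    # feature map after ReLU: 26 x 26
recon = conv_transpose(y, w)             # back to 3 x 32 x 32
print(recon.shape)

# Adjoint check with an arbitrary map z
z = rng.standard_normal(y.shape)
print(np.isclose(np.sum(conv_valid(x, w) * z), np.sum(x * conv_transpose(z, w))))
```

Because the reconstruction is a weighted sum of kernel copies, each channel's range is set by the kernel weights and activations rather than by the original pixel range, so it generally needs rescaling per channel before it can be viewed as a color image.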

How to apply different cost functions to different output channels of a convolutional network?

I have a convolutional neural network whose output is a 4-channel 2D image. I want to apply a sigmoid activation to the first two channels and then use BCECriterion to compute the loss between the produced images and the ground-truth ones. I want to apply a squared loss to the last two channels, and finally compute the gradients and do backprop. I would also like to multiply the squared loss of each of the last two channels by a chosen scalar.
So the cost has the following form:
cost = crossEntropyCh[{1, 2}] + l1 * squaredLossCh_3 + l2 * squaredLossCh_4
The way I'm thinking about doing this is as follows:
criterion1 = nn.BCECriterion()
criterion2 = nn.MSECriterion()
error = criterion1:forward(model.output[{{}, {1, 2}}], groundTruth1) + l1 * criterion2:forward(model.output[{{}, {3}}], groundTruth2) + l2 * criterion2:forward(model.output[{{}, {4}}], groundTruth3)
However, I don't think this is the correct way of doing it, since I would have to do 3 separate backprop steps, one for each cost term. Can anyone suggest a better way to do this in Torch?
SplitTable and ParallelCriterion might be helpful for your problem.
Follow your current output layer with nn.SplitTable, which splits your output channels and converts your output tensor into a table. You can then combine different criteria with ParallelCriterion, so that each criterion is applied to the corresponding entry of the output table.
For details, I suggest reading the Torch documentation on table layers.
After the comments, I added the following code segment, which solves the original question.
require 'nn'
M = 100
C = 4
H = 64
W = 64
l1, l2 = 1, 1  -- weights of the two squared-loss terms; set these to your own values
dataIn = torch.rand(M, C, H, W)
layerOfTables = nn.Sequential()
-- Because SplitTable discards the dimension it is applied on, we insert
-- an additional dimension: MxCxHxW --> MxCx1xHxW.
layerOfTables:add(nn.Reshape(C, 1, H, W, true))
-- We want to split over the second dimension (i.e. channels), which gives
-- a table of C tensors of size Mx1xHxW.
layerOfTables:add(nn.SplitTable(2, 5))
-- We use ConcatTable in order to create paths accessing the data for
-- several criteria. Each branch of the ConcatTable will
-- have access to the data (i.e. the output table).
criterionPath = nn.ConcatTable()
-- Starting from offset 1, NarrowTable will select 2 elements. Since you
-- want to use this portion as a 2-channel input, we need to combine
-- them by using JoinTable. Without JoinTable, the output would again be a
-- table with 2 elements.
criterionPath:add(nn.Sequential():add(nn.NarrowTable(1, 2)):add(nn.JoinTable(2)))
-- SelectTable is a simplified version of NarrowTable; it fetches the desired element.
criterionPath:add(nn.SelectTable(3))
criterionPath:add(nn.SelectTable(4))
layerOfTables:add(criterionPath)
-- Here goes the criterion container. You can use this as if it were a regular
-- criterion (please see the examples on the documentation page).
criterionContainer = nn.ParallelCriterion()
criterionContainer:add(nn.BCECriterion())
criterionContainer:add(nn.MSECriterion(), l1)
criterionContainer:add(nn.MSECriterion(), l2)
Since I used almost every possible table operation, it looks a little messy, but this is the only way I could solve the problem. I hope it helps you and others facing the same problem. This is what the result looks like:
dataOut = layerOfTables:forward(dataIn)
1 : DoubleTensor - size: 100x2x64x64
2 : DoubleTensor - size: 100x1x64x64
3 : DoubleTensor - size: 100x1x64x64
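A single backward pass is enough because the total cost is a weighted sum: its gradient with respect to the network output is the same weighted sum of the three terms' gradients, each filling its own channels of one gradient tensor. The NumPy sketch below (not Torch code) assumes channels 1-2 (0 and 1 here) already went through a sigmoid, uses mean reductions, and checks the combined gradient against finite differences; `combined_cost_and_grad` is a hypothetical helper.

```python
import numpy as np

def combined_cost_and_grad(out, t_bce, t3, t4, l1, l2, eps=1e-12):
    """out: M x 4 x H x W output; channels 0-1 are already sigmoid probabilities."""
    p = out[:, 0:2]
    bce = -np.mean(t_bce * np.log(p + eps) + (1 - t_bce) * np.log(1 - p + eps))
    mse3 = np.mean((out[:, 2:3] - t3) ** 2)
    mse4 = np.mean((out[:, 3:4] - t4) ** 2)
    cost = bce + l1 * mse3 + l2 * mse4

    # One gradient tensor for the whole output: each term fills its own channels
    grad = np.zeros_like(out)
    grad[:, 0:2] = (p - t_bce) / (p * (1 - p) + eps) / p.size
    grad[:, 2:3] = l1 * 2 * (out[:, 2:3] - t3) / t3.size
    grad[:, 3:4] = l2 * 2 * (out[:, 3:4] - t4) / t4.size
    return cost, grad

rng = np.random.default_rng(0)
M, H, W = 2, 4, 4
out = rng.uniform(0.1, 0.9, size=(M, 4, H, W))
t_bce = (rng.uniform(size=(M, 2, H, W)) > 0.5).astype(float)
t3 = rng.standard_normal((M, 1, H, W))
t4 = rng.standard_normal((M, 1, H, W))
l1, l2 = 0.5, 2.0

cost, grad = combined_cost_and_grad(out, t_bce, t3, t4, l1, l2)

# Central finite-difference check on one element of each channel group
h = 1e-6
for idx in [(0, 1, 2, 3), (1, 2, 0, 0), (0, 3, 1, 1)]:
    up, down = out.copy(), out.copy()
    up[idx] += h
    down[idx] -= h
    numeric = (combined_cost_and_grad(up, t_bce, t3, t4, l1, l2)[0]
               - combined_cost_and_grad(down, t_bce, t3, t4, l1, l2)[0]) / (2 * h)
    assert np.isclose(numeric, grad[idx], rtol=1e-5, atol=1e-8)
print(cost)
```

In the Torch setup above, ParallelCriterion:backward should return the per-entry gradients as a table, which layerOfTables:backward can turn back into one gradient tensor for a single model:backward call.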
