Fractional Max Pooling in Tensorflow - machine-learning

When using the function tf.nn.fractional_max_pool in TensorFlow, in addition to the pooled output tensor, it also returns a row_pooling_sequence and a col_pooling_sequence, which I presume are used in backpropagation to compute the gradient. This is in contrast to normal $2 \times 2$ max pooling, which just returns the pooled tensor.
My question is: do we have to handle the row_pooling_sequence and col_pooling_sequence values ourselves? How would we include them in a network to get backpropagation working properly? I modified a simple convolutional neural network to use fractional max pooling instead of $2 \times 2$ max pooling without making use of these values, and the results were much poorer, leading me to believe we must handle them explicitly.
Here's the relevant portion of my code that makes use of the FMP:
def add_layer_ops_FMP(conv_func, x_input, W, keep_prob_layer, training_phase):
    h_conv = conv_func(x_input, W, stride_l = 1)
    h_BN = batch_norm(h_conv, training_phase, epsilon)
    h_elu = tf.nn.elu(h_BN) # Rectified unit layer - change accordingly

    def dropout_no_training(h_elu=h_elu):
        return dropout_op(h_elu, keep_prob = 1.0)

    def dropout_in_training(h_elu=h_elu, keep_prob_layer=keep_prob_layer):
        return dropout_op(h_elu, keep_prob = keep_prob_layer)

    h_drop = tf.cond(training_phase, dropout_in_training, dropout_no_training)
    h_pool, row_pooling_sequence, col_pooling_sequence = tf.nn.fractional_max_pool(h_drop) # FMP layer. See Ben Graham's paper
    return h_pool
Link to function on github.

Do we need to handle row_pooling_sequence and col_pooling_sequence?
Even though the tf.nn.fractional_max_pool documentation says it returns 2 extra tensors which are needed to calculate the gradient, I believe we do not need to handle these 2 extra tensors specially or add them to the gradient calculation operation. The backpropagation of tf.nn.fractional_max_pool in TensorFlow is already registered in the gradient calculation flow by the _FractionalMaxPoolGrad function. As you can see in _FractionalMaxPoolGrad, the row_pooling_sequence and col_pooling_sequence are extracted via op.outputs[1] and op.outputs[2] and used to calculate the gradient.
@ops.RegisterGradient("FractionalMaxPool")
def _FractionalMaxPoolGrad(op, grad_0, unused_grad_1, unused_grad_2):
    """..."""
    return gen_nn_ops._fractional_max_pool_grad(op.inputs[0], op.outputs[0],
                                                grad_0, op.outputs[1],
                                                op.outputs[2],
                                                op.get_attr("overlapping"))
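To see this in practice, here is a minimal sketch (TF 1.x style; the shapes and pooling_ratio are just illustrative) showing that a loss built only on the pooled output already gets a gradient through the op, without touching the two sequence tensors:

import tensorflow as tf

# Illustrative shapes; pooling_ratio must start and end with 1.0.
x = tf.placeholder(tf.float32, [8, 28, 28, 16])
h_pool, row_seq, col_seq = tf.nn.fractional_max_pool(
    x, pooling_ratio=[1.0, 1.44, 1.44, 1.0])

# Only the pooled output enters the loss; row_seq and col_seq are never used here.
loss = tf.reduce_mean(h_pool)

# tf.gradients works because _FractionalMaxPoolGrad is already registered;
# it fetches op.outputs[1] and op.outputs[2] internally.
grads = tf.gradients(loss, [x])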
Possible reasons for poorer performance after using fractional_max_pool (in my personal opinion).
In the fractional max pooling paper, the author used fractional max pooling in a spatially-sparse convolutional network. According to his spatially-sparse convolutional network design, he actually extended the input image's spatial size by padding it with zeros. Additionally, fractional max pooling downsizes the input by a factor of pooling_ratio, which is often less than 2. These two together allow stacking more convolutional layers than regular max pooling does, and hence building a deeper network. (For example, with the CIFAR-10 dataset, the (non-padded) input spatial size is 32x32, and the spatial size drops to 4x4 after 3 convolutional layers and 3 max pooling operations. Using fractional max pooling with pooling_ratio=1.4, the spatial size drops to 4x4 only after 6 convolutional and 6 fractional max pooling layers.)

I experimented with building a CNN with 2 conv layers + 2 pooling layers (regular max pool vs. fractional max pool with pooling_ratio=1.47) + 2 fully-connected layers on the MNIST dataset. The one using regular max pooling produced better performance than the one using fractional max pooling (the latter was 15~20% worse). Comparing the spatial sizes before feeding into the fully connected layers, the model with regular max pooling has a spatial size of 7x7, while the one with fractional max pooling has a spatial size of 12x12. Adding one more conv + fractional_max_pool to the latter model (so the final spatial size dropped to 8x8) improved its performance to a level more comparable with the former model using regular max pooling.
In summary, I personally think the good performance in the Fractional Max-Pooling paper is achieved by the combination of a spatially-sparse CNN with fractional max-pooling and small filters (and network-in-network), which enables building a deep network even when the input image's spatial size is small. Hence, in a regular CNN, simply replacing regular max pooling with fractional max pooling does not necessarily give you better performance.

Related

Initial bias values for a neural network

I am currently building a CNN in tensorflow and I am initialising my weight matrix using a He normal weight initialisation. However, I am unsure how I should initialise my bias values. I am using ReLU as my activation function between each convolutional layer. Is there a standard method for initialising bias values?
import numpy as np
import tensorflow as tf

# Define approximate Xavier weight initialization (with the ReLU correction described by He)
def xavier_over_two(shape):
    std = np.sqrt(2.0 / (shape[0] * shape[1] * shape[2]))  # He: sqrt(2 / fan_in)
    return tf.random_normal(shape, stddev=std)

def bias_init(shape):
    return #???
Initializing the biases. It is possible and common to initialize the
biases to be zero, since the asymmetry breaking is provided by the
small random numbers in the weights. For ReLU non-linearities, some
people like to use small constant value such as 0.01 for all biases
because this ensures that all ReLU units fire in the beginning and
therefore obtain and propagate some gradient. However, it is not clear
if this provides a consistent improvement (in fact some results seem
to indicate that this performs worse) and it is more common to simply
use 0 bias initialization.
source: http://cs231n.github.io/neural-networks-2/
Be aware of the specific case of the last layer's bias. As Andrej Karpathy explains in his Recipe for Training Neural Networks:
init well. Initialize the final layer weights correctly. E.g. if you are regressing some values that have a mean of 50 then initialize the final bias to 50. If you have an imbalanced dataset of a ratio 1:10 of positives:negatives, set the bias on your logits such that your network predicts probability of 0.1 at initialization. Setting these correctly will speed up convergence and eliminate “hockey stick” loss curves where in the first few iteration your network is basically just learning the bias.
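Putting both pieces of advice together, a minimal sketch of bias_init (the log-odds value for the imbalanced 1:10 case is illustrative):

import numpy as np
import tensorflow as tf

def bias_init(shape):
    # Zero initialization; asymmetry breaking already comes from the random weights.
    return tf.Variable(tf.zeros(shape))

# Hypothetical final-layer bias for a 1:10 positives:negatives problem:
# choose b so that sigmoid(b) = 0.1, i.e. b = log(0.1 / 0.9) ~ -2.2.
final_bias = tf.Variable(tf.constant(np.log(0.1 / 0.9), dtype=tf.float32, shape=[1]))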

How do I decide or count number of hidden/tunable parameters in my design?

For my deep learning assignment I need to design an image classification network. There is a constraint in the assignment: I can have at most 500,000 hidden/tunable parameters in this design.
How can I count or observe the number of these hidden parameters, especially if I am using this TensorFlow tutorial as the initial code/design?
Thanks in advance
How can I count or observe the number of these hidden parameters, especially if I am using this TensorFlow tutorial as the initial code/design?
Instead of doing the work for you, I'll show you how to count free parameters.
Glancing quickly, it looks like the cifar10 code uses layers of max pooling, convolutions, biases, and fully connected weights. Let's review how many free parameters each of these layer types adds to your architecture.
max pooling : FREE! That's right, there are no "free parameters" from max pooling.
conv : Convolution weights are defined by a shape like [3, 3, 1, 32], where the numbers correspond to [FILTER_HEIGHT, FILTER_WIDTH, IN_CHANNELS, OUT_CHANNELS]. Multiply all the dimension sizes together to find the total number of free parameters. In the case of [3, 3, 1, 32], the total is 3x3x1x32 = 288.
bias : A Bias is similar to convolutions in that it is defined by a shape like [10] or [1,342,342,3]. Same thing, just multiply all dimension sizes together to get the total free parameters. Sometimes a bias is just a single number, which means a size of 1.
fully connected : A fully connected layer usually has a 2d shape like [1024,32]. This means that it is a 2d matrix, and you calculate the total free parameters just like the convolution. In this example [1024,32] has 1024x32 = 32,768 free parameters.
Finally you add up all the free parameters from all the layers and that is your total number of free parameters.
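If you would rather have TensorFlow do the bookkeeping, here is a small sketch (assuming a TF 1.x graph, such as the cifar10 tutorial model, has already been built) that sums the sizes of all trainable variables:

import numpy as np
import tensorflow as tf

def count_free_parameters():
    # Every trainable variable (conv filters, biases, fully connected weights)
    # contributes the product of its dimension sizes.
    return sum(int(np.prod(v.get_shape().as_list()))
               for v in tf.trainable_variables())

# After building the model graph:
# print("free parameters:", count_free_parameters())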
500,000 parameters? Are you using the R, G and B values of each pixel as inputs? If so, there are some problems:
1. Too much data (long computation time).
2. In image classification, companies always apply some other image analysis technique (preprocessing) before feeding the data into the NN. If you have two identical images and the second is shifted by one pixel, they can look very different to the network.
Imagine another neural network that uses two input parameters, say weight and height; what would happen if you swapped those parameters?
Yes, during training an image network can reduce this effect, but when I ran experiments with 5x5 binary images this was very hard for the network. I started using 4 layers, but that helped only a little.
The images used for training can be classified well, and distorted ones too, but shift an image by one pixel and you have a problem.
If not, run experiments or use a genetic algorithm to find a suitable size.
After training, you should use some algorithm to find inputs the network treats as "not important" (a big difference between the weights for this input and the rest; if an input's weights are too close to 0, the network "thinks" it is not important).

Keras VGG16 lower level features extraction

I am pulling lower-level features from the VGG16 model included as a Keras application. These features are exported as separate outputs of the pre-trained network and fed to an add-on classifier. The conceptual idea was borrowed from Multi-scale recognition with DAG-CNNs.
Using the model without the classifier top, features at the highest level are extracted from the block_5 pooling layer using Flatten(): block_05 = Flatten(name='block_05')(block5_pool). This gives an output vector with dimension 8192. Flatten(), however, does not work on lower pooling layers as the dimensions get too large (memory issues). Instead, lower pooling layers (or any other layer) can be extracted using GlobalAveragePooling2D(): block_04 = GlobalAveragePooling2D(name='block_04')(block4_pool). The problem with this approach, however, is that the dimension of the feature vector shrinks rapidly the lower you go: block_4 (512), block_3 (256), block_2 (128), block_1 (64).
What would be a suitable layer or set-up to retain more feature data from deeper layers?
For info, the output of the model looks like this, the add-on classifier has a corresponding number of inputs.
# Create model, output data in reverse order from top to bottom
model = Model(input=img_input, output=[block_05,  # ch_00, layer 17, dim 8192
                                       block_04,  # ch_01, layer 13, dim 512
                                       block_03,  # ch_02, layer 9, dim 256
                                       block_02,  # ch_03, layer 5, dim 128
                                       block_01]) # ch_04, layer 2, dim 64
The memory error you mentioned comes from flattening a huge array, which makes the number of units extremely large. What you actually need to do is downsample your input in a smart way. I will present some ways to do this (a rough sketch of both follows below):
MaxPooling: by simple use of pooling, you could first downsample your feature maps and then Flatten them. The main advantage of this approach is its simplicity and the lack of additional parameters. The main disadvantage: it can be a rather crude method.
Intelligent downsampling: here you could add a Convolution2D layer with heavy subsampling (e.g. with filter size (4, 4) and subsample (4, 4)). This can be considered a form of learned pooling. The main disadvantage of this method is the additional parameters it requires.
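A rough sketch of both options applied to block4_pool, using the functional API from the question (layer names are illustrative; I use the newer Conv2D/strides spelling rather than Convolution2D/subsample):

from keras.layers import Conv2D, Flatten, MaxPooling2D

# Option 1: plain max pooling, then flatten (no extra parameters).
block_04 = MaxPooling2D(pool_size=(4, 4), name='block_04_pool')(block4_pool)
block_04 = Flatten(name='block_04')(block_04)

# Option 2: "intelligent" downsampling with a strided convolution
# (learns how to compress the feature maps, at the cost of extra parameters).
block_04_alt = Conv2D(256, (4, 4), strides=(4, 4), name='block_04_conv')(block4_pool)
block_04_alt = Flatten(name='block_04_alt')(block_04_alt)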

feature number in tensorflow tf.nn.conv2d

In the Tensorflow example "Deep MNIST for Experts" https://www.tensorflow.org/get_started/mnist/pros
I am not clear on how the number of features specified in the weight variables is determined.
For example:
We can now implement our first layer. It will consist of convolution,
followed by max pooling. The convolution will compute 32 features for
each 5x5 patch.
W_conv1 = weight_variable([5, 5, 1, 32])
Why is 32 picked here?
In order to build a deep network, we stack several layers of this
type. The second layer will have 64 features for each 5x5 patch.
W_conv2 = weight_variable([5, 5, 32, 64])
Again, why is 64 picked?
Now that the image size has been reduced to 7x7, we add a
fully-connected layer with 1024 neurons to allow processing on the
entire image.
W_fc1 = weight_variable([7 * 7 * 64, 1024])
Why 1024 here?
Thanks
Each of these filters will actually do something, like check for edges, check for colour changes, right-shift or left-shift the image, sharpen, blur, etc.
Each of these filters actually works on extracting the meaning of the image by sharpening, enhancing, smoothing, intensifying, etc.
For example, check this link, which explains the meaning of these filters:
http://setosa.io/ev/image-kernels/
So all these filters are actually neurons where the output will be max-pooled and eventually fed into a FC layer after some activation.
If you are just looking to understand the filters, that is a different pursuit. However, if you are looking to learn how conv architectures work, these are tried and tested filter counts for this dataset, so you should just go with them for now.
The filters also learn through Backprop.
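To make the "check for edges" idea concrete, here is a small sketch applying one hand-made 3x3 edge-detection kernel with tf.nn.conv2d (the kernel values are illustrative; in the MNIST example the 32 and 64 kernels start random and are learned by backprop instead):

import tensorflow as tf

# conv2d filters have shape [filter_height, filter_width, in_channels, out_channels].
edge_kernel = tf.reshape(tf.constant([[-1., -1., -1.],
                                      [-1.,  8., -1.],
                                      [-1., -1., -1.]]), [3, 3, 1, 1])

# A batch of grayscale images, shape [batch, height, width, 1].
images = tf.placeholder(tf.float32, [None, 28, 28, 1])
edges = tf.nn.conv2d(images, edge_kernel, strides=[1, 1, 1, 1], padding='SAME')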
32 and 64 are the numbers of filters in the respective layers.
1024 is the number of output neurons in the fully connected layer.
Your question basically is about the reason behind the choice of these hyperparameters.
There is no mathematical or programming reason behind these specific choices. These have been picked up after experiments as they delivered a good accuracy over MNIST dataset.
You can change these numbers and that is one way by which you can modify a model.
Unfortunately, there is no deeper justification for these particular choices to be found in TensorFlow's documentation or elsewhere in the literature.
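What you can check is how these choices translate into parameter counts; multiplying out the weight shapes quoted in the question (biases and the final classification layer omitted) gives:

# Free parameters contributed by the quoted weight tensors
conv1 = 5 * 5 * 1 * 32       # 800
conv2 = 5 * 5 * 32 * 64      # 51,200
fc1   = 7 * 7 * 64 * 1024    # 3,211,264
total = conv1 + conv2 + fc1  # 3,263,264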

Why are my TensorFlow network weights and costs NaN when I use RELU activations?

I can't get TensorFlow RELU activations (neither tf.nn.relu nor tf.nn.relu6) working without NaN values for activations and weights killing my training runs.
I believe I'm following all the right general advice. For example I initialize my weights with
weights = tf.Variable(tf.truncated_normal(w_dims, stddev=0.1))
biases = tf.Variable(tf.constant(0.1 if neuron_fn in [tf.nn.relu, tf.nn.relu6] else 0.0, shape=b_dims))
and use a slow training rate, e.g.,
tf.train.MomentumOptimizer(0.02, momentum=0.5).minimize(cross_entropy_loss)
But any network of appreciable depth results in NaN for the cost and at least some weights (at least in their summary histograms). In fact, the cost is often NaN right from the start (before training).
I seem to have these issues even when I use L2 (about 0.001) regularization, and dropout (about 50%).
Is there some parameter or setting that I should adjust to avoid these issues? I'm at a loss as to where to even begin looking, so any suggestions would be appreciated!
Following He et al. (as suggested in lejlot's comment), initializing the weights of the $l$-th layer to a zero-mean Gaussian distribution with standard deviation $\sqrt{2 / n_l}$, where $n_l$ is the flattened length of the input vector, or
stddev=np.sqrt(2 / np.prod(input_tensor.get_shape().as_list()[1:]))
results in weights that generally do not diverge.
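A small sketch wrapping that formula into a helper (the function name is illustrative; n_l is taken from the tensor feeding the layer, exactly as in the stddev line above):

import numpy as np
import tensorflow as tf

def he_normal_weights(input_tensor, w_dims):
    # He et al.: zero-mean Gaussian with stddev = sqrt(2 / n_l), where n_l is
    # the flattened length of one input example (all dimensions except batch).
    n_l = np.prod(input_tensor.get_shape().as_list()[1:])
    return tf.Variable(tf.truncated_normal(w_dims, stddev=np.sqrt(2.0 / n_l)))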
If you use a softmax classifier at the top of your network, try to make the initial weights of the layer just below the softmax very small (e.g. std=1e-4). This makes the initial distribution of outputs of the network very soft (high temperature), and helps ensure that the first few steps of your optimization are not too large and numerically unstable.
Have you tried gradient clipping and/or a smaller learning rate?
Basically, you will need to process your gradients before applying them, as follows (from tf docs, mostly):
# Replace this with what follows
# opt = tf.train.MomentumOptimizer(0.02, momentum=0.5).minimize(cross_entropy_loss)
# Create an optimizer.
opt = tf.train.MomentumOptimizer(learning_rate=0.001, momentum=0.5)
# Compute the gradients for a list of variables.
grads_and_vars = opt.compute_gradients(cross_entropy_loss, tf.trainable_variables())
# grads_and_vars is a list of tuples (gradient, variable). Do whatever you
# need to the 'gradient' part, for example cap them, etc.
capped_grads_and_vars = [(tf.clip_by_value(gv[0], -5., 5.), gv[1]) for gv in grads_and_vars]
# Ask the optimizer to apply the capped gradients.
train_op = opt.apply_gradients(capped_grads_and_vars)
Also, the discussion in this question might help.
