I don't understand why the official documentation uses a bias_variable of size 32. As far as I know, the number of biases equals the number of neurons in the layer, and in this case I expected the number of neurons in the first layer to be 28, because the image is 28 pixels wide and the code uses padding='SAME'. Why is it 32 and not 28?
Remember that the MNIST example uses a convolutional network, not a conventional fully connected one, so you are dealing with convolutions, not individual neurons. In convolutions you commonly use one bias per output channel, and this example uses 32 output channels in the first convolution layer, which gives you 32 biases.
They use bias of size 32 to be compatible with the weights:
W_conv1 = weight_variable([5, 5, 1, 32])
b_conv1 = bias_variable([32])
They use the weights in the conv2d function: tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME').
The documentation of tf.nn.conv2d() says that the second parameter is your filter and has the shape [filter_height, filter_width, in_channels, out_channels]. So [5, 5, 1, 32] means that your in_channels is 1: you have a greyscale image, so no surprises here.
32 means that during the learning phase the network will try to learn 32 different kernels, which will then be used during prediction. You can change this number to any other value, since it is a hyperparameter that you can tune.
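To make the shapes concrete, here is a minimal sketch in the same TF 1.x style as the tutorial (the initializer values are illustrative, not the tutorial's exact ones):
import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 28, 28, 1])   # batch of 28x28 greyscale images

W_conv1 = tf.Variable(tf.truncated_normal([5, 5, 1, 32], stddev=0.1))  # 32 kernels of 5x5x1
b_conv1 = tf.Variable(tf.constant(0.1, shape=[32]))                    # one bias per output channel

# stride 1 and 'SAME' padding keep the spatial size at 28x28
h_conv1 = tf.nn.conv2d(x, W_conv1, strides=[1, 1, 1, 1], padding='SAME') + b_conv1
print(h_conv1.shape)  # (?, 28, 28, 32): 32 feature maps, hence 32 biases
The 28 you were thinking of is the spatial size of each feature map, not the number of biases.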
I'm new to Keras and wondering how to train an LSTM with (interrupted) time series of different lengths. Consider, for example, a continuous series from day 1 to day 10 and another continuous series from day 15 to day 20. Simply concatenating them into a single series might yield wrong results. I see basically two options to bring them to the shape (batch_size, timesteps, output_features):
Extend the shorter series by some default value (0), i.e. for the above example we would have the following batch:
d1, ..., d10
d15, ..., d20, 0, 0, 0, 0
Compute the GCD of the lengths, cut the series into pieces, and use a stateful LSTM, i.e.:
d1, ..., d5
d6, ..., d10
reset_state
d15, ..., d20
Are there any other / better solutions? Is training a stateless LSTM with a complete sequence equivalent to training a stateful LSTM with pieces?
Have you tried feeding the LSTM layer with inputs of different lengths? Input time series can have different lengths when an LSTM is used (even the batch size can differ from one batch to another, but obviously the dimension of the feature vector must be the same). Here is an example in Keras:
from keras import models, layers
n_feats = 32
latent_dim = 64
lstm_input = layers.Input(shape=(None, n_feats))
lstm_output = layers.LSTM(latent_dim)(lstm_input)
model = models.Model(lstm_input, lstm_output)
model.summary()
Output:
Layer (type)                 Output Shape              Param #
=================================================================
input_2 (InputLayer)         (None, None, 32)          0
_________________________________________________________________
lstm_2 (LSTM)                (None, 64)                24832
=================================================================
Total params: 24,832
Trainable params: 24,832
Non-trainable params: 0
As you can see, the first and second axes of the Input layer are None. That means they are not pre-specified and can take any value. You can think of an LSTM as a loop: no matter the input length, as long as there are remaining data vectors of the same length (i.e. n_feats), the LSTM layer processes them. Therefore, as you can see above, the number of parameters in an LSTM layer does not depend on the batch size or the time-series length; it only depends on the length of the input feature vector and the latent dimension of the LSTM: 4 * (latent_dim * (latent_dim + n_feats + 1)) = 4 * 64 * 97 = 24,832.
import numpy as np
# feed LSTM with: batch_size=10, timestamps=5
model.predict(np.random.rand(10, 5, n_feats)) # This works
# feed LSTM with: batch_size=5, timestamps=100
model.predict(np.random.rand(5, 100, n_feats)) # This also works
However, depending on the specific problem you are working on, this may not be appropriate: although I don't have a concrete example in mind right now, there may be cases where this behavior is unsuitable and you should make sure all the time series have the same length.
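As a side note on the first option in the question: if you pad the shorter series with a default value, Keras's Masking layer can prevent the padded timesteps from affecting the LSTM. A minimal sketch, assuming an all-zero feature vector never occurs as real data:
from keras import models, layers

n_feats = 32
latent_dim = 64

inp = layers.Input(shape=(None, n_feats))
# timesteps whose feature vector equals mask_value everywhere are skipped
masked = layers.Masking(mask_value=0.0)(inp)
out = layers.LSTM(latent_dim)(masked)
masked_model = models.Model(inp, out)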
My model:
classifier = Sequential()
# Convolutional + MaxPooling -> 1
classifier.add(Conv2D(32, (3,3), input_shape = (IMAGE_SIZE, IMAGE_SIZE, 3)))
convout1 = Activation('relu')
classifier.add(convout1)
classifier.add(MaxPooling2D(pool_size = (2,2)))
classifier.add(Dropout(0.25))
I am running the following code to get the weights:
classifier.layers[0].get_weights()[0]
It returns an array of 3x3x3x32. Shouldn't it return 32 matrices of 3x3?
The weights shape is correct, because each convolutional filter is applied to the whole 3D input volume, and the parameters for the different channels are not shared (though they are shared spatially); this is illustrated nicely in the convolution demo from the CS231n class notes.
Yes, each output map is obtained by summing the convolutions across the depth of the input volume, but the parameters for each channel are different.
In your case the channels are RGB (since input_shape = (IMAGE_SIZE, IMAGE_SIZE, 3)), the spatial filter size is 3x3, and there are 32 filters. Hence the resulting shape is 3x3x3x32, and the shape of each filter is 3x3x3.
No, the return value has the right shape. What you are not considering is that each of the 32 filters is 3x3 in its spatial dimensions and has three channels, the same as the input. This means that each filter also operates along the channel dimension. What you expected would only be valid for a 2D convolution on a single-channel image.
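A minimal sketch that makes these shapes visible (IMAGE_SIZE here is just an illustrative value):
from keras.models import Sequential
from keras.layers import Conv2D

IMAGE_SIZE = 64  # illustrative

model = Sequential()
model.add(Conv2D(32, (3, 3), input_shape=(IMAGE_SIZE, IMAGE_SIZE, 3)))

weights, biases = model.layers[0].get_weights()
print(weights.shape)  # (3, 3, 3, 32): height, width, in_channels, out_channels
print(biases.shape)   # (32,): one bias per filter, shared across positions and channels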
I know that a bias is equivalent to adding a constant 1 to the input vector of each layer, or to having a neuron with a constant output of 1. The weights going out of the bias neuron are normal weights which are trained during training.
Now I'm studying some neural network code in TensorFlow, e.g. this one (it's just part of a CNN (VGGNet), specifically the part where the convolutions end and the fully connected layers begin):
with tf.name_scope('conv5_3') as scope:
    kernel = tf.Variable(tf.truncated_normal([3, 3, 512, 512], dtype=tf.float32,
                                             stddev=1e-1), name='weights')
    conv = tf.nn.conv2d(self.conv5_2, kernel, [1, 1, 1, 1], padding='SAME')
    biases = tf.Variable(tf.constant(0.0, shape=[512], dtype=tf.float32),
                         trainable=True, name='biases')
    out = tf.nn.bias_add(conv, biases)
    self.conv5_3 = tf.nn.relu(out, name=scope)
    self.parameters += [kernel, biases]

# pool5
self.pool5 = tf.nn.max_pool(self.conv5_3,
                            ksize=[1, 2, 2, 1],
                            strides=[1, 2, 2, 1],
                            padding='SAME',
                            name='pool4')

with tf.name_scope('fc1') as scope:
    shape = int(np.prod(self.pool5.get_shape()[1:]))
    fc1w = tf.Variable(tf.truncated_normal([shape, 4096],
                                           dtype=tf.float32,
                                           stddev=1e-1), name='weights')
    fc1b = tf.Variable(tf.constant(1.0, shape=[4096], dtype=tf.float32),
                       trainable=True, name='biases')
    pool5_flat = tf.reshape(self.pool5, [-1, shape])
    fc1l = tf.nn.bias_add(tf.matmul(pool5_flat, fc1w), fc1b)
    self.fc1 = tf.nn.relu(fc1l)
    self.parameters += [fc1w, fc1b]
Now my question is: why is the bias initialized to 0 in the convolutional layers and to 1 in the fully connected layers (every conv layer in this model uses 0 for the bias and the FC layers use 1)? Or does my explanation only cover fully connected layers, and is it different for convolutional layers?
The bias (in any layer) is usually initialized with zeros, but random or specific small values are also possible. A quote from Stanford's CS231n:
Initializing the biases. It is possible and common to initialize the biases to be zero, since the symmetry breaking is provided by the small random numbers in the weights. For ReLU non-linearities, some people like to use small constant value such as 0.01 for all biases because this ensures that all ReLU units fire in the beginning and therefore obtain and propagate some gradient. However, it is not clear if this provides a consistent improvement (in fact some results seem to indicate that this performs worse) and it is more common to simply use 0 bias initialization.
Other examples: the tf.layers.dense function, which is a shortcut for creating FC layers, uses tf.zeros_initializer for the bias by default; and this sample CNN uses random init for all weights and biases without hurting performance.
So, in summary, bias initialization isn't that important (compared to weight initialization), and I'm pretty sure you'll get similar training speed with zero or small random init as well.
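For illustration, a minimal sketch of the two initializations being contrasted, in the same TF 1.x style as the snippet above (the shapes are illustrative):
import tensorflow as tf

# zero bias: the common default (tf.layers.dense uses tf.zeros_initializer)
conv_biases = tf.Variable(tf.constant(0.0, shape=[512]), name='conv_biases')

# small positive bias: sometimes used with ReLU so that all units fire
# at the start of training and propagate some gradient
fc_biases = tf.Variable(tf.constant(0.01, shape=[4096]), name='fc_biases')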
I understand how convolution kernels work and their function in neural networks. However, I'm not sure whether in a typical CNN you would predefine the convolution kernel, or whether it is something the CNN "figures out" itself. For example, when building a CNN, would you simply define some 5x5 convolution kernel like this:
kernel = [[ 0,  1, -2,  1,  0],
          [ 0,  2, -1,  2,  1],
          [ 1,  0,  1,  0,  0],
          [-1, -1,  0, -3, -1],
          [-3, -2,  0,  1,  1]]
Or would you simply tell the CNN to find a 5x5 kernel and after training it will have come up with a good 5x5 kernel?
In a CNN the kernels are learned during the optimization procedure, so each number in the matrix is a free parameter, adjusted according to the partial derivative of the loss with respect to that particular variable.
So to answer
Or would you simply tell the CNN to find a 5x5 kernel and after training it will have come up with a good 5x5 kernel?
You would tell the model to use K kernels of a given size, with a given stride, possibly in multiple layers, followed by other operations, and it will find all the kernels on its own.
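A minimal Keras sketch of this (the data is random and purely illustrative; the point is that the kernel values are updated by training rather than specified by hand):
import numpy as np
from keras.models import Sequential
from keras.layers import Conv2D, Flatten, Dense

model = Sequential()
# we only specify: 8 kernels of size 5x5; their values start out random
model.add(Conv2D(8, (5, 5), input_shape=(28, 28, 1), activation='relu'))
model.add(Flatten())
model.add(Dense(1))
model.compile(optimizer='adam', loss='mse')

before = model.layers[0].get_weights()[0].copy()
x = np.random.rand(16, 28, 28, 1)
y = np.random.rand(16, 1)
model.fit(x, y, epochs=1, verbose=0)
after = model.layers[0].get_weights()[0]
print(np.allclose(before, after))  # False: training has adjusted the kernels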
I'm training a convolutional neural network on text (at the character level) and I want to do max-pooling. tf.nn.max_pool expects a rank-4 tensor, but 1-D convolutions in TensorFlow produce rank-3 tensors ([batch, width, depth]), so when I pass the output of conv1d to the max-pool function, I get this error:
ValueError: Shape (1, 144, 512) must have rank 4
I'm new to TensorFlow and deep learning frameworks in general and would like advice on best practice here, because I can imagine there are multiple workarounds. How can I perform max-pooling in the 1-D case?
Thanks.
A quick way would be to add an extra singleton dimension, i.e. make the shape (1, 1, 144, 512); after pooling you can reduce it back with tf.squeeze.
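For example, a minimal sketch in the TF 1.x API (the shapes follow the error message above; the pooling window of 2 is just illustrative):
import tensorflow as tf

x = tf.placeholder(tf.float32, [1, 144, 512])  # [batch, width, depth] from conv1d

x4d = tf.expand_dims(x, axis=1)                # [1, 1, 144, 512]: add a singleton height

pooled = tf.nn.max_pool(x4d,
                        ksize=[1, 1, 2, 1],    # pool over the width dimension only
                        strides=[1, 1, 2, 1],
                        padding='SAME')        # -> [1, 1, 72, 512]

out = tf.squeeze(pooled, axis=1)               # back to rank 3: [1, 72, 512]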
I'm curious about other approaches though.