In the Tensorflow example "Deep MNIST for Experts" https://www.tensorflow.org/get_started/mnist/pros
I am not clear how to determine the feature number specified in weight of activation function.
For example:
We can now implement our first layer. It will consist of convolution,
followed by max pooling. The convolution will compute 32 features for
each 5x5 patch.
W_conv1 = weight_variable([5, 5, 1, 32])
Why 32 is picked here?
In order to build a deep network, we stack several layers of this
type. The second layer will have 64 features for each 5x5 patch.
W_conv2 = weight_variable([5, 5, 32, 64])
Again, why 64 is picked?
Now that the image size has been reduced to 7x7, we add a
fully-connected layer with 1024 neurons to allow processing on the
entire image.
W_fc1 = weight_variable([7 * 7 * 64, 1024])
Why 1024 here?
Thanks
Each of these filters will actually do something, like check for edges, check for colour change, or right-shift, left-shit the image, sharpen, blur etc.
Each of these filters are actually working on finding out the meaning of the image by sharpening, enhancing, smoothening, intensifying etc.
For e.g. check this link which explains the meaning of these filters
http://setosa.io/ev/image-kernels/
So all these filters are actually neurons where the output will be max-pooled and eventually fed into a FC layer after some activation.
If you are looking for just understanding the filters, that is another approach. However if you are looking to learn how conv. architectures work but since these are tried and tested filters over the dataset, you hsould just go with it for now.
The filters also learn through Backprop.
32 and 64 are number of filters in the respective layers.
1024 is the number of output neurons in the fully connected layer.
Your question basically is about the reason behind the choice of these hyperparameters.
There is no mathematical or programming reason behind these specific choices. These have been picked up after experiments as they delivered a good accuracy over MNIST dataset.
You can change these numbers and that is one way by which you can modify a model.
Unfortunately you cannot yet explore the reason for the choice behind these parameters within TensorFlow or any other literature source.
Related
I think this is a comprehension issue, but I would appreciate any help.
I'm trying to learn how to use PyTorch for autoencoding. In the nn.Linear function, there are two specified parameters,
nn.Linear(input_size, hidden_size)
When reshaping a tensor to its minimum meaningful representation, as one would in autoencoding, it makes sense that the hidden_size would be smaller. However, in the PyTorch tutorial there is a line specifying identical input_size and hidden_size:
class NeuralNetwork(nn.Module):
def __init__(self):
super(NeuralNetwork, self).__init__()
self.flatten = nn.Flatten()
self.linear_relu_stack = nn.Sequential(
nn.Linear(28*28, 512),
nn.ReLU(),
nn.Linear(512, 512),
nn.ReLU(),
nn.Linear(512, 10),
)
I guess my question is, what is the purpose of having the same input and hidden size? Wouldn't this just return an identical tensor?
I suspect that this just a requirement after calling the nn.ReLU() activation function.
As well stated by wikipedia:
An autoencoder is a type of artificial neural network used to learn
efficient codings of unlabeled data. The
encoding is validated and refined by attempting to regenerate the
input from the encoding.
In other words, the idea of the autoencoder is to learn an identity. This identity-function will be learned only for particular inputs (i.e. without anomalies). From this, the following points derive:
Input will have same dimensions as output
Autoencoders are (generally) built to learn the essential features of the input
Because of point (1), you have that autoencoder will have a series of layers (e.g. a series of nn.Linear() or nn.Conv()).
Because of point (2), you generally have an Encoder which compresses the information (as your code-snippet, you start from 28x28 to the ending 10) and a Decoder that decompress the information (10 -> 28x28). Generally the latent space dimensionality (10) is much smaller than the input (28x28) across several implementation of this theoretical architecture. Now that the end-goal of the Encoder part is clear, you may appreciate that the compression may produce additional data during the compression itself (nn.Linear(28*28, 512)), which will disappear when the series of layers will give the final output (10).
Note that because the model in your question includes a nonlinearity after the linear layer, the model will not learn an identity transform between the input and output. In the specific case of the relu nonlinearity, the model could learn an identity transform if all of the input values were positive, but in general this won't be the case.
I find it a little easier to imagine the issue if we had an even smaller model consisting of Linear --> Sigmoid --> Linear. In such a case, the input will be mapped through the first matrix transform and then "squashed" into the space [0, 1] as the "hidden" layer representation. The next ("output") layer would need to take this squashed view of the input and come up with some way of "unsquashing" it back into the original. But with an affine output layer, it's not possible to do this, so the model will have to learn some other, non-identity, transforms for the two matrices.
There are some neat visualizations of this concept on Chris Olah's blog that are well worth a look.
I am training a unsupervised NN model and for some reason, after exactly one epoch (80 steps), model stops learning.
]
Do you have any idea why it might happen and what should I do to prevent it?
This is more info about my NN:
I have a deep NN that tries to solve an optimization problem. My loss function is customized and it is my objective function in the optimization problem.
So if my optimization problems is min f(x) ==> loss, now in my DNN loss = f(x). I have 64 input, 64 output, 3 layers in between :
self.l1 = nn.Linear(input_size, hidden_size)
self.relu1 = nn.LeakyReLU()
self.BN1 = nn.BatchNorm1d(hidden_size)
and last layer is:
self.l5 = nn.Linear(hidden_size, output_size)
self.tan5 = nn.Tanh()
self.BN5 = nn.BatchNorm1d(output_size)
to scale my network.
with more layers and nodes(doubles: 8 layers each 200 nodes), I can get a little more progress toward lower error, but again after 100 steps training error becomes flat!
The symptom is that the training loss stops being improved relatively early. Suppose that your problem is learnable at all, there are many reasons for the for this behavior. Following are most relavant:
Improper preprocessing of input: Neural network prefers input with
zero mean. E.g., if the input is all positive, it will restrict the
weights to be updated in the same direction, which may not be
desirable (https://youtu.be/gYpoJMlgyXA).
Therefore, you may want to subtract the mean from all the images (e.g., subtract 127.5 from each of the 3 channels). Scaling to make unit standard deviation in each channel may also be helpful.
Generalization ability of the network: The network is not complicated
or deep enough for the task.
This is very easy to check. You can train the network on just a few
images (says from 3 to 10). The network should be able to overfit the
data and drives the loss to almost 0. If it is not the case, you may
have to add more layers such as using more than 1 Dense layer.
Another good idea is to used pre-trained weights (in applications of Keras documentation). You may adjust the Dense layers at the top to fit with your problem.
Improper weight initialization. Improper weight initialization can
prevent the network from converging (https://youtu.be/gYpoJMlgyXA,
the same video as before).
For the ReLU activation, you may want to use He initialization
instead of the default Glorot initialiation. I find that this may be
necessary sometimes but not always.
Lastly, you can use debugging tools for Keras such as keras-vis, keplr-io, deep-viz-keras. They are very useful to open the blackbox of convolutional networks.
I faced the same problem then I followed the following:
After going through a blog post, I managed to determine that my problem resulted from the encoding of my labels. Originally I had them as one-hot encodings which looked like [[0, 1], [1, 0], [1, 0]] and in the blog post they were in the format [0 1 0 0 1]. Changing my labels to this and using binary crossentropy has gotten my model to work properly. Thanks to Ngoc Anh Huynh and rafaelvalle!
For my deep learning assignment I need to design a image classification network. There this constraint in the assignment I can have 500,000 number of hidden/tunable parameters at most in this design.
How can I count or observe the number of these hidden parameters especially if I am using this tensor flow tutorial as initial code/design.
Thanks in advance
How can I count or observe the number of these hidden parameters especially if I am using this tensor flow tutorial as initial code/design.
Instead of me doing the work for you I'll show you how to count free parameters
Glancing quickly it looks like the code at cifar10 uses layers of max pooling, convolution, bias, fully connected weights. Let's review how many free parameters each of these layers adds to your architecture.
max pooling : FREE! That's right, there are no "free parameters" from max pooling.
conv : Convolutions are defined using parameters like [1,3,3,1] where the numbers correspond to your tensor like so [batch_size, CONV_SIZE, CONV_SIZE, FEATURE_DEPTH]. Multiply all the dimension sizes together to find the total size of your free parameters. In the case of [1,3,3,1], the total is 1x3x3x1 = 9.
bias : A Bias is similar to convolutions in that it is defined by a shape like [10] or [1,342,342,3]. Same thing, just multiply all dimension sizes together to get the total free parameters. Sometimes a bias is just a single number, which means a size of 1.
fully connected : A fully connected layer usually has a 2d shape like [1024,32]. This means that it is a 2d matrix, and you calculate the total free parameters just like the convolution. In this example [1024,32] has 1024x32 = 32,768 free parameters.
Finally you add up all the free parameters from all the layers and that is your total number of free parameters.
500 000 parmeters? You use an R, G and B value of each pixel? If yes there is some problems
1. too much data (long calculating time)
2. in image clasification companys always use some other image analysis technique(preprocesing) befor throwing data into NN. if you have to identical images. Second is moved by one piksel. For the network they can be very diffrend.
Imagine other neural network. Use two parameters maybe weight and height. If you swap this parametrs what will happend.
Yes during learning of your image network can decrease this effect but when I made experiments with 5x5 binary images that was very hard to network. I start using 4 layers but this help only a little.
The image used to lerning can be good clasified, after destoring also but mooving for one pixel and you have a problem.
If no make eksperiments or use genetic algoritm to find it.
After laerning you should use some algoritm to find dates with network recognize as "no important"(big differnce beetwen weight of this input and the rest, If this input weight are too close to 0 network "think" it is no important)
I am pulling lower level features from the VGG16 model included as Keras application. These features are exported as separate outputs of pre-trained input data for an add-on classifier. The conceptual idea was borrowed from Multi-scale recognition with DAG-CNNs
Using the model without the classifier top, features at the highest level are extracted from block_5 pulling layer using Flatten(): block_05 = Flatten(name='block_05')(block5_pool). This gives an output vector with dimension 8192. Flatten(), however does not work on lower pulling layers as the dimensions get too large (memory issues). Instead lower pulling layers (or any other layer) can be extracted using GlobalAveragePooling2D(): block_04 = GlobalAveragePooling2D(name='block_04')(block4_pool). The problem with this approach is however that the dimension of the feature vector reduces rapidly the lower you go: block_4 (512), block_3 (256), block_2 (128), block_1 (64).
What would be a suitable layer or set-up to retain more feature data from deeper layers?
For info, the output of the model looks like this, the add-on classifier has a corresponding number of inputs.
# Create model, output data in reverse order from top to bottom
model = Model(input=img_input, output=[block_05, # ch_00, layer 17, dim 8192
block_04, # ch_01, layer 13, dim 512
block_03, # ch_02, layer 9, dim 256
block_02, # ch_03, layer 5, dim 128
block_01]) # ch_04, layer 2, dim 64
The memory error you mentioned comes from flattening a huge array which makes the number of units extremely large. What you actually need to do is to downsample your input in a smart way. I will present you some way on how to do this:
MaxPooling: by simple usage of pooling - you could first downsample your feature maps and then Flatten them. The main advantage of this approach is its simplicity and lack of need of additional parameters. The main disadvantage : this might be a really rough method.
Intelligent downsampling: here you could add a Convolutional2D layers with huge subsampling (e.g. with filter size (4, 4) and subsample (4, 4)). This might be consider as intelligent pooling. A main disadvantage of this method is additional parameters need for this approach.
When using the function tf.nn.fractional_max_pool in Tensorflow, in addition to the output pooled tensor it returns, it also returns a row_pooling_sequence and a col_pooling_sequence, which I presume is used in backpropagation to find the gradient of. This is in contrast to the normal $2 \times 2$ max pooling, which just returns the pooled tensor.
My question is: do we have to handle the row_pooling and col_pooling values ourselves? How would we include them into a network to get backpropagation working properly? I modified a simple convolutional neural network to use fractional max pooling instead of 2 x 2 max pooling without making use of these values and the results were much poorer, leading me to believe we must explicitly handle these.
Here's the relevant portion of my code that makes use of the FMP:
def add_layer_ops_FMP(conv_func, x_input, W, keep_prob_layer, training_phase):
h_conv = conv_func(x_input, W, stride_l = 1)
h_BN = batch_norm(h_conv, training_phase, epsilon)
h_elu = tf.nn.elu(h_BN) # Rectified unit layer - change accordingly
def dropout_no_training(h_elu=h_elu):
return dropout_op(h_elu, keep_prob = 1.0)
def dropout_in_training(h_elu=h_elu, keep_prob_layer=keep_prob_layer):
return dropout_op(h_elu, keep_prob = keep_prob_layer)
h_drop = tf.cond(training_phase, dropout_in_training, dropout_no_training)
h_pool, row_pooling_sequence, col_pooling_sequence = tf.nn.fractional_max_pool(h_drop) # FMP layer. See Ben Graham's paper
return h_pool
Link to function on github.
Do we need to handle row_pooling_sequence and col_pooling_sequence?
Even though the tf.nn.fractional_max_pool documentation says it turns 2 extra tensors which are needed to calculate gradient, I believe we do not need to specially handle these 2 extra tensors and add them into gradient calculation operation. The backpropagation of tf.nn.fractional_max_poolin TensorFlow is already registered into the gradient calculation flow by the _FractionalMaxPoolGrad function. As you can see in the _FractionalMaxPoolGrad, the row_pooling_sequence and col_pooling_sequence are extracted by op.outputs[1] and op.outputs[2] and used to calculate gradient.
#ops.RegisterGradient("FractionalMaxPool")
def _FractionalMaxPoolGrad(op, grad_0, unused_grad_1, unused_grad_2):
"""..."""
return gen_nn_ops._fractional_max_pool_grad(op.inputs[0], op.outputs[0],
grad_0, op.outputs[1],
op.outputs[2],
op.get_attr("overlapping"))
Possible reasons for poorer performance after using fractional_max_pool (in my personal opinions).
In the fractional max pooling paper, the author used fractional max pooling in a spatially-sparse convolutional network. According to his spatially-sparse convolutional network design, he actually extended the image input spatial size by padding zeros. Additionally, fractional max pooling downsizes the input by a factor of pooling_ratio which is often less than 2. These two combined together allowed stacking more convolutional layers than using regular max pooling and hence building a deeper network. (i.e. imagine using CIFAR-10 dataset, the (non-padding) input spatial size is 32x32, the spatial size drops to 4x4 after 3 convolutional layers and 3 max pooling operations. If using fractional max pooling with pooling_ratio=1.4, the spatial size drops to 4x4 after 6 convolutional and 6 fractional max pooling layers). I experimented with building a CNN with 2-conv-layer+2-pooling-layer(regular max pool vs. fractional max pool with pooling_ratio=1.47)+2-fully-connected-layer on MNIST dataset. The one using regular max pooling also produced a better performance than the one using fractional max pooling (down by 15~20% performance). Comparing the spatial size before feeding into fully connected layers, the model with regular max pooling has spatial size of 7x7, the one with fractional max pooling has spatial size of 12x12. Adding one more conv+fractional_max_pool into the latter model (final spatial size dropped to be 8x8) improved the performance to a more comparative level with the former model with regular max pooling.
In summary, I personally think the good performance in the Fractional Max-Pooling paper is achieved by a combination of using spatially-sparse CNN with fractional max-pooling and small filters (and network in network) which enable building a deep network even when the input image spatial size is small. Hence in regular CNN network, simply replace regular max pooling with fractional max pooling does not necessarily give you a better performance.