CS231n: Total memory of VGGNet

I'm reading this CS231n tutorial, about convolutional neural networks. They give an example about VGGNet:
http://cs231n.github.io/convolutional-networks/
VGGNet in detail. Lets break down the VGGNet in more detail as a case
study. The whole VGGNet is composed of CONV layers that perform 3x3
convolutions with stride 1 and pad 1, and of POOL layers that perform
2x2 max pooling with stride 2 (and no padding). We can write out the
size of the representation at each step of the processing and keep
track of both the representation size and the total number of weights:
Then they give a detailed calculation of the network structure:
The thing is, for the total memory the tutorial gives 24M, but when I calculated it myself I only got about 15M! I simply added up all the activation sizes:
>>> 224*224*(3+64*2)+112*112*(64+128*2)+56*56*(128+256*3)+28*28*(256+512*3)+14*14*(512*4)+7*7*512+4096+4096+1000
15237608
Please help me.

Nice catch! Your calculation is correct; the total memory of the VGG representation is indeed
15.2M values * 4 bytes ~= 61 MB
In fact, this error was reported a long time ago, but unfortunately the CS231n staff don't spend much time on website maintenance...
However, note that if you implement the VGG network in any framework (Caffe, TensorFlow, etc.), the total model size will also include the parameters, and that part is much larger, as the authors show in their calculations (which seem right).
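If you want to check both numbers yourself, here is a minimal sketch (plain Python, layer shapes taken from the CS231n notes for VGG-16) that adds up the activation sizes and the parameter count; the parameter total comes out slightly above the notes' 138M because biases are included here.

# Minimal sketch: activation memory (per image, forward pass) and parameter
# count for VGG-16, assuming 4 bytes per float32 value.
conv_cfg = [64, 64, 'pool', 128, 128, 'pool', 256, 256, 256, 'pool',
            512, 512, 512, 'pool', 512, 512, 512, 'pool']

h, w, c = 224, 224, 3
activations = h * w * c              # the input image itself
params = 0
for layer in conv_cfg:
    if layer == 'pool':              # 2x2 max pool, stride 2: no weights
        h, w = h // 2, w // 2
    else:                            # 3x3 conv, stride 1, pad 1: spatial size kept
        params += 3 * 3 * c * layer + layer
        c = layer
    activations += h * w * c

for n_in, n_out in [(7 * 7 * 512, 4096), (4096, 4096), (4096, 1000)]:
    params += n_in * n_out + n_out   # fully connected weights + biases
    activations += n_out

print(activations, "values ~ %.0f MB" % (activations * 4 / 1e6))   # 15237608 values ~ 61 MB
print(params, "parameters ~ %.0f MB" % (params * 4 / 1e6))         # 138357544 parameters ~ 553 MB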

Related

Why max_batches independent of the size of the dataset?

I am wondering why the number of images has no influence on the number of iterations when training. Here is an example to make my question clearer:
Suppose we have 6400 images for a training to recognize 4 classes. Based on AlexeyAB's explanations, we keep batch = 64, subdivisions = 16 and write max_batches = 8000, since max_batches is determined by #classes x 2000.
Since we have 6400 images, a complete epoch requires 100 iterations. Therefore this training ends after 80 epochs.
Now, suppose that we have 12800 images. In that case, an epoch needs 200 iterations. Therefore the training ends after 40 epochs.
Since an epoch refers to one cycle through the full training dataset, I'm wondering why we don't increase the number of iterations when our dataset increases, in order to keep the number of epochs constant.
Said differently, I'm asking for a simple explanation as to why the number of epochs seems to be irrelevant to the quality of the training. I feel that it's a consequence of Yolo's construction but I am not knowledgeable enough to understand how.
Why does the number of images have no influence on the number of iterations when training?
In darknet YOLO, the number of iterations depends on the max_batches parameter in the .cfg file. After running for max_batches iterations, darknet saves the final weights.
In each epoch, all the data samples are passed through the network, so if you have many images, the training time for one epoch will be higher; you can test that by increasing the number of images in your dataset.
The subdivisions parameter sets the number of mini-batches per batch. Let's say you have 100 images in your dataset, your batch size is 10, subdivisions is 2 and max_batches is 20.
Then in each iteration 10 images are passed to the network in two mini-batches (each with 5 samples), and once you have run 20 batches (20 * 10 data samples) the training is complete. (The details can be a little different; I'm using a slightly modified version of the original darknet by pjreddie.)
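As a minimal sketch of the example above (100 images, batch = 10, subdivisions = 2, max_batches = 20), the relevant quantities are just a few multiplications:

num_images = 100
batch = 10
subdivisions = 2
max_batches = 20

mini_batch = batch // subdivisions        # 5 images per mini-batch
images_seen = max_batches * batch         # 200 images processed over the whole training
epochs = images_seen / num_images         # 2.0 full passes over the dataset

print(mini_batch, images_seen, epochs)    # 5 200 2.0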
The instructions have since been updated: max_batches equals classes * 2000, but not less than the number of training images and not less than 6000. Please find it at this link.
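For reference, the updated rule quoted above can be written as a one-liner (hypothetical helper name), and it reproduces the numbers from the question:

def recommended_max_batches(num_classes, num_train_images):
    # classes * 2000, but not less than the number of training images and not less than 6000
    return max(num_classes * 2000, num_train_images, 6000)

print(recommended_max_batches(4, 6400))    # 8000
print(recommended_max_batches(4, 12800))   # 12800 (the image count now dominates)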

How do I decide or count number of hidden/tunable parameters in my design?

For my deep learning assignment I need to design an image classification network. There is a constraint in the assignment: I can have at most 500,000 hidden/tunable parameters in this design.
How can I count or observe the number of these hidden parameters, especially if I am using this TensorFlow tutorial as the initial code/design?
Thanks in advance
How can I count or observe the number of these hidden parameters, especially if I am using this TensorFlow tutorial as the initial code/design?
Instead of doing the work for you, I'll show you how to count the free parameters yourself.
Glancing quickly, it looks like the cifar10 code uses max pooling, convolution, bias and fully connected layers. Let's review how many free parameters each of these adds to your architecture.
max pooling : FREE! That's right, there are no "free parameters" from max pooling.
conv : A convolution kernel is defined by a shape like [1,3,3,1], where in TensorFlow the numbers correspond to [filter_height, filter_width, in_channels, out_channels]. Multiply all the dimension sizes together to find the total number of free parameters. In the case of [1,3,3,1], the total is 1x3x3x1 = 9.
bias : A Bias is similar to convolutions in that it is defined by a shape like [10] or [1,342,342,3]. Same thing, just multiply all dimension sizes together to get the total free parameters. Sometimes a bias is just a single number, which means a size of 1.
fully connected : A fully connected layer usually has a 2d shape like [1024,32]. This means that it is a 2d matrix, and you calculate the total free parameters just like the convolution. In this example [1024,32] has 1024x32 = 32,768 free parameters.
Finally you add up all the free parameters from all the layers and that is your total number of free parameters.
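If you would rather let the framework do the bookkeeping, here is a minimal sketch (TF 1.x style, matching the tutorial code this question refers to) that sums the sizes of all trainable variables after the graph has been built:

import numpy as np
import tensorflow as tf

def count_free_parameters():
    total = 0
    for var in tf.trainable_variables():                  # conv kernels, biases, FC weights, ...
        total += int(np.prod(var.get_shape().as_list()))  # product of the shape = variable size
    return total

# after building the cifar10 model:
# print("free parameters:", count_free_parameters())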
500,000 parameters? Are you feeding the raw R, G and B values of every pixel into the network? If so, there are some problems:
1. Too much data (long training time).
2. In image classification, some other image-analysis technique (preprocessing) is usually applied before throwing the data into the NN. If you have two identical images and the second one is shifted by a single pixel, they can look very different to the network.
Imagine another neural network that takes two inputs, say weight and height. What happens if you swap those two inputs?
Yes, during training your image network can reduce this effect, but when I experimented with 5x5 binary images this was very hard for the network. I started using 4 layers, but that helped only a little.
The images used for training were classified well, even after some distortion, but move them by one pixel and you have a problem.
If you are not feeding in raw pixels, run experiments or use a genetic algorithm to find a design that fits.
After training you should use some algorithm to find the inputs the network treats as "not important" (a big difference between that input's weights and the rest; if an input's weights are very close to 0, the network "thinks" it is not important).

Should I normalize my features before throwing them into RNN?

I am playing with some demos of recurrent neural networks.
I noticed that the scale of my data differs a lot from column to column, so I am considering doing some preprocessing before I feed the data batches into my RNN. The close column is the target I want to predict in the future.
open high low volume price_change p_change ma5 ma10 \
0 20.64 20.64 20.37 163623.62 -0.08 -0.39 20.772 20.721
1 20.92 20.92 20.60 218505.95 -0.30 -1.43 20.780 20.718
2 21.00 21.15 20.72 269101.41 -0.08 -0.38 20.812 20.755
3 20.70 21.57 20.70 645855.38 0.32 1.55 20.782 20.788
4 20.60 20.70 20.20 458860.16 0.10 0.48 20.694 20.806
ma20 v_ma5 v_ma10 v_ma20 close
0 20.954 351189.30 388345.91 394078.37 20.56
1 20.990 373384.46 403747.59 411728.38 20.64
2 21.022 392464.55 405000.55 426124.42 20.94
3 21.054 445386.85 403945.59 473166.37 21.02
4 21.038 486615.13 378825.52 461835.35 20.70
My question is: is preprocessing the data with, say, StandardScaler from sklearn necessary in my case? And why?
(You are welcome to edit my question)
It will be beneficial to normalize your training data. Feeding features with widely different scales into your model causes the network to weight them unequally, which can falsely prioritize some features over others in the learned representation.
Although the discussion on data preprocessing is somewhat contested, both on when exactly it is necessary and on how to correctly normalize the data for a given model and application domain, there is a general consensus in machine learning that mean subtraction followed by a general normalization step is helpful.
In mean subtraction, the mean of every individual feature is subtracted from the data, which from a geometric point of view can be interpreted as centering the data around the origin. This holds for every dimension.
Normalizing the data after the mean subtraction step brings every data dimension to approximately the same scale. Note that, as mentioned above, the different features lose any prioritization over each other after this step. If you have good reason to believe that the different scales of your features carry information the network needs in order to truly understand the underlying patterns in your dataset, then normalization will be harmful. A standard approach is to scale the inputs to have zero mean and unit variance.
Further preprocessing operations may be helpful in specific cases, such as performing PCA or whitening on your data. Look into the excellent CS231n notes (Setting up the data and the model) for further reference on these topics as well as for a more detailed explanation of the points above.
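As a concrete starting point, here is a minimal sketch with sklearn's StandardScaler (using a tiny synthetic stand-in for the DataFrame shown in the question, with close as the target); the important detail is to fit the scaler on the training part only and reuse its statistics for the test part:

import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({                      # stand-in for the real data above
    "open":   [20.64, 20.92, 21.00, 20.70, 20.60],
    "volume": [163623.62, 218505.95, 269101.41, 645855.38, 458860.16],
    "close":  [20.56, 20.64, 20.94, 21.02, 20.70],
})

features = df.drop(columns=["close"]).values      # everything except the target
split = int(0.8 * len(features))                  # simple chronological split

scaler = StandardScaler()                         # per-column zero mean, unit variance
X_train = scaler.fit_transform(features[:split])  # fit on the training part only
X_test = scaler.transform(features[split:])       # reuse training mean/std (no leakage)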
Definitely yes. Most neural networks work best with data between 0 and 1 or between -1 and 1 (depending on the output function). Also, when some inputs have larger values than others, the network will "think" they are more important. This can make learning very slow, because the network first has to lower the weights on those inputs.
I found this: https://arxiv.org/abs/1510.01378
If you normalize, it may improve convergence, so you will get shorter training times.

feature number in tensorflow tf.nn.conv2d

In the TensorFlow example "Deep MNIST for Experts" https://www.tensorflow.org/get_started/mnist/pros
I am not clear on how the number of features specified in the weight variables is determined.
For example:
We can now implement our first layer. It will consist of convolution,
followed by max pooling. The convolution will compute 32 features for
each 5x5 patch.
W_conv1 = weight_variable([5, 5, 1, 32])
Why is 32 picked here?
In order to build a deep network, we stack several layers of this
type. The second layer will have 64 features for each 5x5 patch.
W_conv2 = weight_variable([5, 5, 32, 64])
Again, why is 64 picked?
Now that the image size has been reduced to 7x7, we add a
fully-connected layer with 1024 neurons to allow processing on the
entire image.
W_fc1 = weight_variable([7 * 7 * 64, 1024])
Why 1024 here?
Thanks
Each of these filters actually does something, like checking for edges, checking for colour changes, or shifting the image right or left, sharpening, blurring, etc.
Each of these filters works on extracting the meaning of the image by sharpening, enhancing, smoothing or intensifying it.
For example, check this link, which explains the meaning of such filters:
http://setosa.io/ev/image-kernels/
So all these filters are effectively neurons whose output will be max-pooled and eventually fed into an FC layer after some activation.
If you just want to understand the filters, that is a different exercise. But if you are looking to learn how conv architectures work, these are tried and tested settings for this dataset, so you should just go with them for now.
The filters also learn through Backprop.
32 and 64 are number of filters in the respective layers.
1024 is the number of output neurons in the fully connected layer.
Your question is basically about the reason behind the choice of these hyperparameters.
There is no mathematical or programming reason behind these specific choices. They were picked after experiments because they delivered good accuracy on the MNIST dataset.
You can change these numbers, and that is one way to modify a model.
Unfortunately, you will not find a derivation of these particular values in TensorFlow or anywhere else in the literature; they are empirical choices.
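What you can do, though, is check what each choice costs in parameters; a quick sketch using the shapes quoted above (plus the tutorial's final [1024, 10] readout layer), counting weights and biases:

layers = {
    "conv1 [5, 5, 1, 32]":  5 * 5 * 1 * 32 + 32,        # 832
    "conv2 [5, 5, 32, 64]": 5 * 5 * 32 * 64 + 64,       # 51,264
    "fc1 [7*7*64, 1024]":   7 * 7 * 64 * 1024 + 1024,   # 3,212,288
    "fc2 [1024, 10]":       1024 * 10 + 10,             # 10,250
}
for name, count in layers.items():
    print(name, count)
print("total:", sum(layers.values()))               # 3,274,634 trainable parameters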

understanding "Deep MNIST for Experts"

I am trying to understand Deep MNIST for Experts. I have a quite clear idea of how Neural networks and deep learning works on a high level, but I struggle to understand the details.
In the tutorial they first write and run a simple one-layer model. This includes defining the model x*W + b, calculating the cross-entropy, minimizing it by gradient descent and evaluating the result.
The first part I found quite easy to run and understand.
In the second part they build a simple multi-layer network and apply some convolutions and pooling. However, here things start to get tricky. They write:
We can now implement our first layer. It will consist of convolution, followed by max pooling. The convolution will compute 32 features for each 5x5 patch.
A 5x5 patch should equal 25 pixels, right? Why would you extract 32 features from 25 pixels? Why do you want more features than you have data points? How does this even make sense? It feels like they are "upscaling" the problem from 25 dimensions to 32 dimensions, so 7 of the 32 dimensions should be redundant.
Secondly, the convolution uses the function truncated_normal, which just picks random values close to the mean. Why is this a good model for modelling handwritten numbers?
Thirdly, the second layer in the network seems to do the same thing again. Are more layers just better? Could I have achieved the same results with a single layer?
I think a visual model can greatly reduce the difficulty of understanding, so perhaps this can help you understand better:
http://scs.ryerson.ca/~aharley/vis/conv/
This is a 3D visualization of a convolutional neural network; it has two convolution layers, each followed by a max pooling layer, and you can click a 3D cube in each layer to check its values.
In general, you have to read quite a lot about CNNs/NNs before trying to understand what is really going on. These examples are not meant to be an introductory course to NNs; they assume you already know what CNNs are.
A 5x5 patch should equal 25 pixels, right? Why would you extract 32 features from 25 pixels? Why do you want more features than you have data points? How does this even make sense? It feels like they are "upscaling" the problem from 25 dimensions to 32 dimensions, so 7 of the 32 dimensions should be redundant.
This is a completely different "level of abstraction"; you are comparing unrelated objects to each other, which is obviously confusing. They are creating 32 filters, each of which linearly maps your whole image through a 5x5 convolution kernel moving across the image. For example, one such filter could be an edge detector:
0 0 0 0 0
0 0 0 0 0
1 1 1 1 1
1 1 1 1 1
1 1 1 1 1
another can detect diagonal lines
1 0 0 0 0
1 1 0 0 0
1 1 1 0 0
1 1 1 1 0
1 1 1 1 1
etc. Why 32? Just a magic number, tested empirically. This is actually quite a small number in terms of CNNs (notice that just to detect basic edges in a greyscale image you already need 8 different filters!).
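To make "a filter linearly maps your image" concrete, here is a tiny sketch (numpy only, with a hypothetical 6x6 input) that slides one fixed horizontal-edge kernel over an image; a learned CNN filter does exactly the same sliding, except its weights come from training:

import numpy as np

image = np.zeros((6, 6))
image[3:, :] = 1.0                      # dark top half, bright bottom half

kernel = np.array([[-1, -1, -1],        # simple 3x3 horizontal-edge detector
                   [ 0,  0,  0],
                   [ 1,  1,  1]], dtype=float)

out = np.zeros((4, 4))                  # "valid" output size: 6 - 3 + 1
for i in range(4):
    for j in range(4):
        out[i, j] = np.sum(image[i:i+3, j:j+3] * kernel)

print(out)                              # strongest responses where the window crosses the edge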
Secondly, the convolution uses the function truncated_normal, which just picks random values close to the mean. Why is this a good model for modelling handwritten numbers?
This is the initializer of the weights. It is not a "model for modelling handwritten numbers"; it is simply a starting point for the optimization of this part of the parameter space. Why a normal distribution? We have some mathematical intuition for how to initialize neural nets, especially assuming ReLU activations. It is important to initialize randomly in a way that ensures many of your neurons are initially active, so you do not get zero derivatives (and thus lose the ability to learn with typical optimizers).
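For reference, the weight_variable helper in that tutorial is exactly this starting point: truncated_normal only supplies the initial values, and the tf.Variable it is wrapped in is what gets trained afterwards.

import tensorflow as tf

def weight_variable(shape):
    # random initial values, clipped to within 2 standard deviations of the mean
    initial = tf.truncated_normal(shape, stddev=0.1)
    return tf.Variable(initial)   # these are the parameters the optimizer will update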
Thirdly, the second layer in the network seems to do the same thing again. Are more layers just better? Could I have achieved the same results with a single layer?
In principle you can model everything with a single-hidden-layer feed-forward net, even without convolutions. However, it might require exponentially many hidden units and perfect optimization strategies, which we do not have (and which may not even exist!). The depth of the network gives you the ability to express more complex (and in some cases more useful) features with fewer parameters, and we more or less know how to optimize it. However, you should avoid the common pitfall of assuming that "deeper is better". This is not true in general. It is true if the important features of your data can be efficiently expressed as a hierarchical structure of abstractions. That holds for images (more and more complex patterns: first edges, then lines and curves, then patches, then more complex concepts) as well as for text, sound, etc., but before you apply DL to a new task you should ask yourself whether this is (or at least might be) true for your case. Using too complex a model is usually far worse than using one that is too simple.
