understanding "Deep MNIST for Experts" - machine-learning

I am trying to understand Deep MNIST for Experts. I have a fairly clear idea of how neural networks and deep learning work at a high level, but I struggle to understand the details.
In the tutorial they first write and run a simple one-layer model. This includes defining the model x*W+b, calculating the cross-entropy, minimizing it by gradient descent and evaluating the result.
The first part I found quite easy to run and understand.
In the second part they build a simple multi-layer network and apply some convolutions and pooling. However, here things start to get tricky. They write:
We can now implement our first layer. It will consist of convolution, followed by max pooling. The convolution will compute 32 features for each 5x5 patch.
A 5x5 patch should equal 25 pixels, right? Why would you extract 32 features from 25 pixels? Why do you want more features than you have data points? How does this even make sense? It feels like they are "upscaling" a problem from 25 dimensions to 32 dimensions, so 7 of the 32 dimensions should be redundant.
Secondly, the convolution uses the function truncated_normal, which just picks random values close to the mean. Why is this a good model for modelling handwritten numbers?
Thirdly, the second layer in the network seems to do the same thing again. Are more layers just better? Could I have achieved the same results with a single layer?

I think a visual model can greatly reduce the difficulty of understanding, so perhaps this can help you understand better:
http://scs.ryerson.ca/~aharley/vis/conv/
This is a 3D visualization of a convolutional neural network. It has two convolution layers, each followed by a max pooling layer, and you can click a 3D cube in each layer to check its value.

In general, you have to read a lot about CNNs/NNs before trying to understand what is really going on. These examples are not really meant to be an introductory course on NNs; they assume you already know what CNNs are.
A 5x5 patch should equal 25 pixels, right? Why would you extract 32 features from 25 pixels? Why do you want more features than you have data points? How does this even make sense? It feels like they are "upscaling" a problem from 25 dimensions to 32 dimensions, so 7 of the 32 dimensions should be redundant.
This is a completely different 'level of abstraction'; you are comparing unrelated objects to each other, which is obviously confusing. They are creating 32 filters, each of which linearly maps your whole image through a 5x5 convolution kernel that moves across the image. For example, one such filter could be an edge detector:
0 0 0 0 0
0 0 0 0 0
1 1 1 1 1
1 1 1 1 1
1 1 1 1 1
another can detect diagonal lines
1 0 0 0 0
1 1 0 0 0
1 1 1 0 0
1 1 1 1 0
1 1 1 1 1
etc. Why 32? Just a magic number, chosen empirically. This is actually quite a small number by CNN standards (notice that just to detect basic edges in greyscale images you already need 8 different filters!).
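To make this concrete, here is a minimal sketch of the first layer in the TensorFlow 1.x style the tutorial uses (variable names follow the tutorial): a single weight tensor of shape [5, 5, 1, 32] holds all 32 filters, and each filter slides over the whole 28x28 image.
import tensorflow as tf

x_image = tf.placeholder(tf.float32, [None, 28, 28, 1])   # batch of greyscale images

# filter tensor: [filter_height, filter_width, in_channels, out_channels]
W_conv1 = tf.Variable(tf.truncated_normal([5, 5, 1, 32], stddev=0.1))
b_conv1 = tf.Variable(tf.constant(0.1, shape=[32]))

# each of the 32 filters produces one 28x28 feature map (SAME padding)
h_conv1 = tf.nn.relu(tf.nn.conv2d(x_image, W_conv1,
                                  strides=[1, 1, 1, 1], padding='SAME') + b_conv1)
# 2x2 max pooling halves the spatial resolution: output shape is [batch, 14, 14, 32]
h_pool1 = tf.nn.max_pool(h_conv1, ksize=[1, 2, 2, 1],
                         strides=[1, 2, 2, 1], padding='SAME')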
Secondly. The convolution uses the function truncated_normal which just picks random values close to the mean. Why is this a good model for modelling handwritten numbers?
This is the initializer of the weights. It is not a "model for modelling handwritten numbers"; it is simply a starting point for optimizing this part of the parameter space. Why a normal distribution? We have some mathematical intuition about how to initialize neural nets, especially assuming ReLU activations. It is important to initialize in a random way that ensures many of your neurons will be initially active, so you do not get 0 derivatives (and thus lose the ability to learn with typical optimizers).
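For reference, the tutorial's weight and bias helpers look roughly like this; the truncated normal only sets the starting point, and training then moves the weights away from it.
import tensorflow as tf

def weight_variable(shape):
    # small random values around 0; draws more than 2 stddevs away are re-drawn
    initial = tf.truncated_normal(shape, stddev=0.1)
    return tf.Variable(initial)

def bias_variable(shape):
    # small positive bias so ReLU units start out active (non-zero gradients)
    initial = tf.constant(0.1, shape=shape)
    return tf.Variable(initial)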
Thirdly. The second layer in the network seems to do the same thing again. Are more layers just better, could I have achieved the same results with a single layer?
In principle you can model everything with a single-hidden-layer feed-forward net, even without convolutions. However, it might require exponentially many hidden units and perfect optimization strategies which we do not have (and which maybe do not even exist!). The depth of the network gives you the ability to express more complex (and in some cases more useful) features with fewer parameters, and we know more or less how to optimize it. However, you should avoid the common pitfall of assuming "deeper is better". This is not true in general. It is true if the important features of your data can be efficiently expressed as a hierarchical structure of abstractions. It is true for images (more and more complex patterns: first edges, then some lines and curves, then patches, then more complex concepts, etc.) as well as for text, sound, etc., but before you apply DL to a new task you should ask yourself whether this is (or at least might be) true for your case. Using too complex a model is usually far worse than using one that is too simple.

Related

CS231n: Total memory of VGGnet

I'm reading this CS231n tutorial, about convolutional neural networks. They give an example about VGGNet:
http://cs231n.github.io/convolutional-networks/
VGGNet in detail. Let's break down the VGGNet in more detail as a case study. The whole VGGNet is composed of CONV layers that perform 3x3 convolutions with stride 1 and pad 1, and of POOL layers that perform 2x2 max pooling with stride 2 (and no padding). We can write out the size of the representation at each step of the processing and keep track of both the representation size and the total number of weights:
Then they give a detailed calculation of the network structure:
But the thing is, for total memory the tutorial gives a result of 24M, yet when I calculate it I only get about 15M! I simply added up all the memory:
>>> 224*224*(3+64*2)+112*112*(64+128*2)+56*56*(128+256*3)+28*28*(256+512*3)+14*14*(512*4)+7*7*512+4096+4096+1000
15237608
Please help me.
Nice catch! Your calculation is correct; the total memory of the VGG representation is indeed
15.2M * 4 bytes ~= 61 MB
In fact, this error was reported a long time ago, but unfortunately the CS231n staff don't spend much time on website maintenance...
However, note that if you code a VGG network in any framework (Caffe, TensorFlow, etc.), the total model size will also include the parameters, and that part is much larger, as the authors show in their calculations (which seem right).
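For reference, here is a small plain-Python sketch that recomputes both numbers from the CS231n layer configuration quoted above (biases ignored for simplicity):
convs = [(224, 3, 64), (224, 64, 64),                        # conv3-64 x2
         (112, 64, 128), (112, 128, 128),                    # conv3-128 x2
         (56, 128, 256), (56, 256, 256), (56, 256, 256),     # conv3-256 x3
         (28, 256, 512), (28, 512, 512), (28, 512, 512),     # conv3-512 x3
         (14, 512, 512), (14, 512, 512), (14, 512, 512)]     # conv3-512 x3

# representation memory: input image, every conv output, every pool output, FC outputs
acts = 224 * 224 * 3                                         # input
acts += sum(size * size * out for size, _, out in convs)     # conv outputs
acts += sum(s * s * c for s, c in [(112, 64), (56, 128), (28, 256), (14, 512), (7, 512)])  # pool outputs
acts += 4096 + 4096 + 1000                                   # FC outputs

# parameters: 3x3 conv kernels plus the three fully connected weight matrices
params = sum(3 * 3 * cin * cout for _, cin, cout in convs)
params += 7 * 7 * 512 * 4096 + 4096 * 4096 + 4096 * 1000

print(acts)     # 15,237,608 values  -> ~61 MB at 4 bytes each
print(params)   # ~138M weights      -> roughly 550 MB at 4 bytes each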

How do I decide or count number of hidden/tunable parameters in my design?

For my deep learning assignment I need to design an image classification network. There is a constraint in the assignment: I can have at most 500,000 hidden/tunable parameters in this design.
How can I count or observe the number of these hidden parameters, especially if I am using this TensorFlow tutorial as the initial code/design?
Thanks in advance
How can I count or observe the number of these hidden parameters, especially if I am using this TensorFlow tutorial as the initial code/design?
Instead of doing the work for you, I'll show you how to count free parameters.
Glancing quickly, it looks like the cifar10 code uses max pooling, convolution, bias, and fully connected layers. Let's review how many free parameters each of these layers adds to your architecture.
max pooling: FREE! That's right, max pooling adds no free parameters.
conv: a convolution's free parameters live in its filter (kernel) tensor, whose shape looks like [CONV_HEIGHT, CONV_WIDTH, IN_CHANNELS, OUT_CHANNELS], e.g. [5, 5, 3, 64]. Multiply all the dimension sizes together to find the total number of free parameters: for [5, 5, 3, 64] that is 5x5x3x64 = 4,800.
bias: a bias is defined by a shape like [10] or [1,342,342,3]. Same thing: multiply all dimension sizes together to get the total free parameters. Sometimes a bias is just a single number, which means a size of 1.
fully connected: a fully connected layer usually has a 2D shape like [1024,32]. This means it is a 2D matrix, and you calculate the total free parameters just as for the convolution. In this example, [1024,32] has 1024x32 = 32,768 free parameters.
Finally you add up all the free parameters from all the layers and that is your total number of free parameters.
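If you build your model with TensorFlow (1.x, as in the tutorial), you can also let the framework do the bookkeeping. A minimal sketch, assuming the graph has already been built; count_free_parameters is just a hypothetical helper name:
import numpy as np
import tensorflow as tf

def count_free_parameters():
    total = 0
    for v in tf.trainable_variables():      # conv kernels, biases, FC weights, ...
        shape = v.get_shape().as_list()      # e.g. [5, 5, 3, 64]
        total += int(np.prod(shape))         # multiply all dimension sizes together
    return total

# print(count_free_parameters())   # must stay below 500,000 for the assignment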
500,000 parameters? Are you using the R, G and B values of each pixel as inputs? If so, there are some problems:
1. too much data (long training times)
2. in image classification, companies always apply some other image analysis technique (preprocessing) before feeding the data into the NN. If you have two identical images and the second one is shifted by one pixel, they can look very different to the network.
Imagine another neural network that uses two parameters, say weight and height. What happens if you swap those parameters?
Yes, during training your image network can reduce this effect, but when I ran experiments with 5x5 binary images it was very hard for the network. I started using 4 layers, but that only helped a little.
The images used for training could be classified well, even after distortion, but move one by a single pixel and you have a problem.
If not, run experiments or use a genetic algorithm to find a suitable design.
After training you should use some algorithm to find inputs the network treats as "not important" (a big difference between the weights of that input and the rest; if an input's weights are too close to 0, the network "thinks" it is not important).

Should I normalize my features before throwing them into RNN?

I am playing with some demos of recurrent neural networks.
I noticed that the scale of my data in each column differs a lot, so I am considering doing some preprocessing before I throw data batches into my RNN. The close column is the target I want to predict in the future.
open high low volume price_change p_change ma5 ma10 \
0 20.64 20.64 20.37 163623.62 -0.08 -0.39 20.772 20.721
1 20.92 20.92 20.60 218505.95 -0.30 -1.43 20.780 20.718
2 21.00 21.15 20.72 269101.41 -0.08 -0.38 20.812 20.755
3 20.70 21.57 20.70 645855.38 0.32 1.55 20.782 20.788
4 20.60 20.70 20.20 458860.16 0.10 0.48 20.694 20.806
ma20 v_ma5 v_ma10 v_ma20 close
0 20.954 351189.30 388345.91 394078.37 20.56
1 20.990 373384.46 403747.59 411728.38 20.64
2 21.022 392464.55 405000.55 426124.42 20.94
3 21.054 445386.85 403945.59 473166.37 21.02
4 21.038 486615.13 378825.52 461835.35 20.70
My question is: is preprocessing the data, with say StandardScaler in sklearn, necessary in my case? And why?
(You are welcome to edit my question)
It will be beneficial to normalize your training data. Feeding your model features with widely different scales will cause the network to weight the features unequally, which can lead to a false prioritisation of some features over others in the representation.
Although the whole discussion of data preprocessing is contentious, both on exactly when it is necessary and on how to correctly normalize the data for a given model and application domain, there is a general consensus in machine learning that running a mean subtraction as well as a general normalization preprocessing step is helpful.
In the case of mean subtraction, the mean of every individual feature is subtracted from the data, which can be interpreted geometrically as centering the data around the origin. This applies to every dimension.
Normalizing the data after the mean subtraction step brings the data dimensions to approximately the same scale. Note that, as mentioned above, the different features will lose any prioritisation over each other after this step. If you have good reason to think that the different scales in your features carry important information that the network needs in order to truly understand the underlying patterns in your dataset, then normalization will be harmful. A standard approach is to scale the inputs to have a mean of 0 and a variance of 1.
Further preprocessing operations may be helpful in specific cases, such as performing PCA or whitening on your data. Look into the excellent CS231n notes (Setting up the data and the model) for further reference on these topics as well as for a more detailed explanation of the points above.
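As a concrete illustration of the mean-0/variance-1 approach with sklearn's StandardScaler (the small arrays below are just made-up stand-ins for your columns):
import numpy as np
from sklearn.preprocessing import StandardScaler

# hypothetical stand-ins: rows are samples, columns are features (e.g. open, volume)
train = np.array([[20.64, 163623.62],
                  [20.92, 218505.95],
                  [21.00, 269101.41]])
test = np.array([[20.70, 645855.38]])

scaler = StandardScaler()                     # mean subtraction + scaling to unit variance
train_scaled = scaler.fit_transform(train)    # fit the statistics on the training data only
test_scaled = scaler.transform(test)          # reuse the same mean/std for new data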
Definitely yes. Most neural networks work best with data between 0 and 1, or -1 and 1 (depending on the output function). Also, when some inputs are larger than others, the network will "think" they are more important. This can make learning very slow, because the network must first lower the weights on those inputs.
I found this: https://arxiv.org/abs/1510.01378
If you normalize, it may improve convergence, so you will get shorter training times.

LSTM network learning

I have attempted to program my own LSTM (long short-term memory) neural network. I would like to verify that the basic functionality is working. I have implemented a backpropagation through time (BPTT) algorithm to train a single-cell network.
Should a single-cell LSTM network be able to learn a simple sequence, or is more than one cell necessary? The network does not seem to be able to learn a simple sequence such as 1 0 0 0 1 0 0 0 1 0 0 0 1.
I am sending the sequence of 1's and 0's one by one, in order, into the network, and feeding it forward. I record each output for the sequence.
After running the whole sequence through the LSTM cell, I feed the mean error signals back into the cell, saving the weight changes internal to the cell in a separate collection. After running all the errors through one by one and calculating the new weights after each error, I average the new weights together to get the new value for each weight in the cell.
Am I doing something wrong? I would very much appreciate any advice.
Thank you so much!
Having only one cell (one hidden unit) is not a good idea, even if you are just testing the correctness of your code. You should try 50 even for such a simple problem. This paper, http://arxiv.org/pdf/1503.04069.pdf, gives you very clear gradient rules for updating the parameters. Having said that, there is no need to implement your own LSTM even if your dataset and/or the problem you are working on is new. Pick one from an existing library (Theano, mxnet, Torch, etc.) and modify from there; I think that is the easier way, given that it is less error prone and supports GPU computing, which is essential for training LSTMs within a reasonable amount of time.
I haven't tried 1 hidden unit before, but I am sure 2 or 3 hidden units will work for the sequence 0,1,0,1,0,1. It is not necessarily the case that more cells give a better result; training difficulty also increases with the number of cells.
You said you averaged the new weights together to get the new weight. Does that mean you run many training sessions and take the average of the trained weights?
There are many possible reasons why your LSTM did not work, even if you implemented it correctly. The weights are not easy to train by simple gradient descent.
Here are my suggestions for weight optimization.
Use the momentum method for gradient descent (a minimal sketch follows this list).
Add some Gaussian noise to your training set to prevent overfitting.
Use adaptive learning rates for each unit.
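For the momentum suggestion, here is a minimal numpy sketch of the update rule (the gradient function below is only a stand-in for whatever your BPTT computes):
import numpy as np

lr, mu = 0.01, 0.9                         # learning rate and momentum coefficient
w = np.random.randn(10) * 0.1              # some weight vector of the network
velocity = np.zeros_like(w)

def fake_gradient(w):
    return 2.0 * w                         # placeholder: gradient of a simple quadratic loss

for step in range(100):
    g = fake_gradient(w)
    velocity = mu * velocity - lr * g      # decaying accumulation of past gradients
    w += velocity                          # step along the accumulated direction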
Maybe you can take a look at Coursera's Neural Networks course offered by the University of Toronto, and discuss with people there.
Or you can take a look at other examples on GitHub. For instance :
https://github.com/JANNLab/JANNLab/tree/master/examples/de/jannlab/examples
The best way to test an LSTM implementation (after gradient checking) is to try it out on the toy memory problems described in the original LSTM paper itself.
The best one that I often use is the 'Addition Problem':
We give a sequence of tuples of the form (value, mask). Value is a real-valued scalar between 0 and 1. Mask is a binary value, either 0 or 1.
0.23, 0
0.65, 0
...
0.86, 0
0.13, 1
0.76, 0
...
0.34, 0
0.43, 0
0.12, 1
0.09, 0
..
0.83, 0 -> 0.125
In the entire sequence of such tuples (usually of length 100), only 2 tuples should have the mask set to 1; the rest should have the mask set to 0. The target at the final time step is the average of the two values for which the mask was 1. The outputs at all other time steps are ignored. The values and the positions of the masks are chosen arbitrarily. Thus, this simple task shows whether your implementation can actually remember things over long periods of time.
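If it helps, here is a short numpy sketch of a generator for this task (the function name and exact shapes are just one possible choice):
import numpy as np

def addition_problem(seq_len=100):
    values = np.random.uniform(0.0, 1.0, size=seq_len)    # real-valued entries in [0, 1]
    mask = np.zeros(seq_len)
    i, j = np.random.choice(seq_len, size=2, replace=False)
    mask[i] = mask[j] = 1.0                                # exactly two positions are marked
    target = 0.5 * (values[i] + values[j])                 # average of the two marked values
    inputs = np.stack([values, mask], axis=1)              # shape (seq_len, 2): (value, mask) tuples
    return inputs, target

# x, y = addition_problem()
# x has one (value, mask) pair per time step; y is the target at the final step only.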

LIBSVM: Get support vectors from model file

This may be a weird request, so some explanation first. I recently had a sudden HD crash and lost a data file I was using to generate model files with libSVM. I do have the SVM model and scaling file that I generated from this data file, and I was wondering if there is a way to generate a data file from the support vectors in the model file, something like model_sv_to_instances(model, &instances), since the process for obtaining instances is very costly. (I know it won't be the same as the original, but it's still better than nothing.) I'm using a probabilistic SVM with an RBF kernel.
If you open a given model file in any text editor you would find something like this:
svm_type c_svc
kernel_type sigmoid
gamma 0.5
coef0 0
nr_class 2
total_sv 4
rho 0
label 0 1
nr_sv 2 2
SV
1 1:0 2:0
1 1:1 2:1
-1 1:1 2:0
-1 1:0 2:1
The interesting part for you comes after the line with SV:
1 1:0 2:0
1 1:1 2:1
-1 1:1 2:0
-1 1:0 2:1
Those are the data points that were selected as support vectors, so you just have to parse the file. The format is as follows:
[label] [index1]:[value1] [index2]:[value2] ... [indexn]:[valuen]
For instance, from my example you can conclude that my training set was:
x y desired val
0 0 -1
0 1 1
1 0 1
1 1 -1
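If you want to automate this, here is a rough Python sketch (a hypothetical helper, not part of libSVM) that turns the SV block of a two-class model file back into instances. Note that the leading number on each SV line is actually the coefficient alpha*y; for a two-class model its sign matches the class label:
def parse_support_vectors(model_path):
    instances = []
    in_sv_block = False
    with open(model_path) as f:
        for line in f:
            if in_sv_block:
                parts = line.split()
                if not parts:
                    continue
                coef = float(parts[0])          # alpha*y; its sign gives the label (two-class case)
                features = {}
                for item in parts[1:]:
                    index, value = item.split(":")
                    features[int(index)] = float(value)
                instances.append((1 if coef > 0 else -1, features))
            elif line.strip() == "SV":           # everything after this marker is a support vector
                in_sv_block = True
    return instances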
A few considerations and warnings. The ratio between the number of SVs and the number of data points depends on the parameters you used; in some cases you will have very few SVs in comparison with your full data set.
Another thing to keep in mind is that this reduction is likely to change the problem: if you train again using only the SVs as data points, you will probably get a completely different model with a completely different set of parameters.
Good luck!
In the case of RBF you are lucky. According to the libsvm FAQ you can extract the support vectors from the model file:
In the model file, after parameters and other information such as labels, each line represents a support vector.
But remember, these are only the support vectors, which are only a fraction of your original input data.
To the best of my knowledge, SVM models in general, and libSVM models in particular, consist of only the support vectors. These vectors represent the borderline between the classes; most probably, they don't represent the vast majority of your data points. So, unfortunately, I don't think there's a way to regenerate your data from the model.
Having said that, I can think of an esoteric case where the model might have some value: there are companies specializing in recovering data in such cases (e.g. from crashed HDs). However, the recovered data sometimes has gaps; in certain cases, the model might be reverse-engineered to fill in some missing spots. However, this is very theoretical.
EDIT: as the other answers state, the proportion of data points represented by the support vectors may vary, depending on the specific problem and parameters. However, as stated above, in most common cases you will be able to reconstruct only a small fraction of your original data set.