For the simplest MNIST solution with 1 hidden layer, can I interpret the number of hidden neurons as the number of parts into which we divide the inputs?
For example: [784,30,10]
Can I say I divide the 784 pixels into 30 small images (784/30 pixels in each image) and then do the calculation?
Thanks!
I'm not entirely sure what you mean. If you have a network layered as [784, 30, 10], you have 784 input neurons, 30 hidden neurons and 10 output neurons. The neurons don't know anything about 'pixels'; they are just computational units. The hidden layer computes 30 values from all 784 inputs, and the output layer computes 10 values from those 30 hidden values.
Can I say I divide the 784 pixels into 30 small images (784/30 pixels in each image) and then do the calculation?
No, a neuron is not an image. Each hidden neuron receives a weighted sum of all 784 inputs rather than its own disjoint slice of them.
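A minimal NumPy sketch (random weights, purely illustrative) of a forward pass through a [784, 30, 10] network, showing that each of the 30 hidden values mixes all 784 pixels instead of a 784/30-pixel patch:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.random(784)                        # one flattened 28x28 image
    W1, b1 = rng.standard_normal((30, 784)), np.zeros(30)
    W2, b2 = rng.standard_normal((10, 30)), np.zeros(10)

    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    hidden = sigmoid(W1 @ x + b1)              # 30 values, each a weighted sum of ALL 784 pixels
    output = sigmoid(W2 @ hidden + b2)         # 10 values computed from the 30 hidden activations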
I want to experiment with Capsule Networks on FER (facial expression recognition). For now I am using the fer2013 Kaggle dataset.
One thing I didn't understand in CapsNet: with a 28x28 input image and 9x9 filters with stride 1, the size after the first conv layer is reduced to 20x20, but in the capsule layer the size drops to 6x6. How did this happen? With a 20x20 input, 9x9 filters and stride 2, I couldn't get 6x6. Maybe I missed something.
For my experiment, the input image size is 48x48. Should I use the same hyperparameters to start with, or are there suggested hyperparameters I could use instead?
At the beginning the picture is 28x28 and you apply a 9x9 kernel, so you lose 9-1 = 8 pixels (4 on each side). At the end of the first convolutional layer you therefore have (28-8)x(28-8) = 20x20 pixels. Applying the same kernel again gives (20-8)x(20-8) = 12x12, but the second layer uses stride 2, so only 12/2 = 6 pixels are left in each dimension (exactly: floor((20-9)/2) + 1 = 6).
With a 48x48 picture, the same two convolutional layers leave a 16x16 output: ((48-8-8)/2) = 16.
The standard CapsNet has two convolutional layers: the first has stride 1 and the second has stride 2.
If you want a 6x6 grid of capsules from a 48x48 input, your filter size should be 19x19, because:
48 - (19-1) = 30
30 - (19-1) = 12
12 / 2 = 6 (because the stride is 2)
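A quick sketch of the output-size arithmetic using the exact 'valid' convolution formula, floor((input - kernel) / stride) + 1, which reproduces the numbers above (the helper name conv_out is just illustrative):

    def conv_out(size, kernel, stride=1):
        # 'Valid' convolution output size: floor((size - kernel) / stride) + 1
        return (size - kernel) // stride + 1

    # 28x28 input, two 9x9 convs with strides 1 then 2 (standard CapsNet): 6x6
    print(conv_out(conv_out(28, 9, 1), 9, 2))    # 6

    # 48x48 input with the same 9x9 filters: 16x16
    print(conv_out(conv_out(48, 9, 1), 9, 2))    # 16

    # 48x48 input with 19x19 filters to land back on 6x6
    print(conv_out(conv_out(48, 19, 1), 19, 2))  # 6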
I need a (non-trainable) layer in Keras which does a local-maximum search on a 2-D input and sums up all entries in a certain neighbourhood around each maximum. The input shape to the layer is something like (None, 1, 24, 16).
To make it clearer: the layer should find every local maximum in the 2-D input and then sum up all entries from the (up to) 8 neighbouring pixels around that maximum.
It's a high energy physics related problem, so the interpretation of these "images" should not be too important I think.
How can I accomplish that using Keras layers?
Any suggestions? Thanks!
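One possible sketch (untested, and assuming the 1 in (None, 1, 24, 16) is a channels-first channel dimension) uses a non-trainable Lambda layer: a pixel counts as a local maximum if it equals the 3x3 max-pool of its neighbourhood, and the 3x3 neighbourhood sums come from a fixed ones-kernel convolution. The function name local_max_sum is just illustrative:

    import tensorflow as tf
    from tensorflow.keras import layers

    def local_max_sum(x):
        # x: (batch, 1, 24, 16), channels_first; move channels last for the pooling ops
        x = tf.transpose(x, [0, 2, 3, 1])                               # (batch, 24, 16, 1)
        pooled = tf.nn.max_pool2d(x, ksize=3, strides=1, padding='SAME')
        is_max = tf.cast(tf.equal(x, pooled), x.dtype)                  # 1 where a pixel is a 3x3 local maximum
        kernel = tf.ones([3, 3, 1, 1], dtype=x.dtype)
        neigh_sum = tf.nn.conv2d(x, kernel, strides=1, padding='SAME')  # 3x3 window sums
        # subtract x from neigh_sum here if the centre pixel itself should be excluded
        out = is_max * neigh_sum                                        # keep the sum only at maxima, zero elsewhere
        return tf.transpose(out, [0, 3, 1, 2])                          # back to (batch, 1, 24, 16)

    layer = layers.Lambda(local_max_sum)  # no trainable parameters

This places each neighbourhood sum at the position of its maximum and leaves zeros elsewhere, which may or may not match the desired output format.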
As mentioned above, both
tf.nn.conv2d with strides = 2
and
tf.nn.max_pool with 2x2 pooling
can halve the spatial size of the input. I know the outputs may be different, but what I don't know is whether that affects the final training result or not. Any clue about this? Thanks.
In both of your examples, assume we have a [height, width] kernel applied with strides [2, 2]. That means we apply the kernel to a 2-D window of size [height, width] on the 2-D input to get one output value, and then slide the window over by 2 (horizontally or vertically) to get the next output value.
In both cases you end up with 4x fewer outputs than inputs (2x fewer in each dimension), assuming padding='SAME'.
The difference is how the output values are computed for each window:
conv2d
the output is a linear combination of the input values, each multiplied by a weight for the corresponding cell in the [height, width] kernel
these weights become trainable parameters in your model
max_pool
the output is just selecting the maximum input value within the [height, width] window of input values
there is no weight and no trainable parameters introduced by this operation
The final training results can indeed differ: the strided convolution multiplies the input by a trainable filter, which you might not want, since it costs extra computation and adds weights that can make the model more prone to overfitting, whereas max pooling adds no parameters.
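A minimal TF2 sketch (shapes chosen only for illustration) showing that both operations halve the spatial dimensions, while only the convolution introduces trainable weights:

    import tensorflow as tf

    x = tf.random.normal([1, 28, 28, 3])          # NHWC input

    # Strided convolution: a trainable [3, 3, 3, 8] filter applied with stride 2
    filters = tf.Variable(tf.random.normal([3, 3, 3, 8]))
    conv_out = tf.nn.conv2d(x, filters, strides=2, padding='SAME')

    # Max pooling: no trainable parameters
    pool_out = tf.nn.max_pool2d(x, ksize=2, strides=2, padding='SAME')

    print(conv_out.shape)  # (1, 14, 14, 8)
    print(pool_out.shape)  # (1, 14, 14, 3)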
I was looking at the TensorFlow examples by Aymeric Damien (https://github.com/aymericdamien/TensorFlow-Examples/blob/master/examples/3_NeuralNetworks/multilayer_perceptron.py), and in multilayer_perceptron.py he uses a neural net to classify MNIST digits. I think he is using a neural network with 784 inputs, two hidden layers with 256 neurons each, and 10 outputs. Am I correct? How do the matrix dimensions of the weights and biases in multilayer_perceptron.py correspond to the ANN "dimensions" (number of inputs, number of hidden layers, number of outputs, number of neurons in each hidden layer, etc.)? Thank you!
This is a 3-layer neural network (2 hidden layers and an output layer).
The connection between the inputs to the first hidden layer has 784 x 256 weights with 256 biases. This configuration is due to the fact that each of the 784 inputs is fully connected to the 256 hidden layer nodes, and each hidden layer node has 1 bias.
The connection between that first hidden layer and the second hidden layer has 256 x 256 weights due to full connectivity between the layers. The second layer's 256 nodes each have 1 bias.
The connection between the second hidden layer and the output layer is similar. There are 256 x 10 weights (for the second hidden layer's 256 nodes and the output layer's 10 nodes), and each output node has 1 bias.
There are thus 784*256 + 256*256 + 256*10 = 268,800 weights and 256 + 256 + 10 = 522 biases.
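A quick sketch of the counting (the layer sizes are taken from the example; the code itself is just illustrative arithmetic, not the code from the repository):

    # Layer sizes in multilayer_perceptron.py: 784 inputs, two hidden layers of 256, 10 outputs
    sizes = [784, 256, 256, 10]

    weights = sum(a * b for a, b in zip(sizes[:-1], sizes[1:]))  # 784*256 + 256*256 + 256*10 = 268,800
    biases = sum(sizes[1:])                                      # 256 + 256 + 10 = 522
    print(weights, biases)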
So, I'm trying to learn fixed vector representations for segments of about 200 songs (~ 3-5 minutes per song) and wanted to use an LSTM-based Sequence-to-sequence Autoencoder for it.
I'm preprocessing the audio (using librosa) as follows:
I'm first just getting a raw audio signal time series of shape around (1500000,) - (2500000,) per song.
I'm then slicing each raw time series into segments and getting a lower-level mel spectrogram matrix of shape (512, 3000) - (512, 6000) per song. Each of these (512,) vectors can be referred to as 'mini-songs' as they represent parts of the song.
I stack all these mini-songs of all the songs side by side as columns to create the training data (let's call this X). X turns out to be (512, 600000) in size, where the first dimension (512) is the window size and the second dimension (600000) is the total number of 'mini-songs' in the dataset.
Which is to say, there are about 600000 mini-songs in X - each column in X represents a mini-song of length (512,).
Each of these (512,) mini-song vectors should be encoded into a (50,) vector per mini-song i.e. we will have 600000 (50,) vectors at the end of the process.
In more standard terminology, I have 600000 training samples, each of length 512. [Think of this as being similar to an image dataset: 600000 images, each of length 784, where the images are of resolution 28x28. Except in my case I want to treat the 512-length samples as sequences that have temporal properties.]
I read the example here and was looking to extend that for my use case. I was wondering what the timesteps and input_dim parameters to the Input layer should be set to.
I'm setting timesteps = X.shape[0] (i.e. 512 in this case) and input_dim = X.shape[1] (i.e 600000). Is this the correct way to go about it?
Edit: Added clarifications above.
Your input is actually a 1-D sequence, not a 2-D image.
The input tensor should have shape (600000, 512, 1), and you need to set input_dim to 1 and timesteps to 512.
The shape argument of the Input layer does not include the first (batch) dimension of the tensor (i.e. 600000 in your case).
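A minimal sketch of what that looks like in Keras, following the standard LSTM sequence-to-sequence autoencoder pattern (the latent size of 50 comes from the question; the layer choices are just one reasonable assumption, not necessarily the example the question links to):

    from tensorflow.keras import layers, models

    timesteps, input_dim, latent_dim = 512, 1, 50

    inputs = layers.Input(shape=(timesteps, input_dim))       # batch dimension is implicit
    encoded = layers.LSTM(latent_dim)(inputs)                 # (batch, 50) fixed-size code per mini-song
    decoded = layers.RepeatVector(timesteps)(encoded)         # feed the code to every decoder timestep
    decoded = layers.LSTM(input_dim, return_sequences=True)(decoded)

    autoencoder = models.Model(inputs, decoded)
    autoencoder.compile(optimizer='adam', loss='mse')
    # X would first need to be reshaped to (600000, 512, 1), e.g. X.T.reshape(-1, 512, 1)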