How to interpret patterns occurring in a convolutional layer after training? - image-processing

I am pretty sure I understood the principle of CNNs and why they are preferred over plain fully connected neural networks. What I am trying to comprehend is how to interpret the patterns that occur after training the model.
So let's assume I want to recognize the number "1" written on a 256x256 image plane (a 1-bit image, black/white) that is then forwarded to the output, which says either "is a one" or "is not a one".
If the model is untrained and the first handwritten "1" is forwarded, the result could be [0.28, 0.72], which is obviously wrong. I then calculate the error between [0.28, 0.72] and [1, 0] (for example based on the mean squared error), take its derivative and step towards a local minimum of the loss (backpropagation). Then I calculate the delta values for each weight (using the chain rule and partial derivatives) until I finally reach the convolutional layer, for which delta values are also calculated.
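To make the numbers concrete, here is a minimal NumPy sketch of the output-layer part of that computation (a sketch only; the array names are my own, not from the question):
import numpy as np
prediction = np.array([0.28, 0.72])  # untrained network output
target = np.array([1.0, 0.0])        # desired output for "is a one"
mse = np.mean((prediction - target) ** 2)           # mean squared error: 0.5184
grad = 2 * (prediction - target) / prediction.size  # gradient of MSE w.r.t. the output: [-0.72, 0.72]
This gradient is the starting point of backpropagation; the chain rule then carries it back through the fully connected and convolutional layers.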
But my question now is: what exactly do the patterns mean that emerge from adding up a bunch of delta values onto the convolutional layer "weights"? Why do they end up picking out features characteristic of the number "1"? Or is it more that the network does not find any specific features per se, but rather "encodes" the relationship between handwritten "1"s and the desired output [1, 0] into the convolutional layers?

Related

Can a dense layer on many inputs be represented as a single matrix multiplication?

Denote a[2, 3] to be a matrix of dimension 2x3. Say there are 10 elements in each input and the network is a two-element classifier (cat or dog, for example). Say there is just one dense layer. For now I am ignoring the bias vector. I know this is an over-simplified neural net, but it is just for this example. Each output in a dense layer of a neural net can be calculated as
output = matmul(input, weights)
Where weights is a weight matrix 10x2, input is an input vector 1x10, and output is an output vector 1x2.
My question is this: Can an entire series of inputs be computed at the same time with a single matrix multiplication? It seems like you could compute
output = matmul(input, weights)
Where there are 100 inputs total, and input is 100x10, weights is 10x2, and output is 100x2.
In back propagation, you could do something similar:
input_err = matmul(output_err, transpose(weights))
weights_err = matmul(transpose(input), output_err)
weights -= learning_rate*weights_err
Where weights is the same, output_err is 100x2, and input is 100x10.
However, I tried to implement a neural network in this way from scratch and have so far been unsuccessful. I am wondering if I have some other error or if my approach is fundamentally wrong.
Thanks!
If anyone else is wondering, I found the answer to my question. This does not in fact work as I originally intended, for a few reasons. Essentially, computing all inputs in this way is like running the network with a batch size equal to the total number of inputs. The weights do not get updated between inputs, but rather all at once. So while the calculation itself is valid, it means that each input no longer individually influences the training step by step. However, with a reasonable batch size, you can do 2D matrix multiplications, where the input is batch_size by input_size, in order to speed up training.
In addition, if predicting on many inputs (in the test stage, for example), since no weights are updated, an entire matrix multiplication of num_inputs by input_size can be run to compute all inputs in parallel.
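To illustrate the batched version described above, here is a small NumPy sketch of one forward and backward pass (the batch size, shapes, and learning rate are arbitrary example values, not from the original post):
import numpy as np
rng = np.random.default_rng(0)
batch_size, input_size, output_size = 32, 10, 2
inputs = rng.normal(size=(batch_size, input_size))    # 32x10
weights = rng.normal(size=(input_size, output_size))  # 10x2
targets = rng.normal(size=(batch_size, output_size))  # 32x2
outputs = inputs @ weights            # forward pass for the whole batch, 32x2
output_err = outputs - targets        # 32x2
input_err = output_err @ weights.T    # 32x10, error propagated to the previous layer
weights_err = inputs.T @ output_err   # 10x2, gradients summed over the batch
weights -= 0.01 * weights_err         # one update for the whole batch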

deep neural network model stops learning after one epoch

I am training an unsupervised NN model, and for some reason, after exactly one epoch (80 steps), the model stops learning.
Do you have any idea why this might happen, and what I should do to prevent it?
This is more info about my NN:
I have a deep NN that tries to solve an optimization problem. My loss function is customized and it is my objective function in the optimization problem.
So if my optimization problem is min f(x), then in my DNN the loss = f(x). I have 64 inputs, 64 outputs, and 3 layers in between:
self.l1 = nn.Linear(input_size, hidden_size)
self.relu1 = nn.LeakyReLU()
self.BN1 = nn.BatchNorm1d(hidden_size)
and the last layer is:
self.l5 = nn.Linear(hidden_size, output_size)
self.tan5 = nn.Tanh()
self.BN5 = nn.BatchNorm1d(output_size)
to scale my network's output.
With more layers and nodes (doubled: 8 layers with 200 nodes each), I can get a little more progress toward a lower error, but again after 100 steps the training error becomes flat!
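For reference, a minimal sketch of a module with the structure quoted above (Linear -> LeakyReLU -> BatchNorm1d blocks, Tanh + BatchNorm1d on the output); the hidden size and the number of hidden blocks are my assumptions, since only fragments were shown:
import torch.nn as nn
class OptNet(nn.Module):
    def __init__(self, input_size=64, hidden_size=200, output_size=64):
        super().__init__()
        # hidden blocks, each Linear -> LeakyReLU -> BatchNorm1d
        self.hidden = nn.Sequential(
            nn.Linear(input_size, hidden_size), nn.LeakyReLU(), nn.BatchNorm1d(hidden_size),
            nn.Linear(hidden_size, hidden_size), nn.LeakyReLU(), nn.BatchNorm1d(hidden_size),
            nn.Linear(hidden_size, hidden_size), nn.LeakyReLU(), nn.BatchNorm1d(hidden_size),
        )
        # output block, as in the question: Linear -> Tanh -> BatchNorm1d
        self.out = nn.Sequential(
            nn.Linear(hidden_size, output_size), nn.Tanh(), nn.BatchNorm1d(output_size),
        )
    def forward(self, x):
        return self.out(self.hidden(x))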
The symptom is that the training loss stops improving relatively early. Assuming that your problem is learnable at all, there are many possible reasons for this behavior. The following are the most relevant:
Improper preprocessing of input: Neural networks prefer input with zero mean. E.g., if the input is all positive, it restricts the weights to being updated in the same direction, which may not be desirable (https://youtu.be/gYpoJMlgyXA).
Therefore, you may want to subtract the mean from all the images (e.g., subtract 127.5 from each of the 3 channels). Scaling to make unit standard deviation in each channel may also be helpful.
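A minimal sketch of that kind of preprocessing, assuming 8-bit RGB images stored in a NumPy array (the constants are just the example values above):
import numpy as np
images = np.random.randint(0, 256, size=(16, 32, 32, 3), dtype=np.uint8)  # placeholder batch of RGB images
x = images.astype(np.float32)
x -= 127.5                                 # roughly zero-center each channel
x /= x.std(axis=(0, 1, 2), keepdims=True)  # unit standard deviation per channel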
Generalization ability of the network: The network is not complicated or deep enough for the task.
This is very easy to check. You can train the network on just a few images (say from 3 to 10). The network should be able to overfit the data and drive the loss to almost 0. If that is not the case, you may have to add more layers, such as using more than one Dense layer.
Another good idea is to use pre-trained weights (see the applications section of the Keras documentation). You may adjust the Dense layers at the top to fit your problem.
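As a sketch of the overfitting sanity check described above (layer sizes, data shapes and epoch count are placeholders, not values from the question):
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
x_small = np.random.rand(5, 64)                 # just a handful of samples
y_small = np.random.randint(0, 2, size=(5, 1))
model = Sequential()
model.add(Dense(128, input_dim=64, activation='relu'))
model.add(Dense(128, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy')
model.fit(x_small, y_small, epochs=500, verbose=0)
print(model.evaluate(x_small, y_small, verbose=0))  # should be close to 0 if the network can overfit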
Improper weight initialization: Improper weight initialization can prevent the network from converging (https://youtu.be/gYpoJMlgyXA, the same video as before).
For the ReLU activation, you may want to use He initialization instead of the default Glorot initialization. I find that this may be necessary sometimes, but not always.
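In Keras, for instance, this can be done via the kernel_initializer argument (a minimal sketch; layer sizes are placeholders):
from keras.models import Sequential
from keras.layers import Dense
model = Sequential()
# He initialization is usually a better match for ReLU than the default Glorot
model.add(Dense(128, input_dim=64, activation='relu', kernel_initializer='he_normal'))
model.add(Dense(1, activation='sigmoid'))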
Lastly, you can use debugging tools for Keras such as keras-vis, keplr-io, and deep-viz-keras. They are very useful for opening the black box of convolutional networks.
I faced the same problem, and then I did the following:
After going through a blog post, I managed to determine that my problem resulted from the encoding of my labels. Originally I had them as one-hot encodings which looked like [[0, 1], [1, 0], [1, 0]] and in the blog post they were in the format [0 1 0 0 1]. Changing my labels to this and using binary crossentropy has gotten my model to work properly. Thanks to Ngoc Anh Huynh and rafaelvalle!

Why is there an activation function in each neural net layer, and not just one in the final layer?

I'm trying to teach myself machine learning and I have a similar question to this.
Is this correct:
For example, say I have an input matrix where X1, X2 and X3 are three numerical features (e.g. petal length, stem length and flower length, and I'm trying to label whether the sample is a particular flower species or not):
x1 x2 x3 label
5 1 2 yes
3 9 8 no
1 2 3 yes
9 9 9 no
That you take the vector of the first ROW (not column) of the table above and input it into the network like this:
i.e. there would be three neurons (one for each value of the first table row), then w1, w2 and w3 are randomly selected, and to calculate the first neuron in the next column, you do the multiplication I have described and add a randomly selected bias term. This gives the value of that node.
This is done for a set of nodes (i.e. each column will actually have four nodes (three + a bias); for simplicity, I removed the other three nodes from the second column), and then in the last node before the output there is an activation function to transform the sum into a value (e.g. 0-1 for sigmoid), and that value tells you whether the classification is yes or no.
I'm sorry for how basic this is; I want to really understand the process, and I'm doing it from free resources. So therefore generally, you should select the number of nodes in your network to be a multiple of the number of features, e.g. in this case, it would make sense to write:
from keras.models import Sequential
from keras.layers import Dense
model = Sequential()
model.add(Dense(6,input_dim=3,activation='relu'))
model.add(Dense(6,input_dim=3,activation='relu'))
model.add(Dense(3,activation='softmax'))
What I don't understand is why the keras model has an activation function in each layer of the network and not just at the end, which is why I'm wondering if my understanding is correct/why I added the picture.
Edit 1: Just a note, I saw that on the bias neuron I put 'b=1' on the edge; that might be confusing. I know the bias doesn't have a weight, so that was just a reminder to myself that the weight of the bias node is 1.
Several issues here apart from the question in your title, but since this is not the time & place for full tutorials, I'll limit the discussion to some of your points, taking also into account that at least one more answer already exists.
So therefore generally, you should select the number of nodes in your network to be a multiple of the number of features,
No.
The number of features is passed in the input_dim argument, which is set only for the first layer of the model; the number of inputs for every layer except the first one is simply the number of outputs of the previous one. The Keras model you have written is not valid, and it will produce an error, since for your 2nd layer you ask for input_dim=3, while the previous one has clearly 6 outputs (nodes).
Beyond this input_dim argument, there is no other relationship whatsoever between the number of data features and the number of network nodes; and since it seems you have in mind the iris data (4 features), here is a simple reproducible example of applying a Keras model to them.
What is somewhat hidden in the Keras sequential API (which you use here) is that there is in fact an implicit input layer, and the number of its nodes is the dimensionality of the input; see my own answer to Keras Sequential model input layer for details.
So, the model you have drawn in your pad actually corresponds to the following Keras model written using the sequential API:
model = Sequential()
model.add(Dense(1,input_dim=3,activation='linear'))
where in the functional API it would be written as:
inputs = Input(shape=(3,))
outputs = Dense(1, activation='linear')(inputs)
model = Model(inputs, outputs)
and that's all, i.e. it is actually just linear regression.
I know the bias doesn't have a weight
The bias does have a weight. Again, the useful analogy is with the constant term of linear (or logistic) regression: the bias "input" itself is always 1, and its corresponding coefficient (weight) is learned through the fitting process.
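One way to see this in Keras (a minimal sketch; the layer size is arbitrary) is that a Dense layer's learned parameters consist of a kernel matrix plus a bias vector:
from keras.models import Sequential
from keras.layers import Dense
model = Sequential()
model.add(Dense(1, input_dim=3, activation='linear'))
kernel, bias = model.layers[0].get_weights()
print(kernel.shape)  # (3, 1): one weight per input feature
print(bias.shape)    # (1,): the bias term, learned during fitting just like the other weights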
why the keras model has an activation function in each layer of the network and not just at the end
I trust this has been covered sufficiently in the other answer.
I'm sorry for how basic this is, I want to really understand the process, and I'm doing it from free resources.
We all did; no excuse though not to benefit from Andrew Ng's free & excellent Machine Learning MOOC at Coursera.
It seems your question is why there is an activation function for each layer instead of just the last layer. The simple answer is: if there are no non-linear activations in the middle, then no matter how deep your network is, it can be boiled down to a single linear equation. Therefore, non-linear activation is one of the big enablers that allow deep networks to actually be "deep" and learn high-level features.
Take the following example: say you have a 3-layer neural network without any non-linear activations in the middle, but with a final softmax layer. The weights and biases for these layers are (W1, b1), (W2, b2) and (W3, b3). Then you can write the network's final output as follows.
h1 = W1.x + b1
h2 = W2.h1 + b2
h3 = Softmax(W3.h2 + b3)
Let's do some manipulations. We'll simply replace h3 as a function of x,
h3 = Softmax(W3.(W2.(W1.x + b1) + b2) + b3)
h3 = Softmax((W3.W2.W1) x + (W3.W2.b1 + W3.b2 + b3))
In other words, h3 is in the following format.
h3 = Softmax(W.x + b)
So, without the non-linear activations, our 3-layer network has been squashed into a single-layer network. That is why non-linear activations are important.
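A small NumPy sketch to verify this numerically (the matrix sizes are arbitrary): the three stacked linear layers and the single collapsed layer produce the same pre-softmax output.
import numpy as np
rng = np.random.default_rng(0)
x = rng.normal(size=3)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(4, 4)), rng.normal(size=4)
W3, b3 = rng.normal(size=(2, 4)), rng.normal(size=2)
h3 = W3 @ (W2 @ (W1 @ x + b1) + b2) + b3  # three linear layers, no activations in between
W = W3 @ W2 @ W1                          # the equivalent single layer
b = W3 @ W2 @ b1 + W3 @ b2 + b3
print(np.allclose(h3, W @ x + b))         # True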
Imagine you have an activation function only in the last layer (in your case sigmoid; it could be something else too, say softmax). Its purpose is to convert real values into a 0 to 1 range for a classification-style answer. But the activations in the inner (hidden) layers have a different purpose altogether: they introduce nonlinearity. Without them (say ReLU, tanh, etc.), what you get is a linear function, and no matter how many hidden layers you have, you still end up with a linear function. You only convert this into a nonlinear function in the last layer. This might work for some simple nonlinear problems, but it will not be able to capture a complex nonlinear function.
Each hidden unit (in each layer) includes an activation function to incorporate nonlinearity.

How to generate the predicted label in caffe from the output of the last layer?

I have trained my own dataset of images (traffic light images, 11x27) with LeNet, using caffe and the DIGITS interface. I get 99% accuracy, and when I give it new images via DIGITS, it predicts the right label, so the network seems to work very well.
However, I struggle to predict the labels through Python/Matlab API for caffe. The last layer output (ip2) is a vector with 2 elements (I have 2 classes), which looks like [4.8060, -5.2608] for example (the first component is always positive, the second always negative and the absolute values range from 4 to 20). I know it from many tests in Python, Matlab and DIGITS.
My problem is :
Argmax can't work directly on this layer (it always gives 0)
If I use a softmax function, it will always give me [1, 0] (and that's actually the value of net.blobs['prob'] or out['prob'] in the python interface, no matter the class of my image)
So, how can I get the correct predicted label?
Thanks!
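For illustration, here is how the post-processing described in the question looks on the quoted example output, using plain NumPy (whether index 0 is the expected class depends on the label ordering used during training):
import numpy as np
ip2 = np.array([4.8060, -5.2608])  # last-layer output for one image
prob = np.exp(ip2 - ip2.max())     # softmax: with such a large score gap it saturates to ~[1, 0]
prob /= prob.sum()
predicted_label = int(np.argmax(prob))  # 0, i.e. the class with the highest raw score
print(prob, predicted_label)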

Artificial Neural Network for formula classification/calculation

I am trying to create an ANN for calculating/classifying a/any formula.
I initially tried to replicate the Fibonacci sequence, using the inputs:
[1,2] output [3]
[2,3] output [5]
[3,5] output [8]
etc...
The issue I am trying to overcome is how to normalize data that could potentially be infinite or scale exponentially. I then tried to create an ANN to calculate the slope-intercept formula y = mx + b (here 2x + 2) with inputs
[1] output [4]
[2] output [6]
etc...
Again I do not know how to normalize the data. If I normalize only the training data how would the network be able to calculate or classify with inputs outside of what was used for normalization?
So would it be possible to create an ANN to calculate/classify the formula ((a + 2b + c^2 + 3d - 5e) modulo 2), where the formula is unknown, but (some of) the inputs a, b, c, d and e are given as well as the output? Essentially it would classify whether the calculation's output is odd or even, with inputs ranging between -infinity and +infinity...
Okay, I think I understand what you're trying to do now. Basically, you are going to have a set of inputs representing the coefficients of a function. You want the ANN to tell you whether the function, with those coefficients, will produce an even or an odd output. Let me know if that's wrong. There are a few potential issues here:
First, while it is possible to use a neural network to do addition, it is not generally very efficient. You also need to set your ANN up in a very specific way, either by using a different node type than is usually used, or by setting up complicated recurrent topologies. This would explain your lack of success with the Fibonacci sequence and the line equation.
But there's a more fundamental problem. You might have heard that ANNs are general function approximators. However, in this case, the function that the ANN is learning won't be your formula. When you have an ANN that is learning to output either 0 or 1 in response to a set of inputs, it's actually trying to learn a function for a line (or set of lines, or hyperplane, depending on the topology) that separates all of the inputs for which the output should be 0 from all of the inputs for which the output should be 1. (see the answers to this question for a more thorough explanation, with pictures). So the question, then, is whether or not there is a hyperplane that separates coefficients that will result in an even output from coefficients that will result in an odd output.
I'm inclined to say that the answer to that question is no. If you consider the a coefficient in your example, for instance, you will see that every time you increment or decrement it by 1, the correct output switches. The same is true for the c, d, and e terms. This means that there aren't big clumps of relatively similar inputs that all return the same output.
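A tiny sketch of that parity flip (the example values are my own):
def parity(a, b, c, d, e):
    return (a + 2 * b + c ** 2 + 3 * d - 5 * e) % 2  # 0 = even, 1 = odd
print(parity(1, 2, 3, 4, 5))  # 1 (odd)
print(parity(2, 2, 3, 4, 5))  # 0 (even): changing a by 1 flips the output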
Why do you need to know whether the output of an unknown function is even or odd? There might be other, more appropriate techniques.
