Input Permutations in Feed-Forward Neural Networks

Input Permutations in Feed-Forward Neural Networks - machine-learning

Given a feed-forward neural-network, how to:
Ensure that it is independent on the order of the inputs? e.g., feeding [0.2, 0.3] would output the same result as [0.3, 0.2];
Ensure that it is independent on the order of groups of inputs? e.g., feeding [0.2, 0.3, 0.4, 0.5] would output the same result as [0.4, 0.5, 0.2, 0.3], but not [0.5, 0.4, 0.3, 0.2];
Ensure that a permutation on the input sequence would give a permutation on the output sequence. e.g., if [0.2, 0.3] gives as output [0.8, 0.7], then [0.3, 0.2] gives as output [0.7, 0.8].
Given the above:
Is there any other solution besides ensuring that the train set covers all the possible permutations?
Is the parity of the hidden layer somehow constrained (i.e., the number of neurons in the hidden layer must be odd or even)?
Does it make sense too look for some sort of symmetry in the weight matrix?

well, it looks like a hard job for NN but
1. I'd make some preprocessing and maybe postprocessing script which would take care of all your permutation, make sure that the easiest possible input is given to NN. I think pre(post)processing would be much easier to achieve your goal than adjusting NN (adding one or more hidden layers)
2&3 NN are usually perceived as blackboxes. It means you train it and analyse just input and output. In most cases it doesn't make sense(time-demanding) to try to understand how is it working inside (of course there are some exceptions eg if you have functional NN and you would like to mine some knowledge - butas i said - it is time-consuming).
In general, there are no constraints regarding to number of hidden neurons per layer. Also, looking for symetry in weight matrix doesn't make sense unless you are trying to find some knowledge ...

Here is my try to answer the questions as best as i can
How to
To get the required results you can either
train all permutations
sort the input data and train it (so it doesn't have to learn the permutations extra)
To get the requested result you do have again two possibilities
train all permutations (timeconsuming)
or better, use another type of network, for example a recurrent neural network with the echo state network training algorithm (paper here)
i would try to solve it again with the echo state network algorithm
I hope it helps even if the possible solutions for the second and third problem are no feed forward networks.
Answering the questions
3 I don't think that it makes any sense to look for symetries in the weight matrix.

Related

How to interpret patterns occuring in convolutional layer after training?

I am pretty sure I understood' the principle of cnn and why they are prefered over just fully connected neural networks. What I try to comprehend is how to interpret the occuring patterns after training the model.
So let's assume I want to recognize the number "1" written on an 256x256 big image-plane (only 1 bit image, black/white) that is then forwared to the output that either says "is a one", or "is not a one".
If the model is untrained and the first handwritten "1" is forwared, the result could be "[0.28, 0.72] which is obiously wrong. I then calculate the error between [0.28, 0.72] and [1, 0] (for example based on the mean squared error), derive it and try to find the local minimas of the derivative (backpropagation). Then I calculate the delta values for each weight (by using chainrule and partial derivative) until I finally reach the convolutional layer for which delta values are also calculated.
But my question now is: What exactly do the patterns that will occur by adding up bunch of delta values to the convolutional layer "weights" mean? Why do they find certain features characteristical for the number "1"? Or is it more like, it does not find any specific features per se, but rather it "encodes" the relationship between handwritten "1"s and the desired output [1, 0] into the convolutional layers?
_

deep neural network model stops learning after one epoch

I am training a unsupervised NN model and for some reason, after exactly one epoch (80 steps), model stops learning.
]
Do you have any idea why it might happen and what should I do to prevent it?
This is more info about my NN:
I have a deep NN that tries to solve an optimization problem. My loss function is customized and it is my objective function in the optimization problem.
So if my optimization problems is min f(x) ==> loss, now in my DNN loss = f(x). I have 64 input, 64 output, 3 layers in between :
self.l1 = nn.Linear(input_size, hidden_size)
self.relu1 = nn.LeakyReLU()
self.BN1 = nn.BatchNorm1d(hidden_size)
and last layer is:
self.l5 = nn.Linear(hidden_size, output_size)
self.tan5 = nn.Tanh()
self.BN5 = nn.BatchNorm1d(output_size)
to scale my network.
with more layers and nodes(doubles: 8 layers each 200 nodes), I can get a little more progress toward lower error, but again after 100 steps training error becomes flat!

The symptom is that the training loss stops being improved relatively early. Suppose that your problem is learnable at all, there are many reasons for the for this behavior. Following are most relavant:
Improper preprocessing of input: Neural network prefers input with
zero mean. E.g., if the input is all positive, it will restrict the
weights to be updated in the same direction, which may not be
desirable (https://youtu.be/gYpoJMlgyXA).
Therefore, you may want to subtract the mean from all the images (e.g., subtract 127.5 from each of the 3 channels). Scaling to make unit standard deviation in each channel may also be helpful.
Generalization ability of the network: The network is not complicated
or deep enough for the task.
This is very easy to check. You can train the network on just a few
images (says from 3 to 10). The network should be able to overfit the
data and drives the loss to almost 0. If it is not the case, you may
have to add more layers such as using more than 1 Dense layer.
Another good idea is to used pre-trained weights (in applications of Keras documentation). You may adjust the Dense layers at the top to fit with your problem.
Improper weight initialization. Improper weight initialization can
prevent the network from converging (https://youtu.be/gYpoJMlgyXA,
the same video as before).
For the ReLU activation, you may want to use He initialization
instead of the default Glorot initialiation. I find that this may be
necessary sometimes but not always.
Lastly, you can use debugging tools for Keras such as keras-vis, keplr-io, deep-viz-keras. They are very useful to open the blackbox of convolutional networks.
I faced the same problem then I followed the following:
After going through a blog post, I managed to determine that my problem resulted from the encoding of my labels. Originally I had them as one-hot encodings which looked like [[0, 1], [1, 0], [1, 0]] and in the blog post they were in the format [0 1 0 0 1]. Changing my labels to this and using binary crossentropy has gotten my model to work properly. Thanks to Ngoc Anh Huynh and rafaelvalle!

Do I need to add ReLU function before last layer to predict a positive value?

I am developing a model using linear regression to predict the age. I know that the age is from 0 to 100 and it is a possible value. I used conv 1 x 1 in the last layer to predict the real value. Do I need to add a ReLU function after the output of convolution 1x1 to guarantee the predicted value is a positive value? Currently, I did not add ReLU and some predicted value becomes negative value like -0.02 -0.4…

There's no compelling reason to use an activation function for the output layer; typically you just want to use a reasonable/suitable loss function directly with the penultimate layer's output. Specifically, a RELU doesn't solve your problem (or at most only solves 'half' of it) since it can still predict above 100. In this case -predicting a continuous outcome- there's a few standard loss functions like squared error or L1-norm.
If you really want to use an activation function for this final layer and are concerned about always predicting within a bounded interval, you could always try scaling up the sigmoid function (to between 0 and 100). However, there's nothing special about sigmoid here - any bounded function, ex. any CDF of a signed, continuous random variable, could be similarly used. Though for optimization, something easily differentiable is important.
Why not start with something simple like squared-error loss? It's always possible to just 'clamp' out-of-range predictions to within [0-100] (we can give this a fancy name like 'doubly RELU') when you need to actually make predictions (as opposed to during training/testing), but if you're getting lots of such errors, the model might have more fundamental problems.

Even for a regression problem, it can be good (for optimisation) to use a sigmoid layer before the output (giving a prediction in the [0:1] range) followed by a denormalization (here if you think maximum age is 100, just multiply by 100)
This tip is explained in this fast.ai course.
I personally think these lessons are excellent.

You should use a sigmoid activation function, and then normalize the targets outputs to the [0, 1] range. This solves both issues of being positive and with a limit.
You can easily then denormalize the neural network outputs to get an output in the [0, 100] range.

A model without any hidden layer shows 98 percent accuracy.Why?

I have build two models one is without any hidden layer and I used softmax at the output. And other is with one hidden layer and in hidden layer I used sigmoid as an activation function. I was expecting that the model with one hidden layer will give better performance but I am getting almost same performance in both models. I was wondering why the model without any hidden layer is showing such a high performance? In both cases I have used large amount of data to train the network.
Here is the out of the model without any hidden layer. Can someone please guide me why it is showing such a high accuracy. In literature I have read that deeper network has more expressive power.
`step: 4400, train_acc: 0.99, test_acc: 0.996
step: 4500, train_acc: 1.0, test_acc: 0.996
step: 4600, train_acc: 1.0, test_acc: 0.998
step: 4700, train_acc: 0.99, test_acc: 0.998
step: 4800, train_acc: 1.0,test_acc: 1.0
step: 4900, train_acc: 0.99,test_acc: 0.996`

it seems that your data set is linearly separable , which means a linear classifier can be used to get good accuracy on training set if not 100%. one neuron is all it takes to find a decision boundary for a linearly separable problem. adding more layers and more neurons in each layer with none linear activation functions, is only for the sake of making more complex classifiers for more complex patterns.
conclusion, if you get the most accuracy that is possible, what is more that you expect a more complex network would offer? computation cost of course.

LSTM network learning

I have attempted to program my own LSTM (long short term memory) neural network. I would like to verify that the basic functionality is working. I have implemented a Back propagation through time BPTT algorithm to train a single cell network.
Should a single cell LSTM network be able to learn a simple sequence, or are more than one cells necessary? The network does not seem to be able to learn a simple sequence such as 1 0 0 0 1 0 0 0 1 0 0 0 1.
I am sending the the sequence 1's and 0's one by one, in order, into the network, and feeding it forward. I record each output for the sequence.
After running the whole sequence through the LSTM cell, I feed the mean error signals back into the cell, saving the weight changes internal to the cell, in a seperate collection, and after running all the errors one by one through and calculating the new weights after each error, I average the new weights together to get the new weight, for each weight in the cell.
Am i doing something wrong? I would very appreciate any advice.
Thank you so much!

Having only one cell (one hidden unit) is not a good idea even if you are just testing the correctness of your code. You should try 50 even for such simple problem. This paper here: http://arxiv.org/pdf/1503.04069.pdf gives you very clear gradient rules for updating the parameters. Having said that, there is no need to implement your own even if your dataset and/or the problem you are working on is new LSTM. Pick from the existing library (Theano, mxnet, Torch etc...) and modify from there I think is a easier way, given that it's less error prone and it supports gpu computing which is essential for training lstm within a reasonable amount of time.

I haven't tried 1 hidden unit before, but I am sure 2 or 3 hidden units will work for sequence 0,1,0,1,0,1. It is not necessarily the more the cells, the better the result. Training difficulty also increases with the number of cells.
You said you averaged new weights together to get the new weight. Does that mean you run many training sessions and take the average of the trained weights?
There are many possibilities your LSTM did not work, even if you implemented it correctly. The weights are not easy to train by simple gradient descent.
Here are my suggestion for weight optimization.
Using Momentum method for gradient descent.
Add some gaussian noise to your training set to prevent overfitting.
using adaptive learning rates for each unit.
Maybe you can take a look at Coursera's course Neural Network offered by Toronto University, and discuss with people there.
Or you can take a look at other examples on GitHub. For instance :
https://github.com/JANNLab/JANNLab/tree/master/examples/de/jannlab/examples

The best way to test an LSTM implementation (after gradient checking) is to try it out on the toy memory problems described in the original LSTM paper itself.
The best one that I often use is the 'Addition Problem':
We give a sequence of tuples of the form (value, mask). Value is a real valued scalar number between 0 and 1. Mask is a binary value - either 0 or 1.
0.23, 0
0.65, 0
...
0.86, 0
0.13, 1
0.76, 0
...
0.34, 0
0.43, 0
0.12, 1
0.09, 0
..
0.83, 0 -> 0.125
In the entire sequence of such tuples (usually of length 100), only 2 tuples should have mask as 1, the rest of the tuples should have the mask as 0. The target at the final time step is the a average of the two values for which the mask was 1. The outputs at all other time steps, other than the last one is ignored. The values and the positions of the mask are arbitrarily chosen. Thus, this simple task shows if your implementation can actually remember things over long periods of time.

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart