Artificial Neural Network Training and Testing - machine-learning

I'm trying to create an ANN that will solve a simple classification problem. The example I am using is a degree classification, so the input will be a percentage between 0 and 100 and the output will be one of five classes (1st, 2:1, 2:2...).
Currently I have set up a neural network with three layers: 1 input neuron, 3 hidden neurons and 5 output neurons. I have managed to train the network using one input, e.g. 60, and the output (1,0,0,0,0). I am unsure, though, how I would go about properly training the network for each input and output combination, so that after training I would be able to input a percentage and the correct output neuron would be the one closest to 1.
The network uses standard feed forward and back propagation algorithms, random weights and the Sigmoid function.
I have a file which I was thinking would work, with the inputs 0-100 and the corresponding outputs in between:
0
1, 0, 0, 0, 0
1
1, 0, 0, 0, 0
.....
40
0, 1, 0, 0, 0
....
100
0, 0, 0, 0, 1
Thanks

I don't quite understand the function you are trying to learn, but it doesn't matter. The usual way to train an ANN is SGD (stochastic gradient descent), where backpropagation is used to compute the gradient for one example at a time. You simply keep looping over all input examples until the network has learned them.
One thing you didn't mention is that you need a loss function. In your case, a simple mean squared error might be appropriate.
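The loop described above can be sketched in plain Python as follows. This is only a minimal illustration, not the asker's actual code: the class boundaries in `one_hot` (40, 50, 60, 70), the `TinyNet` class, the learning rate and the epoch count are all assumptions made for the example. It uses the 1-3-5 sigmoid layout from the question, scales the percentage to [0, 1], and applies a squared-error loss one example at a time.

```python
import math
import random

def one_hot(percent):
    # Assumed class boundaries (hypothetical; replace with your own scheme).
    bounds = [40, 50, 60, 70]
    cls = sum(percent >= b for b in bounds)
    return [1.0 if i == cls else 0.0 for i in range(5)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

class TinyNet:
    """1 input, n_hidden sigmoid units, n_out sigmoid outputs."""
    def __init__(self, n_hidden=3, n_out=5, seed=0):
        rng = random.Random(seed)
        self.w1 = [rng.uniform(-1, 1) for _ in range(n_hidden)]
        self.b1 = [rng.uniform(-1, 1) for _ in range(n_hidden)]
        self.w2 = [[rng.uniform(-1, 1) for _ in range(n_hidden)]
                   for _ in range(n_out)]
        self.b2 = [rng.uniform(-1, 1) for _ in range(n_out)]

    def forward(self, x):
        h = [sigmoid(w * x + b) for w, b in zip(self.w1, self.b1)]
        y = [sigmoid(sum(w * hj for w, hj in zip(row, h)) + b)
             for row, b in zip(self.w2, self.b2)]
        return h, y

    def sgd_step(self, x, target, lr):
        h, y = self.forward(x)
        # Backpropagation for the loss 0.5 * sum((y - t)^2).
        d_out = [(yk - tk) * yk * (1 - yk) for yk, tk in zip(y, target)]
        d_hid = [hj * (1 - hj) * sum(d_out[k] * self.w2[k][j]
                                     for k in range(len(d_out)))
                 for j, hj in enumerate(h)]
        for k, dk in enumerate(d_out):
            for j, hj in enumerate(h):
                self.w2[k][j] -= lr * dk * hj
            self.b2[k] -= lr * dk
        for j, dj in enumerate(d_hid):
            self.w1[j] -= lr * dj * x
            self.b1[j] -= lr * dj

def mean_squared_error(net, data):
    total = 0.0
    for x, t in data:
        _, y = net.forward(x)
        total += sum((yk - tk) ** 2 for yk, tk in zip(y, t))
    return total / len(data)

# One training pair per percentage, with the input scaled to [0, 1].
data = [(p / 100.0, one_hot(p)) for p in range(101)]
net = TinyNet()
loss_before = mean_squared_error(net, data)

shuffler = random.Random(1)
for epoch in range(200):
    shuffler.shuffle(data)        # visit the examples in a new order each epoch
    for x, t in data:
        net.sgd_step(x, t, lr=0.5)

loss_after = mean_squared_error(net, data)
print(loss_before, loss_after)
```

After training, the predicted class for a percentage `p` would be the index of the largest value in `net.forward(p / 100.0)[1]`. Three hidden units may well be too few to separate all five classes cleanly, so treat the hidden layer size as something to experiment with.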

I suggest that you take a look at the classifier.py Python script used for classification in this tutorial: http://www.marekrei.com/blog/theano-tutorial/
The complete code for the tutorial is available here:
https://github.com/marekrei/theano-tutorial
The classifier script is meant for predicting whether the GDP per capita of a country is above the average GDP. I, however, used the script for a different kind of dataset.
I was able to successfully train neural networks in Theano with it to classify a speech sound as the letter "A" or "E".

Related

MLP with both Pattern Recognition & Forecasting Inputs: Bad Idea?

This is regarding a 3-layer MLP (Input, Hidden, Output) in Ward Systems NeuroShell 2
I would prefer to split these input layer classes (PR & F) into 2 separate nets with their own hidden layers that then feed a single output layer - this would be a 3 layer network. There could be a 4 layer version using a new hidden layer to combine the 2 nets:
1) Inputs (partitioned into F and PR classes)
2) Hiddens (partitioned into F and PR classes)
3) Hiddens (fully connected "mixing" layer)
4) Output
These structures would be trained at once as opposed to training the two networks, getting the output/prediction, and then averaging those 2 numbers.
I've found that while averaging outputs works, "letting a net do it" works even better. But this requires layer partitioning, which my platform (NeuroShell 2) cannot do. And I've never read a paper where anyone attempts to do better than averaging.
FYI the ratio of PR to F inputs is 10:1.
Most discussion of nets concerns Forecasting, which usually involves on the order of 10 inputs. Pattern Recognition involves orders of magnitude more: 100's to 1000's and even more.
In fact, it seems that the two types of problems are virtually mutually exclusive when searching the research.
So my conclusion is having both types of structure in a single network is probably a very bad idea.
Agreed?
Not a bad idea at all! In fact this approach is a very common one; you just missed out on some of the lingo.
Basically, what you're trying to do here is an ensemble prediction. The best way to approach this is to train two entirely separate nets for both halves of your problem. Then use those outputs as inputs to a new neural network.
The field is known as ensemble learning and the results are often quite good.
As far as your question of blending pattern recognition and forecasting, well it's really impossible to make a call on that without knowing more specifics around the data you're working with, but just because people haven't tried it doesn't mean you shouldn't either.
Forecasting with time series uses a sliding window over the data sequence to create the inputs for the network. I don't think you should mix forecasting with classification. The output for classification can be either a softmax or a binary result, whereas the output of a forecast will be a dense output with one neuron.
https://machinelearningmastery.com/how-to-develop-multilayer-perceptron-models-for-time-series-forecasting/
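For example, the sequence
[10, 20, 30, 40, 50, 60, 70, 80, 90]
is split into samples with three input steps (X) and one output step (y):
X             y
10, 20, 30    40
20, 30, 40    50
30, 40, 50    60
...
The inputs X and targets y are then fit by the multilayer perceptron.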
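The windowing step above can be sketched in a few lines of Python. The function name `split_sequence` and the parameter `n_steps` are chosen here for illustration only; the sketch just reproduces the X/y table from the example sequence.

```python
def split_sequence(sequence, n_steps):
    """Slide a window of n_steps inputs over the sequence; the value
    right after each window becomes the target for that window."""
    X, y = [], []
    for i in range(len(sequence) - n_steps):
        X.append(sequence[i:i + n_steps])
        y.append(sequence[i + n_steps])
    return X, y

series = [10, 20, 30, 40, 50, 60, 70, 80, 90]
X, y = split_sequence(series, n_steps=3)
for window, target in zip(X, y):
    print(window, target)
```

Each row of `X` then has a fixed length, which is what lets a plain feed-forward network be used for the forecast.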

LSTM network learning

I have attempted to program my own LSTM (long short-term memory) neural network, and I would like to verify that the basic functionality is working. I have implemented a backpropagation through time (BPTT) algorithm to train a single-cell network.
Should a single-cell LSTM network be able to learn a simple sequence, or is more than one cell necessary? The network does not seem to be able to learn a simple sequence such as 1 0 0 0 1 0 0 0 1 0 0 0 1.
I am sending the sequence of 1's and 0's into the network one by one, in order, and feeding it forward. I record each output for the sequence.
After running the whole sequence through the LSTM cell, I feed the mean error signals back into the cell, saving the weight changes internal to the cell in a separate collection. After running all the errors through one by one and calculating the new weights after each error, I average the new weights together to get the new weight, for each weight in the cell.
Am I doing something wrong? I would very much appreciate any advice.
Thank you so much!
Having only one cell (one hidden unit) is not a good idea, even if you are just testing the correctness of your code. You should try 50, even for such a simple problem. This paper: http://arxiv.org/pdf/1503.04069.pdf gives you very clear gradient rules for updating the parameters. Having said that, there is no need to implement your own LSTM, even if your dataset and/or the problem you are working on is new. Picking an existing library (Theano, mxnet, Torch etc.) and modifying from there is, I think, an easier way, given that it is less error-prone and supports GPU computing, which is essential for training an LSTM within a reasonable amount of time.
I haven't tried 1 hidden unit before, but I am sure 2 or 3 hidden units will work for the sequence 0,1,0,1,0,1. More cells do not necessarily give a better result, and training difficulty also increases with the number of cells.
You said you averaged new weights together to get the new weight. Does that mean you run many training sessions and take the average of the trained weights?
There are many possible reasons why your LSTM did not work, even if you implemented it correctly. The weights are not easy to train by simple gradient descent.
Here are my suggestions for weight optimization:
Use the momentum method for gradient descent.
Add some Gaussian noise to your training set to prevent overfitting.
Use adaptive learning rates for each unit.
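As a rough illustration of the first suggestion, here is a minimal sketch of the classical momentum update on a one-dimensional toy problem. The function name, the quadratic objective and the hyperparameter values are assumptions for the example, not part of any LSTM implementation.

```python
def minimize_with_momentum(grad, w0, learning_rate=0.1, momentum=0.9, steps=300):
    """Gradient descent with classical momentum: the velocity keeps a
    decaying sum of past gradients, which smooths the update direction."""
    w, velocity = w0, 0.0
    for _ in range(steps):
        velocity = momentum * velocity - learning_rate * grad(w)
        w += velocity
    return w

# Toy objective f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w_final = minimize_with_momentum(lambda w: 2.0 * (w - 3.0), w0=0.0)
print(w_final)
```

In a network, the same update would be applied to every weight, with one velocity value stored per weight.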
Maybe you can take a look at Coursera's Neural Network course offered by the University of Toronto, and discuss with people there.
Or you can take a look at other examples on GitHub. For instance:
https://github.com/JANNLab/JANNLab/tree/master/examples/de/jannlab/examples
The best way to test an LSTM implementation (after gradient checking) is to try it out on the toy memory problems described in the original LSTM paper itself.
The best one that I often use is the 'Addition Problem':
We give a sequence of tuples of the form (value, mask). Value is a real valued scalar number between 0 and 1. Mask is a binary value - either 0 or 1.
0.23, 0
0.65, 0
...
0.86, 0
0.13, 1
0.76, 0
...
0.34, 0
0.43, 0
0.12, 1
0.09, 0
..
0.83, 0 -> 0.125
In the entire sequence of such tuples (usually of length 100), only 2 tuples should have the mask set to 1; the rest should have the mask set to 0. The target at the final time step is the average of the two values for which the mask was 1. The outputs at all time steps other than the last one are ignored. The values and the positions of the masks are chosen arbitrarily. Thus, this simple task shows whether your implementation can actually remember things over long periods of time.
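A data generator for the task as described above could look like the following sketch. The function name `make_addition_example` and its parameters are made up for this illustration; the target follows the description given here (the average of the two marked values).

```python
import random

def make_addition_example(length=100, rng=None):
    """Return (sequence, target): sequence is a list of (value, mask)
    tuples with exactly two masks set to 1, and target is the average
    of the two marked values."""
    rng = rng or random.Random()
    values = [rng.random() for _ in range(length)]
    marked = rng.sample(range(length), 2)      # two distinct positions
    masks = [1 if i in marked else 0 for i in range(length)]
    target = (values[marked[0]] + values[marked[1]]) / 2.0
    return list(zip(values, masks)), target

sequence, target = make_addition_example(100, random.Random(0))
print(sequence[:3], target)
```

During training, only the network output at the last time step would be compared with `target`.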

Input Permutations in Feed-Forward Neural Networks

Given a feed-forward neural network, how can one:
Ensure that it is independent of the order of the inputs? E.g., feeding [0.2, 0.3] would output the same result as [0.3, 0.2];
Ensure that it is independent of the order of groups of inputs? E.g., feeding [0.2, 0.3, 0.4, 0.5] would output the same result as [0.4, 0.5, 0.2, 0.3], but not [0.5, 0.4, 0.3, 0.2];
Ensure that a permutation of the input sequence gives the same permutation of the output sequence? E.g., if [0.2, 0.3] gives as output [0.8, 0.7], then [0.3, 0.2] gives as output [0.7, 0.8].
Given the above:
Is there any other solution besides ensuring that the training set covers all the possible permutations?
Is the parity of the hidden layer somehow constrained (i.e., must the number of neurons in the hidden layer be odd or even)?
Does it make sense to look for some sort of symmetry in the weight matrix?
Well, it looks like a hard job for a NN, but:
1. I'd write a preprocessing (and maybe postprocessing) script that takes care of all your permutations and makes sure that the easiest possible input is given to the NN. I think pre(post)processing would be a much easier way to achieve your goal than adjusting the NN (adding one or more hidden layers).
2 & 3. NNs are usually perceived as black boxes. That means you train one and analyse just its input and output. In most cases it doesn't make sense (it is too time-demanding) to try to understand how it works inside. Of course there are some exceptions, e.g. if you have a functional NN and would like to mine some knowledge from it, but as I said, it is time-consuming.
In general, there are no constraints on the number of hidden neurons per layer. Also, looking for symmetry in the weight matrix doesn't make sense unless you are trying to extract some knowledge ...
Here is my attempt to answer the questions as best I can.
How to
1. To get the required result you can either
train all permutations, or
sort the input data and train on it (so the network doesn't have to learn the permutations as well).
2. To get the requested result you again have two possibilities:
train all permutations (time-consuming), or
better, use another type of network, for example a recurrent neural network with the echo state network training algorithm (paper here).
3. I would again try to solve it with the echo state network algorithm.
I hope it helps, even if the possible solutions for the second and third problems are not feed-forward networks.
Answering the questions
3. I don't think that it makes any sense to look for symmetries in the weight matrix.
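The preprocessing and postprocessing idea from the answers above (sorting the inputs before they reach the network and, for the third case, undoing the sort on the outputs) can be sketched as wrappers around any model. Everything below is illustrative: the wrapper names are invented, and `toy_scalar_model` and `toy_vector_model` are hypothetical order-sensitive stand-ins for a trained network.

```python
def order_invariant(model, inputs):
    # Case 1: sort the inputs so every permutation looks the same to the model.
    return model(sorted(inputs))

def group_invariant(model, inputs, group_size):
    # Case 2: sort whole groups, keeping the order inside each group.
    groups = [tuple(inputs[i:i + group_size])
              for i in range(0, len(inputs), group_size)]
    return model([x for group in sorted(groups) for x in group])

def permutation_equivariant(model, inputs):
    # Case 3: sort, run the model, then put the outputs back in input order.
    order = sorted(range(len(inputs)), key=lambda i: inputs[i])
    sorted_outputs = model([inputs[i] for i in order])
    outputs = [None] * len(inputs)
    for position, original_index in enumerate(order):
        outputs[original_index] = sorted_outputs[position]
    return outputs

# Hypothetical stand-ins for a trained network; both depend on input order.
def toy_scalar_model(v):
    return sum((i + 1) * x for i, x in enumerate(v))

def toy_vector_model(v):
    return [2 * v[0] + v[1], v[0] - v[1]]

print(order_invariant(toy_scalar_model, [0.2, 0.3]),
      order_invariant(toy_scalar_model, [0.3, 0.2]))
print(permutation_equivariant(toy_vector_model, [0.2, 0.3]),
      permutation_equivariant(toy_vector_model, [0.3, 0.2]))
```

The network itself is then only ever trained and evaluated on sorted inputs. The equivariant wrapper assumes the input values are distinct, since ties make the sort order ambiguous.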

Does it make any sense that weights and threshold are growing proportionally when training my perceptron?

I am taking my first steps in neural networks, and to do so I am experimenting with a very simple single-layer, single-output perceptron which uses a sigmoidal activation function. I am updating my weights online each time a training example is presented, using:
weights += learningRate * (correct - result) * {input,1}
Here weights is an n-length vector which also contains the weight from the bias neuron (- threshold), result is the result as computed by the perceptron (and processed using the sigmoid) when given the input, correct is the correct result, and {input,1} is the input augmented with 1 (the fixed input from the bias neuron). Now, when I try to train the perceptron to perform logic AND, the weights don't converge for a long time. Instead they keep growing in a similar way and maintain a ratio of circa -1.5 with the threshold. For instance, the three weights are, in sequence:
5.067160008240718 5.105631826680446 -7.945513136885797
...
8.40390853077094 8.43890306970281 -12.889540730182592
I would expect the perceptron to stop at 1, 1, -1.5.
Apart from this problem, which looks connected to some missing stopping condition in the learning, if I try to use the identity function as the activation function, I get weight values oscillating around:
0.43601272528257057 0.49092558197172703 -0.23106430854347537
and I obtain similar results with tanh. I can't explain this.
Thank you
Tunnuz
It is because the sigmoid activation function doesn't reach one (or zero) even with very highly positive (or negative) inputs. So (correct - result) will always be non-zero, and your weights will always get updated. Try it with the step function as the activation function (i.e. f(x) = 1 for x > 0, f(x) = 0 otherwise).
Your average weight values don't seem right for the identity activation function. It might be that your learning rate is a little high -- try reducing it and see if that reduces the size of the oscillations.
Also, when doing online learning (aka stochastic gradient descent), it is common practice to reduce the learning rate over time so that you converge to a solution. Otherwise your weights will continue to oscillate.
When trying to analyze the behavior of the perceptron, it helps to also look at correct and result.
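The difference between the two activation functions can be checked with a small experiment. The sketch below applies the update rule from the question to logic AND; the zero initial weights, the learning rate of 1.0 and the epoch counts are assumptions chosen for the illustration.

```python
import math

AND_DATA = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def step(x):
    return 1.0 if x > 0 else 0.0

def train(activation, epochs, learning_rate=1.0):
    weights = [0.0, 0.0, 0.0]            # two input weights plus the bias weight
    for _ in range(epochs):
        for inputs, correct in AND_DATA:
            augmented = (*inputs, 1.0)   # input augmented with the bias input
            result = activation(sum(w * x for w, x in zip(weights, augmented)))
            # weights += learningRate * (correct - result) * {input,1}
            weights = [w + learning_rate * (correct - result) * x
                       for w, x in zip(weights, augmented)]
    return weights

def norm(weights):
    return math.sqrt(sum(w * w for w in weights))

# Step activation: once every example is classified correctly,
# (correct - result) is exactly 0 and the weights stop changing.
print(train(step, 50), train(step, 500))

# Sigmoid activation: the error never reaches exactly 0,
# so the weights keep growing the longer you train.
print(norm(train(sigmoid, 200)), norm(train(sigmoid, 2000)))
```

With the sigmoid, a stopping condition such as "stop when the error on every example is below some tolerance" is therefore needed, since the updates themselves never become zero.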

Unable to approximate the sine function using a neural network

I am trying to approximate the sine() function using a neural network I wrote myself. I have tested my neural network on a simple OCR problem already and it worked, but I am having trouble applying it to approximate sine(). My problem is that during training my error converges on exactly 50%, so I'm guessing it's completely random.
I am using one input neuron for the input (0 to PI), and one output neuron for the result. I have a single hidden layer in which I can vary the number of neurons but I'm currently trying around 6-10.
I have a feeling the problem is because I am using the sigmoid transfer function (which is a requirement in my application) which only outputs between 0 and 1, while the output for sine() is between -1 and 1. To try to correct this I tried multiplying the output by 2 and then subtracting 1, but this didn't fix the problem. I'm thinking I have to do some kind of conversion somewhere to make this work.
Any ideas?
Use a linear output unit.
Here is a simple example using R:
set.seed(1405)
x <- sort(10*runif(50))
y <- sin(x) + 0.2*rnorm(x)
library(nnet)
nn <- nnet(x, y, size=6, maxit=40, linout=TRUE)
plot(x, y)
plot(sin, 0, 10, add=TRUE)
x1 <- seq(0, 10, by=0.1)
lines(x1, predict(nn, data.frame(x=x1)), col="green")
When you train the network, you should normalize the target (the sin function) to the range [0,1], then you can keep the sigmoid transfer function.
sin(x) in [-1,1] => 0.5*(sin(x)+1) in [0,1]
Train data:
input    target     target_normalized
------------------------------------
0        0          0.5
pi/4     0.70711    0.85355
pi/2     1          1
...
Note that we mapped the target before training. Once you train and simulate the network, you can map the output of the net back.
The following MATLAB code illustrates this:
%% input and target
input = linspace(0,4*pi,200);
target = sin(input) + 0.2*randn(size(input));
% mapping
[targetMinMax,mapping] = mapminmax(target,0,1);
%% create network (one hidden layer with 6 nodes)
net = newfit(input, targetMinMax, [6], {'tansig' 'tansig'});
net.trainParam.epochs = 50;
view(net)
%% training
net = init(net); % init
[net,tr] = train(net, input, targetMinMax); % train
output = sim(net, input); % predict
%% view prediction
plot(input, mapminmax('reverse', output, mapping), 'r', 'linewidth',2), hold on
plot(input, target, 'o')
plot(input, sin(input), 'g')
hold off
legend({'predicted' 'target' 'sin()'})
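The target mapping used in this answer, 0.5*(sin(x)+1), and its inverse can also be written out in a few lines of Python, independent of any particular network library. The function names are invented for this sketch.

```python
import math

def normalize_target(t):
    # Map a target in [-1, 1] to [0, 1] so a sigmoid output can reach it.
    return 0.5 * (t + 1.0)

def denormalize_output(y):
    # Map a network output in [0, 1] back to [-1, 1].
    return 2.0 * y - 1.0

inputs = [i * math.pi / 8 for i in range(17)]    # 0 .. 2*pi
targets = [math.sin(x) for x in inputs]
normalized = [normalize_target(t) for t in targets]
restored = [denormalize_output(y) for y in normalized]
print(normalize_target(math.sin(math.pi / 4)))
```

The network is trained on `normalized`, and `denormalize_output` is applied to its predictions afterwards.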
There is no reason your network shouldn't work, although 6 is definitely on the low side for approximating a sine wave. I'd try at least 10 maybe even 20.
If that doesn't work then I think you need to give more detail about your system. i.e. the learning algorithm (back-propagation?), the learning rate etc.
I get the same behavior if I use vanilla gradient descent. Try using a different training algorithm.
As far as the Java applet is concerned, I did notice something interesting: it does converge if I use a "bipolar sigmoid" and I start with some non-random weights (such as results from a previous training using a Quadratic function).
