MLP with both Pattern Recognition & Forecasting Inputs: Bad Idea? - machine-learning

This is regarding a 3-layer MLP (Input, Hidden, Output) in Ward Systems NeuroShell 2
I would prefer to split these input layer classes (PR & F) into 2 separate nets with their own hidden layers that then feed a single output layer - this would be a 3 layer network. There could be a 4 layer version using a new hidden layer to combine the 2 nets:
1) Inputs (partitioned into F and PR classes)
2) Hiddens (partitioned into F and PR classes)
3) Hiddens (fully connected "mixing" layer)
4) Output
These structures would be trained at once as opposed to training the two networks, getting the output/prediction, and then averaging those 2 numbers.
I've found that while averaging outputs works, "letting a net do it" works even better. But this requires layer partitioning which my platform (NeuralShell 2) cannot. And I've never read a paper where anyone attempts to do better than averaging.
FYI the ratio of PR to F inputs is 10:1.
Most discussion of nets is Forecasting with usually is of the Order of 10 inputs. Pattern Recognition has Orders more, 100's to 1000's and even more.
In fact, it seems that the two types of problems are virtually mutually exclusive when searching the research.
So my conclusion is having both types of structure in a single network is probably a very bad idea.
Agreed?

Not a bad idea at all! In fact this approach is a very common one that is used very frequently, you just missed out on some secret lingo.
Basically, what you're trying to do here is an ensemble prediction. The best way to approach this is to train two entirely separate nets for both halves of your problem. Then use those outputs as inputs to a new neural network.
The field is known as ensemble learning and the results are often quite good.
As far as your question of blending pattern recognition and forecasting, well it's really impossible to make a call on that without knowing more specifics around the data you're working with, but just because people haven't tried it doesn't mean you shouldn't either.

forecasting using time series creates a sliding window of data sequences for input into the network. I don't think you show mix forecasting with classification. The output is for classification can be either a softmax or a binary result. Whereas the output of a forecast will be a dense output with one neuron.
https://machinelearningmastery.com/how-to-develop-multilayer-perceptron-models-for-time-series-forecasting/
[10, 20, 30, 40, 50, 60, 70, 80, 90]
X, y
10, 20, 30 40
20, 30, 40 50
30, 40, 50 60
The input X and y is then fit by the multi layer perceptron.

Related

deep neural network model stops learning after one epoch

I am training a unsupervised NN model and for some reason, after exactly one epoch (80 steps), model stops learning.
]
Do you have any idea why it might happen and what should I do to prevent it?
This is more info about my NN:
I have a deep NN that tries to solve an optimization problem. My loss function is customized and it is my objective function in the optimization problem.
So if my optimization problems is min f(x) ==> loss, now in my DNN loss = f(x). I have 64 input, 64 output, 3 layers in between :
self.l1 = nn.Linear(input_size, hidden_size)
self.relu1 = nn.LeakyReLU()
self.BN1 = nn.BatchNorm1d(hidden_size)
and last layer is:
self.l5 = nn.Linear(hidden_size, output_size)
self.tan5 = nn.Tanh()
self.BN5 = nn.BatchNorm1d(output_size)
to scale my network.
with more layers and nodes(doubles: 8 layers each 200 nodes), I can get a little more progress toward lower error, but again after 100 steps training error becomes flat!
The symptom is that the training loss stops being improved relatively early. Suppose that your problem is learnable at all, there are many reasons for the for this behavior. Following are most relavant:
Improper preprocessing of input: Neural network prefers input with
zero mean. E.g., if the input is all positive, it will restrict the
weights to be updated in the same direction, which may not be
desirable (https://youtu.be/gYpoJMlgyXA).
Therefore, you may want to subtract the mean from all the images (e.g., subtract 127.5 from each of the 3 channels). Scaling to make unit standard deviation in each channel may also be helpful.
Generalization ability of the network: The network is not complicated
or deep enough for the task.
This is very easy to check. You can train the network on just a few
images (says from 3 to 10). The network should be able to overfit the
data and drives the loss to almost 0. If it is not the case, you may
have to add more layers such as using more than 1 Dense layer.
Another good idea is to used pre-trained weights (in applications of Keras documentation). You may adjust the Dense layers at the top to fit with your problem.
Improper weight initialization. Improper weight initialization can
prevent the network from converging (https://youtu.be/gYpoJMlgyXA,
the same video as before).
For the ReLU activation, you may want to use He initialization
instead of the default Glorot initialiation. I find that this may be
necessary sometimes but not always.
Lastly, you can use debugging tools for Keras such as keras-vis, keplr-io, deep-viz-keras. They are very useful to open the blackbox of convolutional networks.
I faced the same problem then I followed the following:
After going through a blog post, I managed to determine that my problem resulted from the encoding of my labels. Originally I had them as one-hot encodings which looked like [[0, 1], [1, 0], [1, 0]] and in the blog post they were in the format [0 1 0 0 1]. Changing my labels to this and using binary crossentropy has gotten my model to work properly. Thanks to Ngoc Anh Huynh and rafaelvalle!

Why is there an activation function in each neural net layer, and not just one in the final layer?

I'm trying to teach myself machine learning and I have a similar question to this.
Is this correct:
For example, if I have an input matrix, where X1, X2 and X3 are three numerical features (e.g. say they are petal length, stem length, flower length, and I'm trying to label whether the sample is a particular flower species or not):
x1 x2 x3 label
5 1 2 yes
3 9 8 no
1 2 3 yes
9 9 9 no
That you take the vector of the first ROW (not column) of the table above to be inputted into the network like this:
i.e. there would be three neurons (1 for each value of the first table row), and then w1,w2 and w3 are randomly selected, then to calculate the first neuron in the next column, you do the multiplication I have described, and you add a randomly selected bias term. This gives the value of that node.
This is done for a set of nodes (i.e. each column actually will have four nodes (three + a bias), for simplicity, i removed the other three nodes from the second column), and then in the last node before the output, there is an activation function to transform the sum into a value (e.g. 0-1 for sigmoid) and that value tells you whether the classification is yes or no.
I'm sorry for how basic this is, I want to really understand the process, and I'm doing it from free resources. So therefore generally, you should select the number of nodes in your network to be a multiple of the number of features, e.g. in this case, it would make sense to write:
from keras.models import Sequential
from keras.models import Dense
model = Sequential()
model.add(Dense(6,input_dim=3,activation='relu'))
model.add(Dense(6,input_dim=3,activation='relu'))
model.add(Dense(3,activation='softmax'))
What I don't understand is why the keras model has an activation function in each layer of the network and not just at the end, which is why I'm wondering if my understanding is correct/why I added the picture.
Edit 1: Just a note I saw that in the bias neuron, I put on the edge 'b=1', that might be confusing, I know the bias doesn't have a weight, so that was just a reminder to myself that the weight of the bias node is 1.
Several issues here apart from the question in your title, but since this is not the time & place for full tutorials, I'll limit the discussion to some of your points, taking also into account that at least one more answer already exists.
So therefore generally, you should select the number of nodes in your network to be a multiple of the number of features,
No.
The number of features is passed in the input_dim argument, which is set only for the first layer of the model; the number of inputs for every layer except the first one is simply the number of outputs of the previous one. The Keras model you have written is not valid, and it will produce an error, since for your 2nd layer you ask for input_dim=3, while the previous one has clearly 6 outputs (nodes).
Beyond this input_dim argument, there is no other relationship whatsoever between the number of data features and the number of network nodes; and since it seems you have in mind the iris data (4 features), here is a simple reproducible example of applying a Keras model to them.
What is somewhat hidden in the Keras sequential API (which you use here) is that there is in fact an implicit input layer, and the number of its nodes is the dimensionality of the input; see own answer in Keras Sequential model input layer for details.
So, the model you have drawn in your pad actually corresponds to the following Keras model written using the sequential API:
model = Sequential()
model.add(Dense(1,input_dim=3,activation='linear'))
where in the functional API it would be written as:
inputs = Input(shape=(3,))
outputs = Dense(1, activation='linear')(inputs)
model = Model(inputs, outputs)
and that's all, i.e. it is actually just linear regression.
I know the bias doesn't have a weight
The bias does have a weight. Again, the useful analogy is with the constant term of linear (or logistic) regression: the bias "input" itself is always 1, and its corresponding coefficient (weight) is learned through the fitting process.
why the keras model has an activation function in each layer of the network and not just at the end
I trust this has been covered sufficiently in the other answer.
I'm sorry for how basic this is, I want to really understand the process, and I'm doing it from free resources.
We all did; no excuse though to not benefit from Andrew Ng's free & excellent Machine Learning MOOC at Coursera.
It seems your question is why there is a activation function for each layer instead of just the last layer. The simple answer is, if there are no non-linear activations in the middle, no matter how deep your network is, it can be boiled down to a single linear equation. Therefore, non-linear activation is one of the big enablers that enable deep networks to be actually "deep" and learn high-level features.
Take the following example, say you have 3 layer neural network without any non-linear activations in the middle, but a final softmax layer. The weights and biases for these layers are (W1, b1), (W2, b2) and (W3, b3). Then you can write the network's final output as follows.
h1 = W1.x + b1
h2 = W2.h1 + b2
h3 = Softmax(W3.h2 + b3)
Let's do some manipulations. We'll simply replace h3 as a function of x,
h3 = Softmax(W3.(W2.(W1.x + b1) + b2) + b3)
h3 = Softmax((W3.W2.W1) x + (W3.W2.b1 + W3.b2 + b3))
In other words, h3 is in the following format.
h3 = Softmax(W.x + b)
So, without the non-linear activations, our 3-layer networks has been squashed to a single layer network. That's is why non-linear activations are important.
Imagine, you have an activation layer only in the last layer (In your case, sigmoid. It can be something else too.. say softmax). The purpose of this is to convert real values to a 0 to 1 range for a classification sort of answer. But, the activation in the inner layers (hidden layers) has a different purpose altogether. This is to introduce nonlinearity. Without the activation (say ReLu, tanh etc.), what you get is a linear function. And how many ever, hidden layers you have, you still end up with a linear function. And finally, you convert this into a nonlinear function in the last layer. This might work in some simple nonlinear problems, but will not be able to capture a complex nonlinear function.
Each hidden unit (in each layer) comprises of activation function to incorporate nonlinearity.

Learning a Sin function

I'm new to Machine Learning
I' building a simple model that would be able to predict simple sin function
I generated some sin values, and feeding them into my model.
from math import sin
xs = np.arange(-10, 40, 0.1)
squarer = lambda t: sin(t)
vfunc = np.vectorize(squarer)
ys = vfunc(xs)
model= Sequential()
model.add(Dense(units=256, input_shape=(1,), activation="tanh"))
model.add(Dense(units=256, activation="tanh"))
..a number of layers here
model.add(Dense(units=256, activation="tanh"))
model.add(Dense(units=1))
model.compile(optimizer="sgd", loss="mse")
model.fit(xs, ys, epochs=500, verbose=0)
I then generate some test data, which overlays my learning data, but also introduces some new data
test_xs = np.arange(-15, 45, 0.01)
test_ys = model.predict(test_xs)
plt.plot(xs, ys)
plt.plot(test_xs, test_ys)
Predicted data and learning data looks as follows. The more layers I add, the more curves network is able to learn, but the training process increases.
Is there a way to make it predict sin for any number of curves? Preferably with a small number of layers.
With a fully connected network I guess you won't be able to get arbitrarily long sequences, but with an RNN it looks like people have achieved this. A google search will pop up many such efforts, I found this one quickly: http://goelhardik.github.io/2016/05/25/lstm-sine-wave/
An RNN learns a sequence based on a history of inputs, so it's designed to pick up these kinds of patterns.
I suspect the limitation you observed is akin to performing a polynomial fit. If you increase the degree of polynomial you can better fit a function like this, but a polynomial can only represent a fixed number of inflection points depending on the degree you choose. Your observation here appears the same. As you increase layers you add more non-linear transitions. However, you are limited by a fixed number of layers you chose as the architecture in a fully connected network.
An RNN does not work on the same principals because it maintains a state and can make use of the state being passed forward in the sequence to learn the pattern of a single period of the sine wave and then repeat that pattern based on the state information.

Does the Izhikevich neuron model use weights?

I've been working a bit with neural networks and I'm interested on implementing a spiking neuron model.
I've read a fair amount of tutorials but most of them seem to be about generating pulses and I haven't found any application of it on a given input train.
Say for example I got input train:
Input[0] = [0,0,0,1,0,0,1,1]
It enters the Izhikevich neuron, does the input multiply a weight or only makes use of the parameters a, b, c and d?
Izhikevich equations are:
v[n+1] = 0.04*v[n]^2 + 5*v[n] + 140 - u[n] + I
u[n+1] = a*(b*v[n] - u[n])
where v[n] is input voltage and u[n] is a general recovery variable.
Are there any texts on implementations of Izhikevich or similar spiking neuron models on a practical problem? I'm trying to understand how information is encoded on this models but it looks different from what's done with standard second generation neurons. The only tutorial I've found where it deals with a spiking train and a set of weights is [1] but I haven't seen the same with Izhikevich.
[1] https://msdn.microsoft.com/en-us/magazine/mt422587.aspx
The plain Izhikevich model by itself, does not include weights.
The two equations you mentioned, model the membrane potential (v[]) over time of a point neuron. To use weights, you could connect two or more of such cells with synapses.
Each synapse could include some sort spike detection mechanism on the source cell (pre-synaptic), and a synaptic current mechanism in the target (post-synaptic) cell side. That synaptic current could then be multiplied by a weight term, and then become part of the I term (in the 1st equation above) for the target cell.
As a very simple example of a two cell network, at every time step, you could check if pre- cell v is above (say) 0 mV. If so, inject (say) 0.01 pA * weightPrePost into the post- cell. weightPrePost would range from 0 to 1, and could be modified in response to things like firing rate, or Hebbian-like spike synchrony like in STDP.
With multiple synaptic currents going into a cell, you could devise various schemes how to sum them. The simplest one would be just a simple sum, more complicated ones could include things like distance and dendrite diameters (e.g. simulated neural morphology).
This chapter is a nice introduction to other ways to model synapses: Modelling
Synaptic Transmission

Artificial Neural Network Training and Testing

I'm trying to create an ANN that will solve a simple classification problem the example I am using is a degree classification so the input will be a percentage between 0-100 and the output will be one of five (1st, 2:1, 2:2...).
Currently I have set up a neural network with three layers, 1 input neuron, 3 hidden neurons and 5 output neurons, I have managed to train the network using one input e.g. 60 and the output (1,0,0,0,0). I am unsure though how i would go about properly training the network for each input and output combination so that after training I would be able to input the percentage and the correct output neuron would be the number closest to 1.
The network uses standard feed forward and back propagation algorithms, random weights and the Sigmoid function.
I have a file which I was thinking would work with inputs 0-100 with the outputs inbetween:
0
1, 0, 0, 0, 0
1
1, 0, 0, 0, 0
.....
40
0, 1, 0, 0, 0
....
100
0, 0, 0, 0, 1
Thanks
I don't quite understand the function you are trying to learn, but it doesn't matter. The usual way to train an ANN is to use SGD (stochastic gradient descent) where backpropagation is used to compute the gradient for each example at a time. You just repeat looping this over all input examples until it has learned those examples.
One thing you didn't mention is that you need a loss function. In your case, a simple mean squared error might be appropriate.
I suggest that you take a look at the classifer.py python script used for classification at this link - http://www.marekrei.com/blog/theano-tutorial/
The complete code for the above tutorial is available at this link -
https://github.com/marekrei/theano-tutorial
The classifier script at above link is meant for predicting whether the GDP per capita for a country is more than the average GDP. I, however, used the script for different kind of dataset.
I was able to successfully train neural networks in Theano using the above classifier script for classifying a speech sound as the letter "A" or "E".

Resources