Can a recurrent neural network be used to learn a sequence with slightly different variations? For example, could I get an RNN trained so that it could produce a sequence of consecutive integers or alternate integers if I have enough training data?
For example, if I train using
1,2,3,4
2,3,4,5
3,4,5,6
and so on
and also train the same network using
1,3,5,7
2,4,6,8
3,5,7,9
and so on,
would I be able to predict both sequences successfully for the test set?
What if I have even more variations in the training data like sequences of every three integers or every four integers, et cetera?
Yes, provided there is enough information in the sequence so that it is not ambiguous, a neural network should be able to learn to complete these sequences correctly.
You should note a few details though:
Neural networks, and ML models in general, are bad at extrapolation. A simple network is very unlikely to learn about sequences in general. It will never learn the concept of sequence logic in the way a child quickly would. So if you feed in test data outside of its experience (e.g. steps of 3 between items, when they were not in the training data), it will perform badly.
Neural networks prefer scaled inputs - a common pre-processing step is to normalise to mean 0 standard deviation 1 for each input column. Whilst it is possible for a network to accept larger range of numbers at inputs, that will reduce effectiveness of training. With a generated training set such as artificial numeric sequences, you may be able to force your way through that by training for longer with more examples.
You will need more neurons, and more layers, to support a larger variation of sequences.
For a RNN, it will predict badly if the sequence it has processed so far is ambiguous. E.g. if you train 1,2,3,4 and 1,2,3,5 with equal numbers of samples, it will predict either 4.5 (for regression) or 50% chance 4 or 5 (for classifier) when it shown sequence 1,2,3 and asked to predict.
Related
I am trying to train a neural network with complex numbers as input, as generally suggested I split the complex number into real and imaginary part for feeding the data into network. However, what if I use modulus (argument) and phase of the complex number instead of real and imaginary part? How will it affect my training process? I am unable to have a conclusive thought on this.
I'm quite new to neural network and I recently built neural network for number classification in vehicle license plate. It has 3 layers: 1 input layer for 16*24(382 neurons) number image with 150 dpi , 1 hidden layer(199 neurons) with sigmoid activation function, 1 softmax output layer(10 neurons) for each number 0 to 9.
I'm trying to expand my neural network to also classify letters in license plate. But I'm worried if I just simply add more classes into output, for example add 10 letters into classification so total 20 classes, it would be hard for neural network to separate feature from each class. And also, I think it might cause problem when input is one of number and neural network wrongly classifies as one of letter with biggest probability, even though sum of probabilities of all number output exceeds that.
So I wonder if it is possible to build hierarchical neural network in following manner:
There are 3 neural networks: 'Item', 'Number', 'Letter'
'Item' neural network classifies whether input is numbers or letters.
If 'Item' neural network classifies input as numbers(letters), then input goes through 'Number'('Letter') neural network.
Return final output from Number(Letter) neural network.
And learning mechanism for each network is below:
'Item' neural network learns all images of numbers and letters. So there are 2 output.
'Number'('Letter') neural network learns images of only numbers(letter).
Which method should I pick to have better classification? Just simply add 10 more classes or build hierarchical neural networks with method above?
I'd strongly recommend training only a single neural network with outputs for all the kinds of images you want to be able to detect (so one output node per letter you want to be able to recognize, and one output node for every digit you want to be able to recognize).
The main reason for this is because recognizing digits and recognizing letters is really kind of exactly the same task. Intuitively, you can understand a trained neural network with multiple layers as performing the recognition in multiple steps. In the hidden layer it may learn to detect various kinds of simple, primitive shapes (e.g. the hidden layer may learn to detect vertical lines, horizontal lines, diagonal lines, certain kinds of simple curved shapes, etc.). Then, in the weights between hidden and output layers, it may learn how to recognize combinations of multiple of these primitive shapes as a specific output class (e.g. a vertical and a horizontal line in roughly the correct locations may be recoginzed as a capital letter L).
Those "things" it learns in the hidden layer will be perfectly relevant for digits as well as letters (that vertical line which may indicate an L may also indicate a 1 when combined with other shapes). So, there are useful things to learn that are relevant for both ''tasks'', and it will probably be able to learn these things more easily if it can learn them all in the same network.
See also a this answer I gave to a related question in the past.
I'm trying to expand my neural network to also classify letters in license plate. But i'm worried if i just simply add more classes into output, for example add 10 letters into classification so total 20 classes, it would be hard for neural network to separate feature from each class.
You're far from where it becomes problematic. ImageNet has 1000 classes and is commonly done in a single network. See the AlexNet paper. If you want to learn more about CNNs, have a look at chapter 2 of "Analysis and Optimization of
Convolutional Neural Network Architectures". And when you're on it, see chapter 4 for hirarchical classification. You can read the summary for ... well, a summary of it.
The attributes have been saved in 11 columns in csv file. If the order of columns change, Do Randomforest & RandomTree could give different accuracy in each time?
Ordering of the features does not affect any of classifiers I know (except those which are specially designed to do so - like specialistic classifiers for time series and other temporal features), no matter if it is Neural Network, SVM, RandomForest, RandomTree or NaiveBayes - it is just a numerical simplification, as arrays are faster then sets, while "under the hood" they are treated as unordered sets (only with indicies showing from which dimension it comes from).
What can change is the output of the particular classifier each time you run your code due to its probabilistic/stochastic methods of learning. For example - neural networks have random initializations, RandomForests has random subsampling etc.
So answer is suprisingly "yes, it can change after order of columns change", but the reason for this is not the change in order, but fact, that after you do so, the internal random number generator already passed some cycles and will generate different numbers.
I Wrote a simple recurrent neural network (7 neurons, each one is initially connected to all the neurons) and trained it using a genetic algorithm to learn "complicated", non-linear functions like 1/(1+x^2). As the training set, I used 20 values within the range [-5,5] (I tried to use more than 20 but the results were not changed dramatically).
The network can learn this range pretty well, and when given examples of other points within this range, it can predict the value of the function. However, it can not extrapolate correctly and predicting the values of the function outside the range [-5,5]. What are the reasons for that and what can I do to improve its extrapolation abilities?
Thanks!
Neural networks are not extrapolation methods (no matter - recurrent or not), this is completely out of their capabilities. They are used to fit a function on the provided data, they are completely free to build model outside the subspace populated with training points. So in non very strict sense one should think about them as an interpolation method.
To make things clear, neural network should be capable of generalizing the function inside subspace spanned by the training samples, but not outside of it
Neural network is trained only in the sense of consistency with training samples, while extrapolation is something completely different. Simple example from "H.Lohninger: Teach/Me Data Analysis, Springer-Verlag, Berlin-New York-Tokyo, 1999. ISBN 3-540-14743-8" shows how NN behave in this context
All of these networks are consistent with training data, but can do anything outside of this subspace.
You should rather reconsider your problem's formulation, and if it can be expressed as a regression or classification problem then you can use NN, otherwise you should think about some completely different approach.
The only thing, which can be done to somehow "correct" what is happening outside the training set is to:
add artificial training points in the desired subspace (but this simply grows the training set, and again - outside of this new set, network's behavious is "random")
add strong regularization, which will force network to create very simple model, but model's complexity will not guarantee any extrapolation strength, as two model's of exactly the same complexity can have for example completely different limits in -/+ infinity.
Combining above two steps can help building model which to some extent "extrapolates", but this, as stated before, is not a purpose of a neural network.
As far as I know this is only possible with networks which do have the echo property. See Echo State Networks on scholarpedia.org.
These networks are designed for arbitrary signal learning and are capable to remember their behavior.
You can also take a look at this tutorial.
The nature of your post(s) suggests that what you're referring to as "extrapolation" would be more accurately defined as "sequence recognition and reproduction." Training networks to recognize a data sequence with or without time-series (dt) is pretty much the purpose of Recurrent Neural Network (RNN).
The training function shown in your post has output limits governed by 0 and 1 (or -1, since x is effectively abs(x) in the context of that function). So, first things first, be certain your input layer can easily distinguish between negative and positive inputs (if it must).
Next, the number of neurons is not nearly as important as how they're layered and interconnected. How many of the 7 were used for the sequence inputs? What type of network was used and how was it configured? Network feedback will reveal the ratios, proportions, relationships, etc. and aid in the adjustment of network weight adjustments to match the sequence. Feedback can also take the form of a forward-feed depending on the type of network used to create the RNN.
Producing an 'observable' network for the exponential-decay function: 1/(1+x^2), should be a decent exercise to cut your teeth on RNNs. 'Observable', meaning the network is capable of producing results for any input value(s) even though its training data is (far) smaller than all possible inputs. I can only assume that this was your actual objective as opposed to "extrapolation."
Is anyone here who is familiar with echo state networks? I created an echo state network in c#. The aim was just to classify inputs into GOOD and NOT GOOD ones. The input is an array of double numbers. I know that maybe for this classification echo state network isn't the best choice, but i have to do it with this method.
My problem is, that after training the network, it cannot generalize. When i run the network with foreign data (not the teaching input), i get only around 50-60% good result.
More details: My echo state network must work like a function approximator. The input of the function is an array of 17 double values, and the output is 0 or 1 (i have to classify the input into bad or good input).
So i have created a network. It contains an input layer with 17 neurons, a reservoir layer, which neron number is adjustable, and output layer containing 1 neuron for the output needed 0 or 1. In a simpler example, no output feedback is used (i tried to use output feedback as well, but nothing changed).
The inner matrix of the reservoir layer is adjustable too. I generate weights between two double values (min, max) with an adjustable sparseness ratio. IF the values are too big, it normlites the matrix to have a spectral radius lower then 1. The reservoir layer can have sigmoid and tanh activaton functions.
The input layer is fully connected to the reservoir layer with random values. So in the training state i run calculate the inner X(n) reservor activations with training data, collecting them into a matrix rowvise. Using the desired output data matrix (which is now a vector with 1 ot 0 values), i calculate the output weigths (from reservoir to output). Reservoir is fully connected to the output. If someone used echo state networks nows what im talking about. I ise pseudo inverse method for this.
The question is, how can i adjust the network so it would generalize better? To hit more than 50-60% of the desired outputs with a foreign dataset (not the training one). If i run the network again with the training dataset, it gives very good reults, 80-90%, but that i want is to generalize better.
I hope someone had this issue too with echo state networks.
If I understand correctly, you have a set of known, classified data that you train on, then you have some unknown data which you subsequently classify. You find that after training, you can reclassify your known data well, but can't do well on the unknown data. This is, I believe, called overfitting - you might want to think about being less stringent with your network, reducing node number, and/or training based on a hidden dataset.
The way people do it is, they have a training set A, a validation set B, and a test set C. You know the correct classification of A and B but not C (because you split up your known data into A and B, and C are the values you want the network to find for you). When training, you only show the network A, but at each iteration, to calculate success you use both A and B. So while training, the network tries to understand a relationship present in both A and B, by looking only at A. Because it can't see the actual input and output values in B, but only knows if its current state describes B accurately or not, this helps reduce overfitting.
Usually people seem to split 4/5 of data into A and 1/5 of it into B, but of course you can try different ratios.
In the end, you finish training, and see what the network will say about your unknown set C.
Sorry for the very general and basic answer, but perhaps it will help describe the problem better.
If your network doesn't generalize that means it's overfitting.
To reduce overfitting on a neural network, there are two ways:
get more training data
decrease the number of neurons
You also might think about the features you are feeding the network. For example, if it is a time series that repeats every week, then one feature is something like the 'day of the week' or the 'hour of the week' or the 'minute of the week'.
Neural networks need lots of data. Lots and lots of examples. Thousands. If you don't have thousands, you should choose a network with just a handful of neurons, or else use something else, like regression, that has fewer parameters, and is therefore less prone to overfitting.
Like the other answers here have suggested, this is a classic case of overfitting: your model performs well on your training data, but it does not generalize well to new test data.
Hugh's answer has a good suggestion, which is to reduce the number of parameters in your model (i.e., by shrinking the size of the reservoir), but I'm not sure whether it would be effective for an ESN, because the problem complexity that an ESN can solve grows proportional to the logarithm of the size of the reservoir. Reducing the size of your model might actually make the model not work as well, though this might be necessary to avoid overfitting for this type of model.
Superbest's solution is to use a validation set to stop training as soon as performance on the validation set stops improving, a technique called early stopping. But, as you noted, because you use offline regression to compute the output weights of your ESN, you cannot use a validation set to determine when to stop updating your model parameters---early stopping only works for online training algorithms.
However, you can use a validation set in another way: to regularize the coefficients of your regression! Here's how it works:
Split your training data into a "training" part (usually 80-90% of the data you have available) and a "validation" part (the remaining 10-20%).
When you compute your regression, instead of using vanilla linear regression, use a regularized technique like ridge regression, lasso regression, or elastic net regression. Use only the "training" part of your dataset for computing the regression.
All of these regularized regression techniques have one or more "hyperparameters" that balance the model fit against its complexity. The "validation" dataset is used to set these parameter values: you can do this using grid search, evolutionary methods, or any other hyperparameter optimization technique. Generally speaking, these methods work by choosing values for the hyperparameters, fitting the model using the "training" dataset, and measuring the fitted model's performance on the "validation" dataset. Repeat N times and choose the model that performs best on the "validation" set.
You can learn more about regularization and regression at http://en.wikipedia.org/wiki/Least_squares#Regularized_versions, or by looking it up in a machine learning or statistics textbook.
Also, read more about cross-validation techniques at http://en.wikipedia.org/wiki/Cross-validation_(statistics).