Is it better to make a neural network have hierarchical output? - machine-learning

I'm quite new to neural networks, and I recently built a neural network for digit classification on vehicle license plates. It has 3 layers: 1 input layer for a 16*24 (384-neuron) number image at 150 dpi, 1 hidden layer (199 neurons) with a sigmoid activation function, and 1 softmax output layer (10 neurons), one for each digit 0 to 9.
I'm trying to expand my neural network to also classify letters on license plates. But I'm worried that if I simply add more classes to the output, for example 10 letters for a total of 20 classes, it would be hard for the neural network to separate the features of each class. I also think it might cause a problem when the input is a number and the network wrongly classifies it as the letter with the biggest probability, even though the sum of the probabilities over all the number outputs exceeds that.
So I wonder if it is possible to build a hierarchical neural network in the following manner:
There are 3 neural networks: 'Item', 'Number', and 'Letter'.
The 'Item' network classifies whether the input is a number or a letter.
If the 'Item' network classifies the input as a number (letter), the input goes through the 'Number' ('Letter') network.
The final output is returned from the 'Number' ('Letter') network.
And the learning mechanism for each network is as follows (a rough sketch of the routing is given below):
The 'Item' network learns from all images of numbers and letters, so it has 2 outputs.
The 'Number' ('Letter') network learns from images of numbers (letters) only.
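In rough, hypothetical Python, the routing I have in mind would look something like this, with the predict_* helpers standing in as placeholders for the three trained networks:

    # Hypothetical sketch of the hierarchical routing described above; the
    # predict_* helpers are placeholders for the 'Item', 'Number' and 'Letter' networks.
    import numpy as np

    def predict_item(image):    # placeholder: returns [p_number, p_letter]
        return np.array([0.8, 0.2])

    def predict_number(image):  # placeholder: returns 10 digit probabilities
        return np.random.dirichlet(np.ones(10))

    def predict_letter(image):  # placeholder: returns 10 letter probabilities
        return np.random.dirichlet(np.ones(10))

    def classify(image):
        p_number, p_letter = predict_item(image)
        if p_number >= p_letter:
            return "number", int(np.argmax(predict_number(image)))
        return "letter", int(np.argmax(predict_letter(image)))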
Which method should I pick for better classification: simply adding 10 more classes, or building hierarchical neural networks as above?

I'd strongly recommend training only a single neural network with outputs for all the kinds of images you want to detect (so one output node per letter and one output node per digit you want to be able to recognize).
The main reason is that recognizing digits and recognizing letters is really pretty much the same task. Intuitively, you can understand a trained neural network with multiple layers as performing the recognition in multiple steps. In the hidden layer it may learn to detect various kinds of simple, primitive shapes (e.g. the hidden layer may learn to detect vertical lines, horizontal lines, diagonal lines, certain kinds of simple curved shapes, etc.). Then, in the weights between the hidden and output layers, it may learn how to recognize combinations of these primitive shapes as a specific output class (e.g. a vertical and a horizontal line in roughly the correct locations may be recognized as a capital letter L).
Those "things" it learns in the hidden layer will be perfectly relevant for digits as well as letters (that vertical line which may indicate an L may also indicate a 1 when combined with other shapes). So, there are useful things to learn that are relevant for both ''tasks'', and it will probably be able to learn these things more easily if it can learn them all in the same network.
See also this answer I gave to a related question in the past.
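As a minimal sketch of what that single combined network could look like (written in Keras purely for illustration; the input and hidden sizes are taken from the question, and 20 output classes is just the example from the question, not a recommendation):

    # Minimal sketch, assuming a flattened 16x24 input and 20 classes
    # (10 digits + 10 letters); not a tuned architecture.
    import tensorflow as tf

    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(16 * 24,)),            # flattened image
        tf.keras.layers.Dense(199, activation="sigmoid"),   # shared hidden layer
        tf.keras.layers.Dense(20, activation="softmax"),    # one output node per class
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    # model.fit(x_train, y_train, ...) with integer labels 0..19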

I'm trying to expand my neural network to also classify letters on license plates. But I'm worried that if I simply add more classes to the output, for example 10 letters for a total of 20 classes, it would be hard for the neural network to separate the features of each class.
You're far from where it becomes problematic. ImageNet has 1000 classes and is commonly handled by a single network; see the AlexNet paper. If you want to learn more about CNNs, have a look at chapter 2 of "Analysis and Optimization of Convolutional Neural Network Architectures", and while you're at it, see chapter 4 for hierarchical classification. You can read the summary for ... well, a summary of it.

Related

Will a neural network translate to the same sentence in different runs? And can I receive many translated sentences in one run?

I intend to use a neural network (CNN, RNN, etc.) to translate from one language to another, one sentence at a time. I wonder whether this network will give us the same sentence in different runs or not. And can we get many translated sentences in one run?
Suppose we have these scenarios:
Runtime 1: sentence --- a
Runtime 2: sentence --- a
Runtime 3: sentence --- b
Runtime 4: sentence --- a, b, c, etc
Which scenario will the NN give us? Thanks!
If you have two identical neural networks (same architecture and same weights), then inference is deterministic: two identical inputs will give the same output. That wouldn't be true if you used some kind of randomness inside the architecture of your neural network, for example a Variational Autoencoder (VAE) or a Generative Adversarial Network (GAN), as you would be learning and sampling statistical distributions.
For your second question: neural networks take tensors as input and produce tensors as output. The input can be a 1D tensor (vector), a 2D tensor (matrix), or even a 666D tensor (although that is not recommended). In the end, inference with neural networks is just a series of tensor products.
When you study linear algebra, you learn that in a product of tensors you can always stack one of the tensors with itself (or with a different one of the same size) along a specific dimension, and the expression remains correct. So if you stack your input tensors correctly (one-hot encodings of your sentences, I guess), you can run your predictions as a batch. In that case your output tensors (one-hot encodings of your translated sentences) will also be stacked together. But be aware that (1) such a batch should fit in memory, and (2) the larger the batch, the more computation it takes.
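As a small illustration of both points, determinism and batching (a hand-rolled two-layer network in NumPy; the layer sizes are arbitrary):

    # Deterministic inference with fixed weights, and batched inference
    # by stacking inputs along a new first dimension.
    import numpy as np

    rng = np.random.default_rng(0)
    W1, W2 = rng.normal(size=(8, 16)), rng.normal(size=(16, 4))   # fixed weights

    def forward(x):
        return np.maximum(x @ W1, 0.0) @ W2    # ReLU hidden layer, linear output

    x = rng.normal(size=8)
    print(np.allclose(forward(x), forward(x)))  # True: same input, same weights, same output

    batch = np.stack([x, x + 1.0, x - 1.0])     # three inputs stacked into one batch
    print(forward(batch).shape)                 # (3, 4): one output row per input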
Yes, as long as you have 2 different neural networks. I am not an expert in this, though, because I have only done it once.

Where do neurons in a neural network share their predictive results (learned functions)?

Definitely a noob NN question, but here it is:
I understand that neurons in a layer of a neural network all initialize with different (essentially random) input-feature weights as a means to vary their back-propagation results so they can converge to different functions describing the input data. However, I do not understand when or how these neurons, each generating a unique function to describe the input data, "communicate" their results with each other, as is done in ensemble ML methods (e.g. by growing a forest of trees with randomized initial decision criteria and then determining the most discriminative models in the forest). In the tree-ensemble example, all of the trees work together to generalize the rules each model learns.
How, where, and when do neurons communicate their prediction functions? I know individual neurons use gradient descent to converge to their respective functions, but those functions are unique since the neurons started with unique weights. How do they communicate these differences? I imagine there is some subtle behavior in how the neurons' results are combined in the output layer where this communication occurs. Also, is this communication part of the iterative training process?
Someone in the comments section (https://datascience.stackexchange.com/questions/14028/what-is-the-purpose-of-multiple-neurons-in-a-hidden-layer) asked a similar question, but I didn't see it answered.
Help would be greatly appreciated!
During forward propagation, each neuron typically participates in forming the values of multiple neurons in the next layer. In back-propagation, each of those next-layer neurons will try to push the participating neurons' weights around in order to minimise the error. That's pretty much it.
For example, let's say you're trying to get a NN to recognise digits. Let's say that one neuron in a hidden layer starts getting ideas on recognising vertical lines, another starts finding horizontal lines, and so on. The next-layer neuron that is responsible for finding 1 will see that if it wants to be more accurate, it should pay lots of attention to the vertical line guy; and also, the more the horizontal line guy yells, the more it's not a 1. That's what weights are: telling each neuron how strongly it should care about each of its inputs. In turn, the vertical line guy will learn how to recognise vertical lines better, by adjusting weights for its input layer (e.g. individual pixels).
(This is quite abstract though. No-one told the vertical line guy that he should be recognising vertical lines. Different neurons just train for different things, and by virtue of the mathematics involved, they end up picking different features. One of them might or might not end up being the vertical-line detector.)
There is no "communication" between neurons on the same layer (in the base case, where layers flow linearly from one to the next). It's all about neurons on one layer getting better at predicting features that the next layer finds useful.
At the output layer, the 1 guy might be saying "I'm 72% certain it's a 1", while the 7 guy might be saying "I give that 7 a B+", and the third one might be saying "A horrible 3, wouldn't look at it twice". We usually either take the loudest one's word for it, or we normalise the output layer (divide by the sum of all outputs) so that we have actual comparable probabilities. However, this normalisation is not actually part of the neural network itself.
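As a tiny illustration of that last normalisation step (shown here with softmax, a common variant of "divide by the sum"; the raw scores are made up):

    # Turning raw output-layer scores into comparable probabilities.
    import numpy as np

    raw_scores = np.array([2.1, 1.3, -0.7])   # e.g. the "1", "7" and "3" output neurons

    def softmax(z):
        e = np.exp(z - z.max())               # subtract the max for numerical stability
        return e / e.sum()

    probs = softmax(raw_scores)
    print(probs, probs.argmax())              # scores now sum to 1; the loudest still wins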

Can a recurrent neural network learn slightly different sequences at once?

Can a recurrent neural network be used to learn a sequence with slightly different variations? For example, could I train an RNN so that it could produce a sequence of consecutive integers or of alternate integers, if I have enough training data?
For example, if I train using
1,2,3,4
2,3,4,5
3,4,5,6
and so on
and also train the same network using
1,3,5,7
2,4,6,8
3,5,7,9
and so on,
would I be able to predict both sequences successfully for the test set?
What if I have even more variations in the training data like sequences of every three integers or every four integers, et cetera?
Yes, provided there is enough information in the sequence so that it is not ambiguous, a neural network should be able to learn to complete these sequences correctly.
You should note a few details though:
Neural networks, and ML models in general, are bad at extrapolation. A simple network is very unlikely to learn about sequences in general. It will never learn the concept of sequence logic in the way a child quickly would. So if you feed in test data outside of its experience (e.g. steps of 3 between items, when they were not in the training data), it will perform badly.
Neural networks prefer scaled inputs - a common pre-processing step is to normalise each input column to mean 0 and standard deviation 1. Whilst it is possible for a network to accept a larger range of numbers as inputs, that will reduce the effectiveness of training. With a generated training set such as artificial numeric sequences, you may be able to force your way through that by training for longer with more examples.
You will need more neurons, and more layers, to support a larger variation of sequences.
For an RNN, it will predict badly if the sequence it has processed so far is ambiguous. E.g. if you train on 1,2,3,4 and 1,2,3,5 with equal numbers of samples, it will predict either 4.5 (for regression) or a 50% chance of 4 or 5 (for a classifier) when it is shown the sequence 1,2,3 and asked to predict the next item.
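For what it's worth, here is a minimal sketch of this kind of setup (Keras; the scaling factor, network size, and number of epochs are arbitrary choices, not recommendations):

    # Tiny LSTM that sees three numbers and predicts the fourth, trained on
    # sequences with step 1 and step 2 as in the question.
    import numpy as np
    import tensorflow as tf

    seqs = [[i, i + s, i + 2 * s, i + 3 * s] for s in (1, 2) for i in range(1, 20)]
    data = np.array(seqs, dtype=np.float32) / 30.0    # crude scaling to roughly [0, 1]
    x, y = data[:, :3, None], data[:, 3]              # first three steps -> fourth

    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(3, 1)),
        tf.keras.layers.LSTM(16),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")
    model.fit(x, y, epochs=500, verbose=0)

    test = np.array([[4, 5, 6], [4, 6, 8]], dtype=np.float32)[:, :, None] / 30.0
    print(model.predict(test, verbose=0) * 30.0)      # should come out roughly 7 and 10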

Why is there only one hidden layer in a neural network?

I recently made my first neural network simulation which also uses a genetic evolution algorithm. It's simple software that just simulates simple organisms collecting food, and they evolve, as one would expect, from organisms with random and sporadic movements into organisms with controlled, food-seeking movements. Since this kind of organism is so simple, I only used a few hidden layer neurons and a few input and output neurons. I understand that more complex neural networks could be made by simply adding more neurons, but can't you add more layers? Or would this create some kind of redundancy? All of the pictures of diagrams of neural networks, such as this one http://mechanicalforex.com/wp-content/uploads/2011/06/NN.png, always have one input layer, one hidden layer, and one output layer. Couldn't a more complex neural network be made if you just added a bunch of hidden layers? Of course this would make processing the neural network harder, but would it create any sort of advantage, or would it be just the same as adding more neurons to a single layer?
You can include as many hidden layers as you want, starting from zero (that case is called a perceptron).
The ability to represent unknown functions, however, does not, in principle, increase. Single-hidden-layer neural networks already possess a universal approximation property: by increasing the number of hidden neurons, they can approximate (almost) arbitrary functions. You can't get more than this, and particularly not by adding more layers.
However, that doesn't mean that multi-hidden-layer ANNs can't be useful in practice. Yet, since each additional layer adds another dimension to your parameter set, people usually stick with the single-hidden-layer version.
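For illustration, here is a rough sketch (in Keras; the widths, activations, and sizes are arbitrary) of the same kind of classifier built with a configurable number of hidden layers:

    # Build an MLP with any number of hidden layers; nothing stops you
    # from stacking more than one.
    import tensorflow as tf

    def build_mlp(n_inputs, n_outputs, hidden_layers=1, width=32):
        layers = [tf.keras.layers.Input(shape=(n_inputs,))]
        layers += [tf.keras.layers.Dense(width, activation="relu")
                   for _ in range(hidden_layers)]
        layers += [tf.keras.layers.Dense(n_outputs, activation="softmax")]
        return tf.keras.Sequential(layers)

    shallow = build_mlp(10, 3, hidden_layers=1)   # the "classic" single hidden layer
    deep = build_mlp(10, 3, hidden_layers=4)      # more layers, same interface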

Extrapolation with a recurrent neural network

I wrote a simple recurrent neural network (7 neurons, each one initially connected to all the other neurons) and trained it using a genetic algorithm to learn "complicated", non-linear functions like 1/(1+x^2). As the training set, I used 20 values within the range [-5,5] (I tried using more than 20, but the results did not change dramatically).
The network can learn this range pretty well, and when given other points within this range, it can predict the value of the function. However, it cannot extrapolate correctly and predict the values of the function outside the range [-5,5]. What are the reasons for that, and what can I do to improve its extrapolation abilities?
Thanks!
Neural networks are not extrapolation methods (recurrent or not); this is completely outside their capabilities. They are used to fit a function to the provided data, and they are completely free to build any model outside the subspace populated with training points. So, in a not very strict sense, one should think of them as an interpolation method.
To make things clear: a neural network should be capable of generalizing the function inside the subspace spanned by the training samples, but not outside of it.
A neural network is trained only in the sense of consistency with the training samples, while extrapolation is something completely different. A simple example from "H. Lohninger: Teach/Me Data Analysis, Springer-Verlag, Berlin-New York-Tokyo, 1999. ISBN 3-540-14743-8" shows how NNs behave in this context.
All of these networks are consistent with training data, but can do anything outside of this subspace.
You should rather reconsider your problem's formulation, and if it can be expressed as a regression or classification problem then you can use NN, otherwise you should think about some completely different approach.
The only things which can be done to somehow "correct" what is happening outside the training set are to:
add artificial training points in the desired subspace (but this simply grows the training set, and again, outside of this new set the network's behaviour is "random");
add strong regularization, which will force the network to create a very simple model; but the model's complexity does not guarantee any extrapolation strength, as two models of exactly the same complexity can have, for example, completely different limits at +/- infinity.
Combining the above two steps can help build a model which to some extent "extrapolates", but this, as stated before, is not the purpose of a neural network.
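A quick sketch of the point (scikit-learn's small MLP is used here purely for brevity; the network size and seeds are arbitrary): fit 1/(1+x^2) on [-5,5], then query points outside that range.

    # Fit inside [-5, 5], then evaluate inside and outside the training range.
    import numpy as np
    from sklearn.neural_network import MLPRegressor

    rng = np.random.default_rng(0)
    x_train = rng.uniform(-5, 5, 200).reshape(-1, 1)
    y_train = 1.0 / (1.0 + x_train.ravel() ** 2)

    net = MLPRegressor(hidden_layer_sizes=(50, 50), max_iter=5000, random_state=0)
    net.fit(x_train, y_train)

    x_test = np.array([[0.0], [3.0], [10.0], [50.0]])
    print(net.predict(x_test))                   # usually close at 0 and 3,
    print(1.0 / (1.0 + x_test.ravel() ** 2))     # essentially arbitrary at 10 and 50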
As far as I know this is only possible with networks which have the echo state property. See Echo State Networks on scholarpedia.org.
These networks are designed for arbitrary signal learning and are capable of remembering their behavior.
You can also take a look at this tutorial.
The nature of your post(s) suggests that what you're referring to as "extrapolation" would be more accurately described as "sequence recognition and reproduction." Training networks to recognize a data sequence, with or without a time series (dt), is pretty much the purpose of a Recurrent Neural Network (RNN).
The training function shown in your post has output limits governed by 0 and 1 (or -1, since x is effectively abs(x) in the context of that function). So, first things first, be certain your input layer can easily distinguish between negative and positive inputs (if it must).
Next, the number of neurons is not nearly as important as how they're layered and interconnected. How many of the 7 were used for the sequence inputs? What type of network was used, and how was it configured? Network feedback will reveal the ratios, proportions, relationships, etc., and aid in adjusting the network weights to match the sequence. Feedback can also take the form of a forward feed, depending on the type of network used to create the RNN.
Producing an 'observable' network for the decay function 1/(1+x^2) should be a decent exercise to cut your teeth on RNNs. 'Observable' meaning the network is capable of producing results for any input value(s), even though its training data is (far) smaller than the set of all possible inputs. I can only assume that this was your actual objective, as opposed to "extrapolation."
