Interpreting Neural Network output - machine-learning

For a classification problem, how is the output of the network usually determined?
Say there are three possible classes, each with a numerical identifier. Would a reasonable solution be to sum the outputs and take that sum as the overall output of the network, or would you take the average of the network's outputs?
There is plenty of information regarding ANN theory but not much about application, so I apologise if this is a silly question.

For a multi-layer perceptron classifier with 3 classes, one typically constructs a network with 3 outputs and trains the network so that (1,0,0) is the target output for the first class, (0,1,0) for the second class, and (0,0,1) for the third class. For classifying a new observation, you typically select the output with the greatest value (e.g., (0.12, 0.56, 0.87) would be classified as class 3).
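For example, a minimal NumPy sketch of that decision rule, using the output values above:

```python
import numpy as np

# Raw network outputs for the three classes (the example values above).
outputs = np.array([0.12, 0.56, 0.87])

# The predicted class is the index of the largest output (1-based here).
predicted_class = int(np.argmax(outputs)) + 1
print(predicted_class)  # 3
```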

I agree mostly with bogatron and further you will find many posts here advising on this kind of "multi-class classification" with neural networks.
Regarding your title, I would like to add that you can interpret that output as a probability, since I myself struggled to find a theoretical foundation for this. In what follows I'll talk about a neural network with 3 neurons in the output layer, each indicating 1 for its respective class.
Since the sum of all three target outputs is always 1 during training, the trained network will also tend to produce feed-forward outputs that sum to one (so rather (0.12, 0.36, 0.52) than bogatron's example). You can then interpret these figures as the probability that the respective input belongs to class 1/2/3 (e.g. the probability is 0.52 that it belongs to class 3).
This holds when using the logistic function or tanh as activation functions, although only a softmax output layer guarantees that the outputs sum exactly to one.
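To make the probability interpretation concrete, here is a minimal NumPy sketch of a softmax output layer, the usual way to force the three outputs to form a distribution (the raw scores below are made up for illustration):

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(z - np.max(z))
    return e / e.sum()

raw_scores = np.array([-0.5, 0.6, 1.0])  # hypothetical pre-activation outputs for classes 1-3
probs = softmax(raw_scores)
print(probs)        # roughly [0.12, 0.35, 0.53]
print(probs.sum())  # 1.0 -- interpretable as class probabilities
```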
More on this:
Posterior probability via neural networks: http://www-vis.lbl.gov/~romano/mlgroup/papers/neural-networks-survey.pdf
How to convert the output of an artificial neural network into probabilities?

Related

Activation functions in Neural Networks

I have a few questions related to the usage of various activation functions in neural networks. I would highly appreciate it if someone could give good explanatory answers.
Why is ReLU used only on hidden layers specifically?
Why is sigmoid not used in multi-class classification?
Why do we not use any activation function in regression problems where all the values are negative?
Why do we use "average='micro','macro','average'" when calculating performance metrics in multi-class classification?
I'll answer the first two questions to the best of my ability:
ReLU (= max(0, x)) is used to extract feature maps from the data. This is why it is used in the hidden layers, where we are learning which important characteristics or features the data holds that could let the model learn, for example, how to classify. In the fully connected layers it's time to make a decision about the output, so we usually use sigmoid or softmax, which tend to give us numbers between 0 and 1 (probabilities) that yield an interpretable result.
Sigmoid gives an independent probability for each class, so if you have 10 classes you'll have 10 probabilities, and depending on the threshold used your model might predict, for example, that an image belongs to two classes, whereas in multi-class classification you want just one predicted class per image. That's why softmax is used in this context: it produces a single distribution over the classes, and you take the class with the maximum probability, so the model predicts exactly one class.
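To make the difference concrete, here is a small NumPy sketch (the raw scores are invented) showing that sigmoid scores each class independently while softmax yields one distribution from which you take the argmax:

```python
import numpy as np

raw_scores = np.array([2.0, 1.5, -1.0])  # hypothetical logits for 3 classes

sigmoid_probs = 1.0 / (1.0 + np.exp(-raw_scores))              # independent per-class probabilities
softmax_probs = np.exp(raw_scores) / np.exp(raw_scores).sum()  # single distribution over classes

print(sigmoid_probs)                  # ~[0.88, 0.82, 0.27] -- two classes pass a 0.5 threshold
print(softmax_probs)                  # ~[0.60, 0.37, 0.03] -- sums to 1
print(int(np.argmax(softmax_probs)))  # 0 -- exactly one predicted class
```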

Is it better to make a neural network have hierarchical output?

I'm quite new to neural networks, and I recently built a neural network for number classification on vehicle license plates. It has 3 layers: an input layer for 16×24 (384 neurons) number images at 150 dpi, a hidden layer (199 neurons) with a sigmoid activation function, and a softmax output layer (10 neurons), one for each number 0 to 9.
I'm trying to expand my neural network to also classify the letters on license plates. But I'm worried that if I simply add more classes to the output, for example 10 letters for a total of 20 classes, it would be hard for the neural network to separate the features of each class. I also think it might cause a problem when the input is a number and the network wrongly classifies it as the letter with the biggest single probability, even though the sum of the probabilities over all the number outputs exceeds it.
So I wonder if it is possible to build a hierarchical neural network in the following manner:
There are 3 neural networks: 'Item', 'Number', and 'Letter'.
The 'Item' neural network classifies whether the input is a number or a letter.
If the 'Item' neural network classifies the input as a number (letter), the input then goes through the 'Number' ('Letter') neural network.
Return the final output from the 'Number' ('Letter') neural network.
The learning mechanism for each network is as follows:
The 'Item' neural network learns from all images of numbers and letters, so it has 2 outputs.
The 'Number' ('Letter') neural network learns from images of numbers (letters) only.
Which method should I pick to get better classification: simply adding 10 more classes, or building hierarchical neural networks as described above?
I'd strongly recommend training only a single neural network with outputs for all the kinds of images you want to be able to detect (so one output node per letter and one output node per digit you want to be able to recognize).
The main reason for this is that recognizing digits and recognizing letters is really very much the same task. Intuitively, you can understand a trained neural network with multiple layers as performing the recognition in multiple steps. In the hidden layer it may learn to detect various kinds of simple, primitive shapes (e.g. the hidden layer may learn to detect vertical lines, horizontal lines, diagonal lines, certain kinds of simple curved shapes, etc.). Then, in the weights between the hidden and output layers, it may learn how to recognize combinations of several of these primitive shapes as a specific output class (e.g. a vertical and a horizontal line in roughly the correct locations may be recognized as a capital letter L).
Those "things" it learns in the hidden layer will be perfectly relevant for digits as well as letters (the vertical line that may indicate an L may also indicate a 1 when combined with other shapes). So there are useful things to learn that are relevant for both tasks, and the network will probably be able to learn them more easily if it can learn them all in the same network.
See also this answer I gave to a related question in the past.
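As a rough sketch of this single-network approach (the layer sizes follow the question, but the framework code and names are just an illustration, not the asker's setup), one softmax output layer can cover the digits and the letters together:

```python
import tensorflow as tf

NUM_CLASSES = 10 + 10  # 10 digits plus the 10 letters mentioned in the question

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(16 * 24,)),                   # flattened 16x24 glyph image
    tf.keras.layers.Dense(199, activation='sigmoid'),          # shared hidden features (lines, curves, ...)
    tf.keras.layers.Dense(NUM_CLASSES, activation='softmax'),  # one output per digit and per letter
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
# model.fit(images, one_hot_labels, ...)  # train on digit and letter images together
```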
I'm trying to expand my neural network to also classify the letters on license plates. But I'm worried that if I simply add more classes to the output, for example 10 letters for a total of 20 classes, it would be hard for the neural network to separate the features of each class.
You're far from the point where that becomes problematic. ImageNet has 1000 classes and is commonly handled by a single network; see the AlexNet paper. If you want to learn more about CNNs, have a look at chapter 2 of "Analysis and Optimization of Convolutional Neural Network Architectures", and while you're at it, see chapter 4 for hierarchical classification. You can read the summary for ... well, a summary of it.

Why are weights and biases necessary in neural networks?

So I've been trying to pick up neural networks over the vacation, and I've gone through a lot of pages about this. One thing I don't understand is why we need weights and biases.
For weights, my intuition is that we're multiplying the inputs by certain constants so that we can reach the value of y and learn the relation, kind of like y = mx + c. Please help me out with the intuition as well if possible. Thanks in advance :)
I'd like to credit this answer to Jed Fox, whose explanation from the site below I have adapted. It's a great intro to neural networks:
https://github.com/cazala/synaptic/wiki/Neural-Networks-101
Adapted answer:
Neurons in a network are based on neurons found in nature. They take information in and, according to that information, elicit a certain response: an "activation".
Artificial neurons look like this:
[diagram of an artificial neuron ("Neuron J") from the linked article]
As you can see, they have several inputs, and for each input there is a weight (the weight of that specific connection). When the artificial neuron activates, it computes its state by adding up all the incoming inputs, each multiplied by its corresponding connection weight. But neurons always have one extra input, the bias, which is always 1 and has its own connection weight. This makes sure that even when all the inputs are zero there will still be some activation in the neuron.
After computing its state, the neuron passes it through its activation function, which normalises the result (usually to a value between 0 and 1).
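A minimal sketch of that computation in Python (the weights and inputs are made up), showing the weighted sum, the always-on bias input, and a sigmoid activation:

```python
import math

def neuron_output(inputs, weights, bias_weight):
    # State: sum of inputs times their connection weights, plus the bias input (fixed at 1).
    state = sum(x * w for x, w in zip(inputs, weights)) + 1.0 * bias_weight
    # Activation function (logistic sigmoid) squashes the state into (0, 1).
    return 1.0 / (1.0 + math.exp(-state))

# Even with all-zero inputs, the bias still produces an activation (~0.52 here).
print(neuron_output(inputs=[0.0, 0.0], weights=[0.4, -0.2], bias_weight=0.1))
```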
These weights (and sometimes biases) are what we learn in a neural network. Think of them as the parameters of the system. Without them the network would be pretty useless!
Additional comment:
In a network, those weighted inputs may come from other neurons, so you can begin to see that the weights also describe how neurons relate to each other, often signifying the importance of the relationship between two neurons.
I hope this helps. There is plenty more information available on the internet and at the link above. Consider reading through some of Stanford's material on CNNs for information on more complicated neural networks.

Why use a restricted Boltzmann machine rather than a multi-layer perceptron?

I'm trying to understand the difference between a restricted Boltzmann machine (RBM) and a feed-forward neural network (NN). I know that an RBM is a generative model, where the idea is to reconstruct the input, whereas an NN is a discriminative model, where the idea is to predict a label. But what I am unclear about is why you cannot just use a NN as a generative model. In particular, I am thinking about deep belief networks and multi-layer perceptrons.
Suppose my input to the NN is a set of nodes called x, and my output of the NN is a set of nodes y. In a discriminative model, my loss during training would be the difference between y and the value of y that I want x to produce (e.g. the ground-truth probabilities for the class labels). However, what if I just made the output have the same number of nodes as the input, and then set the loss to be the difference between x and y? In this way, the network would learn to reconstruct the input, like in an RBM.
So, given that a NN (or a multi-layer perceptron) can be used to train a generative model in this way, why would you use an RBM (or a deep belief network) instead? Or in this case, would they be exactly the same?
You can use a NN for a generative model in exactly the way you describe. This is known as an autoencoder, and these can work quite well. In fact, these are often the building blocks of deep belief networks.
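For illustration, a minimal Keras-style autoencoder sketch along the lines you describe (the sizes and names are hypothetical): the training target is the input itself, so the loss measures reconstruction error.

```python
import tensorflow as tf

INPUT_DIM = 784   # hypothetical input size, e.g. a flattened 28x28 image
HIDDEN_DIM = 64   # bottleneck layer: a compressed representation of the input

autoencoder = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(INPUT_DIM,)),
    tf.keras.layers.Dense(HIDDEN_DIM, activation='relu'),     # encoder
    tf.keras.layers.Dense(INPUT_DIM, activation='sigmoid'),   # decoder: same size as the input
])
# The "label" is the input itself, so the network learns to reconstruct x.
autoencoder.compile(optimizer='adam', loss='mse')
# autoencoder.fit(x_train, x_train, epochs=10)
```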
An RBM is a quite different model from a feed-forward neural network. They have connections going both ways (forward and backward) that have a probabilistic / energy interpretation. You'll need to read the details to understand.
A deep belief network (DBN) is just a neural network with many layers. It can be a large NN whose layers act as a kind of autoencoder, or it can consist of stacked RBMs. You need special methods, tricks and lots of data to train such deep and large networks; simple back-propagation suffers from the vanishing gradient problem. But if you do manage to train them, they can be very powerful (encoding "higher level" concepts).
Hope this helps to point you in the right directions.

Different weights for different classes in neural networks and how to use them after learning

I trained a neural network using the Backpropagation algorithm. I ran the network 30 times manually, each time changing the inputs and the desired output. The outcome is that of a traditional classifier.
I tried it out with 3 different classifications. Since I ran the network 30 times with 10 inputs for each class, I ended up with 3 distinct sets of weights, and runs for the same classification produced very similar weights with only a very small amount of error. The network has therefore proven that it has learned successfully.
My question is: now that the learning is complete and I have 3 distinct sets of weights (1 for each classification), how could I use these in a regular feed-forward network so it can classify the input automatically? I searched around to check whether you can somehow average out the weights, but it looks like this is not possible. Some people mentioned bootstrapping the data.
Have I done something wrong during the backpropagation learning process? Or is there an extra step which needs to be done after the learning process with these different weights for the different classes?
One way I am imagining this is by implementing a regular feed-forward network which will hold all 3 sets of weights. There will be 3 outputs, and for any given input one of the output neurons will fire, meaning that the given input is mapped to that particular class.
The network architecture is as follows:
3 inputs, 2 hidden neurons, 1 output neuron
Thanks in advance
It does not make sense to train your neural network on only one class at a time, since the hidden layer can form weight combinations to 'learn' which class the input data may belong to. Learning each class separately makes the sets of weights independent, and the network won't know which learned weights to use when a new test input is given.
Use a vector as the output to represent the three different classes, and train on all the data together.
EDIT
P.S. I don't think the post you link to is relevant to your case. The question in that post arises from different (random) weight initialisations in neural network training. People sometimes seed the random number generator to make the weight learning reproducible and avoid such problems.
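As a sketch of the advice above (the array shapes are hypothetical), encode the class labels as one-hot target vectors and train a single network with 3 output neurons on all 30 examples at once:

```python
import numpy as np

# Hypothetical labels for the 30 training examples (10 per class).
labels = np.array([0] * 10 + [1] * 10 + [2] * 10)

# One-hot target vectors: class 0 -> (1,0,0), class 1 -> (0,1,0), class 2 -> (0,0,1).
targets = np.eye(3)[labels]
print(targets.shape)  # (30, 3) -- one 3-element target per training example

# A single network with 3 inputs, a hidden layer, and 3 output neurons is then
# trained on all (input, target) pairs together, not one class at a time.
```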
In addition to the response by nikie, another possibility is to represent the output as one (unique) output unit with continuous values. For example, the ANN classifies the input as the first class if the output is in the [0, 1) interval, as the second if it is in [1, 2), and as the third if it is in [2, 3). This architecture is reported in the literature (and verified in my experience) to be less efficient than the discrete representation with 3 neurons.
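For completeness, a tiny sketch of how that single-output scheme would decode a prediction (the output value is made up):

```python
import math

output = 1.7  # hypothetical continuous value produced by the single output unit

# [0, 1) -> class 1, [1, 2) -> class 2, [2, 3) -> class 3
predicted_class = min(int(math.floor(output)) + 1, 3)
print(predicted_class)  # 2
```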
