I am trying to figure out how to build a neural network in which, let's say, I have 3 output labels (A, B, C).
Now my data consists of rows in which 2 of the labels can be 1, e.g. A and B will be 1 and C will be 0. Now I want to train my neural network such that it can predict A or B. I don't want it to be trained to have high probability for both A and B (as in multilabel problems); I want only one of them.
The reason for this is that the rows having 1 in both A and B are more like don't-care rows, in which predicting either A or B would be correct. So I don't want the neural network to find a minimum where it tries to predict both A and B.
Is it possible to train a neural network like this?
Using a per-sample weight is the best approach I can think of for your application.
Define a weight w for each sample such that w = 0 if A = 1 and B = 1, else w = 1. Now, define your loss function as:
w * (CE(A) + CE(B)) + w' * min(CE(A), CE(B)) + CE(C)
where CE(A) gives the cross-entropy loss over label A, and w' denotes the complement of w. The loss function is quite simple to understand: it will try to predict both A and B correctly when they are not both 1. Otherwise, it will predict either A or B correctly. Remember, which one of A and B will be predicted correctly cannot be known in advance, and it may not be consistent across batches. The model will always try to predict class C correctly.
If you are using your own weights to indicate sample importance, then you should multiply the entire expression above by that weight.
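A minimal sketch of this loss in TensorFlow 2, assuming a 3-unit sigmoid output layer with columns ordered (A, B, C); the function name and shapes are my assumptions, not from the question:

    import tensorflow as tf

    def either_a_or_b_loss(y_true, y_pred):
        # y_true, y_pred: shape (batch, 3), columns ordered as labels A, B, C.
        eps = 1e-7
        y_pred = tf.clip_by_value(y_pred, eps, 1.0 - eps)
        # Per-label binary cross-entropy.
        ce = -(y_true * tf.math.log(y_pred) + (1.0 - y_true) * tf.math.log(1.0 - y_pred))
        ce_a, ce_b, ce_c = ce[:, 0], ce[:, 1], ce[:, 2]
        both = y_true[:, 0] * y_true[:, 1]  # 1 when A = B = 1 ("don't care" rows)
        w = 1.0 - both                      # the weight w defined above
        return w * (ce_a + ce_b) + both * tf.minimum(ce_a, ce_b) + ce_c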
However, I wouldn't be surprised if you got similar (or even better) performance with the classic multi-label loss function. Assuming you have an equal proportion of each label, then only in 1/8th of the cases are you allowing your network to predict either A or B. Otherwise, the network has to predict all three of them correctly. Usually, simpler loss functions work better.
TL;DR:
A typical network will give you a probability for each class.
How you interpret it is up to you.
If you get equal values in a single-label scenario, it means both labels are equally likely.

The typical implementation of a multi-class classifier with neural networks uses a softmax layer, with one output per class.
If you want a single-label classifier, you treat the output with the maximum value as the selected label.
The actual value of this output compared to the others is a measure of the confidence in this prediction.
In case of equality, it means that both outputs are equally likely.
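For illustration, a minimal sketch (the logit values are made up) of reading a softmax output as a single-label prediction:

    import numpy as np

    logits = np.array([2.0, 2.0, 0.5])             # hypothetical raw network outputs
    probs = np.exp(logits) / np.exp(logits).sum()  # softmax: one probability per class
    label = int(np.argmax(probs))                  # single-label choice: highest probability
    # probs[0] == probs[1] here, so classes 0 and 1 are equally likely.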
I'm currently working on a classification problem with TensorFlow, and I'm new to the world of machine learning, but there is something I don't get.
I have successfully tried to train models that output the y tensor like this:
y = [0,0,1,0]
But I can't understand the principle behind it...
Why not just train the same model to output classes such as y = 3 or y = 4?
This seems much more flexible, because I can imagine a multi-classification problem with 2 million possible classes, and it would be much more efficient to output a single number between 0 and 2,000,000 than to output a tensor of 2,000,000 items for every result.
What am I missing?
Ideally, you could train your model to classify input instances and produce a single output. Something like
y=1 means input=dog, y=2 means input=airplane. An approach like that, however, brings a lot of problems:
How do I interpret the output y=1.5?
Why am I trying to regress a number as if I were working with continuous data, when in reality I'm working with discrete data?
In fact, what you are doing is treating a multi-class classification problem like a regression problem.
This is simply wrong (unless you're doing binary classification, in which case a positive and a negative output are everything you need).
To avoid these (and other) issues, we use a final layer of neurons and we associate a high activation with the right class.
The one-hot encoding represents the fact that you want to force your network to have a single high-activation output when a certain input is present.
Thus, every input=dog will have 1, 0, 0 as output, and so on.
In this way, you're correctly treating a discrete classification problem and producing a discrete, well-interpretable output (in fact, you'll always extract the output neuron with the highest activation using tf.argmax; even though your network hasn't learned to produce a perfect one-hot encoding, you'll still be able to extract the most likely output without a doubt).
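For example, a small sketch assuming TensorFlow 2 (the logits are made up):

    import tensorflow as tf

    y_true = tf.one_hot([2], depth=4)             # class 2 encoded as [0, 0, 1, 0]
    logits = tf.constant([[0.1, 0.2, 3.5, 0.4]])  # imperfect network output
    predicted = tf.argmax(logits, axis=1)         # -> [2]: the most likely class is still recoverable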
The answer lies in how that final tensor, or single value, is calculated. In an NN, your y=3 would be built by a weighted sum over the values of the previous layer.
Trying to train towards single values would then imply a linear relationship between the category IDs where none exists: for the true value y=4, the output y=3 would be considered better than y=1, even though the categories are random and may be 1: dogs, 3: cars, 4: cats.
Neural networks use gradient descent to optimize a loss function. In turn, this loss function needs to be differentiable.
A discrete output would be (indeed is) a perfectly valid and valuable output for a classification network. Problem is, we don't know how to optimize this net efficiently.
Instead, we rely on a continuous loss function. This loss function is usually based on something that is more or less related to the probability of each label -- and for this, you need a network output that has one value per label.
Typically, the output that you describe is then deduced from this soft, continuous output by taking the argmax of these pseudo-probabilities.
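A sketch of this training/inference split, assuming TensorFlow 2 (the values are made up):

    import tensorflow as tf

    logits = tf.Variable([[1.0, 2.0, 0.5]])
    labels = tf.constant([1])
    with tf.GradientTape() as tape:
        # Continuous, differentiable loss over the per-class outputs.
        loss = tf.reduce_mean(
            tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits))
    grads = tape.gradient(loss, logits)  # gradients flow through the softmax
    pred = tf.argmax(logits, axis=1)     # discrete output; argmax itself has no useful gradient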
I have just begun to work with neural networks using TensorFlow and I am really new to this. I trained my first model to make 2-category classifications and I'm a little curious about the output. Let's say we are predicting whether or not a house price will go up, and we get an output like:
House A: .99
House B: .75
House C: .55
House D: .40
Can I assume that these outputs are probabilities? So is it more likely that House B will go up than House C? Or is it just classifying that B and C will go up and House D will not? Thanks!
Not exactly. A neural network will output a prediction of whatever you have trained it for. So if you trained it to predict probabilities, it sure will output (predictions of) probabilities. However, if you trained it on observations of whether the price actually did go up, say with a single output which is 1.0 if the price went up and 0.0 if it didn't, then the output will be a regression of that observation given the input. This is not necessarily a probability, but it can rather be viewed as the confidence of the model.
Yes, each number can be thought of as a probability representing how likely it is that a house will go up in price. Just to clarify further: the probability estimate of one house does not affect the probability estimates of the others, as they are treated as separate samples. So B being more likely doesn't make C less likely; it's just that B happens to be more likely to go up.
And the classification depends on your threshold. By default I believe most classifiers use 0.5 as their threshold, so in this case A, B, and C are classified to go up and D is classified to go down.
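In code, with the scores above and a 0.5 threshold (a sketch, not the asker's model):

    scores = {"A": 0.99, "B": 0.75, "C": 0.55, "D": 0.40}
    labels = {h: ("up" if p >= 0.5 else "down") for h, p in scores.items()}
    # -> A, B, C classified "up"; D classified "down"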
I am trying to simulate an XOR gate using a neural network similar to this:
Now I understand that each neuron has a certain number of weights and a bias. I am using a sigmoid function to determine whether a neuron should fire or not in each state (since this uses a sigmoid rather than a step function, I use "firing" in a loose sense, as it actually spits out real values).
I successfully ran the simulation for the feed-forward part, and now I want to use the backpropagation algorithm to update the weights and train the model. The question is: for each value of x1 and x2 there is a separate result (4 different combinations in total), and under different input pairs, separate error distances (the difference between the desired output and the actual result) can be computed, and subsequently a different set of weight updates will be obtained. This means we would get 4 different sets of weight updates, one for each input pair, by using backpropagation.
How should we decide about the right weight updates?
Say we repeat the backpropagation for a single input pair until we converge; but what if we converge to a different set of weights when we choose another pair of inputs?
Now I understand that each neuron has certain weights. I am using a sigmoid function to determine whether a neuron should fire or not in each state.
You do not really "decide" this; typical MLPs do not "fire", they output real values. There are neural networks which actually fire (like RBMs), but that is a completely different model.
This means we would get 4 different sets of weight updates, one for each input pair, by using backpropagation.
This is actually a feature. Let's start from the beginning. You try to minimize some loss function on your whole training set (in your case, 4 samples), which is of the form:
L(theta) = SUM_i l(f(x_i), y_i)
where l is some loss function, f(x_i) is your current prediction, and y_i is the true value. You do this by gradient descent, so you compute the gradient of L and step against it:
grad L(theta) = grad SUM_i l(f(x_i), y_i) = SUM_i grad l(f(x_i), y_i)
What you now call "a single update" is grad l(f(x_i), y_i) for a single training pair (x_i, y_i). Usually you would not use this on its own; instead, you would sum (or take the average of) the updates across the whole dataset, as this is your true gradient. However, in practice this might not be computationally feasible (the training set is usually quite large); furthermore, it has been shown empirically that more "noise" in training is usually better. Thus another learning technique emerged, called stochastic gradient descent, which, in short, shows that under some light assumptions (like an additive loss function, etc.) you can actually do your "small updates" independently and still converge to a local minimum! In other words, you can do your updates "point-wise" in random order and you will still learn. Will it always be the same solution? No. But this is also true for computing the whole gradient: optimization of non-convex functions is nearly always non-deterministic (you find some local solution, not a global one).
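To make this concrete, here is a minimal sketch (not the asker's code) of a 2-2-1 sigmoid network trained on XOR with such point-wise stochastic updates; whether it converges can depend on the initialization and learning rate:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    y = np.array([0.0, 1.0, 1.0, 0.0])

    rng = np.random.default_rng(0)
    W1, b1 = rng.normal(size=(2, 2)), np.zeros(2)  # hidden layer (2 neurons)
    W2, b2 = rng.normal(size=2), 0.0               # output neuron
    lr = 0.5

    for epoch in range(10000):
        for i in rng.permutation(4):               # "small updates" in random order
            h = sigmoid(X[i] @ W1 + b1)
            p = sigmoid(h @ W2 + b2)
            # Gradients of the squared error 0.5 * (p - y_i)^2 through the sigmoids.
            d_out = (p - y[i]) * p * (1 - p)
            d_hid = d_out * W2 * h * (1 - h)
            W2 -= lr * d_out * h
            b2 -= lr * d_out
            W1 -= lr * np.outer(X[i], d_hid)
            b1 -= lr * d_hid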
Suppose you have 2 classes in 1D space, in any configuration: class A with 2 elements and class B with one element. The task is to distinguish between the two classes, i.e. to classify them. If you can choose an arbitrary activation function, what is the minimal number of neurons that can solve this?
I am thinking that you always have to use at least two neurons, or am I wrong?
Your question is somewhat related to the classical XOR problem for perceptrons. Let us suppose for a moment that it's about a neural network with the specific activation function a perceptron has: the binary threshold. Then the task turns into a 1D XOR problem, and indeed you then need 2 neurons in the hidden layer and 1 neuron in the output layer to solve it. But you mention that an arbitrary activation function can be chosen. In that case we can choose a radial basis function (RBF) network. If it is possible to denote class A as an output value greater than T and class B as an output value less than T, then only 1 RBF neuron will suffice to distinguish the classes. If you want every class to have its own output (whose value can be treated as a probability measure of the input belonging to the corresponding class), then you need 2 RBF neurons.
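For the hardest 1D arrangement (A, B, A), a single RBF neuron centered on the B point suffices; a sketch with hypothetical positions and threshold (here the high output marks B; the labeling convention is arbitrary):

    import numpy as np

    def rbf(x, center=0.0, width=0.25):
        # Gaussian radial basis function, peaked at `center`.
        return np.exp(-((x - center) ** 2) / (2 * width ** 2))

    points = np.array([-1.0, 0.0, 1.0])  # A, B, A in 1D
    T = 0.5
    classes = np.where(rbf(points) > T, "B", "A")  # -> ['A', 'B', 'A']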
Is anyone here familiar with echo state networks? I created an echo state network in C#. The aim was just to classify inputs into GOOD and NOT GOOD ones. The input is an array of double numbers. I know that an echo state network may not be the best choice for this classification, but I have to do it with this method.
My problem is that, after training, the network cannot generalize. When I run the network on foreign data (not the teaching input), I get only around 50-60% good results.
More details: my echo state network must work like a function approximator. The input of the function is an array of 17 double values, and the output is 0 or 1 (I have to classify the input as bad or good).
So I created a network. It contains an input layer with 17 neurons, a reservoir layer whose neuron count is adjustable, and an output layer containing 1 neuron for the needed 0 or 1 output. In a simpler example, no output feedback is used (I tried to use output feedback as well, but nothing changed).
The inner matrix of the reservoir layer is adjustable too. I generate weights between two double values (min, max) with an adjustable sparseness ratio. If the values are too big, it normalizes the matrix to have a spectral radius lower than 1. The reservoir layer can have sigmoid and tanh activation functions.
The input layer is fully connected to the reservoir layer with random values. So in the training phase I calculate the inner X(n) reservoir activations on the training data, collecting them into a matrix row-wise. Using the desired output data matrix (which here is a vector with 1 or 0 values), I calculate the output weights (from reservoir to output); the reservoir is fully connected to the output. Anyone who has used echo state networks knows what I'm talking about. I use the pseudo-inverse method for this.
The question is: how can I adjust the network so that it generalizes better, hitting more than 50-60% of the desired outputs on a foreign dataset (not the training one)? If I run the network again on the training dataset, it gives very good results, 80-90%, but what I want is better generalization.
I hope someone has had this issue with echo state networks too.
If I understand correctly, you have a set of known, classified data that you train on, then you have some unknown data which you subsequently classify. You find that after training, you can reclassify your known data well, but can't do well on the unknown data. This is, I believe, called overfitting - you might want to think about being less stringent with your network, reducing node number, and/or training based on a hidden dataset.
The way people do it is, they have a training set A, a validation set B, and a test set C. You know the correct classification of A and B but not C (because you split up your known data into A and B, and C are the values you want the network to find for you). When training, you only show the network A, but at each iteration, to calculate success you use both A and B. So while training, the network tries to understand a relationship present in both A and B, by looking only at A. Because it can't see the actual input and output values in B, but only knows if its current state describes B accurately or not, this helps reduce overfitting.
Usually people seem to split 4/5 of data into A and 1/5 of it into B, but of course you can try different ratios.
In the end, you finish training, and see what the network will say about your unknown set C.
Sorry for the very general and basic answer, but perhaps it will help describe the problem better.
If your network doesn't generalize, that means it's overfitting.
To reduce overfitting on a neural network, there are two ways:
get more training data
decrease the number of neurons
You also might think about the features you are feeding the network. For example, if it is a time series that repeats every week, then one feature is something like the 'day of the week' or the 'hour of the week' or the 'minute of the week'.
Neural networks need lots of data. Lots and lots of examples. Thousands. If you don't have thousands, you should choose a network with just a handful of neurons, or else use something else, like regression, that has fewer parameters, and is therefore less prone to overfitting.
Like the other answers here have suggested, this is a classic case of overfitting: your model performs well on your training data, but it does not generalize well to new test data.
Hugh's answer has a good suggestion, which is to reduce the number of parameters in your model (i.e., by shrinking the size of the reservoir), but I'm not sure whether it would be effective for an ESN, because the problem complexity that an ESN can solve grows proportional to the logarithm of the size of the reservoir. Reducing the size of your model might actually make the model not work as well, though this might be necessary to avoid overfitting for this type of model.
Superbest's solution is to use a validation set to stop training as soon as performance on the validation set stops improving, a technique called early stopping. But, as you noted, because you use offline regression to compute the output weights of your ESN, you cannot use a validation set to determine when to stop updating your model parameters; early stopping only works for online training algorithms.
However, you can use a validation set in another way: to regularize the coefficients of your regression! Here's how it works:
Split your training data into a "training" part (usually 80-90% of the data you have available) and a "validation" part (the remaining 10-20%).
When you compute your regression, instead of using vanilla linear regression, use a regularized technique like ridge regression, lasso regression, or elastic net regression. Use only the "training" part of your dataset for computing the regression.
All of these regularized regression techniques have one or more "hyperparameters" that balance the model fit against its complexity. The "validation" dataset is used to set these parameter values: you can do this using grid search, evolutionary methods, or any other hyperparameter optimization technique. Generally speaking, these methods work by choosing values for the hyperparameters, fitting the model using the "training" dataset, and measuring the fitted model's performance on the "validation" dataset. Repeat N times and choose the model that performs best on the "validation" set.
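As a sketch of this procedure (with hypothetical shapes and random stand-in data, using scikit-learn's ridge regression in Python rather than the asker's C# setup):

    import numpy as np
    from sklearn.linear_model import Ridge

    states = np.random.randn(1000, 200)      # collected reservoir activations X(n), row-wise
    targets = np.random.randint(0, 2, 1000)  # desired 0/1 outputs

    split = int(0.8 * len(states))           # 80% "training", 20% "validation"
    X_tr, y_tr = states[:split], targets[:split]
    X_va, y_va = states[split:], targets[split:]

    best_alpha, best_acc = None, -1.0
    for alpha in [1e-4, 1e-2, 1.0, 100.0]:   # simple grid search over the ridge penalty
        readout = Ridge(alpha=alpha).fit(X_tr, y_tr)
        acc = ((readout.predict(X_va) > 0.5) == y_va).mean()
        if acc > best_acc:
            best_alpha, best_acc = alpha, acc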
You can learn more about regularization and regression at http://en.wikipedia.org/wiki/Least_squares#Regularized_versions, or by looking it up in a machine learning or statistics textbook.
Also, read more about cross-validation techniques at http://en.wikipedia.org/wiki/Cross-validation_(statistics).