Machine learning multi-classification: Why use 'one-hot' encoding instead of a number

I'm currently working on a classification problem with TensorFlow, and I'm new to the world of machine learning, but there is something I don't get.
I have successfully trained models that output the y tensor like this:
y = [0,0,1,0]
But I can't understand the principle behind it...
Why not just train the same model to output classes such as y = 3 or y = 4?
This seems much more flexible, because I can imagine having a multi-class classification problem with 2 million possible classes, and it would be much more efficient to output a number between 0 and 2,000,000 than to output a tensor of 2,000,000 items for every result.
What am I missing?

Ideally, you could train your model to classify input instances and produce a single output. Something like
y=1 means input=dog, y=2 means input=airplane. An approach like that, however, brings a lot of problems:
How do I interpret the output y=1.5?
Why am I trying to regress a number as if I were working with continuous data when, in reality, I'm working with discrete data?
In fact, what you are doing is treating a multi-class classification problem like a regression problem.
This is logically wrong (unless you're doing binary classification; in that case, a positive and a negative output are everything you need).
To avoid these (and other) issues, we use a final layer of neurons, one per class, and we associate a high activation with the right class.
The one-hot encoding represents the fact that you want to force your network to have a single high-activation output when a certain input is present.
Thus, every input=dog will have 1, 0, 0 as output, and so on.
In this way, you're correctly treating a discrete classification problem, producing a discrete and easily interpretable output. In fact, you'll always extract the output neuron with the highest activation using tf.argmax, so even if your network hasn't learned to produce a perfect one-hot encoding, you can still extract the most likely output without ambiguity.
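As a rough illustration of the encoding and decoding steps described above, here is a minimal sketch in plain Python. The class list and the network output values are made up for the example, and the small `argmax` helper only stands in for what tf.argmax does on a tensor.

```python
def one_hot(index, num_classes):
    """Return a list with 1.0 at `index` and 0.0 everywhere else."""
    return [1.0 if i == index else 0.0 for i in range(num_classes)]


def argmax(values):
    """Return the position of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])


CLASSES = ["dog", "airplane", "cat"]  # hypothetical label set

# Training target for an input that shows a dog.
target = one_hot(CLASSES.index("dog"), len(CLASSES))

# A made-up, imperfect network output: not a clean one-hot vector,
# but the highest activation still identifies one class unambiguously.
network_output = [0.71, 0.18, 0.11]
predicted = CLASSES[argmax(network_output)]

print(target, predicted)
```

The point of the sketch is only that decoding never produces an in-between answer such as y=1.5: whatever the activations are, the result is one of the known classes.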

The answer is in how that final tensor, or single value, is calculated. In an NN, your y=3 would be built by a weighted sum over the values of the previous layer.
Trying to train towards single values would then imply a linear relationship between the category IDs where none exists: for the true value y=4, the output y=3 would be considered better than y=1 even though the category IDs are arbitrary and may be 1: dogs, 3: cars, 4: cats.
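To make the implied ordering concrete, here is a small sketch that compares a squared-error loss on the integer IDs with a cross-entropy loss on per-class outputs. The choice of these two losses and the probability values are assumptions made for the illustration; the category IDs are the ones from the example above.

```python
import math


def squared_error(target_id, predicted_id):
    """Regression-style loss on raw category IDs."""
    return (target_id - predicted_id) ** 2


def cross_entropy(target_index, probabilities):
    """Classification loss: negative log of the probability given to the true class."""
    return -math.log(probabilities[target_index])


# Category IDs: 1 = dogs, 3 = cars, 4 = cats. The true class is 4 (cats).
loss_if_cars = squared_error(4, 3)  # 1: "cars" looks like a near miss
loss_if_dogs = squared_error(4, 1)  # 9: "dogs" looks like a big miss

# One output per class, ordered [dogs, cars, cats]; the true class is index 2.
confident_cars = [0.05, 0.90, 0.05]
confident_dogs = [0.90, 0.05, 0.05]
ce_if_cars = cross_entropy(2, confident_cars)
ce_if_dogs = cross_entropy(2, confident_dogs)

print(loss_if_cars, loss_if_dogs, ce_if_cars, ce_if_dogs)
```

With the integer target, predicting cars is penalised less than predicting dogs purely because 3 is closer to 4 than 1 is. With one output per class, both wrong answers are penalised equally.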

Neural networks use gradient descent to optimize a loss function. In turn, this loss function needs to be differentiable.
A discrete output would be (indeed is) a perfectly valid and valuable output for a classification network. Problem is, we don't know how to optimize this net efficiently.
Instead, we rely on a continuous loss function. This loss function is usually based on something that is more or less related to the probability of each label -- and for this, you need a network output that has one value per label.
Typically, the output that you describe is then deduced from this soft, continuous output by taking the argmax of these pseudo-probabilities.
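A minimal sketch of that pipeline, assuming softmax as the way to obtain the pseudo-probabilities and cross-entropy as the continuous loss (both are common choices, not something the answer above specifies). The logits are made-up numbers.

```python
import math


def softmax(logits):
    """Turn raw scores into pseudo-probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, target_index):
    """Continuous, differentiable loss for the true class."""
    return -math.log(probs[target_index])


def grad_wrt_logits(probs, target_index):
    """Gradient of softmax + cross-entropy: probabilities minus the one-hot target."""
    return [p - (1.0 if i == target_index else 0.0) for i, p in enumerate(probs)]


logits = [2.0, 0.5, -1.0]  # one raw score per label
target_index = 0           # the true label, kept as an integer index

probs = softmax(logits)
loss = cross_entropy(probs, target_index)
grad = grad_wrt_logits(probs, target_index)

# The discrete answer is only extracted at the very end.
predicted_class = max(range(len(probs)), key=lambda i: probs[i])
print(probs, loss, grad, predicted_class)
```

Note that the target is stored here as an integer index and the one-hot vector appears only implicitly inside the gradient, so the labels themselves need not be materialised as huge vectors. The network still has to emit one score per label.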

Related

Understanding Generative Adversarial Networks

I implemented the GAN model proposed in the paper Edge-Connect (https://github.com/knazeri/edge-connect) in Keras and ran some training on the KITTI dataset. Now I am trying to figure out what's going on inside my model, and I have a few questions.
1. Initial Training (100 Epochs, 500 batches/epoch, 10 Samples/Batch)
At first I trained the model as proposed in the paper (including style, perceptual, L1 and adversarial loss).
At first sight, the model converges to nice results:
This is the output of the generator(left) for the masked input(right)
Most of the graphs from the tensorboard look quite good as well:
(These are all values from the GAN model: the total loss of the generator (GENERATOR_Loss), the different losses based on the generated image (L1, perc, style), and the adversarial loss (DISCRIMINATOR_loss).)
When looking closely at the discriminator, things look different. The adversarial loss of the discriminator for the generated images steadily increases.
The loss while training the discriminator (50/50 fake/real examples) doesn't change at all:
![] (https://i.stack.imgur.com/o5jCA.png)
And when looking at the histogram of activations of the output of the discriminator it always outputs values around 0.5.
Coming to my questions/conclusions where I would appreciate your feedback:
So I assume now that my model learned a lot, but nothing from the discriminator, right? The results are all based on the losses other than the adversarial loss?
It seems that the discriminator could not keep up with the generator generating better images. I think the discriminator's activations should move early on to two peaks at around 0 (fake labels) and 1 (real labels) and stay there?
I know that my final goal is that the discriminator outputs 0.5 probability for real as well as fake... but what does it mean when this happens right from the beginning and doesn't change during training?
Did I stop training too early? Could the discriminator catch up (since the output of the generator doesn't change much anymore) and eliminate the last tiny faults of the generator?
2. Thus I started a second training, this time only using the adversarial loss in the generator! (~16 Epochs, 500 batches/epoch, 10 Samples/Batch)
This time the discriminator seems to be able to differentiate between real and fake after a while.
(prob_real is the mean probability assigned to real images and vice versa)
The histogram of activations looks good as well:
But somehow after around 4k Samples things start to change and at around 7k it diverges...
Also all samples from the generator look like this:
Coming to my second part of questions/conclusions:
Should I pretrain the discriminator so it gets a head start? I guess it needs to be able to differentiate between real and fake (outputting large probabilities for real and vice versa) so the generator can learn useful things from it? Should I train the discriminator multiple times for each generator training step, for the same reason?
What happened in the second training? Was the learning rate for the discriminator too high? (Opt: ADAM, lr=1.0E-3)
Many hints on the internet for training GANs aim at increasing the difficulty of the discriminator's job (label noise/label flipping, instance noise, label smoothing etc). Here I think the discriminator rather needs to be boosted? (-> I also trained the Disc without changing the generator and it converges nicely)
If the discriminator outputs a probability of 0.5 right from the beginning of training, it means that the weights of the discriminator are not being updated and it plays no role in training, which further indicates that it is not able to differentiate between real images and fake images coming from the generator. To address this, try adding Gaussian noise to the input of the discriminator or applying label smoothing, both of which are very simple and effective techniques.
In answer to your question of whether the results are all based on the losses other than the adversarial loss: a trick that can be used is to train the network first on all the losses except the adversarial loss and then fine-tune on the adversarial loss. Hope it helps.
For the second part of your questions, the generated images seem to suffer from mode collapse, where the generator tends to learn the color and degradation from one image and pass the same to the other images. Try to solve it by either decreasing the batch size or using unrolled GANs.
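The two simple techniques mentioned above, label smoothing and Gaussian noise on the discriminator input, could look roughly like this. This is a framework-free sketch on plain lists. The function names, the smoothing amount of 0.1, the noise level of 0.05, and the choice to smooth both real and fake targets are all assumptions for illustration, not settings taken from the question or the paper.

```python
import random


def smooth_labels(labels, smoothing=0.1):
    """Pull hard 0/1 targets toward 0.5: 1 becomes 1 - smoothing, 0 becomes smoothing."""
    return [y * (1.0 - 2.0 * smoothing) + smoothing for y in labels]


def add_instance_noise(pixels, stddev=0.05, rng=random):
    """Add zero-mean Gaussian noise to every value of a flattened image."""
    return [p + rng.gauss(0.0, stddev) for p in pixels]


rng = random.Random(0)  # fixed seed so the example is repeatable

real_targets = smooth_labels([1.0, 1.0, 1.0])  # targets for real images
fake_targets = smooth_labels([0.0, 0.0, 0.0])  # targets for generated images

# The same noise function would be applied to real and generated images
# before they are passed to the discriminator.
image = [0.2, 0.5, 0.8]
noisy_image = add_instance_noise(image, stddev=0.05, rng=rng)

print(real_targets, fake_targets, noisy_image)
```

In a Keras training loop, the smoothed targets would replace the hard 0/1 labels in the discriminator's training batch, and the noise would be added to each batch of images just before the discriminator sees them.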

Using Multilayer Perceptron (MLP) to categorise images and its performance

I am new to the machine/deep learning area!
If I understood correctly, when I am using images as an input,
the number of neurons at the input layer = the number of pixels (i.e. the resolution)
The weights and biases are updated through back-propagation to achieve as low an error rate as possible.
Question 1.
So, even one single image will adjust the values of the weights & biases (through the back-propagation algorithm). Then how does adding more similar images into this MLP improve the performance?
(I must be missing something big... however, to me it seems like it will only be optimised for the given single image, and if I input the next one (a similar image), it will only be optimised for the next one.)
Question 2.
If I want to train my MLP to recognise certain types of images (let's say clothes / animals), what is a good training set size for each label (i.e. clothes, animals)? I know a larger training set will produce better results, but how many images would be ideal for good enough performance?
Question 3. (continued)
A bit different angle question,
There is a Google Cloud Vision API, which takes images as input and produces labels/probabilities as output. So this API will give me an output of, let's say, 100 labels and the probability of each label.
(e.g. when I submit an online game screenshot, it produces the output below)
Can this type of data be used as an input to an MLP to categorise certain types of images?
(Assuming I know all possible types of labels that the Google API produces and use all of them as input neurons)
Pixel values represent an image. But I think this type of API output can also represent an image from a different angle.
If so, what would be the performance difference?
e.g. when classifying 10 different types of images,
(pixels trained model) vs (output labels trained model)
I can help you with the "intuitive" picture.
First, it may be worth looking at convolutional neural nets and deep learning to see how to handle images as input while reducing the number of weights. It will not be 1 weight per pixel.
Also, what exactly do you mean by "performance"? That is not a well-defined question. If you use 1 image, say a cat, do you mean by performance that you can identify cats in other pictures, or how closely you are able to reproduce your cat?
Imagine you have a network with 3 weights, 1 input and 1 output, and you trained it to an error of < 0.01, where the desired output is 0.5:
W1  | W2  | W3   | Output
0.1 | 0.2 | 0.05 | 0.5006
If you retrain the network, you may get a different set of weights:
W1  | W2  | W3   | Output
0.3 | 0.2 | 0.08 | 0.49983
Since the weights are quite different, you can imagine that there are several solutions.
Then, if you add another input, you can imagine that some of those weights which worked for the first input will also work for the second.
Then you add another input. Then a subset of the solutions that work for 2 inputs will work for 3 inputs. Etc.
When you have enough unrelated or noisy inputs, you won't find a set of weights which meets your error criterion. Either you need to add weights (more degrees of freedom) or increase the error target, or both.
Now, you have a learning rate when you train a network. Say you are doing online training (you update the weights after each input), not batch training (you find the error vector for a batch (subset) of the inputs and update your weights based on that, once per batch).
Now, suppose your learning rate is 0.01 and a weight has the value 0.1. Intuitively:
If, for the first input, the first weight had a derivative of 5, then your weight has the new value 0.1 - 0.01*5 = 0.05.
If you feed your next input, say the derivative is -5. That means that the second input "disagrees" with the first change and tries to move the weight back to 0.05 - 0.01*(-5) = 0.1.
If the derivative for the second input is 5, that means that the second input "agrees" with the first.
If you have 20 inputs, some will pull the value up, some will push the value down. You keep looping through the training set, and the weight will approach a value which most of the inputs agree on, hence minimizing the error caused by that weight.
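The tug-of-war between inputs can be replayed in a few lines. The first part uses the numbers from the text (weight 0.1, learning rate 0.01, derivatives of 5 and -5). The second part is a made-up toy in which each input "prefers" a different weight value and the derivative is that of a squared error, just to show the weight settling on a compromise.

```python
def sgd_step(weight, derivative, learning_rate):
    """One online gradient-descent update for a single weight."""
    return weight - learning_rate * derivative


# Numbers from the text above.
after_first = sgd_step(0.1, 5.0, 0.01)             # 0.1 - 0.01*5 = 0.05
after_disagree = sgd_step(after_first, -5.0, 0.01)  # pulled back towards 0.1
after_agree = sgd_step(after_first, 5.0, 0.01)      # pushed further, towards 0.0

# Toy example (hypothetical): every input prefers a different weight value,
# and the error for one input is (w - preferred)**2, so its derivative is 2*(w - preferred).
preferred_values = [0.2, 0.4, 0.3, 0.5, 0.1]
w = 0.1
for epoch in range(200):
    for preferred in preferred_values:
        w = sgd_step(w, 2.0 * (w - preferred), learning_rate=0.05)

# w ends up close to the average of the preferred values (0.3).
print(after_first, after_disagree, after_agree, w)
```

With a constant learning rate the weight keeps wobbling slightly as each input tugs on it in turn, which is why the final value is only close to the average rather than exactly equal to it.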
For question 2:
My mathematical gut feeling tells me you definitely need at least 2x the number of weights as training examples for the training to have any meaning, but you should make that at least 10x the number of weights as the bare minimum to even draw a conclusion about your network, unless you are not trying to guess something new (for example, for an XOR gate, you can probably get away with far fewer inputs than weights, but that is a bit of a long discussion).
Note:
With 1 image, you can rotate it, stretch it, mix it with other images... to create other images and increase your input set.
If you have a simple input like an XOR gate, you can create inputs like (0.3, 0.7) (0.3, 0.6) (0.2, 0.8)... to expand your training set.
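For the XOR case, expanding the training set could be sketched like this. The jitter amount of 0.3, the 0.5 threshold used to label a point, and the number of copies per corner are assumptions chosen so that every perturbed point keeps the label of the corner it came from.

```python
import random


def xor_label(a, b):
    """Label a point by treating each coordinate above 0.5 as 'on'."""
    return int((a > 0.5) != (b > 0.5))


def jitter(point, amount, rng):
    """Move each coordinate by up to `amount`, clamped to the range [0, 1]."""
    return tuple(min(1.0, max(0.0, v + rng.uniform(-amount, amount))) for v in point)


rng = random.Random(42)  # fixed seed so the example is repeatable
corners = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]

augmented = []
for corner in corners:
    label = xor_label(*corner)
    for _ in range(5):  # five noisy copies per original input
        augmented.append((jitter(corner, 0.3, rng), label))

print(len(augmented), augmented[:3])
```

The same idea carries over to images, where the perturbations would be rotations, stretches, and similar transformations instead of small coordinate shifts.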
For question 3:
This is equivalent to chaining Google's network serially with a network you create, but training each part separately.
Basically: You have Pictures --> 10 labels input to your network --> your classification
The problem I see there is that you may not know all the possible outputs of Google's classification. But say they are consistent:
Is your label the same as one of the 10 labels? If so, use the given label. If it is a different type of label, you can use that API to simplify your network. What are the consequences, or what is the performance?
That is beyond me. For neural nets, while there are good mathematical theories to tell us what they can do, many posed problems such as the one you asked require either a special mathematical analysis (perhaps getting a PhD on some insight related to that class of problems) or, as most do, empirical results.

Time Series Prediction using Recurrent Neural Networks

I am using a Bike Sharing dataset to predict the number of rentals in a day, given the input. I will use 2011 data to train and 2012 data to validate. I successfully built a linear regression model, but now I am trying to figure out how to predict time series using recurrent neural networks.
The data set has 10 attributes (such as month, working day or not, temperature, humidity, windspeed), all numerical, though one attribute is the day of the week (Sunday: 0, Monday: 1, etc.).
I assume that one day can and probably will depend on previous days (and I will not need all 10 attributes), so I thought about using an RNN. I don't know much, but I read some stuff and also this. I am thinking about a structure like this.
I will have 10 input neurons, a hidden layer and 1 output neuron. I don't know how to decide how many neurons the hidden layer should have.
I guess that I need a matrix to connect the input layer to the hidden layer, a matrix to connect the hidden layer to the output layer, and a matrix to connect the hidden layers in neighbouring time-steps, t-1 to t, t to t+1. That's a total of 3 matrices.
In one tutorial, the activation function was sigmoid, although I'm not sure exactly; if I use a sigmoid function, I will only get output between 0 and 1. What should I use as the activation function? My plan is to repeat this n times:
For each training data:
  Forward propagate
    Propagate the input to hidden layer, add it to propagation of previous hidden layer to current hidden layer. And pass this to activation function.
    Propagate the hidden layer to output.
    Find error and its derivative, store it in a list
  Back propagate
    Find current layers and errors from list
    Find current hidden layer error
    Store weight updates
Update weights (matrices) by multiplying them by learning rate.
Is this the correct way to do it? I want real numerical values as output, instead of a number between 0-1.
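The forward-propagation half of the plan above, with its three matrices, could be sketched roughly as follows. The tiny layer sizes, the weight values, the tanh hidden activation, and the linear (unsquashed) output neuron are all assumptions for illustration. A linear output is one simple way to get real-valued predictions rather than numbers between 0 and 1.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def rnn_forward(inputs, w_in, w_rec, w_out):
    """Run a simple recurrent net over a sequence and return one output per time step."""
    hidden = [0.0] * len(w_rec)  # hidden state before the first time step
    outputs = []
    for x in inputs:
        # Input contribution plus the contribution of the previous hidden state.
        pre_activation = [a + b for a, b in zip(matvec(w_in, x), matvec(w_rec, hidden))]
        hidden = [math.tanh(p) for p in pre_activation]
        # Linear output neuron: no squashing, so the value is not confined to 0-1.
        outputs.append(matvec(w_out, hidden)[0])
    return outputs


# Toy sizes: 2 inputs, 3 hidden neurons, 1 output (the question would use 10 inputs).
w_in = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]]                 # input  -> hidden
w_rec = [[0.1, 0.0, 0.0], [0.0, 0.1, 0.0], [0.0, 0.0, 0.1]]   # hidden -> hidden (t-1 to t)
w_out = [[20.0, -5.0, 12.0]]                                  # hidden -> output

sequence = [[1.0, 0.5], [0.2, 0.9], [0.7, 0.1]]  # three made-up days of features
outputs = rnn_forward(sequence, w_in, w_rec, w_out)
print(outputs)
```

Training would still require the backward pass through time, which is omitted here.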
It seems to be the correct way to do it, if you just want to learn the basics. If you want to build a neural network for practical use, this is a very poor approach, and as Marcin's comment says, almost everyone who constructs neural nets for practical use does so with packages that provide a ready-made neural network implementation. Let me answer your questions one by one...
I don't know how to decide on how many neurons the hidden layer will have.
There is no golden rule for choosing the right architecture for your neural network. There are many empirical rules people have established out of experience, and the right number of neurons is decided by trying out various combinations and comparing the output. A good starting point would be 3/2 times the sum of your input and output neurons, i.e. (10+1)*(3/2) = 16.5, so you could start with 16 or 17 neurons in the hidden layer and then go on reducing the number based on your output.
What should I use as activation function?
Again, there is no 'right' function. It totally depends on what suits your data. Additionally, there are many candidate functions, such as the hyperbolic tangent and logistic sigmoids or RBF. A good starting point would be the logistic function, but again you will only find the right function through trial and error.
Is this the correct way to do it? I want real numerical values as output, instead of a number between 0-1.
Sigmoid-type activation functions (including one assigned to the output neuron) give you a bounded output, 0 to 1 in the case of the logistic function, so you will have to use a multiplier to convert it to real values, or have some kind of encoding with multiple output neurons. Coding this manually will be complicated.
Another aspect to consider is the number of training iterations. Simply repeating 'n' times doesn't help. You need to find the optimal number of training iterations through trial and error as well, to avoid both under-fitting and over-fitting.
The correct way to do it would be to use packages in Python or R, which allow you to train neural nets quickly with a large amount of customization, so you can train and test multiple nets with different activation functions (and even different training algorithms) and network architectures without too much hassle. With some amount of trial and error, you will eventually find the net that gives you the desired output.

How to use opencv ann to evaluate a function output

Can I compute the function f(x) = sqr(x) using the OpenCV ANN?
I need to train my ANN using a set of integers and their squared values.
I need to get the squared value of an integer as output from the ANN model.
If we can do that using the OpenCV ANN, what will be the number of input neurons and output neurons, and how do I specify the classes, etc.?
You mention class specification, but I don't think that this is a class categorization problem. I think it would be better to treat the input as X, and the output as sqr(X). Then this becomes a general function approximation problem.
There is an issue with this particular problem however. Neural networks aren't well suited for functions with unbounded input/output. The output of a neural network is usually limited to the range of its activation function, and the input value is usually scaled to some reasonable range. Assuming you are using the default activation (symmetrical sigmoid), your output is limited to (-1, 1). If you have a limited range of integers you want to use, you can still do this, but you'll have to scale the inputs and outputs accordingly.
If you use this method, there will be one input node, and one output node, corresponding to the scaled versions of X and sqr(X) respectively. OpenCV will try to take care of scaling for you automatically. It's probably best for you to trust this, UNLESS you are planning on providing multiple different sets of training data. The different sets may have different distributions, hence a different scale.
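If you do end up scaling by hand, the bookkeeping could look like the following sketch. It is plain Python and does not touch the OpenCV API. The helper name `make_scaler`, the integer range 0..10, and the target range of -0.9 to 0.9 are assumptions; the range is kept slightly inside (-1, 1) because a sigmoid only approaches its limits.

```python
def make_scaler(lo, hi, target_lo=-0.9, target_hi=0.9):
    """Return (scale, unscale) functions mapping [lo, hi] to [target_lo, target_hi] and back."""
    span = hi - lo
    target_span = target_hi - target_lo

    def scale(value):
        return target_lo + (value - lo) * target_span / span

    def unscale(value):
        return lo + (value - target_lo) * span / target_span

    return scale, unscale


xs = list(range(0, 11))   # a limited range of integers: 0..10
ys = [x * x for x in xs]  # their squares: 0..100

scale_x, unscale_x = make_scaler(min(xs), max(xs))
scale_y, unscale_y = make_scaler(min(ys), max(ys))

# One scaled input value and one scaled output value per training sample.
training_pairs = [(scale_x(x), scale_y(y)) for x, y in zip(xs, ys)]

# A network prediction in scaled units is converted back with unscale_y.
recovered = unscale_y(scale_y(49))
print(training_pairs[:3], recovered)
```

The same `scale_x` and `unscale_y` functions must be reused at prediction time, which is exactly where different training sets with different ranges would cause trouble.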

Echo state neural network?

Is there anyone here who is familiar with echo state networks? I created an echo state network in C#. The aim was just to classify inputs into GOOD and NOT GOOD ones. The input is an array of double numbers. I know that an echo state network may not be the best choice for this classification, but I have to do it with this method.
My problem is that after training the network, it cannot generalize. When I run the network with foreign data (not the teaching input), I get only around 50-60% correct results.
More details: My echo state network must work like a function approximator. The input of the function is an array of 17 double values, and the output is 0 or 1 (I have to classify the input into bad or good input).
So I have created a network. It contains an input layer with 17 neurons, a reservoir layer whose neuron count is adjustable, and an output layer containing 1 neuron for the required output of 0 or 1. In a simpler example, no output feedback is used (I tried to use output feedback as well, but nothing changed).
The inner matrix of the reservoir layer is adjustable too. I generate weights between two double values (min, max) with an adjustable sparseness ratio. If the values are too big, it normalizes the matrix to have a spectral radius lower than 1. The reservoir layer can have sigmoid and tanh activation functions.
The input layer is fully connected to the reservoir layer with random values. So in the training stage I calculate the inner reservoir activations X(n) with the training data, collecting them into a matrix row-wise. Using the desired output data matrix (which is now a vector with 1 or 0 values), I calculate the output weights (from reservoir to output). The reservoir is fully connected to the output. Anyone who has used echo state networks knows what I'm talking about. I use the pseudo-inverse method for this.
The question is, how can I adjust the network so that it generalizes better, to hit more than 50-60% of the desired outputs with a foreign dataset (not the training one)? If I run the network again with the training dataset, it gives very good results, 80-90%, but what I want is for it to generalize better.
I hope someone has had this issue with echo state networks too.
If I understand correctly, you have a set of known, classified data that you train on, then you have some unknown data which you subsequently classify. You find that after training, you can reclassify your known data well, but can't do well on the unknown data. This is, I believe, called overfitting - you might want to think about being less stringent with your network, reducing node number, and/or training based on a hidden dataset.
The way people do it is, they have a training set A, a validation set B, and a test set C. You know the correct classification of A and B but not C (because you split up your known data into A and B, and C are the values you want the network to find for you). When training, you only show the network A, but at each iteration, to calculate success you use both A and B. So while training, the network tries to understand a relationship present in both A and B, by looking only at A. Because it can't see the actual input and output values in B, but only knows if its current state describes B accurately or not, this helps reduce overfitting.
Usually people seem to split 4/5 of data into A and 1/5 of it into B, but of course you can try different ratios.
In the end, you finish training, and see what the network will say about your unknown set C.
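Splitting the known data into A and B could be done along these lines. This is a minimal sketch: the function name, the fixed random seed, and the made-up sample data are assumptions, and the 1/5 validation fraction is the usual ratio mentioned above.

```python
import random


def split_train_validation(samples, validation_fraction=0.2, seed=0):
    """Shuffle the known samples and split them into training set A and validation set B."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_validation = int(round(len(shuffled) * validation_fraction))
    return shuffled[n_validation:], shuffled[:n_validation]


# Made-up known data: (input values, class) pairs.
known = [([float(i), float(i % 3)], i % 2) for i in range(50)]

train_a, validation_b = split_train_validation(known)
print(len(train_a), len(validation_b))
```

Shuffling before splitting matters when the known data is ordered, for example by class or by time of collection; otherwise B may not resemble A at all.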
Sorry for the very general and basic answer, but perhaps it will help describe the problem better.
If your network doesn't generalize that means it's overfitting.
To reduce overfitting on a neural network, there are two ways: get more training data, or decrease the number of neurons.
You also might think about the features you are feeding the network. For example, if it is a time series that repeats every week, then one feature is something like the 'day of the week' or the 'hour of the week' or the 'minute of the week'.
Neural networks need lots of data. Lots and lots of examples. Thousands. If you don't have thousands, you should choose a network with just a handful of neurons, or else use something else, like regression, that has fewer parameters, and is therefore less prone to overfitting.
Like the other answers here have suggested, this is a classic case of overfitting: your model performs well on your training data, but it does not generalize well to new test data.
Hugh's answer has a good suggestion, which is to reduce the number of parameters in your model (i.e., by shrinking the size of the reservoir), but I'm not sure whether it would be effective for an ESN, because the problem complexity that an ESN can solve grows proportional to the logarithm of the size of the reservoir. Reducing the size of your model might actually make the model not work as well, though this might be necessary to avoid overfitting for this type of model.
Superbest's solution is to use a validation set to stop training as soon as performance on the validation set stops improving, a technique called early stopping. But, as you noted, because you use offline regression to compute the output weights of your ESN, you cannot use a validation set to determine when to stop updating your model parameters: early stopping only works for online training algorithms.
However, you can use a validation set in another way: to regularize the coefficients of your regression! Here's how it works:
Split your training data into a "training" part (usually 80-90% of the data you have available) and a "validation" part (the remaining 10-20%).
When you compute your regression, instead of using vanilla linear regression, use a regularized technique like ridge regression, lasso regression, or elastic net regression. Use only the "training" part of your dataset for computing the regression.
All of these regularized regression techniques have one or more "hyperparameters" that balance the model fit against its complexity. The "validation" dataset is used to set these parameter values: you can do this using grid search, evolutionary methods, or any other hyperparameter optimization technique. Generally speaking, these methods work by choosing values for the hyperparameters, fitting the model using the "training" dataset, and measuring the fitted model's performance on the "validation" dataset. Repeat N times and choose the model that performs best on the "validation" set.
You can learn more about regularization and regression at http://en.wikipedia.org/wiki/Least_squares#Regularized_versions, or by looking it up in a machine learning or statistics textbook.
Also, read more about cross-validation techniques at http://en.wikipedia.org/wiki/Cross-validation_(statistics).
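The procedure described above, ridge regression for the output weights with the regularization strength chosen on a validation set, could be sketched as follows. Everything here is illustrative: the tiny data set stands in for collected reservoir activations, the grid of candidate values is arbitrary, and the linear solver is written out in plain Python only to keep the example self-contained (a numerical library would normally do this step).

```python
def solve(a, b):
    """Solve the linear system a*x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def ridge_fit(states, targets, lam):
    """Output weights w minimising ||states*w - targets||^2 + lam*||w||^2."""
    k = len(states[0])
    xtx = [[sum(row[i] * row[j] for row in states) + (lam if i == j else 0.0)
            for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * t for row, t in zip(states, targets)) for i in range(k)]
    return solve(xtx, xty)


def mse(states, targets, w):
    """Mean squared error of the linear readout w on the given data."""
    preds = [sum(wi * xi for wi, xi in zip(w, row)) for row in states]
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(targets)


def pick_lambda(train, validation, grid):
    """Fit on the training part for each candidate and keep the one that is best on validation."""
    (x_tr, y_tr), (x_val, y_val) = train, validation
    scored = [(mse(x_val, y_val, ridge_fit(x_tr, y_tr, lam)), lam) for lam in grid]
    return min(scored)[1]


# Made-up stand-ins for reservoir states (rows) and desired outputs.
train_x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
train_y = [2.0, -1.0, 1.0, 3.0]
val_x = [[1.0, 2.0], [3.0, 1.0]]
val_y = [0.1, 4.8]

grid = [0.01, 0.1, 1.0, 10.0]
best_lambda = pick_lambda((train_x, train_y), (val_x, val_y), grid)
w_unregularized = ridge_fit(train_x, train_y, 0.0)
w_strong = ridge_fit(train_x, train_y, 10.0)
print(best_lambda, w_unregularized, w_strong)
```

With a regularization strength of zero this reduces to ordinary least squares, which for a full-rank state matrix gives the same weights as the pseudo-inverse. Larger values shrink the weights, which is the mechanism that trades training fit for better generalization.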
