When solving a binary classification problem, I think there are two possible approaches in Caffe.
The first one is using "SigmoidCrossEntropyLossLayer" with one output unit.
The other one is using "SoftmaxWithLossLayer" with two output units.
My question is what’s the difference between these two approaches?
Which one should I use?
Thank you very much!
If you play a bit with the math, you can "duplicate" the single logit x_i fed to the "Sigmoid" layer into two logits: 0.5*x_i for class 1 and -0.5*x_i for class 0. The softmax probability of class 1 is then exp(0.5*x_i) / (exp(0.5*x_i) + exp(-0.5*x_i)) = 1 / (1 + exp(-x_i)), which is exactly the sigmoid of x_i, so the "SoftmaxWithLoss" layer amounts to "SigmoidCrossEntropyLoss" on the single output prediction x_i.
So I believe these two methods can be regarded as equivalent for predicting binary outputs.
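The equivalence can be checked numerically. The following is a minimal sketch in plain Python rather than Caffe; the function names are made up for illustration, and class 0 is assumed to sit at index 0 of the two-unit output.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(logits):
    # Subtract the maximum for numerical stability.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid_cross_entropy(x, y):
    # One output unit with logit x and binary target y in {0, 1}.
    p = sigmoid(x)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def softmax_cross_entropy(logits, label):
    # Two output units, integer class label in {0, 1}.
    return -math.log(softmax(logits)[label])


def two_unit_logits(x):
    # "Duplicate" the single logit: -0.5*x for class 0, 0.5*x for class 1.
    return [-0.5 * x, 0.5 * x]


if __name__ == "__main__":
    for x in (-3.0, -0.5, 0.0, 2.0):
        for y in (0, 1):
            a = sigmoid_cross_entropy(x, y)
            b = softmax_cross_entropy(two_unit_logits(x), y)
            print(x, y, a, b)
```

Both loss columns agree up to floating-point error for every logit and label.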
I want to encode my multiclass classification output variable in a specific way to take ordinality into account. I want to use this in a NN with sigmoid objective.
I have a couple of questions about this:
How could I encode my classes in this way?
This would not change the problem from multiclass to multilabel classification, right?
P.S. Here is a link to the paper I based this on. And here is a figure representing the change from a normal NN to their adaptation:
1. How could I encode my classes in this way?
It depends on the framework. A PyTorch example can be found here; it also includes a code snippet for converting predictions back to labels.
2. This would not change the problem from multiclass to multilabel classification, right?
No. You would have multiple binary outputs, but they are subsequently converted to a single label, so it is still multiclass classification.
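As a framework-independent illustration, here is a minimal sketch of one common ordinal encoding. It is an assumption, not necessarily the exact scheme of the linked paper or the PyTorch example: class k out of K is encoded as k ones followed by zeros, using K - 1 binary outputs. The function names are hypothetical.

```python
def encode_ordinal(label, num_classes):
    # Assumed scheme: class k -> k ones followed by zeros (length num_classes - 1).
    # Example with 4 classes: 0 -> [0, 0, 0], 2 -> [1, 1, 0], 3 -> [1, 1, 1].
    return [1.0 if i < label else 0.0 for i in range(num_classes - 1)]


def decode_ordinal(outputs, threshold=0.5):
    # Count the leading sigmoid outputs above the threshold; stop at the first
    # one that falls below it, so the result is always a single class label.
    label = 0
    for value in outputs:
        if value <= threshold:
            break
        label += 1
    return label


if __name__ == "__main__":
    print(encode_ordinal(2, 4))
    print(decode_ordinal([0.9, 0.8, 0.2]))
```

Because the decoder collapses the binary outputs into one label, each sample still belongs to exactly one class, which is why the task stays multiclass.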
I'm working on a model that predicts one of 2 labels (say A and B). But there is a possibility that the output to be predicted is neither A nor B, and in that case I want to predict "can't say". Could you please guide me on how to do that?
Also, guidance with some code snippets would be appreciated.
I answered a similar question over here: Limiting probability percentage of irrelevant image in CNN
The difference in your question is that you are doing binary classification instead of multi-class classification, as in their question. To predict unknown classes, as I mentioned there, you need to change your last layer to have an output of dimension 3. Then, apply a softmax activation to that (instead of using a sigmoid, which you might be currently using), which makes it such that the probabilities of each class add up to 1.
I don't know what framework you are using to build your model, so I can't provide relevant code snippets.
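Independent of any framework, the idea can still be sketched in plain Python. The three logits below stand in for the output of a final layer of dimension 3; the class names and the `predict` helper are hypothetical.

```python
import math

# Assumed class order of the three output units.
CLASSES = ["A", "B", "can't say"]


def softmax(logits):
    # Subtract the maximum for numerical stability.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(logits):
    # Turn the three raw outputs into probabilities and pick the largest one.
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return CLASSES[best], probs


if __name__ == "__main__":
    label, probs = predict([0.1, 0.2, 3.0])
    print(label, probs, sum(probs))
```

For this to work, the training set needs examples labelled with the third class, so that the network can learn what "neither A nor B" looks like.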
I'm basically trying to create a neural network that should tell me whether an input I'm giving it is valid or not. The problem is that I only have valid input with which I can train it.
Right now I am trying to come up with a working dense model that validates only MNIST digits between 0 and 4. All other digits should be seen as invalid. My first attempt was to train it with digits between 0 and 4 as valid and images of random pixels as invalid (with the same percentage of black pixels as a normal image), but unfortunately it doesn't work. When I test it with digits between 5 and 9, they are seen as valid.
So I'm starting to wonder whether it's even possible to train a neural network this way.
Also I realize there might be better ways to do this, maybe with an autoencoder or a different kind of network but right now I want to try this with only dense layers.
Thank you.
What you are looking for is one-class classification, also known as unary classification or class-modelling.
A quick Google search suggests training an autoencoder and treating an object as belonging to your class if its reconstruction error is below a specific threshold.
But before you start building something like that, I would suggest trying something like One-Class K-Nearest Neighbor or One-Class SVM first to see whether you get acceptable results. If so, you can then improve on those results with the far more complicated autoencoder solution.
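To make the one-class nearest-neighbor idea concrete, here is a minimal sketch in plain Python. It is one simple variant, not a reference implementation: a point counts as valid if the distance to its k-th nearest training point is no larger than a threshold estimated from the valid training data alone. All names and the 0.95 quantile are assumptions.

```python
import math


def kth_neighbor_distance(point, train, k=1, skip_self=False):
    # Distance from `point` to its k-th nearest neighbor in `train`.
    dists = sorted(math.dist(point, t) for t in train)
    if skip_self:
        # `point` is itself part of `train`; drop its zero distance.
        dists = dists[1:]
    return dists[k - 1]


def fit_threshold(train, k=1, quantile=0.95):
    # Score every valid training point against the others and use a high
    # quantile of those scores as the acceptance threshold.
    scores = sorted(
        kth_neighbor_distance(p, train, k, skip_self=True) for p in train
    )
    index = min(len(scores) - 1, int(quantile * len(scores)))
    return scores[index]


def is_valid(point, train, threshold, k=1):
    return kth_neighbor_distance(point, train, k) <= threshold


if __name__ == "__main__":
    train = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1)]
    threshold = fit_threshold(train)
    print(is_valid((0.05, 0.05), train, threshold))
    print(is_valid((5.0, 5.0), train, threshold))
```

For images, the points would be flattened pixel vectors or learned features. The same threshold logic applies to an autoencoder if the neighbor distance is replaced by the reconstruction error.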
I am implementing a simple neural net from scratch, just for practice. I have got it working fine with sigmoid, tanh and ReLU activations for binary classification problems. I am now attempting to use it for multi-class, mutually exclusive problems. Of course, softmax is the best option for this.
Unfortunately, I have had a lot of trouble understanding how to implement softmax, cross-entropy loss and their derivatives in backprop. Even after asking a couple of questions here and on Cross Validated, I can't get any good guidance.
Before I try to go further with implementing softmax, is it possible to somehow use sigmoid for multi-class problems (I am trying to predict 1 of n characters, which are encoded as one-hot vectors)? And if so, which loss function would be best? I have been using the squared error for all binary classifications.
Your question is about the fundamentals of neural networks and therefore I strongly suggest you start here ( Michael Nielsen's book ).
It is a Python-oriented book with graphical, textual and formula-based explanations, which makes it great for beginners. I am confident that you will find this book useful for your understanding. Look at chapters 2 and 3 to address your problems.
Addressing your question about the Sigmoids, it is possible to use it for multiclass predictions, but not recommended. Consider the following facts.
Sigmoids are activation functions of the form 1/(1+exp(-z)), where z is the dot product of the previous hidden layer (or the inputs) and a row of the weight matrix, plus a bias (reminder: z = w_i . x + b, where w_i is the i-th row of the weight matrix). This activation is independent of the other rows of the matrix.
Classification tasks are about categories. Without prior knowledge, and most of the time even with it, categories have no order-value interpretation; predicting apple instead of orange is no worse than predicting banana instead of nuts. Therefore, one-hot encoding for categories usually performs better than predicting a category number using a single activation function.
To recap, we want an output layer with as many neurons as there are categories, and sigmoids are independent of each other given the previous layer's values. We would also like to predict the most probable category, which implies that we want the activations of the output layer to be interpretable as a probability distribution. But sigmoid outputs are not guaranteed to sum to 1, while softmax outputs are.
Using an L2 loss function is also problematic because of the vanishing gradient issue. In short, the derivative of the loss with respect to z is (sigmoid(z)-y) . sigmoid'(z) (error times the derivative), and the sigmoid'(z) factor makes this quantity small, even more so when the sigmoid is close to saturation. You can choose cross-entropy, also known as log loss, instead.
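The vanishing gradient point can be illustrated with a few lines of plain Python. This is a sketch for a single sigmoid output; it assumes the squared error is written as 0.5*(a - y)^2, and the function names are made up.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def grad_squared_error(z, y):
    # d/dz of 0.5 * (sigmoid(z) - y)**2 = (a - y) * sigmoid'(z),
    # with sigmoid'(z) = a * (1 - a).
    a = sigmoid(z)
    return (a - y) * a * (1.0 - a)


def grad_cross_entropy(z, y):
    # d/dz of -(y*log(a) + (1-y)*log(1-a)); the sigmoid'(z) factor cancels.
    return sigmoid(z) - y


if __name__ == "__main__":
    # A saturated and badly wrong prediction: target is 1, output is near 0.
    z, y = -6.0, 1.0
    print("squared error gradient:", grad_squared_error(z, y))
    print("cross-entropy gradient:", grad_cross_entropy(z, y))
```

With squared error the gradient is close to zero although the prediction is almost completely wrong; with cross-entropy it stays close to -1, so learning does not stall.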
EDIT:
Corrected phrasing about ordering the categories. To clarify, classification is a general term covering many tasks in which we make categorical predictions over finite sets of values. As of today, using softmax in deep models to predict these categories in a general "dog/cat/horse" classifier, together with one-hot encoding and cross-entropy, is very common practice. It is reasonable to do so if the assumptions above hold. However, there are (many) cases where they don't, for instance when trying to balance the data. For some tasks, e.g. semantic segmentation, categories can have a meaningful ordering or distance between them (or their embeddings). So please choose the tools for your applications wisely, understanding what they're doing mathematically and what their implications are.
What you ask is a very broad question.
As far as I know, when the number of classes is 2, the softmax function is the same as the sigmoid, so yes, they are related. Cross-entropy may be the best loss function.
For backpropagation, it is not easy to find the formula... there are many ways. With the help of CUDA, I don't think it is necessary to spend much time on it if you just want to use NNs or CNNs in the future. Trying a framework like TensorFlow or Keras (highly recommended for beginners) may help you.
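For the from-scratch case in the question, one well-known result is worth stating: with a softmax output and cross-entropy loss on a one-hot target, the gradient with respect to the logits is simply the predicted probabilities minus the one-hot target. A minimal sketch in plain Python, with a finite-difference check and hypothetical function names:

```python
import math


def softmax(logits):
    # Subtract the maximum for numerical stability.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target_index):
    # Loss for a one-hot target whose single 1 is at target_index.
    return -math.log(softmax(logits)[target_index])


def cross_entropy_grad(logits, target_index):
    # Analytic gradient with respect to the logits: softmax(z) - one_hot(target).
    probs = softmax(logits)
    return [p - (1.0 if i == target_index else 0.0) for i, p in enumerate(probs)]


def numerical_grad(logits, target_index, eps=1e-6):
    # Central finite differences, used only to verify the analytic gradient.
    grads = []
    for i in range(len(logits)):
        up = list(logits)
        down = list(logits)
        up[i] += eps
        down[i] -= eps
        diff = cross_entropy(up, target_index) - cross_entropy(down, target_index)
        grads.append(diff / (2 * eps))
    return grads


if __name__ == "__main__":
    z = [1.0, -0.5, 2.0, 0.3]
    print(cross_entropy_grad(z, 2))
    print(numerical_grad(z, 2))
```

This vector is what gets propagated back from the output layer; the rest of backprop is the same as with sigmoid outputs.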
There are also many other factors, like the gradient descent method, the setting of hyperparameters...
Like I said, the topic is very broad. Why not try the machine learning/deep learning courses on Coursera or the Stanford online courses?
I am currently trying to use satellite imagery to recognize apple orchards, and I am facing a small problem with the amount of representative data for each class.
My question is:
Is it possible to randomly draw a different set of images from my "not-apples" class at each epoch? I have many more of these than of the "apples" class, and I want to increase the probability that my network rejects an unrepresentative image.
Thanks in advance for your help
That is not possible with a plain model.fit() call on in-memory arrays in Keras. Keras will, by default, shuffle your training data and then train on it in a mini-batch fashion. However, there are still ways to re-balance your dataset.
The imbalanced training data problem that you are facing is pretty common. You have many options available to you; I list a few below:
1. You can adjust the relative weights of your classes using the class_weight keyword of the model.fit() function.
2. You can "up-sample" your "apples" class or "down-sample" your "non-apples" class to have equal numbers of both classes during training.
3. You can generate synthetic images of your "apples" class to augment your data set. To this end, the ImageDataGenerator class in Keras can be particularly useful. This Keras tutorial is a good introduction to its usage.
In my experience, I've found #2 and #3 to be most useful. #1 is limited by the fact that the convergence of stochastic gradient descent suffers when using class weights differing by a couple orders of magnitude and smaller batch sizes.
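Options #1 and #2 can be sketched without any framework. The code below is a minimal illustration in plain Python with hypothetical helper names: the weighting uses the common "balanced" heuristic n_samples / (n_classes * class_count), and the sampler draws a fresh random subset of the majority class each time it is called, for example once per epoch in a custom training loop.

```python
import random
from collections import Counter


def balanced_class_weights(labels):
    # weight(c) = n_samples / (n_classes * count(c)); rare classes get more weight.
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}


def epoch_sample(minority, majority, rng):
    # Down-sample: keep all minority items (label 1) and draw an equally sized
    # random subset of the majority class (label 0), then shuffle the result.
    subset = rng.sample(majority, len(minority))
    batch = [(x, 1) for x in minority] + [(x, 0) for x in subset]
    rng.shuffle(batch)
    return batch


if __name__ == "__main__":
    labels = [0] * 90 + [1] * 10
    print(balanced_class_weights(labels))

    rng = random.Random(0)
    apples = list(range(10))             # stand-ins for "apples" images
    not_apples = list(range(100, 190))   # stand-ins for "non-apples" images
    for epoch in range(3):
        data = epoch_sample(apples, not_apples, rng)
        print(epoch, len(data))
```

The dictionary returned by `balanced_class_weights` maps class indices to weights, which is the form the class_weight argument takes; wiring it into model.fit() is left out of this sketch.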
Jason Brownlee has put together a list of tactics for dealing with imbalanced classes that might also be useful to you.