Is there a way to train a neural network to output two values: the forecast and its probability?
Example: say we want to predict the next value of the time series 1, 2, 3, 4, 5, ?. We want two numbers: the forecast itself and its probability, i.e. how sure the neural net is about its prediction (in this case the two numbers could be 6 and 90%).
Do you know if that's possible? Can you point to any docs or examples of a neural net that does this or something similar?
Note: the predicted value is not categorical / class, it's a number.
I'm not sure which library you are using to train neural networks, but most of them can produce both the predicted probability and the predicted class.
Take Keras as an example. In older versions (both methods were removed in TensorFlow 2.6), a trained Sequential model has the following two methods:
(1) predict_proba: predict the output class probabilities
(2) predict_classes: predict the output class index
Usually (2) is (1) with a threshold value of 0.5. You can call both of them to get the two output predictions or you can call (1) and then set custom thresholds as well.
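As a minimal sketch of the thresholding idea, assuming a binary classifier whose positive-class probabilities are already available as a plain list. The `probs` values and the `to_classes` helper are hypothetical and not part of Keras:

```python
def to_classes(probs, threshold=0.5):
    """Turn positive-class probabilities into 0/1 class indices."""
    return [1 if p >= threshold else 0 for p in probs]

# Hypothetical predicted probabilities for four inputs.
probs = [0.10, 0.45, 0.55, 0.90]

default_classes = to_classes(probs)       # default threshold of 0.5
strict_classes = to_classes(probs, 0.8)   # custom, more conservative threshold

print(default_classes)
print(strict_classes)
```

Raising the threshold trades recall for precision on the positive class; the probabilities themselves are unchanged.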
Related
After training, a classifier can report the probability that a data point belongs to each class.
y_pred = clf.predict_proba(test_point)
Does the classifier predict the class with the maximum probability, or does it treat the probabilities as a distribution and draw a class according to that distribution?
In other words, suppose the output probabilities are:
C1 - 0.1, C2 - 0.2, C3 - 0.7
Will the output always be C3, or only 70% of the time?
When clf predicts, it does not need to calculate the probability of each class. It uses the fully connected layer to get an array of shape [num_items, num_classes], and then takes the maximum along axis 1 to get each item's class.
By the way, during training the classifier uses softmax to get the probability of each class, which is smoother to optimize. You can find documentation on softmax if you are interested in the training process.
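A small sketch of why the softmax step can be skipped at prediction time: softmax preserves the ordering of the raw scores, so the largest raw score and the largest probability point at the same class. The scores below are made-up values for C1, C2 and C3, and `softmax` is written out by hand rather than taken from any library:

```python
import math

def softmax(scores):
    """Map raw scores to positive values that sum to one."""
    shift = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])

scores = [1.0, 2.0, 4.0]   # hypothetical raw outputs for C1, C2, C3
probs = softmax(scores)

class_from_scores = argmax(scores)
class_from_probs = argmax(probs)
print(class_from_scores, class_from_probs)
```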
How to go from class probability scores to a class is often called the 'decision function', and is often considered separate from the classifier itself. In scikit-learn, many estimators have a default decision function accessible via predict(); for multi-class problems this generally just returns the class with the largest value (the argmax).
However, this may be extended in various ways, depending on needs. For instance, if a wrong prediction of one of the classes is very costly, one might weight that class's probability down (class weighting). Or one can have a decision function that only gives a class as output if the confidence is high, and otherwise returns an error or a fallback class.
One can also have multi-label classification, where the output is not a single class but a list of classes: [ 0.6, 0.1, 0.7, 0.2 ] -> (class0, class2). These can use a common threshold or a per-class threshold. This is common in tagging problems.
But in almost all cases the decision function is a deterministic function, not a probabilistic one.
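The three kinds of decision function described above can be sketched as plain deterministic functions. The function names, the 0.6 confidence cut-off and the -1 fallback value are illustrative assumptions, not scikit-learn API:

```python
def argmax_decision(probs):
    """Default rule: pick the class with the largest score."""
    return max(range(len(probs)), key=lambda i: probs[i])

def reject_decision(probs, min_confidence=0.6, fallback=-1):
    """Only commit to a class if its score is high enough."""
    best = argmax_decision(probs)
    return best if probs[best] >= min_confidence else fallback

def multilabel_decision(scores, threshold=0.5):
    """Multi-label rule: return every class whose score passes the threshold."""
    return [i for i, s in enumerate(scores) if s >= threshold]

print(argmax_decision([0.1, 0.2, 0.7]))             # 2, every time
print(reject_decision([0.30, 0.40, 0.20, 0.10]))    # -1 (not confident enough)
print(multilabel_decision([0.6, 0.1, 0.7, 0.2]))    # [0, 2]
```

None of these rules samples from the distribution, so the same scores always give the same output.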
If there are 4 classes and the output probabilities from the model are A=0.30, B=0.40, C=0.20, D=0.10, can I say that the output from the model is class B with 40% confidence? If not, why?
Although a softmax activation will ensure that the outputs superficially satisfy the Kolmogorov axioms (the probabilities always sum to one, and no probability is below zero or above one) and the individual values can be seen as a measure of the network's confidence, you would need to calibrate the model (train it not as a classifier but rather as a probability predictor) or use a Bayesian network before you could formally claim that the output values are your per-class prediction confidences. (https://arxiv.org/pdf/1706.04599.pdf)
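One post-hoc calibration method discussed in the linked paper is temperature scaling: the logits are divided by a scalar T before the softmax. The sketch below only shows the effect of a given T. The logits and T = 2.0 are made-up values; in practice T would be fitted on a held-out validation set, which is not shown here:

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over logits divided by a temperature T > 0."""
    scaled = [z / temperature for z in logits]
    shift = max(scaled)  # for numerical stability
    exps = [math.exp(z - shift) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])

logits = [2.0, 3.0, 1.0, 0.0]                  # hypothetical logits for A, B, C, D
raw = softmax(logits)                          # uncalibrated probabilities
softened = softmax(logits, temperature=2.0)    # T > 1 lowers the top confidence

print(argmax(raw), round(max(raw), 3))
print(argmax(softened), round(max(softened), 3))
```

The predicted class is the same in both cases; only the reported confidence changes.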
Suppose I want to use a multilayer perceptron to classify 3 classes. When it comes to the number of output neurons, anybody would instantly say: use 3 output neurons with softmax activation. But what if I use 2 output neurons with sigmoid activations to output [0,0] for class 1, [0,1] for class 2 and [1,0] for class 3? That is, a binary-encoded output, with each bit produced by one output neuron. Wouldn't this technique decrease the number of output neurons (and hence the number of parameters) by a lot? A 100-class word classification for a simple NLP application would require 100 output neurons with softmax, whereas you can cover it with 7 output neurons using the above technique. One disadvantage is that you won't get probability scores for all the classes. My question is: is this approach correct? If so, would you consider it more efficient than softmax for datasets with a large number of classes?
You could do this, but then you would have to rethink your loss function. The cross-entropy loss used in training a model for classification is the negative log-likelihood of a categorical distribution, which assumes you have a probability associated with every class. The loss function requires 3 output probabilities and you only have 2 output values.
However, there are ways to do it anyway: you could use a binary cross-entropy loss on each element of your output, but this would be a different probabilistic assumption about your model. You'd be assuming that your classes have some shared characteristics: for example, [0,0] and [0,1] share a value in the first bit. The decreased degrees of freedom are probably going to give you marginally worse performance (but other parts of the MLP may pick up the slack).
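To make the binary-coded output concrete, here is a small sketch of the encoding and decoding steps around such a network. The helper names are hypothetical, class indices are 0-based (so class 3 in the question is index 2), and the 0.5 cut-off on the sigmoid outputs is an assumption:

```python
def num_output_bits(num_classes):
    """Number of sigmoid outputs needed to binary-code num_classes classes."""
    return max(1, (num_classes - 1).bit_length())

def encode(class_index, bits):
    """Binary target vector for a class, most significant bit first."""
    return [(class_index >> b) & 1 for b in reversed(range(bits))]

def decode(outputs, threshold=0.5):
    """Threshold each sigmoid output and read the bits back as a class index."""
    index = 0
    for o in outputs:
        index = (index << 1) | (1 if o >= threshold else 0)
    return index

print(num_output_bits(3), num_output_bits(100))   # 2 and 7 outputs
print(encode(2, 2))                               # [1, 0]
print(decode([0.9, 0.2]))                         # 2
```

Note that with 3 classes and 2 bits the code [1, 1] decodes to an index that is not a valid class, which is one more thing a softmax output never has to handle.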
If you're really worried about the parameter cost of the final layer, then you might be better just not training it at all. This paper shows a fixed Hadamard matrix on the final layer is as good as training it.
I'm trying to solve a text classification problem for academic purposes. I need to classify tweets into labels like "cloud", "cold", "dry", "hot", "humid", "hurricane", "ice", "rain", "snow", "storms", "wind" and "other". Each tweet in the training data has a probability for every label. Say the message "Can already tell it's going to be a tough scoring day. It's as windy right now as it was yesterday afternoon." has a 21% chance of being hot and a 79% chance of being wind. I have worked on classification problems that predict whether a tweet is wind or hot or other. But in this problem, each training example has probabilities for all the labels. I have previously used the Mahout naive Bayes classifier, which takes a single label for a given text to build a model. How can I feed these per-label probabilities as input to a classifier?
In a probabilistic setting, these probabilities reflect uncertainty about the class label of your training instance. This affects parameter learning in your classifier.
There's a natural way to incorporate this: in Naive Bayes, for instance, when estimating the parameters of your model, instead of each word getting a count of one for the class to which the document belongs, it gets a count equal to the document's probability for that class. Thus documents with a high probability of belonging to a class contribute more to that class's parameters. The situation is exactly equivalent to learning a mixture of multinomials model using EM, where the probabilities you have are identical to the membership/indicator variables for your instances.
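A rough sketch of the fractional counting step, assuming each training document is a token list paired with a label-to-probability dictionary. The function name, the toy documents and the tokenisation are illustrative assumptions; smoothing and the class priors are left out:

```python
from collections import defaultdict

def soft_word_counts(documents):
    """Accumulate per-class word counts, weighting each word by the
    document's probability of belonging to that class."""
    counts = defaultdict(lambda: defaultdict(float))
    for tokens, label_probs in documents:
        for label, p in label_probs.items():
            for tok in tokens:
                counts[label][tok] += p   # fractional count instead of 1
    return counts

# Toy training set with soft labels (hypothetical values).
docs = [
    (["windy", "tough", "day"], {"hot": 0.21, "wind": 0.79}),
    (["windy", "cold"], {"wind": 0.5, "cold": 0.5}),
]

counts = soft_word_counts(docs)
print(counts["wind"]["windy"])   # contributions of 0.79 and 0.5
print(counts["hot"]["windy"])    # contribution of 0.21 only
```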
Alternatively, if your classifier were a neural net with softmax output, instead of the target output being a vector with a single [1] and lots of zeros, the target output becomes the probability vector you're supplied with.
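For the softmax case, the usual cross-entropy loss already accepts a probability vector as its target; a one-hot target is just a special case. A minimal sketch, with a hand-written loss and made-up predictions over three labels (hot, wind, other):

```python
import math

def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy between a target distribution and predicted probabilities."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(target, predicted))

soft_target = [0.21, 0.79, 0.0]    # supplied label probabilities: hot, wind, other

close_pred = [0.25, 0.70, 0.05]    # hypothetical prediction near the target
far_pred = [0.70, 0.25, 0.05]      # hypothetical prediction far from the target

loss_close = cross_entropy(soft_target, close_pred)
loss_far = cross_entropy(soft_target, far_pred)
print(loss_close < loss_far)
```

The loss is smallest when the predicted distribution matches the supplied probabilities, so training pulls the network towards reproducing them rather than towards a single hard label.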
I don't, unfortunately, know of any standard implementations that would allow you to incorporate these ideas.
If you want an off-the-shelf solution, you could use a learner that supports multiclass classification and instance weights. Let's say you have k classes with probabilities p_1, ..., p_k. For each input instance, create k new training instances with identical features and with labels 1, ..., k, and assign them weights p_1, ..., p_k respectively.
Vowpal Wabbit is one such learner that supports multiclass classification with instance weights.
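The expansion step can be sketched independently of any particular learner. The `expand_soft_labels` helper and the (features, label, weight) tuple layout are assumptions for illustration; how the weights are actually passed to the learner depends on the tool and is not shown:

```python
def expand_soft_labels(instances):
    """Turn each (features, {label: probability}) pair into one weighted
    hard-labelled instance per label."""
    expanded = []
    for features, label_probs in instances:
        for label, p in label_probs.items():
            if p > 0:   # zero-weight copies carry no information
                expanded.append((features, label, p))
    return expanded

instances = [
    ("tweet text A", {"hot": 0.21, "wind": 0.79, "other": 0.0}),
]

rows = expand_soft_labels(instances)
for row in rows:
    print(row)
```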
For a classification problem, how is the output of the network usually determined?
Say there are three possible classes, each with a numerical identifier. Would a reasonable solution be to sum the outputs and take that sum as the overall output of the network? Or would you take the average of the network's outputs?
There is plenty of information regarding ANN theory, but not much about application. I apologise if this is a silly question.
For a multi-layer perceptron classifier with 3 classes, one typically constructs a network with 3 outputs and trains the network so that (1,0,0) is the target output for the first class, (0,1,0) for the second class, and (0,0,1) for the third class. For classifying a new observation, you typically select the output with the greatest value (e.g., (0.12, 0.56, 0.87) would be classified as class 3).
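The target encoding and the classification rule described above can be sketched as follows. The helper names are hypothetical and classes are numbered from 1 to match the text:

```python
def one_hot(class_number, num_classes):
    """Target vector with a 1 at the position of the given (1-based) class."""
    return tuple(1 if i == class_number - 1 else 0 for i in range(num_classes))

def classify(outputs):
    """Return the (1-based) class whose output is largest."""
    best_index = max(range(len(outputs)), key=lambda i: outputs[i])
    return best_index + 1

targets = [one_hot(c, 3) for c in (1, 2, 3)]
print(targets)                         # [(1, 0, 0), (0, 1, 0), (0, 0, 1)]
print(classify((0.12, 0.56, 0.87)))    # 3
```

The outputs are neither summed nor averaged; only their relative size matters.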
I mostly agree with bogatron, and you will find many posts here advising on this kind of "multi-class classification" with neural networks.
Regarding your heading, I would like to add that you can interpret that output as a probability, since I struggled to find a theoretical foundation for this. In what follows I'll talk about a neural network with 3 neurons in the output layer, each indicating 1 for its respective class.
Since the three target outputs always sum to 1 during training, the trained network's feed-forward outputs will also tend to sum to approximately one (so rather (0.12, 0.36, 0.52) than bogatron's example). You can then interpret these figures as the probability that the respective input belongs to class 1, 2 or 3 (e.g. the probability is 0.52 that it belongs to class 3).
This is true when using the logistic function or the tanh as activation functions.
More on this:
Posterior probability via neural networks: http://www-vis.lbl.gov/~romano/mlgroup/papers/neural-networks-survey.pdf
How to convert the output of an artificial neural network into probabilities?