Neural network classification into hierarchical categories - machine-learning

I am working on a classification problem with progressive (ordered) classes. In other words, there is a hierarchy of categories such that A < B < C, e.g. low, medium, high.
What loss function and activation function for the output layer should I use to take advantage of the class hierarchy?
My ideas are:
1) Assign some value to each category and use one output unit with a sigmoid activation and an RMSE loss function, then map each class to an interval, e.g. 0-0.33 - class A, 0.33-0.66 - class B, 0.66-1 - class C (see the sketch after this list).
It seems to do the trick, but can favor the extreme categories over the middle ones.
2) Use K softmax output units, integer labels instead of one-hot encoded ones, and the sparse categorical crossentropy loss function.
In this case I am not sure how exactly sparse categorical crossentropy works and whether it really takes the hierarchy into account.
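For reference, a minimal sketch of idea 1 in PyTorch (hypothetical layer sizes and n_features; assuming 3 ordered classes mapped to regression targets 0.0 / 0.5 / 1.0, with MSE standing in for the regression loss):
import torch
import torch.nn as nn

# Hypothetical sketch of idea 1: a single sigmoid output regressed onto
# 0.0 / 0.5 / 1.0 targets, then bucketed back into the three intervals.
n_features = 10                                  # assumed input dimensionality
model = nn.Sequential(
    nn.Linear(n_features, 32), nn.ReLU(),
    nn.Linear(32, 1), nn.Sigmoid(),
)
criterion = nn.MSELoss()
class_to_target = {0: 0.0, 1: 0.5, 2: 1.0}       # A, B, C

def predict_class(x):
    # Map the sigmoid output to A/B/C via the 0-0.33 / 0.33-0.66 / 0.66-1 intervals.
    with torch.no_grad():
        p = model(x).squeeze(-1)
    return torch.bucketize(p, torch.tensor([0.33, 0.66]))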

Related

Cross Entropy function implemented with Ground Truth probability vs Ground Truth one-hot encoded vector

Hi, I came across documentation in PyTorch that implements the cross-entropy loss function in two ways:
import torch
import torch.nn as nn

# Example of target with class indices
loss = nn.CrossEntropyLoss()
input = torch.randn(3, 5, requires_grad=True)
target = torch.empty(3, dtype=torch.long).random_(5)
output = loss(input, target)
output.backward()
# Example of target with class probabilities
input = torch.randn(3, 5, requires_grad=True)
target = torch.randn(3, 5).softmax(dim=1)
output = loss(input, target)
output.backward()
One method uses a probability vector as the target, and the other effectively uses a one-hot vector (via class indices). To me, the implementation with class probabilities is closer to the definition of the loss function, but in most places I have seen the other method. Can someone clarify the difference between these methods?
Thanks
I think there are two things worth explaining. From what I understand, you are asking about the following:
the difference between soft and hard labels,
and the difference between the one-hot and dense (index) encodings of hard labels.
Just to be clear, a probability distribution describes how probability is distributed over the values of a random variable (here, the different classes of your classification task). A soft label represents the distribution itself, while a hard label represents the value of the random variable that has the highest probability (i.e. the most probable class).
If you're applying the loss function against a target that represents a probability distribution (its values are non-negative and sum to one), then this would most likely correspond to a pseudo-label. In practice, this means you are applying a soft cross-entropy loss and supervising the whole distribution explicitly.
Now, hard labels are what you can expect to have when using ground-truth annotations. You can represent a label in one of two ways (what is called the encoding of the label):
either with a one-hot encoding vector, all 0s and a single 1 for the true class,
or a dense representation, which is the index of the true class.
With hard labels, only the probability of the true class (the class corresponding to the ground-truth information) is explicitly supervised: you are pushing to maximize the probability mass it gets predicted with, which implicitly minimizes the mass of all other labels, since the total mass is finite (i.e. equal to 1).
In PyTorch, the utility provided by nn.CrossEntropyLoss traditionally expects dense (index) labels for the target, although recent versions also accept class probabilities, as in your second example. TensorFlow's implementation, on the other hand, lets you provide targets as one-hot encodings. This lets you apply the function not only with one-hot encodings (as intended for classical classification tasks), but also with soft targets...
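As a small sanity check (a sketch, assuming a PyTorch version recent enough that nn.CrossEntropyLoss also accepts probability targets), a hard label passed as a class index and the same label passed as a one-hot probability vector give the same loss value:
import torch
import torch.nn as nn
import torch.nn.functional as F

loss = nn.CrossEntropyLoss()
logits = torch.randn(4, 3)
idx_target = torch.tensor([0, 2, 1, 2])            # dense (index) encoding
onehot_target = F.one_hot(idx_target, 3).float()   # one-hot encoding of the same hard labels

print(loss(logits, idx_target))     # target as class indices
print(loss(logits, onehot_target))  # target as class probabilities -> same value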

How do classifiers classify?

After training any classifier, the classifier reports the probability of a data point belonging to each class:
y_pred = clf.predict_proba(test_point)
Does the classifier predict the class with the maximum probability, or does it treat the probabilities as a distribution and draw from it?
In other words, suppose the output probabilities are:
C1: 0.1, C2: 0.2, C3: 0.7
Will the output always be C3, or only 70% of the time?
When clf predicts, it does not sample from the class probabilities. It runs the fully connected layer to get an array of shape [num_items, num_classes], and then you take the max (argmax) over the class dimension to get each item's class.
By the way, during training clf uses softmax to get the probability of each class, which is smoother to optimize; you can find some documentation about softmax if you are interested in the training process.
How to go from class probability scores to a class is often called the 'decision function', and is often considered separate from the classifier itself. In scikit-learn, many estimators have a default decision rule accessible via predict(); for multi-class problems this generally just returns the class with the largest score (an argmax).
However, this may be extended in various ways, depending on needs. For instance, if wrongly predicting one of the classes is very costly, then one might weight those probabilities down (class weighting). Or one can have a decision function that only outputs a class if the confidence is high, and otherwise returns an error or a fallback class.
One can also have multi-label classification, where the output is not a single class but a list of classes, e.g. [0.6, 0.1, 0.7, 0.2] -> (class0, class2). These can use a common threshold or a per-class threshold; this is common in tagging problems.
But in almost all cases the decision function is a deterministic function, not a probabilistic one.
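To illustrate (a hypothetical sketch with made-up probabilities and thresholds), these decision rules applied on top of the same probability vector might look like:
import numpy as np

proba = np.array([0.1, 0.2, 0.7])                  # e.g. predict_proba output for one sample

# Default single-label rule: a deterministic argmax (always C3 here, not 70% of the time).
pred = int(np.argmax(proba))

# Confidence-gated rule: return a fallback value (-1) unless the top probability is high enough.
pred_gated = int(np.argmax(proba)) if proba.max() >= 0.8 else -1

# Multi-label rule: every class whose probability exceeds a (per-class) threshold.
thresholds = np.array([0.5, 0.5, 0.5])
multi_pred = np.flatnonzero(proba >= thresholds)   # -> array([2])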

Multilabel classification neural network, any one label

I am trying to figure out how to build a neural network in which, let's say, I have 3 output labels (A, B, C).
Now my data consists of rows in which 2 of the labels can be 1, e.g. A and B will be 1 and C will be 0. I want to train my neural network so that it predicts A or B; I don't want it to be trained to assign high probability to both A and B (like in multilabel problems), I want only one of them.
The reason for this is that the rows having 1 in both A and B are more like don't-care rows, in which predicting either A or B is correct. So I don't want the neural network to find a minimum where it tries to predict both A and B.
Is it possible to train neural network like this?
I think using a weight is the best way I can think of for your application.
Define a weight w for each sample such that w = 0 if A = 1 and B = 1, else w = 1. Now, define your loss function as:
w * (CE(A) + CE(B)) + w' * min(CE(A), CE(B)) + CE(C)
where CE(A) is the cross-entropy loss over label A and w' is the complement of w. The loss function is quite simple to understand: it will try to predict both A and B correctly when A and B are not both 1; otherwise, it will predict either A or B correctly. Remember, which one of A and B will be predicted correctly cannot be known in advance, and it may not be consistent across batches. The model will always try to predict class C correctly.
If you are using your own weights to indicate sample importance, then you should multiply the entire expression above by that weight.
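A rough PyTorch sketch of this loss (hypothetical function name and shapes: logits and targets of shape [batch, 3] with columns ordered A, B, C, and per-label binary cross-entropy assumed as the form of CE):
import torch
import torch.nn.functional as F

def dont_care_loss(logits, targets):
    # Per-sample, per-label binary cross-entropy.
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    ce_a, ce_b, ce_c = ce[:, 0], ce[:, 1], ce[:, 2]

    # w = 0 on the "don't care" rows where both A and B are 1, else w = 1.
    w = 1.0 - targets[:, 0] * targets[:, 1]

    # w * (CE(A) + CE(B)) + w' * min(CE(A), CE(B)) + CE(C)
    per_sample = w * (ce_a + ce_b) + (1.0 - w) * torch.minimum(ce_a, ce_b) + ce_c
    return per_sample.mean()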
However, I wouldn't be surprised if you got similar (or even better) performance with the classic multi-label loss function. Assuming an equal proportion of each label combination, only in 1/8th of cases are you allowing your network to predict either A or B; otherwise, the network has to predict all three of them correctly. Usually, the simpler loss functions work better.
TL;DR:
a typical network will give you a probability for each class;
how you interpret it is up to you;
if you get equal values in a single-label scenario, it means both labels are equally likely.
The typical implementation of a multi-class classifier with neural networks uses a softmax layer, with one output per class.
If you want a single-label classifier, you treat the output with the maximum value as the selected label.
The actual value of this output compared to the others is a measure of the confidence in this value.
In case of equality, it means that both outputs are equally likely.

Can a machine learning model provide information about mean and standard deviation of data on which it was trained?

Consider a parametric binary classifier (such as Logistic Regression, SVM, etc.) trained on a dataset (say containing two features, e.g. blood pressure and cholesterol level). The dataset is thrown away and the trained model can only be used as a black box (no tweaks can be made and no inside information can be gathered from the trained model). Only a set of data points can be provided and their labels predicted.
Is it possible to get information about the mean and/or standard deviation and/or range of the features of the dataset on which this model was trained? If yes, how? If no, why not?
Thank you for your response! :)
An SVM does not provide any information about the data statistics: it is a maximum-margin classifier that finds the best separating hyperplane between the two classes in feature space, as a linear combination of "support vectors". If you use kernel functions, then this combination lives in the kernel space; it is not even in the original feature space. SVM does not have a straightforward probabilistic interpretation whatsoever.
Logistic regression is a discriminative classifier and models the conditional probability p(y|x,w), where y is your label, x is your data and w are the weights. After maximum-likelihood training you are left with w, which is again a discriminator (hyperplane) in feature space, so you cannot recover the feature statistics from it.
The following can be considered. Use a Gaussian (generative) classifier. Assume that your class is produced by the prior class probability p(y), and that a class-conditional density p(x|y,w) produces your data. Then by Bayes' rule you have p(y|x,w) = p(y)p(x|y,w)/p(x). If you define the class-conditional density p(x|y,w) as Gaussian, its parameter set w will consist of the mean vector m and covariance matrix C of x, assuming it is produced by class y. But remember that this works only under the assumption that the current data vector belongs to a specific class. Conditioned on w, a better option for the mean vector would be E[x|w], the expectation of x with respect to p(x|w). It comes down to a weighted average of the mean vectors for the classes y=0 and y=1, weighted by their prior class probabilities. The same should work for the covariance as well, but it needs to be derived properly; I am not 100% sure right now.
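To make the generative case concrete, here is a small sketch (hypothetical data; it uses scikit-learn's GaussianNB, whose fitted theta_ and class_prior_ attributes expose the per-class means and priors described above):
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Fake "blood pressure, cholesterol" data, purely for illustration.
rng = np.random.default_rng(0)
X = rng.normal(loc=[120.0, 200.0], scale=[15.0, 30.0], size=(500, 2))
y = (X[:, 0] > 120).astype(int)

clf = GaussianNB().fit(X, y)

# E[x] = sum_y p(y) * E[x|y]: the prior-weighted average of the class means
# recovers the overall training mean, because the model is generative.
recovered_mean = (clf.class_prior_[:, None] * clf.theta_).sum(axis=0)
print(recovered_mean)      # approximately equal to the line below
print(X.mean(axis=0))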

SVM in OpenCV: how to handle unbalanced data?

For example, I have a classification problem with 2 classes, but the data is skewed (data samples of the different classes are in proportion 1:10).
How can I handle unbalanced data using SVM?
I found no parameter for weights of the different classes (does OpenCV have no parameter for this?).
There is a class_weights parameter in CvSVMParams::CvSVMParams:
class_weights – Optional weights in the C_SVC problem, assigned to particular classes. They are multiplied by C, so the parameter C of class #i becomes class_weights_i * C. Thus these weights affect the misclassification penalty for different classes. The larger the weight, the larger the penalty on misclassification of data from the corresponding class.
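For reference, a hypothetical sketch using the newer cv2.ml Python API (the old CvSVMParams C API quoted above exposes the same idea through its class_weights field; the data and the 1:10 weighting below are just illustrative):
import numpy as np
import cv2

svm = cv2.ml.SVM_create()
svm.setType(cv2.ml.SVM_C_SVC)
svm.setKernel(cv2.ml.SVM_LINEAR)
svm.setC(1.0)
# Give the rare class (label 1) a 10x larger misclassification penalty.
svm.setClassWeights(np.array([1.0, 10.0]))

# samples: float32 features, labels: int32 classes (0 = majority, 1 = minority).
samples = np.random.randn(110, 2).astype(np.float32)
labels = np.array([0] * 100 + [1] * 10, dtype=np.int32)
svm.train(samples, cv2.ml.ROW_SAMPLE, labels)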
