Deep Learning an Imbalanced data set - machine-learning

I have two data sets that look like this:
DATASET 1
Training (Class 0: 8982, Class 1: 380)
Testing (Class 0: 574, Class 1: 12)
DATASET 2
Training (Class 0: 8982, Class 1: 380)
Testing (Class 0: 574, Class 1: 8)
I am trying to build a deep feedforward neural net in TensorFlow. I get accuracies in the 90s and AUC scores in the 80s. Of course, the data set is heavily imbalanced, so those metrics are useless. My emphasis is on getting a good recall value, and I do not want to oversample Class 1. I have toyed with the complexity of the model to no avail; the best model predicted only 25% of the positive class correctly.
My question is: considering the distribution of these data sets, is it futile to build models without getting more data (I can't get more data), or is there a way to work with data that is this imbalanced?
Thanks!

Question
Can I use TensorFlow to learn imbalanced classification with a ratio of about 30:1?
Answer
Yes, and I have. Specifically, TensorFlow lets you feed in a weight matrix. Look at tf.losses.sigmoid_cross_entropy: it has a weights parameter. You can feed in a matrix that matches Y in shape and, for each value of Y, provides the relative weight that training example should have.
One way to find the correct weights is to start with different balances, run your training, and then look at your confusion matrix and a rundown of precision vs accuracy for each class. Once both classes have about the same precision-to-accuracy ratio, they are balanced.
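To make the "look at your confusion matrix" step concrete, here is a minimal NumPy sketch. It assumes the per-class quantities of interest are precision and recall (the question asks about recall specifically); the function name and the toy labels are made up for illustration.

```python
import numpy as np

def per_class_precision_recall(y_true, y_pred, n_classes=2):
    """Return (confusion, precision, recall); confusion[i, j] counts true class i predicted as j."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    confusion = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        confusion[t, p] += 1
    predicted_totals = confusion.sum(axis=0)  # column sums: how often each class was predicted
    actual_totals = confusion.sum(axis=1)     # row sums: how often each class occurs
    correct = np.diag(confusion)
    # Guard against division by zero when a class is never predicted or never present
    precision = np.divide(correct, predicted_totals,
                          out=np.zeros(n_classes), where=predicted_totals > 0)
    recall = np.divide(correct, actual_totals,
                       out=np.zeros(n_classes), where=actual_totals > 0)
    return confusion, precision, recall

# Toy labels: 8 negatives, 2 positives; the model finds one positive and raises one false alarm
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]
confusion, precision, recall = per_class_precision_recall(y_true, y_pred)
```

Rerunning this after each change of class weights shows whether recall on the rare class is improving and what it costs in precision.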
Example Implementation
Here is an example implementation that converts a Y into a weight matrix; it has performed very well for me:
import numpy as np

def weightMatrix(matrix, most=0.9):
    # Per-column fraction of positives, clipped to the range [1 - most, most]
    b = np.maximum(np.minimum(most, matrix.mean(0)), 1. - most)
    a = 1. / (b * 2.)
    # Positives get weight a; negatives get weight a * b / (1 - b)
    weights = a * (matrix + (1 - matrix) * b / (1 - b))
    return weights
The most parameter represents the largest fractional difference to consider. 0.9 equates to .1:.9 = 1:9, whereas .5 equates to 1:1. Values below .5 don't work.
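As a rough illustration of how such weights enter the loss, here is a NumPy-only sketch. It repeats the weight arithmetic so the block is self-contained and then computes a weighted sigmoid cross-entropy by hand. Normalising by the sum of the weights is my assumption; the reduction TensorFlow applies may differ, so treat this as a sketch of the idea rather than a reimplementation of tf.losses.sigmoid_cross_entropy.

```python
import numpy as np

def weight_matrix(y, most=0.9):
    # Same arithmetic as weightMatrix in the answer above
    b = np.maximum(np.minimum(most, y.mean(0)), 1.0 - most)
    a = 1.0 / (b * 2.0)
    return a * (y + (1 - y) * b / (1 - b))

def weighted_sigmoid_cross_entropy(y, logits, weights):
    # Numerically stable per-example loss: max(x, 0) - x*y + log(1 + exp(-|x|))
    per_example = np.maximum(logits, 0) - logits * y + np.log1p(np.exp(-np.abs(logits)))
    # Assumption: average over the total weight
    return float(np.sum(weights * per_example) / np.sum(weights))

# 96 negatives and 4 positives in a single output column (roughly 24:1)
y = np.concatenate([np.zeros(96), np.ones(4)]).reshape(-1, 1)
w = weight_matrix(y)
logits = np.zeros_like(y)  # an untrained model that outputs p = 0.5 everywhere
loss = weighted_sigmoid_cross_entropy(y, logits, w)
```

With most=0.9 the positive rate of 0.04 is clipped to 0.1, so each positive example ends up weighted 9 times as heavily as a negative one, not 24 times.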

You might be interested to have a look at this question and its answer. Its scope is a priori more restricted than yours, as it addresses specifically weights for classification, but it seems very relevant to your case.
Also, AUC is definitely not irrelevant: it is actually independent of your data imbalance.
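A small sketch of why that is, assuming AUC is read as the probability that a randomly chosen positive scores higher than a randomly chosen negative. The scores below are invented. Repeating the negatives changes the class ratio but not the score distribution within each class, and the AUC stays the same.

```python
import numpy as np

def roc_auc(scores_pos, scores_neg):
    """AUC as the probability that a random positive outscores a random negative (ties count half)."""
    pos = np.asarray(scores_pos, dtype=float)[:, None]
    neg = np.asarray(scores_neg, dtype=float)[None, :]
    return float(np.mean((pos > neg) + 0.5 * (pos == neg)))

scores_pos = [0.9, 0.6, 0.4]
scores_neg = [0.7, 0.5, 0.3, 0.2]

auc_balanced = roc_auc(scores_pos, scores_neg)
# Repeat every negative 30 times: 120 negatives vs 3 positives, same score distribution
auc_imbalanced = roc_auc(scores_pos, np.repeat(scores_neg, 30))
```

The caveat is sample size rather than imbalance as such: with only 12 or 8 positives in the test set, the AUC estimate itself will be noisy.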

Related

How to calculate accuracy score of a random classifier?

Say, for example, a dataset contains 60% instances of the "Yes" class and 40% instances of the "No" class.
In this scenario, the precision and recall of the random classifier are
Precision = 60%
Recall = 50%
Then what will be the accuracy of the random classifier in this scenario?
Some caution is required here, since the very definition of a random classifier is somewhat ambiguous; this is best illustrated in cases of imbalanced data.
For a binary classifier whose predictions are independent of the true class, as is the case for any random classifier, the accuracy is
acc = P(class=0) * P(prediction=0) + P(class=1) * P(prediction=1)
where P stands for probability.
Indeed, if we stick to the intuitive definition of a random binary classifier as giving
P(prediction=0) = P(prediction=1) = 0.5
then the accuracy computed by the above formula is always 0.5, irrespective of the class distribution (i.e. the values of P(class=0) and P(class=1)).
However, in this definition, there is an implicit assumption, i.e. that our classes are balanced, each one consisting of 50% of our dataset.
This assumption (and the corresponding intuition) breaks down in cases of class imbalance: if we have a dataset where, say, 90% of samples are of class 0 (i.e. P(class=0)=0.9), then it doesn't make much sense to use the above definition of a random binary classifier; instead, we should use the percentages of the class distributions themselves as the probabilities of our random classifier, i.e.:
P(prediction=0) = P(class=0) = 0.9
P(prediction=1) = P(class=1) = 0.1
Now, plugging these values into the formula defining the accuracy, we get:
acc = P(class=0) * P(prediction=0) + P(class=1) * P(prediction=1)
= (0.9 * 0.9) + (0.1 * 0.1)
= 0.82
which is nowhere close to the naive value of 0.5...
As I already said, AFAIK there are no clear-cut definitions of a random classifier in the literature. Sometimes the "naive" random classifier (always flip a fair coin) is referred to as a "random guess" classifier, while what I have described is referred to as a "weighted guess" one, but still this is far from being accepted as a standard...
The bottom line here is the following: since the main reason for using a random classifier is as a baseline, it makes sense to do so only in relatively balanced datasets. In your case of a 60-40 balance, the result turns out to be 0.52, which is admittedly not far from the naive one of 0.5; but for highly imbalanced datasets (e.g. 90-10), the usefulness itself of the random classifier as a baseline ceases to exist, since the correct baseline has become "always predict the majority class", which here would give an accuracy of 90%, in contrast to the random classifier accuracy of just 82% (let alone the 50% accuracy of the naive approach)...
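The three baselines discussed above are easy to compare side by side. This is a plain-Python sketch of the formulas in this answer; the function names are mine.

```python
def random_guess_accuracy(p_class1):
    # Fair coin: P(prediction=0) = P(prediction=1) = 0.5
    return (1 - p_class1) * 0.5 + p_class1 * 0.5

def weighted_guess_accuracy(p_class1):
    # Predict each class with its own prior probability
    return (1 - p_class1) ** 2 + p_class1 ** 2

def majority_class_accuracy(p_class1):
    # Always predict the more frequent class
    return max(p_class1, 1 - p_class1)

# Minority-class share of 0.4 (the 60-40 case) and 0.1 (the 90-10 case)
for p in (0.4, 0.1):
    print(p, random_guess_accuracy(p), weighted_guess_accuracy(p), majority_class_accuracy(p))
```

For the 60-40 split this gives 0.5, 0.52 and 0.6 (up to floating-point rounding); for the 90-10 split it gives 0.5, 0.82 and 0.9, which are the figures quoted above.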
As @desertnaut mentioned, if you're after a naïve benchmark for your model, you're always better off using "always predict the majority class" as your benchmark, which achieves an accuracy of %of_samples_in_majority_class (this is at least as good as either a random guess or a weighted guess).
In Deepchecks (a package I maintain) we have a check that automatically compares the performance of your model to a simple model (either weighted random, majority class or simple decision tree).
from deepchecks.checks import SimpleModelComparison
from deepchecks import Dataset
SimpleModelComparison().run(Dataset(train_df, label='target'), Dataset(test_df, label='target'), model)

Can any machine learning algorithm find this pattern: x1 < x2 without generating a new feature (e.g. x1-x2) first?

If I had 2 features x1 and x2 where I know that the pattern is:
if x1 < x2 then
class1
else
class2
Can any machine learning algorithm find such a pattern? What algorithm would that be?
I know that I could create a third feature x3 = x1-x2. Then feature x3 can easily be used by some machine learning algorithms. For example a decision tree can solve the problem 100% using x3 and just 3 nodes (1 decision and 2 leaf nodes).
But, is it possible to solve this without creating new features? This seems like a problem that should be easily solved 100% if a machine learning algorithm could only find such a pattern.
I tried MLP and SVM with different kernels, including svg kernel and the results are not great. As an example of what I tried, here is the scikit-learn code where the SVM could only get a score of 0.992:
import numpy as np
from sklearn.svm import SVC
# Generate 1000 samples with 2 features with random values
X_train = np.random.rand(1000,2)
# Label each sample. If feature "x1" is less than feature "x2" then label as 1, otherwise label is 0.
y_train = X_train[:,0] < X_train[:,1]
y_train = y_train.astype(int) # convert boolean to 0 and 1
svc = SVC(kernel = "rbf", C = 0.9) # tried all kernels and C values from 0.1 to 1.0
svc.fit(X_train, y_train)
print("SVC score: %f" % svc.score(X_train, y_train))
Output running the code:
SVC score: 0.992000
This is an oversimplification of my problem. The real problem may have hundreds of features and different patterns, not just x1 < x2. However, to start with it would help a lot to know how to solve for this simple pattern.
To understand this, you must go into the settings of all the parameters provided by sklearn, and C in particular. It also helps to understand how the value of C influences the classifier's training procedure.
If you look at the equation in the User Guide for SVC, there are two main parts to the equation - the first part tries to find a small set of weights that solves the problem, and the second part tries to minimize the classification errors.
C is the penalty multiplier associated with misclassifications. If you decrease C, then you reduce the penalty (lower training accuracy but better generalization to test) and vice versa.
Try setting C to 1e+6. You will see that you almost always get 100% accuracy. The classifier has learnt the pattern x1 < x2. But it figures that a 99.2% accuracy is enough when you look at another parameter called tol. This controls how much error is negligible for you and by default it is set to 1e-3. If you reduce the tolerance, you can also expect to get similar results.
In general, I would suggest using something like GridSearchCV to find the optimal values of hyperparameters like C, as it internally splits the dataset into train and validation. This helps ensure that you are not just tweaking the hyperparameters to get a good training accuracy, but that the classifier will also do well in practice.
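A minimal sketch of that suggestion on the x1 < x2 toy problem, assuming scikit-learn is available. The grid of C values, the seed and the sample size are arbitrary choices for illustration, not recommendations.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
X = rng.random((500, 2))
y = (X[:, 0] < X[:, 1]).astype(int)  # label 1 when x1 < x2, else 0

# Candidate values for C; the grid itself is an arbitrary illustration
param_grid = {"C": [0.1, 1.0, 100.0, 10000.0]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)

best_C = search.best_params_["C"]
best_cv_accuracy = search.best_score_  # mean accuracy over the 5 validation folds
```

Because best_score_ is measured on held-out folds, it is a fairer number to report than the training score printed in the question.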

Confused about sklearn’s implementation of OSVM

I have recently started experimenting with OneClassSVM (using sklearn) for unsupervised learning, and I followed this example.
I apologize for the silly questions, but I'm a bit confused about two things:
Should I train my SVM on both regular examples and outliers, or on regular examples only?
Which of the labels predicted by the OSVM represents outliers: 1 or -1?
Once again, I apologize for these questions, but for some reason I cannot find this documented anywhere.
As the example you reference is about novelty detection, the docs say:
novelty detection:
The training data is not polluted by outliers, and we are interested in detecting anomalies in new observations.
Meaning: you should train on regular examples only.
The approach is based on:
Schölkopf, Bernhard, et al. "Estimating the support of a high-dimensional distribution." Neural computation 13.7 (2001): 1443-1471.
Extract:
Suppose you are given some data set drawn from an underlying probability distribution P and you want to estimate a “simple” subset S of input space such that the probability that a test point drawn from P lies outside of S equals some a priori specified value between 0 and 1.
We propose a method to approach this problem by trying to estimate a function f that is positive on S and negative on the complement.
The above docs also say:
Inliers are labeled 1, while outliers are labeled -1.
This can also be seen in your example code, extracted:
# Generate some regular novel observations
X = 0.3 * np.random.randn(20, 2)
X_test = np.r_[X + 2, X - 2]
...
# all regular = inliers (defined above)
y_pred_test = clf.predict(X_test)
...
# -1 = outlier <-> error as assumed to be inlier
n_error_test = y_pred_test[y_pred_test == -1].size
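Both points can be checked with a few lines, assuming scikit-learn is installed. The data, nu and gamma below are illustrative values I picked, not tuned ones: the model is fitted on regular observations only, and points far from the training cloud come back labelled -1.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train = 0.3 * rng.standard_normal((200, 2))              # regular observations only
X_far = np.array([[4.0, 4.0], [-4.0, 4.0], [4.0, -4.0]])   # points far from the training cloud

# nu and gamma are illustrative values, not tuned
clf = OneClassSVM(kernel="rbf", nu=0.1, gamma=0.5)
clf.fit(X_train)

pred_train = clf.predict(X_train)  # mostly +1 (inliers)
pred_far = clf.predict(X_far)      # -1 (outliers)
```

A fraction of the training points is still labelled -1; roughly speaking, nu bounds how large that fraction can be.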

Imbalanced data and sample size for large multi-class NLP classification

I'm working on an NLP project where I hope to use MaxEnt to categorize text into one of 20 different classes. I'm creating the training, validation and test sets by hand from administrative data that is hand written.
I would like to determine the sample size required for the classes in the training set and the appropriate size of the validation/testing set.
In the real world, the 20 outcomes are imbalanced. But I'm considering creating a balanced training set to help build the model.
So I have two questions:
How should I determine the appropriate sample size for each category in the training set?
Should the validation/testing sets be imbalanced to reflect the conditions the model might encounter if faced with real-world data?
In order to determine the sample size of your test set you could use Hoeffding's inequality.
Let E be the positive tolerance value and N the sample size of the data set.
Then we can compute Hoeffding's inequality, p = 1 - ( 2 * EXP( -2 * ( E^2 ) * N) ).
Let E = 0.05 (±5%) and N = 750, then p = 0.9530. This means that, with a certainty of at least 95.3%, the error measured on your test set will deviate from the out-of-sample error by no more than 5 percentage points.
As for the sample size of the training and validation sets, there is an established convention to split the data as follows: 50% for training, and 25% each for validation and testing. The optimal size of those sets depends a lot on the training set and the amount of noise in the data. For further information have a look at "Model Assessment and Selection" in "Elements of Statistical Learning".
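The bound is easy to evaluate, and to invert when you want a target confidence instead. This is a short sketch using only the standard library; the function names are mine, and delta denotes the allowed failure probability, 1 - p.

```python
import math

def hoeffding_confidence(eps, n):
    # Lower bound on P(|test error - true error| <= eps) for n independent test samples
    return 1.0 - 2.0 * math.exp(-2.0 * eps ** 2 * n)

def hoeffding_sample_size(eps, delta):
    # Smallest n for which the bound guarantees confidence of at least 1 - delta
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))

p = hoeffding_confidence(0.05, 750)           # the worked example: E = 0.05, N = 750
n_needed = hoeffding_sample_size(0.05, 0.05)  # test-set size for at least 95% confidence
```

Here p comes out at about 0.9530, matching the figure above, and n_needed is 738. The bound assumes independent samples and is distribution-free, so it tends to be conservative.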
As for your other question regarding imbalanced datasets have a look at this thread: https://stats.stackexchange.com/questions/6254/balanced-sampling-for-network-training

Probability and Neural Networks

Is it a good practice to use sigmoid or tanh output layers in Neural networks directly to estimate probabilities?
i.e the probability of given input to occur is the output of sigmoid function in the NN
EDIT
I wanted to use neural network to learn and predict the probability of a given input to occur..
You may consider the input as State1-Action-State2 tuple.
Hence the output of NN is the probability that State2 happens when applying Action on State1..
I Hope that does clear things..
EDIT
When training NN, I do random Action on State1 and observe resultant State2; then teach NN that input State1-Action-State2 should result in output 1.0
First, just a couple of small points on the conventional MLP lexicon (might help for internet searches, etc.): 'sigmoid' and 'tanh' are not 'output layers' but functions, usually referred to as "activation functions". The return value of the activation function is indeed the output from each layer, but they are not the output layer themselves (nor do they calculate probabilities).
Additionally, your question recites a choice between two "alternatives" ("sigmoid and tanh"), but they are not actually alternatives, rather the term 'sigmoidal function' is a generic/informal term for a class of functions, which includes the hyperbolic tangent ('tanh') that you refer to.
The term 'sigmoidal' is probably due to the characteristic shape of the function--the return (y) values are constrained between two asymptotic values regardless of the x value. The function output is usually normalized so that these two values are -1 and 1 (or 0 and 1). (This output behavior, by the way, is obviously inspired by the biological neuron which either fires (+1) or it doesn't (-1)). A look at the key properties of sigmoidal functions and you can see why they are ideally suited as activation functions in feed-forward, backpropagating neural networks: (i) real-valued and differentiable, (ii) having exactly one inflection point, and (iii) having a pair of horizontal asymptotes.
In turn, the sigmoidal function is one category of functions used as the activation function (aka "squashing function") in FF neural networks solved using backprop. During training or prediction, the weighted sum of the inputs (for a given layer, one layer at a time) is passed in as an argument to the activation function, which returns the output for that layer. Another group of functions apparently used as the activation function is the piecewise linear function (PLF). The step function is the binary variant of a PLF:
def step_fn(x):
    # Binary step: 0 for non-positive input, 1 otherwise
    if x <= 0:
        y = 0
    if x > 0:
        y = 1
    return y
(On practical grounds, I doubt the step function is a plausible choice for the activation function, but perhaps it helps understand the purpose of the activation function in NN operation.)
I suppose there are an unlimited number of possible activation functions, but in practice, you only see a handful; in fact just two account for the overwhelming majority of cases (both are sigmoidal). Here they are (in Python) so you can experiment for yourself, given that the primary selection criterion is a practical one:
import math

# logistic function
def sigmoid2(x):
    return 1 / (1 + math.exp(-x))

# hyperbolic tangent
def sigmoid1(x):
    return math.tanh(x)
What are the factors to consider in selecting an activation function?
First the function has to give the desired behavior (arising from or as evidenced by sigmoidal shape). Second, the function must be differentiable. This is a requirement for backpropagation, which is the optimization technique used during training to 'fill in' the values of the hidden layers.
For instance, the derivative of the hyperbolic tangent is (in terms of the output, which is how it is usually written) :
def dsigmoid(y):
    return 1.0 - y**2
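If you want to convince yourself that the "in terms of the output" form is right, a finite-difference check is enough. The sketch below does this for tanh and, as an extra, for the logistic function, whose derivative in terms of its output y is y * (1 - y). The helper names are mine.

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def dlogistic(y):
    # Derivative of the logistic function, written in terms of its output y
    return y * (1.0 - y)

def dtanh(y):
    # Derivative of tanh, written in terms of its output y
    return 1.0 - y ** 2

def central_difference(f, x, h=1e-6):
    # Numerical derivative used only as a cross-check
    return (f(x + h) - f(x - h)) / (2.0 * h)

x = 0.7  # arbitrary test point
tanh_gap = abs(dtanh(math.tanh(x)) - central_difference(math.tanh, x))
logistic_gap = abs(dlogistic(logistic(x)) - central_difference(logistic, x))
```

Both gaps come out tiny, so the closed forms agree with the numerical derivatives at that point.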
Beyond those two requirements, what makes one function better than another is how efficiently it trains the network, i.e., which one causes convergence (reaching the local minimum error) in the fewest epochs?
#-------- Edit (see OP's comment below) ---------#
I am not quite sure I understood (sometimes it's difficult to communicate details of a NN without the code), so I should probably just say that it's fine subject to this proviso: what you want the NN to predict must be the same as the dependent variable used during training. So for instance, if you train your NN using two states (e.g., 0, 1) as the single dependent variable (which is obviously missing from your testing/production data), then that's what your NN will return when run in "prediction mode" (post training, or with a competent weight matrix).
You should choose the right loss function to minimize.
The squared error does not lead to the maximum likelihood hypothesis here.
The squared error is derived from a model with Gaussian noise:
P(y|x,h) = k1 * e**-(k2 * (y - h(x))**2)
You estimate the probabilities directly. Your model is:
P(Y=1|x,h) = h(x)
P(Y=0|x,h) = 1 - h(x)
P(Y=1|x,h) is the probability that event Y=1 will happen after seeing x.
The maximum likelihood hypothesis for your model is:
h_max_likelihood = argmax_h product(
    h(x)**y * (1-h(x))**(1-y) for x, y in examples)
This leads to the "cross entropy" loss function.
See chapter 6 in Mitchell's Machine Learning for the loss function and its derivation.
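The link between the likelihood above and cross entropy is just a logarithm: taking the negative log of the product gives a sum of per-example cross-entropy terms, so maximizing one is the same as minimizing the other. Here is a small numerical check, with made-up outcomes y and model outputs h(x):

```python
import numpy as np

def likelihood(y, h):
    # Product over examples of h(x)^y * (1 - h(x))^(1 - y)
    return float(np.prod(h ** y * (1.0 - h) ** (1.0 - y)))

def cross_entropy(y, h):
    # Sum over examples of -[y log h(x) + (1 - y) log(1 - h(x))]
    return float(-np.sum(y * np.log(h) + (1.0 - y) * np.log(1.0 - h)))

y = np.array([1.0, 0.0, 1.0, 1.0])  # observed outcomes
h = np.array([0.8, 0.3, 0.6, 0.9])  # hypothetical model outputs h(x)

neg_log_likelihood = -np.log(likelihood(y, h))
ce = cross_entropy(y, h)
```

The two values agree up to floating-point error. In practice you work with the sum of logs rather than the raw product, which underflows quickly as the number of examples grows.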
There is one problem with this approach: if you have vectors from R^n and your network maps those vectors into the interval [0, 1], it will not be guaranteed that the network represents a valid probability density function, since the integral of the network is not guaranteed to equal 1.
E.g., a neural network could map every input from R^n to 1.0, which clearly cannot be a valid probability density.
So the answer to your question is: no, you can't.
However, you can just say that your network never sees "unrealistic" code samples and thus ignore this fact. For a discussion of this (and also some more cool information on how to model PDFs with neural networks) see contrastive backprop.
