How can I do a stratified downsampling? - machine-learning

I need to build a classification model for protein sequences using machine learning techniques. Each observation is labeled either 0 or 1. However, my training set contains 170 000 observations in total, of which only 5000 are labeled 1. I therefore want to downsample the observations labeled 0 to 5000.
One of the features I am currently using in the model is the length of the sequence. How can I down sample the data for my class 0 while making sure the distribution of length_sequence remains similar to the one in my class 1?
Here is the histogram of length_sequence for class 1:
Here is the histogram of length_sequence for class 0:
You can see that in both cases, the lengths go from 2 to 255 characters. However, class 0 has many more observations, and they also tend to be significantly longer than the ones seen in class 1.
How can I down sample class 0 and make the new histogram look similar to the one in class 1?
I am trying to do stratified down sampling with scikit-learn, but I'm stuck.
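One possible approach, sketched here in plain Python rather than scikit-learn, is to bin the sequence lengths and draw from class 0 exactly as many observations per bin as class 1 has in that bin. The function name stratified_downsample, the bin width of 10 and the toy data are assumptions made for illustration, not part of the question.

```python
import random
from collections import Counter, defaultdict


def stratified_downsample(lengths_majority, lengths_minority, bin_width=10, seed=0):
    """Return indices into lengths_majority whose length histogram mimics the minority's."""
    rng = random.Random(seed)
    # Number of minority samples that fall into each length bin.
    target = Counter(length // bin_width for length in lengths_minority)
    # Group the majority indices by the same bins.
    by_bin = defaultdict(list)
    for idx, length in enumerate(lengths_majority):
        by_bin[length // bin_width].append(idx)
    chosen = []
    for bin_id, n_wanted in target.items():
        pool = by_bin.get(bin_id, [])
        # A bin cannot give more than it holds; such bins stay under-filled.
        chosen.extend(rng.sample(pool, min(n_wanted, len(pool))))
    return sorted(chosen)


# Toy data (hypothetical): class 1 is shorter on average than class 0.
data_rng = random.Random(42)
minority = [data_rng.randint(2, 120) for _ in range(500)]
majority = [data_rng.randint(2, 255) for _ in range(17000)]
keep = stratified_downsample(majority, minority)
```

The returned indices select the class 0 rows to keep. Narrower bins follow the class 1 histogram more closely but are more likely to run out of class 0 candidates in sparse regions.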

Related

Binary classification: why use +1/0 as labels, and what's the difference from +1/-1 or even +100/-100?

In binary classification problems, we usually use +1 for the positive label and 0 for the negative label. Why is that? In particular, why use 0 rather than -1 for the negative label?
What changes if we use -1 for the negative label or, more generally, +100 for the positive label and -100 for the negative label?
As the name suggests, labeling is used to differentiate the classes. You can use 0/1, +1/-1, cat/dog, etc. (any names that fit your problem).
For example:
If you want to distinguish between cat and dog images, then use cat and dog labels.
If you want to detect spam, then labels will be spam/genuine.
However, because ML algorithms mostly work with numbers, labels are converted to a numeric format before training.
Labels of 0 and 1 arise naturally from some of the historically first methods used for binary classification. For example, logistic regression directly models the probability of an event, where the event here is an object belonging to the positive class. Training data with labels 0 and 1 basically says that objects with label 0 have probability 0 of belonging to the given class, and objects with label 1 have probability 1 of belonging to it. For spam classification, for instance, emails that are not spam get label 0 because their probability of being spam is 0, and emails that are spam get label 1 because their probability of being spam is 1.
So using labels of 0 and 1 makes perfect sense mathematically. When a binary classification model outputs e.g. 0.4 for some input, we can usually interpret this as the probability of belonging to class 1 (although strictly this is not always the case, as pointed out for example here).
There are classification methods that don't make use of convenient properties of labels 0 and 1, such as support vector machines or linear discriminant analysis, but in their case no other labels would provide more convenience than 0 and 1, so using 0 and 1 is still okay.
Even the encoding of classes for multiclass classification makes use of probabilities of belonging to a given class. For example, in classification with three classes, objects from the first class would be encoded as [1 0 0], from the second class as [0 1 0] and from the third class as [0 0 1], which again can be interpreted as probabilities. (This is called one-hot encoding.) The output of a multiclass classification model is often a vector of the form [0.1 0.6 0.3], which can conveniently be interpreted as a vector of class probabilities for the given object.
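As a minimal sketch of the encoding described above (the helper names one_hot and predicted_class are made up for illustration), a class index can be turned into a one-hot vector, and a vector of predicted probabilities can be turned back into a class index:

```python
def one_hot(label, num_classes):
    """Encode a class index as a vector with a single 1 at that index."""
    vec = [0] * num_classes
    vec[label] = 1
    return vec


def predicted_class(probabilities):
    """Return the index of the most probable class."""
    return max(range(len(probabilities)), key=probabilities.__getitem__)


encoded = one_hot(1, 3)                      # second of three classes
winner = predicted_class([0.1, 0.6, 0.3])    # model output from the text
```

Here encoded is [0, 1, 0] and winner is 1, i.e. the second class.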

Which metric to use for imbalanced classification problem?

I am working on a classification problem with very imbalanced classes. I have 3 classes in my dataset: class 0, 1 and 2. Class 0 is 11% of the training set, class 1 is 13% and class 2 is 75%.
I used a random forest classifier and got 76% accuracy. But I discovered that 93% of this accuracy comes from class 2 (the majority class). Here is the crosstab I got.
The results I would like to have :
fewer false negatives for class 0 and 1 OR/AND fewer false positives for class 0 and 1
What I found on the internet to solve the problem and what I've tried :
using class_weight='balanced' or a customized class_weight (1/11% for class 0, 1/13% for class 1, 1/75% for class 2), but it doesn't change anything (the accuracy and crosstab are still the same). Do you have an interpretation/explanation of this?
as I know accuracy is not the best metric in this context, I used other metrics: precision_macro, precision_weighted, f1_macro and f1_weighted, and I implemented the area under the precision-recall curve for each class and used the average as a metric.
Here's my code (feedback welcome):
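from sklearn.metrics import average_precision_score, make_scorer
from sklearn.preprocessing import label_binarize

def pr_auc_score(y_true, y_pred):
    # One-hot encode the true labels so that each class gets its own PR curve.
    y = label_binarize(y_true, classes=[0, 1, 2])
    return average_precision_score(y, y_pred)

# Newer scikit-learn releases use response_method="predict_proba" instead of needs_proba.
pr_auc = make_scorer(pr_auc_score, greater_is_better=True, needs_proba=True)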
and here's a plot of the precision vs recall curves.
Alas, for all these metrics, the crosstab remains the same. They seem to have no effect.
I also tuned the parameters of boosting algorithms (XGBoost and AdaBoost) with accuracy as the metric, and again the results did not improve. I don't understand why, because boosting algorithms are supposed to handle imbalanced data.
Finally, I used another model (BalancedRandomForestClassifier) with accuracy as the metric. The results are good, as we can see in this crosstab. I am happy to have such results, but I notice that when I change the metric for this model, there is again no change in the results.
So I'm really interested in knowing why using class_weight, changing the metric or using boosting algorithms doesn't lead to better results.
As you have figured out, you have encountered the "accuracy paradox".
Say you have a classifier which has an accuracy of 98%, it would be amazing, right? It might be, but if your data consists of 98% class 0 and 2% class 1, you obtain a 98% accuracy by assigning all values to class 0, which indeed is a bad classifier.
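The 98%/2% example can be checked with a few lines of Python. This is only an illustrative sketch using made-up labels, with recall on class 1 added to show what accuracy hides:

```python
# 98 samples of class 0 and 2 samples of class 1, as in the example above.
y_true = [0] * 98 + [1] * 2
# A "classifier" that assigns every sample to class 0.
y_pred = [0] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall on class 1: the share of true class 1 samples that were found.
found = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall_class_1 = found / sum(t == 1 for t in y_true)
```

The accuracy is 0.98 while the recall on class 1 is 0.0: not a single minority sample is detected.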
So, what should we do? We need a measure that is invariant to the class distribution of the data. Enter ROC curves.
ROC curves are invariant to the class distribution of the data, and are therefore a great tool for visualizing a classifier's performance whether or not the data is imbalanced. However, they only work for two-class problems (you can extend them to multiclass by creating one-vs-rest or one-vs-one ROC curves).
The F-score might be a bit more "tricky" to use than the ROC AUC, since it is a trade-off between precision and recall and you need to set the beta parameter (which is often 1, hence the F1 score).
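For reference, the F-score with parameter beta is (1 + beta^2) * precision * recall / (beta^2 * precision + recall). A small sketch of that formula (the function name f_beta is chosen here for illustration):

```python
def f_beta(precision, recall, beta=1.0):
    """F-score: beta > 1 favours recall, beta < 1 favours precision."""
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        return 0.0  # convention when both precision and recall are 0
    return (1 + beta ** 2) * precision * recall / denominator


f1 = f_beta(0.5, 0.5)            # beta = 1 gives the usual F1 score
f2 = f_beta(1.0, 0.5, beta=2.0)  # low recall is punished harder
```

With beta = 1 this is the harmonic mean of precision and recall.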
You write: "fewer false negatives for class 0 and 1 OR/AND fewer false positives for class 0 and 1". Remember that all algorithms work by either minimizing or maximizing something; often we minimize a loss function of some sort. For a random forest, let's say we want to minimize the following function L:
L = (w0+w1+w2)/n
where wi is the number of samples from class i that are classified as not class i (i.e. if w0=13, we have misclassified 13 samples from class 0), and n is the total number of samples.
It is clear that when class 0 makes up most of the data, an easy way to get a small L is to classify most of the samples as 0. We can overcome this by adding a weight to each class, e.g.
L = (b0*w0+b1*w1+b2*w2)/n
As an example, say b0=1, b1=5, b2=10. Now we cannot just assign most of the data to class 0 without being punished by the weights, i.e. we are far more conservative about assigning samples to class 0, since assigning a class 1 sample to class 0 now costs 5 times as much loss as before! This is exactly how the weights in most classifiers work: they assign a penalty/weight to each class (often inversely proportional to its share of the data, i.e. if class 0 makes up 80% and class 1 makes up 20% of the data, then b0=1 and b1=4), but you can often specify the weights yourself. If you find that the classifier still generates too many false negatives for a class, increase the penalty for that class.
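To make the effect of the weights concrete, here is a small sketch of the two loss functions above. The misclassification counts are invented for illustration: a lazy classifier that sends everything to class 0 gets all 10 class 1 samples and all 5 class 2 samples wrong out of n=100.

```python
def weighted_loss(misclassified, weights, n):
    """L = (b0*w0 + b1*w1 + b2*w2) / n for per-class error counts w and weights b."""
    return sum(b * w for b, w in zip(weights, misclassified)) / n


errors = [0, 10, 5]  # w0, w1, w2 (hypothetical counts)
n = 100

plain = weighted_loss(errors, [1, 1, 1], n)      # unweighted L
penalized = weighted_loss(errors, [1, 5, 10], n) # b0=1, b1=5, b2=10
```

The same predictions cost 0.15 without weights and 1.0 with them, so the minimizer is pushed away from the "everything is class 0" solution.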
Unfortunately, "there is no such thing as a free lunch", i.e. which metric to use is a choice specific to the problem, the data and the usage.
On a side note - "random forest" might actually be bad by design when you don't have much data due to how the splits are calculated (let me know, if you want to know why - it's rather easy to see when using e.g Gini as splitting). Since you have only provided us with the ratio for each class and not the numbers, I cannot tell.

Does the Izhikevich neuron model use weights?

I've been working a bit with neural networks and I'm interested in implementing a spiking neuron model.
I've read a fair amount of tutorials but most of them seem to be about generating pulses and I haven't found any application of it on a given input train.
Say, for example, I have the input train:
Input[0] = [0,0,0,1,0,0,1,1]
When it enters the Izhikevich neuron, is the input multiplied by a weight, or does the neuron only make use of the parameters a, b, c and d?
Izhikevich equations are:
v[n+1] = v[n] + dt*(0.04*v[n]^2 + 5*v[n] + 140 - u[n] + I)
u[n+1] = u[n] + dt*a*(b*v[n] - u[n])
if v[n+1] >= 30 mV: v[n+1] = c and u[n+1] = u[n+1] + d
where v[n] is the membrane potential, u[n] is a general recovery variable, I is the input current and dt is the time step.
Are there any texts on implementations of Izhikevich or similar spiking neuron models on a practical problem? I'm trying to understand how information is encoded in these models, but it looks different from what's done with standard second-generation neurons. The only tutorial I've found that deals with a spike train and a set of weights is [1], but I haven't seen the same with Izhikevich.
[1] https://msdn.microsoft.com/en-us/magazine/mt422587.aspx
The plain Izhikevich model by itself does not include weights.
The two equations you mentioned model the membrane potential (v[]) of a point neuron over time. To use weights, you could connect two or more such cells with synapses.
Each synapse could include some sort of spike detection mechanism on the source (pre-synaptic) cell, and a synaptic current mechanism on the target (post-synaptic) cell side. That synaptic current could then be multiplied by a weight term, and then become part of the I term (in the 1st equation above) for the target cell.
As a very simple example of a two cell network, at every time step, you could check if pre- cell v is above (say) 0 mV. If so, inject (say) 0.01 pA * weightPrePost into the post- cell. weightPrePost would range from 0 to 1, and could be modified in response to things like firing rate, or Hebbian-like spike synchrony like in STDP.
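A rough sketch of such a two-cell network is given below. It assumes a forward-Euler discretisation of the standard Izhikevich model with the usual regular-spiking parameters (a=0.02, b=0.2, c=-65, d=8) and a 0.5 ms step. The names izhikevich_step, simulate, drive and syn_current are hypothetical, and the synaptic amplitude is an arbitrary illustrative value. For simplicity, spike detection uses the model's own 30 mV cutoff rather than the 0 mV threshold mentioned above.

```python
def izhikevich_step(v, u, current, a=0.02, b=0.2, c=-65.0, d=8.0, dt=0.5):
    """Advance one neuron by dt ms; return (v, u, spiked)."""
    v_new = v + dt * (0.04 * v * v + 5.0 * v + 140.0 - u + current)
    u_new = u + dt * a * (b * v - u)
    if v_new >= 30.0:
        # Spike: reset the membrane potential and bump the recovery variable.
        return c, u_new + d, True
    return v_new, u_new, False


def simulate(weight, steps=2000, drive=10.0, syn_current=20.0):
    """Pre-cell gets a constant drive; post-cell only gets weighted synaptic input."""
    v_pre, u_pre = -65.0, -13.0
    v_post, u_post = -65.0, -13.0
    pre_fired = False
    pre_spikes = post_spikes = 0
    for _ in range(steps):
        # The synapse injects weight * syn_current for one step after a pre-spike.
        i_syn = weight * syn_current if pre_fired else 0.0
        v_pre, u_pre, pre_fired = izhikevich_step(v_pre, u_pre, drive)
        v_post, u_post, post_fired = izhikevich_step(v_post, u_post, i_syn)
        pre_spikes += pre_fired
        post_spikes += post_fired
    return pre_spikes, post_spikes


pre_count, post_count_unconnected = simulate(weight=0.0)
```

With weight=0.0 the post-cell receives no input and stays at rest, while the driven pre-cell fires. A learning rule such as STDP would adjust weight between runs or between steps.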
With multiple synaptic currents going into a cell, you could devise various schemes how to sum them. The simplest one would be just a simple sum, more complicated ones could include things like distance and dendrite diameters (e.g. simulated neural morphology).
This chapter is a nice introduction to other ways to model synapses: Modelling Synaptic Transmission

Torch7 using weights with unbalanced training sets

I am using a CrossEntropyCriterion with my convnet. I have 150 classes and the number of training files per class is very unbalanced (5 to 2000 files). According to the documentation, I can compensate for this using weights:
criterion = nn.CrossEntropyCriterion([weights])
"If provided, the optional argument weights should be a 1D Tensor assigning weight to each of the classes. This is particularly useful when you have an unbalanced training set."
What format should the weights be in? E.g. number of training files in class n / total number of training files?
I assume you want to balance your training in the sense that small classes become more important. In general there are infinitely many possible weightings, leading to various results. One of the simplest, which assumes that each class should be equally important (so that you effectively drop the empirical prior), is to make the weight proportional to
1 / # samples_in_class
for example
weight_of_class_y = # all_samples / # samples_in_y
This way, if you have a 5:2000 disproportion, the smaller class becomes 400 times more important for the model.
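The weighting above can be computed from the per-class file counts as follows. This is a language-neutral sketch in Python (the helper name inverse_frequency_weights is invented); in Torch7 the resulting values would still have to be placed in a 1D Tensor before being passed to the criterion.

```python
def inverse_frequency_weights(class_counts):
    """weight_of_class_y = all_samples / samples_in_y for every class."""
    total = sum(class_counts)
    return [total / count for count in class_counts]


# Two classes with 5 and 2000 training files, as in the example above.
weights = inverse_frequency_weights([5, 2000])
importance_ratio = weights[0] / weights[1]
```

The ratio between the two weights is 400, matching the 5:2000 disproportion. Only the relative sizes matter for balancing, so the weights can be rescaled if the absolute values affect the effective learning rate.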

LSTM network learning

I have attempted to program my own LSTM (long short-term memory) neural network. I would like to verify that the basic functionality is working. I have implemented a backpropagation through time (BPTT) algorithm to train a single-cell network.
Should a single-cell LSTM network be able to learn a simple sequence, or is more than one cell necessary? The network does not seem to be able to learn a simple sequence such as 1 0 0 0 1 0 0 0 1 0 0 0 1.
I am sending the sequence of 1's and 0's one by one, in order, into the network, and feeding it forward. I record each output for the sequence.
After running the whole sequence through the LSTM cell, I feed the mean error signals back into the cell, saving the weight changes internal to the cell in a separate collection. After running all the errors through one by one and calculating the new weights after each error, I average the new weights together to get the new weight for each weight in the cell.
Am I doing something wrong? I would very much appreciate any advice.
Thank you so much!
Having only one cell (one hidden unit) is not a good idea, even if you are just testing the correctness of your code. You should try 50, even for such a simple problem. This paper: http://arxiv.org/pdf/1503.04069.pdf gives you very clear gradient rules for updating the parameters. Having said that, there is no need to implement your own LSTM, even if your dataset and/or the problem you are working on is new. Picking an existing library (Theano, mxnet, Torch, etc.) and modifying from there is, I think, an easier way, given that it is less error-prone and supports GPU computing, which is essential for training an LSTM within a reasonable amount of time.
I haven't tried 1 hidden unit before, but I am sure 2 or 3 hidden units will work for the sequence 0,1,0,1,0,1. More cells do not necessarily give a better result: training difficulty also increases with the number of cells.
You said you averaged new weights together to get the new weight. Does that mean you run many training sessions and take the average of the trained weights?
There are many possible reasons why your LSTM did not work, even if you implemented it correctly. The weights are not easy to train by simple gradient descent.
Here are my suggestions for weight optimization:
Use the momentum method for gradient descent.
Add some Gaussian noise to your training set to prevent overfitting.
Use adaptive learning rates for each unit.
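As an illustration of the first suggestion, a classical momentum update keeps a running velocity for each weight instead of following the raw gradient. The sketch below is generic and not tied to any LSTM implementation: the function name momentum_step, the hyperparameters and the toy objective f(w) = w^2 are assumptions chosen only for the example.

```python
def momentum_step(weights, grads, velocity, lr=0.1, momentum=0.9):
    """One gradient-descent step with classical momentum."""
    # The velocity accumulates past gradients, damped by the momentum factor.
    new_velocity = [momentum * v - lr * g for v, g in zip(velocity, grads)]
    new_weights = [w + v for w, v in zip(weights, new_velocity)]
    return new_weights, new_velocity


# Toy problem: minimize f(w) = w^2, whose gradient is 2*w.
w, vel = [5.0], [0.0]
for _ in range(300):
    grad = [2.0 * w[0]]
    w, vel = momentum_step(w, grad, vel)
```

After the loop, w[0] has moved from 5.0 to very close to the minimum at 0. In an LSTM the same update would be applied to every weight, using the gradients from BPTT.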
Maybe you can take a look at Coursera's neural networks course offered by the University of Toronto, and discuss with people there.
Or you can take a look at other examples on GitHub. For instance:
https://github.com/JANNLab/JANNLab/tree/master/examples/de/jannlab/examples
The best way to test an LSTM implementation (after gradient checking) is to try it out on the toy memory problems described in the original LSTM paper itself.
The best one that I often use is the 'Addition Problem':
We give a sequence of tuples of the form (value, mask). Value is a real valued scalar number between 0 and 1. Mask is a binary value - either 0 or 1.
0.23, 0
0.65, 0
...
0.86, 0
0.13, 1
0.76, 0
...
0.34, 0
0.43, 0
0.12, 1
0.09, 0
..
0.83, 0 -> 0.125
In the entire sequence of such tuples (usually of length 100), only 2 tuples should have a mask of 1; the rest should have a mask of 0. The target at the final time step is the average of the two values for which the mask was 1. The outputs at all time steps other than the last one are ignored. The values and the positions of the masks are chosen arbitrarily. Thus, this simple task shows whether your implementation can actually remember things over long periods of time.
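A generator for such training examples could look like the sketch below. It follows the description above (target = average of the two marked values); the function name make_addition_example and its parameters are made up for illustration.

```python
import random


def make_addition_example(length=100, rng=random):
    """Return (sequence, target): a list of (value, mask) tuples and the final target."""
    values = [rng.random() for _ in range(length)]
    # Pick two distinct positions whose mask is 1; all other masks are 0.
    marked = rng.sample(range(length), 2)
    masks = [1 if i in marked else 0 for i in range(length)]
    # The target is the average of the two marked values.
    target = sum(values[i] for i in marked) / 2
    return list(zip(values, masks)), target


sequence, target = make_addition_example(rng=random.Random(0))
```

Each call yields one sequence. The network's output is compared with target only at the last time step.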
