the definition of unbalanced sample [closed] - machine-learning

Unbalanced samples cause issues and require extra effort, as we know.
While handling the issue, I got confused about the definition. Say I have a training dataset of 200 cats, 200 dogs, and 400 stones.
When classifying 3 classes, I should have 200 cats, 200 dogs, and 200 stones. But what should I allocate when I am only classifying 2 classes, pets and stones?
Should I still go with 400 pets (200 cats and 200 dogs) and 400 stones, so that the pet and stone classes have the same quantities?
Or should I go with 400 pets (200 cats and 200 dogs) and 200 stones, so that all inner classes have the same probability of being seen? After all, cats and dogs are essentially different.

I think it is task dependent. If you are going to classify your samples into two classes (pets and stones), then you should use all 400 pet images (cats and dogs) and all 400 stone samples. However, if you have three classes (cats, dogs, and stones), then you need to limit the number of stone samples to 200 for every training epoch.
Why is this?
In the case of two classes, pets vs. stones: both labels (pet and stone) update the model's weights 400 times per epoch. So after training finishes, the model will be able to recognize both classes equally well.
In the case of three classes (cats, dogs, and stones), the cat and dog classes update the weights 200 times per epoch, while the stone class updates the weights 400 times per epoch, so the model will have a higher chance of outputting the stone class than of outputting the cat or dog class.
So, in summary, you should make the number of samples the same for all classes.
PS: if you randomly select a different 200 stone samples from the 400 at each epoch in the three-class case, your model won't end up biased toward the stone class compared to the other two; on the contrary, it will generalize better on the stone class than on the other two, because it has seen more unique samples of that class.
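Here is a minimal sketch of that per-epoch subsampling in Python (the index arrays and the train_one_epoch call are hypothetical placeholders for your own pipeline):

import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical index arrays for each class in an 800-sample training set.
cat_idx = np.arange(0, 200)       # 200 cat samples
dog_idx = np.arange(200, 400)     # 200 dog samples
stone_idx = np.arange(400, 800)   # 400 stone samples

def balanced_epoch_indices():
    """Draw a fresh 200-stone subset each epoch: every class contributes
    equally per epoch, yet all 400 stones are seen over the full run."""
    stone_subset = rng.choice(stone_idx, size=200, replace=False)
    epoch = np.concatenate([cat_idx, dog_idx, stone_subset])
    rng.shuffle(epoch)
    return epoch

for _ in range(3):
    indices = balanced_epoch_indices()
    # train_one_epoch(X[indices], y[indices])  # hypothetical training step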

Related

How can I do a stratified downsampling?

I need to build a classification model for protein sequences using machine learning techniques. Each observation is classified as either a 0 or a 1. However, I noticed that my training set contains a total of 170,000 observations, of which only 5,000 are labeled as 1. Therefore, I wish to downsample the number of observations labeled as 0 to 5,000.
One of the features I am currently using in the model is the length of the sequence. How can I downsample the data for class 0 while making sure the distribution of length_sequence remains similar to the one in class 1?
Here are the histograms of length_sequence for class 1 and for class 0 (images omitted):
You can see that in both cases the lengths go from 2 to 255 characters. However, class 0 has many more observations, and they also tend to be significantly longer than the ones seen in class 1.
How can I downsample class 0 so that the new histogram looks similar to the one for class 1?
I am trying to do stratified downsampling with scikit-learn, but I'm stuck.
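One possible approach, sketched below under stated assumptions: bin length_sequence on shared edges and, within each bin, sample as many class-0 rows as class 1 has there. The column names label and length_sequence are assumptions about your DataFrame:

import numpy as np
import pandas as pd

def downsample_to_match(df, n_bins=25, seed=0):
    """Downsample class 0 so its length_sequence histogram mirrors class 1's."""
    pos = df[df["label"] == 1]
    neg = df[df["label"] == 0]
    # Bin both classes on the same edges so bins are comparable.
    edges = np.histogram_bin_edges(df["length_sequence"], bins=n_bins)
    pos_bins = pd.cut(pos["length_sequence"], edges, include_lowest=True)
    neg_bins = pd.cut(neg["length_sequence"], edges, include_lowest=True)
    # For each bin, take as many class-0 rows as class 1 has in that bin.
    target = pos_bins.value_counts()
    sampled = []
    for bin_label, n_wanted in target.items():
        candidates = neg[neg_bins == bin_label]
        n_take = min(n_wanted, len(candidates))
        sampled.append(candidates.sample(n_take, random_state=seed))
    return pd.concat([pos] + sampled)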

Unbalanced model, confused as to what steps to take

This is my first data mining project. I am using SAS Enterprise Miner to train and test a classifier.
I have 3 files at my disposal:
Training file: 85 input variables and 1 target variable, with 5800+ observations
Prediction file: 85 input variables with 4000 observations
Verification file: 1 variable containing the correct predictions for the second file. Since this is an academic project, this file is here to tell us if we are doing a good job or not.
My problem is that the dataset is unbalanced (95% 0s and 5% 1s for the target variable in the training file). So naturally, I tried to resample the data using the "sampling node" as described in the following link.
Here are the 2 approaches I used; they give slightly different results. But here is the general, unsatisfactory result I am getting:
Without resampling: the model predicts fewer than ten solicited individuals (target variable = 1) out of 4000 observations.
With resampling: the model predicts about 1500 solicited individuals out of 4000 observations.
I am looking for 100 to 200 solicited individuals for the model to be considered acceptable.
Why do you think our predictions are so far off, and how can we remedy this situation?
Here is a screenshot of both models.
There are some techniques to deal with unbalanced data. One that I remember from many years ago is this approach:
say you have 100 solicited (minority) observations that make up 5% of all your observations
cluster the non-solicited (majority) class into 20 groups (each with 100 non-solicited observations) with clustering algorithms like k-means, mean shift, DBSCAN, and so on
then, for each group of clustered majority observations, create a dataset with all 100 solicited (minority) observations. This means you have 20 datasets, each balanced with 100 solicited and 100 non-solicited observations
train on each balanced group and create a model for each of them
at prediction time, run all 20 models and vote: for example, if 15 out of 20 models say it is solicited, it is solicited
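A rough sketch of this scheme with scikit-learn (the choice of logistic regression as the per-group model is an assumption, and k-means clusters will only be roughly equal in size, not exactly 100 each):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

def train_cluster_ensemble(X_min, X_maj, n_groups=20, seed=0):
    """Cluster the majority class into n_groups, pair each cluster with the
    full minority class, and train one model per (roughly) balanced set."""
    labels = KMeans(n_clusters=n_groups, random_state=seed).fit_predict(X_maj)
    models = []
    for g in range(n_groups):
        X_g = np.vstack([X_min, X_maj[labels == g]])
        y_g = np.concatenate([np.ones(len(X_min)),
                              np.zeros((labels == g).sum())])
        models.append(LogisticRegression(max_iter=1000).fit(X_g, y_g))
    return models

def predict_by_vote(models, X_new, threshold=0.75):
    """Majority vote: e.g. solicited when 15 of 20 models (75%) agree."""
    votes = np.mean([m.predict(X_new) for m in models], axis=0)
    return votes >= threshold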

Association Rule - Non-Binary Items

I have studied association rules and know how to implement the algorithm on the classic basket of goods problem, such as:
Transaction ID  Potatoes  Eggs  Milk
A               1         0     1
B               0         1     1
In this problem each item has a binary identifier. 1 indicates the basket contains the good, 0 indicates it does not.
But what would be the best way to model a basket which can contain many of the same good? E.g., take the below, very unrealistic example.
Transaction ID  Potatoes  Eggs  Milk
A               5         0     178
B               0         35    7
Using binary indicators in this case would obviously lose a lot of information, and I am seeking a model which takes into account not only the presence of items in the basket but also the frequency with which they occur.
What would be a suitable algorithm for this problem?
In my actual data there are over one hundred items and, based on the profile of a user's basket, I would like to calculate the probabilities of the customer consuming the other available items.
An alternative is to use binary indicators but construct them in a cleverer way.
The idea is to set the indicator only when an amount is above a central value, which means that it is significant. If everyone buys 3 loaves of bread on average, does it make sense to flag someone as a "bread-lover" for buying two or three?
The central value can be a plain arithmetic mean, a mean with outliers removed, or the median.
Instead of
binarize(x) = 0 if x = 0, 1 otherwise
you can use
binarize*(x) = 0 if x <= central(X), 1 otherwise
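A small Python sketch of both indicator schemes (computing the central value as the median over actual buyers is just one of the options above):

import numpy as np

def binarize_plain(x):
    # Original scheme: 1 if the item is present at all.
    return (x > 0).astype(int)

def binarize_central(x):
    # Improved scheme: 1 only when the amount exceeds a central value.
    # Here the central value is the median over actual buyers.
    central = np.median(x[x > 0])
    return (x > central).astype(int)

milk = np.array([5, 0, 178, 3, 2, 7])
print(binarize_plain(milk))    # [1 0 1 1 1 1]
print(binarize_central(milk))  # [0 0 1 0 0 1] -- only above-median buys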
I think if you really want probabilities, the way to go is to encode your data in a probabilistic way. Bayesian or Markov networks might be a feasible approach. Nevertheless, without a reasonable structure this will be computationally extremely expensive; for three item types, however, it seems feasible.
If you have many more item types, I would try a neural network autoencoder. If there is some dependency in the data, it will discover it.
For the above example you could use a network with three input, two hidden, and three output neurons.
A little more fancy would be to use 3 fully connected layers with dropout in the middle layer.
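A minimal PyTorch sketch of that 3-2-3 autoencoder with dropout in the middle layer (the layer sizes, dropout rate, and training-loop details are illustrative assumptions):

import torch
import torch.nn as nn

# 3 inputs (item counts) -> 2 hidden units -> 3 reconstructed outputs,
# with dropout in the middle as the fancier variant suggests.
model = nn.Sequential(
    nn.Linear(3, 2),
    nn.ReLU(),
    nn.Dropout(0.2),
    nn.Linear(2, 3),
)

baskets = torch.tensor([[5.0, 0.0, 178.0], [0.0, 35.0, 7.0]])
baskets = baskets / baskets.max()  # scale raw counts before training

optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()
for _ in range(200):
    optimizer.zero_grad()
    loss = loss_fn(model(baskets), baskets)
    loss.backward()
    optimizer.step()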

Naive Bayes for forecasting a grade

I have a dataset of grades in four lessons (for example lesson a, lesson b, lesson c, lesson d) for 100 students, and let's imagine these grades are associated with the grade in lesson f.
I want to implement naive Bayes to forecast the grade in lesson f from those four grades, but I don't know how to use the input for this.
I read about naive Bayes for spam mail detection, where the probability of each word is calculated.
But for grades I do not know what probabilities I must calculate.
I tried to proceed as with spam, but in this example I only have four names (one for each lesson).
In order to do a good classification, you need some additional information about the students beyond the classes they are taking. Following your example, spam detection is based on words, such as stop words that generally indicate spam (buy, promotion, money), or on the origin in HTTP headers.
To predict a student's grade, you could imagine having information about the student such as social class, whether they play sports, male or female, and so on.
Getting back to your question, it is not the names of the lessons that are interesting but the grades each student got in those lessons. You need to take the grades of the four lessons plus lesson f to train the naive Bayes classifier.
Your input might look like this:
StudentID  gradeA  gradeB  gradeC  gradeD  gradeF
1          10      9       8       5       8
2          3       5       3       8       8
3          5       3       1       1       2
4          10      10      10      5       4
After training your classifier, you will pass in a new entry for a new student, like this:
StudentID  gradeA  gradeB  gradeC  gradeD
1058       1       5       8       4
The classifier will be able to predict the grade for lesson F taking the preceding grades into consideration.
You might have noticed that I intentionally built a training dataset where gradeF is highly correlated with gradeD. That is what the Bayes classifier will try to learn, just in a more complex way.
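A sketch of this setup with scikit-learn's GaussianNB, treating the gradeF values as class labels (the choice of GaussianNB for integer grades is an assumption, and four rows are of course only a toy):

import numpy as np
from sklearn.naive_bayes import GaussianNB

# The toy table from above: grades in lessons A-D, target = grade in lesson F.
X_train = np.array([[10,  9,  8, 5],
                    [ 3,  5,  3, 8],
                    [ 5,  3,  1, 1],
                    [10, 10, 10, 5]])
y_train = np.array([8, 8, 2, 4])   # gradeF values, treated as class labels

clf = GaussianNB().fit(X_train, y_train)

# The new student 1058 from the example:
print(clf.predict([[1, 5, 8, 4]]))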

Naive bayes text classification fails in one category. Why? [closed]

I am implementing a Naive Bayes classifier for text category detection.
I have 37 categories, and I get an accuracy of about 36% on my test set.
I want to improve accuracy, so I decided to implement 37 two-way classifiers, as suggested in many sources (Ways to improve the accuracy of a Naive Bayes Classifier? is one of them). Each classifier would answer, for a given text:
specific_category OR everything_else
and I would determine the text's category by applying them sequentially.
But I have a problem with the first classifier: texts always fall into the "specific_category" class.
I have training data: 37 categories, 100 documents of the same size for each category.
For each category I found a list of 50 features selected by the mutual information criterion (the features are just words).
For the sake of the example, I use two categories, "agriculture" and "everything_else" (everything except agriculture).
For category "agriculture":
number of words in all documents of this class (the first term in the denominator in http://nlp.stanford.edu/IR-book/pdf/13bayes.pdf, (13.7)): W_agriculture = 31649
size of vocabulary: V_agriculture = 6951
log probability of an unknown word (UNK): P(UNK|agriculture) = -10.56
log probability of the class: P(agriculture) = log(1/37) = -3.61 (we have 37 categories of same-size documents)
For category "everything_else":
W_everything_else = 1030043
V_everything_else = 44221
P(UNK|everything_else) = -13.89
P(everything_else) = log(36/37) = -0.03
Now suppose I have a text not related to agriculture that consists mostly of unknown words (UNK). It has 270 words, mostly unknown to both categories "agriculture" and "everything_else". Let's assume 260 words are UNK for "everything_else" and the other 10 are known.
Then, when I calculate probabilities
P(text|agriculture) = P(agriculture) + SUM(P(UNK|agriculture) for 270 times)
P(text|everything_else) = P(everything_else) + SUM(P(UNK|everything_else) for 260 times) + SUM(P(word|everything_else) for 10 times)
In the last line we counted 260 words as UNK and 10 as known for the category.
Main problem: as P(UNK|agriculture) >> P(UNK|everything_else) (in log terms it is much greater), the influence of those 270 P(UNK|agriculture) terms outweighs the influence of the sum of P(word|everything_else) over the words in the text.
Because
SUM(P(UNK|agriculture) for 270 times) = -2851.2
SUM(P(UNK|everything_else) for 260 times) = -3611.4
the first sum is much larger and cannot be compensated either by P(agriculture) or by SUM(P(word|everything_else) for 10 words), because the difference is huge. So texts always fall into the "agriculture" category even though they do not belong to it.
The question is: am I missing something? How should I deal with a large number of UNK words whose probability is significantly higher for small categories?
UPD: I tried to enlarge the training data for the "agriculture" category (just concatenating the documents 36 times) to make the number of documents equal. It helped for a few categories, but not much for others; I suspect that, because of the smaller number of words and the smaller dictionary size, P(UNK|specific_category) gets bigger and outweighs P(UNK|everything_else) when summed 270 times.
So it seems this method is very sensitive to the number of words in the training data and to the vocabulary size. How can I overcome this? Maybe bigrams/trigrams would help?
Right, ok. You're pretty confused, but I'll give you a couple of basic pointers.
Firstly, even if you're following a 1-vs-all scheme, you can't have different vocabularies for the different classes. If you do this, the event spaces of the random variables are different, so probabilities are not comparable. You need to decide on a single common vocabulary for all classes.
Secondly, throw out the unknown token. It doesn't help you. Ignore any words that aren't part of the vocabulary you decide upon.
Finally, I don't know what you're doing with summing probabilities. You're confused about taking logs, I think. This formula is not correct:
P(text|agriculture) = P(agriculture) + SUM(P(UNK|agriculture) for 270 times)
Instead it's:
p(text|agriculture) = p(agriculture) * p(unk|agriculture)^270 * p(all other words in doc|agriculture)
If you take logs, this becomes:
log( p(t|a) ) = log(p(agriculture)) + 270*log(p(unk|agriculture)) + log(p(all other words|agriculture))
Finally, if your classifier is implemented correctly, there's no real reason to believe that one-vs-all will work better than a straight n-way classification. Empirically it might, but theoretically their results should be equivalent. In any case, you shouldn't apply the decisions sequentially; do all n 2-way problems and assign to the class where the positive probability is highest.
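A sketch of both setups with scikit-learn, where a single CountVectorizer provides the common vocabulary for all classes as advised above (the documents and labels are hypothetical placeholders for the 37-category corpus):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical stand-ins for the real corpus.
docs = ["corn yield rose this season", "the parliament passed a new bill"]
labels = ["agriculture", "everything_else"]

# One shared vocabulary for every class; straight n-way classification.
straight = make_pipeline(CountVectorizer(), MultinomialNB()).fit(docs, labels)

# One-vs-rest: all 2-way problems fit at once, decided in a single pass.
one_vs_rest = make_pipeline(
    CountVectorizer(), OneVsRestClassifier(MultinomialNB())
).fit(docs, labels)

# Both assign to the class with the highest probability,
# rather than applying the 2-way decisions sequentially.
print(straight.predict(["farmers planted corn"]))
print(one_vs_rest.predict(["farmers planted corn"]))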
