I have a dataset (based on the Million Song Dataset) on which I need to do genre classification. The following is the distribution of the genre classes in the dataset.
    Genre        Count     %
1.  Rock         115104    39.94
2.  Pop           47534    16.50
3.  Electronic    24313     8.44
4.  Jazz          16465     5.71
5.  Rap           15347     5.33
6.  RnB           13769     4.78
7.  Country       13509     4.69
8.  Reggae         8739     3.03
9.  Blues          7075     2.46
10. Latin          7042     2.44
11. Metal          6257     2.17
12. World          4624     1.60
13. Folk           3661     1.27
14. Punk           3479     1.21
15. New Age        1248     0.43
Would you call this data unbalanced? I've tried reading around, but people mostly describe a dataset as unbalanced when one class makes up 99% of the data in a binary classification problem. I'm not sure whether the set above falls into that category. Please help. I'm not able to get the classification right, and being a newbie I can't decide whether the problem is the data or my own naivety. This is one of the hypotheses I have and need to validate.
In general there is no strict definition of an imbalanced dataset, but as a rule of thumb, if the smallest class is 10x smaller than the largest one, calling it imbalanced is a good idea.
In your case, the smallest class is actually almost 100x smaller than the biggest one, so you can even map it onto the "99-1" binary case you mention. If you only ask about differentiating New Age from Rock, you end up with roughly a 99-1 imbalance, so you can expect the problems typical of imbalanced classification to appear in your project.
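As a quick sanity check (a minimal sketch, not part of the original answer), you can quantify the imbalance directly from the counts in the table; many scikit-learn classifiers also accept class_weight="balanced" as one common mitigation, and the classifier below is only an illustrative choice:

# Imbalance ratio computed from the counts given in the question.
counts = {"Rock": 115104, "Pop": 47534, "New Age": 1248}
print(max(counts.values()) / min(counts.values()))  # roughly 92x between Rock and New Age

# Reweight the loss inversely to class frequency (one common mitigation).
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(class_weight="balanced", max_iter=1000)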
I'm new to the machine learning field.
I'm trying to classify 10 people using their phone call logs.
The phone call logs look like this:
UserId IsInboundCall Duration PhoneNumber(hashed)
1 false 23 1011112222
2 true 45 1033334444
Training an SVM from sklearn on around 8,700 of these logs gives an accuracy of 88%.
I have several questions about this result, and about the proper way to use non-ordinal data (e.g. a phone number) as a feature:
I'm not sure about using a hashed phone number as a feature, but this multi-class classifier's accuracy is not bad. Is that just a coincidence?
How should non-ordinal data be used as a feature?
If this classifier has to handle more than 1,000 classes (more than 1,000 users), will an SVM still work in that case?
Any advice is helpful for me. Thanks
1) Try the SVM without the phone number as a feature to get a sense of how much impact it has.
2) To handle non-ordinal (categorical) data you can either transform it into a number or use a 1-of-K approach. Say you added a Phone OS field with possible values {IOS, Android, Blackberry}: you can represent it as a single number 0, 1, 2, or as 3 binary features (1,0,0), (0,1,0), (0,0,1) (see the sketch after this list).
3) The SVM will still give good results as long as the data is approximately linearly separable. To achieve this you might need to add more features and map into a different feature space (an RBF kernel is a good start).
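A minimal sketch of the 1-of-K encoding from point 2, using scikit-learn's OneHotEncoder on the hypothetical Phone OS field (the values are just the ones named above):

import numpy as np
from sklearn.preprocessing import OneHotEncoder

# One categorical column holding the hypothetical Phone OS values.
os_values = np.array([["IOS"], ["Android"], ["Blackberry"], ["Android"]])

encoder = OneHotEncoder(handle_unknown="ignore")
one_hot = encoder.fit_transform(os_values).toarray()
print(encoder.categories_)  # categories are ordered alphabetically: Android, Blackberry, IOS
print(one_hot)
# [[0. 0. 1.]   IOS
#  [1. 0. 0.]   Android
#  [0. 1. 0.]   Blackberry
#  [1. 0. 0.]]  Android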
I have around 2-3 million products. Each product follows this structure
{
  "sku": "Unique ID of product (string of 20 chars)",
  "title": "Title of product, e.g. OnePlus 5 - 6GB + 64GB",
  "brand": "Brand of product, e.g. OnePlus",
  "cat1": "First category of product, e.g. Phone",
  "cat2": "Second category of product, e.g. Mobile Phones",
  "cat3": "Third category of product, e.g. Smart Phones",
  "price": 500.00,
  "shortDescription": "Short description of the product (around 8-10 lines)",
  "longDescription": "Long description of the product (around 50-60 lines)"
}
The problem statement is:
Find similar products based on content (product data) only. When an e-commerce user clicks on a product (SKU), I will show similar products to that SKU in the recommendation.
For example, if the user clicks on an Apple iPhone 6s Silver, I will show these products under "Similar Products Recommendation":
1) apple iphone 6s gold or other color
2) apple iphone 6s plus options
3) apple iphone 6s options with other configurations
4) other apple iphones
5) other smart-phones in that price range
What I have tried so far
A) I have tried using the 'user view event' data to recommend similar products, but we do not have enough good data of that kind. It gives fine results, but only for a few products, so this template is not suitable for my use case.
B) One hot encoder + Singular Value Decomposition ( SVD ) + Cosine Similarity
I have trained my model on around 250 thousand products with dimension = 500, using a modification of this PredictionIO template. It gives good results. I have not included the long description of the product in the training.
But I have some questions here
1) Is using a one-hot encoder plus SVD the right approach for my use case?
2) Is there any way or trick to give extra weight to the title and brand attributes during training?
3) Do you think it is scalable? I am trying to increase the product count to 1 million and the dimension to 800-1000, but it takes a lot of time and the system hangs/stalls or goes out of memory. (I am using Apache PredictionIO.)
4) What should my dimension value be when I want to train on 2 million products?
5) How much memory would I need to deploy the trained SVD model for in-memory cosine similarity over 2 million products?
What should I use in my use case so that I can give some weight to my important attributes and still get good results with reasonable resources? What would be the best machine learning algorithm to use here?
Setting aside my objections to the posting, here is some guidance on the questions:
"Right Approach" often doesn't exist in ML. The supreme arbiter is whether the result has the characteristics you need. Most important, is the accuracy what you need, and can you find a better method? We can't tell without having a significant subset of your data set.
Yes. Most training methods will adjust whatever factors improve the error (loss) function. If your chosen method (SVD or other) doesn't do this automatically, then alter the error function.
Yes, it's scalable. The basic inference process is linear on the data set size. You got poor results because you didn't scale up the hardware when you enlarged the data set; that's part of "scale up". You might also consider scaling out (more compute nodes).
Well, how should a dimension scale with the data base size? I believe that empirical evidence supports this being a log(n) relationship ... you'd want 600-700 dimension. However, you should determine this empirically.
That depends on how you use the results. From what you've described, all you'll need is a sorted list of N top matches, which requires only the references and the similarity (a simple float). That's trivial memory compared to the model size, a matter of N*8 bytes.
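As a rough sketch of the input-side weighting idea (my own illustration, not the PredictionIO template): vectorize title, brand, and category separately, scale each block by a weight, stack the blocks, then reduce with truncated SVD and compare with cosine similarity. The three toy products, field names, and weights below are purely illustrative:

from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

products = [
    {"title": "Apple iPhone 6s Silver 64GB", "brand": "Apple",   "cat3": "Smart Phones"},
    {"title": "Apple iPhone 6s Gold 64GB",   "brand": "Apple",   "cat3": "Smart Phones"},
    {"title": "OnePlus 5 6GB + 64GB",        "brand": "OnePlus", "cat3": "Smart Phones"},
]

def field(name):
    return [p[name] for p in products]

# Per-field vectorizers; the multipliers are arbitrary illustration weights
# that make title and brand count more than the category.
X = hstack([
    3.0 * TfidfVectorizer().fit_transform(field("title")),
    2.0 * TfidfVectorizer().fit_transform(field("brand")),
    1.0 * TfidfVectorizer().fit_transform(field("cat3")),
])

# Reduce to a dense low-dimensional space, then compare items pairwise.
embeddings = TruncatedSVD(n_components=2).fit_transform(X)
print(cosine_similarity(embeddings))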
I am experimenting with the tensorflow seq2seq_model.py model.
The target vocab size I have is around 200.
The documentation says:
For vocabularies smaller than 512, it might be a better idea to just use a standard softmax loss.
The source-code also has the check:
if num_samples > 0 and num_samples < self.target_vocab_size:
Running the model with a target output vocabulary of only 200 does not invoke that if statement.
Do I need to write a "standard" softmax loss function to ensure good training, or can I just let the model run as it comes?
Thanks for the help!
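For reference, the check quoted above roughly behaves like this (a simplified Python sketch paraphrasing the tutorial's logic, not the actual code); with the tutorial's default num_samples of 512 and a target vocabulary of 200, the condition is false, the sampled loss is never built, and the standard full softmax is used:

# Simplified paraphrase of the branch in seq2seq_model.py (my own sketch).
def choose_loss(num_samples, target_vocab_size):
    # Sampled softmax is only worth building when it samples strictly fewer
    # candidate classes than the full vocabulary contains.
    if num_samples > 0 and num_samples < target_vocab_size:
        return "sampled_softmax_loss"
    # Otherwise softmax_loss_function stays None and the standard (full)
    # softmax cross-entropy is used by the sequence loss.
    return "standard_softmax_loss"

print(choose_loss(num_samples=512, target_vocab_size=200))  # standard_softmax_loss
print(choose_loss(num_samples=64, target_vocab_size=200))   # sampled_softmax_loss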
I am doing the same thing. In order to get my feet wet with different kinds of structures in the training data, I am working in an artificial test world with just 117 words in the (source and) target vocabulary.
I asked myself the same question and decided not to go through that hassle. My models train well even though I didn't touch the loss, thus still using the sampled_softmax_loss.
Further experiences with those small vocab sizes:
- a batch size of 32 is best in my case (smaller ones make training really unstable and I quickly run into NaN issues)
- I am using AdaGrad as the optimizer and it works like magic
- I am working with model_with_buckets (invoked through translate.py), and size 512 with num_layers 2 produces the desired outcomes in many cases
I have a question on how to deal with some interesting data.
I currently have some data (the counts are real, but the situation is made up) where we predict the number of t-shirts that people will purchase online today. We know quite a bit about everyone for our feature attributes, and these change from day to day. We also know how many t-shirts everyone purchased on previous days.
What I want is an algorithm that churns out a continuous variable that acts as a ranking or "score" of the number of t-shirts a person is going to purchase today. My end goal is that if I can attach this score to each person, I can sort people by the score and use them in a specific UI. Currently I've been using random forest regression with scikit-learn, where my targets are yesterday's counts of t-shirt purchases by each person. This has worked out pretty well, except that my data is mildly difficult: a lot of people purchase 0 t-shirts. This is a problem because my random forest gives me a lot of predictions of 0, and I cannot sort those effectively. I get why this is happening, but I'm not sure of the best way to get around it.
What I want is a non-zero score (even if it's a very small number close to 0) that tells me more about the features and the predicted class. I feel that some of my features must be able to tell me something and give me a better prediction than 0.
I think the inherent problem is using a random forest regressor as the algorithm. Each tree gets a vote; however, there are so many zeros that in many forests all the trees vote for 0. I would like another algorithm to try, but I don't know which one would work best. Currently I'm training on the whole dataset and using the out-of-bag estimate that scikit-learn provides.
Here are the counts of the target classes (using Python's Counter([target_classes])). This is set up as {target_class_value: count_of_that_value_in_the_target_class_list}:
{0: 3560426, 1: 121256, 2: 10582, 3: 1029, 4: 412, 5: 88, 6: 66, 7: 35, 8: 21, 9: 17, 10: 17, 11: 10, 12: 2, 13: 2, 15: 2, 21: 2, 17: 1, 18: 1, 52: 1, 25: 1}
I have tried some things to manipulate the training data, but I’m really guessing at things to do.
One thing I tried was scaling down the number of zeros in the training set to an amount linearly scaled against the other classes. So instead of passing the algorithm 3.5 million 0-class rows, I scaled it down to 250,000, so my training set looked like {0: 250000, 1: 121256, 2: 10582, 3: 1029, ...}. This has a drastic effect on the number of 0s coming back from the algorithm: I've gone from the algorithm predicting 0 for 99% of the data to only about 50%. However, I don't know if this is a valid thing to do or if it even makes sense.
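For what it's worth, the zero-class downsampling described above only takes a few lines; this is a sketch with toy stand-ins for the real feature matrix and targets:

import numpy as np

rng = np.random.default_rng(0)
# Toy stand-ins for the real feature matrix and per-person purchase counts.
X = rng.normal(size=(1_000_000, 5))
y = np.where(rng.random(1_000_000) < 0.97, 0, rng.integers(1, 5, 1_000_000))

zero_idx = np.where(y == 0)[0]
nonzero_idx = np.where(y != 0)[0]
keep_zero = rng.choice(zero_idx, size=250_000, replace=False)  # downsample the 0 class
keep = rng.permutation(np.concatenate([keep_zero, nonzero_idx]))
X_train, y_train = X[keep], y[keep]
print(np.bincount(y_train))  # per-class counts after downsampling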
Other things I've tried include increasing the forest size (which doesn't have much of an effect), telling the random forest to only consider sqrt(n_features) features for each tree (which has had a pretty good effect), and using the out-of-bag estimate (which also seems to give good results).
To summarize, I have a dataset in which a disproportionate amount of the data belongs to one class. I would like some way to produce a continuous value, a "score", for each row in the predicted dataset so I can sort them.
Thank you for your help!
This is an unbalanced class problem. One thing you could do is over/undersampling. Undersampling means that you randomly delete instances from the majority class; oversampling means that you sample, with replacement, instances from the minority class. You could also use a combination of both. One thing to try is SMOTE [1], an oversampling algorithm that, instead of just resampling existing instances from the minority class, creates synthetic instances, which helps avoid overfitting and in theory generalizes better.
[1] Chawla, Nitesh V., et al. "SMOTE: Synthetic Minority Over-sampling Technique." Journal of Artificial Intelligence Research (2002): 321-357.
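A minimal sketch of SMOTE using the imbalanced-learn package (assuming it is installed); X and y below are toy stand-ins for your features and for the purchase counts treated as classes:

import numpy as np
from collections import Counter
from imblearn.over_sampling import SMOTE

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                 # toy features
y = np.array([0] * 950 + [1] * 40 + [2] * 10)  # toy class imbalance

# SMOTE interpolates between a minority sample and its nearest minority-class
# neighbours to create synthetic rows rather than duplicating existing ones.
X_res, y_res = SMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)
print(Counter(y), Counter(y_res))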
I have a large dataset I am trying to do cluster analysis on using SOM. The dataset is HUGE (~ billions of records) and I am not sure what should be the number of neurons and the SOM grid size to start with. Any pointers to some material that talks about estimating the number of neurons and grid size would be greatly appreciated.
Thanks!
Quoting from the som_make function documentation of the SOM Toolbox:
"It uses a heuristic formula of 'munits = 5*dlen^0.54321'. The 'mapsize' argument influences the final number of map units: a 'big' map has x4 the default number of map units and a 'small' map has x0.25 the default number of map units."
Here dlen is the number of records in your dataset.
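Applying that heuristic is simple arithmetic; here is a quick Python sketch (the one-billion record count is only an assumption standing in for "billions"):

import math

dlen = 1_000_000_000                    # assumed number of records
munits = 5 * dlen ** 0.54321            # default number of map units
big, small = 4 * munits, 0.25 * munits  # the 'big' and 'small' mapsize variants
side = math.ceil(math.sqrt(munits))     # a roughly square grid as a starting point
print(round(munits), round(big), round(small), side)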
You can also read about the classic WEBSOM, which addresses the issue of large datasets:
http://www.cs.indiana.edu/~bmarkine/oral/self-organization-of-a.pdf
http://websom.hut.fi/websom/doc/ps/Lagus04Infosci.pdf
Keep in mind that the map size is also an application-specific parameter: it depends on what you want to do with the generated clusters. Large maps produce a large number of small but "compact" clusters (the records assigned to each cluster are quite similar). Small maps produce fewer but more generalized clusters. A "right number of clusters" doesn't exist, especially in real-world datasets. It all depends on the level of detail at which you want to examine your dataset.
I have written a function that, given the dataset as input, returns the grid size. I rewrote it from the som_topol_struct() function of Matlab's Self-Organizing Maps Toolbox into an R function.
topology = function(data)
{
  # For a hexagonal lattice, determine the number of neurons (munits)
  # and their arrangement (msize).
  D = data
  # munits: number of map units (hexagons)
  # dlen:   number of records (subjects)
  dlen = dim(data)[1]
  dim = dim(data)[2]
  munits = ceiling(5 * dlen^0.5)  # Matlab toolbox heuristic (the som_make docs quote 5*dlen^0.54321)
  # Mean-center each column, ignoring non-finite values.
  A = matrix(Inf, nrow = dim, ncol = dim)
  for (i in 1:dim)
  {
    D[, i] = D[, i] - mean(D[is.finite(D[, i]), i])
  }
  # Covariance-like matrix computed over the finite entries.
  for (i in 1:dim) {
    for (j in i:dim) {
      c = D[, i] * D[, j]
      c = c[is.finite(c)]
      A[i, j] = sum(c) / length(c)
      A[j, i] = A[i, j]
    }
  }
  # The ratio of the two largest eigenvalues sets the side-length ratio of the grid.
  VS = eigen(A)
  eigval = sort(VS$values)
  if (eigval[length(eigval)] == 0 | eigval[length(eigval) - 1] * munits < eigval[length(eigval)]) {
    ratio = 1
  } else {
    ratio = sqrt(eigval[length(eigval)] / eigval[length(eigval) - 1])
  }
  size1 = min(munits, round(sqrt(munits / ratio * sqrt(0.75))))
  size2 = round(munits / size1)
  return(list(munits = munits, msize = sort(c(size1, size2), decreasing = TRUE)))
}
Hope it helps...
Iván Vallés-Pérez
I don't have a reference for it, but I would suggest starting off by using approximately 10 SOM neurons per expected class in your dataset. For example, if you think your dataset consists of 8 separate components, go for a map with 9x9 neurons. This is completely just a ballpark heuristic though.
If you'd like the data to drive the topology of your SOM a bit more directly, try one of the SOM variants that change topology during training:
Growing SOM
Growing Neural Gas
Unfortunately these algorithms involve even more parameter tuning than plain SOM, but they might work for your application.
Kohonen has written on the issue of selecting parameters and map size for the SOM in his book "MATLAB Implementations and Applications of the Self-Organizing Map". In some cases, he suggests that suitable initial values can be arrived at by testing several SOM sizes and checking that the cluster structures are shown with sufficient resolution and statistical accuracy.
My suggestions would be the following:
SOM is distantly related to correspondence analysis. In statistics, 5*r^2 is used as a rule of thumb, where r is the number of rows/columns in a square setup.
Usually one should use some criterion based on the data itself, meaning you need some criterion for estimating homogeneity. If a certain threshold is violated, you need more nodes. To check homogeneity you need a certain number of records per node. Again, from statistics you can learn that for simple tests (a small number of variables) you need around 20 records, and for more advanced tests on several variables at least 8 records.
Remember that the SOM represents a predictive model, so validation is key, absolutely mandatory. Yet validation of predictive models (see the type I / type II error entry in Wikipedia) is a subject of its own, and the acceptable risk as well as the risk structure depend fully on your purpose.
You may test the dynamics of the model's error rate by reducing its size more and more, then take the smallest one with acceptable error.
It is a strength of the SOM to allow for empty nodes. Yet there should not be too many of them; let's say less than 5%.
Taking all this together, from experience I would recommend the following criterion: an absolute minimum of 8-10 records per node, with the nodes falling below that making up no more than 5% of all clusters.
This 5% rule is of course a heuristic, but it can be justified by the general use of confidence levels in statistical tests. You may choose any percentage from 1% to 5%.