“Histogram and binning” technique for categorical variables publication and implementations - random-forest

H2O.ai have implemented the "histogram and binning" technique for efficient and accurate tree-building using categorical variables of high cardinality (>100): http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/gbm-faq/histograms_and_binning.html
Somewhere in their documentation, they have a reference to a publication that details the method but I can't seem to find it anymore - can anyone link to that publication?
Given that the method, which seems to be state-of-the-art for tree-building with categorical variables, is published - are there really no implementations other than H2O.ai's?
In sklearn, this feature has been brewing on GitHub for years but apparently still hasn't been released.
I asked the question previously on Data Science: https://datascience.stackexchange.com/questions/40241/histogram-and-binning-technique-for-categorical-variables-publication-and-impl
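For reference, the binning resolution for categorical splits surfaces in H2O's Python API through the nbins_cats parameter discussed in the FAQ linked above; below is a minimal usage sketch (the file path and column names are hypothetical):

```python
import h2o
from h2o.estimators import H2OGradientBoostingEstimator

h2o.init()

# Hypothetical training frame with a high-cardinality categorical column
train = h2o.import_file("train.csv")
train["merchant_id"] = train["merchant_id"].asfactor()  # mark the column as categorical

# nbins_cats controls how many histogram bins the categorical levels are grouped into
gbm = H2OGradientBoostingEstimator(ntrees=100, nbins_cats=1024)
gbm.train(x=["merchant_id", "amount"], y="label", training_frame=train)
```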

Related

Is it a good idea to use word2vec for encoding of categorical features?

I am facing a binary prediction task and have a set of features, all of which are categorical. A key challenge is therefore to encode those categorical features as numbers, and I was looking for smart ways to do so.
I stumbled upon word2vec, which is mostly used for NLP, but I was wondering whether I could use it to encode my variables, i.e. simply take the weights of the neural net as the encoded features.
However, I am not sure whether it is a good idea, since the context words, which serve as the input features in word2vec, are in my case more or less random, in contrast to the real sentences word2vec was originally made for.
Do you have any advice, thoughts, or recommendations on this?
You should look into entity embeddings if you are searching for a way to use embeddings for categorical variables.
Google has a good crash course on the topic: https://developers.google.com/machine-learning/crash-course/embeddings/categorical-input-data
This is a good paper on arXiv, written by a team from a Kaggle competition: https://arxiv.org/abs/1604.06737
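For a concrete picture, here is a minimal entity-embedding sketch, assuming Keras (which the thread does not prescribe) and a single hypothetical categorical feature; after training, the Embedding weights are the learned category encodings and can be reused as features elsewhere:

```python
import numpy as np
import tensorflow as tf

n_categories = 1000   # hypothetical cardinality of one categorical feature
embedding_dim = 16    # size of the learned dense representation

# Input: integer-encoded category IDs; output: a binary target
inputs = tf.keras.Input(shape=(1,), dtype="int32")
x = tf.keras.layers.Embedding(input_dim=n_categories, output_dim=embedding_dim)(inputs)
x = tf.keras.layers.Flatten()(x)
x = tf.keras.layers.Dense(8, activation="relu")(x)
outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy")

# Train on hypothetical random data just to illustrate the mechanics
X = np.random.randint(0, n_categories, size=(256, 1))
y = np.random.randint(0, 2, size=(256, 1))
model.fit(X, y, epochs=2, verbose=0)

# The embedding matrix: one learned vector per category level
category_vectors = model.layers[1].get_weights()[0]  # shape (n_categories, embedding_dim)
```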
It's certainly possible to use the word2vec algorithm to train up 'dense embeddings' for things like keywords, tags, categories, and so forth. It's been done, sometimes beneficially.
Whether it's a good idea in your case will depend on your data & goals; the only way to know for sure is to try it and evaluate the results against your alternatives. (For example, if the number of categories is modest and drawn from a controlled vocabulary, one-hot encoding of the categories may be practical, and depending on the kind of binary classifier you use downstream, the classifier may itself be able to learn the same sorts of subtle interrelationships between categories that could otherwise be learned via a word2vec model. On the other hand, if categories are very numerous & chaotic, the pre-step of 'compressing' them into a smaller-dimensional space, where similar categories have similar representational vectors, may be more helpful.)
That such tokens don't quite have the same frequency distributions & surrounding contexts as true natural language text may mean it's worth trying a wider range of non-default training options on any word2vec model.
In particular, if your categories don't have a natural ordering giving rise to meaningful near-neighbors relationships, using a giant window (so all words in a single 'text' are in each others' contexts) may be worth considering.
Recent versions of the Python gensim Word2Vec implementation allow changing a parameter named ns_exponent, which was fixed at 0.75 in many early implementations but which at least one paper has suggested can usefully be varied far from that value for certain corpora and recommendation-like applications.
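For instance, here is a minimal gensim sketch treating each record's category tokens as one 'text', with a large window and a non-default ns_exponent (the token lists and parameter values are hypothetical starting points to experiment with, not recommendations):

```python
from gensim.models import Word2Vec

# Each row's categorical values become one "sentence" of tokens (hypothetical data)
rows = [
    ["color=red", "brand=acme", "country=de"],
    ["color=blue", "brand=acme", "country=fr"],
    ["color=red", "brand=globex", "country=de"],
]

model = Word2Vec(
    sentences=rows,
    vector_size=16,   # dimensionality of the category embeddings ("size" in gensim < 4.0)
    window=100,       # huge window: every token in a row is in every other token's context
    min_count=1,      # keep rare categories
    sg=1,             # skip-gram
    ns_exponent=0.0,  # non-default negative-sampling exponent (0.75 is the classic default)
    epochs=50,
)

# Dense vector for one category value, usable as an encoded feature downstream
vec = model.wv["color=red"]
```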

Different performance by different ML classifiers, what can I deduce?

I have used an ML approach in my research using Python's scikit-learn. I found that SVM and logistic regression classifiers work best (e.g. 85% accuracy), decision trees work markedly worse (65%), and Naive Bayes works worse still (40%).
I will write up the conclusion stating the obvious, that some ML classifiers worked better than others by a large margin, but what else can I say about my learning task or data structure based on these observations?
Edit:
The data set has 500,000 rows and 15 features, but some of the features are various combinations of substrings of text, so it naturally expands to tens of thousands of columns as a sparse matrix. I am using people's names to predict a binary class (e.g. gender), and I engineer many features from the name, such as its length, its substrings, etc.
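For context, here is a minimal sketch of the kind of substring feature expansion described, assuming scikit-learn's CountVectorizer with character n-grams (the names and target values below are hypothetical):

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer

names = ["Maria", "John", "Anna", "Peter"]   # hypothetical names
genders = [1, 0, 1, 0]                        # hypothetical binary target

# Substrings of each name as character n-grams -> a wide, sparse matrix
vectorizer = CountVectorizer(analyzer="char_wb", ngram_range=(2, 4), lowercase=True)
X_ngrams = vectorizer.fit_transform(names)

# Additional engineered features, e.g. name length, stacked onto the sparse matrix
lengths = csr_matrix(np.array([[len(n)] for n in names]))
X = hstack([X_ngrams, lengths])
print(X.shape)  # (4, number_of_ngram_columns + 1)
```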
I recommend you visit this awesome map on choosing the right estimator by the scikit-learn team: http://scikit-learn.org/stable/tutorial/machine_learning_map
As describing the specifics of your own case would be an enormous task (I totally understand you didn't do it!), I encourage you to ask yourself several questions, and the map on 'choosing the right estimator' is a good start.
Literally, go to the 'start' node in the map and follow the path:
is my number of samples > 50?
And so on. Eventually you will end up at some estimator; check whether your results match the map's recommendations (i.e. did I end up at an SVM, which gives me better results?). If so, go deeper into the documentation and ask yourself why that classifier performs better on text data, or whatever insight applies to your case.
As I said, we don't know the specifics of your data, but you should be able to ask questions like: what type of data do I have (text, binary, ...), how many samples, how many classes to predict, ...? Ideally your data will give you some hints about the context of your problem, and therefore about why some estimators perform better than others.
But yes, your question is too broad to grasp in a single answer (especially without knowing the type of problem you are dealing with). You could also check whether any of those approaches is more inclined to overfit, for example.
The list of recommendations could be endless, which is why I encourage you to start by defining the type of problem and data you are dealing with (in addition to the number of samples: is it normalized? Is it sparse? Are you representing text as a sparse matrix? Are your inputs floats from 0.11 to 0.99?).
Anyway, if you want to share some specifics on your data we might be able to answer more precisely. Hope this helped a little bit, though ;)

Scikit - How to get single term for similar words using sklearn

I'm new to text analysis and scikit-learn. I am trying to vectorize tweets using sklearn's TfidfVectorizer class. When I listed the terms using get_feature_names() after vectorizing the tweets, I saw similar words such as 'goal', 'gooooal' or 'goaaaaaal' as different terms.
The question is: how can I map such similar but different words to a single term, 'goal', using sklearn feature extraction techniques (or any other technique) to improve my results?
In short: you can't, in general. This is a very complex problem that reaches all the way into language understanding. Think for a moment: can you define exactly what it means to be "similar but different"? If you can't, a computer won't be able to either. What can you do?
You can come up with simple preprocessing rules, such as "collapse any repeated letters", which will fix the "goal" problem (this should not lead to any further problems); see the sketch after this list.
You can use existing databases of synonyms (like WordNet) to merge words with the same meaning into the same token (this might cause false positives: you might merge words of different meaning due to the lack of context analysis).
You can build a language model and use it to embed your data in a lower-dimensional space, forcing your model to merge similar meanings (using the well-known heuristic that "words that occur in similar contexts have similar meaning"). One such technique is Latent Semantic Analysis, but obviously there are many more.
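Here is a minimal sketch of the first and third options, assuming scikit-learn's TfidfVectorizer and TruncatedSVD (the regex collapses runs of three or more identical characters, and the example tweets are hypothetical):

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

def collapse_repeats(text):
    # Collapse runs of 3+ identical characters to one, so 'gooooal' and
    # 'goaaaaaal' both normalize to 'goal'; legitimate double letters survive.
    return re.sub(r"(.)\1{2,}", r"\1", text.lower())

tweets = ["What a goal!", "GOOOOAL by the striker", "goaaaaaal in the last minute"]  # hypothetical

# Option 1: plug the rule in as the vectorizer's preprocessor
vectorizer = TfidfVectorizer(preprocessor=collapse_repeats)
X = vectorizer.fit_transform(tweets)
print(vectorizer.get_feature_names_out())  # 'goal' appears once; use get_feature_names() on older sklearn

# Option 3: Latent Semantic Analysis = truncated SVD of the tf-idf matrix
lsa = TruncatedSVD(n_components=2, random_state=0)
X_lsa = lsa.fit_transform(X)  # each tweet as a dense 2-dimensional vector
```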

General questions regarding text-classification

I'm new to topic models, classification, etc. I have been working on a project for a while now and have read a lot of research papers. My dataset consists of short messages that are human-labeled. This is what I have come up with so far:
Since my documents are short, I read about Latent Dirichlet Allocation (and all its variants), which is useful for detecting latent topics in a document.
Based on this I found JGibbLDA, a Java implementation (http://jgibblda.sourceforge.net), but since my data is labeled, there is an improvement of it called JGibbLabeledLDA (https://github.com/myleott/JGibbLabeledLDA).
In most of the research papers I read good reviews of Weka, so I experimented with it on my dataset.
However, again, my dataset is labeled, and therefore I found an extension of Weka called Meka (http://sourceforge.net/projects/meka/) that has implementations for multi-labeled data.
Reading about multi-labeled data, I know the most-used approaches, such as one-vs-all and classifier chains.
Now the reason I am here is that I hope to get an answer to the following questions:
Is LDA a good approach for my problem?
Should LDA be used together with a classifier (NB, SVM, Binary Relevance, Logistic Regression, …) or is LDA 'enough' to function as a classifier/estimator for new, unseen data?
How should I interpret the output coming from JGibbLDA / JGibbLabeledLDA? How do I get from these files to something that tells me which words/labels are assigned to the WHOLE message (not just to each word)?
How can I use Weka/Meka to get what I want in the previous question (in case LDA is not what I'm looking for)?
I hope someone, or more than one person, can help me figure out how to do this. The general idea is not the issue here; I just don't know how to go from literature to practice. Most of the papers either don't describe how they perform their experiments in enough detail OR are too technical for my background in these topics.
Thanks!
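Regarding the second question above, one common pattern (not prescribed anywhere in this thread) is to use LDA's per-document topic proportions as features for a separate classifier; here is a minimal scikit-learn sketch, assuming an sklearn workflow rather than JGibbLDA, with hypothetical messages and labels:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

messages = ["server is down again", "great match last night",
            "please reset my password", "who won the game"]   # hypothetical
labels = ["it", "sports", "it", "sports"]                      # hypothetical

# LDA turns each message into a topic-proportion vector;
# the classifier then learns to map those proportions to labels.
pipeline = make_pipeline(
    CountVectorizer(),
    LatentDirichletAllocation(n_components=2, random_state=0),
    LogisticRegression(),
)
pipeline.fit(messages, labels)
print(pipeline.predict(["password reset for the server"]))
```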

What does "Oracle" means in machine learning?

I'm reading "Diversity Creation Methods: A Survey and Categorisation". There is an explanation about the Q-statistic, “Take two classifiers, fi and fj. Over a large set of testing patterns, they will exhibit certain conicident errors, and therefore a probability of error coincidence can be calculated. These are also referred to as the Oracle outputs”
AFAIK "Oracle" doesn't have any established meaning in ML in general. In this article it's just a synonym for an ensemble member. It's not used constantly throughout the article, so I'm guessing it's just a reference to the term used in some earlier formulation of the method discussed.
Sometimes "Oracle" stands for an imaginary source of knowledge about the target function - a source of training/testing samples used in some kind of intuitive proofs or
thought experiments.

Resources