H2O Flow: How does H2O flow UI treat data types differently - machine-learning

Specifically, what is the difference in how H2O treats enum and string data types in contrast to 'int's and 'numerical' types?
For example, say I have a binary classifier that takes input samples that have features
x1=(1 of 10 possible favorite ice cream flavors (enum))
x2=(some random phrase (string))
x3=(some number (int))
What would be the difference in how the classifier treats these types during training?
When uploading data into h2o Flow UI, I get the option to convert certain data types (like enum) to 'numerical.' This makes me think that there is more than just string-to-number mapping going on when I just leave the 'enum' as an 'enum' (not converting to 'numerical' type), but I can't find information on what that difference is.
Clarification would be appreciated, thanks.

The "enum" type is the encoding you'll want to use for categorical features. If categorical features are encoded as "enum", the tree-based algorithms like Random Forest and GBM can handle them in a smart way. Most other implementations of RF and GBM force you to do a one-hot expansion of the categorical features (into K dummy columns), but in H2O the tree-based methods can use these features without any expansion. The exact way that the variables are handled can be controlled using the categorical_encoding argument.
If you have an ordered categorical variable, then it might be okay to encode that as "int"; however, the effect of doing that on model performance will depend on the data.
If you were to convert an "enum" column to "numeric" that would simply encode each category as an integer and you'd lose the notion that those numbers represent categories (so it's not recommended).
You should not use the "string" type in H2O unless you are going to exclude that column from the set of predictors. It would make sense to use a "string" column for text, but you'll probably want to parse (e.g. tokenize) that text to generate new numeric or enum features that will be included in the set of predictors.
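To make the "you'd lose the notion that those numbers represent categories" point concrete, here is a small pure-Python sketch. It is not H2O code: the function names are made up, and it assumes, as a simplification, that a tree splits a numeric column with one threshold and a categorical column by sending any subset of levels to one side. It lists the two-way groupings each kind of split can produce on four flavors.

```python
from itertools import combinations

def threshold_partitions(categories):
    """Two-way groupings reachable by one threshold on integer codes 0..K-1."""
    k = len(categories)
    return {frozenset(categories[:cut]) for cut in range(1, k)}

def subset_partitions(categories):
    """Two-way groupings reachable when any subset of levels may go left."""
    first, rest = categories[0], categories[1:]
    parts = set()
    # Keep the first level on the left so mirrored splits are counted once.
    for size in range(len(rest) + 1):
        for combo in combinations(rest, size):
            left = frozenset((first,) + combo)
            if len(left) < len(categories):  # skip the split that sends everything left
                parts.add(left)
    return parts

# Hypothetical levels, integer-coded in list order when treated as numeric.
flavors = ["chocolate", "mint", "strawberry", "vanilla"]
numeric_splits = threshold_partitions(flavors)
enum_splits = subset_partitions(flavors)
```

With the integer coding, a grouping such as {chocolate, strawberry} versus {mint, vanilla} cannot be reached in a single split, because the codes impose an order that the flavors do not really have.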

Related

Using a Word2Vec Model to Extract Data

I've used gensim Word2Vec to learn the embedding of monetary amounts and other numeric data in bank transaction memos. The goal is to use this to be able to extract these amounts and currencies from future input strings.
Design
Our input strings are something like
"AMAZON.COM TXNw98e7r3347 USD 49.00 # 1.283"
During preprocessing, I tokenize and also replace all tokens that have the possibility of being a monetary amount (string consisting only of digits, commas, and <= 1 decimal point/period) with a special VALUE_TOKEN. And I also manually replace exchange rates with RATE_TOKEN. The result would be
["AMAZON", ".COM", "TXNw", "98", "e", "7", "r", "3347", "USD", "VALUE_TOKEN", "#", "RATE_TOKEN"]
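A minimal sketch of the amount-replacement step, for reference. It assumes plain whitespace tokenization, which is simpler than the splitting shown above, and the names looks_like_amount and preprocess are only illustrative. Exchange rates are not detected here, since that replacement is done manually.

```python
import re

VALUE_TOKEN = "VALUE_TOKEN"

# Digits and commas with at most one period, and at least one digit overall.
AMOUNT_RE = re.compile(r"(?=.*\d)[\d,]*\.?[\d,]*")

def looks_like_amount(token):
    return AMOUNT_RE.fullmatch(token) is not None

def preprocess(memo):
    # Assumption: whitespace tokenization only; no splitting inside a token.
    return [VALUE_TOKEN if looks_like_amount(tok) else tok for tok in memo.split()]

tokens = preprocess("AMAZON.COM TXNw98e7r3347 USD 49.00 # 1.283")
```

Note that with this rule alone the rate 1.283 is also tagged as VALUE_TOKEN, which is exactly the ambiguity the question is about.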
With all my preprocessed lists of strings in list data, I generate model
model = Word2Vec(data, window=3, min_count=3)
The embeddings of model that I'm most interested in are that of VALUE_TOKEN, RATE_TOKEN, as well as any currencies (USD, EUR, CAD, etc.). Now that I generated the model, I'm not sure what to do with it.
Problem
Say I have a new string that the model has never seen before,
new_string = "EUR 299.99 RATE 1.3289 WITH FEE 5.00"
I would like to use model to identify which tokens of new_string are most contextually similar to VALUE_TOKEN (which should return ["299.99", "5.00"]) and which is closest to RATE_TOKEN ("1.3289"). It should be able to classify these based on the learned embedding. I can preprocess new_string the way I do with the training data, but because I don't know the exchange rate beforehand, all three tokens of ["299.99", "5.00", "1.3289"] will be tagged the same (either with VALUE_TOKEN or a new UNIDENTIFIED_TOKEN).
I've looked into methods like most_similar and similarity but don't think they work for tokens that are not necessarily in the vocabulary. What methods should I use to do this? Is this the right approach?
Word2vec's fuzzy, dense embedded token representations don't strike me as the right tool for what you're doing, though they might perhaps be an indirect contributor to a hybrid approach.
In particular:
The word2vec algorithm originated from, & has the most consistent public results when applied to, natural-language texts, with their particular patterns of relative token frequencies and varied co-occurrences. Certainly, many have applied it, with success, to other kinds of text/record data, but such uses may require a lot more preprocessing/parameter-tuning, and to the extent the underlying data has some fixed, highly-repetitive scheme, it might be more amenable to other approaches.
If you replace all known values with 'VALUE_TOKEN', & all known rates with 'RATE_TOKEN', then the model is only going to learn token-vectors for 'VALUE_TOKEN' & 'RATE_TOKEN'. Such a model won't be able to supply any vector for non-replaced tokens it's never seen like '$1.2345' or '299.99'. Even collapsing all those to 'UNIDENTIFIED_TOKEN' just limits the model to whatever it learned earlier was the vector for 'UNIDENTIFIED_TOKEN' (if any, in the training data).
I've not noticed existing word2vec implementations offering an interface for inferring the word-vector for a new unknown word from just one or several new examples of its appearance in context. (They could, in the same style of new-document-vector inference used by 'Paragraph Vectors'/Doc2Vec, but just don't.) The closest I've seen is Gensim's predict_output_word(), which does a CBOW-like forward-propagation on negative-sampling models, to every 'output node' (one per known word), to give a ranked list of the known words most likely to appear given some context words.
That predict_output_word() might, if fed surrounding known-tokens, contribute to your needs by whether it says your 'VALUE_TOKEN' or 'RATE_TOKEN' is a more-likely model-prediction. You could adapt its code to only evaluate those two candidates, if you're always sure the right answer is one or the other, for a speed-up. A simple comparison of the average-of-context-word-vectors, and the candidate-answer vectors, might be as effective as the full forward-propagation.
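The "average of context-word vectors versus candidate vectors" comparison could look roughly like this. It is a sketch under assumptions: the 3-d vectors are hand-made for illustration (in practice they would come from the trained model, e.g. model.wv[token] in Gensim), a single vector table is used for both context and candidates, and closest_candidate is a hypothetical helper, not a Gensim function.

```python
import math

def mean_vector(vectors):
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def closest_candidate(context_tokens, vectors, candidates):
    # Only context tokens the model knows can contribute a vector.
    known = [vectors[t] for t in context_tokens if t in vectors]
    if not known:
        return None
    centre = mean_vector(known)
    return max(candidates, key=lambda c: cosine(centre, vectors[c]))

# Hypothetical vectors, NOT learned from any data.
toy_vectors = {
    "VALUE_TOKEN": [0.9, 0.1, 0.0],
    "RATE_TOKEN": [0.1, 0.9, 0.0],
    "EUR": [0.8, 0.2, 0.1],
    "RATE": [0.0, 1.0, 0.1],
    "WITH": [0.3, 0.3, 0.9],
    "FEE": [0.9, 0.0, 0.2],
}
candidates = ["VALUE_TOKEN", "RATE_TOKEN"]
guess_for_1_3289 = closest_candidate(["RATE", "WITH"], toy_vectors, candidates)
guess_for_5_00 = closest_candidate(["FEE"], toy_vectors, candidates)
```

Whether this works on real data depends entirely on how distinct the learned contexts of the two placeholder tokens turn out to be.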
Alternatively, you might want to use the word2vec model solely as a source of features (via context-words) for some other classifier, which is trained to answer 'VALUE_TOKEN' or 'RATE_TOKEN'. This other classifier's input might include things like:
some average of the vectors of all nearby tokens
the full vectors of closest neighbors
a one-hot encoding ('bag-of-words') of all nearby (or 'preceding' or 'following') known-tokens, assuming the vocabulary of non-numerical tokens is fairly short & highly indicative
?
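As an illustration of the one-hot 'preceding'/'following' option in the list above, here is a small sketch. The vocabulary, the NUM placeholder and the context_features helper are all hypothetical; the resulting rows would be fed to whatever classifier you choose.

```python
def context_features(tokens, index, vocabulary, window=2):
    """One-hot indicators for known tokens before and after position `index`."""
    before = set(tokens[max(0, index - window):index])
    after = set(tokens[index + 1:index + 1 + window])
    features = [1 if word in before else 0 for word in vocabulary]
    features += [1 if word in after else 0 for word in vocabulary]
    return features

# Hypothetical short vocabulary of non-numeric tokens.
vocabulary = ["EUR", "USD", "RATE", "FEE", "WITH"]
# "NUM" stands for any numeric token whose role is still unknown.
tokens = ["EUR", "NUM", "RATE", "NUM", "WITH", "FEE", "NUM"]
first_num_row = context_features(tokens, 1, vocabulary)
rate_row = context_features(tokens, 3, vocabulary)
```

Keeping separate 'before' and 'after' blocks lets the classifier learn, for example, that RATE just before a number means something different from RATE just after it.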
If the data streams might include arbitrary new or corrupted tokens whose meaning might be inferrable from substrings, you could consider a FastText model as well.

How to deal with multiple categorical variables each with different cardinality?

I’m working with an auto dataset I found on kaggle. Besides numerical values like horsepower, car length, car weight etc., it has multiple categorical variables such as:
car type (sedan, suv, hatchback etc.): cardinality=5
car brand (toyota, Nissan, bmw etc.): cardinality=21
Doors (2door and 4door): cardinality=2
Fuel type (gas and diesel): cardinality =2
I would like to use a random forest classifier to perform feature selection with all these variables as input. I’m aware that the categorical variables need to be encoded before doing so. What is the best approach to handling data with such varying cardinalities?
Can I apply different encoding techniques to different variables? Say for example, one hot encoding on fuel type and label encoding on car type?
You can apply different encoding techniques to different variables. However, label encoding introduces hierarchy/order, which doesn't look appropriate for any of the predictors you mention. For this example it looks like one-hot-encoding all the categorical predictors would be better, unless there are some ordinal variables you haven't mentioned.
EDIT: in response to your comment. I would only ever use label encoding if a categorical predictor was ordinal. If they aren't, I would not try and enforce it, and would use one-hot-encoding if the model type couldn't cope with categorical predictors. Whether this causes an issue regarding sparse trees and too many predictors depends entirely on your dataset. If you still have many rows compared to predictors then it generally isn't a problem. You can run into issues with random forests if you have a lot of predictors that aren't correlated at all with the target variable. In this case, as predictors are chosen randomly, you can end up with lots of trees that don't contain any relevant predictors, creating noise. In this case you could try and remove non-relevant predictors before running the random forest model. Or you could try using a different type of model, e.g. penalized regression.
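To get a feel for how often randomly chosen predictors miss everything relevant, here is a back-of-the-envelope calculation. It assumes each split draws its candidate predictors uniformly without replacement; the predictor counts are invented for illustration.

```python
from math import comb

def prob_no_relevant(total, relevant, sampled):
    """Chance that `sampled` predictors drawn without replacement miss all relevant ones."""
    return comb(total - relevant, sampled) / comb(total, sampled)

# Hypothetical settings: 5 relevant predictors in both cases.
p_small = prob_no_relevant(total=10, relevant=5, sampled=3)
p_large = prob_no_relevant(total=100, relevant=5, sampled=10)
```

With 10 predictors the chance of a split seeing no relevant predictor is about 8%, while with 100 predictors it is roughly 58%, which is why pruning irrelevant predictors first can help.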

One-hot encoding in random forest classifier

Is one-hot encoding necessary for a random forest classifier in Python? I want to understand logically whether random forest can handle categorical features with label encoding rather than one-hot encoding.
The concept of encoding is necessary in machine learning because with the help of it, we can convert non-numeric features into numeric ones which is understandable by any model.
Any type of encoding can be applied to any non-numeric feature; the choice depends solely on intuition.
Now, coming to your question when to use label-encoding and when to use One-hot encoding:
Use Label-encoding - Use this when, you want to preserve the ordinal nature of your feature. For example, you have a feature of education level, which has string values like "Bachelor","Master","Ph.D". In this case, you want to preserve the ordinal nature that, Ph.D > Master > Bachelor hence you'll map using label-encoding like - Bachelor-1, Master-2, Ph.D-3.
Use One-hot encoding - Use this when, you want to treat your categorical variable with equal order. For example, you have colors variable which has values "red","yellow", "orange". Now, in this case any value has no precedence over other values, hence you'll use One hot encoding here.
NOTE: One-hot encoding increases the number of features, which is not good for tree-based algorithms like decision trees and random forests. That's why label encoding is mostly preferred in this case. If you still use one-hot encoding, you can check the importance of the categorical features with the feature_importances_ attribute of a fitted sklearn model, and drop a feature if its importance is low.
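The two encodings described above can be sketched in plain Python as follows. The helper names are made up for illustration; in practice you would likely use a library encoder.

```python
# Explicit order for the ordinal feature, as in the education example.
EDUCATION_ORDER = {"Bachelor": 1, "Master": 2, "Ph.D": 3}

def label_encode(values, order):
    return [order[v] for v in values]

def one_hot_encode(values):
    # One column per distinct category, in sorted order.
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

education_codes = label_encode(["Master", "Bachelor", "Ph.D"], EDUCATION_ORDER)
color_columns, color_rows = one_hot_encode(["red", "yellow", "orange", "red"])
```

Note how the color feature goes from one column to three, which is the growth in feature count the note warns about.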
Random forest is based on decision trees, which are sensitive to one-hot encoding. Sensitive here means that splitting on one-hot columns can result in sparse decision trees. The trees generally tend to grow in one direction because at every split of a dummy variable there are only two values (0 or 1). The tree grows in the direction of the zeroes in the dummy variables.
Now you must be wondering how to handle categorical values without one-hot encoding. For that you can refer to the Hashing Trick; you can also look into h2o Random Forest.
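For reference, a minimal sketch of the hashing trick: each category is hashed into one of a fixed number of buckets, so the feature width stays constant no matter how many categories appear. The bucket count and function name are arbitrary choices for illustration, and md5 is used only because it gives the same bucket on every run (Python's built-in hash() of strings varies between processes).

```python
import hashlib

def hashed_features(category, n_buckets=8):
    """Map a category string to a fixed-width indicator row."""
    digest = hashlib.md5(category.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % n_buckets
    row = [0] * n_buckets
    row[bucket] = 1
    return row

toyota_row = hashed_features("toyota")
```

Different categories can collide in the same bucket; a larger n_buckets makes that less likely at the cost of more columns.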

Nominal valued dataset in machine learning

What's the best way to use nominal value as opposed to real or boolean ones for being included in a subset of feature vector for machine learning?
Should I map each nominal value to real value?
For example, if I want my program to learn a predictive model for a web service's users, whose input features may include
{ gender(boolean), age(real), job(nominal) }
where dependent variable may be the number of web-site login.
The variable job may be one of
{ PROGRAMMER, ARTIST, CIVIL SERVANT... }.
Should I map PROGRAMMER to 0, ARTIST to 1 and etc.?
Do a one-hot encoding, if anything.
If your data has categorical attributes, it is recommended to use an algorithm that can deal with such data well without the hack of encoding, e.g. decision trees and random forests.
If you read the book called "Machine Learning with Spark", the author wrote,
Categorical features
Categorical features cannot be used as input in their raw form, as they are not numbers; instead, they are members of a set of possible values that the variable can take. In the example mentioned earlier, user occupation is a categorical variable that can take the value of student, programmer, and so on.
:
To transform categorical variables into a numerical representation, we can use a common approach known as 1-of-k encoding. An approach such as 1-of-k encoding is required to represent nominal variables in a way that makes sense for machine learning tasks. Ordinal variables might be used in their raw form but are often encoded in the same way as nominal variables.
:
I had exactly the same thought.
I think that if there is a meaningful (well-designed) transformation function that maps categorical (nominal) values to real values, I may also use learning algorithms that only take numerical vectors.
Actually, I've done some projects where I had to do it that way, and no issue was raised concerning the performance of the learning system.
To someone who took a vote against my question,
please cancel your evaluation.

How to include words as numerical feature in classification

What's the best method to use the words themselves as features in a machine learning algorithm?
My problem is that I have to extract word-related features from a particular paragraph. Should I use the index in the dictionary as the numerical feature? If so, how would I normalize these?
In general, how are words themselves used as features in NLP?
There are several conventional techniques by which words are mapped to features (columns in a 2D data matrix in which the rows are the individual data vectors) for input to machine learning models:
a Boolean field which encodes the presence or absence of that word in a given document;
a frequency histogram of a predetermined set of words, often the X most commonly occurring words from among all documents comprising the training data (more about this one in the last paragraph of this Answer);
the juxtaposition of two or more words (e.g., 'alternative' and 'lifestyle' in consecutive order have a meaning not related to either component word); this juxtaposition can either be captured in the data model itself, e.g., a boolean feature that represents the presence or absence of two particular words directly adjacent to one another in a document, or this relationship can be exploited in the ML technique, as a naive Bayesian classifier would do in this instance;
words as raw data to extract latent features, eg, LSA or Latent Semantic Analysis (also sometimes called LSI for Latent Semantic Indexing). LSA is a matrix decomposition-based technique which derives latent variables from the text not apparent from the words of the text itself.
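The first three techniques in the list above (presence, frequency, adjacency) can be sketched in a few lines. The vocabulary, the bigram list and the word_features helper are hypothetical, and tokenization is just lowercase whitespace splitting; LSA is not covered here.

```python
from collections import Counter

def word_features(document, vocabulary, bigrams):
    words = document.lower().split()
    counts = Counter(words)
    pairs = set(zip(words, words[1:]))  # directly adjacent word pairs
    presence = [1 if w in counts else 0 for w in vocabulary]
    frequency = [counts[w] for w in vocabulary]
    adjacency = [1 if pair in pairs else 0 for pair in bigrams]
    return presence, frequency, adjacency

# Hypothetical predetermined word set and word pair.
vocabulary = ["the", "alternative", "lifestyle", "music"]
bigrams = [("alternative", "lifestyle")]
presence, frequency, adjacency = word_features(
    "The alternative lifestyle suits the band", vocabulary, bigrams
)
```

Each returned list is one block of columns in the data matrix row for that document.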
A common reference data set in machine learning is comprised of frequencies of 50 or so of the most common words, aka "stop words" (e.g., a, an, of, and, the, there, if) for published works of Shakespeare, London, Austen, and Milton. A basic multi-layer perceptron with a single hidden layer can separate this data set with 100% accuracy. This data set and variations on it are widely available in ML Data Repositories and academic papers presenting classification results are likewise common.
Standard approach is the "bag-of-words" representation where you have one feature per word, giving "1" if the word occurs in the document and "0" if it doesn't occur.
This gives lots of features, but if you have a simple learner like Naive Bayes, that's still OK.
"Index in the dictionary" is a useless feature, I wouldn't use it.
tf-idf is a pretty standard way of turning words into numeric features.
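A bare-bones tf-idf computation, to show what the numbers mean. This sketch uses one common textbook variant (term frequency divided by document length, idf = log(N / document frequency)); libraries often apply smoothing or normalization, so their values will differ. It assumes non-empty documents and whitespace tokenization.

```python
import math
from collections import Counter

def tf_idf(documents):
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Number of documents each word appears in.
    doc_freq = Counter(word for words in tokenized for word in set(words))
    rows = []
    for words in tokenized:
        counts = Counter(words)
        rows.append({
            word: (count / len(words)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return rows

docs = ["the cat sat", "the dog barked", "the cat purred"]
weights = tf_idf(docs)
```

A word that occurs in every document (here "the") gets weight 0, while a word unique to one document gets the highest weight in it.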
You need to remember to use a learning algorithm that supports numeric features, like SVM. Naive Bayes doesn't support numeric features.
