h2o DRF unseen categorical values handling

The documentation for DRF states
What happens when you try to predict on a categorical level not seen
during training?
DRF converts a new categorical level to a NA value in
the test set, and then splits left on the NA value during scoring. The
algorithm splits left on NA values because, during training, NA values
are grouped with the outliers in the left-most bin.
1. So h2o converts unseen levels to NAs and then treats them the same way as NAs in the training data. But what if there are also no NAs in the training data?
2. Assume my categorical predictor is of enum type and is to be understood as non-ordinal. What does "grouped with the outliers in the left-most bin" mean then? If the predictor is non-ordinal, there is no "left-most" and there are no "outliers".
3. Let's put questions 1 and 2 aside and focus on the part "The algorithm splits left on NA values because, during training, NA values are grouped with the outliers in the left-most bin". This contradicts this SO answer showing a single DRF tree derived from a MOJO, where one can clearly see that NAs go both left and right. It also contradicts the answer to another question in the documentation that says "missing values as a separate category [...] can go either left or right", see
How does the algorithm handle missing values during training? Missing
values are interpreted as containing information (i.e., missing for a
reason), rather than missing at random. During tree building, split
decisions for every node are found by minimizing the loss function and
treating missing values as a separate category that can go either left
or right.
The last point is more of a suggestion than a question. The documentation on missing values for GBM says
What happens when you try to predict on a categorical level not seen
during training? Unseen categorical levels are turned into NAs, and
thus follow the same behavior as an NA. If there are no NAs in the
training data, then unseen categorical levels in the test data follow
the majority direction (the direction with the most observations). If
there are NAs in the training data, then unseen categorical levels in
the test data follow the direction that is optimal for the NAs of the
training data.
In contrast to the description of how DRF handles missing values, this seems to be completely consistent. Plus: using the majority path rather than always going left at split points appears to be more natural.

The sentence you pointed to, which seemed to contradict other portions of the docs, is in fact outdated. I have made a Jira ticket to update the FAQ with the correct answer (which is what you see in the GBM missing values section, i.e. the missing value handling is the same for GBM and DRF).
As a side note, the enum data type is internally encoded as numeric values; you can learn more about the types of mappings H2O can use here: http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/algo-params/categorical_encoding.html. For example, after the strings are mapped to integers for Enum, you can split {0, 1, 2, 3, 4, 5} as {0, 4, 5} and {1, 2, 3}.
Or take a look at how h2o-3 does binning for categoricals here: http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/gbm-faq/histograms_and_binning.html
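The rule from the GBM FAQ, which per the answer above also applies to DRF, can be sketched in a few lines. This is only an illustration under assumptions, not H2O's implementation: `CategoricalSplit`, `choose_na_direction` and `route` are hypothetical names, and the levels are assumed to be already mapped to integers as in the {0, 4, 5} / {1, 2, 3} example.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class CategoricalSplit:
    """Hypothetical split node: integer-coded levels in left_levels go left."""
    left_levels: FrozenSet[int]
    na_goes_left: bool  # direction for NAs, fixed at training time


def choose_na_direction(n_left: int, n_right: int,
                        best_for_training_nas: Optional[bool]) -> bool:
    """Return True if NAs should go left at this node.

    best_for_training_nas is the loss-minimizing direction found for the NAs
    in the training data, or None if the training data had no NAs. In the
    latter case the majority direction is used.
    """
    if best_for_training_nas is not None:
        return best_for_training_nas
    return n_left >= n_right


def route(split: CategoricalSplit, level: Optional[int],
          training_levels: FrozenSet[int]) -> str:
    """Route one value; an unseen level is treated exactly like an NA."""
    if level is None or level not in training_levels:
        return "left" if split.na_goes_left else "right"
    return "left" if level in split.left_levels else "right"


levels = frozenset(range(6))
# Assumed scenario: no NAs in training, 30 rows went left and 70 went right.
split = CategoricalSplit(frozenset({0, 4, 5}), choose_na_direction(30, 70, None))
print(route(split, 4, levels))  # a seen level that belongs to the left set
print(route(split, 9, levels))  # an unseen level follows the majority direction
```

With NAs present in training, the second argument pair is ignored and the unseen level simply follows whatever direction was best for those NAs.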


Loss function for Question Answering posed as Multiclass Classification?

I'm working on a question answering problem with limited data (~10,000s of data points) and very few features for both the context/question as well as the options/choices. Given:
a question Q and
options A, B, C, D, E (each characterized by some features, say, string similarity to Q or number of words in each option)
(while training) a single correct answer, say B.
I wish to predict exactly one of these as the correct answer. But I'm stuck because:
If I arrange ground truth as [0 1 0 0 0], and give the concatenation of QABCDE as input, then the model will behave as if classifying an image into dog, cat, rat, human, bird, i.e. each class will have a meaning. However, that's not true here: if I switched the input to QBCDEA, the prediction should be [1 0 0 0 0].
If I split each data point into 5 data points, i.e. QA:0, QB:1, QC:0, QD:0, QE:0, then the model fails to learn that they're in fact interrelated, and only one of them must be predicted as 1.
One approach that seems viable is to make a custom loss function which penalizes multiple 1s for a single question, and which penalizes no 1s as well. But I think I might be missing something very obvious here :/
I'm also aware of how large models like BERT do this over SQuAD-like datasets. They add positional embeddings to each option (e.g. A gets 1, B gets 2), and then use a sort of concatenation over QA1 QB2 QC3 QD4 QE5 as input, and [0 1 0 0 0] as output. Unfortunately, I believe this will not work in my case given the very small dataset I have.
The problem you're having is that you removed all useful information from your "ground truth". The training target is not the ABCDE labels -- the target is the characteristics of the answers that those letters briefly represent.
Those five labels are merely array subscripts for a shuffled subset of your training space: an ordered selection of 5 answers out of n, i.e. one of P(n, 5) possible arrangements. Bottom line: there is no information in those labels.
Rather, extract the salient characteristics from those answers. Your training needs to find the answer (characteristic set) that sufficiently matches the question. As such, what you're doing is close to multi-label training.
Multi-label models should handle this situation. This will include those that label photos, identifying multiple classes represented in the input.
Does that get you moving?
Response to OP comment
You understand correctly: predicting 0/1 for five arbitrary responses is meaningless to the model; the single-letter variables are of only transitory meaning, and have no relation to anything trainable.
A short thought experiment will demonstrate this. Imagine that we sort the answers such that A is always the correct answer; this doesn't change the information in the inputs and outputs; it's a valid arrangement of the multiple-choice test.
Train the model; we'll get to 100% accuracy in short order. Now, consider the model weights: what has the model learned from the input? Nothing -- the weights will train to ignore the input and select A, or will have absolutely arbitrary values that come to the A conclusion.
You need to ignore the ABCDE designations entirely; the target information is in the answers themselves, not in those letters. Since you haven't posted any sample cases, we have little to guide us for an alternate approach.
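One way to make a model ignore the letter designations, offered here as an assumption rather than something taken from the posts above: score every option with the same weights, then apply a softmax across the options of one question so that exactly one of them wins. The function names, the linear scorer and the feature values are illustrative only.

```python
import math
from typing import List, Sequence


def score(weights: Sequence[float], features: Sequence[float]) -> float:
    """Shared linear scorer: the same weights are used for every option."""
    return sum(w * x for w, x in zip(weights, features))


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over the options of a single question."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def option_probabilities(weights: Sequence[float],
                         options: Sequence[Sequence[float]]) -> List[float]:
    """Probability of each option being correct; the values sum to 1."""
    return softmax([score(weights, o) for o in options])


def cross_entropy(probs: Sequence[float], correct_index: int) -> float:
    """Loss for one question, given the position of the correct option."""
    return -math.log(probs[correct_index])


# Made-up features per option, e.g. (string similarity to Q, number of words).
weights = [2.0, -0.5]
options = [[0.1, 3.0], [0.9, 2.0], [0.3, 5.0]]
probs = option_probabilities(weights, options)

# Reordering the options reorders the probabilities in the same way,
# so the position (the letter) carries no meaning of its own.
rotated = options[1:] + options[:1]
probs_rot = option_probabilities(weights, rotated)
```

Because the scorer only ever sees one option's features at a time, shuffling A to E cannot change which answer is preferred, while the softmax still ties the five scores together into a single choice.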
If your paradigm is a typical multiple-choice examination, with few restrictions on the questions and answers, then the problem you're tackling is far larger than your project is likely to solve -- you're in "Watson" territory, requiring a large knowledge base and a strong NLP system to parse the inputs and available responses.
If you have a restricted paradigm for the answers, perhaps you can parse them into phrases and relations, yielding a finite set of classes to consider in your training. In this case, a multi-label model might well be able to solve your problem.
If your application is open-ended, i.e. open topic, then I expect that you need a different model class (such as BERT), but you'll still need to consider the five answers as text sequences, not as letters. You need a holistic match to the subject at hand. If this is a typical multiple-choice exam, then your model will still have classification troubles, as all five answers are likely to be on topic; finding the correct answer should depend on some level of semantic insight into question and answer, something stronger than "bag of words" processing.

one hot encoding of output labels

While I understand the need to one hot encode features in the input data, how does one hot encoding of output labels actually help? The TensorFlow MNIST tutorial encourages one hot encoding of output labels. The first assignment in CS231n (Stanford), however, does not suggest one hot encoding. What's the rationale behind choosing / not choosing to one hot encode output labels?
Edit: Not sure about the reason for the downvote, but just to elaborate more, I missed out mentioning the softmax function along with the cross entropy loss function, which is normally used in multinomial classification. Does it have something to do with the cross entropy loss function?
Having said that, one can calculate the loss even without the output labels being one hot encoded.
A one-hot vector is used in cases where the output is not ordinal, i.e. the labels have no natural order. Let's assume you encode your output as integers, giving each label a number.
The integer values have a natural ordered relationship between each other and machine learning algorithms may be able to understand and harness this relationship, but your labels may be unrelated. There may be no similarity in your labels. For categorical variables where no such ordinal relationship exists, the integer encoding is not good.
In fact, using this encoding and allowing the model to assume a natural ordering between categories may lead to unexpected results where model predictions fall halfway between categories.
What do I mean by that?
The idea is that if we train an ML algorithm - for example a neural network - it’s going to think that a cat (which is 1) is halfway between a dog and a bird, because they are 0 and 2 respectively. We don’t want that; it’s not true and it’s an extra thing for the algorithm to learn.
The same may happen when data is encoded in n dimensional space and vector has a continuous value. The result may be hard to interpret and map back to labels.
In this case, a one-hot encoding can be applied to the label representation, as it has a clear interpretation and its values are separated: each one lies in a different dimension.
If you need more information or would like to see the reason for one-hot encoding from the perspective of the loss function, see https://www.linkedin.com/pulse/why-using-one-hot-encoding-classifier-training-adwin-jahn/
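On the loss-function side, here is a small sketch of why both conventions coexist: with softmax outputs and cross entropy, a one-hot target and a plain integer label give the same loss, because the one-hot vector only selects the log-probability of the true class. The function names and the probabilities are made up for illustration.

```python
import math
from typing import List, Sequence


def one_hot(label: int, num_classes: int) -> List[float]:
    """Turn an integer label into a one-hot target vector."""
    return [1.0 if i == label else 0.0 for i in range(num_classes)]


def cross_entropy_one_hot(probs: Sequence[float],
                          target: Sequence[float]) -> float:
    """-sum_k t_k * log(p_k); terms with t_k == 0 contribute nothing."""
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)


def cross_entropy_index(probs: Sequence[float], label: int) -> float:
    """Same loss, computed by indexing with the integer label."""
    return -math.log(probs[label])


probs = [0.7, 0.2, 0.1]  # assumed softmax output for three classes
label = 1
loss_a = cross_entropy_one_hot(probs, one_hot(label, 3))
loss_b = cross_entropy_index(probs, label)
print(math.isclose(loss_a, loss_b))
```

So the integer label here is only an index into the output vector, not a number the model is asked to regress onto, which is a different situation from the dog/cat/bird ordering problem described above.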

How do UnknownCategoricalLevels affect the confidence values of H2O model predictions

I am using a DRF model generated with h2o flow. When running fresh input data against this model (using its MOJO in a java program with the EasyPredictModelWrapper), there are a large number of UnknownCategoricalLevels (checking with the getUnknownCategoricalLevelsSeen() and getUnknownCategoricalLevelsSeenPerColumn() methods).
My workaround for this was to only use those predictions that had a prediction confidence above a certain threshold (say 0.90). I.e. the classProbability selected by the model must be greater than the threshold to be used.
My questions are:
Is this solution wrong-headed, i.e. does it fail to actually address or work around the problem (e.g. because unknown levels don't actually affect the class probability values), or is it a valid workaround?
Is there a better way to address this issue?
The unknown categorical level is treated as an NA for that column.
Without knowing the details of your data (including the cost implications of false positives and false negatives), I wouldn't say that you need to threshold rows that have NAs any differently than for rows that do not. (The NA is already handled quite well by DRF.)
Note the built-in threshold is max-F1 (not 0.5). So if you are changing the threshold for rows with unknown values, it's relative to max-F1 (not 0.5). Using your own threshold is certainly a valid approach.
If you want to visualize your trees to more easily see how the NAs behave, you can do so following the instructions here:
There are also other strategies for dealing with it, like target-encoding your categorical input column and treating an NA as the average target value. (This effectively turns a categorical variable into a numeric one, but requires you to preprocess the data.)
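A bare-bones sketch of that target-encoding idea, assuming a binary 0/1 target and plain per-level means. The helper names and the data are hypothetical, and a real pipeline would normally add smoothing or out-of-fold estimates to limit target leakage.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple


def fit_target_encoding(
    rows: Iterable[Tuple[Optional[str], float]]
) -> Tuple[Dict[str, float], float]:
    """Return (mean target per level, overall mean target)."""
    sums: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    total, n = 0.0, 0
    for level, target in rows:
        total += target
        n += 1
        if level is not None:  # NAs only contribute to the overall mean
            sums[level] += target
            counts[level] += 1
    encoding = {lvl: sums[lvl] / counts[lvl] for lvl in sums}
    return encoding, total / n


def encode(level: Optional[str], encoding: Dict[str, float],
           global_mean: float) -> float:
    """NA and unseen levels both fall back to the average target value."""
    return encoding.get(level, global_mean)


train = [("red", 1.0), ("red", 0.0), ("blue", 1.0), ("blue", 1.0)]
encoding, global_mean = fit_target_encoding(train)
print(encode("red", encoding, global_mean))    # a level seen in training
print(encode("green", encoding, global_mean))  # an unseen level
```

The encoded column is numeric, so an unknown level at scoring time becomes an ordinary value instead of an NA.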

What type of ML is this? Algorithm to repeatedly choose 1 correct candidate from a pool (or none)

I have a set of 3-5 black box scoring functions that assign positive real value scores to candidates.
Each is decent at ranking the best candidate highest, but they don't always agree--I'd like to find how to combine the scores together for an optimal meta-score such that, among a pool of candidates, the one with the highest meta-score is usually the actual correct candidate.
So they are plain R^n vectors, but each dimension individually tends to have higher value for correct candidates. Naively I could just multiply the components, but I hope there's something more subtle to benefit from.
If the highest score is too low (or perhaps the two highest are too close), I just give up and say 'none'.
So for each trial, my input is a set of these score-vectors, and the output is which vector corresponds to the actual right answer, or 'none'. This is kind of like tech interviewing where a pool of candidates are interviewed by a few people who might have differing opinions but in general each tend to prefer the best candidate. My own application has an objective best candidate.
I'd like to maximize correct answers and minimize false positives.
More concretely, my training data might look like many instances of
{[0.2, 0.45, 1.37], [5.9, 0.02, 2], ...} -> i
where i is the ith candidate vector in the input set.
So I'd like to learn a function that tends to maximize the actual best candidate's score vector from the input. There are no degrees of bestness. It's binary right or wrong. However, it doesn't seem like traditional binary classification because among an input set of vectors, there can be at most 1 "classified" as right, the rest are wrong.
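For reference, the naive baseline described above (multiply the component scores, then give up if the best meta-score is too low or the top two are too close) could look like the sketch below. The thresholds `min_score` and `min_margin` are hypothetical knobs that would have to be tuned on the training data.

```python
import math
from typing import List, Optional, Sequence


def meta_score(scores: Sequence[float]) -> float:
    """Naive meta-score: the product of the individual positive scores."""
    return math.prod(scores)


def pick_candidate(candidates: Sequence[Sequence[float]],
                   min_score: float, min_margin: float) -> Optional[int]:
    """Return the index of the winning candidate, or None for 'none'."""
    if not candidates:
        return None
    metas: List[float] = [meta_score(c) for c in candidates]
    ranked = sorted(range(len(metas)), key=lambda i: metas[i], reverse=True)
    best = ranked[0]
    if metas[best] < min_score:
        return None  # the highest meta-score is too low
    if len(ranked) > 1 and metas[best] - metas[ranked[1]] < min_margin:
        return None  # the two highest meta-scores are too close
    return best


pool = [[0.2, 0.45, 1.37], [5.9, 0.02, 2.0]]
print(pick_candidate(pool, min_score=0.2, min_margin=0.05))
print(pick_candidate(pool, min_score=0.5, min_margin=0.05))
```

Any learned combination would replace `meta_score`, while the selection and the 'none' rule could stay the same.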
Your problem doesn't exactly belong in the machine learning category. The multiplication method might work better. You can also try different statistical models for your output function.
ML, and more specifically classification, problems need training data from which your network can learn any existing patterns in the data and use them to assign a particular class to an input vector.
If you really want to use classification then I think your problem can fit into the category of OnevsAll classification. You will need a network (or just a single output layer) with number of cells/sigmoid units equal to your number of candidates (each representing one). Note, here your number of candidates will be fixed.
You can use your entire candidate vector as input to all the cells of your network. The output can be specified using one-hot encoding i.e. 00100 if your candidate no. 3 was the actual correct candidate and in case of no correct candidate output will be 00000.
For this to work, you will need a big data set containing your candidate vectors and corresponding actual correct candidate. For this data you will either need a function (again like multiplication) or you can assign the outputs yourself, in which case the system will learn how you classify the output given different inputs and will classify new data in the same way as you did. This way, it will maximize the number of correct outputs but the definition of correct here will be how you classify the training data.
You can also use a different type of output where each cell of output layer corresponds to your scoring functions and 00001 means that the candidate your 5th scoring function selected was the right one. This way your candidates will not have to be fixed. But again, you will have to manually set the outputs of the training data for your network to learn it.
OnevsAll is a classification technique where there are multiple cells in the output layer and each performs binary classification between one of the classes and all the others. At the end, the sigmoid with the highest probability is assigned 1 and the rest zero.
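The decoding step can be sketched as follows. The 0.5 cut-off that produces the all-zero 'none' output is an assumption added for illustration; the description above only says that the largest sigmoid gets the 1.

```python
import math
from typing import List, Sequence


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def decode_one_vs_all(logits: Sequence[float],
                      threshold: float = 0.5) -> List[int]:
    """Set the unit with the highest sigmoid output to 1 and the rest to 0.

    If even the best unit stays below the (assumed) threshold, return all
    zeros, which stands for 'no correct candidate'.
    """
    probs = [sigmoid(z) for z in logits]
    best = max(range(len(probs)), key=probs.__getitem__)
    output = [0] * len(probs)
    if probs[best] >= threshold:
        output[best] = 1
    return output


print(decode_one_vs_all([-2.0, 0.3, 1.7, -0.5, -1.0]))   # [0, 0, 1, 0, 0]
print(decode_one_vs_all([-2.0, -1.0, -3.0, -1.5, -2.5]))  # [0, 0, 0, 0, 0]
```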
Once your system has learned how you classify data through your training data, you can feed your new data in and it will give you output in the same way i.e. 01000 etc.
I hope my answer was able to help you.:)

Neural Network Normalization of Nominal Data for 1 Output Neuron

I am new to machine learning and AI and started with NN recently.
Already got some information here on stackoverflow, but I don't understand the logic from the whole gathered information at the moment.
Let's take 4 nominal (but not ordinal) values [A, B, C, D] and 2 numericals already normalized [0.35, 0.55] - so 2 input neurons, one for nominal one for numerical.
I mostly see in NN literature you have to use 4 input neurons for encoding. But I don't need it to predict those nominal ones. I have only one output neuron that represents at most a relationship in the way if I would use it with expert systems and rules.
If I would normalize them to [0.2, 0.4, 0.6, 0.8] for example, isn't the NN able to distinguish between them? For the NN it's only a number, isn't it?
Naive approach and thinking:
A with 0.35 numerical leads to ideal 1.
B with 0.55 numerical leads to ideal 0.
C with 0.35 numerical leads to ideal 0.
D with 0.55 numerical leads to ideal 1.
Is there a mistake in my way of thinking about this approach?
Additional info (edit):
Those nominal values are included in the decision making (they are significant when measured with statistical tools in combination with the numerical values), depending on whether they are true or not. I know they can be binary encoded, but the list of nominal values is a little bit larger.
Other example:
Symptom A with blood test 1 leads to diagnosis X (the ideal)
Symptom B with blood test 1 leads to diagnosis Y (the ideal)
Actually expert systems are used. Symptoms are nominal values, but in combination with the blood test value you get the diagnosis. The main question finally: Do I have to encode symptoms in a binary way, or can I replace symptoms with numbers? If I can't replace them with numbers, why is binary representation the only way when using a NN?
Theoretically it doesn't really matter how you encode your inputs. As long as different samples are represented by different points in the input space, it is possible to separate them with a line, and that is what the input layer (if it's linear) is doing: it combines the inputs linearly. However, the way the data is laid out in the input space can have a huge impact on convergence time during learning. A simple way to see this: imagine a set of lines crossing the origin in 2D space. If your data is scattered around the origin, then it is likely that some of these lines will separate the data into parts, and few "moves" will be required, especially if the data is linearly separable. On the other hand, if your input data is dense and far from the origin, then most of the initial input discrimination lines won't even "hit" the data. So it will require a large number of weight updates to reach the data, and a large number of precise steps to "cut" it into initial categories.
If you have categories, then encoding them as binary is quite important. Imagine that you have three categories: A, B and C. If you encode them with three neurons as 1;0;0, 0;1;0 and 0;0;1, then during learning, and later with noisy data, a point about which the network is "not sure" can end up as 0.5;0.0;0.5 on the output layer. That makes sense if it is really something conceptually between A and C, but surely not B. If you chose one output neuron and encoded A, B and C as 1, 2 and 3, then for the same situation the network would give an output equal to the average of 1 and 3, which gives you 2! So the answer would be "definitely B", which is clearly wrong!
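The arithmetic behind this example, as a tiny sketch. "Not sure between A and C" is modelled here simply as the average of the two target encodings, which is an illustrative simplification of what a trained network would output.

```python
from typing import Dict, List, Sequence

# The two encodings from the example above.
ONE_HOT: Dict[str, List[float]] = {
    "A": [1.0, 0.0, 0.0],
    "B": [0.0, 1.0, 0.0],
    "C": [0.0, 0.0, 1.0],
}
INTEGER: Dict[str, float] = {"A": 1.0, "B": 2.0, "C": 3.0}


def average_one_hot(labels: Sequence[str]) -> List[float]:
    """Component-wise mean of the one-hot targets of the given labels."""
    n = len(labels)
    return [sum(ONE_HOT[label][k] for label in labels) / n for k in range(3)]


def average_integer(labels: Sequence[str]) -> float:
    """Mean of the integer codes of the given labels."""
    return sum(INTEGER[label] for label in labels) / len(labels)


print(average_one_hot(["A", "C"]))  # [0.5, 0.0, 0.5]: between A and C, not B
print(average_integer(["A", "C"]))  # 2.0: exactly the code of B
```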
