How do UnknownCategoricalLevels affect the confidence values of H2O model predictions - machine-learning

I am using a DRF model generated with h2o flow. When running fresh input data against this model (using its MOJO in a java program with the EasyPredictModelWrapper), there are a large number of UnknownCategoricalLevels (checking with the getUnknownCategoricalLevelsSeen() and getUnknownCategoricalLevelsSeenPerColumn() methods).
My workaround for this was to only use those predictions that had a prediction confidence above a certain threshold (say 0.90). I.e., the classProbability selected by the model must be greater than the threshold to be used.
My questions are:
Is this solution wrong-headed (i.e., does it fail to actually address or work around the problem, e.g. because unknown levels don't actually affect the class probability values), or is it a valid workaround?
Is there a better way to address this issue?
Thanks.

The unknown categorical level is treated as an NA for that column.
Without knowing the details of your data (including the cost implications of false positives and false negatives), I wouldn't say that you need to threshold rows that have NAs any differently than for rows that do not. (The NA is already handled quite well by DRF.)
Note the built-in threshold is max-F1 (not 0.5). So if you are changing the threshold for rows with unknown values, it's relative to max-F1 (not 0.5). Using your own threshold is certainly a valid approach.
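As a minimal sketch of that thresholding idea in Python (the probs dict and class names are made up for illustration; the same logic applies to the per-class probabilities returned by the Java EasyPredictModelWrapper):

# Hypothetical per-class probabilities for one prediction, e.g. copied from
# the MOJO wrapper's classProbabilities array (names are illustrative)
probs = {"cat_A": 0.08, "cat_B": 0.91, "cat_C": 0.01}

THRESHOLD = 0.90

best_class = max(probs, key=probs.get)
if probs[best_class] >= THRESHOLD:
    decision = best_class   # accept the model's prediction
else:
    decision = None         # below the confidence threshold; skip or flag for review
print(decision)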
If you want to visualize your trees to more easily see how the NAs behave, you can do so following the instructions here:
http://docs.h2o.ai/h2o/latest-stable/h2o-genmodel/javadoc/overview-summary.html#viewing-a-mojo
There are also other strategies for dealing with it, like target-encoding your categorical input column and treating an NA as the average target value. (This effectively turns a categorical variable into a numeric one, but requires you to preprocess the data.)
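A rough sketch of that target-encoding idea in pandas (the frame and column names are made up, and this assumes a numeric or binary target so the per-level mean is meaningful):

import pandas as pd

# Hypothetical training and scoring frames; "color" is the categorical column,
# "target" the response
train = pd.DataFrame({"color": ["red", "blue", "red", "green"], "target": [1, 0, 1, 0]})
new_data = pd.DataFrame({"color": ["blue", "purple"]})   # "purple" is an unseen level

level_means = train.groupby("color")["target"].mean()
global_mean = train["target"].mean()

# Unknown levels (and NAs) fall back to the overall target mean
new_data["color_encoded"] = new_data["color"].map(level_means).fillna(global_mean)
print(new_data)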

Related

best practices for using Categorical Variables in H2O?

I'm trying to use H2O's Random Forest for a multinomial classification into 71 classes with 38,000 training set examples. I have one feature that is a string and that in many cases is predictive, so I want to use it as a categorical feature.
The hitch is that even after canonicalizing the strings (uppercasing, stripping out numbers, punctuation, etc.), I still have 7,000 different strings (some due to spelling or OCR errors, etc.). I have code to remove strings that are relatively rare, but I'm not sure what a reasonable cutoff value is. (I can't seem to find any help in the documentation.)
I'm also not sure what to do with the nbins_cats hyperparameter. Should I make it equal to the number of different categorical values I have? [added: the default for nbins_cats is 1024 and I'm well below that at around 300 different categorical values, so I guess I don't have to do anything with this parameter]
I'm also thinking that perhaps if a categorical value is associated with too many of the different classes I'm trying to predict, maybe I should drop it as well.
I'm also guessing I need to increase the tree depth to handle this better.
Also, is there a special value to indicate "don't know" for the strings that I am filtering out? (I'm mapping it to a unique string but I'm wondering if there is a better value that indicates to H2O that the categorical value is unknown.)
Many thanks in advance.
High-cardinality categorical predictors can sometimes hurt model performance; specifically, in the case of tree-based models, the tree ensemble (GBM or Random Forest) ends up memorizing the training data and then generalizes poorly to validation data.
A good indication of whether this is happening is if your string/categorical column has very high variable importance. This means that the trees are continuing to split on this column to memorize the training data. Another indication is if you see much smaller error on your training data than on your validation data. This means the trees are overfitting to the training data.
Some methods for handling high cardinality predictors are:
removing the predictor from the model
performing categorical encoding [pdf]
performing grid search on nbins_cats and categorical_encoding
There is a Python example in the H2O tutorials GitHub repo that showcases the effects of removing the predictor from the model and performing grid search here.
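As a rough sketch of what such a grid search might look like with the h2o Python module (the file paths, column names, and hyperparameter values are illustrative; check the H2O documentation for the exact categorical_encoding options supported by your version):

import h2o
from h2o.estimators.random_forest import H2ORandomForestEstimator
from h2o.grid.grid_search import H2OGridSearch

h2o.init()
train = h2o.import_file("train.csv")   # hypothetical file
valid = h2o.import_file("valid.csv")   # hypothetical file
predictors = [c for c in train.columns if c != "label"]

hyper_params = {
    "nbins_cats": [16, 64, 256, 1024],
    "categorical_encoding": ["enum", "sort_by_response", "enum_limited"],
}
grid = H2OGridSearch(
    model=H2ORandomForestEstimator(ntrees=100, seed=42),
    hyper_params=hyper_params,
)
grid.train(x=predictors, y="label", training_frame=train, validation_frame=valid)
# Sort the grid by validation logloss to see which settings generalize best
print(grid.get_grid(sort_by="logloss", decreasing=False))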

What type of ML is this? Algorithm to repeatedly choose 1 correct candidate from a pool (or none)

I have a set of 3-5 black box scoring functions that assign positive real value scores to candidates.
Each is decent at ranking the best candidate highest, but they don't always agree, so I'd like to find out how to combine the scores into an optimal meta-score such that, among a pool of candidates, the one with the highest meta-score is usually the actual correct candidate.
So they are plain R^n vectors, but each dimension individually tends to have higher value for correct candidates. Naively I could just multiply the components, but I hope there's something more subtle to benefit from.
If the highest score is too low (or perhaps the two highest are too close), I just give up and say 'none'.
So for each trial, my input is a set of these score-vectors, and the output is which vector corresponds to the actual right answer, or 'none'. This is kind of like tech interviewing where a pool of candidates are interviewed by a few people who might have differing opinions but in general each tend to prefer the best candidate. My own application has an objective best candidate.
I'd like to maximize correct answers and minimize false positives.
More concretely, my training data might look like many instances of
{[0.2, 0.45, 1.37], [5.9, 0.02, 2], ...} -> i
where i is the ith candidate vector in the input set.
So I'd like to learn a function that tends to maximize the actual best candidate's score vector from the input. There are no degrees of bestness. It's binary right or wrong. However, it doesn't seem like traditional binary classification because among an input set of vectors, there can be at most 1 "classified" as right, the rest are wrong.
Thanks
Your problem doesn't fit neatly into a standard machine learning category. The multiplication method might work better. You can also try different statistical models for your output function.
ML problems, and more specifically classification problems, need training data from which your network can learn whatever patterns exist in the data and use them to assign a particular class to an input vector.
If you really want to use classification, then I think your problem fits into the category of one-vs-all classification. You will need a network (or just a single output layer) with a number of cells/sigmoid units equal to your number of candidates (each cell representing one candidate). Note that here your number of candidates will have to be fixed.
You can use your entire candidate vector as input to all the cells of your network. The output can be specified using one-hot encoding, i.e. 00100 if candidate no. 3 was the actual correct candidate, and 00000 if there is no correct candidate.
For this to work, you will need a big data set containing your candidate vectors and the corresponding actual correct candidate. To label that data you will either need a function (again, like multiplication) or you can assign the outputs yourself, in which case the system will learn how you classify the output given different inputs and will classify new data in the same way you did. This way it will maximize the number of correct outputs, but the definition of "correct" here will be how you classified the training data.
You can also use a different type of output where each cell of the output layer corresponds to one of your scoring functions, and 00001 means that the candidate your 5th scoring function selected was the right one. This way your number of candidates will not have to be fixed. But again, you will have to set the outputs of the training data manually for your network to learn them.
One-vs-all is a classification technique where there are multiple cells in the output layer and each performs a binary classification between one of the classes and all the others. At the end, the sigmoid with the highest probability is assigned 1 and the rest zero.
Once your system has learned how you classify data through your training data, you can feed your new data in and it will give you output in the same way i.e. 01000 etc.
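A minimal sketch of that one-vs-all idea with sigmoid output units, using tf.keras (the layer sizes, candidate count, and the random training data below are all made up for illustration):

import numpy as np
import tensorflow as tf

n_candidates = 5   # fixed pool size, as assumed above
n_scores = 3       # number of scoring functions per candidate

# Input: flattened score vectors for all candidates; output: one sigmoid per
# candidate, with an all-zero target meaning "no correct candidate"
model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu",
                          input_shape=(n_candidates * n_scores,)),
    tf.keras.layers.Dense(n_candidates, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

# Dummy training data, purely illustrative
X = np.random.rand(200, n_candidates * n_scores)
y = np.zeros((200, n_candidates))
y[np.arange(200), np.random.randint(0, n_candidates, 200)] = 1

model.fit(X, y, epochs=5, verbose=0)
probs = model.predict(X[:1])[0]
winner = probs.argmax() if probs.max() > 0.5 else None   # None -> "no correct candidate"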
I hope my answer was able to help you.:)

Machine learning kernels (how to check if the data is linearly separable in high-dimensional space using a given kernel)

How can I test/check whether a given kernel (example: RBF/ polynomial) does really separate my data?
I would like to know if there is a method (not plotting the data of course) which can allow me to check if a given data set (labeled with two classes) can be separated in high dimensional space?
In short: no, there is no general way. However, for some kernels you can easily say that... everything is separable. This property, proved in many forms (among others by Schoenberg), says for example that if your kernel is of the form K(x,y) = f(||x-y||^2) and f is:
infinitely differentiable
completely monotonic (which more or less means that the derivatives alternate in sign: the first derivative is non-positive, the next non-negative, the next non-positive, and so on)
positive
then it will always be able to separate every binary-labeled, consistent dataset (one with no two identical points carrying different labels). Actually it says even more: you can interpolate exactly, meaning that even if it is a regression problem you will get zero error. So in particular multi-class and multi-label problems will also be linearly solvable (there exists a linear/multi-linear model which gives you a correct interpolation).
However, if the above properties do not hold, it does not mean that your data cannot be perfectly separated. This is only "one way" proof.
In particular, this class of kernels includes the RBF kernel, so it will always be able to separate any training set (this is why it overfits so easily!).
What about the other direction? There you first have to fix the hyperparameters of the kernel, and then you can answer the question through optimization: solve the hard-margin SVM problem (C = inf), and it will find a solution iff the data is separable.
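A rough sketch of that separability check with scikit-learn (the data here is synthetic, and a very large C is used as a stand-in for a true hard margin):

import numpy as np
from sklearn.svm import SVC

# Synthetic two-class data, just for illustration
rng = np.random.RandomState(0)
X = rng.randn(200, 5)
y = (X[:, 0] + X[:, 1] ** 2 > 0.5).astype(int)

# Approximate the hard-margin SVM with a huge C; if the training accuracy
# reaches 1.0, the data is separable under this kernel and these
# hyperparameters (for the RBF kernel this will essentially always happen)
clf = SVC(kernel="rbf", gamma="scale", C=1e10)
clf.fit(X, y)
print("separable under this kernel:", clf.score(X, y) == 1.0)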

Choosing random_state for sklearn algorithms

I understand that random_state is used in various sklearn algorithms to break ties between different predictors (trees) with the same metric value (for example in GradientBoosting). But the documentation does not clarify or detail this. For example:
1) Where else are these seeds used for random number generation? For RandomForestClassifier, say, random numbers can be used to find a set of random features to build a predictor. Algorithms which use subsampling can use random numbers to get different subsamples. Can/is the same seed (random_state) playing a role in multiple random number generations?
What I am mainly concerned about is
2) How far-reaching is the effect of this random_state variable? Can the value make a big difference in prediction (classification or regression)? If yes, what kind of data sets should I care about more? Or is it more about stability than quality of results?
3) If it can make a big difference, how best to choose that random_state? It's a difficult one to do GridSearch on without an intuition, especially if the data set is such that one CV run can take an hour.
4) If the motive is only to have steady results/evaluations of my models and cross-validation scores across repeated runs, does it have the same effect if I set random.seed(X) before I use any of the algorithms (and use random_state as None)?
5) Say I am using a random_state value on a GradientBoosted Classifier, and I am cross-validating to find the goodness of my model (scoring on the validation set every time). Once satisfied, I will train my model on the whole training set before I apply it to the test set. Now, the full training set has more instances than the smaller training sets in the cross-validation, so the random_state value can now result in completely different behavior (choice of features and individual predictors) compared to what was happening within the CV loop. Similarly, things like min samples leaf can also result in an inferior model now that the settings are w.r.t. the number of instances in CV while the actual number of instances is larger. Is this a correct understanding? What is the approach to safeguard against this?
Yes, the choice of the random seeds will impact your prediction results and as you pointed out in your fourth question, the impact is not really predictable.
The common way to guard against predictions that happen to be good or bad just by chance is to train several models (based on different random states) and to average their predictions in a meaningful way. Similarly, you can see cross validation as a way to estimate the "true" performance of a model by averaging the performance over multiple training/test data splits.
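A quick sketch of that averaging idea (the data, model, and seed values below are arbitrary; any estimator that accepts random_state works the same way):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Train the same model under several random states and average the
# predicted probabilities to reduce seed-to-seed variability
probas = []
for seed in (1, 7, 13, 42, 99):
    clf = RandomForestClassifier(n_estimators=200, random_state=seed)
    clf.fit(X, y)
    probas.append(clf.predict_proba(X))

avg_proba = np.mean(probas, axis=0)
pred = avg_proba.argmax(axis=1)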
1) Where else are these seeds used for random number generation? For RandomForestClassifier, say, random numbers can be used to find a set of random features to build a predictor. Algorithms which use subsampling can use random numbers to get different subsamples. Can/is the same seed (random_state) playing a role in multiple random number generations?
random_state is used wherever randomness is needed:
If your code relies on a random number generator, it should never use functions like numpy.random.random or numpy.random.normal. This approach can lead to repeatability issues in unit tests. Instead, a numpy.random.RandomState object should be used, which is built from a random_state argument passed to the class or function.
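A small sketch of that convention using sklearn's own helper check_random_state (the subsample function here is a made-up example, not part of any library):

import numpy as np
from sklearn.utils import check_random_state

def subsample(X, n, random_state=None):
    # sklearn-style: None, an int seed, or a RandomState instance are all
    # accepted and converted into a RandomState object
    rng = check_random_state(random_state)
    idx = rng.choice(len(X), size=n, replace=False)
    return X[idx]

X = np.arange(100).reshape(50, 2)
print(subsample(X, 5, random_state=42))   # reproducible
print(subsample(X, 5))                    # uses the global numpy RNG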
2) How far-reaching is the effect of this random_state variable? Can the value make a big difference in prediction (classification or regression)? If yes, what kind of data sets should I care about more? Or is it more about stability than quality of results?
Good problems should not depend too much on the random_state.
3) If it can make a big difference, how best to choose that random_state? It's a difficult one to do GridSearch on without an intuition, especially if the data set is such that one CV run can take an hour.
Do not choose it. Instead try to optimize the other aspects of classification to achieve good results, regardless of random_state.
4) If the motive is only to have steady results/evaluations of my models and cross-validation scores across repeated runs, does it have the same effect if I set random.seed(X) before I use any of the algorithms (and use random_state as None)?
According to "Should I use `random.seed` or `numpy.random.seed` to control random number generation in `scikit-learn`?", random.seed(X) is not used by sklearn. If you need to control this, you could set np.random.seed() instead.
5) Say I am using a random_state value on a GradientBoosted Classifier, and I am cross-validating to find the goodness of my model (scoring on the validation set every time). Once satisfied, I will train my model on the whole training set before I apply it to the test set. Now, the full training set has more instances than the smaller training sets in the cross-validation, so the random_state value can now result in completely different behavior (choice of features and individual predictors) compared to what was happening within the CV loop. Similarly, things like min samples leaf can also result in an inferior model now that the settings are w.r.t. the number of instances in CV while the actual number of instances is larger. Is this a correct understanding? What is the approach to safeguard against this?
The answers to "How can I know training data is enough for machine learning" mostly state that the more data, the better.
If you do a lot of model selection, maybe Sacred can help, too. Among other things, it sets and can log the random seed for each evaluation, e.g.:
>>./experiment.py with seed=123
During experimentation, for tuning and reproducibility, you temporarily fix the random state, but you repeat the experiment with different random states and take the mean of the results.
import os
# Set a random state value
RANDOM_STATE = 42
# Set the Python hash seed (note: to affect hash randomization this must be
# set before the interpreter starts, e.g. in the shell environment)
os.environ['PYTHONHASHSEED'] = str(RANDOM_STATE)
# Seed Python's built-in random module
import random
random.seed(RANDOM_STATE)
# Seed numpy's global random generator
import numpy as np
np.random.seed(RANDOM_STATE)
# Seed other libraries, e.g. TensorFlow, and request deterministic ops
import tensorflow as tf
tf.random.set_seed(RANDOM_STATE)
os.environ['TF_DETERMINISTIC_OPS'] = '1'
os.environ['TF_CUDNN_DETERMINISTIC'] = '1'
# Finally, don't forget to set the random_state parameter in functions like
RandomizedSearchCV(random_state=RANDOM_STATE, ...)
For a production system, you remove the random state by setting it to None:
# Set a Random State value
RANDOM_STATE = None

Most appropriate normalization / transformation method for skewed features?

I am trying to pre-process biological data to train a neural network, and despite an extensive search and repeated presentations of the various normalization methods, I am none the wiser as to which method should be used when. In particular, I have a number of input variables which are positively skewed and have been trying to establish whether there is a normalization method that is most appropriate.
I was also worried about whether the nature of these inputs would affect the performance of the network, and as such have experimented with data transformations (log transformation in particular). However, some inputs have many zeros but may also take small decimal values, and seem to be highly affected by a log(x + 1) transform (or any offset from 1 down to 0.0000001, for that matter), with the resulting distribution failing to approach normal (it either remains skewed or becomes bimodal with a sharp peak at the minimum value).
Is any of this relevant to neural networks? I.e., should I be using specific feature transformation / normalization methods to account for the skewed data, or should I just ignore it, pick a normalization method, and push ahead?
Any advice on the matter would be greatly appreciated!
Thanks!
As the features in your input vector are of a different nature, you should use a different normalization algorithm for each feature. The network should be fed uniformly scaled data on every input for better performance.
As you wrote that some data is skewed, I suppose you can run some algorithm to "normalize" it. If applying a logarithm does not work, perhaps other functions and methods such as rank transforms can be tried out (see the sketch after this answer).
If the small decimal values occur entirely in a specific feature, then just normalize that feature in a specific way, so that the values get transformed into your working range: either [0, 1] or [-1, +1], I suppose.
If some inputs have many zeros, consider removing them from the main neural network, and create an additional neural network which will operate on the vectors with non-zero features. Alternatively, you may try to run Principal Component Analysis (for example, via an autoassociative memory network with structure N-M-N, M < N) to reduce the input space dimension and so eliminate the zeroed components (they will still be taken into account, in some form, in the new combined inputs). BTW, the new M inputs will be automatically normalized. Then you can pass the new vectors to your actual worker neural network.
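A small sketch of the kind of transforms mentioned above for a skewed, zero-heavy feature, using scikit-learn (the synthetic data and parameter values are just illustrative):

import numpy as np
from sklearn.preprocessing import PowerTransformer, QuantileTransformer

# Synthetic positively skewed feature with many exact zeros
rng = np.random.RandomState(0)
x = rng.exponential(scale=0.01, size=(1000, 1))
x[rng.rand(1000) < 0.3] = 0.0

# Yeo-Johnson handles zeros (and negatives) without picking a log(x + c) offset
x_yj = PowerTransformer(method="yeo-johnson").fit_transform(x)

# A rank-based transform maps the values onto an approximately normal shape
x_qt = QuantileTransformer(output_distribution="normal",
                           n_quantiles=200).fit_transform(x)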
This is an interesting question. Normalization is meant to keep features' values in one scale to facilitate the optimization process.
I would suggest the following:
1- Check if you need to normalize your data at all. If, for example, the means of the variables or features are within the same scale of values, you may proceed with no normalization. MSVMpack uses such a normalization-check condition for its SVM implementation. If you do need to normalize, you are still advised to also run the models on the un-normalized data for comparison.
2- If you know the actual maximum or minimum values of a feature, use them to normalize the feature. I think this kind of normalization would preserve the skewness of the values.
3- Try decimal scaling normalization with other features, if applicable.
Finally, you are still advised to apply different normalization techniques and compare the MSE for every technique, including the z-score, which may harm the skewness of your data.
I hope that I have answered your question and gave some support.
