best practices for using Categorical Variables in H2O? - machine-learning

I'm trying to use H2O's Random Forest for a multinominal classification into 71 classes with 38,000 training set examples. I have one features that is a string that in many cases are predictive, so I want to use it as a categorical feature.
The hitch is that even after canonicalizing the strings (uppercase, stripping out numbers, punctuation, etc.), I still have 7,000 different strings (some due to spelling or OCR errors, etc.) I have code to remove strings that are relatively rare, but I'm not sure what a reasonable cut off value is. (I can't seem to find any help in the documentation.)
I'm also not sure what to due with nbin_cats hyperparameter. Should I make it equal to the number of different categorical variables I have? [added: default for nbin_cats is 1024 and I'm well below that at around 300 different categorical values, so I guess I don't have to do anything with this parameter]
I'm also thinking perhaps if a categorical value is associated with too many different categories that I'm trying to predict, maybe I should drop it as well.
I'm also guessing I need to increase the tree depth to handle this better.
Also, is there a special value to indicate "don't know" for the strings that I am filtering out? (I'm mapping it to a unique string but I'm wondering if there is a better value that indicates to H2O that the categorical value is unknown.)
Many thanks in advance.

High cardinality categorical predictors can sometimes hurt model performance, and specifically in the case of tree-based models, the tree ensemble (GBM or Random Forest) ends up memorizing the training data. The model has a poor time generalizing on validation data.
A good indication of whether this is happening is if your string/categorical column has very high variable importance. This means that the trees are continuing to split on this column to memorize the training data. Another indication is if you see much smaller error on your training data than on your validation data. This means the trees are overfitting to the training data.
Some methods for handling high cardinality predictors are:
removing the predictor from the model
performing categorical encoding [pdf]
performing grid search on nbins_cats and categorical_encoding
There is a Python example in the H2O tutorials GitHub repo that showcases the effects of removing the predictor from the model and performing grid search here.

Related

What to do with corrected wrongly classified random forest predictions?

I have trained a multi-class Random Forest model and So now if the model predicts something wrong we manually correct it, SO the thing is What can we do to with that corrected label and make the predictions better.
Thoughts:
Can't retrain the model again and again.(Trained on 0.7 million rows so it might treat the new data as noise)
Can not train small models of RF as they will also create a mess
Random FOrest works better then NN, So not thinking to go that way.
What do you mean by "manually correct" - i.e. there may be various different points in the decision trees that were executed leading to a wrong prediction, not to mention the numerous decision trees used to get your final prediction.
I think there is some misunderstanding in your first point. Unless the distribution is non-stationary (in which case your trained model is of diminished value to begin with), the new data is treated is treated as "noise" in the sense that including it in the final model is unlikely to change future predictions all that much. As far as I can tell this is how it should be, without specifying other factors like a changing distribution, etc. That is, if future data you want to predict will look a lot more like the data you failed to predict correctly, then you would indeed want to upweight the importance of classifying this sample in your new model.
Anyway, it sounds like you're describing an online learning problem(you want a model that updates itself in response to streaming data). You can find some general ideas just searching for online random forests, for example:
[Online random forests] (http://www.ymer.org/amir/research/online-random-forests/) and [online multiclass lpboost] (https://github.com/amirsaffari/online-multiclass-lpboost) describe a general framework akin to what you may have in mind: the input to the model is a stream of new observations; the forest learns on this new data by dropping those trees which perform poorly and eventually growing new trees that include the new data.
The general idea described here is used in a number of boosting algorithms (for example, AdaBoost aggregates an ensemble of "weak learners", for example individual decision trees grown on different + incomplete subsets of data, into a better whole by training subsequent weak learners specifically on formerly misclassified instances. The idea here is that those instances where your current model is wrong are the most informative for future performance improvements.
I don't know the specific details of how the linked implementations accomplish this, though the idea is inline with what you might expect.
You might try these, or other such algorithms you find from searching around.
That all said, I suspect something like the online random forest algorithm is relatively good when old data becomes obsolete over time. If it doesn't -- i.e. if your future data and early data are pulled from the same distribution -- it's not obvious to me that successively retraining your model (by which I mean the random forest itself and any cross validation / model selection procedures you might have to transform forest predictions into a final assignment) data on the whole batch of examples you have is a bad idea, modulo data in a very high dimensional feature space, or really quickly incoming data.

How specific should a Support Vector Machine Model be?

The whole point of using an SVM is that the algorithm will be able to decide whether an input is true or false etc etc.
I am trying to use an SVM for predictive maintenance to predict how likely a system is to overheat.
For my example, the range is 0-102°C and if the temperature reaches 80°C or above it's classed as a failure.
My inputs are arrays of 30 doubles(the last 30 readings).
I am making some sample inputs to train the SVM and I was wondering if it is good practice to pass in very specific data to train it - eg passing in arrays 80°C, 81°C ... 102°C so that the model will automatically associate these values with failure. You could do an array of 30 x 79°C as well and set that to pass.
This seems like a complete way of doing it, although if you input arrays like that - would it not be the same as hardcoding a switch statement to trigger when the temperature reads 80->102°C.
Would it be a good idea to pass in these "hardcoded" style arrays or should I stick to more random inputs?
If there is a finite set of possibilities I would really recommend using Naïve Bayes, as that method would fit this problem perfectly. However if you are forced to use an SVM, I would say that would be rather difficult. For starters the main idea with an SVM is to use it for classification, and the amount of scenarios does not really matter. The input is however seldom discrete, so I guess there usually are infinite scenarios. However, an SVM implemented normally would only give you a classification, unless you have 100 classes one for 1% another one for 2%, this wouldn't really solve problem.
The conclusion is that this could work, but it would not be considered "best practice". You can imagine your 30 dimensional vector space divided into 100 small sub spaces, and each datapoint, a 30x1 vector is a point in that vectorspace so that the probability is decided by which of the 100 subsets its in. However, having a 100 classes and not very clean or insufficient data, will lead to very bad, hard performing models.
Cheers :)

Machine learning, Do unbalanced non-numeric variable classes matter

If I have a non-numeric variable in my data set that contains many of one class but few of another does this cause the same issues as when the target classes are unbalanced?
For example if one of my variables was title and the aim was to identify whether a person is obese. The data obese class is split 50:50 but there is only one row with the title 'Duke' and this row is in the obese class. Does this mean that an algorithm like logistic regression (after numeric encoding) would start predicting that all Dukes are obese (or have a disproportionate weighting for the title 'Duke')? If so, are some algorithms better/worse at handling this case? Is there a way to prevent this issue?
Yes, any vanilla machine learning algorithm will treat categorical data the same way as numerical data in terms of information entropy from a specific feature.
Consider this, before applying any machine learning algorithm you should analyze your input features and identify the explained variance each cause on the target. In your case if the label Duke always gets identified as obese, then given that specific dataset that is an extremely high information feature and should be weighted as such.
I would mitigate this issue by adding a weight to that feature, thus minimizing the impact it will have on the target. However, this would be a shame if this is an otherwise very informative feature for other instances.
An algorithm which could easily circumvent this problem is random forest (decision trees). You can eliminate any rule that is based on this feature being Duke.
Be very careful in mapping this feature to numbers as this will have an impact on the importance attributed to this feature with most algorithms.

Standardization before or after categorical encoding?

I'm working on a regression algorithm, in this case k-NearestNeighbors to predict a certain price of a product.
So I have a Training set which has only one categorical feature with 4 possible values. I've dealt with it using a one-to-k categorical encoding scheme which means now I have 3 more columns in my Pandas DataFrame with a 0/1 depending the value present.
The other features in the DataFrame are mostly distances like latitud - longitude for locations and prices, all numerical.
Should I standardize (Gaussian distribution with zero mean and unit variance) and normalize before or after the categorical encoding?
I'm thinking it might be benefitial to normalize after encoding so that every feature is to the estimator as important as every other when measuring distances between neighbors but I'm not really sure.
Seems like an open problem, thus I'd like to answer even though it's late. I am also unsure how much the similarity between the vectors would be affected, but in my practical experience you should first encode your features and then scale them. I have tried the opposite with scikit learn preprocessing.StandardScaler() and it doesn't work if your feature vectors do not have the same length: scaler.fit(X_train) yields ValueError: setting an array element with a sequence. I can see from your description that your data have a fixed number of features, but I think for generalization purposes (maybe you have new features in the future?), it's good to assume that each data instance has a unique feature vector length. For instance, I transform my text documents into word indices with Keras text_to_word_sequence (this gives me the different vector length), then I convert them to one-hot vectors and then I standardize them. I have actually not seen a big improvement with the standardization. I think you should also reconsider which of your features to standardize, as dummies might not need to be standardized. Here it doesn't seem like categorical attributes need any standardization or normalization. K-nearest neighbors is distance-based, thus it can be affected by these preprocessing techniques. I would suggest trying either standardization or normalization and check how different models react with your dataset and task.
After. Just imagine that you have not numerical variables in your column but strings. You can't standardize strings - right? :)
But given what you wrote about categories. If they are represented with values, I suppose there is some kind of ranking inside. Probably, you can use raw column rather than one-hot-encoded. Just thoughts.
You generally want to standardize all your features so it would be done after the encoding (that is assuming that you want to standardize to begin with, considering that there are some machine learning algorithms that do not need features to be standardized to work well).
So there is 50/50 voting on whether to standardize data or not.
I would suggest, given the positive effects in terms of improvement gains no matter how small and no adverse effects, one should do standardization before splitting and training estimator

Should 'deceptive' training cases be given to a Naive Bayes Classifier

I am setting up a Naive Bayes Classifier to try to determine sameness between two records of five string properties. I am only comparing each pair of properties exactly (i.e., with a java .equals() method). I have some training data, both TRUE and FALSE cases, but let's just focus on the TRUE cases for now.
Let's say there are some TRUE training cases where all five properties are different. That means every comparator fails, but the records are actually determined to be the 'same' after some human assessment.
Should this training case be fed to the Naive Bayes Classifier? On the one hand, considering the fact that NBC treats each variable separately these cases shouldn't totally break it. However, it certainly seems true that feeding in enough of these cases wouldn't be beneficial to the classifier's performance. I understand that seeing a lot of these cases would mean better comparators are required, but I'm wondering what to do in the time being. Another consideration is that the flip-side is impossible; that is, there's no way all five properties could be the same between two records and still have them be 'different' records.
Is this a preferential issue, or is there a definitive accepted practice for handling this?
Usually you will want to have a training data set that is as feasibly representative as possible of the domain from which you hope to classify observations (often difficult though). An unrepresentative set may lead to a poorly functioning classifier, particularly in a production environment where various data are received. That being said, preprocessing may be used to limit the exposure of a classifier trained on a particular subset of data, so it is quite dependent on the purpose of the classifier.
I'm not sure why you wish to exclude some elements though. Parameter estimation/learning should account for the fact that two different inputs may map to the same output --- that is why you would use machine learning instead of simply using a hashmap. Considering that you usually don't have 'all data' to build your model, you have to rely on this type of inference.
Have you had a look at the NLTK; it is in python but it seems that OpenNLP may be a suitable substitute in Java? You can employ better feature extraction techniques that lead to a model that accounts for minor variations in input strings (see here).
Lastly, it seems to me that you want to learn a mapping from input strings to the classes 'same' and 'not same' --- you seem to want to infer a distance measure (just checking). It would make more sense to invest effort in directly finding a better measure (e.g. for character transposition issues you could use edit distances). I'm not sure that NB is well-suited to your problem as it is attempting to determine a class given an observation(s) (or its features). This class will have to be discernible over various different strings (I'm assuming you are going to concatenate string1 & string2, and offer them to the classifier). Will there be enough structure present to derive such a widely applicable property? This classifier is basically going to need to be able to deal with all pair-wise 'comparisons' ,unless you build NBs for each one-vs-many pairing. This does not seem like a simple approach.

Resources