H2O stacked ensemble with models using different inputs

Using H2O Flow, is there a way to create a stacked ensemble model based on individual models that may not take the same inputs but predict the same response labels?
E.g., I am trying to predict miscoded healthcare claims (i.e. charges) and would like to train models for a stacked ensemble of the form:
model1(diagnosis1, diagnosis2, ..., diagnosis5) -> denied or paid (by insurer)
model2(procedure, procedure_detail1, ..., procedure_detail5) -> denied or paid
model3(service_date, insurance_amount, insurer_id) -> (same)
model4(pat_age, pat_sex, ...) -> (same)
...
Is there a way to do this in H2O Flow (I can't tell how to do it from what is presented in the H2O Flow GUI for Stacked Ensemble)? Is this even a sensible way to go about this, or is it confused in some way (I'm relatively new to machine learning)? Thanks.

Darren's response that you can't do this in H2O was correct until very recently -- H2O just removed the requirement that the base models be trained on the same set of inputs, since it's not actually required by the Stacked Ensemble algorithm. This is only available on the nightly releases off of master, though, so even if you're on the latest stable release, you'd see an error like this (in Flow, R, Python, etc.) if you tried to use models that don't use exactly the same columns:
Error: water.exceptions.H2OIllegalArgumentException: Base models are inconsistent: they use different column lists. Found: [x6, x7, x4, x5, x2, x3, x1, x9, x8, x10, response] and: [x10, x16, x15, x18, x17, x12, x11, x14, x13, x19, x9, x8, x20, x21, x28, x27, x26, x25, x24, x23, x22, x6, x7, x4, x5, x2, x3, x1, response].
The metalearning step in the Stacked Ensemble algorithm combines the output from the base models, so the number of inputs that went into training the base models doesn't really matter. Currently, H2O still requires that the inputs are all part of the same original training_frame -- but you can use a different x for each base model if you like (the x argument specifies which of the columns from the training_frame you want to use in your model).
The way that Stacked Ensemble works in Flow is that it looks for models that are all "compatible" -- in other words, trained on the same data frame. Then you select from this list which ones you want to include in the ensemble. So as long as you are using the latest development version of H2O, this is how to do what you want in Flow.
Here's an R example of how to ensemble models that are trained on different subsets of the feature space:
library(h2o)
h2o.init()

# Import a sample binary outcome training set into H2O
train <- h2o.importFile("https://s3.amazonaws.com/erin-data/higgs/higgs_train_10k.csv")
test <- h2o.importFile("https://s3.amazonaws.com/erin-data/higgs/higgs_test_5k.csv")

# Identify predictors and response
y <- "response"
x <- setdiff(names(train), y)

# For binary classification, the response should be a factor
train[, y] <- as.factor(train[, y])
test[, y] <- as.factor(test[, y])

# Train & cross-validate a GBM using a subset of features
my_gbm <- h2o.gbm(x = x[1:10],
                  y = y,
                  training_frame = train,
                  distribution = "bernoulli",
                  nfolds = 5,
                  keep_cross_validation_predictions = TRUE,
                  seed = 1)

# Train & cross-validate a RF using a different subset of features
my_rf <- h2o.randomForest(x = x[3:15],
                          y = y,
                          training_frame = train,
                          nfolds = 5,
                          keep_cross_validation_predictions = TRUE,
                          seed = 1)

# Train a stacked ensemble using the GBM and RF above
ensemble <- h2o.stackedEnsemble(y = y,
                                training_frame = train,
                                base_models = list(my_gbm, my_rf))

# Check out ensemble performance
perf <- h2o.performance(ensemble, newdata = test)
h2o.auc(perf)

A stacked ensemble won't do this, as it does require identical inputs to each model. But you can set up a looser kind of ensemble... and that can almost, but not quite, be done in Flow.
Basically, you would create your four models. Then you would run predict on each of them. Each predict() will give you a new H2O frame. You would then need to cbind (column-bind) those four predictions together, giving you a new H2O frame with 4 binary columns (*). That would then be fed into a 5th model, which gives you a combined result.
*: This is the bit I don't think you can do in Flow. You would need to export the data, combine it in another application, then bring it back in.
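Here is a rough R sketch of that manual stacking, assuming my_model1 through my_model4 are already-trained H2O models and train is an H2OFrame with a denied_or_paid response column (all of these names are hypothetical):
p1 <- h2o.predict(my_model1, train)["predict"]
p2 <- h2o.predict(my_model2, train)["predict"]
p3 <- h2o.predict(my_model3, train)["predict"]
p4 <- h2o.predict(my_model4, train)["predict"]
# Column-bind the four base predictions with the response
# (ideally use holdout or cross-validated predictions here, to limit overfitting)
meta_train <- h2o.cbind(p1, p2, p3, p4, train["denied_or_paid"])
colnames(meta_train) <- c("p1", "p2", "p3", "p4", "denied_or_paid")
# The 5th model that combines the base predictions
combiner <- h2o.glm(x = c("p1", "p2", "p3", "p4"),
                    y = "denied_or_paid",
                    training_frame = meta_train,
                    family = "binomial")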
A better approach would be to build a single model using all the inputs together. This would be both simpler and likely more accurate (since, e.g., interactions between insurance_amount and pat_age could be discovered). But the (potentially major) downside is that you can no longer explain the model as four sets of yes/no; it becomes more black-box-like.

Related

How can I include survey weights in a Poisson point process model fitted to a logistic regression quadrature scheme?

Is it possible to include weights in a Poisson point process model fitted to a logistic regression quadrature scheme? My data is a stratified sample, and I would like to account for this sampling strategy in order to have valid population-level predictions.
This is a question about the model-fitting function ppm in the R package spatstat.
Yes, you can include survey weights. The easiest way is to create a covariate surveyweight, which could be a function(x,y) or a pixel image or a column of data associated with your quadrature scheme. Then when fitting the model using ppm, add the model term +offset(log(surveyweight)).
The result of ppm will be a fitted model that describes the observed point pattern. You can do prediction, simulation etc from this model, but be aware that these will be predictions or simulations of the observed point process including the effect of non-constant survey effort.
To get a prediction or simulation of the original point process (i.e. after removing the effect of non-constant survey effort) you need to replace the original covariate surveyweight by another covariate that is constant and equal to 1, then pass this to predict.ppm in the argument newdata.
Here are a few lines to elaborate on the answer by @adrian-baddeley.
If you have the setup of your related question and we imagine you have the weights and two covariates in a data.frame in the same order as the points of your quadscheme:
library(spatstat)
X <- split(chorley)$larynx
D <- split(chorley)$lung
Q <- quadscheme.logi(X, D)
covar <- data.frame(weights = runif(npoints(chorley)),
                    covar1 = rnorm(npoints(chorley)),
                    covar2 = rnorm(npoints(chorley)))
fit <- ppm(Q ~ offset(log(weights)) + covar1 + covar2, data = covar)
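And a hedged sketch of the prediction step from the answer above, assuming (as quadscheme.logi arranges it) that the data points X come first among the quadrature points, so the first npoints(X) rows of covar correspond to them:
# Predict the original process with the survey effort removed:
# the same covariates, but with the weights forced to the constant 1
newloc <- data.frame(x = X$x, y = X$y)     # prediction locations
newcov <- covar[seq_len(npoints(X)), ]     # covariate rows matching those points
newcov$weights <- 1                        # constant survey effort
lambda <- predict(fit, locations = newloc, covariates = newcov)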

Feature order when used with a trained machine learning model

Say I have a machine learning algorithm trained using features F1, F2 and F3. This model is then subsequently pickled and used on another project (imported using Joblib).
When using the trained model, do the inputs need to be in the same order (F1, F2, F3)?
For simplicity, assume you are fitting a linear regression model (but this generalizes to all the others). If F1, F2, F3 are your features, then it finds the weights w1, w2, w3 and a bias such that the error made by w1*F1 + w2*F2 + w3*F3 + bias is minimized. This is called a linear combination of the weights and features.
So when making a prediction, the model calculates the value w1*F1 + w2*F2 + w3*F3 + bias, so the order of features matters.
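As a minimal sketch with made-up numbers (using R's lm to stand in for the trained model):
set.seed(1)
F1 <- runif(100); F2 <- runif(100); F3 <- runif(100)
y <- 2*F1 + 5*F2 - 3*F3 + rnorm(100, sd = 0.1)
fit <- lm(y ~ F1 + F2 + F3)
w <- coef(fit)                      # bias, w1, w2, w3
x_right <- c(1, 0.5, 0.1, 0.9)      # intercept, F1, F2, F3
x_wrong <- c(1, 0.1, 0.9, 0.5)      # the same numbers, shuffled
sum(w * x_right)  # close to 2*0.5 + 5*0.1 - 3*0.9 = -1.2
sum(w * x_wrong)  # nonsense: weights applied to the wrong features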
Yes, they must be in exactly the same order. And preprocessed in exactly the same way.

How to move from sliding window of accelerometer data to feature vector in gesture recognition task?

We have an application that needs to be able to recognize a short hand gesture based on the accelerometer data.
Now, I read lots of different papers on the subject (among which "Machine Learning Methods for Classifying Human Physical Activity from On-Body Accelerometers" written by Andrea Mannini and Angelo Maria Sabatini seemed to be the most useful one) and I have a somewhat clear understanding of what needs to be done (some steps might look wrong as I don't have any previous experience in statistics/ML):
I need to gather the accelerometer data from the smartphone along 3 axes.
The next step is to separate the data into AC and DC components (one of them relates to the gravitational acceleration, second one relates to body acceleration). I need to work with the body acceleration component, as this component seems to provide better results for activity recognition problems.
The next step is to extract some features from the body acceleration component - the features to be extracted are yet to be determined as it's not the part of the question.
I need to take the list of my feature vectors and train a classifier (most likely I'll go with Hidden Markov Models, as this classifier seems to be the best choice for sequential classification tasks, and gesture recognition seems to be a sequential classification task).
After that I can evaluate and refine.
Now, there is one thing that these papers mentioned and I can not grasp yet and this thing is sliding windows. As far as I understand, I need to take the stream of the incoming accelerometer data and split it into the set of overlapping series, e.g.
Data from 0s to 1.0s
Data from 0.5s to 1.5s
Data from 1.0s to 2.0s
...
Data from n s to (n+1.0)s
The thing that I don't understand is that each window holds a number of accelerometer readings:
x0, y0, z0
x1, y1, z1
x2, y2, z2
...
xn, yn, zn
As you can see, this is not a vector, it's a matrix. But I need a feature vector to train my classifier, so how do I squeeze the matrix into a vector?
I had some ideas on how the authors of these papers did that:
We can create a feature vector based on each triplet: [featureA based on x0, y0, z0; featureB based on x0, y0, z0; featureA based on x1, y1, z1; featureB based on x1, y1, z1; ... ; featureB based on xn, yn, zn].
Or we can somehow average the accelerometer readings inside the window so that we end up having only a single triplet xa, ya, za ('a' for 'average') and then create a feature vector like that: [featureA based on xa, ya, za; featureB based on xa, ya, za].
I don't like the second idea, as it seems that averaging the accelerometer readings, even over small windows, might lead to information loss and inadequate classifier behaviour. Probably the authors of the papers meant something like the first idea, but I am not sure.
Is my understanding of how the sliding windows are created correct (that is, idea #1)?
First of all, you are right in both parts of your question, but not completely.
I will say two things:
First, the sliding window is meant to give context to the data at time t. Depending on the data flow, you can include the future in it or not, but the idea is indeed to put nearby data points in the same window, creating a new dataset where each row is a window. Having said that, extracting features from a window is not the same as extracting features from 3 coordinates. You could, for example, calculate the speed of movement (by using the previous point). So the two options are both valid: you will want features from single points (polar coordinates, for example) but also speed, movement from the last point, and so on. Those are all good features.
Second, each coordinate can be a feature. You don't need to squeeze the matrix, because it's not in fact any different from your first option (or, for that matter, the second one). Your accelerometer provides data from which you can extract features to make a new dataset, but you will end up with a matrix whatever you do.
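To make this concrete, here is a minimal R sketch of idea #1 combined with simple window-level statistics (the sampling rate, window length and step size below are hypothetical):
# Turn a stream of 3-axis readings into one fixed-length feature
# vector per overlapping window: per-axis mean and sd (6 features)
window_features <- function(acc, fs = 50, win_s = 1.0, step_s = 0.5) {
  win  <- round(win_s  * fs)   # samples per window
  step <- round(step_s * fs)   # samples between window starts
  starts <- seq(1, nrow(acc) - win + 1, by = step)
  t(sapply(starts, function(i) {
    w <- acc[i:(i + win - 1), ]        # win x 3 matrix for this window
    c(colMeans(w), apply(w, 2, sd))    # squeeze the matrix into a vector
  }))
}
acc <- matrix(rnorm(500 * 3), ncol = 3,
              dimnames = list(NULL, c("x", "y", "z")))
feats <- window_features(acc)          # one row (feature vector) per window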

Design a Data Model to predict value of sum of Function

I am working on a data mining project and I have to design the following model.
I am given 4 features x1, x2, x3 and x4, and three functions defined on these
features, such that each function depends on some subset of the available features.
e.g.
F1(x1, x2) = x1^2 + 2*x2^2
F2(x2, x3) = 2*x2^2 + 3*x3^3
F3(x3, x4) = 3*x3^3 + 4*x4^4
This means F1 is some function which depends on features x1 and x2, F2 is some function which depends on x2 and x3, and so on.
Now a training data set is available where the values of x1, x2, x3, x4 are known, together with sum(F1+F2+F3) (I know the total sum but not the individual values of the functions).
Using this training data, I have to build a model which can correctly predict the total sum of all the functions, i.e. (F1+F2+F3).
I am new to the data mining and machine learning field, so I apologize in advance if this question is too trivial or wrong. I have tried to model it in many ways, but I am not getting any clear thoughts about it. I would appreciate any help regarding this.
Your problem is non-linear regression. You have features
x1, x2, x3, x4 and a target S (where S = F1 + F2 + F3).
You would like to predict S using the xn, but S is a non-linear function of them.
Since your function S is non-linear, you need to use a non-linear regression algorithm for this problem. Ordinary non-linear regression may solve your problem, or you may choose other approaches. You could, for example, try tree regression or MARS (Multivariate Adaptive Regression Splines). They are well-known algorithms and you can find commercial and open-source implementations.
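For instance, a minimal sketch with simulated data shaped like your problem (rpart for tree regression; the earth package is a common open-source MARS implementation):
library(rpart)
set.seed(1)
n <- 1000
d <- data.frame(x1 = runif(n), x2 = runif(n),
                x3 = runif(n), x4 = runif(n))
# Only the total is observed: F1+F2+F3 = x1^2 + 4*x2^2 + 6*x3^3 + 4*x4^4
d$S <- d$x1^2 + 4*d$x2^2 + 6*d$x3^3 + 4*d$x4^4
fit <- rpart(S ~ x1 + x2 + x3 + x4, data = d)   # tree regression
# fit <- earth::earth(S ~ ., data = d)          # or MARS
predict(fit, newdata = d[1:5, ])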

complex machine learning application

2 questions concerning machine learning algorithms like linear/logistic regression, ANNs and SVMs:
The models in these algorithms deal with data sets where each example has a number of features and one possible output value (e.g., predicting the price of a house from its features). But what if the features are enough to produce more than one piece of information about the item of interest, i.e., more than one output? Consider this example: a data set about cars where each example (car) has the features initial velocity, acceleration and time. In the real world these features are enough to determine two variables: velocity via v = v_i + a*t, and distance via s = v_i*t + 0.5*a*t^2. So I want example X with features (x1, x2, ..., xn) to have outputs y1 and y2 in the same step, so that after training the model, if a new car example is given with initial velocity, acceleration and time, the model will be able to predict the velocity and distance at the same time. Is this possible?
In the house price prediction example, where example X is given with features (x1, x2, x3) and the model predicts the price, can the process be reversed by any means? Meaning, if I give the model example X with features x1 and x2 plus the price y, can it predict the feature x3?
Depends on the model. A linear model such as linear regression cannot reliably learn the distance formula, since it's a cubic function of the given variables. You'd need to add v_i×t and a×t² as features to get a good prediction of the distance. A non-linear model such as an SVM regression with a cubic kernel, or a multi-layer ANN, should be able to learn this from the given features, though, given enough data.
More generally, predicting multiple values with a single model sometimes works and sometimes doesn't -- when in doubt, just fit several models.
You can try. Whether it'll work depends on the relation between the variables and the model.
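As a minimal R sketch of both points, with simulated car data (the engineered v_i*t and a*t^2 columns are what let the linear fits succeed, per the first answer):
set.seed(1)
n   <- 500
v_i <- runif(n, 0, 30)            # initial velocity
a   <- runif(n, 0, 5)             # acceleration
t   <- runif(n, 0, 10)            # time
v   <- v_i + a * t                # output 1
s   <- v_i * t + 0.5 * a * t^2    # output 2
# Question 1: two outputs -> simply fit one model per output
fit_v <- lm(v ~ v_i + I(a * t))
fit_s <- lm(s ~ I(v_i * t) + I(a * t^2))
# Question 2: "reverse" by making the old output a predictor and a
# former feature the target (here v_i = v - a*t, so it fits exactly)
fit_rev <- lm(v_i ~ v + I(a * t))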
