Understanding Polynomial Regression
I understand that we use polynomial regression for some kind of non-linear data set, to fit a curve to it. I know how to write a polynomial regression equation for a single independent variable, but I don't really understand how this equation is constructed for 2 variables:
y = a1*x1 + a2*x2 + a3*x1*x2 + a4*x1^2 + a5*x2^2
What will the equation for polynomial regression be if we have 3 or more variables? What is the actual logic behind constructing these polynomial equations for more than one variable?
You can choose whatever you want, but the general 'formula' (to the best of my own experience and knowledge) is:
Powers (so x1, x1^2, x1^3, etc.) up to whichever degree you choose (many stop at 2).
Cross products (x1*x2, x1*x3, etc.).
Combinations (x1^2*x2, x1*x2^2, etc.), and then even higher-order combinations (x1*x2*x3, possibly with powers as well).
But this quickly gets out of hand, and you can end up with too many features.
I would stick to powers of 2, and cross products (only pairs) with no powers, much like your example. If you have three features, you can also include the product of all three of them; but if you have more than three, I wouldn't bother with triplets.
The idea with polynomials is that you model the complex relationship between the features and the target; polynomials are sometimes a good approximation to more complex relationships (that are not really polynomial in their nature).
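If it helps, here is a minimal sketch of how to generate these terms automatically, assuming Python with scikit-learn (the library choice and data values are mine, purely for illustration):

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Toy data: two rows with features x1, x2, x3 (made-up values)
X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])

# degree=2 generates x1, x2, x3, x1^2, x1*x2, x1*x3, x2^2, x2*x3, x3^2;
# interaction_only=True would keep only the cross products
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)

print(poly.get_feature_names_out(["x1", "x2", "x3"]))
# ['x1' 'x2' 'x3' 'x1^2' 'x1 x2' 'x1 x3' 'x2^2' 'x2 x3' 'x3^2']

Raising degree to 3 adds the cubes and the triple products, which is exactly where the feature count starts to explode.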
I hope this is what you meant and that this can help you.
I have an Excel sheet with 3 columns: x1, x2, x3. x1 and x2 contain questions and x3 contains all the answers, serially; I mean the questions in the first row of x1 and x2 are answered by the first row of x3. x1 and x2 contain a mixture of numerical and text data, and there are some NA values as well.
My task is to use NLP techniques to solve this: if I type the x1 and x2 questions, it should return the x3 answer. The query will not be the full question statement but only some selected words; given just a few keywords, it should still return the answer. Please guide me on where and how I need to start; any suggestions are welcome.
It sounds (your question is a bit unclear) like you have a bunch of mixed data types, and you only want to process x1 = some text1 + x2 = some text2 -> x3 = some answer text.
I would recommend first cleaning up your data: you can easily remove NAs or NaNs by loading your data into a pandas DataFrame (I'm unsure what language you're using, either). If you're using Python, you can also easily remove the numeric entries using the str.isdigit method, along the lines of the sketch below.
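A minimal sketch of that cleaning step, assuming Python with pandas (the file name and column names are hypothetical):

import pandas as pd

# Load the three columns from the spreadsheet (hypothetical file name)
df = pd.read_excel("questions.xlsx", usecols=["x1", "x2", "x3"])

# Drop rows with NA/NaN in any of the three columns
df = df.dropna(subset=["x1", "x2", "x3"])

# Drop rows whose question columns are purely numeric
for col in ["x1", "x2"]:
    df = df[~df[col].astype(str).str.isdigit()]

print(df.head())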
I'm not entirely sure what you're trying to do, so I can't really recommend things to do after cleaning your data. Posting 2 examples of a proper and improper x1, x2 and x3 might be helpful.
Using H2O Flow, is there a way to create a stacked ensemble model based on individual models that may not take the same inputs but predict the same response labels?
E.g. I am trying to predict miscoded healthcare claims (i.e. charges) and would like to train models for a stacked ensemble of the form:
model1(diagnosis1, diagnosis2, ..., diagnosis5) -> denied or paid (by insurer)
model2(procedure, procedure_detail1, ..., procedure_detail5) -> denied or paid
model3(service_date, insurance_amount, insurer_id) -> (same)
model4(pat_age, pat_sex, ...) -> (same)
...
Is there a way to do this in H2O Flow (I can't tell how to do it from what is presented in the H2O Flow GUI for Stacked Ensemble)? And is this even a sensible way to go about the problem, or is it confused in some way? (I'm relatively new to machine learning.) Thanks.
Darren's response that you can't do this in H2O was correct until very recently -- H2O just removed the requirement that the base models be trained on the same set of inputs, since it's not actually required by the Stacked Ensemble algorithm. This is only available in the nightly releases off of master, though, so even if you're on the latest stable release, you'd see an error like the following (in Flow, R, Python, etc.) if you tried to use models that don't use the exact same columns:
Error: water.exceptions.H2OIllegalArgumentException: Base models are inconsistent: they use different column lists. Found: [x6, x7, x4, x5, x2, x3, x1, x9, x8, x10, response] and: [x10, x16, x15, x18, x17, x12, x11, x14, x13, x19, x9, x8, x20, x21, x28, x27, x26, x25, x24, x23, x22, x6, x7, x4, x5, x2, x3, x1, response].
The metalearning step in the Stacked Ensemble algorithm combines the output from the base models, so the number of inputs that went into training the base models doesn't really matter. Currently, H2O still requires that the inputs are all part of the same original training_frame -- but you can use a different x for each base model if you like (the x argument specifies which of the columns from the training_frame you want to use in your model).
The way that Stacked Ensemble works in Flow is that it looks for models that are all "compatible" -- in other words, trained on the same data frame. Then you select from this list which ones you want to include in the ensemble. So as long as you are using the latest development version of H2O, this is how to do what you want in Flow.
Here's an R example of how to ensemble models that are trained on different subsets of the feature space:
library(h2o)
h2o.init()
# Import a sample binary outcome training set into H2O
train <- h2o.importFile("https://s3.amazonaws.com/erin-data/higgs/higgs_train_10k.csv")
test <- h2o.importFile("https://s3.amazonaws.com/erin-data/higgs/higgs_test_5k.csv")
# Identify predictors and response
y <- "response"
x <- setdiff(names(train), y)
# For binary classification, response should be a factor
train[,y] <- as.factor(train[,y])
test[,y] <- as.factor(test[,y])
# Train & Cross-validate a GBM using a subset of features
my_gbm <- h2o.gbm(x = x[1:10],
                  y = y,
                  training_frame = train,
                  distribution = "bernoulli",
                  nfolds = 5,
                  keep_cross_validation_predictions = TRUE,
                  seed = 1)
# Train & Cross-validate a RF using a subset of features
my_rf <- h2o.randomForest(x = x[3:15],
                          y = y,
                          training_frame = train,
                          nfolds = 5,
                          keep_cross_validation_predictions = TRUE,
                          seed = 1)
# Train a stacked ensemble using the GBM and RF above
ensemble <- h2o.stackedEnsemble(y = y, training_frame = train,
                                base_models = list(my_gbm, my_rf))
# Check out ensemble performance
perf <- h2o.performance(ensemble, newdata = test)
h2o.auc(perf)
A stacked ensemble won't do this, as it does require identical inputs to each model. But you can set up a looser kind of ensemble... and that can almost, but not quite, be done in Flow.
Basically, you would create your four models. Then you would run predict on each of them. Each predict() will give you a new H2O frame. You would then need to cbind (column-bind) those four predictions together, to give you a new H2O frame with 4 binary columns (*). That frame would then be fed into a 5th model, which gives you the combined result.
*: This is the bit I don't think you can do in Flow. You would need to export the data, combine it in another application, then bring it back in.
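Outside Flow, the combine step is straightforward in the client APIs. Here is a minimal sketch using the h2o Python API, assuming model1 through model4 are already-trained binomial models and train is the original H2OFrame with a "response" column (all names hypothetical); in practice you'd want cross-validated rather than training-set predictions, to avoid overfitting the 5th model:

import h2o
from h2o.estimators import H2OGradientBoostingEstimator

# model1..model4 and train are assumed to exist already (see caveat above)
preds = [m.predict(train)["p1"] for m in (model1, model2, model3, model4)]

# Column-bind the four prediction columns plus the true labels
meta_frame = preds[0].cbind(preds[1]).cbind(preds[2]).cbind(preds[3])
meta_frame = meta_frame.cbind(train["response"])
meta_frame.columns = ["m1", "m2", "m3", "m4", "response"]

# The 5th model, which combines the four predictions
meta = H2OGradientBoostingEstimator(seed=1)
meta.train(x=["m1", "m2", "m3", "m4"], y="response",
           training_frame=meta_frame)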
A better approach would be to build a single model using all the inputs together. This would be both simpler and give you more accurate results (as, e.g., interactions between insurance_amount and pat_age could be discovered). But the (potentially major) downside is that you cannot explain the model as four sets of yes/no any more, i.e. it becomes more black-box-like.
I am trying to program a machine learning algorithm to learn from training data and classify the language of each instance. There are 4 possible classifications: Polish, French, Slovak, German.
In the training data, each instance is a full sentence, but in the test data each instance is represented by just single characters.
For example, an instance of my training data looks like this:
"Et oui cest la fille du patron Il fait tout"
But my testing data looks like this:
"e e n t l n r i a e i a v i t s r e t n"
How come my training dataset is so different from my testing dataset, and what would be an appropriate feature selection for this problem?
It is suspicious that you have a data set like this. The only method that comes to mind is to use letter-frequency distributions: if you have large enough paragraphs, you can calculate the percentage count of each letter for a given language and match it against your data. For example, it is known that in a large enough English text the letter "a" appears ~8.167% of the time and the letter "e" ~12.702%, whereas in German "a" occurs ~6.5% and "e" ~16.4%. Other languages have different distributions.
Check this Wikipedia article: https://en.wikipedia.org/wiki/Letter_frequency
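A minimal sketch of that matching idea, assuming Python; the frequency profiles below are tiny, hand-picked subsets for illustration only -- real profiles should be built from large corpora or the full Wikipedia table above:

from collections import Counter

def letter_distribution(text):
    # Relative frequency of each alphabetic character
    letters = [c for c in text.lower() if c.isalpha()]
    counts = Counter(letters)
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()}

def profile_distance(dist, profile):
    # Sum of absolute differences over the letters in the profile
    return sum(abs(dist.get(ch, 0.0) - p) for ch, p in profile.items())

# Truncated, illustrative profiles (a real system needs all letters)
profiles = {
    "english": {"a": 0.08167, "e": 0.12702, "t": 0.09056},
    "german":  {"a": 0.06516, "e": 0.16396, "n": 0.09776},
}

test_instance = "e e n t l n r i a e i a v i t s r e t n"
dist = letter_distribution(test_instance)
guess = min(profiles, key=lambda lang: profile_distance(dist, profiles[lang]))
print(guess)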
I am working on a data mining project and I have to design the following model.
I am given 4 features x1, x2, x3 and x4, and four functions defined on these features, such that each function depends on some subset of the available features, e.g.:
F1(x1, x2) = x1^2 + 2x2^2
F2(x2, x3) = 2x2^2 + 3x3^3
F3(x3, x4) = 3x3^3 + 4x4^4
This means F1 is some function which depends on features x1 and x2, F2 is some function which depends on x2 and x3, and so on.
Now a training data set is available where the values of x1, x2, x3, x4 are known, together with sum(F1+F2+F3) (I know the total sum but not the individual value of each function).
Using this training data I have to build a model which can correctly predict the total sum of all the functions, i.e. F1+F2+F3.
I am new to the data mining and machine learning field, so I apologize in advance if this question is too trivial or wrongly posed. I have tried to model it in many ways but I have not arrived at any clear approach. I would appreciate any help with this.
Your problem is non-linear regression. You have features x1, x2, x3, x4 and a target S, where S = sum(F1+F2+F3). You would like to predict S from the x's, but S is a non-linear function of them.
Since your function S is non-linear, you need to use a non-linear regression algorithm for this problem. Ordinary non-linear regression may solve your problem, or you may choose other approaches; you may, for example, try tree regression or MARS (Multivariate Adaptive Regression Splines). They are well-known algorithms and you can find commercial and open-source implementations; a small tree-regression sketch follows.
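A minimal sketch of the tree-regression route, assuming Python with scikit-learn; the data is synthetic, generated with the exact F1+F2+F3 structure from the question, and the model only ever sees the total sum:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(1000, 4))  # columns are x1, x2, x3, x4
x1, x2, x3, x4 = X.T

# Target: S = F1 + F2 + F3 (the individual F values are never exposed)
S = (x1**2 + 2*x2**2) + (2*x2**2 + 3*x3**3) + (3*x3**3 + 4*x4**4)

# Fit on the first 800 rows, evaluate on the held-out 200
model = DecisionTreeRegressor(max_depth=6, random_state=0)
model.fit(X[:800], S[:800])
print(model.score(X[800:], S[800:]))  # R^2 on unseen rows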