VotingRegressor vs. StackingRegressor - machine-learning

Here's one. In what situation would you use one vs. the other. Let me run a hypothetical.
Let's say I'm training a few different Regressors, and I get the final score from each regressor's training run. If I wanted to use the VotingRegressor to ensemble each model, I could use those scores as potential weight parameters to get a weighted average of each model's prediction right?
So what's the benefit of doing that, vs. using the StackingRegressor to get the final prediction? As I understand it, a final model is used to make its predictions based on each individual model's prediction, so in effect, wouldn't that final StackingRegressor model learn that some predictions are better than others? Almost like it's doing a sort of weight voting of its own?
Short of running both examples and seeing the differences in predictions, wondering if anyone else has experience with both of these and could provide some insight as to which might be a better way to go? I don't see a question like this on SO yet. Thanks!

Related

Choosing right metrics for regression model

I have always been using r2 score metrics. I know there are several evaluation metrics out there i have read several articles about it. Since i'm still a beginner in machine learning. I'm still very confused of
When to use each of it, is depending on our case, if yes please give me example
I read this article and it said, r2 score is not straightforward, we need other stuff to measure the performance of our model. Does it mean we need more than 1 evaluation metrics in order to get better insight of our model performance?
Is it recommended if we only measure our model performance by just one evaluation metrics?
From this article it said knowing the distribution of our data and our business goal helps us to understand choose appropriate metrics. What does it mean by that?
How to know for each metrics that the model is 'good' enough?
There are different evaluation metrics for regression problems like below.
Mean Squared Error(MSE)
Root-Mean-Squared-Error(RMSE)
Mean-Absolute-Error(MAE)
R² or Coefficient of Determination
Mean Square Percentage Error (MSPE)
so on so forth..
As you mentioned you need to use them based on your problem type, what you want to measure and the distribution of your data.
To do this, you need to understand how these metrics evaluate the model. You can check the definitions and pros/cons of evaluation metrics from this nice blog post.
R² shows what variation of your purpose variable is described by independent variables. A good model can give R² score close to 1.0 but it does not mean it should be. Models which have low R² can also give low MSE score. So to ensure your predictive power of your model it is better to use MSE, RMSE or other metrics besides the R².
No. You can use multiple evaluation metrics. The important thing is if you compare two models, you need to use same test dataset and the same evaluation metrics.
For example, if you want to penalize your bad predictions too much, you can use MSE evaluation metric because it basically measures the average squared error of our predictions or if your data have too much outlier MSE give too much penalty to this examples.
The good model definition changes based on your problem complexity. For example if you train a model which predicts that heads or tails and gives %49 accuracy it is not good enough because the baseline of this problem is %50. But for any other problem, %49 accuracy may enough for your problem. So in a summary, it depends on your problem and you need to define or think that human(baseline) threshold.

How do I create a feature vector if I don’t have all the data?

So say for each of my ‘things’ to classify I have:
{house, flat, bungalow, electricityHeated, gasHeated, ... }
Which would be made into a feature vector:
{1,0,0,1,0,...} which would mean a house that is heated by electricity.
For my training data I would have all this data- but for the actual thing I want to classify I might only have what kind of house it is, and a couple other things- not all the data ie.
{1,0,0,?,?,...}
So how would I represent this?
I would want to find the probability that a new item would be gasHeated.
I would be using a SVM linear classifier- I don’t have any core to show because this is purely theoretical at the moment. Any help would be appreciated :)
When I read this question, it seems that you may have confused with feature and label.
You said that you want to predict whether a new item is "gasHeated", then "gasHeated" should be a label rather than a feature.
btw, one of the most-common ways to deal with missing value is to set it as "zero" (or some unused value, say -1). But normally, you should have missing value in both training data and testing data to make this trick be effective. If this only happened in your testing data but not in your training data, it means that your training data and testing data are not from the same distribution, which basically violated the basic assumption of machine learning.
Let's say you have a trained model and a testing sample {?,0,0,0}. Then you can create two new testing samples, {1,0,0,0}, {0,0,0,0}, and you will have two predictions.
I personally don't think SVM is a good approach if you have missing values in your testing dataset. Just like I have mentioned above, although you can get two new predictions, but what if each one has different predictions? It is difficult to assign a probability to results of SVM in my opinion unless you use logistic regression or Naive Bayes. I would prefer Random Forest in this situation.

What to do with corrected wrongly classified random forest predictions?

I have trained a multi-class Random Forest model and So now if the model predicts something wrong we manually correct it, SO the thing is What can we do to with that corrected label and make the predictions better.
Thoughts:
Can't retrain the model again and again.(Trained on 0.7 million rows so it might treat the new data as noise)
Can not train small models of RF as they will also create a mess
Random FOrest works better then NN, So not thinking to go that way.
What do you mean by "manually correct" - i.e. there may be various different points in the decision trees that were executed leading to a wrong prediction, not to mention the numerous decision trees used to get your final prediction.
I think there is some misunderstanding in your first point. Unless the distribution is non-stationary (in which case your trained model is of diminished value to begin with), the new data is treated is treated as "noise" in the sense that including it in the final model is unlikely to change future predictions all that much. As far as I can tell this is how it should be, without specifying other factors like a changing distribution, etc. That is, if future data you want to predict will look a lot more like the data you failed to predict correctly, then you would indeed want to upweight the importance of classifying this sample in your new model.
Anyway, it sounds like you're describing an online learning problem(you want a model that updates itself in response to streaming data). You can find some general ideas just searching for online random forests, for example:
[Online random forests] (http://www.ymer.org/amir/research/online-random-forests/) and [online multiclass lpboost] (https://github.com/amirsaffari/online-multiclass-lpboost) describe a general framework akin to what you may have in mind: the input to the model is a stream of new observations; the forest learns on this new data by dropping those trees which perform poorly and eventually growing new trees that include the new data.
The general idea described here is used in a number of boosting algorithms (for example, AdaBoost aggregates an ensemble of "weak learners", for example individual decision trees grown on different + incomplete subsets of data, into a better whole by training subsequent weak learners specifically on formerly misclassified instances. The idea here is that those instances where your current model is wrong are the most informative for future performance improvements.
I don't know the specific details of how the linked implementations accomplish this, though the idea is inline with what you might expect.
You might try these, or other such algorithms you find from searching around.
That all said, I suspect something like the online random forest algorithm is relatively good when old data becomes obsolete over time. If it doesn't -- i.e. if your future data and early data are pulled from the same distribution -- it's not obvious to me that successively retraining your model (by which I mean the random forest itself and any cross validation / model selection procedures you might have to transform forest predictions into a final assignment) data on the whole batch of examples you have is a bad idea, modulo data in a very high dimensional feature space, or really quickly incoming data.

Is it considered overfit a decision tree with a perfect attribute?

I have a 6-dimensional training dataset where there is a perfect numeric attribute which separates all the training examples this way: if TIME<200 then the example belongs to class1, if TIME>=200 then example belongs to class2. J48 creates a tree with only 1 level and this attribute as the only node.
However, the test dataset does not follow this hypothesis and all the examples are missclassified. I'm having trouble figuring out whether this case is considered overfitting or not. I would say it is not as the dataset is that simple, but as far as I understood the definition of overfit, it implies a high fitting to the training data, and this I what I have. Any help?
However, the test dataset does not follow this hypothesis and all the examples are missclassified. I'm having trouble figuring out whether this case is considered overfitting or not. I would say it is not as the dataset is that simple, but as far as I understood the definition of overfit, it implies a high fitting to the training data, and this I what I have. Any help?
Usually great training score and bad testing means overfitting. But this assumes IID of the data, and you are clearly violating this assumption - your training data is completely different from the testing one (there is a clear rule for the training data which has no meaning for testing one). In other words - your train/test split is incorrect, or your whole problem does not follow basic assumptions of where to use statistical ml. Of course we often fit models without valid assumptions about the data, in your case - the most natural approach is to drop a feature which violates the assumption the most - the one used to construct the node. This kind of "expert decisions" should be done prior to building any classifier, you have to think about "what is different in test scenario as compared to training one" and remove things that show this difference - otherwise you have heavy skew in your data collection, thus statistical methods will fail.
Yes, it is an overfit. The first rule in creating a training set is to make it look as much like any other set as possible. Your training set is clearly different than any other. It has the answer embedded within it while your test set doesn't. Any learning algorithm will likely find the correlation to the answer and use it and, just like the J48 algorithm, will regard the other variables as noise. The software equivalent of Clever Hans.
You can overcome this by either removing the variable or by training on a set drawn randomly from the entire available set. However, since you know that there is a subset with an embedded major hint, you should remove the hint.
You're lucky. At times these hints can be quite subtle which you won't discover until you start applying the model to future data.

How to discover new classes in a classification machine learning algorithm?

I'm using a multiclass classifier (a Support Vector Machine, via One-Vs-All) to classify data samples. Let's say I currently have n distinct classes.
However, in the scenario I'm facing, it is possible that a new data sample may belong to a new class n+1 that hasn't been seen before.
So I guess you can say that I need a form of Online Learning, as there is no distinct training set in the beginning that suits all data appearing later. Instead I need the SVM to adapt dynamically to new classes that may appear in the future.
So I'm wondering about if and how I can...
identify that a new data sample does not quite fit into the existing classes but instead should result in creating a new class.
integrate that new class into the existing classifier.
I can vaguely think of a few ideas that might be approaches to solve this problem:
If none of the binary SVM classifiers (as I have one for each class in the OVA case) predicts a fairly high probability (e.g. > 0.5) for the new data sample, I could assume that this new data sample may represent a new class.
I could train a new binary classifier for that new class and add it to the multiclass SVM.
However, these are just my naive thoughts. I'm wondering if there is some "proper" approach for this instead, e.g. using a Clustering algorithms to find all classes.
Or maybe my approach of trying to use an SVM for this is not even appropriate for this kind of problem?
Help on this is greatly appreciated.
As in any other machine learning problem, if you do not have a quality criterion, you suck.
When people say "classification", they have supervised learning in mind: there is some ground truth against which you can train and check your algorithms. If new classes can appear, this ground truth is ambiguous. Imagine one class is "horse", and you see many horses: black horses, brown horses, even white ones. And suddenly you see a zebra. Whoa! Is it a new class or just an unusual horse? The answer will depend on how you are going to use your class labels. The SVM itself cannot decide, because SVM does not use these labels, it only produces them. The decision is up to a human (or to some decision-making algorithm which knows what is "good" and "bad", that is, has its own "loss function" or "utility function").
So you need a supervisor. But how can you assist this supervisor? Two options come to mind:
Anomaly detection. This can help you with early occurences of new classes. After the very first zebra your algorithm sees it can raise an alarm: "There is something unusual!". For example, in sklearn various algorithms from random forest to one-class SVM can be used to detect unusial observations. Then your supervisor can look at them and decide whether they deserve to form an entirely new class.
Clustering. It can help you to make decision about splitting your classes. For example, after the first zebra, you decided it is not worth making a new class. But over time, your algorithm has accumulated dozens of their images. So if you run a clustering algorithm on all the observations labeled as "horses", you might end up with two well-separated clusters. And it will be again up to the supervisor to decide, whether the striped horses should be detached from the plain ones into a new class.
If you want this decision to be purely authomatic, you can split classes if the ratio of within-cluster mean distance to between-cluster distance is low enough. But it will work well only if you have a good distance metric in the first place. And what is "good" is again defined by how you use your algorithms and what your ultimate goal is.

Resources