Multi-output regression in XGBoost - machine-learning

Is it possible to train a model in XGBoost that has multiple continuous outputs (multi-output regression)?
What would be the objective for training such a model?
Thanks in advance for any suggestions.

My suggestion is to use sklearn.multioutput.MultiOutputRegressor as a wrapper around xgb.XGBRegressor. MultiOutputRegressor trains one regressor per target and only requires that the regressor implement fit and predict, which xgboost does.
import numpy as np
import xgboost as xgb
from sklearn.multioutput import MultiOutputRegressor

# get some noised linear data
X = np.random.random((1000, 10))
a = np.random.random((10, 3))
y = np.dot(X, a) + np.random.normal(0, 1e-3, (1000, 3))
# fitting
multioutputregressor = MultiOutputRegressor(xgb.XGBRegressor(objective='reg:linear')).fit(X, y)
# predicting
print(np.mean((multioutputregressor.predict(X) - y)**2, axis=0))  # 0.004, 0.003, 0.005
This is probably the easiest way to regress multi-dimensional targets with xgboost, since you would not need to change any other part of your code (if you were already using the sklearn API).
However, this method does not exploit any relationships between the targets; to do that, you could try designing a custom objective function.

Multiple output regression is now available in the nightly build of XGBoost, and will be included in XGBoost 1.6.0.
See https://github.com/dmlc/xgboost/blob/master/demo/guide-python/multioutput_regression.py for an example.
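As a minimal sketch of the native support (this assumes XGBoost >= 1.6, where the sklearn estimators accept a 2D y directly):
import numpy as np
import xgboost as xgb

X = np.random.random((1000, 10))
y = np.dot(X, np.random.random((10, 3)))  # three continuous targets

# with native multi-output support, a 2D y can be passed directly;
# by default one tree is built per target per boosting round
reg = xgb.XGBRegressor(tree_method="hist", n_estimators=50)
reg.fit(X, y)
print(reg.predict(X).shape)  # (1000, 3)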

The code above generates a warning (reg:linear is now deprecated in favor of reg:squarederror), so here is an updated version of @ComeOnGetMe's answer:
import numpy as np
import xgboost as xgb
from sklearn.multioutput import MultiOutputRegressor
# get some noised linear data
X = np.random.random((1000, 10))
a = np.random.random((10, 3))
y = np.dot(X, a) + np.random.normal(0, 1e-3, (1000, 3))
# fitting
multioutputregressor = MultiOutputRegressor(xgb.XGBRegressor(objective='reg:squarederror')).fit(X, y)
# predicting
print(np.mean((multioutputregressor.predict(X) - y)**2, axis=0))
Out:
[2.00592697e-05 1.50084441e-05 2.01412247e-05]

I would add a comment but I lack the reputation. In addition to @Jesse Anderson's answer: to install the most recent version, select the top link from here:
https://s3-us-west-2.amazonaws.com/xgboost-nightly-builds/list.html?prefix=master/
Make sure to select the one for your operating system.
Use pip install to install the wheel, e.g. for macOS:
pip install https://s3-us-west-2.amazonaws.com/xgboost-nightly-builds/master/xgboost-1.6.0.dev0%2B4d81c741e91c7660648f02d77b61ede33cef8c8d-py3-none-macosx_10_15_x86_64.macosx_11_0_x86_64.macosx_12_0_x86_64.whl

You can use linear regression, random forest regressors, and some other related algorithms in scikit-learn to produce multi-output regression. I am not sure about XGBoost; the boosting regressor in scikit-learn does not allow multiple outputs. For those asking when this may be necessary: one example would be forecasting multiple steps of a time series ahead.

Based on the above discussion, I have extended the univariate XGBoostLSS to a multivariate framework called Multi-Target XGBoostLSS Regression, which models multiple targets and their dependencies in a probabilistic regression setting. Code to follow soon.

Related

LightGBM predicts negative values

My LightGBM regressor model returns negative values.
For XGBoost there is the objective='count:poisson' hyperparameter to prevent it from returning negative predictions.
Is there any way to do this with LightGBM?
Github issue => https://github.com/microsoft/LightGBM/issues/5629
LightGBM also supports Poisson regression. For example, consider the following Python code.
import lightgbm as lgb
import numpy as np
from matplotlib import pyplot
# random Poisson-distributed target and one informative feature
y = np.random.poisson(lam=15.0, size=1_000)
X = y + np.random.normal(loc=10.0, scale=2.0, size=(y.shape[0], ))
X = X.reshape(-1, 1)
# fit a Poisson regression model
reg = lgb.LGBMRegressor(
    objective="poisson",
    n_estimators=150,
    min_data=1
)
reg.fit(X, y)
# get predictions
preds = reg.predict(X)
print("summary of predicted values")
print(f" * min: {round(np.min(preds), 3)}")
print(f" * max: {round(np.max(preds), 3)}")
# compare predicted distribution to the empirical one
bins = np.linspace(0, 30, 50)
pyplot.hist(y, bins, alpha=0.5, label='actual')
pyplot.hist(preds, bins, alpha=0.5, label='predicted')
pyplot.legend(loc='upper right')
pyplot.show()
This example uses Python 3.10 and lightgbm==3.3.3.
However... I don't recommend using Poisson regression just to achieve "no negative predictions". The Poisson loss function is intended to be used for cases where you believe your target is Poisson-distributed, e.g. it looks like counts of events observed over some regular interval like time or space.
Other options you might consider to try to achieve the behavior "never predict a negative number from LightGBM regression":
write a custom objective function in one of the interfaces that support it, like the R or Python package
post-process LightGBM's predictions, recoding negative values to 0 (see the sketch after this list)
pre-process the target variable such that there are no negative values (e.g. dropping such observations, re-scaling, taking the absolute value)
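For instance, the post-processing option is a one-liner; a minimal sketch, assuming preds from the Poisson example above:
import numpy as np

# recode any negative predictions to 0
preds_nonneg = np.clip(preds, a_min=0.0, a_max=None)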
LightGBM also provides an objective parameter which can be set to 'poisson'; see the LightGBM documentation for more information.
An example for LGBMRegressor (scikit-learn API):
from lightgbm import LGBMRegressor
regressor = LGBMRegressor(objective='poisson')

Why does Naive Bayes give results on training and test data, but a negative-values error when used with GridSearchCV?

I have studied some related questions regarding Naive Bayes; here are the links: link1, link2, link3. I am using TF-IDF for feature selection and Naive Bayes for classification. After fitting, the model made predictions successfully. Here is the output:
accuracy = train_model(model, xtrain, train_y, xtest)
print("NB, CharLevel Vectors: ", accuracy)
NB, accuracy: 0.5152523571824736
I don't understand why Naive Bayes gave no error during training and testing, while the following GridSearchCV code fails:
from sklearn.preprocessing import PowerTransformer
params_NB = {'alpha': [1.0], 'class_prior': [None], 'fit_prior': [True]}
gs_NB = GridSearchCV(estimator=model,
                     param_grid=params_NB,
                     cv=cv_method,
                     verbose=1,
                     scoring='accuracy')
Data_transformed = PowerTransformer().fit_transform(xtest.toarray())
gs_NB.fit(Data_transformed, test_y)
It gave this error
Negative values in data passed to MultinomialNB (input X)
TL;DR: PowerTransformer, which you apply only in the GridSearchCV case, produces negative data, which makes MultinomialNB fail, as expected and as explained in detail below; if your initial xtrain and train_y are indeed TF-IDF features, and you do not transform them with PowerTransformer (you don't show anything like that), the fact that they work fine is also unsurprising and expected.
Although not terribly clear from the documentation:
The multinomial Naive Bayes classifier is suitable for classification with discrete features (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work.
reading closely, you realize that it implies that all the features should be non-negative.
This has a statistical basis indeed; from the Cross Validated thread Naive Bayes questions: continus data, negative data, and MultinomialNB in scikit-learn:
MultinomialNB assumes that features have multinomial distribution which is a generalization of the binomial distribution. Neither binomial nor multinomial distributions can contain negative values.
See also the (open) Github issue MultinomialNB fails when features have negative values (it is for a different library, not scikit-learn, but the underlying mathematical rationale is the same).
It is not actually difficult to demonstrate this; using the example available in the documentation:
import numpy as np
rng = np.random.RandomState(1)
X = rng.randint(5, size=(6, 100)) # random integer data
y = np.array([1, 2, 3, 4, 5, 6])
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB()
clf.fit(X, y) # works OK
# inspect X
X # only 0's and positive integers
Now, changing a single element of X to a negative number and trying to fit again:
X[1][0] = -1
clf.fit(X, y)
gives indeed:
ValueError: Negative values in data passed to MultinomialNB (input X)
What can you do? As the Github thread linked above suggests:
Either use MinMaxScaler(), which will bring all the features to [0, 1] (see the sketch after this list)
Or use GaussianNB instead, which does not suffer from this limitation
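A minimal sketch of the first option, reusing the X, y, and clf from the demonstration above (with the negative entry in place):
from sklearn.preprocessing import MinMaxScaler

# rescale every feature to [0, 1] so MultinomialNB accepts the data
X_scaled = MinMaxScaler().fit_transform(X)
clf.fit(X_scaled, y)  # now fits without the ValueError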

What is the leaf-score in LightGBM (classification)?

I have trained LightGBM on a binary classification problem, and when plotting a tree I get leaves like this:
I struggle to find the loss-function for the classification trees - Does LightGBM minimize the cross-entropy in the binary case, and is that the leaf score?
I struggle to find the loss-function for the classification trees - Does LightGBM minimize the cross-entropy in the binary case
Yes, if you don't specify an objective then LGBMClassifier will use cross-entropy by default. The documentation in https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.LGBMClassifier.html#lightgbm.LGBMClassifier says that the default for objective is "binary", and then https://lightgbm.readthedocs.io/en/latest/Parameters.html#objective notes that binary is cross-entropy loss.
and is that the leaf score?
The values like leaf 33: -2.209 ("leaf scores") represent the value of the target that will be predicted for instances in that leaf node, multiplied by the learning rate.
Negative values are possible because of the way the boosting process works. Each tree is trained on the residuals of the model up to that tree. A prediction from a model is obtained by summing the output of all trees. The XGBoost docs have a very good explanation of this: "Introduction to Boosted Trees".
In the future, please try to provide a small reproducible example explaining how you created a figure that you're asking questions about. I assumed something like the following Python code, using lightgbm 3.1.0. You can change the values of tree_index to see the different trees in the model.
import lightgbm as lgb
from sklearn.datasets import load_breast_cancer
X, y = load_breast_cancer(return_X_y=True)
gbm = lgb.LGBMClassifier(
    n_estimators=10,
    num_leaves=3,
    max_depth=8,
    min_data_in_leaf=3,
)
gbm.fit(X, y)
# visualize tree structure as a directed graph
ax = lgb.plot_tree(
    gbm,
    tree_index=0,
    figsize=(15, 8),
    show_info=[
        'data_percentage',
    ]
)
# visualize tree structure in a dataframe
gbm.booster_.trees_to_dataframe()
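As a sketch of how the leaf scores combine (assuming the gbm and X from the code above): summing the leaf outputs of all trees gives the raw score, and in the binary case the sigmoid of the raw score is the predicted probability.
import numpy as np

# raw_score=True returns the summed leaf outputs over all trees
raw = gbm.booster_.predict(X, raw_score=True)
prob = 1.0 / (1.0 + np.exp(-raw))  # sigmoid of the raw score
print(np.allclose(prob, gbm.predict_proba(X)[:, 1]))  # True, up to float precision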

Does Scikit-learn support transfer learning?

Does Scikit-learn support transfer learning? Please check the following code.
Model clf is obtained by fit(X, y).
Can model clf2 learn on the basis of clf, i.e. transfer-learn via fit(X2, y2)?
>>> from sklearn import svm
>>> from sklearn import datasets
>>> clf = svm.SVC()
>>> X, y= ....
>>> clf.fit(X, y)
SVC()
>>> import pickle
>>> s = pickle.dumps(clf)
>>> clf2 = pickle.loads(s)
>>> clf2.fit(X2,y2)
>>> clf2.predict(X[0:1])
In the context of scikit-learn there is no transfer learning as such; there is incremental learning (also called continuous or online learning).
By looking at your code, whatever you're intending to do won't work the way you think it does. From the scikit-learn documentation:
Calling fit() more than once will overwrite what was learned by any
previous fit()
Which means using fit() more than once on the same model will simply overwrite all the previously fitted coefficients, weights, intercept (bias), etc.
However, if you want to fit a portion of your data set and then improve your model by fitting new data, look for estimators that implement the partial_fit API (a sketch follows below).
If we call partial_fit() multiple times, the framework will update the
existing weights instead of re-initialising them.
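A minimal sketch with SGDClassifier, one of the estimators that implements partial_fit (the random data is just for illustration):
import numpy as np
from sklearn.linear_model import SGDClassifier

X1, y1 = np.random.random((100, 5)), np.random.randint(0, 2, 100)
X2, y2 = np.random.random((100, 5)), np.random.randint(0, 2, 100)

clf = SGDClassifier()
clf.partial_fit(X1, y1, classes=[0, 1])  # first call must list all classes
clf.partial_fit(X2, y2)                  # later calls update the existing weights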
Another way to do incremental learning with scikit-learn is to look for algorithms that support the warm_start parameter.
From this doc:
warm_start: bool, default=False
When set to True, reuse the solution of
the previous call to fit() as initialization, otherwise, just erase the
previous solution. Useless for liblinear solver.
Another example is the random forest regressor; a sketch follows below.
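A minimal sketch of warm_start with RandomForestRegressor (random data for illustration):
import numpy as np
from sklearn.ensemble import RandomForestRegressor

X, y = np.random.random((100, 5)), np.random.random(100)
reg = RandomForestRegressor(n_estimators=50, warm_start=True)
reg.fit(X, y)             # first fit: builds 50 trees
reg.n_estimators = 100
reg.fit(X, y)             # reuses the 50 existing trees and trains 50 more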

SciKit Learn feature selection and cross validation using RFECV

I am still very new to machine learning and trying to figure things out myself. I am using SciKit learn and have a data set of tweets with around 20,000 features (n_features=20,000). So far I achieved a precision, recall and f1 score of around 79%. I would like to use RFECV for feature selection and improve the performance of my model. I have read the SciKit learn documentation but am still a bit confused on how to use RFECV.
This is the code I have so far:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.cross_validation import StratifiedShuffleSplit
from sklearn.cross_validation import cross_val_score
from sklearn.feature_selection import RFECV
from sklearn import metrics
# cross validation
sss = StratifiedShuffleSplit(y, 5, test_size=0.2, random_state=42)
for train_index, test_index in sss:
    docs_train, docs_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
# feature extraction
count_vect = CountVectorizer(stop_words='english', min_df=3, max_df=0.90, ngram_range=(1,3))
X_CV = count_vect.fit_transform(docs_train)
tfidf_transformer = TfidfTransformer()
X_tfidf = tfidf_transformer.fit_transform(X_CV)
# Create the RFECV object
nb = MultinomialNB(alpha=0.5)
# The "accuracy" scoring is proportional to the number of correct classifications
rfecv = RFECV(estimator=nb, step=1, cv=2, scoring='accuracy')
rfecv.fit(X_tfidf, y_train)
X_rfecv=rfecv.transform(X_tfidf)
print("Optimal number of features : %d" % rfecv.n_features_)
# train classifier
clf = MultinomialNB(alpha=0.5).fit(X_rfecv, y_train)
# test clf on test data
X_test_CV = count_vect.transform(docs_test)
X_test_tfidf = tfidf_transformer.transform(X_test_CV)
X_test_rfecv = rfecv.transform(X_test_tfidf)
y_predicted = clf.predict(X_test_rfecv)
#print the mean accuracy on the given test data and labels
print ("Classifier score is: %s " % rfecv.score(X_test_rfecv,y_test))
Three questions:
1) Is this the correct way to use cross validation and RFECV? I am especially interested to know if I am running any risk of overfitting.
2) The accuracy of my model before and after I implemented RFECV with the above code are almost the same (around 78-79%), which puzzles me. I would expect performance to improve by using RFECV. Anything I might have missed here or could do differently to improve the performance of my model?
3) What other feature selection methods could you recommend me to try? I have tried RFE and SelectKBest so far, but they both haven't given me any improvement in terms of model accuracy.
To answer your questions:
There is cross-validation built into the RFECV feature selection (hence the name), so you don't really need additional cross-validation for this single step. However, since I understand you are running several tests, it's good to have an overall cross-validation to make sure you're not overfitting to a specific train-test split. I'd like to mention two points here:
I doubt the code behaves exactly like you think it does ;).
# cross validation
sss = StratifiedShuffleSplit(y, 5, test_size=0.2, random_state=42)
for train_index, test_index in sss:
    docs_train, docs_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
# feature extraction
count_vect = CountVectorizer(stop_words='english', min_df=3, max_df=0.90, ngram_range=(1,3))
X_CV = count_vect.fit_transform(docs_train)
Here we first go through the loop, which has 5 iterations (the n_iter parameter of StratifiedShuffleSplit). Then we exit the loop and run all the rest of the code with the last values of train_index and test_index. So this is equivalent to a single train-test split, where you probably meant to have 5. You should move your code back into the loop if you want it to run like a 'proper' cross-validation.
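A sketch of the corrected structure, keeping your variable names (the per-split work is only outlined):
# all per-split work stays inside the loop, so every split is used
sss = StratifiedShuffleSplit(y, 5, test_size=0.2, random_state=42)
for train_index, test_index in sss:
    docs_train, docs_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # ... feature extraction, RFECV, training and evaluation here,
    # collecting one score per iteration and averaging at the end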
You are worried about overfitting: indeed, when 'looking for the best method' there is a risk of picking the method that works best... only on the small sample we're testing it on.
Here the best practice is to have a first train-test split, then to perform cross-validation only using the train set. The test set can be used 'sparingly' when you think you found something, to make sure the scores you get are consistent and you're not overfitting.
It may look like you're throwing away 30% of your data (your test set), but it's absolutely worth it.
It can be puzzling to see feature selection does not have that big an impact. To introspect a bit more you could look into the evolution of the score with the number of selected features (see the example from the docs).
That being said, I don't think this is the right use case for RFE. With your code you are eliminating features one by one, which probably takes a long time to run and does not make much sense when you have 20,000 features.
Other feature selection methods: here you mention SelectKBest but you don't tell us which method you use to score your features! SelectKBest will pick the K best features according to a score function. I'm guessing you were using the default which is ok, but it's better to have an idea of what the default does ;).
I would try SelectPercentile with chi2 as a score function. SelectPercentile is probably a bit more convenient than SelectKBest because if your dataset grows a percentage probably makes more sense than a hardcoded number of features.
Another example from the docs that does just that (and more).
Additional remarks:
You could use a TfidfVectorizer instead of a CountVectorizer followed by a TfidfTransformer. This is strictly equivalent.
You could use a pipeline object to pack the different steps of your classifier into a single object you can run cross validation on (I encourage you to read the docs, it's pretty useful).
from sklearn.feature_selection import chi2
from sklearn.feature_selection import SelectPercentile
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
pipeline = Pipeline(steps=[
    ("vectorizer", TfidfVectorizer(stop_words='english', min_df=3, max_df=0.90, ngram_range=(1,3))),
    ("selector", SelectPercentile(score_func=chi2, percentile=70)),
    ('NB', MultinomialNB(alpha=0.5))
])
Then you'd be able to run cross validation on the pipeline object to find the best combination of alpha and percentile, which is much harder to do with separate estimators.
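For example, a minimal sketch, assuming a recent scikit-learn (where GridSearchCV lives in sklearn.model_selection) and the pipeline plus docs_train, y_train from your code; the grid values are illustrative:
from sklearn.model_selection import GridSearchCV

# search the selector percentile and the NB smoothing jointly
param_grid = {
    "selector__percentile": [50, 70, 90],
    "NB__alpha": [0.1, 0.5, 1.0],
}
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(docs_train, y_train)
print(search.best_params_)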
Hope this helps, happy learning ;).
