With two different versions of Sklearn, I get TOTALLY different results from the Decision Tree - machine-learning

Hello, I am facing a strange problem. I have the exact same code, the same database, and the same train and test data, but when I run it in Jupyter, which uses sklearn version 0.24.1, I get this prediction plot:
[plot for version 0.24.1]
But when I switch to PyCharm and run the same code with sklearn version 1.0.1, I get this plot:
[plot for version 1.0.1]
and here is how I fit the model:
from sklearn.tree import DecisionTreeRegressor

model = DecisionTreeRegressor()
model.fit(x_train, y_train)
The results were totally different from what I expected
How can I achieve the same result with the new version?
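If the variance comes from the tree's randomized tie-breaking between equally good splits (an assumption, not a confirmed diagnosis), fixing random_state makes runs on a single scikit-learn version reproducible; results can still differ across versions because the splitting internals changed between releases. A minimal sketch with synthetic stand-ins for x_train and y_train:
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-ins for the x_train / y_train used in the question
rng = np.random.RandomState(0)
x_train = rng.random_sample((200, 4))
y_train = x_train @ np.array([1.0, -2.0, 0.5, 3.0]) + rng.normal(0, 0.1, 200)

# A fixed seed makes repeated runs on one scikit-learn version reproducible;
# it does not guarantee identical trees between 0.24.1 and 1.0.1.
model = DecisionTreeRegressor(random_state=42)
model.fit(x_train, y_train)
print(model.predict(x_train[:5]))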

Related

Decision Tree classifier throws KeyError: 'log_loss'

I used a decision tree from sklearn; normally there should be a log_loss criterion:
from sklearn.tree import DecisionTreeClassifier

classifier = DecisionTreeClassifier(random_state=42, class_weight='balanced', criterion='log_loss')
classifier.fit(X_train, y_train)
Error:
KeyError: 'log_loss'
The log_loss option for the criterion parameter was only added in scikit-learn 1.1 (the documentation quoted below is from 1.1.2):
criterion{“gini”, “entropy”, “log_loss”}, default=”gini”
It is not there in either of the two previous documented releases, version 1.0.2 or version 0.24.2:
criterion{“gini”, “entropy”}, default=”gini”
The error suggests that you are using an older version; you can check your scikit-learn version with
import sklearn
print(sklearn.__version__)
So, you will need to upgrade scikit-learn to v1.1.2.
Note also that in scikit-learn the log_loss and entropy criteria are equivalent: both measure the Shannon information gain, and both work for binary and multiclass targets.
So if you cannot upgrade, using criterion='entropy' gives you the same impurity measure on older versions.
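A minimal sketch of that fallback, assuming you want one snippet that runs on both old and new scikit-learn versions (the version check and the synthetic data are illustrative):
import sklearn
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# 'log_loss' exists from scikit-learn 1.1 onwards; 'entropy' is the
# mathematically equivalent criterion available on older versions.
major, minor = (int(v) for v in sklearn.__version__.split('.')[:2])
criterion = 'log_loss' if (major, minor) >= (1, 1) else 'entropy'

X_train, y_train = make_classification(n_samples=200, n_features=5, random_state=42)
classifier = DecisionTreeClassifier(random_state=42, class_weight='balanced', criterion=criterion)
classifier.fit(X_train, y_train)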

Catboost's Incremental training with "init_model" fails when not all initial labels are present in new data

catboost python version: 1.0.6
I am training a CatBoostClassifier on 10 different output classes, which works fine. Then I incrementally train a new classifier, passing the earlier trained model as init_model and fitting on a new training dataset. The catch is that this dataset contains only 2 of the original 10 unique labels. CatBoost already warns me with: Found only 2 unique classes in the data, but have defined 10 classes. Probably something is wrong with data.
but it starts to train fine anyway. Only at the end (I assume when the new model gets merged with the original one?) do I get the following error message:
Exception has occurred: CatBoostError
CatBoostError: catboost/libs/model/model.cpp:1716: Approx dimensions don't match: 10 != 2
Is it expected behavior that incremental training is not possible on only a subset of the original classes? If so, a clearer error message would be helpful. It would be even better if the code could handle this case, but maybe I'm overlooking something that makes such functionality impossible.
A similar issue has been posted on GitHub: https://github.com/catboost/catboost/issues/1953
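A minimal sketch of the setup described above, on synthetic data (shapes, iteration counts, and variable names are illustrative, not the original code):
import numpy as np
from catboost import CatBoostClassifier

rng = np.random.RandomState(0)

# Initial training on all 10 classes
X_full = rng.random_sample((1000, 5))
y_full = rng.randint(0, 10, 1000)
base_model = CatBoostClassifier(iterations=50, verbose=False)
base_model.fit(X_full, y_full)

# Incremental training on data that contains only 2 of the 10 classes
X_new = rng.random_sample((200, 5))
y_new = rng.randint(0, 2, 200)
new_model = CatBoostClassifier(iterations=50, verbose=False)
# Warns "Found only 2 unique classes ..." during fitting, then raises
# CatBoostError: Approx dimensions don't match: 10 != 2 when the models are merged.
new_model.fit(X_new, y_new, init_model=base_model)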

sklearn High score with low performance

Could you kindly help me decide whether I'm hitting a bug or whether the problem is in my implementation?
I have a data set with 5 features and 2000+ observations, and I use SVR for regression and select parameters with grid search. If I don't scale my data, I get a best score close to zero, but if I do scale it, the best score is around 0.90.
When I manually test the model, it predicts wrong values, seemingly at random. How can this be? I expect the best score to show how well the trained model validated on held-out data during cross-validation. I should not get a high score if my model cannot generalize, should I? Could this be a bug?
The scikit-learn version is 0.19.1 (from the Ubuntu Linux 18.04 x64 LTS package).
Python version is 3.6.7
Would it be worth upgrading with pip? Any further ideas? Thank you.
Edit: see the following code, which produces a high score yet still generalizes badly. Since this is regression, the scoring should reflect how far the predictions are from the test values:
import numpy as np
from sklearn import svm
from sklearn.model_selection import GridSearchCV

# Exponentially spaced grids for C and gamma
C_range = 2.0 ** np.arange(-5, 15, 2)
gamma_range = 2.0 ** np.arange(-5, 15, 2)
parameters = {"kernel": ["rbf"], "C": C_range, "gamma": gamma_range}
estimator = svm.SVR()
clf = GridSearchCV(estimator, parameters, cv=3, n_jobs=-1, verbose=0)
clf.fit(x, y)
print(clf.best_score_)
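One common cause of this symptom (offered as an assumption, not a confirmed diagnosis of your code) is scaling the whole dataset before cross-validation, or scaling the training data differently from the data used in the manual test. A sketch that keeps the scaler inside a Pipeline, so the exact same transform is learned on the training folds and applied at predict time; the synthetic data only mimics the 5-feature, 2000-observation setup:
import numpy as np
from sklearn import svm
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 5-feature, 2000-observation dataset
rng = np.random.RandomState(0)
x = rng.random_sample((2000, 5))
y = x @ np.array([2.0, -1.0, 0.5, 3.0, -2.0]) + rng.normal(0, 0.1, 2000)

x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0)

pipe = Pipeline([("scale", StandardScaler()), ("svr", svm.SVR(kernel="rbf"))])
parameters = {"svr__C": 2.0 ** np.arange(-5, 15, 2),
              "svr__gamma": 2.0 ** np.arange(-5, 15, 2)}
clf = GridSearchCV(pipe, parameters, cv=3, n_jobs=-1)
clf.fit(x_train, y_train)

print(clf.best_score_)            # cross-validated R^2 on the training folds
print(clf.score(x_test, y_test))  # R^2 on data the search never saw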

How does one save the data results that resulted from a keras experiment run?

I want to save the results of my experiments in Keras, not the model. For example, I want to save everything that results from:
''' Plots '''
import numpy as np
import matplotlib.pyplot as plt

# cnn is the History object returned by model.fit(...)
if plot:
    # Plots for training and testing process: loss and accuracy
    plt.figure(0)
    plt.plot(cnn.history['acc'], 'r')
    plt.plot(cnn.history['val_acc'], 'g')
    plt.xticks(np.arange(0, nb_epochs+1, 2.0))
    plt.rcParams['figure.figsize'] = (8, 6)
    plt.xlabel("Num of Epochs")
    plt.ylabel("Accuracy")
    plt.title("Training Accuracy vs Validation Accuracy")
    plt.legend(['train', 'validation'])

    plt.figure(1)
    plt.plot(cnn.history['loss'], 'r')
    plt.plot(cnn.history['val_loss'], 'g')
    plt.xticks(np.arange(0, nb_epochs+1, 2.0))
    plt.rcParams['figure.figsize'] = (8, 6)
    plt.xlabel("Num of Epochs")
    plt.ylabel("Loss")
    plt.title("Training Loss vs Validation Loss")
    plt.legend(['train', 'validation'])
How do I save all of that so I can plot the figures again later and inspect what happened during training?
The Keras FAQ page
https://keras.io/getting-started/faq/#how-can-i-save-a-keras-model
doesn't seem to explain it. Help?
The pickle module lets you serialize Python objects.
You can save the history dict with:
pkl.dump(cnn.history, file_obj)  # pkl is the pickle module; file_obj is a file opened in 'wb' mode
If you want to save your plots as an image:
plt.savefig(path)
You can also try to pickle matplotlib Figure/Axes objects to recreate the interactive plots but this feature is experimental. I would suggest just pickling your history dict and then regenerating the plots with your code above.
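A minimal round-trip sketch of that approach; the file name is illustrative, and the small dict stands in for history.history from the History object that model.fit returns:
import pickle
import matplotlib.pyplot as plt

# Stand-in for the real thing: history = model.fit(...); history.history
history_dict = {'loss': [0.9, 0.5, 0.3], 'val_loss': [1.0, 0.6, 0.4]}

# Save the raw numbers once training is done
with open("training_history.pkl", "wb") as f:
    pickle.dump(history_dict, f)

# Later, in a fresh session: reload and re-plot
with open("training_history.pkl", "rb") as f:
    hist = pickle.load(f)

plt.plot(hist['loss'], label='train')
plt.plot(hist['val_loss'], label='validation')
plt.xlabel("Num of Epochs")
plt.ylabel("Loss")
plt.legend()
plt.show()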

Multi-output regression in xgboost

Is it possible to train a model in XGBoost that has multiple continuous outputs (multi-output regression)?
What would be the objective for training such a model?
Thanks in advance for any suggestions.
My suggestion is to use sklearn.multioutput.MultiOutputRegressor as a wrapper of xgb.XGBRegressor. MultiOutputRegressor trains one regressor per target and only requires that the regressor implements fit and predict, which xgboost happens to support.
import numpy as np
import xgboost as xgb
from sklearn.multioutput import MultiOutputRegressor

# get some noised linear data
X = np.random.random((1000, 10))
a = np.random.random((10, 3))
y = np.dot(X, a) + np.random.normal(0, 1e-3, (1000, 3))
# fitting
multioutputregressor = MultiOutputRegressor(xgb.XGBRegressor(objective='reg:linear')).fit(X, y)
# predicting
print(np.mean((multioutputregressor.predict(X) - y)**2, axis=0))  # ~0.004, 0.003, 0.005
This is probably the easiest way to regress multi-dimensional targets using xgboost, since you don't need to change any other part of your code (if you were using the sklearn API originally).
However, this method does not exploit any relationships between the targets; you could try to design a customized objective function to achieve that.
Multiple output regression is now available in the nightly build of XGBoost, and will be included in XGBoost 1.6.0.
See https://github.com/dmlc/xgboost/blob/master/demo/guide-python/multioutput_regression.py for an example.
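A minimal sketch of that native route, assuming XGBoost >= 1.6 and the histogram tree method (which, per the linked demo, is what the multi-output support builds on); the synthetic data mirrors the example above:
import numpy as np
import xgboost as xgb

# Same kind of noised linear data: 10 features, 3 continuous targets
X = np.random.random((1000, 10))
a = np.random.random((10, 3))
y = np.dot(X, a) + np.random.normal(0, 1e-3, (1000, 3))

# With XGBoost >= 1.6, a 2-D y can be passed directly: one model, no wrapper
reg = xgb.XGBRegressor(tree_method="hist", objective="reg:squarederror")
reg.fit(X, y)
print(np.mean((reg.predict(X) - y) ** 2, axis=0))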
The code above generates a warning: reg:linear is now deprecated in favor of reg:squarederror, so here is an updated answer based on #ComeOnGetMe's:
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.multioutput import MultiOutputRegressor
# get some noised linear data
X = np.random.random((1000, 10))
a = np.random.random((10, 3))
y = np.dot(X, a) + np.random.normal(0, 1e-3, (1000, 3))
# fitting
multioutputregressor = MultiOutputRegressor(xgb.XGBRegressor(objective='reg:squarederror')).fit(X, y)
# predicting
print(np.mean((multioutputregressor.predict(X) - y)**2, axis=0))
Out:
[2.00592697e-05 1.50084441e-05 2.01412247e-05]
I would place a comment but I lack the reputation. In addition to #Jesse Anderson's answer: to install the most recent nightly build, select the top link from here:
https://s3-us-west-2.amazonaws.com/xgboost-nightly-builds/list.html?prefix=master/
Make sure to select the one for your operating system.
Use pip to install the wheel, e.g. for macOS:
pip install https://s3-us-west-2.amazonaws.com/xgboost-nightly-builds/master/xgboost-1.6.0.dev0%2B4d81c741e91c7660648f02d77b61ede33cef8c8d-py3-none-macosx_10_15_x86_64.macosx_11_0_x86_64.macosx_12_0_x86_64.whl
You can use linear regression, random forest regressors, and some other related algorithms in scikit-learn to produce multi-output regression; I'm not sure about xgboost. The gradient boosting regressor in scikit-learn does not allow multiple outputs. For people asking when this may be necessary, one example is forecasting multiple steps of a time series ahead.
Based on the above discussion, I have extended the univariate XGBoostLSS to a multivariate framework called Multi-Target XGBoostLSS Regression that models multiple targets and their dependencies in a probabilistic regression setting. Code follows soon.

Resources