I'm using the LinearRegression model in Spark ML for prediction.
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

featureassembler = VectorAssembler(
    inputCols=['Year', 'Present_Price', 'Kms_Driven', 'Owner'],
    outputCol='features')
output = featureassembler.transform(df)
data = output.select('features', 'Selling_Price')

# Initialize a Linear Regression model
ss = LinearRegression(featuresCol='features', labelCol='Selling_Price')
I want to test linear regression with SGD (stochastic gradient descent), but pyspark.ml does not offer anything like LinearRegressionWithSGD the way mllib does. Also, when looking at mllib's LinearRegressionWithSGD, I found that it has been deprecated since version 2.0.0.
How can I use ml for linear regression with SGD? Is there a parameter I can set for that?
There is no SGD option in ml's LinearRegression (its solver parameter only accepts 'auto', 'normal' and 'l-bfgs'), so instead of ml you can use mllib:
from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD, LinearRegressionModel
Here is the documentation:
https://spark.apache.org/docs/1.6.1/mllib-linear-methods.html
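A minimal sketch of the mllib route, reusing the data DataFrame from the question (mllib works on an RDD of LabeledPoint rather than a DataFrame, and the iterations and step values are only illustrative):
import pyspark.mllib.linalg as mllib_linalg
from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD

# convert the assembled ml vectors into mllib LabeledPoints (label, features)
train_rdd = data.rdd.map(
    lambda row: LabeledPoint(row['Selling_Price'],
                             mllib_linalg.Vectors.fromML(row['features'])))

# train with SGD; step is the learning rate
model = LinearRegressionWithSGD.train(train_rdd, iterations=100, step=0.01)
print(model.weights, model.intercept)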
I have trained LightGBM on a binary classification problem, and when plotting the tree I get some leaves like this (for example, leaf 33: -2.209).
I struggle to find the loss function for the classification trees. Does LightGBM minimize the cross-entropy in the binary case, and is that the leaf score?
I struggle to find the loss function for the classification trees. Does LightGBM minimize the cross-entropy in the binary case?
Yes, if you don't specify an objective then LGBMClassifier will use cross-entropy by default. The documentation in https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.LGBMClassifier.html#lightgbm.LGBMClassifier says that the default for objective is "binary", and then https://lightgbm.readthedocs.io/en/latest/Parameters.html#objective notes that binary is cross-entropy loss.
and is that the leaf score?
The values like leaf 33: -2.209 ("leaf scores") represent the value of the target that will be predicted for instances in that leaf node, multiplied by the learning rate.
Negative values are possible because of the way the boosting process works. Each tree is trained on the residuals of the model up to that tree. A prediction from a model is obtained by summing the output of all trees. The XGBoost docs have a very good explanation of this: "Introduction to Boosted Trees".
In the future, please try to provide a small reproducible example explaining how you created the figure you're asking about. I assumed something like the following Python code, using lightgbm 3.1.0. You can change the value of tree_index to see the different trees in the model.
import lightgbm as lgb
from sklearn.datasets import load_breast_cancer

# load a small binary classification dataset
X, y = load_breast_cancer(return_X_y=True)

# train a small model so the individual trees are easy to inspect
gbm = lgb.LGBMClassifier(
    n_estimators=10,
    num_leaves=3,
    max_depth=8,
    min_data_in_leaf=3,
)
gbm.fit(X, y)
# visualize tree structure as a directed graph
ax = lgb.plot_tree(
    gbm,
    tree_index=0,
    figsize=(15, 8),
    show_info=[
        'data_percentage',
    ]
)
# visualize tree structure in a dataframe
gbm.booster_.trees_to_dataframe()
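To connect the leaf scores to the final prediction: the raw score for a sample is the sum of the leaf values over all trees, and for the "binary" objective the class-1 probability is the sigmoid of that sum. A small check, assuming the model above:
import numpy as np

# raw score = sum of the leaf values across all 10 trees for each sample
raw = gbm.booster_.predict(X[:5], raw_score=True)
# the class-1 probability is the sigmoid of the raw score
prob = 1.0 / (1.0 + np.exp(-raw))
print(np.allclose(prob, gbm.predict_proba(X[:5])[:, 1]))  # True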
Does Scikit-learn support transfer learning? Please check the following code.
Model clf is obtained by fit(X, y).
Can model clf2 build on clf and transfer-learn via fit(X2, y2)?
>>> from sklearn import svm
>>> from sklearn import datasets
>>> clf = svm.SVC()
>>> X, y= ....
>>> clf.fit(X, y)
SVC()
>>> import pickle
>>> s = pickle.dumps(clf)
>>> clf2 = pickle.loads(s)
>>> clf2.fit(X2,y2)
>>> clf2.predict(X[0:1])
In the context of scikit-learn there is no transfer learning as such; what there is instead is incremental learning (also called continuous or online learning).
Looking at your code, what you intend to do won't work the way you're expecting. From the scikit-learn documentation:
Calling fit() more than once will overwrite what was learned by any
previous fit()
This means that calling fit() more than once on the same model simply overwrites all the previously fitted coefficients, weights, intercept (bias), etc.
However, if you want to fit a portion of your data set and then improve your model by fitting new data, look for estimators that implement the partial_fit API, as in the sketch after the quote below.
If we call partial_fit() multiple times, framework will update the
existing weights instead of re-initialising them.
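A minimal sketch with SGDClassifier, one of the estimators that implements partial_fit (the class labels have to be declared on the first call; X2 and y2 stand for the new batch from the question):
import numpy as np
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier()
# first batch: all classes must be passed up front
clf.partial_fit(X, y, classes=np.unique(y))
# later batches update the existing weights instead of re-initialising them
clf.partial_fit(X2, y2)
clf.predict(X[0:1])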
Another way to do incremental learning with scikit-learn is to look for algorithms that support the warm_start parameter.
From this doc:
warm_start: bool, default=False
When set to True, reuse the solution of the previous call to fit() as initialization, otherwise, just erase the previous solution. Useless for liblinear solver.
Another example is RandomForestRegressor, as in the sketch below.
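A minimal sketch with RandomForestRegressor and warm_start (the n_estimators values are illustrative; the extra trees are grown on the new data while the original ones are kept):
from sklearn.ensemble import RandomForestRegressor

reg = RandomForestRegressor(n_estimators=100, warm_start=True)
reg.fit(X, y)

# grow 50 additional trees on the new batch; the first 100 are kept as-is
reg.set_params(n_estimators=150)
reg.fit(X2, y2)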
How can I do a grid search for a fastai deep learning model (I use an LSTM) to tune the hyperparameters of a Learner (https://docs.fast.ai/basic_train.html#Learner)? My code is:
from fastai.train import Learner
from fastai.train import DataBunch
model = NeuralNet(embedding_matrix, y_aux_train.shape[-1])
learn = Learner(databunch, model, loss_func=custom_loss)
I wonder how I can tune the hyperparameters as in a standard sklearn classification algorithm, using https://scikit-learn.org/stable/modules/grid_search.html.
You can use skorch (https://github.com/skorch-dev/skorch), which allows you to run sklearn's grid search on a PyTorch model; that model can then be wrapped in a fastai Learner.
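A minimal sketch of the skorch route, assuming a hypothetical PyTorch module MyLSTM that takes a hidden_size argument (the grid values and X_train/y_train are placeholders):
from skorch import NeuralNetClassifier
from sklearn.model_selection import GridSearchCV

# wrap the PyTorch module so it behaves like an sklearn estimator
net = NeuralNetClassifier(MyLSTM, max_epochs=10, lr=0.01, verbose=0)

# module__ parameters are forwarded to MyLSTM's constructor
params = {
    'lr': [0.01, 0.001],
    'module__hidden_size': [64, 128],
}
gs = GridSearchCV(net, params, cv=3, scoring='accuracy')
gs.fit(X_train, y_train)
print(gs.best_params_)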
I want to use PCA for dimensionality reduction and then use its output for a one-class SVM classifier in Python. My training data set is of the order of 16000x60. Also, how do I map a principal component back to the original columns to use it in the SVM, or can I use the principal components directly?
It is unclear what the problem is and what you have already tried. Of course you can use the principal components directly: you can either add the PCA output to your original feature set or use it on its own. I encourage you to use sklearn pipelines.
Simple example:
from sklearn import decomposition, datasets
from sklearn.pipeline import Pipeline
from sklearn import svm
svc = svm.SVC()
pca = decomposition.PCA()
pipe = Pipeline(steps=[('pca', pca), ('svc', svc)])
digits = datasets.load_digits()
X_digits = digits.data
y_digits = digits.target
pipe.fit(X_digits, y_digits)
print(pipe.score(X_digits,y_digits))
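Since the question mentions a one-class SVM, the same pipeline pattern works with OneClassSVM; a minimal sketch (X_train/X_test and the parameter values are placeholders, and OneClassSVM is unsupervised, so no labels are passed to fit):
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.svm import OneClassSVM

# reduce the 60 original columns to a few components, then fit the one-class SVM on them
pipe = Pipeline(steps=[('pca', PCA(n_components=10)),
                       ('ocsvm', OneClassSVM(nu=0.05))])
pipe.fit(X_train)            # e.g. the 16000x60 training matrix
pred = pipe.predict(X_test)  # +1 for inliers, -1 for outliers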
Is it possible to train a model in XGBoost that has multiple continuous outputs (multi-output regression)?
What would the objective be for training such a model?
Thanks in advance for any suggestions
My suggestion is to use sklearn.multioutput.MultiOutputRegressor as a wrapper of xgb.XGBRegressor. MultiOutputRegressor trains one regressor per target and only requires that the regressor implements fit and predict, which xgboost happens to support.
import numpy as np
import xgboost as xgb
from sklearn.multioutput import MultiOutputRegressor

# get some noised linear data
X = np.random.random((1000, 10))
a = np.random.random((10, 3))
y = np.dot(X, a) + np.random.normal(0, 1e-3, (1000, 3))

# fitting
multioutputregressor = MultiOutputRegressor(xgb.XGBRegressor(objective='reg:linear')).fit(X, y)

# predicting
print(np.mean((multioutputregressor.predict(X) - y)**2, axis=0))  # 0.004, 0.003, 0.005
This is probably the easiest way to regress multi-dimensional targets using xgboost, as you would not need to change any other part of your code (if you were using the sklearn API originally).
However, this method does not leverage any possible relations between the targets. You can try to design a customized objective function to achieve that.
Multiple output regression is now available in the nightly build of XGBoost, and will be included in XGBoost 1.6.0.
See https://github.com/dmlc/xgboost/blob/master/demo/guide-python/multioutput_regression.py for an example.
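A minimal sketch of the native interface, assuming XGBoost >= 1.6 and the hist tree method (a 2-D target array is passed directly to a single regressor):
import numpy as np
import xgboost as xgb

X = np.random.random((1000, 10))
y = np.dot(X, np.random.random((10, 3)))  # three targets

# one regressor handles all three targets natively
reg = xgb.XGBRegressor(tree_method='hist', n_estimators=64)
reg.fit(X, y)
print(reg.predict(X).shape)  # (1000, 3)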
The code above generates the warning "reg:linear is now deprecated in favor of reg:squarederror", so here is an updated version based on @ComeOnGetMe's answer:
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.multioutput import MultiOutputRegressor
# get some noised linear data
X = np.random.random((1000, 10))
a = np.random.random((10, 3))
y = np.dot(X, a) + np.random.normal(0, 1e-3, (1000, 3))
# fitting
multioutputregressor = MultiOutputRegressor(xgb.XGBRegressor(objective='reg:squarederror')).fit(X, y)
# predicting
print(np.mean((multioutputregressor.predict(X) - y)**2, axis=0))
Out:
[2.00592697e-05 1.50084441e-05 2.01412247e-05]
I would add a comment but I lack the reputation. In addition to @Jesse Anderson's answer: to install the most recent version, select the top link from here:
https://s3-us-west-2.amazonaws.com/xgboost-nightly-builds/list.html?prefix=master/
Make sure to select the one for your operating system.
Use pip install to install the wheel, e.g. for macOS:
pip install https://s3-us-west-2.amazonaws.com/xgboost-nightly-builds/master/xgboost-1.6.0.dev0%2B4d81c741e91c7660648f02d77b61ede33cef8c8d-py3-none-macosx_10_15_x86_64.macosx_11_0_x86_64.macosx_12_0_x86_64.whl
You can use linear regression, random forest regressors and some other related algorithms in scikit-learn to produce multi-output regression (see the sketch below). I'm not sure about XGBoost; the gradient boosting regressor in scikit-learn does not allow multiple outputs. For those asking when this might be necessary, one example would be forecasting multiple steps of a time series ahead.
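A minimal sketch with scikit-learn's RandomForestRegressor, which accepts a 2-D target directly (the toy data mirrors the earlier answers):
import numpy as np
from sklearn.ensemble import RandomForestRegressor

X = np.random.random((1000, 10))
y = np.dot(X, np.random.random((10, 3)))  # three targets

reg = RandomForestRegressor(n_estimators=100).fit(X, y)
print(reg.predict(X).shape)  # (1000, 3)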
Based on the above discussion, I have extended the univariate XGBoostLSS to a multivariate framework called Multi-Target XGBoostLSS Regression that models multiple targets and their dependencies in a probabilistic regression setting. Code follows soon.