In TensorFlow we can use a callback named ReduceLROnPlateau, which reduces the learning rate when the model stops improving. Does anyone know how to do this in XGBoost? I want to know whether there is any way to reduce the learning rate when an XGBoost model stops learning.
This article claims:
A quick Google search reveals that there has been some work done utilizing a decaying learning rate, one that starts out large and shrinks at each round. But we typically see no accuracy gains in Cross Validation and when looking at test error graphs there are minimal differences between its performance and using a regular old constant.
They wrote a Python package called BetaBoost to find an optimal sequence for the learning rate scheduler.
In principle, they seem to use a function that returns the learning rates for the LearningRateScheduler callback:
from scipy.stats import beta

def beta_pdf(scalar=1.5,
             a=26,
             b=1,
             scale=80,
             loc=-68,
             floor=0.01,
             n_boosting_rounds=100):
    """
    Get the learning rate from the beta PDF.

    Returns
    -------
    lrs : list
        The resulting learning rates to use.
    """
    lrs = [scalar * beta.pdf(i, a=a, b=b, scale=scale, loc=loc) + floor
           for i in range(n_boosting_rounds)]
    return lrs
[...]
xgb.train(
    [...],
    callbacks=[xgb.callback.LearningRateScheduler(beta_pdf())]
)
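If the goal is specifically to mimic ReduceLROnPlateau, i.e. to react to a stalling validation metric rather than follow a fixed schedule, one option is a custom callback. Below is a minimal, untested sketch assuming the xgb.callback.TrainingCallback interface available in XGBoost >= 1.3; the class name and the patience/factor parameters are my own choices rather than an XGBoost API, and it assumes a lower-is-better metric such as rmse.

import xgboost as xgb

class PlateauLRScheduler(xgb.callback.TrainingCallback):
    """Multiply the learning rate by `factor` after `patience` rounds without improvement."""

    def __init__(self, data_name, metric_name, patience=5, factor=0.5, initial_lr=0.3):
        super().__init__()
        self.data_name = data_name        # name given to the eval set in `evals`
        self.metric_name = metric_name    # e.g. 'rmse'
        self.patience = patience
        self.factor = factor
        self.lr = initial_lr
        self.best = float("inf")
        self.stale_rounds = 0

    def after_iteration(self, model, epoch, evals_log):
        # evals_log has the shape {data_name: {metric_name: [score_round_0, ...]}}
        score = evals_log[self.data_name][self.metric_name][-1]
        if score < self.best:
            self.best = score
            self.stale_rounds = 0
        else:
            self.stale_rounds += 1
        if self.stale_rounds >= self.patience:
            self.lr *= self.factor                       # reduce LR on plateau
            model.set_param("learning_rate", self.lr)
            self.stale_rounds = 0
        return False  # returning True would stop training

# Usage sketch: dtrain/dvalid are DMatrix objects you have built yourself,
# and 'rmse' must match the eval_metric in params.
# booster = xgb.train(params, dtrain, num_boost_round=500,
#                     evals=[(dvalid, "valid")],
#                     callbacks=[PlateauLRScheduler("valid", "rmse")])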
Related
I am looking to fine-tune a GNN, and my supervisor suggested exploring different learning rates. I came across this tutorial video where he mentions that a randomised log-space search of hyperparameters is typically done in practice. For the sake of the introductory tutorial, this was not covered.
Any help or pointers on how to achieve this in PyTorch are greatly appreciated. Thank you!
Setting the search range on a logarithmic scale lets you spend more of the search on the more useful learning-rate values, which are usually below 0.1.
Imagine you want to sample learning rates between 0.1 (1e-1) and 0.0001 (1e-4). You can express these bounds on a base-10 logarithmic scale, log10(0.1) = -1 and log10(0.0001) = -4, and then sample the exponent uniformly. Andrew Ng provides a clearer explanation in this video.
In Python you can use np.random.uniform() to sample the exponent:
import numpy as np

searchable_learning_rates = [10**(-4 * np.random.uniform(0.5, 1)) for _ in range(10)]
searchable_learning_rates
>>>
[0.004890650359810075,
0.007894672127828331,
0.008698831627963768,
0.00022779163472045743,
0.0012046829055603172,
0.00071395500159473,
0.005690032483124896,
0.000343368839731761,
0.0002819402550629178,
0.0006399571804618883]
As you can see, this samples learning rates ranging from about 0.0002 up to about 0.0087, close to the upper bound of 1e-2 implied by the exponent range. The longer the list, the more values you will try.
Following the example code in the video you provided, you can implement the randomized log search for the learning rate by replacing learning_rates with searchable_learning_rates:
for batch_size in batch_sizes:
    for learning_rate in searchable_learning_rates:
        ...
        ...
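Putting it together for the PyTorch fine-tuning case, a minimal untested sketch might look like this; model_factory, train_one_epoch and evaluate are hypothetical placeholders for your own GNN constructor, training step and validation metric:

import numpy as np
import torch

def random_log_lr_search(model_factory, train_one_epoch, evaluate,
                         n_trials=10, low_exp=-4, high_exp=-1, epochs=5):
    """Sample learning rates log-uniformly in [10**low_exp, 10**high_exp]
    and keep the one with the best validation score (lower is better)."""
    results = []
    for _ in range(n_trials):
        lr = 10 ** np.random.uniform(low_exp, high_exp)   # log-uniform sample
        model = model_factory()                           # fresh model per trial
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        for _ in range(epochs):                           # short training run per trial
            train_one_epoch(model, optimizer)
        results.append((evaluate(model), lr))
    best_score, best_lr = min(results)
    return best_lr, results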
I am trying to implement time series forecasting using genetic programming. I am creating random trees (ramped half-and-half) with s-expressions and evaluating each expression using RMSE to calculate the fitness. My problem is the training process. If I want to predict gold prices and the training data looks like this:
date open high low close
28/01/2008 90.959999 91.889999 90.75 91.75
29/01/2008 91.360001 91.720001 90.809998 91.150002
30/01/2008 90.709999 92.580002 90.449997 92.059998
31/01/2008 90.919998 91.660004 90.739998 91.400002
01/02/2008 91.75 91.870003 89.220001 89.349998
04/02/2008 88.510002 89.519997 88.050003 89.099998
05/02/2008 87.900002 88.690002 87.300003 87.68
06/02/2008 89 89.650002 88.75 88.949997
07/02/2008 88.949997 89.940002 88.809998 89.849998
08/02/2008 90 91 89.989998 91
As I understand it, this data is nonlinear, so my questions are:
1- Do I need to make any changes to this data, such as exponential smoothing? And why?
2- When looping over the current population and evaluating the fitness of each expression on the training data, should I calculate the RMSE on just part of the data or all of it?
3- When the algorithm finishes and I get the expression with the best (lowest) fitness, does this mean that when I apply any row from the training data, the output should be the price of the next day?
I've read some research papers about this and noticed that some of them divide the training data when calculating the fitness, while some apply exponential smoothing. However, I found them a bit difficult to read and understand, and most implementations I've found are in Python or R, which I am not familiar with.
I appreciate any help on this.
Thank you.
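For reference, the kind of fitness evaluation described in question 2 might look like the following untested sketch; evaluate_expression (turning an s-expression plus one row of features into a prediction of the next day's close) is a hypothetical placeholder for your own tree evaluator:

import math

def rmse_fitness(expression, rows, next_day_close, evaluate_expression):
    """RMSE of an expression's one-step-ahead predictions over a data window.

    rows           : list of feature rows, e.g. [open, high, low, close]
    next_day_close : the next day's closing price for each row
    """
    errors = []
    for features, target in zip(rows, next_day_close):
        prediction = evaluate_expression(expression, features)
        errors.append((prediction - target) ** 2)
    return math.sqrt(sum(errors) / len(errors))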
from sklearn.model_selection import GridSearchCV
from sklearn import svm

params_svm = {
    'kernel': ['linear', 'rbf', 'poly'],
    'C': [0.1, 0.5, 1, 10, 100],
    'gamma': [0.001, 0.01, 0.1, 1, 10]
}

svm_clf = svm.SVC()
estimator_svm = GridSearchCV(svm_clf, param_grid=params_svm, cv=4, verbose=1, scoring='accuracy')
estimator_svm.fit(data, labels)
print(estimator_svm.best_params_)
estimator_svm.best_score_

# data.shape is (891, 9)
# labels.shape is (891); both are numeric arrays (2-D and 1-D respectively).
When I use GridSearchCV with only the 'rbf' kernel, it gives the best parameter combination in just 2.7 seconds!
But when the kernel list includes 'poly' or 'linear', either separately or together with 'rbf', it takes far too long, i.e. it produces no output even after 15-20 minutes, which makes me think I am doing something wrong. I am new to (supervised) machine learning, and I cannot find any bug in the code, so I don't understand what is going on behind the scenes.
Can anyone explain to me what I am doing wrong?
No, you are not doing anything wrong in your code. There are several factors at play here:
SVC is a complex classifier that requires computing a distance between each pair of points in the dataset.
The complexity also varies with the kernel. I am not sure, but I think it is O(n_samples^2 * n_features) for the rbf kernel, while it is O(n_samples * n_features) for the linear kernel. So it is not the case that because one kernel finishes quickly, another kernel will finish in a similar time.
The time taken also depends heavily on the dataset and the patterns present in it. For example, an rbf kernel may converge quickly with, say, C = 0.5, while a polynomial kernel may take drastically longer to converge for the same value of C.
Also, without using the cache, the running time increases a lot. In this answer, the author mentions it might increase to O(n_samples^3 * n_features).
Here is the official documentation from sklearn about SVM complexity. See this section about practical tips on using SVMs as well.
You can set verbose to True to see the progress of your classifier while it is trained.
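For example (an untested sketch reusing params_svm, data and labels from your question; cache_size and verbose are standard SVC/GridSearchCV parameters, and the 1000 MB cache value is just illustrative):

from sklearn import svm
from sklearn.model_selection import GridSearchCV

# Larger kernel cache (in MB) and verbose output to watch progress
svm_clf = svm.SVC(cache_size=1000)
estimator_svm = GridSearchCV(svm_clf, param_grid=params_svm, cv=4,
                             verbose=2, scoring='accuracy', n_jobs=-1)
estimator_svm.fit(data, labels)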
References
GridSearchCV goes to endless execution using SVC
Computational complexity of SVM
Official Documentation of SVM for scikit-learn
Hi, I am solving a regression problem. My data set consists of 13 features and 550,068 rows. I tried different models and found that boosting algorithms (i.e. xgboost, catboost, lightgbm) perform well on that big data set. Here is the code:
import numpy as np
from sklearn.metrics import mean_squared_error

import lightgbm as lgb
gbm = lgb.LGBMRegressor(objective='regression', num_leaves=100,
                        learning_rate=0.2, n_estimators=1500)
gbm.fit(x_train, y_train,
        eval_set=[(x_test, y_test)],
        eval_metric='l2_root',
        early_stopping_rounds=10)
y_pred = gbm.predict(x_test, num_iteration=gbm.best_iteration_)
accuracy = round(gbm.score(x_train, y_train) * 100, 2)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
import xgboost as xgb
boost_params = {'eval_metric': 'rmse'}
xgb0 = xgb.XGBRegressor(
    max_depth=8,
    learning_rate=0.1,
    n_estimators=1500,
    objective='reg:linear',
    gamma=0,
    min_child_weight=1,
    subsample=1,
    colsample_bytree=1,
    scale_pos_weight=1,
    seed=27,
    **boost_params)
xgb0.fit(x_train, y_train)
accuracyxgboost = round(xgb0.score(x_train, y_train) * 100, 2)
predict_xgboost = xgb0.predict(x_test)
msexgboost = mean_squared_error(y_test, predict_xgboost)
rmsexgboost = np.sqrt(msexgboost)
from catboost import Pool, CatBoostRegressor
train_pool = Pool(x_train, y_train)
cbm0 = CatBoostRegressor(rsm=0.8, depth=7, learning_rate=0.1,
eval_metric='RMSE')
cbm0.fit(train_pool)
test_pool = Pool(x_test)
predict_cat = cbm0.predict(test_pool)
acc_cat = round(cbm0.score(x_train, y_train)*100,2)
msecat = mean_squared_error(y_test,predict_cat)
rmsecat = np.sqrt(msecat)
Using the above models I am getting RMSE values of about 2850. Now I want to improve my model performance by reducing the root mean square error. How can I improve my model performance? As I am new to boosting algorithms, which parameters affect the models? And how can I do hyperparameter tuning for those algorithms (xgboost, catboost, lightgbm)? I am using Windows 10 and an Intel i5 7th-generation CPU.
Of the three tools you have tried, CatBoost provides an edge in categorical feature processing (it could also be faster, but I have not seen a benchmark demonstrating it, and it does not seem to dominate on Kaggle, so most likely it is not as quick as LightGBM, but I might be wrong in that hypothesis). So I would use it if I had many categorical features in my sample. The other two (LightGBM and XGBoost) provide very similar functionality, and I would suggest choosing one of them and sticking to it. At the moment it seems that LightGBM outperforms XGBoost in training time on CPU while providing very comparable prediction accuracy. See for example the GBM-perf benchmark on GitHub or this in-depth analysis. If you have GPUs available, then XGBoost actually seems preferable, judging by the benchmark above.
In general, you can improve your model performance in several ways:
train longer (if early stopping was not triggered, there is still room for improvement; if it was, then you cannot improve further by training the chosen model longer with the chosen hyper-parameters)
optimise hyper-parameters (see below)
choose a different model. There is no single silver bullet for all problems. Typically GBMs work very well on large samples of structured data, but for some classes of problems (e.g. linear dependence) it is hard for a GBM to learn how to generalise, as it might require very many splits. So it might be that for your problem a linear model, an SVM or something else will do better out of the box.
Since we have narrowed it down to two options: I cannot advise on CatBoost hyper-parameter optimisation, as I have no hands-on experience with it yet. But for LightGBM tuning you can read the official LightGBM docs and these instructions in one of the issues. There are many good examples of hyper-parameter tuning for LightGBM; I can quickly dig out my kernel on Kaggle: see here. I do not claim it to be perfect, but it is something that was easy for me to find :)
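As an illustration of what such tuning can look like (an untested sketch, not taken from the docs above; the parameter ranges are just plausible starting points, and x_train/y_train are from your question), scikit-learn's RandomizedSearchCV works directly with the LGBMRegressor wrapper:

from scipy.stats import randint, uniform
from sklearn.model_selection import RandomizedSearchCV
import lightgbm as lgb

param_distributions = {
    'num_leaves': randint(20, 200),          # main complexity control
    'learning_rate': uniform(0.01, 0.2),     # samples from [0.01, 0.21)
    'n_estimators': randint(200, 2000),
    'min_child_samples': randint(10, 100),
    'colsample_bytree': uniform(0.6, 0.4),   # samples from [0.6, 1.0)
}

search = RandomizedSearchCV(
    lgb.LGBMRegressor(objective='regression'),
    param_distributions=param_distributions,
    n_iter=50,
    scoring='neg_root_mean_squared_error',
    cv=3,
    random_state=27,
    n_jobs=-1,
)
search.fit(x_train, y_train)
print(search.best_params_, -search.best_score_)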
If you are using an Intel CPU, then try Intel-optimized XGBoost. Intel has added several optimizations to XGBoost that accelerate gradient boosting models and improve training and inference performance. Also, please check out the article https://www.intel.com/content/www/us/en/developer/articles/technical/easy-introduction-xgboost-for-intel-architecture.html#gs.q4c6p6 on how to use XGBoost with Intel optimizations.
You can also try lasso or ridge regression; these methods could improve performance.
For hyper-parameter tuning, you can use loops: iterate over candidate values and check where you get the lowest RMSE (a minimal example follows below).
You can also try stacked ensemble techniques.
If you use R, try the h2o.ai package; it gives good results.
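As a sketch of the loop-based tuning suggested above (untested, reusing the x_train/x_test split and RMSE computation from the question; the candidate values are just illustrative):

import itertools
import numpy as np
from sklearn.metrics import mean_squared_error
import lightgbm as lgb

best = (np.inf, None)
for num_leaves, lr in itertools.product([31, 63, 127], [0.05, 0.1, 0.2]):
    model = lgb.LGBMRegressor(objective='regression',
                              num_leaves=num_leaves,
                              learning_rate=lr,
                              n_estimators=500)
    model.fit(x_train, y_train)
    rmse = np.sqrt(mean_squared_error(y_test, model.predict(x_test)))
    best = min(best, (rmse, (num_leaves, lr)))   # keep the lowest RMSE
print(best)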
I know a couple of classification algorithms, such as decision trees, but I can't apply any of them to the problem I have at hand.
I have a dataset in which each row contains information about a purchase. Its columns are:
- customer id
- store id where the purchase took place
- date and time of the event
- amount of money spent
I'm trying to build a model that, given the information of who, where and when, predicts how much money is going to be spent.
What are some possible ways of doing this? Are there any well-known algorithms?
Also, I'm currently learning RapidMiner and experimenting with some of its features. Nothing I've tried there allows me to use a real number (amount spent) as a label. Maybe I'm doing something wrong?
You could use a Decision Tree Regressor for this. Using a toolkit like scikit-learn, you could use the DecisionTreeRegressor algorithm, where your features would be store id, date and time, and customer id, and your target would be the amount spent.
You could turn this into a supervised learning problem. This is untested code, but it should get you started:
# Load libraries
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn import metrics
from sklearn.model_selection import GridSearchCV

def fit_predict_model(data_import):
    """Find and tune the optimal model. Make a prediction on housing data."""
    # Get the features and labels from your data
    X, y = data_import.data, data_import.target
    # Set up a Decision Tree Regressor
    regressor = DecisionTreeRegressor()
    parameters = {'max_depth': (4, 5, 6, 7), 'random_state': [1]}
    scoring_function = metrics.make_scorer(metrics.mean_absolute_error, greater_is_better=False)
    ## fit your data to it ##
    reg = GridSearchCV(estimator=regressor, param_grid=parameters, scoring=scoring_function, cv=10, refit=True)
    fitted_data = reg.fit(X, y)
    print("Best Parameters: ")
    print(fitted_data.best_params_)
    # Use the model to predict the output of a particular sample
    x = [## input a test sample in this list ##]
    y = reg.predict(x)
    print("Prediction: " + str(y))

fit_predict_model(## your data in here ##)
I took this almost directly from a project I was working on to predict housing prices, so there are probably some unnecessary pieces, and without doing validation you have no idea how accurate it would be, but it should get you started.
Check out this link:
http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html
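One practical gap in the sketch above is that DecisionTreeRegressor needs numeric features, so the date/time and id columns have to be encoded first. A minimal, untested example with pandas; the file name and the column names purchase_time, customer_id, store_id and amount_spent are hypothetical stand-ins for your own data:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

df = pd.read_csv("purchases.csv", parse_dates=["purchase_time"])  # hypothetical file

# Expand the timestamp into numeric features the tree can split on
df["hour"] = df["purchase_time"].dt.hour
df["weekday"] = df["purchase_time"].dt.weekday
df["month"] = df["purchase_time"].dt.month

# ids are used as raw numbers here; with many distinct ids, a one-hot or
# target encoding may work better
X = df[["customer_id", "store_id", "hour", "weekday", "month"]]
y = df["amount_spent"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
model = DecisionTreeRegressor(max_depth=6).fit(X_train, y_train)
print(model.score(X_test, y_test))  # R^2 on held-out purchases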
Yes, as comments have pointed out it's regression that you need. Linear regression does sound like a good starting point as you don't have a huge number of variables.
In RapidMiner, type regression into the Operators menu and you'll see several options under Modelling -> Functions: Linear Regression, Polynomial, Vector, etc. (There are more, but as a beginner let's start here.)
Right click any of these operators and press Show Operator Info and you'll see numerical labels are allowed.
Next scroll through the help documentation of the operator and you'll see a link to a tutorial process. It's really simple to use, but it's good to get you started with an example.
Let me know if you need any help.