Improving a boosting model, reducing root mean square error - machine-learning

Hi, I am solving a regression problem. My data set consists of 13 features and 550,068 rows. I tried different models and found that boosting algorithms (i.e. XGBoost, CatBoost, LightGBM) perform well on this big data set. Here is the code:
import numpy as np
from sklearn.metrics import mean_squared_error

import lightgbm as lgb
gbm = lgb.LGBMRegressor(objective='regression', num_leaves=100,
                        learning_rate=0.2, n_estimators=1500)
gbm.fit(x_train, y_train,
        eval_set=[(x_test, y_test)],
        eval_metric='l2_root',
        early_stopping_rounds=10)  # newer LightGBM versions expect callbacks=[lgb.early_stopping(10)] instead
y_pred = gbm.predict(x_test, num_iteration=gbm.best_iteration_)
accuracy = round(gbm.score(x_train, y_train) * 100, 2)  # R^2 on the training set
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)

import xgboost as xgb
boost_params = {'eval_metric': 'rmse'}
xgb0 = xgb.XGBRegressor(
    max_depth=8,
    learning_rate=0.1,
    n_estimators=1500,
    objective='reg:linear',  # deprecated alias of 'reg:squarederror' in newer XGBoost versions
    gamma=0,
    min_child_weight=1,
    subsample=1,
    colsample_bytree=1,
    scale_pos_weight=1,
    seed=27,
    **boost_params)
xgb0.fit(x_train, y_train)
accuracyxgboost = round(xgb0.score(x_train, y_train) * 100, 2)
predict_xgboost = xgb0.predict(x_test)
msexgboost = mean_squared_error(y_test, predict_xgboost)
rmsexgboost = np.sqrt(msexgboost)

from catboost import Pool, CatBoostRegressor
train_pool = Pool(x_train, y_train)
cbm0 = CatBoostRegressor(rsm=0.8, depth=7, learning_rate=0.1,
                         eval_metric='RMSE')
cbm0.fit(train_pool)
test_pool = Pool(x_test)
predict_cat = cbm0.predict(test_pool)
acc_cat = round(cbm0.score(x_train, y_train) * 100, 2)
msecat = mean_squared_error(y_test, predict_cat)
rmsecat = np.sqrt(msecat)
By using the above models I am getting RMSE values of about 2850. Now I want to improve my model performance by reducing the root mean square error. How can I improve my model performance? As I am new to boosting algorithms, which parameters affect the models? And how can I do hyperparameter tuning for those algorithms (XGBoost, CatBoost, LightGBM)? I am using Windows 10 and an Intel i5 7th-generation CPU.

Out of the three tools you have tried, CatBoost provides an edge in categorical feature processing (it could also be faster, but I have not seen a benchmark demonstrating it, and it does not seem to be dominating on Kaggle, so most likely it is not as quick as LightGBM, though I might be wrong in that hypothesis). So I would use it if I had many categorical features in my sample. The other two (LightGBM and XGBoost) provide very similar functionality, and I would suggest choosing one of them and sticking to it. At the moment it seems that LightGBM outperforms XGBoost in training time on CPU while providing very comparable prediction accuracy. See, for example, the GBM-perf benchmark on GitHub or this in-depth analysis. If you have GPUs available, then XGBoost seems preferable, judging by the benchmark above.
In general, you can improve your model performance in several ways:
train longer (if early stopping was not triggered, that means there is still room for generalisation; if it was triggered, then you cannot improve further by training the chosen model longer with the chosen hyper-parameters)
optimise hyper-parameters (see below)
choose a different model. There is no single silver bullet for all problems. Typically GBMs work very well on large samples of structured data, but for some classes of problems (e.g. linear dependence) it is hard for a GBM to learn how to generalise, as it might require very many splits. So it might be that for your problem a linear model, an SVM or something else will do better out of the box.
Since we narrowed it down to two options, I cannot advise on CatBoost hyper-parameter optimisation, as I have no hands-on experience with it yet. But for LightGBM tuning you can read the official LightGBM docs and these instructions in one of the issues. There are many good examples of hyper-parameter tuning for LightGBM. I can quickly dig out my kernel on Kaggle: see here. I do not claim it to be perfect, but it is easy for me to find :)
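For concreteness, here is a minimal sketch of a randomized hyper-parameter search for LGBMRegressor with scikit-learn; the parameter ranges are illustrative assumptions, not recommendations, and x_train/y_train are the arrays from the question:
import numpy as np
from scipy.stats import randint, uniform
from sklearn.model_selection import RandomizedSearchCV
import lightgbm as lgb

# Illustrative search space -- adjust the ranges to your data.
param_dist = {
    'num_leaves': randint(31, 255),
    'learning_rate': uniform(0.01, 0.3),
    'n_estimators': randint(200, 2000),
    'min_child_samples': randint(10, 100),
    'subsample': uniform(0.6, 0.4),
    'colsample_bytree': uniform(0.6, 0.4),
}
search = RandomizedSearchCV(
    lgb.LGBMRegressor(objective='regression'),
    param_distributions=param_dist,
    n_iter=30,                              # number of sampled configurations
    scoring='neg_root_mean_squared_error',  # optimise RMSE directly
    cv=3,
    random_state=42,
    n_jobs=-1,
)
search.fit(x_train, y_train)
print(search.best_params_, -search.best_score_)  # best RMSE found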

If you are using an Intel CPU, then try Intel-optimized XGBoost. Intel has contributed several optimizations to XGBoost that accelerate gradient boosting models and improve training and inference performance. Also, please check out the article https://www.intel.com/content/www/us/en/developer/articles/technical/easy-introduction-xgboost-for-intel-architecture.html#gs.q4c6p6 on how to use XGBoost with Intel optimizations.

You can try lasso or ridge regularization; these methods could improve performance.
For hyper-parameter tuning, you can use loops: iterate over candidate values and check where you get the lowest RMSE (a rough sketch is shown below).
You can also try stacked ensemble techniques.
If you use R, try the h2o package (from h2o.ai); it gives good results.
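A rough sketch of the loop-based approach mentioned above; the candidate values are arbitrary assumptions, and x_train/x_test/y_train/y_test are the arrays from the question:
import numpy as np
from sklearn.metrics import mean_squared_error
import lightgbm as lgb

best_rmse, best_params = float('inf'), None
# Try a small hand-picked grid and keep the combination with the lowest RMSE.
for lr in [0.05, 0.1, 0.2]:
    for leaves in [31, 63, 127]:
        model = lgb.LGBMRegressor(objective='regression', learning_rate=lr,
                                  num_leaves=leaves, n_estimators=1000)
        model.fit(x_train, y_train)
        rmse = np.sqrt(mean_squared_error(y_test, model.predict(x_test)))
        if rmse < best_rmse:
            best_rmse, best_params = rmse, (lr, leaves)
print(best_params, best_rmse)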

Related

How to find the optimal learning rate, number of epochs & decay strategy in torch.optim.Adam?

I am working on a model trained on the MNIST dataset. I am using the torch.optim.Adam optimizer and have been experimenting with tuning the hyperparameters. After running a lot of tests, I have found a combination of hyperparameters that gives 90% accuracy. However, since I am new to this, I feel there might be a more efficient way to find the optimal values of the hyperparameters. The brute-force approach seems to depend on trial and error, and I was wondering if there is a certain strategy to find these values.
Example of the code being used is:
if __name__ == '__main__':
    end = time.time()
    model_ft = Net().to(device)
    print(model_ft.network)
    criterion = nn.CrossEntropyLoss()
    optimizer_ft = optim.Adam(model_ft.parameters(), lr=1e-3)
    exp_lr_scheduler = lr_scheduler.StepLR(optimizer_ft, step_size=9, gamma=0.5)
    history, accuracy = train_test(model_ft, criterion, optimizer_ft, exp_lr_scheduler,
                                   num_epochs=15)
Here I would like to find the optimal values of:
Learning Rate
Step Size
Gamma
Number of Epochs
Any help is much appreciated!
A similar question has already been answered in depth, it seems.
In short, however, you can use something called grid search. With grid search, you set the values you want to try for each hyperparameter, and then grid search will try every combination. This link shows how to do it with PyTorch.
The following Medium post goes more in-depth about other methods and packages to try, but I think you should start with a simple grid search; a rough sketch is shown below.
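A rough sketch of what such a grid search could look like here; the candidate values are arbitrary assumptions, and Net, device and train_test are the placeholders from the question's own snippet:
import itertools
import torch.nn as nn
import torch.optim as optim
from torch.optim import lr_scheduler

# Candidate values are illustrative; adjust them to your compute budget.
lrs        = [1e-2, 1e-3, 1e-4]
step_sizes = [5, 9, 12]
gammas     = [0.1, 0.5, 0.9]
epoch_opts = [10, 15, 20]

best_acc, best_cfg = 0.0, None
for lr, step, gamma, epochs in itertools.product(lrs, step_sizes, gammas, epoch_opts):
    model = Net().to(device)                 # fresh model for every configuration
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=lr)
    scheduler = lr_scheduler.StepLR(optimizer, step_size=step, gamma=gamma)
    _, acc = train_test(model, criterion, optimizer, scheduler, num_epochs=epochs)
    if acc > best_acc:
        best_acc, best_cfg = acc, (lr, step, gamma, epochs)
print(best_cfg, best_acc)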

Why is it only working when setting kernel='rbf' in the SVM classifier?

from sklearn.model_selection import GridSearchCV
from sklearn import svm

params_svm = {
    'kernel': ['linear', 'rbf', 'poly'],
    'C': [0.1, 0.5, 1, 10, 100],
    'gamma': [0.001, 0.01, 0.1, 1, 10]
}
svm_clf = svm.SVC()
estimator_svm = GridSearchCV(svm_clf, param_grid=params_svm, cv=4, verbose=1, scoring='accuracy')
estimator_svm.fit(data, labels)
print(estimator_svm.best_params_)
estimator_svm.best_score_

# data.shape is (891, 9) and labels.shape is (891,);
# both are numeric arrays (2-D and 1-D respectively).
When I use GridSearchCV with only the rbf kernel, it gives the best parameter combination in just 2.7 seconds!
But when I include 'poly' or 'linear' in the kernel list, separately or together with 'rbf', it takes far too long to produce output, i.e. no output even after 15-20 minutes, which makes me think I am doing something wrong. I am new to (supervised) machine learning. I am not able to find any bug in the code... I do not understand what is going wrong behind the scenes!
Can anyone explain to me what I am doing wrong?
No, you are not doing anything wrong in your code. There are many factors that come into play here:
SVC is a complex classifier which requires computing a distance between each pair of points in the dataset.
The complexity also varies with the kernel. I am not sure, but I think it is O(n_samples^2 * n_features) for the rbf kernel, while it is O(n_samples * n_features) for the linear kernel. So it is not the case that because the rbf kernel finishes quickly, the linear kernel will finish in a similar time.
Also, the time taken depends drastically on the dataset and the data patterns present in it. For example, an rbf kernel may converge quickly with, say, C = 0.5, while a polynomial kernel may take drastically more time to converge for the same value of C.
Also, without using the cache, the running time increases a lot. In this answer, the author mentions it might increase to O(n_samples^3 * n_features).
Here is the official documentation from sklearn about SVM complexity. See this section about practical tips on using SVMs as well.
You can set verbose to True to see the progress of your classifier and how it is trained, for example as sketched below.
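A minimal sketch (the cache size and parameter values are arbitrary assumptions; data and labels are the arrays from the question):
from sklearn import svm

# verbose=True prints libsvm's training progress;
# cache_size (in MB) enlarges the kernel cache, which can speed up training noticeably.
clf = svm.SVC(kernel='poly', C=1, gamma=0.01, verbose=True, cache_size=1000)
clf.fit(data, labels)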
References
GridSearchCV goes to endless execution using SVC
Computational complexity of SVM
Official Documentation of SVM for scikit-learn

Matching PyTorch w/ CNTK (VGG on CIFAR)

I am trying to understand how PyTorch works and want to replicate a simple CNN training on CIFAR. The CNTK script gets to 0.76 accuracy after 168 seconds of training (10 epochs), which is similar to my MXNet script (0.75 accuracy after 153 seconds).
However, my PyTorch script is lagging behind a lot, at 0.71 accuracy and 354 seconds. I appreciate that I will get differences in accuracy due to stochastic weight initialisation, etc., but the difference across frameworks is much greater than the difference within a framework between randomly initialised runs.
The reasons I can think of:
MXNet and CNTK are initialized to xavier/glorot uniform; not sure how to do this in PyTorch and so perhaps the weights are initialised to 0
CNTK does gradient-clipping by default; not sure if PyTorch has the equivalent
Perhaps the bias is dropped in PyTorch by default
I use SGD with momentum; perhaps the PyTorch implementation of momentum is a bit different
Edit:
I have tried specifying the weight initialisation, however it seems to have no big effect:
self.conv1 = nn.Conv2d(3, 50, kernel_size=3, padding=1)
init.xavier_uniform(self.conv1.weight, gain=np.sqrt(2.0))
init.constant(self.conv1.bias, 0)
I will try to answer your first two questions:
weight initialization: different kinds of layers have their own method; you can find the default weight initialization of all these layers in the following link: https://github.com/pytorch/pytorch/tree/master/torch/nn/modules
gradient clipping: you might want to use torch.nn.utils.clip_grad_norm (see the short sketch below)
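For illustration, a brief sketch of both points; current PyTorch versions use the underscore-suffixed, in-place variants of these functions, and the layer shape is just a placeholder:
import torch
import torch.nn as nn
import torch.nn.init as init

conv = nn.Conv2d(3, 50, kernel_size=3, padding=1)

# Xavier/Glorot-uniform initialisation (in-place variants in current PyTorch).
init.xavier_uniform_(conv.weight, gain=init.calculate_gain('relu'))
init.constant_(conv.bias, 0)

# Gradient clipping, normally applied after loss.backward() and before optimizer.step().
torch.nn.utils.clip_grad_norm_(conv.parameters(), max_norm=5.0)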
In addition, I am curious why you don't use torchvision.transforms, torch.utils.data.DataLoader, and torchvision.datasets.CIFAR10 to load and preprocess your data.
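A minimal sketch of that loading path; the normalisation constants and batch size are illustrative assumptions:
import torch
import torchvision
import torchvision.transforms as transforms

transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
])
trainset = torchvision.datasets.CIFAR10(root='./data', train=True,
                                        download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=64,
                                          shuffle=True, num_workers=2)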
There is a similar image-classification tutorial on CIFAR for PyTorch:
http://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html#sphx-glr-beginner-blitz-cifar10-tutorial-py
Hope this can help you.

Scikit-learn's PolynomialFeatures with logistic regression resulting in lower scores

I have a dataset X whose shape is (1741, 61). Using logistic regression with cross-validation I was getting around 62-65% for each split (cv=5).
I thought that if I made the data quadratic, the accuracy was supposed to increase. However, I'm getting the opposite effect (each cross-validation split is around the 40s, percentage-wise), so I'm presuming I'm doing something wrong when trying to make the data quadratic?
Here is the code I'm using,
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression

X_scaled = preprocessing.scale(X)

from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(3)
poly_x = poly.fit_transform(X_scaled)

classifier = LogisticRegression(penalty='l2', max_iter=200)

# sklearn.cross_validation is the old module name; in newer versions it is sklearn.model_selection
from sklearn.cross_validation import cross_val_score
cross_val_score(classifier, poly_x, y, cv=5)

# output:
array([ 0.46418338, 0.4269341 , 0.49425287, 0.58908046, 0.60518732])
Which makes me suspect, I'm doing something wrong.
I tried transforming the raw data into quadratic form first and then using preprocessing.scale to scale it, but that produced a warning:
UserWarning: Numerical issues were encountered when centering the data and might not be solved. Dataset may contain too large values. You may need to prescale your features.
warnings.warn("Numerical issues were encountered "
So I didn't bother going this route.
The other thing that's bothering me is the speed of the quadratic computations. cross_val_score takes around a couple of hours to output the score when using polynomial features. Is there any way to speed this up? I have an Intel i5-6500 CPU with 16 GB of RAM and Windows 7.
Thank you.
Have you tried using the MinMaxScaler instead of the Scaler? Scaler will output values that are both above and below 0, so you will run into a situation where values with a scaled value of -0.1 and those with a value of 0.1 will have the same squared value, despite not really being similar at all. Intuitively this would seem to be something that would lower the score of a polynomial fit. That being said, I haven't tested this; it's just my intuition. Furthermore, be careful with polynomial fits. I suggest reading this answer to "Why use regularization in polynomial regression instead of lowering the degree?". It's a great explanation and will likely introduce you to some new techniques. As an aside, Matthew Drury is an excellent teacher and I recommend reading all of his answers and blog posts.
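A minimal sketch of the MinMaxScaler suggestion (the degree is kept at 2 here purely for illustration; X and y are the arrays from the question):
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Scale every feature into [0, 1] before expanding, so -0.1 and 0.1 no longer
# collapse onto the same squared value.
pipe = make_pipeline(MinMaxScaler(),
                     PolynomialFeatures(2),
                     LogisticRegression(penalty='l2', max_iter=200))
print(cross_val_score(pipe, X, y, cv=5))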
There is a statement that "the accuracy is supposed to increase" with polynomial features. That is true if the polynomial features bring the model closer to the original data-generating process. Polynomial features, especially making every feature interact and polynomial, may move the model further from the data-generating process; hence worse results may be appropriate.
By using a degree-3 polynomial in scikit-learn, the X matrix went from (1741, 61) to (1741, 41664), which is significantly more columns than rows.
41k+ columns will take longer to solve. You should be looking at feature-selection methods. As Grr says, investigate lowering the polynomial degree. Try L1, grouped lasso, RFE, Bayesian methods. Try SMEs (subject-matter experts who may be able to identify specific features that may be polynomial). Plot the data to see which features may interact or be best in a polynomial.
I have not looked at it for a while, but I recall discussions on hierarchically well-formulated models (can you remove x1 but keep the x1 * x2 interaction?). That is probably worth investigating if your model behaves best with an ill-formulated hierarchical model.
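One possible illustration of the feature-selection route (the degree, interaction_only setting and regularisation strength are assumptions; X and y are the arrays from the question):
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Expand to interaction terms only, drop weak features with an L1 penalty,
# then fit the final L2 classifier on what is left.
pipe = make_pipeline(
    MinMaxScaler(),
    PolynomialFeatures(degree=2, interaction_only=True),
    SelectFromModel(LogisticRegression(penalty='l1', solver='liblinear', C=0.1)),
    LogisticRegression(penalty='l2', max_iter=200),
)
scores = cross_val_score(pipe, X, y, cv=5)
print(scores)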

libSVM giving highly inaccurate predictions even for the file that was used to train it

here is the deal.
I am trying to make an SVM based POS tagger.
The feature vectors for the SVM was created with the help of format converters.
Now here is a screenshot of the training file that I am using.
http://tinypic.com/r/n4fn2r/8
I have 25 labels for various POS tags. When I use the Java implementation or the command-line tools for prediction, I get the following results.
http://tinypic.com/r/2dtw5ky/8
I have tried with all the kernels available but it gave more or less the same results.
This is happening even when the training file is used as the testing file.
please help me out here..!!
P.S. I cannot share more than two links. Thus here is a snippet of the model file
svm_type c_svc
kernel_type rbf
gamma 0.000548546
nr_class 25
total_sv 431
rho -0.929467 1.01073 1.0531 1.03472 1.01585 0.953263 1.03027 -0.921365 0.984535 1.02796 1.01266 1.03374 0.949463 0.977925 0.986551 -0.920912 0.940926 -0.955562 0.975386 -0.981959 -0.884042 0.0516955 -0.980884 -0.966095 0.995091 1.023 1.01489 1.00308 0.948314 1.01137 -0.845876 0.968034 1.0076 1.00064 1.01335 0.942633 0.965703 0.979212 -0.861236 0.935055 -0.91739 0.970223 -0.97103 0.0743777 0.970321 -0.971215 -0.931582 0.972377 0.958193 0.931253 0.825797 0.954894 -0.972884 -0.941726 0.945077 0.922366 0.953999 -1.00503 0.840985 0.882229 -0.961742 0.791631 -0.984971 0.855911 -0.991528 -0.951211 -0.962096 -0.99213 -0.99708 -0.957557 -0.308987 -0.455442 -0.94881 -0.995319 -0.974945 -0.964637 -0.902152 -0.955258 -1.05287 -1.00614 -0.
Update:
Just trained the SVM with svm type c-SVC and a linear kernel, which gave a non-zero (although very poor) accuracy.
As mentioned by Pedrom, parameter choice is absolutely crucial when training SVMs. I suggest you have a look at this practical guide. Also, 431 words is nowhere near enough to train a 25-class model. You will definitely need more data.
That said, 0% accuracy is indeed odd. Can you please show us the commands you are using to train and evaluate the model?
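If you end up scripting the parameter search, a rough sketch using scikit-learn's SVC (which wraps libSVM) could look like this; the exponential C/gamma grid follows the practical guide's suggestion, and X_train/y_train are placeholders for your feature vectors and labels:
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# Exponentially spaced C and gamma values, as the libSVM practical guide recommends.
param_grid = {'C': 2.0 ** np.arange(-5, 16, 2),
              'gamma': 2.0 ** np.arange(-15, 4, 2)}
search = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)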
