sklearn High score with low performance - machine-learning

could you kindly help me decide whether I'm hitting a bug or the problem may be in my implementation?
I have a data set with 5 features and 2000+ observations and I use SVR to do regression tests and select parameters with grid search. If I don't scale my data, then I get a best score of close to zero, but if I do scale it, the best score is around 0.90.
When I manually test the data, it predicts wrong values totally randomly. How can this be? I expect the best score to show how well the trained data could have been validated on new ones during cross validation. I suppose I should not get high score if my model cannot generate well. Should I? Could this be a bug?
SKlearn version is 0.19.1 (from package of Ubuntu Linux 18.04 x64 LTS platform)
Python version is 3.6.7
Would it be worth an upgrade with pip? Any further idea? Thank you.
Edit: see the following code that produces high score, still generalizes badly - though it is regression, scoring should reflect the difference of the predicted ones from the test values:
C_range = 2.0 ** np.arange(-5, 15, 2)
gamma_range = 2.0 ** np.arange(-5, 15, 2)
parameters = {"kernel":["rbf"], "C":C_range, "gamma":gamma_range}
estimator = svm.SVR()
clf = GridSearchCV(estimator, parameters, cv=3, n_jobs=-1, verbose=0)
clf.fit(x, y)
print( clf.best_score_ )

Related

RandomizedSearchCV-Appropriate hyper-parameter distribution

In "Hands-on Machine Learning with Scikit-Learn, Keras & TensorFlow" book I see below distributions(reciprocal and Expon) being applied for Hyperparameters C and gamma. How did Author(Aurelion) came up with these distributions ? I mean how to determine which distribution would be appropriate for application in RandomizedSearchCV ?
param_distribs = {
'kernel': ['linear', 'rbf'],
'C': reciprocal(20, 200000),
'gamma': expon(scale=1.0),
}
I hope I got the question right.
It depends on the ML model. Randomized or Grid Search is used to the search for the best hyper-parameter that would result in the best estimator for prediction.
For example, consider the following code example. The ```rf_clf`` is the Random Forest model object. The param_distribs will contain the parameters with arbitrary choice of the values.
from scipy.stats import randint
param_distribs = {
'n_estimators': randint(low=1, high=500),
'max_depth': randint(low=1, high=10),
'max_features':randint(low=1,high=10),
}
rf_clf = RandomForestClassifier(random_state=42)
rnd_search_rf = RandomizedSearchCV(rf_clf, param_distributions=param_distribs,
n_iter=10, cv=5, scoring='accuracy', random_state=42)
rnd_search_rf.fit(X_train,y_train)
The best estimator can be accessed via
rnd_search_rf.best_estimator_
I found below comments in sample code on Github.
C-->The distribution we used for C looks quite different: the scale of the samples is picked from a uniform distribution within a given range, which is why the right graph, which represents the log of the samples, looks roughly constant. This distribution is useful when you don't have a clue of what the target scale is Reciprocal-:The reciprocal distribution is useful when you have no idea what the scale of the hyperparameter should be (indeed, as you can see on the figure on the right, all scales are equally likely, within the given range), whereas the exponential distribution is best when you know (more or less) what the scale of the hyperparameter should be.

Can intercept and regression coefficients (Beta values) be very high?

I have 38 variables, like oxygen, temperature, pressure, etc and have a task to determine the total yield produced every day from these variables. When I calculate the regression coefficients and intercept value, they seem to be abnormal and very high (Impractical). For example, if 'temperature' coefficient was found to be +375.456, I could not give a meaning to them saying an increase in one unit in temperature would increase yield by 375.456g. That's impractical in my scenario. However, the prediction accuracy seems right. I would like to know, how to interpret these huge intercept( -5341.27355) and huge beta values shown below. One other important point is that I removed multicolinear columns and also, I am not scaling the variables/normalizing them because I need beta coefficients to have meaning such that I could say, increase in temperature by one unit increases yield by 10g or so. Your inputs are highly appreciated!
modl.intercept_
Out[375]: -5341.27354961415
modl.coef_
Out[376]:
array([ 1.38096017e+00, -7.62388829e+00, 5.64611255e+00, 2.26124164e-01,
4.21908571e-01, 4.50695302e-01, -8.15167717e-01, 1.82390184e+00,
-3.32849969e+02, 3.31942553e+02, 3.58830763e+02, -2.05076898e-01,
-3.06404757e+02, 7.86012402e+00, 3.21339318e+02, -7.00817205e-01,
-1.09676321e+04, 1.91481734e+00, 6.02929848e+01, 8.33731416e+00,
-6.23433431e+01, -1.88442804e+00, 6.86526274e+00, -6.76103795e+01,
-1.11406021e+02, 2.48270706e+02, 2.94836048e+01, 1.00279016e+02,
1.42906659e-02, -2.13019683e-03, -6.71427100e+02, -2.03158515e+02,
9.32094007e-03, 5.56457014e+01, -2.91724945e+00, 4.78691176e-01,
8.78121854e+00, -4.93696073e+00])
It's very unlikely that all of these variables are linearly correlated, so I would suggest that you have a look at simple non-linear regression techniques, such as Decision Trees or Kernel Ridge Regression. These are however more difficult to interpret.
Going back to your issue, these high weights might well be due to there being some high amount of correlation between the variables, or that you simply don't have very much training data.
If you instead of linear regression use Lasso Regression, the solution is biased away from high regression coefficients, and the fit will likely improve as well.
A small example on how to do this in scikit-learn, including cross validation of the regularization hyper-parameter:
from sklearn.linear_model LassoCV
# Make up some data
n_samples = 100
n_features = 5
X = np.random.random((n_samples, n_features))
# Make y linear dependent on the features
y = np.sum(np.random.random((1,n_features)) * X, axis=1)
model = LassoCV(cv=5, n_alphas=100, fit_intercept=True)
model.fit(X,y)
print(model.intercept_)
If you have a linear regression, the formula looks like this (y= target, x= features inputs):
y= x1*b1 +x2*b2 + x3*b3 + x4*b4...+ c
where b1,b2,b3,b4... are your modl.coef_. AS you already realized one of your bigges number is 3.319+02 = 331 and the intercept is also quite big with -5431.
As you already mentioned the coeffiecient variables means how much the target variable changes, if the coeffiecient feature changes with 1 unit and all others features are constant.
so for your interpretation, the higher the absoult coeffienct, the higher the influence of your analysis. But it is important to note that the model is using a lot of high coefficient, that means your model is not depending only of one variable

Can any machine learning algorithm find this pattern: x1 < x2 without generating a new feature (e.g. x1-x2) first?

If I had 2 features x1 and x2 where I know that the pattern is:
if x1 < x2 then
class1
else
class2
Can any machine learning algorithm find such a pattern? What algorithm would that be?
I know that I could create a third feature x3 = x1-x2. Then feature x3 can easily be used by some machine learning algorithms. For example a decision tree can solve the problem 100% using x3 and just 3 nodes (1 decision and 2 leaf nodes).
But, is it possible to solve this without creating new features? This seems like a problem that should be easily solved 100% if a machine learning algorithm could only find such a pattern.
I tried MLP and SVM with different kernels, including svg kernel and the results are not great. As an example of what I tried, here is the scikit-learn code where the SVM could only get a score of 0.992:
import numpy as np
from sklearn.svm import SVC
# Generate 1000 samples with 2 features with random values
X_train = np.random.rand(1000,2)
# Label each sample. If feature "x1" is less than feature "x2" then label as 1, otherwise label is 0.
y_train = X_train[:,0] < X_train[:,1]
y_train = y_train.astype(int) # convert boolean to 0 and 1
svc = SVC(kernel = "rbf", C = 0.9) # tried all kernels and C values from 0.1 to 1.0
svc.fit(X_train, y_train)
print("SVC score: %f" % svc.score(X_train, y_train))
Output running the code:
SVC score: 0.992000
This is an oversimplification of my problem. The real problem may have hundreds of features and different patterns, not just x1 < x2. However, to start with it would help a lot to know how to solve for this simple pattern.
To understand this, you must go into the settings of all the parameters provided by sklearn, and C in particular. It also helps to understand how the value of C influences the classifier's training procedure.
If you look at the equation in the User Guide for SVC, there are two main parts to the equation - the first part tries to find a small set of weights that solves the problem, and the second part tries to minimize the classification errors.
C is the penalty multiplier associated with misclassifications. If you decrease C, then you reduce the penalty (lower training accuracy but better generalization to test) and vice versa.
Try setting C to 1e+6. You will see that you almost always get 100% accuracy. The classifier has learnt the pattern x1 < x2. But it figures that a 99.2% accuracy is enough when you look at another parameter called tol. This controls how much error is negligible for you and by default it is set to 1e-3. If you reduce the tolerance, you can also expect to get similar results.
In general, I would suggest you to use something like GridSearchCV (link) to find the optimal values of hyper parameters like C as this internally splits the dataset into train and validation. This helps you to ensure that you are not just tweaking the hyperparameters to get a good training accuracy but you are also making sure that the classifier will do well in practice.

Improving boosting model ,reducing Root mean square error

Hi i am solving a regression problem.My data set consists of 13 features and 550068 rows.I tried different different models and found that boosting algorithms(i.e xgboost,catboost,lightgbm) are performing well on that big data set.here is the code
import lightgbm as lgb
gbm = lgb.LGBMRegressor(objective='regression',num_leaves=100,learning_rate=0.2,n_estimators=1500)
gbm.fit(x_train, y_train,
eval_set=[(x_test, y_test)],
eval_metric='l2_root',
early_stopping_rounds=10)
y_pred = gbm.predict(x_test, num_iteration=gbm.best_iteration_)
accuracy = round(gbm.score(x_train, y_train)*100,2)
mse = mean_squared_error(y_test,y_pred)
rmse = np.sqrt(mse)
import xgboost as xgb
boost_params = {'eval_metric': 'rmse'}
xgb0 = xgb.XGBRegressor(
max_depth=8,
learning_rate=0.1,
n_estimators=1500,
objective='reg:linear',
gamma=0,
min_child_weight=1,
subsample=1,
colsample_bytree=1,
scale_pos_weight=1,
seed=27,
**boost_params)
xgb0.fit(x_train,y_train)
accuracyxgboost = round(xgb0.score(x_train, y_train)*100,2)
predict_xgboost = xgb0.predict(x_test)
msexgboost = mean_squared_error(y_test,predict_xgboost)
rmsexgboost= np.sqrt(msexgboost)
from catboost import Pool, CatBoostRegressor
train_pool = Pool(x_train, y_train)
cbm0 = CatBoostRegressor(rsm=0.8, depth=7, learning_rate=0.1,
eval_metric='RMSE')
cbm0.fit(train_pool)
test_pool = Pool(x_test)
predict_cat = cbm0.predict(test_pool)
acc_cat = round(cbm0.score(x_train, y_train)*100,2)
msecat = mean_squared_error(y_test,predict_cat)
rmsecat = np.sqrt(msecat)
By using the above models i am getting rmse values about 2850.Now i want to improve my model performance by reducing root mean square error.How can i improve my model performance? As i am new to boosting algorithms,which parameters effect the models?And how can i do hyperparameter tuning for those algorithms(xgboost,catboost,lightgbm).I am using Windows10 os and intel i5 7th genration.
Out of those 3 tools that you have tried CatBoost provides an edge in categorical feature processing (it could be also faster, but I did not see a benchmark demonstrating it, and it seems to be not dominating on kaggle, so most likely it is not as quick as LightGBM, but I might be wrong in that hypothesis). So I would use it if I have many of those in my sample. The other two (LightGBM and XGBoost) provide very similar functionality and I would suggest to choose one of them and stick top it. At the moment it seems that LightGBM outperforms XGBoost in training time on CPU providing a very comparable precision of predictions. See for example GBM-perf beachmark on github or this in-depth analysis. If you have GPU's available, than in fact XGBoost seems to be preferable, judging on the benachmark above.
In general, you can improve your model performance in several ways:
train longer (if early stopping was not triggered, that means that there is still room for generalisation; if it was, then you can not improve further by training longer the chosen model with chosen hyper-parameters)
optimise hyper-parameters (see below)
choose a different model. There is no single silver bullet for all problems. Typically GBMs work very well on large samples of structured data, but for some classes of problems (e.g. linear dependence) it is hard for a GBM to learn how to generalise, as it might require very many splits. So it might be that for your problem a linear model, an SVM or something else will do better out of the box.
Since we narrowed down to 2 options, I can not advice on catboost hyper-parameter optimisation, as I have no hands-on experience with it yet. But for lightgbm tuning you can read this official lightgbm doc and these instructions in one of the issues. There are very many good examples of hyper parameter tuning for LightGBM. I can quickly dig out my kernel on kaggle: see here. I do not claim it to be perfect but that's something what is easy for me to find :)
If you are using Intel CPU, then try Intel XGBoost. Intel has powered several optimizations for XGBoost to accelerate gradient boosting models and improve its training and inference capabilities. Also, please check out the article, https://www.intel.com/content/www/us/en/developer/articles/technical/easy-introduction-xgboost-for-intel-architecture.html#gs.q4c6p6 on how to use XGBoost with Intel optimizations.
You can use either of lasso or ridge, these methods could improve the performance.
For hyper parameter tuning, you can use loops. iterate the values and check where you getting lowest RMSE values.
You can also try stacked ensemble techniques.
If you use R, use h20.ai package, It gives good result.

Scikit_learn's PolynomialFeatures with logistic regression resulting in lower scores

I have a dataset X whose shape is (1741, 61). Using logistic regression with cross_validation I was getting around 62-65% for each split (cv =5).
I thought that if I made the data quadratic, the accuracy is supposed to increase. However, I'm getting the opposite effect (I'm getting each split of cross_validation to be in the 40's, percentage-wise) So,I'm presuming I'm doing something wrong when trying to make the data quadratic?
Here is the code I'm using,
from sklearn import preprocessing
X_scaled = preprocessing.scale(X)
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(3)
poly_x =poly.fit_transform(X_scaled)
classifier = LogisticRegression(penalty ='l2', max_iter = 200)
from sklearn.cross_validation import cross_val_score
cross_val_score(classifier, poly_x, y, cv=5)
array([ 0.46418338, 0.4269341 , 0.49425287, 0.58908046, 0.60518732])
Which makes me suspect, I'm doing something wrong.
I tried transforming the raw data into quadratic, then using preprocessing.scale, to scale the data, but it was resulting in an error.
UserWarning: Numerical issues were encountered when centering the data and might not be solved. Dataset may contain too large values. You may need to prescale your features.
warnings.warn("Numerical issues were encountered "
So I didn't bother going this route.
The other thing that's bothering is the speed of the quadratic computations. cross_val_score is taking around a couple of hours to output the score when using polynomial features. Is there any way to speed this up? I have an intel i5-6500 CPU with 16 gigs of ram, Windows 7 OS.
Thank you.
Have you tried using the MinMaxScaler instead of the Scaler? Scaler will output values that are both above and below 0, so you will run into a situation where values with a scaled value of -0.1 and those with a value of 0.1 will have the same squared value, despite not really being similar at all. Intuitively this would seem to be something that would lower the score of a polynomial fit. That being said I haven't tested this, it's just my intuition. Furthermore, be careful with Polynomial fits. I suggest reading this answer to "Why use regularization in polynomial regression instead of lowering the degree?". It's a great explanation and will likely introduce you to some new techniques. As an aside #MatthewDrury is an excellent teacher and I recommend reading all of his answers and blog posts.
There is a statement that "the accuracy is supposed to increase" with polynomial features. That is true if the polynomial features brings the model closer to the original data generating process. Polynomial features, especially making every feature interact and polynomial, may move the model further from the data generating process; hence worse results may be appropriate.
By using a 3 degree polynomial in scikit, the X matrix went from (1741, 61) to (1741, 41664), which is significantly more columns than rows.
41k+ columns will take longer to solve. You should be looking at feature selection methods. As Grr says, investigate lowering the polynomial. Try L1, grouped lasso, RFE, Bayesian methods. Try SMEs (subject matter experts who may be able to identify specific features that may be polynomial). Plot the data to see which features may interact or be best in a polynomial.
I have not looked at it for a while but I recall discussions on hierarchically well-formulated models (can you remove x1 but keep the x1 * x2 interaction). That is probably worth investigating if your model behaves best with an ill-formulated hierarchical model.

Resources