RandomizedSearchCV - Appropriate hyper-parameter distribution - machine-learning

In "Hands-on Machine Learning with Scikit-Learn, Keras & TensorFlow" book I see below distributions(reciprocal and Expon) being applied for Hyperparameters C and gamma. How did Author(Aurelion) came up with these distributions ? I mean how to determine which distribution would be appropriate for application in RandomizedSearchCV ?
from scipy.stats import expon, reciprocal

param_distribs = {
    'kernel': ['linear', 'rbf'],
    'C': reciprocal(20, 200000),
    'gamma': expon(scale=1.0),
}

I hope I got the question right.
It depends on the ML model. Randomized or grid search is used to search for the hyper-parameter values that yield the best estimator for prediction.
For example, consider the following code. `rf_clf` is the random forest model object, and `param_distribs` contains the parameters with an arbitrary choice of value ranges.
from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Ranges to sample from for each hyper-parameter
param_distribs = {
    'n_estimators': randint(low=1, high=500),
    'max_depth': randint(low=1, high=10),
    'max_features': randint(low=1, high=10),
}

rf_clf = RandomForestClassifier(random_state=42)
rnd_search_rf = RandomizedSearchCV(rf_clf, param_distributions=param_distribs,
                                   n_iter=10, cv=5, scoring='accuracy', random_state=42)
rnd_search_rf.fit(X_train, y_train)
The best estimator can be accessed via
rnd_search_rf.best_estimator_
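For completeness, once the search above has been fitted you can also inspect the winning parameter values and their cross-validated score (these are standard RandomizedSearchCV attributes):
print(rnd_search_rf.best_params_)   # the sampled hyper-parameter values that won
print(rnd_search_rf.best_score_)    # their mean cross-validated accuracy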

I found the comments below in the sample code on GitHub.
On C: "The distribution we used for C looks quite different: the scale of the samples is picked from a uniform distribution within a given range, which is why the right graph, which represents the log of the samples, looks roughly constant. This distribution is useful when you don't have a clue of what the target scale is."
On reciprocal vs. exponential: "The reciprocal distribution is useful when you have no idea what the scale of the hyperparameter should be (indeed, as you can see on the figure on the right, all scales are equally likely, within the given range), whereas the exponential distribution is best when you know (more or less) what the scale of the hyperparameter should be."
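To build some intuition for those comments, here is a minimal sketch (my own, not from the book's notebook) that samples from both distributions used in the question and compares their scales:

import numpy as np
from scipy.stats import expon, reciprocal

# reciprocal(20, 200000) is log-uniform: log10 of its samples is roughly uniform
# between log10(20) and log10(200000), so every scale in the range is equally likely.
c_samples = reciprocal(20, 200000).rvs(10000, random_state=42)
print(np.percentile(np.log10(c_samples), [0, 25, 50, 75, 100]))

# expon(scale=1.0) concentrates its mass around the chosen scale (mean 1.0),
# which suits gamma when you roughly know the right order of magnitude.
gamma_samples = expon(scale=1.0).rvs(10000, random_state=42)
print(np.percentile(gamma_samples, [0, 25, 50, 75, 100]))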

Related

ML accuracy for a particular group/range

General terms that I used to search on Google, such as "localised accuracy", "custom accuracy", and "biased cost functions", all seem wrong, and maybe I am not even asking the right questions.
Imagine I have some data, be it:
The famous Iris Classification Problem
Pictures of felines
The Following Dataset that I made up on predicting house prices:
In all these scenarios, I am really only interested in the accuracy for one set / one regression range of the data.
For irises, I really need Iris setosa to be classified correctly; I really don't care if Iris virginica and Iris versicolor are all wrong.
For felines, I really need the model to tell me if it spotted a tiger (for obvious reasons); whether it is a Persian or a ragdoll I don't really care.
For the house prices, I want the error on higher-end houses to be minimised, because errors there are costly.
How do I do this? If I want setosa to be classified correctly, removing virginica or versicolor both seem wrong. Trying different algorithms like linear models or SVMs is all well and good, but it only improves the OVERALL accuracy. I really need, for example, "tigers" to be predicted correctly, even at the expense of the overall accuracy of the model.
Is there a way to have a custom cost function that gives me high accuracy in a localised region of a regression problem, or for a specific category in a classification problem?
If this cannot be answered, pointing me to some terms that I can search/research would still be greatly appreciated.
You can use weights to achieve that. If you're using the SVC class of scikit-learn, you can pass class_weight to the constructor. You could also pass sample_weight to the fit method.
For example:
from sklearn import svm
from sklearn import datasets

iris = datasets.load_iris()
X = iris.data
y = iris.target

# Class 0 (setosa) counts three times as much as the other classes
clf = svm.SVC(class_weight={0: 3, 1: 1, 2: 1})
clf.fit(X, y)
This way setosa is more important than the other classes.
Example for regression:
from sklearn.linear_model import LinearRegression

X = ...  # features
y = ...  # house prices

# threshold is the price above which errors are considered costly
weights = []
for house_price in y:
    if house_price > threshold:
        weights.append(3)
    else:
        weights.append(1)

clf = LinearRegression()
clf.fit(X, y, sample_weight=weights)
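The same weighting can also be written in vectorized form (a small sketch, assuming y is a NumPy array and threshold is the price cutoff you choose):

import numpy as np

# 3x weight for expensive houses, 1x for the rest
weights = np.where(y > threshold, 3, 1)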

Can intercept and regression coefficients (Beta values) be very high?

I have 38 variables, like oxygen, temperature, pressure, etc., and have a task to determine the total yield produced every day from these variables. When I calculate the regression coefficients and the intercept, they seem abnormal and very high (impractical). For example, if the 'temperature' coefficient were +375.456, I could not give it a meaning by saying that an increase of one unit in temperature would increase the yield by 375.456 g; that's impractical in my scenario. However, the prediction accuracy seems right.
I would like to know how to interpret the huge intercept (-5341.27355) and the huge beta values shown below. One other important point: I removed multicollinear columns, and I am not scaling/normalizing the variables because I need the beta coefficients to keep their meaning, so that I can say an increase in temperature of one unit increases the yield by 10 g or so. Your inputs are highly appreciated!
modl.intercept_
Out[375]: -5341.27354961415
modl.coef_
Out[376]:
array([ 1.38096017e+00, -7.62388829e+00, 5.64611255e+00, 2.26124164e-01,
4.21908571e-01, 4.50695302e-01, -8.15167717e-01, 1.82390184e+00,
-3.32849969e+02, 3.31942553e+02, 3.58830763e+02, -2.05076898e-01,
-3.06404757e+02, 7.86012402e+00, 3.21339318e+02, -7.00817205e-01,
-1.09676321e+04, 1.91481734e+00, 6.02929848e+01, 8.33731416e+00,
-6.23433431e+01, -1.88442804e+00, 6.86526274e+00, -6.76103795e+01,
-1.11406021e+02, 2.48270706e+02, 2.94836048e+01, 1.00279016e+02,
1.42906659e-02, -2.13019683e-03, -6.71427100e+02, -2.03158515e+02,
9.32094007e-03, 5.56457014e+01, -2.91724945e+00, 4.78691176e-01,
8.78121854e+00, -4.93696073e+00])
It's very unlikely that all of these variables are linearly correlated with the target, so I would suggest that you have a look at simple non-linear regression techniques, such as decision trees or kernel ridge regression. These are, however, more difficult to interpret.
Going back to your issue, these high weights might well be due to some high amount of correlation between the variables, or simply to your not having very much training data.
If you use Lasso regression instead of plain linear regression, the solution is biased away from high regression coefficients, and the fit will likely improve as well.
A small example of how to do this in scikit-learn, including cross-validation of the regularization hyper-parameter:
import numpy as np
from sklearn.linear_model import LassoCV

# Make up some data
n_samples = 100
n_features = 5
X = np.random.random((n_samples, n_features))
# Make y linearly dependent on the features
y = np.sum(np.random.random((1, n_features)) * X, axis=1)

model = LassoCV(cv=5, n_alphas=100, fit_intercept=True)
model.fit(X, y)
print(model.intercept_)
print(model.coef_)
If you have a linear regression, the formula looks like this (y = target, x = feature inputs):
y = x1*b1 + x2*b2 + x3*b3 + x4*b4 + ... + c
where b1, b2, b3, b4, ... are your modl.coef_ and c is the intercept. As you already realized, one of your biggest coefficients is 3.319e+02 ≈ 332, and the intercept is also quite big at about -5341.
As you already mentioned, a coefficient tells you how much the target variable changes if that feature changes by 1 unit while all other features are held constant.
So for your interpretation: the higher the absolute coefficient, the higher that feature's influence on the prediction. But it is important to note that the model is using a lot of high coefficients, which means your model does not depend on only one variable.
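One caveat worth adding: comparing raw coefficients across variables measured in different units can be misleading. If you only want to rank influence (while keeping your unscaled model for per-unit interpretation), a common trick, sketched here on made-up data standing in for the 38 process variables, is to refit on standardized features:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data standing in for the 38 measured variables
rng = np.random.default_rng(0)
X = rng.random((200, 38))
y = X @ rng.normal(size=38) + rng.normal(scale=0.1, size=200)

# Coefficients of the scaled model are "change in yield per standard deviation",
# so their absolute sizes are directly comparable across variables.
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X, y)
print(pipe.named_steps['linearregression'].coef_)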

Why does it only work with kernel='rbf' in the SVM classifier?

from sklearn.model_selection import GridSearchCV
from sklearn import svm

params_svm = {
    'kernel': ['linear', 'rbf', 'poly'],
    'C': [0.1, 0.5, 1, 10, 100],
    'gamma': [0.001, 0.01, 0.1, 1, 10]
}

svm_clf = svm.SVC()
estimator_svm = GridSearchCV(svm_clf, param_grid=params_svm, cv=4, verbose=1, scoring='accuracy')
estimator_svm.fit(data, labels)
print(estimator_svm.best_params_)
estimator_svm.best_score_

# data.shape is (891, 9) and labels.shape is (891,);
# they are numeric 2-D and 1-D arrays respectively.
When I run GridSearchCV with only 'rbf', it gives the best parameter combination in just 2.7 seconds!
But when I include 'poly' or 'linear' in the kernel list, either separately or together with 'rbf', it takes far too long, i.e. it gives no output even after 15-20 minutes, which makes me think I am doing something wrong. I am new to (supervised) machine learning. I am not able to find any bug in the code, and I don't understand what's going wrong behind the scenes.
Can anyone explain to me what I am doing wrong?
No, you are not doing anything wrong as far as your code is concerned. There are many factors that come into play here.
SVC is a complex classifier which requires computing a distance between each pair of points in the dataset.
The complexity also varies with the kernel. I am not sure, but I think it is O(n_samples^2 * n_features) for the rbf kernel, while it is O(n_samples * n_features) for the linear kernel. So it is not the case that just because the rbf kernel finishes quickly, the linear kernel will finish in a similar time.
Also, the time taken depends drastically on the dataset and the data patterns present in it. For example, an rbf kernel may converge quickly with, say, C = 0.5, but a polynomial kernel may take drastically more time to converge for the same value of C.
Also, without using the cache the running time increases a lot. In this answer, the author mentions it might increase to O(n_samples^3 * n_features).
Here is the official documentation from sklearn about SVM complexity. See also this section with practical tips on using SVMs.
You can set verbose to True to see the progress of your classifier as it is trained.
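For example, a small sketch (assuming the same data and labels arrays as in the question); a larger kernel cache and a bounded number of iterations also help when a kernel is slow to converge:

from sklearn import svm

# cache_size is in MB (default 200); verbose prints libsvm's training progress;
# max_iter caps the optimizer if it struggles to converge.
svm_clf = svm.SVC(kernel='linear', C=1.0, cache_size=1000, verbose=True, max_iter=100000)
svm_clf.fit(data, labels)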
References
GridSearchCV goes to endless execution using SVC
Computational complexity of SVM
Official Documentation of SVM for scikit-learn

Scikit-learn's PolynomialFeatures with logistic regression resulting in lower scores

I have a dataset X whose shape is (1741, 61). Using logistic regression with cross-validation I was getting around 62-65% accuracy for each split (cv=5).
I thought that if I made the data quadratic, the accuracy would increase. However, I'm getting the opposite effect (each cross-validation split is now in the 40s, percentage-wise), so I'm presuming I'm doing something wrong when trying to make the data quadratic?
Here is the code I'm using:
from sklearn import preprocessing
X_scaled = preprocessing.scale(X)

from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(3)
poly_x = poly.fit_transform(X_scaled)

from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(penalty='l2', max_iter=200)

from sklearn.cross_validation import cross_val_score  # sklearn.model_selection in newer versions
cross_val_score(classifier, poly_x, y, cv=5)
array([ 0.46418338, 0.4269341 , 0.49425287, 0.58908046, 0.60518732])
Which makes me suspect, I'm doing something wrong.
I tried transforming the raw data into quadratic first and then using preprocessing.scale to scale it, but that resulted in a warning:
UserWarning: Numerical issues were encountered when centering the data and might not be solved. Dataset may contain too large values. You may need to prescale your features.
warnings.warn("Numerical issues were encountered "
So I didn't bother going down this route.
The other thing that's bothering me is the speed of the quadratic computations. cross_val_score takes around a couple of hours to output the score when using polynomial features. Is there any way to speed this up? I have an Intel i5-6500 CPU with 16 GB of RAM, Windows 7 OS.
Thank you.
Have you tried using MinMaxScaler instead of scale? scale will output values that are both above and below 0, so you will run into a situation where values with a scaled value of -0.1 and those with a value of 0.1 have the same squared value, despite not really being similar at all. Intuitively this would seem to be something that would lower the score of a polynomial fit. That being said, I haven't tested this; it's just my intuition. Furthermore, be careful with polynomial fits. I suggest reading this answer to "Why use regularization in polynomial regression instead of lowering the degree?". It's a great explanation and will likely introduce you to some new techniques. As an aside, @MatthewDrury is an excellent teacher and I recommend reading all of his answers and blog posts.
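A minimal sketch of that suggestion (assuming X and y are the asker's arrays; the degree is lowered to 2 to keep the feature count manageable), wrapped in a pipeline so the scaling is refit inside each cross-validation split:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures

# Scale to [0, 1] before expanding, so squared terms keep their ordering
pipeline = make_pipeline(
    MinMaxScaler(),
    PolynomialFeatures(degree=2, include_bias=False),
    LogisticRegression(penalty='l2', max_iter=200),
)
print(cross_val_score(pipeline, X, y, cv=5))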
There is a statement that "the accuracy is supposed to increase" with polynomial features. That is true only if the polynomial features bring the model closer to the original data-generating process. Polynomial features, especially ones that make every feature interact and polynomial, may move the model further from the data-generating process; hence worse results may be appropriate.
By using a degree-3 polynomial in scikit-learn, the X matrix went from (1741, 61) to (1741, 41664), which is significantly more columns than rows.
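That column count can be verified directly (a quick check; it includes the bias column that PolynomialFeatures adds by default):

from math import comb

# PolynomialFeatures(degree=3) on 61 inputs produces C(61 + 3, 3) columns
print(comb(61 + 3, 3))  # 41664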
41k+ columns will take longer to solve. You should be looking at feature selection methods. As Grr says, investigate lowering the polynomial degree. Try L1 regularization, the grouped lasso, RFE, or Bayesian methods. Try SMEs (subject matter experts who may be able to identify the specific features that are likely to be polynomial). Plot the data to see which features may interact or be best as polynomials.
I have not looked at it for a while, but I recall discussions on hierarchically well-formulated models (can you remove x1 but keep the x1 * x2 interaction?). That is probably worth investigating if your model behaves best with an ill-formulated hierarchical model.

Optimal parameter estimation for a classifier with multiple parameters

The image on the left shows a standard ROC curve formed by sweeping a single threshold and recording the corresponding True Positive Rate (TPR) and False Positive Rate (FPR).
The image on the right shows my problem setup, where there are 3 parameters and, for each, we have only 2 choices. Together, they produce 2^3 = 8 points, as depicted on the graph. In practice, I intend to have thousands of possible combinations of 100s of parameters, but the concept remains the same in this down-scaled case.
I intend to find 2 things here:
Determine the optimum parameter(s) for the given data
Provide an overall performance score for all combinations of parameters
In the case of the ROC curve on the left, this is done easily using the following methods:
Optimal parameter: Maximal difference of TPR and FPR with a cost component (I believe it is called the J-statistic?)
Overall performance: Area under the curve (the shaded portion in the graph)
However, for my case in the image on the right, I do not know if the methods I have chosen are the standard principled methods that are normally used.
Optimal parameter set: Same maximal difference of TPR and FPR
Parameter score = TPR - FPR * cost_ratio
Overall performance: Average of all "parameter scores"
I have found a lot of reference material for the ROC curve with a single threshold, and while there are other techniques available to determine the performance, the ones mentioned in this question are definitely considered standard approaches. I found no such reading material for the scenario presented on the right.
Bottom line, the question here is two-fold: (1) provide methods to evaluate the optimal parameter set and overall performance in my problem scenario, and (2) provide a reference that shows the suggested methods are a standard approach for the given scenario.
P.S.: I first posted this question on the Cross Validated forum, but didn't get any responses; in fact, it got only 7 views in 15 hours.
I'm going to expand a little on aberger's previous answer about grid search. As with any model tuning, it's best to optimise hyper-parameters using one portion of the data and evaluate those parameters using another portion, so GridSearchCV is well suited for this purpose.
First I'll create some data and split it into training and test
import numpy as np
from sklearn import model_selection, ensemble, metrics
np.random.seed(42)
X = np.random.random((5000, 10))
y = np.random.randint(0, 2, 5000)
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.3)
This gives us a classification problem, which is what I think you're describing, though the same would apply to regression problems too.
Now it's helpful to think about which parameters you may want to optimise. A cross-validated grid search is a computationally expensive process, so the smaller the search space, the quicker it gets done. I will show an example for a RandomForestClassifier because it's my go-to model.
clf = ensemble.RandomForestClassifier()
parameters = {'n_estimators': [10, 20, 30],
              'max_features': [5, 8, 10],
              'max_depth': [None, 10, 20]}
So now I have my base estimator and a list of parameters that I want to optimise. Now I just have to think about how I want to evaluate each of the models that I'm going to build. It seems from your question that you're interested in the ROC AUC, so that's what I'll use for this example, though you can choose from many default metrics in scikit-learn or even define your own.
gs = model_selection.GridSearchCV(clf, param_grid=parameters,
                                  scoring='roc_auc', cv=5)
gs.fit(X_train, y_train)
This will fit a model for all possible combinations of parameters that I have given it, using 5-fold cross-validation to evaluate how well those parameters performed via the ROC AUC. Once that's been fit, we can look at the best parameters and pull out the best-performing model.
print(gs.best_params_)
clf = gs.best_estimator_
Outputs:
{'max_features': 5, 'n_estimators': 30, 'max_depth': 20}
Now at this point you may want to retrain your classifier on all of the training data, as currently it's been trained using cross-validation. Some people prefer not to, but I'm a retrainer!
clf.fit(X_train, y_train)
So now we can evaluate how well the model performs on both our training and test set.
print(metrics.classification_report(y_train, clf.predict(X_train)))
print(metrics.classification_report(y_test, clf.predict(X_test)))
Outputs:
             precision    recall  f1-score   support

          0       1.00      1.00      1.00      1707
          1       1.00      1.00      1.00      1793

avg / total       1.00      1.00      1.00      3500

             precision    recall  f1-score   support

          0       0.51      0.46      0.48       780
          1       0.47      0.52      0.50       720

avg / total       0.49      0.49      0.49      1500
We can see that this model has overfit, given the poor score on the test set. But this is not surprising, as the data is just random noise! Hopefully, when performing these methods on data with a signal, you will end up with a well-tuned model.
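Since the question is framed around ROC analysis, the test-set ROC AUC can also be reported directly (a small addition using the metrics module already imported above):

# ROC AUC on the held-out test set, using the predicted probability of class 1
print(metrics.roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))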
EDIT
This is one of those situations where 'everyone does it' but there's no real clear reference to say this is the best way to do it. I would suggest looking for an example close to the classification problem that you're working on, for example using Google Scholar to search for "grid search" "SVM" "gene expression".
I feel like we're talking about grid search in scikit-learn. It (1) provides methods to evaluate optimal (hyper)parameters, and (2) is implemented in a massively popular and well-referenced statistical software package.
