I am trying to find the best hyperparameters for Support Vector classification. So far, grid search has worked fine for tasks like that, but with SVC it seems to be hitting walls everywhere.
A minimal attempt with only a few suggestions for the C parameter works and produces results:
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {
    'C': [0.01, 0.1, 1, 10],
}
classifier = SVC()
grid_search = GridSearchCV(estimator=classifier, param_grid=param_grid, scoring='f1',
                           error_score=0, n_jobs=-1, verbose=42)
grid_search.fit(data[0], np.ravel(data[1]))
Similarly, other parameters like gamma, coef0 or shrinking don't create any problems.
However, anything involving searching for a kernel function just seems to go on processing infinitely. Even just adding one other choice leads to Python hogging all available processors for some ominous work that does not finish (at least not within 10 minutes or so).
param_grid = {
    'C': [0.01, 0.1, 1, 10],
    'kernel': ['rbf', 'linear'],
}
What's really confusing me is that it starts out alright, producing good output within the first minute, then just seemingly stops doing anything while still making the coolers run at full speed. The verbose output looks like this:
Fitting 3 folds for each of 8 candidates, totalling 24 fits
[CV] kernel=rbf, C=0.01 ..............................................
[CV] kernel=rbf, C=0.01 ..............................................
[CV] kernel=rbf, C=0.01 ..............................................
[CV] kernel=linear, C=0.01 ...........................................
[...]/python3/lib/python3.5/site-packages/sklearn/metrics/classification.py:1113: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 due to no predicted samples.
'precision', 'predicted', average, warn_for)
[CV] ............... kernel=rbf, C=0.01, score=0.564932, total= 0.9s
[CV] kernel=linear, C=0.01 ...........................................
[CV] ............... kernel=rbf, C=0.01, score=0.574120, total= 0.8s
[CV] kernel=linear, C=0.01 ...........................................
[CV] ............... kernel=rbf, C=0.01, score=0.000000, total= 0.9s
[CV] kernel=rbf, C=0.1 ...............................................
[Parallel(n_jobs=-1)]: Done 1 tasks | elapsed: 1.3s
[Parallel(n_jobs=-1)]: Done 2 tasks | elapsed: 1.4s
[Parallel(n_jobs=-1)]: Done 3 tasks | elapsed: 1.5s
[CV] ................ kernel=rbf, C=0.1, score=0.555556, total= 1.0s
[CV] kernel=rbf, C=0.1 ...............................................
[Parallel(n_jobs=-1)]: Done 4 tasks | elapsed: 2.9s
[CV] ................ kernel=rbf, C=0.1, score=0.564932, total= 1.1s
[CV] kernel=rbf, C=0.1 ...............................................
[Parallel(n_jobs=-1)]: Done 5 tasks | elapsed: 4.5s
[CV] ................ kernel=rbf, C=0.1, score=0.574120, total= 1.0s
[CV] kernel=linear, C=0.1 ............................................
[Parallel(n_jobs=-1)]: Done 6 tasks | elapsed: 5.9s
Setting n_jobs to any other number leads to similar results: part of the computation is done quickly and without complaints, then it seems to get stuck and uses all available CPU without any visible progress.
Furthermore, giving only one kernel choice yields different results: while rbf and sigmoid work fine and finish within seconds, poly and linear apparently get stuck.
I'm at a loss - what is the problem here, and how can I run grid search usefully? My data consists of a bit over 5000 instances with 12 numerical features each. The classes are either 0 or 1, in equal distribution. Is that too much, maybe? If so, why would some searches work just fine and the trouble only start for certain kernel functions?
EDIT: It looks like this is a problem with the data I'm using. The only thing that has helped so far was normalizing the features (all values in a range from 0 to 1).
Now, normalization is generally recommended for Support Vector approaches, as they are not scale invariant, so I was going to do it in any case. But I thought of it as a way to improve performance, not as a necessary precaution for the search to work at all, as it seems to be in this case.
I'll be able to work with this for now, but I'm still curious whether anybody knows just what might be wrong with the data and how it could still be fed to a support vector algorithm.
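For reference, a minimal sketch of the scaled setup (assuming the same data variables as above; the scaler sits inside a Pipeline so it is fit on each training fold only):

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# scale every feature to [0, 1] before it reaches the SVC
pipe = Pipeline([('scaler', MinMaxScaler()), ('svc', SVC())])
param_grid = {
    'svc__C': [0.01, 0.1, 1, 10],
    'svc__kernel': ['rbf', 'linear'],
}
grid_search = GridSearchCV(estimator=pipe, param_grid=param_grid, scoring='f1',
                           error_score=0, n_jobs=-1, verbose=42)
grid_search.fit(data[0], np.ravel(data[1]))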
I used my own data and modified your code a bit. The following code runs fine for me on Windows 8.
Code:
import pandas as pd
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
if __name__=='__main__':
    data = pd.read_csv('Prior Decompo2.csv', header=None)
    X, y = data.iloc[0:, 0:26].values, data.iloc[0:, 26].values
    param_grid = {'C': [0.01, 0.1, 1, 10], 'kernel': ('rbf', 'linear')}
    classifier = SVC()
    grid_search = GridSearchCV(estimator=classifier, param_grid=param_grid, scoring='accuracy', n_jobs=-1, verbose=42)
    grid_search.fit(X, y)
Change #1:
I added the if __name__=='__main__': guard, which is needed on Windows when n_jobs=-1 spawns worker processes.
Change #2:
Use parentheses (a tuple) for the kernel values:
param_grid = {'C' : [0.01, 0.1, 1, 10], 'kernel': ('rbf', 'linear')}
Important:
In the code you posted, there is a trailing comma after 'kernel': ['rbf', 'linear'] that is not needed at all!
Change #3:
Inside GridSearchCV use another scoring e.g. scoring='accuracy':
grid_search = GridSearchCV(estimator=classifier, param_grid=param_grid,scoring='accuracy', n_jobs=-1, verbose=42)
The result is:
You can clearly see in the image that both linear and rbf are tested.
Related
I am performing a classification task which is essentially doing algorithm configuration, i.e. trying to pick a configuration (or 'mode') which is likely to make the problem-solving algorithm finish in the quickest time.
I am learning to classify the "best" configuration based on features of problem instances. I see that scikit-learn enables you to create your own scoring function to use in tuning the models. However the score_func only takes the true label and the predicted label as input.
Is it possible to identify which row in the dataset a prediction came from (when passing to this custom scorer)? That way I could figure out the performance hit of a predicted ("wrong") config and score the model accordingly. Basically sometimes a "wrong" selection can still be very good and close to the best, but a naive classification has no way of knowing this when the classification labels are purely based on the best config.
Here's a contrived example to illustrate what I'm trying to do
import random as rnd
import pandas as pd
rnd.seed('hello')
probs = [f'instance_{i}' for i in range(6)]
confs = ('analytic', 'bruteforce', 'hybrid')
times = [(p,c,60*rnd.random()) for p in probs for c in confs]
df_alltimes = pd.DataFrame(times, columns=('problem', 'config', 'time'))
print(df_alltimes)
bestrows = df_alltimes.groupby(['problem'])['time'].idxmin()
dataset = df_alltimes.loc[bestrows,['config']].\
    rename(columns={'config':'best_config'})
feats = [[rnd.random() for p in range(len(probs))] for f in range(5)]
for i in range(len(feats)):
    dataset[f'feature_{i}'] = feats[i]
print(dataset)
df_alltimes:
problem config time
0 instance_0 analytic 15.307044
1 instance_0 bruteforce 36.742846
2 instance_0 hybrid 35.053416
3 instance_1 analytic 57.781358
4 instance_1 bruteforce 31.723275
5 instance_1 hybrid 8.080238
6 instance_2 analytic 4.211297
7 instance_2 bruteforce 24.034830
8 instance_2 hybrid 39.073023
9 instance_3 analytic 36.325485
10 instance_3 bruteforce 14.717841
11 instance_3 hybrid 57.103908
12 instance_4 analytic 7.358539
13 instance_4 bruteforce 10.805536
14 instance_4 hybrid 2.605044
15 instance_5 analytic 0.489870
16 instance_5 bruteforce 42.888858
17 instance_5 hybrid 58.634073
dataset:
best_config feature_0 feature_1 feature_2 feature_3 feature_4
0 analytic 0.645388 0.641626 0.975619 0.680713 0.209235
5 hybrid 0.993443 0.221038 0.893763 0.408532 0.254791
6 analytic 0.263872 0.142887 0.264538 0.166985 0.800054
10 bruteforce 0.155023 0.601300 0.258767 0.614732 0.850529
14 hybrid 0.766183 0.993692 0.597047 0.401482 0.275133
15 analytic 0.386327 0.065699 0.349115 0.370136 0.357329
I am using sklearn with the dataset where the X would be the feature columns and the y would be the best_config column. In this example, the "bad" choices for instance_0 are both almost equally bad, but for instance_1, the two wrong choices are not equally bad. So I'd like my custom scorer to be able to reflect this somehow. Is that possible?
In the end I did find a way to get the information I was after in the original question. If you're passing a pandas.Series as your target labels, the index attribute is available, so you can look up whatever you want in the full dataset.
In the solution below, the first part is pretty much the same as the original minimal working example - i.e. generating a fake dataset.
In the second part, a custom scorer function is defined, which is then passed to the cross-validating hyperparameter tuner, RandomizedSearchCV. Please bear in mind the data is garbage, so the "results" are meaningless; this is just a demo of how to refer back to a fuller set of results so that you can evaluate the quality of predictions made during hyperparameter tuning based on more specialised information rather than just "match / fail" when doing a classification.
import numpy as np
import pandas as pd
import random as rnd
INSTANCES = 200
FEATURES = 5
HP_ITER = 10
SEED = 1984
# invent timings for some problems run with different configurations
rnd.seed(SEED)
probs = [f'p_{i:03d}' for i in range(INSTANCES)]
confs = ('analytic', 'bruteforce', 'hybrid')
times = [(p,c,60*rnd.random()) for p in probs for c in confs]
df_times = pd.DataFrame(times, columns=('problem', 'config', 'time'))
# pick out the fastest config for each problem
bestrows = df_times.groupby(['problem'])['time'].idxmin()
dataset = df_times.loc[bestrows,['config','problem']]\
    .rename(columns={'config':'target'})\
    .reset_index(drop=True)
# invent some features for each problem
feats = [[rnd.random() for _ in probs] for f in range(FEATURES)]
for i in range(len(feats)):
    dataset[f'feature_{i}'] = feats[i]
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import train_test_split
# split our data into training and test sets
df_trn = dataset.sample(frac=0.8, replace=False, random_state=SEED)
df_tst = dataset.loc[~dataset.index.isin(df_trn.index)]
def _vb_loss(xvals, yvals, validation=False):
    """A custom scorer for cross-validation which uses distance to Virtual Best"""
    # use the .index attribute to access the relevant rows in the
    # timing data frame
    source = df_tst if validation else df_trn
    data = source.loc[xvals.index].reindex(columns=['problem','target'])
    data['truevals'] = xvals
    data['predvals'] = yvals
    # what's the best time available for each problem?
    data = data.merge(
        df_times, left_on=['problem','truevals'], right_on=['problem', 'config']
    ).rename(columns={'time' : 'best_time'}).drop(columns=['config'])
    # what's the time for our predicted choices?
    data = data.merge(
        df_times, left_on=['problem','predvals'], right_on=['problem','config']
    ).rename(columns={'time' : 'pred_time'}).drop(columns=['config'])
    # how far away were the predictions in total?
    residual_seconds = np.sum( data['pred_time'] - data['best_time'] )
    return residual_seconds
def fitAndPredict(use_custom_scorer=False):
    """Fit a model and make some predictions """
    our_scorer = make_scorer(_vb_loss, greater_is_better=False)
    hyperparameters = {'criterion' : ['gini', 'entropy'],
                       'n_estimators' : list(range(50,250)),
                       'max_depth' : list(range(2,32))
                       }
    model = RandomizedSearchCV(
        RandomForestClassifier(random_state=SEED),
        hyperparameters,
        n_iter = HP_ITER,
        scoring = our_scorer if use_custom_scorer else None,
        verbose = 1,
        random_state = SEED,
    )
    model.fit(
        df_trn.drop(columns=['target','problem']),
        df_trn['target']
    )
    preds = model.predict(df_tst.drop(columns=['target','problem']))
    return _vb_loss(df_tst['target'], preds, validation=True)
print("Timings for all configs:", df_times, "", sep="\n")
print("Labelled dataset:", dataset, "", sep="\n")
print("Test loss with default CV scorer :", fitAndPredict(False))
print("Test loss with custom CV scorer :", fitAndPredict(True))
Here's the output:
** Timings for all configs **
problem config time
0 p_000 analytic 21.811701
1 p_000 bruteforce 29.652341
2 p_000 hybrid 20.376605
3 p_001 analytic 12.989269
4 p_001 bruteforce 51.759137
.. ... ... ...
595 p_198 bruteforce 10.874092
596 p_198 hybrid 14.723661
597 p_199 analytic 24.984775
598 p_199 bruteforce 4.899111
599 p_199 hybrid 36.188729
[600 rows x 3 columns]
** Labelled dataset **
target problem feature_0 feature_1 feature_2 feature_3 feature_4
0 hybrid p_000 0.864952 0.487293 0.946654 0.863503 0.310866
1 analytic p_001 0.514093 0.007643 0.948784 0.582419 0.258159
2 bruteforce p_002 0.319059 0.872320 0.321495 0.807644 0.158471
3 analytic p_003 0.421063 0.955742 0.114808 0.980013 0.900057
4 hybrid p_004 0.325935 0.125824 0.697967 0.037196 0.923626
.. ... ... ... ... ... ... ...
195 hybrid p_195 0.179126 0.578338 0.391535 0.632501 0.442677
196 bruteforce p_196 0.827637 0.641567 0.710201 0.833341 0.215357
197 hybrid p_197 0.116661 0.480170 0.253893 0.623913 0.465419
198 bruteforce p_198 0.670555 0.037084 0.954332 0.408546 0.935973
199 bruteforce p_199 0.371541 0.463060 0.549176 0.581093 0.391114
[200 rows x 7 columns]
Fitting 5 folds for each of 10 candidates, totalling 50 fits
[Parallel(n_jobs=None)]: Done 50 out of 50 | elapsed: 8.8s finished
Test loss with default CV scorer : 542.5191014477357
Fitting 5 folds for each of 10 candidates, totalling 50 fits
[Parallel(n_jobs=None)]: Done 50 out of 50 | elapsed: 9.1s finished
Test loss with custom CV scorer : 522.3236277796698
I have a sparse dataset of dimensions (40000, 21). I am trying to build a classification model for it using xgboost. Unfortunately it is so slow that it never terminates for me. However, on the same dataset scikit-learn's RandomForestClassifier takes about 1 second. This is the code I am using:
from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier
[...]
t0 = time()
rf = RandomForestClassifier(n_jobs=-1)
rf.fit(trainX, trainY)
print("RF score", rf.score(testX, testY))
print("Time to fit and score random forest", time()-t0)
t0 = time()
clf = XGBClassifier(n_jobs=-1)
clf.fit(trainX, trainY, verbose=True)
print(clf.score(testX, testY))
print("Time taken to fit and score xgboost", time()-t0)
To show the type of trainX:
print(repr(trainX))
<40000x21 sparse matrix of type '<class 'numpy.int64'>'
with 360000 stored elements in Compressed Sparse Row format>
Notice I am using all the default parameters except for n_jobs.
What am I doing wrong?
In [3]: print(xgboost.__version__)
0.6
print(sklearn.__version__)
0.19.1
I tried the following so far from advice in the comments:
I set n_estimators = 5. Now at least it finishes in 62 seconds. This is still about 60 times slower than RandomForestClassifier.
With n_estimators = 5 I removed n_jobs=-1 and set n_jobs=1. It then finished in about 107 seconds (about 100 times slower than RandomForestClassifier). If I increase n_jobs to 4, this speeds up to 27 seconds, still about 27 times slower than RandomForestClassifier.
If I leave the default number of estimators it still never finishes for me.
Here is full code to reproduce the problem using fake data. I set n_estimators=50 for both classifiers, which slows the RandomForestClassifier down to about 16 seconds. xgboost, on the other hand, still never terminates for me.
#!/usr/bin/python3
from sklearn.datasets import make_classification
from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier
from time import time
(trainX, trainY) = make_classification(n_informative=10, n_redundant=0, n_samples=50000, n_classes=120)
print("Shape of trainX and trainY", trainX.shape, trainY.shape)
t0 = time()
rf = RandomForestClassifier(n_estimators=50, n_jobs=-1)
rf.fit(trainX, trainY)
print("Time elapsed by RandomForestClassifier is: ", time()-t0)
t0 = time()
xgbrf = XGBClassifier(n_estimators=50, n_jobs=-1,verbose=True)
xgbrf.fit(trainX, trainY)
print("Time elapsed by XGBClassifier is: ", time()-t0)
It turns out that the running time of xgboost scales quadratically with the number of classes. See https://github.com/dmlc/xgboost/issues/2926 .
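A rough way to observe this scaling on a smaller fake dataset (a sketch only; the timings you get will depend on your machine and xgboost version):

from time import time
from sklearn.datasets import make_classification
from xgboost import XGBClassifier

# hold everything fixed except the number of classes and time the fit
for n_classes in (2, 10, 40):
    X, y = make_classification(n_informative=10, n_redundant=0,
                               n_samples=5000, n_classes=n_classes)
    t0 = time()
    XGBClassifier(n_estimators=10, n_jobs=-1).fit(X, y)
    print(n_classes, "classes:", round(time() - t0, 1), "seconds")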
I'm using sklearn to train a classification model; the training pipeline and data shape are:
clf = Pipeline([
    ("imputer", Imputer(missing_values='NaN', strategy="mean", axis=0)),
    ('feature_selection', VarianceThreshold(threshold=(.97 * (1 - .97)))),
    ('scaler', StandardScaler()),
    ('classification', svm.SVC(kernel='linear', C=1))])
print X.shape, y.shape
(59381, 895) (59381,)
I have checked that feature_selection will reduce the feature vector size from 895 to 124
feature_selection = Pipeline([
    ("imputer", Imputer(missing_values='NaN', strategy="mean", axis=0)),
    ('feature_selection', VarianceThreshold(threshold=(.97 * (1 - .97))))
])
feature_selection.fit_transform(X).shape
(59381, 124)
then I try to get accuracy as below
scores = cross_validation.cross_val_score(clf, X, y)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
but the training process is very slow. How can I speed it up in this situation? Or is a feature vector of size 124 still too large for an SVM model?
Try using sklearn.svm.LinearSVC.
It is supposed to give very similar results to svm.SVC(kernel='linear'), but the training process will be faster (at least when d < m, where d is the feature dimension and m is the size of the training sample).
If you want to use other kernel, like rbf, you can't use LinearSVC.
However, you can increase the kernel cache size: the size of the kernel cache has a strong impact on run times for larger problems. If you have enough RAM available, it is recommended to set cache_size to a higher value than the default of 200 (MB), such as 500 (MB) or 1000 (MB).
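A rough sketch of both suggestions applied to your pipeline (only the scaling and classification steps are shown; the imputation and feature-selection steps from your question would go in front as before):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC, SVC

# linear kernel: LinearSVC (liblinear) is much faster when n_samples >> n_features
clf_linear = Pipeline([
    ('scaler', StandardScaler()),
    ('classification', LinearSVC(C=1))])

# non-linear kernel: keep SVC but give it a bigger kernel cache (in MB)
clf_rbf = Pipeline([
    ('scaler', StandardScaler()),
    ('classification', SVC(kernel='rbf', C=1, cache_size=1000))])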
I'm using SVMs, specifically libsvm, in order to predict peaks in electricity consumption. In the training set, each vector has 24 values, representing the accumulated kWh for each hour. The vector is labeled "peak", if the next value is defined as a peak (basic outlier detection).
Sample vectors from the training set:
1 1:4.05 2:2.75 3:2.13 4:1.82 5:1.5 6:2.92 7:1.78 8:1.71 9:2.1 10:2.74 11:2.75 12:2.41 13:2.38 14:2.37 15:3.57 16:2.38 17:2.48 18:2.44 19:2.35 20:2.78 21:3.03 22:2.29 23:2.41 24:2.71
0 1:2.75 2:2.13 3:1.82 4:1.5 5:2.92 6:1.78 7:1.71 8:2.1 9:2.74 10:2.75 11:2.41 12:2.38 13:2.37 14:3.57 15:2.38 16:2.48 17:2.44 18:2.35 19:2.78 20:3.03 21:2.29 22:2.41 23:2.71 24:(3.63)<- Peak
0 1:2.13 2:1.82 3:1.5 4:2.92 5:1.78 6:1.71 7:2.1 8:2.74 9:2.75 10:2.41 11:2.38 12:2.37 13:3.57 14:2.38 15:2.48 16:2.44 17:2.35 18:2.78 19:3.03 20:2.29 21:2.41 22:2.71 23:3.63 24:(1.53)<- No peak
The training seems fine and I get a ~85% accuracy when performing cross validation. However, when I try to classify the testing set, the predicted class labels are all the same. No peaks are discovered.
I'm using the default radial basis function and haven't changed any parameters.
Output from training.model (without the vectors):
svm_type c_svc
kernel_type rbf
gamma 0.0416667
nr_class 2
total_sv 174
rho -0.883122
label 0 1
nr_sv 122 52
Am I doing something fundamentally wrong here?
Is it possible to use stochastic gradient descent for time-series analysis?
My initial idea, given a series of (t, v) pairs where I want an SGD regressor to predict the v associated with t+1, would be to convert the date/time into an integer value, and train the regressor on this list using the hinge loss function. Is this feasible?
Edit: This is example code using the SGD implementation in scikit-learn. However, it fails to properly predict a simple linear time series model. All it seems to do is calculate the average of the training Y-values and use that as its prediction of the test Y-values. Is SGD just unsuitable for time-series analysis, or am I formulating this incorrectly?
from datetime import date
from sklearn.linear_model import SGDRegressor

# Build data.
s = date(2010,1,1)
i = 0
training = []
for _ in xrange(12):
    i += 1
    training.append([[date(2012,1,i).toordinal()], i])

testing = []
for _ in xrange(12):
    i += 1
    testing.append([[date(2012,1,i).toordinal()], i])

clf = SGDRegressor(loss='huber')

print 'Training...'
for _ in xrange(20):
    try:
        print _
        clf.partial_fit(X=[X for X,_ in training], y=[y for _,y in training])
    except ValueError:
        break

print 'Testing...'
for X,y in testing:
    p = clf.predict(X)
    print y,p,abs(p-y)
SGDRegressor in sklearn is numerically unstable for unscaled input parameters. For good results it is highly recommended that you scale the input variables.
from datetime import date
from sklearn.linear_model import SGDRegressor

# Build data.
s = date(2010,1,1).toordinal()
i = 0
training = []
for _ in range(1,13):
    i += 1
    training.append([[s+i], i])

testing = []
for _ in range(13,25):
    i += 1
    testing.append([[s+i], i])

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform([X for X,_ in training])
After training the SGD regressor, you will have to scale the test input variables accordingly:
clf = SGDRegressor()
clf.fit(X=X_train, y=[y for _,y in training])
print(clf.intercept_, clf.coef_)

print('Testing...')
for X,y in testing:
    p = clf.predict(scaler.transform([X]))
    print(X[0],y,p[0],abs(p[0]-y))
Here is the result:
[6.31706122] [3.35332573]
Testing...
733786 13 12.631164799851827 0.3688352001481725
733787 14 13.602565350686039 0.39743464931396133
733788 15 14.573965901520248 0.42603409847975193
733789 16 15.545366452354457 0.45463354764554254
733790 17 16.51676700318867 0.48323299681133136
733791 18 17.488167554022876 0.5118324459771237
733792 19 18.459568104857084 0.5404318951429161
733793 20 19.430968655691295 0.569031344308705
733794 21 20.402369206525506 0.5976307934744938
733795 22 21.373769757359714 0.6262302426402861
733796 23 22.34517030819392 0.6548296918060785
733797 24 23.316570859028133 0.6834291409718674
The method of choice for time series prediction depends on what you know about your time series. If you choose a specific method for your task, you always make implicit assumptions about the nature of your signal and the kind of system that generated it. Any method is always a model of the system. The more you know a priori about your signal and the system, the better you are able to model it.
If your signal is, for instance, of a stochastic nature, ARMA processes or Kalman filters are usually a good choice. If those fail, other, more deterministic models might help, given, of course, that you have some information about your system.
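For instance, a minimal ARMA sketch (assuming statsmodels >= 0.12 is available; the series here is just a toy AR(1) process, not real data):

import numpy as np
from statsmodels.tsa.arima.model import ARIMA

# build a toy stochastic series: y[t] = 0.8 * y[t-1] + noise
rng = np.random.default_rng(0)
y = np.zeros(200)
for t in range(1, 200):
    y[t] = 0.8 * y[t - 1] + rng.normal(scale=0.5)

# fit an ARMA(1, 1) model (ARIMA with d=0) and forecast the next 5 values
res = ARIMA(y, order=(1, 0, 1)).fit()
print(res.forecast(steps=5))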