GridSearchCV parameters don't improve classification - machine-learning

I have a target dataset that I divide into 5 non-overlapping folds.
At each iteration (total iterations == 5) I use 1 fold (let's call it fold_for_tuning) to do parameter tuning, and I use 4 folds for testing.
The reason for this is that I want to do domain adaptation: before tuning, I fit the classifier on the source data, and then I tune it using a small subset of the target data.
I call GridSearchCV, fit it on fold_for_tuning, and pass the parameters that I want to tune:
param_test1 = {
    'max_depth': [5, 7],
    'min_child_weight': [0.5, 1, 2],
    'gamma': [0.1, 1],
    'subsample': [0.6, 0.7],
    'colsample_bytree': [0.6, 0.7],
    'reg_alpha': [0.01, 0.1]
}
gsearch = GridSearchCV(estimator=classifierXGB,
                       param_grid=param_test1,
                       scoring='accuracy',
                       n_jobs=4, iid=False, cv=2)
gsearch.fit(fold_for_tuning_data, fold_for_tuning_labels)
After each iteration, I take gsearch.best_params_ and set them on classifierXGB (because, in my understanding, they should give better predictions).
Then, when I call
test_y_predicted = classifierXGB.predict(4_unseen_folds)
I see no improvement:
prediction before tuning:
acc: 0.690658872245
auc: 0.700764301397
f1: 0.679211922203
prediction after tuning:
acc: 0.691382460414
auc: 0.701595887248
f1: 0.680132554837
But if I call gsearch.predict(4_unseen_folds)
I get MUCH BETTER performance:
prediction grid search :
acc: 0.933313032887
auc: 0.930058979926
f1: 0.920623414281
So I am confused: what is happening inside the grid search? Shouldn't it be optimizing only the parameters that I pass in param_grid? If so, why does setting the very same parameters on classifierXGB not result in better performance?

Your gsearch.predict(...) call returns the predictions of the best classifier found during the search.
I'm not sure what's happening in the background of ClassifierXGB, but if you create a new classifierXGB:
classifierXGB = ClassifierXGB(**gsearch.best_params_)
and then call classifierXGB.predict(4_unseen_folds) you should see something similar to gsearch.predict(4_unseen_folds).
It might be that applying changes to classifierXGB after the fact doesn't do what you expect. Creating a new instance of ClassifierXGB should help.

Once you have set the parameters on your classifierXGB, you need to fit it on the whole training data and then use it to predict.
The grid search found the "right" parameters and you gave them to your classifier so that it can learn efficiently, but you did not give it the actual trees/weights of the fitted model. It is still an empty shell.
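Putting the two answers together, a minimal sketch of the refit step might look like the following. XGBClassifier from the xgboost package is an assumption here (substitute whatever ClassifierXGB wraps), and full_train_data / full_train_labels stand for the data you actually train on.
from xgboost import XGBClassifier

best_clf = XGBClassifier(**gsearch.best_params_)       # fresh instance carrying the tuned parameters
best_clf.fit(full_train_data, full_train_labels)       # refit: the parameters alone contain no trees/weights
test_y_predicted = best_clf.predict(unseen_folds_data)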

Related

How to pass a gradient through a categorical distribution?

I have a sequence of words:
my_sequence = "This is my sequence"
I also have model_1 that predicts a label for each word, out of 2 possible labels:
model_1_probabilities_predictions = [[0.1, 0.9], [0.4, 0.6], [0.7, 0.3], [0.9, 0.1]]
So far so good. Now here is where I'm stuck:
I have another model, call it model_2. This model needs to take the categorical predictions (i.e., the argmax of model_1_probabilities_predictions) in order to make a prediction. It only uses the words corresponding to the first label.
model_2_predictions = model_2(["This", "is"])
Using this I can get a loss:
loss = loss_function(model_2_predictions, true_label)
There are 3 issues:
1. I can't differentiate through argmax
2. Even if there were a way to differentiate through argmax, the loss is 0 if the categorical labels are exact and 1 otherwise (e.g., 0 for [label_1, label_1, label_0, label_0] but 1 for [label_0, label_1, label_0, label_0]).
3. Since model_2 only uses the words corresponding to the first label, the loss can't really propagate to the other words
I looked into the Gumbel softmax trick but it's not quite what I need, as model_2 must have the categorical labels, so there isn't really a way around it
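For reference, here is a minimal sketch of the straight-through variant of the Gumbel-softmax trick mentioned above, assuming PyTorch (the tensor names are made up): it emits hard one-hot vectors in the forward pass while gradients flow through the soft probabilities in the backward pass.
import torch
import torch.nn.functional as F

# stand-in for model_1's per-word logits (4 words, 2 labels)
logits = torch.randn(4, 2, requires_grad=True)

# hard=True returns one-hot rows in the forward pass, but the backward pass uses the
# soft Gumbel-softmax probabilities, so gradients can flow through the discrete choice
one_hot = F.gumbel_softmax(logits, tau=1.0, hard=True)

# instead of indexing with argmax, multiply the word representations by this mask,
# which keeps the word-selection step differentiable
first_label_mask = one_hot[:, 0]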

Layers for predicting financial data using Tensorflow/tflearn

I'd like to predict the interest rate, and I have some relevant factors such as a stock index and the money supply figure; the number of factors may be up to 200.
For example, the training data looks like the following: X contains the factors and y is the interest rate I want to train on and predict.
factor1 factor2 factor3 factor176 factor177 factor178
X= [[ 2.1428 6.1557 5.4101 ..., 5.86 6.0735 6.191 ]
[ 2.168 6.1533 5.2315 ..., 5.8185 6.0591 6.189 ]
[ 2.125 4.7965 3.9443 ..., 5.7845 5.9873 6.1283]...]
y= [[ 3.5593]
[ 3.014 ]
[ 2.7125]...]
So I want to use tensorflow/tflearn to train this model, but I don't really know which method I should choose for the regression. I have tried LinearRegression from tflearn before, but the result was not great.
For now, I just use the code I found online.
net = tflearn.input_data([None, 178])
net = tflearn.fully_connected(net, 64, activation='linear',
                              weight_decay=0.0005)
net = tflearn.fully_connected(net, 1, activation='linear')
net = tflearn.regression(net,
                         optimizer=tflearn.optimizers.AdaGrad(learning_rate=0.01,
                                                              initial_accumulator_value=0.01),
                         loss='mean_square', learning_rate=0.05)
model = tflearn.DNN(net, tensorboard_verbose=0, checkpoint_path='tmp/')
model.fit(X, y, show_metric=True,
          batch_size=1, n_epoch=100)
The result is only about 50% accuracy when a prediction counts as correct if it falls within ±10% of the true value.
I have also tried widening the window to 7 days, but the result is still bad. So I want to know what additional layers I can use to make this network better.
First of all, this network makes no sense: if you do not have any non-linear activations on your hidden units, your network is equivalent to linear regression.
So first change
net = tflearn.fully_connected(net, 64, activation='linear',
                              weight_decay=0.0005)
to
net = tflearn.fully_connected(net, 64, activation='relu',
                              weight_decay=0.0005)
Another general thing is to always normalise your data. Your X's are big and your y's are big as well; make sure they aren't, for example by whitening them (making them zero mean and unit standard deviation).
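For instance, a minimal sketch of this normalisation step, assuming scikit-learn's StandardScaler and the X, y arrays from the question:
import numpy as np
from sklearn.preprocessing import StandardScaler

x_scaler = StandardScaler()
y_scaler = StandardScaler()

X_norm = x_scaler.fit_transform(X)              # each factor becomes zero mean, unit std
y_norm = y_scaler.fit_transform(np.asarray(y))  # y in the question is already a column vector

# after training on (X_norm, y_norm), map predictions back to the original scale:
# y_pred = y_scaler.inverse_transform(model.predict(X_new_norm))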
Finding the right architecture is a hard problem and you will not find any "magical recipes" for it. Start with understanding what you are doing: log your training and check whether the training loss converges to small values. If it does not, you either are not training long enough, the network is too small, or the training hyperparameters are off (for example too big a learning rate or too much regularisation).

How to plot a learning curve for a keras experiment?

I'm training an RNN using keras and would like to see how the validation accuracy changes with the data set size. Keras has a list called val_acc in its history object which gets appended after every epoch with the respective validation set accuracy (link to the post in google group). I want to get the average of val_acc for the number of epochs run and plot that against the respective data set size.
Question: How can I retrieve the elements in the val_acc list and perform an operation like numpy.mean(val_acc)?
EDIT: As @runDOSrun said, getting the mean of the val_accs doesn't make sense. Let me focus on getting the final val_acc.
I tried what's been suggested by @nemo but no luck. Here's what I got when I print
model.fit(X_train, y_train, batch_size = 512, nb_epoch = 5, validation_split = 0.05).__dict__
output:
{'model': <keras.models.Sequential object at 0x000000001F752A90>, 'params': {'verbose': 1, 'nb_epoch': 5, 'batch_size': 512, 'metrics': ['loss', 'val_loss'], 'nb_sample': 1710, 'do_validation': True}, 'epoch': [0, 1, 2, 3, 4], 'history': {'loss': [0.96936064512408959, 0.66933631673890948, 0.63404161288724303, 0.62268789783555867, 0.60833334699708819], 'val_loss': [0.84040999412536621, 0.75676006078720093, 0.73714292049407959, 0.71032363176345825, 0.71341043710708618]}}
It turns out there's no list as val_acc in my history dictionary.
Question: How to include val_acc in to the history dictionary?
To get accuracy values, you need to request that they are calculated during fit, because accuracy is not an objective function, but a (common) metric. Sometimes calculating accuracy does not make sense, so it is not enabled by default in Keras. However, it is a built-in metric, and easy to add.
To add the metric, pass metrics=['accuracy'] to model.compile.
In your example:
history = model.fit(X_train, y_train, batch_size=512,
                    nb_epoch=5, validation_split=0.05)
You can then access validation accuracy as history.history['val_acc']
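For example, a minimal sketch using the (older) Keras API from the question; the loss and optimizer below are placeholders for whatever your model actually uses:
import matplotlib.pyplot as plt

model.compile(loss='categorical_crossentropy', optimizer='adam',
              metrics=['accuracy'])                 # this is what populates acc / val_acc

history = model.fit(X_train, y_train, batch_size=512,
                    nb_epoch=5, validation_split=0.05)

plt.plot(history.history['acc'], label='train accuracy')
plt.plot(history.history['val_acc'], label='validation accuracy')
plt.xlabel('epoch')
plt.legend()
plt.show()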
The history object is created during fit()ting the model. See keras/engine/training.py for details.
You can access the history using the history attribute on the model: model.history.
After fitting the model you simply average over the attribute.
np.mean(model.history.history['val_acc'])
Note that the pattern is val_<your output name here> for every output you specify.
Why do you find the average accuracy more important than the final accuracy? Depending on your initial values, your average might be quite misleading. It's easy to come up with different curves that have the same average but different interpretations.
I'd just plot the complete history of train_acc and val_acc to decide whether the RNN is performing well within the given setup. And also don't forget to have a sample size N > 1. Random initialization can have a big impact on RNNs, take at least N=10 different initializations for each setup to make sure that the different performance is actually caused by your set size and not by better/worse initializations.

Multi Label classification with Sklearn

I have tried using OneVsRest with Logistic Regression from Sklearn, but it gives empty labels for some samples (i.e. it doesn't predict any label), even though I do not have any unlabelled training data.
Any idea what might be causing this or how to fix this?
clf = OneVsRestClassifier(LogisticRegression(multi_class='ovr',max_iter=1000,solver='lbfgs'))
clf.fit(X,Y)
self.classifier=clf
self.classifier.predict(test_data)
Whenever you are performing multilabel classification, according to the OneVsRestClassifier documentation the targets need to be "a sequence of sequences of labels".
Moreover, depending on how you encode these labels you may get the following warning: "DeprecationWarning: Direct support for sequence of sequences multilabel representation will be unavailable from version 0.17. Use sklearn.preprocessing.MultiLabelBinarizer to convert to a label indicator representation."
So, a neat way to encode your labels:
from sklearn import preprocessing
mlb = preprocessing.MultiLabelBinarizer()
Y = mlb.fit_transform([(1, 2), (1,2), (1,2),(4,)])
# this means sample one belongs to classes {1,2} and so on.
# Take into account the format if only one class is needed, (4,) not (4)
so Y turns out to be:
array([[1, 1, 0],
       [1, 1, 0],
       [1, 1, 0],
       [0, 0, 1]])
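Putting it together with the classifier from the question (a sketch; label_sets stands for your list of label tuples):
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(label_sets)            # binary indicator matrix, one column per class

clf = OneVsRestClassifier(LogisticRegression(max_iter=1000, solver='lbfgs'))
clf.fit(X, Y)

pred = clf.predict(test_data)                # also a binary indicator matrix
pred_labels = mlb.inverse_transform(pred)    # back to tuples of labels; a tuple can be empty
                                             # when no class clears the 0.5 decision threshold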

Ensemble of different kinds of regressors using scikit-learn (or any other python framework)

I am trying to solve a regression task. I found out that 3 models work nicely for different subsets of the data: LassoLARS, SVR and Gradient Tree Boosting. When I make predictions with all 3 models and build a table of the true output next to the outputs of my 3 models, I see that each time at least one of the models is really close to the true output, though the other 2 can be relatively far away.
When I compute the minimal possible error (taking the prediction of the 'best' predictor for each test example) I get an error which is much smaller than the error of any model alone. So I thought about combining the predictions of these 3 different models into some kind of ensemble. The question is how to do this properly. All 3 of my models are built and tuned using scikit-learn; does it provide a method that can be used to pack models into an ensemble? The problem here is that I don't want to simply average the predictions of all three models: I want to weight them, where the weighting is determined by the properties of the specific example.
Even if scikit-learn does not provide such functionality, it would be nice if someone knows how to properly address this task of figuring out the weighting of each model for each example in the data. I think it might be done by a separate regressor built on top of these 3 models, which tries to output optimal weights for each of the 3 models, but I am not sure whether this is the best way of doing it.
This is a known, interesting (and often painful!) problem with hierarchical predictions. The problem with training a number of predictors over the train data and then training a higher-level predictor over them, again using the train data, has to do with the bias-variance decomposition.
Suppose you have two predictors, one essentially an overfitting version of the other; then the former will appear better than the latter over the train set. The combining predictor will favor the former for no good reason, simply because it cannot distinguish overfitting from truly high-quality prediction.
The known way of dealing with this is to prepare, for each row in the train data and for each of the predictors, a prediction for the row based on a model that was not fitted on this row. For the overfitting version, e.g., this won't produce a good result for the row, on average. The combining predictor will then be able to better assess a fair model for combining the lower-level predictors.
Shahar Azulay & I wrote a transformer stage for dealing with this:
import warnings

import numpy as np
import sklearn
import sklearn.cross_validation
import sklearn.exceptions


class Stacker(object):
    """
    A transformer fitting a predictor `pred` to data in a way
    that will allow a higher-up predictor to build a model utilizing both this
    and other predictors correctly.

    The fit_transform(self, x, y) of this class will create a column matrix whose
    each row contains the prediction of `pred` fitted on other rows than this one.
    This allows a higher-level predictor to correctly fit a model on this, and other
    column matrices obtained from other lower-level predictors.

    The fit(self, x, y) and transform(self, x_) methods will fit `pred` on all
    of `x`, and transform the output of `x_` (which is either `x` or not) using the fitted
    `pred`.

    Arguments:
        pred: A lower-level predictor to stack.
        cv_fn: Function taking `x`, and returning a cross-validation object. In `fit_transform`
            the train and test indices of the object will be iterated over. For each iteration, `pred` will
            be fitted to the `x` and `y` with rows corresponding to the
            train indices, and the test indices of the output will be obtained
            by predicting on the corresponding indices of `x`.
    """
    def __init__(self, pred, cv_fn=lambda x: sklearn.cross_validation.LeaveOneOut(x.shape[0])):
        self._pred, self._cv_fn = pred, cv_fn

    def fit_transform(self, x, y):
        x_trans = self._train_transform(x, y)
        self.fit(x, y)
        return x_trans

    def fit(self, x, y):
        """
        Same signature as any sklearn transformer.
        """
        self._pred.fit(x, y)
        return self

    def transform(self, x):
        """
        Same signature as any sklearn transformer.
        """
        return self._test_transform(x)

    def _train_transform(self, x, y):
        x_trans = np.nan * np.ones((x.shape[0], 1))

        all_te = set()
        for tr, te in self._cv_fn(x):
            all_te = all_te | set(te)
            x_trans[te, 0] = self._pred.fit(x[tr, :], y[tr]).predict(x[te, :])
        if all_te != set(range(x.shape[0])):
            warnings.warn('Not all indices covered by Stacker', sklearn.exceptions.FitFailedWarning)

        return x_trans

    def _test_transform(self, x):
        return self._pred.predict(x)
Here is an example of the improvement for the setting described in @MaximHaytovich's answer.
First, some setup:
import numpy as np

from sklearn import linear_model
from sklearn import cross_validation
from sklearn import ensemble
from sklearn import metrics

y = np.random.randn(100)
x0 = (y + 0.1 * np.random.randn(100)).reshape((100, 1))
x1 = (y + 0.1 * np.random.randn(100)).reshape((100, 1))
x = np.zeros((100, 2))
Note that x0 and x1 are just noisy versions of y. We'll use the first 80 rows for train, and the last 20 for test.
These are the two predictors: a higher-variance gradient booster, and a linear predictor:
g = ensemble.GradientBoostingRegressor()
l = linear_model.LinearRegression()
Here is the methodology suggested in the answer:
g.fit(x0[: 80, :], y[: 80])
l.fit(x1[: 80, :], y[: 80])
x[:, 0] = g.predict(x0)
x[:, 1] = l.predict(x1)
>>> metrics.r2_score(
y[80: ],
linear_model.LinearRegression().fit(x[: 80, :], y[: 80]).predict(x[80: , :]))
0.940017788444
Now, using stacking:
x[: 80, 0] = Stacker(g).fit_transform(x0[: 80, :], y[: 80])[:, 0]
x[: 80, 1] = Stacker(l).fit_transform(x1[: 80, :], y[: 80])[:, 0]
u = linear_model.LinearRegression().fit(x[: 80, :], y[: 80])
x[80: , 0] = Stacker(g).fit(x0[: 80, :], y[: 80]).transform(x0[80:, :])
x[80: , 1] = Stacker(l).fit(x1[: 80, :], y[: 80]).transform(x1[80:, :])
>>> metrics.r2_score(
y[80: ],
u.predict(x[80:, :]))
0.992196564279
The stacking prediction does better. It realizes that the gradient booster is not that great.
Ok, after spending some time googling 'stacking' (as mentioned by @andreas earlier) I found out how I could do the weighting in Python, even with scikit-learn. Consider the below:
I train a set of my regression models (as mentioned: SVR, LassoLars and GradientBoostingRegressor). Then I run all of them on the training data (the same data that was used for training each of these 3 regressors). I get predictions for the examples with each of my algorithms and save these 3 results into a pandas dataframe with columns 'predictedSVR', 'predictedLASSO' and 'predictedGBR'. And I add a final column to this dataframe, which I call 'predicted', holding the true target value.
Then I just train a linear regression on this new dataframe:
# df - dataframe with the results of the 3 regressors and the true output
from sklearn import linear_model

stacker = linear_model.LinearRegression()
stacker.fit(df[['predictedSVR', 'predictedLASSO', 'predictedGBR']], df['predicted'])
So when I want to make a prediction for a new example I just run each of my 3 regressors separately and then I do:
stacker.predict()
on outputs of my 3 regressors. And get a result.
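A sketch of that prediction step (svr, lasso and gbr stand for the three fitted base regressors, new_X for the new examples):
import pandas as pd

meta_features = pd.DataFrame({
    'predictedSVR': svr.predict(new_X),      # same column names/order as in the training dataframe
    'predictedLASSO': lasso.predict(new_X),
    'predictedGBR': gbr.predict(new_X),
})
final_prediction = stacker.predict(meta_features[['predictedSVR', 'predictedLASSO', 'predictedGBR']])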
The problem here is that I am finding optimal weights for the regressors only 'on average': the weights will be the same for every example on which I try to make a prediction.
What you describe is called "stacking" which is not implemented in scikit-learn yet, but I think contributions would be welcome. An ensemble that just averages will be in pretty soon: https://github.com/scikit-learn/scikit-learn/pull/4161
Late response, but I wanted to add one practical point for this sort of stacked regression approach (which I use frequently in my work).
You may want to choose an algorithm for the stacker which allows positive=True (for example, ElasticNet). I have found that, when you have one relatively stronger model, the unconstrained LinearRegression() model will often fit a larger positive coefficient to the stronger model and a negative coefficient to the weaker one.
Unless you actually believe that your weaker model has negative predictive power, this is not a helpful outcome. It is very similar to having high multicollinearity between the features of a regular regression model, and it causes all sorts of edge effects.
This comment applies most significantly to noisy data. If you're aiming for an RSQ of 0.9-0.95-0.99, you'd probably want to throw out the model that was getting a negative weighting.
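A sketch of that suggestion, reusing the dataframe from the earlier answer; the alpha value is only a placeholder to be tuned:
from sklearn.linear_model import ElasticNet

# positive=True constrains the stacker's weights on the base models to be non-negative
stacker = ElasticNet(alpha=0.001, positive=True)
stacker.fit(df[['predictedSVR', 'predictedLASSO', 'predictedGBR']], df['predicted'])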
