Scikit Learn - Train New Model with GridSearchCV - machine-learning

If I get the optimal parameters using GridSearchCV and a pipeline, is there anyway to save the trained model, so in the future I can call the entire pipeline onto new data and generate a prediction for it? For example, I have the following pipeline followed by a gridsearchcv of the parameters:
pipeline = Pipeline([
('vect', CountVectorizer()),
('tfidf', TfidfTransformer()),
('clf', OneVsRestClassifier(SVC(probability=True))),
])
parameters = {
'vect__ngram_range': ((1, 1),(1, 2),(1,3)), # unigrams or bigrams
'clf__estimator__kernel': ('rbf','linear'),
'clf__estimator__C': tuple([10**i for i in range(-10,11)]),
}
grid_search = GridSearchCV(pipeline,parameters,n_jobs=-1,verbose=1)
print("Performing grid search...")
print("pipeline:", [name for name, _ in pipeline.steps])
print("parameters:")
pprint(parameters)
t0 = time()
#Conduct the grid search
grid_search.fit(X,y)
print("done in %0.3fs" % (time() - t0))
print()
print("Best score: %0.3f" % grid_search.best_score_)
print("Best parameters set:")
#Obtain the top performing parameters
best_parameters = grid_search.best_estimator_.get_params()
#Print the results
for param_name in sorted(parameters.keys()):
print("\t%s: %r" % (param_name, best_parameters[param_name]))
Now I want to save all of these steps into a single flow so that I can apply it to a new, unseen dataset and it will use the same parameters, vectorizers and transformers to transform, implement and report back the results on it?

You can just pickle the GridSearchCV object to save it and then unpickle it when you want to use it to predict new data.
import pickle
# Fit model and pickle fitted model
grid_search.fit(X,y)
with open('/model/path/model_pickle_file', "w") as fp:
pickle.dump(grid_search, fp)
# Load model from file
with open('/model/path/model_pickle_file', "r") as fp:
grid_search_load = pickle.load(fp)
# Predict new data with model loaded from disk
y_new = grid_search_load.best_estimator_.predict(X_new)

Related

How to get a loss from Huggingface's pipeline method in order to finetune a model?

I'm trying to use this model on huggingface for QA. The code for it is in the link:
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, pipeline
model_name = "deepset/roberta-base-squad2"
# a) Get predictions
nlp = pipeline('question-answering', model=model_name, tokenizer=model_name)
QA_input = {
'question': 'Why is model conversion important?',
'context': 'The option to convert models between FARM and transformers gives freedom to the user and let people easily switch between frameworks.'
}
res = nlp(QA_input)
# b) Load model & tokenizer
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
print(res)
>>>
{'score': 0.2117144614458084,
'start': 59,
'end': 84,
'answer': 'gives freedom to the user'}
However, I can't figure out how to get a loss so I can finetune this model. I was looking at the huggingface tutorial but didn't see anything other than using the Trainer method or another training method in the link (which is not QA):
import torch
from transformers import AdamW, AutoTokenizer, AutoModelForSequenceClassification
# Same as before
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
sequences = [
"I've been waiting for a HuggingFace course my whole life.",
"This course is amazing!",
]
batch = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")
# This is new
batch["labels"] = torch.tensor([1, 1])
optimizer = AdamW(model.parameters())
loss = model(**batch).loss
loss.backward()
optimizer.step()
Say that the true answer is freedom to the user instead of gives freedom to the user
You do not have to get the loss. There is the Trainer class in Hugginface and you can use it to train your model. It is also optimized for hugginface models and contains lots of different kind of deep learning best practices that you might be interested in. See here: https://huggingface.co/docs/transformers/main_classes/trainer

unable to get any entity after training a blank spacy model

I have been working on a legal data to recognize the custom entities within a legal document. I am training empty spacy model on my cleaned training dataset(JSON format as mentioned on spacy documentation). I have tried to remove punctuation, special character, brackets etc from the training dataset. I have labeled the new entity by name 'dictkey' and mentioned the startindex and endindex in it's JSON in order to create the training dataset.
Below is the links for training dataset and the code which i am using. Can you please have a look on the attached training dataset if there is any issue or further cleaning required for this dataset?
https://techmailer.online/TRAIN_DATA3json.txt
https://techmailer.online/train_ner%20-%20Copy.py
from __future__ import unicode_literals, print_function
import plac
import random
from pathlib import Path
import spacy
from spacy.util import minibatch, compounding
import re
#import createtrainingdataset_updated_v2 as traindataset
# training data
#TRAIN_DATA = [
# ("Who is Shaka Khan?", {"entities": [(7, 17, "PERSON")]}),
# ("I like London and Berlin.", {"entities": [(7, 13, "LOC"), (18, 24, "LOC")]}),
#]
with open('/Users/modis1/Desktop/24-01/TRAIN_DATA3json.txt', 'r', encoding='utf-8') as openfile:
TRAIN_DATA = openfile.read()
#plac.annotations(
model=("Model name. Defaults to blank 'en' model.", "option", "m", str),
output_dir=("Optional output directory", "option", "o", Path),
n_iter=("Number of training iterations", "option", "n", int),
)
def main(model=None, output_dir='/Users/A-GUPTA50/Desktop/ShubhamModi/NewSavedModel/', n_iter=100):
"""Load the model, set up the pipeline and train the entity recognizer."""
if model is not None:
nlp = spacy.load(model) # load existing spaCy model
print("Loaded model '%s'" % model)
else:
nlp = spacy.blank("en") # create blank Language class
print("Created blank 'en' model")
# create the built-in pipeline components and add them to the pipeline
# nlp.create_pipe works for built-ins that are registered with spaCy
if "ner" not in nlp.pipe_names:
ner = nlp.create_pipe("ner")
nlp.add_pipe(ner, last=True)
# otherwise, get it so we can add labels
else:
ner = nlp.get_pipe("ner")
# add labels
for _, annotations in TRAIN_DATA:
for ent in annotations.get("entities"):
ner.add_label(ent[2])
# get names of other pipes to disable them during training
pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"]
other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
with nlp.disable_pipes(*other_pipes): # only train NER
# reset and initialize the weights randomly – but only if we're
# training a new model
if model is None:
nlp.begin_training()
for itn in range(n_iter):
random.shuffle(TRAIN_DATA)
losses = {}
# batch up the examples using spaCy's minibatch
batches = minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001))
for batch in batches:
texts, annotations = zip(*batch)
nlp.update(
texts, # batch of texts
annotations, # batch of annotations
drop=0.5, # dropout - make it harder to memorise data
losses=losses,
)
print("Losses", losses)
# test the trained model
for text, _ in TRAIN_DATA:
# text = re.sub(r'\\u\d{4,}','', text.rstrip())
doc = nlp(text)
print("Entities", [(ent.text, ent.label_) for ent in doc.ents])
print("Tokens", [(t.text, t.ent_type_, t.ent_iob) for t in doc])
# save model to output directory
if output_dir is not None:
output_dir = Path(output_dir)
if not output_dir.exists():
output_dir.mkdir()
nlp.to_disk(output_dir)
print("Saved model to", output_dir)
# test the saved model
print("Loading from", output_dir)
nlp2 = spacy.load(output_dir)
for text, _ in TRAIN_DATA:
doc = nlp2(text)
print("Entities", [(ent.text, ent.label_) for ent in doc.ents])
print("Tokens", [(t.text, t.ent_type_, t.ent_iob) for t in doc])
if __name__ == "__main__":
plac.call(main)

How to save trained ARIMA model to use later

I am using a univariate time series dataset to forecast. I am using ARIMA model to train. But it is a time-consuming process to train every time. Is there any process to save the trained ARIMA model and use it later for prediction?
Here I am giving the github link of the code that I have tried. I am inspired by this post. Below, I am giving the function of the ARIMA model and the necessary line of code.
Actual = [x for x in train_set]
Predictions = list()
def StartARIMAForecasting(Actual, P, D, Q):
print('from function screaming')
model = ARIMA(Actual, order=(P, D, Q))
model_fit = model.fit(disp=0)
# save_model=model_fit.save('model_22.pkl')
prediction = model_fit.forecast()[0]
# save_model=model_fit.save('model_22.pkl')
return prediction,save_model
for timepoint in range(len(test_set)):
print('I am in for loop')
ActualValue = test_set[timepoint]
#forcast value
Prediction,save_model_1 = StartARIMAForecasting(Actual, 2,1,2)
print('Actual=%f, Predicted=%f' % (ActualValue, Prediction))
#add it in the list
Predictions.append(Prediction)
Actual.append(ActualValue)
Here, I want to save the model and then after loading the model I want to do the prediction using my test data.
My assumption is --
after saving the model I will load it and then will follow the following steps
loaded = ARIMAResults.load('model_arima.pkl')
#don't know how and in which line to create this model_arima.pkl
start_index = len(Actual)
end_index = start_index + len(test_set)-1
forecast = loaded.predict(start=start_index, end=end_index)
from sklearn.metrics import mean_squared_error
Error = mean_squared_error(test_set, forecast)
print(Error)

Using a stateful Keras model in pure TensorFlow

I have a stateful RNN model with several GRU layers that was created in Keras.
I have to run this model now from Java, so I dumped the model as protobuf, and I'm loading it from Java TensorFlow.
This model must be stateful because features will be fed one timestep at-a-time.
As far as I understand, in order to achieve statefulness in a TensorFlow model, I must somehow feed in the last state every time I execute the session runner, and also that the run would return the state after the execution.
Is there a way to output the state in the Keras model?
Is there a simpler way altogether to get a stateful Keras model to work as such using TensorFlow?
Many thanks
An alternative solution is to use the model.state_updates property of the keras model, and add it to the session.run call.
Here is a full example that illustrates this solutions with two lstms:
import tensorflow as tf
class SimpleLstmModel(tf.keras.Model):
""" Simple lstm model with two lstm """
def __init__(self, units=10, stateful=True):
super(SimpleLstmModel, self).__init__()
self.lstm_0 = tf.keras.layers.LSTM(units=units, stateful=stateful, return_sequences=True)
self.lstm_1 = tf.keras.layers.LSTM(units=units, stateful=stateful, return_sequences=True)
def call(self, inputs):
"""
:param inputs: [batch_size, seq_len, 1]
:return: output tensor
"""
x = self.lstm_0(inputs)
x = self.lstm_1(x)
return x
def main():
model = SimpleLstmModel(units=1, stateful=True)
x = tf.placeholder(shape=[1, 1, 1], dtype=tf.float32)
output = model(x)
sess = tf.Session()
sess.run(tf.initialize_all_variables())
res_at_step_1, _ = sess.run([output, model.state_updates], feed_dict={x: [[[0.1]]]})
print(res_at_step_1)
res_at_step_2, _ = sess.run([output, model.state_updates], feed_dict={x: [[[0.1]]]})
print(res_at_step_2)
if __name__ == "__main__":
main()
Which produces the following output:
[[[0.00168626]]]
[[[0.00434444]]]
and shows that the lstm state is preserved between batches.
If we set stateful to False, the output becomes:
[[[0.00033928]]]
[[[0.00033928]]]
Showing that the state is not reused.
ok, so I managed to solve this problem!
What worked for me was creating tf.identity tensors for not only the outputs, as is standard, but also for the state tensors.
In the Keras models, the state tensors can be found by doing:
model.updates
Which gives something like this:
[(<tf.Variable 'gru_1_1/Variable:0' shape=(1, 70) dtype=float32_ref>,
<tf.Tensor 'gru_1_1/while/Exit_2:0' shape=(1, 70) dtype=float32>),
(<tf.Variable 'gru_2_1/Variable:0' shape=(1, 70) dtype=float32_ref>,
<tf.Tensor 'gru_2_1/while/Exit_2:0' shape=(1, 70) dtype=float32>),
(<tf.Variable 'gru_3_1/Variable:0' shape=(1, 4) dtype=float32_ref>,
<tf.Tensor 'gru_3_1/while/Exit_2:0' shape=(1, 4) dtype=float32>)]
The 'Variable' is used for inputting the states, and the 'Exit' for outputs of the new states.
So I created tf.identity out of the 'Exit' tensors. I gave them meaningful names, e.g.:
tf.identity(state_variables[j], name='state'+str(j))
Where state_variables contained only the 'Exit' tensors
Then used the input variables (e.g. gru_1_1/Variable:0) to feed the model state from TensorFlow, and the identity variables I created out of the 'Exit' tensors were used to extract the new states after feeding the model at each timestep

Put customized functions in Sklearn pipeline

In my classification scheme, there are several steps including:
SMOTE (Synthetic Minority Over-sampling Technique)
Fisher criteria for feature selection
Standardization (Z-score normalisation)
SVC (Support Vector Classifier)
The main parameters to be tuned in the scheme above are percentile (2.) and hyperparameters for SVC (4.) and I want to go through grid search for tuning.
The current solution builds a "partial" pipeline including step 3 and 4 in the scheme clf = Pipeline([('normal',preprocessing.StandardScaler()),('svc',svm.SVC(class_weight='auto'))])
and breaks the scheme into two parts:
Tune the percentile of features to keep through the first grid search
skf = StratifiedKFold(y)
for train_ind, test_ind in skf:
X_train, X_test, y_train, y_test = X[train_ind], X[test_ind], y[train_ind], y[test_ind]
# SMOTE synthesizes the training data (we want to keep test data intact)
X_train, y_train = SMOTE(X_train, y_train)
for percentile in percentiles:
# Fisher returns the indices of the selected features specified by the parameter 'percentile'
selected_ind = Fisher(X_train, y_train, percentile)
X_train_selected, X_test_selected = X_train[selected_ind,:], X_test[selected_ind, :]
model = clf.fit(X_train_selected, y_train)
y_predict = model.predict(X_test_selected)
f1 = f1_score(y_predict, y_test)
The f1 scores will be stored and then be averaged through all fold partitions for all percentiles, and the percentile with the best CV score is returned. The purpose of putting 'percentile for loop' as the inner loop is to allow fair competition as we have the same training data (including synthesized data) across all fold partitions for all percentiles.
After determining the percentile, tune the hyperparameters by second grid search
skf = StratifiedKFold(y)
for train_ind, test_ind in skf:
X_train, X_test, y_train, y_test = X[train_ind], X[test_ind], y[train_ind], y[test_ind]
# SMOTE synthesizes the training data (we want to keep test data intact)
X_train, y_train = SMOTE(X_train, y_train)
for parameters in parameter_comb:
# Select the features based on the tuned percentile
selected_ind = Fisher(X_train, y_train, best_percentile)
X_train_selected, X_test_selected = X_train[selected_ind,:], X_test[selected_ind, :]
clf.set_params(svc__C=parameters['C'], svc__gamma=parameters['gamma'])
model = clf.fit(X_train_selected, y_train)
y_predict = model.predict(X_test_selected)
f1 = f1_score(y_predict, y_test)
It is done in the very similar way, except we tune the hyperparamter for SVC rather than percentile of features to select.
My questions are:
In the current solution, I only involve 3. and 4. in the clf and do 1. and 2. kinda "manually" in two nested loop as described above. Is there any way to include all four steps in a pipeline and do the whole process at once?
If it is okay to keep the first nested loop, then is it possible (and how) to simplify the next nested loop using a single pipeline
clf_all = Pipeline([('smote', SMOTE()),
('fisher', Fisher(percentile=best_percentile))
('normal',preprocessing.StandardScaler()),
('svc',svm.SVC(class_weight='auto'))])
and simply use GridSearchCV(clf_all, parameter_comb) for tuning?
Please note that both SMOTE and Fisher (ranking criteria) have to be done only for the training data in each fold partition.
It would be so much appreciated for any comment.
SMOTE and Fisher are shown below:
def Fscore(X, y, percentile=None):
X_pos, X_neg = X[y==1], X[y==0]
X_mean = X.mean(axis=0)
X_pos_mean, X_neg_mean = X_pos.mean(axis=0), X_neg.mean(axis=0)
deno = (1.0/(shape(X_pos)[0]-1))*X_pos.var(axis=0) +(1.0/(shape(X_neg[0]-1))*X_neg.var(axis=0)
num = (X_pos_mean - X_mean)**2 + (X_neg_mean - X_mean)**2
F = num/deno
sort_F = argsort(F)[::-1]
n_feature = (float(percentile)/100)*shape(X)[1]
ind_feature = sort_F[:ceil(n_feature)]
return(ind_feature)
SMOTE is from https://github.com/blacklab/nyan/blob/master/shared_modules/smote.py, it returns the synthesized data. I modified it to return the original input data stacked with the synthesized data along with its labels and synthesized ones.
def smote(X, y):
n_pos = sum(y==1), sum(y==0)
n_syn = (n_neg-n_pos)/float(n_pos)
X_pos = X[y==1]
X_syn = SMOTE(X_pos, int(round(n_syn))*100, 5)
y_syn = np.ones(shape(X_syn)[0])
X, y = np.vstack([X, X_syn]), np.concatenate([y, y_syn])
return(X, y)
scikit created a FunctionTransformer as part of the preprocessing class in version 0.17. It can be used in a similar manner as David's implementation of the class Fisher in the answer above - but with less flexibility. If the input/output of the function is configured properly, the transformer can implement the fit/transform/fit_transform methods for the function and thus allow it to be used in the scikit pipeline.
For example, if the input to a pipeline is a series, the transformer would be as follows:
def trans_func(input_series):
return output_series
from sklearn.preprocessing import FunctionTransformer
transformer = FunctionTransformer(trans_func)
sk_pipe = Pipeline([("trans", transformer), ("vect", tf_1k), ("clf", clf_1k)])
sk_pipe.fit(train.desc, train.tag)
where vect is a tf_idf transformer, clf is a classifier and train is the training dataset. "train.desc" is the series text input to the pipeline.
I don't know where your SMOTE() and Fisher() functions are coming from, but the answer is yes you can definitely do this. In order to do so you will need to write a wrapper class around those functions though. The easiest way to this is inherit sklearn's BaseEstimator and TransformerMixin classes, see this for an example: http://scikit-learn.org/stable/auto_examples/hetero_feature_union.html
If this isn't making sense to you, post the details of at least one of your functions (the library it comes from or your code if you wrote it yourself) and we can go from there.
EDIT:
I apologize, I didn't look at your functions closely enough to realize that they transform your target in addition to your training data (i.e. both X and y). Pipeline does not support transformations to your target so you will have do them prior as you originally were. For your reference, here is what it would look like to write your custom class for your Fisher process which would work if the function itself did not need to affect your target variable.
>>> from sklearn.base import BaseEstimator, TransformerMixin
>>> from sklearn.preprocessing import StandardScaler
>>> from sklearn.svm import SVC
>>> from sklearn.pipeline import Pipeline
>>> from sklearn.grid_search import GridSearchCV
>>> from sklearn.datasets import load_iris
>>>
>>> class Fisher(BaseEstimator, TransformerMixin):
... def __init__(self,percentile=0.95):
... self.percentile = percentile
... def fit(self, X, y):
... from numpy import shape, argsort, ceil
... X_pos, X_neg = X[y==1], X[y==0]
... X_mean = X.mean(axis=0)
... X_pos_mean, X_neg_mean = X_pos.mean(axis=0), X_neg.mean(axis=0)
... deno = (1.0/(shape(X_pos)[0]-1))*X_pos.var(axis=0) + (1.0/(shape(X_neg)[0]-1))*X_neg.var(axis=0)
... num = (X_pos_mean - X_mean)**2 + (X_neg_mean - X_mean)**2
... F = num/deno
... sort_F = argsort(F)[::-1]
... n_feature = (float(self.percentile)/100)*shape(X)[1]
... self.ind_feature = sort_F[:ceil(n_feature)]
... return self
... def transform(self, x):
... return x[self.ind_feature,:]
...
>>>
>>> data = load_iris()
>>>
>>> pipeline = Pipeline([
... ('fisher', Fisher()),
... ('normal',StandardScaler()),
... ('svm',SVC(class_weight='auto'))
... ])
>>>
>>> grid = {
... 'fisher__percentile':[0.75,0.50],
... 'svm__C':[1,2]
... }
>>>
>>> model = GridSearchCV(estimator = pipeline, param_grid=grid, cv=2)
>>> model.fit(data.data,data.target)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/dmcgarry/anaconda/lib/python2.7/site-packages/sklearn/grid_search.py", line 596, in fit
return self._fit(X, y, ParameterGrid(self.param_grid))
File "/Users/dmcgarry/anaconda/lib/python2.7/site-packages/sklearn/grid_search.py", line 378, in _fit
for parameters in parameter_iterable
File "/Users/dmcgarry/anaconda/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 653, in __call__
self.dispatch(function, args, kwargs)
File "/Users/dmcgarry/anaconda/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 400, in dispatch
job = ImmediateApply(func, args, kwargs)
File "/Users/dmcgarry/anaconda/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 138, in __init__
self.results = func(*args, **kwargs)
File "/Users/dmcgarry/anaconda/lib/python2.7/site-packages/sklearn/cross_validation.py", line 1239, in _fit_and_score
estimator.fit(X_train, y_train, **fit_params)
File "/Users/dmcgarry/anaconda/lib/python2.7/site-packages/sklearn/pipeline.py", line 130, in fit
self.steps[-1][-1].fit(Xt, y, **fit_params)
File "/Users/dmcgarry/anaconda/lib/python2.7/site-packages/sklearn/svm/base.py", line 149, in fit
(X.shape[0], y.shape[0]))
ValueError: X and y have incompatible shapes.
X has 1 samples, but y has 75.
You actually can put all of these functions into a single pipeline!
In the accepted answer, #David wrote that your functions
transform your target in addition to your training data (i.e. both X and y). Pipeline does not support transformations to your target so you will have do them prior as you originally were.
It is true that sklearn's pipeline does not support this. However imblearn's pipeline here supports this. The imblearn pipeline is just like that of sklearn but it allows you to call transformations separately on the training and testing data via sample methods. Moreover, these sample methods are actually designed so that you can change both the data X and the labels y. This is important because many times you want to include smote in your pipeline but you want to smote just the training data, not the testing data. And with the imblearn pipeline, you can call smote in the pipeline to transform just X_train and y_train and not X_test and y_test.
So you can create an imblearn pipeline that has a smote sampler, pre-processing step, and svc.
For more details check out this stack overflow post here and machine learning mastery article here.

Resources