Save a meta-model for future use - save

I am using openMDAO to construct a co-kriging metamodel that I would like to export and then import in another python code.
I've found a message on the old forum (http://openmdao.org/forum/questions/444/how-can-i-save-the-metamodel-for-later-use?sort=votes) in which someone used pickle to save a meta-model.
I have also read about the recorders however I didn't succeed in the different tests I performed.
Is there a way to save the meta-model and use it in a future code?
EDIT: I think I found a kind of solution using 'pickle'. I succeded to do this with a kriging meta-model but i assume I would work the same with the co-kriging.
Like in the post on the 'old' forum of openMDAO, I saved the trained meta-model in a file and then reuse it in another python script. I joined here the part of the code saving the trained meta-model:
cok = MultiFiCoKrigingSurrogate()
prob = Problem(Simulation(cok, nfi=2))
prob.setup(check=False)
prob['mm.train:x1'] = DATA_HF_dim
prob['mm.train:x1_fi2'] = DATA_LF_dim
prob['mm.train:y1'] = rastri_e
prob['mm.train:y1_fi2'] = rastri_c
prob.run()
import pickle
f = open('meta_model_info.p','wb')
pickle.dump(prob,f)
f.close
Once the trained meta-model is saved in the file meta_model_info.p, I can load it in another script, skipping the learning phase. Part of the code of the second script is here:
class Simulation(Group):
def __init__(self, surrogate, nfi):
super(Simulation, self).__init__()
self.surrogate = surrogate
mm = self.add("mm", MultiFiMetaModel(nfi=nfi))
mm.add_param('x1', val=0.)
mm.add_output('y1', val=(0.,0.), surrogate=surrogate)
cok = MultiFiCoKrigingSurrogate()
prob = Problem(Simulation(cok, nfi=2))
prob.setup(check=False)
import pickle
f = open('meta_model_info.p','rb')
clf = pickle.load(f)
pred_cok_clf = []
for x in inputs:
clf['mm.x1'] = x
clf.run()
pred_cok_clf.append(clf['mm.y1'])
pred_mu_clf = np.array([float(p[0]) for p in pred_cok_clf])
pred_sigma_clf = np.array([float(p[1]) for p in pred_cok_clf])
However I was forced to redefine the class of the problem and to setup the problem either in this second script.
I don't know if it is a proper use of 'pickle' or if there is another way to do this, if you have any suggestion :)

There is not currently any provision for saving and reloading the surrogate model. You have two options:
1) Save off the training data, then import and re-train the model in your other script. You can call the fit and predict methods of the surrogate model directly for this by importing them from the library.
2) If you want to skip the cost of re-training each time, then you need to modify the surrogate model itself to save off the result of the fitting process, then re-load it into a new instance later: https://github.com/OpenMDAO/OpenMDAO/blob/c69e00f6f9eeb617863e782246e2e7ed1fe9e019/openmdao/surrogate_models/multifi_cokriging.py#L322

Related

Integrate the ImageDataGenerator in own customized fit_generator

I want to fit a Siamese CNN with multiple inputs that are stored in my memory and no label (just an arbitrary dummy label). Therefore, I had to write my own data_generator function for using a CNN model in Keras.
My data generator is of the following form
class DataGenerator(keras.utils.Sequence):
def __init__(self, train_data, train_triplets, batch_size=32, dim=(128,128), n_channels=3, shuffle=True):
self.dim = dim
self.batch_size = batch_size
#Added
self.train_data = train_data
self.train_triplets = train_triplets
self.n_channels = n_channels
self.shuffle = shuffle
self.on_epoch_end()
def __len__(self):
'Denotes the number of batches per epoch'
n_row = self.train_triplets.shape[0]
return int(np.floor(n_row / self.batch_size))
def __getitem__(self, index):
'Generate one batch of data'
# Generate indexes of the batch
#print(index)
indexes = self.indexes[index*self.batch_size:(index+1)*self.batch_size]
# Find list of IDs
list_IDs_temp = self.train_triplets.iloc[indexes,]
# Generate data
[anchor, positive, negative] = self.__data_generation(list_IDs_temp)
y_train = np.random.randint(2, size=(1,2,self.batch_size)).T
return [anchor,positive, negative], y_train
def on_epoch_end(self):
'Updates indexes after each epoch'
n_row = self.train_triplets.shape[0]
self.indexes = np.arange(n_row)
if self.shuffle == True:
np.random.shuffle(self.indexes)
def __data_generation(self, list_IDs_temp):
'Generates data containing batch_size samples'
# anchor positive and negatives: (n_samples, *dim, n_channels)
# Initialization
anchor = np.zeros((self.batch_size,*self.dim,self.n_channels))
positive = np.zeros((self.batch_size,*self.dim,self.n_channels))
negative = np.zeros((self.batch_size,*self.dim,self.n_channels))
nrow_temp = list_IDs_temp.shape[0]
# Generate data
for i in range(nrow_temp):
list_ind = list_IDs_temp.iloc[i,]
anchor[i] = self.train_data[list_ind[0]]
positive[i] = self.train_data[list_ind[1]]
negative[i] = self.train_data[list_ind[2]]
return [anchor, positive, negative]
where train_data is a list of all images and train triplets a data frame containing image indices to create my inputs containing of a triplet of images.
Now, I want to do some data augmenting for each mini batch supplied to my CNN. I have tried to integrate the ImageDataGenarator of Keras but I couldn't implement it in my code. Is it somehow possible to do it ? I am not very experienced with python and would appreciate any help.
Does this article answer your question?
To put it in a nutshell, Kera's ImageDataGenerator lacks flexibility when it comes to personalized batch generators, and the easiest way to still use data augmentation is simply to switch to another data augmentation tool (like the albumentations library described in the previous article, but you could also use imgaug as well).
I just want to warn you that I encountered several issues with albumentations (that I described in this question on GitHub, but for now I still have had no answers), so maybe using imgaug is a better idea.
Hope this helps, good luck with your model !

Considering one column more important than others

In case of 3 columns data, (In my test case) I can see that all the columns are valued as equal.
random_forest.feature_importances_
array([0.3131602 , 0.31915436, 0.36768544])
Is there any way to add waitage to one of the columns?
Update:
I guess xgboost can be used in this case.
I tried, but getting this error:
import xgboost as xgb
param = {}
num_round = 2
dtrain = xgb.DMatrix(X, y)
dtest = xgb.DMatrix(x_test_split)
dtrain_split = xgb.DMatrix(X_train, label=y_train)
dtest_split = xgb.DMatrix(X_test)
gbdt = xgb.train(param, dtrain_split, num_round)
y_predicted = gbdt.predict(dtest_split)
rmse_pred_vs_actual = xgb.rmse(y_predicted, y_test)
AttributeError: module 'xgboost' has no attribute 'rmse'
Error is by assuming xgb has method "rmse":
rmse_pred_vs_actual = xgb.rmse(y_predicted, y_test)
It is literally written: AttributeError: module 'xgboost' has no attribute 'rmse'
Use sklearn.metrics.mean_squared_error
By:
from sklearn.metrics import mean_squared_error
# Your code
rmse_pred_vs_actual = mean_squared_error(y_test, y_predicted)
It'll fix your error but it still doesn't control a feature importance.
Now, if you really want to change the importance of a feature, you need to be creative about how to make a change like this. There is no text book solution that I know of and no method in xgboost that I know of. You can follow the link Stev posted in a comment to your question and maybe get some ideas (including changing your ML algorithm).

Generating words from trained RNN model: "Variable already exists, disallowed. Did you mean to set reuse=True in VarScope? "

So I implemented a RNN word generator model in jupytor notebook.
When I was trying to use the trained model to generate some words:
with open(os.path.join(cfgs['save_dir'], 'config.pkl'), 'rb') as f:
saved_args = cPickle.load(f)
with open(os.path.join(cfgs['save_dir'], 'words_vocab.pkl'), 'rb') as f:
words, vocab = cPickle.load(f)
with tf.Session() as sess:
model = Model(saved_args, True)
tf.global_variables_initializer().run()
saver = tf.train.Saver(tf.global_variables())
ckpt = tf.train.get_checkpoint_state(cfgs['save_dir'])
if ckpt and ckpt.model_checkpoint_path:
saver.restore(sess, ckpt.model_checkpoint_path)
print(model.sample(sess, words, vocab, cfgs['n'], cfgs['prime'], cfgs['sample'], cfgs['pick'], cfgs['width']))
It works for the first time, but if I run the code again there is an error:
ValueError: Variable rnnlm/softmax_w already exists, disallowed. Did you mean to set reuse=True in VarScope?
Right now I have to shut down the ipynb file then run the code to get a new sample.
How to change the code to avoid this situation?
You can call the model.sample function multiple times without a problem but everything else (creating the session, constructing the Model, loading the checkpoint) should only be run once. If you refactor your code then you won't see that error message anymore.

Keras Text Preprocessing - Saving Tokenizer object to file for scoring

I've trained a sentiment classifier model using Keras library by following the below steps(broadly).
Convert Text corpus into sequences using Tokenizer object/class
Build a model using the model.fit() method
Evaluate this model
Now for scoring using this model, I was able to save the model to a file and load from a file. However I've not found a way to save the Tokenizer object to file. Without this I'll have to process the corpus every time I need to score even a single sentence. Is there a way around this?
The most common way is to use either pickle or joblib. Here you have an example on how to use pickle in order to save Tokenizer:
import pickle
# saving
with open('tokenizer.pickle', 'wb') as handle:
pickle.dump(tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)
# loading
with open('tokenizer.pickle', 'rb') as handle:
tokenizer = pickle.load(handle)
Tokenizer class has a function to save date into JSON format:
tokenizer_json = tokenizer.to_json()
with io.open('tokenizer.json', 'w', encoding='utf-8') as f:
f.write(json.dumps(tokenizer_json, ensure_ascii=False))
The data can be loaded using tokenizer_from_json function from keras_preprocessing.text:
with open('tokenizer.json') as f:
data = json.load(f)
tokenizer = tokenizer_from_json(data)
The accepted answer clearly demonstrates how to save the tokenizer. The following is a comment on the problem of (generally) scoring after fitting or saving. Suppose that a list texts is comprised of two lists Train_text and Test_text, where the set of tokens in Test_text is a subset of the set of tokens in Train_text (an optimistic assumption). Then fit_on_texts(Train_text) gives different results for texts_to_sequences(Test_text) as compared with first calling fit_on_texts(texts) and then text_to_sequences(Test_text).
Concrete Example:
from keras.preprocessing.text import Tokenizer
docs = ["A heart that",
"full up like",
"a landfill",
"no surprises",
"and no alarms"
"a job that slowly"
"Bruises that",
"You look so",
"tired happy",
"no alarms",
"and no surprises"]
docs_train = docs[:7]
docs_test = docs[7:]
# EXPERIMENT 1: FIT TOKENIZER ONLY ON TRAIN
T_1 = Tokenizer()
T_1.fit_on_texts(docs_train) # only train set
encoded_train_1 = T_1.texts_to_sequences(docs_train)
encoded_test_1 = T_1.texts_to_sequences(docs_test)
print("result for test 1:\n%s" %(encoded_test_1,))
# EXPERIMENT 2: FIT TOKENIZER ON BOTH TRAIN + TEST
T_2 = Tokenizer()
T_2.fit_on_texts(docs) # both train and test set
encoded_train_2 = T_2.texts_to_sequences(docs_train)
encoded_test_2 = T_2.texts_to_sequences(docs_test)
print("result for test 2:\n%s" %(encoded_test_2,))
Results:
result for test 1:
[[3], [10, 3, 9]]
result for test 2:
[[1, 19], [5, 1, 4]]
Of course, if the above optimistic assumption is not satisfied and the set of tokens in Test_text is disjoint from that of Train_test, then test 1 results in a list of empty brackets [].
I've created the issue https://github.com/keras-team/keras/issues/9289 in the keras Repo. Until the API is changed, the issue has a link to a gist that has code to demonstrate how to save and restore a tokenizer without having the original documents the tokenizer was fit on. I prefer to store all my model information in a JSON file (because reasons, but mainly mixed JS/Python environment), and this will allow for that, even with sort_keys=True
I found the following snippet provided at following link by #thusv89.
Save objects:
import pickle
with open('data_objects.pickle', 'wb') as handle:
pickle.dump(
{'input_tensor': input_tensor,
'target_tensor': target_tensor,
'inp_lang': inp_lang,
'targ_lang': targ_lang,
}, handle, protocol=pickle.HIGHEST_PROTOCOL)
Load objects:
with open("dataset_fr_en.pickle", 'rb') as f:
data = pickle.load(f)
input_tensor = data['input_tensor']
target_tensor = data['target_tensor']
inp_lang = data['inp_lang']
targ_lang = data['targ_lang']
Quite easy, because Tokenizer class has provided two funtions for save and load:
save —— Tokenizer.to_json()
load —— keras.preprocessing.text.tokenizer_from_json
In to_json() method,it call "get_config" method which handle this:
json_word_counts = json.dumps(self.word_counts)
json_word_docs = json.dumps(self.word_docs)
json_index_docs = json.dumps(self.index_docs)
json_word_index = json.dumps(self.word_index)
json_index_word = json.dumps(self.index_word)
return {
'num_words': self.num_words,
'filters': self.filters,
'lower': self.lower,
'split': self.split,
'char_level': self.char_level,
'oov_token': self.oov_token,
'document_count': self.document_count,
'word_counts': json_word_counts,
'word_docs': json_word_docs,
'index_docs': json_index_docs,
'index_word': json_index_word,
'word_index': json_word_index
}

Why the following partial fit is not working property?

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
Hello I have the following list of comments:
comments = ['I am very agry','this is not interesting','I am very happy']
These are the corresponding labels:
sents = ['angry','indiferent','happy']
I am using tfidf to vectorize these comments as follows:
tfidf_vectorizer = TfidfVectorizer(analyzer='word')
tfidf = tfidf_vectorizer.fit_transform(comments)
from sklearn import preprocessing
I am using label encoder to vectorize the labels:
le = preprocessing.LabelEncoder()
le.fit(sents)
labels = le.transform(sents)
print(labels.shape)
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.model_selection import train_test_split
with open('tfidf.pickle','wb') as idxf:
pickle.dump(tfidf, idxf, pickle.HIGHEST_PROTOCOL)
with open('tfidf_vectorizer.pickle','wb') as idxf:
pickle.dump(tfidf_vectorizer, idxf, pickle.HIGHEST_PROTOCOL)
Here I am using passive aggressive to fit the model:
clf2 = PassiveAggressiveClassifier()
with open('passive.pickle','wb') as idxf:
pickle.dump(clf2, idxf, pickle.HIGHEST_PROTOCOL)
with open('passive.pickle', 'rb') as infile:
clf2 = pickle.load(infile)
with open('tfidf_vectorizer.pickle', 'rb') as infile:
tfidf_vectorizer = pickle.load(infile)
with open('tfidf.pickle', 'rb') as infile:
tfidf = pickle.load(infile)
Here I am trying to test the usage of partial fit as follows with three new comments and their corresponding labels:
new_comments = ['I love the life','I hate you','this is not important']
new_labels = [1,0,2]
vec_new_comments = tfidf_vectorizer.transform(new_comments)
print(clf2.predict(vec_new_comments))
clf2.partial_fit(vec_new_comments, new_labels)
The problem is that I am not getting the right results after the partial fit as follows:
print('AFTER THIS UPDATE THE RESULT SHOULD BE 1,0,2??')
print(clf2.predict(vec_new_comments))
however I am getting this output:
[2 2 2]
So I really appreciate support to find, why the model is not being updated if I am testing it with the same examples that it has used to be trained the desired output should be:
[1,0,2]
I would like to appreciate support to ajust maybe the hyperparameters to see the desired output.
this is the complete code, to show the partial fit:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
import sys
from sklearn.metrics.pairwise import cosine_similarity
import random
comments = ['I am very agry','this is not interesting','I am very happy']
sents = ['angry','indiferent','happy']
tfidf_vectorizer = TfidfVectorizer(analyzer='word')
tfidf = tfidf_vectorizer.fit_transform(comments)
#print(tfidf.shape)
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(sents)
labels = le.transform(sents)
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.model_selection import train_test_split
with open('tfidf.pickle','wb') as idxf:
pickle.dump(tfidf, idxf, pickle.HIGHEST_PROTOCOL)
with open('tfidf_vectorizer.pickle','wb') as idxf:
pickle.dump(tfidf_vectorizer, idxf, pickle.HIGHEST_PROTOCOL)
clf2 = PassiveAggressiveClassifier()
clf2.fit(tfidf, labels)
with open('passive.pickle','wb') as idxf:
pickle.dump(clf2, idxf, pickle.HIGHEST_PROTOCOL)
with open('passive.pickle', 'rb') as infile:
clf2 = pickle.load(infile)
with open('tfidf_vectorizer.pickle', 'rb') as infile:
tfidf_vectorizer = pickle.load(infile)
with open('tfidf.pickle', 'rb') as infile:
tfidf = pickle.load(infile)
new_comments = ['I love the life','I hate you','this is not important']
new_labels = [1,0,2]
vec_new_comments = tfidf_vectorizer.transform(new_comments)
clf2.partial_fit(vec_new_comments, new_labels)
print('AFTER THIS UPDATE THE RESULT SHOULD BE 1,0,2??')
print(clf2.predict(vec_new_comments))
However I got:
AFTER THIS UPDATE THE RESULT SHOULD BE 1,0,2??
[2 2 2]
Well there are multiple problems with your code. I will start by stating the obvious ones to more complex ones:
You are pickling the clf2 before it has learnt anything. (ie. you pickle it as soon as it is defined, it doesnt serve any purpose). If you are only testing, then fine. Otherwise they should be pickled after the fit() or equivalent calls.
You are calling clf2.fit() before the clf2.partial_fit(). This defeats the whole purpose of partial_fit(). When you call fit(), you essentially fix the classes (labels) that the model will learn about. In your case it is acceptable, because on your subsequent call to partial_fit() you are giving the same labels. But still it is not a good practice.
See this for more details
In a partial_fit() scenario, dont call the fit() ever. Always call the partial_fit() with your starting data and new coming data. But make sure that you supply all the labels you want the model to learn in the first call to parital_fit() in a parameter classes.
Now the last part, about your tfidf_vectorizer. You call fit_transform()(which is essentially fit() and then transformed() combined) on tfidf_vectorizer with comments array. That means that it on subsequent calls to transform() (as you did in transform(new_comments)), it will not learn new words from new_comments, but only use the words which it saw during the call to fit()(words present in comments).
Same goes for LabelEncoder and sents.
This again is not prefereble in a online learning scenario. You should fit all the available data at once. But since you are trying to use the partial_fit(), we assume that you have very large dataset which may not fit in memory at once. So you would want to apply some sort of partial_fit to TfidfVectorizer as well. But TfidfVectorizer doesnt support partial_fit(). In fact its not made for large data. So you need to change your approach. See the following questions for more details:-
Updating the feature names into scikit TFIdfVectorizer
How can i reduce memory usage of Scikit-Learn Vectorizers?
All things aside, if you change just the tfidf part of fitting the whole data (comments and new_comments at once), you will get your desired results.
See the below code changes (I may have organized it a bit and renamed vec_new_comments to new_tfidf, please go through it with attention):
comments = ['I am very agry','this is not interesting','I am very happy']
sents = ['angry','indiferent','happy']
new_comments = ['I love the life','I hate you','this is not important']
new_sents = ['happy','angry','indiferent']
tfidf_vectorizer = TfidfVectorizer(analyzer='word')
le = preprocessing.LabelEncoder()
# The below lines are important
# I have given the whole data to fit in tfidf_vectorizer
tfidf_vectorizer.fit(comments + new_comments)
# same for `sents`, but since the labels dont change, it doesnt matter which you use, because it will be same
# le.fit(sents)
le.fit(sents + new_sents)
Below is the Not so preferred code (which you are using, and about which I talked in point 2), but results are good as long as you make the above changes.
tfidf = tfidf_vectorizer.transform(comments)
labels = le.transform(sents)
clf2.fit(tfidf, labels)
print(clf2.predict(tfidf))
# [0 2 1]
new_tfidf = tfidf_vectorizer.transform(new_comments)
new_labels = le.transform(new_sents)
clf2.partial_fit(new_tfidf, new_labels)
print(clf2.predict(new_tfidf))
# [1 0 2] As you wanted
Correct approach, or the way partial_fit() is intended to be used:
# Declare all labels that you want the model to learn
# Using classes learnt by labelEncoder for this
# In any calls to `partial_fit()`, all labels should be from this array only
all_classes = le.transform(le.classes_)
# Notice the parameter classes here
# It needs to present first time
clf2.partial_fit(tfidf, labels, classes=all_classes)
print(clf2.predict(tfidf))
# [0 2 1]
# classes is not present here
clf2.partial_fit(new_tfidf, new_labels)
print(clf2.predict(new_tfidf))
# [1 0 2]

Resources