How to save/serialize a GLM model as a zip/pickle file? - machine-learning

I built a Tweedie GLM using statsmodels.
How can I save/serialize it as a zip file or pkl file?
I tried:
import pickle
import statsmodels.api as sm
from statsmodels.formula.api import glm
formula4 = "y ~ x1 + C(x2)"
mod4 = glm(formula=formula4, var_weights='one', data=train, family=sm.families.Tweedie())
res4 = mod4.fit()
filename = 'test.pkl'
# Use pickle to save the object to a file:
with open(filename, 'wb') as f:
    pickle.dump(mod4, f)
But the saved pickle file is too large.
Any idea?
--
Answer:
Don't use the formula interface directly when building the model. Use patsy's dmatrices to process the data ahead of time, then save the model; the result is around 10 KB.
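Since the question also asks about a zip file, compressing the pickle is another way to shrink it. The following is a minimal sketch using only the standard library. `save_compressed` and `load_compressed` are hypothetical helper names, and the dictionary is only a stand-in for a fitted model or results object:

```python
import gzip
import os
import pickle
import tempfile


def save_compressed(obj, path):
    """Pickle obj and write it through gzip compression."""
    with gzip.open(path, "wb") as f:
        pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)


def load_compressed(path):
    """Read back an object written by save_compressed."""
    with gzip.open(path, "rb") as f:
        return pickle.load(f)


# Stand-in for a fitted model: any picklable object is handled the same way.
stand_in = {"params": [0.1] * 10000, "family": "Tweedie"}

tmp_dir = tempfile.mkdtemp()
plain_path = os.path.join(tmp_dir, "test.pkl")
gz_path = os.path.join(tmp_dir, "test.pkl.gz")

# Write an uncompressed pickle next to the compressed one for comparison.
with open(plain_path, "wb") as f:
    pickle.dump(stand_in, f, protocol=pickle.HIGHEST_PROTOCOL)
save_compressed(stand_in, gz_path)

restored = load_compressed(gz_path)
print(os.path.getsize(plain_path), os.path.getsize(gz_path))
```

How much space this saves depends on how repetitive the stored data is, so it complements rather than replaces keeping the training data out of the pickled object.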

Related

How to save sentence-Bert output vectors to a file?

I am using BERT to compute the similarity between multi-term words. Here is the code I used for embedding:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('bert-large-uncased-whole-word-masking')
words = [
"Artificial intelligence",
"Data mining",
"Political history",
"Literature book"]
I also have a dataset which contains 540000 other words.
Vocabs = [
"Winter flooding",
"Cholesterol diet", ....]
The problem is that embedding Vocabs into vectors takes forever.
words_embeddings = model.encode(words)
Vocabs_embeddings = model.encode(Vocabs)
Is there any way to make it faster? Alternatively, I would like to embed Vocabs in a for loop and save the output vectors to a file, so that I don't have to embed 540000 vocabs every time I need them. Is there a way to save embeddings to a file and use them again?
I would really appreciate your help.
You can pickle your corpus and embeddings like this. You can also pickle a dictionary instead, or write them to a file in any other format you prefer.
import pickle
with open("my-embeddings.pkl", "wb") as fOut:
    pickle.dump({'sentences': words, 'embeddings': words_embeddings}, fOut)
Or, more generally, as below: encode only when the embeddings don't exist yet, and afterwards load them from file whenever you need them instead of re-encoding your corpus:
import os
import pickle
import numpy as np
if not os.path.exists(embedding_cache_path):
    # read your corpus etc
    corpus_sentences = ...
    print("Encoding the corpus. This might take a while")
    corpus_embeddings = model.encode(corpus_sentences, show_progress_bar=True, convert_to_numpy=True)
    # normalize each embedding to unit length
    corpus_embeddings = corpus_embeddings / np.linalg.norm(corpus_embeddings, axis=1, keepdims=True)
    print("Storing file on disc")
    with open(embedding_cache_path, "wb") as fOut:
        pickle.dump({'sentences': corpus_sentences, 'embeddings': corpus_embeddings}, fOut)
else:
    print("Loading pre-computed embeddings from disc")
    with open(embedding_cache_path, "rb") as fIn:
        cache_data = pickle.load(fIn)
        corpus_sentences = cache_data['sentences']
        corpus_embeddings = cache_data['embeddings']
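The same encode-once, load-afterwards pattern can be wrapped in a helper that also encodes in chunks, which is what the question's "for loop" idea amounts to. This is only a sketch: `load_or_encode` is a hypothetical name, and `fake_encode` is a made-up stand-in for `model.encode` so that the example runs without downloading a model:

```python
import os
import pickle
import tempfile


def load_or_encode(cache_path, sentences, encode_fn, chunk_size=1000):
    """Return embeddings for sentences, encoding them only if no cache exists."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f_in:
            return pickle.load(f_in)["embeddings"]

    embeddings = []
    # Encode chunk by chunk so that a large corpus is processed in pieces.
    for start in range(0, len(sentences), chunk_size):
        chunk = sentences[start:start + chunk_size]
        embeddings.extend(encode_fn(chunk))

    with open(cache_path, "wb") as f_out:
        pickle.dump({"sentences": sentences, "embeddings": embeddings}, f_out)
    return embeddings


calls = []  # records the chunk sizes that actually reached the encoder


def fake_encode(chunk):
    """Hypothetical stand-in for model.encode: one tiny vector per sentence."""
    calls.append(len(chunk))
    return [[float(len(s))] for s in chunk]


vocabs = ["Winter flooding", "Cholesterol diet", "Data mining"]
cache_path = os.path.join(tempfile.mkdtemp(), "vocab-embeddings.pkl")

first = load_or_encode(cache_path, vocabs, fake_encode, chunk_size=2)
second = load_or_encode(cache_path, vocabs, fake_encode, chunk_size=2)  # read from the cache
```

On the second call the encoder is never invoked, because the result is read back from the pickle file.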

Using torchtext for inference

I wonder what the right way to use torchtext for inference is.
Let's assume I've trained the model and dumped all Fields with their built vocabularies. It seems the next step is to use torchtext.data.Example to load a single example. Somehow I should numericalize it using the loaded Fields and create an Iterator.
I would appreciate any simple examples of using torchtext for inference.
For a trained model and vocabulary (the vocabulary is part of the text field, so you don't have to save the whole class):
import os.path
import pickle
import torch
# vocab_file, WEIGHTS, classifier and nlp are assumed to be defined elsewhere;
# nlp is a tokenizer whose tokens expose .text (for example a loaded spaCy pipeline)
def read_vocab(path):
    # read the pickled vocabulary
    with open(path, 'rb') as pkl_file:
        vocab = pickle.load(pkl_file)
    return vocab
def load_model_and_vocab():
    my_path = os.path.abspath(os.path.dirname(__file__))
    vocab_path = os.path.join(my_path, vocab_file)
    weights_path = os.path.join(my_path, WEIGHTS)
    vocab = read_vocab(vocab_path)
    model = classifier(vocab_size=len(vocab))
    model.load_state_dict(torch.load(weights_path))
    model.eval()  # switch to inference mode
    return model, vocab
def predict(model, vocab, sentence):
    tokenized = [w.text.lower() for w in nlp(sentence)]  # tokenize the sentence
    indexed = [vocab.stoi[t] for t in tokenized]  # convert to integer sequence
    length = [len(indexed)]  # compute no. of words
    tensor = torch.LongTensor(indexed).to('cpu')  # convert to tensor
    tensor = tensor.unsqueeze(1).T  # reshape in form of batch, no. of words
    length_tensor = torch.LongTensor(length)  # convert to tensor
    prediction = model(tensor, length_tensor)  # prediction
    return round(1-prediction.item())
"classifier" is the class I defined for my model.
To save the vocabulary as a pkl file:
def save_vocab(vocab):
    import pickle
    with open('vocab.pkl', 'wb') as output:
        pickle.dump(vocab, output)
And to save the model after training, you can use:
torch.save(model.state_dict(), 'saved_weights.pt')
Tell me if it worked for you!
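One detail worth checking in `predict` is what `vocab.stoi[t]` does for a word that never appeared during training. If the vocabulary is stored as a plain dict, such a lookup raises a KeyError, so an explicit fallback is needed. A minimal sketch, assuming the vocabulary contains an unknown-word entry named `<unk>`; `numericalize` and the toy vocabulary are hypothetical and not part of torchtext:

```python
UNK_TOKEN = "<unk>"  # assumed name of the unknown-word entry


def numericalize(tokens, stoi, unk_token=UNK_TOKEN):
    """Map tokens to integer ids, falling back to the unknown-word id."""
    unk_index = stoi[unk_token]
    return [stoi.get(token, unk_index) for token in tokens]


# Toy vocabulary standing in for the one loaded from vocab.pkl.
stoi = {"<unk>": 0, "<pad>": 1, "the": 2, "movie": 3, "was": 4, "great": 5}

tokens = "The movie was surprisingly great".lower().split()
indexed = numericalize(tokens, stoi)
print(indexed)  # [2, 3, 4, 0, 5]
```

The word "surprisingly" is not in the toy vocabulary, so it is mapped to index 0 instead of raising an error.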

how to split datasets into training and test data with sklearn

I'm using the AT&T faces dataset: a main directory that contains 40 sub-directories, each of which contains different images of a specific person. I created a list that contains the sub-directory names. I want to use the data to train a neural network, so I want to split it into 80% training and 20% testing. Here is what I have done so far:
import os
import cv2
path = r"C:\Users\Desktop\att_faces"
directory = []
directory = [x[1] for x in os.walk(path)]
non_empty_dirs = [x for x in directory if x]
directory = [item for subitem in non_empty_dirs for item in subitem]
How should I proceed after this step?
You want to split your data into train and test sets. For that, you can either:
1) Separate train and test into folders, manually or with a script, and load them for training with the help of a data generator.
2) Load the whole dataset and split it into train and test in memory.
Let's discuss the second option.
"main directory contains 40 sub-directories"
Let's assume your main directory is Train// and that it has 40 subfolders named 1-40. I also assume the class label is the folder name.
# imports
import cv2
import numpy as np
import os
from sklearn.model_selection import train_test_split
# seed for reproducibility
SEED = 44000
# lists to store data
data = []
label = []
# folder where data is placed
BASE_FOLDER = 'Train//'
folders = os.listdir(BASE_FOLDER)
# loading data to lists
for folder in folders:
    for file in os.listdir(BASE_FOLDER + folder + '//'):
        img = cv2.imread(BASE_FOLDER + folder + '//' + file)
        # do any pre-processing if needed like resize, sharpen etc.
        # list.append works in place and returns None, so do not reassign its result
        data.append(img)
        label.append(folder)
# now split the data into train and test with the help of train_test_split
train_data, test_data, train_label, test_label = train_test_split(data, label, test_size=0.2, random_state=SEED)
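With only about ten images per person, a purely random 80/20 split does not guarantee that every person ends up in both sets. Splitting within each class avoids that; with scikit-learn, passing stratify=label to train_test_split has a similar effect. The sketch below shows the per-class idea in plain Python on file names only. `split_per_class` and the folder and file names are made up for illustration:

```python
import random


def split_per_class(files_by_class, test_fraction=0.2, seed=44000):
    """Split each class's file list into train and test parts."""
    rng = random.Random(seed)
    train, test = [], []
    for label, files in sorted(files_by_class.items()):
        shuffled = list(files)
        rng.shuffle(shuffled)
        # keep at least one test file per class
        n_test = max(1, round(len(shuffled) * test_fraction))
        test.extend((name, label) for name in shuffled[:n_test])
        train.extend((name, label) for name in shuffled[n_test:])
    return train, test


# Toy listing: 40 folders named s1..s40 with 10 images each (names are made up).
files_by_class = {
    f"s{person}": [f"{image}.pgm" for image in range(1, 11)]
    for person in range(1, 41)
}

train_files, test_files = split_per_class(files_by_class)
print(len(train_files), len(test_files))  # 320 80
```

The resulting (file name, label) pairs can then be copied into separate train and test folders, or loaded directly.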

Keras Text Preprocessing - Saving Tokenizer object to file for scoring

I've trained a sentiment classifier model using the Keras library by (broadly) following the steps below:
1) Convert the text corpus into sequences using the Tokenizer object/class
2) Build a model using the model.fit() method
3) Evaluate this model
Now, for scoring with this model, I was able to save the model to a file and load it from a file. However, I've not found a way to save the Tokenizer object to a file. Without this, I'll have to process the corpus every time I need to score even a single sentence. Is there a way around this?
The most common way is to use either pickle or joblib. Here you have an example on how to use pickle in order to save Tokenizer:
import pickle
# saving
with open('tokenizer.pickle', 'wb') as handle:
    pickle.dump(tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)
# loading
with open('tokenizer.pickle', 'rb') as handle:
    tokenizer = pickle.load(handle)
The Tokenizer class has a function to save data in JSON format:
import io
import json
tokenizer_json = tokenizer.to_json()
with io.open('tokenizer.json', 'w', encoding='utf-8') as f:
    f.write(json.dumps(tokenizer_json, ensure_ascii=False))
The data can be loaded using the tokenizer_from_json function from keras_preprocessing.text:
from keras_preprocessing.text import tokenizer_from_json
with open('tokenizer.json') as f:
    data = json.load(f)
    tokenizer = tokenizer_from_json(data)
The accepted answer clearly demonstrates how to save the tokenizer. The following is a comment on the problem of (generally) scoring after fitting or saving. Suppose that a list texts consists of two lists, Train_text and Test_text, where the set of tokens in Test_text is a subset of the set of tokens in Train_text (an optimistic assumption). Then calling fit_on_texts(Train_text) gives different results for texts_to_sequences(Test_text) than first calling fit_on_texts(texts) and then texts_to_sequences(Test_text).
Concrete Example:
from keras.preprocessing.text import Tokenizer
docs = ["A heart that",
"full up like",
"a landfill",
"no surprises",
"and no alarms"
"a job that slowly"
"Bruises that",
"You look so",
"tired happy",
"no alarms",
"and no surprises"]
docs_train = docs[:7]
docs_test = docs[7:]
# EXPERIMENT 1: FIT TOKENIZER ONLY ON TRAIN
T_1 = Tokenizer()
T_1.fit_on_texts(docs_train) # only train set
encoded_train_1 = T_1.texts_to_sequences(docs_train)
encoded_test_1 = T_1.texts_to_sequences(docs_test)
print("result for test 1:\n%s" %(encoded_test_1,))
# EXPERIMENT 2: FIT TOKENIZER ON BOTH TRAIN + TEST
T_2 = Tokenizer()
T_2.fit_on_texts(docs) # both train and test set
encoded_train_2 = T_2.texts_to_sequences(docs_train)
encoded_test_2 = T_2.texts_to_sequences(docs_test)
print("result for test 2:\n%s" %(encoded_test_2,))
Results:
result for test 1:
[[3], [10, 3, 9]]
result for test 2:
[[1, 19], [5, 1, 4]]
Of course, if the above optimistic assumption is not satisfied and the set of tokens in Test_text is disjoint from that of Train_text, then test 1 results in a list of empty brackets [].
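The results differ because the tokenizer assigns word indices by frequency rank over whatever texts it was fitted on, so adding the test texts changes the ranking. A simplified plain-Python stand-in illustrates the effect. It is not the Keras implementation, and `fit_word_index` and `to_sequences` are hypothetical names:

```python
from collections import Counter


def fit_word_index(texts):
    """Rank words by frequency: the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}


def to_sequences(texts, word_index):
    """Encode texts, silently dropping words that are missing from the index."""
    return [[word_index[w] for w in text.lower().split() if w in word_index]
            for text in texts]


train = ["a b", "a c"]
test = ["c c c", "d"]

index_train = fit_word_index(train)       # fitted on train only
index_all = fit_word_index(train + test)  # fitted on train + test

print(to_sequences(test, index_train))  # [[3, 3, 3], []]
print(to_sequences(test, index_all))    # [[1, 1, 1], [4]]
```

This is why the tokenizer fitted for training should be saved and reused at scoring time rather than refitted on new text.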
I've created the issue https://github.com/keras-team/keras/issues/9289 in the keras Repo. Until the API is changed, the issue has a link to a gist that has code to demonstrate how to save and restore a tokenizer without having the original documents the tokenizer was fit on. I prefer to store all my model information in a JSON file (because reasons, but mainly mixed JS/Python environment), and this will allow for that, even with sort_keys=True
I found the following snippet, provided by @thusv89.
Save objects:
import pickle
with open('data_objects.pickle', 'wb') as handle:
    pickle.dump(
        {'input_tensor': input_tensor,
         'target_tensor': target_tensor,
         'inp_lang': inp_lang,
         'targ_lang': targ_lang,
         }, handle, protocol=pickle.HIGHEST_PROTOCOL)
Load objects:
# load from the same file name that was used for saving
with open('data_objects.pickle', 'rb') as f:
    data = pickle.load(f)
    input_tensor = data['input_tensor']
    target_tensor = data['target_tensor']
    inp_lang = data['inp_lang']
    targ_lang = data['targ_lang']
This is quite easy, because the Tokenizer class provides two functions for saving and loading:
save —— Tokenizer.to_json()
load —— keras.preprocessing.text.tokenizer_from_json
The to_json() method calls the get_config method, which handles this:
json_word_counts = json.dumps(self.word_counts)
json_word_docs = json.dumps(self.word_docs)
json_index_docs = json.dumps(self.index_docs)
json_word_index = json.dumps(self.word_index)
json_index_word = json.dumps(self.index_word)
return {
    'num_words': self.num_words,
    'filters': self.filters,
    'lower': self.lower,
    'split': self.split,
    'char_level': self.char_level,
    'oov_token': self.oov_token,
    'document_count': self.document_count,
    'word_counts': json_word_counts,
    'word_docs': json_word_docs,
    'index_docs': json_index_docs,
    'index_word': json_index_word,
    'word_index': json_word_index
}

Save a meta-model for future use

I am using OpenMDAO to construct a co-kriging metamodel that I would like to export and then import in another Python script.
I've found a message on the old forum (http://openmdao.org/forum/questions/444/how-can-i-save-the-metamodel-for-later-use?sort=votes) in which someone used pickle to save a meta-model.
I have also read about the recorders; however, I didn't succeed in the different tests I performed.
Is there a way to save the meta-model and use it in future code?
EDIT: I think I found a kind of solution using 'pickle'. I succeeded in doing this with a kriging meta-model, but I assume it would work the same with co-kriging.
As in the post on the 'old' OpenMDAO forum, I saved the trained meta-model to a file and then reused it in another Python script. Here is the part of the code that saves the trained meta-model:
cok = MultiFiCoKrigingSurrogate()
prob = Problem(Simulation(cok, nfi=2))
prob.setup(check=False)
prob['mm.train:x1'] = DATA_HF_dim
prob['mm.train:x1_fi2'] = DATA_LF_dim
prob['mm.train:y1'] = rastri_e
prob['mm.train:y1_fi2'] = rastri_c
prob.run()
import pickle
# the with block closes the file; the original f.close lacked the call parentheses
with open('meta_model_info.p', 'wb') as f:
    pickle.dump(prob, f)
Once the trained meta-model is saved in the file meta_model_info.p, I can load it in another script, skipping the learning phase. Part of the code of the second script is here:
import pickle
import numpy as np
class Simulation(Group):
    def __init__(self, surrogate, nfi):
        super(Simulation, self).__init__()
        self.surrogate = surrogate
        mm = self.add("mm", MultiFiMetaModel(nfi=nfi))
        mm.add_param('x1', val=0.)
        mm.add_output('y1', val=(0.,0.), surrogate=surrogate)
cok = MultiFiCoKrigingSurrogate()
prob = Problem(Simulation(cok, nfi=2))
prob.setup(check=False)
# load the trained problem saved by the first script
with open('meta_model_info.p', 'rb') as f:
    clf = pickle.load(f)
pred_cok_clf = []
for x in inputs:
    clf['mm.x1'] = x
    clf.run()
    pred_cok_clf.append(clf['mm.y1'])
pred_mu_clf = np.array([float(p[0]) for p in pred_cok_clf])
pred_sigma_clf = np.array([float(p[1]) for p in pred_cok_clf])
However, I was forced to redefine the problem class and set up the problem again in this second script.
I don't know whether this is a proper use of 'pickle' or whether there is another way to do this. Any suggestions are welcome :)
There is not currently any provision for saving and reloading the surrogate model. You have two options:
1) Save off the training data, then import and re-train the model in your other script. You can call the fit and predict methods of the surrogate model directly for this by importing them from the library.
2) If you want to skip the cost of re-training each time, then you need to modify the surrogate model itself to save off the result of the fitting process, then re-load it into a new instance later: https://github.com/OpenMDAO/OpenMDAO/blob/c69e00f6f9eeb617863e782246e2e7ed1fe9e019/openmdao/surrogate_models/multifi_cokriging.py#L322
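The first option can be sketched as follows. This is not OpenMDAO code: `NearestNeighborSurrogate` is a toy class invented for illustration, with fit and predict methods standing in for those of a real surrogate, and the training values are made up:

```python
import json
import os
import tempfile


class NearestNeighborSurrogate:
    """Toy surrogate: predicts the y of the closest training x."""

    def fit(self, x, y):
        self._points = list(zip(x, y))

    def predict(self, x_new):
        return min(self._points, key=lambda point: abs(point[0] - x_new))[1]


path = os.path.join(tempfile.mkdtemp(), "training_data.json")

# Script 1: store the training data instead of the trained object.
with open(path, "w") as f:
    json.dump({"x": [0.0, 1.0, 2.0], "y": [0.0, 1.0, 4.0]}, f)

# Script 2: reload the data and re-train a fresh surrogate.
with open(path) as f:
    data = json.load(f)
surrogate = NearestNeighborSurrogate()
surrogate.fit(data["x"], data["y"])
print(surrogate.predict(1.2))  # 1.0
```

The second script only needs the surrogate class and the data file, at the price of paying the training cost again on every load.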
