Saving an sklearn.svm.SVR model as JSON instead of pickling - machine-learning

I have a trained SVR model which needs to be saved in a JSON format instead of pickling.
The idea behind JSONifying the trained model is to simply capture the state of the weights and other 'fitted' attributes. Then, I can set these attributes later to make predictions. Here is an implementation of it I did:
import json
import numpy as np
from sklearn.svm import SVR
# assume SVR has been trained
regressor = SVR()
regressor.fit(x_train, y_train)
# saving the regressor params in a JSON file for later retrieval
with open('saved_regressor_params.json', 'w', encoding='utf-8') as outfile:
    json.dump(regressor.get_params(), outfile)
# finding the fitted attributes of SVR()
# if an attribute is trailed by '_', it's a fitted attribute
attrs = [i for i in dir(regressor) if i.endswith('_') and not i.endswith('__')]
remove_list = ['coef_', '_repr_html_', '_repr_mimebundle_']  # unnecessary attributes
for attr in remove_list:
    if attr in attrs:
        attrs.remove(attr)
# convert NumPy arrays to lists and save trained attribute values into JSON file
attr_dict = {i: getattr(regressor, i) for i in attrs}
for k in attr_dict:
    if isinstance(attr_dict[k], np.ndarray):
        attr_dict[k] = attr_dict[k].tolist()
# dump JSON for prediction
with open('saved_regressor.json', 'w', encoding='utf-8') as outfile:
    json.dump(attr_dict,
              outfile,
              separators=(',', ':'),
              sort_keys=True,
              indent=4)
This would create two separate json files. One file called saved_regressor_params.json which saves certain required parameters for SVR and another is called saved_regressor.json which stores attributes and their trained values as objects. Example (saved_regressor.json):
{
    "_dual_coef_":[
        [
            -1.0,
            -1.0,
            -1.0
        ]
    ],
    "_intercept_":[
        1.323423423
    ],
    ...
    ...
    "_n_support_":[
        3
    ]
}
Later, I can create a new SVR() model and simply set these parameters and attributes into it by calling them from the existing JSON files we just created. Then, call in the predict() method to predict. Like so (in a new file):
import codecs
import json
import numpy as np
from sklearn.svm import SVR
predict_svr = SVR()
# load the JSON from the files
obj_text = codecs.open('saved_regressor_params.json', 'r', encoding='utf-8').read()
params = json.loads(obj_text)
obj_text = codecs.open('saved_regressor.json', 'r', encoding='utf-8').read()
attributes = json.loads(obj_text)
# setting params
predict_svr.set_params(**params)
# setting attributes
for k in attributes:
    if isinstance(attributes[k], list):
        setattr(predict_svr, k, np.array(attributes[k]))
    else:
        setattr(predict_svr, k, attributes[k])
predict_svr.predict(...)
However, during this process, one attribute, n_support_, cannot be set for some reason. And even if I skip the n_support_ attribute, other errors follow. (Is my logic wrong, or am I missing something here?)
Therefore, I am looking for different ways or ingenious methods to save an SVR model into JSON.
I have tried the existing third party helper libraries like: sklearn_json. These libraries tend to export perfectly for linear models but not for support vectors.

Making a reproducible example missing in the OP, based on the docs (version 1.1.2)
from sklearn.svm import SVR
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
import numpy as np
n_samples, n_features = 10, 5
rng = np.random.RandomState(0)
y = rng.randn(n_samples)
X = rng.randn(n_samples, n_features)
regressor = SVR(C=1.0, epsilon=0.2)
regressor.fit(X, y)
Then a sketch of a JSON serialization/deserialization
import json
# serialize
serialized = json.dumps({
    k: v.tolist() if isinstance(v, np.ndarray) else v
    for k, v in regressor.__dict__.items()
})
# deserialize
regressor2 = SVR()
regressor2.__dict__ = {
    k: np.asarray(v) if isinstance(v, list) else v
    for k, v in json.loads(serialized).items()
}
# test
assert np.all(regressor.predict(X) == regressor2.predict(X))
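A likely reason the attribute-by-attribute approach in the question fails on n_support_ while restoring __dict__ works: in recent scikit-learn versions n_support_ appears to be a read-only property computed from a private attribute, so setattr has nothing to write to. The sketch below reproduces that effect with a hypothetical stand-in class (FakeSVR is not part of scikit-learn), assuming this is indeed the cause:

```python
class FakeSVR:
    """Hypothetical stand-in for an estimator; not a scikit-learn class."""

    def fit(self):
        # the fitted state lives in a private attribute
        self._n_support = [3]
        return self

    @property
    def n_support_(self):
        # read-only view: there is no setter
        return list(self._n_support)


trained = FakeSVR().fit()

# Setting the public name fails because the property has no setter
restored = FakeSVR()
try:
    setattr(restored, "n_support_", [3])
    setattr_failed = False
except AttributeError:
    setattr_failed = True

# Copying the instance dictionary restores the private state instead
restored.__dict__ = dict(trained.__dict__)
print(setattr_failed, restored.n_support_)
```

This is also why filtering dir() for names ending in '_' picks up more than is needed: the instance dictionary already holds everything the properties are computed from.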
EDIT: Serialization preserving data type
A not so elegant solution to address the first issue mentioned in a comment is to save the data type together with the data.
import json
# serialize
serialized = json.dumps({
    k: [v.tolist(), 'np.ndarray', str(v.dtype)] if isinstance(v, np.ndarray) else v
    for k, v in regressor.__dict__.items()
})
# deserialize
regressor2 = SVR()
regressor2.__dict__ = {
    k: np.asarray(v[0], dtype=v[2]) if isinstance(v, list) and v[1] == 'np.ndarray' else v
    for k, v in json.loads(serialized).items()
}
# test
assert np.all(regressor.predict(X) == regressor2.predict(X))
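A possible variation on the same idea, sketched under assumptions: instead of a positional three-element list, each array is wrapped in a tagged dict and rebuilt through the object_hook parameter of json.loads, so ordinary lists (including short or empty ones) are never mistaken for arrays. The marker key "__ndarray__" and the helper names are made up for this sketch, and a plain dict stands in for regressor.__dict__:

```python
import json

import numpy as np

ARRAY_TAG = "__ndarray__"  # hypothetical marker key, chosen for this sketch


def encode_value(value):
    """Wrap arrays in a tagged dict that records the dtype."""
    if isinstance(value, np.ndarray):
        return {ARRAY_TAG: value.tolist(), "dtype": str(value.dtype)}
    return value


def decode_object(obj):
    """object_hook: rebuild an array from a tagged dict, pass others through."""
    if ARRAY_TAG in obj:
        return np.asarray(obj[ARRAY_TAG], dtype=obj["dtype"])
    return obj


# a plain dict standing in for regressor.__dict__ (values are made up)
state = {
    "support_": np.array([0, 2, 5], dtype=np.int32),
    "dual_coef_": np.array([[-1.0, 0.5, 0.5]]),
    "shape_fit_": [10, 5],  # a plain list stays a plain list
    "kernel": "rbf",
}

serialized = json.dumps({k: encode_value(v) for k, v in state.items()})
restored = json.loads(serialized, object_hook=decode_object)
```

Tuples still come back as lists, since JSON has no tuple type; whether that matters depends on how the estimator uses those attributes.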

Related

How to handle alphanumeric values in machine learning

I am trying to find the best algorithm for my claims data. The claims data include some diagnosis codes which are alphanumeric, like 'EA43454'. When I run the code below to evaluate the models
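from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
# candidate models to compare
models = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC()))
# evaluate each model in turn
results = []
names = []
scoring = 'accuracy'
for name, model in models:
    kfold = model_selection.KFold(n_splits=10, random_state=None)
    cv_results = model_selection.cross_val_score(model, X, y, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)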
I get the error
ValueError: could not convert string to float: 'U0003'
How to handle these alphanumeric values?
You need to convert your strings to an indicator variable (dummy variables). Each value of the string variable has to be associated with a number so that the models can train on that data.
Scikit-learn has several preprocessors to help you with this such as OneHotEncoder. You can also use pandas.get_dummies, but using sklearn's own classes is more composable - for example, you can use them as part of a pipeline.
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder
rng = np.random.default_rng()
animals = pd.DataFrame({"animal": rng.choice(["cat", "dog"], size=10),
                        "age": rng.integers(1, 20, size=10)})
animals_ohe = OneHotEncoder().fit_transform(animals.drop(columns=["age"]))
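To make the idea of indicator variables concrete, here is a dependency-free sketch of what one-hot encoding does to a column of codes. The function name and the sample codes are made up for illustration; in practice OneHotEncoder does this (and remembers the categories for new data):

```python
def one_hot(values):
    """Return (categories, rows) where each row is a 0/1 indicator vector."""
    categories = sorted(set(values))
    index = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1  # exactly one column is switched on per row
        rows.append(row)
    return categories, rows


codes = ["EA43454", "U0003", "EA43454", "B12"]  # made-up diagnosis codes
categories, rows = one_hot(codes)
print(categories)
for code, row in zip(codes, rows):
    print(code, row)
```

Each distinct code gets its own column, so a feature with many distinct codes produces a wide, mostly-zero matrix.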

Pytorch: Add information to images in image prediction

I would like to add information to my current dataset. At the moment, I have six-frame sequences in folders. The DataLoader reads all six frames and uses the first three to predict the last one, two, or three (depending on how many I tell it to). This is the Dataset class used by the DataLoader.
class TrainFeeder(Dataset):
    def __init__(self, data_set):
        super(TrainFeeder, self).__init__()
        self.input_data = data_set
        #print(torch.cuda.current_device())
        if torch.cuda.current_device() == 0:
            print('There are total %d sequences in trainset' % len(self.input_data))
    def __getitem__(self, index):
        path = self.input_data[index]
        imgs_path = sorted(glob.glob(path + '/*.png'))
        imgs = []
        for img_path in imgs_path:
            img = cv2.imread(img_path)
            img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
            img = cv2.resize(img, (256,448))
            img = cv2.resize(img, (0, 0), fx=0.5, fy=0.5, interpolation=cv2.INTER_CUBIC) #has been 0.5 for official data, new is fx = 2.63 and fy = 2.84
            img_tensor = ToTensor()(img).float()
            imgs.append(img_tensor)
        imgs = torch.stack(imgs, dim=0)
        return imgs
    def __len__(self):
        return len(self.input_data)
Now I'd like to add one value to these images. It is a boolean that I have stored in a list in a .json file in the same folder as the six-frame sequences. But I don't know how to add the values of the list in the .json file to the tensor. Which dimension should I use? Will the system work at all if I change the shape of the input?
The function __getitem__ can return anything, so you can return a tuple instead of just the images:
def __getitem__(self, index):
    path = ...
    # load your 6 images
    imgs = torch.stack( ... )
    # load your boolean metadata
    metadata = load_json_data( ... )
    # return them both
    return (imgs, metadata)
You will need to make metadata a tensor before returning it; otherwise I expect that PyTorch will complain about not being able to collate (i.e. stack) them to make batches.
"Will the system work" is a question only you can answer, since you did not provide the code of your ML model. I would bet on: "no, but it won't require significant changes to work". Most likely you currently have a loop like
for imgs in dataloader:
    # do some training
    output = model(imgs)
    ...
And you will have to change it to
for imgs, metadata in dataloader:
    # do some training
    output = model(imgs)
    ...

Getting the column names chosen after a feature selection method

Given a simple feature selection code below, I want to know the selected columns after the feature selection (The dataset includes a header V1 ... V20)
import pandas as pd
from sklearn.feature_selection import SelectFromModel, SelectKBest, f_regression
def feature_selection(data):
    y = data['Class']
    X = data.drop(['Class'], axis=1)
    fs = SelectKBest(score_func=f_regression, k=10)
    # Applying feature selection
    X_selected = fs.fit_transform(X, y)
    # TODO: determine the columns being selected
    return X_selected
data = pd.read_csv("../dataset.csv")
new_data = feature_selection(data)
I appreciate any help.
I have used the iris dataset for my example but you can probably easily modify your code to match your use case.
The SelectKBest method has the scores_ attribute I used to sort the features.
Feel free to ask for any clarifications.
import pandas as pd
import numpy as np
from sklearn.feature_selection import SelectFromModel, SelectKBest, f_regression
from sklearn.datasets import load_iris
def feature_selection(data):
    y = data[1]
    X = data[0]
    column_names = ["A", "B", "C", "D"]  # Here you should use your dataframe's column names
    k = 2
    fs = SelectKBest(score_func=f_regression, k=k)
    # Applying feature selection
    X_selected = fs.fit_transform(X, y)
    # Find top features
    # I create a list like [[ColumnName1, Score1] , [ColumnName2, Score2], ...]
    # Then I sort in descending order on the score
    top_features = sorted(zip(column_names, fs.scores_), key=lambda x: x[1], reverse=True)
    print(top_features[:k])
    return X_selected
data = load_iris(return_X_y=True)
new_data = feature_selection(data)
I don't know the built-in method, but it can be easily coded.
# number of selected columns (second dimension of the transformed data)
n_columns_selected = X_selected.shape[1]
# names of the highest-scoring columns
top_scored = sorted(zip(fs.scores_, X.columns))[-n_columns_selected:]
new_columns = [name for _, name in top_scored]
# new_columns order is perturbed, we need to restore it. We use the names of the columns of X as a reference
new_columns = sorted(new_columns, key=lambda x: list(X.columns).index(x))
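The same "take the k best scores, then restore the original column order" logic can be sketched without any dependencies. The function name and the score values are made up for illustration; the scores stand in for fs.scores_. scikit-learn selectors also expose get_support(), which returns a mask of the kept columns and is usually the more direct route:

```python
def selected_column_names(columns, scores, k):
    """Names of the k highest-scoring columns, in their original order."""
    # indices of all columns, best score first
    ranked = sorted(range(len(columns)), key=lambda i: scores[i], reverse=True)
    # keep the top k, then go back to the original column order
    keep = sorted(ranked[:k])
    return [columns[i] for i in keep]


columns = ["V1", "V2", "V3", "V4", "V5"]
scores = [0.3, 12.5, 4.1, 0.9, 7.7]  # made-up values standing in for fs.scores_
print(selected_column_names(columns, scores, k=3))
```

Working with indices rather than names avoids the repeated list(X.columns).index(x) lookups and stays correct if two columns happen to share a name.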

How to prevent Keras predict_generator from shuffling data?

I created a deep learning model, and I want to check the performance of the model by using predict_generator. I am using the following code which compares the images' labels with the predicted classes and then returns the prediction error.
validation_generator = validation_datagen.flow_from_directory(
    validation_dir,
    target_size=(image_size, image_size),
    batch_size=val_batchsize,
    class_mode='categorical',
    shuffle=False)
# Get the filenames from the generator
fnames = validation_generator.filenames
# Get the ground truth from generator
ground_truth = validation_generator.classes
# Get the label to class mapping from the generator
label2index = validation_generator.class_indices
# Getting the mapping from class index to class label
idx2label = dict((v,k) for k,v in label2index.items())
# Get the predictions from the model using the generator
predictions = model.predict_generator(validation_generator, steps=validation_generator.samples/validation_generator.batch_size,verbose=1)
predicted_classes = np.argmax(predictions,axis=1)
errors = np.where(predicted_classes != ground_truth)[0]
print("No of errors = {}/{}".format(len(errors),validation_generator.samples))
# Show the errors
for i in range(len(errors)):
    pred_class = np.argmax(predictions[errors[i]])
    pred_label = idx2label[pred_class]
    title = 'Original label:{}, Prediction :{}, confidence : {:.3f}'.format(
        fnames[errors[i]].split('/')[0],
        pred_label,
        predictions[errors[i]][pred_class])
    original = load_img('{}/{}'.format(validation_dir,fnames[errors[i]]))
    plt.figure(figsize=[7,7])
    plt.axis('off')
    plt.title(title)
    plt.imshow(original)
    plt.show()
validation_generator.classes is in order, but predicted_classes is not.
I took the code from https://www.learnopencv.com/keras-tutorial-fine-tuning-using-pre-trained-models/
How can I prevent predict_generator from shuffling data?
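One thing that may be worth checking, offered as an assumption rather than a confirmed diagnosis: the steps argument in the call above is a float whenever the number of samples is not a multiple of the batch size, which can leave the number of predictions out of step with ground_truth. A minimal sketch of rounding the step count up, using made-up numbers in place of the generator's attributes:

```python
import math

samples = 1050      # made-up stand-in for validation_generator.samples
batch_size = 32     # made-up stand-in for validation_generator.batch_size

steps_float = samples / batch_size        # not a whole number of batches
steps = math.ceil(samples / batch_size)   # enough batches to cover every sample once

# the last batch is smaller than the others
last_batch = samples - (steps - 1) * batch_size
print(steps_float, steps, last_batch)
```

With an integer step count that covers the data exactly once, predictions and ground_truth should have the same length, which makes the element-wise comparison meaningful.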

Unable to get pipeline.fit() to work using Sklearn and Keras Wrappers

I am getting a ValueError about unpacking (not enough values to unpack: expected 2, got 1). I have a network I want to train:
def build(self):
    numpy.random.seed(self.seed)
    self.estimators.append(('standardize', StandardScaler))
    self.estimators.append(('mlp', KerasClassifier(build_fn=self.build_fn, epochs=50, batch_size=5, verbose=0)))
    self.pipeline = Pipeline(self.estimators)
Now if I want to fit the data to some values: say self.X, self.Y
self.model = self.pipeline.fit(self.X, self.Y, verbose=1)
I get
Traceback (most recent call last):
  File "C:/Users/jaehan/PycharmProjects/cerebro/cerebro.py", line 257, in <module>
    model.run()
  File "C:/Users/jaehan/PycharmProjects/cerebro/cerebro.py", line 138, in run
    self.model = self.pipeline.fit(self.X, self.Y, verbose=1)
  File "C:\Users\jaehan\AppData\Local\Continuum\anaconda3\envs\py36\lib\site-packages\sklearn\pipeline.py", line 248, in fit
    Xt, fit_params = self._fit(X, y, **fit_params)
  File "C:\Users\jaehan\AppData\Local\Continuum\anaconda3\envs\py36\lib\site-packages\sklearn\pipeline.py", line 197, in _fit
    step, param = pname.split('__', 1)
ValueError: not enough values to unpack (expected 2, got 1)
Am I doing something wrong here? I was under the impression I could just run a fit and it would return a history object, which I could save and load at any time
I even tried...
self.pipeline.fit(self.X, self.Y)
Which throws...
AttributeError: 'numpy.ndarray' object has no attribute 'fit'
I have no idea what is going on here.
Full Code
class Cerebro:
    def __init__(self):
        self.model = None
        self.build_fn = None
        self.data = None
        self.X = None
        self.Y = None
        #these three are for encoding string values to integer_encodings / one hot encodings
        self.encoder = LabelEncoder()
        self.encodings = {}
        self.one_hot_encodings = {}
        self.seed = numpy.random.seed(7) #this is to ensure we have reproducible results.
        self.estimators = []
        self.pipeline = None
        self.kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=self.seed)
        self.cross_validation_score = 0.0
    def preprocess(self):
        """
        This method will preprocess the dataset we want to train our network on.
        Example:
            import preproccessing
            ...
            dataset, X, Y = preprocessing.main()
        """
        self.data = pandas.read_csv('src_examples/hwtxn_final_for_influx.txt', sep='\t').values
        self.X = numpy.delete(self.data, 13, axis=1)
        self.Y = self.data[:, 13].astype(numpy.float16)
    def build(self):
        self.build_fn = self.base_model()
        self.preprocess()
        numpy.random.seed(self.seed)
        self.estimators.append(('standardize', StandardScaler()))
        self.estimators.append(('mlp', KerasClassifier(build_fn=self.build_fn, epochs=50, batch_size=5, verbose=0)))
        self.pipeline = Pipeline(self.estimators)
    def run(self):
        """This will actually take the pipeline (preprocessing standardization, model)
        and fit it to our dataset (X, Y) (We don't need test/train since we are using stratified k fold cross val.)
        Args:
            None
        Returns:
            None
        """
        # this is the 'model'
        # self.pipeline
        print(type(self.pipeline))
        print(self.X.shape)
        self.model = self.pipeline.fit(self.X, self.Y)
    def load(self, fn):
        """This will load a saved model (history object)
        Args:
            fn (filename): represents saved model file
        Returns:
            model (pkl object): represents model
        """
        return pickle.load(open(fn, 'rb'))
    def save(self, fn):
        """This will save a model (history object)
        Args:
            fn (filename): represents a filename to save the model as
        Returns:
            None
        """
        pickle.dump(self.model, open(fn, 'wb'))
    def encode(self, vals, key):
        """ This method will encode a list of values and take a key (representing column name, or index) to save
        in the class object (self.encodings)
        This will help us keep track of encodings we have for values we need to translate/decipher.
        Args:
            vals(np.array): array of values to encode
            key(str): str representing the key used to encode this particular set of values
        Returns:
            transformed values (np.array) representing the encoded versions of values
        """
        # int encoding for non int values
        self.encodings[key] = self.encoder.fit_transform(vals)
        return self.encoder.fit_transform(vals)
    def decoder(self, vals, key):
        """This method will decode the integer_encodings for class variables. It will take vals which
        represents a list of values to decode (i.e. [1,2,3] -- [apple, pear, orange])
        It will also take a key (since every decoding has a corresponding encoding) to find which encoding
        scheme to map to
        Args:
            vals(np.array) : array of values to decode
            key(str) : string representing the key used for encoding the values (for decoding it)
        Returns:
            inverse transform of encoded values (np.array)
        """
        # translate int encodings to original values (encoder._classes)
        return self.encodings[key].inverse_transform(vals)
    def cross_validate(self):
        """
        This will perform a cross validation score using a stratified kfold method. (Think traditional Kfold but
        with the values evenly distributed for each subsample)
        Args:
            None
        Returns:
            None
        """
        self.cross_validation_score = cross_val_score(self.pipeline, self.X, self.Y, cv=self.kfold)
        return self.cross_validation_score
    @staticmethod
    def base_model():
        """
        This will return a base model for us to try. The good thing about this implementation is that
        when we decide we want something more complex then all we have to do is define a class function and replace
        the values in the build f(x)
        Args:
            None
        Returns:
            model (keras.models.Sequential): Keras based DNN Model
        """
        # create model
        model = Sequential()
        model.add(Dense(60, input_dim=60, kernel_initializer='normal', activation='relu'))
        model.add(Dense(1, kernel_initializer='normal', activation='sigmoid'))
        # Compile model
        model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
        return model
    @staticmethod
    def one_hot_encoder(int_encoding):
        """
        This will take an integer encoding of string variables (traditional preprocessing step, will probably
        move this to the preprocessing package.
        Essential it returns a binary 'one hot' encoding of the values we wish to encode
        Example
            #Dataset Values
            [apple, orange, pear]
            #Integer Encoding
            [1, 2, 3]
            #One Hot Encoding
            [[1, 0, 0]
            [0, 1, 0]
            [0, 0, 1]]
        Args:
            None
        Returns:
            Matrix (np.array): matrix representing one hot vectors for a class of values
        """
        # we might not need this... so for now we will keep it static
        return OneHotEncoder(sparse=False).fit_transform(int_encoding.reshape(len(int_encoding), 1))
if __name__ == '__main__':
    # Step 1 is to initialize class (with seed == 7)
    model = Cerebro()
    model.build()
    model.cross_validate()
    print("Here are our estimators:\n {}".format(model.estimators))
    print("Here is our pipeline:\n {}".format(model.pipeline))
    model.run()
EDIT
The answer is that the build_fn argument of KerasClassifier requires the function itself (a callable), not the model that the function returns.
IMHO I feel an error should be thrown for specifically that case.
This is due to the following line:
self.build_fn = self.base_model()
This should actually be:
self.build_fn = self.base_model
KerasClassifier requires a pointer to the function which creates the model, but by appending () at the end, you are assigning build_fn with the actual model, which is wrong.
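The difference between passing a function and passing its result can be sketched with plain Python. Wrapper and build_model below are hypothetical stand-ins, not the Keras API, and the real wrapper may fail in a different way (or later) than this toy does:

```python
def build_model():
    """Hypothetical factory; returns a fresh 'model' (a dict here) on each call."""
    return {"layers": [60, 1]}


class Wrapper:
    """Hypothetical stand-in for a wrapper that builds its model lazily."""

    def __init__(self, build_fn):
        self.build_fn = build_fn

    def fit(self):
        # the wrapper calls build_fn itself when it needs a model
        self.model = self.build_fn()
        return self


ok = Wrapper(build_fn=build_model).fit()       # function object passed: works

try:
    Wrapper(build_fn=build_model()).fit()      # result passed: a dict is not callable
    failed = False
except TypeError:
    failed = True

print(ok.model, failed)
```

Passing the function also lets the wrapper build a fresh model whenever it needs one, for example once per cross-validation fold.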
In addition to the above error, I would recommend checking the following lines in your code, which, if not corrected, will cause errors later when you use the code.
1) self.encodings[key] = self.encoder.fit_transform(vals)
Here you are assigning the transformed data to encodings[key], not the fitted encoder. So when you do this:
self.encodings[key].inverse_transform(vals)
it makes no sense to call inverse_transform() on the transformed data.
inverse_transform() is a method of scikit-learn transformers, but self.encodings[key] will give you an ndarray, because you have saved the output array from fit_transform().
2) Something similar to 1 is also happening with one_hot_encoder()
The error "AttributeError: 'numpy.ndarray' object has no attribute 'fit'" seems related to 1 and 2.
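A minimal sketch of the fix suggested in point 1, assuming the goal is to decode values later: keep one fitted encoder per key instead of the transformed output. TinyLabelEncoder is a hypothetical stand-in written for this example; with scikit-learn you would store a separate LabelEncoder instance per key in the same way:

```python
class TinyLabelEncoder:
    """Hypothetical minimal encoder, standing in for a fitted transformer."""

    def fit_transform(self, values):
        self.classes_ = sorted(set(values))
        lookup = {c: i for i, c in enumerate(self.classes_)}
        return [lookup[v] for v in values]

    def inverse_transform(self, codes):
        return [self.classes_[c] for c in codes]


encoders = {}  # key -> fitted encoder (not the transformed output)


def encode(values, key):
    # a fresh encoder per key, so one column cannot overwrite another's classes
    encoders[key] = TinyLabelEncoder()
    return encoders[key].fit_transform(values)


def decode(codes, key):
    return encoders[key].inverse_transform(codes)


codes = encode(["pear", "apple", "orange", "apple"], key="fruit")
print(codes, decode(codes, key="fruit"))
```

Sharing a single encoder across keys, as the question's class does, would also lose the earlier fit each time fit_transform is called on a new column.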
