Getting explainable predictions from Vertex AI for custom trained model - docker

I've created a custom docker container to deploy my model on Vertex AI (the model uses LightGBM, so I can't use the pre-built containers provided by Vertex AI for TF/SKL/XGBoost).
I'm getting errors while trying to get explainable predictions from the model (I deployed the model and normal predictions are working just fine). I have gone through the official Vertex AI guide(s) for getting predictions/explanations, and also tried different ways of configuring the explanation parameters and metadata, but it's still not working. The errors are not very informative, and this is what they look like:
400 {"error": "Unable to explain the requested instance(s) because: Invalid response from prediction server - the response field predictions is missing. Response: {'error': '400 Bad Request: The browser (or proxy) sent a request that this server could not understand.'}"}
This notebook, provided by Google in the Vertex AI samples, has some examples of how to configure the explanation parameters and metadata for models trained with different frameworks. I'm trying something similar:
https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/community/ml_ops/stage4/get_started_with_vertex_xai.ipynb
The model is a classifier that takes tabular input with 5 features (2 string, 3 numeric), and the output of model.predict() is 0/1 for each input instance. My custom container returns predictions in this format:
# Input for prediction
raw_input = request.get_json()
input = raw_input['instances']
df = pd.DataFrame(input, columns = ['A', 'B', 'C', 'D', 'E'])
# Prediction from model
predictions = model.predict(df).tolist()
response = jsonify({"predictions": predictions})
return response
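A slightly more defensive version of the same handler, which accepts instances either as dicts keyed by feature name or as plain value lists (just a sketch, assuming Flask and the same column order; I'm not certain which shape the explanation service actually sends, and the model path is a placeholder):
from flask import Flask, request, jsonify
import joblib
import pandas as pd

app = Flask(__name__)
COLUMNS = ['A', 'B', 'C', 'D', 'E']
model = joblib.load('model.joblib')  # placeholder path to the LightGBM model

@app.route('/predict', methods=['POST'])
def predict():
    payload = request.get_json()
    instances = payload['instances']
    if instances and isinstance(instances[0], dict):
        # dict-style instances: keep only the expected features, in order
        df = pd.DataFrame(instances)[COLUMNS]
    else:
        # list-style instances: assume values arrive in column order A..E
        df = pd.DataFrame(instances, columns=COLUMNS)
    predictions = model.predict(df).tolist()
    return jsonify({'predictions': predictions})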
This is how I am configuring the explanation parameters and metadata for the model:
# Explanation parameters
PARAMETERS = {"sampled_shapley_attribution": {"path_count": 10}}
exp_parameters = aip.explain.ExplanationParameters(PARAMETERS)
# Explanation metadata (this is probably the part that is causing the errors)
COLUMNS = ["A", "B", "C", "D", "E"]
exp_metadata = aip.explain.ExplanationMetadata(
    inputs={
        "features": {"index_feature_mapping": COLUMNS, "encoding": "BAG_OF_FEATURES"}
    },
    outputs={"predictions": {}},
)
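For reference, the parameters and metadata are attached when the model is uploaded; this is roughly the upload/deploy code (a sketch only; the project, image URI, routes and machine type below are placeholders, not my real values):
from google.cloud import aiplatform as aip

aip.init(project='my-project', location='us-central1')  # placeholder project/region

model = aip.Model.upload(
    display_name='lightgbm-classifier',                        # placeholder name
    serving_container_image_uri='gcr.io/my-project/my-image',  # placeholder image
    serving_container_predict_route='/predict',                # must match the container
    serving_container_health_route='/health',
    explanation_parameters=exp_parameters,
    explanation_metadata=exp_metadata,
)
endpoint = model.deploy(machine_type='n1-standard-4')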
For getting predictions/explanations, I tried using the below format, among others:
instance_1 = {"A": <value>, "B": <>, "C": <>, "D": <>, "E": <>}
instance_2 = {"A": <value>, "B": <>, "C": <>, "D": <>, "E": <>}
inputs = [instance_1, instance_2]
predictions = endpoint.predict(instances=inputs) # Works fine
explanations = endpoint.explain(instances=inputs) # Returns error
Could you please suggest how to correctly configure the explanation metadata, or how to provide input in the right format to the explain API, so that I can get explanations from Vertex AI? I have tried many different formats, but nothing has worked so far. :(

Related

How to "remember" categorical encodings for actual predictions after training?

Suppose I wanted to train a machine learning algorithm on some dataset that includes categorical parameters. (I'm new to machine learning, but my thinking is as follows.) Even if I converted all the categorical data to 1-hot-encoded vectors, how will this encoding map be "remembered" after training?
E.g., converting the initial dataset to use 1-hot encoding before training: say the universe of categories for some column c is {"good","bad","ok"}, so the rows are converted as
[1, 2, "good"] ---> [1, 2, [1, 0, 0]],
[3, 4, "bad"] ---> [3, 4, [0, 1, 0]],
...
Then, after training the model, all future prediction inputs would need to use the same encoding scheme for column c.
How, then, during future predictions, will data inputs remember that mapping (where "good" maps to index 0, etc.), specifically when planning on using a Keras RNN or LSTM model? Do I need to save it somewhere (e.g. with python pickle), and if so, how do I get the explicit mapping? Or is there a way to have the model automatically handle categorical inputs internally, so I can just feed in the original label data during training and future use?
If anything in this question shows any serious confusion on my part about something, please let me know (again, very new to ML).
** I wasn't sure if this belongs on https://stats.stackexchange.com/, but I posted it here since I specifically wanted to know how to deal with the actual code implementation of this problem.
What I've been doing is the following:
After you use StringIndexer.fit(), you can save its metadata (it includes the actual encoder mapping, e.g. "good" being the first column).
This is the code I use (in Java, but it can be adjusted to Python):
StringIndexerModel sim = new StringIndexer()
.setInputCol(field)
.setOutputCol(field + "_INDEX")
.setHandleInvalid("skip")
.fit(dataset);
sim.write().overwrite().save("IndexMappingModels/" + field + "_INDEX");
and later, when trying to make predictions on a new dataset, you can load the stored metadata:
StringIndexerModel sim = StringIndexerModel.load("IndexMappingModels/" + field + "_INDEX");
dataset = sim.transform(dataset);
I imagine you have already solved this issue, since it was posted in 2018, but I've not found this solution anywhere else, so I believe it's worth sharing.
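For anyone in a scikit-learn/Keras setup rather than Spark, the same idea applies: fit the encoder once, persist it, and reload it at prediction time. A minimal sketch (assuming scikit-learn and joblib are available; the column name, file name and toy data are made up):
import joblib
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# toy training data with one categorical column "c" (made-up values)
train_df = pd.DataFrame({"c": ["good", "bad", "ok", "good"]})

encoder = OneHotEncoder(handle_unknown="ignore")   # unseen labels -> all-zero row
encoder.fit(train_df[["c"]])
joblib.dump(encoder, "c_encoder.joblib")           # persist the fitted mapping

# later, at prediction time, reload and apply the same mapping
encoder = joblib.load("c_encoder.joblib")
new_df = pd.DataFrame({"c": ["ok", "good", "excellent"]})  # "excellent" was never seen
X_new = encoder.transform(new_df[["c"]]).toarray()
print(encoder.categories_)   # the explicit label order, e.g. [array(['bad', 'good', 'ok'], ...)]
print(X_new)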
My thought would be to do something like this on the training/testing dataset D (using a mix of Python and plain pseudo-code):
Do something like
# Before: D.schema == {num_col_1: int, cat_col_1: str, cat_col_2: str, ...}
# assign a unique index to each distinct label of a categorical column and store it in a new column
# http://spark.apache.org/docs/latest/ml-features.html#stringindexer
label_indexer = StringIndexer(inputCol="cat_col_i", outputCol="cat_col_i_index").fit(D)
D = label_indexer.transform(D)
# After: D.schema == {num_col_1: int, cat_col_1: str, cat_col_2: str, ..., cat_col_1_index: int, cat_col_2_index: int, ...}
for all the categorical columns
Then, for all of these categorical name and index columns in D, make a map of the form
map = {}
for all categorical column names colname in D:
    map[colname] = []
    # create mapping entries for all categorical values
    # see https://spark.apache.org/docs/latest/sql-programming-guide.html#untyped-dataset-operations-aka-dataframe-operations
    for all rows r in D.select(colname, '%s_index' % colname).drop_duplicates():
        enc_from = r['%s' % colname]
        enc_to = r['%s_index' % colname]
        map[colname].append((enc_from, enc_to))
    # for cats that may appear later that have yet to be seen
    # (IDK if this is best practice, may be another way, see https://medium.com/@vaibhavshukla182/how-to-solve-mismatch-in-train-and-test-set-after-categorical-encoding-8320ed03552f)
    map[colname].append(('NOVEL_CAT', map[colname].len))
    # sort by index encoding
    map[colname].sort(key=lambda pair: pair[1])
to end up with something like
{
    'cat_col_1': [('orig_label_11', 0), ('orig_label_12', 1), ...],
    'cat_col_2': [(), (), ...],
    ...
    'cat_col_n': [('orig_label_n1', 0), ...]
}
which can then be used to generate 1-hot-encoded vectors for each categorical column in any later data sample row ds. Eg.
for all categorical column names colname in ds:
    enc_from = ds[colname]
    # make zero vector for 1-hot for category
    col_onehot = zeros.(size = map[colname].len)
    for label, index in map[colname]:
        if (label == enc_from):
            col_onehot[index] = 1
            # make new column in sample for 1-hot vector
            ds['%s_onehot' % colname] = col_onehot
            break
You can then save this structure with pickle, e.g. pickle.dump(map, open("cats_map.pkl", "wb")), and use it to encode the categorical column values when making actual predictions later.
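And the loading side at prediction time, for completeness (a small sketch that assumes the same file name and structure as above; the incoming label is made up):
import pickle

# reload the saved {column -> [(label, index), ...]} mapping
with open("cats_map.pkl", "rb") as f:
    cats_map = pickle.load(f)

# look up the one-hot index for an incoming label, falling back to NOVEL_CAT
label_to_index = dict(cats_map['cat_col_1'])
idx = label_to_index.get('some_incoming_label', label_to_index['NOVEL_CAT'])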
** There may be a better way, but I think I would need to better understand this article first (https://medium.com/@satnalikamayank12/on-learning-embeddings-for-categorical-data-using-keras-165ff2773fc9). I will update this answer if I learn anything.

Doc2vec: Only 10 docvecs in gensim doc2vec model?

I used gensim to fit a doc2vec model, with tagged documents (length > 10) as training data. The goal is to get the doc vectors of all training docs, but only 10 vectors can be found in model.docvecs.
Example training data (length > 10):
docs = ['This is a sentence', 'This is another sentence', ....]
with some pre-processing:
doc_=[d.strip().split(" ") for d in doc]
doc_tagged = []
for i in range(len(doc_)):
    tagd = TaggedDocument(doc_[i],str(i))
    doc_tagged.append(tagd)
tagged docs
TaggedDocument(words=array(['a', 'b', 'c', ..., ],
dtype='<U32'), tags='117')
fit a doc2vec model
model = Doc2Vec(min_count=1, window=10, size=100, sample=1e-4, negative=5, workers=8)
model.build_vocab(doc_tagged)
model.train(doc_tagged, total_examples= model.corpus_count, epochs= model.iter)
Then I check the final model:
len(model.docvecs)
the result is 10...
I tried other datasets (length > 100, 1000) and got the same result for len(model.docvecs).
So, my question is:
How can I use model.docvecs to get all the training doc vectors (without using model.infer_vector)?
Is model.docvecs designed to provide all training docvecs?
The bug is in this line:
tagd = TaggedDocument(doc[i],str(i))
Gensim's TaggedDocument accepts a sequence of tags as a second argument. When you pass a string '123', it's turned into ['1', '2', '3'], because it's treated as a sequence. As a result, all of the documents are tagged with just 10 tags ['0', ..., '9'], in various combinations.
Another issue: you're defining doc_ and never actually using it, so your documents will be split incorrectly as well.
Here's the proper solution:
docs = [doc.strip().split(' ') for doc in docs]
tagged_docs = [doc2vec.TaggedDocument(doc, [str(i)]) for i, doc in enumerate(docs)]
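With the tags passed as lists, there is one vector per document, which you can check like this (a small sketch on a toy corpus, using the same old-style gensim API as the question; in gensim 4.x the constructor argument is vector_size, the property is model.dv, and model.iter is replaced by model.epochs):
from gensim.models import doc2vec

docs = ['This is a sentence', 'This is another sentence']   # toy corpus
docs = [d.strip().split(' ') for d in docs]
tagged_docs = [doc2vec.TaggedDocument(d, [str(i)]) for i, d in enumerate(docs)]

model = doc2vec.Doc2Vec(min_count=1, window=10, size=100,
                        sample=1e-4, negative=5, workers=8)
model.build_vocab(tagged_docs)
model.train(tagged_docs, total_examples=model.corpus_count, epochs=model.iter)

print(len(model.docvecs))   # == len(docs): one vector per training document
print(model.docvecs['1'])   # the vector for the document tagged '1'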

Why does sklearn Imputer need to fit?

I'm really new in this whole machine learning thing and I'm taking an online course on this subject. In this course, the instructors showed the following piece of code:
imputer = Imputer(missing_values='NaN', strategy='mean', axis=0)
imputer = imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])
I don't really get why this imputer object needs to fit. I mean, I'm just trying to get rid of missing values in my columns by replacing them with the column mean. From the little I know about programming, this is a pretty simple, iterative procedure, and wouldn't require a model that has to train on data to be accomplished.
Can someone please explain how this imputer thing works and why it requires training to replace some missing values by the column mean?
I have read scikit-learn's documentation, but it just shows how to use the methods, and not why they're required.
Thank you.
The Imputer fills missing values with some statistics (e.g. mean, median, ...) of the data.
To avoid data leakage during cross-validation, it computes the statistic on the train data during the fit, stores it, and uses it on the test data during the transform.
import numpy as np
from sklearn.preprocessing import Imputer

obj = Imputer(strategy='mean')
obj.fit([[1, 2, 3], [2, 3, 4]])
print(obj.statistics_)
# array([ 1.5, 2.5, 3.5])
X = obj.transform([[4, np.nan, 6], [5, 6, np.nan]])
print(X)
# array([[ 4. , 2.5, 6. ],
# [ 5. , 6. , 3.5]])
You can do both steps in one call, using fit_transform, when the data you fit on and the data you transform are the same.
X = obj.fit_transform([[1, 2, np.nan], [2, 3, 4]])
print(X)
# array([[ 1. , 2. , 4. ],
# [ 2. , 3. , 4. ]])
This data leakage issue is important, since the data distribution may change from the training data to the testing data, and you don't want information from the testing data to be present during the fit.
See the doc for more information about cross-validation.
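To see why this separation matters in practice, here is a small sketch (using the newer SimpleImputer, which has since replaced Imputer; the data is made up): inside a Pipeline, cross-validation re-fits the imputer on each training fold only, so the held-out fold's values never influence the imputed means.
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan], [2.0, 5.0],
              [3.0, 1.0], [np.nan, 4.0], [6.0, 2.0], [5.0, np.nan]])
y = np.array([0, 1, 0, 1, 0, 1, 0, 1])

# the imputer is fit on each training fold, then applied to the held-out fold
pipe = make_pipeline(SimpleImputer(strategy='mean'), LogisticRegression())
print(cross_val_score(pipe, X, y, cv=2))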

Placeholder tensors require a value in ml-engine predict but not local predict

I've been developing a model for use with the cloud ML engine's online prediction service. My model contains a placeholder_with_default tensor that I use to hold a threshold for prediction significance.
threshold = tf.placeholder_with_default(0.01, shape=(), name="threshold")
I've noticed that when using local predict:
gcloud ml-engine local predict --json-instances=data.json --model-dir=/my/model/dir
I don't need to supply values for this tensor. E.g., this is a valid input:
{"features": ["a", "b"], "values": [10, 5]}
However when using online predict:
gcloud ml-engine predict --model my_model --version v1 --json-instances data.json
If I use the above JSON I get an error:
{
"error": "Prediction failed: Exception during model execution: AbortionError(code=StatusCode.INVALID_ARGUMENT, details=\"input size does not match signature\")"
}
However, if I include threshold, then I don't get the error. E.g.:
{"features": ["a", "b"], "values": [10, 5], "threshold": 0.01}
Is there a way to have "threshold" be an optional input?
Thanks
Matthew
Looks like currently it's not possible in CloudML. If you're getting predictions from a JSON file, you need to add the default values explicitly (like you did with "threshold": 0.01).
In Python I'm just dynamically adding the required attributes before doing the API request:
def add_empty_fields(instance):
    placeholder_defaults = {"str_placeholder": "", "float_placeholder": -1.0}
    for ph, default_val in placeholder_defaults.items():
        if ph not in instance:
            instance[ph] = default_val
which would mutate the instance dict that maps placeholder names to placeholder values. For a model with many optional placeholders, this is a bit nicer than manually setting missing placeholder values for each instance.
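For example (the instance contents below are made up), the helper is applied to every instance before building the request body:
instances = [
    {"features": ["a", "b"], "values": [10, 5]},
    {"features": ["c"], "values": [7], "float_placeholder": 0.5},
]
for instance in instances:
    add_empty_fields(instance)

# every instance now contains "str_placeholder" and "float_placeholder",
# so the request matches the signature the online prediction service expects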

PYTHON: Memory Error - MultinomialNB.partial_fit() - 17k classes

Hi, I am new to Python, scikit-learn and ML in general. I'm encountering a MemoryError when using MultinomialNB.partial_fit(); I'm trying to do multi-label classification on the DMOZ directory data.
My questions:
What am I doing wrong? Is it my lack of memory, or is the data wrong?
Am I using the right approach?
Is there anything I can do to improve my approach?
Approach:
Store DMOZ DB directories into MongoDB/TokuMX
{
    "_id": {
        "$oid": "54e758c91d41c804d8ace196"
    },
    "docs": [
        {
            "url": "http://www.awn.com/",
            "description": "Provides information resources to the international animation community. Features include searchable database archives, monthly magazine, web animation guide, the Animation Village, discussion forums and other useful resources.",
            "title": "Animation World Network"
        }
    ],
    "labels": [
        "Top",
        "Arts",
        "Animation"
    ]
}
Iterate over the docs array and pass the doc elements into my classifier function.
Vectorizer and Classifier
classifier = MultinomialNB()
vectorizer = HashingVectorizer(
    stop_words='english',
    strip_accents='unicode',
    norm='l2'
)
My classifier function
def classify(doc, labels, classifier, vectorizer, *args):
    r = requests.get(doc['url'], verify=False)
    print "Retrieving URL = {0}\n".format(doc['url'])
    if r.status_code == 200:
        html = lxml.html.fromstring(r.text)
        doc['content'] = []
        tags = ['font', 'td', 'h1', 'h2', 'h3', 'p', 'title']
        for tag in tags:
            for x in html.xpath('//'+tag):
                try:
                    bag_of_words = nltk.word_tokenize(x.text_content())
                    pos_tagged = nltk.pos_tag(bag_of_words)
                    for word, pos in pos_tagged:
                        if pos[:2] == 'NN':
                            doc['content'].append(word)
                except AttributeError as e:
                    print e
        x_train = vectorizer.fit_transform(doc['content'])
        # if we are the first one to run partial_fit, pass all classes
        if len(args) == 1:
            classifier.partial_fit(x_train, labels, classes=args[0])
        else:
            classifier.partial_fit(x_train, labels)
    return doc
X: doc['content'] consists of an array of nouns (about 600).
Y: labels consists of an array with the labels from the mongo document shown above (3).
Classes: args[0] consists of an array with all the (unique) labels in the database (17490).
I'm running this inside VirtualBox on a quad-core laptop, with 4 GB of RAM assigned to the VM.
What are the 17490 unique labels? There will be one coefficient for each label and each feature, which is likely where your memory error comes from.
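A rough back-of-envelope calculation shows the scale of the problem (assuming HashingVectorizer's default of 2**20 features and 64-bit floats; MultinomialNB keeps dense (n_classes, n_features) arrays such as feature_count_ and feature_log_prob_):
n_classes = 17490      # unique labels
n_features = 2 ** 20   # HashingVectorizer default n_features
bytes_per_float = 8    # float64

gib = n_classes * n_features * bytes_per_float / float(2 ** 30)
print("%.0f GiB per array" % gib)   # ~137 GiB, far beyond 4 GB of RAM
Reducing the hash space (e.g. HashingVectorizer(n_features=2**16)) or pruning the label set would shrink this considerably.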
