How to access cluster labels from a fit method in AWS Sagemaker - machine-learning

Background information:
AWS SageMaker lets you use external sklearn clustering methods, like DBSCAN, as well as built-in clustering methods like KMeans, for fitting and deploying/predicting. By default, you get access to the cluster labels after deploying your model as a predictor object:
Example:
kmeans_customers_3 = KMeans(role=role,
                            instance_count=1,
                            instance_type='ml.c4.xlarge',
                            output_path=output_path_cluster,  # specified above
                            k=3,
                            epochs=20,
                            sagemaker_session=sagemaker_session)
kmeans_customers_3.fit(some_data)
kmeans_predict_3 = kmeans_customers_3.deploy(
    initial_instance_count=1,
    instance_type="ml.t2.medium"
)
cluster_info = kmeans_predict_3.predict(aws_conform_data_in_record_set)
cluster_labels = [cluster.label['closest_cluster'].float32_tensor.values[0] for cluster in cluster_info]
Problem:
Most external clustering methods from sklearn have no predict() function. For example, Agglomerative Clustering and DBSCAN only have fit() and fit_predict() methods, which is not compatible with AWS deployment. Only methods that have a predict() method, like KMeans or Affinity Propagation, work well with AWS (https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html).
Question:
How can I access a fitted clustering model from AWS, so that I have access to the model.class_labels attributes after fit (in the hope of not being limited to clustering methods that have a predict method)? I know how to download the model.tar.gz, but I'm a bit confused about what to do with it, since opening it does not help.
It might also be possible to write my own predict function for such a method that only returns class labels. However, I don't know how to do that in this environment, since AWS uses an SKLearn object, and I don't believe I can override it or the method of e.g. DBSCAN itself.
Any ideas on how to retrieve the class labels of clustering methods from a fit method in AWS SageMaker?

Once your Sklearn model is trained and saved in S3 as a model.tar.gz, you can download it to the client of your choice, untar it and re-open it with the same libraries you used to save it (pickle, joblib, etc).
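As a minimal sketch of that last step, assuming the training script pickled the fitted estimator under the hypothetical name model.pkl inside the archive (if it used joblib instead, swap pickle.load for joblib.load), the archive could be read like this. The dictionary in the demo only stands in for a fitted model and its labels.

```python
import os
import pickle
import tarfile
import tempfile


def load_pickled_model(tar_path, member_name="model.pkl"):
    """Unpickle one file from a model.tar.gz archive without unpacking it to disk.

    member_name is a hypothetical file name; use whatever your training
    script actually wrote into the model directory.
    """
    with tarfile.open(tar_path, "r:gz") as archive:
        member = archive.extractfile(member_name)
        if member is None:  # directories and other special members yield None
            raise ValueError(f"{member_name} is not a regular file")
        with member:
            return pickle.load(member)


# Demo: build a tiny archive locally, standing in for the one downloaded from S3.
with tempfile.TemporaryDirectory() as tmp:
    pkl_path = os.path.join(tmp, "model.pkl")
    with open(pkl_path, "wb") as f:
        pickle.dump({"labels_": [0, 0, 1, -1]}, f)

    tar_path = os.path.join(tmp, "model.tar.gz")
    with tarfile.open(tar_path, "w:gz") as archive:
        archive.add(pkl_path, arcname="model.pkl")

    restored = load_pickled_model(tar_path)

print(restored["labels_"])
```

With a real sklearn estimator, the restored object would expose its fitted attributes (for DBSCAN, labels_) directly. Only unpickle archives that you produced or trust.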

If you're looking for a way to open the model.tar.gz after training the model with the built-in SageMaker KMeans algorithm, check the SageMaker example "Analyze US census data for population segmentation", in particular the section "Accessing the KMeans model attributes", which has this code sample:
Kmeans_model_params = mx.ndarray.load("model_algo-1")
The code sample you provided in your question is correct if you want to calculate (predict) the labels for all data points in your dataset.
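If you have the centroids locally, for instance after loading them from model_algo-1, you could also compute the labels yourself, since k-means assigns each point to its nearest centroid. A minimal sketch, assuming Euclidean distance and centroids that have already been converted to plain Python lists; the function name and all values below are made up for illustration:

```python
import math


def closest_cluster(point, centroids):
    """Return the index of the centroid nearest to point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


# Made-up centroids and data points, for illustration only.
centroids = [[0.0, 0.0], [10.0, 10.0], [0.0, 10.0]]
points = [[1.0, 1.0], [9.0, 8.5], [0.5, 9.0], [6.0, 6.0]]

cluster_labels = [closest_cluster(p, centroids) for p in points]
print(cluster_labels)
```

This avoids keeping an endpoint running just to label data you already have, at the cost of reimplementing the assignment step yourself.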
Another example, Bring Your Own Model (k-means), shows how to pack your own KMeans model, e.g. one trained with sklearn.cluster.KMeans, for inference inside the SageMaker built-in KMeans container. This is the main part of the code:
centroids = mx.ndarray.array(kmeans.cluster_centers_)
mx.ndarray.save("model_algo-1", [centroids])
If you're looking for the way to host another SKLearn model in SageMaker, you need to create an inference.py script and define predict_fn() and model_fn() as in the SageMaker scikit-learn Bring Your Own Model example.
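A rough sketch of what such an inference.py could look like for an estimator without predict(), such as DBSCAN. The function names model_fn and predict_fn follow the convention mentioned above; everything else is an assumption: the file name model.pkl, the use of pickle (use joblib.load if your training script saved with joblib), and the choice to return the labels_ stored during fit() instead of anything derived from the request payload.

```python
import os
import pickle


def model_fn(model_dir):
    """Load the fitted estimator from the unpacked model.tar.gz directory.

    "model.pkl" is a hypothetical name; it must match whatever the
    training script wrote.
    """
    with open(os.path.join(model_dir, "model.pkl"), "rb") as f:
        return pickle.load(f)


def predict_fn(input_data, model):
    """Return the cluster labels computed during fit().

    input_data is ignored on purpose: estimators such as DBSCAN cannot
    label unseen points, so this only exposes the training labels.
    """
    return model.labels_
```

If you would rather cluster each request batch from scratch, predict_fn could call model.fit_predict(input_data) instead, with the caveat that the labels are then not comparable between requests.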

Related

How to extract weights from trained tensorflow object detection api model

I am using the TensorFlow Object Detection API to train a couple of models (with SSD and Faster R-CNN) on a custom dataset. Everything works well, but I want to know how to extract the convolutional and classification model weights, in order to load them into a corresponding external (for instance Keras) convolutional and fully connected model. I've read about the meta architectures (SSDMetaArch and FasterRCNNMetaArch) and about restoring checkpoints, but I am not yet sure how to do it for my purpose.
I ask because I want to use something like CAM or Grad-CAM to visually check what the model learns for every class in my dataset.
Thank you

Strategies to assign specific weights to training instances

I am working on a machine learning classification model in which the user can provide labeled instances that should help improve the model.
More relevance needs to be given to the latest instances provided by the user than to the instances that were previously available for training.
In particular, I am developing my machine learning models in Python using the sklearn libraries.
So far, the only strategy I've found as a possible solution is oversampling particular instances. With this strategy I would create multiple copies of the instances to which I want to give higher relevance.
Another strategy that I've found, but that does not seem to help under these conditions, is:
Strategies that focus on giving weights to each class. This strategy is widely used and available by default in multiple libraries like sklearn. However, it generalizes the idea to the class level and doesn't help me focus on particular instances.
I've looked for multiple strategies that might help provide specific weights for individual instances, but most have focused on class level instead of instance level weights.
I read some suggestions to multiply the loss function by some factor per instance in TensorFlow models, but this seems to be mostly applicable to neural network models in TensorFlow.
I wonder if anyone has information on other approaches that might help with this problem.
"I've looked for multiple strategies that might help provide specific weights for individual instances, but most have focused on class level instead of instance level weights."
This is not accurate; most scikit-learn classifiers provide a sample_weight argument in their fit methods, which does exactly that. For example, here is the documentation reference for Logistic Regression:
sample_weight : array-like, shape (n_samples,) optional
Array of weights that are assigned to individual samples. If not provided, then each sample is given unit weight.
Similar arguments exist for most scikit-learn classifiers, e.g. decision trees, random forests etc, even for linear regression (not a classifier). Be sure to check the SVM: Weighted samples example in the docs.
The situation is roughly similar for other frameworks; see for example my own answer in "Is there in PySpark a parameter equivalent to scikit-learn's sample_weight?"
What's more, scikit-learn also provides a utility function to compute sample_weight in cases of imbalanced datasets: sklearn.utils.class_weight.compute_sample_weight
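For the use case in the question, a minimal sketch of weighting a recent instance more heavily through sample_weight could look as follows. The toy data and the weight of 5 are made up for illustration, and DecisionTreeClassifier is chosen only because it makes the effect easy to see.

```python
from sklearn.tree import DecisionTreeClassifier

# Toy data: one feature, five instances. The last instance is the "recent"
# one supplied by the user and contradicts two older instances at x = 1.
X = [[0], [0], [1], [1], [1]]
y = [0, 0, 0, 0, 1]

# One weight per instance; the recent instance counts five times as much.
sample_weight = [1, 1, 1, 1, 5]

unweighted = DecisionTreeClassifier(random_state=0).fit(X, y)
weighted = DecisionTreeClassifier(random_state=0).fit(X, y, sample_weight=sample_weight)

print(unweighted.predict([[1]]))  # the two older instances at x = 1 outvote the recent one
print(weighted.predict([[1]]))    # the heavily weighted recent instance wins
```

In practice the weights could follow any scheme that reflects recency, for example a fixed boost for user-supplied instances or a decay with the age of each instance.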

Is there any way to preserve the internal variables of a trainer in MxNet?

I wrote a program that contains an algorithm called distributed randomized gradient descent (DRGD). The algorithm has some internal variables that are used to calculate the step lengths. Real training algorithms are probably much more complex than DRGD, so they should have even more internal variables. If we preserve these variables, we can pause training, test the model, and then resume training.
You can save the states of the trainer and resume training by calling the .save_states() and .load_states() functions of the Trainer class during training with MXNet Gluon.
Here is an example:
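from mxnet import gluon

# net is assumed to be an already initialized Gluon network
trainer = gluon.Trainer(net.collect_params(), 'adam')
trainer.save_states('training.states')  # write the trainer (optimizer) states to disk
trainer.load_states('training.states')  # restore them before resuming training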
If you want to store some data across multiple devices (GPUs or machines), you can use KVStore. Here is the tutorial on how to use it.
Please note that KVStore is considered quite an advanced feature and should be used with care.
I am not sure, but what you call a "trainer" may actually be called an "Optimizer" in the MXNet world, so please consider reading this API page as well.

Exporting Tensorflow model to Google Cloud Storage

I am trying to export my model to Google Cloud Storage. I used tf.contrib.learn to build my model and followed the iris classification example.
After my training and evaluation is done I would like to store the model on the cloud so I can make predictions, but I don't know how to export the model.
classifier = tf.contrib.learn.DNNClassifier(feature_columns=feature_columns,
                                            hidden_units=[100],
                                            n_classes=50,
                                            model_dir="Model_Logs")
The best example of training on the cloud is probably census (canned estimator) or census (custom estimator). They use the same Estimator API, so that part should be familiar. In addition, they use the Experiment class to help perform the training automatically. The train_and_evaluate method is called on that class by learn_runner.run, which will export the model if properly configured; this basically boils down to setting the export_strategy and the model_dir.
If you want to do things outside the Experiment and learn_runner frameworks, you can just call Estimator.export_savedmodel.

Applying custom costfunction in TensorFlow's SKFlow model training

I'm trying to make a regression model with TensorFlow while using the sklearn implementation so it plays nicely with all the other models I've made. However I cannot seem to find a way to train the model with a custom score function (cost function or objective function).
Is this simply impossible with skflow?
Thanks loads!
Many of the examples use learn.models.logistic_regression, which is basically a built-in high-level model that returns predictions and losses. For example, models.logistic_regression uses ops.losses_ops.softmax_classifier, which means you can look into how ops.losses_ops.softmax_classifier is implemented and implement your own loss function, perhaps using TensorFlow's low-level APIs.
