I am not using any of the recommended algorithms in the AWS SageMaker Examples. My question is simply whether this error occurs because of the way I create the Docker image.
I use a MacBook Pro M1 to create the Docker image, and I am launching training on an ml.m5.xlarge instance. (First, I want to know if this is the problem.)
I also want to mention that my algorithm is a bit unusual, in the sense that it is a raw RL job, not optimised with Ray or Stable Baselines, so when I tried to use RLEstimator in SageMaker instead of the Estimator class I wasn't able to execute the training. However, I do not believe this error is due to that problem. I'm hoping to get some insight from anyone who has experience with AWS SageMaker and Docker.
I have built an XGBoost classifier and a RandomForest classifier for an audio classification project. I want to deploy these models, which are saved in pickle (.pkl) format, on AWS SageMaker. From what I have observed, there aren't a lot of resources available online. Can anyone guide me through the steps and, if possible, also provide the code? I already have the models built and I'm just left with deploying them on SageMaker.
By saying that you want to deploy to SageMaker, I assume you mean a SageMaker endpoint.
The answer is the SageMaker inference toolkit. It's basically about teaching SageMaker how to load your model and run inference. More details here: https://github.com/aws/sagemaker-inference-toolkit and here is an example implementation: https://github.com/aws/amazon-sagemaker-examples/tree/master/advanced_functionality/multi_model_bring_your_own
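To make that concrete, here is a minimal sketch of an inference script for pickled scikit-learn / XGBoost models, assuming the model_fn / input_fn / predict_fn / output_fn convention used by SageMaker's framework containers (which are built on the inference toolkit). The file name, model file name, and JSON payload shape are assumptions, not anything from the question.

```python
# inference.py -- sketch of a custom handler for a pickled classifier.
# Assumes model.pkl was packaged into the model.tar.gz uploaded to S3 and
# that requests arrive as JSON like {"instances": [[...], [...]]}.
import json
import os
import pickle


def model_fn(model_dir):
    """Load the pickled classifier from the extracted model artifact."""
    with open(os.path.join(model_dir, "model.pkl"), "rb") as f:
        return pickle.load(f)


def input_fn(request_body, request_content_type):
    """Deserialize the request body into a batch of feature vectors."""
    if request_content_type == "application/json":
        return json.loads(request_body)["instances"]
    raise ValueError(f"Unsupported content type: {request_content_type}")


def predict_fn(input_data, model):
    """Run the loaded scikit-learn / XGBoost model on the batch."""
    return model.predict(input_data)


def output_fn(prediction, accept):
    """Serialize predictions back to JSON."""
    return json.dumps({"predictions": prediction.tolist()})
```

From there you would package the .pkl into a model.tar.gz, point a model object from the SageMaker Python SDK (for example SKLearnModel with entry_point="inference.py") at it, and call .deploy(); the exact container and framework version depend on your setup.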
I started working with the Azure Machine Learning service. It has a feature called Pipelines, which I'm currently trying to use. There are, however, a bunch of things that are completely unclear from the documentation and the examples, and I'm struggling to fully grasp the concept.
When I look at 'batch scoring' examples, it is implemented as a Pipeline Step. This raises the question: does this mean that the 'predicting part' is part of the same pipeline as the 'training part', or should there be 2 separate pipelines for this? Making 1 pipeline that combines both steps seems odd to me, because you don't want to run your predicting part every time you change something in the training part (and vice versa).
What parts should be implemented as a Pipeline Step and what parts shouldn't? Should the creation of the Datastore and Dataset be implemented as a step? Should registering a model be implemented as a step?
What isn't shown anywhere is how to deal with the model registry. I create the model in the training step and then write it to the output folder as a pickle file. Then what? How do I get the model in the next step? Should I pass it on as a PipelineData object? Should train.py itself be responsible for registering the trained model?
Anders has a great answer, but I'll expand on #1 a bit. In the batch scoring examples you've seen, the assumption is that there is already a trained model, which could be coming from another pipeline, or in the case of the notebook, it's a pre-trained model not built in a pipeline at all.
However, running both training and prediction in the same pipeline is a valid use case. Use the allow_reuse param and set it to True, which will cache the step output in the pipeline to prevent unnecessary reruns.
Take a model training step, for example, and consider the following inputs to that step:
training script
input data
additional step params
If you set allow_reuse=True and your training script, input data, and other step params are the same as the last time the pipeline ran, it will not rerun that step; it will use the cached output from that earlier run. But if, say, your input data changed, then the step would rerun.
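A minimal sketch of what that looks like in code, with made-up script and compute target names (nothing here is specific to the question):

```python
# Sketch: enable step-output caching on a training step.
from azureml.pipeline.steps import PythonScriptStep

train_step = PythonScriptStep(
    name="train",
    script_name="train.py",
    source_directory="./src",
    compute_target="cpu-cluster",
    arguments=["--learning-rate", 0.01],
    # With allow_reuse=True the step is skipped and its cached output reused
    # whenever the script, inputs, and arguments are unchanged since the last run.
    allow_reuse=True,
)
```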
In general, pipelines are pretty modular and you can build them how you see fit. You could maintain separate pipelines for training and scoring, or bundle everything in one pipeline but leverage the automatic caching.
Azure ML pipeline best practices are emergent, so I can give you some recommendations, but I wouldn't be surprised if others respond with divergent, deeply-held opinions. The Azure ML product group is also improving and expanding the product at a phenomenal pace, so I fully expect things to change (for the better) over time. This article does a good job of explaining ML pipelines.
3 Passing a model to a downstream step
How do I get the model in the next step?
During development, I recommend that you don't register your model and that the scoring step receives your model via a PipelineData as a pickled file.
In production, the scoring step should use a previously registered model.
Our team uses a PythonScriptStep that has a script argument allowing a model to be passed from an upstream step or fetched from the registry. Our batch score step uses a PipelineData named best_run_data, which contains the best model (saved as model.pkl) from a HyperDriveStep.
The definition of our batch_score_step has a boolean argument, '--use_model_registry', that determines whether to use the recently trained model or the model registry. We use a function, get_model_path(), to pivot on that script arg.
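A sketch of how that pivot might look; the model name, argument parsing, and file layout are assumptions (Model.get_model_path is the azureml-core helper for resolving a registered model):

```python
# batch_score.py -- sketch of choosing between a freshly trained model
# (passed in via the best_run_data PipelineData) and the model registry,
# driven by the --use_model_registry script argument.
import argparse
import os
import pickle

from azureml.core import Run
from azureml.core.model import Model


def get_model_path(use_model_registry, best_run_data_dir):
    if use_model_registry:
        # Resolve the latest registered version from the workspace.
        ws = Run.get_context().experiment.workspace
        return Model.get_model_path("my-registered-model", _workspace=ws)
    # Otherwise use the model.pkl written by the upstream HyperDriveStep.
    return os.path.join(best_run_data_dir, "model.pkl")


parser = argparse.ArgumentParser()
parser.add_argument("--use_model_registry", type=lambda s: s.lower() == "true", default=False)
parser.add_argument("--best_run_data", type=str, default=None)
args = parser.parse_args()

with open(get_model_path(args.use_model_registry, args.best_run_data), "rb") as f:
    model = pickle.load(f)
```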
2 Control Plane vs Data Plane
What parts should be implemented as a Pipeline Step and what parts shouldn't?
All transformations you do to your data (munging, featurization, training, scoring) should take place inside PipelineSteps, and their inputs and outputs should be PipelineDatas.
Azure ML artifacts should be:
- created in the pipeline control plane using PipelineData, and
- registered either:
- ad-hoc, as opposed to with every run, or
- when you need to pass artifacts between pipelines.
In this way, PipelineData is the glue that connects pipeline steps directly, rather than the steps being connected indirectly with .register() and .download().
PipelineDatas are ultimately just ephemeral directories that can also be used as placeholders, before steps are run, to create and register artifacts.
Datasets are abstractions of PipelineDatas in that they make things easier to pass to AutoMLStep, HyperDriveStep, and DataDrift.
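As a sketch of that glue (step names, scripts, and compute target are placeholders): a PipelineData produced by the training step is consumed directly by the scoring step, with no intermediate registration.

```python
# Sketch: PipelineData connects the training step's output directory to the
# scoring step's input.
from azureml.core import Datastore, Workspace
from azureml.pipeline.core import Pipeline, PipelineData
from azureml.pipeline.steps import PythonScriptStep

ws = Workspace.from_config()
datastore = Datastore.get(ws, "workspaceblobstore")

model_dir = PipelineData("model_dir", datastore=datastore)

train_step = PythonScriptStep(
    name="train",
    script_name="train.py",
    source_directory="./src",
    compute_target="cpu-cluster",
    arguments=["--output_dir", model_dir],  # train.py writes model.pkl here
    outputs=[model_dir],
)

score_step = PythonScriptStep(
    name="batch_score",
    script_name="batch_score.py",
    source_directory="./src",
    compute_target="cpu-cluster",
    arguments=["--model_dir", model_dir],   # batch_score.py reads model.pkl from here
    inputs=[model_dir],
)

pipeline = Pipeline(workspace=ws, steps=[train_step, score_step])
```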
1 Pipeline encapsulation
does this mean that the 'predicting part' is part of the same pipeline as the 'training part', or should there be 2 separate pipelines for this?
Your pipeline architecture depends on whether:
you need to predict live (else batch prediction is sufficient), and
your data is already transformed and ready for scoring.
If you need live scoring, you should deploy your model. If batch scoring is fine, you could either:
have a training pipeline at the end of which you register a model that is then used in a scoring pipeline, or
do as we do and have one pipeline that can be configured to do either, using script arguments.
The workflow:
to preprocess our raw data we use PySpark. We need to use Spark because of the size of the data.
the PySpark preprocessing job uses a pipeline model that allows you to export your preprocessing logic to a file.
by exporting the preprocessing logic via a pipeline model, you can load the pipeline model back at inference time; this way you don't need to code your preprocessing logic twice (a sketch of this export step follows below).
at inference time, we would prefer to do the preprocessing step without a Spark context. The Spark context is redundant at inference time and slows down the inference.
I was looking at MLeap, but it only supports the Scala language for doing inference without a Spark context. Since we use PySpark, it would be nice to stick to the Python language.
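For context, the training-time export described above looks roughly like this; column names and paths are made-up placeholders, and loading the exported model back still requires an active Spark context, which is exactly the problem:

```python
# Sketch: fit a preprocessing-only pipeline in PySpark and export it to disk.
from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
raw_df = spark.read.parquet("s3://bucket/raw/")

preprocessing = Pipeline(stages=[
    StringIndexer(inputCol="category", outputCol="category_idx"),
    VectorAssembler(inputCols=["category_idx", "feature_1", "feature_2"],
                    outputCol="features"),
])

model = preprocessing.fit(raw_df)
model.write().overwrite().save("s3://bucket/preprocessing_pipeline/")

# Loading the exported pipeline model still needs a SparkSession:
reloaded = PipelineModel.load("s3://bucket/preprocessing_pipeline/")
```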
Question:
What is a good alternative that lets you build a pipeline model in (Py)Spark during the training phase and lets you reuse this pipeline model from Python without the need for a Spark context?
What's the best way to run machine learning algorithms on Google Cloud Dataflow? I can imagine that using Mahout would be one option, given it's Java-based.
The answer is probably no, but is there a way to invoke R- or Python-based scripts (which have strong support for these algorithms) to offload ML execution?
-Girish
You can already implement many algorithms in terms of Dataflow transforms.
A class of algorithms that may not be as easy to implement are iterative algorithms, where the pipeline's execution graph depends on the data itself. Simplifying implementation of iterative algorithms is something that we are interested in, and you can expect future improvements and simplifications in this area.
Invoking a Python (or any other) executable shouldn't be hard from a Dataflow pipeline. A ParDo can, for example, shell out and start an arbitrary process. You can use, for example, the --filesToStage pipeline option to add additional files to the Dataflow worker environment.
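As an illustration of the ParDo idea, the sketch below uses the Apache Beam Python SDK and shells out to a made-up scoring binary with subprocess; note that the --filesToStage option mentioned above belongs to the Java SDK, while Python pipelines stage dependencies via setup.py or --extra_packages.

```python
# Sketch: a DoFn that starts an arbitrary external process per element.
import subprocess

import apache_beam as beam


class ScoreWithExternalProcess(beam.DoFn):
    def process(self, element):
        # Shell out to a (hypothetical) scoring binary staged on the worker.
        result = subprocess.run(
            ["./score_model", "--input", element],
            capture_output=True, text=True, check=True,
        )
        yield result.stdout.strip()


with beam.Pipeline() as p:
    (p
     | "Read" >> beam.io.ReadFromText("gs://my-bucket/features.csv")
     | "Score" >> beam.ParDo(ScoreWithExternalProcess())
     | "Write" >> beam.io.WriteToText("gs://my-bucket/predictions"))
```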
There is also http://quickml.org/ (haven't used it personally) and Weka. I remember the docs mentioning that it's possible to launch a new process from within the job, but AFAIK it's not recommended.