AWS SageMaker ML DevOps tooling / architecture - Kubeflow?

I'm tasked with defining AWS tools for ML development at a medium-sized company. Assume about a dozen ML engineers plus other DevOps staff familiar with serverless (Lambdas and the Serverless Framework). The main questions are: a) what is an architecture that allows for the main tasks related to ML development (creating, training, fitting models, data pre-processing, hyperparameter optimization, job management, wrapping serverless services, gathering model metrics, etc.), b) what are the main tools that can be used for packaging and deploying things, and c) what are the development tools (IDEs, SDKs, 'frameworks') used for it?
I just want to set Jupyter notebooks aside for a second. Jupyter notebooks are great for proof-of-concepts and the closest thing to PowerPoint for management... But I have a problem with notebooks when thinking about deployable units of code.
My intuition points to a preliminary target architecture with 5 parts:
1 - A 'core' with ML models supporting basic model operations (create blank, create pre-trained, train, test/fit, etc.). I foresee core Python scripts here - no problem.
2 - (optional) A 'containerized-set-of-things' that performs hyperparameter optimization and/or model versioning.
3 - A 'contained-unit-of-Python-scripts-around-models' that exposes an API, does job management and incorporates data pre-processing. This also reads from and writes to S3 buckets.
4 - A 'serverless layer' with a high-level API (in Python). It talks to #3 and/or #1 above.
5 - Some container or bundling thing that will unpack files from Git and deploy them onto various AWS services, creating the things from the previous points.
As you can see, my terms are rather fuzzy :) If someone can be specific with terms, that will be helpful.
My intuition and my preliminary readings say that the answer will likely include a local IDE like PyCharm or Anaconda, or a cloud-based IDE (what could these be? - don't mention notebooks, please).
The point that I'm not really clear about is #5. Candidates include Amazon SageMaker Components for Kubeflow Pipelines and/or the AWS Step Functions Data Science SDK for SageMaker. It's unclear to me how they can perform #5, however. Kubeflow looks very interesting, but does it have enough adoption, or will it die in 2 years? Are Amazon SageMaker Components for Kubeflow Pipelines and the AWS Step Functions Data Science SDK for SageMaker mutually exclusive? How can each of them help with 'containerizing things' and with basic provisioning and deployment tasks?

It's a long question, but these things totally make sense when you design ML infrastructure for production. There are three levels that define the maturity of your machine learning process:
1 - CI/CD: here a Docker image goes through stages like build and test, and the versioned training image is pushed to a registry. You can also perform training in these stages and store versioned models using Git references.
2 - Continuous training: here we deal with the ML pipeline, automating the process of retraining models on new data. It becomes very useful when you have to run the whole ML pipeline with new data or a new implementation. Tools for implementation: Kubeflow Pipelines, SageMaker, Nuclio.
3 - Continuous delivery: where to? Cloud or edge? On the cloud you can use KFServing, or combine SageMaker with Kubeflow Pipelines and deploy the model to SageMaker through Kubeflow.
SageMaker and Kubeflow give somewhat the same functionality, but each has its own unique power. Kubeflow has the power of Kubernetes, pipelines, portability, caching and artifacts; meanwhile SageMaker has the power of managed infrastructure, scale-from-zero capability and AWS ML services like Athena or Ground Truth (a pipeline sketch combining the two follows below).
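To make levels 2 and 3 concrete, here is a minimal sketch of a Kubeflow pipeline that delegates training to SageMaker through the AWS-provided component. This is a sketch under assumptions: the component URL points at the kubeflow/pipelines master branch, the role ARN, ECR image and S3 paths are placeholders, and the component's exact parameter names can vary between versions.

```python
# Kubeflow pipeline that delegates training to SageMaker via the
# AWS-provided component. ARNs, image names and S3 paths are placeholders.
import json
import kfp
import kfp.dsl as dsl
from kfp import components

sagemaker_train_op = components.load_component_from_url(
    "https://raw.githubusercontent.com/kubeflow/pipelines/master/"
    "components/aws/sagemaker/train/component.yaml"
)

# Input channel definition in SageMaker's JSON format (placeholder bucket).
CHANNELS = json.dumps([{
    "ChannelName": "train",
    "DataSource": {"S3DataSource": {
        "S3Uri": "s3://my-bucket/train",
        "S3DataType": "S3Prefix",
        "S3DataDistributionType": "FullyReplicated",
    }},
}])

@dsl.pipeline(name="sagemaker-training")
def train_pipeline(
    role_arn: str = "arn:aws:iam::111122223333:role/sagemaker-execution-role",
    image: str = "111122223333.dkr.ecr.us-east-1.amazonaws.com/my-training:latest",
):
    # One pipeline step: launch a SageMaker training job and record
    # where the model artifact should land.
    sagemaker_train_op(
        region="us-east-1",
        image=image,
        instance_type="ml.m5.xlarge",
        instance_count=1,
        channels=CHANNELS,
        model_artifact_path="s3://my-bucket/models/",
        role=role_arn,
    )

if __name__ == "__main__":
    kfp.compiler.Compiler().compile(train_pipeline, "train_pipeline.yaml")
```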
Solution:
Kubeflow Pipelines standalone + AWS SageMaker (training + model serving) + Lambda to trigger pipelines from S3 or Kinesis (see the Lambda sketch after the infrastructure list below).
Infra required:
- Kubernetes cluster (at least one m5 instance)
- MinIO or S3
- Container registry
- SageMaker credentials
- MySQL or RDS
- Load balancer
- Ingress for using the Kubeflow SDK
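A minimal sketch of the Lambda trigger mentioned in the solution, assuming the kfp SDK is bundled with the function and the KFP API is reachable through the ingress above; the environment variables, pipeline ID and parameter names are placeholders:

```python
# Lambda handler that starts a Kubeflow Pipelines run when a new
# object lands in S3. Assumes the kfp SDK is packaged with the function.
import os
import kfp

KFP_HOST = os.environ["KFP_HOST"]        # e.g. the ingress URL (placeholder)
PIPELINE_ID = os.environ["PIPELINE_ID"]  # ID of a previously uploaded pipeline

def handler(event, context):
    # Pull the bucket/key of the object that triggered us from the S3 event.
    record = event["Records"][0]["s3"]
    s3_uri = f"s3://{record['bucket']['name']}/{record['object']['key']}"

    client = kfp.Client(host=KFP_HOST)
    # Create (or reuse) an experiment to group retraining runs.
    experiment = client.create_experiment("retraining")
    run = client.run_pipeline(
        experiment_id=experiment.id,
        job_name=f"retrain-{record['object']['key']}",
        pipeline_id=PIPELINE_ID,
        params={"input_data": s3_uri},  # pipeline parameter name is illustrative
    )
    return {"run_id": run.id}
```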
Again, you have asked about my year-long journey in one question. If you are interested, let's connect :)
Permissions:
Kube --> registry (Read)
Kube --> S3 (Read, Write)
Kube --> RDS (Read, Write)
Lambda --> S3 (Read)
Lambda --> Kube (API Access)
SageMaker --> S3, Registry
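As an illustration of wiring one of these grants with boto3, here is a hedged sketch of the Lambda --> S3 (Read) permission; the role, policy and bucket names are hypothetical:

```python
# Attach a read-only S3 policy to the Lambda's execution role.
# Role, policy and bucket names here are hypothetical placeholders.
import json
import boto3

iam = boto3.client("iam")

policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:GetObject", "s3:ListBucket"],
        "Resource": [
            "arn:aws:s3:::ml-input-bucket",
            "arn:aws:s3:::ml-input-bucket/*",
        ],
    }],
}

iam.put_role_policy(
    RoleName="kfp-trigger-lambda-role",
    PolicyName="s3-read-input",
    PolicyDocument=json.dumps(policy),
)
```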
Some good starting guides:
https://github.com/kubeflow/pipelines/tree/master/manifests/kustomize/env/aws
https://aws.amazon.com/blogs/machine-learning/introducing-amazon-sagemaker-components-for-kubeflow-pipelines/
https://github.com/shashankprasanna/kubeflow-pipelines-sagemaker-examples/blob/master/kfp-sagemaker-custom-container.ipynb

Related

Export a set up virtual machine from cloud infrastructure to use it locally

I need to perform some machine learning tasks using a TensorFlow-based neural network architecture (PointNet, https://github.com/charlesq34/pointnet). I would like to use cloud infrastructure to do this, because I do not have the physical resources needed. The customer demands that they get the whole set-up machine I used for the training afterwards, not only the final model. This is because they are researchers who would like to use the machine themselves, play around and understand what I did, but they do not want to do the setup/installation work on their own. Unfortunately, they cannot provide a (physical or virtual) machine themselves right now.
The question is: is it possible/reasonable to set up a machine on a cloud infrastructure provider like Google Cloud or AWS, install the needed software (which uses Nvidia CUDA), export this machine later when suitable hardware is available, import it into a virtualisation tool (like VirtualBox) and continue using it on one's own system? Will the installed GPU/CUDA-related software like TensorFlow etc. still work?
I guess it's possible, but you will need to configure the specific hardware to make it work in the local environment.
For Google Cloud Platform, Deep Learning Containers will allow you to create portable environments:
Deep Learning Containers are a set of Docker containers with key data science frameworks, libraries, and tools pre-installed. These containers provide you with performance-optimized, consistent environments that can help you prototype and implement workflows quickly.
In addition, please check Running Instances with GPU accelerators
Google provides a seamless experience for users to run their GPU workloads within Docker containers on Container-Optimized OS VM instances so that users can benefit from other Container-Optimized OS features such as security and reliability as well.
To configure Docker with VirtualBox, please check this external blog.
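To illustrate the portable-environment idea, here is a small sketch using the Docker SDK for Python; the Deep Learning Container image tag is an example and should be swapped for the framework/version you actually need:

```python
# Pull a GCP Deep Learning Container and run a quick TensorFlow check inside it.
# The image tag is an example; pick the framework/version you actually need.
import docker

client = docker.from_env()
image = "gcr.io/deeplearning-platform-release/tf2-gpu.2-8"

client.images.pull(image)
output = client.containers.run(
    image,
    command=["python", "-c", "import tensorflow as tf; print(tf.__version__)"],
    remove=True,  # clean up the container after it exits
)
print(output.decode())
```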

Is Serverless Cloud Native too?

Is serverless a subset or attribute of cloud native? Or is it the other way round: is cloud native a subset or attribute of serverless?
Nathan Aw (Singapore)
Cloud native is a more general approach to building and running applications that take advantage of cloud computing. Serverless is more of an execution model in the cloud.
A Cloud native stack will usually aim to make use of containers and microservices:
Each part of the stack is packaged in its own container. This promotes reproducibility, transparency, and resource isolation. Dynamically orchestrated containers are then actively scheduled and managed to optimize resource utilization.
Applications are also segmented or broken-down into microservices, which are more easily testable and maintainable, are loosely-coupled, and independently deployable.
Serverless describes a model of providing backend services on an as-used basis. The cloud provider (AWS Lambda/Google Cloud Functions/Azure Functions) is responsible for executing a piece of code by dynamically allocating the resources.
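As a minimal illustration of that execution model, here is a Python function as it would run on AWS Lambda; the event shape is just an example:

```python
# Minimal serverless function: the provider allocates resources, runs the
# handler once per invocation, and scales to zero when idle.
def handler(event, context):
    # 'event' carries the request payload; its shape depends on the trigger.
    name = event.get("name", "world")
    return {"statusCode": 200, "body": f"Hello, {name}!"}
```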
Many of today's apps apply elements of both.

How can I make an API after compiling and running my pipeline on Kubeflow?

I built a pipeline which takes an image and returns the number of persons in it. I want to make an API which takes an image and returns a JSON file with the count, using Kubeflow.
There are a few ways that you can deploy a model for inference from a pipeline:
You can use Kubeflow components like KFServing and the KFServing deployer component for Kubeflow Pipelines (see the sketch after this list)
If you are using a cloud provider, they may have services you can use for inference. For example, there is a component that deploys trained models to Google Cloud AI Platform
Or, you could build a custom solution
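A hedged sketch of the first option: loading the KFServing deployer component into a pipeline. The component URL follows the kubeflow/pipelines repo layout, the parameter names are assumptions that may differ by component version, and the model URI is a placeholder:

```python
# Pipeline step that deploys a trained model behind an HTTP endpoint
# via the KFServing deployer component. URL and parameters are illustrative.
import kfp
import kfp.dsl as dsl
from kfp import components

kfserving_op = components.load_component_from_url(
    "https://raw.githubusercontent.com/kubeflow/pipelines/master/"
    "components/kubeflow/kfserving/component.yaml"
)

@dsl.pipeline(name="person-counter-serving")
def serving_pipeline(model_uri: str = "s3://my-bucket/person-counter/model"):
    # Creates an inference service that accepts images and returns the count.
    kfserving_op(
        action="create",
        model_name="person-counter",
        model_uri=model_uri,
        framework="tensorflow",
    )

if __name__ == "__main__":
    kfp.compiler.Compiler().compile(serving_pipeline, "serving_pipeline.yaml")
```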

How to fully manage the ML lifecycle in Git for reproducibility

All components of the end-to-end machine learning lifecycle, including data preparation steps, data cleaning, model training code, models and more, should be stored and version controlled in Git.
Can you please share the steps to do this?
Azure Machine Learning Pipelines support all of this. Further along the deployment and scale maturity curve, use Kubeflow + MLflow to manage the scheduling of jobs. Both are open source, and platform and framework agnostic.
Azure MLOps provides comprehensive ML lifecycle management.
MLOps Happy Path: an end-to-end happy-path solution for MLOps
Azure Pipelines + ML CLI: an example Azure DevOps pipeline that uses the ML CLI
Kubeflow Labs: https://github.com/Azure/kubeflow-labs
Our team does this by:
1. Using the SDK and Azure ML Pipelines to encapsulate all code and artifact creation within the pipeline control plane
2. Using pygit2 to coordinate artifact registration with feature branch names (see the sketch below)
The Azure ML MLOps team also has this repo which provides more info on step 1.
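A minimal sketch of step 2, assuming the Azure ML SDK (v1) and a local clone of the training repo; the workspace config, model path and tag names are placeholders:

```python
# Tag a registered model with the current feature branch so the artifact
# can be traced back to the Git state that produced it.
import pygit2
from azureml.core import Workspace, Model

repo = pygit2.Repository(".")       # assumes we run from the repo root
branch = repo.head.shorthand        # e.g. "feature/new-preprocessing"
commit = str(repo.head.target)      # full commit SHA

ws = Workspace.from_config()        # reads config.json for the workspace

Model.register(
    workspace=ws,
    model_path="outputs/model.pkl",  # placeholder artifact path
    model_name="my-model",
    tags={"git_branch": branch, "git_commit": commit},
)
```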

What is the difference between google's Dataflow and google's dataproc?

Dataflow itself handles ETL, computation and streaming, so why do we need to go for Google's Dataproc?
Google Dataflow is a fully managed and self-optimizing cloud service that lets you use the Apache Beam programming model to write your batch and streaming data processing pipelines. It's integrated with many open source and Google Cloud data sources and sinks.
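To give a feel for the Beam model that Dataflow executes, here is a minimal batch pipeline; the bucket paths are placeholders, and without extra options it runs locally on the DirectRunner:

```python
# Minimal Apache Beam pipeline: count words in text files.
# Run locally on the DirectRunner, or on Dataflow by passing
# --runner=DataflowRunner plus project/region/temp_location options.
import apache_beam as beam

with beam.Pipeline() as p:
    (
        p
        | "Read" >> beam.io.ReadFromText("gs://my-bucket/input/*.txt")
        | "Split" >> beam.FlatMap(lambda line: line.split())
        | "Pair" >> beam.Map(lambda word: (word, 1))
        | "Count" >> beam.CombinePerKey(sum)
        | "Format" >> beam.MapTuple(lambda word, n: f"{word}: {n}")
        | "Write" >> beam.io.WriteToText("gs://my-bucket/output/counts")
    )
```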
Google Dataproc is a fully managed cloud service for running Apache Hadoop and Apache Spark clusters in a simple, cost-effective way. If you have existing data processing pipelines that use Spark, Hive, or Pig, this is a quick and easy way to move them. You can install custom packages and start, stop and scale these clusters at any time. On top of that, Google Dataproc is integrated with many of Google Cloud's data services.
