Kubernetes Machine Learning Model Serving - machine-learning

Is there a suggested way to serve hundreds of machine learning models in Kubernetes?
Solutions like KFServing seem more suitable for cases where there is a single trained model, or a few versions of it, and this model serves all requests. For instance, a typeahead model that is universal across all users.
But is there a suggested way to serve hundreds or thousands of such models? For example, a typeahead model trained specifically on each user's data.
The most naive way to achieve something like that would be for each typeahead serving container to maintain a local in-memory cache of models. But then scaling to multiple pods would be a problem because each cache is local to its pod, so each request would need to be routed to the pod that has loaded the right model.
Maintaining a registry that tracks which pod has loaded which model, and updating it on model eviction, also seems like a lot of work.
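To make the naive approach above concrete, here is a minimal sketch rather than a recommended design. It assumes each pod keeps a small LRU cache of loaded models, and it swaps the registry for rendezvous hashing, so every router computes the same pod for a given model without shared state. The names `ModelCache`, `pick_pod`, and the `loader` callback are hypothetical and do not come from any serving framework.

```python
import hashlib
from collections import OrderedDict
from typing import Any, Callable, List


class ModelCache:
    """Per-pod LRU cache of loaded models (hypothetical helper)."""

    def __init__(self, loader: Callable[[str], Any], capacity: int) -> None:
        self._loader = loader          # assumed to fetch a model by ID, e.g. from object storage
        self._capacity = capacity      # maximum number of models kept in memory
        self._models: "OrderedDict[str, Any]" = OrderedDict()

    def get(self, model_id: str) -> Any:
        if model_id in self._models:
            self._models.move_to_end(model_id)  # mark as most recently used
            return self._models[model_id]
        model = self._loader(model_id)
        self._models[model_id] = model
        if len(self._models) > self._capacity:
            self._models.popitem(last=False)    # evict the least recently used model
        return model

    def loaded(self) -> List[str]:
        """Return cached model IDs, least recently used first."""
        return list(self._models)


def pick_pod(model_id: str, pods: List[str]) -> str:
    """Rendezvous hashing: route a model to the pod with the highest hash score."""
    def score(pod: str) -> int:
        digest = hashlib.sha256(f"{pod}:{model_id}".encode()).digest()
        return int.from_bytes(digest[:8], "big")
    return max(pods, key=score)
```

With this scheme, removing a pod only remaps the models that were assigned to it, so the other pods' caches stay warm. It does not address hot models that need more than one replica.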

You can look at Catwalk, the model serving platform built at Grab.
Grab has a tremendous amount of data that we can leverage to solve
complex problems such as fraudulent user activity, and to provide our
customers personalized experiences on our products. One of the tools
we are using to make sense of this data is machine learning (ML).
That is how Catwalk is created: an easy-to-use, self-serve, machine
learning model serving platform for everyone at Grab.
You can find more information about Catwalk here: Catwalk.
You can serve multiple Machine Learning models using TensorFlow and Google Cloud.
The reason the field of machine learning is experiencing such an epic
boom is because of its real potential to revolutionize industries and
change lives for the better. Once machine learning models have been
trained, the next step is to deploy these models into usage, making
them accessible to those who need them — be they hospitals,
self-driving car manufacturers, high-tech farms, banks, airlines, or
everyday smartphone users. In production, the stakes are high and one
cannot afford to have a server crash, connection slow down, etc. As
our customers increase their demand for our machine learning services,
we want to seamlessly meet that demand, be it at 3AM or 3PM.
Similarly, if there is a decrease in demand we want to scale down the
committed resources so as to save cost, because as we all know, cloud
resources are very expensive.
You can find more information here: machine-learning-serving.
You can also use Seldon.
Seldon Core is an open source platform for deploying machine learning models on a Kubernetes cluster.
Features:
Deploying machine learning models in the cloud or on-premise.
Gaining metrics and ensuring proper governance and compliance for your running machine learning models.
Creating inference graphs made up of multiple components.
Providing a consistent serving layer for models built using heterogeneous ML toolkits.
Useful documentation: Kubernetes-Machine-Learning.

Related

Best practices for serving user-specific large ML/DL models in a web application?

First, excuse any naive statements you may find below; I'm a newcomer to the ML/DL field.
How do web applications that integrate fine-tuning of large machine learning/deep learning models handle the storage and retrieval of these models for inference?
I'm trying to implement a web app that allows users to fine-tune a Stable Diffusion model on their own images with DreamBooth. The fine-tuned model is quite large, reaching several gigabytes. After the model is trained and saved, the app should retrieve and use the model for inference each time a user visits the site and requests one.
The current approach I am considering is to store the fine-tuned model in a compressed format in an S3 or R2 bucket. Each time a user visits the web app and requests an inference, I would retrieve the model from the bucket, decompress it, and run the inference.
That said, adding the overhead of fetching and decompression to every inference is obviously not a good idea.
I'm fairly sure there is a standard approach the community follows for such scenarios. What is it, if it exists, and how are these scenarios typically handled?

What is the difference between Deploying and Serving ML model?

I recently developed an ML model for a classification problem and would now like to put it into production to classify actual production data. While exploring, I came across two terms: deploying and serving an ML model. What is the basic difference between them?
Based on my own readings and understanding, here's the difference:
Deploying = creating a server/API (e.g. a REST API) around the model so that it can predict on new unlabelled data.
Serving = running a server that is specialized for model prediction. The idea is that it can serve multiple models and handle different requests.
Basically, if your use case requires deploying multiple ML models, you might want to look at a serving tool like TorchServe. But if it's just one model, Flask is good enough in my opinion.
Reference:
Pytorch Deploying using flask
TorchServe
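To illustrate the "deploying" case for a single model, here is a minimal sketch of a REST prediction endpoint. It uses Python's standard `http.server` in place of Flask so that it runs without extra dependencies; the `/predict` route, the JSON shape, and the placeholder `predict` function are assumptions made for this example, not part of any library.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import List


def predict(features: List[float]) -> int:
    """Placeholder classifier; a real app would call the trained model here."""
    return int(sum(features) > 0)


class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self) -> None:
        if self.path != "/predict":  # hypothetical route name
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        body = json.dumps({"label": predict(payload["features"])}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format: str, *args) -> None:
        pass  # silence per-request logging


def make_server(port: int = 8080) -> HTTPServer:
    """Create the server; call .serve_forever() on the result to start it."""
    return HTTPServer(("127.0.0.1", port), PredictHandler)
```

Calling `make_server().serve_forever()` starts the endpoint, which accepts a POST body such as `{"features": [1.0, 2.0]}`. A dedicated serving tool adds things this sketch lacks, such as hosting several models, versioning, and batching.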

Migrate from running ML training and testing locally to Google Cloud

I currently have a simple Machine Learning infrastructure running locally and I want to migrate this all onto Google Cloud. I simply fetch the data I need from a database, build my model and then test the model on test data. This is all done in PyCharm locally.
I want to simply migrate this and have the possibility for all this to be done on Google Cloud, while having the flexibility to make local changes that can apply when run on the cloud as well. There are many Google Cloud resources relating to this and so I am looking for best practices people follow on running such a procedure.
Thanks and please let me know if there are any clarifications needed.
I highly suggest you take a look at this machine learning workflow in the cloud, which consists of:
Data ingestion and collection
Storing the data
Processing data
ML training
ML deployment
Data Ingestion and Collection
There are multiple resources you can use to ingest data with Google Cloud Platform. The simplest solutions I can recommend are either Google Compute Engine or an App Engine app (for example, for a form where a user fills in some data).
If you would like to ingest data in real time, you can also use Cloud Pub/Sub.
Storing the data
As you mentioned, you are retrieving all the information from a database. If you are used to working with SQL, I highly suggest you go with Cloud SQL. Not only does it provide a good interface for building your instance, it also lets you access it securely and very rapidly.
If that is not the case, you can also use Google Cloud Storage or BigQuery. Of those two, I would pick BigQuery, since it can also work with streaming data.
Processing data
For processing data before feeding it to the model you can use either:
Cloud Dataflow: Cloud Dataflow is a fully-managed service for transforming and enriching data in stream (real time) and batch (historical) modes with equal reliability and expressiveness -- no more complex workarounds or compromises needed.
Cloud Dataproc: Dataproc is a fast, easy-to-use, fully managed cloud service for running Apache Spark and Apache Hadoop clusters in a simpler, more cost-efficient way.
Cloud Dataprep: Cloud Dataprep by Trifacta is an intelligent data service for visually exploring, cleaning, and preparing structured and unstructured data for analysis, reporting, and machine learning.
ML training & ML deployment
For training and deploying your ML model, I would suggest using AI Platform.
AI Platform makes it easy for machine learning developers, data scientists, and data engineers to take their ML projects from ideation to production and deployment, quickly and cost-effectively.
If you have to work with huge datasets, the best practice is to run the model as a TensorFlow job on AI Platform so that you can use a training cluster.
Finally, for deploying your models using AI Platform, you can take a look here.

Data processing while using tensorflow serving (Docker/Kubernetes)

I am looking to host 5 deep learning models where data preprocessing/postprocessing is required.
It seems straightforward to host each model using TF serving (and Kubernetes to manage the containers), but if that is the case, where should the data pre and post-processing take place?
I'm not sure there's a single definitive answer to this question, but I've had good luck deploying models at scale by bundling the data pre- and post-processing code into fairly vanilla Go or Python (e.g., Flask) applications that are connected to my persistent storage for other operations.
For instance, to take the movie recommendation example, on the predict route it's pretty performant to pull the 100 films a user has watched from the database, dump them into a NumPy array of the appropriate size and encoding, dispatch to the TensorFlow serving container, and then do the minimal post-processing (like pulling the movie name, description, cast from a different part of the persistent storage layer) before returning.
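The flow described above can be sketched roughly as follows. This is an illustration under assumptions, not the answerer's actual code: the host name, model name, fixed input size, and the `titles` lookup are hypothetical, and plain lists stand in for NumPy arrays. The request targets TensorFlow Serving's REST predict endpoint (`/v1/models/<name>:predict` with an `instances` payload, on the default REST port 8501).

```python
import json
import urllib.request
from typing import Dict, List


def preprocess(watched_ids: List[int], size: int = 100) -> List[int]:
    """Pad with zeros or truncate the watched-film IDs to a fixed-length input."""
    return (watched_ids + [0] * size)[:size]


def build_predict_request(host: str, model: str,
                          instances: List[List[int]]) -> urllib.request.Request:
    """Build (but do not send) a request for TensorFlow Serving's REST API."""
    url = f"http://{host}:8501/v1/models/{model}:predict"
    body = json.dumps({"instances": instances}).encode()
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def postprocess(response_body: bytes, titles: Dict[int, str],
                top_k: int = 3) -> List[str]:
    """Map the highest-scoring indices of the first prediction to film titles."""
    scores = json.loads(response_body)["predictions"][0]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [titles[i] for i in ranked[:top_k] if i in titles]
```

In the application's predict route, the request would be sent with `urllib.request.urlopen(request).read()` and the response bytes passed to `postprocess`. In practice, `titles` would come from the persistent storage layer rather than an in-memory dict.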
In addition to the options in josephkibe's answer, you can:
Implement the processing in the model itself (see signatures for Keras models and input receivers for estimators in the SavedModel guide).
Install Seldon Core. It is a whole serving framework that handles building images and networking. It builds the service as a graph of pods with different APIs; one kind of pod is a transformer, which pre- and post-processes data.

How to get a specific machine type for ML Engine online prediction?

Is there an option to request a faster node for online prediction in ML Engine?
For example, when training I can configure any of these machines for my job:
standard,
large_model,
complex_model_s,
complex_model_m,
complex_model_l,
standard_gpu,
complex_model_m_gpu,
complex_model_l_gpu,
standard_p100,
complex_model_m_p100
See description of available clusters and machines for training here and here
I am struggling to find out whether it is possible to control what kind of machine runs my online prediction.
We are currently adding that capability and will let you know when it's publicly available.
ML Engine offers a 4-core instance type in addition to the default serving instance type for online prediction. However, the feature is still at the alpha stage and is only available to a selected list of accounts that have opted in as "Trusted Testers". Please contact cloudml-feedback@google.com if you need help setting up the prediction service with a faster node.
