What is the difference between Deploying and Serving ML model? - machine-learning

Recently I have developed a ML model for classification problem and now would like to put in the production to do classification on actual production data, while exploring I have came across two methods deploying and serving ML model what is the basic difference between them ?

Based on my own readings and understanding, here's the difference:
Deploying = it means that you want to create a server/api (e.g. REST API) so that it will be able to predict on new unlabelled data
Serving = it acts as a server that is specialized for predict models. The idea is that it can serve multiple models with different requests.
Basically, if your use case requires deploying multiple ML models, you might want to look for serving like torchServe. But if it's just one model, for me, Flask is already good enough.
Reference:
Pytorch Deploying using flask
TorchServe

Related

Best practices for serving user-specific large ML/DL models in a web application?

First excuse any naive statement you may find below, i'm a newcomer to the ML/DL field.
How do web applications that integrate fine-tuning of large machine learning/deep learning models handle the storage and retrieval of these models for inference?
I'm trying to implement a web app that allows users to fine-tune a stable diffusion model using their own images with dreambooth. as the fine-tuned model is quite large reaching several gigabytes. After the model is trained and saved, the app should retrieve and use the model for inference each time a user visits the site and requests one.
The current approach I am considering is to store the fine-tuned model in a compressed format in a S3 or R2 bucket. Each time a user visits the web app and requests an inference, I would retrieve the model from the bucket, decompress it, and run the inference.
that being said adding the overhead of fetching + decompression to inference is obviously not a good idea.
I'm sort of sure that there's a standard approach that the community follows for handling such scenarios, what are those if they exist ? how typically these scenarios are handled ?

Cascading two Machine Learning models

I built a machine learning (ML) model to classify real-time network traffic as an attack or normal traffic using a dataset consisting of approximately 3 million records.
Then, I built a second ML model to classify the real-time network traffic according to their application, i.e., Google, Facebook, YouTube, etc. using another dataset consisting of approximately 1.5 million records.
Now I want to cascade these two models so that if the traffic is normal, then the traffic should be classified by the second ML model. Otherwise, it should be discarded since there is no need to pass through the second model.
Can I cascade these two models even though they are built using different datasets? And if so, how can I do that?
I do the cascading logic simply in a programming language code C++ or Python, not using ML-tool features. If the data from the second model, doesn't contribute to the decision of the first model - just keep the models separated.

Kubernetes Machine Learning Model Serving

Is there a suggested way to serve hundreds of machine learning models in Kubernetes?
Solutions like Kfserving seem to be more suitable for cases where there is a single trained model, or a few versions of it, and this model serves all requests. For instance a typeahead model that is universal across all users.
But is there a suggested way to serve hundreds or thousands of such models? For example, a typeahead model trained specifically on each user's data.
The most naive way to achieve something like that, would be that each typeahead serving container maintains a local cache of models in memory. But then scaling to multiple pods would be a problem because each cache is local to the pod. So each request would need to get routed to the correct pod that has loaded the model.
Also having to maintain such a registry where we know which pod has loaded which model and perform updates on model eviction seems like a lot of work.
You can use Catwalk mixed with Grab.
Grab has a tremendous amount of data that we can leverage to solve
complex problems such as fraudulent user activity, and to provide our
customers personalized experiences on our products. One of the tools
we are using to make sense of this data is machine learning (ML).
That is how Catwalk is created: an easy-to-use, self-serve, machine
learning model serving platform for everyone at Grab.
More infromation about Catwalk you can find here: Catwalk.
You can serve multiple Machine Learning models using TensorFlow and Google Cloud.
The reason the field of machine learning is experiencing such an epic
boom is because of its real potential to revolutionize industries and
change lives for the better. Once machine learning models have been
trained, the next step is to deploy these models into usage, making
them accessible to those who need them — be they hospitals,
self-driving car manufacturers, high-tech farms, banks, airlines, or
everyday smartphone users. In production, the stakes are high and one
cannot afford to have a server crash, connection slow down, etc. As
our customers increase their demand for our machine learning services,
we want to seamlessly meet that demand, be it at 3AM or 3PM.
Similarly, if there is a decrease in demand we want to scale down the
committed resources so as to save cost, because as we all know, cloud
resources are very expensive.
More information you cna find here: machine-learning-serving.
Also you can use Seldon.
Seldon Core is an open source platform for deploying machine learning models on a Kubernetes cluster.
Features:
deploying machine learning models in the cloud or on-premise.
gaining metrics ensuring proper governance and compliance for your
running machine learning models.
creating inference graphs made up of multiple components.
providing a consistent serving layer for models built using
heterogeneous ML toolkits.
Useful documentation: Kubernetes-Machine-Learning.

Data processing while using tensorflow serving (Docker/Kubernetes)

I am looking to host 5 deep learning models where data preprocessing/postprocessing is required.
It seems straightforward to host each model using TF serving (and Kubernetes to manage the containers), but if that is the case, where should the data pre and post-processing take place?
I'm not sure there's a single definitive answer to this question, but I've had good luck deploying models at scale bundling the data pre- and post-processing code into fairly vanilla Go or Python (e.g., Flask) applications that are connected to my persistent storage for other operations.
For instance, to take the movie recommendation example, on the predict route it's pretty performant to pull the 100 films a user has watched from the database, dump them into a NumPy array of the appropriate size and encoding, dispatch to the TensorFlow serving container, and then do the minimal post-processing (like pulling the movie name, description, cast from a different part of the persistent storage layer) before returning.
Additional options to josephkibe's answer, you can:
Implementing processing into model itself (see signatures for keras models and input receivers for estimators in SavedModel guide).
Install Seldon-core. It is a whole framework for serving that handles building images and networking. It builds service as a graph of pods with different API's, one of them are transformers that pre/post-process data.

How to deploy machine learning algorithm in production environment?

I'm new to machine learning algorithm. I'm learning basic algorithms like regression, classification, clustering, sequence modelling, on-line algorithms. All the article that are available on internet shows how to use these algorithm with specific data. There is no article regarding deployment of those algorithm in production environment. So my questions are
1) How to deploy machine learning algorithm in production environment?
2) The typical approach follows in machine learning tutorial is to build the model using some training data, use it for testing data. But is it advisable to use that kind of model in production environment? Incoming data may keep changing so the model will be ineffective. What should be duration for the model refresh cycle to accommodate such changes?
I am not sure if this is a good question (since it is too general and not formulated good), but I suggest you to read about bias - variance tradeoff. Long story short, you could have low bias\high variance machine-learning model and get 100% accurate results on your test data (the data you used to implement a model), but you could cause your model to overfit the training data. As result, when you will try to use it on data which you haven't used during training it will lead to poor performance. On the other hand, you may have high bias\low variance model, which will be poorly fit to your training data and will also perform just as bad on new production data. Keeping this in mind general guideline will be:
1) Obtain some good amount of data which you could use to build a prototype of machine-learning system
2) Split your data into train set, cross-validation set and test set
3) Create a model which will have relatively low bias (good accuracy, actually - good F1 score) on your test data. Then try this model on cross-validation set to see the results. If the results are bad - you have a high variance problem, you used a model which overfit the data and can't generalize well. Re-write your model, play with model parameters or use different algorithm. Repeat until you get a good result on CV set
4) Since we played with the model in order to get a good result on CV set, you want to test your final model on test set. If it is good - that's it, you have a final version of model and could use it on prod environment.
Second question has no answer, it is based on your data and your application. But 2 general approaches might be used:
1) Do everything I mentioned earlier to build a model with a good performance on test set. Re-train your model on new data once in some period (try different periods, but you could try to re-train your model once you see that performance of model dropped down).
2) Use online-learning approach. This is not applicable for many algorithms, but for some cases it could be used. Generally, if you see that you could use stochastic gradient descent learning method - you could use online-learning and just keep your model up-to-date with the newest production data.
Keep in mind that even if you use #2 (online-learning approach) you can't be sure that your model will be good forever. Sooner or later the data you get may change significantly and you may want to use whole different model (for example switch to ANN instead of SWM or logistic regression).
DISCLAIMER: I work for this company, Datmo building a better workflow for ML. We’re always looking to help fellow developers working on ML so feel free to reach out to me at anand#datmo.com if you have any questions.
1) In order to deploy, you should first split up your code into preprocessing, training and test. This way you can easily encapsulate the required components for deployment. Usually, you will then want to take your preprocessing, test, as well as your weights file (the output of your training process) and put them in one folder. Next, you will want to host this on a server and wrap an API server around this. I would suggest a Flask Restful API so that you can use query parameters as your inputs and output your response in standard JSON blobs.
To host it on a server, you can use this article which talks about how you can deploy a Flask API on EC2.
You can load and model and serve it as API as given in this code.
2) Hard for me to answer without more details. It's highly dependent on the type of data and the type of model. For example, for deep learning, there is no such thing as online learning.
I am sorry that my comments does not include too much detail* since I am also a newbie in "deployment" of ML. But since the author is also new in ML, I hope these basic guidance could be helpful as well.
For "deployment", you should
Have ML algorithms: You may use free-tools, or develop your own tool using libraries in Python, R, Java, .Net, .. or use a system on cloud..)
Train those ML models using training datasets
Save those trained models (You should search this topic based on your development environment. There are some file formats that Tensorflow/Keras provide, or formats like pickle, ONNX,.. I would like to write a whole list here, with their supporting language & environment, advantage&disadvantage and loadability but I am also trying to investigate this topic, as a newbie)
And THEN, you can deploy these saved-models on production. On production you should either have your own-developed application to run the saved model (For example: an application that you developed with Python that takes trained&saved .pickle file and TestData as input; and simply gives "prediction for the test data" as output) or you should have an environment/framework that runs the saved models (search for ML environments/frameworks on cloud). At first, you should clarify your need: Do you need a stand-alone program on production, or will you serve a internal web-service, or via-cloud, etc.
For the second question; as above answers indicate the issue is "online training ability" of the models. Please additionally note that; for "online learning", your production environment has to feed your production tool/system with the real-correct label of the test data as well. Will you have that capability?
Note: All above are just small "comments" instead of a clear answer, but technically I am not able to write comments yet. Thanks for not de-voting :)
Regarding the first question, my service mlrequest makes deploying models to production simple. You can get started with a free API key that provides 50k model transactions a month.
This code will train and deploy, or update your model across 5 global data centers.
from mlrequest import Classifier
classifier = Classifier('my-api-key')
features = {'feature1': 'val1', 'feature2': 45}
training_data = {'features': features, 'label': 2}
r = classifier.learn(training_data=training_data, model_name='my-model', class_count=2)
This is how you make predictions, latency-routed to the nearest data center to get the quickest response.
features = {'feature1': 'val1', 'feature2': 77}
r = classifier.predict(features=features, model_name='my-model', class_count=2)
r.predict_result
Regarding your second question, it completely depends on the problem you are solving. Some models need to be frequently updated, while others almost never need to be updated.

Resources