I have an application in which a streaming Dataflow pipeline runs inference on an incoming stream of images. It loads a TensorFlow CNN model, saved as an h5 file in a GCS location, inside a user-defined Dataflow PTransform and uses that model to do the inference.
I have been going through the GCP documentation, but the following is still not clear to me.
Instead of loading the TensorFlow model from a GCS bucket, is it possible to deploy the model on a Vertex AI endpoint and call that endpoint from a Dataflow PTransform to run inference on an image? Is it feasible?
Related
According to the documentation, you can use Cloud Storage FUSE during training. However, I cannot find in the documentation if the filesystem is also available during inference.
My goal is to load various pre-trained models from GCS. I do not want to package these models (huggingface) into the Docker image, so I am able to make changes to these models without pushing the entire image.
I'm new to SageMaker Pipelines and am doing some research on how I can train models not just in a Jupyter notebook but as a SageMaker pipeline in SageMaker Studio. I followed some examples based on the blogs/docs provided here -> https://aws.amazon.com/blogs/machine-learning/hugging-face-on-amazon-sagemaker-bring-your-own-scripts-and-data/
and was able to run these steps in a Jupyter notebook in SageMaker. If I want to set up a SageMaker pipeline and create steps for training, how can I convert this code into SageMaker pipeline steps? Any examples or blogs doing something similar in SageMaker Pipelines/Studio would be helpful.
While there is a way to convert a Jupyter notebook to a Python script, that alone is not sufficient to turn code written in a notebook context into code for a pipeline context.
The only things that remain intact are the scripts written as entry_point for training/inference/processing in general (barring minor internal readjustments, e.g. to environment variables that may be set differently in a pipeline context).
This official guide seems to me the most complete to follow as a prerequisite:
"Amazon SageMaker Model Building Pipeline"
Next, you can look at a fairly common application scenario, again in the official guide:
"Orchestrate Jobs to Train and Evaluate Models with Amazon SageMaker Pipelines"
By setting the name of your pipeline, once it is launched, you will see it as a graph with the various states running in SageMaker Studio.
You will probably have written data manipulation code within your notebook. Remember that the pipeline is composed of steps, so you cannot manipulate data between steps unless that manipulation is itself a step (e.g. a processing step). This is therefore probably the biggest code change you will need to make.
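As a rough illustration of that point, here is a minimal sketch of a two-step pipeline in which the notebook's data manipulation becomes a processing step that feeds a training step. It assumes the SageMaker Python SDK v2; the script names (`preprocess.py`, `train.py`), the `./scripts` directory, the S3 layout, the instance types, and the framework versions are all placeholders that are not taken from the question and would need to match your own setup.

```python
def build_pipeline(role, bucket, pipeline_name="hf-training-pipeline"):
    """Return a Pipeline with a processing step followed by a training step.

    Imports are done lazily so this module can be loaded without the SDK.
    All names, paths, and versions below are illustrative assumptions.
    """
    from sagemaker.huggingface import HuggingFace
    from sagemaker.inputs import TrainingInput
    from sagemaker.processing import ProcessingInput, ProcessingOutput
    from sagemaker.sklearn.processing import SKLearnProcessor
    from sagemaker.workflow.pipeline import Pipeline
    from sagemaker.workflow.steps import ProcessingStep, TrainingStep

    # Step 1: the data manipulation that lived in notebook cells.
    processor = SKLearnProcessor(
        framework_version="1.2-1",  # assumed version
        role=role,
        instance_type="ml.m5.xlarge",
        instance_count=1,
    )
    step_process = ProcessingStep(
        name="PrepareData",
        processor=processor,
        inputs=[ProcessingInput(source=f"s3://{bucket}/raw",
                                destination="/opt/ml/processing/input")],
        outputs=[ProcessingOutput(output_name="train",
                                  source="/opt/ml/processing/train")],
        code="preprocess.py",  # hypothetical script
    )

    # Step 2: the same entry_point script used from the notebook.
    estimator = HuggingFace(
        entry_point="train.py",       # hypothetical script
        source_dir="./scripts",       # hypothetical directory
        role=role,
        instance_type="ml.p3.2xlarge",
        instance_count=1,
        transformers_version="4.26",  # assumed version combination
        pytorch_version="1.13",
        py_version="py39",
    )
    train_uri = (step_process.properties.ProcessingOutputConfig
                 .Outputs["train"].S3Output.S3Uri)
    step_train = TrainingStep(
        name="TrainModel",
        estimator=estimator,
        inputs={"train": TrainingInput(s3_data=train_uri)},
    )
    return Pipeline(name=pipeline_name, steps=[step_process, step_train])
```

With a valid execution role, something like `pipeline = build_pipeline(role, bucket)`, then `pipeline.upsert(role_arn=role)` and `pipeline.start()` would register and launch it. The training step consumes the processing step's output through `properties`, which is how data passes between steps.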
What is the difference between these GCP pipeline services:
Cloud Dataflow and Cloud Data Fusion ...
Which should I use, and when?
I did a high-level pricing comparison using 10 instances on the Basic edition in Data Fusion
and a 10-instance cluster (n1-standard-8) in Dataflow.
The price for Data Fusion is more than double.
What are the pros and cons of each compared with the other?
Cloud Dataflow is purpose-built for highly parallelized graph processing and can be used for both batch and stream processing. It is also built to be fully managed, abstracting away the need to manage and understand underlying resource scaling concepts, e.g. how to optimize shuffle performance or deal with key imbalance issues. The user/developer is responsible for building the graph via code, creating N transforms and/or operations to achieve the desired goal. For example: read files from storage, process each line in a file, extract data from the line, cast the data to numeric, sum the data in groups of X, and write the output to a data lake.
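The example graph above could be sketched with the Apache Beam Python SDK roughly as follows. This is only an illustration under assumptions: the input is taken to be text lines of the form `key,value`, "groups of X" is interpreted as summing per key, and the function names and file locations are hypothetical.

```python
def parse_line(line):
    """Extract (key, value) from a 'key,value' line and cast the value to float."""
    key, raw_value = line.strip().split(",", 1)
    return key.strip(), float(raw_value)


def build_pipeline(pipeline, input_pattern, output_prefix):
    """Attach the read -> parse -> sum -> write graph to a Beam pipeline."""
    import apache_beam as beam  # lazy import; requires apache-beam

    return (
        pipeline
        | "ReadFiles" >> beam.io.ReadFromText(input_pattern)
        | "ParseLine" >> beam.Map(parse_line)
        | "SumPerKey" >> beam.CombinePerKey(sum)
        | "FormatRow" >> beam.Map(lambda kv: f"{kv[0]},{kv[1]}")
        | "WriteOutput" >> beam.io.WriteToText(output_prefix)
    )
```

Each `|` stage is one transform in the graph. To run it, you would call `build_pipeline(p, ...)` inside a `with beam.Pipeline(options=...) as p:` block, passing your own input pattern and output prefix (for example `gs://` paths) and the pipeline options for the Dataflow runner.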
Cloud Data Fusion is focused on enabling data integration scenarios: reading from a source (via an extensible set of connectors) and writing to targets, e.g. BigQuery, storage, etc. It does have parallelization concepts, but they are not fully managed as in Cloud Dataflow. CDF rides on top of Cloud Dataproc, which is a managed version for Hadoop-based processing. Its sweet spot is visual graph development leveraging an extensible set of connectors and operators.
Your question focuses on cost. My advice is to take a step back and define what your processing/graph goals look like, then look at the value each product offers. If you want full control over processing semantics with a greater focus on analytics, and you want to run in batch and/or must have streaming, focus on Dataflow. If you want point-and-click data movement with less focus on data analytics and do not need streaming, then look at CDF.
I currently have a simple Machine Learning infrastructure running locally and I want to migrate this all onto Google Cloud. I simply fetch the data I need from a database, build my model and then test the model on test data. This is all done in PyCharm locally.
I want to simply migrate this and have the possibility for all this to be done on Google Cloud, while having the flexibility to make local changes that can apply when run on the cloud as well. There are many Google Cloud resources relating to this and so I am looking for best practices people follow on running such a procedure.
Thanks and please let me know if there are any clarifications needed.
I highly suggest you take a look at this machine learning workflow in the cloud, which consists of:
Data Ingestion and Collection
Storing the data.
Processing data.
ML training.
ML deployment.
Data Ingestion and Collection
There are multiple resources you can use if you would like to ingest data with Google Cloud Platform. The simplest solutions I can recommend are either Google Compute Engine or an App Engine app (for example, a form where a user fills in some data).
Nonetheless, if you would like to ingest data in real-time, you can also use Cloud Pub/Sub.
Storing the data
As you mentioned, you are retrieving all the information from a database. If you are used to working with SQL or NoSQL, I highly suggest Cloud SQL. Not only does it provide a good interface for building your instance, it also lets you access it securely and quickly.
If that is not the case, you can also use Google Cloud Storage or BigQuery; of those two, I would pick BigQuery since it can also work with streaming data.
Processing data
For processing data before feeding it to the model you can use either:
Cloud DataFlow: Cloud Dataflow is a fully-managed service for transforming and enriching data in stream (real time) and batch (historical) modes with equal reliability and expressiveness -- no more complex workarounds or compromises needed.
Cloud Dataproc: Dataproc is a fast, easy-to-use, fully managed cloud service for running Apache Spark and Apache Hadoop clusters in a simpler, more cost-efficient way.
Cloud Dataprep: Cloud Dataprep by Trifacta is an intelligent data service for visually exploring, cleaning, and preparing structured and unstructured data for analysis, reporting, and machine learning.
ML training & ML deployment
For training/deploying your ML model, I would suggest using AI Platform.
AI Platform makes it easy for machine learning developers, data scientists, and data engineers to take their ML projects from ideation to production and deployment, quickly and cost-effectively.
If you have to work with huge datasets, the best practice is to run the model as a TensorFlow job on AI Platform so that you can have a training cluster.
Finally for deploying your models using AI Platform, you can take a look here.
I have created a sklearn model on my local machine and uploaded it to Google Cloud Storage. I have created a model and version in AI Platform from it, and it works for online prediction. Now I want to perform batch prediction and store the results in BigQuery, so that the BigQuery table is updated every time I run the prediction.
Can someone suggest how to do this?
AI Platform does not support writing prediction results to BigQuery at the moment.
You can write the prediction results to BigQuery with Dataflow. There are two options here:
Create Dataflow job that makes the predictions itself.
Create Dataflow job that uses AI Platform to get the model's predictions. Probably this would use online predictions.
In both cases you can define a BigQuery sink to insert new rows to your table.
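A minimal sketch of the second option (a Dataflow job that calls AI Platform online prediction and writes to a BigQuery sink) might look like the following. It assumes the `apache-beam[gcp]` and `google-api-python-client` packages; the function names, the two-column schema, and the batch size are illustrative choices, not something stated above.

```python
import json


def predictions_to_rows(instances, predictions):
    """Pair each input instance with its prediction as a BigQuery row dict."""
    return [
        {"instance": json.dumps(inst), "prediction": json.dumps(pred)}
        for inst, pred in zip(instances, predictions)
    ]


def predict_batch(instances, model_name):
    """Call AI Platform online prediction for one batch of instances.

    model_name is expected to look like
    'projects/PROJECT/models/MODEL/versions/VERSION'.
    """
    from googleapiclient import discovery  # lazy import

    service = discovery.build("ml", "v1", cache_discovery=False)
    response = service.projects().predict(
        name=model_name, body={"instances": instances}
    ).execute()
    if "error" in response:
        raise RuntimeError(response["error"])
    return predictions_to_rows(instances, response["predictions"])


def build_pipeline(pipeline, instances, model_name, table):
    """Attach a predict-then-write graph; table is 'project:dataset.table'."""
    import apache_beam as beam  # lazy import

    return (
        pipeline
        | "CreateInputs" >> beam.Create(instances)
        | "Batch" >> beam.BatchElements(min_batch_size=1, max_batch_size=100)
        | "Predict" >> beam.FlatMap(predict_batch, model_name)
        | "WriteToBQ" >> beam.io.WriteToBigQuery(
            table,
            schema="instance:STRING,prediction:STRING",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        )
    )
```

For simplicity the sketch builds the API client once per batch; in a real job you would probably create it once per worker, for example in a `DoFn.setup` method. For the first option (the job makes the predictions itself), `predict_batch` would instead load the model and call it directly.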
Alternatively, you can use Cloud Functions to update a BigQuery table whenever a new file appears in GCS. This solution would look like:
Use gcloud to run the batch prediction (`gcloud ml-engine jobs submit prediction ... --output-path="gs://[My Bucket]/batch-predictions/"`)
Results are written in multiple files: gs://[My Bucket]/batch-predictions/prediction.results-*-of-NNNNN
A Cloud Function is triggered to parse the results and insert them into BigQuery. This Medium post explains how to set this up.
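The Cloud Function in the last step could be sketched roughly as below. This is an assumption-laden illustration, not the code from the Medium post: it assumes a background function triggered by object finalization in the bucket, result files containing one JSON value per line, and a hypothetical `TABLE_ID` whose schema matches the parsed rows.

```python
import json

# Hypothetical destination table; replace with your own.
TABLE_ID = "my-project.my_dataset.predictions"


def parse_prediction_file(text):
    """Parse newline-delimited JSON results into a list of row dicts.

    Lines that are bare values (not JSON objects) are wrapped as
    {"prediction": value} so every row is a dict.
    """
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        value = json.loads(line)
        rows.append(value if isinstance(value, dict) else {"prediction": value})
    return rows


def on_prediction_file(event, context):
    """Background Cloud Function entry point for a GCS 'finalize' event."""
    from google.cloud import bigquery, storage  # lazy imports

    # Skip anything that is not a prediction results file.
    if "prediction.results" not in event["name"]:
        return

    blob = storage.Client().bucket(event["bucket"]).blob(event["name"])
    rows = parse_prediction_file(blob.download_as_text())
    if not rows:
        return

    errors = bigquery.Client().insert_rows_json(TABLE_ID, rows)
    if errors:
        raise RuntimeError(f"BigQuery insert failed: {errors}")
```

Because batch prediction writes several result files, the function runs once per file and each run appends its rows to the table.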