Call AutoML prediction model from Dataflow SQL - google-cloud-dataflow

Is there a way to call an AutoML prediction model from within Dataflow SQL?

No, that isn't possible from Dataflow SQL alone. You could, however, write a Beam pipeline that contains SQL transforms as well as a DoFn that calls the AutoML prediction endpoint.
If you want to reuse the SQL you currently run in Dataflow SQL, note that Dataflow SQL uses the ZetaSQL variant of Beam SQL.
You would then run the Beam SQL pipeline like any other pipeline, as a "regular" Dataflow job.
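A minimal sketch of that shape, assuming the Beam Python SDK, a BigQuery source with id and text columns, and an AutoML text classification model (all project, table, and model IDs below are placeholders; SqlTransform also needs the cross-language expansion service, i.e. a Java runtime, available when the job is submitted):

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms.sql import SqlTransform
from google.cloud import automl_v1


class CallAutoML(beam.DoFn):
    """Calls the AutoML prediction endpoint for each row."""

    def __init__(self, model_name):
        # e.g. "projects/PROJECT/locations/us-central1/models/MODEL_ID"
        self._model_name = model_name

    def setup(self):
        # Create the client once per worker, not once per element.
        self._client = automl_v1.PredictionServiceClient()

    def process(self, row):
        # Payload shape assumes a text model; adapt it to your model type.
        payload = {"text_snippet": {"content": row.text, "mime_type": "text/plain"}}
        response = self._client.predict(name=self._model_name, payload=payload)
        yield {"id": row.id, "label": response.payload[0].display_name}


def run():
    with beam.Pipeline(options=PipelineOptions()) as p:
        (
            p
            | "Read" >> beam.io.ReadFromBigQuery(table="my_project:my_dataset.events")
            | "AsRows" >> beam.Map(lambda d: beam.Row(id=str(d["id"]), text=str(d["text"])))
            # The same statement you ran in Dataflow SQL, via the ZetaSQL dialect.
            | "Sql" >> SqlTransform(
                "SELECT id, text FROM PCOLLECTION WHERE text IS NOT NULL",
                dialect="zetasql")
            | "Predict" >> beam.ParDo(CallAutoML(
                "projects/my-project/locations/us-central1/models/MODEL_ID"))
            | "Write" >> beam.io.WriteToBigQuery(
                "my_project:my_dataset.predictions",
                schema="id:STRING,label:STRING",
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
        )


if __name__ == "__main__":
    run()
```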

Related

Feasibility of calling Vertex AI endpoints from a Dataflow streaming pipeline

I have an application where a streaming Dataflow pipeline does inference on an incoming stream of images. It does so by loading a TensorFlow CNN model, saved as an h5 file in a GCS location, and using that model inside a user-defined PTransform to do the inference.
I have been going through the GCP documentation, but the following is still not clear to me.
Instead of having to load the TensorFlow model from a GCS bucket, is it possible to deploy the model on a Vertex AI endpoint and call that endpoint from a Dataflow PTransform to do inference on an image? Is it feasible?
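This is feasible in principle; a deployed endpoint can be called from a DoFn with the google-cloud-aiplatform client (the h5 model would first have to be exported in a servable format, e.g. a TensorFlow SavedModel, and deployed). A hedged sketch, with the endpoint name and input format as placeholders:

```python
import apache_beam as beam
from google.cloud import aiplatform


class PredictViaEndpoint(beam.DoFn):
    """Sends each preprocessed image to a deployed Vertex AI endpoint."""

    def __init__(self, endpoint_name):
        # e.g. "projects/PROJECT/locations/us-central1/endpoints/ENDPOINT_ID"
        self._endpoint_name = endpoint_name

    def setup(self):
        # Built once per worker, so the connection is reused across elements.
        self._endpoint = aiplatform.Endpoint(endpoint_name=self._endpoint_name)

    def process(self, element):
        # element is assumed to be a (key, pixel_list) pair produced upstream.
        key, instance = element
        prediction = self._endpoint.predict(instances=[instance])
        yield key, prediction.predictions[0]
```

Per-element HTTP calls add latency, so batching several images into one request (the instances list accepts more than one) is usually worth considering.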

Using PySpark pipeline models at inference time without a Spark context

The workflow:
To preprocess our raw data we use PySpark. We need Spark because of the size of the data.
The PySpark preprocessing job uses a pipeline model, which lets you export your preprocessing logic to a file.
By exporting the preprocessing logic via a pipeline model, you can load that pipeline model at inference time, so you don't need to code your preprocessing logic twice.
At inference time, we would prefer to do the preprocessing step without a Spark context. The Spark context is redundant at inference time and only slows down the inference.
I was looking at MLeap, but it only supports Scala for inference without a Spark context. Since we use PySpark, it would be nice to stick to Python.
Question:
What is a good alternative that lets you build a pipeline model in (Py)Spark during the training phase and reuse that pipeline model from Python at inference time, without the need for a Spark context?
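For reference, the export and reload steps described in the workflow above are Spark's standard PipelineModel persistence; a minimal sketch (the stages, columns, and GCS path are placeholders), which also shows why a SparkSession is still needed on the inference side:

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.feature import StringIndexer, VectorAssembler

spark = SparkSession.builder.appName("preprocessing-export").getOrCreate()
train_df = spark.createDataFrame(
    [("books", 12.0), ("games", 7.5)], ["category", "amount"])

# Training side: fit the preprocessing pipeline and export it to storage.
pipeline = Pipeline(stages=[
    StringIndexer(inputCol="category", outputCol="category_idx"),
    VectorAssembler(inputCols=["category_idx", "amount"], outputCol="features"),
])
model = pipeline.fit(train_df)
model.write().overwrite().save("gs://my-bucket/preprocessing-pipeline")

# Inference side: PipelineModel.load still requires a live SparkSession,
# which is exactly the dependency the question wants to avoid.
reloaded = PipelineModel.load("gs://my-bucket/preprocessing-pipeline")
reloaded.transform(train_df).show()
```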

BigQuery predict using scikit-learn model

I have created a scikit-learn model on my local machine and uploaded it to Google Cloud Storage. I have created a model and a version in AI Platform from it, and online prediction is working. Now I want to perform batch prediction and store the results in BigQuery, so that the BigQuery table is updated every time I run a prediction.
Can someone suggest how to do this?
AI Platform does not support writing prediction results to BigQuery at the moment.
You can write the prediction results to BigQuery with Dataflow. There are two options here:
Create a Dataflow job that makes the predictions itself.
Create a Dataflow job that uses AI Platform to get the model's predictions; this would probably use online predictions.
In both cases you can define a BigQuery sink to insert new rows to your table.
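A minimal sketch of the first option, assuming the Beam Python SDK and a model pickled with joblib in GCS (the table names, model path, and feature columns are placeholders):

```python
import apache_beam as beam
import joblib
from apache_beam.io.filesystems import FileSystems
from apache_beam.options.pipeline_options import PipelineOptions


class SklearnPredict(beam.DoFn):
    """Loads the pickled scikit-learn model once per worker and scores each row."""

    def __init__(self, model_path):
        self._model_path = model_path  # e.g. "gs://my-bucket/models/model.joblib"

    def setup(self):
        with FileSystems.open(self._model_path) as f:
            self._model = joblib.load(f)

    def process(self, row):
        features = [[row["feature_a"], row["feature_b"]]]  # adapt to your schema
        yield {"id": row["id"], "prediction": float(self._model.predict(features)[0])}


def run():
    with beam.Pipeline(options=PipelineOptions()) as p:
        (
            p
            | "Read" >> beam.io.ReadFromBigQuery(table="my_project:my_dataset.inputs")
            | "Predict" >> beam.ParDo(SklearnPredict("gs://my-bucket/models/model.joblib"))
            | "Write" >> beam.io.WriteToBigQuery(
                "my_project:my_dataset.predictions",
                schema="id:STRING,prediction:FLOAT",
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
        )


if __name__ == "__main__":
    run()
```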
Alternatively, you can use Cloud Functions to update a BigQuery table whenever a new file appears in GCS. This solution would look like:
Use gcloud to run the batch prediction (`gcloud ml-engine jobs submit prediction ... --output-path="gs://[My Bucket]/batch-predictions/"`).
Results are written to multiple files: gs://[My Bucket]/batch-predictions/prediction.results-*-of-NNNNN
A Cloud Function is triggered to parse the results and insert them into BigQuery. This Medium post explains how to set this up.
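A hedged sketch of such a Cloud Function, assuming a google.storage.object.finalize trigger on the output bucket and that each line of prediction.results-* is a JSON object whose keys match the BigQuery table columns (the table name is a placeholder):

```python
import json

from google.cloud import bigquery, storage

BQ_TABLE = "my_project.my_dataset.predictions"  # placeholder


def load_predictions(event, context):
    """Background function triggered when a new object lands in the bucket."""
    name = event["name"]
    if "prediction.results" not in name:
        return  # skip the error/stats files that batch prediction also writes

    blob = storage.Client().bucket(event["bucket"]).blob(name)
    rows = [json.loads(line) for line in blob.download_as_text().splitlines() if line]

    errors = bigquery.Client().insert_rows_json(BQ_TABLE, rows)
    if errors:
        raise RuntimeError(f"BigQuery insert failed: {errors}")
```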

Complex join with Google Dataflow

I'm a newbie trying to understand how we might rewrite a batch ETL process in Google Dataflow. I've read some of the docs and run a few examples.
I'm proposing that the new ETL process would be driven by business events (i.e. a source PCollection). These would trigger the ETL process for that particular business entity. The ETL process would extract datasets from source systems and then pass those results (PCollections) onto the next processing stage. The processing stages would involve various types of joins (including cartesian and non-key joins, e.g. date-banded).
So a couple of questions here:
(1) Is the approach that I'm proposing valid and efficient? If not, what would be better? I haven't seen any presentations on real-world complex ETL processes using Google Dataflow, only simple scenarios.
Are there any "higher-level" ETL products that are a better fit? I've been keeping an eye on Spark and Flink for a while.
Our current ETL is moderately complex, though there are only about 30 core tables (classic EDW dimensions and facts), and ~1000 transformation steps. Source data is complex (roughly 150 Oracle tables).
(2) The complex non-key joins, how would these be handled?
I'm obviously attracted to Google Dataflow because it is first and foremost an API, and the parallel processing capabilities seem a very good fit (we are being asked to move from overnight batch to incremental processing).
A good worked example of Dataflow for this use case would really push adoption forward!
Thanks,
Mike S
It sounds like Dataflow would be a good fit. We allow you to write a pipeline that takes a PCollection of business events and performs the ETL. The pipeline could either be batch (executed periodically) or streaming (executed whenever input data arrives).
The various joins are, for the most part, straightforward to express in Dataflow. For the cartesian product, you can look at using side inputs to make the contents of one PCollection available as an input to the processing of each element of another PCollection.
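For example, a cartesian-style cross join via a side input could look like this in the Python SDK (the data is a placeholder):

```python
import apache_beam as beam

with beam.Pipeline() as p:
    orders = p | "Orders" >> beam.Create(
        [{"order_id": 1, "amount": 10.0}, {"order_id": 2, "amount": 25.0}])
    rates = p | "Rates" >> beam.Create(
        [{"currency": "EUR", "rate": 1.1}, {"currency": "GBP", "rate": 1.3}])

    def cross(order, all_rates):
        # Pair every order with every rate: a cartesian product.
        for rate in all_rates:
            yield {**order, **rate}

    # The whole `rates` PCollection is handed to each order as a side input.
    joined = orders | "Cross" >> beam.FlatMap(cross, all_rates=beam.pvalue.AsList(rates))
    joined | beam.Map(print)
```

Side inputs are materialized on the workers, so this works best when one side of the join is small.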
You can also look at using GroupByKey or CoGroupByKey to implement joins. These group multiple inputs and let you access all values with the same key in one place. You can also use Combine.perKey to compute associative and commutative combinations of all the elements associated with a key (e.g., SUM, MIN, MAX, AVERAGE).
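A small sketch of a key-based join with CoGroupByKey, plus the per-key combine (CombinePerKey in the Python SDK), with placeholder data:

```python
import apache_beam as beam

with beam.Pipeline() as p:
    customers = p | "Customers" >> beam.Create([("c1", "Alice"), ("c2", "Bob")])
    orders = p | "Orders" >> beam.Create([("c1", 10.0), ("c1", 4.5), ("c2", 7.0)])

    def expand_join(element):
        key, grouped = element
        # grouped["customers"] and grouped["orders"] hold all values for this key.
        for name in grouped["customers"]:
            for amount in grouped["orders"]:
                yield {"customer": name, "amount": amount}

    joined = (
        {"customers": customers, "orders": orders}
        | "Join" >> beam.CoGroupByKey()
        | "Expand" >> beam.FlatMap(expand_join)
    )
    totals = orders | "SumPerCustomer" >> beam.CombinePerKey(sum)

    joined | "PrintJoin" >> beam.Map(print)
    totals | "PrintTotals" >> beam.Map(print)
```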
Date-banded joins sound like a good fit for windowing, which lets you write a pipeline that consumes windows of data (e.g., hourly windows, daily windows, 7-day windows that slide every day, etc.).
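For instance, a daily aggregation per key could be windowed like this in the Python SDK (events and timestamps are placeholders; SlidingWindows(7 * 86400, 86400) would give the 7-day windows that slide every day):

```python
import apache_beam as beam
from apache_beam.transforms.window import FixedWindows, TimestampedValue

with beam.Pipeline() as p:
    (
        p
        | beam.Create([
            # (account, amount, unix_timestamp) -- placeholder events
            ("acct-1", 10.0, 1700000000),
            ("acct-1", 5.0, 1700003600),
            ("acct-2", 7.0, 1700090000),
        ])
        # Attach each element's own timestamp so windowing can use it.
        | "Timestamp" >> beam.Map(lambda e: TimestampedValue((e[0], e[1]), e[2]))
        | "Daily" >> beam.WindowInto(FixedWindows(24 * 60 * 60))
        | "SumPerKey" >> beam.CombinePerKey(sum)
        | beam.Map(print)
    )
```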
Edit: Mention GroupByKey and CoGroupByKey.

Google Cloud Dataflow and machine learning

What's the best way to run machine learning algorithms on Google Cloud Dataflow? I can imagine that using Mahout would be one option, given that it's Java-based.
The answer is probably no, but is there a way to invoke scripts in R or Python (which have strong support for ML algorithms) to offload the ML execution?
-Girish
You can already implement many algorithms in terms of Dataflow transforms.
A class of algorithms that may not be as easy to implement are iterative algorithms, where the pipeline's execution graph depends on the data itself. Simplifying implementation of iterative algorithms is something that we are interested in, and you can expect future improvements and simplifications in this area.
Invoking a Python (or any other) executable shouldn't be hard from a Dataflow pipeline. A ParDo can, for example, shell out and start an arbitrary process. You can use the --filesToStage pipeline option to add additional files to the Dataflow worker environment.
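For illustration, a DoFn in today's Python SDK could shell out roughly like this (score.py is a hypothetical script you would stage on the workers yourself; in the Java SDK, --filesToStage is the staging mechanism mentioned above):

```python
import subprocess

import apache_beam as beam


class RunExternalScript(beam.DoFn):
    """Pipes each element through an external script available on the worker."""

    def process(self, element):
        # score.py is a placeholder for whatever R/Python program does the ML work.
        result = subprocess.run(
            ["python3", "score.py"],
            input=str(element).encode("utf-8"),
            capture_output=True,
            check=True,
        )
        yield result.stdout.decode("utf-8").strip()
```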
There are also http://quickml.org/ (I haven't used it personally) and Weka. I remember the docs mention that it's possible to launch a new process from within the job, but AFAIK it's not recommended.
