BigQuery predict using sklearn model - machine-learning

I have created an sklearn model on my local machine and uploaded it to Google Storage. I have created a model and a version in AI Platform from it, and it works for online prediction. Now I want to perform batch prediction and store the results in BigQuery, so that the BigQuery table is updated every time I run a prediction.
Can someone suggest how to do this?

AI Platform does not support writing prediction results to BigQuery at the moment.
You can write the prediction results to BigQuery with Dataflow. There are two options here:
Create a Dataflow job that makes the predictions itself.
Create a Dataflow job that uses AI Platform to get the model's predictions (this would probably use online prediction).
In both cases you can define a BigQuery sink to insert new rows into your table.
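As a rough illustration of the first option, a Beam (Python SDK) pipeline can load the pickled sklearn model from GCS, run predictions in a DoFn, and write the results to a BigQuery sink. This is only a sketch; the bucket, table, schema, and feature names below are placeholders:

```python
import pickle

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


class PredictDoFn(beam.DoFn):
    def setup(self):
        # Load the pickled sklearn model from GCS once per worker.
        from apache_beam.io.gcp.gcsio import GcsIO
        with GcsIO().open('gs://my-bucket/model/model.pkl', 'rb') as f:
            self._model = pickle.load(f)

    def process(self, row):
        features = [[float(row['feature_a']), float(row['feature_b'])]]
        yield {'id': row['id'], 'prediction': float(self._model.predict(features)[0])}


with beam.Pipeline(options=PipelineOptions()) as p:
    (p
     | 'ReadInput' >> beam.io.ReadFromText('gs://my-bucket/input/*.csv')
     | 'ParseCsv' >> beam.Map(lambda line: dict(zip(['id', 'feature_a', 'feature_b'],
                                                    line.split(','))))
     | 'Predict' >> beam.ParDo(PredictDoFn())
     | 'WriteToBigQuery' >> beam.io.WriteToBigQuery(
           'my-project:my_dataset.predictions',
           schema='id:STRING,prediction:FLOAT',
           write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))
```

Running the same pipeline with the DataflowRunner (instead of the default DirectRunner) turns it into a Dataflow job.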
Alternatively, you can use Cloud Functions to update a BigQuery table whenever a new file appears in GCS. This solution would look like:
Use gcloud to run the batch prediction: `gcloud ml-engine jobs submit prediction ... --output-path="gs://[My Bucket]/batch-predictions/"`
Results are written to multiple files: gs://[My Bucket]/batch-predictions/prediction.results-*-of-NNNNN
A Cloud Function is triggered to parse the results and insert them into BigQuery. This Medium post explains how to set this up.
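For the Cloud Functions route, here is a hedged sketch of a function that fires on each new object under `batch-predictions/` and streams the parsed rows into BigQuery (dataset, table, and prefix names are placeholders, and the prediction records are assumed to match the table schema):

```python
import json

from google.cloud import bigquery, storage

bq_client = bigquery.Client()
gcs_client = storage.Client()

TABLE_ID = 'my-project.my_dataset.predictions'


def load_predictions(event, context):
    """Background function triggered by google.storage.object.finalize events."""
    if not event['name'].startswith('batch-predictions/prediction.results'):
        return  # ignore other objects, e.g. the prediction.errors* files

    blob = gcs_client.bucket(event['bucket']).blob(event['name'])
    # Each line of the output file is one JSON prediction record.
    rows = [json.loads(line) for line in blob.download_as_text().splitlines() if line]

    errors = bq_client.insert_rows_json(TABLE_ID, rows)
    if errors:
        raise RuntimeError(f'BigQuery insert errors: {errors}')
```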

Related

Difference Between Cloud Data fusion and DataFlow on GCP

What is the difference between the GCP pipeline services Cloud Dataflow and Cloud Data Fusion, and which do you use when?
I did a high-level pricing comparison, taking 10 instances with the Basic edition in Data Fusion and a 10-instance cluster (n1-standard-8) in Dataflow. The pricing is more than double for Data Fusion.
What are the pros and cons of each over the other?
Cloud Dataflow is purpose-built for highly parallelized graph processing, and can be used for both batch and stream processing. It is also built to be fully managed, abstracting away the need to manage and understand underlying resource scaling concepts, e.g. how to optimize shuffle performance or deal with key imbalance issues. The user/developer is responsible for building the graph via code: creating N transforms and/or operations to achieve the desired goal. For example: read files from storage, process each line in a file, extract data from the line, cast the data to numeric, sum the data in groups of X, write the output to a data lake.
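As a hand-wavy illustration of what "building the graph via code" looks like in the Beam Python SDK (file paths and field positions are made up):

```python
import apache_beam as beam

with beam.Pipeline() as p:
    (p
     | 'ReadFiles' >> beam.io.ReadFromText('gs://my-bucket/raw/*.csv')      # read files from storage
     | 'ExtractFields' >> beam.Map(lambda line: line.split(','))            # process each line
     | 'CastToNumeric' >> beam.Map(lambda f: (f[0], float(f[1])))           # cast data to numeric
     | 'SumPerGroup' >> beam.CombinePerKey(sum)                             # sum data in groups
     | 'Format' >> beam.MapTuple(lambda key, total: f'{key},{total}')
     | 'WriteOutput' >> beam.io.WriteToText('gs://my-data-lake/aggregated/part'))  # write to data lake
```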
Cloud Data Fusion is focused on enabling data integration scenarios: reading from a source (via an extensible set of connectors) and writing to targets, e.g. BigQuery, storage, etc. It does have parallelization concepts, but they are not fully managed like Cloud Dataflow. CDF rides on top of Cloud Dataproc, which is a managed service for Hadoop-based processing. Its sweet spot is visual graph development leveraging an extensible set of connectors and operators.
Your question is based on "cost" concepts. My advice is to take a step back and define what your processing/graph goals look like, then look at each product's value. If you want full control over processing semantics, with a greater focus on analytics, and need to run in batch and/or streaming mode, focus on Dataflow. If you want point-and-click data movement, with less need for data analytics, and do not need streaming, then look at CDF.

Best Practices for Azure Machine Learning Pipelines

I started working with Azure Machine Learning Service. It has a feature called Pipeline, which I'm currently trying to use. There are, however, a bunch of things that are completely unclear from the documentation and the examples, and I'm struggling to fully grasp the concept.
When I look at 'batch scoring' examples, it is implemented as a Pipeline Step. This raises the question: does this mean that the 'predicting part' is part of the same pipeline as the 'training part', or should there be 2 separate pipelines for this? Making 1 pipeline that combines both steps seems odd to me, because you don't want to run your predicting part every time you change something in the training part (and vice versa).
What parts should be implemented as a Pipeline Step and what parts shouldn't? Should the creation of the Datastore and Dataset be implemented as a step? Should registering a model be implemented as a step?
What isn't shown anywhere is how to deal with model registry. I create the model in the training step and then write it to the output folder as a pickle file. Then what? How do I get the model in the next step? Should I pass it on as a PipelineData object? Should train.py itself be responsible for registering the trained model?
Anders has a great answer, but I'll expand on #1 a bit. In the batch scoring examples you've seen, the assumption is that there is already a trained model, which could be coming from another pipeline, or in the case of the notebook, it's a pre-trained model not built in a pipeline at all.
However, running both training and prediction in the same pipeline is a valid use-case. Use the allow_reuse param and set it to True, which will cache the step output in the pipeline to prevent unnecessary reruns.
Take a model training step for example, and consider the following input to that step:
training script
input data
additional step params
If you set allow_reuse=True, and your training script, input data, and other step params are the same as the last time the pipeline ran, it will not rerun that step; it will use the cached output from the last run. But if, say, your input data changed, then the step would rerun.
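As a small sketch (Azure ML SDK v1; the script, compute, and experiment names are placeholders), allow_reuse is just a flag on the step definition:

```python
from azureml.core import Experiment, Workspace
from azureml.pipeline.core import Pipeline, PipelineData
from azureml.pipeline.steps import PythonScriptStep

ws = Workspace.from_config()
model_dir = PipelineData('model_dir', datastore=ws.get_default_datastore())

train_step = PythonScriptStep(
    name='train',
    script_name='train.py',
    source_directory='src',
    compute_target='cpu-cluster',
    arguments=['--output-dir', model_dir],
    outputs=[model_dir],
    allow_reuse=True,  # reuse the cached output while script, params and inputs are unchanged
)

Experiment(ws, 'train-pipeline').submit(Pipeline(workspace=ws, steps=[train_step]))
```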
In general, pipelines are pretty modular and you can build them how you see fit. You could maintain separate pipelines for training and scoring, or bundle everything in one pipeline but leverage the automatic caching.
Azure ML pipelines best practices are emergent, so I can give you some recommendations, but I'd be surprised if others respond with divergent, deeply-held opinions. The Azure ML product group is also improving and expanding the product at a phenomenal pace, so I fully expect things to change (for the better) over time. This article does a good job of explaining ML pipelines.
3 Passing a model to a downstream step
How do I get the model in the next step?
During development, I recommend that you don't register your model and that the scoring step receives your model via a PipelineData as a pickled file.
In production, the scoring step should use a previously registered model.
Our team uses a PythonScriptStep that has a script argument that allows a model to be passed from an upstream step or fetched from the registry. Our batch score step uses a PipelineData named best_run_data, which contains the best model (saved as model.pkl) from a HyperDriveStep.
The definition of our batch_score_step has a boolean argument, '--use_model_registry', that determines whether to use the recently trained model or the model registry. We use a function, get_model_path(), to pivot on the script arg.
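A hypothetical sketch of such a helper (the model name, argument names, and PipelineData layout below are assumptions, not the original code):

```python
import argparse
import os

from azureml.core import Run
from azureml.core.model import Model


def get_model_path(args):
    if args.use_model_registry:
        # Fetch the latest registered model from the workspace's model registry.
        ws = Run.get_context().experiment.workspace
        return Model.get_model_path('my-model', _workspace=ws)
    # Otherwise use the model.pkl the upstream step wrote into the PipelineData directory.
    return os.path.join(args.best_run_data, 'model.pkl')


parser = argparse.ArgumentParser()
parser.add_argument('--use_model_registry', action='store_true')
parser.add_argument('--best_run_data', type=str, default=None)
model_path = get_model_path(parser.parse_args())
```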
2 Control Plane vs Data Plane
What parts should be implemented as a Pipeline Step and what parts shouldn't?
All transformations you do to your data (munging, featurization, training, scoring) should take place inside PipelineSteps, and their inputs and outputs should be PipelineData objects.
Azure ML artifacts should be:
- created in the pipeline control plane using PipelineData, and
- registered either:
- ad-hoc, as opposed to with every run, or
- when you need to pass artifacts between pipelines.
In this way PipelineData is the glue that connects pipeline steps directly, rather than having them indirectly connected with .register() and .download().
PipelineData's are ultimately just ephemeral directories that can also be used as placeholders before steps are run to create and register artifacts.
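A short sketch of that glue: the scoring step simply declares the training step's PipelineData as one of its inputs, so no .register()/.download() round-trip is needed (names and scripts are placeholders, same conventions as the earlier sketch):

```python
from azureml.core import Workspace
from azureml.pipeline.core import PipelineData
from azureml.pipeline.steps import PythonScriptStep

ws = Workspace.from_config()
model_dir = PipelineData('model_dir', datastore=ws.get_default_datastore())

train_step = PythonScriptStep(
    name='train', script_name='train.py', source_directory='src',
    compute_target='cpu-cluster',
    arguments=['--output-dir', model_dir], outputs=[model_dir])

score_step = PythonScriptStep(
    name='batch_score', script_name='score.py', source_directory='src',
    compute_target='cpu-cluster',
    arguments=['--model-dir', model_dir], inputs=[model_dir])
```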
Datasets are abstractions of PipelineData in that they are easier to pass to AutoMLStep, HyperDriveStep, and DataDrift.
1 Pipeline encapsulation
does this mean that the 'predicting part' is part of the same pipeline as the 'training part', or should there be 2 separate pipelines for this?
Your pipeline architecture depends on whether:
you need to predict live (else batch prediction is sufficient), and
your data is already transformed and ready for scoring.
If you need live scoring, you should deploy your model. If batch scoring is fine, you could either have:
a training pipeline at the end of which you register a model that is then used in a scoring pipeline, or
do as we do and have one pipeline that can be configured to do either using script arguments.

Does tensorflow-federated support placing training data on client side?

It's very awesome to see that tensorflow-federated now supports distributed training. I referred to the example here.
However, it seems the training data is sent from the server to the clients at each epoch, and the client (remote_executor_service) doesn't hold any dataset. This is different from the typical federated learning scenario. So I was wondering: could I place training data separately on each client?
Two things to note:
The example linked (High-performance Simulation with Kubernetes) is running a federated learning simulation. At this time TensorFlow Federated is primarily used for executing research simulations. There has yet to be a means for deploying to real smart phones. In the simulation, each client dataset is logically separate, but may be physically present on the same machine.
The creation of a tf.data.Dataset (e.g. the definition of train_data in the tutorial) could be thought of as creating a "recipe to read data", rather than actually reading the data itself. For example, adding .batch() or .map() calls to a dataset returns a new recipe, but doesn't actually materialize any dataset examples. The dataset isn't actually read until a .reduce() call, or until the dataset is iterated over in a for loop. In the tutorial, the dataset "recipe" is sent to remote workers; the data is read/materialized remotely when the dataset is iterated over during local training (the data itself is not sent to the remote workers).
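A tiny standalone example of that laziness (nothing TFF-specific, just tf.data):

```python
import tensorflow as tf

dataset = tf.data.Dataset.range(10)      # a recipe; no elements are read yet
dataset = dataset.map(lambda x: x * 2)   # still just a recipe
dataset = dataset.batch(4)               # still just a recipe

# Only now are elements actually materialized:
total = dataset.reduce(
    tf.zeros([], tf.int64),
    lambda acc, batch: acc + tf.reduce_sum(batch))
print(int(total))  # 90

for batch in dataset:                    # iterating also materializes the data
    print(batch.numpy())
```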

Migrate from running ML training and testing locally to Google Cloud

I currently have a simple Machine Learning infrastructure running locally and I want to migrate this all onto Google Cloud. I simply fetch the data I need from a database, build my model and then test the model on test data. This is all done in PyCharm locally.
I want to simply migrate this and have the possibility for all this to be done on Google Cloud, while having the flexibility to make local changes that can apply when run on the cloud as well. There are many Google Cloud resources relating to this and so I am looking for best practices people follow on running such a procedure.
Thanks and please let me know if there are any clarifications needed.
I highly suggest you take a look at this machine learning workflow in the cloud, which consists of:
Data Ingestion and Collection
Storing the data.
Processing data.
ML training.
ML deployment.
Data Ingestion and Collection
There are multiple resources you can use if you would like to ingest data with Google Cloud Platform. The simplest solutions I can recommend are Google Compute Engine or an App Engine app (for example, a forum where users fill in some data).
Nonetheless, if you would like to ingest data in real-time, you can also use Cloud Pub/Sub.
Storing the data
As you mentioned, you are retrieving all the information from a database. If you are used to working with SQL or NoSQL, I highly suggest you go with Cloud SQL. Not only does it provide a good interface when building your instance, it also lets you access it securely and very rapidly.
If that is not the case, you can also use Google Cloud Storage or BigQuery; of those two, I would pick BigQuery since it can also work with streaming data.
Processing data
For processing data before feeding it to the model you can use either:
Cloud DataFlow: Cloud Dataflow is a fully-managed service for transforming and enriching data in stream (real time) and batch (historical) modes with equal reliability and expressiveness -- no more complex workarounds or compromises needed.
Cloud Dataproc: Dataproc is a fast, easy-to-use, fully managed cloud service for running Apache Spark and Apache Hadoop clusters in a simpler, more cost-efficient way.
Cloud Dataprep: Cloud Dataprep by Trifacta is an intelligent data service for visually exploring, cleaning, and preparing structured and unstructured data for analysis, reporting, and machine learning.
ML training & ML deployment
For training/deploying your ML model I would suggest to use AI platform.
AI Platform makes it easy for machine learning developers, data scientists, and data engineers to take their ML projects from ideation to production and deployment, quickly and cost-effectively.
If you have to work with huge datasets, the best practice is to run the model as a TensorFlow job on AI Platform so you can have a training cluster.
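One way to submit such a training job from Python is through the AI Platform Training REST API (the project, bucket, package, and module names below are placeholders; the same job can also be submitted with `gcloud ai-platform jobs submit training`):

```python
from googleapiclient import discovery

ml = discovery.build('ml', 'v1')

job_spec = {
    'jobId': 'my_training_job_001',
    'trainingInput': {
        'scaleTier': 'STANDARD_1',                 # a predefined multi-worker training cluster
        'packageUris': ['gs://my-bucket/trainer/trainer-0.1.tar.gz'],
        'pythonModule': 'trainer.task',
        'region': 'us-central1',
        'runtimeVersion': '1.15',
        'pythonVersion': '3.7',
        'jobDir': 'gs://my-bucket/job-output',
    },
}

ml.projects().jobs().create(parent='projects/my-project', body=job_spec).execute()
```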
Finally for deploying your models using AI Platform, you can take a look here.

Using Google ML Engine with BigQuery?

I'm currently designing a data warehouse in BigQuery. I'm planning to store user data like past purchases or abandoned carts.
This seems to be perfect to manually analyze trends and to get insights. But what if I want to leverage Machine Learning, e.g. to suggest products to a group of users?
I have looked into Google ML Engine and TensorFlow, and it seems like the TensorFlow model would need to query BigQuery first. In some scenarios, this could mean that TensorFlow would need to query all or most of the data that is stored in BigQuery.
This feels a bit off, so I'm wondering if this is really how things are supposed to happen. Otherwise, I assume that my ML model would have to work with stale data?
So I would agree with you, using BigQuery as a data warehouse for your ML is expensive. It would be cheaper and much more efficient to use Google Cloud Storage to store all the data you wish to process. Once everything is processed and generated, you may then wish to push that data to BigQuery or to another source like Spanner or even Cloud Storage.
That being said, Google has now created a beta product, BigQuery ML. This allows users to create and execute machine learning models in BigQuery via SQL queries. I believe it uses Python and TensorFlow under the hood, and it would be the best solution if you have a lightweight ML load.
Since it is still in beta as of now, I don't know how well its performance compares to Google ML Engine and TensorFlow.
Depending on what kind of model you want to train and how you want to serve it, you can choose one of the following options:
You can export your data to Google Cloud Storage as CSV and then read the files in Cloud ML Engine. This will let you use the power of Tensorflow and you can then use Cloud ML Engine's serving system to send traffic to your model.
On the downside, this means that you have to export all of your BigQuery data to GCS, and every time you decide to make any change to the data you need to go back to BigQuery and export again. Also, if the data you want to run prediction on is in BigQuery, you have to export that as well and send it to Cloud ML Engine using a separate system.
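For reference, the export step itself is a one-liner with the BigQuery Python client (project, dataset, table, and bucket names are placeholders):

```python
from google.cloud import bigquery

client = bigquery.Client()
extract_job = client.extract_table(
    'my-project.my_dataset.training_data',
    'gs://my-bucket/exports/training_data-*.csv',  # the wildcard shards large tables
)
extract_job.result()  # block until the export finishes
```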
If you want to explore and interactively train Logistic or Linear regression models on your data, you can use BigQuery Machine learning. This will allow you to slice and dice your data in BigQuery and experiment with different parts of your data and various preprocessing options. You can also use all the power of SQL. BigQuery ML also allows you to use the model after training within BigQuery (you can use SQL to feed data in to the model).
For many cases using full power of Tensorflow (i.e. using DNNs) is not necessary. This is especially true for structured data. On the other hand, most of your time will be spent on preprocessing and cleaning the data which would be much easier in SQL in BigQuery.
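To give a feel for the BigQuery ML option, a logistic regression model can be trained and then used for prediction entirely in SQL (here submitted through the Python client; the dataset, table, and column names are made up):

```python
from google.cloud import bigquery

client = bigquery.Client()

# Train a logistic regression model directly on a BigQuery table.
client.query("""
    CREATE OR REPLACE MODEL `my_dataset.purchase_model`
    OPTIONS (model_type = 'logistic_reg', input_label_cols = ['purchased']) AS
    SELECT num_past_purchases, cart_abandoned, purchased
    FROM `my_dataset.user_events`
""").result()

# Feed new rows into the trained model with SQL.
rows = client.query("""
    SELECT user_id, predicted_purchased
    FROM ML.PREDICT(MODEL `my_dataset.purchase_model`,
                    (SELECT * FROM `my_dataset.new_users`))
""").result()
```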
So you have two options here. Choose based on your needs.
P.S.: You can also try using BigQuery Reader in Tensorflow. I don't recommend it as it is very slow. But if your data is not huge it may work for you.
