Using PySpark pipeline models at inference time without a Spark context

The workflow:

- To preprocess our raw data we use PySpark. We need Spark because of the size of the data.
- The PySpark preprocessing job uses a pipeline model, which lets you export your preprocessing logic to a file (see the sketch below).
- By exporting the preprocessing logic via a pipeline model, you can load the pipeline model at inference time, so you don't need to code your preprocessing logic twice.
- At inference time, we would prefer to do the preprocessing step without a Spark context: the Spark context is redundant at inference time and slows down the inference.
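To make the export step concrete, here is a minimal sketch; the stages, column names, and output path are illustrative assumptions, not the actual preprocessing logic:

```python
# Minimal sketch of exporting preprocessing logic as a PipelineModel.
# Stages, column names, and the save path are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.feature import StringIndexer, VectorAssembler

spark = SparkSession.builder.getOrCreate()
raw_df = spark.createDataFrame([("a", 1.0), ("b", 2.0)], ["category", "amount"])

indexer = StringIndexer(inputCol="category", outputCol="category_idx")
assembler = VectorAssembler(inputCols=["category_idx", "amount"],
                            outputCol="features")

pipeline = Pipeline(stages=[indexer, assembler])
model = pipeline.fit(raw_df)
model.write().overwrite().save("/tmp/preprocessing_pipeline")

# Loading it back is exactly where the problem shows: PipelineModel.load
# still needs a live Spark context, even for single-record inference.
reloaded = PipelineModel.load("/tmp/preprocessing_pipeline")
```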
I was looking at MLeap, but it only supports the Scala language for inference without a Spark context. Since we use PySpark, it would be nice to stick to Python.
Question:
What is a good alternative that lets you build a pipeline model in (Py)Spark during the training phase, and lets you reuse this pipeline model in Python at inference time without the need for a Spark context?

Related

Best Practices for Azure Machine Learning Pipelines

I started working with Azure Machine Learning Service. It has a feature called Pipeline, which I'm currently trying to use. There are, however, a bunch of things that are completely unclear from the documentation and the examples, and I'm struggling to fully grasp the concept.
When I look at 'batch scoring' examples, it is implemented as a Pipeline Step. This raises the question: does this mean that the 'predicting part' is part of the same pipeline as the 'training part', or should there be 2 separate pipelines for this? Making 1 pipeline that combines both steps seems odd to me, because you don't want to run your predicting part every time you change something in the training part (and vice versa).
What parts should be implemented as a Pipeline Step and what parts shouldn't? Should the creation of the Datastore and Dataset be implemented as a step? Should registering a model be implemented as a step?
What isn't shown anywhere is how to deal with the model registry. I create the model in the training step and then write it to the output folder as a pickle file. Then what? How do I get the model in the next step? Should I pass it on as a PipelineData object? Should train.py itself be responsible for registering the trained model?
Anders has a great answer, but I'll expand on #1 a bit. In the batch scoring examples you've seen, the assumption is that there is already a trained model, which could be coming from another pipeline, or in the case of the notebook, it's a pre-trained model not built in a pipeline at all.
However, running both training and prediction in the same pipeline is a valid use-case. Use the allow_reuse param and set it to True, which will cache the step output in the pipeline to prevent unnecessary reruns.
Take a model training step, for example, and consider the following inputs to that step:

- training script
- input data
- additional step params
If you set allow_reuse=True, and your training script, input data, and other step params are the same as the last time the pipeline ran, the step will not rerun; it will use the cached output from the previous run. But if, say, your input data changed, then the step would rerun.
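As a hedged sketch of what enabling the cache looks like, assuming a training script, dataset input, and compute target that are not given in the original:

```python
# Sketch of a cached training step: with allow_reuse=True the step is
# skipped when the script, inputs, and params match the previous run.
# Names (train.py, input_ds, compute_target) are illustrative assumptions.
from azureml.pipeline.steps import PythonScriptStep

train_step = PythonScriptStep(
    script_name="train.py",
    source_directory="src",
    arguments=["--input", input_ds],
    inputs=[input_ds],
    compute_target=compute_target,
    allow_reuse=True,  # reuse cached output if nothing above changed
)
```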
In general, pipelines are pretty modular and you can build them how you see fit. You could maintain separate pipelines for training and scoring, or bundle everything in one pipeline but leverage the automatic caching.
Azure ML pipelines best practices are emergent, so I can give you some recommendations, but I'd be surprised if others respond with divergent, deeply-held opinions. The Azure ML product group is also improving and expanding the product at a phenomenal pace, so I fully expect things to change (for the better) over time. This article does a good job of explaining ML pipelines.
3 Passing a model to a downstream step
How do I get the model in the next step?
During development, I recommend that you don't register your model and that the scoring step receives your model via a PipelineData as a pickled file.
In production, the scoring step should use a previously registered model.
Our team uses a PythonScriptStep that has a script argument that allows a model to be passed from an upstream step or fetched from the registry. The screenshot below shows our batch score step using a PipelineData named best_run_data, which contains the best model (saved as model.pkl) from a HyperDriveStep.
The definition of our batch_score_step has a boolean argument, '--use_model_registry', that determines whether to use the recently trained model or the model registry. We use a function, get_model_path(), to pivot on the script arg. Here are some code snippets of the above.
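The snippets themselves aren't reproduced here, but a hedged reconstruction of get_model_path() might look like the following; the registered model name and the argument parsing are assumptions, while --use_model_registry, best_run_data, and model.pkl come from the description above:

```python
# Hypothetical reconstruction of get_model_path(): pivot between the
# freshly trained model passed via PipelineData and a registered model.
import argparse
import os

from azureml.core.model import Model

def get_model_path(args):
    if args.use_model_registry:
        # Resolve the on-disk path of a previously registered model.
        # The model name is an assumption.
        return Model.get_model_path(model_name="batch_score_model")
    # Otherwise pick up model.pkl from the upstream HyperDriveStep's
    # PipelineData directory (best_run_data).
    return os.path.join(args.best_run_data, "model.pkl")

parser = argparse.ArgumentParser()
parser.add_argument("--use_model_registry",
                    type=lambda s: s.lower() == "true", default=False)
parser.add_argument("--best_run_data", default=".")
model_path = get_model_path(parser.parse_args())
```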
2 Control Plane vs Data Plane
What parts should be implemented as a Pipeline Step and what parts shouldn't?
All transformations you do to your data (munging, featurization, training, scoring) should take place inside PipelineSteps, whose inputs and outputs should be PipelineData objects.
Azure ML artifacts should be:
- created in the pipeline control plane using PipelineData, and
- registered either:
- ad-hoc, as opposed to with every run, or
- when you need to pass artifacts between pipelines.
In this way, PipelineData is the glue that connects pipeline steps directly, rather than connecting them indirectly with .register() and .download().
PipelineData objects are ultimately just ephemeral directories that can also be used as placeholders, before steps run, to create and register artifacts.
Datasets are abstractions of PipelineData: they are easier to pass to an AutoMLStep, a HyperDriveStep, and DataDrift.
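A minimal sketch of PipelineData acting as that glue, assuming a workspace ws, a compute target, and script names that are not in the original:

```python
# PipelineData as the glue between a training and a scoring step.
# ws, compute_target, and the script names are illustrative assumptions.
from azureml.pipeline.core import Pipeline, PipelineData
from azureml.pipeline.steps import PythonScriptStep

model_dir = PipelineData("model_dir", datastore=ws.get_default_datastore())

train_step = PythonScriptStep(
    script_name="train.py",
    source_directory="src",
    arguments=["--output_dir", model_dir],
    outputs=[model_dir],           # the step writes model.pkl here
    compute_target=compute_target,
)

score_step = PythonScriptStep(
    script_name="batch_score.py",
    source_directory="src",
    arguments=["--model_dir", model_dir],
    inputs=[model_dir],            # connected directly, no .register()/.download()
    compute_target=compute_target,
)

pipeline = Pipeline(workspace=ws, steps=[train_step, score_step])
```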
1 Pipeline encapsulation
does this mean that the 'predicting part' is part of the same pipeline as the 'training part', or should there be 2 separate pipelines for this?
Your pipeline architecture depends on whether:

- you need to predict live (otherwise batch prediction is sufficient), and
- your data is already transformed and ready for scoring.

If you need live scoring, you should deploy your model. If batch scoring is fine, you could either:

- have a training pipeline, at the end of which you register a model that is then used in a scoring pipeline, or
- do as we do and have one pipeline that can be configured to do either, using script arguments.
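For the first option, the registration at the end of the training pipeline can be a one-liner inside the training script; a hedged sketch, where the model name and output path are assumptions:

```python
# Inside train.py: register the trained model from the current run so a
# separate scoring pipeline can fetch it from the registry by name.
# Model name and output path are illustrative assumptions.
from azureml.core import Run

run = Run.get_context()
run.register_model(model_name="my_model", model_path="outputs/model.pkl")
```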

Is there a way to use external, compiled packages for data processing in Google's AI Platform?

I would like to set up a prediction task, but the data preprocessing step requires using tools outside of Python's data science ecosystem, though Python has APIs to work with those tools (e.g. a compiled Java NLP toolset). I first thought about creating a Docker container to have an environment with those tools available, but a commenter has said that this is not currently supported. Is there perhaps some other way to make such tools available to the Python prediction class needed for AI Platform? I don't really have a clear sense of what's happening on the backend of AI Platform, and how much ability a user has to modify or set it up.
Not possible today. Is there any specific use case you are targeting that isn't satisfied today?
Cloud AI Platform offers multiple prediction frameworks (TensorFlow, scikit-learn, XGBoost, PyTorch, custom prediction routines) in multiple versions.
After looking into the requirements, you can use the new AI Platform feature, custom prediction routines: https://cloud.google.com/ml-engine/docs/tensorflow/custom-prediction-routine-keras
To deploy a custom prediction routine to serve predictions from your trained model, do the following:

- Create a custom predictor to handle requests (a minimal sketch follows this list).
- Package your predictor and your preprocessing module. Here you can install your custom libraries.
- Upload your model artifacts and your custom code to Cloud Storage.
- Deploy your custom prediction routine to AI Platform.
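A minimal sketch of the first step, assuming a pickled scikit-learn model plus a hypothetical pickled preprocessor that wraps the external toolset behind its Python API; the class and file names are assumptions, but the predict/from_path interface is the one the custom prediction routine docs define:

```python
# Sketch of a custom predictor for an AI Platform custom prediction
# routine. model.pkl and preprocessor.pkl are illustrative assumptions.
import os
import pickle

class MyPredictor(object):
    def __init__(self, model, preprocessor):
        self._model = model
        self._preprocessor = preprocessor

    def predict(self, instances, **kwargs):
        # Apply the custom preprocessing (e.g. the external NLP toolset
        # behind its Python API) before handing data to the model.
        inputs = self._preprocessor.preprocess(instances)
        return self._model.predict(inputs).tolist()

    @classmethod
    def from_path(cls, model_dir):
        # AI Platform calls this with the Cloud Storage artifact directory.
        with open(os.path.join(model_dir, "model.pkl"), "rb") as f:
            model = pickle.load(f)
        with open(os.path.join(model_dir, "preprocessor.pkl"), "rb") as f:
            preprocessor = pickle.load(f)
        return cls(model, preprocessor)
```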

Google Cloud ML Engine: Apply Custom Function Before Training / Predicting

I currently pre-process some data on my laptop before passing it to ML Engine for training. Is it possible to apply a custom pre-processing function to my data and then train, all within ML Engine?
So instead of these steps:
Pre-process data on laptop.
Send pre-processed data to ML Engine for training.
I would do:
Define pre-processing function for ML Engine
Send raw data to ML Engine, where it will:
a) pre-process my data by applying the function I've specified and
b) train on that data
Is this possible and, if so, how would I do it? I don't see anything in the docs.
Thanks!
You can use some of the sample code here:
Pre-processing is done using Dataflow, and then training runs in ML Engine using the output generated during the pre-processing phase.
For preprocessing TensorFlow models, consider TensorFlow Transform (Getting Started Guide).
You may be interested in the chicago_taxi example, which includes a script for integrating the preprocessing with classification on Cloud ML Engine.
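To give a flavor of what the preprocessing code looks like, here is a minimal tf.Transform preprocessing_fn sketch; the feature names are illustrative assumptions:

```python
# Minimal tf.Transform preprocessing_fn. Analyzers like scale_to_z_score
# do a full pass over the data (run as a Dataflow job), and the resulting
# transform graph is applied identically at training and serving time.
# Feature names 'x' and 'y' are illustrative assumptions.
import tensorflow_transform as tft

def preprocessing_fn(inputs):
    return {
        "x_scaled": tft.scale_to_z_score(inputs["x"]),
        "y": inputs["y"],
    }
```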

Why use TensorFlow for Convolutional Neural Networks

I recently took a course by Andrew Ng on Coursera. After that I shifted to Python and used Pandas, NumPy, and scikit-learn to implement ML algorithms. Now while surfing I came across TensorFlow and found it pretty amazing, and implemented this example, which takes MNIST data as input.
But I am unsure why one would use such a library (TensorFlow).
We are not doing any parallel calculations, since the weights updated in the previous epoch are used in the next one.
I am finding it difficult to find a reason to use such a library.
There are several forms of parallelism that TensorFlow provides when training a convolutional neural network (and many other machine learning models), including:
Parallelism within individual operations (such as tf.nn.conv2d() and tf.matmul()). These operations have efficient parallel implementations for multi-core CPUs and GPUs, and TensorFlow uses these implementations wherever available.
Parallelism between operations. TensorFlow uses a dataflow graph representation for your model, and where there are two nodes that aren't connected by a directed path in the dataflow graph, these may execute in parallel. For example, the Inception image recognition model has many parallel branches in its dataflow graph (see figure 3 in this paper), and TensorFlow can exploit this to run many operations at the same time. The AlexNet paper also describes how to use "model parallelism" to run operations in parallel on different parts of the model, and TensorFlow supports that using the same mechanism.
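To illustrate the second point, here is a small sketch written against the TF 1.x-style graph API: the two matmuls below are not connected by a directed path in the dataflow graph, so TensorFlow is free to run them concurrently.

```python
# Two independent branches in the dataflow graph; the runtime may execute
# them in parallel because neither depends on the other's output.
import tensorflow.compat.v1 as tf
tf.disable_eager_execution()

a = tf.random_normal([1000, 1000])
b = tf.random_normal([1000, 1000])

left = tf.matmul(a, a)    # branch 1
right = tf.matmul(b, b)   # branch 2 (independent of branch 1)
result = left + right     # joins the two branches

with tf.Session() as sess:
    sess.run(result)
```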
Parallelism between model replicas. TensorFlow is also designed for distributed execution. One common scheme for parallel training ("data parallelism") involves sharding your dataset across a set of identical workers, performing the same training computation on each of those workers for different data, and sharing the model parameters between the workers.
In addition, libraries like TensorFlow and Theano can perform various optimizations when they can work with the whole dataflow graph of your model. For example, they can eliminate common subexpressions, avoid recomputing constant values, and generate more efficient fused code.
You might be able to find pre-baked models in sklearn or other libraries, but TensorFlow allows for really fast iteration of custom machine learning models. It also comes with a ton of useful functions that you would have to (and probably shouldn't) write yourself.
To me, it's less about performance (though they certainly care about performance), and more about whipping out neural networks really quickly.

Google cloud dataflow and machine learning

What's the best way to run machine learning algorithms on Google Cloud Dataflow? I can imagine that using Mahout would be one option, given it's Java based.
The answer is probably no, but is there a way to invoke R- or Python-based scripts (which have strong support for algorithms) to offload ML execution?
-Girish
You can already implement many algorithms in terms of Dataflow transforms.
A class of algorithms that may not be as easy to implement are iterative algorithms, where the pipeline's execution graph depends on the data itself. Simplifying implementation of iterative algorithms is something that we are interested in, and you can expect future improvements and simplifications in this area.
Invoking a Python (or any other) executable shouldn't be hard from a Dataflow pipeline: a ParDo can, for example, shell out and start an arbitrary process. You can use the --filesToStage pipeline option to add additional files to the Dataflow worker environment.
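A hedged sketch of that shell-out pattern, using the Apache Beam Python SDK (the open-source successor of the original Dataflow SDK); ./my_tool is a hypothetical stand-in for whatever staged executable you need:

```python
# A DoFn that shells out to an external executable for each element.
# './my_tool' is a hypothetical staged binary, not a real tool.
import subprocess

import apache_beam as beam

class ShellOutFn(beam.DoFn):
    def process(self, element):
        # Start an external process per element and capture its output.
        out = subprocess.run(
            ["./my_tool", element],
            capture_output=True, text=True, check=True,
        )
        yield out.stdout.strip()

with beam.Pipeline() as p:
    (p
     | beam.Create(["a", "b", "c"])
     | beam.ParDo(ShellOutFn()))
```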
There is also http://quickml.org/ (haven't used it personally) and Weka. I remember the docs mention that it's possible to launch a new process from within the job, but AFAIK it's not recommended.
