I started working with Azure Machine Learning Service. It has a feature called Pipeline, which I'm currently trying to use. There are, however, a bunch of things that are completely unclear from the documentation and the examples, and I'm struggling to fully grasp the concept.
When I look at 'batch scoring' examples, it is implemented as a Pipeline Step. This raises the question: does this mean that the 'predicting part' is part of the same pipeline as the 'training part', or should there be 2 separate pipelines for this? Making 1 pipeline that combines both steps seems odd to me, because you don't want to run your predicting part every time you change something in the training part (and vice versa).
What parts should be implemented as a Pipeline Step and what parts shouldn't? Should the creation of the Datastore and Dataset be implemented as a step? Should registering a model be implemented as a step?
What isn't shown anywhere is how to deal with the model registry. I create the model in the training step and then write it to the output folder as a pickle file. Then what? How do I get the model in the next step? Should I pass it on as a PipelineData object? Should train.py itself be responsible for registering the trained model?
Anders has a great answer, but I'll expand on #1 a bit. In the batch scoring examples you've seen, the assumption is that there is already a trained model, which could be coming from another pipeline, or in the case of the notebook, it's a pre-trained model not built in a pipeline at all.
However, running both training and prediction in the same pipeline is a valid use case. Use the allow_reuse param and set it to True, which will cache the step output in the pipeline to prevent unnecessary reruns.
Take a model training step for example, and consider the following inputs to that step:
training script
input data
additional step params
If you set allow_reuse=True and your training script, input data, and other step params are the same as the last time the pipeline ran, the step will not be rerun; the cached output from the previous run will be used instead. But if, say, your input data changed, then the step would rerun.
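For illustration, here is a minimal sketch of a training step with allow_reuse=True; the script name (train.py), compute target name (cpu-cluster), and registered dataset name (training_data) are assumptions, not details from the question:

```python
# Sketch only: script, dataset, and compute target names are assumptions.
from azureml.core import Workspace, Dataset
from azureml.pipeline.core import Pipeline, PipelineData
from azureml.pipeline.steps import PythonScriptStep

ws = Workspace.from_config()
input_data = Dataset.get_by_name(ws, "training_data")             # assumed registered dataset
model_output = PipelineData("model_output", datastore=ws.get_default_datastore())

train_step = PythonScriptStep(
    name="train",
    script_name="train.py",                                       # assumed training script
    arguments=["--output-dir", model_output],
    inputs=[input_data.as_named_input("training_data")],
    outputs=[model_output],
    compute_target="cpu-cluster",                                  # assumed compute target
    allow_reuse=True,  # reuse cached step output when script, inputs, and params are unchanged
)

pipeline = Pipeline(workspace=ws, steps=[train_step])
```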
In general, pipelines are pretty modular and you can build them how you see fit. You could maintain separate pipelines for training and scoring, or bundle everything in one pipeline but leverage the automatic caching.
Azure ML pipelines best practices are emergent, so I can give you some recommendations, but I wouldn't be surprised if others respond with divergent, deeply-held opinions. The Azure ML product group is also improving and expanding the product at a phenomenal pace, so I fully expect things to change (for the better) over time. This article does a good job of explaining ML pipelines.
3 Passing a model to a downstream step
How do I get the model in the next step?
During development, I recommend that you don't register your model and that the scoring step receives your model via a PipelineData as a pickled file.
In production, the scoring step should use a previously registered model.
Our team uses a PythonScriptStep that has a script argument that allows a model to be passed from an upstream step or fetched from the registry. Our batch score step uses a PipelineData named best_run_data, which contains the best model (saved as model.pkl) from a HyperDriveStep.
The definition of our batch_score_step has a boolean argument, '--use_model_registry', that determines whether to use the recently trained model or the model registry. We use a function, get_model_path(), to pivot on the script arg. Here are some code snippets of the above.
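An illustrative sketch of how the scoring script could pivot on that flag (the --model_dir and --model_name arguments and the body of get_model_path() are assumptions, not our exact code):

```python
# Illustrative sketch only: argument names and get_model_path() are assumptions.
import argparse
import os

from azureml.core import Model, Run


def get_model_path(use_model_registry, model_dir, model_name):
    """Return a local path to model.pkl, either from the registry or an upstream step."""
    if use_model_registry:
        # Fetch the latest registered version of the model from the workspace.
        ws = Run.get_context().experiment.workspace
        return Model(ws, name=model_name).download(exist_ok=True)
    # Otherwise use the model.pkl written by the upstream HyperDrive/training step,
    # delivered to this step as a mounted PipelineData directory (e.g. best_run_data).
    return os.path.join(model_dir, "model.pkl")


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--use_model_registry", type=lambda s: s.lower() == "true", default=False)
    parser.add_argument("--model_dir", type=str, default=None)   # bound to the best_run_data PipelineData
    parser.add_argument("--model_name", type=str, default="my_model")
    args = parser.parse_args()

    model_path = get_model_path(args.use_model_registry, args.model_dir, args.model_name)
    print(f"Scoring with model at: {model_path}")
```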
2 Control Plane vs Data Plane
What parts should be implemented as a Pipeline Step and what parts shouldn't?
All transformations you do to your data (munging, featurization, training, scoring) should take place inside PipelineSteps, whose inputs and outputs should be PipelineData objects.
Azure ML artifacts should be:
- created in the pipeline control plane using PipelineData, and
- registered either:
- ad-hoc, as opposed to with every run, or
- when you need to pass artifacts between pipelines.
In this way, PipelineData is the glue that connects pipeline steps directly, rather than steps being indirectly connected with .register() and .download().
PipelineData objects are ultimately just ephemeral directories that can also be used as placeholders, before steps are run, to create and register artifacts.
Datasets are abstractions of PipelineData in that they make things easier to pass to AutoMLStep, HyperDriveStep, and DataDrift.
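As a sketch of PipelineData acting as that glue (with assumed script and compute target names), a training step can write model.pkl into a PipelineData directory that a downstream batch scoring step then mounts as an input:

```python
# Sketch only: script names and compute target are assumptions.
from azureml.core import Workspace
from azureml.pipeline.core import Pipeline, PipelineData
from azureml.pipeline.steps import PythonScriptStep

ws = Workspace.from_config()
best_run_data = PipelineData("best_run_data", datastore=ws.get_default_datastore())

train_step = PythonScriptStep(
    name="train",
    script_name="train.py",
    arguments=["--model-dir", best_run_data],
    outputs=[best_run_data],          # upstream step writes model.pkl here
    compute_target="cpu-cluster",
)

batch_score_step = PythonScriptStep(
    name="batch_score",
    script_name="batch_score.py",
    arguments=["--model-dir", best_run_data, "--use_model_registry", "False"],
    inputs=[best_run_data],           # downstream step mounts the same directory
    compute_target="cpu-cluster",
)

pipeline = Pipeline(workspace=ws, steps=[train_step, batch_score_step])
```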
1 Pipeline encapsulation
does this mean that the 'predicting part' is part of the same pipeline as the 'training part', or should there be 2 separate pipelines for this?
Your pipeline architecture depends on whether:
you need to predict live (else batch prediction is sufficient), and
your data is already transformed and ready for scoring.
If you need live scoring, you should deploy your model (see the sketch after this list). If batch scoring is fine, you could either have:
a training pipeline at the end of which you register a model that is then used in a scoring pipeline, or
do as we do and have one pipeline that can be configured to do either using script arguments.
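For the live-scoring branch, a minimal sketch of deploying a registered model as a real-time endpoint (ACI for simplicity); the model name approval-model, score.py, and environment.yml are placeholder assumptions:

```python
# Sketch only: model name, entry script, and environment file are assumptions.
from azureml.core import Environment, Model, Workspace
from azureml.core.model import InferenceConfig
from azureml.core.webservice import AciWebservice

ws = Workspace.from_config()
model = Model(ws, name="approval-model")                     # assumed registered model

env = Environment.from_conda_specification("score-env", "environment.yml")
inference_config = InferenceConfig(entry_script="score.py", environment=env)
deployment_config = AciWebservice.deploy_configuration(cpu_cores=1, memory_gb=1)

service = Model.deploy(
    workspace=ws,
    name="approval-service",
    models=[model],
    inference_config=inference_config,
    deployment_config=deployment_config,
)
service.wait_for_deployment(show_output=True)
```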
Related
I'm new to SageMaker Pipelines and doing some research on how I can train models not just in a Jupyter notebook but set them up as a SageMaker pipeline in SageMaker Studio. I tried and followed some examples based on the blogs/docs provided here -> https://aws.amazon.com/blogs/machine-learning/hugging-face-on-amazon-sagemaker-bring-your-own-scripts-and-data/
I was able to run these steps in a Jupyter notebook in SageMaker, but if I wanted to set up a SageMaker pipeline and create steps for training, how can I convert these to SageMaker pipeline steps? Any examples or blogs doing something similar in SageMaker Pipelines/Studio would be helpful.
While there is a way to convert a Jupyter notebook to a Python script, this is not sufficient to convert code written in a notebook context to a pipeline context.
The only things that remain intact are the scripts written as the entry_point for training/inference/processing (barring minor internal adjustments, e.g. environment variables that may be exposed differently in a pipeline context).
This official guide seems to me the most complete to follow as a prerequisite:
"Amazon SageMaker Model Building Pipeline"
Next you can see a fairly recurring application scenario, again in official guide:
"Orchestrate Jobs to Train and Evaluate Models with Amazon SageMaker Pipelines"
Once you set the name of your pipeline and launch it, you will see it in SageMaker Studio as a graph with the various steps and their running states.
You will probably have written data manipulation code within your notebook. Remember that the pipeline is composed of steps, so you cannot manipulate data between steps without wrapping that manipulation in a step of its own (e.g. a processing step). This is probably the biggest part of the code you will need to restructure.
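To make that concrete, here is a hedged sketch (not the code from the linked guides) of wrapping notebook-style data preparation and training into pipeline steps; the scripts (preprocess.py), S3 paths, and instance types are placeholder assumptions:

```python
# Sketch only: scripts, S3 paths, and instance types are assumptions.
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import ProcessingStep, TrainingStep

session = sagemaker.Session()
role = sagemaker.get_execution_role()

# Data preparation that used to live in notebook cells becomes a ProcessingStep.
processor = SKLearnProcessor(
    framework_version="1.2-1",
    role=role,
    instance_type="ml.m5.xlarge",
    instance_count=1,
)
step_process = ProcessingStep(
    name="PrepareData",
    processor=processor,
    inputs=[ProcessingInput(source="s3://my-bucket/raw/", destination="/opt/ml/processing/input")],
    outputs=[ProcessingOutput(output_name="train", source="/opt/ml/processing/train")],
    code="preprocess.py",               # assumed preprocessing script
)

# The training code stays in its entry-point script; only the orchestration changes.
estimator = Estimator(
    image_uri=sagemaker.image_uris.retrieve("xgboost", session.boto_region_name, version="1.5-1"),
    role=role,
    instance_type="ml.m5.xlarge",
    instance_count=1,
    output_path="s3://my-bucket/models/",
)
step_train = TrainingStep(
    name="TrainModel",
    estimator=estimator,
    inputs={
        "train": TrainingInput(
            s3_data=step_process.properties.ProcessingOutputConfig.Outputs["train"].S3Output.S3Uri
        )
    },
)

pipeline = Pipeline(name="MyTrainingPipeline", steps=[step_process, step_train])
pipeline.upsert(role_arn=role)
execution = pipeline.start()
```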
I just noticed that many people tend to use train_test_split even before handling the missing data; it seems like they split the data at the very beginning.
There are also a bunch of people who tend to split the data right before the model building step, after they do all the data cleaning, feature engineering, and feature selection.
The people who split the data at the very beginning say it is to prevent data leakage.
I am right now just so confused about the pipeline of building a model.
Why do we need to split the data at the very beginning, and clean the train set and test set separately, when we could do all the data cleaning and feature engineering (things like transforming categorical variables to dummy variables) together for convenience?
Please help me with this.
I really want to know a convenient and scientific pipeline.
You should split the data as early as possible.
To put it simply, your data engineering pipeline builds models too.
Consider the simple idea of filling in missing values. To do this you need to "train" a mini-model to generate the mean or mode or some other average to use. Then you use this model to "predict" missing values.
If you include the test data in the training process for these mini-models, then you are letting the training process peek at that data and cheat a little bit. When it fills in the missing data with values built using the test data, it leaves little hints about what the test set is like. This is what "data leakage" means in practice. In an ideal world you could ignore it and just use all the data for training, using the training score to decide which model is best.
But that won't work, because in practice a model is only useful once it is able to predict any new data, and not just the data available at training time. Google Translate needs to work on whatever you and I type in today, not just what it was trained with earlier.
So, in order to ensure that the model will continue to work well when that happens, you should test it on some new data in a more controlled way. Using a test set, which has been split out as early as possible and then hidden away, is the standard way to do that.
Yes, it means some inconvenience to split the data engineering up for training vs. testing. But many tools like scikit-learn, which separates the fit and transform stages, make it convenient to build an end-to-end data engineering and modeling pipeline with the right train/test separation.
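As a small illustration of that fit/transform separation, the sketch below (with made-up columns) fits the imputation and encoding "mini-models" on the training split only and then applies them to the held-out test split:

```python
# Illustration only: the columns and values are made up for the example.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({
    "age": [25, None, 40, 31, None, 52],
    "city": ["NY", "LA", None, "NY", "SF", "LA"],
    "approved": [1, 0, 1, 1, 0, 0],
})

# Split first, before any cleaning or feature engineering.
X_train, X_test, y_train, y_test = train_test_split(
    df[["age", "city"]], df["approved"], test_size=0.33, random_state=0, stratify=df["approved"]
)

preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="mean"), ["age"]),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), ["city"]),
])

model = Pipeline([("prep", preprocess), ("clf", LogisticRegression())])

# fit() learns imputation values and categories from the training set only;
# score() then applies those same transforms to the unseen test set.
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```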
The workflow:
To preprocess our raw data we use PySpark. We need to use Spark because of the size of the data.
The PySpark preprocessing job uses a pipeline model that allows you to export your preprocessing logic to a file (see the sketch below).
By exporting the preprocessing logic via a pipeline model, you can load the pipeline model at inference time. This way you don't need to code your preprocessing logic twice.
At inference time, we would prefer to do the preprocessing step without a Spark context. The Spark context is redundant at inference time and slows down the inference.
I was looking at MLeap, but it only supports the Scala language for inference without a Spark context. Since we use PySpark, it would be nice to stick to the Python language.
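Here is a minimal sketch of that export step on the training side, with made-up paths and column names; note that loading the saved PipelineModel back still requires a Spark context, which is exactly the limitation being asked about:

```python
# Sketch only: input path and column names are assumptions.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler

spark = SparkSession.builder.appName("preprocessing").getOrCreate()
raw = spark.read.parquet("path/to/raw_data")              # assumed input location

indexer = StringIndexer(inputCol="city", outputCol="city_idx", handleInvalid="keep")
encoder = OneHotEncoder(inputCols=["city_idx"], outputCols=["city_vec"])
assembler = VectorAssembler(inputCols=["age", "city_vec"], outputCol="features")

pipeline = Pipeline(stages=[indexer, encoder, assembler])
pipeline_model = pipeline.fit(raw)                        # the exportable preprocessing logic

# Persist the fitted PipelineModel so the same logic can be reloaded later.
pipeline_model.write().overwrite().save("path/to/preprocessing_pipeline")
```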
Question:
What is a good alternative that lets you build a pipeline model in (Py)Spark at training time and reuse this pipeline model from Python without needing a Spark context?
I am using Azure Machine Learning to build a model which will predict if a project will be approved (1) or not (0).
My dataset is composed of a list of projects. Each line represents a project and its details - starting day, theme, author, place, people involved, stage, date of last stage and approved.
There are 15 possible sequential stages a project can pass through before being approved. On the other hand, in some special cases, a project can be approved mid-way, that is, before getting to the last stage, which is the most common case.
I will be receiving daily updates on some projects, as well as new projects that are coming in. I am trying to build a model which will predict the probability of a project being approved or not based on my inputs (which will include stage).
I want to use stage as an input, but if I use it with a two-class boosted decision tree it will indirectly give the answer to my model.
I've read a little bit about HMM and tried to learn how to apply to my ML model but did not understand how to. Could anyone guide me to the right path, please? Should I really use HMM?
Rather than stage, I would recommend using the duration in the last stage, the duration in the last stage -1, and the duration in the last stage -2.
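A hedged sketch of what that feature engineering could look like, assuming you keep a history of the daily updates (project_id, stage, date); the column names and the aggregation are illustrative assumptions:

```python
# Sketch only: the history table and its columns are assumptions.
import pandas as pd

history = pd.DataFrame({
    "project_id": [1, 1, 1, 1, 2, 2],
    "stage":      [1, 1, 2, 3, 1, 2],
    "date": pd.to_datetime(
        ["2024-01-01", "2024-01-05", "2024-01-08", "2024-01-20", "2024-02-01", "2024-02-10"]
    ),
})

# Days spent in each stage = span between first and last observation of that stage.
durations = (
    history.groupby(["project_id", "stage"])["date"]
    .agg(lambda d: (d.max() - d.min()).days)
    .reset_index(name="days_in_stage")
)

# Pivot so each project gets a column per stage it has passed through.
features = durations.pivot(index="project_id", columns="stage", values="days_in_stage")
features.columns = [f"days_in_stage_{s}" for s in features.columns]
print(features.fillna(0))
```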
After running the machine learning algorithm (SVM) on training data using the GATE tool, I would like to test it on testing data. My question is: should the same training data be used for testing? Also, how can the model extract entities from the test data when the test data is not annotated with the annotations that were learnt from the training data?
I followed the tutorial on this link http://gate.ac.uk/sale/talks/gate-course-may11/track-3/module-11-machine-learning/module-11.pdf but at the end it was a bit confusing when it talks about splitting the dataset into training and testing.
In GATE you have 3 modes of the machine learning PR - for training, evaluation and application.
What happens when you train is that the ML PR checks the selected annotation (let's say Token), collects its features, and learns the target class (i.e. Person, Mention, or whatever). Using the example docs, the ML PR creates a model which holds values for features and basically "learns" how to classify new Tokens (or sentences, or other units).
When testing, you provide the ML PR only the Tokens with all their features. Then the ML PR uses them as input for its model and decides if or what Mention to create. The ML PR actually needs everything that was there in the training corpus, except the label / target class / mention - the decision that should be made.
I think the GATE ML PR ignores the labels when in test mode, so it's not crucial to remove them.
Evaluation is a helpful option where training and testing are done automatically: the corpus is split and results are presented. What it does is split the corpus in two, train on one part, apply the model on the other, and compare the gold standard to what it labeled. Repeat with different splits.
The usual sequence is to train and evaluate, check results, fix, add features, etc. and when you're happy with the evaluation results, switch to application and run on data that doesn't have labels.
It is crucial that you run the same pre-processing when you're training and testing. For instance if in training you've run a POS tagger and you skip this when testing, the ML PR won't have the "Token.category" feature and will calculate very different results.
Now to your questions :)
NO! Don't use the same data for testing; that is a very common mistake. If you get suspiciously good results, first check whether you're doing that.
In the tutorial, when you split the corpus both parts will have all the annotations as before, so the ML PR will have all the features it needed. In real life, you'll have to run some pre-processing first as docs will come without tokens or anything.
Splitting in their case is done very simply: just save all docs to files, split the files into two folders, and load them as two corpora.
Hope this helps :)