Running a multi-file Python Apache Beam job with Airflow / Cloud Composer - google-cloud-dataflow

I have a multi-file Python Apache Beam job that I want to run on Cloud Dataflow through Airflow / Cloud Composer. My multi-file job is structured following this recommendation: https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/#multiple-file-dependencies It runs without any problem from the CLI.
Now I want to run it through Airflow and Cloud Composer. I tried to use https://airflow.apache.org/integration.html#dataflowpythonoperator but it does not work; I am probably missing the right way to configure this operator. For what it's worth, I have no problem using this operator to run single-file Python Apache Beam jobs.
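For context, here is a hedged sketch of one way the operator is often configured for multi-file jobs: pointing its options at the job's setup.py so the extra modules are shipped to the Dataflow workers. The DAG name, paths, project and bucket below are placeholders, and this is an illustration of the idea rather than a verified configuration.

    # Hedged sketch, not a verified configuration: assumes the pipeline package and its
    # setup.py live in the Composer DAGs bucket; all paths/names below are placeholders.
    from datetime import datetime
    from airflow import DAG
    from airflow.contrib.operators.dataflow_operator import DataFlowPythonOperator

    with DAG('beam_multifile_job', start_date=datetime(2019, 1, 1), schedule_interval=None) as dag:
        run_beam = DataFlowPythonOperator(
            task_id='run_beam_multifile',
            py_file='/home/airflow/gcs/dags/my_pipeline/main.py',             # placeholder path
            options={
                # Passed to the pipeline as --setup_file, so Dataflow packages the local modules.
                'setup_file': '/home/airflow/gcs/dags/my_pipeline/setup.py',  # placeholder path
            },
            dataflow_default_options={
                'project': 'my-gcp-project',            # placeholder
                'temp_location': 'gs://my-bucket/tmp',  # placeholder
            },
        )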

Related

Build Beam pipelines using Bazel (with DataflowRunner)

I use Bazel to build my Beam pipeline. The pipeline works well with the DirectRunner; however, I have trouble managing dependencies when I use the DataflowRunner: Python cannot find local dependencies (e.g. those generated by py_library) on the Dataflow workers. Is there any way to hint Dataflow to use the Python binary (the py_binary zip file) in the worker container to resolve the issue?
Thanks,
Please see here for more details on setting up dependencies for the Python SDK on Dataflow. If you are using a local dependency, you should probably look into packaging it as a Python package and using the extra_package option, or into developing a custom container.
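For illustration, a hedged sketch of the extra_package route: the local dependency is first built into a source distribution (outside Bazel, or via a genrule), and the resulting tarball is handed to the pipeline options. The package name, project and bucket are placeholders.

    # Hedged sketch: assumes my_local_dep has already been built into a source
    # distribution (e.g. python setup.py sdist); file name, project and bucket are placeholders.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions([
        '--runner=DataflowRunner',
        '--project=my-gcp-project',                      # placeholder
        '--temp_location=gs://my-bucket/tmp',            # placeholder
        '--staging_location=gs://my-bucket/staging',     # placeholder
        '--extra_package=dist/my_local_dep-0.1.tar.gz',  # ships the local dependency to the workers
    ])

    with beam.Pipeline(options=options) as p:
        p | beam.Create([1, 2, 3]) | beam.Map(lambda x: x * 2)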

Generating Dataflow Template Using Python

I have a Python script that creates a Dataflow template at the specified GCS path. I have tested the script using my GCP free trial and it works perfectly.
My question: I want to generate a template with the same code in the production environment, but I cannot use Cloud Shell because of restrictions, and I also cannot directly run the Python script that uses the SA keys.
I also cannot create a VM and use it to generate a template in GCS.
Given the above restrictions, is there any option to generate the Dataflow template?
Using Dataflow Flex Templates should obviate the need to generate templates programmatically; instead, you can create a single template that can be parameterized arbitrarily.
Using Composer, I triggered the Dataflow DAGs, which created jobs in Dataflow. I also managed to generate a Dataflow template, and then executed the job from the Dataflow console using that template.
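For reference, a classic template of the kind described in the question is produced by running the pipeline with a template_location option; the sketch below shows the mechanism with placeholder names (it does not, by itself, solve the credential restrictions).

    # Hedged sketch of classic template generation: project, bucket and paths are placeholders.
    # With template_location set, the run stages the pipeline as a template instead of executing it.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions([
        '--runner=DataflowRunner',
        '--project=my-gcp-project',                                   # placeholder
        '--staging_location=gs://my-bucket/staging',                  # placeholder
        '--temp_location=gs://my-bucket/tmp',                         # placeholder
        '--template_location=gs://my-bucket/templates/my_template',   # where the template is written
    ])

    with beam.Pipeline(options=options) as p:
        p | beam.Create(['hello', 'world']) | beam.Map(str.upper)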

TensorFlow Transform on Beam with the Flink runner

This is my very first post here, so sorry if I do anything wrong.
I am currently building a simple ML pipeline with TFX 0.11 (i.e. tfdv-tft-tfserving) and TensorFlow 1.11, using Python 2.7. I have an Apache Flink cluster and I want to use it for TFX. I know the framework behind TFX is Apache Beam 2.8, and Apache Beam currently supports Flink with the Python SDK through a portable runner layer.
The problem is how to code in TFX (tfdv-tft) using Apache Beam with the Flink runner through this portable runner concept, as TFX currently seems to only support the DirectRunner and the DataflowRunner (Google Cloud).
I have been searching the web for some time, and saw the last line on the TFX website:
"Please direct any questions about working with tf.Transform to Stack Overflow using the tensorflow-transform tag."
And that's why I am here. Any idea or workaround is really appreciated. Thank you!
Thanks for the question.
Disclaimer: the portable Flink runner is still in an experimental phase and will only work with a trivial amount of input data.
Here is how you can run TFX on Flink via Beam.
Prerequisites
Linux
Docker
Beam Repo: https://github.com/apache/beam
Distributed file system for input and output.
Instructions to run a python pipeline: https://beam.apache.org/roadmap/portability/#python-on-flink
Note: We currently only support Flink 1.5.5
Instructions
1) Build Worker Containers:
Go to Beam checkout dir
Run gradle command: ./gradlew :beam-sdks-python-container:docker
2) Run Beam JobServer for Flink:
Go to Beam checkout dir
Run gradle command: ./gradlew beam-runners-flink_2.11-job-server:runShadow
Note: this command will not finish, as it starts the job server and keeps it running.
3) Submit a pipeline
Please refer to https://github.com/angoenka/model-analysis/blob/hack_1/examples/chicago_taxi/preprocess_flink.sh
Note: make sure to pass the following flags to your pipeline (a sketch of wiring these into pipeline options follows the list):
--experiments=beam_fn_api
--runner PortableRunner
--job_endpoint=localhost:8099
--experiments=worker_threads=100
--execution_mode_for_batch=BATCH_FORCED
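For illustration, here is a hedged sketch of how these flags might be passed to a Beam Python pipeline programmatically. The trivial pipeline body is a placeholder; a real TFX / tf.Transform job would hand these options to its own Beam pipeline instead.

    # Hedged sketch: builds PipelineOptions from the flags listed above.
    # The trivial pipeline body is a placeholder for the actual TFX/tf.Transform work.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions([
        '--runner=PortableRunner',
        '--job_endpoint=localhost:8099',
        '--experiments=beam_fn_api',
        '--experiments=worker_threads=100',
        '--execution_mode_for_batch=BATCH_FORCED',
    ])

    with beam.Pipeline(options=options) as p:
        p | beam.Create(['a', 'b', 'c']) | beam.Map(lambda x: x.upper())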

Bit Bucket Pipeline Docker Image

Is there a Docker image out there with Composer, PHPUnit and Codeception installed that I can use for Bitbucket Pipelines?
Or is there another way to prevent Composer from being installed in every build?
Composer has an official image, but I guess you have to install Codeception yourself. Composer and PHPUnit (which is only a PHP script) should work: https://hub.docker.com/_/composer/

How to develop a Beam pipeline locally in an IDE and run it on Dataflow?

I want to use PyCharm on my PC to develop Beam pipelines and run them on Dataflow. Are there any tutorials on how to do this?
Set the parameters as described here:
https://cloud.google.com/dataflow/pipelines/specifying-exec-params
and try the example:
https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/wordcount_minimal.py
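To make the workflow concrete, here is a hedged sketch of a minimal wordcount-style pipeline that can be developed and run locally from PyCharm, then pointed at Dataflow just by changing the pipeline options; project, region and bucket are placeholders.

    # Hedged sketch: develop/debug locally with DirectRunner, then switch the options below
    # to DataflowRunner to run the same code on Dataflow. Project, region and bucket are placeholders.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions([
        '--runner=DataflowRunner',                    # use '--runner=DirectRunner' while developing locally
        '--project=my-gcp-project',                   # placeholder
        '--region=us-central1',                       # placeholder
        '--temp_location=gs://my-bucket/tmp',         # placeholder
        '--staging_location=gs://my-bucket/staging',  # placeholder
    ])

    with beam.Pipeline(options=options) as p:
        (p
         | 'Read' >> beam.io.ReadFromText('gs://dataflow-samples/shakespeare/kinglear.txt')
         | 'Split' >> beam.FlatMap(lambda line: line.split())
         | 'Count' >> beam.combiners.Count.PerElement()
         | 'Format' >> beam.Map(lambda kv: '%s: %d' % kv)
         | 'Write' >> beam.io.WriteToText('gs://my-bucket/output/wordcount'))  # placeholder output path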
It's easy:
Step 1: write your code as a .py file.
Step 2: in the terminal, just run the .py file.
