Build Beam pipelines using Bazel (with DataflowRunner) - google-cloud-dataflow

I use Bazel to build my Beam pipeline. The pipeline works well with the DirectRunner; however, I have trouble managing dependencies when I use the DataflowRunner: Python cannot find local dependencies (e.g. those generated by py_library) on the Dataflow workers. Is there any way to hint to Dataflow that it should use the Python binary (the py_binary zip file) in the worker container to resolve this?
Thanks,

Please see here for more details on setting up dependencies for the Python SDK on Dataflow. If you are using a local dependency, you should probably look into packaging it as a Python package and using the extra_package option, or building a custom container.
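As a minimal sketch (the project, bucket, and package names below are placeholders, not values from the question), you can build the local dependency into a source distribution and point the Dataflow job at it with the extra_packages option:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Build the local dependency into a source distribution first, e.g.
#   cd my_local_dep && python setup.py sdist
# which produces dist/my_local_dep-0.0.1.tar.gz (hypothetical name).
options = PipelineOptions(
    runner="DataflowRunner",
    project="my-gcp-project",              # placeholder
    region="us-central1",                  # placeholder
    temp_location="gs://my-bucket/tmp",    # placeholder
    extra_packages=["dist/my_local_dep-0.0.1.tar.gz"],
)

with beam.Pipeline(options=options) as p:
    (p
     | beam.Create(["hello", "world"])
     | beam.Map(str.upper))

Dataflow stages the tarball and installs it on each worker at startup, which is something the Bazel-built py_binary zip does not give you out of the box.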

Related

How should I use npm modules while developing Jenkins plugin UI?

I am working on a simple Jenkins plugin for pipeline visualization. I would like to use the d3.js library in my project, but I have no idea how to integrate npm modules into the Jenkins plugin development workflow. Is it even possible?
I have tried putting the d3.js source code directly into the folder with my JS scripts, but it doesn't work and some files go missing for reasons I can't explain.

Running multi file python apache beam with airflow Cloud Composer

I have a multi-file Python Apache Beam job that I want to run on Cloud Dataflow through Airflow / Cloud Composer. My multi-file job is designed following this recommendation: https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/#multiple-file-dependencies It works without any problem from the CLI.
Now I want to run it through Airflow and Cloud Composer. I tried to use https://airflow.apache.org/integration.html#dataflowpythonoperator but it does not work; I'm probably missing the right way to configure this operator. By the way, I have no problem using this operator to run single-file Python Apache Beam jobs.
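A minimal sketch of one way the operator can be configured for a multi-file job, assuming the job is packaged with a setup.py as in the Beam recommendation linked above (the DAG id, paths, and project values are placeholders):

from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.dataflow_operator import DataFlowPythonOperator

default_args = {"start_date": datetime(2019, 1, 1)}  # placeholder

with DAG("beam_multifile_example",        # hypothetical DAG id
         schedule_interval=None,
         default_args=default_args) as dag:

    run_pipeline = DataFlowPythonOperator(
        task_id="run_beam_pipeline",
        # Entry-point module of the multi-file job (placeholder path).
        py_file="/home/airflow/gcs/dags/my_pipeline/main.py",
        options={
            # The setup.py from the multiple-file-dependencies recommendation;
            # Beam builds and stages the package from it, like --setup_file on the CLI.
            "setup_file": "/home/airflow/gcs/dags/my_pipeline/setup.py",
        },
        dataflow_default_options={
            "project": "my-gcp-project",            # placeholder
            "temp_location": "gs://my-bucket/tmp",  # placeholder
        },
    )

The paths assume the Cloud Composer mount at /home/airflow/gcs/dags; adjust them for your environment.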

Tensorflow transform on beams with flink runner

This may seem like a silly question, but it is my very first post here, so sorry if I get anything wrong.
I am currently building a simple ML pipeline with TFX 0.11 (i.e. tfdv-tft-tfserving) and TensorFlow 1.11, using Python 2.7. I have an Apache Flink cluster and I want to use it for TFX. I know the framework behind TFX is Apache Beam 2.8, which currently supports Flink with the Python SDK through a portable runner layer.
The problem is how I can code in TFX (tfdv-tft) using Apache Beam with the Flink runner through this portable runner concept, since TFX currently seems to support only the DirectRunner and DataflowRunner (Google Cloud).
I have been searching the web for some time, and the last line on the TFX website says,
"Please direct any questions about working with tf.Transform to Stack Overflow using the tensorflow-transform tag."
And that's why I am here. Any idea or workaround is really appreciated. Thank you!
Thanks for the question.
Disclaimer: the portable Flink runner is still in an experimental phase and will only work with a trivial amount of input data.
Here is how you can run TFX on Flink via Beam.
Prerequisites
Linux
Docker
Beam Repo: https://github.com/apache/beam
Distributed file system for input and output.
Instructions to run a python pipeline: https://beam.apache.org/roadmap/portability/#python-on-flink
Note: We currently only support Flink 1.5.5
Instructions
1) Build Worker Containers:
Go to Beam checkout dir
Run gradle command: ./gradlew :beam-sdks-python-container:docker
2) Run Beam JobServer for Flink:
Go to Beam checkout dir
Run gradle command: ./gradlew beam-runners-flink_2.11-job-server:runShadow
Note: this command will not finish, as it starts the job server and keeps it running.
3) Submit a pipeline
Please refer to https://github.com/angoenka/model-analysis/blob/hack_1/examples/chicago_taxi/preprocess_flink.sh
Note: make sure to pass the following flags to your pipeline
--experiments=beam_fn_api
--runner PortableRunner
--job_endpoint=localhost:8099
--experiments=worker_threads=100
--execution_mode_for_batch=BATCH_FORCED
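For reference, a minimal sketch of wiring these flags into a Beam Python pipeline's options (the pipeline body itself is just a placeholder, not part of the original answer):

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# The flags listed above, passed as if on the command line.
options = PipelineOptions([
    "--runner=PortableRunner",
    "--job_endpoint=localhost:8099",
    "--experiments=beam_fn_api",
    "--experiments=worker_threads=100",
    "--execution_mode_for_batch=BATCH_FORCED",
])

with beam.Pipeline(options=options) as p:
    (p
     | beam.Create(["a", "b", "c"])
     | beam.Map(lambda x: x.upper()))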

how to develop Beam Pipeline locally on IDE and run on Dataflow?

I want to use PyCharm on my PC to develop Beam pipelines and run them on Dataflow. Are there any tutorials on how to do this?
Set the execution parameters as described here:
https://cloud.google.com/dataflow/pipelines/specifying-exec-params
and try the example here:
https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/wordcount_minimal.py
It's easy:
Step 1: write your pipeline code as a .py file.
Step 2: run the .py file from the terminal.
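As a rough sketch of what that looks like in practice (the project, region, and bucket values are placeholders; the input file is the public Shakespeare sample), write something like this in PyCharm and run it with python wordcount_minimal.py from the terminal:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Execution parameters from the first link above; replace the placeholders.
options = PipelineOptions(
    runner="DataflowRunner",
    project="my-gcp-project",
    region="us-central1",
    temp_location="gs://my-bucket/tmp",
    staging_location="gs://my-bucket/staging",
)

with beam.Pipeline(options=options) as p:
    (p
     | "Read" >> beam.io.ReadFromText("gs://dataflow-samples/shakespeare/kinglear.txt")
     | "Split" >> beam.FlatMap(lambda line: line.split())
     | "Pair" >> beam.Map(lambda word: (word, 1))
     | "Count" >> beam.CombinePerKey(sum)
     | "Format" >> beam.Map(lambda kv: "%s: %d" % kv)
     | "Write" >> beam.io.WriteToText("gs://my-bucket/output/counts"))

Dropping the runner option falls back to the DirectRunner, so the same file can be run locally while developing.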

Can I use third party libraries with Cloud Dataflow?

Does Cloud Dataflow allow you to use third-party library jar files? How about non-Java libraries?
Kaz
Yes, you can use third-party library files just fine. By default, when you run your Dataflow main program to submit your job, Dataflow analyzes your classpath, uploads any jars it sees, and adds them to the classpath of the workers.
If you need more control, you can use the command line option --filesToStage to specify additional files to stage on the workers.
Another common technique is building a single bundled jar which contains all your dependencies. One way to build a bundled jar is to use a Maven plugin like Shade.
