TensorFlow Transform on Beam with Flink runner - machine-learning

This may seem like a silly question, but it is my very first post here, so apologies if I get anything wrong.
I am currently building a simple ML pipeline with TFX 0.11 (i.e. tfdv-tft-tfserving) and TensorFlow 1.11, using Python 2.7. I have an Apache Flink cluster and I want to use it for TFX. I know the framework behind TFX is Apache Beam 2.8, and Beam currently supports Flink with the Python SDK through a portable runner layer.
The problem is how to write TFX (tfdv-tft) code that uses Apache Beam with the Flink runner through this portable runner concept, as TFX currently seems to support only the DirectRunner and DataflowRunner (Google Cloud).
I have been searching the web for some time and saw this last line on the TFX website:
"Please direct any questions about working with tf.Transform to Stack Overflow using the tensorflow-transform tag."
And that's why I am here. Any idea or workaround is really appreciated. Thank you!

Thanks for the question.
Disclaimer: the portable Flink runner is still in an experimental phase and will only work with a trivial amount of input data.
Here is how you can run TFX on Flink via Beam.
Prerequisites
Linux
Docker
Beam Repo: https://github.com/apache/beam
Distributed file system for input and output.
Instructions to run a Python pipeline: https://beam.apache.org/roadmap/portability/#python-on-flink
Note: We currently only support Flink 1.5.5
Instructions
1) Build Worker Containers:
Go to Beam checkout dir
Run gradle command: ./gradlew :beam-sdks-python-container:docker
2) Run Beam JobServer for Flink:
Go to Beam checkout dir
Run gradle command: ./gradlew beam-runners-flink_2.11-job-server:runShadow
Note: this command will not finish, as it starts the job server and keeps it running.
3) Submit a pipeline
Please refer to https://github.com/angoenka/model-analysis/blob/hack_1/examples/chicago_taxi/preprocess_flink.sh
Note: make sure to pass the following flags to your pipeline (see the sketch after this list):
--experiments=beam_fn_api
--runner PortableRunner
--job_endpoint=localhost:8099
--experiments=worker_threads=100
--execution_mode_for_batch=BATCH_FORCED
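To make these flags concrete, here is a minimal, hedged sketch (not the script from the linked repo) of how they could be forwarded into a Beam pipeline that drives tf.Transform. The preprocessing_fn, the Create input, and the temp directory are placeholders you would replace with your own; only the flag values come from the answer above.

import apache_beam as beam
import tensorflow_transform.beam.impl as tft_beam
from apache_beam.options.pipeline_options import PipelineOptions

# Flag values come from the list above; everything else is a placeholder.
flink_args = [
    '--runner=PortableRunner',
    '--job_endpoint=localhost:8099',
    '--experiments=beam_fn_api',
    '--experiments=worker_threads=100',
    '--execution_mode_for_batch=BATCH_FORCED',
]

def preprocessing_fn(inputs):
    # Placeholder: your tf.Transform feature engineering goes here.
    return inputs

with beam.Pipeline(options=PipelineOptions(flink_args)) as pipeline:
    # tf.Transform needs a temp location that every Flink worker can reach,
    # hence the distributed file system prerequisite above.
    with tft_beam.Context(temp_dir='/path/on/distributed/fs/tft_tmp'):
        raw_data = pipeline | 'ReadPlaceholder' >> beam.Create([{'x': 1.0}])
        # With raw_metadata describing your schema, you would then run:
        # (raw_data, raw_metadata) | tft_beam.AnalyzeAndTransformDataset(
        #     preprocessing_fn)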

Related

Build Beam pipelines using Bazel (with DataflowRunner)

I use Bazel to build my Beam pipeline. The pipeline works well with the DirectRunner; however, I have trouble managing dependencies when I use the DataflowRunner: Python cannot find local dependencies (e.g. those generated by py_library) on Dataflow. Is there any way to hint Dataflow to use the Python binary (the py_binary zip file) in the worker container to resolve the issue?
Thanks,
Please see the Beam documentation on managing Python pipeline dependencies for more details on setting up dependencies for the Python SDK on Dataflow. If you are using a local dependency, you should probably look into packaging it as a Python package and using the extra_package option (sketched below), or into developing a custom container.
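For illustration, here is a hedged sketch of the extra_package route, assuming the local dependency can be packaged as a source distribution with its own setup.py; the project, bucket, region, and package names are placeholders.

from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions

options = PipelineOptions([
    '--runner=DataflowRunner',
    '--project=my-gcp-project',            # placeholder
    '--temp_location=gs://my-bucket/tmp',  # placeholder
    '--region=us-central1',                # placeholder
])
# Equivalent to passing --extra_package=dist/my_local_dep-0.0.1.tar.gz on the
# command line; Dataflow installs the tarball on each worker at startup.
# The tarball is built from the dependency's setup.py, e.g.
#   python setup.py sdist  ->  dist/my_local_dep-0.0.1.tar.gz
options.view_as(SetupOptions).extra_packages = ['dist/my_local_dep-0.0.1.tar.gz']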

Running multi file python apache beam with airflow Cloud Composer

I have a multi-file Python Apache Beam job that I want to run on Cloud Dataflow through Airflow / Cloud Composer. My multi-file job is structured following this recommendation: https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/#multiple-file-dependencies and it works without any problem from the CLI.
Now I want to run it through Airflow and Cloud Composer. I tried to use https://airflow.apache.org/integration.html#dataflowpythonoperator but it does not work. I am probably missing the right way to configure this operator. By the way, I have no problem using this operator to run single-file Python Apache Beam jobs.
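One configuration that is sometimes suggested for this setup, sketched below with placeholder paths and GCP values, is to point DataFlowPythonOperator at the pipeline's entry file and forward the setup.py from the multiple-file-dependencies guide through its options dict. The parameter names are taken from the Airflow 1.x contrib operator and may differ in newer releases; this is an assumption, not a verified fix.

from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.dataflow_operator import DataFlowPythonOperator

with DAG(dag_id='beam_multifile_job',
         start_date=datetime(2019, 1, 1),
         schedule_interval=None) as dag:
    run_pipeline = DataFlowPythonOperator(
        task_id='run_pipeline',
        py_file='/home/airflow/gcs/dags/my_pipeline/main.py',  # placeholder entry point
        options={
            # Forwarded to the pipeline as --setup_file, so Dataflow packages
            # the multi-file job the same way it does when launched from the CLI.
            'setup_file': '/home/airflow/gcs/dags/my_pipeline/setup.py',
        },
        dataflow_default_options={
            'project': 'my-gcp-project',            # placeholder
            'temp_location': 'gs://my-bucket/tmp',  # placeholder
        },
    )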

NUnit3 with SpecFlow runs in VS and as batch command but not in Jenkins

I have my Selenium tests written using SpecFlow (+SpecRun) and the NUnit framework (v3.8.1.0). I've configured Jenkins to run these tests. My Jenkins Windows batch command is as follows:
"C:\Program Files (x86)\NUnit.ConsoleRunner\3.7.0\tools\nunit3-console.exe"
C:\Projects\Selenium\ClassLibrary1\PortalTests\bin\Debug\PortalTests.dll
--test=TransactionTabTest;result="%WORKSPACE%\TestResults\TestR.xml";format=nunit3
When I trigger a build, the test seems to start running: the output gets as far as the end of NUNIT3-CONSOLE [inputfiles] [options], with a spinner indicating that the test is running, but it never ends and the estimated remaining time is N/A.
Now, when I run this script with windows cmd.exe:
"[PATH to Console.exe]\nunit3-console.exe" PortalTests.dll -- test=TransactionTabTest
the test passes successfully, and it also passes in VS.
Now, I know this is a very generic question, but any clues will be much appreciated.
As you are using SpecFlow+Runner/SpecRun, you can find documentation on how to configure it for different build servers here: http://specflow.org/plus/documentation/SpecFlowPlus-and-Build-Servers/

how to develop Beam Pipeline locally on IDE and run on Dataflow?

I want to use PyCharm on my PC to develop Beam pipelines and run them on Dataflow. Are there any tutorials on how to do this?
Set the execution parameters as described here:
https://cloud.google.com/dataflow/pipelines/specifying-exec-params
and try the example here:
https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/wordcount_minimal.py
It's easy:
Step 1: write your pipeline code in a .py file.
Step 2: run that .py file from a terminal, passing the execution parameters for your chosen runner.
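As a concrete starting point, here is a minimal, hedged sketch of a pipeline you can write and debug in PyCharm with the DirectRunner and then submit to Dataflow purely by changing the command-line options; the project, bucket, and region values are placeholders.

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def run(argv=None):
    # Locally:     python my_pipeline.py --runner=DirectRunner
    # On Dataflow (placeholder values):
    #   python my_pipeline.py --runner=DataflowRunner \
    #       --project=my-gcp-project --region=us-central1 \
    #       --temp_location=gs://my-bucket/tmp
    options = PipelineOptions(argv)
    with beam.Pipeline(options=options) as p:
        (p
         | 'Create' >> beam.Create(['hello world', 'hello beam'])
         | 'Split' >> beam.FlatMap(lambda line: line.split())
         | 'PairWithOne' >> beam.Map(lambda word: (word, 1))
         | 'Count' >> beam.CombinePerKey(sum)
         | 'Format' >> beam.Map(lambda kv: '%s: %d' % kv)
         | 'Write' >> beam.io.WriteToText('counts'))

if __name__ == '__main__':
    run()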

How to integrate meteor's velocity tests with jenkins?

On Velocity's GitHub page it mentions "easy CI integration" as one of the benefits, but I haven't seen any documentation about it.
How can I integrate Velocity with Jenkins?
You should use:
meteor --test
or
meteor run --test
This does the same thing as velocity-ci without the extra installation.
You could try the velocity-ci
velocity-cli
NPM module for running your velocity test suites from the command-line
Installation
npm install -g velocity-ci
Run
From inside your project directory type velocity
How it works
The velocity-cli spawns a meteor process and connects to it using DDP. PhantomJS connects to the meteor process to trigger client side tests. Test results received via DDP are printed at the console. This process exits with the appropriate exit status code.
So the Jenkins build step would be running velocity inside the meteor directory.
