It's really confusing that every Google document for Dataflow now says it is based on Apache Beam and directs me to the Beam website. Also, if I look for the GitHub project, I see that the Google Dataflow project is empty and everything points to the Apache Beam repo. Say I now need to create a pipeline: from what I read on Apache Beam, I would use from apache_beam.options.pipeline_options. However, if I go with google-cloud-dataflow, I get the error no module named 'options'; it turns out I should use from apache_beam.utils.pipeline_options. So it looks like google-cloud-dataflow ships with an older Beam version and is going to be deprecated?
Which one should I pick to develop my Dataflow pipeline?
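For reference, here is a minimal sketch contrasting the two import layouts mentioned above; the first path is the Apache Beam 2.x / Dataflow SDK 2.x layout, and the commented-out one is the older layout that pre-2.0 google-cloud-dataflow releases expect:

# Apache Beam 2.x / Cloud Dataflow SDK 2.x layout:
from apache_beam.options.pipeline_options import PipelineOptions

# Older google-cloud-dataflow (pre-2.0 Beam) layout:
# from apache_beam.utils.pipeline_options import PipelineOptions

options = PipelineOptions(['--runner=DirectRunner'])  # run locally for a quick check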
Ended up finding the answer in the Google Dataflow Release Notes:
The Cloud Dataflow SDK distribution contains a subset of the Apache Beam ecosystem. This subset includes the necessary components to define your pipeline and execute it locally and on the Cloud Dataflow service, such as:
The core SDK
DirectRunner and DataflowRunner
I/O components for other Google Cloud Platform services
The Cloud Dataflow SDK distribution does not include other Beam components, such as:
Runners for other distributed processing engines
I/O components for non-Cloud Platform services
Version 2.0.0 is based on a subset of Apache Beam 2.0.0
Yes, I've had this issue recently when testing outside of GCP. This link helps to determine what you need when it comes to apache-beam. If you run the below, you will have no GCP components.
$ pip install apache-beam
If you run this, however, you will have all the Cloud components.
$ pip install apache-beam[gcp]
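As a rough illustration of the difference (the project, bucket, and table names below are placeholders), the GCP-specific connectors live under apache_beam.io.gcp and only import when the [gcp] extra is installed:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
# This import requires 'pip install apache-beam[gcp]'; with the plain
# 'apache-beam' install it fails because the GCP extras are missing.
from apache_beam.io.gcp.bigquery import WriteToBigQuery

options = PipelineOptions([
    '--runner=DirectRunner',               # switch to DataflowRunner to run on the Cloud Dataflow service
    '--project=my-project',                # placeholder project id
    '--temp_location=gs://my-bucket/tmp',  # placeholder bucket
])

with beam.Pipeline(options=options) as p:
    (p
     | beam.Create([{'name': 'beam', 'value': 1}])
     | WriteToBigQuery('my-project:my_dataset.my_table',      # placeholder table
                       schema='name:STRING,value:INTEGER'))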
As an aside, I use the Anaconda distribution for almost all of my Python coding and package management. As of 7/20/17 you cannot use the Anaconda repos to install the necessary GCP components. I'm hoping to work with the Continuum folks to have this resolved, not just for Apache Beam but also for TensorFlow.
Related
Absolute beginner in DevOps here. I have a GitLab repo that I would like to build and whose tests I would like to run in the GitLab CI pipeline.
So far, I'm only testing locally on my machine with a specific runner. There's a lot of information out there and I'm starting to get lost about what to use and how to use it.
How would I go about creating a container with the tools that I need? (VS compiler, CMake, Git, etc.)
My application contains an SDK that only works on Windows, so I'm not sure building on another platform would work at all. How do I select a Windows-based container?
How would I use that container in the .yml file in GitLab so that I can build my solution and run my tests?
Any specific documentation links or suggestions are welcome and appreciated.
How would I go about creating a container with the tools that I need? (VS compiler, CMake, Git, etc.)
You can install those tools before the pipeline script runs. I usually do this in before_script.
If there are large-ish packages that need to be installed on every pipeline run, I'd recommend that you make your own image with all the required build dependencies, push it to GitLab, and then just use it as your job image.
My application contains an SDK that only works on Windows, so I'm not sure building on another platform would work at all. How do I select a Windows-based container?
If you're using gitlab.com, Windows runners are currently in beta but available for use.
SaaS runners on Windows are in beta and shouldn’t be used for production workloads.
During this beta period, the shared runner quota for CI/CD minutes applies for groups and projects in the same manner as Linux runners. This may change when the beta period ends, as discussed in this related issue.
If you're self-hosting, set up your own runner on Windows.
How would I use that container in the .yml file in GitLab so that I can build my solution and run my tests?
This really depends on:
the previous parts (whether you're using GitLab.com or self-hosting)
how your application is built
what infrastructure you have access to
What I'm trying to say is that I feel like I can't give you a good answer without quite a bit more information.
I am trying to deploy an Apache Beam Dataflow pipeline, built in Python, with Google Cloud Build. I can't find any specific details about constructing the cloudbuild.yaml file.
I found a link, dataflow-ci-cd-with-cloudbuild, but this seems Java-based; I tried it too, but it did not work since my starting point is main.py.
It requires a container registry. The steps to build and deploy are explained in the link below:
GitHub link
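Since the question's entry point is main.py, here is a minimal, hypothetical sketch (not taken from the linked repo; all names and paths are placeholders) of a main.py that a Cloud Build step could invoke with python main.py plus DataflowRunner flags:

# main.py -- hypothetical entry point; a build step could run, for example:
#   python main.py --runner=DataflowRunner --project=my-project --region=us-central1 \
#     --temp_location=gs://my-bucket/tmp --input=gs://my-bucket/in.txt --output=gs://my-bucket/out
import argparse

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def run(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument('--input', required=True, help='Input file pattern (placeholder).')
    parser.add_argument('--output', required=True, help='Output path prefix (placeholder).')
    known_args, pipeline_args = parser.parse_known_args(argv)

    # Everything not consumed above (runner, project, temp_location, ...) goes to Beam.
    options = PipelineOptions(pipeline_args)
    with beam.Pipeline(options=options) as p:
        (p
         | beam.io.ReadFromText(known_args.input)
         | beam.Map(lambda line: line.strip())
         | beam.io.WriteToText(known_args.output))


if __name__ == '__main__':
    run()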
OpenShift provides a standard Jenkins template with preinstalled tools that allow us to execute oc commands. If we want to build NodeJS apps, we can install the NodeJS plugin for Jenkins. If we want to build .NET apps, we can use the MSBuild plugin. But there's no .NET Core plugin. In virtually all resources online, people just install the .NET Core CLI using apt-get (or other package management tools) directly on the Jenkins machine.
How do I properly build .NET Core applications in Jenkins within an OpenShift environment? Should I provision the Jenkins pod with the .NET Core CLI using some scripts? Can I use a custom image to create a Jenkins slave instance with the .NET Core CLI preinstalled (this link suggests that approach)? What's the recommended way?
For using the dotnet-jenkins-slave images, there is information in the Red Hat documentation: https://access.redhat.com/documentation/en-us/net_core/2.2/html/getting_started_guide/gs_dotnet_on_openshift#using_jenkins.
There is also a blog post on Red Hat developer blog on this topic: https://developers.redhat.com/blog/2019/10/17/ci-cd-for-net-core-container-applications-on-red-hat-openshift/.
I have created a pipeline in Python using the Apache Beam SDK, and the Dataflow jobs run perfectly from the command line.
Now, I'd like to run those jobs from the UI. For that I have to create a template file for my job. I found steps to create a template in Java using Maven.
But how do I do it using the Python SDK?
Templates have been available for creation in the Dataflow Python SDK since April 2017. Here is the documentation.
To run a template, no SDK is needed (which is the main problem templates try to solve), so you can run them from the UI, the REST API, or the CLI, and here is how.
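As a rough sketch of the Python side (project, bucket, paths, and class names below are placeholders), a classic template is staged by running the pipeline once with --template_location; ValueProvider arguments are the mechanism for parameters supplied when the template is launched:

# Running this once writes the template file to the --template_location path;
# after that the job can be launched from the UI, REST API, or gcloud without the SDK.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


class MyTemplateOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        # ValueProvider arguments can be filled in at template launch time.
        parser.add_value_provider_argument('--input', type=str, help='Input file pattern (placeholder).')
        parser.add_value_provider_argument('--output', type=str, help='Output path prefix (placeholder).')


options = PipelineOptions([
    '--runner=DataflowRunner',
    '--project=my-project',                                       # placeholder
    '--temp_location=gs://my-bucket/tmp',                         # placeholder
    '--staging_location=gs://my-bucket/staging',                  # placeholder
    '--template_location=gs://my-bucket/templates/my_template',   # where the template file is written
])
custom = options.view_as(MyTemplateOptions)

with beam.Pipeline(options=options) as p:
    (p
     | beam.io.ReadFromText(custom.input)
     | beam.io.WriteToText(custom.output))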
We are currently using Google's Cloud Dataflow SDK (1.6.0) to run Dataflow jobs in GCP; however, we are considering moving to the Apache Beam SDK (0.1.0). We will still be running our jobs in GCP using the Dataflow service. Has anyone gone through this transition who has advice? Are there any compatibility issues here, and is this move encouraged by GCP?
Formally, Beam is not yet supported on Dataflow (although that is certainly what we are working towards). We recommend staying with the Dataflow SDK, especially if SLA or support are important to you. That said, our tests show that Beam runs on Dataflow, and although that may break at any time, you are certainly welcome to try it at your own risk.
Update:
The Dataflow SDKs are now based on Beam as of the release of Dataflow SDK 2.0 (https://cloud.google.com/dataflow/release-notes/release-notes-java-2). Both Beam and the Dataflow SDKs are currently supported on Cloud Dataflow.
You can run Beam SDK pipelines on Dataflow now. See:
https://beam.apache.org/documentation/runners/dataflow/
You'll need to add a dependency to your pom.xml, and probably a few command-line options, as explained on that page.
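The pom.xml dependency applies to the Java SDK; as a hedged sketch of the Python-side equivalent (placeholder project, region, and bucket), you would install apache-beam[gcp] and select the runner through pipeline options:

# Rough Python-SDK equivalent: pick DataflowRunner via pipeline options
# (all values below are placeholders).
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    '--runner=DataflowRunner',
    '--project=my-project',                # placeholder GCP project
    '--region=us-central1',                # placeholder region
    '--temp_location=gs://my-bucket/tmp',  # placeholder bucket
])

with beam.Pipeline(options=options) as p:
    p | beam.Create(['hello', 'dataflow']) | beam.Map(lambda word: word.upper())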