How to get running Google Dataflow jobs in Java using the client library - google-cloud-dataflow

I am trying to get all jobs in a project using the Google client library for Dataflow. I am able to fetch metrics using a job ID, but I cannot list all jobs inside a project; any code snippet would be very helpful. Using the Apache Beam runner is also an option. There is a method to list all jobs in the Beam runner, but I have not been able to use it.

You will want to use this API: https://cloud.google.com/dataflow/docs/reference/rest/v1b3/projects.jobs/list
This should have an example showing how to use the Java client: https://cloud.google.com/dataflow/docs/samples/dataflow-v1beta3-generated-JobsV1Beta3-ListJobs-sync.
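For reference, here is a minimal sketch of what listing the running jobs could look like with the generated v1beta3 Java client. The project ID and region are placeholders, and the builder methods follow the ListJobs sample linked above, so double-check them against that page:

import com.google.dataflow.v1beta3.Job;
import com.google.dataflow.v1beta3.JobsV1Beta3Client;
import com.google.dataflow.v1beta3.ListJobsRequest;

public class ListDataflowJobs {
  public static void main(String[] args) throws Exception {
    // The client picks up Application Default Credentials automatically.
    try (JobsV1Beta3Client client = JobsV1Beta3Client.create()) {
      ListJobsRequest request =
          ListJobsRequest.newBuilder()
              .setProjectId("my-project-id")            // placeholder project ID
              .setLocation("us-central1")               // regional endpoint of the jobs
              .setFilter(ListJobsRequest.Filter.ACTIVE) // only currently running jobs
              .build();
      // iterateAll() pages through the results for you.
      for (Job job : client.listJobs(request).iterateAll()) {
        System.out.printf("%s  %s  %s%n", job.getId(), job.getName(), job.getCurrentState());
      }
    }
  }
}

Dropping the filter (or using Filter.ALL) returns every job in the project, not just the active ones.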

Related

Google Cloud Build YAML file for a Python Apache Beam Dataflow pipeline

I am trying to deploy an Apache Beam Dataflow pipeline written in Python with Google Cloud Build. I can't find any specific details about constructing the cloudbuild.yaml file.
I found the link dataflow-ci-cd-with-cloudbuild, but it seems to be Java based; I tried it too, but it did not work because my starting point is main.py.
It requires a container registry. The steps to build and deploy are explained in the link below.
GitHub link

Load PostgreSQL data into BigQuery using a Cloud Dataflow pipeline

I tried to implement a Cloud Dataflow task that loads data from a PostgreSQL database table into a Google Cloud BigQuery table with the help of the document linked below. When executing the Dataflow job I got an issue; refer to the screenshot [1].
URL: [Approach 2: ETL into BigQuery with Cloud Dataflow][1]
Google Cloud Dataflow supports the 2.x version of the SDK. Since this code was written against version 1.x, it hit this generic issue when running the Dataflow pipeline on Google Cloud. After migrating it to the 2.x version, the code ran successfully.
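For context, here is a rough sketch of what a migrated 2.x pipeline could look like with the Beam Java SDK, using JdbcIO to read from PostgreSQL and BigQueryIO to write the rows. The connection string, query, column names, and table spec are invented placeholders rather than values from the linked tutorial:

import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import java.util.Arrays;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.TableRowJsonCoder;
import org.apache.beam.sdk.io.jdbc.JdbcIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class PostgresToBigQuery {
  public static void main(String[] args) {
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
    Pipeline pipeline = Pipeline.create(options);

    // Placeholder BigQuery schema matching the two columns read below.
    TableSchema schema =
        new TableSchema()
            .setFields(
                Arrays.asList(
                    new TableFieldSchema().setName("id").setType("INTEGER"),
                    new TableFieldSchema().setName("name").setType("STRING")));

    pipeline
        .apply(
            "ReadFromPostgres",
            JdbcIO.<TableRow>read()
                .withDataSourceConfiguration(
                    JdbcIO.DataSourceConfiguration.create(
                            "org.postgresql.Driver", "jdbc:postgresql://HOST:5432/mydb")
                        .withUsername("USER")
                        .withPassword("PASSWORD"))
                .withQuery("SELECT id, name FROM my_table")
                .withCoder(TableRowJsonCoder.of())
                // Map each JDBC ResultSet row to a BigQuery TableRow.
                .withRowMapper(
                    rs -> new TableRow().set("id", rs.getLong("id")).set("name", rs.getString("name"))))
        .apply(
            "WriteToBigQuery",
            BigQueryIO.writeTableRows()
                .to("my-project:my_dataset.my_table")
                .withSchema(schema)
                .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
                .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE));

    pipeline.run();
  }
}

You also need the PostgreSQL JDBC driver plus the beam-sdks-java-io-jdbc and beam-sdks-java-io-google-cloud-platform modules on the classpath.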

google-cloud-dataflow vs apache-beam

It's really confusing that every Google document for Dataflow now says it's based on Apache Beam and directs me to the Beam website. Also, when I look for the GitHub project, the Google Dataflow project is empty and everything points to the Apache Beam repo. Say I now need to create a pipeline: from what I read on the Apache Beam side, I would do from apache_beam.options.pipeline_options. However, if I go with google-cloud-dataflow, I get the error no module named 'options'; it turns out I should use from apache_beam.utils.pipeline_options. So it looks like google-cloud-dataflow ships an older Beam version and is going to be deprecated?
Which one should I pick to develop my Dataflow pipeline?
I ended up finding the answer in the Google Dataflow Release Notes:
The Cloud Dataflow SDK distribution contains a subset of the Apache Beam ecosystem. This subset includes the necessary components to define your pipeline and execute it locally and on the Cloud Dataflow service, such as:
The core SDK
DirectRunner and DataflowRunner
I/O components for other Google Cloud Platform services
The Cloud Dataflow SDK distribution does not include other Beam components, such as:
Runners for other distributed processing engines
I/O components for non-Cloud Platform services
Version 2.0.0 is based on a subset of Apache Beam 2.0.0
Yes, I've had this issue recently when testing outside of GCP. This link helps to determine what you need when it comes to apache-beam. If you run the command below you will have no GCP components.
$ pip install apache-beam
If you run this, however, you will have all the Cloud components.
$ pip install apache-beam[gcp]
As an aside, I use the Anaconda distribution for almost all of my Python coding and package management. As of 7/20/17 you cannot use the Anaconda repos to install the necessary GCP components. I'm hoping to work with the Continuum folks to have this resolved, not just for Apache Beam but also for TensorFlow.

Steps to create Cloud Dataflow template using the Python SDK

I have created a pipeline in Python using the Apache Beam SDK, and Dataflow jobs run perfectly from the command line.
Now I'd like to run those jobs from the UI. For that I have to create a template file for my job. I found steps to create a template in Java using Maven.
But how do I do it using the Python SDK?
Templates have been available for creation in the Dataflow Python SDK since April 2017. Here is the documentation.
To run a template, no SDK is needed (which is the main problem templates solve), so you can run them from the UI, the REST API, or the command line, and here is how.

Access build.xml in Jenkins via REST api

Each finished build in Jenkins has a build.xml file (in work/jobs/...BuildName.../builds/...BuildNumber...) with a lot of info about the build. Is it possible to access that file using the REST API? I tried a lot of variations, but I could not find it.
Look in Jenkins itself for the documentation.
If you access the URL
http://SERVER/job/JOB/api/
you will see how to use the REST API, which can access all elements of your Jenkins instance (including parameters and logs from the build).
I hope this helps.
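If a worked example helps, here is a small illustrative snippet using the JDK 11+ HttpClient to fetch that build data. The server, job name, and build number 42 are placeholders, and a secured Jenkins will also need a user/API token for authentication:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class JenkinsBuildInfo {
  public static void main(String[] args) throws Exception {
    // Placeholder URL: the XML view of a build is served at .../api/xml,
    // JSON at .../api/json (this is the API view, not the raw build.xml file on disk).
    String url = "http://SERVER/job/JOB/42/api/xml";

    HttpClient client = HttpClient.newHttpClient();
    HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
    HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());

    System.out.println(response.statusCode());
    System.out.println(response.body());
  }
}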
