How to add labels to an existing Google Dataflow job? - google-cloud-dataflow

I am using the Java GAPI client to work with Google Cloud Dataflow (v1b3-rev197-1.22.0). I am running a pipeline from template and the method for doing that (com.google.api.services.dataflow.Dataflow.Projects.Templates#create) does not allow me to set labels for the job. However I get the Job object back when I execute the pipeline, so I updated the labels and tried to call com.google.api.services.dataflow.Dataflow.Projects.Jobs#update to persist that information in Dataflow. But the labels do not get updated.
I also tried updating labels on finished jobs (which I also need to do), which didn't work either, so I thought it's because the job is in a terminal state. But updating labels seems to do nothing regardless of the state.
The documentation does not say anything about labels not being mutable on running or terminated pipelines, so I would expect things to work. Am I doing something wrong and if not what is the rationale behing the decision no to allow label updates? (And how are template users supposed to set the initial label set when executing the template?)
Background: I want to mark terminated pipelines that have been "processed", i.e. those that our automated infrastructure already sent notification about to appropriate places. Labels seemed as a good approach that would shield me from having to use some kind of local persitence to track stuff (big complexity jump). Any suggestions on how to approach this if labels are not the right tool? Sadly, Stackdriver cannot monitor finished pipelines, only failed ones. And sending a notification from within the pipeline code doesn't seem as a good idea to me (wrong?).

Related

Refusing to split GroupedShuffleRangeTracker proposed split position is out of range

I am sporadically getting the following errors:
W Refusing to split
at '\x00\x00\x00\x15\xbc\x19)b\x00\x01': proposed
split position is out of range
['\x00\x00\x00\x15\x00\xff\x00\xff\x00\xff\x00\xff\x00\x01',
'\x00\x00\x00\x15\xbc\x19)b\x00\x01'). Position of last group
processed was '\x00\x00\x00\x15\xbc\x19)a\x00\x01'.
When it happens, the error is logged every so often and the job never seems to end. Although it seems that it did actually complete the job otherwise.
In the last instance I am using 10 workers and have auto scaling disabled. I am using the Python implementation of Apache Beam.
This is not an error, it's part of normal operation of a pipeline. We should probably reduce its logging level to INFO and rephrase it, because it very frequently confuses people.
This message (rather obscurely) signals that Dataflow is trying to apply dynamic rebalancing, but there's no work that can be further subdivided.
I.e. your job is stuck doing something non-parallelizable on a small number of workers, while other workers are staying idle. To investigate this further, one would need to look at the code of your job and the Dataflow job id.

Creating a structured Jenkins Failing Test Report

The situation right now:
Every Monday morning I manually check Jenkins jobs jUnit results that ran over the weekend, using Project Health plugin I can filter on the timeboxed runs. I then copy paste this table into Excel and go over each test case's output log to see what failed and note down the failure cause. Every weekend has another tab in Excel. All this makes tracability a nightmare and causes time consuming manual labor.
What I am looking for (and hoping that already exists to some degree):
A database that stores all failed tests for all jobs I specify. It parses the output log of a failed test case and based on some regex applies a 'tag' e.g. 'Audio' if a test regarding audio is failing. Since everything is in a database I could make or use a frontend that can apply filters at will.
For example, if I want to see all tests regarding audio failing over the weekend (over multiple jobs and multiple runs) I could run a query that returns all entries with the Audio tag.
I'm OK with manually tagging failed tests and the cause, as well as writing my own frontend, is there a way (Jenkins API perhaps?) to grab the failed tests (jUnit format and Jenkins plugin) and create such a system myself if it does not exist?
A good question. Unfortunately, it is very difficult in Jenkins to get such "meta statistics" that spans several jobs. There is no existing solution for that.
Basically, I see two options for getting what you want:
Post-processing Jenkins-internal data to get the statistics that you need.
Feeding a database on-the-fly with build execution data.
The first option basically means automating the tasks that you do manually right now.
you can use external scripting (Python, Perl,...) to process Jenkins-internal data (via REST or CLI APIs, or directly reading on-disk data)
or you run Groovy scripts internally (which will be faster and more powerful)
It's the most direct way to go. However, depending on the statistics that you need and depending on your requirements regarding data persistance , you may want to go for...
The second option: more flexible and completely decoupled from Jenkins' internal data storage. You could implement it by
introducing a Groovy post-build step for all your jobs
that script parses job results and puts data of interest in a custom, external database
Statistics you'd get from querying that database.
Typically, you'd start with the first option. Once requirements grow, you'd slowly migrate to the second one (e.g., by collecting internal data via explicit post-processing scripts, putting that into a database, and then running queries on it). You'll want to cut this migration phase as short as possible, as it eventually requires the effort of implementing both options.
You may want to have a look at couchdb-statistics. It is far from a perfect fit, but at least seems to do partially what you want to achieve.

Multiple export using google dataflow

Not sure whether this is the right place to ask but I am currently trying to run a dataflow job that will partition a data source to multiple chunks in multiple places. However I feel that if I try to write to too many table at once in one job, it is more likely for the dataflow job to fail on a HTTP transport Exception error, and I assume there is some bound one how many I/O in terms of source and sink I could wrap into one job?
To avoid this scenario, the best solution I can think of is to split this one job into multiple dataflow jobs, however for which it will mean that I will need to process same data source multiple times (once for which dataflow job). It is okay for now but ideally I sort of want to avoid it if later if my data source grow huge.
Therefore I am wondering there is any rule of thumb of how many data source and sink I can group into one steady job? And is there any other better solution for my use case?
From the Dataflow service description of structuring user code:
The Dataflow service is fault-tolerant, and may retry your code multiple times in the case of worker issues. The Dataflow service may create backup copies of your code, and can have issues with manual side effects (such as if your code relies upon or creates temporary files with non-unique names).
In general, Dataflow should be relatively resilient. You can Partition your data based on the location you would like it output. The writes to these output locations will be automatically divided into bundles, and any bundle which fails to get written will be retried.
If the location you want to write to is not already supported you can look at writing a custom sink. The docs there describe how to do so in a way that is fault tolerant.
There is a bound on how many sources and sinks you can have in a single job. Do you have any details on how many you expect to use? If it exceeds the limit, there are also ways to use a single custom sink instead of several sinks, depending on your needs.
If you have more questions, feel free to comment. In addition to knowing more about what you're looking to do, it would help to know if you're planning on running this as a Batch or Streaming job.
Our solution to this was to write a custom GCS sink that supports partitions. Though with the responses I got I'm unsure whether that was the right thing to do or not. Writing Output of a Dataflow Pipeline to a Partitioned Destination

Jenkins - monitoring the estimated time of builds

I would like to monitor the estimated time of all of my builds to catch the cases where this value is shown as 'N/A'.
In these cases the build gets stuck (probably due to network issues in my environment) and it won't start new builds for that job until killed manually.
What I am missing is how to get that data for each job, either from api or other source.
I would appreciated any suggestion.
Thanks.
For each job, you can click "Trend" on the job run history table, and it will show you the currently executing progress along with a graph of "usual" execution times.
Using the API, you can go to http://jenkins/job/<your_job_name>/<build_number>/api/xml (or /json) and the information is under <duration> and <estimatedDuration> fields.
Finally, there is a Jenkins Timeout Plugin that you can use to automatically take care of "stuck" builds

Showing build queue per slave Jenkins

I wanted to ask if you can show build queue that will be executed on a slave. I have many slaves and I can check the global queue but I would like to see a queue dedicated to one slave. Every build fits in regular expression like : "(jobName)_(slaveNumber)".
I could not find any solution from jenkins settings so I wanted to use queue API and get the json that is provided up there then filter the results by this regular expression and display them on one jenkins page. But it turned out that jenkins does not allow to use javascript in description.
Is there a workaround? Or there is some kind of fancy button that will display build queue per slave? I would like to use as simple way as possible so installing "Anything Goes" won't be the solution I want.
have you tried REST api for computers AKA nodes AKA slaves? It shows current load statistic.
http://jenkins_site/computer/api/json?depth=1&pretty=true

Resources