Testing triggers with processing time - google-cloud-dataflow

Can we test a pipeline that has windows with triggers that depend on processing time? For instance a streaming pipeline with a global window and a trigger to fire on elementCountAtLeast 0 will have different outputs depending on when the data comes in, so can we simulate that in any way?
Even if not for automated tests, being able to try out different windowing strategies and see their effects would be very useful.

Totally agree. We're in the process of rewriting the DirectPipelineRunner to support this. Stay tuned!

Related

How can I debug why my Dataflow job is stuck?

I have a Dataflow job that is not making progress - or it is making very slow progress, and I do not know why. How can I start looking into why the job is slow / stuck?
The first resource that you should check is the Dataflow documentation. These pages are a good starting point:
Troubleshooting your Pipeline
Common error guidance
If these resources don't help, I'll try to summarize some reasons why your job may be stuck, and how you can debug it. I'll separate these issues by the part of the system that is causing the trouble.
Job stuck at startup
A job can get stuck being received by the Dataflow service, or starting up new Dataflow workers. Some risk factors for this are:
Did you add a custom setup.py file?
Do you have any dependencies that require a special setup on worker startup?
Are you manipulating the worker container?
To debug this sort of issue I usually open StackDriver logging, and look for worker-startup logs. These logs are written by the worker as it starts up a Docker container with your code and your dependencies. If you see any problem here, it would indicate an issue with your setup.py, your job submission, staged artifacts, etc.
Another thing you can do is to keep the same setup, and run a very small pipeline that stages everything:
import logging
import apache_beam as beam

with beam.Pipeline(...) as p:
    (p
     | beam.Create(['test element'])
     | beam.Map(lambda x: logging.info(x)))
If you don't see your logs in StackDriver, then you can continue to debug your setup. If you do see the log in StackDriver, then your job may be stuck somewhere else.
Job seems stuck in user code
Something else that could happen is that your job is performing some operation in user code that is stuck or slow. Some risk factors for this are:
Is your job performing operations that require you to wait for them? (e.g. loading data to an external service, waiting for promises/futures)
Note that some of the builtin transforms of Beam do exactly this (e.g. the Beam IOs like BigQueryIO, FileIO, etc).
Is your job loading very large side inputs into memory? This may happen if you are using View.AsList for a side input.
Is your job loading very large iterables after GroupByKey operations?
A symptom of this kind of issue can be that the pipeline's throughput is lower than you would expect. Another symptom is seeing the following line in the logs:
Processing stuck in step <STEP_NAME>/<...>/<...> for at least <TIME> without outputting or completing in state <STATE>
.... <a stacktrace> ....
In cases like these it makes sense to look at which step is consuming the most time in your pipeline, and inspect the code for that step, to see what may be the problem.
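To see which step shows up most often in those messages, you can tally them from exported logs. This is only a sketch: it assumes the log lines follow the pattern quoted above, the sample lines and step names are made up, and the time and state formats are matched loosely because the exact format is not specified here.

```python
import re
from collections import Counter

# Pattern follows the log line quoted above. The exact formats of the
# step name, time and state are assumptions, so they are matched loosely.
STUCK_RE = re.compile(
    r"Processing stuck in step (?P<step>.+?) for at least (?P<time>.+?) "
    r"without outputting or completing in state (?P<state>\S+)"
)

def count_stuck_steps(log_lines):
    """Return a Counter mapping step name -> number of 'stuck' messages."""
    counts = Counter()
    for line in log_lines:
        match = STUCK_RE.search(line)
        if match:
            counts[match.group("step")] += 1
    return counts

# Made-up sample lines, for illustration only.
sample_logs = [
    "Processing stuck in step WriteToSink/ParDo(Write) for at least 05m00s "
    "without outputting or completing in state process",
    "Processing stuck in step WriteToSink/ParDo(Write) for at least 10m00s "
    "without outputting or completing in state process",
    "Processing stuck in step ReadInput/Decode for at least 05m00s "
    "without outputting or completing in state start",
    "some unrelated log line",
]

for step, count in count_stuck_steps(sample_logs).most_common():
    print(step, count)
```

The step with the highest count is a reasonable first place to look, though a single long stall can matter more than many short ones.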
Some tips:
Very large side inputs can be troublesome, so if your pipeline relies on accessing a very large side input, you may need to redesign it to avoid that bottleneck.
It is possible to have asynchronous requests to external services, but I recommend that you commit / finalize work on startBundle and finishBundle calls.
If your pipeline's throughput is not what you would normally expect, it may be because you don't have enough parallelism. This can be fixed by a Reshuffle, or by sharding your existing keys into subkeys (Beam often does processing per-key, and so if you have too few keys, your parallelism will be low) - or using a Combiner instead of GroupByKey + ParDo.
Another reason that your throughput is low may be that your job is waiting too long on external calls. You can try addressing this by trying out batching strategies, or async IO.
In general, there's no silver bullet to improve your pipeline's throughput, and you'll need to experiment.
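To make the subkey idea from the tips concrete, here is a plain-Python sketch (not Beam code) of sharding a hot key and aggregating in two stages. The function names and the choice of a random shard index are my own assumptions. In a real pipeline the first stage would be a per-subkey combine and the second a per-key combine.

```python
import random
from collections import defaultdict

def shard_key(key, num_shards, rng=random):
    """Spread one hot key over num_shards subkeys: (key, shard_index)."""
    return (key, rng.randrange(num_shards))

def two_stage_sum(pairs, num_shards=4, rng=random):
    """Sum values per key in two steps: per subkey, then per original key.

    Returns (totals_per_key, number_of_subkeys_used).
    """
    # Stage 1: partial sums per (key, shard). With per-key processing,
    # each subkey could be handled by a different worker.
    partial = defaultdict(int)
    for key, value in pairs:
        partial[shard_key(key, num_shards, rng)] += value

    # Stage 2: merge the partial sums back onto the original key.
    totals = defaultdict(int)
    for (key, _shard), subtotal in partial.items():
        totals[key] += subtotal
    return dict(totals), len(partial)

# One very hot key and one cold key.
pairs = [("hot", 1)] * 1000 + [("cold", 5)]
totals, num_subkeys = two_stage_sum(pairs, num_shards=4)
print(totals, num_subkeys)
```

This only works for aggregations that can be merged from partial results (sums, counts, max and so on), which is the same property that lets a Combiner replace GroupByKey + ParDo.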
The data freshness or system lag are increasing
First of all, I'd recommend you check out this presentation on watermarks.
For streaming, the advance of the watermark is what drives the pipeline to make progress. It is therefore important to watch for things that could hold the watermark back and stall your pipeline downstream. Some reasons why the watermark may become stuck:
One possibility is that your pipeline is hitting an unresolvable error condition. When a bundle fails processing, your pipeline will continue to attempt to execute that bundle indefinitely, and this will hold the watermark back.
When this happens, you will see errors in your Dataflow console, and the count will keep climbing as the bundle is retried.
You may have a bug when associating the timestamps to your data. Make sure that the resolution of your timestamp data is the correct one!
Although unlikely, it is possible that you've hit a bug in Dataflow. If neither of the other tips helps, please open a support ticket.
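On the timestamp resolution point: a common mistake is mixing up seconds, milliseconds and microseconds since the epoch. A seconds value read as milliseconds lands in January 1970, and a milliseconds value read as seconds lands tens of thousands of years in the future. The sketch below is a heuristic of my own (not a Dataflow or Beam feature) that guesses the unit from the magnitude, which is only reasonable for present-day dates.

```python
from datetime import datetime, timezone

def to_epoch_seconds(raw):
    """Guess the unit of an epoch timestamp by magnitude and return seconds.

    Heuristic (an assumption, valid for present-day dates only): values are
    around 1e9 in seconds, 1e12 in milliseconds and 1e15 in microseconds.
    """
    if raw >= 1e14:
        return raw / 1_000_000  # treat as microseconds
    if raw >= 1e11:
        return raw / 1_000      # treat as milliseconds
    return float(raw)           # treat as seconds

def as_utc(raw):
    """Convert a raw epoch value of unknown unit to a UTC datetime."""
    return datetime.fromtimestamp(to_epoch_seconds(raw), tz=timezone.utc)

# The same instant expressed in three resolutions.
for raw in (1_600_000_000, 1_600_000_000_000, 1_600_000_000_000_000):
    print(raw, as_utc(raw).isoformat())
```

Logging a few converted timestamps like this before assigning them to elements is a cheap way to catch a unit mismatch early.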

Streaming pipeline publish to pubsub after write step completes

I have a use case where I have a Dataflow job running in streaming mode with an hourly fixed window.
When the pipeline runs for a given window, we calculate some data and write it to a data source. What I want to do next is publish some message to PubSub once the write is complete - how might I go about making sure that the write step is complete before writing to PubSub?
If the pipeline were executed in batch mode, I know I could execute it in a blocking fashion as suggested here, but the tricky part is that this is constantly running in streaming mode.
The Wait.on() transform is designed for this use case. See the documentation for a usage example.

How to add labels to an existing Google Dataflow job?

I am using the Java GAPI client to work with Google Cloud Dataflow (v1b3-rev197-1.22.0). I am running a pipeline from template and the method for doing that (com.google.api.services.dataflow.Dataflow.Projects.Templates#create) does not allow me to set labels for the job. However I get the Job object back when I execute the pipeline, so I updated the labels and tried to call com.google.api.services.dataflow.Dataflow.Projects.Jobs#update to persist that information in Dataflow. But the labels do not get updated.
I also tried updating labels on finished jobs (which I also need to do), which didn't work either, so I thought it's because the job is in a terminal state. But updating labels seems to do nothing regardless of the state.
The documentation does not say anything about labels not being mutable on running or terminated pipelines, so I would expect things to work. Am I doing something wrong, and if not, what is the rationale behind the decision not to allow label updates? (And how are template users supposed to set the initial label set when executing the template?)
Background: I want to mark terminated pipelines that have been "processed", i.e. those that our automated infrastructure has already sent a notification about to the appropriate places. Labels seemed like a good approach that would shield me from having to use some kind of local persistence to track things (a big jump in complexity). Any suggestions on how to approach this if labels are not the right tool? Sadly, Stackdriver cannot monitor finished pipelines, only failed ones. And sending a notification from within the pipeline code doesn't seem like a good idea to me (wrong?).

Creating a structured Jenkins Failing Test Report

The situation right now:
Every Monday morning I manually check the JUnit results of the Jenkins jobs that ran over the weekend; using the Project Health plugin I can filter on the timeboxed runs. I then copy and paste this table into Excel and go over each test case's output log to see what failed and note down the failure cause. Every weekend gets its own tab in Excel. All this makes traceability a nightmare and requires time-consuming manual labor.
What I am looking for (and hoping that already exists to some degree):
A database that stores all failed tests for all jobs I specify. It parses the output log of a failed test case and based on some regex applies a 'tag' e.g. 'Audio' if a test regarding audio is failing. Since everything is in a database I could make or use a frontend that can apply filters at will.
For example, if I want to see all tests regarding audio failing over the weekend (over multiple jobs and multiple runs) I could run a query that returns all entries with the Audio tag.
I'm OK with manually tagging failed tests and their causes, as well as writing my own frontend. Is there a way (the Jenkins API, perhaps?) to grab the failed tests (JUnit format and Jenkins plugin) and create such a system myself if it does not exist?
A good question. Unfortunately, it is very difficult in Jenkins to get such "meta statistics" that spans several jobs. There is no existing solution for that.
Basically, I see two options for getting what you want:
Post-processing Jenkins-internal data to get the statistics that you need.
Feeding a database on-the-fly with build execution data.
The first option basically means automating the tasks that you do manually right now.
you can use external scripting (Python, Perl,...) to process Jenkins-internal data (via REST or CLI APIs, or directly reading on-disk data)
or you run Groovy scripts internally (which will be faster and more powerful)
It's the most direct way to go. However, depending on the statistics that you need and on your requirements regarding data persistence, you may want to go for...
The second option: more flexible and completely decoupled from Jenkins' internal data storage. You could implement it by
introducing a Groovy post-build step for all your jobs
that script parses job results and puts data of interest in a custom, external database
Statistics you'd get from querying that database.
Typically, you'd start with the first option. Once requirements grow, you'd slowly migrate to the second one (e.g., by collecting internal data via explicit post-processing scripts, putting that into a database, and then running queries on it). You'll want to cut this migration phase as short as possible, as it eventually requires the effort of implementing both options.
You may want to have a look at couchdb-statistics. It is far from a perfect fit, but at least seems to do partially what you want to achieve.

How do I trigger a job when another completes?

I have two jobs; consider them to be super simple jobs that just print a line and have no triggers or timeouts defined. They work fine when I call them from a controller class through: <name of my class>Job.triggerNow()
What I want is to trigger one job and, as soon as it finishes, trigger a different, subsequent job.
I have tried using the quartzScheduler, but I can't seem to get a JobDetail from my job classes, so I'm not sure what is the correct way for doing this. I also want to pass some results from the first job onto the second one.
I know I can trigger the second job as the last line of my first job's execute method, but this is not desirable since it's technically not part of the first job and couples things more than I would like.
Any help will be greatly appreciated. thanks
What it sounds like you are after is an asynchronous "pipeline" of work, where different workers stand in a line and pass data to be worked on from one to the next. This sort of architecture is very flexible and applies to a large number of common applications.
The best way that I have found to get such an architecture in place with Grails is to use a message queue, like RabbitMQ for example, with a series of queues (one for each step in the pipeline), and then have the controller(s) put messages into the first step of the pipeline.
Then, you have a worker (just a service within the Grails app if you use the excellent RabbitMQ Grails plugin) listen to the queue that holds jobs for it to work on. As work comes into the queue, the worker will pop the job off, process it, and then put a message into the queue of the next step in the pipeline.
I've found this to be the best way to architect just about any asynchronous pipeline, since it allows you to scale each piece separately as needed and doesn't have too much overhead. There are also ways to decouple the jobs from having to know about the next step in the pipeline, but I've found that in most cases this isn't really needed and just adds useless complexity.
Quartz is great for jobs that need to happen on a schedule, but a pipeline is much better at processing work as it comes in, in a scalable way.
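The shape of that pipeline can be sketched without any broker. The example below swaps RabbitMQ for Python's in-process `queue.Queue` and threads, purely to illustrate the one-queue-per-step layout; the function names and the sentinel-based shutdown are my own choices, not part of any plugin.

```python
import queue
import threading

STOP = object()  # sentinel telling a worker to shut down

def worker(inbox, outbox, step):
    """Pop messages from inbox, process them, push results to outbox."""
    while True:
        message = inbox.get()
        if message is STOP:
            outbox.put(STOP)  # propagate shutdown to the next step
            break
        outbox.put(step(message))

def run_pipeline(messages, steps):
    """Run messages through steps, with one queue and one worker per step."""
    # queues[i] feeds step i; the last queue collects the final results.
    queues = [queue.Queue() for _ in range(len(steps) + 1)]
    threads = [
        threading.Thread(target=worker, args=(queues[i], queues[i + 1], step))
        for i, step in enumerate(steps)
    ]
    for thread in threads:
        thread.start()

    # The "controller" only knows about the first queue.
    for message in messages:
        queues[0].put(message)
    queues[0].put(STOP)

    for thread in threads:
        thread.join()

    results = []
    while True:
        item = queues[-1].get()
        if item is STOP:
            break
        results.append(item)
    return results

print(run_pipeline([1, 2, 3], [lambda x: x + 1, lambda x: x * 10]))
```

With a real message broker each worker would be a separate consumer, so a slow step can be scaled by adding consumers on its queue without touching the others.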
Please have a look at JobListener.
You can utilize
public void jobWasExecuted(JobExecutionContext context,
                           JobExecutionException jobException);
I built something similar in my web application using a queue messaging technique with Redis. I simply define the dependency structure for all the jobs and have a master job whose only purpose is to monitor and update the status of the other jobs and trigger dependent jobs when needed.
Each job has to report its status (running/finish/cancel) using the Redis queue. The master job pops each message off the queue and processes it accordingly.
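A stripped-down sketch of that master-job idea, with Python's in-process `queue.Queue` standing in for the Redis queue and jobs run synchronously for simplicity. The dependency map, job names and message format are hypothetical. Each job's result is passed on to its dependents, which covers passing data from the first job to the second.

```python
import queue

# Hypothetical dependency structure: job name -> jobs to trigger when it finishes.
DEPENDENTS = {
    "extract": ["transform"],
    "transform": ["load", "report"],
}

def run_master(jobs, dependents, status_queue, first_job):
    """Trigger first_job, then trigger dependents as 'finish' messages arrive.

    jobs maps a job name to a callable taking the upstream result.
    Returns the list of job names in the order they were triggered.
    """
    executed = []

    def trigger(name, payload):
        executed.append(name)
        result = jobs[name](payload)
        # Each job reports its status (and result) through the queue.
        status_queue.put((name, "finish", result))

    trigger(first_job, None)
    # The master pops each status message and reacts to it.
    while not status_queue.empty():
        name, status, result = status_queue.get()
        if status == "finish":
            for dependent in dependents.get(name, []):
                trigger(dependent, result)
    return executed

jobs = {
    "extract": lambda _: [1, 2, 3],
    "transform": lambda rows: [r * 2 for r in rows],
    "load": lambda rows: len(rows),
    "report": lambda rows: sum(rows),
}
print(run_master(jobs, DEPENDENTS, queue.Queue(), "extract"))
```

Because the jobs only report to the queue, none of them needs to know which job comes next; that knowledge lives in the dependency map.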

Resources