PubsubIO does not output custom timestamp attribute as context.timestamp when running with DataflowRunner and Dataflow service - google-cloud-dataflow

I am working on an Apache Beam project that ran into an issue with the Dataflow service and PubsubIO related to the custom timestamp attribute. The current Beam SDK version is 2.7.0.
In the project, we have 2 Dataflow jobs communicating via a PubSub topic and subscription:
The first pipeline (sinking data to PubSub)
This pipeline works on a per-message basis, so it has no custom windowing strategy applied besides GlobalWindows (Beam's default). At the end of this pipeline, we sank (wrote) all the messages, each of which had already been assigned a map of attributes including its event timestamp (e.g. "published_at"), to a PubSub topic using PubsubIO.writeMessages().
Note: if we use PubsubIO.writeMessages().withTimestampAttribute(), this method will tell PubsubIO.ShardFn, PubsubIO.WriteFn and PubsubClient to write/overwrite the sinking pipeline's processing time into that attribute in the map.
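For reference, a minimal sketch of what that write step looks like with the Beam 2.x Java API. The record type, topic path, and the way the event time is obtained are placeholders for illustration, not our actual code; "records" is assumed to be an upstream PCollection of (payload, event-time-millis) pairs.

import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

// "records" is assumed to be a PCollection<KV<String, Long>> of
// (payload, event time in epoch millis) produced earlier in the pipeline.
PCollection<PubsubMessage> messages = records.apply("AttachAttributes",
    ParDo.of(new DoFn<KV<String, Long>, PubsubMessage>() {
      @ProcessElement
      public void processElement(ProcessContext c) {
        Map<String, String> attributes = new HashMap<>();
        // The event time travels as a message attribute, not as the Pub/Sub publish time.
        attributes.put("published_at", String.valueOf(c.element().getValue()));
        c.output(new PubsubMessage(
            c.element().getKey().getBytes(StandardCharsets.UTF_8), attributes));
      }
    }));

// Plain writeMessages(): the attribute map is passed through unchanged.
messages.apply("WriteToPubsub",
    PubsubIO.writeMessages().to("projects/my-project/topics/my-topic"));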
The second pipeline (reading data from PubSub)
In the second pipeline (reading pipeline), we have tried PubsubIO.readMessagesWithAttributes().withTimestampAttribute("published_at") and PubsubIO.readStrings().withTimestampAttribute("published_at") for the source.
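Concretely, the read side looked roughly like the sketch below ("pipeline" and the subscription path are placeholders); withTimestampAttribute("published_at") is the only piece that differs from a plain read.

import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage;
import org.apache.beam.sdk.values.PCollection;

PCollection<PubsubMessage> input = pipeline.apply("ReadFromPubsub",
    PubsubIO.readMessagesWithAttributes()
        .fromSubscription("projects/my-project/subscriptions/my-subscription")
        // Use the "published_at" attribute as the element timestamp
        // instead of the Pub/Sub publish time.
        .withTimestampAttribute("published_at"));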
When running with DirectRunner, everything worked as expected. The messages were read from the PubSub subscription and output to the downstream stages with a ProcessContext.timestamp() equal to their event timestamp "published_at".
But when running with DataflowRunner, the ProcessContext.timestamp() was always set to near real time, close to the sinking pipeline's processing time. We checked and can confirm that those timestamps were not PubSub's publishing time. All the data were then assigned to the wrong windows relative to their event-domain timestamps. We expected late data to be dropped, not assigned to invalid windows.
Note: we had left the PubSub topic populated with a considerable amount of data before turning on the second pipeline, so that there would be some historical/late data.
[Screenshot: Pubsub messages with invalid context timestamp]
Assumed root cause
Looking deeper into the source code of DataflowRunner, we can see that the Dataflow service uses completely different Pubsub code (overriding PubsubIO.Read at the pipeline's construction time) to read from and sink to Pubsub.
So if we want to use the Beam SDK's PubsubIO, we have to use the experimental option "enable_custom_pubsub_source". But no luck so far, as we have run into this issue https://jira.apache.org/jira/browse/BEAM-5674 and have not been able to test the Beam SDK's Pubsub code.
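For completeness, this is roughly how we pass the experiment (it can equivalently be supplied on the command line as --experiments=enable_custom_pubsub_source). This is only a sketch, and the behavior of the experiment may vary across SDK and service versions; "args" is assumed to come from main().

import java.util.Arrays;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

DataflowPipelineOptions options = PipelineOptionsFactory.fromArgs(args)
    .withValidation()
    .as(DataflowPipelineOptions.class);
// Ask Dataflow to run the Beam SDK's own Pub/Sub source instead of the
// service's internal Pub/Sub implementation (experimental).
options.setExperiments(Arrays.asList("enable_custom_pubsub_source"));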
Workaround solution
Our current workaround is that, after the step that assigns windows to the messages, we implemented a DoFn to check their event timestamp against their IntervalWindow (see the sketch below). If the window is invalid, we simply drop the messages and later run a weekly or twice-weekly job to correct them from a historical source. It is better to have some missing data than improperly calculated data.
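A minimal sketch of what such a filtering DoFn can look like, assuming the "published_at" attribute holds epoch milliseconds and is always present; the class name and per-element details are illustrative, not our production code.

import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.windowing.BoundedWindow;
import org.apache.beam.sdk.transforms.windowing.IntervalWindow;
import org.joda.time.Instant;

class DropMisWindowedFn extends DoFn<PubsubMessage, PubsubMessage> {
  @ProcessElement
  public void processElement(ProcessContext c, BoundedWindow boundedWindow) {
    IntervalWindow window = (IntervalWindow) boundedWindow;
    // Re-parse the original event time from the message attribute
    // (assumed to be epoch millis here).
    Instant eventTime =
        new Instant(Long.parseLong(c.element().getAttribute("published_at")));
    // Keep the element only if its real event time belongs to the window it ended up in.
    if (!eventTime.isBefore(window.start()) && eventTime.isBefore(window.end())) {
      c.output(c.element());
    }
    // Otherwise drop it; a periodic batch job backfills from a historical source.
  }
}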
[Screenshot: Messages dropped due to invalid windows]
Please share your experiences with this case. We know that, from the perspective of Dataflow watermark management, the watermark is said to adjust itself to the current real time if the ingested data is sparse (not dense enough over time).
We also believe that we are misunderstanding something about the way the Dataflow service maintains the PubsubUnboundedSource's output timestamp, as we are still new to Apache Beam and Google's Dataflow, so there are things we have not come to know of yet.
Many Thanks!

I found the fix for this issue. In my sinking pipeline, the timestamp attribute was set with a date format that does not conform to the RFC 3339 standard: the formatted dates were missing the 'Z' character. We either added the 'Z' character or switched to milliseconds since epoch. Both worked well.
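For anyone hitting the same problem, a small illustration of both fixes using standard java.time; the attribute name and values are just examples.

import java.time.Instant;
import java.time.format.DateTimeFormatter;

Instant eventTime = Instant.now();

// Option 1: RFC 3339 / ISO-8601 instant, including the trailing 'Z'.
String rfc3339 = DateTimeFormatter.ISO_INSTANT.format(eventTime);  // e.g. 2018-10-15T08:30:00Z

// Option 2: milliseconds since the Unix epoch, as a plain number.
String epochMillis = String.valueOf(eventTime.toEpochMilli());

// Either value can then be put into the "published_at" attribute.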
But one thing to note is that when the Dataflow service could not parse the malformed dates, it did not warn or throw an error; instead it silently took the processing time for all the elements, so they were assigned to the wrong event_time windows.

Related

Debugging slow reads from BigQuery on Google Cloud Dataflow

Background:
We have a really simple pipeline which reads some data from BigQuery (usually ~300MB), filters/transforms it, and puts it back into BigQuery. In 99% of cases this pipeline finishes in 7-10 minutes and is then restarted to process a new batch.
Problem:
Recently, the job has started to take >3h once in a while, maybe 2 times in a month out of 2000 runs. When I look at the logs, I can't see any errors and in fact it's only the first step (read from BigQuery) that is taking so long.
Does anyone have a suggestion on how to approach debugging of such cases? Especially since it's really the read from BQ and not any of our transformation code. We are using Apache Beam SDK for Python 0.6.0 (maybe that's the reason!?)
Is it maybe possible to define a timeout for the job?
This is an issue on either Dataflow side or BigQuery side depending on how one looks at it. When splitting the data for parallel processing, Dataflow relies on an estimate of the data size. The long runtime happens when BigQuery sporadically gives a severe under-estimate of the query result size, and Dataflow, as a consequence, severely over-splits the data and the runtime becomes bottlenecked by the overhead of reading lots and lots of tiny file chunks exported by BigQuery.
On one hand, this is the first time I've seen BigQuery produce such dramatically incorrect query result size estimates. However, as size estimates are inherently best-effort and can in general be arbitrarily off, Dataflow should control for that and prevent such oversplitting. We'll investigate and fix this.
The only workaround that comes to mind meanwhile is to use the Java SDK: it uses quite different code for reading from BigQuery that, as far as I recall, does not rely on query size estimates.

Creating a structured Jenkins Failing Test Report

The situation right now:
Every Monday morning I manually check the jUnit results of the Jenkins jobs that ran over the weekend; using the Project Health plugin I can filter on the timeboxed runs. I then copy-paste this table into Excel and go over each test case's output log to see what failed and note down the failure cause. Every weekend gets another tab in Excel. All this makes traceability a nightmare and requires time-consuming manual labor.
What I am looking for (and hoping that already exists to some degree):
A database that stores all failed tests for all jobs I specify. It parses the output log of a failed test case and based on some regex applies a 'tag' e.g. 'Audio' if a test regarding audio is failing. Since everything is in a database I could make or use a frontend that can apply filters at will.
For example, if I want to see all tests regarding audio failing over the weekend (over multiple jobs and multiple runs) I could run a query that returns all entries with the Audio tag.
I'm OK with manually tagging failed tests and their causes, as well as writing my own frontend. Is there a way (the Jenkins API perhaps?) to grab the failed tests (jUnit format and Jenkins plugin) and create such a system myself if it does not already exist?
A good question. Unfortunately, it is very difficult in Jenkins to get such "meta statistics" that span several jobs. There is no existing solution for that.
Basically, I see two options for getting what you want:
Post-processing Jenkins-internal data to get the statistics that you need.
Feeding a database on-the-fly with build execution data.
The first option basically means automating the tasks that you do manually right now.
you can use external scripting (Python, Perl,...) to process Jenkins-internal data (via REST or CLI APIs, or directly reading on-disk data); a sketch of the REST route is at the end of this answer
or you run Groovy scripts internally (which will be faster and more powerful)
It's the most direct way to go. However, depending on the statistics that you need and depending on your requirements regarding data persistence, you may want to go for...
The second option: more flexible and completely decoupled from Jenkins' internal data storage. You could implement it by
introducing a Groovy post-build step for all your jobs
that script parses job results and puts data of interest in a custom, external database
Statistics you'd get from querying that database.
Typically, you'd start with the first option. Once requirements grow, you'd slowly migrate to the second one (e.g., by collecting internal data via explicit post-processing scripts, putting that into a database, and then running queries on it). You'll want to keep this migration phase as short as possible, as it eventually requires the effort of implementing both options.
You may want to have a look at couchdb-statistics. It is far from a perfect fit, but at least seems to do partially what you want to achieve.
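As an illustration of the REST route mentioned under the first option, here is a small stand-alone Java sketch that pulls a build's jUnit report as JSON, which you could then tag and push into your own database. The Jenkins URL, job name, and authentication are placeholders, and the exact JSON structure depends on your Jenkins and JUnit plugin versions, so treat this only as a starting point.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class FetchTestReport {
  public static void main(String[] args) throws Exception {
    // testReport/api/json is exposed by the JUnit plugin's remote API.
    String url = "https://jenkins.example.com/job/my-job/lastCompletedBuild/testReport/api/json";
    HttpRequest request = HttpRequest.newBuilder(URI.create(url))
        // Add an Authorization header (user:apiToken, Base64) if your Jenkins requires it.
        .GET()
        .build();
    HttpResponse<String> response =
        HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
    // The body contains the test suites and cases with names, statuses and error details;
    // parse it with a JSON library of your choice, apply your regex-based tags,
    // and insert the failed cases into your database.
    System.out.println(response.body());
  }
}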

Multiple export using google dataflow

Not sure whether this is the right place to ask, but I am currently trying to run a Dataflow job that will partition a data source into multiple chunks in multiple places. However, I feel that if I try to write to too many tables at once in one job, the Dataflow job is more likely to fail with an HTTP transport exception error, and I assume there is some bound on how much I/O, in terms of sources and sinks, I can wrap into one job?
To avoid this scenario, the best solution I can think of is to split this one job into multiple Dataflow jobs, but that would mean processing the same data source multiple times (once per Dataflow job). It is okay for now, but ideally I would like to avoid it if my data source later grows huge.
Therefore I am wondering whether there is any rule of thumb for how many data sources and sinks I can group into one steady job? And is there any better solution for my use case?
From the Dataflow service description of structuring user code:
The Dataflow service is fault-tolerant, and may retry your code multiple times in the case of worker issues. The Dataflow service may create backup copies of your code, and can have issues with manual side effects (such as if your code relies upon or creates temporary files with non-unique names).
In general, Dataflow should be relatively resilient. You can Partition your data based on the location you would like it written to (see the sketch below). The writes to these output locations will be automatically divided into bundles, and any bundle which fails to get written will be retried.
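As a rough illustration of the Partition idea (written against the Beam-style Java API; the older Dataflow SDK has an equivalent transform), with the number of destinations, the routing function, and the GCS paths all made up for the example; "lines" is assumed to be a PCollection<String> read earlier in the pipeline.

import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.transforms.Partition;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionList;

// Split one collection into N partitions based on some property of each record.
int numDestinations = 3;
PCollectionList<String> parts = lines.apply("SplitByDestination",
    Partition.of(numDestinations,
        (String line, int n) -> Math.floorMod(line.hashCode(), n)));

// Each partition then gets its own sink; failed bundles are retried per write.
for (int i = 0; i < numDestinations; i++) {
  parts.get(i).apply("WriteDestination" + i,
      TextIO.write().to("gs://my-bucket/output/destination-" + i + "/part"));
}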
If the location you want to write to is not already supported you can look at writing a custom sink. The docs there describe how to do so in a way that is fault tolerant.
There is a bound on how many sources and sinks you can have in a single job. Do you have any details on how many you expect to use? If it exceeds the limit, there are also ways to use a single custom sink instead of several sinks, depending on your needs.
If you have more questions, feel free to comment. In addition to knowing more about what you're looking to do, it would help to know if you're planning on running this as a Batch or Streaming job.
Our solution to this was to write a custom GCS sink that supports partitions. Though with the responses I got I'm unsure whether that was the right thing to do or not. Writing Output of a Dataflow Pipeline to a Partitioned Destination

Dataflow OutOfMemoryError while reading small tables from BigQuery

We have a pipeline reading data from BigQuery and processing historical data for various calendar years. It fails with OutOfMemoryError errors if the input data is small (~500MB).
On startup it reads from BigQuery at about 10,000 elements/sec; after a short time it slows down to hundreds of elements/sec and then hangs completely.
Observing 'Elements Added' on the next processing step (BQImportAndCompute), the value increases and then decreases again. That looks to me like some already loaded data is dropped and then loaded again.
Stackdriver Logging console contains errors with various stack traces that contain java.lang.OutOfMemoryError, for example:
Error reporting workitem progress update to Dataflow service:
"java.lang.OutOfMemoryError: Java heap space
at com.google.cloud.dataflow.sdk.runners.worker.BigQueryAvroReader$BigQueryAvroFileIterator.getProgress(BigQueryAvroReader.java:145)
at com.google.cloud.dataflow.sdk.util.common.worker.ReadOperation$SynchronizedReaderIterator.setProgressFromIteratorConcurrent(ReadOperation.java:397)
at com.google.cloud.dataflow.sdk.util.common.worker.ReadOperation$SynchronizedReaderIterator.setProgressFromIterator(ReadOperation.java:389)
at com.google.cloud.dataflow.sdk.util.common.worker.ReadOperation$1.run(ReadOperation.java:206)
I would suspect that there is a problem with the topology of the pipeline, but running the same pipeline
locally with DirectPipelineRunner works fine
in the cloud with DataflowPipelineRunner on a large dataset (5GB, for another year) works fine
I assume the problem is how Dataflow parallelizes and distributes work in the pipeline. Is there any way to inspect or influence it?
The problem here doesn't seem to be related to the size of the BigQuery table, but rather to the number of BigQuery sources being used and the rest of the pipeline.
Instead of reading from multiple BigQuery sources and flattening them, have you tried reading from a single query that pulls in all the information (see the sketch below)? Doing that in a single step should simplify the pipeline and also allow BigQuery to execute better (one query against multiple tables vs. multiple queries against individual tables).
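Not the original pipeline's code, but a sketch of the single-query idea using the Dataflow SDK 1.x classes that appear in the stack trace above. The project, dataset, and table names are placeholders; in legacy BigQuery SQL, a comma between tables acts as UNION ALL.

import com.google.api.services.bigquery.model.TableRow;
import com.google.cloud.dataflow.sdk.io.BigQueryIO;
import com.google.cloud.dataflow.sdk.values.PCollection;

// One query that unions the per-year tables, replacing several separate
// BigQuery sources plus a Flatten step.
PCollection<TableRow> rows = pipeline.apply(
    BigQueryIO.Read.named("ReadAllYears").fromQuery(
        "SELECT * FROM [my_project:my_dataset.events_2013],"
            + " [my_project:my_dataset.events_2014],"
            + " [my_project:my_dataset.events_2015]"));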
Another possible problem is if there is a high degree of fan-out within or after the BQImportAndCompute operation. Depending on the computation being done there, you may be able to reduce the fan-out using clever CombineFns or WindowFns. If you want help figuring out how to improve that path, please share more details about what is happening after the BQImportAndCompute.
Have you tried debugging with Stackdriver?
https://cloud.google.com/blog/big-data/2016/04/debugging-data-transformations-using-cloud-dataflow-and-stackdriver-debugger

Cloud Dataflow Running really slow when reading/writing from Cloud Storage (GCS)

Since the release of the latest build of Cloud Dataflow (0.4.150414), our jobs have been running really slowly when reading from Cloud Storage (GCS). After running for 20 minutes with 10 VMs, we were only able to read about 20 records, whereas previously we could read millions without issue.
It seems to be hanging, although no errors are being reported back to the console.
We received an email informing us that the latest build would be slower and that it could be countered by using more VMs, but we got similar results with 50 VMs.
Here is the job id for reference: 2015-04-22_22_20_21-5463648738106751600
Instance: n1-standard-2
Region: us-central1-a
Your job seems to be using side inputs to a DoFn. Since there has been a recent change in how Cloud Dataflow SDK for Java handles side inputs, it is likely that your performance issue is related to that. I'm reposting my answer from a related question.
The evidence seems to indicate that there is an issue with how your pipeline handles side inputs. Specifically, it's quite likely that side inputs may be getting re-read from BigQuery again and again, for every element of the main input. This is completely orthogonal to the changes to the type of virtual machines used by Dataflow workers, described below.
This is closely related to the changes made in the Dataflow SDK for Java, version 0.3.150326. In that release, we changed the side input API to apply per window. Calls to sideInput() now return values only in the specific window corresponding to the window of the main input element, and not the whole side input PCollectionView. Consequently, sideInput() can no longer be called from startBundle and finishBundle of a DoFn because the window is not yet known.
For example, the following code snippet has an issue that causes the side input to be re-read for every input element.
@Override
public void processElement(ProcessContext c) throws Exception {
  // The side input is re-read here for every element, which is the expensive part.
  Iterable<String> uniqueIds = c.sideInput(iterableView);
  for (String item : uniqueIds) {
    [...]
  }
  c.output([...]);
}
This code can be improved by caching the side input in a List member variable of the transform (assuming it fits into memory) during the first call to processElement, and using that cached List instead of the side input in subsequent calls (a sketch follows below).
This workaround should restore the performance you were seeing before, when side inputs could have been called from startBundle. Long-term, we will work on better caching for side inputs. (If this doesn't help fully resolve the issue, please reach out to us via email and share the relevant code snippets.)
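A minimal sketch of that caching pattern, using the old DoFn style from the snippet above. The element type and the per-element work are placeholders, "iterableView" is the side input view from the snippet, and it assumes the side input is small enough to hold in memory and does not vary per window.

import java.util.ArrayList;
import java.util.List;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.values.PCollectionView;

class CachedSideInputFn extends DoFn<String, String> {
  private final PCollectionView<Iterable<String>> iterableView;
  private transient List<String> cachedIds;  // populated on the first element seen by this instance

  CachedSideInputFn(PCollectionView<Iterable<String>> iterableView) {
    this.iterableView = iterableView;
  }

  @Override
  public void processElement(ProcessContext c) throws Exception {
    if (cachedIds == null) {
      // First element on this DoFn instance: read the side input once and keep it.
      cachedIds = new ArrayList<>();
      for (String id : c.sideInput(iterableView)) {
        cachedIds.add(id);
      }
    }
    for (String id : cachedIds) {
      // ... same per-element logic as before, but against the cached list
    }
    c.output(c.element());  // placeholder for the original output logic
  }
}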
Separately, there was, indeed, an update to the Cloud Dataflow Service around 4/9/15 that changed the default type of virtual machines used by Dataflow workers. Specifically, we reduced the default number of cores per worker because our benchmarks showed it as cost effective for typical jobs. This is not a slowdown in the Dataflow Service of any kind -- it just runs with fewer resources per worker, by default. Users are still given the option to override both the number of workers and the type of virtual machine used by workers.
We had a similar issue. It occurs when the side input reads from a BigQuery table that has had its data streamed in rather than bulk loaded. When we copy the table(s) and read from the copies instead, everything works fine.
If your tables are streamed, try copying them and reading the copies instead. This is a workaround.
See: Dataflow performance issues
