We are using Apache Avro and got Fortify scan results for Avro. The Fortify report flags a race condition at this location: https://github.com/apache/avro/blob/master/lang/js/lib/protocols.js#L529
Does anyone have an idea whether this is actually possible? It looks like a false positive to me, but I want to make sure. Any input is appreciated.
We have a Dataflow job that has a low system latency and a high "data freshness" (or "data watermark lag").
After upgrading to Beam 2.15 (from 2.12), we see that this metric keeps increasing, which would suggest that something is stuck in the pipeline. However, this is not the case, as all data is being consumed (from a PubSub subscription). Permissions also seem OK, as we can consume (unless that is not enough?).
We also checked individual watermarks on all components of the pipeline, and they are ok (very recent).
Thanks!
This is indeed quite odd. Here are some reasons why you might be seeing this:
There may be a bug in the new Beam SDK, or in Dataflow's watermark estimation.
It may be that you updated the topology of your pipeline, and hit a bug related to watermark calculation for old/new topology.
The job may indeed be stuck, and you may have missed some data that actually did not make it across the pipeline.
My advice, if you're seeing this, is to open a support case with Dataflow support.
I am working on an Apache Beam project that ran into an issue with the Dataflow service and PubsubIO related to the custom timestamp attribute. The current version of the Beam SDK is 2.7.0.
In the project, we have 2 Dataflow jobs communicating via a PubSub topic and subscription:
The first pipeline (sinking data to PubSub)
This pipeline works on a per-message basis, so it has no custom windowing strategy applied besides the GlobalWindows (the default in Beam). At the end of this pipeline, we sank (wrote) all the messages, which had already been assigned a map of attributes including their event timestamp (e.g. "published_at"), to a PubSub topic using PubsubIO.writeMessages().
Note: if we use PubsubIO.writeMessages().withTimestampAttribute(), this method tells PubsubIO.ShardFn, PubsubIO.WriteFn and PubsubClient to write/overwrite the sinking pipeline's processing time into that attribute in the map.
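For illustration, a minimal sketch of such a sinking step is below; the project, topic, payloads, and attribute value are assumptions for the example, not our actual code:

    import java.nio.charset.StandardCharsets;
    import java.time.Instant;
    import java.util.Collections;

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
    import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage;
    import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessageWithAttributesCoder;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.Create;
    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.values.TypeDescriptor;

    public class WriteWithEventTimestamp {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        p.apply("CreatePayloads", Create.of("payload-1", "payload-2"))
            .apply("ToPubsubMessage",
                MapElements.into(TypeDescriptor.of(PubsubMessage.class))
                    .via(payload -> new PubsubMessage(
                        payload.getBytes(StandardCharsets.UTF_8),
                        // Event timestamp attached by us as an RFC 3339 attribute.
                        Collections.singletonMap("published_at", Instant.now().toString()))))
            .setCoder(PubsubMessageWithAttributesCoder.of())
            // Adding .withTimestampAttribute("published_at") here would make PubsubIO
            // overwrite the attribute with the sinking pipeline's processing time,
            // which is the behavior described in the note above.
            .apply("WriteToPubsub",
                PubsubIO.writeMessages().to("projects/my-project/topics/my-topic"));

        p.run();
      }
    }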
The second pipeline (reading data from PubSub)
In the second pipeline (the reading pipeline), we have tried both PubsubIO.readMessagesWithAttributes().withTimestampAttribute("published_at") and PubsubIO.readStrings().withTimestampAttribute("published_at") for the source.
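A minimal sketch of the reading side we tried (the subscription name is an assumption); the DoFn only logs ProcessContext.timestamp() so the behavior described below can be observed:

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
    import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;

    public class ReadWithEventTimestamp {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        p.apply("ReadFromPubsub",
                PubsubIO.readMessagesWithAttributes()
                    .fromSubscription("projects/my-project/subscriptions/my-subscription")
                    // Tell the source to use our attribute as the element timestamp.
                    .withTimestampAttribute("published_at"))
            .apply("LogTimestamps", ParDo.of(new DoFn<PubsubMessage, Void>() {
              @ProcessElement
              public void processElement(ProcessContext c) {
                // With DirectRunner this prints the "published_at" event time;
                // with DataflowRunner we saw timestamps near the sinking
                // pipeline's processing time instead.
                System.out.println("element timestamp: " + c.timestamp());
              }
            }));

        p.run();
      }
    }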
When running with DirectRunner, everything worked as expected. The messages were read from the PubSub subscription and output to the downstream stages with a ProcessContext.timestamp() equal to their event timestamp "published_at".
But when running with DataflowRunner, the ProcessContext.timestamp() was always set to near real time, which is close to the sinking pipeline's processing time. We checked and can confirm that those timestamps were not PubSub's publishing time. All the data was then assigned to the wrong windows compared to its event-domain timestamp. We expected late data to be dropped, not to be assigned to invalid windows.
Note: We had left the PubSub topic populated with a considerable amount of data before we turned on the second pipeline, in order to have some historical/late data.
Pubsub messages with invalid context timestamp
Assumed root cause
Looking deeper into the source code of DataflowRunner, we can see that the Dataflow service uses completely different Pubsub code (overriding the PubsubIO.Read at pipeline construction time) to read from and sink to Pubsub.
So if we want to use the Beam SDK's PubsubIO, we have to use the experimental option "enable_custom_pubsub_source". But so far no luck, as we have run into this issue https://jira.apache.org/jira/browse/BEAM-5674 and have not been able to test the Beam SDK's Pubsub code.
Workaround solution
Our current workaround is that, after the step that assigns windows to the messages, we implemented a DoFn to check their event timestamp against their IntervalWindow. If a window is invalid, we simply drop the message, and later run weekly or twice-weekly jobs to correct the data from a historical source. It is better to have some missing data than improperly calculated data.
Messages dropped due to invalid windows
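A rough sketch of that workaround DoFn, assuming the "published_at" attribute carries milliseconds since the epoch (the attribute format and the class name are illustrative):

    import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.windowing.BoundedWindow;
    import org.apache.beam.sdk.transforms.windowing.IntervalWindow;
    import org.joda.time.Instant;

    /** Drops messages whose event timestamp does not fall inside their assigned window. */
    public class DropInvalidWindowMessages extends DoFn<PubsubMessage, PubsubMessage> {
      @ProcessElement
      public void processElement(ProcessContext c, BoundedWindow window) {
        if (!(window instanceof IntervalWindow)) {
          c.output(c.element());
          return;
        }
        IntervalWindow interval = (IntervalWindow) window;
        // Assumption: the attribute holds milliseconds since the epoch.
        Instant eventTime =
            new Instant(Long.parseLong(c.element().getAttribute("published_at")));
        boolean inWindow =
            !eventTime.isBefore(interval.start()) && eventTime.isBefore(interval.end());
        if (inWindow) {
          c.output(c.element());
        }
        // Otherwise the message is dropped; a periodic job later restores it
        // from the historical source, as described above.
      }
    }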
Please share your experiences on this case with us. We know that, from the perspective of Dataflow's watermark management, the watermark is said to adjust itself toward the current real time if the ingested data is sparse (not dense enough over time).
We also believe that we are misunderstanding something about the way the Dataflow service maintains the PubsubUnboundedSource's output timestamp, as we are still new to Apache Beam and Google's Dataflow, so there are things we have not yet come to know.
Many Thanks!
I found the fix for this issue. In my sinking pipeline, the timestamp attribute was set with a date format that does not conform to the RFC 3339 standard: the formatted dates were missing the 'Z' character. We fixed it either by adding the 'Z' character or by switching to milliseconds since the Unix epoch. Both worked well.
One thing to note is that when the Dataflow service could not parse the wrong date format, it did not warn or throw an error, but instead took the processing time for all the elements, so they were assigned to the wrong event-time windows.
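For reference, a small sketch contrasting the broken format with the two fixes that worked for us; the concrete date value is only illustrative:

    import java.time.Instant;
    import java.time.ZoneOffset;
    import java.time.format.DateTimeFormatter;

    public class TimestampAttributeFormats {
      public static void main(String[] args) {
        Instant eventTime = Instant.parse("2019-03-27T09:30:00.000Z");

        // Broken variant (what our sinking pipeline originally produced): no 'Z' / offset,
        // so Dataflow silently fell back to processing time.
        String missingZone = DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss.SSS")
            .withZone(ZoneOffset.UTC)
            .format(eventTime);                                   // 2019-03-27T09:30:00.000

        // Fix 1: RFC 3339 / ISO-8601 in UTC, including the 'Z'.
        String rfc3339 = DateTimeFormatter.ISO_INSTANT.format(eventTime); // 2019-03-27T09:30:00Z

        // Fix 2: milliseconds since the Unix epoch, as a string attribute.
        String epochMillis = Long.toString(eventTime.toEpochMilli());     // 1553679000000

        System.out.println(missingZone + " | " + rfc3339 + " | " + epochMillis);
      }
    }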
I am writing a simple Dataflow pipeline in Java:
PubsubIO -> ConvertToTableRowDoFn -> BigQueryIO
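For reference, a minimal sketch of that topology; the subscription, table, and the body of ConvertToTableRowDoFn are assumptions for illustration:

    import com.google.api.services.bigquery.model.TableRow;
    import java.nio.charset.StandardCharsets;

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
    import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
    import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;

    public class PubsubToBigQuery {

      /** Hypothetical stand-in for the ConvertToTableRowDoFn mentioned above. */
      static class ConvertToTableRowDoFn extends DoFn<PubsubMessage, TableRow> {
        @ProcessElement
        public void processElement(ProcessContext c) {
          String payload = new String(c.element().getPayload(), StandardCharsets.UTF_8);
          c.output(new TableRow().set("payload", payload));
        }
      }

      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        p.apply("ReadFromPubsub",
                PubsubIO.readMessagesWithAttributes()
                    .fromSubscription("projects/my-project/subscriptions/my-subscription"))
            .apply("ConvertToTableRow", ParDo.of(new ConvertToTableRowDoFn()))
            .apply("WriteToBigQuery",
                BigQueryIO.writeTableRows()
                    .to("my-project:my_dataset.my_table")
                    .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
                    .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

        p.run();
      }
    }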
The pipeline is working -- data arrives in BigQuery as expected -- but I'm seeing OutOfMemoryErrors in the Dataflow worker logs.
One experiment I tried was slowing down the ConvertToTableRowDoFn by adding Thread.sleep(100). I was thinking this would make the batch sizes smaller for BigQueryIO, but to my surprise it made the OutOfMemoryErrors more frequent!
This makes me think that something in PubsubIO is reading data too quickly or doing too much buffering. Any tips for how to investigate this, or pointers on how PubsubIO does buffering in the Google Dataflow environment?
I recompiled Beam with FILE_TRIGGERING_RECORD_COUNT = 100000 instead of 500000, and we haven't seen any OOMs since.
We have a pipeline reading data from BigQuery and processing historical data for various calendar years. It fails with OutOfMemoryError errors if the input data is small (~500 MB).
On startup it reads from BigQuery at about 10,000 elements/sec; after a short time it slows down to hundreds of elements/sec and then hangs completely.
Observing 'Elements Added' on the next processing step (BQImportAndCompute), the value increases and then decreases again. That looks to me like some already loaded data is dropped and then loaded again.
Stackdriver Logging console contains errors with various stack traces that contain java.lang.OutOfMemoryError, for example:
Error reporting workitem progress update to Dataflow service:
"java.lang.OutOfMemoryError: Java heap space
at com.google.cloud.dataflow.sdk.runners.worker.BigQueryAvroReader$BigQueryAvroFileIterator.getProgress(BigQueryAvroReader.java:145)
at com.google.cloud.dataflow.sdk.util.common.worker.ReadOperation$SynchronizedReaderIterator.setProgressFromIteratorConcurrent(ReadOperation.java:397)
at com.google.cloud.dataflow.sdk.util.common.worker.ReadOperation$SynchronizedReaderIterator.setProgressFromIterator(ReadOperation.java:389)
at com.google.cloud.dataflow.sdk.util.common.worker.ReadOperation$1.run(ReadOperation.java:206)
I would suspect that there is a problem with the topology of the pipeline, but running the same pipeline locally with DirectPipelineRunner works fine, and running it in the cloud with DataflowPipelineRunner on a large dataset (5 GB, for another year) also works fine.
I assume the problem is in how Dataflow parallelizes and distributes work in the pipeline. Are there any possibilities to inspect or influence it?
The problem here doesn't seem to be related to the size of the BigQuery table, but more likely to the number of BigQuery sources being used and the rest of the pipeline.
Instead of reading from multiple BigQuery sources and flattening them, have you tried reading from a single query that pulls in all the information? Doing that in one step should simplify the pipeline and also allow BigQuery to execute more efficiently (one query against multiple tables vs. multiple queries against individual tables).
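As a sketch, using the current Beam-style BigQueryIO API (the older Dataflow SDK used in this job has an equivalent BigQueryIO.Read.fromQuery); the table names and the query are purely illustrative:

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    public class SingleQueryRead {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        // One query spanning the relevant tables, instead of several
        // BigQueryIO sources flattened together.
        p.apply("ReadFromBigQuery",
            BigQueryIO.readTableRows()
                .fromQuery(
                    "SELECT * FROM `my-project.my_dataset.events_2015` "
                        + "UNION ALL "
                        + "SELECT * FROM `my-project.my_dataset.events_2016`")
                .usingStandardSql());

        p.run();
      }
    }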
Another possible problem is if there is a high degree of fan-out within or after the BQImportAndCompute operation. Depending on the computation being done there, you may be able to reduce the fan-out using clever CombineFns or WindowFns. If you want help figuring out how to improve that path, please share more details about what is happening after the BQImportAndCompute.
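As a toy illustration of the CombineFn idea (the keys and values are made up): combining per key lets Beam perform partial aggregation on each worker before the shuffle, which keeps the intermediate data produced by a high fan-out small.

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.Create;
    import org.apache.beam.sdk.transforms.Sum;
    import org.apache.beam.sdk.values.KV;

    public class CombineInsteadOfFanout {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        // Toy stand-in for per-key values produced downstream of BQImportAndCompute.
        p.apply("CreateKeyedValues",
                Create.of(KV.of("2015", 3L), KV.of("2015", 7L), KV.of("2016", 5L)))
            // Partial sums are computed worker-side, then merged.
            .apply("SumPerKey", Sum.longsPerKey());

        p.run();
      }
    }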
Have you tried debugging with Stackdriver?
https://cloud.google.com/blog/big-data/2016/04/debugging-data-transformations-using-cloud-dataflow-and-stackdriver-debugger
I have installed the JMeter plugin for Jenkins and I can see the performance report based on the jtl file that my jmx file generates. In my jmx file, I have several Response Assertion listeners. Based on the jtl file it generates, some of the assertion results are false. However, when I look at the performance report in Jenkins, it shows 0 in the percentage of errors. My first question is: what constitutes an error in the performance report? And secondly, how can I see which of the Response Assertion listeners returned false in the performance report in Jenkins?
What constitutes an error in the performance report is a failed sample; the error percentage shown is the proportion of samples that failed.
To see what the error is, go to Last Build -> Performance Report. You should see a listing of all of the samples in your JMeter test plan and their results. Then select a sample name, and you should see the results for that sample.