I have added a delay of 4 seconds to an Amazon SQS queue, but the ApproximateNumberOfMessagesDelayed metric shows as zero. I am not sure why the delay is not showing up in the metrics.
But in the metrics it's not showing up.
In our streaming pipeline we read data from Pub/Sub, do some validations and then group it by a key in a session window with a 10 second gap. Afterwards the data is processed further and written to Bigtable and Pub/Sub again.
We're using Apache Beam 2.28 and the Dataflow Streaming Engine. During the day we process more data than overnight, and the pipeline scales up the number of workers (n2d-standard-4) automatically. Mostly it scales up from 2 workers to 4 or 5 to reduce the backlog. After that it scales down again, as the CPU utilization is too low for 4 or 5 workers.
It is at this point that the CPU utilization drops to nearly 0% for all workers and the entire pipeline starts lagging behind massively. The result is that the number of workers is scaled up to a higher number again and the pipeline resumes processing the data. After the backlog is reduced again, the number of workers is gradually lowered and the same issue arises.
[Image: metrics]
What we notice is that in the GroupByKey step, the input throughput stays more or less the same, but the output throughput drops to 0.
[Image: GroupByKey throughput]
I know GroupByKey can suffer from hot keys, but then I would expect the CPU utilization of 1 worker to be very high while the others have nothing to do.
Does anyone know what might be causing this issue?
The issue was caused by the combination of using the session window with a GroupByKey, how the watermark for an unbounded Pub/Sub source works, and when the acknowledgements are sent to Pub/Sub.
Our session window with a gap of 10 seconds sometimes didn't output any messages for a couple of minutes (due to no early trigger being configured and messages continuously arriving for the same key within the 10 second session gap). Because these steps are part of the first fused stage in the actual execution of our pipeline, this led to some messages not being acknowledged to Pub/Sub (the ack is only sent when the first fused stage is completed). The oldest unacknowledged message time on the subscription kept on rising, causing the watermark not to advance.
This issue became more pronounced because the acknowledgement deadline was set to 10 minutes. When the number of workers scaled down, this caused the issue described in the original question.
We were able to solve this by adding a Reshuffle before the creation of the session window (with the groupbykey) and decreasing the acknowledgement deadline.
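To illustrate why a session window can stay open for minutes, here is a minimal pure-Python sketch of session merging for a single key. It is not Beam code and not our pipeline; the function name and the sample traffic are hypothetical, and it only models the rule that events closer together than the gap extend the same session.

```python
from typing import List, Tuple


def session_windows(timestamps: List[float], gap: float) -> List[Tuple[float, float]]:
    """Merge event timestamps (in seconds) for ONE key into session windows.

    Each event opens a window [t, t + gap); overlapping windows are merged.
    """
    sessions: List[Tuple[float, float]] = []
    for t in sorted(timestamps):
        if sessions and t < sessions[-1][1]:
            # The event falls inside the open session, so the session is extended.
            start, end = sessions[-1]
            sessions[-1] = (start, max(end, t + gap))
        else:
            # The gap has elapsed, so a new session starts.
            sessions.append((t, t + gap))
    return sessions


# Hypothetical traffic: one message every 5 s for 3 minutes on the same key.
busy_key = [float(t) for t in range(0, 180, 5)]
# Hypothetical traffic: small bursts separated by more than the 10 s gap.
quiet_key = [0.0, 2.0, 30.0, 31.0, 60.0]

busy_sessions = session_windows(busy_key, gap=10.0)
quiet_sessions = session_windows(quiet_key, gap=10.0)

print(busy_sessions)   # a single session covering the whole 3 minutes
print(quiet_sessions)  # several short sessions
```

With the busy key, nothing can be emitted until the one long session closes, which without an early trigger is the situation that held back the acknowledgements in our case.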
https://cloud.google.com/blog/products/data-analytics/handling-duplicate-data-in-streaming-pipeline-using-pubsub-dataflow
https://cloud.google.com/dataflow/docs/guides/deploying-a-pipeline#fusion-optimization
I have a worker environment set up to run six cron jobs.
When checking CloudWatch I noticed that I'm receiving 57K emptyReceives/day.
I researched this and found that long polling can be used to counter this high number of emptyReceives.
So, I tried to reduce this by setting Receive Message Wait Time to 20s in the SQS console for the SQS queue of the worker environment.
But I'm still getting 57K emptyReceives/day.
I checked a 5 minute sample and I'm getting 200 emptyReceives.
This means a request every 1.5 seconds, right? So the setting is obviously not working.
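As a quick sanity check of that arithmetic, here is a small sketch. The helper name is made up for illustration, and the last figure assumes a single poller whose every empty receive waits the full 20 seconds, which may not match how the worker environment actually polls.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def seconds_between_requests(count: int, period_seconds: float) -> float:
    """Average spacing between requests, given a count over a period."""
    return period_seconds / count


# 200 empty receives in a 5 minute sample.
five_min = seconds_between_requests(200, 5 * 60)
# 57K empty receives per day.
per_day = seconds_between_requests(57_000, SECONDS_PER_DAY)

# Assumption: ONE idle poller, each empty receive blocking for the full 20 s.
idle_long_poll_per_day = SECONDS_PER_DAY // 20

print(f"{five_min:.2f} s between requests (5 minute sample)")
print(f"{per_day:.2f} s between requests (daily figure)")
print(f"{idle_long_poll_per_day} empty receives/day expected with 20 s long polling")
```

Both samples agree on roughly one request every 1.5 seconds, far more than a single 20 second long poller would produce.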
So is there any other setting I need to set before I can use long polling with the worker environment queue?
When I checked the tutorial, it says short polling occurs when:
The ReceiveMessage call sets WaitTimeSeconds to 0.
The ReceiveMessage call doesn’t set WaitTimeSeconds, but the queue attribute ReceiveMessageWaitTimeSeconds is set to 0.
From what I understood, the WaitTimeSeconds of the ReceiveMessage call must be 0 for this to happen in my case.
Is there any option with which I can change this?
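The two conditions quoted from the tutorial can be written down as a tiny model. This is only a sketch of the rule, not an AWS API call; the function and parameter names are hypothetical, with `None` standing for "the call did not set WaitTimeSeconds".

```python
from typing import Optional


def effective_wait_seconds(call_wait: Optional[int], queue_wait: int) -> int:
    """Wait time that applies to one ReceiveMessage call.

    call_wait:  WaitTimeSeconds passed on the call, or None if it was not set.
    queue_wait: the queue attribute ReceiveMessageWaitTimeSeconds.
    A value set on the call takes precedence over the queue attribute.
    """
    return queue_wait if call_wait is None else call_wait


def is_short_polling(call_wait: Optional[int], queue_wait: int) -> bool:
    """Short polling happens when the effective wait time is 0."""
    return effective_wait_seconds(call_wait, queue_wait) == 0


# Queue attribute set to 20 s, as in the question.
print(is_short_polling(call_wait=None, queue_wait=20))  # caller relies on the queue attribute
print(is_short_polling(call_wait=0, queue_wait=20))     # caller overrides it with 0
```

Under this model, a queue-level setting of 20 seconds has no effect if the consumer itself passes WaitTimeSeconds=0 on every call.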
We are testing Cloud Dataflow with a pipeline that pulls messages from a Pub/Sub subscription, converts the data to BigQuery TableRows and loads them into BigQuery as a load job every 1 min 30 sec.
We can see that the pipeline works well and can process 500,000 elements per second with 40 workers. But when trying autoscaling, the number of workers unexpectedly goes up to 40 and stays there, even if we send only 50,000 messages to Pub/Sub. In this situation, there are no unacknowledged messages and the workers' CPU utilization is below 60%. One thing we noticed is that the Dataflow system lag goes up slowly.
Does system lag affect autoscaling?
If so, are there any solutions or ways to debug this problem?
Does system lag affect autoscaling?
Google does not really expose the specifics of its autoscaling algorithm. Generally, though, it is based on CPU utilization, throughput and backlog. Since you're using Pub/Sub, backlog by itself should be based on the number of unacknowledged messages. Still, the rate at which these are being consumed (i.e. the throughput at the Pub/Sub read stage) is also taken into account. Now, throughput as a whole relates to the rate at which each stage processes input bytes. As for CPU utilization, if the aforementioned don't "run smoothly", 60% usage is already too high. So, system lag at some stage could be interpreted as the throughput of that stage and therefore should affect autoscaling. Then again, these two should not always be conflated. If, for example, a worker gets stuck due to a hot key, system lag is high but there's no autoscaling, as the work is not parallelizable. So, all in all, it depends.
If so, are there any solutions or ways to debug this problem?
The most important tools you have at hand are the execution graph, Stackdriver Logging and Stackdriver Monitoring. From monitoring, you should consider JVM, compute and Dataflow metrics. gcloud dataflow jobs describe can also be useful, mostly to see how steps are fused and, by extension, which steps run in the same worker, like so:
gcloud dataflow jobs describe --full $JOB_ID --format json | jq '.pipelineDescription.executionPipelineStage[] | {"stage_id": .id, "stage_name": .name, "fused_steps": .componentTransform }'
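If jq is not available, the same projection can be sketched in Python on the saved JSON output. The field names are the ones used in the command above; the function name and the trimmed sample job are hypothetical and only there to make the snippet self-contained.

```python
import json
from typing import Any, Dict, List


def fused_stages(job: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Project each execution stage to its id, name and fused steps."""
    stages = job.get("pipelineDescription", {}).get("executionPipelineStage", [])
    return [
        {
            "stage_id": stage.get("id"),
            "stage_name": stage.get("name"),
            "fused_steps": stage.get("componentTransform"),
        }
        for stage in stages
    ]


# Hypothetical, heavily trimmed stand-in for the output of
# `gcloud dataflow jobs describe --full $JOB_ID --format json`.
sample_job = json.loads("""
{
  "pipelineDescription": {
    "executionPipelineStage": [
      {
        "id": "S01",
        "name": "F12",
        "componentTransform": [
          {"userName": "ReadFromPubSub"},
          {"userName": "Validate"}
        ]
      }
    ]
  }
}
""")

result = fused_stages(sample_job)
print(json.dumps(result, indent=2))
```

In practice you would load the real command output, for example with `json.load(sys.stdin)`, instead of the sample.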
Stackdriver monitoring exposes all three of the main autoscaling components.
Now, how you're going to take advantage of the above obviously depends on the problem. In your case, at first glance I'd say that, if you can work without autoscaling and 40 workers, you should normally expect that you can do the same with autoscaling when you've set maxNumWorkers to 40. Then again, the number of messages alone does not tell the full story; their size and content also matter. I think you should start by analyzing your graph, check which step has the highest lag, see what the input/output ratio is and check for messages with severity>=WARNING in your logs. If you shared any of those here, maybe we could spot something more specific.
I am new to the Flink community and I am trying to do an experimental study to capture the performance of Flink for streaming data.
For this, I am trying to collect statistics of running jobs over a few hours. However, using Flink’s UI I can only see the statistics for the last 5 minutes.
I tried to hit the REST API, but it does not contain any statistics other than bytes read/written.
The metrics provided in the UI under Task Metrics are very helpful but do not go back beyond 5 minutes. Is there a way in which I can capture the entire history of metrics?
You can configure a metrics reporter to record all the metrics that you are interested in over any period of time.
Flink 1.3 comes with reporters for JMX, Ganglia, Graphite, StatsD, and Datadog. The documentation describes how to configure a reporter.
I'm using a Google Dataflow streaming pipeline with the default settings.
Thing is, it looks like the pipeline will start off at 1 worker, then scale down to 0 for 10-20 minutes, then up to 1 for 10-40 minutes, then back down.
This causes backups and surges in my Pub/Sub topics, and sets off alerts based on unacknowledged messages. I've adjusted the alerting to accommodate these surges, but it's still odd behavior.
If the traffic through Dataflow is sufficiently low, but not zero, is it expected that the workers will scale to 0 until there is a backlog of work to do?