I have a question regarding the reserved CPU time field in Google Dataflow. I don't understand why it varies so widely depending on the configuration of my run. I suspect that I am not interpreting the reserved CPU time for what it really is. To my understanding, it is the CPU time that was needed to complete the job I submitted, but based on the following evidence, it seems I may be mistaken. Is it the time that is allocated to your job, regardless of whether it is actually using the resources? If that's the case, how do I get the actual CPU time of my job?
First I ran my job with a variable sized pool of workers (max 24 workers).
The corresponding stats are as follows:
Then, I ran my script using a fixed number of workers (10):
And the stats changed to:
The reserved CPU time went from 15 days to 7 hours? How is that possible?!
Thanks!
If you hover over the "?" next to "Reserved CPU time", a pop-up will appear that reads: "The total time Dataflow was active on GCE instances, on a per-CPU basis." This indicates it is not the CPU time actually used by the VMs, but rather the time for which CPUs were allocated to the job. At this time Dataflow does not aggregate per-machine CPU usage stats; you may, however, be able to use the Cloud Monitoring API to extract those metrics yourself.
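For example, a rough, untested sketch of pulling per-instance CPU utilization with the Cloud Monitoring client library for Python could look like the following (the project ID and the one-hour window are placeholders, and you would still need to narrow the results down to your job's workers, e.g. by instance name):

import time
from google.cloud import monitoring_v3

# Sketch only: assumes the google-cloud-monitoring package is installed
# and application default credentials are configured.
project_id = "your-project-id"  # placeholder
client = monitoring_v3.MetricServiceClient()

now = int(time.time())
interval = monitoring_v3.TimeInterval(
    end_time={"seconds": now},
    start_time={"seconds": now - 3600},  # look back one hour
)

results = client.list_time_series(
    request={
        "name": f"projects/{project_id}",
        "filter": 'metric.type = "compute.googleapis.com/instance/cpu/utilization"',
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)

for series in results:
    # Each series corresponds to one VM; points carry fractional CPU usage.
    name = series.metric.labels["instance_name"]
    for point in series.points:
        print(name, point.value.double_value)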
I want to trigger a procedure in a Snowflake warehouse to load a file from Azure Blob Storage. For that, I have implemented a Snowflake connector as an Azure Function running on the Consumption (dynamic) plan. The Consumption plan has a default timeout of 5 minutes and a maximum timeout of 10 minutes, but my data is around 50 GB and the load takes about 20 minutes with a medium-size Snowflake cluster. Is there any other way to achieve this?
If you want to work around this limitation, you have several options.
First, you can set up a timer trigger to wake the function up before it times out. This timer trigger is periodic, and its period should be shorter than your function's timeout.
Second, because the timeout limit comes from the hosting plan, you can switch to a different plan.
In a serverless Consumption plan, the valid range is from 1 second to 10 minutes, and the default value is 5 minutes.
In the Premium plan, the valid range is from 1 second to 60 minutes, and the default value is 30 minutes.
In a Dedicated (App Service) plan, there is no overall limit, and the default value is 30 minutes. A value of -1 indicates unbounded execution, but keeping a fixed upper bound is recommended.
Related documents: https://learn.microsoft.com/en-us/azure/azure-functions/functions-host-json#functiontimeout
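For reference, the timeout is configured via the functionTimeout setting in host.json; a minimal example raising it to the Consumption-plan maximum looks roughly like this:

{
  "version": "2.0",
  "functionTimeout": "00:10:00"
}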
Third, use Durable Functions. Under the Consumption plan, an ordinary out-of-the-box function runs for at most 10 minutes, but with Durable Functions the orchestration as a whole is not bound by that limit. Durable Functions also add support for stateful execution, which means state can be preserved across the calls that make up an orchestration. This is an extension of the normal out-of-the-box function model, and it requires some additional boilerplate code to make all functions work as expected.
More details about Durable Functions: https://learn.microsoft.com/en-us/azure/azure-functions/durable/durable-functions-overview?tabs=csharp
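As a rough, untested sketch of the orchestrator pattern in Python (the activity name load_blob_to_snowflake is hypothetical and would need its own implementation and bindings):

import azure.durable_functions as df

def orchestrator_function(context: df.DurableOrchestrationContext):
    # Receive the blob path as orchestration input and delegate the
    # long-running Snowflake load to an activity function.
    blob_path = context.get_input()
    result = yield context.call_activity("load_blob_to_snowflake", blob_path)
    return result

main = df.Orchestrator.create(orchestrator_function)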
We are testing Cloud Dataflow, which pulls messages from a Pub/Sub subscription, converts the data to BigQuery TableRows, and loads them into BigQuery as a load job every 1 minute 30 seconds.
We can see the pipeline works well and can process 500,000 elements per second with 40 workers. But when we try autoscaling, the number of workers unexpectedly goes up to 40 and stays there even if we send only 50,000 messages to Pub/Sub. In this situation, there are no unacknowledged messages and the workers' CPU utilization is below 60%. One thing we noticed is that the Dataflow system lag goes up slowly.
Does system lag affect autoscaling?
If so, are there any solutions or ways to debug this problem?
Does system lag affect autoscaling?
Google does not really expose the specifics of its autoscaling algorithm. Generally, though, it is based on CPU utilization, throughput and backlog. Since you're using Pub/Sub, backlog by itself should be based on the number of unacknowledged messages. Still, the rate at which these are being consumed (i.e. the throughput at the Pub/Sub read stage) is also taken into account. Now, throughput as a whole relates to the rate at which each stage processes input bytes. As for CPU utilization, if the aforementioned don't "run smoothly", 60% usage is already too high. So, system lag at some stage could be interpreted as low throughput at that stage and therefore should affect autoscaling. Then again, the two should not always be conflated. If, for example, a worker gets stuck due to a hot key, system lag is high but there is no autoscaling, as the work is not parallelizable. So, all in all, it depends.
If so, are there any solutions or ways to debug this problem?
The most important tools you have at hand are the execution graph, Stackdriver Logging and Stackdriver Monitoring. From monitoring, you should consider the JVM, Compute and Dataflow metrics. gcloud dataflow jobs describe can also be useful, mostly to see how steps are fused and, by extension, which steps run in the same worker, like so:
gcloud dataflow jobs describe --full $JOB_ID --format json | jq '.pipelineDescription.executionPipelineStage[] | {"stage_id": .id, "stage_name": .name, "fused_steps": .componentTransform }'
Stackdriver monitoring exposes all three of the main autoscaling components.
Now, how you're going to take advantage of the above obviously depends on the problem. In your case, at first glance I'd say that, if you can work without autoscaling and 40 workers, you should normally expect to be able to do the same with autoscaling when you've set maxNumWorkers to 40. Then again, the number of messages alone does not tell the full story; their size/content also matters. I think you should start by analyzing your graph, check which step has the highest lag, see what the input/output ratio is, and check for messages with severity>=WARNING in your logs. If you shared any of those here, maybe we could spot something more specific.
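For the logs, something along these lines should surface the relevant entries (replace <JOB_ID> with your job's ID):

gcloud logging read 'resource.type="dataflow_step" AND resource.labels.job_id="<JOB_ID>" AND severity>=WARNING' --limit 50 --format json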
I'm curious if anyone can point me towards greater visibility into how various Beam Runners manage autoscaling. We seem to be experiencing hiccups during both the 'spin up' and 'spin down' phases, and we're left wondering what to do about it. Here's the background of our particular flow:
1- Binary files arrive on gs://, and object notification duly notifies a PubSub topic.
2- Each file requires about 1Min of parsing on a standard VM to emit about 30K records to downstream areas of the Beam DAG.
3- 'Downstream' components include things like inserts to BigQuery, storage in GS:, and various sundry other tasks.
4- The files in step 1 arrive intermittently, usually in batches of 200-300 every hour, making this - we think - an ideal use case for autoscaling.
What we're seeing, however, has us a little perplexed:
1- It looks like when 'workers=1', Beam bites off a little more than it can chew, eventually causing some out-of-RAM errors, presumably as the first worker tries to process a few of the PubSub messages which, again, take about 60 seconds/message to complete because the 'message' in this case is that a binary file needs to be deserialized in gs.
2- At some point, the runner (in this case, Dataflow with jobId 2017-11-12_20_59_12-8830128066306583836) gets the message that additional workers are needed and real work can now get done. During this phase, errors decrease and throughput rises. Not only are there more deserializers for step 1, but the step 3/downstream tasks are evenly spread out.
3- Alas, the previous step gets cut short when Dataflow senses (I'm guessing) that enough of the PubSub messages are 'in flight' to begin cooling down a little. That seems to come a little too soon, and workers are getting pulled while they are still chewing through the PubSub messages themselves - even before the messages are ACK'd.
We're still thrilled with Beam, but I'm guessing the less-than-optimal spin-up/spin-down phases are resulting in 50% more VM usage than what is needed. What do the runners look for besides PubSub consumption? Do they look at RAM/CPU/etc.? Is there anything a developer can do, besides ACKing a PubSub message, to provide feedback to the runner that more/fewer resources are required?
Incidentally, in case anyone doubted Google's commitment to open-source, I spoke about this very topic with an employee there yesterday, and she expressed interest in hearing about my use case, especially if it ran on a non-Dataflow runner! We hadn't yet tried our Beam work on Spark (or elsewhere), but would obviously be interested in hearing if one runner has superior abilities to accept feedback from the workers for THROUGHPUT_BASED work.
Thanks in advance,
Peter
CTO,
ATS, Inc.
Generally, streaming autoscaling in Dataflow works like this:
Upscale: If the pipeline's backlog is more than a few seconds based on current throughput, the pipeline is upscaled. Here, CPU utilization does not directly affect the amount of upscaling. Knowing the CPU usage (say it is at 90%) does not help answer the question 'how many more workers are required?'. CPU does have an indirect effect, since pipelines fall behind when they don't have enough CPU, which increases the backlog.
Downscale: When the backlog is low (i.e. < 10 seconds), the pipeline is downscaled based on current CPU consumption. Here, CPU does directly influence the downscaling.
I hope the above basic description helps.
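If you want to bound or disable this behavior yourself, the main knobs available to the pipeline author are the autoscaling algorithm and the worker cap. A minimal sketch with the Beam Python SDK's standard Dataflow options (project and region values are placeholders):

from apache_beam.options.pipeline_options import PipelineOptions

# Cap streaming autoscaling at 10 workers; to pin the worker count instead,
# use --autoscaling_algorithm=NONE together with --num_workers.
options = PipelineOptions([
    "--runner=DataflowRunner",
    "--project=your-project-id",
    "--region=us-central1",
    "--streaming",
    "--autoscaling_algorithm=THROUGHPUT_BASED",
    "--max_num_workers=10",
])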
Due to the inherent delays involved in starting up new GCE VMs, the pipeline pauses for a minute or two during resizing events. This is expected to improve in the near future.
I will ask specific questions about the job you mentioned in the description.
This question is really about the data schema. I have a program which has a bunch of discrete events, and I want to get beautiful graphs out.
From my knowledge, I understand that I should really keep a counter of the number of events that have occurred, and on a regular interval, transfer that cumulative counter to the TSDB (as part of a cron job or similar).
What I currently have, though, is a system where the monitor, on a regular interval, tells the TSDB how many events occurred during that interval (a fixed, hard-coded value!).
Which of these two design patterns is better? What are the factors that affect that decision? Do I have a counter value here or is it just a measurement?
I have various concerns, including but not limited to the efficiency of the monitoring tool.
You tagged the question with InfluxDB but it seems like what you are really asking about is the collection agent. For that I would look at Telegraf.
StatsD is also a really great lightweight API that is available for most major languages now, from which you can efficiently emit different types of stats (counters, timings, etc); either for every event or at a sample rate you define.
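For instance, a minimal sketch with the Python statsd client (host, port and metric names here are arbitrary):

import statsd

# Emit a counter increment per discrete event plus a per-event timing;
# the backend (e.g. Telegraf's statsd input) aggregates them per interval.
client = statsd.StatsClient("localhost", 8125)

def handle_event(duration_ms):
    client.incr("myapp.events.processed")
    client.timing("myapp.events.duration", duration_ms)

handle_event(42)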
I implemented a solution that gathers metrics emitted from my app using StatsD, metrics that were pulled (JMX queries), and basic host-level stats you get for free with Telegraf. Every host (30+) runs a single Telegraf instance which delivers its stats to a centralized InfluxDB server on some interval (e.g. 30 seconds).
So with an approach like that you get a good balance of performance and data precision.
We are storing metrics with the build number in the metric name. Here is the format of the metric in Graphite.
latency.<host>.<request>.<buildNumber>.average
The issue with the above format is that buildNumber is an ever-changing value; in our case it changes every week because of the release cycle. This results in a new storage file (.wsp) every week, and since Whisper allocates space upfront, we never fully utilize that space because of the changing build number.
I know disk space is a cheap resource, but still, at some point I think we will have a lot of unused space.
For example, if each metric file is 10 MB and we are sending 5,000 different latency metrics, then for a particular build number we will use up 50 GB. Now, if every week we send a new build number, then 1 TB of disk space will be filled in 20 weeks, which is roughly 5 months (1 TB = 1000 GB; 1000 GB / 50 GB per week = 20 weeks).
The above problem could be solved if we could aggregate multiple metrics into one after, say, the last month. Is there any way of specifying a retention policy where multiple metrics are merged into one using some aggregation method?
Or is there any other way of tackling this kind of problem in Graphite?
If you use the Ceres storage engine for Graphite instead of using Whisper, you will avoid the problems of pre-allocation of space. https://github.com/graphite-project/ceres
I don't believe you can, during downsampling, merge multiple metrics with a specified aggregation. However, you can do this at the point of ingestion via aggregation-rules.conf. Documentation can be found here: http://graphite.readthedocs.org/en/latest/config-carbon.html#aggregation-rules-conf
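As an illustration of the latter (this only mirrors the documented syntax; 'all' is just a placeholder chosen here to collapse the build dimension), an aggregation-rules.conf entry could look roughly like this:

# Average all per-build latency series into a single series every 60 seconds
# at ingestion time; the * matches the buildNumber component.
latency.<host>.<request>.all.average (60) = avg latency.<host>.<request>.*.average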