I have a Dataflow job that communicates with external resources. The problem is that these external resources are slower than the Dataflow job, so they are always saturated. I need some way to reduce the number of messages read from PubSub, or otherwise reduce the throughput of the job, in order to reduce the traffic to the external resources.
Thanks.
We currently do not support throttling primitives (such as "make sure this DoFn is throttled to at most X calls per second over the whole job"), however we know it is an important use case and it will most likely be supported sooner or later.
Meanwhile your best bet is, as Ryan said, to limit the number of workers and worker threads: specify --numWorkers (or --maxNumWorkers if you are using autoscaling) and --numberOfWorkerHarnessThreads. However, note that this will lead to creating a backlog of input messages, rather than dropping them. It is hard to tell which is better in your use case.
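Until such a primitive exists, one workaround is to throttle inside the code that calls the external resource. The following is only a minimal sketch of a token-bucket limiter in Python, not a Dataflow or Beam API: the class name `TokenBucket` and its parameters are hypothetical, it is not thread-safe, and it limits a single process only, so a job-wide limit of X calls per second would have to be divided by the number of workers and threads.

```python
import time


class TokenBucket:
    """Hypothetical per-process limiter: at most `rate` calls per second,
    with bursts of up to `capacity` calls. Not thread-safe."""

    def __init__(self, rate, capacity, clock=time.monotonic, sleep=time.sleep):
        self.rate = float(rate)
        self.capacity = float(capacity)
        self.tokens = float(capacity)  # start with a full bucket
        self.clock = clock
        self.sleep = sleep
        self.last = clock()

    def _refill(self):
        # Add tokens for the time elapsed since the last refill, capped at capacity.
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now

    def acquire(self):
        """Block until one token is available, then consume it."""
        while True:
            self._refill()
            if self.tokens >= 1.0:
                self.tokens -= 1.0
                return
            # Sleep just long enough for the missing fraction of a token to accrue.
            self.sleep((1.0 - self.tokens) / self.rate)
```

Calling `acquire()` immediately before each request to the external resource makes the caller block instead of flooding it. As with limiting workers, this builds up a backlog rather than dropping messages.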
I'm curious if anyone can point me towards greater visibility into how various Beam Runners manage autoscaling. We seem to be experiencing hiccups during both the 'spin up' and 'spin down' phases, and we're left wondering what to do about it. Here's the background of our particular flow:
1- Binary files arrive on gs://, and object notification duly notifies a PubSub topic.
2- Each file requires about 1Min of parsing on a standard VM to emit about 30K records to downstream areas of the Beam DAG.
3- 'Downstream' components include things like inserts to BigQuery, storage in GS:, and various sundry other tasks.
4- The files in step 1 arrive intermittently, usually in batches of 200-300 every hour, making this - we think - an ideal use case for autoscaling.
What we're seeing, however, has us a little perplexed:
1- It looks like when 'workers=1', Beam bites off a little more than it can chew, eventually causing some out-of-RAM errors, presumably as the first worker tries to process a few of the PubSub messages which, again, take about 60 seconds/message to complete because the 'message' in this case is that a binary file needs to be deserialized in gs.
2- At some point, the runner (in this case, Dataflow with jobId 2017-11-12_20_59_12-8830128066306583836) gets the message that additional workers are needed, and real work can now get done. During this phase, errors decrease and throughput rises. Not only are there more deserializers for step 1, but the step 3/downstream tasks are evenly spread out.
3- Alas, the previous step gets cut short when Dataflow senses (I'm guessing) that enough of the PubSub messages are 'in flight' to begin cooling down a little. That seems to come a little too soon, and workers are getting pulled as they chew through the PubSub messages themselves, even before the messages are 'ACK'd'.
We're still thrilled with Beam, but I'm guessing the less-than-optimal spin-up/spin-down phases are resulting in 50% more VM usage than what is needed. What do the runners look for besides PubSub consumption? Do they look at RAM, CPU, etc.? Is there anything a developer can do, besides ACKing a PubSub message, to provide feedback to the runner that more or fewer resources are required?
Incidentally, in case anyone doubted Google's commitment to open-source, I spoke about this very topic with an employee there yesterday, and she expressed interest in hearing about my use case, especially if it ran on a non-Dataflow runner! We hadn't yet tried our Beam work on Spark (or elsewhere), but would obviously be interested in hearing if one runner has superior abilities to accept feedback from the workers for THROUGHPUT_BASED work.
Thanks in advance,
Peter
CTO,
ATS, Inc.
Generally, streaming autoscaling in Dataflow works like this:
Upscale: If the pipeline's backlog is more than a few seconds based on current throughput, the pipeline is upscaled. Here CPU utilization does not directly affect the amount of upsizing: knowing that CPU is at, say, 90% does not help answer the question 'how many more workers are required?'. CPU does have an indirect effect, since pipelines fall behind when they don't have enough CPU, which increases the backlog.
Downscale: When the backlog is low (i.e. < 10 seconds), the pipeline is downscaled based on current CPU consumption. Here, CPU does directly influence the amount of downsizing.
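To make the two rules concrete, here is a rough Python sketch of such a decision function. It is not Dataflow's actual algorithm: the function name, the 20-second upscale threshold, the 75% CPU target and both sizing formulas are assumptions made up for illustration. Only the overall shape (backlog drives upscaling, CPU drives downscaling once the backlog is under 10 seconds) comes from the description above.

```python
import math


def desired_workers(current_workers, backlog_seconds, cpu_utilization,
                    max_workers, upscale_backlog_s=20.0,
                    downscale_backlog_s=10.0, target_cpu=0.75):
    """Hypothetical autoscaling heuristic; all thresholds are assumed values."""
    if backlog_seconds > upscale_backlog_s:
        # Upscale: sized from the backlog, not from CPU (assumed formula).
        wanted = math.ceil(current_workers * backlog_seconds / upscale_backlog_s)
        return min(max_workers, max(current_workers + 1, wanted))
    if backlog_seconds < downscale_backlog_s:
        # Downscale: sized from CPU so the remaining workers run near target_cpu.
        wanted = math.ceil(current_workers * cpu_utilization / target_cpu)
        return max(1, min(current_workers, wanted))
    # Backlog between the two thresholds: keep the current size.
    return current_workers
```

In this sketch a job with a small backlog but high CPU keeps its workers, which matches the point that CPU only influences the downscaling side directly.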
I hope the above basic description helps.
Due to inherent delays involved in starting up new GCE VMs, the pipeline pauses for a minute or two during resizing events. This is expected to improve in the near future.
I will ask specific questions about the job you mentioned in the description.
We're building a web-app where users will be uploading potentially large files that will need to be processed in the background. The task involves calling 3rd-party APIs, so each job can take several hours to complete. We're using DelayedJob to run the background jobs. With every user kicking off a background job, each of which will take a few hours to finish, that will add up to a lot of background jobs very quickly. I am wondering what would be the best way to set up the deployment for this. We're currently hosted on DigitalOcean. I've kicked off 10 DelayedJob workers. Each one (when idle) takes up 157 MB. When actively running, it uses around 900 MB. Our user base right now is pretty small so it's not an issue, but it will be one soon. So on a 4 GB droplet, I can probably run 2 or 3 workers at a time. How should we approach this issue? Should we be looking at using DigitalOcean's API to auto-spin cheap droplets on demand? Should we subscribe to high-memory droplets on a monthly basis instead? If we go with auto-spinning droplets, should we stick with DigitalOcean or would Heroku make more sense? Or is the entire approach wrong, and should we be approaching it from an entirely different direction? Any help/advice would be very much appreciated.
Thanks!
It sounds like you are limited by memory on the number of workers that you can run on your DigitalOcean host.
If you are worried about scaling, I would focus on making the workers as efficient as possible. Have you done any benchmarking to understand where the 900 MB of memory is being allocated? I'm not sure what the nature of these jobs is, but you mentioned large files. Are you reading the contents of these files into memory, or are you streaming them? Are you using a database with SQL you can tune? Are you making many small API calls when you could be using a batch endpoint? Are you assigning intermediary variables that must then be garbage collected? Can you compress the files before you send them?
Look at the job structure itself. I've found that background jobs work best with many smaller jobs rather than one larger job. This allows execution to happen in parallel, and be more load balanced across all workers. You could even have a job that generates other jobs. If you need a job to orchestrate callbacks when a group of jobs finishes there is a DelayedJobGroup plugin at https://github.com/salsify/delayed_job_groups_plugin that allows you to invoke a final job only after the sibling jobs complete. I would aim for an execution time of a single job to be under 30 seconds. This is arbitrary but it illustrates what I mean by smaller jobs.
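The "many smaller jobs plus a final job" shape can be sketched independently of DelayedJob. The Python below is only an illustration of the pattern, not the DelayedJobGroup plugin's API: `process_chunk` stands in for whatever a real small job would do, and all function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(items, chunk_size):
    """Cut one large unit of work into small, independently runnable pieces."""
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]


def process_chunk(chunk):
    # Hypothetical small job: here it just counts characters in its lines.
    return sum(len(line) for line in chunk)


def run_job_group(items, chunk_size, workers, on_complete):
    """Run the small jobs in parallel, then invoke the final job exactly once."""
    chunks = split_into_chunks(items, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves chunk order; list() waits for every sibling job.
        results = list(pool.map(process_chunk, chunks))
    return on_complete(results)
```

For example, `run_job_group(lines, 100, 4, sum)` would process the lines 100 at a time on four workers and only then combine the partial results.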
Some hosting providers like Amazon provide spot instances where you can pay a lower price for servers that do not have guaranteed availability. These pair well with the many smaller jobs approach I mentioned earlier.
Ruby also might not be the right tool for the job. There are faster languages, and if you are limited by memory or CPU, you might consider writing these jobs and their workers in another language like JavaScript, Go or Rust. These can pair well with a Ruby stack while offloading computationally expensive subroutines to faster languages.
Finally, like many scaling issues, if you have more money than time, you can always throw more hardware at it. At least for a while.
I think memory and time are your main problems. Consider using the sidekiq gem for this, because it will need less time and memory to do the same job; it uses Redis, a key-value store, as its database. If the problem continues, go with JavaScript.
I have a cloud streaming pipeline that reads from PubSubIO and whose "PipelineOptions" are set with "WorkerMachineType = n1-standard-1". This machine has 3.75 GB of memory.
My problem is that if the subscription has a lot of messages, the pipeline reads really fast, and when it starts to process many elements it doesn't have enough memory.
Is there any way to reduce the number of messages read per second? Or is the memory consumption related to the duration assigned to the window, so that I should reduce that duration?
Thanks in advance.
It sounds like you may be trying to process too much data with too few workers. We are looking at addressing this and related scenarios, but in the meantime you may want to try dialing down the amount of data you're ingesting, or increasing the number of workers available to the jobs.
You'll also get better performance with n1-standard-4 machines, which is why we make those the default for the streaming runner.
Would NServiceBus or an equivalent ESB be appropriate for an application that has a bunch of different kinds of background maintenance-type tasks? For example:
Scanning databases for the occurrence of certain words in user-generated content
Updating database tables that store the results of relatively expensive queries
Creating/maintaining external indexes for content
Sending event notification emails for a scheduled event.
My idea is to employ some kind of task scheduler (either the Windows built-in one, Quartz.NET, or my own database-based solution) to publish different kinds of messages onto the bus periodically. The period may be as short as one minute or as long as a few days. The reason I want to use the bus is so that I can scale out the number of subscribers as the system becomes larger and busier and the tasks become either more frequent or more resource-intensive. It would also provide redundancy as long as I always have at least two subscribers running.
The obvious alternative to this would be to write my own Windows Service that is triggered by the scheduler and performs the work, but I feel like making that scale beyond a single machine and provide fault tolerance might be more difficult than using the ESB as that plumbing.
Does this sound like a reasonable approach? Alternative suggestions?
TIA
As the author of NServiceBus, I'm quite probably biased, but there is a tradeoff between learning a new technology and writing (possibly a simpler version of) your own. I would recommend considering the longer-term maintenance (and documentation) costs of your own solution as compared to one not written in house.
In terms of the feature-set you described, NServiceBus does provide facilities for all of that.
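The pattern in the question (a scheduler publishes task messages and several subscribers compete for them) can be sketched without any particular bus. The Python below is a single-process illustration using only the standard library; it is not NServiceBus code, and the function name and message strings are hypothetical. A real bus would add durability, retries and distribution across machines.

```python
import queue
import threading


def run_competing_consumers(messages, subscriber_count, handle):
    """Deliver each message to exactly one of several subscriber threads."""
    bus = queue.Queue()      # stands in for the bus / message queue
    handled = []             # (subscriber name, result) pairs
    lock = threading.Lock()

    def subscriber(name):
        while True:
            msg = bus.get()
            if msg is None:  # sentinel: no more work for this subscriber
                return
            result = handle(msg)
            with lock:
                handled.append((name, result))

    threads = [threading.Thread(target=subscriber, args=(f"sub-{i}",))
               for i in range(subscriber_count)]
    for t in threads:
        t.start()
    for msg in messages:     # the "scheduler" publishing tasks
        bus.put(msg)
    for _ in threads:        # one shutdown sentinel per subscriber
        bus.put(None)
    for t in threads:
        t.join()
    return handled
```

Adding subscribers raises throughput without changing the publisher, which is the scaling property the question is after.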
Does anybody know if there is a sort of 'load-balancer' in the Erlang standard library? I mean, if I have some really simple operations on a really large set of data, the overhead of constructing a process for every item will be larger than performing the operation sequentially. But if I can balance the work across the 'right number' of processes, it will perform better, so I'm basically asking if there is an easy way to accomplish this task.
By the way, does anybody know if an OTP application does some kind of load balancing? I mean, in an OTP application, is there the concept of a "worker process" (like a Java-ish worker thread)?
See modules pg2 and pool.
pg2 implements quite simple distributed process pool. pg2:get_closest_pid/1 returns "closest" pid, i.e. random local process if available, otherwise random remote process.
pool implements load balancing between nodes started with module slave.
The plists module probably does what you want. It is basically a parallel implementation of the lists module, designed to be used as a drop-in replacement. However, you can also control how it parallelizes its operations, for example by defining how many worker processes should be spawned.
You probably would do it by calculating some number of workers depending on the length of the list or the load of the system etc.
From the website:
plists is a drop-in replacement for the Erlang module lists, making most list operations parallel. It can operate on each element in parallel, for IO-bound operations, on sublists in parallel, for taking advantage of multi-core machines with CPU-bound operations, and across erlang nodes, for parallizing inside a cluster. It handles errors and node failures. It can be configured, tuned, and tweaked to get optimal performance while minimizing overhead.
There is, in my view, no useful generic load-balancing tool in OTP, and perhaps one is only useful in specific cases. It is easy enough to implement one yourself. plists may be useful in the same cases. I do not believe in parallel libraries as a substitute for the real thing. Amdahl will haunt you forever if you walk this path.
The right number of worker processes is equal to the number of schedulers. This may vary depending on what other work is done on the system. Use
erlang:system_info(schedulers_online) -> NS
to get the number of schedulers.
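The idea of one worker per scheduler, each taking a slice of the data, looks roughly like this. The sketch is in Python rather than Erlang and is only an analogy: `os.cpu_count()` plays the role of the scheduler count, and `partition` and `parallel_map` are hypothetical helper names. Python threads will not speed up CPU-bound work, so the sketch shows the partitioning, not the performance.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def partition(items, parts):
    """Split items into at most `parts` contiguous, nearly equal slices."""
    parts = max(1, min(parts, len(items)))
    size, extra = divmod(len(items), parts)
    slices, start = [], 0
    for i in range(parts):
        # The first `extra` slices take one additional element each.
        end = start + size + (1 if i < extra else 0)
        slices.append(items[start:end])
        start = end
    return slices


def parallel_map(func, items, workers=None):
    """Map func over items using one task per slice instead of one per item."""
    workers = workers or os.cpu_count() or 1
    if not items:
        return []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(lambda chunk: [func(x) for x in chunk],
                               partition(items, workers)))
    # Flatten the per-slice results back into one list, preserving order.
    return [y for chunk in mapped for y in chunk]
```

Spawning one task per slice rather than per element keeps the per-task overhead small relative to the work each task does.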
The notion of overhead when flooding the system with an abundance of worker processes is somewhat faulty. There is overhead with new processes, but not as much as with OS threads. The main overhead is message copying between processes; this can be alleviated with the use of binaries, since only the reference to the binary is sent. With eterms the structure is first expanded and then copied to the other process.
There is no way to predict the cost of work mechanically without measuring it, i.e. actually doing it. Someone must determine how to partition the work for a given class of tasks. By 'load balancer' I understand something very different from what your question describes.