Daemon eats too much CPU when idle - ruby-on-rails

I am using the blue-daemons fork of the daemons gem (since the latter looks totally abandoned) along with the daemons-rails gem, which wraps daemons for Rails.
The problem is that my daemon eats too much CPU when it's idle (10-20 times more than when it's actually performing the job).
By idle, I mean that I have a special flag, Status.active?. If Status.active? is true, I perform the job; if it's false, I just sleep 10 seconds, iterate the next step of the while($running) do block, and check the status again and again.
I don't want to hard-stop the job because it handles really sensitive data and I don't want the process to corrupt it. Is there any good way to handle that high CPU usage? I tried Sidekiq, but it looks like its primary aim is to run jobs on demand or on a schedule, whereas I need the daemon to run on a non-stop basis.
$running = true
Signal.trap("TERM") do
  $running = false
end

while $running
  if Status.active?
    # ..... DO LOTS OF WORK .....
  else
    sleep 10
  end
end
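For what it's worth, a minimal variant of that loop (a sketch only, not a confirmed fix for the CPU usage) keeps the 10-second idle interval but sleeps in 1-second slices, so the idle branch stays essentially free of CPU while a TERM sent during the idle period is handled within a second:

$running = true
Signal.trap("TERM") { $running = false }

while $running
  if Status.active?
    # ..... DO LOTS OF WORK .....
  else
    # Idle: sleep in 1-second slices instead of one 10-second block,
    # so shutdown is noticed quickly and the loop body barely runs.
    10.times do
      break unless $running
      sleep 1
    end
  end
end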

Related

Sidekiq - Enqueuing a job to be performed 0.seconds from now

I'm using Sidekiq for background jobs and I enqueue a job like this:
SampleJob.set(wait: waiting_time.to_i.seconds).perform_later(***) ・・・ ①
When waiting_time is nil,
it becomes
SampleJob.set(wait: 0.seconds).perform_later(***)
Of course it works well, but I'm worried about performance, because a job enqueued with a wait argument goes through the scheduled set and is only picked up by the poller,
so I wonder if I should drop set(wait: waiting_time.to_i.seconds) when
waiting_time is nil,
i.e.:
if waiting_time.present?
  SampleJob.set(wait: waiting_time.to_i.seconds).perform_later(***)
else
  SampleJob.perform_later(***)
end ・・・ ②
Is there any differences in performance or speed between ① and ②?
Thank you in advance.
There is no difference. It looks like this is already considered in the Sidekiq library.
https://github.com/mperham/sidekiq/blob/main/lib/sidekiq/worker.rb#L261
# Optimization to enqueue something now that is scheduled to go out now or in the past
opts["at"] = ts if ts > now
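In other words (a simplified sketch of that check, not Sidekiq's literal code): with wait: 0.seconds the computed timestamp is not in the future, so no "at" option is stored and the job is pushed straight onto the queue, exactly as in ②:

# Simplified sketch of the scheduling check quoted above (illustrative only).
ts  = Time.now.to_f + waiting_time.to_i   # waiting_time of nil => 0 seconds
now = Time.now.to_f
opts = {}
opts["at"] = ts if ts > now               # omitted when ts <= now, so the poller is never involved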

Can I make flex template jobs take less than 10 minutes before they start to process data?

I am using terraform resource google_dataflow_flex_template_job to deploy a Dataflow flex template job.
resource "google_dataflow_flex_template_job" "streaming_beam" {
provider = google-beta
name = "streaming-beam"
container_spec_gcs_path = module.streaming_beam_flex_template_file[0].fully_qualified_path
parameters = {
"input_subscription" = google_pubsub_subscription.ratings[0].id
"output_table" = "${var.project}:beam_samples.streaming_beam_sql"
"service_account_email" = data.terraform_remote_state.state.outputs.sa.email
"network" = google_compute_network.network.name
"subnetwork" = "regions/${google_compute_subnetwork.subnet.region}/subnetworks/${google_compute_subnetwork.subnet.name}"
}
}
It's all working fine; however, without my requesting it, the job seems to be using Flexible Resource Scheduling (FlexRS) mode. I say this because the job takes about ten minutes to start, and during that time it has state=QUEUED, which I think only applies to FlexRS jobs.
Using FlexRS mode is fine for production scenarios, but I'm currently still developing my Dataflow job, and while doing so FlexRS is massively inconvenient because it takes about 10 minutes to see the effect of any change I make, no matter how small.
In Enabling FlexRS it is stated
To enable a FlexRS job, use the following pipeline option:
--flexRSGoal=COST_OPTIMIZED, where the cost-optimized goal means that the Dataflow service chooses any available discounted resources or
--flexRSGoal=SPEED_OPTIMIZED, where it optimizes for lower execution time.
I then found the following statement:
To turn on FlexRS, you must specify the value COST_OPTIMIZED to allow the Dataflow service to choose any available discounted resources.
at Specifying pipeline execution parameters > Setting other Cloud Dataflow pipeline options
I interpret that to mean that flexrs_goal=SPEED_OPTIMIZED will turn off FlexRS mode, so I changed the definition of my google_dataflow_flex_template_job resource to:
resource "google_dataflow_flex_template_job" "streaming_beam" {
provider = google-beta
name = "streaming-beam"
container_spec_gcs_path = module.streaming_beam_flex_template_file[0].fully_qualified_path
parameters = {
"input_subscription" = google_pubsub_subscription.ratings[0].id
"output_table" = "${var.project}:beam_samples.streaming_beam_sql"
"service_account_email" = data.terraform_remote_state.state.outputs.sa.email
"network" = google_compute_network.network.name
"subnetwork" = "regions/${google_compute_subnetwork.subnet.region}/subnetworks/${google_compute_subnetwork.subnet.name}"
"flexrs_goal" = "SPEED_OPTIMIZED"
}
}
(note the addition of "flexrs_goal" = "SPEED_OPTIMIZED"), but it doesn't seem to make any difference. The Dataflow UI confirms I have set SPEED_OPTIMIZED, yet it still takes too long (9 minutes 46 seconds) for the job to start processing data, and it was in state=QUEUED for all that time:
2021-01-17 19:49:19.021 GMT  Starting GCE instance, launcher-2021011711491611239867327455334861, to launch the template.
...
...
2021-01-17 19:59:05.381 GMT  Starting 1 workers in europe-west1-d...
2021-01-17 19:59:12.256 GMT  VM, launcher-2021011711491611239867327455334861, stopped.
I then tried explicitly setting flexrs_goal=COST_OPTIMIZED just to see if it made any difference, but this only caused an error:
"The workflow could not be created. Causes: The workflow could not be
created due to misconfiguration. The experimental feature
flexible_resource_scheduling is not supported for streaming jobs.
Contact Google Cloud Support for further help. "
This makes sense. My job is indeed a streaming job and the documentation does indeed state that flexRS is only for batch jobs.
This page explains how to enable Flexible Resource Scheduling (FlexRS) for autoscaled batch pipelines in Dataflow.
https://cloud.google.com/dataflow/docs/guides/flexrs
This doesn't solve my problem though. As I said above, if I deploy with flexrs_goal=SPEED_OPTIMIZED the job still sits in state=QUEUED for almost ten minutes, yet as far as I know QUEUED only applies to FlexRS jobs:
Therefore, after you submit a FlexRS job, your job displays an ID and a Status of Queued
https://cloud.google.com/dataflow/docs/guides/flexrs#delayed_scheduling
Hence I'm very confused:
Why is my job getting queued even though it is not a flexRS job?
Why does it take nearly ten minutes for my job to start processing any data?
How can I speed up the time it takes for my job to start processing data so that I can get quicker feedback during development/testing?
UPDATE: I dug a bit more into the logs to find out what was going on during those 9 minutes 46 seconds. These two consecutive log messages are 7 minutes 23 seconds apart:
2021-01-17 19:51:03.381 GMT
"INFO:apache_beam.runners.portability.stager:Executing command: ['/usr/local/bin/python', '-m', 'pip', 'download', '--dest', '/tmp/dataflow-requirements-cache', '-r', '/dataflow/template/requirements.txt', '--exists-action', 'i', '--no-binary', ':all:']"
2021-01-17 19:58:26.459 GMT
"INFO:apache_beam.runners.portability.stager:Downloading source distribution of the SDK from PyPi"
Whatever is going on between those two log records is the main contributor to the long time spent in state=QUEUED. Anyone know what might be the cause?
As mentioned in the existing answer, you need to pull the apache-beam dependency out of your requirements.txt and install it in a separate step, so that pip doesn't rebuild the SDK from source every time the launcher starts:
# Install the pinned Beam SDK first, then the remaining requirements
RUN pip install -U apache-beam==<version>
RUN pip install -U -r ./requirements.txt
While developing, I prefer to use DirectRunner for the fastest feedback.
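For local iteration, something like this minimal sketch (not from the original post; the sample transforms are placeholders) runs the pipeline code in-process with DirectRunner, avoiding the template build-and-launch cycle entirely:

# Sketch: run pipeline code locally with DirectRunner for quick feedback.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(runner="DirectRunner")
with beam.Pipeline(options=options) as pipeline:
    (pipeline
     | "Sample input" >> beam.Create(["a", "b", "c"])  # stand-in for the real source
     | "Debug print" >> beam.Map(print))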

How can I programmatically cancel a Dataflow job that has run for too long?

I'm using Apache Beam on Dataflow through Python API to read data from Bigquery, process it, and dump it into Datastore sink.
Unfortunately, quite often the job just hangs indefinitely and I have to manually stop it. While the data gets written into Datastore and Redis, from the Dataflow graph I've noticed that it's only a couple of entries that get stuck and leave the job hanging.
As a result, when a job with fifteen 16-core machines is left running for 9 hours (normally, the job runs for 30 minutes), it leads to huge costs.
Maybe there is a way to set a timer that would stop a Dataflow job if it exceeds a time limit?
It would be great if you could create a customer support ticket so we could try to debug this with you.
Maybe there is a way to set a timer that would stop a Dataflow job if
it exceeds a time limit?
Unfortunately the answer is no: Dataflow does not have an automatic way to cancel a job after a certain time. However, it is possible to do this using the API: call wait_until_finish() with a timeout and then cancel() the pipeline.
You would do this like so:
p = beam.Pipeline(options=pipeline_options)
p | ... # Define your pipeline code
pipeline_result = p.run() # doesn't do anything
pipeline_result.wait_until_finish(duration=TIME_DURATION_IN_MS)
pipeline_result.cancel() # If the pipeline has not finished, you can cancel it
To sum up, with the help of @ankitk's answer, this works for me (Python 2.7, SDK 2.14):
pipe = beam.Pipeline(options=pipeline_options)
... # main pipeline code
run = pipe.run() # doesn't do anything
run.wait_until_finish(duration=3600000) # (ms) actually starts a job
run.cancel() # cancels if can be cancelled
Thus, if the job finished successfully within the duration passed to wait_until_finish(), cancel() will just print a warning ("already closed"); otherwise it will close a running job.
P.S. if you try to print the state of a job
state = run.wait_until_finish(duration=3600000)
logging.info(state)
it will be RUNNING for a job that didn't finish within wait_until_finish(), and DONE for a finished job.
Note: this technique will not work when running Beam from within a Flex Template job...
The run.cancel() method doesn't work if you are writing a template, and I haven't seen any successful workaround for it...

Autoscaling Resque workers on Heroku in real time

I would like to up/down-scale my dynos automatically depending on the size of the pending list.
I heard about HireFire, but the scaling is only done once a minute, and I need it to be (almost) real time.
I would like to scale my dynos so that the pending list is ~always empty.
I was thinking about doing it myself (with a scheduler (~15 s delay) and the Heroku API), because I'm not sure there is anything out there; and if not, do you know of any monitoring tool that could send an email alert when the queue length exceeds a fixed size (similar to Apdex on New Relic)?
A potential custom-code solution is included below. There are also two New Relic plugins that do Resque monitoring; I'm not sure whether either does email alerts based on exceeding a certain queue size. Using Resque hooks you could output log messages that trigger email alerts (or Slack, HipChat, PagerDuty, etc.) via a service like Papertrail or Loggly. This might look something like:
def after_enqueue_pending_check(*args)
  job_count = Resque.info[:pending].to_i
  if job_count > PENDING_THRESHOLD
    Rails.logger.warn('pending queue threshold exceeded')
  end
end
Instead of logging, you could send an email directly, but without some sort of rate limiting on the emails you could easily get flooded if the pending queue grows rapidly.
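For example, a minimal throttle (a sketch only; AlertMailer and the Redis key are illustrative names, not part of Resque) that allows at most one alert every 10 minutes using the Redis connection Resque already holds:

# Sketch: rate-limit alert emails to one per 10 minutes.
def alert_if_pending_threshold_exceeded
  return unless Resque.info[:pending].to_i > PENDING_THRESHOLD

  # SETNX succeeds only if no alert marker exists; expire it after 10 minutes.
  if Resque.redis.setnx('alerts:pending-threshold', '1')
    Resque.redis.expire('alerts:pending-threshold', 600)
    AlertMailer.pending_queue_alert(Resque.info[:pending]).deliver_now
  end
end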
I don't think there is a Heroku add-on or other service that can do the scaling in real time. There is a gem that will do this using the deprecated Heroku API. You can do it yourself using Resque hooks and the Heroku platform-api gem. The untested example below uses platform-api to scale the 'worker' dynos up and down; just as an example, it allocates 1 worker for every three pending jobs. The downscale only ever resets the workers to 1, and only when there are no pending jobs and no working jobs. This is not ideal and should be adapted to your needs. See here for information on making sure you don't lose jobs when scaling workers down: http://quickleft.com/blog/heroku-s-cedar-stack-will-kill-your-resque-workers
require 'platform-api'

def after_enqueue_upscale(*args)
  heroku = PlatformAPI.connect_oauth('OAUTH_TOKEN')
  worker_count = heroku.formation.info('app-name', 'worker')["quantity"]
  job_count = Resque.info[:pending].to_i

  # one worker for every 3 jobs (minimum of 1)
  new_worker_count = ((job_count / 3) + 1).to_i
  return if new_worker_count <= worker_count

  heroku.formation.update('app-name', 'worker', {"quantity" => new_worker_count})
end

def after_perform_downscale
  heroku = PlatformAPI.connect_oauth('OAUTH_TOKEN')
  if Resque.info[:pending].to_i == 0 && Resque.info[:working].to_i == 0
    heroku.formation.update('app-name', 'worker', {"quantity" => 1})
  end
end
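For completeness, here is a sketch of how such hooks attach to a job (the module and job class names are made up for illustration): Resque calls any after_enqueue_* and after_perform_* class methods it finds on the job class, so the methods above would typically live in a module the job extends:

# Illustrative wiring only; names are not from the original post.
module HerokuAutoscaling
  # define after_enqueue_upscale and after_perform_downscale (as above) here
end

class ResizeImageJob
  extend HerokuAutoscaling
  @queue = :images

  def self.perform(image_id)
    # ... the actual work ...
  end
end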
I'm having a similar issue and ran into HireFire:
https://www.hirefire.io/.
For Ruby, use:
https://github.com/hirefire/hirefire-resource
It theoretically works like Adept Scale (https://www.adeptscale.com/), but HireFire can also scale workers and doesn't limit itself to just web dynos. Hope this helps!

Ruby popen3 not working as expected in Sidekiq worker

I'd like to find out more about the wait_thr thread being passed to my Open3.popen3 wrapper method:
require 'open3'

def my_popen(cmd, ignore_err = true)
  cmd_output = ''
  cmd_status = nil
  Open3.popen3(cmd, {}) do |stdin, stdout, stderr, wait_thr|
    cmd_status = wait_thr.value
    cmd_output << stdout.read
    cmd_output << stderr.read unless ignore_err
  end
  return cmd_output, cmd_status
end
It works for short-running processes, but it's being used in a Sidekiq worker that can take around an hour. However, when I time it, it takes only around 30 seconds every time, no matter how long the worker really takes. To time it, I just add a timestamp entry to the database at the beginning of the worker and update it at the end (that keeps it thread-safe and lets me see it in a UI).
Is there something about this wait thread that is timing out after around 30 seconds?
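For comparison, a wrapper built on Open3.capture3 (a sketch, not a confirmed explanation of the 30-second behaviour) reads stdout and stderr while the command runs and only then waits on the child, so it cannot stall on a full pipe buffer the way reading only after wait_thr.value returns can:

require 'open3'

# Sketch: capture3 drains stdout/stderr as the command runs, then waits for it.
def my_popen_capture(cmd, ignore_err = true)
  stdout, stderr, status = Open3.capture3(cmd)
  output = stdout.dup
  output << stderr unless ignore_err
  return output, status
end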

Resources