Setting MAX_RUN_TIME for delayed_job - how much time can I set? - ruby-on-rails

The default value is 4 hours. When I run my data-processing job, I get this error message:
E, [2014-08-15T06:49:57.821145 #17238] ERROR -- : 2014-08-15T06:49:57+0000: [Worker(delayed_job host:app-name pid:17238)] Job ImportJob (id=8) FAILED (1 prior attempts) with Delayed::WorkerTimeout: execution expired (Delayed::Worker.max_run_time is only 14400 seconds)
I, [2014-08-15T06:49:57.830621 #17238] INFO -- : 2014-08-15T06:49:57+0000: [Worker(delayed_job host:app-name pid:17238)] 1 jobs processed at 0.0001 j/s, 1 failed
This means the current limit is set to 4 hours.
Because I have a large amount of data that might take 40 or 80 hours to process, I was curious whether I can set MAX_RUN_TIME that high.
Are there any limits or downsides to setting MAX_RUN_TIME to, say, 100 hours? Or is there another way to process this data?
EDIT: Is there a way to set MAX_RUN_TIME to an infinite value?

There does not appear to be a way to set MAX_RUN_TIME to infinity, but you can set it very high. To configure the max run time, add a setting to your delayed_job initializer (config/initializers/delayed_job_config.rb by default):
Delayed::Worker.max_run_time = 7.days

Assuming you are running your Delayed Job daemon on its own utility server (so that it doesn't affect your web server, if you have one), I don't see why long run times would be problematic. If you're expecting long run times and you're getting them, then all is normal and you should feel free to raise MAX_RUN_TIME. However, the limit is also there to protect you, so I would suggest keeping it reasonable in case you run into an infinite loop or a job that will never actually complete.
As for setting MAX_RUN_TIME to infinity: it doesn't appear to be possible, since Delayed Job doesn't make max_run_time optional, and there's a place in the code where a to_i conversion is performed, which doesn't work with infinity:
[2] pry(main)> Float::INFINITY.to_i
# => FloatDomainError: Infinity

Related

CircleCI: reduce maximum job duration time

I know there is a 10-minute no-output timeout on builds in CircleCI. However, we had a dodgy build which for some reason ran for 5 hours until it hit the maximum job duration. Is there a way to override this 5-hour limit to something custom and more sensible for our case, like 1 hour?

Flink Checkpoint Failure - Checkpoints time out after 10 mins

We get one or two checkpoint failures while processing data every day. The data volume is low, under 10k records, and our checkpoint interval setting is 2 minutes. (The reason processing is slow is that we need to sink the data to another API endpoint, which takes some time at the end of the Flink job, so the total time is streaming the data plus sinking to the external API endpoint.)
The root issue is:
Checkpoints time out after 10 minutes because the data processing takes longer than 10 minutes, so the checkpoint times out. We could increase the parallelism to speed up processing, but if the data grows we would have to increase the parallelism again, so we don't want to take that approach.
Suggested solution:
I saw someone suggest setting a pause between the old and new checkpoint, but my question is: if I set that pause time, will the new checkpoint miss the state produced during the pause?
Aim:
How can we avoid this issue and record the correct state without missing any data?
(Screenshots of the failed and completed checkpoints omitted; the failed checkpoint shows a subtask that didn't respond.)
Thanks
There are several related configuration variables you can set -- such as the checkpoint interval, the pause between checkpoints, and the number of concurrent checkpoints. No combination of these settings will result in data being skipped for checkpointing.
Setting an interval between checkpoints means that Flink won't initiate a new checkpoint until some time has passed since the completion (or failure) of the previous checkpoint -- but this has no effect on the timeout.
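For illustration, here is a minimal sketch of how those related settings are typically applied to the checkpoint config (the interval matches the 2 minutes mentioned in the question; the pause and concurrency values are placeholders, not recommendations):
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// start a new checkpoint every 2 minutes
env.enableCheckpointing(2 * 60 * 1000);
// require at least 30 seconds between the end of one checkpoint and the start of the next
env.getCheckpointConfig().setMinPauseBetweenCheckpoints(30 * 1000);
// allow only one checkpoint to be in flight at a time
env.getCheckpointConfig().setMaxConcurrentCheckpoints(1);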
Sounds like you should extend the timeout, which you can do like this:
env.getCheckpointConfig().setCheckpointTimeout(n);
where n is measured in milliseconds. See the section of the Flink docs on enabling and configuring checkpointing for more details.
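For example, to give each checkpoint up to 30 minutes (a placeholder value; size it to how long your sink actually takes):
// 30 minutes, expressed in milliseconds
env.getCheckpointConfig().setCheckpointTimeout(30 * 60 * 1000);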

Airflow list_dags times out after exactly 30 seconds

I have a dynamic Airflow DAG (backfill_dag) that reads an Admin Variable (JSON) and builds itself. Backfill_dag is used for backfilling/history loading, so for example if I want to history-load DAGs x, y, and z in some order (x and y run in parallel, z depends on x), then I describe this in a particular JSON format and put it in the Admin Variable of backfill_dag.
Backfill_dag now:
parses the JSON,
renders the tasks of the DAGs x, y, and z, and
builds itself dynamically, with x and y in parallel and z depending on x.
Issue:
It works fine as long as Backfill_dag can be listed within 30 seconds.
Since Backfill_dag is a bit complex here, it takes more than 30 seconds to list (airflow list_dags -sd Backfill_dag.py), hence it times out and the DAG breaks.
Tried:
I tried setting the parameter dagbag_import_timeout = 100 in the scheduler's airflow.cfg file, but that did not help.
I fixed my code.
Fix:
I had some aws s3 cp commands in my DAG that were running during compilation, hence my list_dags command was taking more than 30 seconds. I removed them (or moved them into a BashOperator task), and now my code compiles (list_dags) in a couple of seconds.
Besides fixing your code, you can also increase core.dagbag_import_timeout, which defaults to 30 seconds. For me, increasing it to 150 helped.
core.dagbag_import_timeout
default: 30 seconds
The number of seconds before importing a Python file times out.
You can use this option to free up resources by increasing the time it takes before the Scheduler times out while importing a Python file to extract the DAG objects. This option is processed as part of the Scheduler "loop," and must contain a value lower than the value specified in core.dag_file_processor_timeout.
core.dag_file_processor_timeout
default: 50 seconds
The number of seconds before the DagFileProcessor times out processing a DAG file.
You can use this option to free up resources by increasing the time it takes before the DagFileProcessor times out. We recommend increasing this value if you're seeing timeouts in your DAG processing logs that result in no viable DAGs being loaded.
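For example (the values below are just an illustration; pick numbers that fit your own DAGs, keeping dagbag_import_timeout below dag_file_processor_timeout), you could set in airflow.cfg:
[core]
# seconds allowed to import one DAG file (default 30)
dagbag_import_timeout = 150
# seconds allowed for the DagFileProcessor (default 50); keep this above dagbag_import_timeout
dag_file_processor_timeout = 180
or, equivalently, via environment variables:
export AIRFLOW__CORE__DAGBAG_IMPORT_TIMEOUT=150
export AIRFLOW__CORE__DAG_FILE_PROCESSOR_TIMEOUT=180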
You can also try changing other Airflow configs, such as:
AIRFLOW__WEBSERVER__WEB_SERVER_WORKER_TIMEOUT
AIRFLOW__CORE__DEFAULT_TASK_EXECUTION_TIMEOUT
also as mentioned above:
AIRFLOW__CORE__DAG_FILE_PROCESSOR_TIMEOUT
AIRFLOW__CORE__DAGBAG_IMPORT_TIMEOUT

How to specify custom timeout for bazel test

The bazel test command uses a default timeout of 75 seconds for tests tagged size = small in my setup (version 0.12.0), while the documentation says it is 60 seconds.
Is there a way to supply a custom timeout, say 10 seconds, on the bazel command line, so that if a test hangs it is terminated more quickly?
I hope I did not misread the question, but this really sounds like you're looking for the --test_timeout option:
--test_timeout
(a single integer or comma-separated list of 4 integers; default: "-1")
Override the default test timeout values for test timeouts (in secs). If a
single positive integer value is specified it will override all
categories. If 4 comma-separated integers are specified, they will
override the timeouts for short, moderate, long and eternal (in that
order). In either form, a value of -1 tells blaze to use its default
timeouts for that category.
If you want to use the same option(s) every time, you can save yourself some typing by using bazelrc.
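For example, to cap every test at 10 seconds for a single run (the target label here is just a placeholder):
bazel test --test_timeout=10 //my/package:my_test
or, to make it the default, add a line like this to your .bazelrc, here overriding the four categories (short, moderate, long, eternal) separately:
test --test_timeout=10,60,300,900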

How to estimate memory requirement for submitting a job to a cluster running SGE?

I am trying to submit a job to a cluster running Sun Grid Engine (SGE). The job keeps being terminated with the following report:
Job 780603 (temp_new) Aborted
Exit Status = 137
Signal = KILL
User = heaswara
Queue = std.q#comp-0-8.local
Host = comp-0-8.local
Start Time = 08/24/2013 13:49:05
End Time = 08/24/2013 16:26:38
CPU = 02:46:38
Max vmem = 12.055G
failed assumedly after job because:
job 780603.1 died through signal KILL (9)
The resource requirements I had set were:
#$ -l mem_free=10G
#$ -l h_vmem=12G
mem_free is the amount of memory my job requires, and h_vmem is the upper bound on the amount of memory the job is allowed to use. I suspect my job is being terminated because it requires more than that threshold (12G).
Is there a way to estimate how much memory will be required for my operation? I am trying to figure out what the upper bound should be.
Thanks in advance.
It depends on the nature of the job itself. If you know anything about the program that is being run (e.g., you wrote it), you should be able to estimate how much memory it is going to want. If not, your only recourse is to run it without the limit and see how much it actually uses.
I have a bunch of FPGA build and simulation jobs that I run. After each job, I track how much memory was actually used, and I use this historical information to estimate how much it might use in the future (I pad by 10% in case there are some weird changes in the source). I still have to redo the calculations whenever the vendor delivers a new version of the tools, though, as the memory footprint quite often changes dramatically.
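If your cluster has accounting enabled, one way to see how much memory a finished SGE job actually used is to query its accounting record afterwards (the job ID below is just the one from the question):
# maxvmem in the accounting record is the peak virtual memory the job reached
qacct -j 780603 | grep -E 'maxvmem|exit_status'
You can then pad that observed maxvmem and use it as the h_vmem request for future runs.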
