I know there is a 10-minute no-output timeout on builds in CircleCI. However, we had a dodgy build which for some reason took 5 hours until it hit the maximum job duration. Is there a way to override this 5-hour limit with something custom and more sensible for our case, like 1 hour?
I have a dynamic Airflow DAG (backfill_dag) that reads an admin variable (JSON) and builds itself. Backfill_dag is used for backfilling/history loading. For example, if I want to history-load DAGs x, y, and z in some order (x and y run in parallel, z depends on x), I describe this in a particular JSON format and put it in the admin variable of backfill_dag.
Backfill_dag now:
parses the Json,
renders the tasks of the dags x,y, and z, and
builds itself dynamically, with x and y in parallel and z depending on x.
Issue:
It works well as long as list_dags can process Backfill_dag within 30 seconds.
Since Backfill_dag is a bit complex, it takes more than 30 seconds to list (airflow list_dags -sd Backfill_dag.py), so it times out and the DAG breaks.
Tried:
I tried setting a parameter, dagbag_import_timeout = 100, in the scheduler's airflow.cfg file, but that did not help.
I fixed my code.
Fix:
I had some aws s3 cp commands in my DAG file that were running during compilation (that is, while the file was being parsed), so my list_dags command was taking more than 30 seconds. I removed them (or moved them into a BashOperator task), and now my code compiles (list_dags) in a couple of seconds.
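The idea behind this fix can be sketched without Airflow itself: at parse time the DAG file should only build the command as a string, and leave running it to a task. This is a minimal sketch, not the original poster's code. The helper name make_copy_command and the bucket and file paths are hypothetical. In a real DAG the resulting string would be passed to a BashOperator through its bash_command argument.

```python
import shlex


def make_copy_command(src: str, dst: str) -> str:
    """Build the shell command as a string. Nothing is executed here."""
    return "aws s3 cp {} {}".format(shlex.quote(src), shlex.quote(dst))


# Module level (parse time): cheap string building only, no network calls.
# The bucket and paths are placeholders for illustration.
COPY_CMD = make_copy_command("s3://example-bucket/config.json", "/tmp/config.json")

# In a DAG file, COPY_CMD would become the bash_command of a BashOperator,
# so the copy only happens when that task actually runs.
print(COPY_CMD)
```

Anything that calls out to S3, a database, or another slow service at module level is paid for on every parse of the file, which is why moving it into a task brings the listing time back down.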
Besides fixing your code, you can also increase core.dagbag_import_timeout, which defaults to 30 seconds. For me, increasing it to 150 helped.
core.dagbag_import_timeout
default: 30 seconds
The number of seconds before importing a Python file times out.
You can use this option to free up resources by increasing the time it takes before the Scheduler times out while importing a Python file to extract the DAG objects. This option is processed as part of the Scheduler "loop," and must contain a value lower than the value specified in core.dag_file_processor_timeout.
core.dag_file_processor_timeout
default: 50 seconds
The number of seconds before the DagFileProcessor times out processing a DAG file.
You can use this option to free up resources by increasing the time it takes before the DagFileProcessor times out. We recommend increasing this value if you're seeing timeouts in your DAG processing logs that result in no viable DAGs being loaded.
You can also try changing other Airflow settings, such as:
AIRFLOW__WEBSERVER__WEB_SERVER_WORKER_TIMEOUT
AIRFLOW__CORE__DEFAULT_TASK_EXECUTION_TIMEOUT
and, as mentioned above:
AIRFLOW__CORE__DAG_FILE_PROCESSOR_TIMEOUT
AIRFLOW__CORE__DAGBAG_IMPORT_TIMEOUT
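These environment variable names follow Airflow's AIRFLOW__{SECTION}__{KEY} convention, so core.dagbag_import_timeout in airflow.cfg corresponds to AIRFLOW__CORE__DAGBAG_IMPORT_TIMEOUT. The sketch below only shows that mapping. The helper name airflow_env_var is hypothetical, and the value 180 for the processor timeout is an assumed example chosen to stay above the 150 mentioned earlier.

```python
def airflow_env_var(section: str, key: str) -> str:
    """Map an airflow.cfg section/key pair to its environment variable name."""
    return "AIRFLOW__{}__{}".format(section.upper(), key.upper())


# Example values: 150 comes from the answer above, 180 is an assumption.
# The import timeout is kept lower than the file processor timeout.
settings = {
    ("core", "dagbag_import_timeout"): "150",
    ("core", "dag_file_processor_timeout"): "180",
}

for (section, key), value in settings.items():
    # Print shell export lines that could be added to the scheduler's environment.
    print("export {}={}".format(airflow_env_var(section, key), value))
```

The variables have to be set in the environment of the process that parses the DAG files (the scheduler), which may explain why a change made elsewhere appears to have no effect.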
If you are writing a bosun alert that is based on a percentage error rate for requests handled by your system, how do you write it in such a way that it handles periods of low traffic?
For example:
Suppose I have an alert which looks back over the last 5 minutes, works out the error rate for requests as $errorRate = $numberErr/$numberReq, and then triggers an alarm if the error rate exceeds a predefined threshold: crit = $errorRate > 0.05. This can work quite well so long as every 5-minute period has a sufficiently large number of requests ($numberReq).
If the number of requests in a 5-minute period was 10,000, then 501 errors would be required to trigger an alarm. However, if the number of requests in a 5-minute period was 100, then only 6 errors would be required to trigger an alarm.
How can I write an alert which handles periods where the number of requests is so low that a small number of errors equates to a large error rate? I had considered a sliding window of time, rather than a fixed 5-minute period, where the window would increase in size until the number of requests was high enough to give some confidence in the alarm, e.g. increase the time period until the number of requests is 10,000.
I can't find a way to achieve this in bosun, and I don't want to commit to a larger period of time for my alerts because the traffic rate varies so much. A longer period during peak traffic could result in an actual error causing a much larger impact.
I generally pair any percentage-based and/or history-based alerts with a static threshold.
For example: crit = $numberErr > 100 && $errorRate > 0.05. That way the percentage part doesn't matter unless the number of errors has also crossed some threshold, because otherwise the entire statement won't be true.
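The effect of pairing the two conditions can be checked with a small sketch. This is plain Python rather than bosun expression syntax, and the function name should_alert is hypothetical. The thresholds (more than 100 errors, rate above 0.05) are the ones from the example above.

```python
def should_alert(number_err: int, number_req: int,
                 rate_threshold: float = 0.05, min_errors: int = 100) -> bool:
    """Return True only if both the error count and the error rate are too high."""
    if number_req == 0:
        # No traffic means no meaningful error rate, so do not alert.
        return False
    error_rate = number_err / number_req
    return number_err > min_errors and error_rate > rate_threshold


# Low traffic: 6 errors in 100 requests exceeds the 5% rate,
# but the static threshold keeps the alert quiet.
print(should_alert(6, 100))       # False
# High traffic: 501 errors in 10,000 requests crosses both thresholds.
print(should_alert(501, 10000))   # True
```

The static threshold has to be chosen with the quietest periods in mind: set too high, it can hide a real outage when traffic is low.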
I have scheduled a JMeter job in Jenkins. My goal is to compare the result of the current test with the previous build's result and set the status accordingly. I have set the following:
Unstable % Range as = (-)1 to (+)1
Error % Range as = (-)5 to (+)5
My application's response time always varies by about 1 second, and this is a known factor.
Since JMeter returns the average response time in milliseconds, the test is marked as failed or unstable because the difference in milliseconds is always high.
Instead of comparing the response time in milliseconds, is it possible to convert the response time to seconds and then perform the build comparison?
I'm using Graphite+Statsd (with the Python client) to collect custom metrics from a webapp: a counter for successful transactions. Let's say the counter is stats.transactions.count, which also has a per-second rate metric available at stats.transactions.rate.
I've also set up Seyren as a monitoring and alerting system and successfully pulled metrics from Graphite. Now I want to set up an alert in Seyren if the number of successful transactions in the last 60 minutes is less than a certain minimum.
Which metric and Graphite function should I use? I tried with summarize(metric, '1h') but this gives me an alert each hour when Graphite starts aggregating the metric for the starting hour.
Note that Seyren also allows you to specify Graphite's from and until parameters, if this helps.
I contributed the Seyren code to support from/until in order to handle this exact situation.
The following configuration should raise a warning if the count for the last hour drops below 50, and an error if it drops below 25.
Target: summarize(nonNegativeDerivative(stats.transactions.count),"1h","sum",true)
From: -1h
To: [blank]
Warn: 50 (soft minimum)
Error: 25 (hard minimum)
Note this will run every minute, so the "last hour" is a sliding window. Also note that the final boolean argument to the summarize function, true (its fourth parameter), tells it to align its 1h bucket to the From, meaning you get a full 1-hour bucket starting 1 hour ago, rather than accidentally getting a half bucket. (Newer versions of Graphite may do this automatically.)
Your mileage may vary. I have had problems with this approach when the counter gets set back to 0 on a server restart. But in my case I'm using dropwizard metrics + graphite, not statsd + graphite, so you may not have that problem.
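To make the behaviour of that target concrete, here is a simplified imitation of it in plain Python. It is not Graphite's or Seyren's implementation. It assumes one cumulative counter sample per minute, treats a drop in the counter as a restart and skips that interval, and follows the description above by warning below 50 and raising an error below 25. All function names are hypothetical.

```python
from typing import List, Optional


def non_negative_derivative(values: List[Optional[int]]) -> List[Optional[int]]:
    """Per-sample increase of a cumulative counter; None where it cannot be computed."""
    deltas: List[Optional[int]] = []
    prev: Optional[int] = None
    for value in values:
        if value is None or prev is None or value < prev:
            # First sample, missing data, or a counter reset: no usable delta.
            deltas.append(None)
        else:
            deltas.append(value - prev)
        prev = value
    return deltas


def last_hour_count(per_minute_counter: List[Optional[int]]) -> int:
    """Sum the increases over the last hour (61 samples give 60 one-minute deltas)."""
    window = per_minute_counter[-61:]
    return sum(d for d in non_negative_derivative(window) if d is not None)


def classify(count: int, warn: int = 50, error: int = 25) -> str:
    """Map the hourly count to a state, using the soft and hard minimums above."""
    if count < error:
        return "ERROR"
    if count < warn:
        return "WARN"
    return "OK"


# A counter that restarts halfway through the hour: 100..130, then 0..29.
samples = list(range(100, 131)) + list(range(0, 30))
print(last_hour_count(samples), classify(last_hour_count(samples)))
```

In this sketch a restart costs only the single interval in which the reset happened, so the hourly sum is slightly low rather than wildly wrong.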
Please let me know if this approach works for you!
The default value is 4 hours. When I ran my data-processing job, I got this error message:
E, [2014-08-15T06:49:57.821145 #17238] ERROR -- : 2014-08-15T06:49:57+0000: [Worker(delayed_job host:app-name pid:17238)] Job ImportJob (id=8) FAILED (1 prior attempts) with Delayed::WorkerTimeout: execution expired (Delayed::Worker.max_run_time is only 14400 seconds)
I, [2014-08-15T06:49:57.830621 #17238] INFO -- : 2014-08-15T06:49:57+0000: [Worker(delayed_job host:app-name pid:17238)] 1 jobs processed at 0.0001 j/s, 1 failed
This means that the current limit is set to 4 hours.
Because I have a large amount of data that might take 40 or 80 hours to process, I was wondering whether I can set MAX_RUN_TIME to that many hours.
Are there any limits or drawbacks to setting MAX_RUN_TIME to, let's say, 100 hours? Or is there another way to process this data?
EDIT: is there a way to set MAX_RUN_TIME to infinity?
There does not appear to be a way to set MAX_RUN_TIME to infinity, but you can set it very high. To configure the max run time, add a setting to your delayed_job initializer (config/initializers/delayed_job_config.rb by default):
Delayed::Worker.max_run_time = 7.days
Assuming you are running your Delayed Job daemon on its own utility server (i.e. so that it doesn't affect your web server, assuming you have one) then I don't see why long run times would be problematic. Basically, if you're expecting long run times and you're getting them then it sounds like all is normal and you should feel free to up the MAX_RUN_TIME. However, it is also there to protect you so I would suggest keeping a reasonable limit lest you run into an infinite loop or something that actually never will complete.
As far as setting MAX_RUN_TIME to infinity... it doesn't look to be possible since Delayed Job doesn't make the max_run_time optional. And there's a part in the code where a to_i conversion is done, which wouldn't work with infinity:
[2] pry(main)> Float::INFINITY.to_i
# => FloatDomainError: Infinity