Airflow list_dags times out after exactly 30 seconds

I have a dynamic Airflow DAG (backfill_dag) that basically reads an admin Variable (JSON) and builds itself. Backfill_dag is used for backfilling/history loading, so for example if I want to history-load DAGs x, y, and z in some order (x and y run in parallel, z depends on x), then I describe this in a particular JSON format and put it in the admin Variable of backfill_dag.
Backfill_dag now:
parses the JSON,
renders the tasks of DAGs x, y, and z, and
builds itself dynamically with x and y in parallel and z depending on x (a minimal sketch follows below).
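For illustration only, here is a minimal sketch of this kind of dynamic DAG. The Variable name (backfill_dag_config) and the {"task": {"depends_on": [...]}} JSON shape are made up for the example, not the poster's actual format:

from datetime import datetime
from airflow import DAG
from airflow.models import Variable
from airflow.operators.bash_operator import BashOperator  # Airflow 1.x import path

# Hypothetical admin Variable content, e.g.
# {"x": {"depends_on": []}, "y": {"depends_on": []}, "z": {"depends_on": ["x"]}}
config = Variable.get("backfill_dag_config", deserialize_json=True)

dag = DAG("backfill_dag", start_date=datetime(2019, 1, 1), schedule_interval=None)

# One task per entry in the JSON
tasks = {
    name: BashOperator(task_id=name, bash_command="echo backfill " + name, dag=dag)
    for name in config
}

# Wire up dependencies: z depends on x, while x and y have none and run in parallel
for name, spec in config.items():
    for upstream in spec.get("depends_on", []):
        tasks[upstream] >> tasks[name]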
Issue:
It works fine as long as Backfill_dag can list_dags within 30 seconds.
Since Backfill_dag is a bit complex here, it takes more than 30 seconds to list (airflow list_dags -sd Backfill_dag.py), hence it times out and the DAG breaks.
Tried:
I tried to set a parameter, dagbag_import_timeout = 100, in the scheduler's airflow.cfg file, but that did not help.

I fixed my code.
Fix:
I had some aws s3 cp commands in my DAG that were running during compilation (DAG parsing), hence my list_dags command was taking more than 30 seconds. I removed them (or rather moved them into a BashOperator task), and now my code compiles (list_dags) in a couple of seconds. See the sketch below.
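For illustration, a hedged sketch of what that looks like: the shell command is moved out of module-level code and into a BashOperator, so it runs when the task executes rather than every time the file is parsed (task id, bucket and paths are made up):

from airflow.operators.bash_operator import BashOperator  # Airflow 1.x import path

# Runs only at task execution time, not during DAG parsing / list_dags
s3_copy = BashOperator(
    task_id="copy_from_s3",
    bash_command="aws s3 cp s3://some-bucket/some-key /tmp/some-file",
    dag=dag,  # the existing backfill_dag object
)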

Besides fixing your code you can also increase core.dagbag_import_timeout, which defaults to 30 seconds. For me, increasing it to 150 helped.
core.dagbag_import_timeout
default: 30 seconds
The number of seconds before importing a Python file times out.
You can use this option to free up resources by increasing the time it takes before the Scheduler times out while importing a Python file to extract the DAG objects. This option is processed as part of the Scheduler "loop," and must contain a value lower than the value specified in core.dag_file_processor_timeout.
core.dag_file_processor_timeout
default: 50 seconds
The number of seconds before the DagFileProcessor times out processing a DAG file.
You can use this option to free up resources by increasing the time it takes before the DagFileProcessor times out. We recommend increasing this value if you're seeing timeouts in your DAG processing logs that result in no viable DAGs being loaded.
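For example, the relevant part of airflow.cfg could look like this (the values are illustrative; keep dagbag_import_timeout below dag_file_processor_timeout):

[core]
dagbag_import_timeout = 150
dag_file_processor_timeout = 180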

You can also try changing other Airflow configs, such as:
AIRFLOW__WEBSERVER__WEB_SERVER_WORKER_TIMEOUT
AIRFLOW__CORE__DEFAULT_TASK_EXECUTION_TIMEOUT
and, as mentioned above:
AIRFLOW__CORE__DAG_FILE_PROCESSOR_TIMEOUT
AIRFLOW__CORE__DAGBAG_IMPORT_TIMEOUT

Related

Does dask dataframe processing speed increase linearly with the number of workers?

Hi, I'm new to dask dataframe and am looking at how much it can improve processing time with distributed computing. My code works on a 5M+ row file as below:
import time
import dask.dataframe
from dask.distributed import Client

client = Client('127.0.0.1:8786')

start_time = time.time()
print(time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(start_time)))

# filename and filter_text are defined earlier in the script
ddf = dask.dataframe.read_csv(filename)
ddf = ddf[['Prod_desc']].drop_duplicates()
ddf = ddf.query(filter_text)
df = ddf.head(1000, npartitions=-1)

end_time = time.time()
print(time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(end_time)))
print(end_time - start_time, 'secs')
The results are below:
1 server,  1 worker/server:  22.344669342041016 secs
2 servers, 1 worker/server:  11.367061614990234 secs
3 servers, 1 worker/server:   9.111485004425049 secs
3 servers, 2 workers/server:  6.44242000579834 secs
3 servers, 3 workers/server:  8.706154823303223 secs
The processing speed didn't improve linearly as we expected. When we used 3 workers per server, it took more time than with 2 workers per server.
Does anybody have an idea why this happened? Thanks.
If I read the code correctly, you read the data, then drop duplicates and later show the top 1000.
I suspect that one or both of these steps require a shuffle of the data, perhaps even to one place.
If you need to shuffle, that costs time (data transport between nodes touches both network and disk between servers; I'm not sure what happens within a server), so that would be my primary suspect for why you see a slowdown.
In general, parallel processing should speed things up roughly linearly, but this especially holds if you can do many steps of processing and the shuffling is just a minor component. In this job the processing is so light that it possibly takes even less time than simply transporting the data.
Could you please try removing
ddf = ddf[['Prod_desc']].drop_duplicates()
and test the time again?
This step might drop so many rows that too few are left for the next step. A sketch of that comparison follows below.
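A hedged sketch of that comparison, timing the pipeline without the drop_duplicates step; with a recent version of distributed you can also wrap it in a performance report to see how much time goes into shuffling (filename and filter_text as defined in the original code):

import time
import dask.dataframe
from dask.distributed import Client, performance_report

client = Client('127.0.0.1:8786')

with performance_report(filename="dask-report.html"):
    start_time = time.time()
    ddf = dask.dataframe.read_csv(filename)  # filename/filter_text as in the question
    ddf = ddf.query(filter_text)             # drop_duplicates removed for comparison
    df = ddf.head(1000, npartitions=-1)
    print(time.time() - start_time, 'secs')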

Why does a dask worker fail due to MemoryError on a "small" task? [Dask.bag]

I am running a pipeline on multiple images. The pipeline consists of reading the images from the file system, doing some processing on each of them, then saving the images back to the file system. However, the dask workers fail due to MemoryError.
Is there a way to make sure the dask workers don't load too many images into memory? i.e. wait until there is enough space on a worker before starting the processing pipeline on a new image.
I have one scheduler and 40 workers with 4 cores, 15 GB RAM each, running CentOS 7. I am trying to process 125 images in a batch; each image is fairly large but small enough to fit on a worker; around 3 GB is required for the whole process.
I tried to process a smaller number of images and it works great.
EDITED
from dask.distributed import Client, LocalCluster

# LocalCluster is used here to mirror the config of the workers on the actual cluster
client = Client(LocalCluster(n_workers=2, resources={'process': 1}))

paths = ['list', 'of', 'paths']

# Read the file data from each path (read, foo, save and n are defined elsewhere)
data = client.map(read, paths, resources={'process': 1})

# Apply foo to the data n times
for _ in range(n):
    data = client.map(foo, data, resources={'process': 1})

# Save the processed data
data = client.map(save, data, resources={'process': 1})

# Retrieve results
client.gather(data)
I expected the images to be processed as space became available on the workers, but it seems like the images are all loaded simultaneously onto the different workers.
EDIT:
My issue is that all tasks get assigned to workers and they don't have enough memory. I found how to limit the number of tasks a worker handles at a single moment (see https://distributed.readthedocs.io/en/latest/resources.html#resources-are-applied-separately-to-each-worker-process).
However, with that limit, when I execute my tasks they all finish the read step, then the process step and finally the save step. This is an issue since the images are spilled to disk.
Would there be a way to make every task finish before starting a new one?
e.g. on Worker-1: read(img1)->process(img1)->save(img1)->read(img2)->...
Dask does not generally know how much memory a task will need; it can only know the size of the outputs, and that only once they are finished. This is because Dask simply executes a Python function and then waits for it to complete; all sorts of things can happen within a Python function. You should generally expect as many tasks to begin as you have available worker cores, as you are finding.
If you want a smaller total memory load, then your solution should be simple: have a small enough number of workers so that, if all of them are using the maximum memory you can expect, you still have some spare in the system to cope.
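A minimal sketch of that sizing idea, assuming roughly 15 GB of RAM per machine and about 3 GB of peak memory per image (numbers from the question; LocalCluster stands in for the real cluster config):

from dask.distributed import Client, LocalCluster

# Cap concurrency per machine: 4 single-threaded workers, each with an explicit
# memory budget, so at most ~4 images are in flight per 15 GB machine.
cluster = LocalCluster(n_workers=4, threads_per_worker=1, memory_limit='3.5GB')
client = Client(cluster)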
To the EDIT: you may want to try running optimize on the graph before submission (although this should happen anyway, I think), as it sounds like your linear chains of tasks should be "fused". http://docs.dask.org/en/latest/optimize.html
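To the "make every task finish before starting a new one" question: one way to approximate the fused behaviour is to wrap the whole per-image chain in a single function, so each worker finishes an image before picking up the next. A sketch, using the same hypothetical read/foo/save helpers and n from the question:

def process_one(path):
    # read -> n processing steps -> save, all inside a single task
    img = read(path)
    for _ in range(n):
        img = foo(img)
    return save(img)

futures = client.map(process_one, paths, resources={'process': 1})
client.gather(futures)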

Reading large gzip JSON files from Google Cloud Storage via Dataflow into BigQuery

I am trying to read about 90 gzipped JSON logfiles from Google Cloud Storage (GCS), each about 2 GB (10 GB uncompressed), parse them, and write them into a date-partitioned BigQuery (BQ) table via Google Cloud Dataflow (GCDF).
Each file holds 7 days of data, the whole date range is about 2 years (730 days and counting). My current pipeline looks like this:
p.apply("Read logfile", TextIO.Read.from(bucket))
.apply("Repartition", Repartition.of())
.apply("Parse JSON", ParDo.of(new JacksonDeserializer()))
.apply("Extract and attach timestamp", ParDo.of(new ExtractTimestamps()))
.apply("Format output to TableRow", ParDo.of(new TableRowConverter()))
.apply("Window into partitions", Window.into(new TablePartWindowFun()))
.apply("Write to BigQuery", BigQueryIO.Write
.to(new DayPartitionFunc("someproject:somedataset", tableName))
.withSchema(TableRowConverter.getSchema())
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
The Repartition is something I've built in while trying to make the pipeline reshuffle after decompressing; I have tried running the pipeline with and without it. Parsing JSON works via a Jackson ObjectMapper and corresponding classes, as suggested here. The TablePartWindowFun is taken from here; it is used to assign a partition to each entry in the PCollection.
The pipeline works for smaller files and not too many of them, but breaks for my real data set. I've selected large enough machine types and tried setting a maximum number of workers, as well as using autoscaling up to 100 n1-highmem-16 machines. I've tried streaming and batch mode, and diskSizeGb values from 250 up to 1200 GB per worker.
The possible solutions I can think of at the moment are:
Uncompress all files on GCS, thereby enabling dynamic work splitting between workers, as it is not possible to leverage GCS's gzip transcoding
Building "many" parallel pipelines in a loop, with each pipeline processing only a subset of the 90 files.
Option 2 seems to me like programming "around" a framework. Is there another solution?
Addendum:
With Repartition after reading the gzipped JSON files, in batch mode with a maximum of 100 workers (of type n1-highmem-4), the pipeline runs for about an hour with 12 workers and finishes the Reading as well as the first stage of Repartition. Then it scales up to 100 workers and processes the repartitioned PCollection. After it is done, the graph looks like this:
Interestingly, when reaching this stage, it first processes up to 1.5 million elements/s, then the progress goes down to 0. The size of the OutputCollection of the GroupByKey step in the picture first goes up and then down, from about 300 million to 0 (there are about 1.8 billion elements in total), as if it is discarding something. Also, the run-time of ExpandIterable and ParDo(Streaming Write) at the end is 0. The picture shows the state slightly before it runs "backwards".
In the worker logs I see some exceptions thrown while executing request messages, coming from the com.google.api.client.http.HttpTransport logger, but I can't find more info in Stackdriver.
Without Repartition after Reading, the pipeline fails on n1-highmem-2 instances with out-of-memory errors at exactly the same step (everything after GroupByKey); using bigger instance types leads to exceptions like
java.util.concurrent.ExecutionException: java.io.IOException:
CANCELLED: Received RST_STREAM with error code 8 dataflow-...-harness-5l3s
talking to frontendpipeline-..-harness-pc98:12346
Thanks to Dan from the Google Cloud Dataflow Team and the example he provided here, I was able to solve the issue. The only changes I made:
Looping over the days in chunks of 175 days (= 25 weeks), running one pipeline after the other, so as not to overwhelm the system. In the loop, make sure the last files of the previous iteration are re-processed and the startDate is moved forward at the same speed as the underlying data (175 days). As WriteDisposition.WRITE_TRUNCATE is used, incomplete days at the end of the chunks are overwritten with correct, complete data this way.
Using the Repartition/Reshuffle transform mentioned above, after reading the gzipped files, to speed up the process and allow smoother autoscaling
Using DateTime instead of Instant types, as my data is not in UTC
UPDATE (Apache Beam 2.0):
With the release of Apache Beam 2.0 the solution became much easier. Sharding BigQuery output tables is now supported out of the box.
It may be worthwhile trying to allocate more resources to your pipeline by setting --numWorkers to a higher value when you run it. This is one of the possible solutions discussed in the "Troubleshooting Your Pipeline" online document, in the "Common Errors and Courses of Action" section.

How to alert in Seyren with Graphite if transactions in last 60 minutes are less than x?

I'm using Graphite+Statsd (with the Python client) to collect custom metrics from a webapp: a counter for successful transactions. Let's say the counter is stats.transactions.count, which also has a per-second rate metric available at stats.transactions.rate.
I've also set up Seyren as a monitoring+alerting system and successfully pulled metrics from Graphite. Now I want to set up an alert in Seyren if the number of successful transactions in the last 60 minutes is less than a certain minimum.
Which metric and Graphite function should I use? I tried summarize(metric, '1h'), but this gives me an alert each hour when Graphite starts aggregating the metric for the starting hour.
Note that Seyren also allows specifying Graphite's from and until parameters, if that helps.
I contributed the Seyren code to support from/until in order to handle this exact situation.
The following configuration should raise a warning if the count for the last hour drops below 50, and an error if it drops below 25.
Target: summarize(nonNegativeDerivative(stats.transactions.count),"1h","sum",true)
From: -1h
To: [blank]
Warn: 50 (soft minimum)
Error: 25 (hard minimum)
Note this will run every minute, so the "last hour" is a sliding window. Also note that the third boolean parameter true for the summarize function tells it to align its 1h bucket to the From, meaning you get a full 1-hour bucket starting 1 hour ago, rather than accidentally getting a half bucket. (Newer versions of Graphite may do this automatically.)
Your mileage may vary. I have had problems with this approach when the counter gets set back to 0 on a server restart. But in my case I'm using dropwizard metrics + graphite, not statsd + graphite, so you may not have that problem.
Please let me know if this approach works for you!

Setting up MAX_RUN_TIME for delayed_job - how high can I set it?

The default value is 4 hours. When I run my data-processing job, I get this error message:
E, [2014-08-15T06:49:57.821145 #17238] ERROR -- : 2014-08-15T06:49:57+0000: [Worker(delayed_job host:app-name pid:17238)] Job ImportJob (id=8) FAILED (1 prior attempts) with Delayed::WorkerTimeout: execution expired (Delayed::Worker.max_run_time is only 14400 seconds)
I, [2014-08-15T06:49:57.830621 #17238] INFO -- : 2014-08-15T06:49:57+0000: [Worker(delayed_job host:app-name pid:17238)] 1 jobs processed at 0.0001 j/s, 1 failed
Which means that the current limit is set to 4 hours.
Because I have a large amount of data to process that might take 40 or 80 hours, I was curious whether I can set MAX_RUN_TIME to that many hours.
Are there any limits or downsides to setting MAX_RUN_TIME to, let's say, 100 hours? Or is there possibly another way to process this data?
EDIT: is there a way to set MAX_RUN_TIME to an infinite value?
There does not appear to be a way to set MAX_RUN_TIME to infinity, but you can set it very high. To configure the max run time, add a setting to your delayed_job initializer (config/initializers/delayed_job_config.rb by default):
Delayed::Worker.max_run_time = 7.days
Assuming you are running your Delayed Job daemon on its own utility server (i.e. so that it doesn't affect your web server, assuming you have one), then I don't see why long run times would be problematic. Basically, if you're expecting long run times and you're getting them, then it sounds like all is normal and you should feel free to raise MAX_RUN_TIME. However, it is also there to protect you, so I would suggest keeping a reasonable limit lest you run into an infinite loop or something that actually never will complete.
As far as setting MAX_RUN_TIME to infinity... it doesn't look to be possible, since Delayed Job doesn't make max_run_time optional. And there's a part in the code where a to_i conversion is done, which wouldn't work with infinity:
[2] pry(main)> Float::INFINITY.to_i
# => FloatDomainError: Infinity
