Why is there such a long delay between scheduling and performing an activity function? - azure-durable-functions

I have an activity function which will:
Read a bunch of results from a process elsewhere in my pipeline, via Azure Blob Storage
Perform some processing on each result, using the fan-out, fan-in pattern
Combine the original results with the newly processed data, and re-upload to Blob Storage
When I have a large number of results (~10,000), I'm experiencing large delays between finishing my processing and the upload activity actually triggering.
I'm seeing the processing take ~3-4 minutes, then my "PersistResults" activity is scheduled - and only about 10 minutes later does the "PersistResults" activity actually run, taking about 20 seconds.
My guess is that the large payload passed to the "Persist" activity function is slowing it down a lot - though I've found nothing in the documentation about limits on activity input size, and the documentation certainly implies that any actual work I do (e.g. storing results) should be done in the activities to keep the orchestrator deterministic.
Actually uploading the results to storage appears to be fairly quick, given that when my activity does run, it only takes 20 seconds.
The final payload (all combined results) is roughly 50MB uncompressed - I notice that within the durable functions storage account it uses a compressed version for the activity input.
Is this significant delay expected when using durable functions like this?
Is there anything I've likely done to cause such a delay?
Is there anything I can do to prevent it?

Related

Dataflow batch vs streaming: window size larger than batch size

Let's say we have log data with timestamps that can either be streamed into BigQuery or stored as files in Google Storage, but not streamed directly to the unbounded collection source types that Dataflow supports.
We want to analyse this data based on timestamp, either relatively or absolutely, e.g. "how many hits in the last 1 hour?" and "how many hits between 3pm and 4pm on 5th Feb 2018"?
Having read the documentation on windows and triggers, we're still not clear how we would divide our incoming data into batches in a way that Dataflow supports if we want to use a large window - potentially we want to aggregate over the last day, 30 days, 3 months, etc.
For example, if our batched source is a BigQuery query, run every 5 mins, for the last 5 mins worth of data, will Dataflow keep the windows open between job runs, even though the data is arriving in 5 min chunks?
Similarly, if the log files are rotated every 5 mins, and we start Dataflow as a new file is saved to the bucket, the same question applies - is the job stopped and started, and all knowledge of previous jobs discarded, or does the large window (e.g. up to a month) remain open for new events?
How do we change/modify this pipeline without disturbing the existing state?
Apologies if these are basic questions, even a link to some docs would be appreciated.
It sounds like you want arbitrary interactive aggregation queries on your data. Beam / Dataflow are not a good fit for this per se; however, one of the most common use cases of Dataflow is to ingest data into BigQuery (e.g. from GCS files or from Pubsub), and BigQuery is a very good fit for that.
A few more comments on your question:
it's not clear how we would divide our incoming data into batches
Windowing in Beam is simply a way to specify the aggregation scope in the time dimension. E.g. if you're using sliding windows of size 15 minutes every 5 minutes, then a record whose event-time timestamp is 14:03 counts towards aggregations in three windows: 13:50..14:05, 13:55..14:10, 14:00..14:15.
So, just as you don't need to divide your incoming data into "keys" when grouping by a key (the data processing framework performs the group-by-key for you), you don't divide it into windows either (the framework performs the group-by-window implicitly as part of every aggregating operation).
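For concreteness, a minimal Beam Java sketch of that example, assuming a PCollection<LogEntry> named hits whose elements already carry event-time timestamps (assigned by the source or with WithTimestamps):

```java
import org.apache.beam.sdk.transforms.windowing.SlidingWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

// Assign each element to 15-minute windows that slide every 5 minutes.
PCollection<LogEntry> windowedHits = hits.apply(
    Window.<LogEntry>into(
        SlidingWindows.of(Duration.standardMinutes(15))
                      .every(Duration.standardMinutes(5))));
// Any downstream aggregation (GroupByKey, Combine, Count, ...) now runs per window,
// so an element with event time 14:03 contributes to the 13:50-14:05, 13:55-14:10
// and 14:00-14:15 windows, exactly as described above.
```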
will Dataflow keep the windows open between job runs
I'm hoping this is addressed by the previous point, but to clarify more: No. Stopping a Dataflow job discards all of its state. However, you can "update" a job with new code (e.g. if you've fixed a bug or added an extra processing step) - in that case state is not discarded, but I think that's not what you're asking.
if the log files are rotated every 5 mins, and we start Dataflow as a new file is saved
It sounds like you want to ingest data continuously. The way to do that is to write a single, continuously running streaming pipeline that ingests the data as it arrives, rather than starting a new pipeline every time new data shows up. In the case of files arriving in a bucket, you can use TextIO.read().watchForNewFiles() if you're reading text files, or its various analogues if you're reading some other kind of files (the most general is FileIO.matchAll().continuously()).
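A rough sketch of that continuous ingestion in Beam Java (the bucket path and poll interval are assumptions):

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Watch;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

Pipeline p = Pipeline.create(PipelineOptionsFactory.create());

// Keep picking up newly rotated log files as they land in the bucket.
PCollection<String> logLines = p.apply(
    TextIO.read()
        .from("gs://my-log-bucket/logs/*.log")   // assumed location
        .watchForNewFiles(
            Duration.standardMinutes(1),         // how often to poll for new files
            Watch.Growth.never()));              // never stop watching
```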

Speed up the process of requesting messages from SQS

We need to process a large number of messages stored in SQS (the messages originate from the Amazon store and SQS is the only place we can save them to) and save the result to our database. The problem is, SQS can only return 10 messages at a time. Considering we can have up to 300,000 messages in SQS, even if requesting and processing 10 messages takes little time, the whole process takes forever, with the main culprit being actually requesting and receiving the messages from SQS.
We're looking for a way to speed this up. The intended result would be dumping the results to our database. The process would probably run a few times per day (the number of messages would likely be less per run in that scenario).
Like Michael-sqlbot wrote, parallel requests were the solution. By rewriting our code to use async and making 10 requests at the same time, we managed to reduce the execution time to something much more reasonable.
I guess it's because I rarely use multithreading directly in my job that I hadn't thought of using it to solve this problem.
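The poster's code isn't shown, but the idea might look something like this in Java with the AWS SDK v2 async client (the queue URL and the number of concurrent requests are assumptions):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import software.amazon.awssdk.services.sqs.SqsAsyncClient;
import software.amazon.awssdk.services.sqs.model.Message;
import software.amazon.awssdk.services.sqs.model.ReceiveMessageRequest;
import software.amazon.awssdk.services.sqs.model.ReceiveMessageResponse;

public class ParallelSqsReceiver {
  private static final String QUEUE_URL =
      "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"; // assumed queue

  public static List<Message> receiveBatch(SqsAsyncClient sqs, int parallelRequests) {
    // Fire N receive calls at once instead of waiting for each one to return.
    List<CompletableFuture<ReceiveMessageResponse>> inFlight = new ArrayList<>();
    for (int i = 0; i < parallelRequests; i++) {
      ReceiveMessageRequest request = ReceiveMessageRequest.builder()
          .queueUrl(QUEUE_URL)
          .maxNumberOfMessages(10) // SQS hard limit per call
          .waitTimeSeconds(5)      // long polling to avoid empty responses
          .build();
      inFlight.add(sqs.receiveMessage(request));
    }

    // Wait for all of them and collect the messages.
    List<Message> messages = new ArrayList<>();
    for (CompletableFuture<ReceiveMessageResponse> future : inFlight) {
      messages.addAll(future.join().messages());
    }
    return messages;
  }

  public static void main(String[] args) {
    // Picks up region and credentials from the environment.
    try (SqsAsyncClient sqs = SqsAsyncClient.create()) {
      List<Message> batch = receiveBatch(sqs, 10); // 10 concurrent requests, up to 100 messages
      System.out.println("Received " + batch.size() + " messages");
      // ... process and delete the messages here ...
    }
  }
}
```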

How can you replay old data into dataflow via pub/sub and maintain correct event time logic?

We're trying to use dataflow's processing-time independence to start up a new streaming job and replay all of our data into it via Pub/Sub but are running into the following problem:
The first stage of the pipeline is a groupby on a transaction id, with a session window of 10s, discarding fired panes and no allowed lateness. So if we don't specify the timestampLabel of our replay pub/sub topic, then when we replay into pub/sub all of the event timestamps are the same, and the groupby tries to group all of our archived data into transaction IDs for all time. No good.
If we set the timestampLabel to be the actual event timestamp from the archived data, and replay say 1d at a time into the pub/sub topic then it works for the first day's worth of events, but then as soon as those are exhausted the data watermark for the replay pub/sub somehow jumps forward to the current time, and all subsequent replayed days are dropped as late data. I don't really understand why that happens, as it seems to violate the idea that dataflow logic is independent of the processing time.
If we set the timestampLabel to be the actual event timestamp from the archived data, and replay all of it into the pub/sub topic, and then start the streaming job to consume it, the data watermark never seems to advance, and nothing ever seems to come out of the groupby. I don't really understand what's going on with that either.
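For reference, the first stage described above might look roughly like this in Beam Java (the topic, the timestamp attribute name, and the way the transaction id is extracted are all assumptions; in the older Dataflow SDK the attribute was set via timestampLabel):

```java
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.windowing.Sessions;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

// p is the Pipeline; event time is taken from a message attribute, not publish time.
PCollection<KV<String, Iterable<String>>> perTransaction = p
    .apply(PubsubIO.readStrings()
        .fromTopic("projects/my-project/topics/replay")   // assumed topic
        .withTimestampAttribute("eventTimestamp"))        // assumed attribute name
    .apply("KeyByTransactionId", ParDo.of(new DoFn<String, KV<String, String>>() {
      @ProcessElement
      public void processElement(ProcessContext c) {
        // Assumed format: the transaction id is the first comma-separated field.
        c.output(KV.of(c.element().split(",", 2)[0], c.element()));
      }
    }))
    .apply(Window.<KV<String, String>>into(
            Sessions.withGapDuration(Duration.standardSeconds(10)))
        .withAllowedLateness(Duration.ZERO)
        .discardingFiredPanes())
    .apply(GroupByKey.create());
```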
Your approaches #2 and #3 are suffering from different issues:
Approach #3 (write all data, then start consuming): Since data is written to the pubsub topic out of order, the watermark really cannot advance until all (or most) of the data is consumed - because the watermark is a soft guarantee that "further items you receive are unlikely to have an event time earlier than this", but due to out-of-order publishing there is no correspondence whatsoever between publish time and event time. So your pipeline is effectively stuck until it finishes processing all this data.
Approach #2: technically it suffers from the same problem within each particular day, but I suppose the amount of data within one day is not that large, so the pipeline is able to process it. However, after that, the pubsub channel stays empty for a long time, and in that case the current implementation of PubsubIO will advance the watermark to real time, which is why further days of data are declared late. The documentation explains this some more.
In general, quickly catching up with a large backlog, e.g. by using historic data to "seed" the pipeline and then continuing to stream in new data, is an important use case that we currently don't support well.
Meanwhile I have a couple of recommendations for you:
(better) Use a variation on approach #2, but try timing it against the streaming pipeline so that the pubsub channel doesn't stay empty.
Use approach #3, but with more workers and more disk per worker (your current job appears to be using autoscaling with max 8 workers - try something much larger, like 100? It will downscale after it catches up)
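If it helps, those knobs are exposed as Dataflow pipeline options; a sketch (the values are illustrative only):

```java
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

// Equivalent to passing --maxNumWorkers=100 --diskSizeGb=250 on the command line.
DataflowPipelineOptions options =
    PipelineOptionsFactory.fromArgs(args).withValidation().as(DataflowPipelineOptions.class);
options.setMaxNumWorkers(100); // raise the autoscaling ceiling well above the current 8
options.setDiskSizeGb(250);    // give each worker more local disk for the backlog
```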

Making use of workers in Custom Sink

I have a custom sink which will publish the final result from a pipeline to a repository.
I am getting the inputs for this pipeline from BigQuery and GCS.
The custom writer present in the sink is called for each bundle, on all of the workers. The custom writer just collects the objects to be pushed and returns them as part of the WriteResult. Finally, I merge these records in CustomWriteOperation.finalize() and push the merged result into my repository.
This works fine for smaller files. But my repository will not accept the result if it is larger than 5 MB, and it will not accept more than 20 writes per hour.
If I push the result from each worker, the writes-per-hour limit will be violated. If I write it all in CustomWriteOperation.finalize(), it may violate the size limit, i.e. 5 MB.
The current approach is to write in chunks in CustomWriteOperation.finalize(). As finalize() is not executed on many workers, this might delay my job. How can I make use of workers in finalize(), and how can I specify the number of workers to be used inside a pipeline for a specific job (i.e. the write job)?
Or is there any better approach?
The sink API doesn't explicitly allow tuning of bundle size.
One workaround might be to use a ParDo to group records into bundles. For example, you can use a DoFn to randomly assign each record a key between 1 and N. You could then use a GroupByKey to group the records into KV<Integer, Iterable<Records>>. This should produce N groups of roughly the same size.
As a result, an invocation of Sink.Writer.write could write all the records with the same key at once and since write is invoked in parallel the bundles would be written in parallel.
However, since a given KV pair could be processed multiple times or in multiple workers at the same time, you will need to implement some mechanism to create a lock so that you only try to write each group of records once.
You will also need to handle failures and retries.
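A sketch of that bundling idea in Beam-style Java (Record, records and numShards are assumptions):

```java
import java.util.concurrent.ThreadLocalRandom;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

// Shard the records into roughly equal bundles; pick numShards so each bundle
// stays under the repository's per-write size limit.
final int numShards = 20;
PCollection<KV<Integer, Iterable<Record>>> bundles = records
    .apply("AssignRandomShard", ParDo.of(new DoFn<Record, KV<Integer, Record>>() {
      @ProcessElement
      public void processElement(ProcessContext c) {
        c.output(KV.of(ThreadLocalRandom.current().nextInt(numShards), c.element()));
      }
    }))
    .apply(GroupByKey.create());
// Each KV's Iterable<Record> can then be written with a single call, keeping in mind
// the caveats above about locking, duplicate attempts, failures and retries.
```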
So, if I understand correctly, you have a repository that
Accepts no more than X write operations per hour (I suppose if you try to do more, you get an error from the API you're writing to), and
Each write operation can be no bigger than Y in size (with similar error reporting).
That means it is not possible to write more than X*Y data in 1 hour, so I suppose, if you want to write more than that, you would want your pipeline to wait longer than 1 hour.
Dataflow currently does not provide built-in support for enforcing either of these limits, however it seems like you should be able to simply do retries with randomized exponential back-off to get around the first limitation (here's a good discussion), and it only remains to make sure individual writes are not too big.
Limiting individual writes can be done in your Writer class in the custom sink. You can maintain a buffer of records, have write() add to the buffer and flush it by issuing the API call (with exponential back-off, as mentioned) once the buffer gets just below the allowed write size, and flush one more time in close().
This way you will write bundles that are as big as possible but not bigger, and if you add retry logic, throttling will also be respected.
Overall, this seems to fit well in the Sink API.
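A plain-Java sketch of that buffering idea (MAX_BYTES, the retry policy and pushToRepository() are assumptions standing in for the real limit and client call):

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical buffering writer: write() fills a buffer, flush() pushes it with
// randomized exponential back-off.
class BufferedRepositoryWriter {
  private static final long MAX_BYTES = 5L * 1024 * 1024; // stay just under the 5 MB limit
  private final List<byte[]> buffer = new ArrayList<>();
  private long bufferedBytes = 0;

  void write(byte[] record) throws IOException, InterruptedException {
    if (bufferedBytes + record.length > MAX_BYTES) {
      flush(); // about to exceed the write-size limit, so issue one API call now
    }
    buffer.add(record);
    bufferedBytes += record.length;
  }

  void close() throws IOException, InterruptedException {
    if (!buffer.isEmpty()) {
      flush(); // flush one more time for the final partial bundle
    }
  }

  private void flush() throws IOException, InterruptedException {
    long backoffMillis = 1000;
    for (int attempt = 0; attempt < 8; attempt++) {
      try {
        pushToRepository(buffer);
        buffer.clear();
        bufferedBytes = 0;
        return;
      } catch (IOException throttled) {
        // Randomized exponential back-off so parallel writers don't retry in lockstep.
        Thread.sleep(backoffMillis + ThreadLocalRandom.current().nextLong(backoffMillis));
        backoffMillis *= 2;
      }
    }
    throw new IOException("Giving up after repeated throttled writes");
  }

  private void pushToRepository(List<byte[]> records) throws IOException {
    // Assumed: one call to the target system's API with this batch of records.
  }
}
```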
I am working with Sam on this and here are the actual limits imposed by our target system: 100 GB per api call, and max of 25 api calls per day.
Given these limits, the retry method with back-off logic may cause the upload to take many days to complete, since we don't have control over the number of workers.
Another approach would be to leverage FileBasedSink to write many files in parallel. Once all these files are written, finalize (or copyToOutputFiles) can combine files until the total size reaches 100 GB and push each combined chunk to the target system. This way we leverage parallelization from the writer threads, and honor the limit of the target system.
Thoughts on this, or any other ideas?
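For what it's worth, the parallel file-writing half of that idea might look roughly like this in Beam Java (the staging bucket and shard count are assumptions); combining the shards into chunks of at most 100 GB and pushing them would then happen in a separate, serial step:

```java
import org.apache.beam.sdk.io.TextIO;

// `results` is an existing PCollection<String> of serialized records.
// Stage them as many shards, written in parallel by the workers.
results.apply("WriteShards",
    TextIO.write()
        .to("gs://my-staging-bucket/output/results")   // assumed staging location
        .withSuffix(".json")
        .withNumShards(500));                          // assumed shard count
```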

Parse large and multiple fetches for statistics

In my app I want to retrieve a large amount of data from Parse to build a view of statistics. However, in future, as data builds up, this may be a huge amount.
For example, 10,000 results. Even if I fetched in batches of 1,000 at a time, this would mean 10 fetches. That could quickly send me over Parse's 30-requests-per-second limit, especially when several other chunks of data may need to be collected at the same time for other stats.
Any recommendations/tips/advice for this scenario?
You will also run into limits with the skip and limit query variables. And heavy weight lifting on a mobile device could also present issues for you.
If you can, you should pre-aggregate these statistics, perhaps once per day, so that you can simply request them directly.
Alternatively, create a cloud code function to do some processing for you and return the results. Again, you may well run into limits here, so a cloud job may meet your needs better; you may then need to effectively create a request object which is processed by the job, and then either poll for completion or send out push notifications on completion.
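As an illustration of the pre-aggregation route, a sketch with the Parse Android (Java) SDK querying a hypothetical DailyStats class that a scheduled job fills in once per day (the class and field names are assumptions):

```java
import com.parse.FindCallback;
import com.parse.ParseException;
import com.parse.ParseObject;
import com.parse.ParseQuery;
import java.util.Date;
import java.util.List;

// Fetch ~30 pre-aggregated rows instead of paging through 10,000 raw records.
Date thirtyDaysAgo = new Date(System.currentTimeMillis() - 30L * 24 * 60 * 60 * 1000);
ParseQuery<ParseObject> query = ParseQuery.getQuery("DailyStats");   // assumed class
query.whereGreaterThanOrEqualTo("date", thirtyDaysAgo);
query.orderByAscending("date");
query.setLimit(31);
query.findInBackground(new FindCallback<ParseObject>() {
  @Override
  public void done(List<ParseObject> stats, ParseException e) {
    if (e != null) {
      return; // handle the error
    }
    for (ParseObject day : stats) {
      long hits = day.getLong("hitCount"); // assumed field written by the aggregation job
      // ... feed into the statistics view ...
    }
  }
});
```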
