My Azure Durable Function (Runtime V3) receives an average of 3M events per day. After it has been running for two or three weeks, it gets slower and slower. When I delete the two storage tables (History and Instances) used by the Durable Functions framework, performance recovers and it works as expected. I host my function app on the Consumption plan. Inside my function app I also use Durable Entities, and my code uses sub-orchestrators for the fan-out mechanism.
Is this problem expected under a heavy workload? Do I need to clear those storage tables from time to time, or do I need to delete the state of completed entities inside my Durable Entity function?
Someone, please help me
Yes, you should perform periodic clean-ups yourself by calling the PurgeInstanceHistoryAsync method. See a similar post on how to do this: https://stackoverflow.com/a/60894392
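As a minimal sketch (assuming the .NET in-process model with Durable Functions 2.x; the daily schedule and the 30-day retention window are illustrative choices), the purge can be run from a timer-triggered function:

```csharp
using System;
using System.Threading.Tasks;
using DurableTask.Core;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.DurableTask;

public static class HistoryCleanup
{
    // Runs daily at 00:30 UTC and purges the history of orchestrations
    // that completed more than 30 days ago.
    [FunctionName("PurgeHistory")]
    public static Task Run(
        [TimerTrigger("0 30 0 * * *")] TimerInfo timer,
        [DurableClient] IDurableOrchestrationClient client)
    {
        return client.PurgeInstanceHistoryAsync(
            DateTime.MinValue,
            DateTime.UtcNow.AddDays(-30),
            new[] { OrchestrationStatus.Completed });
    }
}
```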
Also review any loops or Monitor patterns that you may have in your code.
Any looping logic (foreach, for, or while loops) in an orchestrator will be replayed from its initial startup state. While the Durable Functions replay architecture is very efficient at doing this, the code we write may not be optimised for repeated iteration.
The Durable Monitor pattern is almost an anti-pattern. The concept is fine, but it is easily misinterpreted and open to abuse. It is designed for a low-frequency loop that polls an endpoint either for a set number of iterations, until a finite time, or until the state of the endpoint being monitored changes. That state change is the trigger to perform the rest of the operation.
It is NOT an example of how to use general or high-frequency looping structures in Durable Functions.
It is NOT an example of how to implement a traditional HTTP endpoint-polling monitor in an infinite-loop (while(true)) style, perhaps to record changes into a data store over time.
If your durable function logic has an iterator that may involve many iterations, consider migrating the iteration step to a sub-orchestration that uses the Eternal Orchestration pattern
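A minimal sketch of that pattern in C# (the function and activity names are illustrative): each generation performs one unit of work and then restarts itself with ContinueAsNew, so the replay history never grows.

```csharp
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.DurableTask;

public static class EternalIteration
{
    [FunctionName("IterateEternally")]
    public static async Task Run(
        [OrchestrationTrigger] IDurableOrchestrationContext context)
    {
        int iteration = context.GetInput<int>();

        // One unit of work per generation instead of an in-process loop.
        await context.CallActivityAsync("ProcessBatch", iteration);

        if (iteration < 100) // illustrative stop condition
        {
            // Restart the orchestration with a fresh history, so replay
            // never has to walk through the earlier iterations.
            context.ContinueAsNew(iteration + 1);
        }
    }
}
```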
There are many usages of the Fuseable interface in the Reactor source code, but I can't find any reference explaining what it is. Could someone explain its purpose?
The Fuseable interface, and the interfaces nested within it, define the contracts used for stream fusion. Stream fusion is a reactive-streams optimisation.
Without any such optimisation (in "normal" execution if you will), each reactive operator:
Subscribes to a previous operator in the chain
Is notified when the operator it subscribed to emits or completes
Performs its operation
Notifies its subscribers
...and then the cycle repeats for all operators. This is fantastic for making sure everything stays non-blocking, but all of those asynchronous calls come with some amount of overhead.
"Stream fusion" (or "operator fusion") significantly reduces this overhead by performing two or more of the operations in one chunk (fusing them together as one unit), passing values between them using a Queue or similar rather than via subscriptions, eliminating this overhead. It's not always possible of course - it can't be done this way if running in parallel, when certain side effects come into play, etc. - but a neat optimisation when it is possible.
Are there any guidelines or limitations for using stateful processing and timers with the Beam Dataflow runner (as of v2.1.0)? Things such as limitations on the size of state or frequency of updates etc.? The candidate streaming pipeline would use state and timers extensively for user session state, with Bigtable as durable storage.
Here is some general advice for your use case:
Please aggregate multiple elements before setting a timer; don't create a timer per element, which would be excessive.
Try to aggregate state instead of accumulating large amounts of it, e.g. keep a running sum and count instead of storing every number when computing a mean (see the sketch after this list).
Please consider session windows for this use case.
In Dataflow, state is not supported for merging windows (such as session windows); the Beam model does support it.
Please use the state type that matches your access pattern, e.g. BagState for blind writes.
Here is an informative blog post with some more information on state: "Stateful processing with Apache Beam."
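Here is a minimal Beam Java sketch of the sum-and-count advice (assuming a keyed PCollection<KV<String, Double>>; the class and state names are illustrative):

```java
import org.apache.beam.sdk.state.CombiningState;
import org.apache.beam.sdk.state.StateSpec;
import org.apache.beam.sdk.state.StateSpecs;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.Sum;
import org.apache.beam.sdk.values.KV;

// Keeps only a running sum and count per key instead of buffering
// every element, then emits the running mean.
class RunningMeanFn extends DoFn<KV<String, Double>, KV<String, Double>> {

  @StateId("sum")
  private final StateSpec<CombiningState<Double, double[], Double>> sumSpec =
      StateSpecs.combining(Sum.ofDoubles());

  @StateId("count")
  private final StateSpec<CombiningState<Long, long[], Long>> countSpec =
      StateSpecs.combining(Sum.ofLongs());

  @ProcessElement
  public void process(
      ProcessContext ctx,
      @StateId("sum") CombiningState<Double, double[], Double> sum,
      @StateId("count") CombiningState<Long, long[], Long> count) {
    sum.add(ctx.element().getValue());
    count.add(1L);
    ctx.output(KV.of(ctx.element().getKey(), sum.read() / count.read()));
  }
}
```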
I'm new to working with CloudKit and database fetching, and I've looked at the CKDatabaseOperation calls, so I'm trying to understand the real differences between adding an operation to a database and using "normal" function calls on that database, given that they both produce more or less the same results.
Why would adding an operation be more desirable over a function call and in what situations?
Thanks for helping me understand this. I'm trying to learn as much as I can about Swift.
Overview:
In CloudKit, most tasks can be done in 2 ways:
Convenience APIs (functions with completion handlers)
Operations
1. Convenience APIs
Advantages:
As the name implies, they are convenient to use
Disadvantages:
Usually require more server requests.
Can't build dependencies between operations.
2. Operations:
Advantages:
More configurable, with more options.
Require fewer server requests (better for your server-request quota).
Built using Operation, so you get all the capabilities of Operation, such as dependencies (you will need them in a real app).
Disadvantages:
Not as convenient to use; you need to create the operation yourself. It takes a little more time to code, but it is well worth it.
Example 1 (Fetch):
If you use CKDatabase.fetch, you would need to specify the record IDs that you want to fetch.
If you use CKQueryOperation, you can query based on field values.
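For instance, a brief sketch (the "Photo" record type and "albumID" field are made up for illustration, and this uses the pre-iOS 15 callback API):

```swift
import CloudKit

let database = CKContainer.default().publicCloudDatabase

// Query by field value rather than by record ID.
let predicate = NSPredicate(format: "albumID == %@", "summer-holiday")
let query = CKQuery(recordType: "Photo", predicate: predicate)
let queryOperation = CKQueryOperation(query: query)

queryOperation.recordFetchedBlock = { record in
    print("Fetched:", record.recordID.recordName)
}
queryOperation.queryCompletionBlock = { _, error in
    if let error = error {
        print("Query failed:", error)
    }
}
database.add(queryOperation)
```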
Example 2 (Save / Update):
If you use CKDatabase.save, you can save one record per function call, and each call results in a separate server request. If you want to save 200 records, you would have to run it in a loop, making 200 server requests, which is not very efficient. CloudKit also limits the number of server requests you can make per second, so you would exhaust your quota very quickly.
If you use CKModifyRecordsOperation, you can save all 200 records at once* by passing them as an array, so you would make far fewer server requests.
*Note: the server imposes a limit on the number of records it can save in one request, but this is still far better than creating a separate request to save each record.
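A short sketch of the batch save (again with a made-up "Photo" record type and the pre-iOS 15 callback API):

```swift
import CloudKit

let database = CKContainer.default().privateCloudDatabase

// Build 200 records locally, then save them in a single batch request
// instead of 200 separate CKDatabase.save calls.
let records: [CKRecord] = (1...200).map { index in
    let record = CKRecord(recordType: "Photo")
    record["index"] = index as NSNumber
    return record
}

let saveOperation = CKModifyRecordsOperation(recordsToSave: records,
                                             recordIDsToDelete: nil)
saveOperation.modifyRecordsCompletionBlock = { saved, _, error in
    if let error = error {
        print("Batch save failed:", error)
    } else {
        print("Saved \(saved?.count ?? 0) records")
    }
}
database.add(saveOperation)
```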
Reference:
https://developer.apple.com/library/content/documentation/DataManagement/Conceptual/CloudKitQuickStart/Introduction/Introduction.html#//apple_ref/doc/uid/TP40014987-CH1-SW1
Watch the WWDC CloudKit videos.
It might also help to watch the WWDC videos about Operation (previously referred to as NSOperation).
To our streaming pipeline, we want to submit unique GCS files, with each file containing multiple event records and each event containing a key (for example, device_id). As part of the processing, we want to shuffle by this device_id so as to achieve some form of worker-to-device_id affinity (more background on why we want to do this is in another SO question). Once all events from the same file are complete, we want to reduce (GroupBy) by their source GCS file (which we will make a property of the event itself, something like file_id) and finally write the output to GCS (could be multiple files).
The reason we want to do the final GroupBy is because we want to notify an external service once a specific input file has completed processing. The only problem with this approach is that since the data is shuffled by the device_id and then grouped at the end by the file_id, there is no way to guarantee that all data from a specific file_id has completed processing.
Is there something we could do about it? I understand that Dataflow provides exactly-once guarantees, which means all the events will eventually be processed, but is there a way to set a deterministic trigger that says all data for a specific key has been grouped?
EDIT
I wanted to highlight the broader problem we are facing here. The ability to mark file-level completeness would help us checkpoint different stages of the data as seen by external consumers. For example, this would allow us to trigger per-hour or per-day completeness markers, which are critical for us to generate reports for that window. Given that these stages/barriers (hour/day) are clearly defined on the input (GCS files are date/hour partitioned), it is only natural to expect the same of the output. But with Dataflow's model, this seems impossible.
Similarly, although Dataflow guarantees exactly-once processing, there will be cases where the entire pipeline needs to be restarted because something went horribly wrong. In those cases, it is almost impossible to restart from the correct input marker, since there is no guarantee that what was already consumed has been completely flushed out. The DRAIN mode tries to achieve this but, as mentioned, if the entire pipeline is broken and draining itself cannot make progress, there is no way to know which part of the source should be the starting point.
We are considering using Spark, since its micro-batch-based streaming model seems to fit better. We would still like to explore Dataflow if possible, but it seems we won't be able to achieve this without storing these checkpoints externally from within the application. If there is an alternative way of providing these guarantees from Dataflow, that would be great. The idea behind broadening this question was to see whether we are missing an alternative perspective that would solve our problem.
Thanks
This is actually tricky. Neither Beam nor Dataflow have a notion of a per-key watermark, and it would be difficult to implement that level of granularity.
One idea would be to use a stateful DoFn instead of the second shuffle. This DoFn would need to receive the number of elements expected in the file (from either a side input or some special value on the main input). It could then count the number of elements it has processed, and only signal that everything has been processed once it has seen that number of elements.
This would be assuming that the expected number of elements can be determined ahead of time, etc.
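Here is a rough Beam Java sketch of that idea, keyed by file_id (the Event type and its expectedCount field are hypothetical; in practice the expected count would be stamped onto each event when its file is parsed):

```java
import org.apache.beam.sdk.state.CombiningState;
import org.apache.beam.sdk.state.StateSpec;
import org.apache.beam.sdk.state.StateSpecs;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.Sum;
import org.apache.beam.sdk.values.KV;

// Hypothetical event type: a payload plus the total number of events
// in its source file, recorded when the file was parsed.
class Event implements java.io.Serializable {
  String payload;
  long expectedCount;
}

// Keyed by file_id; emits the file_id once the expected number of
// elements for that file has been seen.
class FileCompletionFn extends DoFn<KV<String, Event>, String> {

  @StateId("seen")
  private final StateSpec<CombiningState<Long, long[], Long>> seenSpec =
      StateSpecs.combining(Sum.ofLongs());

  @ProcessElement
  public void process(
      ProcessContext ctx,
      @StateId("seen") CombiningState<Long, long[], Long> seen) {
    seen.add(1L);
    if (seen.read() == ctx.element().getValue().expectedCount) {
      ctx.output(ctx.element().getKey()); // this file is complete
    }
  }
}
```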
I have a server application made in Erlang. In it I have an mnesia table
that store some information on photos. In the spirit of "everything is a
process" I decided to wrap that table in a gen_server module, so that the
gen_server module is the only one that directly accesses the table. Querying
and adding information to that table is done by sending messages to that process
(which has a registered name). The idea is that there will be several client
processes querying information from that table.
This works just fine, but that gen_server module has no state. Everything it
requires is stored in the mnesia table. So, I wonder if a gen_server is perhaps
not the best model for encapsulating that table?
Should I simply not make it a process, and instead only encapsulate the table
through the functions in that module? In case of a bug in that module, that
would cause the calling process to crash, which I think might be better, because
it would only affect a single client, as opposed to now, when it would cause the
gen_server process to crash, leaving everyone without access to the table (until
the supervisor restarts it).
Any input is greatly appreciated.
I guess, by Occam's razor, there is no need for this gen_server to exist, especially since there is absolutely no state stored in it. Such a process could be needed in situations where you want access to the table (or any other resource) to be strictly sequential (for example, you might want to avoid any aborted transactions at the cost of a bottleneck).
Encapsulating access to the table in a module is a good solution. It creates no additional complexity, while providing proper level of abstraction and encapsulation.
I'm not sure I understand why you've decided to encapsulate a table with a process. Mnesia is designed to mediate multiple concurrent accesses to tables, both locally and distributed across a cluster.
Creating an API module that performs all the particular table access operations and updates is a good idea as the API functions will convey your intent better in the code that calls them. It will be more readable than putting the mnesia operations directly into the calling code.
An API module also gives you the option to switch from mnesia to some other storage system later if you need to. Using mnesia transactions inside your API module protects you from some programming errors, as mnesia will roll back operations that crash. The API module will always be available to callers and allows any number of callers to perform operations concurrently, whereas a gen_server-based API has a single point of failure, the process, which can render the API unavailable.
The only thing a gen_server-based API gives you over a purely functional API is serialized access to the table. That is an unusual requirement, and unless you specifically need it, it will be a performance killer.
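To make the API-module suggestion concrete, here is a minimal Erlang sketch with no process in between: each caller runs the transaction in its own process, so a crash only affects that caller (the module, table, and record names are illustrative).

```erlang
-module(photo_db).
-export([add_photo/2, get_photo/1]).

-record(photo, {id, info}).

%% Write inside a transaction; mnesia rolls back if the fun crashes.
add_photo(Id, Info) ->
    mnesia:transaction(fun() ->
        mnesia:write(#photo{id = Id, info = Info})
    end).

%% Read a single photo record by key.
get_photo(Id) ->
    case mnesia:transaction(fun() -> mnesia:read(photo, Id) end) of
        {atomic, [#photo{info = Info}]} -> {ok, Info};
        {atomic, []}                    -> {error, not_found};
        {aborted, Reason}               -> {error, Reason}
    end.
```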
It may be a good idea to handle an mnesia table through a single gen_server process when you want to use dirty access and avoid transactions. This approach might be faster than transactions, but as usual, you need to benchmark it.