I know this question is weird, but I am not 100% sure whether it is possible or not. I need expert advice.
I am using this architecture (see Fig 1): there is an MVC WebAPI which puts data into an Azure Queue, and the queue then triggers an Azure Function to perform tasks that are individually small but very large in number, e.g. the queue sends 5k-10k requests to the Azure Function per minute.
Fig 1
We want to remove the Azure Function because it costs us a lot, and we want an alternative. To this end, someone shared the idea of replacing the Azure Function with another MVC WebAPI (see Fig 2).
Fig 2
Is the above architecture possible? If yes, then how? If not, can anyone please suggest an alternative?
When using Azure Functions with a Storage Queue trigger, Azure Functions will scale out based on the load on the queue. By default, batchSize is set to 16. The setting can be configured via host.json:
The number of queue messages that the Functions runtime retrieves simultaneously and processes in parallel. When the number being processed gets down to the newBatchThreshold, the runtime gets another batch and starts processing those messages. So the maximum number of concurrent messages being processed per function is batchSize plus newBatchThreshold. This limit applies separately to each queue-triggered function.
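For reference, here is a minimal host.json sketch showing where these knobs live on the v2+ runtime (on v1 the queues block sits at the top level); the values shown are the documented defaults:

```json
{
  "version": "2.0",
  "extensions": {
    "queues": {
      "batchSize": 16,
      "newBatchThreshold": 8
    }
  }
}
```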
This setting alone might not be sufficient when the number of messages is substantial. In that case, you want to restrict the scale-out behaviour, i.e. the number of VMs used to execute the Function App. The setting is an App Setting, WEBSITE_MAX_DYNAMIC_APPLICATION_SCALE_OUT. Setting it to 1 would prevent any scale-out to new VMs, but according to the documentation:
This setting is a preview feature - and only reliable if set to a value <= 5
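If it helps, the app setting can be applied with the Azure CLI (the app and resource-group names are placeholders):

```
az functionapp config appsettings set \
  --name <function-app-name> \
  --resource-group <resource-group> \
  --settings WEBSITE_MAX_DYNAMIC_APPLICATION_SCALE_OUT=1
```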
While your focus is on the cost of processing, take time into consideration as well. Unless it's OK to wait a long time for the messages to get processed, you're likely to have other alternatives to Functions. But the trade-off between the cost and the time to process will always be there.
Related
I want to trigger a procedure in a Snowflake warehouse to load a file from Azure Blob Storage. For that I have implemented the Snowflake connector as an Azure Function running on the Consumption plan (dynamic). But the Consumption plan has a default timeout of 5 minutes and a maximum timeout of 10 minutes, while my data is around 50 GB and takes about 20 minutes with a medium-size Snowflake cluster. So is there any other way to achieve this?
If you want to get rid of this limitation, you have multiple solutions.
First, you can use a timer trigger to wake the function up before it times out. The timer trigger is periodic, and its period should be less than your function's timeout.
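A minimal sketch of such a timer-triggered function; the 4-minute CRON schedule is an assumption, chosen to fire inside the 5-minute default timeout:

```csharp
using System;
using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;

public static class KeepAlive
{
    // Fires every 4 minutes, i.e. within the 5-minute default timeout window.
    [FunctionName("KeepAlive")]
    public static void Run([TimerTrigger("0 */4 * * * *")] TimerInfo timer, ILogger log)
    {
        log.LogInformation($"Timer fired at {DateTime.UtcNow:o}");
    }
}
```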
Second, because the timeout limit comes from the hosting plan, you can change your plan to raise the limit:
In a serverless Consumption plan, the valid range is from 1 second to 10 minutes, and the default value is 5 minutes.
In the Premium plan, the valid range is from 1 second to 60 minutes, and the default value is 30 minutes.
In a Dedicated (App Service) plan, there is no overall limit, and the default value is 30 minutes. A value of -1 indicates unbounded execution, but keeping a fixed upper bound is recommended.
Related documentation: https://learn.microsoft.com/en-us/azure/azure-functions/functions-host-json#functiontimeout
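For reference, the timeout itself is set in host.json; a minimal sketch raising a Consumption-plan app to its 10-minute maximum:

```json
{
  "version": "2.0",
  "functionTimeout": "00:10:00"
}
```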
Third, use Durable Functions. Under the Consumption plan, ordinary out-of-the-box functions run for up to 10 minutes, but Durable Functions orchestrations have no such restriction, because the orchestrator checkpoints its progress and replays rather than running in one long execution. Durable Functions also introduce support for stateful execution, which means that subsequent calls to the same function can share local variables and static members. This is an extension of the normal out-of-the-box function model, and it requires some additional boilerplate code to make all functions work as expected.
More details about Durable Functions: https://learn.microsoft.com/en-us/azure/azure-functions/durable/durable-functions-overview?tabs=csharp
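A minimal C# sketch of the Durable approach, splitting the 50 GB load into chunks so that each activity finishes within the Consumption-plan limit; the activity names and chunking scheme are assumptions, not something from the original question:

```csharp
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.DurableTask;

public static class SnowflakeLoad
{
    // The orchestration as a whole can run far longer than 10 minutes,
    // because it checkpoints and replays between awaits.
    [FunctionName("LoadOrchestrator")]
    public static async Task RunOrchestrator(
        [OrchestrationTrigger] IDurableOrchestrationContext context)
    {
        var chunks = await context.CallActivityAsync<string[]>("ListChunks", null);
        foreach (var chunk in chunks)
        {
            // Each activity call must individually finish within the plan's
            // timeout, so keep the chunks small.
            await context.CallActivityAsync("LoadChunk", chunk);
        }
    }

    [FunctionName("ListChunks")]
    public static Task<string[]> ListChunks([ActivityTrigger] object input)
    {
        // Hypothetical: enumerate the source blob into manageable chunk names.
        return Task.FromResult(new[] { "chunk-000", "chunk-001" });
    }

    [FunctionName("LoadChunk")]
    public static Task LoadChunk([ActivityTrigger] string chunk)
    {
        // Hypothetical: call the Snowflake procedure for a single chunk here.
        return Task.CompletedTask;
    }
}
```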
We have a use case in streaming mode where we want to keep track of a counter in BigTable from the pipeline (e.g., the number of items that finished processing), for which we need the increment operation. From looking at https://cloud.google.com/bigtable/docs/dataflow-hbase, I see that the append/increment operations of the HBase API are not supported by this client. The stated reason is the retry logic in batch mode, but if Dataflow guarantees exactly-once, why would supporting it be a bad idea, since I know for sure the increment was called only once? I want to understand what part I am missing.
Also, is CloudBigTableIO usable in streaming mode, or is it tied to batch mode only? I guess we could use the BigTable HBase client directly in the pipeline, but the connector seems to have nice properties like connection pooling which we would like to leverage, hence the question.
The way that Dataflow (and other systems) offer the appearance of exactly-once execution in the presence of failures and retries is by requiring that side effects (such as mutating BigTable) are idempotent. A "write" is idempotent because it is overwritten on retry. Inserts can be made idempotent by including a deterministic "insert ID" that deduplicates the insert.
For an increment, that is not the case. It is not supported because it would not be idempotent when retried, so it would not support exactly-once execution.
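To make the retry argument concrete, here is a toy sketch; the store API is hypothetical, not BigTable's. Replaying an overwrite converges to the same state, while replaying an increment drifts:

```csharp
using System;
using System.Collections.Generic;

// Toy store, for illustration only: why retries are safe for overwrites
// but not for increments.
class ToyStore
{
    private readonly Dictionary<string, long> cells = new Dictionary<string, long>();

    public void Put(string key, long value) => cells[key] = value;
    public void Increment(string key, long delta) =>
        cells[key] = cells.GetValueOrDefault(key) + delta;
    public long Get(string key) => cells.GetValueOrDefault(key);
}

class Program
{
    static void Main()
    {
        var store = new ToyStore();

        store.Put("count", 5);
        store.Put("count", 5);       // retried write: still 5 -> idempotent
        Console.WriteLine(store.Get("count")); // 5

        store.Increment("count", 1);
        store.Increment("count", 1); // retried increment: 7, not 6 -> not idempotent
        Console.WriteLine(store.Get("count")); // 7
    }
}
```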
CloudBigTableIO is usable in streaming mode. We had to implement a DoFn rather than a Sink in order to support that via the Dataflow SDK.
I have a Dataflow job that communicates with external resources. The problem is that these external resources are slower than the Dataflow job, which causes them to be constantly saturated. I need some way to reduce the quantity of messages read from Pub/Sub, or something to reduce the throughput of the job, in order to reduce the traffic to the external resources.
Thanks.
We currently do not support throttling primitives (such as "make sure this DoFn is throttled to at most X calls per second over the whole job"), however we know it is an important use case and it will most likely be supported sooner or later.
Meanwhile your best bet is, as Ryan said, to limit the number of workers and worker threads: specify --numWorkers (or --maxNumWorkers if you are using autoscaling) and --numberOfWorkerHarnessThreads. However, note that this will lead to creating a backlog of input messages, rather than dropping them. It is hard to tell which is better in your use case.
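For example, a pipeline launched with arguments along these lines (the values are purely illustrative):

```
--numWorkers=2 --maxNumWorkers=2 --numberOfWorkerHarnessThreads=4
```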
We have an app that uses Cloudant as a remote server. However, from previous experience, Cloudant is not completely compatible with TouchDB's continuous replications. So our alternative for now is to manually trigger one-shot replications at a fixed frequency. We would like to know whether that approach is going to cost us more money than continuous replications, since continuous replications use longpoll and don't need to query the server often. In other words, does a one-shot pull replication with Cloudant as the target cost us a GET request?
Thank you,
Paul
I think the issue you refer to is [1].

Cloudant's replication is 100% compatible with CouchDB. In this instance, TouchDB's logs indicate the iOS network stack passed on incomplete JSON to TouchDB. It's not clear who was to blame in this case for the replication failure.

[1] https://github.com/couchbaselabs/TouchDB-iOS/issues/241
For the cost question, a one-shot pull replication will result in a GET to the _changes feed each time it happens, plus the other requests required to replicate. This _changes request will be counted as a light HTTP request against your Cloudant account.

However, whether this works out as more or fewer requests overall depends on the number of changes coming down from the remote server.

It's also important to remember that the number of _changes calls is very small relative to the number of other calls involved (e.g., getting the content of the changes themselves, particularly if there are many attachments).
While this question is specific to TouchDB, and I mention specific behaviours of that codebase, this answer deals with the requests involved in replication between any two systems speaking the CouchDB replication protocol[2].

[2] http://www.dataprotocols.org/en/latest/couchdb_replication.html
Let's take a contrived example: 1 update per 10-second window to the source database for the replication, where a TouchDB database is the target. Let's compare a 5-minute poll vs. a continuous replication. For simplicity of call-counting, let's also take attachments out of the picture. We'll also assume the device has a constant network connection.

For the continuous case, every 10s TouchDB will receive an update in the _changes feed. This causes the longpoll connection to close. TouchDB then runs through the changes, requesting the updates from the source database; one or more GET requests on the remote server. While this is happening, TouchDB has to open up another longpoll request to _changes. So in a five-minute period, you'd end up with perhaps 30 calls to _changes, plus all the calls to get documents and record checkpoints.
Compare this with a one-shot replication every five minutes. You'd receive notification of the 30 updates in one _changes feed call. TouchDB implements an optimisation[3] whereby it will call _all_docs to get updated documents for 1- revs, so you might end up with a single call to get all 30 documents (not possible in the continuous case, as each change arrives on its own). Then you have the checkpoint documents to record. At best that's fewer than 5 HTTP calls, at most about a third of the continuous case, as you've avoided the extra _changes requests.

[3] https://github.com/couchbaselabs/TouchDB-iOS/wiki/Replication-Algorithm#performance
It comes down to the frequency of updates you expect to the source database. One-shot replication is likely to provide a smoother price curve, as you're in better control of the number of requests you make.

A further question is how often connections will drop because of the network disconnects which happen regularly with mobile devices. TouchDB's continuous replications will fire back up each time the user comes online (if added via the _replicator database). This is a further source of unpredictable costs.

However, the benefits of more immediate visibility of changes may well be worth the uncertainty.
A month ago I tried to use F# agents to process and record Twitter Streaming API data here. As a little exercise, I am trying to transfer the code to Windows Azure.
So far I have two roles:
One worker role (Publisher) that puts messages (a message being the JSON of a tweet) onto a queue.
One worker role (Processor) that reads messages from the queue, decodes the JSON and dumps the data into a cloud table.
Which leads to lots of questions:
Is it okay to think of a worker role as an agent?
In practice the message can be larger than 8 KB, so I am going to need to use blob storage and pass a reference to the blob as the message (or is there another way?). Will that impact performance?
Is it correct to say that if needed I can increase the number of instances of the Processor worker role, and the queue will magically be processed faster?
Sorry for pounding you with all these questions; I hope you don't mind.
Thanks a lot!
There is an open-source library named Lokad.Cloud which can handle big messages transparently; you can check it out at http://code.google.com/p/lokad-cloud/
Is it okay to think of a worker role as an agent?
Yes, definitely.
In practice the message can be larger than 8 KB, so I am going to need to use blob storage and pass a reference to the blob as the message (or is there another way?). Will that impact performance?
Yes, using the technique you're talking about (saving the JSON to blob storage with a name of "JSONMessage-1" and then sending a message to a queue with contents of "JSONMessage-1") seems to be the standard way of passing messages in Azure that are bigger than 8KB. As you're making 4 calls to Azure storage rather than 2 (1 to get the queue message, 1 to get the blob contents, 1 to delete from the queue, 1 to delete the blob) it will be slower. Will it be noticeably slower? Probably not.
If a good number of messages are going to be smaller than 8 KB when Base64 encoded (the Base64 encoding is a gotcha in the StorageClient library), then you can put in some logic to determine how to send each one.
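Here is a rough sketch of that logic using the modern Azure.Storage.* SDKs (which postdate this question; the container name, queue name, "blob:" prefix and size threshold are all assumptions):

```csharp
using System;
using System.Text;
using System.Threading.Tasks;
using Azure.Storage.Blobs;
using Azure.Storage.Queues;

// Assumes the "overflow" container and "tweets" queue already exist.
class OverflowMessaging
{
    private const int MaxInlineBytes = 8 * 1024; // the 8 KB limit from the question
    private readonly BlobContainerClient blobs;
    private readonly QueueClient queue;

    public OverflowMessaging(string connectionString)
    {
        blobs = new BlobContainerClient(connectionString, "overflow");
        queue = new QueueClient(connectionString, "tweets");
    }

    public async Task SendAsync(string json)
    {
        if (Encoding.UTF8.GetByteCount(json) <= MaxInlineBytes)
        {
            // Small message: send it inline.
            await queue.SendMessageAsync(json);
        }
        else
        {
            // Large message: park the body in a blob, enqueue a reference.
            var name = $"msg-{Guid.NewGuid():N}";
            await blobs.UploadBlobAsync(name, BinaryData.FromString(json));
            await queue.SendMessageAsync("blob:" + name);
        }
    }
}
```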
Is it correct to say that if needed I can increase the number of instances of the Processor worker role, and the queue will magically be processed faster?
As long as you've written your worker role so that it's self-contained and the instances don't get in each other's way, then yes, increasing the instance count will increase the throughput.
If your role is mainly just reading and writing to storage, you might benefit from multi-threading the worker role first, before increasing the instance count, which will save money.
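A minimal sketch of what that multi-threading might look like inside a single role instance; processMessageAsync and the thread count of 4 are hypothetical, and the modern Azure.Storage.Queues SDK shown here postdates the question:

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;
using Azure.Storage.Queues;

class MultiThreadedProcessor
{
    // Run several dequeue loops in one instance before paying for more instances.
    public static Task RunAsync(QueueClient queue, Func<string, Task> processMessageAsync)
    {
        var workers = Enumerable.Range(0, 4).Select(_ => Task.Run(async () =>
        {
            while (true)
            {
                var response = await queue.ReceiveMessageAsync();
                var msg = response.Value;
                if (msg == null) { await Task.Delay(1000); continue; } // queue empty: back off
                await processMessageAsync(msg.MessageText);
                await queue.DeleteMessageAsync(msg.MessageId, msg.PopReceipt);
            }
        }));
        return Task.WhenAll(workers);
    }
}
```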
Is it okay to think of a worker role as an agent?
This is the perfect way to think of it. Imagine the workers at McDonald's. Each worker has certain tasks and they communicate with each other via messages (spoken).
In practice the message can be larger than 8 KB, so I am going to need to use blob storage and pass a reference to the blob as the message (or is there another way?). Will that impact performance?
As long as the message is immutable, this is the best way to do it. Strings can be very large and thus are allocated on the heap. Since they are immutable, passing around references is not an issue.
Is it correct to say that if needed I can increase the number of instances of the Processor worker role, and the queue will magically be processed faster?
You need to look at what your process is doing and decide whether it is IO-bound or CPU-bound. Typically, IO-bound processes will see a performance increase from adding more agents. If you are using the ThreadPool for your agents, the work will be balanced quite well, even for CPU-bound processes, but you will hit a limit. That being said, don't be afraid to mess around with your architecture and MEASURE the results of each run. This is the best way to balance the number of agents to use.