Leaky bucket vs token bucket - rate limiter

So, reading about the leaky bucket algorithm, I'm trying to understand whether the implementation differs between implementors or whether I'm missing something.
The descriptions say the leaky bucket is used to keep a constant pace of processing (e.g. 2 requests/second). So if the configuration is a bucket size of 40 and 2 QPS, any request that arrives while the bucket is full (more than 40) will be throttled.
But what happens to requests when the bucket is not full? Are they actually added to a queue where they wait? (That doesn't make much sense, I suppose, because it would affect latency.)
If they don't actually wait, then the pace of handling requests is not a constant 2 QPS, but allows bursts of up to 40.
However, in that case, what is the difference between the leaky bucket and the token bucket, other than the fact that one removes tokens and the other adds them? It looks like both allow bursts, and after a burst both try to smooth request handling back down to the defined QPS.
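For what it's worth, here is a minimal sketch in Java of the two variants being compared (class names and structure are mine, not from any particular library): a token bucket, and a leaky bucket used as a meter rather than as a queue.

```java
// Token bucket: tokens refill at ratePerSec up to capacity; each request
// consumes one token or is rejected, so bursts of up to `capacity` are allowed.
class TokenBucket {
    private final double capacity, ratePerSec;
    private double tokens;
    private long last = System.nanoTime();

    TokenBucket(double capacity, double ratePerSec) {
        this.capacity = capacity;
        this.ratePerSec = ratePerSec;
        this.tokens = capacity; // start full
    }

    synchronized boolean tryAcquire() {
        long now = System.nanoTime();
        tokens = Math.min(capacity, tokens + (now - last) / 1e9 * ratePerSec);
        last = now;
        if (tokens >= 1.0) { tokens -= 1.0; return true; }
        return false;
    }
}

// Leaky bucket "as a meter": the level drains at ratePerSec; a request is
// admitted (and handled immediately) only if adding it keeps the level within
// `capacity`. Note this also admits bursts of up to `capacity`.
class LeakyBucketMeter {
    private final double capacity, ratePerSec;
    private double level;
    private long last = System.nanoTime();

    LeakyBucketMeter(double capacity, double ratePerSec) {
        this.capacity = capacity;
        this.ratePerSec = ratePerSec;
    }

    synchronized boolean tryAcquire() {
        long now = System.nanoTime();
        level = Math.max(0.0, level - (now - last) / 1e9 * ratePerSec);
        last = now;
        if (level + 1.0 <= capacity) { level += 1.0; return true; }
        return false;
    }
}
```

In the "as a queue" reading of the leaky bucket, admitted requests instead sit in the bucket and are drained at the fixed rate, which does keep the pace constant but at the cost of latency; the difference between the meter and queue readings seems to be the source of the ambiguity the question describes.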

Related

How can you verify that a blockchain/DLT has not been tampered with?

I understand how a distributed ledger ensures integrity using a chained linked-list data model whereby each block is chained to all its previous ones.
I also understand how, in a PoW/PoS/PoET/(insert any trustless consensus mechanism here) context, the verification process makes it difficult for a malicious individual (or group) to tamper with a block, because they would not have the necessary resources to broadcast an instance of the ledger to all network members and get them to update their copies to the wrong version.
My question is, if, let's say, someone does manage to change a block, does an integrity-checking process ever happen? Is it an actual part of the verification mechanism, and if so, how far back in history does it go?
Is it ever necessary to verify the integrity of, say, block number 500 out of a total of 10,000, and if so, how do I do that? Do I need to start from block 10,000 and verify all blocks from there back to block 500?
My question is, if, let's say, someone does manage to change a block, does an integrity-checking process ever happen?
Change the block where? If you mean change my copy of the block in my computer, how would they do that? By breaking into my computer? And how would that affect anyone else?
Is it an actual part of the verification mechanism, and if so, how far back in history does it go? Is it ever necessary to verify the integrity of, say, block number 500 out of a total of 10,000, and if so, how do I do that? Do I need to start from block 10,000 and verify all blocks from there back to block 500?
The usual rule for most blockchains is that every full node checks every single block it ever receives to ensure that it follows every single system rule for validity.
While you could re-check every block you already checked to ensure that your particular copy hasn't been tampered with, this generally serves no purpose. Someone who could tamper with your local block storage could also tamper with your local checking code. So this kind of check doesn't usually provide any additional security.
To begin with, tampering with a block is not made almost impossible by a resource shortage for broadcasting the wrong ledger to all nodes. Broadcasting is not necessarily resource-intensive; it is a chain reaction that you only have to trigger. The real challenge in tampering with a chained block is the difficulty of recomputing valid hashes (hashes that meet the block difficulty level) for all the successive blocks, i.e. the blocks that come after the one being tampered with. Altering a block alters its hash, which in turn changes the previous-hash attribute of the next block, invalidating its previously correct hash, and so on up to the latest block.
Say the latest block index is 1000 and you tamper with the 990th block. You would then have to re-mine (recalculate a valid hash by repeatedly changing the nonce) every block from 990 to 1000, which is in itself very difficult. But even if you somehow managed to do that, by the time you broadcast your updated blockchain, other miners would have mined further blocks (say 1001 and 1002), so yours would not be the longest valid blockchain and would therefore be rejected.
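To make the "one change breaks every later block" point concrete, here is a minimal, hypothetical integrity check in Java. The Block layout, field names, and difficulty check are simplifications for illustration, not any particular chain's real format.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HexFormat;
import java.util.List;

// Hypothetical minimal block layout, just enough to show the chaining.
record Block(int index, String prevHash, String data, long nonce, String hash) {}

class ChainCheck {
    static String sha256(String input) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            return HexFormat.of().formatHex(md.digest(input.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Re-checks every link in the chain. Tampering with any block breaks either
    // its own hash, the next block's prevHash link, or the proof-of-work prefix.
    static boolean isValid(List<Block> chain, String difficultyPrefix) {
        for (int i = 1; i < chain.size(); i++) {
            Block prev = chain.get(i - 1);
            Block cur = chain.get(i);
            String recomputed = sha256(cur.index() + cur.prevHash() + cur.data() + cur.nonce());
            if (!recomputed.equals(cur.hash())) return false;            // contents were altered
            if (!cur.prevHash().equals(prev.hash())) return false;       // link to predecessor broken
            if (!cur.hash().startsWith(difficultyPrefix)) return false;  // proof-of-work no longer satisfied
        }
        return true;
    }
}
```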
According to this article, when a new block is broadcast to the network, the integrity of its transactions is validated and verified against the history of transactions. The trouble with integrity checking arises only if a malicious, longer blockchain is broadcast to the network; in that case the protocol forces the nodes to accept the longest chain. Note that this longer chain maintains its own integrity and verifies on its own terms, but against its own version of the truth. Mind you, this is only possible if the attacker has at least 51% of the network's hashing power, and that kind of power is assumed to be practically unattainable. https://medium.com/coinmonks/what-is-a-51-attack-or-double-spend-attack-aa108db63474

Making use of workers in Custom Sink

I have a custom sink which will publish the final result from a pipeline to a repository.
I am getting the inputs for this pipeline from BigQuery and GCS.
The custom Writer in the sink is called on all of the workers. The Writer just collects the objects to be pushed and returns them as part of the WriteResult. Then, finally, I merge these records in CustomWriteOperation.finalize() and push them into my repository.
This works fine for smaller files, but my repository will not accept a result greater than 5 MB, and it will not accept more than 20 writes per hour.
If I push the result from each worker, then the write-count limit will be violated. If I write everything in CustomWriteOperation.finalize(), then it may violate the size limit, i.e. 5 MB.
My current approach is to write in chunks in CustomWriteOperation.finalize(). Since this is not executed across many workers, it might cause a delay in my job. How can I make use of workers in finalize(), and how can I specify the number of workers to be used inside a pipeline for a specific step (i.e. the write step)?
Or is there any better approach?
The sink API doesn't explicitly allow tuning of bundle size.
One workaround might be to use a ParDo to group records into bundles. For example, you can use a DoFn to randomly assign each record a key between 1 and N, and then use a GroupByKey to group the records into KV<Integer, Iterable<Records>>. This should produce N groups of roughly the same size, as sketched below.
As a result, an invocation of Sink.Writer.write could write all the records with the same key at once, and since write is invoked in parallel, the bundles would be written in parallel.
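A rough sketch of that keying step, written against the Apache Beam-style Java API (the question uses the older Dataflow SDK, where the shapes are similar); the String element type and the bundle count are placeholders:

```java
import java.util.concurrent.ThreadLocalRandom;

import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

class BundlingSketch {
    // Keys each record randomly into one of numBundles groups, then groups by key
    // so each Iterable can be written to the repository as a single bundle.
    static PCollection<KV<Integer, Iterable<String>>> intoRandomBundles(
            PCollection<String> records, final int numBundles) {
        return records
            .apply("AssignRandomKey", ParDo.of(new DoFn<String, KV<Integer, String>>() {
                @ProcessElement
                public void processElement(ProcessContext c) {
                    c.output(KV.of(ThreadLocalRandom.current().nextInt(numBundles), c.element()));
                }
            }))
            .apply("GroupIntoBundles", GroupByKey.<Integer, String>create());
    }
}
```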
However, since a given KV pair could be processed multiple times or in multiple workers at the same time, you will need to implement some mechanism to create a lock so that you only try to write each group of records once.
You will also need to handle failures and retries.
So, if I understand correctly, you have a repository that
Accepts no more than X write operations per hour (I suppose if you try to do more, you get an error from the API you're writing to), and
Each write operation can be no bigger than Y in size (with similar error reporting).
That means it is not possible to write more than X*Y data in 1 hour, so I suppose, if you want to write more than that, you would want your pipeline to wait longer than 1 hour.
Dataflow currently does not provide built-in support for enforcing either of these limits; however, it seems like you should be able to simply do retries with randomized exponential back-off to get around the first limitation (here's a good discussion), and it only remains to make sure individual writes are not too big.
Limiting individual writes can be done in your Writer class in the custom sink. You can maintain a buffer of records, have write() add to the buffer and flush it by issuing the API call (with exponential back-off, as mentioned) once the buffer gets close to the allowed write size, and flush one more time in close().
This way you will write bundles that are as big as possible but not bigger, and if you add retry logic, throttling will also be respected.
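A rough sketch of that buffering logic, independent of the Sink classes themselves; the record format, the repository call, the retry bounds, and the error type are all placeholders:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical buffering writer: batches records so each API call stays under
// the repository's maximum write size, retrying with randomized exponential back-off.
class BufferingRepoWriter {
    private static final long MAX_WRITE_BYTES = 5L * 1024 * 1024; // the 5 MB limit from the question
    private final List<byte[]> buffer = new ArrayList<>();
    private long bufferedBytes = 0;

    void write(byte[] record) throws IOException, InterruptedException {
        if (bufferedBytes + record.length > MAX_WRITE_BYTES) {
            flushWithBackoff();
        }
        buffer.add(record);
        bufferedBytes += record.length;
    }

    void close() throws IOException, InterruptedException {
        if (!buffer.isEmpty()) flushWithBackoff();
    }

    private void flushWithBackoff() throws IOException, InterruptedException {
        long delayMillis = 1_000;
        for (int attempt = 0; attempt < 8; attempt++) {
            try {
                pushToRepository(buffer);  // placeholder for the real API call
                buffer.clear();
                bufferedBytes = 0;
                return;
            } catch (IOException rateLimitedOrTransient) {
                // Randomized exponential back-off before the next attempt.
                Thread.sleep(delayMillis + ThreadLocalRandom.current().nextLong(delayMillis));
                delayMillis *= 2;
            }
        }
        throw new IOException("giving up after repeated write failures");
    }

    // Stand-in for the repository client; the real call and its error types are not known here.
    private void pushToRepository(List<byte[]> records) throws IOException {
        /* call the repository API here */
    }
}
```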
Overall, this seems to fit well in the Sink API.
I am working with Sam on this, and here are the actual limits imposed by our target system: 100 GB per API call, and a maximum of 25 API calls per day.
Given these limits, the retry method with back-off logic may cause the upload to take many days to complete, since we don't have control over the number of workers.
Another approach would be to leverage FileBasedSink to write many files in parallel. Once all these files are written, finalize (or copyToOutputFiles) can combine files until the total size reaches 100 GB and push the result to the target system. This way we leverage the parallelism of the writer threads and honor the limits of the target system.
Thoughts on this, or any other ideas?

When is key/value caching on Heroku worthwhile?

Background:
I have a web service which takes in input of 1 to 20 objects, and then performs an operation on each that takes roughly 100-300ms. The results of that operation are valid on average for one hour, and the output is a hash of strings and integers. The average request has 5 objects, thus a response time of roughly 1000ms. I am expecting a pretty low cache hit rate until the service picks up traction--let's call it a 10% hit rate for now.
My application is hosted on Heroku, and for the purposes of this question, I do not wish to move it.
What I've Tried
I started with the free offering from IronCache (through the Heroku add-on) and did some very rough tests. A put() or get() request takes roughly 20-40 ms for simple objects. There is no support for batch operations, so assuming a 100% cache miss, this would add 20-40 ms per object to my response. In my average case of 5 objects, that is roughly 150 ms extra.
IronCache does not support batched operations, but it seems like batch support would solve my issue.
My Question
Given this profile, is it worthwhile to use a hosted caching (key/value) store on Heroku? If so, which?
I went with MemCachier, an add-on for Heroku which offers a 25MB free tier. They use Dalli as their Ruby library, which supports get_multi, and a multi function which takes a block and defers sending until the end of the block.
If batched operations will make caching worthwhile for you, you should use Redis; it supports them.
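The question's stack is Ruby on Heroku, but just to illustrate the batching idea (shown here with the Jedis Java client; host, port, and key names are placeholders), a single MGET replaces N individual round trips:

```java
import java.util.List;
import redis.clients.jedis.Jedis;

public class BatchedCacheLookup {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            // One round trip to store several entries...
            jedis.mset("obj:1", "result-1", "obj:2", "result-2", "obj:3", "result-3");

            // ...and one round trip to read them back, instead of N get() calls
            // each paying the 20-40 ms network cost mentioned in the question.
            List<String> hits = jedis.mget("obj:1", "obj:2", "obj:3", "obj:4");
            System.out.println(hits); // missing keys come back as null ("obj:4" here)
        }
    }
}
```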

Force update Amazon CloudFront on iOS

I am using Amazon S3 in conjunction with Amazon CloudFront. In my app I have a method to update an S3 object: I fetch the S3 object through CloudFront, make a change to the data, and re-upload it under the same key, basically replacing/updating the file/object.
However, CloudFront doesn't seem to update along with S3 (well, it does, but my users don't have all day). Is there a way to force a CloudFront content update? Apparently you can invalidate it; is there a way to do that with the iOS SDK?
I don't know of a way to make a CloudFront invalidation request via the iOS SDK. You would likely need to build your own method to formulate the request against the AWS API.
I would, however, suggest that you take another approach. Invalidation requests are expensive operations (relative to other CloudFront costs). You probably do not want to let users initiate an unlimited number of invalidation requests against CloudFront via the application, and you will also run up against limits on the number of concurrent invalidation requests you can have. Your best bet is to implement a file-name versioning scheme whereby you can change the file name programmatically for each revision. You would then reference the new URL in CloudFront with each revision, eliminating the need to wait for a cache refresh or perform an invalidation. This also makes the new content available more immediately, as invalidation requests may take a while to process.
Please note the following from the CloudFront FAQ:
Q. Is there a limit to the number of invalidation requests I can make?
There are no limits on the total number of files you can invalidate; however, each invalidation request you make can have a maximum of 1,000 files. In addition, you can only have 3 invalidation requests in progress at any given time. If you exceed this limit, further invalidation requests will receive an error response until one of the earlier requests completes. You should use invalidation only in unexpected circumstances; if you know beforehand that your files will need to be removed from cache frequently, it is recommended that you either implement a versioning system for your files and/or set a short expiration period.
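For completeness, if you do decide to issue invalidations, here is roughly what the request looks like with the AWS SDK for Java (v1), e.g. from a backend you control rather than from the iOS app itself; the distribution ID and object path are placeholders:

```java
import com.amazonaws.services.cloudfront.AmazonCloudFront;
import com.amazonaws.services.cloudfront.AmazonCloudFrontClientBuilder;
import com.amazonaws.services.cloudfront.model.CreateInvalidationRequest;
import com.amazonaws.services.cloudfront.model.InvalidationBatch;
import com.amazonaws.services.cloudfront.model.Paths;

public class InvalidateExample {
    public static void main(String[] args) {
        AmazonCloudFront cloudFront = AmazonCloudFrontClientBuilder.defaultClient();

        // The paths to invalidate; "quantity" must match the number of items.
        Paths paths = new Paths()
                .withItems("/uploads/my-object.json")
                .withQuantity(1);

        // The caller reference must be unique per request; a timestamp works for a demo.
        InvalidationBatch batch =
                new InvalidationBatch(paths, "invalidate-" + System.currentTimeMillis());

        cloudFront.createInvalidation(new CreateInvalidationRequest("DISTRIBUTION_ID", batch));
    }
}
```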

Single Bigger POST to a google-script or several smaller?

I have a script which collects some data about certain files on my computer and then makes a POST to a google-script published as a service.
I was wondering which would be better: collect all the data (which couldn't be more than a few MB, maybe 10) and make a single POST, or make one POST request for each piece (each just a few KB)?
Which is better for performance on both sides, my local computer and Google's servers?
Could it be understood as abuse if I make a hundred POSTs? It will run just once a month.
There are a lot of factors that would go into this decision -
In general, I would argue it's better to do one upload, as 10 MB isn't a large amount of data.
Is this asynchronous (or automatic), or is there a user clicking a button? If it's happening automatically, then you don't have to worry about reporting progress accurately to the user. If there is a user watching the upload, then smaller uploads are better, as you'll be able to measure how many of the units (or chunks) have been uploaded successfully.
Your computer should not be in the picture at all - Google Apps Script runs on Google's servers. Perhaps there is some confusion here?
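If you go the single-upload route, here is a minimal sketch of the client side (in Java for illustration; the deployment URL and payload are placeholders, and the published Apps Script web app would read the body in its doPost(e) handler):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SingleUpload {
    public static void main(String[] args) throws Exception {
        // Placeholder payload: replace with the metadata collected from the local files.
        String jsonPayload = "{\"files\": []}";

        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://script.google.com/macros/s/DEPLOYMENT_ID/exec"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(jsonPayload))
                .build();

        // One request carries everything, so there is a single success/failure to handle.
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```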
