What is the best way to performance test an SQS consumer to find the max TPS that one host can handle? - amazon-sqs

I have an SQS consumer running in EventConsumerService that needs to handle up to 3K TPS successfully, sometimes upwards of 20K TPS (or 1.2 million messages per minute). For each message processed, I make a REST call to DataService's TCP VIP. I'm trying to perform a load test to find the max TPS that one host can handle in EventConsumerService without overstraining:
Request volume on dependencies, DynamoDB storage, etc
CPU utilization in both EventConsumerService and DataService
Network connections per host
IO stats due to overlogging
DLQ size must be minimal; currently I am seeing my DLQ grow to 500K messages due to 500 Service Unavailable exceptions thrown from DataService, so something must be wrong.
Approximate age of oldest message. I do not want a message sitting in the queue for over X minutes.
Fatals and latency of the REST call to DataService
Active threads
This is how I am performing the performance test:
I set up both my consumer and the other service on one host, the reason being I want to understand the load on both services per host.
I use a TPS generator to fill the SQS queue with a million messages
EventConsumerService is already running in production. Once messages started filling the SQS queue, I could immediately see requests being sent to DataService.
Here are the parameters I am tuning to find messagesPolledPerSecond:
messagesPolledPerSecond = (numberOfHosts * numberOfPollers * messageFetchSize) * (1000/(sleepTimeBetweenPollsPerMs+receiveMessageTimePerMs))
messagesInSurge / messagesPolledPerSecond = ageOfOldestMessageSLA
ageOfOldestMessage + settingsUpdatedLatency < latencySLA
The variables for SqsConsumer which I kept constant are:
numberOfHosts = 1
receiveMessageTimePerMs = ~60 ms (it's out of my control)
Max thread pool size: 300
The other factors are all fair game (a worked example follows this list):
Number of pollers (default 1), I set to 150
Sleep time between polls (default 100 ms), I set to 0 ms
Sleep time when no messages (default 1000 ms), ???
message fetch size (default 1), I set to 10
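For reference, plugging my current values into the formula above gives (my own rough arithmetic):
messagesPolledPerSecond = (1 * 150 * 10) * (1000 / (0 + 60)) ≈ 1,500 * 16.7 ≈ 25,000
So a single host is trying to push roughly 25K TPS at DataService, far above the 3K TPS target, which presumably explains the server errors and the growing DLQ.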
However, with the above parameters, I am seeing a large number of messages sent to the DLQ due to server errors, so clearly I have set the values too high. This testing methodology seems highly inefficient, and I am unable to find the optimal TPS that neither sends such a tremendous number of messages to the DLQ nor lets the approximate age of the oldest message grow so high.
Any guidance on how best to test this is appreciated.

Related

Understand how k6 manages at low level a large number of API call in a short period of time

I'm new to k6, so I'm sorry if I'm asking something naive. I'm trying to understand how the tool manages network calls under the hood. Does it execute them at the maximum rate it can? Does it queue them based on the System Under Test's response time?
I need to understand this because I'm running a lot of tests using both k6 run and k6 cloud, but I can't get past ~2000 requests per second (looking at the k6 results). I was wondering whether k6 implements some kind of back-pressure mechanism when it detects that my system is "slow", or whether there is some other reason why I can't break that limit.
I read here that it is possible to make 300,000 requests per second and that the cloud environment is already configured for that. I also tried to configure my machine manually, but nothing changed.
e.g. The following tests are identical; the only change is the number of VUs. I ran all tests on k6 cloud.
Shared parameters:
60 api calls (I have a single http.batch with 60 api calls)
Iterations: 100
Executor: per-vu-iterations
Here I got 547 reqs/s:
VUs: 10 (60,000 calls with an avg response time of 108 ms)
Here I got 1,051.67 reqs/s:
VUs: 20 (120,000 calls with an avg response time of 112 ms)
Here I got 1,794.33 reqs/s:
VUs: 40 (240,000 calls with an avg response time of 134 ms)
Here I got 2,060.33 reqs/s:
VUs: 80 (480,000 calls with an avg response time of 238 ms)
Here I got 2,223.33 reqs/s:
VUs: 160 (960,000 calls with an avg response time of 479 ms)
Here I got a peak of 2,102.83 reqs/s:
VUs: 200 (1,081,380 calls with an avg response time of 637 ms) // I hit the max duration here, which is why it stopped
What I was expecting is that if my system can't handle that many requests, I would see a lot of timeout errors, but I haven't seen any. What I'm seeing is that all the API calls are executed and no errors are returned. Can anyone help me?
As k6 - or more specifically, your VUs - execute code synchronously, the amount of throughput you can achieve is fully dependent on how quickly the system you're interacting with responds.
Let's take this script as an example:
import http from 'k6/http';
export default function() {
http.get("https://httpbin.org/delay/1");
}
The endpoint here is purposefully designed to take 1 second to respond. There is no other code in the exported default function. Because each VU will wait for a response (or a timeout) before proceeding past the http.get statement, the maximum amount of throughput for each VU will be a very predictable 1 HTTP request/sec.
Often, response times (and/or errors, like timeouts) will increase as you increase the number of VUs. You will eventually reach a point where adding VUs does not result in higher throughput. In this situation, you've basically established the maximum throughput the System-Under-Test can handle. It simply can't keep up.
The only situation where that might not be the case is when the system running k6 runs out of hardware resources (usually CPU time). This is something that you must always pay attention to.
If you are using k6 OSS, you can scale to as many VUs (concurrent threads) as your system can handle. You could also use http.batch to fire off multiple requests concurrently within each VU (the statement will still block until all responses have been received). This might be slightly less overhead than spinning up additional VUs.
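For example, here is a minimal sketch of the http.batch variant (same placeholder endpoint as above; the batch size of 10 is arbitrary):
import http from 'k6/http';
export default function() {
  // Build 10 identical requests; http.batch fires them concurrently within this VU
  // and blocks until every response (or timeout) has come back.
  const requests = new Array(10).fill(['GET', 'https://httpbin.org/delay/1']);
  http.batch(requests);
}
Against the 1-second delay endpoint, each VU would then complete roughly 10 requests per second instead of 1.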

Throttle Apache Spout Dynamically

I have a topology where a spout reads data from Kafka and sends it to a bolt, which in turn calls a REST API (A), which then calls another REST API (B). So far API B did not have throttling; now they have implemented throttling (a maximum of x calls per clock minute).
We need to implement the throttling handler.
Option A
Initially we were thinking of doing it at the REST API (A) level and putting in a
Thread.sleep(x in millis) once the call is throttled by REST API (B),
but that would hold all the REST (A) calls waiting for that long (which could vary between 1 second and 59 seconds), and that may increase the load from new calls coming in.
Option B
REST API (A) sends a response back to the Bolt saying it was throttled. The Bolt notifies the Spout of a processing failure so as:
not to change the offset for those messages, and
to tell the Spout to stop reading from Kafka and to stop emitting messages to the Bolt.
The Spout waits for some time (say a minute) and resumes from where it left off.
Option A is straightforward to implement, but not a good solution in my opinion.
I am trying to figure out whether Option B is feasible with topology.max.spout.pending; however, I don't see how to dynamically tell Storm to throttle the spout. Can anyone share some thoughts on this option?
Option C
REST API (B) throttles the call from REST API (A), which does not handle it itself but sends the 429 response code on to the bolt. The bolt re-queues the message to an error topic that is part of another Storm topology. The message can carry a retry count, and if the same message gets throttled again we can re-queue it with an incremented retry count.
Updating the post, as I found a solution that makes Option B feasible.
Option D
https://github.com/apache/storm/blob/master/external/storm-kafka-client/src/main/java/org/apache/storm/kafka/spout/KafkaSpoutRetryExponentialBackoff.java
/**
 * The time stamp of the next retry is scheduled according to the exponential backoff formula (geometric progression):
 * nextRetry = failCount == 1 ? currentTime + initialDelay : currentTime + delayPeriod^(failCount-1),
 * where failCount = 1, 2, 3, ... nextRetry = Min(nextRetry, currentTime + maxDelay).
 * <p/>
 * By specifying a value for maxRetries lower than Integer.MAX_VALUE, the user decides to sacrifice guarantee of delivery for the
 * previous polled records in favor of processing more records.
 *
 * @param initialDelay initial delay of the first retry
 * @param delayPeriod the time interval that is the ratio of the exponential backoff formula (geometric progression)
 * @param maxRetries maximum number of times a tuple is retried before being acked and scheduled for commit
 * @param maxDelay maximum amount of time waiting before retrying
 */
public KafkaSpoutRetryExponentialBackoff(TimeInterval initialDelay, TimeInterval delayPeriod, int maxRetries, TimeInterval maxDelay) {
    this.initialDelay = initialDelay;
    this.delayPeriod = delayPeriod;
    this.maxRetries = maxRetries;
    this.maxDelay = maxDelay;
    LOG.debug("Instantiated {}", this.toStringImpl());
}
The steps will be as follows:
Create a kafkaSpoutRetryService using the above constructor
Set the retry service on the KafkaSpoutConfig using
KafkaSpoutConfig.builder(kafkaBootStrapServers, topic).setRetry(kafkaSpoutRetryService)
Fail the tuple in the Bolt whenever REST API (B) throttles, using
collector.fail(tuple), which signals the spout to process the tuple again based on the retry configuration set up in steps 1 and 2 (see the sketch below)
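Roughly, steps 1 and 2 could look like this (a minimal sketch assuming storm-kafka-client is on the classpath; kafkaBootStrapServers and topic are the placeholders already used above, and the delay values are illustrative, not recommendations):
import org.apache.storm.kafka.spout.KafkaSpoutConfig;
import org.apache.storm.kafka.spout.KafkaSpoutRetryExponentialBackoff;
import org.apache.storm.kafka.spout.KafkaSpoutRetryExponentialBackoff.TimeInterval;
import org.apache.storm.kafka.spout.KafkaSpoutRetryService;

public class ThrottleAwareSpoutConfig {

    public static KafkaSpoutConfig<String, String> build(String kafkaBootStrapServers, String topic) {
        // Step 1: retry service built with the constructor shown above
        KafkaSpoutRetryService kafkaSpoutRetryService = new KafkaSpoutRetryExponentialBackoff(
                TimeInterval.seconds(60),    // initialDelay: wait roughly one throttling window before the first retry
                TimeInterval.seconds(2),     // delayPeriod: ratio of the geometric progression
                5,                           // maxRetries before the tuple is acked and dropped
                TimeInterval.seconds(600));  // maxDelay: upper bound on the backoff

        // Step 2: attach the retry service to the spout config
        return KafkaSpoutConfig.builder(kafkaBootStrapServers, topic)
                .setRetry(kafkaSpoutRetryService)
                .build();
    }
}
Step 3 is then just calling collector.fail(tuple) in the bolt's execute() whenever API (B) returns a throttling response; the spout replays the tuple according to the backoff configured above.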
Your option D sounds fine, but in the interest of avoiding duplicates in calls to API A, I think you should consider separating your topology into two.
Have a topology that reads from your original Kafka topic (call it topic 1), calls REST API A, and writes whatever the output of the bolt is back to a Kafka topic (call it topic 2).
You then make a second topology whose only job is to read from topic 2, and call REST API B.
This will allow you to use option D while avoiding extra calls to API A when you are saturating API B. Your topologies will look like
Kafka 1 -> Bolt A -> REST API A -> Kafka 2
Kafka 2 -> Bolt B -> REST API B
If you want to make the solution a little more responsive to the throttling, you can use the topology.max.spout.pending configuration in Storm to limit how many tuples can be in-flight at the same time. You could then make your bolt B buffer in-flight tuples until the throttling expires, at which point you can make it try sending the tuples again. You can use OutputCollector.resetTupleTimeout to avoid the tuples timing out while Bolt B is waiting for the throttling to expire. You can use tick tuples to make Bolt B periodically wake up and check whether the throttling has expired.
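If you go that route, capping the in-flight tuples is just a topology configuration setting; a minimal sketch (the value 250 is an arbitrary example, not a recommendation):
import org.apache.storm.Config;

public class TopologyLimits {

    public static Config limitedConfig() {
        Config conf = new Config();
        // Equivalent to setting topology.max.spout.pending: caps the number of
        // unacked tuples in flight per spout task, so the buffer in Bolt B
        // stays bounded while you wait out the throttling window.
        conf.setMaxSpoutPending(250);
        return conf;
    }
}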

Explain Cost of Google Cloud PubSub when used with Cloud Dataflow

The documentation on Pub/Sub pricing is very minimal. Can someone explain the costs for the scenario below?
Size of the data per event = 0.5 KB
Size of data per day = 1 TB
There is only one publisher app and there are two dataflow pipeline subscriptions.
The very rough estimate I can come up with is:
1x publishing
2x subscription (1x for each subscription)
2x acknowledgment (1x for each subscription ack)
The questions are:
Is the total data volume per month 150 TB (30 * 1 TB * 5x)? That is about $8,000 per month from the price calculator.
Is the 1 KB minimum size for the calculation applicable even when acknowledging a message?
Dataflow handles subscribe/acknowledge in bundles of ParDos. But is the bundle for each message acknowledged separately?
One does not pay for acknowledgements in Google Cloud Pub/Sub, only for publishes, pulls, and pushes. With messages of size 0.5KB, the amount you'd get charged would depend on the batching because of the 1KB minimum size. If all requests had at least 1KB, then the total cost for publishing and getting messages to two subscribers would be:
1 TB/day * 30 days * 3 = 92,160 GB/month
10 GB * $0 + 92,150 GB * $0.04 = $3,686
If some messages were not batched, then the price could go up because of the 1KB minimum. The Google Cloud Pub/Sub client library does batch published messages by default, so assuming your messages were not published very sporadically (i.e., so infrequently that no batching occurs), you would rarely hit the 1KB minimum. With this amount of data, you are probably going to end up with batching on the subscribe side as well.
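As a hedged upper bound (my own arithmetic, using the same $0.04/GB rate and 10 GB free tier as above, not an official figure): if no batching happened at all and every 0.5KB message were billed at the 1KB minimum, the billable volume would simply double:
2 TB/day * 30 days * 3 = 184,320 GB/month
10 GB * $0 + 184,310 GB * $0.04 ≈ $7,372
The real bill should land between that figure and the $3,686 above, depending on how well the publishes and pulls batch.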

How to write bosun alerts which handle low traffic volumes

If you are writing a bosun alert which is based on a percentage error rate for requests handled by your system, how do you write it in such a way that it handles periods of low traffic?
For example:
If I have an alert which looks back over the last 5 minutes and works out the error rate for requests,
$errorRate = $numberErr / $numberReq, and then triggers an alarm if the error rate exceeds a predefined threshold, crit = $errorRate > 0.05, this can work quite well so long as every 5 minute period has a sufficiently large number of requests ($numberReq).
If the number of requests in a 5 minute period was 10,000, then 501 errors would be required to trigger an alarm. However, if the number of requests in a 5 minute period was 100, then only 6 errors would be required to trigger an alarm.
How can I write an alert which handles periods where the number of requests is so low that a small number of errors equates to a large error rate? I had considered a sliding window of time, rather than a fixed 5 minute period, where the window would increase in size until the number of requests was high enough to give some confidence in the alarm, e.g. increase the time period until the number of requests reaches 10,000.
I can't find a way to achieve this in bosun, and I don't want to commit to a larger period of time for my alerts because the traffic rate varies so much. A longer period during peak traffic could result in an actual error causing a much larger impact.
I generally pair any percentage and/or historical based alerts with a static threshold.
For example: crit = $numberErr > 100 && $errorRate > 0.05. That way the percentage part doesn't matter unless the number of errors has also crossed some threshold, because otherwise the entire statement won't be true.
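In a full bosun alert definition that might look roughly like this (a sketch; the metric queries and the 100-error floor are placeholders to adapt to your own data):
alert service.error.rate {
    $numberReq = sum(q("sum:requests.total{host=*}", "5m", ""))
    $numberErr = sum(q("sum:requests.errors{host=*}", "5m", ""))
    $errorRate = $numberErr / $numberReq
    crit = $numberErr > 100 && $errorRate > 0.05
}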

Flume doesn't recover after memory transaction capacity is exceeded

I'm creating a proof of concept of a Flume agent that will buffer events and stop consuming events from the source when the sink is unavailable. Only when the sink is available again should the buffered events be processed, and then the source should resume consumption.
For this I've created a simple agent which reads from a SpoolDir and writes to a file. To simulate the sink service being down, I change the file permissions so Flume can't write to it. When I then start Flume, some events are buffered in the memory channel, and it stops consuming events when the channel capacity is full, as expected. As soon as the file becomes writeable again, the sink is able to process the events and Flume recovers. However, that only works when the transaction capacity is not exceeded. As soon as the transaction capacity is exceeded, Flume never recovers and keeps writing the following error:
2015-10-02 14:52:51,940 (SinkRunner-PollingRunner-DefaultSinkProcessor) [ERROR -
org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:160)] Unable to
deliver event. Exception follows.
org.apache.flume.EventDeliveryException: Failed to process transaction
at org.apache.flume.sink.RollingFileSink.process(RollingFileSink.java:218)
at org.apache.flume.sink.DefaultSinkProcessor.process(DefaultSinkProcessor.java:68)
at org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:147)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.flume.ChannelException: Take list for MemoryTransaction,
capacity 4 full, consider committing more frequently, increasing capacity, or
increasing thread count
at org.apache.flume.channel.MemoryChannel$MemoryTransaction.doTake(MemoryChannel.java:96)
at org.apache.flume.channel.BasicTransactionSemantics.take(BasicTransactionSemantics.java:113)
at org.apache.flume.channel.BasicChannelSemantics.take(BasicChannelSemantics.java:95)
at org.apache.flume.sink.RollingFileSink.process(RollingFileSink.java:191)
... 3 more
As soon as the number of events buffered in memory exceeds the transaction capacity (4), this error occurs. I don't understand why, because the batchSize of the fileout sink is 1, so it should take the events out one by one.
This is the config I'm using:
agent.sources = spool-src
agent.channels = mem-channel
agent.sinks = fileout
agent.sources.spool-src.channels = mem-channel
agent.sources.spool-src.type = spooldir
agent.sources.spool-src.spoolDir = /tmp/flume-spool
agent.sources.spool-src.batchSize = 1
agent.channels.mem-channel.type = memory
agent.channels.mem-channel.capacity = 10
agent.channels.mem-channel.transactionCapacity = 4
agent.sinks.fileout.channel = mem-channel
agent.sinks.fileout.type = file_roll
agent.sinks.fileout.sink.directory = /tmp/flume-output
agent.sinks.fileout.sink.rollInterval = 0
agent.sinks.fileout.batchSize = 1
I've tested this config with different values for the channel capacity & transaction capacity (e.g., 3 and 3), but haven't found a situation where the channel capacity is full and Flume is able to recover.
On the Flume mailing list someone told me it was probably this bug that affected my proof of concept. The bug means the batch size is effectively 100, even though it's specified differently in the config. I re-ran the test with the source & sink batchSizes set to 100, the memory channel transactionCapacity set to 100, and its capacity set to 300. With those values, the proof of concept works exactly as expected.
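For reference, the only lines that change relative to the config above are these (the values are the ones from the re-run described here):
agent.sources.spool-src.batchSize = 100
agent.channels.mem-channel.capacity = 300
agent.channels.mem-channel.transactionCapacity = 100
agent.sinks.fileout.batchSize = 100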
