Understand how k6 manages at low level a large number of API call in a short period of time - load-testing

I'm new with k6 and I'm sorry if I'm asking something naive. I'm trying to understand how that tool manage the network calls under the hood. Is it executing them at the max rate he can ? Is it queuing them based on the System Under Test's response time ?
I need to get that because I'm running a lot of tests using both k6 run and k6 cloud but I can't make more than ~2000 requests per second (looking at k6 results). I was wondering if it is k6 that implement some kind of back-pressure mechanism if it understand that my system is "slow" or if there are some other reasons why I can't overcome that limit.
I read here that is possible to make 300.000 request per second and that the cloud environment is already configured for that. I also try to manually configure my machine but nothing changed.
e.g. The following tests are identical, the only changes is the number of VUs. I run all test on k6 cloud.
Shared parameters:
60 api calls (I have a single http.batch with 60 api calls)
Iterations: 100
Executor: per-vu-iterations
Here I got 547 reqs/s:
VUs: 10 (60.000 calls with an avg response time of 108ms)
Here I got 1.051,67 reqs/s:
VUs: 20 (120.000 calls with an avg response time of 112 ms)
I got 1.794,33 reqs/s:
VUs: 40 (240.000 calls with an avg response time of 134 ms)
Here I got 2.060,33 ​reqs/s:
VUs: 80 (480.000 calls with an avg response time of 238 ms)
Here I got 2.223,33 ​reqs/s:
VUs: 160 (960.000 calls with an avg response time of 479 ms)
Here I got 2.102,83 peak ​reqs/s:
VUs: 200 (1.081.380 calls with an avg response time of 637 ms) // I reach the max duration here, that's why he stop
What I was expecting is that if my system can't handle so much requests I have to see a lot of timeout errors but I haven't see any. What I'm seeing is that all the API calls are executed and no errors is returned. Can anyone help me ?

As k6 - or more specifically, your VUs - execute code synchronously, the amount of throughput you can achieve is fully dependent on how quickly the system you're interacting with responds.
Lets take this script as an example:
import http from 'k6/http';
export default function() {
http.get("https://httpbin.org/delay/1");
}
The endpoint here is purposefully designed to take 1 second to respond. There is no other code in the exported default function. Because each VU will wait for a response (or a timeout) before proceeding past the http.get statement, the maximum amount of throughput for each VU will be a very predictable 1 HTTP request/sec.
Often, response times (and/or errors, like timeouts) will increase as you increase the number of VUs. You will eventually reach a point where adding VUs does not result in higher throughput. In this situation, you've basically established the maximum throughput the System-Under-Test can handle. It simply can't keep up.
The only situation where that might not be the case is when the system running k6 runs out of hardware resources (usually CPU time). This is something that you must always pay attention to.
If you are using k6 OSS, you can scale to as many VUs (concurrent threads) as your system can handle. You could also use http.batch to fire off multiple requests concurrently within each VU (the statement will still block until all responses have been received). This might be slightly less overhead than spinning up additional VUs.

Related

SLO calculation for 90% of requests under 1000ms

I'm trying to figure out the PromQL for an SLO for latency, where we want 90% of all requests to be served in 1000ms or less.
I can get the 90th percentile of requests with this:
histogram_quantile( 0.90, sum by (le) ( rate(MyMetric_Request_Duration_bucket{instance="foo"}[1h]) ) )
And I can find what percentage of ALL requests were served in 1000ms or less with this.
((sum(rate(MyMetric_Request_Duration_bucket{le="1000",instance="foo"}[1h]))) / (sum (rate(MyMetric_Request_Duration_count{instance="foo"}[1h])))) *100
Is it possible to combine these into one query that tells me what percentage of requests in the 90th percentile were served in 1000ms or less?
I tried the most obvious (to me anyway) solution, but got no data back.
histogram_quantile( 0.90, sum by (le) ( rate(MyMetric_Request_Duration_bucket{le="1000",instance="foo"}[1h]) ) )
The goal is to get a measure that shows For the 90th percentile of requests, how many of those requests were under 1000ms? Seems like this should be simple but I can't find a PromQL query that allows me to do it.
Welcome to SO.
Out of all the requests how many are getting served under 1000ms, to find that I would divide the total number of requests under 1000ms by the total number of requests.. In my gcp world, it translates to a query like this:
You are basically measuring your
(sum(rate(istio_request_duration_milliseconds_bucket{reporter="destination",namespace="abcxyz",le="1000"}[1m]))/sum(rate(istio_request_duration_milliseconds_count{reporter="destination",namespace="abcxyz"}[1m])))*100
Once you have a graph setup with the above query in grafana, you can setup an alert on anything below 93 that way you are alerted even before your reach your SLO of 90%.
Prometheus doesn't provide a function, which could be used for calculating the share (aka the percentage) of requests served in under one second from histogram buckets. But such a function exists in VictoriaMetrics - this is Prometheus-like monitoring system I work on. The function is histogram_share(). For example, the following query returns the share of requests with durations smaller than one second served during the last hour:
histogram_share(1s, sum(rate(http_request_duration_seconds_bucket[1h])) by (le))
Then the following query can be used for alerting when the share or requests, which are served in less than one second, drops below 90%:
histogram_share(1s, sum(rate(http_request_duration_seconds_bucket[1h])) by (le)) < 0.9
Please note that all the functions, which work over histogram buckets, return the estimated results. Their accuracy highly depends on the used histogram buckets' boundaries. See this article for details.

Getting Locust to send a predefined distribution of requests per second

I previously asked this question about using Locust as the means of delivering a static, repeatable request load to the target server (n requests per second for five minutes, where n is predetermined for each second), and it was determined that it's not readily achievable.
So, I took a step back and reformulated the problem into something that you probably could do using a custom load shape, but I'm not sure how – hence this question.
As in the previous question, we have a 5-minute period of extracted Apache logs, where each second, anywhere from 1 to 36 GET requests were made to an Apache server. From those logs, I can get a distribution of how many times a certain requests-per-second rate appeared; e.g. there's a 1/4000 chance of 36 requests being processed on any given second, 1/50 for 18 requests to be processed on any given second, etc.
I can model the distribution of request rates as a simple Python list: the numbers between 1 and 36 appear in it an equal number of times as 1–36 requests per second were made in the 5-minute period captured in the Apache logs, and then just randomly get a number from it in the tick() method of a custom load shape to get a number that informs the (user count, spawn rate) calculation.
Additionally, by using a predetermined random seed, I can make the test runs repeatable to within an acceptable level of variation to be useful in testing my API server configuration changes, since the same random list elements should be retrieved each time.
The problem is that I'm not yet able to "think in Locust", to think in terms of user counts and spawn rates instead of rates of requests received by the server.
The question becomes this:
How do you implement the tick() method of a custom load shape in such a way that the (user count, spawn rate) tuple results in a roughly known distribution of requests per second to be sent, possibly with the help of other configuration options and plugins?
You need to create a Locust User with the tasks you want it to run (e.g. make your http calls). You can define time between tasks to kind of control the requests per second. If you have a task to make a single http call and define wait_time = constant(1) you can roughly get 1 request per second. Locust's spawn_rate is a per second unit. Since you have the data you want to reproduce already and it's in 1 second intervals, you can then create a LoadTestShape class with the tick() method somewhat like this:
class MyShape(LoadTestShape):
repro_data = […]
last_user_count = 0
def tick(self):
self.last_user_count = requests_per_second
if len(self.repro_data) > 0:
requests_per_second = self.repro_data.pop(0)
requests_per_second_diff = abs(last_user_count - requests_per_second)
return (requests_per_second, requests_per_second_diff)
return None
If your first data point is 10 requests, you'd need requests_per_second=10 and requests_per_second_diff=10 to make Locust spin up all 10 users in a single second. If the next second is 25, you'd have requests_per_second=25 and requests_per_second_diff=15. In a Load Shape, spawn_rate also works for decreasing the number of users. So if next is 16, requests_per_second=16 and requests_per_second_diff=9.

Circumventing negative side effects of default request sizes

I have been using Reactor pretty extensively for a while now.
The biggest caveat I have had coming up multiple times is default request sizes / prefetch.
Take this simple code for example:
Mono.fromCallable(System::currentTimeMillis)
.repeat()
.delayElements(Duration.ofSeconds(1))
.take(5)
.doOnNext(n -> log.info(n.toString()))
.blockLast();
To the eye of someone who might have worked with other reactive libraries before, this piece of code
should log the current timestamp every second for five times.
What really happens is that the same timestamp is returned five times, because delayElements doesn't send one request upstream for every elapsed duration, it sends 32 requests upstream by default, replenishing the number of requested elements as they are consumed.
This wouldn't be a problem if the environment variable for overriding the default prefetch wasn't capped to minimum 8.
This means that if I want to write real reactive code like above, I have to set the prefetch to one in every transformation. That sucks.
Is there a better way?

What is the best way to performance test an SQS consumer to find the max TPS that one host can handle?

I have a SQS consumer running in EventConsumerService that needs to handle up to 3K TPS successfully, sometimes upwards of 20K TPS (or 1.2 million messages per minute). For each message processed, I make a REST call to DataService's TCP VIP. I'm trying to perform a load test to find the max TPS that one host can handle in EventConsumerService without overstraining:
Request volume on dependencies, DynamoDB storage, etc
CPU utilization in both EventConsumerService and DataService
Network connections per host
IO stats due to overlogging
DLQ size must be minimal, currently I am seeing my DLQ growing to 500K messages due to 500 Service Unavailable exceptions thrown from DataService, so something must be wrong.
Approximate age of oldest message. I do not want a message sitting in the queue for over X minutes.
Fatals and latency of the REST call to DataService
Active threads
This is how I am performing the performance test:
I set up both my consumer and the other service on one host, the reason being I want to understand the load on both services per host.
I use a TPS generator to fill the SQS queue with a million messages
The EventConsumerService service is already running in production. Once messages started filling the SQS queue, I immediately could see requests being sent to DataService.
Here are the parameters I am tuning to find messagesPolledPerSecond:
messagesPolledPerSecond = (numberOfHosts * numberOfPollers * messageFetchSize) * (1000/(sleepTimeBetweenPollsPerMs+receiveMessageTimePerMs))
messagesInSurge / messagesPolledPerSecond = ageOfOldestMessageSLA
ageOfOldestMessage + settingsUpdatedLatency < latencySLA
The variables for SqsConsumer which I kept constant are:
numberOfHosts = 1
ReceiveMessageTimePerMs = 60 ms? It's out of my control
Max thread pool size: 300
Other factors are all game:
Number of pollers (default 1), I set to 150
Sleep time between polls (default 100 ms), I set to 0 ms
Sleep time when no messages (default 1000 ms), ???
message fetch size (default 1), I set to 10
However, with the above parameters, I am seeing a high amount of messages being sent to the DLQ due to server errors, so clearly I have set values to be too high. This testing methodology seems highly inefficient, and I am unable to find the optimal TPS that does not cause such a tremendous number of messages to be sent to the DLQ, and does not cause such a high approximate age of the oldest message.
Any guidance is appreciated in how best I should test. It'd be very helpful if we can set up a time to chat. PM me directly

How to alert in Seyren with Graphite if transactions in last 60 minutes are less than x?

I'm using Graphite+Statsd (with Python client) to collect custom metrics from a webapp: a counter for successful transactions. Let's say the counter is stats.transactions.count, that also has a rate/per/second metric available at stats.transactions.rate.
I've also setup Seyren as a monitor+alert system and successfully pulled metrics from Graphite. Now I want to setup an alert in Seyren if the number of successful transactions in the last 60 minutes is less than a certain minimum.
Which metric and Graphite function should I use? I tried with summarize(metric, '1h') but this gives me an alert each hour when Graphite starts aggregating the metric for the starting hour.
Note that Seyren also allows to specify Graphite's from and until parameters, if this helps.
I contributed the Seyren code to support from/until in order to handle this exact situation.
The following configuration should raise a warning if the count for the last hour drops below 50, and an error if it drops below 25.
Target: summarize(nonNegativeDerivative(stats.transactions.count),"1h","sum",true)
From: -1h
To: [blank]
Warn: 50 (soft minimum)
Error: 25 (hard minimum)
Note this will run every minute, so the "last hour" is a sliding scale. Also note that the third boolean parameter true for the summarize function is telling it to align its 1h bucket to the From, meaning you get a full 1-hour bucket starting 1 hour ago, rather than accidentally getting a half bucket. (Newer versions of Graphite may do this automatically.)
Your mileage may vary. I have had problems with this approach when the counter gets set back to 0 on a server restart. But in my case I'm using dropwizard metrics + graphite, not statsd + graphite, so you may not have that problem.
Please let me know if this approach works for you!

Resources