Spike vs stress test to determine performance (load-testing)

I have a somewhat theoretical question that I'm actively gathering more views on, but it's also related to k6:
During load testing, I'm mostly doing spike tests on the SUT (System Under Test). We gradually dial up the number of users hitting the server: 10, 100, 1000. We log the API execution time separately from the response time, and these are the kinds of results we obtain for, say, a spike of 1000 users at a single point in time:
import http from 'k6/http';

// Bu, Iteration and Index are defined elsewhere in the script
export const options: any = {
  scenarios: {
    [Bu]: {
      executor: 'per-vu-iterations',
      exec: 'mRun',
      vus: 1000,
      iterations: Iteration,
      maxDuration: '1000m',
      tags: { tag_index: Index },
      env: {
        bu: Bu,
      },
    },
  },
};

export function mRun() {
  const res = http.get('baseurl/check');
  console.log(`response time: ${res.timings.duration}`); // log res.timings.duration
}
response time: 0.1s   api execution time: 0.3s
response time: 0.11s  api execution time: 0.4s
response time: 0.15s  api execution time: 0.3s
response time: 0.7s   api execution time: 0.5s
response time: 0.9s   api execution time: 0.3s
... 500th request processed by the server
response time: 7s     api execution time: 0.6s
response time: 7.1s   api execution time: 0.5s
response time: 7.2s   api execution time: 0.4s
... 1000th request processed by the server
response time: 11s    api execution time: 0.6s
response time: 11.1s  api execution time: 0.5s
I'm using both an Azure Web App Service (P2V1) and an Azure Database for PostgreSQL (General Purpose, 4 cores, 16 GB).
We speculate that the increasing response time is due to requests queuing up on the server.
As the observations above show, the execution time fluctuates around roughly the same value, and it's this queuing of requests that gives us a seemingly bad response time.
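As a rough back-of-the-envelope check on that speculation (a sketch, assuming the server drains a FIFO backlog with a roughly fixed number of concurrent workers; the worker count below is inferred from the observations, not measured):

// If response time ≈ queue wait + execution time, and the backlog drains at
// (workers / execution time) requests per second, then for the k-th request:
//   responseTime ≈ (k * executionTime) / workers
// so the implied number of concurrent workers is:
function impliedWorkers(position: number, execTimeSec: number, responseTimeSec: number): number {
  return (position * execTimeSec) / responseTimeSec;
}

console.log(impliedWorkers(500, 0.5, 7));   // ≈ 36
console.log(impliedWorkers(1000, 0.5, 11)); // ≈ 45

Both samples imply a few dozen concurrent workers, which is consistent with the queuing explanation: execution time stays flat while wait time grows with queue position.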
So what I'm unsure about is whether a spike test is actually a performance test. To me it seems more like a concurrency test, in which we're trying to see whether the system fails at some rate; logging response times here seems to have less value than logging them for a SUT under continuous, expected load. These are just my ideas, and I'd like more views on this point.
So, if you logged the response times of a SUT during a spike test and during a stress test, how would you interpret those two different sets of data? And which test (or test data) would be better suited to estimate, determine and benchmark the response times of a SUT?
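For contrast, a stress-style profile in k6 replaces the single 1000-VU spike with a gradual ramp, e.g. via the ramping-vus executor. This is only a sketch; the stage durations and targets are placeholders, not values from the test above:

import http from 'k6/http';

export const options: any = {
  scenarios: {
    stress: {
      executor: 'ramping-vus',
      exec: 'mRun',
      startVUs: 0,
      stages: [
        { duration: '2m', target: 100 },   // ramp up to the expected load
        { duration: '5m', target: 100 },   // hold at the expected load
        { duration: '2m', target: 1000 },  // push beyond the expected load
        { duration: '5m', target: 1000 },  // hold to observe degradation
        { duration: '2m', target: 0 },     // ramp down
      ],
    },
  },
};

export function mRun() {
  const res = http.get('baseurl/check');
  console.log(`response time: ${res.timings.duration}`);
}

With a ramp like this you can see at which VU level the response time starts to diverge from the execution time, instead of only observing the fully queued state.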

Related

Understanding how k6 manages, at a low level, a large number of API calls in a short period of time

I'm new to k6, and I'm sorry if I'm asking something naive. I'm trying to understand how the tool manages network calls under the hood. Does it execute them at the maximum rate it can? Does it queue them based on the System Under Test's response time?
I need to understand this because I'm running a lot of tests using both k6 run and k6 cloud, but I can't make more than ~2,000 requests per second (according to the k6 results). I was wondering whether k6 implements some kind of back-pressure mechanism when it detects that my system is "slow", or whether there are other reasons why I can't get past that limit.
I read here that it is possible to make 300,000 requests per second and that the cloud environment is already configured for that. I also tried to manually configure my machine, but nothing changed.
For example, the following tests are identical; the only change is the number of VUs. I ran all tests on k6 cloud.
Shared parameters:
60 API calls (I have a single http.batch with 60 API calls)
Iterations: 100
Executor: per-vu-iterations
Here I got 547 reqs/s:
VUs: 10 (60,000 calls with an avg response time of 108 ms)
Here I got 1,051.67 reqs/s:
VUs: 20 (120,000 calls with an avg response time of 112 ms)
Here I got 1,794.33 reqs/s:
VUs: 40 (240,000 calls with an avg response time of 134 ms)
Here I got 2,060.33 reqs/s:
VUs: 80 (480,000 calls with an avg response time of 238 ms)
Here I got 2,223.33 reqs/s:
VUs: 160 (960,000 calls with an avg response time of 479 ms)
Here I got a peak of 2,102.83 reqs/s:
VUs: 200 (1,081,380 calls with an avg response time of 637 ms) // I reached the max duration here, which is why it stopped
What I was expecting is that, if my system couldn't handle that many requests, I would see a lot of timeout errors, but I haven't seen any. What I'm seeing is that all the API calls are executed and no errors are returned. Can anyone help me?
As k6 - or more specifically, your VUs - execute code synchronously, the amount of throughput you can achieve is fully dependent on how quickly the system you're interacting with responds.
Let's take this script as an example:
import http from 'k6/http';

export default function () {
  http.get("https://httpbin.org/delay/1");
}
The endpoint here is purposefully designed to take 1 second to respond. There is no other code in the exported default function. Because each VU will wait for a response (or a timeout) before proceeding past the http.get statement, the maximum amount of throughput for each VU will be a very predictable 1 HTTP request/sec.
Often, response times (and/or errors, like timeouts) will increase as you increase the number of VUs. You will eventually reach a point where adding VUs does not result in higher throughput. In this situation, you've basically established the maximum throughput the System-Under-Test can handle. It simply can't keep up.
The only situation where that might not be the case is when the system running k6 runs out of hardware resources (usually CPU time). This is something that you must always pay attention to.
If you are using k6 OSS, you can scale to as many VUs (concurrent threads) as your system can handle. You could also use http.batch to fire off multiple requests concurrently within each VU (the statement will still block until all responses have been received). This might be slightly less overhead than spinning up additional VUs.
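If the goal is to push a fixed request rate regardless of how quickly the SUT responds (an open model, rather than the closed model described above where each VU waits for a response), k6's arrival-rate executors can help. A sketch, with the rate, duration and URL as placeholder values:

import http from 'k6/http';

export const options = {
  scenarios: {
    open_model: {
      executor: 'constant-arrival-rate',
      rate: 3000,            // iterations started per timeUnit, independent of response times
      timeUnit: '1s',
      duration: '5m',
      preAllocatedVUs: 500,  // enough idle VUs to sustain the rate
      maxVUs: 2000,          // extra VUs are allocated up to this cap if responses slow down
    },
  },
};

export default function () {
  http.get('https://httpbin.org/delay/1');
}

If the allocated VUs cannot sustain the configured rate, k6 reports dropped iterations instead of silently slowing down, which makes it easier to tell whether the bottleneck is the SUT or the load generator.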

Apache Bench 'Time per Request' decreases with increasing concurrency

I am testing my web server using Apache Bench and I am getting the following results:
Request : ab -n 1000 -c 20 https://www.my-example.com
Time per request: 16.264 [ms] (mean, across all concurrent requests)
Request : ab -n 10000 -c 100 https://www.my-example.com
Time per request: 3.587 [ms] (mean, across all concurrent requests)
Request : ab -n 10000 -c 500 https://www.my-example.com
Time per request: 1.381 [ms] (mean, across all concurrent requests)
The 'Time per request' is decreasing with increasing concurrency. May I know why? Or is this by any chance a bug?
You should be seeing 2 values for Time per request. One of them is [ms] (mean) whereas the other one is [ms] (mean, across all concurrent requests). A concurrency of 20 means that 20 simultaneous requests were sent in a single go and the concurrency was maintained for the duration of the test. The lower value is total_time_taken/total_number_of_requests and it kind of disregards the concurrency aspect whereas the other value is closer to the mean response time (actual response time) you were getting for your requests. I generally visualize it as x concurrent requests being sent in a single batch, and that value is the mean time it took for a batch of concurrent requests to complete. This value will also be closer to your percentiles, which also points to it being the actual time taken by the request.
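To make the relationship concrete, here is a small sketch using the numbers from the question; ab derives the "mean, across all concurrent requests" figure by dividing the per-request mean by the concurrency level, so multiplying it back out recovers the approximate actual latency:

// ab's "Time per request (mean)" ≈ concurrency × "Time per request (mean, across all concurrent requests)"
const runs = [
  { concurrency: 20,  meanAcrossAllMs: 16.264 },
  { concurrency: 100, meanAcrossAllMs: 3.587 },
  { concurrency: 500, meanAcrossAllMs: 1.381 },
];

for (const run of runs) {
  const approxLatencyMs = run.concurrency * run.meanAcrossAllMs;
  console.log(`-c ${run.concurrency}: ~${approxLatencyMs.toFixed(1)} ms actual time per request`);
}
// -c 20:  ~325.3 ms
// -c 100: ~358.7 ms
// -c 500: ~690.5 ms

So the per-request latency actually rises with concurrency; only the throughput-derived figure falls, because more requests are in flight at once.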

Flink Checkpoint Failure - Checkpoints time out after 10 mins

We get one or two checkpoint failures while processing data every day. The data volume is low (under 10k records), and our checkpoint interval setting is 2 minutes. (The reason processing is very slow is that we need to sink the data to another API endpoint, which takes some time, at the end of the Flink job; so the total time is streaming the data + sinking it to the external API endpoint.)
The root issue is:
Checkpoints time out after 10 minutes. This is caused by the data processing time being longer than 10 minutes, so the checkpoint times out. We could increase the parallelism to speed up processing, but if the data volume grows we would have to increase the parallelism again, so we don't want to go that way.
Suggested solution:
I saw someone suggest setting a pause between the old and new checkpoint, but my question is: if I set a pause time there, will the new checkpoint miss the state changes that happen during the pause?
Aim:
How can we avoid this issue and record the correct state without missing any data?
Failed checkpoint: (screenshot omitted)
Completed checkpoint: (screenshot omitted)
The failure message was that a subtask didn't respond.
Thanks
There are several related configuration variables you can set -- such as the checkpoint interval, the pause between checkpoints, and the number of concurrent checkpoints. No combination of these settings will result in data being skipped for checkpointing.
Setting an interval between checkpoints means that Flink won't initiate a new checkpoint until some time has passed since the completion (or failure) of the previous checkpoint -- but this has no effect on the timeout.
It sounds like you should extend the timeout, which you can do like this:
// the default checkpoint timeout is 10 minutes (600000 ms)
env.getCheckpointConfig().setCheckpointTimeout(n);
where n is measured in milliseconds. See the section of the Flink docs on enabling and configuring checkpointing for more details.

What is the best way to performance test an SQS consumer to find the max TPS that one host can handle?

I have a SQS consumer running in EventConsumerService that needs to handle up to 3K TPS successfully, sometimes upwards of 20K TPS (or 1.2 million messages per minute). For each message processed, I make a REST call to DataService's TCP VIP. I'm trying to perform a load test to find the max TPS that one host can handle in EventConsumerService without overstraining:
Request volume on dependencies, DynamoDB storage, etc
CPU utilization in both EventConsumerService and DataService
Network connections per host
IO stats due to overlogging
DLQ size must stay minimal; currently I am seeing my DLQ grow to 500K messages due to 500 Service Unavailable exceptions thrown by DataService, so something must be wrong.
Approximate age of oldest message. I do not want a message sitting in the queue for over X minutes.
Fatals and latency of the REST call to DataService
Active threads
This is how I am performing the performance test:
I set up both my consumer and the other service on one host, the reason being I want to understand the load on both services per host.
I use a TPS generator to fill the SQS queue with a million messages
The EventConsumerService service is already running in production. Once messages started filling the SQS queue, I immediately could see requests being sent to DataService.
Here are the parameters I am tuning to find messagesPolledPerSecond:
messagesPolledPerSecond = (numberOfHosts * numberOfPollers * messageFetchSize) * (1000/(sleepTimeBetweenPollsPerMs+receiveMessageTimePerMs))
messagesInSurge / messagesPolledPerSecond = ageOfOldestMessageSLA
ageOfOldestMessage + settingsUpdatedLatency < latencySLA
The variables for SqsConsumer which I kept constant are:
numberOfHosts = 1
ReceiveMessageTimePerMs = 60 ms? It's out of my control
Max thread pool size: 300
The other factors are all in play:
Number of pollers (default 1), I set to 150
Sleep time between polls (default 100 ms), I set to 0 ms
Sleep time when no messages (default 1000 ms), ???
message fetch size (default 1), I set to 10
However, with the above parameters, I am seeing a high number of messages being sent to the DLQ due to server errors, so I have clearly set the values too high. This testing methodology seems highly inefficient, and I am unable to find the optimal TPS that neither sends such a tremendous number of messages to the DLQ nor produces such a high approximate age of the oldest message.
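As a rough sanity check (a sketch that simply plugs the values quoted above into the polling formula; it is not part of the original test setup), the tuned settings imply a polling rate far above the 3K TPS target, which is consistent with the DLQ filling up:

// messagesPolledPerSecond = (numberOfHosts * numberOfPollers * messageFetchSize)
//                           * (1000 / (sleepTimeBetweenPollsPerMs + receiveMessageTimePerMs))
const numberOfHosts = 1;
const numberOfPollers = 150;          // raised from the default of 1
const messageFetchSize = 10;          // raised from the default of 1
const sleepTimeBetweenPollsPerMs = 0; // lowered from the default of 100 ms
const receiveMessageTimePerMs = 60;   // out of the poster's control

const messagesPolledPerSecond =
  numberOfHosts * numberOfPollers * messageFetchSize *
  (1000 / (sleepTimeBetweenPollsPerMs + receiveMessageTimePerMs));

console.log(messagesPolledPerSecond); // 25,000 messages/second on a single host

That is roughly eight times the 3K TPS goal for one host, so stepping numberOfPollers or messageFetchSize back down until the DLQ stays empty and the oldest-message age stays within the SLA is one way to bisect toward a sustainable rate.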
Any guidance on how best I should test would be appreciated. It'd be very helpful if we could set up a time to chat; PM me directly.

Jenkins - JMeter Plugin - How to compare with Previous build in terms of Average Response time in Seconds

I have scheduled a JMeter job in Jenkins. My goal is to compare the result of the current test with the previous build's result and set the build status accordingly. I have set:
Unstable % Range = (-)1 to (+)1
Error % Range = (-)5 to (+)5
My application's response time always varies by about 1 second, and this is a known factor.
Since JMeter returns the average response time in milliseconds, the test is marked as failed or unstable because the difference in milliseconds is always high.
Instead of comparing the response time in milliseconds, is it possible to convert the response time to seconds and then perform the build comparison?
