Old HTTP requests arrive at a server behind AWS ELB - iOS

We had an interesting event the other day in our system where a burst of month-old HTTP requests arrived at our ELB and from there at one of our coupled servers. We could tell that the requests were old from a timestamp we send from our client app (and from the fact they contained no relevant data :) ).
Our system is hosted on AWS, using a group of EC2 instances behind an ELB that communicates with the EC2 instances over HTTP. Also, our client app runs on iOS.
A thing to notice - the old requests were dated to a day on which we had a server crash, which led to a heavy load on our remaining servers (resulting in a lot of hung HTTP requests, i.e. they were not processed).
Also, although the group of old messages originally spanned several minutes (which we know from the timestamps), they all came in a single bulk the other day (this is from the ELB metrics).
We are trying to figure out how or where these requests could have stacked up, and maybe understand why it happened when it did.
Any insights, similar experiences or suggestions will be appreciated as we've failed to find similar events on the web, thanks!

Related

NextJs - ServerSide Render CPU spike with certain API request

We noticed an interesting issue with the server-side rendering of our NextJs application.
The issue occurred when we decided to add fetching of our translation key values (which are fetched via API) to our server-side calls.
We made this decision to reduce CLS and give the user a better overall experience on first load. We didn't think much of it, since it's just another API call being handled, returning the JSON before rendering the page.
Below is a graph of our CPU usage on AWS. Between 10:00 (10 AM) and 14:00 (2 PM) the fetching of the translation keys was live on production. We had to restart manually every 2 hours in order for the servers to survive (that is the peak you are seeing). After 16:00 (4 PM) we removed this API call from that specific server-side call, and you can see that the server is stable. From 20:00 (8 PM) we enabled an automatic restart to be sure the servers would survive the night, but as you can see this was not necessary.
The API call in question just returns a JSON object. It can contain up to 700 lines of values, which should be fine, as our product listing page can have larger responses (up to 10k lines). Everything has caching enabled, both next/static and the API responses. We were also thinking that it might have something to do with outgoing connections not closing in time. But I am making this post because no one really knows why this is an issue.
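For reference, one direction we were considering is memoising the translation response at module scope so a server-side render does not open a new outgoing request every time. This is only a rough sketch; the TRANSLATIONS_URL endpoint, the TTL, and the getTranslations helper are made-up illustrations, not our actual code:

```typescript
// Sketch only: cache the translation JSON in module scope so repeated
// SSR renders reuse it instead of re-fetching from the API each time.
// TRANSLATIONS_URL and the response shape are assumptions for illustration.
import type { GetServerSideProps } from "next";

const TRANSLATIONS_URL = "https://api.example.com/translations"; // hypothetical
const TTL_MS = 5 * 60 * 1000; // refresh at most every five minutes

let cached: { data: Record<string, string>; fetchedAt: number } | null = null;

async function getTranslations(): Promise<Record<string, string>> {
  if (cached && Date.now() - cached.fetchedAt < TTL_MS) {
    return cached.data;
  }
  const res = await fetch(TRANSLATIONS_URL);
  if (!res.ok) throw new Error(`translations fetch failed: ${res.status}`);
  const data = (await res.json()) as Record<string, string>;
  cached = { data, fetchedAt: Date.now() };
  return data;
}

export const getServerSideProps: GetServerSideProps = async () => {
  return { props: { translations: await getTranslations() } };
};
```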
We are running the following setup:
Dockerized Application
Running on AWS Beanstalk
Cloudfront + Akamai
NextJs v12.2
Node v14.16
If anyone has the smallest idea in which direction to look, please let me know. We appreciate it.

Large percentage of requests in CLRThreadPoolQueue

We have an ASP.NET MVC application hosted in an Azure App Service. After running the profiler to help diagnose possible slow requests, we were surprised to see this:
An unusually high percentage of slow requests in the CLRThreadPoolQueue. We've now run multiple profiling sessions, and each comes back with between 40% and 80% of slow requests in the CLRThreadPoolQueue (something we'd never seen before in previous profiles). CPU each time was below 40%, and after checking our metrics we aren't getting sudden spikes in requests.
The majority of the requests listed as slow are very simple API calls. We've added response caching and made them async. The only thing they do is hit a database looking for a single-record result. We've checked the metrics on the database and the average query run time is around 50ms or less. Looking at Application Insights for these requests confirms this, and shows that the database query doesn't take place until the very end of the request timeline (I assume this is the request sitting in the queue).
Recently we started including SignalR in a portion of our application. It's not fully in use, but it is in the code base. We have since switched to using Azure SignalR Service and saw no changes. The addition of SignalR is the only "major" change/addition we've made since encountering this issue.
I understand we can scale up and/or increase minWorkerThreads. However, this feels like I'm just treating the symptom, not the cause.
Things we've tried:
Finding the most frequent requests and making them async (they weren't before)
Response caching to frequent requests
Using Azure SignalR Service rather than hosting it in the same web app
Running memory dumps and contacting Azure support (they found nothing)
Scaling up to an S3
Profiling with and without thread report
-- None of these steps have resolved our issue --
How can we determine what requests and/or code is causing requests to pile up in the CLRThreadPoolQueue?
We encountered a similar problem; I guess internally SignalR must be using up a lot of threads or some other contended resource.
We did three things that helped a lot:
Call ThreadPool.SetMinThreads(400, 1) on app startup to make sure that the threadpool has enough threads to handle all the incoming requests from the start
Create a second App Service with the same code deployed to it. In the JavaScript, set the SignalR URL to point to that second instance (see the sketch after this list). That way, all the SignalR requests go to one App Service, and all the app's HTTP requests go to the other. Obviously this requires a SignalR backplane to be set up, but assuming your App Service has more than one instance you'll have had to do this anyway
Review the code for any synchronous code paths (e.g. making a non-async call to the database or to an API) and convert them to async code paths
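Point 2 on the client side is just a matter of which URL the SignalR connection is built against. A minimal sketch with the JavaScript/TypeScript client (the host name and hub path here are hypothetical):

```typescript
// Sketch: build the SignalR connection against the dedicated App Service
// so hub traffic never hits the main site. URL and hub name are made up.
import { HubConnectionBuilder, LogLevel } from "@microsoft/signalr";

const connection = new HubConnectionBuilder()
  .withUrl("https://myapp-signalr.azurewebsites.net/notificationsHub")
  .withAutomaticReconnect()
  .configureLogging(LogLevel.Information)
  .build();

connection.start().catch((err) => console.error("SignalR start failed", err));
```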

How do I spread out load on my rails app from webhook responses?

I have a rails app that easily handles the traffic we currently experience, except once a day when we receive a large number of pings within a few seconds from an external service's webhook that is reporting on past transactions. Currently this causes the app to time out due to lack of db connection availability, meaning we lose some of the webhooks as well as bringing the site down for a few seconds. It's not important that the data contained in these webhooks be processed instantaneously, so I am looking for a good way to spread out the responses, rather than do an expensive upgrade just to handle these bursts with additional db connection capability.
Is it okay to just have the relevant controller method sleep for a small, random number of seconds before doing anything that would open a db connection to spread things out? Or is there a better way to do this?
Set up a background/async processing system like Sidekiq (or whatever Heroku offers). Modify your controller action to do nothing but shove the parameters into a background job and return "ok". Then process the job in the background.
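The same shape in a Node/TypeScript sketch, purely as an illustration of the pattern (the answer above is about Sidekiq in Rails; here Express and BullMQ stand in for the controller and the job queue):

```typescript
// Illustration only: acknowledge the webhook immediately and enqueue the
// payload, so DB work happens later at a controlled pace in a worker.
import express from "express";
import { Queue, Worker } from "bullmq";

const connection = { host: "127.0.0.1", port: 6379 }; // Redis backing the queue
const webhookQueue = new Queue("webhooks", { connection });

const app = express();
app.use(express.json());

app.post("/webhooks/transactions", async (req, res) => {
  await webhookQueue.add("process-transaction", req.body);
  res.status(200).send("ok"); // respond before touching the database
});

// Low concurrency keeps the number of simultaneous DB connections small.
new Worker(
  "webhooks",
  async (job) => {
    // ...open a DB connection and persist job.data here...
  },
  { connection, concurrency: 2 }
);

app.listen(3000);
```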

Server Sent Events and Rails Streaming

I'm experimenting with Rails 4 ActionController::Live and Server Sent Events. I'm using MRI 2.0.0 and Puma.
From what I can see, each connected client keeps an active connection to the server. I was wondering if it is possible to leverage SSEs without keeping all response streams running.
Puma manages multiple connections using threads, and I imagine there is a limit to the number of concurrent connections.
What if I want to support a real-world scenario with thousands of clients registering to my Rails app for SSE events?
Is there any example?
Also, I usually run Rails app servers behind an nginx reverse proxy. Would it require any particular setup?
The way that SSEs are built is by the client opening a connection to the server, which is then left open until the server has some data to send. This is part of the SSE spec, and not a thing specific to ActionController::Live. It's effectively the same as long-polling, but with the connection not being closed after the first bit of data is returned, and with the mechanism built into the browser.
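On the browser side, that mechanism is the built-in EventSource API. A minimal sketch (the /events path and the "update" event name are hypothetical):

```typescript
// Minimal browser-side SSE consumer; endpoint and event name are made up.
const source = new EventSource("/events");

// Fires for plain "data:" messages with no event name.
source.onmessage = (event: MessageEvent) => {
  console.log("message:", event.data);
};

// Fires for messages the server tags with "event: update".
source.addEventListener("update", (event) => {
  console.log("update:", (event as MessageEvent).data);
});

source.onerror = () => {
  // The browser reconnects automatically; call source.close() to stop.
  console.warn("SSE connection dropped, retrying...");
};
```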
As such, the only way it can be implemented is by having multiple open client connections to the webserver which sit there indefinitely. As to what resources are required to deal with them, I'm not sure, as I've not yet tried to benchmark this, but it'll need enough servers for Puma to keep open thousands of connections if you have that many users with a page open.
The default limit for Puma is 16 concurrent connections. Several blog posts about setting up SSEs for Rails mention upping this to a larger value, but none that I've found suggest what this higher value should be. They do mention that the number of DB connections will need to be the same, as each Rails thread keeps one running. Sort of sounds like an expensive way to run things.
"Run a benchmark" is the only answer really.
I can't comment as to reverse proxying as I've not tried it, but as SSEs are done over standard HTTP, I shouldn't think it'll need any special setup.

Amazon Web Service Micro Instance - Server Crash

I am currently using an AWS micro instance as a web server for a website that allows users to upload photos. Two questions:
1) When looking at my CloudWatch metrics, I have recently noticed CPU spikes. The website receives very little traffic at the moment, but becomes utterly unusable during these spikes. These spikes can last several hours, and resetting the server does not eliminate them.
2) Although seemingly unrelated, whenever I post a link to my website on Twitter, the server crashes (i.e., "Error Establishing a Database Connection"). After restarting Apache and MySQL, the website returns to normal functionality.
My only guess is that the issue is somehow the result of the limitations of the micro instance. Unfortunately, when I upgraded to a small instance, the site was actually slower, due to the fact that micro instances can burst to two EC2 compute units.
Any suggestions?
If you want to stay in the free tier of AWS (micro instance), you should offload as much as possible from your EC2 instance.
I would suggest uploading the images directly to S3 instead of going through your web server (see an example here: http://aws.amazon.com/articles/1434).
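The usual way to do this today is to have the server hand the client a short-lived presigned URL and let the browser PUT the file straight to S3. A rough sketch with the AWS SDK for JavaScript v3 (which postdates this thread; the bucket name, key, and region are made up):

```typescript
// Sketch: generate a presigned PUT URL so uploads bypass the EC2 instance.
// Bucket, key, and region are illustrative assumptions.
import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";
import { getSignedUrl } from "@aws-sdk/s3-request-presigner";

const s3 = new S3Client({ region: "us-east-1" });

export async function presignUpload(key: string): Promise<string> {
  const command = new PutObjectCommand({
    Bucket: "my-photo-uploads", // hypothetical bucket
    Key: key,
    ContentType: "image/jpeg",
  });
  // The URL is valid for 5 minutes; the client PUTs the file directly to S3.
  return getSignedUrl(s3, command, { expiresIn: 300 });
}
```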
S3 can also be used to serve most of your web assets (images, JS, CSS...) instead of your weak web server. You can also use these files in S3 as the origin for an Amazon CloudFront (CDN) distribution to improve your application's performance.
Another service that can help you offload work is SQS (Simple Queue Service). Instead of handling everything as online requests from users, you can send some requests (upload done, for example) as messages to SQS and have a reader process these messages at its own pace. This is a good way to handle momentary load caused by several users working with your service simultaneously.
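For illustration, publishing such a message is a single API call. A sketch with the AWS SDK for JavaScript v3 (the queue URL and payload are made up):

```typescript
// Sketch: push an "upload done" event onto SQS so a background consumer can
// process it later. Queue URL, region, and payload shape are assumptions.
import { SQSClient, SendMessageCommand } from "@aws-sdk/client-sqs";

const sqs = new SQSClient({ region: "us-east-1" });
const QUEUE_URL =
  "https://sqs.us-east-1.amazonaws.com/123456789012/upload-events"; // hypothetical

export async function notifyUploadDone(photoKey: string): Promise<void> {
  await sqs.send(
    new SendMessageCommand({
      QueueUrl: QUEUE_URL,
      MessageBody: JSON.stringify({ event: "upload-done", photoKey }),
    })
  );
}
```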
Another service is DynamoDB (a managed NoSQL DB service). You can move most of your current MySQL data and queries onto DynamoDB. Amazon DynamoDB also has a free tier that you can enjoy.
With the combination of the above, you can have your micro instance handling the few remaining dynamic pages until you need to scale your service with your growing success.
Wait… I'm sorry. Did you say you were running both Apache and MySQL Server on a micro instance?
First of all, that's never a good idea. Secondly, as documented, micros have low I/O and can only burst to 2 ECUs.
If you want to continue using a resource-constrained micro instance, you need to (a) put MySQL somewhere else, and (b) use something like Nginx instead of Apache as it requires far fewer resources to run. Otherwise, you should seriously consider sizing up to something larger.
I had the same issue. As far as I understand, the problem is that AWS will slow you down when you reach a predefined usage threshold. This means that they allow a small burst, but after that things will become horribly slow.
You can test that by logging in and doing something. If you use the CPU for a couple of seconds then the whole box will become extremely slow. After that you'll have to wait without doing anything at all to get things back to "normal".
That was the main reason I went for VPS instead of AWS.
