How to isolate worker dynos from web dynos on Heroku?

We have a Rails app deployed on Heroku with around 3 web dynos and 2-3 worker dynos. We have some export and import features that heavily load our worker dynos; when that happens, everything crashes and we get an App Error on the website.
Sentry tells us that it is due to a Timeout. We are trying to find out which part of our software is taking up so much worker time. The problem is that it affects all of our users, some of whom are only using web-layer features.
But I was wondering: is there a way to isolate our worker dyno problems from the web dynos' work? That is, is there a way to keep the site from crashing when one user exports a large amount of data and saturates the workers?
Thanks in advance!
Regards,
Gonzalo

Thank you for the answers; let me give you some related info:
- We use delayed_job for the workers, which is async.
- Our previous DB plan supported 120 connections, and we never saw it completely busy. The current AWS RDS instance only reached 24% utilization, and we saw at most 28 concurrent connections on the day of the crash.
- New Relic did not indicate delay in the DB.
- The web dynos start to generate timeouts across many features; if the crash is not related to the workers, it may be caused by some functionality that does not run in jobs.
Update:
- We have set limits on our exports. Even though the exports run in jobs, they were affecting the web layer and causing App Errors; when we set the limit, the App Errors were dramatically reduced.
- We are still searching for other unoptimized functionality.
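
For anyone with the same question - a minimal sketch of the isolation idea, assuming Active Job on top of delayed_job (the job, queue, and Procfile names are illustrative, not from the question): route heavy exports to their own queue and give that queue its own worker process, so a flood of exports can only saturate its dedicated workers.

# Procfile entries (illustrative):
#   worker:        bundle exec rake jobs:work QUEUES=default
#   export_worker: bundle exec rake jobs:work QUEUES=exports
class ExportJob < ApplicationJob
  queue_as :exports # heavy exports land on their own queue

  def perform(user_id)
    # build the export here; if this queue backs up, the default queue
    # and the web dynos stay unaffected
  end
end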

Related

Heroku configuration for Ruby on Rails application

I set a client up on Heroku for their Ruby on Rails application and have had a great deal of trouble over the years with the application not running well, regardless of how much money we spend on additional resources. I also find their documentation highly confusing and have never been able to understand their specific terminology. We are constantly getting "H12" errors, "R14" errors, etc. The memory usage and dyno loads are constantly spiking, and yet this is a small to medium-sized business without a massive amount of traffic. Wondering if anybody out there who does understand the ins and outs of Heroku can look this configuration over and tell me if it makes sense:
DB_POOL: 10
MALLOC_ARENA_MAX: 2
RAILS_MAX_THREADS: 5
WEB_CONCURRENCY: 4
Ruby 2.7
Rails 6.0
Puma
8 2x web dynos
5 1x worker dynos
$50 Postgres standard 0 database
$15 Memcachier
$10 Rediscloud
...etc addons
Your WEB_CONCURRENCY is too high for your Standard-2x dynos. The recommended default is 2: https://devcenter.heroku.com/articles/deploying-rails-applications-with-the-puma-web-server#recommended-default-puma-process-and-thread-configuration
This is likely contributing to your R14 errors as higher web concurrency means more memory usage. So you need to either lower your web concurrency (which may mean you also need to increase the # of dynos to compensate) or you need to use bigger dynos.
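
For reference, the Heroku-recommended Puma setup reads those config vars like this (a minimal config/puma.rb sketch, not your app's actual file):

# config/puma.rb
# Each Puma worker is a separate process with its own copy of the app,
# which is why WEB_CONCURRENCY=4 runs hot on a Standard-2x's 1GB of RAM.
workers Integer(ENV.fetch("WEB_CONCURRENCY", 2))

threads_count = Integer(ENV.fetch("RAILS_MAX_THREADS", 5))
threads threads_count, threads_count

preload_app!
port        ENV.fetch("PORT", 3000)
environment ENV.fetch("RACK_ENV", "development")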
You already have MALLOC_ARENA_MAX=2, but I'm not sure if you are using jemalloc. You might want to try that too.
Of course, you may also have other memory issues in your app - check out some tips here. I also recommend adding a monitoring tool like AppSignal as it's capable of tracking memory allocations per transaction.
For mitigating H12s:
Ensure you have installed something like the rack-timeout gem, which drops a long-running request at the dyno level and thus avoids the H12 error (you get a Rack::TimeoutError exception instead). Set the timeout to 15s so that it is well under Heroku's 30s H12 limit.
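
For example (a sketch - the exact configuration knob depends on your rack-timeout version; recent releases are configured through environment variables rather than Ruby code):

# Gemfile - in Rails, rack-timeout inserts its middleware automatically
gem "rack-timeout"

# Then set the limit via a Heroku config var (the gem's documented knob):
#   heroku config:set RACK_TIMEOUT_SERVICE_TIMEOUT=15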
Investigate your slow transactions. A monitoring tool is key here, e.g. New Relic (start with the lowest-priced paid plan; the free plan does not allow transaction tracing). Here is their blog post on how to trace transactions.
When you've identified the problem - fix it!
- If the bottleneck is external:
  - check for external API limits and throttling
  - add timeouts and make the app resilient to slow external responses (see the sketch after this list)
- If the bottleneck is the database:
  - optimize slow queries
  - check cache hit rates
  - check the number of waiting connections and DB locks -> if the number of waiting connections is consistently above 0 for X minutes, that indicates long locks you'll need to investigate. Waiting connections are easiest to track over time with Librato (the free plan should do fine)
- If the bottleneck is other app code:
  - add more custom instrumentation to get more insights, e.g. with New Relic's custom instrumentation
  - address the app code issues
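
As a sketch of the "add timeouts" point above - an external HTTP call guarded by explicit timeouts (the method name and URL handling are illustrative, not from the question):

require "net/http"

# Without explicit timeouts, a slow third-party API can hold a request
# until Heroku kills it with an H12.
def fetch_exchange_rates(uri_string)
  uri = URI(uri_string)
  Net::HTTP.start(uri.host, uri.port,
                  use_ssl: uri.scheme == "https",
                  open_timeout: 2,  # seconds to establish the connection
                  read_timeout: 5)  do |http| # seconds to wait for a response
    http.get(uri.request_uri)
  end
rescue Net::OpenTimeout, Net::ReadTimeout
  nil # degrade gracefully instead of tying up the dyno
end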
I want to stress the importance of monitoring tools for diagnosing issues and determining optimal resource usage. Figuring out the correct concurrency configs and the correct size and number of dynos to run is virtually impossible without proper monitoring tools. Hopefully you already have some covered by the add-ons you didn't list, but if not, I'll summarize my recommendations and mention a couple of other tips:
- To get more metrics info, ensure you have enabled log-runtime-metrics
- Also enable Ruby language metrics
- Add a monitoring tool that can track Ruby memory allocations, like AppSignal. Scout APM can do this too, but I think the plans capable of it are more expensive (it requires the Scout Insights feature)
- Add the lowest-paid version of New Relic. This is my go-to tool for transaction tracing. AppSignal can do this too if you don't want to pay for another tool, but I find it easier with New Relic.
- Add Librato. It offers some great charts out of the box, including a set of Postgres charts in its own dashboard.
- Set alerts in your monitoring apps to warn you about things like response times so you can look into them!
And of course, make all your changes in staging first AND load test them to see the impacts of your changes before attempting in production!
Update: I also just noticed that you said you are using Standard-0 Postgres, which means it has a 120 connection limit. So if you end up lowering your WEB_CONCURRENCY and increasing the # of dynos, watch out for your total connections to that database. Beyond just the fact that there is a limit, more connections also mean more overhead for your db anyway so if you are close to your connection limit, you are more likely to see db performance suffer. You may want to upgrade to another plan that has a higher connection limit or use pgbouncer as your connection pooler to avoid connection limits.
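
To make the connection math concrete (a back-of-envelope sketch assuming each Puma thread holds one DB connection, i.e. DB_POOL >= RAILS_MAX_THREADS):

web_dynos       = 8
web_concurrency = 2  # after lowering WEB_CONCURRENCY as recommended above
max_threads     = 5  # RAILS_MAX_THREADS
web_connections = web_dynos * web_concurrency * max_threads # => 80
# Leave headroom under Standard-0's 120-connection cap for the 5 worker
# dynos, one-off dynos, migrations and console sessions - or use pgbouncer.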

Could performance issues emerge when using ActionCable in production?

I'm planning a Rails app with a very content-rich interactive page that many users will connect to.
Development went well, and small-scale testing on the dev servers went without a hitch either.
Problems started when we began alpha testing with selected groups of people. The server would suddenly grind to a halt, and nginx would stop because its queue was full. I was at a loss for a while, but after looking around I came to the conclusion that Action Cable was completely eating up my memory. This gets especially bad when a user repeatedly reloads the page that subscribes to Action Cable, causing additional processes to become active and completely stopping the server, cured only by an nginx restart.
I currently run alpha testing on a 2-core, 1GB-memory SSD VPS with at most perhaps 20 concurrent users. Should I be running into performance problems under such load, or should tuning the code, Redis, or Passenger fix this?
I know it's hard to say anything definitive without more specifics, but could a ballpark estimate be made with this information?
After some googling and testing of nginx settings, adding this directive to the nginx settings for Passenger seems to have dramatically improved the performance issue:
location /special_websocket_endpoint {
    passenger_app_group_name foo_websocket;
    passenger_force_max_concurrent_requests_per_process 0;
}
more info here
https://www.phusionpassenger.com/library/config/nginx/tuning_sse_and_websockets/
20 concurrent users with multiple tabs per user is still fewer than about 100 concurrent websocket connections, which is not a lot.
First thing I'd look for is leaks - when, for some reason, a websocket connection or other resource (open files, etc.) does not get freed when the actual user disconnects. Make sure you're running fresh versions of Rails/Passenger, as there was a bug in Rails causing similar behaviour (see https://blog.phusion.nl/2016/07/07/actioncable-under-stress-p1/ for details).
Also, while Action Cable + Passenger inside nginx allows you to run everything inside a single process, that is not a good idea when you expect some load.
If you run a clean nginx and separate Rails servers for regular requests and for the cable, at least the other parts of the app will keep more or less working under such conditions.
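
One concrete way to do that split is the standalone Action Cable server described in the Rails guides (a sketch - the port is illustrative, and you'd point a separate nginx upstream at it):

# cable/config.ru - run with any Rack server, e.g.:
#   bundle exec puma -p 28080 cable/config.ru
require_relative "../config/environment"
Rails.application.eager_load!

run ActionCable.server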

How to decrease the memory consumption of Sidekiq processes?

I have a server that launches 10 web applications that are almost identical (only assets and content differ). These applications use Sidekiq to send emails after successful form submissions. The problem is memory usage: each process consumes 80-100MB of RAM.
I've already set concurrency: 1 for every project. Since the number of jobs is small, I want to somehow combine these processes into one. How do I do this? Is it a reliable solution? Or maybe I should search for memory leaks?
I'm not very experienced in this field, so any advice is welcome.
You could centralize all the email sending into a single Sidekiq process that sends out the emails for all applications.
Here's a good answer about the details of sharing sidekiq between various apps:
How to share worker among two different applications on heroku?
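
A minimal sketch of that setup, assuming all 10 apps share one Redis instance (the env var, queue, and worker names are hypothetical):

# In every app: point Sidekiq's client at the shared Redis.
Sidekiq.configure_client do |config|
  config.redis = { url: ENV.fetch("SHARED_REDIS_URL") }
end

# Enqueueing apps can push by class name without defining the worker;
# only the single app that runs the Sidekiq process needs EmailWorker.
Sidekiq::Client.push("class" => "EmailWorker",
                     "queue" => "mailers",
                     "args"  => [42, "welcome_email"])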

Split Heroku Web Workers by URL

Firstly: I realise in an ideal world I could achieve this using SOA. Humour me :)
Background
Imagine I have a rails app running on heroku with very minimal traffic in terms of user requests, they can be happily served by 1 web dyno.
I also have a machine somewhere in the world which is regularly and repeatedly submitting large files to my application via http://example.com/api/bigupload as fast as it is able.
The large files eat up my web dynos and so the user experience is bad. I increase the web dynos, but the large file uploads continue to tie them all up in long requests.
Question
Is there some way I can keep one worker in 'reserve' which will not respond to the big upload requests and concentrate on serving user traffic for other URLs?
Note: I have a similar situation to this one where automated large image uploads are eating my requests and delaying users accessing the API, albeit on a larger scale.
I think you're effectively asking: "Is there a way to partition my web dynos so that only some respond to a certain subset of requests".
The answer (today) is no unfortunately. Heroku routes randomly across all your web dynos.
What web server are you running on your web dyno? Are you using a concurrent web server? If you're not, that may have a large impact (in that it won't tie the dyno up nearly as much).
Have you explored a different architecture where, instead of your other app submitting the big uploads directly, it submits pointers to the big payloads? That way your web dyno can simply dump them on a queue, and your workers can fetch the payloads and process them - then you can scale by increasing the number of workers.
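
Sketched out, the pointer-passing idea might look like this (the controller, job, and parameter names are hypothetical - e.g. the uploader puts the file on S3 first and only POSTs the object key):

class Api::BigUploadsController < ApplicationController
  def create
    # Enqueue immediately and return; the web dyno never touches the payload.
    ProcessUploadJob.perform_later(params.require(:s3_key))
    head :accepted
  end
end

class ProcessUploadJob < ApplicationJob
  queue_as :uploads

  def perform(s3_key)
    # fetch the payload from S3 and do the heavy processing on a worker dyno
  end
end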

Heroku | Different performance parameters for different parts of your application

I have a Rails 3 application hosted on Heroku. It has a pretty common configuration where I have a client-facing part, say www.myapplication.com, and an admin part, admin.myapplication.com.
I want the client-facing part of my application to be fast, and I don't really care how fast my admin module is. What I do care about is that usage of my admin site does not slow down the client-facing part of my application.
Ideally the client side of the app will have 3 dedicated dynos, and the admin side will have 1 dedicated dyno.
Does anyone have any idea on the best way to accomplish this?
Thanks!
If you split the applications, you're going to have to share the database between the two apps. To be honest, I'd just keep it as one single app and give it 4 dynos :)
Also, dynos don't increase performance; they increase throughput, so you're capable of dealing with more requests per second.
For example,
Roughly - if a typical page response is 100ms, 1 dyno could process 10 requests a second. If you only have a single dyno and your app suddenly receives 10 requests per second, then the excess requests will be queued until the dyno is freed up to process them. Also, requests taking over 30s will be timed out.
If you add a second dyno requests would be shared between the 2 dynos so you'd now be able to process 20 requests a second (in an ideal world) and so on as you add more dynos.
And remember a dyno is single-threaded, so if it's doing anything (rendering a page, building a PDF, even receiving an uploaded image) then it's busy and unable to process further requests until it's finished, and if you don't have more dynos, requests will be queued.
My advice is to split your application into its logical parts. Having a separate application for the admin interface is a good thing.
It does not have to be on the same domain as the main application. It could sit behind a global client IP restriction or just simple global Basic Auth.
Why complicate things and stuff two things into one application? This also lets you experiment more with the admin part and redeploy it without affecting your users.
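
For example, a simple global Basic Auth gate for the separate admin app (a sketch; the credential names are illustrative - keep the real values in config vars):

class ApplicationController < ActionController::Base
  # Every request to the admin app must pass HTTP Basic Auth.
  http_basic_authenticate_with name:     ENV.fetch("ADMIN_USER"),
                               password: ENV.fetch("ADMIN_PASSWORD")
end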
