Unicorn CPU usage spiking during load tests, ways to optimize - ruby-on-rails

I am interested in ways to optimize my Unicorn setup for my Ruby on Rails 3.1.3 app. I'm currently spawning 14 worker processes on a High-CPU Extra Large instance, since my application appears to be CPU bound during load tests. At about 20 requests per second, replaying requests in a simulated load test, all 8 cores on my instance get pegged and the box load spikes up to 7-8. Each unicorn instance is utilizing about 56-60% CPU.
What are some ways I can optimize this? I'd like to be able to funnel more requests per second onto an instance of this size. Memory is completely fine, as is all other I/O. CPU is getting tanked during my tests.

If you are CPU bound, you want to use no more unicorn processes than you have cores; otherwise you overload the system and slow down the scheduler. You can test this on a dev box using ab. You will notice that 2 unicorns will outperform 20 (the exact number depends on cores, but the concept holds true).
The exception to this rule is if you're IO bound. In that case, add as many unicorns as memory can hold.
A good performance trick is to route IO-bound requests to a different app server hosting many unicorns, for example requests that use a slow SQL query or wait on an external call such as a credit card transaction. If using nginx, define an upstream server for the IO-bound requests and forward those URLs to a box with 40 unicorns. Forward CPU-bound or really fast requests to a box with 8 unicorns (you stated you have 8 cores, but on AWS you might want to try 4-6, since the instances run under a hypervisor whose scheduler is already very busy).
Also, I'm not sure you can count on AWS giving you reliable CPU usage numbers, as you're getting a percentage of an obscure percentage.
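As a rough illustration of the sizing rule in this answer, the helper below derives a worker count from the core count. It is only a sketch: unicorn_worker_count and the io_multiplier of 5 are made-up names and values for illustration (the real ceiling for IO-bound workers is whatever memory allows), while worker_processes is Unicorn's own config directive.

```ruby
require "etc"

# Hypothetical helper: choose a Unicorn worker count from the core count.
# CPU-bound apps get one worker per core; IO-bound apps get a multiple of
# that (io_multiplier is an assumed value; memory is the real limit).
def unicorn_worker_count(cores: Etc.nprocessors, io_bound: false, io_multiplier: 5)
  io_bound ? cores * io_multiplier : cores
end

# In config/unicorn.rb this could be used as:
#   worker_processes unicorn_worker_count
puts unicorn_worker_count(cores: 8)                 # CPU-bound box
puts unicorn_worker_count(cores: 8, io_bound: true) # IO-bound box
```

On a hypervised instance, passing a smaller cores value (say 4-6 instead of 8) would follow the answer's suggestion to leave some slack.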

First off, you probably don't want instances at 45-60% cpu. In that case, if you get a traffic spike, all of your instances will choke.
Next, 14 Unicorn instances seems large. Unicorn does not use threading. Rather, each worker process runs with a single thread, and a worker only picks up a request when it is free to handle one. Because of this, the number of cores isn't a metric you should use to measure performance with Unicorn.
A more conservative setup may use 4 or so Unicorn processes per instance, responding to maybe 5-8 requests per second. Then, adjust the number of instances until your CPU use is around 35%. This will ensure stability under the stressful '20 requests per second' scenario.
Lastly, you can get more gritty stats and details by using God.

For a high CPU extra large instance, 20 requests per second is very low. It is likely there is an issue with the code. A unicorn-specific problem seems less likely. If you are in doubt, you could try a different app server and confirm it still happens.
In this scenario, questions I'd be thinking about...
1 - Are you doing something CPU intensive in code--maybe something that should really be in the database. For example, if you are bringing back a large recordset and looping through it in ruby/rails to sort it or do some other operation, that would explain a CPU bottleneck at this level as opposed to within the database. The recommendation in this case is to revamp the query to do more and take the burden off of rails. For example, if you are sorting the result set in your controller, rather than through sql, that would cause an issue like this.
2 - Are you doing anything unusual compared to a vanilla crud app, like accessing a shared resource, or anything where contention could be an issue?
3 - Do you have any loops that might burn CPU, especially if there was contention for a resource?
4 - Try unhooking various parts of the controller logic in question. For example, how well does it scale if you hack your code to just return a static hello world response instead? I bet unicorn will suddenly be blazingly fast. Then try adding parts of your code back in until you discover the source of the slowness.
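One way to try the "static hello world" baseline from point 4 without touching the Rails app is a bare Rack endpoint. This is a minimal sketch; HELLO_APP is a made-up name, and it assumes you can point the same load test at a separate config.ru served by unicorn.

```ruby
# Minimal Rack-compatible app: any object responding to #call(env) that
# returns [status, headers, body]. It does no database or template work,
# so it measures the app server and instance on their own.
HELLO_APP = lambda do |_env|
  [200, { "content-type" => "text/plain" }, ["hello world"]]
end

# Saved as config.ru, the file would end with:
#   run HELLO_APP
status, _headers, body = HELLO_APP.call({})
puts status
puts body.join
```

If this endpoint sustains far more than 20 requests per second on the same box, the bottleneck is in application code rather than in Unicorn.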

Related

Async App Server versus Multiple Blocking Servers

tl;dr Many Rails apps or one Vertx/Play! app?
I've been having discussions with other members of my team on the pros and cons of using an async app server such as the Play! Framework (built on Netty) versus spinning up multiple instances of a Rails app server.
I know that Netty is asynchronous/non-blocking, meaning during a database query, network request, or something similar an async call will allow the event loop thread to switch from the blocked request to another request ready to be processed/served. This will keep the CPUs busy instead of blocking and waiting.
I'm arguing in favor of using something such as the Play! Framework or Vertx.io, something that is non-blocking and scalable. My team members, on the other hand, are saying that you can get the same benefit by using multiple instances of a Rails app, which out of the box only comes with one thread and doesn't have true concurrency as apps on the JVM do. They are saying just use enough app instances to match the performance of one Play! application (or however many Play! apps we use), and when a Rails app blocks, the OS will switch processes to a different Rails app. In the end, they are saying that the CPUs will be doing the same amount of work and we will get the same performance.
So here are my questions:
Are there any logical fallacies in the arguments above? Would the OS manage the Rails app instances as well as Netty (which also runs on the JVM, which maps threads to cores very well) manages requests in its event loop?
Would the OS be as performant in switching on blocking calls as would something like Netty or Vertx, or even something built on Ruby's own EventMachine?
With enough Rails app instances to match the performance of the Play! apps, would there be a noticeable cost difference in running the servers? If there is no cost difference, it wouldn't really matter which method is used, in my opinion. Shoot, if it were financially cheaper to run a million Rails apps than one Play! app, I would rather do that.
What are some other benefits to using either of these approaches that I may be failing to ask about?
Both approaches can and have worked. So if switching would incur a high development cost and/or schedule hit then it's probably not worth the effort...yet. Make the switch when the costs become unacceptably high. Think of using microservices as a gradual switching strategy.
If you are early on in your development cycle then making the switch early may make sense. Rewriting is a pain.
Or perhaps you'll never have to switch and rails will work for your use case like a charm. And you've been so successful at making your customers happy that the cash is just rolling in.
Some of the downsides of a blocking, single-threaded server approach:
Increased memory usage. Sources: multiple processes, memory leaks, lack of shared datastructures (which increases communication costs and brings up consistency issues).
Lack of parallelism. This has two consequences: more boxes and more latency. You'll need potentially a much larger box count to handle the same load. So if you need to scale and have money concerns then this can be a problem. If it isn't a concern then it doesn't matter. In the server it means increased latency, the sort of latency which can't be improved by multiplying processes, which may be a killer argument depending on your app.
Some examples of those who have made such a switch from Rails to node.js and golang:
LinkedIn Moved From Rails To Node: 27 Servers Cut And Up To 20x Faster : http://highscalability.com/blog/2012/10/4/linkedin-moved-from-rails-to-node-27-servers-cut-and-up-to-2.html
Why Timehop Chose Go to Replace Our Rails App : https://medium.com/building-timehop/why-timehop-chose-go-to-replace-our-rails-app-2855ea1912d
How We Moved Our API From Ruby to Go and Saved Our Sanity : http://blog.parse.com/learn/how-we-moved-our-api-from-ruby-to-go-and-saved-our-sanity/
How We Went from 30 Servers to 2: Go : http://www.iron.io/blog/2013/03/how-we-went-from-30-servers-to-2-go.html
These posts represent arguments that are probably illustrative of what your group is going through. The decision is unfortunately not an obvious one.
It depends on the nature of what you are building, the nature of your team, the nature of resources, the nature of your skills, the nature of your goals and how you value all the different tradeoffs.
Would costs really drop? Isn't the same amount of computation done no matter the number of servers?
Depends on the type and scale of the work being done. Typically web services are IO bound, waiting on responses from other services like databases, caches, etc.
If you are using a single-threaded server, the process is blocked on IO a lot, so it spends much of its time doing nothing. In contrast, the nonblocking server will be able to handle many requests while the single-threaded server is blocked. You can keep adding processes, but there are only so many processes a single machine can run. A nonblocking server can have the same number of processes while keeping the CPU as busy as possible handling requests. It's often possible to handle higher loads on smaller, cheaper machines when using nonblocking servers.
If your expected request rate can be handled by an acceptable number of boxes and you don't expect huge spikes then you would be fine with single threaded servers. Nonblocking servers are great at soaking up load spikes without necessarily having to add machines.
If your work is such that response latencies don't really matter then you can get by with fewer nodes.
If your workload is CPU bound then you'll need more boxes anyway because there won't be the same opportunity for parallelism because the servers won't be blocking on IO.
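The difference between blocking on IO one request at a time and overlapping the waits can be seen in a toy measurement. This is only a sketch under loud assumptions: fake_io_request is a made-up stand-in that sleeps instead of calling a database, and Ruby threads are used here merely to overlap the waits; this is not how Netty, Play! or Vert.x are implemented.

```ruby
require "benchmark"

# Stand-in for an IO-bound request: sleeping models waiting on a database
# or remote service without using the CPU. The 0.05s figure is arbitrary.
def fake_io_request
  sleep 0.05
  :ok
end

REQUESTS = 4

# One blocking worker handles the requests one after another.
sequential = Benchmark.realtime { REQUESTS.times { fake_io_request } }

# Overlapping the waits: each request waits at the same time as the others.
overlapped = Benchmark.realtime do
  Array.new(REQUESTS) { Thread.new { fake_io_request } }.each(&:join)
end

puts format("sequential: %.2fs, overlapped: %.2fs", sequential, overlapped)
```

Replacing the sleep with a CPU-heavy loop would remove most of the gap on MRI, which matches the point above that CPU-bound workloads gain little from this kind of overlap.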

My server gets overloaded even though I keep a limit on the requests I send it

I have a server on Heroku - 3 dynos, 2 processes each.
The server does 2 things:
It responds to requests from the browser (AJAX and some web pages), based on data stored in a postgresql database
It exposes a REST API to update the data in the database. This API is called by another server. The rate of calls is limited: The other server only calls my server through a queue with a single worker, which makes sure the other server doesn't issue more than one request in parallel to my server (I verified that indeed it doesn't).
When I look at new relic, I see the following graph, which suggests that even though I keep the other server at one parallel request at most, it still loads my server which creates peaks.
I'd expect that since the rate of calls from the other server is limited, my server will not get overloaded, since a request will only start when the previous request has ended (I'm guessing that maybe the database gets overloaded if it gets an update request and returns but continues processing after that).
What can explain this behaviour?
Where else can I look in order to understand what's going on?
Is there a way to avoid this behaviour?
There are a whole lot of directions this investigation could go, but from your screenshot and some inferences, I have two guesses.
A long query: You'd see this graph if your other server or a browser occasionally hits a slow query. If it's just a long read query and your DB isn't hitting its limits, it should only affect the process running the query, but if the query is taking an exclusive lock, all dynos will have to wait on it. Since the spikes are so regular, first think of anything you have running on a schedule. If the cadence matches, you probably have your culprit. The next simple thing to do is run heroku pg:long-running-queries and heroku pg:seq-scans. The former shows queries that might need optimization, and the latter shows full table scans you can probably fix with a different query or a better index. You can find similar information in NewRelic's Database tab, which has time and throughput graphs you can try to match against your queueing spikes. Finally, look at NewRelic's Transactions tab.
There are various ways to sort - slowest average response time is probably going to help, but check out all the options and see if any transactions stand out.
Click on a suspicious transaction and look at the graph on the right. If you see spikes matching your queueing buildups, that could be it, but since it looks to be affecting your whole site, watch out for several transactions seeing correlated slowdowns.
Check out the transaction traces at the bottom. Something in there taking a long time to run is as close to a smoking gun as you'll get. This should correlate with pg:long-running-queries.
Look at the breakdown table between the graph and the transaction traces. Check for things that are taking a long time (e.g. a 2 second external request) or happening often (e.g. a partial that gets rendered 2500 times per request). Those are places for caching or optimization.
Garbage collection—This is less likely because Ruby GCs all the time and there's no reason it would show spikes on that regular cadence, but if there's a regular request that allocates a ton of objects, both building the objects and cleaning them up will take time. It would only affect one dyno at once, and it would be correlated with a long or highly repetitive query in your NewRelic investigation. You can see some stats about this in NewRelic's Ruby VM tab.
Take a look at your dyno and DB memory usage too. Both are printed to the Heroku logs, and if you add Librato, they'll build some automatic graphs that are quite helpful. If your dyno is swapping, performance will suffer and you should either upgrade to a bigger dyno or run fewer processes per dyno. Processes will typically accumulate memory as they run and never quite release as much as you'd like, so tune it so that right before a restart, your dyno is just under its available RAM. Similarly for the DB, if you're hitting swap there, query performance will suffer and you should upgrade.
Other things it could be, but probably isn't in this case:
Sleeping dynos—Heroku puts a dyno to sleep if it hasn't served a request in a while, but only if you have just 1 dyno running. You have 3, so this isn't it.
Web Server Concurrency—If at any given moment, there are more requests than available processes, requests will be queued. The obvious fix is to increase the available dynos/processes, which will put more load on your DB and potentially move the issue there. Since some regular request is visible every time, I'm guessing request volume is low and this also isn't your problem.
Heroku Instability—Sometimes, for no obvious reason, Heroku starts queueing requests more than it should and doesn't report any issues at status.heroku.com. Restarting the dynos typically fixes that temporarily while Heroku gets their head back on straight.

Rails rake parallelization thresholds and caveats

This is the first time that I've actually run into timing issues regarding the task I have to tackle. I need to do a calculation (running against a webservice) with approximately 7M records. This would take more than 180hrs, so I was thinking about running multiple instances of the webservice on EC2 and just running rake tasks in parallel.
Since I have never done this before, I was wondering what needs to be considered.
More precisely:
What's the maximum number of rake tasks I can run (Is there any limit at all besides your own machine power)?
What's the maximum number of concurrent connections to a postgres 9.3 db?
Are there any things to be considered when running multiple active_record.save actions at the same time?
I am looking forward to hearing your thoughts.
Best,
Phil
rake instances
Every time you run rake, you are running a new instance of your ruby server, with all associated memory and related load-dependency usages. Look in your Rakefile for the inits.
your number of instances is limited by the memory and CPU each one uses
you must profile the memory and CPU of each instance to know how many can be run
you could write a program to monitor and calculate what's possible, but heuristics will work better for one-off, and first experiments.
datastore
heuristically explore your database capacity, too.
watch for write-locks that create blocking
watch for slow reads due to missing indices
look at your postgres configs to see concurrency limits, cache size, etc.
.save
each rake task is its own ruby server, so multiple concurrent active_record.save calls raise these concerns:
blocking/waiting due to write-locking
one instance getting 'old' data that was read prior to another's update .save
operational complexity
the number of records (7MM) is just a multiplier for all of the operations that occur upon each record. The operational complexity is the source of limitation, since theoretically, running 7MM workers would solve the problem in the minimum timescale
if 180hr is accurate (dubious), then (180 * 60 * 60 * 1000) / 7000000 == 92.57 ms per record.
Look for any shared-resource that is an IO blocker.
look for any common calculation that you can do in advance and cache. A lookup beats a calc.
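The per-record arithmetic above, plus a best-case estimate for parallel runs, can be checked with a few lines. This is a back-of-the-envelope sketch: ms_per_record and ideal_hours are made-up helper names, and the parallel figure assumes perfect scaling with no lock contention or shared bottleneck, which a real run will not reach.

```ruby
RECORDS      = 7_000_000
SERIAL_HOURS = 180.0

# Average wall-clock time per record implied by the serial estimate.
def ms_per_record(hours, records)
  hours * 60 * 60 * 1000 / records
end

# Idealized runtime with `workers` parallel tasks, assuming perfect scaling
# (no write-lock waits, no shared IO blocker). Treat it as a lower bound.
def ideal_hours(hours, workers)
  hours / workers
end

puts ms_per_record(SERIAL_HOURS, RECORDS).round(2) # 92.57
puts ideal_hours(SERIAL_HOURS, 10)                  # 18.0
```

Comparing the measured runtime against ideal_hours as workers are added gives a quick signal of when a shared resource has become the limit.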
errata
leave headroom for base OS processes. These will vary by your environment. You mention AWS, but it is best to learn conceptually how to monitor any system for activity
run top in a separate screen / terminal as the rakes are running.
Prefer to run 2 tops in different screens. sort 1 by memory, sort the other by CPU
have a way to monitor the rakes
watch for events that bubble up the top processes.
if you do this long / well enough, you've profiled your headroom
run more rakes to fill your headroom
don't overrun your memory or you'll get swapping
You may want to consider beanstalk instead, but my guess is you'll find that more complicated than learning all these good foundations, first.

Why is my PostgreSQL server cpu constrained?

My database is very CPU constrained, and I can't find the root cause of the issue. I currently have two application servers, each with a Rails API connecting to PostgreSQL via the ruby-pg gem. Both application servers also have sidekiq running background jobs, and I have a handful of support servers processing new posts from a national feed via sidekiq. If I were running out of memory, the solution would seemingly be straightforward. Any general ideas why I am CPU constrained?
Database Specs:
Rackspace 8GB Performance Tier cloud VM (8GB RAM, 8x Core CPU, SSD)
Debian 7 Wheezy Linux OS
PostgreSQL 9.1 with PostGIS extension
Possible Problems:
PostgreSQL 9.1 is bad at indexes
The database has nearly 10GB of indexes. I am going to upgrade my database to PostgreSQL version >= 9.2. In version 9.2, index only scans were introduced.
Too many connections
In postgresql.conf, I have set max_connections to 500. Usually throughout the day, only 175 connections are utilized, but during peak times, sidekiq tasks will increase the current connections to 350. How many connections are recommended with an 8GB server instance?
Idle Connections
When I take a look at pg_stat_activity in the psql console, I see sidekiq is leaving a lot of IDLE connections. Could these connections result in CPU inflation? Does the fix exist in the api or in sidekiq?
Need a more powerful server
Maybe there is not a bug. I might need to simply increase the server instance. Again this would make more sense if I was memory bound. However, both app servers and 3 of the support sidekiq servers are 4gb performance tier instances. Essentially, servers that interact with the database have combined more than double the resources of the database. Should this even matter?
Additional questions:
What tools/techniques should I employ to troubleshoot the issue?
Any basic settings in the postgresql.conf related to cpu usage?
Are there any known issues related to rails, sidekiq, or the pg gem that could be a contributing factor? (I haven't seen any open issues.)
Are there any general PostgreSQL guidelines for CPU usage?
Any other ideas or thoughts that might help my search?
You are using massively too many concurrent connections. PostgreSQL will be wasting lots of its time on housekeeping and juggling concurrent queries. All the concurrent work will be fighting for CPU and buffer space, there'll be heavy contention on spinlocks, and it'll all generally be a mess.
On an 8 core machine, you should probably not have more than 20 actively working connections if you're mostly CPU constrained. If you're I/O limited, you can go higher, but 350 is just ridiculous.
If possible, put a PgBouncer in transaction pooling mode in front of your PostgreSQL instance, so queries get queued up and executed rapidly in series instead of slowly in parallel.
See number of database connections (Pg wiki).
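A commonly quoted starting point from that wiki page is (core count * 2) + effective spindle count, which lands close to the "no more than 20" figure above for an 8-core box. The snippet below just encodes that rule of thumb; suggested_active_connections is a made-up name, and treating an SSD as a single spindle is an assumption, so take the result as a place to begin tuning rather than a limit.

```ruby
# Starting-point pool size: (cores * 2) + effective spindle count.
# This is a rule of thumb to begin tuning from, not a hard limit.
def suggested_active_connections(cores:, effective_spindles:)
  cores * 2 + effective_spindles
end

# 8 cores as in the question; counting the SSD as one spindle is an assumption.
puts suggested_active_connections(cores: 8, effective_spindles: 1)
```

The resulting number is the size to aim for on the PostgreSQL side of a pooler; the application-facing side can accept many more client connections.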
Additionally, PostGIS can be very CPU-heavy. It sometimes needs to do very complex calculations. I suggest using the auto_explain module to record long running queries, and using pg_stat_statements / pg_stat_plans to record what's taking up resources. Examine these queries to see if they need improvement.
Your idle in transaction sessions must be dealt with, too. Depending on why they're idle and whether they have a transaction ID or not, they might be causing serious table bloat. They're also creating unnecessary signalling overhead within PostgreSQL, as it has to do more co-ordination with backends that are actively doing things. Finally, the number of open transactions itself increases the cost of some internal housekeeping operations.
So. Your DB will probably perform better if you reduce the connection counts, put a PgBouncer in transaction pooling mode in front, and fix those idle connections.
Most likely you are CPU constrained because your work needs a lot of CPU. :)
9.1 is not generally bad at indexes. Every version may have some specific issues, and exactly what they are changes from version to version.
Index-only-scans are mostly a benefit when you are IO constrained. I wouldn't hold out much hope for that being a magic bullet for you.
350 connections are certainly not helpful, but probably are not very harmful, either. But when they are harmful, it can be downright catastrophic. The correct value is more determined by the number of cores, not the amount of RAM. If it is easy to throttle down the sidekiq connections, do it even if you can't prove that it helps.
If the connections are just IDLE, not IDLE in transaction, then they probably aren't very harmful, but again there are a few cases where they can be. That is pretty much the same issue as the number of connections.
The connection you showed from top was idle in transaction. That status shouldn't be taking up much CPU, so it probably means the connection is rapidly cycling through statements and top just happened to catch it between them. But you didn't say how many similar lines there were in top. If it is just that one, it suggests your code is not running concurrently and 7 of your 8 CPUs are wasted.
Regarding the db server versus the other servers, if the database is fundamentally the limit, beating on it with a bigger hammer is not going to help. Often there is some flexibility about where computation is done. If you can get the app servers to do more computation that is currently done on the db and let the db focus on ACID issues, that would be good. But no one but you can know if that is possible or feasible.
My first stop would be to use pg_stat_statements to see what SQL statements are taking the most time. Maybe just adding an index to the slowest/most frequent query would make the problem magically go away.

Concurrent SOAP api requests taking collectively longer time

I'm using the savon gem to interact with a SOAP API. I'm trying to send three parallel requests to the API using the parallel gem. Normally each request takes around 13 seconds to complete, so three requests take around 39 seconds. After using the parallel gem and sending three parallel requests using 3 threads, it takes around 23 seconds to complete all three requests, which is really nice, but I'm not able to figure out why it's not completing in 14-15 seconds. I really need to lower the total time as it directly affects the response time of my website. Any ideas on why this is happening? Are network requests blocking in nature?
I'm sending the requests as follows
Parallel.map(["GDSSpecialReturn", "Normal", "LCCSpecialReturn"], :in_threads => 3){ |promo_plan| self.search_request(promo_plan) }
I also tried using multiple processes, but it didn't help.
I have 2 theories:
Part of the work load can't run in parallel, so you don't see 3x speedup, but a bit less than that. It's very rare to see multithreaded tasks speedup 100% proportionally to the number of CPUs used, because there are always a few bits that have to run one at a time. See Amdahl's Law, which provides equations to describe this, and states that:
The speedup of a program using multiple processors in parallel computing is limited by the time needed for the sequential fraction of the program
Disk I/O is involved, and this runs slower in parallel because of disk seek time, which limits the IO operations per second. Remember that unless you're on an SSD, the disk has to make part of a physical rotation every time you look for something different on it. With 3 requests at once, the disk head is repeatedly seeking across the platter to try to fulfill I/O requests in 3 different places. This is why random I/O on hard drives is much slower than sequential I/O. Even on an SSD, random I/O can be a bit slower, especially if small-block read-write is involved.
I think option 2 is the culprit if you're running your database on the same system. The problem is that when the SOAP calls hit the DB, it gets hit on both of these factors. Even blazing-fast 15000 RPM server hard drives can only manage ~200 IO operations per second. SSDs will do 10,000-100,000+ IO/s. See figures on Wikipedia for ballparks. Though, most databases do some clever memory caching to mitigate the problems.
A clever way to test if it's factor 2 is to run an H2 Database in-memory DB and test SOAP calls using this. They'll complete much faster, probably, and you should see similar execution time for 1,3, or $CPU-COUNT requests at once.
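To put numbers on the first theory, Amdahl's Law can be inverted to estimate how much of the work actually ran in parallel, using the timings from the question (39 seconds serial, 23 seconds with 3 threads). The helper names below are made up for illustration, and the estimate assumes the law's simple model of one serial part and one perfectly parallel part.

```ruby
# Amdahl's Law: speedup with n workers when a fraction of the work
# can run in parallel and the rest stays serial.
def amdahl_speedup(parallel_fraction, n)
  1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n)
end

# Inverse: the parallel fraction implied by an observed speedup on n workers.
def implied_parallel_fraction(speedup, n)
  (1.0 - 1.0 / speedup) / (1.0 - 1.0 / n)
end

observed = 39.0 / 23.0 # timings from the question: 39s serial, 23s with 3 threads
fraction = implied_parallel_fraction(observed, 3)
puts fraction.round(3)                    # 0.615
puts amdahl_speedup(fraction, 3).round(2) # 1.7
```

Under that model, roughly 62% of each request overlapped and about 38% was effectively serialized somewhere, whether in the client, the network, or the remote API.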
That's actually a big question; it depends on many factors.
1. Ruby language implementation
It could be different between MRI, Rubinius, and JRuby, though I am not sure whether the parallel gem supports Rubinius and JRuby.
2. Your Machine
How many CPU cores does your machine have? You can take advantage of them by running parallel processes. Have you tried using processes for this if you have multiple cores?
Parallel.map(["GDSSpecialReturn", "Normal", "LCCSpecialReturn"]){ |promo_plan| self.search_request(promo_plan) } # by default it will use [number] of processes if you have [number] of CPUs
3. What happens inside self.search_request?
If you are running this on MRI, then because of the GIL your Ruby code does not actually run in parallel. More precisely, MRI releases the GIL during blocking IO calls, so only the network call part runs concurrently, not the rest. That's why I am interested in what other work you do inside self.search_request, because that would have an impact on the overall performance.
So I recommend testing your code in different environments and on different machines (results could differ between your local machine and the real production machine, so do tune and benchmark) to get the best result.
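A quick way to benchmark this is to time each call inside its own thread as well as the total. The sketch below uses plain Thread rather than the parallel gem to stay dependency-free, and search_request here is a placeholder that sleeps; in the real code it would be the method that makes the SOAP call.

```ruby
require "benchmark"

PLANS = ["GDSSpecialReturn", "Normal", "LCCSpecialReturn"].freeze

# Placeholder for the real SOAP call; sleep stands in for network wait.
def search_request(_promo_plan)
  sleep 0.02
end

# Time each call inside its own thread, plus the total wall time.
timings = {}
total = Benchmark.realtime do
  threads = PLANS.map do |plan|
    Thread.new { [plan, Benchmark.realtime { search_request(plan) }] }
  end
  threads.each do |thread|
    plan, seconds = thread.value # value joins the thread and returns its result
    timings[plan] = seconds
  end
end

timings.each { |plan, seconds| puts format("%-18s %.3fs", plan, seconds) }
puts format("%-18s %.3fs", "total", total)
```

If each call still takes about 13 seconds but the total is 23, the time is being lost in Ruby code outside the network wait. If each call itself stretches to around 20 seconds when run concurrently, the remote API or a shared resource is likely serializing the requests.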
Btw, if you want to know more about threads and processes in Ruby, I highly recommend Jesse Storimer's Working with Ruby Threads. He did a pretty good job explaining all these things.
Hope it helps, thanks.
