Scaling Puppet - when is too much for WEBrick? - scalability

I've found the following at Docs: Scaling Puppet:
Are you using the default webserver?
WEBrick, the default web server used to enable Puppet’s web services connectivity, is essentially a reference implementation, and becomes unreliable beyond about ten managed nodes. In any sort of production environment serving many nodes, you should switch to a more efficient web server implementation such as Passenger or Mongrel.
Where does the the number 10 come from in "ten managed nodes"?
I have a little over 20 nodes and I might soon have little over 30. Should I change to Passenger or not?

You should change to Passenger when you start having problems with WEBrick (or a little before). When that happens for you will depend on your workload.
The biggest problem with WEBrick is that it's single-threaded and blocking; once it's started working on a request, it cannot handle any other requests until it's done with the first one. Thus, what will make the difference to you is how much of the time Puppet spends processing requests.
Each time a client asks for its catalog, that's a request. Each separate file retrieved via puppet:/// URLs is also a request. If you're using Puppet lightly, each catalog won't take too long to generate, you won't be distributing many files on any given Puppet run, and each client won't be taking more than four to six seconds of server time every hour. If each client takes four seconds of server time per hour, 10 clients have a 5% chance of collisions0--of at least one client having to wait while another's request is processed. For 20 or 30 clients, those chances are 19% and 39%, respectively. As long as each request is short, you might be able to live with some contention, but the odds of collisions increase pretty quickly, so if you've got more than, say, 50 hosts (75% collision chance) you really ought to by using Passenger unless you're doing active performance measuring that shows that you're doing okay.
If, however, you're working your Puppet master harder--taking longer to generate catalogs, serving lots of files, serving large files, or whatever--you need to switch to Passenger sooner. I inherited a set of about thirty hosts with a WEBrick Puppet master where things were doing okay, but when I started deploying new systems, all of the Puppet traffic caused by a fresh deployment (including a couple of gigabyte files1) was preventing other hosts from getting their updates, so that's when I was forced to switch to Passenger.
In short, you'll probably be okay with 30 nodes if you're not doing anything too intense with Puppet, but at that point you need to be monitoring the performance of at least your Puppet master and preferably your clients' update status, too, so you'll know when you start running beyond the capabilities of WEBrick.
0 This is a standard birthday paradox calculation; if n is the number of clients and s is the average number of seconds of server time each client uses per hour, then the chance of having at least one collision during an hour is given by 1-(s/3600)!/((s/3600)^n*((s/3600)-n)!).
1 Puppet isn't really a good avenue for distributing files of this size in any case. I eventually switched to putting them on an NFS share that all of the hosts had access to.

For 20-30 nodes, there shouldn't be any problem. Note that passenger provides some additional features. It may be faster serving the nodes, but I am not sure how much improvement you will get if you have only 30 nodes.
You should change to passenger if you are using more than hundred nodes. I started seeing problems when the number of nodes requesting service from the puppet-master reached about 200. In my case, with the default web-server, about 5% of the nodes (random) couldn't receive the catalog during hourly run.

Related

Can't generate more than ~8000 RPM from Locust

I'm using Locust to load test my web servers. I'm running Locust in distributed mode. The worker nodes are written in Java, and use the Locust/Java port using locust4j. The master node and the worker nodes are containerized, our orchestrator is Kubernetes. When I want to spin up more workers, I'm doing it from there.
The problem that I'm running into is that no matter how many users I add, or worker nodes I add, I can't seem to generate more than ~8000 RPM. This is confirmed by the Locust web frontend, as well as the metrics I'm collecting from my web server.
Does anyone have any ideas why this is happening?
I've attached an image of timings I've collected. The snapshots are from running the load test for 60 seconds, I'm timing it from a stopwatch.
The usual culprit in these kinds of situations is your servers can't handle more than that. In my experience, the behavior you'll see client side as the servers get overwhelmed is you'll start to see a slow but steady increase in response times. This is one big reason why Locust includes those in the metrics it shows you.
Based on what I'm seeing in your screenshots, this is most likely the case for you. You have some very low minimum times but your average, median, and 90%iles are a lot higher than your minimums; your maximums are very significantly higher than those. Without seeing your charts I can't know for sure but that's a big red flag.
For more things to look out for, check out this question in the FAQ (especially see the list of server stats to investigate):
https://github.com/locustio/locust/wiki/FAQ#increase-my-request-raterps

How can I get Jelastic to sleep?

Yesterday I got a trial account on webhosting.net's Jelastic v2.2.2 and configured an environment with a minimum of 0 cloudlets (max 8, i.e., all dynamic, no reserved). Then I deployed a Grails war which was using 3 cloudlets after it started up (around 350 MB). It worked great, and I was very impressed.
However, I did not access my app overnight, and the billing history shows it kept using 3 dynamic cloudlets every hour, even with 0 requests (i.e., 0 MB paid traffic) for 14 hours. Is there some way I can get my Jelastic environment to sleep (i.e., hibernation) after some period with no requests (e.g., after an hour or two)? Then, when it gets a request, I'd like it to automatically wake up (i.e., allocate some cloudlets and restore memory from disk). I see how to stop and restart it manually, but I would like it to work automatically, for any requester.
edit: I found the following documentation, but does it not work for Tomcat/Grails?
Hibernation
Jelastic’s hibernation feature delivers even better utilization of cluster resources. Optimal use of resources is achieved by suspending non-active containers and returning released resources back to the cluster.
Because they are in sleep mode, hibernated containers do not consume resources (only disk space). As a result you save money while your containers are in hibernate mode. If applications are needed again the platform returns them to a running state again in just a few seconds.
It takes a little time to awaken your environment from sleep, so it's not suitable to work how you describe for production use - you would effectively lose visitors because it would seem like your service is offline due to the delays for that first access.
For that reason the 'sleep' function is only active for trial accounts, and the inactivity time before sleep is set by the hosting provider (so you should contact them directly for help on that point).
Of course you should also remember that accesses from search engine spiders etc. may keep your environment awake.

Ruby on Rails hosting - Engineyard vs Enterprise-Rails

I've been running a Rails app on 1 big dedicated server. Now for scaling I want to switch to a cloud service hoster and serve the app on 3 instances - App, DB and Redis.
I have really bad experience with Heroku performance wise and hence cost efficiency. So for me 2 Alternatives remain: Engineyard and Enterprise-Rails.
What I find important is that Engineyard doesn't offer an autoscaling option to handle peaks. On the other hand Enterprise-Rails doesn't have too much of documentation, most of it is handled by a support crew which is setting up everything.
What are other differences and what should I use for my website? I don't need much of administration work and I am not experienced with it. Basically I just want my Site to run optimally safe, stable and cost efficient without much personal work involved.
I am running a massive Rails app off AWS at this time and I'm really happy with it. Previously I had a number of dedicated boxes that were always causing problems - sooner or later one of them would crash for some reason, Raid failures, database problems whatnot.
At AWS I use RDS for database, elastic cache for caching, I keep all my code on a fat instance that acts as staging server and get a variable number of reserved instances to load the code via NFS.
I also use autoscaling - we've prepaid for a number of reserved instances and autoscaling helps starting up nodes when CPU usage goes above 60%, then removing them when it goes below 25%. autoscaling rules are based on cloudwatch alerts that can be set to monitor a particular group of instances, memcache servers, and so on, you even get e-mails and SMS notifications via SNS when certain scaling activities take place, say when more than 100 instances are spammed in less than 1 hour (massive traffic spike). The instances also get added right up to the load balancers by the way and you don't need to mess with the session store as you can use the sticky session feature which is quite nice.
Recently I also started using a 2nd launch group with spot instances, this complicated things a bit in terms of cloudwatch rules but I'm able to save a lot every month as spot prices are much lower. When the spot price (minimum) I bid is not enough, the set-up I have switches back to reserved instances.
Even more recently I've also started using CloudFront which got my app's page assets to load real fast (about 2 megs of CSS, JS, some icon sprites). Previously I was serving directly from instances via the load balancers.
This took about 20 hours to deploy, test and tune for maximum performance and availability.
One of the problems I have with AWS is that there's no support unless you're prepared to foot a bill. They claim some support is offered without a subscription but the only option in the support area is Billing. Ha. Fortunately it's all stable enough not to put me in a position where I'd have to pay for it.
Overall Rails fits in quite nice with AWS. I spend less than 2 hours per month doing maintenance, where I was spending over 30 previously. Most important for me is that I know that I can GTFO on a vacation for X months knowing nothing will cause any trouble - haven't had a monitoring alert more than a year.
Later edit: the app is a sports site with white labeling feature, lots of users, lots of administrators working on content in the back-end, database intensive as we show market pricing data that should update every few seconds. I had an average load time of about 3 seconds per page with dedicated servers that were doing about the same thing - database, memcache, storage, load balancing, web app. Now my average is under 1 second. Monthly bill is about 8 times lower now.
While Engine Yard doesn't offer auto-scaling (it is in the pipeline), we do have a fairly easy to use scaling feature that allows you to spin up multiple instances at once in times of need.
The advantages over something like Enterprise-Rails is the full documentation, the choice to deploy from the CLI or the dashboard,and our amazing support team. It's also easier to use Engine Yard and move from a personal machine or from another cloud setup than it is using a service such as AWS directly.

Apache/Passenger/RoR Slow - But Why

I am running Ubuntu (64Bit) with Apache 2.2.17, Passenger 3.0.11, Ruby 1.9.3 and Rails 3.2.6
When accessing the web page (index.html) on my webpage the request takes ages to complete, somewhere around 30 second in extreme cases.
The server has plenty of memory available (top shows more than 4GB free), the Apache processes (there are 10 of them) each show 0% CPU in top and the load is also almost 0 and there are hardly any DB accesses as I cache most of the things with memcached.
The log files of Apache as well as Rails do not show any errors, on the contrary the render times shown in the RubyOnRails log file show excellent values (<100 ms).
So where to go from here?
Is the first request slow or all requests slow? Passengers shutdown after a given time interval. So intermittenly requests (requests with sufficient time span in between) will allow passengers to shutdown (only to be restarted at next request.
Passenger does the autoshutdown BY DESIGN. This is so because on a shared environment, there might be other user's apps. If your app is idle for a while, then the resources can be transferred to other people's app.
If you are on a tight budget and you have multiple apps hosted on the same server, then passenger is a great solution.
If you have only ONE app in your server which you control, then please reconfigure Passengers to NOT shutdown (if that indeed is your problem).
You can do "passenger-status" to see how many passengers are currently running and available for taking requests.
The configuration to ensure that Passengers stay up is PassengerMinInstances and PassengerPoolIdleTime.
Are you accessing it through a 'fake domain name' (added to your /etc/hosts file)?
If so, do
service avahi-daemon stop
At least that's what worked for me on ubuntu 10.10 :)
For some reason a DNS lookup is made on each and every request you do to the server, and when the domain doesn't exists, it times out ...
The performance issue has been keeping me busy for all these days. I believe I have nailed it down to Apache configuration: KeepAliveTimeout, it was set to a very high value (90), can't think why it was set that high, must have been a typo.
My understanding of KeepAliveTimeout is that the Apache process gets locked to the client for 90 seconds, even if the client isn't issuing any further requests, hence when traffic picks up (which it did on that day when performance was significantly reduced, page visits more than tripled) all Apache processes are busy waiting for the KeepAliveTimeout, while blocking all new requests coming in. This would also explain why the system was not showing much load at all, it was just sitting there waiting. I reduced the value down to 10, if traffic picks up I'll probably drop it to 5.

How can I find out why my app is slow?

I have a simple Rails app deployed on a 500 MB Slicehost VPN. I'm the only one who uses the app. When I run it on my laptop, it's fast enough. But the deployed version is insanely slow. It take 6 to 10 seconds to load the login screen.
I would like to find out why it's so slow. Is it my code? (Don't think so because it's much faster locally, but maybe.) Is it Slicehost's server being overloaded? Is it the Internet?
Can someone suggest a technique or set of steps I can take to help narrow down the cause of this problem?
Update:
Sorry forgot to mention. I'm running it under CentOS 5 using Phusion Passenger (AKA mod_rails or mod_rack).
If it is just slow on the first time you load it is probably because of passenger killing the process due to inactivity. I don't remember all the details but I do recall reading people who used cron jobs to keep at least one process alive to avoid this lag that can occur with passenger needed to reload the environment.
Edit: more details here
Specifically - pool idle time defaults to 2 minutes which means after two minutes of idling passenger would have to reload the environment to serve the next request.
First, find out if there's a particularly slow response from the server. Use Firefox and the Firebug plugin to see how long each component (including JavaScript and graphics) takes to download. Assuming the main page itself is what is taking all the time, you can start profiling the application. You'll need to find a good profiler, and as I don't actually work in Ruby on Rails, I can't suggest any: google "profile ruby on rails" for some options.
As YenTheFirst points out, the server software and config you're using may contribute to a slowdown, but A) slicehost doesn't choose that, you do, as Slicehost just provides very raw server "slices" that you can treat as dedicated machines. B) you're unlikely to see a script that runs instantly suddenly take 6 seconds just because it's running as CGI. Something else must be going on. Check how much RAM you're using: have you gone into swap? Is the login slow only the first time it's hit indicating some startup issue, or is it always that slow? Is static content served slow? That'd tend to mean some network issue (either on the Slicehost side, or your local network) is slowing things down, assuming you're not in swap.
When you say "fast enough" you're being vague: does the laptop version take 1 second to the Slicehost 6? That wouldn't be entirely surprising, if the laptop is decent: after all, the reason slices are cheap is because they're a fraction of a full server. You're using probably 1/32 of an 8 core machine at Slicehost, as opposed to both cores of a modern laptop. The Slicehost cores are quick, but your laptop could be a screamer compared to 1/4 of core. :)
Try to pint point where the slowness lies
1/ application is slow, or infrastructure (network + web server)
put a static file on your web server, and access it through your browser
2/ If it is fast, it is probable a problem with application + server configuration.
database access is slow
try a page with a simpel loop: is it slow?
3/ If it slow, it is probably your infrastructure. You can check:
bad network connection: do a packet capture (with Wireshark for example) and look for retransmissions, duplicate packets, etc.
DNS resolution is slow?
server is misconfigured?
etc.
What is Slicehost using to serve it?
Fast options are things like: Mongrel, or apache's mod_rails (also called passenger phusion or
something like that)
These are dedicated servers (or plugins to servers) which run an instance of your rails app.
If your host isn't using that, then it's probably defaulting to CGI. Rails comes with a simple CGI script that will serve the page, but it reloads the app for every page.
(edit: I suspect that this is the most likely case, that your app is running off of the CGI in /webapp_directory/public/dispatch.cgi, which would explain the slowness. This tends to be a default deployment on many hosts, since it doesn't require extra configuration on their part, but it doesn't give good performance)
If your host supports "Fast CGI", rails supports that too. Fast CGI will open a CGI session, and keep it open for multiple pages, so you get much better performance, but it's not nearly as good as Mongrel or mod_rails.
Secondly, is it in 'production' or 'development' mode? The easy way to tell is to go to a page in your app that gives an error. If it shows you a stack trace, it's in development mode, which is slower than production mode. Mongrel and mod_rails have startup options to determine whether to run the app in production or development mode.
Finally, if your database is slow for whatever reason, that will be a big bottleneck as well. If you do have a good deployment (Mongrel/mod_rails/etc.) in production mode, try looking into that.
Do you have a lot of data in your DB? I would double check that you have indexed all the appropriate columns- because this can make a huge difference. On your local dev system, you probably have a lot more memory than on your 500 mb slice, which would result in the DB running a lot slower if you have big, un indexed tables. You can also run the slow queries logger in MySql to pinpoint columns without indexes.
Other than that, yes- passenger will need to spool up a process for you if you have not been using the site recently. If this is the case, you should see a significant speed increase on second, and especially third and later page loads.
You might want to run a local virtual machine with 500 MB. Are you doing a lot of client-server interaction? Delays over the WAN are significant
You might want to check out RPM (there's a free "lite" version too) and/or New Relic's Tune Up.
Your CPU time is guaranteed by Slicehost using the Xen virtualization system, so it's not that. Don't have the other answers for you, sorry! Might try 'top' on a console while you're trying to access the page.
If you are using FireFox and doing localhost testing (or maybe even on LAN) you may want to try editing the network.dns.disableIPv6 setting.
Type about:config in the address bar and filter for network.dns.disableIPv6 and double-click to set to true.
This bug has been reported mainly from Vista OS's, but some others as well.
You could try running 'top' when you SSH in to see which process is heavy. If you also have problems logging you, perhaps you may try getting Statistics in the Slicehost manager.
If you discover it is MySQL's fault, consider decreasing the number of servers it can spawn.
512 seems decent for Rails application, you might have to check if you misconfigured too.

Resources