ease of scaling mongodb vs mysql - grails

I am creating a Grails application that is the backend for a mobile application. It is currently deployed on Amazon EC2. It persists data to a mysql database. One instance currently pointing to the database. I plan to deploy multiple instances of the app behind a load balancer and eventually have read requests go to slave instances of the db. We plan to release in the coming months and have a beta group of a couple of thousand users. It is more read intensive than write.
We have looked into using mongodb instead of sql and see it as a good solution.
Not having a lot of experience scaling mysql ( or mongodb ) would it be easier to scale mongodb since it has features such as auto sharding. ( Looking for thoughts from people who have done both ) I am of thinking it will be easier to switch to mongodb now rather than be in 'production' and having to migrate.
Thoughts?

MongoDB has two versions of "scaling":
Read scaling via replica sets.
Write scaling via sharding.
They're not silver bullets, but they're both very easy to set up. Replica sets have auto-failover which is practically essential when using EC2 (they have a good history of just randomly failing nodes). When you need write scaling, MongoDB has documented processes for upgrading your replica set to a series of sharded replica sets.
The unfortunate limitation is that (last I checked), things like scalr don't really support automatic scaling. So you'll have to roll your own solution for adding and removing nodes from the set.
Some important considerations:
Disk IO performance is sketchy in the cloud. Good performance is all about the amount of RAM you can throw at the problem.
If you're using replica sets for reads, ensure that your driver / data wrapper is capable of handling the distribution of reads. Just like MySQL it's not currently "free", you'll need to decide "write vs. read".
64-bit machines. MongoDB really wants to operate on 64-bit hardware. This is a cost consdieration as you'll probably have to ramp up with 4GB machines instead of 2GB machines (I don't think this is a big limitation, but I also know what it's like to be a startup).
MongoDB is still new tech. The lists are very active, and people are using it in production for very large data sets. But this is still a new product, you have to be prepared to work from the command-line and parse through docs and ask questions.
would it be easier to scale mongodb
At some level scaling is going to be a "hard" problem. What MongoDB does well is provide a way to really scale out lots of boxes horizontally with replication. In my experience, MySQL really tops out at around two boxes for writes. You can easily configure co-masters, but after that you have to start mucking around with all kinds of partitioning and you basically lose the ability to do joins.
I am of thinking it will be easier to switch to mongodb now rather than be in 'production'
It probably will.
Thoughts?
Start small. Get one piece working and see if you like how it works. If you have access to an EC2 account, then it's easy to spin up a couple of machines and play. MongoDB is not a panacea, but it works really well for a lot of modern web problems. Just measure how badly you need joins :)

Related

How to scale your 1 server Rails application

I have a rails application running on a single VPS that uses passenger, apache and MySQL. I am moving this to Amazon AWS with the following simple setup:
ELB > Web Server > MySQL
Lets say I am expecting a huge spike in daily users and want to start to scale this out on Amazon AWS using multiple instances. Where does a newbie start on this journey? Do I simply create an AMI from my production configured web server and get the ASG to launch these when required?
I understand that AWS increases the number of instances using auto scale groups as the load demands it, but do I need to architect anything differently in my Rails application for it to run at scale across multiple interfaces?
The problem with scaling horizontally is that it really depends on the application. There's no "just-add-water" ways to do it.
But there are some generic recipes you can follow in the beginning:
Extract MySQL server into a separate instance, which is capable of holding a higher load. Then create as many worker (i.e. app) instances that connect to the MySQL database as you need. You can keep doing so before your MySQL server gets saturated with requests, and can no longer keep up with the load.
When you're done with step 1, you can add MySQL replicas and setup a master-slave replication. This will leave you with a MySQL cluster, where one server can accept writes and all the others are read-only. After your set it up, change your application to send SELECT's to read-only replicas and INSERT/DELETE/UPDATE's to the writeable master server. This approach is based on the fact that most of the applications do reads way more often than writes. It can be not the case for you, but if it is, it'll keep your afloat pretty long. Right before you saturate MySQL master server write performance.
Once you've squeezed everything from step 2, you can go ahead and shard the data. This is now becoming more and more dependent on your application. But I will provide a blind example in order to convey the idea. Say, you have a user-centric application (e.g. a private photo-album, with no sharing capabilities), and each user has a name. In this case you can make two completely independent clusters, where the first one will serve users with names starting A-M, and the second one will serve ones with N-Z. It essentially makes the load twice as less, but complicates the whole architecture.
Though generic, these recipes can help you build a pretty solid application capable of serving millions of users daily before you're forced to bring up more exotic ways of scaling.
Hope this helps!

Scaling to support a massive amount of traffic in a short period of time

Until now, our site has had a modest amount of traffic. None of our developers are big ops guys, but we've stayed ahead of it and keep the site up and running pretty quick. That said, our dev team is stretched, we've accumulated some technical debt, and there's plenty of opportunity to optimize.
Without getting into specifics, we just found out that we'll be expecting a massive amount of traffic in the near future in a very short period time. On the order of several million hits in a few hours. Scaling is one thing, but this is several orders of magnitude greater than what we're seeing now.
We're a Rails app hosted on S3 using ELB, and Postgresql.
I wanted to field some recommendations for broad starting points for scaling and load testing given this situation.
Update: Sorry, EC2, late night :)
#LastZactionHero
Pretty interesting question, let me answer you in detail, I hope you are talking about some e-commerce applications, enterprise or B2B apps doenst see spikes as such. Since you already mentioned that you are hosted your rails app on s3. Let me make couple of things clear.
1)You cant host an rails app on s3. S3 is simple storage service. Where you can only store files.
2) I guess you have hosted your rails app on AWS ec2 with a elastic load balancer attached above the ec2 instances which is pretty good.
3)You have a self managed Postgresql deployed on a ec2 instance.
If you are running on AWS you are half way safe and you can easily scale up and scale down.
I can see one problem in your present model, that your db. AWS has got db as a service. Thats called Relation database service.Which supports Mysql Oracle and MS SQL server.
RDS comes with lot of features like auto back up of your database, high IOPS etc.
But it doesnt support your Postgresql. You need to have or manage a self managed ec2 instance and run postgresql database, but make sure its fail safe and you do have proper back and restore system at place.
AWS provides auto scaling api and command line tools, pretty easy.
You dont have worry about the bandwidth issue etc, but I admit Angelo's answer too.
You can use elastic mem cache for caching your app. Use CDN if need to speed your app. RDS can manage upto 30000 IOPS, its a monster to it will do lot of work for you.
Feel free to ask me if you need any kind of help.
(Disclaimer: I am a senior devOps engineer working for an e-commerce company, use ruby on rails)
Congratulations and I hope your expectation pans out!!
This is such a difficult question to comprehensively answer given the available information. For example, is your site heavy on db reads, writes or both (and is your sharding/replication strategy in line with your db strain)? Is bandwidth an issue, etc? Obvious points would focus on making sure you have access to the appropriate hardware and that your recipies for whatever you use to provision/deploy your hardware is up to date and good to go. You can often throw hardware at a sudden spike in traffic until you can get to the root of whatever bottlenecks you discover (and yes, you will discover them at inconvenient times!)
Regarding scaling your app, you should at least:
1) Cache whatever you can. Pay attention to cache expiration, etc.
2) Be sure your DB has appropriate indexes set up (essentially, you should have an index on any field you're searching on.)
3) Watch your logs closely to identify potential long queries, N+1 queries, long view renders, etc.
4) Do things like what Shopify outlines in this post: http://www.shopify.com/technology/7535298-what-does-your-webserver-do-when-a-user-hits-refresh#axzz2O0gJDotV
5) Set up a good monitoring system (Monit, God, etc) for each layer of your stack - sudden spikes in traffic can quickly bottleneck your application in unexpected places and lead to more issues. The cascade can happen quickly.
6) Set up cron to automate all those little tasks you currently do manually...that you will probably forget about doing once you're dealing with traffic spikes.
7) Google scaling rails and you'll see tons of good info.
8) etc, etc, etc...
You can use some profiling tools (rubyperf, or something like NewRelic, etc) Whatever response you get from them is probably best to be considered as a rough baseline at best. Simple reason being that your profiling is dependent on your hardware stack which will certainly change depending on actual traffic patterns. Pretty easy to do if you have a site with one page of static content...incredibly difficult to do if you have a CMS site with a growing db and growing traffic.
Good luck!!!

Multiple servers or everything in a single server?

I have a Rails app that uses MySQL, MongoDB, NodeJS (and SocketIO). Right now, the app (everything) is hosted inside 1 box. I would like to know what I should do when the number of users grow. What factors should I take into account to determine whether I need to host a separate element in another box (like MySQL, Node, Mongo in each of its separate box). Should I just make that one single box bigger? Is there a best-practice method that I can go with?
If you guys can provide me with reference, guides, research regarding this topic. Please do. I am super noob at deployment and server configuration.
We faced this dilemma at work a short while ago and found that simply upgrading to a more powerful single box sufficed and would give us room to grow further by up to 3-4 times.
The most important thing would be to identify your potential bottlenecks.
In our case there were 2 bottlenecks. Disk I/O and the database's ability to utilise memory.
On our new server we had the hard drive array configured in such a way as to maximise the disk I/O and we upgraded the database software to allow it to use more memory. In fact the DBMS now keeps the entire database in memory and only performs write operations to the disk as needed. This significantly improved performance.
The short answer is move everything to its own box. The longer answer is: it depends on your app's usage.
I recommend you use Nagios or similar to monitor your app's resource utilization -- that is, how much CPU and RAM each of your services use. When one starts to each up too much resources (and your page load speed is negatively affected), move that to its own box.
Then continue to monitor that box, beef up when necessary or shard out.
The high scalability blog is good for reading on what other people have done.

Should I choose cloud?

I'm about to start development on a project with very uncertain load/traffic specifics. When it will be released there will certainly be very low load that can easily be handled by a single desktop quad code machine.
The problem is that there will be (after some invite-only period) a strong publicity for the product so I expect considerable traffic/load peaks.
I haven't read enough about cloud providers and I'm mostly leaning toward Amazon or Azure for the credibility these two companies have without checking them out as I should with others (ie. Rackspace that I suppose is also a cloud service provider).
What I want
I would like to create a normal Asp.net MVC web application that can be run on in-house single machine low-cost server. It would run web server along with database (relational and maybe also document) and fulltext search (not SQL FTS but rather high speed separate product like Lucene or Sphinx). But after initial invite-only period I'd like to move this app to the cloud to make it more traffic/load demand-friendly.
As much as I know Amazon offers a sort of virtual machine hosting which I understand you setup as a normal server but has possible flexible resources in terms of load power. I'm not sure if that can be accomplished on Azure as well.
Questions
What is your experience with application transition to cloud and which one did you choose and why?
What would you recommend I should think about when designing/developing the solution to make the transition as painless as possible.
Based on your experience is it better to move to the cloud (financial wise) or is it better to buy your own servers and load balance application yourself and maybe save money on the long run?
"Cloud" is such a vague term. Still, I think this is a very good question.
Basically, IaaS cloud hosting does not magically make your application scale. It's really a virtual private server with very short contract / cancellation periods.
For scalability, the magic lies not so much in the hosting, but in the horizontal scalability of the application code itself. This is related to all the distributed computing challenges. For example, adding more application servers is not always easy: you must be sure that you don't persist any user state in the server application (but rather in a database, static can be evil), caching can be problematic because local caches can make the situation worse if you're using a round-robin strategy, etc.
What is your experience with application transition to cloud and which one did you choose and why?
What would you recommend I should think about when designing/developing the solution to make the transition as painless as possible.
You don't really have to do anything different just to host on EC2 or Azure -- basically. But of course, it's not that easy when things grow.
For instance, EC2 instance storage is rather limited. Additional storage on EBS, however, does not provide comparable performance characteristics and can be a bit more laggy than a disk. The point here is that EBS does magically scale, and it's probably more PaaS than IaaS; but it's not a simple hard disk and it does, consequently, not behave like a hard drive. I don't know about Azure block storage. In general, expect additional abstraction layers to introduce problems of their own, no matter what they do.
Based on your experience is it better to move to the cloud (financial wise) or is it better to buy your own servers and load balance application yourself and maybe save money on the long run?
Typical cloud providers are more expensive than the usual 'round-the-corner VPS providers, but they are, to my experience, also much more reliable and professional. EC2 has a free tier (but it's quite small), Azure gives you a small instance for free for 3 months.
Doing the calculation right is rather tricky; for example, if you have to shut down your service for whatever reason, it's nice to be able to cancel now rather than pay another year - you might want to put this risk into your calculation. On the other hand, both EC2 and Azure will be considerably cheaper if you sign up for 6 or 12 months, rather than paying by the hour.
You might want to check out the free Azure plan, because it's nice to start fiddling around without any cost. A big advantage of cloud providers is that you can scale vertically very easily: buying a 16 core, 64GB RAM server machine is really expensive, but if there's so much traffic on your site, upgrading your plan won't be such a big issue.
As someone hasn't mention it yet...
AppHarbor has been amazing. You can push stuff in a matter of minutes. Deployment is a breeze. And setting up your project for it is easier as well. And it doesn't even require any major changes in your solution to fit in.
For the full-text search, you might consider something like Websolr.
A lot of this depends on what your app is doing (e.g., are there separable components that might benefit from running on different instances, vs. a simple CRUD application with a front end). One thing to consider is that in a cloud application you normally don't have a traditional relational database. As such, you have to choose either cloud or traditional hosting, or plan on coding your access layer twice. Azure does have relational databases (SQL Azure), although they're not identical to SQL Server 2008R2. You're going to have to research the pros/cons of a cloud setup for your specific situation.
As far as financial concerns, it's usually a lot cheaper to just get an account with a hosting company instead of a cloud service, since you pay by the month, instead of the hour (last time I checked an account with Azure running 24/7 for a month would cost about $40-$50, while you can get hosting for $15 a month). The savings with the cloud come in when you have to run several servers, and the cost of maintaining them surpasses the cost of the instance on the cloud platform.
So, sorry, there's no silver bullet answer for you. Read up on the different services available. Consider what your application needs, what prices will be, and go from there.
I have just migrated an MVC-based application from a dedicated server to Azure. When migrating the MSSQl-database, I first tried importing .bacpak files but some of the tables failed because of their size. I then used the SQL Database migratio wizard which worked fine for small tables but failed for tables with BLOB-fields. For these tables I had to use temporary intermediate tables. Then after a while after all the data was transferred setting up the Webapp was a breeze and we went in production. At first, everything seemed to work just fine, but after a couple of hours when the load got heavier, all kind of errors occurred. I went into the Azure portal and it was really easy to see the

Are there any stable and production quality nosql datastores?

Are there are production quality nosql stores that I can use on a production system. I have looked at cassandra, tokyodb, couchdb etc but none of them seem to be ready for deployments on production like environments. I am talking thousands of requests per minute and lots of reads/writes/updates. My only concern is speed and service times. Does anybody know of production systems that use nosql stores effectively ? Does anybody know of a nosql store that is backed by a big enterprise like Google/Yahoo/ IBM ?
Cassandra handles thousands of requests (including write-mostly workloads) per second, per machine, and its scaling-by-adding-machines has been there since day 1.
Here is a thread about Cassandra use in production and in-production-soon at dozens of companies: http://n2.nabble.com/Cassandra-users-survey-td4040068.html#a4040068
We're also adding more docs all the time, like http://wiki.apache.org/cassandra/Operations.
I think the NoSQL systems are an excellent choice if I you 'only' care about speed and service time (and not or less about stuff like consistency and transactions). Facebook uses Cassandra.
"Cassandra is used in Facebook as an email search system containing 25TB and over 100m mailboxes." http://highscalability.com/product-facebooks-cassandra-massive-distributed-store
I think CouchDb isn't really speedy, maybe you can use MongoDB: http://www.mongodb.org/display/DOCS/Production+Deployments
Also worth consideration is using a traditional RDBMS like MySQL to store schema-less. This method gives you the stability of a proven database server like MySQL with the flexibility a NoSQL solution.
Check out this blog posting on how FriendFeed does this.
BerkeleyDB is backed by Oracle
Using the native C interface one can reach close to 1 million read requests per second.
By the way, when you say thousands requests per minute, any 'normal' DB should be able to handle that easily too.
Redis is worth giving a try as Github uses redis to manage a heavy queue of background jobs.
My first instinct would be BerkeleyDB, with each application node on a SAMBA network to facilitate ACID conformance & network use. It also sports a SQLite interface. Other poster cites MemcacheDB also having BDB inside.
Another unique option would be OrientDB, also has a SQL interface, lots of network & cluster features.

Resources