Ruby on Rails scalability/performance? [closed] - ruby-on-rails

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 7 years ago.
Improve this question
I have used PHP for awhile now and have used it well with CodeIgniter, which is a great framework. I am starting on a new personal project and last time I was considering what to use (PHP vs ROR) I used PHP because of the scalability problems I heard ROR had, especially after reading what the Twitter devs had to say about it. Is scalability still an issue in ROR or has there been improvements to it?
I would like to learn a new language, and ROR seems interesting. PHP gets the job done but as everyone knows its syntax and organization are fugly and it feels like one big hack.

To expand on Ryan Doherty's answer a bit...
I work in a statically typed language for my day job (.NET/C#), as well as Ruby as a side thing. Prior to my current day job, I was the lead programmer for a ruby development firm doing work for the New York Times Syndication service. Before that, I worked in PHP as well (though long, long ago).
I say that simply to say this: I've experienced rails (and more generally ruby) performance problems first hand, as well as a few other alternatives. As Ryan says, you aren't going to have it automatically scale for you. It takes work and immense amounts of patience to find your bottlenecks.
A large majority of the performance issues we saw from others and even ourselves were dealing with slow performing queries in our ORM layer. We went from Rails/ActiveRecord to Rails/DataMapper and finally to Merb/DM, each iteration getting more speed simply because of the underlying frameworks.
Caching does amazing wonders for performance. Unfortunately, we couldn't cache our data. Our cache would effectively be invalidated every five minutes at most. Nearly every single bit of our site was dynamic. So if/when you can't do that, perhaps you can learn from our experience.
We had to end up seriously fine tuning our database indexes, making sure our queries weren't doing very stupid things, making sure we weren't executing more queries than was absolutely necessary, etc. When I say "very stupid things", I mean the 1 + N query problem...
# 1 query
Dog.find(:all).each do |dog|
# N queries
dog.owner.siblings.each do |sibling|
# N queries per above N query!
sibling.pets.each do |pet|
# Do something here
end
end
end
DataMapper is an excellent way to handle the above problem (there are no 1 + N problems with it), but an even better way is to use your brain and stop doing queries like that. When you need raw performance, most of the ORM layers won't easily handle extremely custom queries, so you might as well hand write them.
We also did common sense things. We bought a beefy server for our growing database, and moved it off onto it's own dedicated box. We also had to do TONS of processing and data importing constantly. We moved our processing off onto its own box as well. We also stopped loading our entire freaking stack just for our data import utilities. We tastefully loaded only what we absolutely needed (thus reducing memory overhead!).
If you can't tell already... generally, when it comes to ruby/rails/merb, you have to scale out, throwing hardware at the problem. But in the end, hardware is cheap; though that's no excuse for shoddy code!
And even with these difficulties, I personally would never start projects in another framework if I can help it. I'm in love with the language, and continually learn more about it every day. That's something that I don't get from C#, though C# is faster.
I also enjoy the open source tools, the low cost to start working in the language, the low cost to just get something out there and try to see if it's marketable, all the while working in a language that often times can be elegant and beautiful...
In the end, it's all about what you want to live, breathe, eat, and sleep in day in and day out when it comes to choosing your framework. If you like Microsoft's way of thinking, go .NET. If you want open source but still want structure, try Java. If you want to have a dynamic language and still have a bit more structure than ruby, try python. And if you want elegance, try Ruby (I kid, I kid... there are many other elegant languages that fit the bill. Not trying to start a flame war.)
Hell, try them all! I tend to agree with the answers above that worrying about optimizations early isn't the reason you should or shouldn't pick a framework, but I disagree that this is their only answer.
So in short, yes there are difficulties you have to overcome, but the elegance of the language, imho, far outweighs those shortcomings.
Sorry for the novel, but I've been there and back with performance issues. It can be overcome. So don't let that scare you off.

RoR is being used with lots of huge websites, but as with any language or framework, it takes a good architecture (db scaling, caching, tuning, etc) to scale to large numbers of users.
There's been a few minor changes to RoR to make it easier to scale, but don't expect it to scale magically for you. Every website has different scaling issues, so you'll have to put in some work to make it scale.

Develop in the technology that is going to give your project the best chance of success - quick to develop in, easy debugging, easy deployment, good tools, you know it inside out (unless the point is to learn a new language), etc.
If you get tens of million of uniques a month you can always hire in a couple of people and rewrite in a different technology if you need to as ...
... you'll be rake-ing in the cache (sorry - couldn't resist!!)

First of all, it would perhaps make more sense to compare Rails to
Symfony, CodeIgniter or CakePHP, since Ruby on Rails is a complete web application
framework. Compared to PHP or PHP frameworks, Rails applications offer
the advantages that they are small, clean, and readable. PHP is perfect
for small, personal pages (originally it stood for "Personal Home Page"),
while Rails is a full MVC framwork which can be used to build large
sites.
Ruby on Rails has not a larger scalability issue than comparable PHP frameworks.
Both Rails and PHP will scale well if you have only a moderate number
of users (10,000-100,000) which operate on a similar number of objects.
For a few thousand users a classic monolithic architecture will
be sufficient. With a bit of M&M (Memcached and MySQL) you can also
handle millions of objects. The M&M architecture uses a MySQL server to
handle writes and Memcached to handle high read loads. The traditional
storage pattern, a single SQL server using normalized relational tables
(or at best a SQL Master/Multiple Read Slave setup), no longer works
for very large sites.
If you have billions of users like Google, Twitter and Facebook, then
probably a distributed architecture will be better. If you really want to
scale your application without limit, use some kind of cheap commodity hardware
as a foundation, divide your application into a set of services, keep
each component or service scalable itself (design every component as
a scalable service), and adapt the architecture to your application.
Then you will need suitable scalable datastores like NoSQL databases
and distributed hash tables (DHTs), you will need sophisticated map-reduce
algorithms to work with them, you will have to deal with SOA, external
services, and messaging. Neither PHP nor Rails offer a magic bullet here.

What is breaks down to with RoR is that unless you're in Alexa's top 100, you will not have any scalability problems. You'll have more issues with stability on shared hosting unless you can squeeze Phusion, Passenger, or Mongrel out.

Take a little while to look at the problems the Twitter people had to deal with, then ask yourself if your app is going to need to scale to that level.
Then build it in Rails anyway, because you know it makes sense. If you get to Twitter-level volumes then you'll be in the happy position of considering performance optimisaton options. At least you'll be applying them in a nice language!

You can't compare PHP and ROR, PHP is a scripting language as Ruby, and Rails is a framework as CakePHP.
Stated that, I strongly suggest you Rails, because you will have an application strictly organized in MVC pattern, and this is a MUST for your scalability requirement. (Using PHP you had to take care about the project organization on your own).
But for what about scalability, Rails it's not just MVC: For instance, you can start to develop your application with a database, changing it on road without any effort (in the most part of cases), so we can state that a Rails application is (almost) database indipendent because it's ORM (that allow you to avoid database query), you can do a lot of other stuff. (take a look to this video http://www.youtube.com/watch?v=p5EIrSM8dCA )

Just wanted to add some more info to Keith Hanson's smart point about 1 + N problem where he states:
DataMapper is an excellent way to handle the above problem (there are no 1 + N problems with it), but an even better way is to use your brain and stop doing queries like that. When you need raw performance, most of the ORM layers won't easily handle extremely custom queries, so you might as well hand write them.
Doctrine is one of the most popular ORM's for PHP. It addresses this 1 + N complexity problem intrinsic to ORMs by providing a language called Doctrine Query Language (DQL). This allows you to write SQL like statements that use your existing model relationships. e.g
$q = Doctrine_Query::Create()
->select(*)
->from(ModelA m)
->leftJoin(m.ModelB)
->execute()

I'm getting the impression from this thread that the scalability issues of ROR come down primarily to the mess that ORMs are in with regard to loading child objects - ie the '1+N' problem mentioned above. In the above example that Ryan gave with dogs and owners:
Dog.find(:all).each do |dog|
#N queries
dog.owner.siblings.each do |sibling|
#N queries per above N query!!
sibling.pets.each do |pet|
#Do something here
end
end
end
You could actually write a single sql statement to get all that data, and you could also 'stitch' that data up into the Dog.Owner.Siblings.Pets object heirarchy of your custom-written objects. But could someone write an ORM that did that automatically, so that the above example would incur a single round-trip to the DB and a single SQL Statement, instead of potentially hundreds? Totally. Just join those tables into one dataset, then do some logic to stitch it up. It's a bit tricky to make that logic generic so it can handle any set of objects but not the end of the world. In the end, tables and objects only relate to each other in one of three categories (1:1, 1:many, many:many). It's just that no one ever built that ORM.
You need a syntax that tells the system upfront what children you want to load for this particular query. You can sort of do this with the 'eager' loading of LinqToSql (C#), which is not a part of ROR, but even though that results in one round trip to the DB, it's still hundreds of separate SQL statements the way it has currently been set up. It's really more about the history of ORMs. They just got started down the wrong path with that and never really recovered in my opnion. 'Lazy loading' is the default behavior of most ORMs, ie incurring another round trip for every mention of a child object, which is crazy. Then with 'eager' loading - loading the children upfront, that is set up statically in everything I am aware outside of LinqToSql - ie which children always load with certain objects - as if you would always need the same children loaded when you loaded a collection of Dogs.
You need some kind of strongly-typed syntax saying that this time I want to load these children and grandchilren. Ie, something like:
Dog.Owners.Include()
Dog.Owners.Siblings.Include()
Dog.Owners.Siblings.Pets.Include()
then you could issue this command:
Dog.find(:all).each do |dog|
The ORM system would know what tables it needs to join, then stitch up the resulting data into the OM heirarchy. It's true that you can throw hardware at the current problem, which I'm generally in favor of, but it's no reason the ORM (ie Hibernate, Entity Framework, Ruby ActiveRecord) shouldn't just be better written. Hardware really doesn't bail you out of an 8 round-trip, 100-SQL statement query that should have been one round trip and one SQL statement.

Related

How to handle database scalability with Ruby on Rails

I am creating a management system and I want to know how "Ruby on Rails" can support me in the mission of ensuring that each customer has their information, records and tables independent from other customers.
Is it better to put everything in a database and put a customer identifier to pull information through this parameter in queries or create a database for each customer automatically?
I admit that the second option attracts me more ... And I know that putting everything in one database will be detrimental to performance, because I assume that customers and their data will increase exponentially!
I want to know which option is more viable in the long run. And if the best option is to create separate databases, how can I do this with Ruby on Rails ??
There are pro and cons for both solutions which really depend on your use case.
Separating each customer in its own database has definitely advantages for scaling, running in different data centres or even onsite. However, this comes with higher complexity. For instance you can't query across customers anymore, you would need to run queries for each customers and aggregate the results. This approach is called multi tenancy (or shardening). There is a good gem called Apartment available (https://github.com/influitive/apartment).
Keeping everything in one database might be simpler to start of as it's less complex but it really depends on your use case.
Edit
Adding some more information based on the questions.
There are several reasons to use a one db per client architecture.
You have clearly separated tenants. In case it might make sense to go with the one db approach.
Scale. Having separated databases for each tenant makes scaling of course easier.
If 2) is the main reason you want to go for a one db per client approach I would strongly advise you against it. You add so much more complexity to your app which you might not need for years to come (if ever).
If scaling is your main concern I recommend reading Designing Data Intensive Applications by Martin Kleppmann. But basically, don't worry about scale for the first few years and focus on your product.

Is there a better solution than ActiveRecord for batch data imports?

I've developed a web interface for a legacy (vendor) database using Ruby on Rails. The database schema is a complete mess, > 450 tables, and customer data spread over more than 20, involving complex joins, etc.
I've got a good solution for this for the web app, it works very well. But we also do nightly imports from external data sources (currently a view to a SQL Server DB and a SOAP feed) and they run SLOW. About 1.5-2.5 hours for the XML data import and about 4 hours for the DB import.
This is after doing some basic optimizations, which include manually starting the MRI garbage collector. And that right there suggests to me I'm Doing It Wrong. I've considered moving the nightly update/insert tasks out of the main Rails app and trying to use either JRuby or Rubinius to take advantage of the better concurrency and garbage collection.
My question is this: I know ActiveRecord isn't really designed for this type of task. But out of the O/RM options for Ruby (my preferred language), it seems to have the best Oracle support.
What would you do? Stick with AR and use a different interpreter? Would that really help? What about DataMapper or Sequel? Is there a better way of doing this?
I'm open to using Scala or Clojure if there's a better alternative (not limited to, but these are the other languages I'm playing with right now)... but what I don't want is something like DBI where I'm writing straight SQL, if for no other reason than that vendor updates occasionally change the DB schema, and I'd rather change a couple of classes than hundreds of UPDATE or INSERT statements.
Hopefully this question isn't 'too vague,' but I could really use some advice about this issue.
FWIW, Ruby is 1.9.2, Rails is 3.0.7, platform is OS X Server Snow Leopard (or optionally Debian 6.0).
Edit ok just realized that this solution will not work for oracle, sorry ---
You should really check out ActiveRecord-Import, it is easy to use and handles bulk imports with minimal amounts of sql statements. I saw a speed up from 5 hours to 2 minutes. And it will still run validations on the data.
from the github page:
books = []
10.times do |i|
books << Book.new(:name => "book #{i}")
end
Book.import books
https://github.com/zdennis/activerecord-import
From my experience, ORMs are a great tool to use on the front end, where you're mostly just reading the data or updating a single row at a time. On the back end where you're ingesting lost of data at a time, they can cause problems because of the way they tend to interact with the database.
As an example, assume you have a Person object that has a list of Friends that is long (lets say 100 for now). You create the Person object and assign 100 Friends to it, and then save it to the database. It's common for the naive use of an ORM to do 101 writes to the database (one for each Friend, one for the Person). If you were to do this in pure SQL at a lower level, you'd do 2 writes, one for Person and then one for all the Friends at once (an insert with 100 actual rows). The difference between the two actions is significant.
There are a couple ways I've seen to work around the problem.
Use a lower level database API that lets you write your "insert 100 friends in a single call" type command
Use an ORM that lets you write lower level SQL in order to do the Friends insert as a single SQL command (not all of them allow this and I don't know if Rails does)
Use an ORM that lets you batch writes into a single database call. It's still 101 writes to the database, but it allows the ORM to batch them into a single network call to the database and say "do these 101 things". I'm not sure what ORMs allow for this.
There's probably other ways
The main point being that using the ORM to ingest any real sized amount of data can run into efficiency problems. Understanding what the ORM is doing underneath the hood (asking it to log all db calls is a good way to understand what it's doing) is the best first step. Once you know what it's doing, you can look for ways to tell it "what I'm doing doesn't fit well into the normal pattern, lets change how you're using it"... and, should it not have a way that works, you can look at using a lower level API to allow for it.
I'll point out one other thing you can look at with a STRONG caveat that it should be one of the last things you consider. When inserting rows into the database in bulk, you can create a raw text file with all the data (format depends on the db, but the concept is similar to a CSV file) and give the file to the database to import in bulk. It's a bad way to go in almost every case, but I wanted to include it because it does exist as an option.
Edit: As a side note, the comment about more efficiently parsing the XML is a good thing to look at too. Using SAX vs DOM, or a different XML library, can be a huge win in time to completion. In some cases, it can be an even bigger win than more efficient database interaction. For example, you may be parsing a LOT of XML with lots of small pieces of data, and then only use small parts of it. In a case like that, the parsing could take a long time via DOM while SAX could ignore the parts you don't need... or it could be using a lot of memory creating DOM objects and slow down the whole thing due to garbage collection, etc. At the very least, it's worth looking at.
Since your question is indeed "a bit vague", I can only recommend you optimizing the XML import by using XML Pull parsing.
Take a look at this:
https://gist.github.com/827475
I needed to import MySQL XML, and to be fair, using the XML Pull method improved the parse part in factor of around 7 (yes, almost 7 times faster than reading the entire thing in the memory).
Another thing: you are saying "the DB import takes 4 hours". What file formats are these DB exports you are importing?

Is entity framework a bad choice for multiple websites and large aplications?

Scenario : We currently have a website and are working on building couple of websites with an admin website. We are using asp.net-mvc , SQL Server 2005 and Entity Framework 4. So, currently we have a single solution that has all the websites and all the websites are using the same entity framework model. The Model currently has over 70 tables and will potentially have a lot more in the future... around 400?
Questions : Is Entity Framework model going to be slower when it is going to grow bigger? I have read quite a few articles where they say it is pretty slow due to the additional layers of mapping when as compared to say ado.net? Also , we thought of having multiple models but it seems that it is a bad practice too and is LINQ useful when we are not using any ORM?
So, we are just curious what and how all the large websites using a similiar technology as we have achieve good performance while using an ORM like EF or do they never opt for an ORM ? I have also worked on a LINQ to SQL application that had over 150 tables and we encurred a huge startup penalty, site took 15-20 seconds to respond when first loaded. I am pretty sure this was due to large startup cost of LINQ to SQL ORM. It would be great if someone can share their experience and thoughts regarding this ? What are the best practices to follow and I know it depends on every application but if performance is a concern then what are the best steps to be taken ?
I don't have a definite answer for you, but I have found this SO post: ORM performance cost, it will probably be informative for you, expecially the second highest answer mentioning this site:
http://ormbattle.net/
My personal experience is that for any ORM mapper I have seen so far, Joel's law of leaky abstraction applies heavily. So if you are going to use EF, make sure you have alternatives for optimization at hand.
I think you can certainly get EF4 to work in a performant way with a database with a large number of tables. That said, you will certainly have to overcome a number of hurdles that are specific to EF.
I don't think LinqToSql is a good alternative since Microsoft has stopped enhancing it for the most part.
What other alternatives have you considered? ADO.NET? NHibernate? Stored Procedures?
I know NHibernate may have trouble establishing the SessionFactory for 400 tables quickly, but that only happens once when the website application starts, which should be fairly rare if the application is used heavily. Each web request generally has a new Session and creating sessions from the session factory is very quick and inexpensive.
My biggest concern with EF is the management of the thing, if you have multiple models, then you're suddenly going to have multiple work to do maintaining them, making sure you never update the wrong model for the right database, or vice versa. This is a problem for us at the moment, and looks to only get worse.
Personally, I like to write SQL rather than rely on an abstraction on top of an abstration. The DB knows SQL and so I keep it happy with hand-crafted stored procedures, or hand-crafted SQL in some cases. One huge benefit to this is that I can reply code to see what it was trying to do, and see the resulting data by c&p the sql from the log to the sql query editor. That, in my opinion, makes support so much easier it entirely invalidates any programmer benefit you might have from using an ORM in the first place (especially as EF generates absolutely unreadable SQL).
In fact, come to think of it, the only benefit an ORM gives you is that you can code a bit quicker (once you have everything set up and are not changing the schema, of course), and ultimately, I don't think the benefit is worth the cost, not when you consider that I spend most of my coding time thinking about what I'm going to do as the 'doing it' part is relatively small cmpared to the design, test, support and maintain parts.

Achieving 25K Concurrent connections in RubyOnRails Application

I am trying to build a suggestion board application. where a users raises a query and multiple people will post at the same time. expected to be supporting atleast 25k concurrent users. now the question format also has checkboxes or radio buttons, in thats case they will be writing to DB.
Please let me know how can achieve this in Ruby on Rails.
- hardware support (specific Hardware LB)
- software support like (DB clustering/App server clustering/ Web traffic resolution)
I think your best plan is to worry about scaling to this level once you have that many users. There's nothing stopping you from achieving this in Rails, or indeed any other framework/language.
The problem with trying to design your architecture up-front to scale to this level is that, at this point, you have absolutely no idea where the pain points are going to be. Are there specific pages which are going to hit the database harder, are some of your pages heavy on HTML and images... there are so many questions to ask that simply cannot be answered effectively until you've gotten something out there.
This doesn't mean that you shouldn't worry about scaling - by all means, try to design your data structures in such a way as to allow you to scale later. But put off any major decisions, and think about them later when you have some hard data to work with.

The Ruby community values simplicity...what's your argument for simplifying a db schema in a new project?

I'm working on a project with developers who have not worked with Ruby OR Rails before.
They have created a schema that is too complicated, in my opinion. The schema has 117 tables, and obtaining the simplest piece of information would require traversing/joining 7 tabels...and of course, there's no "main" table that serves as a sort of key between them. The schema renders many of the rails tools like 'find' method, and many of the has_many/belongs to relationships almost useless. And coding for all of these relationships will likely be more time-consuming than we have the money to code for.
THE QUESTION:
Assuming you are VERY convinced (IMHO...hehe) that the schema is not ideal, and there are multiple ways to represent the domain, how would you argue FOR simplifying the schema (aside from what I've already said)?
I'll stand up in 2 roles here
DBA: Database admin/designer.
Dev: Application developer.
I assume the DBA is a person who really know all the Database tricks. Reaallyy Knows.
DBA:
Database is the key of the application and should have predefined structure in order to serve its purpose well and with best performance.
If you cannot use random schema (which is reasonably normalised and good) then the tools are wrong.
Dev:
The database is just a data store, so we need to keep it simple and concentrate on the application.
DBA:
Database is not a store it is the core of the application. There is no application without database.
Dev:
No. The application is the core. There is no application without the front-end and the business logic applied to it.
And the war begins...
Both points are valid and it is always trade off.
If the database will ONLY be used by RoR, then you can use it more like a simple store.
If the DB can be used by other application OR it will be used with large amount of data and high traffic it must enforce some best practices.
Generally there is no way you can disagree with DBA.
But they can understand your situation and might allow you to loose the standards a bit so you could be more productive.
So you need to work closely, together.
And you need to talk to each other to explain and prove the point why database should be like this or that.
Otherwise, the team is broken and project can be failure with hight probability.
ActiveRecord is a very handy tool. But it cannot do everything for you. It does not provide Database structure by default that you expect exactly. So it should be tuned.
On the other side. If DBA can accept that all PKs are Auto incremented integers that would make Developer's life easier (ActiveRecord does it by default).
On the other side, if developers would accept some of DBA constraints it would make DBA's life easier.
Now to answer your question:
how would you argue FOR simplifying the schema
Do not argue. Meet the team and deliver the message and point on WHY it should be done.
Maybe it really shouldn't and you don't know all the things, maybe they are not aware of something.
You could agree on the general structure of the database AND try to describe it using RoR migrations as a meta language.
This way they would see the general picture, and you would use your great ActiveRecords.
And also everybody would be on the same page.
Your DB schema should reflect the domain and its relationships.
De-normalisation should only be done when you have measured that there is a performance problem.
7 joins is not excessive or bad, provided you have good indexes in place.
The general way to make this argument up the chain is based on cost. If you do things simply, there will be less code and fewer bugs. The system will be able to be built more quickly, or with more features, and thus will create more ROI. If you can get the money manager on board with that approach, he or she may let you dictate terms to the team. There is the counterargument that extreme over-normalization prevents bad data, but I have found that this is not the case, as the complexity it engenders tends to lead to more errors and more database code in general.
The architectural and technical argument here is simple. You have decided to use Ruby on Rails. Therefore you have decided to use the ActiveRecord pattern. The ActiveRecord pattern is driven by having the database tables match the object model. That's the pattern in use here, and in many other places, so the best practices they are trying to apply for extreme data normalization simply do not apply. Buy a copy of Patterns of Enterprise Application Architecture and put the little red bookmark at page 160 so they can understand how the pattern works from the architecture perspective.
What the DBA types tend to be unaware of is how much work ActiveRecord does for you, from query generation, cascading deletes, optimistic locking, auto populated columns, versioning (with acts_as_versioned), soft deletes (with acts_as_paranoid), etc. There is a strong argument to use well tested, community supported library functions to perform these operations versus custom code that must be maintained by a DBA.
The real issue with DBAs is then that they need some work to do. Let them focus on monitoring performance, finding slow queries in the code, creating indexes and doing backups.
If you end up losing the political battle for a sane schema, you may want to consider switching to DataMapper. It's the next pattern in PoEAA. The other thing you may be able to get them to do is to create views in the database that correspond to the object model. This way, you could use many of the finding capabilities in the ActiveRecord model based on the views, but have custom insert, update, and delete methods.

Resources