Question about Ruby on Rails, Constants, belongs_to & Database Optimization/Performance - ruby-on-rails

I've developed a web based point of sale system for one of my clients in Ruby on Rails with MySQL backend. These guys are growing so fast that they are ringing close to 10,000 transactions per day corporate-wide. For this question, I will use the transactions table as an example. Currently, I store the transactions.status as a string (ie: 'pending', 'completed', 'incomplete') within a varchar(255) field that has an index. In the beginning, it was fine when I was trying to lookup records by different statuses as I didn't have to worry about so many records. Over time, using the query analyzer, I have noticed that performance has worsened and that varchar fields can really slowdown your query speed over thousands of lookups. I've been thinking about converting these varchar fields to integer based status fields utilizing STATUS CONSTANT within the Transaction model like so:
class Transaction < ActiveRecord::Base
STATUS = { :incomplete => 0, :pending => 1, :completed => 2 }
def expensive_query_by_status(status)
self.find(:all,
:select => "id, cashier, total, status",
:condition => { :status => STATUS[status.to_sym] })
end
Is this the best route for me to take? What do you guys suggest? I am already using proper indexes on various lookup fields and memcached for query caching wherever possible. They're currently setup on a distributed server environment of 3 servers where 1st is for application, 2nd for DB & 3rd for caching (all in 1 datacenter & on same VLAN).

Have you tried the alternative on a representative database? From the example given, I'm a little sceptical that it's going to make much difference, you see. If there are only three statuses then a query by status may be better-off not using an index at all.
Say "completed" comprises 80% of your table - with no other indexed column involved, you're going to be requiring more reads if the index is used than not. So a query of that type is almost certainly going to get slower as the table grows. "incomplete" and "pending" queries would probably still benefit from an index, however; they'd only be affected as the total number of rows with those statuses grew.
How often do you look at everything, complete and otherwise, without some more selective criterion? Could you partition the table in some (internal or external) way? For example, store completed transactions in a separate table, moving new ones there as they reach their final (?) state. I think internal database partitioning was introduced in MySQL 5.1 - looking at the documentation it seems that a RANGE partition might be appropriate.
All that said, I do think there's probably some benefit to moving away from storing statuses as strings. Storage and bandwidth considerations aside, it's a lot less likely that you'll inadvertently mis-spell an integer or, better yet, a constant or symbol.

You might want to start limiting your searchings (if your not doing that already), #find(:all) is pretty taxing on that scale. Also you might want to think about what your Transaction model is reaching out for as it gets translated into your views and perhaps eager load those to minimize requests to the db for extra information.

Related

Effects of Rail's default_scope on performance

Can default_scope when used to not order records by ID significantly slow down a Rails application?
For example, I have a Rails (currently 3.1) app using PostgreSQL where nearly every Model has a default_scope ordering records by their name:
default_scope order('users.name')
Right now because the default_scope's order records by name rather by ID, I am worried I might be incurring a significant performance penalty when normal queries are run. For example with:
User.find(5563401)
or
#User.where('created_at = ?', 2.weeks.ago)
or
User.some_scope_sorted_best_by_id.all
In the above examples, what performance penalty might I incur by having a default_scope by name on my Model? Should I be concerned about this default_scope affecting application performance?
Your question is missing the point. The default scope itself is just a few microseconds of Ruby execution to cause an order by clause to be added to every SQL statement sent to PostgreSQL.
So your question is really asking about the performance difference between unordered queries and ordered ones.
Postgresql documentation is pretty explicit. Ordered queries on unindexed fields are much slower than unordered because (no surprise), PostgreSQL must sort the results before returning them, first creating temporary table or index to contain the result. This could easily be a factor of 4 in query time, possibly much more.
If you introduce an index just to achieve quick ordering, you are still paying to maintain the index on every insert and update. And unless it's the primary index, sorted access still involves random seeks, which may actually be slower than creating a temporary table. This also is discussed in the Postgres docs.
In a nutshell, NEVER add an order clause to an SQL query that doesn't need it (unless you enjoy waiting for your database).
NB: I doubt a simple find() will have order by attached because it must return exactly one result. You can verify this very quickly by starting rails console, issuing a find, and watching the generated SQL scroll by. However, the where and all definitely will be ordered and consequently definitely be slower than needed.

What is the 'Rails Way' to implement a dynamic reporting system on data

Intro
I'm doing a system where I have a very simple layout only consisting of transactions (with basic CRUD). Each transaction has a date, a type, a debit amount (minus) and a credit amount (plus). Think of an online banking statement and that's pretty much it.
The issue I'm having is keeping my controller skinny and worrying about possibly over-querying the database.
A Simple Report Example
The total debit over the chosen period e.g. SUM(debit) as total_debit
The total credit over the chosen period e.g. SUM(credit) as total_credit
The overall total e.g. total_credit - total_debit
The report must allow a dynamic date range e.g. where(date BETWEEN 'x' and 'y')
The date range would never be more than a year and will only be a max of say 1000 transactions/rows at a time
So in the controller I create:
def report
#d = Transaction.select("SUM(debit) as total_debit").where("date BETWEEN 'x' AND 'y'")
#c = Transaction.select("SUM(credit) as total_credit").where("date BETWEEN 'x' AND 'y'")
#t = #c.credit_total - #d.debit_total
end
Additional Question Info
My actual report has closer to 6 or 7 database queries (e.g. pulling out the total credit/debit as per type == 1 or type == 2 etc) and has many more calculations e.g totalling up certain credit/debit types and then adding and removing these totals off other totals.
I'm trying my best to adhere to 'skinny model, fat controller' but am having issues with the amount of variables my controller needs to pass to the view. Rails has seemed very straightforward up until the point where you create variables to pass to the view. I don't see how else you do it apart from putting the variable creating line into the controller and making it 'skinnier' by putting some query bits and pieces into the model.
Is there something I'm missing where you create variables in the model and then have the controller pass those to the view?
A more idiomatic way of writing your query in Activerecord would probably be something like:
class Transaction < ActiveRecord::Base
def self.within(start_date, end_date)
where(:date => start_date..end_date)
end
def self.total_credit
sum(:credit)
end
def self.total_debit
sum(:debit)
end
end
This would mean issuing 3 queries in your controller, which should not be a big deal if you create database indices, and limit the number of transactions as well as the time range to a sensible amount:
#transactions = Transaction.within(start_date, end_date)
#total = #transaction.total_credit - #transaction.total_debit
Finally, you could also use Ruby's Enumerable#reduce method to compute your total by directly traversing the list of transactions retrieved from the database.
#total = #transactions.reduce(0) { |memo, t| memo + (t.credit - t.debit) }
For very small datasets this might result in faster performance, as you would hit the database only once. However, I reckon the first approach is preferable, and it will certainly deliver better performance when the number of records in your db starts to increase
I'm putting in params[:year_start]/params[:year_end] for x and y, is that safe to do?
You should never embed params[:anything] directly in a query string. Instead use this form:
where("date BETWEEN ? AND ?", params[:year_start], params[:year_end])
My actual report probably has closer to 5 database calls and then 6 or 7 calculations on those variables, should I just be querying the date range once and then doing all the work on the array/hash etc?
This is a little subjective but I'll give you my opinion. Typically it's easier to scale the application layer than the database layer. Are you currently having performance issues with the database? If so, consider moving the logic to Ruby and adding more resources to your application server. If not, maybe it's too soon to worry about this.
I'm really not seeing how I would get the majority of the work/calculations into the model, I understand scopes but how would you put the date range into a scope and still utilise GET params?
Have you seen has_scope? This is a great gem that lets you define scopes in your models and have them automatically get applied to controller actions. I generally use this for filtering/searching, but it seems like you might have a good use case for it.
If you could give an example on creating an array via a broad database call and then doing various calculations on that array and then passing those variables to the template that would be awesome.
This is not a great fit for Stack Overflow and it's really not far from what you would be doing in a standard Rails application. I would read the Rails guide and a Ruby book and it won't be too hard to figure out.

Handling lots of report / financial data in rails 3, without slowing down?

I'm trying to figure out how to ask this - so I'll update the question as it goes to clear things up if needed.
I have a virtual stock exchange game site I've been building for fun. People make tons of trades, and each trade is its own record in a table.
When showing the portfolio page, I have to calculate everything on the fly, on the table of data - i.e. How many shares the user has, total gains, losses etc.
Things have really started slowing down, when I try to segment it by trades by company by day.
I don't really have any code to show to demonstrate this - but it just feels like I'm not approaching this correctly.
UPDATE: This code in particular is very slow
#Returning an array of values for a total portfolio over time
def portfolio_value_over_time
portfolio_value_array = []
days = self.from_the_first_funding_date
companies = self.list_of_companies
days.each_with_index do |day, index|
#Starting value
days_value = 0
companies.each do |company|
holdings = self.holdings_by_day_and_company(day, company)
price = Company.find_by_id(company).day_price(day)
days_value = days_value + (holdings * price).to_i
end
#Adding all companies together for that day
portfolio_value_array[index] = days_value
end
The page load time can be up to 20+ seconds - totally insane. And I've cached a lot of the requests in Memcache.
Should I not be generating reports / charts on the live data like this? Should I be running a cron task and storing them somewhere? Whats the best practice for handling this volume of data in Rails?
The Problem
Of course it's slow. You're presumably looking up large volumes of data from each table, and performing multiple lookups on multiple tables on every iteration through your loop.
One Solution (Among Many)
You need to normalize your data, create a few new models to store expensive calculated values, and push more of the calculations onto the database or into tables.
The fact that you're doing a nested loop over high-volume data is a red flag. You're making many calls to the database, when ideally you should be making as few sequential requests as possible.
I have no idea how you need to normalize your data or optimize your queries, but you can start by looking at the output of explain. In general, though, you probably will want to eliminate any full table scans and return data in larger chunks, rather than a record at a time.
This really seems more like a modeling or query problem than a Rails problem, per se. Hope this gets you pointed in the right direction!
You should precompute and store all this data on another table. An example table might look like this:
Table: PortfolioValues
Column: user_id
Column: day
Column: company_id
Column: value
Index: user_id
Then you can easily load all the user's portfolio data with a single query, for example:
current_user.portfolio_values
Since you're using memcached anyway, use it to cache some of those queries. For example:
Company.find_by_id(company).day_price(day)

How do you order an array by a connected integer in Ruby on Rails?

I'm creating a most popular activity section for user profiles. I have no difficulty pulling questions through the user_id but I'm having trouble pulling then ordering by the associated integer: question.votes.size . This is probably a simply question but how do I sort then limit the output to 3? How do I do this without lagging the database? There will eventually be a lot of votes to be counted. Should this be a named_scope?
#user_id = User.find_by_username(params[:username]).id
questions = Question.find(:all, :conditions => {:user_id => #user_id })
I wanted to pop in and suggest another way that is, perhaps, a bit more native RoR. :-)
#user = User.find_by_username( params[:username], :include => [{:questions => :votes}])
#sorted_questions = #user.questions.sort { |q1,q2| q2.votes.length <=> q1.votes.length }
This has a number of advantages:
1) No SQL written, maintains DB portability, easier to read(?)
2) Relieves DB of sort compute, should scale better
and a couple of disadvantages:
1) Works Ruby harder, higher latency at low loads, less efficient on single box
2) Moves more data, potentially mitigates advantage #2
Ideally, you'd want to look at ActiveRecord's counter cache functionality. It automatically caches relationship counts by denormalizing the child row count into the parent table. For it to work, all child row manipulation must occur via the parent object, but that's Rails best practice in any case.
Having a votes counter cache in questions would eliminate the need to reference the votes table in the query. Doing this, and sorting in Ruby, might be the ideal situation from both performance and code esthetic points of view.
Finally, I have to admit punting on Rails 3's very cool relational algebra stuff. With it, it's likely that this could be a super readable one-liner that generates optimal SQL. How cool is that going to be? :-)
I'm assuming that each question has many votes.
You can do this by using
Question.find_by_sql("
SELECT question.*, COUNT(votes.id) as vote_count
FROM questions
LEFT JOIN votes on questions.id = votes.question_id
GROUP BY questions.id
ORDER BY vote_count DESC
");
or something roughly equivalent to that (I didn't test it)

Does Ruby on Rails "has_many" array provide data on a "need to know" basis?

On Ruby on Rails, say, if the Actor model object is Tom Hanks, and the "has_many" fans is 20,000 Fan objects, then
actor.fans
gives an Array with 20,000 elements. Probably, the elements are not pre-populated with values? Otherwise, getting each Actor object from the DB can be extremely time consuming.
So it is on a "need to know" basis?
So does it pull data when I access actor.fans[500], and pull data when I access actor.fans[0]? If it jumps from each record to record, then it won't be able to optimize performance by doing sequential read, which can be faster on the hard disk because those records could be in the nearby sector / platter layer -- for example, if the program touches 2 random elements, then it will be faster just to read those 2 records, but what if it touches all elements in random order, then it may be faster just to read all records in a sequential way, and then process the random elements. But how will RoR know whether I am doing only a few random elements or all elements in random?
Why would you want to fetch 50000 records if you only use 2 of them? Then fetch only those two from DB. If you want to list the fans, then you will probably use pagination - i.e. use limit and offset in your query, or some pagination gem like will_paginate.
I see no logical explanation why should you go the way you try to. Explain a real situation so we could help you.
However there is one think you need to know wile loading many associated objects from DB - use :include like
Actor.all(:include => :fans)
this will eager-load all the fans so there will only be 2 queries instead of N+1, where N is a quantity of actors
Look at the SQL which is spewed out by the server in development mode, and that will tell you how many fan records are being loaded. In this case actor.fans will indeed cause them all to be loaded, which is probably not what you want.
You have several options:
Use a paginator as suggested by Tadas;
Set up another association with the fans table that pulls in just the ones you're interested in. This can be done either with a conditions on the has_many statement, e.g.
has_many :fans, :conditions => "country of residence = 'UK'"
Specifying the full SQL to narrow down the rows returned with the :finder_sql option
Specifying the :limit option which will, well, limit, the number of rows returned.
All depends on what you want to do.

Resources