Optimising export of DB using Rails

I have a RoR application which contains an API to manage applications, each of which contains recipes (and groups, ingredients, measurements).
Once the user has finished managing the recipes, they download a JSON file of the entire application. Because each application could have hundreds of recipes, the files can be large. It also means there are a lot of DB calls to get all the required data to export.
Because of this, the request to download the application can take upwards of 30 seconds, sometimes more.
My current code looks something like this:
application.categories.each do |c|
  c.recipes.each do |r|
    r.groups.each do |g|
      g.ingredients.each do |i|
        # build up the export hash here
      end
    end
  end
end
Within each loop I'm storing the data in a hash and then giving it to the user.
My question is: where do I go from here?
Is there a way to grab all the data I require from the DB in one query? From looking at the log, I can see it is running hundreds of queries.
If the above solution is still slow, is this something I should put into a background process, and then email the user a link (or similar)?

There are of course ways to grab more data at once. This is done with Rails includes or joins, depending on your needs. See this article for some detailed information.
The basic idea is that you can join between your tables so that new queries aren't generated each time. When you do application.categories, that's one query. For each of those categories, you'll do another query: c.recipes - this creates N+1 queries, where N is the number of categories you have. Instead, you can include them from the start to create 1 or 2 queries (depending on what Rails does).
The basic syntax is easy:
Application.includes(:categories => :recipes).each do |application| ...
This generates 1 (or 2 - again, see the article) query that grabs all applications, their categories, and each category's recipes all at once. You can tack on the groups and ingredients too, as in the sketch below.
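For instance, a nested includes hash eager-loads the whole tree up front (a rough sketch, assuming the association names from the question):
# Eager-load categories, their recipes, each recipe's groups and each group's ingredients
app = Application.includes(:categories => { :recipes => { :groups => :ingredients } }).find(params[:id])
app.categories.each do |category|
  category.recipes.each do |recipe|
    # no additional queries are fired here - everything was loaded above
  end
end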
As for putting the work in the background, my suggestion would be to just have a loading image, or get fancy by using a progress bar.

First of all I have to assume that the required has_many and belongs_to associations exist.
Generally you can do something like
c.recipes.includes(:groups)
or even
c.recipes.includes(:groups => :ingredients)
which will fetch recipes and groups (and ingredients) at once.
But since you have quite a big data set, IMO it would be better to limit that technique to the deepest levels.
The most useful approach would be to use find_each and includes together.
(find_each fetches the items in batches in order to keep the memory usage low)
perhaps something like
application.categories.each do |c|
  c.recipes.find_each do |r|
    r.groups.includes(:ingredients).each do |g|
      g.ingredients.each do |i|
        # ...
      end
    end
  end
end
Now even that can take quite a long time (for an HTTP request), so you can consider using some async processing, where the client generates a request that is processed by the server as a background job; when that is ready, you provide a download link (or send an email) to the client.
Resque is one possible solution for handling the async part.
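A minimal sketch of that background-job approach with Resque (the job class, exporter object and mailer are hypothetical names, not something from the question):
class ExportApplicationJob
  @queue = :exports

  # application_id and user_id are passed in when the job is enqueued
  def self.perform(application_id, user_id)
    application = Application.find(application_id)
    json = ApplicationExporter.new(application).to_json # hypothetical exporter that builds the hash
    export = Export.create!(:application => application, :data => json)
    ExportMailer.ready(user_id, export).deliver # e-mail the user a download link
  end
end

# enqueue from the controller instead of building the JSON inline:
Resque.enqueue(ExportApplicationJob, application.id, current_user.id)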

Related

Using caching to optimize a timeline in Rails

I'm hoping to get advice on the proper use of caching to speed up a timeline query in Rails. Here's the background:
I'm developing an iPhone app with a Rails backend. It's a social app, and like other social apps, its primary view is a timeline (i.e., newsfeed) of messages. This works very much like Twitter, where the timeline is made up of messages of the user and of his/her followers. The main query in the API request to retrieve the timeline is the following:
#messages = Message.where("user_id in (?) OR user_id = ?", current_user.followed_users.map(&:id), current_user)
Now this query gets quite inefficient, particularly at scale, so I'm looking into caching. Here are the two things I'm planning to do:
1) Use Redis to cache timelines as lists of message ids
Part of what makes this query so expensive is figuring out which messages to display on-the-fly. My plan here is to create a Redis list of message ids for each user. Assuming I build this correctly, when a Timeline API request comes in I can call Redis to get a pre-processed, ordered list of the ids of the messages to display. For example, I might get something like this: "[21, 18, 15, 14, 8, 5]"
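Something like this is what I have in mind, using the redis-rb client (the key names and the recipient_ids helper are illustrative assumptions):
# fan out on write: when a message is created, push its id onto each recipient's timeline list
redis = Redis.new
message.recipient_ids.each do |user_id|     # recipient_ids is a hypothetical helper
  redis.lpush("timeline:#{user_id}", message.id)
  redis.ltrim("timeline:#{user_id}", 0, 99) # keep only the 100 most recent ids
end

# read on request: fetch the pre-computed list (ids come back as strings)
ids_from_redis = redis.lrange("timeline:#{current_user.id}", 0, 49)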
2) Use Memcached to cache individual message objects
While I believe the first point will help a great deal, there's still the potential problem of retrieving the individual message objects from the database. The message objects can get quite big. With them, I return related objects like comments, likes, the user, etc. Ideally, I would cache these individual message objects as well. This is where I'm confused.
Without caching, I would simply make a query call like this to retrieve the message objects:
#messages = Message.where("id in (?)", ids_from_redis)
Then I would return the timeline:
respond_with(:messages => @messages.as_json) # includes related likes, comments, user, etc.
Now given my desire to utilize Memcache to retrieve individual message objects, it seems like I need to retrieve the messages one at a time. Using pseudo-code I'm thinking something like this:
@messages = []
ids_from_redis.each do |m|
  message = Rails.cache.fetch("message_#{m}") do
    Message.find(m).as_json
  end
  @messages << message
end
Here are my two specific questions (sorry for the lengthy build):
1) Does this approach generally make sense (redis for lists, memcached for objects)?
2) Specifically, in the pseudo-code above, is this the only way to do this? It feels inefficient grabbing the messages one-by-one but I'm not sure how else to do it given my intention to do object-level caching.
Appreciate any feedback as this is my first time attempting something like this.
On the face of it, this seems reasonable. Redis is well suited to storing lists etc, can be made persistent etc, and memcached will be very fast for retrieving individual messages, even if you call it sequentially like that.
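If the one-by-one fetches ever become a bottleneck, note that most cache stores (including the memcached store) can read several keys in a single round trip via read_multi; a rough sketch, assuming the same key scheme as your pseudo-code:
keys = ids_from_redis.map { |id| "message_#{id}" }
cached = Rails.cache.read_multi(*keys) # one round trip to memcached

@messages = ids_from_redis.map do |id|
  # fall back to the DB (and repopulate the cache) for any ids that were missing
  cached["message_#{id}"] || Rails.cache.fetch("message_#{id}") { Message.find(id).as_json }
end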
The issue here is that you're going to need to clear/supplement that redis cache each time a message is posted. It seems a bit of a waste just to clear the cache in this circumstance, because you'll already have gone to the trouble of identifying every recipient of the message.
So, without wishing to answer the wrong question, have you thought about 'rendering' the visibility of messages into the database (or redis, for that matter) when each message is posted? Something like this:
class Message < ActiveRecord::Base
  belongs_to :sender
  has_many :visibilities
  before_create :render_visibility

  def render_visibility
    sender.followers.each do |follower|
      visibilities.build(:user => follower)
    end
  end
end
You could then render the list of messages quite simply:
class User < ActiveRecord::Base
  has_many :visibilities
  has_many :messages, :through => :visibilities
end
# in your timeline view:
<%= render current_user.messages %>
I would then add caching of individual messages like this:
# In your message partial, caching individual rendered messages:
<% cache(message) do %>
<!-- render your message here -->
<% end %>
I would also then add caching of entire timelines like this:
# In your timeline view
<%= cache("timeline-for-#{current_user}-#{current_user.messages.last.cache_key}") do %>
<%= current_user.messages.each { |message| render message } %>
<% end %>
What this should achieve (I've not tested it) is that the entire timeline HTML will be cached until a new message is posted. When that happens, the timeline will be re-rendered, but all the individual messages will come from the cache rather than being rendered again (with the possible exception of any new ones that haven't been viewed by anyone else!)
Note that this assumes that the message rendering is the same for every user. If it isn't, you'll need to cache the messages per user too, which would be a bit of a shame, so try not to do this if you can!
FWIW, I believe this is vaguely (and I mean vaguely) what Twitter does. They have a 'big data' approach to it though, where the tweets are exploded and inserted into follower timelines across a large cluster of machines. What I've described here will struggle to scale in a write-heavy environment with lots of followers, although you could improve this somewhat by using Resque or similar.
P.S. I've been a bit lazy with the code here - you should look to refactor this to move e.g. the timeline cache key generation into a helper and/or the person model.

How to improve performance of single-page application?

Introduction
I have a (mostly) single-page application built with BackboneJS and a Rails backend.
Because most of the interaction happens on one page of the webapp, when the user first visits the page I basically have to pull a ton of information out of the database in one large deeply joined query.
This is causing me some rather extreme load times on this one page.
NewRelic appears to be telling me that most of my problems are because of 457 individual fast method calls.
Now I've done all the eager loading I can do (I checked with the Bullet gem) and I still have a problem.
These method calls are most likely occurring in my Rabl serializer which I use to serialize a bunch of JSON to embed into the page for initializing Backbone. You don't need to understand all this but suffice to say it could add up to 457 method calls.
object @search
attributes :id, :name, :subscription_limit

# NOTE: Include a list of the members of this search.
child :searchers => :searchers do
  attributes :id, :name, :gravatar_icon
end

# Each search has many concepts (there could be over 100 of them).
child :concepts do |search|
  attributes :id, :title, :search_id, :created_at

  # The person who suggested each concept.
  child :suggester => :suggester do
    attributes :id, :name, :gravatar_icon
  end

  # Each concept has many suggestions (approx. 4 each).
  node :suggestions do |concept|
    # Here I'm scoping suggestions to only ones which meet certain conditions.
    partial "suggestions/show", object: concept.active_suggestions
  end

  # Add a boolean flag to tell if the concept is a favourite or not.
  node :favourite_id do |concept|
    # Another method call which occurs for each concept.
    concept.favourite_id_for(current_user)
  end
end

# Each search has subscriptions to certain services (approx. 4).
child :service_subscriptions do
  # This contains a few attributes and 2 fairly innocuous method calls.
  extends "service_subscriptions/show"
end
So it seems that I need to do something about this but I'm not sure what approach to take. Here is a list of potential ideas I have:
Performance Improvement Ideas
Dumb-Down the Interface
Maybe I can come up with ways to present information to the user which don't require the actual data to be present. I don't see why I should absolutely need to do this, though; other single-page apps such as Trello have incredibly complicated interfaces.
Concept Pagination
If I paginate concepts it will reduce the amount of data being extracted from the database each time. It would produce an inferior user interface, though.
Caching
At the moment, refreshing the page just extracts the entire search out of the DB again. Perhaps I can cache parts of the app to reduce DB hits. This seems messy though because not much of the data I'm dealing with is static.
Multiple Requests
It is technically bad to serve the page without embedding the JSON into the page but perhaps the user will feel like things are happening faster if I load the page unpopulated and then fetch the data.
Indexes
I should make sure that I have indexes on all my foreign keys. I should also try to think about places where it would help to have indexes (such as favourites?) and add them.
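Something like this migration is what I have in mind (table and column names are guesses based on the serializer above):
class AddMissingIndexes < ActiveRecord::Migration
  def change
    add_index :concepts, :search_id
    add_index :suggestions, :concept_id
    add_index :favourites, [:concept_id, :user_id]
    add_index :service_subscriptions, :search_id
  end
end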
Move Method Calls into DB
Perhaps I can cache some of the results of the iteration I do in my view layer into the DB and just pull them out instead of computing them. Or I could sync things on write rather than on read.
Question
Does anyone have any suggestions as to what I should be spending my time on?
This is a hard question to answer without being able to see the actual user interface, but I would focus on loading only as much data as is required to display the initial interface. For example, if the user has to drill down to see some of the data you're presenting, then you can load that data on demand, rather than loading it as part of the initial payload. You mention that a search can have as many as 100 "concepts"; maybe you don't need to fetch all of those concepts initially?
Bottom line, it doesn't sound like your issue is really on the client side -- it sounds like your server-side code is slowing things down, so I'd explore what you can do to fetch less data, or to defer the complex queries until they are definitely required.
I'd recommend separating your JS code-base into modules that are dynamically loaded using an asset loader like RequireJS. This way you won't have so many XHRs firing at load time.
When a specific module is needed it can load and initialize at an appropriate time instead of every page load.
If you complicate your code a little, each module should be able to start and stop. So, if you have any polling occurring or complex code executing you can stop the module to increase performance and decrease the network load.

Caching bunch of simple queries in rails

In my app there are objects, and they belong to countries, regions, cities, types, groups, companies and other sets. Every set is rather simple - it has an id, a name and sometimes some pointers to other sets, and it never changes. Some sets are small and I load them in a before_filter like this:
@countries = Country.all
@regions = Region.all
But then I call, for example,
offer.country.name
or
region.country.name
and my app performs a separate DB query-by-id, although I've already loaded them all. I've also tried querying through :include, but in that case the eager-loading queries are generated regardless of whether I've already loaded this data with another query-by-id or not.
So I want some cache. For example, I may build hashes keyed by record id in my before_filter and then call @countries[offer.country_id].name. In that case it seems I don't need eager loading, and it's easy to turn on Rails.cache here. But maybe there's some smart built-in Rails solution that doesn't require rewriting everything?
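For the hash-lookup idea, ActiveSupport's index_by does most of the work; a sketch (the filter name is just an illustration):
before_filter :load_lookup_tables

def load_lookup_tables
  # one query per table, then O(1) lookups by id for the rest of the request
  @countries = Country.all.index_by(&:id)
  @regions   = Region.all.index_by(&:id)
end

# later on:
@countries[offer.country_id].name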
Caching lists of models like that won't cache the individual instances that exist in other models' associations.
The Rails team has worked on implementing Identity Maps in Rails 3.1 to solve this exact problem, but it is disabled by default for now. You can enable it and see if it works for your problem.
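If you want to try it, enabling it in Rails 3.1 should be a single config flag (check the known caveats around associations before relying on it):
# config/application.rb
config.active_record.identity_map = true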

Specifying and Executing Rules in Ruby

I am looking for a Ruby/Rails tool that will help me accomplish the following:
I would like to store the following string, and ones similar to it, in my database. When an object is created, updated, deleted, etc., I want to run through all the strings, check to see if the CRUD event matches the conditions of the string, and if so, run the actions specified.
When a new ticket is created and its category=6 then notify user 1234 via email
I am planning to create an interface that builds these strings, so it doesn't need to be a human-readable string. If a JSONish structure is better, or a tool has an existing language, that would be fantastic. I'm kinda thinking something along the lines of:
{
  object_types: ['ticket'],
  events: ['created', 'updated'],
  conditions: 'ticket.category=6',
  actions: 'notify user',
  parameters: {
    user: 1234,
    type: 'email'
  }
}
So basically, I need the following:
Monitor CRUD events - It would be nice if the tool had a way to do this, but I can use Rails' ModelObservers here if the tool doesn't natively provide it
Find all matching "rules" - This is my major unknown...
Execute the requested method/parameters - Ideally, this would be defined in my Ruby code as classes/methods
Are there any existing tools that I should investigate?
Edit:
Thanks for the responses so far guys! I really appreciate you pointing me down the right paths.
The use case here is that we have many different clients, with many different business rules. For the rules that apply to all clients, I can easily create those in code (using something like Ruleby), but for all of the client-specific ones, I'd like to store them in the database. Ideally, the rule could be written once, stored either in the code or in the DB, and then run (using something like Resque for performance).
At this point, it looks like I'm going to have to roll my own, so any thoughts as to the best way to do that, or any tools I should investigate, would be greatly appreciated.
Thanks again!
I don't think it would be a major thing to write something yourself to do this; I don't know of any gems which would do it (but it would be good if someone wrote one!).
I would tackle the project in the following way. My thinking is that you don't want to do the rule matching at the point the user saves, as it may take a while and could interrupt the user experience and/or slow down the server, so...
Use observers to store a record each time a CRUD event happens, or to make things simpler use the Acts as Audited gem which does this for you.
1.5. Use a rake task, running from your crontab to run through the latest changes, perhaps every minute, or you could use Resque which does a good job of handling lots of jobs
Create a set of tables which define the possible rules a user could select from, perhaps something like
Table: Rule
Name
ForEvent (eg. CRUD)
TableInQuestion
FieldOneName
FieldOneCondition etc.
MethodToExecute
You can use a bit of metaprogramming to execute your method, and since your method knows your table name and record id, this can be picked up.
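Something along these lines is what I mean by the metaprogramming part (the column names match the Rule table sketched above; the handler class is hypothetical):
rule = Rule.find(rule_id)
record = rule.table_in_question.classify.constantize.find(record_id)

# MethodToExecute holds something like "TicketNotifier.notify_user"
handler, method = rule.method_to_execute.split('.')
handler.constantize.send(method, record, rule)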
Additional Notes
The best way to get going with this is to start simple then work upwards. To get the simple version working first I'd do the following ...
Install acts as audited
Add an additional field to the created audit table, :when_processed
Create yourself a module in your /lib folder called something like processrules which roughly does this
3.1 Grabs all unprocessed audit entries
3.2 Marks them as processed (perhaps make another small audit table at this point to record events happening)
Now create a rules table which simply has a name and condition statement, perhaps add a few sample ones to get going
Name: First | Rule Statement: 'SELECT 1 WHERE table.value = something'
Adapt your new processrules method to execute that sql for each changed entry (perhaps you want to restrict it to just the tables you are working with)
If the rule matched, add it to your log file.
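A very rough sketch of what that processrules module might look like (it leans on the Audit model from acts_as_audited and the rules table described above, and the SQL matching is deliberately simplistic):
# lib/process_rules.rb
module ProcessRules
  def self.run
    Audit.where(:when_processed => nil).find_each do |audit|
      Rule.all.each do |rule|
        # the rule stores a raw SQL condition, e.g. 'SELECT 1 WHERE table.value = something'
        matched = ActiveRecord::Base.connection.select_value(rule.rule_statement)
        Rails.logger.info("Rule #{rule.name} matched audit #{audit.id}") if matched
      end
      audit.update_attribute(:when_processed, Time.now)
    end
  end
end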
From here you can extrapolate the additional functionality you need, and perhaps ask another question about the metaprogramming side of dynamically calling methods, as this question is quite broad; I'm more than happy to help further.
I tend to think the best way to go about task processing is to set up the process nicely first, so it will work with any server load and situation, then plug in the custom bits.
You could make this abstract enough so that you can specify arbitrary conditions and rules, but then you'd be developing a framework/engine as opposed to solving the specific problems of your app.
There's a good chance that using ActiveRecord::Observer will solve your needs, since you can hardcode all the different types of conditions you expect, and then only put the unknowns in the database. For example, say you know that you'll have people watching categories, then create an association like category_watchers, and use the following Observer:
class TicketObserver < ActiveRecord::Observer
  # observe :ticket # not needed here, since it's inferred by the class name

  def after_create(ticket)
    ticket.category.watchers.each { |user| notify_user(ticket, user) }
  end

  # def after_update ... (similar)

  private

  def notify_user(ticket, user)
    # lookup the user's stored email preferences
    # send an email if appropriate
  end
end
If you want to store the email preference along with the fact that the user is watching the category, then use a join model with a flag indicating that.
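For example, the join model could be as small as this (names are illustrative):
class CategoryWatcher < ActiveRecord::Base
  belongs_to :user
  belongs_to :category
  # a boolean column such as :send_email stores the user's preference for this category
end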
If you then want to abstract it a step further, I'd suggest using something like treetop to generate the observers themselves, but I'm not convinced that this adds more value than abstracting similar observers in code.
There's a Ruby & Rules Engines SO post that might have some info that you might find useful. There's another Ruby-based rules engine that you may want to explore that as well - Ruleby.
Hope that this helps you start your investigation.

How to make ActiveRecord work with legacy partitioned/sharded databases/tables?

Thanks for your time first... After all the searching on Google, GitHub and here, I only got more confused by the big words (partition/shard/federate), so I figure I have to describe the specific problem I've met and ask around.
My company's databases deals with massive users and orders, so we split databases and tables in various ways, some are described below:
way         database and table name   sharded by (maybe it should be called "partitioned by"?)
YZ.X        db_YZ.tb_X                order serial number, last three digits
YYYYMMDD.   db_YYYYMMDD.tb            date
YYYYMM.DD   db_YYYYMM.tb_DD           date too
The basic concept is that databases and tables are separated according to a field (not necessarily the primary key), and there are too many databases and too many tables, so writing or magically generating one database.yml config for each database and one model for each table isn't possible, or at least not the best solution.
I looked into drnic's magic solutions, and DataFabric, and even the source code of ActiveRecord. Maybe I could use ERB to generate database.yml and set up the database connection in an around filter, and maybe I could use named_scope to dynamically decide the table name for find, but update/create operations are bound to "self.class.quoted_table_name", so I couldn't easily get my problem solved that way. I could even generate one model for each table, since there are at most 30 of them.
But this is just not DRY!
What I need is a clean solution like the following DSL:
class Order < ActiveRecord::Base
  shard_by :order_serialno do |key|
    [get_db_config_by(key), # some or all of the databases might share the same machine, be configured by a hash of regexes, or even be a constant
     get_db_name_by(key),
     get_tb_name_by(key)]
  end
end
Can anybody enlighten me? Any help would be greatly appreciated!
Case two (where only db name changes) is pretty easy to implement with DbCharmer. You need to create your own sharding method in DbCharmer, that would return a connection parameters hash based on the key.
The other two cases are not supported right away, but could easily be added to your system:
You implement a sharding method that knows how to deal with database names in your sharded database; this gives you the ability to make shard_for(key) calls on your model to switch the DB connection.
You add a method like this:
class MyModel < ActiveRecord::Base
  db_magic :sharded => { :sharded_connection => :my_sharding_method }

  def self.switch_shard(key)
    set_table_name(table_for_key(key)) # switch table
    shard_for(key)                     # switch connection
  end
end
Now you could use your model like this:
MyModel.switch_shard(key).first
MyModel.switch_shard(key).count
and, considering you have shard_for(key) call results returned from the switch_shard method, you could use it like this:
m = MyModel.switch_shard(key) # Switch connection and get a connection proxy
m.first # Call any AR methods on the proxy
m.count
If you want that particular DSL, or something that matches the logic behind the legacy sharding you are going to need to dig into ActiveRecord and write a gem to give you that kind of capability. All the existing solutions that you mention were not necessarily written with your situation in mind. You may be able to bend any number of solutions to your will, but in the end you're gonna have to probably write custom code to get what you are looking for.
Sounds like, in this case, you should consider not using SQL.
If the data sets are that big and can be expressed as key/value pairs (with a little de-normalization), you should look into CouchDB or other NoSQL solutions.
These solutions are fast, fully scalable and REST-based, so it is easy to grow, back up and replicate.
We all have gotten into solving all our problems with the same tool (Believe me, I try to too).
It would be much easier to switch to a NoSQL solution than to rewrite ActiveRecord.
