Rails model caching - ruby-on-rails

I have two models, called Division and Tier, which are queried a lot but hardly ever inserted into or updated. The tables only have a handful of records, so the whole tables can be stored in memory.
class Division < ActiveRecord::Base
  # position: integer
  belongs_to :tier
end

class Tier < ActiveRecord::Base
  # name: string
  has_many :divisions
end
These tables are queried on almost every page, so it seems like a waste of database calls.
Model caching solutions like IdentityCache, which cache records in Memcached, only allow you to retrieve records from the cache by id.
A lot of queries are just selecting by id, but I also perform a lot of queries like
SELECT * FROM divisions WHERE divisions.position BETWEEN ? AND ?
I have a few questions
Will the database cache these tables in memory by itself, making another caching solution unnecessary?
If not, is there any way to cache these models in memory, and query against the cache using conditions other than just the id?
Am I essentially asking for something with the speed of redis/memcached, but the features of a relational database, which isn't possible?

Rails will cache query results and avoid duplicate DB lookups, but only for identical SQL queries, and only if that SQL query has already been made within the same request (the SQL query cache). So another caching solution may be warranted.
For your case (non-volatile, reference data), caching sounds like a good idea. However, there are a few caveats:
As you pointed out, you aren't always querying by id, which means you either need to cache the results of specific ranges (which seems like a bad idea), or you need a cache that can perform lookups by multiple indexes and find results across a range (Elasticsearch would be a good example of this).
Caches can become stale when you modify data (INSERT/UPDATE/DELETE). Range caches are particularly vulnerable to this (another reason it was a bad idea), but even if you cache by indexes you'll still need to ensure that you update your caches accordingly.
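Given how small and stable these tables are, one low-tech option is to cache the whole table and do the range filtering in Ruby. This is only a sketch of the idea, not an existing gem API: the fetch_all helper, the cache key and the expiry callback are all assumptions.

class Division < ActiveRecord::Base
  belongs_to :tier

  # Load every division once and keep the array in the Rails cache.
  def self.fetch_all
    Rails.cache.fetch("divisions/all") { all.to_a }
  end

  # Cached equivalent of:
  #   SELECT * FROM divisions WHERE position BETWEEN ? AND ?
  def self.between_positions(low, high)
    fetch_all.select { |d| d.position.between?(low, high) }
  end

  # Expire the cache on the rare insert/update/delete.
  after_commit { Rails.cache.delete("divisions/all") }
end

Division.between_positions(2, 5) then hits the database at most once per cache cycle; you trade the SQL BETWEEN for a trivial in-memory filter, which is cheap for a table with only a handful of rows.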

Related

Rails 4 encrypt foreign key

I'm building an app that requires HIPAA compliance, which, to cut to the chase, means that I can't allow for certain connections to be freely viewable in the database (patients and recommendations for them).
These tables are connected through the patients_recommendations table in my app, which worked well until I added the encryption via attr_encrypted. In an effort to cut down on the amount of encrypting and decrypting (and associated overhead), I'd like to be able to simply encrypt the patient_id in the patients_recommendations table. However, upon changing the data type to string and the column name to encrypted_patient_id, the app breaks with the following error when I try to reseed my database:
can't write unknown attribute `patient_id'
I assume this is because the join is looking for the column directly and not by going through the model (makes sense, using the model is probably slower). Is there any way that I can make Rails go through the model (where attr_encrypted has added the necessary helper methods)?
Update:
In an effort to find a work-around, I've tried adding a before_save to the model like so:
before_save :encrypt_patient_id
...
private

def encrypt_patient_id
  self.encrypted_patient_id = PatientRecommendation.encrypt(:patient_id, self.patient_id)
  self.patient_id = nil
end
This doesn't work either, however; it results in the same unknown attribute error. Either solution would work for me (though the first would address the primary problem). Any ideas why the before_save isn't being called when the record is created through an association?
You should probably store the PII data and the PHI data in separate DBs. Encrypt the PII data (including any associations to a provider or provider location) and then hash out all of the PHI data to make it easier. As long as there are not direct associations between the two, it would be acceptable to not have the PHI data encrypted as it's anonymized.
Plan A
Don't set patient_id to nil in encrypt_patient_id; that column no longer exists, and the problem could go away once you stop writing to it.
Also, ending a callback with nil or false will halt the callback chain, so put an explicit true at the end of the method.
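A sketch of Plan A, keeping the callback from the question but dropping the write to the removed column and ending with an explicit true (this assumes attr_encrypted still provides the virtual patient_id accessor):

before_save :encrypt_patient_id

private

def encrypt_patient_id
  self.encrypted_patient_id = PatientRecommendation.encrypt(:patient_id, self.patient_id)
  true # returning nil or false here would halt the callback chain
end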
Plan B, rethink your options
There are more options - from database-level transparent encryption (which formally encrypts the data on disk), to encrypted filesystems for storing certain tablespaces, to flat out encryption of data in the columns.
Encrypting the join columns sounds like a road to unhappiness, for reasons ranging from reporting issues to potentially severe performance issues whenever a join is necessary.
The trouble you're currently experiencing with the seed looks like the first bump on what promises to be a bad road: ActiveRecord seems to be confused about how to handle your association; it tries to set patient_id on initialize and breaks.
The overhead of encrypting restricted data might not be as high as you think. I'm not sure how things go for HIPAA, but for PCI you're not exactly encouraged to render the protected data on screen, so encryption incurs only a small overhead because it happens relatively rarely (business need to know, etc.).
Also, memory is probably considered 'not at rest and not in transit', so you could in theory cache some of the clear values for limited periods of time and save on the decryption overhead.
Basically, encrypting the data might not be that bad, while encrypting keys in the database might be worse than you think.
I suggest we talk directly; I'm doing PCI DSS compliance work and this topic interests me.
Option: 1-way hashes for primary/foreign keys
PatientRecommendation would store a hash of patient_id (call it patient_hash), and Patient would be capable of generating the same patient_hash from its id. I'd suggest storing patient_hash in both tables: for Patient it would be the key to join on, and for PatientRecommendation it would be the foreign key. You then define the Rails relation in these terms, and Rails will no longer be confused about your association scheme:
has_many :patient_recommendations, primary_key: :patient_hash, foreign_key: :patient_hash
The result is cryptographically more robust and easy for the database to handle.
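A rough sketch of that layout, assuming a patient_hash column on both tables; the PatientHashing module and the PATIENT_HASH_PEPPER secret are illustrative names, not part of any gem:

require "digest"

# Hypothetical helper: both models derive the join key from the same secret.
module PatientHashing
  PEPPER = ENV.fetch("PATIENT_HASH_PEPPER")

  def self.compute(patient_id)
    Digest::SHA256.hexdigest("#{PEPPER}-#{patient_id}")
  end
end

class Patient < ActiveRecord::Base
  has_many :patient_recommendations,
           primary_key: :patient_hash, foreign_key: :patient_hash

  # The id only exists after the row is inserted, so fill in the hash afterwards.
  after_create { update_column(:patient_hash, PatientHashing.compute(id)) }
end

class PatientRecommendation < ActiveRecord::Base
  belongs_to :patient,
             primary_key: :patient_hash, foreign_key: :patient_hash
end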
If you're adamant about not storing the patient_hash in Patient, you could use a plain SQL statement to do the join - less convenient but workable - something along the lines of this pseudo-SQL:
JOIN ON generate_hash(patient.id) = patient_recommendations.patient_hash
Oracle, for example, has an option to create function-based indexes (think create index on generate_hash(patient.id)), so this approach could be pretty efficient depending on your choice of database.
However, playing with join keys will complicate your life a lot, even with these measures.
I'll expand on this post later on with additional options

grails when to use enum vs database lookup table

I'm getting towards the end of my application and have 25 lookup tables. Some of my domain classes have 15 table references as properties (one-to-one). Now I have just remembered/learned I can use enums instead. I'm sure I can refactor a bunch of those lookup tables to be enum classes. However, what are the best practices for using an enum vs. a lookup table? A few things about my application scenario that may assist in the answer:
There is only one developer (me)
This is a web based application
It won't be a heavily used application
When the view for the domain class with currently 15 lookups is rendered, all of those lookups will need to be loaded to display the data in the view. (15 joins)
The lookups will be cached in memory.
The lookup relationships to the domain class are all one-to-one
The data in the lookups will rarely change
Some of the values in the lookup tables are wordy, example: "Taking care of animals"
Some of the lookups have many records. Example: one lookup is U.S. states. Another is a person's occupation.
I think the most obvious answer is to use enums for things that don't change - ever - like days of the week or planets in the Solar System. OK, so sometimes things you thought wouldn't change actually do change, but that's OK - a quick update and you are good to go for a few more years.
But when you are working with data that will certainly change over time and often enough, then by all means store that in the database. A change to a table will not require a code change and redeployment. Additionally it would be nice to add an admin interface to these - and scaffolded screens should be quick and easy enough.
This is just my opinion. I don't think there is a "right" answer here.
If it gets stored in a database, I use a lookup table. It is more flexible and, depending on a few things, should be more performant. While you do have to do join queries, you can lessen that impact by enabling the 2nd level cache with cache: true in your mappings.
One thing to remember when using an enum instead of a lookup is that there is no longer a foreign key to a separate table, so you can't guarantee at the database level that the values are in the list you desire. In addition, if you need to query based on the enum, it is typically now a string comparison instead of a number comparison.

Rails Minimizing Database Load

I am relatively new to Rails. I understand that Rails makes it very easy to work with your database values, but I am a little in the dark about which approaches are easier on the database and which are not.
Here is a case in point. I have a model Appointment which belongs_to :user. In my code I can sometimes say process_user @appointment.user. When I write that, does it run a separate SELECT query against the database to retrieve that user? Is it more efficient to write process_user @appointment.user_id, where user_id is an attribute on the appointment, and then use the user_id value for my evaluation tasks, as long as I don't need the whole user object @appointment.user?
Frankly, from a peace-of-mind point of view, I just love being able to use process_user @appointment.user because it reads better, looks nicer and works better when preparing logic. Is it an efficient way performance-wise?
You are perfectly fine using code like process_user @appointment.user, as ActiveRecord tries its best to minimize the number of database queries. Of course it does not handle all situations perfectly, but your example is a very basic one. There would probably be no immediate database query; the object would only be loaded when its attributes are accessed.
If you notice performance problems in a running large-scale application and you can track them down to ActiveRecord using profiling, it is probably time to optimize. Trying to pre-optimize from the very beginning would be against Rails' philosophy and will only result in ugly (and possibly even slower) code. Remember that the real performance bottlenecks are often in places where you would never expect them.
EDIT: As Winfield pointed out, optimizing the number of queries usually does not mean managing foreign keys or similar internals yourself. There are quite a number of flags and options on the DB access methods that let you control how your database is queried.
You can eagerly load your associated users with your Appointment models:
Appointment.all(:include => :user)
...which will join in the users or do a separate lookup for all the associated users in a single query.
This will then load the user association in advance (eagerly) so the user attribute is already populated with the object when you reference it, instead of having to stop and execute a separate query to look it up one by one (N+1 queries).
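For reference, on Rails 4+ the same eager load is usually written with includes; a small sketch (process_user is just the hypothetical helper from the question):

# Loads the appointments plus all their users up front (two queries in total),
# instead of one extra user lookup per appointment (N+1 queries).
Appointment.includes(:user).find_each do |appointment|
  process_user(appointment.user) # no extra query; the user is already loaded
end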

To normalize or not to normalize user_ids

In my Rails application, I have a variety of database tables that contain user data. Some of these tables have a lot of rows (as many as 500,000 rows per user in some cases) and are queried frequently. Whenever I query any table for anything, the user_id of the current user is somewhere in the query - either directly, if the table has a direct relation with the user, or through a join, if they are related through some other tables.
Should I denormalize the user_id and include it in every table, for faster performance?
Here's one example:
Address belongs to user, and has a user_id
Envelope belongs to user, and has a user_id
AddressesEnvelopes joins an Address and an Envelope, so it has envelope_id and address_id -- it doesn't have user_id, but could get to it through either the envelope or the address (which must belong to the same user).
One common expensive query is to select all the AddressesEnvelopes for a particular user, which I could accomplish by joining with either Address or Envelope, even though I don't need anything from those tables. Or I could just duplicate the user id in this table.
Here's a different scenario:
Letter belongs to user, and has a user_id
Recepient belongs to Letter, and has a letter_id
RecepientOption belongs to Recepient, and has a recepient_id
Would it make sense to duplicate the user_id in both Recepient and RecepientOption, even though I could always get to it by going up through the associations, through Letter?
Some notes:
There are never any objects that are shared between users. An entire hierarchy of related objects always belongs to the same user.
The user owner of objects never changes.
Database performance is important because it's a data intensive application. There are many queries and many tables.
So should I include user_id in every table so I can use it when creating indexes? Or would that be bad design?
I'd like to point out that it isn't necessary to denormalize if you are willing to work with composite primary keys. A sample for the AddressesEnvelopes case:
user(#user_id)
address(#user_id, #address_num)
envelope(#user_id, #envelope_num)
address_envelope(#user_id, #address_num, #envelope_num)
(the # indicates a primary key column)
I am not a fan of this design if I can avoid it, but considering the fact that you say all these objects are tied to a user, this type of design would make it relatively simple to partition your data (either logically, putting ranges of users in separate tables, or physically, using multiple databases or even machines).
Another thing that would make sense with this type of design is using clustered indexes (in MySQL, the primary key of an InnoDB table is a clustered index). If you ensure that user_id is always the first column in your index, then for each table all data for one user is stored close together on disk. This is great when you always query by user_id, but it can hurt performance if you query by another object (in which case duplication like you suggested may be a better solution).
At any rate, before you change the design, first make sure your schema is already optimized, and you have proper indexes on your foreign key columns. If performance really is paramount, you should simply try several solutions and do benchmarks.
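As a sketch of that "indexes first" advice in Rails migration terms (table and column names follow the examples above; the commented composite index is only relevant if you do add the duplicated user_id):

class AddAddressesEnvelopesIndexes < ActiveRecord::Migration
  def change
    # Plain foreign key indexes for the join table.
    add_index :addresses_envelopes, :address_id
    add_index :addresses_envelopes, :envelope_id

    # If user_id were denormalized onto the join table, putting it first in a
    # composite index keeps each user's rows together in the index:
    # add_index :addresses_envelopes, [:user_id, :address_id]
  end
end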
As long as you
a) get a measurable performance improvement
and
b) know which parts of your database are real normalized data and which are redundant improvements
there is no reason not to do it!
Do you actually have a measured performance problem? 500,000 rows isn't a very large table. Your selects should be reasonably fast if they are not very complex and you have proper indexes on your columns.
I would first see if there are slow queries and try to optimize them with indexes. If that is not enough, only then would I look into denormalization.
The denormalizations you suggest seem reasonable if you can't achieve the required performance by other means. Just make sure that you keep the denormalized fields up to date.
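If you do go that route, one common way to keep a duplicated user_id from drifting is to copy it from the parent in a callback. A minimal sketch using the Recepient example from the question (keeping the question's spelling), assuming a user_id column has been added to recepients:

class Recepient < ActiveRecord::Base
  belongs_to :letter
  before_validation :copy_user_id_from_letter

  private

  # The duplicated column is only ever written from the parent letter,
  # so it cannot drift as long as letters never change owner.
  def copy_user_id_from_letter
    self.user_id = letter.user_id if letter
  end
end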

Rails Caching DB Queries and Best Practices

The DB load on my site is getting really high so it is time for me to cache common queries that are being called 1000s of times an hour where the results are not changing.
So for instance on my city model I do the following:
def self.fetch(id)
  Rails.cache.fetch("city_#{id}") { City.find(id) }
end

def after_save
  Rails.cache.delete("city_#{self.id}")
end

def after_destroy
  Rails.cache.delete("city_#{self.id}")
end
So now when I call City.fetch(1), the first time I hit the DB but the next 1000 times I get the result from memory. Great. But most of the calls to city are not City.fetch(1) but @user.city.name, where Rails does not use fetch but queries the DB again... which makes sense, but is not exactly what I want it to do.
I can do City.fetch(@user.city_id) but that is ugly.
So my question to you guys: what are the smart people doing? What is the right way to do this?
With respect to the caching, a couple of minor points:
It's worth using a slash to separate object type and id, which is the Rails convention. Even better, ActiveRecord models provide the cache_key instance method, which returns a unique identifier built from the table name and id, e.g. "cities/13".
One minor correction to your after_save filter: since you have the data on hand, you might as well write it back to the cache instead of deleting it. That saves you a trip to the database ;)
def after_save
  Rails.cache.write(cache_key, self)
end
As to the root of the question, if you're continuously pulling @user.city.name, there are two real choices:
Denormalize the user's city name to the user row, i.e. @user.city_name (keep the city_id foreign key). This value should be written at save time.
-or-
Implement your User.fetch method to eager load the city. Only do this if the contents of the city row never change (i.e. name etc.), otherwise you can potentially open up a can of worms with respect to cache invalidation.
Personal opinion:
Implement basic id based fetch methods (or use a plugin) to integrate with memcached, and denormalize the city name to the user's row.
I'm personally not a huge fan of cached-model-style plugins; I've never seen one that saved a significant amount of development time and that I didn't grow out of in a hurry.
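A minimal sketch of that denormalization (assuming a city_name string column is added to users; City.fetch is the cached finder defined in the question):

class User < ActiveRecord::Base
  belongs_to :city
  before_save :copy_city_name

  private

  # Copy the (rarely changing) city name onto the user row, so rendering
  # user.city_name never has to touch the cities table or cache at all.
  def copy_city_name
    self.city_name = City.fetch(city_id).name if city_id
  end
end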
If you're getting way too many database queries it's definitely worth checking out eager loading (through :include) if you haven't already. That should be the first step for reducing the quantity of database queries.
If you need to speed up SQL queries on data that doesn't change much over time, then you can use materialized views.
A matview stores the results of a query into a table-like structure of its own, from which the data can be queried. It is not possible to add or delete rows, but the rest of the time it behaves just like an actual table. Queries are faster, and the matview itself can be indexed.
At the time of this writing, matviews are natively available in Oracle DB, PostgreSQL, Sybase, IBM DB2, and Microsoft SQL Server. MySQL doesn't provide native support for matviews, unfortunately, but there are open source alternatives to it.
Here are some good articles on how to use matviews in Rails:
sitepoint.com/speed-up-with-materialized-views-on-postgresql-and-rails
hashrocket.com/materialized-view-strategies-using-postgresql
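For a rough idea of how this can look in a Rails app on PostgreSQL (names here are illustrative, not taken from the articles): a migration creates the view with raw SQL, and a read-only model sits on top of it.

class CreatePopularCitiesMatview < ActiveRecord::Migration
  def up
    execute <<-SQL
      CREATE MATERIALIZED VIEW popular_cities AS
        SELECT cities.*, COUNT(users.id) AS users_count
        FROM cities
        JOIN users ON users.city_id = cities.id
        GROUP BY cities.id;
    SQL
  end

  def down
    execute "DROP MATERIALIZED VIEW popular_cities;"
  end
end

class PopularCity < ActiveRecord::Base
  self.table_name = "popular_cities"

  def readonly?
    true
  end

  # Matviews are a snapshot; refresh when the underlying data changes.
  def self.refresh
    connection.execute("REFRESH MATERIALIZED VIEW popular_cities;")
  end
end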
I would go ahead and take a look at Memoization, which is now in Rails 2.2.
"Memoization is a pattern of
initializing a method once and then
stashing its value away for repeat
use."
There was a great Railscast episode on it recently that should get you up and running nicely.
Quick code sample from the Railscast:
class Product < ActiveRecord::Base
  extend ActiveSupport::Memoizable
  belongs_to :category

  def filesize(num = 1)
    # some expensive operation
    sleep 2
    12345789 * num
  end
  memoize :filesize
end
More on Memoization
Check out cached_model
