Rails ActiveRecord - Uniqueness and Lookup on Array Attribute

Good morning,
I have a Rails model in which I’m currently serializing an array of information. Two things are important to me:
I want to be able to ensure that this is unique (i.e. can’t have two models with the same array)
I want to be able to search existing models for this array (in a type of find_or_create_by method).
This model describes a “portfolio” – i.e. a group of stocks or bonds. The array is the description of which securities are inside the portfolio, and in what weights. I also have a second model, which is a group of portfolios (let's call it a “Portcollection” to keep things simple). A collection has many portfolios, and a portfolio can be in many collections. In other words:
class Portfolio < ActiveRecord::Base
  serialize :weights
  has_and_belongs_to_many :portcollections
end

class Portcollection < ActiveRecord::Base
  has_and_belongs_to_many :portfolios
end
When I am generating a “portcollection” I need to build a bunch of portfolios, which I do programmatically (implementation not important). Building a portfolio is an expensive operation, so I’m trying to check for the existence of one first. I thought I could do this via find_or_create_by, but wasn’t having much luck. This is my current solution:
class Portcollection < ActiveRecord::Base
  before_save :build_portfolios

  def build_portfolios
    # ...
    proposed_weights = ...
    yml = proposed_weights.to_yaml
    if port = Portfolio.find_by_weights(yml)
      self.portfolios << port
    else
      self.portfolios << Portfolio.create!(:weights => proposed_weights)
    end
    # ...
  end
end
This does work, but it is quite slow. I have a feeling this is because I’m converting stuff to YAML each time it runs when I try to check for an existing portfolio (this is running probably millions of times), and I’m searching for a string, as opposed to an integer. I do have an index on this column though.
Is there a better way to do this? A few thoughts had crossed my mind:
Calculate an MD5 hash of the “weights” array, and save to a database column. I’ll still have to calculate this hash each time I want to search for an array, but I have a gut feeling this would be easier for the database to index & search?
Work on moving from has_and_belongs_to_many to has_many :through, and store the array information as database columns. That way I could try to sort out a database query that could check for the uniqueness, without any YAML or serialization…
i.e. something like:

class Portfolio < ActiveRecord::Base
  has_many :security_weights
  has_many :portcollections, :through => :security_weights
end

class Portcollection < ActiveRecord::Base
  has_many :security_weights
  has_many :portfolios, :through => :security_weights
end
SECURITY_WEIGHTS

id | portfolio_id | portcollection_id | weight_of_GOOG | weight_of_APPLE | ...
1  | 14           | 15                | 0.4            | 0.3             |
In case it is important, the “weights” array would look like this:
[ ['GOOG', 0.4], ['AAPL', 0.3], ['GE', 0.3] ]
Any help would be appreciated. Please keep in mind I'm quite an amateur - programming is just a hobby for me! Please excuse me if I'm doing anything really hacky or missing something obvious....
Thanks!
UPDATE 1
I've done some research into the Rails 3.2 "store" method, but that doesn't seem to be the answer either... It just stores objects as JSON, which gives me the same lack of searchability I have now.

I think storing a separate hash in its own column is the only way to do this efficiently. You are using serialization, or a key/value store, which by design is not easily searchable.
Just make sure you sort your values before hashing them, otherwise you could have the same content but differing hashes.
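For illustration, here is a minimal sketch of that idea, assuming an indexed string column named weights_digest (the column and helper names are hypothetical, not from the question):

require 'digest/md5'

class Portfolio < ActiveRecord::Base
  serialize :weights
  before_validation :compute_weights_digest
  validates_uniqueness_of :weights_digest  # back this with a unique index

  # Look up by digest first; only build the expensive portfolio when missing.
  def self.find_or_create_for_weights(weights)
    find_by_weights_digest(digest_for(weights)) ||
      create!(:weights => weights)
  end

  # Sort before hashing so element order doesn't change the digest.
  def self.digest_for(weights)
    Digest::MD5.hexdigest(weights.sort.to_yaml)
  end

  private

  def compute_weights_digest
    self.weights_digest = self.class.digest_for(weights)
  end
end

The database then compares short fixed-length strings against an index instead of full YAML blobs.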

Related

More efficient, rails way to check for any of three fields being unique?

So, I need to check three fields for uniqueness of an object before creating it (from a form), but I will create the object so long as any of the three fields is unique.
My first thought was to just pass the params from the controller to the model, and then run a query checking whether those three fields match > 0 existing documents. However, I've since learned that this is a dangerous approach, and it should not be used.
So I checked the docs, and based off of this snippet
Or even multiple scope parameters. For example, making sure that a teacher can only be on the schedule once per semester for a particular class.
class TeacherSchedule < ActiveRecord::Base
  validates_uniqueness_of :teacher_id, scope: [:semester_id, :class_id]
end
I thought I had found my answer, and implemented:
validates_uniqueness_of :link_to_event, :scope => [:name_of_event, :date_of_event]
which works! But, this dataset is going to get very large (not from this form alone, lol), and I'm under the impression that with this implementation, Rails will query for all records matching link_to_event, then all records matching name_of_event, and then all records matching date_of_event. So, my question(s) is:
A) Am I wrong about how rails will implement this? Is it going to be more efficient out of the box?
B) If this will not be efficient for a table with a couple million entries, is there a better (and still railsy) way to do this?
You can define a method that queries the records with all the fields that you want to be unique as a group:
validate :uniqueness_of_teacher_semester_and_class

def uniqueness_of_teacher_semester_and_class
  others = self.class.where(teacher_id: teacher_id, semester_id: semester_id, class_id: class_id)
  others = others.where('id != ?', id) if persisted?  # don't match ourselves on update
  errors.add :base, 'Record not unique.' if others.exists?
end
To answer your questions:
A) Am I wrong about how rails will implement this? Is it going to be more efficient out of the box?
I think Rails will query for a match on all 3 fields, and you should check the Mongo (or Rails) log to see for sure.
B) If this will not be efficient for a table with a couple million entries, is there a better (and still railsy) way to do this?
This is the Rails way. There are 2 things you can do to make it efficient:
You would need indexes on all 3 fields, or a compound index of the 3 fields. The compound index *might* be faster, but you can benchmark to find out.
You can add a new field with the 3 fields concatenated, and an index on it. But this will take up extra space and may not be faster than the compound index.
These days a couple million documents is not that much, but it depends on document size and hardware.
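If you are on a SQL database, the compound index can be added with a migration along these lines (the table and migration names are assumptions, not from the question):

class AddCompoundIndexToTeacherSchedules < ActiveRecord::Migration
  def change
    # A unique compound index speeds up the validation query and also
    # enforces uniqueness at the database level, closing the race window
    # that validates_uniqueness_of alone leaves open.
    add_index :teacher_schedules, [:teacher_id, :semester_id, :class_id], unique: true
  end
end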

Rails Data Modelling

In my company, we are trying to cache some data that we are querying from an API. We are using Rails. Two of my models are 'Query' and 'Response'. I want to create a one-to-many relationship between Query and Response, wherein, one query can have many responses.
I thought this is the right way to do it.
Query = [query]
Response = [query_id, response_detail_1, response_detail_2]
Then, in the Models, I did the following Data Associations:
class Query < ActiveRecord::Base
  has_many :responses
end

class Response < ActiveRecord::Base
  belongs_to :query
end
So, canonically, whenever I want to find all the responses for a given query, I would do:

query_id = Query.where(:query => "given query").first.id
Response.where(:query_id => query_id)
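(Given the has_many :responses association above, the equivalent lookup can also go through the association:)

Query.where(:query => "given query").first.responses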
But my boss made me use an Array column in the Query model, remove the Data Associations between the models and put the id of each response record in that array column in the Query model. So, now the Query model looks like
Query = [query_id, [response_id_1, response_id_2, response_id_3,...]]
I just want to know what are the merits and demerits of doing it both ways and which is the right way to do it.
If the relationship is really a one-to-many relationship, the "standard" approach is what you originally suggested, or using a junction table. You're losing out on referential integrity that you could get with a FK by using the array. Postgres almost had FK constraints on array columns, but from what I researched it looks like it's not currently in the roadmap:
http://blog.2ndquadrant.com/postgresql-9-3-development-array-element-foreign-keys/
You might get some performance advantages out of the array approach if you consider it like a denormalization/caching assist. See this answer for some info on that, but it still recommends using a junction table:
https://stackoverflow.com/a/17012344/4280232. This answer and the comments also offer some thoughts on the array performance vs the join performance:
https://stackoverflow.com/a/13840557/4280232
Another advantage of using the array is that arrays will preserve order, so if order is important you could get some benefits there:
https://stackoverflow.com/a/2489805/4280232
But even then, you could put the order directly on the responses table (assuming they're unique to each query) or you could put it on a join table.
So, in sum, you might get some performance advantages out of the array foreign keys, and they might help with ordering, but you won't be able to enforce FK constraints on them (as of the time of this writing). Unless there's a special situation going on here, it's probably better to stick with the "FK column on the child table" approach, as that is considerably more common.
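For reference, a sketch of what the array approach can look like on Postgres (this assumes Rails 4+ array-column support and a response_ids column name, neither of which is from the question):

# Migration: an integer array holding the child ids on the parent row.
add_column :queries, :response_ids, :integer, array: true, default: []

# Fetching the responses; note that no FK constraint protects these ids.
responses = Response.where(id: query.response_ids)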
Granted, that all applies mainly to SQL databases, which I notice now you didn't specify in your question. If you're using NoSQL there may be other conventions for this.

A database design for variable column names

I have a situation that involves Companies, Projects, and Employees who write Reports on Projects.
A Company owns many projects, many reports, and many employees.
One report is written by one employee for one of the company's projects.
Companies each want different things in a report. Let's say one company wants to know about project performance and speed, while another wants to know about cost-effectiveness. There are 5-15 criteria, set differently by each company, which ALL apply to all of that company's project reports.
I was thinking about different ways to do this, but my current stalemate is this:
To company table, add text field criteria, which contains an array of the criteria desired in order.
In the report table, have a company_id and columns criterion1, criterion2, etc.
I am completely aware that this is typically considered horrible database design - inelegant and inflexible. So, I need your help! How can I build this better?
Conclusion
I decided to go with the serialized option in my case, for these reasons:
My requirements for the criteria are simple - no searching or sorting will be required of the reports once they are submitted by each employee.
I wanted to minimize database load - where these are going to be implemented, there is already a large page with overhead.
I want to avoid complicating my database structure for what I believe is a relatively simple need.
CouchDB and Mongo are not currently in my repertoire so I'll save them for a more needy day.
This would be a great opportunity to use NoSQL! Seems like the textbook use-case to me. So head over to CouchDB or Mongo and start hacking.
With conventional DBs you are slightly caught in the problem of how much to normalize your data:
A sort of "good" way (meaning very normalized) would look something like this:
class Company < AR::Base
  has_many :reports
  has_many :criteria
end

class Report < AR::Base
  belongs_to :company
  has_many :criteria_values
  has_many :criteria, :through => :criteria_values
end

class Criteria < AR::Base # should be Criterion but whatever
  belongs_to :company
  has_many :criteria_values
  # one attribute 'name' (or 'type' and you can mess with STI)
end

class CriteriaValues < AR::Base
  belongs_to :report
  belongs_to :criteria
  # one attribute 'value'
end
This makes something very simple and fast in NoSQL a triple or quadruple join in SQL and you have many models that pretty much do nothing.
Another way is to denormalize:
class Company < AR::Base
  has_many :reports
  serialize :criteria
end

class Report < AR::Base
  belongs_to :company
  serialize :criteria_values

  def criteria
    self.company.criteria
  end

  # custom code here to validate that criteria_values correspond to criteria etc.
end
Related to that, a rather clever way of serializing at least the criteria (and maybe the values too, if they were all boolean) is to use bit fields. This basically gives you more or less easy migrations (deleting and modifying is hard, but adding is easy) and searchability without any overhead.
A good plugin that implements this is Flag Shih Tzu, which I've used on a few projects and can recommend.
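A minimal sketch of the bit-field idea with that plugin (the flag names are made up; the gem expects an integer column named flags by default):

class Report < ActiveRecord::Base
  include FlagShihTzu

  # Each criterion occupies one bit of a single integer column,
  # so adding another criterion later needs no new column.
  has_flags 1 => :performance,
            2 => :speed,
            3 => :cost_effectiveness
end

# The plugin generates boolean accessors and named scopes, e.g.:
#   report.performance = true
#   Report.performance  # reports with the performance bit set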
Variable columns (e.g. crit1, crit2, etc.).
I'd strongly advise against it. You don't get much benefit (it's still not very searchable, since you don't know which column your info is in) and it leads to maintainability nightmares. Imagine your db gets to a few million records and suddenly someone needs 16 criteria. What could have been a complete non-issue is suddenly a migration that adds a completely useless field to millions of records.
Another problem is that a lot of the ActiveRecord magic doesn't work with this: you'll have to figure out what crit1 means by yourself, and if you want to add validations on these fields, that adds a lot of pointless work.
So to summarize: Have a look at Mongo or CouchDB and if that seems impractical, go ahead and save your stuff serialized. If you need to do complex validation and don't care too much about DB load then normalize away and take option 1.
Well, when you say "To company table, add text field criteria, which contains an array of the criteria desired in order", that smells like the company table wants to be normalized: you might break out each criterion into one of 15 columns called "criterion1", ..., "criterion15", where any or all columns can default to null.
To me, you are on the right track with your report table. Each row in that table might represent one report, with corresponding columns "criterion1", ..., "criterion15", as you say, where each cell says how well the company did on that column's criterion. There will be multiple reports per company, so you'll need a date (or report-number or similar) column in the report table. The date plus the company id can then form a composite key, and the company id can be a non-unique index, as can the report date/number/other identifier. And don't forget a column for the reporting-employee id.
Any and every criterion column in the report table can be null, meaning (maybe) that the employee did not report on this criterion; or that this criterion (column) did not apply in this report (row).
It seems like that would work fine. I don't see that you ever need to do a join. It looks perfectly straightforward, at least to these naive and ignorant eyes.
Create a criteria table that lists the criteria for each company (company 1 .. * criteria).
Then, create a report_criteria table (report 1 .. * report_criteria) that lists the criteria for that specific report based on the criteria table (criteria 1 .. * report_criteria).
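A hypothetical sketch of those two tables (column names assumed):

criteria:        id | company_id | name | position
report_criteria: id | report_id | criterion_id | value

The position column covers the "criteria desired in order" requirement without serializing anything.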

Best way to handle multiple tables to replace one big table in Rails? (e.g. 'todo_items1', 'todo_items2', etc., instead of just 'todo_items')?

Update:
Originally, this post used Books as the example entity, with Books1, Books2, etc. being the separate tables. I think this was a bit confusing, so I've changed the example entity to "private todo_items created by a particular user."
This makes Horace and Ryan's original comments seem a bit off, and I apologize for that. Please know that their points were valid when it looked like I was dealing with books.
Hello,
I've decided to use multiple tables for an entity (e.g. todo_items1, todo_items2, todo_items3, etc.), instead of just one main table which could end up having a lot of rows (e.g. just todo_items). I'm doing this to try and to avoid a potential future performance drop that could come with having too many rows in one table.
With that, I'm looking for a good way to handle this in Rails, mainly by trying to avoid loading a bunch of unused associations for each User object. I'm guessing that others have done something similar, so there are probably a few good tips/recommendations out there.
(I know that I could use a partition for this, but, for now, I've decided to go the 'multiple tables' route.)
Each user has their todo_items placed into a specific table. The actual "todo items" table is chosen when the user is created, and all of their todo_items go into the same table. The data in their todo items collection is private, so when it comes time to process a user's todo_items, I'll only have to look at one table.
One thing I don't particularly want to have is a bunch of unused associations in the User class. Right now, it looks like I'd have to do the following:
class User < ActiveRecord::Base
  has_many :todo_items1
  has_many :todo_items2
  has_many :todo_items3
  has_many :todo_items4
  has_many :todo_items5
end

class TodoItems1 < ActiveRecord::Base
  belongs_to :user
end

class TodoItems2 < ActiveRecord::Base
  belongs_to :user
end

class TodoItems3 < ActiveRecord::Base
  belongs_to :user
end
The thing is, for each individual user, only one of the "todo items" tables would be usable/applicable/accessible since all of a user's todo_items are stored in the same table. This means only one of the associations would be in use at any time and all of the other has_many :todo_itemsX associations that were loaded would be a waste.
For example, with a user.id of 2, I'd only need TodoItems3.find_by_text('search_word'), but the way I'm thinking of setting this up, I'd still have access to TodoItems1, TodoItems2, TodoItems4 and TodoItems5.
I'm thinking that these "extra associations" adds extra overhead and makes each User object's size in memory much bigger than it has to be. Also, there's a bunch of stuff that Ruby/Rails is doing in the background which may cause other performance problems.
I'm also guessing that there could be some additional method call/lookup overhead for each User object, since it has to load all of those associations, which in turn creates all of those nice, dynamic model accessor methods like User.find_by_something.
I don't really know what Ruby/Rails does internally with all of those has_many associations though, so maybe it's not so bad. But right now I'm thinking that it's really wasteful, and that there may just be a better, more efficient way of doing this.
So, a few questions:
1) Is there some sort of special Ruby/Rails methodology that could be applied to this 'multiple tables to represent one entity' scheme? Are there any 'best practices' for this?
2) Is it really bad to have so many unused has_many associations for each object? Is there a better way to do this?
3) Does anyone have any advice on how to abstract the fact that there's multiple "todo items" tables behind a single TodoItems model/class? For example, so I can call TodoItems.find_by_text('search_phrase') instead of TodoItems3.find_by_text('search_phrase').
Thank you!
This is not the way to scale.
It would probably be better going with master-slave replication and proper indexing (besides primary key) on fields such as "title" and/or "author" if that's what you're going to be looking up books based on. Having it in n-tables, how are you going to know the best place to go looking for the book the user is after? Are you going to go looking through 4 tables?
I agree with Horace: "don't try to solve a performance issue before you have figures to prove it." I suggest, however, that you really look into adding indexes to your table if you want lookups to be fast. If they aren't fast, then tell us how they aren't fast and we will tell you how to make it go ZOOOOOM.
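A hedged example of what those indexes might look like (column names borrowed from elsewhere in the thread, not prescribed by this answer):

class AddIndexesToTodoItems < ActiveRecord::Migration
  def change
    add_index :todo_items, :user_id  # fast per-user lookups
    add_index :todo_items, :text     # fast find_by_text lookups
  end
end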

Rails model: "has many" *simple* attribute

Let's assume this model:
Movie
- Title: String
- Has many:
  - Alternative Title: String
My question is, how should I store the alternative title attribute? I am deciding between three approaches:
Separate AR model: probably overkill
CSV in a single DB column
Serialized array in a single DB column
The latter two seem logically equivalent. I am leaning towards the CSV approach. Can anyone give some advice on this? What would be the implications for speed and searchability?
If a movie can have many titles, it makes most sense to have a Title model and give the Movie model a has_many :titles relation, especially if you later on decide to add more metadata about titles. It may seem like overkill, but I think it will be the least hassle in the long run. Furthermore, I think that a movie's "main" title should be a Title object as well, perhaps with an is_main_title or similar attribute to distinguish it from the others.
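A minimal sketch of that suggestion (the is_main_title flag comes from the answer; the other column names are assumed):

class Movie < ActiveRecord::Base
  has_many :titles
end

class Title < ActiveRecord::Base
  belongs_to :movie
  # columns: movie_id, name (string), is_main_title (boolean)
end

# The "main" title is then just a scoped lookup:
# movie.titles.where(:is_main_title => true).first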
If most of the time you only use the primary title, I'd go with your CSV option.
If most of the time you use all the titles, I'd put all the titles (primary and secondary) inside a single CSV column (named "titles") and just take the first when the primary is needed (with a helper function).
Why?
Because it keeps things simple, and if the time comes, like Jordan said, that you need another attribute, you can always migrate to a separate model.
Until then, YAGNI.
I would also vote for a separate model; even though it seems like overkill, it will let you follow the Rails way most easily. However, if you choose not to reap the benefits of all the baked-in magic that comes with associations, then I would recommend YAML or JSON over CSV. CSV is quite simple, but Rails has baked-in support for YAML serialization, which would probably be the easiest solution. Check out the RDoc on #serialize. For the given example this basically amounts to:
class Movie < ActiveRecord::Base
  serialize :alternate_titles
end
With that, Rails would handle a lot of the drudgery for you and you'll have a nice array of alternate titles always available.
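For instance (hypothetical data):

movie = Movie.create!(:title => "Seven Samurai",
                      :alternate_titles => ["Shichinin no Samurai"])
movie.alternate_titles  # => ["Shichinin no Samurai"]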
