Has Many Through Relationship in MongoDB with ONE query - ruby-on-rails

I want to create the following two pages, using ONE query each:
A list of all courses for a particular student
A list of all students for a particular course
I would like to recreate the following with MongoDB (or another NoSQL solution):
class Student < ActiveRecord::Base
  has_many :assignments
  has_many :courses, :through => :assignments
end

class Course < ActiveRecord::Base
  has_many :assignments
  has_many :students, :through => :assignments
end

class Assignment < ActiveRecord::Base
  belongs_to :course
  belongs_to :student
end
With a relational database, I can accomplish this by eager loading the association.
However, in Mongo, when I create my schema, I can either choose to embed or link the student data in the course, or embed or link the course data in the student.
If I choose to embed the student data in the course document, I can easily pull all students for a particular course (one document). However, if I want to find all courses a particular student is taking, I will have to query MongoDB N times (where N is the number of courses a student is taking)!
In a web application backed by a relational database, making N calls to the database to load a page isn't an option. It needs to be done in a constant number of queries. Is this not the case with MongoDB? Is there a way to structure my documents such that I can load the two pages above with ONE query?

If I choose to embed the student data in the course document, I
can easily pull all students for a particular course (one document).
However, if I want to find all courses a particular student is taking,
I will have to query MongoDB N times (where N is the number of courses
a student is taking)!
If you keep your information highly normalized, that is correct. There are other possibilities, though, such as maintaining an array of current course IDs in the student record rather than relying on relationship queries.
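The array-of-course-IDs idea can be sketched in plain Ruby. This is only an illustration: the arrays below are in-memory stand-ins for two MongoDB collections, and the field name course_ids, the sample data, and both helper methods are assumptions, not driver API.

```ruby
# In-memory stand-ins for two MongoDB collections (hypothetical data).
STUDENTS = [
  { _id: 1, name: "Ada",   course_ids: [10, 11] },
  { _id: 2, name: "Linus", course_ids: [11] }
].freeze

COURSES = [
  { _id: 10, title: "Algebra" },
  { _id: 11, title: "Databases" },
  { _id: 12, title: "Physics" }
].freeze

# One lookup: every student whose course_ids array contains the course id.
# Against a real database this would be a single query on the students collection.
def students_for_course(course_id)
  STUDENTS.select { |s| s[:course_ids].include?(course_id) }
end

# One lookup once the student document is in hand: fetch every course whose
# _id appears in the student's array, instead of one query per course.
def courses_for_student(student)
  COURSES.select { |c| student[:course_ids].include?(c[:_id]) }
end

ada = STUDENTS.find { |s| s[:_id] == 1 }
courses_for_student(ada).map { |c| c[:title] } # Algebra and Databases
students_for_course(11).map { |s| s[:name] }   # Ada and Linus
```

Note that listing a student's courses still costs one fetch for the student plus one for the courses, so the number of queries is constant rather than N.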
In a web application backed by a relational database, making N calls
to the database to load a page isn't an option. It needs to be done
in a constant number of queries. Is this not the case with MongoDB?
Is there a way to structure my documents such that I can load the two
pages above with ONE query?
MongoDB is not a relational database, and intentionally does not support joins. The MongoDB data modelling approach is more akin to a data warehouse, where some data is denormalized or embedded in related data to remove the need for joins.
There is a client driver notion of relationships using the convention of Database References (DBRefs) but there is no server support for hydrating references and these will result in additional queries.
Multiple queries aren't necessarily bad, but the general design goal would be to model your data to most effectively balance speed of inserts, updates, and reads based on your application's common use cases.
For more information on data modelling, see:
MongoDB Schema Design (includes some best practices and related links)
Designing MongoDB Schemas with Embedded, Non-Embedded and Bucket Structures

Related

Is it a good idea to serialize immutable data from an association?

Let's say we have a collection of products, each with their own specifics e.g. price.
We want to issue invoices that contain said products. Using a direct association from Invoice to Product via :has_many is a no-go: invoices must be immutable, and since products may change, a later change would alter the invoice's price, concept, etc.
I first thought of having an intermediate model like InvoiceProduct that would be associated to the Invoice and created from a Product. Each InvoiceProduct would be unique to its parent invoice and immutable. This option would increase the db size significantly as more invoices get issued though, so I think it is not a good option.
I'm now considering adding a serialized field to the invoice model with all the products information that are associated to it, a hash of the collection of items the invoice contains. This way we can have them in an immutable manner even if the product gets modified in the future.
I'm not sure of possible mid or long term downsides to this approach, though. Would like to hear your thoughts about it.
Also, if there's some more obvious approach that I might have overlooked I'd love to hear about it too.
Cheers
In my experience, the main downside of a serialized field approach vs the InvoiceProducts approach described above is decreased flexibility in terms of how you can use your invoice data going forward.
In our case, we have Orders and OrderItems tables in our database and use this data to generate sales analytics reports as well as customer Invoices.
Querying the OrderItem data to generate the sales reports we need is much faster and easier with this approach than it would be if the same data was stored as serialized data in the db.
No.
Serialized columns have no place in a modern application. They are an overused dirty hack from the days before native JSON/JSONB columns were widespread and have only downsides. The only exception to this rule is when you're using application-side encryption.
JSON/JSONB columns can be used for a limited number of tasks where the data defies being defined by a fixed schema or if you're just storing raw JSON responses, but it should not be how you define your schema out of convenience, because you're just shooting yourself in the foot. It's a special tool for special jobs.
The better alternative is to actually use good relational database design and store the price at the time of sale and everything else in a separate table:
class Order < ApplicationRecord
  has_many :line_items
end

# rails g model line_item order:belongs_to product:belongs_to units:decimal unit_price:decimal subtotal:decimal
# The line item model is responsible for each item of an order
# and records the price at the time of order and any discounts applied to that line
class LineItem < ApplicationRecord
  belongs_to :order
  belongs_to :product
end

class Product < ApplicationRecord
  has_many :line_items
end
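The idea of recording the price at the time of sale can be illustrated without Rails. The sketch below uses plain Ruby Structs as stand-ins for the Product and LineItem models; build_line_item is a hypothetical helper, and the sample prices are made up.

```ruby
require "bigdecimal"

# Plain-Ruby stand-ins for the Product and LineItem models (no database involved).
Product  = Struct.new(:name, :price)
LineItem = Struct.new(:product_name, :units, :unit_price) do
  def subtotal
    units * unit_price
  end
end

# Copy the product's current price into the line item when the order is placed,
# so the line keeps its own record of what was charged.
def build_line_item(product, units)
  LineItem.new(product.name, units, product.price)
end

widget = Product.new("Widget", BigDecimal("9.99"))
item   = build_line_item(widget, 3)

# A later price change on the product does not touch the existing line item.
widget.price = BigDecimal("12.50")

item.unit_price == BigDecimal("9.99")  # the line still carries the sale price
item.subtotal == BigDecimal("29.97")   # 3 units at the original price
```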
A serialized column is not immutable in any way. It's actually more prone to denormalization and corruption, as there are no database-side constraints to ensure its correctness.
Tables can actually be made immutable in many databases by using triggers.
Advantages:
No violation of 1NF.
A normalized fixed data schema to work with - constraints ensure the validity of the data on the database level.
Joins are an extremely powerful tool and not as expensive as you might think.
You can actually access and make sense of the data outside of the application if needed.
DECIMAL data types. JSON has only a single number type, which most parsers read as an IEEE 754 floating point value.
You have an actual model and associations instead of having to deal with raw hashes.
You can query the data in sane queries.
You can generate aggregates on the database level and use tools like materialized views.
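The point about DECIMAL versus floating point is easy to demonstrate in Ruby, where BigDecimal plays the role of a decimal column and Float is what a number typically becomes after a round trip through JSON. This is a small illustration, not a claim about any particular database driver.

```ruby
require "bigdecimal"

float_sum   = 0.1 + 0.2
decimal_sum = BigDecimal("0.1") + BigDecimal("0.2")

float_sum == 0.3                 # false: 0.1 and 0.2 have no exact binary representation
decimal_sum == BigDecimal("0.3") # true: decimal arithmetic is exact for these values
```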

Postgres: Many-to-many vs. multiple columns vs. array column

I need help designing complex user permissions within a Postgres database. In my Rails app, each user will be able to access a unique set of features. In other words, there are no pre-defined "roles" that determine which features a user can access.
In almost every controller/view, the app will check whether or not the current user has access to different features. Ideally, the app will provide ~100 different features and will support 500k+ users.
At the moment, I am considering three different options (but welcome alternatives!) and would like to know which option offers the best performance. Thank you in advance for any help/suggestions.
Option 1: Many-to-many relationship
By constructing a many-to-many relationship between the User table and a Feature table, the app could check whether a user has access to a given feature by querying the join table.
E.g., if there is a record in the join table that connects user1 and feature1, then user1 has access to feature1.
Option 2: Multiple columns
The app could represent each feature as a boolean column on the User table. This would avoid querying multiple tables to check permissions.
E.g., if user1.has_feature1 is true, then user1 has access to feature1.
Option 3: Array column
The app could store features as strings in a (GIN-indexed?) array column on the User table. Then, to check whether a user has access to a feature, it would search the array column for the given feature.
E.g., if user1.features.include? 'feature1' is true, then user1 has access to feature1.
Many-to-many relationships are the only viable option here. There is a reason why they call it a relational database.
Why?
Joins are actually not that expensive.
Multiple columns - The number of columns in your tables will be ludicrous and it will be true developer hell. As each feature adds a migration, the amount of churn in your codebase will be silly.
Array column - Using an array column may seem like an attractive alternative until you realize that it's actually just a marginal improvement over stuffing things into a comma-separated string. You have no referential integrity and none of the code organization benefits that come from having models that represent the entities in your application.
Oh, and every time a feature is yanked you have to update every one of those 500k+ users, versus just using CASCADE.
class Feature < ApplicationRecord
  has_many :user_features
  has_many :users, through: :user_features
end

class UserFeature < ApplicationRecord
  belongs_to :user
  belongs_to :feature
end

class User < ApplicationRecord
  has_many :user_features
  has_many :features, through: :user_features

  # True if a join row links this user to a feature with the given name.
  def has_feature?(name)
    features.exists?(name: name)
  end
end
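To make the join-table reasoning concrete without a database, here is a plain-Ruby sketch. FeatureGrants is a hypothetical in-memory stand-in for the user_features table; the method names are assumptions chosen for illustration.

```ruby
require "set"

# Hypothetical in-memory stand-in for the user_features join table:
# each row is a [user_id, feature_name] pair.
class FeatureGrants
  def initialize
    @rows = Set.new
  end

  def grant(user_id, feature)
    @rows.add([user_id, feature])
  end

  # Similar in spirit to an EXISTS query against the join table.
  def has_feature?(user_id, feature)
    @rows.include?([user_id, feature])
  end

  # Retiring a feature only removes its join rows; no user record is rewritten.
  def retire(feature)
    @rows.delete_if { |row| row.last == feature }
  end
end

grants = FeatureGrants.new
grants.grant(1, "reports")
grants.grant(1, "export")
grants.has_feature?(1, "reports") # true
grants.retire("reports")
grants.has_feature?(1, "reports") # false
```

With an array column on the user, the retire step would instead have to rewrite every user row that mentions the feature.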

Database record containing multiple entries in one column

I am working on a web app written in rails. It is currently running on heroku with a postgres database.
I am supposed to add a feature where users may enter up to three codes for each one of the user's students. The codes themselves are irrelevant, they are simply strings that will be entered into the database.
This brings me to my dilemma. I am unsure of how to best store the codes in terms of their relationship to the student table. My original thought was to use the rails method serialize to store up to three codes in an array, but I have read that more often than not, storing data in an array in a database is not what you want to do.
Should I create a new table "codes" and set up a has_many relationship with the "students" table? Or is there a preferable way to set up this relationship?
Given your situation, the most reasonable approach is to have a Code model and then set up a has_many association with the Student model:
student has_many codes and
code belongs_to student.
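The "up to three codes" rule from the question still has to be enforced somewhere. A minimal plain-Ruby sketch of that limit follows; in a Rails app this would more likely be a model validation, and the class, MAX_CODES, and add_code here are hypothetical names used only for illustration.

```ruby
# Plain-Ruby illustration of a student owning at most three codes.
class Student
  MAX_CODES = 3

  attr_reader :codes

  def initialize
    @codes = []
  end

  # Adds a code, refusing once the limit has been reached.
  def add_code(value)
    if @codes.size >= MAX_CODES
      raise ArgumentError, "a student may have at most #{MAX_CODES} codes"
    end
    @codes << value
  end
end

student = Student.new
%w[A1 B2 C3].each { |code| student.add_code(code) }
student.codes.size # 3; a fourth add_code call raises ArgumentError
```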

Real data in Ruby on Rails join table

In our RoR app, we have two core models and a join table:
class Workflow < ActiveRecord::Base
  has_many :workflow_datafiles
  has_many :datafiles, :through => :workflow_datafiles
end

class Datafile < ActiveRecord::Base
  has_many :workflow_datafiles
  has_many :workflows, :through => :workflow_datafiles
end

class WorkflowDatafile < ActiveRecord::Base
  belongs_to :datafile
  belongs_to :workflow
end
The join table has its own model code, and contains an actual data element that describes the nature of the relationship between a given workflow and datafile. I've written code to import data from XML files, and I need to put some data into the join rows after I associate the imported files and workflows. The problem is that, even though an unsaved workflow object has a datafiles array, and an unsaved datafile object has a workflows array, neither has a workflow_datafile array. I think they show up after I save (I should verify this, I guess).
So, when I'm processing the XML file, instantiating workflow and datafile objects, and adding them to each other's collections, I don't have a good way to access their join objects.
I see two options:
I could save the workflow and datafile objects at this point to force Rails to write them and their join rows out to the database. Presumably, this would populate the .workflow_datafiles arrays, or at least let me query and update the rows I need directly in the database. The problem with this is that both of these objects are part of a larger XML structure, and right now the code does all of its validations once everything is loaded, then saves it all at once. This would short-circuit that logic, and leave open the possibility that I create workflows and/or datafiles (and join rows) whose larger projects aren't in the database.
I could put code at a higher level of the XML processing to check the loaded data for workflows that have datafiles associated with them after saving the whole thing, and do some extra processing on them at that point to populate the join table columns. This is a little more of a structural change to my code, and revisits structures I've already saved for post-processing, but at least it doesn't risk creating orphaned rows in the tables.
I'm leaning toward option 2, but I'm really hoping that someone has a better option, or at least some wisdom about the situation as a whole.

A database design for variable column names

I have a situation that involves Companies, Projects, and Employees who write Reports on Projects.
A Company owns many projects, many reports, and many employees.
One report is written by one employee for one of the company's projects.
Companies each want different things in a report. Let's say one company wants to know about project performance and speed, while another wants to know about cost-effectiveness. There are 5-15 criteria, set differently by each company, which ALL apply to all of that company's project reports.
I was thinking about different ways to do this, but my current stalemate is this:
To company table, add text field criteria, which contains an array of the criteria desired in order.
In the report table, have a company_id and columns criterion1, criterion2, etc.
I am completely aware that this is typically considered horrible database design - inelegant and inflexible. So, I need your help! How can I build this better?
Conclusion
I decided to go with the serialized option in my case, for these reasons:
My requirements for the criteria are simple - no searching or sorting will be required of the reports once they are submitted by each employee.
I wanted to minimize database load - where these are going to be implemented, there is already a large page with overhead.
I want to avoid complicating my database structure for what I believe is a relatively simple need.
CouchDB and Mongo are not currently in my repertoire so I'll save them for a more needy day.
This would be a great opportunity to use NoSQL! Seems like the textbook use-case to me. So head over to CouchDB or Mongo and start hacking.
With conventional DBs you are slightly caught in the problem of how much to normalize your data:
A sort of "good" way (meaning very normalized) would look something like this:
class Company < ActiveRecord::Base
  has_many :reports
  has_many :criteria
end

class Report < ActiveRecord::Base
  belongs_to :company
  has_many :criteria_values
  has_many :criteria, :through => :criteria_values
end

class Criteria < ActiveRecord::Base # should be Criterion but whatever
  belongs_to :company
  has_many :criteria_values
  # one attribute 'name' (or 'type' and you can mess with STI)
end

# Singular class name, so that has_many :criteria_values can find it
class CriteriaValue < ActiveRecord::Base
  belongs_to :report
  belongs_to :criteria
  # one attribute 'value'
end
This turns something that is very simple and fast in NoSQL into a triple or quadruple join in SQL, and you end up with many models that pretty much do nothing.
Another way is to denormalize:
class Company < ActiveRecord::Base
  has_many :reports
  serialize :criteria
end

class Report < ActiveRecord::Base
  belongs_to :company
  serialize :criteria_values

  def criteria
    self.company.criteria
  end
  # custom code here to validate that criteria_values correspond to criteria etc.
end
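The "custom code here to validate" placeholder could look something like the following plain-Ruby check. It assumes criteria is an array of names and criteria_values is a hash keyed by those names; both shapes and the method name criteria_value_errors are assumptions, since the answer does not specify them.

```ruby
# Returns a list of error messages; an empty list means the serialized
# values line up with the company's criteria.
def criteria_value_errors(criteria, criteria_values)
  missing = criteria - criteria_values.keys
  unknown = criteria_values.keys - criteria

  errors = []
  errors << "missing values for: #{missing.join(', ')}" unless missing.empty?
  errors << "unknown criteria: #{unknown.join(', ')}" unless unknown.empty?
  errors
end

company_criteria = ["performance", "speed"]

criteria_value_errors(company_criteria, { "performance" => 4, "speed" => 5 })
# => []
criteria_value_errors(company_criteria, { "performance" => 4, "cost" => 2 })
# => ["missing values for: speed", "unknown criteria: cost"]
```

In a Rails model, a check like this could be called from a custom validation that adds each message to the record's errors.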
Related to that, a rather clever way of serializing at least the criteria (and maybe the values, if they are all boolean) is to use bit fields. This gives you more or less easy migrations (hard to delete and modify, but easy to add) and searchability without any overhead.
A good plugin that implements this is Flag Shih Tzu which I've used on a few projects and could recommend.
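For readers unfamiliar with bit fields, here is a minimal hand-rolled sketch of the technique in plain Ruby. It does not use or imitate the Flag Shih Tzu API; the criteria names and helper methods are hypothetical.

```ruby
# Each criterion gets a fixed bit position. New criteria can be appended
# with the next free bit; reusing or reordering bits would corrupt old data.
CRITERIA_BITS = {
  performance: 1 << 0,
  speed: 1 << 1,
  cost_effectiveness: 1 << 2
}.freeze

# Pack a list of criterion names into a single integer.
def encode_criteria(names)
  names.reduce(0) { |mask, name| mask | CRITERIA_BITS.fetch(name) }
end

# Check one criterion with a bitwise AND.
def criterion_set?(mask, name)
  (mask & CRITERIA_BITS.fetch(name)) != 0
end

# Unpack the integer back into criterion names.
def decode_criteria(mask)
  CRITERIA_BITS.keys.select { |name| criterion_set?(mask, name) }
end

mask = encode_criteria([:performance, :cost_effectiveness]) # => 5
criterion_set?(mask, :speed)                                # => false
decode_criteria(mask)                                       # => [:performance, :cost_effectiveness]
```

Because the whole set lives in one integer column, a database can filter on it with a bitwise AND, which is where the searchability comes from.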
Variable columns (eg. crit1, crit2, etc.).
I'd strongly advise against it. You don't get much benefit (it's still not very searchable since you don't know in which column your info is) and it leads to maintainability nightmares. Imagine your db gets to a few million records and suddenly someone needs 16 criteria. What could have been a complete no-issue is suddenly a migration that adds a completely useless field to millions of records.
Another problem is that a lot of the ActiveRecord magic doesn't work with this. You'll have to figure out what crit1 means by yourself, and if you want to add validations on these fields, that adds a lot of pointless work.
So to summarize: Have a look at Mongo or CouchDB and if that seems impractical, go ahead and save your stuff serialized. If you need to do complex validation and don't care too much about DB load then normalize away and take option 1.
Well, when you say "To company table, add text field criteria, which contains an array of the criteria desired in order" that smells like the company table wants to be normalized: you might break out each criterion in one of 15 columns called "criterion1", ..., "criterion15" where any or all columns can default to null.
To me, you are on the right track with your report table. Each row in that table might represent one report; and might have corresponding columns "criterion1",...,"criterion15", as you say, where each cell says how well the company did on that column's criterion. There will be multiple reports per company, so you'll need a date (or report-number or similar) column in the report table. Then the date plus the company id can be a composite key; and the company id can be a non-unique index. As can the report date/number/some-identifier. And don't forget a column for the reporting-employee id.
Any and every criterion column in the report table can be null, meaning (maybe) that the employee did not report on this criterion; or that this criterion (column) did not apply in this report (row).
It seems like that would work fine. I don't see that you ever need to do a join. It looks perfectly straightforward, at least to these naive and ignorant eyes.
Create a criteria table that lists the criteria for each company (company 1 .. * criteria).
Then, create a report_criteria table (report 1 .. * report_criteria) that lists the criteria for that specific report based on the criteria table (criteria 1 .. * report_criteria).

Resources