To normalize or not to normalize user_ids - ruby-on-rails

In my Rails application, I have a variety of database tables that contain user data. Some of these tables have a lot of rows (as many as 500,000 rows per user in some cases) and are queried frequently. Whenever I query any table for anything, the user_id of the current user is somewhere in the query - either directly, if the table has a direct relation with the user, or through a join, if they are related through some other tables.
Should I denormalize the user_id and include it in every table, for faster performance?
Here's one example:
Address belongs to user, and has a user_id
Envelope belongs to user, and has a user_id
AddressesEnvelopes joins an Address and an Envelope, so it has envelope_id and address_id -- it doesn't have user_id, but could get to it through either the envelope or the address (which must belong to the same user).
One common expensive query is to select all the AddressesEnvelopes for a particular user, which I could accomplish by joining with either Address or Envelope, even though I don't need anything from those tables. Or I could just duplicate the user id in this table.
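To make the trade-off concrete, here is roughly what the two queries would look like in ActiveRecord (assuming a model named AddressesEnvelope over the join table):

# current shape: reach the user through a join with addresses
AddressesEnvelope.joins(:address).where(addresses: { user_id: current_user.id })

# denormalized shape: filter on the duplicated column directly,
# backed by an index on addresses_envelopes.user_id
AddressesEnvelope.where(user_id: current_user.id)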
Here's a different scenario:
Letter belongs to user, and has a user_id
Recipient belongs to Letter, and has a letter_id
RecipientOption belongs to Recipient, and has a recipient_id
Would it make sense to duplicate the user_id in both Recipient and RecipientOption, even though I could always get to it by going up through the associations, through Letter?
Some notes:
There are never any objects that are shared between users. An entire hierarchy of related objects always belongs to the same user.
The user owner of objects never changes.
Database performance is important because it's a data intensive application. There are many queries and many tables.
So should I include user_id in every table so I can use it when creating indexes? Or would that be bad design?

I'd like to point out that it isn't necessary to denormalize if you are willing to work with composite primary keys. Here's a sample for the AddressesEnvelopes case:
user(
#user_id
)
address(
#user_id
, #address_num
)
envelope(
#user_id
, #envelope_num
)
address_envelope(
#user_id
, #address_num
, #envelope_num
)
(the # indicates a primary key column)
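Rails does not support composite primary keys natively; if you go this route, the composite_primary_keys gem is the usual workaround. A minimal sketch, with the API assumed from that gem's documentation (verify against the version you use):

# Gemfile: gem 'composite_primary_keys'
class AddressEnvelope < ActiveRecord::Base
  self.table_name = 'address_envelope'
  # user_id comes first, so each user's rows share a key prefix
  self.primary_keys = :user_id, :address_num, :envelope_num
end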
I am not a fan of this design if I can avoid it, but considering that you say all these objects are tied to a user, this type of design would make it relatively simple to partition your data (either logically, putting ranges of users in separate tables, or physically, using multiple databases or even machines).
Another thing that would make sense with this type of design is using clustered indexes (in MySQL, InnoDB tables are clustered on the primary key). If you ensure user_id is always the first column in your index, then for each table, all data for one user will be stored close together on disk. This is great when you always query by user_id, but it can hurt performance if you query by another object (in which case duplication like you suggested may be a better solution).
At any rate, before you change the design, first make sure your schema is already optimized, and you have proper indexes on your foreign key columns. If performance really is paramount, you should simply try several solutions and do benchmarks.
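For the AddressesEnvelopes example, the baseline indexes might look like this (a sketch; table and column names assumed from the question):

class AddIndexesToAddressesEnvelopes < ActiveRecord::Migration
  def change
    # index both foreign keys so joins to addresses and envelopes are cheap
    add_index :addresses_envelopes, :address_id
    add_index :addresses_envelopes, :envelope_id
  end
end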

As long as you a) get a measurable performance improvement, and b) know which parts of your database are real normalized data and which are redundant improvements, there is no reason not to do it!

Do you actually have a measured performance problem? 500,000 rows isn't a very large table. Your selects should be reasonably fast if they are not very complex and you have proper indexes on your columns.
I would first look for slow queries and try to optimize them with indexes. Only if that is not enough would I look into denormalization.
The denormalizations you suggest seem reasonable if you can't achieve the required performance by other means. Just make sure you keep the denormalized fields up to date.
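One way to keep a denormalized user_id up to date is to copy it from the parent in a callback. A sketch, using the Letter/Recipient models from the question:

class Recipient < ActiveRecord::Base
  belongs_to :letter
  belongs_to :user

  # copy the owner down from the parent, so the duplicated
  # column can never drift from the letter's user_id
  before_validation :copy_user_from_letter

  private

  def copy_user_from_letter
    self.user_id = letter.user_id if letter
  end
end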

Related

Join an ActiveRecord model to a table in another schema with no model

I need to join an ActiveRecord model in my Ruby on Rails app to another table in a different schema that has no model. I've searched for the answer, and found parts of it, but not a whole solution in one place, hence this question.
I have a Vehicle model with many millions of rows.
I have a table (reload_cars) in another schema (temp_cars) in the same database, with a few million records. This is an ad hoc table, to be used for one ad hoc data update, and will never be used again. There is no model associated with that table.
I initially was lazy and selected all the reload_cars records into an array (reload_vins) in one query, and then in a second query did something like:
`Vehicle.where(vin_status: :invalid).where('vin in (?)', reload_vins)`.
That's simplified a bit from the actual query, but demonstrates the join I need. In other queries, I need full sets of inner and outer joins between these tables. I also need to put various selection criteria on the model table and/or the non-model table in various steps.
That blunt approach worked fine in development, but did not scale up to the production database. I thought it would take a few minutes, which is plenty fast enough for a one-time operation. But, it timed out, particularly when looping through sets of records. Small tweaks did not help.
So, I need to do a legit join.
In retrospect, the answer seems pretty obvious. This query ran pretty much instantly, and gave the exact expected result, with various criteria on each table.
Here is one such query:
Vehicle.where(vin_status: :invalid)
  .joins("JOIN temp_cars.reload_cars tcar ON tcar.vin = vehicles.vin")
  .where("tcar.registration_id IS NOT NULL")

How to create an Order Model with a type field that dictates other fields

I'm building a Ruby on Rails app for a business and will be using an ActiveRecord database. My question is really about database architecture: the best way to organize all the different tables and models within my app. The app will hold a database of orders for an e-commerce business that sells products through two channels: a subscription service, where the business picks the products and sells them for a fixed monthly fee, and a traditional e-commerce channel, where customers pay for their products directly. So while all of this activity falls under the Order model, there are two types of orders: subscription orders and regular orders.
So initially I thought I would record all this activity in my orders table and include a type field that would indicate whether an order is a subscription order or a regular order. My issue is that there are a number of fields that are specific to each type. For instance, transaction_id, batch_id, and sub_id are all fields that would only be present if the order type is subscription, and conversely would be absent if the order type is regular.
My question is: would it be in my best interest to just create two separate tables, one for subscription orders and one for regular orders? Or is there a way for fields to appear only conditionally on the type field? I would hate to see so many nil values if the order type is a regular order.
Sorry this question isn't so much technical as it is about best practice and organization.
Thanks,
Sunny
What you've described is a pattern called Single Table Inheritance — aka, having one table store data for different types of objects with different behavior.
Generally, people will tell you not to do it, since it leads to a lot of empty fields in your database which will hurt performance long term. It also just looks gross.
You should probably instead store the data in separate tables. If you want to get fancy, you can try to implement Class Table Inheritance, in which there are actually separate but connected tables for each of the child classes. This isn't supported natively by ActiveRecord, though there are gems that try to fill the gap; I've never used any of them, so I can't give you a firm recommendation.
I would keep all of my orders in one table. You could create a second table for "subscription order information" that would contain only the columns transaction_id, batch_id and sub_id, plus a key linking it back to the main orders table. You would still want to include an order type column in the main table, though, to make debugging a little easier.
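In Rails terms, that second table could be modeled along these lines (a sketch; the SubscriptionDetail name is made up for illustration):

class Order < ActiveRecord::Base
  # only subscription orders will have one of these rows
  has_one :subscription_detail
end

class SubscriptionDetail < ActiveRecord::Base
  # columns: order_id, transaction_id, batch_id, sub_id
  belongs_to :order
end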
Assuming you're using Postgres, I might lean towards an Hstore for that.
Some reading:
http://www.devmynd.com/blog/2013-3-single-table-inheritance-hstore-lovely-combination
https://github.com/devmynd/hstore_accessor
Make an integer column called order_type.
In the model do:
SUBSCRIPTION = 0
ONLINE = 1
...
It'll query better than strings, and whenever you want to reference one you do Order::SUBSCRIPTION (see the enum sketch below for the newer built-in equivalent).
Make two or more other tables, each with a foreign key equal to the ID of the corresponding row in orders.
Now you can keep all shared data in the orders table, for easy querying, and all unique data in the other tables so you don't have bloated models.
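On newer Rails (4.1+), ActiveRecord's built-in enum gives you the same integer column plus named scopes and predicates for free. A sketch, assuming an integer order_type column:

class Order < ActiveRecord::Base
  # stores 0 or 1 in order_type; adds Order.subscription scope and order.online? predicate
  enum order_type: { subscription: 0, online: 1 }
end

Order.subscription   # all subscription orders
order.online?        # type check on a single record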

How we design Dynamo db with keep relation of two entity

Hi, I am new to DynamoDB and, to my knowledge, it's a non-relational DB, i.e. we can't join tables. My doubt is how to design the table structure. Please clarify with the following example.
I have a following tables
1) users - user_id, username, password, email, phone number, role
2) roles - id, name [i.e. admin, supervisor, etc.]
a) My first doubt is: is there any provision to auto-increment the user_id field?
b) Is setting user_id as the primary key the correct approach?
c) Is this the correct way to store user roles in DynamoDB, i.e. a roles table containing id and title, with the role id stored in the users table?
d) Is it possible to retrieve the data from both tables along with each user? I am using Rails 3 and the aws-sdk gem.
Any reply will be very helpful for me as a new DynamoDB user.
Typically with NoSQL-style databases you provide the unique identifier yourself, rather than having an auto-increment PK field do it for you. This usually means you would use a GUID as the key for each User record.
As for the user roles, there are many ways to accomplish this, and each has benefits and problems:
One simple way would be to add a "Role" attribute to the Users table and have one entry per role for that user. Then you could grab the User and you would have all the roles in one query. DynamoDB allows attributes to have multiple values, so one attribute can have one value per role.
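With the aws-sdk-dynamodb gem, that might look roughly like this (a sketch; the question's Rails 3 era aws-sdk v1 had a different interface, so treat the calls below as assumptions to verify):

require 'set'
require 'aws-sdk-dynamodb'

client = Aws::DynamoDB::Client.new(region: 'us-east-1')

# one item per user; roles stored as a string-set attribute
client.put_item(
  table_name: 'users',
  item: {
    'user_id'  => 'user-123',   # caller-supplied GUID in practice
    'username' => 'sunny',
    'roles'    => Set.new(%w[admin supervisor])
  }
)

# a single get returns the user together with all of their roles
resp = client.get_item(table_name: 'users', key: { 'user_id' => 'user-123' })
resp.item['roles']  # => #<Set: {"admin", "supervisor"}>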
If you need to be able to query users in a particular role (i.e. "Give me all the Users who are Supervisors"), then you will be doing a table scan in DynamoDB, which can be an expensive operation. But if your number of users is reasonably small, and the need for this kind of lookup is infrequent, this may still be acceptable for your application.
If you really do need this expensive type of lookup often, then you will need to create a new table, something like "RolesWithUsers", with one record per role, holding the userIds of the users in that role. For most applications I'd advise against doing something like this, because now you have two tables representing one fact: which role a particular user has. So every delete or update needs to be done in two places. Not impossible, but it takes more vigilance and testing to be sure your application doesn't end up with wrong data. The other disadvantage of this approach is that you need two queries to get the information, which may be more expensive than the table scan, again depending on the quantity of records.
Another option that makes sense for this specific use case would be to use SimpleDB. It has better querying capability (all attributes are indexed by default), and a single table with roles as a multi-valued attribute is going to be a much better fit than DynamoDB in this case.
Hope this helps!
We have a similar situation and we simply use two DBs, a relational one and a NoSQL one (Dynamo). For a "User" object, everything that ties it to other things, such as roles, projects, skills, etc., goes in the relational DB, and everything about the user itself (attributes, etc.) goes in Dynamo. If we need to add new attributes to the user, that is fine, since NoSQL doesn't care about them. The rule of thumb is: if we only need something on that object's page (i.e. we don't need to associate it with other objects), it goes in Dynamo; otherwise, it goes in the relational DB.
Using a table scan on the NoSQL DB is not really an option after you cross even a small threshold (up to that point, you can just use an in memory DB anyway).

SQL Relationships

I'm using MS SQL Server 2008R2, but I believe this is database agnostic.
I'm redesigning some of my SQL structure, and I'm looking for the best way to set up 1-to-many relationships.
I have 3 tables, Companies, Suppliers and Utilities; any of these can have a 1-to-many relationship with another table called VanInfo.
A van info record can either belong to a company, supplier or utility.
I originally had a company_id in the VanInfo table that pointed to the company table, but then when I added suppliers, they needed vaninfo records as well, so I added another column in VanInfo for supplier_id, and set a constraint that either supplier_id or company_id was set and the other was null.
Now I've added Utilities, and now they need access to the VanInfo table, and I'm realizing that this is not the optimum structure.
What would be the proper way of setting up these relationships? Should I just continue adding foreign keys to the VanInfo table, or set up some sort of cross-reference table?
The application isn't technically live yet, but I want to make sure that this is set up using the best possible practices.
UPDATE:
Thank you for all the quick responses.
I've read all the suggestions, checked out all the links. My main criteria is something that would be easy to modify and maintain as clients requirements always tend to change without a lot of notice. After studying, research and planning, I'm thinking it is best to go with a cross reference table of sorts named Organizations, and 1 to 1 relationships between Companies/Utilities/Suppliers and the Organizations table, allowing a clean relationship to the Vaninfo table. This is going to be easy to maintain and still properly model my business objects.
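Sketched as Rails models (the question is SQL Server, but the shape is database-agnostic; names assumed from the update above):

class Organization < ActiveRecord::Base
  # exactly one of these exists per organization
  has_one :company
  has_one :supplier
  has_one :utility
  has_many :van_infos
end

class Company < ActiveRecord::Base
  belongs_to :organization   # companies.organization_id, unique
end

class VanInfo < ActiveRecord::Base
  belongs_to :organization   # one FK, whoever owns the van
end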
With your example I would always go for 'some sort of cross reference table' - adding columns to the VanInfo table smells.
Ultimately you'll have more joins in your SPs, but I think the overhead is worth it.
When you design a database, you should not think about where the primary/foreign keys go, because those are concepts that don't belong in the design stage. I know it sounds weird, but you should not think about tables either! (You could implement your E/R model using XML, files, whatever.)
Sticking to E/R relationship design, you should just identify your entities (in your case Company, Supplier, Utility, VanInfo) and then think about what kind of relationships exist between them (if any). For example, you said a company can have one or more VanInfo records, but a VanInfo record can belong to only one company. We are talking about a one-to-many relationship, as you have already guessed. At this point, when you convert your design model (a one-to-many relationship) to database tables, you will know where to put the keys/foreign keys. In a one-to-many relationship, the foreign key goes on the "many" side; in this case VanInfo gets a foreign key to Company (so the vaninfo table will contain the company id). Follow the same approach for all the other tables. A migration sketch follows the link below.
Have a look at the link below:
https://homepages.westminster.org.uk/it_new/BTEC%20Development/Advanced/Advanced%20Data%20Handling/ERdiagrams/build.htm
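A sketch of that rule as a Rails migration, using the Company/VanInfo example from this answer (column names assumed):

class CreateVanInfos < ActiveRecord::Migration
  def change
    create_table :van_infos do |t|
      # the FK lives on the "many" side: each van row names its company
      t.references :company
      t.timestamps
    end
    add_index :van_infos, :company_id
  end
end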
Consider making the Company, Supplier and Utility PKs GUIDs; this should be enough to solve the problem. However, this situation may be an indicator of poor database design, but to propose a different solution one would need to know the broader context, i.e. what you are trying to achieve. To me it seems like VanInfo should just be a separate entity for each of the tables (yes, exact duplicates like Com_VanInfo, Sup_VanInfo, etc.), unless VanInfo is shared between these entities (in which case the relationships should be inverted, i.e. Company, Supplier and Utility should contain a FK to VanInfo).
Your database basically needs normalization, and I think your schema should be in fifth normal form, where you have two tables linked by one table. Please see this article; it will help you:
http://en.wikipedia.org/wiki/Fifth_normal_form
You may also want to see this article on database normalization:
http://en.wikipedia.org/wiki/Database_normalization

Generating sequential numbers in multi-user saas application

How do people generate auto_incrementing integers for a particular user in a typical saas application?
For example, the invoice numbers for all the invoices for a particular user should be auto_incrementing and start from 1. The rails id field can't be used in this case, as it's shared amongst all the users.
Off the top of my head, I could count all the invoices a user has, and then add 1, but does anyone know of any better solution?
A typical solution for any relational database could be a table like
user_invoice_numbers (user_id int primary key clustered, last_id int)
and a stored procedure or a SQL query like
update user_invoice_numbers set last_id = last_id + 1 where user_id = #user_id
select last_id from user_invoice_numbers where user_id = #user_id
It will work for users (if each user has only a few simultaneously running transactions) but will not work for companies (for example, when you need companies_invoice_numbers), because transactions from different users inside the same company may block each other, and the table will become a performance bottleneck.
The most important functional requirement to check is whether your system is allowed to have gaps in invoice numbering or not. When you use a standard auto_increment, you allow gaps, because in most databases I know, when you roll back a transaction, the incremented number is not rolled back. With this in mind, you can improve performance using one of the following guidelines:
1) Exclude the procedure that you use for getting new numbers from long-running transactions. Suppose the insert-invoice procedure is a long-running transaction with complex server-side logic. In this case you first acquire a new id, and then, in a separate transaction, insert the new invoice. If the second transaction is rolled back, the auto-number will not decrease. But user_invoice_numbers will not be locked for a long time, so many simultaneous users can insert invoices at the same time.
2) Do not use a traditional transactional database to store the last id for each user. When you need to maintain a simple list of keys and values, there are lots of small but fast database engines that can do that work for you (memcached is probably the most popular key/value store). In the past, I saw projects where simple key/value storage was implemented using the Windows Registry or even the file system: there was a directory where each file name was the key, and inside each file was the last id. This rough solution was still better than using a SQL table, because locks were acquired and released very quickly and were not involved in the transaction scope.
If my proposed optimization seems overcomplicated for your project, forget about it for now, until you actually run into performance issues. In most projects the simple method with an additional table will work pretty fast.
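An ActiveRecord version of the counter-table approach might look like this (a sketch; the row lock serializes concurrent increments per user):

class UserInvoiceNumber < ActiveRecord::Base
  # returns the next per-user invoice number, safe under concurrency
  def self.next_for(user)
    transaction do
      counter = lock.find_or_create_by!(user_id: user.id)  # SELECT ... FOR UPDATE
      counter.increment!(:last_id)
      counter.last_id
    end
  end
end

# usage when creating an invoice
invoice.number = UserInvoiceNumber.next_for(current_user)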
You could introduce another table associated with your "users" table that tracks the most recent invoice number for a user. However, reading this value will result in a database query, so you might as well just get a count of the user's invoices and add one, as you suggested. Either way, it's a database hit.
If the invoice numbers are independent for each user/customer then it seems like having "lastInvoice" field in some persistent store (eg. DB record) associated with the user is pretty unavoidable. However this could lead to some contention for the "latest" number.
Does it really matter if we send a user invoices 1, 2, 3 and 5, and never send them invoice 4? Consider whether you can relax the requirement a bit.
If the requirement is actually "every invoice number must be unique" then we can look at all the normal id generating tricks, and these can be quite efficient.
Ensuring that the numbers are sequential adds to the complexity; does it add to the business benefit?
I've just uploaded a gem that should resolve your need (a few years late is better than never!) :)
https://github.com/alisyed/sequenceid/
Not sure if this is the best solution, but you could store the last invoice ID on the User and then use that to determine the next ID when creating a new invoice for that user. But this simple solution may have integrity problems, so you'll need to be careful.
Do you really want to generate the invoice IDs in an incremental format? Would this not open security holes (wherein, if a user can guess how invoice numbers are generated, they can change the number in a request, which may lead to information disclosure)?
I would ideally generate the numbers randomly (and keep track of used numbers). This also prevents collisions: the chance of collision is low, since the numbers are allocated randomly over a range.
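A minimal sketch of that idea, assuming an invoices table with a uniquely indexed number column:

require 'securerandom'

# pick random 7-digit numbers until an unused one turns up;
# the unique index on invoices.number is the real safety net
def next_invoice_number(user)
  loop do
    candidate = 1_000_000 + SecureRandom.random_number(9_000_000)
    return candidate unless user.invoices.exists?(number: candidate)
  end
end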
