SQL Relationships - entity-relationship

I'm using MS SQL Server 2008R2, but I believe this is database agnostic.
I'm redesigning some of my sql structure, and I'm looking for the best way to set up 1 to many relationships.
I have 3 tables, Companies, Suppliers and Utilities, any of these can have a 1 to many relationship with another table called VanInfo.
A van info record can either belong to a company, supplier or utility.
I originally had a company_id in the VanInfo table that pointed to the company table, but then when I added suppliers, they needed vaninfo records as well, so I added another column in VanInfo for supplier_id, and set a constraint that either supplier_id or company_id was set and the other was null.
Now I've added Utilities, and now they need access to the VanInfo table, and I'm realizing that this is not the optimum structure.
What would be the proper way of setting up these relationships? Or should I just continue adding foreign keys to the VanInfo table? or set up some sort of cross reference table.
The application isn't technically live yet, but I want to make sure that this is set up using the best possible practices.
UPDATE:
Thank you for all the quick responses.
I've read all the suggestions, checked out all the links. My main criteria is something that would be easy to modify and maintain as clients requirements always tend to change without a lot of notice. After studying, research and planning, I'm thinking it is best to go with a cross reference table of sorts named Organizations, and 1 to 1 relationships between Companies/Utilities/Suppliers and the Organizations table, allowing a clean relationship to the Vaninfo table. This is going to be easy to maintain and still properly model my business objects.

With your example I would always go for 'some sort of cross reference table' - adding columns to the VanInfo table smells.
Ultimately you'll have more joins in your SP's but I think the overhead is worth it.

When you design a database you should not think about where the primary/foreign key goes because those are concepts that doesn’t belong to the design stage. I know it sound weird but you should not think about tables as well ! (you could implement your E/R model using XML/Files/Whatever
Sticking to E/R relationship design you should just indentify your entity (in your case Company/supplier/utilities/vanInfo) and then think about what kind of relationship there is between them(if there are any). For example you said the company can have one or more VanInfo but the Van Info can belong only to one Company. We are talking about a one – to- many relationship as you have already guessed. At this point when you “convert” you design model (a one-to many relationship) to a Database table you will know where to put the keys/ foreign keys. In the case of a one-to-Many relationship the foreign key should go to the “Many” side. In this case the van info will have a foreign keys to company (so the vaninfo table will contain the company id) . You have to follow this way for all the others tables
Have a look at the link below:
https://homepages.westminster.org.uk/it_new/BTEC%20Development/Advanced/Advanced%20Data%20Handling/ERdiagrams/build.htm

Consider making Com, Sup and Util PKs a GUID, this should be enough to solve the problem. However this sutiation may be a good indicator of poor database design, but to propose a different solution one should know more broad database context, i.e. that you are trying to achive. To me this seems like a VanInfo should be just a separate entity for each of the tables (yes, exact duplicate like Com_VanInfo, Sup_VanInfo etc), unless VanInfo isn't shared between this entities (then relationships should be inverted, i.e. Com, Sup and Util should contain FK for VanInfo).

Your database basically need normalization and I think you're database should be on its fifth normal form where you have two tables linked by one table. Please see this article, this will help you:
http://en.wikipedia.org/wiki/Fifth_normal_form
You may also want to see this, database normalization:
http://en.wikipedia.org/wiki/Database_normalization

Related

Rails data model question: should I have optional foreign keys or join tables?

I have a Rails application that includes tables for surveys and (survey) questions.
A survey has_many questions
A question belongs_to a survey
The surveys are inherently teacher surveys. Now we are introducing student surveys, which are meaningfully different from teacher surveys, with different types of information that we need to store about them, such that they seem to each warrant their own table/model, so I'm thinking we want separate tables for teacher_surveys and student_surveys.
However, the questions are really pretty much the same. Includes things like the question text, type of question (text, checkbox, dropdown), etc. So it seems like questions should remain a single table.
How do I best model this data?
Should the questions table have a teacher_survey_id and a student_survey_id where each is optional but one of the two of them is required?
Should I have join tables for questions_teacher_surveys and questions_student_surveys?
Something else?
There is no easy answer to this question. Separating questions into student_question and teach_question tables does mean you have a slight bit of duplication and you can't query them as a homogenized collection if that happens to be important.
The code duplication can be very simply addressed by using inheritance or composition but there is an increased maintainence burdon / complexity cost here.
But it does come with the advantage that you get a guarentee of referential integrity from the non-nullable foreign key column without resorting to stuff like creating a database trigger or the lack of real foreign key constraints if you choose to use a polymorphic association.
An additional potential advantage is that you're querying smaller tables that can remain dense instead of sparse if the requirements diverge.
sounds like you need to make your surveys table polymorphic and add a type column as an enum with :student and :teacher as options, you can use the type column as a flag to control the different business logic. Once the survey table becomes polymorphic you probably wont need to do anything with your questions table. If you decide to go with this solution its also recommend to add a concern class named Surveyable to hold the needed associations and shared logic between your surveyed models (in your case students and teachers).

Bitemporal master-satellite relationship

I am a DB newbie to the bitemporal world and had a naive question.
Say you have a master-satellite relationship between two tables - where the master stores essential information and the satellite stores information that is relevant to only few of the records of the master table. Example would be 'trade' as a master table and 'trade_support' as the satellite table where 'trade_support' will only house supporting information for non-electronic trades (which will be a small minority).
In a non-bitemporal landscape, we would model it as a parent-child relationship. My question is: in a bitemporal world, should such a use case be still modeled as a two-table parent-child relationship with 4 temporal columns on both tables? I don't see a reason why it can't be done, but the question of "should it be done" is quite hazy in my mind. Any gurus to help me out with the rationale behind the choice?
Pros:
Normalization
Cons:
Additional table and temporal columns to maintain and manage via DAO's
Defining performant join conditions
I believe this should be a pretty common use-case and wanted to know if there are any best practices that I can benefit from.
Thanks in advance!
Bitemporal data management and foreign keys can be quite tricky. For a master-satellite relationship between bitemporal tables, an "artificial key" needs to be introduced in the master table that is not unique but identical for different temporal or historical versions of an object. This key is referenced from the satellite. When joining the two tables a bitemporal context (T_TIME, V_TIME) where T_TIME is the transaction time and V_TIME is the valid time must be given for the join. The join would be something like the following:
SELECT m.*, s.*
FROM master m
LEFT JOIN satellite s
ON m.key = s.master_key
AND <V_TIME> between s.valid_from and s.valid_to
AND <T_TIME> between s.t_from and s.t_to
WHERE <V_TIME> between m.valid_from and m.valid_to
AND <T_TIME> between m.t_from and m.t_to
In this query the valid period is given by the columns valid_from and valid_to and the transaction period is given by the columns t_from and t_to for both the master and the satellite table. The artificial key in the master is given by the column m.key and the reference to this key by s.master_key. A left outer join is used to retrieve also those entries of the master table for which there is no corresponding entry in the satellite table.
As you have noted above, this join condition is likely to be slow.
On the other hand this layout may be more space efficient in that if only the master data (in able trade) or only the satellite data (in table trade_support) is updated, this will only require a new entry in the respective table. When using one table for all data, a new entry for all columns in the combined table would be necessary. Also you will end up with a table with many null values.
So the question you are asking boils down to a trade-off between space requirements and concise code. The amount of space you are sacrificing with the single-table solution depends on the number of columns of your satellite table. I would probably go for the single-table solution, since it is much easier to understand.
If you have any chance to switch database technology, a document oriented database might make more sense. I have written a prototype of a bitemporal scala layer based on mongodb, which is available here:
https://github.com/1123/bitemporaldb
This will allow you to work without joins, and with a more flexible structure of your trade data.

One polymorphic association vs many through/HABTM associations

I am working on a project that currently has tons of HABTM associations. Essentially, everything is related to everything else. I am considering setting up a single intermediate table/model that has two polymorphic fields. This way, if I add another model I can easily connect it to the remaining models. Is this a good idea? If not, why not? If it is, why don't all rails projects have this kind of intermediate table?
I see two other options. I could keep adding intermediate tables or I could add a table that contains one of each type. The former option is kind of a hassle and the latter option does not allow for self joins.
While a polymorphic join table sounds like it would make things easier, I think you will end up creating more headache for yourself than it's worth. Here are a few potential challenges/problems off the top of my head:
You will not be able to use ActiveRecord's has_and_belongs_to_many association or related helpers without a ton of hacking/monkeypatching which will immediately eclipse the time it would take to setup individual pairwise link tables.
Your join table will have two id columns, let's call them a_id and b_id. For any given pair of models you will have to ensure that the ids always end up in the same column.
Example: If you have two models called User and Role, you would have to ensure for that pair that the user_id is always stored in col a_id and the role_id is always stored in col b_id, otherwise you will not be able to index the table in any kind of meaningful way (and will run the risk of defining the same relationship twice).
If you ever want to use database enforcement of FOREIGN KEY constraints it is unlikely that this polymorphic link table scheme will be supported.
The universal link table will get n times larger than n separate link tables. It shouldn't matter much with good indexing but as your application and data grow this could become a headache and limit some of your options in regards to scaling. Give your DB a break.
Most or least importantly (I can't decide) you will be bucking the norm which means a lot fewer (if any) resources out there to help you when you run into trouble. Basically the Adam Sandler "they're all gonna laugh at you" rationale.
Last thought: Can you eliminate any of the link tables by using has_many :xxx, :through => :xxx relationships?
Thinking it all through, you could actually do this, but I wouldn't. Join tables grow fast enough as it is and i like to keep model relationships simple and easy to alter.
I'm used to working on very large systems / data sets though, so if you're going going to have much in each join then ok. I'd still do it separately for joins however and i really like my polymorphics.
I think it would be cleaner and more flexible if you were to use multiple join tables as opposed to one giant multipurpose join table.

Domain Driven Design: When to make an Aggregate Root?

I'm attempting to implement DDD for the first time with a ASP.NET MVC project and I'm struggling with a few things.
I have 2 related entities, a Company and a Supplier. My initial thought was that Company was an aggregate root and that Supplier was a value object for Company. So I have a Repository for company and none for Supplier.
But as I have started to build out my app, I ended up needing separate list, create, and update forms for the Supplier. The list was easy I could call Company.Suppliers, and create was horrible I could do Company.Suppliers.Add(supplier), but update is giving me a headache. Since I need just one entity and I can't exactly stick it in memory between forms, I ended up needing to refetch the company and all of the suppliers and find the one I needed to bind to it and again to modified it and persist it back to the db.
I really just needed to do a GetOne if I had a repository for Supplier. I could add some work arounds by adding a GetOneSupplier to my Company or CompanyRepository, but that seems junky.
So, I'm really wondering if it's actually a Value Object, and not a full domain entity itself.
tldr;
Is needing separate list/create/update view/pages a sign that an entity should be it's own root?
Based on your terminology I assume you are performing DDD based on Eric Evans' book. It sounds like you have already identified a problem with your initial go at modeling and you are right on.
You mention you thought of supplier as a Value Object... I suggest it is not. A Value Object is something primarily identified by its properties. For example, the date "September 30th, 2009" is a value object. Why? Because all date instances with a different month/day/year combo are different dates. All date instances with the same month/day/year combo are considered identical. We would never argue over swapping my "September 30th, 2009" for yours because they are the same :-)
An Entity on the other hand is primarily identified by its "ID". For example, bank accounts have IDs - they all have account numbers. If there are two accounts at a bank, each with $500, if their account numbers are different, so are they. Their properties (in this example, their balance) do not identify them or imply equality. I bet we would argue over swapping bank accounts even if their balances were the same :-)
So, in your example, I would consider a supplier an Entity, as I would presume each supplier is primarily identified by its ID rather than its properties. My own company shares its name with two others in the world - yet we are not all interchangeable.
I think your suggestion that if you need views for CRUDing an object then it is an Entity probably holds true as a rule of thumb, but you should focus more on what makes one object different from others: properties or ID.
Now as far as Aggregate Roots go, you want to focus on the lifecycle and access control of the objects. Consider that I have a blog with many posts each with many comments - where is/are the Aggregate Root(s)? Let's start with comments. Does it make sense to have a comment without a post? Would you create a comment, then go find a post and attach it to it? If you delete a post, would you keep its comments around? I suggest a post is an Aggregate Root with one "leaf" - comments. Now consider the blog itself - its relationship with its posts is similar to that between posts and comments. It too in my opinion is an Aggregate Root with one "leaf" - posts.
So in your example, is there a strong relationship between company and supplier whereby if you delete a company (I know... you probably only have one instance of company) you would also delete its suppliers? If you delete "Starbucks" (a coffee company in the US) do all its coffee bean suppliers cease to exist? This all depends on your domain and application, but I suggest more than likely neither of your Entities are Aggregate Roots, or perhaps a better way to think about them is that they are Aggregate Roots each with no "leaves" (nothing to aggregate). In other words, company does not control access to or control the lifecycle of suppliers. It simply has a one-to-many relationship with suppliers (or perhaps many-to-many).
This brings us to Repositories. A Repository is for storing and retrieving Aggregate Roots. You have two (technically they are not aggregating anything but its easier than saying "repositories store aggregate roots or entities that are not leaves in an aggregate"), therefore you need two Repositories. One for company and one for suppliers.
I hope this helps. Perhaps Eric Evans lurks around here and will tell me where I deviated from his paradigm.
Sounds like a no-brainer to me - Supplier should have its own repository. If there is any logical possibility that an entity could exist independently in the model then it should be a root entity, otherwise you'll just end up refactoring later on anyway, which is redundant work.
Root entities are always more flexible than value objects, despite the extra implementation work up front. I find that value objects in a model become rarer over time as the model evolves, and entities that remain value objects were usually the ones that you could logically constrain that way from day one.
If companies share suppliers then having supplier as a root entity removes data redundancy as well, as you do not duplicate the supplier definition per company but share the reference instead, and the association between Company and Supplier can be bi-directional as well, which may yield more benefits.

To normalize or not to normalize user_ids

In my Rails application, I have a variety of database tables that contain user data. Some of these tables have a lot of rows (as many as 500,000 rows per user in some cases) and are queried frequently. Whenever I query any table for anything, the user_id of the current user is somewhere in the query - either directly, if the table has a direct relation with the user, or through a join, if they are related through some other tables.
Should I denormalize the user_id and include it in every table, for faster performance?
Here's one example:
Address belongs to user, and has a user_id
Envelope belongs to user, and has a user_id
AddressesEnvelopes joins an Address and an Envelope, so it has envelope_id and address_id -- it doesn't have user_id, but could get to it through either the envelope or the address (which must belong to the same user).
One common expensive query is to select all the AddressesEnvelopes for a particular user, which I could accomplish by joining with either Address or Envelope, even though I don't need anything from those tables. Or I could just duplicate the user id in this table.
Here's a different scenario:
Letter belongs to user, and has a user_id
Recepient belongs to Letter, and has a letter_id
RecepientOption belongs to Recepient, and has a recepient_id
Would it make sense to duplicate the user_id in both Recepient and RecepientOption, even though I could always get to it by going up through the associations, through Letter?
Some notes:
There are never any objects that are
shared between users. An entire
hierarchy of related objects always
belongs to the same user.
The user owner of objects never changes.
Database performance is important because it's a data intensive application. There are many queries and many tables.
So should I include user_id in every table so I can use it when creating indexes? Or would that be bad design?
I'd like to point out that it isn't necessary to denormalize, if you are willing to work with composite primary keys. Sample for AddressEnvelop case:
user(
#user_id
)
address(
#user_id
, #addres_num
)
envelope(
#user_id
, #envelope_num
)
address_envelope(
#user_id
, #addres_num
, #envelope_num
)
(the # indicates a primary key column)
I am not a fan of this design if I can avoid it, but considering the fact that you say that all these objects are tied to a user, this type of design would make it relatively simply to partition your data (either logically, put ranges of users in separate tables or physically, using multiple databases or even machines)
Another thing that would make sense with this type of design is using clustered indexes (in MySQL, the primary key of InnoDB tables are built from a clustered index). If you ensure the user_id is always the first column in your index, it will ensure that for each table, all data for one user is stored close together on disk. This is great when you always query by user_id, but it can hurt perfomance if you query by another object (in which case duplication like you sugessted may be a better solution)
At any rate, before you change the design, first make sure your schema is already optimized, and you have proper indexes on your foreign key columns. If performance really is paramount, you should simply try several solutions and do benchmarks.
As long as you
a) get a measurable performance improvement
and
b) know which parts of your database are real normalized data and which are redundant improvements
there is no reason not to do it!
Do you actually have a measured performance problem? 500 000 rows isn't very large table. Your selects should be reasonable fast if they are not very complex and you have proper indexes on your columns.
I would first see if there are slow queries and try to optimize them with indexes. If that is not enough, only then I would look into denormalization.
Denormalizations that you suggest seem reasonable if you can't achieve the required performance with other means. Just make sure that you keep denormalized fields up-to-date.

Resources