Resolving Many-To-Many relationship using Fact and single common key - data-warehouse

We have 3 tables as below:
DimTask
FactTask
DimReason
As the name suggests, DimTask and DimReason are dimensions and FactTask is a fact table.
The DimTask table records details of a task, such as its "Description" and "Title", where a task is simply an activity that took place, like maintenance or cleaning.
The FactTask table records the measures of that task, such as fees, charges, and time taken in minutes.
The DimReason table records the purpose of the activity, that is, why it happened; a single task can have multiple reasons.
Now, DimTask and FactTask are connected/related through TaskID.
How do we handle the relation of DimReason in the model?
Should I create a bridge table and connect it to DimTask?
Can I connect DimReason to FactTask using the same common key, "TaskID", or is that bad design practice?
Please help.

Without the data, it is tough to say.
Thought 1
It looks like DimReason is not a dimension; rather, it should be a fact.
DimTask::TaskID > FactTask::TaskID ... going by your description, this works like an order header (a FactOrderHeader type)
DimTask::TaskID > FactReason (what you called DimReason)::TaskID ... going by your description, this works like order details (a FactOrderDetails type)
Note: Time-bound activities are typically facts (with exceptions).
Thought 2
Multi-valued dimensions: https://www.kimballgroup.com/data-warehouse-business-intelligence-resources/kimball-techniques/dimensional-modeling-techniques/multivalued-dimension-bridge-table/
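To make the bridge-table idea from Thought 2 concrete, here is a minimal sketch using Python's built-in sqlite3 module. The BridgeTaskReason table, the Weight column and all sample rows are assumptions for illustration; they are not from the question. Weight is an allocation factor that sums to 1 per task, so fees are not double-counted when a task has several reasons.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DimTask   (TaskID INTEGER PRIMARY KEY, Title TEXT, Description TEXT);
CREATE TABLE DimReason (ReasonID INTEGER PRIMARY KEY, ReasonName TEXT);
-- One row per task: the measures live here.
CREATE TABLE FactTask  (TaskID INTEGER REFERENCES DimTask(TaskID),
                        Fees REAL, Minutes INTEGER);
-- Hypothetical bridge: one row per (task, reason) pair.
-- Weight values sum to 1 for each task.
CREATE TABLE BridgeTaskReason (
    TaskID   INTEGER REFERENCES DimTask(TaskID),
    ReasonID INTEGER REFERENCES DimReason(ReasonID),
    Weight   REAL,
    PRIMARY KEY (TaskID, ReasonID)
);
INSERT INTO DimTask VALUES (1, 'Clean lobby', 'Weekly cleaning'),
                           (2, 'Fix pump', 'Maintenance');
INSERT INTO DimReason VALUES (10, 'Scheduled'), (20, 'Complaint');
INSERT INTO FactTask VALUES (1, 100.0, 60), (2, 50.0, 30);
INSERT INTO BridgeTaskReason VALUES (1, 10, 0.5), (1, 20, 0.5), (2, 10, 1.0);
""")

def fees_by_reason(conn):
    """Allocate each task's fees across its reasons using the bridge weights."""
    rows = conn.execute("""
        SELECT r.ReasonName, SUM(f.Fees * b.Weight)
        FROM FactTask f
        JOIN BridgeTaskReason b ON b.TaskID = f.TaskID
        JOIN DimReason r ON r.ReasonID = b.ReasonID
        GROUP BY r.ReasonName
    """).fetchall()
    return dict(rows)

fees = fees_by_reason(conn)
total = conn.execute("SELECT SUM(Fees) FROM FactTask").fetchone()[0]
```

With the weights in place, the per-reason fees add back up to the overall total in FactTask. Joining through the bridge without the weight would count task 1's fees once per reason.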
Hope this helps!

Related

Is a table (from source system) that contains only relationships and current status of a row from another table a fact table in Data Warehouse?

I am developing a BI system for our company, from scratch, and currently, I am designing a data warehouse. I am completely new to this so there are many things that I don't really understand, so I need to hear some more insights into this.
My problems are:
1) In our source system, there are tables called "Booking" and "BookingAccess". The Booking table holds the data of a booking, such as check-in time and check-out time, booking date, booking number, and the gross amount of that booking.
BookingAccess, on the other hand, holds foreign keys related to the booking, such as bookerID, customerID, processID, hotelID and paymentproviderID, plus the current status of that booking. Booking and BookingAccess have a 1:1 relationship.
Our source system is about checking the validity of those bookings; the bookings are not ours. We receive the booking information from other sources and carry out the above process on their behalf. The gross amount is just a piece of information about the booking that we need to validate; it is not part of our business. The current status of a booking, which is held in the BookingAccess table, is the status of that booking in our system, which can be "Processing" or "Finished".
From what I have read from Ralph Kimball, in this situation "Booking" is the dimension table and BookingAccess should be the fact. I feel that BookingAccess is somewhat like an accumulating snapshot table, in which I should track the time when a booking is "Processing" and when it is "Finished".
Do I get it right?
2) In the "Booking" table, there is also a foreign key called "ImportID". This key links to a table called "Import". The "Import" table holds history records of the files that were imported into our system (these files contain the bookings that are written to the "Booking" table), including attributes such as file name, import date, total bookings imported...
From my point of view, this is clearly a fact table.
But the problem is that the "Import" table and the "Booking" table have a one-to-many relationship (one ImportID in the "Import" table can have 1, 2 or more records with the same ImportID in the "Booking" table). This goes against the idea that the relationship between fact and dimension must be many-to-one, with the fact always on the "many" side.
So what approach should I use to solve this case? I'm thinking of using bridge tables, but I don't know if this is good practice: there are a lot of records in the "Import" table, so I would have to create a big bridge table just to cover all of them.
3) Should I split a source-system table that contains a mix of relationships and information into a fact table containing only the relationships and a dimension table containing only the information? (For example, a table called "Customer" in the source system. This table contains things like customer name, customer address, customertype id, customer parentID....)
I am asking this because if I use BI tools to analyze things (for example, the number of customers that have customertypeid = 1), it feels somewhat weird if there are no fact tables involved.
Or should I treat it as a mere dimension table and use a snowflake schema? But this will lead to a mix of star schema and snowflake schema in our data warehouse. Is this normal? I have read some official sources (most likely Oracle) stating that one should try to avoid using and mixing in snowflake schemas as much as possible. But some sources, like Microsoft, say that this is very normal. Even the AdventureWorks data warehouse sample database uses this kind of approach.
Or should I denormalize every relation in that "Customer" table? I don't think this is a good approach, as it will make the Customer dimension contain a lot of columns, and it will be very hard to track the history of every row in the "DIM_Customer" table. For example, if a change occurs in any relation of the "Customer" table, the whole "DIM_Customer" table will need to be updated.
I still have a lot of questions regarding data warehousing. I am working on this nearly alone, without any help or consultant, so pardon me if I have made any mistakes or caused any inconvenience.

Identifying the fact table in data warehouse design

I'm trying to design my first data mart with a star schema from an Excel sheet containing information about help desk service calls. The sheet contains 33 fields holding different kinds of information, and I can't identify the fact table, because I want to do reporting later based on different KPIs.
I want to know how to identify the fact table measures easily, and I have another question: can a fact table contain only foreign keys of dimensions and no measures? Thanks in advance, and sorry for my bad English.
You can have more than one fact table.
A fact table represents an event or process that you want to analyze.
The structure of a fact table depends on the process or event that you are trying to analyze.
You need to tell us the events or processes that you want to analyze before we can help you further.
Can a fact table contain only foreign keys of dimensions and no measures?
Yes. This is called a factless fact table.
Let's say you want to do a basic analysis of calls:
Your full table might look like this
CALL_ID
START_DATE
DURATION
AGENT_NAME
AGENT_TENURE (how long worked for company)
CUSTOMER_NAME
CUSTOMER_TENURE (how long a customer)
PRODUCT_NAME (the product the customer is calling about)
RESOLVED
You would turn this into a fact table like this:
CALL_ID
START_DATE_KEY
AGENT_KEY
CUSTOMER_KEY
PRODUCT_KEY
DURATION (measure)
RESOLVED (quasi-measure)
And you would have a DATE dimension table, AGENT dimension table, CUSTOMER dimension table and PRODUCT dimension table.
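As a rough illustration of that conversion, here is a sketch in Python with the built-in sqlite3 module that loads a few flat call rows into a fact table plus two of the dimensions. The sample rows, the key_for helper and the yyyymmdd date key are assumptions for the example; the CUSTOMER and DATE dimensions are left out to keep it short.

```python
import sqlite3

# Flat source rows: (call_id, start_date, duration, agent_name, product_name, resolved)
flat_calls = [
    (1, "2024-01-01", 300, "Ann", "Router", 1),
    (2, "2024-01-01", 500, "Ann", "Router", 0),
    (3, "2024-01-02", 200, "Bob", "Modem", 1),
]

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_agent   (agent_key INTEGER PRIMARY KEY, agent_name TEXT UNIQUE);
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, product_name TEXT UNIQUE);
CREATE TABLE fact_call (
    call_id        INTEGER PRIMARY KEY,
    start_date_key INTEGER,  -- would point at a DATE dimension
    agent_key      INTEGER REFERENCES dim_agent(agent_key),
    product_key    INTEGER REFERENCES dim_product(product_key),
    duration       INTEGER,  -- measure
    resolved       INTEGER   -- quasi-measure, stored as 0/1
);
""")

def key_for(table, key_col, name_col, name):
    """Return the surrogate key for name, adding a dimension row if needed."""
    conn.execute(f"INSERT OR IGNORE INTO {table} ({name_col}) VALUES (?)", (name,))
    return conn.execute(
        f"SELECT {key_col} FROM {table} WHERE {name_col} = ?", (name,)
    ).fetchone()[0]

for call_id, start_date, duration, agent, product, resolved in flat_calls:
    conn.execute(
        "INSERT INTO fact_call VALUES (?, ?, ?, ?, ?, ?)",
        (call_id, int(start_date.replace("-", "")),
         key_for("dim_agent", "agent_key", "agent_name", agent),
         key_for("dim_product", "product_key", "product_name", product),
         duration, resolved),
    )

# Calls, total duration and resolution rate per product.
summary = conn.execute("""
    SELECT p.product_name, COUNT(*), SUM(f.duration), AVG(f.resolved)
    FROM fact_call f
    JOIN dim_product p ON p.product_key = f.product_key
    GROUP BY p.product_name
    ORDER BY p.product_name
""").fetchall()
```

The descriptive text (agent name, product name) ends up stored once per dimension row, and the fact table keeps only keys and numbers.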
Agile Data Warehouse Design is a good book, as are the ones by Kimball.
In general, the way I've done it (and there are a number of ways to do anything) is that categorical data is referenced with a foreign key in the fact table, while anything you want to perform aggregations on (typically money, integer or double data types) can be stored in the fact table itself. So, for example, a fact table might reference a hierarchy of types, such as product_category >> product_name, and it usually contains a time and/or location field as well, all of which would be referenced by a foreign key to a lookup table. The measure columns are usually integer or money data, and are used in aggregate functions grouped by the other fields, like this:
select product_category, sum(measureOne) as total_measure
from facttable
where timeCol between X and Y
group by product_category -- ...etc
At one time a few years ago, I did have a fact table that had no measure column... because the only measure I had was based on count, which I would do dynamically by grouping different dimensions in the fact table.
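A fact table like that, with no measure column, might look like the following sketch (Python with the built-in sqlite3 module). The table name fact_call_event and the sample rows are made up for illustration; the only "measure" is the row count, computed at query time.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, product_name TEXT);
-- Factless fact table: only dimension keys, one row per call event.
CREATE TABLE fact_call_event (
    date_key    INTEGER,
    agent_key   INTEGER,
    product_key INTEGER REFERENCES dim_product(product_key)
);
INSERT INTO dim_product VALUES (1, 'Router'), (2, 'Modem');
INSERT INTO fact_call_event VALUES
    (20240101, 1, 1), (20240101, 2, 1), (20240102, 1, 2);
""")

# Counting rows per dimension value stands in for a stored measure.
calls_per_product = dict(conn.execute("""
    SELECT p.product_name, COUNT(*)
    FROM fact_call_event f
    JOIN dim_product p ON p.product_key = f.product_key
    GROUP BY p.product_name
"""))
calls_per_day = dict(conn.execute(
    "SELECT date_key, COUNT(*) FROM fact_call_event GROUP BY date_key"
))
```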

How to model Players for different Sports in RoR?

I am building an app that has the following requirements:
-> A User can be a player of different teams.
-> A Team can be of a sport type.
My question is:
-> Since for each sport type I want to store different information of a Player, what would be the best way to model that?
I have thought about having several models (and tables), one for each kind of sport, for example:
Basketball_Players, Football_Players and so on, but I am not sure if that would be a good approach. How do you usually do this in RoR?
I'd say you have two options, and I don't know that it's really possible to say which is the "most correct" way to do it without knowing the details of the requirements of your application.
What's a given is that you'll have a sport table and a player table. I can say that for sure. The question is how you connect the two.
Option 1: a single join table
You could have a table called player_sport (or whatever) with a player_id column, a sport_id column, and a serialized_player_data column or something like that, where you'd keep serialized player data (JSON, perhaps) depending on the sport. Pros: simple schema. Cons: not properly normalized, and therefore subject to inconsistencies.
Option 2: a separate join table for each sport
This is what you alluded to in your question, where you have a basketball_player, football_player, etc. Here you'd also have a player_id column but probably not a sport_id column because that would be redundant now that you're specifying the sport right in the table name. The need to have a serialized_player_data column would go away, since you'd now be free to store the needed attributes directly in columns, e.g. wrestling_player.weight_class_id or whatever. Pros: proper normalization. Cons: more complex schema, and therefore more work in your application code.
There's actually a third option as well:
Option 3: a combination of 1 and 2
Here you might do everything you would do in Option 2, except you'd move the common player attributes to the player_sport table and save basketball_player, etc. for the sport-specific attributes. So weight_class_id would stay in wrestling_player but player_sport would have height, weight, and other columns that are relevant to all sports.
If you're looking for a recommendation, I would probably do Option 2, or, if it looks like there's enough overlap for it to make sense, Option 3.
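For what it's worth, here is a rough sketch of the Option 3 tables, written with Python's built-in sqlite3 module as a stand-in for Rails migrations and models. The column names height_cm, weight_kg and position, and the sample data, are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE player (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE sport  (id INTEGER PRIMARY KEY, name TEXT);
-- Attributes relevant to every sport live on the join table.
CREATE TABLE player_sport (
    player_id INTEGER REFERENCES player(id),
    sport_id  INTEGER REFERENCES sport(id),
    height_cm INTEGER,
    weight_kg INTEGER,
    PRIMARY KEY (player_id, sport_id)
);
-- Sport-specific attributes get one table per sport.
CREATE TABLE wrestling_player (
    player_id       INTEGER PRIMARY KEY REFERENCES player(id),
    weight_class_id INTEGER
);
CREATE TABLE basketball_player (
    player_id INTEGER PRIMARY KEY REFERENCES player(id),
    position  TEXT
);
INSERT INTO player VALUES (1, 'Sam');
INSERT INTO sport VALUES (1, 'wrestling'), (2, 'basketball');
INSERT INTO player_sport VALUES (1, 1, 180, 82), (1, 2, 180, 82);
INSERT INTO wrestling_player VALUES (1, 3);
INSERT INTO basketball_player VALUES (1, 'guard');
""")

# Common attributes come from player_sport, the rest from the sport's own table.
wrestling_profile = conn.execute("""
    SELECT p.name, ps.height_cm, ps.weight_kg, w.weight_class_id
    FROM player p
    JOIN player_sport ps ON ps.player_id = p.id
    JOIN sport s ON s.id = ps.sport_id AND s.name = 'wrestling'
    JOIN wrestling_player w ON w.player_id = p.id
""").fetchone()

sports_played = [row[0] for row in conn.execute("""
    SELECT s.name FROM sport s
    JOIN player_sport ps ON ps.sport_id = s.id
    WHERE ps.player_id = 1
    ORDER BY s.name
""")]
```

In a Rails app each of these tables would normally get its own model, with the sport-specific ones reached through the player.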

SQL Relationships

I'm using MS SQL Server 2008R2, but I believe this is database agnostic.
I'm redesigning some of my SQL structure, and I'm looking for the best way to set up one-to-many relationships.
I have 3 tables, Companies, Suppliers and Utilities; any of these can have a one-to-many relationship with another table called VanInfo.
A van info record can either belong to a company, supplier or utility.
I originally had a company_id in the VanInfo table that pointed to the company table, but then when I added suppliers, they needed vaninfo records as well, so I added another column in VanInfo for supplier_id, and set a constraint that either supplier_id or company_id was set and the other was null.
Now I've added Utilities, and now they need access to the VanInfo table, and I'm realizing that this is not the optimum structure.
What would be the proper way of setting up these relationships? Or should I just continue adding foreign keys to the VanInfo table? or set up some sort of cross reference table.
The application isn't technically live yet, but I want to make sure that this is set up using the best possible practices.
UPDATE:
Thank you for all the quick responses.
I've read all the suggestions and checked out all the links. My main criterion is something that is easy to modify and maintain, as clients' requirements always tend to change without much notice. After studying, research and planning, I think it is best to go with a cross-reference table of sorts named Organizations, with 1-to-1 relationships between Companies/Utilities/Suppliers and the Organizations table, allowing a clean relationship to the VanInfo table. This will be easy to maintain and will still properly model my business objects.
With your example I would always go for 'some sort of cross reference table' - adding columns to the VanInfo table smells.
Ultimately you'll have more joins in your stored procedures, but I think the overhead is worth it.
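One possible shape for the cross-reference approach (the Organizations table described in the update) is sketched below with Python's built-in sqlite3 module rather than SQL Server. The column names org_type, name and plate, and the sample rows, are assumptions for illustration. VanInfo needs only a single foreign key, no matter how many organization types are added later.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
conn.executescript("""
CREATE TABLE organizations (id INTEGER PRIMARY KEY, org_type TEXT NOT NULL);
-- Each subtype row is 1-to-1 with exactly one organizations row.
CREATE TABLE companies (organization_id INTEGER PRIMARY KEY
                        REFERENCES organizations(id), name TEXT);
CREATE TABLE suppliers (organization_id INTEGER PRIMARY KEY
                        REFERENCES organizations(id), name TEXT);
CREATE TABLE utilities (organization_id INTEGER PRIMARY KEY
                        REFERENCES organizations(id), name TEXT);
-- One FK column covers every kind of owner.
CREATE TABLE van_info (id INTEGER PRIMARY KEY,
                       organization_id INTEGER NOT NULL REFERENCES organizations(id),
                       plate TEXT);
INSERT INTO organizations VALUES (1, 'company'), (2, 'supplier'), (3, 'utility');
INSERT INTO companies VALUES (1, 'Acme');
INSERT INTO suppliers VALUES (2, 'Parts Co');
INSERT INTO utilities VALUES (3, 'City Power');
INSERT INTO van_info VALUES (10, 1, 'AAA-111'), (11, 1, 'AAA-222'), (12, 3, 'CCC-333');
""")

def owner_type(conn, van_id):
    """Return the organization type that owns the given van_info row."""
    return conn.execute("""
        SELECT o.org_type
        FROM van_info v
        JOIN organizations o ON o.id = v.organization_id
        WHERE v.id = ?
    """, (van_id,)).fetchone()[0]

vans_per_org = dict(conn.execute(
    "SELECT organization_id, COUNT(*) FROM van_info GROUP BY organization_id"
))

# A van_info row pointing at a non-existent organization is rejected.
try:
    conn.execute("INSERT INTO van_info VALUES (99, 12345, 'ZZZ-999')")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True
```

Compared with a nullable company_id / supplier_id / utility_id column for each owner type, this keeps the "exactly one owner" rule in a single NOT NULL foreign key.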
When you design a database, you should not think about where the primary/foreign keys go, because those concepts don't belong to the design stage. I know it sounds weird, but you should not think about tables either! (You could implement your E/R model using XML, files, or whatever.)
Sticking to E/R design, you should first identify your entities (in your case Company/Supplier/Utility/VanInfo) and then think about what kind of relationship exists between them, if any. For example, you said a company can have one or more VanInfo records, but a VanInfo can belong to only one company. That is a one-to-many relationship, as you have already guessed. At this point, when you "convert" your design model (a one-to-many relationship) to database tables, you will know where to put the keys and foreign keys. In a one-to-many relationship, the foreign key goes on the "many" side. In this case, VanInfo will have a foreign key to Company (so the VanInfo table will contain the company ID). Follow the same approach for all the other tables.
Have a look at the link below:
https://homepages.westminster.org.uk/it_new/BTEC%20Development/Advanced/Advanced%20Data%20Handling/ERdiagrams/build.htm
Consider making the Com, Sup and Util PKs GUIDs; this should be enough to solve the problem. However, this situation may be a good indicator of poor database design, but to propose a different solution one would need to know the broader database context, i.e. what you are trying to achieve. To me it seems that VanInfo should just be a separate entity for each of the tables (yes, exact duplicates like Com_VanInfo, Sup_VanInfo, etc.), unless VanInfo is shared between these entities (in which case the relationships should be inverted, i.e. Com, Sup and Util should contain an FK to VanInfo).
Your database basically needs normalization, and I think it should be in fifth normal form, where you have two tables linked by a third table. Please see this article; it should help:
http://en.wikipedia.org/wiki/Fifth_normal_form
You may also want to see this, database normalization:
http://en.wikipedia.org/wiki/Database_normalization

Wrapping my head around MongoDB, mongomapper and joins

I'm new to MongoDB and I've used RDBMS for years.
Anyway, let's say I have the following collections:
Realtors
many :bookmarks
key :name
Houses
key :address, String
key :bathrooms, Integer
Properties
key :address, String
key :landtype, String
Bookmark
key :notes
I want a Realtor to be able to bookmark a House and/or a Property. Notice that Houses and Properties are stand-alone and have no idea about Realtors or Bookmarks. I want the Bookmark to be sort of like a "join table" in MySQL.
The Houses/Properties come from a different source so they can't be modified.
I would like to be able to do this in Rails:
r = Realtor.first
r.bookmarks would give me:
House1
House2
PropertyABC
PropertyOO1
etc...
There will be thousands of Houses and Properties.
I realize that this is what RDBMS were made for. But there are several reasons why I am using MongoDB so I would like to make this work.
Any suggestions on how to do something like this would be appreciated.
Thanks!
OK, first things first. You've structured your data as if this were an RDBMS. You've even run off and created a "join table" as if such a thing were useful in Mongo.
The short answer to your question is that you're probably going to have to redefine "first" to load the given "Bookmarks", either "server-side" with an $in clause or "client-side" with a big for loop.
So two Big Questions about the data:
If Bookmarks completely belong to a Realtor, why are they in their own collection?
If Realtors can Bookmark Houses and Property, then why are these in different collections? Isn't this needless complication? If you want something like Realtor.first on bookmarks why put them in different collections?
The Realtors collection should probably be composed of items that look like this:
{"name":"John", "bookmarks": [
{"h":"House1","notes":["Nice location","High Ask"] },
{"p":"PropertyABC","notes":["Haunted"] }
] }
Note how I've differentiated "h" and "p" for ID of the house and ID of the property? If you take my next suggestion you won't need even that.
Taking this one step further, you probably want Houses and Properties in the same collection, say "Locations". In the "Locations" collection, you're just going to stuff all Houses and Properties and mark them with "type":"house" or "type":"property". Then you'll index on the "type" field.
Why? Because now when you write the "first" method, your query is pretty easy. All you do is loop through "bookmarks" and grab the appropriate key ("House1", "PropertyABC") from the "Locations" collection. Paging is straightforward: you query for 10 items and then return.
I know that at some level it seems kind of lame. "Why am I writing a for loop to grab data? I tried to stop doing that 15 years ago!" But Mongo is a "document-oriented" store, so it's optimized for loading individual documents. You're trying to load a bunch of documents, so you have to jump through this little hoop.
Fortunately, it's not all bad. Mongo is really fast at loading individual docs. Running a query to fetch 10 items at once is still going to be very quick.
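To show the shape of that lookup without needing a running database, here is a small Python sketch where plain dicts stand in for the "Locations" collection and a Realtor document. The location_id key, the bookmarks_page helper and the sample data are hypothetical. With a real driver, the dict lookup would be a single query using a filter like {"_id": {"$in": ids}}.

```python
# In-memory stand-in for the "Locations" collection, keyed by _id.
locations = {
    "House1": {"_id": "House1", "type": "house", "address": "1 Main St"},
    "House2": {"_id": "House2", "type": "house", "address": "2 Main St"},
    "PropertyABC": {"_id": "PropertyABC", "type": "property", "landtype": "farm"},
}

# A Realtor document with its bookmarks embedded.
realtor = {
    "name": "John",
    "bookmarks": [
        {"location_id": "House1", "notes": ["Nice location", "High Ask"]},
        {"location_id": "PropertyABC", "notes": ["Haunted"]},
        {"location_id": "House2", "notes": []},
    ],
}

def bookmarks_page(realtor, locations, page=0, per_page=10):
    """Resolve one page of a realtor's bookmarks into location documents."""
    chunk = realtor["bookmarks"][page * per_page:(page + 1) * per_page]
    ids = [b["location_id"] for b in chunk]
    # Stand-in for one server-side query with the filter {"_id": {"$in": ids}}.
    found = {i: locations[i] for i in ids if i in locations}
    # Re-attach each bookmark's notes, keeping the bookmark order.
    return [
        {**found[b["location_id"]], "notes": b["notes"]}
        for b in chunk
        if b["location_id"] in found
    ]

first_page = bookmarks_page(realtor, locations, page=0, per_page=2)
second_page = bookmarks_page(realtor, locations, page=1, per_page=2)
```

Because the bookmarks are embedded in the Realtor document, paging is just a slice of that array followed by one batched lookup.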

Resources