Design of a data warehouse with more than one fact tables - data-warehouse

I'm new to data warehousing. First, I want to precise than my copy of The Data Warehouse Toolkit is on it's way to my mailbox (snail mail :P). But I'm already studying all this stuff with what I find on the net.
What I don't find on the net, however, is what to do when you seems to have more than one fact in a DW. In my case (insurance), I have refunds that occur on a non regular basis. One client can have none for 3 months and then ten in the same months. On the other hands, I have "subscription fee" (not sure what is the correct english term, but you get the point), that occur every month or every three months. That seems clearly like two distinct facts to me.
Those two are kind of loosely coupled by some dimensions, like the client or the "insurance product". Now are these two different warehouse, on which I have to produce two different report and then connect the reports outside of the DW ? Or is there a way to design this to fit a single descent DW. Or should I combine these two facts in one? I would probably lose granularity on refunds then.
Some blog I read said a DW always has one fact table. Others mention the step of designing what are the fact tables with a S, but there is no clear instruction of if there is a link between them or they are just distinct components of a same DW project.
Does anyone know some references on that precise part of DW design?

I realize that I am answering an old post, but I am not satisfied with either of the answers provided. I feel that neither answered the question.
A schema can have one or more facts, but these facts are not linked by any key relationship. It is best practice not to join fact tables in a single query as you would whey querying a normalized/transactional database. Due to the nature of many to many joins, etc - the results would be incorrect if attempted.
The answer you are looking for is that you need to "drill across" which basically means that you are querying each fact table (schema) separately and merging the results. This can occur using SQl or preferably via a reporting/analytics tool that you may have which referenced the data warehouse. Instead of duplicating the answers on how to do this, I will direct everyone to two very good articles:
Three ways to drill across by Chris Adamson
and
The Soul of the Warehouse - Drilling Across by Ralph Kimball

You can have as many fact tables as you like. In your example you may have something like:
dimProduct lists several products -- subscription being one of those.
dimTransactionType would list possible transactions (purchase, refund, recurring subscription fee ...)
Now suppose you are interested in simplified subscription reporting, you could add a factSubscription like this:

Taking your questions backwards.
A data warehouse can have more than one fact table. However, you do want to minimize joins between fact tables. It's ok to duplicate fact information in different fact tables.
Of the objects you mentioned:
Refund is a fact. Timestamp is the dimension of the refund fact.
Subscription fee is a fact. Timestamp is the dimension of the subscription fee fact.
A refund can happen more than once. I'm guessing that each customer has one subscription fee. So it appears we have two fact tables so far, customer, and customer refund.
If you knew that there could only be at the most 3 refunds (as an example), then you would eliminate the customer refund fact table, and put 3 refund columns in the customer table.
You also mention insurance. A customer can have more than one policy. So we have a third fact table.
A data warehouse is usually designed using a star schema. The star schema is basically one fact table connected to one or more dimension tables. You'll probably have more than one star in a data warehouse, since we already defined 3 fact tables.

Related

Can I work with Many Fact Tables? My DW has many fact tables, to diferent products

Can I work with many fact tables? In my model I have many fact tables cause I have multiple diferent products in my company.
But the model in analysis services became big, today we have 60 tables in analysis services model.
Is there any documentation about this theme? In all kimbal documentation I only read thins about one Big fact table.
But im Afraid that i could follow the wrong strategy and have problems in the future.
Fact tables do not have a complex mechanism but if you keep adding fact tables in your analysis reports, these can manipulate and complicate your analysis and even can create performance issues related to data fetching from DWH Cubes. Pleas read below topic for detailed information about the facts and decide for yourself as per your business requirement:
https://www.nuwavesolutions.com/fact-tables/

How to Organize an out of control table?

Hello and good morning.
I am working on a side project where I am adding an analytic board to an already existing app. The problem is that now the users table has over 400 columns. My question is that what's a better way of organizing this table such as splintering the table off into separate tables. How do you do that and how do you communicate the tables between the new tables?
Another concern is that If I separate the table will I still be able to save into it through the user model? I have code right now that says:
user.wallet += 100
user.save
If I separate wallet from user and link the two tables will I have to change this code. The reason I'm asking this is that there is a ton of code like this in the app.
Thank you so much if you can help me understanding how to organize a database. As a bonus if there is a book that talks about database organization can you recommend it to me (preferably one that is in rails).
Edit: Is there also a way to do all of this without loosing any data. For example transfer the data to a new column on the new table then destroying the old column.
Please read about:
Database Normalization
You'll get loads of hits when searching for that string and there are many books about database design covering that subject.
It is most likely, that this table of yours lacks normalization, but you have to see yourself!
Just to give an orientation - I would get a little anxious when dealing with a tenth of that number of columns. That saying, I clearly have to stress that there might be well normalized tables with 400 columns as well as sloppily created examples with just 10 columns.
Generally speaking, the probability of dealing with bad designed tables and hence facing trouble simply rises with the number of columns.
So take your time and if you find out, that users table needs normalization next step would indeed be to spread data over several tables. Because that clearly (and most likely even heavily) affects the coding of your application here is where you thoroughly have to balance pros and cons - simply impossible to judge that from far away.
Say, you have substantial problems (e.g. fierce performance problems - you wouldn't post it) that could be eased by normalization there are different approaches of how to split data. Here please read about:
Cardinalities
Usually the new tables are linked by
Foreign Keys
, identical data (like a user id) that appear in multiple tables and that are used to join them, that is.
And finally, yes, you can do that without losing data as the overall amount of information never changes when normalizing.
In case your last question was meant to be technical: There is no problem in reading data from one column and inserting them into a new one (of a new table). That has to happen in a certain order as foreign keys have to be filled before you can use them. See
Referential Integrity
However, quite obvious: Deleting data and dropping columns interferes with the operability of your application. Good planning is due.

Transaction Fact Table approach

I'm working on financial data mart structure.
And I'm having some doubts on whats the better approach to do so.
The source system database,Dynamics AX 2009, has three tables for customer transaction.
One table for open transactions, where the Customer still needs to pay for service/product;
One table for settle transactions, where it holds what the customer have already paid;
Finally a table that have all customers transactions, holds transactions from open to settle and also others transactions as customer to bank or ledger accounts.
I thought in two options, first I will maintain a fact table representing the three table, fact for open transactions, fact for any customer transaction and fact for settle transaction.
Second is to create a single fact to hold all transactions, to do so I would have to do a full join on three tables.
I'm not sure on both approaches, as the first seems to copy tables from production and create the proper dimension.
On the Second one I would create a massive fact table, that where data would constantly change, as open transaction are delete on source system when they are settle.
Another doubt, should i create a fact with scd(slowly changing dimension) structure to maintain history data?(star date, end date , flag)
It's hard to say from the information given whether this needs to be one or more Fact tables. However, the key point which you should use to decide is whether all of the information is at the same granularity. Consider the grain of your intended Fact table(s) and you should find an answer for whether you need one table or multiple tables.
If all of the information sits at the same grain - i.e. all of the same dimensions apply to all of the measures you are considering putting into the same Fact table - then they can probably all live in the same Fact table. If you're finding that some of the Dimensions wouldn't apply to some of the measures then you probably need to re-think your design. Either you might need multiple Fact tables, or you might need to take all of your measures down to the lowest grain and combine hierarchies into single Dimensions if you currently have them split across multiple Dimensions.
While it's been mentioned that having measures in separate cubes could make it difficult to compare things, keep in mind that you don't need one cube per Fact table. You can have multiple Fact tables in a single cube, and sometimes this is very helpful when you need to be able to compare measures which share some Dimensions but not others. This is far, far better than forcing data which does not have the same grain into one Fact table.
Also, it sounds like what you're trying to model is the sales ledger of an organisation. I'd suggest having a dig around via Google as you may well be able to find materials discussing dimensional data warehouse design for sales ledger structures, rather than reinventing the wheel. If you don't have a decent understanding of the accounting concepts you're trying to model I would especially recommend looking for a reference schema to work from, or failing that doing some reading up on accountancy concepts (and sales ledgers specifically). Understanding the account structure should help you understand what the grain of your Fact table(s) needs to be, how to model the Dimensions, and so on.
This is a really helpful abridged version of Kimball's modelling techniques which discusses grain, and the different types of Fact table, amongst many other topics:
http://www.kimballgroup.com/wp-content/uploads/2013/08/2013.09-Kimball-Dimensional-Modeling-Techniques11.pdf
I think you should just use one fact table (one cube) and use a dimension to differentiate between open/settled/etc. transactions. That's what dimensions are for: They help you to categorize your measures and get a specific view on them. This approach will also open much more possibilities to create knowledge with your cube. With separate cubes for open/settled/etc. transactions, it will be harder or not possible to set this data into contrast.
Since the data is changing constantly, you should consider to update your fact table in a given time and rebuild your cube if it needs to.
If you use scd or not really depends on the data you process and what it is used for. Is there a business case claiming it? Is there a technical use?
I think this is something you have to decide on your own.

Natural Key and Fact tables

I'm new on dimensional modelling I believe that you guys can help me in the following doubts.
In the production system I have a transaction table, sales table for example.The unique identifier is a primary key called SaleId.
Example:
My doubt is when modelling the fact table should the SaleID be included in the fact table as a NaturalKey?
Also should the Fact table have a SurrogateKey?
Please feel free to send me any link as reference.
Thanks in advance
Technically speaking, it is probably not a natural key - it does look system generated. However, sometimes it is very valid to store a system generated ID in a Fact for use as a Degenerate Dimension. Usually, these are cases where either the business users do have sight of this system generated ID (order numbers, invoice numbers, purchase order numbers, etc.), or where there's no other useful way of identifying some rows which can be usefully grouped together.
If the users of your BI solutions are likely to want to be able to drill down into information and look at it by sale, then the SaleID might well be a good candidate for this treatment. Have a think whether there's any other way for them to get to this level - could a customer be associated with two distinct sales on the same day? If so, would your users want to look at them as two separate sales? You might need to speak to them to find out what's going to be useful for them. If for some reason you can't get a clear answer, I'd say keep it - there's little harm, and you can always remove it later if it's not used.
Here's the Kimball group's take on Degenerate Dimensions, in case you're at all unclear on how they work:
http://www.kimballgroup.com/2003/06/design-tip-46-another-look-at-degenerate-dimensions/
As far as Fact table surrogate keys - I always use them. As Kimball's Design Tip #81 points out, they're sometimes useful, and it's the kind of thing I'd rather put in at the beginning and not use than realise later on that it would have been useful to have. Point 2 - where you might want to make updates by inserting new rows and deleting the old ones - certainly applies to work I've done.
The requirement for a primary key in a fact table depends on the type of the fact table. Transactional facts which are never updated do not need it. Periodic snapshots probably don't need it, unless the current period is a to-date measure. Accumulating snapshots definitely need it.

Document-oriented database (MongoDB?) and invoicing/ordering?

New to document-oriented database concepts and have a few high-level questions related to orders and order processing.
How does one capture an order in this world? Would an order just be a new document in an Orders collection? Would order_item relate back to a product listed in another document? Or is it assumed that order_item would be copied and inserted into the order document and thus, perhaps, difficult to report the total of product sold over time?
How does one work around lack of transactions and maintain integrity
Sorry, very new to me though eager to understand...it sounds very appealing to encapsulate all these 'things' for sale as "objects" and move them around as such between server & clients, etc...if it's indeed plausible. Just need some help conceptualizing big picture dos and don'ts.
How does one capture an order in this world? Would an order just be a new document in an Orders collection?
Yes. That's the way these databases work.
Would order_item relate back to a product listed in another document?
It could. Depends on what you're doing.
Or is it assumed that order_item would be copied and inserted into the order document
Also possible. This works well for historical analysis and data warehousing.
and thus, perhaps, difficult to report the total of product sold over time?
It's always hard to report total product sold over time.
Today, product "23SKIDOO" is a 23l, open-valved, framistat with double widgets.
Last year, before the recall, the same product was a 23l, closed-valved framistat with only a single widget.
In a previous year, the same product was actually 22.5l.
Are these the "same" product? Marketing calls them all "23SKIDOO". But there are differences.
A single Product table doesn't resolve this correctly. What folks then do is invent product lines and product families so they can introduce the "23SKIDOO-B" and "23SKIDOO-PLUS" products which are all part of the "23SKIDOO" family.
Product lines and product families and other more fanciful groupings are workarounds and hacks to magically make unrelated products report together and provide a "total product sold over time" even though the products are clearly different.
Copying the product into the order (while it seems wasteful) can preserve more historical fidelity than many of the commonly-used workarounds.
How does one work around lack of transactions and maintain integrity?
MongoDB has locks. http://www.mongodb.org/display/DOCS/How+does+concurrency+work.
It's not clear what you mean by lacking transactions.
So its always hard to answer a generic question. However, what I would encourage you to do it look at the patterns of read and write you expect your application to perform. There are trade offs for certain document designs just like there are from RDBMS schema designs.
Here's a link to a MongoDB centric schema design presentation. It may help you to understand some of these trade off and options for design.
http://www.scribd.com/doc/47326395/MongoBoulder-Schema-Design

Resources