Data Warehousing: Redundant combinations of dimensions - data-warehouse

I have built my own, very basic data warehouse. In it I have very simple cubes, for example:
Fact: ReviewRatingByday
Dimensions: Review, Organization, Date
In the OLTP side of my application, an Organization has a 1 to many relationship with Reviews.
Currently my data warehouse provides my Fact's extract function with all possible combinations of the dimensions. This results in redundant combinations where a given Review is combined with an Organization, yet the Review is in fact associated with a different Organization.
How do other data warehouse systems avoid this?
Should I mirror my OLTP relationships in my Dimensions?

I don't really understand your question. If some combinations of Review and Organization do not exist in the source data, then you will have no rows for them in the fact table anyway. So where is the "redundant combination"?
I think you might be asking, "how do I show users only valid combinations of Review and Organization when they select their report criteria". If that's correct then you have two main options:
Use a reporting tool that is able to present only valid combinations to the user
Combine Review and Organization into a single dimension that contains all valid combinations of Review and Organization (Kimball's term for this is a mini-dimension)
If I misunderstood your question, please give some more information about exactly what your issue is, especially what you mean by "redundant combination".

Related

Handling Contracts extension and licensing/Subscriptions addition/removal in dimensional model

Background: I am trying to design a star schema for a data warehouse. We have the following business model where we have few products that our customers can buy and then use. The customers are companies and then they have people in their organization who can be mapped to the licenses they have brought for products.
I have the following dimensions.
Account_dim: The dimension contains all the list of companies that have are our current/prospective with our company. It could have companies who still don't have a contract with us and are still in a discussion phase. so some rows might not have a contract.
User_dim: This is the list of users the company has nominated point of contacts for their company. So a user will belong to one particular Account in the Account_dim. One account can have many users.
Product_Dim: This dimension contains all the information regarding all the products we sell. The cost of a license and how many users are allowed on a license.So if for example he brought product A a max of two users can use it.
Now I have three tables that have data regarding the contract.
Contract: It contains information regarding a contract we have which will include the contract start date and end date and the account which this contract is assigned to.
products_bought: This table contains the product brought under a contract. A contract can hold multiple products bought.Each product row will have the product start date/end date and the price of the asset the client has paid.
allocated users:Each product bought can have users mapped to it who are allowed to use the product which is the user in user_dim for that account. Basically attaching a license to a user.
I am trying to model the contract, product bought and allocated user so I can generate the following data.
The amount of money a account has spend on products.
THe utilization of licenses by an account. for example an account has a product that allows 3 users but has only one user mapped to it will show the product is under utilized.
I tried denormalizing all three tables into one fact table but the I am running into problem where the contract end date can be changed if it is extended. As well as new assets can be mapped to it. Also last be not least, the company can remove a user and then map another user to the product or remove users because they left the company or add more users.
How can this be best modeled. Because they contract and asset users can change they should be SCDs rather than fact table or how should I implement a fact to handle these changes as well which must be captured as well to maintain history of usage over time.
your best bet is to read a book on how to go about designing a data warehouse: The Data Warehouse Lifecycle Toolkit as this will give you all the information you need to be able to answer questions like this.
However, to specifically address your question, the best way to approach this is as follows:
Define your measures: what are the values that you wish to be able to aggregate in your reports
Define the grain of each measure: what are the dimensions that uniquely identify each measure. For example, a transaction amount might be defined by Store, Customer and Date/Time; if you dropped any of these then the transaction amount would change; if you added another dimension, such as rainfall, it would not change the transaction amount (n.b. having defined the grain of a measure you should never add dimensions that would change the grain e.g. Product Dimension, in this example)
Once you have defined your measures and their grains you can add all the other dimensions to them (that won't affect their grain) andn then decide whether to hold them in separate fact tables or combine them into one fact table:
Rule: if two measures don't have the same grain you must not put them in the same fact table
Guidance: for measures that meet the above rule, if there is also significant overlap in the other dimensions you want to use for each measure then consider combining them into a single fact table. My rule of thumb is that if you have 2-3 dimensions that don't apply to all measures then that's OK; if you hit 5 or more then you probably need to be thinking of splitting the measures into separate facts

How to integrate various data marts?

I recently joined a healthcare company and they have separate datamarts for each type of each type of diseases. Lets say I have three different DM's as follows:
HIV
HepC
Respiratory
How would I go on to integrate these into one Data-warehouse?
From what I have read, this is a Kimball Aprroach.
And I should look for similar dimensions and try to build on that.
Any other recommendations ?
Your question is too vague. Without knowing what you want to do with a data warehouse, and how the data marts are structured, it's hard to comment on how you should go about it. You might want to step back and think about two things, and explain: what do I want to do? and what do I have?
Talk with the stake holders to settle on what you they to have on a data warehouse. How do they want to use a data warehouse? Is it for internal analytics is for simple aggregation reports? If so, what kind of metrics need to be aggregated? If they are doing complex analytics, what kind of metrics do they need for it? I recommend identifying a list of "needs", and prioritize them, so you can think about what dimensions need to be delivered first.
After that, research what you have closely. What does each disease data mart have? Does it have information about disease? taxonomy? patients who have that disease? procedures done for that disease? Identify the structure of the data marts, and make a list of attributes that can be derived from them.
After that, you might have a more fruitful conversation on the methods of integration.

Transaction Fact Table approach

I'm working on financial data mart structure.
And I'm having some doubts on whats the better approach to do so.
The source system database,Dynamics AX 2009, has three tables for customer transaction.
One table for open transactions, where the Customer still needs to pay for service/product;
One table for settle transactions, where it holds what the customer have already paid;
Finally a table that have all customers transactions, holds transactions from open to settle and also others transactions as customer to bank or ledger accounts.
I thought in two options, first I will maintain a fact table representing the three table, fact for open transactions, fact for any customer transaction and fact for settle transaction.
Second is to create a single fact to hold all transactions, to do so I would have to do a full join on three tables.
I'm not sure on both approaches, as the first seems to copy tables from production and create the proper dimension.
On the Second one I would create a massive fact table, that where data would constantly change, as open transaction are delete on source system when they are settle.
Another doubt, should i create a fact with scd(slowly changing dimension) structure to maintain history data?(star date, end date , flag)
It's hard to say from the information given whether this needs to be one or more Fact tables. However, the key point which you should use to decide is whether all of the information is at the same granularity. Consider the grain of your intended Fact table(s) and you should find an answer for whether you need one table or multiple tables.
If all of the information sits at the same grain - i.e. all of the same dimensions apply to all of the measures you are considering putting into the same Fact table - then they can probably all live in the same Fact table. If you're finding that some of the Dimensions wouldn't apply to some of the measures then you probably need to re-think your design. Either you might need multiple Fact tables, or you might need to take all of your measures down to the lowest grain and combine hierarchies into single Dimensions if you currently have them split across multiple Dimensions.
While it's been mentioned that having measures in separate cubes could make it difficult to compare things, keep in mind that you don't need one cube per Fact table. You can have multiple Fact tables in a single cube, and sometimes this is very helpful when you need to be able to compare measures which share some Dimensions but not others. This is far, far better than forcing data which does not have the same grain into one Fact table.
Also, it sounds like what you're trying to model is the sales ledger of an organisation. I'd suggest having a dig around via Google as you may well be able to find materials discussing dimensional data warehouse design for sales ledger structures, rather than reinventing the wheel. If you don't have a decent understanding of the accounting concepts you're trying to model I would especially recommend looking for a reference schema to work from, or failing that doing some reading up on accountancy concepts (and sales ledgers specifically). Understanding the account structure should help you understand what the grain of your Fact table(s) needs to be, how to model the Dimensions, and so on.
This is a really helpful abridged version of Kimball's modelling techniques which discusses grain, and the different types of Fact table, amongst many other topics:
http://www.kimballgroup.com/wp-content/uploads/2013/08/2013.09-Kimball-Dimensional-Modeling-Techniques11.pdf
I think you should just use one fact table (one cube) and use a dimension to differentiate between open/settled/etc. transactions. That's what dimensions are for: They help you to categorize your measures and get a specific view on them. This approach will also open much more possibilities to create knowledge with your cube. With separate cubes for open/settled/etc. transactions, it will be harder or not possible to set this data into contrast.
Since the data is changing constantly, you should consider to update your fact table in a given time and rebuild your cube if it needs to.
If you use scd or not really depends on the data you process and what it is used for. Is there a business case claiming it? Is there a technical use?
I think this is something you have to decide on your own.

Document-oriented database (MongoDB?) and invoicing/ordering?

New to document-oriented database concepts and have a few high-level questions related to orders and order processing.
How does one capture an order in this world? Would an order just be a new document in an Orders collection? Would order_item relate back to a product listed in another document? Or is it assumed that order_item would be copied and inserted into the order document and thus, perhaps, difficult to report the total of product sold over time?
How does one work around lack of transactions and maintain integrity
Sorry, very new to me though eager to understand...it sounds very appealing to encapsulate all these 'things' for sale as "objects" and move them around as such between server & clients, etc...if it's indeed plausible. Just need some help conceptualizing big picture dos and don'ts.
How does one capture an order in this world? Would an order just be a new document in an Orders collection?
Yes. That's the way these databases work.
Would order_item relate back to a product listed in another document?
It could. Depends on what you're doing.
Or is it assumed that order_item would be copied and inserted into the order document
Also possible. This works well for historical analysis and data warehousing.
and thus, perhaps, difficult to report the total of product sold over time?
It's always hard to report total product sold over time.
Today, product "23SKIDOO" is a 23l, open-valved, framistat with double widgets.
Last year, before the recall, the same product was a 23l, closed-valved framistat with only a single widget.
In a previous year, the same product was actually 22.5l.
Are these the "same" product? Marketing calls them all "23SKIDOO". But there are differences.
A single Product table doesn't resolve this correctly. What folks then do is invent product lines and product families so they can introduce the "23SKIDOO-B" and "23SKIDOO-PLUS" products which are all part of the "23SKIDOO" family.
Product lines and product families and other more fanciful groupings are workarounds and hacks to magically make unrelated products report together and provide a "total product sold over time" even though the products are clearly different.
Copying the product into the order (while it seems wasteful) can preserve more historical fidelity than many of the commonly-used workarounds.
How does one work around lack of transactions and maintain integrity?
MongoDB has locks. http://www.mongodb.org/display/DOCS/How+does+concurrency+work.
It's not clear what you mean by lacking transactions.
So its always hard to answer a generic question. However, what I would encourage you to do it look at the patterns of read and write you expect your application to perform. There are trade offs for certain document designs just like there are from RDBMS schema designs.
Here's a link to a MongoDB centric schema design presentation. It may help you to understand some of these trade off and options for design.
http://www.scribd.com/doc/47326395/MongoBoulder-Schema-Design

Design of a data warehouse with more than one fact tables

I'm new to data warehousing. First, I want to precise than my copy of The Data Warehouse Toolkit is on it's way to my mailbox (snail mail :P). But I'm already studying all this stuff with what I find on the net.
What I don't find on the net, however, is what to do when you seems to have more than one fact in a DW. In my case (insurance), I have refunds that occur on a non regular basis. One client can have none for 3 months and then ten in the same months. On the other hands, I have "subscription fee" (not sure what is the correct english term, but you get the point), that occur every month or every three months. That seems clearly like two distinct facts to me.
Those two are kind of loosely coupled by some dimensions, like the client or the "insurance product". Now are these two different warehouse, on which I have to produce two different report and then connect the reports outside of the DW ? Or is there a way to design this to fit a single descent DW. Or should I combine these two facts in one? I would probably lose granularity on refunds then.
Some blog I read said a DW always has one fact table. Others mention the step of designing what are the fact tables with a S, but there is no clear instruction of if there is a link between them or they are just distinct components of a same DW project.
Does anyone know some references on that precise part of DW design?
I realize that I am answering an old post, but I am not satisfied with either of the answers provided. I feel that neither answered the question.
A schema can have one or more facts, but these facts are not linked by any key relationship. It is best practice not to join fact tables in a single query as you would whey querying a normalized/transactional database. Due to the nature of many to many joins, etc - the results would be incorrect if attempted.
The answer you are looking for is that you need to "drill across" which basically means that you are querying each fact table (schema) separately and merging the results. This can occur using SQl or preferably via a reporting/analytics tool that you may have which referenced the data warehouse. Instead of duplicating the answers on how to do this, I will direct everyone to two very good articles:
Three ways to drill across by Chris Adamson
and
The Soul of the Warehouse - Drilling Across by Ralph Kimball
You can have as many fact tables as you like. In your example you may have something like:
dimProduct lists several products -- subscription being one of those.
dimTransactionType would list possible transactions (purchase, refund, recurring subscription fee ...)
Now suppose you are interested in simplified subscription reporting, you could add a factSubscription like this:
Taking your questions backwards.
A data warehouse can have more than one fact table. However, you do want to minimize joins between fact tables. It's ok to duplicate fact information in different fact tables.
Of the objects you mentioned:
Refund is a fact. Timestamp is the dimension of the refund fact.
Subscription fee is a fact. Timestamp is the dimension of the subscription fee fact.
A refund can happen more than once. I'm guessing that each customer has one subscription fee. So it appears we have two fact tables so far, customer, and customer refund.
If you knew that there could only be at the most 3 refunds (as an example), then you would eliminate the customer refund fact table, and put 3 refund columns in the customer table.
You also mention insurance. A customer can have more than one policy. So we have a third fact table.
A data warehouse is usually designed using a star schema. The star schema is basically one fact table connected to one or more dimension tables. You'll probably have more than one star in a data warehouse, since we already defined 3 fact tables.

Resources