I am new to BI/data warehousing, and after building some easy samples, I need to build a more complex structure. My project initially involved product licenses, and I was measuring how many were sold, by month/year and by program, just counting the number of licenses.
Now the requirement is to introduce jump-offs from those metrics. That is, when you come to a certain group of licenses, they want to see a whole different set of metrics for them. For example, if 100 licenses were sold in March 2011, how many of them installed, activated and cancelled the product? (We track that info, but not in the DW.) So, I am looking for the best way to do this. I assume the first thing I have to do is add three dimensions for installed, activated and cancelled - and have three fact tables? Or have one fact table with each license, and have a row for cancelled, installed or activated? (so one license may be repeated). Or have one fact table, with different fields for installed, cancelled, activated? Also, how do you relate one fact table to another? Is it through dimensions, or can they be related in some other ways?
Any help would be much appreciated!
EDIT:
Thanks for the post... I was also thinking the second option is probably the correct one. But in this implementation, I have a unique problem. One of the facts that is measured is the number of licenses that are sold, by date of course. Let's say I add a row for installed, cancelled, activated. The requirement is for them to be able to see a connected fact. For example, if I add individual rows, given a timeframe, I can tell how many were sold, and how many were installed.
But they want to see, given a timeframe, how many were bought, and out of them, how many were installed. E.g., if the timeframe is March, and 100 were sold in March, out of those 100, how many were installed, even though they could have been installed much later than March, so the row date would not be in the timeframe they are looking at. Is this a common problem? How is it solved?
I assume the first thing I have to do is add three dimensions for installed, activated and cancelled - and have three fact tables?
Not really. A license sale is a fact. It has a price.
A license sale has dimensions like date, product, customer and program.
An "installation" or "activation" is a state-change event of a license. You have "events" for each license (sale, install, activate, etc.)
So a license has a "sale" fact, an "installation" fact and an "activation" fact. Each of which is (minimally) a relationship with time.
Or have one fact table with each license, and have a row for cancelled, installed or activated? (so one license may be repeated).
This gives the most flexibility, because each event can be rich with multiple dimensions. A sequence of events can then be organized to provide the history of a license.
This works out very well.
You will often want to create summary tables for simple counts and sums to save having to traverse all events for the most common dashboard metrics.
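As a rough illustration of such a summary table, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names (license_event, license_event_monthly, event_type, event_date) are made up for this example and are not from the original question; a real warehouse would use dimension keys rather than raw text dates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical event-grain fact table: one row per license event.
CREATE TABLE license_event (
    license_id INTEGER NOT NULL,
    event_type TEXT    NOT NULL,
    event_date TEXT    NOT NULL
);
INSERT INTO license_event VALUES
    (1, 'sale',    '2011-03-05'),
    (2, 'sale',    '2011-03-20'),
    (1, 'install', '2011-03-07'),
    (2, 'install', '2011-05-02');

-- Summary table: event counts per calendar month and event type,
-- so dashboards do not have to scan every event row.
CREATE TABLE license_event_monthly AS
SELECT substr(event_date, 1, 7) AS month,
       event_type,
       COUNT(*) AS event_count
FROM license_event
GROUP BY substr(event_date, 1, 7), event_type;
""")

monthly = conn.execute(
    "SELECT month, event_type, event_count "
    "FROM license_event_monthly ORDER BY month, event_type"
).fetchall()
```

Note that this summary counts events by the month in which each event happened. It does not answer "of the licenses sold in March, how many were later installed"; that needs the event rows themselves.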
The requirement is for them to be able to see a connected fact.
Right. You're joining several rows from the fact table together. A row where the event was sold, outer joined with a row where the event was installed, outer joined with a row where the event was activated, etc. It's just outer joins among the facts.
So. Count of sales in March is easy. Event = "Sale". Time is all the rows where time.month = "March". Easy.
Count of sales in March which became installs. Same "March sales" where clause, outer joined with all "install" events for those licenses. Count of "sales" is the same as count(*). Count of installs may be smaller because the outer join puts in some nulls.
Count of sales in March which became activations. The "March sales" where clause, outer joined with all "activation" events. Note that the activation has no date constraint.
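The outer joins described above can be sketched as follows, using Python's built-in sqlite3 module. The schema (a single license_event table with license_id, event_type and event_date) and the sample rows are assumptions made for illustration; the sketch also assumes at most one install and one activation event per license, otherwise the joins would fan out and inflate the counts.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical event-grain fact table: one row per license event.
CREATE TABLE license_event (
    license_id INTEGER NOT NULL,
    event_type TEXT    NOT NULL,
    event_date TEXT    NOT NULL
);
INSERT INTO license_event VALUES
    (1, 'sale',     '2011-03-05'),
    (1, 'install',  '2011-03-07'),
    (2, 'sale',     '2011-03-20'),
    (2, 'install',  '2011-05-02'),
    (2, 'activate', '2011-05-03'),
    (3, 'sale',     '2011-03-28'),
    (4, 'sale',     '2011-04-01'),
    (4, 'install',  '2011-04-02');
""")

# Only the sale row carries the date constraint. Install and activation
# events are matched by license, whenever they happened.
query = """
SELECT COUNT(*)               AS sold,
       COUNT(inst.license_id) AS installed,
       COUNT(act.license_id)  AS activated
FROM license_event AS sale
LEFT JOIN license_event AS inst
       ON inst.license_id = sale.license_id AND inst.event_type = 'install'
LEFT JOIN license_event AS act
       ON act.license_id = sale.license_id AND act.event_type = 'activate'
WHERE sale.event_type = 'sale'
  AND sale.event_date >= '2011-03-01'
  AND sale.event_date <  '2011-04-01'
"""
sold, installed, activated = conn.execute(query).fetchone()
```

With this sample data, license 2 was sold in March and installed in May, and it still counts as an install for the March cohort because COUNT(inst.license_id) skips only the NULLs produced by the outer join.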
Or have one fact table, with different fields for installed, cancelled, activated?
This doesn't work out as well because the table's columns dictate a business process. That business process might change and you'll be endlessly tweaking the columns in the fact table.
Saying it doesn't work out "as well" means it doesn't give ultimate flexibility. In some cases, you don't need ultimate flexibility. In some cases, the industry (or regulations) may define a structure that's quite fixed.
Also, how do you relate one fact table to another? Is it through dimensions, or can they be related in some other ways?
Dimensions by definition. A fact table only has two things -- measurements and FK's to dimensions.
Some dimensions (like "license instance") are degenerate because the dimension may have almost no usable attributes other than a PK.
So you have a "sold" fact that ties to a license, an optional "installed" fact that ties to a license and an optional "activate" fact that ties to a license. The license is an object ID (the database surrogate key) and -- perhaps -- the license identifier itself (maybe a license serial number or something outside the database).
Please buy Ralph Kimball's Data Warehouse Toolkit before doing anything more.
I am trying to answer the questions below, which were given to me by the business. (The business generates revenue from multiple apps through a customer-pay model.) The business is interested in the following:
new users (trend with respect to previous months)
daily active users
Day 1 retention
I came up with the below DM
Dimension: users, app, deviceid, useractions, plan, date
Fact: fact_activity(userid, appid,deviceid, actionid)
Actions could be: app installed, app launch, registered, completed purchase, postedcomments, playgame etc
The questions I have are:
Should the fact table contain action_type instead of actionid (to avoid a join with useractions)?
Definition of day 1 retention: number of apps installed / app launches the next day. How do I avoid counting a single user multiple times when they use multiple devices?
Would it be advisable to have device details in the user dimension, or separate?
If I need to measure average session duration, should I use another fact at session level or tweak the activity fact?
Your questions are really unanswerable without significantly more information about your business processes, data definitions, etc. In effect, you are asking someone to design a dimensional model for you before they can answer your questions, which is obviously not going to happen.
However, I can give you some very generic pointers that may help you:
Dimensions
A Dimension describes an entity, so if attributes can't be described as belonging to the same entity then they shouldn't be in the same dimension. In your case, I assume a Device and a User are not the same thing and therefore they need to be separate dimensions.
Facts
You need to define your measures i.e. what precisely are the things you are going to want to aggregate (count, sum, avg, etc) and how are they defined/calculated.
For each measure, you also need to define its grain, i.e. the minimum set of dimensions that uniquely identify it. Once you have the grain defined, multiple measures that have the same grain can be held in the same fact table; measures with different grains cannot.
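One way to sanity-check a proposed grain is to test whether the chosen dimension columns really identify each row uniquely. The following is only a sketch in plain Python; the helper name grain_violations, the column names and the sample rows are all hypothetical.

```python
from collections import Counter


def grain_violations(rows, grain):
    """Return the grain-key combinations that appear in more than one row."""
    keys = Counter(tuple(row[col] for col in grain) for row in rows)
    return [key for key, count in keys.items() if count > 1]


# Hypothetical activity rows: the same user launches the same app
# on two devices at the same moment.
activity = [
    {"user_id": 1, "app_id": 10, "device_id": "a", "action": "launch", "ts": "2020-01-01T09:00"},
    {"user_id": 1, "app_id": 10, "device_id": "b", "action": "launch", "ts": "2020-01-01T09:00"},
    {"user_id": 2, "app_id": 10, "device_id": "c", "action": "install", "ts": "2020-01-01T10:00"},
]

# Without the device, two rows collide, so this grain is too coarse.
too_coarse = grain_violations(activity, ["user_id", "app_id", "action", "ts"])

# Adding the device makes every row unique for this sample.
unique = grain_violations(activity, ["user_id", "app_id", "device_id", "action", "ts"])
```

A check like this only shows that a grain fits the data you happen to have; the grain still has to be justified by the business definition of the measure.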
Background: I am trying to design a star schema for a data warehouse. We have the following business model: we have a few products that our customers can buy and then use. The customers are companies, and they have people in their organization who can be mapped to the licenses they have bought for products.
I have the following dimensions.
Account_dim: This dimension contains the list of all companies that are current or prospective customers of our company. It can include companies that don't yet have a contract with us and are still in a discussion phase, so some rows might not have a contract.
User_dim: This is the list of users the company has nominated as points of contact for their company. A user belongs to one particular account in Account_dim. One account can have many users.
Product_Dim: This dimension contains all the information regarding all the products we sell: the cost of a license and how many users are allowed on a license. For example, if a customer bought product A, a maximum of two users can use it.
Now I have three tables that have data regarding the contract.
Contract: It contains information regarding a contract we have which will include the contract start date and end date and the account which this contract is assigned to.
products_bought: This table contains the products bought under a contract. A contract can hold multiple products bought. Each product row has the product start date/end date and the price the client has paid for the asset.
allocated users: Each product bought can have users mapped to it who are allowed to use the product; these are users in user_dim for that account. Basically, this attaches a license to a user.
I am trying to model the contract, product bought and allocated user so I can generate the following data.
The amount of money an account has spent on products.
The utilization of licenses by an account. For example, an account that has a product allowing 3 users but has only one user mapped to it will show that the product is under-utilized.
I tried denormalizing all three tables into one fact table, but I am running into a problem: the contract end date can change if the contract is extended, and new assets can be mapped to it. Last but not least, the company can remove a user and then map another user to the product, remove users because they left the company, or add more users.
How can this best be modeled? Because the contract and asset users can change, should they be SCDs rather than fact tables? Or how should I implement a fact to handle these changes, which must also be captured to maintain a history of usage over time?
Your best bet is to read a book on how to go about designing a data warehouse, such as The Data Warehouse Lifecycle Toolkit, as this will give you all the information you need to be able to answer questions like this.
However, to specifically address your question, the best way to approach this is as follows:
Define your measures: what are the values that you wish to be able to aggregate in your reports
Define the grain of each measure: what are the dimensions that uniquely identify each measure. For example, a transaction amount might be defined by Store, Customer and Date/Time; if you dropped any of these then the transaction amount would change; if you added another dimension, such as rainfall, it would not change the transaction amount (n.b. having defined the grain of a measure you should never add dimensions that would change the grain e.g. Product Dimension, in this example)
Once you have defined your measures and their grains you can add all the other dimensions to them (that won't affect their grain) and then decide whether to hold them in separate fact tables or combine them into one fact table:
Rule: if two measures don't have the same grain you must not put them in the same fact table
Guidance: for measures that meet the above rule, if there is also significant overlap in the other dimensions you want to use for each measure then consider combining them into a single fact table. My rule of thumb is that if you have 2-3 dimensions that don't apply to all measures then that's OK; if you hit 5 or more then you probably need to be thinking of splitting the measures into separate facts
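The rule above can be applied mechanically once each measure's grain is written down. Below is a small sketch in plain Python; the measure names and grains are invented for illustration and are not a recommendation for the asker's actual model.

```python
from collections import defaultdict


def group_by_grain(measure_grains):
    """Group measures whose grain (set of dimensions) is identical."""
    groups = defaultdict(list)
    for measure, grain in measure_grains.items():
        groups[frozenset(grain)].append(measure)
    return dict(groups)


# Hypothetical measures and the dimensions that define their grain.
measure_grains = {
    "amount_paid": {"contract", "product", "date"},
    "licensed_seats": {"contract", "product", "date"},
    "user_is_allocated": {"contract", "product", "user", "date"},
}

candidates = group_by_grain(measure_grains)
shared = candidates[frozenset({"contract", "product", "date"})]
```

Here amount_paid and licensed_seats share a grain, so they are candidates for one fact table, while user_is_allocated has a finer grain and must live in a separate fact. Whether same-grain measures should actually be combined is still the judgment call described in the guidance.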
I'm new to dimensional modelling, and I hope you can help me with the following questions.
In the production system I have a transaction table, a sales table for example. The unique identifier is a primary key called SaleId.
Example:
My question is: when modelling the fact table, should the SaleID be included in the fact table as a NaturalKey?
Also should the Fact table have a SurrogateKey?
Please feel free to send me any link as reference.
Thanks in advance
Technically speaking, it is probably not a natural key - it does look system generated. However, sometimes it is very valid to store a system generated ID in a Fact for use as a Degenerate Dimension. Usually, these are cases where either the business users do have sight of this system generated ID (order numbers, invoice numbers, purchase order numbers, etc.), or where there's no other useful way of identifying some rows which can be usefully grouped together.
If the users of your BI solutions are likely to want to be able to drill down into information and look at it by sale, then the SaleID might well be a good candidate for this treatment. Have a think whether there's any other way for them to get to this level - could a customer be associated with two distinct sales on the same day? If so, would your users want to look at them as two separate sales? You might need to speak to them to find out what's going to be useful for them. If for some reason you can't get a clear answer, I'd say keep it - there's little harm, and you can always remove it later if it's not used.
Here's the Kimball group's take on Degenerate Dimensions, in case you're at all unclear on how they work:
http://www.kimballgroup.com/2003/06/design-tip-46-another-look-at-degenerate-dimensions/
As far as Fact table surrogate keys - I always use them. As Kimball's Design Tip #81 points out, they're sometimes useful, and it's the kind of thing I'd rather put in at the beginning and not use than realise later on that it would have been useful to have. Point 2 - where you might want to make updates by inserting new rows and deleting the old ones - certainly applies to work I've done.
The requirement for a primary key in a fact table depends on the type of the fact table. Transactional facts which are never updated do not need it. Periodic snapshots probably don't need it, unless the current period is a to-date measure. Accumulating snapshots definitely need it.
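To make the accumulating-snapshot case concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table fact_order_fulfilment, its columns and the sample values are assumptions for illustration only. The row is inserted when the sale happens and later updated in place as milestones are reached, which is why a key on the fact row is useful.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical accumulating snapshot: one row per sale, updated over time.
CREATE TABLE fact_order_fulfilment (
    fulfilment_key INTEGER PRIMARY KEY,  -- fact table surrogate key
    sale_id        INTEGER NOT NULL,     -- degenerate dimension from the source system
    ordered_date   TEXT    NOT NULL,
    shipped_date   TEXT,
    delivered_date TEXT
);
INSERT INTO fact_order_fulfilment (sale_id, ordered_date)
VALUES (9001, '2015-06-01');
""")

# Look up the surrogate key once, then use it to update the milestone.
key = conn.execute(
    "SELECT fulfilment_key FROM fact_order_fulfilment WHERE sale_id = ?", (9001,)
).fetchone()[0]
conn.execute(
    "UPDATE fact_order_fulfilment SET shipped_date = ? WHERE fulfilment_key = ?",
    ("2015-06-03", key),
)

row = conn.execute(
    "SELECT fulfilment_key, sale_id, ordered_date, shipped_date, delivered_date "
    "FROM fact_order_fulfilment"
).fetchone()
```

A purely transactional fact that is only ever appended to would not need the UPDATE step, which is the distinction drawn above.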
New to document-oriented database concepts and have a few high-level questions related to orders and order processing.
How does one capture an order in this world? Would an order just be a new document in an Orders collection? Would order_item relate back to a product listed in another document? Or is it assumed that order_item would be copied and inserted into the order document and thus, perhaps, difficult to report the total of product sold over time?
How does one work around lack of transactions and maintain integrity?
Sorry, very new to me though eager to understand...it sounds very appealing to encapsulate all these 'things' for sale as "objects" and move them around as such between server & clients, etc...if it's indeed plausible. Just need some help conceptualizing big picture dos and don'ts.
How does one capture an order in this world? Would an order just be a new document in an Orders collection?
Yes. That's the way these databases work.
Would order_item relate back to a product listed in another document?
It could. Depends on what you're doing.
Or is it assumed that order_item would be copied and inserted into the order document
Also possible. This works well for historical analysis and data warehousing.
and thus, perhaps, difficult to report the total of product sold over time?
It's always hard to report total product sold over time.
Today, product "23SKIDOO" is a 23l, open-valved, framistat with double widgets.
Last year, before the recall, the same product was a 23l, closed-valved framistat with only a single widget.
In a previous year, the same product was actually 22.5l.
Are these the "same" product? Marketing calls them all "23SKIDOO". But there are differences.
A single Product table doesn't resolve this correctly. What folks then do is invent product lines and product families so they can introduce the "23SKIDOO-B" and "23SKIDOO-PLUS" products which are all part of the "23SKIDOO" family.
Product lines and product families and other more fanciful groupings are workarounds and hacks to magically make unrelated products report together and provide a "total product sold over time" even though the products are clearly different.
Copying the product into the order (while it seems wasteful) can preserve more historical fidelity than many of the commonly-used workarounds.
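A small sketch of what "copying the product into the order" means, using plain Python dictionaries as stand-ins for documents. No database driver is involved; the place_order helper, the field names and the values are made up for illustration.

```python
import copy

# Hypothetical product catalogue, keyed by SKU.
products = {
    "23SKIDOO": {"sku": "23SKIDOO", "volume_l": 22.5, "valve": "closed", "widgets": 1},
}


def place_order(order_id, sku, quantity):
    """Build an order document that embeds a snapshot of the product."""
    return {
        "_id": order_id,
        "items": [{"product": copy.deepcopy(products[sku]), "quantity": quantity}],
    }


old_order = place_order(1, "23SKIDOO", 2)

# The product is later redesigned under the same name.
products["23SKIDOO"].update(volume_l=23, valve="open", widgets=2)

new_order = place_order(2, "23SKIDOO", 1)
```

The old order still records a 22.5l closed-valve product, while the new order records the redesigned one. If the order had stored only a reference to the product, both orders would now appear to be for the current design.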
How does one work around lack of transactions and maintain integrity?
MongoDB has locks. http://www.mongodb.org/display/DOCS/How+does+concurrency+work.
It's not clear what you mean by lacking transactions.
It's always hard to answer a generic question. However, what I would encourage you to do is look at the patterns of reads and writes you expect your application to perform. There are trade-offs for certain document designs, just as there are for RDBMS schema designs.
Here's a link to a MongoDB-centric schema design presentation. It may help you to understand some of these trade-offs and options for design.
http://www.scribd.com/doc/47326395/MongoBoulder-Schema-Design
I'm new to data warehousing. First, I want to make clear that my copy of The Data Warehouse Toolkit is on its way to my mailbox (snail mail :P). But I'm already studying all this stuff with what I find on the net.
What I don't find on the net, however, is what to do when you seem to have more than one fact in a DW. In my case (insurance), I have refunds that occur on a non-regular basis. One client can have none for 3 months and then ten in the same month. On the other hand, I have "subscription fees" (not sure what the correct English term is, but you get the point) that occur every month or every three months. That seems clearly like two distinct facts to me.
Those two are loosely coupled by some dimensions, like the client or the "insurance product". Now, are these two different warehouses, on which I have to produce two different reports and then connect the reports outside of the DW? Or is there a way to design this to fit a single decent DW? Or should I combine these two facts into one? I would probably lose granularity on refunds then.
Some blog I read said a DW always has one fact table. Others mention the step of designing the fact tables, plural, but there is no clear instruction on whether there is a link between them or whether they are just distinct components of the same DW project.
Does anyone know some references on that precise part of DW design?
I realize that I am answering an old post, but I am not satisfied with either of the answers provided. I feel that neither answered the question.
A schema can have one or more facts, but these facts are not linked by any key relationship. It is best practice not to join fact tables in a single query as you would when querying a normalized/transactional database. Because of the many-to-many joins involved, the results would be incorrect if attempted.
The answer you are looking for is that you need to "drill across", which basically means that you query each fact table (schema) separately and merge the results. This can be done using SQL or, preferably, via a reporting/analytics tool that references the data warehouse. Instead of duplicating the answers on how to do this, I will direct everyone to two very good articles:
Three ways to drill across by Chris Adamson
and
The Soul of the Warehouse - Drilling Across by Ralph Kimball
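For readers who want a feel for drilling across before reading those articles, here is a minimal sketch using Python's built-in sqlite3 module. The tables fact_refund and fact_subscription_fee, their columns and the sample amounts are assumptions for illustration, not taken from the question. Each fact is aggregated separately to the shared (conformed) client dimension, and the two result sets are merged afterwards.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Two hypothetical fact tables that share only the client dimension key.
CREATE TABLE fact_refund (client_id INTEGER, refund_date TEXT, amount REAL);
CREATE TABLE fact_subscription_fee (client_id INTEGER, fee_date TEXT, amount REAL);
INSERT INTO fact_refund VALUES
    (1, '2012-01-10', 50), (1, '2012-01-15', 25), (2, '2012-02-01', 10);
INSERT INTO fact_subscription_fee VALUES
    (1, '2012-01-01', 100), (3, '2012-01-01', 80);
""")

# Step 1: query each fact on its own, aggregated to the common grain (client).
refunds = dict(conn.execute(
    "SELECT client_id, SUM(amount) FROM fact_refund GROUP BY client_id"))
fees = dict(conn.execute(
    "SELECT client_id, SUM(amount) FROM fact_subscription_fee GROUP BY client_id"))

# Step 2: merge the two result sets on the shared dimension.
report = {
    client: {"fees": fees.get(client, 0.0), "refunds": refunds.get(client, 0.0)}
    for client in sorted(set(refunds) | set(fees))
}

# For contrast: joining the facts directly repeats client 1's single fee
# once per refund row, so the fee total is double counted.
naive_fees_client_1 = conn.execute(
    "SELECT SUM(f.amount) FROM fact_subscription_fee AS f "
    "JOIN fact_refund AS r ON r.client_id = f.client_id "
    "WHERE f.client_id = 1"
).fetchone()[0]
```

In this sample, client 1 paid 100 in fees, but the direct join reports 200 because the fee row is matched to both refund rows. The merged report keeps the correct 100 alongside 75 in refunds.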
You can have as many fact tables as you like. In your example you may have something like:
dimProduct lists several products -- subscription being one of those.
dimTransactionType would list possible transactions (purchase, refund, recurring subscription fee ...)
Now suppose you are interested in simplified subscription reporting, you could add a factSubscription like this:
Taking your questions backwards.
A data warehouse can have more than one fact table. However, you do want to minimize joins between fact tables. It's ok to duplicate fact information in different fact tables.
Of the objects you mentioned:
Refund is a fact. Timestamp is the dimension of the refund fact.
Subscription fee is a fact. Timestamp is the dimension of the subscription fee fact.
A refund can happen more than once. I'm guessing that each customer has one subscription fee. So it appears we have two fact tables so far, customer, and customer refund.
If you knew that there could only be at the most 3 refunds (as an example), then you would eliminate the customer refund fact table, and put 3 refund columns in the customer table.
You also mention insurance. A customer can have more than one policy. So we have a third fact table.
A data warehouse is usually designed using a star schema. The star schema is basically one fact table connected to one or more dimension tables. You'll probably have more than one star in a data warehouse, since we already defined 3 fact tables.