Background: I am trying to design a star schema for a data warehouse. Our business model is as follows: we have a few products that our customers can buy and then use. The customers are companies, and people in their organizations can be mapped to the licenses they have bought for our products.
I have the following dimensions.
Account_dim: This dimension contains the list of all companies that are current or prospective customers of our company. It can include companies that don't yet have a contract with us and are still in the discussion phase, so some rows might not have a contract.
User_dim: This is the list of users each company has nominated as points of contact. A user belongs to one particular account in Account_dim. One account can have many users.
Product_Dim: This dimension contains all the information regarding the products we sell: the cost of a license and how many users are allowed on a license. For example, if a customer bought product A, a maximum of two users can use it.
Now I have three tables that have data regarding the contract.
Contract: It contains information regarding a contract we have which will include the contract start date and end date and the account which this contract is assigned to.
products_bought: This table contains the products bought under a contract. A contract can hold multiple products bought. Each product row has the product start date/end date and the price the client has paid for the asset.
allocated users: Each product bought can have users mapped to it who are allowed to use the product; these are users from User_dim for that account. Basically, this attaches a license to a user.
I am trying to model the contract, product bought and allocated user so I can generate the following data.
The amount of money an account has spent on products.
The utilization of licenses by an account. For example, an account that has a product allowing 3 users but has only one user mapped to it would show that the product is underutilized.
I tried denormalizing all three tables into one fact table, but I am running into a problem: the contract end date can change if the contract is extended, and new assets can be mapped to it. Last but not least, the company can remove a user and map another user to the product, remove users because they left the company, or add more users.
How can this best be modeled? Because the contract and the asset users can change, should they be SCDs rather than fact tables? Or how should I implement a fact to handle these changes, which must also be captured to maintain a history of usage over time?
Your best bet is to read a book on how to go about designing a data warehouse, such as The Data Warehouse Lifecycle Toolkit, as this will give you all the information you need to be able to answer questions like this.
However, to specifically address your question, the best way to approach this is as follows:
Define your measures: what are the values that you wish to be able to aggregate in your reports
Define the grain of each measure: what are the dimensions that uniquely identify each measure. For example, a transaction amount might be defined by Store, Customer and Date/Time; if you dropped any of these then the transaction amount would change; if you added another dimension, such as rainfall, it would not change the transaction amount (n.b. having defined the grain of a measure you should never add dimensions that would change the grain e.g. Product Dimension, in this example)
Once you have defined your measures and their grains you can add all the other dimensions to them (that won't affect their grain) and then decide whether to hold them in separate fact tables or combine them into one fact table:
Rule: if two measures don't have the same grain you must not put them in the same fact table
Guidance: for measures that meet the above rule, if there is also significant overlap in the other dimensions you want to use for each measure then consider combining them into a single fact table. My rule of thumb is that if you have 2-3 dimensions that don't apply to all measures then that's OK; if you hit 5 or more then you probably need to be thinking of splitting the measures into separate facts
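As a rough sketch of the rule above applied to the tables in the question (the row layouts, column names and numbers are all hypothetical, and max_users is copied onto the fact row only to keep the example short): the price paid is defined per product bought, while a license allocation is defined per product bought per user. The two measures have different grains, so they sit in separate fact tables here.

```python
from collections import defaultdict

# Hypothetical fact, grain: one row per product bought under a contract.
product_bought_fact = [
    {"account": "ACME", "contract": "C1", "product": "A", "price_paid": 1000.0, "max_users": 3},
    {"account": "ACME", "contract": "C1", "product": "B", "price_paid": 500.0, "max_users": 2},
    {"account": "GLOBEX", "contract": "C2", "product": "A", "price_paid": 1000.0, "max_users": 3},
]

# Hypothetical factless fact, grain: one row per user allocated to a product bought.
allocation_fact = [
    {"account": "ACME", "contract": "C1", "product": "A", "user": "alice"},
    {"account": "ACME", "contract": "C1", "product": "B", "user": "alice"},
    {"account": "ACME", "contract": "C1", "product": "B", "user": "bob"},
]


def spend_by_account(bought_rows):
    """Sum price_paid per account; safe because each purchase appears exactly once."""
    totals = defaultdict(float)
    for row in bought_rows:
        totals[row["account"]] += row["price_paid"]
    return dict(totals)


def utilization(bought_rows, allocation_rows):
    """Allocated users divided by allowed users, per (contract, product)."""
    allocated = defaultdict(int)
    for row in allocation_rows:
        allocated[(row["contract"], row["product"])] += 1
    return {
        (row["contract"], row["product"]): allocated[(row["contract"], row["product"])] / row["max_users"]
        for row in bought_rows
    }
```

If price_paid were stored on the allocation rows instead, product B would be counted once per allocated user and the spend total would be inflated, which is the practical reason for keeping measures of different grain apart.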
Related
I am trying to answer the questions below, posed by the business. (The business generates revenue from multiple apps through a customer-pay model.) The business is interested in the following:
new users (trend with respect to previous months)
daily active users
Day 1 retention
I came up with the below DM
Dimension: users, app, deviceid, useractions, plan, date
Fact: fact_activity(userid, appid,deviceid, actionid)
Actions could be: app installed, app launch, registered, completed purchase, postedcomments, playgame etc
The questions I have are:
Should the fact table contain action_type instead of actionid (to avoid a join with useractions)?
Definition of day 1 retention: number of apps installed / app launches the next day. How do I avoid counting a single user multiple times when they use multiple devices?
Would it be advisable to have device details in the user dimension, or separate?
If I need to measure average session duration, should I use another fact at session level or tweak the activity fact?
Your questions are really unanswerable without significantly more information about your business processes, data definitions, etc. In effect, you are asking someone to design a dimensional model for you before they can answer your questions, which is obviously not going to happen.
However, I can give you some very generic pointers that may help you:
Dimensions
A Dimension describes an entity so if attributes can't be described as belonging to the same entity then they shouldn't be in the same dimension. In your case, I assume a Device and a User are not the same thing and therefore they need to be separate dimensions
Facts
You need to define your measures i.e. what precisely are the things you are going to want to aggregate (count, sum, avg, etc) and how are they defined/calculated.
For each measure, you also need to define its grain i.e. what is the minimum set of dimensions that uniquely identify it. Once you have the grain defined, if multiple measures have the same grain then they can be held in the same fact table and if they don't then they can't
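Purely as an illustration of what "defining a measure" can look like, here is a minimal sketch. It assumes an activity fact with one row per user, app, device, action and date, and it assumes day 1 retention means "share of users who installed on a given day and launched the app the following day". Both are assumptions for the example, not definitions from the question. Counting distinct users rather than rows is what keeps a user with several devices from being counted more than once.

```python
from datetime import date, timedelta

# Hypothetical activity fact rows; the values are made up for illustration.
fact_activity = [
    {"user": "u1", "app": "game", "device": "phone", "action": "app installed", "date": date(2024, 1, 1)},
    {"user": "u1", "app": "game", "device": "tablet", "action": "app installed", "date": date(2024, 1, 1)},
    {"user": "u2", "app": "game", "device": "phone", "action": "app installed", "date": date(2024, 1, 1)},
    {"user": "u1", "app": "game", "device": "phone", "action": "app launch", "date": date(2024, 1, 2)},
    {"user": "u1", "app": "game", "device": "tablet", "action": "app launch", "date": date(2024, 1, 2)},
]


def daily_active_users(rows, day):
    """Distinct users with any activity on the given day, regardless of device."""
    return len({r["user"] for r in rows if r["date"] == day})


def day1_retention(rows, install_day):
    """Share of users who installed on install_day and launched the next day."""
    installers = {
        r["user"] for r in rows
        if r["action"] == "app installed" and r["date"] == install_day
    }
    next_day = install_day + timedelta(days=1)
    returned = {
        r["user"] for r in rows
        if r["action"] == "app launch" and r["date"] == next_day
    }
    if not installers:
        return 0.0
    return len(installers & returned) / len(installers)
```

Here user u1 appears on two devices but counts as one installer and one returning user.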
I am currently creating a data warehouse for a (coffee) aggregator in Latin America. They have two main business operations:
buying coffee from farmers and selling it in the international market, and
providing micro-credit loans to these farmers to increase their yield.
My plan is to create a data warehouse on top of their operational systems/dbs.
The first business process I will integrate is the credit operation, after that I will add the buying of the coffee of individual farmers.
For the credit operation I envision a single fact table which consists of the loan-amount, with dimensions to farmer, loan-officer etc. But before getting into the fact table concerning loans, I am currently working on the creation of the farmer dimension.
I have a nice little farmer dimension with some keys, geographical location, sex, education, etc.
I also would like to include the "economic production" of the farmer. This is information that is captured in the loan application process and basically says what kind of coffee they produce and the size of the land they produce this on. The relation between farmer and economic production is thus 1:n
Furthermore, this changes from year to year and is obviously only known for farmers that have done a loan application.
The goal of this information, is to be able to (even before the credit fact table is created) create some basic figures / insights on the farmers, their spatial distribution and their economic activity/output.
I am thus thinking of having a farmer dimension connected to a "production dimension". This production dimension would be (1) time varying and (2) multivalued. The time variance I plan to implement according to type 2 (valid_from, valid_to and currently_valid columns).
Since I am rather new to the whole data warehouse scene I have been reading up a lot on common techniques and principles, mainly from Kimball's excellent (!) book. However, I haven't come across anything which describes such a dimension-dimension connection.
My questions therefore are:
is this common and considered a good approach?
where can I find some information on best practices concerning this matter
EDIT: A second possibility that I am thinking about would be to create some kind of factless fact table which deals with "customer interaction" (e.g. the loan application process in which such information from farmers is collected). This fact table would then have an FK to the farmer and an FK to the production dimension as well as an FK to a time dimension table. As it has no facts associated with it, this would only form a kind of 1:n linking table. In my opinion, the only difference from the former method would be that the time dimension is now in a separate table as opposed to being included in the production table.
EDIT2: Or should I perhaps create a production fact table, although it does not coincide with a business process of the aggregator? In that case the surface area used to produce a certain crop would probably become the measure, and potentially the crop types / varieties etc. would go into a separate dimension.
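For reference, the time-varying, multivalued production data described above (type 2 with valid_from and valid_to) could be queried along these lines. This is only a sketch under assumptions: the column names, the 9999-12-31 "open end" convention and the sample rows are hypothetical, and it does not settle whether the rows belong in a dimension or a fact table.

```python
from datetime import date

OPEN_END = date(9999, 12, 31)  # assumed convention for "still valid"

# Hypothetical production rows: several crops per farmer, each with a validity period.
production_rows = [
    {"farmer_id": 1, "crop": "arabica", "hectares": 2.0,
     "valid_from": date(2022, 1, 1), "valid_to": date(2022, 12, 31)},
    {"farmer_id": 1, "crop": "arabica", "hectares": 2.5,
     "valid_from": date(2023, 1, 1), "valid_to": OPEN_END},
    {"farmer_id": 1, "crop": "robusta", "hectares": 1.0,
     "valid_from": date(2023, 1, 1), "valid_to": OPEN_END},
]


def production_as_of(rows, farmer_id, day):
    """Return every production row of a farmer that was valid on the given day."""
    return [
        r for r in rows
        if r["farmer_id"] == farmer_id and r["valid_from"] <= day <= r["valid_to"]
    ]


def total_hectares_as_of(rows, farmer_id, day):
    """Total cultivated area of a farmer on the given day."""
    return sum(r["hectares"] for r in production_as_of(rows, farmer_id, day))
```

A farmer with no loan application simply has no rows, so the lookup returns an empty list for them.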
Are there any free or good paid tools that allow business users to edit data warehouse dimensions and then initiate updates to related tables?
I am looking for a really simple solution. One example is to let business users change the Product dimension so they can assign/change Product Category or Price.
I am on SQL Server 2008R2
Just as an example about back applying: when the user changes a product price they may wish to back date it. This requires the following changes:
Create a new dimension record (assuming this is an SCD2), generating a new surrogate key with a start/end date
Replace the old surrogate key in the fact with the new surrogate key from the effective date
So this is at the very least a two step process which I wrap up in a stored procedure called by the ADP
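The two steps above can be sketched roughly as follows. This uses plain Python lists in place of tables; the column names, the 9999-12-31 end date for the current version and the sample data are assumptions for illustration, and the sketch assumes the effective date falls within the validity period of the current version.

```python
from datetime import date, timedelta

OPEN_END = date(9999, 12, 31)  # assumed marker for the current version of a row

# Hypothetical SCD2 product dimension and a fact table that references it.
product_dim = [
    {"product_key": 1, "product_code": "P1", "price": 10.0,
     "valid_from": date(2024, 1, 1), "valid_to": OPEN_END},
]
sales_fact = [
    {"date": date(2024, 2, 1), "product_key": 1, "qty": 1},
    {"date": date(2024, 3, 15), "product_key": 1, "qty": 2},
]


def backdate_price_change(dim_rows, fact_rows, product_code, new_price, effective_date):
    """Apply a back-dated price change to an SCD2 dimension and its fact rows."""
    current = next(
        r for r in dim_rows
        if r["product_code"] == product_code and r["valid_to"] == OPEN_END
    )
    old_key = current["product_key"]
    new_key = max(r["product_key"] for r in dim_rows) + 1

    # Step 1: create the new dimension record and close the old one.
    new_row = dict(current, product_key=new_key, price=new_price,
                   valid_from=effective_date, valid_to=OPEN_END)
    current["valid_to"] = effective_date - timedelta(days=1)
    dim_rows.append(new_row)

    # Step 2: repoint fact rows from the effective date onwards to the new key.
    for fact in fact_rows:
        if fact["product_key"] == old_key and fact["date"] >= effective_date:
            fact["product_key"] = new_key
    return new_key


new_key = backdate_price_change(product_dim, sales_fact, "P1", 12.0, date(2024, 3, 1))
```

After the call, the February sale still points at the old version and the March sale points at the new one. In a real warehouse both steps would normally run in one transaction so the fact table never references a half-applied change.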
Again, all the usual suspects (Microsoft, IBM, etc.) have what they call MDM tools, but they are all really complicated, requiring definition of a business model (which is fair enough).
I have built my own, very basic data warehouse. In it I have very simple cubes, for example:
Fact: ReviewRatingByday
Dimensions: Review, Organization, Date
In the OLTP side of my application, an Organization has a 1 to many relationship with Reviews.
Currently my data warehouse provides my Fact's extract function with all possible combinations of the dimensions. This results in redundant combinations where a given Review is combined with an Organization, yet the Review is in fact associated with a different Organization.
How do other data warehouse systems avoid this?
Should I mirror my OLTP relationships in my Dimensions?
I don't really understand your question. If some combinations of Review and Organization do not exist in the source data, then you will have no rows for them in the fact table anyway. So where is the "redundant combination"?
I think you might be asking, "how do I show users only valid combinations of Review and Organization when they select their report criteria". If that's correct then you have two main options:
Use a reporting tool that is able to present only valid combinations to the user
Combine Review and Organization into a single dimension that contains all valid combinations of Review and Organization (Kimball's term for this is a mini-dimension)
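A small sketch of the difference between "all possible combinations" and "valid combinations", assuming the source reviews each carry the id of their organization (the field names and sample rows are hypothetical):

```python
from itertools import product

# Hypothetical OLTP rows: each review belongs to exactly one organization.
reviews = [
    {"review_id": "r1", "organization_id": "o1", "rating": 4},
    {"review_id": "r2", "organization_id": "o1", "rating": 2},
    {"review_id": "r3", "organization_id": "o2", "rating": 5},
]

review_ids = sorted({r["review_id"] for r in reviews})
organization_ids = sorted({r["organization_id"] for r in reviews})

# Cross join of the two dimensions: includes pairs that never occur in the source.
all_combinations = list(product(review_ids, organization_ids))

# Pairs that actually occur in the source data; these are the only ones that
# a fact load driven by the source rows would ever produce.
valid_combinations = sorted({(r["review_id"], r["organization_id"]) for r in reviews})

# Fact rows built from the source rows rather than from the cross join.
fact_rows = [
    {"review_id": r["review_id"], "organization_id": r["organization_id"], "rating": r["rating"]}
    for r in reviews
]
```

With three reviews and two organizations the cross join yields six pairs, while only three exist in the source. The list of valid pairs is also what a combined Review/Organization dimension would contain.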
If I misunderstood your question, please give some more information about exactly what your issue is, especially what you mean by "redundant combination".
New to document-oriented database concepts and have a few high-level questions related to orders and order processing.
How does one capture an order in this world? Would an order just be a new document in an Orders collection? Would order_item relate back to a product listed in another document? Or is it assumed that order_item would be copied and inserted into the order document and thus, perhaps, difficult to report the total of product sold over time?
How does one work around the lack of transactions and maintain integrity?
Sorry, very new to me though eager to understand...it sounds very appealing to encapsulate all these 'things' for sale as "objects" and move them around as such between server & clients, etc...if it's indeed plausible. Just need some help conceptualizing big picture dos and don'ts.
How does one capture an order in this world? Would an order just be a new document in an Orders collection?
Yes. That's the way these databases work.
Would order_item relate back to a product listed in another document?
It could. Depends on what you're doing.
Or is it assumed that order_item would be copied and inserted into the order document
Also possible. This works well for historical analysis and data warehousing.
and thus, perhaps, difficult to report the total of product sold over time?
It's always hard to report total product sold over time.
Today, product "23SKIDOO" is a 23l, open-valved, framistat with double widgets.
Last year, before the recall, the same product was a 23l, closed-valved framistat with only a single widget.
In a previous year, the same product was actually 22.5l.
Are these the "same" product? Marketing calls them all "23SKIDOO". But there are differences.
A single Product table doesn't resolve this correctly. What folks then do is invent product lines and product families so they can introduce the "23SKIDOO-B" and "23SKIDOO-PLUS" products which are all part of the "23SKIDOO" family.
Product lines and product families and other more fanciful groupings are workarounds and hacks to magically make unrelated products report together and provide a "total product sold over time" even though the products are clearly different.
Copying the product into the order (while it seems wasteful) can preserve more historical fidelity than many of the commonly-used workarounds.
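A minimal sketch of the copy-into-the-order idea, using plain Python dictionaries to stand in for documents (no database driver is involved; the field names and the place_order helper are hypothetical):

```python
import copy

# Hypothetical product catalogue, keyed by SKU.
products = {
    "23SKIDOO": {"sku": "23SKIDOO", "volume_l": 22.5, "valve": "closed", "widgets": 1},
}


def place_order(order_id, sku, qty):
    """Build an order document that embeds a snapshot of the product as sold."""
    return {
        "_id": order_id,
        "items": [{"qty": qty, "product": copy.deepcopy(products[sku])}],
    }


old_order = place_order(1, "23SKIDOO", 2)

# The product is later redesigned; the catalogue entry changes in place.
products["23SKIDOO"].update({"volume_l": 23.0, "valve": "open", "widgets": 2})

new_order = place_order(2, "23SKIDOO", 1)
```

The first order still records the 22.5l, closed-valved product that was actually sold, while the second records the redesigned one. Had the orders only stored a reference to the catalogue entry, both would now appear to contain the current product.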
How does one work around lack of transactions and maintain integrity?
MongoDB has locks. http://www.mongodb.org/display/DOCS/How+does+concurrency+work.
It's not clear what you mean by lacking transactions.
It's always hard to answer a generic question. However, what I would encourage you to do is look at the patterns of reads and writes you expect your application to perform. There are trade-offs for certain document designs, just as there are for RDBMS schema designs.
Here's a link to a MongoDB-centric schema design presentation. It may help you understand some of these trade-offs and design options.
http://www.scribd.com/doc/47326395/MongoBoulder-Schema-Design