Datawarehouse: Multivalued Slowly Changing Dimensions - data-warehouse

I am currently creating a data warehouse for a coffee aggregator in Latin America. They have two main business operations:
buying coffee from farmers and selling it on the international market, and
providing micro-credit loans to these farmers to increase their yield.
My plan is to create a data warehouse on top of their operational systems/DBs.
The first business process I will integrate is the credit operation; after that I will add the buying of coffee from individual farmers.
For the credit operation I envision a single fact table containing the loan amount, with dimensions for farmer, loan officer, etc. But before getting into the fact table concerning loans, I am currently working on the creation of the farmer dimension.
I have a nice little farmer dimension with some keys, geographical location, sex, education, etc.
I would also like to include the "economic production" of the farmer. This is information that is captured in the loan application process and basically says what kind of coffee they produce and the size of the land they produce it on. The relation between farmer and economic production is thus 1:n.
Furthermore, this changes from year to year and is obviously only known for farmers that have submitted a loan application.
The goal of this information is to be able to create (even before the credit fact table exists) some basic figures and insights on the farmers, their spatial distribution and their economic activity/output.
I am thus thinking of having a farmer dimension connected to a "production dimension". This production dimension would be (1) time varying and (2) multivalued. The time variance I plan to implement according to type 2 (valid_from, valid_to and currently_valid columns).
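To make the idea concrete, here is a minimal sketch of such a type 2, multivalued production table, using Python's built-in sqlite3 module. The table name dim_production, the columns farmer_id, coffee_type and land_size_ha, and the helper record_production are assumptions made for illustration; only valid_from, valid_to and currently_valid come from the description above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE dim_production (
        production_key  INTEGER PRIMARY KEY,  -- surrogate key
        farmer_id       INTEGER NOT NULL,     -- assumed natural key of the farmer
        coffee_type     TEXT    NOT NULL,     -- assumed attribute
        land_size_ha    REAL    NOT NULL,     -- assumed attribute
        valid_from      TEXT    NOT NULL,
        valid_to        TEXT,                 -- NULL while the row is current
        currently_valid INTEGER NOT NULL
    )
""")

def record_production(conn, farmer_id, coffee_type, land_size_ha, as_of):
    """Close the current row for this farmer/coffee type, then insert a new version."""
    conn.execute(
        """UPDATE dim_production
           SET valid_to = ?, currently_valid = 0
           WHERE farmer_id = ? AND coffee_type = ? AND currently_valid = 1""",
        (as_of, farmer_id, coffee_type),
    )
    conn.execute(
        """INSERT INTO dim_production
           (farmer_id, coffee_type, land_size_ha, valid_from, valid_to, currently_valid)
           VALUES (?, ?, ?, ?, NULL, 1)""",
        (farmer_id, coffee_type, land_size_ha, as_of),
    )

record_production(conn, 1, "arabica", 2.0, "2012-01-01")
record_production(conn, 1, "robusta", 1.5, "2012-01-01")  # multivalued: second row for the same farmer
record_production(conn, 1, "arabica", 3.0, "2013-01-01")  # new version of the arabica row

current_rows = conn.execute(
    """SELECT coffee_type, land_size_ha FROM dim_production
       WHERE farmer_id = 1 AND currently_valid = 1
       ORDER BY coffee_type"""
).fetchall()
history_count = conn.execute("SELECT COUNT(*) FROM dim_production").fetchone()[0]
```

Each farmer can have several current rows (one per coffee type), and every change keeps the old row with a closed validity interval.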
Since I am rather new to the whole data warehouse scene I have been reading up a lot on common techniques and principles, mainly from Kimball's excellent (!) book. However, I haven't come across anything that describes such a dimension-to-dimension connection.
My questions therefore are:
Is this common and considered a good approach?
Where can I find some information on best practices concerning this matter?
EDIT: A second possibility that I am thinking about would be to create some kind of factless fact table which deals with "customer interaction" (e.g. the loan application process in which such information from farmers is collected). This fact table would then have an FK to the farmer dimension, an FK to the production dimension and an FK to a time dimension table. As it has no facts associated with it, it would only form a kind of 1:n linking table. As far as I can see, the only difference from the former method would be that the time dimension is now in a separate table as opposed to being included in the production table.
EDIT2: Or should I perhaps create a production fact table, although it does not coincide with a business process of the aggregator? In that case the surface area used to produce a certain crop would probably become the measure, and the crop types, varieties, etc. would potentially go into a separate dimension.
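A rough sketch of the production fact table option from EDIT2, again with Python's sqlite3. All table and column names (dim_farmer, dim_crop, fact_production, area_ha, and a plain year column standing in for a date dimension) are hypothetical, and the rows are made-up sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_farmer (farmer_key INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE dim_crop   (crop_key   INTEGER PRIMARY KEY, variety TEXT);

    -- Grain: one row per farmer, crop and year; the measure is the surface area.
    CREATE TABLE fact_production (
        farmer_key INTEGER REFERENCES dim_farmer(farmer_key),
        crop_key   INTEGER REFERENCES dim_crop(crop_key),
        year       INTEGER,
        area_ha    REAL,
        PRIMARY KEY (farmer_key, crop_key, year)
    );

    INSERT INTO dim_farmer VALUES (1, 'north'), (2, 'south');
    INSERT INTO dim_crop   VALUES (1, 'arabica'), (2, 'robusta');
    INSERT INTO fact_production VALUES
        (1, 1, 2012, 2.0), (1, 2, 2012, 1.5), (2, 1, 2012, 4.0), (1, 1, 2013, 3.0);
""")

# Cultivated area per region and year, straight from the fact table.
area_by_region = conn.execute("""
    SELECT f.region, p.year, SUM(p.area_ha)
    FROM fact_production p
    JOIN dim_farmer f USING (farmer_key)
    GROUP BY f.region, p.year
    ORDER BY f.region, p.year
""").fetchall()
```

With this layout the yearly change is captured by new fact rows rather than by versioning a dimension.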

Related

Dimensional Modeling: app session or activity measures

I am trying to answer the questions below, which were given to me by the business. (The business generates revenue from multiple apps through a customer-pays model.) The business is interested in:
new users (trend with respect to previous months)
daily active users
Day 1 retention
I came up with the below DM
Dimension: users, app, deviceid, useractions, plan, date
Fact: fact_activity(userid, appid,deviceid, actionid)
Actions could be: app installed, app launch, registered, completed purchase, postedcomments, playgame etc
The questions I have are:
Should the fact table contain action_type instead of actionid (to avoid a join with useractions)?
Definition of day 1 retention: number of apps installed / app launches the next day. How do I avoid counting a single user multiple times when they use multiple devices?
Would it be advisable to have device details in the user dimension, or separate?
If I need to measure average session duration, should I use another fact at session level or tweak the activity fact?
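As an illustration of the double-counting concern, here is a small Python sketch that computes day 1 retention over distinct users rather than devices. It assumes one possible reading of the definition (share of users who installed on a given day and launched the app the following day); the event rows and the function name day1_retention are made up for the example.

```python
from datetime import date, timedelta

# Hypothetical activity rows: (user_id, device_id, action, activity_date)
events = [
    ("u1", "phone",  "app installed", date(2020, 1, 1)),
    ("u1", "tablet", "app installed", date(2020, 1, 1)),
    ("u1", "phone",  "app launch",    date(2020, 1, 2)),
    ("u1", "tablet", "app launch",    date(2020, 1, 2)),
    ("u2", "phone",  "app installed", date(2020, 1, 1)),
    ("u3", "phone",  "app installed", date(2020, 1, 1)),
    ("u3", "phone",  "app launch",    date(2020, 1, 3)),
]

def day1_retention(events, install_day):
    """Share of users who installed on install_day and launched the app the next day."""
    # Sets of user ids, so a user on several devices is counted once.
    installers = {u for u, _, a, d in events
                  if a == "app installed" and d == install_day}
    next_day = install_day + timedelta(days=1)
    returned = {u for u, _, a, d in events
                if a == "app launch" and d == next_day and u in installers}
    return len(returned) / len(installers) if installers else 0.0

retention = day1_retention(events, date(2020, 1, 1))
```

Here u1 uses two devices but contributes only once to each set, so the result is one retained user out of three installers.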
Your questions are really unanswerable without significantly more information about your business processes, data definitions, etc. In effect, you are asking someone to design a dimensional model for you before they can answer your questions, which is obviously not going to happen.
However, I can give you some very generic pointers that may help you:
Dimensions
A Dimension describes an entity, so if attributes can't be described as belonging to the same entity then they shouldn't be in the same dimension. In your case, I assume a Device and a User are not the same thing and therefore they need to be separate dimensions.
Facts
You need to define your measures i.e. what precisely are the things you are going to want to aggregate (count, sum, avg, etc) and how are they defined/calculated.
For each measure, you also need to define its grain, i.e. the minimum set of dimensions that uniquely identifies it. Once you have the grain defined, measures that share the same grain can be held in the same fact table; measures with different grains cannot.
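One way to sanity-check a proposed grain is to test whether the chosen dimension columns really identify each fact row uniquely. The following Python sketch does that on a few made-up rows; the column names and the helper duplicate_grain_keys are assumptions, not part of the question's model.

```python
from collections import Counter

def duplicate_grain_keys(rows, grain):
    """Return the grain-key combinations that occur more than once in rows."""
    counts = Counter(tuple(row[col] for col in grain) for row in rows)
    return sorted(key for key, n in counts.items() if n > 1)

# Hypothetical activity rows; u1 performs the same action on two devices.
fact_activity = [
    {"user_id": "u1", "app_id": "a1", "device_id": "d1", "action_id": 1, "date": "2020-01-01"},
    {"user_id": "u1", "app_id": "a1", "device_id": "d2", "action_id": 1, "date": "2020-01-01"},
    {"user_id": "u2", "app_id": "a1", "device_id": "d3", "action_id": 1, "date": "2020-01-01"},
]

# Without device_id the key is not unique, so this grain is too coarse.
too_coarse = duplicate_grain_keys(fact_activity, ["user_id", "app_id", "action_id", "date"])

# Adding device_id makes every row unique.
full_grain = duplicate_grain_keys(
    fact_activity, ["user_id", "app_id", "device_id", "action_id", "date"]
)
```

An empty result means the candidate set of dimensions identifies every row in the sample; any returned key shows where a dimension is missing from the grain.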

Handling Contracts extension and licensing/Subscriptions addition/removal in dimensional model

Background: I am trying to design a star schema for a data warehouse. In our business model we have a few products that our customers can buy and then use. The customers are companies, and people in their organization can be mapped to the licenses the company has bought for our products.
I have the following dimensions.
Account_dim: This dimension contains the list of all companies that are our current or prospective customers. It can include companies that don't yet have a contract with us and are still in the discussion phase, so some rows might not have a contract.
User_dim: This is the list of users that a company has nominated as its points of contact. A user belongs to one particular account in Account_dim. One account can have many users.
Product_Dim: This dimension contains all the information regarding the products we sell, including the cost of a license and how many users are allowed on a license. For example, if a customer bought product A, a maximum of two users can use it.
Now I have three tables that have data regarding the contract.
Contract: This contains information regarding a contract, including the contract start date and end date and the account the contract is assigned to.
products_bought: This table contains the products bought under a contract. A contract can hold multiple products. Each product row has the product start date/end date and the price the client has paid for the asset.
allocated users: Each product bought can have users mapped to it who are allowed to use the product; these are the users in User_dim for that account. Basically, this attaches a license to a user.
I am trying to model the contract, products bought and allocated users so I can generate the following data:
The amount of money an account has spent on products.
The utilization of licenses by an account. For example, an account that has a product allowing 3 users but has only one user mapped to it will show that the product is underutilized.
I tried denormalizing all three tables into one fact table, but I am running into the problem that the contract end date can change if the contract is extended, and new assets can be mapped to it. Last but not least, the company can remove a user and then map another user to the product, remove users because they left the company, or add more users.
How can this best be modeled? Because the contract and the asset users can change, should they be SCDs rather than a fact table? Or how should I implement a fact table to handle these changes, which must also be captured to maintain a history of usage over time?
Your best bet is to read a book on how to go about designing a data warehouse, such as The Data Warehouse Lifecycle Toolkit, as this will give you all the information you need to be able to answer questions like this.
However, to specifically address your question, the best way to approach this is as follows:
Define your measures: what are the values that you wish to be able to aggregate in your reports
Define the grain of each measure: what are the dimensions that uniquely identify each measure. For example, a transaction amount might be defined by Store, Customer and Date/Time; if you dropped any of these then the transaction amount would change; if you added another dimension, such as rainfall, it would not change the transaction amount (n.b. having defined the grain of a measure you should never add dimensions that would change the grain e.g. Product Dimension, in this example)
Once you have defined your measures and their grains you can add all the other dimensions to them (that won't affect their grain) and then decide whether to hold them in separate fact tables or combine them into one fact table:
Rule: if two measures don't have the same grain you must not put them in the same fact table
Guidance: for measures that meet the above rule, if there is also significant overlap in the other dimensions you want to use for each measure then consider combining them into a single fact table. My rule of thumb is that if you have 2-3 dimensions that don't apply to all measures then that's OK; if you hit 5 or more then you probably need to be thinking of splitting the measures into separate facts
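The rule above can be sketched in a few lines of Python: list each measure with its grain and group the measures whose grains are identical. The measure names and the grains assigned to them below are purely hypothetical guesses for the contract/license scenario, not a recommended design.

```python
from collections import defaultdict

# Hypothetical measures and their grains (sets of dimension names).
measures = {
    "amount_paid":      {"account", "contract", "product", "date"},
    "licenses_allowed": {"account", "contract", "product", "date"},
    "user_allocated":   {"account", "contract", "product", "user", "date"},
}

def group_by_grain(measures):
    """Group measure names by identical grain; each group is a candidate fact table."""
    groups = defaultdict(list)
    for name, grain in measures.items():
        groups[frozenset(grain)].append(name)
    return sorted(sorted(names) for names in groups.values())

candidate_facts = group_by_grain(measures)
```

In this made-up case the two contract/product-level measures could share a fact table, while the per-user allocation measure has a finer grain and needs its own.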

Transaction Fact Table approach

I'm working on the structure of a financial data mart, and I have some doubts about the best approach.
The source system database, Dynamics AX 2009, has three tables for customer transactions:
One table for open transactions, where the customer still needs to pay for a service/product;
One table for settled transactions, which holds what the customer has already paid;
Finally, a table that has all customer transactions; it holds transactions from open to settled and also other transactions, such as customer to bank or ledger accounts.
I thought of two options. In the first, I would maintain one fact table per source table: a fact for open transactions, a fact for all customer transactions and a fact for settled transactions.
The second is to create a single fact table holding all transactions; to do so I would have to do a full join on the three tables.
I'm not sure about either approach, as the first seems to just copy tables from production and create the proper dimensions.
With the second I would create a massive fact table whose data would constantly change, as open transactions are deleted in the source system when they are settled.
Another doubt: should I create a fact table with an SCD (slowly changing dimension) structure to maintain historical data (start date, end date, flag)?
It's hard to say from the information given whether this needs to be one or more Fact tables. However, the key point which you should use to decide is whether all of the information is at the same granularity. Consider the grain of your intended Fact table(s) and you should find an answer for whether you need one table or multiple tables.
If all of the information sits at the same grain - i.e. all of the same dimensions apply to all of the measures you are considering putting into the same Fact table - then they can probably all live in the same Fact table. If you're finding that some of the Dimensions wouldn't apply to some of the measures then you probably need to re-think your design. Either you might need multiple Fact tables, or you might need to take all of your measures down to the lowest grain and combine hierarchies into single Dimensions if you currently have them split across multiple Dimensions.
While it's been mentioned that having measures in separate cubes could make it difficult to compare things, keep in mind that you don't need one cube per Fact table. You can have multiple Fact tables in a single cube, and sometimes this is very helpful when you need to be able to compare measures which share some Dimensions but not others. This is far, far better than forcing data which does not have the same grain into one Fact table.
Also, it sounds like what you're trying to model is the sales ledger of an organisation. I'd suggest having a dig around via Google as you may well be able to find materials discussing dimensional data warehouse design for sales ledger structures, rather than reinventing the wheel. If you don't have a decent understanding of the accounting concepts you're trying to model I would especially recommend looking for a reference schema to work from, or failing that doing some reading up on accountancy concepts (and sales ledgers specifically). Understanding the account structure should help you understand what the grain of your Fact table(s) needs to be, how to model the Dimensions, and so on.
This is a really helpful abridged version of Kimball's modelling techniques which discusses grain, and the different types of Fact table, amongst many other topics:
http://www.kimballgroup.com/wp-content/uploads/2013/08/2013.09-Kimball-Dimensional-Modeling-Techniques11.pdf
I think you should just use one fact table (one cube) and use a dimension to differentiate between open/settled/etc. transactions. That's what dimensions are for: they help you categorize your measures and get a specific view on them. This approach will also open up many more possibilities to create knowledge with your cube. With separate cubes for open/settled/etc. transactions, it will be harder or even impossible to compare this data.
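A minimal sketch of that single-fact layout, using Python's sqlite3. The table and column names (dim_transaction_status, fact_customer_transaction, status_key, amount) are invented for illustration and are not the Dynamics AX schema; the rows are sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- The status dimension distinguishes open from settled transactions.
    CREATE TABLE dim_transaction_status (status_key INTEGER PRIMARY KEY, status TEXT);

    CREATE TABLE fact_customer_transaction (
        transaction_id INTEGER PRIMARY KEY,
        customer_key   INTEGER,
        status_key     INTEGER REFERENCES dim_transaction_status(status_key),
        amount         REAL
    );

    INSERT INTO dim_transaction_status VALUES (1, 'open'), (2, 'settled');
    INSERT INTO fact_customer_transaction VALUES
        (100, 1, 1, 250.0), (101, 1, 2, 100.0), (102, 2, 2, 75.0), (103, 2, 1, 50.0);
""")

# Open and settled amounts side by side from one fact table.
totals = dict(conn.execute("""
    SELECT s.status, SUM(f.amount)
    FROM fact_customer_transaction f
    JOIN dim_transaction_status s USING (status_key)
    GROUP BY s.status
""").fetchall())
```

Because both statuses live in one fact table, comparing them is a single GROUP BY rather than a query across separate cubes.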
Since the data is changing constantly, you should consider updating your fact table at set intervals and rebuilding your cube when needed.
Whether you use SCD or not really depends on the data you process and what it is used for. Is there a business case that requires it? Is there a technical use?
I think this is something you have to decide on your own.

Can two data marts be entirely different from each other in spite of being (both) part of an organization?

In a textbook, the above-mentioned line is printed, but I don't agree with it. Whether we look from top to bottom or vice versa, each business component has a single business goal (in most cases "profit"). So data marts belonging to two different branches of an organisation must have something in common: the business goal.
Please help me understand the meaning of the statement and let me know if my interpretation is wrong.
In a complex organisation with different business units you will have data that is only relevant to the business unit. Even different types of insurance product (marine, property, public liability) have different information attached to them. Even if you have an enterprise-wide data warehouse you will still have reporting requirements that are specific to individual departments, even individual users.
Business-wide performance management systems (dashboards etc.) will tend to have a relatively narrow set of dimensional data and metrics common to all the business. It is quite reasonable to have a narrower set of business-wide performance metrics sliced by a higher level reporting hierarchy for company-wide reporting. Essentially this is what accounting systems do, although accounts data is by no means the be-all and end-all of management information.
A mature data warehouse system is likely to have a 'board pack', dashboard system or other MIS component where company-wide metrics are reported to senior management. This will be fairly narrow. Individual departments will have richer reporting on data that is specific to their operations, but may have little in common with other departments.
Yes, this is possible. Let's take the example of an organization with several Business Units. Each Business Unit has its own characteristics and Key Performance Indicators. Therefore, each data mart is distinct in its own right. But C-level executives who want a 360-degree view of the organization's performance will want to see the targets for the overall business goals. In such cases there will be a need to connect these data marts.

How do I avoid complex joins in star schema?

My fact table holds a user's score in a course they took. Some of the details of the course, which I have to show on the report, come from more than one table (in the actual OLTP db).
Do I create a non-normalized version of that course entry in a dimension table?
Or do I just join the fact table directly to the course table, joined to the other tables that describe this course (course_type, the faculty who created the course, etc.)?
Snowflaking or bridge tables do make the joins more complicated, and not just from a coding perspective; they also make things less simple for BI users.
In most cases, I would put these directly in existing or additional dimension tables.
For instance, you have a scores fact table, which has the user details in a dimension which may or may not hold demographics on the user (perhaps it's only a bridge). Sometimes it is better to split out demographic information. So even though the gender and age might be associated with a user entity, in the dimensional model, these might be individual dimensions or lumped into a single dimension - all depending on the usage scenarios.
Perhaps your scores are attached to a state and states have regions (snowflake). It might be far more efficient for analysis to have the region dimension linked directly instead of going through the state dimension.
I think what you will find is that the dimensional model is a very pragmatic denormalization approach. The main things which are non-negotiable are the facts - after that the choice of dimensions is very much informed by the behavior of the data, your foresight for common usage scenarios - and avoiding falling into the too few dimensions and too many dimensions problems.
Maybe I do not understand your question, but a fact table in a star schema is supposed to be joined to dimension tables surrounding it.
If you do not feel like making joins, simply create a view, and use the view for reporting.
If you were to post a model (schema), it would be easier to comment/help.
It is a common practice to consolidate several dimensions together, sacrificing normalization in favor of performance. This is usually done when your typical query will need all dimensions together (as opposed to using different bits for different use cases).
Also remember that while you receive a reduction in join overhead, there are some drawbacks:
Loss of flexibility, which might hinder development as the warehouse expands
Full table scans take longer (in traditional row-based RDBMS such as SQL Server)
Disk space consumption
You will have to consider each case separately.
It might be worthwhile to also consider the option of creating a materialized view, if such ability is offered by your RDBMS.
We commonly have a snowflake schema as the physical DWH design, but add a reporting view layer that flattens the snowflake schema into a star schema.
This way your OLAP cube becomes much simpler and easier to manage.
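As a rough sketch of such a reporting view layer, the following Python/sqlite3 example keeps the course data snowflaked in the physical tables and exposes a flattened view for reporting. All table, column and view names (course, course_type, faculty, fact_score, dim_course_flat) and the sample rows are assumptions based on the question, not an actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Physical, snowflaked tables.
    CREATE TABLE course_type (course_type_id INTEGER PRIMARY KEY, type_name TEXT);
    CREATE TABLE faculty     (faculty_id     INTEGER PRIMARY KEY, faculty_name TEXT);
    CREATE TABLE course (
        course_id      INTEGER PRIMARY KEY,
        title          TEXT,
        course_type_id INTEGER REFERENCES course_type(course_type_id),
        faculty_id     INTEGER REFERENCES faculty(faculty_id)
    );
    CREATE TABLE fact_score (
        user_id   INTEGER,
        course_id INTEGER REFERENCES course(course_id),
        score     REAL
    );

    -- Reporting layer: one flat course dimension hiding the snowflake joins.
    CREATE VIEW dim_course_flat AS
    SELECT c.course_id, c.title, t.type_name, f.faculty_name
    FROM course c
    JOIN course_type t ON t.course_type_id = c.course_type_id
    JOIN faculty f     ON f.faculty_id = c.faculty_id;

    INSERT INTO course_type VALUES (1, 'online'), (2, 'classroom');
    INSERT INTO faculty     VALUES (1, 'Dr. Ada'), (2, 'Dr. Bo');
    INSERT INTO course      VALUES (1, 'SQL Basics', 1, 1), (2, 'Statistics', 2, 2);
    INSERT INTO fact_score  VALUES (10, 1, 80.0), (11, 1, 90.0), (10, 2, 70.0);
""")

# Report queries need only a single join, from the fact to the flat view.
report = conn.execute("""
    SELECT d.title, d.type_name, d.faculty_name, AVG(s.score)
    FROM fact_score s
    JOIN dim_course_flat d USING (course_id)
    GROUP BY d.title, d.type_name, d.faculty_name
    ORDER BY d.title
""").fetchall()
```

Report writers and cube definitions then see a plain star schema, while the normalized tables remain the place where data is maintained.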

Resources