I'm trying to design my first data mart with a star schema from an Excel Sheet containing informations about a Help Desk Service calls, this sheet contains 33 fields including different informations and I can't identify the fact table because I want to do the reporting later based on different KPI's.
I want to know how to identify the fact table measures easily and I have another question which is : Can a fact table contain only foreign keys of dimensions and no measures? Thanks in advance guys and sorry for my bad English.
You can have more than one fact table.
A fact table represents an event or process that you want to analyze.
The structure of the fact tables depend on the process or event that you are trying to analyze.
You need to tell us the events or processes that you want to analyze before we can help you further.
Can a fact table contain only foreign keys of dimensions and no measures?
Yes. This is called a factless fact table.
Let's say you want to do a basic analysis of calls:
Your full table might look like this
CALL_ID
START_DATE
DURATION
AGENT_NAME
AGENT_TENURE (how long worked for company)
CUSTOMER_NAME
CUSTOMER_TENURE (how long a customer)
PRODUCT_NAME (the product the customer is calling about)
RESOLVED
You would turn this into a fact table like this:
CALL_ID
START_DATE_KEY
AGENT_KEY
CUSTOMER_KEY
PRODUCT_KEY
DURATION (measure)
RESOLVED (quasi-measure)
And you would have a DATE dimension table, AGENT dimension table, CUSTOMER dimension table and PRODUCT dimension table.
Agile Data Warehouse Design is a good book, as are the ones by Kimball.
In general, the way I've done it (and there are a number of ways to do anything) is that the categorical data is referenced with a FKey in the fact table, but anything you want to perform aggregations on (typically as data types $/integers/doubles etc) can be in the fact table as well. So for example, a fact table might contain a hierarchy of types, such as product_category >> product_name, and it usually contains a time and/or location field as well; all of which would be referenced by a FKEY to a lookup table. The measure columns are usually integer based or money data, and are used in aggregate functions grouped by the other fields like this:
select sum(measureOne) as sum, product_category from facttable
where timeCol between X and Y group by product_category...etc
At one time a few years ago, I did have a fact table that had no measure column... because the only measure I had was based on count, which I would do dynamically by grouping different dimensions in the fact table.
Related
I have the following scenario where OLTP sales data is stored in two separate physical tables:
Sales
Refunds/Cancellations
A refund always refers to an existing sale (thus 'negating' it), though the dimensions of these tables are nearly the same (date, sales clerk, store, etc.). The data schema looks something like the following:
CREATE TABLE sale
(
sale_id uuid,
transaction_at timestamp with time zone,
store_id uuid,
clerk_id uuid,
clerk_number bigint,
currency character varying(3),
pos_id uuid,
total numeric,
net_total numeric
);
CREATE TABLE refund
(
refund_id uuid,
sale_id uuid, -- referenced sale to void
refunded_at timestamp with time zone,
pos_id uuid,
clerk_id uuid,
clerk_number bigint
)
I am trying to figure how to model this data for ingstion in a DW. Since I am relatively new to dimensional modeling I have begun reading The Data Warehouse Toolkit, but I am as of now unsure of the best approach to handle this case.
To my mind, these are two separate fact tables describing two different business processes (e.g. making a sale and getting a refund), though due to normalization concerns the refunds table (besides containing most of the same dimensions) is basically a pointer to a row in the sales table (which is fine for OLTP).
Analytical reports down the line would obviously want to look at these in a few ways:
All net sales per dimension (gross sales minus refunds)
All refund amounts per dimension
Other potential business use cases
As is, the first two cases would require joining the fact tables to either subtract the sales amount (case 1) or to get the information on refund amounts (case 2).
The approach that seems to make the most sense for me is something like the following (via some ETL/ELT process):
Load the (gross) sales data into a table in the DW
Load and denormalize refund data into a table in the DW, joining actual sale data so that amounts etc. are located in the fact table
Join either table with common conformed dimensions for further roll-ups and querying
This makes sense to me because:
Both fact tables have all required information from the physical event, and
There is no explicit dependency between the fact tables, and
Common dimensions can be reused
However, in this case, I still would not be able to get the net sales without joining these two tables. This makes me think that there should be a separate net_sale fact table, but this is problematic:
From a business point of view, sales without refunds are the vast majority of events that occur. A net_sale table would copy basically 99% of all sale data.
From a business process point of view, this table would describe an event that does not exist as such (there is no "net sale", only an aggregated view of sales amount per dimension minus refund costs).
Glossing over the third Chapter in The Data Warehouse Toolkit, I do not see this case mentioned explicitly (though there might be some parallels w.r.t. factless fact-tables and derived facts). What kind of approach would work in a case like this?
I am developing a BI system for our company, from scratch, and currently, I am designing a data warehouse. I am completely new to this so there are many things that I don't really understand, so I need to hear some more insights into this.
My problems are:
1) In our source system, there are tables called "Booking" and "BookingAccess". Booking table holds the data of a booking, such as check-in time and check-out time, booking date, booking number, gross amount of that booking.
Whereas in BookingAccess, it holds foreign keys related to the booking, such as bookerID, customerID, processID, hotelID, paymentproviderID and a current status of that booking. Booking and BookingAccess has a 1:1 relation ship.
Our source system is about checking the validity of those bookings, these bookings are not ours. We receive these booking information from other sources, outsource the above process for them. The gross amount is just an information of that booking that we need to validate, their are not parts of our business. The current status of a booking which is hold in the BookingAccess table is the current status of that booking in our system, which can be "Processing" or "Finshed".
From what I read from Ralph Kimball, in this situation, the "Booking" is the Dimension table, and the BookingAccess should be the fact. I feel that the BookingAccess is some what a [Accumulating Snapshot table], in which I should track the time when a booking is "Processing", and when a booking is "Finshed".
Do I get it right?
2) In "Booking" table, there is also a foreign key called "ImportID". This key links to a table called "Import". This "Import" table hold history records of files (these file contain bookings which will be written to the "Booking" table) which were imported to our system, including attributes such as file name, imported date, total booking imported...
From my point of view, this is clearly a fact table.
But the problem is that, the "Import" table and the "Booking" table has a relationship of one to many (1 ImportID in "Import" table can have 1, 2 or more records which have a same ImportID in "Booking" table). This is against the idea of fact tables which insists that the relationship between Fact and Dimension must be many-to-one, which fact is always in the many side.
So what approach should I use to solve this case? I'm thinking of using bridge tables to solve this problem. But I don't know if this is a good practice, as there are a lot of record in the "Import" table, so I will have to create a big bridge table just to covers all of this.
3) Should I separate a table (from source system) which contains a mix of relationships and information to a fact table containing only relationships, and dimension table containing only information? (For example, a table called "Customer" in source system. This table contains some things like customer name, customer address and customertype id, customer parentID....)
I am asking this because I feel that if I use BI tools to analyze things (for example, analyzing the number of customers which has customertypeid = 1), I feel it's some what weird if there are no fact tables involved in.
Or should I treat it as a mere dimension table and use snowflake-schema? But this will lead to a mix of Star-Schema and snowflake-schema in our Data Warehouse. Is this normal? I have read some official sources (most likely Oracle) stating that one should try to avoid using and mixing snowflake-schema as much as possible. But some sources like Microsoft say that this is very normal. Even the Advanture Work Data Warehouse sample database uses this kind of approach.
Or should I de-normalize every relation in that "Customer" table? But I don't think this is a good approach as it will make the Customer contain a lot of columns, and it will be very hard to track the history of every row in the "DIM_Customer" table. For example, if any change occur in any relation of "Customer" table, the whole "DIM_Customer" table will need to be updated.
I still have a lot of question regarding to Data Warehouse. I am working with it nearly alone, without any help or consultant. So pardon me if I made any kind of inconveniences or mistakes.
First project using star schema, still in planning stage. We would appreciate any thoughts and advice on the following problem.
We have a dimension table for "product features used", and the set of features grows and changes over time. Because of the dynamic set of features, we think the features cannot be columns but instead must be rows.
We have a fact table for "user events", and we need to know which product features were used within each event.
So it seems we need to have a primary key on the fact table, which is used as a foreign key within the dimension table (exactly the opposite direction from a conventional star schema). We have several different dimension tables with similar dynamics and therefore a similar need for a foreign key into the fact table.
On the other hand, most of our dimension tables are more conventional and the fact table can just store a foreign key into these conventional dimension tables. We don't like that this means that some joins (many-to-one) will use the dimension table's primary key, but other joins (one-to-many) will use the fact table's primary key. We have considered using the fact table key as a foreign key in all the dimension tables, just for consistency, although the storage requirements increase.
Is there a better way to implement the keys for the "dynamic" dimension tables?
Here's an example that's not exactly what we're doing but similar:
Suppose our app searches for restaurants.
Optional features that a user may specify include price range, minimum star rating, or cuisine. The set of optional features changes over time (for example we may get rid of the option to specify cuisine, and add an option for most popular). For each search that is recorded in the database, the set of features used is fixed.
Each search will be a row in the fact table.
We are currently thinking that we should have a primary key in the fact table, and it should be used as a foreign key in the "features" dimension table. So we'd have:
fact_table(search_id, user_id, metric1, metric2)
feature_dimension_table(feature_id, search_id, feature_attribute1, feature_attribute2)
user_dimension_table(user_id, user_attribute1, user_attribute2)
Alternatively, for consistent joins and ignoring storage requirements for the sake of argument, we could use the fact table's primary key as a foreign key in all the dimension tables:
fact_table(search_id, metric1, metric2) /* no more user_id */
feature_dimension_table(feature_id, search_id, feature_attribute1, feature_attribute2)
user_dimension_table(user_id, search_id, user_attribute1, user_attribute2)
What are the pitfalls with these key schemas? What would be better ways to do it?
You need a Bridge table, it is the recommended solution for many-to-many relationships between fact and dimension.
http://www.kimballgroup.com/data-warehouse-business-intelligence-resources/kimball-techniques/dimensional-modeling-techniques/multivalued-dimension-bridge-table/
Edit after example added to question:
OK, maybe it is not a bridge, the example changes my view.
A fundamental requirement of dimensional modelling is to correctly identify the grain of your fact table. A common example is invoice and line-item, where the grain is usually line-item.
Hypothetical examples are often difficult because you can never be sure that the example mirrors the real use case, but I think that your scenario might be search-and-criteria, and that your grain should be at the criteria level.
For example, your fact table might look like this:
fact_search (date_id,time_id,search_id,criteria_id,criteria_value)
Thinking about the types of query I might want to do against search data, this design is my best choice. The only issue I see is with the data type of criteria_value, it would have to be a choice/text value, and would definitely be non-additive.
If you have a relational database and you want to start making reports, you might do the following (please let me know if this is incorrect).
Go through your relational database and make a list of all the columns that you want to include in your report.
Group related columns together and then split those (normalise) into additional tables. These are the dimensions.
The dimensions then have a primary key (possibly a combination of two rows), and the fact table has a foreign key to reference each dimension, plus fields that you don't separate out in the first place such as sales value.
The question:
I was originally seeing dimensions as data marts that referenced data from external sources, and a fact table that in turn referenced data in the dimensions.. that's incorrect, isn't it? It's the other way around...
Or in general, if you were to normalise a database you would always replace the columns you take out a table with a foreign key, and add a primary key to the new table?
A fact table represents a process or event that you want to analyze.
Step 1: What is the process or event that you want to analyze?
The columns in the fact table represent all of the variables that are pertinent to your analysis.
Step 2: What variables are pertinent to the analysis?
Whether you "split-out" columns into dimension tables is irrelevant to your understanding. It's an optimization to minimize the space taken up by fact tables.
If you want to discriminate between measures and dimensions, ask
Step 3: What are the (true) numeric values in my fact table? These are your measures.
An example of a true numeric value is a dollar amount, like Sales Order Line Item Extended Price. You can sum it up or take an average of it.
An example of a not true numeric value is Customer ID 12345. It's a number, but represents something that isn't a number (a customer). The sum of customer ids makes no sense, nor does the average. Dig?
Regarding your questions:
Fact tables do not need foreign keys to dimension tables. (hint: see Hot-Swappable Dimensions)
"dimensions as data marts that referenced data from external sources". Hm...maybe, but don't worry about data marts for now. A dimension is just a column in your fact table (that isn't a measure). A dimension table is just a collection of dimensions that are related.
Just start with Excel. Figure out the columns you need in your analysis. Put them in Excel. That's your fact table. If you expect your fact table to get large (100s of MB), then do ONE level of normalization:
Figure out your measures. Leave them in the fact table.
Figure out your dimensions. Group them together (Customer info into one group, Store info into another).
Put them in their own tables. Give them meaningless surrogate keys. Put those keys in the fact table.
We've got a data warehouse design with four dimension tables and one fact table:
dimUser id, email, firstName, lastName
dimAddress id, city
dimLanguage id, language
dimDate id, startDate, endDate
factStatistic id, dimUserId, dimAddressId, dimLanguageId, dimDate, loginCount, pageCalledCount
Our problem is: We want to build the fact table which includes calculating the statistics (depending on userId, date range) and filling the foreign keys.
But we don't know how, because we don't understand how to use natural keys (which seems to be the solution to our problem according to the literature we read).
I believe a natural key would be the userId, which is needed in all ETL jobs which calculate the dimension data.
But there are many difficulties:
in the ETL jobs load(), we do bulk inserts with INSERT IGNORE INTO to remove duplicates => we don't know the surrogate keys which were generated
if we create meta data (including a set of dimension_name, surrogate_key, natural_key) this will not work because of the duplicate elimination
The problem seems to be the duplicate elimination strategy. Is there a better approach?
We are using MySQL 5.1, if it makes any difference.
If your fact table is tracking logins and page calls per user, then you should have set of source tables which track these things, which is where you'll load your fact table data from. I would probably build the fact table at the grain of one row per user / login date - or even lower to persist atomic data if at all possible.
Here you would then have a fact table with two dimensions - User and Date. You can persist address and language as dimensions on the fact as well, but these are really just attributes of user.
Your dimensions should have surrogate keys, but also should have the source "business" or "natural" key available - either as an attribute on the dimension itself, or through a mapping table as your colleague suggested. It's not "wrong" to use a mapping table - it does make things easier when there are multiple sources.
If you store the business keys on a mapping table, or in the dimension as an attribue, then for each row to load in the fact, it's a simple lookup (usually via a join) against the dim or mapping table to get the surrogate key for the user (and then from the user to get the user's "current" address / language to persist on the fact). The date dimension usually hase a surrogate key stored in a YYYYMMDD or other "natural" format - you can just generate this from the date information on your source record that you're loading into the fact.
do not force for single query, try to load the data in separated queries and mix the data in some provider...