Can you extract a dimension table from a fact table? - data-warehouse

Here's the situation, in the source database we have more than 600K active rows for a dimension but in reality the business only uses 100 of them.
Unfortunately the list of values that they might use is not known and we can't manually filter on those values to populate the dimension table.
I was thinking, what if I include the dimension columns for that table in the fact table and then when we send that to staging area, just seperate it from the fact and send it to it's own table.
This way, I will only capture the values that are actually used.
P.S. They have a search function in the application that help users navigate through 600K values. it's not like a drop-down field !
Do you have a better recommendation?

Yes - you could build the Dimension from the fact staging table. A couple of things to consider:
If the only attribute for the Dimension is the field in the fact staging table then you can keep this as a degenerate dimension in the fact table; no need to build a dimension table for it - unless you have other requirements that require a standalone dimension table, such as your BI tool needs it.
If there are other attributes you need to include in the dimension then you are still going to need to bring in the source dimension table - but you can filter it using the the values in the fact staging table and only load the used values into your dimension

Related

Should we put all the fields related to the `user` in a `dim_user` table in a data warehouse?

Considering there is a data warehouse contains one fact table and three dimension tables.
Fact table:
fact_orders
Dimension tables:
dim_user
dim_product
dim_date
All the data of these tables are extracted from our business systems.
In the business system, the user has many attributes, some of which could change upon time(mobile, avatar_url, nick_name, status), some others won't change once the record is created(id,gender,register_channel).
So generally in the dim_user table, which fields should we use and why?
Dim_User should have both changeable and unchangeable fields. In denormalized model, it is preferrable to keep all the related attributes of a dimension in a single table.
Also, it is preferrable to keep all the information available about user in the dimension table, as they might be used for reporting purposes. If they won't be needed for reporting purpose, you can skip them.
If you want to keep the history of change of the user, you can consider implementing slowly changing dimensions. Otherwise, you can update the dimension attributes, as and when they change. It is called SCD Type I.

Is A Table Linking a Dimension to a Fact table, a Dimension of a Fact?

In my incomplete view of BI tables, a Fact table represents action and a dimension an entity.
I have a FactOrder table that contains order information (including OrderId and CustomerId). There is a separate dimension for people who are actually connected to the order but are not customers. So they are saved in a separate table called DimServiceUser. A linking table connects an Order to a ServiceUser. Should this intermediate OrderServeruser table be defined as a Dimension, a Fact or another type?
It really is more of a bridge table. Here's what is going on.
Your FactOrder table is a fact table, but it also contains a degenerate dimension. A degenerate dimension acts as a dimension key in the fact table, however does not join to a corresponding dimension table because all its interesting attributes have already been placed in other analytic dimensions. So you have an implied DimOrder in there that didn't require a separate table.
A bridge table can connect a set of values to a single fact table row, or it can connect two dimensions (such as customers and bank accounts). It is a way of handling legitimate many-to-many relationships. A bridge table is like a factless fact table. But in dimensional modeling we do not join fact tables together, whereas it is acceptable to join bridge tables and fact tables together. If you must force your bridge table into being a fact or dimension, it is closer to a fact table. But doing so may make it easier to implement bad modeling habits in the future. If you can call it a bridge instead, I would just go with that. (Make sure you read that third link on "like a factless fact table". It was written by the author of Star Schema: The Complete Reference. That is a pretty well accepted source.)
As OrderService does not contain any facts/measurements therefore you cannot call it a fact table.
Dimension table:
A dimension table contains dimensions of a fact.
They are joined to fact table via a foreign key.
Dimension tables are de-normalized tables.
The Dimension Attributes are the various columns in a dimension table
Dimensions offers descriptive characteristics of the facts with the help of their attributes
No set limit set for given for number of dimensions
The dimension can also contain one or more hierarchical relationships
Based on the above definition of dimension table, I believe you table should be referred as dimension table.
OrderServeruser table should be prefix with "bridge".

Star schema: how to handle dimension table with constantly changing set of columns?

First project using star schema, still in planning stage. We would appreciate any thoughts and advice on the following problem.
We have a dimension table for "product features used", and the set of features grows and changes over time. Because of the dynamic set of features, we think the features cannot be columns but instead must be rows.
We have a fact table for "user events", and we need to know which product features were used within each event.
So it seems we need to have a primary key on the fact table, which is used as a foreign key within the dimension table (exactly the opposite direction from a conventional star schema). We have several different dimension tables with similar dynamics and therefore a similar need for a foreign key into the fact table.
On the other hand, most of our dimension tables are more conventional and the fact table can just store a foreign key into these conventional dimension tables. We don't like that this means that some joins (many-to-one) will use the dimension table's primary key, but other joins (one-to-many) will use the fact table's primary key. We have considered using the fact table key as a foreign key in all the dimension tables, just for consistency, although the storage requirements increase.
Is there a better way to implement the keys for the "dynamic" dimension tables?
Here's an example that's not exactly what we're doing but similar:
Suppose our app searches for restaurants.
Optional features that a user may specify include price range, minimum star rating, or cuisine. The set of optional features changes over time (for example we may get rid of the option to specify cuisine, and add an option for most popular). For each search that is recorded in the database, the set of features used is fixed.
Each search will be a row in the fact table.
We are currently thinking that we should have a primary key in the fact table, and it should be used as a foreign key in the "features" dimension table. So we'd have:
fact_table(search_id, user_id, metric1, metric2)
feature_dimension_table(feature_id, search_id, feature_attribute1, feature_attribute2)
user_dimension_table(user_id, user_attribute1, user_attribute2)
Alternatively, for consistent joins and ignoring storage requirements for the sake of argument, we could use the fact table's primary key as a foreign key in all the dimension tables:
fact_table(search_id, metric1, metric2) /* no more user_id */
feature_dimension_table(feature_id, search_id, feature_attribute1, feature_attribute2)
user_dimension_table(user_id, search_id, user_attribute1, user_attribute2)
What are the pitfalls with these key schemas? What would be better ways to do it?
You need a Bridge table, it is the recommended solution for many-to-many relationships between fact and dimension.
http://www.kimballgroup.com/data-warehouse-business-intelligence-resources/kimball-techniques/dimensional-modeling-techniques/multivalued-dimension-bridge-table/
Edit after example added to question:
OK, maybe it is not a bridge, the example changes my view.
A fundamental requirement of dimensional modelling is to correctly identify the grain of your fact table. A common example is invoice and line-item, where the grain is usually line-item.
Hypothetical examples are often difficult because you can never be sure that the example mirrors the real use case, but I think that your scenario might be search-and-criteria, and that your grain should be at the criteria level.
For example, your fact table might look like this:
fact_search (date_id,time_id,search_id,criteria_id,criteria_value)
Thinking about the types of query I might want to do against search data, this design is my best choice. The only issue I see is with the data type of criteria_value, it would have to be a choice/text value, and would definitely be non-additive.

Identifying the fact table in data warehouse design

I'm trying to design my first data mart with a star schema from an Excel Sheet containing informations about a Help Desk Service calls, this sheet contains 33 fields including different informations and I can't identify the fact table because I want to do the reporting later based on different KPI's.
I want to know how to identify the fact table measures easily and I have another question which is : Can a fact table contain only foreign keys of dimensions and no measures? Thanks in advance guys and sorry for my bad English.
You can have more than one fact table.
A fact table represents an event or process that you want to analyze.
The structure of the fact tables depend on the process or event that you are trying to analyze.
You need to tell us the events or processes that you want to analyze before we can help you further.
Can a fact table contain only foreign keys of dimensions and no measures?
Yes. This is called a factless fact table.
Let's say you want to do a basic analysis of calls:
Your full table might look like this
CALL_ID
START_DATE
DURATION
AGENT_NAME
AGENT_TENURE (how long worked for company)
CUSTOMER_NAME
CUSTOMER_TENURE (how long a customer)
PRODUCT_NAME (the product the customer is calling about)
RESOLVED
You would turn this into a fact table like this:
CALL_ID
START_DATE_KEY
AGENT_KEY
CUSTOMER_KEY
PRODUCT_KEY
DURATION (measure)
RESOLVED (quasi-measure)
And you would have a DATE dimension table, AGENT dimension table, CUSTOMER dimension table and PRODUCT dimension table.
Agile Data Warehouse Design is a good book, as are the ones by Kimball.
In general, the way I've done it (and there are a number of ways to do anything) is that the categorical data is referenced with a FKey in the fact table, but anything you want to perform aggregations on (typically as data types $/integers/doubles etc) can be in the fact table as well. So for example, a fact table might contain a hierarchy of types, such as product_category >> product_name, and it usually contains a time and/or location field as well; all of which would be referenced by a FKEY to a lookup table. The measure columns are usually integer based or money data, and are used in aggregate functions grouped by the other fields like this:
select sum(measureOne) as sum, product_category from facttable
where timeCol between X and Y group by product_category...etc
At one time a few years ago, I did have a fact table that had no measure column... because the only measure I had was based on count, which I would do dynamically by grouping different dimensions in the fact table.

How are dimensions and fact tables related in a star diagram?

If you have a relational database and you want to start making reports, you might do the following (please let me know if this is incorrect).
Go through your relational database and make a list of all the columns that you want to include in your report.
Group related columns together and then split those (normalise) into additional tables. These are the dimensions.
The dimensions then have a primary key (possibly a combination of two rows), and the fact table has a foreign key to reference each dimension, plus fields that you don't separate out in the first place such as sales value.
The question:
I was originally seeing dimensions as data marts that referenced data from external sources, and a fact table that in turn referenced data in the dimensions.. that's incorrect, isn't it? It's the other way around...
Or in general, if you were to normalise a database you would always replace the columns you take out a table with a foreign key, and add a primary key to the new table?
A fact table represents a process or event that you want to analyze.
Step 1: What is the process or event that you want to analyze?
The columns in the fact table represent all of the variables that are pertinent to your analysis.
Step 2: What variables are pertinent to the analysis?
Whether you "split-out" columns into dimension tables is irrelevant to your understanding. It's an optimization to minimize the space taken up by fact tables.
If you want to discriminate between measures and dimensions, ask
Step 3: What are the (true) numeric values in my fact table? These are your measures.
An example of a true numeric value is a dollar amount, like Sales Order Line Item Extended Price. You can sum it up or take an average of it.
An example of a not true numeric value is Customer ID 12345. It's a number, but represents something that isn't a number (a customer). The sum of customer ids makes no sense, nor does the average. Dig?
Regarding your questions:
Fact tables do not need foreign keys to dimension tables. (hint: see Hot-Swappable Dimensions)
"dimensions as data marts that referenced data from external sources". Hm...maybe, but don't worry about data marts for now. A dimension is just a column in your fact table (that isn't a measure). A dimension table is just a collection of dimensions that are related.
Just start with Excel. Figure out the columns you need in your analysis. Put them in Excel. That's your fact table. If you expect your fact table to get large (100s of MB), then do ONE level of normalization:
Figure out your measures. Leave them in the fact table.
Figure out your dimensions. Group them together (Customer info into one group, Store info into another).
Put them in their own tables. Give them meaningless surrogate keys. Put those keys in the fact table.

Resources