We have some use cases for our DW where we have fact tables at different grains, e.g., sales by store by day (fact 1) and sales budget targets by month (fact 2). Both involve Date in their grain, but in one case the grain is day and in the other the grain is a period (month).
Assuming we can't in the near term change the grain, what's the right way to model this?
A Date and a Month dimension, which will have conformed attributes?
One Date dimension, with nulls or flags or something when it's representing a higher level (e.g., month)?
Something else?
You only need one date dimension with one row per day. Just link to the last day of your period.
E.g. for a monthly aggregated fact just link to the last day of the month in your date dimension.
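To make the "link to the last day of the month" idea concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names (dim_date, fact_sales, fact_budget, the yyyymmdd date_key) and the sample figures are all assumptions for illustration, not anything from the question.

```python
import sqlite3
from datetime import date, timedelta

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date (
    date_key  INTEGER PRIMARY KEY,   -- assumed smart key, yyyymmdd
    full_date TEXT    NOT NULL,
    year      INTEGER NOT NULL,
    month     INTEGER NOT NULL
);
CREATE TABLE fact_sales  (date_key INTEGER, store_id INTEGER, amount REAL);
CREATE TABLE fact_budget (date_key INTEGER, store_id INTEGER, target REAL);
""")

# One row per day in the date dimension (January and February 2024 only).
d = date(2024, 1, 1)
while d < date(2024, 3, 1):
    key = d.year * 10000 + d.month * 100 + d.day
    conn.execute("INSERT INTO dim_date VALUES (?, ?, ?, ?)",
                 (key, d.isoformat(), d.year, d.month))
    d += timedelta(days=1)

# Daily-grain sales rows (made-up numbers).
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)",
                 [(20240105, 1, 100.0), (20240120, 1, 150.0), (20240210, 1, 80.0)])

# Monthly-grain budget rows, each keyed to the LAST day of its month.
conn.executemany("INSERT INTO fact_budget VALUES (?, ?, ?)",
                 [(20240131, 1, 200.0), (20240229, 1, 120.0)])

# Roll both facts up to month through the same date dimension, then compare.
query = """
WITH s AS (
    SELECT d.year, d.month, f.store_id, SUM(f.amount) AS sales
    FROM fact_sales f JOIN dim_date d ON d.date_key = f.date_key
    GROUP BY d.year, d.month, f.store_id
), b AS (
    SELECT d.year, d.month, f.store_id, SUM(f.target) AS target
    FROM fact_budget f JOIN dim_date d ON d.date_key = f.date_key
    GROUP BY d.year, d.month, f.store_id
)
SELECT s.year, s.month, s.store_id, s.sales, b.target
FROM s JOIN b
  ON b.year = s.year AND b.month = s.month AND b.store_id = s.store_id
ORDER BY s.year, s.month
"""
monthly = conn.execute(query).fetchall()
for row in monthly:
    print(row)
```

Each fact is aggregated to month separately before the join, so the single budget row per month is not repeated once per sales row.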
Two different dimensions, one for Date and one for Month
I have a simple data warehouse with an existing data mart. This data mart includes a date dimension table. Because the date grain of the fact table is day, then the date dimension table is also at the grain of day (i.e., 1 row per day). Because date is a hierarchical dimension, I have de-normalized the hierarchy into the date dimension table. So, even though the grain of the date dimension table is day, it also includes attributes like week, month, and year.
I have a new data mart that I'm designing whose fact table date grain is month. So, I have to join to a month dimension from this new fact table. What is the best implementation of this month dimension? That is, should it be a view using the date dimension table? Or, should it be its own physical table?
I have learned ROLLUP, CUBE and GROUPING SETS, but one thing confuses me: how do I know which to use? For example, if I need to find the sales for each month in 1996 by region and by manager, the two candidate queries are:
SELECT month, region, sales_mgr, SUM(price)
FROM Sales
WHERE year = 1996
GROUP BY GROUPING SETS((month, region),(month, sales_mgr))
and
SELECT month, region, sales_mgr, SUM(price)
FROM Sales
WHERE year = 1996
GROUP BY ROLLUP(month, region, sales_mgr)
I know the result of each one, but I don't know which to use to answer the question properly. Is there something I missed, or are both considered correct?
ROLLUP and CUBE are just shorthand for two common usages of GROUPING SETS.
GROUPING SETS gives more precise control of which aggregations you want to calculate.
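One way to see the difference is to write out which grouping sets each shorthand stands for. The sketch below does that in plain Python; the helper names rollup_sets and cube_sets are made up for illustration, and the expansion follows the standard SQL definition (ROLLUP = every prefix of the column list, CUBE = every subset).

```python
from itertools import combinations

def rollup_sets(cols):
    """Grouping sets that ROLLUP(cols) expands to: each prefix, longest first, ending with ()."""
    return [tuple(cols[:n]) for n in range(len(cols), -1, -1)]

def cube_sets(cols):
    """Grouping sets that CUBE(cols) expands to: every subset of cols."""
    return [combo
            for n in range(len(cols), -1, -1)
            for combo in combinations(cols, n)]

cols = ["month", "region", "sales_mgr"]
rollup = rollup_sets(cols)
cube = cube_sets(cols)

# The two aggregations the question asks for.
wanted = [("month", "region"), ("month", "sales_mgr")]

print(rollup)
print([w in rollup for w in wanted])
print([w in cube for w in wanted])
```

ROLLUP(month, region, sales_mgr) contains (month, region) but never (month, sales_mgr), and it adds a full-detail level, a month-only level and a grand total that were not asked for. CUBE contains both wanted sets plus six others. Only the GROUPING SETS query computes exactly the two requested aggregations.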
I am a beginner in data warehousing. I have two fact tables, named sales and budget.
I can put days (the Date dimension key) in my sales fact, but the table I have for budget only goes down to month detail, so I don't know what I should do. Would you please tell me what the best practices are in this case?
regards
Mana
In this scenario, I generally find it easiest to store the month level data always on either the first/last day of the month. This way, you can still aggregate up to month from date and compare sales & budget; and you will only store the budget value once a month as intended. This would also help if down the road you're asked to store the budget data at the day level.
If you don't want to use this approach, then you would want to snowflake out your date dimension and have a separate month dimension, and then your budget fact table can FK to this new dimension.
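For the separate-month-dimension route, one possible shape is to derive the month dimension from the day-grain date dimension so the month attributes stay conformed. This is only a sketch using Python's sqlite3 module: dim_month is built here as a view (a physical table loaded from the same query would work the same way), and the names dim_date, dim_month, month_key and fact_budget are all assumed for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date (
    date_key  INTEGER PRIMARY KEY,   -- assumed smart key, yyyymmdd
    year      INTEGER NOT NULL,
    month     INTEGER NOT NULL,
    month_key INTEGER NOT NULL       -- assumed smart key, yyyymm
);

-- Month dimension derived from the date dimension: one row per month.
CREATE VIEW dim_month AS
SELECT DISTINCT month_key, year, month
FROM dim_date;

-- Budget fact at month grain, keyed to the month dimension.
CREATE TABLE fact_budget (month_key INTEGER NOT NULL, target REAL NOT NULL);
""")

# A few sample days spanning two months.
conn.executemany("INSERT INTO dim_date VALUES (?, ?, ?, ?)", [
    (20240130, 2024, 1, 202401),
    (20240131, 2024, 1, 202401),
    (20240201, 2024, 2, 202402),
])
conn.executemany("INSERT INTO fact_budget VALUES (?, ?)",
                 [(202401, 200.0), (202402, 120.0)])

months = conn.execute(
    "SELECT month_key, year, month FROM dim_month ORDER BY month_key").fetchall()
budget_by_month = conn.execute("""
    SELECT m.year, m.month, b.target
    FROM fact_budget b JOIN dim_month m ON m.month_key = b.month_key
    ORDER BY m.month_key
""").fetchall()
print(months)
print(budget_by_month)
```

Because month_key also lives on dim_date, daily sales can be rolled up to the same key and compared with the budget rows.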
I'm building a data warehouse. Each fact has its own timestamp. I need to create reports by day, month and quarter, but also by hour. Looking at the examples, I see that dates tend to be saved in dimension tables.
But I think that this makes no sense for time: the dimension table would grow and grow. On the other hand, a JOIN with a date dimension table is more efficient than using date/time functions in SQL.
What are your opinions/solutions?
(I'm using Infobright)
Kimball recommends having separate time and date dimensions:
Design Tip #51: Latest Thinking on Time Dimension Tables
In previous Toolkit books, we have recommended building such a dimension with the minutes or seconds component of time as an offset from midnight of each day, but we have come to realize that the resulting end user applications became too difficult, especially when trying to compute time spans. Also, unlike the calendar day dimension, there are very few descriptive attributes for the specific minute or second within a day. If the enterprise has well defined attributes for time slices within a day, such as shift names, or advertising time slots, an additional time-of-day dimension can be added to the design where this dimension is defined as the number of minutes (or even seconds) past midnight. Thus this time-of-day dimension would either have 1440 records if the grain were minutes or 86,400 records if the grain were seconds.
My guess is that it depends on your reporting requirement.
If you need something like
WHERE "Hour" = 10
meaning every day between 10:00:00 and 10:59:59, then I would use the time dimension, because it is faster than
WHERE date_part('hour', TimeStamp) = 10
because the date_part() function will be evaluated for every row.
You should still keep the TimeStamp in the fact table in order to aggregate over boundaries of days, like in:
WHERE TimeStamp between '2010-03-22 23:30' and '2010-03-23 11:15'
which gets awkward when using dimension fields.
Usually, a time dimension has minute resolution, so 1440 rows.
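As a rough illustration of the above, here is a minute-grain time-of-day dimension built with Python's sqlite3 module, together with a fact table that keeps both the time key and the raw timestamp. Table names (dim_time, fact_event), the time_key helper and the sample events are assumptions for the sketch, and the timestamps are stored as ISO-style text so that a plain BETWEEN works in SQLite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE dim_time (
    time_key INTEGER PRIMARY KEY,   -- minutes past midnight, 0..1439
    hour     INTEGER NOT NULL,
    minute   INTEGER NOT NULL,
    label    TEXT    NOT NULL       -- e.g. '10:05'
)""")
conn.executemany(
    "INSERT INTO dim_time VALUES (?, ?, ?, ?)",
    [(m, m // 60, m % 60, f"{m // 60:02d}:{m % 60:02d}") for m in range(24 * 60)],
)

# The fact keeps the time key for slicing AND the raw timestamp for range filters.
conn.execute(
    "CREATE TABLE fact_event (event_ts TEXT NOT NULL, time_key INTEGER NOT NULL, qty INTEGER NOT NULL)")

def time_key(ts):
    """Minutes past midnight for a 'YYYY-MM-DD HH:MM' timestamp string."""
    hh, mm = ts[11:16].split(":")
    return int(hh) * 60 + int(mm)

events = [("2010-03-22 23:30", 1), ("2010-03-23 10:05", 2),
          ("2010-03-23 10:59", 3), ("2010-03-23 11:15", 4)]
conn.executemany("INSERT INTO fact_event VALUES (?, ?, ?)",
                 [(ts, time_key(ts), qty) for ts, qty in events])

row_count = conn.execute("SELECT COUNT(*) FROM dim_time").fetchone()[0]

# "Everything in the 10 o'clock hour, any day": filter on the dimension attribute.
ten_oclock_qty = conn.execute("""
    SELECT SUM(f.qty)
    FROM fact_event f JOIN dim_time t ON t.time_key = f.time_key
    WHERE t.hour = 10
""").fetchone()[0]

# A range crossing midnight: filter on the timestamp kept in the fact table.
range_qty = conn.execute("""
    SELECT SUM(qty) FROM fact_event
    WHERE event_ts BETWEEN '2010-03-22 23:30' AND '2010-03-23 11:15'
""").fetchone()[0]

print(row_count, ten_oclock_qty, range_qty)
```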
Time should be a dimension in a data warehouse, since you will frequently want to aggregate over it. You could use a snowflake schema to reduce the overhead. In general, as I pointed out in my comment, hours seem like an unusually high resolution. If you insist on them, making the hour of the day a separate dimension might help, but I cannot tell you whether this is good design.
I would recommend having separate dimensions for date and time. The date dimension would have one record for each date within an agreed valid range of dates, for example 01/01/1980 to 12/31/2025.
The separate time dimension would have 86,400 records, one per second, each identified by a time key.
In fact records where you need both date and time, add both keys, referencing these conformed dimensions.
I've just begun diving into data warehousing and I have one question that I just can't seem to figure out.
I have a business which has ten stores, each with a certain number of employees. In my data warehouse I have a dimension representing the store. The employee dimension is an SCD, with start/end columns and the store at which the employee is working.
My fact table is based on suggestions the employees give (anonymously) to the store managers. This table contains the suggestion type (cleanliness, salary issue, etc), the date it was submitted (foreign keyed to a Time dimension table), and the store at which it was submitted.
What I want to do is create a report showing the ratio of the number of suggestions to the number of employees in a given year. Because the number of employees changes periodically I just can't do a simple query for the total number of employees.
Unfortunately I've searched the web quite a bit trying to find a solution but the majority of the examples are retail based sales, which is different from what I'm trying to do.
Any help would be appreciated. I do have the AdventureWorksDW installed on my machine so I can use that as a point of reference if anyone offers a suggestion using that.
Thanks in advance!
The slowly changing dimension should have a natural key that identifies the source of the row (otherwise, how would it know what to compare against to detect changes?). This key should be constant across all versions of a given dimension member. You can get a count of employees by computing a distinct count of the natural key.
Edit: If your transaction table (suggestion) has a date on it, a distinct count of employees grouped by a computed function of the suggestion date (e.g. datepart (yy, s.SuggestionDate)) and the business unit should do it. You don't need to worry about the date on the employee dimension as the applicable row should join directly to the transaction table.
Add another fact table for number of Employees in each store for each month -- you could use max number for the month. Then average months for the year, use this as "number of employees in a year".
Load your new fact table at the end of each month. The new table would look like:
fact table: EmployeeCount
KeyEmployeeCount int -- surrogate key
KeyDate int -- FK to date dimension, point to last day of a month
KeyStore int -- FK to store dimension
NumberOfEmployes int -- (max) number of employees for the month in a given store
If you need a finer resolution, use "per week" or even "per day". The main idea is to average the NumberOfEmployes measure for a given store over the year.
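Here is a small sketch of that idea with Python's sqlite3 module. The EmployeeCount columns follow the layout above; the Suggestion table, the yyyymmdd encoding of KeyDate (which lets the year be taken by integer division instead of joining to the date dimension) and all the numbers are assumptions made for the example.

```python
import sqlite3
from calendar import monthrange

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE EmployeeCount (
    KeyEmployeeCount INTEGER PRIMARY KEY,  -- surrogate key
    KeyDate          INTEGER NOT NULL,     -- last day of the month, assumed yyyymmdd
    KeyStore         INTEGER NOT NULL,
    NumberOfEmployes INTEGER NOT NULL      -- (max) headcount for the month
);
CREATE TABLE Suggestion (                  -- hypothetical suggestion fact
    KeyDate  INTEGER NOT NULL,
    KeyStore INTEGER NOT NULL
);
""")

# Store 1 in 2023: 10 employees for January-June, 14 for July-December.
for month in range(1, 13):
    last_day = monthrange(2023, month)[1]
    key_date = 2023 * 10000 + month * 100 + last_day
    conn.execute(
        "INSERT INTO EmployeeCount (KeyDate, KeyStore, NumberOfEmployes) VALUES (?, ?, ?)",
        (key_date, 1, 10 if month <= 6 else 14),
    )

# Six suggestions submitted at store 1 during 2023.
conn.executemany("INSERT INTO Suggestion VALUES (?, ?)",
                 [(20230115, 1), (20230220, 1), (20230505, 1),
                  (20230808, 1), (20231010, 1), (20231201, 1)])

query = """
WITH headcount AS (
    SELECT KeyStore, KeyDate / 10000 AS yr, AVG(NumberOfEmployes) AS avg_employees
    FROM EmployeeCount
    GROUP BY KeyStore, KeyDate / 10000
), suggestions AS (
    SELECT KeyStore, KeyDate / 10000 AS yr, COUNT(*) AS n
    FROM Suggestion
    GROUP BY KeyStore, KeyDate / 10000
)
SELECT h.KeyStore, h.yr, s.n * 1.0 / h.avg_employees AS suggestions_per_employee
FROM headcount h
JOIN suggestions s ON s.KeyStore = h.KeyStore AND s.yr = h.yr
"""
ratios = conn.execute(query).fetchall()
print(ratios)
```

With a real date dimension you would join on KeyDate and group by its year attribute rather than dividing the key.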