I have a dataset that I need to subset into dates that fall within daylight saving time and dates that do not, so I can perform different adjustments on each group. The lookup table DST_HOUR_SHIFT contains the DST start and end dates from 2007-2020 (example below), and the main data table contains hourly meter data with a value for each hour from 2007-2020 (example below). Joins seem to be my Achilles heel and I do not know the most efficient way to separate the data. Any help would be greatly appreciated.
DST_HOUR_SHIFT
DST_BEG DST_END
03/11/2007 11/04/2007
03/09/2008 11/02/2008
03/08/2009 11/01/2009
INITIAL_SYSTEM_LOAD
SHORT_DATE HOUR VALUE
01/01/2007 0 1225.00
01/01/2007 1 1170.00
01/01/2007 2 1124.00
01/01/2007 3 1101.00
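One way to split the data is a range join on the DST window; a minimal sketch, assuming standard SQL and that SHORT_DATE, DST_BEG and DST_END are stored as date types:

-- Hours that fall within DST: an inner join on a BETWEEN range match
SELECT l.SHORT_DATE, l.HOUR, l.VALUE
FROM INITIAL_SYSTEM_LOAD l
JOIN DST_HOUR_SHIFT d
  ON l.SHORT_DATE BETWEEN d.DST_BEG AND d.DST_END;

-- Hours outside DST: the same join as a left join, keeping only the non-matches
SELECT l.SHORT_DATE, l.HOUR, l.VALUE
FROM INITIAL_SYSTEM_LOAD l
LEFT JOIN DST_HOUR_SHIFT d
  ON l.SHORT_DATE BETWEEN d.DST_BEG AND d.DST_END
WHERE d.DST_BEG IS NULL;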
We have some use cases for our DW where we have fact tables at different grains - e.g., sales by store by day (fact 1) and sales budget targets by month (fact 2). Both involve Date as a grain, but in one case the grain is day and in the other the grain is period.
Assuming we can't in the near term change the grain, what's the right way to model this?
A Date and a Month dimension, which will have conformed attributes?
One Date dimension, with nulls or flags or something when it's representing a higher level (e.g., month)?
Something else?
You only need one date dimension with one row per day. Just link to the last day of your period.
E.g. for a monthly aggregated fact just link to the last day of the month in your date dimension.
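A minimal sketch of that approach (table and column names here are made up for illustration; date_key is assumed to be a YYYYMMDD surrogate):

-- The January 2007 budget row links to the last day of January
INSERT INTO monthly_budget_fact (date_key, store_key, budget_amount)
VALUES (20070131, 42, 150000.00);

-- Both facts can then be reported through the same conformed date dimension
SELECT d.calendar_year, d.calendar_month, SUM(f.budget_amount) AS budget
FROM monthly_budget_fact f
JOIN date_dim d ON d.date_key = f.date_key
GROUP BY d.calendar_year, d.calendar_month;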
Two different dimensions, one for Date and one for Month
I'm working on a google sheet that generates a monthly time series based on the following information:
Start Date (C17)
End Date (C18)
Amount at the beginning of the period (C19)
Deductions during the period (D19:L19)
My current setup looks like this:
I need two ArrayFormulas:
One to populate the column's deducted amount (D19:L19) if the date (B21:B) is within the deduction date range (D17:17, D18:18)
One to calculate the effective post-deduction amount at the end of each month.
I am using ArrayFormulas because I expect regular users to have a hard time auto-populating normal formulas after pasting particularly long rows of deduction details.
The time series table is intended to be generated by feeding it data from another table - one that has the monthly starting amount and deduction details in Start Date, End Date and Amount fields. What I've been able to put together so far is the following:
Months in Range:
=datedif($C$17,$C$18,"M") - named range "ScheduleMonths"
Column of Months:
=ArrayFormula(edate(C17,row(B1:indirect("B"&ScheduleMonths))))
Monthly Total Remaining:
=ARRAYFORMULA(C21:indirect("C"&ScheduleMonths+20)-
D21:indirect("D"&ScheduleMonths+20)-E21:indirect("E"&ScheduleMonths+20)-
F21:indirect("F"&ScheduleMonths+20)-G21:indirect("G"&ScheduleMonths+20)-
H21:indirect("H"&ScheduleMonths+20)-I21:indirect("I"&ScheduleMonths+20)-
J21:indirect("J"&ScheduleMonths+20)-K21:indirect("K"&ScheduleMonths+20)-
L21:indirect("L"&ScheduleMonths+20))
This is the normal formula I set up for use in the table fields:
=if(AND(C$17<=$B21,$B21<=C$18),C$19,"")
I expect the resulting ArrayFormulas to populate all monthly deduction cells with the proper deduction amounts whenever the date (B21:B) falls within the deduction date range (D17:17, D18:18). So far, I have only achieved this with the normal formula.
Update: I figured out how to do the post-deduction amount in A22.
=ARRAYFORMULA(C22:indirect("C"&'Amplaine Auto Time Series'!ScheduleMonths+21)-SUMIF(IF(COLUMN(D22:indirect("AC"&'Amplaine Auto Time Series'!ScheduleMonths+21)),ROW(D22:indirect("AC"&'Amplaine Auto Time Series'!ScheduleMonths+21))),ROW(D22:indirect("AC"&'Amplaine Auto Time Series'!ScheduleMonths+21)),D22:indirect("AC"&'Amplaine Auto Time Series'!ScheduleMonths+21)))
The population of individual deduction fields is still a work in progress with growing requirements.
Which way of storing date and time will provide the quickest search if I plan to search on each part separately? I have always stored a single datetime column, but maybe it would be more pragmatic to store them in separate columns (e.g., the database's date and time types) for this purpose?
I haven't run any benchmarks, but I could only see the need to save them as different columns if you're planning on doing massive - and I mean massive - queries on Date alone.
Type                      Size      Description
timestamp with time zone  8 bytes   both date and time, with time zone
date                      4 bytes   date (no time of day)
time without time zone    8 bytes   time of day (no date)
time with time zone       12 bytes  times of day only, with time zone
So if you store them separately, it takes at least 4 bytes more (4-byte date + 8-byte time = 12 bytes vs. 8 bytes for a single timestamp), or 8 bytes more if you keep the time zone on the time field.
If you're going to have a massive number of rows to query, and you'll be querying only on Date, then it might make sense; otherwise I don't think so. (In fact, it might still make no sense: with a massive number of rows you would also use more space unnecessarily, possibly offsetting any perceived advantage.)
I have used integer fields to represent dates down to month granularity (e.g., 201704 as an index), but only because the records were unique monthly records, and there is no date type that reflects just year + month. Otherwise, PostgreSQL is already quite optimised to handle date and timestamp situations.
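A middle ground, if date-only searches are the concern: keep the single timestamp column and index just its date part. A sketch, assuming PostgreSQL and a timestamp without time zone column (table and column names are hypothetical):

-- Expression index on the date part of the timestamp
CREATE INDEX events_date_idx ON events ((created_at::date));

-- A date-only search can then use the index
SELECT count(*) FROM events WHERE created_at::date = DATE '2017-04-01';

-- Caveat: with timestamptz the ::date cast depends on the session time zone,
-- so it cannot be used in an index expression as-is.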
I am working on a system where 200,000+ records have been created in the past year, and I need to plot their creation on a time series with various added filters. At present, this requires performing lots of count queries (30 for each month plotted). How should these dates be stored for maximum speed?
One idea: store the most commonly-visualized data in a number of serialized fields containing counts for each day over the past month. Update each day with cron and serve up as necessary. (Where should these be stored - some new database table or a separate file accessible by Heroku cron?)
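Before reaching for cached counts, it may be enough to collapse the roughly 30 per-month count queries into one grouped query; a minimal sketch, assuming PostgreSQL (table and column names are hypothetical):

-- One pass over the last month instead of ~30 separate counts
SELECT created_at::date AS day, count(*) AS created
FROM records
WHERE created_at >= now() - interval '30 days'
GROUP BY created_at::date
ORDER BY day;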
I'm building a data warehouse. Each fact has its timestamp. I need to create reports by day, month and quarter, but by hour too. Looking at the examples, I see that dates tend to be saved in dimension tables.
[diagram omitted; source: etl-tools.info]
But I think that makes no sense for time: the dimension table would grow and grow. On the other hand, a JOIN with a date dimension table is more efficient than using date/time functions in SQL.
What are your opinions/solutions?
(I'm using Infobright)
Kimball recommends having separate time and date dimensions:
design-tip-51-latest-thinking-on-time-dimension-tables
In previous Toolkit books, we have recommended building such a dimension with the minutes or seconds component of time as an offset from midnight of each day, but we have come to realize that the resulting end user applications became too difficult, especially when trying to compute time spans. Also, unlike the calendar day dimension, there are very few descriptive attributes for the specific minute or second within a day. If the enterprise has well defined attributes for time slices within a day, such as shift names, or advertising time slots, an additional time-of-day dimension can be added to the design where this dimension is defined as the number of minutes (or even seconds) past midnight. Thus this time-of-day dimension would either have 1440 records if the grain were minutes or 86,400 records if the grain were seconds.
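A minute-grain time-of-day dimension of that kind is small enough to generate in one statement; a sketch, assuming PostgreSQL (names are illustrative):

CREATE TABLE time_dim (
    time_key       integer PRIMARY KEY, -- minutes past midnight, 0..1439
    hour_of_day    integer NOT NULL,
    minute_of_hour integer NOT NULL
);

INSERT INTO time_dim (time_key, hour_of_day, minute_of_hour)
SELECT m, m / 60, m % 60
FROM generate_series(0, 1439) AS m;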
My guess is that it depends on your reporting requirement.
If you need something like
WHERE "Hour" = 10
meaning every day between 10:00:00 and 10:59:59, then I would use the time dimension, because it is faster than
WHERE date_part('hour', TimeStamp) = 10
because the date_part() function will be evaluated for every row.
You should still keep the TimeStamp in the fact table in order to aggregate over boundaries of days, like in:
WHERE TimeStamp between '2010-03-22 23:30' and '2010-03-23 11:15'
which gets awkward when using dimension fields.
Usually, a time dimension has minute resolution, so 1440 rows.
Time should be a dimension in data warehouses, since you will frequently want to aggregate over it. You could use a snowflake schema to reduce the overhead. In general, as I pointed out in my comment, hours seem like an unusually high resolution. If you insist on them, making the hour of the day a separate dimension might help, but I cannot tell you if this is good design.
I would recommend having separate dimensions for date and time. The date dimension would have one record for each date in the identified valid range of dates, for example 01/01/1980 to 12/31/2025.
And a separate dimension for time, having 86,400 records, with each second having a record identified by its time key.
In the fact records, where you need both date and time, add both keys as references to these conformed dimensions.
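Wired together, each fact row then carries both keys; a sketch of that design (names are illustrative; time_key here follows this answer's second grain, i.e. seconds past midnight):

CREATE TABLE meter_fact (
    date_key integer NOT NULL REFERENCES date_dim (date_key),
    time_key integer NOT NULL REFERENCES time_dim (time_key),
    value    numeric(12,2) NOT NULL
);

-- "Every day at 10:00:00" becomes a plain key comparison
SELECT f.value
FROM meter_fact f
WHERE f.time_key = 36000; -- 10 * 3600 seconds past midnight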