What to do when daylight savings results in duplicate data rows? - data-warehouse

I have a fact table for energy consumption as follows:
f_meter_data:
utc_calendar_id
local_calendar_id
meter_id
reading
timestamp
The calendar table is structured as per the Kimball recommendations. Following the Data Warehouse Toolkit, I have two calendar IDs so that users can query on both local and UTC time.
This is all well and good but the problems arise when daylight savings kicks in.
As the granularity is half-hour periods, there will be duplicate fact records when the clocks change.
And when the clocks change in the other direction there will be a gap in the data.
How can I handle this situation?
Should I average the duplicate values and store that instead?
And for when it's a gap in the data, should I use an average of the point immediately before and the point immediately after the gap?

I have a feeling this question may end up getting closed as "primarily opinion based", but my particular opinion is that the system should be set up to deal with the fact that not every day has exactly 24 hours. There may be 23, 24 or 25. (Or, if you're on Lord Howe Island, 23.5, 24 or 24.5).
Depending on when your extra hour falls (which will be different for each time zone), you may have something like:
00 01a 01b 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23
Or you might consider coupling the hour with the local UTC offset, like:
00-04:00 01-04:00 01-05:00 02-05:00 03-05:00 etc...
Or if you're doing half-hour buckets:
00:00-04:00 00:30-04:00 01:00-04:00 01:30-04:00 01:00-05:00 01:30-05:00 ...
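A minimal sketch of how such a half-hour listing could be generated, assuming Python 3.9+ with the standard zoneinfo module and an available tz database. The function name half_hour_buckets and the choice of America/New_York on 2013-11-03 are only illustrative assumptions; it also assumes local midnight exists in the chosen zone.

```python
from datetime import date, datetime, timedelta, timezone
from zoneinfo import ZoneInfo


def half_hour_buckets(local_day: date, tz_name: str) -> list[str]:
    """Label every 30-minute bucket of one local calendar day as HH:MM+offset."""
    tz = ZoneInfo(tz_name)
    next_day = local_day + timedelta(days=1)
    # Local midnight at both ends of the day, expressed in UTC.
    start = datetime(local_day.year, local_day.month, local_day.day,
                     tzinfo=tz).astimezone(timezone.utc)
    end = datetime(next_day.year, next_day.month, next_day.day,
                   tzinfo=tz).astimezone(timezone.utc)

    labels = []
    t = start
    while t < end:
        # Step in UTC so no bucket is skipped or merged, then label in local time.
        labels.append(t.astimezone(tz).isoformat(timespec="minutes")[11:])
        t += timedelta(minutes=30)
    return labels


fall_back = half_hour_buckets(date(2013, 11, 3), "America/New_York")
print(len(fall_back))  # 50 buckets on the 25-hour day
print(fall_back[:6])
```

A spring-forward day in the same zone gives 46 buckets and an ordinary day gives 48, so totals per day stay correct without any averaging.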
It probably wouldn't be appropriate to do any averaging to align to 24 hours. If you did, then totals would be off.
You also should consider how people will be using the data. Will they be trying to figure out trends across a given hour of the day? If so, then how will they compensate for a spike or dip caused by the DST transition? It may be as simple as putting an asterisk and footnote on the output report. Or it may be much more involved than that, depending on the usage.
Also, you said you're working with 30-minute intervals. Be aware that there are some time zones that are 45-minute offset (Nepal, Chatham Islands, and a small region in Australia). So if you're trying to cover the entire world then you would need 15-minute interval buckets.
And, as Whichert pointed out in comments, if you're using UTC then there is no daylight saving time. It's only when you group by local-time that you'll have this concern.
You may also find the graphs in the DST tag wiki useful.

I think you should simplify this with your business. That is, when the clock is turned back, you turn back your records too: push the old records for that interval out into a warning or error table and store the new ones for the same interval.
As Matt suggested, reports run by local time would not tell the true story anyway. So why present wrong data in the reports?
Or, to follow up on Matt's advice again, change your interval records. Do not bind the time interval to the local_id. Instead, use an Interval_seq_id that counts 30-minute intervals, so a given day has 46 records (1-46), 48 records (1-48) or 50 records (1-50), depending on your region and whether the clocks change that day. This removes your duplicate problem on Local_Int_starttime and Time_interval_Endtime, because the records are no longer dependent on or bound to the local time intervals.
This does move the issue to your reports and query tools, which now have to decide how to display time in graphs that have duplicates on local time, especially if you want to do analytics based on local time and meter reading. Still, the database design now differentiates the records through Interval_Seq_id rather than through the time interval.
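A rough sketch of how such a sequence number could be derived from a UTC reading time, assuming Python 3.9+ with zoneinfo. The function name interval_seq_id and the example zone are hypothetical and not part of the schema in the question; the sketch also assumes local midnight exists in the zone.

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo


def interval_seq_id(utc_ts: datetime, tz_name: str):
    """Return (local calendar date, 1-based 30-minute interval number)."""
    tz = ZoneInfo(tz_name)
    utc_ts = utc_ts.astimezone(timezone.utc)
    local = utc_ts.astimezone(tz)
    # Start of the local day, converted to UTC so elapsed time is measured correctly.
    midnight_utc = datetime(local.year, local.month, local.day,
                            tzinfo=tz).astimezone(timezone.utc)
    seq = (utc_ts - midnight_utc) // timedelta(minutes=30) + 1
    return local.date(), seq


# Both readings show 01:00 on the local clock, but they get different ids.
first_0100 = interval_seq_id(datetime(2013, 11, 3, 5, 0, tzinfo=timezone.utc), "America/New_York")
second_0100 = interval_seq_id(datetime(2013, 11, 3, 6, 0, tzinfo=timezone.utc), "America/New_York")
print(first_0100, second_0100)
```

On a 25-hour fall-back day the last interval of the day gets number 50, so the pair (local date, sequence id) stays unique even where the local clock time repeats.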

There is a similar thread about daylight savings problems in C# here.
The answer goes into deep details about daylight savings. I believe the problem is somewhat similar.

Related

Do UNIX timestamps change across timezones?

As the subject asks; do UNIX timestamps change in each timezone?
For example, if I sent a request to another server on the other side of the world saying, "Send out an email when the time is 1397484936", would the other server's timestamp be 12 hours behind my own?
The definition of UNIX timestamp is time zone independent. The UNIX timestamp is the number of seconds (or milliseconds) elapsed since an absolute point in time, midnight of Jan 1 1970 in UTC time. (UTC is Greenwich Mean Time without Daylight Savings time adjustments.)
Regardless of your time zone, the UNIX timestamp represents a moment that is the same everywhere. Of course you can convert back and forth to a local time zone representation (time 1397484936 is such-and-such local time in New York, or some other local time in Djakarta) if you want.
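As a small illustration, assuming Python 3.9+ with zoneinfo, the timestamp from the question can be rendered in several zones and still maps back to the same number. The two zone names are just examples.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

ts = 1397484936  # the timestamp from the question

utc = datetime.fromtimestamp(ts, tz=timezone.utc)
new_york = datetime.fromtimestamp(ts, tz=ZoneInfo("America/New_York"))
jakarta = datetime.fromtimestamp(ts, tz=ZoneInfo("Asia/Jakarta"))

print(utc.isoformat())       # 2014-04-14T14:15:36+00:00
print(new_york.isoformat())  # 2014-04-14T10:15:36-04:00
print(jakarta.isoformat())   # 2014-04-14T21:15:36+07:00

# All three are the same instant, so they map back to the same timestamp.
assert utc.timestamp() == new_york.timestamp() == jakarta.timestamp() == ts
```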
The article at http://en.wikipedia.org/wiki/Unix_time is pretty impressive if you'd like a longer read.
Unix time is defined as the number of seconds that have elapsed since 00:00:00 Coordinated Universal Time (UTC), Thursday, 1 January 1970. So the answer is no:
Unix timestamps do not change across time zones; they were created for the purpose of having a standard time across the globe.
Note:
Timestamps are calculated from the current time on the computer, so do not rely on them unless you are very sure about the time settings on the participating machines.
Someone stated that "UTC is Greenwich Mean Time without Daylight Savings time adjustments." This is simply untrue. GMT does not have Daylight Savings Time. GMT is measured in Greenwich, England (at the Royal Observatory) [0 longitude, but not 0 latitude]. UTC is measured at the equator [0 longitude and 0 latitude - which happens to lie in the ocean off the coast of Africa].
What difference does it make? It doesn't make a difference in terms of "what time of day is it?" It does, however, make a difference in terms of calculating a year. Now you'd think a year would be measured based upon the location of the center (the core) of the earth, right? When the earth's core is back in the same location it was ~365 days ago, it has been a year. It isn't measured that way. It is measured by a specific location on the earth getting back to the same location (relative to the sun) that it was ~365 days ago. But the period of a day and a year don't divide evenly. Once the earth is back to about where it was a year ago, the earth isn't facing the same direction it was last year, so that spot on the earth isn't facing the same direction it was a year ago. Being further north, Greenwich isn't going to get back to the same spot (relative to the sun) that it was last year at the same time that 0 Lat / 0 Long is. So if you base the definition on Greenwich vs. 0/0, you get a slightly different answer to the question "how many days are in a year". To put it another way, when a given spot on the earth gets back to where it was a year ago (relative to the sun), the core of the earth isn't in the same spot it was a year ago, so which spot you pick matters: the core of the earth will be in a different spot (relative to the sun) than it was one year ago if you pick a different spot on the earth.
Neither UTC nor GMT have daylight savings time. Europe/London time, the timezone that Greenwich resides in, does. But GMT does not. GMT is, what Americans would call a "Standard Time" - i.e. without DST.
Getting back to the question, Epoch time doesn't technically have a timezone. It is based on a particular point in time, which just so happens to line up to an "even" UTC time (at the exact beginning of a year and a decade, etc.). If that concept doesn't fit well in your brain, and if it helps to think of Epoch time as being in UTC, go right ahead. You're in good company and in the grand scheme of things, it really doesn't matter. Have you ever seen those lawsuits where someone is awarded $1? It's kind of a "you're right, but it doesn't really matter" type of verdict. If someone sued you for saying Epoch time is in the UTC timezone, they would win $1. That wouldn't buy them a cup of coffee at any Starbucks in any timezone on the planet.
IF both computers are set up correctly with their clocks set for the correct timezone and UTC values, they should return the same value.
Of course that's a big IF. There's almost certain to be a difference of at least a second, more often minutes, between the times reported by two computers. And many computers are set up with incorrect timezone settings, and will report their local time rather than UTC when asked for a timestamp.
And in that lies the difference between theory and practice. In theory it's all the same, in practice you should not rely on it.
No, epoch timestamp should not change, because it has a fixed timezone which is UTC.
If you want to use a time object in another time zone, just look it up in the libraries of the language you use, but do NOT try to add/subtract a couple of hours from the epoch timestamp and assume it's in another time zone. That will make things very confusing to other people, especially when you expose it in your API.
If you use C++, I recommend this library. I heard it will soon be added into standard library.
To everyone: I understand that a time object is sometimes hard to deal with and that it looks easier to add/subtract on the epoch timestamp. Please don't do it, and do not persuade others to do it. A time object is much easier once you get used to it, and it can take care of time zone conversion easily without tripping over historical time zone changes due to politics, law, etc.
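To make the warning concrete, here is a small sketch, assuming Python 3.9+ with zoneinfo. The helper names and the hard-coded 5-hour shift are hypothetical; it compares shifting the epoch value by a fixed number of hours with letting a time zone library do the conversion.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

NEW_YORK = ZoneInfo("America/New_York")


def hour_by_shifting(ts: float) -> int:
    """Anti-pattern: shift the epoch value by a hard-coded 5 hours."""
    return int((ts - 5 * 3600) % 86400 // 3600)


def hour_by_time_zone(ts: float) -> int:
    """Let the tz database decide which offset applies at that instant."""
    return datetime.fromtimestamp(ts, tz=NEW_YORK).hour


winter = datetime(2014, 1, 15, 12, 0, tzinfo=timezone.utc).timestamp()
summer = datetime(2014, 7, 15, 12, 0, tzinfo=timezone.utc).timestamp()

print(hour_by_shifting(winter), hour_by_time_zone(winter))  # 7 7
print(hour_by_shifting(summer), hour_by_time_zone(summer))  # 7 8
```

The fixed shift happens to agree in winter but is an hour off in summer, when New York observes daylight saving time.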

Is timezone just an offset number or "more information"?

I live in a country where they change the time twice a year. That is: there is a period in the year when the offset from UTC is -3 hours (-180 mins) and other period where the offset is -4 hours (-240 mins)
Graphically:
|------- (offset = -3) -------|------- (offset is -4) -------|
start of year                mid                   end of year
My question is:
the "timezone" is just the number representing the offset? that is: my country has two timezones? or the timezone includes this information?
This is important because I save every date in UTC timezone (offset = 0) in my database.
Should I, instead, be saving the dates with local timezone and saving their offset (at the moment of saving) too?
Here is an example of a problem I see by saving the dates with timezone UTC:
Lets say I have a system where people send messages.
I want to have a statistics section where I plot "messages sent v/s hour" (ie: "Messages sent by hour in a regular day")
Lets say there are just two messages in the whole database:
Message 1, sent in march 1, at UTC time 5 pm (local time 2 pm)
Message 2, sent in august 1, at UTC time 5 pm (local time 1 pm)
Then, if I create the plot on august 2, converting those UTC dates to local would give me: "2 messages were sent at 1 pm", which is erroneous information!
From the timezone tag wiki here on StackOverflow:
TimeZone != Offset
A time zone can not be represented solely by an offset from UTC. Many
time zones have more than one offset due to "daylight savings time" or
"summer time" rules. The dates that offsets change are also part of
the rules for the time zone, as are any historical offset changes.
Many software programs, libraries, and web services disregard this
important detail, and erroneously call the standard or current offset
the "zone". This can lead to confusion, and misuse of the data. Please
use the correct terminology whenever possible.
There are two commonly used databases, the Microsoft Windows time zone db, and the IANA/Olson time zone db. See the wiki for more detail.
Your specific questions:
the "timezone" is just the number representing the offset? that is: my country has two timezones? or the timezone includes this information?
You have one "time zone". It includes two "offsets".
Should I, instead, be saving the dates with local timezone and saving their offset (at the moment of saving) too?
If you are recording the precise moment an event occurred or will occur, then you should store the offset of that particular time with it. In .Net and SQL Server, this is represented using a DateTimeOffset. There are similar datatypes in other platforms. It only contains the offset information - not the time zone that the offset originated from. Commonly, it is serialized in ISO8601 format, such as:
2013-05-09T13:29:00-04:00
If you might need to edit that time, then you cannot just store the offset. Somewhere in your system, you also need to have the time zone identifier. Otherwise, you have no way to determine what the new offset should be after the edit is made. If you desire, you can store this with the value itself. Some platforms have objects for exactly this purpose - such as ZonedDateTime in NodaTime. Example:
2013-05-09T13:29:00-04:00 America/New_York
Even when storing the zone id, you still need to record the offset. This is to resolve ambiguity during a "fall-back" transition from a daylight offset to a standard offset.
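A short sketch of that ambiguity, assuming Python 3.9+ with zoneinfo, where the fold attribute plays the role of the stored offset. The date and zone are only an example of a fall-back transition.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

tz = ZoneInfo("America/New_York")

# 01:30 occurs twice on this night: once before and once after the clocks go back.
first = datetime(2013, 11, 3, 1, 30, tzinfo=tz)           # fold=0, daylight offset
second = datetime(2013, 11, 3, 1, 30, fold=1, tzinfo=tz)  # fold=1, standard offset

print(first.isoformat())   # 2013-11-03T01:30:00-04:00
print(second.isoformat())  # 2013-11-03T01:30:00-05:00

# Only the offset tells the two instants apart; they are one hour apart in UTC.
gap = second.astimezone(timezone.utc) - first.astimezone(timezone.utc)
print(gap)  # 1:00:00
```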
Alternatively, you could store the time at UTC with the time zone name:
2013-05-09T17:29:00Z America/New_York
This would work just as well, but you'd have to apply the time zone before displaying the value to anyone. TIMESTAMP WITH TIME ZONE in Oracle and PostgreSQL work this way.
You can read more about this in this post, while .Net focused - the idea is applicable to other platforms as well. The example problem you gave is what I call "maintaining the perspective of the observer" - which is discussed in the same article.
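For the "messages by hour" example in the question, a sketch along these lines, assuming Python 3.9+ with zoneinfo, converts each stored UTC value with the full time zone rather than with today's offset. The zone America/Santiago and the year 2013 are assumptions, chosen only because they match the -3/-4 offsets described.

```python
from collections import Counter
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

LOCAL_TZ = ZoneInfo("America/Santiago")  # assumption: a zone with -3 / -4 offsets

# The two messages from the question, stored in UTC.
sent_utc = [
    datetime(2013, 3, 1, 17, 0, tzinfo=timezone.utc),
    datetime(2013, 8, 1, 17, 0, tzinfo=timezone.utc),
]

# Each value is converted with the offset that was in force on its own date.
by_hour = Counter(ts.astimezone(LOCAL_TZ).hour for ts in sent_utc)
print(sorted(by_hour.items()))  # [(13, 1), (14, 1)]
```

Storing UTC is therefore not the problem; the incorrect plot only appears if a single current offset is applied to every row.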
that is: my country has two timezones? or the timezone includes this information?
The term "timezone" usually includes that information. For example, in Java, "TimeZone represents a time zone offset, and also figures out daylight savings" (link), and on Unix-like systems, the tz database contains DST information.
However, for a single timestamp, I think it's more common to give just a UTC offset than a complete time-zone identifier.
[…] in my database.
Naturally, you should consult your database's documentation, or at least indicate what database you're using, and what tools (e.g., what drivers, what languages) you're using to access it.
Here's an example of a very popular format for describing timezones (though not what Windows uses).
You can see that it's more than a simple offset. More along the lines of offsets and the set of rules (changing over time) for when to use which offset.

Converting to UTC with known timezone offset

Not sure if the title of my question was accurate so sorry if it's misleading, here goes.
I am doing some work that involves timezones and I just want to make sure I get this right... if I want something to start at 03:00:00 my time and my timezone offset is -5, all I need to do is add 5 to 03:00:00, giving me 08:00:00, and that is the UTC time?
It depends what you mean by "timezone offset". Usually an offset is expressed as the amount added to UTC to get to local time, in which case you need to subtract it from the local time in order to get back to UTC (so with an offset of -5, 03:00 local is 03:00 - (-5:00) = 08:00 UTC, as you calculated).
So for example, Pacific Daylight Time has an offset of -7 - it's 7 hours behind UTC.
However, there are situations (annoyingly) where the offset is expressed the other way round, so make sure you know which way is appropriate for your specific context.
Note that knowing the offset doesn't mean you know the time zone - there can be multiple time zones with the same offset for a particular moment, but different rules for when the offset changes.
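A quick check of the usual convention (offset = local minus UTC), sketched in Python with only the standard library. The calendar date is arbitrary and only there to make the value a complete datetime.

```python
from datetime import datetime, timedelta, timezone

offset = timedelta(hours=-5)  # local time is 5 hours behind UTC

# 03:00 local time with a fixed -05:00 offset (the date itself is arbitrary).
local = datetime(2024, 1, 15, 3, 0, 0, tzinfo=timezone(offset))
utc = local.astimezone(timezone.utc)

print(local.isoformat())  # 2024-01-15T03:00:00-05:00
print(utc.isoformat())    # 2024-01-15T08:00:00+00:00

# Equivalent by hand: UTC = local - offset, i.e. 03:00 - (-5h) = 08:00.
assert utc.replace(tzinfo=None) == local.replace(tzinfo=None) - offset
```

A fixed offset like this is fine for a single known instant, but it carries none of the rules for when the offset changes.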
...Yes. Depending on what language you're doing it in, it may or may not be as easy as writing
time-offset;

Russian time zone changes

As many of you know, Russia has eleven time zones, and has (or will) cut two of them. It is possible that they may end daylight savings time altogether.
Does anyone know if they have cut two time zones, and if daylight savings is now a thing of the past? If so, does daylight savings end in all time zones, or just some?
I maintain some software that may need to be patched and can't find two news sites that agree on if they have, or have not implemented these changes.
My biggest concern is daylight savings.
According to wiki article,
On February 8th 2011, Russian
President Dmitry Medvedev issued a
decree cancelling DST in Russia. Under
the decree, all clocks in Russia will
advance one hour on March 27th 2011
but will not change back the following
October, effectively making Moscow
Time UTC+4 permanently.[1]
According to Wikipedia:
On March 28, 2010, the following changes were introduced, which, in particular, led to abolition of two of the eleven time zones.
* The Udmurt Republic and Samara Oblast started using Moscow Time, thus eliminating Samara Time.[2][3]
* Kemerovo Oblast started using Omsk Time.[4]
* Chukotka Autonomous Okrug and Kamchatka Krai started using Magadan Time, thus eliminating Kamchatka Time.[5]
There is no mention of daylight savings being canceled, only that its abolition was proposed.

What is the common practice with regards to differentiating between UTC and GMT?

I finally found out the difference between UTC and GMT by making the effort to look it up on Wikipedia today. Technically speaking it appears that GMT != UTC because you do not know if it is UTC or UT1 being referred to. However practically, people use the terms interchangeably to indicate the same timezone.
A while ago, I suggested that we change the user interface of one of my company's apps to display UTC instead of GMT.
Just to be sure that our database was not calculating the potential seconds difference between GMT and UTC, I ran the below query and verified that they both are just acting as aliases for the same timezone.
select now() AT TIME ZONE 'GMT', now() AT TIME ZONE 'UTC';
timezone | timezone
----------------------------+----------------------------
2009-02-11 08:46:11.643032 | 2009-02-11 08:46:11.643032
(1 row)
What do you think? Do enough users out there understand UTC? Is it better to use the older but more common term? Or should I just do a UTC/GMT?
Normal humans don't need to worry about the few seconds difference between GMT and UTC. The difference only matters to astronomers and time nerds.
I have seen very little software that bothers to make the distinction. Most software ends up using the labels "GMT" and "UTC" interchangeably. Typically it just means "clock time after removing the local time zone offset in exact hours (or half/quarter hours)."
In most cases, nobody will be concerned about the sub-second technical difference between GMT and UTC.
However, writing that the time is expressed in UTC instead of GMT avoids one source of confusion:
Greenwich (and the UK in general) is currently GMT+01:00 because of the daylight saving time (DST).
GMT+01:00 does not mean 1 hour ahead of the time in the UK as one could mistakenly think. Because of the DST, GMT+01:00 is currently the exact time in England.
Stating it as UTC+01:00 helps to avoid this confusion.
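The point about UK local time can be checked with a short sketch, assuming Python 3.9+ with zoneinfo. The two dates are arbitrary winter and summer examples.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

london = ZoneInfo("Europe/London")

# The same UTC clock time (12:00) on a winter day and on a summer day.
winter = datetime(2013, 1, 15, 12, 0, tzinfo=timezone.utc).astimezone(london)
summer = datetime(2013, 7, 15, 12, 0, tzinfo=timezone.utc).astimezone(london)

print(winter.isoformat(), winter.tzname())  # 2013-01-15T12:00:00+00:00 GMT
print(summer.isoformat(), summer.tzname())  # 2013-07-15T13:00:00+01:00 BST
```

Local time in the UK only coincides with GMT in winter, which is one reason a label of "UTC" is less likely to be misread than "GMT".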
Personally, I think of the term UTC before I think of GMT.
I think of GMT before UTC, but I am also living at GMT (+/-0)

Resources