The task is simply to forecast some feature value Y (a float) given a specific time T.
Currently I've got simple 2 column data like
2019-10-18 10:00 | 1.0
2019-10-18 12:00 | 2.5
and so on.
This simple input data could, for example, represent the changing values of the sinusoid f(x) = sin(x) over time.
I'm interested in how to represent a date-time series in ML.NET so that I can later ask the engine to predict the feature value Y for a given date-time T (perhaps supplied in the form of a Unix timestamp?).
I would recommend converting to Unix timestamps, yes. ML.NET algorithms use floats as features, so timestamps will work fine.
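The conversion itself is straightforward; here is a minimal Python sketch of the idea (ML.NET itself is C#, and the column names and the assumption that the timestamps are UTC are mine):

```python
from datetime import datetime, timezone

def to_unix(s: str) -> float:
    """Parse a 'YYYY-MM-DD HH:MM' string (assumed UTC) into a float Unix timestamp."""
    dt = datetime.strptime(s, "%Y-%m-%d %H:%M").replace(tzinfo=timezone.utc)
    return dt.timestamp()

# The two-column data from the question, as (timestamp feature, Y label) pairs:
rows = [("2019-10-18 10:00", 1.0), ("2019-10-18 12:00", 2.5)]
features = [(to_unix(t), y) for t, y in rows]
print(features[0])  # (1571392800.0, 1.0)
```

The float timestamps can then be fed to a regression learner directly, just as the answer suggests.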
Spreadsheet
I need to show the effect of shifting energy use away from peak price times on the cost of energy over the year. To do this I want to change the amount of energy being used: the value in G6 should be 8 if the time in B6 falls within the morning or afternoon peak pricing times, or keep the value from C6 if B6 is not within the peak pricing times.
Morning Peak Start 06:30:00 Finish 07:45:00
Afternoon Peak Start 16:30:00 Finish 23:45:00
I have attempted to modify the code found here:
excel-if-and-formula-between-two-times
=IF(and(B6>$N$3,B6<$O$3),"8",IF(AND(B6>=$P$3,B6<$Q$3),8,C6))
however, this does not return "8" in the desired time ranges.
You can see in the Google Spreadsheet that I have experimented with code in N6,N7 and N8.
I appreciate your assistance.
Please examine your cells N3 and P3. They have an extra 05/01/1900 in them. Remove that, leaving the time part, and it should work fine.
The problem is that you were doing a comparison on a time string, e.g. 06:45:00 and a full date string, e.g. 05/01/1900 06:30:00, which results in a faulty comparison.
The values you compare must have the same data type, and the data within that type must be consistent. In your case B6, N3, O3, P3 and Q3 are all set to the data type Time; however, you are using the Date Time format in all of the cells but B6. Since you want to compare dates with times, you must give all of these cells the Date Time data type.
To do so, in your Spreadsheet's menu bar, after selecting all these cells select Format->Number->Date Time and add the right date to B6.
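The faulty comparison is easier to see with the serial numbers spreadsheets use under the hood: a bare time is a fraction of a day (between 0 and 1), while a date-time is whole days since the epoch plus that fraction. A hypothetical Python sketch (the serial value 5 for 05/01/1900 is approximate and illustrative):

```python
# Spreadsheet serial numbers: a bare time is a fraction of a day,
# a date-time is whole days since the epoch plus that fraction.
time_0645 = 6/24 + 45/1440           # B6 holds only a time: 0.28125
datetime_0630 = 5 + 6/24 + 30/1440   # N3 held 05/01/1900 06:30: about 5.27

# B6 > N3 is False even though 06:45 is after 06:30, because the
# stray date pushes N3 five whole days ahead of any bare time value.
print(time_0645 > datetime_0630)  # False
```

Removing the stray date from N3 and P3 (or adding the matching date to B6) makes both sides comparable again.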
Why does Delphi use a double (8 bytes) to store date and time instead of an Int64 (also 8 bytes)? Since a double-precision floating point is not an exact value, I'm curious whether the precision of a Unix date and time stored in an Int64 will be better than the precision of a Delphi date and time.
The simple explanation is that the Delphi TDateTime type maps directly to the OLE/COM date/time format.
Embarcadero chose to use an existing date/time representation rather than create yet another one, and selected, what was at the time, the most obvious platform native option.
A couple of useful articles on Windows date/time representations:
How to recognize different types of timestamps from quite a long way away, Raymond Chen
Eric's Complete Guide To VT_DATE, Eric Lippert
As far as precision goes, you would like to compare Unix time to TDateTime. Unix time has second precision for both 32 or 64 bit values. For values close to the epoch, a double has far greater precision. There are 86,400 seconds in a day, and there are many orders of magnitude more double values between 0 and 1, for instance. You need to go to around year 188,143,673 before the precision of Unix time surpasses that of TDateTime.
Although you have focused on the size of the types, the representation is of course crucially important. For instance, if instead of representing date as seconds after epoch, it was represented as milliseconds after epoch, then the precision of Unix time would be 1000 times greater. Of course the range would be reduced by 1000 times also.
You need to be wary of considering precision of these types in isolation. These types don't exist in isolation, and the source of the values is important. If the time is coming from a file system say, then that will in fact determine the precision of the value.
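The precision claim above can be checked numerically. Python's `math.ulp` gives the spacing between adjacent doubles, and for a present-day TDateTime day count that spacing is far below one second (the day count below is approximate):

```python
import math

# TDateTime stores days since 1899-12-30 as a double.
# 2019-10-18 is roughly 43,756 days after that epoch.
days = 43756.0
spacing_days = math.ulp(days)           # gap to the next representable double
spacing_seconds = spacing_days * 86400  # convert that gap to seconds

print(spacing_seconds)        # sub-microsecond steps
print(spacing_seconds < 1.0)  # True: finer than Unix time's 1-second step
```

Integer Unix time, by contrast, has a fixed 1-second step regardless of magnitude, which is why the double wins for any remotely contemporary date.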
How do you prepare cyclic ordinal features like time in a day or day in a week for the random forest algorithm?
If you just encode time as minutes after midnight, the numerical difference between 23:55 and 00:05 will be very large, even though they are only 10 minutes apart.
I found a solution here where the time feature is split into two features using the cosine and sine of the seconds-after-midnight feature. But is that appropriate for a random forest? With a random forest one can't be sure that all features will be present for every split, so often half of the time information will be missing for a decision.
Looking forward to your thoughts!
If you have a date variable, with values like '2019/11/09', you can extract individual features like year (2019), month (11), day (09), day of the week (Saturday), quarter (4), semester (2). You can go further and add features like "is bank holiday", "is weekend", or "advertisement campaign", if you know the dates of specific events.
If you have a time variable with values like 23:55, you can extract hour (23), minutes (55) and, if present, seconds, nanoseconds, etc. If you have info about the timezone, you can extract that as well.
If you have datetime variable with values like '2019/11/09 23:55', you can combine the above.
If you have more than 1 datetime variable, you can capture differences between them, for example if you have date of birth, and date of application, you can determine the feature "age at time of application".
More info about the options for datetime can be found in the pandas dt accessor. Check the methods there.
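A minimal sketch of that kind of extraction with the pandas `.dt` accessor (the column name `when` and the sample dates are mine):

```python
import pandas as pd

df = pd.DataFrame({"when": pd.to_datetime(["2019-11-09 23:55", "2019-12-25 00:05"])})

# Extract calendar and clock parts with the .dt accessor
df["year"] = df["when"].dt.year
df["month"] = df["when"].dt.month
df["day"] = df["when"].dt.day
df["dayofweek"] = df["when"].dt.dayofweek  # Monday=0 ... Sunday=6
df["quarter"] = df["when"].dt.quarter
df["hour"] = df["when"].dt.hour
df["minute"] = df["when"].dt.minute
df["is_weekend"] = df["when"].dt.dayofweek >= 5

print(df[["year", "dayofweek", "quarter", "hour"]])
```

Each extracted column can then be fed to the model as an ordinary numerical or boolean feature.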
The cyclical transformation in your link is used to re-encode circular variables like hours of the day or months of the year, where for example December (month 12) is closer to January (month 1) than to July (month 7); if you encoded the months as plain numbers, this relationship would not be captured. You would use this transformation if that is what you want to represent, but it is not the standard go-to method for transforming these variables (to my knowledge).
You can check Scikit-learn's tutorial on time related feature engineering.
Random forests capture non-linear relationships between features and targets, so they should be able to handle both numerical features like month, or the cyclical variation.
To be absolutely sure, the best approach is to try both engineering methods and see which encoding yields better model performance.
You can apply the cyclical transformation straightaway with the open source package Feature-engine. Check the CyclicalTransformer.
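The sine/cosine trick itself is a few lines; this sketch shows why it fixes the 23:55 vs 00:05 problem from the question:

```python
import math

def encode_minutes(m: int) -> tuple[float, float]:
    """Map minutes-after-midnight onto a point on the unit circle."""
    angle = 2 * math.pi * m / 1440  # 1440 minutes in a day
    return math.sin(angle), math.cos(angle)

a = encode_minutes(23 * 60 + 55)  # 23:55
b = encode_minutes(5)             # 00:05

# Raw minute values differ by 1430, but on the circle the points are close:
dist = math.dist(a, b)
print(round(dist, 3))  # 0.044
```

Distance on the circle now reflects the true 10-minute gap, at the cost of splitting the information across two features, which is exactly the random-forest concern raised in the question.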
I live in a country where they change the time twice a year. That is: there is a period in the year when the offset from UTC is -3 hours (-180 mins) and another period where the offset is -4 hours (-240 mins).
Graphically:
|------- (offset = -3) -------|------- (offset = -4) -------|
start of year               mid                   end of year
My question is:
Is the "timezone" just the number representing the offset? That is: does my country have two timezones, or does the timezone include this information?
This is important because I save every date in UTC timezone (offset = 0) in my database.
Should I, instead, be saving the dates with local timezone and saving their offset (at the moment of saving) too?
Here is an example of a problem I see when saving dates in UTC:
Let's say I have a system where people send messages.
I want to have a statistics section where I plot "messages sent vs. hour" (i.e., "messages sent by hour on a typical day").
Let's say there are just two messages in the whole database:
Message 1, sent on March 1 at 5 pm UTC (local time 2 pm)
Message 2, sent on August 1 at 5 pm UTC (local time 1 pm)
Then, if I create the plot on August 2, converting those UTC dates to local time would give me "2 messages were sent at 1 pm", which is incorrect!
From the timezone tag wiki here on StackOverflow:
TimeZone != Offset
A time zone can not be represented solely by an offset from UTC. Many
time zones have more than one offset due to "daylight savings time" or
"summer time" rules. The dates that offsets change are also part of
the rules for the time zone, as are any historical offset changes.
Many software programs, libraries, and web services disregard this
important detail, and erroneously call the standard or current offset
the "zone". This can lead to confusion, and misuse of the data. Please
use the correct terminology whenever possible.
There are two commonly used databases: the Microsoft Windows time zone db, and the IANA/Olson time zone db. See the wiki for more detail.
Your specific questions:
Is the "timezone" just the number representing the offset? That is: does my country have two timezones, or does the timezone include this information?
You have one "time zone". It includes two "offsets".
Should I, instead, be saving the dates with local timezone and saving their offset (at the moment of saving) too?
If you are recording the precise moment an event occurred or will occur, then you should store the offset of that particular time with it. In .Net and SQL Server, this is represented using a DateTimeOffset. There are similar datatypes in other platforms. It only contains the offset information - not the time zone that the offset originated from. Commonly, it is serialized in ISO8601 format, such as:
2013-05-09T13:29:00-04:00
If you might need to edit that time, then you cannot just store the offset. Somewhere in your system, you also need to have the time zone identifier. Otherwise, you have no way to determine what the new offset should be after the edit is made. If you desire, you can store this with the value itself. Some platforms have objects for exactly this purpose - such as ZonedDateTime in NodaTime. Example:
2013-05-09T13:29:00-04:00 America/New_York
Even when storing the zone id, you still need to record the offset. This is to resolve ambiguity during a "fall-back" transition from a daylight offset to a standard offset.
Alternatively, you could store the time at UTC with the time zone name:
2013-05-09T17:29:00Z America/New_York
This would work just as well, but you'd have to apply the time zone before displaying the value to anyone. TIMESTAMP WITH TIME ZONE in Oracle and PostgreSQL work this way.
You can read more about this in this post; while .Net focused, the idea is applicable to other platforms as well. The example problem you gave is what I call "maintaining the perspective of the observer", which is discussed in the same article.
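The "UTC instant plus zone identifier" approach can be sketched in Python with the standard-library `zoneinfo` module (this assumes the IANA tz database is available on the system; the sample instants are mine):

```python
from datetime import datetime, timezone, timedelta
from zoneinfo import ZoneInfo  # IANA tz database access, Python 3.9+

NY = ZoneInfo("America/New_York")

# Store the instant in UTC plus the zone identifier; convert on display:
utc_winter = datetime(2013, 1, 9, 17, 29, tzinfo=timezone.utc)
utc_summer = datetime(2013, 7, 9, 17, 29, tzinfo=timezone.utc)

# One zone id, two different offsets, depending on the date:
print(utc_winter.astimezone(NY).isoformat())  # 2013-01-09T12:29:00-05:00
print(utc_summer.astimezone(NY).isoformat())  # 2013-07-09T13:29:00-04:00
```

This is exactly why a single stored offset cannot replace the zone id: the same zone yields -05:00 in January and -04:00 in July.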
that is: my country has two timezones? or the timezone includes this information?
The term "timezone" usually includes that information. For example, in Java, "TimeZone represents a time zone offset, and also figures out daylight savings" (link), and on Unix-like systems, the tz database contains DST information.
However, for a single timestamp, I think it's more common to give just a UTC offset than a complete time-zone identifier.
[…] in my database.
Naturally, you should consult your database's documentation, or at least indicate what database you're using, and what tools (e.g., what drivers, what languages) you're using to access it.
Here's an example of a very popular format for describing timezones (though not what Windows uses).
You can see that it is more than a simple offset: it is a set of offsets plus the rules (which change over time) for when to use which offset.
The Delphi functions EncodeDate/DecodeDate seem to be able to handle only dates after 01.01.0001. Are there implementations of EncodeDate/DecodeDate that can handle B.C. TDateTime values?
AFAIK TDateTime is a Windows base type, common to COM, Variants, .NET and Delphi. Negative values are used for dates before December 30, 1899.
But it is not that simple: negative values bring some trouble, as stated by this page:
The integral part is the date, the fraction is the time. Date.time.
That's easy. Things get odd when the value is negative. This is on or
before #12/30/1899#.
With modern dates time always runs forwards, as you would suspect.
With negative historical dates time actually runs backwards!
Midnight #1/1/1800# equals −36522, but noon #1/1/1800# is −36522.5 (less than
midnight!) and one second before midnight is −36522.9999884259 (even
less). At midnight the clock jumps forward to -36521, which
equals #1/2/1800#. The decimal fraction still shows the time and the integral
part is the date, but each second decrements the clock while each new
day increments it, not just by 1, but by almost 2. Negative times are
really counterintuitive.
To make things worse, time values for #12/30/1899# are ambiguous
in two ways. First, a time value without a date equals that time
on #12/30/1899#. This means that 0.5 is either plain noon or noon
on #12/30/1899#, depending on context, and zero is either plain
midnight or midnight #12/30/1899#. The other ambiguity is that all
time values come in pairs for #12/30/1899#: 0.5 is noon #12/30/1899#,
but -0.5 is noon #12/30/1899# as well. The integral part is the date,
the fraction is the time. Another surprise is here: #12/30/1899
11:59:59 PM# - #12/29/1899 11:59:59 PM# = 2.99997685185185. Not 1,
what you normally would expect for a 24-hour period. Be careful when
working with historical dates.
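The backwards-running fraction described in that quote can be made concrete with a small decoder. A hypothetical Python sketch of the OLE/TDateTime rule (signed integral part = day count from the 1899-12-30 epoch, absolute fractional part = time of day):

```python
import math
from datetime import datetime, timedelta

EPOCH = datetime(1899, 12, 30)

def ole_to_datetime(value: float) -> datetime:
    """Decode an OLE/TDateTime double: the signed integral part selects
    the date, the absolute fractional part is the time of day."""
    days = math.trunc(value)   # signed day count from the epoch
    frac = abs(value - days)   # time of day, always a positive fraction
    return EPOCH + timedelta(days=days) + timedelta(days=frac)

print(ole_to_datetime(-36522.5))  # 1800-01-01 12:00:00 (noon, despite -0.5)
print(ole_to_datetime(0.5))       # 1899-12-30 12:00:00
print(ole_to_datetime(-0.5))      # 1899-12-30 12:00:00 (same instant as +0.5)
```

Note how both +0.5 and -0.5 decode to the same instant, which is exactly the ambiguity the quoted page warns about.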
To my knowledge, the current implementation of EncodeDate/DecodeDate will work, but you may run into trouble when working with negative or near-zero TDateTime values...
You would be better off using your own time format, e.g. ISO 8601, or a simple record such as:
TMyDateTime = packed record
  Year: SmallInt;
  Month: Byte;
  Day: Byte;
end;
And when computing durations or displaying date/time, you must be aware that "our time" is not continuous, so calculations using the TDateTime = double trick won't always work as expected. For instance, I remember that Teresa of Avila died in 1582, on October 4th, just as Catholic nations were making the switch from the Julian to the Gregorian calendar, which required removing October 5-14 from the calendar. :)