Cyclic ordinal features in random forest - machine-learning

How do you prepare cyclic ordinal features like time in a day or day in a week for the random forest algorithm?
If time is encoded simply as minutes after midnight, the numeric difference between 23:55 and 00:05 is very large, although the two times are only 10 minutes apart.
I found a solution here where the time feature is split into two features, using the cosine and sine of the seconds-after-midnight feature. But is that appropriate for random forest? With random forest, one can't be sure that all features will be available for every split, so often half of the time information will be missing for a decision.
Looking forward to your thoughts!

If you have a date variable with values like '2019/11/09', you can extract individual features like year (2019), month (11), day (09), day of the week (Saturday), quarter (4), semester (2). You can also add features like "is bank holiday", "is weekend", or "advertisement campaign", if you know the dates of specific events.
If you have a time variable with values like 23:55, you can extract the hour (23), the minutes (55) and, if you have them, seconds, nanoseconds, etc. If you have information about the timezone, you can extract that as well.
If you have datetime variable with values like '2019/11/09 23:55', you can combine the above.
If you have more than 1 datetime variable, you can capture differences between them, for example if you have date of birth, and date of application, you can determine the feature "age at time of application".
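As a rough sketch of the extraction described above, assuming pandas is available: the column names and the sample rows below are hypothetical and exist only for illustration, and the age calculation is an approximation based on 365.25-day years.

```python
import pandas as pd

# Hypothetical example data; the column names are illustrative only.
df = pd.DataFrame({
    "applied_at": ["2019/11/09 23:55", "2020/02/03 00:05"],
    "date_of_birth": ["1990/05/17", "1985/12/30"],
})
applied = pd.to_datetime(df["applied_at"], format="%Y/%m/%d %H:%M")
born = pd.to_datetime(df["date_of_birth"], format="%Y/%m/%d")

features = pd.DataFrame({
    "year": applied.dt.year,
    "month": applied.dt.month,
    "day": applied.dt.day,
    "day_of_week": applied.dt.dayofweek,  # Monday=0 ... Sunday=6
    "day_name": applied.dt.day_name(),
    "quarter": applied.dt.quarter,
    "semester": (applied.dt.month - 1) // 6 + 1,
    "is_weekend": applied.dt.dayofweek >= 5,
    "hour": applied.dt.hour,
    "minute": applied.dt.minute,
    # Approximate age in years at the time of application.
    "age_at_application": (applied - born).dt.days / 365.25,
})
print(features)
```

Event flags such as "is bank holiday" would need an external list of dates, so they are left out here.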
More information about the datetime options can be found in the documentation of the pandas .dt accessor. Check the methods here.
The cyclical transformation in your link is used to re-code circular variables like hours of the day or months of the year, where for example December (month 12) is closer to January (month 1) than to July (month 7). If you encode these with plain numbers, this relationship is not captured. You would use this transformation if this is what you want to represent. But this is not the standard go-to method to transform these variables (to my knowledge).
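A minimal sketch of that cyclical transformation, using only the standard library. The helper names encode_cyclic and minutes_after_midnight are made up for this example. It maps the time of day onto the unit circle and compares distances between the encoded points.

```python
import math

MINUTES_PER_DAY = 24 * 60

def encode_cyclic(value, period):
    """Map a cyclic value onto the unit circle as a (sin, cos) pair."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)

def minutes_after_midnight(hhmm):
    """Convert an 'HH:MM' string to minutes after midnight."""
    hours, minutes = map(int, hhmm.split(":"))
    return hours * 60 + minutes

late = encode_cyclic(minutes_after_midnight("23:55"), MINUTES_PER_DAY)
early = encode_cyclic(minutes_after_midnight("00:05"), MINUTES_PER_DAY)
noon = encode_cyclic(minutes_after_midnight("12:00"), MINUTES_PER_DAY)

# In the encoded space 23:55 and 00:05 are neighbours, while noon is far away.
print(math.dist(late, early))
print(math.dist(late, noon))
```

Both components are needed to identify a time uniquely: the sine alone (or the cosine alone) takes the same value at two different times of day.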
You can check Scikit-learn's tutorial on time related feature engineering.
Random forests capture non-linear relationships between features and targets, so they should be able to handle both numerical features like month, or the cyclical variation.
To be absolutely sure, the best way is to try both engineering methods and see which feature returns better model performance.
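One way to run such a comparison, sketched with scikit-learn and NumPy on synthetic data. The target, the noise level and the model settings are assumptions chosen for illustration, so the scores say nothing about any real dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
hour = rng.uniform(0, 24, size=500)
# Synthetic target with a daily cycle plus noise (an assumption for illustration).
y = np.sin(2 * np.pi * hour / 24) + rng.normal(0, 0.1, size=500)

# Encoding 1: the raw hour as a single numeric feature.
X_raw = hour.reshape(-1, 1)
# Encoding 2: the sine/cosine pair.
X_cyc = np.column_stack([
    np.sin(2 * np.pi * hour / 24),
    np.cos(2 * np.pi * hour / 24),
])

def mean_r2(X):
    """Mean 5-fold cross-validated R^2 of a random forest on X."""
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    return cross_val_score(model, X, y, cv=5, scoring="r2").mean()

score_raw = mean_r2(X_raw)
score_cyc = mean_r2(X_cyc)
print(f"raw hour: {score_raw:.3f}  sin/cos: {score_cyc:.3f}")
```

On your own data, swap in the real features and target and keep the cross-validation setup identical for both encodings.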
You can apply the cyclical transformation straight away with the open-source package Feature-engine. Check the CyclicalTransformer (named CyclicalFeatures in more recent releases, as far as I know).

Related

Time series- Not periodic, despite having included frequency

This is actually part of my thesis research, where I have to run a time series analysis on pollution and economic growth of a single country.
I have data of over 144 years of the two variables with each value representing a single year. I imported, set the values as numeric and attached the dataset through the console and ran:
ts_gdp <- ts(data = `GDP per capita`, start = 1871, end = 2014, frequency = 1, names = "gdp")
I can see all the values for the first variable, but when I follow up with stl() I get the error below. Any clues why this shows up, although I have set frequency=1, which is the number of observations per unit of time, in this case a year? Thank you in advance!
Error in stl(GDP, s.window = "periodic") :
series is not periodic or has less than two periods

ML.NET - Normalizing date time data

The task is simply to forecast a value Y (a float) for a specific time T.
Currently I have simple two-column data like
2019-10-18 10:00 | 1.0
2019-10-18 12:00 | 2.5
and so on.
The input data could, for example, represent the values of the sinusoid f(x)=sin(x) changing over time.
I'm interested in how to convert a date-time series in ML.NET so that I can later ask the engine to predict the value Y for a given date and time T (maybe given as a Unix timestamp?).
I would recommend converting to Unix timestamps, yes. ML.NET algorithms use floats as features, so timestamps will work fine.
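The conversion itself is independent of ML.NET. The sketch below shows the arithmetic in Python rather than .NET, and it assumes the sample timestamps are in UTC, which the question does not state. The helper name to_unix_seconds is made up for this example.

```python
from datetime import datetime, timezone

# The two sample rows from the question.
rows = [("2019-10-18 10:00", 1.0), ("2019-10-18 12:00", 2.5)]

def to_unix_seconds(text):
    """Parse 'YYYY-MM-DD HH:MM' and return seconds since the Unix epoch."""
    # Assumption: the timestamps are in UTC; the question does not say.
    parsed = datetime.strptime(text, "%Y-%m-%d %H:%M").replace(tzinfo=timezone.utc)
    return parsed.timestamp()

features = [(to_unix_seconds(t), y) for t, y in rows]
print(features)  # [(1571392800.0, 1.0), (1571400000.0, 2.5)]
```

If the features end up as 32-bit floats, values of this magnitude are spaced 128 seconds apart, so subtracting a reference time first (for example the earliest timestamp in the data) keeps more resolution.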

Why does Delphi use double to store Date and Time instead of Int64?

Why does Delphi use double (8 bytes) to store date and time instead of Int64 (8 bytes as well)? As a double-precision floating point number is not an exact value, I'm curious whether the precision of a Unix date and time stored in an Int64 value will be better than the precision of a Delphi date and time.
The simple explanation is that the Delphi TDateTime type maps directly to the OLE/COM date/time format.
Embarcadero chose to use an existing date/time representation rather than create yet another one, and selected, what was at the time, the most obvious platform native option.
A couple of useful articles on Windows date/time representations:
How to recognize different types of timestamps from quite a long way away, Raymond Chen
Eric's Complete Guide To VT_DATE, Eric Lippert
As far as precision goes, you would like to compare Unix time to TDateTime. Unix time has a precision of one second for both 32-bit and 64-bit values. For values close to the epoch, a double has far greater precision. There are 86,400 seconds in a day, and there are many orders of magnitude more double values between 0 and 1, for instance. You need to go to around year 188,143,673 before the precision of Unix time surpasses that of TDateTime.
Although you have focused on the size of the types, the representation is of course crucially important. For instance, if instead of representing date as seconds after epoch, it was represented as milliseconds after epoch, then the precision of Unix time would be 1000 times greater. Of course the range would be reduced by 1000 times also.
You need to be wary of considering precision of these types in isolation. These types don't exist in isolation, and the source of the values is important. If the time is coming from a file system say, then that will in fact determine the precision of the value.
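The precision comparison above can be checked numerically. This sketch assumes Python 3.9 or later (for math.ulp) and uses the TDateTime convention of counting days from 1899-12-30. The sample date is arbitrary.

```python
import math
from datetime import date

SECONDS_PER_DAY = 86_400
DELPHI_EPOCH = date(1899, 12, 30)  # day zero of TDateTime / OLE automation dates

# A present-day date expressed as a TDateTime-style day count.
days = (date(2019, 11, 9) - DELPHI_EPOCH).days
# Gap between adjacent doubles at that magnitude, converted to seconds.
step_seconds = math.ulp(float(days)) * SECONDS_PER_DAY
print(days, step_seconds)  # well below one microsecond

# The gap between doubles first exceeds one second at 2**36 days.
crossover_days = 2.0 ** 36
print(math.ulp(crossover_days) * SECONDS_PER_DAY)  # about 1.3 seconds
print(crossover_days / 365.25)  # roughly 188 million years
```

So for present-day dates a double-based day count resolves well under a microsecond, while whole-second Unix time resolves one second regardless of how far the date is from the epoch.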

Why does .weekday date component start at 1 in swift?

The .weekday component starts at 1 (Sunday = 1, Monday = 2, etc.) and I'm interested if anyone knows why. It seems that in programming things usually start at 0.
The reason for zero-based indexing in programming dates back to the time when programs were written in machine language or assembly code. It is a reflection of the base+displacement capability of memory access from CPU registers. It was maintained in low-level programming languages (such as C) that were essentially a bridge to assembly code. Zero-based indexing also provides much simpler index manipulation when processing a one-dimensional array (or memory block) as a multidimensional matrix. That being said, it is still just a convention. Some languages (such as Pascal) use one-based indexing, and normal human beings don't start numbering things at zero.
I don't know the fundamental reason for the numbering of weekdays being based on 1, but I strongly suspect that it is more consistent (and practical) to use with calendars where day numbers within a month, and months within a year, are also 1-based. It would be very confusing to manipulate days and months as zero-based indexes. Given this, weekdays should follow the same convention.

Do UNIX timestamps change across timezones?

As the subject asks; do UNIX timestamps change in each timezone?
For example, if I sent a request to another server on the other side of the world saying, "Send out an email when the time is 1397484936", would the other server's timestamp be 12 hours behind my own?
The definition of UNIX timestamp is time zone independent. The UNIX timestamp is the number of seconds (or milliseconds) elapsed since an absolute point in time, midnight of Jan 1 1970 in UTC time. (UTC is Greenwich Mean Time without Daylight Savings time adjustments.)
Regardless of your time zone, the UNIX timestamp represents a moment that is the same everywhere. Of course you can convert back and forth to a local time zone representation (time 1397484936 is such-and-such local time in New York, or some other local time in Djakarta) if you want.
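A small sketch of this, using the timestamp from the question. The fixed UTC offsets are assumptions for illustration (UTC-4 for New York, which was on daylight time on that date, and UTC+7 for Jakarta). A real application would use a time zone database instead.

```python
from datetime import datetime, timedelta, timezone

STAMP = 1397484936  # the timestamp from the question

utc = datetime.fromtimestamp(STAMP, tz=timezone.utc)
# Fixed offsets for illustration only; see the assumptions above.
new_york = utc.astimezone(timezone(timedelta(hours=-4)))
jakarta = utc.astimezone(timezone(timedelta(hours=7)))

for label, moment in [("UTC", utc), ("New York", new_york), ("Jakarta", jakarta)]:
    # The wall-clock representation differs, the timestamp does not.
    print(label, moment.isoformat(), int(moment.timestamp()))
```

All three lines print the same integer timestamp next to three different local representations of the same instant.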
The article at http://en.wikipedia.org/wiki/Unix_time is pretty impressive if you'd like a longer read.
Unix time is defined as the number of seconds that have elapsed since 00:00:00 Coordinated Universal Time (UTC), Thursday, 1 January 1970. So the answer is no.
Unix timestamps do not change across timezones; they were created for the purpose of having a standard time across the globe.
NOTE:
Timestamps are calculated from the current time on the computer, so do not rely on them unless you are very sure about the time settings on the participating machines.
Someone stated that "UTC is Greenwich Mean Time without Daylight Savings time adjustments." This is simply untrue. GMT does not have Daylight Savings Time. GMT is measured in Greenwich, England (at the Royal Observatory) [0 longitude, but not 0 latitude]. UTC is measured at the equator [0 longitude and 0 latitude - which happens to lie in the ocean off the coast of Africa].
What difference does it make? It doesn't make a difference in terms of "what time of day is it?" It does, however, make a difference in terms of calculating a year. Now you'd think a year would be measured based upon the location of the center (the core) of the earth, right? When the earth's core is back in the same location it was ~365 days ago, it has been a year. It isn't measured that way. It is measured by a specific location on the earth getting back to the same location (relative to the sun) that it was ~365 days ago. But the period of a day and a year don't divide evenly. Once the earth is back to about where it was a year ago, the earth isn't facing the same direction it was last year, so that spot on the earth isn't facing the same direction it was a year ago. Being further north, Greenwich isn't going to get back to the same spot (relative to the sun) that it was last year at the same time that 0 Lat / 0 Long is. So if you base the definition on Greenwich vs. 0/0, you get a slightly different answer to the question "how many days are in a year". To put it another way, when a given spot on the earth gets back to where it was a year ago (relative to the sun), the core of the earth isn't in the same spot it was a year ago, so what spot you pick matters because the core of the earth is going to be in a different spot (relative to the sun) than it was one year ago, if you pick a different spot on the earth.
Neither UTC nor GMT has daylight savings time. Europe/London time, the timezone that Greenwich resides in, does. But GMT does not. GMT is what Americans would call a "Standard Time", i.e. without DST.
Getting back to the question, Epoch time doesn't technically have a timezone. It is based on a particular point in time, which just so happens to line up with an "even" UTC time (at the exact beginning of a year and a decade, etc.). If that concept doesn't fit well in your brain, and if it helps to think of Epoch time as being in UTC, go right ahead. You're in good company, and in the grand scheme of things it really doesn't matter. You ever see those lawsuits where someone is awarded $1? It's kind of a "you're right, but it doesn't really matter" type of verdict. If someone sued you for saying Epoch time is in the UTC timezone, they would win $1. That wouldn't buy them a cup of coffee at any Starbucks in any timezone on the planet.
IF both computers are set up correctly with their clocks set for the correct timezone and UTC values, they should return the same value.
Of course that's a big IF. There's almost certain to be a difference of at least a second, more often minutes, between the times reported by two computers. And many computers are set up with incorrect timezone settings, and will report their local time rather than UTC when asked for a timestamp.
And in that lies the difference between theory and practice. In theory it's all the same, in practice you should not rely on it.
No, an epoch timestamp does not change, because it has a fixed timezone, which is UTC.
If you want to use a time object in another time zone, look up how to do that in the libraries of the language you use, but do NOT try to add or subtract a couple of hours from the epoch timestamp and assume it is now in another time zone. That will make things very confusing to other people, especially when you expose it in your API.
If you use C++, I recommend this library. I heard it will soon be added into standard library.
To everyone: I understand that time objects are sometimes hard to deal with and that it looks easier to add to or subtract from the epoch timestamp. Please don't do it, and do not persuade others to do it. A time object is much easier once you get used to it, and it can take care of time zone conversion without tripping over historical time zone changes due to politics, law, etc.
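To make the pitfall concrete, here is a short sketch. The UTC+2 offset is a hypothetical choice and the timestamp is the one from the question. Shifting the integer produces a value that any other consumer decodes as a different instant, while a time zone aware object keeps the instant intact.

```python
from datetime import datetime, timedelta, timezone

STAMP = 1397484936
LOCAL = timezone(timedelta(hours=2))  # hypothetical fixed UTC+2 zone

# Anti-pattern: "convert" to local time by shifting the epoch value.
shifted = STAMP + 2 * 3600
# Anyone who decodes `shifted` as a real epoch value lands two hours late.
misread = datetime.fromtimestamp(shifted, tz=timezone.utc)
actual = datetime.fromtimestamp(STAMP, tz=timezone.utc)
print(misread - actual)  # 2:00:00

# Preferred: keep the timestamp and attach the zone to the time object.
local_view = datetime.fromtimestamp(STAMP, tz=LOCAL)
print(local_view.isoformat(), int(local_view.timestamp()))
```

The local view shows 16:15:36 on the wall clock, yet converting it back still yields the original timestamp.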
