My question isn't about any particular software; it's a broad question that could apply to any type of data mining problem.
I have a data set with daily data and several attributes, like the one below. 'Sales' is numeric and represents the sales revenue on a given day. 'Open' is categorical and indicates whether a store is open (=1) or closed (=0). 'Promo' is categorical and states which type of promo is running on the given day (it takes the values a, b and c).
| day        | sales | open | promo |
|------------|-------|------|-------|
| 06/12/2022 | 15    | 1    | a     |
| 05/12/2022 | 0     | 0    | a     |
| 04/12/2022 | 12    | 1    | b     |
Now, my goal is to develop a model that predicts weekly sales. To do this, I need to aggregate the daily data into weekly data.
For the variable sales this is quite straightforward, because the value of weekly sales is the sum of daily sales within a given week.
My question concerns the categorical variables (open and promo): what kind of aggregation function should I use? I have tried converting the variables to numerical values and using the weekly mean as the aggregation method for these attributes, but I don't know if this is a common approach.
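A minimal sketch of the aggregation described above, in Python with only the standard library: sales are summed, the 0/1 'open' flag is averaged (which gives the share of open days), and 'promo' is one-hot encoded so that each level gets its own weekly share. Grouping by ISO week (Monday to Sunday) and the names `aggregate_weekly`, `open_share` and `promo_*_share` are assumptions made for illustration; this shows one possible encoding, not an established standard.

```python
from collections import defaultdict
from datetime import datetime

# Rows from the example table: (day as dd/mm/yyyy, sales, open, promo)
rows = [
    ("06/12/2022", 15, 1, "a"),
    ("05/12/2022", 0, 0, "a"),
    ("04/12/2022", 12, 1, "b"),
]
PROMO_LEVELS = ("a", "b", "c")


def aggregate_weekly(rows):
    """Group daily rows by ISO (year, week) and aggregate each column."""
    buckets = defaultdict(list)
    for day, sales, is_open, promo in rows:
        iso = datetime.strptime(day, "%d/%m/%Y").isocalendar()
        buckets[(iso[0], iso[1])].append((sales, is_open, promo))

    weekly = {}
    for week, days in sorted(buckets.items()):
        n = len(days)
        features = {
            # numeric variable: weekly sum
            "sales": sum(s for s, _, _ in days),
            # binary variable: mean of 0/1 = share of days the store was open
            "open_share": sum(o for _, o, _ in days) / n,
        }
        # nominal variable: mean of each one-hot indicator = share per level
        for level in PROMO_LEVELS:
            features[f"promo_{level}_share"] = sum(p == level for _, _, p in days) / n
        weekly[week] = features
    return weekly


weekly = aggregate_weekly(rows)
for week, features in weekly.items():
    print(week, features)
```

Taking the mean of a single numeric code for 'promo' (a=1, b=2, c=3) would impose an ordering that the categories do not have, which is why the sketch uses one share per level instead.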
Does anyone know the best or usual way to tackle this?
Thanks, anyway!
I am looking for a way to calculate the change rate over time from a tabular dataset in Google Data Studio. Here is the link to the dataset: https://docs.google.com/spreadsheets/d/1To1n5JJA6uVkLMgwjKhghJgCJpFmtXkqNog4DzfoEbE/edit?usp=sharing
There are many rows, each with a date stamp and different categories and sub-categories. I have created a change rate table manually to show what I want to chart in Google Data Studio. The charts should be built from the raw tabular data, not from the separate change rate table, which exists only as an example.
So the chart could be based on a main category (as in the sample), could also be viewed by sub-category, and should show the change rate over time between the dates.
The dates can sometimes be months or years. I am not very savvy with advanced formulas or scripting, but I am hopeful someone here will be able to help me out. I will be ever so grateful :)
I can only show you how to get the quotient of the record counts between two consecutive days. If you need other intervals between dates (days, months, years), repeat the following steps for each one:
Generate a new field "yesterday" with: DATETIME_SUB(date, INTERVAL 1 DAY)
Blend this dataset with itself, using "date" and "yesterday" as the join dimensions.
Add your category fields A and B as further dimensions.
As the metric, use the count of the date field.
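To illustrate what the blend computes, here is a rough Python sketch of the same self-join outside Data Studio. The rows and category values are made up for the example, and `daily_quotients` is a hypothetical helper name; it counts records per (date, A, B) and divides each count by the count of the previous day.

```python
from collections import Counter
from datetime import date, timedelta

# Made-up rows for illustration: (date, category A, category B)
rows = [
    (date(2022, 1, 1), "x", "p"),
    (date(2022, 1, 1), "x", "p"),
    (date(2022, 1, 2), "x", "p"),
    (date(2022, 1, 2), "x", "p"),
    (date(2022, 1, 2), "x", "p"),
]


def daily_quotients(rows, step=timedelta(days=1)):
    """Join each (date, A, B) count with the count one step earlier."""
    counts = Counter(rows)  # record count per (date, A, B)
    result = {}
    for (day, cat_a, cat_b), n in counts.items():
        previous = counts.get((day - step, cat_a, cat_b))
        if previous:  # like an inner join: skip days with no earlier match
            result[(day, cat_a, cat_b)] = n / previous
    return result


quotients = daily_quotients(rows)
print(quotients)
```

Passing a different `step` covers other day-based intervals; month and year comparisons would need calendar-aware arithmetic rather than a fixed `timedelta`.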
I need to develop an application that identifies dates inside a given text using some NLP approach. Let's assume I have data in a DB with date columns "from" and "to", and the text is:
Get data between 1st August and 15th August
I need to identify the dates and form the query to retrieve the data. I used Natty NLP and was able to identify the dates. But I'm stuck on more complex time expressions like:
Get data uploaded next week
Get data uploaded last week
For the first one I need to identify the dates of next week's Monday and Sunday and form the query; the same goes for the second one. But Natty gives me the date one week from today. What other solutions exist? Or do I need to handle the expression in code? I am using Java.
Your question is a bit confusing, but I guess you want to achieve two things:
Identify words that represent a time expression
Map these words to a formal machine-readable representation
If that is what you need, check out the Duckling framework: it identifies time expressions and normalises them into a single formal date representation.
Note that you need to pass a reference date so that ambiguous time expressions can be resolved.
You can run it as a service and call it from your code.
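Whichever library recognises the expression, the Monday-to-Sunday range the question asks for can be derived from the reference date with a little date arithmetic. The sketch below is in Python for brevity (the question uses Java, but the arithmetic is the same); `week_bounds` and the reference date are assumptions made for the example, and weeks are assumed to start on Monday.

```python
from datetime import date, timedelta


def week_bounds(reference, offset_weeks):
    """Return (Monday, Sunday) of the week `offset_weeks` away from the
    week that contains `reference` (0 = this week, 1 = next, -1 = last)."""
    this_monday = reference - timedelta(days=reference.weekday())
    monday = this_monday + timedelta(weeks=offset_weeks)
    return monday, monday + timedelta(days=6)


reference = date(2022, 12, 6)  # hypothetical reference date, a Tuesday
next_week = week_bounds(reference, 1)    # "Get data uploaded next week"
last_week = week_bounds(reference, -1)   # "Get data uploaded last week"
print(next_week, last_week)
```

The two dates can then be bound as the "from" and "to" parameters of the query.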
I'm trying to create a spreadsheet that will allow me to quickly calculate the amount of time my trains were delayed on a daily basis.
I need a formula that will find all trains on a particular route that depart after a planned departure time (written in a cell), check these trains' actual arrival times, and then display the earliest time I could have arrived at my destination.
spreadsheet image
For example, in G4 I would like a formula that looks for all trains that depart after 7:49 (A4) and also match both its "From" and "To" (C4 & D4). It would then need to check these trains' corresponding "actual arrival times" in column F and show the earliest one. So for row 4 this would be 9:36.
Any help would be really appreciated as I have been messing around with this for over a day and have gotten nowhere!
A link to the example is here - https://docs.google.com/spreadsheets/d/1eE8t4-_hKB6o5j3W57EHgKzsF9p1usm7nojerjmrDwY/edit#gid=0
Thanks
Oli
Not sure about 9:36. Do you mean 9:39?
This is a little tricky, but I think what you are looking for is a multi-conditional array lookup. Below is what I think you are trying to achieve:
Where A2:A8 is greater than A4, C2:C8 = C4 and D2:D8 = D4, what is the lowest value in F2:F8?
Is this correct?
If so, I came up with this formula:
=ArrayFormula(MIN(IF((A2:A8>A4),IF((C2:C8=C4),IF((D2:D8=D4),F2:F8)))))
If you get 0.402 or something similar, format the cell as a time.
Otherwise, could you break it down for us a bit more?
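For readers who want to check the logic outside the spreadsheet, here is the same multi-conditional minimum as a small Python sketch. The timetable rows are invented for illustration (they are not the data from the linked sheet), and `earliest_arrival` and `as_day_fraction` are hypothetical helper names. The second helper shows why an unformatted cell displays something like 0.402: spreadsheets store times as fractions of a day.

```python
from datetime import time

# Invented timetable: (planned departure, from, to, actual arrival)
trains = [
    (time(7, 19), "A", "B", time(9, 10)),
    (time(7, 49), "A", "B", time(9, 45)),
    (time(8, 19), "A", "B", time(9, 39)),
    (time(8, 49), "A", "B", time(10, 5)),
    (time(8, 0), "A", "C", time(9, 0)),
]


def earliest_arrival(trains, planned, origin, destination):
    """Earliest actual arrival among trains on the same route that depart
    strictly after `planned` (the same test as A2:A8>A4 in the formula)."""
    arrivals = [
        arrival
        for departure, frm, to, arrival in trains
        if departure > planned and frm == origin and to == destination
    ]
    return min(arrivals) if arrivals else None


def as_day_fraction(t):
    """A clock time expressed as a fraction of a 24-hour day."""
    return (t.hour * 60 + t.minute) / (24 * 60)


best = earliest_arrival(trains, time(7, 49), "A", "B")
print(best, round(as_day_fraction(best), 3))
```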
I have a requirement to store dates and durations arising from multiple different calendars. In particular I need to store dates that:
Span the change to the Gregorian calendar, which happened in different countries at different times
Cover a historic period of at least 500 years
Deal with multiple types of calendar: lunar, solar, Chinese, financial, Christian, UTC, Muslim.
Deal with the change, in the UK, of the year end from 31st March to 31st December, and comparable changes in other countries.
I also need to store durations which I have defined as the difference between two timestamps (date and time). This implies the need to be able to store a "zero" date - so I can store durations of, say, three and a half hours; or 10 minutes.
I have details of the computations needed. Firebird's timestamp is based on a date function that starts at January 1st, 100 CE, so is not capable of being used for durations in the way I need to record them. In addition this data type is geared up (like most timestamp functions) to record the number of days since a base date; it is not geared up to record calendar dates.
Could anyone suggest:
A data structure to store dates and durations that meet the above requirements OR
A reference to such a data structure OR
Offer guidelines to approach the structuring of such storage OR
Any points that may help me to a solution.
EDIT:
@Warren P has provided some excellent work in his responses. I obviously have not explained clearly enough what I am seeking, as his work concentrates on the computations and how to go about calculating them. All valuable and useful stuff, but not what I intended my question to convey.
I do have details of all the computations needed to convert between various representations of dates, and I have a fairly good idea of how to implement them (using elements such as Warren suggests). However, my requirement is to STORE dates which meet the various criteria listed above. Example: date to be stored - 'Third June 13 Charles II'. I am trying to determine an appropriate structure within which to store such dates.
EDIT:
I have amended my proposed schema. I have listed the attributes on each table, and defined the tables and attributes by examples, given in the third section of the entity box. I have used the example given in this question and answer in my definition by example, and have amended the example in my question to correspond. Although I have proved my schema by describing somebody else's example, this schema may still be overcomplicated or over-analysed, may miss some obvious simplification, and may prove very difficult to implement (indeed, it may be plain wrong). Any comments or suggestions would be most welcome.
If you are writing your own, as I assume you intend to, I would make a class that contains a TDateTime and other fields. I would base it on the functionality in the very nicely written mxDateTime extension for Python, which is easily readable, open-source C code from which you could extract the Gregorian calendar logic you are going to need.
Within certain limits, TDateTime is always right. Its epoch value (0) is December 30, 1899 at midnight. From there, you can calculate other Julian day numbers. It supports negative values, and thus it will support more than 400 years. I believe you will have to start making corrections at the time of the last Gregorian calendar reforms. If you start from Friday, 15 October 1582, and figure out its Julian day number and the reforms before and after that, you should be able to do all that you require. Be aware that the time of day runs "backwards" before 1899, but this is purely a problem in human heads: the computer will be accurate, and will calculate the number of minutes and seconds for you, up to the limit of double-precision floating-point math. Stick with TDateTime as your base.
I found some really old BorlandPascal/TurboPascal code that handles a really wide range of dates here.
If you need to handle Arabic, Jewish, and other calendars, again, I refer you to Python as a great source of working examples. Not just the mxDateTime extension, but stuff like this.
For database persistence, you might want to base your date storage around Julian day numbers, and store your time as C-like seconds since midnight, if the maximum resolution you need is 1 second.
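As a rough illustration of that storage layout, here is a Python sketch that splits a timestamp into an integer Julian day number and seconds since midnight, and reassembles it. The helper names `to_storage` and `from_storage` are hypothetical. It leans on the fact that Python's `date.toordinal()` counts days in the proleptic Gregorian calendar, so adding 1721425 yields the Julian day number; Python dates only go back to year 1, which is an assumption that fits the 500-year range in the question.

```python
from datetime import date, datetime, time, timedelta

# Julian day number = proleptic Gregorian ordinal + this offset
JDN_OFFSET = 1721425


def to_storage(moment):
    """Split a datetime into (Julian day number, seconds since midnight)."""
    jdn = moment.date().toordinal() + JDN_OFFSET
    seconds = moment.hour * 3600 + moment.minute * 60 + moment.second
    return jdn, seconds


def from_storage(jdn, seconds):
    """Rebuild the datetime from the two stored integers."""
    midnight = datetime.combine(date.fromordinal(jdn - JDN_OFFSET), time())
    return midnight + timedelta(seconds=seconds)


stored = to_storage(datetime(1582, 10, 15, 12, 0, 0))
print(stored, from_storage(*stored))
```

Two integer columns like these sort correctly and make day differences a plain subtraction.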
Here's a snippet I would start with, and do code completion on:
type
  TCalendarDisplaySubtype = (cdsGregorian, cdsHebrew, cdsArabic, cdsAztec,
    cdsValveSoftwareCompany, cdsWhoTheHeckKnows);

  TDateInformation = class
  private
    FBaseDateTime: TDateTime;
    FYear, FMonth, FDay: Integer; // -1 means not calculated yet
    FCalendarDisplaySubtype: TCalendarDisplaySubtype;
    function GetJulianDayNumber: Integer;
    procedure SetJulianDayNumber(Value: Integer);
  public
    // Pascal identifiers are case-insensitive, so minutes use N rather than a second "m"
    function SetByDateInCE(Y, M, D, H, N, S: Integer): Boolean;
    function GetAsDateInCE(var Y, M, D, H, N, S: Integer): Boolean;
    function DisplayStr: String;
    function SetByDateInJewishCalendar( ... );
    property BaseDateTime: TDateTime read FBaseDateTime write FBaseDateTime;
    property JulianDayNumber: Integer read GetJulianDayNumber write SetJulianDayNumber;
    property CalendarDisplaySubtype: TCalendarDisplaySubtype
      read FCalendarDisplaySubtype write FCalendarDisplaySubtype;
  end;
I see no reason to STORE both the Julian day number and the TDateTime. Just use a constant: add it to (or subtract it from) the Trunc(FBaseDateTime) value in the GetJulianDayNumber and SetJulianDayNumber methods. It might be worth having fields where you calculate the year, month, and day for the given calendar once and store them, which makes the display-as-string function much simpler and faster.
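The constant in question is the Julian day number of the TDateTime epoch, 30 December 1899, which is 2415019. Here is a small sketch in Python (not Delphi) that treats a TDateTime as a plain floating-point number of days; the function names are hypothetical and only mirror the getter and setter mentioned above.

```python
import math

# Julian day number of 30 December 1899, the day TDateTime value 0 falls on
TDATETIME_EPOCH_JDN = 2415019


def tdatetime_to_jdn(value):
    """Drop the time-of-day fraction and shift by the epoch constant.
    Truncation (not floor) matches TDateTime, where -1.25 is still day -1."""
    return math.trunc(value) + TDATETIME_EPOCH_JDN


def jdn_to_tdatetime(jdn):
    """Midnight of the given Julian day number as a TDateTime-style float."""
    return float(jdn - TDATETIME_EPOCH_JDN)


print(tdatetime_to_jdn(0.0))      # the epoch itself
print(jdn_to_tdatetime(2299161))  # 15 October 1582 as a day offset
```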
Update: It looks like you're better at ER modelling than me, so if you posted that diagram, I'd upvote it, and that would be it. As for me, I'd be storing three kinds of fields: a Datetime field that is normalized to modern calendar standards, a free-form text field containing the original scholarly date in whatever form, and a few other fields that are foreign keys into subtype lookup tables, to help me organize and search on dates by date and subtype. That would be IT for me.
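A minimal sketch of such a record, written as a Python dataclass rather than a table definition. The class name, the field names and the `calendar_id` value are all hypothetical; the normalised field is left empty here because the sketch does not attempt to convert the regnal date from the question.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class StoredDate:
    original_text: str                     # the date exactly as the source gives it
    calendar_id: int                       # foreign key into a calendar/subtype lookup table
    normalized: Optional[datetime] = None  # modern-calendar equivalent, once computed


# calendar_id=7 is a placeholder key, not a real lookup value
record = StoredDate(original_text="Third June 13 Charles II", calendar_id=7)
print(record)
```

Keeping the original text untouched means the normalised value can always be recomputed if the conversion rules are later corrected.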
Only a partial answer but an important piece.
Since you are going to store dates over a very broad range, during which a lot of things happened to calendars, you need to accommodate those changes.
The time zone database (TZ database) and TZDB, the Delphi wrapper around it, will be of big help.
It has a database of rules for how time zones have behaved historically.
I know they are based on the current calendar schemes, and you need to convert to UTC first.
You need to devise something similar for the other calendar schemes you want to support.
Edit:
The scheme I'd use would be like this:
find ways for all your calendars to convert to/from UTC
store the calendar type
store the dates in their original format, and the source of the date (just in case your source screwed up, and you need to recalculate).
use the UTC conversions to go from your original through UTC to the calendar types in your UI
--jeroen
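The scheme above can be sketched as a conversion through a shared pivot. This Python example uses the Julian day number as a day-level stand-in for the UTC pivot, which is an assumption of the sketch and not part of the answer; the registry `TO_PIVOT`, the function names and the stored record are hypothetical. The Julian-calendar formula is the standard integer one, and the display side uses Python's proleptic Gregorian `date`.

```python
from datetime import date


def julian_calendar_to_jdn(year, month, day):
    """Julian-calendar date -> Julian day number (standard integer formula)."""
    a = (14 - month) // 12
    y = year + 4800 - a
    m = month + 12 * a - 3
    return day + (153 * m + 2) // 5 + 365 * y + y // 4 - 32083


def jdn_to_gregorian(jdn):
    """Julian day number -> proleptic Gregorian date (year 1 onwards)."""
    return date.fromordinal(jdn - 1721425)


# Each supported calendar registers a converter to the shared pivot.
TO_PIVOT = {"julian": julian_calendar_to_jdn}

# Stored as-is: calendar type, original value, and where it came from.
stored = {"calendar": "julian", "original": (1582, 10, 4), "source": "hypothetical"}

pivot = TO_PIVOT[stored["calendar"]](*stored["original"])  # original -> pivot
shown = jdn_to_gregorian(pivot)                            # pivot -> UI calendar
print(pivot, shown)
```

Adding a calendar then means adding one pair of converters to and from the pivot, rather than one converter per pair of calendars.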