So I have 2 datasets.
In the first one I have values for each hour of a day. Example:
Date Value
05/07/2017 01:00 5
05/07/2017 02:00 10
05/07/2017 03:00 5
In the second dataset I only have the total for each day:
Date Value
05/07/2017 40
So I want to distribute the total of the second dataset according to the distribution of the first dataset. Something like this:
Date Value
05/07/2017 01:00 10
05/07/2017 02:00 20
05/07/2017 03:00 10
How can I do this? I'm using R and have created a time series for the first dataset.
You may want to check the mice package for R, which specialises in missing-data imputation. In your case, a knn method, which imputes the missing values by looking at attribute-wise similar (time) samples, might do the trick.
Having a second look, a slightly more sophisticated procedure would be to bootstrap the values across the different times and then, to fill the missing values, find a random combination of times (assuming you draw a random sample from each time-specific pool or distribution) that totals to the sum you have.
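For what it's worth, the question's own example (hourly values 5, 10, 5 summing to 20, scaled up to a daily total of 40) amounts to simple proportional scaling. A minimal sketch of that calculation, shown in Python/pandas rather than R purely for illustration, with hypothetical column names:

import pandas as pd

# Hourly profile from the first dataset (hypothetical column names)
hourly = pd.DataFrame({
    "Date": pd.to_datetime(["2017-07-05 01:00", "2017-07-05 02:00", "2017-07-05 03:00"]),
    "Value": [5, 10, 5],
})

day_total = 40  # daily total for 05/07/2017 from the second dataset

# Each hour receives its proportional share of the daily total
hourly["Distributed"] = day_total * hourly["Value"] / hourly["Value"].sum()
print(hourly)  # Distributed: 10, 20, 10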
I am trying to create a formula that will help me predict a future date based on an average time per day.
For example, I have a range of dates [1/12/2022, 5/12/2022, 15/12/2022], and each date has an amount of hours spent on that day [4, 2, 12]. At the moment I have a formula which works out the average per day by dividing the total by the number of days between the start date and the current date.
What I want is to then predict the date on which, based on this average (say 4 hours per day), I will reach a goal of 2000 hours.
An example sheet would look like this -
If the below scenario is your input data, then the following formula may help.
=C2+ROUNDUP(B2/A2,0)
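As a hedged reading of that formula (the sheet layout is only shown in the screenshot, so the cell meanings here are assumptions): A2 would hold the average hours per day, B2 the hours remaining to the goal, and C2 the date the prediction is made from. A minimal Python sketch of the same arithmetic:

import math
from datetime import date, timedelta

avg_hours_per_day = 4            # assumed meaning of A2: average hours logged per day
remaining_hours = 2000           # assumed meaning of B2: hours still needed to reach the goal
start_date = date(2022, 12, 15)  # assumed meaning of C2: date the prediction is made from

days_needed = math.ceil(remaining_hours / avg_hours_per_day)  # ROUNDUP(B2/A2, 0)
predicted_date = start_date + timedelta(days=days_needed)     # C2 + ...
print(predicted_date)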
I am using pytorch-forecasting for count time series. I have some date information such as hour of day, day of week, day of month, etc.
When I assign these as categorical variables in TimeSeriesDataSet using time_varying_known_categoricals, the training.data['categoricals'] values seem shuffled and not in the same order as the target. Why is that?
The pandas dataframe looks like the below before going through TimeSeriesDataSet.
After the following code
why has the hour_of_day column changed to 0, 1, 12, 17?
Actually, the time_varying_known_categoricals are NOT shuffled. The categories assigned to them are just not in order like 1 for the 1st hour, 2 for the 2nd hour, etc., which is why it looks like the time series has been shuffled. I aligned the "hour_of_day" categorical variable for 3 days and noticed that the encoding for each hour matches correctly for each day, so there is no shuffling. This information should be mentioned in the docstring at least; it would save a lot of time and confusion.
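The codes reported in the question (0, 1, 12, 17 for hours 0 to 3) are consistent with the categories being label-encoded in the sort order of their string representations. A minimal plain-Python sketch of that effect (it does not call pytorch-forecasting itself, it only reproduces the numbering):

# Hours 0..23 stored as strings and label-encoded in lexicographic order
categories = sorted(str(h) for h in range(24))        # '0', '1', '10', '11', ..., '19', '2', '20', ...
codes = {cat: i for i, cat in enumerate(categories)}

for hour in ["0", "1", "2", "3"]:
    print(hour, "->", codes[hour])                    # 0 -> 0, 1 -> 1, 2 -> 12, 3 -> 17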
I am trying to predict the bookings of a stand-up comedy cafe. There are a lot of features I can use which have an effect on the number of sales (e.g. day of the year, weather, average sales last month, day of the week, average sales on the specific day of the week, etc.).
However, one of the features that correlates most with the actual number of sales is the number of tickets already sold before the deadline. Customers are able to start making reservations 120 hours (5 days) before the actual deadline for ordering (11:00 AM on the same day as the show).
I would prefer to use this data as input for my machine learning algorithm. Currently I have created 120 columns in the dataframe, covering the 120 hours before the deadline up to the deadline itself. Column "hour_98" therefore shows the accumulated sales roughly 4 days before the deadline, column "hour_24" shows the accumulated sales 24 hours before the deadline, etc.
If I now want to predict the sales 24 hours before the deadline, the columns "hour_24" through "hour_0" are all given "NaN" values. Since algorithms can't deal with NaN values, I currently give these columns a value of 0. However, I think this is too simplistic and will result in a bad prediction model.
How do we deal with a changing input shape since we obtain more data if we get closer to the deadline of ordering?
Now, from what I understand, you have a fixed number of columns, each representing the data from a predefined hour before the deadline. So in a sense the input data shape never changes; only the validity of some input features changes.
Provided you have a fixed input shape with changing validity of the features (NaNs), you can get around that issue by using a mask for each input feature. For example, a valid hour_24 can be represented as hour_24 = 20 and mask_24 = 1, while an invalid hour_24 can be represented as hour_24 = 0 (or whatever) and mask_24 = 0. The algorithm itself will need to learn when to ignore a given feature with respect to the related feature's mask.
This answer explains in more detail how to mask input.
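A minimal sketch of that value-plus-mask encoding, assuming a pandas DataFrame with hypothetical hour_* columns where NaN marks data that is not yet available:

import numpy as np
import pandas as pd

# Hypothetical accumulated-sales features; NaN means "not yet observed"
df = pd.DataFrame({
    "hour_48": [12.0, 15.0],
    "hour_24": [20.0, np.nan],
    "hour_0":  [np.nan, np.nan],
})

for col in ["hour_48", "hour_24", "hour_0"]:
    hours = col.split("_")[1]
    df[f"mask_{hours}"] = df[col].notna().astype(int)  # 1 = valid observation, 0 = missing
    df[col] = df[col].fillna(0)                        # placeholder value for invalid entries

print(df)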
Currently my data in Influx is the following
measurement: revenue_count
field: users
field: revenue #
timestamp: (auto generated by influx)
What I'm looking to do is find a way to get the average revenue for a day of the week, i.e. what is the average revenue for Monday, Tuesday, etc.
What's the best way to do this in InfluxDB?
You should use continuous queries to schedule automated rollups/downsampling and then select the data from these pre-calculated series.
If you don't have too many points, you might not need the CQs. In that case an on-the-fly GROUP BY will most probably be enough.
I wasn't able to find info on whether you can "select all points for a certain day" by just specifying a date. As far as I know, this is currently not possible, because if you specify something like time == '2016-02-22', what this will effectively mean is 2016-02-22 00:00:00 (it won't mean "give me everything from 22nd Feb 2016").
What you may need to do is specify an interval (two time points) between which you expect your downsampled point to be placed.
InfluxDB has no concept of days of the week. You can get the average revenue per day, where a day is midnight to midnight UTC, with the following:
SELECT MEAN(revenue) FROM revenue_count WHERE time > now() - 7d GROUP BY time(1d)
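Since the day-of-week grouping itself has to happen outside InfluxDB, one option (an assumption on my part, not something the answer prescribes) is to fetch the daily rollups produced by a query like the one above and average them by weekday on the client. A minimal pandas sketch with made-up daily totals:

import pandas as pd

# Made-up daily revenue totals, e.g. the output of the GROUP BY time(1d) query above
daily = pd.DataFrame({
    "time": pd.to_datetime(["2016-02-22", "2016-02-23", "2016-02-29", "2016-03-01"]),
    "mean_revenue": [100.0, 80.0, 120.0, 90.0],
})

# Average the daily values by weekday (e.g. all Mondays together)
daily["weekday"] = daily["time"].dt.day_name()
avg_by_weekday = daily.groupby("weekday")["mean_revenue"].mean()
print(avg_by_weekday)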
I'm building a data warehouse. Each fact has its timestamp. I need to create reports by day, month, quarter, but by hours too. Looking at the examples, I see that dates tend to be saved in dimension tables.
(source: etl-tools.info)
But I think that it makes no sense for time. The dimension table would grow and grow. On the other hand, a JOIN with the date dimension table is more efficient than using date/time functions in SQL.
What are your opinions/solutions?
(I'm using Infobright)
Kimball recommends having separate time- and date dimensions:
design-tip-51-latest-thinking-on-time-dimension-tables
In previous Toolkit books, we have recommended building such a dimension with the minutes or seconds component of time as an offset from midnight of each day, but we have come to realize that the resulting end user applications became too difficult, especially when trying to compute time spans. Also, unlike the calendar day dimension, there are very few descriptive attributes for the specific minute or second within a day. If the enterprise has well defined attributes for time slices within a day, such as shift names, or advertising time slots, an additional time-of-day dimension can be added to the design where this dimension is defined as the number of minutes (or even seconds) past midnight. Thus this time-of-day dimension would either have 1440 records if the grain were minutes or 86,400 records if the grain were seconds.
My guess is that it depends on your reporting requirement.
If you need something like
WHERE "Hour" = 10
meaning every day between 10:00:00 and 10:59:59, then I would use the time dimension, because it is faster than
WHERE date_part('hour', TimeStamp) = 10
because the date_part() function will be evaluated for every row.
You should still keep the TimeStamp in the fact table in order to aggregate over boundaries of days, like in:
WHERE TimeStamp between '2010-03-22 23:30' and '2010-03-23 11:15'
which gets awkward when using dimension fields.
Usually, the time dimension has a minute resolution, so 1440 rows.
Time should be a dimension in data warehouses, since you will frequently want to aggregate over it. You could use a snowflake schema to reduce the overhead. In general, as I pointed out in my comment, hours seem like an unusually high resolution. If you insist on them, making the hour of the day a separate dimension might help, but I cannot tell you if this is good design.
I would recommend having separate dimensions for date and time. The date dimension would have one record for each date within an identified valid range of dates, for example 01/01/1980 to 12/31/2025.
A separate time dimension would have 86,400 records, one for each second, each identified by a time key.
In the fact records, where you need both date and time, add both keys as references to these conformed dimensions.
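A minimal Python/pandas sketch of how such date and time-of-day dimensions could be generated (the column names here are illustrative assumptions, not a prescribed design):

import numpy as np
import pandas as pd

# Date dimension: one row per calendar date in the valid range
dates = pd.date_range("1980-01-01", "2025-12-31", freq="D")
dim_date = pd.DataFrame({
    "date_key": dates.strftime("%Y%m%d").astype(int),
    "date": dates,
    "year": dates.year,
    "quarter": dates.quarter,
    "month": dates.month,
    "day": dates.day,
})

# Time-of-day dimension: one row per second past midnight (86,400 rows)
seconds = np.arange(86_400)
dim_time = pd.DataFrame({
    "time_key": seconds,               # seconds past midnight
    "hour": seconds // 3600,
    "minute": (seconds % 3600) // 60,
    "second": seconds % 60,
})

print(len(dim_date), len(dim_time))    # row counts of each dimension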