I have a dataset of daily sales for a company. The columns are: category code (4 categories), item code (195 items), day ID (from 1st Sep 2021 to 1st Feb 2022), and daily sales quantity.
In the val and test sets, I have to predict WEEKLY sales from 14th Feb 2022 to 13th March 2022. The columns are category code, item code, and week number (w1, w2, w3, w4). The val set includes the weekly sales quantity; for the test set, I have to predict it.
Because my train set has DAILY sales and no week number, I am confused about how to approach this problem. I don't have historical sales data for the months covered by the val and test sets.
Should I map days in the train set to weeks as w1, w2, w3, w4 for each month? Are there any other good methods?
I tried expanding the val set by dividing weekly sales by 7 and replacing each week row with 7 new rows, one for each day in that week, but it gave me very bad results.
I have to use the MAPE metric.
Welcome to the community!
Since you are asked to predict on a weekly basis, it is better to transform your training data to weeks.
A pandas method for this is resample(); you can learn more about it in the documentation here. You can change the offset string to the one that matches the way the validation set was built. All the available choices can be found here.
You may find this useful too.
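As a rough sketch of what that could look like (column names such as category, item, date, and qty are assumed here, and "W-SUN" is just one example offset string):

import pandas as pd

# Toy daily data; the column names are assumed, not taken from the question.
daily = pd.DataFrame({
    "category": ["A", "A", "A", "A"],
    "item":     [101, 101, 101, 101],
    "date":     pd.to_datetime(["2021-09-01", "2021-09-02", "2021-09-08", "2021-09-09"]),
    "qty":      [5, 3, 7, 2],
})

# Aggregate daily sales to weekly totals per category/item.
# "W-SUN" (weeks ending on Sunday) is only an example; pick the offset string
# that matches how the validation weeks were defined.
weekly = (
    daily.set_index("date")
         .groupby(["category", "item"])["qty"]
         .resample("W-SUN")
         .sum()
         .reset_index()
)
print(weekly)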
I am trying to create a formula that will help me predict a future date based on an average time per day.
For example, I have a range of dates [1/12/2022, 5/12/2022, 15/12/2022], and each date has a number of hours spent on that day [4, 2, 12]. At the moment I have a formula which works out the average per day by dividing the total hours by the number of days between the start date and the current date.
What I then want is to predict the date on which, at this average rate (say 4 hours per day), I will reach a goal of 2000 hours.
An example sheet would look like this -
If the scenario below is your input data, then the following formula may help.
=C2+ROUNDUP(B2/A2,0)
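Reading the formula (the cell meanings are an assumption based on the example sheet, not stated in the question): if A2 holds the average hours per day, B2 the hours still needed to reach the goal, and C2 the current date, then ROUNDUP(B2/A2, 0) is the whole number of days still required, e.g. ROUNDUP(2000/4, 0) = 500, and adding that to the date in C2 gives the predicted completion date.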
I am trying to predict the bookings of a stand-up comedian cafe. There are a lot of features I can use which have an effect on the number of sales (e.g. day of the year, weather, average sales last month, day of the week, average sales on the specific day of the week, etc.).
However, one of the features that correlates most with the actual number of sales is the number of tickets already sold before the deadline. The customers are able to start making reservations 120 hours (5 days) before the actual ordering deadline (11:00 AM on the day of the show).
I would prefer to use this data as input for my machine learning algorithm. Currently I have created 120 columns in the dataframe. The columns cover the 120 hours before the deadline up until the deadline itself. Column "hour_98" therefore shows the accumulated sales 4 days before the deadline, column "hour_24" shows the accumulated sales 24 hours before the deadline, etc.
If I now want to predict the sales 24 hours before the deadline, the columns "hour_24" through "hour_0" all have "NaN" values. Since algorithms can't deal with NaN values, I currently give these columns a value of 0. However, I think this is too simplistic and will result in a bad prediction model.
How do we deal with a changing input shape since we obtain more data if we get closer to the deadline of ordering?
Now from what I understand, you have a fixed number of columns, each representing the data from a predefined hour before the deadline. So in a sense the input data shape never changes, only the validity of some input features changes.
Provided you have a fixed input shape with changing validity of the features (NaNs), you can get around that issue by using a mask for each input feature. For example, a valid hour_24 can be represented as hour_24 = 20 and mask_24 = 1, while an invalid hour_24 can be represented as hour_24 = 0 (or whatever) and mask_24 = 0.
The algorithm itself will then need to learn when to ignore a given feature based on the corresponding feature's mask.
This answer explains in more detail how to mask input.
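A minimal sketch of that masking idea in pandas (the hour_* naming follows the question; the data and everything else are assumed):

import numpy as np
import pandas as pd

# Toy data: 5 shows, 120 cumulative-sales columns hour_0 ... hour_119.
hour_cols = [f"hour_{h}" for h in range(120)]
df = pd.DataFrame(np.random.randint(0, 50, size=(5, 120)), columns=hour_cols).astype(float)

# Pretend we are predicting 25 hours before the deadline:
# hour_24 ... hour_0 have not been observed yet.
df.loc[:, [f"hour_{h}" for h in range(25)]] = np.nan

# One mask column per hour column: 1 = observed, 0 = not yet available.
for h in range(120):
    df[f"mask_{h}"] = df[f"hour_{h}"].notna().astype(int)

# Replace the NaNs with a neutral value now that the masks carry that information.
df[hour_cols] = df[hour_cols].fillna(0)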
So I have 2 datasets.
In the first one, I have values for each hour of a day. Example:
Date Value
05/07/2017 01:00 5
05/07/2017 02:00 10
05/07/2017 03:00 5
In the second dataset, I only have the total for each day:
Date Value
05/07/2017 40
So I want to distribute the total of the second dataset according to the same distribution as the first dataset. Something like this:
Date Value
05/07/2017 01:00 10
05/07/2017 02:00 20
05/07/2017 03:00 10
How can I do this? I'm using R and have created a time series for the first dataset.
You may want to check the mice package for R, which specialises in missing-data imputation. In your case, a kNN method, which imputes missing values by looking at attribute-wise similar samples (times), might do the trick.
Having a second look, a somewhat more sophisticated procedure would be to bootstrap the values across the different times; to fill the missing values you would then look for a random combination of times (assuming you use a random sample from each time-specific pool or distribution) that totals to the sum that you have.
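Separately, for the simple proportional split that the question describes (scale each hour by its share of the day's total), a minimal sketch in Python/pandas rather than R, using the example values from the question:

import pandas as pd

hourly = pd.DataFrame({
    "date":  ["05/07/2017 01:00", "05/07/2017 02:00", "05/07/2017 03:00"],
    "value": [5, 10, 5],
})
daily_total = 40  # the day's total from the second dataset

# Scale each hourly value by its share of the day's sum: 5/20 * 40 = 10, etc.
hourly["distributed"] = daily_total * hourly["value"] / hourly["value"].sum()
print(hourly)  # distributed: 10, 20, 10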
This is a school assignment, though unfortunately I'm either overthinking the question or this is significantly easier than I think.
Starting off here is a link to my spreadsheet: https://docs.google.com/spreadsheets/d/1jDFzitEGi319i6hUjqjJDF8nYZ8qm09-ieMGHk2T7AA/edit?usp=sharing
I am trying to calculate a weekly average from [Point Spread], though columns A and B only offer a year and a week number. What would be the most efficient way to tackle this?
I guess you're supposed to calculate the average of Point Spread for each distinct pair of year and week, so for week 1 of 1998 you would calculate the average of the Point Spread over the first 16 rows.
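If you want this as a single spreadsheet formula per (year, week) pair, something like =AVERAGEIFS(C:C, A:A, 1998, B:B, 1) should return the week-1-of-1998 average, assuming the Point Spread values are in column C (the column letter is an assumption; adjust it to your sheet).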
Currently, my data in Influx is the following:
measurement: revenue_count
field: users
field: revenue #
timestamp: (auto generated by influx)
What I'm looking to do is find a way to get the average revenue for each day of the week, i.e. what is the average revenue for Monday, Tuesday, etc.
What's the best way to do this in influx?
You should use continuous queries to schedule automated rollups/downsampling and then select the data from these pre-calculated series.
If you don't have too many points, you might not need the CQs. In that case, an on-the-fly GROUP BY will most probably be enough.
I wasn't able to find info on whether you can "select all points for a certain day" by just specifying a date. As far as I know, this is currently not possible, because if you specify something like time = '2016-02-22', what this will effectively mean is 2016-02-22 00:00:00 (it won't mean "give me everything from 22nd Feb 2016").
What you may need to do is specify an interval (two time points) between which you expect your downsampled point to be placed.
InfluxDB has no concept of days of the week. You can get the average revenue per day, where a day is midnight to midnight UTC, with the following:
SELECT MEAN(revenue) FROM revenue_count WHERE time > now() - 7d GROUP BY time(1d)
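If you then need the averages per weekday (Monday, Tuesday, ...), one option is to take the daily means returned by a query like the one above and group them client-side; a minimal sketch in Python/pandas (the data shape is assumed):

import pandas as pd

# Daily means as returned by the GROUP BY time(1d) query; values are made up.
daily = pd.DataFrame({
    "time":         pd.to_datetime(["2016-02-22", "2016-02-23", "2016-02-29"]),
    "mean_revenue": [100.0, 120.0, 140.0],
})

# Average the daily means per weekday.
per_weekday = daily.groupby(daily["time"].dt.day_name())["mean_revenue"].mean()
print(per_weekday)  # e.g. Monday = (100 + 140) / 2 = 120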