Aggregating daily to weekly data with categorical variables - machine-learning

My question doesn't concern any particular software; it's a broad question that could apply to any type of data mining problem.
I have a data set with daily data and a bunch of attributes, like the example below. 'Sales' is numeric and represents the revenue from sales on a given day. 'Open' is categorical and indicates whether a store is open (=1) or closed (=0). And 'Promo' is categorical, stating which type of promo is running on the given day (it takes the values a, b and c).
day         sales  open  promo
06/12/2022     15     1  a
05/12/2022      0     0  a
04/12/2022     12     1  b
Now, my goal is to develop a model that predicts weekly sales. In order to do this, I will need to aggregate daily data into weekly data.
For the variable sales this is quite straightforward, because weekly sales is simply the sum of the daily sales within a given week.
My question concerns the categorical variables (open and promo): what kind of aggregation function should I use? I have tried converting the variables to numeric and using the weekly mean as the aggregation for these attributes (sketched below), but I don't know if this is a common approach.
Does anyone know the best/usual way to tackle this?
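To make the attempt concrete, here is a minimal sketch of the mean/count idea in plain Ruby (the question is software-agnostic, so any language or dataframe library works the same way; the column names match the example table above). Note that the weekly mean of a 0/1 flag like open is just the fraction of open days, so summing it into a count of open days is equivalent; for promo, one count per level is the analogous idea:

require 'date'

rows = [
  { day: Date.new(2022, 12, 6), sales: 15, open: 1, promo: "a" },
  { day: Date.new(2022, 12, 5), sales: 0,  open: 0, promo: "a" },
  { day: Date.new(2022, 12, 4), sales: 12, open: 1, promo: "b" },
]

# Group by ISO year/week, then aggregate each column with its own function.
weekly = rows.group_by { |r| [r[:day].cwyear, r[:day].cweek] }.map do |(year, week), days|
  {
    year:      year,
    week:      week,
    sales:     days.sum { |r| r[:sales] },       # numeric: sum
    open_days: days.sum { |r| r[:open] },        # 0/1 flag: sum = number of open days
    promo:     days.map { |r| r[:promo] }.tally  # categorical: one count per level, e.g. {"a"=>2}
  }
end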
Thanks, anyway!

Related

Machine learning model with varying input shape as time changes

I am trying to predict the bookings of a stand-up comedy cafe. There are a lot of features I can use which have an effect on the number of sales (e.g. day of the year, weather, average sales last month, day of the week, average sales on the specific day of the week, etc.).
However, one of the features that correlates most with the actual number of sales is the number of tickets already sold before the deadline. Customers are able to start making reservations 120 hours (5 days) before the deadline for ordering (11:00 AM on the day of the show).
I would prefer to use this data as input for my machine learning algorithm. Currently I have created 120 columns in the dataframe, covering the 120 hours before the deadline up to the deadline itself. Column "hour_98" therefore shows the accumulated sales roughly 4 days before the deadline, column "hour_24" shows the accumulated sales 24 hours before the deadline, etc.
If I now want to predict the sales 24 hours before the deadline, the columns "hour_24" through "hour_0" all have "NaN" values. Since algorithms can't deal with NaN values, I currently give these columns a value of 0. However, I think this is too simplistic and will result in a bad prediction model.
How do we deal with a changing input shape since we obtain more data if we get closer to the deadline of ordering?
From what I understand, you have a fixed number of columns, each representing the data from a predefined hour before the deadline. So in a sense the input data shape never changes; only the validity of some input features changes.
Provided you have a fixed input shape with changing validity of the features (NaNs), you can get around the issue by adding a mask for each input feature. For example, a valid hour_24 can be represented as hour_24 = 20 with mask_24 = 1, and an invalid hour_24 as hour_24 = 0 (or any placeholder) with mask_24 = 0. The algorithm then has to learn when to ignore a given feature based on the corresponding mask.
This answer explains in more detail how to mask input.
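A rough sketch of that representation in plain Ruby (hypothetical rows, with nil standing in for NaN):

rows = [
  { "hour_96" => 10, "hour_24" => 20,  "hour_0" => 25  },  # past show: fully observed
  { "hour_96" => 12, "hour_24" => nil, "hour_0" => nil }   # upcoming show: last day unobserved
]

rows.each do |row|
  row.keys.select { |k| k.start_with?("hour_") }.each do |col|
    row[col.sub("hour_", "mask_")] = row[col].nil? ? 0 : 1  # 1 = observed, 0 = not yet available
    row[col] ||= 0                                          # placeholder for unobserved hours
  end
end

The model then receives both the hour_* and mask_* columns, so a placeholder 0 can no longer be confused with "zero tickets sold".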

What should I do with date column in dataset?

I'm working on the Australia weather dataset to predict whether it will rain tomorrow or not.
I'm new to machine learning and I don't know what to do with the date column in my dataset, because I know models only take numerical values.
So please tell me what I should do with this date column and how I should deal with it.
You can extract the year, month and day number from the date as categorical variables. It can be significant which month or year it is, or whether the date falls at the beginning or the end of the month.
If you tell me what kind of language you are using I can help you with code :)
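For illustration, the extraction is a one-liner in most languages; here is a sketch in Ruby, assuming the column holds date strings such as "2008-12-01" (a hypothetical value):

require 'date'

d = Date.parse("2008-12-01")  # hypothetical value from the date column

features = {
  year:        d.year,              # e.g. 2008
  month:       d.month,             # 1..12, a natural categorical
  day:         d.day,               # 1..31
  day_of_week: d.wday,              # 0 (Sunday) .. 6 (Saturday)
  month_start: d.day <= 10 ? 1 : 0, # rough "beginning of month" flag
  month_end:   d.day >= 21 ? 1 : 0  # rough "end of month" flag
}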
Hey Navin Bondade,
When you use one-hot encoding and the column contains too many distinct values, too many columns pile up in your dataset. So you can pick the 10 or 15 most frequent values in that particular feature, encode only those, and leave the rest out. I haven't used it personally, but I did see a team win the KDD Orange Cup with this technique.
Happy Learning !!
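A rough sketch of that trick in plain Ruby (made-up data; the same idea applies with any encoding library):

values = ["a", "b", "a", "c", "a", "b", "d", "e", "a", "b"]  # hypothetical high-cardinality column

top_n = 2  # in practice 10 or 15
top = values.tally.sort_by { |_, count| -count }.first(top_n).map(&:first)  # => ["a", "b"]

# One-hot encode only the frequent levels; rare levels become all zeros.
encoded = values.map do |v|
  top.to_h { |level| ["is_#{level}", v == level ? 1 : 0] }
end
# encoded.first  # => {"is_a"=>1, "is_b"=>0}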

Predicting next 4 quarters' customer count based on last 3 years' quarterly customer count

I am currently working on a project where I need to predict the next 4 quarters' customer count for a retail client, based on the customer count of the last three years, i.e. quarterly data, 12 data points in total. Please suggest the best approach to predict the customer count for the next 4 quarters.
Note: I can't share the data, but the customer count has a declining trend year over year.
Please let me know if more information is required or question is not clear.
With only 12 data points you would be hard-pushed to justify anything more than a simple regression analysis.
If the declining trend were so strong that you were at risk of passing below 0 sales, you could look at taking a log to linearise the data.
If there is a strong seasonal cycle you will need to factor that in, but doing so also reduces the effective sample size from 12 to 9 quarters of data (three degrees of freedom being used up by the seasonalisation).
That's about it, really.
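As a hedged sketch of that minimal approach (plain Ruby, made-up counts, no seasonal terms): fit a straight line to the log of the counts against a time index, then exponentiate the extrapolation for the next 4 quarters.

counts = [520, 505, 480, 470, 455, 430, 420, 410, 390, 375, 370, 350]  # hypothetical quarterly counts

# Hand-rolled least squares for log(y) = a + b*t.
t = (0...counts.size).to_a
y = counts.map { |c| Math.log(c) }
t_mean = t.sum.to_f / t.size
y_mean = y.sum / y.size
b = t.zip(y).sum { |ti, yi| (ti - t_mean) * (yi - y_mean) } /
    t.sum { |ti| (ti - t_mean)**2 }
a = y_mean - b * t_mean

# Extrapolate 4 quarters ahead and undo the log.
forecast = (counts.size...counts.size + 4).map { |ti| Math.exp(a + b * ti).round }

A strong seasonal cycle would add three quarter dummies to this fit, at the cost of the degrees of freedom mentioned above.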
You don't specify explicitly how far into the future you want to make your predictions; rather, you do that implicitly when you make sure your model is robust and does not over-fit.
What does that mean?
Make sure that the distribution of the labels, given your available independent variables, is similar to what you expect in the future. You can't expect your model to learn patterns that were not there in the first place. So the variables you want to include are those that still carry distinct information about customer counts 4 quarters into the future.

Modelling recurrent items (expenses) as records with Rails

I am writing what could be defined as an accountancy/invoicing app using Rails 5. I am in need of implementing a section that predicts the company's cashflow in the future. So far I've got the following:
Actual bank movements and balances (in the past), imported from the bank
Future invoices (income) which are expected to be paid on a certain date
Future one-time expenses which are expected to be paid on a certain date
Using these three sets of data, I can calculate, for any given date in the future, the sum of: the last known bank balance, plus all the future invoices values coming IN, minus all the future expenses going OUT, so I get, theoretically, the expected balance of the company for any given date.
My doubt arises when it comes to recurrent expenses (or potentially incomes). Given that all of the items I mentioned before (bank movements, invoices and expenses) are actual ActiveRecord records stored in my database, I'm not sure about how to treat the recurrent expenses, for example:
Let's imagine I want to enter a known future recurrent paycheck of a certain employee, which is $2000 every first day of the month.
1- Should I, at some point, generate the next X entries and treat them as normal future expenses (each with its own ID, date and amount)?
2- The other option I've thought of is having some kind of "declaration" of the nature of the recurring expense, as in "it's $2000 every 1st day of the month until -forever-", similar to a cronjob. But if I were to take this approach, I'd like to have an ActiveRecord-like interface, so that I can do something like:
cashflow = []
last_movement = BankMovement.last
value = last_movement.balance

# Walk forward one day at a time, applying expected money in and out.
(last_movement.date..(last_movement.date + 12.months)).each do |day|
  value += Invoice.pending.expected_on(day).sum(:gross_amount)
  value -= Expense.pending.expected_on(day).sum(:gross_amount)
  value -= RecurringExpense.expected_on(day).sum(:gross_amount)
  cashflow.push(date: day, balance: value)
end
This feels almost right, but I'm not sure how to link the actual expense with the recurring/calculated one. How can I then change the date if the expense gets paid the day after it was supposed to be? I need to have an actual record of each one of those, at least once they are "consolidated".
I'm not sure if I was clear enough about my problem here, so if anyone wants to help and has some spare time, please feel free to ask for any extra relevant info. I'd really appreciate some help, especially if we can find a way of doing this "the Rails way"!
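No answer is recorded here, but to make option 2 concrete, here is a rough sketch of what the "cronjob-like" declaration could look like as a model (every model, scope and column name beyond those in the question is hypothetical):

# Assumed columns: gross_amount, day_of_month, starts_on, ends_on (nullable).
class RecurringExpense < ApplicationRecord
  has_many :expenses  # consolidated occurrences become real Expense records

  # Mirrors the Invoice/Expense.expected_on interface used in the loop above.
  scope :expected_on, ->(day) {
    where(day_of_month: day.day)
      .where("starts_on <= ?", day)
      .where("ends_on IS NULL OR ends_on >= ?", day)
  }

  # Turn one occurrence into a real Expense record, which can then be
  # re-dated if it is actually paid a day late.
  def consolidate!(day)
    expenses.create!(date: day, gross_amount: gross_amount)
  end
end

Once an occurrence is consolidated, the projection loop would also need to skip it for that date (for example by checking for an existing Expense with the same date), so it is not counted twice.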

Algorithm for tracking changes in value over time

I am writing a rails app that deals with product inventory. I would like to include the following features, and am struggling with developing an efficient algorithm:
View stock history (how many were in stock on each date)
Quantity removed from warehouse, and quantity added to warehouse over specific periods of time
Amount of time the product was out of stock in any given period
My questions are as follows:
What is the best way of tracking changes? In addition to my Products table, should I create another table called HistoricProductQuantities, and insert a new record each time there is a change in the quantity?
What number should I track? The historic stock quantity (i.e. 50 in stock on this day, 24 in stock on that day), or the CHANGE in stock quantity, i.e. -5 (5 sold) or 15 (15 added to inventory)? Or do I track both in separate tables?
Thanks for your help.
First of all, I recommend implementing date dimensions in your application, as it seems like you will be doing a lot of time-related calculations. Search for "date dimension" as the details are beyond the scope of your question; that said, I believe it will be of great benefit for your app to implement and use date dimensions.
As far as your direct questions go:
What is the best way of tracking changes? In addition to my Products table, should I create another table called HistoricProductQuantities, and insert a new record each time there is a change in the quantity?
Yes, you could do this. I would probably call it HistoricProductSnapshot and keep track of the product activity in there on a daily basis. With this information, as well as the date dimensions, you can do calculations such as "how many of Product X did we have 5 days ago, or a month ago?"
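A rough sketch of such a snapshot table as a Rails migration (table and column names are hypothetical):

# Hypothetical migration: one row per product per day.
class CreateHistoricProductSnapshots < ActiveRecord::Migration[5.0]
  def change
    create_table :historic_product_snapshots do |t|
      t.references :product, foreign_key: true
      t.date    :snapshot_date, null: false
      t.integer :quantity,      null: false
      t.timestamps
    end
    add_index :historic_product_snapshots, [:product_id, :snapshot_date], unique: true
  end
end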
What number should I track? The historic stock quantity (i.e. 50 in stock on this day, 24 in stock on that day), or the CHANGE in stock quantity i.e. -5 (5 sold) or 15 (15 added to inventory)? Or do I track both in separate tables?
I do not have experience writing inventory control software, but I believe that with the snapshot table mentioned above you would only have to keep track of quantities per day. The change in product counts can then be calculated from your snapshot table. You could, for example, have a function that outputs the product amount in a given time range as an array. Example: from March 1 to March 7 these were the stock amounts for Product Y: [45, 40, 39, 27, 22, 45, 44].
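Deriving the changes from those snapshots is then a simple difference over consecutive days; a sketch in plain Ruby using the example numbers above:

snapshots = [45, 40, 39, 27, 22, 45, 44]  # daily stock for Product Y, March 1..7

# Day-over-day changes derived from consecutive snapshots:
changes = snapshots.each_cons(2).map { |prev, cur| cur - prev }
# => [-5, -1, -12, -5, 23, -1]  (negative = removed/sold, positive = added)

removed = changes.select(&:negative?).sum.abs  # => 24 units removed over the period
added   = changes.select(&:positive?).sum      # => 23 units added
out_of_stock_days = snapshots.count(&:zero?)   # => 0 days at zero stock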
Hope that helps. As I said, I am not a product inventory guy, but I have worked with point-of-sale systems and the procedure above should give you a good enough start for what you are trying to do.
This gem could be useful for tracking changes in models: https://github.com/collectiveidea/audited
Keep the data raw. I would personally create a new data entry every day, recording how many items you have in stock per day. Or you can make the interval much shorter, such as every 12 hours.
For our particular use case:
We had a table called Days, which had a many-to-many relationship with Products, and each "relationship" row had a value called quantity (to keep track of the quantity of each product per day). Additionally, each of those rows had a one-to-many relationship with Transactions, holding entries for the time of each transaction and the remaining stock.
I would personally advise you to use the quantity of stock as the raw data, as it will enable you to derive things such as how many items were removed during a certain transaction, when the item went out of stock and when it came back in stock, all from the data. When you have data on which you need to perform statistical calculations, it's best to store it as raw values (the quantity of the item).
