I'm very new to machine learning, and I currently have a real-time project that needs a forecast of the inventory required for the coming months.
I have a dataset for the last 3 years. The dataset is very clean (no nulls, no duplicates).
Example of the data
year_month device_type person_type # of devices
2017-01 laptop employee 2
2017-01 desktop temp 5
I have data like this up until December 2019, and now I need to predict for February 2020.
Could someone suggest which model should be used?
My current thinking is to use linear regression.
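To make the linear-regression idea concrete, here is a minimal sketch of one way it could be framed: aggregate the counts into one monthly series per (device_type, person_type) pair and regress on a month index. The column names and the scikit-learn setup below are my assumptions for illustration, not part of the actual dataset.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy data in the same shape as the example above (values are made up).
df = pd.DataFrame({
    "year_month": ["2017-01", "2017-02", "2017-03", "2017-04"],
    "device_type": ["laptop"] * 4,
    "person_type": ["employee"] * 4,
    "num_devices": [2, 3, 3, 4],
})

# Encode the month as a numeric trend feature: months since January 2017.
ym = pd.to_datetime(df["year_month"])
df["t"] = (ym.dt.year - 2017) * 12 + ym.dt.month

# Fit one model per (device_type, person_type) series; shown for a single series here.
series = df[(df["device_type"] == "laptop") & (df["person_type"] == "employee")]
model = LinearRegression().fit(series[["t"]], series["num_devices"])

# February 2020 corresponds to t = (2020 - 2017) * 12 + 2 = 38 under this encoding.
print(model.predict(pd.DataFrame({"t": [38]})))
```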
When training a time series forecasting model, I checked the option to "Export test dataset to BigQuery." I'm having a hard time understanding the meaning of the "predicted_on" timestamps that appear in the BigQuery table.
Some info about my model: the granularity is weekly. The context window is 26 weeks, and the forecast horizon is 26 weeks. The 10% test data split also contains exactly 26 weeks of data. In our training data, we have a submission_week column which is designated as the "timestamp" column.
In the BigQuery table, I see the submission_week column. It starts on 06/05/2022, which is the first date of the 10% test data split. The BigQuery table also contains a predicted_on_submission_week column. (This is the column which I don't understand.)
When I sort the BigQuery table by submission_week and then predicted_on_submission_week, it looks like this:†
predicted_on_submission_week / submission_week
06/05/2022 06/05/2022
---
06/05/2022 06/12/2022
06/12/2022 06/12/2022
---
06/05/2022 06/19/2022
06/12/2022 06/19/2022
06/19/2022 06/19/2022
† (Note that for each row above, there actually are multiple rows in the BigQuery table - one for each time series.)
The pattern seen above proceeds until there are at most 6 predicted_on_submission_week timestamps for every submission_week timestamp.
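For what it's worth, here is a rough pandas sketch of how this pattern could be checked directly from the export; the table path and the use of pandas-gbq are placeholders/assumptions, not details from my setup.

```python
import pandas as pd

# Hypothetical table path; replace with the actual BigQuery export table.
QUERY = """
SELECT submission_week, predicted_on_submission_week
FROM `my_project.my_dataset.forecast_test_export`
"""

df = pd.read_gbq(QUERY)

# Count how many distinct predicted_on timestamps exist per submission_week;
# in my export this tops out at 6.
counts = (
    df.groupby("submission_week")["predicted_on_submission_week"]
      .nunique()
      .sort_index()
)
print(counts)
```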
My questions:
What is the meaning of the predicted_on_submission_week timestamps? Why are there multiple (at most 6) such timestamps for each submission_week timestamp?
(I suspect this may be related to how the context window and forecast horizon are used during training and forecasts as described here in Google's documentation, but I'm not sure...)
I have a question about machine learning and Named Entity Recognition.
My goal is to extract named entities from an invoice document. Invoices are typically structured text, and this kind of data is usually not well suited to natural language processing (NLP). I already tried to train a model with the NLP library spaCy to detect invoice metadata like date, total, and customer name. This works more or less well. As far as I understand, an invoice does not provide the unstructured plain text that is usually expected for NLP.
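A minimal sketch of the kind of spaCy training loop I mean is below; the labels, example text, and character offsets are made up for illustration and are not my real training data.

```python
import spacy
from spacy.training import Example

# Hypothetical training example: raw text plus character-offset entity spans.
TRAIN_DATA = [
    ("Invoice date: 26-Jul-2022 Total: 30,00 EUR",
     {"entities": [(14, 25, "INVOICE_DATE"), (33, 42, "TOTAL")]}),
]

nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")
for _, annotations in TRAIN_DATA:
    for _start, _end, label in annotations["entities"]:
        ner.add_label(label)

optimizer = nlp.initialize()
for epoch in range(30):
    losses = {}
    for text, annotations in TRAIN_DATA:
        example = Example.from_dict(nlp.make_doc(text), annotations)
        nlp.update([example], sgd=optimizer, losses=losses)
```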
A typical example of invoice text analyzed with NLP/ML, which I often found on the Internet, looks like this:
“Partial invoice (€100,000, so roughly 40%) for the consignment C27655 we shipped on 15th August to London from the Make Believe Town depot. INV2345 is for the balance.. Customer contact (Sigourney) says they will pay this on the usual credit terms (30 days).”
NLP loves this kind of text. But text extracted from an invoice PDF (using Apache Tika) usually looks more like this:
Client no: Invoice no: Invoice date: Due date:
1000011128 DEAXXXD220012269 26-Jul-2022 02-Aug-2022
Invoice to: Booking Reference
LOGISTCS GMBH Client Reference :
DEMOSTRASSE 2-6 Comments:
28195 BREMEN
Germany
Vessel : Voy : Place of Receipt : POL: B/LNo:
XXX JUBILEE NUBBBW SAV33NAH, GA ME000243
ETA: Final Destination : POD:
15-Jul-2022 ANTWERP, BELGIUM
Charge Quantity(days) x Rate Currency Total ROE Total EUR VAT
STORAGE_IMP_FOREIGN 1 day(s) x 30,00 EUR EUR 30,00 1,000000 30,00 0,00
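(For context, the flattened text above is roughly what a plain extraction returns; a minimal sketch of that step with the tika-python bindings, assuming a local file named invoice.pdf, would be:)

```python
from tika import parser  # tika-python bindings around Apache Tika

# Hypothetical file name; parsed["content"] is the flattened plain text
# that ends up looking like the invoice dump shown above.
parsed = parser.from_file("invoice.pdf")
print(parsed["content"])
```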
So I guess NLP is in general the wrong approach for training a model to recognize metadata in an invoice document. I think the problem is more like recognizing cats in a picture.
What could be a more promising approach for training Named Entity Recognition on structured text with a machine learning framework?
I have a dataset with daily sales in a company. The columns are: category code (4 categories), item code (195 items), day ID (from 1st Sep 2021 to 1st Feb 2022), and daily sales quantity.
In the val and test sets, I have to predict WEEKLY sales from 14th Feb 2022 to 13th March 2022. The columns are category code, item code, and week number (w1, w2, w3, w4). In the val set I have the weekly sales quantity, and in the test set I have to predict it.
Because my train set has DAILY sales and no week number, I am confused about how to approach this problem. I don't have historical sales data for the months given in the val and test sets.
Should I map days in the train set to weeks as w1, w2, w3, w4 for each month? Are there any other good methods?
I tried expanding the val set by dividing weekly sales by 7 and replacing each week row with 7 new rows, one for each day in that week, but it gave me very bad results.
I have to use the MAPE metric.
Welcome to the community!
Since you are asked to predict on a weekly basis, it is better to transform your training data to weeks.
A pandas method for this is resample(); you can learn more about it in the documentation here. You can change the offset string to the one you need so that it matches the way the validation set was built. All the available choices can be found here.
You may find this useful too.
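For example, a minimal sketch of the resampling step might look like this; the column names are assumptions about your data, and the offset string may need adjusting so the week boundaries match the validation set.

```python
import pandas as pd

# Toy daily data (column names are assumptions, not from the original post).
df = pd.DataFrame({
    "item_code": ["A", "A", "A", "B"],
    "date": pd.to_datetime(["2022-02-14", "2022-02-15", "2022-02-21", "2022-02-14"]),
    "qty": [3, 2, 5, 1],
})

# Resample daily sales into weekly totals per item; change "W-SUN" to the
# offset string that matches how the validation weeks were built.
weekly = (
    df.set_index("date")
      .groupby("item_code")["qty"]
      .resample("W-SUN")
      .sum()
      .reset_index()
)
print(weekly)
```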
The data contains Jan-Dec 2016 and Jan-June 2017 order data for each customer. Most of the order amounts are 0, which means most customers only order once or twice a year. Some customers are new in 2017, and there are no records for them in 2016.
The data looks like (in dollars): 0, 0, 0, 100, 0, 0, 0, 0, 70, 0, 0, 0, 0, 0, 0, 0, 0, 0
How can I forecast the order amount for each customer for July-Dec 2017?
What you are looking for is called intermittent demand forecasting. Check out the R package tsintermittent.
This link on Croston's method will be helpful: https://stats.stackexchange.com/questions/127337/explain-the-croston-method-of-r
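If you would rather stay in Python, here is a hand-rolled sketch of Croston's idea (smooth the non-zero demand sizes and the inter-demand intervals separately, then forecast their ratio); it is an illustration only, not the tsintermittent implementation.

```python
import numpy as np

def croston(demand, alpha=0.1, horizon=1):
    """Minimal sketch of Croston's method for intermittent demand."""
    demand = np.asarray(demand, dtype=float)
    nonzero = np.flatnonzero(demand)
    if nonzero.size == 0:
        return np.zeros(horizon)

    # Initialise with the first non-zero demand and the periods until it occurred.
    size = demand[nonzero[0]]
    interval = float(nonzero[0] + 1)
    prev = nonzero[0]

    # Exponentially smooth demand sizes and inter-demand intervals separately.
    for t in nonzero[1:]:
        size = alpha * demand[t] + (1 - alpha) * size
        interval = alpha * (t - prev) + (1 - alpha) * interval
        prev = t

    # Croston's forecast is the smoothed size divided by the smoothed interval.
    return np.full(horizon, size / interval)

# Example with the sparse order history from the question (in dollars):
history = [0, 0, 0, 100, 0, 0, 0, 0, 70, 0, 0, 0, 0, 0, 0, 0, 0, 0]
print(croston(history, alpha=0.1, horizon=6))
```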
We have a fact table which collects information detailing when an employee selected a benefit. The problem we are trying to solve is how to count the total benefits selected by all employees.
We do have a BenefitSelectedOnDay flag, and ordinarily we can do a SUM on this to get a result, but this only works for benefit selections made since we started loading the data.
For Example:
Suppose Client#1 has been using our analytics tool since October 2016. We have 4 months of data in the platform.
When the data is loaded in October, the Benefits source data will show:
Employee#1 selected a benefit on 4th April 2016.
Employee#2 selected a benefit on 3rd October 2016.
Setting the BenefitSelectedOnDay flag for Employee#2 is very straightforward.
The issue is what to do with Employee#1, because we can't set a flag on a day that doesn't exist for that client in the fact table. Client#1's data starts on 1st October 2016.
Counting the benefit selections is problematic in some scenarios. If we're filtering the report by date and only looking at benefit selections in Q4 2016, we have no problem. But if we want a total benefit-selection count, we do have a problem, because we haven't set a flag for Employee#1: the selection date precedes Client#1's dataset range (1st Oct 2016 to 31st Jan 2017 currently).
Two approaches seem logical in your scenario:
1. Load some historical data going back as far as the first benefit selection date that is still relevant to current reporting. While it may take some work and extra space, this may be your only solution if employees qualify for different benefits based on how long the benefit has been active.
2. Add records for a single day prior to the join date (Sept 30 in this case) and flag all benefits that were selected before, and are still active on, the client join date (Oct 1) as being selected on that date. They will fall outside the October reporting window but will count for unbounded queries. If benefits are a binary on/off thing, this should work just fine.
Personally, I would go with option 1 unless the storage requirements are ridiculous. Even then, you could load only the flagged records into the fact table. Your client might get confused if they can select a period prior to the join date and see incomplete data, but you can explain/justify that.
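As a rough illustration of option 2 (all names and dates below are placeholders based on the example in the question), the backfill could be sketched in pandas like this:

```python
import pandas as pd

# Client#1 joins on 1st October 2016; pre-join selections get a synthetic
# fact date of 30th September so unbounded counts still include them.
join_date = pd.Timestamp("2016-10-01")

selections = pd.DataFrame({
    "employee": ["Employee#1", "Employee#2"],
    "selection_date": pd.to_datetime(["2016-04-04", "2016-10-03"]),
})

fact = selections.copy()
fact["fact_date"] = fact["selection_date"].where(
    fact["selection_date"] >= join_date,   # keep dates inside the loaded range
    join_date - pd.Timedelta(days=1),      # pre-join selections land on Sept 30
)
fact["BenefitSelectedOnDay"] = 1

# Unbounded total counts both employees; a Q4-2016 filter still excludes Employee#1.
print(fact["BenefitSelectedOnDay"].sum())
print(fact.loc[fact["fact_date"] >= join_date, "BenefitSelectedOnDay"].sum())
```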