Multivariate timeseries using DARTS - time-series

All. I want to forecast sales, but the data is in the following format: different geo-locations have the same sales date, say I have three locations with three weeks of sales data, all three locations have the same dates. My question is whether I should create a unique N-BEATS model for each location.
It also has some categorical features, aside from one-hot, are there any other ways to use categorical features using the darts library like in PyTorch forecasting?

Related

How to fine-tune the number of cluster in k-means clustering and incremental way of building the model using BigQuery ML

Used K-means clustering Model for detecting anomaly using BigQuery ML.
Datasets information
date Date
trade_id INT
trade_name STRING
agent_id INT
agent_name String
total_item INT
Mapping - One trade has multiple agent based on date.
Model Trained with below information by sum(total_iteam)
trade_id
trade_name
agent_id
agent_name
Number of cluster: 4
Need to find the anomaly for each trades and agent based on date.
Model is trained with given set of data and distance_from_closest_centroid is calculated. for each trade and agent based on date distance is called. Rightest distance is consider as a anomaly. Using this information
Questions
1. How to find the number of cluster need to use for model(eg: Elbow method used for selecting minimal cluster number selection).
Questions
2. How to build the model in case when trade data added on daily basis. Its is possible to build the incremental way of building the model on daily basis.
As the question was updated, I will sum up our discussion as an answer to further contribute to the community.
Regarding your first question "How to find the number of cluster need to use for model(eg: Elbow method used for selecting minimal cluster number selection).".
According to the documentation, if you omit the num_clusters option, BigQuery ML will choose a reasonable default based on the total number of rows in the training data. However, if you want to select the most optimal number, you can perform hyperarameter tunning, which is the process of selecting one (or a set) of optimal hyperparameter for a learning algorithm, in your case K-means within BigQuery ML. In order to determine the ideal number of clusters you would run CREATE MODEL query for different values of num_clusters. Then , finding the error measure and select the point which it is at the minimum value. You can select the error measure in the training tab Evaluation, it will show the Davies–Bouldin index and Mean squared distance.
Your second question was "How to build the model in case when trade data added on daily basis. Its is possible to build the incremental way of building the model on daily basis."
K-means is an unsupervised leaning algorithm. So you will train your model with your current data. Then store it in a data set. This model is already trained and can certainly be used with new data, using the ML.PREDICT. So it will use the model to predict which clusters the new data belong to.
As a bonus information, I would like to share this link for the documentation which explains how K-means in BigQuery ML can be used to detect data anomaly.
UPDATE:
Regarding your question about retraining the model:
Question: "I want to rebuild the model because new trade information has to be updated in my existing model. In this case is this possible to append the model with only two month of data or should we need to rebuild the entire model?"
Answer: You would have to retrain the whole model if new relevant data arrives. There is not the possibility to append the model with only two months of new data. Although, I must mention that you can and should use warm_start to retrain your already existing model, here.
As per #Alexandre Moraes
omiting the num_clusters using K-means, BigQuery ML will choose a reasonable amount based in the number of rows in the training data. In addition, you can also use hyperparameter tuning to determine a optimal number of clusters. Thus, you would have to run the CREATE MODEL query for different values of num_clusters, find the error measure and pick the point which the error is minimum, link. –

Time Series Model - Device/Customer wise

I wanted to build a time series model - ARIMA, LSTM for eg. where I am trying to forecast a single time series column. But I wanted to do it say each customer wise. For instance: customer_name, date, metric is my data. I wanted to forecast what will be the metric for customer1 tomorrow.
The actual question now is: How to include the categorical column customer_name to the model.
I am good building model only with metric column but what I need is the metric for a particular customer.

Before clustering should i do an analysis on time series?

I have a question. I have a lot of different items, different articles of a company, (26000) and i have the sell quantity of 52 weeks of 2017. I need to do a forecasting model for the future so I decided to do a cluster of items.
The goal is to show the quantity of items that were sold during 2017 in the similar quantity and for the new collection of items i do a classification based on the cluster and do a specific model forecasting for items. It’s my first time that i use machine learning so i need help.
Do I need to do an analysis about correlation before i do the cluster?
I can create a metric based on correlation that i put in my cluster function like the distance metric.
Doing clustering on time series data cannot yield results on raw data.
Time series data is about trends and not actual values.
Try transforming your data to reflect some trends and the do clustering.
For example suppose your data is like 5,10,45,23
Transform it to 0,1,1,0. (1 means increase in value then previous). By doing so you can cluster the items which increases or decreases together.
This is just an opinion, you will have to try out various transformations and see what works on your data. https://datascience.stackexchange.com/ is relevant place to ask such questions

Model that predict both categorical and numerical output

I am building a RNN for a time series model, which have a categorical output.
For example, if precious 3 pattern is "A","B","A","B" model predict next is "A".
there's also a numerical level associated with each category.
For example A is 100, B is 50,
so A(100), B(50), A(100), B(50),
I have the model framework to predict next is "A", it would be nice to predict the (100) at the same time.
For real life examples, you have national weather data.
You are predicting the next few days weather type(Sunny, windy, raining ect...) at the same time, it would be nice model will also predict the temperature.
Or for Amazon, analysis customer's trxns pattern.
Customer A shopped category
electronic($100), household($10), ... ...
predict what next trxn category that this customer is likely to shop and predict at the same time what would be the amount of that trxns.
Researched a bit, have not found any relevant research on similar topics.
What is stopping you from adding an extra output to your model? You could have one categorical output and one numerical output next to each other. Every neural network library out there supports multiple outputs.
Your will need to normalise your output data though. Categories should be normalised with one-hot encoding and numerical values should be normalised by dividing by some maximal value.
Researched a bit, have not found any relevant research on similar topics.
Because this is not really a 'topic'. This is something completely normal, and it does not require some special kind of network.

Can categorical information improve prediction for out-of-sample categories?

Suppose we have records with several features relating to a target number that we're trying to predict. All records follow the same general underlying pattern, and are learned quite well by a RandomForestRegressor. Let's now say that all records have added a categorical feature, which can be encoded as additional information to improve upon the model's prediction ability. So far, so good.
But now let's say we want to use our regressor that was trained on the data including the categorical feature to predict records with new categories not represented in the training data. In this context, does the categorical information become useless (or worse?) Should the model be retrained without categorical information available in order to get the best generalization performance (since it's been previously fit to categories not in this dataset)? Or, is there some possible way that knowing category membership in the training data could improve prediction ability for out-of-sample categories?
If these sets have no intersection, then you shouldn't include the variable. If you expect to see some of the original values in the test data, then you should use it.

Resources