Time Series Model - Device/Customer wise

I want to build a time series model (ARIMA or LSTM, for example) to forecast a single time series column, but per customer. For instance, my data has the columns customer_name, date, metric, and I want to forecast what the metric will be for customer1 tomorrow.
The actual question is: how do I include the categorical column customer_name in the model?
I am fine building a model with the metric column alone, but what I need is the metric for a particular customer.
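One common approach is to sidestep the encoding question entirely and fit a separate model per customer. A minimal sketch, assuming pandas and statsmodels (the file name and the ARIMA order are illustrative):

    import pandas as pd
    from statsmodels.tsa.arima.model import ARIMA

    # Assumes the columns from the question: customer_name, date, metric.
    df = pd.read_csv("customer_metrics.csv", parse_dates=["date"])

    forecasts = {}
    for customer, group in df.groupby("customer_name"):
        # One univariate series per customer, indexed by date.
        series = group.sort_values("date").set_index("date")["metric"]
        fitted = ARIMA(series, order=(1, 1, 1)).fit()  # order is illustrative
        forecasts[customer] = fitted.forecast(steps=1)  # tomorrow's metric

    print(forecasts["customer1"])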

Related

Multivariate timeseries using DARTS

I want to forecast sales, but the data is in the following format: different geo-locations share the same sales dates. Say I have three locations with three weeks of sales data; all three locations have the same dates. My question is whether I should create a separate N-BEATS model for each location.
The data also has some categorical features. Aside from one-hot encoding, are there other ways to use categorical features with the darts library, as there are in PyTorch Forecasting?
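For the first part, darts can train a single global model across all locations by passing a list of series. A minimal sketch, assuming a DataFrame df with date, location, and sales columns (illustrative names); from_group_dataframe also takes a static_cols argument for attaching per-location categorical features as static covariates, though support for them varies by model:

    from darts import TimeSeries
    from darts.models import NBEATSModel

    # One TimeSeries per location; a single global model learns from all of them.
    series_list = TimeSeries.from_group_dataframe(
        df, group_cols="location", time_col="date", value_cols="sales"
    )

    # Chunk lengths are illustrative and must fit the available history.
    model = NBEATSModel(input_chunk_length=14, output_chunk_length=7)
    model.fit(series_list)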

How to fine-tune the number of clusters in k-means clustering, and how to build the model incrementally, using BigQuery ML

I used a K-means clustering model for anomaly detection with BigQuery ML.
Dataset information:
date        DATE
trade_id    INT
trade_name  STRING
agent_id    INT
agent_name  STRING
total_item  INT
Mapping: one trade has multiple agents, based on date.
The model is trained on the fields below, aggregated by sum(total_item):
trade_id
trade_name
agent_id
agent_name
Number of clusters: 4
I need to find anomalies for each trade and agent, based on date.
The model is trained on the given data and distance_from_closest_centroid is calculated; the distance is retrieved for each trade and agent based on date, and the largest distance is considered an anomaly.
Questions
1. How do I find the number of clusters to use for the model (e.g., using the elbow method to select a minimal number of clusters)?
2. How do I build the model when trade data is added on a daily basis? Is it possible to build the model incrementally, day by day?
As the question was updated, I will sum up our discussion as an answer to further contribute to the community.
Regarding your first question, "How do I find the number of clusters to use for the model (e.g., using the elbow method to select a minimal number of clusters)?":
According to the documentation, if you omit the num_clusters option, BigQuery ML will choose a reasonable default based on the total number of rows in the training data. However, if you want to select the most optimal number, you can perform hyperparameter tuning, which is the process of selecting one (or a set of) optimal hyperparameters for a learning algorithm, in your case K-means within BigQuery ML. To determine the ideal number of clusters, you would run the CREATE MODEL query for different values of num_clusters, find the error measure, and select the point at which it is at a minimum. You can see the error measures in the Evaluation tab after training; it shows the Davies–Bouldin index and the mean squared distance.
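A minimal sketch of that loop, using the google-cloud-bigquery Python client (project, dataset, and table names are placeholders; the columns follow the question):

    from google.cloud import bigquery

    client = bigquery.Client()

    # Train one model per candidate num_clusters, then read the error
    # measures mentioned above from ML.EVALUATE.
    for k in range(2, 9):
        client.query(f"""
            CREATE OR REPLACE MODEL `my_project.my_dataset.kmeans_k{k}`
            OPTIONS(model_type='kmeans', num_clusters={k}) AS
            SELECT trade_id, trade_name, agent_id, agent_name, total_item
            FROM `my_project.my_dataset.trades`
        """).result()

        rows = client.query(f"""
            SELECT davies_bouldin_index, mean_squared_distance
            FROM ML.EVALUATE(MODEL `my_project.my_dataset.kmeans_k{k}`)
        """).result()
        for row in rows:
            print(k, row.davies_bouldin_index, row.mean_squared_distance)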
Your second question was: "How do I build the model when trade data is added on a daily basis? Is it possible to build the model incrementally, day by day?"
K-means is an unsupervised learning algorithm, so you train your model on your current data and store it in a dataset. This trained model can certainly be used with new data via ML.PREDICT, which will use the model to predict which cluster the new data belongs to.
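A minimal sketch of scoring new rows with the stored model via ML.PREDICT (same placeholder names as above):

    from google.cloud import bigquery

    client = bigquery.Client()

    # ML.PREDICT assigns each new row to its nearest cluster; the returned
    # distances can then be inspected for anomalies.
    rows = client.query("""
        SELECT *
        FROM ML.PREDICT(
            MODEL `my_project.my_dataset.kmeans_k4`,
            (SELECT trade_id, trade_name, agent_id, agent_name, total_item
             FROM `my_project.my_dataset.new_trades`))
    """).result()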
As a bonus, I would like to share this link to the documentation, which explains how K-means in BigQuery ML can be used to detect data anomalies.
UPDATE:
Regarding your question about retraining the model:
Question: "I want to rebuild the model because new trade information has to be updated in my existing model. In this case is this possible to append the model with only two month of data or should we need to rebuild the entire model?"
Answer: You would have to retrain the whole model when new relevant data arrives; it is not possible to append only two months of new data to the model. However, I should mention that you can and should use warm_start to retrain your existing model, as described here.
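A minimal sketch of such a warm-started retrain (same placeholder names; warm_start starts training from the existing model's centroids, but the query still reads the full, updated training table):

    from google.cloud import bigquery

    client = bigquery.Client()

    # Retrain the existing model in place, initialized from its current state.
    client.query("""
        CREATE OR REPLACE MODEL `my_project.my_dataset.kmeans_k4`
        OPTIONS(model_type='kmeans', num_clusters=4, warm_start=TRUE) AS
        SELECT trade_id, trade_name, agent_id, agent_name, total_item
        FROM `my_project.my_dataset.trades`
    """).result()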
As per @Alexandre Moraes:
Omitting num_clusters when using K-means, BigQuery ML will choose a reasonable default based on the number of rows in the training data. In addition, you can use hyperparameter tuning to determine an optimal number of clusters: run the CREATE MODEL query for different values of num_clusters, find the error measure, and pick the point at which the error is at a minimum (link).

Can I build an ML model with independent variables containing time series, categorical, and numeric data, and a binary (0/1) dependent variable?

Let's say I have data containing
salary,
job profile,
work experience,
number of people in the household,
other demographics, etc.,
for multiple people who visited my car dealership, and I also have data on whether or not each person bought a car from me.
I can leverage this dataset to predict whether a new customer coming in is likely to buy a car, and let's say I am currently doing this with XGBoost.
Now I have additional data, but it is time series data of the monthly expenditure each person makes, and say I have it for my training data too. I want to build a model which uses this time series data together with the old demographic data (salary, age, etc.) to determine whether a customer is likely to buy or not.
Note: in the second part, I have time series data only for monthly expenditure; the other variables are at a point in time. For example, I do not have a time series for salary or age.
Note 2: I also have categorical variables, like job profile, which I would like to use in the model, but I do not know whether the person has stayed in the same job profile or changed over from another one.
As most of the data is specific to the person (everything except the expenditure time series), it is better to bring the time series data to the person level. This can be done with feature engineering, such as:
As @cmxu suggested, take various statistical measures. It is even more beneficial to take these statistical measures over different time intervals, e.g., the mean over the last 2 days, 5 days, 7 days, 15 days, 30 days, 90 days, 180 days, etc.
Create mixed features, like:
a) the ratio of salary to the expenditure statistics created in point 1 (choose an appropriate interval)
b) salary per household member, or average monthly expenditure per household member, etc.
With similar ideas you can easily create hundreds or thousands of features from your data and then feed them all to XGBoost (which is easy to train and debug) or a NN (more complicated to train); a sketch of both steps follows below.
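A minimal sketch of the two feature-engineering steps, assuming a pandas DataFrame exp of monthly expenditure (person_id, month, expenditure) and a person-level frame people (person_id, salary, household_size); all names are illustrative:

    import pandas as pd

    exp = exp.sort_values(["person_id", "month"])

    # Whole-history statistics per person.
    feats = exp.groupby("person_id")["expenditure"].agg(
        exp_mean_all="mean", exp_std_all="std", exp_max_all="max"
    )

    # The same statistic over different trailing windows (months here,
    # since the expenditure series is monthly).
    for window in (3, 6, 12):
        last = exp.groupby("person_id").tail(window)
        feats[f"exp_mean_last_{window}m"] = (
            last.groupby("person_id")["expenditure"].mean()
        )

    # Mixed features: ratios against the person-level variables.
    feats = feats.join(people.set_index("person_id")[["salary", "household_size"]])
    feats["salary_to_expenditure"] = feats["salary"] / feats["exp_mean_last_3m"]
    feats["expenditure_per_member"] = feats["exp_mean_all"] / feats["household_size"]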

How to include variable attributes in Machine Learning models?

What machine learning techniques can be used to build a model when some attributes change over time? For example, predicting the price of a hotel depends on the number of tourists in the city, which is time-dependent, i.e., it changes from time to time.
Also, if we have a well-trained model on some static data, what are the ways to update the model when some of the data changes, other than retraining the model on the complete data again?
Regarding the first question, I would just add a feature indicating time. For instance, hotel X will appear in several data records, each differing in the value of its "Month" feature (the August data point might have a higher price than the December one). This way the model will take the time of year into consideration.
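A minimal sketch of such a time feature, assuming a pandas DataFrame df of hotel records with a date column (names are illustrative):

    import pandas as pd

    df["date"] = pd.to_datetime(df["date"])
    df["month"] = df["date"].dt.month  # 1-12

    # For linear models, one-hot the month so December isn't "larger" than
    # August; tree-based models can use the integer month directly.
    df = df.join(pd.get_dummies(df["month"], prefix="month"))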
Regarding the second question: unless you're using reinforcement learning / online learning, which trains models on an incoming sequence of samples, I don't see a way to incorporate changed data without training the model again.

Customer behaviour prediction when there are no negative examples

Imagine you own a postal service and you want to optimize your business processes. You have a history of orders in the following form (sorted by date):
# date user_id from to weight-in-grams
Jan-2014 "Alice" "London" "New York" 50
Jan-2014 "Bob" "Madrid" "Beijing" 100
...
Oct-2017 "Zoya" "Moscow" "St.Petersburg" 30
Most of the records (about 95%) contain positive numbers in the "weight-in-grams" field, but a few have zero weight (perhaps these messages were cancelled or lost).
Is it possible to predict whether the users from the history file (Alice, Bob, etc.) will use the service in November 2017? What machine learning methods should I use?
I tried simple logistic regression and decision trees, but they evidently give a positive outcome for every user, as there are very few negative examples in the training set. I also tried to apply the Pareto/NBD model (BTYD library in R), but it seems to be extremely slow for large datasets, and my dataset contains more than 500,000 records.
I have another problem: if I impute negative examples (treating a user who didn't send a letter in a given month as a negative example for that month), the dataset grows from 30 MB to 10 GB.
The answer is yes, you can try to predict this.
You can approach this as a time series problem and run an RNN: train the RNN on your set pivoted so that each user is one sample.
You can also pivot your set so that each user is a row (observation) by aggregating each user's data, and then run a multivariate logistic regression. You will lose information this way, but it might be simpler. You can add time-related columns such as "average delay between orders", "average orders per year", etc. (a sketch of this approach follows below).
You can use Bayesian methods to estimate the probability with which the user will return.
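A minimal sketch of the second approach (per-user aggregation plus logistic regression), assuming an orders DataFrame with the columns from the question; class_weight='balanced' offsets the class imbalance instead of materializing explicit negative rows (avoiding the 10 GB blow-up):

    import pandas as pd
    from sklearn.linear_model import LogisticRegression

    orders["date"] = pd.to_datetime(orders["date"])
    orders["month"] = orders["date"].dt.to_period("M")

    # Hold out the last observed month as the label month and build
    # features from everything before it.
    label_month = orders["month"].max()  # e.g. Oct-2017
    history = orders[orders["month"] < label_month]

    users = history.groupby("user_id").agg(
        n_orders=("date", "size"),
        total_weight=("weight-in-grams", "sum"),
        active_months=("month", "nunique"),
    )

    # Label: did the user order in the held-out month?
    users["label"] = users.index.isin(
        orders.loc[orders["month"] == label_month, "user_id"]
    ).astype(int)

    clf = LogisticRegression(class_weight="balanced", max_iter=1000)
    clf.fit(users.drop(columns="label"), users["label"])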
