LSTM prediction output for multiple values given a certain time step - machine-learning

I came across a time series prediction problem where I have a dataset with multiple entries. Each entry represents a value of a certain category in a given time. All the entries are indexed by their timestamp. The entries are separated by a constant time (2 minutes in my case). My goal is to predict all or a subset of the dataset values given a timestamp in the future. However, the majority of the tutorials online are focusing on predicting a single value from the dataset.
My question: Can an LSTM be used to model such a problem?

If I understood correctly, you have the beginning (of some length) of a sequence of values and you need to predict how the sequence continues from there for potentially multiple steps.
You could do that with e.g. a sequence to sequence model. See
www.tensorflow.org/versions/r0.12/tutorials/seq2seq/
Or
https://blog.keras.io/a-ten-minute-introduction-to-sequence-to-sequence-learning-in-keras.html
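
As a rough illustration of how such data could be framed for a sequence-to-sequence (or any multi-output) model, the sketch below slices a multivariate series into input windows and multi-step targets. The function name make_windows, the window lengths, and the toy data are assumptions made for this example; they do not come from the linked tutorials.

```python
def make_windows(series, n_in, n_out):
    """Split a multivariate series into (input, target) pairs.

    series: list of rows, one row per timestamp (e.g. every 2 minutes),
            each row holding one value per category.
    n_in:   number of past time steps fed to the model.
    n_out:  number of future time steps to predict.
    """
    inputs, targets = [], []
    for start in range(len(series) - n_in - n_out + 1):
        inputs.append(series[start:start + n_in])
        targets.append(series[start + n_in:start + n_in + n_out])
    return inputs, targets


# Toy data: 6 timestamps, 2 categories per timestamp.
series = [[t, 10 * t] for t in range(6)]
X, y = make_windows(series, n_in=3, n_out=2)
```

Each target holds every category for n_out future steps, so the model would have to emit n_out rows of values (one per category) rather than a single number.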

Related

What does the 'global_step' parameter refer to from the 'report_hyperparameter_tuning_metric' function in the hypertune package?

I am using Google Vertex AI to train models, and I am not sure what this parameter is specifying. I noticed that in some Vertex AI tutorials this value was set to a variable called 'NUM_EPOCHS'. Looking at the GitHub repository for the package doesn't add much clarity.
I'm not sure how this can be referring to the number of epochs that the model is trained with, as I feel that can be done more easily just by writing code (and its default value, 1000, seems absurdly high). What does this parameter mean?
The global_step from the training step is passed into the report_hyperparameter_tuning_metric function. As mentioned in this StackOverflow question, it represents how many batches the model has seen during training, from its start until now.
The function report_hyperparameter_tuning_metric records the value of some metric (e.g. loss) and dumps it to a file, so that you can see how well the model is performing. It takes the metric value and the step number (how many steps have passed, i.e. how many batches the model has seen) and records this data point. The function needs to be called after every step (the model sees a batch, updates the weights and the metric values, and then calls this function), so that the training metrics can be shown in a 2D plot (number of steps vs. metric). This step number equals the value of global_step.
The global_step must be an integer variable. Each time a batch is provided, the weights are updated in a direction that minimizes the loss. When the variable is passed as the global_step argument of optimizer.minimize(), it is incremented by one on each update.
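
To make the step counting concrete, here is a minimal sketch of a training loop that increments a step counter once per batch and records one (step, metric) point each time. report_metric is a hypothetical stand-in written for this example, not the hypertune function itself, and the loss values are made up.

```python
recorded = []  # (global_step, metric_value) pairs, i.e. the points of the 2D plot


def report_metric(metric_value, global_step):
    """Hypothetical stand-in for a metric reporter: just stores the data point."""
    recorded.append((global_step, metric_value))


num_epochs = 2
batches_per_epoch = 3
global_step = 0  # integer count of batches seen since training started

for epoch in range(num_epochs):
    for batch in range(batches_per_epoch):
        # ... forward pass, loss computation and weight update would go here ...
        global_step += 1
        fake_loss = 1.0 / global_step  # made-up metric that shrinks over time
        report_metric(fake_loss, global_step)
```

With 2 epochs of 3 batches each, the counter ends at 6 and six data points are recorded. If the metric were reported once per epoch instead, the epoch index would play the role of the step, which may be why some tutorials pass an epoch count.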

Is this problem a classification or regression?

In a lecture from Andrew Ng, he asked whether the problem below is a classification or a regression problem. Answer: It is a regression problem.
You have a large inventory of identical items. You want to predict how
many of these items will sell over the next 3 months.
It looks like I am missing something. As I understand it, this should be a classification problem, because we have to classify each item into one of two categories, i.e. it will be sold or not, and these are discrete values rather than continuous ones.
I am not sure where the gap in my understanding is.
Your thinking is that you have a database of items with their respective features and want to predict if each item will be sold. At the end, you would simply count the number of items that can be sold. If you frame the problem this way, then it would be a classification problem indeed.
However, note the following sentence in your question:
You have a large inventory of identical items.
Identical items means that all items will have exactly the same features. If you come up with a binary classifier that tells whether a product can be sold or not, since all feature values are exactly the same, your classifier would put all items in the same category.
I would guess that, to solve this problem, you would probably have access to the time series of sold items per month for the past 5 years, for instance. Then, you would have to crunch this data and extrapolate into the future. You won't be classifying each item individually but actually calculating a numerical value that indicates the number of sold items for 1, 2, and 3 months in the future.
According to Pattern Recognition and Machine Learning (Christopher M. Bishop, 2006):
Cases such as the digit recognition example, in which the aim is to assign each input vector to one of a finite number of discrete categories, are called classification problems. If the desired output consists of one or more continuous variables, then the task is called regression.
On top of that, it is important to understand the difference between categorical, ordinal, and numerical variables, as defined in statistics:
A categorical variable (sometimes called a nominal variable) is one that has two or more categories, but there is no intrinsic ordering to the categories. For example, gender is a categorical variable having two categories (male and female) and there is no intrinsic ordering to the categories.
(...)
An ordinal variable is similar to a categorical variable. The difference between the two is that there is a clear ordering of the variables. For example, suppose you have a variable, economic status, with three categories (low, medium and high). In addition to being able to classify people into these three categories, you can order the categories as low, medium and high.
(...)
A numerical variable is similar to an ordinal variable, except that the intervals between the values of the numerical variable are equally spaced. For example, suppose you have a variable such as annual income that is measured in dollars, and we have three people who make $10,000, $15,000 and $20,000.
Although your end result will be an integer (a discrete set of numbers), note it is still a numerical value, not a category. You can manipulate numerical values mathematically (e.g. calculate the average number of sold items in the next year, find the peak number of sold items in the next 3 months...) but you cannot do that with discrete categories (e.g. what would be the average of a cellphone and a telephone?).
Classification problems are the ones where the output is either categorical or ordinal (discrete categories, as per Bishop). Regression problems output numerical values (continuous variables, as per Bishop).
Your system might be restricted to outputting integers instead of real numbers, but that won't change the nature of the variable, which is still numerical. Therefore, your problem is a regression problem.
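
As a small illustration of treating the sales count as a numerical target, the sketch below fits a straight line to monthly sales and extrapolates three months ahead. The sales figures are made up, and the linear trend is an assumption chosen for simplicity; it is not the method from the lecture.

```python
def fit_line(ys):
    """Ordinary least-squares fit of y = slope * x + intercept for x = 0, 1, 2, ..."""
    n = len(ys)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up monthly sales counts for the past six months.
monthly_sales = [100, 110, 120, 130, 140, 150]
slope, intercept = fit_line(monthly_sales)

# Forecast the next three months (x = 6, 7, 8) and add them up.
forecast = [slope * x + intercept for x in range(6, 9)]
total_next_3_months = round(sum(forecast))
```

The model outputs real numbers; rounding to an integer happens only at the end, which is the sense in which the integer restriction does not turn the task into classification.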

Validating accuracy on time-series data with an atypical ending

I'm working on a project to predict demand for a product based on past historical data for multiple stores. I have data from multiple stores over a 5 year period. I split the 5-year time series into overlapping subsequences and use the last 18 months to predict the next 3 and I'm able to make predictions. However, I've run into a problem in choosing a cross-validation method.
I want to have a holdout test split, and use some sort of cross-validation for training my model and tuning parameters. However, the last year of the data was a recession in which almost all demand suffered. When I use the last 20% (time-wise) of the data as a holdout set, my test score is very low compared to my OOF cross-validation scores, even though I am using a TimeSeriesSplit CV. This is very likely caused by the recession being new behavior: the model can't predict these strong downswings since it has never seen them before.
The solution I'm thinking of is using a random 20% of the data as a holdout, and a shuffled KFold as cross-validation. Since I am not feeding any information about when the sequence started into the model except the starting month (1 to 12) of the sequence (to help the model explain seasonality), my theory is that the model should not overfit this data based on that. If all types of economy are present in the data, the results of the model should extrapolate to new data too.
I would like a second opinion on this. Do you think my assumptions are correct? Is there a different way to solve this problem?
Your overall assumption is correct in that you can probably take random chunks of time to form your training and testing set. However, when doing it this way, you need to be careful. Rather than predicting the raw values of the next 3 months from the prior 18 months, I would predict the relative increase/decrease of sales in the next 3 months vs. the mean of the past 18 months.
(see here)
http://people.stern.nyu.edu/churvich/Forecasting/Handouts/CourantTalk2.pdf
Otherwise, the correlation between the next 3 months and your prior 18 months of data might give you a misleading impression of the accuracy of your model.
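
The relative target suggested in this answer could be built along the following lines. This is only a sketch: the function name is hypothetical, the data are made up, and defining the target as mean(next 3 months) / mean(past 18 months) - 1 is one possible reading of "relative increase/decrease".

```python
def relative_change_targets(sales, n_past=18, n_future=3):
    """For each window, return (past_window, relative_change).

    relative_change = mean(next n_future months) / mean(past n_past months) - 1
    Assumes the mean of each past window is non-zero.
    """
    samples = []
    for start in range(len(sales) - n_past - n_future + 1):
        past = sales[start:start + n_past]
        future = sales[start + n_past:start + n_past + n_future]
        past_mean = sum(past) / n_past
        future_mean = sum(future) / n_future
        samples.append((past, future_mean / past_mean - 1))
    return samples


# Made-up series: 18 flat months at 100 units, then 3 months at 80 units.
sales = [100] * 18 + [80] * 3
samples = relative_change_targets(sales)
past, change = samples[0]
```

Here the single window yields a change of about -0.2, i.e. a 20% drop relative to the recent average, regardless of whether the store sells 100 or 100,000 units a month.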

Different scenario based queries on Imputing and Machine Learning

I am new to Data Science and am learning about imputation and model training. Below are a few questions that came up while training on my datasets. Please provide answers to these.
Suppose I have a dataset with 1000 observations. Now I train the model on the complete dataset in one go. Another way I did it, I divided my dataset in 80% and 20% and trained my model first at 80% and then on 20% data. Is it same or different? Basically, if I train my already trained model on new data, what does it mean?
Imputing Related
Another question is related to imputing. Imagine I have a dataset of some ship passengers, where only first-class passengers were given cabin. There is a column that holds cabin numbers (categorical) but very few observations have these cabin numbers. Now I know this column is important so I cannot remove it and because it has many missing values, so most of the algorithms do not work. How to handle imputing of this type of column?
When imputing the validation data, do we impute with same values that were used to impute training data or the imputing values are again calculated from validation data itself?
How to impute data in the form of a string like a Ticket number (like A-123). The column is important because the 1st alphabet tells the class of passenger. Therefore, we cannot drop it.
Suppose I have a dataset with 1000 observations. Now I train the model
on the complete dataset in one go. Another way I did it, I divided my
dataset in 80% and 20% and trained my model first at 80% and then on
20% data. Is it same or different?
It's hard to say whether it is good or not. Generally, if your data splits are drawn from the same distribution, you can perform additional training. However, not all model types are well suited to it. I advise you to run some kind of cross-validation with 80/20 splitting and to check the error both before and after the additional training.
Basically, if I train my already
trained model on new data, what does it mean?
If the datasets come from the same distribution, you are performing additional learning, which in theory should have a positive influence on your model.
Imagine I have a dataset of some ship passengers, where only first-class passengers were given cabin. There is a column that holds cabin numbers (categorical) but very few observations have these cabin numbers. Now I know this column is important so I cannot remove it and because it has many missing values, so most of the algorithms do not work. How to handle imputing of this type of column?
You need to clearly understand what you want to achieve with imputation. If only first-class passengers have values, how can you perform imputation for second- or third-class passengers? What do you need to find? The deck? The cabin number? Do you want to find new values or impute with already existing values?
When imputing the validation data, do we impute with same values that were used to impute training data or the imputing values are again calculated from validation data itself?
Very generally, you run the imputation algorithm on all the data you have (without the target column).
How to impute data in the form of a string like a Ticket number (like A-123). The column is important because the 1st alphabet tells the class of passenger. Therefore, we cannot drop it.
If you have a finite number of cases, you just need to impute the values as strings. If not, perform feature engineering: try to predict the letter, the number, the first digit of the number, len(number), and so on.
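
The feature engineering mentioned above might start by splitting each ticket string into its parts. The sketch below assumes every valid ticket has the form letter(s), a hyphen, then digits (as in A-123); the function name and the returned keys are made up for this example.

```python
def ticket_features(ticket):
    """Split a ticket string such as 'A-123' into simple features.

    Assumes the format '<letters>-<digits>'; anything else yields None values.
    """
    letter, sep, number = ticket.partition("-")
    if not sep or not letter.isalpha() or not number.isdecimal():
        return {"letter": None, "number": None, "first_digit": None, "length": None}
    return {
        "letter": letter,
        "number": int(number),
        "first_digit": int(number[0]),
        "length": len(number),
    }


features = ticket_features("A-123")
missing = ticket_features("")
```

Each of the resulting columns can then be imputed or predicted on its own, which is usually easier than imputing the raw string.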

Model that predict both categorical and numerical output

I am building an RNN for a time series model, which has a categorical output.
For example, if the previous pattern is "A", "B", "A", "B", the model predicts that the next is "A".
There's also a numerical level associated with each category.
For example, A is 100 and B is 50,
so A(100), B(50), A(100), B(50).
I have the model framework to predict that the next is "A"; it would be nice to predict the (100) at the same time.
For real life examples, you have national weather data.
You are predicting the weather type for the next few days (sunny, windy, raining, etc.); it would be nice if the model also predicted the temperature at the same time.
Or, for Amazon, analyzing a customer's transaction pattern:
Customer A shopped category
electronic($100), household($10), ... ...
Predict which category this customer's next transaction is likely to fall into and, at the same time, the amount of that transaction.
Researched a bit, have not found any relevant research on similar topics.
What is stopping you from adding an extra output to your model? You could have one categorical output and one numerical output next to each other. Every neural network library out there supports multiple outputs.
You will need to normalise your output data though. Categories should be encoded with one-hot encoding, and numerical values should be normalised by dividing by some maximal value.
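
The two kinds of target described above could be prepared roughly as follows. This is a plain-Python sketch using the A/B example from the question; the helper names are made up, and taking the maximum of the observed values as the divisor is an assumption.

```python
def one_hot(label, categories):
    """Return a one-hot list for label, using the order given in categories."""
    return [1.0 if label == c else 0.0 for c in categories]


def scale_by_max(values):
    """Divide every value by the largest one so the results lie in [0, 1].

    Assumes non-negative values with a positive maximum.
    """
    max_value = max(values)
    return [v / max_value for v in values]


categories = ["A", "B"]
labels = ["A", "B", "A", "B"]
levels = [100, 50, 100, 50]

categorical_targets = [one_hot(label, categories) for label in labels]
numerical_targets = scale_by_max(levels)
```

The one-hot vectors would feed the categorical output and the scaled levels the numerical output; the same divisor has to be kept so that predictions can be multiplied back to the original scale.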
Researched a bit, have not found any relevant research on similar topics.
Because this is not really a 'topic'. This is something completely normal, and it does not require some special kind of network.
