Predicting time series based on previous events using neural networks - machine-learning

I want to see if the following problem can be solved by using neural networks: I have a database containing over 1000 basketball events, where the total score has been recorded every second from minute 5 till minute 20, and where the basketball games are all from the same league. This means that the events are occurring on different time periods. The data is afterwards interpolated to have the exact time difference between two timesteps, and thus obtaining exactly 300 points between minute 5 and minute 20. This can be seen here:
Time series. The final goal is to have a model that can predict the y values between t=15 till t=20 and use as input data the y values between t=5 and t=15. I want to train the model by using the database containing the 1000 events. For this I tried using the following network:
input data vs output data
Neural network
The input data, that will be used to train the neural network model would have the shape (1000,200) and the output data, would have the shape (1000,100).
Can someone maybe guide me in the right direction for this and maybe give some feedback if this is a correct approach for such a problem, I have found some previous time series problems, but all of them were based on one large time series, while in this situation I have 1000 different time series.

There are a couple different ways to approach this problem. Based on the comments this sounds like a univariate/multi-step time series forecasting albeit across many different events.
First to clarify most deep learning for time series models/frameworks take data in the following format (batch_size, n_historical_steps, n_feature_time_series) and output the result in the format (batch_size, n_forecasted_steps, n_targets) .
Since this is a univariate forecasting problem n_feature_time_series would be one (unless I'm missing something). Now n_historical_steps is a hyper parameter we often optimize on as often the entire temporal history is not relevant to forecasting the next time n steps. You might want to try optimizing on that as well. However let say you choose to use the full temporal history then this would look like (batch_size, 200, 1). Following this approach you might then have output shape of (batch_size, 100, 1). You could then use a batch_size of 1000 to feed in all the different events at once (assuming of course you have a different validation/test set).This would give you an input shape of (1000, 200, 1) This is how you would likely do it for instance if you were going to use models like DA-RNN, LSTM, vanilla Transformer, etc.
There are some other models though that would create a learnable series embedding_id such as the Convolutional Transformer Paper or Deep AR. This is essentially a unique series identifier that would be associated with each event and the model would learn to forecast in the same pass on each.
I have models of both varieties implemented that you could use in Flow Forecast. Though I don't have any detailed tutorials on this type of problem at the moment. I will also say also that in all honesty given that you only have 1000 BB events (each with only 300 univariate time steps) and the many variables in play at Basketball I doubt that you will be able to accomplish this task with any real degree of accuracy. I would guess you probably need at least 20k+ basketball event data to be able to forecast this type of problem well with deep learning at least.

Related

Converting Multiple Time-series Signals into One Spectrogram

Is it possible to combine multivariate time series signals (from the same event) and convert them to the frequency domain to feed a spectrogram? I would like to convert these signals to the frequency domain so that I can perform a Convolutional Neural Network and predict classifications of events.
So far, I've only seen examples using just ONE (1) time series, not multidimensional. Such as pictured here below.
Time Series to Spectrogram
As an example, let's assume (in the figure below) this is the data I collected in multiple time series for 1 day in the year. I've collected similar data for 30 other days. I want to combine the signals in a way to create a frequency spectrogram.
Multivariate
Can this be done? What are some ways to perform this operation?
Can you please provide the code/github link of that one time series conversion

Impute time series using similar time series

I have a problem where I have a lot of data about 1 year recordings of thermostats where every hour it gives me the mean temperature in that household. But a lot of data is not available due to they only installed the thermostat in the middle of the year or they put out the thermostat for a week or ... But a lot of this thermostat data is really similar. What I want to do is impute the missing data using similar timeseries.
So lets say house A only started in july but from there they are very similar to household B I would want to then use the info from household B to predict what the data dould be before july in house A.
I was thinking about training a Recurrent Neural Network that could do this for me but I am not shure what is out there to do this and when I search for papers and such they almost exclusively work on data sets over multiple years and impute the data using the data of previous years. I do not have this data, so that is not an option.
Does anyone have a clue how to tackle this problem or a refference I could use that solves a similar problem ?
As I understand it you want to impute the data using cross-sectional data rather than time series information.
There are actually quite a lot of imputation packages that can do this for you in R. (if you are using R)
You'd need equally spaced data. So 1 values per hour and if it is not present, then it needs to be NA. So ideally you have then multiple time series of qual length.
Then you merge these time series according to the time stamp / hour.
Afterwards you can apply an imputation package like e.g. mice, missForest, imputeR with basically one line of code. These packages will use the correlations between the different time series to estimate the missing values in these series.

LSTM multi-step prediction with a single time-signal input

Say I have one-minute data sample collected from soccer matches with six features. 1500 games to train and test the model.
I have implemented LSTM model for multiple feature forecast. I trained/tested the model with lag 5 and got a score of 91%. I.e make a prediction for minute 6.
My question is, given only the first-minute data, is it possible to make a prediction for the remaining 89 minutes of the game? (Of course, I will design a new model with input shape(1,6) and output (89,6)
So my input_shape=(1,1,6) which always looks [[0,0,a,b,0,0]] where a and b are unique and pre-determined for each match.
And the expected output will have shape=(89,6).
I really appreciate any suggestion.
Yes it is possible to do that, the method to be used will be a slight variation of a method known as Sampling Novel sequences and it follows this model
What basically happens is that you use first minute to predict second minute, rather than randomly generating it, and use the generated result as the Input for the next step and thus you continue, remember, this is only for the sampling/predicting stage and not for generation step, I learned this from Andrew ng's deep learning course and this links to the video for same.
And I believe you can handle shapes and dimensions accordingly.
If you have any doubts or difficulties, comment below.
Image source: medium.com

Can I use a transfer learning in facebook prophet?

I want my prophet model to predict values for every 10 minute interval over the next 24h (e.g. 24*6=144 values).
Let's say I've trained a model on a huge (over 900k of rows) .csv file where sample row is
...
ds=2018-04-24 16:10, y=10
ds=2018-04-24 16:20, y=14
ds=2018-04-24 16:30, y=12
...
So I call mode.fit(huge_df) and wait for 1-2 seconds to receive 144 values.
And then an hour passes and I want to tune my prediction for the following (144 - 6) 138 values given a new data (6 rows).
How can I tune my existing prophet model without having to call mode.fit(huge_df + live_df) and wait for some seconds again? I'd like to be able to call mode.tune(live_df) and get an instant prediction.
As far as I'm aware this is not really a possibility. I think they use a variant of the BFGS optimization algorithm to maximize the posterior probability of the the models. So as I see it the only way to train the model is to take into account the whole dataset you want to use. The reason why transfer learning works with neural networks is that it is just a weight (parameter) initialization and back propagation is then run iteratively in the standard SGD training schema. Theoretically you could initialize the parameters to the ones of the previous model in the case of prophet, which might or might not work as expected. I'm however not aware that something of the likes is currently implemented (but since its open-source you could give it a shot, hopefully reducing convergence times quite a bit).
Now as far as practical advice goes. You probably don't need all the data, just tail it to what you really need for the problem at hand. For instance it does not make sense to have 10 years of data if you have only monthly seasonality. Also depending on how strongly your data is autocorrelated, you may downsample a bit without loosing any predictive power. Another idea would be to try an algorithm that is suitable for online-learning (or batch) - You could for instance try a CNN with dilated convolution.
Time Series problems are quite different from usual Machine Learning Problems. When we are training cat/dog classifier, the feature set of cats and dogs are not going to change instantly (evolution is slow). But when it comes to the time series problems, training should happen, every time prior to forecasting. This becomes even more important when you are doing univariate forecasting (as is your case), as only feature we're providing to the model is the past values and these value will change at every instance. Because of these concerns, I don't think something like transfer learning will work in time series.
Instead what you can do is, try converting your time series problem into the regression problem by use of rolling windowing approach. Then, you can save that model and get you predictions. But, make sure to train it again and again in short intervals of time, like once a day or so, depending upon how frequently you need a forecast.

Using Random Forest for time series dataset

For a time series dataset, I would like to do some analysis and create prediction model. Usually, we would split data (by random sampling throughout entire data set) into training set and testing set and use the training set with randomForest function. and keep the testing part to check the behaviour of the model.
However, I have been told that it is not possible to split data by random sampling for time series data.
I would appreciate if someone explain how to split data into training and testing for time series data. Or if there is any alternative to do time series random forest.
Regards
We live in a world where "future-to-past-causality" only occurs in cool scifi movies. Thus, when modeling time series we like to avoid explaining past events with future events. Also, we like to verify that our models, strictly trained on past events, can explain future events.
To model time series T with RF rolling is used. For day t, value T[t] is the target and values T[t-k] where k= {1,2,...,h}, where h is the past horizon will be used to form features. For nonstationary time series, T is converted to e.g. the relatively change Trel. = (T[t+1]-T[t]) / T[t].
To evaluate performance, I advise to check the out-of-bag cross validation measure of RF. Be aware, that there are some pitfalls possibly rendering this measure over optimistic:
Unknown future to past contamination - somehow rolling is faulty and the model using future events to explain the same future within training set.
Non-independent sampling: if the time interval you want to forecast ahead is shorter than the time interval the relative change is computed over, your samples are not independent.
possible other mistakes I don't know of yet
In the end, everyone can make above mistakes in some latent way. To check that is not happening you need to validate your model with back testing. Where each day is forecasted by a model strictly trained on past events only.
When OOB-CV and back testing wildly disagree, this may be a hint to some bug in the code.
To backtest, do rolling on T[t-1 to t-traindays]. Model this training data and forecast T[t]. Then increase t by one, t++, and repeat.
To speed up you may train your model only once or at every n'th increment of t.
Reading Sales File
Sales<-read.csv("Sales.csv")
Finding length of training set.
train_len=round(nrow(Sales)*0.8)
test_len=nrow(Sales)
Splitting your data into training and testing set here I have considered 80-20 split you can change that. Make sure your data in sorted in ascending order.
Training Set
training<-slice(SubSales,1:train_len)
Testing Set
testing<-slice(SubSales,train_len+1:test_len)

Resources