LSTM multi-step prediction with a single time-signal input

Say I have data sampled at one-minute intervals from soccer matches, with six features per minute, and 1500 games to train and test the model.
I have implemented an LSTM model for multi-feature forecasting. I trained/tested the model with a lag of 5 and got a score of 91%, i.e. it predicts minute 6 from minutes 1-5.
My question is: given only the first minute of data, is it possible to make a prediction for the remaining 89 minutes of the game? (Of course, I would design a new model with input shape (1,6) and output shape (89,6).)
So my input_shape=(1,1,6), and the input always looks like [[0,0,a,b,0,0]], where a and b are unique and pre-determined for each match.
And the expected output will have shape=(89,6).
I really appreciate any suggestion.

Yes, it is possible. The method is a slight variation of a technique known as sampling novel sequences.
What basically happens is that you use the first minute to predict the second minute (rather than generating it randomly), then feed that prediction back in as the input for the next step, and so on. Remember, this is only for the sampling/prediction stage, not for training. I learned this from Andrew Ng's deep learning course, which has a video on exactly this.
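A minimal sketch of this feed-back loop, assuming a trained Keras model (here called `step_model`, a hypothetical name) that maps one minute of the six features to the next minute:

```python
import numpy as np

# Assumes a trained Keras model `step_model` that maps an input of shape
# (1, 1, 6) -- one timestep, six features -- to the next timestep of shape (1, 6).
# All names here are illustrative, not from the original post.

def sample_full_match(step_model, first_minute, n_future=89):
    """Iteratively predict n_future minutes starting from the first minute."""
    preds = []
    current = first_minute.reshape(1, 1, 6)   # e.g. [[0, 0, a, b, 0, 0]]
    for _ in range(n_future):
        next_minute = step_model.predict(current, verbose=0)  # shape (1, 6)
        preds.append(next_minute[0])
        current = next_minute.reshape(1, 1, 6)  # feed the prediction back in
    return np.stack(preds)                      # shape (89, 6)
```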
And I believe you can handle shapes and dimensions accordingly.
If you have any doubts or difficulties, comment below.

Related

Predicting time series based on previous events using neural networks

I want to see if the following problem can be solved using neural networks. I have a database containing over 1000 basketball events, where the total score has been recorded every second from minute 5 till minute 20, and where the basketball games are all from the same league. The events therefore cover different time periods, so the data is afterwards interpolated onto a fixed time grid, giving exactly 300 points between minute 5 and minute 20.
The final goal is to have a model that predicts the y values from t=15 till t=20, using the y values between t=5 and t=15 as input. I want to train the model on the database containing the 1000 events. For this I tried using the following network:
(Figures in the original post: the input data vs. output data, and the neural network architecture.)
The input data used to train the neural network model would have shape (1000, 200), and the output data would have shape (1000, 100).
Can someone guide me in the right direction and give some feedback on whether this is a correct approach for such a problem? I have found some previous time-series problems, but all of them were based on one large time series, while in this situation I have 1000 different time series.
There are a couple of different ways to approach this problem. Based on the comments, this sounds like a univariate, multi-step time-series forecasting problem, albeit across many different events.
First, to clarify: most deep learning time-series models/frameworks take data in the format (batch_size, n_historical_steps, n_feature_time_series) and output the result in the format (batch_size, n_forecasted_steps, n_targets).
Since this is a univariate forecasting problem, n_feature_time_series would be one (unless I'm missing something). n_historical_steps is a hyperparameter we often optimize over, as the entire temporal history is frequently not relevant to forecasting the next n steps, so you might want to tune that as well. However, let's say you choose to use the full temporal history; then the input would look like (batch_size, 200, 1) and the output would have shape (batch_size, 100, 1). You could then use a batch_size of 1000 to feed in all the different events at once (assuming, of course, you have a separate validation/test set), giving an input shape of (1000, 200, 1). This is how you would likely do it if you were going to use models like DA-RNN, an LSTM, a vanilla Transformer, etc.
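As a rough illustration of those shapes, a minimal sketch of a seq2seq-style LSTM in Keras (layer sizes are arbitrary, and random arrays stand in for the real events):

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# X: (1000, 200, 1) historical steps, y: (1000, 100, 1) steps to forecast.
# Random data stands in for the real basketball events here.
X = np.random.rand(1000, 200, 1).astype("float32")
y = np.random.rand(1000, 100, 1).astype("float32")

model = keras.Sequential([
    keras.Input(shape=(200, 1)),
    layers.LSTM(64),                         # encode the 200 historical steps
    layers.RepeatVector(100),                # repeat for the 100 forecast steps
    layers.LSTM(64, return_sequences=True),
    layers.TimeDistributed(layers.Dense(1))  # one target value per forecast step
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, batch_size=32, epochs=1)     # hold out validation/test events in practice
```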
There are some other models, though, that create a learnable series embedding, such as the Convolutional Transformer paper or DeepAR. This is essentially a unique series identifier associated with each event, and the model learns to forecast every event in the same pass.
I have models of both varieties implemented that you could use in Flow Forecast, though I don't have any detailed tutorials on this type of problem at the moment. I will also say, in all honesty, that given you only have 1000 basketball events (each with only 300 univariate time steps) and the many variables in play in basketball, I doubt you will be able to accomplish this task with any real degree of accuracy. I would guess you need at least 20k+ basketball events to forecast this type of problem well, with deep learning at least.

Can I use transfer learning in Facebook Prophet?

I want my Prophet model to predict values for every 10-minute interval over the next 24h (i.e. 24*6=144 values).
Let's say I've trained a model on a huge (over 900k rows) .csv file where a sample row is
...
ds=2018-04-24 16:10, y=10
ds=2018-04-24 16:20, y=14
ds=2018-04-24 16:30, y=12
...
So I call model.fit(huge_df) and wait for 1-2 seconds to receive the 144 values.
Then an hour passes and I want to update my prediction for the remaining (144 - 6) = 138 values, given the new data (6 rows).
How can I update my existing Prophet model without having to call model.fit(huge_df + live_df) and wait again? I'd like to be able to call something like model.tune(live_df) and get an instant prediction.
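For reference, the workflow described above looks roughly like this with the Prophet API (a sketch; the file and variable names are illustrative):

```python
import pandas as pd
from prophet import Prophet  # older versions: from fbprophet import Prophet

huge_df = pd.read_csv("history.csv")    # columns: ds (timestamp), y (value)

model = Prophet()
model.fit(huge_df)                      # the slow, full re-fit in question

future = model.make_future_dataframe(periods=144, freq="10min")
forecast = model.predict(future)        # yhat for the next 24h at 10-minute steps
print(forecast[["ds", "yhat"]].tail(144))
```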
As far as I'm aware, this is not really a possibility. I think Prophet uses a variant of the BFGS optimization algorithm to maximize the posterior probability of the model, so as I see it the only way to train the model is to take into account the whole dataset you want to use. The reason transfer learning works with neural networks is that it is just a weight (parameter) initialization, and backpropagation is then run iteratively in the standard SGD training schema. Theoretically, you could initialize the parameters to those of the previous model in the case of Prophet, which might or might not work as expected. I'm not aware, however, that anything of the sort is currently implemented (but since it's open source you could give it a shot, hopefully reducing convergence times quite a bit).
Now, as far as practical advice goes: you probably don't need all the data, so just tail it to what you really need for the problem at hand. For instance, it does not make sense to keep 10 years of data if you only have monthly seasonality. Also, depending on how strongly your data is autocorrelated, you may be able to downsample a bit without losing any predictive power. Another idea would be to try an algorithm that is suitable for online (or mini-batch) learning; you could for instance try a CNN with dilated convolutions.
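A minimal sketch of the tail-and-downsample idea (the 90-day window and 30-minute resolution are arbitrary examples, not recommendations):

```python
import pandas as pd
from prophet import Prophet

df = pd.read_csv("history.csv", parse_dates=["ds"])

# Keep only the recent history that is actually relevant to the forecast,
# e.g. the last 90 days instead of the full 900k-row file.
recent = df[df["ds"] >= df["ds"].max() - pd.Timedelta(days=90)]

# Optionally downsample, e.g. from 10-minute to 30-minute resolution,
# if the series is strongly autocorrelated.
downsampled = (recent.set_index("ds")["y"]
                     .resample("30min").mean()
                     .reset_index())

model = Prophet()
model.fit(downsampled)   # much faster than fitting on the full history
```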
Time-series problems are quite different from usual machine-learning problems. When we train a cat/dog classifier, the features of cats and dogs are not going to change suddenly (evolution is slow). But with time-series problems, training should happen every time prior to forecasting. This becomes even more important when you are doing univariate forecasting (as in your case), because the only features we provide to the model are the past values, and these values change at every instant. Because of these concerns, I don't think something like transfer learning will work for time series.
Instead, what you can do is convert your time-series problem into a regression problem by using a rolling-window approach. Then you can save that model and get your predictions. But make sure to retrain it at short intervals, like once a day or so, depending on how frequently you need a forecast.
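A minimal sketch of the rolling-window conversion to a regression problem (the window size and choice of regressor are arbitrary here):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def make_windows(series, window=12):
    """Turn a 1-D series into (lag-window features, next-value target) pairs."""
    X, y = [], []
    for i in range(len(series) - window):
        X.append(series[i:i + window])
        y.append(series[i + window])
    return np.array(X), np.array(y)

df = pd.read_csv("history.csv")          # columns ds, y as in the question
series = df["y"].to_numpy()

X, y = make_windows(series, window=12)   # 12 past values predict the next one
reg = RandomForestRegressor(n_estimators=100)
reg.fit(X, y)                            # can be saved and re-fit on a schedule

next_value = reg.predict(series[-12:].reshape(1, -1))
```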

Overfitting my model over my training data of a single sample

I am trying to overfit my model on training data that consists of only a single sample. The training accuracy comes out to be 1.00, but when I predict the output for my test data, which consists of the same single training sample, the results are not accurate. The model has been trained for 100 epochs and the loss is ~1e-4.
What could be the possible sources of error?
As mentioned in the comments of your post, it isn't possible to give specific advice without you first providing more details.
Generally speaking, your approach to overfitting a tiny batch (in your case one image) is in essence providing three sanity checks, i.e. that:
backprop is functioning
the weight updates are doing their job
the learning rate is in the correct order of magnitude
As Andrej Karpathy points out in Lecture 5 of the CS231n course at Stanford: "if you can't overfit on a tiny batch size, things are definitely broken".
This means, given your description, that your implementation is incorrect somewhere. I would start by checking each of the three points listed above. For example, alter your test by picking several different images, or a batch size of 5 images instead of one. You could also revise your predict function, as that is where there is most likely a discrepancy, given you are getting near-zero error during training (and presumably during validation?).
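A minimal PyTorch sketch of this sanity check; one common source of a train/predict gap like the one described is forgetting to switch layers such as dropout or batch norm into evaluation mode before predicting:

```python
import torch
import torch.nn as nn

# A single (input, target) training sample; shapes and sizes are illustrative.
x = torch.randn(1, 10)
y = torch.tensor([1])

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Dropout(0.5), nn.Linear(32, 2))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

model.train()
for _ in range(100):                 # overfit the single sample
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()

model.eval()                         # without this, dropout stays active at predict time
with torch.no_grad():
    pred = model(x).argmax(dim=1)
print(loss.item(), pred.item())      # loss should be tiny and pred should equal y
```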

HMM application in speech recognition

This is my first time posting a question here, so if the approach is not standard I apologize. I understand there are lots of questions out there on this, and I have read tons of theses, questions, articles and tutorials, yet I seem to have a problem, and it's always best to ask. I am creating a speech recognition application using phoneme-level processing (not isolated words): continuous HMMs based on Gaussian mixture models, involving the Baum-Welch, forward-backward, and Viterbi algorithms.
I have implemented a good feature extraction and pre-processing pipeline (MFCC); the feature vectors consist of the MFCCs plus delta and acceleration coefficients, and that part is working pretty well. However, when it comes to the HMMs, I seem to either have a major misunderstanding about how HMMs are supposed to help recognize speech, or I am missing a small point. I have tried so hard that at this point I can't really tell what's right and what's wrong.
First off, I recorded around 50 words, each with 6 utterances, ran them through a compatibility and conversion program that I wrote myself, and then extracted the features so that they can be used for Baum-Welch.
Please tell me where I am making a mistake in this procedure. I will also mention a few doubts I have about it, so that you can help me understand this whole subject better.
Here are the steps in my application concerning anything related to training.
Steps for the initial parameters of the HMM model:
1 - Assign all observations from each training sample of each model to their corresponding discrete state (in other words, which feature vector belongs to which alphabetic state).
2 - Use k-means to find the initial continuous emission parameters; clustering is done over all observations of each state, and the cluster count here is 6 (equal to the number of mixtures in the probability density function). The parameters are the sample means, sample covariances and mixture weights for each cluster.
3 - Create the initial state-initial and transition probability matrices for each model and training sample individually (a left-right structure is used in this case): 0 for previous states and 1 for up to one next state in the transitions, and 1 for the first state and 0 for the others in the state initials.
4 - Calculate the Gaussian-mixture-model-based probability density function for each state -> its corresponding cluster -> assigned to all the vectors in all the training samples for each model.
5 - Calculate the initial emission parameters using the PDFs and the mixture weights of the clusters.
6 - Now calculate the gamma variables using the initial parameters (transitions, emissions, initials) in forward-backward and the initial PDFs, using the continuous formula for gamma (gamma = probability of being in a certain state at a certain time for any of the mixtures).
7 - Estimate new state initials.
8 - Estimate new state transitions.
9 - Estimate new sample means.
10 - Estimate new sample covariances.
11 - Estimate new PDFs.
12 - Estimate new emissions using the new PDFs.
Repeat steps 6 to 12 using the newly estimated values on each iteration; use Viterbi to get an overview of how the estimation is going, and when the probability is no longer changing, stop and save.
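For comparison, an off-the-shelf GMM-HMM implementation such as hmmlearn wraps up roughly this same loop (k-means initialisation plus Baum-Welch re-estimation of means, covariances, mixture weights and transitions). A sketch, not the code from this post, with random data standing in for real MFCC+delta+acceleration features:

```python
import numpy as np
from hmmlearn.hmm import GMMHMM

# One word's 6 training utterances, each an array of (n_frames, n_features).
# Random data is used here purely for illustration.
mfcc_utterances = [np.random.randn(80, 39) for _ in range(6)]

X = np.concatenate(mfcc_utterances)            # stack all frames
lengths = [len(u) for u in mfcc_utterances]    # utterance boundaries for the EM loop

# A left-right topology can be encouraged by initialising the transition matrix;
# the default (ergodic) initialisation is used here for brevity.
model = GMMHMM(n_components=5, n_mix=6, covariance_type="diag", n_iter=20)
model.fit(X, lengths)                          # k-means init + Baum-Welch (EM)
print(model.score(X, lengths))                 # log-likelihood: should rise, then plateau
```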
Now my issues:
First, I don't know if the entire procedure I have followed is correct, or whether there is a better method to approach this. All I know is that convergence is pretty fast: after 4-5 iterations it is already not changing anymore. However, assuming I am right so far:
It's not possible for me to sit down and pre-assign each feature vector to its state at the beginning in step 1, and I don't think it's a standard procedure either. I don't even know if I have to do it at all, but from all my studies it was the best method I could find to get rapid convergence.
Second, say this whole Baum-Welch procedure has done a great job of re-estimating and finding local maxima. What raises my doubt about my Baum-Welch implementation is: how are the estimated parameters later going to help me recognize speech? I assume they are used in Viterbi to find the optimal state sequence for every spoken utterance. If so, then the emission parameters are not known for new input, because if you look closely you will see that the final emission parameters in my algorithm assign each alphabetic state of each model only to the signals observed for that model. Beyond that, no emission parameters can be found if the signal does not exactly match the ones used in re-estimation, so it obviously won't work; and any attempt to match up the signals and look up emissions would make the whole HMM lose its purpose.
Again, I might have the wrong idea about almost everything here. I would appreciate it if you could help me understand what I am doing wrong; if ANYTHING is wrong, please let me know. Thank you.
You're attempting to determine the most likely set of phonemes that would have generated the sounds that you're observing - you're not attempting to work out emission parameters, you're working out the most likely set of inputs that would have produced them.
Also, your input corpus is quite small, so it's unsurprising that it converges so quickly. If you're doing this while affiliated with a university, see if they have access to one of the larger speech corpora commonly used to train this kind of algorithm.
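To make the recognition step concrete: a common pattern with whole-word HMMs is to train one model per word and label a new utterance with the model that assigns it the highest likelihood. A sketch assuming hmmlearn, with random data standing in for real MFCC features:

```python
import numpy as np
from hmmlearn.hmm import GMMHMM

def recognize(utterance_features, word_models):
    """Pick the word whose trained HMM gives the utterance the highest log-likelihood."""
    scores = {word: model.score(utterance_features) for word, model in word_models.items()}
    return max(scores, key=scores.get)

# Illustrative setup: one GMMHMM per word, each fit on that word's training utterances.
word_models = {}
for word in ["yes", "no"]:
    train = np.concatenate([np.random.randn(80, 39) for _ in range(6)])
    word_models[word] = GMMHMM(n_components=5, n_mix=2, n_iter=10).fit(train)

new_utterance = np.random.randn(80, 39)
print(recognize(new_utterance, word_models))
```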

Leave one out accuracy for multi class classification

I am a bit confused about how to use the leave-one-out (LOO) method for calculating accuracy in the case of multi-class, one-vs-rest classification.
I am working on the YUPENN Dynamic Scene Recognition dataset, which contains 14 categories with 30 videos in each category (a total of 420 videos). Let's name the 14 classes {A, B, C, D, E, F, G, H, I, J, K, L, M, N}.
I am using a linear SVM for one-vs-rest classification.
Let's say I want to find the accuracy for class 'A'. When I perform 'A' vs. 'rest', I need to exclude one video while training and test the model on the video I excluded. Should this excluded video be from class A, or can it be from any of the classes?
In other words, to find the accuracy of class 'A', should I perform SVM with LOO 30 times (leaving each video from class 'A' out exactly once), or should I perform it 420 times (leaving each video from all the classes out exactly once)?
I have a feeling that I have got this all mixed up. Can anyone provide a short schematic of the right way to perform multi-class classification using LOO?
Also, how do I perform this using libsvm in Matlab?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The number of videos in the dataset is small, so I can't afford to create a separate TEST set (the one that was supposed to be sent to Neptune). Instead I have to make full use of the dataset, because each video provides some new/unique information. In scenarios like this, I have read that people use LOO as a measure of accuracy (when we can't afford an isolated TEST set). They call it the Leave-One-Video-Out experiment.
The people who have worked on Dynamic Scene Recognition have used this methodology for testing accuracy. To compare the accuracy of my method against theirs, I need to use the same evaluation process. But they have only mentioned that they use LOVO for accuracy; not much detail is provided beyond that. I am a newbie in this field, so it is a bit confusing.
As far as I can tell, LOVO can be done in two ways:
1) Leave one video out of the 420 videos. Train 14 one-vs-rest classifiers using the remaining 419 videos as the training set ('A' vs. 'rest', 'B' vs. 'rest', ..., 'N' vs. 'rest').
Evaluate the left-out video with the 14 classifiers and label it with the class that gives the maximum confidence score. Thus one video is classified. Follow the same procedure to label all 420 videos. Using these 420 labels we can build the confusion matrix and find the false positives/negatives, precision, recall, etc. (a code sketch of this bookkeeping is given after the second option below).
2) From each of the 14 classes I leave out one video. That means I choose 406 videos for training and 14 for testing. Using the 406 videos I train the 14 one-vs-rest classifiers. I evaluate each of the 14 videos in the test set and give them labels based on the maximum confidence score. In the next round I again leave out 14 videos, one from each class, but this time the set of 14 is such that none of them were left out in a previous round. I train again and evaluate the 14 videos to obtain labels. I carry on this process 30 times, with a non-repeating set of 14 videos each time. In the end all 420 videos are labelled. In this case as well, I calculate the confusion matrix, accuracy, precision, recall, etc.
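As an illustration of the first scheme's bookkeeping (a sketch only: scikit-learn stands in for libsvm/Matlab, and random features stand in for the real video descriptors):

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import confusion_matrix

# One feature vector per video and its class label (14 classes, 30 videos each).
X = np.random.randn(420, 128)
y = np.repeat(np.arange(14), 30)

preds = np.empty_like(y)
for train_idx, test_idx in LeaveOneOut().split(X):
    clf = OneVsRestClassifier(LinearSVC())      # 14 one-vs-rest SVMs per fold
    clf.fit(X[train_idx], y[train_idx])
    preds[test_idx] = clf.predict(X[test_idx])  # label = highest-confidence class

print(confusion_matrix(y, preds))
print((preds == y).mean())                      # leave-one-video-out accuracy
```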
Apart from these two methods, LOVO could be done in many other ways. In the papers on Dynamic Scene Recognition they have not mentioned how they perform the LOVO. Is it safe to assume that they are using the first method? Is there any way of deciding which method would be better? Would there be a significant difference in the accuracies obtained by the two methods?
Following are some of the recent papers on Dynamic Scene Recognition for reference. In the evaluation sections they mention LOVO.
1) http://www.cse.yorku.ca/vision/publications/FeichtenhoferPinzWildesCVPR2014.pdf
2) http://www.cse.yorku.ca/~wildes/wildesBMVC2013b.pdf
3) http://www.seas.upenn.edu/~derpanis/derpanis_lecce_daniilidis_wildes_CVPR_2012.pdf
4) http://webia.lip6.fr/~thomen/papers/Theriault_CVPR_2013.pdf
5) http://www.umiacs.umd.edu/~nshroff/DynScene.pdf
When using cross-validation, it is good to keep in mind that it applies to training a model, and not usually to the honest-to-god, end-of-the-whole-thing measure of accuracy, which is instead reserved for classification accuracy on a testing set that has not been touched at all or involved in any way during training.
Let's focus on just one of the classifiers you plan to build: the "A vs. rest" classifier. You are going to separate all of the data into a training set and a testing set, and then you are going to put the testing set in a cardboard box, staple it shut, cover it with duct tape, place it in a titanium vault, and attach it to a NASA rocket that will deposit it in the ice-covered oceans of Neptune.
Then let's look at the training set. When we train with the training set, we'd like to leave some of the training data to the side, just for calibrating, but not as part of the official Neptune-ocean test set.
So what we can do is let every data point (in your case a data point is a video) sit out once. We don't care whether it comes from class A or not. So if there are 420 videos that would be used in the training set for just the "A vs. rest" classifier, then yes, you're going to fit 420 different SVMs.
And in fact, if you are tweaking parameters for the SVM, this is where you'll do it. For example, if you're trying to choose a penalty term or a coefficient in a polynomial kernel, then you will repeat the entire training process (yes, all 420 different trained SVMs) for every combination of parameters you want to search through. For each collection of parameters, you associate with it the sum of the accuracy scores from the 420 LOO-trained classifiers.
Once that's all done, you choose the parameter set with the best LOO score, and voilà, that is your "A vs. rest" classifier. Rinse and repeat for "B vs. rest" and so on.
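A compact way to express that parameter sweep, assuming scikit-learn (GridSearchCV with a LeaveOneOut splitter handles the "420 fits per parameter setting" bookkeeping; the feature matrix and the C grid here are illustrative):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, LeaveOneOut
from sklearn.svm import SVC

# Binary "A vs. rest" labels for the training videos (random features for illustration).
X_train = np.random.randn(420, 128)
y_train = (np.repeat(np.arange(14), 30) == 0).astype(int)   # class 'A' vs. the rest

param_grid = {"C": [0.01, 0.1, 1, 10]}
search = GridSearchCV(SVC(kernel="linear"), param_grid,
                      cv=LeaveOneOut(), scoring="accuracy")
search.fit(X_train, y_train)          # 420 LOO fits per value of C

print(search.best_params_)            # the parameters with the best LOO score
a_vs_rest = search.best_estimator_    # the final "A vs. rest" classifier
```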
With all of this going on, there is rightfully a big worry that you are overfitting the data. Especially if many of the "negative" samples have to be repeated from class to class.
But this is why you sent that testing set to Neptune. Once you finish with all of the LOO-based, parameter-swept SVMs and you've got the final classifier in place, you execute that classifier on your actual test set (back from Neptune), and that will tell you whether the whole thing shows efficacy in predicting on unseen data.
This whole exercise is obviously computationally expensive. So instead people will sometimes use Leave-P-Out, where P is much larger than 1. And instead of repeating that process until all of the samples have spent some time in a left-out group, they will just repeat it a "reasonable" number of times, for various definitions of reasonable.
In the Leave-P-Out situation, there are some algorithms which allow you to sample which points are left out in a way that represents the classes fairly. So if the "A" samples make up 40% of the data, you might want them to take up about 40% of the leave-out set.
This doesn't really apply for LOO, for two reasons: (1) you're almost always going to perform LOO on every training data point, so trying to sample them in a fancy way would be irrelevant if they are all going to end up being used exactly once. (2) If you plan to use LOO for some number of times that is smaller than the sample size (not usually recommended), then just drawing points randomly from the set will naturally reflect the relative frequencies of the classes, and so if you planned to do LOO K times, then simply taking a random size-K subsample of the training set, and doing regular LOO on those, would suffice.
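If you do go the Leave-P-Out route with class-fair sampling, something like scikit-learn's StratifiedShuffleSplit approximates it (a sketch; P=42 and the number of repeats are arbitrary choices here):

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

X = np.random.randn(420, 128)                  # illustrative features
y = np.repeat(np.arange(14), 30)

# Leave P=42 videos out per round, sampled so class proportions are preserved,
# and repeat a "reasonable" number of times rather than exhaustively.
splitter = StratifiedShuffleSplit(n_splits=10, test_size=42, random_state=0)
for train_idx, heldout_idx in splitter.split(X, y):
    counts = np.bincount(y[heldout_idx], minlength=14)
    print(counts)                              # roughly 3 held-out videos per class each round
```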
In short, the papers you mentioned use the second scheme, i.e. leaving out one video from each class, which gives 14 videos for testing and the rest for training.

Resources