Guessing next K values of a sequence - time-series

Say we have sampled a function at a constant rate and received x1,...,xn, and we are then asked to guess the next k values xn+1,...,xn+k. Is this a known problem? Are there known algorithms or approaches for dealing with this kind of problem?

This problem is not well specified.
Who says the next elements are not: 42, 42, 42, 42, 42, 42, 42, pi, ...?
Any sequence of integers is mathematically equally likely, unless you specify your problem more precisely.
(Also, "data mining" is probably the wrong terminology. This is a TV "intelligence test" puzzle problem, not so much a real data problem.)

Related

How do I match samples with their predictions when doing inference with PyTorch's DistributedSampler?

I have trained a torch model for NLP tasks and would like to perform some inference using a multi-GPU machine (in this case with two GPUs).
Inside the processing code, I use this:
dataset = TensorDataset(encoded_dict['input_ids'], encoded_dict['attention_mask'])
sampler = DistributedSampler(
    dataset, num_replicas=args.nodes * args.gpus, rank=args.node_rank * args.gpus + gpu_number, shuffle=False
)
dataloader = DataLoader(dataset, batch_size=batch_size, sampler=sampler)
For those familiar with NLP, encoded_dict is the output from the tokenizer.batch_encode_plus function where the tokenizer is an instance of transformers.BertTokenizer.
The issue I’m having is that when I call the code through the torch.multiprocessing.spawn function, each GPU is doing predictions (i.e. inference) on a subset of the full dataset and saving the predictions separately; for example, if I have a dataset with 1000 samples to predict, each GPU predicts 500 of them. As a result, I have no way of knowing which of the 1000 samples were predicted by which GPU, as their order is not preserved; therefore the model predictions are meaningless, since I cannot trace each of them back to its input sample.
I have tried to save the dataloader instance (as a pickle) together with the predictions and then extracting the input_ids by using dataloader.dataset.tensors, however this requires a tokenizer decoding step which I'd rather avoid, as the tokenizer will have slightly changed the text (for example double whitespaces would be removed, words with dashes will have been split, and so on).
What is the cleanest way to save the input text samples together with their predictions when doing inference in distributed mode, or alternatively to keep track of which prediction refers to which sample?
As I understand it, your dataset basically returns [data, label] for an index idx during training and [data] during inference. The issue with this is that idx is not preserved by the dataloader object, so there is no way to obtain the idx values for a minibatch after the fact.
One way to handle this issue is to define a very simple custom dataset object that also returns [data, id] instead of only data during inference. Probably the easiest way to do this is to make the dataset return a dictionary object with keys id and data. The dictionary return type is convenient because PyTorch collates (converts data structures to batches) this type automatically; otherwise you'd have to define a custom collate_fn and pass it to the dataloader object, which is itself not very hard but is an extra step.
In any case, here's how I would define a new dataset object; it should be an almost one-to-one substitute for your current dataset (I believe):
import torch
from torch.utils.data import Dataset

class TensorDictDataset(Dataset):
    def __init__(self, ids, attention_mask):
        self.ids = ids            # the encoded input_ids
        self.mask = attention_mask
    def __len__(self):
        return len(self.ids)
    def __getitem__(self, idx):
        # return the data together with the sample index so that each
        # prediction can later be traced back to its input
        datum = {
            "ids": self.ids[idx],
            "mask": self.mask[idx],
            "id": idx,
        }
        return datum
The only change you'll then have to make is that rather than returning just the mask, your dataset will now return a dict like {"ids": ids, "mask": mask, "id": id}, so you'll have to parse that appropriately.
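A hedged usage sketch (the sampler arguments are taken from the question; model simply stands for your trained network and is not defined here): PyTorch's default collate_fn batches dicts automatically, so each minibatch carries its sample ids along with it.

dataset = TensorDictDataset(encoded_dict['input_ids'], encoded_dict['attention_mask'])
sampler = DistributedSampler(
    dataset, num_replicas=args.nodes * args.gpus,
    rank=args.node_rank * args.gpus + gpu_number, shuffle=False
)
dataloader = DataLoader(dataset, batch_size=batch_size, sampler=sampler)

all_ids, all_preds = [], []
for batch in dataloader:
    # batch['id'] is a tensor of the original dataset indices for this minibatch
    preds = model(batch['ids'], attention_mask=batch['mask'])
    all_ids.extend(batch['id'].tolist())
    all_preds.append(preds)
# persist all_ids next to all_preds; each prediction now maps back to its sample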
Thanks for your answer. I have done some further debugging, found another solution, and wanted to post it.
Your solution is quite elegant (there was one minor misunderstanding, in that the predictions contain only the predicted labels and not the data, contrary to what you understood, but this doesn't affect your answer anyway). "Mask" in NLP also means something else, and instead of having the mask tokens alongside the predictions I would like to have the untokenized text string. This is not so easy to achieve because the splitting of the data across GPUs happens AFTER the tokenisation; however, I believe that with a slight adaptation of your answer it could work.
However, I’ve done some further debugging and I’ve noticed that the data are not actually randomly split across GPUs as I thought. If I set shuffle=False in the DistributedSampler then this happens:
in the case of two GPUs, GPU 0 and GPU 1, all the samples with even index (starting from 0) will be passed to GPU 0, and all those with odd index will be passed to GPU 1.
So for example, if you have 10 samples, whose indices are [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], then samples 0, 2, 4, 6, 8 will go to GPU 0 and samples 1, 3, 5, 7, 9 will go to GPU 1. This allows me to map the predictions back to the original text samples simply by using this ordering. I'm not sure if it is the best solution, as keeping the original text string next to its prediction would be ideal, but at least it works.
N.B. Special case: as the two GPUs must be passed the SAME number of inputs, if the number of inputs is odd, for example 9 samples with indices [0, 1, 2, 3, 4, 5, 6, 7, 8], then GPU 0 will be passed samples 0, 2, 4, 6, 8 and GPU 1 will be passed samples 1, 3, 5, 7, 0 (in this exact order). In other words, the first sample (index 0) is repeated at the very end of the dataset so that each GPU receives the same number of samples, in which case we can write some code that drops the last prediction from GPU 1 as it is redundant.
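Based on that interleaving, a small reassembly sketch (variable names are hypothetical): with shuffle=False, rank r receives samples r, r + world_size, r + 2 * world_size, ..., and the tail may be padded with wrapped-around samples.

def merge_predictions(per_rank_preds, num_samples):
    """per_rank_preds: list of prediction lists, one per rank (GPU)."""
    world_size = len(per_rank_preds)          # e.g. 2 GPUs -> 2 lists of predictions
    merged = [None] * num_samples
    for rank, preds in enumerate(per_rank_preds):
        for i, p in enumerate(preds):
            idx = rank + i * world_size       # original dataset index
            if idx < num_samples:             # drop the redundant padded predictions
                merged[idx] = p
    return merged

# e.g. 9 samples on 2 GPUs: GPU 0 saw [0, 2, 4, 6, 8], GPU 1 saw [1, 3, 5, 7, 0(pad)]
# merged = merge_predictions([preds_from_gpu0, preds_from_gpu1], num_samples=9)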

Time series forecasting (DeepAR): Prediction results seem to have basic flaw

I'm using the DeepAR algorithm to forecast survey response progress over time. I want the model to predict the next 20 data points of the survey progress. Each survey is a time series in my training data. The length of each time series is the number of days for which the survey ran. For example, the series below indicates that the survey started on 29-June-2011 and the last response was received on 24-Jul-2011 (a length of 25 days).
{"start":"2011-06-29 00:00:00", "target": [37, 41.2, 47.3, 56.4, 60.6, 60.6,
61.8, 63, 63, 63, 63.6, 63.6, 64.2, 65.5, 66.1, 66.1, 66.1, 66.1, 66.1, 66.1,
66.1, 66.1, 66.1, 66.1, 66.7], "cat": 3}
As you can see, the values in the time series can remain the same or increase; the training data never indicates a downward trend. Surprisingly, when I generated predictions, I noticed that they had a downward trend. Given that there is no trace of a downward trend in the training data, I'm wondering how the model could possibly have learned one. To me, this seems to be a basic flaw in the predictions. Can someone please shed some light on why the model might behave this way? I built the DeepAR model with the hyperparameters below. The model was tested and the RMSE is about 9. Would it help to change any of the hyperparameters? Any recommendations would be appreciated.
time_freq= 'D',
context_length= 30,
prediction_length= 20,
cardinality= 8,
embedding_dimension= 30,
num_cells= 40,
num_layers= 2,
likelihood= 'student-T',
epochs= 20,
mini_batch_size= 32,
learning_rate= 0.001,
dropout_rate= 0.05,
early_stopping_patience= 10
If there is an up-trend in all time series, there should not be a problem learning it. If your time series usually have a rising and then a falling period, the algorithm may learn this pattern and generate something similar, even though the particular series you forecast has only had an up-trend so far.
How many time series do you have, and how long are they on average?
All your hyperparameters look reasonable, and it is a bit hard to tell what to improve without knowing more about the data. If you don't have that many time series, you can try increasing the number of epochs (perhaps to a few hundred) and increasing early_stopping_patience to 20-30.
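For concreteness, a hedged sketch of that suggestion, assuming the model is built with the SageMaker Python SDK (the estimator object is whatever you already constructed; only epochs and early_stopping_patience change, the rest mirrors the question):

estimator.set_hyperparameters(
    time_freq='D',
    context_length=30,
    prediction_length=20,
    cardinality=8,
    embedding_dimension=30,
    num_cells=40,
    num_layers=2,
    likelihood='student-T',
    epochs=300,                   # a few hundred instead of 20
    mini_batch_size=32,
    learning_rate=0.001,
    dropout_rate=0.05,
    early_stopping_patience=25,   # 20-30 instead of 10
)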

Catboost: what are reasonable values for l2_leaf_reg?

Running catboost on a large-ish dataset (~1M rows, 500 columns), I get:
Training has stopped (degenerate solution on iteration 0, probably too small l2-regularization, try to increase it).
How do I guess what the l2 regularization value should be? Is it related to the mean value of y, the number of variables, or the tree depth?
Thanks!
I don't think you will find an exact answer to your question, because every dataset is different.
However, in my experience, values in the range between 2 and 30 are a good starting point.
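As a rough, hedged sketch (X and y stand for your ~1M x 500 data and are not defined here), you could scan that 2-30 range and keep the value with the best holdout RMSE:

from catboost import CatBoostRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

best = None
for reg in [2, 5, 10, 20, 30]:
    model = CatBoostRegressor(l2_leaf_reg=reg, iterations=500, depth=6, verbose=0)
    model.fit(X_train, y_train, eval_set=(X_val, y_val), early_stopping_rounds=50)
    rmse = mean_squared_error(y_val, model.predict(X_val)) ** 0.5
    if best is None or rmse < best[1]:
        best = (reg, rmse)

print("best l2_leaf_reg:", best[0], "holdout RMSE:", best[1])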

Can artificial neural networks work with mathematical sets?

I know that using neural networks for anything text-related is difficult as they have problems with non-numerical input data.
But I'm not sure about mathematical sets. And sets of sets.
Like [0, 1, 2] and [3, 4, 5] or [[0, 1], [2, 3]] and [[4, 5], [6, 7]]
It should be possible to compute distances between these by computing the distances between all corresponding elements, right? I can't really find any information on that and don't want to start using neural networks without being sure.
(Googling anything with 'set' just isn't promising, because all you get as a result is the term 'data set'...)
EDIT:
First: The assignment specifically asks for a neural network, so I can't use k-means or any other clustering methods.
So the original question wasn't really addressing the actual problem. I don't have to think of a distance metric but rather of a way to feed the sets into the activation function, and for that, of how to map them to a single value. But regarding the distance metric, I'm actually not really sure at which point of the neural network I need it; I guess that's a basic comprehension problem.
I will just write down some thoughts now.
The thing that confuses me is the standardization of categories. With three categories 'red', 'green' and 'blue', you can map them to the numbers 1 to 3, but that would mean that 'red' is farther from 'blue' than 'green' is, and that's not the case. So the categories are encoded as (1, 0, 0), (0, 1, 0) and (0, 0, 1), which gives them all the same distance.
So it must be possible to feed these into the activation function somehow. I could imagine that they are interpreted as binary numbers, so that (1,0,0) = 100 = 4, (0,1,0) = 010 = 2 and (0,0,1) = 001 = 1. That would be a distinct mapping. But the numbers 1 to 3 are distinct too, so, as mentioned above, the distance metric must be necessary at some point.
So the problem is still how to map a set to a single value. I can do that right before I add it to the function, so I don't have to choose a mapping that also maintains a logical distance between the sets, because when it comes to applying the distance metric I can still apply it to the original sets rather than to the mapped values. Is that correct? Or am I still missing something?
Neural nets, in general, have no such problem. Image recognition and language translation are well within their domains. What you do need are the metrics and manipulations to relate your inputs to the ground truth in a well-ordered fashion -- which your distance metric will do quite nicely.
Go right ahead and build your neural network. Supply it with the appropriate distance function, and let it train away. Do make sure to put in some tracking instrumentation (e.g. print statements) to trace the operation for a few iterations before you turn it entirely loose.
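One concrete way to feed sets into a network, sketched under the assumption that the elements come from a known finite universe (here the integers 0-9; the sizes are hypothetical): encode each set as a fixed-length multi-hot vector, which is the same trick as the one-hot colour encoding discussed above.

import numpy as np

def multi_hot(s, universe_size=10):
    # one slot per possible element; 1.0 where the element is present
    v = np.zeros(universe_size, dtype=np.float32)
    v[list(s)] = 1.0
    return v

x1 = multi_hot({0, 1, 2})   # -> [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
x2 = multi_hot({3, 4, 5})   # -> [0, 0, 0, 1, 1, 1, 0, 0, 0, 0]

# a set of sets becomes a stack (or a sum/mean) of such vectors, which an
# ordinary feed-forward network can take as input
X = np.stack([multi_hot(s) for s in [{0, 1}, {2, 3}]])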

(Bidirectional) RNN for simple text classification

TL;DR: Is Bidirectional RNN helpful for simple text classification and is padding evil?
In my recent work, I created an LSTM model and a BLSTM model for the same task, namely text classification. The LSTM model did a pretty good job, yet I decided to give the BLSTM a shot to see whether it might push the accuracy even further. In the end, I found the BLSTM much slower to converge and, surprisingly, it overfitted, even though I applied dropout with a probability of 50%.
In the implementation, I used unrolled RNNs for both the LSTM and the BLSTM, expecting faster training. To meet that requirement, I manually padded the input texts to a fixed length.
Let's say we have the sentence "I slept late in the morning and missed th interview with Nebuchadnezzar", which is padded with 0s at its end when converted to an array of indices into pre-trained word embeddings. So we get something like [21, 43, 25, 64, 43, 25, 6, 234, 23, 0, 0, 29, 0, 0, 0, ..., 0]. Note that "th" (it should be "the") is a typo and the name "Nebuchadnezzar" is too rare, so neither is present in the vocabulary and we also replace them with 0, which corresponds to a special all-zero word vector.
Here are my reflections:
Some people prefer changing unknown words into a special token like "<unk>" before feeding the corpus into a GloVe or Word2Vec model. Does that mean we have to first build the vocabulary and change low-frequency words (according to a min-count setting) into "<unk>" before training? Is that better than mapping unknown words to 0, or simply removing them, when training the RNN?
The trailing 0s fed into the LSTM or BLSTM networks, as far as I'm concerned, mess up the output. Although no new information comes in from outside, the cell state still gets updated at each subsequent time step, so the output of the final cell will be heavily affected by the long run of trailing 0s. I believe a BLSTM will be affected even more, as it also processes the text in reverse order, i.e. something like [0, 0, 0, ..., 0, 321, 231], especially if we initialize the forget gate to 1.0 to foster memory at the beginning. I see a lot of people use padding, but won't it cause a disaster if the texts are padded to a great length, particularly in the BLSTM case?
Any idea on these issues? :-o
I mostly agree with Fabrice's answer, but I'd like to add a few comments:
You should NEVER use the same token for UNK and PAD. Most deep learning libraries mask out PAD, since it provides no information. UNK, on the other hand, does provide information to your model (there is a word here, we just don't know what it is, and it's probably a special word), so you should not mask it out. Yes, this does mean that, in a separate preprocessing step, you should go through your training/testing data, build a vocabulary of, say, the 10,000 most common words, and switch everything else to UNK.
As noted in 1, most libraries simply mask out (i.e. ignore) padding tokens, so this is not an issue. But, as you said, you don't strictly need to pad the sentences: for example, you can group them by length while training, or feed the sentences into your model one at a time.
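A minimal PyTorch sketch of both points (the token ids and sizes below are toy values, not from the question): PAD and UNK get distinct ids, PAD is masked via padding_idx, and the trailing pads are skipped entirely by packing the sequences.

import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

PAD_ID, UNK_ID = 0, 1                           # two separate special tokens
vocab_size, emb_dim, hidden = 10000, 100, 128   # toy sizes

emb = nn.Embedding(vocab_size, emb_dim, padding_idx=PAD_ID)  # PAD embeds to zeros
blstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)

# two toy sentences padded with PAD_ID; UNK_ID marks an out-of-vocabulary word
batch = torch.tensor([[21, 43, 25, 64, UNK_ID, PAD_ID, PAD_ID],
                      [12,  7,  2, PAD_ID, PAD_ID, PAD_ID, PAD_ID]])
lengths = torch.tensor([5, 3])                  # true lengths before padding

packed = pack_padded_sequence(emb(batch), lengths, batch_first=True,
                              enforce_sorted=False)
output, (h_n, c_n) = blstm(packed)              # the BLSTM never sees the trailing 0s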
In my experience, having different embeddings for UNKNOWN and PADDING is helpful. Since you're doing text classification, I guess removing them wouldn't be too harmful if there aren't too many, but I'm not familiar enough with text classification to say that with certainty.
As for padding your sequences, have you tried padding them differently? For example, pad the beginning of the sequence for a forward LSTM and the end of the sequence for a backward LSTM. Since you're padding with zeros, the activations are not going to be as strong (if present at all), and your LSTMs will now end with your actual sequence instead of zeros, which could otherwise overwrite your LSTM's memory.
Of course, these are just suggestions off the top of my head (I do not have enough reputation to comment), so I do not have THE answer. You'll have to try it out yourself and see whether it helps. I hope it does.
Cheers.

Resources