Keras LSTM Restore States from Respective Sequences - machine-learning

I have a task where the training data comes from several long sequences. I want to train on a randomly chosen sequence at each step, but without changing the order within each sequence (because long-term dependencies might be present).
I think this means choosing a sequence number, restoring the previous state for that sequence, training, saving the new state for that sequence, and repeating.
Is there any way to specify the state when training a layer created with Keras' LSTM? Or do I have to go to my backend (which is TensorFlow)?

According to https://github.com/fchollet/keras/issues/1947, I could use K.set_value(lstm.states[i], val) to write a state and K.get_value(lstm.states[i]) to read it.
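The restore/train/save loop described in the question can be sketched without Keras. This is only an illustration of the bookkeeping: ToyStatefulLayer is a hypothetical stand-in whose "state" is a running sum, and its get_state/set_state methods play the roles that K.get_value and K.set_value would play on lstm.states in the approach from the linked issue.

```python
import random


class ToyStatefulLayer:
    """Hypothetical stand-in for a stateful recurrent layer; its state is a running sum."""

    def __init__(self):
        self.state = 0.0

    def get_state(self):
        # Plays the role of K.get_value(lstm.states[i]).
        return self.state

    def set_state(self, value):
        # Plays the role of K.set_value(lstm.states[i], val).
        self.state = value

    def train_on_chunk(self, chunk):
        # A real layer would also update its weights; here we only advance the state.
        for x in chunk:
            self.state += x


def train_interleaved(layer, sequences, chunk_len, rng):
    """Visit sequences in random order while keeping each sequence's own order and state."""
    saved_state = {name: 0.0 for name in sequences}  # one state slot per sequence
    position = {name: 0 for name in sequences}       # next unread index per sequence
    unfinished = list(sequences)
    while unfinished:
        name = rng.choice(unfinished)                # pick a sequence at random
        layer.set_state(saved_state[name])           # restore that sequence's state
        start = position[name]
        layer.train_on_chunk(sequences[name][start:start + chunk_len])
        saved_state[name] = layer.get_state()        # save the updated state
        position[name] = start + chunk_len
        if position[name] >= len(sequences[name]):
            unfinished.remove(name)
    return saved_state


sequences = {"a": [1, 2, 3, 4, 5], "b": [10, 20, 30]}
final_states = train_interleaved(
    ToyStatefulLayer(), sequences, chunk_len=2, rng=random.Random(0)
)
print(final_states)
```

Whatever order the sequences are visited in, each final state depends only on its own sequence, which is the property the question asks for.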

Related

What exactly happens when we call IterativeProcess.next on federated training data?

I went through the Federated Learning tutorial. I was wondering how the .next function works when we call it on an iterative process.
Assume that we have training data which is a list of lists. The outer list is a list of clients and the inner lists are batches of data for each client. Then we create an iterative process, for example a federated averaging process, and initialize the state.
What exactly happens when we call IterativeProcess.next on this training data? Does it take from these data randomly in each round? Or does it just take data from each client one batch at a time?
Assume that I have a list of tf.data.Datasets, each representing one client's data. How can I add some randomness to sampling from this list for the next iteration of federated learning?
My datasets are not necessarily the same length. When one of them is completely iterated over, does this dataset wait for all the other datasets to be completely iterated over or not?
Does (the iterative process) take from these data randomly in each round? Or just take data from each client one batch at a time?
The TFF tutorials all use tff.learning.build_federated_averaging_process, which constructs a tff.templates.IterativeProcess that implements the Federated Averaging algorithm (McMahan et al. 2017). In this algorithm each "round" (one invocation of IterativeProcess.next()) processes as many batches of examples on each client as the tf.data.Dataset is set up to produce in one iteration. The guide "tf.data: Build TensorFlow input pipelines" is a good introduction to tf.data.Dataset.
The order in which examples are processed is determined by how the tf.data.Datasets that were passed into the next() method as arguments were constructed. For example, in the Federated Learning for Text Generation tutorial's section titled Load and Preprocess the Federated Shakespeare Data, each client dataset is set up with the following preprocessing pipeline:
def preprocess(dataset):
    return (
        # Map ASCII chars to int64 indexes using the vocab
        dataset.map(to_ids)
        # Split into individual chars
        .unbatch()
        # Form example sequences of SEQ_LENGTH + 1
        .batch(SEQ_LENGTH + 1, drop_remainder=True)
        # Shuffle and form minibatches
        .shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)
        # And finally split into (input, target) tuples,
        # each of length SEQ_LENGTH.
        .map(split_input_target))
The next function iterates over each of these datasets in its entirety once per invocation. In this case, since there is no call to tf.data.Dataset.repeat(), each call to next() has each client see all of its examples exactly once.
Assume that I have a list of tf.data.Datasets, each representing one client's data. How can I add some randomness to sampling from this list for the next iteration of federated learning?
To add randomness to each client's dataset, one could use the tf.data.Dataset.shuffle() to first randomize the order of yielded examples, and then tf.data.Dataset.take() to take only a sample of that new random ordering. This could be added to the preprocess() method above.
Alternatively, randomness in the selection of clients (e.g. randomly picking which clients participate each round) can be done using any Python library to sub-sample the list of datasets, e.g. Python's random.sample.
My datasets are not necessarily the same length. When one of them is completely iterated over, does this dataset wait for all the other datasets to be completely iterated over or not?
Each dataset is only iterated over once on each invocation of .next(). This is in line with the synchronous communication "rounds" in McMahan et al. 2017. In some sense, yes, the datasets "wait" for each other.
Any tff.Computation (like next) will always run the entire specified computation. If your tff.templates.IterativeProcess is, for example, the result of tff.learning.build_federated_averaging_process, its next function will represent one round of the federated averaging algorithm.
The federated averaging algorithm runs training for a fixed number of epochs (let's say 1 for simplicity) over each local dataset, and averages the model updates in a data-weighted manner at the server in order to complete a round--see Algorithm 1 in the original federated averaging paper for a specification of the algorithm.
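The data-weighted averaging step can be illustrated in isolation. The sketch below is not TFF code: federated_average is a hypothetical helper, each client "update" is a plain list of floats, and the numbers are made up. It only shows how a client with more examples pulls the average further toward its own update.

```python
def federated_average(client_updates, client_num_examples):
    """Average per-client weight vectors, weighting each by its number of examples."""
    total = sum(client_num_examples)
    dim = len(client_updates[0])
    averaged = [0.0] * dim
    for update, n in zip(client_updates, client_num_examples):
        for k in range(dim):
            averaged[k] += update[k] * n / total
    return averaged


# Two hypothetical clients: the second holds three times as much data as the first.
client_updates = [[1.0, 0.0], [3.0, 4.0]]
client_num_examples = [1, 3]
global_update = federated_average(client_updates, client_num_examples)
print(global_update)  # [2.5, 3.0]
```

An unweighted mean of the same two updates would be [2.0, 2.0]; the weighting moves the result toward the client with more data.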
Now, for how TFF represents and executes this algorithm. From the documentation for build_federated_averaging_process, the next function has type signature:
(<S#SERVER, {B*}#CLIENTS> -> <S#SERVER, T#SERVER>)
TFF's type system represents a dataset as a tff.SequenceType (this is the meaning of the * above), so the second element in the parameter of the type signature represents a set (technically a multiset) of datasets with elements of type B, placed at the clients.
What this means in your example is as follows. You have a list of tf.data.Datasets, each of which represents the local data on each client--you can think of the list as representing the federated placement. In this context, TFF executing the entire specified computation means: TFF will treat every item in the list as a client to be trained on in this round. In the terms of the algorithm linked above, your list of datasets represents the set S_t.
TFF will faithfully execute one round of the federated averaging algorithm, with the Dataset elements of your list representing the clients selected for this round. Training will be run for a single epoch on each client (in parallel); if datasets have different amounts of data, you are correct that the training on each client is likely to finish at different times. However, this is the correct semantics of a single round of the federated averaging algorithm, as opposed to a parameterization of a similar algorithm like Reptile, which runs for a fixed number of steps for each client.
If you wish to select a subset of clients to run a round of training on, this should be done in Python, before calling into TFF, e.g.:
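import random
state = iterative_process.initialize()
# ls is the list of client datasets; N_CLIENTS is the number of clients
# to train on in this round (at most len(ls))
sampled_clients = random.sample(ls, N_CLIENTS)
# next returns the updated server state together with the round's metrics
state, metrics = iterative_process.next(state, sampled_clients)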
Generally, you can think of the Python runtime as an "experiment driver" layer--any selection of clients, for example, should happen at this layer. See the beginning of this answer for further detail on this.

How does Beam Search operate on the output of The Transformer?

According to my understanding (please correct me if I'm wrong), Beam Search is a BFS that only explores the "graph" of possibilities down the b most likely options at each step, where b is the beam size.
To calculate/score each option, especially for the work that I'm doing which is in the field of NLP, we basically calculate the score of a possibility by calculating the probability of a token, given everything that comes before it.
This makes sense in a recurrent architecture, where you simply run the model you have with your decoder through the best b first tokens, to get the probabilities of the second tokens, for each of the first tokens. Eventually, you get sequences with probabilities and you just pick the one with the highest probability.
However, in the Transformer architecture, where the model doesn't have that recurrence, the output is a probability distribution over the vocabulary for each position in the sequence, with shape (batch size, max sequence length, vocab size). How do I interpret this output for Beam Search? I can get the encodings for the input sequence, but since there isn't that recurrence of using the previous output as input for the next token's decoding, how do I go about calculating the probability of all the possible sequences that stem from the best b tokens?
The beam search works exactly the same way as with recurrent models. The decoder is not recurrent (it is self-attentive), but it is still auto-regressive, i.e., generating a token is conditioned on the previously generated tokens.
At training time, the self-attention is masked so that each position can only attend to the words to the left of the word that is currently being generated. This simulates the setup you have at inference time, when you indeed only have the left context (because the right context has not been generated yet).
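As a small illustration of that masking, the sketch below builds a causal mask in plain Python. The helper name causal_mask and the convention that True means "may attend" are assumptions for this example; some implementations use the opposite convention and mark the blocked positions instead.

```python
def causal_mask(seq_len):
    """mask[i][j] is True when decoder position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


mask = causal_mask(3)
for row in mask:
    print(row)
```

Row i contains True only for positions 0 through i, so nothing to the right of the current position leaks into the prediction during training.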
The only difference is that in the RNN decoder, you only use the last RNN state in every beam search step. With the Transformer, you always need to keep the entire hypothesis and do the self-attention over the entire left context.
Adding more information for your later question and for people who have the same question:
I guess what I really want to ask is that, with an RNN architecture, in the decoder, I can feed it the b tokens that are highest in probability, to get the conditional probabilities of subsequent tokens. However, as I understand, from this tutorial here: tensorflow.org/beta/tutorials/text/…, I can't really do that for the Transformer architecture. Is that right? The decoder takes in the encoder outputs, the 2 masks and the target -- what would I input in for the parameter target?
The tutorial on the website you mentioned uses teacher forcing in the training stage. Beam search can be applied to the Transformer decoder in the testing stage.
Using beam search in the training stage is not common for modern architectures like Transformers (check this link for more info), whereas teacher forcing, as used in the tutorial's training stage, allows parallel computation and speeds up training when the task has a large vocabulary.
As for testing such a decoder, you could try the following steps to do beam search (this is just one possibility based on my understanding, and there may be better solutions):
First, instead of taking the entire ground truth sequence as input for the decoder, provide only "[SOS]" and pad the remaining positions.
Although the output of your decoder still has shape [batch_size, max_sequence_len, vocab_size], only the slice at position 0 (shape [batch_size, vocab_size]) carries useful information: it is the distribution over the first token your model generates. Select the top b tokens and append each one to its own copy of the "[SOS]" sequence. Now you have the sequences "[SOS] token(1,1)", ..., "[SOS] token(1,b)".
Second, use the above sequences as input for the decoder and search for the top b tokens among the b * vocab_size options. Append them to their corresponding sequences.
Repeat until the sequences meet some stopping condition (max_output_length or [EOS]).
P.S: 1) [SOS] or [EOS] means the Start or the End of the sequence.
2) token(i,j) means the j-th token in top b tokens for the i-th token in sequence
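The steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the tutorial's code: toy_next_token_logprobs is a hypothetical stand-in for one decoder forward pass that receives the whole prefix and returns log-probabilities for the next token only, and its probability table is made up. A real Transformer decoder would be called in its place, and practical implementations often add length normalization, which is omitted here.

```python
import math

SOS, EOS = "[SOS]", "[EOS]"


def toy_next_token_logprobs(prefix):
    """Hypothetical decoder call: takes the full prefix, returns next-token log-probs."""
    table = {
        SOS: {"a": 0.6, "b": 0.4},
        "a": {"a": 0.1, "b": 0.5, EOS: 0.4},
        "b": {"a": 0.3, EOS: 0.7},
    }
    # The toy model happens to look only at the last token of the prefix.
    return {tok: math.log(p) for tok, p in table[prefix[-1]].items()}


def beam_search(next_token_logprobs, beam_size, max_len):
    beams = [([SOS], 0.0)]  # (tokens so far, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            # Re-run the decoder on the entire hypothesis at every step.
            for tok, logprob in next_token_logprobs(tokens).items():
                candidates.append((tokens + [tok], score + logprob))
        # Keep the top b among all (up to b * vocab_size) extensions.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_size]:
            if tokens[-1] == EOS:
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
        if not beams:
            break
    finished.extend(beams)  # hypotheses that were cut off at max_len
    return max(finished, key=lambda c: c[1])


best_tokens, best_score = beam_search(toy_next_token_logprobs, beam_size=2, max_len=4)
print(best_tokens, round(math.exp(best_score), 2))
```

With this table, a greedy decoder would commit to "a" first and end with "[SOS] a b [EOS]" (probability 0.21), while a beam of size 2 also keeps "b" alive and finds "[SOS] b [EOS]" (probability 0.28).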

LSTM: choice of batch size when State reused for generation later

I am building an LSTM model that generates symbols step-by-step. The task is to train the model up to some point of the data sequence and then to use the trained model to process the remaining pieces of the sequence in the test phase -- these remaining pieces weren't seen during Training.
For this task, I am attempting to re-use the latest state from the Training phase for the subsequent Prediction phase (i.e. not to start predicting with clean zero-state, but to sort-of continue where things were left off during training).
In this context, I am wondering how to best choose the Batch size for training.
My Training data is one long sequence of time-ordered observations. If that sequence is chopped up into N batches for Training, then my understanding is that the State tensor will be of shape [N, Network_Size] during Training, and [1, Network_Size] during Prediction. So for Prediction, I simply take the last element of the [N, Network_Size] tensor, which is of shape [1, Network_Size].
That seems to work in terms of mechanics, but this means that the value of N determines how many observations that last vector of the original State has seen during Training.
Is there a best practice for determining how to choose N? The network trains much faster with a larger batch size, but I am concerned that this way the last part of the State tensor may not have seen enough. Obviously I'm trying various combinations, but I am curious how others have dealt with it.
Also, I have seen a few examples where parameters like this (or Cell size/etc.) are set as powers-of-2 (i.e. 64, 128, etc.). Is there any theoretical reason behind that vs simple 50/100/etc.? Or just a quirky choice?
First, for your last question: for computers powers of two are simpler than powers of 10 (memory size and alignment constraints, for example, are likelier to be powers of two).
It is unclear from your question what you mean by training: updating parameters, or just computing RNN forward steps. Updating parameters doesn't make much sense because for RNNs (including LSTMs) you'd ideally update parameters only after seeing an entire batch of sequences (and you often need many updates until the model is at all reasonable). Similarly, RNN forward steps don't make much sense to me because the state for each example is independent of the batch size (ignoring any batch normalization you might be doing).
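To make the setup in the question concrete, the sketch below assumes the layout the question describes: one long sequence is cut into N contiguous streams, and row k of the state tensor follows stream k. The helper split_into_streams and the toy data are hypothetical. Under that assumption, the last row's state has only been carried across roughly len(sequence) / N observations.

```python
def split_into_streams(sequence, num_streams):
    """Cut one long sequence into num_streams contiguous, equal-length streams."""
    stream_len = len(sequence) // num_streams  # any trailing remainder is dropped
    return [sequence[i * stream_len:(i + 1) * stream_len] for i in range(num_streams)]


observations = list(range(12))  # toy data: 12 time-ordered observations
for n in (1, 3, 6):
    streams = split_into_streams(observations, n)
    print(n, "stream(s); the last one covers", len(streams[-1]), "observations:", streams[-1])
```

Doubling N halves the history behind the state that is carried into prediction, which is the trade-off against training speed that the question raises.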

What algorithms can be used to build predictors from fractions of time series?

Let S and T be sets of time series labelled with a property. Each time series is highly periodic and in fact contains subsequent repeats of the same process (consider e.g. a gait recording, which is a time series of foot positions that repeat the same motion, which I'm calling a segment for simplicity's sake).
What is a good feature extractor if my objective is to build a model that, from a sequence of such segments, returns a similarity score to S or T? Ignore the model itself for now and just consider feature extraction for the time being.
What you describe falls into the following class of problems: given a sequence of features, classify or recognize the hidden state.
For example, in machine vision, the sequence could be images captured continuously of a moving human. The goal is to identify certain categories of gestures.
In your problem, the input is d-dimensional time series data and your output is the probability of each of the two classes (S and T).
There are some general methods for handling such problems, namely hidden Markov models (HMMs) and conditional random fields (CRFs).
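One way to use HMMs for a two-class score is to keep one model per class and compare the likelihoods they assign to a new sequence. The sketch below implements the forward algorithm for a discrete HMM. Everything specific in it is an assumption for illustration: the observations are already quantized into symbols 0 and 1, the two parameter sets are made up rather than trained, and equal class priors are assumed when forming the score.

```python
def forward_likelihood(observations, start_prob, trans_prob, emit_prob):
    """Probability of an observation sequence under a discrete HMM (forward algorithm)."""
    num_states = len(start_prob)
    alpha = [start_prob[s] * emit_prob[s][observations[0]] for s in range(num_states)]
    for obs in observations[1:]:
        alpha = [
            sum(alpha[prev] * trans_prob[prev][s] for prev in range(num_states))
            * emit_prob[s][obs]
            for s in range(num_states)
        ]
    return sum(alpha)


# Hypothetical two-state models: class S tends to stay in a state, class T tends to alternate.
hmm_s = dict(start_prob=[0.5, 0.5],
             trans_prob=[[0.9, 0.1], [0.1, 0.9]],
             emit_prob=[[0.8, 0.2], [0.2, 0.8]])
hmm_t = dict(start_prob=[0.5, 0.5],
             trans_prob=[[0.1, 0.9], [0.9, 0.1]],
             emit_prob=[[0.8, 0.2], [0.2, 0.8]])

segment = [0, 0, 0, 1, 1, 1]  # a quantized segment with long runs
like_s = forward_likelihood(segment, **hmm_s)
like_t = forward_likelihood(segment, **hmm_t)
score_s = like_s / (like_s + like_t)  # similarity to S, assuming equal class priors
print(round(score_s, 3))
```

For continuous, d-dimensional recordings the emission table would have to be replaced by a density (for example Gaussian emissions), and the parameters would be fitted to the labelled series in S and T.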

Updating a Neural Network input

I trained a 4-input, 1-output NN for 1 month, and then the same NN was upgraded to 5 inputs and 1 output. Should I repeat the training with the new configuration, or can I still use the old training?
You'll almost definitely need to repeat the training, unless you can feed your five-input NN to your trained 4-input NN, in which case you might be able to get away with less. It depends on exactly what the new variable represents.
If the remaining 4 inputs still represent the same thing, you do not have to start from scratch. Instead, add a new neuron to the input layer, along with edges between it and the hidden units. Initialize the new weights as usual, but leave the remaining weights as they are. In other words, you are using your previous network as the starting point of the optimization. It should converge much faster, and is in general the better option if you no longer have access to the historical data (or do not have time to retrain everything).
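As a rough sketch of that idea, the example below extends an input-to-hidden weight matrix by one input while keeping the trained weights untouched. The helper add_input_unit, the weight values, and the small uniform initialization range are assumptions for illustration; any standard initializer could be used for the new row.

```python
import random


def add_input_unit(input_weights, rng, scale=0.01):
    """Return a copy of the input-to-hidden weights with one extra input row.

    input_weights[i][h] is the weight from input i to hidden unit h.
    """
    num_hidden = len(input_weights[0])
    # Freshly initialized weights for the edges leaving the new input neuron.
    new_row = [rng.uniform(-scale, scale) for _ in range(num_hidden)]
    return [row[:] for row in input_weights] + [new_row]


# Hypothetical trained weights: 4 inputs, 3 hidden units.
old_weights = [
    [0.5, -0.2, 0.1],
    [0.3, 0.8, -0.4],
    [-0.7, 0.1, 0.2],
    [0.0, 0.6, -0.9],
]
new_weights = add_input_unit(old_weights, random.Random(0))
print(len(new_weights), "inputs x", len(new_weights[0]), "hidden units")
```

Training then continues from new_weights on data that includes the fifth input, with all layers after the first left as they were.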
