What exactly happens when we call IterativeProcess.next on federated training data? - tensorflow-federated

I went through the Federated Learning tutorial. I was wondering how the .next function works when we call it on an iterative process.
Assume that we have training data which is a list of lists: the outer list is a list of clients and the inner lists are batches of data for each client. Then we create an iterative process, for example a federated averaging process, and we initialize the state.
What exactly happens when we call IterativeProcess.next on this training data? Does it take from these data randomly in each round? Or just take data from each client one batch at a time?
Assume that I have a list of tf.data.Datasets each representing a client data. How can I add some randomness to sampling from this list for the next iteration of federated learning?
My datasets are not necessarily the same length. When one of them is completely iterated over, does this dataset wait for all other datasets to completely iterate over their data or not?

Does (the iterative process) take from these data randomly in each round? Or just take data from each client one batch at a time?
The TFF tutorials all use tff.learning.build_federated_averaging_process which constructs a tff.templates.IterativeProcess that implements the Federated Averaging algorithm (McMahan et al. 2017). In this algorithm each "round" (one invocation of IterativeProcess.next()) processes as many batches of examples on each client as the tf.data.Dataset is set up to produce in one iteration. tf.data: Build TensorFlow input pipelines is a great guide for tf.data.Dataset.
The order in which examples are processed is determined by how the tf.data.Datasets that were passed into the next() method as arguments were constructed. For example, in the Federated Learning for Text Generation tutorial's section titled Load and Preprocess the Federated Shakespeare Data, each client dataset is set up with a preprocessing pipeline:
def preprocess(dataset):
  return (
      # Map ASCII chars to int64 indexes using the vocab
      dataset.map(to_ids)
      # Split into individual chars
      .unbatch()
      # Form example sequences of SEQ_LENGTH + 1
      .batch(SEQ_LENGTH + 1, drop_remainder=True)
      # Shuffle and form minibatches
      .shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)
      # And finally split into (input, target) tuples,
      # each of length SEQ_LENGTH.
      .map(split_input_target))
The next function will iterate over each of these datasets in its entirety once per invocation of next(); in this case, since there is no call to tf.data.Dataset.repeat(), next() will have each client see all of its examples exactly once.
Assume that I have a list of tf.data.Datasets each representing a client data. How can I add some randomness to sampling from this list for the next iteration of federated learning?
To add randomness to each client's dataset, one could use tf.data.Dataset.shuffle() to first randomize the order of yielded examples, and then tf.data.Dataset.take() to take only a sample of that new random ordering. This could be added to the preprocess() method above, as sketched below.
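For instance, a minimal sketch of that idea (not from the tutorial itself) could wrap the tutorial's preprocess() like this, where EXAMPLES_PER_ROUND is an assumed constant:

EXAMPLES_PER_ROUND = 500  # assumed sample size per client per round

def preprocess_with_sampling(dataset):
  return preprocess(
      dataset.shuffle(BUFFER_SIZE)        # randomize the order of examples
             .take(EXAMPLES_PER_ROUND))   # keep only a random sample of them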
Alternatively, randomness in the selection of clients (e.g. randomly picking which clients participate each round) can be done using any Python library to sub-sample the list of datasets, e.g. Python's random.sample.
My datasets are not necessarily the same length. When one of them is completely iterated over, does this dataset wait for all other datasets to completely iterate over their data or not?
Each dataset is only iterated over once on each invocation of .next(). This is in line with the synchronous communication "rounds" in McMahan et al. 2017. In some sense, yes, the datasets "wait" for each other.

Any tff.Computation (like next) will always run the entire specified computation. If your tff.templates.IterativeProcess is, for example, the result of tff.learning.build_federated_averaging_process, its next function will represent one round of the federated averaging algorithm.
The federated averaging algorithm runs training for a fixed number of epochs (let's say 1 for simplicity) over each local dataset, and averages the model updates in a data-weighted manner at the server in order to complete a round--see Algorithm 1 in the original federated averaging paper for a specification of the algorithm.
Now, for how TFF represents and executes this algorithm. From the documentation for build_federated_averaging_process, the next function has type signature:
(<S@SERVER, {B*}@CLIENTS> -> <S@SERVER, T@SERVER>)
TFF's type system represents a dataset as a tff.SequenceType (this is the meaning of the * above), so the second element in the parameter of the type signature represents a set (technically a multiset) of datasets with elements of type B, placed at the clients.
What this means in your example is as follows. You have a list of tf.data.Datasets, each of which represents the local data on each client--you can think of the list as representing the federated placement. In this context, TFF executing the entire specified computation means: TFF will treat every item in the list as a client to be trained on in this round. In the terms of the algorithm linked above, your list of datasets represents the set S_t.
TFF will faithfully execute one round of the federated averaging algorithm, with the Dataset elements of your list representing the clients selected for this round. Training will be run for a single epoch on each client (in parallel); if datasets have different amounts of data, you are correct that the training on each client is likely to finish at different times. However, this is the correct semantics of a single round of the federated averaging algorithm, as opposed to a parameterization of a similar algorithm like Reptile, which runs for a fixed number of steps for each client.
If you wish to select a subset of clients to run a round of training on, this should be done in Python, before calling into TFF, e.g.:
import random  # standard library, used to sub-sample clients

N_CLIENTS = 10  # e.g. number of clients to sample per round
state = iterative_process.initialize()
# ls is the list of client datasets
sampled_clients = random.sample(ls, N_CLIENTS)
state, metrics = iterative_process.next(state, sampled_clients)
Generally, you can think of the Python runtime as an "experiment driver" layer--any selection of clients, for example, should happen at this layer. See the beginning of this answer for further detail on this.

Related

What does the 'global_step' parameter refer to from the 'report_hyperparameter_tuning_metric' function in the hypertune package?

I am using Google Vertex AI to train models, and I am not sure what this parameter is specifying. I noticed that in some Vertex AI tutorials this value was also given a variable value called 'NUM_EPOCHS'. Looking at the Github for the package doesn't add much clarity.
I'm not sure how this can be referring to the number of epochs that the model is trained with, as I feel that can be done more easily just by writing code (and its default value, 1000, seems absurdly high). What does this parameter mean?
The global_step passed to the report_hyperparameter_tuning_metric function defines the number of batches the graph has seen, as mentioned in this StackOverflow question. It represents how many batches the model has seen during training, from the start until now.
The function report_hyperparameter_tuning_metric is used to record and write to a file the value of some metric (e.g. the loss) in order to understand how well the model is performing. It takes the metric value and the step number (representing how many steps have passed, i.e. how many batches the model has seen) and records this data point. This function needs to be called after every step (the model sees the batch, updates the weights and the metric values, and then calls this function), so that the training metrics can be plotted as a 2D curve (number of steps vs. metric). This step number equals the value of global_step, which is used to keep track of the number of batches.
The global_step is used to keep track of the number of batches seen. It must be an integer variable. Each time a batch is provided, the weights are updated in a direction that minimizes the loss. When global_step is passed to optimizer.minimize() via the global_step argument, the variable is incremented by one on every update.
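As a concrete illustration, here is a minimal sketch of how global_step and report_hyperparameter_tuning_metric fit together, assuming the cloudml-hypertune package; the training loop and data below are placeholders, not code from the Vertex AI tutorials.

import hypertune

hpt = hypertune.HyperTune()
global_step = 0

num_epochs = 3
train_batches = [[1.0, 2.0], [3.0, 4.0]]   # placeholder "batches"

def train_one_step(batch):
    # Placeholder for: the model sees the batch, the weights are updated,
    # and the current loss is returned.
    return sum(batch) / len(batch)

for epoch in range(num_epochs):
    for batch in train_batches:
        loss = train_one_step(batch)
        global_step += 1                       # one more batch seen by the model
        hpt.report_hyperparameter_tuning_metric(
            hyperparameter_metric_tag='loss',  # which metric is being recorded
            metric_value=loss,                 # its value at this point
            global_step=global_step)           # how many batches seen so far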

How does Beam Search operate on the output of The Transformer?

According to my understanding (please correct me if I'm wrong), Beam Search is BFS where it only explores the "graph" of possibilities down the b most likely options, where b is the beam size.
To calculate/score each option, especially for the work that I'm doing which is in the field of NLP, we basically calculate the score of a possibility by calculating the probability of a token, given everything that comes before it.
This makes sense in a recurrent architecture, where you simply run the model you have with your decoder through the best b first tokens, to get the probabilities of the second tokens, for each of the first tokens. Eventually, you get sequences with probabilities and you just pick the one with the highest probability.
However, in the Transformer architecture, where the model doesn't have that recurrence, the output is the entire probability for each word in the vocabulary, for each position in the sequence (batch size, max sequence length, vocab size). How do I interpret this output for Beam Search? I can get the encodings for the input sequence, but since there isn't that recurrence of using the previous output as input for the next token's decoding, how do I go about calculating the probability of all the possible sequences that stems from the best b tokens?
Beam search works exactly the same way as with recurrent models. The decoder is not recurrent (it is self-attentive), but it is still auto-regressive, i.e., generating a token is conditioned on previously generated tokens.
At training time, the self-attention is masked such that it only attends to words to the left of the word that is currently being generated. This simulates the setup you have at inference time, when you indeed only have the left context (because the right context has not been generated yet).
The only difference is that in the RNN decoder, you only use the last RNN state in every beam search step. With the Transformer, you always need to keep the entire hypothesis and do the self-attention over the entire left context.
Adding more information for your later question and for people who have the same question:
I guess what I really want to ask is: with an RNN architecture, in the decoder, I can feed it the b tokens that are highest in probability, to get the conditional probabilities of subsequent tokens. However, as I understand from this tutorial here: tensorflow.org/beta/tutorials/text/…, I can't really do that for the Transformer architecture. Is that right? The decoder takes in the encoder outputs, the 2 masks and the target -- what would I input for the target parameter?
The tutorial on the website you mentioned uses teacher forcing in the training stage, and it is possible to apply beam search to the decoder of a Transformer in the testing stage.
Using beam search for modern architectures like Transformers in the training stage is not so popular (check this link for more info), while teacher forcing, as used in the tutorial's training stage, offers parallel computation and speeds up training when you are dealing with a task with a large vocabulary.
As for testing such a decoder, you could try the following steps to do beam search (just offering a possibility based on my understanding; there may be better solutions -- a rough code sketch follows these steps):
First, instead of taking the entire ground-truth sequence as input for the decoder, you could provide only "[SOS]" and pad the remaining positions.
Although the output of your decoder is still [batch_size, max_sequence_len, vocab_size], only the slice (batch_size, 0, vocab_size) gives you useful information, and that is the distribution over the first token your model generates. Select the top b tokens and append each one to your "[SOS]" sequence. Now you have the sequences "[SOS] token(1,1)", ..., "[SOS] token(1,b)".
Second, use the above sequences as input for the decoder and search for the top b tokens among the b * vocab_size options. Add them to their corresponding sequences.
Repeat until the sequences meet some stopping condition (max_output_length or [EOS]).
P.S: 1) [SOS] or [EOS] means the Start or the End of the sequence.
2) token(i,j) means the j-th token among the top b tokens at position i in the sequence.
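Putting the steps above together, here is a rough beam-search sketch in plain Python/NumPy. decode_step is a hypothetical wrapper around your Transformer decoder (not a function from the tutorial): it takes the encoder output and a partial target sequence starting with [SOS] and returns the log-probabilities for the next token only.

import numpy as np

def beam_search(decode_step, enc_out, sos_id, eos_id, beam_size, max_len):
    # Each hypothesis is a pair (token_ids, cumulative_log_prob).
    beams = [([sos_id], 0.0)]
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == eos_id:              # finished hypotheses are carried over
                candidates.append((seq, score))
                continue
            log_probs = decode_step(enc_out, seq)         # shape: (vocab_size,)
            top_ids = np.argsort(log_probs)[-beam_size:]  # best b next tokens
            for tok in top_ids:
                candidates.append((seq + [int(tok)], score + float(log_probs[tok])))
        # Keep the best b hypotheses among the (up to) b * b candidates.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
        if all(seq[-1] == eos_id for seq, _ in beams):
            break
    return beams[0][0]  # highest-scoring sequence, including [SOS]

# Example usage with a dummy decode_step (uniform distribution over 10 tokens):
# dummy = lambda enc_out, seq: np.log(np.full(10, 0.1))
# beam_search(dummy, enc_out=None, sos_id=0, eos_id=9, beam_size=3, max_len=5)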

Is it possible to cluster data with grouped rows of data in unsupervised learning?

I am working to set up data for an unsupervised learning algorithm. The goal of the project is to group (cluster) different customers together based on their behavior on the website. Obviously, some sort of clustering algorithm is best for discovering patterns in the data we can't see as humans.
However, the database contains multiple rows for each customer (in chronological order) for each action the customer took on the website for that visit. For example customer with ID# 123 clicked on page 1 at time X and that would be a row in the database, and then the same customer clicked another page at time Y. That would make another row in the database.
My question is what algorithm or approach would you use for clustering in this given scenario? K-means is really popular for this type of problem, but I don't know if it's possible to use in this situation because of the grouping. Is it somehow possible to do cluster analysis around one specific ID that includes multiple rows?
Any help/direction of unsupervised learning I should take is appreciated.
In short,
Learn a fixed-length embedding (representation) of each event;
Learn a way to combine a sequence of such embeddings into a single representation for each customer, then use your favorite unsupervised methods.
For (1), you can do it either manually or use an encoder/decoder;
For (2), there is a range of things you can do, ranging from simply averaging the embeddings of each event, to training an encoder-decoder to reconstruct the original sequence of events and taking the intermediate representation (the one the decoder uses to reconstruct the original sequence). A minimal sketch of the averaging variant follows.
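This sketch assumes each customer's session is already a list of fixed-length event embeddings (produced manually or by an encoder); the names and toy data below are illustrative, not from any specific library beyond NumPy and scikit-learn.

import numpy as np
from sklearn.cluster import KMeans

def customer_vector(event_embeddings):
    # Average a variable-length sequence of event embeddings into one vector.
    return np.mean(event_embeddings, axis=0)

# customer_sessions: customer ID -> chronological list of event embeddings
customer_sessions = {
    123: [np.random.rand(16) for _ in range(5)],   # toy data
    456: [np.random.rand(16) for _ in range(3)],
}

X = np.stack([customer_vector(events) for events in customer_sessions.values()])
labels = KMeans(n_clusters=2, n_init=10).fit_predict(X)  # one label per customer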
A good read on this topic (though a bit old; you now also have the option of Transformer Network):
Representations for Language: From Word Embeddings to Sentence Meanings

How to classify text with Knime

I'm trying to classify some data using knime with knime-labs deep learning plugin.
I have about 16,000 products in my DB, but there are only about 700 of them for which I know the category.
I'm trying to classify as many of them as possible using some DM (data mining) technique. I've downloaded some plugins for KNIME, so now I have some deep learning tools as well as some text tools.
Here is my workflow; I'll use it to explain what I'm doing:
I'm transforming the product name into a vector and then feeding it into the learner.
Then I train a DL4J Learner with DeepMLP (I don't really understand it all; it was the one that seemed to give me the best results). Then I try to apply the model to the same data set.
I thought I would get the result with the predicted classes, but I'm getting a column with output_activations that looks like it contains a pair of doubles. When sorting this column I get some related data close to each other, but I was expecting to get the classes.
Here is a screenshot of the result table, where you can see the output together with the input.
In the column selection it's using just the converted_document, and I selected des_categoria as the Label Column (learner node config). And in the Predictor node I checked "Append SoftMax Predicted Label?".
nom_produto is the text column that I'm trying to use to predict the des_categoria column, which is the product category.
I'm really a newbie at DM and DL. Any help with what I'm trying to do would be awesome. Also feel free to suggest some learning material about what I'm attempting to achieve.
PS: I also tried to apply it to the unclassified data (17,000 products), but I got the same result.
I won't answer with a workflow on this one because it is not going to be a simple one. However, be sure to find the text mining example on the KNIME server, i.e. the one that makes use of the bag of words approach.
The task
Product mapping to categories should be a straight-forward data mining task because the information that explains the target variable is available in a quasi-exhaustive manner. Depending on the number of categories to train though, there is a risk that you might need more than 700 instances to learn from.
Some resources
Here are some resources, only the first one being truly specialised in text mining:
Introduction on Information Retrieval, in particular chapter 13;
Data Science for Business is an excellent introduction to data mining, including text mining (chapter 10), also do not forget the chapter about similarity (chapter 6);
Machine Learning with R has the advantage of being accessible enough (chapter 4 provides an example of text classification with R code).
Preprocessing
First, you will have to preprocess your product labels a bit. Use KNIME's text analytics preprocessing nodes for that purpose, that is after you've transformed the product labels with Strings to Document:
Case Convert, Punctuation Erasure and Snowball Stemmer;
you probably won't need Stop Word Filter, however, there may be quasi-stop words such as "product", which you may need to remove manually with Dictionary Filter;
Be careful not to use any of the following without testing their impact first: N Chars Filter (g may be a useful word), Number Filter (numbers may indicate quantities, which may be useful for classification).
Should you encounter any trouble with the relevant nodes (e.g. Punctuation Erasure can be amazingly tricky thanks to the tokenizer), you can always apply String Manipulation with regex before converting the Strings to Document.
Keep it short and simple: the lookup table
You could build a lookup table based on the 700 training instances. The book Data mining techniques as well as resource (2) present this approach in some detail. If any model performs any worse than the lookup table, you should abandon the model.
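As an illustration (plain Python rather than KNIME, with the column names borrowed from the question and toy rows standing in for the 700 labelled products), such a lookup table is nothing more than a normalized-name-to-category dictionary:

import pandas as pd

# Toy stand-in for the 700 labelled products.
train = pd.DataFrame({
    "nom_produto": ["Arroz Branco 1kg", "Feijao Preto 500g"],
    "des_categoria": ["Grains", "Grains"],
})

def normalize(name):
    return name.lower().strip()

lookup = {normalize(n): c
          for n, c in zip(train["nom_produto"], train["des_categoria"])}

def predict_category(name, default=None):
    # Known product names get their recorded category; unseen names get a
    # default (e.g. the majority class).
    return lookup.get(normalize(name), default)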
Nearest neighbors
Neural networks are probably overkill for this task.
Start with a K Nearest Neighbor node (applying a string distance such as Cosine, Levenshtein or Jaro-Winkler). This approach requires the least amount of data wrangling. At the very least, it will provide an excellent baseline model, so it is most definitely worth a shot.
You'll need to tune the parameter k and to experiment with the distance types. The Parameter Optimization Loop pair will help you with optimizing k; you can include a Cross-Validation meta node inside of the said loop to obtain an estimate of the expected performance for a given k, instead of only one point estimate per value of k. Use Cohen's Kappa as an optimization criterion, as proposed by resource (3) and available via the Scorer node.
After the parameter tuning, you'll have to evaluate the relevance of your model using yet another Cross-Validation meta node, then follow up with a Loop pair including Scorer to calculate the descriptives on the performance metric(s) per iteration, and finally use Statistics. Kappa is a convenient metric for this task because the target variable consists of many product categories.
Don't forget to test its performance against the lookup table.
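For readers who prefer to prototype outside KNIME, a rough scikit-learn equivalent of this k-NN baseline might look as follows; it is only an illustration (it uses cosine distance over TF-IDF vectors rather than the whole-string distances named above, and tunes k by cross-validated Cohen's kappa):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import cohen_kappa_score, make_scorer

pipeline = make_pipeline(
    TfidfVectorizer(),                      # product name -> sparse vector
    KNeighborsClassifier(metric="cosine"),  # cosine distance between vectors
)

search = GridSearchCV(
    pipeline,
    param_grid={"kneighborsclassifier__n_neighbors": [1, 3, 5, 7, 9]},
    scoring=make_scorer(cohen_kappa_score),  # Kappa copes well with many classes
    cv=5,                                    # cross-validation, as in Test & Score
)
# search.fit(nom_produto_list, des_categoria_list)  # the 700 labelled products
# print(search.best_params_, search.best_score_)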
What next?
Should the lookup table or k-NN work well for you, then there's nothing else to add.
Should any of those approaches fail, you might want to analyse the precise cases on which they fail. In addition, the training set size may be too low, so you could manually classify another few hundred or thousand instances.
If, after increasing the training set size, you are still dealing with a bad model, you can try the bag-of-words approach together with a Naive Bayes classifier (see chapter 13 of the Information Retrieval reference). There is no room here to elaborate on the bag-of-words approach and Naive Bayes, but you'll find the resources above useful for that purpose.
One last note. Personally, I find KNIME's Naive Bayes node to perform poorly, probably because it does not implement Laplace smoothing. However, KNIME's R Learner and R Predictor nodes will allow you to use R's e1071 package, as demonstrated by resource (3).
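As an illustration of that last point (in Python/scikit-learn rather than KNIME or R, so purely a sketch), a bag-of-words Naive Bayes baseline with Laplace smoothing is only a few lines:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

nb_model = make_pipeline(
    CountVectorizer(),         # bag of words over the product names
    MultinomialNB(alpha=1.0),  # alpha=1.0 is Laplace smoothing
)
# nb_model.fit(nom_produto_train, des_categoria_train)
# predictions = nb_model.predict(nom_produto_new)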

In the Orange data mining toolkit, how do I specify groups for cross-validation?

I'm using the Orange GUI, and trying to perform cross-validation. My data has 8 different groups (specified by a variable in the input data), and I'd like each fold to hold out a different group. Is this possible to do using Orange? I can select the number of folds for cross-validation, but I don't see any way of determining which data is in each one.
Cross-validation does random sampling. I don't think what you seek is possible out of the box.
If you really want to have it honor the splits you made beforehand (according to some input variable), and you aren't afraid of some manual labor, you can use the Select Rows widget to select the rows of one group (i.e. Matching Data), pass that into Test & Score as Test Data, and have all the rest of the data (i.e. Unmatched Data) as Training Data. This way, you get the cross-validation for a single fold (group). Repeat, and finally average, to obtain results for all folds.
If you know some Python, there's always the Orange scripting layer you can fall back to.
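For example, scikit-learn's LeaveOneGroupOut implements exactly this kind of grouped cross-validation (one fold per group). The sketch below is plain scikit-learn called from a script, not the Orange widget API, and uses toy data and a toy model:

import numpy as np
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score
from sklearn.linear_model import LogisticRegression

X = np.random.rand(80, 5)              # toy features
y = np.random.randint(0, 2, size=80)   # toy labels
groups = np.repeat(np.arange(8), 10)   # the 8 groups from the input variable

scores = cross_val_score(LogisticRegression(), X, y,
                         groups=groups, cv=LeaveOneGroupOut())
print(scores.mean())                   # average over the 8 folds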
