Full training set used by dask_lightgbm? - dask

I'm reading over the implementation of the dask-lightgbm estimators (specifically, the _train_part function in dask_lightgbm.core.py), and I'm failing to see how the entirety of the training set gets used to fit the final estimator.
The _train_part function accepts a boolean argument return_model, and in the implementation of the train function (which uses client.submit to call _train_part on each worker), return_model is only true when the worker is the "master_worker" (which itself appears to be a randomly chosen Dask worker). Logically, each worker gets dispatched 1/n of the chunks of the overall training set (where n is the total number of workers), and then each worker trains its own independent model on its own subset of the training set. The return_model parameter controls whether each worker's model gets returned by _train_part, so _train_part returns None for every worker (and therefore every model) except one.
Code:
def _train_part(params, model_factory, list_of_parts, worker_addresses, return_model,
                local_listen_port=12400, time_out=120, **kwargs):
    network_params = build_network_params(worker_addresses, get_worker().address,
                                          local_listen_port, time_out)
    params.update(network_params)

    # Concatenate many parts into one
    parts = tuple(zip(*list_of_parts))
    data = concat(parts[0])
    label = concat(parts[1])
    weight = concat(parts[2]) if len(parts) == 3 else None

    try:
        model = model_factory(**params)
        model.fit(data, label, sample_weight=weight)
    finally:
        _safe_call(_LIB.LGBM_NetworkFree())

    return model if return_model else None
Is this not equivalent to training a non-distributed version of a lightgbm estimator on a 1/n subsample of the training set? Am I missing something? I feel like I am missing a part where either the workers' independent models get combined into one, or where a single estimator is getting updated with the individual trees learned by separate workers.
Thank you!

Ah, the answer is yes: dask_lightgbm uses all available training samples. Dask's responsibility is only to distribute the data across workers; LightGBM handles all distributed learning once its network parameters are set. Each worker is not training its own independent model - LightGBM trains a single model - but each worker ends up with a copy of it. For this reason, only the chosen worker returns the fitted estimator, and everyone else returns None.
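For intuition, here is a rough sketch (not the actual dask_lightgbm code) of the kind of network parameters that build_network_params assembles; machines, local_listen_port, num_machines and time_out are LightGBM's documented network parameters, while the address handling below is simplified for illustration:

def sketch_network_params(worker_hosts, local_listen_port=12400, time_out=120):
    # one host:port entry per participating worker, identical on every machine
    machines = ",".join(f"{host}:{local_listen_port}" for host in worker_hosts)
    return {
        "machines": machines,                   # full list of participating workers
        "local_listen_port": local_listen_port, # port this worker listens on
        "num_machines": len(worker_hosts),      # number of workers in the cluster
        "time_out": time_out,                   # socket time-out in minutes
    }

Once these are merged into params, LightGBM's own distributed learning takes over and all workers cooperatively fit the same model.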

Related

How to use pretrained weights of a model for initializing the weights in next iteration?

I have a model architecture. I have saved the entire model using torch.save() after some n iterations. I want to run another iteration of my code using the pre-trained weights of the model I saved previously.
Edit: I want the weight initialization for the new iteration to be done from the weights of the pretrained model.
Edit 2: Just to add, I don't plan to resume training. I intend to save the model and use it for a separate training run with the same parameters. Think of it like using a saved model with weights etc. for a larger run with more samples (i.e. a completely new training job).
Right now, I do something like:
# default_lr = 5
# default_weight_decay = 0.001
# model_io = the pretrained model
model = torch.load(model_io)
optim = torch.optim.Adam(model.parameters(), lr=default_lr, weight_decay=default_weight_decay)
loss_new = BCELoss()
epochs = default_epoch
...
training_loop():
    ...
    outputs = model(input)
    ...
# similarly for the test loop
Am I missing something? I have to train for a very large number of epochs on a huge number of samples, so I cannot afford to wait for the results and only then figure things out.
Thank you!
From the code that you have posted, I see that you are only loading the previous model parameters in order to restart your training from where you left off. This is not sufficient to restart your training correctly. Along with your model parameters (weights), you also need to save and load your optimizer state, especially when your choice of optimizer is Adam, which keeps running moment estimates for every weight that adapt the effective learning rate.
In order to smoothly restart training, I would do the following:
# For saving your model
state = {
    'model': model.state_dict(),
    'optimizer': optimizer.state_dict()
}
model_save_path = "Enter/your/model/path/here/model_name.pth"
torch.save(state, model_save_path)

# ------------------------------------------
# For loading your model
state = torch.load(model_save_path)
model = MyNetwork()
model.load_state_dict(state['model'])
optim = torch.optim.Adam(model.parameters(), lr=default_lr, weight_decay=default_weight_decay)
optim.load_state_dict(state['optimizer'])
Besides these, you may also want to save your learning rate if you are using a learning rate decay strategy, your best validation accuracy so far (which you may want for checkpointing purposes), and any other changeable parameter which might affect your training. But in most cases, saving and loading just the model weights and optimizer state should be sufficient.
EDIT: You may also want to look at the following answer, which explains in detail how you should save your model in different scenarios.
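For instance, a fuller checkpoint along these lines might look like the sketch below; scheduler, best_val_acc and epoch are hypothetical names for whatever you track in your own loop:

state = {
    'model': model.state_dict(),
    'optimizer': optimizer.state_dict(),
    'scheduler': scheduler.state_dict(),  # if you use a learning rate scheduler
    'best_val_acc': best_val_acc,         # for checkpointing on validation accuracy
    'epoch': epoch,                       # where to resume counting from
}
torch.save(state, model_save_path)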

How to use tf.metrics.accuracy?

I want to use tf.metrics.accuracy to track the accuracy of my predictions, but I am unsure of how to use the update_op (acc_update_op below) that the function returns:
accuracy, acc_update_op = tf.metrics.accuracy(labels, predictions)
I was thinking that adding it to tf.GraphKeys.UPDATE_OPS would make sense, but I am not sure how to do this.
tf.metrics.accuracy is one of the many streamed metric TensorFlow operations (another one of which is tf.metrics.recall). Upon creation, two variables (count and total) are created in order to accumulate all incoming results for one final outcome. The first returned value is a tensor for the calculation count / total. The second op returned is a stateful function which updates these variables. Streamed metric functions are useful when evaluating the performance of a classifier over multiple batches of data. A quick example of use:
# building phase
with tf.name_scope("streaming"):
    accuracy, acc_update_op = tf.metrics.accuracy(labels, predictions)

test_fetches = {
    'accuracy': accuracy,
    'acc_op': acc_update_op
}

# when testing the classifier
with tf.name_scope("streaming"):
    # clear counters for a fresh evaluation
    sess.run(tf.local_variables_initializer())

for _i in range(n_batches_in_test):
    fd = get_test_batch()
    outputs = sess.run(test_fetches, feed_dict=fd)

print("Accuracy:", outputs['accuracy'])
I was thinking that adding it to tf.GraphKeys.UPDATE_OPS would make sense, but I am not sure how to do this.
That would not be a good idea unless you are only using the UPDATE_OPS collection for testing purposes. Usually, the collection will already contain control operations for the training phase (such as updating the moving averages of batch normalization) that are not meant to be run alongside the validation phase. It may be best to either keep these ops in a new collection or add them to the fetch dictionary manually.
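As a minimal sketch of the first option, reusing the names from the snippet above (the collection name "streaming_updates" is made up for this example, not a TensorFlow built-in):

# register the streaming update op in its own collection, separate from UPDATE_OPS
accuracy, acc_update_op = tf.metrics.accuracy(labels, predictions)
tf.add_to_collection("streaming_updates", acc_update_op)

# at evaluation time, run only the streaming updates
sess.run(tf.local_variables_initializer())  # reset the count/total counters
for _ in range(n_batches_in_test):
    sess.run(tf.get_collection("streaming_updates"), feed_dict=get_test_batch())
print("Accuracy:", sess.run(accuracy))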

Keras LSTM: Injecting already-known *future* values into prediction

I've built an LSTM in Keras with the goal of predicting future values of a time series from a high-dimensional, time-indexed input.
However, there's a unique requirement: for certain time points in the future, we know with certainty what some values of the input series will be. For example:
model = SomeLSTM()
trained_model = model.train(train_data)
known_data = [(24, {feature: 2, val: 7.0}), (25, {feature: 2, val: 8.0})]
predictions = trained_model(look_ahead=48, known_data=known_data)
Which would train the model up to time t (the end of training), and predict forward 48 time periods from time t, but substituting known_data values for feature 2 at times 24 and 25.
How exactly can I explicitly inject this into the LSTM at some time?
For reference, here's the model:
model = Sequential()
model.add(LSTM(hidden, input_shape=(look_back, num_features)))
model.add(Dropout(dropout))
model.add(Dense(look_ahead))
model.add(Activation('linear'))
This may be a result of my unintuitive grasp of LSTMs, and I'd appreciate any clarification. I've dug into the Keras source code, and my first guess is to inject it right into the LSTM state variable, but I'm unsure how to do that at time t (or even whether that is correct).
I think a clean way of doing this is to introduce 2*look_ahead new features: for each 0 <= i < look_ahead, the 2*i-th feature is an indicator of whether the value at the i-th future time step is known, and the (2*i+1)-th feature is the value itself (0 if not known). You can then generate training data with these features so that your model learns to take the known values into account.
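A small NumPy sketch of building those extra features; known_data follows the (time_step, {feature, val}) shape from the question (with string keys for concreteness) and look_ahead is the prediction horizon:

import numpy as np

def known_value_features(known_data, look_ahead):
    extra = np.zeros(2 * look_ahead)
    for t, info in known_data:
        if t < look_ahead:
            extra[2 * t] = 1.0              # indicator: the value at step t is known
            extra[2 * t + 1] = info['val']  # the known value itself (stays 0 if unknown)
    return extra

# e.g. known_value_features([(24, {'feature': 2, 'val': 7.0})], look_ahead=48)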
I am not exactly sure what you are trying to do, but you could create your own layer to go at the end that sets the outputs to the known values, similar to how dropout sets random activations to zero. As a side note, I have had better results with pooling than with dropout, so maybe try switching that out and retraining. Here is a good guide on how to write a custom layer: https://www.tutorialspoint.com/keras/keras_customized_layer.htm
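A hedged sketch of such a layer using tf.keras; preds, known and mask are hypothetical tensors of shape (batch, look_ahead), and wiring them in requires the functional API rather than Sequential:

from tensorflow import keras

class OverrideKnown(keras.layers.Layer):
    # overwrite predictions with known values wherever mask == 1
    def call(self, inputs):
        preds, known, mask = inputs
        return mask * known + (1.0 - mask) * preds

# usage: outputs = OverrideKnown()([dense_out, known_values_input, known_mask_input])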

Tensorflow RNN example limited to fixed batch size?

When looking at the RNN example in TensorFlow, I'm having an issue with how the initial state is constructed. At build time the graph is limited to handling input of only one batch size. This is an issue for me since I want to be able to feed in a single example and get a prediction for that single example.
The part of the code that restricts this is:
initial_state = state = tf.zeros([batch_size, lstm.state_size])
So my question is: how can I change the example to use a variable batch size, so that I can train with a full batch size and then feed in a single example for predictions?
This is how I do it. You can pass batch_size as a placeholder, like this:
batch_size = tf.placeholder(tf.int32)
init_state = cell.zero_state(batch_size, tf.float32)
where cell is one of the RNN cells (BasicLSTMCell, BasicGRUCell, MultiRNNCell, etc.). However, if you're preserving the state over multiple batches, that won't work, since its size has to be constant.
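A brief usage sketch of that idea (train_op, prediction, x, train_batch and single_example are hypothetical names):

sess.run(train_op, feed_dict={x: train_batch, batch_size: 64})      # training with full batches
sess.run(prediction, feed_dict={x: single_example, batch_size: 1})  # predicting a single example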
The Tensorflow text generation tutorial explains how to do this (now TF 2.0). It seems that the batch_size becomes part of the built model, so you have to rebuild/reload from the saved weights with a new batch size:
https://www.tensorflow.org/tutorials/text/text_generation#restore_the_latest_checkpoint
To keep this prediction step simple, use a batch size of 1. Because of the way the RNN state is passed from timestep to timestep, the model only accepts a fixed batch size once built. To run the model with a different batch_size, we need to rebuild the model and restore the weights from the checkpoint.
model = build_model(vocab_size, embedding_dim, rnn_units, batch_size=1)
model.load_weights(tf.train.latest_checkpoint(checkpoint_dir))
model.build(tf.TensorShape([1, None]))
model.summary()
I don't know for sure why you have to do this, but I always assumed it's because batching for recurrent layers requires management of multiple, parallel hidden state pipelines, so it preallocates them.

Parameter selection and k-fold cross-validation

I have one dataset and need to do cross-validation, for example a 10-fold cross-validation, on the entire dataset. I would like to use a radial basis function (RBF) kernel with parameter selection (there are two parameters for an RBF kernel: C and gamma). Usually, people select the hyperparameters of an SVM using a dev set, and then apply the best hyperparameters from the dev set to the test set for evaluation. However, in my case, the original dataset is partitioned into 10 subsets, and each subset in turn is tested using the classifier trained on the remaining 9 subsets. It is obvious that we do not have fixed training and test data. How should I do hyperparameter selection in this case?
Is your data partitioned into exactly those 10 partitions for a specific reason? If not, you could concatenate/shuffle them together again and then do regular (repeated) cross-validation to perform a parameter grid search. For example, using 10 partitions and 10 repeats gives a total of 100 training and evaluation sets. These are used to train and evaluate all parameter sets, so you will get 100 results per parameter set you try. The average performance per parameter set can then be computed from those 100 results.
This process is already built into most ML tools, as in this short example in R using the caret library:
library(caret)
library(lattice)
library(doMC)
registerDoMC(3)

model <- train(x = iris[,1:4],
               y = iris[,5],
               method = 'svmRadial',
               preProcess = c('center', 'scale'),
               tuneGrid = expand.grid(C=3**(-3:3), sigma=3**(-3:3)), # all permutations of these parameters get evaluated
               trControl = trainControl(method = 'repeatedcv',
                                        number = 10,
                                        repeats = 10,
                                        returnResamp = 'all', # store results of all parameter sets on all partitions and repeats
                                        allowParallel = T))

# performance of the different parameter sets (e.g. average and standard deviation of performance)
print(model$results)

# visualization of the above
levelplot(x = Accuracy~C*sigma, data = model$results, col.regions=gray(100:0/100), scales=list(log=3))

# results of all parameter sets over all partitions and repeats; the metrics above are calculated from these
str(model$resample)
Once you have evaluated a grid of hyperparameters, you can choose a reasonable parameter set ("model selection"), e.g. by choosing a model that performs well while still being reasonably simple.
BTW: I would recommend repeated cross-validation over plain cross-validation if possible (possibly using more than 10 repeats, but the details depend on your problem); and as #christian-cerri already recommended, having an additional, unseen test set that is used to estimate the performance of your final model on new data is a good idea.
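For completeness, roughly the same repeated-CV grid search can be done in Python with scikit-learn; this is a sketch assuming scikit-learn is available, and the parameter ranges are illustrative, not tuned:

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10)  # 100 train/evaluation splits
grid = GridSearchCV(
    make_pipeline(StandardScaler(), SVC(kernel='rbf')),
    param_grid={'svc__C': [3.0**i for i in range(-3, 4)],
                'svc__gamma': [3.0**i for i in range(-3, 4)]},
    cv=cv)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)  # mean accuracy over the 100 splits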
