How can Tensorboard files be merged/combined or appended? - machine-learning

If I have multiple Tensorboard files, how can they be combined into a single Tensorboard file?
Say that in Keras model.fit() is called multiple times for a single model, for example in a typical GAN implementation:
for i in range(num_epochs):
    model.fit(epochs=1, callbacks=[TensorBoard()])
This produces a new Tensorboard file on every call, which is not useful. Is there a way to have Tensorboard append to an existing file, or to stop it producing a unique time-stamped file on each callback call?

It seems in tensorboard 2.3, you can access the tensorboard file's logged data and load it into a pandas dataframe. This tutorial outlines the approach:
https://www.tensorflow.org/tensorboard/dataframe_api
At the time of writing, I couldn't find a tensorboard version 2.3, but the module the tutorial relies upon - tensorboard.data.experimental.ExperimentFromDev() - seems to be present in tensorboard 2.2: https://github.com/tensorflow/tensorboard/blob/master/tensorboard/data/experimental/experiment_from_dev.py#L71
You could load existing data from several tensorboard files into dataframes and then combine those dataframes in the desired manner to write the combined dataframe to a new tensorboard file. I had considered this approach to address the problem of having multiple tensorboard files for different training runs but haven't tried it yet in my project.
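A minimal sketch of the combining step only, assuming each file's scalars have already been loaded into a long-form DataFrame with tag, step and value columns. The helper name combine_runs and the sample values are made up for illustration, and writing the result back to a Tensorboard file is not shown.

```python
import pandas as pd

def combine_runs(frames):
    """Concatenate per-file scalar frames into one continuous run.

    Hypothetical helper: it assumes every frame restarts its step counter
    at 0, so each frame is shifted to start right after the previous one.
    """
    shifted = []
    offset = 0
    for frame in frames:
        frame = frame.copy()
        frame["step"] = frame["step"] + offset
        offset = int(frame["step"].max()) + 1
        shifted.append(frame)
    combined = pd.concat(shifted, ignore_index=True)
    return combined.sort_values(["tag", "step"]).reset_index(drop=True)

# Two made-up frames standing in for two separate event files.
first = pd.DataFrame({"tag": ["loss", "loss"], "step": [0, 1], "value": [0.9, 0.7]})
second = pd.DataFrame({"tag": ["loss", "loss"], "step": [0, 1], "value": [0.6, 0.5]})

combined = combine_runs([first, second])
print(combined)
```

Shifting the steps is a design choice for the "one fit() call per file" case; for genuinely separate training runs you would probably keep the steps and add a run column instead.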

If anyone stumbles upon this question, you can use initial_epoch defined in model.fit()
Inside the for loop, initial_epoch can be changed every time model.fit() is called.

New link for the initial_epoch documentation: https://keras.io/api/models/model_training_apis/
Also note that epochs needs to change on every call as well:
for i in range(num_epochs):
    model.fit(epochs=i + 1, initial_epoch=i, callbacks=[TensorBoard()])
because, as the documentation says:
Note that in conjunction with initial_epoch, epochs is to be understood as "final epoch". The model is not trained for a number of iterations given by epochs, but merely until the epoch of index epochs is reached.
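To make the "final epoch" rule concrete, here is a small sketch that only computes the argument pairs, without calling Keras. The generator name epoch_windows is hypothetical; the point is that a call which should train exactly one epoch needs epochs = initial_epoch + 1.

```python
def epoch_windows(num_epochs):
    """Yield (initial_epoch, epochs) pairs for successive one-epoch fit() calls."""
    for i in range(num_epochs):
        # epochs is the index to stop at, not a count, so it is always one
        # more than the epoch the call starts from.
        yield i, i + 1

for initial_epoch, epochs in epoch_windows(3):
    print(f"model.fit(initial_epoch={initial_epoch}, epochs={epochs})")
```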

Related

ClearML multiple tasks in single script changes logged value names

I trained multiple models with different configuration for a custom hyperparameter search. I use pytorch_lightning and its logging (TensorboardLogger).
When running my training script after Task.init() ClearML auto-creates a Task and connects the logger output to the server.
For each training stage (train, val and test) I log the following scalars at each epoch: loss, acc and iou.
When I have multiple configurations, e.g. networkA and networkB, the first training logs its values to loss, acc and iou, but the second logs to networkB:loss, networkB:acc and networkB:iou. This makes the values incomparable.
My training loop with Task initialization looks like this:
names = ['networkA', 'networkB']
for name in names:
    task = Task.init(project_name="NetworkProject", task_name=name)
    pl_train(name)
    task.close()
The method pl_train is a wrapper for the whole training with PyTorch Lightning. No ClearML code is inside this method.
Do you have any hints on how to properly use a loop in a script with completely separate tasks?
Edit: ClearML version was 0.17.4. Issue is fixed in main branch.
Disclaimer: I'm part of the ClearML (formerly Trains) team.
pytorch_lightning is creating a new Tensorboard for each experiment. When ClearML logs the TB scalars, and it captures the same scalar being re-sent again, it adds a prefix so if you are reporting the same metric it will not overwrite the previous one. A good example would be reporting loss scalar in the training phase vs validation phase (producing "loss" and "validation:loss"). It might be the task.close() call does not clear the previous logs, so it "thinks" this is the same experiment, hence adding the prefix networkB to the loss. As long as you are closing the Task after training is completed you should have all experiments log with the same metric/variant (title/series). I suggest opening a GitHub issue, this should probably be considered a bug.

how to get Tensorflow session from only keras .h5 file without session

The motivation behind this question is I had saved a Keras model using Matterport's MaskRCNN and in the tf.keras.callbacks.ModelCheckpoint() had very explicitly set the save_weights_only argument to False, so that the entire model would be saved (not just the weights).
Turns out there's a bug in the ModelCheckpoint() callback where it sometimes does not save the full model.
This is obviously a problem when you go to load the model after closing your TF session, as the Graph, architecture, and optimizer state are gone, making it hard (if not impossible) to reload that saved model.
Therefore, I am asking whether it is possible to somehow extract the TF session retroactively, from just the .h5 weights file, after the session has closed (resulting from, for example, your Notebook kernel crashing).
Not much code to go on, but there it is:
Given a .h5 file that was saved after each epoch of training a model in Keras, is it possible to extract the Graph session from that .h5 file, and if so, how?
I have several models saved in .h5 format but never called tf.get_session() during the saving of the model weights in h5 format.
with tf.Session() as sess:
    # how to load this model using Tensorflow?
TF 2.0 makes this a cinch, but how to solve this on Tensorflow version 1.14?
The end goal of this is to take a model saved with Keras as a .h5 file and do inference with it on Tensorflow Serving, which needs, to my knowledge, a protobuf file in .pb format.
https://medium.com/#pipidog/how-to-convert-your-keras-models-to-tensorflow-e471400b886a
I've tried keras_to_tensorflow:
https://github.com/amir-abdi/keras_to_tensorflow
The code to convert ModelCheckPoint saved in .h5 format to .pb format is shown below:
import tensorflow as tf
tf.keras.backend.set_learning_phase(0)  # Ignore dropout at inference
model = tf.keras.models.load_model('./model.h5')
# The export path contains the name and the version of the model
export_path = './PlanetModel/1'
# Fetch the Keras session and save the model
# The signature definition is defined by the input and output tensors
# And stored with the default serving key
with tf.keras.backend.get_session() as sess:
    tf.saved_model.simple_save(
        sess,
        export_path,
        inputs={'input_image': model.input},
        outputs={t.name: t for t in model.outputs})
For more information, please refer to this article.
For other ways to do it, please refer to this Stack Overflow answer.

model.predict_classes vs model.predict_generator in keras

I understand that predict_generator outputs probabilities. To get the class, I just then find the index for the greatest probability and that will be the most probable class. However I find that after doing this, I get a different output than if I were to call predict_classes. I do not understand why. Can someone explain this please?
Generators in Keras use glob to list the class folders, which are sorted alphabetically. You can get the classes used during training with:
import json
# save classes to JSON
class_json = json.dumps(train_generator.class_indices)
with open("class.json", "w") as class_file:
    class_file.write(class_json)
The samples are shuffled within the batch generator, so when fit_generator or evaluate_generator requests a batch, random samples are returned.
Another possibility if this is being done on images is not to use rescale=1./255 in ImageDataGenerator as mentioned in https://github.com/fchollet/keras/issues/3477
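As a rough sketch of the "index of the greatest probability" step, the snippet below turns rows of probabilities into class names by inverting a class_indices-style mapping. The helper name predicted_class_names and the sample mapping and probabilities are made up; with a real generator the rows only line up with the files if shuffling is turned off for prediction.

```python
def predicted_class_names(probabilities, class_indices):
    """Map each row of class probabilities to the name of its most likely class."""
    # class_indices maps name -> index, so invert it to look names up by index.
    index_to_name = {index: name for name, index in class_indices.items()}
    names = []
    for row in probabilities:
        best = max(range(len(row)), key=row.__getitem__)  # plain-Python argmax
        names.append(index_to_name[best])
    return names

# Made-up stand-ins for train_generator.class_indices and the predict output.
class_indices = {"cat": 0, "dog": 1, "horse": 2}
probabilities = [[0.1, 0.7, 0.2], [0.8, 0.1, 0.1]]

print(predicted_class_names(probabilities, class_indices))
```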
Hope that helps!

Tensorflow RNN example limited to fixed batch size?

When looking at the RNN example at Tensorflow I'm having an issue with how the initial state is constructed. At build time, the graph is limited to handling input of a single batch size. This is an issue for me since I want to be able to feed in a single example and get a prediction for that single example.
The part of the code that restricts this is:
initial_state = state = tf.zeros([batch_size, lstm.state_size])
So my question is how can I expand the example so that I can use a variable batch size so that I can use the same model for training with batch size and then use single example for predictions?
This is how I'm doing this. You can pass the batch_size as a variable like this:
batch_size = tf.placeholder(tf.int32)
init_state = cell.zero_state(batch_size, tf.float32)
where cell is one of the RNN cells (BasicLSTMCell, GRUCell, MultiRNNCell, etc). However, if you're preserving the state over multiple batches that won't work, since its size has to be constant.
The Tensorflow text generation tutorial explains how to do this (now TF 2.0). It seems that the batch_size becomes part of the built model, so you have to rebuild/reload from the saved weights with a new batch size:
https://www.tensorflow.org/tutorials/text/text_generation#restore_the_latest_checkpoint
To keep this prediction step simple, use a batch size of 1.
Because of the way the RNN state is passed from timestep to timestep,
the model only accepts a fixed batch size once built.
To run the model with a different batch_size, we need to rebuild the
model and restore the weights from the checkpoint.
model = build_model(vocab_size, embedding_dim, rnn_units, batch_size=1)
model.load_weights(tf.train.latest_checkpoint(checkpoint_dir))
model.build(tf.TensorShape([1, None]))
model.summary()
I don't know for sure why you have to do this, but I always assumed it's because batching for recurrent layers requires management of multiple, parallel hidden state pipelines, so it preallocates them.
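One way to picture that guess is a toy recurrent step in plain Python. This is not TensorFlow code and makes no claim about its internals; zero_state and rnn_step here are hypothetical stand-ins that only show the state holding one row per example, so its first dimension is tied to the batch size.

```python
import math

def zero_state(batch_size, state_size):
    """Return a batch_size x state_size grid of zeros, one row per example."""
    return [[0.0] * state_size for _ in range(batch_size)]

def rnn_step(inputs, state, w_in=0.5, w_rec=0.5):
    """Toy recurrent update: each example only ever touches its own state row."""
    if len(inputs) != len(state):
        raise ValueError("state was built for a different batch size")
    return [
        [math.tanh(w_in * x + w_rec * h) for h in row]
        for x, row in zip(inputs, state)
    ]

# Training-style batch of four examples, then a single example for prediction.
train_state = rnn_step([1.0, 2.0, 3.0, 4.0], zero_state(4, 3))
single_state = rnn_step([1.0], zero_state(1, 3))
print(len(train_state), len(single_state))
```

The weights (here the two scalars) do not depend on the batch size, only the state does, which is consistent with being able to reload the same weights into a model rebuilt with batch_size=1.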

Scikit and Pandas: Fitting Large Data

How do I use scikit-learn to train a model on a large CSV dataset (~75MB) without running into memory problems?
I'm using IPython notebook as the programming environment, and pandas+sklearn packages to analyze data from kaggle's digit recognizer tutorial.
The data is available on the webpage, along with a link to my code and the error message.
KNeighborsClassifier is used for the prediction.
Problem:
"MemoryError" occurs when loading the large dataset using the read_csv
function. To bypass this problem temporarily, I have to restart the
kernel, after which read_csv loads the file successfully, but
the same error occurs when I run the same cell again.
When the read_csv function loads the file successfully, after making changes to the dataframe, I can pass the features and labels to the KNeighborsClassifier's fit() function. At this point, similar memory error occurs.
I tried the following:
Iterate through the CSV file in chunks, and fit the data accordingly, but the problem is that the predictive model is overwritten every time for a chunk of data.
What do you think I can do to successfully train my model without running into memory problems?
Note: when you load the data with pandas it will create a DataFrame object where each column has a homogeneous datatype for all the rows, but two columns can have distinct datatypes (e.g. integer, dates, strings).
When you pass a DataFrame instance to a scikit-learn model it will first allocate a homogeneous 2D numpy array with dtype np.float32 or np.float64 (depending on the implementation of the models). At this point you will have 2 copies of your dataset in memory.
To avoid this you could write / reuse a CSV parser that directly allocates the data in the internal format / dtype expected by the scikit-learn model. You can try numpy.loadtxt for instance (have a look at the docstring for the parameters).
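A minimal sketch of the numpy.loadtxt suggestion, assuming a CSV with a header row and the label in the first column. The inline sample data is made up, and whether float32 or float64 is the right dtype depends on the estimator you use.

```python
import io

import numpy as np

# Made-up sample standing in for the real file; pass a path instead in practice.
csv_text = "label,pixel0,pixel1\n1,0,255\n0,12,34\n"

# Parse straight into a single float32 array, skipping the header row.
data = np.loadtxt(io.StringIO(csv_text), delimiter=",", skiprows=1, dtype=np.float32)

labels = data[:, 0].astype(np.int64)  # first column holds the class label
features = data[:, 1:]                # remaining columns are the features

print(features.dtype, features.shape, labels)
```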
Also if your data is very sparse (many zero values) it will be better to use a scipy.sparse data structure and a scikit-learn model that can deal with such an input format (check the docstrings to know). However the CSV format itself is not very well suited for sparse data and I am not sure there exists a direct CSV-to-scipy.sparse parser.
Edit: for reference, KNeighborsClassifier allocates a temporary distances array with shape (n_samples_predict, n_samples_train), which is very wasteful when only (n_samples_predict, n_neighbors) is needed instead. This issue can be tracked here:
https://github.com/scikit-learn/scikit-learn/issues/325
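To get a feel for the difference between the two shapes, here is a back-of-the-envelope calculation. The sample counts and n_neighbors are assumed for illustration (they are not taken from the question), and 8 bytes per entry assumes float64 distances.

```python
def array_bytes(rows, cols, itemsize=8):
    """Memory needed for a dense rows x cols array with itemsize bytes per entry."""
    return rows * cols * itemsize

# Assumed sizes, for illustration only.
n_samples_predict, n_samples_train, n_neighbors = 28_000, 42_000, 5

full = array_bytes(n_samples_predict, n_samples_train)   # all pairwise distances
needed = array_bytes(n_samples_predict, n_neighbors)     # only the k nearest

print(f"full: {full / 1e9:.1f} GB, needed: {needed / 1e6:.1f} MB")
```

Under these assumptions, predicting in smaller batches of rows keeps the temporary array proportionally smaller, since its size scales with the number of rows predicted at once.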
