Tuning a stable baselines agent for a custom env - machine-learning

I made a game and I am trying to make it work with stable baselines. I've tried different algorithms
I've tried reading stable baselines documentation, but i can't figure out where to start tuning.
My game is here: https://github.com/AbdullahGheith/BlockPuzzleGym
And this is the code i ran to train it:
import gym
import blockpuzzlegym
from stable_baselines import PPO2
env = gym.make("BlockPuzzleGym-v0")
model = PPO2('MlpPolicy', env, verbose=1)
model.learn(250000)
for _ in range(10000):
action, _states = model.predict(obs)
obs, rewards, dones, info = env.step(action)
env.render()
I tried 25000000 timesteps, but it still wouldn't work.
As you might tell, i am new to ML, and this is my first project. Any indications are welcome.
I tried using the parameters the MiniGrid parameters from stable baselines zoo (without the env wrapper)

You can see that you used simple default MlpPolicy that includes two fully connected layers with size 64. How to implement your own network?
Here is a docs how to implement your own network. https://stable-baselines.readthedocs.io/en/master/guide/custom_policy.html
You should understand that any model in stable-baselines corresponds to following structure:
feature_extractor -> mlp_extractor -> policy\value networks.
So, you could add policy_kwargs to the model, where policy_kwargs is a dict with:
policy_kwargs = dict(
feature extractor=custom_extractor,
net_arch=[64, dict(vf=[16], pi=[16])]
)
Here you can see net_arch corresponds to structure of mlp_extractor and for policy\value nets. custom_extractor is a funciton that you can implement that would process your input data to fully connected layer at the end. You can implement here any structure of a model that you want. For example, there is a default cnn_extractor from library:
def nature_cnn(scaled_images, **kwargs):
activ = tf.nn.relu
layer_1 = activ(conv(scaled_images, 'c1', n_filters=32, filter_size=8, stride=4, init_scale=np.sqrt(2), **kwargs))
layer_2 = activ(conv(layer_1, 'c2', n_filters=64, filter_size=4, stride=2, init_scale=np.sqrt(2), **kwargs))
layer_3 = activ(conv(layer_2, 'c3', n_filters=64, filter_size=3, stride=1, init_scale=np.sqrt(2), **kwargs))
layer_3 = conv_to_fc(layer_3)
return activ(linear(layer_3, 'fc1', n_hidden=512, init_scale=np.sqrt(2)))

Related

Pass information between pipeline steps in sklearn

I am working on a simple text generation problem with LSTMs. To make the preprocessing more compact and reproducible, I decided to implement everything in sklearn fashion, using custom sklearn transformers, and the KerasClassifier from scikeras to wrap the neural network definition in a sklearn-type estimator.
It almost works but I can't figure out how to pass information from within a certain custom transformer on to the KerasClassifier estimator. More precisely, for the method that creates the neural network, I need the number of outputs as an argument; but this depends on the number of words in the fitted vocabulary - which is an information that is currently encapsulated in ModelEncoder class.
(Note that in order to get the current logic work, I had to slightly modify the default sklearn Pipeline class, as it wouldn't allow modifying and returning both X and y. In other words, the default sklearn Pipeline only allows feature transformations but not target transformations. Modifying the custom Pipeline class was explained in this StackOverflow post.)
Example data:
train_data = ['o by no means honest ventidius i gave it freely ever and theres none can truly say he gives if our betters play at that game we must not dare to imitate them faults that are rich are fair'
'but was not this nigh shore'
'impairing henry strengthening misproud york the common people swarm like summer flies and whither fly the gnats but to the sun'
'what while you were there'
'chill pick your teeth zir come no matter vor your foins'
'thanks dear isabel' 'come prick me bullcalf till he roar again'
'go some of you knock at the abbeygate and bid the lady abbess come to me'
'an twere not as good deed as drink to break the pate on thee i am a very villain'
'beaufort it is thy sovereign speaks to thee'
'but say lucetta now we are alone wouldst thou then counsel me to fall in love'
'for being a bawd for being a bawd'
'all blest secrets all you unpublishd virtues of the earth spring with my tears'
'what likelihood' 'o find him']
Full code:
# Modify the sklearn Pipeline class to allow it to return tuples and hence enable both X and y modifications. (Current default implementation in sklearn only allows
# feature transformations, i.e. transformations on X, but not on y.)
class Pipeline(pipeline.Pipeline):
def _fit(self, X, y=None, **fit_params_steps):
self.steps = list(self.steps)
self._validate_steps()
memory = check_memory(self.memory)
fit_transform_one_cached = memory.cache(pipeline._fit_transform_one)
for (step_idx, name, transformer) in self._iter(
with_final=False, filter_passthrough=False
):
if transformer is None or transformer == "passthrough":
with _print_elapsed_time("Pipeline", self._log_message(step_idx)):
continue
try:
# joblib >= 0.12
mem = memory.location
except AttributeError:
mem = memory.cachedir
finally:
cloned_transformer = clone(transformer) if mem else transformer
X, fitted_transformer = fit_transform_one_cached(
cloned_transformer,
X,
y,
None,
message_clsname="Pipeline",
message=self._log_message(step_idx),
**fit_params_steps[name],
)
if isinstance(X, tuple): ###### unpack X if is tuple X = (X,y)
X, y = X
self.steps[step_idx] = (name, fitted_transformer)
return X, y
def fit(self, X, y=None, **fit_params):
fit_params_steps = self._check_fit_params(**fit_params)
Xt = self._fit(X, y, **fit_params_steps)
if isinstance(Xt, tuple): ###### unpack X if is tuple X = (X,y)
Xt, y = Xt
with _print_elapsed_time("Pipeline", self._log_message(len(self.steps) - 1)):
if self._final_estimator != "passthrough":
fit_params_last_step = fit_params_steps[self.steps[-1][0]]
self._final_estimator.fit(Xt, y, **fit_params_last_step)
return self
class ModelTokenizer(TransformerMixin, BaseEstimator):
def __init__(self, max_len=100):
super().__init__()
self.max_len = max_len
def fit(self, X=None, y=None):
return self
def transform(self, X, y=None):
X_flattened = " ".join(X).split()
sequences = list()
for i in range(self.max_len+1, len(X_flattened)):
seq = X_flattened[i-self.max_len-1:i]
sequences.append(seq)
return sequences
class ModelEncoder(TransformerMixin, BaseEstimator):
def __init__(self):
super().__init__()
self.tokenizer = Tokenizer()
def fit(self, X=None, y=None):
self.tokenizer.fit_on_texts(X)
return self
def transform(self, X, y=None):
encoded_sequences = np.array(self.tokenizer.texts_to_sequences(X))
return (encoded_sequences[:,:-1], encoded_sequences[:,-1])
def create_nn(input_shape=(100,1), output_shape=None):
model = Sequential()
model.add(LSTM(64, input_shape=input_shape, return_sequences=True))
model.add(Dropout(0.3))
model.add(Flatten())
model.add(Dense(20, activation='relu'))
model.add(Dropout(0.3))
model.add(Dense(output_shape, activation='softmax'))
metrics_list = [tf.keras.metrics.BinaryAccuracy(name='accuracy')]
model.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = metrics_list)
return model
pipe = Pipeline([
('tokenizer', ModelTokenizer()),
('encoder', ModelEncoder()),
('model', KerasClassifier(build_fn=create_nn, epochs=10, output_shape=vocab_size)),
])
# Question: how to pass 'vocab_size'?
Imports:
from sklearn import pipeline
from sklearn.base import clone
from sklearn.utils import _print_elapsed_time
from sklearn.utils.validation import check_memory
from sklearn.base import BaseEstimator, TransformerMixin
from keras.preprocessing.text import Tokenizer
from scikeras.wrappers import KerasClassifier
KerasClassifier has its own internal transformer (see here, it is used to provide one-hot encoding and such) which has an API to pass metadata to the model (see here, that's how arguments such as n_outputs_ are passed into the model building function). Could you override that to pass this extra metadata to the model? It's stepping a bit outside of the Scikit-Learn API, but as you've noted the Scikit-Learn API doesn't have this functionality built in. If you want to propagate that information from a Transformer in your pipeline into SciKeras you could encode it into a feature and then use the above-mentioned hooks along with a custom encoder to remove that feature and convert it into metadata that can be passed into the model, but now you'd be really pushing the Scikit-Learn API.

TFF : every client do a pretrain function instead of build_federated_averaging_process

I would like that every client train his model with a function pretrainthat I wrote below :
def pretrain(model):
resnet_output = model.output
layer1 = tf.keras.layers.GlobalAveragePooling2D()(resnet_output)
layer2 = tf.keras.layers.Dense(units=zdim*2, activation='relu')(layer1)
model_output = tf.keras.layers.Dense(units=zdim)(layer2)
model = tf.keras.Model(model.input, model_output)
iterations_per_epoch = determine_iterations_per_epoch()
total_iterations = iterations_per_epoch*num_epochs
optimizer = tf.keras.optimizers.SGD(learning_rate=learning_rate, momentum=0.9)
checkpoint = tf.train.Checkpoint(step=tf.Variable(1), optimizer=optimizer, net=model)
manager = tf.train.CheckpointManager(checkpoint, pretrain_save_path, max_to_keep=10)
current_epoch = tf.cast(tf.floor(optimizer.iterations/iterations_per_epoch), tf.int64)
batch = client_data(0)
batch = client_data(0).batch(2)
epoch_loss = []
for (image1, image2) in batch:
loss, gradients = train_step(model, image1, image2)
epoch_loss.append(loss)
optimizer.apply_gradients(zip(gradients, model.trainable_variables))
# if tf.reduce_all(tf.equal(epoch, current_epoch+1)):
print("Loss after epoch {}: {}".format(current_epoch, sum(epoch_loss)/len(epoch_loss)))
#print("Learning rate: {}".format(learning_rate(optimizer.iterations)))
epoch_loss = []
current_epoch += 1
if current_epoch % 50 == 0:
save_path = manager.save()
print("Saved model for epoch {}: {}".format(current_epoch, save_path))
save_path = manager.save()
model.save("model.h5")
model.save_weights("saved_weights.h5")
But as we know that TFF has a predefined function :
iterative_process = tff.learning.build_federated_averaging_process(...)
So please, how can I proceed ? Thanks
There are a few ways that one could proceed along similar lines.
First it is important to note that TFF is functional--one can use things like writing to / reading from files to manage state (as TF allows this), but it is not in the interface TFF exposes to users--while something involving writing to / reading from a file (IE, manipulating state without passing it through function parameters and results), this should at best be considered an implementation detail. It's something that TFF does not encourage.
By slightly refactoring your code above, however, I think this kind of application can fit quite nicely in TFF's programming model. We will want to define something like:
#tff.tf_computation
#tf.function
def pretrain_client_model(model, client_dataset):
# perhaps do dataset processing you want...
for batch in client_dataset:
# do model training
return model.weights() # or some tensor structure representing the trained model weights
Once your implementation looks something like this, you will be able to wire it in to a custom iterative process. The canned function you mention (build_federated_averaging_process) really just constructs an instance of tff.templates.IterativeProcess; you are always, however, free to write your own instance of this class.
Several tutorials take us through this process, this probably being the simplest. For a finished code example of a standalone iterative process implementation, see simple_fedavg.py.

FedProx with TensorFlow Federated

Would anyone know how to implement the FedProx optimisation algorithm with TensorFlow Federated? The only implementation that seems to be available online was developed directly with TensorFlow. A TFF implementation would enable an easier comparison with experiments that utilise FedAvg which the framework supports.
This is the link to the FedProx repo: https://github.com/litian96/FedProx
Link to the paper: https://arxiv.org/abs/1812.06127
At this moment, FedProx implementation is not available. I agree it would be a valuable algorithm to have.
If you are interested in contributing FedProx, the best place to start would be simple_fedavg which is a minimal implementation of FedAvg meant as a starting point for extensions -- see the readme there for more details.
I think the major change would need to happen to the client_update method, where you would add the proximal term depending on model_weights and initial_weights to the loss computed in forward pass.
I provide below my implementation of FedProx in TFF. I am not 100% sure that this is the right implementation; I post this answer also for discussing on actual code example.
I tried to follow the suggestions in the Jacub Konecny's answer and comment.
Starting from the simple_fedavg (referring to the TFF Github repo), I just modified the client_update method, and specifically changing the input argument for calculating the gradient with the GradientTape, i.e. instaead of just passing in input the outputs.loss, the tape calculates the gradient considering the outputs.loss + proximal_term previosuly (and iteratively) calculated.
#tf.function
def client_update(model, dataset, server_message, client_optimizer):
"""Performans client local training of "model" on "dataset".Args:
model: A "tff.learning.Model".
dataset: A "tf.data.Dataset".
server_message: A "BroadcastMessage" from server.
client_optimizer: A "tf.keras.optimizers.Optimizer".
Returns:
A "ClientOutput".
"""
def difference_model_norm_2_square(global_model, local_model):
"""Calculates the squared l2 norm of a model difference (i.e.
local_model - global_model)
Args:
global_model: the model broadcast by the server
local_model: the current, in-training model
Returns: the squared norm
"""
model_difference = tf.nest.map_structure(lambda a, b: a - b,
local_model,
global_model)
squared_norm = tf.square(tf.linalg.global_norm(model_difference))
return squared_norm
model_weights = model.weights
initial_weights = server_message.model_weights
tf.nest.map_structure(lambda v, t: v.assign(t), model_weights,
initial_weights)
num_examples = tf.constant(0, dtype=tf.int32)
loss_sum = tf.constant(0, dtype=tf.float32)
# Explicit use `iter` for dataset is a trick that makes TFF more robust in
# GPU simulation and slightly more performant in the unconventional usage
# of large number of small datasets.
for batch in iter(dataset):
with tf.GradientTape() as tape:
outputs = model.forward_pass(batch)
# ------ FedProx ------
mu = tf.constant(0.2, dtype=tf.float32)
prox_term =(mu/2)*difference_model_norm_2_square(model_weights.trainable, initial_weights.trainable)
fedprox_loss = outputs.loss + prox_term
# Letting GradientTape dealing with the FedProx's loss
grads = tape.gradient(fedprox_loss, model_weights.trainable)
client_optimizer.apply_gradients(zip(grads, model_weights.trainable))
batch_size = tf.shape(batch['x'])[0]
num_examples += batch_size
loss_sum += outputs.loss * tf.cast(batch_size, tf.float32)
weights_delta = tf.nest.map_structure(lambda a, b: a - b,
model_weights.trainable,
initial_weights.trainable)
client_weight = tf.cast(num_examples, tf.float32)
return ClientOutput(weights_delta, client_weight, loss_sum / client_weight)

Using a stateful Keras model in pure TensorFlow

I have a stateful RNN model with several GRU layers that was created in Keras.
I have to run this model now from Java, so I dumped the model as protobuf, and I'm loading it from Java TensorFlow.
This model must be stateful because features will be fed one timestep at-a-time.
As far as I understand, in order to achieve statefulness in a TensorFlow model, I must somehow feed in the last state every time I execute the session runner, and also that the run would return the state after the execution.
Is there a way to output the state in the Keras model?
Is there a simpler way altogether to get a stateful Keras model to work as such using TensorFlow?
Many thanks
An alternative solution is to use the model.state_updates property of the keras model, and add it to the session.run call.
Here is a full example that illustrates this solutions with two lstms:
import tensorflow as tf
class SimpleLstmModel(tf.keras.Model):
""" Simple lstm model with two lstm """
def __init__(self, units=10, stateful=True):
super(SimpleLstmModel, self).__init__()
self.lstm_0 = tf.keras.layers.LSTM(units=units, stateful=stateful, return_sequences=True)
self.lstm_1 = tf.keras.layers.LSTM(units=units, stateful=stateful, return_sequences=True)
def call(self, inputs):
"""
:param inputs: [batch_size, seq_len, 1]
:return: output tensor
"""
x = self.lstm_0(inputs)
x = self.lstm_1(x)
return x
def main():
model = SimpleLstmModel(units=1, stateful=True)
x = tf.placeholder(shape=[1, 1, 1], dtype=tf.float32)
output = model(x)
sess = tf.Session()
sess.run(tf.initialize_all_variables())
res_at_step_1, _ = sess.run([output, model.state_updates], feed_dict={x: [[[0.1]]]})
print(res_at_step_1)
res_at_step_2, _ = sess.run([output, model.state_updates], feed_dict={x: [[[0.1]]]})
print(res_at_step_2)
if __name__ == "__main__":
main()
Which produces the following output:
[[[0.00168626]]]
[[[0.00434444]]]
and shows that the lstm state is preserved between batches.
If we set stateful to False, the output becomes:
[[[0.00033928]]]
[[[0.00033928]]]
Showing that the state is not reused.
ok, so I managed to solve this problem!
What worked for me was creating tf.identity tensors for not only the outputs, as is standard, but also for the state tensors.
In the Keras models, the state tensors can be found by doing:
model.updates
Which gives something like this:
[(<tf.Variable 'gru_1_1/Variable:0' shape=(1, 70) dtype=float32_ref>,
<tf.Tensor 'gru_1_1/while/Exit_2:0' shape=(1, 70) dtype=float32>),
(<tf.Variable 'gru_2_1/Variable:0' shape=(1, 70) dtype=float32_ref>,
<tf.Tensor 'gru_2_1/while/Exit_2:0' shape=(1, 70) dtype=float32>),
(<tf.Variable 'gru_3_1/Variable:0' shape=(1, 4) dtype=float32_ref>,
<tf.Tensor 'gru_3_1/while/Exit_2:0' shape=(1, 4) dtype=float32>)]
The 'Variable' is used for inputting the states, and the 'Exit' for outputs of the new states.
So I created tf.identity out of the 'Exit' tensors. I gave them meaningful names, e.g.:
tf.identity(state_variables[j], name='state'+str(j))
Where state_variables contained only the 'Exit' tensors
Then used the input variables (e.g. gru_1_1/Variable:0) to feed the model state from TensorFlow, and the identity variables I created out of the 'Exit' tensors were used to extract the new states after feeding the model at each timestep

TensorBoard - Plot training and validation losses on the same graph?

Is there a way to plot both the training losses and validation losses on the same graph?
It's easy to have two separate scalar summaries for each of them individually, but this puts them on separate graphs. If both are displayed in the same graph it's much easier to see the gap between them and whether or not they have begin to diverge due to overfitting.
Is there a built in way to do this? If not, a work around way? Thank you much!
The work-around I have been doing is to use two SummaryWriter with different log dir for training set and cross-validation set respectively. And you will see something like this:
Rather than displaying the two lines separately, you can instead plot the difference between validation and training losses as its own scalar summary to track the divergence.
This doesn't give as much information on a single plot (compared with adding two summaries), but it helps with being able to compare multiple runs (and not adding multiple summaries per run).
Just for anyone coming accross this via a search: The current best practice to achieve this goal is to just use the SummaryWriter.add_scalars method from torch.utils.tensorboard. From the docs:
from torch.utils.tensorboard import SummaryWriter
writer = SummaryWriter()
r = 5
for i in range(100):
writer.add_scalars('run_14h', {'xsinx':i*np.sin(i/r),
'xcosx':i*np.cos(i/r),
'tanx': np.tan(i/r)}, i)
writer.close()
# This call adds three values to the same scalar plot with the tag
# 'run_14h' in TensorBoard's scalar section.
Expected result:
Many thanks to niko for the tip on Custom Scalars.
I was confused by the official custom_scalar_demo.py because there's so much going on, and I had to study it for quite a while before I figured out how it worked.
To show exactly what needs to be done to create a custom scalar graph for an existing model, I put together the following complete example:
# + <
# We need these to make a custom protocol buffer to display custom scalars.
# See https://developers.google.com/protocol-buffers/
from tensorboard.plugins.custom_scalar import layout_pb2
from tensorboard.summary.v1 import custom_scalar_pb
# >
import tensorflow as tf
from time import time
import re
# Initial values
(x0, y0) = (-1, 1)
# This is useful only when re-running code (e.g. Jupyter).
tf.reset_default_graph()
# Set up variables.
x = tf.Variable(x0, name="X", dtype=tf.float64)
y = tf.Variable(y0, name="Y", dtype=tf.float64)
# Define loss function and give it a name.
loss = tf.square(x - 3*y) + tf.square(x+y)
loss = tf.identity(loss, name='my_loss')
# Define the op for performing gradient descent.
minimize_step_op = tf.train.GradientDescentOptimizer(0.092).minimize(loss)
# List quantities to summarize in a dictionary
# with (key, value) = (name, Tensor).
to_summarize = dict(
X = x,
Y_plus_2 = y + 2,
)
# Build scalar summaries corresponding to to_summarize.
# This should be done in a separate name scope to avoid name collisions
# between summaries and their respective tensors. The name scope also
# gives a title to a group of scalars in TensorBoard.
with tf.name_scope('scalar_summaries'):
my_var_summary_op = tf.summary.merge(
[tf.summary.scalar(name, var)
for name, var in to_summarize.items()
]
)
# + <
# This constructs the layout for the custom scalar, and specifies
# which scalars to plot.
layout_summary = custom_scalar_pb(
layout_pb2.Layout(category=[
layout_pb2.Category(
title='Custom scalar summary group',
chart=[
layout_pb2.Chart(
title='Custom scalar summary chart',
multiline=layout_pb2.MultilineChartContent(
# regex to select only summaries which
# are in "scalar_summaries" name scope:
tag=[r'^scalar_summaries\/']
)
)
])
])
)
# >
# Create session.
with tf.Session() as sess:
# Initialize session.
sess.run(tf.global_variables_initializer())
# Create writer.
with tf.summary.FileWriter(f'./logs/session_{int(time())}') as writer:
# Write the session graph.
writer.add_graph(sess.graph) # (not necessary for scalars)
# + <
# Define the layout for creating custom scalars in terms
# of the scalars.
writer.add_summary(layout_summary)
# >
# Main iteration loop.
for i in range(50):
current_summary = sess.run(my_var_summary_op)
writer.add_summary(current_summary, global_step=i)
writer.flush()
sess.run(minimize_step_op)
The above consists of an "original model" augmented by three blocks of code indicated by
# + <
[code to add custom scalars goes here]
# >
My "original model" has these scalars:
and this graph:
My modified model has the same scalars and graph, together with the following custom scalar:
This custom scalar chart is simply a layout which combines the original two scalar charts.
Unfortunately the resulting graph is hard to read because both values have the same color. (They are distinguished only by marker.) This is however consistent with TensorBoard's convention of having one color per log.
Explanation
The idea is as follows. You have some group of variables which you want to plot inside a single chart. As a prerequisite, TensorBoard should be plotting each variable individually under the "SCALARS" heading. (This is accomplished by creating a scalar summary for each variable, and then writing those summaries to the log. Nothing new here.)
To plot multiple variables in the same chart, we tell TensorBoard which of these summaries to group together. The specified summaries are then combined into a single chart under the "CUSTOM SCALARS" heading. We accomplish this by writing a "Layout" once at the beginning of the log. Once TensorBoard receives the layout, it automatically produces a combined chart under "CUSTOM SCALARS" as the ordinary "SCALARS" are updated.
Assuming that your "original model" is already sending your variables (as scalar summaries) to TensorBoard, the only modification necessary is to inject the layout before your main iteration loop starts. Each custom scalar chart selects which summaries to plot by means of a regular expression. Thus for each group of variables to be plotted together, it can be useful to place the variables' respective summaries into a separate name scope. (That way your regex can simply select all summaries under that name scope.)
Important Note: The op which generates the summary of a variable is distinct from the variable itself. For example, if I have a variable ns1/my_var, I can create a summary ns2/summary_op_for_myvar. The custom scalars chart layout cares only about the summary op, not the name or scope of the original variable.
Here is an example, creating two tf.summary.FileWriters which share the same root directory. Creating a tf.summary.scalar shared by the two tf.summary.FileWriters. At every time step, get the summary and update each tf.summary.FileWriter.
import os
import tqdm
import tensorflow as tf
def tb_test():
sess = tf.Session()
x = tf.placeholder(dtype=tf.float32)
summary = tf.summary.scalar('Values', x)
merged = tf.summary.merge_all()
sess.run(tf.global_variables_initializer())
writer_1 = tf.summary.FileWriter(os.path.join('tb_summary', 'train'))
writer_2 = tf.summary.FileWriter(os.path.join('tb_summary', 'eval'))
for i in tqdm.tqdm(range(200)):
# train
summary_1 = sess.run(merged, feed_dict={x: i-10})
writer_1.add_summary(summary_1, i)
# eval
summary_2 = sess.run(merged, feed_dict={x: i+10})
writer_2.add_summary(summary_2, i)
writer_1.close()
writer_2.close()
if __name__ == '__main__':
tb_test()
Here is the result:
The orange line shows the result of the evaluation stage, and correspondingly, the blue line illustrates the data of the training stage.
Also, there is a very useful post by TF team to which you can refer.
For completeness, since tensorboard 1.5.0 this is now possible.
You can use the custom scalars plugin. For this, you need to first make tensorboard layout configuration and write it to the event file. From the tensorboard example:
import tensorflow as tf
from tensorboard import summary
from tensorboard.plugins.custom_scalar import layout_pb2
# The layout has to be specified and written only once, not at every step
layout_summary = summary.custom_scalar_pb(layout_pb2.Layout(
category=[
layout_pb2.Category(
title='losses',
chart=[
layout_pb2.Chart(
title='losses',
multiline=layout_pb2.MultilineChartContent(
tag=[r'loss.*'],
)),
layout_pb2.Chart(
title='baz',
margin=layout_pb2.MarginChartContent(
series=[
layout_pb2.MarginChartContent.Series(
value='loss/baz/scalar_summary',
lower='baz_lower/baz/scalar_summary',
upper='baz_upper/baz/scalar_summary'),
],
)),
]),
layout_pb2.Category(
title='trig functions',
chart=[
layout_pb2.Chart(
title='wave trig functions',
multiline=layout_pb2.MultilineChartContent(
tag=[r'trigFunctions/cosine', r'trigFunctions/sine'],
)),
# The range of tangent is different. Let's give it its own chart.
layout_pb2.Chart(
title='tan',
multiline=layout_pb2.MultilineChartContent(
tag=[r'trigFunctions/tangent'],
)),
],
# This category we care less about. Let's make it initially closed.
closed=True),
]))
writer = tf.summary.FileWriter(".")
writer.add_summary(layout_summary)
# ...
# Add any summary data you want to the file
# ...
writer.close()
A Category is group of Charts. Each Chart corresponds to a single plot which displays several scalars together. The Chart can plot simple scalars (MultilineChartContent) or filled areas (MarginChartContent, e.g. when you want to plot the deviation of some value). The tag member of MultilineChartContent must be a list of regex-es which match the tags of the scalars that you want to group in the Chart. For more details check the proto definitions of the objects in https://github.com/tensorflow/tensorboard/blob/master/tensorboard/plugins/custom_scalar/layout.proto. Note that if you have several FileWriters writing to the same directory, you need to write the layout in only one of the files. Writing it to a separate file also works.
To view the data in TensorBoard, you need to open the Custom Scalars tab. Here is an example image of what to expect https://user-images.githubusercontent.com/4221553/32865784-840edf52-ca19-11e7-88bc-1806b1243e0d.png
The solution in PyTorch 1.5 with the approach of two writers:
import os
from torch.utils.tensorboard import SummaryWriter
LOG_DIR = "experiment_dir"
train_writer = SummaryWriter(os.path.join(LOG_DIR, "train"))
val_writer = SummaryWriter(os.path.join(LOG_DIR, "val"))
# while in the training loop
for k, v in train_losses.items()
train_writer.add_scalar(k, v, global_step)
# in the validation loop
for k, v in val_losses.items()
val_writer.add_scalar(k, v, global_step)
# at the end
train_writer.close()
val_writer.close()
Keys in the train_losses dict have to match those in the val_losses to be grouped on the same graph.
Tensorboard is really nice tool but by its declarative nature can make it difficult to get it to do exactly what you want.
I recommend you checkout Losswise (https://losswise.com) for plotting and keeping track of loss functions as an alternative to Tensorboard. With Losswise you specify exactly what should be graphed together:
import losswise
losswise.set_api_key("project api key")
session = losswise.Session(tag='my_special_lstm', max_iter=10)
loss_graph = session.graph('loss', kind='min')
# train an iteration of your model...
loss_graph.append(x, {'train_loss': train_loss, 'validation_loss': validation_loss})
# keep training model...
session.done()
And then you get something that looks like:
Notice how the data is fed to a particular graph explicitly via the loss_graph.append call, the data for which then appears in your project's dashboard.
In addition, for the above example Losswise would automatically generate a table with columns for min(training_loss) and min(validation_loss) so you can easily compare summary statistics across your experiments. Very useful for comparing results across a large number of experiments.
Please let me contribute with some code sample in the answer given by #Lifu Huang. First download the loger.py from here and then:
from logger import Logger
def train_model(parameters...):
N_EPOCHS = 15
# Set the logger
train_logger = Logger('./summaries/train_logs')
test_logger = Logger('./summaries/test_logs')
for epoch in range(N_EPOCHS):
# Code to get train_loss and test_loss
# ============ TensorBoard logging ============#
# Log the scalar values
train_info = {
'loss': train_loss,
}
test_info = {
'loss': test_loss,
}
for tag, value in train_info.items():
train_logger.scalar_summary(tag, value, step=epoch)
for tag, value in test_info.items():
test_logger.scalar_summary(tag, value, step=epoch)
Finally you run tensorboard --logdir=summaries/ --port=6006and you get:

Resources