I need to train a CNN that will take 1-2 days to train on a remotely accessed GPU server.
Will I simply need to leave my laptop on overnight for the training to be complete or is there a way to save the state of the training and resume from there the next day?
(Implementation in pytorch)
If you need to keep training the model that you are about to save, you need to save more than just the model. You also need to save the state of the optimizer, epochs, score, etc. You would do it like this:
state = {
    'epoch': epoch,
    'state_dict': model.state_dict(),
    'optimizer': optimizer.state_dict(),
    # ...
}
torch.save(state, filepath)
To resume training you would do things like: state = torch.load(filepath), and then, to restore the state of each individual object, something like this:
model.load_state_dict(state['state_dict'])
optimizer.load_state_dict(state['optimizer'])
Since you are resuming training, DO NOT call model.eval() after restoring the state when loading.
To read more about this or see actual examples: https://www.programcreek.com/python/example/101175/torch.save
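As a concrete illustration, here is a minimal sketch of that save/resume pattern; the function names and the checkpoint path are placeholders of mine, not something taken from the question:
import torch

checkpoint_path = 'checkpoint.pth'  # hypothetical path

def save_checkpoint(model, optimizer, epoch, path=checkpoint_path):
    # Bundle everything needed to resume: weights, optimizer state, epoch counter.
    state = {
        'epoch': epoch,
        'state_dict': model.state_dict(),
        'optimizer': optimizer.state_dict(),
    }
    torch.save(state, path)

def load_checkpoint(model, optimizer, path=checkpoint_path):
    state = torch.load(path)
    model.load_state_dict(state['state_dict'])
    optimizer.load_state_dict(state['optimizer'])
    # Resume from the next epoch; do NOT call model.eval() here,
    # since training will continue.
    return state['epoch'] + 1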
I assume you ssh into your remote server. When training the model by running your script, say, $ python train.py, simply prepend nohup:
$ nohup python train.py
This tells your process to ignore the hangup signal sent when you exit the ssh session and shut down your laptop (by default, nohup writes the script's output to a file called nohup.out).
I trained multiple models with different configurations for a custom hyperparameter search. I use pytorch_lightning and its logging (TensorboardLogger).
When running my training script after Task.init(), ClearML auto-creates a Task and connects the logger output to the server.
For each training stage (train, val and test) I log the following scalars at each epoch: loss, acc and iou.
When I have multiple configurations, e.g. networkA and networkB, the first training logs its values to loss, acc and iou, but the second logs to networkB:loss, networkB:acc and networkB:iou. This makes the values incomparable.
My training loop with Task initialization looks like this:
names = ['networkA', 'networkB']
for name in names:
    task = Task.init(project_name="NetworkProject", task_name=name)
    pl_train(name)
    task.close()
The method pl_train is a wrapper around the whole training with PyTorch Lightning. No ClearML code is inside this method.
Do you have any hint on how to properly use a loop in a script with completely separated tasks?
Edit: ClearML version was 0.17.4. Issue is fixed in main branch.
Disclaimer: I'm part of the ClearML (formerly Trains) team.
pytorch_lightning creates a new TensorBoard logger for each experiment. When ClearML logs the TB scalars and detects the same scalar being re-sent, it adds a prefix so that reporting the same metric does not overwrite the previous one. A good example would be reporting the loss scalar in the training phase vs the validation phase (producing "loss" and "validation:loss").
It might be that the task.close() call does not clear the previous logs, so ClearML "thinks" this is the same experiment and hence adds the networkB prefix to the loss. As long as you close the Task after training is completed, you should have all experiments logging with the same metric/variant (title/series). I suggest opening a GitHub issue; this should probably be considered a bug.
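For what it's worth, the loop pattern described above (one fully closed Task per configuration) would look roughly like this; the reuse_last_task_id=False argument is an assumption on my part that may help keep runs separate, it was not part of the original snippet:
from clearml import Task

def pl_train(name):
    # Placeholder for the Lightning training wrapper from the question.
    ...

names = ['networkA', 'networkB']
for name in names:
    # One fresh Task per configuration. reuse_last_task_id=False is an
    # assumption: it asks ClearML not to reuse the previously created task,
    # which may help keep each run's scalars independent.
    task = Task.init(project_name="NetworkProject",
                     task_name=name,
                     reuse_last_task_id=False)
    pl_train(name)
    task.close()  # close before the next Task.init()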
I would like to use the yolo architecture for object detection. Before training the network with my custom data, I followed these steps to train it on the Pascal VOC data: https://pjreddie.com/darknet/yolo/
The instructions are very clear.
But after the final step
./darknet detector train cfg/voc.data cfg/yolo-voc.cfg darknet19_448.conv.23
darknet immediately stops training and announces that weights have been written to the backups/ directory.
At first I thought that the pretraining was simply too good and that the stopping criteria would be reached at once.
So I've used the ./darknet detect command with these weights on one of the test images data/dog. Nothing is found.
If I don't use any pretrained weights, the network does train.
I've edited cfg/yolo-voc.cfg to use
# Testing
#batch=1
#subdivisions=1
# Training
batch=32
subdivisions=8
Now the training process has been running for many hours and is keeping my GPU warm.
Is this the intended way to train darknet?
How can I use pretrained weights correctly, without the training just breaking off?
Is there any setting to create checkpoints, or to get an idea of the progress?
Adding -clear 1 at the end of your training command will clear the stats of how many images this model has seen in previous training. Then you can fine-tune your model on new data(set).
You can find more info about the usage in the function signature
void train_detector(char *datacfg, char *cfgfile, char *weightfile, int *gpus, int ngpus, int clear)
at https://github.com/pjreddie/darknet/blob/b13f67bfdd87434e141af532cdb5dc1b8369aa3b/examples/detector.c
I doubt that increasing the max number of iterations is a good idea, as the learning rate is usually tied to the current iteration number. We usually increase the max number of iterations when we want to resume a previous training task that ended because it reached the max number of iterations, but we believe that more iterations will give better results.
FYI, when you have a small dataset, training on it from scratch or from a classification network may not be a great idea. You may still want to re-use the weights from a detection network trained on a large dataset like COCO or ImageNet.
This is an old question so I hope you have your answer by now, but here is mine just in case it helps.
After working with darknet for about a month, I've run into most of the roadblocks that people have asked/posted about on forums. In your case, I'm pretty certain it's because the weights have already been trained for the max number of batches, so when the pre-trained weights were read in, darknet assumed training was done.
Relevant personal experience: when I used one of the pretrained weights files, it started from iteration 40101 and ran until 40200 before cutting off.
I would stick to training from scratch if you have custom data, but if you want to try the pre-trained weights again, you might find that changing max batches in the cfg file helps.
Also, if you are using AlexeyAB/darknet, there might be a problem with the -clear option,
in detector.c:
if (clear) *nets[k].seen = 0;
should really be:
if (clear) { *nets[k].seen = 0; *nets[k].cur_iteration = 0; }
otherwise the training loop will exit immediately.
Set the OpenCV flag in your darknet/Makefile to 0:
OPENCV=0
In Tensorflow, how can I save the weights and all other variables of the program after it has finished training? I would like to be able to use the model I trained later on. Thanks in advance.
You can define a saver object like this:
saver = tf.train.Saver(max_to_keep=5, keep_checkpoint_every_n_hours=1)
In this case, the saver is configured to keep the five most recent checkpoints and also to keep a checkpoint every hour during training.
The saver can then be called periodically in your main training loop with a call such as the following.
sess = tf.Session()
...
# Save the model every 100 iterations
if step % 100 == 0:
    saver.save(sess, "./model", global_step=step)
In this example the saver writes a checkpoint with the filename prefix ./model every 100 training steps. The optional global_step parameter appends the step value to the checkpoint filenames.
The model weights and other values may be restored at a later time for additional training or inference by the following:
saver.restore(sess, path.model_checkpoint_path)
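To tie the pieces together, here is a minimal, self-contained sketch using the TF1-style API; the toy variable, the checkpoint directory and the step counts are placeholders of mine, not taken from the answer above:
import tensorflow as tf  # TF1-style API (tf.compat.v1 in TF2)

# A tiny graph: one trainable variable so there is something to checkpoint.
w = tf.Variable(0.0, name="w")
train_op = tf.assign_add(w, 1.0)

saver = tf.train.Saver(max_to_keep=5, keep_checkpoint_every_n_hours=1)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(1, 501):
        sess.run(train_op)
        if step % 100 == 0:
            # Writes files named model-<step>.* in the current directory.
            saver.save(sess, "./model", global_step=step)

# Later: restore the most recent checkpoint for inference or further training.
with tf.Session() as sess:
    latest = tf.train.latest_checkpoint(".")
    saver.restore(sess, latest)
    print(sess.run(w))  # value restored from the checkpoint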
There are a variety of other useful variants and options. A good place to start learning about them is the TF how-to on variable creation, storage and retrieval here
I ran the demo TensorFlow MNIST model (in models/image/mnist) by
python -m tensorflow.models.image.mnist.convolutional
Does it mean that after the model completes training, the parameters/weights are automatically stored on secondary storage? Or do we have to edit the code to include "saver" functions for parameters to be stored?
No they are not automatically saved. Everything is in memory. You have to explicitly add a saver function to store your model to a secondary storage.
First you create a saver operation
saver = tf.train.Saver(tf.all_variables())
Then you want to save your model as it progresses through the training process, usually every N steps. These intermediate saves are commonly named "checkpoints".
# Save the model checkpoint periodically.
if step % 1000 == 0:
    checkpoint_path = os.path.join('.train_dir', 'model.ckpt')
    saver.save(sess, checkpoint_path)
Then you can restore the model from the checkpoint:
saver.restore(sess, model_checkpoint_path)
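Since model_checkpoint_path has to be obtained from somewhere, a common way to look it up is tf.train.get_checkpoint_state; a short sketch, continuing with the sess and saver from the snippets above (the '.train_dir' directory simply mirrors the saving snippet):
# Look up the latest checkpoint recorded in the checkpoint directory.
ckpt = tf.train.get_checkpoint_state('.train_dir')
if ckpt and ckpt.model_checkpoint_path:
    saver.restore(sess, ckpt.model_checkpoint_path)
else:
    print('No checkpoint found')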
Take a look at tensorflow.models.image.cifar10 for a concrete example
Using Vowpal Wabbit and its Python interactor for active learning, I've got to the point of being able to send messages back and forth between the client and the server, but I am having problems with seeding.
When I seed the model with the current command:
python active_interactor.py --verbose -o labelled.txt --seed data.seed localhost 12345 unlabelled.txt
The interactor sends the examples to the server (and I know this because the server updates the models and the debug information is produced) but when it feeds the unlabelled examples and asks for a label as a response, the predictions are always 0.
My question is: is the model not being seeded? If not, why are the predictions always 0 even though there is a model?
It should be noted that the same data can be successfully used to create a passive model that gives non-0 predictions, so I do not think the problem is with the training data.
---UPDATE---
Upon looking at the tests, we went ahead and changed the vw server command to match the test, with two parameters in mind that had been left at their defaults beforehand, namely initial_t and -l.
vw -f final.model --active_learning --active_mellowness 0.000001 --daemon --port 12345 --initial_t 10 -l 10
Once doing this, predictions are produced. This also works when -l is at its default. We will now do a grid search to find the best possible parameters. One question though: what is the reason why low values of initial_t led to no predictions?