ClearML multiple tasks in single script changes logged value names - devops

I trained multiple models with different configurations for a custom hyperparameter search. I use pytorch_lightning and its logging (TensorBoardLogger).
When running my training script after Task.init(), ClearML auto-creates a Task and connects the logger output to the server.
For each training stage (train, val and test) I log the following scalars at each epoch: loss, acc and iou.
When I have multiple configurations, e.g. networkA and networkB, the first training logs its values to loss, acc and iou, but the second logs to networkB:loss, networkB:acc and networkB:iou. This makes the values incomparable.
My training loop with Task initialization looks like this:
from clearml import Task

names = ['networkA', 'networkB']
for name in names:
    task = Task.init(project_name="NetworkProject", task_name=name)
    pl_train(name)
    task.close()
The method pl_train is a wrapper around the whole PyTorch Lightning training; no ClearML code is inside this method.
Do you have any hint on how to properly use a loop in a single script with completely separated tasks?
Edit: the ClearML version was 0.17.4. The issue is fixed in the main branch.

Disclaimer: I'm part of the ClearML (formerly Trains) team.
pytorch_lightning creates a new TensorBoard logger for each experiment. When ClearML captures the TB scalars and sees the same scalar being re-sent, it adds a prefix so that reporting the same metric does not overwrite the previous one. A good example is reporting a loss scalar in the training phase vs. the validation phase (producing "loss" and "validation:loss"). It might be that the task.close() call does not clear the previous logs, so ClearML "thinks" this is still the same experiment and adds the networkB prefix to the loss. As long as you close the Task after training is completed, all experiments should log with the same metric/variant (title/series). I suggest opening a GitHub issue; this should probably be considered a bug.
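A possible workaround until the fix is released (just a sketch, not an official ClearML recommendation) is to run each configuration in its own process, so both the pytorch_lightning TensorBoard state and the ClearML logger start completely fresh for every task:

from multiprocessing import Process

from clearml import Task

def run_one(name):
    # Fresh process: no TensorBoard or ClearML state can leak between runs
    task = Task.init(project_name="NetworkProject", task_name=name)
    pl_train(name)  # the same PyTorch Lightning training wrapper as above
    task.close()

if __name__ == "__main__":
    for name in ['networkA', 'networkB']:
        p = Process(target=run_one, args=(name,))
        p.start()
        p.join()  # run one task at a time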

Related

Sagemaker - Distributed training

I can't find documentation on the behavior of SageMaker when distributed training is not explicitly specified.
Specifically,
When SageMaker distributed data parallel is used via distribution='dataparallel', the documentation states that each instance processes different batches of data.
from sagemaker.tensorflow import TensorFlow

estimator = TensorFlow(
    role=role,
    py_version="py37",
    framework_version="2.4.1",
    # For training with multinode distributed training, set this count. Example: 2
    instance_count=4,
    instance_type="ml.p3.16xlarge",
    sagemaker_session=sagemaker_session,
    # Training using SMDataParallel Distributed Training Framework
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
)
I am not sure what happens when the distribution parameter is not specified but instance_count > 1, as below:
estimator = TensorFlow(
    py_version="py3",
    entry_point="mnist.py",
    role=role,
    framework_version="1.12.0",
    instance_count=4,
    instance_type="ml.m4.xlarge",
)
Thanks!
In the training code, if you initialize smdataparallel without the corresponding distribution setting you get a runtime error: RuntimeError: smdistributed.dataparallel cannot be used outside smddprun for distributed training launch.
The distribution parameters you pass in the estimator select the appropriate runner.
"I am not sure what happens when distribution parameter is not specified but instance_count > 1 as below" -> SageMaker will run your code on 4 machines. Unless you have code purpose-built for distributed computation this is useless (simple duplication).
It gets really interesting when:
you parse resource configuration (resourceconfig.json or via env variables) so that each machine is aware of its rank in the cluster, and you can write custom arbitrary distributed things
if you run the same code over input that is ShardedByS3Key, your code will run on different parts of your S3 data that is homogeneously spread over machines. Which makes SageMaker Training/Estimators a great place to run arbitrary shared-nothing distributed tasks such as file transformations and batch inference.
Having machines clustered together also allows you to launch open-source distributed training software like PyTorch DDP
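For illustration, here is a minimal sketch of how a training script can discover its place in the cluster from the resource configuration SageMaker mounts into every training container (the file path and keys follow the documented container contract; the channel name "training" and the fallback data directory are assumptions):

import json
import os

# SageMaker writes the cluster layout into every training container
RESOURCE_CONFIG = "/opt/ml/input/config/resourceconfig.json"

with open(RESOURCE_CONFIG) as f:
    cfg = json.load(f)

hosts = sorted(cfg["hosts"])          # e.g. ["algo-1", "algo-2", "algo-3", "algo-4"]
current_host = cfg["current_host"]    # e.g. "algo-2"
rank = hosts.index(current_host)
world_size = len(hosts)
print(f"worker {rank} of {world_size}")

# With ShardedByS3Key input, each instance only sees its own shard of the
# channel, so shard-local work needs no further coordination.
data_dir = os.environ.get("SM_CHANNEL_TRAINING", "/opt/ml/input/data/training")
my_files = os.listdir(data_dir)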

Full training set used by dask_lightgbm?

I'm reading over the implementation of the dask-lightgbm estimators (specifically, the _train_part function in dask_lightgbm/core.py), and I'm failing to see how the entirety of the training set gets used to fit the final estimator.
The _train_part function accepts the boolean argument return_model, and in the implementation of the train function (which uses client.submit to call _train_part on each worker), return_model is only true when the worker is the "master_worker" (which itself appears to be a randomly chosen Dask worker). Logically, each worker gets dispatched 1/n chunks of the overall training set, where n is the total number of workers, and then each worker trains its own independent model on its own subset of the training set. The return_model parameter controls whether each worker's model gets returned by _train_part, so _train_part returns None for every worker, and therefore every model, except one.
Code:
def _train_part(params, model_factory, list_of_parts, worker_addresses, return_model,
                local_listen_port=12400, time_out=120, **kwargs):
    network_params = build_network_params(worker_addresses, get_worker().address,
                                          local_listen_port, time_out)
    params.update(network_params)

    # Concatenate many parts into one
    parts = tuple(zip(*list_of_parts))
    data = concat(parts[0])
    label = concat(parts[1])
    weight = concat(parts[2]) if len(parts) == 3 else None

    try:
        model = model_factory(**params)
        model.fit(data, label, sample_weight=weight)
    finally:
        _safe_call(_LIB.LGBM_NetworkFree())

    return model if return_model else None
Is this not equivalent to training a non-distributed version of a lightgbm estimator on a 1/n subsample of the training set? Am I missing something? I feel like I am missing a part where either the workers' independent models get combined into one, or where a single estimator is getting updated with the individual trees learned by separate workers.
Thank you!
Ah, the answer is yes: dask_lightgbm uses all available training samples. Dask's responsibility is only to distribute the data across workers; LightGBM handles all distributed learning once its network parameters are set. It's not that each worker trains its own independent model. LightGBM trains a single model, and each worker ends up with a copy of it. For this reason, only the chosen worker returns the fitted estimator, and everyone else returns None.
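To make "network parameters" concrete, here is roughly what build_network_params injects into each worker's params before model.fit() is called (the keys are LightGBM's standard distributed-learning settings; the addresses below are made up):

# Roughly what gets merged into `params` on every worker:
network_params = {
    # every worker sees the full cluster, in the same order
    "machines": "10.0.0.1:12400,10.0.0.2:12400,10.0.0.3:12400",
    "num_machines": 3,
    # the port this particular worker listens on
    "local_listen_port": 12400,
    "time_out": 120,
}

# With these set, model.fit() on each worker participates in one joint
# training run: the workers exchange histogram/split information over the
# network and all end up holding copies of the same boosted model.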

k-fold cross validation in RankLib

I want to do 5-fold cross validation on the MQ2008 dataset. I am using RankLib to apply an ML algorithm to the dataset. I am confused about the kcv option RankLib provides for cross validation.
Command used:
java -jar RankLib.jar -ranker 0 -train train.txt -test test.txt -validate vali.txt -kcv 5
Here we are specifying different files for training, testing and validation, so how does it divide the data for 5-fold cross validation?
To do k-fold cross-validation using RankLib, you only need one dataset.
The program itself randomly divides the data into train, test and validation parts.
With 5-fold cross-validation, the program repeats the process 5 times and gives you the average of the 5 analyses as the final result.
You also need to choose a metric for your learning evaluation. See [ -metric2t <metric> ] on this How to use page.
For example, see the command below. I have only one dataset to feed my algorithm. I used NDCG@10 as my evaluation metric. Also, I used -kcvmd to save my models in a directory and -kcvmn to name the models.
java -jar RankLib-2.1-patched.jar -train trainingData.txt -ranker 8 -kcv 5 -kcvmd kcvModels/ -kcvmn txt -metric2t NDCG@10 -metric2T NDCG@10 -save Models/model.txt

Random Forest - Verbose and Speed

I am trying to build a random forest on a data set with 120k rows and 518 columns.
I have two questions:
1. I want to see the progress and logs of building the forest. Is the verbose option deprecated in the randomForest function?
2. How can I increase the speed? Right now it takes more than 6 hours to build a random forest with 1000 trees.
The H2O cluster is initialized with the settings below:
hadoop jar h2odriver.jar -Dmapreduce.job.queuename=devclinical \
    -output temp3p -nodes 20 -nthreads -1 -mapperXmx 32g

h2o.init(ip = h2o_ip, port = h2o_port, startH2O = FALSE,
         nthreads = -1, max_mem_size = "64G", min_mem_size = "4G")
Depending on the congestion of your network and the busyness level of your Hadoop nodes, it may finish faster with fewer nodes. For example, if 1 of the 20 nodes you requested is totally slammed by some other jobs, then that node may lag, and the work from that node is not rebalanced to other nodes.
A good way to see what is going on is to connect to H2O Flow in a browser and run the WaterMeter. This will show you CPU activity in your cluster.
You can compare the activity before you start your RF and after you start your RF.
If even before you start your RF the nodes are extremely busy then you may be out of luck and just have to wait. If even after you start your RF the nodes are not busy at all, then the network communication may be too high and fewer nodes would be better.
You'll also want to look at the H2O logs and see how the dataset got parsed, datatype-wise, and the speed at which individual trees are built. And if your response column is categorical and you're doing multinomial classification, each tree is really N trees, where N is the number of levels in the response column.
[ Unfortunately, the "it's too slow" complaint is way too generic to say much more. ]
That sounds like a long time to train a Random Forest on a dataset that is only 120k x 518. As Tom said above, it might have to do with congestion on your Hadoop cluster, and possibly the cluster is way too big for this task. You should be able to train on a dataset that size on a single machine (no multi-node cluster necessary).
If possible, try training the model on your laptop for a comparison. If there is nothing you can do to improve the Hadoop environment, this may be a better option for training.
For your other question about a verbose option: I don't remember there ever being such an option in H2O's Random Forest. You can view the progress of models as they build in H2O Flow, the GUI. When you click on a model to view it, there is a "Refresh" button that lets you check on the progress of the model as it trains.
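If you want to try the laptop comparison, here is a minimal sketch with H2O's Python API (the CSV path, response column name and memory size are placeholders):

import h2o
from h2o.estimators import H2ORandomForestEstimator

# Single-node H2O on the local machine; size max_mem_size to the laptop's RAM
h2o.init(max_mem_size="16G")

# Placeholder path: an export of the 120k x 518 dataset
train = h2o.import_file("train.csv")
response = "target"  # placeholder response column
predictors = [c for c in train.columns if c != response]

rf = H2ORandomForestEstimator(ntrees=1000, seed=42)
rf.train(x=predictors, y=response, training_frame=train)

print(rf.model_performance(train))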

How to make a non-static Caffe network architecture?

I would like to implement a neural network architecture in Caffe which will perform differently based on some iterable variable. For example: the full network might use 10 layers for 4 out of 5 training or testing iterations, but for all other iterations it will truncate the network and only use the last 5 layers. This would require that the input to the first layer and the input to the 5th layer have the same dimensionality of course, but my primary question is how to implement this switching between the two architectures during training/testing.
I guess you can do that using pycaffe and caffe.NetSpec(), but it is not going to be very nice code...
On the other hand, why don't you train the full net for i iterations, save a snapshot, and then "warm start" the reduced model from the snapshot you saved?
That is: have 'full_trainval.prototxt' with 'full_solver.prototxt' configured to train the full net for i iterations, and 'top_trainval.prototxt' with 'top_solver.prototxt' configured to train only the top layers of the net. Then
~$ $CAFFE_ROOT/build/tools/caffe train -solver full_solver.prototxt
When this stage is through, make sure you have the final snapshot saved, and then
~$ $CAFFE_ROOT/build/tools/caffe train -solver top_solver.prototxt -snapshot full_train_last_snapshot.solverstate
Finally, you could use net_surgery to merge the weights of the two phases into a single full net.
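For that merge step, a rough net_surgery-style sketch with pycaffe (the .prototxt/.caffemodel file names are placeholders; the shared top layers must keep the same names in both nets):

import caffe

# Placeholders: the full net with its originally trained weights, and the
# separately fine-tuned top-layers-only net
full_net = caffe.Net('full_trainval.prototxt', 'full_weights.caffemodel', caffe.TEST)
top_net = caffe.Net('top_trainval.prototxt', 'top_weights.caffemodel', caffe.TEST)

# Copy every layer that exists in both nets from the top-only net into the
# full net; pycaffe matches layers by name.
for layer_name in top_net.params:
    if layer_name in full_net.params:
        for i, blob in enumerate(top_net.params[layer_name]):
            full_net.params[layer_name][i].data[...] = blob.data

# Save the merged weights and use them together with full_trainval.prototxt
full_net.save('merged_full_net.caffemodel')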
