SageMaker - Distributed training - machine-learning

I can't find documentation on the behavior of SageMaker when distributed training is not explicitly specified.
Specifically:
When SageMaker distributed data parallel is enabled via the distribution parameter (dataparallel), the docs state that each instance processes a different batch of data.
from sagemaker.tensorflow import TensorFlow

estimator = TensorFlow(
    role=role,
    py_version="py37",
    framework_version="2.4.1",
    # For training with multinode distributed training, set this count. Example: 2
    instance_count=4,
    instance_type="ml.p3.16xlarge",
    sagemaker_session=sagemaker_session,
    # Training using SMDataParallel Distributed Training Framework
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
)
I am not sure what happens when the distribution parameter is not specified but instance_count > 1, as below:
estimator = TensorFlow(
    py_version="py3",
    entry_point="mnist.py",
    role=role,
    framework_version="1.12.0",
    instance_count=4,
    instance_type="ml.m4.xlarge",
)
Thanks!

If you initialize smdistributed.dataparallel in your training code without the corresponding distribution configuration, you get a runtime error: RuntimeError: smdistributed.dataparallel cannot be used outside smddprun for distributed training launch.
The distribution parameter you pass to the estimator selects the appropriate runner.

"I am not sure what happens when distribution parameter is not specified but instance_count > 1 as below" -> SageMaker will run your code on 4 machines. Unless you have code purpose-built for distributed computation this is useless (simple duplication).
It gets really interesting when:
you parse resource configuration (resourceconfig.json or via env variables) so that each machine is aware of its rank in the cluster, and you can write custom arbitrary distributed things
if you run the same code over input that is ShardedByS3Key, your code will run on different parts of your S3 data that is homogeneously spread over machines. Which makes SageMaker Training/Estimators a great place to run arbitrary shared-nothing distributed tasks such as file transformations and batch inference.
Having machines clustered together also allows you to launch open-source distributed training software like PyTorch DDP
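For illustration, here is a minimal sketch (not an official snippet) of how a training script could discover its rank from the SM_HOSTS / SM_CURRENT_HOST environment variables set by the SageMaker training toolkit, or from /opt/ml/input/config/resourceconfig.json; the file-sharding logic at the end is purely hypothetical.

import json
import os

def get_cluster_info():
    # Preferred: environment variables exposed by the SageMaker training toolkit
    if "SM_HOSTS" in os.environ:
        hosts = json.loads(os.environ["SM_HOSTS"])
        current_host = os.environ["SM_CURRENT_HOST"]
    else:
        # Fallback: read the resource configuration file directly
        with open("/opt/ml/input/config/resourceconfig.json") as f:
            cfg = json.load(f)
        hosts, current_host = cfg["hosts"], cfg["current_host"]
    rank = sorted(hosts).index(current_host)
    return rank, len(hosts)

if __name__ == "__main__":
    rank, world_size = get_cluster_info()
    # Hypothetical shared-nothing split: each instance handles every world_size-th file
    data_dir = os.environ.get("SM_CHANNEL_TRAIN", "/opt/ml/input/data/train")
    my_files = sorted(os.listdir(data_dir))[rank::world_size]
    print(f"instance {rank}/{world_size} will process {len(my_files)} files")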

Related

Not sure which configuration to use with AWS EC2 Instance for ML project

I am working on a machine learning project and I need help with the type of instance I should use to train and test the machine learning models.
Following are the project details:
Methods used are a heavy ensemble with LightGBM (LGB) and neural networks (NN).
Train data:
Size: 16 GB
Values: 459,000
Type: CSV
Test data:
Size: 33.82 GB
Values: 920,000
Type: CSV
I have not worked with such a huge amount of data previously and need help choosing an AWS instance for the project that will be cost-effective and won't cause performance issues.
I haven't tried anything yet, but I am going to explore the instance types.

ClearML multiple tasks in single script changes logged value names

I trained multiple models with different configuration for a custom hyperparameter search. I use pytorch_lightning and its logging (TensorboardLogger).
When running my training script after Task.init() ClearML auto-creates a Task and connects the logger output to the server.
For each training stage (train, val and test) I log the following scalars at each epoch: loss, acc and iou.
When I have multiple configurations, e.g. networkA and networkB, the first training logs its values to loss, acc and iou, but the second logs to networkB:loss, networkB:acc and networkB:iou. This makes the values incomparable.
My training loop with Task initialization looks like this:
names = ['networkA', 'networkB']
for name in names:
    task = Task.init(project_name="NetworkProject", task_name=name)
    pl_train(name)
    task.close()
The method pl_train is a wrapper for the whole training with PyTorch Lightning. No ClearML code is inside this method.
Do you have any hint on how to properly use a loop in a script with completely separated tasks?
Edit: ClearML version was 0.17.4. Issue is fixed in main branch.
Disclaimer: I'm part of the ClearML (formerly Trains) team.
pytorch_lightning creates a new TensorBoard logger for each experiment. When ClearML logs the TB scalars and it captures the same scalar being re-sent, it adds a prefix so that reporting the same metric does not overwrite the previous one. A good example would be reporting the loss scalar in the training phase vs the validation phase (producing "loss" and "validation:loss").
It might be that the task.close() call does not clear the previous logs, so it "thinks" this is the same experiment, hence adding the networkB prefix to the loss. As long as you are closing the Task after training is completed, you should have all experiments logged with the same metric/variant (title/series). I suggest opening a GitHub issue; this should probably be considered a bug.
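As a possible workaround (a sketch, not a confirmed ClearML-recommended fix), you could bypass the auto-captured TensorBoard naming and report the scalars explicitly through the task's logger, so the title/series stay identical across runs. The epoch count and the pl_train_one_epoch helper below are placeholders.

# Sketch: report scalars explicitly so every task uses the same title/series.
from clearml import Task

names = ['networkA', 'networkB']
num_epochs = 10                                     # placeholder epoch count
for name in names:
    task = Task.init(project_name="NetworkProject", task_name=name)
    logger = task.get_logger()
    for epoch in range(num_epochs):
        train_loss = pl_train_one_epoch(name, epoch)   # hypothetical per-epoch training helper
        logger.report_scalar(title="loss", series="train", value=train_loss, iteration=epoch)
    task.close()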

Random Cut Forest model evaluation strategies (confusion matrix, accuracy and precision_recall_fscore)

I am using the AWS SageMaker Random Cut Forest algorithm to detect anomalies.
import boto3
import sagemaker

containers = {
    'us-west-2': '174872318107.dkr.ecr.us-west-2.amazonaws.com/randomcutforest:latest',
    'us-east-1': '382416733822.dkr.ecr.us-east-1.amazonaws.com/randomcutforest:latest',
    'us-east-2': '404615174143.dkr.ecr.us-east-2.amazonaws.com/randomcutforest:latest',
    'eu-west-1': '438346466558.dkr.ecr.eu-west-1.amazonaws.com/randomcutforest:latest',
    'ap-southeast-1': '475088953585.dkr.ecr.ap-southeast-1.amazonaws.com/randomcutforest:latest'
}
region_name = boto3.Session().region_name
container = containers[region_name]

session = sagemaker.Session()

rcf = sagemaker.estimator.Estimator(
    container,
    sagemaker.get_execution_role(),
    output_path='s3://{}/{}/output'.format(bucket, prefix),
    train_instance_count=1,
    train_instance_type='ml.c5.xlarge',
    sagemaker_session=session)

rcf.set_hyperparameters(
    num_samples_per_tree=200,
    num_trees=250,
    feature_dim=1,
    eval_metrics=["accuracy", "precision_recall_fscore"])

s3_train_input = sagemaker.session.s3_input(
    s3_train_data,
    distribution='ShardedByS3Key',
    content_type='application/x-recordio-protobuf')

rcf.fit({'train': s3_train_input})
(referred from https://aws.amazon.com/blogs/machine-learning/use-the-built-in-amazon-sagemaker-random-cut-forest-algorithm-for-anomaly-detection/)
I used the above code to train the model, but didn't find a way to evaluate it.
How do I get the accuracy and F-score after deploying the model?
In order to get evaluation metrics you need to provide an extra channel called "test" during training. The test channel must contain labeled data. This is explained in the official documentation, https://docs.aws.amazon.com/sagemaker/latest/dg/randomcutforest.html :
Amazon SageMaker Random Cut Forest supports the train and test data channels. The optional test channel is used to compute accuracy, precision, recall, and F1-score metrics on labeled data. Train and test data content types can be either application/x-recordio-protobuf or text/csv formats. For the test data, when using text/csv format, the content must be specified as text/csv;label_size=1 where the first column of each row represents the anomaly label: "1" for an anomalous data point and "0" for a normal data point. You can use either File mode or Pipe mode to train RCF models on data that is formatted as recordIO-wrapped-protobuf or as CSV
Also note ... the test channel only supports S3DataDistributionType=FullyReplicated
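As an illustrative sketch (the s3_test_data location is a placeholder), adding a labeled test channel alongside the train channel could look like this; RCF then reports the evaluation metrics in the training job logs:

# Sketch: add a labeled "test" channel so RCF computes accuracy/precision/recall/F1.
# s3_test_data would point to CSV rows whose first column is the 0/1 anomaly label.
s3_test_input = sagemaker.session.s3_input(
    s3_test_data,                              # placeholder, e.g. 's3://{}/{}/test'.format(bucket, prefix)
    distribution='FullyReplicated',            # the test channel only supports FullyReplicated
    content_type='text/csv;label_size=1')

rcf.fit({'train': s3_train_input, 'test': s3_test_input})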
Thanks,
Julio

Random Forest - Verbose and Speed

I am trying to build a random forest on a data set with 120k rows and 518 columns.
I have two questions:
1. I want to see the progress and logs while the forest is being built. Is the verbose option deprecated in the randomForest function?
2. How can I increase the speed? Right now it takes more than 6 hours to build a random forest with 1000 trees.
The H2O cluster is initialized with the settings below:
hadoop jar h2odriver.jar -Dmapreduce.job.queuename=devclinical \
    -output temp3p -nodes 20 -nthreads -1 -mapperXmx 32g
h2o.init(ip = h2o_ip, port = h2o_port, startH2O = FALSE,
         nthreads = -1, max_mem_size = "64G", min_mem_size = "4G")
Depending on congestion of your network and the busyness level of your hadoop nodes, it may finish faster with fewer nodes. For example, if 1 of the 20 nodes you requested is totally slammed by some other jobs, then that node may lag, and the work from that node is not rebalanced to other nodes.
A good way to see what is going on is to connect to H2O Flow in a browser and run the WaterMeter. This will show you CPU activity in your cluster.
You can compare the activity before you start your RF and after you start your RF.
If even before you start your RF the nodes are extremely busy then you may be out of luck and just have to wait. If even after you start your RF the nodes are not busy at all, then the network communication may be too high and fewer nodes would be better.
You'll also want to look at the H2O logs to see how the dataset got parsed, datatype-wise, and the speed at which individual trees are built. And if your response column is a categorical and you're doing multinomial, each tree is really N trees, where N is the number of levels in the response column.
[ Unfortunately, the "it's too slow" complaint is way too generic to say much more. ]
That sounds like a long time to train a Random Forest on a dataset of only 120k rows x 518 columns. As Tom said above, it might have to do with congestion on your Hadoop cluster, and possibly with a cluster that is way too big for this task. You should be able to train a dataset that size on a single machine (no multi-node cluster necessary).
If possible, try training the model on your laptop for a comparison. If there is nothing you can do to improve the Hadoop environment, this may be a better option for training.
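For illustration, a minimal single-machine sketch using the h2o Python client (the file path, response column name and tree count are placeholders; the original code uses the R client, but the idea is the same):

# Sketch: train a Random Forest on a single local H2O instance.
import h2o
from h2o.estimators import H2ORandomForestEstimator

h2o.init(max_mem_size="32G")                      # local, single-node H2O

train = h2o.import_file("train.csv")              # hypothetical local copy of the data
response = "label"                                # hypothetical response column
predictors = [c for c in train.columns if c != response]
train[response] = train[response].asfactor()      # only if this is a classification problem

rf = H2ORandomForestEstimator(ntrees=1000, seed=42)
rf.train(x=predictors, y=response, training_frame=train)
print(rf.model_performance(train))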
For your other question about a verbose option -- I don't remember there ever being such an option in H2O's Random Forest. You can view the progress of models as they build in H2O Flow, the GUI. When you click on a model to view it, there is a "Refresh" button that will allow you to check on the progress of the model as it trains.

How to make a non-static Caffe network architecture?

I would like to implement a neural network architecture in Caffe which will perform differently based on some iterable variable. For example: the full network might use 10 layers for 4 out of 5 training or testing iterations, but for all other iterations it will truncate the network and only use the last 5 layers. This would require that the input to the first layer and the input to the 5th layer have the same dimensionality of course, but my primary question is how to implement this switching between the two architectures during training/testing.
I guess you can do that using pycaffe and caffe.NetSpec(), but the resulting code is not going to be very nice...
On the other hand, why don't you train the full net for i iterations, save a snapshot, and then "warm start" the reduced model from the snapshot you saved?
That is: have 'full_trainval.prototxt' with 'full_solver.prototxt' configured to train the full net for i iterations, and 'top_trainval.prototxt' with 'top_solver.prototxt' configured to train only the top layers of the net. Then
~$ $CAFFE_ROOT/build/tools/caffe train -solver full_solver.prototxt
When this stage is through, make sure you have the final snapshot saved, and then
~$ $CAFFE_ROOT/build/tools/caffe train -solver top_solver.prototxt -snapshot full_train_last_snapshot.solverstate
Finally, you could use net_surgery to merge the weights of the two phases into a single full net.
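A minimal sketch of that last merging step with pycaffe (the prototxt/caffemodel file names and the list of top-layer names are hypothetical; it just copies the retrained top-layer weights back into the full net):

# Sketch: copy weights of the retrained top layers into the full net (net surgery).
import caffe

full_net = caffe.Net('full_trainval.prototxt', 'full_train_last_snapshot.caffemodel', caffe.TEST)
top_net  = caffe.Net('top_trainval.prototxt',  'top_train_last_snapshot.caffemodel',  caffe.TEST)

for layer_name in ['layer6', 'layer7', 'layer8', 'layer9', 'layer10']:   # hypothetical top layers
    for i in range(len(top_net.params[layer_name])):                     # weights and biases
        full_net.params[layer_name][i].data[...] = top_net.params[layer_name][i].data

full_net.save('merged_full_net.caffemodel')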
