How do I set Dask autoscaling using Iguazio? - dask

I need to create a new Dask cluster in Iguazio. I want to take advantage of Dask's autoscaling features that are described here: https://docs.dask.org/en/stable/how-to/adaptive.html
Does Iguazio support Dask cluster autoscaling and, if so, how do I enable that?

In Iguazio, when you create a Dask cluster you don't need to worry about the lower-level dask_kubernetes details.
You just need to specify the minimum and maximum number of workers, as shown below:
import mlrun

# create an MLRun function which will init the Dask cluster
dask_cluster_name = "dask-cluster"
dask_cluster = mlrun.new_function(dask_cluster_name, kind='dask', image='mlrun/ml-models')
dask_cluster.apply(mlrun.mount_v3io())

# set the range for the number of workers with min_replicas and max_replicas
dask_cluster.spec.min_replicas = 1
dask_cluster.spec.max_replicas = 100
Depending on your workload, the cluster will scale up and down between the minimum and maximum number of workers. Adaptive deployment of the Dask cluster is baked in, so analyses run faster and users get more compute when they need it, while putting much less pressure on computational resources the rest of the time.
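Once the function is defined, you can get a standard dask.distributed Client from it and submit work. A minimal sketch (the toy computation is purely for illustration):
# Accessing the client deploys the Dask cluster on the Iguazio/MLRun side
# (if it is not already running) and connects to its scheduler.
client = dask_cluster.client

# Work submitted through the client is spread across the workers, which
# scale between min_replicas and max_replicas as needed.
futures = client.map(lambda x: x ** 2, range(1000))
print(sum(client.gather(futures)))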

Related

Sagemaker - Distributed training

I can’t find documentation on the behavior of Sagemaker when distributed training is not explicitly specified.
Specifically,
When SageMaker distributed data parallel is used via distribution='dataparallel', the documentation states that each instance processes different batches of data.
from sagemaker.tensorflow import TensorFlow

estimator = TensorFlow(
    role=role,
    py_version="py37",
    framework_version="2.4.1",
    # For training with multinode distributed training, set this count. Example: 2
    instance_count=4,
    instance_type="ml.p3.16xlarge",
    sagemaker_session=sagemaker_session,
    # Training using the SMDataParallel distributed training framework
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
)
I am not sure what happens when the distribution parameter is not specified but instance_count > 1, as below:
estimator = TensorFlow(
    py_version="py3",
    entry_point="mnist.py",
    role=role,
    framework_version="1.12.0",
    instance_count=4,
    instance_type="ml.m4.xlarge",
)
Thanks!
If the distribution parameter is not set but your training code still initializes smdataparallel, you get a runtime error: RuntimeError: smdistributed.dataparallel cannot be used outside smddprun for distributed training launch.
The distribution parameters you pass in the estimator select the appropriate runner.
"I am not sure what happens when distribution parameter is not specified but instance_count > 1 as below" -> SageMaker will run your code on 4 machines. Unless you have code purpose-built for distributed computation this is useless (simple duplication).
It gets really interesting when:
you parse the resource configuration (resourceconfig.json, or the equivalent environment variables) so that each machine is aware of its rank in the cluster, and you can write arbitrary custom distributed logic (see the sketch after this list)
if you run the same code over input that is ShardedByS3Key, your code will run on different parts of your S3 data, spread homogeneously over the machines. That makes SageMaker Training/Estimators a great place to run arbitrary shared-nothing distributed tasks such as file transformations and batch inference.
Having the machines clustered together also allows you to launch open-source distributed training software such as PyTorch DDP.
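For the resource-configuration point above, here is a minimal sketch of what rank awareness could look like inside the training script; it assumes the standard SageMaker training environment variables (SM_HOSTS, SM_CURRENT_HOST), which mirror /opt/ml/input/config/resourceconfig.json:
import json
import os

# List of hosts in the training cluster, e.g. ["algo-1", "algo-2", "algo-3", "algo-4"],
# and the name of the host this copy of the script is running on.
hosts = json.loads(os.environ["SM_HOSTS"])
current_host = os.environ["SM_CURRENT_HOST"]
rank = hosts.index(current_host)

# With ShardedByS3Key input, each instance already receives its own slice of the
# data channel, so the rank is mainly useful for any extra coordination you add.
print(f"Running as worker {rank} of {len(hosts)}")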

Async - FL Model

How do I perform asynchronous model training using the TFF framework?
I reviewed the iterative training process loop; however, I am not sure how to know which clients' models have been received.
It's quite possible to simulate something akin to "asynchronous FL" in TFF. One way to think about this is to conceptually decouple simulation time from wall-clock time.
Sampling a different number of clients each round (rather than the uniform K clients that is commonly done), perhaps with a distribution that weights clients by how long they are expected to train, can simulate asynchronous FL. It's also possible to process only a portion of the selected clients first; the researcher has the freedom to slice up the data/computation as desired.
The Python-esque pseudocode below demonstrates the two techniques: different client sampling, and delayed gradient application:
from datetime import timedelta

state = fed_avg_iter_proc.initialize()
for round_num in range(NUM_ROUNDS):
    # Here we conceptualize a "round" as a block of time, rather than a synchronous
    # round. We have a function that determines which clients will "finish" within
    # our configured block of time. This might even return only a single client.
    participants = get_next_clients(time_window=timedelta(minutes=30))
    num_participants = len(participants)
    # Here we only process the first half, and then update the global model.
    state2, metrics = fed_avg_iter_proc.next(state, participants[:num_participants // 2])
    # Now process the second half of the selected clients.
    # Note: this now applies the 'pseudo-gradient' that was computed on the clients
    # (the difference between the original `state` and their local training result)
    # to the model that has already taken one step (`state2`). This possibly has
    # undesirable effects on the optimisation process, or may be improved with
    # techniques that handle "stale" gradients.
    state3, metrics = fed_avg_iter_proc.next(state2, participants[num_participants // 2:])
    # Finally, update the state for the next iteration of the simulation loop.
    state = state3
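The get_next_clients helper above is left undefined in the pseudocode. One way to sketch it, assuming placeholder client IDs and an invented exponential model of per-client training time (a real simulation would draw durations from data):
import random
from datetime import timedelta

CLIENT_IDS = [f"client_{i}" for i in range(1000)]  # placeholder client population

def get_next_clients(time_window):
    # Sample a candidate cohort for this block of simulated time.
    candidates = random.sample(CLIENT_IDS, k=100)
    # Assign each candidate a simulated training duration (mean ~20 minutes here);
    # only clients that would finish within the window "arrive" this round.
    durations = {c: timedelta(minutes=random.expovariate(1 / 20)) for c in candidates}
    return [c for c in candidates if durations[c] <= time_window]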

Is high label cardinality but low metric/label count and infrequent sampling an acceptable use-case for Prometheus?

I have a monitoring use case that I'm not entirely sure is a good match for Prometheus, and I wanted to ask for opinions before I delve deeper.
The numbers of what I'm going to store:
Only 1 metric.
That metric has 1 label with 1,000,000 to 2,000,000 distinct values.
The values are gauges (but does it make a difference if they are counters?)
Sample rate is once every 5 minutes. Retaining data for 180 days.
Estimated storage size if I have 1 million distinct label values, according to the formula in Prometheus' documentation (retention_time_seconds * ingested_samples_per_second * bytes_per_sample):
There are (24*60)/5 = 288 five-minute intervals in a day.
(180 * 288 samples per label value) * (1,000,000 label values) * (2 bytes per sample) = 103,680,000,000 bytes ≈ 100 GB
So I assume 100-200 GB will be required.
Is this estimation correct?
I read in multiple places about avoiding high-cardinality labels, and I would like to ask about this. Considering I will be looking at one time series at a time: is the problem the high-cardinality labels themselves, or having a high number of time series (since each label value produces another time series)? I also read in multiple places that Prometheus can handle millions of time series at once, so even if I have 1 label with one million distinct values I should be fine in terms of time-series count; do I still have to worry about the label having high cardinality in this case? I'm aware that it depends on the strength of the server, but assuming average capacity, I would like to know whether Prometheus' implementation has a problem handling this case efficiently.
Also, if it's a matter of time-series count, am I correct in assuming there will be no significant difference between the following options?
1 metric with 1 label of 1,000,000 distinct label values.
10 metrics, each with 1 label of 100,000 distinct label values.
X metrics, each with 1 label of Y distinct label values, where X * Y = 1,000,000.
Thanks for the help!
That might work, but it's not what Prometheus is designed for and you'll likely run into issues. You probably want a database rather than a monitoring system, maybe Cassandra here.
How the cardinality is split across metrics won't affect ingestion performance; however, having to read 1M series in a query will be relatively slow.
Note that VictoriaMetrics is an easy-to-configure backend for Prometheus which will reduce storage requirements significantly.

Dask: many small workers vs a big worker

I am trying to understand this simple example from the dask-jobqueue documentation:
from dask_jobqueue import PBSCluster
cluster = PBSCluster(cores=36,
                     memory='100GB',
                     project='P48500028',
                     queue='premium',
                     walltime='02:00:00')
cluster.start_workers(100)  # Start 100 jobs that match the description above

from dask.distributed import Client
client = Client(cluster)  # Connect to that cluster
I think it means that there will be 100 jobs each using 36 cores.
Let's say I can use 48 cores on a cluster.
Should I use 1 worker with 48 cores, or 48 workers with 1 core each?
If your computations mostly release the GIL, then you'll probably want several threads per process. This is true if you're doing mostly Numpy, Pandas, Scikit-Learn, Numba/Cython programming on numeric data. I might do something like six processes with eight cores each.
If your computations are mostly pure Python code, for example you process text data or iterate heavily with Python for loops over dicts/lists/etc., then you'll want fewer threads per process, maybe two.
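As a sketch of what that might look like with dask-jobqueue (assuming a 48-core allocation; the memory, queue, and walltime values are placeholders), the processes argument splits each job into several workers, and each worker gets cores/processes threads:
from dask_jobqueue import PBSCluster

# Six processes of eight threads each suits GIL-releasing numeric work;
# for mostly pure-Python work you might instead use processes=24 or 48.
cluster = PBSCluster(cores=48,
                     processes=6,
                     memory='100GB',
                     queue='premium',
                     walltime='02:00:00')
cluster.scale(jobs=1)  # one 48-core job, split into 6 workers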

Random Forest - Verbose and Speed

I am trying to build a random forest on a dataset with 120k rows and 518 columns.
I have two questions:
1. I want to see the progress and logs of building the forest. Is the verbose option deprecated in the randomForest function?
2. How can I increase the speed? Right now it takes more than 6 hours to build a random forest with 1000 trees.
The H2O cluster is initialized with the settings below:
hadoop jar h2odriver.jar -Dmapreduce.job.queuename=devclinical
-output temp3p -nodes 20 -nthreads -1 -mapperXmx 32g
h2o.init(ip = h2o_ip, port = h2o_port, startH2O = FALSE,
nthreads=-1,max_mem_size = "64G", min_mem_size="4G" )
Depending on the congestion of your network and how busy your Hadoop nodes are, it may finish faster with fewer nodes. For example, if 1 of the 20 nodes you requested is totally slammed by some other jobs, then that node may lag, and the work from that node is not rebalanced to the other nodes.
A good way to see what is going on is to connect to H2O Flow in a browser and run the WaterMeter. This will show you CPU activity in your cluster.
You can compare the activity before you start your RF and after you start your RF.
If even before you start your RF the nodes are extremely busy then you may be out of luck and just have to wait. If even after you start your RF the nodes are not busy at all, then the network communication may be too high and fewer nodes would be better.
You'll also want to look at the H2O logs to see how the dataset was parsed, datatype-wise, and the speed at which individual trees are built. And if your response column is categorical and you're doing multinomial classification, each tree is really N trees, where N is the number of levels in the response column.
[ Unfortunately, the "it's too slow" complaint is way too generic to say much more. ]
That sounds like a long time to train a Random Forest on a dataset of only 120k rows x 518 columns. As Tom said above, it might have to do with congestion on your Hadoop cluster, and possibly with that cluster being way too big for this task. You should be able to train on a dataset of that size on a single machine (no multi-node cluster necessary).
If possible, try training the model on your laptop for a comparison. If there is nothing you can do to improve the Hadoop environment, this may be a better option for training.
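As a rough sketch of that single-machine comparison run (shown with the H2O Python client; the R equivalent would be h2o.randomForest), where the file path and column names are placeholders:
import h2o
from h2o.estimators import H2ORandomForestEstimator

# Start a local, single-node H2O instance (no Hadoop involved).
h2o.init()

train = h2o.import_file("train.csv")                        # placeholder path
predictors = [c for c in train.columns if c != "response"]  # placeholder response column

rf = H2ORandomForestEstimator(ntrees=1000)
rf.train(x=predictors, y="response", training_frame=train)
print(rf.model_performance(train))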
For your other question about a verbose option: I don't remember there ever being such an option in H2O's Random Forest. You can view the progress of models as they build in H2O Flow, the GUI. When you click on a model to view it, there is a "Refresh" button that allows you to check on the progress of the model as it trains.

Resources