I am trying to understand this simple example from the dask-jobqueue documentation:
from dask_jobqueue import PBSCluster
cluster = PBSCluster(cores=36,
memory"100GB",
project='P48500028',
queue='premium',
walltime='02:00:00')
cluster.start_workers(100) # Start 100 jobs that match the description above
from dask.distributed import Client
client = Client(cluster) # Connect to that cluster
I think it means that there will be 100 jobs each using 36 cores.
Let's say I can use 48 cores on a cluster.
Should I use 1 worker with 48 cores or 48 workers of 1 core each?
If your computations mostly release the GIL, then you'll probably want several threads per process. This is true if you're doing mostly Numpy, Pandas, Scikit-Learn, Numba/Cython programming on numeric data. I might do something like six processes with eight cores each.
If your computations are mostly pure Python code, for example if you process text data or iterate heavily with Python for loops over dicts/lists/etc., then you'll want fewer threads per process, maybe two.
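For example (a hedged sketch; the 48-core split and resource values below are illustrative, not from the original post), dask-jobqueue lets you control this split with the processes argument:
from dask_jobqueue import PBSCluster
# Mostly-numeric, GIL-releasing work: few processes, many threads each
# (6 processes x 8 threads = 48 cores per job).
cluster = PBSCluster(cores=48, processes=6, memory='100GB',
                     queue='premium', walltime='02:00:00')
# Mostly pure-Python work: more processes, ~2 threads each
# (24 processes x 2 threads = 48 cores per job).
# cluster = PBSCluster(cores=48, processes=24, memory='100GB',
#                      queue='premium', walltime='02:00:00')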
I need to create a new Dask cluster in Iguazio. I want to take advantage of Dask's autoscaling features that are described here: https://docs.dask.org/en/stable/how-to/adaptive.html
Does Iguazio support Dask cluster autoscaling and, if so, how do I enable that?
In Iguazio, when you create a Dask cluster, you don't need to worry about the lower-level dask_kubernetes details.
You just need to specify the min and max number of workers, like below:
# create an mlrun function which will init the dask cluster
dask_cluster_name = "dask-cluster"
dask_cluster = mlrun.new_function(dask_cluster_name, kind='dask', image='mlrun/ml-models')
dask_cluster.apply(mlrun.mount_v3io())
# set the range for the number of workers with min_replicas and max_replicas
dask_cluster.spec.min_replicas = 1
dask_cluster.spec.max_replicas = 100
Depending on your workload, the cluster will scale up and down between the min and max number of workers. Adaptive deployment of the Dask cluster is baked in, so users get faster analyses and more compute power with much less pressure on computational resources.
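For reference, the same behaviour outside of Iguazio comes from Dask's adaptive scaling described in the linked docs; a minimal sketch with dask.distributed (the cluster type and bounds here are illustrative):
from dask.distributed import LocalCluster, Client
cluster = LocalCluster()
cluster.adapt(minimum=1, maximum=100)  # scale between 1 and 100 workers based on load
client = Client(cluster)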
When I serve my TF model with TensorFlow Serving (version 2.1.0) through Docker and run a stress test with JMeter, there is a problem: TPS hits 4400 when testing with a single data point, while it only reaches 1700 with multiple data points read from a txt file. The model is a BiLSTM that I've trained without any cache setting. All experiments run against a local server rather than over the network.
Metrics:
In the single-data task, I send HTTP requests with identical data, with no interval, from 30 request threads for 10 minutes.
TPS: 4491
CPU occupied: 2100%
99% latency line (ms): 17
error rate: 0
In the multiple-data task, the 30 request threads read from a txt file, a dataset with 9,740,000 different examples.
TPS: 1711
CPU occupied: 2300%
99% latency line (ms): 42
error rate: 0
Hardware:
CPU cores: 12
logical processors: 24
Intel(R) Xeon(R) Silver 4214 CPU @ 2.20GHz
Is there a cache in Tensorflow Serving?
Why is TPS three times higher with single-data testing than with varied-data testing in the stress test?
I've solved the problem. The request threads all read from the same file and have to wait for one another, and that waiting costs CPU on the JMeter side.
I'm trying to read a big (will not fit in memory) parquet dataset, and then sample from it. Each partition of the dataset fits perfectly in memory.
The dataset is about 20 GB of data on disk, divided into 104 partitions of about 200 MB each. I don't want to use more than 40 GB of memory at any point, so I'm setting n_workers and memory_limit accordingly.
My hypothesis was that Dask would load as many partitions as it could handle, sample from them, scrap them from memory and then continue loading the next ones. Or something like that.
Instead, judging by the execution graph (104 load operations in parallel, after each a sample), it looks like it tries to load all partitions simultaneously, and therefore the workers keep getting killed for running out of memory.
Am I missing something?
This is my code:
from datetime import datetime
from dask.distributed import Client
client = Client(n_workers=4, memory_limit=10e9)  # 10 GB per worker (in bytes)
import dask.dataframe as dd
df = dd.read_parquet('/path/to/dataset/')
df = df.sample(frac=0.01)
df = df.compute()
To reproduce the error, you can create a mock dataset 1/10th the size of the one I was trying to load using the code below, and run my code with memory_limit=1e9 (1 GB) to compensate.
from dask.distributed import Client
client = Client() #add restrictions depending on your system here
from dask import datasets
df = datasets.timeseries(end='2002-12-31')
df = df.repartition(npartitions=104)
df.to_parquet('./mock_dataset')
Parquet is an efficient binary format, with encoding and compression. There is a very good chance that in memory, it takes up far more space than you think.
In order to sample the data at 1%, each partition is being loaded and expanded into memory in entirety, before being sub-selected. This comes with considerable memory overhead of buffer copies. Each worker thread will need to accommodate the currently-processed chunk, as well as results that have been accumulated so far on that worker, and then a task will copy all of these for the final concat operation (which also involves copies and overhead).
The general recommendation is that each worker should have access to "several times" the in-memory size of each partition, and in your case those partitions are ~200 MB on disk and bigger in memory.
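A quick way to check how much a single partition expands when loaded is to pull just one partition into pandas (a minimal sketch reusing the placeholder path from the question):
import dask.dataframe as dd
df = dd.read_parquet('/path/to/dataset/')
part = df.get_partition(0).compute()  # load only the first partition into pandas
print(part.memory_usage(deep=True).sum() / 1e9, "GB in memory for one partition")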
I want to merge two dask dataframes, impute missing values with column median and export the merged dataframe to csv files.
I have one problem: my current code cannot utilize all 8 CPUs (~20% of each CPU).
I am not sure which part limits the CPU usage. Here is a reproducible example:
import numpy as np
import pandas as pd
df1 = pd.DataFrame(
np.c_[(np.random.randint(100, size=(10000, 1)), np.random.randn(10000, 3))],
columns=['id', 'a', 'b', 'c'])
df2 = pd.DataFrame(
np.c_[(np.array(range(100)), np.random.randn(100, 10000))],
columns=['id'] + ['d_' + str(i) for i in range(10000)])
df1.id=df1.id.astype(int).astype(object)
df2.id=df2.id.astype(int).astype(object)
## some cells are missing in df2
df2.iloc[:, 1:] = df2.iloc[:,1:].mask(np.random.random(df2.iloc[:, 1:].shape) < .05)
## dask codes starts here
import dask.dataframe as dd
from dask.distributed import Client
ddf1 = dd.from_pandas(df1, npartitions=3)
ddf2 = dd.from_pandas(df2, npartitions=3)
ddf = ddf1.merge(ddf2, how='left', on='id')
ddf = ddf.fillna(ddf.quantile())
ddf.to_csv('train_*.csv', index=None, header=None)
Although all 8 CPUs are invoked, only ~20% of each CPU is utilized. Can I change my code to improve the CPU usage?
Firstly, note that if you don't specify otherwise, Dask will use threads for execution. With threads, only one Python operation can occur at a time (the "GIL"), except in some lower-level code that explicitly releases the lock. The "merge" operation involves a lot of shuffling of data in memory, and I suspect it releases the lock only some of the time.
Secondly, all of the output is being written to the filesystem, so you will always have a bottleneck here: however fast other processing may be, you still need to feed all of it through the storage bus.
If the CPUs are each working at ~20%, I daresay this is still faster than a single-core version? Put simply, some workloads just parallelise better than others.
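If you want to test whether the GIL is the limiting factor, one experiment (a hedged sketch; worker counts are illustrative) is to run the same pipeline on a process-based local cluster instead of the default thread pool:
from dask.distributed import Client
# 8 single-threaded worker processes sidestep the GIL for pure-Python work;
# data then has to be copied between processes, so the merge/shuffle may cost more.
client = Client(n_workers=8, threads_per_worker=1)
# ...then run the same merge/fillna/to_csv code as above.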
I am trying to build a random forest on a data set with 120k rows and 518 columns.
I have two questions:
1. I want to see the progress and logs of building the forest. Is the verbose option deprecated in the randomForest function?
2. How can I increase the speed? Right now it takes more than 6 hours to build a random forest with 1000 trees.
H2O cluster is initialized with below settings:
hadoop jar h2odriver.jar -Dmapreduce.job.queuename=devclinical
-output temp3p -nodes 20 -nthreads -1 -mapperXmx 32g
h2o.init(ip = h2o_ip, port = h2o_port, startH2O = FALSE,
nthreads=-1,max_mem_size = "64G", min_mem_size="4G" )
Depending on congestion of your network and the busyness level of your hadoop nodes, it may finish faster with fewer nodes. For example, if 1 of the 20 nodes you requested is totally slammed by some other jobs, then that node may lag, and the work from that node is not rebalanced to other nodes.
A good way to see what is going on is to connect to H2O Flow in a browser and run the WaterMeter. This will show you CPU activity in your cluster.
You can compare the activity before you start your RF and after you start your RF.
If even before you start your RF the nodes are extremely busy then you may be out of luck and just have to wait. If even after you start your RF the nodes are not busy at all, then the network communication may be too high and fewer nodes would be better.
You'll also want to look at the H2O logs and see how the dataset got parsed, datatype-wise, and the speed at which individual trees are built. And if your response column is categorical and you're doing multinomial, each tree is really N trees, where N is the number of levels in the response column.
[ Unfortunately, the "it's too slow" complaint is way too generic to say much more. ]
That sounds like a long time to train a Random Forest on a dataset of only 120k rows x 518 columns. As Tom said above, it might have to do with congestion on your Hadoop cluster, and possibly with a cluster that is way too big for this task. You should be able to train on a dataset that size on a single machine (no multi-node cluster necessary).
If possible, try training the model on your laptop for a comparison. If there is nothing you can do to improve the Hadoop environment, this may be a better option for training.
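If you try the laptop comparison, a minimal sketch with H2O's Python API might look like this (the question uses R, but the flow is the same; the file path and column names are hypothetical):
import h2o
from h2o.estimators import H2ORandomForestEstimator
h2o.init()  # single-node H2O on the local machine
train = h2o.import_file("train.csv")  # hypothetical local copy of the 120k x 518 dataset
response = "target"                   # hypothetical response column
predictors = [c for c in train.columns if c != response]
rf = H2ORandomForestEstimator(ntrees=1000, seed=42)
rf.train(x=predictors, y=response, training_frame=train)
print(rf.model_performance(train))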
For your other question about a verbose option -- I don't remember there ever being this option in H2O's Random Forest. You can view the progress of models as they build in H2O Flow, the GUI. When you click on a model to view it, there is a "Refresh" button that will allow you to check on the progress of the model as it trains.