ML Engine with GPU workers errors - google-cloud-ml-engine

Hi, I am using ML Engine with a custom tier made up of a complex_m master, four workers each with a GPU, and one complex_m parameter server.
The model is training a CNN. However, there seems to be trouble with the workers.
This is an image of the logs https://i.stack.imgur.com/VJqE0.png.
The master still seems to be working because session checkpoints are being saved; however, this is nowhere near the speed it should be.
With complex_m workers, the model works. It just gives a "waiting for the model to be ready" message in the beginning (I assume this lasts until the master initializes the global variables, correct me if I am wrong) and then works normally. With GPUs, however, there seems to be a problem with the task.
I didn't use the tf.device() function anywhere; in the cloud I thought the device is set automatically if a GPU is available.
I followed the Census example and loaded the TF_CONFIG environment variable.
import json
import os

import tensorflow as tf

# This block lives inside a dispatch-style function, as in the Census sample
# (hence the early returns).
def dispatch(*args, **kwargs):
    tf.logging.info('Setting up the server')
    tf_config = os.environ.get('TF_CONFIG')
    # If TF_CONFIG is not available run local
    if not tf_config:
        return run('', True, *args, **kwargs)

    tf_config_json = json.loads(tf_config)
    cluster = tf_config_json.get('cluster')
    job_name = tf_config_json.get('task', {}).get('type')
    task_index = tf_config_json.get('task', {}).get('index')

    # If cluster information is empty run local
    if job_name is None or task_index is None:
        return run('', True, *args, **kwargs)

    cluster_spec = tf.train.ClusterSpec(cluster)
    server = tf.train.Server(cluster_spec,
                             job_name=job_name,
                             task_index=task_index)

    # Wait for incoming connections forever
    # Worker ships the graph to the ps server
    # The ps server manages the parameters of the model.
    if job_name == 'ps':
        server.join()
        return
    elif job_name in ['master', 'worker']:
        return run(server.target, job_name == 'master', *args, **kwargs)
Then I used tf.train.replica_device_setter before defining the main graph.
As a session I am using tf.train.MonitoredTrainingSession, which should handle variable initialization and checkpoint saving. I do not know why the workers are saying that the variables are not initialized.
The variables to be initialized are all of the graph's variables: https://i.stack.imgur.com/hAHPL.png
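For reference, this is roughly how those pieces fit together (a sketch; build_model is a hypothetical stand-in for my CNN definition, the other names come from the snippet above):

# Sketch only: build_model is hypothetical; cluster_spec, server and
# job_name come from the dispatch code above. Passing cluster=cluster_spec
# is what places the variables on the ps task(s).
with tf.device(tf.train.replica_device_setter(cluster=cluster_spec)):
    train_op = build_model()

with tf.train.MonitoredTrainingSession(
        master=server.target,
        is_chief=(job_name == 'master'),
        checkpoint_dir='gs://<bucket>/checkpoints') as sess:
    while not sess.should_stop():
        sess.run(train_op)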
Optimizer: AdaDelta
I appreciate the help!

In the comments, you seem to have answered your own question (passing cluster_spec to replica_device_setter). Allow me to address the issue of throughput of a cluster of CPUs vs. a cluster of GPUs.
GPUs are fairly powerful. You'll typically get higher throughput by getting a single machine with many GPUs rather than having many machines each with a single GPU. That's because the communication overhead becomes a bottleneck (the bandwidth and latency to main memory on the same machine is much better than communicating with a parameter server on a remote machine).
The reason the GPUs are slower than CPUs may be the extra overhead of copying data from main memory to the GPU and back. If you're doing a lot of parallelizable computation, then this copy is negligible. Your model may be doing too little work on the GPU, so the overhead swamps the actual computation.
For more information about building high performance models, see this guide.
In the meantime, I recommend using a single machine with more GPUs to see if that helps:
{
  "scaleTier": "CUSTOM",
  "masterType": "complex_model_l_gpu",
  ...
}
Just beware that you'll have to modify your code to assign ops to the right GPUs, probably using towers.
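Here's a rough sketch of the tower pattern (num_gpus, batch_slices and build_loss are illustrative names, not from your code; it assumes every variable receives a gradient in each tower):

# Each GPU computes gradients on its own slice of the batch,
# then the per-tower gradients are averaged and applied once.
optimizer = tf.train.AdadeltaOptimizer()
tower_grads = []
for i in range(num_gpus):
    with tf.device('/gpu:%d' % i), tf.name_scope('tower_%d' % i):
        loss = build_loss(batch_slices[i])
        tower_grads.append(optimizer.compute_gradients(loss))

averaged = []
for grad_vars in zip(*tower_grads):
    grads = tf.stack([g for g, _ in grad_vars])
    averaged.append((tf.reduce_mean(grads, axis=0), grad_vars[0][1]))
train_op = optimizer.apply_gradients(averaged)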

Related

Running functions that are themselves multithreaded on a Dask cluster

I have a function which uses image processing routines that are themselves multithreaded. I distribute many of these function calls across a dask cluster.
First, I started a scheduler on a host: dask-scheduler. Then I started the workers: dask-worker --nthreads 1 --memory-limit 0.9 tcp://scheduler:8786.
The python code looks similar to this:
import SimpleITK as sitk

def func(filename):
    sitk.ProcessObject.SetGlobalDefaultNumberOfThreads(4)  # limit to four threads
    img = sitk.ReadImage(filename)
    # Do more stuff and store resulting image
    # SimpleITK is already multithreaded
    return 'someresult'

# [...]
from distributed import Client

client = Client('tcp://scheduler:8786')
futures = client.map(func, ['somefile', 'someotherfile'])
for result in client.gather(futures):
    print(result)
Right now, I have to set the number of threads for each worker to one, in order not to overcommit the CPU on the worker node. But in some cases it makes sense to limit the number of cores used by SimpleITK, because the gain is not so high. Instead, I could run multiple function calls in parallel on the same host.
But in that case I would have to calculate all the core usages by hand.
Ideally, I would like to set an arbitrary number of cores each function can use and dask should decide how many parallel functions invocations are started on each node, given the number of available threads. I.e. is it possible to specify the number of threads a function will use?
No, Dask is not able to limit the number of threads spawned by some function, and it doesn't attempt to measure this either.
The only thing I can think you might want to do is use Dask's abstract resources, where you control how much of each labelled quantity is available per worker and how much each task needs to run.
futures = client.map(func, ['somefile', 'someotherfile'], resources=...)
I don't see an obvious way to assign resources to workers using Cluster() (i.e., the default LocalCluster); you may need to use the CLI.
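A sketch of how that could look, assuming workers are started with a labelled resource, e.g. dask-worker tcp://scheduler:8786 --nthreads 4 --resources "CORES=4" (the CORES label is arbitrary, as long as workers and tasks agree on it):

from distributed import Client

client = Client('tcp://scheduler:8786')

# Each invocation of func claims 4 CORES units, so a worker advertising
# CORES=4 runs only one such task at a time; a lighter-threaded variant
# of func could claim fewer units and run more copies concurrently.
futures = client.map(func, ['somefile', 'someotherfile'],
                     resources={'CORES': 4})
results = client.gather(futures)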

How to pick proper number of threads, workers, processes for Dask when running in an ephemeral environment as single machine and cluster

Our company is currently leveraging prefect.io for data workflows (ELT, report generation, ML, etc). We have just started adding the ability to do parallel task execution, which is powered by Dask. Our flows are executed using ephemeral AWS Fargate containers, which will use Dask LocalCluster with a certain number of workers, threads, processes passed into the LocalCluster object.
Our journey on Dask will look very much like this:
Continue using a single-machine LocalCluster until we outgrow the max CPU/memory allowed
When we outgrow a single container, spawn additional worker containers from the initial container (a la dask-kubernetes) and join them to the LocalCluster.
We're currently starting with containers that have 256 CPU units (.25 vCPU) and 512 MB of memory, pinning the LocalCluster to n_workers=1 and threads_per_worker=3 to get a reasonable amount of parallelism. One worker, since it's a machine with less than 1 vCPU, and 3 threads because that doesn't sound crazy to me based on my previous experience running other python based applications in Fargate. This seems to work fine in a very simple example that just maps a function against a list of items.
from pathlib import Path
from time import sleep

from prefect import Flow, task

RENEWAL_TABLES = [
    'Activity',
    'CurrentPolicyTermStaus',
    'PolicyRenewalStatus',
    'PolicyTerm',
    'PolicyTermStatus',
    'EndorsementPolicyTerm',
    'PolicyLifeState'
]
RENEWAL_TABLES_PAIRS = [
    (i, 1433 + idx) for idx, i in enumerate(RENEWAL_TABLES)
]

@task(state_handlers=[HANDLER])
def dummy_step():
    LOGGER.info('Dummy Step...')
    sleep(15)

@task(state_handlers=[HANDLER])
def test_map(table):
    LOGGER.info('table: {}...'.format(table))
    sleep(15)

with Flow(Path(__file__).stem, SCHEDULE, state_handlers=[HANDLER]) as flow:
    first_step = dummy_step()
    test_map.map(RENEWAL_TABLES_PAIRS).set_dependencies(upstream_tasks=[first_step])
I see no more than 3 tasks executed at once.
I would really like to understand how to best configure n_workers (single machine), threads, and processes as we expand the size of the single machine and add remote workers. I know it depends on the workload, but you could see a combination of things in a single flow where one task does an extract from a database to a csv and another task runs a pandas computation. I have seen suggestions online that threads should equal the number of CPUs requested, but it seems like you can still achieve parallelism with less than one CPU in Fargate.
Any feedback would be appreciated and could help others looking to leverage Dask in a more ephemeral nature.
Given that Fargate increments from .25 -> .50 -> 1 -> 2 -> 4 vCPUs, I think it's safe to go with a 1 worker to 1 vCPU setup. However, it would be helpful to understand how to choose a good upper limit for the number of threads per worker given how Fargate vCPU allotment works.
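As a starting point, a sketch of that sizing (the numbers are illustrative assumptions, not a recommendation):

from dask.distributed import Client, LocalCluster

# Illustrative sizing for a 1-vCPU Fargate task: one worker process per
# vCPU; more threads help I/O-bound tasks (database extracts, object-store
# reads), while CPU-bound pandas work rarely benefits from
# threads_per_worker greater than the vCPU count.
cluster = LocalCluster(n_workers=1, threads_per_worker=4)
client = Client(cluster)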

Why does a dask worker fail due to MemoryError on a "small" task? [Dask.bag]

I am running a pipeline on multiple images. The pipeline consists of reading the images from the file system, doing some processing on each of them, then saving the images to the file system. However, the dask workers fail due to MemoryError.
Is there a way to assure the dask workers don't load too many images in memory? i.e. Wait until there is enough space on a worker before starting the processing pipeline on a new image.
I have one scheduler and 40 workers with 4 cores, 15GB of RAM, running CentOS 7. I am trying to process 125 images in a batch; each image is fairly large but small enough to fit on a worker; around 3GB is required for the whole process.
I tried to process a smaller number of images and it worked great.
EDITED
from dask.distributed import Client, LocalCluster

# LocalCluster is used to show the config of the workers on the actual cluster
client = Client(LocalCluster(n_workers=2, resources={'process': 1}))

paths = ['list', 'of', 'paths']

# Read the file data from each path
data = client.map(read, paths, resources={'process': 1})

# Apply foo to the data n times
for _ in range(n):
    data = client.map(foo, data, resources={'process': 1})

# Save the processed data
data = client.map(save, data, resources={'process': 1})

# Retrieve results
client.gather(data)
I expected the images to be processed as space became available on the workers, but it seems like the images are all loaded simultaneously on the different workers.
EDIT:
My issue is that all tasks get assigned to workers and the workers don't have enough memory. I found how to limit the number of tasks a worker handles at a single moment (see https://distributed.readthedocs.io/en/latest/resources.html#resources-are-applied-separately-to-each-worker-process).
However, with that limit, when I execute my tasks they all finish the read step, then the process step and finally the save step. This is an issue since the images are spilled to disk.
Would there be a way to make every task finish before starting a new one?
e.g. on Worker-1: read(img1)->process(img1)->save(img1)->read(img2)->...
Dask does not generally know how much memory a task will need; it can only know the size of the outputs, and that only once they are finished. This is because Dask simply executes a python function and then waits for it to complete, but all sorts of things can happen within a python function. You should generally expect as many tasks to begin as you have available worker cores - as you are finding.
If you want a smaller total memory load, then your solution should be simple: have a small enough number of workers, so that if all of them are using the maximum memory that you can expect, you still have some spare in the system to cope.
Regarding your EDIT: you may want to try running optimize on the graph before submission (although this should happen anyway, I think), as it sounds like your linear chains of tasks should be "fused". http://docs.dask.org/en/latest/optimize.html
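If the fusing doesn't happen through client.map, one way to force each image through its whole chain is to express the chain explicitly; a sketch using dask.delayed, reusing the read/foo/save/n names from your question:

import dask

# Each image's pipeline becomes one linear delayed chain, so
# read -> foo (n times) -> save can be fused into a single task per
# image, and an image finishes before its intermediates pile up.
results = []
for path in paths:
    img = dask.delayed(read)(path)
    for _ in range(n):
        img = dask.delayed(foo)(img)
    results.append(dask.delayed(save)(img))

futures = client.compute(results)
client.gather(futures)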

Dask distributed perform computations without returning data

I have a dynamic Dask Kubernetes cluster.
I want to load 35 parquet files (about 1.2GB) from Gcloud storage into a Dask Dataframe, process them with apply(), and then save the result to a parquet file on Gcloud.
While loading files from Gcloud storage, the cluster's memory usage increases to about 3-4GB. Then the workers (each worker has 2GB of RAM) are terminated/restarted and some tasks get lost,
so the cluster starts computing the same things in a circle.
I removed the apply() operation and left only read_parquet() to test
whether my custom code was causing the trouble, but the problem was the same, even with just a single read_parquet() operation. This is the code:
from dask.distributed import Client, get_client
import dask.dataframe as dd

client = Client('<ip>:8786')
client.restart()

def command():
    client = get_client()
    df = dd.read_parquet('gcs://<bucket>/files/name_*.parquet',
                         storage_options={'token': 'cloud'},
                         engine='fastparquet')
    df = df.compute()

x = client.submit(command)
x.result()
Note: I'm submitting a single command function to run all the necessary commands, to avoid problems with gcsfs authentication inside the cluster.
After some investigation, I understood that the problem could be in .compute(), which returns all the data to a process, but this process (my command function) is running on a worker. Because of that, the worker doesn't have enough RAM, crashes and loses all the computed tasks, which triggers task re-runs.
My goal is:
to read from parquet files
perform some computations with apply()
and write it back to Gcloud storage in parquet format without ever returning the data from the cluster.
Simply put, I want to keep the data on the cluster and not return it: just compute and save the data somewhere else.
After reading Dask distributed docs, I have found client.persist()/compute() and .scatter() methods. They look like what I need, but I don't really understand how to use them.
Could you please help me with the client.persist() and client.compute() methods for my example, or suggest another way to do it? Thank you very much!
Dask version: 0.19.1
Dask distributed version: 1.23.1
Python version: 3.5.1
df = dd.read_parquet('gcs://<bucket>/files/name_*.parquet', storage_options={'token':'cloud'}, engine='fastparquet')
df = df.compute() # this triggers computations, but brings all of the data to one machine and creates a Pandas dataframe
df = df.persist() # this triggers computations, but keeps all of the data in multiple pandas dataframes spread across multiple machines
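Putting that together with your goal, a sketch of the keep-it-on-the-cluster pattern (the output path and my_row_func are hypothetical placeholders for your apply() step): chaining read_parquet -> apply -> to_parquet writes directly from the workers, so nothing is gathered back to one machine:

import dask.dataframe as dd

df = dd.read_parquet('gcs://<bucket>/files/name_*.parquet',
                     storage_options={'token': 'cloud'},
                     engine='fastparquet')
# my_row_func is a placeholder for the real per-row computation
df['result'] = df.apply(my_row_func, axis=1, meta=('result', 'object'))
# to_parquet runs on the workers; no data returns to the client
df.to_parquet('gcs://<bucket>/output', storage_options={'token': 'cloud'},
              engine='fastparquet')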

Openwhisk: scaling issues on the distributed setup

I set up a distributed openwhisk installation on a few virtual machines as described here: https://github.com/apache/incubator-openwhisk/blob/master/ansible/README_DISTRIBUTED.md (I also had to install some dependencies on the VMs manually because they were expected but not installed by default).
My host file looks like this:
; the first parameter in a host is the inventory_hostname
; used for local actions only
ansible ansible_connection=local
[registry]
xxx.xx.xx.173 ansible_host=xxx.xx.xx.173
[edge]
xxx.xx.xx.176 ansible_host=xxx.xx.xx.176
[apigateway:children]
edge
[redis:children]
edge
[controllers]
xxx.xx.xx.174 ansible_host=xxx.xx.xx.174
xxx.xx.xx.175 ansible_host=xxx.xx.xx.175
[kafkas]
xxx.xx.xx.176 ansible_host=xxx.xx.xx.176
[zookeepers:children]
kafkas
[invokers]
xxx.xx.xx.174 ansible_host=xxx.xx.xx.174
xxx.xx.xx.175 ansible_host=xxx.xx.xx.175
[db]
xxx.xx.xx.176 ansible_host=xxx.xx.xx.176
Everything seems to be running fine in general; I can create actions, invoke them, etc.
On the two VMs which host the invokers and controllers, I opened htop to check the CPU usage and tried running a python script that invokes the same action (a prime number calculation, which takes time for a large enough input) multiple times in parallel.
The result seems to be that the first invoker works at 100% CPU while the calculation is happening, while the second one idles at 5-7% CPU. I also tried different ways of distributing components across multiple VMs, e.g. placing the invokers on two machines and one controller separately on another machine, but the result is the same.
What could be the reason for this? And what could be the proper use case to make Openwhisk get the second invoker involved?
In a small deployment, a fraction of the invoker pool is allocated strictly for docker actions. This is called the blackbox fraction, which is 10% by default (with a minimum of 1 invoker, which is why you see one loaded invoker and one idle).
This recent pull request allows all the invokers to be used for small numbers of invokers (up to the reciprocal of the blackbox fraction): https://github.com/apache/incubator-openwhisk/pull/3751
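To illustrate the arithmetic behind that default for your two-invoker deployment (this is an illustration of the numbers in the answer, not OpenWhisk's actual scheduling code):

import math

# With 2 invokers and the default 10% blackbox fraction, the pool
# reserved for docker (blackbox) actions is max(1, floor(0.10 * 2)) = 1,
# leaving a single invoker for all other actions: one busy, one idle.
num_invokers = 2
blackbox_fraction = 0.10
blackbox_invokers = max(1, math.floor(blackbox_fraction * num_invokers))
managed_invokers = num_invokers - blackbox_invokers
print(blackbox_invokers, managed_invokers)  # -> 1 1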
