I'm trying to use Dask local cluster to manage system wide memory usage,
from dask.distributed import Client, LocalCluster
cluster = LocalCluster(scheduler_port=5272, dashboard_address=5273,memory_limit='4GB')
I connect with:
client = Client('tcp://127.0.0.1:5272')
I have 8 cores and 32 GB. The local cluster distributes 4GB * 4 = 16GB memory (I have another task that required about 10GB memory) into the local cluster. However, previously there are some tasks I could finish well without calling client = Client('tcp://127.0.0.1:5272'). After I call client = Client('tcp://127.0.0.1:5272'), memory error triggered. What can i do in this scenario? Thanks!
I'm thinking if it is because each worker is only allocated 4GB memory... but if I assign memory_limit='16GB'. If it uses all of the resources it would take 64GB. I don't have that much memory. What can I do?
It's not clear what you are trying to achieve, but your observation on memory is correct. If a worker is constrained by memory, then they won't be able to complete the task. What are ways out of this?
getting access to more resources, if you don't have access to additional hardware, then you can check coiled.io or look into the various dask cloud options
optimizing your code, perhaps some calculations could be done in smaller chunks, data could be compressed (e.g. categorical dtype) or there are other opportunities to reduce memory requirements (really depends on the functions, but let's say some internal calculation could be done at a smaller accuracy with fewer resources)
using all available resources with a non-distributed code (which would add some overhead to the resource requirements).
Related
PCIE bus bandwidth latencies force constraints on how and when applications should copy data to and from GPUs.
When working with cuDF directly, I can efficiently move a single large chunk of data into a single DataFrame.
When using dask_cudf to partition my DataFrames, does Dask copy partitions into GPU memory one at a time? In batches? If so, is there significant overhead from multiple copy operations instead of a single larger copy?
This probably depends on the scheduler you're using. As of 2019-02-19 dask-cudf uses the single-threaded scheduler by default (cudf segfaulted for a while there if used in multiple threads), so any transfers would be sequential if you're not using some dask.distributed cluster. If you're using a dask.distributed cluster, then presumably this would happen across each of your GPUs concurrently.
It's worth noting that dask.dataframe + cudf doesn't do anything special on top of what cudf would do. It's as though you called many cudf calls in a for loop, or in one for-loop per GPU, depending on the scheduler choice above.
Disclaimer: cudf and dask-cudf are in heavy flux. Future readers probably should check with current documentation before trusting this answer.
What do I get by running multiple nodes on a single host? I am not getting availability, because if the host is down, the whole cluster goes with it. Does it make sense regarding performance? Doesn't one instance of ES take as many resources from the host as it needs?
Generally no, but if you have machines with ridiculous amounts of CPU and memory, you might want that to properly utilize the available resources. Avoiding big heaps with Elasticsearch is a good thing generally since garbage collection on bigger heaps can become a problem and in any case above 32 GB you lose the benefit of pointer compression. Mostly you should not need big heaps with ES. Most of the memory that ES uses is through memory mapped files, which relies on the OS cache. So just because you aren't assigning memory to the heap doesn't mean it is not being used: more memory available for caching means you'll be able to handle bigger shards or more shards.
So if you run more nodes, that advantage goes away and you waste memory on redundant heaps, and you'll have nodes competing for resources. Mostly, you should base these decisions on actual memory, cache, and cpu usage of course.
It depends on your host and how you configure your nodes.
For example, Elastic recommends allocating up to 32GB of RAM (because of how Java compresses pointers) to elasticsearch and have another 32GB for the operating system (mostly for disk caching).
Assuming you have more than 64GB of ram on your host, let's say 128, it makes sense to have two nodes running on the same machine, having both configured to 32GB ram each and leaving another 64 for the operating system.
I am in an HPC environment with clusters, tightly coupled interconnects, and backing Lustre filesystems. We have been exploring how to leverage Dask to not only provide computation, but also to act as a distributed cache to speed up our workflows. Our proprietary data format is n-dimensional and regular, and we have coded a lazy reader to pass into the from_array/from_delayed methods.
We have had some issues with loading and persisting larger-than-memory datasets across a Dask cluster.
Example with hdf5:
# Dask scheduler has been started and connected to 8 workers
# spread out on 8 machines, each with --memory-limit=150e9.
# File locking for reading hdf5 is also turned off
from dask.distributed import Client
c = Client({ip_of_scheduler})
import dask.array as da
import h5py
hf = h5py.File('path_to_600GB_hdf5_file', 'r')
ds = hf[hf.keys()[0]]
x = da.from_array(ds, chunks=(100, -1, -1))
x = c.persist(x) # takes 40 minutes, far below network and filesystem capabilities
print x[300000,:,:].compute() # works as expected
We have also loaded datasets (using slicing, dask.delayed, and from_delayed) from some of our own file file formats, and have seen similar degradation of performance as the file size increases.
My questions: Are there inherent bottlenecks to using Dask as a distributed cache? Will all data be forced to funnel through the scheduler? Are the workers able to take advantage of Lustre, or are functions and/or I/O serialized somehow? If this is the case, would it be more effective to not call persist on massive datasets and just let Dask handle the data and computation when it needs to?
Are there inherent bottlenecks to using Dask as a distributed cache?
There are bottlenecks to every system, but it sounds like you're not close to running into the bottlenecks that I would expect from Dask.
I suspect that you're running into something else.
Will all data be forced to funnel through the scheduler?
No, workers can execute functions that load data on their own. That data will then stay on the workers.
Are the workers able to take advantage of Lustre, or are functions and/or I/O serialized somehow?
Workers are just Python processes, so if Python processes running on your cluster can take advantage of Lustre (this is almost certainly the case) then yes, Dask Workers can take advantage of Lustre.
If this is the case, would it be more effective to not call persist on massive datasets and just let Dask handle the data and computation when it needs to?
This is certainly common. The tradeoff here is between distributed bandwidth to your NFS and the availability of distributed memory.
In your position I would use Dask's diagnostics to figure out what was taking up so much time. You might want to read through the documentation on understanding performance and the section on the dashboard in particular. That section has a video that might be particularly helpful. I would ask two questions:
Are workers running tasks all the time? (status page, Task Stream plot)
Within those tasks, what is taking up time? (profile page)
I have a single page Angular app that makes request to a Rails API service. Both are running on a t2xlarge Ubuntu instance. I am using a Postgres database.
We had increase in traffic, and my Rails API became slow. Sometimes, I get an error saying Passenger queue full for rails application.
Auto scaling on the server is working; three more instances are created. But I cannot trace this issue. I need root access to upgrade, which I do not have. Please help me with this.
As you mentioned that you are using T2.2xlarge instance type. Firstly I want to tell you should not use T2 instance type for production environment. Cause of T2 instance uses CPU Credit. Lets take a look on this
What happens if I use all of my credits?
If your instance uses all of its CPU credit balance, performance
remains at the baseline performance level. If your instance is running
low on credits, your instance’s CPU credit consumption (and therefore
CPU performance) is gradually lowered to the base performance level
over a 15-minute interval, so you will not experience a sharp
performance drop-off when your CPU credits are depleted. If your
instance consistently uses all of its CPU credit balance, we recommend
a larger T2 size or a fixed performance instance type such as M3 or
C3.
Im not sure you won't face to the out of CPU Credit problem because you are using Xlarge type but I think you should use other fixed performance instance types. So instance's performace maybe one part of your problem. You should use cloudwatch to monitor on 2 metrics: CPUCreditUsage and CPUCreditBalance to make sure the problem.
Secondly, how about your ASG? After scale-out, did your service become stable? If so, I think you do not care about this problem any more because ASG did what it's reponsibility.
Please check the following
If you are opening a connection to Database, make sure you close it.
If you are using jquery, bootstrap, datatables, or other css libraries, use the CDN links like
<link rel="stylesheet" ref="https://cdnjs.cloudflare.com/ajax/libs/bootstrap-select/1.12.4/css/bootstrap-select.min.css">
it will reduce a great amount of load on your server. do not copy the jquery or other external libraries on your own server when you can directly fetch it from other servers.
There are a number of factors that can cause an EC2 instance (or any system) to appear to run slowly.
CPU Usage. The higher the CPU usage the longer to process new threads and processes.
Free Memory. Your system needs free memory to process threads, create new processes, etc. How much free memory do you have?
Free Disk Space. Operating systems tend to thrash when the file systems on system drives run low on free disk space. How much free disk space do you have?
Network Bandwidth. What is the average bytes in / out for your
instance?
Database. Monitor connections, free memory, disk bandwidth, etc.
Amazon has CloudWatch which can provide you with monitoring for everything except for free disk space (you can add an agent to your instance for this metric). This will also help you quickly see what is happening with your instances.
Monitor your EC2 instances and your database.
You mention T2 instances. These are burstable CPUs which means that if you have consistenly higher CPU usage, then you will want to switch to fixed performance EC2 instances. CloudWatch should help you figure out what you need (CPU or Memory or Disk or Network performance).
This is totally independent of AWS Server. Looks like your software needs more juice (RAM, StorageIO, Network) and it is not sufficient with one machine. You need to evaluate the metric using cloudwatch and adjust software needs based on what is required for the software.
It could be memory leaks or processing leaks that may lead to this as well. You need to create clusters or server farm to handle the load.
Hope it helps.
I am confused by hadoop namenode memory problem.
when namenode memory usage is higher than a certain percentage (say 75%), reading and writing hdfs files through hadoop api will fail (for example, call some open() will throw exception), what is the reason? Does anyone has the same thing?
PS.This time the namenode disk io is not high, the CPU is relatively idle.
what determines namenode'QPS (Query Per Second) ?
Thanks very much!
Since the namenode is basically just a RPC Server managing a HashMap with the blocks, you have two major memory problems:
Java HashMap is quite costly, its collision resolution (seperate chaining algorithm) is costly as well, because it stores the collided elements in a linked list.
The RPC Server needs threads to handle requests- Hadoop ships with his own RPC framework and you can configure this with the dfs.namenode.service.handler.count for the datanodes it is default set to 10. Or you can configure this dfs.namenode.handler.count for other clients, like MapReduce jobs, JobClients that want to run a job. When a request comes in and it want to create a new handler, it go may out of memory (new Threads are also allocating a good chunk of stack space, maybe you need to increase this).
So these are the reasons why your namenode needs so much memory.
What determines namenode'QPS (Query Per Second) ?
I haven't benchmarked it yet, so I can't give you very good tips on that. Certainly fine tuning the handler counts higher than the number of tasks that can be run in parallel + speculative execution.
Depending on how you submit your jobs, you have to fine tune the other property as well.
Of course you should give the namenode always enough memory, so it has headroom to not fall into full garbage collection cycles.