How to configure Dask to use swap space to avoid memory errors - dask

I am running a relatively large task on Dask on a machine with 120 GB RAM and 280 GB swap space.
With this setting, everything is a bit slow but it works:
client = Client(memory_limit='90GB', n_workers=1, threads_per_worker=14)
Observing the computation, I think there is some I/O bottleneck. However, if I use
client = Client()
I eventually get memory errors that cause everything to break. But I figured it didn't try to use the swap space at all... Is there a way to configure Dask to use swap space to avoid memory errors?
Any other advice? For example, is it OK if I do
client = Client(memory_limit='90GB', n_workers=4, threads_per_worker=14)
since, theoretically, 120 GB RAM + 280 GB swap > 4 × 90 GB... Am I making any sense here?
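For what it's worth, Dask's own mechanism for data that doesn't fit in RAM is spilling to disk rather than relying on OS swap. A minimal sketch of that kind of setup, assuming dask.distributed's worker-memory settings; the fractions, the per-worker limit, and the spill directory path below are placeholders for illustration, not recommendations:

import dask
from dask.distributed import Client

# Worker-memory thresholds are fractions of each worker's memory_limit.
dask.config.set({
    "distributed.worker.memory.target": 0.60,     # start spilling to disk
    "distributed.worker.memory.spill": 0.70,      # spill more aggressively
    "distributed.worker.memory.pause": 0.80,      # pause accepting new tasks
    "distributed.worker.memory.terminate": 0.95,  # terminate the worker as a last resort
    "temporary-directory": "/path/to/fast/disk",  # placeholder spill location
})

# Note: memory_limit is per worker, so 4 workers at 30 GB each targets roughly
# 120 GB of RAM before spilling starts; 4 x 90 GB would oversubscribe physical RAM.
client = Client(memory_limit="30GB", n_workers=4, threads_per_worker=14)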

Related

Why does Prometheus consume so much memory?

I'm using Prometheus 2.9.2 for monitoring a large environment of nodes.
As part of testing the maximum scale of Prometheus in our environment, I simulated a large amount of metrics on our test environment.
My management server has 16GB ram and 100GB disk space.
During the scale testing, I've noticed that the Prometheus process consumes more and more memory until the process crashes.
I've noticed that the WAL directory is getting filled fast with a lot of data files while the memory usage of Prometheus rises.
The management server scrapes its nodes every 15 seconds and the storage parameters are all set to default.
I would like to know why this happens, and how/if it is possible to prevent the process from crashing.
Thank you!
The out-of-memory crash is usually the result of an excessively heavy query. This may be set in one of your rules. (The rule may even be running on a Grafana page instead of Prometheus itself.)
If you have a very large number of metrics, it is possible the rule is querying all of them. A quick fix is to specify exactly which metrics to query, using specific labels instead of a regex.
This article explains why Prometheus may use large amounts of memory during data ingestion. If you need to reduce Prometheus's memory usage, the following actions can help:
Increasing scrape_interval in Prometheus configs.
Reducing the number of scrape targets and/or scraped metrics per target.
P.S. Also take a look at the project I work on, VictoriaMetrics. It can use less memory than Prometheus. See this benchmark for details.
Because the combination of labels depends on your business, the combinations and the blocks may be unlimited, so there's no way to fully solve the memory problem with the current design of Prometheus. But I suggest you compact small blocks into big ones; that will reduce the number of blocks.
There is huge memory consumption for two reasons:
The Prometheus TSDB has an in-memory block named "head"; because the head stores all the series of the latest hours, it eats a lot of memory.
Each block on disk also eats memory, because each block on disk has an index reader in memory; dismayingly, all labels, postings and symbols of a block are cached in the index reader struct, so the more blocks on disk, the more memory is occupied.
In index/index.go, you will see:
type Reader struct {
    b ByteSlice

    // Close that releases the underlying resources of the byte slice.
    c io.Closer

    // Cached hashmaps of section offsets.
    labels map[string]uint64
    // LabelName to LabelValue to offset map.
    postings map[string]map[string]uint64
    // Cache of read symbols. Strings that are returned when reading from the
    // block are always backed by true strings held in here rather than
    // strings that are backed by byte slices from the mmap'd index file. This
    // prevents memory faults when applications work with read symbols after
    // the block has been unmapped. The older format has sparse indexes so a map
    // must be used, but the new format is not so we can use a slice.
    symbolsV1        map[uint32]string
    symbolsV2        []string
    symbolsTableSize uint64

    dec *Decoder

    version int
}
We used Prometheus version 2.19 and had significantly better memory performance. This blog post highlights how that release tackles memory problems. I strongly recommend using it to improve your instance's resource consumption.

Why is interfaced DDR3 memory slow compared to on-chip memory in an FPGA?

I am using a MAX 10 FPGA and have interfaced DDR3 memory. I have noticed that my DDR3 memory is slow compared to on-chip memory. I noticed this because I wrote a blinking-LED program, and the same delay function runs faster from on-chip memory than from DDR3 memory. What can be done to increase the speed? And what might be wrong? My system clock is running at 50 MHz.
P.S. There are no instruction or data caches in my system.
First, your function is not pipelined, going by your description: you do something with memory and then blink the LED, so everything runs in sequence.
In this case, you should estimate the response time and throughput of your memory. For example, you read a value from memory and then do an add, and you do this 10 times. If you always read memory after the add, your total time is about 10 × response time + 10 × add time.
The difference is memory response time. Internal RAM's response time can be 1 cycle at 50 MHz, but DDR3 memory is about 80 ns. That's the difference.
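To put rough numbers on that, here is a back-of-the-envelope sketch in Python using the figures quoted above, with the add itself assumed to take one cycle:

# Latency comparison for 10 dependent read+add steps, non-pipelined,
# using the figures above: 1 cycle on-chip at 50 MHz vs. ~80 ns for DDR3.
CLOCK_HZ = 50e6
CYCLE_NS = 1e9 / CLOCK_HZ            # 20 ns per cycle at 50 MHz

onchip_read_ns = 1 * CYCLE_NS        # ~1-cycle response from on-chip RAM
ddr3_read_ns = 80.0                  # rough DDR3 response time from above
add_ns = 1 * CYCLE_NS                # assume the add itself takes one cycle

steps = 10
onchip_total = steps * (onchip_read_ns + add_ns)   # 10 * (20 + 20) = 400 ns
ddr3_total = steps * (ddr3_read_ns + add_ns)       # 10 * (80 + 20) = 1000 ns

print(f"on-chip: {onchip_total:.0f} ns, DDR3: {ddr3_total:.0f} ns, "
      f"ratio: {ddr3_total / onchip_total:.1f}x")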
But you can change your module to a pipelined pattern: read/write data and do your other work in parallel, and read/write DDR ahead of time. That's like a cache in a PC. This can save some time.
By the way, DDR throughput highly depends on your access pattern. If you read or write data at sequential addresses, you will get higher throughput.
In the end, external memory's throughput and response time can never beat internal memory's.
Forgive my English.

CL_OUT_OF_RESOURCES for 2 million floats with 1GB VRAM?

It seems like 2 million floats should be no big deal, only 8 MB out of 1 GB of GPU RAM. I am able to allocate that much at times, and sometimes more than that, with no trouble. I get CL_OUT_OF_RESOURCES when I do a clEnqueueReadBuffer, which seems odd. Am I able to sniff out where the trouble really started? OpenCL shouldn't be failing like this at clEnqueueReadBuffer, right? It should have failed when I allocated the data, right? Is there some way to get more details than just the error code? It would be cool if I could see how much VRAM was allocated when OpenCL declared CL_OUT_OF_RESOURCES.
I just had the same problem you had (took me a whole day to fix).
I'm sure people with the same problem will stumble upon this, that's why I'm posting to this old question.
You probably didn't check the maximum work group size of the kernel.
This is how you do it:
size_t kernel_work_group_size;
clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE, sizeof(size_t), &kernel_work_group_size, NULL);
My devices (2x NVIDIA GTX 460 & Intel i7 CPU) support a maximum work group size of 1024, but the above code returns something around 500 when I pass my Path Tracing kernel.
When I used a workgroup size of 1024 it obviously failed and gave me the CL_OUT_OF_RESOURCES error.
The more complex your kernel becomes, the smaller the maximum workgroup size for it will become (or that's at least what I experienced).
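For anyone hitting this from Python, here is a rough equivalent of the check above using pyopencl; it is only a sketch, assuming pyopencl and a working OpenCL driver, and the kernel file name and kernel name are placeholders:

import pyopencl as cl

ctx = cl.create_some_context()
device = ctx.devices[0]

# "kernels.cl" and "my_kernel" are placeholders for your own program/kernel.
program = cl.Program(ctx, open("kernels.cl").read()).build()
kernel = program.my_kernel

# Per-kernel limit: this can be well below the device-wide maximum
# for complex kernels (heavy register/local-memory usage).
kernel_wg = kernel.get_work_group_info(
    cl.kernel_work_group_info.WORK_GROUP_SIZE, device)
device_wg = device.max_work_group_size

local_size = min(kernel_wg, device_wg)
print(f"kernel limit: {kernel_wg}, device limit: {device_wg}, using {local_size}")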
Edit:
I just realized you said "clEnqueueReadBuffer" instead of "clEnqueueNDRangeKernel"...
My answer was related to clEnqueueNDRangeKernel.
Sorry for the mistake.
I hope this is still useful to other people.
From another source:
- calling clFinish() gets you the error status for the calculation (rather than getting it when you try to read data).
- the "out of resources" error can also be caused by a 5s timeout if the (NVidia) card is also being used as a display
- it can also appear when you have pointer errors in your kernel.
A follow-up suggests running the kernel first on the CPU to ensure you're not making out-of-bounds memory accesses.
Not all available memory can necessarily be supplied to a single allocation request. Read up on heap fragmentation 1, 2, 3 to learn why the largest allocation that can succeed is the largest contiguous block of memory, and how blocks get divided into smaller pieces as a result of using the memory.
It's not that the resource is exhausted... It just can't find a single piece big enough to satisfy your request...
Out-of-bounds accesses in a kernel are typically silent (there is still no error at the kernel enqueue call).
However, if you try to read the kernel result later with clEnqueueReadBuffer(), this error will show up; it indicates something went wrong during kernel execution.
Check your kernel code for out-of-bounds read/writes.
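Here is a small pyopencl sketch of both points: the in-kernel bounds guard, and finishing the queue right after the launch so execution errors surface there rather than at the later read. The kernel and buffer sizes are made up for illustration:

import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

# The guard `if (gid >= n) return;` keeps padded work-items from writing
# past the end of the buffer, a common cause of errors that only show up later.
src = """
__kernel void scale(__global float *buf, const unsigned int n) {
    size_t gid = get_global_id(0);
    if (gid >= n) return;
    buf[gid] *= 2.0f;
}
"""
kernel = cl.Program(ctx, src).build().scale

n = 2_000_000
host = np.arange(n, dtype=np.float32)
mf = cl.mem_flags
buf = cl.Buffer(ctx, mf.READ_WRITE | mf.COPY_HOST_PTR, hostbuf=host)

local_size = 256
global_size = ((n + local_size - 1) // local_size) * local_size  # padded up
kernel(queue, (global_size,), (local_size,), buf, np.uint32(n))

queue.finish()                      # execution errors are reported here...
cl.enqueue_copy(queue, host, buf)   # ...instead of being blamed on this read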

GC and memory limit issues with R

I am using R on some relatively big data and am hitting some memory issues. This is on Linux. I have significantly less data than the available memory on the system so it's an issue of managing transient allocation.
When I run gc(), I get the following listing
            used   (Mb) gc trigger   (Mb)  max used   (Mb)
Ncells   2147186  114.7    3215540  171.8   2945794  157.4
Vcells 251427223 1918.3  592488509 4520.4 592482377 4520.3
yet R appears to have 4 GB allocated in resident memory and 2 GB in swap. I'm assuming this is OS-allocated memory that R's memory management system will allocate and GC as needed. However, let's say that I don't want to let R OS-allocate more than 4 GB, to prevent swap thrashing. I could always ulimit, but then it would just crash instead of working within the reduced space and GCing more often. Is there a way to specify an arbitrary maximum for the gc trigger and make sure that R never OS-allocates more? Or is there something else I could do to manage memory usage?
In short: no. I found that you simply cannot micromanage memory management and gc().
On the other hand, you could try to keep your data in memory, but 'outside' of R. The bigmemory package makes that fairly easy. Of course, using a 64-bit version of R and ample RAM may make the problem go away too.

Why is Erlang crashing on large sequences?

I have just started learning Erlang and am trying out some Project Euler problems to get started. However, I seem to be unable to do any operations on large sequences without crashing the Erlang shell.
I.e., even this:
lists:seq(1,64000000).
crashes Erlang with the error:
eheap_alloc: Cannot allocate 467078560 bytes of memory (of type "heap").
The actual number of bytes varies, of course.
Now half a gig is a lot of memory, but a system with 4 gigs of RAM and plenty of space for virtual memory should be able to handle it.
Is there a way to let Erlang use more memory?
Your OS may have a default limit on the size of a user process. On Linux you can change this with ulimit.
You probably want to iterate over these 64000000 numbers without needing them all in memory at once. Lazy lists let you write code similar in style to the list-all-at-once code:
-module(lazy).
-export([seq/2]).

seq(M, N) when M =< N ->
    fun() -> [M | seq(M+1, N)] end;
seq(_, _) ->
    fun() -> [] end.
1> Ns = lazy:seq(1, 64000000).
#Fun<lazy.0.26378159>
2> hd(Ns()).
1
3> Ns2 = tl(Ns()).
#Fun<lazy.0.26378159>
4> hd(Ns2()).
2
Possibly a noob answer (I'm a Java dev), but the JVM artificially limits the amount of memory to help detect memory leaks more easily. Perhaps Erlang has similar restrictions in place?
This is a feature. We do not want one process to consume all the memory. It's like the fuse box in your house: it's for the safety of us all.
You have to know Erlang's recovery model to understand why they let the process just die.
Also, both Windows and Linux have limits on the maximum amount of memory an image can occupy.
As I recall, on Linux it is half a gigabyte.
The real question is why these operations aren't being done lazily ;)
