I have two tiers: MEM+SSD. The MEM tier is almost always about 90% full, and sometimes the SSD tier is also full.
Now this kind of message sometimes spams my log:
2022-06-14 07:11:43,607 WARN TieredBlockStore - Target tier: BlockStoreLocation{TierAlias=MEM, DirIndex=0, MediumType=MEM} has no available space to store 67108864 bytes for session: -4254416005596851101
2022-06-14 07:11:43,607 WARN BlockTransferExecutor - Transfer-order: BlockTransferInfo{TransferType=SWAP, SrcBlockId=36401609441282, DstBlockId=36240078405636, SrcLocation=BlockStoreLocation{TierAlias=MEM, DirIndex=0, MediumType=MEM}, DstLocation=BlockStoreLocation{TierAlias=SSD, DirIndex=0, MediumType=SSD}} failed. alluxio.exception.WorkerOutOfSpaceException: Failed to find space in BlockStoreLocation{TierAlias=MEM, DirIndex=0, MediumType=MEM} to move blockId 36240078405636
2022-06-14 07:11:43,607 WARN AlignTask - Insufficient space for worker swap space, swap restore task called.
Is my setup flawed? What can I do to get rid of these warnings?
It looks like the Alluxio worker is trying to move/swap some blocks, but there is not enough space to finish the operation. I guess it is caused by both the SSD and MEM tiers being full. Have you tried the property alluxio.worker.tieredstore.free.ahead.bytes? It can help determine whether the swap failed due to insufficient storage space.
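For example, in alluxio-site.properties on the workers (the 2GB value below is only illustrative; pick something that fits your tier sizes):

# Free this much extra space beyond what is immediately required
# whenever a tier fills up (illustrative value).
alluxio.worker.tieredstore.free.ahead.bytes=2GB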
I'm trying to run a relatively large Dask LightGBM task on a relatively small machine (32GB RAM, 8 cores), so I cap the memory usage to 20GB... The dataset is about 100M rows with 50 columns. I know it is large, but aren't we trying to do out-of-core ML?
import lightgbm
from dask.distributed import Client

client = Client(memory_limit='20GB', processes=False,
                n_workers=1, threads_per_worker=7)
params = {"max_depth": 4, "n_estimators": 800, "client": client}
learner = lightgbm.DaskLGBMRegressor(**params)
learner.fit(dd_feature_009a013a_train[x_columns], dd_price_solely_y_train[y_column_now])
However, errors are output and the process dies:
/home/ubuntu/anaconda3/lib/python3.8/site-packages/lightgbm/dask.py:317: UserWarning: Parameter n_jobs will be ignored.
_log_warning(f"Parameter {param_alias} will be ignored.")
Finding random open ports for workers
distributed.worker - WARNING - Worker is at 85% memory usage. Pausing worker. Process memory: 15.93 GiB -- Worker memory limit: 18.63 GiB
distributed.comm.inproc - WARNING - Closing dangling queue in <InProc local=inproc://172.31.91.159/37355/1 remote=inproc://172.31.91.159/37355/9>
distributed.worker - WARNING - Memory use is high but worker has no data to store to disk. Perhaps some other process is leaking memory? Process memory: 26.48 GiB -- Worker memory limit: 18.63 GiB
distributed.worker - WARNING - Memory use is high but worker has no data to store to disk. Perhaps some other process is leaking memory? Process memory: 26.48 GiB -- Worker memory limit: 18.63 GiB
Without an MCVE it's difficult to answer your question precisely.
The "Memory use is high" error could be thrown for a few different reasons. I found this resource by a core Dask maintainer helpful in diagnosing the exact issue.
To summarise, consider:
Breaking your data into smaller chunks.
Manually triggering garbage collection and/or tweaking the gc settings on the workers through a Worker Plugin.
Trimming memory using malloc_trim (especially if working with non-NumPy data or small NumPy chunks). A rough sketch combining these ideas follows after this list.
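Below is a minimal sketch of those three ideas, assuming a Linux/glibc worker; the gc thresholds, partition size and plugin name are illustrative, not prescriptive:

import ctypes
import gc

from dask.distributed import Client, WorkerPlugin


def trim_memory() -> int:
    # Ask glibc to return freed memory to the OS (Linux/glibc only).
    return ctypes.CDLL("libc.so.6").malloc_trim(0)


class AggressiveGC(WorkerPlugin):
    # Make the garbage collector kick in more eagerly on each worker.
    def setup(self, worker):
        gc.set_threshold(100, 5, 5)
        gc.collect()


client = Client(memory_limit='20GB', processes=False,
                n_workers=1, threads_per_worker=7)
client.register_worker_plugin(AggressiveGC())

# Smaller chunks: repartition the training dataframe before calling .fit(),
# e.g. ddf = ddf.repartition(partition_size="100MB")

# Trim unmanaged memory on all workers between heavy steps.
client.run(trim_memory)

Whether the gc tweaks or malloc_trim actually help depends on where the memory is going.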
I'd also advise you to make sure you can see the Dask Dashboard while your computations are running to figure out which approach is working.
My Jenkins instance has been running for over two years without issue, but yesterday it quit responding to HTTP requests. No errors, just endless loading spinners.
I've restarted the service, then restarted the entire server.
There's been a lot of mention of a thread dump. I attempted to get one, but I'm not sure that what follows is actually it.
Heap
PSYoungGen total 663552K, used 244203K [0x00000000d6700000, 0x0000000100000000, 0x0000000100000000)
eden space 646144K, 36% used [0x00000000d6700000,0x00000000e4df5f70,0x00000000fde00000)
from space 17408K, 44% used [0x00000000fef00000,0x00000000ff685060,0x0000000100000000)
to space 17408K, 0% used [0x00000000fde00000,0x00000000fde00000,0x00000000fef00000)
ParOldGen total 194048K, used 85627K [0x0000000083400000, 0x000000008f180000, 0x00000000d6700000)
object space 194048K, 44% used [0x0000000083400000,0x000000008879ee10,0x000000008f180000)
Metaspace used 96605K, capacity 104986K, committed 105108K, reserved 1138688K
class space used 12782K, capacity 14961K, committed 14996K, reserved 1048576K
Ubuntu 16.04.5 LTS
I prefer looking in the Jenkins log file. There you can see the errors and then fix them.
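For example (the log path assumes the standard Debian/Ubuntu package install; <jenkins-pid> is a placeholder for the Jenkins process id):

# Recent errors and stack traces from Jenkins itself
tail -n 200 /var/log/jenkins/jenkins.log

# A proper thread dump; the output you pasted is only the heap summary
sudo -u jenkins jstack <jenkins-pid> > /tmp/jenkins-threads.txt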
I want to run thousands of identical single-thread simulations with different random seeds (which I pass to my program). Some of them have run out of memory, and I don't know why. I call run_batch_job as sbatch --array=0-999%100 --mem=200M run_batch_job, where run_batch_job contains:
#!/bin/env bash
#SBATCH --ntasks=1 # One task per array element
#SBATCH --nodes=1 # Run that task on a single node
srun my_program.out $SLURM_ARRAY_TASK_ID
For a single thread, 200M should be more than enough memory, yet for some simulations, I get the error:
slurmstepd: error: Exceeded step memory limit at some point.
slurmstepd: error: Exceeded job memory limit at some point.
srun: error: cluster-cn002: task 0: Out Of Memory
slurmstepd: error: Exceeded job memory limit at some point.
Am I allocating 200M to each of the thousand threads, or am I doing something wrong?
EDIT: I've tried specifying --cpus-per-task=1 and --mem-per-cpu=200M instead of --ntasks=1, --nodes=1 and --mem=200M, with the same results.
Your submission is correct: each array task is an independent job that gets its own 200M. But 200M might be too low depending on the libraries you use or the files you read. Request at least 2G, as virtually all clusters have at least 2GB of memory per core.
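For example, the same submission from the question with only the memory request raised:

sbatch --array=0-999%100 --mem=2G run_batch_job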
The machine's status is described below:
The machine has 96G of physical memory. Real usage is about 64G and the page cache uses about 32G. We also use swap; at the time about 10G of swap was in use (the swap size is set to 32G). At that moment, XFS reported:
Apr 29 21:54:31 w-openstack86 kernel: XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
After reading the source code, I found that this message is printed from these lines:
ptr = kmalloc(size, lflags);
if (ptr || (flags & (KM_MAYFAIL|KM_NOSLEEP)))
        return ptr;
if (!(++retries % 100))
        xfs_err(NULL,
                "possible memory allocation deadlock in %s (mode:0x%x)",
                __func__, lflags);
congestion_wait(BLK_RW_ASYNC, HZ/50);
The message is caused by the kmalloc() call failing: there is not enough memory in the system. But there is still 32G of page cache.
So I ran
echo 3 > /proc/sys/vm/drop_caches
to drop the page cache.
Then the system was fine. But I really don't know the reason.
Why does kmalloc() succeed after I run the drop_caches operation? Even if the whole of physical memory is in use, only about 64G is real usage, the other 32G is page cache, and we still have enough swap space. So why doesn't the kernel flush the page cache, or swap something out, to satisfy the kmalloc() allocation?
It seems like 2 million floats should be no big deal, only 8 MB out of 1 GB of GPU RAM. I am able to allocate that much at times, and sometimes more than that, with no trouble. I get CL_OUT_OF_RESOURCES when I do a clEnqueueReadBuffer, which seems odd. Am I able to sniff out where the trouble really started? OpenCL shouldn't be failing like this at clEnqueueReadBuffer, right? It should be when I allocated the data, right? Is there some way to get more details than just the error code? It would be cool if I could see how much VRAM was allocated when OpenCL declared CL_OUT_OF_RESOURCES.
I just had the same problem you had (took me a whole day to fix).
I'm sure people with the same problem will stumble upon this, which is why I'm posting to this old question.
You probably didn't check the maximum work group size of the kernel.
This is how you do it:
size_t kernel_work_group_size;
clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE, sizeof(size_t), &kernel_work_group_size, NULL);
My devices (2x NVIDIA GTX 460 & Intel i7 CPU) support a maximum work group size of 1024, but the above code returns something around 500 when I pass my Path Tracing kernel.
When I used a workgroup size of 1024 it obviously failed and gave me the CL_OUT_OF_RESOURCES error.
The more complex your kernel becomes, the smaller its maximum workgroup size will become (or at least that's what I experienced).
Edit:
I just realized you said "clEnqueueReadBuffer" instead of "clEnqueueNDRangeKernel"...
My answer was related to clEnqueueNDRangeKernel.
Sorry for the mistake.
I hope this is still useful to other people.
From another source:
- calling clFinish() gets you the error status for the kernel execution (rather than only getting it when you try to read data back); see the sketch below.
- the "out of resources" error can also be caused by a 5s timeout if the (NVidia) card is also being used as a display
- it can also appear when you have pointer errors in your kernel.
A follow-up suggests running the kernel first on the CPU to ensure you're not making out-of-bounds memory accesses.
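A minimal sketch of the first point, assuming a command queue, kernel and work sizes have already been created elsewhere (run_and_check is just an illustrative name):

#include <stdio.h>
#include <CL/cl.h>   /* <OpenCL/opencl.h> on macOS */

/* Surface kernel-execution errors at clFinish() instead of discovering
 * them later at clEnqueueReadBuffer(). */
static cl_int run_and_check(cl_command_queue queue, cl_kernel kernel,
                            size_t global_size, size_t local_size)
{
    cl_int err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                                        &global_size, &local_size,
                                        0, NULL, NULL);
    if (err != CL_SUCCESS) {
        fprintf(stderr, "clEnqueueNDRangeKernel failed: %d\n", err);
        return err;
    }

    /* Failures during execution (e.g. CL_OUT_OF_RESOURCES) often only
     * show up at the next blocking call. */
    err = clFinish(queue);
    if (err != CL_SUCCESS)
        fprintf(stderr, "clFinish reported: %d\n", err);
    return err;
}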
Not all available memory can necessarily be supplied to a single allocation request. Read up on heap fragmentation [1], [2], [3] to learn why the largest allocation that can succeed is the size of the largest contiguous block of memory, and how blocks get divided into smaller pieces as a result of using the memory.
It's not that the resource is exhausted... It just can't find a single piece big enough to satisfy your request...
Out-of-bounds accesses in a kernel are typically silent (there is still no error at the kernel enqueue call).
However, when you later try to read the kernel result with clEnqueueReadBuffer(), this error will show up. It indicates something went wrong during kernel execution.
Check your kernel code for out-of-bounds reads/writes.