What is the global memory size of my device?

I have a Tesla C2075 and wanted to know its global memory size, so I ran the deviceQuery SDK sample. It reports 4 GB of global memory, but when I run nvidia-smi -q, it reports 6 GB. Why does this mismatch occur? Is some memory specially dedicated to the OS?
./deviceQuery reports:
CUDA Device Query (Runtime API) version (CUDART static linking)
Found 1 CUDA Capable device(s)
Device 0: "Tesla C2075"
CUDA Driver Version / Runtime Version 5.0 / 5.0
CUDA Capability Major/Minor version number: 2.0
Total amount of global memory: 4096 MBytes (4294967295 bytes)
nvidia-smi -q output:
Memory Usage
Total : 5375 MB
Used : 39 MB
Free : 5336 MB

You're running 32-bit Linux, so only 4GB of device memory is available to your process; the 4294967295 bytes that deviceQuery prints is 2^32 - 1, the largest value a 32-bit size_t can hold.
The device itself still has 6GB, so if two processes share the device they can occupy the full 6GB between them, but each individual process can only use 4GB.
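If you want to sanity-check these numbers programmatically, here is a minimal Python sketch. It uses PyTorch, which is not part of the CUDA 5.0 setup above, and assumes a 64-bit OS with a current driver; it queries the runtime's per-process view and the driver's free/total figures, roughly what deviceQuery and nvidia-smi report, respectively:
import torch

# Total global memory as reported by the CUDA runtime for this process
# (on a 32-bit OS this is where the 4 GB cap shows up).
props = torch.cuda.get_device_properties(0)
print(f"{props.name}: total global memory = {props.total_memory / 2**20:.0f} MiB")

# Free/total memory as seen by the driver (roughly what nvidia-smi -q shows).
free, total = torch.cuda.mem_get_info(0)
print(f"driver view: {free / 2**20:.0f} MiB free of {total / 2**20:.0f} MiB")
On a 64-bit system both figures should line up with what nvidia-smi reports.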

Related

PyTorch running under WSL2 getting "Killed" for Out of memory even though I have a lot of memory left?

I'm on Windows 11, using WSL2 (Windows Subsystem for Linux). I recently upgraded my RAM from 32 GB to 64 GB.
While I can make my computer use more than 32 GB of RAM, WSL2 seems to be refusing to use more than 32 GB. For example, if I do
>>> import torch
>>> a = torch.randn(100000, 100000) # 40 GB tensor
Then I see the memory usage go up until it hits 30-ish GB, at which point I see "Killed" and the Python process dies. Checking dmesg, it says the process was killed because it was "Out of memory".
Any idea what the problem might be, or what the solution is?
According to this blog post, WSL2 is automatically configured to use 50% of the machine's physical RAM. You'll need to add a memory=48GB line (or your preferred setting) to a .wslconfig file placed in your Windows home directory (\Users\{username}\).
[wsl2]
memory=48GB
After adding this file, shut down your distribution and wait at least 8 seconds before restarting.
Keep in mind that Windows 11 itself needs a fair amount of memory to operate, so letting WSL2 use the full 64 GB could starve the Windows host and make it run out of memory.
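To check from inside the distribution whether the new limit took effect, and to see why a 100000 x 100000 float32 tensor blows past the default 32 GB cap, a short sketch along these lines works (psutil is an extra dependency not mentioned in the question):
import psutil
import torch

# Memory the WSL2 VM actually has; this reflects the .wslconfig "memory=" cap,
# not the 64 GB of physical RAM on the Windows host.
vm = psutil.virtual_memory()
print(f"WSL2 sees {vm.total / 2**30:.1f} GiB total, {vm.available / 2**30:.1f} GiB available")

# Rough footprint of the tensor from the question: 100000 x 100000 float32 elements.
need = 100_000 * 100_000 * 4  # 4 bytes per float32, about 37 GiB
print(f"torch.randn(100000, 100000) needs roughly {need / 2**30:.1f} GiB")

if need < vm.available:
    a = torch.randn(100_000, 100_000)
else:
    print("Not enough RAM in the WSL2 VM; raise the limit in .wslconfig")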

RTX 3080 LHR Missing gpu__dram_throughput CUDA metric

As part of a machine learning project, we are optimizing some custom CUDA kernels.
We are trying to profile them using Nsight Compute, but encounter the following error on the LHR RTX 3080 when profiling a simple wrapper around the CUDA kernel:
==ERROR== Failed to access the following 4 metrics: dram__cycles_active.avg.pct_of_peak_sustained_elapsed, dram__cycles_elapsed.avg.per_second, gpu__compute_memory_throughput.avg.pct_of_peak_sustained_elapsed, gpu__dram_throughput.avg.pct_of_peak_sustained_elapsed
==ERROR== Failed to profile kernel "kernel" in process 20204
Running a diff of the metrics available on an RTX 3080 Ti (non-LHR) vs an RTX 3080 (LHR) via nv-nsight-cu-cli --devices 0 --query-metrics (see the sketch after the details below), we notice the following metrics are missing on the RTX 3080 LHR:
gpu__compute_memory_request_throughput
gpu__compute_memory_throughput
gpu__dram_throughput
All of these are required for even basic memory profiling in Nsight Compute; every other metric is present. Is this a limitation of LHR cards? Why would they not be present?
Details:
Gigabyte RTX 3080 Turbo (LHR)
CUDA version: 11.5
Driver version: 497.29
Windows 10
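For reference, the diff was produced along these lines: dump nv-nsight-cu-cli --devices 0 --query-metrics to a text file on each card, then compare the metric names in Python. The file names below are placeholders, and the sketch assumes the metric name is the first whitespace-separated token on each line of the output:
# Compare two saved outputs of `nv-nsight-cu-cli --devices 0 --query-metrics`.
def metric_names(path):
    with open(path) as f:
        return {line.split()[0] for line in f if line.strip()}

lhr = metric_names("rtx3080_lhr_metrics.txt")      # placeholder file name
non_lhr = metric_names("rtx3080ti_metrics.txt")    # placeholder file name

print("Metrics missing on the LHR card:")
for name in sorted(non_lhr - lhr):
    print("  ", name)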
I saw your post on the NVIDIA developer forums, and from the looks of it NVIDIA didn't intend this, so I'd just go with what works (non-LHR) for now until they fix it. Quadro and Tesla cards are also supported by Nsight Compute, so they might be a stopgap solution.
So to answer the main question:
Will buying a non-LHR GPU address this problem?
For right now, yes, buying a non-LHR 3080 should fix the issue.
As per the NVIDIA forums, this is an unintended bug that is fixed by upgrading from CUDA 11.5 to CUDA 11.6; with 11.6, profiling works correctly and all metrics are available.
Working configuration:
Gigabyte RTX 3080 Turbo (LHR)
CUDA version: 11.6
Driver version: 511.23
Windows 10
We don't know why these metrics were unavailable, but the version update is definitely the correct fix.

Huge memory usage in Xilinx Vivado

Vivado consumes all of the free memory on my machine during synthesis, and as a result the machine either hangs or crashes after a while.
I encountered this issue when using Vivado 2018.1 on Windows 10 (w/ 8GB RAM) and Vivado 2020.1 on CentOS 7 (w/ 16GB RAM).
Is there any option in Vivado to limit its memory usage?
If this problem happens when you are synthesizing multiple out-of-context modules, try reducing the Number of Jobs when you start the run.

Reducing Valgrind memory usage for embedded target

I'm trying to use Valgrind to debug a crashing program on an embedded Linux target. The system has roughly 31 MB of free memory when nothing is running, and my program uses about 2 MB of memory, leaving 29 MB for Valgrind. Unfortunately, when I try to run my program under Valgrind, Valgrind reports an error:
Valgrind's memory management: out of memory:
initialiseSector(TC)'s request for 27597024 bytes failed.
50,388,992 bytes have already been mmap-ed ANONYMOUS.
Valgrind cannot continue. Sorry.
Is there any way I can cut down Valgrind's memory usage so it will run successfully in this environment? Or am I just out of luck?
Valgrind can be tuned to decrease (or increase) its CPU and memory usage, at the cost of reporting less (or more) information about problems/bugs.
See e.g. https://archive.fosdem.org/2015/schedule/event/valgrind_tuning/attachments/slides/743/export/events/attachments/valgrind_tuning/slides/743/tuning_V_for_your_workload.pdf
Note, however, that running Valgrind within 31 MB (or so) looks like an impossible task.

Performance issue in Sun OS

I am transforming XML to XML using SAXON9EE. The source file is 200 MB and the XSL is 65 KB.
The transformation time differs from machine to machine:
On Windows Vista, 64-bit, 24 GB RAM: 4 hrs
On Windows XP, 32-bit, 4 GB RAM: 6 hrs
On Linux, 32-bit, 8 GB RAM: 14 hrs
On Sun OS, 64-bit, 32 GB RAM: 48 hrs. Here is some output from the Sun OS top command:
309 processes: 290 sleeping, 3 zombie, 3 stopped, 13 on cpu
CPU states: 63.6% idle, 35.0% user, 1.4% kernel, 0.0% iowait, 0.0% swap
Memory: 32G phys mem, 10G free mem, 4005M total swap, 4005M free swap
On another Sun OS, 64-bit, 8 GB RAM: 48 hrs.
My requirement is to run on Sun OS and reduce the time. Why does Sun OS take so long? I have tried changing the heap size, with no luck. Should I try changing any other parameter?
Please advise.
Thanks,
Siva
