How many bytes is a GB of GPU memory?

NVIDIA specifies that their GeForce RTX 3090 (Ti) has 24 GB of memory. How are you supposed to know how much data you can fit on it, when some sources use 1 GB = 1024^3 bytes, while other sources use 1 GB = 1000^3 bytes? Can you assume that hardware manufacturers always use base 1000, since that lets them write a higher number in the specifications, or do some hardware manufacturers still use base 1024?

I would expect 1 GB = 1024^3 bytes, as I have only encountered 1 GB = 1000^3 bytes in the context of storage, such as SSDs.
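For concreteness, here is a minimal Python sketch of how far apart the two readings are for a 24 GB card (only the 24 comes from the spec; the variable names and the comparison are my illustration):

gb_decimal = 24 * 1000**3   # 24,000,000,000 bytes if "GB" is decimal (SI)
gib_binary = 24 * 1024**3   # 25,769,803,776 bytes if "GB" really means GiB

print(gib_binary - gb_decimal)         # 1,769,803,776 bytes of difference
print(round(gb_decimal / 1024**3, 2))  # a decimal "24 GB" is only ~22.35 GiB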

Related

Data bus and memory unit addressing confusion
I have a question regarding (RAM) memory units.
For a 32-bit CPU architecture, we have 32-bit CPU registers, a 32-wire data bus to RAM, and a 32-wire address bus. So the maximum number of addressable memory units is 2^32 = 4,294,967,296.
In other words, we have 4,294,967,296 memory units, and each memory unit must be able to receive a full data-bus write, so it seems each unit should be 32 bits wide to match the data bus.
If I concluded right (which I doubt), total RAM size should be: number of memory units * size of each unit = 4,294,967,296 * 32 = 137,438,953,472 bits, which is not true.
After some research, I found out that RAM memory units are standardized at 8 bits each. If that is the case, how can a single 8-bit memory unit store 32 bits from the data bus?
What is true is that a 32-bit address bus gives the memory 2^32 distinct addresses. However, each address refers to one byte (eight bits), not one 32-bit word, so the addressable memory is 2^32 bytes = 4 * 2^30 bytes = 4 GiB. A 32-bit data-bus transfer simply reads or writes four consecutive byte addresses at once.
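A quick numeric check of that reasoning, as a small Python sketch (the numbers are the ones from the question):

addresses   = 2**32          # distinct addresses from a 32-bit address bus
bytes_total = addresses * 1  # each address names one byte, not one 32-bit unit

print(bytes_total)           # 4294967296 bytes
print(bytes_total // 2**30)  # = 4 GiB
# A 32-bit bus transfer just spans 4 consecutive byte addresses:
print(bytes_total // 4)      # 1073741824 four-byte words fit in that space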

Understanding docker stats cpu utilization

When I run docker stats, I see that usage is greater than 100% most of the time. My machine has 8 cores. So, does the output below mean that 100% CPU means one core is fully occupied, and 690% means close to 7 cores are fully occupied?
CONTAINER      CPU %     MEM USAGE / LIMIT      MEM %    NET I/O             BLOCK I/O       PIDS
d99e067cfffc   690.00%   5.517 GiB / 12.7 GiB   43.46%   1.47 GB / 1.03 GB   9.15 MB / 0 B   338
Exactly as you stated: you can have up to N * 100% CPU usage, where N is the number of cores you have.
By the way, you can run the container with a --cpus <your_num> flag to limit how many CPU cores it may use.
More details are in the official docs.
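To make the conversion explicit, here is a tiny Python sketch turning the reported percentage into core-equivalents (the 690% and the 8 cores come from the question; the rest is my illustration):

cores_total = 8        # cores on the machine from the question
cpu_percent = 690.00   # the CPU % column reported by docker stats

cores_used = cpu_percent / 100.0
print(f"{cores_used:.2f} of {cores_total} cores busy")  # 6.90 of 8 cores busy

For example, starting a container with --cpus 2 would cap it at two cores' worth of CPU time, so docker stats would top out near 200% for it.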

Compute max memory bandwidth of a processor

I'm reading this CPU specification: http://ark.intel.com/products/67356/Intel-Core-i7-3612QM-Processor-6M-Cache-up-to-3_10-GHz-rPGA
It says the CPU has 2 channels, so I think it has 2 memory controllers inside. Then the max memory bandwidth should be 1.6 GHz * 64 bits * 2 * 2 = 51.2 GB/s if the supported DDR3 RAM is 1600 MHz. But the specification says its max memory bandwidth is 25.6 GB/s.
I multiplied by two 2s here: one for the double data rate, another for the two memory channels.
Is this a problem with the specification, or do I have a misunderstanding?
Double data rate memory specs usually already take into account that the effective transfer rate is doubled: "1600 MHz" memory really runs its I/O clock at 800 MHz, so you can leave one factor of 2 out of your calculation.
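Redoing the arithmetic with that correction, as a short Python sketch (all figures come from the question and the spec page):

io_clock_hz = 800e6    # DDR3-1600 actually runs its I/O bus at 800 MHz
ddr_factor  = 2        # two transfers per clock: the "double" in DDR
bus_bytes   = 64 // 8  # one 64-bit channel moves 8 bytes per transfer
channels    = 2        # dual-channel memory controller

bandwidth = io_clock_hz * ddr_factor * bus_bytes * channels
print(bandwidth / 1e9)  # 25.6 (GB/s), matching the Intel spec sheet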

Maximum memory size a system can support

Suppose that I have a computer with an address register of size 16 bits (MAR, for example). The smallest addressable unit in this computer is a word and each word is of size 2 bytes. What is the maximum memory size (in bytes) this system can support?
I thought it would be 2^16 = 65536 bytes, but the part about the smallest addressable unit implies that this is not the way to solve it.
Thanks in advance
There is no direct correlation between the maximum amount of memory a system can support and the size of its address registers.
16-bit computers 30 years ago could very well support more than 64 kilobytes. On the other hand, modern 64-bit processors typically only have address lanes for 52 bits (or fewer), and even so, a typical computer comes nowhere near supporting 2^52 bytes of memory.
Typical 64-bit computers today could in theory address 16 exbibytes, but present-day CPUs only support 4 pebibytes of physical and 256 tebibytes of per-process virtual memory. Typical desktop mainboards support 128 GiB at most, if you buy extra-expensive DIMMs. With affordable DIMMs, you are limited to about half as much (there are only so many slots).
Operating systems typically allow main memory sizes only in the hundreds of gigabytes (e.g. 512 GiB for Windows 8 Enterprise/Professional and 128 GiB otherwise, or as little as 16 GiB for Windows 7 Home Premium).
Generally the smallest addressable unit is one byte; if it were one byte here, the answer would be 2^16 * 1 = 65,536 bytes, as you calculated. However, because on this system there are two bytes per address, it is actually 2^16 * 2 = 131,072 bytes.
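The same computation in Python, for completeness (values taken from the question):

address_bits  = 16                 # width of the MAR
word_bytes    = 2                  # smallest addressable unit: a 2-byte word
max_addresses = 2**address_bits    # 65536 distinct word addresses

print(max_addresses * word_bytes)  # 131072 bytes = 128 KiB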

Understanding memory usage in CUDA

I have an NVIDIA GTX 570 graphics card running on an Ubuntu 10.10 system with CUDA 4.0.
I know that for performance we need to access memory efficiently and use registers and shared memory on the device cleverly.
However, I don't understand how to calculate the number of registers available per thread, or how much shared memory a single block can use, or other such simple but important quantities for particular kernel configurations.
I want to understand this by an explicit example.
Incidentally, I am currently trying to write a particle code, in which one of the kernels should look like this.
Each block is a 1-D collection of threads, and each grid is a 1-D collection of blocks.
Number of blocks : 16384
Number of threads per block : 32 ( => total threads 32*16384 = 524288)
Each thread-block is given a 32 x 32 two-dimensional integer array of shared memory to work with.
Within a thread I would like to store some numbers of type double, but I am not sure how many such doubles I can store without any registers spilling into local memory (which resides in device memory). Can someone tell me how many doubles can be stored per thread for this kernel configuration?
Also, is the above-mentioned shared-memory configuration valid for each of my blocks?
A sample computation of how one would go about deducing these things would be very illustrative and helpful.
Here is the information about my GTX 570: (using deviceQuery from CUDA-SDK)
[deviceQuery] starting...
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Found 1 CUDA Capable device(s)
Device 0: "GeForce GTX 570"
CUDA Driver Version / Runtime Version 4.0 / 4.0
CUDA Capability Major/Minor version number: 2.0
Total amount of global memory: 1279 MBytes (1341325312 bytes)
(15) Multiprocessors x (32) CUDA Cores/MP: 480 CUDA Cores
GPU Clock Speed: 1.46 GHz
Memory Clock rate: 1900.00 Mhz
Memory Bus Width: 320-bit
L2 Cache Size: 655360 bytes
Max Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536,65535), 3D=(2048,2048,2048)
Max Layered Texture Size (dim) x layers 1D=(16384) x 2048, 2D=(16384,16384) x 2048
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 65535
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Concurrent kernel execution: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support enabled: No
Device is using TCC driver mode: No
Device supports Unified Addressing (UVA): Yes
Device PCI Bus ID / PCI location ID: 2 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 4.0, CUDA Runtime Version = 4.0, NumDevs = 1, Device = GeForce GTX 570
[deviceQuery] test results...
PASSED
Press ENTER to exit...
The kernel configuration question is a little complicated, so you should use the CUDA Occupancy Calculator, and you also have to study how warps work. Once a block is assigned to an SM, it is further divided into 32-thread units called warps; a warp is the unit of thread scheduling in an SM. You can calculate the number of warps that reside in an SM for a given block size and a given number of blocks assigned to each SM. In your case a warp consists of 32 threads, so a block with 256 threads would contain 8 warps. Choosing a good kernel configuration depends on your data and operations: remember that you want to occupy each SM fully, that is, reach full thread capacity and the maximum number of warps so the scheduler can hide long-latency operations. Another important thing is not to exceed the hardware limits, such as the maximum of 1024 threads per block in your case. A sample computation is sketched below.
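As a rough sketch of such a sample computation, in Python, using the deviceQuery numbers above plus the published compute capability 2.0 limits (the CC 2.0 constants are mine, taken from NVIDIA's documentation; treat this as an estimate, not a substitute for the occupancy calculator):

REGS_PER_SM         = 32768  # deviceQuery's "registers per block" (per SM here)
SHARED_PER_SM       = 49152  # from deviceQuery: shared memory bytes per SM
MAX_BLOCKS_PER_SM   = 8      # compute capability 2.0 limit
MAX_THREADS_PER_SM  = 1536   # compute capability 2.0 limit
MAX_REGS_PER_THREAD = 63     # compute capability 2.0 per-thread cap

threads_per_block = 32
shared_per_block  = 32 * 32 * 4   # the 32 x 32 int array = 4096 bytes

# How many blocks can be resident on one SM, per resource?
by_shared  = SHARED_PER_SM // shared_per_block         # 12 blocks
by_threads = MAX_THREADS_PER_SM // threads_per_block   # 48 blocks
blocks_per_sm = min(by_shared, by_threads, MAX_BLOCKS_PER_SM)  # 8 blocks

# Registers each thread can use once those blocks are resident:
regs_per_thread = min(
    REGS_PER_SM // (blocks_per_sm * threads_per_block),  # 128 by budget
    MAX_REGS_PER_THREAD,                                 # 63 by hardware
)

# Each double occupies two 32-bit registers, and the compiler also needs
# registers for addresses and indices, so this is an optimistic upper bound:
doubles_per_thread = regs_per_thread // 2  # about 31

occupancy = blocks_per_sm * threads_per_block / MAX_THREADS_PER_SM
print(blocks_per_sm, regs_per_thread, doubles_per_thread, f"{occupancy:.0%}")

So the 4 KiB of shared memory per block is well within the 48 KiB limit, and roughly 31 doubles per thread is a generous upper bound before spilling. Note, though, that only 8 * 32 = 256 threads end up resident per SM, about 17% occupancy, which is one reason 32 threads per block is usually considered too small on this architecture; 128 or 192 threads per block fills a CC 2.0 SM much better.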
