Which CPU architectures have invalidate queues? (x86 has only a store buffer, because that is a feature of TSO; ARM has a weak memory model.) Maybe Alpha has them?
So I installed the GPU version of TensorFlow on a Windows 10 machine with a GeForce GTX 980 graphics card on it.
Admittedly, I know very little about graphics cards, but according to dxdiag it does have:
4060 MB of dedicated memory (VRAM), and
8163 MB of shared memory,
for a total of about 12224 MB.
What I noticed, though, is that this "shared" memory seems to be pretty much useless. When I start training a model, the VRAM will fill up and if the memory requirement exceeds these 4GB, TensorFlow will crash with a "resource exhausted" error message.
I CAN, of course, prevent reaching that point by choosing the batch size suitably low, but I do wonder if there's a way to make use of these "extra" 8GB of RAM, or if that's it and TensorFlow requires the memory to be dedicated.
Shared memory is an area of the main system RAM reserved for graphics. References:
https://en.wikipedia.org/wiki/Shared_graphics_memory
https://www.makeuseof.com/tag/can-shared-graphics-finally-compete-with-a-dedicated-graphics-card/
https://youtube.com/watch?v=E5WyJY1zwcQ
This type of memory is what integrated graphics (e.g. the Intel HD series) typically use.
This is not on your NVIDIA GPU, and CUDA can't use it. Tensorflow can't use it when running on GPU because CUDA can't use it, and also when running on CPU because it's reserved for graphics.
Even if CUDA could somehow use it, it wouldn't be useful, because system RAM bandwidth is around 10x lower than GPU memory bandwidth, and you would still have to get the data to and from the GPU over the slow (and high-latency) PCIe bus.
Bandwidth numbers for reference:
GeForce GTX 980: 224 GB/s
DDR4 on a desktop motherboard: approx. 25 GB/s
PCIe x16: 16 GB/s
This doesn't take into account latency. In practice, running a GPU compute task on data which is too big to fit in GPU memory and has to be transferred over PCIe every time it is accessed is so slow for most types of compute that doing the same calculation on CPU would be much faster.
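As a rough illustration of where the PCIe bottleneck shows up, here is a minimal CUDA sketch (not part of the original answer; buffer size and variable names are arbitrary) that times a single host-to-device copy with CUDA events. On a PCIe 3.0 x16 link the printed figure should land somewhere near the 16 GB/s quoted above, far below the GPU's own memory bandwidth.

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    const size_t bytes = 256u * 1024 * 1024;                 // 256 MB test buffer (arbitrary size)
    float *host = nullptr, *dev = nullptr;
    cudaMallocHost((void **)&host, bytes);                   // pinned host memory for a fair measurement
    cudaMalloc((void **)&dev, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);    // this copy crosses the PCIe bus
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("Host->Device: %.1f GB/s\n", (bytes / 1e9) / (ms / 1e3));

    cudaFree(dev);
    cudaFreeHost(host);
    return 0;
}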
Why do you see that kind of memory being allocated when you have an NVIDIA card in your machine? Good question. I can think of a couple of possibilities:
(a) You have both NVIDIA and Intel graphics drivers active (e.g. as happens when running different displays on both). Uninstall the Intel drivers and/or disable Intel HD graphics in the BIOS and the shared memory will disappear.
(b) NVIDIA is using it. This may be, e.g., extra texture memory. It could also not be real memory at all, just a memory-mapped area that corresponds to GPU memory. Look in the advanced settings of the NVIDIA driver for a setting that controls this.
In any case, no, there isn't anything that Tensorflow can use.
CUDA can make use of system RAM as well. In CUDA, memory shared between VRAM and system RAM is called unified memory. However, TensorFlow does not allow it, for performance reasons.
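For context, here is a minimal CUDA sketch of what unified (managed) memory looks like outside of TensorFlow; the kernel name and sizes are made up for illustration. A single pointer is usable from both host and device, and on supported GPUs and platforms the allocation can even be larger than the VRAM, with pages migrating over PCIe on demand, which is exactly the performance cost frameworks like TensorFlow want to avoid.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

int main()
{
    const int n = 1 << 20;
    float *data = nullptr;
    cudaMallocManaged((void **)&data, n * sizeof(float));  // unified memory: visible to CPU and GPU
    for (int i = 0; i < n; ++i)
        data[i] = 1.0f;                                    // touched on the host first

    scale<<<(n + 255) / 256, 256>>>(data, n, 2.0f);        // pages migrate to the GPU as needed
    cudaDeviceSynchronize();                               // wait before reading on the host again

    printf("%f\n", data[0]);                               // prints 2.000000
    cudaFree(data);
    return 0;
}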
I had the same problem. My VRAM is 6 GB but only 4 GB was detected. I read some code about TensorFlow limiting GPU memory, then I tried this code, and it works:
# Setting a GPU memory limit
import tensorflow as tf

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    # Restrict TensorFlow to only allocate 6 GB of memory on the first GPU
    try:
        tf.config.experimental.set_virtual_device_configuration(
            gpus[0],
            [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=6144)])
        logical_gpus = tf.config.experimental.list_logical_devices('GPU')
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
    except RuntimeError as e:
        # Virtual devices must be set before GPUs have been initialized
        print(e)
Note: if you have 10 GB of VRAM, then try a memory limit of 10240.
Well, that's not entirely true. You're right in terms of lowering the batch size, but it will depend on what type of model you are training. If you train Xseg, it won't use the shared memory, but when you get into SAEHD training you can place your model optimizers on the CPU (instead of the GPU), as well as your learning rate dropout, which then lets you take advantage of that shared memory for those optimizations while saving the dedicated GPU memory for your model resolution and batch size. So it may seem like that shared memory is useless, but play with your settings and you'll see that, for certain settings, it's just a matter of redistributing the right tasks. You'll have increased iteration times, but you'll be utilizing that shared memory one way or another. I had to experiment a lot to find what worked with my GPU, and there were some surprising revelations. This is an old post, but I bet you've figured it out by now, hopefully.
Is there any direct relation between the OpenCL memory architecture:
Local/Global/Constant/Private memory
And the physical GPU's memory and caches.
For example, take a GPU card that has 1 GB of memory, an L1 cache, and an L2 cache. How are these related to local/global/... memory?
Or are Local/Constant/Private memory allocated from Global memory?
-Thanks
OpenCL doesn't really discuss caching of memory. Most modern graphics cards do have some sort of caching protocols for global memory, but these are not guaranteed in older cards. However here is an overview of the different memories.
Private Memory - This memory is kept as registers per work-item. GPUs have very large register files per compute unit. However this memory can spill into local memory if needed. Private memory is allocated by default when you create variables.
Local Memory - Memory local to and shared by the workgroup. This memory system typically is on the compute unit itself and cannot be read or written to by other workgroups. This memory has typically very low latency on GPU architectures (on CPU architectures, this memory is simply part of your system memory). This memory is typically used as a manual cache for global memory. Local memory is specified by the __local attribute.
Constant Memory - Part of the global memory, but read only and can therefore be aggressively cached. __constant is used to define this type of memory.
Global Memory - This is the main memory of the GPU. __global is used to place memory into the global memory space.
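The same hierarchy exists in CUDA under different names (OpenCL private ≈ CUDA registers, OpenCL local ≈ CUDA __shared__, OpenCL constant ≈ CUDA __constant__, OpenCL global ≈ CUDA global memory). Here is a small, hypothetical CUDA kernel just to make the address spaces concrete; the names and sizes are illustrative only, and the kernel assumes blocks of 256 threads.

#include <cuda_runtime.h>

__constant__ float coeff[16];                    // constant memory: read-only, aggressively cached

__global__ void blend(const float *in, float *out, int n)   // in/out point into global memory
{
    __shared__ float tile[256];                  // shared memory: OpenCL's "local" memory, per block/work-group
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;  // stage global data in on-chip shared memory
    __syncthreads();                             // every thread in the block reaches this barrier

    float x = tile[threadIdx.x] * coeff[threadIdx.x % 16];   // x lives in a register (private memory)
    if (i < n)
        out[i] = x;
}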
We believe that texture memory is part of global memory; is this true? If so, how much can you allocate? (Indirectly, how much is there?)
And is it true that all multiprocessors can read from the texture memory at the same time?
Texture data is contained in CUDA arrays, and CUDA arrays are allocated out of global memory; so however much global memory is still free (you can call cuMemGetInfo() to see how much free memory is left) is available for allocation as textures.
It's impossible to know how much memory is consumed by a given CUDA array - obviously it has to be at least Width*Height*Depth*sizeof(Texel), but it may take more because the driver has to do an allocation that conforms to the hardware's alignment requirements.
Texture limits for different compute capabilities can be found in the CUDA Programming Guide, available at the NVIDIA CUDA website.
For a given device, one can query device capabilities including texture limits using the cudaGetDeviceProperties function.
Allocation depends on the amount of available global memory and the segmentation of that memory, so there is no easy way to tell whether a given allocation will be successful or not, especially when working with large textures.
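To make that concrete, here is a minimal CUDA runtime-API sketch (dimensions chosen arbitrarily) that checks free global memory, reads the texture limits from the device properties, and then attempts to allocate a CUDA array; if the card is low on memory or fragmented, the allocation simply fails with an error code.

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    size_t freeBytes = 0, totalBytes = 0;
    cudaMemGetInfo(&freeBytes, &totalBytes);               // runtime-API equivalent of cuMemGetInfo()
    printf("Free: %zu MB of %zu MB\n", freeBytes >> 20, totalBytes >> 20);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("Max 2D texture: %d x %d texels\n", prop.maxTexture2D[0], prop.maxTexture2D[1]);

    // A CUDA array backing a 4096 x 4096 float texture, carved out of global memory.
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    cudaArray_t array = nullptr;
    cudaError_t err = cudaMallocArray(&array, &desc, 4096, 4096);
    if (err != cudaSuccess)
        printf("Allocation failed: %s\n", cudaGetErrorString(err));
    else
        cudaFreeArray(array);
    return 0;
}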
I am reading some articles that discuss "memory subsystems". What is the definition of a memory subsystem?
As far as I understand from googling and reading other documents, it seems to mean the combination of main memory and the processor caches. Is that correct?
There are three types of memory subsystem components: RAM (R) components, single-access (S) components, and dual-access (D) components. All memory subsystem components automatically retrieve operands from, and store results in, their associated memory modules. All memory subsystem components have an output data connection and an input data connection; therefore, they must be capable of handling both an output data stream and an input data stream. In addition, a D component includes a second pair of input and output connections. All memory subsystem components have a queue in each of their input and output data streams.
A significant difference between the memory subsystem components and the other components is that a Number of Operands In (NumOpsIn) register as well as a NumOpsOut register must be included. The NumOpsIn register serves the same purpose for the input data stream as NumOpsOut does for the output stream. Both NumOpsIn and NumOpsOut must be zero before new instructions can be distributed to the component's programmable registers.
The term "memory subsystem" seems to refer to a DRAM module, not CPU caches, and not CPU registers. I claim this based on the usages of the term listed below.
However, each usage is in the context of CPU caches, cache misses, and performance.
Here's a sample of usages:
Ulrich Drepper in What Every Programmer Should Know About Memory:
Unfortunately, neither the structure nor the cost of using the memory subsystem of a computer or the caches on CPUs is well understood by most programmers.
Intel in Intel 64 and IA-32 Architectures Optimization Reference Manual:
The manual also distinguishes between the cache and the memory subsystems:
Memory Bound corresponds to execution stalls related to the cache and memory subsystems.
Intel's VTune Profiler lumps CPU caches in with DRAM when listing things related to "memory subsystem issues", but doesn't claim that CPU caches are part of a "memory subsystem":
This metric shows how memory subsystem issues affect the performance. [...]
In Knights Landing, the manual contrasts an on-package memory subsystem with the off-package memory subsystem ("main memory"):
MCDRAM is an on-package, high bandwidth memory subsystem that provides peak bandwidth for read traffic, but lower bandwidth for write traffic (compared to reads). The aggregate bandwidth provided by MCDRAM is higher than the off-package memory subsystem (i.e., DDR memory).
I looked through the programming guide and the best practices guide, and they mention that global memory access takes 400-600 cycles. I did not see much on the other memory types like the texture cache, constant cache, and shared memory. Registers have zero memory latency.
I think the constant cache is as fast as registers if all threads use the same address in the constant cache. In the worst case I am not so sure.
Is shared memory as fast as registers as long as there are no bank conflicts? If there are, how does the latency unfold?
What about texture cache?
For (Kepler) Tesla K20 the latencies are as follows:
Global memory: 440 clocks
Constant memory
L1: 48 clocks
L2: 120 clocks
Shared memory: 48 clocks
Texture memory
L1: 108 clocks
L2: 240 clocks
How do I know? I ran the microbenchmarks described by the authors of Demystifying GPU Microarchitecture through Microbenchmarking. They provide similar results for the older GTX 280.
This was measured on a Linux cluster; the computing node where I ran the benchmarks was not being used by any other users or running any other processes. It is BULLX Linux with a pair of 8-core Xeons and 64 GB of RAM, with nvcc 6.5.12. I changed sm_20 to sm_35 for compiling.
There is also an operand cost chapter in the PTX ISA document, although it is not very helpful; it just reiterates what you already expect, without giving precise figures.
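For reference, here is a stripped-down sketch in the spirit of those microbenchmarks (my own illustration, not the authors' code): a single thread chases dependent pointers through a buffer, so each load must wait for the previous one, and the elapsed clocks divided by the number of loads approximates the latency. Sizes, strides, and iteration counts are arbitrary and would need tuning to isolate a particular cache level.

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void chase(const unsigned int *next, int iters, long long *cycles, unsigned int *sink)
{
    unsigned int j = 0;
    long long start = clock64();
    for (int i = 0; i < iters; ++i)
        j = next[j];                      // each load depends on the previous one, exposing latency
    long long stop = clock64();
    *cycles = stop - start;
    *sink = j;                            // keep the loop from being optimized away
}

int main()
{
    const int n = 1 << 22;                // 16 MB of indices: large enough to defeat the caches
    const int stride = 128;               // jump 512 bytes per access (arbitrary)
    unsigned int *h = (unsigned int *)malloc(n * sizeof(unsigned int));
    for (int i = 0; i < n; ++i)
        h[i] = (i + stride) % n;          // a simple strided cycle through the buffer

    unsigned int *d_next, *d_sink;
    long long *d_cycles;
    cudaMalloc((void **)&d_next, n * sizeof(unsigned int));
    cudaMalloc((void **)&d_sink, sizeof(unsigned int));
    cudaMalloc((void **)&d_cycles, sizeof(long long));
    cudaMemcpy(d_next, h, n * sizeof(unsigned int), cudaMemcpyHostToDevice);

    const int iters = 100000;
    chase<<<1, 1>>>(d_next, iters, d_cycles, d_sink);   // one thread: nothing hides the latency
    cudaDeviceSynchronize();

    long long cycles = 0;
    cudaMemcpy(&cycles, d_cycles, sizeof(long long), cudaMemcpyDeviceToHost);
    printf("~%lld clocks per global load\n", cycles / iters);

    cudaFree(d_next); cudaFree(d_sink); cudaFree(d_cycles); free(h);
    return 0;
}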
The latency to the shared/constant/texture memories is small and depends on which device you have. In general, though, GPUs are designed as a throughput architecture, which means that by creating enough threads the latency to the memories, including global memory, is hidden.
The reason the guides talk about the latency to global memory is that the latency is orders of magnitude higher than that of other memories, meaning that it is the dominant latency to be considered for optimization.
You mentioned constant cache in particular. You are quite correct that if all threads within a warp (i.e. group of 32 threads) access the same address then there is no penalty, i.e. the value is read from the cache and broadcast to all threads simultaneously. However, if threads access different addresses then the accesses must serialize since the cache can only provide one value at a time. If you're using the CUDA Profiler, then this will show up under the serialization counter.
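A small hypothetical kernel (not from the answer) illustrates the two cases: in the first read, every thread in the warp uses the same __constant__ address, so the value is broadcast in one transaction; in the second, threads use different addresses and the constant cache serves them one at a time.

__constant__ float table[256];

__global__ void constant_access(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float uniform   = table[0];                  // same address for the whole warp: one broadcast
        float divergent = table[threadIdx.x % 256];  // different addresses: reads are serialized
        out[i] = uniform + divergent;
    }
}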
Shared memory, unlike constant cache, can provide much higher bandwidth. Check out the CUDA Optimization talk for more details and an explanation of bank conflicts and their impact.
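To round that out, here is a hedged sketch of what a bank conflict looks like (assuming the common layout of 32 banks of 4-byte words, a block of exactly 32 threads, and an input buffer of at least 32*32 floats; the kernel is illustrative, not from the talk). The unit-stride writes touch 32 different banks, while the stride-32 reads all land in bank 0 and are replayed serially as a 32-way conflict.

__global__ void bank_conflict_demo(const float *in, float *out)
{
    __shared__ float buf[32 * 32];
    int t = threadIdx.x;                      // assumes blockDim.x == 32 (a single warp)

    for (int i = t; i < 32 * 32; i += 32)
        buf[i] = in[i];                       // unit-stride accesses: one bank per thread, no conflicts
    __syncthreads();

    // Stride of 32 floats: every thread in the warp maps to bank 0,
    // so the 32 reads are serviced one after another (32-way bank conflict).
    out[t] = buf[t * 32];
}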