Intel VTune - Estimate data offload to GPU - memory

I'm interested in estimating the data transfer, in bytes, of an algorithm or function to be executed on a GPU, using Intel VTune. For example, if my algorithm computes a multiplication of two vectors of 10 float elements each, the offload would transfer 10+10 float elements to the GPU and send 1 element, the result, back, for a total of 84 bytes (21*4). Keep in mind that I'm interested in an estimation, not the actual result on a GPU, since I don't have one available.
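As a concrete illustration of the kind of estimate I'm after, here's a back-of-the-envelope sketch in Python (the vector sizes and element type are just the example above, nothing measured):

BYTES_PER_FLOAT = 4                  # single-precision float
elements_to_gpu = 10 + 10            # two input vectors of 10 floats each
elements_to_host = 1                 # one float result copied back
total_bytes = (elements_to_gpu + elements_to_host) * BYTES_PER_FLOAT
print(total_bytes)                   # 84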
With Intel Advisor it is possible to do this; the metric is called "Estimated data transfer with reuse", as shown in the following screenshot:
[Screenshot: Intel Advisor data transfer estimation result]
In Intel VTune the only way I found is the "Memory Access" analysis, but it expresses the result as a number of loads and stores, probably obtained from hardware counters. So if huge data structures cause multiple reads from main memory, those are all counted, and the analysis does not return a number of bytes.
[Screenshot: Intel VTune Memory Access analysis result]
Is there a way to perform a similar analysis with Intel VTune? Thanks

If you have an Intel Core CPU in your system, it will have integrated UHD Graphics. When you run the GPU Offload analysis in the Intel VTune Profiler, you can see GPU memory access (read and write) metrics in GB/sec. I have attached a screenshot for your reference. To try the GPU Offload analysis, you need a sample that runs on the GPU as well as a system with an Intel GPU.
You can find the answer to your question here.

Related

How to allocate GpuMat in shared (not dedicated) GPU memory? [duplicate]

So I installed the GPU version of TensorFlow on a Windows 10 machine with a GeForce GTX 980 graphics card on it.
Admittedly, I know very little about graphics cards, but according to dxdiag it does have:
4060MB of dedicated memory (VRAM) and;
8163MB of shared memory
for a total of about 12224MB.
What I noticed, though, is that this "shared" memory seems to be pretty much useless. When I start training a model, the VRAM will fill up and if the memory requirement exceeds these 4GB, TensorFlow will crash with a "resource exhausted" error message.
I CAN, of course, prevent reaching that point by choosing the batch size suitably low, but I do wonder if there's a way to make use of these "extra" 8GB of RAM, or if that's it and TensorFlow requires the memory to be dedicated.
Shared memory is an area of the main system RAM reserved for graphics. References:
https://en.wikipedia.org/wiki/Shared_graphics_memory
https://www.makeuseof.com/tag/can-shared-graphics-finally-compete-with-a-dedicated-graphics-card/
https://youtube.com/watch?v=E5WyJY1zwcQ
This type of memory is what integrated graphics, e.g. the Intel HD series, typically use.
This is not on your NVIDIA GPU, and CUDA can't use it. Tensorflow can't use it when running on GPU because CUDA can't use it, and also when running on CPU because it's reserved for graphics.
Even if CUDA could use it somehow, it wouldn't be useful, because system RAM bandwidth is around 10x less than GPU memory bandwidth, and you also have to get the data to and from the GPU over the slow (and high-latency) PCIe bus.
Bandwidth numbers for reference:
GeForce GTX 980: 224 GB/s
DDR4 on a desktop motherboard: approx. 25 GB/s
PCIe x16: 16 GB/s
This doesn't take into account latency. In practice, running a GPU compute task on data which is too big to fit in GPU memory and has to be transferred over PCIe every time it is accessed is so slow for most types of compute that doing the same calculation on CPU would be much faster.
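To make the gap concrete, here is a small back-of-the-envelope comparison in Python using the reference numbers above (the 4 GB working set is just an illustrative figure):

# Rough time to stream a 4 GB working set at each of the quoted bandwidths.
data_gb = 4.0
for name, bw_gbs in [("GPU memory (GTX 980)", 224.0),
                     ("System RAM (DDR4)", 25.0),
                     ("PCIe x16", 16.0)]:
    print("%-22s %6.1f ms" % (name, data_gb / bw_gbs * 1000))
# ~17.9 ms, ~160.0 ms, ~250.0 ms respectively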
Why do you see that kind of memory being allocated when you have an NVIDIA card in your machine? Good question. I can think of a couple of possibilities:
(a) You have both NVIDIA and Intel graphics drivers active (e.g. as happens when running different displays on both). Uninstall the Intel drivers and/or disable Intel HD graphics in the BIOS and the shared memory will disappear.
(b) NVIDIA is using it. This may be, e.g., extra texture memory. It could also not be real memory, but just a memory-mapped area that corresponds to GPU memory. Look in the advanced settings of the NVIDIA driver for a setting that controls this.
In any case, no, there isn't anything that Tensorflow can use.
CUDA can make use of system RAM as well. In CUDA, memory shared between VRAM and system RAM is called unified memory. However, TensorFlow does not allow it for performance reasons.
I had the same problem. My VRAM is 6 GB, but only 4 GB was detected. I read about TensorFlow limiting GPU memory, then I tried the code below, and it works:
import tensorflow as tf

# Setting the GPU memory limit
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    # Restrict TensorFlow to only allocate 6 GB of memory on the first GPU
    try:
        tf.config.experimental.set_virtual_device_configuration(
            gpus[0],
            [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=6144)])
        logical_gpus = tf.config.experimental.list_logical_devices('GPU')
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
    except RuntimeError as e:
        # Virtual devices must be set before GPUs have been initialized
        print(e)
Note: if you have 10 GB of VRAM, then try a memory limit of 10240.
Well, that's not entirely true. You're right in terms of lowering the batch size, but it will depend on what model type you are training. If you train Xseg, it won't use the shared memory, but when you get into SAEHD training, you can put your model optimizers on the CPU (instead of the GPU) as well as your learning dropout rate, which will then let you take advantage of that shared memory for those optimizations while saving the dedicated GPU memory for your model resolution and batch size. So it may seem like that shared memory is useless, but play with your settings and you'll see that for certain settings it's just a matter of redistributing the right tasks. You'll have increased iteration times, but you'll be utilizing that shared memory in one way or another. I had to experiment a lot to find what worked with my GPU, and there were some surprising revelations. This is an old post, but I bet you've figured it out by now, hopefully.

How do I calculate memory bandwidth on a given (Linux) system, from the shell?

I want to write a shell script/command which uses commonly-available binaries, the /sys filesystem or other facilities to calculate the theoretical maximum bandwidth for the RAM available on a given machine.
Notes:
I don't care about latency, just bandwidth.
I'm not interested in the effects of caching (e.g. the CPU's last-level cache), but in the bandwidth of reading from RAM proper.
If it helps, you may assume a "vanilla" Intel platform, and that all memory DIMMs are identical; but I would rather you not make this assumption.
If it helps, you may rely on root privileges (e.g. using sudo)
@einpoklum, you should have a look at the Performance Counter Monitor, available at https://github.com/opcm/pcm. It will give you the measurements that you need. I do not know whether it supports kernel 2.6.32.
Alternatively you should also check Intel's EMON tool which promises support for kernels as far back as 2.6.32. The user guide is listed at https://software.intel.com/en-us/download/emon-user-guide, which implies that it is available for download somewhere on Intel's software forums.
I'm not aware of any standalone tool that does it, but for Intel chips only, if you know the "ARK URL" for the chip, you could get the maximum bandwidth using a combination of a tool to query ARK, like curl, and something to parse the returned HTML, like xmllint --html --xpath.
For example, for my i7-6700HQ, the following works:
curl -s 'https://ark.intel.com/products/88967/Intel-Core-i7-6700HQ-Processor-6M-Cache-up-to-3_50-GHz' | \
xmllint --html --xpath '//li[@class="MaxMemoryBandwidth"]/span[@class="value"]/span/text()' - 2>/dev/null
This returns 34.1 GB/s which is the maximum theoretical bandwidth of my chip.
The primary difficulty is determining the ARK URL, which doesn't correspond in an obvious way to the CPU brand string. One solution would be to find the CPU model on an index page like this one, and follow the link.
This gives you the maximum theoretical bandwidth, which can be calculated as (number of memory channels) x (transfer width) x (data rate). The data rate is the number of transfers per unit time, and is usually the figure given in the name of the memory type, e.g., DDR4-2133 has a data rate of 2133 million transfers per second. Alternatively, you can calculate it as the product of the bus speed (1067 MHz in this case) and the data rate multiplier (2 for DDR technologies).
For my CPU, this calculation gives 2 memory channels * 8 bytes/transfer * 2133 million transfers/second = 34.128 GB/s, consistent with the ARK figure.
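As a quick sanity check, the same arithmetic in a few lines of Python (channel count, transfer width, and data rate are the i7-6700HQ / DDR4-2133 figures from the text above):

channels = 2                    # populated memory channels
bytes_per_transfer = 8          # 64-bit DDR bus width
transfers_per_second = 2133e6   # DDR4-2133 data rate
peak_gbs = channels * bytes_per_transfer * transfers_per_second / 1e9
print(peak_gbs)                 # 34.128 GB/s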
Note that theoretical maximum as reported by ARK might be lower or higher than the theoretical maximum on your particular system for various reasons, including:
Fewer memory channels populated than the maximum number of channels. For example, if I only populated one channel on my dual channel system, theoretical bandwidth would be cut in half.
Not using the maximum speed supported RAM. My CPU supports several RAM types (DDR4-2133, LPDDR3-1866, DDR3L-1600) with varying speeds. The ARK figure assumes you use the fastest possible supported RAM, which is true in my case, but may not be true on other systems.
Over or under-clocking of the memory bus, relative to the nominal speed.
Once you get the correct theoretical figure, you won't actually reach this figure in practice, due to various factors including the following:
Inability to saturate the memory interface from one or more cores due to limited concurrency for outstanding requests, as described in the section "Latency Bound Platforms" in this answer.
Hidden doubling of bandwidth implied by writes that need to read the line before writing it.
Various low-level factors relating the DRAM interface that prevents 100% utilization such as the cost to open pages, the read/write turnaround time, refresh cycles, and so on.
Still, using enough cores and non-temporal stores, you can often get very close to the theoretical bandwidth, often 90% or more.

Memory management when using GPU in TensorFlow

I have some questions about using the GPU in TensorFlow. I was following the convolutional neural network tutorial here (tensorflow/models/image/cifar10/cifar10_train.py). As in the tutorial, all parameters (e.g., weights) are stored and updated in CPU memory, and the GPUs are only used to compute gradients or inference.
Since the weights are stored on the CPU, they have to be synchronized every iteration, and it seems that the GPU is underutilized (about 60% according to nvidia-smi). When using multiple GPUs, I understand that the weights should be stored in CPU memory to synchronize between the GPUs. But why does this tutorial store all weights on the CPU even with a single GPU? Is there any way to store and update them in GPU memory?
In the case of inference, are the weights copied to the GPU once and reused, or do they have to be copied every time they are used?
How about the image data? It seems that this data resides on the GPU (not sure). When is this data transferred to the GPU? When it is loaded from disk, or when it is required by the GPU?
If it is copied to the GPU right after it is loaded from disk, what happens if the size of the image data is too large to fit in GPU memory? In such a case, is there any way to copy the data in pieces (something like prefetching)?
If it is copied to the GPU on demand, is there any way to prefetch it before it is actually used by the GPU, to avoid idle time?
EDIT: It would be helpful if there is any way to check where the send/recv nodes are inserted between CPU and GPU (as in the white paper).
Those tutorials are meant to show off the API, so they don't optimize for performance. It's faster to keep variables on the GPU for a single-tower model, and also faster for a multi-tower model when you have p2p communication enabled between GPUs. To pin variables to the GPU, use the same tf.device('/gpu:0') approach as for any other op.
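A minimal sketch of what that pinning looks like (TF 1.x-style API, as used in the tutorial; the variable name and shape are only illustrative):

import tensorflow as tf

# Place the variable on the GPU instead of the default CPU placement.
with tf.device('/gpu:0'):
    weights = tf.get_variable(
        'conv1_weights', shape=[5, 5, 3, 64],
        initializer=tf.truncated_normal_initializer(stddev=0.05))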
You can see all the memory copies between GPUs if you enable partition graphs, i.e., do something like this:
from tensorflow.python.client.timeline import Timeline

# Collect run metadata with full tracing and partition graphs enabled.
metadata = tf.RunMetadata()
sess.run(x,
         options=tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE,
                               output_partition_graphs=True),
         run_metadata=metadata)

# Dump a Chrome trace and the raw metadata (including the partition graphs).
timeline = Timeline(metadata.step_stats)
with open("dynamic_stitch_gpu_profile.json", "w") as f:
    f.write(timeline.generate_chrome_trace_format())
with open("dynamic_stitch_gpu_profile.pbtxt", "w") as f:
    f.write(str(metadata))
See this issue for an example of using this technique to track down copies:
https://github.com/tensorflow/tensorflow/issues/7251#issuecomment-277385212
For prefetching to GPU, see this issue
There are new stage_op ops that have been added that allow prefetching to the GPU and are dramatically faster than the Python queue-runner approach. They are in the process of being documented.

Maximum data a GPU can take?

I have a large dataset, say 5 GB, and I am doing stream-wise processing on the data. Now I need to figure out how much data I can send to the GPU at a time for processing, so that I can utilize the GPU memory to the fullest.
Also, if my RAM is not sufficient to process/hold 5 GB of data, what is the work-around for this?
A pipelined application might use 3 buffers on the GPU. One buffer is used to hold the data currently being transferred to the GPU (from the host), one buffer to hold the data currently being processed by the GPU, and one buffer to hold the data(results) currently being transferred from the GPU (to the host).
This implies that your application processing can be broken into "chunks". This is true for many applications that work on large data sets.
CUDA streams enable the developer to write code that allows these 3 operations (transfer to, process, transfer from) to run simultaneously.
There is no specific number that defines the size of the buffers in the above scenario. Certainly, a straightforward implementation would create 3 buffers, each of which is smaller than 1/3 of the total memory on the GPU, leaving some memory left over for overhead and other data that may need to live in GPU memory. So if your GPU has 5GB, you might be able to run with three 1GB buffers. But there is no tool like deviceQuery that will tell you this; it is not a property of the device.
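As a rough illustration of that sizing logic, a small sketch (the 5 GB capacity and the 20% overhead reserve are just assumptions for the example):

# Per-buffer sizing for a 3-buffer (copy-in / compute / copy-out) pipeline.
gpu_mem_bytes = 5 * 1024**3      # assume a 5 GB GPU, as in the example above
overhead_fraction = 0.2          # reserve ~20% for context, code, other data
usable = gpu_mem_bytes * (1 - overhead_fraction)
buffer_bytes = int(usable // 3)  # one chunk per buffer
print(buffer_bytes / 1024**3)    # ~1.33 GB per buffer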
You may want to read carefully the above linked programming guide section, as well as review the CUDA simple streams sample code.

HLSL: Memory coalescing with structured buffers

I'm currently working on an HLSL shader that is limited by global memory bandwidth. I need to coalesce as much memory as possible in each memory transaction. Based on the guidelines from NVIDIA for CUDA and OpenCL (DirectCompute documentation is quite lacking), the largest memory transaction size for compute capability 2.0 is 128 bytes, while the largest word that can be accessed is 16 bytes. Global memory accesses can be coalesced when the data being accessed by the threads in a warp fall into the same 128 byte segment. With this in mind, wouldn't structured buffers be detrimental for memory coalescing if the structure is larger than 16 bytes?
Suppose you have a structure of two float4's, call them A and B. You can access either A or B, but not both, in a single memory transaction for an instruction issued in a non-divergent warp. The layout of the memory would look like ABABABAB. If you're trying to read consecutive structures into shared memory, wouldn't memory bandwidth be wasted by storing the data in this manner? For example, you can only access the A elements, but the hardware coalesces the memory transaction so it reads in 128 bytes of consecutive data, half of which is the B elements. Essentially, you're wasting half of your memory bandwidth. Wouldn't it be better to store the data like AAAABBBB, which is a structure of buffers instead of a buffer of structures? Or is this handled by the L1 cache, where the B elements are cached so you can access them faster when the next instruction is to read in the B elements? The only other solution would be to have even-numbered threads access the A elements, while odd-numbered threads access the B elements.
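To make the layout difference concrete, here is a small CPU-side sketch in Python/numpy that only illustrates the byte layout, not actual GPU coalescing behaviour (the A/B float4 pair mirrors the example above):

import numpy as np

# AoS: interleaved ABAB... layout, like a structured buffer of {float4 A; float4 B;}.
aos = np.zeros(8, dtype=[('A', np.float32, 4), ('B', np.float32, 4)])
# SoA: separate AAAA... and BBBB... buffers.
soa_a = np.zeros((8, 4), dtype=np.float32)
soa_b = np.zeros((8, 4), dtype=np.float32)

print(aos.dtype.itemsize)  # 32 bytes per struct, so a 128-byte segment holds
                           # 4 A values (64 useful bytes) plus 4 unwanted B values
print(aos['A'].strides)    # (32, 4): A values are strided across the buffer
print(soa_a.strides)       # (16, 4): A values are densely packed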
If memory bandwidth is indeed wasted, I don't see why anyone would use structured buffers other than for convenience. Hopefully I explained this well enough that someone can understand. I would ask this on the NVIDIA developer forums, but I think they're still down. Visual Studio keeps crashing when I try to run the NVIDIA Nsight frame profiler, so it's difficult to see how the memory bandwidth is affected by changes in how the data is stored. P.S., has anyone been able to successfully run the NVIDIA Nsight frame profiler?
