Low memory bandwidth

Low memory bandwidth - memory

I have ddr2-667 ram and I measured my memory bandwidth via STREAM tool.
Here is my results:
Function Rate (MB/s) Avg time Min time Max time
Copy: 2229.0490 0.0158 0.0144 0.0206
Scale: 2208.1095 0.0160 0.0145 0.0216
Add: 2620.2118 0.0196 0.0183 0.0208
Triad: 2358.1446 0.0217 0.0204 0.0246
But theoretically my memory bandwidth is 5333 Mb/s.
Why my bandwidth results is very low ? is there a solution to increase (e.g overclock)

First, as SamGamgee said, reaching theoretical memory bandwidth is hard.
Using multiple threads could increase the measured bandwidth, though. STREAM by default disables multi-thread support, though. You can enable it by adding -fopenmp (if you are using GCC) to the compile option to enable multi-thread support.

Related

How to check bus utilization / bus load for GPU during ML inference?

I am running an ML inference for image recognition on the GPU using onnxruntime and I am seeing an upper limit for how much performance improvement batching of images is giving me - there is reduction in inference time upto around batch_size of 8, beyond that the time remains constant. I assume this must be because of some max utilization of the GPU resources, as I dont see any such limitation mentioned in the onnx documentation.
I tried using the package pynmvl.smi to get nvidia_smi and printed some utilization factors during inference as such -
utilization_percent = nvidia_smi.getInstance().DeviceQuery()['gpu'][0]['utilization']
gpu_util.append(utilization_percent ['gpu_util'])
mem_util.append(utilization_percent ['memory_util'])
What I do see is that the gpu_util and the memory_util are within 25% for the entire run of my inference, even at batch size like 32 or 64, so these are unlikely to be the cause of the bottleneck.
I assume then, that it must be the bus load limitation that might be causing this. I did not find any option within nvidia-smi to print the GPU bus load.
How can I find the bus load during the inference?

What is maximum (ideal) memory bandwidth of an OpenCL device?

My OpenCL device memory-relevant specs are:
Max compute units 20
Global memory channels (AMD) 8
Global memory banks per channel (AMD) 4
Global memory bank width (AMD) 256 bytes
Global Memory cache line size 64 bytes
Does it mean that to utilize my device at full memory-wise potential it needs to have 8 work items on different CUs constantly reading memory chunks of 64 bytes? Are memory channels arranged so that they allow different CUs access memory simultaneously? Are memory reads of 64 bytes always considered as single reads or only if address is % 64 == 0?
Does memory banks quantity/width has anything to do with memory bandwidth and is there a way to reason about memory performance when writing kernel with respect to memory banks?

Memory bank quantity is useful to hint about strided access pattern performance and bank conflicts.
Cache line width must be the L2 cache line between L2 and CU(L1). 64 bytes per cycle means 64GB/s per compute unit (assuming there is only 1 active cache line per CU at a time and 1GHz clock). There can be multiple like 4 of them per L1 too.). With 20 compute units, total "L2 to L1" bandwidth must be 1.28TB/s but its main advantage against global memory must be lower clock cycles to fetch data.
If you need to utilize global memory, then you need to approach bandwidth limits between L2 and main memory. That is related to memory channel width, number of memory channels and frequency.
Gddr channel width is 64 bits, HBM channel width is 128 bits. A single stack of hbm v1 has 8 channels so its a total of 1024 bits or 128 bytes. 128 bytes per cycle means 128GB/s per GHz. More stacks mean more bandwidth. If 8GB memory is made of two stacks, then its 256 GB/s.
If your data-set fits inside L2 cache, then you expect more bandwidth under repeated access.
But the true performance (instead of on paper) can be measured by a simple benchmark that does pipelined memory copy between two arrays.
Total performance by 8 work items depends on capability of compute unit. If it lets only 32 bytes per clock per work item then you may need more work items. Compute unit must have some optimization phase like packing of similar addresses into one big memory access by each CU. So you can even achieve max performance using only single work group (but using multiple work items, not just 1, the number depends on how big of an object each work item is accessing and its capability). You can benchmark this on an array-summation or reduction kernel. Just 1 compute unit is generally more than enough to utilize global memory bandwidth unless its single L2-L1 bandwidth is lower than the global memory bandwidth. But may not be true for highest-end cards.
What is the parallelism between L2 and L1 for your card? Only 1 active line at a time? Then you probably rewuire 8 workitems distributed on 8 work groups.
According to datasheet from amd about rdna, each shader is capable to do 10-20 requests in flight so if 1 rdna compute unit L1-L2 communication is enough to use all bw of global mem, then even just a few workitems from single work group should be enough.
L1-L2 bandwidth:
It says 4 lines active between each L1 nad the L2. So it must have 256GB/s per compute unit. 4 workgroups running on different CU should be enough for a 1TB/s main memory. I guess OpenCL has no access to this information and this can change for new cards so best thing would be to benchmark for various settings like from 1 CU to N CU, from 1 work item to N work items. It shouldn't take much time to measure under no contention (i.e. whole gpu server is only dedicated to you).
Shader bandwidth:
If these are per-shader limits, then a single shader can use all of its own CU L1-L2 bandwidth, especially when reading.
Also says L0-L1 cache line size is 128 bytes so 1 workitem could be using that wide data type.
N-way-set-associative cache (L1, L2 in above pictures) and direct-mapped cache (maybe texture cache?) use the modulo mapping. But LRU (L0 here) may not require the modulo access. Since you need global memory bandwidth, you should look at L2 cache line which is n-way-set-associative hence the modulo. Even if data is already in L0, the OpenCL spec may not let you do non-modulo-x access to data. Also you don't have to think about alignment if the array is of type of the data you need to work with.
If you dont't want to fiddle with microbenchmarking and don't know how many workitems required, then you can use async workgroup copy commands in kernel. The async copy implementation uses just the required amount of shaders (or no shaders at all? depending on hardware). Then you can access the local memory fast, from single workitem.
But, a single workitem may require an unrolled loop to do the pipelining to use all the bandwidth of its CU. Just a single read/write operation will not fill the pipeline and make the latency visible (not hidden behind other latencies).
Note: L2 clock frequency can be different than main memory frequency, not just 1GHz. There could be a L3 cache or something else to adapt a different frequency in there. Perhaps its the gpu frequency like 2GHz. Then all of the L1 L0 bandwidths are also higher, like 512 GB/s per L1-L2 communication. You may need to query CL_DEVICE_MAX_CLOCK_FREQUENCY for this. In any way, just 1 CU looks like capable of using bandwidth of 90% of high-end cards. An RX6800XT has 512GB/s main memory bandwidth and 2GHz gpu so likely it can use only 1 CU to do it.

frequency sampling limit for beaglebone adc

I intend to use the beaglebone to sample a shaped signal of the order of 1 microsec. I need to fit the signal after and therefore i would like to have a sampling rate of let's 10 MHZ. Something that seems feasible with PRU and libpruio. The point is, looking to the adc specifications it seems there is a limit at 200KHz. Is my reasoning correct?
thanks

You'll need additional hardware for a sampling rate of 10 MHz! libpruio isn't designed to work at that speed, as well as the BBB hardware.
The ADC subsystem in the AM335x CPU is clocked at 24 MHz and needs 15 cycles for a sample (14 in continous mode). This leads to a maximum sample rate of 1.6 (1.74) MSamples/s. See SRM, chapter 12 for details.
The problem is to get the samples in to the host memory. I couldn't get this working faster than ~250 kSamples/s (by CPU access - I didn't try DMA).
As long as you don't need more values than the FIFO can hold, you can sample a single line at maximum 1.7 MHz.
BR

iOS Metal compute pipeline slower than CPU implementation for search task

I made simple experiment, by implementing naive char search algorithm searching 1.000.000 rows of 50 characters each (50 mil char map) on both CPU and GPU (using iOS8 Metal compute pipeline).
CPU implementation uses simple loop, Metal implementation gives each kernel 1 row to process (source code below).
To my surprise, Metal implementation is on average 2-3 times slower than simple, linear CPU (if I use 1 core) and 3-4 times slower if I employ 2 cores (each of them searching half of database)!
I experimented with diffrent threads per group (16, 32, 64, 128, 512) yet still get very similar results.
iPhone 6:
CPU 1 core: approx 0.12 sec
CPU 2 cores: approx 0.075 sec
GPU: approx 0.35 sec (relEase mode, validation disabled)
I can see Metal shader spending more than 90% of accessing memory (see below).
What can be done to optimise it?
Any insights will be appreciated, as there are not many sources in the internet (besides standard Apple programming guides), providing details on memory access internals & trade-offs specific to the Metal framework.
METAL IMPLEMENTATION DETAILS:
Host code gist:
https://gist.github.com/lukaszmargielewski/0a3b16d4661dd7d7e00d
Kernel (shader) code:
https://gist.github.com/lukaszmargielewski/6b64d06d2d106d110126
GPU frame capture profiling results:

The GPU shader is also striding vertically through memory, whereas the CPU is moving horizontally. Consider the addresses actually touched more or less concurrently by each thread executing in lockstep in your shader as you read charTable. The GPU will probably run a good deal faster if your charTable matrix is transposed.
Also, because this code executes in a SIMD fashion, each GPU thread will probably have to run the loop to the full search phrase length, whereas the CPU will get to take advantage of early outs. The GPU code might actually run a little faster if you remove the early outs and just keep the code simple. Much depends on the search phrase length and likelihood of a match.

I'll take my guesses too, gpu isn't optimized for if/else, it doesn't predict branches (it probably execute both), try to rewrite the algorithm in a more linear way without any conditional or reduce them to bare minimum.

IOPS versus Throughput

What is the key difference between IOPS and Throughput in large data storage?
Does file size have an effect on IOPS? Why?

IOPS measures the number of read and write operations per second, while throughput measures the number of bits read or written per second.
Although they measure different things, they generally follow each other as IO operations have about the same size.
If you have large files, you simply need more IO operations to read the entire file. The file size has no effect on the IOPS as it measures the number of clusters read or written, not the number of files.
If you have small files, there will be more overhead, so while the IOPS and throughput look good, you may experience a lower actual performance.

This is the analogy I came up with when talking about Throughput and IOPS.
Think of it as:
You have 4 buckets (Disk blocks) of the same size that you want to fill or empty with water.
You'll be using a jug to transfer the water into the buckets. Now your question will be:
At a given time (per second), how many jugs of water can you pour (write) or withdraw (read)? This is IOPS.
At a given time (per second) what's the amount (bit, kb, mb, etc) of water the jug can transfer into/out of the bucket continuously? This is throughput.
Additionally, there is a delay in the process of you pouring and/or withdrawing the water. This is Latency.
There's 3 things to consider when talking about IOPS and Throughput:
Size (file size/block size)
Patterns (Random/Sequential)
Mix (Read/Write) percentage

The Disk IOPS Describes the count of input/output operations on the disk per seconds, regardless block size.
The disk throughput describes how many data may be transferred per second, so the block size play a huge role upon calculating the throughput required by app
Let's consider as the sample the 3000 IOPS and SQL database engine, the block size in terms of db engine is called the page size and for SQL Server it's equal to 8 KB. If you wish to calculate the actual throughput, if the IOPS defined, you will end up with the formula below:
throughput = [IOPS] * [block size] = 3000 * 8 = 24 000 KB/s = 24 MB/s

IOPS - Number of read write operations mostly useful for OLTP transactions used in AWS for DBs like Cassandra.
Throughput - Is the number of bit transferred per sec. i.e.data transferred per sec.
Mainly a unit for high data transfer applications like big data hadoop,kafka streaming

IOPS- The time taken for a storage system to perform an Input/Output operation per second from start to finish constitutes IOPS.
Throughput- Data transfer speed in megabytes per second is often termed as throughput. Earlier, it was measured in Kilobytes. But now the standard has become megabytes.
More about this see: What is the difference between IOPS and throughput?

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart