Can we use SSE intrinsics to write to memory-mapped PCI device memory?

I have a use case where the x86 CPU has to write 64 bytes of data to a PCIe slave device whose memory has been mmap'ed into user space. As of now, I use memcpy to do that, but it turns out to be very slow. Can we use Intel SSE intrinsics like _mm_stream_si128 to speed it up? Or is there any other mechanism, other than DMA?
The objective is to pack all the 64 bytes into one TLP and send it on the PCI bus to reduce the overhead.

As I understand it, memory mapped I/O doesn't make certain store instructions special. An 8B store from movq mem, xmm is the same as the store from mov mem, r64.
I think if you have 64B to write into MMIO, you should do it with whatever instructions do it most efficiently as it's generated, then flush the cache line. Generating a 64B buffer and then doing memcpy (or doing it yourself with four movdqa, or two AVX vmovdqa) is a waste of time, unless you expect your code that generates the 64B to be slow and more likely to be interrupted partway through than memcpy. A timer interrupt can come in at any time, including during your memcpy, if you're in user space where you can't disable interrupts. Since you can't guarantee complete 64B writes, a 99.99% chance of a full cache-line write vs. a 99.99999% chance probably won't make a difference.
Streaming stores to the MMIO region might avoid the CPU doing a read-for-ownership after the clflush from the previous write. clwb isn't available yet, so the only option is clflush, which evicts the data from cache.
Non-temporal load/stores are so-called weakly-ordered. IDK if that means you'd need more fencing to guarantee ordering.
One use-case for streaming loads/stores is copying from uncacheable memory, like video RAM. I'm not sure about using them for MMIO. I found this article about it, talking about how to read from MMIO without just getting the same cached value.
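As a rough illustration of the streaming-store idea discussed above, here is a minimal sketch that copies one 64-byte block with four _mm_stream_si128 stores followed by _mm_sfence. The helper name write64_stream is hypothetical, and the sketch assumes both pointers are 16-byte aligned. On ordinary memory it simply copies the data. Whether the four stores leave the CPU as a single 64-byte TLP is not guaranteed; as far as I know it depends on the region being mapped write-combining and on the CPU and chipset.

```c
#include <assert.h>
#include <emmintrin.h> /* SSE2: _mm_load_si128, _mm_stream_si128, _mm_sfence */
#include <stdint.h>

/* Hypothetical helper: copy one 64-byte block using four non-temporal
 * 16-byte stores, then fence so the weakly-ordered stores are made
 * globally visible before any later stores. */
static void write64_stream(void *dst, const void *src)
{
    /* Assumption: both pointers are 16-byte aligned. */
    assert(((uintptr_t)dst & 15u) == 0 && ((uintptr_t)src & 15u) == 0);

    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;

    /* Load the whole block first so the four stores are issued back to back. */
    __m128i r0 = _mm_load_si128(s + 0);
    __m128i r1 = _mm_load_si128(s + 1);
    __m128i r2 = _mm_load_si128(s + 2);
    __m128i r3 = _mm_load_si128(s + 3);

    _mm_stream_si128(d + 0, r0);
    _mm_stream_si128(d + 1, r1);
    _mm_stream_si128(d + 2, r2);
    _mm_stream_si128(d + 3, r3);

    _mm_sfence();
}
```

With a real device, dst would be the mmap'ed device address, ideally 64-byte aligned so that the four stores can fill one write-combining buffer.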

Related

Does a DMA controller copy one word of memory at a time?

A DMA controller greatly speeds up memory copy operations because the data in memory doesn't have to be read into the CPU.
From what I've read, DMA controllers can "copy a block of memory from one location to another" in one operation, but thinking about this at a low level, I'm guessing the DMA ultimately has to iterate over memory one word at a time. Is that correct? Is that one word per clock cycle? One word per two clock cycles? (one for read memory into the DMA, one for write to memory) Or does the DMA have a circuit that can somehow (I can't imagine how) copy large chunks of memory in one or two clock cycles?
If the CPU tells the DMA to copy 1024 bytes of memory from one address to another, how many clock cycles will the CPU have free to perform other tasks while waiting for the DMA to finish?
Is it possible to have an architecture where the DMA is doing a memory copy using one bus, while the CPU can access memory at the same time in a different area? Say, in a different bank?
I'm sure it's architecture dependent, so for the answers just pick one or more 8 or 16 bit home micros.
Yes, this is architecture dependent. Usually there is a main memory bus and one or more caches. The buses in a system are not required to all be the same width. Memory buses are usually wider than processor words; a bus might be 64 bits wide, so it loads 64 bits at a time.
Then the destination might be the same memory bus, or PCIE, or even another bus that is memory mapped, in which case the transfers might be constrained by the destination bus width.
How many clock cycles are available again depends on how things are done. Usually in a µC the DMA triggers an interrupt when it is done, and the CPU does not have to do anything for the transfer until then. Another option is polling.
Dual-port memories exist, but usually there is only one main memory bus. IIRC banks are usually a trick to avoid large addresses, but they use the same memory bus.
Cache and bus arbitration are used to mitigate bus contention, the user really shouldn't care about that. You can have a look at your µC datasheet if you want reliable information.

cache read system memory vs cpu read system memory

On an ARM-based SoC running Android/Linux, I observed the following:
Method 1: Allocate a memory area as un-cached for device DMA input. After the DMA finishes, the content of this memory area is copied to another system memory area.
Method 2: Allocate a memory area as cached for device DMA input. After the DMA finishes, invalidate the memory range, then copy the content of this memory area to another system memory area.
The size of the memory area allocated is about 2MB, which is larger than the cache size (the L2 cache size is 256KB).
Method 2 is 10x faster than method 1. That is, the memory copy operation of method 2 is 10x faster than that of method 1.
I speculate that method 2 reads from system memory a cache line at a time when copying, while method 1 needs the CPU to read from system memory one bus transaction at a time, bypassing the cache hardware.
However, I cannot find an explicit explanation. I would appreciate it if someone could provide a detailed explanation.
There are so many hardware items involved that it is difficult to give specifics. The SOC determines a lot of this. However, what you observe is typical in performance terms for modern ARM systems.
The main factor is SDRAM. All DRAM is structured with 'rows' and 'columns' (see DRAM history). On the DRAM chip, an entire 'row' can be read at one time. I.e., there is a matrix of transistors and there is a physical point/wiring where an entire row can be read (in fact there may be SRAM to store the row on the chip). When you read from another 'row', you need to 'un-charge/pre-charge' the wiring to access the new 'row'. This takes some time. The main point is that DRAM can read sequential memory very fast in large chunks. Also, there is no command overhead as the memory streams out with each clock edge.
If you mark memory as un-cached, then a CPU/SOC may issue single beat reads. Often these will 'pre-charge' consuming extra cycles during a single read/write and many extra commands must be sent to the DRAM device.
SDRAM also has 'banks'. A bank has a separate 'ROW' buffer (static RAM/multi-transistor memory), which allows you to read from one bank and then another without having to recharge/re-read. The banks are often very far apart. If your OS has physically allocated the 'un-cached' memory in a different bank from the 2nd 'cached' area, then this will also add some efficiency. It is common for an OS to manage cached/un-cached memory separately (for MMU issues). The memory pools are often distant enough to be in separate banks.

Memcpy from PCIe memory takes more time than memcpy to PCIe memory

I am trying to read and write data between a Linux PC and a PCIe 2.0 (2 lane) device. The memory regions for reading and writing are at different RAM locations in the PCIe device. They are mapped into the Linux PC using ioremap. My use case needs 18 MBytes/second of read/write throughput, which is obviously supported by the PCIe link. The memory on the PCIe device is uncached.
I am able to achieve the write throughput, i.e. when I write from Linux PC local memory to PCIe device memory using memcpy. The memcpy takes less than 1 ms for 9216 bytes of data in this case. But when I read the ioremapped PCIe memory into Linux local memory, data loss happens. I profiled the memcpy and it takes more than 1 ms, sometimes 2 ms, for 9216 bytes of data. I don't want to use DMA for this operation.
Any thoughts on what can be the problem in this case? How can I handle this?
That's entirely expected, and there is nothing you can do about that. The CPU can only issue serialized word-sized reads and writes, which have very poor throughput over the PCIe link due to protocol overheads. Every operation has 24 or 28 byte-times worth of overhead associated with it: that's a 12 or 16 byte TLP header plus 12 byte-times of link layer overhead. The CPU can only operate on 4 or 8 bytes at a time, which is at best 25% efficient (8/(8+24) = 25%) and at worst 12.5% efficient (4/(4+28) = 12.5%).
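The efficiency arithmetic above can be written as a small function. This is only a sketch of the calculation in the answer; the function name tlp_efficiency_pct is made up for illustration, and the overhead figures (24 or 28 byte-times) are the ones quoted above.

```c
#include <assert.h>

/* Percentage of link byte-times that carry payload, given the payload
 * size and the per-operation overhead (TLP header + link layer), both
 * in bytes. */
static double tlp_efficiency_pct(unsigned payload_bytes, unsigned overhead_bytes)
{
    assert(payload_bytes + overhead_bytes > 0);
    return 100.0 * payload_bytes / (payload_bytes + overhead_bytes);
}
```

tlp_efficiency_pct(8, 24) gives 25 and tlp_efficiency_pct(4, 28) gives 12.5, matching the figures above. With the same 24 byte-times of overhead, a 64-byte payload would come out at roughly 73%, which is why packing more data into each operation matters.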
The protocol overhead is not the only issue, however. Writes in PCIe are posted, so the CPU can simply issue a bunch of back-to-back writes which eventually make their way onto the bus and to the device. On the other hand, when reading, the CPU can only issue a single read operation, wait for it to traverse the bus twice, store the result, issue another read, etc. Since it can only operate on 8 bytes at a time, the performance is horrible due to the relatively high latency over the PCIe bus (can be on the order of microseconds for each transfer).
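To put a number on the serialized reads described above, here is a back-of-the-envelope sketch. The function name is hypothetical, and the 1 microsecond round-trip figure used below is an assumption picked only because the answer says "on the order of microseconds"; the real latency depends on the platform and device.

```c
#include <assert.h>

/* Estimated time in microseconds to read total_bytes when each read
 * moves bytes_per_read bytes and must complete a full round trip
 * before the next read is issued. */
static double serialized_read_time_us(unsigned total_bytes,
                                      unsigned bytes_per_read,
                                      double round_trip_us)
{
    assert(bytes_per_read > 0);
    /* Round up: a partial final chunk still costs one full read. */
    unsigned reads = (total_bytes + bytes_per_read - 1) / bytes_per_read;
    return reads * round_trip_us;
}
```

For the 9216 bytes in the question, 8-byte reads mean 1152 round trips, so about 1.15 ms at the assumed 1 microsecond each; 4-byte reads would double that to about 2.3 ms. That is in the same range as the 1 to 2 ms the question reports.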
The solution? Use DMA. PCIe is specifically designed to support efficient DMA operations over the bus, as devices can issue much larger read and write operations: every device supports a maximum payload size of at least 128 bytes per operation.

Does Word length == number of bits transferred between memory and CPU per access?

I am really confused about the concept of the "word length".
I know that in a 32-bit machine, the memory address has 32 bits, and each memory access transfers 32 bits (4 bytes) to the CPU.
In a 64-bit machine, the address has 64 bits. But does that mean the memory access unit is also 64 bits?
In this answer, the author says "Word: The natural size with which a processor is handling data (the register size)". But it does not explicitly specify how many bits are transferred between memory and CPU per memory access.
In a CPU with a cache, data usually only transfers between CPU and memory a whole cache-line at a time. e.g. on a modern x86, a 1B load that hits in cache would not produce any external memory access.
If it missed even in the last-level cache, the memory chips would see a request for the 64B aligned block containing that byte.
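To make "the 64B aligned block containing that byte" concrete, here is a small sketch that computes the base address of the cache line holding a given byte address. The 64-byte line size is the figure from the answer above; the function name is just for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define LINE_BYTES 64u /* cache-line size assumed from the text above */

/* Base address of the 64-byte aligned block that contains addr. */
static uint64_t line_base(uint64_t addr)
{
    /* Masking off the low bits only works for power-of-two sizes. */
    assert((LINE_BYTES & (LINE_BYTES - 1u)) == 0);
    return addr & ~(uint64_t)(LINE_BYTES - 1u);
}
```

For example, a 1B load from address 0x1234 falls in the line starting at 0x1200, so on a miss the memory would see a request for bytes 0x1200 through 0x123F.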
Modern x86 CPUs have 16B or even 32B (256b) data paths between cache and execution units.
See also other links in the x86 tag wiki to learn more.
It does not say that because it does not mean that. Addresses are often 64-bits on a 64-bit machine, but not always. Data paths are often 64-bits on a 64-bit machine, but not always.

HLSL: Memory coalescing with structured buffers

I'm currently working on an HLSL shader that is limited by global memory bandwidth. I need to coalesce as much memory as possible in each memory transaction. Based on the guidelines from NVIDIA for CUDA and OpenCL (DirectCompute documentation is quite lacking), the largest memory transaction size for compute capability 2.0 is 128 bytes, while the largest word that can be accessed is 16 bytes. Global memory accesses can be coalesced when the data being accessed by the threads in a warp fall into the same 128 byte segment. With this in mind, wouldn't structured buffers be detrimental for memory coalescing if the structure is larger than 16 bytes?
Suppose you have a structure of two float4's, call them A and B. You can access either A or B, but not both in a single memory transaction for an instruction issued in a non-divergent warp. The layout of the memory would look like ABABABAB. If you're trying to read consecutive structures into shared memory, wouldn't memory bandwidth be wasted by storing the data in this manner? For example, you can only access the A elements, but the hardware coalesces the memory transaction so it reads in 128 bytes of consecutive data, half of which is the B elements. Essentially, you're wasting half of your memory bandwidth. Wouldn't it be better to store the data like AAAABBBB, which is a structure of buffers instead of a buffer of structures? Or is this handled by the L1 cache, where the B elements are cached so you can access them faster when the next instruction is to read in the B elements? The only other solution would be to have even-numbered threads access the A elements, while odd-numbered threads access the B elements.
If memory bandwidth is indeed wasted, I don't see why anyone would use structured buffers other than for convenience. Hopefully I explained this well enough so someone could understand. I would ask this on the NVIDIA developer forums, but I think they're still down. Visual Studio keeps crashing when I try to run the NVIDIA Nsight frame profiler, so it's difficult to see how the memory bandwidth is affected by changes in how the data is stored. P.S., has anyone been able to successfully run the NVIDIA Nsight frame profiler?
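The ABAB versus AAAABBBB comparison above can be checked with a simple count of how many 128-byte segments one warp touches when every thread reads only its A element. This is a deliberately simplified model written in C, not HLSL: it assumes a 128-byte aligned buffer, a warp of 32 threads, and ignores any caching. The function name is hypothetical.

```c
#include <assert.h>

#define SEGMENT_BYTES 128u /* transaction size quoted in the question */

/* Count the distinct 128-byte segments touched when thread t reads
 * elem_bytes starting at byte offset t * stride_bytes. */
static unsigned segments_touched(unsigned threads, unsigned stride_bytes,
                                 unsigned elem_bytes)
{
    assert(elem_bytes > 0);
    unsigned count = 0;
    unsigned long next_new = 0; /* lowest segment index not yet counted */
    for (unsigned t = 0; t < threads; t++) {
        unsigned long start = (unsigned long)t * stride_bytes;
        unsigned long first = start / SEGMENT_BYTES;
        unsigned long last = (start + elem_bytes - 1) / SEGMENT_BYTES;
        /* Offsets never decrease, so any segment >= next_new is new. */
        for (unsigned long seg = first; seg <= last; seg++) {
            if (seg >= next_new) {
                count++;
                next_new = seg + 1;
            }
        }
    }
    return count;
}
```

With 32 threads each reading a 16-byte float4, the interleaved layout (stride 32) touches 8 segments, while the separated layout (stride 16) touches 4. Under this model the interleaved layout moves twice as many bytes to fetch the same A elements, which is the "half of your memory bandwidth" concern raised in the question.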
