Is cudaMemcpy3DPeer supported on GeForce cards?

Is it possible to use peer-to-peer memory transfer on GeForce cards, or is it allowed only on Teslas? Assume the cards are two GTX 690s (each one has two GPUs on board).
I have tried copying between a Quadro 4000 and a Quadro 600, and it failed. I was transferring 3D arrays using cudaMemcpy3DPeer by filling in the cudaMemcpy3DPeerParms struct.

Peer-to-peer memory copy should work on GeForce and Quadro as well as Tesla; see the programming guide for more details:
Memory copies can be performed between the memories of two different devices. When a unified address space is used for both devices (see Unified Virtual Address Space), this is done using the regular memory copy functions mentioned in Device Memory. Otherwise, this is done using cudaMemcpyPeer(), cudaMemcpyPeerAsync(), cudaMemcpy3DPeer(), or cudaMemcpy3DPeerAsync().
Peer-to-peer memory access, where one GPU can directly read from another GPU's memory, additionally requires UVA (and therefore a 64-bit OS), a Tesla-series device, and compute capability 2.0 or higher:
When the application is run as a 64-bit process on Windows Vista/7 in TCC mode (see Tesla Compute Cluster Mode for Windows), on Windows XP, or on Linux, devices of compute capability 2.0 and higher from the Tesla series may address each other's memory (i.e., a kernel executing on one device can dereference a pointer to the memory of the other device).
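For concreteness, here is a minimal host-side sketch of such a transfer between two devices (the device indices, the extent and the error handling are illustrative assumptions, not taken from the question):

    // Minimal sketch: peer-to-peer 3D copy from device 0 to device 1.
    // Device indices, extent and error handling are illustrative assumptions.
    #include <cuda_runtime.h>
    #include <cstdio>

    int main() {
        int canAccess = 0;
        cudaDeviceCanAccessPeer(&canAccess, 0, 1);
        std::printf("device 0 can access device 1: %d\n", canAccess);

        // Enabling peer access is what lets kernels dereference remote pointers;
        // cudaMemcpy3DPeer itself goes through the copy engines.
        if (canAccess) {
            cudaSetDevice(0);
            cudaDeviceEnablePeerAccess(1, 0);
        }

        // Allocate a pitched 3D buffer on each device.
        cudaExtent extent = make_cudaExtent(256 * sizeof(float), 256, 64);
        cudaPitchedPtr src, dst;
        cudaSetDevice(0);
        cudaMalloc3D(&src, extent);
        cudaSetDevice(1);
        cudaMalloc3D(&dst, extent);

        // Fill the peer-copy descriptor and issue the copy.
        cudaMemcpy3DPeerParms p = {};
        p.srcDevice = 0;
        p.srcPtr    = src;
        p.dstDevice = 1;
        p.dstPtr    = dst;
        p.extent    = extent;
        cudaError_t err = cudaMemcpy3DPeer(&p);
        std::printf("cudaMemcpy3DPeer: %s\n", cudaGetErrorString(err));

        cudaSetDevice(0); cudaFree(src.ptr);
        cudaSetDevice(1); cudaFree(dst.ptr);
        return 0;
    }

If cudaDeviceCanAccessPeer reports 0 (for example because the two GPUs sit behind different PCIe root complexes, or the board combination simply doesn't support peer access), cudaMemcpy3DPeer can still be used; the runtime then stages the copy through host memory instead of a direct GPU-to-GPU transfer.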

Related

Can an x86 CPU read or write a physical address that is larger than RAM? [duplicate]

I'm doing an operating systems lab on QEMU. I found that reads and writes are allowed when the physical address produced by paging is beyond the end of RAM. Is the behaviour the same on a real x86 machine? Will x32 or x64 give different results?
The physical address space contains RAM, ROM, memory mapped devices (some PCI and some built into the chipset) and unused space.
An OS can access all of it, including unused space (even though there's no sane reason to deliberately access unused space).
The total amount of physical address space depends on the CPU, and is a "size in bits" (which you can obtain from the CPUID instruction) that ranges from 32 bits to 52 bits, but is often in the 36 to 48 bit range. If you try to use paging to access a "too high, not supported by the CPU" physical address you will get a page fault (with the "reserved bit set" flag in the error code), because the physical-address bits the CPU doesn't support are treated as reserved and the CPU checks whether reserved bits are set in page table entries, etc.
Note that when writing an OS (for modern CPUs) it's easier to assume that physical addresses are 64 bits (regardless of what the CPU supports) and that the physical address space includes a reserved area that can't be accessed (where the size of the reserved area depends on what the CPU supports); as this simplifies code and data structures used for physical memory management (e.g. C has a uint64_t type but nothing has a uint52_t).
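As an aside, that "size in bits" can be read directly from software. Here is a small sketch (assuming an x86/x86-64 host and GCC or Clang, which provide <cpuid.h>) that queries leaf 0x80000008, where EAX[7:0] reports the physical-address width and EAX[15:8] the linear-address width:

    // Sketch: query MAXPHYADDR (the supported physical-address width) via CPUID.
    // Assumes an x86/x86-64 host compiled with GCC or Clang.
    #include <cpuid.h>
    #include <cstdio>

    int main() {
        unsigned int eax, ebx, ecx, edx;
        if (__get_cpuid(0x80000008, &eax, &ebx, &ecx, &edx)) {
            std::printf("physical address bits: %u\n", eax & 0xFF);
            std::printf("linear address bits:   %u\n", (eax >> 8) & 0xFF);
        } else {
            // Very old CPUs don't implement this leaf; the architectural
            // fallback is 36 bits if PAE is supported, otherwise 32.
            std::printf("CPUID leaf 0x80000008 not available\n");
        }
        return 0;
    }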
I'm doing an operating systems lab on QEMU. I found that reads and writes are allowed when the physical address produced by paging is beyond the end of RAM. Is the behaviour the same on a real x86 machine?
Yes; both QEMU and real hardware work the same.
Will x32 or x64 cause different results?
The CPU supports several types of paging structures - "plain 32-bit paging", PSE36, PAE (Physical Address Extension), and long mode. For x32 you can't use long-mode paging, but PAE normally has the same layout and the same physical address restrictions (the only case where it doesn't is some Xeon Phi accelerator cards).
If x32 is using "plain 32-bit paging", physical addresses will be restricted to 32 bits; and if it's using PSE36, physical addresses will be restricted to 36 bits.
The other possibility is that x32 isn't using any paging at all. In this case addresses are masked so that only 32 bits can be used (e.g. if you create a segment with a base address of 0xFFFFF000 and a "high enough" limit, then use an offset within the segment that's 0x00001000 or more, the result will be masked, causing physical addresses to wrap around; like (0xFFFFF000 + 0x00001234) & 0xFFFFFFFF = 0x00000234).
Apart from that, it still works the same (you can still access unused parts of the physical address space, there's just less of it, and you might not be able to access all RAM).
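As a complement, here is a hedged sketch (same GCC/Clang <cpuid.h> assumption as above) that checks which of those paging-related features the CPU advertises; CPUID leaf 1 reports PAE in EDX bit 6 and PSE-36 in EDX bit 17, and extended leaf 0x80000001 reports long mode in EDX bit 29:

    // Sketch: check which paging-related features the CPU advertises.
    #include <cpuid.h>
    #include <cstdio>

    int main() {
        unsigned int eax, ebx, ecx, edx;
        if (__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
            std::printf("PAE:       %s\n", (edx & (1u << 6))  ? "yes" : "no");
            std::printf("PSE-36:    %s\n", (edx & (1u << 17)) ? "yes" : "no");
        }
        if (__get_cpuid(0x80000001, &eax, &ebx, &ecx, &edx)) {
            std::printf("long mode: %s\n", (edx & (1u << 29)) ? "yes" : "no");
        }
        return 0;
    }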

Is Intel QuickPath Interconnect (QPI) used by processors to access memory?

I have read An Introduction to the Intel® QuickPath Interconnect. The document does not mention that QPI is used by processors to access memory. So I think that processors don't access memory through QPI.
Is my understanding correct?
Intel QuickPath Interconnect (QPI) is not wired to the DRAM DIMMs, and as such it is not used to access the memory that is connected to the CPU's integrated memory controller (iMC).
The paper you linked includes a figure that shows the connections of a processor, with the QPI signals pictured separately from the memory interface.
The text just before that figure confirms that QPI is not used to access memory:
The processor also typically has one or more integrated memory controllers. Based on the level of scalability supported in the processor, it may include an integrated crossbar router and more than one Intel® QuickPath Interconnect port.
Furthermore, if you look at a typical datasheet you'll see that the CPU pins for accessing the DIMMs are not the ones used by QPI.
QPI is, however, used to access the uncore, the part of the processor that contains the memory controller.
(Diagram courtesy of the QPI article on Wikipedia.)
QPI is a fast, general-purpose internal bus; in addition to giving access to the uncore of the CPU, it gives access to other CPUs' uncores.
Due to this link, every resource available in the uncore can potentially be accessed over QPI, including the iMC of a remote CPU.
QPI defines a protocol with multiple message classes, two of which are used to read memory through another CPU's iMC.
The flow uses a stack similar to the usual network stack.
Thus the path to remote memory includes a QPI segment, but the path to local memory doesn't.
Update
For the Xeon E7 v3-18C CPU (designed for multi-socket systems), the Home Agent doesn't access the DIMMs directly; instead it uses an Intel SMI2 link to reach the Intel C102/C104 Scalable Memory Buffer, which in turn accesses the DIMMs.
The SMI2 link is faster than DDR3, and the memory controller implements reliability features or interleaving with the DIMMs.
Initially the CPU used an FSB to access the north bridge; the north bridge contained the memory controller and was linked to the south bridge (ICH, I/O Controller Hub in Intel terminology) through DMI.
Later the FSB was replaced by QPI.
Then the memory controller was moved into the CPU (it uses its own bus to access memory, while QPI is used to communicate with the rest of the system).
Later, the north bridge (IOH, I/O Hub in Intel terminology) was integrated into the CPU and used to access the PCH (which now replaces the south bridge), while PCIe was used to access fast devices (like an external graphics controller).
Recently the PCH has been integrated into the CPU as well, which now exposes only PCIe, DIMM pins, SATA Express and the other common internal buses.
As a rule of thumb the buses used by the processors are:
To other CPUs - QPI
To IOH - QPI (if IOH present)
To the uncore - QPI
To DIMMs - pins, as the DRAM technology (DDR3, DDR4, ...) mandates. For Xeon v2+, Intel uses a fast SMI(2) link to connect to an off-core memory controller (Intel C102/C104) that handles the DIMMs and channels based on two configurations.
To PCH - DMI
To devices - PCIe, SATA Express, I2C, and so on.
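A quick way to observe that local/remote split from software is the NUMA distance table the firmware reports (ACPI SLIT), which Linux exposes under /sys/devices/system/node/; the extra distance for a remote node reflects the hop over the socket interconnect (QPI here). A small Linux-only sketch, with the sysfs path as the only assumption beyond a NUMA-aware kernel:

    // Sketch: print the NUMA distance table (ACPI SLIT) exposed by Linux.
    // A value of 10 means "local"; larger values mean extra interconnect hops.
    #include <cstdio>

    int main() {
        for (int node = 0; ; ++node) {
            char path[64];
            std::snprintf(path, sizeof path,
                          "/sys/devices/system/node/node%d/distance", node);
            std::FILE *f = std::fopen(path, "r");
            if (!f) break;                               // no more nodes
            char line[256] = {0};
            if (std::fgets(line, sizeof line, f))
                std::printf("node%d: %s", node, line);   // e.g. "10 21" on 2 sockets
            std::fclose(f);
        }
        return 0;
    }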
Yes, QPI is used to access all remote memory on multi-socket systems, and much of its design and performance is intended to support such access in a reasonable fashion (i.e., with latency and bandwidth not too much worse than local access).
Basically, most x86 multi-socket systems are lightly[1] NUMA: every DRAM bank is attached to the memory controller of a particular socket; this memory is then local memory for that socket, while the remaining memory (attached to some other socket) is remote memory. All access to remote memory goes over the QPI links, and on many systems[2] that is fully half of all memory accesses or more.
So QPI is designed to be low latency and high bandwidth to make such access still perform well. Furthermore, aside from pure memory access, QPI is the link through which the cache coherence between sockets occurs, e.g., notifying the other socket of invalidations, lines which have transitioned into the shared state, etc.
[1] That is, the NUMA factor is fairly low, typically less than 2 for latency and bandwidth.
[2] E.g., with NUMA interleave mode on, and 4 sockets, 75% of your access is remote.
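To see that NUMA factor from user space, here is a rough sketch using libnuma on Linux (the node numbers, the buffer size and the use of memset as the "workload" are illustrative assumptions; link with -lnuma). It pins the thread to node 0 and touches one buffer allocated on the local node and one on a remote node, so the second pass has to cross QPI:

    // Rough sketch: touch memory bound to the local NUMA node vs a remote node.
    // Linux-only; link with -lnuma.
    #include <numa.h>
    #include <chrono>
    #include <cstdio>
    #include <cstring>

    static double touch_ms(void *buf, size_t bytes) {
        auto t0 = std::chrono::steady_clock::now();
        std::memset(buf, 1, bytes);                 // write every byte once
        auto t1 = std::chrono::steady_clock::now();
        return std::chrono::duration<double, std::milli>(t1 - t0).count();
    }

    int main() {
        if (numa_available() < 0 || numa_max_node() < 1) {
            std::printf("need a NUMA system with at least two nodes\n");
            return 1;
        }
        const size_t bytes = 512ull << 20;          // 512 MiB per buffer

        void *local  = numa_alloc_onnode(bytes, 0); // pages on node 0
        void *remote = numa_alloc_onnode(bytes, 1); // pages on node 1
        if (!local || !remote) { std::printf("allocation failed\n"); return 1; }

        numa_run_on_node(0);                        // run the thread on node 0
        std::printf("local  touch: %.1f ms\n", touch_ms(local, bytes));
        std::printf("remote touch: %.1f ms\n", touch_ms(remote, bytes));

        numa_free(local, bytes);
        numa_free(remote, bytes);
        return 0;
    }

The first write also pays the page-fault cost, so treat the numbers as a rough indication of the local/remote difference rather than a clean measurement of the QPI penalty.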

Is the memory snooping possible against the Intel architecture?

I am interested in developing an external hardware monitor for the Intel architecture.
I want to know if it is possible to snoop the bus between the CPU and DRAM to know which physical address is being read or written.
In other words, can we attach any devices to the motherboard or use other devices such as graphics card to snoop the bus to know which memory area is accessed by the CPU?
The reason I want to use a hardware approach is that, under our threat model, we don't fully trust the kernel or hypervisor that translates VAs to PAs.

Is address 0xFFFFFFF0 hardwired for system BIOS ROM?

I read this from a previous stack overflow answer:
At initial power on, the BIOS is executed directly from ROM. The ROM chip is mapped to a fixed location in the processor's memory space (this is typically a feature of the chipset). When the x86 processor comes out of reset, it immediately begins executing from 0xFFFFFFF0.
Follow up questions,
Is the address 0xFFFFFFF0 hardwired just to access the system BIOS ROM, and after the system is up and running can this address 0xFFFFFFF0 not be used by RAM?
Also, when this address 0xFFFFFFF0 is being used to access the system BIOS ROM, is the CPU accessing it as an I/O device or as a memory device?
At power up, it is ROM. It has to be, or the CPU would be unable to boot. Some chipsets have register bits that allow you to unmap the BIOS flash chip from the memory address space. Of course you should not do this while executing from ROM!
There is a common technique on PC hardware called "shadowing" where the BIOS will copy the contents of the ROM chip into RAM mapped at the same address. RAM is generally much faster than ROM, so it can speed up the system.
As for your second question, it is a memory device. It must be for the following reasons:
I/O addresses are 16 bits wide, not 32.
An x86 processor cannot execute code from I/O space; you cannot point the instruction pointer at an I/O address.
It's mapped into the global memory space and addressed in the same way. Conventionally, RAM shouldn't be mapped to any range of addresses used by other devices. This is common enough: you might remember, a few years ago before 64-bit operating systems became standard on home PCs, that a user could have 4 GB of physical memory installed but only about 3.5 GB accessible because the graphics card was mapped into 512 MB of the address space.
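On Linux you can see this layout directly: the kernel's physical-address map in /proc/iomem lists which ranges are System RAM and which are ROM, reserved, or device windows. A small sketch (run as root to see real addresses instead of zeros):

    // Sketch: dump the physical-address map the Linux kernel exposes.
    #include <cstdio>

    int main() {
        std::FILE *f = std::fopen("/proc/iomem", "r");
        if (!f) { std::perror("/proc/iomem"); return 1; }

        char line[256];
        while (std::fgets(line, sizeof line, f))
            std::fputs(line, stdout);   // e.g. "000f0000-000fffff : System ROM"
                                        //      "00100000-bfffffff : System RAM"
        std::fclose(f);
        return 0;
    }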

How do I increase the "global memory" available to the Intel CPU OpenCL driver?

My system has 32 GB of RAM, but the device information for the Intel OpenCL implementation says "CL_DEVICE_GLOBAL_MEM_SIZE: 2147352576" (~2 GB).
I was under the impression that on a CPU platform the global memory is the "normal" RAM, and thus something like 30+ GB should be available to the OpenCL CPU implementation (of course I'm using the 64-bit version of the SDK).
Is there some sort of secret setting to tell the Intel OpenCL driver to increase global memory and use all of the system memory?
SOLVED: Got it working by recompiling everything as 64-bit. As stupid as it seems, I thought that OpenCL worked similarly to OpenGL, where you can easily allocate e.g. 8 GB of texture memory from a 32-bit process and the driver handles the details for you (of course you can't allocate 8 GB in one sweep, but you can transfer multiple textures that add up to more than 4 GB).
I still think that limiting the OpenCL memory abstraction to the address space of the process (at least for the Intel/AMD drivers) is irritating, but maybe there are some subtle details or performance tradeoffs behind why this implementation was chosen.
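For anyone comparing drivers, here is a small sketch that queries the limits in question through the standard OpenCL C API (platform and device selection is deliberately simplified to "first one found"; link with -lOpenCL):

    // Sketch: print CL_DEVICE_GLOBAL_MEM_SIZE and CL_DEVICE_MAX_MEM_ALLOC_SIZE
    // for the first device of the first platform.
    #define CL_TARGET_OPENCL_VERSION 120
    #include <CL/cl.h>
    #include <cstdio>

    int main() {
        cl_platform_id platform;
        cl_device_id device;
        if (clGetPlatformIDs(1, &platform, nullptr) != CL_SUCCESS ||
            clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 1, &device, nullptr) != CL_SUCCESS) {
            std::printf("no OpenCL platform/device found\n");
            return 1;
        }

        cl_ulong globalMem = 0, maxAlloc = 0;
        clGetDeviceInfo(device, CL_DEVICE_GLOBAL_MEM_SIZE,
                        sizeof globalMem, &globalMem, nullptr);
        clGetDeviceInfo(device, CL_DEVICE_MAX_MEM_ALLOC_SIZE,
                        sizeof maxAlloc, &maxAlloc, nullptr);

        std::printf("CL_DEVICE_GLOBAL_MEM_SIZE:    %llu MiB\n",
                    (unsigned long long)(globalMem >> 20));
        std::printf("CL_DEVICE_MAX_MEM_ALLOC_SIZE: %llu MiB\n",
                    (unsigned long long)(maxAlloc >> 20));
        return 0;
    }

Note that OpenCL also caps a single allocation at CL_DEVICE_MAX_MEM_ALLOC_SIZE (the spec only requires it to be at least a quarter of the global size), so even with a 64-bit build the reported global memory is not usable as one contiguous block.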
