Determine Rank and Channel Bits in the Haswell Physical Address Mapping

I have an Intel(R) Core(TM) i7-4720HQ CPU @ 2.60GHz (Haswell). The chipset datasheets include Datasheet, volume 1 (M- and H-processor lines) and Datasheet, volume 2 (M- and H-processor lines), and the physical address mapping is described there.
I want to know which bits of the physical address determine the target rank and channel. According to Volume 2, if high-order rank interleaving is enabled, any one of address bits 20-27 can be used as the rank-select bit.
Does this mean that if I use high-order rank interleaving I can choose the rank bit, and that otherwise the rank bit has to be determined by reverse engineering? And how about the channel bits?
P.S.: The chipset supports two channels, each supporting up to two DIMMs, each supporting up to two ranks.
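For illustration, here is a minimal host-side sketch of what extracting such bits could look like once the positions are known. The bit positions below are placeholders, not the real Haswell mapping (which is not publicly documented and is usually reverse-engineered), and the XOR-style channel hash is only a commonly reported pattern, not a confirmed one:

    #include <stdint.h>
    #include <stdio.h>

    /* Placeholder position: the real value must come from reverse engineering
       or from the memory controller's configuration registers. Per the
       datasheet, one of bits 20-27 if high-order rank interleaving is on. */
    #define RANK_BIT 20

    static unsigned rank_of(uint64_t paddr) {
        return (unsigned)((paddr >> RANK_BIT) & 1);
    }

    static unsigned channel_of(uint64_t paddr) {
        /* Channel selection is often an XOR hash of several address bits
           rather than a single bit; bits 8 and 16 here are purely
           illustrative. */
        return (unsigned)(((paddr >> 8) ^ (paddr >> 16)) & 1);
    }

    int main(void) {
        uint64_t paddr = 0x12345678ULL;
        printf("rank=%u channel=%u\n", rank_of(paddr), channel_of(paddr));
        return 0;
    }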

Related

CK (tCK, nCK) unit ambiguity in DDR3 standard/datasheets?

I am designing a simplistic memory controller and PHY on an Artix-7 FPGA but am having problems reading the datasheet. The timings in the memory part's datasheet (and in the JEDEC JESD79-3F document) are expressed in CK/tCK/nCK units, which are in my opinion ambiguous when the memory is not run at its nominal frequency (e.g. a clock lower than 666 MHz for a 1333 MT/s module).
If I run a 1333 MT/s module at a frequency of 300 MHz -- still allowed with DLL on, as per the datasheet speed bins -- is the CK/tCK/nCK unit equal to 1.5 ns (from the module's native 666 MHz), or 3.33 ns (from the frequency it is actually run at)? On the one hand it makes sense that certain delays are constant, but then again some delays are expressed relative to the clock edges on the CK/CK# pins (like CL or CWL).
That is to say, some timing parameters in the datasheet only change when changing speed bins. E.g. tRP is 13.5 ns for a 1333 part, which is also backwards compatible with the tRP of 13.125 ns of a 1066 part -- no matter the chosen operating frequency of the physical clock pins of the device.
But then, running a DDR3 module at 300 MHz only allows usage of CL = CWL = 5, which is again expressed in "CK" units. To my understanding, this means 5 periods of the input clock, i.e. 5 * 3.33 ns.
I suppose all I am asking is whether the "CK" (or nCK or tCK) unit is tied to the chosen speed bin (tCK = 1.5 ns when choosing DDR3-1333) or to the actual frequency of the clock signal provided to the memory module by the controlling hardware (e.g. 3.33 ns for the 600 MT/s mode)?
This is the response from u/Allan-H on Reddit, which helped me reach a conclusion:
When you set the CL in the mode register, that's the number of clocks that the chip will wait before putting the data on the pins. That clock is the clock that your controller is providing to the chip (it's SDRAM, after all).
It's your responsibility to ensure that the number of clocks you program (e.g. CL=5) when multiplied by the clock period (e.g. 1.875ns) is at least as long as the access time of the RAM. Note that you program a number of clocks, but the important parameter is actually time. The RAM must have the data ready before it can send it to the output buffers.
Now let's run the RAM at a lower speed, say 312.5MHz (3.2ns period). We now have the option of programming CL to be as low as 3, since 3 x 3.2ns > 5 x 1.875ns.
BTW, since we are dealing with fractions of a ns, we also need to take the clock jitter into account.
Counterintuitively, the DRAM chip doesn't know how fast it is; it must be programmed with that information by the DRAM controller. That information might be hard-coded into the controller (e.g. for an FPGA implementation) or set by software, which would typically read the SPD EEPROM on the DIMM to work out the speed grade and then write the appropriate values into the DRAM controller.
This also explains timing values defined as e.g. "Greater of 3CK or 5ns". In this case, the memory chip cannot respond faster than 5 ns, but the internal logic also needs 3 positive clock edges on the input CK pins to complete the action defined by this example parameter.
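To make that conclusion concrete, here is a small calculation following u/Allan-H's reasoning: pick the smallest CL whose duration covers the part's access time, and resolve "greater of nCK or t ns" parameters at the actual clock period. The 9.375 ns access time is taken from the CL5 x 1.875 ns example above; a real controller must additionally round up to a CL value the speed bin actually supports.

    #include <math.h>
    #include <stdio.h>

    /* Smallest CAS latency (in cycles of the actual clock) whose duration
       covers the access time of the part. */
    static int min_cl(double access_time_ns, double clk_period_ns) {
        return (int)ceil(access_time_ns / clk_period_ns);
    }

    /* Resolve a "greater of nCK or t ns" parameter into clock cycles. */
    static int min_cycles(int n_ck, double t_ns, double clk_period_ns) {
        int by_time = (int)ceil(t_ns / clk_period_ns);
        return n_ck > by_time ? n_ck : by_time;
    }

    int main(void) {
        /* Running at 312.5 MHz means a 3.2 ns clock period. */
        printf("min CL: %d\n", min_cl(9.375, 3.2));                   /* -> 3 */
        printf("'3CK or 5ns': %d cycles\n", min_cycles(3, 5.0, 3.2)); /* -> 3 */
        return 0;
    }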

Calculating memory size based on address bit-length and memory cell contents

I am trying to calculate the maximum memory size knowing the bit length of an address and the size of the memory cell.
It is my understanding that if the address is n bits then there are 2^n memory locations. But then to calculate the actual memory size of the machine, you would need to multiply the number of addresses by the size of the memory cell. Is that correct?
To put it another way,
Step 1: determine the length of the address in bits (n bits)
Step 2: compute the number of memory locations: 2^n
Step 3: multiply the number of memory locations by the size in bytes of each memory cell.
If each cell were 2 bytes, for example, I would multiply 2^n (the number of addresses) by the 2 bytes per memory cell.
So total memory would be 2^n (number of addresses) * x bytes (cell size)?
"actual memory size of the machine"
I will assume here that you mean the physical address space of the machine in question, ignoring virtual addressing, etc.
Most modern machines are byte-addressable, meaning that each address refers to 1 byte. In this case, assuming that you have an n-bit processor with a matching n-bit address bus (there are cases where these aren't the same, e.g. Pentium processors), the number of memory locations is 2^n, for a total of 2^n bytes.
If you have more specialized hardware (embedded microcontrollers, etc.) that is word-addressable (16-bit or 32-bit words), then you are correct: you would compute 2^n * (word size in bits) / 8 = number of bytes.
That being said, when you take into consideration virtual addressing and physical bus sizes that may differ from the processor's address lines, you have to look at the specific machine for its "theoretical limit".
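As a minimal sketch of the arithmetic above (the function name is mine, not from any particular API):

    #include <stdint.h>
    #include <stdio.h>

    /* Total memory = (number of addresses) x (bytes per addressable cell). */
    static uint64_t memory_bytes(unsigned addr_bits, unsigned cell_bytes) {
        return ((uint64_t)1 << addr_bits) * cell_bytes;
    }

    int main(void) {
        /* Byte-addressable, 16-bit addresses: 65536 bytes. */
        printf("%llu\n", (unsigned long long)memory_bytes(16, 1));
        /* Two-byte cells, 16-bit addresses: 131072 bytes. */
        printf("%llu\n", (unsigned long long)memory_bytes(16, 2));
        return 0;
    }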

Why is RAM in powers of 2?

Why is the amount of RAM always a power of 2?
512, 1024, etc.
Specifically, what is the difference between using 512, 768, and 1024 RAM for an Android emulator?
Memory is closely tied to the CPU, so making its size a power of two means that multiple modules can be packed while requiring a minimum of logic to switch between them; only a few bits from the end need to be checked (since the binary representation of the size is 1000...0000 regardless of its magnitude) instead of many more bits were it not a power of two.
Hard drives are not tied to the CPU and not packed in the same manner, so exactness of their size is not required.
From https://superuser.com/questions/235030/why-are-ram-size-usually-in-powers-of-2-512-mb-1-2-4-8-gb, as referenced by BrajeshKumar in the comments on the OP. Thanks Brajesh!
Because computers deal with binary values, 0 and 1: registers are either on (1) or off (0). So if you use powers of 2, your hardware makes use of 100% of the available bit combinations rather than leaving some unused.
If computers used ternary values in their circuits, then we'd have memory, processors and anything else in powers of 3.
I think it is related to the number of bits in the address bus (or the bits used to select between address spaces). n bits can address 2^n bytes, so whenever the number of address bits increases to n+1, the addressable space automatically doubles. Manufacturers use their maximum address capacity when adding memory chips to the design.
In the Android emulator, the increase in RAM may make your program more efficient, because when your application exceeds the available RAM, part of the (much slower) non-volatile storage is used in its place.
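A small sketch of the decoding argument from the quote above: with power-of-two module sizes, picking a module and the offset inside it is pure bit slicing, with no comparators or division. The 512 MiB module size is an arbitrary example of mine:

    #include <stdint.h>
    #include <stdio.h>

    #define MODULE_BITS 29  /* hypothetical 512 MiB modules: 2^29 bytes each */

    static unsigned module_of(uint64_t addr) {
        return (unsigned)(addr >> MODULE_BITS);           /* top bits pick the module */
    }

    static uint64_t offset_of(uint64_t addr) {
        return addr & (((uint64_t)1 << MODULE_BITS) - 1); /* low bits are the offset */
    }

    int main(void) {
        uint64_t addr = 0x30001234ULL;
        printf("module %u, offset 0x%llx\n",
               module_of(addr), (unsigned long long)offset_of(addr));
        return 0;
    }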

Maximum memory size a system can support

Suppose that I have a computer with an address register of size 16 bits (MAR, for example). The smallest addressable unit in this computer is a word and each word is of size 2 bytes. What is the maximum memory size (in bytes) this system can support?
I thought it would be 2^16 = 65536 bytes, but the part about the smallest addressable unit implies that this is not the way to solve it.
Thanks in advance
There is no direct correlation between the maximum amount of memory a system can support and the size of its address registers.
16-bit computers 30 years ago could very well support more than 64 kilobytes. On the other hand, modern 64-bit processors typically only have lanes for 52 bits (or fewer), but even so a typical computer cannot come close to supporting 2^52 bytes of memory.
Typical 64-bit computers today could in theory address 16 exbibytes, but present-day CPUs only support 4 petabytes of physical and 256 terabytes of per-process virtual memory. Typical desktop mainboards support 128 GiB at most, if you buy extra-expensive DIMMs. With affordable DIMMs, you're limited to about half as much (there are only so many slots).
Operating systems typically allow main memory sizes in the hundreds of gigabytes only (e.g. 512 GiB for Windows 8 Enterprise/Professional and 128 GiB otherwise, or as little as 16 GiB for Windows 7 Home Premium).
Generally the smallest addressable unit is one byte; if that were the case here, the total would be 2^16 * 1 = 65536 bytes. However, because on this system each address refers to a two-byte word, it is actually 2^16 * 2 = 131072 bytes.
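A short sketch of the same computation, mapping the highest word address the 16-bit MAR can hold to its byte range (the names are mine):

    #include <stdint.h>
    #include <stdio.h>

    #define WORD_BYTES 2  /* the smallest addressable unit in this question */

    int main(void) {
        uint16_t mar = 0xFFFF;  /* highest possible word address */
        uint32_t first_byte = (uint32_t)mar * WORD_BYTES;
        /* The last word spans bytes 131070-131071, so the total is 131072. */
        printf("last word: bytes %u-%u, total %u bytes\n",
               first_byte, first_byte + WORD_BYTES - 1, first_byte + WORD_BYTES);
        return 0;
    }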

Understanding memory usage in CUDA

I have an NVIDIA GTX 570 graphics card running on an Ubuntu 10.10 system with CUDA 4.0.
I know that for performance we need to access memory efficiently, and use registers and shared memory on the device cleverly.
However, I don't understand how to calculate the number of registers available per thread, or how much shared memory a single block can use, or other such simple but important figures for particular kernel configurations.
I want to understand this by an explicit example.
Incidentally, I am currently trying to write a particle code, in which one of the kernels should look like this.
Each block is a 1-D collection of threads, and each grid is a 1-D collection of blocks.
Number of blocks : 16384
Number of threads per block : 32 ( => total threads 32*16384 = 524288)
Each thread block is given a 32 x 32 two-dimensional integer array of shared memory to work with.
Within a thread I would like to store some numbers of type double, but I am not sure how many such doubles I can store without any registers spilling into local memory (which resides in device memory). Can someone tell me how many doubles can be stored per thread for this kernel configuration?
Also, is the above shared-memory configuration valid for each of my blocks?
A sample computation of how one would go about deducing these things would be very illustrative and helpful.
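For reference, a minimal CUDA sketch of a kernel with the geometry described above (the kernel name and body are hypothetical placeholders; only the launch configuration and the shared-memory declaration follow the question):

    #include <cstdio>
    #include <cuda_runtime.h>

    // Hypothetical kernel matching the configuration above: 16384 blocks of
    // 32 threads, each block using a 32 x 32 int array of shared memory
    // (32 * 32 * 4 = 4096 bytes per block, well under the 48 KiB limit).
    __global__ void particleKernel(const double *in, double *out)
    {
        __shared__ int tile[32][32];      // 4096 bytes of shared memory

        int tid = blockIdx.x * blockDim.x + threadIdx.x;

        // Per-thread locals: each double normally occupies two 32-bit
        // registers, until the compiler runs out and spills to local memory.
        double a = in[tid];
        double b = a * 2.0;               // placeholder computation

        tile[threadIdx.x][0] = (int)b;    // placeholder shared-memory use
        __syncthreads();

        out[tid] = b + tile[threadIdx.x][0];
    }

    int main()
    {
        const int nBlocks = 16384, nThreads = 32;
        const size_t n = (size_t)nBlocks * nThreads;  // 524288 elements
        double *d_in, *d_out;
        cudaMalloc(&d_in,  n * sizeof(double));
        cudaMalloc(&d_out, n * sizeof(double));
        cudaMemset(d_in, 0, n * sizeof(double));
        particleKernel<<<nBlocks, nThreads>>>(d_in, d_out);
        cudaDeviceSynchronize();
        printf("kernel status: %s\n", cudaGetErrorString(cudaGetLastError()));
        cudaFree(d_in);
        cudaFree(d_out);
        return 0;
    }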
Here is the information about my GTX 570: (using deviceQuery from CUDA-SDK)
[deviceQuery] starting...
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Found 1 CUDA Capable device(s)
Device 0: "GeForce GTX 570"
CUDA Driver Version / Runtime Version 4.0 / 4.0
CUDA Capability Major/Minor version number: 2.0
Total amount of global memory: 1279 MBytes (1341325312 bytes)
(15) Multiprocessors x (32) CUDA Cores/MP: 480 CUDA Cores
GPU Clock Speed: 1.46 GHz
Memory Clock rate: 1900.00 Mhz
Memory Bus Width: 320-bit
L2 Cache Size: 655360 bytes
Max Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536,65535), 3D=(2048,2048,2048)
Max Layered Texture Size (dim) x layers 1D=(16384) x 2048, 2D=(16384,16384) x 2048
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 65535
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Concurrent kernel execution: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support enabled: No
Device is using TCC driver mode: No
Device supports Unified Addressing (UVA): Yes
Device PCI Bus ID / PCI location ID: 2 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 4.0, CUDA Runtime Version = 4.0, NumDevs = 1, Device = GeForce GTX 570
[deviceQuery] test results...
PASSED
Press ENTER to exit...
So, the kernel configuration is a little complicated. You should use the CUDA Occupancy Calculator. On the other hand, you have to study how warps work. Once a block is assigned to an SM, it is further divided into 32-thread units called warps; a warp is the unit of thread scheduling in SMs. You can calculate the number of warps that reside in an SM for a given block size and a given number of blocks assigned to each SM. In your case a warp consists of 32 threads, so a block with 256 threads contains 8 warps. Choosing the right kernel configuration depends on your data and operations: you want to occupy each SM as fully as possible, i.e. reach its full thread capacity and have the maximum number of warps available to schedule around long-latency operations. Another important thing is not to exceed the hardware limits, such as the maximum of 1024 threads per block in your case.
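A rough back-of-the-envelope version of that arithmetic for the deviceQuery numbers above; the per-SM block and warp limits (8 and 48) are assumptions taken from NVIDIA's compute capability 2.0 tables, as is the 63-register per-thread compiler cap:

    #include <stdio.h>

    int main(void) {
        const int regs_per_sm   = 32768;  /* from deviceQuery */
        const int smem_per_sm   = 49152;  /* from deviceQuery */
        const int max_blocks_sm = 8;      /* CC 2.0 limit (assumed) */
        const int max_warps_sm  = 48;     /* CC 2.0 limit (assumed) */

        const int threads_per_block = 32;
        const int smem_per_block    = 32 * 32 * 4;  /* 32x32 ints = 4096 bytes */

        /* Shared memory alone would allow 12 resident blocks per SM, but the
           CC 2.0 limit caps it at 8. */
        int blocks = smem_per_sm / smem_per_block;
        if (blocks > max_blocks_sm)
            blocks = max_blocks_sm;

        int warps = blocks * threads_per_block / 32;
        int regs_per_thread = regs_per_sm / (blocks * threads_per_block);

        printf("resident blocks/SM: %d\n", blocks);                 /* 8 */
        printf("occupancy: %d of %d warps\n", warps, max_warps_sm); /* 8/48 */
        printf("registers/thread: %d (capped at 63 on CC 2.x)\n",
               regs_per_thread);                                    /* 128 */
        /* Each double takes two 32-bit registers, so at most about 31 doubles
           per thread fit in registers before spilling, minus whatever the
           compiler needs for addresses and temporaries. */
        return 0;
    }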
