Does a one cycle instruction take one cycle, even if RAM is slow?

I am using an embedded RISC processor. There is one basic thing I have a problem figuring out.
The CPU manual clearly states that the instruction ld r1, [p1] (in C: r1 = *p1) takes one cycle. Register r1 is 32 bits wide, but the memory bus is only 16 bits wide. So how can it fetch all the data in one cycle?

The cycle counts assume full-width, zero-wait-state memory. The time it takes the core to execute that instruction is one clock cycle.
There was a time when each instruction took a different number of clock cycles. Memory was relatively fast then too, usually zero wait states. There was also a time before pipelines, when you had to burn a clock cycle fetching, then a clock cycle decoding, then a clock cycle executing, plus extra clock cycles for variable-length instructions and extra clock cycles if the instruction had a memory operation.
Today clock speeds are high and chip real estate is relatively cheap, so a one-clock-cycle add or multiply is the norm, as are pipelines and caches. Processor clock speed is no longer the determining factor for performance. Memory is relatively expensive and slow. So caches (configuration, number, and size), bus width, memory speed, and peripheral speed determine the overall performance of a system. Normally, increasing the processor clock speed but not the memory or peripherals will show minimal if any performance gain; on some occasions it can even make things slower.
Memory size and wait states are not part of the clock execution spec in the reference manual; it describes only what the core itself costs you, in clocks, for each instruction. If it is a Harvard architecture, where the instruction and data buses are separate, then one clock including the memory cycle is possible. The fetch of the instruction happens at least one clock cycle earlier, if not before that, so at the beginning of the clock cycle the instruction is ready; decode and execute (the memory read cycle) happen during that one clock, and at the end of it the result of the read is latched into the register. If the instruction and data buses are shared, you could argue that the instruction still finishes in one clock cycle, but you do not get to fetch the next instruction, so there is a bit of a stall there; they might cheat and still call that one clock cycle.

My understanding is: when we say some instruction takes one cycle, it does not mean the instruction finishes in one cycle. We have to take the instruction pipeline into account. Suppose your CPU has a 5-stage pipeline; that instruction would take 5 cycles if it were executed on its own, start to finish.
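A rough way to see the latency-versus-throughput distinction (the 5-stage, stall-free pipeline is an assumption for illustration, not a description of any particular core):

    #include <stdio.h>

    /* Minimal sketch: in an ideal 5-stage pipeline with no stalls,
     * instruction i enters the pipe on cycle i and completes at the end of
     * cycle i + 4. Each instruction has a 5-cycle latency, yet one
     * instruction retires per cycle once the pipe is full. */
    int main(void)
    {
        const int stages = 5;          /* assumed pipeline depth */
        const int n_instructions = 8;

        for (int i = 0; i < n_instructions; i++)
            printf("instruction %d completes at end of cycle %d\n",
                   i, i + stages - 1);

        printf("total: %d cycles for %d instructions (~1 cycle/instruction in steady state)\n",
               n_instructions + stages - 1, n_instructions);
        return 0;
    }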

Related

Does a DMA controller copy one word of memory at a time?

A DMA controller greatly speeds up memory copy operations because the data in memory doesn't have to be read into the CPU.
From what I've read, DMA controllers can "copy a block of memory from one location to another" in one operation, but thinking about this at a low level, I'm guessing the DMA ultimately has to iterate over memory one word at a time. Is that correct? Is that one word per clock cycle? One word per two clock cycles? (one for read memory into the DMA, one for write to memory) Or does the DMA have a circuit that can somehow (I can't imagine how) copy large chunks of memory in one or two clock cycles?
If the CPU tells the DMA to copy 1024 bytes of memory from one address to another, how many clock cycles will the CPU have free to perform other tasks while waiting for the DMA to finish?
Is it possible to have an architecture where the DMA is doing a memory copy using one bus, while the CPU can access memory at the same time in a different area? Say, in a different bank?
I'm sure it's architecture dependent, so for the answers just pick one or more 8 or 16 bit home micros.
Yes, this is architecture dependent. Usually there is a main memory bus and one or more caches. Not all buses in the system have to be the same width; memory buses are often wider than the processor word, e.g. 64 bits, so they move 64 bits at a time.
Then the destination might be the same memory bus, or PCIe, or even another bus that is memory mapped, in which case the transfers might be constrained by the destination bus width.
How many clock cycles are available again depends on how things are done. Usually in a µC the DMA controller triggers an interrupt when it is done, and the CPU does nothing for the transfer in the meantime. Another option is polling.
There are dual-port memories, but usually there is only one main memory bus. IIRC, banks are usually a trick to avoid large addresses, but they use the same memory bus.
Caching and bus arbitration are used to mitigate bus contention; the user really shouldn't have to care about that. Have a look at your µC's datasheet if you want reliable information.
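As a back-of-the-envelope model of the word-at-a-time case (the 16-bit bus width, the 1024-byte block, and the two bus cycles per word are assumptions, not a description of any particular controller):

    #include <stdio.h>

    /* Minimal sketch of a word-at-a-time DMA copy: one bus read cycle and
     * one bus write cycle per word. Bus width and block size are assumed. */
    int main(void)
    {
        const unsigned block_bytes     = 1024;
        const unsigned bus_width_bytes = 2;   /* 16-bit bus, as on many 8/16-bit micros */

        unsigned words      = block_bytes / bus_width_bytes;
        unsigned bus_cycles = words * 2;      /* read into DMA latch + write back out */

        printf("%u words moved, %u bus cycles (2 per word)\n", words, bus_cycles);
        /* How many of those cycles the CPU "loses" depends on arbitration:
         * on a single shared bus it stalls whenever it also needs the bus;
         * with separate banks/buses it may keep running almost untouched. */
        return 0;
    }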

In RAM memory: is CL the total number of RAM cycles needed to access memory?

Well, my doubt is: when you buy a new RAM module for your computer, you can see something like CL17 in its specifications. I know that CL is the same as CAS latency, but I have a question here: I've read in some posts that CAS is the number of RAM clock cycles it takes for the RAM to output data requested by the CPU, but I've also read that we have to add the RAS-to-CAS delay to that CAS figure to get the total number of RAM clock cycles it takes the RAM to output the requested data.
So, is it correct to say that, in my example, the CPU will wait 17 RAM clock cycles from when it requests the data until the first data bytes arrive? Or do we have to add the RAS-to-CAS delay?
And, if we have to add the RAS-to-CAS delay, how can I know how many cycles RAS-to-CAS is if the RAM vendor only tells me that it is "CL17"?
Edit: Suppose that when I talk about the 17 cycles I'm referring to "17 RAM cycles between an L3 miss and the reception of the first bytes of the requested data".
So, is it correct to say that, in my example, the CPU will wait 17 RAM clock cycles from when it requests the data until the first data bytes arrive? Or do we have to add the RAS-to-CAS delay? And, if we have to add the RAS-to-CAS delay, how can I know how many cycles RAS-to-CAS is if the RAM vendor only tells me that it is "CL17"?
No. This delay is only a small part of the total delay from when a core requests some memory until the line returns to the core.
In particular, the request must make its way all the way from the core, checking the L1, L2 and L3 caches, and to the memory controller, before the DRAM (and timings like CAS) even become involved. After the read occurs, it has to go all the way back. This trip usually accounts for much more of the total latency of RAM access than the RAM access itself.
John D McCalpin has an excellent blog post about the memory latency components on an x86 system. On that system the CAS delay of ~11 ns makes up only a bit more than 20% of the total latency of ~50 ns.
John also points out in a comment that on some multi-socket systems, the memory latencies may not even matter, because snooping the other cores in the system takes longer than the response from memory.
About RAS-to-CAS vs CAS alone: it depends on the access pattern. The RAS-to-CAS delay is only needed if that row wasn't already open; in that case the row must be opened and the RAS-to-CAS delay incurred. Otherwise, if the row is already open, only the CAS delay is required. Which case applies depends on your physical-address access pattern, the RAM configuration, and how the memory controller maps physical addresses to RAM addresses.
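As a hedged worked example of converting those cycle counts to time (the DDR4-2400 memory clock of 1200 MHz and a tRCD equal to CL are assumptions; the real tRCD comes from the module's datasheet or SPD, not from "CL17" alone):

    #include <stdio.h>

    /* Minimal sketch: convert DRAM timing cycles to nanoseconds.
     * Assumptions: DDR4-2400 (1200 MHz memory clock), CL = 17, tRCD = 17. */
    int main(void)
    {
        const double clock_mhz = 1200.0;
        const int cl = 17, trcd = 17;

        double ns_per_cycle = 1000.0 / clock_mhz;   /* ~0.833 ns per cycle */
        double hit  = cl * ns_per_cycle;            /* row already open: CL only */
        double miss = (trcd + cl) * ns_per_cycle;   /* row must be opened first */

        printf("open-row access:   %.1f ns (CL only)\n", hit);     /* ~14.2 ns */
        printf("closed-row access: %.1f ns (tRCD + CL)\n", miss);  /* ~28.3 ns */
        return 0;
    }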

how does burst-mode DMA speed up data transfer between main memory and I/O devices?

According to Wikipedia, there are three kinds of DMA modes, namely burst mode, cycle-stealing mode, and transparent mode.
In burst mode, the DMA controller will take over control of the bus. Before the transfer completes, CPU tasks that need the bus will be suspended. However, in each instruction cycle, the fetch cycle has to reference the main memory. Therefore, during the transfer, the CPU will be idle doing no work, which is essentially the same as being occupied by the transferring work, under interrupt-driven I/O.
In my understanding, the cycle stealing mode is essentially the same. The only difference is that in that mode the CPU uses one of every two consecutive cycles, as opposed to being totally idle in burst mode.
Does burst-mode DMA make a difference by skipping the fetch and decode cycles needed when using interrupt-driven I/O, and thus accomplish one transfer per clock cycle instead of one per instruction cycle, thereby speeding the process up?
Thanks a lot!
how does burst-mode DMA speed up data transfer between main memory and I/O devices?
There is no "speed up" as you allege, nor is any "speed up" typically necessary/possible. The data transfer is not going to occur any faster than the slower of the source or destination.
The DMA controller will consolidate several individual memory requests into occasional burst requests, so the benefit of burst mode is reduced memory contention due to a reduction in the number of memory arbitrations.
Burst mode combined with a wide memory word improves memory bandwidth utilization. For example, with a 32-bit wide memory, four sequential byte reads consolidated into a single burst could result in only one memory access cycle.
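To put rough numbers on that (the 1024-byte transfer size, the 32-bit bus, and the 4-beat bursts are illustrative assumptions, not a specific controller's behaviour), counting bus transactions looks like this:

    #include <stdio.h>

    /* Minimal sketch: count bus transactions for a 1024-byte transfer.
     * Assumptions: 32-bit (4-byte) memory bus, bursts of 4 beats. */
    int main(void)
    {
        const unsigned bytes       = 1024;
        const unsigned bus_bytes   = 4;
        const unsigned burst_beats = 4;

        unsigned byte_accesses  = bytes;                              /* 1024 single-byte cycles */
        unsigned word_accesses  = bytes / bus_bytes;                  /* 256 single-word cycles  */
        unsigned burst_accesses = bytes / (bus_bytes * burst_beats);  /* 64 arbitrations, 4 beats each */

        printf("byte-at-a-time: %u accesses\n", byte_accesses);
        printf("word-at-a-time: %u accesses\n", word_accesses);
        printf("4-beat bursts:  %u arbitrations\n", burst_accesses);
        return 0;
    }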
Before the transfer completes, CPU tasks that need the bus will be suspended.
The concept of "task" does not exist at this level of operations. There is no "suspension" of anything. At most the CPU has to wait (i.e. insertion of wait states) to gain access to memory.
However, in each instruction cycle, the fetch cycle has to reference the main memory.
Not true. A hit in the instruction cache will make a memory access unnecessary.
Therefore, during the transfer, the CPU will be idle doing no work, which is essentially the same as being occupied by the transferring work, under interrupt-driven I/O.
Faulty assumption for every cache hit.
Apparently you are misusing the term "interrupt-driven IO" to really mean programmed I/O using interrupts.
Equating a wait cycle or two to the execution of numerous instructions of an interrupt service routine for programmed I/O is a ridiculous exaggeration.
And "interrupt-driven IO" (in its proper meaning) does not exclude the use of DMA.
In my understanding, the cycle stealing mode is essentially the same.
Then your understanding is incorrect.
If the benefits of DMA are so minuscule or nonexistent as you allege, then how do you explain the existence of DMA controllers, and the preference of using DMA over programmed I/O?
Does burst-mode DMA make a difference by skipping the fetch and decode cycles needed when using interrupt-driven I/O, and thus accomplish one transfer per clock cycle instead of one per instruction cycle, thereby speeding the process up?
Comparing DMA to "interrupt-driven I/O" is illogical. See this.
Programmed I/O using interrupts requires a lot more than just the one instruction that you allege.
I'm unfamiliar with any CPU that can read a device port, write that value to main memory, bump the write pointer, and check if the block transfer is complete all with just a single instruction.
And you're completely ignoring the ISR code (e.g. save and then restore processor state) that is required to be executed for each interrupt (that the device would issue for requesting data).
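To make that per-interrupt cost concrete, here is a minimal sketch of a programmed-I/O receive ISR; the device register name and address (UART_DATA at 0x40001000) and the buffer bookkeeping are hypothetical, and the hardware and compiler add interrupt entry/exit and state save/restore on top of this:

    #include <stdint.h>
    #include <stdbool.h>

    /* Hypothetical memory-mapped data register -- name and address are
     * illustrative only, not from any real device. */
    #define UART_DATA (*(volatile uint8_t *)0x40001000u)

    static uint8_t           rx_buf[1024];
    static volatile uint16_t rx_idx;
    static volatile bool     rx_done;

    /* Sketch of a programmed-I/O receive ISR: even this minimal body is a
     * handful of instructions per byte, before counting the interrupt
     * entry/exit and register save/restore the hardware and compiler add. */
    void uart_rx_isr(void)
    {
        rx_buf[rx_idx++] = UART_DATA;   /* read device port, store to memory, bump pointer */
        if (rx_idx >= sizeof rx_buf)    /* check whether the block transfer is complete */
            rx_done = true;
    }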
When used with many older or simpler CPUs, burst mode DMA can speed up data transfer in cases where a peripheral is able to accept data at a rate faster than the CPU itself could supply it. On a typical ARM, for example, a loop like:
lp:
    ldr  r0,[r1,r2]   ; load one 32-bit word; r1 points to address *after* end of buffer, r2 is the negative byte index
    strb r0,[r3]      ; write byte 0 to the peripheral data register
    lsr  r0,r0,#8
    strb r0,[r3]      ; write byte 1
    lsr  r0,r0,#8
    strb r0,[r3]      ; write byte 2
    lsr  r0,r0,#8
    strb r0,[r3]      ; write byte 3
    adds r2,#4        ; advance the index toward zero, setting flags
    bne  lp           ; loop until the whole buffer has been sent
would likely take at least 11 cycles for each group of four bytes transferred (including five 32-bit instruction fetches, one 32-bit data fetch, four 8-bit writes, plus a wasted fetch for the instruction following the loop). A burst-mode DMA operation, by contrast, would only need 5 cycles per group (assuming the receiving device was able to accept data that fast).
Because a typical low-end ARM will only use the bus about every other cycle when running most kinds of code, a DMA controller that grabs the bus on every other cycle could allow the CPU to run at almost normal speed while the DMA controller performed one access every other cycle. On some platforms, it may be possible to have a DMA controller perform transfers on every cycle where the CPU isn't doing anything, while giving the CPU priority on cycles where it needs the bus. DMA performance would be highly variable in such a mode (no data would get transferred while running code that needs the bus on every cycle) but DMA operations would have no impact on CPU performance.
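A minimal sketch of that arbitration policy, assuming (as described above) a CPU that wants the bus on every other cycle and a DMA controller that takes whatever cycles are left:

    #include <stdio.h>

    /* Minimal sketch of cycle-stealing arbitration: each bus cycle goes to
     * the CPU if it wants the bus, otherwise to the DMA controller. The
     * CPU's every-other-cycle access pattern is an assumption from the
     * paragraph above. */
    int main(void)
    {
        const int cycles = 1000;
        int cpu_cycles = 0, dma_cycles = 0;

        for (int c = 0; c < cycles; c++) {
            int cpu_wants_bus = (c % 2 == 0);   /* assumed access pattern */
            if (cpu_wants_bus)
                cpu_cycles++;                   /* CPU keeps priority, runs unimpeded */
            else
                dma_cycles++;                   /* DMA steals the otherwise idle cycle */
        }

        printf("CPU bus cycles: %d, DMA transfers: %d\n", cpu_cycles, dma_cycles);
        return 0;
    }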

What happens when VRAM is full?

I want to know how the current NVIDIA/AMD implementations handle VRAM resource allocation.
We already know that operating systems use swap/virtual memory when system RAM is full, so what is the equivalent of swap when it comes to VRAM? Do they fall back to system RAM or to the hard disk?
I thought falling back to system RAM would be the rational choice, but in my experience video games lag horribly (1/20 of typical FPS) when they run out of video memory. That made me doubt they are using system RAM, because I don't think system RAM is slow enough to make a game lag that much.
In short, I would like to know what the current implementations are and what the biggest bottleneck is that causes games to lag in out-of-memory situations.
The swapping is really done to RAM, provided there is enough RAM to swap to; swapping to a file is unusable due to its slow speed (see the next point).
The RAM itself is not that slow (still slower than VRAM), but the buses connected to it are.
When system memory is swapped to a swap file, the swap occurs only when needed (changing application focus, opening a new file/table, ...), which is not that frequent. But if you are out of VRAM you are in trouble, because usually most of the graphics data is used in every frame.
This leads to swapping every frame, so you need to copy very large data blocks very often. For example, swapping a 256 MB chunk at 20 fps leads to:
256 MB x 2 x 20 = 10 GB/s read
256 MB x 2 x 20 = 10 GB/s write
which is 20 GB/s of bandwidth needed. Of course, depending on the memory controller and architecture you can do reads and writes simultaneously up to a point, so you can theoretically get close to 10 GB/s in total, but that is still a huge number for only a 256 MB chunk of data; look here:
Cache size estimation on your system?
My setup at that time could only write memory at around 5 GB/s, which is nowhere near the transfer rate needed for such a task.
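Spelling that arithmetic out (the 256 MB chunk and 20 fps are the assumed numbers from above, not measurements):

    #include <stdio.h>

    /* Minimal sketch of the bandwidth estimate above: swapping a 256 MB
     * chunk in and out every frame at 20 fps. Numbers are the answer's
     * assumptions, not measured values. */
    int main(void)
    {
        const double chunk_mb = 256.0;
        const double fps      = 20.0;

        double read_gb_s  = chunk_mb * 2.0 * fps / 1024.0;  /* swap out + swap in: ~10 GB/s of reads  */
        double write_gb_s = chunk_mb * 2.0 * fps / 1024.0;  /* swap out + swap in: ~10 GB/s of writes */

        printf("required read bandwidth:  %.1f GB/s\n", read_gb_s);
        printf("required write bandwidth: %.1f GB/s\n", write_gb_s);
        printf("total: %.1f GB/s\n", read_gb_s + write_gb_s);
        return 0;
    }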

Which factors affect the speed of CPU tracing?

When I use YJP to do CPU-tracing profiling on our own product, it is really slow.
The product runs on a 16-core machine with an 8 GB heap, and I use grinder to run a small load test (e.g. 10 grinder threads) with about 7-10 steps during the profiling. I have a script that starts the product with the profiler, starts profiling (using the controller API) and then starts grinder to emulate user operations. When all the operations finish, the script tells the profiler to stop profiling and save a snapshot.
During the profiling, each step in the grinder test takes more than 1 million ms to finish. The whole profiling often takes more than 10 hours with just 10 grinder threads, each running the test 10 times. Without the profiler, it finishes within 500 ms.
So... besides problems with the product being profiled, is there anything else that affects the performance of the CPU tracing process itself?
Last I used YourKit (v7.5.11, which is pretty old; the current version is 12) it had two CPU profiling settings: sampling and tracing, the latter being much slower but more accurate. Since tracing is supposed to be more accurate I used it myself and also observed a huge slowdown, in spite of the statement that the slowdown would be "average". Yet it was far less than your results: from 2 seconds to 10 minutes. My code is a fragment of a calculation engine: virtually no I/O, no waits on anything, just reading input, calculating, and writing the result to the console, so the whole slowdown comes from the profiler, with no external influences.
Back to your question: the option mentioned, sampling vs tracing, will affect the performance, so you may try sampling.
Now that I think of it: YourKit can be set up to do things automatically, like taking snapshots periodically or on low memory, profiling memory usage, or recording object allocations; each of these measures will make profiling slower. Perhaps you should run an online session instead of a script-controlled one, to see what it really does.
According to the YourKit docs:
Although tracing provides more information, it has its drawbacks. First, it may noticeably slow down the profiled application, because the profiler executes special code on each enter to and exit from the methods being profiled. The greater the number of method invocations in the profiled application, the lower its speed when tracing is turned on.
The second drawback is that, since this mode affects the execution speed of the profiled application, the CPU times recorded in this mode may be less adequate than times recorded with sampling. Please use this mode only if you really need method invocation counts.
Also:
When sampling is used, the profiler periodically queries stacks of running threads to estimate the slowest parts of the code. No method invocation counts are available, only CPU time.
Sampling is typically the best option when your goal is to locate and discover performance bottlenecks. With sampling, the profiler adds virtually no overhead to the profiled application.
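As a rough, language-agnostic illustration of why tracing overhead grows with invocation count while sampling overhead stays roughly constant (the per-event costs below are made-up assumptions; this is a cost model, not how YourKit instruments the JVM):

    #include <stdio.h>

    /* Conceptual sketch only: tracing pays a fixed bookkeeping cost on every
     * method entry/exit, so its overhead scales with invocation count.
     * Sampling pays a roughly fixed cost per periodic stack walk. */
    int main(void)
    {
        const double trace_cost_ns  = 50.0;     /* assumed enter+exit bookkeeping   */
        const double sample_cost_ns = 5000.0;   /* assumed cost of one stack walk   */
        const double sample_rate_hz = 1000.0;
        const double run_seconds    = 2.0;
        const double invocations    = 50e6;     /* many calls to tiny, hot methods  */

        double tracing_overhead  = invocations * trace_cost_ns / 1e9;
        double sampling_overhead = sample_rate_hz * run_seconds * sample_cost_ns / 1e9;

        printf("tracing overhead:  ~%.1f s added to a %.0f s run\n", tracing_overhead, run_seconds);
        printf("sampling overhead: ~%.3f s added to a %.0f s run\n", sampling_overhead, run_seconds);
        return 0;
    }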
Also, it's a little confusing what the doc means by "CPU time", because it also talks about "wall-clock time".
If you are doing any I/O, waits, sleeps, or any other kind of blocking, it is important to get samples on wall-clock time, not CPU-only time, because it's dangerous to assume that blocked time is either insignificant or unavoidable.
Fortunately, that appears to be the default (though it's still a little unclear):
The default configuration for CPU sampling is to measure wall time for I/O methods and CPU time for all other methods.
"Use Preconfigured Settings..." allows to choose this and other presents. (sic)
If your goal is to make the code as fast as possible, don't be concerned with invocation counts and measurement "accuracy"; do find out which lines of code are on the stack a large fraction of the time, and why.
More on all that.
