FMC slower than QSPI on STM32H7? - memory

I'm working on STM32H753, for now on the STM32H753I-EVAL2 board.
I am evaluating the external memories capabilities, in particular FMC SRAM and QSPI Flash.
I used projects from STMicro (from STM32CubeH7) and measured the duration of reading 1KB of data from QSPI Flash and from FMC SRAM respectively. In both cases, if I understood correctly, the different clocks are configured at their maximum speed (without boost, i.e. CPU clock at 400 MHz and so on).
I was surprised to notice that, with both D-cache and I-cache enabled, reading 1KB from QSPI Flash is almost twice as fast as from FMC SRAM. I was expecting the opposite, since FMC is a parallel bus.
It's the first time I'm using a FMC memory.
Do you have any idea how FMC and QSPI compare on an STM32?
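For reference, this is a minimal sketch of how such a timing measurement can be done on the H7 with the DWT cycle counter; the memory-mapped base addresses (0x90000000 for QSPI, 0x60000000 for FMC bank 1) are the usual STM32H7 mappings, and this is not the exact code from the Cube examples:

#include <stdint.h>
#include <string.h>
#include "stm32h7xx.h"   /* CMSIS device header, pulls in the DWT definitions */

static uint8_t dst[1024];

/* Returns the number of CPU cycles spent copying 1KB from `src` into internal RAM. */
static uint32_t time_1kb_read(const void *src)
{
    CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;   /* enable trace/DWT unit */
    DWT->CYCCNT = 0;
    DWT->CTRL  |= DWT_CTRL_CYCCNTENA_Msk;             /* start the cycle counter */
    memcpy(dst, src, sizeof dst);
    return DWT->CYCCNT;
}

/* Assumed memory-mapped base addresses of the two external memories. */
uint32_t qspi_cycles(void) { return time_1kb_read((const void *)0x90000000UL); }
uint32_t fmc_cycles(void)  { return time_1kb_read((const void *)0x60000000UL); }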

Although they are named the same, STM32 peripherals behave differently from family to family. So your question is actually family-dependent. I'll elaborate on your H7 part, but YMMV.
QSPI is actually a parallel interface as well, as it transfers data over 4 wires simultaneously. More than that, QSPI is synchronous and quite fast (up to 133 MHz for certain voltage ranges). That's about 533 Mbit/s of instantaneous speed.
FMC, on the other hand, is not so fast. Its maximum clock is 100 MHz, and it takes a few clocks to start a transfer, even a burst one. On top of that, it also works in async mode, where a single transfer takes 5-8 clocks. If your SRAM is connected to the FMC in async mode, it will be no faster than around 15 megatransfers per second, which is 240 Mbit/s of instantaneous speed for a 16-bit part.
Most SRAM parts can do better than that, but it would require some FMC setting up/tweaking, and maybe some glue logic, to run them in sync/burst mode.
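To give an idea of where those clocks go, here is a hedged sketch of an async FMC SRAM setup with the STM32 HAL; the bank and the setup/hold counts below are illustrative, not the values from the EVAL2 board support package, and the RCC/GPIO configuration is omitted:

#include "stm32h7xx_hal.h"

SRAM_HandleTypeDef hsram;

/* Illustrative async timing: each access costs roughly AddressSetupTime +
   DataSetupTime (+ bus turnaround) FMC kernel clocks, hence the ~5-8 clocks
   per transfer mentioned above. Clock and GPIO init are omitted here. */
void fmc_sram_init(void)
{
    FMC_NORSRAM_TimingTypeDef timing = {
        .AddressSetupTime      = 2,
        .AddressHoldTime       = 1,
        .DataSetupTime         = 3,
        .BusTurnAroundDuration = 1,
        .CLKDivision           = 2,   /* only relevant in synchronous mode */
        .DataLatency           = 2,   /* only relevant in synchronous mode */
        .AccessMode            = FMC_ACCESS_MODE_A,
    };

    hsram.Instance             = FMC_NORSRAM_DEVICE;
    hsram.Extended             = FMC_NORSRAM_EXTENDED_DEVICE;
    hsram.Init.NSBank          = FMC_NORSRAM_BANK3;             /* board-dependent */
    hsram.Init.MemoryType      = FMC_MEMORY_TYPE_SRAM;
    hsram.Init.MemoryDataWidth = FMC_NORSRAM_MEM_BUS_WIDTH_16;
    hsram.Init.BurstAccessMode = FMC_BURST_ACCESS_MODE_DISABLE; /* async mode */
    hsram.Init.WriteOperation  = FMC_WRITE_OPERATION_ENABLE;

    HAL_SRAM_Init(&hsram, &timing, &timing);
}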


ESP32: Best way to store data frequently?

I'm developing a C++ application on the ESP32-DevKitC board where I sense acceleration with an accelerometer. The goal of the application is to store the accelerometer data until storage is full, then send all the data over WiFi and start again. The micro also goes into deep-sleep mode when possible.
I'm currently using the ESP32 NVS library, which is very well documented and pretty easy to use. The downside is that the library uses Flash memory, so a lot of writes will end up degrading it.
I know that Espressif also offers some other storage libraries (FAT, SPIFFS, etc.) but, as far as I know (correct me if I'm wrong), they all use the Flash as well.
Is there any other way of doing what I want without using the Flash storage?
Clarifications
Using Flash memory is not the problem in itself; degrading it is.
Storage has to be non-volatile, or at least must not be erased when the micro goes into deep-sleep mode.
I'm not using any Arduino library.
That's a great question that I wish more people would ask.
ESP32s use NOR flash storage, which is usually rated for between 10,000 and 100,000 write cycles (100,000 seems to be the standard these days). Flash can't write single bytes; instead it writes a "page" of bytes, which I believe is 256 bytes. So each 256-byte page is rated for at least 100,000 cycles. When a device is rated for 100,000 cycles it's likely to be usable for at least 10 times that, but the manufacturer is not going to make any promises beyond the 100,000.
SPIFFS (and LittleFS, now used on the ESP8266 Arduino Core) performs "wear leveling" to minimize the number of times a particular page is written. So if you modify the same section of a file repeatedly, it will automatically be written to different pages of flash. FAT is not designed to work well with flash storage; I would avoid it.
Whether SPIFFS with wear leveling will be adequate for your needs depends on the required lifetime of the device versus how much data you'll be writing and how frequently.
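For a rough idea of what that looks like in ESP-IDF, here is a minimal sketch that mounts SPIFFS and appends samples to a log file; the partition settings and file name are placeholders, not from your project:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include "esp_err.h"
#include "esp_spiffs.h"

/* Mount SPIFFS once at startup; wear leveling is handled by the filesystem. */
void storage_init(void)
{
    esp_vfs_spiffs_conf_t conf = {
        .base_path              = "/spiffs",
        .partition_label        = NULL,   /* use the first SPIFFS partition */
        .max_files              = 4,
        .format_if_mount_failed = true,
    };
    ESP_ERROR_CHECK(esp_vfs_spiffs_register(&conf));
}

/* Append one batch of accelerometer samples to a log file. */
void storage_append(const int16_t *samples, size_t count)
{
    FILE *f = fopen("/spiffs/accel.bin", "ab");
    if (f != NULL) {
        fwrite(samples, sizeof samples[0], count, f);
        fclose(f);
    }
}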
NVS may perform some level of wear leveling, though to what extent I'm unsure. Here, in a forum post with two Espressif employees, they both confirm that NVS does do some form of wear leveling. NVS is best used to persist things like configuration information that doesn't change frequently. It's not a great choice for storing information that's updated often.
You mentioned that the data just needs to survive deep sleep. If that's the case, your best option (if it's large enough) is to use the ESP32's RTC static RAM. This chunk of memory will survive restarts and deep sleep mode, but will lose its state if power is interrupted. It's real RAM so you won't wear it out by writing to it frequently, and it doesn't cost a lot of energy to write to. The catch is there's only 8KB of it.
If the 8KB of RTC RAM isn't enough and you're writing too much data too frequently to trust that SPIFFS will be okay, your best bet would be an SD card. The ESP32 can talk to an SD card adapter. SD cards use NAND flash, which has a much greater lifespan than NOR and can be safely overwritten many more times (which is why these kinds of cards are usable for filesystems in devices like Raspberry Pis).
Writing to flash also takes much more energy than writing to regular RAM. If your device is going to be battery powered, the RTC RAM is also a better choice than SPIFFS or an SD card from a power savings perspective.
Finally, if you use the RTC RAM I'd recommend starting to write it over wifi before it's full, as bringing up wifi and transmitting the data could easily take long enough that you might run out of space for some samples. Using it as a ring buffer and starting the transmit process when you hit a high water mark rather than when the buffer is full would probably be your best bet.
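A minimal sketch of a simplified (linear rather than ring) variant of that idea, assuming 16-bit accelerometer samples and a made-up high-water mark; RTC_DATA_ATTR places the buffer in RTC slow memory, so it survives deep sleep but not a power cycle:

#include <stdbool.h>
#include <stdint.h>
#include "esp_attr.h"   /* RTC_DATA_ATTR */

#define SAMPLE_CAPACITY 1024                        /* 2KB, fits in ~8KB of RTC RAM */
#define HIGH_WATER_MARK (SAMPLE_CAPACITY * 3 / 4)   /* start the WiFi flush early   */

/* Contents survive deep sleep, but are lost on power loss or a full reset. */
RTC_DATA_ATTR static int16_t  rtc_samples[SAMPLE_CAPACITY];
RTC_DATA_ATTR static uint32_t rtc_count;

/* Store one sample; returns true when it is time to bring up WiFi and flush. */
bool buffer_push(int16_t sample)
{
    if (rtc_count < SAMPLE_CAPACITY) {
        rtc_samples[rtc_count++] = sample;
    }
    return rtc_count >= HIGH_WATER_MARK;
}

void buffer_reset(void)
{
    rtc_count = 0;
}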
I know I'm late with this answer, but you can buy ESP32 modules with external RAM, even 4-8 MB of it. External RAM is really fast (at least much faster than the flash; it uses an SPI interface to communicate) and you can fit a lot of sensor readings in there.
I'm using an ESP32-WROVER-E module with 8 MB of external RAM (4 MB is usable with normal function calls) and 16 MB of flash.
Here is a link to the module I'm using on TME's site.

how does burst-mode DMA speed up data transfer between main memory and I/O devices?

According to Wikipedia, there are three kinds of DMA modes, namely burst mode, cycle-stealing mode and transparent mode.
In burst mode, the DMA controller will take over control of the bus. Before the transfer completes, CPU tasks that need the bus will be suspended. However, in each instruction cycle, the fetch cycle has to reference the main memory. Therefore, during the transfer, the CPU will be idle doing no work, which is essentially the same as being occupied by the transferring work, under interrupt-driven IO.
In my understanding, the cycle stealing mode is essentially the same. The only difference is that in that mode the CPU uses one of every two consecutive cycles, as opposed to being totally idle in burst mode.
Does burst mode DMA make a difference by skipping the fetch and decoding cycles needed when using interrupt-driven I/O, and thus accomplish one transfer per clock cycle instead of one per instruction cycle, and thus speed the process up?
Thanks a lot!
how does burst-mode DMA speed up data transfer between main memory and I/O devices?
There is no "speed up" as you allege, nor is any "speed up" typically necessary/possible. The data transfer is not going to occur any faster than the slower of the source or destination.
The DMA controller will consolidate several individual memory requests into occasional burst requests, so the benefit of burst mode is reduced memory contention due to a reduction in the number of memory arbitrations.
Burst mode combined with a wide memory word improves memory bandwidth utilization. For example, with a 32-bit wide memory, four sequential byte reads consolidated into a single burst could result in only one memory access cycle.
Before the transfer completes, CPU tasks that need the bus will be suspended.
The concept of "task" does not exist at this level of operations. There is no "suspension" of anything. At most the CPU has to wait (i.e. insertion of wait states) to gain access to memory.
However, in each instruction cycle, the fetch cycle has to reference the main memory.
Not true. A hit in the instruction cache will make a memory access unnecessary.
Therefore, during the transfer, the CPU will be idle doing no work, which is essentially the same as being occupied by the transferring work, under interrupt-driven IO.
Faulty assumption for every cache hit.
Apparently you are misusing the term "interrupt-driven IO" to really mean programmed I/O using interrupts.
Equating a wait cycle or two to the execution of numerous instructions of an interrupt service routine for programmed I/O is a ridiculous exaggeration.
And "interrupt-driven IO" (in its proper meaning) does not exclude the use of DMA.
In my understanding, the cycle stealing mode is essentially the same.
Then your understanding is incorrect.
If the benefits of DMA are so minuscule or nonexistent as you allege, then how do you explain the existence of DMA controllers, and the preference of using DMA over programmed I/O?
Does burst mode DMA make a difference by skipping the fetch and decoding cycles needed when using interrupt-driven I/O and thus accomplish one transfer per clock cycle instead of one per instruction cycle and thus speed the process
Comparing DMA to "interrupt-driven I/O" is illogical. See this.
Programmed I/O using interrupts requires a lot more than just the one instruction that you allege.
I'm unfamiliar with any CPU that can read a device port, write that value to main memory, bump the write pointer, and check if the block transfer is complete all with just a single instruction.
And you're completely ignoring the ISR code (e.g. save and then restore processor state) that is required to be executed for each interrupt (that the device would issue for requesting data).
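To make that concrete, a byte-per-interrupt programmed-I/O receive handler looks roughly like the sketch below (the device registers and the buffer bookkeeping are hypothetical); every received byte costs all of these instructions plus the processor's interrupt entry/exit overhead, whereas a DMA controller moves the same byte in one or two bus cycles:

#include <stddef.h>
#include <stdint.h>

/* Hypothetical memory-mapped device registers. */
#define UART_DATA (*(volatile uint8_t *)0x40001000u)   /* receive data register */
#define UART_CTRL (*(volatile uint8_t *)0x40001004u)   /* control register      */
#define RX_IRQ_EN 0x01u

static uint8_t rx_buf[512];
static size_t  rx_idx;
static size_t  rx_expected;
static volatile int rx_done;

void uart_rx_isr(void)                      /* entered once per received byte      */
{
    rx_buf[rx_idx++] = UART_DATA;           /* read the port, store to main memory */
    if (rx_idx >= rx_expected) {            /* is the block complete?              */
        UART_CTRL &= (uint8_t)~RX_IRQ_EN;   /* stop further receive interrupts     */
        rx_done = 1;                        /* tell the main loop                  */
    }
}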
When used with many older or simpler CPUs, burst mode DMA can speed up data transfer in cases where a peripheral is able to accept data at a rate faster than the CPU itself could supply it. On a typical ARM, for example, a loop like:
lp:
    ldr  r0,[r1,r2]  ; load next word; r1 points to address *after* end of buffer,
                     ; r2 is a negative byte offset counting up toward zero
    strb r0,[r3]     ; write byte 0 of the word to the peripheral data register
    lsr  r0,r0,#8
    strb r0,[r3]     ; write byte 1
    lsr  r0,r0,#8
    strb r0,[r3]     ; write byte 2
    lsr  r0,r0,#8
    strb r0,[r3]     ; write byte 3
    adds r2,#4       ; advance the offset (sets flags)
    bne  lp          ; repeat until the offset reaches zero
would likely take at least 11 cycles for each group of four bytes transferred (including five 32-bit instruction fetches, one 32-bit data fetch, four 8-bit writes, plus a wasted fetch for the instruction following the loop). A burst-mode DMA operation, by contrast, would only need 5 cycles per group (assuming the receiving device was able to accept data that fast).
Because a typical low-end ARM will only use the bus about every other cycle when running most kinds of code, a DMA controller that grabs the bus on every other cycle could allow the CPU to run at almost normal speed while the DMA controller performed one access every other cycle. On some platforms, it may be possible to have a DMA controller perform transfers on every cycle where the CPU isn't doing anything, while giving the CPU priority on cycles where it needs the bus. DMA performance would be highly variable in such a mode (no data would get transferred while running code that needs the bus on every cycle) but DMA operations would have no impact on CPU performance.

What happens when VRAM is full?

I want to know the current nvidia/AMD implementation of handling VRAM resource allocation.
We already know that operating systems use swap/virtual memory when system RAM is full, then what is the equivalent of swap when it comes to VRAM? Do they fall back to system RAM or hard disk?
I thought that falling back to system RAM would be rational, but from my experience video games lag horribly (1/20 of the typical FPS) when they run out of video memory, which made me doubt that they use system RAM, because I don't think system RAM is slow enough to make a game lag that much.
In short, I would like to know what the current implementations are and what the biggest bottleneck is that causes games to lag in out-of-memory situations.
- The swapping really is done to system RAM, provided there is enough RAM to swap to. Swapping to a file is unusable due to its slow speed; see the next bullet.
- The RAM itself is not that slow (though still slower than VRAM), but the buses connected to it are.
- When system memory is swapped to a swap file, the swap occurs only when needed (changing application focus, opening a new file/table, ...), which is not that frequent. But if you are out of VRAM you are in trouble, because usually most of the graphics data is used in every frame.
- That leads to swapping per frame, so you need to copy usually very large data blocks very often. For example, swapping 256 MB at 20 fps leads to:
256M x 2 x 20 = 10 GB/s read
256M x 2 x 20 = 10 GB/s write
which is 20 GB/s of bandwidth needed. Of course, depending on the memory controller and architecture, you can do reads and writes simultaneously up to a point, so theoretically you can get close to 10 GB/s in total, but that is still a huge number for only a 256 MB chunk of data. Look here:
Cache size estimation on your system?
My setup at that time had a memory write speed of only around 5 GB/s, which is nowhere near the transfer rate needed for such a task.
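For reference, the kind of figure quoted above can be approximated with a plain timed memset over a buffer much larger than the caches; a rough, portable sketch (POSIX clock_gettime assumed):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

int main(void)
{
    const size_t size = 512u * 1024u * 1024u;   /* well beyond any CPU cache */
    uint8_t *buf = malloc(size);
    if (buf == NULL) return 1;

    memset(buf, 0, size);                       /* fault the pages in first */

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    memset(buf, 0xA5, size);                    /* timed sustained write stream */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double sec = (double)(t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("write bandwidth: %.2f GB/s\n", (double)size / sec / 1e9);

    free(buf);
    return 0;
}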

How long does it take to set up an I/O controller on PCIe bus

Say I have an InfiniBand or similar PCIe device and a fast Intel Core CPU and I want to send e.g. 8 bytes of user data over the IB link. Say also that there is no device driver or other kernel: we're keeping this simple and just writing directly to the hardware. Finally, say that the IB hardware has previously been configured properly for the context, so it's just waiting for something to do.
Q: How many CPU cycles will it take the local CPU to tell the hardware where the data is and that it should start sending it?
More info: I want to get an estimate of the cost of using PCIe communication services compared to CPU-local services (e.g. using a coprocessor). What I am expecting is that there will be a number of writes to registers on the PCIe bus, for example setting up an address and length of a packet, and possibly some reads and writes of status and/or control registers. I expect each of these to take several hundred CPU cycles, so I would expect the overall setup to take on the order of 1000 to 2000 CPU cycles. Would I be right?
I am just looking for a ballpark answer...
Your ballpark number is correct.
If you want to send an 8-byte payload using an RDMA write, first you will write the request descriptor to the NIC using programmed I/O (PIO), and then the NIC will fetch the payload using a PCIe DMA read. I'd expect both the PIO write and the DMA read to take between 200 and 500 nanoseconds, although the PIO should be faster.
You can get rid of the DMA read and save some latency by putting the payload inside the request descriptor.
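In rough pseudo-register terms, the inline-payload variant looks like the sketch below; the descriptor layout and the queue-slot and doorbell addresses are hypothetical, not any real NIC's programming model:

#include <stdint.h>
#include <string.h>

/* Hypothetical send descriptor with room for a small inline payload. */
struct send_desc {
    uint64_t remote_addr;     /* destination address of the RDMA write */
    uint32_t length;
    uint32_t flags;
    uint8_t  inline_data[32]; /* payload carried inside the descriptor */
};

#define DESC_INLINE 0x1u

/* BAR-mapped queue slot and doorbell register (placeholder addresses). */
#define QUEUE_SLOT ((volatile uint8_t *)0xF0000000u)
#define DOORBELL   (*(volatile uint32_t *)0xF0001000u)

void rdma_write_8_bytes_inline(uint64_t remote_addr, const void *payload)
{
    struct send_desc d = {
        .remote_addr = remote_addr,
        .length      = 8,
        .flags       = DESC_INLINE,   /* NIC needs no DMA read for the data */
    };
    memcpy(d.inline_data, payload, 8);

    /* PIO: the CPU writes the whole descriptor into the NIC's queue slot. */
    memcpy((void *)QUEUE_SLOT, &d, sizeof d);

    DOORBELL = 1;                     /* ring the doorbell: "go send it" */
}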

Clarify: Processor operates at 800 Mhz and 200Mhz DDR RAM

I have an evaluation kit which has an implementation of ARM Cortex-A8 core. The processor data sheet states that it has a
ARM Cortex A8™ core, which operates at speeds as high as 800MHz and Up to 200MHz DDR2 RAM.
What can I expect from this system? Am I right to assume that the memory accesses will be a bottleneck because it operates at only 200MHz?
Need more info on how to interpret this.
The processor works with an internal cache (actually, several) which it can access at "full speed". The cache is small (typically 8 to 32 kilobytes) and is filled by chunks ("cache lines") from the external RAM (a cache line will be a few dozen consecutive bytes). When the code needs some data which is not presently in the cache, the processor will have to fetch the line from main RAM; this is called a cache miss.
How fast the cache line can be obtained from main RAM is described by two parameters, called latency and bandwidth. Latency is the amount of time between the moment the processor issues the request and the moment the first cache line byte is received. Typical latencies are about 30 ns. At 800 MHz, 30 ns means 24 clock cycles. Bandwidth describes how many bytes per nanosecond can be sent on the bus. "200 MHz DDR2" means that the bus clock will run at 200 MHz. DDR2 RAM can send two data elements per cycle (hence 400 million elements per second). Bandwidth then depends on how many wires there are between the CPU and the RAM: with a 64-bit bus and 200 MHz DDR2 RAM, you could hope for 3.2 GBytes/s in ideal conditions. So while the first byte takes quite some time to be obtained (latency is high relative to what the CPU can do), the rest of the cache line is read quite quickly.
In the other direction: the CPU writes some data to its cache, and some circuitry will propagate the modification to main RAM at its leisure.
The description above is overly simplistic; caches and cache management are a complex area. The bottom line is the following: if your code uses big data tables in memory and accesses them in a seemingly random way, then the application will be slow, because most of the time the processor will just be waiting for data from main memory. On the other hand, if your code can operate with little RAM, less than a few dozen kilobytes, then chances are that it will run most of the time within the innermost cache, and external RAM speed will be unimportant. The ability to make memory accesses in a way which works well with the caches is called locality of reference.
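A classic illustration of locality of reference in C: summing a large 2-D array row by row uses every byte of each fetched cache line before moving on, while summing it column by column drags a new cache line in for nearly every element.

#include <stdint.h>

#define ROWS 1024
#define COLS 1024

static int32_t table[ROWS][COLS];   /* 4 MB, far larger than the caches */

/* Cache-friendly: walks memory sequentially, one cache line at a time. */
int64_t sum_row_major(void)
{
    int64_t s = 0;
    for (int r = 0; r < ROWS; r++)
        for (int c = 0; c < COLS; c++)
            s += table[r][c];
    return s;
}

/* Cache-hostile: consecutive accesses are 4KB apart, so most of them miss
   the caches and have to wait on main RAM. */
int64_t sum_col_major(void)
{
    int64_t s = 0;
    for (int c = 0; c < COLS; c++)
        for (int r = 0; r < ROWS; r++)
            s += table[r][c];
    return s;
}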
See the Wikipedia page on caches for an introduction and pointers on the matter of caches.
(Big precomputed tables were a common optimization trick during the '80s, because at that time processors were not faster than RAM and one-cycle memory access was the rule. That is why an 8 MHz Motorola 68000 CPU had no cache. But those days are long gone.)
Yes, the memory may well be a bottleneck, but you are very unlikely to be running an application that does nothing but read and write memory.
For work that stays inside the CPU (registers and cache), the memory bottleneck will have no effect.
