Store data in DDR3 from PL in SoC Zynq 7020

I have a ZTurn-Board with a Zynq 7020 SoC and a total of 1GB of DDR3 memory connected to the PS.
For my project, the PL will read a total of 4*2584=10336 consecutive 8-bit values under very precise timing control (I receive four 8-bit values at a time, at a rate of 2MHz).
So I was wondering whether it is possible to store all the data I am generating in the DDR3 memory from the PL until the process is finished and then, once it has finished, send it from the PS to the PC, either by UART or GbE. If it is possible to store all this data, which IP should I look into?
Would it also be possible to keep storing data from the PL until the 1GB of DDR3 memory is completely full?
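As a rough feasibility check on the figures in the question, the sketch below works out the data rate and the buffer size involved. It assumes that "every 2MHz" means one 4-byte word per 2MHz tick and that 1GB means 2**30 bytes; all names are illustrative.

```python
# Back-of-the-envelope sizing for the capture described in the question.
# Assumption: one 4-byte word (four 8-bit values) arrives on every 2 MHz tick.
BYTES_PER_WORD = 4
WORD_RATE_HZ = 2_000_000
WORDS_PER_CAPTURE = 2584
DDR_BYTES = 1 << 30  # assumption: "1GB" taken as 2**30 bytes

capture_bytes = BYTES_PER_WORD * WORDS_PER_CAPTURE      # size of one capture
throughput_bytes_per_s = BYTES_PER_WORD * WORD_RATE_HZ  # sustained write rate
capture_time_s = WORDS_PER_CAPTURE / WORD_RATE_HZ       # duration of one capture
seconds_to_fill_ddr = DDR_BYTES / throughput_bytes_per_s

print(f"one capture: {capture_bytes} bytes in {capture_time_s * 1e3:.3f} ms")
print(f"sustained rate: {throughput_bytes_per_s / 1e6:.1f} MB/s")
print(f"time to fill the DDR3: {seconds_to_fill_ddr:.1f} s")
```

The last figure is an upper bound, since whatever runs on the PS normally needs part of the same memory.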

Related

What is the storage available in Movesense? For how long is it able to capture data locally?

According to the tech specs sheet, Movesense seems to operate with 512kB of local memory.
Am I right? Can we work out how long it is able to store data locally (e.g. at 26Hz)?
The idea is to store data locally and sync it with the mobile app once in a while.
Thanks

UPDATE 2:
The DataLogger and Logbook improvements in software versions 1.4, 1.6 and 1.9 have changed the situation for the better. The chunk overhead is now smaller, at 15/255 bytes, and large measurements can be split across consecutive chunks. To check whether the memory is full, there is also the /Mem/Logbook/isFull resource, which can be GET'd and SUBSCRIBE'd.
UPDATE:
In the last proto build (hw build G1) and production builds, the EEPROM Data memory has grown to 384kB. The memory can be freely allocated between DataLogger/Logbook use and "other" (Movesense device lib sw version >= 1.0.1).
Movesense sensor has (at the moment of writing):
512kB of FLASH (program) memory, out of which there is about 70kB for customer application (the rest is taken by Bluetooth stack, bootloader, movesense platform and settings)
64kB of RAM out of which ~10kB is reserved for Bluetooth stack. Current software seems to have 12.5kB free heap for customer software after framework and execution contexts have been initialized.
128 kB of EEPROM data memory (though it may be bigger in production version). This is the memory where DataLogger saves the measurements.
The number of bytes per measurement required by DataLogger can be seen in /sbem-code/sbem_definitions.cpp. At a 26Hz sample rate each data packet contains 2 measurements, so it takes 28 bytes, and packets arrive at 13Hz. There are 112 bytes available for data in each EEPROM chunk, so each 128 byte chunk can hold exactly 4 data packets. So the answer:
128*1024 [B] / 128 [B / chunk] / ( 13 [pkg/sec] / 2 [pkg/chunk] ) =>
1024 [chunks] / 6.5 [chunk/sec] = ~157 seconds
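The estimate above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic, and the function and parameter names are made up for illustration. The formula as written uses 2 packets per chunk, while the preceding paragraph mentions 4 packets per chunk, so both values are evaluated.

```python
def logging_seconds(eeprom_bytes, chunk_bytes, packets_per_s, packets_per_chunk):
    """Seconds until the EEPROM is full, given how fast chunks are consumed."""
    chunks = eeprom_bytes // chunk_bytes
    chunks_per_s = packets_per_s / packets_per_chunk
    return chunks / chunks_per_s

# Figures from the answer: 128 kB EEPROM, 128-byte chunks, 13 packets per second.
as_written = logging_seconds(128 * 1024, 128, 13, packets_per_chunk=2)
four_per_chunk = logging_seconds(128 * 1024, 128, 13, packets_per_chunk=4)

print(f"2 packets per chunk: {as_written:.1f} s")
print(f"4 packets per chunk: {four_per_chunk:.1f} s")
```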
Disclaimer: The calculations above are for current Movesense hardware and current software, the situation for both may change in the future
Full disclosure: I work for the Movesense team

Memcpy from PCIe memory takes more time than memcpy to PCIe memory

I am trying to read and write data between a Linux PC and a PCIe 2.0 (2 lane) device. The memory regions for reading and writing are at different RAM locations in the PCIe device. These regions are mapped on the Linux PC using ioremap. My use case is to achieve 18MBytes/second read/write throughput, which is obviously supported by the PCIe link. The memory on the PCIe device is uncached.
I am able to achieve the write throughput, i.e. when I write from Linux PC local memory to PCIe device memory using memcpy. The memcpy takes less than 1 ms for 9216 bytes of data in this case. But when I read the ioremapped PCIe memory into Linux local memory, data loss occurs. I profiled the memcpy and it takes more than 1ms, sometimes 2ms, for 9216 bytes of data. I don't want to use DMA for this operation.
Any thoughts on what can be the problem in this case? How can I handle this?

That's entirely expected, and there is nothing you can do about it. The CPU can only issue serialized word-sized reads and writes, which have very poor throughput over the PCIe link due to protocol overheads. Every operation has 24 or 28 byte-times worth of overhead associated with it: that's a 12 or 16 byte TLP header plus 12 byte-times of link layer overhead. The CPU can only operate on 4 or 8 bytes at a time, which is at best 25% efficient (8/(8+24) = 25%) and at worst 12.5% efficient (4/(4+28) = 12.5%).
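The efficiency figures above follow directly from the overhead numbers. A minimal sketch of that arithmetic, with an illustrative function name:

```python
def pcie_efficiency(payload_bytes, header_bytes, link_overhead_bytes=12):
    """Fraction of link time spent on payload for a single transaction."""
    return payload_bytes / (payload_bytes + header_bytes + link_overhead_bytes)

best = pcie_efficiency(payload_bytes=8, header_bytes=12)   # 8 / (8 + 24)
worst = pcie_efficiency(payload_bytes=4, header_bytes=16)  # 4 / (4 + 28)

print(f"best case:  {best:.1%}")
print(f"worst case: {worst:.1%}")
```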
The protocol overhead is not the only issue, however. Writes in PCIe are posted, so the CPU can simply issue a bunch of back-to-back writes which eventually make their way onto the bus and to the device. On the other hand, when reading, the CPU can only issue a single read operation, wait for it to traverse the bus twice, store the result, issue another read, etc. Since it can only operate on 8 bytes at a time, the performance is horrible due to the relatively high latency over the PCIe bus (can be on the order of microseconds for each transfer).
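To see how this plays out for the 9216-byte transfer in the question, here is a rough latency-bound estimate. The 1 microsecond round-trip time is an assumed value, chosen only to be consistent with "on the order of microseconds"; the real figure depends on the platform and the device.

```python
def read_time_s(total_bytes, bytes_per_read, round_trip_s):
    """Time for back-to-back blocking reads, each waiting a full round trip."""
    reads = -(-total_bytes // bytes_per_read)  # ceiling division
    return reads * round_trip_s

TRANSFER_BYTES = 9216   # from the question
ROUND_TRIP_S = 1e-6     # assumption: about 1 microsecond per read

elapsed_s = read_time_s(TRANSFER_BYTES, bytes_per_read=8, round_trip_s=ROUND_TRIP_S)
throughput_bytes_per_s = TRANSFER_BYTES / elapsed_s

print(f"{elapsed_s * 1e3:.3f} ms for {TRANSFER_BYTES} bytes")
print(f"{throughput_bytes_per_s / 1e6:.1f} MB/s")
```

Under that assumption the copy takes a little over 1 ms and reaches roughly 8 MB/s, which is in line with the timings reported in the question and well short of the 18MBytes/second target.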
The solution? Use DMA. PCIe is specifically designed to support efficient DMA operations over the bus, as devices can issue much larger read and write operations: every device supports payloads of at least 128 bytes per operation.

Maximum data a GPU can take?

I have a large dataset, say 5 GB, and I am doing stream-wise processing on the data. I need to figure out how much data I can send to the GPU at a time for processing, so that I can use the GPU memory to the fullest.
Also, if my RAM is not sufficient to process or hold 5 GB of data, what is the workaround?

A pipelined application might use 3 buffers on the GPU. One buffer is used to hold the data currently being transferred to the GPU (from the host), one buffer to hold the data currently being processed by the GPU, and one buffer to hold the data(results) currently being transferred from the GPU (to the host).
This implies that your application processing can be broken into "chunks". This is true for many applications that work on large data sets.
CUDA streams enable the developer to write code that allows these 3 operations (transfer to, process, transfer from) to run simultaneously.
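The overlap can be pictured with a small schedule model. The sketch below is plain Python, not CUDA code: it only lists which chunk would occupy each of the three stages at every step, assuming each stage takes one step per chunk. The function name is illustrative.

```python
def pipeline_schedule(num_chunks):
    """For each step, return (upload, compute, download) chunk indices or None."""
    steps = []
    for t in range(num_chunks + 2):  # two extra steps to drain the pipeline
        upload = t if t < num_chunks else None
        compute = t - 1 if 0 <= t - 1 < num_chunks else None
        download = t - 2 if 0 <= t - 2 < num_chunks else None
        steps.append((upload, compute, download))
    return steps

schedule = pipeline_schedule(4)
for step, (upload, compute, download) in enumerate(schedule):
    print(f"step {step}: upload={upload} compute={compute} download={download}")
```

Once the pipeline is full (step 2 onwards in this example), all three operations are busy with different chunks at the same time.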
There is no specific number that defines the size of the buffers in the above scenario. Certainly, a straightforward implementation would create 3 buffers, each of which is smaller than 1/3 of the total memory on the GPU, leaving some memory left over for overhead and other data that may need to live in GPU memory. So if your GPU has 5GB, you might be able to run with three 1GB buffers. But there is no tool like deviceQuery that will tell you this; it is not a property of the device.
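One way to turn that rule of thumb into numbers is sketched below. The amount of memory held back for overhead is a free choice, not something the device reports, and the function and parameter names are hypothetical.

```python
GIB = 1 << 30

def plan_buffers(gpu_bytes, dataset_bytes, reserve_bytes, num_buffers=3):
    """Split the GPU memory left after a reserve into equal buffers.

    Returns (bytes per buffer, number of chunks needed for the dataset).
    """
    usable = gpu_bytes - reserve_bytes
    buffer_bytes = usable // num_buffers
    num_chunks = -(-dataset_bytes // buffer_bytes)  # ceiling division
    return buffer_bytes, num_chunks

# Example from the text: a 5GB GPU running three 1GB buffers, i.e. 2GB held back.
buffer_bytes, num_chunks = plan_buffers(5 * GIB, 5 * GIB, reserve_bytes=2 * GIB)
print(f"{buffer_bytes / GIB:.0f} GiB per buffer, {num_chunks} chunks")
```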
You may want to read carefully the above linked programming guide section, as well as review the CUDA simple streams sample code.

How long does it take to set up an I/O controller on PCIe bus

Say I have an InfiniBand or similar PCIe device and a fast Intel Core CPU and I want to send e.g. 8 bytes of user data over the IB link. Say also that there is no device driver or other kernel: we're keeping this simple and just writing directly to the hardware. Finally, say that the IB hardware has previously been configured properly for the context, so it's just waiting for something to do.
Q: How many CPU cycles will it take the local CPU to tell the hardware where the data is and that it should start sending it?
More info: I want to get an estimate of the cost of using PCIe communication services compared to CPU-local services (e.g. using a coprocessor). What I am expecting is that there will be a number of writes to registers on the PCIe bus, for example setting up an address and length of a packet, and possibly some reads and writes of status and/or control registers. I expect each of these will take several hundred CPU cycles, so I would expect the overall setup to take on the order of 1000 to 2000 CPU cycles. Would I be right?
I am just looking for a ballpark answer...

Your ballpark number is correct.
If you want to send an 8 byte payload using an RDMA write, first you will write the request descriptor to the NIC using Programmed IO, and then the NIC will fetch the payload using a PCIe DMA read. I'd expect both the PIO and the DMA read to take between 200 and 500 nanoseconds each, although the PIO should be faster.
You can get rid of the DMA read and save some latency by putting the payload inside the request descriptor.
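To relate those nanosecond figures to the cycle counts in the question, the sketch below converts them using an assumed 3 GHz core clock; the question only says "fast Intel Core CPU", so the clock speed is a guess. For simplicity it applies the same 200 to 500 ns range to both the PIO and the DMA read.

```python
CPU_GHZ = 3  # assumption: 3 GHz core clock, not stated in the question

def ns_to_cycles(nanoseconds, cpu_ghz=CPU_GHZ):
    """At 1 GHz one cycle lasts one nanosecond, so cycles = ns * GHz."""
    return nanoseconds * cpu_ghz

STEP_NS = (200, 500)  # range given in the answer for each of PIO and DMA read

# Payload placed inside the descriptor: only the PIO write remains.
inline_cycles = tuple(ns_to_cycles(ns) for ns in STEP_NS)
# Descriptor written by PIO, then payload fetched by a DMA read.
pio_plus_dma_cycles = tuple(ns_to_cycles(2 * ns) for ns in STEP_NS)

print("inline payload:", inline_cycles)
print("PIO + DMA read:", pio_plus_dma_cycles)
```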

Clarify: Processor operates at 800 MHz and 200 MHz DDR RAM

I have an evaluation kit built around an ARM Cortex-A8 core. The processor data sheet lists the following:
ARM Cortex A8™ core, which operates at speeds as high as 800MHz and Up to 200MHz DDR2 RAM.
What can I expect from this system? Am I right to assume that the memory accesses will be a bottleneck because it operates at only 200MHz?
Need more info on how to interpret this.

The processor works with an internal cache (actually, several) which it can access at "full speed". The cache is small (typically 8 to 32 kilobytes) and is filled by chunks ("cache lines") from the external RAM (a cache line will be a few dozen consecutive bytes). When the code needs some data which is not presently in the cache, the processor will have to fetch the line from main RAM; this is called a cache miss.
How fast the cache line can be obtained from main RAM is described by two parameters, called latency and bandwidth. Latency is the amount of time between the moment the processor issues the request and the moment the first cache line byte is received. Typical latencies are about 30ns. At 800 MHz, 30ns means 24 clock cycles. Bandwidth describes how many bytes per nanosecond can be sent on the bus. "200 MHz DDR2" means that the bus clock will run at 200 MHz. DDR2 RAM can send two data elements per cycle (hence 400 million elements per second). Bandwidth then depends on how many wires there are between the CPU and the RAM: with a 64-bit bus and 200 MHz DDR2 RAM, you could hope for 3.2 GBytes/s in ideal conditions. So while the first byte takes quite some time to be obtained (latency is high with regard to what the CPU can do), the rest of the cache line is read quite quickly.
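The two figures quoted above can be checked with a short calculation. The variable names are illustrative, and the 64-bit bus width is the hypothetical case from the text rather than a property of any particular board.

```python
CPU_MHZ = 800
LATENCY_NS = 30
BUS_CLOCK_MHZ = 200
TRANSFERS_PER_CLOCK = 2  # DDR: data moves on both clock edges
BUS_WIDTH_BYTES = 8      # the 64-bit bus assumed in the text

# Cycles the CPU could have executed while waiting for the first byte.
latency_cycles = LATENCY_NS * CPU_MHZ / 1000
# Peak transfer rate in MB/s under ideal conditions.
peak_mb_per_s = BUS_CLOCK_MHZ * TRANSFERS_PER_CLOCK * BUS_WIDTH_BYTES

print(f"latency: {latency_cycles:.0f} CPU cycles")
print(f"peak bandwidth: {peak_mb_per_s / 1000:.1f} GB/s")
```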
In the other direction: the CPU writes some data to its cache, and some circuitry will propagate the modification to main RAM at its leisure.
The description above is overly simplistic; caches and cache management are a complex area. The bottom line is the following: if your code uses big data tables in memory and accesses them in a seemingly random way, then the application will be slow, because most of the time the processor will just wait for data from main memory. On the other hand, if your code can operate with little RAM, less than a few dozen kilobytes, then chances are that it will run most of the time within the innermost cache, and external RAM speed will be unimportant. The ability to make memory accesses in a way that works well with the caches is called locality of reference.
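The effect of the access pattern can be illustrated with a toy model of a direct-mapped cache. This is a deliberately simplified sketch: the 32 kB cache size and 64-byte line size are assumptions for the example, not the actual Cortex-A8 configuration, and real caches are more elaborate.

```python
import random

def miss_rate(addresses, cache_bytes=32 * 1024, line_bytes=64):
    """Fraction of accesses that miss in a simple direct-mapped cache model."""
    num_slots = cache_bytes // line_bytes
    slots = [None] * num_slots  # memory line currently held by each cache slot
    misses = 0
    for address in addresses:
        line = address // line_bytes
        slot = line % num_slots
        if slots[slot] != line:  # line not cached: fetch it from main RAM
            slots[slot] = line
            misses += 1
    return misses / len(addresses)

TABLE_BYTES = 4 * 1024 * 1024  # a table much larger than the cache
ELEMENT_BYTES = 4
ACCESSES = 100_000

sequential = [i * ELEMENT_BYTES for i in range(ACCESSES)]
rng = random.Random(0)  # fixed seed so the run is repeatable
scattered = [rng.randrange(TABLE_BYTES // ELEMENT_BYTES) * ELEMENT_BYTES
             for _ in range(ACCESSES)]

sequential_rate = miss_rate(sequential)
scattered_rate = miss_rate(scattered)
print(f"sequential: {sequential_rate:.1%} misses")
print(f"scattered:  {scattered_rate:.1%} misses")
```

In this model the sequential walk misses once per cache line, while the scattered accesses miss almost every time.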
See the Wikipedia page on caches for an introduction and pointers on the matter of caches.
(Big precomputed tables were a common optimization trick during the '80s because at that time processors were not faster than RAM and one-cycle memory access was the rule, which is why an 8 MHz Motorola 68000 CPU had no cache. But those days are long gone.)

Yes, the memory may well be a bottleneck, but you are very unlikely to be running an application that does nothing but read and write to memory.
Inside the CPU, the memory bottleneck will not have an effect.
