I have a BeagleBoard with a TMS320C64x+ DSP. I'm working on an image processing application for the BeagleBoard. Here's how it's going to work:
1. The ARM reads an image from a file and puts the image in a 2D array.
2. The ARM sends the matrix to the DSP. The DSP receives the matrix.
3. The DSP performs the image processing algorithm on the received matrix (the algorithm code uses about 5 MB of dynamically allocated memory).
4. The DSP sends the processed image (matrix) back to the ARM. The ARM receives the matrix.
5. The ARM saves the processed image to a file.
I've already written the code for steps 1, 3 and 5. What is the easiest way to do steps 2 and 4 (sending the data)? Code examples are welcome.
The easiest way is to use shared memory:
Use the CMEM kernel module to allocate a chunk of memory on the ARM that can be accessed from both the ARM and the DSP. Then pass the pointer down to the DSP using the DSP/BIOS Link NOTIFY component.
Once the DSP is done with processing you can notify the ARM via NOTIFY.
This way there is no need to copy the data from the ARM to the DSP or vice versa. All you have to make sure is that the buffer comes from CMEM, which guarantees the memory is physically contiguous (the DSP knows nothing about the ARM's memory manager).
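Since the question asks for code, here is a rough ARM-side sketch of that flow. It assumes TI's user-space CMEM library (cmem.h from linuxutils) and the DSP/BIOS Link NOTIFY API; the processor ID, IPS ID and event number are placeholders that depend on your DSPLink configuration, so treat this as an outline rather than drop-in code.

    #include <cmem.h>                    /* TI linuxutils user-space CMEM API */

    #define IMG_BYTES (640 * 480)        /* placeholder image size */

    int send_image_to_dsp(void)
    {
        CMEM_AllocParams params = CMEM_DEFAULTPARAMS;
        unsigned char   *img;
        unsigned long    phys;

        CMEM_init();

        /* Physically contiguous buffer, visible to both the ARM and the DSP */
        img = (unsigned char *)CMEM_alloc(IMG_BYTES, &params);
        if (img == NULL)
            return -1;
        phys = CMEM_getPhys(img);

        /* ... copy the 2D image array into img ... */

        /* Hand the physical address to the DSP, e.g. via DSPLink NOTIFY.
         * The IDs below are configuration-specific placeholders:
         *
         *   NOTIFY_notify(DSP_PROC_ID, IPS_ID, EVENT_NO, (Uint32)phys);
         *
         * The DSP processes the buffer in place, notifies the ARM back with
         * its own NOTIFY event, and the ARM then saves img to a file. */

        CMEM_free(img, &params);
        CMEM_exit();
        return 0;
    }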
Shared memory is the right approach, but learning how to do it can be a pain. The C6Run tool can abstract the ARM/DSP communication for you, making it easier. Although NOTIFY is really the right API to use, C6Run relies on CMEM through an older API.
If you want to try C6Run out on the BeagleBoard, the easiest way is by following the instructions on the eLinux wiki for setting up C6Run for the ECE597 course given by Mark Yoder at Rose-Hulman. These instructions depend on running the Angstrom demo image (2). A stable version that was used to demonstrate functionality of the hardware is documented as well (3).
(2): www.angstrom-distribution.org/demo/beagleboard
(3): code.google.com/p/beagleboard/wiki/BeagleBoardDiagnosticsNext
This is a general question, and although I use OpenCV as a framework, the question is broader than OpenCV's realm.
I am developing an image processing tool that will effectively get an image from a webcam (yielding a host-memory located cv::Mat), upload it to GPU device memory in CUDA (i.e. cv::GpuMat), do some processing using CUDA to get a result finalCudaMat, and finally send the result to OpenGL (i.e. cv::ogl::Buffer::mapDevice + finalCudaMat.copyTo(mappedOglBuffer)). Everything works as intended.
Since the whole process involves multiple steps, I use a CUDA stream object (cv::cuda::Stream) to be able to make CUDA calls asynchronous and not wait on every single operation to be finished on CPU side. Now if someone instead is to eventually copy the result to a CPU matrix (i.e. finalCudaMat.download(finalCpuMat)), as in a customary situation, typically a wait on the stream is required (cudaStream.waitForCompletion()) to ensure the result is ready before using the CPU side matrix.
In my case, the result never gets back to the CPU, as it continues to be rendered on the screen (a bit of OpenGL operations and shaders are also involved).
One way, it might be appropriate to wait for the CUDA work to finish before starting to copy the GpuMat to the OpenGL buffer. If I add the wait on the stream, everything works fine and the CUDA operations take ~2.5 ms.
Another way, it feels like I don't need to wait for completion of the stream (all the results are consumed by the GPU anyway; the CPU is never involved again). So I can remove the cudaStream.waitForCompletion() call before performing finalCudaMat.copyTo(mappedOglBuffer), and everything seems to work fine. The whole CUDA processing operation (basically any GPU task minus the OpenGL-related work) apparently takes ~1.8 ms for me.
In the past I have had bad experiences with not properly synchronizing GPU work when two different APIs were involved (e.g. do something on Direct3D 9, don't wait for it to finish, then copy the resulting texture to a Direct3D 10 texture; on some frames the image clearly comes out empty or torn).
At this point, the difference is tiny and doesn't affect my 60 FPS throughput. But I wonder if I am technically doing the correct thing by removing the wait-on-stream operation. Any thoughts on this? Or maybe a document regarding OpenGL/CUDA interop that could help me?
The rules are defined in this document: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#graphics-interoperability
In particular it says that
Accessing a resource through OpenGL, Direct3D, or another CUDA context while it is mapped produces undefined results.
That's a very strong hint that the needed synchronization is performed by cudaGraphicsUnmapResources, which is confirmed by its documentation:
This function provides the synchronization guarantee that any CUDA work issued in stream before cudaGraphicsUnmapResources() will complete before any subsequently issued graphics work begins.
So you won't need to make the CPU wait on CUDA completion, but you must call cudaGraphicsUnmapResources which will put the appropriate barrier in the asynchronous instruction stream. Note that unlike your CPU transfer code, this call goes after CUDA copies data into the OpenGL buffer.
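In terms of the OpenCV wrappers used in the question, the ordering that follows from this might look like the sketch below. mapDevice/unmapDevice wrap the CUDA graphics map/unmap calls; oglBuffer, finalCudaMat and cudaStream are the question's names, the rest is illustrative.

    // Map the GL buffer for CUDA access on the stream, copy, then unmap.
    cv::cuda::GpuMat mapped = oglBuffer.mapDevice(cudaStream);
    finalCudaMat.copyTo(mapped, cudaStream);   // asynchronous device-to-device copy
    oglBuffer.unmapDevice(cudaStream);         // barrier: later GL work sees the data
    // No cudaStream.waitForCompletion() is needed before the OpenGL rendering.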
As Ben Voigt already pointed out, CUDA requires explicit synchronization with OpenGL (or any other graphics API that interoperates with it). This used to be kind of a chore, where one had to submit callbacks to the compute stream and use them to manually work with e.g. OpenGL fences.
However, with the advent of Vulkan and its support for external resources (and the OpenGL extensions for them), you can in fact synchronize between CUDA and OpenGL command streams by having both sides import platform-native semaphores (cudaImportExternalSemaphore, GL_EXT_semaphore) and using them for mutual synchronization. It usually still involves a whole round trip through the CPU-side driver, but since that part has to manage the command streams anyway, it's not really an efficiency issue.
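A very rough sketch of that handshake is below. It assumes the semaphore was created in Vulkan with an exportable file descriptor (one duplicated FD per import, not shown), that the GL_EXT_semaphore / GL_EXT_semaphore_fd extensions are available, and it omits all error checking.

    // CUDA side: import the FD and signal the semaphore after the compute work.
    cudaExternalSemaphoreHandleDesc semDesc = {};
    semDesc.type      = cudaExternalSemaphoreHandleTypeOpaqueFd;
    semDesc.handle.fd = cudaFd;                        // FD exported from Vulkan
    cudaExternalSemaphore_t cudaSem;
    cudaImportExternalSemaphore(&cudaSem, &semDesc);

    cudaExternalSemaphoreSignalParams sig = {};
    cudaSignalExternalSemaphoresAsync(&cudaSem, &sig, 1, stream);  // after kernels/copies

    // OpenGL side: import the same semaphore and wait on it before consuming the buffer.
    GLuint glSem = 0;
    glGenSemaphoresEXT(1, &glSem);
    glImportSemaphoreFdEXT(glSem, GL_HANDLE_TYPE_OPAQUE_FD_EXT, glFd);
    glWaitSemaphoreEXT(glSem, 1, &glBufferId, 0, nullptr, nullptr);  // then issue the draw calls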
We are working on a project where we need to do some image processing on an FPGA. For that purpose we are using a ZedBoard with Linaro (an Ubuntu variant) running on it.
What we have already done is store the image in binary form, pixel by pixel, in DDR using a Python script on the processing system (PS) of the ZedBoard.
Now our task is to read the contents of the DDR memory, process them, and send the processed output back to DDR memory. We are using the Xilinx Vivado tool for the FPGA part. We tried to use AXI DMA with an AXI Interconnect to read and write data from DDR.
My question is: do we need to use the SDK and some sort of C code to read and write the DDR memory from the programmable logic side? We want our module to start reading the data from DDR on a control signal and then begin the actual processing of the image data: read a specific block of data, process it, and store the result back to DDR memory on the fly. We are not sure which IP blocks we need in our Vivado block design. Also, do we need block RAM at the end before sending the data back to DDR?
Has anyone already done this sort of project, or does anyone have any knowledge about it? Any help from your side will be appreciated!
Thanks
The Zynq FPGA provides an AMBA AXI interconnect for that purpose.
It is the interconnect shown on the right of the Zynq block diagram (the image is not reproduced here).
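On the SDK question: one common arrangement is to let the PS drive the AXI DMA. A hypothetical bare-metal sketch using the Xilinx xaxidma driver is shown below; the device ID, addresses and sizes are placeholders for whatever your block design uses, and since you are running Linaro you would in practice go through a Linux DMA/UIO driver rather than this standalone code.

    #include "xaxidma.h"
    #include "xparameters.h"

    #define IMG_BYTES  (640 * 480)              /* example image size            */
    #define SRC_ADDR   ((UINTPTR)0x10000000)    /* raw image already in DDR      */
    #define DST_ADDR   ((UINTPTR)0x11000000)    /* where the result should land  */

    int process_image_via_dma(void)
    {
        XAxiDma        dma;
        XAxiDma_Config *cfg = XAxiDma_LookupConfig(XPAR_AXIDMA_0_DEVICE_ID);

        if (cfg == NULL || XAxiDma_CfgInitialize(&dma, cfg) != XST_SUCCESS)
            return -1;

        /* PL -> DDR: arm the receive channel for the processed pixels first */
        XAxiDma_SimpleTransfer(&dma, DST_ADDR, IMG_BYTES, XAXIDMA_DEVICE_TO_DMA);

        /* DDR -> PL: stream the raw pixels into the processing pipeline */
        XAxiDma_SimpleTransfer(&dma, SRC_ADDR, IMG_BYTES, XAXIDMA_DMA_TO_DEVICE);

        /* Poll until both channels are idle (interrupts are the cleaner choice) */
        while (XAxiDma_Busy(&dma, XAXIDMA_DMA_TO_DEVICE) ||
               XAxiDma_Busy(&dma, XAXIDMA_DEVICE_TO_DMA))
            ;

        return 0;
    }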
I'm developing an application for an Atmel SAME70Q21 microcontroller. This MCU has an ARM Cortex-M7 core.
Atmel has implemented the ARM TCM (Tightly Coupled Memory) in this particular MCU variant, and divides the TCM into two sections: "ITCM" (instruction TCM) and "DTCM" (data TCM).
I'm currently using the DTCM for fast storage, usually from interrupts. However, the ITCM is currently turned off, even though the TCM configuration system still allocates 32 KB to it.
I was thinking: since I'm not executing out of the ITCM, and the RAM is already allocated, can I use the ITCM for data storage? The Cortex-M7 is a Von Neumann architecture CPU, and the architecture diagrams I've seen show the two TCM memory segments as having separate interfaces to the CPU.
Both the DTCM and ITCM memory spaces are marked rw in the linker script (to use the ITCM you actually have to relocate your code into it at runtime). What are the performance effects of (ab)using ARM cores this way?
From the ARM Cortex-M7 Processor Technical Reference Manual Section 5.8 TCM Interfaces:
The Prefetch Unit (PFU) can fetch instructions from any of the TCM interfaces. The Load Store Unit (LSU) and the AHBS interface can each read and write data using any of the TCM interfaces. Best performance is achieved if code is placed in ITCM and data in DTCM. However, there is no functional restriction in which TCM, code and data is placed.
If you are using neither for code, then there is probably no performance hit, but if you are running code in TCM, then separating them benefits from the Harvard architecture, allowing concurrent instruction fetch and data read. The ITCM's 64-bit bus presumably allows single-cycle instruction and operand fetch, but I doubt that will be of any benefit for data read/write.
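If you do go ahead and use the ITCM for data, doing so from C is mostly a linker exercise. A minimal GCC-style sketch is below; the section name ".itcm" is an assumption and must match whatever output section your linker script maps to the ITCM address range, with the ITCM itself enabled (on the SAM E70 this is controlled by the GPNVM bits).

    #include <stdint.h>

    /* Buffer forced into the ITCM output section (assumed name ".itcm"). */
    static uint8_t fast_scratch[1024] __attribute__((section(".itcm")));

    void use_fast_scratch(void)
    {
        for (uint32_t i = 0; i < sizeof fast_scratch; ++i)
            fast_scratch[i] = (uint8_t)i;   /* ordinary data accesses via the LSU */
    }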
I am new to ARM and looking for ways to detect the memory map of an ARM-based platform. Earlier I worked a little on x86, where I could find the memory map using BIOS calls.
Can the same be done on ARM, even though ARM has no BIOS?
Does any ARM instruction exist to find the memory map?
A short guide to finding the memory map for an ARM CPU:
Read the documentation from arm.com for your corresponding core
Read the documentation of your CPU
Read the documentation of your platform, to see if it has external memory connected to the SoC (CPU)
Or as a shortcut:
If your platform vendor provides a toolchain to compile code for it, make a dummy project and look at the memory layout in your linker file...
Gather this information:
Memory map for the corresponding core
Memory map of your CPU
If it has externally accessible memory, you have to perform some steps to initialize the controller.
Use the gathered data to build the linker file for your project
Do whatever you want with it
There is no interface as ubiquitous as BIOS or EFI for ARM systems, though Microsoft does specify UEFI for systems that run Windows.
The Linux boot interface is the most common interface, see Documentation/arm/Booting in the kernel source and the header files.
If you want to write a program that is to be portable across different ARM devices, you have to detect the memory yourself. I am not an ARM expert, but there are common principles: you can simply scan the whole address space and probe the memory by writing a value and then reading it back. Usually two such passes are made with different values, in order to rule out accidental matches:
1. write 0xAA
2. read and check for 0xAA
3. write 0x55
4. read and check for 0x55
Note 1: for better speed, not every byte has to be checked; use some natural granularity and check only at the start of each page (whatever the page size is on this platform).
At the end you will have a map of the RAM. ROM is not so easy to detect, though, and there is no common solution.
Note 2: depending on the architecture (as I said, I am not an ARM expert), your program must have access to the whole address space, subject to the memory protection mechanisms of the CPU (if any).
Note 3: the only real problem with this approach is memory-mapped I/O. Touching it can affect the I/O devices in unpredictable ways. That is why you must know which areas of the address space are used for memory-mapped I/O and avoid probing them.
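A rough bare-metal sketch of that probe loop follows; the scan range, page size and the assumption that nothing in the range is memory-mapped I/O all have to come from your platform documentation.

    #include <stdint.h>

    #define SCAN_START 0x00000000u     /* placeholder address range     */
    #define SCAN_END   0x20000000u
    #define PAGE_SIZE  0x1000u         /* probe once per page           */

    /* Returns 1 if the word at p behaves like RAM (both patterns read back). */
    static int probe_word(volatile uint32_t *p)
    {
        uint32_t saved = *p;           /* preserve the original content */

        *p = 0xAAAAAAAAu;
        if (*p != 0xAAAAAAAAu) { *p = saved; return 0; }

        *p = 0x55555555u;
        if (*p != 0x55555555u) { *p = saved; return 0; }

        *p = saved;
        return 1;
    }

    void scan_ram(void)
    {
        for (uintptr_t addr = SCAN_START; addr < SCAN_END; addr += PAGE_SIZE) {
            if (probe_word((volatile uint32_t *)addr)) {
                /* record [addr, addr + PAGE_SIZE) as RAM */
            }
        }
    }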
If my algorithm is bottlenecked by host to device and device to host memory transfers, is the only solution a different or revised algorithm?
There are a couple things you can try to mitigate the PCIe bottleneck:
Asynchronous transfers - permits overlapping computation and bulk transfer
Mapped memory - allows a kernel to stream data to/from the GPU during execution
Note that neither of these techniques makes the transfer itself go any faster; they just reduce the time the GPU spends waiting for the data to arrive.
With the cudaMemcpyAsync API function you can initiate a transfer, launch one or more kernels that do not depend on the result of the transfer, synchronize the host and device, and then launch kernels that were waiting on the transfer to complete. If you can structure your algorithm such that you're doing productive work while the transfer is taking place, then asynchronous copies are a good solution.
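A hedged sketch of that overlap pattern is below; the kernel names and launch sizes are placeholders, and the host buffer must be pinned (e.g. with cudaMallocHost) for the copy to be truly asynchronous.

    #include <cuda_runtime.h>

    __global__ void independentKernel() { /* work that does not need the data */ }
    __global__ void dependentKernel(const float *d, int n) { /* uses the data */ }

    void overlapped_transfer(int n)
    {
        float *h_buf, *d_buf;
        cudaStream_t stream;
        cudaStreamCreate(&stream);
        cudaMallocHost((void **)&h_buf, n * sizeof(float));   // pinned host memory
        cudaMalloc((void **)&d_buf, n * sizeof(float));

        // 1. Start the copy; the call returns immediately.
        cudaMemcpyAsync(d_buf, h_buf, n * sizeof(float),
                        cudaMemcpyHostToDevice, stream);

        // 2. Do productive work that does not depend on the transfer.
        independentKernel<<<64, 256>>>();

        // 3. Synchronize, then launch the work that was waiting on the data.
        cudaStreamSynchronize(stream);
        dependentKernel<<<64, 256, 0, stream>>>(d_buf, n);

        cudaStreamSynchronize(stream);
        cudaFree(d_buf);
        cudaFreeHost(h_buf);
        cudaStreamDestroy(stream);
    }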
With the cudaHostAlloc API function you can allocate host memory that can be read and written directly from the GPU. The reason this is faster is that a block that needs host data only needs to wait for a small portion of the data to be transferred. In contrast, the usual approach makes all blocks wait until the entire transfer is complete. Mapped memory essentially breaks a big monolithic transfer into a bunch of smaller copy operations, so the latency is reduced.
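A corresponding sketch for mapped ("zero-copy") memory; the kernel is a placeholder, and the device-flag call has to happen before the CUDA context is created.

    #include <cuda_runtime.h>

    __global__ void kernel(const float *d, int n) { /* reads host memory over PCIe */ }

    void mapped_memory_example(int n)
    {
        float *h_data, *d_data;
        cudaSetDeviceFlags(cudaDeviceMapHost);          // before any CUDA context exists
        cudaHostAlloc((void **)&h_data, n * sizeof(float), cudaHostAllocMapped);
        cudaHostGetDevicePointer((void **)&d_data, h_data, 0);

        // No explicit cudaMemcpy: the kernel streams the data from host memory
        // as it executes, so blocks start as soon as their portion arrives.
        kernel<<<64, 256>>>(d_data, n);
        cudaDeviceSynchronize();

        cudaFreeHost(h_data);
    }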
You can read more about these topics in Section 3.2.6-3.2.7 of the CUDA Programming Guide and Section 3.1 of the CUDA Best Practices Guide. Chapter 3 of the OpenCL Best Practices Guide explains how to use these features in OpenCL.
You really need to do the math to be certain that you're going to be doing enough processing on the GPU to make it worthwhile transferring data between host and GPU. Ideally you do this at the design stage, before doing any coding, since it can be a deal-breaker.