DSP on Beaglebone - signal-processing

I have a Beaglebone running Ubuntu. We want to continuously sample from 3 on-board ATD converters at 100 kS/s, and on every window of samples we will run a cross-correlation DSP algorithm. Once we find a correlation value above a threshold, we will send that value to a PC.
My concern is the process scheduling in Ubuntu. If our process gets swapped out and an ATD sample becomes available during this time, the process will miss the sample. We need to ensure that our process will capture every sample and save it in memory.
With this being said, is there a way to trigger interrupts on the Beaglebone so that if an ATD sample is ready, the sample will be saved in the memory of our program even if the program does not have the processor at the time?
Thanks!

You might be able to trigger the EDMA or use the PRUSS. Probably best to ask on beagleboard@googlegroups.com. There isn't a DSP per se on the BeagleBone.

This is not exactly an answer to your question, but hopefully it explains how the process works. Since you didn't mention what hardware you are running for AD conversion, maybe this is the best that can be done:
With audio hardware, which faces the same problem, the solution comes from the hardware and the drivers working together: whenever the hardware has filled up enough of the buffer it signals the driver (via an interrupt or some similar mechanism). In some cases, it's also possible that the driver polls the hardware or something like that, but that's a less efficient solution, and I'm not sure anyone does it that way anymore (maybe on cheaper hardware?). From there, the driver process may call right into the end-user process, or it may simply mark the relevant end-user process as "runnable". Either way, control needs to be transferred to the end user process.
For that to happen, the end-user process must be running at a higher priority than anything else occupying the CPUs at that moment. To guarantee that your process is always first in the queue, you can run it at a high priority; with the appropriate permissions you can even use the very high (real-time) priorities.
The time it takes for the top priority process to go from runnable to running is sometimes called the "latency" of the OS, though I am sure there's a more specific technical term. The latency of Linux is on the order of 1 ms, but since it's not a "hard" real-time OS, this is not a guarantee. If this is too long to handle your chunks of data, you may have to buffer some of it in your driver.
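As a rough illustration of the high-priority approach on Linux, the sketch below (assuming Python 3 on the board) requests a SCHED_FIFO real-time priority and locks the process's pages in RAM so that neither scheduling nor paging adds to the latency; the priority value is an arbitrary choice and the libc call is made through ctypes.

```python
import ctypes
import os

# Hedged sketch: give the sampling process a real-time priority and pin its
# pages in RAM. Requires root (or CAP_SYS_NICE / CAP_IPC_LOCK) on Linux.

# SCHED_FIFO with a high (but not maximal) priority: the process preempts
# ordinary tasks as soon as it becomes runnable.
priority = 80  # arbitrary choice for illustration
os.sched_setscheduler(0, os.SCHED_FIFO, os.sched_param(priority))

# Lock current and future pages so a page fault cannot add to the latency.
MCL_CURRENT, MCL_FUTURE = 1, 2
libc = ctypes.CDLL("libc.so.6", use_errno=True)
if libc.mlockall(MCL_CURRENT | MCL_FUTURE) != 0:
    raise OSError(ctypes.get_errno(), "mlockall failed")

# ...from here on, drain the driver's buffer as fast as possible.
```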

Related

Multiple boards Windows Device Manager prompts resource conflict, error code is 12

I inserted multiple identical boards into the system. The PCIe device is implemented with a Xilinx IP core. After each FPGA is programmed, I manually refresh Device Manager to check whether the device and driver are working properly.
My confusion is that this approach only seems to work with two boards at a time. After the third board is programmed and Device Manager is refreshed, the system reports insufficient resources: "This device cannot find enough free resources that it can use (Code 12)".
I tried disabling the other two boards, but the device still reported a conflict. I don't know how to find out which resources are conflicting.
My boards have two BARs (BAR0: 2 KB, BAR1: 16 MB) and one IRQ.
After a few experiments, it appears to be caused by a memory-resource conflict, and the conflicting party is the AMD integrated graphics card that came with the motherboard.
1st, 2nd, and 3rd all refer to my board numbers.
The 3rd board is not recognized because of the conflict.
After shutting the machine down, I plugged in all three boards and powered it on again. During startup the system abruptly power-cycled and restarted, after which all three boards worked normally. At that point I found that the graphics card's memory address had changed.
How can I resolve the conflict? Should I modify my driver code or the FPGA configuration?
PCIe enumeration should resolve memory allocation issues; however, there are a couple of implementation issues to be aware of. Case in point: I have used Xilinx XDMA cores with a 64-bit BAR of 2 GB in size and literally bricked a DELL XPS motherboard, yet I have done the same with an IBM system and it just worked. The point here is that enumeration can be a firmware-, hardware-, or OS-driven event. If you are working through Device Manager, that sounds OS-driven, but when I toasted the XPS board it was some kind of FW issue related to BAR size that resulted in a permanent failure. 16 MB isn't big and should not be a problem, but I would recommend going with the Xilinx defaults first and showing reliability there. I think it's one BAR at 1 MB. I have run 3 × 64-bit BARs at 1 MB apiece without issue, but keep it simple and show reliability, then move up. This will help isolate whether it is system flakiness or not.
I have seen some systems use FW-based enumeration that comes up really fast, before the FPGA has configured, in which case there is no PCIe target to ID. If you frequently find that your FPGA is not detected on power-up but is detected on a rescan, this can be a symptom. Resolving it is a bit of a pain; we ended up using partial reconfiguration: start with the PCIe interface, then reconfigure to load the remaining image. Let's hope it isn't this problem.
The next thing to be aware of is the reset mechanism within the FPGA. You probably hooked the PCIe IP reset to the bus reset, which is great; however, in the past I have also hooked that reset to internal PLL-locked signals that may not be up. For troubleshooting purposes, keep that reset simple and get rid of everything else, to show first that the PCIe IP by itself is reliable.
You have to be careful here, too. If you strip things down, make sure it is clean. If you ignore the PLL lock and try to use a Xilinx driver such as the XDMA driver, it has a routine where it tries to identify the XDMA with data transactions, looking for the DMA control BAR one BAR at a time. When it does this, the transaction it attempts may go out on the AXI bus if the BAR isn't the XDMA control BAR. If the AXI bus isn't out of reset or clocked when that happens, you will lock up the AXI bus; I have locked up a Linux box this way on several occasions. AXI requires that transactions complete; otherwise it just sits there waiting.
BTW, on a Linux box, you can look at the enumeration output in the kernel log. I'm not sure if Windows shows you the same thing or not. But this can be helpful if you see that the device was initially probed but then something invalid was detected in the config register, versus not being seen at all.
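If you want to script that check on the Linux side, here is a rough sketch; the grep patterns are common BAR-assignment failure messages seen in kernel logs, not an exhaustive list, and dmesg may need elevated privileges on some systems.

```python
import re
import subprocess

# Hedged helper: scan the kernel log for PCI BAR assignment problems.
# The patterns are illustrative examples of typical failure messages.
PATTERNS = [
    r"no compatible bridge window",
    r"can't assign",
    r"failed to assign",
    r"address conflict",
]

def scan_kernel_log():
    # dmesg may require root or CAP_SYSLOG depending on kernel settings.
    log = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
    for line in log.splitlines():
        if "pci" in line.lower() and any(re.search(p, line) for p in PATTERNS):
            print(line)

if __name__ == "__main__":
    scan_kernel_log()
```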
So a couple things to look at.

Does simulating memory-mapped I/O using VMX require instruction decoding?

I am wondering how a hypervisor using Intel's VMX / VT technology would simulate memory-mapped I/O (so that the guest could think it was performing memory-mapped I/O against a device).
I think the basic principle would be to set up the EPT page tables so that the memory addresses in question cannot be read or written, causing an EPT violation (i.e., a VM exit). However, the next question is how to process the VM exit. Such a VM exit fills out the exit-qualification fields, including the guest-linear and guest-physical addresses. But what I am missing in these exit-qualification fields is, in the case of a write instruction, the value that the guest attempted to write and the size of the write. Likewise, for a read instruction it would be nice to have fields indicating the destination of the read, say a register or a memory location (in the case of memory-to-memory string operations). This would make it very easy for the hypervisor to figure out what the guest was trying to do and then simulate the device behavior towards the guest.
But the trouble is, I can't find such fields among the exit qualifications. I can see an instruction pointer to where the faulting instruction is, so I could walk the page tables to read in the instruction and then decode it to understand the instruction, then simulate the I/O behavior. However, this requires the hypervisor to have a fairly complete picture of all x86 instructions, and be able to decode them. That seems to be quite a heavy burden on the hypervisor, and will also require it to stay current with later instruction additions. And the CPU should already have this information.
There's a chance that I am missing these fields because the documentation is quite extensive, but I have searched carefully and have not been able to find them. Maybe someone can point me in the right direction, or confirm that the hypervisor will need to contain an instruction decoder.
I believe most VMs decode the instruction. It's not actually that hard, and most VMs have software emulators to fall back on when the CPU VM extensions aren't available or aren't up to the task. You don't need to handle every instruction, just those that can take memory operands, and you can probably ignore everything that isn't a 1-, 2-, or 4-byte memory operand, since you're not likely to be emulating device registers of other sizes. (For memory-mapped device buffers, like video memory, you don't want to be trapping every memory access because that's too slow, so you'll have to take a different approach.)
However, there is one way you can let the CPU do the work for you, but it's much slower than decoding the instruction yourself and it's not entirely perfect. You can single-step the instruction while temporarily mapping in a valid page of RAM. The VM exit will tell you the guest-physical address accessed and whether it was a read or a write. Unfortunately, it doesn't reliably tell you whether it was a read-modify-write instruction; those may just set the write flag, and with some device registers that can make a difference. It might be easier to copy the instruction (it can be at most 15 bytes, but watch out for page boundaries) and execute it in the host, but that requires that you can map the page to the same virtual address in the host as in the guest.
You could combine these techniques, decode the common instructions that are actually used to access memory mapped device registers, while using single stepping for the instructions you don't recognize.
Note that by choosing to write your own hypervisor you've put a heavy burden on yourself. Having to decode instructions in software is a pretty minor burden compared to the task of emulating an entire IBM PC compatible computer. The Intel virtualisation extensions aren't designed to make this easier, they're just designed to make it more efficient. It would be easier to write a pure software emulator that interpreted the instructions. Handling memory mapped I/O would be just a matter of dispatching the reads and writes to the correct function.
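To illustrate that last point, here is a toy sketch of such a dispatch in a software emulator; the bus layout and the fake UART registers are invented purely for illustration.

```python
# Toy sketch of MMIO dispatch in a software emulator: each emulated device
# claims an address range, and guest loads/stores in that range are routed
# to its read/write handlers. Addresses and the UART model are made up.

class FakeUart:
    BASE, SIZE = 0x1000_0000, 0x100

    def read(self, offset, size):
        if offset == 0x14:          # invented "status" register: always ready
            return 0x1
        return 0

    def write(self, offset, size, value):
        if offset == 0x00:          # invented "tx" register
            print(chr(value & 0xFF), end="")

class Bus:
    def __init__(self):
        self.devices = []           # list of (base, size, device)

    def attach(self, device):
        self.devices.append((device.BASE, device.SIZE, device))

    def _find(self, addr):
        for base, size, dev in self.devices:
            if base <= addr < base + size:
                return dev, addr - base
        raise Exception(f"bus error at {addr:#x}")

    def load(self, addr, size):
        dev, offset = self._find(addr)
        return dev.read(offset, size)

    def store(self, addr, size, value):
        dev, offset = self._find(addr)
        dev.write(offset, size, value)

bus = Bus()
bus.attach(FakeUart())
bus.store(0x1000_0000, 1, ord("A"))   # a guest write lands in FakeUart.write
```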
I don't know in detail how VT-x works, but I think I see a flaw in the way you wish it could work:
Remember that x86 is not a load/store machine. The load part of add [rdi], 2 doesn't have an architecturally-visible destination, so your proposed solution of telling the hypervisor where to find or put the data doesn't really work, unless there's some temporary location that isn't part of the guest's architectural state, used only for communication between the hypervisor and the VMX hardware.
To handle a read-modify-write instruction with a memory destination efficiently, the VM should do the whole thing with one VM exit. So you can't just provide separate load and store interfaces.
More importantly, handling atomic read-modify-writes is a special case. lock add [rdi], 2 can't just be done as a separate load and store.

Opening camera in multiple program in openCV

How can I open a single webcam in multiple OpenCV programs simultaneously? I have attached three webcams and all of them work fine in any single OpenCV program, but why can't two programs use them simultaneously?
Is this a restriction, or is there a workaround?
Yes, this restriction is intentional
Why? The conceptual view is related to the hardware-control layer. The operating system assumes there are peripherals that can be used on demand, but it keeps their context-of-use non-shareable.
As an example, consider a USB mouse. While it could be used within several processes, some reasoning told the architects that one mouse should not feed events into more than one window-framed context (yes, right, a.k.a. a process).
Some other peripherals are even instances of both an EventSENSOR and an EventCOLLECTOR; USB cameras, for example, can receive and process signals from the operating system to re-adjust their physical state (pan-tilt-zoom, for example).
This makes the 1:1 relation assumption all the more obvious, even though it is something we may sometimes wish not to have hardcoded there. On the other hand, what would the poor device do if one process instructed it to go left and another to go right at the same time?
Similarly, who would be happy if one USB mouse sent its motion and other MMI-interaction events to all currently running processes? At the very least, drag-and-drop UI navigation would become a funny lottery.
Yes, there is a workaround
The simplest "just-enough" scenario includes aCameraControllerPROCESS, which keeps the 1:1 relation with the USB device and has the visual data-acquisition responsibility. Here nanoseconds matter, so do not spend any CPU_CLK on anything other than moving bytes into a buffer. All other processing shall be left to aVisualDataViewConsumerPROCESS, where OpenCV may (and will) spend tens to hundreds of microseconds on its own.
Plus it has a communication / service-provisioning part, which allows other distributed processes to access the acquired visual data in parallel (in a non-blocking, concurrent-view manner, which is a must for distributed soft-real-time processing).
If one's architecture requires more features or alternatives, this layered approach allows one to add features while retaining both the control and performance overheads within acceptable envelopes.
A good implementation may remain quite smart, fast, low-latency and slim (resource-wise) if one uses ZeroMQ with its zero-copy inproc: virtually abstract transport class. One pays almost zero latency as the cost of the trick, but gains the immense power of a robust, massively distributable design that one may scale out beyond one, two, a dozen, thousands ... you name it ... hosts, for free.
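As a rough sketch of that split (not the author's code): one process owns the camera and publishes frames over ZeroMQ, and any number of consumers subscribe. The pyzmq endpoint, camera index and TCP transport are assumptions for a two-process demo; a threaded design could use inproc: instead.

```python
import cv2
import numpy as np
import zmq

ENDPOINT = "tcp://127.0.0.1:5555"   # hypothetical endpoint

def camera_controller():
    """Owns the 1:1 relation with the USB camera; only moves bytes out."""
    ctx = zmq.Context()
    pub = ctx.socket(zmq.PUB)
    pub.bind(ENDPOINT)
    cap = cv2.VideoCapture(0)        # assumed camera index
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # Ship shape + dtype so consumers can rebuild the ndarray.
        pub.send_json({"shape": frame.shape, "dtype": str(frame.dtype)},
                      flags=zmq.SNDMORE)
        pub.send(frame.tobytes(), copy=False)

def view_consumer():
    """Any OpenCV processing happens here, off the acquisition path."""
    ctx = zmq.Context()
    sub = ctx.socket(zmq.SUB)
    sub.connect(ENDPOINT)
    sub.setsockopt_string(zmq.SUBSCRIBE, "")
    while True:
        header = sub.recv_json()
        frame = np.frombuffer(sub.recv(), dtype=header["dtype"])
        frame = frame.reshape(header["shape"])
        cv2.imshow("view", frame)
        if cv2.waitKey(1) & 0xFF == ord("q"):
            break
```

Run the controller in one process and as many consumers as needed in others; only the controller ever touches the camera device.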

Are there memory limitations when outputting to a ScrolledText widget?

I am fairly new to Python and to GUI programming, and have been learning the Tkinter package to further my development.
I have written a simple data logger that sends a command to a device via a serial or TCP connection, and then reads the response back, displaying it in a ScrolledText widget. In addition, I have a button that allows me to save the contents of the ScrolledText widget into a text file.
I was testing my software by sending a looped command, with a 0.5 second delay between commands. The aim was to test the durability of the logger so it may later be deployed to automatically monitor and log the output of the devices it is connected to.
After 30-40 minutes, I find that the program crashes on my Windows 7 system, and I suspect that it may be caused by a memory issue. The crash is a rather nondescript "pythonw.exe has stopped working" message. When I monitor the process using Windows Task Manager, the memory used by pythonw.exe increases each time a response is read, and eventually reaches nearly 2 GB.
It may be that I need to rethink my logic and have the software log to the disk in 'real time', while the ScrolledText box overwrites the oldest data after x-number of lines... However, for my own education, I was wondering if there was a better way to manage the memory used by ScrolledText?
Thanks in advance!
In general, no, there are no memory limitations with writing to a scrolled text widget. Internally, the text is stored in an efficient b-tree (efficient, unless all the data is a single line, since the b-tree leaves are lines). There might be a limit of some sort, but it would likely be in the millions of lines or so.
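If the unbounded growth described in the question is the real problem, a common pattern is to stream the full log to disk and keep only a rolling window of lines in the widget. A minimal sketch, with an arbitrarily chosen line cap:

```python
import tkinter as tk
from tkinter import scrolledtext

MAX_LINES = 5000  # arbitrary cap for illustration

def append_line(widget, line, logfile):
    logfile.write(line + "\n")          # full history goes to disk
    widget.insert("end", line + "\n")   # widget keeps only a rolling window
    # Approximate number of lines currently in the widget.
    lines = int(widget.index("end-1c").split(".")[0])
    if lines > MAX_LINES:
        # Delete the oldest lines so widget memory stays bounded.
        widget.delete("1.0", f"{lines - MAX_LINES + 1}.0")
    widget.see("end")

root = tk.Tk()
st = scrolledtext.ScrolledText(root)
st.pack(fill="both", expand=True)

with open("device.log", "a") as log:
    for i in range(10):                 # stand-in for the serial/TCP reader
        append_line(st, f"response {i}", log)

root.mainloop()
```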

Detailed multitasking monitoring

I'm trying to put together a model of a computer and run some simulations on it (part of a school assignment). It's a very simple model: a CPU, a disk, and a process generator that generates user processes which take turns using the CPU and accessing the disk. (I've decided to omit the various system processes because, according to Microsoft's Process Explorer tool running on Windows 7, they use next to no CPU time.) And this is where I've stopped.
I have no idea how to get relevant data on how often various processes read/write to disk, how much data they transfer at once, and how much time they spend using the CPU. Let's say I want to get some statistics for typical operations on a PC - playing music/movies, browsing the internet, playing games, working with Office, video editing and so on. Is there even a way to gather such data?
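(One way to gather such data programmatically, as a sketch: the psutil package exposes per-process I/O counters and CPU times on Windows and Linux; the 10-second sampling window below is arbitrary.)

```python
import time
import psutil

# Hedged sketch: sample per-process disk reads and CPU time over a window,
# similar to what Process Explorer reports.
WINDOW = 10  # seconds, arbitrary

before = {}
for p in psutil.process_iter(["pid", "name"]):
    try:
        before[p.pid] = (p.info["name"], p.io_counters(), p.cpu_times())
    except (psutil.AccessDenied, psutil.NoSuchProcess):
        pass

time.sleep(WINDOW)

for pid, (name, io0, cpu0) in before.items():
    try:
        proc = psutil.Process(pid)
        io1, cpu1 = proc.io_counters(), proc.cpu_times()
    except (psutil.AccessDenied, psutil.NoSuchProcess):
        continue
    reads = io1.read_count - io0.read_count
    read_bytes = io1.read_bytes - io0.read_bytes
    cpu = (cpu1.user + cpu1.system) - (cpu0.user + cpu0.system)
    if reads or cpu:
        print(f"{name:25s} reads={reads:6d} bytes={read_bytes:10d} cpu={cpu:6.2f}s")
```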
I'm simulating preemptive multitasking using RR with a time quantum of 15ms for switching processes, and this is how it looks:
->Process gets to CPU
->Process does its work in 0-15ms, gives up the CPU or is cut off
And now, two options arise:
a) the process just sits and waits until it gets the CPU again, or waits for user input if there is nothing to do
b) the process requested data from the disk and does not rejoin the queue until that data is available
And I would like the decision between a) and b) in the model to be made based on a probability, for example 90% for a) and 10% for b). But I do not know how to make those percentages even slightly realistic for a given type of process. Also, how much data can and does a process typically access at once?
Any hints, sources, utilities available for this?
I think I found an answer myself, albeit an unreliable one.
The Process Explorer utility for Windows measures disk I/O, both by volume and by number of occurrences. So there's a rough way to get the answer:
Say a process performs 3,000 reads in 30 minutes while using 2% of the CPU during that time (assuming a single-core CPU). The process has therefore used 36,000 ms of CPU time, divided into roughly 4,800 bursts (this is the unreliable part: the process in all probability does not use its whole time slot, so I'll just divide by half a time slot, 7.5 ms). 3000/4800 gives roughly a 63% chance of reading data after using the CPU.
I hope I did not misunderstand the "reads" statistic in Process Explorer.
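Written out as a quick check of the arithmetic, using the same rough "half a quantum per burst" assumption as above:

```python
# Back-of-the-envelope estimate from the observed Process Explorer numbers.
reads        = 3000              # disk reads observed in the window
window_s     = 30 * 60           # 30-minute observation window, in seconds
cpu_fraction = 0.02              # 2% of one core
quantum_ms   = 15.0              # RR time quantum
avg_burst_ms = quantum_ms / 2    # assume a burst uses about half a quantum

cpu_time_ms = window_s * 1000 * cpu_fraction   # 36,000 ms
bursts      = cpu_time_ms / avg_burst_ms       # ~4,800 CPU bursts
p_disk      = reads / bursts                   # ~0.63

print(f"CPU time: {cpu_time_ms:.0f} ms, bursts: {bursts:.0f}, "
      f"P(disk read after a burst) ~ {p_disk:.1%}")
```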
