STM32F407ZET6: Is it possible for multiple streams of a DMA to run in parallel?

Hi, I am using an STM32F407ZET6 microcontroller and I want to use multiple streams of DMA1. Is it possible to trigger two different streams of the same DMA to transfer data to two different peripherals simultaneously (i.e. in parallel)?
In the advanced AHB bus matrix I observe that each DMA has only two ports, one for memory and one for peripherals, which suggests to me that at most two streams could run in parallel at any time, and even then only if neither stream is doing a memory<->peripheral transfer. Is this assumption correct? And is it also correct that, for two streams to run in parallel through a single DMA, neither should be doing a memory<->peripheral transfer? What I mean is that, from the look of the AHB matrix, it seems two streams could run in parallel if one does a memory-to-memory transfer and the other a peripheral-to-peripheral transfer, but if either of them does a memory<->peripheral transfer, that single transfer would occupy both the memory and the peripheral interface of the DMA and probably make parallel operation impossible. Can you shed some light on this?
I would like some guidance on this topic, as I could not find satisfactory information on it. And if running streams in parallel depends on bus bandwidth, how is the bandwidth of a single bus divided among multiple channels performing multiple transfers? If there is an example of this, I would be thankful. For reference, I have put the AHB bus matrix below:

You can only select one channel per stream, but you can enable all 8 streams per DMA peripheral at once if you like, subject to the hardware defect listed in the errata sheet*.
Each of the masters takes turns accessing the buses. Once a master takes a bus, it decides how long to use it for; for the DMA master this is configured with the MBURST and PBURST bits of the DMA_SxCR register. If you require very low latency in the system and do not want the processor or another master (Ethernet, etc.) to be stalled waiting for the DMA to get off the bus, then keep the burst configuration short (but even the longest burst you can configure will still only occupy the bus for a microsecond or so).
(*) There is a hardware defect in DMA2 which disallows concurrent use of AHB and APB peripherals; see the errata sheet for details.
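To make that concrete, here is a minimal sketch (not from the original answer) of configuring two DMA1 streams for two different peripherals with the STM32Cube HAL, keeping the bursts short as discussed above. The stream/channel pairings (USART2_TX on Stream6/Channel 4, SPI2_TX on Stream4/Channel 0) follow the DMA1 request map in the reference manual; verify them for your own configuration.

```c
/* Minimal sketch, assuming the STM32Cube HAL: two DMA1 streams serving
 * two different peripherals, both enabled at the same time. */
#include "stm32f4xx_hal.h"

DMA_HandleTypeDef hdma_usart2_tx;
DMA_HandleTypeDef hdma_spi2_tx;

static void dma_streams_init(void)
{
    __HAL_RCC_DMA1_CLK_ENABLE();

    /* Stream 6 / channel 4: memory-to-peripheral transfer for USART2_TX. */
    hdma_usart2_tx.Instance                 = DMA1_Stream6;
    hdma_usart2_tx.Init.Channel             = DMA_CHANNEL_4;
    hdma_usart2_tx.Init.Direction           = DMA_MEMORY_TO_PERIPH;
    hdma_usart2_tx.Init.PeriphInc           = DMA_PINC_DISABLE;
    hdma_usart2_tx.Init.MemInc              = DMA_MINC_ENABLE;
    hdma_usart2_tx.Init.PeriphDataAlignment = DMA_PDATAALIGN_BYTE;
    hdma_usart2_tx.Init.MemDataAlignment    = DMA_MDATAALIGN_BYTE;
    hdma_usart2_tx.Init.Mode                = DMA_NORMAL;
    hdma_usart2_tx.Init.Priority            = DMA_PRIORITY_LOW;
    hdma_usart2_tx.Init.FIFOMode            = DMA_FIFOMODE_ENABLE;
    hdma_usart2_tx.Init.FIFOThreshold       = DMA_FIFO_THRESHOLD_FULL;
    /* Short bursts keep this stream from holding the bus matrix for long. */
    hdma_usart2_tx.Init.MemBurst            = DMA_MBURST_INC4;
    hdma_usart2_tx.Init.PeriphBurst         = DMA_PBURST_SINGLE;
    HAL_DMA_Init(&hdma_usart2_tx);

    /* Stream 4 / channel 0: memory-to-peripheral transfer for SPI2_TX.
     * Both streams can be enabled and running concurrently. */
    hdma_spi2_tx.Instance                 = DMA1_Stream4;
    hdma_spi2_tx.Init.Channel             = DMA_CHANNEL_0;
    hdma_spi2_tx.Init.Direction           = DMA_MEMORY_TO_PERIPH;
    hdma_spi2_tx.Init.PeriphInc           = DMA_PINC_DISABLE;
    hdma_spi2_tx.Init.MemInc              = DMA_MINC_ENABLE;
    hdma_spi2_tx.Init.PeriphDataAlignment = DMA_PDATAALIGN_BYTE;
    hdma_spi2_tx.Init.MemDataAlignment    = DMA_MDATAALIGN_BYTE;
    hdma_spi2_tx.Init.Mode                = DMA_NORMAL;
    hdma_spi2_tx.Init.Priority            = DMA_PRIORITY_LOW;
    hdma_spi2_tx.Init.FIFOMode            = DMA_FIFOMODE_ENABLE;
    hdma_spi2_tx.Init.FIFOThreshold       = DMA_FIFO_THRESHOLD_FULL;
    hdma_spi2_tx.Init.MemBurst            = DMA_MBURST_INC4;
    hdma_spi2_tx.Init.PeriphBurst         = DMA_PBURST_SINGLE;
    HAL_DMA_Init(&hdma_spi2_tx);
}
```

Once both streams are linked to their peripherals and started, the DMA's internal arbiter interleaves their requests at burst granularity, so "in parallel" in practice means interleaved access to the memory and peripheral ports rather than truly simultaneous transfers.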

Related

When to write a Custom Kernel Module

Problem Statement:
I have a very high-bandwidth data link that is UDP based. The source of this data is not configurable and sends a stream of datagrams over UDP. We have code that uses the standard methods for receiving data on the UDP socket, and it works adequately. I wanted to know:
Does an API exist to extract multiple UDP datagrams at a time, to improve efficiency?
If one doesn't exist, does it make sense to create a kernel module to provide that capability?
I am a novice, and I wanted to understand the thought process behind deciding when writing your own kernel module is appropriate. I know that such a surgical procedure isn't meant to be done lightly, but there must be a set of criteria under which that action is prudent. Maybe not in my case, but in general.
HW / Kernel Module Perspective
A typical network adapter these days is capable of distributing received packets across multiple hardware Rx queues, letting the host run multiple software Rx queues bound to different CPU cores that read out packets in parallel. From the perspective of a single HW/SW queue, the host may poll it for new packets (see Linux NAPI), with each poll ideally yielding a batch of packets; alternatively, the host may still use an interrupt-driven approach for Rx signalling, with interrupt coalescing turned on for improved efficiency.
Existing NIC drivers in the Linux kernel strive to use the most performant of these techniques, and the kernel itself should be able to leverage all of them properly.
Userland / Application Perspective
There's the PACKET_MMAP interface provided by the Linux kernel for improved Rx/Tx efficiency on the application side. Long story short, an application can set up a memory buffer shared between kernel space and userspace and read incoming packets from it, ideally in batches (blocks), thus avoiding the costly kernel-to-userspace copies and context switches that are so customary with the regular methods.
For added efficiency, the application may have multiple sockets bound to the NIC in separate threads/processes and demand that packet reception be load balanced across these sockets (see the AF_PACKET fanout mode description).
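A minimal sketch of both ideas follows, assuming a TPACKET_V2 RX ring and hash-based fanout; the block/frame sizes and the fanout group id are arbitrary illustration values, and a real application would also bind() the socket to a specific interface and poll() before walking the ring.

```c
/* Hedged sketch: a PACKET_MMAP (TPACKET_V2) RX ring plus PACKET_FANOUT
 * load balancing. Sizes and the fanout group id are illustrative only. */
#include <stdio.h>
#include <arpa/inet.h>
#include <sys/socket.h>
#include <sys/mman.h>
#include <linux/if_packet.h>
#include <linux/if_ether.h>

#define FANOUT_GROUP_ID 42  /* hypothetical fanout group id */

int main(void)
{
    int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    if (fd < 0) { perror("socket"); return 1; }

    int version = TPACKET_V2;
    setsockopt(fd, SOL_PACKET, PACKET_VERSION, &version, sizeof(version));

    /* Shared kernel/userspace ring: 64 blocks of 4 KiB, 2 KiB frames. */
    struct tpacket_req req = {
        .tp_block_size = 4096,
        .tp_block_nr   = 64,
        .tp_frame_size = 2048,
        .tp_frame_nr   = 64 * (4096 / 2048),
    };
    if (setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &req, sizeof(req)) < 0) {
        perror("PACKET_RX_RING");
        return 1;
    }

    /* Map the ring; received frames appear here without extra copies. */
    size_t ring_len = (size_t)req.tp_block_size * req.tp_block_nr;
    void *ring = mmap(NULL, ring_len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (ring == MAP_FAILED) { perror("mmap"); return 1; }

    /* Sockets joining the same fanout group (e.g. one per worker thread)
     * have received packets load balanced across them. */
    int fanout_arg = FANOUT_GROUP_ID | (PACKET_FANOUT_HASH << 16);
    setsockopt(fd, SOL_PACKET, PACKET_FANOUT, &fanout_arg, sizeof(fanout_arg));

    /* A frame whose tp_status has TP_STATUS_USER set holds a received packet;
     * userspace hands it back by writing TP_STATUS_KERNEL. Shown here for the
     * first frame of the ring only. */
    volatile struct tpacket2_hdr *hdr = ring;
    if (hdr->tp_status & TP_STATUS_USER) {
        /* Packet data starts at (char *)hdr + hdr->tp_mac. */
        hdr->tp_status = TP_STATUS_KERNEL;
    }

    munmap(ring, ring_len);
    return 0;
}
```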
DPDK Perspective
DPDK is a kernel-bypass framework that allows an application to take full control of a network adapter by means of a vendor-specific poll-mode driver (PMD), which effectively runs in userspace as part of the application and by its very nature needs no kernel-to-userspace copies, no context switches and, most likely, no locking. Multi-queue receive operation, load balancing (round robin, RSS, you name it) and more cutting-edge offloads are likely to be available too (this is vendor specific).
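As a rough sketch of what the receive path looks like under DPDK (EAL initialisation, mempool creation and port/queue setup are elided here, and port_id is assumed to refer to an already-configured port):

```c
/* Hedged sketch: the core of a DPDK poll-mode receive loop. Setup (EAL,
 * mempool, rte_eth_dev_configure/rx_queue_setup/dev_start) is elided. */
#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST_SIZE 32

static void rx_loop(uint16_t port_id)
{
    struct rte_mbuf *bufs[BURST_SIZE];

    for (;;) {
        /* Poll the NIC queue directly from userspace: no syscalls, no copies. */
        uint16_t nb_rx = rte_eth_rx_burst(port_id, 0 /* queue id */, bufs, BURST_SIZE);

        for (uint16_t i = 0; i < nb_rx; i++) {
            /* ... process bufs[i] (packet bytes via rte_pktmbuf_mtod) ... */
            rte_pktmbuf_free(bufs[i]);
        }
    }
}
```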
Summary
The short of it: given that multiple network acceleration techniques already exist, you need not write your own kernel module to solve the problem in question. By the looks of it, your application, which, as you say, uses the standard methods, is not aware of the PACKET_MMAP technique, so I'd suggest looking at that one closely. The DPDK approach might require that the application be re-implemented from scratch, so I would first go for PACKET_MMAP as the low-hanging fruit.

Using consumer cellphones to build a mesh network for IOT devices?

I have been looking into LoRaWAN for a low-cost, waterproof asset tracker I am considering building.
AFAIK, the primary benefits of LoRaWAN over, say, LTE-M or cellular are no connectivity costs and potentially lower power consumption.
What I'm wondering is: why can't we use our own cellphones as the "base station" that the IoT device talks to? We can do this with Bluetooth and WiFi, so why not cellular? Is it the LTE protocol that prevents it? Physics? What am I missing?
There are quite a few architectural reasons why peer-to-peer LTE isn't feasible, but the largest is probably that in LTE the uplink and downlink use different modulation schemes.
In the downlink (the connection from the base stations (eNodeBs) to the User Equipment (our mobile phones)), Orthogonal Frequency Division Multiple Access (OFDMA) is used, which means the phone listens on the RF interface for the OFDMA signal.
This works well: OFDMA is a great way of encoding data onto the air interface, but it has a very high peak-to-average power ratio, which means that if UEs used OFDMA in the uplink (from the UE to the eNodeB) they would have awful battery life.
Instead, LTE uses Single Carrier Frequency Division Multiple Access (SC-FDMA) in the uplink, which is much more power efficient and allows you to talk all day, so the eNodeBs listen on their RF interface for SC-FDMA-modulated traffic.
This means our UEs (mobile phones) use one modulation scheme to send and a different one to receive, so they can't talk directly to one another: they can't send OFDMA-modulated data, only receive it, and vice versa.
Some more reading on OFDMA & SC-FDMA.
The LTE relay interface introduced as part of Release 10 allows the deployment of relay nodes (a kind of low-cost eNB) that are fixed and that use in-band LTE to extend the coverage of standard eNodeBs by one hop, improve signal quality and increase network capacity. Relays can be placed so that one long hop is converted into two shorter hops.
However, the approach of using a UE as a relay has many challenges, since it loads the UE with functional changes across layers (MAC, PHY, RRC, NAS): it would have to take on additional functionality from relay nodes/eNBs, ranging from lower-layer signalling, coordination and mobility to forwarding. There would also be additional power consumption and antenna changes to support this, all of which adds to the cost of the UE.

Erlang messages when there are lots of nodes or binary data

Would native Erlang messages provide reasonable performance when there are lots of nodes or binary data?
Case 1: There's a dynamic pool of about 50-200 machines (Erlang nodes). It's constantly changing, with about 5-50 machines added or removed every 10 minutes.
Case 2: Let's say we are using this cluster to build a YouTube clone and plan to stream video data via messages.
By reasonable performance I mean: it's OK to be 2-3 times slower than the best performance achievable with complex Erlang code, but 10 times slower is not OK.
There is no significant difference between sending a message and sending binary data. A message is just transformed into a binary packet using term_to_binary and sent via TCP, and the same applies to binary data. (Well, it is a little bit smarter than that, because the textual form of the same atoms is not sent again and again as a plain term_to_binary would do.) So the difference is negligible.
There are important details:
1) In clusters of over 100 nodes, ping noise in a fully connected cluster will be a significant part of the network traffic. Even bigger deployments require deep changes to the Erlang VM and the OS.
2) If you want to stream video or audio, you need to plan the capacity of a single node: clients per node, TCP/UDP packet rate, network bandwidth.
3) There is a performance limit of roughly 150-200K messages/s between two processes on different nodes.

Mirrored queue performance factors

We operate two dual-node brokers, each broker having quite different queues and workloads. Each box has 24 cores (with hyper-threading) of Xeon E5645 @ 2.4GHz with 48GB RAM, connected by gigabit LAN with ~150μs latency, running RHEL 5.6, RabbitMQ 3.1 and Erlang R16B with HiPE off. We've tried with HiPE on, but it made no noticeable performance difference and was very crashy.
We appear to have hit a ceiling for our message rates of between 1,000/s and 1,400/s, both in and out. This is broker-wide, not per-queue. Adding more consumers doesn't improve overall throughput; it just gives that particular queue a bigger slice of this apparent "pool" of resource.
Every queue is mirrored across the two nodes that make up the broker. Our publishers and consumers connect equally to both nodes in a persistent way. We notice an ADSL-like asymmetry in the rates too: if we manage to publish at a high rate, the deliver rate drops to high double digits. Testing with an unmirrored queue gives much higher throughput, as expected. Queues and exchanges are durable; messages are not persistent.
We'd like to know what we can do to improve the situation. The CPU on the box is fine: beam takes a core and a half for one process, plus about 80% each of two more cores for another couple of processes. The rest of the box is essentially idle. We are using ~20GB of RAM in userland, with the system cache filling the rest. IO rates are fine. The network is fine.
Is there any Erlang/OTP tuning we can do? delegate_count is at the default of 16; could someone explain what this does in a bit more detail, please?
This is difficult to answer without knowing more about how your producers and consumers are configured, which client library you're using and so on. As discussed on IRC (http://dev.rabbitmq.com/irclog/index.php?date=2013-05-22) a minute ago, I'd suggest you attempt to reproduce the topology using the MulticastMain Java load-testing tool that ships with the RabbitMQ Java client. You can configure multiple producers/consumers, message sizes and so on. I can certainly get 5K messages/s out of a two-node cluster with HA on my desktop, so this may be a client (or application code) related issue.

How to set/get the time spent on packet fragmentation in ns-3 models?

If I transfer packets through multiple subnets whose routers have different MTUs, the packets may be fragmented. How can I get or set the time spent on each fragmentation operation in ns-3 models? I need to know this to calculate the speed.
What you are asking is unclear to me, but let me try to answer.
If you want to measure the CPU time it takes for ns-3 to create fragments and reassemble them, you can run a simple two-node experiment and change the MTU of the outbound network interface of the sending node to see how much wall-clock time is spent with fragmentation versus without.
If, on the other hand, you want to measure the effect, in terms of simulation time, of splitting a packet into multiple fragments and performing the MAC-level access function for each fragment, it's just a function of:
the access function used at the MAC level. If you want to model switched Ethernet, it's easy: make it zero.
the propagation delay through your medium. If Ethernet, it's easy again: it's the length of the cable divided by the propagation speed of the electromagnetic waves in your cable, which depends on the quality of the cable.
the size of the fragment and the throughput of your medium.
Basically, if you know how many times the packet will be fragmented (it is possible for several routers to fragment the packet consecutively into smaller fragments) and which MTU will be used each time, you can trivially create an analytical model of that process and predict the actual transport-level transmission delay of your packets through the simulation.
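As a rough illustration (plain C rather than ns-3 code, with purely made-up parameter values, and ignoring the extra IP header carried by each fragment), such a model boils down to summing a serialization delay and a propagation delay per fragment:

```c
/* Hedged back-of-the-envelope model: per-fragment transmission (serialization)
 * delay plus propagation delay over one point-to-point hop. */
#include <stdio.h>

int main(void)
{
    double packet_bytes   = 4000.0;  /* original packet size            */
    double mtu_bytes      = 1500.0;  /* MTU of the outbound interface   */
    double link_bps       = 100e6;   /* link throughput in bits/s       */
    double cable_m        = 100.0;   /* cable length in metres          */
    double prop_speed_mps = 2.0e8;   /* ~2/3 c in copper or fibre       */

    int fragments = (int)((packet_bytes + mtu_bytes - 1.0) / mtu_bytes);
    double prop_delay = cable_m / prop_speed_mps;

    double total = 0.0;
    for (int i = 0; i < fragments; i++) {
        double frag_bytes = (i == fragments - 1)
                          ? packet_bytes - mtu_bytes * (fragments - 1)
                          : mtu_bytes;
        double tx_delay = (frag_bytes * 8.0) / link_bps;  /* size / throughput */
        total += tx_delay + prop_delay;
        printf("fragment %d: %.0f bytes, %.2f us\n",
               i + 1, frag_bytes, (tx_delay + prop_delay) * 1e6);
    }
    printf("total one-hop delay: %.2f us\n", total * 1e6);
    return 0;
}
```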
