How to set/get the time spent on packet fragmentation in ns-3 models? - network-programming

If I transfer a packet through multiple subnets whose routers have different MTUs, it may be fragmented. How can I get or set the time spent on each fragmentation operation in ns-3 models? I need this to calculate the speed.

What you are asking is unclear to me, but let me try to answer.
If you want to measure the CPU time it takes for ns-3 to create fragments and reassemble them, you can run a simple 2-node experiment and change the MTU of the outbound network interface of the sending node to see how much wall-clock time is spent with fragmentation versus without it.
If, on the other hand, you want to measure the effect, in terms of simulation time, of splitting a packet into multiple packets and performing the MAC-level access function for each fragment, it's just a function of:
the access function used at the MAC level. If you want to model switched Ethernet, it's easy: make it zero.
the propagation delay through your medium. If Ethernet, it's easy again: it's the length of the cable divided by the speed of the electromagnetic waves in your cable, which depends on the quality of the cable.
the size of the fragment and the throughput of your medium.
Basically, if you know how many times the packet will be fragmented (it is possible for multiple routers to fragment the packet consecutively into smaller fragments) and which MTU will be used each time, you can trivially build an analytical model of that process and predict the actual transport-level transmission delay of your packets through the simulation.
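For a concrete feel of that analytical model, here is a minimal Python sketch (it is not ns-3 code; the link parameters, header sizes, and the simplification of re-fragmenting the whole payload at each hop rather than fragmenting individual fragments are all my own assumptions):

```python
# Analytic sketch of per-hop fragmentation + transmission delay.
# Illustrative only; parameters are assumptions, not values from any ns-3 model.
import math

IP_HDR = 20          # IPv4 header without options (bytes)
C = 299_792_458      # speed of light in vacuum (m/s)

def hop_delay(payload, mtu, rate_bps, cable_m, velocity_factor=0.66, access_delay=0.0):
    """Delay to push one payload across one hop, fragmenting if needed.
    Simplification: the whole payload is re-fragmented at each hop."""
    max_frag_payload = (mtu - IP_HDR) // 8 * 8   # fragment payloads are 8-byte aligned
    n_frags = math.ceil(payload / max_frag_payload)
    tx = 0.0
    remaining = payload
    for _ in range(n_frags):
        frag_payload = min(remaining, max_frag_payload)
        remaining -= frag_payload
        frame_bytes = frag_payload + IP_HDR      # ignoring L2 framing for simplicity
        tx += access_delay + frame_bytes * 8 / rate_bps
    prop = cable_m / (velocity_factor * C)       # one propagation delay per hop
    return n_frags, tx + prop

# Example: a 4000-byte datagram over two hops with different MTUs.
hops = [dict(mtu=1500, rate_bps=100e6, cable_m=100),
        dict(mtu=576,  rate_bps=10e6,  cable_m=500)]
payload = 4000
total = 0.0
for h in hops:
    frags, d = hop_delay(payload, **h)
    total += d
    print(f"MTU {h['mtu']}: {frags} fragments, {d*1e3:.3f} ms")
print(f"end-to-end (sum of hops): {total*1e3:.3f} ms")
```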

Related

STM32F407ZET6: Is it possible for multiple streams of a DMA to run in parallel?

Hi, I am using the STM32F407ZET6 microcontroller and I want to use multiple streams of DMA1. Is it possible to trigger two different streams of the same DMA to transfer data to two different peripherals simultaneously (i.e. in parallel)?
In the advanced AHB bus matrix I observe that each DMA has only two lines, one for memory and one for peripherals. This suggests to me that at most two streams can run in parallel at any time, and only if neither of them is doing a memory<->peripheral transfer. Is this assumption correct? And is it also correct that, to run two streams in parallel through a single DMA, they should not be doing memory<->peripheral transfers? What I mean is that, from the look of the AHB matrix, it seems that if only mem-to-mem and periph-to-periph transfers are done, two streams can probably run in parallel, but if either one does a memory<->peripheral transfer, using both the DMA memory and peripheral interfaces for a single transfer probably makes that impossible. Can you shed some light on this?
I would like some guidance on this topic, as I could not find satisfactory information on it. And if running streams in parallel depends on bus bandwidth, how is the bandwidth divided among multiple channels when a single bus performs multiple transfers? If there is any such example, I would be thankful. As a reference I have put the AHB matrix below:
You can only select one channel per stream, but you can enable all 8 streams per DMA peripheral at once if you like, subject to the hardware defect listed in the errata sheet*.
Each of the masters takes turns to access the buses. Once a master takes the bus, it decides how long to use it for. For the DMA master, this is configured with the MBURST and PBURST bits of the DMA_SxCR register. If you require very low latency in the system and do not want the processor or another master (Ethernet, etc.) to be stalled waiting for the DMA to get off the bus, then set the burst configuration short (but even the longest burst you can configure will still only be a microsecond or so).
(*) There is a hardware defect in DMA2 that disallows concurrent use of AHB and APB peripherals; see the errata for details.
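To get a feel for the numbers, here is a rough back-of-the-envelope sketch in Python of how long a burst can occupy the bus; the clock and cycle counts are illustrative assumptions, not datasheet figures:

```python
# Back-of-the-envelope: how long another AHB master may wait while a DMA
# burst holds the bus. Cycle counts are rough assumptions for illustration,
# not figures from the STM32F407 reference manual.
AHB_CLK_HZ = 168e6   # typical STM32F407 AHB clock

def worst_case_stall_us(beats, cycles_per_beat=1, setup_cycles=2):
    """Approximate time a burst of `beats` transfers occupies the bus.
    Accesses to slow APB peripherals through the bridge cost more cycles
    per beat, which is why cycles_per_beat is adjustable."""
    cycles = setup_cycles + beats * cycles_per_beat
    return cycles / AHB_CLK_HZ * 1e6

for beats in (1, 4, 8, 16):   # MBURST/PBURST: single, INCR4, INCR8, INCR16
    print(f"{beats:2d}-beat burst: ~{worst_case_stall_us(beats):.2f} us bus occupancy")
```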

Using consumer cellphones to build a mesh network for IoT devices?

I have been looking into LoRaWAN for a low-cost, waterproof asset tracker I am looking at building.
AFAIK, the primary benefits of LoRaWAN over, say, LTE-M or cellular are: no connectivity costs and potentially lower power consumption.
What I'm wondering is: why can't we use our own cellphones as the "base station" that the IoT device talks to? We can do this with Bluetooth and WiFi, why not cell? Is it the LTE protocol that prevents it? Physics? What am I missing?
There are quite a few architectural reasons why peer-to-peer LTE isn't feasible, but the largest is probably the fact that in LTE the uplink and downlink use different modulation techniques.
In the downlink (the connection from the base stations (eNodeBs) to the User Equipment (our mobile phones)), Orthogonal Frequency Division Multiple Access (OFDMA) is used; this means the phone listens on the RF interface for the OFDMA signal.
This works well; OFDMA is a great way of encoding the data onto the air interface, but it has a very high peak-to-average power ratio. This means that if the UEs used OFDMA in the uplink (from the UE to the eNodeB), they'd have awful battery life.
Instead, in the uplink LTE uses Single Carrier Frequency Division Multiple Access (SC-FDMA), which is much more power efficient and allows you to talk all day, so the eNodeBs listen on their RF interface for the SC-FDMA modulated traffic.
This means our UEs (mobile phones) use one modulation scheme to send and a different one to receive, so they can't talk directly to one another: they can't send OFDMA-modulated data, only receive it, and vice versa.
Some more reading on OFDMA & SC-FDMA.
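If you want to see the peak-to-average power ratio difference for yourself, here is a small numerical sketch (generic NumPy with arbitrary parameters; it is not an LTE waveform, just an OFDM-style multicarrier signal compared with a single-carrier one):

```python
# Quick numerical illustration of peak-to-average power ratio (PAPR):
# summing many independently modulated subcarriers (OFDM-style) produces
# much larger peaks than a single-carrier signal of the same average power.
import numpy as np

rng = np.random.default_rng(0)

def papr_db(x):
    p = np.abs(x) ** 2
    return 10 * np.log10(p.max() / p.mean())

n_subcarriers, n_symbols = 256, 1000
# Random QPSK symbols on each subcarrier; one IFFT per OFDM symbol.
qpsk = (rng.choice([-1, 1], (n_symbols, n_subcarriers)) +
        1j * rng.choice([-1, 1], (n_symbols, n_subcarriers))) / np.sqrt(2)
ofdm = np.fft.ifft(qpsk, axis=1).ravel()   # multicarrier time-domain signal
single_carrier = qpsk.ravel()              # the same symbols sent one at a time

print(f"OFDM PAPR:           {papr_db(ofdm):5.1f} dB")            # typically ~10-12 dB
print(f"Single-carrier PAPR: {papr_db(single_carrier):5.1f} dB")  # ~0 dB for QPSK
```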
The LTE relay interface introduced as part of Release 10 allows the deployment of relay nodes (a kind of low-cost eNodeB) that are fixed and that use in-band LTE to extend the coverage of standard eNodeBs by one hop, improve signal quality, and increase network capacity. Relays can be placed so that a long single hop is converted into two shorter hops.
However, the approach of using a UE as a relay has many challenges: the UE would be loaded with functional changes across layers (MAC, PHY, RRC, NAS), since it would have to take on additional functionality from relay nodes/eNodeBs, ranging from lower-layer signalling, coordination and mobility to forwarding. There would also be additional power consumption and possibly antenna changes to support this, all of which adds to the cost of the UE.

TensorFlow scalability

I am using TensorFlow to train a DNN. My network structure is very simple, and each minibatch takes about 50 ms with only one parameter server and one worker. In order to process a huge number of samples, I am using distributed ASGD training. However, I found that increasing the worker count does not increase throughput; for example, 40 machines achieve 1.5 million samples per second, and after doubling the parameter-server machine count and worker machine count, the cluster still processes only 1.5 million samples per second, or even fewer. The reason is that each step takes much longer when the cluster is large. Does TensorFlow have good scalability, and is there any advice for speeding up training?
The general approach to solving these problems is to find where the bottlenecks are. You could be hitting a bottleneck in software or in your hardware.
A general example of doing the math -- suppose you have 250M parameters and each backward pass takes 1 second. This means each worker will be sending 1 GB/sec of data and receiving 1 GB/sec of data. If you have 40 machines, that'll be 80 GB/sec of transfer between workers and parameter servers. Suppose the parameter-server machines only have 1 GB/sec full-duplex NICs. This means that if you have fewer than 40 parameter-server shards, your NIC speed will be the bottleneck.
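If it helps, here is the same back-of-the-envelope math as a small Python snippet, with every number an adjustable assumption taken from the example above:

```python
# Back-of-the-envelope parameter-server bandwidth math.
# All values are the illustrative assumptions from the text, not measurements.
params = 250e6          # model parameters
bytes_per_param = 4     # float32
step_time_s = 1.0       # one backward pass per second per worker
workers = 40
nic_gb_per_s = 1.0      # full-duplex NIC on each parameter-server shard

per_worker_gb = params * bytes_per_param / 1e9 / step_time_s  # sent AND received
total_each_way_gb = per_worker_gb * workers                   # ~40 GB/s per direction
total_transfer_gb = 2 * total_each_way_gb                     # ~80 GB/s counting both ways
shards_needed = total_each_way_gb / nic_gb_per_s              # NIC-limited shard count

print(f"each worker sends and receives ~{per_worker_gb:.1f} GB/s")
print(f"aggregate transfer (both directions): ~{total_transfer_gb:.0f} GB/s")
print(f"parameter-server shards needed to avoid a NIC bottleneck: >= {shards_needed:.0f}")
```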
After ruling that out, you should consider interconnect speed. You may have N network cards in your cluster, but the cluster most likely can't handle all network cards sending data to all other network cards at once. Can your cluster handle 80 GB/sec of data flowing between 80 machines? Google designs its own network hardware to handle its interconnect demands, so this is an important problem constraint.
Once you have checked that your network hardware can handle the load, I would check software. That is, suppose you have a single worker: how does "time to send" scale with the number of parameter-server shards? If the scaling is strongly sublinear, this suggests a bottleneck, perhaps some inefficient scheduling of threads or some such.
As an example of finding and fixing a software bottleneck, see the "grpc RecvTensor is slow" issue. That issue involved the gRPC layer becoming inefficient when trying to send messages larger than 100 MB. The issue was fixed in an upstream gRPC release but has not yet been integrated into a TensorFlow release, so the current work-around is to break messages into pieces of 100 MB or smaller.
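The chunking idea itself is simple; here is a generic sketch of it in NumPy (this is not the actual TensorFlow work-around or its API, just an illustration of splitting a large flat buffer into pieces of at most 100 MB):

```python
# Sketch of "split big messages into <=100 MB pieces"; generic NumPy only,
# not the actual TensorFlow patch or gRPC code path.
import numpy as np

MAX_BYTES = 100 * 1024 * 1024

def split_for_transport(tensor: np.ndarray, max_bytes: int = MAX_BYTES):
    """Yield views of the flattened tensor, each no larger than max_bytes."""
    flat = tensor.ravel()
    items_per_chunk = max(1, max_bytes // flat.itemsize)
    for start in range(0, flat.size, items_per_chunk):
        yield flat[start:start + items_per_chunk]

big = np.zeros(250_000_000, dtype=np.float32)   # ~1 GB of float32 "parameters"
chunks = list(split_for_transport(big))
print(len(chunks), "chunks,", chunks[0].nbytes / 2**20, "MB each (last may be smaller)")
```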
The general approach to finding these is to write lots of benchmarks to validate your assumptions about the speed.
Here are some examples:
benchmark sending messages between workers (local)
sharded parameter-server benchmark (local)

Erlang messages when there are lots of nodes or binary data

Would native Erlang messages provide reasonable performance when there are lots of nodes or binary data?
Case 1: There's a dynamic pool of about 50-200 machines (Erlang nodes). It's constantly changing, with about 5-50 machines added or removed every 10 minutes.
Case 2: Let's say we are using this cluster to build a YouTube clone and plan to stream video data via messages.
By reasonable performance I mean: it's OK to be 2-3 times slower than the best performance achievable with complex Erlang code; 10 times slower is not OK.
There is no significant difference between sending a message and sending binary data. A message is just transformed into a binary packet using term_to_binary and sent via TCP, and the same applies to binary data. (Well, it is a little bit smarter than that, because the textual form of the same atoms is not sent again and again, as a naive term_to_binary would do.) So the difference is negligible.
There are important details:
1) In clusters of over 100 nodes, ping noise in a fully connected cluster will be a significant part of the network traffic. Even bigger deployments require deep changes to the Erlang VM and the OS.
2) If you want to stream video or audio, you need to plan the capacity of a single node: clients per node, TCP/UDP packet rate, network bandwidth.
3) There is a performance limit of roughly 150-200K messages/s between two processes on different nodes.
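As a rough illustration of points 2) and 3), here is some capacity-planning arithmetic in Python; every input (bitrate, chunk size, NIC speed) is an assumption, and treating the per-process message-rate ceiling as a per-node limit is a simplification:

```python
# Rough capacity-planning arithmetic for points 2) and 3) above.
# Every input is an illustrative assumption, not a measured Erlang figure.
video_bitrate_mbps = 3.0      # per viewer (roughly a 720p stream)
chunk_kb = 64                 # payload of one message carrying video data
node_nic_gbps = 10.0          # NIC of a single Erlang node
max_msgs_per_s = 150_000      # inter-node message-rate ceiling from point 3)

msgs_per_client = video_bitrate_mbps * 1e6 / 8 / (chunk_kb * 1024)  # messages/s per viewer
clients_by_rate = max_msgs_per_s / msgs_per_client                  # message-rate limit
clients_by_bw = node_nic_gbps * 1e3 / video_bitrate_mbps            # NIC bandwidth limit

print(f"{msgs_per_client:.1f} msgs/s per client")
print(f"clients per node, message-rate limited: ~{clients_by_rate:,.0f}")
print(f"clients per node, NIC-bandwidth limited: ~{clients_by_bw:,.0f}")
print(f"plan for the smaller of the two: ~{min(clients_by_rate, clients_by_bw):,.0f}")
```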

iOS - syncing audio between devices with millisecond accuracy

I need to sync audio on networked devices with millisecond accuracy. I've hacked together something that works quite well, but isn't perfectly reliable:
1) The server device sends an RPC with a timeSinceClick param.
2) The client device launches the same click, offset according to the time the RPC spent in transit.
3) System.Diagnostics.Stopwatch checks periodically on all connected devices to make sure playback hasn't deviated too much from absolute time, and corrects it if necessary.
Are there any more elegant ways to do this? Also, my way of doing it requires manual syncing if non-iOS devices are added to the mix: latency differences make it very hard to automate...
I'm all eyes!
Cheers,
Gregzo
It is difficult to synchronize multiple devices on the same machine with millisecond accuracy, so if you are able to do this across multiple machines, I would say you are doing well. I'm not familiar enough with iOS to comment on the steps you describe, but I can tell you how I would approach this in a cross-platform way; maybe your approach amounts to the same thing. A rough sketch follows the steps below.
1) One machine (the "master") sends a UDP packet to all other machines.
2) All other machines reply as quickly as possible.
3) The time it takes to receive the reply, divided by two, is (approximately) the time it takes to get a packet from one machine to another. (This would have to be validated: maybe it takes much longer to process and send a packet? Probably not.)
4) After repeating steps 1-3, ignoring any extreme values and averaging the remaining results, you know roughly how long it takes to get a message from one machine to another.
5) Now "sync" UDP packets can be sent from the master machine to the "slave" machines. The sync packets include delay information so that when the slave machines receive them, they know they were sent x milliseconds ago. Several sync packets may need to be sent in case the network delays or drops some of them.
