Query about network packet traversal in Linux - network-programming

I was reading the book Understanding Linux Network Internals and the PDF Network packet capture in Linux kernel space (networkkernel.pdf).
In Understanding Linux Network Internals, under topic 9.2.2, it is stated that:
The code that takes care of an input frame is split into two parts: first the driver copies the frame into an input queue accessible by the kernel, and then the kernel processes it (usually passing it to a handler dedicated to the associated protocol such as IP). The first part is executed in interrupt context and can preempt the execution of the second part.
Now the question is: when is the second part scheduled? Who schedules it? Is the call made from within the interrupt handler itself? And in Network packet capture in Linux kernel space, the packet input flow is described as:
• When working in the interrupt-driven model, the NIC registers an interrupt handler;
• This interrupt handler will be called when a frame is received;
• Typically in the handler, we allocate an `sk_buff` by calling `dev_alloc_skb()`;
• Data is copied from the NIC's buffer to this newly created struct;
• The NIC calls the generic reception routine `netif_rx()`;
• `netif_rx()` puts the frame in a per-CPU queue;
• If the queue is full, drop!
• `net_rx_action()` makes its decision based on skb->protocol;
• This function basically dequeues the frame and delivers a copy to every protocol handler;
• `ptype_all` and `ptype_base` queues
I want to know when `netif_rx()` and `net_rx_action()` are called. Who calls them, i.e., who schedules them?
Please guide.

This scheduling is done through the NAPI structure and the networking softirq. The driver's interrupt handler hands the frame to `netif_rx()` (or schedules its NAPI poll), which queues the frame and raises the NET_RX_SOFTIRQ softirq; `net_rx_action()` is the handler registered for that softirq, so it runs later in softirq context rather than inside the hardware interrupt itself. Packets are captured by the method described above. The softirq-based (NAPI) approach is also what handles the 'livelock' problem under a flood of packets.
Packet egress is significantly more complex than packet ingress, with queue management and QoS (perhaps even packet shaping) being implemented. "Queue Disciplines" are used to implement user-specifiable QoS policies.
NAPI structure scheduling:
How the NAPI structure is defined and scheduled depends on the driver design and the hardware architecture.
Linux uses ksoftirqd as the general solution to schedule softirqs to run before the next interrupt, putting them under scheduler control. This also prevents consecutive softirqs from monopolizing the CPU. A side effect is that the priority of ksoftirqd needs to be considered when running very CPU-intensive applications alongside networking, to get the proper softirq/user balance. Increasing the ksoftirqd priority to 0 (or possibly more) is reported to cure problems with low network performance at high CPU load.
There is also a research paper on NAPI Scheduling.
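To make the last bullet of the flow quoted in the question (the `ptype_all` and `ptype_base` lists) concrete: the protocol handlers that `net_rx_action()` / the NAPI poll path delivers frames to are registered with `dev_add_pack()`. A minimal, illustrative kernel module that hooks `ptype_all` could look like the sketch below; the module name and log message are made up, and a real capture module would do more than print. It only shows where your code would sit in the path the question describes.

```c
#include <linux/module.h>
#include <linux/netdevice.h>
#include <linux/skbuff.h>
#include <linux/if_ether.h>

/* Called from net_rx_action()/the NAPI poll path, in softirq context,
 * once for every received frame (we registered for ETH_P_ALL). */
static int sniff_rcv(struct sk_buff *skb, struct net_device *dev,
		     struct packet_type *pt, struct net_device *orig_dev)
{
	pr_info_ratelimited("frame on %s: proto=0x%04x len=%u\n",
			    dev->name, ntohs(skb->protocol), skb->len);
	kfree_skb(skb);		/* release the reference handed to this handler */
	return 0;
}

static struct packet_type sniff_ptype = {
	.type = htons(ETH_P_ALL),	/* ETH_P_ALL => placed on the ptype_all list */
	.func = sniff_rcv,
};

static int __init sniff_init(void)
{
	dev_add_pack(&sniff_ptype);
	return 0;
}

static void __exit sniff_exit(void)
{
	dev_remove_pack(&sniff_ptype);
}

module_init(sniff_init);
module_exit(sniff_exit);
MODULE_LICENSE("GPL");
```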

Related

Freeswitch: Lua performance bottleneck after importing http.request

We use the FreeSWITCH stack with 4000 channels to send IVR calls. The logic is written in Lua. We need to send a POST request whenever a call gets picked up, and for this purpose we are using the lua-http package. Both functionalities, i.e. outbound calling and sending the POST request, are implemented in the same Lua file. When we import the lua-http package (require "http.request"), we observe high CPU utilization (more than 95%), which affects outbound calls. We have also observed that whenever the number of threads importing the library crosses 1500, the system starts slowing down. One possible solution is to restrict the number of threads, but that would reduce the number of outbound calls. Is there anything else we can do to remove this bottleneck?
System configuration:
Operating system: Debian
4-core CPU, 16 GB RAM

When to write a Custom Kernel Module

Problem Statement:
I have a very high bandwidth data link that is UDP based. The source of this data is not configurable, and it sends a stream of datagrams over UDP. We have code that uses the standard methods for receiving data on the UDP socket, and it works adequately. I wanted to know:
Does there exist a command interface to extract multiple UDP datagrams at a time, to improve efficiency?
If one doesn't exist, does it make sense to create a kernel module to provide that capability?
I am a novice, and I wanted to understand what thought process should lead to the conclusion that writing your own kernel module is appropriate. I know that such a surgical procedure isn't meant to be done lightly, but there must be a set of criteria where that action is prudent. Maybe not in my case, but in general.
HW / Kernel Module Perspective
A typical network adapter these days is capable of distributing received packets across multiple hardware Rx queues, letting the host run multiple software Rx queues bound to different CPU cores that read out packets in parallel. From a single HW/SW queue perspective, the host may poll it for new packets (see Linux NAPI), with each poll ideally yielding a batch of packets; alternatively, the host may still use an interrupt-driven approach for Rx signalling, with interrupt coalescing turned on for improved efficiency.
Existing NIC drivers in Linux kernel strive to stick with the most performant techniques, and the kernel itself should be able to leverage all of that properly.
Userland / Application Perspective
There's the PACKET_MMAP interface provided by the Linux kernel for improved Rx/Tx efficiency on the application side. Long story short, an application can set up a memory buffer shared between kernel- and userspace and read incoming packets from it, ideally in batches, or blocks, thus avoiding the costly kernel-to-userspace copies and context switches that are so customary when using regular methods.
For added efficiency, the application may have multiple sockets bound to the NIC in separate threads / processes and demand that packet reception be load balanced across these sockets (see AF_PACKET fanout mode description).
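In case it helps, here is a minimal sketch of the PACKET_RX_RING flavour of PACKET_MMAP, under the assumptions that TPACKET_V2 is used, the interface is "eth0", and the ring geometry shown is acceptable; a production reader would add error handling, PACKET_FANOUT across several such sockets, and proper memory barriers.

```c
#include <stdio.h>
#include <poll.h>
#include <sys/socket.h>
#include <sys/mman.h>
#include <linux/if_packet.h>
#include <linux/if_ether.h>
#include <net/if.h>
#include <arpa/inet.h>

int main(void)
{
    /* Raw AF_PACKET socket; requires CAP_NET_RAW (typically root). */
    int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    if (fd < 0) { perror("socket"); return 1; }

    int ver = TPACKET_V2;
    setsockopt(fd, SOL_PACKET, PACKET_VERSION, &ver, sizeof(ver));

    /* Ring of 64 blocks x 4 KiB, two 2 KiB frame slots per block. */
    struct tpacket_req req = {
        .tp_block_size = 4096,
        .tp_block_nr   = 64,
        .tp_frame_size = 2048,
        .tp_frame_nr   = 64 * (4096 / 2048),
    };
    if (setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &req, sizeof(req)) < 0) {
        perror("PACKET_RX_RING"); return 1;
    }

    /* Map the ring once; the kernel writes packets straight into it. */
    size_t ring_len = (size_t)req.tp_block_size * req.tp_block_nr;
    unsigned char *ring = mmap(NULL, ring_len, PROT_READ | PROT_WRITE,
                               MAP_SHARED, fd, 0);
    if (ring == MAP_FAILED) { perror("mmap"); return 1; }

    struct sockaddr_ll ll = {
        .sll_family   = AF_PACKET,
        .sll_protocol = htons(ETH_P_ALL),
        .sll_ifindex  = if_nametoindex("eth0"),   /* interface name is an example */
    };
    bind(fd, (struct sockaddr *)&ll, sizeof(ll));

    unsigned int frame = 0;
    for (;;) {
        struct tpacket2_hdr *hdr =
            (struct tpacket2_hdr *)(ring + (size_t)frame * req.tp_frame_size);

        if (!(hdr->tp_status & TP_STATUS_USER)) {
            /* Slot still owned by the kernel: sleep until data arrives. */
            struct pollfd pfd = { .fd = fd, .events = POLLIN };
            poll(&pfd, 1, -1);
            continue;
        }

        unsigned char *pkt = (unsigned char *)hdr + hdr->tp_mac;
        printf("frame %u: %u bytes\n", frame, hdr->tp_len);
        (void)pkt;                          /* parse the packet here, no copy needed */

        hdr->tp_status = TP_STATUS_KERNEL;  /* hand the slot back to the kernel */
        frame = (frame + 1) % req.tp_frame_nr;
    }
}
```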
DPDK Perspective
Kernel bypass framework that allows an application to seize full control of a network adapter by means of a vendor-specific poll-mode driver, or PMD, effectively running in userspace as part of the application and by its very nature not needing any kernel-to-userspace copies, context switches and, most likely, locking. Multi-queue receive operation, load balancing (round robin, RSS, you name it) and more cutting edge offloads are likely to be available, too (it's vendor specific).
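Very roughly, the receive side of such an application is a busy poll loop over the NIC queues. The sketch below is a hypothetical RX-only skeleton: the port number, descriptor counts, and mbuf pool sizing are assumptions, hugepage/EAL runtime arguments are left to the command line, and error handling is minimal.

```c
#include <stdio.h>
#include <string.h>
#include <stdint.h>
#include <rte_eal.h>
#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define RX_DESCS  1024
#define TX_DESCS  1024
#define NUM_MBUFS 8191
#define BURST     32

int main(int argc, char **argv)
{
    uint16_t port = 0;                          /* first DPDK-bound port: an assumption */
    struct rte_eth_conf port_conf;
    memset(&port_conf, 0, sizeof(port_conf));   /* defaults; real apps enable RSS here */

    if (rte_eal_init(argc, argv) < 0) {
        fprintf(stderr, "EAL init failed\n");
        return 1;
    }

    struct rte_mempool *pool = rte_pktmbuf_pool_create("rx_pool", NUM_MBUFS, 250, 0,
                                    RTE_MBUF_DEFAULT_BUF_SIZE, rte_socket_id());
    if (pool == NULL ||
        rte_eth_dev_configure(port, 1, 1, &port_conf) != 0 ||
        rte_eth_rx_queue_setup(port, 0, RX_DESCS, rte_socket_id(), NULL, pool) != 0 ||
        rte_eth_tx_queue_setup(port, 0, TX_DESCS, rte_socket_id(), NULL) != 0 ||
        rte_eth_dev_start(port) != 0) {
        fprintf(stderr, "port setup failed\n");
        return 1;
    }

    for (;;) {
        struct rte_mbuf *bufs[BURST];
        /* Poll the hardware queue directly from userspace: no interrupts,
         * no syscalls, packets delivered in batches of up to BURST. */
        uint16_t nb = rte_eth_rx_burst(port, 0, bufs, BURST);
        for (uint16_t i = 0; i < nb; i++) {
            /* data would be inspected via rte_pktmbuf_mtod(bufs[i], void *) */
            rte_pktmbuf_free(bufs[i]);
        }
    }
    return 0;
}
```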
Summary
The short of it: given that multiple network acceleration techniques already exist, one should never need to write their own kernel module to solve the problem in question. By the looks of it, your application, which, as you say, uses standard methods, is not aware of the PACKET_MMAP technique, so I'd be tempted to suggest looking at that one closely. The DPDK approach might require that the application be effectively re-implemented from scratch, so I would first go for PACKET_MMAP as the low-hanging fruit.

Storm process increasing memory

I am implementing a distributed algorithm for pagerank estimation using Storm. I have been having memory problems, so I decided to create a dummy implementation that does not explicitly save anything in memory, to determine whether the problem lies in my algorithm or my Storm structure.
Indeed, while the only thing the dummy implementation does is message-passing (a lot of it), the memory of each worker process keeps rising until the pipeline is clogged. I do not understand why this might be happening.
My cluster has 18 machines (some with 8g, some 16g and some 32g of memory). I have set the worker heap size to 6g (-Xmx6g).
My topology is very very simple:
One spout
One bolt (with parallelism).
The bolt receives data from the spout (fieldsGrouping) and also from other tasks of itself.
My message-passing pattern is based on random walks with a certain stopping probability. More specifically:
The spout generates a tuple.
One specific task from the bolt receives this tuple.
Based on a certain probability, this task generates another tuple and emits it again to another task of the same bolt.
I am stuck at this problem for quite a while, so it would be very helpful if someone could help.
Best Regards,
Nick
It seems you have a bottleneck in your topology, i.e., a bolt receives more data than it can process. Thus, the bolt's input queue grows over time, consuming more and more memory.
You can either increase the parallelism of the "bottleneck bolt" or enable the fault-tolerance mechanism, which also enables flow control via a limited number of in-flight tuples (https://storm.apache.org/documentation/Guaranteeing-message-processing.html). For this, you also need to set the "max spout pending" parameter.

Interfacing peripheral drivers with RTOS

For one of my projects, the controller selected was the STM32L1 series. ST provides drivers for USB, I2C, SPI, etc. So while making a decision on an RTOS, is there any consideration that needs to be given to the drivers? Or, put another way, after deciding on an RTOS, is there any standard way of interfacing the microcontroller's peripheral drivers with the RTOS?
No, microcontroller peripheral drivers and the RTOS are typically independent, so compatibility doesn't need to be a consideration. The microcontroller peripheral drivers are basic drivers that aren't reliant on any RTOS services; in fact the peripheral library can be used without any RTOS at all. And an RTOS typically does not rely on any microcontroller peripherals beyond a timer. Even the setup of the timer is not built into the RTOS; the timer is typically set up by user code before starting the RTOS.
If I haven't convinced you and you still want some assurance of compatibility then explore CMSIS.
While ST's low-level drivers do not have RTOS dependencies or requirements, you might build a higher-level driver architecture around them using RTOS mechanisms to support mutual exclusion and buffering, and to manage handler priority, for example.
You could for example manage multi-thread access to a device either through a device manager thread, or via mutual exclusion.
There is no standard, defined way to interface peripheral drivers to an RTOS, as it depends on the RTOS. However, a common way is to take advantage of the blocking mutexes or semaphores provided by the RTOS. A blocking mutex means that if the mutex is not available, a task will wait until it is free and not use any CPU time until then.
Usually when running an RTOS, you want the peripheral driver to grab the input data as quickly as possible, using an interrupt, and then pass the data off to an RTOS task that can take its time processing it. This is a nice clean way of managing peripheral interrupts and RTOS multitasking.
The general scenario is then that you have a task that waits on the mutex; most of the time it does not take any CPU time. When the peripheral driver gets invoked by an interrupt, the driver grabs the data off the hardware and frees the mutex so the waiting task wakes up. The actual data can be passed between the peripheral driver and the task using a global variable or some other RTOS-defined mechanism. A similar scheme can be built with a semaphore.
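As a concrete illustration of that interrupt-to-task hand-off, here is a minimal sketch written against FreeRTOS purely as an example API (the answer is not specific to any RTOS); `uart_drain_rx_fifo()` and `process_data()` stand in for the vendor driver call and your processing code and are hypothetical.

```c
#include <stdint.h>
#include <stddef.h>
#include "FreeRTOS.h"
#include "task.h"
#include "semphr.h"

/* Hypothetical hooks: the ST driver call that empties the RX FIFO,
 * and the application-level processing routine. */
extern size_t uart_drain_rx_fifo(volatile uint8_t *buf, size_t max_len);
extern void process_data(const uint8_t *buf, size_t len);

static SemaphoreHandle_t rx_sem;
static volatile uint8_t rx_buf[64];     /* data handed from the ISR to the task */
static volatile size_t  rx_len;

/* Peripheral interrupt handler: grab the data quickly, then signal the task. */
void UART_IRQHandler(void)
{
    BaseType_t woken = pdFALSE;

    rx_len = uart_drain_rx_fifo(rx_buf, sizeof rx_buf);
    xSemaphoreGiveFromISR(rx_sem, &woken);   /* wake the waiting task */
    portYIELD_FROM_ISR(woken);
}

/* Processing task: blocked on the semaphore, it uses no CPU until data arrives. */
static void rx_task(void *arg)
{
    (void)arg;
    for (;;) {
        if (xSemaphoreTake(rx_sem, portMAX_DELAY) == pdTRUE) {
            process_data((const uint8_t *)rx_buf, rx_len);
        }
    }
}

void rx_init(void)
{
    rx_sem = xSemaphoreCreateBinary();
    xTaskCreate(rx_task, "rx", 256, NULL, tskIDLE_PRIORITY + 2, NULL);
}
```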
ST's provided peripheral drivers (whether the Standard Peripheral Library, HAL, or LL) can operate in this model. Therefore, when making a decision on which RTOS to use, you should consider an RTOS whose API supports this model.

When does a UDP sendto() block?

While using the default (blocking) behavior on a UDP socket, in which cases will a call to sendto() block? I'm essentially interested in the Linux behavior.
For TCP I understand that congestion control makes the send() call block if the sending window is full, but what about UDP? Does it ever block, or does it just let packets get discarded at lower layers?
This can happen if you fill up your socket buffer, but it is highly operating-system dependent. Since UDP does not provide any guarantees, your operating system can decide to do whatever it wants when your socket buffer is full: block or drop. You can try increasing SO_SNDBUF for temporary relief.
This can even depend on the fine tuning of your system, for instance it can also depend on the size of the TX ring in the driver of your network interface. There are a few discussions about this in the iperf mailing list, but you really want to discuss this with the developers of your operating system. Pay special attention to O_NONBLOCK and EAGAIN / EWOULDBLOCK.
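The two knobs mentioned above look like this in practice; the destination address, port, payload size, and buffer size in this minimal sketch are placeholders. With O_NONBLOCK set, a full socket buffer makes sendto() return -1 with EAGAIN/EWOULDBLOCK instead of blocking.

```c
#include <stdio.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    /* Ask for a bigger send buffer (the kernel may clamp it; see net.core.wmem_max). */
    int sndbuf = 4 * 1024 * 1024;
    setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &sndbuf, sizeof(sndbuf));

    /* Non-blocking mode: sendto() now fails fast instead of blocking. */
    fcntl(fd, F_SETFL, fcntl(fd, F_GETFL, 0) | O_NONBLOCK);

    struct sockaddr_in dst = { .sin_family = AF_INET, .sin_port = htons(9000) };
    inet_pton(AF_INET, "192.168.1.34", &dst.sin_addr);

    char payload[1400] = {0};
    for (int i = 0; i < 100000; i++) {
        ssize_t n = sendto(fd, payload, sizeof(payload), 0,
                           (struct sockaddr *)&dst, sizeof(dst));
        if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
            /* Socket buffer full: back off, or poll() for POLLOUT. */
            usleep(1000);
            continue;
        }
        if (n < 0) { perror("sendto"); break; }
    }
    close(fd);
    return 0;
}
```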
This may be because your operating system is attempting to perform an ARP request in order to get the hardware address of the remote host.
Basically, whenever a packet goes out, the header requires the IP address of the remote host and the MAC address of the remote host (or of the first gateway used to reach it), e.g. 192.168.1.34 and AB:32:24:64:F3:21.
Your "block" behavior could be ARP at work.
I've heard that in older versions of Windows (2000, I think) the first packet would sometimes get discarded if the ARP request took too long and you were sending out too much data. A service pack has probably fixed that since then.
