FreeSWITCH: Lua performance bottleneck after importing http.request

We use a FreeSWITCH stack with 4000 channels to send IVR calls. The logic is written in Lua. We need to send a POST request whenever a call gets picked up, and for this purpose we use the lua-http package. Both functions, i.e. outbound calling and sending the POST request, are implemented in the same Lua file. When we import the lua-http package (require "http.request"), we observe high CPU utilization (more than 95%), which affects the outbound calls. We have also observed that whenever the number of threads importing the library crosses 1500, the system starts slowing down. One possible solution is to restrict the number of threads, but that would reduce the number of outbound calls. Is there anything else we can do to remove this bottleneck?
System configuration:
Operating system: Debian
CPU: 4 cores, 16 GB RAM

Related

When to write a Custom Kernel Module

Problem Statement:
I have a very high-bandwidth data link that is UDP based. The source of this data is not configurable and sends a stream of UDP datagrams. We have code that uses the standard methods for receiving data on the UDP socket, and it works adequately. I wanted to know:
Does there exist a command interface to extract multiple UDP datagrams at a time, to improve efficiency?
If one doesn't exist, does it make sense to create a kernel module to provide that capability?
I am a novice, and I wanted to understand the thought process that has to happen when deciding whether writing your own kernel module is appropriate. I know that such a surgical procedure isn't meant to be done lightly, but there must be a set of criteria where that action is prudent. Maybe not in my case, but in general.
HW / Kernel Module Perspective
A typical network adapter these days would be capable of distributing received packets across multiple hardware Rx queues, thus letting the host run multiple software Rx queues bound to different CPU cores reading out packets in parallel. From a single HW/SW queue perspective, the host may poll it for new packets (see Linux NAPI), with each poll ideally yielding a batch of packets; alternatively, the host may still use an interrupt-driven approach for Rx signalling, with interrupt coalescing turned on for improved efficiency.
Existing NIC drivers in Linux kernel strive to stick with the most performant techniques, and the kernel itself should be able to leverage all of that properly.
Userland / Application Perspective
There's the PACKET_MMAP interface provided by the Linux kernel for improved Rx/Tx efficiency on the application side. Long story short, an application can set up a memory buffer shared between kernel- and userspace and read incoming packets from it, ideally in batches, or blocks, thus avoiding the costly kernel-to-userspace copies and context switches so customary with the regular methods.
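To make that concrete, a minimal PACKET_RX_RING receive loop might look like the sketch below (assumptions, not prescriptions: a recent Linux kernel, TPACKET_V2, the interface name eth0, and all error handling omitted). The kernel writes frames straight into the mapped ring; the application reads them in place and hands each slot back when done.

```c
/* Minimal PACKET_MMAP (PACKET_RX_RING) receive sketch.
 * Assumptions: Linux, TPACKET_V2, interface "eth0"; error handling omitted. */
#include <stdio.h>
#include <unistd.h>
#include <poll.h>
#include <sys/socket.h>
#include <sys/mman.h>
#include <arpa/inet.h>
#include <net/if.h>
#include <linux/if_ether.h>
#include <linux/if_packet.h>

int main(void)
{
    int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));

    int version = TPACKET_V2;
    setsockopt(fd, SOL_PACKET, PACKET_VERSION, &version, sizeof(version));

    /* Ring geometry: 64 blocks of 4 KiB, 2 KiB frames -> 128 frame slots. */
    struct tpacket_req req = {
        .tp_block_size = 4096,
        .tp_block_nr   = 64,
        .tp_frame_size = 2048,
        .tp_frame_nr   = (4096 / 2048) * 64,
    };
    setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &req, sizeof(req));

    /* Map the ring into userspace: the kernel writes frames here directly. */
    size_t ring_size = (size_t)req.tp_block_size * req.tp_block_nr;
    unsigned char *ring = mmap(NULL, ring_size, PROT_READ | PROT_WRITE,
                               MAP_SHARED, fd, 0);

    /* Bind to one interface so only its traffic lands in the ring. */
    struct sockaddr_ll ll = {
        .sll_family   = AF_PACKET,
        .sll_protocol = htons(ETH_P_ALL),
        .sll_ifindex  = if_nametoindex("eth0"),
    };
    bind(fd, (struct sockaddr *)&ll, sizeof(ll));

    unsigned int frame = 0;
    for (;;) {
        struct tpacket2_hdr *hdr =
            (struct tpacket2_hdr *)(ring + (size_t)frame * req.tp_frame_size);

        if (!(hdr->tp_status & TP_STATUS_USER)) {
            /* This slot is still owned by the kernel: wait for more frames. */
            struct pollfd pfd = { .fd = fd, .events = POLLIN };
            poll(&pfd, 1, -1);
            continue;
        }

        unsigned char *pkt = (unsigned char *)hdr + hdr->tp_mac;
        printf("got %u bytes\n", hdr->tp_len);
        (void)pkt;  /* parse headers / hand the frame to the application here */

        hdr->tp_status = TP_STATUS_KERNEL;      /* return the slot to the kernel */
        frame = (frame + 1) % req.tp_frame_nr;  /* advance through the ring */
    }
}
```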
For added efficiency, the application may have multiple sockets bound to the NIC in separate threads / processes and demand that packet reception be load balanced across these sockets (see AF_PACKET fanout mode description).
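Joining a fanout group is a single setsockopt() per socket; a sketch follows (the group id 42 and the hash balancing mode are arbitrary choices here). Each worker thread or process opens its own AF_PACKET socket, sets up its own ring if desired, and issues the same call.

```c
/* Join AF_PACKET fanout group 42 with hash-based load balancing (sketch).
 * Every worker opens its own AF_PACKET socket and makes the same call;
 * the kernel then spreads received packets across the group members. */
#include <sys/socket.h>
#include <linux/if_packet.h>

static int join_fanout(int fd)
{
    int fanout_id   = 42;                  /* arbitrary group id (assumption) */
    int fanout_mode = PACKET_FANOUT_HASH;  /* balance by flow hash            */
    int fanout_arg  = fanout_id | (fanout_mode << 16);
    return setsockopt(fd, SOL_PACKET, PACKET_FANOUT,
                      &fanout_arg, sizeof(fanout_arg));
}
```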
DPDK Perspective
DPDK is a kernel-bypass framework that allows an application to seize full control of a network adapter by means of a vendor-specific poll-mode driver, or PMD, which effectively runs in userspace as part of the application and by its very nature needs no kernel-to-userspace copies, no context switches and, most likely, no locking. Multi-queue receive operation, load balancing (round robin, RSS, you name it) and more cutting-edge offloads are likely to be available, too (it's vendor specific).
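The receive side of such an application usually boils down to a tight burst loop like the sketch below. It assumes the EAL, the mbuf pool, and port 0 with a single Rx queue have already been set up (rte_eal_init(), rte_eth_dev_configure(), rte_eth_rx_queue_setup(), rte_eth_dev_start()); a real application would pin this loop to a dedicated core.

```c
/* DPDK-style receive loop sketch: poll queue 0 of one port in bursts.
 * Assumes EAL initialization and port/queue setup have already been done. */
#include <stdint.h>
#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST_SIZE 32

static void rx_loop(uint16_t port_id)
{
    struct rte_mbuf *bufs[BURST_SIZE];

    for (;;) {
        /* Pull up to BURST_SIZE packets with no syscall and no interrupt. */
        uint16_t nb_rx = rte_eth_rx_burst(port_id, 0, bufs, BURST_SIZE);

        for (uint16_t i = 0; i < nb_rx; i++) {
            /* The packet data lives in the mbuf; process it here ... */
            rte_pktmbuf_free(bufs[i]);  /* return the buffer to its pool */
        }
    }
}
```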
Summary
The short of it: given that multiple network acceleration techniques already exist, one need never write their own kernel module to solve the problem in question. By the looks of it, your application, which, as you say, uses the standard methods, is not aware of the PACKET_MMAP technique. So I'd be tempted to suggest looking at that one closely. The DPDK approach might require that the application be re-implemented from scratch, so I would first go for the PACKET_MMAP approach as the low-hanging fruit.

Multiple unary rpc calls vs long-running bidirectional streaming in grpc?

I have a use case where many clients need to keep sending a lot of metrics to the server (almost perpetually). The server needs to store these events, and process them later. I don't expect any kind of response from the server for these events.
I'm thinking of using gRPC for this. Initially, I thought client-side streaming would do (like how Envoy does it), but the issue is that client-side streaming cannot ensure reliable delivery at the application level (i.e. if the stream closes midway, how many of the messages that were sent were actually processed by the server?), and I can't afford this.
My thought process is, I should either go with bidi streaming, with acks in the server stream, or multiple unary rpc calls (perhaps with some batching of the events in a repeated field for performance).
Which of these would be better?
the issue is that client-side streaming cannot ensure reliable delivery at the application level (i.e. if the stream closes midway, how many of the messages that were sent were actually processed by the server?), and I can't afford this
This implies you need a response. Even if the response is just an acknowledgement, it is still a response from gRPC's perspective.
The general approach should be "use unary," unless large enough problems can be solved by streaming to overcome their complexity costs. I discussed this at 2018 CloudNativeCon NA (there's a link to slides and YouTube for the video).
For example, if you have multiple backends then each unary RPC may be sent to a different backend. That may cause a high overhead for those various backends to synchronize themselves. A streaming RPC chooses a backend at the beginning and continues using the same backend. So streaming might reduce the frequency of backend synchronization and allow higher performance in the service implementation. But streaming adds complexity when errors occur, and in this case it will cause the RPCs to become long-lived which are more complicated to load balance. So you need to weigh whether the added complexity from streaming/long-lived RPCs provides a large enough benefit to your application.
We don't generally recommend using streaming RPCs for higher gRPC performance. It is true that sending a message on a stream is faster than a new unary RPC, but the improvement is fixed and has higher complexity. Instead, we recommend using streaming RPCs when it would provide higher application (your code) performance or lower application complexity.
Streams ensure that messages are delivered in the order they were sent; this means that if there are concurrent messages, there will be some kind of bottleneck.
Google's gRPC team advises against using streams over unary calls for performance, but nevertheless, there have been arguments that, theoretically, streams should have lower overhead. That does not seem to be true in practice.
For a lower number of concurrent requests, both seem to have comparable latencies. However, for higher loads, unary calls are much more performant.
There is no apparent reason we should prefer streams over unary calls, given that using streams comes with additional problems like:
Poor latency when we have concurrent requests
Complex implementation at the application level
Lack of load balancing: the client will connect with one server and ignore any new servers
Poor resilience to network interruptions (even a brief interruption in the TCP connection fails the entire stream)
Some benchmarks here: https://nshnt.medium.com/using-grpc-streams-for-unary-calls-cd64a1638c8a

Using kqueue for simple async io

How does one actually use kqueue() for doing simple async r/w's?
Its inception seems to be as a replacement for epoll() and select(), and thus the problem it is trying to solve is scaling to listening on a large number of file descriptors for changes.
However, if I want to do something like: read data from descriptor X, let me know when the data is ready - how does the API support that? Unless there is a complementary API for kicking off non-blocking r/w requests, I don't see a way other than managing a thread pool myself, which defeats the purpose.
Is this simply the wrong tool for the job? Should I stick with aio?
Aside: I'm not savvy with how modern BSD-based OS internals work, but is kqueue() built on aio, or vice versa? I would imagine it would depend on whether the OS I/O subsystem is fundamentally interrupt-driven or polling.
None of the APIs you mention, aside from aio itself, has anything to do with asynchronous IO, as such.
None of select(), poll(), epoll(), or kqueue() are helpful for reading from file systems (or "vnodes"). File descriptors for file system items are always "ready", even if the file system is network-mounted and there is network latency such that a read would actually block for a significant time. Your only choice there to avoid blocking is aio or, on a platform with GCD, dispatch IO.
The use of kqueue() and the like is for other kinds of file descriptors such as sockets, pipes, etc. where the kernel maintains buffers and there's some "event" (like the arrival of a packet or a write to a pipe) that changes when data is available. Of course, kqueue() can also monitor a variety of other input sources, like Mach ports, processes, etc.
(You can use kqueue() for reads of vnodes, but then it only tells you when the file position is not at the end of the file. So, you might use it to be informed when a file has been extended or truncated. It doesn't mean that a read would not block.)
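To make the socket case concrete, a minimal kqueue() read-readiness loop might look like the sketch below (assuming sockfd is an already-connected socket; error handling omitted): register interest with EV_SET(), then block in kevent() until the kernel has data buffered, at which point read() will not block.

```c
/* Minimal kqueue read-readiness sketch for a socket.
 * Assumes sockfd is an already-connected socket; error handling omitted. */
#include <sys/types.h>
#include <sys/event.h>
#include <sys/time.h>
#include <unistd.h>
#include <stdio.h>

static void watch_socket(int sockfd)
{
    int kq = kqueue();

    /* Registration: ask for EVFILT_READ events on sockfd. */
    struct kevent change;
    EV_SET(&change, sockfd, EVFILT_READ, EV_ADD | EV_ENABLE, 0, 0, NULL);
    kevent(kq, &change, 1, NULL, 0, NULL);

    for (;;) {
        /* Block until the kernel has buffered data for sockfd. */
        struct kevent event;
        if (kevent(kq, NULL, 0, &event, 1, NULL) <= 0)
            break;

        if (event.filter == EVFILT_READ) {
            char buf[4096];
            /* event.data holds the number of bytes ready to read. */
            ssize_t got = read(sockfd, buf, sizeof(buf));
            if (got <= 0)
                break;              /* EOF or error: stop watching */
            printf("read %zd bytes\n", got);
        }
    }
    close(kq);
}
```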
I don't think either kqueue() or aio is built on the other. Why would you think they were?
I used kqueues to adapt a Linux proxy server (based on epoll) to BSD. I set up separate GCD async queues, each using a kqueue to listen on a set of sockets. GCD manages the threads for you.

A question about network packet traversal in Linux

I was reading the book Understanding Linux Network Internals and the PDF Network packet capture in Linux kernelspace (networkkernel.pdf).
In Understanding Linux Network Internals, under section 9.2.2, it is given that:
The code that takes care of an input frame is split into two parts: first the driver copies the frame into an input queue accessible by the kernel, and then the kernel processes it (usually passing it to a handler dedicated to the associated protocol such as IP). The first part is executed in interrupt context and can preempt the execution of the second part.
Now the query is: when is the second part scheduled? Who schedules it? Is the call made from within the interrupt handler itself? And in Network packet capture in Linux kernelspace the packet input flow is described as:
• When working in the interrupt-driven model, the NIC registers an interrupt handler;
• This interrupt handler will be called when a frame is received;
• Typically in the handler, we allocate an sk_buff by calling dev_alloc_skb();
• Data is copied from the NIC's buffer into the struct just created;
• The NIC driver calls the generic reception routine `netif_rx()`;
• `netif_rx()` puts the frame on a per-CPU queue;
• If the queue is full, the frame is dropped!
• `net_rx_action()` makes its decision based on skb->protocol;
• This function basically dequeues the frame and delivers a copy to every protocol handler;
• the ptype_all and ptype_base queues
I want to know when `netif_rx()` and `net_rx_action()` are called. Who calls them, i.e. who schedules them?
Please guide.
This scheduling is done by the NAPI structure. Packets are captured by the method described above. The softirq comes into the picture when there is a 'livelock' problem or a flood of packets; those cases are handled there.
Packet egress is significantly more complex than packet ingress, with queue management and QoS (perhaps even packet shaping) being implemented. “Queue Disciplines” are used to implement user-specifiable QoS policies.
NAPI structure scheduling:
How the NAPI structure is scheduled depends on the driver design and the hardware architecture.
Linux uses ksoftirqd as the general solution to schedule softirqs to run before the next interrupt, by putting them under scheduler control. This also prevents consecutive softirqs from monopolizing the CPU. A side effect is that the priority of ksoftirqd needs to be considered when running very CPU-intensive applications alongside networking, to get the proper softirq/user balance. Increasing the ksoftirqd priority to 0 (and eventually higher) is reported to cure problems with low network performance at high CPU load.
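To tie this back to the question: in the legacy (non-NAPI) path that the quoted text describes, netif_rx() is called directly from the driver's interrupt handler; it queues the sk_buff on the per-CPU backlog and raises NET_RX_SOFTIRQ. net_rx_action() is the registered handler for that softirq, and the kernel runs it later, either on return from the interrupt or in the ksoftirqd thread. A rough sketch of that handler (the mynic_* names and the fixed frame length are placeholders, not a real driver):

```c
/* Sketch of the legacy (non-NAPI) receive path described above.
 * mynic_* and the hard-coded length are placeholders, not a real driver. */
#include <linux/netdevice.h>
#include <linux/skbuff.h>
#include <linux/interrupt.h>
#include <linux/etherdevice.h>

static irqreturn_t mynic_interrupt(int irq, void *dev_id)
{
    struct net_device *dev = dev_id;
    unsigned int len = 1500;                    /* placeholder frame length */
    struct sk_buff *skb = dev_alloc_skb(len + 2);

    if (!skb)
        return IRQ_HANDLED;                     /* out of memory: drop the frame */

    skb_reserve(skb, 2);                        /* align the IP header */
    /* ... copy 'len' bytes from the NIC's receive buffer into
     *     skb_put(skb, len) here ... */
    skb->dev = dev;
    skb->protocol = eth_type_trans(skb, dev);

    /* Queues the skb on the per-CPU backlog and raises NET_RX_SOFTIRQ.
     * net_rx_action() is the NET_RX_SOFTIRQ handler; the kernel runs it
     * later, on return from the interrupt or in ksoftirqd. */
    netif_rx(skb);
    return IRQ_HANDLED;
}
```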
There is also a research paper on NAPI Scheduling.

What is the most common approach for designing large scale server programs?

Ok I know this is pretty broad, but let me narrow it down a bit. I've done a little bit of client-server programming but nothing that would need to handle more than just a couple clients at a time. So I was wondering design-wise what the most mainstream approach to these servers is. And if people could reference either tutorials, books, or ebooks.
Haha, OK, that didn't really narrow it down. I guess what I'm looking for is a simple but literal example of how the server-side program is set up.
The way I see it: the client sends a command; the server receives the command and puts it into a queue; the server has either a single dedicated thread or a thread pool that constantly polls this queue and then sends the appropriate response back to the client. Is non-blocking I/O often used?
I suppose just tutorials, time and practice are really what I need.
*EDIT: Thanks for your responses! Here is a little more of what I'm trying to do I suppose.
This is mainly for the purpose of learning so I'd rather steer away from use of frameworks or libraries as much as I can. Take for example this somewhat made up idea:
There is a client program that does some function and constantly streams its output to a server (there can be many of these clients); the server then computes statistics and stores most of the data. And let's say there is an admin client that can log into the server; if any clients are streaming data to the server, the server in turn streams that data to each of the connected admin clients.
This is how I envision the server program logic:
The server would have 3 threads for managing incoming connections (one for each port it listens on), each spawning a thread to manage each connection:
1) ClientConnection, which would basically just receive output, which we'll just say is text
2) AdminConnection, which would be for sending commands between the server and the admin client
3) AdminDataConnection, which would basically be for streaming client output to the admin client
When data comes in from a client to the server, the server parses what is relevant and puts that data in a queue, let's say adminDataQueue. In turn there is a thread that watches this queue and every 200 ms (or whatever) checks whether there is data; if there is, it cycles through the AdminDataConnections and sends it to each.
Now for the AdminConnection, this would be for any commands or direct requests of data. So you could request statistics; the server side would receive the command for statistics, then send a command saying "incoming statistics", then immediately after that send a statistics object or data.
As for the AdminDataConnection, it is just the output from the clients with maybe a few simple commands intertwined.
Aside from the bandwidth concerns and the logical problem of all the client data being funneled together to each of the admin clients, what sort of problems would arise from this design due to scaling issues (again neglecting bandwidth between clients and server, and between admin clients and server)?
There are a couple of basic approaches to doing this.
Worker threads or processes. Apache does this in most of its multiprocessing modes. In some versions of this, a thread or process is spawned for each request when the request arrives; in other versions, there's a pool of waiting threads which are assigned work as it arrives (avoiding the fork/thread-creation overhead when the request arrives).
Asynchronous (non-blocking) I/O and an event loop. This is basically using the UNIX select call (although FreeBSD and Linux provide more optimized alternatives, kqueue and epoll respectively). lighttpd uses this approach and is able to achieve very high scalability, but any in-server computation blocks all other requests. Concurrent dynamic request handling is passed on to separate processes (via CGI) or waiting processes (via FastCGI or its equivalent).
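As a rough illustration of the second approach, the sketch below multiplexes the listening socket and every client socket in a single thread with select() (error handling trimmed; port 9000 is an arbitrary choice). No connection blocks the others as long as the per-request work stays short, which is exactly why long in-server computations hurt this model.

```c
/* Minimal select()-based event loop: one thread handles all connections.
 * Sketch only: error handling trimmed, port 9000 is an arbitrary choice. */
#include <unistd.h>
#include <sys/socket.h>
#include <sys/select.h>
#include <netinet/in.h>
#include <arpa/inet.h>

int main(void)
{
    int listener = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = { .sin_family = AF_INET,
                                .sin_port = htons(9000),
                                .sin_addr.s_addr = htonl(INADDR_ANY) };
    bind(listener, (struct sockaddr *)&addr, sizeof(addr));
    listen(listener, 16);

    fd_set all_fds;
    FD_ZERO(&all_fds);
    FD_SET(listener, &all_fds);
    int max_fd = listener;

    for (;;) {
        fd_set read_fds = all_fds;              /* select() mutates its set */
        select(max_fd + 1, &read_fds, NULL, NULL, NULL);

        for (int fd = 0; fd <= max_fd; fd++) {
            if (!FD_ISSET(fd, &read_fds))
                continue;

            if (fd == listener) {               /* new client connection */
                int client = accept(listener, NULL, NULL);
                FD_SET(client, &all_fds);
                if (client > max_fd)
                    max_fd = client;
            } else {                            /* data from an existing client */
                char buf[4096];
                ssize_t n = recv(fd, buf, sizeof(buf), 0);
                if (n <= 0) {                   /* client closed or error */
                    close(fd);
                    FD_CLR(fd, &all_fds);
                } else {
                    /* Parse and queue work here; keep it short, since long
                     * work in this loop stalls every other client. */
                    send(fd, buf, (size_t)n, 0);   /* echo back as a stand-in */
                }
            }
        }
    }
}
```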
I don't have any particular references handy to point you to, but looking at the web sites of open source projects that use the different approaches for information on their design wouldn't be a bad start.
In my experience, building a worker thread/process setup is easier when working from the ground up. If you have a good asynchronous framework that integrates fully with your other communications tasks (such as database queries), however, it can be very powerful and frees you from some (but not all) thread locking concerns. If you're working in Python, Twisted is one such framework. I've also been using Lwt for OCaml lately with good success.
