In the WebSocket++ 0.3.x library, what determines the limit of how many WebSocket clients can have an active connection? Is it one connection per thread, or can one thread handle multiple WebSocket client connections? If it is the latter, roughly how many connections can one thread hold?
Basically, I'm looking for a ballpark number of how many client connections the WebSocket++ library can handle in an application with roughly 25 threads to spare. The library homepage is:
http://www.zaphoyd.com/websocketpp
If you are using the Boost.Asio based transport policy with a recent version of Boost on a platform that supports non-blocking/asynchronous I/O (epoll on Linux, kqueue on OS X/BSD, IOCP on Windows), then WebSocket++ does not introduce any significant limits on simultaneous connections.
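To make the threading model concrete, here is a minimal sketch (mine, not from the library documentation) of a 0.3.x echo server on the Asio transport. A single call to run() drives one io_service, which multiplexes every connection over epoll/kqueue/IOCP; port 9002 is an arbitrary choice:

    #include <websocketpp/config/asio_no_tls.hpp>
    #include <websocketpp/server.hpp>

    typedef websocketpp::server<websocketpp::config::asio> server_t;

    // Echo each message straight back to the client that sent it.
    void on_message(server_t* s, websocketpp::connection_hdl hdl,
                    server_t::message_ptr msg) {
        s->send(hdl, msg->get_payload(), msg->get_opcode());
    }

    int main() {
        server_t server;
        server.init_asio();  // one io_service shared by all connections
        server.set_message_handler(
            websocketpp::lib::bind(&on_message, &server,
                                   websocketpp::lib::placeholders::_1,
                                   websocketpp::lib::placeholders::_2));
        server.listen(9002);
        server.start_accept();
        server.run();        // this single thread services every client
    }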
In such situations, the limits are pretty much set by OS, hardware, and application factors. The OS will limit the total file descriptors in use per process (with root access this limit can be raised). High concurrency levels will require your application to be structured appropriately to handle them (primarily by using short, bounded-time, non-blocking handlers). Other factors will limit you in the same way that generic servers are limited: Gigabit Ethernet can only handle so much traffic, using TLS or compression will reduce performance, and so on.
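As a hedged illustration of the descriptor limit (the rlimit API is standard POSIX; raising the soft limit to the hard limit at startup is my suggestion, not something the library does for you):

    #include <sys/resource.h>
    #include <cstdio>

    int main() {
        rlimit rl;
        getrlimit(RLIMIT_NOFILE, &rl);
        rl.rlim_cur = rl.rlim_max;  // raise soft limit up to the hard limit
        if (setrlimit(RLIMIT_NOFILE, &rl) == 0)
            std::printf("fd limit now %llu\n",
                        (unsigned long long)rl.rlim_cur);
        return 0;
    }

Going beyond the hard limit requires privileges (or an administrator raising it system-wide).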
I haven't done extensive performance benchmarking with 0.3.x yet, but 0.2.x, in appropriately tuned applications, was able to easily service tens of thousands of concurrent clients on an i7 core.
It is the intention of the WebSocket++ architecture to scale to arbitrary connection counts given sufficient resources. If you are working on an application that scales WebSocket++ beyond 10k connections I'd be interested in more details and in addressing any bottlenecks you discover.
Problem Statement:
I have a very high bandwidth data link that is UDP based. The source of this data is not configurable, and sends a stream of datagrams over UDP. We have code that uses the standard methods for receiving data on the UDP socket, and it works adequately. I wanted to know:
1. Does there exist an interface to extract multiple UDP datagrams at a time, to improve efficiency?
2. If one doesn't exist, does it make sense to create a kernel module to provide that capability?
I am a novice, and I wanted to understand the thought process that has to happen before writing your own kernel module seems appropriate. I know that such a surgical procedure isn't meant to be done lightly, but there must be a set of criteria under which that action is prudent. Maybe not in my case, but in general.
HW / Kernel Module Perspective
A typical network adapter these days is capable of distributing received packets across multiple hardware Rx queues, thus letting the host run multiple software Rx queues bound to different CPU cores that read out packets in parallel. From a single HW/SW queue perspective, the host may poll it for new packets (see Linux NAPI), with each poll ideally yielding a batch of packets; alternatively, the host may still use an interrupt-driven approach for Rx signalling, with interrupt coalescing turned on for improved efficiency.
Existing NIC drivers in Linux kernel strive to stick with the most performant techniques, and the kernel itself should be able to leverage all of that properly.
Userland / Application Perspective
There's the PACKET_MMAP interface provided by the Linux kernel for improved Rx/Tx efficiency on the application side. Long story short, an application can set up a memory buffer shared between the kernel and userspace and read incoming packets out of it, ideally in batches, or blocks, thus avoiding the costly kernel-to-userspace copies and context switches that are so customary with the regular methods.
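For a feel of the mechanics, here is a hedged, error-handling-free sketch of the receive side (TPACKET_V1 layout for brevity; the ring sizing is purely illustrative):

    #include <sys/socket.h>
    #include <sys/mman.h>
    #include <linux/if_packet.h>
    #include <linux/if_ether.h>
    #include <arpa/inet.h>
    #include <poll.h>

    int main() {
        int fd = socket(AF_PACKET, SOCK_DGRAM, htons(ETH_P_IP));

        struct tpacket_req req = {};
        req.tp_block_size = 4096;   // illustrative sizing only
        req.tp_block_nr   = 64;
        req.tp_frame_size = 2048;
        req.tp_frame_nr   = req.tp_block_size / req.tp_frame_size
                          * req.tp_block_nr;
        setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &req, sizeof(req));

        // Map the ring once; the kernel fills frames in place and the
        // application flips each frame's status flag back when done.
        char* ring = static_cast<char*>(mmap(nullptr,
            (size_t)req.tp_block_size * req.tp_block_nr,
            PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0));

        for (unsigned i = 0;; i = (i + 1) % req.tp_frame_nr) {
            tpacket_hdr* hdr = reinterpret_cast<tpacket_hdr*>(
                ring + (size_t)i * req.tp_frame_size);
            while (!(hdr->tp_status & TP_STATUS_USER)) {
                pollfd pfd = {fd, POLLIN, 0};  // nothing ready: block until
                poll(&pfd, 1, -1);             // the kernel posts frames
            }
            // Packet data lives at ring + i*tp_frame_size + hdr->tp_net,
            // hdr->tp_snaplen bytes of it; consume it here, in batches.
            hdr->tp_status = TP_STATUS_KERNEL; // hand the frame back
        }
    }

Real code must check every return value, bind the socket to a specific interface, and drain all ready frames per wakeup; that batch-per-wakeup behaviour is where the efficiency win comes from.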
For added efficiency, the application may have multiple sockets bound to the NIC in separate threads / processes and demand that packet reception be load balanced across these sockets (see AF_PACKET fanout mode description).
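As a hedged illustration of the fanout mode (the constants come from linux/if_packet.h; the group id is an arbitrary application-chosen value):

    #include <sys/socket.h>
    #include <linux/if_packet.h>
    #include <linux/if_ether.h>
    #include <arpa/inet.h>

    // Each worker thread/process creates one such socket; the kernel then
    // load-balances received packets across all members of the group.
    int make_fanout_socket(int group_id) {
        int fd = socket(AF_PACKET, SOCK_DGRAM, htons(ETH_P_IP));
        // Hash fanout keeps all packets of one flow on the same socket.
        int arg = group_id | (PACKET_FANOUT_HASH << 16);
        setsockopt(fd, SOL_PACKET, PACKET_FANOUT, &arg, sizeof(arg));
        return fd;
    }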
DPDK Perspective
DPDK is a kernel-bypass framework that allows an application to seize full control of a network adapter by means of a vendor-specific poll-mode driver, or PMD, which effectively runs in userspace as part of the application and by its very nature needs no kernel-to-userspace copies, no context switches and, most likely, no locking. Multi-queue receive operation, load balancing (round robin, RSS, you name it) and more cutting-edge offloads are likely to be available too (it's vendor-specific).
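The shape of a PMD receive loop, as a hedged fragment (EAL, port and mempool initialization omitted; port_id and BURST_SIZE are assumptions):

    #include <rte_ethdev.h>
    #include <rte_mbuf.h>

    constexpr uint16_t BURST_SIZE = 32;

    void rx_loop(uint16_t port_id) {
        rte_mbuf* bufs[BURST_SIZE];
        for (;;) {
            // Poll the queue: up to BURST_SIZE packets per call, no syscalls.
            uint16_t n = rte_eth_rx_burst(port_id, /*queue=*/0,
                                          bufs, BURST_SIZE);
            for (uint16_t i = 0; i < n; ++i) {
                // ... inspect rte_pktmbuf_mtod(bufs[i], void*) here ...
                rte_pktmbuf_free(bufs[i]);  // return buffer to its mempool
            }
        }
    }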
Summary
The short of it: given that multiple network acceleration techniques already exist, one need never write their own kernel module to solve the problem in question. By the looks of it, your application, which, as you say, uses the standard methods, is not aware of the PACKET_MMAP technique, so I'd be tempted to suggest looking at that one closely. The DPDK approach might require that the application be effectively re-implemented from scratch, so I would first go for the PACKET_MMAP approach as the low-hanging fruit.
There is a legacy implementation (to an extent company-proprietary, in Pascal and C with some Java macros) which processes TCP socket based requests from TCP client applications. It supports multiple client applications (around 5K) connecting over TCP sockets; however, it only supports a single socket connection with the backend (database). There are two instances of the server, so in total it supports 10K client applications over two TCP socket connections with the database. All database-related communication happens synchronously over that single socket connection. There are massive issues in this application, especially high RTT (round-trip time) and occasional outages due to back-pressure. We have an ops team for such issues; they mostly resolve them by restarting the server. Hardly anyone on our team knows the coding details of this application, and there is not much documentation. As this is a critical application, we cannot afford to mess with it, and we don't want to touch the code, at least for now. This becomes even more critical due to a shift in business priorities: there is a need to add another 30K client applications from another business to this setup.
The task before us is to integrate it with another application, which is based on a microservice architecture with middleware using RabbitMQ. That is a customer-facing application sensitive to QoS; we cannot afford outages and downtime in it. As part of this integration, there is a need to process the request messages coming from the above legacy application over the TCP socket before passing them to the database. In other words, we want to introduce a component which would process the legacy application's requests before handing them over to the database; this additional processing is part of our client's request. Some of the processing requirements are very intensive and resource-hungry in terms of CPU cycles, memory and socket I/O, so there is a chance such processing may lead to server downtime and higher RTT. This layer of ours is very flexible: we can easily add more servers or replace faulty ones. But that doesn't buy much in this integration, as we are limited by the legacy application's single socket connection; in total, at most, we can only have 2 (+6 for the new 30K client applications) servers. This is our cause of concern.
I want to know what different options are available to address the high-availability, scalability and latency issues of such an integration. Especially given the limitation of a single TCP socket connection, how can we make this integration efficient, so that it can handle back-pressure, offer better application uptime, and so on?
We were thinking of leveraging RabbitMQ, a Layer 4 load balancer (like HAProxy or NGINX), IPVS, NAT, etc. But they all lead toward making some changes to the legacy code (or are not very efficient techniques), which we don't want.
We operate two dual-node brokers, each broker having quite different queues and workloads. Each box has 24 cores (H/T) worth of Xeon E5645 @ 2.4GHz with 48GB RAM, connected by Gigabit LAN with ~150μs latency, running RHEL 5.6, RabbitMQ 3.1, Erlang R16B with HiPE off. We've tried with HiPE on, but it made no noticeable performance impact and was very crashy.
We appear to have hit a ceiling for our message rates of between 1,000/s and 1,400/s, both in and out. This is broker-wide, not per-queue. Adding more consumers doesn't improve overall throughput; it just gives that particular queue a bigger slice of this apparent "pool" of resources.
Every queue is mirrored across the two nodes that make up the broker. Our publishers and consumers connect equally to both nodes in a persistent way. We notice an ADSL-like asymmetry in the rates too: if we manage to publish at a high rate, the delivery rate drops to the high double digits. Testing with an unmirrored queue gives much higher throughput, as expected. Queues and exchanges are durable; messages are not persistent.
We'd like to know what we can do to improve the situation. The CPU on the box is fine: beam takes a core and a half for one process, then another 80% each of two cores for another couple of processes. The rest of the box is essentially idle. We are using ~20GB of RAM in userland, with the system cache filling the rest. IO rates are fine. The network is fine.
Is there any Erlang/OTP tuning we can do? delegate_count is at the default of 16; could someone explain what this does in a bit more detail, please?
This is difficult to answer without knowing more about how your producers and consumers are configured, which client library you're using, and so on. As discussed on IRC (http://dev.rabbitmq.com/irclog/index.php?date=2013-05-22) a minute ago, I'd suggest you attempt to reproduce the topology using the MulticastMain Java load-test tool that ships with the RabbitMQ Java client. You can configure multiple producers/consumers, message sizes and so on. I can certainly get 5k msg/s out of a two-node cluster with HA on my desktop, so this may be a client (or application code) related issue.
We want to start a big multi-tier application. The server-side application must respond to more than 1000 users at the same time. We want to build the server application with the 64-bit compiler and the client side as 32-bit. In this case, we don't know whether DataSnap can respond to all the clients without any problems.
The server computer is very powerful (multi-processor, with more than 16GB of RAM) and the database management system is Firebird 2.5.
You need a way to perform realistic load tests.
For the Firebird database, you can simulate concurrent users with the free Apache JMeter tool. It can run SQL statements and record their execution-time statistics (average, min/max, etc.). So you could, for example, create a thread group with twenty different SQL queries, and then run twenty threads which will each perform these queries sequentially.
JMeter allows you to define time limits on the SQL query, and it treats it as an error if the query exceeds this limit. You can then try to find the maximum client count at which the overall error rate is still less than (for example) five percent.
But you also need to know how high the expected database load will be, and you will also need a test database of realistic size, not just a couple of records. Also, some database queries, such as reports, may cause higher load; these should be included in the simulation too, as they can affect overall performance. In JMeter, you can create a second thread group, running in parallel with the first one, for these long-running statements with different settings (fewer simulated clients).
Testing the database will show whether there is a bottleneck in this area already. For example, the test result could be that the database can serve twenty clients with a total average transaction rate of 20 TPS (transactions per second), which means one client executes one transaction per second. But this TPS value will decrease as the user count grows.
Related question: Firebird usage in big projects which also has a link to http://www.firebirdsql.org/en/case-studies-catalog/
Regarding DataSnap client load simulation: this can be done with a scripted client, which runs a predefined set of statements / commands over the connection.
To run a high number of load-test clients simultaneously, you could use a service like Amazon Elastic Compute Cloud (EC2) to launch clones of your test machine image, saving you hardware costs. But of course I would start with a small client machine which simply runs ten or twenty scripted clients.
As far as I know, DataSnap is based on Indy, and Indy's connection handling model is not very scalable: one thread per connection, which is very resource-consuming. Even using Indy's thread pools is not an option, I think. Also, in Windows (32-bit), for example, there is a limit on the maximum number of threads you can create (2000, IIRC). Anyway, using many threads is not good and hurts server performance (for reference, see the Windows Internals book, the Windows Performance Team blog, etc.).
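That ~2000 figure is essentially address-space arithmetic (my reasoning, not part of the original answer): a 32-bit Windows process gets 2 GB of user address space by default, each thread reserves 1 MB of stack by default, and 2 GB / 1 MB = 2,048, so the thread stacks alone exhaust the address space at roughly that count.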
A scalable, robust and professional application server would use I/O completion ports (IOCP) for data processing. But I don't know whether DataSnap can take advantage of them.
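To show why IOCP decouples connection count from thread count, here is a hedged Win32 skeleton (link against ws2_32; the dispatch step is left as a comment because it is application-specific):

    #include <winsock2.h>
    #include <windows.h>

    DWORD WINAPI worker(LPVOID port) {
        DWORD bytes;
        ULONG_PTR key;   // per-socket context supplied at association time
        OVERLAPPED* ov;  // identifies which overlapped operation completed
        for (;;) {
            // Blocks until *any* associated socket completes an I/O.
            BOOL ok = GetQueuedCompletionStatus((HANDLE)port, &bytes,
                                                &key, &ov, INFINITE);
            if (!ok) continue;  // real code inspects the failed operation
            // ... dispatch on (key, ov): handle the read/write completion ...
        }
    }

    int main() {
        HANDLE iocp = CreateIoCompletionPort(INVALID_HANDLE_VALUE,
                                             NULL, 0, 0);
        SYSTEM_INFO si;
        GetSystemInfo(&si);
        for (DWORD i = 0; i < si.dwNumberOfProcessors; ++i)
            CreateThread(NULL, 0, worker, iocp, 0, NULL);
        // For each accepted SOCKET s:
        //   CreateIoCompletionPort((HANDLE)s, iocp, (ULONG_PTR)ctx, 0);
        //   then post overlapped WSARecv/WSASend calls on it.
        Sleep(INFINITE);
    }

The worker pool is sized to the core count, not the connection count; that is the whole point.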
UPDATE:
At CodeRage 7 I asked similar scalability questions. Here are the answers:
Q: Recently there was a question on StackOverflow about DataSnap's scalability/performance. Can DS handle, for example, 2000 or more concurrent user requests at the network and application level?
A: The scalability is based on the scalability of TCP/HTTP/HTTPS and the number of connections allowed in your server operating system, and also on the memory and hardware you employ. There is no specific limit in DataSnap.
My comment: While this is true, Indy's connection handling model, i.e. one thread per connection, introduces a bottleneck, especially in 32-bit Windows (2000 threads max). In Win64 it should not be as much of a problem, but again, this kind of data-flow handling leads to performance degradation.
Q: Does DataSnap support some kind of load balancing?
A: Not directly. You can do this in code in your DataSnap server(s).
My comment: I've found a very good paper on implementing failover/load balancing in DataSnap on Andreano Lanusse's blog.
Q: Does DataSnap support I/O completion ports for better scalability?
This question of mine was left unanswered.
Hope this helps!
UPDATE2:
I found a very interesting post on DataSnap performance: DataSnap analysis based on Speed & Stability tests.
UPDATE3:
DataSnap, Deployment, Performance, and More (Marco Cantu)
Monitoring and control of connections in DataSnap XE2 - translated into English
Monitoring and control of connections in DataSnap XE2 - original
When the specifications for a system are drawn up, you need to be very precise when it comes to multiple users.
For example: you create a website, and the client expects 15,000 unique users.
Then the client usually comes up with a requirement that the system must support 15,000 simultaneous users, which is very naive.
You'll need a more detailed specification than that.
Usually it's more sensible to say something like: for 99% of the requests, 99% of the users get a response to their request within 5 seconds on average.
In normal usage, you'll never see all users send a request within the same second. If at some point they all arrive within the same minute (also very unlikely), you'll have a lot fewer concurrent users.
Even for websites with tens of thousands of users, where most of them connect on a daily basis, the web server is idle most of the time, and once in a while it jumps to 5%, or in extreme cases to 20%. If we really had to serve all of these users at once we'd be screwed, but that never happens, and it's not realistic to provision a server for such loads.
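A rough back-of-envelope (my numbers, purely illustrative): if 15,000 users each make 100 requests spread over an 8-hour working day, that is 1,500,000 requests in 28,800 seconds, or about 52 requests per second on average. Even a 10x peak is only ~520 requests per second, and by Little's law a server answering those in half a second on average holds about 260 requests in flight at any moment - two orders of magnitude below "15,000 simultaneous users".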
Any suggestions for components to use as a base for a scalable TCP server? I currently have an implementation that uses Indy which works well for say 100 relatively active connections or 1,000 relatively inactive connections, but the one thread per connection model limits the number of concurrent active connections that can be handled.
Let's say my goal might be 1,000 connections each processing 10 messages per second or 10,000 connections each processing 1 message per second on a good server (8-16 cores). Is this realistic? I'd really like to hear of any real-world implementations because I have found that what might work in theory does not necessarily work in practice and I do not want to be chasing a proposed solution that will not work.
Edit: IOCP would be good, but I only want to use commercial-grade classes/components, so they need to be as "professional" as Indy or IP*Works before I would think of using them. Furthermore, I have no intention of "rolling my own" solution - it would take too much time to make it commercial-grade. Lastly, I am looking for a significant improvement on what I already have. I am sure I can squeeze at least 20-50% more out of what I have (based on Indy), but I am never going to be able to handle 10,000 concurrent clients, or 10,000 messages per second, no matter how hard I try. Whether there is something out there that meets these conditions is another matter.
I have decided to accept the answer referring to the IOCP classes, even though I have not used them, because they look like the best path for investigation at this stage.
There is a project at http://voipobjects.com/ which is based on the former iocpclasses project.
It claims to handle thousands of simultaneous connections:
The IOCP engine is a set of classes, components and routines for the rapid creation of highly scalable, high-performance TCP/UDP applications. An application created using the IOCP classes can handle thousands of simultaneous connections.
The library is written in Delphi; Delphi 7 through 2010 are supported.
The library uses the I/O completion ports technology, the most powerful technology in the Win32 world for creating highly scalable, high-performance TCP/UDP applications. This technology is supported in all desktop Windows OSes except the old Win9x/WinME versions.
The library is licensed under the MPL 1.1. It also includes some files from the JEDI project (a Winsock2 header translation).
https://bitbucket.org/voipobjects/iocpengine
My favorite Delphi network layer is ICS by Francois Piette. It's fantastically easy to understand, very scalable, and ultra-high-performance. Free, and open source. It will probably scale to 1000 clients for most people, without significant effort, and without the complexity that gives me trouble when I use Indy.
I got about a 20% scalability/performance boost from switching all my stuff from Indy to ICS.
You should look at the RealThinClient SDK: http://www.realthinclient.com/about.htm
It's a well-proven solution with good support. There are test results for different server solutions on the home page.
The real deciding factor is what you plan to do in each of those transactions.
I use Indy with Network Load Balanced (NLB) Windows servers. One of these Delphi applications is serviced by three physical servers listening on one public IP address, where we have received millions of requests since yesterday with zero errors. Load overnight is pretty much idle, so the actual rate works out to around 350 requests/second/server during the day, and there's plenty of room for growth.
If there's not a lot of CPU/memory needed per transaction, you might get away with it on one box using Indy. It all depends on the load, as you likely can't write to 1,000 different files every second.
There are other items to worry about too, like the OS supporting this amount of activity; you may need to tweak some registry settings (see this StackOverflow question).
IOCP is the way to go for ultra-capacity servers. I have used Indy, for its ease of implementation and debugging, for a very long time. I have my own IOCP implementation that I wrote years ago, but I never rolled it out to production as we simply haven't needed to.
My simple advice: I'd highly suggest rolling it out with Indy, using NLB as your crutch for load, and after that, if you still desire the utmost speed, write your own IOCP implementation so you can craft it to your specific requirements. Note that this is based on knowing nothing of your actual implementation requirements.
I've tried multiple Delphi solutions for networking, and found that many if not all of them add complexity and code which impact performance, footprint, or both. So I started searching for the lightest wrapper around the Winsock API, and then (re)discovered Delphi's own TTcpClient and TTcpServer components. Using them in blocking mode, and overriding TCustomTcpServer's DoAccept method, I've had the best results so far.
If you expect a really high number of incoming connections serving (small) responses to (small) requests, it's highly advisable to implement I/O completion ports, as they handle incoming requests better.
I have been using ICS for the last 12 years. It is non-blocking. I support up to 2000 concurrent connections, each receiving at least 5000 bytes per second, sent as 1000 bytes every 200 msec. I have never faced any problems, and the CPU usage of the app is very small.
There's good support in the forum, but it's hardly ever needed.