Scalable Delphi TCP server implementation

Any suggestions for components to use as a base for a scalable TCP server? I currently have an implementation based on Indy which works well for, say, 100 relatively active connections or 1,000 relatively inactive connections, but the one-thread-per-connection model limits the number of concurrent active connections that can be handled.
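For reference, the current implementation follows the classic Indy pattern; a minimal sketch (Indy 10; component and form names are illustrative):

uses
  IdContext, IdTCPServer;

procedure TMyForm.IdTCPServer1Execute(AContext: TIdContext);
var
  Msg: string;
begin
  // OnExecute runs in a loop on a thread dedicated to this connection,
  // so 10,000 clients would mean 10,000 threads.
  Msg := AContext.Connection.IOHandler.ReadLn;
  AContext.Connection.IOHandler.WriteLn('ACK: ' + Msg);
end;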
Let's say my goal is 1,000 connections each processing 10 messages per second, or 10,000 connections each processing 1 message per second, on a good server (8-16 cores). Is this realistic? I'd really like to hear of real-world implementations, because I have found that what works in theory does not necessarily work in practice, and I do not want to chase a proposed solution that will not work.
Edit: IOCP would be good, but I only want to use commercial-grade classes/components, so they need to be as "professional" as Indy or IP*Works before I would think of using them. Furthermore, I have no intention of "rolling my own" solution - it would take too much time to make it commercial-grade. Lastly, I am looking for a significant improvement on what I already have. I am sure I can squeeze at least 20-50% more out of what I have (based on Indy), but I am never going to be able to handle 10,000 concurrent clients, or 10,000 messages per second, no matter how hard I try. Whether there is something out there that meets these conditions is another matter.
I have decided to accept the answer referring to the IOCP classes, even though I have not used them, because they look like the best path for investigation at this stage.

There is a project at http://voipobjects.com/ which is based on the former iocpclasses project.
It claims to handle thousands of simultaneous connections:
The IOCP engine is a set of classes, components and routines for the
rapid creation of highly scalable, high-performance TCP/UDP
applications. Applications created using the IOCP classes can handle
thousands of simultaneous connections.
The library is written in Delphi; Delphi 7 through 2010 are supported.
The library uses I/O completion ports, the most powerful technology in
the Win32 world for creating highly scalable, high-performance TCP/UDP
applications. This technology is supported in all desktop Windows OSes
except the old Win9x/WinME versions.
The library is licensed under MPL 1.1. It also includes some files
from the JEDI project (Winsock2 header translation).
https://bitbucket.org/voipobjects/iocpengine
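To show the shape of the approach: the heart of any IOCP server is a small pool of worker threads all blocking on one completion port. A rough sketch follows (this is not the library's actual API, and parameter types such as ULONG_PTR vs DWORD vary across Delphi versions):

uses
  Windows;

var
  IOCP: THandle;

function IocpWorker(Param: Pointer): Integer;
var
  Bytes: DWORD;
  Key: ULONG_PTR;   // DWORD on older Delphi versions
  Ov: POverlapped;
begin
  // A handful of workers (roughly 2 per CPU core) service completed
  // overlapped I/O for ALL sockets - no thread per connection.
  while GetQueuedCompletionStatus(IOCP, Bytes, Key, Ov, INFINITE) do
  begin
    // Key identifies the connection, Ov the WSARecv/WSASend that just
    // completed; process the buffer here, then post the next read.
  end;
  Result := 0;
end;

var
  Tid: LongWord;
begin
  // One port is shared by all sockets; each accepted socket is later
  // associated with it via CreateIoCompletionPort(Socket, IOCP, Key, 0).
  IOCP := CreateIoCompletionPort(INVALID_HANDLE_VALUE, 0, 0, 0);
  BeginThread(nil, 0, IocpWorker, nil, 0, Tid);
end.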

My favorite Delphi network layer is ICS by Francois Piette. It's fantastically easy to understand, very scalable, and ultra-high performance. Free, and open source. Will probably scale to 1000 clients for most people, without significant effort, and without the complexity that gives me trouble when I use Indy.
I got about a 20% scalability/performance boost from switching all my stuff from Indy to ICS.
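To give an idea of the programming model, here is a minimal sketch against ICS V7; unit, component and handler names are from memory, so treat them as illustrative:

uses
  OverbyteIcsWSocket, OverbyteIcsWSocketS;

procedure TMyForm.FormCreate(Sender: TObject);
begin
  // One event-driven (non-blocking) socket multiplexes all clients
  // on the main thread - no thread per connection.
  WSocketServer1.ClientClass := TWSocketClient;
  WSocketServer1.Proto := 'tcp';
  WSocketServer1.Addr := '0.0.0.0';
  WSocketServer1.Port := '5000';
  WSocketServer1.OnClientConnect := ClientConnect;
  WSocketServer1.Listen;
end;

procedure TMyForm.ClientConnect(Sender: TObject; Client: TWSocketClient;
  Error: Word);
begin
  Client.OnDataAvailable := ClientDataAvailable;
end;

procedure TMyForm.ClientDataAvailable(Sender: TObject; ErrCode: Word);
begin
  with Sender as TWSocketClient do
    SendStr('ACK: ' + ReceiveStr);   // echo-style handling, illustrative
end;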

You should look at RealThinClient SDK http://www.realthinclient.com/about.htm
Well proven solution. Good support. Test results for different server solutions on the home page.

The real deciding factor is what you plan on doing in each of those transactions.
I use Indy with Network Load Balanced Windows servers. One of these Delphi applications is serviced by 3 physical servers listening on one public IP address, and we have received millions of requests since yesterday with zero errors. Load overnight is pretty light, so the actual rate during the day is around 350 requests/second/server, and there's plenty of room for growth.
If there's not a lot of CPU/Memory needed per transaction you might get away with it on one box using Indy. It all depends on the load...as you likely can't write to 1000 different files every second.
There are other items to worry about too, like the OS supporting this amount of activity. You may need to tweak some registry settings (see this Stack Overflow question).
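For reference, the two values most often tweaked for busy TCP servers live under the Tcpip\Parameters key (defaults from the Windows Server 2003/2008 era):

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters
  MaxUserPort        (DWORD, default 5000)  - raises the ephemeral port ceiling, up to 65534
  TcpTimedWaitDelay  (DWORD, default 240)   - seconds a closed connection lingers in TIME_WAIT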
IOCP is the way to go for ultra-capacity servers. I have used Indy for ease of use in implementation/debugging for a very long time. I have my own IOCP implementation that I wrote years ago but never rolled it out on production as we simply haven't needed to.
My simple advice: I'd highly suggest rolling it out with Indy, using NLB as your crutch for load, and after that, if you still want the utmost speed, write your own IOCP implementation so you can craft it to your specific requirements. Note that this is based on knowing nothing of the actual implementation requirements.

I've tried multiple Delphi solutions for networking, and found that many if not all of them add complexity and code which impact performance, footprint, or both. So I started searching for the lightest wrapper around the Winsock API. Then I (re)discovered Delphi's own TTcpClient and TTcpServer components. Using them in blocking mode, and subclassing TCustomTcpServer to override its DoAccept method, I've had the best results so far.
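A minimal sketch of that approach (written from memory, so exact signatures in the Sockets unit may differ slightly between Delphi versions):

uses
  SysUtils, Sockets;

type
  TEchoServer = class(TCustomTcpServer)
  protected
    // With BlockMode = bmThreadBlocking, DoAccept is called on a worker
    // thread once per accepted client.
    procedure DoAccept(ClientSocket: TCustomIpClient); override;
  end;

procedure TEchoServer.DoAccept(ClientSocket: TCustomIpClient);
begin
  // Illustrative echo handling; real code would loop and frame messages.
  ClientSocket.Sendln('ACK: ' + ClientSocket.Receiveln);
end;

var
  Srv: TEchoServer;
begin
  Srv := TEchoServer.Create(nil);
  Srv.BlockMode := bmThreadBlocking;
  Srv.LocalPort := '5000';
  Srv.Open;
end.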
If you expect a really high number of incoming connections serving (small) responses to (small) requests, it's highly advisable to implement I/O completion ports, as they handle incoming requests better.

I have been using ICS for the last 12 years. It is non-blocking. I support up to 2000 concurrent connections, each receiving at least 5000 bytes per second, sent as 1000 bytes every 200 ms. I have never faced any problems, and the CPU usage of the app is very small.
Good support in the forum, though it has hardly been needed.

Related

Distributing an Erlang Chat system

I just finished the Erlang in Practice screencasts (code here), and have some questions about distribution.
The overall architecture and the supervision tree are shown in the screencast's diagrams (not reproduced here).
Reading Distributed Applications leads me to believe that one of the primary motivations is failover/takeover.
However, is it possible, for example, for the Message Router supervisor and its workers to be on one node, and the rest of the system on another, without many changes to the code?
Or should there be 3 different OTP applications?
Also, how can this system be made to scale horizontally? For example, if I realize that my system can handle 100 users, and I've identified the Message Router as the main bottleneck, how can I 'just add another node' so that it can now handle 200 users?
I've developed Erlang apps only during my studies, but generally we had many small processes doing only one thing each and sending messages to other processes. And the beauty of Erlang is that it doesn't matter whether you send a message within the same Erlang VM, within the same computer, on the same LAN, or over the Internet: the call and the reference to the other process always look the same to the developer.
So you really want to have one application for every small part of the system.
That being said, it doesn't make it any simpler to construct an application that can scale out. A rule of thumb says that if you want an application to work on 10 times as many nodes, you need to rewrite it, since otherwise the messaging overhead would be too large. And obviously, when you go from 1 node to 2, you also need to consider it.
So if you have found the bottleneck, the application that is particularly slow when handling too many clients, you want to run it a second time, and you need to have some additional load balancing implemented before you start the second instance.
Let's assume the supervisor checks the message content for inappropriate content and is therefore slow. In this case, the node everyone talks to would be a simple router application which forwards the messages to the different instances of the supervisor application in round-robin fashion. If 1 or 2 instances are not enough, you could write the router so that the number of instances can be changed by sending it control messages.
However, for this to work automatically, you would need another process that monitors the servers and detects when they are overloaded or underutilized.
I know that dynamically adding and removing resources always sounds great when you hear about it, but as you can see it is a lot of work: you need a messaging design that allows it, as well as a monitoring system that can detect the need.
Hope this gives you some idea of how it could be done. Unfortunately, it's been over a year since I wrote my last Erlang application, and I didn't want to provide code that might be wrong.

Is DataSnap Optimized for responding to more than 1k users at the same time?

We want to start a big multi-tier application. The server-side application must respond to more than 1000 users at the same time. We want to build the server application with the 64-bit compiler and the client side with the 32-bit compiler. We don't know whether DataSnap can serve all these clients without problems.
The server machine is very powerful (multi-processor, more than 16 GB of RAM) and the database management system is Firebird 2.5.
You need a way to perform realistic load tests.
For the Firebird database, you can simulate concurrent users with the free Apache JMeter tool. It can run SQL statements and record their execution time statistics (average, min/max, etc.). So you could, for example, create a thread group with twenty different SQL queries, and then run twenty threads which each perform these queries sequentially.
JMeter lets you define time limits on the SQL queries and treats it as an error if a query exceeds its limit. Then you can try to find the maximum client count at which the overall error rate is still less than (for example) five percent.
But you also need to know how high the expected database load will be, and you will need a test database of realistic size, not just a couple of records. Also, some database queries, like reports, might cause higher load; these should be included in the simulation too, as they can affect overall performance. In JMeter, you can create a second thread group, running in parallel with the first one, for these long-running statements, with different settings (fewer simulated clients).
Testing the database will show if there is a bottleneck already in this area. For example, the test result could be that the database can serve twenty clients with a total average transaction rate of 20 TPS (transactions per second), which means one client executes one transaction per second. But this TPS value will decrease with higher user count.
Related question: Firebird usage in big projects which also has a link to http://www.firebirdsql.org/en/case-studies-catalog/
Regarding DataSnap client load simulation: this can be done with a scripted client, which runs a predefined set of statements / commands over the connection.
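A skeleton of such a scripted client (plain Delphi threads; the actual DataSnap calls are left as comments, since they depend on your server's methods):

uses
  Classes, SysUtils;

type
  TScriptedClient = class(TThread)
  protected
    procedure Execute; override;
  end;

procedure TScriptedClient.Execute;
var
  i: Integer;
  Start: TDateTime;
begin
  for i := 1 to 100 do
  begin
    Start := Now;
    // Connect, invoke a predefined sequence of DataSnap server methods,
    // then disconnect; record (Now - Start) per iteration for statistics.
  end;
end;

var
  c: Integer;
begin
  for c := 1 to 20 do            // twenty simulated clients in parallel
    TScriptedClient.Create(False);
  ReadLn;                        // keep the process alive while threads run
end.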
To run a high number of load-test clients simultaneously, you could use a service like Amazon Elastic Compute Cloud (EC2) to launch clones of your test machine image, saving hardware costs. But of course I would start with a small client machine which simply runs ten or twenty scripted clients.
As far as I know, DataSnap is based on Indy, and Indy's connection handling model - one thread per connection - is not very scalable, as it is very resource-consuming. Even using Indy's thread pools is not an option, I think. Also, in 32-bit Windows, for example, there is a limit on the maximum number of threads you can create (2000, IIRC). Anyway, using many threads is not good and hurts server performance (for reference, see the Windows Internals book, the Windows Performance Team blog, etc.).
A scalable, robust and professional application server would use IO Completion ports (IOCP) for data processing. But I don't know if DataSnap can take advantage of it.
UPDATE:
At CodeRage 7, I asked similar scalability questions. Here are the answers:
Q: Recently there was a question on StackOverflow about DataSnap's scalability/performance. Can DS handle, for example, 2000 or more concurrent user requests at the network and application level?
A: The scalability is based on the scalability of TCP/HTTP/HTTPS and the number of connections allowed by your server operating system, and also on the memory and hardware you employ. There is no specific limit in DataSnap.
My comment: While this is true, Indy's connection handling model, i.e. one thread per connection, introduces a bottleneck, especially on 32-bit Windows (2000 threads max). On Win64 it should be less of a problem, but again, this kind of data-flow handling leads to performance degradation.
Q: Does DataSnap support some kind of load balancing?
A: Not directly. You can do this in code in your DataSnap server(s).
My comment: I've found a very good paper on implementing failover/load balancing in DataSnap on Andreano Lanusse's blog.
Q: Does DataSnap support I/O completion ports for better scalability?
This question of mine was left unanswered.
Hope this helps!
UPDATE2:
I found a very interesting post on DataSnap performance: DataSnap analysis based on Speed & Stability tests
UPDATE3:
DataSnap, Deployment, Performance, and More (Marco Cantu)
Monitoring and control of connections in DataSnap XE2 - translated into English
Monitoring and control of connections in DataSnap XE2 - original
When the specifications for a system are drawn up, you need to be very precise about multiple users.
For example: you create a website, and the client expects 15,000 unique users.
Then the client usually comes up with a requirement that the system must support 15,000 simultaneous users, which is very naive.
You'll need a more detailed specification than that.
Usually it's more sensible to say something like: in 99% of the requests, 99% of the users get a response to their request within 5 seconds on average.
In normal usage, you'll never see all users send a request within the same second. If at some point they all arrive within the same minute (also very unlikely), you'll have far fewer concurrent requests per second.
Even for websites with tens of thousands of users, where most of them connect on a daily basis, the web server is idle most of the time, and once in a while it jumps to 5% or, in extreme cases, to 20%. If we really had to serve all of these users at once we'd be in trouble, but that never happens, and it's not realistic to provision a server for such loads.
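To make that concrete with assumed numbers: if those 15,000 users each issue 20 requests over a 10-hour working day, that is 15,000 × 20 / 36,000 s ≈ 8 requests per second on average; even a peak of ten times the average is still under 100 requests per second, nowhere near 15,000 simultaneous requests.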

What is the performance overhead of Apache ActiveMQ vs. raw sockets?

We're looking to implement ActiveMQ to handle messaging between two of our servers, over a geographically diverse environment (Australia to the UK and back, via the internet).
I've been looking around the net for rough performance indicators, but so far have had no luck.
My question: compared to a DIY TCP/SSL implementation of basic messaging, how would ActiveMQ perform? Similar systems of our own can send and receive messages across Australia in 100-150ms, over a SSL layer with an already established connection.
Also, does ActiveMQ persist its TLS/SSL connections, thus saving a substantial amount of time that would already be used in connection creation/teardown?
What I am hoping is that it will at least perform better than HTTPS, at a per-request level.
I am aware that performance can vary remarkably, depending on hardware, networks, code and so on. I'm just after something to start with.
I know the above is a little fuzzy - if you need any clarification please let me know and I will only be too happy to oblige.
Thank you.
What Tim means is that this is not an apples to apples comparison. If you are solely concerned with the performance of a single point to point connection to transfer data, a direct link will give you a good result (although DIY is still a dubious design decision). If you are building a system that requires the transfer of data and you have more complex functional requirements, then a broker-based messaging platform like ActiveMQ will come into play.
You should consider broker-based messaging if you want:
a post-office style system where a producer sends a message, and knows that it will be consumed at some point, even if there is no consumer there at that time
to not care where the consumer of a message is, or how many of them there are
a guarantee that a message will be consumed, even if the consumer that first handles it dies midway through processing (transactions, redelivery)
many consumers, with a guarantee that a message will only be consumed once - queues
many consumers that will each react to a single message - topics
These patterns are pretty standard, and apply to all off-the-shelf messaging products. As a general rule, DIY in this domain is a bad idea, as messaging is complex (see http://www.ohloh.net/p/activemq/estimated_cost for an estimate of how long it would take you to do the same), and there are many existing implementations of various flavours (some without a broker) that are all widely used, commercially supported, and don't require you to maintain them. I would think very hard before going down to the TCP level for any sort of data transfer, as there is so much prior art.

Sharing data system wide

Good evening.
I'm looking for a method to share data from my application system-wide, so that other applications could read that data and then do whatever they want with it (e.g. format it for display, use it for logging, etc). The data needs to be updated dynamically in the method itself.
WMI came to mind first, but then you've got the issue of applications pausing while reading from WMI. Additionally, I've no real idea how to set up my own namespace or classes, if that's even possible in Delphi.
Using files is another idea, but that could get disk-heavy, and it's an awful method for real-time data.
Using a driver would probably be the best option, but that's a little too intrusive on the user's end for my liking, and I've no idea where to even start with it.
WM_COPYDATA would be great, but I'm not sure if it's dynamic enough, and whether it'll be heavy on resources or not.
Using TCP/IP would be the best choice over a network, but it's obviously of little use when run on a single system with no networking requirement.
As you can see, I'm struggling to figure out where to go with this. I don't want to commit to one method only to find that it isn't going to work out in the end. Essentially, I want something like a service or background process that records data and then allows other applications to read it. I'm just unsure of the methods. I'd prefer NOT to need elevation/UAC for this, but if need be, I'll settle for it.
I'm using Delphi 2010 for this exercise.
Any ideas?
What you want to create is a client-server architecture, also known as IPC (inter-process communication).
Using WM_COPYDATA is a very good idea. I have found it to be very fast, lightweight, and efficient on a local machine. And it can be broadcast system-wide, to all applications at once (to be used with care if some application does not handle it correctly).
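A minimal sketch of both sides (standard Win32 API; Delphi 2010 types, form and routine names are illustrative):

uses
  Windows, Messages;

procedure SendText(Target: HWND; const S: AnsiString);
var
  cds: TCopyDataStruct;
begin
  cds.dwData := 0;                  // application-defined message tag
  cds.cbData := Length(S);
  cds.lpData := PAnsiChar(S);
  // SendMessage blocks until the receiver returns, so the buffer
  // remains valid while the other process copies it.
  SendMessage(Target, WM_COPYDATA, 0, LPARAM(@cds));
end;

// Receiver side - declared in the form class as:
//   procedure WMCopyData(var Msg: TWMCopyData); message WM_COPYDATA;
procedure TMyForm.WMCopyData(var Msg: TWMCopyData);
var
  S: AnsiString;
begin
  SetString(S, PAnsiChar(Msg.CopyDataStruct.lpData),
    Msg.CopyDataStruct.cbData);
  // ...use S...
  Msg.Result := 1;                  // acknowledge receipt
end;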
You can also share some memory, using memory-mapped files. This may be the fastest IPC option for huge amounts of data, but synchronization is a bit complex (if you want to share more than one buffer at once).
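A sketch of the memory-mapped file approach (the mapping name and 64 KB size are arbitrary choices):

uses
  Windows;

var
  hMap: THandle;
  Shared: PAnsiChar;
begin
  // A named mapping backed by the page file; every process that opens
  // the same name sees the same 64 KB region.
  hMap := CreateFileMapping(INVALID_HANDLE_VALUE, nil, PAGE_READWRITE,
    0, 65536, 'Local\MySharedBlock');
  Shared := MapViewOfFile(hMap, FILE_MAP_ALL_ACCESS, 0, 0, 0);
  // Guard reads/writes with a named mutex or event - this is the
  // synchronization mentioned above.
end.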
Named pipes are a good candidate for local IPC. They tend to be difficult to implement/configure over a network, due to security issues on modern Windows versions (and they use TCP/IP for network communication anyway, so you would be better off using TCP/IP directly).
My personal advice is to implement your data sharing behind abstract classes that can provide several implementations. You may use WM_COPYDATA first, then switch to named pipes, TCP/IP or HTTP in order to spread your application over a network.
For our open-source client-server ORM, we implemented several protocols, including WM_COPYDATA, named pipes, HTTP, and direct in-process access. You can take a look at the source code provided for implementation patterns. Here are some benchmarks, to give you data from real implementations:
Client server access:
- Http client keep alive: 3001 assertions passed
first in 7.87ms, done in 153.37ms i.e. 6520/s, average 153us
- Http client multi connect: 3001 assertions passed
first in 151us, done in 305.98ms i.e. 3268/s, average 305us
- Named pipe access: 3003 assertions passed
first in 78.67ms, done in 187.15ms i.e. 5343/s, average 187us
- Local window messages: 3002 assertions passed
first in 148us, done in 112.90ms i.e. 8857/s, average 112us
- Direct in process access: 3001 assertions passed
first in 44us, done in 41.69ms i.e. 23981/s, average 41us
Total failed: 0 / 15014 - Client server access PASSED
As you can see, the fastest is direct access, then WM_COPYDATA, then named pipes, then HTTP (i.e. TCP/IP). The message was around 5 KB of JSON data containing 113 rows, retrieved from the server, then parsed on the client, 100 times (yes, our framework is fast :) ). For huge blocks of data (like 4 MB), WM_COPYDATA is slower than named pipes or HTTP over TCP/IP.
There are several IPC (inter-process communication) methods in Windows. Your question is rather general; I can suggest memory-mapped files to store your shared data and message broadcasting via PostMessage to inform other applications that the shared data has changed.
If you don't mind running another process, you could use one of the NoSQL databases.
I'm pretty sure that a lot of them won't have Delphi drivers, but some of them have REST drivers and hence can be driven from pretty much anything.
Memcached is an easy way to share data between applications. Memcached is an in-memory key-value store for small chunks of arbitrary data (strings, objects).
A Delphi 2010 client for Memcached can be found on Google Code:
http://code.google.com/p/delphimemcache/
related question:
Are there any Caching Frameworks for Delphi?
Googling for 'delphi interprocess communication' will give you lots of pointers.
I suggest you take a look at http://madshi.net/, especially MadCodeHook (http://help.madshi.net/madCodeHook.htm)
I have good experience with the product.

How to specifically identify server issues from a load test result (using LoadRunner)?

How do you isolate a performance issue to a specific component of the application infrastructure? Specifically, are there distinct markers in the result logs that distinguish between bottlenecks at web, application and/or database server levels?
I was asked this question in an interview and went blank on it. Seems this information is not available anywhere.
In addition to SiteScope and other agentless monitoring of system components, you need to make sure your scenario and scripts work as expected. This includes proper error checking and use of transactions (and a host of other things). If the transactions are granular enough, they will give you insight into at least which requests have performance issues. Once you have these indicators, work with the infrastructure team to review logs and other information. Since this is an iterative process, subsequent tests can focus on smaller and smaller sections of the infrastructure.
In addition, LoadRunner scripts don't have to come in strictly 'through the front door'. If you have a multi-tiered system, scripts can be made to hit the web/app/database servers directly.
As for what to look for, focus on any measurements that show 'knees' or 'hockey stick' behaviour. You can hook into any of the server resource measurements directly in the Controller and integrate other teams' stats in the analysis phase. Compare with benchmarks at lower virtual-user levels to determine what is acceptable and unacceptable.
Good luck!
If the interview focused on LoadRunner and SiteScope was mentioned, I'd conclude it was more focused on HP/Mercury solutions. In that case, I'd suggest you look into HP Diagnostics and its LoadRunner integration capabilities.
This type of information is usually not available by just looking at the standard results from a performance test.
Parts of the information you are looking for MAY be found by using SiteScope to monitor all the relevant servers in the test. SiteScope offers many counters to look at, such as CPU, memory, disk I/O and network I/O, as seen on each server.
This information gives clues as to where the bottleneck is, and the more counters you add to SiteScope, the bigger the chance of pinpointing it.
It is a very common misconception that app-server and DB-server bottlenecks can be identified just by looking at the raw response times, hits, pages, etc. (web protocol), unless of course the URI accessed identifies the exact component(s) in the system.
