What's the idiomatic way to do async socket programming in Delphi? - delphi

What is the normal way people writing network code in Delphi use Windows-style overlapped asynchronous socket I/O?
Here's my prior research into this question:
The Indy components seem entirely synchronous. The ScktComp unit, on the other hand, does use WSAAsyncSelect, but it basically only asynchronizes a BSD-style multiplexed socket application: you get dumped into a single event callback, as if you had just returned from select() in a loop, and have to do all the state-machine navigation yourself.
The .NET situation is considerably nicer, with Socket.BeginRead / Socket.EndRead, where the continuation is passed directly to Socket.BeginRead, and that's where you pick back up. A continuation coded as a closure obviously has all the context you need, and more.

I have found that Indy, while a simpler concept in the beginning, is awkward to manage due to the need to kill sockets to free threads at application termination. In addition, I had the Indy library stop working after an OS patch upgrade. ScktComp works well for my application.

#Roddy - Synchronous sockets are not what I'm after. Burning a whole thread for the sake of a possibly long-lived connection means you limit the amount of concurrent connections to the number of threads that your process can contain. Since threads use a lot of resources - reserved stack address space, committed stack memory, and kernel transitions for context switches - they do not scale when you need to support hundreds of connections, much less thousands or more.

"What is the normal way people writing network code in Delphi use Windows-style overlapped asynchronous socket I/O?"
Well, Indy has been the 'standard' library for socket I/O for a long while now - and it's based on blocking sockets. This means if you want asynchronous behaviour, you use additional thread(s) to connect/read/write data. To my mind this is actually a major advantage, as there's no need to manage any kind of state machine navigation, or worry about callback procs or similar stuff. I find the logic of my 'reading' thread is less cluttered and much more portable than non-blocking sockets would allow.
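As a rough illustration of that pattern, here is a minimal sketch of a blocking 'reading' thread (assuming Indy 10's IOHandler API; the names TReaderThread and the line-based protocol are mine, and error/disconnect handling is omitted):

uses
  Classes, IdTCPClient;

type
  TReaderThread = class(TThread)
  private
    FClient: TIdTCPClient;
  protected
    procedure Execute; override;
  public
    constructor Create(AClient: TIdTCPClient);
  end;

constructor TReaderThread.Create(AClient: TIdTCPClient);
begin
  FClient := AClient;
  FreeOnTerminate := True;
  inherited Create(False);  // start running immediately
end;

procedure TReaderThread.Execute;
var
  Line: string;
begin
  // Plain linear logic: each ReadLn blocks only this thread,
  // so there is no state machine to navigate.
  while not Terminated do
  begin
    Line := FClient.IOHandler.ReadLn;
    { handle Line }
  end;
end;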
Indy 9 has been mostly bombproof, fast and reliable for us. However the move to Indy 10 for Tiburon is causing me a little concern.
#Mike: "...the need to kill sockets to free threads...".
This made me go "huh?" until I remembered that our threading library uses an exception-based technique to kill 'waiting' threads safely. We call QueueUserAPC to queue a function which raises a C++ exception (NOT derived from class Exception) that should only be caught by our thread wrapper procedure. All destructors get called, so the threads terminate cleanly and tidy up on the way out.
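For what it's worth, a rough Delphi analogue of that technique might look like the sketch below (hypothetical names; the original is C++). It only works while the target thread sits in an alertable wait, e.g. SleepEx or WaitForSingleObjectEx with bAlertable = True:

uses
  Windows;

type
  // Deliberately NOT derived from Exception, so generic handlers miss it.
  EThreadKill = class(TObject);

procedure KillApc(Param: ULONG_PTR); stdcall;
begin
  // Runs on the target thread during its alertable wait; raising here
  // unwinds the stack so destructors and finally blocks execute.
  raise EThreadKill.Create;
end;

procedure RequestKill(hThread: THandle);
begin
  QueueUserAPC(@KillApc, hThread, 0);
end;

// The thread wrapper procedure is the only place that catches it:
//   try
//     ThreadBody;
//   except
//     on EThreadKill do ;  // swallow and terminate tidily
//   end;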

"Synchronous sockets are not what I'm after."
Understood - but I think in that case the answer to your original question is that there just isn't a Delphi idiom for async socket IO because it's actually a highly specialized and uncommon requirement.
As a side issue, you might find these links interesting. They're both a little old, and more *nix-y than Windows. The second one implies that - in the right environment - threads might not be as bad as you think.
The C10K problem
Why Events Are A Bad Idea (for High-concurrency Servers)

#Chris Miller - What you've stated in your answer is factually inaccurate.
Windows message-style async, as available through WSAAsyncSelect, is indeed largely a workaround for lack of a proper threading model in Win 3.x days.
.NET Begin/End, however, is not using extra threads. Instead, it uses overlapped I/O: the extra argument on WSASend / WSARecv - specifically the overlapped completion routine - specifies the continuation.
This means that the .NET style harnesses the Windows OS's async I/O support to avoid burning a thread by blocking on a socket.
Since threads are generally speaking expensive (unless you specify a very small stack size to CreateThread), having threads blocking on sockets will stop you from scaling to 10,000s of concurrent connections.
This is why it's important that async I/O be used if you want to scale, and also why .NET is not, I repeat, is not, simply "using threads, [...] just managed by the Framework".
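To make the distinction concrete, here is a minimal sketch of overlapped I/O with a completion routine in Delphi (assuming a Winsock2 import unit such as Winapi.Winsock2 or the JEDI headers; the names Ctx, RecvComplete and PostRead are mine, and error handling is omitted). The socket must be created in overlapped mode, e.g. via WSASocket with WSA_FLAG_OVERLAPPED:

uses
  Windows, Winsock2;

var
  Ctx: record
    Ov: TWSAOverlapped;
    Buf: array[0..4095] of AnsiChar;
  end;

procedure RecvComplete(dwError, cbTransferred: DWORD;
  lpOverlapped: LPWSAOVERLAPPED; dwFlags: DWORD); stdcall;
begin
  // This is the continuation: it runs when the read completes,
  // delivered as an APC while the issuing thread is in an alertable wait.
  if (dwError = 0) and (cbTransferred > 0) then
  begin
    { process Ctx.Buf[0..cbTransferred-1], then post the next read }
  end;
end;

procedure PostRead(Sock: TSocket);
var
  WsaBuf: TWsaBuf;
  Flags, Recvd: DWORD;
begin
  FillChar(Ctx.Ov, SizeOf(Ctx.Ov), 0);
  WsaBuf.buf := @Ctx.Buf[0];
  WsaBuf.len := SizeOf(Ctx.Buf);
  Flags := 0;
  // Returns immediately (WSA_IO_PENDING); no thread blocks on the socket.
  WSARecv(Sock, @WsaBuf, 1, Recvd, Flags, @Ctx.Ov, @RecvComplete);
  // Something on this thread must later enter an alertable state, e.g.:
  SleepEx(INFINITE, True);
end;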

#Roddy - I've already read the links you point to, they are both referenced from Paul Tyma's presentation "Thousands of Threads and Blocking I/O - The old way to write Java Servers is New again".
Some of the things that don't necessarily jump out from Paul's presentation, however, are that he specified -Xss48k to the JVM on startup, and that he's assuming the JVM's NIO implementation is efficient in order for it to be a valid comparison.
Indy does not specify a similarly shrunken and tightly constrained stack size. There are no calls to BeginThread (the Delphi RTL thread creation routine, which you should use for such situations) or CreateThread (the raw WinAPI call) in the Indy codebase.
The default stack size is stored in the PE; for the Delphi compiler it defaults to 1MB of reserved address space (space is committed page by page by the OS in 4K chunks; in fact, the compiler needs to generate code to touch pages if there are more than 4K of locals in a function, because the extension is controlled by page faults, but only for the lowest (guard) page in the stack). With a 2GB user address space, that means you're going to run out of address space at around 2,000 concurrent threads handling connections.
Now, you can change the default stack size in the PE using the {$M minStackSize [,maxStackSize]} directive, but that will affect all threads, including the main thread. I hope you don't do much recursion, because 48K (or similar) isn't a lot of space.
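If you do want tightly constrained per-thread stacks without touching the main thread, Windows lets you pass a stack size at thread creation; a hedged sketch (the STACK_SIZE_PARAM_IS_A_RESERVATION flag, available from Windows XP on, makes the parameter a reservation rather than an initial commit):

const
  STACK_SIZE_PARAM_IS_A_RESERVATION = $00010000;

function ConnectionProc(Parameter: Pointer): Integer;
begin
  // Handle one connection; keep locals small and avoid recursion.
  Result := 0;
end;

var
  Tid: LongWord;
begin
  // Reserve only 64K of address space for this worker's stack,
  // leaving the {$M} default for the main thread untouched.
  BeginThread(nil, 64 * 1024, ConnectionProc, nil,
    STACK_SIZE_PARAM_IS_A_RESERVATION, Tid);
end.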
Now, whether Paul is right about non-performance of async I/O for Windows in particular, I'm not 100% sure - I'd have to measure it to be certain. What I do know, however, is that arguments about threaded programming being easier than async event-based programming, are presenting a false dichotomy.
Async code doesn't need to be event-based; it can be continuation-based, like it is in .NET, and if you specify a closure as your continuation, you get state maintained for you for free. Moreover, conversion from linear thread-style code to continuation-passing-style async code can be made mechanical by a compiler (CPS transform is mechanical), so there need be no cost in code clarity either.

There is a free IOCP (I/O completion port) socket component, with source code included: http://www.torry.net/authorsmore.php?id=7131
"By Naberegnyh Sergey N.. High
performance socket server based on
Windows Completion Port and with using
Windows Socket Extensions. IPv6
supported. "
I found it while looking for better components/libraries to re-architect my little instant-messaging server. I haven't tried it yet, but it looks well coded as a first impression.

For async stuff try ICS
http://www.overbyte.be/frame_index.html?redirTo=/products/ics.html

Indy uses synchronous sockets because it's a simpler way of programming. Asynchronous socket support was added to the Winsock stack back in the Windows 3.x days: Windows 3.x did not support threads, so without asynchronous (message-based) sockets you couldn't do socket I/O without freezing the application. For some additional information about why Indy uses the blocking model, please see this article.
The .NET Socket.BeginRead/EndRead calls are using threads, it's just managed by the Framework instead of by you.
#Roddy, Indy 10 has been bundled with Delphi since Delphi 2006. I found migrating from Indy 9 to Indy 10 to be a straightforward task.

With the ScktComp classes, you need to use a ThreadBlocking server rather than a NonBlocking server type. Use the OnGetThread event to hand off the ClientSocket param to a new thread of your devising. Once you've instantiated an inherited instance of TServerClientThread, you'll create an instance of TWinSocketStream (inside the thread) which you can use to read and write to the socket. This method gets you away from trying to process data in the event handler. These threads could exist for just the short period needed to read or write, or persist for the duration of the connection so they can be reused. A minimal sketch of this arrangement appears below.
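Here is that sketch (assuming a TServerSocket with ServerType set to stThreadBlocking; the names TClientThread and the echo logic are mine, and error handling is omitted):

uses
  Classes, ScktComp;

type
  TClientThread = class(TServerClientThread)
  protected
    procedure ClientExecute; override;
  end;

procedure TClientThread.ClientExecute;
var
  Stream: TWinSocketStream;
  Buf: array[0..1023] of Byte;
  N: Integer;
begin
  while (not Terminated) and ClientSocket.Connected do
  begin
    Stream := TWinSocketStream.Create(ClientSocket, 60000); // 60s timeout
    try
      if Stream.WaitForData(1000) then
      begin
        N := Stream.Read(Buf, SizeOf(Buf));
        if N = 0 then Break;    // connection closed
        Stream.Write(Buf, N);   // echo the data back
      end;
    finally
      Stream.Free;
    end;
  end;
end;

// In the server's OnGetThread handler:
procedure TMyForm.ServerSocketGetThread(Sender: TObject;
  ClientSocket: TServerClientWinSocket;
  var SocketThread: TServerClientThread);
begin
  SocketThread := TClientThread.Create(False, ClientSocket);
end;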
The subject of writing a socket server is fairly vast. There are many techniques and practices you could choose to implement. The method of reading and writing to the same socket within the TServerClientThread is straightforward and fine for simple applications. If you need a model for high availability and high concurrency then you need to look into patterns like the Proactor pattern.
Good luck!

Related

How To Detect If Another Process Is Reading My Application's Memory

Is it possible to detect if another process is reading the memory of my application? If so, can you give me any examples of how to accomplish this? (Examples in C++ would be great.)
Thank You
To detect a process that opens a handle to your process and calls ReadProcessMemory, you must either hook OpenProcess & ReadProcessMemory in every process or have a kernel-mode driver that intercepts these calls.
Detecting ReadProcessMemory solely from inside the target process, in usermode, is not possible to my knowledge. I have seen this question come up many times and have never seen an acceptable answer.

How to cap memory usage of Haskell threads

In a Haskell program compiled with GHC, is it possible to programmatically guard against excessive memory usage? That is, have it notify the program when memory usage reaches a specified limit, preferably indicating the offending thread.
For example, suppose I want to write a server, hosting a scripting language interpreter, that users can connect to. It's Turing-complete, so programs could theoretically use unlimited memory or time. Suppose each client is handled with a separate thread. If a client writes an infinite loop that consumes memory very quickly, I want to ensure that the thread consumes no more than, say, 1 MB of memory, before being alerted with an exception. I do not want other users to be affected when that happens.
This is probably possible using separate processes and ulimit, but:
I would rather keep it in one program, to avoid the complexity of inter-process communication.
I need to support both Linux and Windows, so I would prefer to keep it platform-agnostic if possible.
Edward Z. Yang and David Mazières have developed an extension to GHC that supports dynamic resource limits, which they discuss at http://ezyang.com/rlimits.html
They also provide a version of GHC 7.8 that supports this. Unfortunately, their work was not included in GHC upstream.
This may not be exactly what you want, but as documented here there is a GHC compile option, -Ksize. Update: oops, sorry, -K is for stack overflows; still, the link is worth checking.
For your example, you may need to modify the source of the scripting language interpreter: if it has managed memory allocation, you can tweak its memory management module(s) so that the interpreter reports excessive use of its memory quota through an API callback to your host application.

Does Erlang always copy messages between processes on the same node?

A faithful implementation of the actor message-passing semantics means that message contents are deep-copied from a logical point of view, even for immutable types. Deep copying of message contents remains a bottleneck for implementations of the actor model, so for performance some implementations support zero-copy message passing (although it's still a deep copy from the programmer's point of view).
Is zero-copy message-passing implemented at all in Erlang? Between nodes it obviously can't be implemented as such, but what about between processes on the same node? This question is related.
I don't think your assertion is correct at all - deep copying of inter-process messages isn't a bottleneck in Erlang, and with the default VM build/settings, this is exactly what all Erlang systems are doing.
Erlang process heaps are completely separate from each other, and the message queue is located in the process heap, so messages must be copied. This is also true for transferring data into and out of ETS tables as their data is stored in a separate allocation area from process heaps.
There are a number of shared datastructures however. Large binaries (>64 bytes long) are generally allocated in a node-wide area and are reference counted. Erlang processes just store references to these binaries. This means that if you create a large binary and send it to another process, you're only sending the reference.
Sending data between processes is actually worse in terms of allocation size than you might imagine - sharing inside a term isn't preserved during the copy. This means that if you carefully construct a term with sharing to reduce memory consumption, it will expand to its unshared size in the other process. You can see a practical example in the OTP Efficiency Guide.
As Nikolaus Gradwohl pointed out, there was an experimental hybrid heap mode for the VM which did allow term sharing between processes and enabled zero-copy message passing. It wasn't a particularly promising experiment as I understand it - it requires extra locking and complicates the existing ability of processes to garbage-collect independently. So not only is copying inter-process messages not the usual bottleneck in Erlang systems, avoiding the copy actually reduced performance.
AFAIK there was/is experimental support for zero-copy message-passing in Erlang using the -shared or -hybrid model. I read a blog post in 2009 claiming that it's broken on SMP machines, but I have no idea about the current status.
As has been mentioned here and in other questions, current versions of Erlang basically copy everything except for larger binaries. In older pre-SMP times it was feasible to pass references instead of copying. While this resulted in very fast message passing, it created other problems in the implementation: primarily, it made garbage collection more difficult and complicated the implementation. I think that today, passing references and having shared data could result in excessive locking and synchronisation, which is, of course, not a Good Thing.
I wrote the accepted answer to that other question you're referencing, and in it I give you a direct pointer to this line of code:
message = copy_struct(message, msize, &hp, &bp->off_heap);
This is in a function called when the Erlang run-time system needs to send a message, and it's not inside any kind of "if" that could cause it to be skipped. So, as far as I can tell, the answer is "yes, it's always copied." (That's not strictly true -- there is an "if", but it seems to be dealing with exceptional cases, not the normal code-flow path.)
(I'm ignoring the hybrid heap option brought up by Nikolaus. It looks like he's right, but since this isn't the way Erlang is normally built and it has its own penalties, I don't see that it's worth considering as a way to answer your concern.)
I don't know why you're considering 10 GByte/sec a bottleneck, though. Nothing short of registers or CPU cache goes faster in the computer, and such memories are small, thus constituting a kind of bottleneck themselves. Besides which, the zero-copy idea you're proposing would require locking in the case of cross-CPU message passing in a multi-core system, which is also a bottleneck. We're already paying the locking penalty once in this function to copy the message into the other process's message queue; why pay it again later when that process gets around to reading the message?
Bottom line, I don't think your ideas of ways to make it go faster would actually help much.

Handling the segfault signal SIGSEGV: determining the cause of a segfault using siginfo_t

I'm making a wrapper for the pthread library that allows each thread to have its own set of non-shared memory. Right now, the way it is set up, if any thread tries to read, write or execute another thread's data, the program segfaults. This is fine; I can catch it with a signal handler, call pthread_exit(), and continue on with the program.
But not every segfault is going to be the result of such a bad access. I need to find a way to use the siginfo_t data to determine whether the segfault was bad programming or this error. Any ideas?
Since I am using mmap to manage the memory pages, I think using si_addr in siginfo_t will help me out.
It sounds like what you're really after is thread-local storage, which is already solved much more portably than this: GCC provides __thread, MSVC provides __declspec(thread), and boost::thread provides portable thread-local storage using a variety of mechanisms depending on the platform/toolchain.
If you really do want to go down this road, it can be made to work, but the path is fraught with dangers. Recovering from SIGSEGV is technically undefined behaviour; although it can be made to work on quite a few platforms, it is neither robust nor portable. You also need to be very careful about what you do in the signal handler: the list of async-signal-safe functions, i.e. those which may legally be called from a signal handler, is very small.
I've used this trick successfully a few times in the past, normally for marking "pages" as "dirty" in userspace. The way I did this was by setting up a hashtable which contained the base address of all the "pages" of memory that I was interested in. When you catch a SIGSEGV in a handler you can then map an address back to a page with simple arithmetic operations. Provided the hashtable can be read without locks you can then lookup if this is a page that you care about or a segfault from somewhere else and decide how to act.
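The address-to-page arithmetic is the same in any language; as a tiny illustrative sketch (in Delphi, assuming 4K pages; the fault address would come from si_addr in the siginfo_t):

const
  PAGE_SIZE = 4096;

function PageBase(FaultAddr: NativeUInt): NativeUInt;
begin
  // Round down to the page boundary; the result is the hashtable key.
  Result := FaultAddr and not NativeUInt(PAGE_SIZE - 1);
end;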

List of Delphi data types with 'thread-safe' read/write operations?

Are 'boolean' variables thread-safe for reading and writing from any thread? I've seen some newsgroup references to say that they are. Are any other data types available? (Enumerated types, short ints perhaps?)
It would be nice to have a list of all data types that can be safely read from any thread and another list that can also be safely written to in any thread without having to resort to various synchronization methods.
Please note that you can make essentially everything in Delphi thread-unsafe. While others mention alignment problems with Boolean, that in a way hides the real problem.
Yes, you can read a Boolean in any thread and write to a Boolean in any thread if it's correctly aligned. But reading a Boolean that another thread changes is not necessarily "thread safe" anyway. Say you have a Boolean you set to True when you've updated a number, so that another thread knows to read the number:
if NumberUpdated then
begin
  LocalNumber := TheNumber;
end;
Due to optimizations the processor makes, TheNumber may be read before NumberUpdated, so you may get the old value of TheNumber even though NumberUpdated was set last. In effect, your code may become:
temp := TheNumber;
if NumberUpdated then
begin
  LocalNumber := temp;
end;
Imho, a basic rule of thumb:
"Reads are thread safe. Writes are not thread safe."
So if you're going to do a write, protect the data with synchronization everywhere you read the value while a write could potentially occur.
On the other hand, if you only read and write a value in one thread, then it's thread safe. So you can do a large chunk of writing in a temporary location, then synchronize an update of application-wide data.
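A minimal sketch of that rule with a critical section (SyncObjs; the names are mine):

uses
  SyncObjs;

var
  Lock: TCriticalSection;  // created once at startup: Lock := TCriticalSection.Create;
  TheNumber: Integer;

procedure SetNumber(Value: Integer);
begin
  Lock.Acquire;
  try
    TheNumber := Value;
  finally
    Lock.Release;
  end;
end;

function GetNumber: Integer;
begin
  // Reads take the same lock, because a write could occur concurrently.
  Lock.Acquire;
  try
    Result := TheNumber;
  finally
    Lock.Release;
  end;
end;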
Bonus blurb:
The VCL is not thread safe. Keep all modification of ui stuff in the main thread. Keep the creation of all ui stuff in the main thread too.
Many functions are not thread safe either, while others are, it often depends on the underlying winapi calls.
I don't think a "list" would be helpful as "thread safe" can mean a lot of stuff.
This is not a question of data types being thread-safe, but of what you do with them. Without locking, no operation is thread-safe that involves loading a value, changing it, and writing it back: incrementing or decrementing a number, clearing or setting an element in a set - none of these are thread-safe.
There are a number of functions that allow for atomic operations: interlocked increment, interlocked decrement, and interlocked exchange. This is a common concept, nothing specific to Windows, x86 or Delphi. In Delphi you can use the InterlockedFoo() functions of the Windows API; there are several wrappers around those too. Or write your own. The functions operate on integers, so you can have atomic increment, decrement and exchange of 32-bit integers with them.
You can also use assembler and prefix ops with the lock prefix.
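A small sketch of the Windows API route (InterlockedIncrement and friends operate on an aligned 32-bit Integer passed by reference):

uses
  Windows;

var
  ActiveConnections: Integer = 0;

procedure ConnectionOpened;
begin
  // Atomic read-modify-write; safe from any thread without a lock.
  InterlockedIncrement(ActiveConnections);
end;

procedure ConnectionClosed;
begin
  InterlockedDecrement(ActiveConnections);
end;

function ResetAndGet: Integer;
begin
  // Atomically store 0 and return the previous value.
  Result := InterlockedExchange(ActiveConnections, 0);
end;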
For more information see also this StackOverflow question.
On a 32-bit architecture, only properly aligned data types of 32 bits or less should be considered atomic. 32-bit values must be 4-aligned (the address of the data must be evenly divisible by four). You probably wouldn't run into interleaving at such a tight level, but theoretically you could have a non-atomic write of a Double, Int64 or Extended.
With multi-core RISC processing and separate per-core cache memory in the mix on a modern processor, it is no longer the case that any 'trivial' high-level language read or write construct (or, for that matter, many once-upon-a-time 'atomic' 8086 assembly instructions) can be considered atomic. Indeed, unless an assembler instruction is specifically designed to be atomic, it probably is not - and that includes most mechanisms for memory reads. Even a long-integer read at assembler level can be corrupted by a simultaneous write from another processor core that shares the same memory and uses asynchronous cache-update actions.
Remember that on a processor comprising multiple RISC cores, even assembly language instructions are effectively just "higher-level" code instructions: you never really know how they are implemented at the bit level, and it might not be quite what you expect from reading an old (single-core) 8086 assembler manual. Windows provides native, system-compatible atomic operators, and you would be well advised to use these rather than make any base assumptions about atomic operations.
Why use the Windows operators? Because one of the first things Windows does is establish what machine it is running on. One of the key aspects it ensures it gets right is which atomic operations are available and how they work. If you want your code to work well into the future on any future processor, you can either duplicate (and constantly update) all this effort in your own code, or you can make use of the fact that Windows did it all already at startup and incorporated the necessary code into its runtime API.
Read the MSDN pages on atomic operations. The Windows API surfaces these for you. They may sometimes seem clunky or clumsy - but they are future proof and they will always work exactly as it says on the tin.
How do I know this? Well, because if they didn't - then you wouldn't be able to run Windows. Full-stop. Never mind running your own code.
Whenever you write code, it is always a good idea to understand parsimony and consider Occam's razor. In other words, if Windows already does it, and your code needs Windows to run, then use what Windows already does rather than trying out many alternative and increasingly complex hypothetical solutions that may or may not work. Doing anything else is just a waste of your time (unless, of course, that is what you are into).
The Indy code contains some atomic / thread safe data types in IdThreadSafe.pas:
TIdThreadSafeInteger
TIdThreadSafeBoolean
TIdThreadSafeString
TIdThreadSafeStringList
and some more ...
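For illustration, a small usage sketch (assuming Indy is on the search path; the counter name is mine):

uses
  IdThreadSafe;

var
  Counter: TIdThreadSafeInteger;
begin
  Counter := TIdThreadSafeInteger.Create;
  try
    Counter.Increment;       // guarded internally by a critical section
    Counter.Value := 42;     // atomic assignment via the same lock
    WriteLn(Counter.Value);
  finally
    Counter.Free;
  end;
end.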
