pthread_cleanup_pop_restore - what is it?
It is from glibc, and it is called very often and eats a lot of CPU time.
The program makes a lot of getc() calls. I can't change the program (it is a benchmark with fixed source), but I want to make it run faster.

It is a function!
It deregisters a cleanup handler from the thread's cancellation cleanup stack, as the counterpart of the push call that registered it. When a glibc I/O function (with file locking enabled in glibc) is interrupted by pthread_cancel, that registered handler is the only chance to unlock the stream; this function removes the handler again on the normal, uncancelled path.
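For illustration, here is a minimal C sketch of the push/pop pattern this function belongs to, in the spirit of the locking glibc does internally around stdio calls. The mutex and handler names are made up for the example; glibc itself uses GNU/internal variants such as pthread_cleanup_push_defer_np / pthread_cleanup_pop_restore_np, which also save and restore the cancellation type.

    #include <pthread.h>
    #include <stdio.h>

    static pthread_mutex_t stream_lock = PTHREAD_MUTEX_INITIALIZER;

    /* Cleanup handler: only runs if the thread is cancelled between
     * push and pop, so the lock is never left held. */
    static void unlock_stream(void *arg)
    {
        pthread_mutex_unlock((pthread_mutex_t *)arg);
    }

    static int locked_getc(FILE *fp)
    {
        int c;

        pthread_mutex_lock(&stream_lock);
        /* Register the handler in case pthread_cancel() hits while we
         * are blocked inside the read. */
        pthread_cleanup_push(unlock_stream, &stream_lock);

        c = getc(fp);            /* may be a cancellation point */

        /* Normal path: deregister the handler and (argument 1) run it,
         * which releases the lock. */
        pthread_cleanup_pop(1);
        return c;
    }

Every locked getc() in the benchmark presumably goes through a push/pop pair like this internally, which is why the pop counterpart shows up so prominently in the profile.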

Related

How do modern OS's achieve idempotent cleanup functions for process deaths?

Let's say that I have an OS that implements malloc by storing a list of segments that the process points to in a process control block. I grab my memory from a free list and give it to the process.
If that process dies, I simply remove the reference to the segment from the process control block, and move the segment back to my free list.
Is it possible to create an idempotent function that does this process cleanup? How is it possible to create a function such that it can be called again, regardless of whether it was called many times before or if previous calls died in the middle of executing the cleanup function? It seems to me that you can't execute two move commands atomically.
How do modern OS's implement the magic involved in culling memory from processes that randomly die? How do they implement it so that it's okay for even the process performing the cull to randomly die, or is this a false assumption that I made?
I'll assume your question boils down to how the OS culls a process's memory if that process crashes.
Although I'm self-educated in these matters, I'll give you two ways an OS can make sure any memory used by a process is reclaimed if the process crashes.
In a typical modern CPU and modern OS with virtual memory:
You have two layers of allocation. Whenever the process calls malloc, malloc tries to satisfy the request from memory pages the kernel has already given the process. If not enough pages are available, malloc asks the kernel to allocate more pages (on Linux, via brk/sbrk or mmap).
In this case, whenever a process crashes or even if it exits normally, the kernel doesn't care what malloc did, or what memory the process forgot to release. It only needs to free all the pages it gave the process.
In a simpler OS that doesn't care much about performance, memory fragmentation or virtual memory and maybe not even about memory protection:
Malloc/free is implemented completely on the kernel side (e.g. as system calls). Whenever a process calls malloc/free, the kernel does all the work and therefore knows about all the memory that needs to be freed. Once the process crashes or exits, the kernel can clean up. Since the kernel is never supposed to crash and keeps a record of all allocated memory per process, this is trivial.
Like I said, I'm self-educated, and I didn't check how, for example, Linux or Windows actually implement it.

Are cuda kernel calls synchronous or asynchronous

I read that one can use kernel launches to synchronize different blocks, i.e., if I want all blocks to complete operation 1 before they go on to operation 2, I should place operation 1 in one kernel and operation 2 in another kernel. This way, I can achieve global synchronization between blocks. However, the CUDA C Programming Guide mentions that kernel calls are asynchronous, i.e. the CPU does not wait for the first kernel call to finish, and thus the CPU can also call the second kernel before the first has finished. However, if this is true, then we cannot use kernel launches to synchronize blocks. Please let me know where I am going wrong.
Kernel calls are asynchronous from the point of view of the CPU so if you call 2 kernels in succession the second one will be called without waiting for the first one to finish. It only means that the control returns to the CPU immediately.
On the GPU side, if you haven't specified different streams to execute the kernels, they will be executed in the order they were called (if you don't specify a stream, they both go to the default stream and are executed serially). Only after the first kernel finishes will the second one execute.
This behavior applies to devices with compute capability 2.x, which support concurrent kernel execution. On other devices, even though kernel launches are still asynchronous, kernel execution is always sequential.
Check section 3.2.5 of the CUDA C Programming Guide, which every CUDA programmer should read.
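To make this concrete, here is a minimal CUDA C sketch (the kernels are placeholders): both launches return to the CPU immediately, but because they land in the same default stream, op2 only starts on the GPU once every block of op1 has finished.

    __global__ void op1(float *d, int n) { /* ... first pass over the data ... */ }
    __global__ void op2(float *d, int n) { /* ... second pass, needs op1's results ... */ }

    void run(float *d_data, int n)
    {
        int threads = 256;
        int blocks  = (n + threads - 1) / threads;

        // Asynchronous from the CPU's point of view: returns at once.
        op1<<<blocks, threads>>>(d_data, n);

        // Also returns at once, but the GPU serializes it after op1 in the
        // default stream, so all blocks of op1 complete before op2 begins.
        op2<<<blocks, threads>>>(d_data, n);

        // Only needed when the host itself must wait for the results,
        // e.g. before copying them back or timing on the CPU side.
        cudaDeviceSynchronize();
    }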
The accepted answer is not always correct.
In most cases, kernel launch is asynchronous. But in the following cases it is effectively synchronous, and they are easily overlooked:
The environment variable CUDA_LAUNCH_BLOCKING is set to 1.
A profiler (nvprof) is collecting hardware counters without concurrent kernel profiling enabled.
A memcpy involves host memory that is not page-locked.
Programmers can globally disable asynchronicity of kernel launches for all CUDA applications running on a system by setting the CUDA_LAUNCH_BLOCKING environment variable to 1. This feature is provided for debugging purposes only and should not be used as a way to make production software run reliably.
Kernel launches are synchronous if hardware counters are collected via a profiler (Nsight, Visual Profiler) unless concurrent kernel profiling is enabled. Async memory copies will also be synchronous if they involve host memory that is not page-locked.
From the NVIDIA CUDA Programming Guide (http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#concurrent-execution-host-device).
Concurrent kernel execution is supported since compute capability 2.0.
In addition, control can return to the CPU code before all warps of the kernel have finished executing.
In this case, you can provide the synchronization yourself.

How to cap memory usage of Haskell threads

In a Haskell program compiled with GHC, is it possible to programmatically guard against excessive memory usage? That is, have it notify the program when memory usage reaches a specified limit, preferably indicating the offending thread.
For example, suppose I want to write a server, hosting a scripting language interpreter, that users can connect to. It's Turing-complete, so programs could theoretically use unlimited memory or time. Suppose each client is handled with a separate thread. If a client writes an infinite loop that consumes memory very quickly, I want to ensure that the thread consumes no more than, say, 1 MB of memory, before being alerted with an exception. I do not want other users to be affected when that happens.
This is probably possible using separate processes and ulimit, but:
I would rather keep it in one program, to avoid the complexity of inter-process communication.
I need to support both Linux and Windows, so I would prefer to keep it platform-agnostic if possible.
Edward Z. Yang and David Mazières have developed an extension to GHC that supports dynamic resource limits and discuss it at http://ezyang.com/rlimits.html. They also provide a version of GHC 7.8 that supports this.
Unfortunately, their work was not included in GHC upstream.
This may not be exactly what you want, but as documented here GHC has an option:
-Ksize. Update: oops, sorry, -K is for stack overflows, not heap usage; still, you can check that link.
In your example, you may need to modify the source of the scripting language interpreter and make some tweaks to its memory management module(s). Of course, if it has some managed memory allocation features, the interpreter could report excessive use of its memory quota via an API callback to your host application.

What's the idiomatic way to do async socket programming in Delphi?

What is the normal way people writing network code in Delphi use Windows-style overlapped asynchronous socket I/O?
Here's my prior research into this question:
The Indy components seem entirely synchronous. On the other hand, while the ScktComp unit does use WSAAsyncSelect, it basically only asynchronizes a BSD-style multiplexed socket app. You get dumped into a single event callback, as if you had just returned from select() in a loop, and have to do all the state machine navigation yourself.
The .NET situation is considerably nicer, with Socket.BeginRead / Socket.EndRead, where the continuation is passed directly to Socket.BeginRead, and that's where you pick back up. A continuation coded as a closure obviously has all the context you need, and more.
I have found that Indy, while a simpler concept in the beginning, is awkward to manage due to the need to kill sockets to free threads at application termination. In addition, I had the Indy library stop working after an OS patch upgrade. ScktComp works well for my application.
#Roddy - Synchronous sockets are not what I'm after. Burning a whole thread for the sake of a possibly long-lived connection means you limit the amount of concurrent connections to the number of threads that your process can contain. Since threads use a lot of resources - reserved stack address space, committed stack memory, and kernel transitions for context switches - they do not scale when you need to support hundreds of connections, much less thousands or more.
What is the normal way people writing network code in Delphi use Windows-style overlapped asynchronous socket I/O?
Well, Indy has been the 'standard' library for socket I/O for a long while now - and it's based on blocking sockets. This means if you want asynchronous behaviour, you use additional thread(s) to connect/read/write data. To my mind this is actually a major advantage, as there's no need to manage any kind of state machine navigation, or worry about callback procs or similar stuff. I find the logic of my 'reading' thread is less cluttered and much more portable than non-blocking sockets would allow.
Indy 9 has been mostly bombproof, fast and reliable for us. However the move to Indy 10 for Tiburon is causing me a little concern.
#Mike: "...the need to kill sockets to free threads...".
This made me go "huh?" until I remembered our threading library uses an exception-based technique to kill 'waiting' threads safely. We call QueueUserAPC to queue a function which raises a C++ exception (NOT derived from class Exception) which should only be caught by our thread wrapper procedure. All destructors get called, so the threads all terminate cleanly and tidy up on the way out.
"Synchronous sockets are not what I'm after."
Understood - but I think in that case the answer to your original question is that there just isn't a Delphi idiom for async socket IO because it's actually a highly specialized and uncommon requirement.
As a side issue, you might find these links interesting. They're both a little old, and more *nxy than Windows. The second one implies that - in the right environment - threads might not be as bad as you think.
The C10K problem
Why Events Are A Bad Idea (for High-concurrency Servers)
#Chris Miller - What you've stated in your answer is factually inaccurate.
Windows message-style async, as available through WSAAsyncSelect, is indeed largely a workaround for lack of a proper threading model in Win 3.x days.
.NET Begin/End, however, is not using extra threads. Instead, it is using overlapped I/O, using the extra argument on WSASend / WSARecv, specifically the overlapped completion routine, to specify the continuation.
This means that the .NET style harnesses the Windows OS's async I/O support to avoid burning a thread by blocking on a socket.
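For reference, a bare-bones C sketch of that overlapped-I/O mechanism (error handling omitted, the connected SOCKET is assumed to exist, link against ws2_32): WSARecv returns immediately and the completion routine is the continuation, so no thread sits blocked on the socket.

    #include <winsock2.h>

    static char          buf[4096];
    static WSAOVERLAPPED ov;
    static WSABUF        wb = { sizeof buf, buf };

    /* Called by Windows when the read completes; this is the "continuation". */
    static void CALLBACK on_read(DWORD err, DWORD nread, LPWSAOVERLAPPED o, DWORD flags)
    {
        if (err == 0 && nread > 0) {
            /* ... process nread bytes of buf, then post the next read ... */
        }
    }

    static void start_read(SOCKET s)
    {
        DWORD flags = 0;
        /* Returns at once; the completion is delivered to on_read() while
         * the thread is in an alertable wait (e.g. SleepEx). */
        WSARecv(s, &wb, 1, NULL, &flags, &ov, on_read);
    }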
Since threads are generally speaking expensive (unless you specify a very small stack size to CreateThread), having threads blocking on sockets will stop you from scaling to 10,000s of concurrent connections.
This is why it's important that async I/O be used if you want to scale, and also why .NET is not, I repeat, is not, simply "using threads, [...] just managed by the Framework".
#Roddy - I've already read the links you point to, they are both referenced from Paul Tyma's presentation "Thousands of Threads and Blocking I/O - The old way to write Java Servers is New again".
Some of the things that don't necessarily jump out from Paul's presentation, however, are that he specified -Xss:48k to the JVM on startup, and that he's assuming that the JVM's NIO implementation is efficient in order for it to be a valid comparison.
Indy does not specify a similarly shrunken and tightly constrained stack size. There are no calls to BeginThread (the Delphi RTL thread creation routine, which you should use for such situations) or CreateThread (the raw WinAPI call) in the Indy codebase.
The default stack size is stored in the PE, and for the Delphi compiler it defaults to 1MB of reserved address space (space is committed page by page by the OS in 4K chunks; in fact, the compiler needs to generate code to touch pages if there are >4K of locals in a function, because the extension is controlled by page faults, but only for the lowest (guard) page in the stack). That means you're going to run out of address space after max 2,000 concurrent threads handling connections.
Now, you can change the default stack size in the PE using the {$M minStackSize [,maxStackSize]} directive, but that will affect all threads, including the main thread. I hope you don't do much recursion, because 48K (or similar) isn't a lot of space.
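As an aside, in native Win32 code the per-thread reservation can be shrunk when the thread is created instead of changing the PE default for everything; a hedged C sketch (the handler function is hypothetical):

    #include <windows.h>

    static DWORD WINAPI handle_connection(LPVOID param)
    {
        /* ... per-connection work; keep locals and recursion shallow ... */
        return 0;
    }

    HANDLE spawn_handler(LPVOID conn)
    {
        /* Reserve only 64 KiB of address space for this thread's stack
         * instead of the PE default (typically 1 MB), so many more
         * threads fit into a 32-bit address space. */
        return CreateThread(NULL,
                            64 * 1024,
                            handle_connection,
                            conn,
                            STACK_SIZE_PARAM_IS_A_RESERVATION,
                            NULL);
    }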
Now, whether Paul is right about non-performance of async I/O for Windows in particular, I'm not 100% sure - I'd have to measure it to be certain. What I do know, however, is that arguments about threaded programming being easier than async event-based programming, are presenting a false dichotomy.
Async code doesn't need to be event-based; it can be continuation-based, like it is in .NET, and if you specify a closure as your continuation, you get state maintained for you for free. Moreover, conversion from linear thread-style code to continuation-passing-style async code can be made mechanical by a compiler (CPS transform is mechanical), so there need be no cost in code clarity either.
There is a free IOCP (I/O completion ports) socket component library: http://www.torry.net/authorsmore.php?id=7131 (source code included)
"By Naberegnyh Sergey N.. High
performance socket server based on
Windows Completion Port and with using
Windows Socket Extensions. IPv6
supported. "
I found it while looking for better components/libraries to re-architect my little instant messaging server. I haven't tried it yet, but as a first impression the code looks well written.
For async stuff try ICS
http://www.overbyte.be/frame_index.html?redirTo=/products/ics.html
Indy uses synchronous sockets because it's a simpler way of programming. Asynchronous, non-blocking sockets were something added to the WinSock stack back in the Windows 3.x days. Windows 3.x did not support threads, and without threads you couldn't do blocking socket I/O without hanging the application. For some additional information about why Indy uses the blocking model, please see this article.
The .NET Socket.BeginRead/EndRead calls are using threads, it's just managed by the Framework instead of by you.
#Roddy, Indy 10 has been bundled with Delphi since Delphi 2006. I found migrating from Indy 9 to Indy 10 to be a straightforward task.
With the ScktComp classes, you need to use a ThreadBlocking server rather than a NonBlocking server type. Use the OnGetThread event to hand off the ClientSocket param to a new thread of your devising. Once you've instantiated an inherited instance of TServerClientThread, you'll create an instance of TWinSocketStream (inside the thread) which you can use to read and write to the socket. This method gets you away from trying to process data in the event handler. These threads could exist for just the short period needed to read or write, or hang on for the duration for the purpose of being reused.
The subject of writing a socket server is fairly vast. There are many techniques and practices you could choose to implement. The method of reading and writing to the same socket within the TServerClientThread is straightforward and fine for simple applications. If you need a model for high availability and high concurrency, then you need to look into patterns like the Proactor pattern.
Good luck!

How to log mallocs

This is a bit hypothetical and grossly simplified but...
Assume a program that will be calling functions written by third parties. These parties can be assumed to be non-hostile but can't be assumed to be "competent". Each function will take some arguments, have side effects and return a value. They have no state while they are not running.
The objective is to ensure they can't cause memory leaks by logging all mallocs (and the like) and then freeing everything after the function exits.
Is this possible? Is this practical?
p.s. The important part to me is ensuring that no allocations persist so ways to remove memory leaks without doing that are not useful to me.
You don't specify the operating system or environment; this answer assumes Linux, glibc, and C.
You can set __malloc_hook, __free_hook, and __realloc_hook to point to functions which will be called from malloc(), free(), and realloc() respectively. There is a __malloc_hook manpage showing the prototypes. You can track allocations in these hooks, then return to let glibc handle the memory allocation/deallocation.
It sounds like you want to free any live allocations when the third-party function returns. There are ways to have gcc automatically insert calls at every function entrance and exit using -finstrument-functions, but I think that would be inelegant for what you are trying to do. Can you have your own code call a function in your memory-tracking library after calling one of these third-party functions? You could then check if there are any allocations which the third-party function did not already free.
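A hedged sketch of that combination, assuming an older glibc (these hooks are deprecated, not thread-safe, and were removed from the public API in glibc 2.34; __realloc_hook is omitted for brevity, and the table and function names are made up):

    #include <malloc.h>
    #include <stddef.h>

    #define MAX_LIVE 4096
    static void *live[MAX_LIVE];               /* crude table of outstanding blocks */

    static void *(*old_malloc_hook)(size_t, const void *);
    static void  (*old_free_hook)(void *, const void *);

    static void *track_malloc(size_t size, const void *caller);
    static void  track_free(void *ptr, const void *caller);

    static void hooks_off(void) { __malloc_hook = old_malloc_hook; __free_hook = old_free_hook; }
    static void hooks_on(void)  { __malloc_hook = track_malloc;    __free_hook = track_free;    }

    static void *track_malloc(size_t size, const void *caller)
    {
        (void)caller;
        hooks_off();                           /* avoid recursing into ourselves */
        void *p = malloc(size);
        for (int i = 0; i < MAX_LIVE; i++)
            if (live[i] == NULL) { live[i] = p; break; }
        hooks_on();
        return p;
    }

    static void track_free(void *ptr, const void *caller)
    {
        (void)caller;
        hooks_off();
        for (int i = 0; i < MAX_LIVE; i++)
            if (live[i] == ptr) { live[i] = NULL; break; }
        free(ptr);
        hooks_on();
    }

    /* Call the third-party function, then free anything it forgot. */
    void call_and_sweep(void (*third_party_fn)(void))
    {
        old_malloc_hook = __malloc_hook;
        old_free_hook   = __free_hook;
        hooks_on();
        third_party_fn();
        hooks_off();
        for (int i = 0; i < MAX_LIVE; i++)
            if (live[i] != NULL) { free(live[i]); live[i] = NULL; }
    }

On current glibc the same effect is more commonly achieved by interposing malloc/free at link time (see the linker-based answer below) or via LD_PRELOAD.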
First, you have to provide the entrypoints for malloc() and free() and friends. Because this code is compiled already (right?) you can't depend on #define to redirect.
Then you can implement these in the obvious way and log that they came from a certain module by linking those routines to those modules.
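One concrete way to provide those entrypoints at link time, assuming the GNU linker, is its --wrap feature; a sketch (link the third-party objects together with this file and add -Wl,--wrap=malloc -Wl,--wrap=free):

    #include <stddef.h>
    #include <stdio.h>

    /* With -Wl,--wrap=malloc the linker redirects every call to malloc()
     * in the linked objects here; __real_malloc is the original one. */
    void *__real_malloc(size_t size);
    void  __real_free(void *ptr);

    void *__wrap_malloc(size_t size)
    {
        void *p = __real_malloc(size);
        fprintf(stderr, "module X: malloc(%zu) = %p\n", size, p);  /* or record it */
        return p;
    }

    void __wrap_free(void *ptr)
    {
        fprintf(stderr, "module X: free(%p)\n", ptr);
        __real_free(ptr);
    }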
The fastest way involves no logging at all. If the amount of memory they use is bounded, why not pre-allocate all the "heap" they'll ever need and write an allocator out of that? Then when it's done, free the entire "heap" and you're done! You could extend this idea to multiple heaps if it's more complex than that.
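A hedged sketch of that pre-allocated heap idea (the arena API handed to the third parties is invented for the example): carve pieces out of one big block, then throw the whole block away, so nothing they forget to free can outlive the call.

    #include <stddef.h>
    #include <stdlib.h>

    typedef struct {
        char  *base;
        size_t size;
        size_t used;
    } arena_t;

    /* Pre-allocate everything the third-party code will ever get. */
    int arena_init(arena_t *a, size_t size)
    {
        a->base = malloc(size);
        a->size = size;
        a->used = 0;
        return a->base != NULL;
    }

    /* The allocator you expose instead of malloc(). */
    void *arena_alloc(arena_t *a, size_t n)
    {
        n = (n + 15) & ~(size_t)15;        /* keep returned blocks aligned */
        if (a->used + n > a->size)
            return NULL;                   /* bounded: they can never exceed it */
        void *p = a->base + a->used;
        a->used += n;
        return p;
    }

    /* After the function returns, drop the whole heap in O(1);
     * no per-allocation logging needed. */
    void arena_dispose(arena_t *a)
    {
        free(a->base);
        a->base = NULL;
        a->used = a->size = 0;
    }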
If you really do need to "log" and not make your own allocator, here's some ideas. One, use a hash table with pointers and internal chaining. Another would be to allocate extra space in front of every block and put your own structure there containing, say, an index into your "log table," then keep a free-list of log table entries (as a stack so getting a free one or putting a free one back is O(1)). This takes more memory but should be fast.
Is it practical? I think it is, so long as the speed-hit is acceptable.
You could run the third party functions in a separate process and close the process when you are done using the library.
A better solution than attempting to log mallocs might be to sandbox the functions when you call them—give them access to a fixed segment of memory and then free that segment when the function is done running.
Unconfined, incompetent memory usage can be just as damaging as malicious code.
Can't you just force them to allocate all their memory on the stack? That way it would be guaranteed to be freed after the function exits.
In the past I wrote a software library in C that had a memory management subsystem that contained the ability to log allocations and frees, and to manually match each allocation and free. This was of some use when attempting to find memory leaks, but it was difficult and time consuming to use. The number of logs was overwhelming, and it took an extensive amount of time to understand the logs.
That being said, if your third party library has extensive allocations, it's more than likely impractical to track this via logging. If you're running in a Windows environment, I would suggest using a tool such as Purify[1] or BoundsChecker[2], which should be able to detect leaks in your third party libraries. The investment in the tool should pay for itself in time saved.
[1]: http://www-01.ibm.com/software/awdtools/purify/ Purify
[2]: http://www.compuware.com/products/devpartner/visualc.htm BoundsChecker
Since you're worried about memory leaks and talking about malloc/free, I assume you're in C. I'm also assuming based on your question that you do not have access to the source code of the third party library.
The only thing I can think of is to examine memory consumption of your app before & after the call, log error messages if they're different and convince the third party vendor to fix any leaks you find.
If you have money to spare, then consider using Purify to track issues. It works wonders, and does not require source code or recompilation. There are also other debugging malloc libraries available that are cheaper. Electric Fence is one name I recall. That said, the debugging hooks mentioned by Denton Gentry seem interesting too.
If you're too poor for Purify, try Valgrind. It is a lot better than it was 6 years ago and a lot easier to dive into than Purify.
Microsoft Windows provides (use SUA if you need POSIX) quite possibly the most advanced heap (and heap-using API) debugging infrastructure of any shipping OS today.
The __malloc() debug hooks and the associated CRT debug interfaces are nice for cases where you have the source code to the tests; however, they can often miss allocations made by standard libraries or other code which is linked in. This is expected, as they are the Visual Studio heap debugging infrastructure.
gflags is a very comprehensive and detailed set of debugging capabilities which has been included with Windows for many years. It has advanced functionality for both source and binary-only use cases (as it is the OS heap debugging infrastructure).
It can log full stack traces (resolving symbolic information in a post-processing step) of all heap users, for all heap-modifying entry points, serially if needed. It can also modify the heap with pathological cases which align allocations such that the page protection offered by the VM system is optimally assigned (i.e. allocate your requested heap block at the end of a page, so even a single-byte overflow is detected at the time of the overflow).
umdh is a tool which can help assess the status at various checkpoints; however, the data is continually accumulated during the execution of the target, so it is not a simple checkpointing debug stop in the traditional sense. Also, a warning: last I checked at least, the total size of the circular buffer which stores the stack information for each request is somewhat small (64k entries (entry + stack)), so you may need to dump rapidly for heavy heap users. There are other ways to access this data, but umdh is fairly simple.
NOTE there are 2 modes;
MODE 1, umdh {-p:Process-id|-pn:ProcessName} [-f:Filename] [-g]
MODE 2, umdh [-d] {File1} [File2] [-f:Filename]
I do not know what insanity gripped the developer who chose to alternate between -p:foo argument specifiers and naked ordering of arguments, but it can get a little confusing.
The debugging SDK works with a number of other tools; memsnap is a tool which apparently focuses on memory leaks and such, but I have not used it, so your mileage may vary.
Execute gflags with no arguments for the UI mode; +args and /args are also different "modes" of use.
On Linux I've successfully used mtrace(3) to log allocations and frees. Its usage is as simple as:
Modify your program to call mtrace() when you need to begin tracing (e.g. at the top of main()),
Set environment variable MALLOC_TRACE to the file path where the trace should be saved and run the program.
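A minimal sketch of those two steps (paths and sizes are arbitrary); afterwards the mtrace(1) Perl script can be pointed at the binary and the trace file to get a readable leak summary:

    #include <mcheck.h>
    #include <stdlib.h>

    int main(void)
    {
        mtrace();                  /* start logging to the file named by $MALLOC_TRACE */

        char *kept = malloc(42);   /* never freed: will be reported as a leak */
        char *tmp  = malloc(100);  /* logged as "+ <addr> 0x64" ...           */
        free(tmp);                 /* ... and "- <addr>"                      */
        (void)kept;

        muntrace();                /* optional: stop tracing */
        return 0;
    }

Run it as MALLOC_TRACE=/tmp/trace.log ./a.out and inspect the raw file, or run mtrace ./a.out /tmp/trace.log for the summarized report.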
After that the output file will contain something like this (excerpt from the middle to show a failed allocation):
# /usr/lib/tls/libnvidia-tls.so.390.116:[0xf44b795c] + 0x99e5e20 0x49
# /opt/gcc-7/lib/libstdc++.so.6:(_ZdlPv+0x18)[0xf6a80f78] - 0x99beba0
# /usr/lib/tls/libnvidia-tls.so.390.116:[0xf44b795c] + 0x9a23ec0 0x10
# /opt/gcc-7/lib/libstdc++.so.6:(_ZdlPv+0x18)[0xf6a80f78] - 0x9a23ec0
# /opt/Xorg/lib/video-libs/libGL.so.1:[0xf668ee49] + 0x99c67c0 0x8
# /opt/Xorg/lib/video-libs/libGL.so.1:[0xf668f14f] - 0x99c67c0
# /opt/Xorg/lib/video-libs/libGL.so.1:[0xf668ee49] + (nil) 0x30000000
# /lib/libc.so.6:[0xf677f8eb] + 0x99c21f0 0x158
# /lib/libc.so.6:(_IO_file_doallocate+0x91)[0xf677ee61] + 0xbfb00480 0x400
# /lib/libc.so.6:(_IO_setb+0x59)[0xf678d7f9] - 0xbfb00480