Dask Vs Multiprocessing when using C pointers - dask

When I use C pointers in Python and process them with dask, it works perfectly. But when I try Python's multiprocessing module, it spits out a pointer reference error.
How is dask able to succeed where the multiprocessing module fails when using C pointers?

The default scheduler for dask ("threaded") works with threads in the same process where you have defined your graph, so each worker has access to the original memory space - the C pointers are therefore valid. In a new process, whether using the builtin multiprocessing module or otherwise, you would need to recreate whatever C structures you need, and make new pointers to them; this can be done either in de/serialisation (dask-distributed has a lot of logic dedicated to this), or by reloading modules/data in the worker.
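To make the difference concrete, here is a minimal sketch that uses only the standard library. A `ThreadPoolExecutor` stands in for dask's threaded scheduler (an assumption made to keep the example dependency-free), and `pickle` stands in for the serialisation step that multiprocessing performs when it ships arguments to another process. The names `deref`, `read_in_thread` and `can_pickle` are made up for illustration.

```python
import ctypes
import pickle
from concurrent.futures import ThreadPoolExecutor

# A C int and a C pointer to it, both living in this process's memory.
value = ctypes.c_int(42)
ptr = ctypes.pointer(value)


def deref(p):
    """Follow the C pointer and return the integer it points to."""
    return p.contents.value


def read_in_thread(p):
    """Dereference the pointer from a worker thread (same address space)."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        return pool.submit(deref, p).result()


def can_pickle(obj):
    """Report whether obj survives the serialisation needed to cross processes."""
    try:
        pickle.dumps(obj)
        return True
    except (ValueError, TypeError, pickle.PicklingError):
        return False
```

The worker thread can follow the pointer because it shares the parent's memory, whereas ctypes refuses to pickle an object containing a pointer, since the raw address would be meaningless in another process.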

Related

Wasm Hot Reloading Experiment: Debunking Assumptions, and How to specify where the data section is?

First, to avoid making this seem like an XY problem, I'd like to give some context (note that I am not using Emscripten):
I am trying to see if I can implement a form of hot reloading for Wasm programs written in C++, hosted on the web. To do this, I want to have a section of memory that I call my "world state" (to anyone who has watched Handmade Hero ( https://handmadehero.org/ ), this will be familiar):
struct State {
// put everything here
} state;
Typically for a full C++ program with a platform layer, you'd allocate this struct on the platform side and feed a pointer to that memory through a function pointer in the reloadable/dll/dylib part of the code. The reloadable code puts EVERYTHING into this persistent memory so if the code needs to be recompiled and reloaded, all the state will continue to exist since the memory was allocated in the part of the program that wasn't reloaded. As far as I can tell, this is impossible in Wasm though.
Firstly, is my assumption correct that I have to use WebAssembly.Memory? Or can I allocate a Uint8Array in JS and use that for my persistent state, separate from the program memory? If so, is that slower?
So this will work as long as I don't use a dynamic allocator like WASI, and instead use a push allocator I can control. (I think this because, suppose I use malloc to get memory addresses and reload--malloc's internal state will reload and think all the heap memory is available when it's not, so future allocations might clobber previous ones.)
Upon reload, I can first copy the struct into a temporary buffer on the js side, reload, get the memory location of the struct from Wasm (I will require that it exists), and copy the saved memory from js back into position.
However this falls apart if I use pointers because if I change the program (which is the point) __data_end might change, which would offset all of the addresses! I checked the linker flags here https://lld.llvm.org/WebAssembly.html to see what I could control. I can specify that the stack comes before the data segment, but the heap would still come after that, which results in the same problem. I can also specify where the global data are located, but that's not the data segment I believe, so the variable-size data segment could still offset all of my addresses.
Here's a nice page that can help us visualize the Wasm memory: https://dassur.ma/things/c-to-webassembly/
Would anyone have any thoughts on how to achieve what I'd like? The only options I can think of involve somehow using memory outside the Wasm memory (possibly slower or impossible), using only stack memory and no pointers (unrealistic unless I can auto-recalculate all pointer offsets after a recompile, which would be painful and bug-prone), or finding a way to make the data segment come after the stack and heap at a fixed address, which would then guarantee that the stack and heap segments wouldn't get offset if the data segment needs to grow. Another option, if possible, would be to fix the max size of the data segment. The Wasm spec/documentation aren't really great when it comes to memory manipulation like this, so I'd appreciate some clarification about what's possible too. Lastly, maybe I could use two Wasm modules (but wouldn't that sort of indirection be slow)? I might be missing something crucial related to the memory layout.
Please let me know if you need more details. I've done something like this before in C, as I mentioned, and it's a common rapid iteration game-dev technique. Basically I'm trying to recreate it in Wasm.
EDIT: Apparently you can call Wasm functions from another module directly. Firstly, how do you do it, and secondly, what would the performance characteristics be for accessing the memory of another module?
EDIT2: Maybe some form of dynamic linking if that's supported? https://webassembly.org/docs/dynamic-linking/
WebAssembly modules hold variable state in three distinct places:
Linear memory
Local variables associated with the execution stack
Global variables
Of these, only global variables and linear memory are accessible to the host environment, and potentially serialisable in order to cache them as you hot-reload your module. There is of course no way to directly access and store the current call-stack.
If I were looking to achieve this, I'd create my own state machine within WebAssembly, storing this within a known location within linear memory.
Wasm is organised into modules, and modules define four relevant kinds of entities: functions, memories, tables, globals. The code is in the functions, while the other three represent a module's state.
Now, the interesting thing is that all four of these entity kinds can be imported and exported. Moreover, all of them can be created externally to the module, e.g., by the JS API.
Consequently, a way to emulate code swapping is to set up your module such that all three pieces of state are created externally and imported into the module. That way, you can keep them alive externally and pass them to the upgraded module once available. (You also need to make sure that the upgraded module doesn't use data/element segments or start functions in a way that paves over preexisting state.)
Of course, this only works if the shape of the module's state does not change between upgrades. E.g., no new globals, no new data layout in memory, otherwise the new code won't understand the old state. That is actually the hard part of the problem, but it's independent from Wasm specifics.

Constant memory dumps

Is it possible to constantly dump the memory of a process to record every change that is happening? For example if I have a program that modifies the contents of an array I'd like to know the contents of that array before some modification. I imagine a program could save the initial memory and then all changes in a file and I'd just search the file by the modified contents of the array which I know. Then I'd look for changes in that specific memory location before that moment and find the initial contents.
Does a program like that exist? If so, what program would you recommend?
EDIT: I wrote a program in C++ that captures packets of another process using pcap and I would like to know how these packets are constructed inside that program. I'm using Windows.
Notice that memory content is (or may be) changing a lot faster than what a disk is capable of writing.
Also, your question is OS specific. I guess that you are using Linux.
In all cases, design your application very early with your goals.
Perhaps you are looking for application checkpointing. If on Linux, consider BLCR.
Perhaps you are looking for some persistence mechanism. A possible way might be to explicitly persist the state of your application at some points in your program which are executed frequently. Persistence of the call stack or of continuations is a difficult issue.
You may want to use textual formats (like JSON) for serialization. You could be interested in database technology, either relational SQL (e.g. SQLite or PostgreSQL) or NoSQL (e.g. MongoDB).
Persistence and checkpointing may be related to garbage collection algorithms (notably copying GC).
Some language implementations are able to persist their entire heap. For example, in Common Lisp, the SBCL implementation offers save-lisp-and-die
For debugging, you might want watchpoints, or the gcore(1) command.
Notice that if you fork(2) your process and immediately make the child process sleep or idle, you are keeping a snapshot of your address space in that child process.
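The fork-as-snapshot idea can be sketched as follows, assuming a Unix-like system where `os.fork` is available. The function name `snapshot_via_fork` and the pipe-based handshake are illustrative choices, not part of any existing tool. The child idles until asked, then reports the state as it was at the moment of the fork.

```python
import os


def snapshot_via_fork(state):
    """Fork, mutate `state` in the parent, then read the old state from the child."""
    result_r, result_w = os.pipe()   # child -> parent: the snapshot
    go_r, go_w = os.pipe()           # parent -> child: "report now"
    pid = os.fork()
    if pid == 0:
        # Child: holds a copy-on-write image of the parent's memory at fork time.
        os.close(result_r)
        os.close(go_w)
        os.read(go_r, 1)                            # idle until the parent asks
        os.write(result_w, repr(state).encode())
        os._exit(0)                                 # skip normal interpreter teardown
    # Parent: change the data after the snapshot was taken.
    os.close(result_w)
    os.close(go_r)
    state[0] = 999
    os.write(go_w, b"x")
    snapshot = os.read(result_r, 4096).decode()
    os.waitpid(pid, 0)
    os.close(result_r)
    os.close(go_w)
    return snapshot, state
```

The parent sees its own modification while the child still reports the pre-modification contents, which is the behaviour the question asks for, at the granularity of one snapshot per fork.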
Read also about transactional memory & ACID properties

How to convert synchronous blocking shared memory model code to asynchronous coroutines running on thread pool?

While there are lots of solutions matching my question partially, I'd like to know if a complete match exists. It's hard to find a complete solution because of these partial ones occupying search results. This should be a runtime framework and (optionally) a transformation required to source language code when the language doesn't support coroutines.
There are libraries like lthread that have an lthread_cond_wait() API, but every lthread is bound to a single pthread. I'd like lightweight threads to be able to run on several pthreads, picked arbitrarily by a thread pool. Neither single-threaded schedulers nor global-lock schedulers fit. I think we can do better.
lthreads is also not an option because it neither involves source code transformation nor avoids it like protothreads.
Several green-threading runtimes (Erlang, Limbo) don't match because they are limited to CSP (communicating sequential processes) model only, but I'd like to have shared memory model synchronization primitives as well: mutexes, condition variables, rwlocks.
Transformation involves:
Transforming stack contexts into objects in heap
Transforming mutex calls into operations that disable and activate jobs on the thread pool, using publish-subscribe
Condition variables should also be transformed into publish-subscribe relationships
It would be nice to have Ada-style rendezvous
I failed to write a straightforward runtime implementation because of potential deadlocks in the publish-subscribe mechanism when no global lock or single scheduler thread is used, but I still think this is possible.
Disclaimer: lthread author.
You can launch several pthreads and run an lthread scheduler in each one (this is done automagically by calling lthread_run() in the pthread function). This way each pthread will run a bunch of lthreads.

How can I get beam size for Erlang?

I have a legacy Erlang program that needs optimizations. This piece of code uses up to 20G memory in run time. I'm wondering if there is a way to get the Erlang Beam size of the process itself in run time? If that is possible then I can do something like if beam size>10GB then reject all calls to gen_server process. Thanks for the help!
Perhaps you could use some process_info data:
{memory, Size}:
Size is the size in bytes of the process. This includes call
stack, heap and internal structures.
process_info(self(), memory).
{memory,17128}
Start by calling memory() from the shell to learn whether the memory is being held in binaries, ETS, processes and so on. Next, if a process is the culprit, you can ask a tool like etop to show you the processes using the most memory. This can often track down the problem.
If the problem is ETS or binaries, then you may be keeping certain large binaries around for a long time due to sub-binary pointers inside them. This needs GC tweaks to fix.

How to log mallocs

This is a bit hypothetical and grossly simplified but...
Assume a program that will be calling functions written by third parties. These parties can be assumed to be non-hostile but can't be assumed to be "competent". Each function will take some arguments, have side effects and return a value. They have no state while they are not running.
The objective is to ensure they can't cause memory leaks by logging all mallocs (and the like) and then freeing everything after the function exits.
Is this possible? Is this practical?
p.s. The important part to me is ensuring that no allocations persist so ways to remove memory leaks without doing that are not useful to me.
You don't specify the operating system or environment, this answer assumes Linux, glibc, and C.
You can set __malloc_hook, __realloc_hook, and __free_hook to point to functions which will be called from malloc(), realloc(), and free() respectively. There is a __malloc_hook manpage showing the prototypes. You can track allocations in these hooks, then return to let glibc handle the memory allocation/deallocation. (Note that these hooks were deprecated and then removed from the API in glibc 2.34.)
It sounds like you want to free any live allocations when the third-party function returns. There are ways to have gcc automatically insert calls at every function entrance and exit using -finstrument-functions, but I think that would be inelegant for what you are trying to do. Can you have your own code call a function in your memory-tracking library after calling one of these third-party functions? You could then check if there are any allocations which the third-party function did not already free.
First, you have to provide the entrypoints for malloc() and free() and friends. Because this code is compiled already (right?) you can't depend on #define to redirect.
Then you can implement these in the obvious way and log that they came from a certain module by linking those routines to those modules.
The fastest way involves no logging at all. If the amount of memory they use is bounded, why not pre-allocate all the "heap" they'll ever need and write an allocator out of that? Then when it's done, free the entire "heap" and you're done! You could extend this idea to multiple heaps if it's more complex than that.
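As a rough model of the pre-allocated "heap" idea, here is a bump allocator over a fixed buffer. It is written in Python purely for illustration: a real version would be a C allocator substituted for malloc()/free() in the third-party code, and the `Arena` class and its method names are hypothetical. Allocations are returned as offsets into the buffer, and a single reset releases everything at once.

```python
class Arena:
    """Fixed-size region from which allocations are carved by bumping an offset."""

    def __init__(self, size):
        self.buf = bytearray(size)   # the whole "heap", reserved up front
        self.top = 0                 # offset of the first unused byte

    def alloc(self, nbytes, align=8):
        """Return the offset of a fresh block of nbytes, aligned to `align`."""
        start = (self.top + align - 1) // align * align
        if start + nbytes > len(self.buf):
            raise MemoryError("arena exhausted")
        self.top = start + nbytes
        return start

    def reset(self):
        """Release every allocation at once; no per-block bookkeeping is needed."""
        self.top = 0
```

Calling `reset()` after the third-party function returns plays the role of freeing the entire "heap": nothing the function allocated can outlive the call, whether or not it remembered to free it.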
If you really do need to "log" and not make your own allocator, here are some ideas. One, use a hash table with pointers and internal chaining. Another would be to allocate extra space in front of every block and put your own structure there containing, say, an index into your "log table," then keep a free-list of log table entries (as a stack so getting a free one or putting a free one back is O(1)). This takes more memory but should be fast.
Is it practical? I think it is, so long as the speed-hit is acceptable.
You could run the third party functions in a separate process and close the process when you are done using the library.
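A minimal sketch of the separate-process approach, assuming a Unix-like system where the "fork" start method is available. `leaky_third_party` is a made-up stand-in for an untrusted function, and `run_isolated` and `parent_has_leak` are hypothetical helper names. Whatever the function allocates lives in the child and is reclaimed by the OS when the child exits.

```python
import multiprocessing as mp


def leaky_third_party(n):
    """Hypothetical third-party function that 'leaks' by stashing a buffer globally."""
    global _leak
    _leak = bytearray(n)
    return len(_leak)


def _child(conn, func, args):
    """Run func in the child and send back either its result or its error."""
    try:
        conn.send(("ok", func(*args)))
    except Exception as exc:
        conn.send(("error", repr(exc)))
    finally:
        conn.close()


def run_isolated(func, *args):
    """Call func(*args) in a forked child so its allocations die with the child."""
    ctx = mp.get_context("fork")
    parent_conn, child_conn = ctx.Pipe(duplex=False)
    proc = ctx.Process(target=_child, args=(child_conn, func, args))
    proc.start()
    child_conn.close()               # the parent only reads
    status, value = parent_conn.recv()
    proc.join()
    if status == "error":
        raise RuntimeError(value)
    return value


def parent_has_leak():
    """True if the leaked global exists in the calling (parent) process."""
    return "_leak" in globals()
```

The return value has to be serialisable to cross the process boundary, which fits the stated constraint that the functions keep no state between calls.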
A better solution than attempting to log mallocs might be to sandbox the functions when you call them—give them access to a fixed segment of memory and then free that segment when the function is done running.
Unconfined, incompetent memory usage can be just as damaging as malicious code.
Can't you just force them to allocate all their memory on the stack? This way it would be guaranteed to be freed after the function exits.
In the past I wrote a software library in C that had a memory management subsystem that contained the ability to log allocations and frees, and to manually match each allocation and free. This was of some use when attempting to find memory leaks, but it was difficult and time consuming to use. The number of logs was overwhelming, and it took an extensive amount of time to understand the logs.
That being said, if your third party library has extensive allocations, it's more than likely impractical to track this via logging. If you're running in a Windows environment, I would suggest using a tool such as Purify[1] or BoundsChecker[2] that should be able to detect leaks in your third party libraries. The investment in the tool should pay for itself in time saved.
[1]: http://www-01.ibm.com/software/awdtools/purify/ Purify
[2]: http://www.compuware.com/products/devpartner/visualc.htm BoundsChecker
Since you're worried about memory leaks and talking about malloc/free, I assume you're in C. I'm also assuming based on your question that you do not have access to the source code of the third party library.
The only thing I can think of is to examine memory consumption of your app before & after the call, log error messages if they're different and convince the third party vendor to fix any leaks you find.
If you have money to spare, then consider using Purify to track issues. It works wonders, and does not require source code or recompilation. There are also other debugging malloc libraries available that are cheaper. Electric Fence is one name I recall. That said, the debugging hooks mentioned by Denton Gentry seem interesting too.
If you're too poor for Purify, try Valgrind. It is a lot better than it was 6 years ago and a lot easier to dive into than Purify.
Microsoft Windows provides (use SUA if you need a POSIX environment) quite possibly the most advanced infrastructure for the heap (and other APIs known to use the heap) of any shipping OS today.
The __malloc() debug hooks and the associated CRT debug interfaces are nice for cases where you have the source code to the tests; however, they can often miss allocations by standard libraries or other linked code. This is expected, as they are the Visual Studio heap debugging infrastructure.
gflags is a very comprehensive and detailed set of debugging capabilities which has been included with Windows for many years. It has advanced functionality for both source and binary-only use cases (as it is the OS heap debugging infrastructure).
It can log full stack traces (repaginating symbolic information in a post-process operation) of all heap users, for all heap-modifying entry points, serially if needed. It can also modify the heap with pathological cases, aligning the allocation of data so that the page protection offered by the VM system is optimally assigned (i.e. it allocates your requested heap block at the end of a page, so even a single-byte overflow is detected at the time of the overflow).
umdh is a tool which can help assess the status at various checkpoints; however, the data is continually accumulated during the execution of the target, so it is not a simple checkpointing debug stop in the traditional sense. Also, WARNING: last I checked, at least, the total size of the circular buffer which stores the stack information for each request is somewhat small (64k entries (entries+stack)), so you may need to dump rapidly for heavy heap users. There are other ways to access this data, but umdh is fairly simple.
NOTE there are 2 modes;
MODE 1, umdh {-p:Process-id|-pn:ProcessName} [-f:Filename] [-g]
MODE 2, umdh [-d] {File1} [File2] [-f:Filename]
I do not know what insanity gripped the developer who chose to alternate between -p:foo argument specifiers and naked ordering of arguments, but it can get a little confusing.
The debugging SDK works with a number of other tools. memsnap is a tool which apparently focuses on memory leaks and such, but I have not used it; your mileage may vary.
Execute gflags with no arguments for the UI mode; +args and /args are different "modes" of use as well.
On Linux I've successfully used mtrace(3) to log allocations and frees. Its usage is as simple as:
Modify your program to call mtrace() when you need to begin tracing (e.g. at the top of main()),
Set environment variable MALLOC_TRACE to the file path where the trace should be saved and run the program.
After that the output file will contain something like this (excerpt from the middle to show a failed allocation):
# /usr/lib/tls/libnvidia-tls.so.390.116:[0xf44b795c] + 0x99e5e20 0x49
# /opt/gcc-7/lib/libstdc++.so.6:(_ZdlPv+0x18)[0xf6a80f78] - 0x99beba0
# /usr/lib/tls/libnvidia-tls.so.390.116:[0xf44b795c] + 0x9a23ec0 0x10
# /opt/gcc-7/lib/libstdc++.so.6:(_ZdlPv+0x18)[0xf6a80f78] - 0x9a23ec0
# /opt/Xorg/lib/video-libs/libGL.so.1:[0xf668ee49] + 0x99c67c0 0x8
# /opt/Xorg/lib/video-libs/libGL.so.1:[0xf668f14f] - 0x99c67c0
# /opt/Xorg/lib/video-libs/libGL.so.1:[0xf668ee49] + (nil) 0x30000000
# /lib/libc.so.6:[0xf677f8eb] + 0x99c21f0 0x158
# /lib/libc.so.6:(_IO_file_doallocate+0x91)[0xf677ee61] + 0xbfb00480 0x400
# /lib/libc.so.6:(_IO_setb+0x59)[0xf678d7f9] - 0xbfb00480
