How resilient is modern Rails to the antipattern "thread + fork"?

I think this is a common antipattern that occurs either standalone (for example, an ActiveJob task run with the async adapter) or in controllers, where the concurrency strategy of the webserver must also be taken into account.
My question is: what precautions should one take in the code when forking inside a thread (say, inside an ActiveJob task) and then even spawning threads inside the fork?
The main worries I have seen online are:
Database connections need to be discarded and reopened after the fork. It seems that nowadays ActiveRecord takes care of this, doesn't it?
Access to the shared Logger could be complicated. Somehow it seems to work.
concurrent-ruby was expected to be problematic too, but current versions are patched to detect that a fork has happened and that their threads are dead. Still, it seems one needs to make sure that, at the end of the forked process, any Concurrent pool that could have active or pending jobs is shut down cleanly. I think that it is enough to call
ActiveJob::Base.queue_adapter.shutdown
but perhaps it could miss tasks that have not started, or tasks on other Concurrent executors. In fact, I think that already happens if one uses Concurrent::Future in a controller served by the Puma webserver. As a generic measure I try to insert
Concurrent::global_io_executor.shutdown
Concurrent::global_io_executor.wait_for_termination
Extra problems I have found are resource-related: the Postgres server is not configured to handle so many connections by default. It might be sensible to reduce the size of the connection pool before the fork. The inotify file-watcher gem also exhausts resources when launched in development. Production is fine in both cases. A sketch combining these precautions follows.
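Putting the pieces together, here is a minimal sketch of the post-fork hygiene described above (do_the_work is a placeholder; whether the explicit reconnect is needed depends on your Rails version):
class ForkingJob < ApplicationJob
  def perform
    pid = fork do
      # The child inherits the parent's database sockets; reconnect rather
      # than reuse them. Recent Rails handles this automatically after fork,
      # but an explicit establish_connection is harmless on older versions.
      ActiveRecord::Base.establish_connection

      do_the_work

      # Drain concurrent-ruby's global executor before exiting so that
      # pending tasks are not silently dropped.
      Concurrent.global_io_executor.shutdown
      Concurrent.global_io_executor.wait_for_termination

      exit!(0) # skip at_exit hooks inherited from the parent
    end
    Process.detach(pid) # reap the child so it doesn't linger as a zombie
  end
end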

TL;DR: I'm against doing it, but many of us do it anyway and ignore the fact that it's unsafe... things break too rarely.
It is a simple fact that calling fork in a multi-threaded process may cause the new child to crash / deadlock / spin and may also cause other (harder to isolate) bugs.
This has nothing to do with Ruby. It is related to the locking mechanisms that safeguard critical sections and core process functionality such as opening/closing files, allocating memory, and any user-created mutex / spinlock, etc.
Why is it risky?
When calling fork, the new process inherits all the state of the parent process, but only the thread that called fork survives (all other threads do not exist in the new process).
This means that if any of the other threads was inside a critical section (i.e., allocating memory, opening a file, etc.), that critical section would remain locked for the lifetime of the new process, possibly causing deadlocks or unexpected errors.
Why do we ignore it?
In practical terms, the risk of something seriously breaking is often very low, and most developers have never both encountered the issue and recognized its cause. Open files can be manually (if not automatically) closed, which leaves us mostly with the question of critical sections.
We can often reset our own critical sections, which leaves mostly the system's critical sections...
The system's core critical sections that can be affected by fork are not that many. The main one is the memory allocator, which hardly ever breaks. Often the malloc implementation has multiple "arenas", each with its own critical section, and it would be a long shot to hit the system's underlying page allocation (i.e., mmap).
So is it safe?
No. Things still break, they just break rarely, and when they do it isn't always obvious. Also, a parent process can sometimes catch some of these errors and retry / recover, and there are other ways to handle the risks.
Should I do it?
I wouldn't recommend doing it, but it depends. If you can handle an error, sure, go ahead. If not, that's a no.
Anyway, it's usually much better to use IPC to forward a message to a background process, so that that process performs any required fork / task. A sketch of this approach follows.
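For example, one such pattern is a single-threaded "fork server" spawned at boot, before any threads exist, that receives work over a pipe (a sketch; handle_job is a placeholder):
reader, writer = IO.pipe

fork_server_pid = fork do
  writer.close
  # Only one thread lives in this process, so forking here is safe.
  while (line = reader.gets)
    pid = fork { handle_job(line.chomp) }
    Process.detach(pid)
  end
end
reader.close
Process.detach(fork_server_pid)

# Later, from any thread of the multi-threaded parent:
writer.puts("send_email:42")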

The pattern can occur naturally when a Rails controller is combined with a webserver. The situation differs depending on whether the webserver is threaded, forked, or evented, but the final conclusion is the same: it is safe.
Fork + fork and thread + fork should not present problems of duplicated database access or duplicated execution of the same code, as only the calling thread is active in the child.
Event + fork could be a source of trouble if the event loop is still active in the forked child. Fortunately, most designs run the event loop on a separate thread. For the forked case, the webserver's per-worker hooks are the natural place for reconnection logic, as sketched below.
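With a forking server such as Puma in cluster mode, the usual hook points look like this (a sketch; recent Rails reconnects ActiveRecord automatically, so the explicit call matters only on older stacks):
# config/puma.rb
workers 2

before_fork do
  # Runs once in the parent, just before workers are forked.
end

on_worker_boot do
  # Runs in each forked worker; reconnect anything holding sockets.
  ActiveRecord::Base.establish_connection if defined?(ActiveRecord)
end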

Related

Does the gen_server restart strategy copy state?

The Erlang world does not use try-catch the way mainstream languages do. I want to know how the performance of restarting a process compares with try-catch in a mainstream language.
An Erlang process has its own small stack and heap, which are actually allocated on the OS heap. Why is it efficient to restart one?
I hope someone can give me deep insight into what BEAM does when a restart operation is invoked on a process.
Besides, what about using a gen_server, which maintains state in its process? Will a state copy happen when the gen_server restarts?
Thanks
I recommend having a read of https://ferd.ca/the-zen-of-erlang.html
Here's my understanding: restarting is effective for fixing "Heisenbugs", which only happen when the (Erlang) process is in some weird state and/or is trying to handle a "weird" message.
The presumption is that you revert to a known good state (by restarting), which should handle all normal messages correctly. Restarting is not meant to "fix all the problems", and certainly not things like bad configuration or a missing internet connection. By this definition we can see that it would be very dangerous to copy the state at the moment the crash happened and try to recover from that, because that defeats the whole point of going back to a known state.
The second point is: say this process only crashes when handling an action that only 0.001% (or whatever percentage is considered negligible) of all your users actually use, and it's not really important (e.g. a minor UI detail). Then it's totally fine to just let it crash and restart, and you don't need to fix it. I think it can be a productivity enabler for these cases.
Regarding your question in the OP comment: yes, the state is just whatever your init callback returns; you can either build the entire starting state there or source it from other places. It totally depends on the use case.
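The "restart with fresh state, never with the crashed state" idea translates outside Erlang too. Here is a loose Ruby analogy of a supervisor loop (a sketch, not BEAM semantics; run_worker and initial_state are placeholders):
def supervise
  loop do
    state = initial_state   # rebuilt from scratch on every restart, like init/1
    begin
      run_worker(state)     # the worker's main loop
      break                 # normal termination: stop supervising
    rescue => e
      warn "worker crashed (#{e.class}), restarting with fresh state"
      # deliberately do NOT reuse the crashed worker's state
    end
  end
end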

Use of serialized target queues for Concurrent queues in iOS

I was going through this excellent blog post
(http://www.humancode.us/2014/08/14/target-queues.html)
about target queues in iOS, and I could not help but wonder why we need such a mechanism. In the example, we specify a serialised target queue for a custom concurrent queue. Could we not achieve the same by executing the blocks of the original concurrent queue on a serialised queue instead?
What's the point of having a serialised target queue for a concurrent queue?
If I understood you right, you're asking why someone would run serial tasks on a concurrent queue.
You would need that kind of behaviour when most tasks on some resource can be performed concurrently (i.e., simultaneously), but some tasks are, by nature, unsafe to perform concurrently with others.
The most common example is the readers/writers problem. Say you are accessing some resource of a file system. It's OK to read it even from different threads: every reader gets exactly what it needs. But then comes the necessity to update the contents of that file. Modifying it while someone reads leads to unpredictable results: the reader is not guaranteed to get the right, expected data (it may see partially the old version, partially the new). Even worse, there can be two writers (if the file's contents are changed both by the application user and from some central storage via the network), and the result will be some crazy mix of the two versions (actually, the file can even end up corrupted).
Hence the necessity for each writer to wait until all other tasks have completed (no one reads, no one writes), and for each reader to wait until no writing task is in progress (no one writes, no matter how many readers there are).
Wikipedia has a nice article on this one. I haven't run into any other practical situations where you would need this, but I believe there are more of them. A sketch of the same pattern outside GCD follows.
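For illustration, the same readers/writers discipline can be sketched with a read-write lock; here is a loose Ruby analogy using concurrent-ruby (not GCD code):
require 'concurrent'

lock = Concurrent::ReadWriteLock.new
contents = 'initial'

readers = 4.times.map do
  Thread.new do
    lock.with_read_lock { puts "read: #{contents}" } # many readers may hold this at once
  end
end

writer = Thread.new do
  lock.with_write_lock { contents = 'updated' } # excludes all readers and all writers
end

(readers + [writer]).each(&:join)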
Hope it answers your question

Interprocess SQLite Thread Safety (on iOS)

I'm trying to determine if my sqlite access to a database is thread-safe on iOS. I'm writing a non App Store app (or possibly a launch daemon), so Apple's approval isn't an issue. The database in question is the built-in sms.db, so for sure the OS is also accessing this database for reading and writing. I only want to be able to safely read it.
I've read this about reading from multiple processes with sqlite:
Multiple processes can have the same database open at the same time. Multiple processes can be doing a SELECT at the same time. But only one process can be making changes to the database at any moment in time, however.
I understand that thread-safety can be compiled out of sqlite, and that sqlite3_threadsafe() can be used to test for this. Running this on iOS 5.0.1
int safe = sqlite3_threadsafe();
yields a result of 2. According to this, that means mutex locking is available. But, that doesn't necessarily mean it's in use.
I'm not entirely clear on whether thread-safety is dynamically enabled on a per-connection, per-database, or global basis.
I have also read this. It looks like sqlite3_config() can be used to enable safe multi-threading, but of course, I have no control, or visibility into how the OS itself may have used this call (do I?). If I were to make that call again in my app, would it make it safe to read the database, or would it only deconflict concurrent access for multiple threads in my app that used the same sqlite3 database handle?
Anyway, my question is ...
can I safely read this database that's also accessed by iOS, and if so, how?
I've never used SQLite, but I've spent a decent amount of time reading its docs because I plan on using it in the future (and the docs are interesting). I'd say that thread safety is independent of whether multiple processes can access the same database file at once. SQLite, regardless of what threading mode it is in, will lock the database file, so that multiple processes can read from the database at once but only one can write.
Thread safety only affects how your process can use SQLite. Without any thread safety, you can only call SQLite functions from one thread. But it should still, say, take an EXCLUSIVE lock before writing, so that other processes can't corrupt the database file. Thread safety just protects data in your process's memory from getting corrupted if you use multiple threads. So I don't think you ever need to worry about what another process (in this case iOS) is doing with an SQLite database.
Edit: To clarify, any time you write to the database, including a plain INSERT/UPDATE/DELETE, it will automatically take an EXCLUSIVE lock, write to the database, then release the lock. (And it actually takes a SHARED lock, then a RESERVED lock, then a PENDING lock, then an EXCLUSIVE lock before writing.) By default, if the database is already locked (say from another process), then SQLite will return SQLITE_BUSY without waiting. You can call sqlite3_busy_timeout() to tell it to wait longer.
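As a loose illustration of that busy handling, from Ruby's sqlite3 gem rather than the C API (a sketch; the path and table name are only indicative of the sms.db case):
require 'sqlite3'

db = SQLite3::Database.new('/var/mobile/Library/SMS/sms.db', readonly: true)
db.busy_timeout = 1000 # wait up to 1000 ms for a lock instead of failing with SQLITE_BUSY
count = db.execute('SELECT COUNT(*) FROM message').first.first
puts "messages: #{count}"
db.close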
I don't think any of this is news to you, but a few thoughts:
In terms of enabling multi-threading (either serialized or multi-threaded), the general counsel is that one can invoke sqlite3_config() (but you may have to do a shutdown first as suggested in the docs or as discussed on SO here) to enable the sort of multi-threading you want. That may be of diminished usefulness here, though, where you have no control over what sort of access iOS is requesting of sqlite and/or this database.
Thus, I would have thought that, from an academic perspective, it would not be safe to read this system database (because as you say, you have no assurance of what the OS is doing). But I wouldn't be surprised if iOS is opening the database using whatever the default mode is, so from a more pragmatic perspective, you might be fine.
Clearly, for most users concerned about multi-threaded access within a single app, the best counsel would be to bypass the sqlite3_config() silliness and simply ensure coordinated access through your own GCD serial queue (i.e., have a dedicated queue through which all database interactions go, gracefully eliminating the multi-threading issue altogether). Sadly, that's not an option here, because you're trying to coordinate database interaction with iOS itself.

Threading in Rails - do params[] persist?

I am trying to spawn a thread in Rails. I am usually not comfortable using threads, as it requires in-depth knowledge of Rails' request/response cycle, yet I cannot avoid using one, as my request times out.
In order to avoid the timeout, I am using a thread within a request. My question here is simple. The thread I've spawned accesses a params[] variable inside it, and things seem to work OK for now. I want to know whether this is right. I'd be happy if someone could throw some light on using threads in Rails during the request/response cycle.
[Starting a bounty]
The short answer is yes, but only to a degree; the binding in which the thread was created will continue to persist. The params will still exist only as long as no one (including Rails) goes out of their way to modify or delete the params hash; instead, they rely on the garbage collector to clean up these objects. Since the thread has access to the context (called the "binding" in Ruby) in which it was created, all the variables reachable from that scope (effectively the entire state at the moment the thread was created) cannot be deleted by the garbage collector. However, as execution continues in the main thread, the values of the variables in that context can be changed by the main thread, or even by the thread you created, if it can access them. This is the benefit, and the downfall, of threads: they share memory with everything else.
You can emulate an environment very similar to Rails to test your problem, using a function like the one in this gist: http://gist.github.com/637719. As you can see, the code still prints 5.
However, this is not the correct way to do this. The better way to pass data to a thread is to pass it to Thread.new, like so:
# Always dup objects when passing them into a thread; otherwise you really
# haven't done yourself any good: it would still be the same object in memory.
Thread.new(params.dup) do |params|
puts params[:foo]
end
This way, you can be sure that any modifications to params will not affect your thread. The best practice is to use only data you pass to your thread in this way, or things that the thread itself created. Relying on the state of the program outside the thread is dangerous.
So as you can see, there are good reasons that this is not recommended. Multithreaded programming is hard, even in Ruby, and especially when you're dealing with as many libraries and dependencies as are used in Rails. In fact, Ruby seems to make it deceptively easy, but it's basically a trap. The bugs you will encounter are very subtle; they will happen "randomly" and be very hard to debug. This is why things like Resque, Delayed Job, and other background processing libraries are used so widely and recommended for Rails apps, and I would recommend the same.
The question is more whether Rails keeps the request open while the thread is running than whether it persists the value.
It won't persist the value once the request ends, and I also wouldn't recommend holding the request open unless there is a real need. As other users have said, some work is just better done in a delayed job.
Having said that, we used threading a couple of times to query multiple sources concurrently and actually reduce the response time of an app (it was only for admins, so it didn't need fast response times), and if memory serves correctly, the threads keep the request open if you call join at the end and wait for each thread to finish before continuing, as in the sketch below.
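A minimal sketch of that fan-out-and-join pattern inside a controller action (fetch_from is a placeholder for the actual query):
results = {}

threads = [:source_a, :source_b].map do |source|
  Thread.new { results[source] = fetch_from(source) }
end

threads.each(&:join) # the request stays open until every thread has finished
render json: results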

using Kernel#fork for backgrounding processes, pros? cons?

I'd like some thoughts on whether using fork {} to 'background' a process from a Rails app is such a good idea or not...
From what I gather, fork { my_method; Process.setsid } does in fact do what it's supposed to do:
1) creates another process with a different PID
2) doesn't interrupt the calling process (i.e., it continues without waiting for the fork to finish)
3) executes the child until it finishes
...which is cool, but is it a good idea? What exactly is fork doing? Does it create a duplicate instance of my entire Rails Mongrel/Passenger instance in memory? If so, that would be very bad. Or does it somehow avoid consuming a huge swath of memory?
My ultimate goal was to do away with my background daemon/queue system in favor of forking these processes (primarily sending emails) -- but if this won't save memory then it's definitely a step in the wrong direction
The fork does make a copy of your entire process and, depending on exactly how you are hooked up to the application server, a copy of that as well. As noted in the other discussion, this is done with copy-on-write, so it's tolerable. Unix is built around fork(2), after all, so it has to manage it fairly fast. Note that any partially buffered I/O, open files, and lots of other state are also copied, including program state that is spring-loaded to write them out; the child would then write them out again, which would be incorrect.
I have a few thoughts:
Are you using Action Mailer? It seems like email could easily be done with AM, or by IO.popen of something. (popen does a fork, but it is immediately followed by an exec.)
immediately get rid of all that state by calling Process.exec with another Ruby interpreter plus your functionality. If there is too much state to transfer, or you really need to use those duplicated file descriptors, you might do something like IO.popen instead, so you can send the subprocess work to do. The system will automatically share the pages containing the text of the subprocess's Ruby interpreter with the parent.
in addition to the above, you might want to consider the use of the daemons gem. While your rails process is already a daemon, using the gem might make it easier to keep one background task running as a batch job server, and make it easy to start, monitor, restart if it bombs, and shut down when you do...
if you do exit from a fork(2)ed subprocess, use exit! instead of exit, so that at_exit handlers inherited from the parent don't run in the child (see the sketch after this list)
having a message queue and a daemon already set up, like you do, kinda sounds like a good solution to me :-)
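For reference, the fork-and-detach shape being discussed looks roughly like this (deliver_email is a placeholder for the real work):
pid = fork do
  Process.setsid      # detach from the controlling terminal/session
  deliver_email       # the background work
  exit!(0)            # exit! skips at_exit handlers inherited from the parent
end
Process.detach(pid)   # let Ruby reap the child so it doesn't stay a zombie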
Be aware that it will prevent you from using JRuby on Rails as fork() is not implemented (yet).
The semantics of fork are to copy the entire memory space of the process into a new process, but many (most?) systems do that by just making a copy of the virtual memory tables and marking them copy-on-write. That means that (at first, at least) it doesn't use that much more physical memory, just enough to make the new tables and other per-process data structures.
That said, I'm not sure how well Ruby, RoR, etc. interact with copy-on-write forking. In particular, garbage collection could be problematic if it touches many memory pages (causing them to be copied).
