Threading in Rails - do params[] persist? - ruby-on-rails

I am trying to spawn a thread in Rails. I am usually not comfortable using threads, as doing so well requires in-depth knowledge of Rails' request/response cycle, yet I cannot avoid using one because my request times out.
In order to avoid the timeout, I am using a thread within a request. My question here is simple. The thread that I've spawned accesses a params[] variable inside it. Things seem to work OK for now, but I want to know whether this is correct. I'd be happy if someone could shed some light on using threads in Rails during the request/response cycle.
[Starting a bounty]

The short answer is yes, but only to a degree: the binding in which the thread was created continues to persist. The params will still exist as long as nothing (including Rails) goes out of its way to modify or delete the params hash; instead, Rails relies on the garbage collector to clean up these objects. Since the thread has access to the context in which it was created (called the "binding" in Ruby), every variable reachable from that scope (effectively the entire program state at the moment the thread was created) cannot be reclaimed by the garbage collector. However, as execution continues in the main thread, the values of the variables in that context can be changed by the main thread, or even by the thread you created, if it can access them. This is the benefit, and the downfall, of threads: they share memory with everything else.
You can emulate an environment very similar to Rails to test your problem, using a script such as this one: http://gist.github.com/637719. As you can see, the code still prints 5.
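In case the gist is unavailable, here is a minimal standalone sketch of the same idea (the exact gist contents are an assumption; this just demonstrates that a thread keeps its creation binding alive after the enclosing method returns):

```ruby
# A thread created inside a method keeps that method's binding alive,
# so it can still read local variables after the method has returned.
def fake_request
  params = { foo: 5 }
  Thread.new do
    sleep 0.1        # simulate work continuing after the "request" ends
    params[:foo]     # the binding persists: this still sees 5
  end
end

thread = fake_request   # the method has returned; its scope is "gone"
puts thread.value       # => 5
```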
However, this is not the correct way to do this. The better way to pass data to a thread is to pass it to Thread.new, like so:
# Always dup objects when passing them into a thread; otherwise you
# haven't done yourself any good: it would still be the same memory.
Thread.new(params.dup) do |params|
  puts params[:foo]
end
This way, you can be sure that any modifications to params will not affect your thread. The best practice is to only use data you pass to your thread in this way, or things that the thread itself created. Relying on the state of the program outside the thread is dangerous.
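A small self-contained sketch of the difference (names are illustrative): the dup'd hash is isolated from later mutation, while a thread that closes over the original sees the change:

```ruby
shared = { foo: "original" }
gate1, gate2 = Queue.new, Queue.new   # used only to make the race deterministic

# Passing a dup: the thread gets its own shallow copy of the hash.
t1 = Thread.new(shared.dup) do |params|
  gate1.pop          # wait until the main thread has mutated `shared`
  params[:foo]
end

# Closing over the original: the thread shares memory with the caller.
t2 = Thread.new do
  gate2.pop
  shared[:foo]
end

shared[:foo] = "mutated"
gate1 << :go
gate2 << :go

puts t1.value   # => "original" (the copy was unaffected)
puts t2.value   # => "mutated"  (shared memory sees the change)
```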
So as you can see, there are good reasons that this is not recommended. Multithreaded programming is hard, even in Ruby, and especially when you're dealing with as many libraries and dependencies as are used in Rails. In fact, Ruby seems to make it deceptively easy, but it's basically a trap. The bugs you will encounter are very subtle; they will happen "randomly" and be very hard to debug. This is why things like Resque, Delayed Job, and other background processing libraries are used so widely and recommended for Rails apps, and I would recommend the same.

The question is more whether Rails keeps the request open whilst the thread is running than whether it persists the value.
It won't persist the value once the request ends, and I also wouldn't recommend holding the request open unless there is a real need. As other users have said, some work is just better done in a delayed job.
Having said that, we used threading a couple of times to query multiple sources concurrently and actually reduce the response time of an app (it was only for admins, so it didn't need fast response times). If memory serves correctly, the thread can keep the request open if you call join at the end and wait for each thread to finish before continuing.
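A sketch of that fan-out-and-join pattern; the source names and fetch logic here are stand-ins, but the join-at-the-end shape is the point:

```ruby
# Fan out to several slow sources inside one request, then join so the
# request stays open until every thread has finished.
def fetch_concurrently(sources)
  results = {}
  threads = sources.map do |name, fetcher|
    Thread.new { results[name] = fetcher.call }
  end
  threads.each(&:join)   # block here; the response is built afterwards
  results
end

# Usage with stand-in "sources" (both run concurrently, so the total
# wall time is roughly one sleep, not two):
fetch_concurrently(
  a: -> { sleep 0.1; "from A" },
  b: -> { sleep 0.1; "from B" }
)
# => { a: "from A", b: "from B" } (key order may differ)
```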

Related

How resilient is modern Rails to the antipattern "thread + fork"?

I think this is a popular antipattern that happens either standalone (for example, an ActiveJob task run locally with async) or in controllers, where the concurrency strategy of the server must also be taken into account.
My question is: what cautions should one take in the code when forking inside a thread (think inside an ActiveJob task) and then even spawning threads inside the fork?
The main worries I have seen online are:
One needs to close and reopen the database connections after the fork. It seems that nowadays ActiveRecord takes care of this, doesn't it?
Access to the common Logger could be complicated. Somehow it seems to work.
Concurrent was expected to be problematic too, but current versions are patched to detect that a fork has happened and that its threads are dead. Still, it seems one needs to make sure to do, at the end of the forked process, a clean shutdown of any Rails::Concurrent pool that could have active or pending jobs. I think it is enough to call
ActiveJob::Base.queue_adapter.shutdown
but perhaps it could miss some tasks that have not started, or tasks in another Concurrent queue. In fact, I think this already happens if one uses Concurrent::Future in a controller managed by the Puma webserver. Generically, I try to insert
Concurrent::global_io_executor.shutdown
Concurrent::global_io_executor.wait_for_termination
Extra problems I have found are resource-related: the Postgres server is not ready to manage so many connections by default. Perhaps it would be sensible to reduce the size of the connection pool before the fork. The inotify watcher gem also exhausts resources when launched in development. Production is fine in both cases.
TL;DR: I'm against doing it, but many of us do it anyway and ignore the fact that it's unsafe... things break too rarely.
It is a simple fact that calling fork in a multi-threaded process may cause the new child to crash / deadlock / spin and may also cause other (harder to isolate) bugs.
This has nothing to do with Ruby; it is related to the locking mechanisms that safeguard critical sections and core process functionality such as opening/closing files, allocating memory, and any user-created mutex/spinlock, etc.
Why is it risky?
When calling fork, the new process inherits all the state of the previous process, but only the thread that called fork survives (all other threads do not exist in the new process).
This means that if any of the other threads was inside a critical section (i.e., allocating memory, opening a file, etc.), that critical section would remain locked for the lifetime of the new process, possibly causing deadlocks or unexpected errors.
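The thread-dropping part is easy to observe directly in plain Ruby (the deadlock risk itself is not safely demonstrable in a short example); this sketch assumes a platform where fork is available:

```ruby
# Only the thread that calls fork exists in the child process.
background = Thread.new { sleep }   # parent now has a second, parked thread

reader, writer = IO.pipe
pid = fork do
  writer.write(Thread.list.size)   # the child sees only its own thread
  writer.close
  exit!                            # skip at_exit handlers in the child
end

writer.close
Process.wait(pid)
child_thread_count = reader.read
puts child_thread_count   # => "1" (the parent's background thread is gone)
background.kill
```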
Why do we ignore it?
In practical terms, the risk of something seriously breaking is often very low, and most developers have never both encountered the issue and recognized its cause. Open files can be manually (if not automatically) closed, which leaves us mostly with the question of critical sections.
We can often reset our own critical sections which leaves mostly the system's critical sections...
The system's core critical sections that can be affected by fork are not that many. The main one is the memory allocator, which can hardly ever break. Often the malloc implementation has multiple "arenas", each with its own critical section, and it would be a long shot to hit the system's underlying page allocation (i.e., mmap).
So is it safe?
No. Things still break; they just break rarely, and when they do it isn't always obvious. Also, a parent process can sometimes catch some of these errors and retry / recover, and there are other ways to handle the risks.
Should I do it?
I wouldn't recommend doing it, but it depends. If you can handle an error, sure, go ahead. If not, that's a no.
Anyway, it's usually much better to use IPC to forward a message to a background process, so that that process performs any required fork / task.
The pattern can occur naturally when a Rails controller is combined with a webserver. The situation differs depending on whether the webserver is threaded, forked, or evented, but the final conclusion is the same: it is safe.
Fork + fork and thread + fork should not present problems of multiple access to the database or multiple execution of the same code, as only the current thread is active in the children.
Event + fork could be a source of trouble if the event machine is still active in the forked process. Fortunately, most designs generate a separate thread to control the event loop.

Rails / ActiveRecord: avoid two threads updating model at the same time with locks

Let's say I need to be sure ModelName can't be updated at the same time by two different Rails threads; this can happen, for example, when a webhook POST to the application tries to modify it at the same time some other code is running.
Per Rails documentation, I think the solution would be to use model_name_instance.with_lock, which also begins a new transaction.
This works fine and prevents simultaneous UPDATES to the model, but it does not prevent other threads from reading that table row while the with_lock block is running.
I can prove that with_lock does not prevent other READS by doing this:
Open 2 rails consoles;
On console 1, type something like ModelName.last.with_lock { sleep 30 }
On console 2, type ModelName.last. You'll be able to read the model no problem.
On console 2, type ModelName.last.update_columns(updated_at: Time.now). You'll see it will wait for the 30-second lock to be released before it finishes.
This proves that the lock DOES NOT prevent reading, and as far as I could tell there's no way to lock the database row from being read.
This is problematic because, if two threads run the same method at the EXACT same time, and I must decide whether to run the with_lock block based on some previous checks on the model data, thread 2 could be reading stale data that will soon be updated by thread 1 once it finishes the with_lock block already in progress: thread 2 CAN READ the model while the with_lock block runs in thread 1; it only can't UPDATE it, because of the lock.
EDIT: I found the answer to this question, so you can stop reading here and go straight to it below :)
One idea that I had was to begin the with_lock block issuing a harmless update to the model (like model_instance_name.update_columns(updated_at: Time.now) for instance), and then following it with a model_name_instance.reload to be sure that it gets the most updated data. So if two threads are running the same code at the same time, only one would be able to issue the first update, while the other would need to wait for the lock to be released. Once it is released, it would be followed with that model_instance_name.reload to be sure to get any updates performed by the other thread.
The problem is that this solution seems way too hacky for my taste, and I'm not sure I should be reinventing the wheel here (I don't know if I'm missing any edge cases). How does one assure that, when two threads run the exact same method at the exact same time, one thread waits for the other to finish before even reading the model?
Thanks Robert for the Optimistic Locking info, I could definitely see me going that route, but Optimistic locking works by raising an exception on the moment of writing to the database (SQL UPDATE), and I have a lot of complex business logic that I wouldn't even want to run with the stale data in the first place.
This is how I solved it, and it was simpler than what I imagined.
First of all, I learned that pessimistic locking DOES NOT prevent any other threads from reading that database row.
But I also learned that with_lock initiates the lock immediately, regardless of whether you attempt a write or not.
So if you start 2 rails consoles (simulating two different threads), you can test that:
If you type ModelName.last.with_lock { sleep 30 } on Console 1 and ModelName.last on Console 2, Console 2 can read that record immediately.
However, if you type ModelName.last.with_lock { sleep 30 } on Console 1 and ModelName.last.with_lock { p "I'm waiting" } on Console 2, Console 2 will wait for the lock held by Console 1, even though it's not issuing any write whatsoever.
So that's a way of 'locking the read': if you have a piece of code that you want to be sure won't run simultaneously (not even for reads!), begin that method by opening a with_lock block and issue your model reads inside it, so that they wait for any other locks to be released first. If you issue your reads outside it, they will be performed even though some other piece of code in another thread holds a lock on that table row.
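Putting that into a controller-ish sketch (ModelName, the `pending?` predicate, and the status value are hypothetical placeholders):

```ruby
# Serialize the whole check-then-update sequence: reads done inside
# with_lock wait for any rival with_lock on the same row to finish.
def process_webhook(id)
  record = ModelName.find(id)   # plain find: NOT blocked by anyone's lock
  record.with_lock do
    # with_lock reloads `record` under SELECT ... FOR UPDATE, so these
    # checks never see data another thread is about to overwrite
    record.update!(status: "processed") if record.pending?
  end
end
```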
Some other nice things I learned:
As per the Rails documentation, with_lock will not only start a transaction with a lock, but will also reload your model for you, so you can be sure that inside the block your instance is in its most up-to-date state, since with_lock issues a .reload on it.
There are some gems designed specifically to prevent the same piece of code from running at the same time in multiple threads (which, I believe, is how the majority of Rails apps run in production), regardless of the database lock. Take a look at redis-mutex, redis-semaphore and redis-lock.
There are many articles on the web (I could find at least 3) stating that Rails' with_lock will prevent a READ on the database row, while we can easily see from the tests above that this is not the case. Take care, and always confirm information by testing it yourself! I tried to comment on them warning about this.
You were close: you want optimistic locking instead of pessimistic locking: http://api.rubyonrails.org/classes/ActiveRecord/Locking/Optimistic.html .
It won't prevent reading an object and submitting a form. But it can detect that the form was submitted when the user was seeing stale version of the object.
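A hedged sketch of how that detection is typically handled (the retry policy and method name are mine, not from the Rails docs): with a `lock_version` column on the table, a stale write raises ActiveRecord::StaleObjectError, which you can rescue:

```ruby
# Assumes an ActiveRecord model whose table has a `lock_version` column.
# A real implementation should cap the number of retries.
def update_with_retry(record, attrs)
  record.update!(attrs)
rescue ActiveRecord::StaleObjectError
  record.reload   # another thread won the race: re-read fresh data...
  retry           # ...and re-apply the change (or re-run your checks)
end
```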

What happens to the execution context when I call Thread.new in a Rails/Rack request?

Say three Rails instances have been spawned and are taking requests.
For each request, Thread.new is called a bunch of times to put some cross-cutting concerns out of band and return the response quicker.
What confuses me is what happens under the covers with the Rails instances.
I would guess that the Rails instance will not be ready to handle another request until all the threads spawned have finished executing?
To do this safely, you need to enable the threadsafe option in rails:
config.threadsafe!
Otherwise, results will probably be undefined (depending on what operations you perform).
In either case, Rails will not block on the threads before handling another request. For this to happen, you would have to call #join on the threads. The threads will happily continue processing in the background. If you are after something more specific, please post some code.
Note that normally a library such as delayed_job or girl_friday is used to provide background processing rather than instantiating threads yourself. See the documentation for the latter for a good argument on why you should also use a Ruby implementation with good threading support (I use JRuby) if you are doing this extensively.

Safety of Thread.current[] usage in rails

I keep getting conflicting opinions on the practice of storing information in the Thread.current hash (e.g., the current_user, the current subdomain, etc.). The technique has been proposed as a way to simplify later processing within the model layer (query scoping, auditing, etc.).
Why are my thread variables intermittent in Rails?
Alternative to using Thread.current in API wrapper for Rails
Are Thread.current[] values and class level attributes safe to use in rails?
Many consider the practice unacceptable because it breaks the MVC pattern.
Others express concerns about reliability/safety of the approach, and my 2-part question focuses on the latter aspect.
Is the Thread.current hash guaranteed to be available and private to one and only one response, throughout its entire cycle?
I understand that a thread, at the end of a response, may well be handed over to other incoming requests, thereby leaking any information stored in Thread.current. Would clearing such information before the end of the response (e.g. by executing Thread.current[:user] = nil from a controller's after_filter) suffice in preventing such security breach?
Thanks!
Giuseppe
There is no specific reason to stay away from thread-local variables; the main issues are:
it's harder to test them, as you will have to remember to set the thread-local variables when testing code that uses them
classes that use thread locals need to know that these objects are only available to them inside a thread-local variable, and this kind of indirection usually breaks the Law of Demeter
not cleaning up thread locals might be an issue if your framework reuses threads (the thread-local variable would already be initialized, and code that relies on ||= calls to initialize variables might fail)
So, while it's not completely out of the question to use them, the best approach is to avoid them. From time to time, though, you hit a wall where a thread local is the simplest possible solution short of changing quite a lot of code, and you have to compromise: accept a less-than-perfect object-oriented model with the thread local, or change quite a lot of code to achieve the same thing.
So, it's mostly a matter of deciding which is the best solution for your case. If you really do go down the thread-local path, I'd advise you to use blocks that remember to clean up after they are done, like the following:
around_filter :do_with_current_user

def do_with_current_user
  Thread.current[:current_user] = self.current_user
  begin
    yield
  ensure
    Thread.current[:current_user] = nil
  end
end
This ensures the thread-local variable is cleaned up, so it won't leak into the next request if this thread is recycled.
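The leak that the ensure guards against is easy to show in plain Ruby (the user value here is a stand-in):

```ruby
# Without the ensure, a reused thread would still carry the previous
# request's user; with it, the slot is empty again afterwards.
def handle_request(user)
  Thread.current[:current_user] = user
  Thread.current[:current_user]   # ...controller/model work goes here...
ensure
  Thread.current[:current_user] = nil
end

handle_request("alice")         # => "alice" inside the "request"
Thread.current[:current_user]   # => nil once it has finished
```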
This little gem ensures your thread/request-local variables don't stick around between requests: https://github.com/steveklabnik/request_store
The accepted answer covers the question, but Rails 5 now provides an abstract superclass, ActiveSupport::CurrentAttributes, which uses Thread.current under the hood.
I thought I would provide a link to that as a possible (unpopular) solution.
https://github.com/rails/rails/blob/master/activesupport/lib/active_support/current_attributes.rb
The accepted answer is technically accurate, but as pointed out gently in that answer, and not so gently in http://m.onkey.org/thread-safety-for-your-rails:
Don't use thread-local storage, Thread.current, if you don't absolutely have to.
The request_store gem is another (better) solution, but just read its README for more reasons to stay away from thread-local storage.
There is almost always a better way.

Ruby/Rails synchronous job manager

Hi,
I'm going to set up a Rails website where, after some initial user input, some heavy calculations are done (via a C extension to Ruby, which will use multithreading). As these calculations are going to consume almost all CPU time (and memory too), there should never be more than one calculation running at a time. Also, I can't use (asynchronous) background jobs (as with Delayed Job), because Rails has to show the results of that calculation, and the site should work without JavaScript.
So I suppose I need a separate process where all Rails instances queue their calculation requests and wait for the answer (maybe an error message if the queue is full): a kind of synchronous job manager.
Does anyone know if there is a gem/plugin with such functionality?
(Nanite seemed pretty cool to me, but it seems to be asynchronous only, so the Rails instances would not know when the calculation is finished. Is that correct?)
Another idea is to write my own using distributed Ruby (DRb), but why reinvent the wheel if it already exists?
Any help would be appreciated!
EDIT:
Because of zaius's tips, I think I will be able to do this asynchronously, so I'm going to try Resque.
Ruby has mutexes / semaphores.
http://www.ruby-doc.org/core/classes/Mutex.html
You can use a semaphore to make sure only one resource-intensive process is happening at the same time.
http://en.wikipedia.org/wiki/Mutex
http://en.wikipedia.org/wiki/Semaphore_(programming)
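A minimal sketch of that idea with a Ruby Mutex; it uses try_lock instead of lock so that a busy calculation is reported immediately rather than queued behind (the constant and method names are mine):

```ruby
CALC_LOCK = Mutex.new

# Run the given work only if no other thread is already calculating;
# otherwise return :busy immediately instead of blocking the request.
def run_exclusive(&work)
  return :busy unless CALC_LOCK.try_lock
  begin
    work.call
  ensure
    CALC_LOCK.unlock   # always release, even if the work raises
  end
end
```

In a Rails action you would branch on the `:busy` result and render an "already running, try again later" page.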
However, the idea of blocking a front end process while other tasks finish doesn't seem right to me. If I was doing this, I would use a background worker, and then use a page (or an iframe) with the refresh meta tag to continuously check on the progress.
http://en.wikipedia.org/wiki/Meta_refresh
That way, you can use the same code for both javascript enabled and disabled clients. And your web app threads aren't blocking.
If you have a separate process, then you have a background job... so either you can have one or you can't.
What I have done is have the website write the request params to a database. Then a separate process looks for pending requests in the database - using the daemons gem. It does the work and writes the results back to the database.
The website then polls the database until the results are ready and then displays them.
Although I use javascript to make it do the polling.
If you really can't use javascript, then it seems you need to either do the work in the web request thread or make that thread wait for the background worker to finish.
To make the web request thread wait, just run a loop in it, checking the database until the reply is saved back into it. Once it's there, you can complete the request.
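The waiting loop can be sketched generically; pass in whatever check reads your database (e.g. a find_by on the results table), with a timeout so a lost job doesn't hold the request open forever (method and parameter names are illustrative):

```ruby
# Poll until the check returns a truthy result or the deadline passes.
def wait_for_result(timeout: 30, interval: 0.5, &check)
  deadline = Time.now + timeout
  loop do
    result = check.call        # e.g. Result.find_by(request_id: id)
    return result if result
    return nil if Time.now > deadline
    sleep interval             # don't hammer the database between polls
  end
end

# Usage (stand-in check that succeeds immediately):
wait_for_result(timeout: 5, interval: 0.1) { :done }   # => :done
```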
HTH, chris
