Why use threads in Ruby Event Machine? - ruby-on-rails

Since EventMachine is said to be an event-based, async I/O library (like Node.js) that is single-threaded and uses an event loop to handle concurrent requests, is it really necessary to care about and use threading in Ruby application-layer code (i.e. a Rails controller handling requests)?
I'm more used to the Node.js model, where you just wrap your code inside a callback and everything is taken care of for you (the select()/kqueue/epoll machinery, and any threads it spawns, is handled in the lower-level C++ implementation), and ECMAScript by its nature doesn't have threads anyway.
Recently I saw this piece of Ruby code while trying to learn about EventMachine:
thread = Thread.current
Thread.new {
  EM.run { thread.wakeup }
}
# pause until reactor starts
Thread.stop
I'm just curious when threads are to be used in the event-based programming paradigm in ruby environment and what specific situation would require us to use them.
I know that Ruby has threads built into the language (green threads in MRI, JVM threads in JRuby), so it may be tempting to use them. However, from my point of view that kind of defeats the whole purpose: you're not supposed to worry about threads in higher-level application code, since the event-based model was introduced precisely to solve this problem.
Thanks, I appreciate any answers/clarifications.

While using EventMachine you cannot run a CPU-intensive task in the reactor, because the time spent on the task is "taken away" from the reactor. I use threads when I know a task is going to:
be blocking (you should never block the EventMachine thread)
use more CPU than my average tasks
In these cases, spawning the task in a separate thread allows it to do its job without preventing the reactor from doing its own work.
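The pattern described above is what EM.defer implements with its internal thread pool. A minimal sketch with plain Ruby threads (the sleep and the symbol are stand-ins for real blocking work and its result):

```ruby
require 'thread'

results = Queue.new

# hand the blocking work to a background thread (roughly what EM.defer
# does internally), so the reactor thread stays free
Thread.new do
  sleep 0.05            # stands in for a blocking call or heavy computation
  results << :done
end

# the event loop would keep processing other events in the meantime...
other_events = 3.times.map { |i| "event-#{i}" }

value = results.pop     # later, collect the result (EM.defer's callback role)
# value == :done
```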
Another choice is to use fibers, which are yet another beast entirely.

The biggest difference between a thread and a state machine, as far as I'm aware, is that threads can take advantage of a multi-core processor for true parallel processing (subject to the GIL on MRI), while a state machine processes everything serially. The state machine, on the other hand, makes it easier to maintain data integrity, since you don't have to worry as much about race conditions.

Related

Behaviour of Erlang system_info and system_flag methods

Do the methods system_info and system_flag make a system call to the operating system each time I call one of them, or do they use values stored inside the Erlang virtual machine?
Task: I'm writing an application which checks for idle processors and creates new processes to complete a task. If these methods make a system call each time, that could be a performance overhead.
The functions erlang:system_info and erlang:system_flag inspect and work on the Erlang virtual machine and not the underlying OS. They allow you to inspect the system to see how it is performing and in some ways to control it. The BEAM, the erlang virtual machine, is a complex beast and there is a lot of information to be had. Another useful function is process_info which allows you to get information about one process.
While these functions are obviously written in C, you can be certain that calling them will not cause problems in the sense that long-running NIFs might (long-running here means more than a few milliseconds). Also important is how often they are called, and whether by the same process, etc.
The functions system_info and system_flag are BIFs which call into the C code found in the file erl_bif_info.c. This code is not a NIF, so calling them will not cause the problems that long-running NIFs can.
NIFs are considered harmful:
Long-running NIFs will take over a scheduler and prevent Erlang from efficiently handling many processes.
Short-running NIFs will still confuse the scheduler if they take more than a few microseconds to run.
A crashing NIF will take down your VM.

What can threads do that processes can't?

I would like some input on this since it would help guide as to what I should focus on in my studies (if I should consider threads at all).
Are there examples of Rails applications where threads are absolutely necessary and the multiple-process model can't provide an adequate solution? One exception would be an application with memory restrictions, which would need to use threads instead of spawning multiple processes. But assuming memory is not an issue, in what additional cases are threads the better bet?
Threads are easier to write and debug. I'll start with simple non-threaded code, debug it, then wrap a chunk with Thread.new, join at the end, and I'm done.
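That wrap-and-join workflow can be as small as this sketch (the upcase call is a placeholder for real per-item work):

```ruby
items = %w[a b c]

# wrap the per-item work in Thread.new...
threads = items.map do |item|
  Thread.new(item) { |i| i.upcase }  # pass item in so each thread gets its own copy
end

# ...and join at the end; Thread#value joins and returns the block's result
results = threads.map(&:value)
# results == ["A", "B", "C"]
```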
And, yes, study them. You'll learn useful techniques and gain knowledge that's going to be good to have in your "programming toolchest".
As for what threads can do that processes can't: threads can very easily share data and work from the same queue or queues. Doing that with separate processes requires a database, IPC, or a messaging queue, all of which add a lot of complexity (though they can also increase capacity).
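That shared-queue pattern can be sketched with Ruby's thread-safe Queue; the squaring is a placeholder for real work, and the worker count is arbitrary:

```ruby
require 'thread'   # Queue (thread-safe) lives here on older Rubies

jobs    = Queue.new
results = Queue.new
10.times { |i| jobs << i }

# three workers pull from the same in-process queue: no IPC or database needed
workers = 3.times.map do
  Thread.new do
    loop do
      n = begin
            jobs.pop(true)    # non-blocking pop raises ThreadError when empty
          rescue ThreadError
            break             # queue drained: this worker is done
          end
      results << n * n
    end
  end
end
workers.each(&:join)

squares = []
squares << results.pop until results.empty?
# squares.sort == [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```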
Generally, Threads are more efficient to create / tear-down than processes.
Sidekiq is more efficient than Resque largely because Sidekiq workers are threads, whereas Resque uses forked workers (processes).
But the problem is that threads on MRI are limited by the Global Interpreter Lock (GIL): only one thread can execute Ruby code at a time, regardless of how many cores are available. See this Igvita article for more information: http://www.igvita.com/2008/11/13/concurrency-is-a-myth-in-ruby/
On platforms with truly parallel threads, such as JRuby, you can have a multi-threaded Rails app (running in a servlet container) and it will likely out-perform the same app running under MRI. It's also possible for JRuby on the HotSpot JVM to apply just-in-time performance optimizations.

Should spawn be used in Erlang whenever I have a non dependent asynchronous function?

If I have a function that can be executed asynchronously, with no dependencies, and no other functions require its results directly, should I use spawn? In my scenario I want to go on to consume a message queue, so spawning would relieve my blocking loop. But in other situations, if I distribute function calls as much as possible, will that affect my application negatively?
Overall, what are the pros and cons of using spawn?
Unlike operating system processes or threads, Erlang processes are very light weight. There is minimal overhead in starting, stopping, and scheduling new processes. You should be able to spawn as many of them as you need (the max per vm is in the hundreds of thousands). The Actor model Erlang implements allows you to think about what is actually happening in parallel and write your programs to express that directly. Avoid complicating your logic with work queues if you can avoid it.
Spawn a process whenever it makes logical sense, and optimize only when you have to.
The first thing that comes to mind is the size of the parameters. They will be copied from your current process to the new one, and if the parameters are huge this may be inefficient.
Another problem that may arise is bloating the VM with so many processes that your system becomes unresponsive. You can overcome this by using a pool of worker processes, or a special monitor process that allows only a limited number of such processes to run at once.
so spawning would relieve my blocking loop
If you are in a situation where a loop will receive many messages requiring independent actions, don't hesitate: spawn a new process for each message, so you take advantage of the multicore capabilities (if any) of your computer. As kjw0188 says, Erlang processes are very lightweight, and if the system hits the limit of processes alive in parallel (assuming you are writing reasonable code), it is more likely that the application is overloading the capacity of the node.

What is the difference between forking and threading in a background process?

Reading the documentation for the spawn gem it states:
By default, spawn will use the fork to spawn child processes. You can configure it to do threading either by telling the spawn method when you call it or by configuring your environment. For example, this is how you can tell spawn to use threading on the call,
What would be the difference between using a fork or a thread, what are the repercussions of either decision, and how do I know which to use?
Threading means you run the code in another thread in the same process whereas forking means you fork a separate process.
Threading generally means you'll use less memory, since you won't have a separate application instance (this advantage is lessened if you have a copy-on-write-friendly Ruby such as REE). Communication between threads is also a little easier.
Depending on your Ruby interpreter, Ruby may not use extra cores efficiently (JRuby is good at this, MRI much worse), so spawning a bunch of extra threads will impact the performance of your web app and won't make full use of your resources; MRI only runs one thread at a time.
Forking creates separate Ruby instances, so you'll make better use of multiple cores. You're also less likely to adversely affect your main application. You do need to be a little careful when forking, since open file descriptors are shared across the fork, so you usually want to reopen database connections, memcache connections, etc.
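The shared-descriptor point can be sketched like this, with a pipe standing in for any inherited descriptor such as a database socket:

```ruby
# fork copies the process; descriptors like this pipe are shared across
# the fork, which is why long-lived sockets (DB connections etc.) should
# be reopened in the child rather than reused.
reader, writer = IO.pipe

pid = fork do
  reader.close                 # child closes the end it doesn't use
  writer.puts Process.pid      # child does its work in its own Ruby instance
  writer.close
end

writer.close                   # parent closes its copy of the write end
child_pid = reader.gets.to_i   # read what the child reported
Process.wait(pid)              # reap the child
# child_pid == pid
```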
With MRI I'd use forking; with JRuby there's more of a case for threading.
Fork creates another process and processes are generally designed to run independently of whatever else is going on in your application. Processes do not share resources.
Threads, however, are designed for a different purpose. You would want to use a thread if you wish to parallelize a certain task.
"A fork() induces a parent-child relationship between two processes. Thread creation induces a peer relationship between all the threads of a process."

How to deploy a threadsafe asynchronous Rails app?

I've read tons of material around the web about thread safety and performance in different versions of ruby and rails and I think I understand those things quite well at this point.
What seems to be oddly missing from the discussions is how to actually deploy an asynchronous Rails app. When talking about threads and synchronicity in an app, there are two things people want to optimize:
utilizing all CPU cores with minimal RAM usage
being able to serve new requests while previous requests are waiting on IO
Point 1 is where people get (rightly) excited about JRuby. For this question I am only trying to optimize point 2.
Say this is the only controller in my app:
class TheController < ActionController::Base
  def fast
    render :text => "hello"
  end

  def slow
    render :text => User.count.to_s
  end
end
fast has no IO and can serve hundreds or thousands of requests per second, and slow has to send a request over the network, wait for work to be done, then receive the answer over the network, and is therefore much slower than fast.
So an ideal deployment would allow hundreds of requests to fast to be fulfilled while a request to slow is waiting on IO.
What seems to be missing from the discussions around the web is which layer of the stack is responsible for enabling this concurrency. thin has a --threaded flag, which will "Call the Rack application in threads [experimental]" -- does that start a new thread for each incoming request, or spool up Rack app instances in persistent threads that wait for incoming requests?
Is thin the only way or are there others? Does the ruby runtime matter for optimizing point 2?
The right approach for you depends heavily on what your slow method is doing.
In a perfect world, you could use something like the sinatra-synchrony gem to handle each request in a fiber. You'd only be limited by the maximum number of fibers. Unfortunately, the stack size on fibers is hardcoded, and it is easy to overrun in a Rails app. Additionally, I've read a few horror stories about the difficulties of debugging fibers, due to the automatic yielding after async IO has been initiated. Race conditions are still possible when using fibers, as well. Currently, fibered Ruby is a bit of a ghetto, at least on the front-end of a web app.
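The fiber mechanics such gems build on can be seen with plain Ruby; the yielded symbol and the response string here are purely illustrative:

```ruby
# A fiber pauses at Fiber.yield (as if waiting on async IO) and resumes
# when the "IO" completes; fiber-per-request servers automate this yield
# and resume around real IO operations.
request = Fiber.new do
  data = Fiber.yield(:io_started)   # pause here; pretend IO is in flight
  "got #{data}"                     # resumes here once the IO result arrives
end

first  = request.resume             # runs until the yield, returns :io_started
second = request.resume("response") # feeds the "IO result" back into the fiber
# first == :io_started, second == "got response"
```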
A more pragmatic solution that doesn't require code changes is to use a Rack server that has a pool of worker threads, such as Rainbows! or Puma. I believe Thin's --threaded flag handles each request in a new thread, and spinning up a native OS thread is not cheap; better to use a thread pool with the pool size set sufficiently high. In Rails, don't forget to set config.threadsafe! in production.
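With Puma, for example, the pool sizing lives in the config file. These numbers are purely illustrative, not recommendations; tune them to your app's IO-to-CPU ratio:

```ruby
# config/puma.rb -- illustrative values only
threads 8, 32   # min and max threads per process; a high max suits IO-bound apps
workers 2       # forked processes, each with its own thread pool (helps under MRI)
preload_app!    # load the app before forking so workers can share memory via CoW
```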
If you're OK with changing code, you can check out Konstantin Haase's excellent talk on real-time Rack. He discusses using the EventMachine::Deferrable class to produce a response outside of the traditional request/response cycle that Rack is built on. This seems really neat, but you have to rewrite the code in an async style.
Also take a look at Cramp and Goliath. These let you implement your slow method in a separate Rack app that is hosted alongside your Rails app, but you will probably have to rewrite your code to work in the Cramp/Goliath handlers as well.
As for your question about the Ruby runtime, it also depends on the work that slow is doing. If you're doing CPU-heavy computation, then you run the risk of the GIL giving you issues. If you're doing IO, then the GIL shouldn't get in your way. (I say shouldn't because I believe I've read about issues with the older mysql gem holding the GIL during queries.)
Personally, I've had success using sinatra-synchrony for a backend mashup web service. I can issue several requests to external web services in parallel and wait for all of them to return. Meanwhile, the frontend Rails server uses a thread pool and makes requests directly to the backend. Not perfect, but it works well enough right now.
