Rails 4 - threading error - ruby-on-rails

I am trying to perform some calculations to populate some historic data in the database.
The database is SQL Server. The server is tomcat (using JRuby).
I am running the script file in a rails console pointed to the uat environment.
I am trying to use threads to speed up the execution. The idea being that each thread would take an object and run the calculations for it, and save the calculated values back to the database.
Problem: I keep getting this error:
ActiveRecord::ConnectionTimeoutError (could not obtain a database connection within 5.000 seconds (waited 5.000 seconds))
code:
require 'thread'
threads = []
items_to_calculate = Item.where("id < 11").to_a #testing only 10 items for now
for item in items_to_calculate
threads << Thread.new(item) { |myitem|
my_calculator = ItemsCalculator.new(myitem)
to_save = my_calculator.calculate_details
to_save.each do |dt|
dt.save!
end
}
end
threads.each { |aThread| aThread.join }

You're probably spawning more threads than ActiveRecord's DB connection pool has connections. Ekkehard's answer is an excellent general description; so here's a simple example of how to limit your workers using Ruby's thread-safe Queue.
require 'thread'
queue = Queue.new
items.each { |i| queue << i } # Fill the queue
Array.new(5) do # Only 5 concurrent workers
Thread.new do
until queue.empty?
item = queue.pop
ActiveRecord::Base.connection_pool.with_connection do
# Work
end
end
end
end.each(&:join)
I chose 5 because that's the ConnectionPool's default, but you can certainly tune that to the max that still works, or populate another queue with the result to save later and run an arbitrary number of threads for the calculation.
The with_connection method grabs a connection, runs your block, then ensures the connection is released. It's necessary because of a bug in ActiveRecord where the connection doesn't always get released otherwise. Check out this blog post for some details.

You are potentially starting a huge amount of threads at the same time if you leave the testing stage.
Each of these threads will need a DB connection. Either Rails is going to create a new one for every thread (possible creating a huge amount of DB connections at the same time), or it does not, in which case you'll run into trouble because several threads are trying to use the same connection in parallel. The first case would explain the error message because there will probably be a hard limit of open DB connections in your DB server.
Creating threads like this is usually not advisable. You're usually better off to create a handful (controlled/limited) amount of worker threads and using a queue to distribute work between them.
In your case, you could have a set of worker threads to do the calculations, and a second set of worker threads to write to the DB. I do not know enough about the details of your code to decide for you which is better. If the calculation is expensive and the DB-work is not, then you will probably have only one worker for writing to the DB in a serial fashion. If your DB is a beast and highly optimized for parallel writing and you need to write a lot of data, then you will maybe want a (small) amount of DB workers.

Related

Is this Ruby code using threads, thread pools, and concurrency correctly

I am what I now consider part 3 of completing a task of pinging a very large list of URLs (which number in the thousands) and retrieving a URL's x509 certificate associated with it. Part 1 is here (How do I properly use threads to ping a URL) and Part 2 is here (Why won't my connection pool implement my thread code).
Since I asked these two questions, I have now ended up with the following code:
###### This is the code that pings a url and grabs its x509 cert #####
class SslClient
attr_reader :url, :port, :timeout
def initialize(url, port = '443')
#url = url
#port = port
end
def ping_for_certificate_info
context = OpenSSL::SSL::SSLContext.new
tcp_client = TCPSocket.new(url, port)
ssl_client = OpenSSL::SSL::SSLSocket.new tcp_client, context
ssl_client.hostname = url
ssl_client.sync_close = true
ssl_client.connect
certificate = ssl_client.peer_cert
verify_result = ssl_client.verify_result
tcp_client.close
{certificate: certificate, verify_result: verify_result }
rescue => error
{certificate: nil, verify_result: nil }
end
end
The above code is paramount that I retrieve the ssl_client.peer_cert. Below I have the following code that is the snippet that makes multiple HTTP pings to URLs for their certs:
pool = Concurrent::CachedThreadPool.new
pool.post do
[LARGE LIST OF URLS TO PING].each do |struct|
ssl_client = SslClient.new(struct.domain.gsub("*.", "www."), struct.scan_port)
cert_info = ssl_client.ping_for_certificate_info
struct.x509_cert = cert_info[:certificate]
struct.verify_result = cert_info[:verify_result]
end
end
pool.shutdown
pool.wait_for_termination
#Do some rails code with the database depending on the results.
So far when I run this code, it is unbelievably slow. I thought that by creating a thread pool with threads, the code would go much faster. That doesn't seem the case and I'm not sure why. A lot of it was because I didn't know the nuances of threads, pools, starvation, locks, etc. However, after implementing the above code, I read some more to try to speed it up and once again I'm confused and could use some clarification as to how I can make the code faster.
For starters, in this excellent article here (ruby-concurrency-parallelism) . We get the following definitions and concepts:
Concurrency vs. Parallelism
These terms are used loosely, but they do have distinct meanings.
Concurrency: The art of doing many tasks, one at a time. By switching
between them quickly, it may appear to the user as though they happen
simultaneously. Parallelism: Doing many tasks at literally the same
time. Instead of appearing simultaneous, they are simultaneous.
Concurrency is most often used for applications that are IO heavy. For
example, a web app may regularly interact with a database or make lots
of network requests. By using concurrency, we can keep our application
responsive, even while we wait for the database to respond to our
query.
This is possible because the Ruby VM allows other threads to run while
one is waiting during IO. Even if a program has to make dozens of
requests, if we use concurrency, the requests will be made at
virtually the same time.
Parallelism, on the other hand, is not currently supported by Ruby.
So from this piece of the article, I understand that what I want to do needs to be done concurrently because I am pinging URLs on the network and that Parallelism is not currently supported by Ruby.
Next is where things get confused for me. From my part 1 question on Stack Overflow, I learned the following in a comment given to me that I should do the following:
Use a thread pool; don't just create a thousand concurrent threads. For something like
connecting to a URL where there will be a lot of waiting you can
oversubscribe the number of threads per CPU core, but not by a huge
amount. You'll have to experiment.
Another user says this:
You'd not spawn thousands of threads, use a connection pool
(e.g https://github.com/mperham/connection_pool) so you have maximum
20-30 concurrent requests going (this maximum number should be
determined by testing at which point network performance drops and you
get these timeouts)
So for this part, I turned to concurrent-ruby and implemented both a CachedThreadPool and a FixedThreadPool with10 threads. I chose a `CachedThreadPool because it seemed to me that the number of threads needed would be taken care of for me by the Threadpool. Now in concurrent ruby's documentation for a pool, I see this:
pool = Concurrent::CachedThreadPool.new
pool.post do
# some parallel work
end
I thought we just established in the first article that parallelism is not supported in Ruby, so what is the thread pool doing? Is it working concurrently or in parallel? What exactly is going on? Do I need a thread pool or not? Also at this point in time I thought connection pools and thread pools were the same just used interchangeably. What is the difference between the two pools and which one do I need?
In another excellent article How to Perform Concurrent HTTP Requests in Ruby and Rails, this article introduces the Concurrent::Promises class form concurrent ruby to avoid locks and have thread safety with two api calls. Here is a snippet of code below with the following description:
def get_all_conversations
groups_thread = Thread.new do
get_groups_list
end
channels_thread = Thread.new do
get_channels_list
end
[groups_thread, channels_thread].map(&:value).flatten
end
Every request is executed it its own thread, which can run in parallel because it is a blocking I/O. But can you see a catch here?
In the above code, another mention of parallelism which we just said didn't exist in ruby. Below is the approach with Concurrent::Promise
def get_all_conversations
groups_promise = Concurrent::Promise.execute do
get_groups_list
end
channels_promise = Concurrent::Promise.execute do
get_channels_list
end
[groups_promise, channels_promise].map(&:value!).flatten
end
So according to this article, these requests are being made 'in parallel'. Are we still talking about concurrency at this point?
Finally, in these two articles, they talk about using Futures for concurrent http requests. I won't go into the details but I'll paste the links here.
1.Using Concurrent Ruby in a Ruby on Rails Application
2. Learn Concurrency by Implementing Futures in Ruby
Again, what's talked about in the article looks to me like the Concurrent::Promise functionality. I just want to note that the examples show how to use the concepts for two different API calls that need to be combined together. This is not what I need. I just need to make thousands of API calls fast and log the results.
In conclusion, I just want to know what I need to do to make my code faster and thread safe to make it run concurrently. What exactly am I missing to make the code go faster because right now it is going so slow that I might as well not have used threads in the first place.
Summary
I have to ping thousands of URLs using threads to speed up the process. The code is slow and I am confused if I am using threads, thread pools, and concurrency correctly.
Let us look at the problems you have described and try to solve these one at a time:
You have two pieces of code, SslClient and the script which uses this ssl client. From my understanding of the threadpool, the way you have used the threadpool needs to be changed a bit.
From:
pool = Concurrent::CachedThreadPool.new
pool.post do
[LARGE LIST OF URLS TO PING].each do |struct|
ssl_client = SslClient.new(struct.domain.gsub("*.", "www."), struct.scan_port)
cert_info = ssl_client.ping_for_certificate_info
struct.x509_cert = cert_info[:certificate]
struct.verify_result = cert_info[:verify_result]
end
end
pool.shutdown
pool.wait_for_termination
to:
pool = Concurrent::FixedThreadPool.new(10)
[LARGE LIST OF URLS TO PING].each do | struct |
pool.post do
ssl_client = SslClient.new(struct.domain.gsub("*.", "www."), struct.scan_port)
cert_info = ssl_client.ping_for_certificate_info
struct.x509_cert = cert_info[:certificate]
struct.verify_result = cert_info[:verify_result]
end
end
pool.shutdown
pool.wait_form
In the initial version, there is only one unit of work that is posted to the pool. In the second version, we are posting as many units of work to the pool as there are items in LARGE LIST OF URLS TO PING.
To add a bit more about Concurrency vs Parallelism in Ruby, it is true that Ruby doesn't support true parallelism due to GIL (Global Interpreter Lock), but this only applies when we are actually doing any amount of work on the CPU. In case of a network request, CPU bound work duration is very negligible compared to the IO bound work, which means that your usecase is a very good candidate for using threads.
Also by using a threadpool, we can minimize the overhead of thread creation incurred by the CPU. When we use a threadpool, like in the case of Concurrent::FixedThreadPool.new(10), we are literally restricting the number of threads that are available in the pool, for an unbound threadpool, new threads are created for everytime when a unit of work is present, but rest of thre threads in the pool are busy.
In the first article, there was a need to collect the result returned by each individual workers and also to act meaningfully in case of an exception (I am the author). You should be able to use the class given in that blog without any change.
Lets try rewriting your code using Concurrent::Future since in your case too, we need the results.
thread_pool = Concurrent::FixedThreadPool.new(20)
executors = [LARGE LIST OF URLS TO PING].map do | struct |
Concurrent::Future.execute({ executor: thread_pool }) do
ssl_client = SslClient.new(struct.domain.gsub("*.", "www."), struct.scan_port)
cert_info = ssl_client.ping_for_certificate_info
struct.x509_cert = cert_info[:certificate]
struct.verify_result = cert_info[:verify_result]
struct
end
end
executors.map(&:value)
I hope this helps. In case of questions, please ask in comments, I shall modify this write up to answer those.

Multithreading vs Background jobs in Rails

I have an application that makes thousands of requests to a web service API. Each request takes about 2 seconds, then the response creates new record in the database. I want to just fire off as many of those requests as I can simultaneously, and save the response to the database as as soon as I get the response.
Is this something I should be using a gem like sidekiq for, or the ruby Thread class? I don't want to just hand off the requests to be handled synchronously.
Sounds like you need a thread pool for performing the operation, and a database thread to commit the results.
You can build one of these really simply:
require 'thread'
db_queue = Queue.new
Thread.new do
while (item = db_queue.pop)
# ... Deal with item in queue
end
end
# Example of supplying a job
db_queue.push(api_response)
# When finished
db_queue.push(nil)
Due to the Global Interpreter Lock in the standard Ruby runtime threads are only really useful for managing many lightly loaded threads. If you need something more heavy-duty, JRuby might be what you're looking for.

Sidekiq threads accessing global variable

I have a controller that spins off 6 sidekiq threads for faster parallel processing of a large file. Before that however I want to provide these threads with a few variables that should be available accross all threads because they variables themselves are fairly memory intensive. (it is only reading from that, not writing, so the concurrency issues doesn't exist)
In other words my controller looks like this
def foo
$bar1 = ....
$bar2 = ...
worker.perform_async()...
worker2.perform_async()...
end
I don't want to put those global vars into the perform methods because serializing those to redis chokes the entire thing. My issue is that the workers cannot see these variables and die because of a no method error (i.e. trying to call .first on on of them gives that error because the var is nil for the workers).
How come? Is there any other way to do this that won't kill my memory? (i.e. I don't want to take up most of the mem with 6x the same large array)
Sidekiq runs on a separate process, so it doesn't share the same memory as the initiator of the worker.
If the data is static, you might want to load it on the start of the sidekiq process (maybe when you configure the sidekiq server).
If it changes per task, you should model it in a way where you can create a global repository to hold it (if redis is not good for this, maybe you can try memcached)...

Ruby on Rails large amount of sql queries

I'm building an app that queries an api and that saves the output of the api into a database.
This is no rocket science and it is working but as the amount of data is growing, the inserts are going slower.
I have a simple form that accepts a keyword which is added into the api string to get all the keywords. I now like to show the results of the api onto the screen so the user can choose what results to keep.
I've added threading in my code, so that the inserts are going faster.
Is there a way to trigger an action when all the threads are finished?
Thanks
The simplest way to do this is to join all the threads in a thread pool, this is effectively waiting for them to finish:
threadpool = []
threadpool << Thread.new { do_stuff }
threadpool << Thread.new { do_more }
threadpool.map &:join # wait for all threads to finish
do_final_stuff # code below the join can only run when all threads finish
But that's just if you're using plain threads.

Connection pool issue with ActiveRecord objects in rufus-scheduler

I'm using rufus-scheduler to run a number of frequent jobs that do some various tasks with ActiveRecord objects. If there is any sort of network or postgresql hiccup, even after recovery, all the threads will throw the following error until the process is restarted:
ActiveRecord::ConnectionTimeoutError (could not obtain a database connection within 5 seconds (waited 5.000122687 seconds). The max pool size is currently 5; consider increasing it.
The error can easily be reproduced by restarting postgres. I've tried playing (up to 15) with the pool size, but no luck there.
That leads me to believe the connections are just in a stale state, which I thought would be fixed with the call to clear_stale_cached_connections!.
Is there a more reliable pattern to do this?
The block that is passed is a simple select and update active record call, and happens to matter what the AR object is.
The rufus job:
scheduler.every '5s' do
db do
DataFeed.update #standard AR select/update
end
end
wrapper:
def db(&block)
begin
ActiveRecord::Base.connection_pool.clear_stale_cached_connections!
#ActiveRecord::Base.establish_connection # this didn't help either way
yield block
rescue Exception => e
raise e
ensure
ActiveRecord::Base.connection.close if ActiveRecord::Base.connection
ActiveRecord::Base.clear_active_connections!
end
end
Rufus scheduler starts a new thread for every job.
ActiveRecord on the other hand cannot share connections between threads, so it needs to assign a connection to a specific thread.
When your thread doesn't have a connection yet, it will get one from the pool.
(If all connections in the pool are in use, it will wait untill one is returned from another thread. Eventually timing out and throwing ConnectionTimeoutError)
It is your responsibility to return it back to the pool when you are done with it, in a Rails app, this is done automatically. But if you are managing your own threads (as rufus does), you have to do this yourself.
Lucklily, there is an api for this:
If you put your code inside a with_connection block, it will get a connection form the pool, and release it when it is done
ActiveRecord::Base.connection_pool.with_connection do
#your code here
end
In your case:
def db
ActiveRecord::Base.connection_pool.with_connection do
yield
end
end
Should do the trick....
http://api.rubyonrails.org/classes/ActiveRecord/ConnectionAdapters/ConnectionPool.html#method-i-with_connection
The reason can be that you have many threads which are using all connections, if DataFeed.update method takes more than 5 seconds, than your block can be overlapped.
try
scheduler.every("5s", :allow_overlapping => false) do
#...
end
Also try release connection instead of closing it.
ActiveRecord::Base.connection_pool.release_connection
I don't really know about rufus-scheduler, but I got some ideas.
The first problem could be a bug on rufus-scheduler that does not checkout database connection properly. If it's the case the only solution is to clear stale connections manually as you already do and to inform the author of rufus-scheduler about your issue.
Another problem that could happen is that your DataFeed operation takes a really long time and because it is performed every 5 secondes Rails is running out of database connections, but it's rather unlikely.

Resources