Ruby on Rails large amount of sql queries - ruby-on-rails

I'm building an app that queries an API and saves the API's output into a database.
This is not rocket science and it works, but as the amount of data grows, the inserts are getting slower.
I have a simple form that accepts a keyword, which is added to the API request string to fetch all the matching keywords. I would now like to show the API results on the screen so the user can choose which results to keep.
I've added threading in my code, so that the inserts are going faster.
Is there a way to trigger an action when all the threads are finished?
Thanks

The simplest way to do this is to join all the threads in a thread pool, which effectively waits for them to finish:
threadpool = []
threadpool << Thread.new { do_stuff }
threadpool << Thread.new { do_more }
threadpool.map &:join # wait for all threads to finish
do_final_stuff # code below the join can only run when all threads finish
But that's just if you're using plain threads.
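If the final step also needs what each thread produced, Thread#value joins the thread and returns the result of its block. A minimal sketch of that idea, where save_batch and run_batches are hypothetical names standing in for the real insert work:

```ruby
# Hypothetical stand-in for one unit of insert work; returns a row count.
def save_batch(rows)
  rows.size
end

# Runs one thread per batch, waits for all of them, then returns the total.
def run_batches(batches)
  threads = batches.map do |rows|
    Thread.new { save_batch(rows) }
  end
  # Thread#value joins the thread and returns the block's result,
  # so this line does not finish until every thread has finished.
  counts = threads.map(&:value)
  counts.sum # anything placed here runs only after all threads are done
end

total = run_batches([[1, 2], [3], [4, 5, 6]])
puts "inserted #{total} rows"
```

An exception raised inside a thread is re-raised by Thread#value, so a failed insert surfaces at the point where the results are collected.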

Related

Is this Ruby code using threads, thread pools, and concurrency correctly

I am on what I now consider part 3 of completing a task of pinging a very large list of URLs (which number in the thousands) and retrieving the x509 certificate associated with each URL. Part 1 is here (How do I properly use threads to ping a URL) and Part 2 is here (Why won't my connection pool implement my thread code).
Since I asked these two questions, I have now ended up with the following code:
###### This is the code that pings a url and grabs its x509 cert #####
class SslClient
attr_reader :url, :port, :timeout
def initialize(url, port = '443')
@url = url
@port = port
end
def ping_for_certificate_info
context = OpenSSL::SSL::SSLContext.new
tcp_client = TCPSocket.new(url, port)
ssl_client = OpenSSL::SSL::SSLSocket.new tcp_client, context
ssl_client.hostname = url
ssl_client.sync_close = true
ssl_client.connect
certificate = ssl_client.peer_cert
verify_result = ssl_client.verify_result
tcp_client.close
{certificate: certificate, verify_result: verify_result }
rescue => error
{certificate: nil, verify_result: nil }
end
end
It is paramount that the above code retrieves ssl_client.peer_cert. Below is the snippet that makes multiple HTTP pings to URLs for their certs:
pool = Concurrent::CachedThreadPool.new
pool.post do
[LARGE LIST OF URLS TO PING].each do |struct|
ssl_client = SslClient.new(struct.domain.gsub("*.", "www."), struct.scan_port)
cert_info = ssl_client.ping_for_certificate_info
struct.x509_cert = cert_info[:certificate]
struct.verify_result = cert_info[:verify_result]
end
end
pool.shutdown
pool.wait_for_termination
#Do some rails code with the database depending on the results.
So far, when I run this code, it is unbelievably slow. I thought that by creating a thread pool with threads, the code would go much faster. That doesn't seem to be the case and I'm not sure why. A lot of it is because I didn't know the nuances of threads, pools, starvation, locks, etc. However, after implementing the above code, I read some more to try to speed it up, and once again I'm confused and could use some clarification on how I can make the code faster.
For starters, this excellent article (ruby-concurrency-parallelism) gives the following definitions and concepts:
Concurrency vs. Parallelism
These terms are used loosely, but they do have distinct meanings.
Concurrency: The art of doing many tasks, one at a time. By switching
between them quickly, it may appear to the user as though they happen
simultaneously.
Parallelism: Doing many tasks at literally the same
time. Instead of appearing simultaneous, they are simultaneous.
Concurrency is most often used for applications that are IO heavy. For
example, a web app may regularly interact with a database or make lots
of network requests. By using concurrency, we can keep our application
responsive, even while we wait for the database to respond to our
query.
This is possible because the Ruby VM allows other threads to run while
one is waiting during IO. Even if a program has to make dozens of
requests, if we use concurrency, the requests will be made at
virtually the same time.
Parallelism, on the other hand, is not currently supported by Ruby.
So from this piece of the article, I understand that what I want to do needs to be done concurrently because I am pinging URLs on the network and that Parallelism is not currently supported by Ruby.
Next is where things get confusing for me. From my part 1 question on Stack Overflow, I learned from a comment that I should do the following:
Use a thread pool; don't just create a thousand concurrent threads. For something like
connecting to a URL where there will be a lot of waiting you can
oversubscribe the number of threads per CPU core, but not by a huge
amount. You'll have to experiment.
Another user says this:
You'd not spawn thousands of threads, use a connection pool
(e.g https://github.com/mperham/connection_pool) so you have maximum
20-30 concurrent requests going (this maximum number should be
determined by testing at which point network performance drops and you
get these timeouts)
So for this part, I turned to concurrent-ruby and implemented both a CachedThreadPool and a FixedThreadPool with 10 threads. I chose a CachedThreadPool because it seemed to me that the number of threads needed would be taken care of for me by the thread pool. Now in concurrent-ruby's documentation for a pool, I see this:
pool = Concurrent::CachedThreadPool.new
pool.post do
# some parallel work
end
I thought we just established in the first article that parallelism is not supported in Ruby, so what is the thread pool doing? Is it working concurrently or in parallel? What exactly is going on? Do I need a thread pool or not? Also at this point in time I thought connection pools and thread pools were the same just used interchangeably. What is the difference between the two pools and which one do I need?
Another excellent article, How to Perform Concurrent HTTP Requests in Ruby and Rails, introduces the Concurrent::Promise class from concurrent-ruby to avoid locks and get thread safety with two API calls. Here is a snippet of code with the following description:
def get_all_conversations
groups_thread = Thread.new do
get_groups_list
end
channels_thread = Thread.new do
get_channels_list
end
[groups_thread, channels_thread].map(&:value).flatten
end
Every request is executed in its own thread, which can run in parallel because it is a blocking I/O. But can you see a catch here?
In the above code, another mention of parallelism which we just said didn't exist in ruby. Below is the approach with Concurrent::Promise
def get_all_conversations
groups_promise = Concurrent::Promise.execute do
get_groups_list
end
channels_promise = Concurrent::Promise.execute do
get_channels_list
end
[groups_promise, channels_promise].map(&:value!).flatten
end
So according to this article, these requests are being made 'in parallel'. Are we still talking about concurrency at this point?
Finally, in these two articles, they talk about using Futures for concurrent http requests. I won't go into the details but I'll paste the links here.
1. Using Concurrent Ruby in a Ruby on Rails Application
2. Learn Concurrency by Implementing Futures in Ruby
Again, what's talked about in the article looks to me like the Concurrent::Promise functionality. I just want to note that the examples show how to use the concepts for two different API calls that need to be combined together. This is not what I need. I just need to make thousands of API calls fast and log the results.
In conclusion, I just want to know what I need to do to make my code faster and thread safe to make it run concurrently. What exactly am I missing to make the code go faster because right now it is going so slow that I might as well not have used threads in the first place.
Summary
I have to ping thousands of URLs using threads to speed up the process. The code is slow and I am confused if I am using threads, thread pools, and concurrency correctly.
Let us look at the problems you have described and try to solve these one at a time:
You have two pieces of code, SslClient and the script which uses this ssl client. From my understanding of the threadpool, the way you have used the threadpool needs to be changed a bit.
From:
pool = Concurrent::CachedThreadPool.new
pool.post do
[LARGE LIST OF URLS TO PING].each do |struct|
ssl_client = SslClient.new(struct.domain.gsub("*.", "www."), struct.scan_port)
cert_info = ssl_client.ping_for_certificate_info
struct.x509_cert = cert_info[:certificate]
struct.verify_result = cert_info[:verify_result]
end
end
pool.shutdown
pool.wait_for_termination
to:
pool = Concurrent::FixedThreadPool.new(10)
[LARGE LIST OF URLS TO PING].each do | struct |
pool.post do
ssl_client = SslClient.new(struct.domain.gsub("*.", "www."), struct.scan_port)
cert_info = ssl_client.ping_for_certificate_info
struct.x509_cert = cert_info[:certificate]
struct.verify_result = cert_info[:verify_result]
end
end
pool.shutdown
pool.wait_for_termination
In the initial version, there is only one unit of work that is posted to the pool. In the second version, we are posting as many units of work to the pool as there are items in LARGE LIST OF URLS TO PING.
To add a bit more about concurrency vs. parallelism in Ruby: it is true that Ruby doesn't support true parallelism because of the GIL (Global Interpreter Lock), but this only applies while a thread is actually doing work on the CPU. For a network request, the CPU-bound part is negligible compared to the time spent waiting on IO, which means that your use case is a very good candidate for threads.
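The effect of the GIL being released during IO can be seen with a small timing experiment. This is only a sketch: fake_request and timed are hypothetical helpers, and sleep stands in for a blocking network call (the 0.1 second delay is an arbitrary assumption).

```ruby
# Simulated network call: like a blocking socket read, sleep lets other
# threads run while this one waits.
def fake_request(id)
  sleep 0.1
  id
end

# Returns the block's result together with the elapsed wall-clock seconds.
def timed
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  result = yield
  [result, Process.clock_gettime(Process::CLOCK_MONOTONIC) - start]
end

# Five requests one after another: the waits add up.
_, seq_time = timed { (1..5).map { |i| fake_request(i) } }

# Five requests on five threads: the waits overlap.
_, thr_time = timed do
  (1..5).map { |i| Thread.new { fake_request(i) } }.map(&:value)
end

puts format("sequential: %.2fs, threaded: %.2fs", seq_time, thr_time)
```

If fake_request did CPU-heavy work instead of waiting, the two timings would be expected to be much closer on standard Ruby.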
Also, by using a thread pool we can minimize the thread-creation overhead incurred by the CPU. With a fixed pool such as Concurrent::FixedThreadPool.new(10), we are literally restricting the number of threads available in the pool. With an unbounded thread pool, a new thread is created every time a unit of work arrives while the rest of the threads in the pool are busy.
In the first article, there was a need to collect the result returned by each individual worker and also to act meaningfully in case of an exception (I am the author). You should be able to use the class given in that blog without any change.
Let's try rewriting your code using Concurrent::Future since in your case too, we need the results.
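To make the "fixed number of threads" idea concrete without depending on concurrent-ruby, here is a rough standard-library sketch. It is not how Concurrent::FixedThreadPool is implemented; run_in_fixed_pool is a hypothetical helper that only illustrates that no more than size jobs are ever in flight.

```ruby
# Runs every job on at most `size` worker threads and returns the results
# plus the highest number of jobs that were in flight at once.
def run_in_fixed_pool(jobs, size)
  queue = Queue.new
  jobs.each { |job| queue << job }
  queue.close # once drained, pop returns nil and the workers stop

  lock = Mutex.new
  running = 0
  peak = 0
  results = []

  workers = Array.new(size) do
    Thread.new do
      while (job = queue.pop)
        lock.synchronize { running += 1; peak = [peak, running].max }
        value = job.call
        lock.synchronize { running -= 1; results << value }
      end
    end
  end
  workers.each(&:join)
  [results, peak]
end

# Twenty small jobs (lambdas) that each wait briefly, run on four workers.
jobs = Array.new(20) { |i| -> { sleep 0.01; i * 2 } }
results, peak = run_in_fixed_pool(jobs, 4)
puts "#{results.size} jobs done, peak concurrency #{peak}"
```

Results arrive in completion order, not submission order, so sort or key them if the order matters.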
thread_pool = Concurrent::FixedThreadPool.new(20)
executors = [LARGE LIST OF URLS TO PING].map do | struct |
Concurrent::Future.execute({ executor: thread_pool }) do
ssl_client = SslClient.new(struct.domain.gsub("*.", "www."), struct.scan_port)
cert_info = ssl_client.ping_for_certificate_info
struct.x509_cert = cert_info[:certificate]
struct.verify_result = cert_info[:verify_result]
struct
end
end
executors.map(&:value)
I hope this helps. In case of questions, please ask in comments, I shall modify this write up to answer those.

Call new thread each time within large loop

I have 20,000 to 30,000 users, who should receive a message at a given time. SendMessage is a service that does API call against a third party site. I have this loop:
@users.each do |user|
...
SendMessage.new(user.id)
...
end
Since there are quite a large number of users and each API response takes about one second, the last user receives the message much later than the scheduled time.
I thought of using Thread like this:
@users.each do |user|
...
Thread.new{ SendMessage.new(user.id) }
...
end
Can I do as above? Is it a good idea to use Thread.new 20,000 times within a loop? Are there any drawbacks? Is there something else I am supposed to do?
Looking at your need to send 20,000 API calls to a third party provider, and assuming this can be taken async, you should implement this with Sidekiq or Resque.
You can issue a request initially, and then poll continuously for status update if needed.
I can't comment yet, but if my answer is not useful I will delete it.
If you use each, all records will be loaded into memory, which is not a good idea when you have more than 20,000 records.
Try find_each instead. The find is performed by find_in_batches with a batch size of 1000 (or as specified by the :batch_size option).
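find_each needs ActiveRecord, so the sketch below uses Enumerable#each_slice on a plain range of ids as a stand-in to show the batching idea. process_in_batches and the id range are hypothetical.

```ruby
# Stand-in for find_each: walks the ids in batches so only one batch is
# held at a time. With ActiveRecord the equivalent loop would be
# User.find_each(batch_size: 1000) { |user| ... }.
def process_in_batches(ids, batch_size: 1000)
  batches = 0
  processed = 0
  ids.each_slice(batch_size) do |batch|
    batches += 1
    batch.each { |_id| processed += 1 } # the per-user work would go here
  end
  { batches: batches, processed: processed }
end

p process_in_batches(1..25_000)
```

Batching limits memory use but does not make the API calls any faster on its own; it combines naturally with a worker pool or a background job queue.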

Rails 4 - threading error

I am trying to perform some calculations to populate some historic data in the database.
The database is SQL Server. The server is tomcat (using JRuby).
I am running the script file in a rails console pointed to the uat environment.
I am trying to use threads to speed up the execution. The idea being that each thread would take an object and run the calculations for it, and save the calculated values back to the database.
Problem: I keep getting this error:
ActiveRecord::ConnectionTimeoutError (could not obtain a database connection within 5.000 seconds (waited 5.000 seconds))
code:
require 'thread'
threads = []
items_to_calculate = Item.where("id < 11").to_a #testing only 10 items for now
for item in items_to_calculate
threads << Thread.new(item) { |myitem|
my_calculator = ItemsCalculator.new(myitem)
to_save = my_calculator.calculate_details
to_save.each do |dt|
dt.save!
end
}
end
threads.each { |aThread| aThread.join }
You're probably spawning more threads than ActiveRecord's DB connection pool has connections. Ekkehard's answer is an excellent general description; so here's a simple example of how to limit your workers using Ruby's thread-safe Queue.
require 'thread'
queue = Queue.new
items.each { |i| queue << i } # Fill the queue
Array.new(5) do # Only 5 concurrent workers
Thread.new do
loop do
item = queue.pop(true) rescue break # non-blocking pop; avoids blocking forever if another worker takes the last item
ActiveRecord::Base.connection_pool.with_connection do
# Work
end
end
end
end.each(&:join)
I chose 5 because that's the ConnectionPool's default, but you can certainly tune that to the max that still works, or populate another queue with the result to save later and run an arbitrary number of threads for the calculation.
The with_connection method grabs a connection, runs your block, then ensures the connection is released. It's necessary because of a bug in ActiveRecord where the connection doesn't always get released otherwise. Check out this blog post for some details.
You are potentially starting a huge number of threads at the same time once you leave the testing stage.
Each of these threads will need a DB connection. Either Rails is going to create a new one for every thread (possibly creating a huge number of DB connections at the same time), or it does not, in which case you'll run into trouble because several threads are trying to use the same connection in parallel. The first case would explain the error message because there will probably be a hard limit on open DB connections in your DB server.
Creating threads like this is usually not advisable. You're usually better off creating a small, controlled number of worker threads and using a queue to distribute work between them.
In your case, you could have a set of worker threads to do the calculations, and a second set of worker threads to write to the DB. I do not know enough about the details of your code to decide for you which is better. If the calculation is expensive and the DB work is not, then you will probably have only one worker for writing to the DB in a serial fashion. If your DB is a beast and highly optimized for parallel writing and you need to write a lot of data, then you may want a (small) number of DB workers.
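A rough sketch of the "calculation workers plus one DB writer" layout, using only the standard library. calculate_and_store is a hypothetical name, squaring a number stands in for the expensive calculation, and an array stands in for the database table.

```ruby
# Calculation workers feed a single writer thread through a queue, so only
# one thread ever touches the (simulated) database.
def calculate_and_store(items, calc_workers: 4)
  work = Queue.new
  items.each { |item| work << item }
  work.close

  results = Queue.new
  stored = [] # stands in for the database table

  writer = Thread.new do
    while (row = results.pop) # nil once results is closed and drained
      stored << row           # a real writer would call save! here
    end
  end

  calculators = Array.new(calc_workers) do
    Thread.new do
      while (item = work.pop)
        results << item * item # placeholder for the expensive calculation
      end
    end
  end

  calculators.each(&:join)
  results.close # tell the writer no more rows are coming
  writer.join
  stored
end

p calculate_and_store((1..10).to_a).sort
```

With a single writer only one DB connection is needed for the writes, which sidesteps the connection pool limit regardless of how many calculation workers run.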

Multithreading vs Background jobs in Rails

I have an application that makes thousands of requests to a web service API. Each request takes about 2 seconds, then the response creates a new record in the database. I want to just fire off as many of those requests as I can simultaneously, and save each response to the database as soon as I get it.
Is this something I should be using a gem like sidekiq for, or the ruby Thread class? I don't want to just hand off the requests to be handled synchronously.
Sounds like you need a thread pool for performing the operation, and a database thread to commit the results.
You can build one of these really simply:
require 'thread'
db_queue = Queue.new
Thread.new do
while (item = db_queue.pop)
# ... Deal with item in queue
end
end
# Example of supplying a job
db_queue.push(api_response)
# When finished
db_queue.push(nil)
Due to the Global Interpreter Lock in the standard Ruby runtime threads are only really useful for managing many lightly loaded threads. If you need something more heavy-duty, JRuby might be what you're looking for.

Rails: How to execute one task per user in parallel?

I have one simple rake task that executes some action per user. Something like this:
task users_actions: :environment do
User.all.each { |u|
# Some actions here
}
end
The problem is that it doesn't start on the next user until it has finished the previous one. What I want is to execute these in parallel. How can I do that? Is it even possible?
Thanks,
If there was a good library available, it would be better to use it rather than implementing everything from scratch. concurrent-ruby has all kinds of utility classes for writing concurrent code, but I'm not sure if they have something suitable for this use case; anyways, I'll show you how to do it from scratch.
First pull in the thread library:
require 'thread'
Make a thread-safe queue, and stick all the users on it:
queue = Queue.new
User.all.each { |user| queue << user }
Start some number of worker threads, and make them process items from the queue until all are done.
threads = 5.times.collect do
Thread.new do
while true
user = queue.pop(true) rescue break
# do something with the user
# this had better be thread-safe, or you will live to regret it!
end
end
end
Then wait until all the threads finish:
threads.each(&:join)
Again, please make sure that the code which processes each user is thread-safe! The power of multi-threading is in your hands, don't abuse it!
NOTE: If your user-processing code is very CPU-intensive, you might consider running this on Rubinius or JRuby, so that more than one thread can run Ruby code at the same time.
