I would like to be able to connect to different databases in separate threads, and query the same model in each database. For instance, without threads, I can do something like:
# given 'db1' and 'db2' are rails environments with connection configurations
['db1', 'db2'].each do |db|
  Post.establish_connection(db)
  Post.where(title: "Foo")
end
Post.establish_connection(Rails.env)
This will loop over the two databases and look up the posts in each. I need to be able to do this in parallel using threads, like:
threads = ['db1', 'db2'].map do |db|
  Thread.new do
    Post.establish_connection(db)
    Post.where(title: "Foo")
  end
end
threads.each(&:join)
Post.establish_connection(Rails.env)
But quite clearly, establishing a new connection pool in each thread using the global Post class isn't threadsafe.
What I'd like to do is establish a new connection pool in each thread. I got this far:
threads = ['db1', 'db2'].map do |db|
  Thread.new do
    conf = ActiveRecord::ConnectionAdapters::ConnectionSpecification.new(Rails.configuration.database_configuration[db], "mysql2_connection")
    pool = ActiveRecord::ConnectionAdapters::ConnectionPool.new(conf)
    pool.with_connection do |con|
      # Problem: I have this con object, but using the Post class still uses the thread's default connection.
      Post.where(title: "Foo")
    end
  end
end
threads.each(&:join)
Surely there has to be a way to change the connection pool that ActiveRecord uses on a thread-by-thread basis?
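One direction that might work (just a sketch; I have not verified it) is to skip the model class entirely and run raw SQL through each thread's private pool:

threads = ['db1', 'db2'].map do |db|
  Thread.new do
    conf = ActiveRecord::ConnectionAdapters::ConnectionSpecification.new(Rails.configuration.database_configuration[db], "mysql2_connection")
    pool = ActiveRecord::ConnectionAdapters::ConnectionPool.new(conf)
    begin
      pool.with_connection do |con|
        # Query through the checked-out connection instead of the Post class;
        # the literal SQL here is a placeholder standing in for the real query.
        con.select_all("SELECT * FROM posts WHERE title = 'Foo'")
      end
    ensure
      pool.disconnect! # don't leak the throwaway pool's connections
    end
  end
end
threads.each(&:join)

This avoids touching Post's global connection, at the cost of dropping down to raw SQL.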
Related
I am using a second database with datasets within my API.
Every API request can trigger up to 3 queries on that database, so I am splitting them across three threads. To keep this thread-safe I am using a connection pool.
But after the code has run, the connection pool's thread is not terminated. So basically, every time a request is made we get a new thread on the server, until eventually there is no memory left.
Is there a way to close the connection pool's thread? Or am I wrong to create a connection pool per request?
I set up the connection pool this way:
begin
  full_db = YAML::load(ERB.new(File.read(Rails.root.join("config", "full_datasets_database.yml"))).result)
  resolver = ActiveRecord::ConnectionAdapters::ConnectionSpecification::Resolver.new(full_db)
  spec = resolver.spec(Rails.env.to_sym)
  pool = ActiveRecord::ConnectionAdapters::ConnectionPool.new(spec)
Then I iterate over the queries array in threads and collect the results of each query:
  returned_responses = []
  threads = []
  queries_array.each do |query|
    threads << Thread.new do
      pool.with_connection do |conn|
        returned_responses << conn.execute(query).to_a
      end
    end
  end
  threads.map(&:join)
  returned_responses
Finally I close the connections inside the connection pool:
ensure
  pool.disconnect!
end
Since you want to make SQL queries directly without taking advantage of ActiveRecord as the ORM, but you do want to take advantage of ActiveRecord connection pooling, I suggest you create a new abstract class like ApplicationRecord:
# app/models/full_datasets.rb
class FullDatasets < ActiveRecord::Base
  self.abstract_class = true

  connects_to database: {
    writing: :full_datasets_database,
    reading: :full_datasets_database
  }
end
You'll need to configure the database full_datasets_database in database.yml so that connects_to is able to connect to it.
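A database.yml entry along these lines should work (the adapter and database names here are assumptions; mirror your real primary config):

production:
  primary:
    adapter: mysql2
    database: my_app_production
  full_datasets_database:
    adapter: mysql2
    database: full_datasets_production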
Then you'll be able to connect directly to that database and make direct SQL queries against it by referencing that class instead of ActiveRecord::Base:
FullDatasets.connection.execute(query)
The connection pooling will happen transparently with different pools:
FullDatasets.connection_pool.object_id
=> 22620
ActiveRecord::Base.connection_pool.object_id
=> 9000
You may have to do additional configuration, like dumping the schema to db/full_datasets_schema.rb, but any additional troubleshooting or configuration you'll have to do is described in https://guides.rubyonrails.org/active_record_multiple_databases.html.
The short version of this explanation is that you should take advantage of ActiveRecord as much as possible, so that your implementation stays clean and straightforward while still letting you drop down to raw SQL.
After some time spent, I ended up finding an answer. The general idea came from @anothermg, but I had to make some changes for it to work in my version of Rails (5.2).
I set up the database config in config/full_datasets_database.yml.
I had the following initializer already:
#! config/initializers/db_full_datasets.rb
DB_FULL_DATASETS = YAML::load(ERB.new(File.read(Rails.root.join("config","full_datasets_database.yml"))).result)[Rails.env]
I created the following model to create a connection to the new database:
#! app/models/full_datasets.rb
class FullDatasets < ActiveRecord::Base
  self.abstract_class = true

  establish_connection DB_FULL_DATASETS
end
On the actual module I added the following code:
def parallel_queries(queries_array)
  returned_responses = []
  threads = []
  pool = FullDatasets.connection_pool
  queries_array.each do |query|
    threads << Thread.new do
      returned_responses << pool.with_connection { |conn| conn.execute(query).to_a }
    end
  end
  threads.map(&:join)
  returned_responses
end
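Hypothetical usage of the method above (the SQL strings are placeholders for real dataset queries):

responses = parallel_queries([
  "SELECT COUNT(*) FROM datasets",
  "SELECT MAX(updated_at) FROM datasets"
])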
Follow the official way of handling multiple databases in Rails:
https://guides.rubyonrails.org/active_record_multiple_databases.html
I can't give you an accurate answer, as I do not have your source code to fully understand the whole context. If the setup I described above is not applicable to your use case, you might have missed some background cleanup tasks. You can refer to this doc:
https://api.rubyonrails.org/classes/ActiveRecord/ConnectionAdapters/ConnectionPool.html
I have a requirement of batch imports. Files can contain 1000s of records and each record needs validation. User wants to be notified how many records were invalid. Originally I did this with Ruby's Mutex and Redis' Publish/Subscribe. Note that I have 20 concurrent threads processing each record via Sidekiq:
class Record < ActiveRecord::Base
  class << self
    # invalidated_records is SHARED memory for the Sidekiq worker threads
    attr_accessor :invalidated_records
    attr_accessor :semaphore
  end

  def self.batch_import
    self.semaphore = Mutex.new
    self.invalidated_records = []
    redis.subscribe_with_timeout(180, 'validation_update') do |on|
      on.message do |channel, message|
        if message.to_s =~ /\d+|import_.+/
          self.semaphore.synchronize {
            self.invalidated_records << message
          }
        elsif message == 'exit'
          redis.unsubscribe
        end
      end
    end
  end
end
Sidekiq would publish to the Record object:
Redis.current.publish 'validation_update', 'import_invalid_address'
The problem is that something weird happens: not all of the invalid imports end up in Record.invalidated_records. Many of them do, but not all. I thought it was because multiple threads updating the object concurrently were corrupting it, and that the Mutex lock would solve the problem. But even after adding the Mutex lock, not all invalids are populated in Record.invalidated_records.
Ultimately, I used Redis's atomic increment and decrement to track invalid imports, and that worked like a charm. But I am curious: what is the issue with a Ruby Mutex and multiple threads trying to update Record.invalidated_records?
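For reference, a minimal sketch of the atomic-counter approach I ended up with (the key name here is made up): each worker thread increments a counter server-side in Redis instead of mutating shared Ruby state, and the total is read back once the batch finishes.

# In each Sidekiq worker thread, on an invalid record:
Redis.current.incr('batch_import:invalid_count') # atomic on the Redis server

# After the batch completes:
invalid_count = Redis.current.get('batch_import:invalid_count').to_i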
I have not used Mutex, but I think what happens is that a thread sees the semaphore is locked and skips saving the message. You need to use a ConditionVariable (https://apidock.com/ruby/ConditionVariable): wait for the mutex to unlock, then save the data.
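A minimal ConditionVariable sketch of what this answer seems to suggest (the ready flag is standard predicate-checking usage, not something from the original post; whether this actually fixes the problem above is debatable):

mutex = Mutex.new
cond  = ConditionVariable.new
ready = false

waiter = Thread.new do
  mutex.synchronize do
    cond.wait(mutex) until ready # releases the lock while sleeping
    # ... save the data here ...
  end
end

mutex.synchronize do
  ready = true
  cond.signal # wake the waiting thread
end
waiter.join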
I am trying to perform some calculations to populate some historic data in the database.
The database is SQL Server, and the server is Tomcat (using JRuby).
I am running the script file in a rails console pointed to the uat environment.
I am trying to use threads to speed up the execution. The idea being that each thread would take an object and run the calculations for it, and save the calculated values back to the database.
Problem: I keep getting this error:
ActiveRecord::ConnectionTimeoutError (could not obtain a database connection within 5.000 seconds (waited 5.000 seconds))
code:
require 'thread'

threads = []
items_to_calculate = Item.where("id < 11").to_a # testing only 10 items for now

items_to_calculate.each do |item|
  threads << Thread.new(item) do |myitem|
    my_calculator = ItemsCalculator.new(myitem)
    to_save = my_calculator.calculate_details
    to_save.each do |dt|
      dt.save!
    end
  end
end

threads.each { |aThread| aThread.join }
You're probably spawning more threads than ActiveRecord's DB connection pool has connections. Ekkehard's answer is an excellent general description, so here's a simple example of how to limit your workers using Ruby's thread-safe Queue.
require 'thread'

queue = Queue.new
items.each { |i| queue << i } # fill the queue

Array.new(5) do # only 5 concurrent workers
  Thread.new do
    # Non-blocking pop: a plain queue.pop would block forever if another
    # worker emptied the queue between an empty? check and the pop.
    while (item = (queue.pop(true) rescue nil))
      ActiveRecord::Base.connection_pool.with_connection do
        # work with item
      end
    end
  end
end.each(&:join)
I chose 5 because that's the ConnectionPool's default, but you can certainly tune that to the max that still works, or populate another queue with the result to save later and run an arbitrary number of threads for the calculation.
The with_connection method grabs a connection, runs your block, then ensures the connection is released. It's necessary because of a bug in ActiveRecord where the connection doesn't always get released otherwise. Check out this blog post for some details.
You are potentially starting a huge number of threads at the same time once you leave the testing stage.
Each of these threads will need a DB connection. Either Rails creates a new one for every thread (possibly opening a huge number of DB connections at once), or it does not, in which case you'll run into trouble because several threads are trying to use the same connection in parallel. The first case would explain the error message, because there is probably a hard limit on open DB connections in your DB server.
Creating threads like this is usually not advisable. You're usually better off creating a limited, controlled number of worker threads and using a queue to distribute work between them.
In your case, you could have one set of worker threads doing the calculations and a second set writing to the DB; a sketch follows below. I do not know enough about the details of your code to decide for you which is better. If the calculation is expensive and the DB work is not, then you will probably have just one worker writing to the DB in a serial fashion. If your DB is a beast, highly optimized for parallel writing, and you need to write a lot of data, then you may want a (small) number of DB workers.
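A rough sketch of that two-stage split (thread counts are arbitrary, the Item/ItemsCalculator names come from the question, and the nil sentinel is my own convention): calculation workers feed a results queue that a single DB writer drains.

require 'thread'

work    = Queue.new
results = Queue.new
Item.where("id < 11").each { |item| work << item }

calculators = Array.new(4) do # 4 calculation workers; tune as needed
  Thread.new do
    loop do
      item = work.pop(true) rescue break # non-blocking pop; exit when drained
      results << ItemsCalculator.new(item).calculate_details
    end
  end
end

writer = Thread.new do # a single writer keeps the DB work serial
  while (batch = results.pop) # the nil sentinel ends this loop
    ActiveRecord::Base.connection_pool.with_connection do
      batch.each(&:save!)
    end
  end
end

calculators.each(&:join)
results << nil # tell the writer we're done
writer.join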
In our Rails application we need to use different databases depending on the subdomain of the request (different DB per country).
Right now we're doing something similar to what's recommended in this question. That is, calling ActiveRecord::Base.establish_connection on each request.
But it seems ActiveRecord::Base.establish_connection drops the current connection pool and establishes a new connection each time it's called.
I made this quick benchmark to see if there was any significant difference between calling establish_connection each time and having the connections already established:
require 'benchmark/ips'

$config = Rails.configuration.database_configuration[Rails.env]
$db1_config = $config.dup.update('database' => 'db1')
$db2_config = $config.dup.update('database' => 'db2')

# Method 1: call establish_connection on each "request".
Benchmark.ips do |r|
  r.report('establish_connection:') do
    # Simulate two requests, one for each DB.
    ActiveRecord::Base.establish_connection($db1_config)
    MyModel.count # a little query to force the DB connection to establish
    ActiveRecord::Base.establish_connection($db2_config)
    MyModel.count
  end
end

# Method 2: have different subclasses of my models, one for each DB, and
# call establish_connection only once.
class MyModelDb1 < MyModel
  establish_connection($db1_config)
end

class MyModelDb2 < MyModel
  establish_connection($db2_config)
end

Benchmark.ips do |r|
  r.report('different models:') do
    MyModelDb1.count
    MyModelDb2.count
  end
end
I ran this script with rails runner, pointing at a local MySQL with a couple of thousand records in each DB. The results seem to indicate that there is in fact a pretty big difference (an order of magnitude) between the two methods (BTW, I'm not sure whether the benchmark is valid, or whether I screwed up and the results are therefore misleading):
Calculating -------------------------------------
establish_connection: 8 i/100ms
-------------------------------------------------
establish_connection: 117.9 (±26.3%) i/s - 544 in 5.001575s
Calculating -------------------------------------
different models: 119 i/100ms
-------------------------------------------------
different models: 1299.4 (±22.1%) i/s - 6188 in 5.039483s
So, basically, I'd like to know if there's a way to maintain a connection pool for each subdomain and then re-use those connections instead of establishing a new connection on each request. Having a subclass of my models for each subdomain is not feasible, as there are many models; I just want to change the connection for all the models (in ActiveRecord::Base).
Well, I've been digging into this a bit more and managed to get something working.
After reading tenderlove's post about connection management in ActiveRecord, which explains how the class hierarchy gets unnecessarily coupled with the connection management, I understood why doing what I'm trying to do is not as straightforward as one would expect.
What I ended up doing was subclassing ActiveRecord's ConnectionHandler and using that new connection handler at the top of my model hierarchy (some fiddling with the ConnectionHandler code was needed to understand how it works internally, so of course this solution could be very tied to the Rails version I'm using (3.2)). Something like:
# A model class that connects to a different DB depending on the subdomain
# we're in
class ModelBase < ActiveRecord::Base
  self.abstract_class = true
  self.connection_handler = CustomConnectionHandler.new
end

# ...

class CustomConnectionHandler < ActiveRecord::ConnectionAdapters::ConnectionHandler
  def initialize
    super
    @pools_by_subdomain = {}
  end

  # Override the behaviour of ActiveRecord's ConnectionHandler to return a
  # connection pool for the current subdomain.
  def retrieve_connection_pool(klass)
    # Get the current subdomain somehow (maybe store it in a class variable
    # on each request or whatever)
    subdomain = @@subdomain
    @pools_by_subdomain[subdomain] ||= create_pool(subdomain)
  end

  private

  def create_pool(subdomain)
    conf = Rails.configuration.database_configuration[Rails.env].dup
    # The name of the DB for that subdomain...
    conf.update('database' => "db_#{subdomain}")
    resolver = ActiveRecord::Base::ConnectionSpecification::Resolver.new(conf, nil)
    # Call ConnectionHandler#establish_connection, which receives a key
    # (in this case the subdomain) for the new connection pool
    establish_connection(subdomain, resolver.spec)
  end
end
This still needs some testing to check if there is in fact a performance gain, but my initial tests running on a local Unicorn server suggest there is.
As far as I know, Rails does not maintain its database pool between requests, except in a multi-threaded environment like Sidekiq. If you use Passenger or Unicorn on your production server, it will create a new database connection for each Rails instance.
So basically using a database connection pool is useless in that setup, which means that creating a new database connection on each request should not be a concern.
I'm using rufus-scheduler to run a number of frequent jobs that do various tasks with ActiveRecord objects. If there is any sort of network or PostgreSQL hiccup, even after recovery, all the threads will throw the following error until the process is restarted:
ActiveRecord::ConnectionTimeoutError (could not obtain a database connection within 5 seconds (waited 5.000122687 seconds). The max pool size is currently 5; consider increasing it.)
The error can easily be reproduced by restarting Postgres. I've tried playing with the pool size (up to 15), but no luck there.
That leads me to believe the connections are just in a stale state, which I thought would be fixed by the call to clear_stale_cached_connections!.
Is there a more reliable pattern for this?
The block that is passed in is a simple ActiveRecord select and update, and the error happens no matter what the AR object is.
The rufus job:
scheduler.every '5s' do
  db do
    DataFeed.update # standard AR select/update
  end
end
wrapper:
def db(&block)
  begin
    ActiveRecord::Base.connection_pool.clear_stale_cached_connections!
    # ActiveRecord::Base.establish_connection # this didn't help either way
    yield block
  rescue Exception => e
    raise e
  ensure
    ActiveRecord::Base.connection.close if ActiveRecord::Base.connection
    ActiveRecord::Base.clear_active_connections!
  end
end
Rufus scheduler starts a new thread for every job.
ActiveRecord on the other hand cannot share connections between threads, so it needs to assign a connection to a specific thread.
When your thread doesn't have a connection yet, it will get one from the pool.
(If all connections in the pool are in use, it will wait until one is returned from another thread, eventually timing out and throwing ConnectionTimeoutError.)
It is your responsibility to return the connection to the pool when you are done with it. In a Rails app this is done automatically, but if you are managing your own threads (as rufus does), you have to do it yourself.
Luckily, there is an API for this:
if you put your code inside a with_connection block, it will get a connection from the pool and release it when it is done:
ActiveRecord::Base.connection_pool.with_connection do
  # your code here
end
In your case:
def db
  ActiveRecord::Base.connection_pool.with_connection do
    yield
  end
end
Should do the trick....
http://api.rubyonrails.org/classes/ActiveRecord/ConnectionAdapters/ConnectionPool.html#method-i-with_connection
The reason may be that you have many threads using all the connections: if the DataFeed.update method takes more than 5 seconds, your scheduled blocks can overlap.
Try:
scheduler.every("5s", :allow_overlapping => false) do
#...
end
Also, try releasing the connection instead of closing it:
ActiveRecord::Base.connection_pool.release_connection
I don't really know rufus-scheduler, but I have some ideas.
The first possibility is a bug in rufus-scheduler where it does not handle database connection checkout properly. If that's the case, the only solution is to clear stale connections manually, as you already do, and to inform the author of rufus-scheduler about your issue.
Another possibility is that your DataFeed operation takes a really long time and, because it is performed every 5 seconds, Rails runs out of database connections; but that's rather unlikely.