I have a controller that spins off 6 Sidekiq threads for faster parallel processing of a large file. Before that, however, I want to provide these threads with a few variables that should be available across all threads, because the variables themselves are fairly memory intensive. (The threads only read from them, never write, so there is no concurrency issue.)
In other words, my controller looks like this:

def foo
  $bar1 = ...
  $bar2 = ...
  worker.perform_async()...
  worker2.perform_async()...
end
I don't want to pass those globals as arguments to the perform methods because serializing them to Redis chokes the entire thing. My issue is that the workers cannot see these variables and die with a NoMethodError (i.e. calling .first on one of them fails because the variable is nil inside the workers).
How come? Is there any other way to do this that won't kill my memory? (i.e. I don't want to take up most of the memory with 6 copies of the same large array.)
Sidekiq runs in a separate process, so it doesn't share memory with the process that enqueued the work.
If the data is static, you might want to load it at the start of the Sidekiq process (for example, where you configure the Sidekiq server).
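A minimal sketch of that idea (SharedData and the file path are illustrative names, not Sidekiq API; configure_server is real):

# config/initializers/sidekiq.rb
# Load the large, read-only data once per Sidekiq process; every worker
# thread inside that process then shares this single in-memory copy.
module SharedData
  def self.bar1
    @bar1 ||= File.readlines("path/to/large_file") # hypothetical loader
  end
end

Sidekiq.configure_server do |_config|
  SharedData.bar1 # warm the data when the server process boots
end

Workers then read SharedData.bar1 directly instead of receiving it as an argument.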
If it changes per task, you should model it as a global repository that all workers can read from (if Redis is not a good fit for this, maybe try memcached)...
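As a rough sketch of the repository idea, assuming Rails.cache is backed by a memcached store and batch_id is an illustrative key:

# Before enqueueing the jobs: store the structure once under a known key.
Rails.cache.write("bar1/#{batch_id}", bar1)

# Inside each worker: read it back instead of passing it through Redis.
bar1 = Rails.cache.read("bar1/#{batch_id}")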
EDIT:
My question was horrifically put, so I deleted it and am rephrasing it entirely here.
I'll give a tl;dr:
I'm trying to assign each computation to a designated worker that fits the computation type.
In full:
I'm trying to run a simulation, so I represent it using a class of the form:
class Simulation:
    def __init__(self, first_client: Client, second_client: Client):
        self.first_client = first_client
        self.second_client = second_client

    def first_calculation(self, input):
        with self.first_client.as_current():
            ...  # first computation happens here
            return output

    def second_calculation(self, input):
        with self.second_client.as_current():
            ...  # second computation happens here
            return output

    def run(self, input):
        return self.second_calculation(self.first_calculation(input))
This format has downsides like the fact that this simulation object is not pickleable.
I could edit the Simulation object to contain only addresses and not clients for example, but I feel as if there must be a better solution. For instance, I would like the simulation object to work the following way:
class Simulation:
    def first_calculation(self, input):
        client = dask.distributed.get_client()
        with client.as_current():
            ...  # first computation happens here
            return output
    ...
Thing is, the Dask workers best suited for the first calculation are different from the Dask workers best suited for the second calculation, which is why my Simulation object has two clients that connect to two different schedulers to begin with. Is there any way to make it so there is only one client but two types of schedulers, so that the client knows to run first_calculation on the first scheduler and second_calculation on the second one?
Dask will chop up large computations into smaller tasks that can run in parallel. Those tasks are then submitted by the client to the scheduler, which in turn schedules them on the available workers.
Sending the client object to a Dask scheduler will likely not work due to the serialization issue you mention.
You could try one of two approaches:
Depending on how you actually run those worker machines, you could specify different types of workers for different tasks. If you run on Kubernetes, for example, you could try to leverage the node pool functionality to make different worker types available.
An easier approach using your existing infrastructure would be to bring the results of your first computation back to the machine from which you are running the client, using something like .compute(), and then use that data as input for the second computation. In this case you're sending the actual data over the network instead of the client. If the size of that data becomes an issue, you can always write the intermediate results to something like S3.
Dask does support giving specific tasks to specific workers with annotate. Here's an example snippet, where a delayed_sum task was passed to one worker and the doubled task was sent to the other worker. The assert statements check that those workers really were restricted to only those tasks. With annotate you shouldn't need separate clusters. You'll also need the most recent versions of Dask and Distributed for this to work because of a recent bug fix.
import distributed
import dask
from dask import delayed

local_cluster = distributed.LocalCluster(n_workers=2)
client = distributed.Client(local_cluster)
workers = list(client.scheduler_info()['workers'].keys())

with dask.annotate(workers=workers[0]):
    delayed_sum = delayed(sum)([1, 2])

with dask.annotate(workers=workers[1]):
    doubled = delayed_sum * 2

# use persist so the scheduler doesn't clean the tasks up, and wrap in
# distributed.wait to make sure they're there when we check the scheduler
distributed.wait([doubled.persist(), delayed_sum.persist()])

worker_restrictions = local_cluster.scheduler.worker_restrictions
assert worker_restrictions[delayed_sum.key] == {workers[0]}
assert worker_restrictions[doubled.key] == {workers[1]}
I am trying to perform some calculations to populate some historic data in the database.
The database is SQL Server. The app server is Tomcat (using JRuby).
I am running the script file in a Rails console pointed at the UAT environment.
I am trying to use threads to speed up the execution. The idea being that each thread would take an object and run the calculations for it, and save the calculated values back to the database.
Problem: I keep getting this error:
ActiveRecord::ConnectionTimeoutError (could not obtain a database connection within 5.000 seconds (waited 5.000 seconds))
code:
require 'thread'

threads = []
items_to_calculate = Item.where("id < 11").to_a # testing only 10 items for now

for item in items_to_calculate
  threads << Thread.new(item) { |myitem|
    my_calculator = ItemsCalculator.new(myitem)
    to_save = my_calculator.calculate_details
    to_save.each do |dt|
      dt.save!
    end
  }
end

threads.each { |aThread| aThread.join }
You're probably spawning more threads than ActiveRecord's DB connection pool has connections. Ekkehard's answer is an excellent general description, so here's a simple example of how to limit your workers using Ruby's thread-safe Queue.
require 'thread'

queue = Queue.new
items.each { |i| queue << i } # Fill the queue

Array.new(5) do # Only 5 concurrent workers
  Thread.new do
    # Non-blocking pop raises ThreadError when the queue is empty;
    # rescuing to nil ends the loop without racing a separate empty? check.
    while (item = queue.pop(true) rescue nil)
      ActiveRecord::Base.connection_pool.with_connection do
        # Work with item
      end
    end
  end
end.each(&:join)
I chose 5 because that's the ConnectionPool's default, but you can certainly tune that to the max that still works, or populate another queue with the result to save later and run an arbitrary number of threads for the calculation.
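If you're unsure what your app currently allows, you can check at runtime; the value itself comes from the pool setting in database.yml for the environment you're running in:

# The configured maximum number of connections in the current pool.
ActiveRecord::Base.connection_pool.size # => 5 by default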
The with_connection method grabs a connection, runs your block, then ensures the connection is released. It's necessary because of a bug in ActiveRecord where the connection doesn't always get released otherwise. Check out this blog post for some details.
You are potentially starting a huge number of threads at the same time once you leave the testing stage.
Each of these threads will need a DB connection. Either Rails creates a new one for every thread (possibly opening a huge number of DB connections at once), or it does not, in which case you'll run into trouble because several threads would be using the same connection in parallel. The first case would explain the error message, since your DB server will have a hard limit on open connections.
Creating threads like this is usually not advisable. You're usually better off creating a handful (a controlled, limited number) of worker threads and using a queue to distribute the work between them.
In your case, you could have one set of worker threads doing the calculations and a second set writing to the DB. I don't know enough about the details of your code to decide which is better for you. If the calculation is expensive and the DB work is not, then you will probably want only one worker writing to the DB in a serial fashion. If your DB is a beast, highly optimized for parallel writing, and you need to write a lot of data, then you may want a (small) number of DB workers.
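Here's a hedged sketch of that split, reusing the names from the question (ItemsCalculator, calculate_details, items_to_calculate) and assuming calculate_details returns an array of records:

require 'thread'

work_queue = Queue.new
db_queue   = Queue.new
items_to_calculate.each { |item| work_queue << item }

# A few calculation workers, since the calculation is the expensive part.
calculators = Array.new(4) do
  Thread.new do
    while (item = work_queue.pop(true) rescue nil)
      db_queue << ItemsCalculator.new(item).calculate_details
    end
  end
end

# One serial writer, so only a single DB connection is ever needed here.
writer = Thread.new do
  while (records = db_queue.pop) # a nil sentinel ends the loop
    ActiveRecord::Base.connection_pool.with_connection do
      records.each(&:save!)
    end
  end
end

calculators.each(&:join)
db_queue << nil # no more work is coming
writer.join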
I have an application that makes thousands of requests to a web service API. Each request takes about 2 seconds, and the response creates a new record in the database. I want to fire off as many of those requests as I can simultaneously, and save each response to the database as soon as I get it.
Is this something I should be using a gem like sidekiq for, or the ruby Thread class? I don't want to just hand off the requests to be handled synchronously.
Sounds like you need a thread pool for performing the operation, and a database thread to commit the results.
You can build one of these really simply:
require 'thread'

db_queue = Queue.new

Thread.new do
  while (item = db_queue.pop)
    # ... Deal with item in queue
  end
end

# Example of supplying a job
db_queue.push(api_response)

# When finished
db_queue.push(nil)
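To fill in the other half, here's a rough sketch of a request pool feeding such a database thread; urls and Record are stand-ins for your endpoint list and model, and Net::HTTP is from the standard library:

require 'net/http'
require 'thread'

url_queue = Queue.new
db_queue  = Queue.new
urls.each { |u| url_queue << u } # `urls` is your list of API endpoints

# Database thread: saves each response as soon as it arrives.
db_thread = Thread.new do
  while (response = db_queue.pop)
    Record.create!(body: response) # `Record` is a stand-in model
  end
end

# Request pool: size it to what the API tolerates, not to your item count.
requesters = Array.new(20) do
  Thread.new do
    while (url = url_queue.pop(true) rescue nil)
      db_queue << Net::HTTP.get(URI(url))
    end
  end
end

requesters.each(&:join)
db_queue.push(nil) # tell the database thread we're finished
db_thread.join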
Due to the Global Interpreter Lock in the standard Ruby runtime, threads are only really useful for lightly loaded, I/O-bound work such as these API calls; CPU-bound work won't actually run in parallel. If you need something more heavy-duty, JRuby might be what you're looking for.
In my project there is one script that returns a list of products which I have to display in a table.
To capture the output of the script I used IO.popen:
@device_list = []
IO.popen("device list").each do |device|
  @device_list << device
end
device list is the command that will give me the product list.
I return the @device_list array to my view and display it by iterating over it.
When I run it I get an error:
Errno::ENOMEM (Cannot allocate memory):
for IO.popen
I have another script, device status, that returns only true or false, but I get the same error:
def check_status(device_id)
  @stat = system("status #{device_id}") # interpolate the id into the command
  if @stat == true
    "sold"
  else
    "not sold"
  end
end
What should I do?
Both IO.popen and Kernel#system can be expensive operations in terms of memory because they both rely on fork(2). fork(2) is a Unix system call which creates a child process that clones the parent's memory and resources. That means that if your parent process uses 500 MB of memory, the child will also take 500 MB. So each time you call Kernel#system or IO.popen, you increase your application's memory usage by the amount of memory it takes to run your Rails app.
If your development machine has more RAM than your production server or if your production server produces a lot more output, there are two things you could do:
Increase memory for your production server.
Do some memory management using something like Resque.
You can use Resque to queue those operations as jobs. Resque will then spawn "workers"/child processes that take a job from the queue, work on it, and then exit. Resque still forks, but the important thing is that each worker exits after finishing its task, which frees the memory. There'll be a spike in memory every time a worker does a job, but it will go back to your app's baseline afterwards.
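A minimal Resque job sketch (the class name, queue name, and persistence step are illustrative; @queue, self.perform, and Resque.enqueue are how Resque jobs are defined):

class DeviceListJob
  @queue = :devices

  def self.perform
    list = IO.popen("device list", &:readlines)
    # Persist `list` somewhere your controller can read it back,
    # e.g. a database table or cache entry.
  end
end

# In the controller, enqueue instead of shelling out in-request:
Resque.enqueue(DeviceListJob)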
You might have to do both options above and look for other ways to minimize the memory-usage of your app.
It seems the output from device list is too large.
The question "Cannot allocate memory (Errno::ENOMEM)" is a useful link that describes this problem.
Limit the output of device list and check again; then you will know whether or not it is a memory issue.
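For example, a quick way to cap the output while testing, assuming a Unix shell with head available:

# Read only the first 100 lines to see whether output size is the problem.
@device_list = IO.popen("device list | head -n 100", &:readlines)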
I have a controller called McController which extends ApplicationController, and I set a class variable in McController called @@scheduler_map, like below:
class McController < ApplicationController
  @@scheduler_map = {}

  def action
    ...
  end

  private

  def get_scheduler(host, port)
    scheduler = @@scheduler_map[host + "_" + port]
    unless scheduler
      scheduler = Scheduler.create(host, port)
      @@scheduler_map[host + "_" + port] = scheduler
    end
    scheduler
  end
end
But I found that from the second request onwards, @@scheduler_map is always an empty hash. I am running in the development environment. Does anyone know the reason? Is it related to the running environment?
Thank you in advance.
You answered your own question :-)
Yes, this is caused by the development environment (I tested it), and to be more precise by the config option config.cache_classes = false in config/environments/development.rb.
This flag causes all classes to be reloaded on every request.
This is done so that you don't have to restart the whole server whenever you make a small change to your controllers.
You might want to take into consideration that what you are trying to do can cause HUGE memory leaks when later run in production with a lot of visitors.
Every time a user visits your site, it creates a new entry in that hash, and the hash never gets cleaned up.
Imagine what will happen after 10,000 users have visited your site, or 1,000,000.
All this data is kept in the process's memory, so the longer the server is online, the more space it will take.
Also, I'm not really sure this solution will work on a production server.
The server will create multiple threads (or processes) to handle a lot of visitors at the same time.
I think (though I'm not sure) that every thread will have its own instances of the classes.
This means that in thread 1 the scheduler map for ip xx might exist, while in thread 2 it doesn't.
If you give me some more information about what this scheduler is, I might be able to suggest a different solution.