In my project there is one script that returns the list of products which I have to display in a table.
To store the output of the script I used IO.popen:
@device_list = []
IO.popen("device list").each do |device|
  @device_list << device
end
device list is the command that will give me the product list.
I return the @device_list array to my view and display it by iterating over it.
When I run it I get an error:
Errno::ENOMEM (Cannot allocate memory):
for IO.popen
I have another script, device status, that returns only true or false, but I get the same error:
def check_status(device_id)
  @stat = system("status #{device_id}")
  if @stat == true
    "sold"
  else
    "not sold"
  end
end
What should I do?
Both IO.popen and Kernel#system can be expensive operations in terms of memory because they both rely on fork(2). fork(2) is a Unix system call that creates a child process which clones the parent's memory and resources. That means if your parent process uses 500 MB of memory, your child will also use 500 MB of memory. So each time you call Kernel#system or IO.popen, you increase your application's memory usage by roughly the amount of memory it takes to run your Rails app.
If your development machine has more RAM than your production server or if your production server produces a lot more output, there are two things you could do:
Increase memory for your production server.
Do some memory management using something like Resque.
You can use Resque to queue those operations as jobs. Resque will then spawn "workers" (child processes) that take a job from the queue, work on it, and then exit. Resque still forks, but the important thing is that the worker exits after working on the task, which frees up the memory. There will be a spike in memory every time a worker does a job, but it will go back to your app's baseline memory after each one.
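For illustration, a minimal Resque job for the device list command might look like the sketch below. DeviceListJob, the queue name, and the cache key are assumptions, not something from your app:

class DeviceListJob
  @queue = :devices # hypothetical queue name

  def self.perform
    # Run the external command in the short-lived worker process; the
    # memory spike disappears when the worker exits after the job.
    devices = IO.popen("device list") { |io| io.readlines }
    # Store the result somewhere the Rails app can read it later
    # (a cache key is just one option).
    Rails.cache.write("device_list", devices)
  end
end

# In the controller, enqueue the job instead of shelling out during the request:
Resque.enqueue(DeviceListJob)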
You might have to do both of the options above and also look for other ways to minimize the memory usage of your app.
It seems your output from device list is too large.
"Cannot allocate memory (Errno::ENOMEM)" is a useful link that describes this problem.
Limit the output of device list and check again. Then you will know whether it is a memory issue or not.
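For a quick test you could cap the output on the shell side; head and the line count here are just an illustration:

@device_list = []
IO.popen("device list | head -n 100").each do |device|
  @device_list << device
end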
Related
Scenario:
I have a job running a process (Sidekiq) in production (Heroku). The process imports data (CSV) from S3 into a DB model using the activerecord-import gem, which helps with bulk insertion of data. The dbRows variable therefore holds a considerable amount of memory: all the ActiveRecord objects built while iterating the CSV lines (all good so far). Once the data is imported (in db_model.import dbRows), dbRows is cleared (or it should be!) and the next object is processed.
Such as: (script simplified for better understanding)
def import
  ....
  s3_objects.contents.each do |obj|
    @cli.get_object({..., key: obj.key}, target: file)
    dbRows = []
    csv = CSV.new(file, headers: false)
    while line = csv.shift
      # >> here dbRows grows and grows and is never freed!
      dbRows << db_model.new(
        field1: field1,
        field2: field2,
        fieldN: fieldN
      )
    end
    db_model.import dbRows
    dbRows = nil # try 1 to free the array
    GC.start     # try 2 to free memory
  end
  ....
end
Issue:
Job memory grows while the process runs, BUT once the job is done the memory does not go down. It stays there forever and ever!
Debugging, I found that dbRows never seems to be garbage collected,
and I learned about RETAINED objects and how memory works in Rails. However, I have not yet found a way to apply that to solve my problem.
I would like all references held by dbRows to be garbage collected once the job finishes, so the worker's memory is freed.
Any help appreciated.
UPDATE: I read about WeakRef, but I don't know whether it would be useful. Any insights there?
Try importing lines from the CSV in batches, e.g. import lines into the DB 1,000 at a time so you're not holding onto previous rows and the GC can collect them. This is good for the database in any case (and for the download from S3, if you hand CSV the IO object from S3).
s3_io_object = s3_client.get_object(*s3_obj_params).body
csv = CSV.new(s3_io_object, headers: true, header_converters: :symbol)
csv.each_slice(1_000) do |row_batch|
  db_model.import ALLOWED_FIELDS, row_batch.map(&:to_h), validate: false
end
Note that I'm not instantiating AR models either, to save memory; I'm only passing in hashes and telling activerecord-import to skip validations with validate: false.
Also, where does the file reference come from? It seems to be long-lived.
It's not evident from your example, but is it possible that references to objects are still being held globally by a library or extension in your environment?
Sometimes these things are very difficult to track down, as any code from anywhere that's called (including external library code) could do something like:
Dynamically defining constants, since they never get GC'd
Any::Module::Or::Class.const_set('NewConstantName', :foo)
or adding data to anything referenced/owned by a constant
SomeConstant::Referenceable::Globally.array << foo # array will only get bigger and contents will never be GC'd
Otherwise, the best you can do is use some memory profiling tools, either inside of Ruby (memory profiling gems) or outside of Ruby (job and system logs) to try and find the source.
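For example, a minimal sketch with the memory_profiler gem; wrapping a single iteration of the S3 loop (here a hypothetical import_one_s3_object helper) is an assumption about where the retention happens:

require 'memory_profiler'

report = MemoryProfiler.report do
  import_one_s3_object # hypothetical: one iteration of the S3 loop above
end

# The "retained" section of the report lists objects that survive GC,
# i.e. the likely source of the growing baseline.
report.pretty_print(to_file: 'tmp/memory_report.txt')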
I am trying to perform some calculations to populate some historical data in the database.
The database is SQL Server. The server is Tomcat (using JRuby).
I am running the script file in a rails console pointed to the uat environment.
I am trying to use threads to speed up the execution. The idea being that each thread would take an object and run the calculations for it, and save the calculated values back to the database.
Problem: I keep getting this error:
ActiveRecord::ConnectionTimeoutError (could not obtain a database connection within 5.000 seconds (waited 5.000 seconds))
code:
require 'thread'

threads = []
items_to_calculate = Item.where("id < 11").to_a # testing only 10 items for now

for item in items_to_calculate
  threads << Thread.new(item) { |myitem|
    my_calculator = ItemsCalculator.new(myitem)
    to_save = my_calculator.calculate_details
    to_save.each do |dt|
      dt.save!
    end
  }
end

threads.each { |aThread| aThread.join }
You're probably spawning more threads than ActiveRecord's DB connection pool has connections. Ekkehard's answer is an excellent general description, so here's a simple example of how to limit your workers using Ruby's thread-safe Queue.
require 'thread'

queue = Queue.new
items.each { |i| queue << i } # Fill the queue

Array.new(5) do # Only 5 concurrent workers
  Thread.new do
    until queue.empty?
      item = queue.pop
      ActiveRecord::Base.connection_pool.with_connection do
        # Work
      end
    end
  end
end.each(&:join)
I chose 5 because that's the ConnectionPool's default, but you can certainly tune that to the maximum that still works, or push each result onto another queue to save later and run an arbitrary number of threads for the calculation.
The with_connection method grabs a connection, runs your block, then ensures the connection is released. It's necessary because of a bug in ActiveRecord where the connection doesn't always get released otherwise. Check out this blog post for some details.
You are potentially starting a huge number of threads at the same time if you leave the testing stage.
Each of these threads will need a DB connection. Either Rails is going to create a new one for every thread (possibly creating a huge number of DB connections at the same time), or it does not, in which case you'll run into trouble because several threads will try to use the same connection in parallel. The first case would explain the error message, because there is probably a hard limit on open DB connections in your DB server.
Creating threads like this is usually not advisable. You're usually better off creating a controlled, limited number of worker threads and using a queue to distribute work between them.
In your case, you could have one set of worker threads to do the calculations and a second set of worker threads to write to the DB. I don't know enough about the details of your code to decide for you which is better. If the calculation is expensive and the DB work is not, you will probably want only one worker writing to the DB in a serial fashion. If your DB is a beast that is highly optimized for parallel writing and you need to write a lot of data, then you may want a (small) number of DB workers.
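Here is a rough sketch of that split, assuming a single DB writer; ItemsCalculator comes from your code, while the thread count, the SizedQueue capacity, and the :done sentinel are arbitrary choices:

require 'thread'

work_queue    = Queue.new
results_queue = SizedQueue.new(100) # back-pressure so calculators can't race far ahead

Item.where("id < 11").each { |item| work_queue << item }

calculators = Array.new(4) do
  Thread.new do
    loop do
      item = begin
        work_queue.pop(true) # non-blocking pop; raises when the queue is drained
      rescue ThreadError
        break
      end
      results_queue << ItemsCalculator.new(item).calculate_details
    end
  end
end

# A single writer keeps DB access serial and needs only one connection.
writer = Thread.new do
  loop do
    results = results_queue.pop
    break if results == :done
    ActiveRecord::Base.connection_pool.with_connection do
      results.each(&:save!)
    end
  end
end

calculators.each(&:join)
results_queue << :done
writer.join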
I have a controller that spins off 6 Sidekiq workers for faster parallel processing of a large file. Before that, however, I want to provide these workers with a few variables that should be available across all of them, because the variables themselves are fairly memory intensive. (They are only read from, not written to, so there are no concurrency issues.)
In other words my controller looks like this
def foo
  $bar1 = ....
  $bar2 = ...
  worker.perform_async()...
  worker2.perform_async()...
end
I don't want to put those global vars into the perform methods, because serializing them to Redis chokes the entire thing. My issue is that the workers cannot see these variables and die with a NoMethodError (i.e. trying to call .first on one of them fails because the var is nil for the workers).
How come? Is there any other way to do this that won't kill my memory? (i.e. I don't want to take up most of the memory with six copies of the same large array.)
Sidekiq runs in a separate process, so it doesn't share memory with the process that enqueued the work.
If the data is static, you might want to load it when the Sidekiq process starts (for example, when you configure the Sidekiq server); see the sketch below.
If it changes per task, you should model it in a way where you can create a global repository to hold it (if Redis is not good for this, maybe you can try memcached)...
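A sketch of the static-data option, assuming the data can be read from a file at boot; the initializer path, the file name, and the LARGE_LOOKUP constant are all placeholders:

# config/initializers/sidekiq.rb (assumed location)
require 'json'

Sidekiq.configure_server do |config|
  # Runs once when the Sidekiq server process boots, so every job executed
  # by this process reads the same in-memory copy; nothing is serialized to Redis.
  LARGE_LOOKUP = JSON.parse(File.read(Rails.root.join("data", "lookup.json")))
end

# In the worker, just reference the constant:
class Worker
  include Sidekiq::Worker

  def perform
    LARGE_LOOKUP.first
  end
end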
See the updated question below.
Original question:
In my current Rails project, I need to parse a large XML/CSV data file and save it into MongoDB.
Right now I use these steps:
Receive the uploaded file from the user and store the data in MongoDB.
Use Sidekiq to process the data in MongoDB asynchronously.
After the processing is finished, delete the raw data.
For small and medium data on localhost, the steps above run well. But on Heroku, I use HireFire to dynamically scale the worker dyno up and down. While the worker is still processing the large data, HireFire sees an empty queue and scales the worker dyno down. This sends a kill signal to the process and leaves the processing in an incomplete state.
I'm looking for a better way to do the parsing: one that allows the parsing process to be killed at any time (saving the current state when it receives the kill signal) and allows the job to be re-queued.
Right now I'm using Model.delay.parse_file and it doesn't get re-queued.
UPDATE
After reading the Sidekiq wiki, I found an article about job control. Can anyone explain the code, how it works, and how it preserves its state when it receives a SIGTERM signal so that the worker gets re-queued?
Is there an alternative way to handle job termination, save the current state, and continue right from the last position?
Thanks,
It might be easier to explain the process and the high-level steps, give a sample implementation (a stripped-down version of one that I use), and then talk about throw and catch:
Insert the raw CSV rows with an incrementing index (to be able to resume from a specific row/index later).
Process the CSV, stopping after every 'chunk' to check whether the job is done, by checking whether Sidekiq::Fetcher.done? returns true.
When the fetcher is done?, store the index of the currently processed item on the user and return, so that the job completes and control is returned to Sidekiq.
Note that if a job is still running after a short timeout (default 20s) the job will be killed.
Then when the job runs again, simply start where you left off last time (or at 0).
Example:
class UserCSVImportWorker
  include Sidekiq::Worker

  def perform(user_id)
    user = User.find(user_id)
    items = user.raw_csv_items.where(:index => {'$gte' => user.last_csv_index.to_i})

    items.each_with_index do |item, i|
      if ((i + 1) % 100).zero? && Sidekiq::Fetcher.done?
        user.update(last_csv_index: item.index)
        return
      end

      # Process the item as normal
    end
  end
end
The above class makes sure that every 100 items we check whether the fetcher is done (a proxy for whether shutdown has been started) and, if so, end execution of the job. Before the execution ends, however, we update the user with the last index that has been processed, so that we can start where we left off next time.
throw/catch is a way to implement the above functionality a little more cleanly (maybe), but it is a little like using Fibers: a nice concept that is hard to wrap your head around. Technically, throw/catch is more like goto than most people are generally comfortable with.
edit
Also, you could skip the call to Sidekiq::Fetcher.done? and record last_csv_index on each row, or on each chunk of rows processed; that way, if your worker is killed without having the opportunity to record last_csv_index, you can still resume 'close' to where you left off.
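A sketch of that variant, reusing the names from the example above; the chunk size of 100 is arbitrary:

items.each_with_index do |item, i|
  # Process the item as normal

  # Checkpoint unconditionally every 100 rows, so even a hard kill loses
  # at most the last unrecorded chunk of work.
  user.update(last_csv_index: item.index) if ((i + 1) % 100).zero?
end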
You are trying to address the concept of idempotency, the idea that processing a thing multiple times with potential incomplete cycles does not cause problems. (https://github.com/mperham/sidekiq/wiki/Best-Practices#2-make-your-jobs-idempotent-and-transactional)
Possible steps forward
Split the file up into parts and process those parts with a job per part.
Lift the threshold for HireFire so that it only scales down when jobs are likely to have fully completed (10 minutes).
Don't allow HireFire to scale down while a job is working (set a Redis key on start and clear it on completion; see the sketch after this list).
Track progress of the job as it is processing and pick up where you left off if the job is killed.
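For the Redis-key idea above, a minimal sketch; the key name and the way HireFire would consult it before scaling down are assumptions:

class ImportWorker
  include Sidekiq::Worker

  def perform(upload_id)
    Sidekiq.redis { |conn| conn.set("import:running:#{jid}", "1") }
    # ... parse and import ...
  ensure
    # Clear the flag even if the job raises, so scale-down is never blocked forever.
    Sidekiq.redis { |conn| conn.del("import:running:#{jid}") }
  end
end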
I am trying to resolve a memory leak problem with Rails. I can see through New Relic that memory usage keeps increasing without ever decreasing.
This is a spinoff question from a large thread ( Memory constantly increasing in Rails app ) where I am troubleshooting the problem. What I need to know now is just:
What are major reasons/factors when it comes to memory leaks in Rails?
As far as I understand:
Global variables (such as @@variable) - I have none of these
Symbols (I have not created any specifically)
Sessions - What should one avoid here? Let's say I have a session keeping track of the last query a particular user used to text-search the site. How should I kill it off?
"Leaving references" - What does this really mean? Could you give an example?
Any other bad coding examples you can give that will typically create memory leaks?
I want to use this information to look through my code so please provide examples!
Lastly, would this be "memory leaking code"?
ProductController
  ...
  @last_products << Product.order("ASC").limit(5)
end
will that make @last_products bloat?
The following will destroy applications.
Foo.each do |bar|
  # Whatever
end
If you have a lot of Foos, that will pull them all into memory. I have seen apps blow up because they have a bunch of "Foos" and a rake task that runs through all of them; this rake task takes forever, let's say Y seconds, but is run every X seconds, where X < Y. So what happens is they now have all Foos in memory, more than once, because they just keep pulling everything into memory over and over again.
While this can't happen inside a front-facing web app in exactly the same way, it isn't efficient or desirable either.
Instead of the above, do the following:
Foo.find_each do |bar|
  # Whatever
end
which retrieves records in batches and doesn't put a whole bunch of them into your memory all at once.
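If the default batch of 1,000 records is still too heavy, find_each also accepts a batch_size option:

Foo.find_each(batch_size: 100) do |bar|
  # Whatever
end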
And right as I finished typing this I realized this question was asked in September of last year... oh boy...