I am using ActiveRecord to bulk migrate some data from a table in one database to a different table in another database. About 4 million rows.
I am using find_each to fetch in batches. Then I do a little bit of logic to each record fetched and write it to the other database. I have tried both writing records one by one and using the nice activerecord-import gem to batch the writes.
However, in either case, my Ruby process's memory usage grows quite a bit throughout the life of the export/import. I would have thought that with find_each fetching batches of 1000, only 1000 records would be in memory at a time... but no, each record I fetch seems to consume memory forever, until the process is over.
Any ideas? Is ActiveRecord caching something somewhere that I can turn off?
Update 17 Jan 2012
I think I'm going to give up on this. I have tried:
* Making sure everything is wrapped in an ActiveRecord::Base.uncached do block (a rough sketch of the loop is below this list)
* Adding ActiveRecord::IdentityMap.enabled = false (I think that should turn off the identity map for the current thread, although it's not clearly documented, and I think the identity map isn't on by default in current Rails anyhow)
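For reference, the migration loop looks roughly like this (Source and Target are placeholder models wired to the old and new databases respectively):

ActiveRecord::IdentityMap.enabled = false   # should be a no-op if the map is already off

ActiveRecord::Base.uncached do
  Source.find_each(:batch_size => 1000) do |row|
    # a little bit of per-record logic, then write to the other DB
    Target.create!(row.attributes.except("id"))
  end
end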
Neither of those seems to have much effect; memory is still leaking.
I then added a periodic explicit:
GC.start
That seems to slow down the rate of memory leak, but the memory leak still happens (eventually exhausting all memory and bombing).
So I think I'm giving up, and deciding it is not currently possible to use AR to read millions of rows from one db and insert them into another. Perhaps there is a memory leak in MySQL-specific code being used (that's my db), or somewhere else in AR, or who knows.
I would suggest queueing each unit of work into a Resque queue. I have found that Ruby has some quirks when iterating over large collections like these.
Have one main process that queues up the work by ID, then have multiple Resque workers hitting that queue to get the work done.
I have used this method on approx 300k records, so it would most likely scale to millions.
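A rough sketch of that pattern (MigrateRecordJob, SourceRecord and TargetRecord are made-up names for illustration):

# Producer: enqueue only IDs, never full ActiveRecord objects.
SourceRecord.select(:id).find_each(:batch_size => 1000) do |r|
  Resque.enqueue(MigrateRecordJob, r.id)
end

# Worker: each job loads and processes a single record.
class MigrateRecordJob
  @queue = :migration

  def self.perform(id)
    record = SourceRecord.find(id)
    TargetRecord.create!(record.attributes.except("id"))
  end
end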
Change line #86 to bulk_queue = [], since bulk_queue.clear only sets the length of the array to 0, making it impossible for the GC to reclaim it.
Related
I have a couple of database management tasks that need to go through every record in the database. It was my understanding that with the CakePHP 3.x ORM, I could do something like this, and it would only ever have one record in memory at a time:
$records = TableRegistry::get('Whatever')->find();
foreach ($records as $record) {
    // do some processing
}
However, this is eventually crashing with an "out of memory" exception. I've added a bit of logging of memory_get_peak_usage, and it's increasing with every iteration, even if there is nothing other than the logging happening inside the foreach loop. The delta is around 12K every time through the loop.
I'm running 3.2.7, and results are similar whether I have debugging and/or SQL logging enabled or not. Adding frequent calls to gc_collect_cycles() only slows the process down, it doesn't help with the memory usage.
Is this expected, or a bug? If the former, is there anything I can do differently in this code to prevent it? (Obviously, I could process it in smaller batches, but that's not an elegant solution.)
The CakePHP 3.x ORM has built-in result buffering for the ResultSet object. When you iterate over the result set, the entities are stored in an internal array. This is done so that you can rewind the iterator and loop again.
If you are going to iterate over a large result set only once, and you want to reduce memory usage then you have to disable result buffering.
$records = TableRegistry::get('Whatever')->find()->bufferResults(false);
foreach ($records as $record) {
    // do some processing
}
With buffering turned off the entity is fetched from the result set and there should be no references to it afterwards.
Documentation for this feature is available in the CakePHP book: https://book.cakephp.org/3.0/en/orm/retrieving-data-and-resultsets.html#working-with-result-sets
Here's the API reference: https://api.cakephp.org/3.6/class-Cake.Database.Query.html#_bufferResults
From my understanding this is the expected behaviour: the query built with the ORM executes when you start iterating over the object ($records). All the data is then loaded into memory, and you iterate over each entry one by one.
If you want to limit memory usage, I would suggest you look into limit and offset. With these you can extract subsets to work on.
I have a few million records in the database that I need to process from time to time. However, this operation takes all the memory on my server. I'm running it with Sidekiq, and while the task is using all the memory, my Rails app becomes very slow.
In general (no logic included) my code looks like:
Model.each do |m|
  # do some logic code here
end
How do I make the garbage collector run after some number of records (for example, 10k records) so I don't run into out-of-memory situations? Will splitting the work into chunks help?
You should always use find_each when dealing with potentially large tables.
That way, models will be retrieved from the database and loaded in memory batch by batch (the default size is 1000 but you can customize it to your needs).
Just be aware that sorting by arbitrary columns doesn't play well with find_each, as it implicitly sorts records by ID so that it has a way to fetch records in batches.
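For example (Model and the batch size are placeholders):

Model.find_each(batch_size: 1000) do |m|
  # do some logic code here; only one batch of 1000 records is in memory at a time
end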
You can force the garbage collector to run with GC.start, but if you are doing
Model.all.each do |m|
end
then garbage collection cannot free the already processed records - they are still referenced by the array that each is iterating over, so running the garbage collector explicitly won't do anything.
Instead use find_each (or its close relative, find_in_batches) which fetches records and processes them in batches (you can control the batch size - I think it is 1000 by default). This way the entire result set is never in memory and previously processed batches are not referenced by anything and so can be disposed of.
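For example, combining find_in_batches with an occasional GC.start (the model name and batch size are illustrative):

Model.find_in_batches(batch_size: 10_000) do |batch|
  batch.each do |m|
    # do some logic code here
  end
  GC.start  # optional; the previous batch is no longer referenced and can be collected
end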
I use Delphi XE2 along with DISQLite v3 (which is basically a port of SQLite3). I love everything about SQLite3, except the lack of concurrent writing, especially since I rely extensively on multi-threading in this project :(
My profiler made it clear I needed to do something about it, so I decided to use this approach:
Whenever I need to insert a record into the DB, instead of doing an INSERT, I write the SQL query to a file in a special folder, i.e.
WriteToFile_Inline(SPECIAL_FOLDER_PATH + '\' + GUID, FileName + '|' + IntToStr(ID) + '|' + Hash + '|' + FloatToStr(ModifDate) + '|' + ...);
I added a timer (in the main app thread) that fires every minute, parses these files, and then runs the INSERTs inside a transaction.
The temporary files are deleted at the end.
The result is roughly a 500% performance gain. Plus, this technique is ACID, as I can always scan SPECIAL_FOLDER_PATH after a power failure and execute the INSERTs I find.
Despite the good results, I'm not very happy with the method (hackish to say the least). I keep thinking that if I had a generics-like, thread-safe, ACID list with fast lookup access, this would be much cleaner (and possibly faster?).
So my question is: do you know anything like that for Delphi XE2?
PS. I expect many of you reading the code above will be in shock and start insulting me at this point! Please be my guest, but if you know a better (i.e. faster) ACID approach, please share your thoughts!
Your idea of sending the inserts to a queue, which regroups them and executes them via prepared statements, is very good. Whether you use a timer in the main thread or a separate thread is up to you. It will avoid any locking.
Do not forget to use a transaction, then commit it every 100/1000 inserts for instance.
About high performance with SQLite3, see e.g. this blog article and its benchmark graphic.
In that graphic, the best performance ("file off") comes from:
PRAGMA synchronous = OFF
Using prepared statements
Inside a transaction
In WAL mode (especially in concurrency mode)
You may also change the page size or the journal size, but the settings above matter most. See https://stackoverflow.com/search?q=sqlite3+performance
If you do not want to use a background thread, ensure WAL is ON, prepare your statements, use batches, and regroup your processing to release the SQLite3 lock as soon as possible.
The best performance will be achieved by adding a Client-Server layer, just as we did for mORMot.
With the files you have organized an asynchronous, persistent job queue. It lets you avoid one-by-one inserts and use a batch (record-group) approach instead. Comparing one-by-one and batch:
* The first (probably) works in auto-commit mode for each record; the second wraps a batch into a single transaction, which gives the greatest performance gain.
* The first (probably) prepares an INSERT command for every record; the second prepares it once per batch, which gives the second-largest gain.
I don't think SQLite concurrency is the problem in your case (at least not the main issue), because in SQLite a single insert is comparatively fast, and concurrency issues only appear under high workload. You would probably see similar results with another DBMS, such as Oracle.
To improve your batch approach, consider the following:
* Consider setting journal_mode to WAL and disabling shared cache mode.
* Use a background thread to process your queue. Instead of a fixed time interval (1 min), check SPECIAL_FOLDER_PATH more often and start processing once the queue holds more than X KB of data; or keep a count of queued records and use an event to notify the thread that processing should start.
* Use a multi-record prepared INSERT instead of a single-record INSERT. You can build an INSERT for 100 records and process your queue data in 100-record chunks.
* Consider writing/reading binary field values instead of text values.
* Consider using a set of files with preallocated sizes.
* Etc.
sqlite3_busy_timeout is pretty inefficient because it doesn't return immediately when the table it's waiting on is unlocked.
I would try creating a critical section (TCriticalSection?) to protect each table. If you enter the critical section before inserting a row and exit it immediately thereafter, you will create better table locks than SQLite provides.
Without knowing your access patterns, though, it's hard to say if this will be faster than batching up a minute's worth of inserts into single transactions.
When I run this and then watch the memory consumption of my ruby process in OSX Activity Monitor, the memory increases at about 3 MB/s.
If I remove the transaction, memory consumption roughly halves, but the footprint still keeps going up. I have an issue on my production app where Heroku kills the process because of its memory consumption.
Is there a way of doing the below that won't keep increasing memory? If I comment out the save! line it's okay, but of course that isn't a solution.
ActiveRecord::Base.transaction do
  10000000.times do |time|
    puts "---- #{time} ----"
    a = Activity.new(:name => "#{time} Activity")
    a.save!(:validate => false)
    a = nil
  end
end
I am running this using delayed_job.
The a = nil line is unnecessary and you can remove that.
You're creating a lot of objects every time you loop (two strings, two hashes, and an Activity object), so I'm not surprised you're experiencing high memory usage, especially as you're looping 10 million times! There doesn't appear to be a more memory-efficient way to write this code.
The only way I can think of to reduce memory usage is to manually start the garbage collector every x number of iterations. Chances are Ruby's GC isn't being aggressive enough. You don't, however, want to invoke it every iteration as this will radically slow your code. Maybe you could use every 100 iterations as a starting point and go from there. You'll have to profile and test what is most effective.
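For example, something along these lines (the interval of 100 is only a starting point to profile against):

ActiveRecord::Base.transaction do
  10000000.times do |time|
    Activity.new(:name => "#{time} Activity").save!(:validate => false)
    GC.start if time % 100 == 0  # run the GC every 100 iterations; tune by profiling
  end
end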
The documentation for Ruby's GC module has the details.
I know this is an ancient issue, but I have to suggest another radical approach:
ActiveRecord::Base.transaction do
  10000000.times do |time|
    puts "---- #{time} ----"
    sql = <<-SQL
      INSERT INTO activities (name) VALUES ('#{time}')
    SQL
    Activity.connection.execute(sql)
  end
end
The point is that if the insert is that simple, and you're already skipping any ActiveModel validation, there's no reason to instantiate an ActiveRecord object in the first place. Normally it wouldn't hurt, but since it is hurting you in this case, I think you'll use a lot less memory this way.
I am planning on using delayed_job to run some background analytics. In my initial test I saw a tremendous amount of memory usage, so I basically created a very simple task that runs every 2 minutes just to observe how much memory is being used.
The task is very simple and the analytics_eligible? method always returns false, given where the data is now, so basically none of the heavy-hitting code is being called. I have around 200 Posts in my sample data in development. Post has_one analytics_facet.
Regardless of the internal logic/business here, the only thing this task is doing is calling the analytics_eligible? method 200 times every 2 minutes. In a matter of 4 hours my physical memory usage is at 110 MB and virtual memory at 200 MB, just for doing something this simple! I can't even begin to imagine how much memory this will eat if it's doing real analytics on 10,000 Posts with real production data! Granted, it may not run every 2 minutes, more like every 30, but I still don't think it will fly.
This is running Ruby 1.9.7 and Rails 2.3.5 on Ubuntu 10.x 64-bit. My laptop has 4 GB of memory and a dual-core CPU.
Is Rails really this bad, or am I doing something wrong?
Delayed::Worker.logger.info('RAM USAGE Job Start: ' + `pmap #{Process.pid} | tail -1`[10,40].strip)

Post.not_expired.each do |p|
  if p.analytics_eligible?
    # this method is never called
    Post.find_for_analytics_update(p.id).update_analytics
  end
end

Delayed::Worker.logger.info('RAM USAGE Job End: ' + `pmap #{Process.pid} | tail -1`[10,40].strip)
Delayed::Job.enqueue PeriodicAnalyticsJob.new(), 0, 2.minutes.from_now
Post Model
def analytics_eligible?
  vf = self.analytics_facet
  if self.total_ratings > 0 && vf.nil?
    return true
  elsif !vf.nil? && vf.last_update_tv > 0
    ratio = self.total_ratings / vf.last_update_tv
    if (ratio - 1) >= Constants::FACET_UPDATE_ELIGIBILITY_DELTA
      return true
    end
  end
  return false
end
ActiveRecord is fairly memory-hungry - be very careful when doing selects, and be mindful that Ruby automatically returns the last statement in a block as the return value, potentially meaning that you're passing back an array of records that get saved as a result somewhere and thus aren't eligible for GC.
Additionally, when you call "Post.not_expired.each", you're loading all your not_expired posts into RAM. A better solution is find_in_batches, which specifically only loads X records into RAM at a time.
Fixing it could be something as simple as:
def do_analytics
  Post.not_expired.find_in_batches(:batch_size => 100) do |batch|
    batch.each do |post|
      if post.analytics_eligible?
        # this method is never called
        Post.find_for_analytics_update(post.id).update_analytics
      end
    end
  end
  GC.start
end

do_analytics
A few things are happening here. First, the whole thing is scoped in a method to prevent variable collisions from holding onto references from the block iterators. Next, find_in_batches retrieves batch_size objects from the DB at a time, and as long as you aren't building references to them, they become eligible for garbage collection after each iteration runs, which keeps total memory usage down. Finally, we call GC.start at the end of the method; this forces the GC to start a sweep (which you wouldn't want to do in a realtime app, but since this is a background job, it's okay if it takes an extra 300 ms to run). It also has the very distinct benefit of returning nil, which means that the result of the method is nil, so we can't accidentally hang onto AR instances returned from the finder.
Using something like this should ensure that you don't end up with leaked AR objects, and should vastly improve both performance and memory usage. You'll want to make sure you aren't leaking elsewhere in your app (class variables, globals, and class references are the worst offenders), but I suspect that this'll solve your problem.
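To illustrate the kind of offender to look for, a contrived example (not taken from your code):

# A class-level cache like this keeps every Post it has ever seen reachable,
# so none of those records can ever be garbage collected.
class Post < ActiveRecord::Base
  @@seen = {}

  def self.remember(post)
    @@seen[post.id] = post
  end
end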
All that said, this is a cron problem (periodic recurring work) rather than a DJ problem, in my opinion. You can have a one-shot analytics parser that runs your analytics every X minutes with script/runner, invoked by cron, which very neatly cleans up any potential memory leaks or misuses per run (since the whole process terminates at the end).
Loading data in batches and using the garbage collector aggressively as Chris Heald has suggested is going to give you some really big gains, but another area people often overlook is what frameworks they're loading in.
Loading a default Rails stack will give you ActionController, ActionMailer, ActiveRecord and ActiveResource all together. If you're building a web application you may not be using all of these, but you're probably using most.
When you're building a background job, you can avoid loading things you don't need by creating a custom environment for that:
# config/environments/production_bg.rb
config.frameworks -= [ :action_controller, :active_resource, :action_mailer ]
# (Also include config directives from production.rb that apply)
Each of these frameworks will just be sitting around waiting for an email that will never be sent, or a controller that will never be called. There's simply no point in loading them. Adjust your database.yml file, set your background job to run in the production_bg environment, and you'll have a much cleaner slate to start with.
Another thing you can do is use ActiveRecord directly without loading Rails at all. That might be all you need for this particular operation. I've also found that a lightweight ORM like Sequel keeps your background job very lean if you're doing mostly SQL calls to reorganize records or delete old data. If you need access to your models and their methods, you will need ActiveRecord. Sometimes, though, it's worth re-implementing simple logic in pure SQL for performance and efficiency.
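For instance, a minimal sketch with Sequel (the connection string, table, and cleanup rule are hypothetical):

require "date"
require "sequel"

# Connect directly, mirroring the settings in database.yml (placeholder credentials).
DB = Sequel.connect("mysql2://user:password@localhost/myapp_production")

# Pure-SQL maintenance without loading Rails or ActiveRecord:
# delete posts older than 90 days.
DB[:posts].where { created_at < Date.today - 90 }.delete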
When measuring memory usage, the only number to be concerned with is "real" memory. The virtual amount contains shared libraries and the cost of these is spread amongst every process using them even though it is counted in full for each one.
In the end, if running something important takes 100 MB of memory but you could get it down to 10 MB with three weeks of work, I don't see why you'd bother. 90 MB of memory costs at most about $60/year on a managed provider, which is usually far less expensive than your time.
Ruby on Rails embraces the philosophy of being more concerned with your productivity and your time than with memory usage. If you want to trim it back and put it on a diet, you can, but it will take a bit of effort.
If you are experiencing memory issues, one solution is to use another background-processing library, like Resque. It is the background processor used by GitHub.
Thanks to Resque's parent / child architecture, jobs that use too much memory release that memory upon completion. No unwanted growth.
How?
On certain platforms, when a Resque worker reserves a job it immediately forks a child process. The child processes the job then exits. When the child has exited successfully, the worker reserves another job and repeats the process.
You can find more technical details in the README.
It is a fact that Ruby consumes (and leaks) memory. I don't know if you can do much about it, but I do recommend that you take a look at Ruby Enterprise Edition.
REE is an open source branch that promises "33% less memory", among other good things. I have used REE with Passenger in production for almost two years now and I'm very pleased.