rake task memory leak - ruby-on-rails

I have a long-running rake task that is gobbling up all my system memory over time. What is the quickest way to track down my issue and get to the bottom of it?
I'm using Rails 2.3.5, Ruby 1.8.7, Ubuntu on Slicehost, and MySQL 5.
I have a Rails app that works fine. I have a nightly job that runs all night and does tons of work (some external calls to Twitter, Google, etc., and lots of DB calls using ActiveRecord). Over time that job grows to nearly 4 GB of memory. I need to figure out why the rake task is not releasing memory.
I started looking into bleak_house, but it seems complex to set up and hasn't been updated in over a year. I can't get it to work locally, so I'm reluctant to try it in production.
Thanks,
Joel

Throwing out two ideas. First, if you're looping as part of this job, make sure you're not holding onto references to objects you don't need, as this will prevent them from being collected. Once you're done with an object, remove it from your array, or whatever is holding it. Also, put a periodic GC.start into your loop as a way to see if the collector is simply not getting a chance to run.
Second idea is that Ruby 1.8 does not GC symbols, so if your API clients are storing values as symbols you can end up with a huge and ever-growing symbol table. Symbols are tiny, but tiny things can still add up.
And of course, don't load more objects than you need to. Use #find_each to load AR objects in batches if you have to iterate over lots of them.
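Both suggestions can be sketched in plain Ruby (no Rails here; the heavy string stands in for an ActiveRecord object, and the batch interval is arbitrary):

```ruby
# Keep derived values, not the heavy objects themselves, so each record
# becomes garbage as soon as the iteration finishes; run GC periodically.
results = []

10_000.times do |i|
  record = "x" * 1_000            # stand-in for a heavy AR object
  results << record.length        # store only what you actually need
  GC.start if (i % 1_000).zero?   # periodic GC keeps the heap bounded
end

puts results.size  # => 10000
```

In the real task the loop body would sit inside `Model.find_each(:batch_size => 500) { |rec| ... }`, which loads records in batches of 500 instead of all at once.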

Related

Does the use of many rails associations cause significant memory growth?

I have a Rails active job running on a Heroku dyno via Sidekiq.
When the job runs, the memory grows from 150MB to 550MB.
I found out the cause was n+1 queries to the DB, and fixed it by doing my calculations on the data in the DB query instead of in the code.
Afterwards I wanted to refactor a bit, as I generally like to keep the SQL somewhat simple and have the logic in the code. For this reason I switched my "joins" for "includes", which lets me use the associations of the objects. It turned out, though, that this refactoring reintroduced the memory issue.
So my conclusion is that it is the use of the associations that causes the memory growth, since the number of SQL queries is the same after I fixed the N+1 issue. Please note the job handles a lot of objects, around 250,000. This seems plausible, cf. the delete vs. destroy functionality, where delete performs better because the objects themselves are not instantiated as they are with destroy.
Is my conclusion accurate, or am I missing something?
Thanks,
-Louise

How to speed up rake tasks handling big amount of data?

I use rake tasks in my Rails application. They're fine when dealing with small amounts of data, but when tens of thousands of records need to be retrieved or computed, the tasks can take a lot of time.
Rake tasks are very easy to understand and develop, and I'd really like to keep using them, but are there any recommendations when it comes to huge amounts of data?
I was thinking of a map/reduce algorithm, for instance. Is that the way to go?
It's not rake that's slow. Rake is just firing up an instance of your application and running whatever you send it.
You can try to refactor your code and see if there are some shortcuts you didn't see before.
You can try to thread or fork tasks off if the work can be done simultaneously.
I would recommend using Spawn if you are going to attempt this in your Rails app.
Sometimes your jobs just need to take a long time. Big data = big time.
Also, if you are running your rake tasks regularly throughout the day, I would recommend using something like Delayed::Job to handle this instead, so you aren't firing up and quitting Rails instances each time you need to run a task.
I recommend threach and JRuby.
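The "thread or fork" suggestion can be sketched in plain Ruby using a Queue as a work pool. This is a hedged sketch: the squaring is a placeholder for your per-record computation, and with ActiveRecord you would also need to check out a DB connection per thread.

```ruby
require 'thread'

jobs = Queue.new
(1..100).each { |n| jobs << n }

results = Queue.new                 # Queue is thread-safe, unlike Array
workers = Array.new(4) do
  Thread.new do
    loop do
      n = begin
            jobs.pop(true)          # non-blocking pop...
          rescue ThreadError
            break                   # ...queue drained, this worker exits
          end
      results << n * n              # placeholder for the real work
    end
  end
end
workers.each(&:join)

puts results.size  # => 100
```

Note that on MRI the GIL means threads only help when the work is I/O-bound (API calls, DB queries); for CPU-bound work you'd fork, or use JRuby as suggested.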

How to deal with expensive fixture/factory_girl object creation in tests?

For all users in our system, we generate a private/public key pair, which often takes a second or two. That's not a deal breaker on the live site, but it makes running the tests extremely slow, and slow tests won't get run.
Our setup is Rails 3.1 with factory_girl and rspec.
I tried creating some (10 or so) ahead of time, with a method to return a random one, but this seems to be problematic: perhaps they're getting cleared out of the database and are unavailable for subsequent tests... I'm not sure.
This might be useful: https://github.com/pcreux/rspec-set - any other ideas?
You can always make a fake key pair for your tests. Pregenerating them won't work, at least not if you store them in the DB, because the DB should get cleared for every test. I suppose you could store them in a YAML file or something and read them from there...
https://github.com/pcreux/rspec-set was good enough for what we need, combined with an after(:all) block to clean up the droppings it leaves in the database.
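One way to implement the "fake key pair" suggestion: generate a single pair once per test run and reuse it everywhere. OpenSSL is in Ruby's standard library; the constant name, helper, and the idea of wiring it into your factory are assumptions, not project code.

```ruby
require 'openssl'

# Generated once when the test suite loads; every factory-built user can
# share it, since tests rarely care that keys are unique per user.
TEST_KEY_PAIR = OpenSSL::PKey::RSA.new(2048)

def fake_key_pair
  TEST_KEY_PAIR   # always the same object -- no per-test generation cost
end
```

In the factory, something like `key_pair { fake_key_pair }` (factory_girl's lazy-attribute block syntax) would then reuse the pair for every built user, sidestepping the per-record generation entirely.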

Rails DB Performance with several workers

I'm working on an application that works like a search engine, and all the time it has workers in the background searching the web and adding results to the Results table.
While everything works perfectly, lately I started getting huge response times while trying to browse, edit or delete the results. My guess is that the Results table is being constantly locked by the workers who keep adding new data, which means web requests must wait until the table is freed.
However, I can't figure out a way to lower that load on the Results table and get faster response times for my web requests. Has anyone had to deal with something like that?
The search bots are constantly reading and adding new stuff, it adds new results as it finds them. I was wondering if maybe by only adding the bulk of the results to the database after the search would help, or if it would make things worse since it would take longer.
Anyway, I'm at a loss here and would appreciate any help or ideas.
I'm using RoR 2.3.8 and hosting my app on Heroku with PostgreSQL.
PostgreSQL doesn't lock tables for reads or writes; thanks to MVCC, readers don't block writers. Start logging your queries and try to find out what is going on. Guessing doesn't help; you have to dig into it.
To check the current activity:
SELECT * FROM pg_stat_activity;
Try the NOWAIT locking option (e.g. SELECT ... FOR UPDATE NOWAIT), which makes a query fail immediately instead of waiting on a lock. Since you're only adding new stuff with your background workers, I'd assume there would be no lock conflicts when browsing/editing/deleting.
You might want to put a cache in front of the database. On Heroku you can use memcached as a cache store very easily.
This'll take some load off your db reads. You could even have your search bots update the cache when they add new stuff so that you can use a very long expiration time and your frontend Rails app will very rarely (if ever) hit the database directly for simple reads.
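The read-through caching described above can be sketched with a plain Hash standing in for memcached; the `fetch` helper mirrors the shape of `Rails.cache.fetch`, and the key name is made up.

```ruby
CACHE = {}

def fetch(key)
  return CACHE[key] if CACHE.key?(key)
  CACHE[key] = yield   # miss: run the expensive query once and store it
end

query_count = 0
3.times do
  fetch('results:recent') do
    query_count += 1         # only the first call reaches the "database"
    [:row1, :row2]
  end
end

puts query_count  # => 1
```

With memcached you would also pass an expiration time, or (as suggested above) have the search bots write fresh results straight into the cache so the expiry can be very long.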

Application Context in Rails

Rails comes with a handy session hash into which we can cram stuff to our heart's content. I would, however, like something like ASP's application context, which instead of sharing data only within a single session, will share it with all sessions in the same application. I'm writing a simple dashboard app, and would like to pull data every 5 minutes, rather than every 5 minutes for each session.
I could, of course, store the cache update times in a database, but so far haven't needed to set up a database for this app, and would love to avoid that dependency if possible.
So, is there any way to get (or simulate) this sort of thing? If there's no way to do it without a database, is there any kind of "fake" database engine that comes with Rails, runs in memory, but doesn't bother persisting data between restarts?
Right answer: memcached. Fast, clean, supports multiple processes, integrates very cleanly with Rails these days. Not even that bad to set up, but it is one more thing to keep running.
90% Answer: There are probably multiple Rails processes running around -- one for each Mongrel you have, for example. Depending on the specifics of your caching needs, it's quite possible that having one cache per Mongrel isn't the worst thing in the world. For example, suppose you were caching the results of a long-running query which
gets fresh data every 8 hours
is used every page load, 20,000 times a day
needs to be accessed in 4 processes (Mongrels)
then you can drop that 20,000 requests down to 12 with about a single line of code
@@arbitrary_name ||= Model.find_by_stupidly_long_query(param)
The double at-mark, Ruby syntax you might not be familiar with, denotes a class variable, which here acts as a process-wide global. ||= is the commonly used Ruby idiom to execute the assignment if and only if the variable is currently nil or otherwise evaluates to false. The value will stay good until you explicitly empty it OR until the process stops, for any reason -- server restart, explicitly killed, what have you.
And after you go down from 20k calculations a day to 12 in about 15 seconds (OK, two minutes -- you need to wrap it in a trivial if block which stores the cache update time in a different global), you might find that there is no need to spend additional engineering assets on getting it down to 4 a day.
I actually use this in one of my production sites, for caching a few expensive queries which literally only need to be evaluated once in the life of the process (i.e. they change only at deployment time -- I suppose I could precalculate the results and write them to disk or DB but why do that when SQL can do the work for me).
You don't get any magic expiry syntax, reliability is pretty slim, and it can't be shared across processes -- but it's 90% of what you need in a line of code.
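The "trivial if block" with a cache-update time mentioned above looks roughly like this; the class name, the instrumentation counter, and the fake query result are placeholders for this sketch, with the real slow query in place of the literal array.

```ruby
class DashboardCache
  EXPIRY = 8 * 60 * 60              # refresh at most every 8 hours
  @@results    = nil
  @@fetched_at = Time.at(0)
  @@queries    = 0                  # instrumentation for this sketch only

  def self.results
    if @@results.nil? || Time.now - @@fetched_at > EXPIRY
      @@queries += 1
      @@results    = [:expensive, :rows]   # stand-in for the slow query
      @@fetched_at = Time.now
    end
    @@results                       # everything inside the window is a cache hit
  end

  def self.queries; @@queries; end
end

5.times { DashboardCache.results }
puts DashboardCache.queries  # => 1
```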
You should have a look at memcached: http://wiki.rubyonrails.org/rails/pages/MemCached
There is a helpful Railscast on Rails 2.1 caching. It is very useful if you plan on using memcached with Rails.
Using the stock Rails cache is roughly equivalent to this.
p3t0r is right, memcached is probably the best option, but you could also use the SQLite database that comes with Rails. That won't work across multiple machines, though, where memcached will. Also, SQLite will persist to disk, though I think you can set it up not to if you want. Rails itself has no application-scoped storage since it's run as one process per request handler, so it has no shared memory space like ASP.NET or a Java server would.
So what you are asking is quite impossible in Rails because of the way it is designed. What you ask for is a shared object, and Rails is strictly single-threaded. Memcached or a similar tool for sharing data between distributed processes is the only way to go.
The Rails.cache freezes the objects it stores. This kind of makes sense for a cache but NOT for an application context. I guess instead of doing a roundtrip to the moon to accomplish that simple task, all you have to do is create a constant inside config/environment.rb
APP_CONTEXT = Hash.new
Pretty simple, eh?
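With that constant defined (repeated here so the snippet stands alone), any code running in the same process can read and write shared state; the key names are illustrative, and note that this store is per-process, as the other answers point out.

```ruby
APP_CONTEXT = Hash.new   # as in config/environment.rb above

# e.g. a dashboard action memoizing its data plus a fetched-at timestamp:
APP_CONTEXT[:dashboard] ||= { :rows => [1, 2, 3], :fetched_at => Time.now }

APP_CONTEXT[:dashboard][:rows]  # => [1, 2, 3]
```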