How to speed up rake tasks handling large amounts of data?

I use rake tasks in my Rails application. They are fine when dealing with a small amount of data, but when tens of thousands of records need to be retrieved or computed, the tasks can take a lot of time.
Rake tasks are very easy to understand and develop, and I'd really like to keep using them, but are there any recommendations when it comes to huge amounts of data?
I was thinking of a map/reduce algorithm, for instance. Is that the way to go?

It's not rake that's slow. Rake is just firing up an instance of your application and running whatever you sent to it.
You can try to refactor your code and see if there are some shortcuts that you didn't see before.
You can try to thread or fork tasks off if the work can be done simultaneously.
I would recommend using Spawn if you are going to attempt this in your Rails app.
Sometimes your jobs just need to take a long time. Big Data = Big Time.
Also, if you are running your rake tasks regularly throughout the day, I would recommend using something like Delayed_Job to handle this instead, so you aren't firing up and quitting Rails instances each time you need to run a task.
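As a rough illustration of the "thread tasks off" suggestion above, here is a minimal plain-Ruby sketch that splits a list of records into slices and processes each slice in its own thread. The names `process_record` and `process_in_threads` are hypothetical stand-ins, not part of Rake or Rails, and the sketch assumes the per-record work is independent. On MRI, threads mainly help when the work waits on I/O (database, HTTP) rather than on the CPU.

```ruby
# Hypothetical stand-in for the per-record work a rake task performs.
def process_record(record)
  record * 2
end

# Split the records into slices and process each slice in its own thread.
# Results come back in the original order because slices are joined in order.
def process_in_threads(records, thread_count: 4)
  return [] if records.empty?

  slice_size = (records.size / thread_count.to_f).ceil
  threads = records.each_slice(slice_size).map do |slice|
    Thread.new { slice.map { |record| process_record(record) } }
  end
  # Thread#value waits for the thread and returns the block's result.
  threads.flat_map(&:value)
end

results = process_in_threads((1..10).to_a, thread_count: 3)
puts results.inspect
```

Inside a Rails task each thread would also need its own database connection from the pool, which this sketch does not cover.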

I recommend threach and JRuby.

Related

Scheduling stuff in the future

I've been doing some research, but I can't figure out how to do this in Rails.
I need to execute some code after a certain amount of time. I found some gems that can handle this, but I don't want to use any.
I basically have a create action in a Rails controller, and some things need to happen there after 24 hours.
EDIT: I tried sleep, but this needs to be async; sleep stops everything from running until it's done, even if it is inside an if statement.

You need to create an Active Job with rails generate job your_job_name.
Then you can delay its execution:
YourJob.set(wait_until: 1.day.from_now).perform_later
For this to work, you do need a job queue configured for your project. I personally recommend Sidekiq.
Sadly, you will have a hard time avoiding setting up a gem for this.

Background job taking twice the time that the same operation within rails

In my Rails application, I have a long calculation requiring a lot of database access.
To make it short, my calculation took 25 seconds.
When I implement the same calculation in a background job (one big single worker), it takes twice as long (i.e. 50 seconds). I have tried several techniques for moving the work into the background, but none of them changed the result: Delayed Job, Sidekiq, and running the work inside my Rails process in a dedicated thread all show the same 2x slowdown.
This performance difference only exists in the Rails 'production' environment. It looks like Rails applies an optimisation that is not applied in my background job.
My technical environment is the following:
I am using Ruby 2.0 / Rails 4
I am using Unicorn (but I have the same problem without it).
The job is using Rails.cache to store some partial computation.
I am using PostgreSQL
Does anybody have a clue where this impact might come from?

I'm assuming you're comparing the background job speed to the speed of running the operation during a web request? If so, you're likely benefiting from Rails's QueryCache, which caches db queries during a web request. Try disabling it as described here:
Disabling Rails SQL query caching globally
If that causes the web request version of the job to take as long as the background job, you've found your culprit. You can then enable the query cache on your background job to speed it up (if it makes sense for your application).
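To make the effect concrete, here is a toy plain-Ruby model of what a per-request query cache does: identical SQL strings are answered from memory after the first hit. `FakeDatabase` and `QueryCache` are hypothetical classes written for this illustration and are not Rails's implementation. In Rails itself, wrapping the job body in `ActiveRecord::Base.cache do ... end` is, to my knowledge, the usual way to turn the query cache on outside a request, but check that against your Rails version.

```ruby
# Hypothetical stand-in for a database connection that counts real queries.
class FakeDatabase
  attr_reader :query_count

  def initialize
    @query_count = 0
  end

  def execute(sql)
    @query_count += 1
    "rows for #{sql}"
  end
end

# Toy query cache: results are memoized by their exact SQL string.
class QueryCache
  def initialize(database)
    @database = database
    @results = {}
  end

  # Hit the database only the first time a given SQL string is seen.
  def select(sql)
    @results.fetch(sql) { @results[sql] = @database.execute(sql) }
  end
end

db = FakeDatabase.new
cache = QueryCache.new(db)
3.times { cache.select("SELECT * FROM scores WHERE user_id = 1") }
puts db.query_count # three identical selects, one real query
```

A calculation that repeats the same lookups many times benefits a lot from this; one that issues mostly distinct queries will not.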

A background job is not a tool for speeding things up. Its main purpose is 'fire and forget': you remove 25 seconds of synchronous calculation from the request and accept somewhat more asynchronous calculation instead. That way you can tell the user that their request is being processed and come back with the result later.
You can gain speed from background jobs by splitting a big task into several small ones and running them at the same time. In your case I think that is not possible, because the operations in your calculation depend on each other.
So if you want to speed up your calculation, you need to look into denormalizing your data structure: store some precomputed values for your big calculation at the moment the source data is updated. You will then calculate less when the user requests results and more when data is stored. That is a good place to use a background job: when you finish updating the data, create a background task to update the caches. If a user request for the calculation comes in before this task has finished, you will still need to wait for the cache to fill.
Update: I think I still need to answer your main question. Basically, the additional time in background task processing comes from the implementation. Because of the 'fire and forget' approach, nobody wants the background task scheduler to consume a large amount of processor time just monitoring for new jobs. I am not completely sure, but I think that if your calculation were twice as complex, the extra time would still be the same 25 seconds.

My guess is that the extra time comes from your background worker needing to load Rails and all of your application. My clue is that you said the difference shows up with Rails in production mode. In production mode, subsequent calls to the app make use of the app and class cache.
How to check this hypothesis:
Change your background job to do the following:
1. print a log message before you initiate the worker
2. start the worker
3. run your calculation. As part of your calculation startup, print a log message
4. print another log message
5. run your calculation again
6. print another log message
Then compare the two times for running your calculation.
Of course, the second run will also benefit from database caching, code that remains resident in memory, etc. But if the second run is much, much faster, then the fact that it didn't have to restart Rails is the more significant factor.
The time between the log messages from steps 1 and 3 will also help you understand the start-up time.
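The measurement described above could be sketched as follows in plain Ruby. `run_calculation` is a hypothetical placeholder for the real calculation, and the log lines go to standard output rather than to the Rails logger; in a real worker you would put the same timing calls around the actual job body.

```ruby
# Hypothetical stand-in for the long calculation.
def run_calculation
  (1..50_000).reduce(0) { |sum, n| sum + n * n }
end

# Seconds elapsed while running the block, measured on a monotonic clock.
def elapsed_seconds
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  yield
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
end

# Run the calculation twice in the same process and log both durations,
# so a slow first run (boot, class loading, cold caches) stands out.
def time_two_runs(log: $stdout)
  log.puts "#{Time.now} worker started"
  first = elapsed_seconds { run_calculation }
  log.puts "#{Time.now} first run finished in #{first.round(4)}s"
  second = elapsed_seconds { run_calculation }
  log.puts "#{Time.now} second run finished in #{second.round(4)}s"
  [first, second]
end

first, second = time_two_runs
```

Comparing `first` and `second` from inside the worker, and again from inside a web request, shows whether the gap is start-up cost or something that persists across runs.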
Fixes
Why wait?
Most important: why do you need the results faster? E.g., tell your user that the result will be emailed to them after it is calculated. Or let your user see that the calculation is proceeding in the background, and show them the result later.
The key for any long running calculation is to do it in the background and encourage the user to not wait for the result. They should be able to do something else until they get the result.
Start the calculation automatically
As soon as the user logs in, or after they do something interesting, start the calculation. That way, when (and if) the user asks for the calculation, the answer will either be already done or will soon be done.
Cache the result and bust the cache as needed
Similar to the above, start the calculation periodically and automatically. If the user changes some data, then restart the calculation by busting the cache. There are also ways to halt any on-going calculation if data is changed during the calculation.
Pre-calculate part of the calculation
Why are you taking 25 seconds or more for a dbms calculation? Could be that you should change the calculation. Investigate adding indexes, summary tables, de-normalizing, splitting the calculation into smaller steps that can be pre-calculated, etc.

rake task memory leak

I have a long-running rake task that is gobbling up all my system memory over time. What is the quickest way to track down my issue and get to the bottom of it?
I'm using Rails 2.3.5, Ruby 1.8.7, Ubuntu on Slicehost and MySQL 5.
I have a Rails app that works fine. I have a nightly job that runs all night and does tons of work (some external calls to Twitter, Google, etc., and lots of db calls using Active Record). Over time that job grows in memory size to nearly 4 GB. I need to figure out why the rake task is not releasing memory.
I started looking into bleak_house, but it seems complex to set up and hasn't been updated in over a year. I can't get it to work locally, so I'm reluctant to try it in production.
thanks
Joel

Throwing out two ideas. First, if you're looping as part of this job, make sure you're not holding onto references to objects you don't need, as this will prevent them from being collected. If you're done, remove them from your array, or whatever. Also, put a periodic GC.start into your loop as a way to see if it's simply not getting around to GC-ing.
Second idea is that Ruby 1.8.7 does not GC symbols (Ruby 2.2 and later can collect dynamically created ones), so if your API clients are storing values as symbols you can end up with a huge and growing set of symbols that will never be re-used. Symbols are tiny, but tiny things can still add up.
And of course, don't load more objects than you need to. Use #find_each to load AR objects in batches if you have to iterate over lots of them.
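A minimal plain-Ruby sketch of the batching advice above, assuming the job only needs an aggregate and not every processed object. `handle` and `process_in_batches` are hypothetical names; with Active Record, `find_each` would take the place of `each_slice` and load the rows in batches for you. The periodic `GC.start` is there purely as a diagnostic, as suggested above.

```ruby
# Hypothetical stand-in for the per-record work done by the nightly job.
def handle(record)
  record.to_s.length
end

# Walk the records in fixed-size batches, keeping only a running total
# instead of accumulating every processed object in an array.
def process_in_batches(records, batch_size: 1000, gc_every: 10)
  total = 0
  records.each_slice(batch_size).with_index(1) do |batch, batch_number|
    batch.each { |record| total += handle(record) }
    # Diagnostic only: force a collection every few batches to see whether
    # memory was merely waiting to be collected.
    GC.start if (batch_number % gc_every).zero?
  end
  total
end

puts process_in_batches(1..25_000, batch_size: 1000)
```

If memory still grows with this shape of loop, something outside the loop is holding references, which narrows the search considerably.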

Timed server events with ruby on rails

I am attempting to create a web-based game in Ruby on Rails. I have a model named 'Game', which has a datetime in the database entry that corresponds to a time that I would like the server to call the Game model's update_game function. Depending on the game's settings, this could be every 30 seconds to every 12 hours.
Ruby on Rails only seems to work when it receives an HTTP request; is there a slick way to get my game to update on a periodic basis independent of HTTP requests?

I'd look into delayed_job for this. When the game starts, you can create a delayed_job for the first update, and every run after that can add a new job at the correct interval until it's done.
I'd do lots of testing though ;) - you don't want to let the jobs get away from you.

Rails itself doesn't do this; cron does this. Ruby does, however, have a gem named Whenever that makes declaring and deploying new cron jobs easier.
However, if you really expect a large number of games to update reliably every 30 seconds, you may want to take a different approach if updating a game takes any significant amount of time. Perhaps once the game is accessed, the game could run the update as many times as necessary (e.g. if 3 minutes had passed and the interval is 30 seconds, run 6 updates once requested). This may or may not be a good option for your setup, however, so figure out which method is more viable for your purposes.
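The "catch up when accessed" idea above can be sketched in plain Ruby. The `Game` class here is a hypothetical, database-free stand-in for the model in the question; only `update_game` is a name taken from it, and its body is reduced to a counter. The arithmetic is the point: elapsed time divided by the interval, rounded down, gives the number of updates owed.

```ruby
# Number of whole update intervals that have elapsed since the last update.
def pending_updates(last_updated_at, interval_seconds, now: Time.now)
  elapsed = now - last_updated_at
  return 0 if elapsed.negative?

  (elapsed / interval_seconds).floor
end

# Hypothetical minimal game object for illustration.
class Game
  attr_reader :turn, :last_updated_at

  def initialize(last_updated_at, interval_seconds)
    @last_updated_at = last_updated_at
    @interval_seconds = interval_seconds
    @turn = 0
  end

  def update_game
    @turn += 1
  end

  # Run every update that should have happened by `now`, then advance the
  # timestamp by exactly the intervals consumed so partial intervals carry over.
  def catch_up(now: Time.now)
    count = pending_updates(@last_updated_at, @interval_seconds, now: now)
    count.times { update_game }
    @last_updated_at += count * @interval_seconds
    count
  end
end

start = Time.at(0)
game = Game.new(start, 30)
puts game.catch_up(now: start + 180) # 3 minutes at a 30-second interval: 6
```

Calling `catch_up` at the top of every request that touches the game keeps its state current without any scheduler, at the cost of doing the work inside the request.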

Look into background processing options and possibly cron.

I like the gem 'rufus-scheduler' which works within Rails, though I'm not sure you can programmatically add more tasks to it.

Application Context in Rails

Rails comes with a handy session hash into which we can cram stuff to our heart's content. I would, however, like something like ASP's application context, which instead of sharing data only within a single session, will share it with all sessions in the same application. I'm writing a simple dashboard app, and would like to pull data every 5 minutes, rather than every 5 minutes for each session.
I could, of course, store the cache update times in a database, but so far haven't needed to set up a database for this app, and would love to avoid that dependency if possible.
So, is there any way to get (or simulate) this sort of thing? If there's no way to do it without a database, is there any kind of "fake" database engine that comes with Rails, runs in memory, but doesn't bother persisting data between restarts?

Right answer: memcached. Fast, clean, supports multiple processes, integrates very cleanly with Rails these days. Not even that bad to set up, but it is one more thing to keep running.
90% Answer: There are probably multiple Rails processes running around -- one for each Mongrel you have, for example. Depending on the specifics of your caching needs, it's quite possible that having one cache per Mongrel isn't the worst thing in the world. For example, supposing you were caching the results of a long-running query which
gets fresh data every 8 hours
is used every page load, 20,000 times a day
needs to be accessed in 4 processes (Mongrels)
then you can drop those 20,000 requests down to 12 with about a single line of code:
@@arbitrary_name ||= Model.find_by_stupidly_long_query(param)
The double at-mark, a Ruby sigil you might not be familiar with, marks a class variable: it is shared by every instance of the class for as long as the process lives, so it behaves much like a global here. ||= is the commonly used Ruby idiom to execute the assignment if and only if the variable is currently nil or otherwise evaluates to false. It will stay good until you explicitly empty it OR until the process stops, for any reason -- server restart, explicitly killed, what have you.
And after you go down from 20k calculations a day to 12 in about 15 seconds (OK, two minutes -- you need to wrap it in a trivial if block which stores the cache update time in a different class variable), you might find that there is no need to spend additional engineering assets on getting it down to 4 a day.
I actually use this in one of my production sites, for caching a few expensive queries which literally only need to be evaluated once in the life of the process (i.e. they change only at deployment time -- I suppose I could precalculate the results and write them to disk or DB but why do that when SQL can do the work for me).
You don't get any magic expiry syntax, reliability is pretty slim, and it can't be shared across processes -- but it's 90% of what you need in a line of code.
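The "trivial if block" mentioned above might look roughly like this. It is a plain-Ruby sketch under stated assumptions: `ExpensiveQuery.run` is a hypothetical stand-in for `Model.find_by_stupidly_long_query(param)`, `DashboardCache` is an invented name, and the 8-hour lifetime comes from the example figures in this answer. The cache lives in one process only, exactly as described.

```ruby
# Hypothetical stand-in for the expensive query; it counts how often it runs.
class ExpensiveQuery
  @calls = 0

  class << self
    attr_reader :calls

    def run
      @calls += 1
      "result #{@calls}"
    end
  end
end

# Per-process cache with a manual expiry check stored in a second class variable.
class DashboardCache
  TTL_SECONDS = 8 * 60 * 60 # the data is refreshed every 8 hours

  @@value = nil
  @@fetched_at = nil

  def self.fetch(now: Time.now)
    if @@fetched_at.nil? || now - @@fetched_at >= TTL_SECONDS
      @@value = ExpensiveQuery.run
      @@fetched_at = now
    end
    @@value
  end
end

t0 = Time.at(0)
DashboardCache.fetch(now: t0)                               # runs the query
DashboardCache.fetch(now: t0 + 60)                          # served from memory
DashboardCache.fetch(now: t0 + DashboardCache::TTL_SECONDS) # expired, runs again
puts ExpensiveQuery.calls
```

With several threads in one process, two requests could refresh at the same moment; wrapping the body of `fetch` in a `Mutex` would avoid that, which this sketch leaves out.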

You should have a look at memcached: http://wiki.rubyonrails.org/rails/pages/MemCached

There is a helpful Railscast on Rails 2.1 caching. It is very useful if you plan on using memcached with Rails.

Using the stock Rails cache is roughly equivalent to this.

@p3t0r is right, MemCached is probably the best option, but you could also use the SQLite database that comes with Rails. That won't work over multiple machines though, where MemCached will. Also, SQLite will persist to disk, though I think you can set it up not to if you want. Rails itself has no application-scoped storage since it's being run as one-process-per-request-handler, so it has no shared memory space like ASP.NET or a Java server would.

So what you are asking is quite impossible in Rails because of the way it is designed. What you are asking for is a shared object, and Rails is strictly single threaded. Memcached or a similar tool for sharing data between distributed processes is the only way to go.

Rails.cache freezes the objects it stores. This kind of makes sense for a cache but NOT for an application context. I guess instead of doing a roundtrip to the moon to accomplish that simple task, all you have to do is create a constant inside config/environment.rb:
APP_CONTEXT = Hash.new
Pretty simple, ah?
