I'm trying to perform a daily operation on a larger than normal dataset (2m+ records). However, Rails seems to take a very long time performing operations on such a dataset. Operations like
Dataset.all.each do |data|
...
end
take a very long time to complete (I assume this is because it can't fit all the items into memory at once, right?).
Does anyone have any strategies on how I could handle this situation? I know SQL would probably speed up the process, but I'm looking to use the Rails environment as I can do many more complicated things to the data than I can with just SQL statements.
You want to use ActiveRecord's find_each for this.
Dataset.find_each do |data|
...
end
When processing a large set of rows, a database is very fast and efficient; it's what they were designed for. I would recommend attempting to do all this processing in SQL if you want maximum performance. If you prefer to use Rails, or it is impossible to do everything you want in SQL, you might attempt to do some pre-processing in SQL and the remainder in Rails. Short of that, 2m+ rows is a lot to loop over; even if each one only takes a fraction of a second, it adds up to a long time.
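As a rough illustration of that hybrid approach (the scope and the process! method here are hypothetical):

# Let SQL do the cheap filtering up front, then stream only the survivors
# through Ruby in fixed-size batches; find_each never loads the whole table.
Dataset.where(:processed => false).find_each(:batch_size => 1000) do |data|
  data.process!   # the complicated, Ruby-only part
end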
Related
I use Delphi XE2 along with DISQLite v3 (which is basically a port of SQLite3). I love everything about SQLite3 except the lack of concurrent writing, especially since I rely extensively on multi-threading in this project :(
My profiler made it clear I needed to do something about it, so I decided to use this approach:
Whenever I need to insert a record into the DB, instead of doing an INSERT I write the SQL query to a file in a special folder, e.g.
WriteToFile_Inline(SPECIAL_FOLDER_PATH + '\' + GUID, FileName + '|' + IntToStr(ID) + '|' + Hash + '|' + FloatToStr(ModifDate) + '|' + ...);
I added a timer (in the main app thread) that fires every minute, parses these files, and then runs the INSERTs inside a transaction.
The temporary files are deleted at the end.
The result is roughly a 500% performance gain. Plus, this technique is ACID, since I can always scan the SPECIAL_FOLDER_PATH after a power failure and execute the INSERTs I find there.
Despite the good results, I'm not very happy with the method (hackish, to say the least). I keep thinking that if I had a generics-like, thread-safe, ACID list with fast lookup access, this would be much cleaner (and possibly faster?).
So my question is: do you know anything like that for Delphi XE2?
PS. I trust many of you reading the code above will be in shock and will start insulting me at this point! Please be my guest, but if you know a better (i.e. faster) ACID approach, please share your thoughts!
Your idea of sending the inserts to a queue, which will rearrange the inserts and join them via prepared statements, is very good. Whether you use a timer in the main thread or a separate thread is up to you. Either way, it will avoid any locking.
Do not forget to use a transaction, then commit it every 100/1000 inserts for instance.
About high performance using SQLite3, see e.g. this blog article. In its benchmark graphic, the best performance ("file off") comes from:
PRAGMA synchronous = OFF
Using prepared statements
Inside a transaction
In WAL mode (especially in concurrency mode)
You may also change the page size, or the journal size, but settings above are the best. See https://stackoverflow.com/search?q=sqlite3+performance
If you do not want to use a background thread, ensure WAL is on, prepare your statements, use batches, and regroup your processing to release the SQLite3 lock as soon as possible.
The best performance will be achieved by adding a Client-Server layer, just as we did for mORMot.
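As a rough sketch of that recipe outside Delphi (here using Ruby's sqlite3 gem, with an invented files table; records stands for one parsed batch from the queue, and the same pattern applies to DISQLite3):

require "sqlite3"

db = SQLite3::Database.new("data.db")
db.execute("PRAGMA journal_mode = WAL")     # allow readers while a writer is active
db.execute("PRAGMA synchronous = NORMAL")   # or OFF for maximum speed, at some durability cost

stmt = db.prepare("INSERT INTO files (name, id, hash, modif_date) VALUES (?, ?, ?, ?)")
db.transaction do                           # one transaction per batch, not per row
  records.each { |r| stmt.execute(r[:name], r[:id], r[:hash], r[:modif_date]) }
end                                         # commits here
stmt.close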
With files you have organized an asynchronous job queue with persistence. It lets you avoid inserting records one by one and use a batch (group of records) approach instead. Comparing one-by-one and batch:
the first (probably) works in auto-commit mode for each record; the second wraps a batch into a single transaction and gives the greatest performance gain.
the first (probably) prepares an INSERT command each time a record needs to be inserted; the second prepares it once per batch and gives the second-largest gain.
I don't think SQLite concurrency is the problem in your case (at least not the main issue), because in SQLite a single insert is comparatively fast, and concurrency performance issues only appear under a high workload. You would probably get similar results with another DBMS, like Oracle.
To improve your batch approach, consider the following:
consider setting journal_mode to WAL and disabling shared-cache mode.
use a background thread to process your queue. Instead of a fixed time interval (1 min), check SPECIAL_FOLDER_PATH more often, and if the queue holds more than X KB of data, start processing. Or keep a count of queued records and use an event to notify the thread that the queue should be processed.
use a multi-record prepared INSERT instead of a single-record INSERT. You can build an INSERT for 100 records and process your queue data in chunks of 100 records (see the sketch after this list).
consider writing/reading binary field values instead of text values.
consider using a set of files with a preallocated size.
etc.
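To make the multi-record INSERT idea concrete, here is a sketch (again using Ruby's sqlite3 gem with invented names; db is assumed to be an open SQLite3::Database handle and queue is one parsed batch from your folder):

ROWS_PER_STMT = 100
sql  = "INSERT INTO files (name, hash) VALUES " + (["(?, ?)"] * ROWS_PER_STMT).join(", ")
stmt = db.prepare(sql)                        # prepared once, reused for every full chunk

queue.each_slice(ROWS_PER_STMT) do |chunk|
  next unless chunk.size == ROWS_PER_STMT     # the short tail would need its own smaller statement
  stmt.execute(*chunk.flat_map { |r| [r[:name], r[:hash]] })
end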
sqlite3_busy_timeout is pretty inefficient because it doesn't return immediately when the table it's waiting on is unlocked.
I would try creating a critical section (TCriticalSection?) to protect each table. If you enter the critical section before inserting a row and exit it immediately thereafter, you will create better table locks than SQLite provides.
Without knowing your access patterns, though, it's hard to say if this will be faster than batching up a minute's worth of inserts into single transactions.
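The pattern itself is small; purely as an illustration (sketched in Ruby for brevity, where a Delphi TCriticalSection would play the Mutex role):

FILES_TABLE_LOCK = Mutex.new            # one lock per table you write to

def insert_row(stmt, values)
  FILES_TABLE_LOCK.synchronize do       # serialize your own writers instead of
    stmt.execute(*values)               # spinning in sqlite3_busy_timeout
  end
end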
I am using ActiveRecord to bulk migrate some data from a table in one database to a different table in another database. About 4 million rows.
I am using find_each to fetch in batches. Then I do a little bit of logic to each record fetched, and write it to a different db. I have tried both directly writing one-by-one, and using the nice activerecord-import gem to batch write.
However, in either case, my Ruby process's memory usage grows quite a bit throughout the life of the export/import. I would think that, since find_each fetches batches of 1000, only 1000 records should be in memory at a time... but no, each record I fetch seems to consume memory forever, until the process is over.
Any ideas? Is ActiveRecord caching something somewhere that I can turn off?
update 17 Jan 2012
I think I'm going to give up on this. I have tried:
* Making sure everything is wrapped in an ActiveRecord::Base.uncached do block
* Adding ActiveRecord::IdentityMap.enabled = false (I think that should turn off the identity map for the current thread, although it's not clearly documented, and I think the identity map isn't on by default in current Rails anyhow)
Neither of those seems to have much effect; memory is still leaking.
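(For clarity, by "wrapped in uncached" I mean roughly the following, where do_logic_and_write stands in for the real per-record work:)

ActiveRecord::Base.uncached do
  SourceRecord.find_each(:batch_size => 1000) do |record|
    do_logic_and_write(record)   # transform, then write to the destination database
  end
end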
I then added a periodic explicit:
GC.start
That seems to slow down the rate of memory leak, but the memory leak still happens (eventually exhausting all memory and bombing).
So I think I'm giving up, and deciding it is not currently possible to use AR to read millions of rows from one db and insert them into another. Perhaps there is a memory leak in MySQL-specific code being used (that's my db), or somewhere else in AR, or who knows.
I would suggest queueing each unit of work into a Resque queue. I have found that Ruby has some quirks when iterating over large arrays like these.
Have one main thread that queues up the work by ID, then have multiple Resque workers hitting that queue to get the work done.
I have used this method on approx 300k records, so it would most likely scale to millions.
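A rough sketch of that layout (class, queue, and model names are made up):

# A Resque job that migrates one record, identified by ID.
class MigrateRecordJob
  @queue = :migration

  def self.perform(record_id)
    record = SourceRecord.find(record_id)
    # ... transform the record and write it to the destination database ...
  end
end

# The main thread only enqueues IDs; the workers do the heavy lifting.
SourceRecord.select(:id).find_each { |r| Resque.enqueue(MigrateRecordJob, r.id) }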
Change line #86 to bulk_queue = [], since bulk_queue.clear only sets the length of the array to 0, making it impossible for the GC to clear it.
So here's the final line of my request:
Completed in 9209ms (View: 358, DB: 1582)
So I need to figure out what's holding up the request. This request can take upwards of a minute, so I really need to track it down. If the DB and View account for about 1.9 seconds, that leaves roughly 7.3 seconds unaccounted for in the log. How can I further analyze other parts of the code? It would be great to know if, say, a single callback were largely responsible for the delay.
Generally code that isn't in the DB or the view is in the controller (or it could be in the logic part of the model). The biggest culprits I've had problems with are loops within loops, or gratuitous use of :include => {...stuff...}
Are you materializing a lot of records from the database? Query time might not be much, but if, say, I selected 10,000 records from the database using a very simple and fast query, I'd still expect my page load time to be high, since it would spend a considerable amount of time hydrating the large recordset.
I'd suggest running some profiling tests (rake test:profile). These should give you an indication of where most of the time is being spent. You'll probably find it's just one or two method calls that are being invoked with excessive frequency.
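If the profiler output is hard to digest, a cruder first pass is to wrap suspect stages in the standard library's Benchmark and log the timings (the query below is only a placeholder):

require "benchmark"

elapsed = Benchmark.realtime do
  @records = Record.all          # stand-in for whatever the controller really loads
end
Rails.logger.info("loading records took #{elapsed} seconds")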
I've found that previous programmers used cfstoredproc in our existing project to insert records into the database.
For example, they used something like this:
<cfstoredproc procedure="myProc" datasource="myDsn">
<cfprocparam type="In" cfsqltype="CF_SQL_CHAR" value="ppshein" dbvarname="username">
<cfprocparam type="In" cfsqltype="CF_SQL_CHAR" value="28 yrs" dbvarname="age">
</cfstoredproc>
I can't figure out why they used the code above instead of:
<cfquery datasource="myDsn">
insert usertb
(name, age)
values
(<cfqueryparam cfsqltype="CF_SQL_CHAR" value="ppshein">, <cfqueryparam cfsqltype="CF_SQL_CHAR" value="28 yrs">)
</cfquery>
I feel there may be hidden performance differences between cfstoredproc and cfquery for complex data manipulation. Please let me know how cfstoredproc performs in ColdFusion compared to cfquery for complex data manipulation. All I know is that a stored procedure is reusable.
CFSTOREDPROC should have better performance for the same reason a stored procedure will have better performance at the database level -- when created, the stored procedure is optimized internally by the database.
Whether this is noticeable depends on your database and your query. Use of CFQUERYPARAM inside your CFQUERY (as in your example) also speeds execution (at the db driver level).
Unless the application is VERY performance-sensitive, I tend to run my SQL code in a profiler first to optimize it, then put it into my CFQUERY parametrized with CFQUERYPARAM tags, rather than use a storedproc. That way, all the logic is in the CF code. This is a matter of taste, of course, and it's easy to move the SQL into a storedproc later when the application matures.
Some shops prefer to have all data logic controlled by the database, leaving CF to act almost exclusively as the front-end generator. Some places are so controlling that they won't let you write any SQL in your CF code.
Update: There might be more to that Stored Proc than a simple INSERT INTO. There may be some data lookup in another table. There could be validation. There may be conditional logic. There may be multiple transactions going on, like a log. A failure to perform the insert may return a specific status code rather than throwing an error.
Honestly, it's simply a matter of style. There are reasons for and against either way, and I've found that it usually comes down to who has more/more competent coders: The CF guys or the database guys.
Basically, if you use a stored procedure, the stored procedure will be pre-compiled by the database and the execution plan stored. This means that subsequent calls to the stored procedure do not incur that overhead. For large, complex queries this can be substantial.
So as a rule of thumb: queries that are...
large and complex or
called very frequently or
both of the above
...are very good candidates for conversion to a stored procedure.
Hope that helps!
I've got a fancy-schmancy "worksheet" style view in a Rails app that is taking way too long to load. (In dev mode, and yes I know there's no caching there, "Completed in 57893ms (View: 54975, DB: 855)") The worksheet is rendered using helper methods, because I couldn't stand maintaining umpteen teeny little partials for the different sorts of rows in the worksheet. Now I'm wondering whether partials might actually be faster?
I've profiled the page load and identified a few cases where object caching will shave a few seconds off, but the profile output suggests that a large chunk of time is spent simply looping through the Worksheet model's constituent objects and appending the string output from the helper. Here's an example of what I'm talking about:
def header_row(wksht)
content_tag(:thead, :class => "ioe") do
content_tag(:tr) do
html_row = []
for i in (0...wksht.class::NUM_COLS) do
html_row << content_tag(:th, h(wksht.column_headings[i].upcase),
:class => wksht.column_classes[i])
end
html_row.join("\n")
end
end
end
OTOH using partials means opening files, spinning off the Ruby interpreter, and in the long run, aggregating a bunch of strings, right? So I'm wondering whether there is another way to speed things up in the helpers. Should I be using something like a stringstream (does that exist in Ruby?), should I get rid of content_tag calls in favor of my own "" string interpolation... I'm willing to write my own performance tests, and share the results, if you have any suggested alternatives to the approach I've already taken.
As it's a fairly complex view (and has an editable version as well), I'd rather not rewrite-and-profile the whole thing more than once. :)
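For what it's worth, one variant I may benchmark is appending to a single String buffer instead of building an Array and joining it; a sketch (not measured yet):

def header_row(wksht)
  buffer = ""
  wksht.class::NUM_COLS.times do |i|
    buffer << content_tag(:th, h(wksht.column_headings[i].upcase),
                          :class => wksht.column_classes[i])
    buffer << "\n"
  end
  content_tag(:thead, content_tag(:tr, buffer), :class => "ioe")
end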
Some related reading:
http://www.viget.com/extend/helpers-vs-partials-a-performance-question/ (old)
http://www.breakingpointsystems.com/community/blog/ruby-string-processing-overhead/
http://blog.purepistos.net/index.php/2008/07/14/benchmarking-ruby-string-interpolation-concatenation-and-appending/
@tadman:
There are row totals and column totals (and more columnar arithmetic), and since they're not all just totals, but also depend on other "magic numbers" from the database, I implemented them in the Ruby code rather than Javascript. (DRY and unit testable.) Javascript is used only in the edit view, and just to add/delete rows (client side only) and to fetch a sheet with fresh totals when the cell contents change. It fetches the whole table because nearly half of the values get updated when an input cell changes.
The worksheet and its rows are actually virtual models; they don't live in the DB, but rather aggregate a boatload of real AR objects. They get created every time a view renders (but that takes 1.7 secs in dev mode, so I'm not worried about it).
I suppose I could transmit a matrix of numbers, rather than marked-up content, and have JS unpack it into the sheet. But that gets unmaintainable fast.
I ended up reading an excellent article at http://www.infoq.com/articles/Rails-Performance ("A Look At Common Performance Problems In Rails"). Then I followed the author's suggestion to cache computations during request processing:
def estimated_costs
  @estimated_costs ||=
    begin
      # tedious vector math
    end
end
Because my worksheet does stuff like the above over and over, and then builds on those results to calculate some more rows, this resulted in a 90% speedup right off the bat. Should have been plain as day, but it started with just a few totals, then I showed the prototype to the customer, and it snowballed from there :)
I also wondered whether my array-based math might be inefficient, so I replaced the Ruby Arrays of numbers with NArray (http://narray.rubyforge.org/). The speedup was negligible but the code's cleaner, so it's staying that way.
Finally, I put some object caching in place. The "magic numbers" in the database only change a few times a year at most, and some of them are encrypted, but they need to be used in most of the calculations. That's low-hanging fruit ripe for caching, and it shaved off another 1.25 seconds.
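The object caching itself is nothing fancy; roughly the following, with the model name and cache key invented for illustration:

def magic_numbers
  @magic_numbers ||= Rails.cache.fetch("worksheet/magic_numbers", :expires_in => 12.hours) do
    Hash[MagicNumber.all.map { |m| [m.name, m.decrypted_value] }]   # hypothetical model
  end
end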
I'll look at eager loading of associations next, as there's probably some time to save there, and I'll do a quick comparison of sending "just the data" vs sending the HTML, as @tadman suggested. About the only partial I can cache is the navigation sidebar. All of the other content depends on the request parameters.
Thanks for your suggestions!
Internally all partials are translated into a block of executable Ruby code and run through exactly the same runtime as any helper methods. Periodically you can see glimpses of this when a malformed template causes the generated code to fail to compile.
Although it stands to reason that helper methods are faster than partials, and a straightforward string interpolation is faster still, it's hard to say if the performance gain from this would make it worth pursuing. Rendering a very large number of partials can be a bottleneck in terms of logging in the development environment, but in a production environment their impact seems less severe.
The only way to figure this one out is to benchmark your pages using two different rendering methods.
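A quick, rough way to do that from the view (names taken from the question; the iteration count is arbitrary):

require "benchmark"

helper_time  = Benchmark.realtime { 100.times { header_row(wksht) } }
partial_time = Benchmark.realtime do
  100.times { render :partial => "header_row", :locals => { :wksht => wksht } }
end
Rails.logger.info("helper: #{helper_time}s, partial: #{partial_time}s")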
As you point out, caching is where you get the big gains. Using Memcached to save large chunks of pre-rendered HTML content can give you dramatically faster load times. Rendering 10,000 rows into HTML will always be slower than retrieving the same snippet from the Rails.cache subsystem.
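A sketch of that pattern (the cache key and the inner helper are placeholders):

def worksheet_body(wksht)
  Rails.cache.fetch("worksheet/#{wksht.cache_key}") do
    render_worksheet_rows(wksht)   # hypothetical expensive HTML-building helper
  end
end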
It's also the case that the content you don't render is always rendered the quickest, so anything you can do to reduce the amount of content you generate for each helper call will provide big gains. If you're building a large spread-sheet style app that's entirely dependent on JavaScript, you may find that bundling up the data as a JSON array and expanding it client-side is significantly faster than unrolling the HTML on the server and shipping it over that way.