Rails design doubt: Should/could I load the whole dictionary/table into memory?

Rails design doubt: Should/could I load the whole dictionary/table into memory? - ruby-on-rails

I am a newbie working in a simple Rails app that translates a document (long string) from a language to another. The dictionary is a table of terms (a string regexp to find and substitute, and a block that ouputs a substituting string). The table is 1 million records long.
Each request is a document that wants to be translated. In a first brutish force approach I need to run the whole dictionary against each request/document.
Since the dictionary will run whole every time (from the first record to the last), instead of loading the table of records of the dictionary with each document, I think the best would be to have the whole dictionary as an array in memory.
I know it is not the most efficient, but the dictionary has to run whole at this point.
1.- If no efficiency can be gained by restructuring the document and dictionary (meaning it is not possible to create smaller subsets of the dictionary). What is the best design approach?
2.- Do you know of similar projects that I can learn from?
3.- Where should I look to learn how to load such a big table into memory (cache?) at rails startup?
Any answer to any of the posed questions will be greatly appreciated. Thank you very much!

I don't think your web hoster will be happy with a solution like this. This script
dict = {}
(0..1000_000).each do | num |
dict[/#{num}/] = "#{num}_subst"
end
consumes a gigabyte of RAM on my MBP for storing the hash table. Another approach will be to store your substitutions marshaled in memcached so that you could (at least) store them across machines.
require 'rubygems'
require 'memcached'
#table = Memcached.new("localhost:11211")
retained_keys = (0..1000_000).each do | num |
stored_blob = Marshal.dump([/#{num}/, "#{num}_subst"])
#table.set("p#{num}", stored_blob)
end
You will have to worry about keeping the keys "hot" since memcached will expire them if they are not needed.
The best approach however, for your case, would be very simple - write your substitutions to a file (one line per substitution) and make a stream-filter that reads the file line by line, and replaces from this file. You can also parallelize that by mapping work on this, say, per letter of substitution and replacing markers.
But this should get you started:
require "base64"
File.open("./dict.marshal", "wb") do | file |
(0..1000_000).each do | num |
stored_blob = Base64.encode64(Marshal.dump([/#{num}/, "#{num}_subst"]))
file.puts(stored_blob)
end
end
puts "Table populated (should be a 35 meg file), now let's run substitutions"
File.open("./dict.marshal", "r") do | f |
until f.eof?
pattern, replacement = Marshal.load(Base64.decode64(f.gets))
end
end
puts "All replacements out"
To populate the file AND load each substitution, this takes me:
real 0m21.262s
user 0m19.100s
sys 0m0.502s
To just load the regexp and the string from file (all the million, piece by piece)
real 0m7.855s
user 0m7.645s
sys 0m0.105s
So this is 7 seconds IO overhead, but you don't lose any memory (and there is huge room for improvement) - the RSIZE is about 3 megs. You should easily be able to make it go faster if you do IO in bulk, or make one file for 10-50 substitutions and load them as a whole. Put the files on an SSD or a RAID and you got a winner, but you get to keep your RAM.

In production mode, Rails will not reload classes between requests. You can keep something in memory easily by putting it into a class variable.
You could do something like:
class Dictionary < ActiveRecord::Base
##cached = nil
mattr_accessor :cached
def self.cache_dict!
##cached = Dictionary.all
end
end
And then in production.rb:
Dictionary.cache_dict!
For your specific questions:
Possibly write the part that's inefficient in C or Java or a faster language
Nope, sorry. Maybe you could do a MapReduce algorithm to distribute the load across servers.
See above.

This isn't so much a specific answer to one of your questions as a process recommendation. If you're having (or anticipating) performance issues, you should be using a profiler from the get-go.
Check out this tutorial: How to Profile Your Rails Application.
My experience on a number of platforms (ANSI C, C#, Ruby) is that performance problems are very hard to deal with in advance; rather, you're better off implementing something that looks like it might be performant then load-testing it through a profiler.
Then, once you know where your time is being spent, you can expend some effort on optimisation.
If I had to take a guess, I'd say the regex work you'll be performing will be as much of a performance bottleneck as any ActiveRecord work. But without verifying that with a profiler, that guess is of little value.

If you use something like cache_fu, you can then leverage something like memcache without doing any work yourself. If you are trying to bring 1 MM rows into memory, being able to leverage the distributed nature of memcache will probably be useful.

Related

Why I always fail on loading big files in lua? [duplicate]

The overview is I am prototyping code to understand my problem space, and I am running into 'PANIC: unprotected error in call to Lua API (not enough memory)' errors. I am looking for ways to get around this limit.
The environment bottom line is Torch, a scientific computing framework that runs on LuaJIT, and LuaJIT runs on Lua. I need Torch because I eventually want to hammer on my problem with neural nets on a GPU, but to get there I need a good representation of the problem to feed to the nets. I am (stuck) on Centos Linux, and I suspect that trying to rebuild all the pieces from source in 32bit mode (this is reported to extend the LuaJIT memory limit to 4gb) will be a nightmare if it works at all for all of the libraries.
The problem space itself is probably not particularly relevant, but in overview I have datafiles of points that I calculate distances between and then bin (i.e. make histograms of) these distances to try and work out the most useful ranges. Conveniently I can create complicated Lua tables with various sets of bins and torch.save() the mess of counts out, then pick it up later and inspect with different normalisations etc. -- so after one month of playing I am finding this to be really easy and powerful.
I can make it work looking at up to 3 distances with 15 bins each (15x15x15 plus overhead), but this only by adding explicit garbagecollection() calls and using fork()/wait() for each datafile so that the outer loop will keep running if one datafile (of several thousand) still blows the memory limit and crashes the child. This gets extra painful as each successful child process now has to read, modify and write the current set of bin counts -- and my largest files for this are currently 36mb. I would like to go larger (more bins), and would really prefer to just hold the counts in the 15 gigs of RAM I can't seem to access.
So, here are some paths I have thought of; please do comment if you can confirm/deny that any of them will/won't get me outside of the 1gb boundary, or will just improve my efficiency within it. Please do comment if you can suggest another approach that I have not thought of.
am I missing a way to fire off a Lua process that I can read an arbitrary table back in from? No doubt I can break my problem into smaller pieces, but parsing a return table from stdio (as from a system call to another Lua script) seems error prone, and writing/reading small intermediate files will be a lot of disk i/o.
am I missing a stash-and-access-table-in-high-memory module ? This seems like what I really want, but not found it yet
can FFI C data structures be put outside the 1gb? Doesn't seem like that would be the case but certainly I lack a full understanding of what is causing the limit in the first place. I suspect that this will just get me an efficiency improvement over generic Lua tables for the few pieces that have moved beyond prototyping? (unless I do a bunch of coding for each change)
Surely I can get out by writing an extension in C (Torch appears to support nets that should go outside of the limit), but my brief investigation there turns up references to 'lightuserdata' pointers -- does this mean that a more normal extension won't get outside 1gb either? This also seems like it has the heavy development cost for what should be a prototyping exercise.
I know C well so going the FFI or extension route doesn't bother me - but I know from experience that encapsulating algorithms in this way can be both really elegant and really painful with two places to hide bugs. Working through data structures containing tables within tables on the stack doesn't seem great either. Before I make this effort I would like to be certain that the end result really will solve my problem.
Thanks for reading the long post.

Only object allocated by LuaJIT itself are limited to the first 2GB of memory. This means that tables, strings, full userdata (i.e. not lightuserdata), and FFI objects allocated with ffi.new will count towards the limit, but objects allocated with malloc, mmap, etc. are not subjected to this limit (regardless if called by a C module or the FFI).
An example for allocating a structure with malloc:
ffi.cdef[[
typedef struct { int bar; } foo;
void* malloc(size_t);
void free(void*);
]]
local foo_t = ffi.typeof("foo")
local foo_p = ffi.typeof("foo*")
function alloc_foo()
local obj = ffi.C.malloc(ffi.sizeof(foo_t))
return ffi.cast(foo_p, obj)
end
function free_foo(obj)
ffi.C.free(obj)
end
The new GC to be implemented in LuaJIT 3.0 IIRC will not have this limit, but I haven't heard any news on it's development recently.
Source: http://lua-users.org/lists/lua-l/2012-04/msg00729.html

Here is some follow-up information for those who find this question later:
The key information is as posted by Colonel Thirty Two, that C module extensions and FFI code can easily get outside of the limit. (and the referenced lua list post reminds that plain Lua tables that go outside the limit will be very slow to garbage collect)
It took me some time to pull the pieces together to both access and save/load my objects, so here it is in one place:
I used lds at https://github.com/neomantra/lds as a starting point, in particular the 1-D Array code.
This broke using torch.save(), as it doesn't know how to write the new objects. For each object I added the code below (using Array as the example):
function Array:load(inp)
for i=1,#inp do
self._data[i-1] = tonumber(inp[i])
end
return self
end
function Array:serialize ()
local siz = tonumber(self._size)
io.write(' lds.ArrayT( ffi.typeof("double"), lds.MallocAllocator )( ', siz , "):load({")
for i=0,siz-1 do
io.write(string.format("%a,", self._data[i]))
end
io.write("})")
end
Note that my application specifically uses doubles and malloc(), so a better implementation would store and use these in self rather than hard coding above.
Then as discussed in PiL and elsewhere, I needed a serializer that would handle the object:
function serialize (o)
if type(o) == "number" then
io.write(o)
elseif type(o) == "string" then
io.write(string.format("%q", o))
elseif type(o) == "table" then
io.write("{\n")
for k,v in pairs(o) do
io.write(" ["); serialize(k); io.write("] = ")
serialize(v)
io.write(",\n")
end
io.write("}\n")
elseif o.serialize then
o:serialize()
else
error("cannot serialize a " .. type(o))
end
end
and this needs to be wrapped with:
io.write('do local _ = ')
serialize( myWeirdTable )
io.write('; return _; end')
and then the output from that can be loaded back in with
local myWeirdTableReloaded = dofile('myWeirdTableSaveFile')
See PiL (Programming in Lua book) for dofile()
Hope that helps someone!

You can use the torch tds module. From the README:
Data structures which do not rely on Lua memory allocator, nor being limited by Lua garbage collector.
Only C types can be stored: supported types are currently number, strings, the data structures themselves (see nesting: e.g. it is possible to have a Hash containing a Hash or a Vec), and torch tensors and storages. All data structures can store heterogeneous objects, and support torch serialization.

HHVM staticly typing lookup tables and keeping them fully cached in RAM

I'm doing scientific research, processing through millions of combinations of multi-megabyte arrays.
For you to be capable of answering this question you will need to have knowledge/experience of all of the following
how HHVM is able to cache data structures in RAM between requests
how to tell HHVM data structures will be constant
how to declare array index and value types
I need to process the entire arrays, so it's a lot of data to be loaded and processed. (millions of requests within minutes on a LAN). The faster I can complete requests the quicker I can complete my work. If HHVM has to do work loading this data on each request, it accounts for a significant fraction of the time to complete the request (sometimes more than half, it depends on the complexity of the analysis I'm doing at the time).
I have found a method that has allowed me to keep these data structures cached in RAM (no loading from files, interpreting code, pushing to the array hundreds of thousands of times for no reason, no pointless repetitive unserialize etc), and thus I have eliminated this massive measurable delay.
I have 3 questions regarding how I can make this even faster:
Is the way I'm doing it now creating a global scope penalty?
How can I declare my arrays as constant and tell HHVM what data types to expect?
If I declare my arrays as constant is it even necessary to declare the types for HHVM?
Instead of using nested arrays, if I use 3 separate data structures ImmVector, PackedArray, or define a class would it be faster?
Keep in mind that anything that prevents HHVM from caching the data structure in RAM between requests should be regarded as unacceptable.
Lookuptable35543.php
<?php
$data = [
["uuid (20 chars)", 5336, 7373],
["uuid (20 chars)", 5336, 7373],
#more lines as above
];
?>
Some of these files are many MB in size and there are a lot of them
Main.php
<?php
function main() {
require /path/to/Lookuptable35543.php;
#(Do stuff with $data)
}
?>
This is working quite well, as Main.php gets thousands of requests, in a short period of time, HHVM keeps Lookuptable.php's data structure in memory. Avoiding pointless processing and IO, as it just sits in RAM, ready for use. (I have more than enough RAM)
Unfortunately, the only way I know how to make HHVM hold the lookup table in RAM is, I set $data in the global scope inside my lookup####.php file (then require the lookup file into a function in the data processing file: Main.php)? This way HHVM doesnt bother loading the file or re executing the code to create $data, because it can see that $data can be determined at compile time, and it will not ever change during runtime. This works but I dont know if there is a penalty from having the $data exist in the lookup###.php file's global scope. (Or maybe its not global at all because it is required into main.php's function?)
What if I return $data from a function inside Lookup.php and call that function from Main.php like this
Main.php
Would the HHVM JIT the result of getData() in RAM?
Somehow I associate functions with unpredictability... but maybe HHVM is clever enough to know that the functions result can be determined at compile time, and never changes?
I can't put the lookup table inside Main.php because I require different lookup tables based on the type of request.
Is there a way I can tell HHVM that my outer array will always have an integer index that never changes, and the values of the outer array will always be an array?
Perhaps I need to use ImmVector?
Then is there a way to tell HHVM that my inner array will always be a fixed length string followed by 2 integers, always, no extra elements, contents never changes?
I'd prefer not to use OO or create a class. How can I declare types, procedural style?
If a class is absolutely necessary can you please give example code suitable for my requirements above?
Will it be faster if I dont nest arrays?
I just realized I could have one array with integer index and values of fixed length string. Then a 2nd array with integer index and integer values, and a 3rd one with integer index and integer values.
If you're not familiar with this HHVM caching technique please do not waste mutual time suggesting a database, redis, APC, unserialize, etc. The fastest is for HHVM to just keep my various $data variables in RAM. Even unserializing $data from a ramdisk file is slow, because then the entire data structure must be parsed as a string and converted into a data structure in memory for every request. APC has the same problem as far as i know. I dont want to even have to copy $data. The lookup tables are immutable, read only. They must just stay fully structured in RAM. My current caching solution (at the top of this question) has already given me huge gains, but as per my 3 questions I think there may be more gains to be had?
Incase you're wondering, I have measured the latency of various data loading or caching methods.
Now I basically want to keep the caching situation I have, but give the HHVM JIT maximum confidence about how to type my data, so it can save time not running type or even bound (array size) checks.
Edit
Ok so nobody has been able to give me any code examples yet, so I'm just trying stuff out.
Here's what I've found out so far.
const arrays don't work yet in HHVM. const foo = ['uuid1',43,43];
throws an error about HHVM only supporting constants with scalar values.
Vector with Array values: I don't know how it will perform yet... I expect it will be better than a normal array. This is valid HH code.
This is progress, because HHVM should be able to cache this in the same way, HHVM knows this whole structure is constant, and HHVM knows the indexes are all integers.
What I'm still not entirely happy about with this structure is this:
Consider this code
for ($n=0;$n<count($iv);++$n) if ($x > $iv[$n][1]) dosomething();
Will HHVM perform a type check on $if[$n][2] on every loop iteration?
In my definition of $iv above, there is nothing that says the 2nd element of the inner array will be an integer.
How can I improve on this?
Can disabling the type checker be of any use? Does this only hide errors from the external type checker, or does it prevent HHVM from constantly doing type checks? (I'm thinking it's the first thing)
Perhaps if I could make my own user-defined type that would solve the problem?
<?hh
#I don't know what mechanisms for UDT's exist, so this code is made-up
CreateUDT foo = <string,int,int>;
$iv = ImmVector<foo> {
['uuid1',425,244],
['uuid2',658,836]
};
print_r($iv);
I found a reference to this at Hack Collections Literal Syntax Vector<Foo> unfortunately it might not be available to use yet.

I'm a software engineer at Facebook working on HHVM.
This entire question reeks of premature optimization to me. Have you done profiling and determined that loading this array is actually a bottleneck for your app? (Not just microbenchmarks, but how it actually affects the performance, latency, RPS, etc of realistic pageloads.) And also isolated from other effects, e.g., if this array is a cache or some sort of precomputed data, you need to isolate the win of precomputing the data from the actual time to load it by caching it in various different ways.
In general, HHVM is very good at dealing with arrays, since they are so hot in nearly every codepath -- and in particular at constant arrays like this one. To your questions about how to inform it of the shape and types of things in the arrays, HHVM can figure that all out for itself, and is very good at doing so on constant arrays composed entirely of constants. (And the ways it thinks about arrays aren't quite the ways you think about arrays, so it can probably do a better job anyway!) Basically, unless profiling says this is actually a hotspot -- which I'm pretty skeptical of -- I wouldn't worry too much about it. A couple general notes to be aware of:
Measure every performance diff. Don't prematurely optimize -- use profiling to guide. The developer productivity lost by premature optimizations getting in the way can be lethal.
Get things out of toplevel ("pseudomains") as much as possible. A function which returns a static or constant array should be just fine, and will in general help HHVM optimize code even better.
Avoid references as much as possible, especially in this array if you care about performance so much.
You probably should look into repo authoritiative mode which can help HHVM optimize lots of things even more -- but in particular for this case, the more aggressive inlining that repo auth mode can do might be a win.
Edit, aside:
because then the entire data structure must be parsed as a string and converted into a data structure in memory for every request. APC has the same problem as far as i know
This is exactly what I mean by premature optimization: you're rejecting APC without even trying it, even if it might be a cleaner way of doing what you want. It turns out that, in most cases, HHVM actually can optimize away the serialization/deserialization of storing arrays in APC, particularly if they are constant arrays that are never modified. As above, HHVM is very good at optimizing lots of common patterns. Just write code that's clean, profile it, and fix the hotspots.

Okay I've solved my first question.
I don't have any global scope issues. My require is being done from inside function main(), so it's as if the code from lookuptable####.php is being inserted into function main().
HHVM docs: "If the include occurs inside a function..."
Basically if you were to open lookuptable####.php it looks like the code is in global scope, but that's not the file that is being requested from hhvm. main.php is the one being requested, thus there is no code in global scope.
I think I've answered my 2nd question, it's currently at the bottom of my question. I'm not 100% convinced, but I'm pretty happy to move ahead and test it.

Lua tables: performance hit for starting array indexing at 0?

I'm porting FFT code from Java to Lua, and I'm starting to worry a bit about the fact that in Lua the array part of a table starts indexing at 1 while in Java array indexing starts at 0.
For the input array this causes no problem because the Java code is set up to handle the possibility that the data under consideration is not located at the start of the array. However, all of the working arrays internal to the code are assumed to starting indexing at 0. I know that the code will work as written -- Lua tables are awesome like that -- but I have no sense at all about the performance hit I might incur by having the "0" element of the array going into the hash table part of the underlying C structure (or indeed, if that is what will happen).
My question: is this something worth worrying about? Should I be planning to profile and hand-optimize the code? (The code will eventually be used to transform many relatively small (> 100 time points) signals of varying lengths not known in advance.)

I have made small, probably not that reliable, test:
local arr = {}
for i=0,10000000 do
arr[i] = i*2
end
for k, v in pairs(arr) do
arr[k] = v*v
end
And similar version with 1 as the first index. On my system:
$ time lua example0.lua
real 2.003s
$ time lua example1.lua
real 2.014s
I was also interested how table.insert will perform
for i=1,10000000 do
table.insert(arr, 2*i)
...
and, suprisingly
$ time lua example2.lua
real 6.012s
Results:
Of course, it depends on what system you're running it, probably also whic lua version, but it seems that it makes little to no difference between zero-start and one-start. Bigger difference is caused by the way you insert things to array.

I think the correct answer in this case is changing the algorithm so that everything is indexed with 1. And consider that part of the conversion.
Your FFT will be less surprising to another Lua user (like me), given that all "array-like" tables are indexed by one.
It might not be as stressful as you might think, given the way numeric loops are structured in Lua (where the "start" and the "end" are "inclusive"). You would be exchanging this:
for i=0,#array-1 do
... (do stuff with i)
end
By this:
for i=1,#array do
... (do stuff with i)
end
The non-numeric loops would remain unchanged (except that you will be able to use ipairs too, if you so desire).

How to count occurrences of a substring within string fast with Ruby

I have a text file sized 300MB, I want to count the occurrences of each 10,000 substrings in the file. I want to know how to do it fast.
Now, I use the following code:
content = IO.read("path/to/mytextfile")
Word.each do |w|
w.occurrence = content.scan(w.name).size
w.save
end
Word is an ActiveRecord class.
It took me almost 1 day to finish the counting. Is there anyway to do it faster? Thanks.
Edit1:
Thank you again. I am running rails 2.3.9. The name filed of words table contains what I am searching for, and it contains only unique values. Instead of using Word.each, I use batch(1000 rows a time) load. It should help.
I rewrited the whole code with the idea from bpaulon. Now it only took a few hours to finish the counting.
I profiled the new version code, now the largest time costing methods are utf8 encode supported string truncating code
def truncate(n)
self.slice(/\A.{0,#{n}}/m)
end
and characters counting code
def utf8_length
self.unpack('U*').size
end
Any other faster methods to replace them?

Your use of scan creates an array, counts the size of it, then throws it away. If you have a lot of occurrences of the substring inside a big file, you will create a big array temporarily, potentially burning up CPU time with memory management, but that should still run pretty quickly, even with 300MB.
Because Word is an ActiveRecord class, it is dependent on the schema and any indexes in your database, plus any issues your database server might be having. If the database is not optimized or is responding slowly or the query used to retrieve the data is not efficient, then the iteration will be slow. You might find it a lot faster to grab groups of Word so they are in RAM, then iterate over them.
And, if the database and your code are running on the same machine, you could be suffering from resource constraints like having only one drive, not enough RAM, etc.
Without knowing more about your environment and hardware it's hard to say.
EDIT:
I can grab the substrings into an array/hash first, then add the count results to the array or hash, and write the results back to database after all the counting is done. You think it be faster, right?
No, I doubt that will help a lot, and, without knowing where the problem lies all you might do is make the problem worse because you'll have to load 10,000 records as objects from the database, then build a 10,000 element hash or array which will also be in memory along with the DB records, then write them out.
Ruby will only use a single core, currently, but you can gain speed by using Ruby 1.9+. I'd recommend installing RVM and letting it manage your Ruby. Be sure to read the instructions on that page, then run rvm notes and follow those directions.
What is your Word model and the underlying schema and indexes look like? Is the database on the same machine?
EDIT: From looking at your table schema, you have no indexes except for id which really won't help much for normal look-ups. I'd recommend presenting your schema on Stack Overflow's sibling site https://dba.stackexchange.com/ and explain what you want to do. At a minimum I'd add a key to the text fields to help avoid full table scans for any searches you do.
What might help more is to read: Retrieving Multiple Objects in Batches from "Active Record Query Interface".
Also, look at the SQL being emitted when your Word.each is running. Is it something like "select * from word"? If so, Rails is pulling in 10,000 records to iterate over them one by one. If it is something like "select * from word where id=1" then for every record you have a database read followed by a write when you update the count. That is the scenario that the "Retrieving Multiple Objects in Batches" link will help fix.
Also, I am guessing that content is the text you are searching for, but I can't tell for sure. Is it possible you have duplicated text values causing you to do scans more than once for the same text? If so, select your records using a unique condition on that field and then update your counts for all matching records at one time.
Have you profiled your code to see if Ruby itself can help you pinpoint the problem? Modify your code a little to process 100 or 1000 records. Start the app with the -r profile flag. When the app exits profiler will output a table showing where time was spent.
What version of Rails are you running?

I think you could approach this problem differently
You do not need to scan the file this many times, you could create a db, like in mongo or mysql, and for each word you find, you fetch the db for it and then adds on some "counter" field.
You could ask me "but then I will have to scan my database a lot and it could take a lot more". Well, sure you wouldn't ask this, but it won't take more time because databases are focused in IO, besides you could always index it.
EDIT: There is no way to delimit at all?? Let's say that where you have the a Word.name string you really holds a (not simple) regex. Could the regex contain the \n? Well, if the regex can contain any value, you should estimate the maximum size of string the regex can fetch, double it, and scan the file by that ammount of chars but moving the cursor by that number.
Lets say your estimate of the maximum your regex could fetch it is like 20 chars nad your file has from 0 to 30000 chars. You pass each regex you have from 0 to 40 chars, then again from 20 to 60, from 40 to 80, etc...
You should also hold the position you found of your smaller regex so it wouldn't repeat it.
Finally, this solution seems to be not worth the effort, your problem may have a greater solution based on what that regexes are, but it will be faster than invoke scan Words.count times your your 300Mb string.

You could load your entire "Word" table into a Trie, then do back-tracking since you said there are no delimiters in the text.
So for each character in the text, go down the Trie of words. If you hit a word, increment its count. "Going down the trie" involves three cases:
There's no node at this character. (If you're mid-search, pop the back-tracking stack)
There's a node at this character. (But it's not a Word)
There's a node at this character. (It's a Word - increment and "dirty")
Back-tracking is just keeping track of places you want to go after you've exhausted this "search" of the Trie, which is when you run out of nodes to visit. This will probably be each character you visit that is a root of the Trie.
After you've done this, you can then visit all the nodes you changed and just update the records they represent.
This will take some time to implement, but will surely be faster than each & scan.

Are helpers really faster than partials? What about string building?

I've got a fancy-schmancy "worksheet" style view in a Rails app that is taking way too long to load. (In dev mode, and yes I know there's no caching there, "Completed in 57893ms (View: 54975, DB: 855)") The worksheet is rendered using helper methods, because I couldn't stand maintaining umpteen teeny little partials for the different sorts of rows in the worksheet. Now I'm wondering whether partials might actually be faster?
I've profiled the page load and identified a few cases where object caching will shave a few seconds off, but the profile output suggests that a large chunk of time is spent simply looping through the Worksheet model's constituent objects and appending the string output from the helper. Here's an example of what I'm talking about:
def header_row(wksht)
content_tag(:thead, :class => "ioe") do
content_tag(:tr) do
html_row = []
for i in (0...wksht.class::NUM_COLS) do
html_row << content_tag(:th, h(wksht.column_headings[i].upcase),
:class => wksht.column_classes[i])
end
html_row.join("\n")
end
end
end
OTOH using partials means opening files, spinning off the Ruby interpreter, and in the long run, aggregating a bunch of strings, right? So I'm wondering whether there is another way to speed things up in the helpers. Should I be using something like a stringstream (does that exist in Ruby?), should I get rid of content_tag calls in favor of my own "" string interpolation... I'm willing to write my own performance tests, and share the results, if you have any suggested alternatives to the approach I've already taken.
As it's a fairly complex view (and has an editable version as well), I'd rather not rewrite-and-profile the whole thing more than once. :)
Some related reading:
http://www.viget.com/extend/helpers-vs-partials-a-performance-question/ (old)
http://www.breakingpointsystems.com/community/blog/ruby-string-processing-overhead/
http://blog.purepistos.net/index.php/2008/07/14/benchmarking-ruby-string-interpolation-concatenation-and-appending/
#tadman:
There are row totals and column totals (and more columnar arithmetic), and since they're not all just totals, but also depend on other "magic numbers" from the database, I implemented them in the Ruby code rather than Javascript. (DRY and unit testable.) Javascript is used only in the edit view, and just to add/delete rows (client side only) and to fetch a sheet with fresh totals when the cell contents change. It fetches the whole table because nearly half of the values get updated when an input cell changes.
The worksheet and its rows are actually virtual models; they don't live in the DB, but rather aggregate a boatload of real AR objects. They get created every time a view renders (but that takes 1.7 secs in dev mode, so I'm not worried about it).
I suppose I could transmit a matrix of numbers, rather than marked-up content, and have JS unpack it into the sheet. But that gets unmaintainable fast.

I ended up reading an excellent article at http://www.infoq.com/articles/Rails-Performance ("A Look At Common Performance Problems In Rails"). Then I followed the author's suggestion to cache computations during request processing:
def estimated_costs
#estimated_costs ||=
begin
# tedious vector math
end
end
Because my worksheet does stuff like the above over and over, and then builds on those results to calculate some more rows, this resulted in a 90% speedup right off the bat. Should have been plain as day, but it started with just a few totals, then I showed the prototype to the customer, and it snowballed from there :)
I also wondered whether my array-based math might be inefficient, so I replaced the Ruby Arrays of numbers with NArray (http://narray.rubyforge.org/). The speedup was negligible but the code's cleaner, so it's staying that way.
Finally, I put some object caching in place. The "magic numbers" in the database only change a few times a year at most, and some of them are encrypted, but they need to be used in most of the calculations. That's low-hanging fruit ripe for caching, and it shaved off another 1.25 seconds.
I'll look at eager loading of associations next, as there's probably some time to save there, and I'll do a quick comparison of sending "just the data" vs sending the HTML, as #tadman suggested. About the only partial I can cache is the navigation sidebar. All of the other content depends on the request parameters.
Thanks for your suggestions!

Internally all partials are translated into a block of executable Ruby code and run through exactly the same runtime as any helper methods. Periodically you can see glimpses of this when a malformed template causes the generated code to fail to compile.
Although it stands to reason that helper methods are faster than partials, and a straightforward string interpolation is faster still, it's hard to say if the performance gain from this would make it worth pursuing. Rendering a very large number of partials can be a bottleneck in terms of logging in the development environment, but in a production environment their impact seems less severe.
The only way to figure this one out is to benchmark your pages using two different rendering methods.
As you point out, caching is where you get the big gains. Using Memcached to save large chunks of pre-rendered HTML content can give you exponentially faster load times. Rendering 10,000 rows into HTML will always be slower than retrieving the same snippet from the Rails.cache subsystem.
It's also the case that the content you don't render is always rendered the quickest, so anything you can do to reduce the amount of content you generate for each helper call will provide big gains. If you're building a large spread-sheet style app that's entirely dependent on JavaScript, you may find that bundling up the data as a JSON array and expanding it client-side is significantly faster than unrolling the HTML on the server and shipping it over that way.

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart