Below is a method which inserts records into the device's database. I am having a problem where I get a 'failed to allocate memory' error.
It is being run on a Windows Mobile device with quite limited memory.
There are 10 models; one is reasonably large, with 108,000 records.
The error occurs when executing the line f.readlines().each do |line|, but only after the largest model has already been inserted.
Is the memory not being released by the block that is iterating through the lines? Or is there something else happening?
Any help on this matter would be greatly appreciated!
def insertRecordsIntoRhom(models)
  updateAmount = 45 / models.length
  models.each_with_index do |model,i|
    csvColumns = []
    db = ::Rho::RHO.get_src_db(model)
    f = File.open("#{model}.csv")
    j = 0
    f.readlines().each do |line|
      #extract columns from header line of csv
      if j==0
        csvColumns = getCsvFieldFromHeader(line)
      else
        eval(models[i] + ".create(#{csvPutIntoHash(line,csvColumns)})")
      end
      j += 1
    end
    f.close
  end
end

IO#readlines returns an Array, i.e. it reads the whole file and returns a list of all the lines. No line can be garbage collected until you are completely done iterating that list.
Since you only need one line at a time, you should use IO#each_line instead. This will read only a little bit at a time and pass you lines one by one. Once you are done with a line, it can be garbage collected while the rest of the file is being processed.
Finally, note that Ruby comes bundled with a good CSV library; you probably want to use that, if you can, instead of rolling your own.
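As a minimal sketch of the streaming approach described above: the helper name count_data_rows, the column name and the sample file are made up for illustration. File.foreach yields one line at a time, and CSV.foreach from the standard library does the same for parsed rows, so neither builds an Array of all lines the way IO#readlines does.

```ruby
require 'csv'
require 'tempfile'

# Hypothetical helper: streams the file row by row instead of loading
# every line into an Array first.
def count_data_rows(path)
  count = 0
  # headers: true treats the first line as column names, so each row
  # can be read by column name.
  CSV.foreach(path, headers: true) do |row|
    count += 1 unless row['name'].to_s.empty?
  end
  count
end

# Made-up sample data, only for demonstration.
sample = Tempfile.new(['devices', '.csv'])
sample.write("id,name\n1,alpha\n2,beta\n3,gamma\n")
sample.close

# Plain line-by-line reading: each line can be collected once the
# block is done with it.
line_count = 0
File.foreach(sample.path) { |_line| line_count += 1 }

puts "#{line_count} lines, #{count_data_rows(sample.path)} data rows"
```

Whether this is enough on a very constrained device also depends on what is done with each row inside the block.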


Read data from csv file with foreach function

I have been reading data from a CSV file. To avoid the Rack timeout (12 seconds) on large files, I read only 25 rows per request: after 25 rows the method returns, and another request is made. This continues until all the rows have been read.
def read_csv(offset)
  r_count = 1
  CSV.foreach(file.tempfile, options) do |row|
    if r_count > offset.to_i
      # process the row
    end
    r_count += 1
  end
end
But this creates a new issue. Say the first request reads 25 rows; when the next request comes in with an offset of 25, it reads through the first 25 rows again before it starts processing at row 26. How can I skip the rows that have already been read? I tried an if with next to skip the iteration, but that fails. Or is there a more efficient way to do this?
def read_csv(fileName)
  lines = (`wc -l #{fileName}`).to_i + 1
  lines_processed = 0
  open(fileName) do |csv|
    csv.each_line do |line|
      # process the line here
      lines_processed += 1
    end
  end
end
Pure Ruby - SLOWER
def read_csv(fileName)
  lines = File.foreach(fileName).count
  lines_processed = 0
  open(fileName) do |csv|
    csv.each_line do |line|
      # process the line here
      lines_processed += 1
    end
  end
end
I ran a new benchmark comparing your original method provided and my own. I also included the test file information.
"File Information"
Lines: 1172319
Size: 126M
"django's original method"
Time: 18.58 secs
Memory: 0.45 MB
"OneNeptune's method"
Time: 0.58 secs
Memory: 2.18 MB
"Pure Ruby method"
Time: 0.96 secs
Memory: 2.06 MB
NOTE: I added a pure ruby method, since using wc is sort of cheating, and not portable. In most cases it's important to use pure language solutions.
You can use this method to process a very large CSV file.
I feel ~2 MB of memory is pretty reasonable considering the file size. It is a bit more memory than the original method used, but the time savings seem to be a fair trade, and this will prevent timeouts.
I did modify the method to take a fileName, but this was just because I was testing many different CSV files to make sure they all worked correctly. You can remove this if you'd like, but it'll likely be helpful.
I also removed the concept of an offset, since you stated you originally included it to try to optimize the parsing yourself, but this is no longer necessary.
Also, I keep track of how many lines are in the file and how many were processed, since you needed that information. Note that the wc-based line count only works on Unix-based systems; it is a trick to avoid loading the entire file into memory. It counts the newlines, and I add 1 to account for a last line that has no trailing newline. If you're not going to count the header as a line, though, you could remove the +1 and rename lines to "rows" to be more accurate.
Another logistical problem you may run into is the need to figure how to handle if the CSV file has headers.
You could use lazy reading to speed this up: the whole file would not be read, only the part from the beginning of the file up to the end of the chunk you use.
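A minimal sketch of lazy reading with an offset, assuming a CSV file with one header line and no quoted fields that span several lines (splitting by line would break those). The helper name read_chunk is made up for illustration. Lines before the offset are still read from disk, but they are skipped without being parsed, and reading stops as soon as the chunk is complete.

```ruby
require 'csv'
require 'tempfile'

# Hypothetical helper: returns `limit` parsed rows, starting after the
# first `offset` data rows.
def read_chunk(path, offset, limit)
  File.foreach(path)              # Enumerator over lines, nothing read yet
      .lazy
      .drop(1 + offset)           # skip the header plus rows already handled
      .first(limit)               # stop reading once the chunk is complete
      .map { |line| CSV.parse_line(line) }
end

# Made-up sample data, only for demonstration.
sample = Tempfile.new(['rows', '.csv'])
sample.write("id,name\n1,a\n2,b\n3,c\n4,d\n5,e\n")
sample.close

p read_chunk(sample.path, 2, 2)
```

Each request would pass the number of rows it has already processed as offset; the cost of skipping still grows with the offset, so this reduces parsing work, not disk reads.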
You could also use SmarterCSV to work in chunks like this.
SmarterCSV.process(file_path, {:chunk_size => 1000}) do |chunk|
  chunk.each do |row|
    # Do your processing
  end
end
The way I did this was by streaming the result to the user: if you can see what is happening, having to wait doesn't bother you as much. The timeout you mention won't happen here.
I'm not a Rails user, so I give an example from Sinatra; this can be done with Rails also.
require 'sinatra'

get '/' do
  stream :keep_open do |out|
    1.upto(100) do |line| # this would be your CSV file opened
      out << "processing line #{line}<br>"
      # process line
      sleep 1 # for simulating the delay
    end
  end
end
A still better but somewhat more complicated solution would be to use websockets: the browser would receive the results from the server once the processing is finished. You will also need some JavaScript in the client to handle this.

CSV Parse Taking Too Much Memory

I'm trying to read a 5-million-line file, and right now it is exceeding my allotted memory usage on Heroku. My method is somewhat fast, ~200 inserts/second. I believe it's crashing on the import, so my plan was to import in batches of 1,000 or 10,000. My question is: how do I tell that I'm at the end of my file? Ruby has an .eof method, but it's a File method, and I'm not sure how to call it in my loop.
def self.import_parts_db(file)
  time = Benchmark.measure do
    Part.transaction do
      parts_db = []
      CSV.parse(File.read(file), headers: true) do |row|
        row_hash = row.to_hash
        part = Part.new(
          part_num: row_hash["part_num"],
          description: row_hash["description"],
          manufacturer: row_hash["manufacturer"],
          model: row_hash["model"],
          cage_code: row_hash["cage_code"],
          nsn: row_hash["nsn"]
        )
        parts_db << part
      end
      Part.import parts_db
    end
  end
  puts time
end
1st problem
As soon as you use File.read with a huge file, your script will use a lot of memory (possibly too much). You read the entire file into one huge string, even though CSV then parses it line by line.
It might work fine with files of a few thousand rows. Still, you should use CSV.foreach. Change
CSV.parse(File.read(file), headers: true) do |row|
to
CSV.foreach(file, headers: true) do |row|
In this example, memory usage went from 1GB to 0.5MB.
2nd problem
parts_db becomes a huge Array of Parts, which keeps growing until the very end of the CSV file.
You need to either remove the transaction (import will be slow but won't require more memory than for 1 row) or process the CSV in batches.
Here's one way to do it. We use CSV.parse again, but only on batches of 2000 lines:
def self.import_parts_db(filename)
  time = Benchmark.measure do
    File.open(filename) do |file|
      headers = file.first
      file.lazy.each_slice(2000) do |lines|
        Part.transaction do
          rows = CSV.parse(lines.join, write_headers: true, headers: headers)
          parts_db = rows.map do |row|
            Part.new(
              part_num: row['part_num'],
              description: row['description'],
              manufacturer: row['manufacturer'],
              model: row['model'],
              cage_code: row['cage_code'],
              nsn: row['nsn']
            )
          end
          Part.import parts_db
        end
      end
    end
  end
  puts time
end
3rd problem?
The previous answer shouldn't use much memory, but it still could take a long time to import everything, possibly too much for a remote server.
The advantage of using an Enumerator is that it's easy to skip batches, and get just the ones you want.
Let's say your import takes too long, and stops for some reason after 424000 successful imports.
You could replace:
file.lazy.each_slice(2000) do |lines|
with:
file.lazy.drop(424_000).take(300_000).each_slice(2000) do |lines|
to skip the first 424000 CSV lines and parse the next 300000.
For the next import, use:
file.lazy.drop(424_000+300_000).take(300_000).each_slice(2000) do |lines|
and then:
file.lazy.drop(424_000+2*300_000).take(300_000).each_slice(2000) do |lines|
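To see which items a drop/take/each_slice chain selects, here is a small sketch on a plain Range standing in for the lines of a file. The helper name batches and the numbers are made up for illustration; any Enumerable, including an open File, should behave the same way.

```ruby
# Hypothetical helper: skip `skip` items, keep the next `count`,
# and group them into batches of `batch_size`.
def batches(enum, skip, count, batch_size)
  enum.lazy.drop(skip).take(count).each_slice(batch_size).to_a
end

# Stand-in for 20 lines of a file.
result = batches(1..20, 5, 10, 4)
p result
```

The last batch may be shorter than batch_size, so the code handling each batch should not assume a fixed length.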
CSV.parse is pretty efficient, passing one parsed CSV line to the block that does the processing.
The problem doesn't come from the CSV parser, it comes from building the parts_db array in memory. I suggest rewriting the Part.import method to import the data line by line, instead of an entire array of records at once.
Try a different CSV. I had one that was about 30 MB and used up my remaining 8 GB of RAM; re-saving the file seemed to correct the issue.

How to read a block of a file in Rails without reading it again from the beginning

I have a growing file (log) that I need to read by blocks.
I make a call by Ajax to get a specified number of lines.
I used File.foreach to read the lines I want, but it always reads from the beginning, and I need to return only the lines I want, directly.
Example Pseudocode:
#First call: and return 0 to 10 lines
#Second call: and return 11 to 20 lines
#Third call: and return 21 to 30 lines
#And so on...
Is there any way to do this?
Solution 1: Reading the whole file
The solution proposed here is not efficient in your case, because it requires you to read all the lines from the file for each AJAX request, even if you just need the last 10 lines of the logfile.
That's an enormous waste of time: each request re-reads everything that comes before its block, so processing the whole logfile in blocks of size N takes time that grows quadratically with the file length, not linearly.
Solution 2: Seeking
Since your AJAX calls request sequential lines, we can implement a much more efficient approach by seeking to the correct position before reading, using IO#seek and IO#pos.
This requires you to return some extra data (the last file position) back to the AJAX client at the same time you return the requested lines.
The AJAX request then becomes a function call of the form request_lines(position, line_count), which enables the server to seek to that position before reading the requested count of lines.
Here's the pseudocode for the solution:
Client code:
pos = 0
loop {
  data = server.request_lines(pos, LINE_COUNT)
  pos = data.pos
  break if pos == -1 # Reached end of file
}
Server code:
def request_lines(pos, line_count)
  file = File.open('logfile')
  # Seek to requested position
  file.seek(pos)
  # Read the requested count of lines while checking for EOF
  lines = line_count.times.map { file.readline if !file.eof? }.compact
  # Mark pos with -1 if we reached EOF during reading
  pos = file.eof? ? -1 : file.pos
  file.close
  # Return data
  data = { lines: lines, pos: pos }
end
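The seek-based exchange can be tried locally with the following sketch, which simulates the client loop against a made-up temporary log file. The extra path parameter and the sample data are assumptions added for the demonstration; they are not part of the pseudocode above.

```ruby
require 'tempfile'

# Hypothetical server-side helper: reads up to line_count lines starting
# at byte offset pos and reports where the next read should start.
def request_lines(path, pos, line_count)
  File.open(path) do |file|
    file.seek(pos)                # jump straight to the saved position
    lines = []
    line_count.times do
      break if file.eof?
      lines << file.readline.chomp
    end
    # -1 signals that the end of the file was reached
    { lines: lines, pos: file.eof? ? -1 : file.pos }
  end
end

# Made-up log file, only for demonstration.
log = Tempfile.new('log')
log.write((1..5).map { |n| "entry #{n}\n" }.join)
log.close

# Simulated client loop: ask for two lines at a time until EOF.
pos = 0
collected = []
until pos == -1
  data = request_lines(log.path, pos, 2)
  collected.concat(data[:lines])
  pos = data[:pos]
end
p collected
```

For a log that keeps growing, the client would probably want to remember the last real offset instead of -1, so that it can resume once new lines arrive.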

Newbie rails logic & syntax questions

I am learning rails and have come across the following code which I would like to use. The code in question is the answer by John F Miller (first answer) in the following link:
How to render all records from a nested set into a real html tree
def tree_from_set(set) #set must be in order
  buf = START_TAG(set[0])
  stack = []
  stack.push set[0]
  set[1..-1].each do |node|
    if stack.last.lft < node.lft && node.lft < stack.last.rgt
      if node.leaf? #(node.rgt - node.lft == 1)
        buf << NODE_TAG(node)
      else
        buf << START_TAG(node)
        stack.push(node)
      end
    else
      buf << END_TAG
      stack.pop
      retry
    end
  end
  buf << END_TAG
end
def START_TAG(node) #for example
def NODE_TAG(node)
I am unsure of the following and would appreciate any guidance.
I see this will cycle through "set", assigning each item to the object "node"; however, I cannot determine what [1..-1] does.
set[1..-1].each do |node|
Following the logic, I cannot understand the purpose of removing the last item from the array "stack".
It appears that retry in this context is no longer supported in Ruby after 1.9. I believe the intention was to return to the start of the loop and repeat.
A "subarray" with all but zeroth element.
Negative array indices -x in Ruby are shorthand for length-x. That is, the -1st element is the last. The range 1..-1 is "first to last", but since arrays are zero-indexed in Ruby, that means "all but the zeroth element".
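A short illustration of the indexing rules just described; the array contents and the helper name all_but_first are made up for the example.

```ruby
# Hypothetical helper wrapping the expression from the question.
def all_but_first(list)
  list[1..-1]
end

set = %w[root child_a child_b child_c]

p set[-1]              # last element, same as set[set.length - 1]
p all_but_first(set)   # everything from index 1 through the last index
p set[0..-2]           # everything except the last element
```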
The stack holds "how deep you are", or more precisely, which elements you are currently in. When examining the next element, if you "went out", you should close the list you left (possibly multiple times!) before adding the current item.
As for retry: I think it should be redo. If you stepped outside, you have to make sure you close every list you need to: once per iteration you pop from the stack, close the closest list, and loop until you are inside the top element on the stack, which is the context in which the current node should be inserted.
Actually, this code assumes you only have one tree, with set[0] as its root. By adding to line 6 a check (with ||) for whether the stack is empty, you eliminate this flaw, along with the need to push set[0] manually and thus to exclude it from the loop. If the stack is empty, you are in a hyperspace that contains everything, so you shouldn't bother comparing anything. This makes it possible to render multiple trees (possibly without a common root) from one list.
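The loop described above (close lists until the current node fits inside the top of the stack, with an empty-stack check) can be sketched as follows. This is an illustration under assumptions, not the original author's code: the Node struct, the HTML tags and the use of an until loop in place of retry or redo are all made up for the example.

```ruby
# Stand-in for a nested-set record; lft and rgt follow the usual
# nested-set numbering.
Node = Struct.new(:name, :lft, :rgt) do
  def leaf?
    rgt - lft == 1
  end
end

def tree_from_set(set) # set must be ordered by lft
  buf = +"<ul>"
  stack = []
  set.each do |node|
    # Close every list we have stepped out of. An empty stack means we
    # are at the top level, so there is nothing to compare against.
    until stack.empty? ||
          (stack.last.lft < node.lft && node.lft < stack.last.rgt)
      buf << "</ul></li>"
      stack.pop
    end
    if node.leaf?
      buf << "<li>#{node.name}</li>"
    else
      buf << "<li>#{node.name}<ul>"
      stack.push(node)
    end
  end
  # Close whatever is still open at the end of the set.
  stack.length.times { buf << "</ul></li>" }
  buf << "</ul>"
end

set = [
  Node.new("root", 1, 8),
  Node.new("a", 2, 5),
  Node.new("a1", 3, 4),
  Node.new("b", 6, 7)
]
puts tree_from_set(set)
```

Because the stack starts empty, the same loop handles a list that holds several separate trees.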
I believe the clean solution to this is a recursive one, replacing "home-made stack" with Ruby's call stack. I can't come up with a solution too quickly on this though.

Why is this RegExp taking 16 minutes to process on Rails?

I've written a function to remove email addresses from my data using gsub. The code is below. The problem is that it takes a total of 27 minutes to execute the function on a set of 10,000 records. (16 minutes for the first pattern, 11 minutes for the second). Elsewhere in the code I process about 20 other RegExp's using a similar flow (iterating through data.each) and they all finish in less than a second. (BTW, I recognize that my RegExp's aren't perfect and may catch some strings that aren't email addresses.)
Is there something about these two RegExp's that is causing the processing time to be so high? I've tried it on seven different data sources all with the same result, so the problem isn't peculiar to my data set.
def remove_email_addresses!(data)
  email_patterns = [
    /[[:graph:]]+ +at +[^ ][ [[:graph:]]]{0,40} +dot +com/i
  ]
  data.each do |row|
    email_patterns.each do |pattern|
      row[:title].gsub!(pattern,"") unless row[:title].blank?
      row[:description].gsub!(pattern,"") unless row[:description].blank?
    end
  end
end
Check that your faster code isn't just doing var =~ /blah/ matching, rather than replacement: that is several orders of magnitude faster.
In addition to reducing backtracking and replacing + and * with bounded ranges for safety, as follows...
email_patterns = [
  /\b[-_.\w]{1,128} {1,10}at {1,10}[^ ][-_.\w ]{0,40} {1,10}dot {1,10}com/i
]
... you could also try "unrolling your loop", though the loop itself is unlikely to be the cause unless there is some kind of interaction between the iterators (which there shouldn't be, but...). That is:
data.each do |row|
  row[:title].gsub!(email_patterns[0],"") unless row[:title].blank?
  row[:description].gsub!(email_patterns[0],"") unless row[:description].blank?
  row[:title].gsub!(email_patterns[1],"") unless row[:title].blank?
  row[:description].gsub!(email_patterns[1],"") unless row[:description].blank?
end
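The bounded pattern quoted above can be tried on a sample string with the sketch below. The constant name, the helper scrub and the sample text are made up for illustration; the sketch only shows what the pattern removes and says nothing about its speed on real data.

```ruby
# The bounded pattern from the answer, stored in a constant.
BOUNDED_EMAIL = /\b[-_.\w]{1,128} {1,10}at {1,10}[^ ][-_.\w ]{0,40} {1,10}dot {1,10}com/i

# Hypothetical helper: returns a copy of text with spelled-out
# addresses of the form "name at host dot com" removed.
def scrub(text)
  text.gsub(BOUNDED_EMAIL, "")
end

p scrub("contact john at example dot com now")
p scrub("no address here")
```

Note that the text on either side of a removed address keeps its own spaces, so a follow-up squeeze of repeated spaces may be wanted.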
Finally, if this causes little to no speedup, consider profiling with something like ruby-prof to find out whether the regexes themselves are the issue, or whether there's a problem in the do iterator or the unless clauses instead.
Could it be that the data is large enough that it causes issues with paging once read in? If so, might it be faster to read the data in and parse it in chunks of N entries, rather than process the whole lot at once?
