Ruby CSV.foreach start at specific row - ruby-on-rails

I've seen a couple of posts for this with no real answers or out-of-date answers, so I'm wondering if there are any new solutions. I have an enormous CSV I need to read in. I can't call open() on it because it kills my server. I have no choice but to use .foreach().
Doing it this way, my script will take 6 days to run. I want to see if I can cut that down by using Threads and splitting the task in two or four. So one thread reads lines 1-n and one thread simultaneously will read lines n+1-end.
So I need to be able to only read in the last half of the file in one thread (and later if I split it into more threads, just a specific line through a specific line).
Is there any way in Ruby to do this? Can this start at a certain row?
CSV.foreach(FULL_FACT_SHEET_CSV_PATH) do |trial|
EDIT:
Just to give an idea of what one of my threads looks like:
threads << Thread.new {
CSV.open('matches_thread3.csv', 'wb') do |output_csv|
output_csv << HEADER
count = 1
index = 0
CSV.foreach(CSV_PATH) do |trial|
index += 1
if index > 120000
break if index > 180000
#do stuff
end
end
end
}
But as you can see, it has to iterate the file until it gets to record 120,000 before it starts. So the goal would be to eliminate reading all of the rows before row 120,000 by starting to read at row 120,000.

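If this is still relevant, you can do something like this by chaining .with_index after CSV.foreach. Note that it still reads every earlier row; it only skips processing them: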
rows_array = []
CSV.foreach(path).with_index do |row, i|
next if i == 0 #skip first row
rows_array << columns.map { |n| row[n] }
end

"But as you can see, it has to iterate the file until it gets to record 120,000 before it starts. So the goal would be to eliminate reading all of the rows before row 120,000 by starting to read at row 120,000."
Impossible. Content of a CSV file is just a blob of text, with some commas and newlines. You can't know at which offset in the file row N starts without knowing where row N-1 ends. And to know this, you have to know where row N-1 starts (see recursion?) and read the file until you see where it ends (encounter a newline that is not part of field value).
The exception to this is if all your rows are of fixed size, in which case you can seek directly to offset 120_000 * row_size. I have yet to see a file like this, though.
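The fixed-size exception can be sketched as follows. This is a minimal illustration, not something taken from the question: the helper name read_fixed_width_rows and its parameters are hypothetical, and it assumes every row occupies exactly row_size bytes, newline included.

```ruby
# Hypothetical helper: only valid when EVERY row is exactly row_size bytes
# long (including the trailing newline). Real CSV files rarely are.
def read_fixed_width_rows(path, row_size, first_row, row_count)
  File.open(path, 'rb') do |file|
    # Jump straight to the byte where row number first_row (0-based) starts.
    file.seek(first_row * row_size)
    rows = []
    row_count.times do
      raw = file.read(row_size)
      break if raw.nil? # reached the end of the file
      rows << raw.chomp
    end
    rows
  end
end
```

Each returned string is one raw row, which could then be handed to CSV.parse_line. With variable-length rows this approach lands in the middle of a record, which is why the answer calls the general case impossible.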

As I understand your question, something like this may help:
require 'csv'
csv_file = "matches_thread3.csv"
# define one Constant Chunk Size for Jobs
CHUNK_SIZE = 120000
# split - by splitting (\n) will generate an array of CSV records
# each_slice - will create array of records of CHUNK_SIZE defined
File.read(csv_file).split("\n").drop(1).each_slice(CHUNK_SIZE).with_index do |chunk, index|
data = []
# each chunk is one job of up to 120000 records
chunk.each do |row|
data << row
##do stuff
end
end
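Note that File.read loads the whole file into memory, which the question says is not an option. A streaming variant of the same idea might look like the sketch below. The name each_line_chunk is hypothetical, and like the answer above it assumes no field contains an embedded newline.

```ruby
# Hypothetical helper: yields arrays of up to chunk_size raw lines, reading
# the file one line at a time instead of loading it all with File.read.
# Assumption: no quoted field contains a newline.
def each_line_chunk(path, chunk_size)
  File.open(path) do |file|
    file.gets # discard the header line
    file.each_line(chomp: true).each_slice(chunk_size).with_index do |chunk, index|
      yield chunk, index
    end
  end
end
```

This still reads the file sequentially from the start, so it reduces memory use but does not let a thread jump to row 120,000.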

Related

Ruby/ Rails split array into N groups with remainder being added to last group.

I am attempting to make a batch process which will take a parameter that specifies the number of background workers, and split a collection into that many arrays. For example if
def split_for_batch(number_of_workers)
<code>
end
array = [1,2,3,4,5,6,7,8,9,10]
array.split_for_batch(3)
=> [[1,2,3],[4,5,6],[7,8,9,10]]
the thing is that I don't want to have to load all of the users into memory at once because it is a batch. What I have now is
def initialize_audit_run_threads
total_users = tax_audit_run_users.count
partition_size = (total_users / thread_count).round
tax_audit_run_users.in_groups_of(partition_size).each do |group|
thread = TaxAuditRunThread.create(:tax_audit_run_id => id, :status_code => 1)
group.each do |user|
if user
user.tax_audit_run_thread_id = thread.id
user.save
end
end
end
where the thread_count is an attribute of the class that determines the number of background workers. Currently this code will create 4 threads rather than 3. I have also tried using find_in_batches but I am having the same problem where if I have 10 tax_audit_run_users in the array I have no way to let the last worker know to process the last record. Is there a way in ruby or rails to divide a collection into n parts and have the last part include the stragglers?
How to split (chunk) a Ruby array into parts of X elements?
You will of course need to modify it a bit to add the last chunk if it's less than the chunk size, or not, up to you.
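One way to fold the leftover chunk into the last group is sketched below. It is only an illustration for an in-memory array: split_for_batch here takes the array as an explicit argument (an assumption, unlike the method call in the question), and it does not address loading records lazily from the database.

```ruby
# Sketch: split items into number_of_workers groups, with any remainder
# appended to the last group. Works on an in-memory array only.
def split_for_batch(items, number_of_workers)
  base = items.size / number_of_workers
  # Fewer items than workers: hand everything to a single group.
  return [items.dup] if base.zero?

  slices = items.each_slice(base).to_a
  head = slices.first(number_of_workers - 1)
  # Merge every remaining slice (the last full one plus the stragglers).
  tail = slices.drop(number_of_workers - 1).flatten(1)
  head + [tail]
end
```

For the array in the question, split_for_batch([1,2,3,4,5,6,7,8,9,10], 3) yields three groups with the tenth element in the last one.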

Yaml , counting lines including blank lines

I am parsing a YAML file and searching for specific values. After the search matches, I want to get the line number and print it. I managed to do exactly that, but the major problem is that while parsing the YAML file using YAML.load, the blank lines are ignored.
I can count the rest of the lines using keys, i.e. 1 key per line, but I am unable to count blank lines. Please help, I've been stuck with this for a few days now.
this is how my code looks like:
hash = YAML.load(IO.read(File.join(File.dirname(__FILE__), 'en.yml')))
def recursive_hash_to_yml_string(input, hash, depth = 0)
hash.keys.each do |search|
@count = @count + 1
if hash[search].is_a?(String) && hash[search] == input
@yml_array.push(@count)
elsif hash[search].is_a?(Hash)
recursive_hash_to_yml_string(input, hash[search], depth + 1)
end
end
end
I agree with @Wukerplank - parsing a file should ignore blank lines. You might want to think about finding the line number using a different approach.
Perhaps you don't need to parse the YAML at all. If you are just searching the file for some matching text and returning the line number, maybe you'd manage better reading each line of the file using File.each_line.
You could iterate over each line in the file until you found a match and then do something with the line number.
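A minimal sketch of that line-by-line search, assuming a plain substring match is enough (the helper name line_numbers_matching is hypothetical):

```ruby
# Sketch: return the 1-based line numbers of every line containing text.
# Blank lines are counted because the file is read as raw text, not parsed.
def line_numbers_matching(path, text)
  matches = []
  File.foreach(path).with_index(1) do |line, number|
    matches << number if line.include?(text)
  end
  matches
end
```

Because this never goes through the YAML parser, it cannot tell a key from a value; it only reports where the text appears.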

large data in array

I am developing a Rails app. I would like to use an array to hold 2,000,000 records, then insert the data into the database like the following:
large_data = Get_data_Method() #get 2,000,000 raw data
all_values = Array.new
large_data.each{ |data|
all_values << data[1] #e.g. data[1] has the format "(2,'john','2002-09-12')"
}
sql="INSERT INTO cars (id,name,date) VALUES "+all_values.join(',')
ActiveRecord::Base.connection.execute(sql)
When I run the code, it takes a very long time at the point of large_data.each{...}. I am actually still waiting for it to finish (it has been running for 1 hour already and has still not finished the large_data.each{...} part).
Is it because the number of elements is too large and a Ruby array cannot hold 2,000,000 elements? Or can a Ruby array hold that many elements, and is it reasonable to wait this long?
I would like to use bulk insertion in SQL to speed up inserting the data into the MySQL database, so I would like to use only one INSERT INTO statement; that's why I did the above. If this is a bad design, can you recommend a better way?
Some notes:
Don't use the pattern "empty array + each + push", use Enumerable#map.
all_values = large_data.map { |data| data[1] }
Is it possible to write get_data to return items lazily? If the answer is yes, check enumerators and use them to do batched inserts into the database instead of putting all objects in at once. Something like this:
def get_data
Enumerator.new do |yielder|
yielder.yield some_item
yielder.yield another_item
# yield all items.
end
end
get_data.each_slice(1000) do |data|
# insert those 1000 elements into the database
end
That said, there're projects for doing efficient bulk insertions, check ar-extensions and activerecord-import for Rails >= 3.
An array of 2m items is never going to be the easiest thing to manage. Have you taken a look at MongoDB? It is a database which can be accessed just like an array and could be the answer to your issues.
An easy fix would be to split your inserts into blocks of 1000, that would make the whole process more manageable.
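Splitting into blocks of 1000 could look like the sketch below. It only builds the SQL strings; the helper name batched_insert_statements is hypothetical, and the table and column names are copied from the question. It assumes each tuple is already a safely escaped string such as "(2,'john','2002-09-12')".

```ruby
# Sketch: build one INSERT statement per batch of pre-formatted value tuples.
# Assumption: every tuple is already escaped, e.g. "(2,'john','2002-09-12')".
def batched_insert_statements(value_tuples, batch_size = 1000)
  value_tuples.each_slice(batch_size).map do |batch|
    "INSERT INTO cars (id,name,date) VALUES " + batch.join(',')
  end
end
```

Each resulting statement could then be passed to ActiveRecord::Base.connection.execute, as in the question, so no single statement carries all 2,000,000 rows.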

Finding mongoDB records in batches (using mongoid ruby adapter)

Using rails 3 and mongoDB with the mongoid adapter, how can I batch finds to the mongo DB? I need to grab all the records in a particular mongo DB collection and index them in solr (initial index of data for searching).
The problem I'm having is that doing Model.all grabs all the records and stores them into memory. Then when I process over them and index in solr, my memory gets eaten up and the process dies.
What I'm trying to do is batch the find in mongo so that I can iterate over 1,000 records at a time, pass them to solr to index, and then process the next 1,000, etc...
The code I currently have does this:
Model.all.each do |r|
Sunspot.index(r)
end
For a collection that has about 1.5 million records, this eats up 8+ GB of memory and kills the process. In ActiveRecord, there is a find_in_batches method that allows me to chunk up the queries into manageable batches that keeps the memory from getting out of control. However, I can't seem to find anything like this for mongoDB/mongoid.
I would LIKE to be able to do something like this:
Model.all.in_batches_of(1000) do |batch|
Sunspot.index(batch)
end
That would alleviate my memory problems and query difficulties by only doing a manageable problem set each time. The documentation is sparse, however, on doing batch finds in mongoDB. I see lots of documentation on doing batch inserts but not batch finds.
With Mongoid, you don't need to manually batch the query.
In Mongoid, Model.all returns a Mongoid::Criteria instance. Upon calling #each on this Criteria, a Mongo driver cursor is instantiated and used to iterate over the records. This underlying Mongo driver cursor already batches all records. By default the batch_size is 100.
For more information on this topic, read this comment from the Mongoid author and maintainer.
In summary, you can just do this:
Model.all.each do |r|
Sunspot.index(r)
end
If you are iterating over a collection where each record requires a lot of processing (e.g. querying an external API for each item), it is possible for the cursor to time out. In this case you need to perform multiple queries in order to not leave the cursor open.
require 'mongoid'
module Mongoid
class Criteria
def in_batches_of(count = 100)
Enumerator.new do |y|
total = 0
loop do
batch = 0
self.limit(count).skip(total).each do |item|
total += 1
batch += 1
y << item
end
break if batch == 0
end
end
end
end
end
The code above is a helper method that adds the batching functionality. It can be used like so:
Post.all.order_by(:id => 1).in_batches_of(7).each_with_index do |post, index|
# call external slow API
end
Just make sure you ALWAYS have an order_by on your query. Otherwise the paging might not do what you want it to. Also I would stick with batches of 100 or less. As said in the accepted answer Mongoid queries in batches of 100 so you never want to leave the cursor open while doing the processing.
It is faster to send batches to sunspot as well.
This is how I do it:
records = []
Model.batch_size(1000).no_timeout.only(:your_text_field, :_id).all.each do |r|
records << r
if records.size >= 1000
Sunspot.index! records
records.clear
end
end
Sunspot.index! records
no_timeout: prevents the cursor from disconnecting (after 10 min, by default)
only: selects only the id and the fields, which are actually indexed
batch_size: fetch 1000 entries instead of 100
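The buffer-and-flush loop above can also be written with Enumerable#each_slice. The sketch below is generic Ruby, not Mongoid-specific: index_in_batches is a hypothetical name, and it assumes the query object responds to each_slice and that the block is where a call such as Sunspot.index! would go.

```ruby
# Sketch: hand records to the block in groups of batch_size.
# Assumption: records responds to #each_slice (any Enumerable does).
def index_in_batches(records, batch_size = 1000)
  records.each_slice(batch_size) do |batch|
    yield batch
  end
end
```

each_slice pulls items from the underlying enumeration as it goes, so only one batch is held by this helper at a time; the final, smaller batch is yielded automatically.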
I am not sure about the batch processing, but you can do this way
current_page = 0
item_count = Model.count
while item_count > 0
Model.all.skip(current_page * 1000).limit(1000).each do |item|
Sunspot.index(item)
end
item_count-=1000
current_page+=1
end
But if you are looking for a good long-term solution, I wouldn't recommend this. Let me explain how I handled the same scenario in my app. Instead of doing batch jobs, I created a Resque job which updates the Solr index:
class SolrUpdator
@queue = :solr_updator
def self.perform(item_id)
item = Model.find(item_id)
#i have used RSolr, u can change the below code to handle sunspot
solr = RSolr.connect :url => Rails.application.config.solr_path
js = JSON.parse(item.to_json)
solr.add js
end
end
After adding the item, I just put an entry on the Resque queue:
Resque.enqueue(SolrUpdator, item.id.to_s)
That's all. Start Resque and it will take care of everything.
As @RyanMcGeary said, you don't need to worry about batching the query. However, indexing objects one at a time is much, much slower than batching them.
Model.all.to_a.in_groups_of(1000, false) do |records|
Sunspot.index! records
end
The following should work for you, just try it:
Model.all.in_groups_of(1000, false) do |r|
Sunspot.index! r
end

searching within an already retrieved mysql result

I'm trying to limit the number of times I do a mysql query, as this could end up being 2k+ queries just to accomplish a fairly small result.
I'm going through a CSV file, and I need to check that the format of the content in the csv matches the format the db expects, and sometimes I try to accomplish some basic clean-up (for example, I have one field that is a string, but is sometimes in the csv as jb2003-343, and I need to strip out the -343).
The first thing I do is get from the database the list of fields by name that I need to retrieve from the csv, then I get the index of those columns in the csv, then I go through each line in the csv and get each of the indexed columns
get_fields = BaseField.find_by_group(:all, :conditions=>['group IN (?)',params[:group_ids]])
csv = CSV.read(csv.path)
first_line=csv.first
first_line.split(',')
csv.each_with_index do |row|
if row==0
col_indexes=[]
csv_data=[]
get_fields.each do |col|
col_indexes << row.index(col.name)
end
else
csv_row=[]
col_indexes.each do |col|
#possibly check the value here against another mysql query but that's ugly
csv_row << row[col]
end
csv_data << csv_row
end
end
The problem is that when I'm adding the content of the csv_data for output, I no longer have any connection to the original get_fields query. Therefore, I can't seem to say 'does this match the type of data expected from the db'.
I could work my way back through the same process that got me down to that level, and make another query like this
get_cleanup = BaseField.find_by_csv_col_name(first_line[col])
if get_cleanup.format==row[col].is_a
csv_row << row[col]
else
# do some data clean-up
end
but as I mentioned, that could mean the get_cleanup is run 2000+ times.
Instead of doing this, is there a way to search within the original get_fields result for the name, and then get the associated field?
I tried searching for 'search rails object', but kept getting back results about building search, not searching within an already existing object.
I know I can do array.search, but don't see anything in the object api about search.
Note: The code above may not be perfect, because I'm not running it yet, just wrote that off the top of my head, but hopefully it gives you the idea of what I'm going for.
When you populate your col_indexes array, rather than storing a single value, you can store a hash which includes index and the datatype.
get_fields.each do |col|
col_info = {:row_index => row.index(col.name), :name => col.name, :format => col.format}
col_indexes << col_info
end
You can then access all of this data inside the loop that processes each row.
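A variation on the same idea is to key the column information by name, so a field can be looked up in memory without another query. This is a sketch under stated assumptions: Field is a stand-in Struct for the BaseField records, and build_column_info is a hypothetical helper name.

```ruby
# Stand-in for the BaseField records returned by the original query.
Field = Struct.new(:name, :format)

# Sketch: map each field name to its CSV column index and expected format,
# using only the records that were already fetched.
def build_column_info(fields, header_row)
  fields.each_with_object({}) do |field, info|
    info[field.name] = {
      :row_index => header_row.index(field.name),
      :format => field.format
    }
  end
end
```

With this hash built once from the header row, each later row can check info[name][:format] directly, so the per-cell lookup never touches the database.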
