large data in array - ruby-on-rails

I am developing a Rails app. I would like to use an array to hold 2,000,000 data, then insert the data into database like following:
large_data = Get_data_Method() #get 2,000,000 raw data
all_values = Array.new
large_data.each{ |data|
all_values << data[1] #e.g. data[1] has the format "(2,'john','2002-09-12')"
}
sql="INSERT INTO cars (id,name,date) VALUES "+all_values.join(',')
ActiveRecord::Base.connection.execute(sql)
When I run the code, it takes a long long time at the point of large_data.each{...} . Actually I am now still waiting for it to finish(it has been running for 1 hour already still not finish the large_data.each{...} part).
Is it because of the number of elements is too large for the ruby array that the array can not hold 2,000,000 elements ? or ruby array can hold that much elements and it is reasonable to wait this long?
Since I would like to use bulk insertion in SQL to speed up the large data insertion time in mysql database, so I would like to use only one INSERT INTO statement, that's why I did the above thing. If this is a bad design, can you recommand me a better way?

Some notes:
Don't use the pattern "empty array + each + push", use Enumerable#map.
all_values = large_data.map { |data| data[1] }
Is it possible to write get_data to return items lazily? if the answer is yes, check enumerators and use them to do batched inserts into the database instead of puting all objects at once. Something like this:
def get_data
Enumerator.new do |yielder|
yielder.yield some_item
yielder.yield another_item
# yield all items.
end
end
get_data.each_slice(1000) do |data|
# insert those 1000 elements into the database
end
That said, there're projects for doing efficient bulk insertions, check ar-extensions and activerecord-import for Rails >= 3.

An array of 2m items is never going to be the easyist thing to manage, have you taken a look at MongoDB, this is a database which can be accessed just like an array and could be the answer to your issues.
An easy fix would be to split your inserts into blocks of 1000, that would make the whole process more manageable.

Related

How to iterate over an ActiveRecord resultset in one line with nil check in Ruby

I have an array of Active Record result and I want to iterate over each record to get a specific attribute and add all of them in one line with a nil check. Here is what I got so far
def total_cost(cost_rec)
total= 0.0
unless cost_rec.nil?
cost_rec.each { |c| total += c.cost }
end
total
end
Is there an elegant way to do the same thing in one line?
You could combine safe-navigation (to "hide" the nil check), summation inside the database (to avoid pulling a bunch of data out of the database that you don't need), and a #to_f call to hide the final nil check:
cost_rec&.sum(:cost).to_f
If the cost is an integer, then:
cost_rec&.sum(:cost).to_i
and if cost is a numeric inside the database and you don't want to worry about precision issues:
cost_rec&.sum(:cost).to_d
If cost_rec is an array rather than a relation (i.e. you've already pulled all the data out of the database), then one of:
cost_rec&.sum(&:cost).to_f
cost_rec&.sum(&:cost).to_i
cost_rec&.sum(&:cost).to_d
depending on what type cost is.
You could also use Kernel#Array to ignore nils (since Array(nil) is []) and ignore the difference between arrays and ActiveRecord relations (since #Array calls #to_ary and relations respond to that) and say:
Array(cost_rec).sum(&:cost)
that'll even allow cost_rec to be a single model instance. This also bypasses the need for the final #to_X call since [].sum is 0. The downside of this approach is that you can't push the summation into the database when cost_rec is a relation.
anything like these?
def total_cost(cost_rec)
(cost_rec || []).inject(0) { |memo, c| memo + c.cost }
end
or
def total_cost(cost_rec)
(cost_rec || []).sum(&:cost)
end
Either one of these should work
total = cost_rec.map(&:cost).compact.sum
total = cost_rec.map{|c| c.cost }.compact.sum
total = cost_rec.pluck(:cost).compact.sum
Edit: if cost_rec is nil
total = (cost_rec || []).map{|c| c.cost }.compact.sum
When cost_rec is an ActiveRecord::Relatation then this should work out of the box:
cost_rec.sum(:cost)
See ActiveRecord::Calculations#sum.

How to speed up these ActiveRecord queries?

I have three tables in my database: values, keys, and sources. Both keys and sources have many values. sources has a datetime column called file_date. For each key, I have to select all values that have a source that is between a specific date range (usually within two years). Not all dates within that range have a source, and not all sources have a value. I need to create an array that contains all values from within that range.
I at first tried to simply query the entire sources array at once, like so:
Value.where(key: 1, source: sources_array)
However, since there are several values that are nil, it simply returned records that have a value at that date.
So now, I've created an array containing all of the sources between that date range. Days that don't have a source simply have nil. Then, for each key, I iterate through the sources array, returning the value that matches that foo and source. That is obviously not ideal, and it takes about 7 seconds for the page to load.
Here is that code:
sources.map do |source|
Value.where(source: source, key: key)
end
Any ideas?
You could also generate a single SQL query if you only need to retrieve the data and getting back arrays will suffice.
output = []
sources.each do |source|
output << "select * from values where (source like #{source} and key like #{key})"
end
result = ActiveRecord::Base.connection.execute(output.join(" union ")).to_a
The big issue with your .map is that it runs several queries, instead of one query with all the results you want. With each additional query, you add the overhead of sending the request to the database and receiving the response back. What you need to do is generate the query in a way that it returns all the results you may be looking for, and then dealing with any grouping or sorting application side.
You could try wrapping your queries in a single transaction:
ActiveRecord::Base.transaction do
sources.map do |source|
Value.where(source: source, key: key)
end
end
I'll need to refactor this like crazy, but I have something that has cut it down to 3 seconds:
source_ids = sources.map do |source|
if source
source.id
else
nil
end
end
values = Value.where(source: sources, key: key)
value_sources = values.pluck(:source_id)
value_decimals = values.pluck(:value_decimal)
source_ids.map do |id|
index = value_sources.index(id)
if index
value_decimals[index]
else
nil
end
end

How to include duplicate records in query?

Given an array of part ids containing duplicates, how can I find the corresponding records in my Part model, including the duplicates?
An example array of part ids would be ["1B", "4", "3421", "4"]. If we assume I have a record corresponding to each of those values I would like to see 4 records returned in total, not 3. If possible, I was hoping to be able to make additional SQL operations on whatever is returned.
Here's what I'm currently using which doesn't include the duplicates:
#parts = Part.where(:part_id => params[:ids])
To give a little background, I'm trying to upload an XML file containing a list of parts used in some item. My application is meant to parse the XML file and compare the parts listed within against my Parts database so that I can see how much the part weighs. These items will sometimes contain duplicates of various parts so that's what I'm trying to account for here.
The only way I can think of doing it is using map...
#parts = params[:ids].map { |id| Part.find_by_id(id) }
hard to tell exactly what you are doing, are you looking up weight from the xml or from your data?
parts_xml = some_method_that_loads_xml
part_ids_from_xml = part_xml.... # pull out the ids
parts = Part.where("id IN (?)", part_ids_from_xml)
now you have two arrays (xml data and your 'matching' database records) and you can use select or detect to do in memory lookups by part_id
part_ids_from_xml.each do |part_id|
weight = parts.detect { |item| item.id == part_id }.weight
puts "#{id} weighs #{weight}"
end
see http://ruby-doc.org/core-2.0.0/Enumerable.html#method-i-detect
and http://ruby-doc.org/core-2.0.0/Enumerable.html#method-i-select

Finding mongoDB records in batches (using mongoid ruby adapter)

Using rails 3 and mongoDB with the mongoid adapter, how can I batch finds to the mongo DB? I need to grab all the records in a particular mongo DB collection and index them in solr (initial index of data for searching).
The problem I'm having is that doing Model.all grabs all the records and stores them into memory. Then when I process over them and index in solr, my memory gets eaten up and the process dies.
What I'm trying to do is batch the find in mongo so that I can iterate over 1,000 records at a time, pass them to solr to index, and then process the next 1,000, etc...
The code I currently have does this:
Model.all.each do |r|
Sunspot.index(r)
end
For a collection that has about 1.5 million records, this eats up 8+ GB of memory and kills the process. In ActiveRecord, there is a find_in_batches method that allows me to chunk up the queries into manageable batches that keeps the memory from getting out of control. However, I can't seem to find anything like this for mongoDB/mongoid.
I would LIKE to be able to do something like this:
Model.all.in_batches_of(1000) do |batch|
Sunpot.index(batch)
end
That would alleviate my memory problems and query difficulties by only doing a manageable problem set each time. The documentation is sparse, however, on doing batch finds in mongoDB. I see lots of documentation on doing batch inserts but not batch finds.
With Mongoid, you don't need to manually batch the query.
In Mongoid, Model.all returns a Mongoid::Criteria instance. Upon calling #each on this Criteria, a Mongo driver cursor is instantiated and used to iterate over the records. This underlying Mongo driver cursor already batches all records. By default the batch_size is 100.
For more information on this topic, read this comment from the Mongoid author and maintainer.
In summary, you can just do this:
Model.all.each do |r|
Sunspot.index(r)
end
If you are iterating over a collection where each record requires a lot of processing (i.e querying an external API for each item) it is possible for the cursor to timeout. In this case you need to perform multiple queries in order to not leave the cursor open.
require 'mongoid'
module Mongoid
class Criteria
def in_batches_of(count = 100)
Enumerator.new do |y|
total = 0
loop do
batch = 0
self.limit(count).skip(total).each do |item|
total += 1
batch += 1
y << item
end
break if batch == 0
end
end
end
end
end
Here is a helper method you can use to add the batching functionality. It can be used like so:
Post.all.order_by(:id => 1).in_batches_of(7).each_with_index do |post, index|
# call external slow API
end
Just make sure you ALWAYS have an order_by on your query. Otherwise the paging might not do what you want it to. Also I would stick with batches of 100 or less. As said in the accepted answer Mongoid queries in batches of 100 so you never want to leave the cursor open while doing the processing.
It is faster to send batches to sunspot as well.
This is how I do it:
records = []
Model.batch_size(1000).no_timeout.only(:your_text_field, :_id).all.each do |r|
records << r
if records.size > 1000
Sunspot.index! records
records.clear
end
end
Sunspot.index! records
no_timeout: prevents the cursor to disconnect (after 10 min, by default)
only: selects only the id and the fields, which are actually indexed
batch_size: fetch 1000 entries instead of 100
I am not sure about the batch processing, but you can do this way
current_page = 0
item_count = Model.count
while item_count > 0
Model.all.skip(current_page * 1000).limit(1000).each do |item|
Sunpot.index(item)
end
item_count-=1000
current_page+=1
end
But if you are looking for a perfect long time solution i wouldn't recommend this. Let me explain how i handled the same scenario in my app. Instead of doing batch jobs,
i have created a resque job which updates the solr index
class SolrUpdator
#queue = :solr_updator
def self.perform(item_id)
item = Model.find(item_id)
#i have used RSolr, u can change the below code to handle sunspot
solr = RSolr.connect :url => Rails.application.config.solr_path
js = JSON.parse(item.to_json)
solr.add js
end
end
After adding the item, i just put an entry to the resque queue
Resque.enqueue(SolrUpdator, item.id.to_s)
Thats all, start the resque and it will take care of everything
As #RyanMcGeary said, you don't need to worry about batching the query. However, indexing objects one at a time is much much slower than batching them.
Model.all.to_a.in_groups_of(1000, false) do |records|
Sunspot.index! records
end
The following will work for you , just try it
Model.all.in_groups_of(1000, false) do |r|
Sunspot.index! r
end

searching within an already retrieved mysql result

I'm trying to limit the number of times I do a mysql query, as this could end up being 2k+ queries just to accomplish a fairly small result.
I'm going through a CSV file, and I need to check that the format of the content in the csv matches the format the db expects, and sometimes I try to accomplish some basic clean-up (for example, I have one field that is a string, but is sometimes in the csv as jb2003-343, and I need to strip out the -343).
The first thing I do is get from the database the list of fields by name that I need to retrieve from the csv, then I get the index of those columns in the csv, then I go through each line in the csv and get each of the indexed columns
get_fields = BaseField.find_by_group(:all, :conditions=>['group IN (?)',params[:group_ids]])
csv = CSV.read(csv.path)
first_line=csv.first
first_line.split(',')
csv.each_with_index do |row|
if row==0
col_indexes=[]
csv_data=[]
get_fields.each do |col|
col_indexes << row.index(col.name)
end
else
csv_row=[]
col_indexes.each do |col|
#possibly check the value here against another mysql query but that's ugly
csv_row << row[col]
end
csv_data << csv_row
end
end
The problem is that when I'm adding the content of the csv_data for output, I no longer have any connection to the original get_fields query. Therefore, I can't seem to say 'does this match the type of data expected from the db'.
I could work my way back through the same process that got me down to that level, and make another query like this
get_cleanup = BaseField.find_by_csv_col_name(first_line[col])
if get_cleanup.format==row[col].is_a
csv_row << row[col]
else
# do some data clean-up
end
but as I mentioned, that could mean the get_cleanup is run 2000+ times.
instead of doing this, is there a way to search within the original get_fields result for the name, and then get the associated field?
I tried searching for 'search rails object', but kept getting back results about building search, not searching within an already existing object.
I know I can do array.search, but don't see anything in the object api about search.
Note: The code above may not be perfect, because I'm not running it yet, just wrote that off the top of my head, but hopefully it gives you the idea of what I'm going for.
When you populate your col_indexes array, rather than storing a single value, you can store a hash which includes index and the datatype.
get_fields.each do |col|
col_info = {:row_index = row.index(col.name), :name=>col.name :format=>col.format}
col_indexes << col_info
end
You can then access all your data in the for loop

Resources