Explain Rake query to create csv

I have inherited a Ruby app which connects to a MongoDB database. Unfortunately I have no idea about Mongo or Ruby, so I'm on a rapid googling and learning curve.
The app stores placenames as well as their lat/longs, alternative names, people's memories, and comments. It also counts how many times a place has been discussed.
The following rake task, when run, grabs all the locations from the MongoDB and creates a CSV, spitting out one line for each location with the user, number of times mentioned, the memories, etc.
task :data_dump => :environment do
  File.open("results.csv", "w") do |file|
    Location.all.each_with_index do |l, index|
      puts "done #{index}"
      file.puts [l.id, l.classification_count, l.position, l.created_at,
                 l.classifications.collect { |c| c.text },
                 l.classifications.collect { |c| c.alternative_names }.flatten.join(";"),
                 l.classifications.collect { |c| c.comment }.flatten.join(";"),
                 l.memories.collect { |m| m.text }.flatten.join(";")].join(",")
    end
  end
end
It works great and generates a CSV I can then pull into other programmes. The problem is that the content includes free-text fields containing line breaks, commas and the like, which break the validity of the CSV, and I want to make sure all plain-text fields are properly enclosed within the CSV.
So if I can understand the above query better, I can then add the correct field enclosures to ensure the CSV is valid when loaded into GIS software.
The above also takes about an hour and 45 minutes to run on my laptop, so I want to find out whether it is the most efficient way to do the query.
To date we have around 300,000 placenames listed, and this is going to rise to a few million, so it will only get slower.

You can generate the CSV with Ruby's 'csv' module:
require 'csv'

task :data_dump => :environment do
  CSV.open("results.csv", "w") do |csv|
    Location.all.each_with_index do |l, index|
      puts "done #{index}"
      csv << [l.id, l.classification_count, ...]
    end
  end
end
This will ensure that the CSV is generated properly, with quoting and escaping handled for you. As for the speed, I've only used ActiveRecord with relational databases, but I imagine the problem is the same: the N + 1 problem. Each time you call l.classifications.collect or l.memories.collect, a separate query is issued to fetch that location's classifications/memories from the database. The solution is eager loading:
require 'csv'

task :data_dump => :environment do
  CSV.open("results.csv", "w") do |csv|
    Location.all.includes(:classifications, :memories).each_with_index do |l, index|
      puts "done #{index}"
      csv << [l.id, l.classification_count, l.position, l.created_at,
              l.classifications.collect { |c| c.text },
              l.classifications.collect { |c| c.alternative_names }.flatten.join(";"),
              l.classifications.collect { |c| c.comment }.flatten.join(";"),
              l.memories.collect { |m| m.text }.flatten.join(";")]
    end
  end
end
(and you might need to do the same for alternative_names - I don't remember the syntax for nested eager loading). This way the associated records are loaded up front in a small, fixed number of queries instead of one per location, which should be much faster.
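For what it's worth, nested eager loading in ActiveRecord looks roughly like the sketch below. It assumes alternative_names is an association in its own right (if it is just a field on Classification, the includes above is already enough), and Mongoid's includes is more limited, so treat it as a pointer rather than a drop-in:
# Assumes :alternative_names is an association on Classification; adjust or
# drop it if alternative_names is a plain field.
Location.includes(:memories, :classifications => :alternative_names).each_with_index do |l, index|
  # build the CSV row as before
end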

Related

Creating a comma separated csv file in ruby

I have an array reference like this:
a1 = [["http://ww.amazon.com"],["failed"]]
When I write it to a CSV file it is written like:
["http://ww.amazon.com"]
["failed"]
But I want it written like:
http://ww.amazon.com failed
First you need to flatten the array a1
b1 = a1.flatten # => ["http://ww.amazon.com", "failed"]
Then generate the CSV by passing each row (an array) to the csv object yielded by CSV.generate:
require 'csv'

csv_string = CSV.generate(:col_sep => "\t") do |csv|
  csv << b1
end
:col_sep => "\t" is used to separate the values in each row with a tab.
Change it to :col_sep => "," to use a comma instead.
Finally, csv_string contains the correctly formed CSV.
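If you then want the result on disk, a minimal sketch (the output.csv path is just an example):
# Write the generated CSV string to a file (the filename is arbitrary).
File.open('output.csv', 'w') { |f| f.write(csv_string) }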
Ruby's built-in CSV class is your starting point. From the documentation for writing to a CSV file:
CSV.open("path/to/file.csv", "wb") do |csv|
csv << ["row", "of", "CSV", "data"]
csv << ["another", "row"]
# ...
end
For your code, simply flatten your array:
[['a'], ['b']].flatten # => ["a", "b"]
Then you can push it onto the block parameter (csv), which writes the array to the file as a single row:
require 'csv'

CSV.open('file.csv', 'wb') do |csv|
  csv << [["row"], ["of"], ["CSV"], ["data"]].flatten
end
Saving and running that creates "file.csv", which contains:
row,of,CSV,data
Your question is written in such a way that it sounds like you're trying to generate the CSV file by hand, rather than relying on a class designed for that particular task. On the surface, creating a CSV seems easy, but it has nasty corner cases, such as fields that contain the column separator, line breaks, or the quoting character used to delimit strings. A well-tested, pre-written class can save you a lot of time writing and debugging code, or save you from having to explain to a customer or manager why your data won't load correctly into a database.
But that leaves the question: why does your array contain sub-arrays? Usually that happens because you're doing something wrong as you gather the elements, and it makes me think your question should really be about how to avoid doing that in the first place. (It's called an XY problem.)
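To see the kind of quoting the CSV class handles for you, a quick illustration (output shown approximately):
require 'csv'

# Fields containing commas, quotes or newlines are quoted and escaped automatically.
puts CSV.generate { |csv| csv << ['plain', 'has,comma', 'has "quotes"', "has\nnewline"] }
# => plain,"has,comma","has ""quotes""","has
#    newline"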

Finding MongoDB records in batches (using the Mongoid Ruby adapter)

Using Rails 3 and MongoDB with the Mongoid adapter, how can I batch finds to MongoDB? I need to grab all the records in a particular MongoDB collection and index them in Solr (the initial index of the data for searching).
The problem I'm having is that doing Model.all grabs all the records and stores them in memory. Then, when I process them and index them in Solr, my memory gets eaten up and the process dies.
What I'm trying to do is batch the find in Mongo so that I can iterate over 1,000 records at a time, pass them to Solr to index, and then process the next 1,000, etc.
The code I currently have does this:
Model.all.each do |r|
  Sunspot.index(r)
end
For a collection that has about 1.5 million records, this eats up 8+ GB of memory and kills the process. In ActiveRecord, there is a find_in_batches method that allows me to chunk the queries into manageable batches and keeps the memory from getting out of control. However, I can't seem to find anything like this for MongoDB/Mongoid.
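(For reference, the ActiveRecord pattern being described looks roughly like this.)
# ActiveRecord batching, shown for comparison only; not available in Mongoid.
Model.find_in_batches(:batch_size => 1000) do |batch|
  Sunspot.index(batch)
end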
I would LIKE to be able to do something like this:
Model.all.in_batches_of(1000) do |batch|
  Sunspot.index(batch)
end
That would alleviate my memory problems and query difficulties by only dealing with a manageable problem set each time. The documentation is sparse, however, on doing batch finds in MongoDB. I see lots of documentation on doing batch inserts but not batch finds.
With Mongoid, you don't need to manually batch the query.
In Mongoid, Model.all returns a Mongoid::Criteria instance. Upon calling #each on this Criteria, a Mongo driver cursor is instantiated and used to iterate over the records. This underlying Mongo driver cursor already batches all records. By default the batch_size is 100.
For more information on this topic, read this comment from the Mongoid author and maintainer.
In summary, you can just do this:
Model.all.each do |r|
  Sunspot.index(r)
end
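If the default of 100 documents per fetch is too small for your workload, the criteria also accept an explicit batch_size (the same option one of the answers below uses); a rough sketch:
# Ask the driver cursor to fetch larger batches per round trip.
# 1000 is arbitrary; the default is 100.
Model.batch_size(1000).all.each do |r|
  Sunspot.index(r)
end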
If you are iterating over a collection where each record requires a lot of processing (e.g. querying an external API for each item), it is possible for the cursor to time out. In this case you need to perform multiple queries so as not to leave the cursor open.
require 'mongoid'

module Mongoid
  class Criteria
    def in_batches_of(count = 100)
      Enumerator.new do |y|
        total = 0
        loop do
          batch = 0
          self.limit(count).skip(total).each do |item|
            total += 1
            batch += 1
            y << item
          end
          break if batch == 0
        end
      end
    end
  end
end
The helper method above adds the batching functionality. It can be used like so:
Post.all.order_by(:id => 1).in_batches_of(7).each_with_index do |post, index|
  # call external slow API
end
Just make sure you ALWAYS have an order_by on your query, otherwise the paging might not do what you want it to. Also, I would stick with batches of 100 or less. As mentioned in the accepted answer, Mongoid queries in batches of 100, so you never want to leave the cursor open while doing the processing.
It is faster to send batches to Sunspot as well.
This is how I do it:
records = []
Model.batch_size(1000).no_timeout.only(:your_text_field, :_id).all.each do |r|
  records << r
  if records.size > 1000
    Sunspot.index! records
    records.clear
  end
end
Sunspot.index! records
no_timeout: prevents the cursor from disconnecting (after 10 minutes, by default)
only: selects only the id and the fields that are actually indexed
batch_size: fetches 1000 entries at a time instead of 100
I am not sure about batch processing, but you can do it this way:
current_page = 0
item_count = Model.count
while item_count > 0
  Model.all.skip(current_page * 1000).limit(1000).each do |item|
    Sunspot.index(item)
  end
  item_count -= 1000
  current_page += 1
end
But if you are looking for a long-term solution, I wouldn't recommend this. Let me explain how I handled the same scenario in my app: instead of doing batch jobs, I created a Resque job which updates the Solr index.
class SolrUpdator
  @queue = :solr_updator

  def self.perform(item_id)
    item = Model.find(item_id)
    # I have used RSolr; you can change the code below to use Sunspot
    solr = RSolr.connect :url => Rails.application.config.solr_path
    js = JSON.parse(item.to_json)
    solr.add js
  end
end
After adding the item, I just put an entry on the Resque queue:
Resque.enqueue(SolrUpdator, item.id.to_s)
That's all. Start the Resque workers and they will take care of everything.
As #RyanMcGeary said, you don't need to worry about batching the query. However, indexing objects one at a time is much much slower than batching them.
Model.all.to_a.in_groups_of(1000, false) do |records|
  Sunspot.index! records
end
The following will work for you, just try it:
Model.all.in_groups_of(1000, false) do |r|
  Sunspot.index! r
end

Searching within an already retrieved MySQL result

I'm trying to limit the number of times I do a MySQL query, as this could end up being 2k+ queries just to produce a fairly small result.
I'm going through a CSV file, and I need to check that the format of the content in the CSV matches the format the db expects, and sometimes do some basic clean-up (for example, I have one field that is a string but sometimes appears in the CSV as jb2003-343, and I need to strip out the -343).
The first thing I do is get from the database the list of fields, by name, that I need to retrieve from the CSV. Then I get the index of those columns in the CSV, and then I go through each line in the CSV and get each of the indexed columns:
get_fields = BaseField.find_by_group(:all, :conditions => ['group IN (?)', params[:group_ids]])
csv = CSV.read(csv.path)
first_line = csv.first

col_indexes = []
csv_data = []

csv.each_with_index do |row, index|
  if index == 0
    get_fields.each do |col|
      col_indexes << row.index(col.name)
    end
  else
    csv_row = []
    col_indexes.each do |col|
      # possibly check the value here against another mysql query, but that's ugly
      csv_row << row[col]
    end
    csv_data << csv_row
  end
end
The problem is that when I'm adding the content of the csv_data for output, I no longer have any connection to the original get_fields query. Therefore, I can't seem to say 'does this match the type of data expected from the db'.
I could work my way back through the same process that got me down to that level, and make another query like this:
get_cleanup = BaseField.find_by_csv_col_name(first_line[col])
if get_cleanup.format == row[col].is_a
  csv_row << row[col]
else
  # do some data clean-up
end
but as I mentioned, that could mean the get_cleanup is run 2000+ times.
Instead of doing this, is there a way to search within the original get_fields result for the name, and then get the associated field?
I tried searching for 'search rails object', but kept getting back results about building search, not searching within an already existing object.
I know I can do array.search, but don't see anything in the object api about search.
Note: The code above may not be perfect, because I'm not running it yet, just wrote that off the top of my head, but hopefully it gives you the idea of what I'm going for.
When you populate your col_indexes array, rather than storing a single value, you can store a hash which includes the index, the name, and the expected format.
get_fields.each do |col|
  col_info = { :row_index => row.index(col.name), :name => col.name, :format => col.format }
  col_indexes << col_info
end
You can then access all of that data inside the loop, as sketched below.
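A minimal sketch of what that might look like; the 'string' format value is hypothetical, so adapt the comparison to whatever BaseField actually stores:
col_indexes.each do |col_info|
  value = row[col_info[:row_index]]
  # Compare the cell against the expected format recorded when col_indexes
  # was built, instead of querying MySQL again ('string' is a placeholder).
  if col_info[:format] == 'string'
    csv_row << value.to_s
  else
    # do some data clean-up before appending
    csv_row << value
  end
end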

Ruby on Rails: compare two strings in terms of database collation

I have a list of words and want to find which ones already exist in the database.
Instead of making tens of SQL queries, I decided to use "SELECT word FROM table WHERE word IN(array_of_words)" and then loop through the result.
The problem is database collation.
http://www.collation-charts.org/mysql60/mysql604.utf8_general_ci.european.html
There are many different characters, which MySQL treats as the same. However, in Ruby code string1 would not be equal to string2.
For example: if the word is "šuo", the database might also return "suo" if it's found (and that's OK), but when I then check whether anything matching "šuo" was found, Ruby, of course, returns false (šuo != suo).
So, is there any way to compare two strings in Ruby in terms of the same collation?
I've used iconv like this for something similar:
require 'iconv'

class String
  def to_ascii_iconv
    Iconv.new('ASCII//IGNORE//TRANSLIT', 'UTF-8').iconv(self).unpack('U*').select { |cp| cp < 127 }.pack('U*')
  end
end
puts 'suo'.to_ascii_iconv
# => suo
puts 'šuo'.to_ascii_iconv
# => suo
puts 'suo'.to_ascii_iconv == 'šuo'.to_ascii_iconv
# => true
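On newer Rubies where iconv is no longer available, ActiveSupport's transliterate gives a similar ASCII folding inside a Rails app; it is a rough equivalent, not an exact match for MySQL's collation rules:
# Inside a Rails app ActiveSupport is already loaded; otherwise:
require 'active_support/inflector'

ActiveSupport::Inflector.transliterate('šuo')  # => "suo"
ActiveSupport::Inflector.transliterate('šuo') == ActiveSupport::Inflector.transliterate('suo')  # => true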
Hope that helps!

Rails: rake db:migrate *very* slow on Oracle

I'm using Rails with the oracle_enhanced adapter to create a new interface for a legacy application.
Database migrations work successfully, but take an incredibly long time before rake finishes. The database changes happen pretty quickly (1 or 2 seconds), but the db/schema.rb dump takes over an hour to complete. (See the example migration below.)
It's a relatively large schema (about 150 tables), but I'm sure it shouldn't be taking this long to dump out each table description.
Is there any way to speed this up by just taking the last schema.rb and applying the change specified in the migration to it? Or can I skip this schema dump altogether?
I understand this schema.rb is used to create the test database from scratch each time, but in this case there's a large chunk of the database logic in table triggers which isn't included in the schema.rb anyway, so the rake tests are no good to us in any case. (That's a whole different issue that I need to sort out at some other point.)
dgs#dgs-laptop:~/rails/voyager$ time rake db:migrate
(in /home/dgs/rails/voyager)
== 20090227012452 AddModuleActionAndControllerNames: migrating ================
-- add_column(:modules, :action_name, :text)
-> 0.9619s
-> 0 rows
-- add_column(:modules, :controller_name, :text)
-> 0.1680s
-> 0 rows
== 20090227012452 AddModuleActionAndControllerNames: migrated (1.1304s) =======
real 87m12.961s
user 0m12.949s
sys 0m2.128s
After all migrations are applied to the database, rake db:migrate calls the db:schema:dump task to generate the schema.rb file from the current database schema.
db:schema:dump calls the adapter's "tables" method to get the list of all tables, then for each table calls the "indexes" and "columns" methods. You can find the SQL SELECT statements used in these methods in the activerecord-oracle_enhanced-adapter gem's oracle_enhanced_adapter.rb file. Basically it selects from ALL% or USER% data dictionary views to find all the information.
Initially I had issues with the original Oracle adapter when I used it with databases containing a lot of different schemas (performance can be affected by the total number of tables in the database, not just in your schema), and therefore I did some optimizations in the Oracle enhanced adapter. It would be good to find out which methods are slow in your case (I suspect it is either the "indexes" or the "columns" method, as each is executed once per table).
One way to debug this would be to put some debug messages in the oracle_enhanced_adapter.rb file so that you can identify which method calls are taking so long.
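A minimal sketch of that kind of instrumentation, wrapping the per-table indexes call in a timer (the alias name and the logging are illustrative only; the same wrapper works for columns):
require 'benchmark'

module ActiveRecord
  module ConnectionAdapters
    class OracleEnhancedAdapter
      # Hypothetical timing wrapper to see how long each per-table call
      # takes during db:schema:dump.
      alias_method :indexes_without_timing, :indexes

      def indexes(table_name, name = nil)
        result = nil
        seconds = Benchmark.realtime { result = indexes_without_timing(table_name, name) }
        puts "indexes(#{table_name}) took #{(seconds * 1000).round} ms"
        result
      end
    end
  end
end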
The problem was mostly solved after some digging around in oracle_enhanced_adapter.rb.
It came down to way too many tables in the local schema (many EBA_%, EVT_%, EMP_%, SMP_% tables had been created in there coincidentally at some point), archive tables being included in the dump, and a select from the data dictionaries taking 14 seconds to execute.
To fix the speed, I did three things:
Dropped all unneeded tables (about 250 out of 500)
Excluded archive tables from the schema dump
Cached the result of the long running query
This improved the time from the migration/schema dump for the remaining 350 tables from about 90 minutes to about 15 seconds. More than fast enough.
My code is as follows (for inspiration, not copying and pasting - this code is fairly specific to my database, but you should be able to get the idea). You need to create the temp table manually; that takes about 2 or 3 minutes for me, still too long to regenerate with each migration, but it's fairly static anyway =)
module ActiveRecord
  module ConnectionAdapters
    class OracleEnhancedAdapter
      def tables(name = nil)
        select_all("select lower(table_name) from all_tables where owner = sys_context('userenv','session_user') and table_name not like 'A!_%' escape '!' ").inject([]) do |tabs, t|
          tabs << t.to_a.first.last
        end
      end

      # TODO think of some way to automatically create the rails_temp_index table
      #
      # Table created by:
      #   create table rails_temp_index_table as
      #   SELECT lower(i.index_name) as index_name, i.uniqueness,
      #          lower(c.column_name) as column_name, i.table_name
      #   FROM all_indexes i, user_ind_columns c
      #   WHERE c.index_name = i.index_name
      #     AND i.owner = sys_context('userenv','session_user')
      #     AND NOT exists (SELECT uc.index_name FROM user_constraints uc
      #                     WHERE uc.constraint_type = 'P' and uc.index_name = i.index_name);
      def indexes(table_name, name = nil) #:nodoc:
        result = select_all(<<-SQL, name)
          SELECT index_name, uniqueness, column_name
          FROM rails_temp_index_table
          WHERE table_name = '#{table_name.to_s.upcase}'
          ORDER BY index_name
        SQL
        current_index = nil
        indexes = []
        result.each do |row|
          if current_index != row['index_name']
            indexes << IndexDefinition.new(table_name, row['index_name'], row['uniqueness'] == "UNIQUE", [])
            current_index = row['index_name']
          end
          indexes.last.columns << row['column_name']
        end
        indexes
      end
    end
  end
end
