How to perform a rails migration with dynamically generated sql? - ruby-on-rails

I'm trying to run inserts through either a PG::Connection or ActiveRecord::Base.connection while parsing a text file during a Rails migration. The migration seems to run out of memory pretty fast, much faster than when I run the same PG::Connection code in a plain Ruby script. What is the cause of this?
def import_movies(db_conn)
  title_re = /^(#{$title}) \s+ \([0-9]+\) \s+ ([0-9]+)$/ix
  i = 0
  db_conn.transaction do |conn|
    # ActiveRecord::Base.transaction do # tried this also, slow
    stmt = prepare2(conn, "INSERT INTO movies (title, year) VALUES (?, ?);")
    File.new("#{DUMP_PATH}/movies.list").each_line do |line|
      print "." if (i = i + 1) % 5000 == 0; STDOUT.flush
      if match = title_re.match(line)
        stmt.execute!(match[1], match[2].to_i)
      end
    end
  end
  puts
end

Related

How to speed up a very frequently made query using raw SQL and without ORM?

I have an API endpoint that accounts for a little less than half of the average response time (on average taking about 514 ms, yikes). The endpoint simply returns some statistics about stored data scoped to particular time periods, such as this week, last week, this month, and so on...
There are a number of ways we could reduce its impact, like getting the clients to hit it less often and with more specific queries, such as only asking for "this week" when only that data is used. Here we focus on what can be done at the database level first. In our current implementation we generate this data for all "time scopes" on the fly, so the number of queries is enormous and they are made multiple times per second. No caching is used, but maybe there is a way to use Rails's cache_key, or the low-level Rails.cache?
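For the low-level caching idea, a minimal sketch (assuming slightly stale statistics are acceptable and a cache store such as Memcached or Redis is configured) would wrap the service call in Rails.cache.fetch:
# Hypothetical low-level caching wrapper around the existing service object.
# The expires_in value and the cache key granularity are assumptions; pick
# them to match how stale the statistics are allowed to be.
def cached_summaries_for(user)
  Rails.cache.fetch(["foo_summaries", user.id, Date.current], expires_in: 10.minutes) do
    FooSummaries.generate_for(user)
  end
end
The struct returned by generate_for would need to be serializable by the cache store, which is an assumption here.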
The current implementation looks something like this:
class FooSummaries
  include SummaryStructs

  def self.generate_for(user)
    @user = user
    summaries = Struct::Summaries.new
    TimeScope::TIME_SCOPES.each do |scope|
      foos = user.foos.by_scope(scope.to_sym)
      summary = Struct::Summary.new
      # e.g: summaries.last_week = build_summary(foos)
      summaries.send("#{scope}=", build_summary(summary, foos))
    end
    summaries
  end

  private_class_method

  def self.build_summary(summary, foos)
    summary.all_quuz = @user.foos_count
    summary.all_quux = all_quux(foos)
    summary.quuw = quuw(foos).to_f
    %w[foo bar baz qux].product(
      %w[quux quuz corge]
    ).each do |a, b|
      # e.g: summary.foo_quux = quux(foos, "foo")
      summary.send("#{a.downcase}_#{b}=", send(b, foos, a) || 0)
    end
    summary
  end

  def self.all_quuz(foos)
    foos.count
  end

  def self.all_quux(foos)
    foos.sum(:quux)
  end

  def self.quuw(foos)
    foos.quuwable.total_quuw
  end

  def self.corge(foos, foo_type)
    return if foos.count.zero?
    count = self.quuz(foos, foo_type) || 0
    count.to_f / foos.count
  end

  def self.quux(foos, foo_type)
    case foo_type
    when "foo"
      foos.where(foo: true).sum(:quux)
    when "bar"
      foos.bar.where(foo: false).sum(:quux)
    when "baz"
      foos.baz.where(foo: false).sum(:quux)
    when "qux"
      foos.qux.sum(:quux)
    end
  end

  def self.quuz(foos, foo_type)
    case foo_type
    when "foo"
      foos.where(foo: true).count
    when "bar"
      foos.bar.where(foo: false).count
    when "baz"
      foos.baz.where(foo: false).count
    when "qux"
      foos.qux.count
    end
  end
end
To avoid making changes to the model, or creating a migration to add a table to store this data (both of which may be valid and better solutions), I decided it might be easier to construct one large SQL query that is executed all at once, in the hope that building the query string and executing it would be faster than the overhead of Active Record setting up and tearing down individual SQL queries.
The new approach looks something like this. It is horrifying to me, and I know there must be a more elegant way:
class FooSummaries
  include SummaryStructs

  def self.generate_for(user)
    results = ActiveRecord::Base.connection.execute(build_query_for(user))
    results.each do |result|
      # build up summary struct from query results
    end
  end

  def self.build_query_for(user)
    TimeScope::TIME_SCOPES.map do |scope|
      time_scope = TimeScope.new(scope)
      %w[foo bar baz qux].map do |foo_type|
        %[
          select
            '#{scope}_#{foo_type}',
            sum(quux) as quux,
            count(*) as quuz,
            round(100.0 * (count(*) / #{user.foos_count.to_f}), 3) as corge
          from
            "foos"
          where
            "foos"."user_id" = #{user.id}
            and "foos"."foo_type" = '#{foo_type.humanize}'
            and "foos"."end_time" between '#{time_scope.from}' AND '#{time_scope.to}'
            and "foos"."foo" = '#{foo_type == 'foo' ? 't' : 'f'}'
          union
        ]
      end
    end.join.reverse.sub("union".reverse, "").reverse
  end
end
The funny way of replacing the last occurrence of union also horrifies me, but it seems to work. There must be a better way, as there are probably many things wrong with the above implementation(s). It may be helpful to note that I use PostgreSQL and have no problem with writing queries that are not portable to other databases. Any advice is truly appreciated!
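For the union hack specifically, a smaller cleanup is to build each SELECT fragment without the trailing union and let join insert the keyword between them; a minimal sketch of the idea (the fragment body itself is elided here):
def self.build_query_for(user)
  fragments = TimeScope::TIME_SCOPES.flat_map do |scope|
    %w[foo bar baz qux].map do |foo_type|
      # same SELECT body as above, just without the trailing "union"
      "select '#{scope}_#{foo_type}', ... -- rest of the SELECT as before"
    end
  end
  fragments.join("\nunion\n")
end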
Thanks for reading!
Update: I found a solution that works for me and sped up the endpoint that uses this service object by 500%! Essentially the idea is, instead of building a query string and then executing it for each set of parameters, we create a prepared statement using prepare followed by an exec_prepared call passing in the parameters to the query. Since this query is made many times over, this is a useful optimization because, as per the documentation:
A prepared statement is a server-side object that can be used to optimize performance. When the PREPARE statement is executed, the specified statement is parsed, analyzed, and rewritten. When an EXECUTE command is subsequently issued, the prepared statement is planned and executed. This division of labor avoids repetitive parse analysis work, while allowing the execution plan to depend on the specific parameter values supplied.
We prepare the query like so:
def prepare_query!
  ActiveRecord::Base.transaction do
    connection.prepare("foos_summary",
      %[with scoped_foos as (
          select
            *
          from
            "foos"
          where
            "foos"."user_id" = $3
            and ("foos"."end_time" between $4 and $5)
        )
        select
          $1::text as scope,
          $2::text as foo_type,
          sum(quux)::float as quux,
          sum(eggs + bacon + ham)::float as food,
          count(*) as count,
          round((sum(quux) / nullif(
            (select
               sum(quux)
             from
               scoped_foos), 0))::numeric,
            5)::float as quuz
        from
          scoped_foos
        where
          (case $6
           when 'Baz'
             then (baz = 't')
           else
             (baz = 'f' and foo_type = $6)
           end
          )
      ])
  end
end
You can see that in this query we use a common table expression, for readability and to avoid writing the same select twice.
Then we execute the query, passing in the parameters we need:
def connection
  @connection ||= ActiveRecord::Base.connection.raw_connection
end

def query_results
  prepare_query! unless query_already_prepared?
  @results ||= TimeScope::TIME_SCOPES.map do |scope|
    time_scope = TimeScope.new(scope)
    %w[bacon eggs ham spam].map do |foo_type|
      connection.exec_prepared("foos_summary",
                               [scope,
                                foo_type,
                                @user.id,
                                time_scope.from,
                                time_scope.to,
                                foo_type.humanize])
    end
  end
end
Where query_already_prepared? is a simple check against the prepared statements view maintained by Postgres:
def query_already_prepared?
  connection.exec(%(select
                      name
                    from
                      pg_prepared_statements
                    where name = 'foos_summary')).count.positive?
end
A nice solution, I thought! Hopefully the technique illustrated here will help others with similar problems.

Batching when using ActiveRecord::Base.connection.execute

I am busy writing a migration that will allow us to move our yamler from Syck to Psych and finally upgrade our project to Ruby 2. This migration is going to be seriously resource intensive, though, so I am going to need to use chunking.
I wrote the following method to confirm that the transformation I plan to use produces the expected result and can be done without downtime. To avoid Active Record performing the serialization automatically, I needed to use ActiveRecord::Base.connection.execute.
My method that describes the transformation is as follows:
def show_summary(table, column_name)
  a = ActiveRecord::Base.connection.execute <<-SQL
    SELECT id, #{column_name} FROM #{table}
  SQL
  all_rows = a.to_a; ""
  problem_rows = all_rows.select do |row|
    original_string = Syck.dump(Syck.load(row[1]))
    original_object = Syck.load(original_string)
    new_string = Psych.dump(original_object)
    new_object = Syck.load(new_string)
    Syck.dump(new_object) != original_string rescue true
  end
  problem_rows.map do |row|
    old_string = Syck.dump(Syck.load(row[1]))
    new_string = Psych.dump(Syck.load(old_string)) rescue "Parse failure"
    roundtrip_string = begin
      Syck.dump(Syck.load(new_string))
    rescue => e
      e.message
    end
    new_row = {}
    new_row[:id] = row[0]
    new_row[:original_encoding] = old_string
    new_row[:new_encoding] = roundtrip_string
    new_row
  end
end
How can you use batching when making use of ActiveRecord::Base.connection.execute?
For completeness, my update function is as follows:
# Migrate the given serialized YAML column from Syck to Psych
# (if any).
def migrate_to_psych(table, column)
  table_name = ActiveRecord::Base.connection.quote_table_name(table)
  column_name = ActiveRecord::Base.connection.quote_column_name(column)

  fetch_data(table_name, column_name).each do |row|
    transformed = ::Psych.dump(convert(Syck.load(row[column])))

    ActiveRecord::Base.connection.execute <<-SQL
      UPDATE #{table_name}
      SET #{column_name} = #{ActiveRecord::Base.connection.quote(transformed)}
      WHERE id = #{row['id']};
    SQL
  end
end

def fetch_data(table_name, column_name)
  ActiveRecord::Base.connection.select_all <<-SQL
    SELECT id, #{column_name}
    FROM #{table_name}
    WHERE #{column_name} LIKE '---%'
  SQL
end
Which I got from http://fossies.org/linux/openproject/db/migrate/migration_utils/legacy_yamler.rb
You can easily build something with SQL's LIMIT and OFFSET clauses:
def fetch_data(table_name, column_name)
  batch_size, offset = 1000, 0

  begin
    batch = ActiveRecord::Base.connection.select_all <<-SQL
      SELECT id, #{column_name}
      FROM #{table_name}
      WHERE #{column_name} LIKE '---%'
      LIMIT #{batch_size}
      OFFSET #{offset}
    SQL

    batch.each do |row|
      yield row
    end

    offset += batch_size
  end until batch.empty?
end
which you can use almost exactly as before, just without the .each:
fetch_data(table_name, column_name) do |row| ... end
HTH!
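One caveat with OFFSET paging is that it can skip or revisit rows if the table changes while the loop runs, and each batch gets slower as the offset grows. An alternative sketch, assuming an integer id primary key, is keyset pagination that pages on id instead:
# Hypothetical keyset-paginated variant of fetch_data: pages on id rather than
# OFFSET, so updated rows cannot shift the window between batches.
def fetch_data(table_name, column_name)
  batch_size, last_id = 1000, 0

  loop do
    rows = ActiveRecord::Base.connection.select_all(<<-SQL).to_a
      SELECT id, #{column_name}
      FROM #{table_name}
      WHERE #{column_name} LIKE '---%'
        AND id > #{last_id}
      ORDER BY id
      LIMIT #{batch_size}
    SQL
    break if rows.empty?

    rows.each { |row| yield row }
    last_id = rows.last['id']
  end
end
It keeps the same yield interface, so the calling code does not change.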

How to import a large size (5.5Gb) CSV file to Postgresql using ruby on rails?

I have a huge CSV file, 5.5 GB in size, with more than 100 columns in it. I want to import only specific columns from the CSV file. What are the possible ways to do this?
I want to import it into two different tables: only one field into one table and the rest of the fields into another table.
Should I use the COPY command in Postgresql, the CSV class, or a SmartCSV kind of gem for this purpose?
Regards,
Suresh.
If I had 5Gb of CSV, I'd better import it without Rails! But, you may have a use case that needs Rails...
Since you've said RAILS, I suppose you are talking about a web request and ActiveRecord...
If you don't care about waiting (and hanging one instance of your server process) you can do this:
Before the code, notice two things: 1) the use of a temp table, so in case of errors you don't mess with your destination table (this is optional, of course); 2) the use of an option to truncate the destination table first.
CONTROLLER ACTION:
def updateDB
  remote_file = params[:remote_file] # an ActionDispatch::Http::UploadedFile
  truncate    = (params[:truncate] == 'true')

  if remote_file
    result = Model.csv2tempTable(remote_file.original_filename, remote_file.tempfile)
    if result[:result]
      Model.updateFromTempTable(truncate)
      flash[:notice] = 'Success.'
    else
      flash[:error] = 'Errors: ' + result[:errors].join(" ==>")
    end
  else
    flash[:error] = 'Error: no file given.'
  end

  redirect_to somewhere_else_path
end
MODEL METHODS:
# References:
# http://www.kadrmasconcepts.com/blog/2013/12/15/copy-millions-of-rows-to-postgresql-with-rails/
# http://stackoverflow.com/questions/14526489/using-copy-from-in-a-rails-app-on-heroku-with-the-postgresql-backend
# http://www.postgresql.org/docs/9.1/static/sql-copy.html
#
def self.csv2tempTable(uploaded_name, uploaded_file)
  errors = []
  begin
    # read csv file
    file = uploaded_file
    Rails.logger.info "Creating temp table...\n From: #{uploaded_name}\n "

    # init connection
    conn = ActiveRecord::Base.connection
    rc = conn.raw_connection

    # recreate the temp table without the created_at/updated_at columns
    rc.exec "drop table IF EXISTS #{TEMP_TABLE};"
    rc.exec "create table #{TEMP_TABLE} (like #{self.table_name});"
    rc.exec "alter table #{TEMP_TABLE} drop column created_at, drop column updated_at;"

    # copy it!
    rc.exec("COPY #{TEMP_TABLE} FROM STDIN WITH CSV HEADER")
    while !file.eof?
      # Add row to copy data
      l = file.readline
      if l.encoding.name != 'UTF-8'
        Rails.logger.info "line encoding is #{l.encoding.name}..."
        # ENCODING:
        # If the source string is already encoded in UTF-8, then just calling .encode('UTF-8') is a no-op,
        # and no checks are run. However, converting it to UTF-16 first forces all the checks for invalid byte
        # sequences to be run, and replacements are done as needed.
        # Reference: http://stackoverflow.com/questions/2982677/ruby-1-9-invalid-byte-sequence-in-utf-8?rq=1
        l = l.encode('UTF-16', 'UTF-8').encode('UTF-8', 'UTF-16')
      end
      Rails.logger.info "writing line with encoding #{l.encoding.name} => #{l[0..80]}"
      rc.put_copy_data( l )
    end

    # We are done adding copy data
    rc.put_copy_end

    # Display any error messages
    while res = rc.get_result
      e_message = res.error_message
      if e_message.present?
        errors << "Error executing SQL: \n" + e_message
      end
    end
  rescue StandardError => e
    errors << "Error in csv2tempTable: \n #{e} => #{e.to_yaml}"
  end

  if errors.present?
    Rails.logger.error errors.join("*******************************\n")
    { result: false, errors: errors }
  else
    { result: true, errors: [] }
  end
end
# copy from TEMP_TABLE into self.table_name
# If <truncate> = true, truncates self.table_name first
# If <truncate> = false, update lines from TEMP_TABLE into self.table_name
#
def self.updateFromTempTable(truncate)
  begin
    Rails.logger.info "Refreshing table #{self.table_name}...\n Truncate: #{truncate}\n "

    # init connection
    conn = ActiveRecord::Base.connection
    rc = conn.raw_connection

    if truncate
      rc.exec "TRUNCATE TABLE #{self.table_name}"
      return false unless check_exec(rc)
      rc.exec "INSERT INTO #{self.table_name} SELECT *, '#{DateTime.now}' as created_at, '#{DateTime.now}' as updated_at FROM #{TEMP_TABLE}"
      return false unless check_exec(rc)
    else
      # remove lines from self.table_name that are present in temp
      rc.exec "DELETE FROM #{self.table_name} WHERE id IN ( SELECT id FROM #{TEMP_TABLE} )"
      return false unless check_exec(rc)

      # copy lines from temp into self + include timestamps
      rc.exec "INSERT INTO #{self.table_name} SELECT *, '#{DateTime.now}' as created_at, '#{DateTime.now}' as updated_at FROM #{TEMP_TABLE};"
      return false unless check_exec(rc)
    end
  rescue StandardError => e
    Rails.logger.error "Error in updateFromTempTable: \n #{e} => #{e.to_yaml}"
    return false
  end

  true
end
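As for importing only specific columns and splitting one field off into a second table, the same COPY approach can work if you give COPY an explicit column list and project the CSV rows in Ruby before streaming them. A rough sketch, with hypothetical table and column names, not part of the original answer:
require 'csv'

# Hypothetical: "detail_records(name, col_a, col_b)" receives most of the
# selected columns, "master_records(name)" receives the single field.
def self.import_selected_columns(path)
  rc = ActiveRecord::Base.connection.raw_connection

  # COPY accepts an explicit column list, so the destination table does not
  # have to match the CSV layout column-for-column.
  rc.exec("COPY detail_records (name, col_a, col_b) FROM STDIN WITH CSV")
  CSV.foreach(path, headers: true) do |row|
    # Project just the CSV columns we care about, in the order COPY expects.
    rc.put_copy_data(CSV.generate_line([row['name'], row['col_a'], row['col_b']]))
  end
  rc.put_copy_end
  while res = rc.get_result
    raise res.error_message if res.error_message.present?
  end

  # The single field destined for the other table can then be filled from the
  # rows just loaded, entirely inside Postgres.
  rc.exec("INSERT INTO master_records (name) SELECT DISTINCT name FROM detail_records")
end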

how to set query timeout for oracle 11 in ruby

I saw other threads stating how to do it for MySQL, and even how to do it in Java, but not how to set the query timeout in Ruby.
I'm trying to use the setQueryTimeout function in Jruby using OJDBC7, but can't find how to do it in ruby. I've tried the following:
@c.connection.instance_variable_get(:@connection).instance_variable_set(:@query_timeout, 1)
@c.connection.instance_variable_get(:@connection).instance_variable_set(:@read_timeout, 1)
@c.connection.setQueryTimeout(1)
I also tried modifying my database.yml file to include
adapter: jdbc
driver: oracle.jdbc.driver.OracleDriver
timeout: 1
None of the above had any effect, other than setQueryTimeout, which threw a method error.
Any help would be great.
So I found a way to make it work, but I don't like it. It's very hackish and orphans queries on the database, but it at least allows my app to continue executing. I would still love to find a way to cancel the statement so I'm not orphaning queries that take longer than 10 seconds.
query_thread = Thread.new {
  # execute query
}
begin
  Timeout::timeout(10) do
    query_thread.join()
  end
rescue
  Thread.kill(query_thread)
  results = Array.new
end
Query timeout on Oracle-DB works for me with Rails 4 and JRuby
With JRuby you can use the JDBC function Statement#setQueryTimeout to define a query timeout.
Unfortunately this requires patching the oracle-enhanced adapter, as shown below.
The example below implements an iterator-style query that does not store the whole result in an array and also applies the query timeout.
# hold open SQL-Cursor and iterate over SQL-result without storing whole result in Array
# Peter Ramm, 02.03.2016

# expand class by getter to allow access on internal variable @raw_statement
ActiveRecord::ConnectionAdapters::OracleEnhancedJDBCConnection::Cursor.class_eval do
  def get_raw_statement
    @raw_statement
  end
end

# Class extension by Module-Declaration : module ActiveRecord, module ConnectionAdapters, module OracleEnhancedDatabaseStatements
# does not work as Engine with Winstone application server, therefore hard manipulation of class ActiveRecord::ConnectionAdapters::OracleEnhancedAdapter
# and extension with method iterate_query
ActiveRecord::ConnectionAdapters::OracleEnhancedAdapter.class_eval do

  # Method comparable with ActiveRecord::ConnectionAdapters::OracleEnhancedDatabaseStatements.exec_query,
  # but without storing whole result in memory
  def iterate_query(sql, name = 'SQL', binds = [], modifier = nil, query_timeout = nil, &block)
    type_casted_binds = binds.map { |col, val|
      [col, type_cast(val, col)]
    }
    log(sql, name, type_casted_binds) do
      cursor = nil
      cached = false
      if without_prepared_statement?(binds)
        cursor = @connection.prepare(sql)
      else
        unless @statements.key? sql
          @statements[sql] = @connection.prepare(sql)
        end
        cursor = @statements[sql]
        binds.each_with_index do |bind, i|
          col, val = bind
          cursor.bind_param(i + 1, type_cast(val, col), col)
        end
        cached = true
      end

      cursor.get_raw_statement.setQueryTimeout(query_timeout) if query_timeout

      cursor.exec

      if name == 'EXPLAIN' and sql =~ /^EXPLAIN/
        res = true
      else
        columns = cursor.get_col_names.map do |col_name|
          @connection.oracle_downcase(col_name).freeze
        end
        fetch_options = { :get_lob_value => (name != 'Writable Large Object') }
        while row = cursor.fetch(fetch_options)
          result_hash = {}
          columns.each_index do |index|
            result_hash[columns[index]] = row[index]
            row[index] = row[index].strip if row[index].class == String # Remove possible 0x00 at end of string, this leads to error in Internet Explorer
          end
          result_hash.extend SelectHashHelper
          modifier.call(result_hash) unless modifier.nil?
          yield result_hash
        end
      end

      cursor.close unless cached
      nil
    end
  end # iterate_query
end # class_eval

class SqlSelectIterator
  def initialize(stmt, binds, modifier, query_timeout)
    @stmt          = stmt
    @binds         = binds
    @modifier      = modifier      # proc for modification of record
    @query_timeout = query_timeout
  end

  def each(&block)
    # Execute SQL and call block for every record of result
    ActiveRecord::Base.connection.iterate_query(@stmt, 'sql_select_iterator', @binds, @modifier, @query_timeout, &block)
  end
end
Use the class SqlSelectIterator above like this:
SqlSelectIterator.new(stmt, binds, modifier, query_timeout).each do |record|
  process(record)
end

optimize memory usage in rails loop

I develop a Heroku Rails application on the Cedar stack, and this is the bottleneck.
def self.to_csvAlt(options = {})
  CSV.generate(options) do |csv|
    column_headers = ["user_id", "session_id", "survey_id"]
    pages = PageEvent.order(:page).select(:page).map(&:page).uniq
    page_attributes = ["a", "b", "c", "d", "e"]

    pages.each do |p|
      page_attributes.each do |pa|
        column_headers << p + "_" + pa
      end
    end
    csv << column_headers

    session_ids = PageEvent.order(:session_id).select(:session_id).map(&:session_id).uniq
    session_ids.each do |si|
      session_user = PageEvent.find(:first, :conditions => ["session_id = ? AND page != ?", si, 'none'])
      if session_user.nil?
        row = [si, nil, nil, nil]
      else
        row = [session_user.username, si, session_user.survey_name]
      end

      pages.each do |p|
        a = 0
        b = 0
        c = 0
        d = 0
        e = 0
        allpages = PageEvent.where(:page => p, :session_id => si)
        allpages.each do |ap|
          a += ap.a
          b += ap.b
          c += ap.c
          d += ap.d
          e += ap.e
        end

        index = pages.index p
        end_index = (index + 1) * 5 + 2
        if !p.nil?
          row[end_index]     = a
          row[end_index - 1] = b
          row[end_index - 2] = c
          row[end_index - 3] = d
          row[end_index - 4] = e
        else
          row[end_index]     = nil
          row[end_index - 1] = nil
          row[end_index - 2] = nil
          row[end_index - 3] = nil
          row[end_index - 4] = nil
        end
      end
      csv << row
    end
  end
end
As you can see, it generates a CSV file from a table that contains data on each individual page taken from a group of surveys. The problem is that there are ~50,000 individual pages in the table, and the Heroku app keeps giving me R14 errors (out of memory, 512 MB) and eventually dies when the dyno goes to sleep after an hour.
That being said, I really don't care how long it takes to run, I just need it to complete. I am waiting on approval to add a worker dyno to run the CSV generation, which I know will help, but in the meantime I would still like to optimize this code. There is potential for over 100,000 pages to be processed at a time, and I realize this is incredibly memory heavy, so I really need to cut back its memory usage as much as possible. Thank you for your time.
You can split it up into batches so that the work is completed in sensible chunks.
Try something like this:
def self.to_csvAlt(options = {})
  # ...
  pages = PageEvent.order(:page).select(:page)  # keep this as a relation, not an Array

  pages.find_each(:batch_size => 5000) do |p|
    # p is a PageEvent record here; use p.page for the page name
    # ...
Using find_each with a batch_size, you won't do one huge lookup for your loop. Instead it'll fetch 5000 rows, run your loop, fetch another batch, loop again, and so on, until no more records are returned.
The other key thing to note here is that rather than Rails trying to instantiate all of the objects returned from the database at the same time, it will only instantiate those returned in your current batch. This can save a huge memory overhead if you have a giant dataset.
UPDATE:
Using #map to restrict your results to a single attribute of your model is highly inefficient. You should instead use the pluck Active Record method to pull back just the data you want directly from the DB, rather than manipulating the results with Ruby, like this:
# Instead of this:
pages = PageEvent.order(:page).select(:page).map(&:page).uniq
# Use this:
pages = PageEvent.order(:page).pluck(:page).uniq
I also personally prefer to use .distinct rather than the alias .uniq as I feel it sits more in line with the DB query rather than confusing things with what seems more like an array function:
pages = PageEvent.order(:page).pluck(:page).distinct
Use
CSV.open("path/to/file.csv", "wb")
instead of CSV.generate. This streams the CSV into the file, whereas generate builds one huge string that will end up exhausting memory if it gets too large.
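A minimal sketch of how the two suggestions combine, assuming the same PageEvent model and a writable tmp path (the reduced column set is just for illustration):
require 'csv'

def self.to_csv_file(path = "tmp/page_events.csv")
  # Stream rows straight to disk while batching the reads, so neither the CSV
  # string nor the full record set has to live in memory at once.
  CSV.open(path, "wb") do |csv|
    csv << ["user_id", "session_id", "survey_id"]

    PageEvent.where("page != ?", "none").find_each(:batch_size => 5000) do |event|
      csv << [event.username, event.session_id, event.survey_name]
    end
  end
end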
