Batching when using ActiveRecord::Base.connection.execute - ruby-on-rails

I am busy writing a migration that will allow us to move our yamler from Syck to Psych and finally upgrade our project to Ruby 2. This migration is going to be seriously resource intensive, though, so I am going to need to use chunking.
I wrote the following method to confirm that the transformation I plan to use produces the expected result and can be done without downtime. To avoid ActiveRecord performing the serialization automatically, I needed to use ActiveRecord::Base.connection.execute.
My method that describes the transformation is as follows:
def show_summary(table, column_name)
  result = ActiveRecord::Base.connection.execute <<-SQL
    SELECT id, #{column_name} FROM #{table}
  SQL
  all_rows = result.to_a

  problem_rows = all_rows.select do |row|
    original_string = Syck.dump(Syck.load(row[1]))
    original_object = Syck.load(original_string)
    new_string = Psych.dump(original_object)
    new_object = Syck.load(new_string)
    Syck.dump(new_object) != original_string rescue true
  end

  problem_rows.map do |row|
    old_string = Syck.dump(Syck.load(row[1]))
    new_string = Psych.dump(Syck.load(old_string)) rescue "Parse failure"
    roundtrip_string = begin
      Syck.dump(Syck.load(new_string))
    rescue => e
      e.message
    end
    new_row = {}
    new_row[:id] = row[0]
    new_row[:original_encoding] = old_string
    new_row[:new_encoding] = roundtrip_string
    new_row
  end
end
How can you use batching when making use of ActiveRecord::Base.connection.execute?
For completeness, my update function is as follows:
# Migrate the given serialized YAML column from Syck to Psych
# (if any).
def migrate_to_psych(table, column)
  table_name = ActiveRecord::Base.connection.quote_table_name(table)
  column_name = ActiveRecord::Base.connection.quote_column_name(column)

  fetch_data(table_name, column_name).each do |row|
    transformed = ::Psych.dump(convert(Syck.load(row[column])))

    ActiveRecord::Base.connection.execute <<-SQL
      UPDATE #{table_name}
      SET #{column_name} = #{ActiveRecord::Base.connection.quote(transformed)}
      WHERE id = #{row['id']};
    SQL
  end
end

def fetch_data(table_name, column_name)
  ActiveRecord::Base.connection.select_all <<-SQL
    SELECT id, #{column_name}
    FROM #{table_name}
    WHERE #{column_name} LIKE '---%'
  SQL
end
I got this from http://fossies.org/linux/openproject/db/migrate/migration_utils/legacy_yamler.rb

You can easily build something with SQL's LIMIT and OFFSET clauses:
def fetch_data(table_name, column_name)
  batch_size, offset = 1000, 0

  begin
    batch = ActiveRecord::Base.connection.select_all <<-SQL
      SELECT id, #{column_name}
      FROM #{table_name}
      WHERE #{column_name} LIKE '---%'
      LIMIT #{batch_size}
      OFFSET #{offset}
    SQL

    batch.each do |row|
      yield row
    end

    offset += batch_size
  end until batch.empty?
end
which you can use almost exactly as before, just without the .each:
fetch_data(table_name, column_name) do |row| ... end
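One caveat: OFFSET-based paging rescans the skipped rows on every batch and gets slower as the offset grows. A common alternative is keyset pagination on the primary key; a minimal sketch of the same fetch_data (editorial, not from the original answer), assuming id is a monotonically increasing primary key:
def fetch_data(table_name, column_name)
  batch_size, last_id = 1000, 0

  loop do
    batch = ActiveRecord::Base.connection.select_all <<-SQL
      SELECT id, #{column_name}
      FROM #{table_name}
      WHERE #{column_name} LIKE '---%'
        AND id > #{last_id}
      ORDER BY id
      LIMIT #{batch_size}
    SQL

    break if batch.empty?
    batch.each { |row| yield row }
    last_id = batch.to_a.last['id'] # resume after the highest id seen
  end
end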
HTH!

Related

Rails audited create in batch

I work with Rails 5 / audited 4.6.5.
I have batch actions on more than 2000 items at once.
To make it usable, I need to use the update_all function.
Then, I would like to create the needed audits in one go.
What I would like to do is something like this:
Followup.where(id: ids).in_batches(of: 500) do |group|
  # quick update for user responsiveness
  group.update_all(step_id: 1)
  group.delay.newAudits(step_id: 1)
end
But the audited gem looks to be too basic for that.
I'm sure a lot of people have faced issues like this before.
After a lot of iterations, I managed to create an optimized query. I put it in an AuditBatch class and also added delay:
class AuditBatch
  ############################################
  ## Init a batch update from a list of ids.
  ## The aim is to avoid instantiations in controllers.
  ## @param auditable_type [String] class name (e.g. 'Followup')
  ## @param auditable_ids [Array<Integer>] ids of objects to update
  ## @param changes [Hash] changes {key1: newval, key2: newval}
  ############################################
  def self.init_batch_creation(auditable_type, auditable_ids, changes)
    obj = Object.const_get(auditable_type)
    group = obj.where(id: auditable_ids)
    AuditBatch.delay.batch_creation(group, changes, false)
  end

  ############################################
  ## Insert a long list of audits in one go.
  ## @param group [Array] array of auditable objects
  ## @param changes [Hash] changes {key1: newval, key2: newval}
  ############################################
  def self.batch_creation(group, changes, delayed = true)
    sql = 'INSERT INTO audits ("action", "audited_changes", "auditable_id", "auditable_type", "created_at", "version", "request_uuid")
           VALUES '
    total = group.size

    group.each_with_index do |g, index|
      parameters = 'json_build_object('
      length = changes.size
      i = 1
      changes.each do |key, val|
        parameters += "'#{key}',"
        parameters += "json_build_array("
        parameters += "(SELECT ((audited_changes -> '#{key}')::json->>1) FROM audits WHERE auditable_id = #{g.id} ORDER BY id DESC LIMIT 1),"
        parameters += val.is_a?(String) ? "'#{val}'" : val.to_s
        parameters += ')'
        parameters += ',' if i < length
        i += 1
      end
      parameters += ')'

      sql += "('update', #{parameters}, #{g.id}, '#{g.class.name}', '#{Time.now}', (SELECT max(version) FROM audits WHERE auditable_id = #{g.id}) + 1, '#{SecureRandom.uuid}')"
      sql += ", " if (index + 1) < total
    end

    if delayed == true
      AuditBatch.delay.execute_delayed_sql(sql)
    else
      ActiveRecord::Base.connection.execute(sql)
    end
  end

  def self.execute_delayed_sql(sql)
    ActiveRecord::Base.connection.execute(sql)
  end
end
With group.update_all your callbacks are skipped, so audited never records the changes in a new audit.
You cannot manually create audits for those records after the fact either: even if you build the audit rows yourself (filling in action, audited_changes, auditable_id, auditable_type, created_at, version and request_uuid), you need a reference for "what changed?" (which goes in audited_changes), and those changes are already lost once update_all has run on the group.
This is also documented in this audited issue: https://github.com/collectiveidea/audited/issues/352
paper_trail, another such gem which retains change logs, has the same issue: https://github.com/airblade/paper_trail/issues/337
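When the new values are known up front (as in the question, where step_id is set to a fixed value), one workaround is to read the old values before update_all and write the audit rows yourself in bulk. A minimal sketch (editorial, not from the answers), assuming Rails 6+ for insert_all and audited's default Audited::Audit model; the serialization of audited_changes must match your audits schema (YAML text by default):
Followup.where(id: ids).in_batches(of: 500) do |group|
  # capture the old values before update_all discards them
  old_values = group.pluck(:id, :step_id)

  group.update_all(step_id: 1)

  now = Time.current
  rows = old_values.map do |id, old_step_id|
    {
      action:          'update',
      audited_changes: { 'step_id' => [old_step_id, 1] }.to_yaml, # use a raw hash for json(b) columns
      auditable_id:    id,
      auditable_type:  'Followup',
      created_at:      now,
      request_uuid:    SecureRandom.uuid
      # version is omitted here; compute it as in the answer above if you rely on it
    }
  end
  Audited::Audit.insert_all(rows) # one INSERT per batch, no per-row callbacks
end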

How to perform a rails migration with dynamically generated sql?

I'm trying to run inserts through either a PG::Connection or ActiveRecord::Base.connection while parsing a text file during a Rails migration. The migration seems to run out of memory pretty fast, much faster than when I run the same PG::Connection code in a plain Ruby script. What is the cause of this?
def import_movies(db_conn)
  title_re = /^(#{$title}) \s+ \([0-9]+\) \s+ ([0-9]+)$/ix
  i = 0

  db_conn.transaction do |conn|
  # ActiveRecord::Base.transaction do # tried this also, slow
    stmt = prepare2(conn, "INSERT INTO movies (title, year) VALUES (?, ?);")

    File.new("#{DUMP_PATH}/movies.list").each_line do |line|
      print "." if (i = i + 1) % 5000 == 0; STDOUT.flush

      if match = title_re.match(line)
        stmt.execute!(match[1], match[2].to_i)
      end
    end
  end
  puts
end
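For comparison, here is a minimal sketch of the same load done with the pg gem directly, using a server-side prepared statement and committing in fixed-size chunks so the transaction never grows unbounded. PG::Connection#prepare and #exec_prepared are stock pg APIs (prepare2 in the question is a custom helper); the connection settings are placeholders, and the $title pattern from the question is replaced with a generic one:
require 'pg'

BATCH_SIZE = 10_000
title_re = /^(.+?) \s+ \([0-9]+\) \s+ ([0-9]+)$/x

conn = PG.connect(dbname: 'movies_dev') # placeholder connection settings
conn.prepare('ins_movie', 'INSERT INTO movies (title, year) VALUES ($1, $2)')

count = 0
conn.exec('BEGIN')
File.foreach("#{DUMP_PATH}/movies.list") do |line|
  next unless (m = title_re.match(line))

  conn.exec_prepared('ins_movie', [m[1], m[2].to_i])
  count += 1
  if (count % BATCH_SIZE).zero?
    conn.exec('COMMIT') # flush this chunk...
    conn.exec('BEGIN')  # ...and start a fresh transaction
  end
end
conn.exec('COMMIT')
conn.close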

How to import a large size (5.5Gb) CSV file to Postgresql using ruby on rails?

I have a huge CSV file of 5.5 GB in size; it has more than 100 columns. I want to import only specific columns from the CSV file. What are the possible ways to do this?
I want to import it into two different tables: only one field into one table and the rest of the fields into another table.
Should I use the COPY command in PostgreSQL, or the CSV class, or SmartCSV-style gems for this purpose?
Regards,
Suresh.
If I had 5 GB of CSV, I'd better import it without Rails! But you may have a use case that needs Rails...
Since you've said RAILS, I suppose you are talking about a web request and ActiveRecord...
If you don't care about waiting (and hanging one instance of your server process) you can do this:
Before you start, notice 2 things: 1) the use of a temp table, so in case of errors you don't mess with your dest table - this is optional, of course; 2) the use of an option to truncate the dest table first.
CONTROLLER ACTION:
def updateDB
  remote_file = params[:remote_file] ## <ActionDispatch::Http::UploadedFile>
  truncate    = (params[:truncate] == 'true') ? true : false

  if remote_file
    result = Model.csv2tempTable(remote_file.original_filename, remote_file.tempfile)
    if result[:result]
      Model.updateFromTempTable(truncate)
      flash[:notice] = 'Success.'
    else
      flash[:error] = 'Errors: ' + result[:errors].join(" ==>")
    end
  else
    flash[:error] = 'Error: no file given.'
  end

  redirect_to somewhere_else_path
end
MODEL METHODS:
# References:
# http://www.kadrmasconcepts.com/blog/2013/12/15/copy-millions-of-rows-to-postgresql-with-rails/
# http://stackoverflow.com/questions/14526489/using-copy-from-in-a-rails-app-on-heroku-with-the-postgresql-backend
# http://www.postgresql.org/docs/9.1/static/sql-copy.html
#
def self.csv2tempTable(uploaded_name, uploaded_file)
  errors = []
  begin
    # read csv file
    file = uploaded_file
    Rails.logger.info "Creating temp table...\n From: #{uploaded_name}\n "

    # init connection
    conn = ActiveRecord::Base.connection
    rc = conn.raw_connection

    # create the temp table like the dest table, minus created_at/updated_at
    rc.exec "drop table IF EXISTS #{TEMP_TABLE};"
    rc.exec "create table #{TEMP_TABLE} (like #{self.table_name});"
    rc.exec "alter table #{TEMP_TABLE} drop column created_at, drop column updated_at;"

    # copy it!
    rc.exec("COPY #{TEMP_TABLE} FROM STDIN WITH CSV HEADER")
    while !file.eof?
      # Add row to copy data
      l = file.readline
      if l.encoding.name != 'UTF-8'
        Rails.logger.info "line encoding is #{l.encoding.name}..."
        # ENCODING:
        # If the source string is already encoded in UTF-8, then just calling .encode('UTF-8') is a no-op,
        # and no checks are run. However, converting it to UTF-16 first forces all the checks for invalid byte
        # sequences to be run, and replacements are done as needed.
        # Reference: http://stackoverflow.com/questions/2982677/ruby-1-9-invalid-byte-sequence-in-utf-8?rq=1
        l = l.encode('UTF-16', 'UTF-8').encode('UTF-8', 'UTF-16')
      end
      Rails.logger.info "writing line with encoding #{l.encoding.name} => #{l[0..80]}"
      rc.put_copy_data( l )
    end

    # We are done adding copy data
    rc.put_copy_end

    # Collect any error messages
    while res = rc.get_result
      e_message = res.error_message
      if e_message.present?
        errors << "Error executing SQL: \n" + e_message
      end
    end
  rescue StandardError => e
    errors << "Error in csv2tempTable: \n #{e} => #{e.to_yaml}"
  end

  if errors.present?
    Rails.logger.error errors.join("*******************************\n")
    { result: false, errors: errors }
  else
    { result: true, errors: [] }
  end
end
# copy from TEMP_TABLE into self.table_name
# If <truncate> = true, truncates self.table_name first
# If <truncate> = false, updates lines from TEMP_TABLE into self.table_name
#
def self.updateFromTempTable(truncate)
  begin
    Rails.logger.info "Refreshing table #{self.table_name}...\n Truncate: #{truncate}\n "

    # init connection
    conn = ActiveRecord::Base.connection
    rc = conn.raw_connection

    if truncate
      rc.exec "TRUNCATE TABLE #{self.table_name}"
      return false unless check_exec(rc)
      rc.exec "INSERT INTO #{self.table_name} SELECT *, '#{DateTime.now}' as created_at, '#{DateTime.now}' as updated_at FROM #{TEMP_TABLE}"
      return false unless check_exec(rc)
    else
      # remove lines from self.table_name that are present in temp
      rc.exec "DELETE FROM #{self.table_name} WHERE id IN ( SELECT id FROM #{TEMP_TABLE} )"
      return false unless check_exec(rc)
      # copy lines from temp into self + include timestamps
      rc.exec "INSERT INTO #{self.table_name} SELECT *, '#{DateTime.now}' as created_at, '#{DateTime.now}' as updated_at FROM #{TEMP_TABLE};"
      return false unless check_exec(rc)
    end
  rescue StandardError => e
    Rails.logger.error "Error in updateFromTempTable: \n #{e} => #{e.to_yaml}"
    return false
  end
  true
end
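To address the column-selection part of the question: COPY accepts an explicit column list, so one way to import only specific columns (and split them across two tables) is to project each row in Ruby with the standard CSV library and feed each COPY its own subset. A sketch with made-up table and column names, at the cost of reading the file twice:
require 'csv'

rc = ActiveRecord::Base.connection.raw_connection

# PG::Connection#copy_data wraps COPY ... FROM STDIN and finishes the copy
# when the block returns. Hypothetical split: email goes to one table,
# a few profile fields go to another.
rc.copy_data("COPY users (email) FROM STDIN WITH (FORMAT csv)") do
  CSV.foreach('/tmp/huge.csv', headers: true) do |row|
    rc.put_copy_data([row['email']].to_csv)
  end
end

rc.copy_data("COPY profiles (first_name, last_name, city) FROM STDIN WITH (FORMAT csv)") do
  CSV.foreach('/tmp/huge.csv', headers: true) do |row|
    rc.put_copy_data(row.values_at('first_name', 'last_name', 'city').to_csv)
  end
end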

how to set query timeout for oracle 11 in ruby

I saw other threads stating how to do it for MySQL, and even how to do it in Java, but not how to set the query timeout in Ruby.
I'm trying to use the setQueryTimeout function in JRuby using OJDBC7, but can't find how to do it in Ruby. I've tried the following:
@c.connection.instance_variable_get(:@connection).instance_variable_set(:@query_timeout, 1)
@c.connection.instance_variable_get(:@connection).instance_variable_set(:@read_timeout, 1)
@c.connection.setQueryTimeout(1)
I also tried modifying my database.yml file to include
adapter: jdbc
driver: oracle.jdbc.driver.OracleDriver
timeout: 1
None of the above had any effect, other than setQueryTimeout, which threw a method error.
Any help would be great.
So I found a way to make it work, but I don't like it. It's very hackish and orphans queries on the database, but it at least allows my app to continue executing. I would still love to find a way to cancel the statement so I'm not orphaning queries that take longer than 10 seconds.
query_thread = Thread.new {
  # execute query
}
begin
  Timeout::timeout(10) do
    query_thread.join()
  end
rescue
  Thread.kill(query_thread)
  results = Array.new
end
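If you can get at the raw JDBC statement, java.sql.Statement#cancel will abort the server-side work instead of orphaning it. A sketch of that idea (editorial, not from the answers), reusing the get_raw_statement accessor patch from the answer below; the query is a placeholder:
sql = "SELECT COUNT(*) FROM some_big_table" # hypothetical long-running query
cursor = ActiveRecord::Base.connection.instance_variable_get(:@connection).prepare(sql)

watchdog = Thread.new do
  sleep 10
  cursor.get_raw_statement.cancel rescue nil # java.sql.Statement#cancel stops the running query
end

begin
  cursor.exec
ensure
  watchdog.kill # finished (or cancelled); stop the watchdog either way
end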
Query timeout on Oracle DB works for me with Rails 4 and JRuby.
With JRuby you can use the JDBC function statement.setQueryTimeout to define a query timeout.
However, this requires patching the oracle_enhanced adapter as shown below.
This example is an implementation of an iterator query that does not store the whole result in an array, and which also uses the query timeout.
# Hold open a SQL cursor and iterate over the SQL result without storing the whole result in an array
# Peter Ramm, 02.03.2016

# Expand class by a getter to allow access to the internal variable @raw_statement
ActiveRecord::ConnectionAdapters::OracleEnhancedJDBCConnection::Cursor.class_eval do
  def get_raw_statement
    @raw_statement
  end
end

# Class extension by module declaration (module ActiveRecord, module ConnectionAdapters, module OracleEnhancedDatabaseStatements)
# does not work as an engine with the Winstone application server, therefore hard manipulation of class ActiveRecord::ConnectionAdapters::OracleEnhancedAdapter
# and extension with method iterate_query
ActiveRecord::ConnectionAdapters::OracleEnhancedAdapter.class_eval do
  # Method comparable with ActiveRecord::ConnectionAdapters::OracleEnhancedDatabaseStatements.exec_query,
  # but without storing the whole result in memory
  def iterate_query(sql, name = 'SQL', binds = [], modifier = nil, query_timeout = nil, &block)
    type_casted_binds = binds.map { |col, val|
      [col, type_cast(val, col)]
    }
    log(sql, name, type_casted_binds) do
      cursor = nil
      cached = false
      if without_prepared_statement?(binds)
        cursor = @connection.prepare(sql)
      else
        unless @statements.key? sql
          @statements[sql] = @connection.prepare(sql)
        end
        cursor = @statements[sql]
        binds.each_with_index do |bind, i|
          col, val = bind
          cursor.bind_param(i + 1, type_cast(val, col), col)
        end
        cached = true
      end

      cursor.get_raw_statement.setQueryTimeout(query_timeout) if query_timeout

      cursor.exec

      if name == 'EXPLAIN' and sql =~ /^EXPLAIN/
        res = true
      else
        columns = cursor.get_col_names.map do |col_name|
          @connection.oracle_downcase(col_name).freeze
        end
        fetch_options = { :get_lob_value => (name != 'Writable Large Object') }
        while row = cursor.fetch(fetch_options)
          result_hash = {}
          columns.each_index do |index|
            result_hash[columns[index]] = row[index]
            row[index] = row[index].strip if row[index].class == String # Remove possible 0x00 at end of string, this leads to errors in Internet Explorer
          end
          result_hash.extend SelectHashHelper
          modifier.call(result_hash) unless modifier.nil?
          yield result_hash
        end
      end

      cursor.close unless cached
      nil
    end
  end # iterate_query
end # class_eval

class SqlSelectIterator
  def initialize(stmt, binds, modifier, query_timeout)
    @stmt          = stmt
    @binds         = binds
    @modifier      = modifier # proc for modification of each record
    @query_timeout = query_timeout
  end

  def each(&block)
    # Execute SQL and call block for every record of the result
    ActiveRecord::Base.connection.iterate_query(@stmt, 'sql_select_iterator', @binds, @modifier, @query_timeout, &block)
  end
end
Use the above SqlSelectIterator class like this:
SqlSelectIterator.new(stmt, binds, modifier, query_timeout).each do |record|
  process(record)
end

Removing “duplicate objects” with same attributes using Array.map

As you can see in the current code below, I am finding duplicates based on the attribute recordable_id. What I need to do is find duplicates based on four matching attributes: user_id, recordable_type, hero_type, recordable_id. How must I modify the code?
heroes = User.heroes
for hero in heroes
  hero_statuses = hero.hero_statuses
  seen = []
  hero_statuses.sort! { |a, b| a.created_at <=> b.created_at } # sort by created_at
  hero_statuses.each do |hero_status|
    if seen.map(&:recordable_id).include? hero_status.recordable_id # check if the id has been seen already
      hero_status.revoke
    else
      seen << hero_status # if not, add it to the seen array
    end
  end
end
Try this:
HeroStatus.all(:group => "user_id, recordable_type, hero_type, recordable_id",
               :having => "count(*) > 1").each do |status|
  status.revoke
end
Edit 2
To revoke all of the later duplicate entries (keeping the earliest), do the following:
HeroStatus.all(:joins => "(
    SELECT user_id, recordable_type, hero_type,
           recordable_id, MIN(created_at) AS created_at
    FROM hero_statuses
    GROUP BY user_id, recordable_type, hero_type, recordable_id
    HAVING COUNT(*) > 1
  ) AS A ON A.user_id = hero_statuses.user_id AND
            A.recordable_type = hero_statuses.recordable_type AND
            A.hero_type = hero_statuses.hero_type AND
            A.recordable_id = hero_statuses.recordable_id AND
            A.created_at < hero_statuses.created_at
").each do |status|
  status.revoke
end
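On Rails 4+ the same cleanup can be expressed with the group/having query API instead of the old finder-hash options. A sketch (editorial, not from the original answer): collect the duplicate key tuples, then revoke everything but the earliest row in each group:
dup_keys = HeroStatus.group(:user_id, :recordable_type, :hero_type, :recordable_id)
                     .having('COUNT(*) > 1')
                     .count # => { [user_id, type, hero_type, rec_id] => n, ... }

dup_keys.each_key do |user_id, recordable_type, hero_type, recordable_id|
  HeroStatus.where(user_id: user_id,
                   recordable_type: recordable_type,
                   hero_type: hero_type,
                   recordable_id: recordable_id)
            .order(:created_at)
            .offset(1) # keep the earliest, revoke the rest
            .each(&:revoke)
end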
Using straight Ruby (not the SQL server):
heroes = User.heroes
for hero in heroes
  hero_statuses = hero.hero_statuses
  seen = {}
  hero_statuses.sort_by!(&:created_at)
  hero_statuses.each do |status|
    key = [status.user_id, status.recordable_type, status.hero_type, status.recordable_id]
    if seen.has_key?(key)
      status.revoke
    else
      seen[key] = status # if not, add it to the seen hash
    end
  end
  remaining = seen.values
end
For lookups, always use a Hash (or a Set, but here I thought it would be nice to keep the statuses that were retained).
Note: I used sort_by!, but that's new to 1.9.2, so use sort_by (or require "backports")
