How to import a large (5.5 GB) CSV file into PostgreSQL using Ruby on Rails?

I have a huge CSV file, 5.5 GB in size, with more than 100 columns. I want to import only specific columns from the CSV file. What are the possible ways to do this?
I want to import it into two different tables: only one field into one table, and the rest of the fields into another table.
Should I use the COPY command in PostgreSQL, the CSV class, or a gem like SmartCSV for this purpose?

If I had 5 GB of CSV, I'd rather import it without Rails! But you may have a use case that needs Rails...
Since you've said Rails, I suppose you are talking about a web request and ActiveRecord...
If you don't care about waiting (and tying up one instance of your server process), you can do this:
Before the code, notice two things: 1) the use of a temp table, so that in case of errors you don't mess with your destination table (this is optional, of course); 2) the use of an option to truncate the destination table first.
CONTROLLER ACTION:
def updateDB
  remote_file = params[:remote_file] # <ActionDispatch::Http::UploadedFile>
  truncate = (params[:truncate] == 'true')
  if remote_file
    result = Model.csv2tempTable(remote_file.original_filename, remote_file.tempfile)
    if result[:result]
      Model.updateFromTempTable(truncate)
      flash[:notice] = 'Success.'
    else
      flash[:error] = 'Errors: ' + result[:errors].join(" ==>")
    end
  else
    flash[:error] = 'Error: no file given.'
  end
  redirect_to somewhere_else_path
end
MODEL METHODS:
# References:
# http://www.kadrmasconcepts.com/blog/2013/12/15/copy-millions-of-rows-to-postgresql-with-rails/
# http://stackoverflow.com/questions/14526489/using-copy-from-in-a-rails-app-on-heroku-with-the-postgresql-backend
# http://www.postgresql.org/docs/9.1/static/sql-copy.html
#
def self.csv2tempTable(uploaded_name, uploaded_file)
  errors = []
  begin
    # read csv file
    file = uploaded_file
    Rails.logger.info "Creating temp table...\n From: #{uploaded_name}\n "
    # init connection
    conn = ActiveRecord::Base.connection
    rc = conn.raw_connection
    # remove columns created_at/updated_at
    rc.exec "drop table IF EXISTS #{TEMP_TABLE};"
    rc.exec "create table #{TEMP_TABLE} (like #{self.table_name});"
    rc.exec "alter table #{TEMP_TABLE} drop column created_at, drop column updated_at;"
    # copy it!
    rc.exec("COPY #{TEMP_TABLE} FROM STDIN WITH CSV HEADER")
    until file.eof?
      # Add row to copy data
      l = file.readline
      if l.encoding.name != 'UTF-8'
        Rails.logger.info "line encoding is #{l.encoding.name}..."
        # ENCODING:
        # If the source string is already encoded in UTF-8, then just calling .encode('UTF-8') is a no-op,
        # and no checks are run. However, converting it to UTF-16 first forces all the checks for invalid byte
        # sequences to be run, and replacements are done as needed.
        # Reference: http://stackoverflow.com/questions/2982677/ruby-1-9-invalid-byte-sequence-in-utf-8?rq=1
        l = l.encode('UTF-16', 'UTF-8').encode('UTF-8', 'UTF-16')
      end
      Rails.logger.info "writing line with encoding #{l.encoding.name} => #{l[0..80]}"
      rc.put_copy_data( l )
    end
    # We are done adding copy data
    rc.put_copy_end
    # Collect any error messages
    while res = rc.get_result
      e_message = res.error_message
      if e_message.present?
        errors << "Error executing SQL: \n" + e_message
      end
    end
  rescue StandardError => e
    errors << "Error in csv2tempTable: \n #{e} => #{e.to_yaml}"
  end
  if errors.present?
    Rails.logger.error errors.join("*******************************\n")
    { result: false, errors: errors }
  else
    { result: true, errors: [] }
  end
end
# copy from TEMP_TABLE into self.table_name
# If <truncate> = true, truncates self.table_name first
# If <truncate> = false, update lines from TEMP_TABLE into self.table_name
#
def self.updateFromTempTable(truncate)
  begin
    Rails.logger.info "Refreshing table #{self.table_name}...\n Truncate: #{truncate}\n "
    # init connection
    conn = ActiveRecord::Base.connection
    rc = conn.raw_connection
    #
    if truncate
      rc.exec "TRUNCATE TABLE #{self.table_name}"
      return false unless check_exec(rc)
      rc.exec "INSERT INTO #{self.table_name} SELECT *, '#{DateTime.now}' as created_at, '#{DateTime.now}' as updated_at FROM #{TEMP_TABLE}"
      return false unless check_exec(rc)
    else
      # remove lines from self.table_name that are present in temp
      rc.exec "DELETE FROM #{self.table_name} WHERE id IN ( SELECT id FROM #{TEMP_TABLE} )"
      return false unless check_exec(rc)
      # copy lines from temp into self + include timestamps
      rc.exec "INSERT INTO #{self.table_name} SELECT *, '#{DateTime.now}' as created_at, '#{DateTime.now}' as updated_at FROM #{TEMP_TABLE};"
      return false unless check_exec(rc)
    end
  rescue StandardError => e
    Rails.logger.error "Error in updateFromTempTable: \n #{e} => #{e.to_yaml}"
    return false
  end
  true
end
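The question also asks about loading only specific columns and sending one field to a second table, which the code above doesn't cover. A minimal sketch of one way to do it, assuming you stage the whole CSV first; every table and column name below is hypothetical, made up for illustration:
conn = ActiveRecord::Base.connection.raw_connection

conn.exec "DROP TABLE IF EXISTS csv_staging;"
# one column per column in the CSV file (names here are hypothetical)
conn.exec "CREATE TABLE csv_staging (col_a text, col_b text, note text);"

# COPY the whole file into the staging table
conn.exec "COPY csv_staging FROM STDIN WITH CSV HEADER"
File.foreach("/path/to/huge.csv") { |line| conn.put_copy_data(line) }
conn.put_copy_end
while res = conn.get_result; end # drain results, as in the answer above

# then fan the columns out: one field to one table, the rest to another
conn.exec "INSERT INTO order_notes (note) SELECT note FROM csv_staging;"
conn.exec "INSERT INTO orders (col_a, col_b) SELECT col_a, col_b FROM csv_staging;"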

Related

Batching when using ActiveRecord::Base.connection.execute

I am busy writing a migration that will allow us to move our yamler from Syck to Psych and finally upgrade our project to Ruby 2. This migration is going to be seriously resource intensive, though, so I am going to need to use chunking.
I wrote the following method to confirm that the transformation I plan to use produces the expected result and can be done without downtime. To avoid ActiveRecord performing the serialization automatically, I needed to use ActiveRecord::Base.connection.execute.
My method that describes the transformation is as follows:
def show_summary(table, column_name)
  a = ActiveRecord::Base.connection.execute <<-SQL
    SELECT id, #{column_name} FROM #{table}
  SQL
  all_rows = a.to_a
  problem_rows = all_rows.select do |row|
    original_string = Syck.dump(Syck.load(row[1]))
    original_object = Syck.load(original_string)
    new_string = Psych.dump(original_object)
    new_object = Syck.load(new_string)
    Syck.dump(new_object) != original_string rescue true
  end
  problem_rows.map do |row|
    old_string = Syck.dump(Syck.load(row[1]))
    new_string = Psych.dump(Syck.load(old_string)) rescue "Parse failure"
    roundtrip_string = begin
      Syck.dump(Syck.load(new_string))
    rescue => e
      e.message
    end
    new_row = {}
    new_row[:id] = row[0]
    new_row[:original_encoding] = old_string
    new_row[:new_encoding] = roundtrip_string
    new_row
  end
end
How can you use batching when making use of ActiveRecord::Base.connection.execute?
For completeness, my update function is as follows:
# Migrate the given serialized YAML column from Syck to Psych
# (if any).
def migrate_to_psych(table, column)
  table_name = ActiveRecord::Base.connection.quote_table_name(table)
  column_name = ActiveRecord::Base.connection.quote_column_name(column)
  fetch_data(table_name, column_name).each do |row|
    transformed = ::Psych.dump(convert(Syck.load(row[column])))
    ActiveRecord::Base.connection.execute <<-SQL
      UPDATE #{table_name}
      SET #{column_name} = #{ActiveRecord::Base.connection.quote(transformed)}
      WHERE id = #{row['id']};
    SQL
  end
end

def fetch_data(table_name, column_name)
  ActiveRecord::Base.connection.select_all <<-SQL
    SELECT id, #{column_name}
    FROM #{table_name}
    WHERE #{column_name} LIKE '---%'
  SQL
end
I got this from http://fossies.org/linux/openproject/db/migrate/migration_utils/legacy_yamler.rb
You can easily build something with SQL's LIMIT and OFFSET clauses:
def fetch_data(table_name, column_name)
  batch_size, offset = 1000, 0
  begin
    batch = ActiveRecord::Base.connection.select_all <<-SQL
      SELECT id, #{column_name}
      FROM #{table_name}
      WHERE #{column_name} LIKE '---%'
      LIMIT #{batch_size}
      OFFSET #{offset}
    SQL
    batch.each do |row|
      yield row
    end
    offset += batch_size
  end until batch.empty?
end
which you can use almost exactly as before, just without the .each:
fetch_data(table_name, column_name) do |row| ... end
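For completeness, here is a sketch of how the migrate_to_psych method from the question would look against the block-yielding fetch_data; it assumes the same convert helper used in the question:
def migrate_to_psych(table, column)
  table_name  = ActiveRecord::Base.connection.quote_table_name(table)
  column_name = ActiveRecord::Base.connection.quote_column_name(column)
  # rows now arrive in batches of 1000, so memory stays bounded
  fetch_data(table_name, column_name) do |row|
    transformed = ::Psych.dump(convert(Syck.load(row[column])))
    ActiveRecord::Base.connection.execute <<-SQL
      UPDATE #{table_name}
      SET #{column_name} = #{ActiveRecord::Base.connection.quote(transformed)}
      WHERE id = #{row['id']};
    SQL
  end
end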
HTH!

Fluentd record with source filename parts

I'm using fluentd on a server to export logs.
My configuration uses something like this to capture several log files:
<source>
  type tail
  path /my/path/to/file/*/*.log
</source>
The different files are tracked properly; however, I need one more feature: the two wildcard parts of the path should be added to the record as well (let's call them directory and filename).
If the in_tail plugin added the filename to the record, I could write a formatter to split and edit it.
Am I missing something, or is rewriting in_tail to my heart's content the best way to go?
So, yes. Extending in_tail is the way to go.
I've written a new plugin that inherits from NewTailInput and uses a slightly different parse_singleline and parse_multilines to add the path to the record.
Much better than expected.
Update 6/3/2020:
I've dug up the code; this was the least Ruby I could muster to solve the problem.
Customize convert_line_to_event_with_path_names for your needs to add custom data to the records.
module Fluent
  class DirParsingTailInput < NewTailInput
    Plugin.register_input('dir_parsing_tail', self)

    def initialize
      super
    end

    def receive_lines(lines, tail_watcher)
      es = @receive_handler.call(lines, tail_watcher)
      unless es.empty?
        tag = if @tag_prefix || @tag_suffix
                @tag_prefix + tail_watcher.tag + @tag_suffix
              else
                @tag
              end
        begin
          router.emit_stream(tag, es)
        rescue
          # ignore errors. Engine shows logs and backtraces.
        end
      end
    end

    def convert_line_to_event_with_path_names(line, es, path)
      begin
        directory = File.basename(File.dirname(path))
        filename = File.basename(path, ".*")
        line.chomp! # remove \n
        @parser.parse(line) { |time, record|
          if time && record
            if directory != "logs"
              record["parent"] = directory
              record["child"] = filename
            else
              record["parent"] = filename
            end
            es.add(time, record)
          else
            log.warn "pattern not match: #{line.inspect}"
          end
        }
      rescue => e
        log.warn line.dump, :error => e.to_s
        log.debug_backtrace(e.backtrace)
      end
    end

    def parse_singleline(lines, tail_watcher)
      es = MultiEventStream.new
      lines.each { |line|
        convert_line_to_event_with_path_names(line, es, tail_watcher.path)
      }
      es
    end

    def parse_multilines(lines, tail_watcher)
      lb = tail_watcher.line_buffer
      es = MultiEventStream.new
      if @parser.has_firstline?
        lines.each { |line|
          if @parser.firstline?(line)
            if lb
              convert_line_to_event_with_path_names(lb, es, tail_watcher.path)
            end
            lb = line
          else
            if lb.nil?
              log.warn "got incomplete line before first line from #{tail_watcher.path}: #{line.inspect}"
            else
              lb << line
            end
          end
        }
      else
        lb ||= ''
        lines.each do |line|
          lb << line
          @parser.parse(lb) { |time, record|
            if time && record
              convert_line_to_event_with_path_names(lb, es, tail_watcher.path)
              lb = ''
            end
          }
        end
      end
      tail_watcher.line_buffer = lb
      es
    end
  end
end
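To use the plugin, the source block from the question would then point at the new input type; this is only a sketch, and the tag and format lines below are assumptions, not part of the original post:
<source>
  # tag and format are assumed values; adjust them for your logs
  type dir_parsing_tail
  path /my/path/to/file/*/*.log
  tag parsed.logs
  format none
</source>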

How to use the Rails PostgreSQL connection raw.put_copy_data from a string instead of a file

I usually copy data into my PostgreSQL database in Rails using the following import module.
In this case I am uploading a file that is already in the format PostgreSQL's COPY command expects.
module Import # < ActiveRecord::Base
  class Customer
    include ActiveModel::Model
    include EncodingSupport

    attr_accessor :file
    validates :file, :presence => true

    def process file=nil
      file ||= @file.tempfile
      ActiveRecord::Base.connection.execute('truncate customers')
      conn = ActiveRecord::Base.connection_pool.checkout
      raw = conn.raw_connection
      raw.exec("COPY customers FROM STDIN WITH (FORMAT CSV, DELIMITER ',', NULL ' ', HEADER true)")
      # open up your CSV file looping through line by line and getting the line into a format suitable for pg's COPY...
      data = file.open
      data.gets # read and discard the first line
      ticker = 0
      counter = 0
      success_counter = 0
      failed_records = []
      data.each_with_index do |line, index|
        raw.put_copy_data line
        counter += 1
      end
      # once all done...
      raw.put_copy_end
      while res = raw.get_result do; end # very important to do this after a copy
      ActiveRecord::Base.connection_pool.checkin(conn)
      return { :csv => false, :item_count => counter, :processed_successfully => counter, :errored_records => failed_records }
    end
  end
end
Now I have another file that needs to be formatted first, so I have another module that converts it from a text file to a CSV file and trims out unnecessary content. Once it is ready, I'd like to pass the data to the module above and have PostgreSQL take it into the database.
def pg_import file=nil
  file ||= @file.tempfile
  ticker = 0
  counter = 0
  col_order = [:warehouse_id, :customer_type_id, :pricelist_id]
  data = col_order.to_csv
  file.each do |line|
    line.strip!
    if item_line?(line)
      row = built_line
      data += col_order.map { |col| row[col] }.to_csv
    else
      line.empty?
    end
    ticker += 1
    counter += 1
    if ticker == 1000
      p counter
      ticker = 0
    end
  end
  pg_import data
end
My problem is that this builds data as a single string:
"warehouse_id,customer_type_id,pricelist_id\n201,A01,0AA\n201,A02,0AC"
which means I can't iterate over it in the copy loop, because that loop expects the data in the following format:
[0] "201,A01,0AA\r\n",
[1] "201,A02,0AC\r\n",
[2] "201,A03,oAE\r\n"
What command can I use to convert the string data so that I can iterate over it in the
data.each_with_index do |line, index|
  raw.put_copy_data line
  counter += 1
end
??
This probably has a really simple solution; I just wasn't expecting to be able to use put_copy_data without having a file to iterate over...
Solved this by adding the following before and within the loop.
data = CSV.parse(data)
data.shift

data.each_with_index do |line, index|
  line = line.to_csv
  raw.put_copy_data line
  counter += 1
end
This converts the string into an array of arrays with CSV, and the shift gets rid of the header row. That allows me to iterate over the arrays: grab each array, convert it back into a CSV string, and then pass it into the database.
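Since the generated string is already newline-delimited CSV, another option (a sketch, not part of the accepted approach) is to skip the CSV re-parsing and feed the lines straight to COPY:
# `data` is the CSV string built in pg_import, `raw` the PG raw connection.
data.lines.drop(1).each do |line| # drop(1) skips the header row
  raw.put_copy_data line
  counter += 1
end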

How to set a query timeout for Oracle 11 in Ruby

I saw other threads explaining how to do it for MySQL, and even how to do it in Java, but not how to set the query timeout in Ruby.
I'm trying to use the setQueryTimeout function in JRuby using OJDBC7, but can't find how to do it in Ruby. I've tried the following:
@c.connection.instance_variable_get(:@connection).instance_variable_set(:@query_timeout, 1)
@c.connection.instance_variable_get(:@connection).instance_variable_set(:@read_timeout, 1)
@c.connection.setQueryTimeout(1)
I also tried modifying my database.yml file to include
adapter: jdbc
driver: oracle.jdbc.driver.OracleDriver
timeout: 1
None of the above had any effect, other than setQueryTimeout, which threw a method error.
Any help would be great.
So I found a way to make it work, but I don't like it. It's very hackish and orphans queries on the database, but it at least allows my app to continue executing. I would still love to find a way to cancel the statement so I'm not orphaning queries that take longer than 10 seconds.
query_thread = Thread.new {
  # execute query
}
begin
  Timeout::timeout(10) do
    query_thread.join()
  end
rescue
  Thread.kill(query_thread)
  results = Array.new
end
Query timeout on Oracle-DB works for me with Rails 4 and JRuby
With JRuby you can use the JDBC function Statement#setQueryTimeout to define a query timeout.
However, this requires patching the oracle-enhanced adapter, as shown below.
This example implements an iterating query that does not store the whole result in an array, and that also uses a query timeout.
# hold open SQL-Cursor and iterate over SQL-result without storing whole result in Array
# Peter Ramm, 02.03.2016

# expand class by getter to allow access on internal variable @raw_statement
ActiveRecord::ConnectionAdapters::OracleEnhancedJDBCConnection::Cursor.class_eval do
  def get_raw_statement
    @raw_statement
  end
end

# Class extension by Module-Declaration: module ActiveRecord, module ConnectionAdapters, module OracleEnhancedDatabaseStatements
# does not work as Engine with Winstone application server, therefore hard manipulation of class ActiveRecord::ConnectionAdapters::OracleEnhancedAdapter
# and extension with method iterate_query
ActiveRecord::ConnectionAdapters::OracleEnhancedAdapter.class_eval do
  # Method comparable with ActiveRecord::ConnectionAdapters::OracleEnhancedDatabaseStatements.exec_query,
  # but without storing whole result in memory
  def iterate_query(sql, name = 'SQL', binds = [], modifier = nil, query_timeout = nil, &block)
    type_casted_binds = binds.map { |col, val|
      [col, type_cast(val, col)]
    }
    log(sql, name, type_casted_binds) do
      cursor = nil
      cached = false
      if without_prepared_statement?(binds)
        cursor = @connection.prepare(sql)
      else
        unless @statements.key? sql
          @statements[sql] = @connection.prepare(sql)
        end
        cursor = @statements[sql]
        binds.each_with_index do |bind, i|
          col, val = bind
          cursor.bind_param(i + 1, type_cast(val, col), col)
        end
        cached = true
      end

      cursor.get_raw_statement.setQueryTimeout(query_timeout) if query_timeout

      cursor.exec

      if name == 'EXPLAIN' and sql =~ /^EXPLAIN/
        res = true
      else
        columns = cursor.get_col_names.map do |col_name|
          @connection.oracle_downcase(col_name).freeze
        end
        fetch_options = {:get_lob_value => (name != 'Writable Large Object')}
        while row = cursor.fetch(fetch_options)
          result_hash = {}
          columns.each_index do |index|
            result_hash[columns[index]] = row[index]
            row[index] = row[index].strip if row[index].class == String # Remove possible 0x00 at end of string, this leads to error in Internet Explorer
          end
          result_hash.extend SelectHashHelper
          modifier.call(result_hash) unless modifier.nil?
          yield result_hash
        end
      end

      cursor.close unless cached
      nil
    end
  end # iterate_query
end # class_eval

class SqlSelectIterator
  def initialize(stmt, binds, modifier, query_timeout)
    @stmt          = stmt
    @binds         = binds
    @modifier      = modifier # proc for modification of record
    @query_timeout = query_timeout
  end

  def each(&block)
    # Execute SQL and call block for every record of result
    ActiveRecord::Base.connection.iterate_query(@stmt, 'sql_select_iterator', @binds, @modifier, @query_timeout, &block)
  end
end
Use the SqlSelectIterator class above like this:
SqlSelectIterator.new(stmt, binds, modifier, query_timeout).each do |record|
  process(record)
end
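If you only need the timeout and not the streaming iterator, the core trick can be reduced to a few lines. This is a sketch that reuses the Cursor#get_raw_statement getter patched in above and calls into oracle-enhanced internals, so verify the exact calls against your adapter version:
# per-statement timeout on the raw JDBC statement
jdbc_conn = ActiveRecord::Base.connection.instance_variable_get(:@connection)
cursor = jdbc_conn.prepare("SELECT * FROM some_big_table") # hypothetical table
cursor.get_raw_statement.setQueryTimeout(10) # seconds; raises once exceeded
cursor.exec
while row = cursor.fetch(:get_lob_value => true)
  # process row (an array of column values)
end
cursor.close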

Ignore first line on CSV parse in Rails

I am using the code from this tutorial to parse a CSV file and add the contents to a database table. How would I ignore the first line of the CSV file? The controller code is below:
def csv_import
  @parsed_file = CSV::Reader.parse(params[:dump][:file])
  n = 0
  @parsed_file.each do |row|
    s = Student.new
    s.name = row[0]
    s.cid = row[1]
    s.year_id = find_year_id_from_year_title(row[2])
    if s.save
      n = n + 1
      GC.start if n % 50 == 0
    end
    flash.now[:message] = "CSV Import Successful, #{n} new students added to the database."
  end
  redirect_to(students_url)
end
This question kept popping up when I was searching for how to skip the first line with the CSV / FasterCSV libraries, so here's the solution in case you end up here.
The solution is...
CSV.foreach("path/to/file.csv", { :headers => :first_row }) do |row|
  # row is yielded for every line after the header
end
HTH.
@parsed_file.each_with_index do |row, i|
  next if i == 0
  ....
If you identify your first line as headers then you get back a Row object instead of a simple Array.
When you grab cell values, it seems like you need to use .fetch("Row Title") on the Row object.
This is what I came up with. I'm skipping nil with my if conditional.
CSV.foreach("GitHubUsersToAdd.csv",{:headers=>:first_row}) do |row|
username = row.fetch("GitHub Username")
if username
puts username.inspect
end
end
Using this simple code, you can read a CSV file and ignore the first line, which contains the header or field names:
CSV.foreach(File.join(File.dirname(__FILE__), filepath), headers: true) do |row|
  puts row.inspect
end
You can do whatever you want with row. Don't forget headers: true.
require 'csv'

csv_content = <<EOF
lesson_id,user_id
5,3
69,95
EOF

parse_1 = CSV.parse csv_content
parse_1.size # => 3 # it treats all lines as equal data

parse_2 = CSV.parse csv_content, headers: true
parse_2.size # => 2 # it ignores the first line, as it's the header

parse_1
# => [["lesson_id", "user_id"], ["5", "3"], ["69", "95"]]
parse_2
# => #<CSV::Table mode:col_or_row row_count:3>
Here is the fun part:
parse_1.each do |line|
  puts line.inspect # the object is an Array
end
# ["lesson_id", "user_id"]
# ["5", "3"]
# ["69", "95"]

parse_2.each do |line|
  puts line.inspect # the object is a CSV::Row
end
# #<CSV::Row "lesson_id":"5" "user_id":"3">
# #<CSV::Row "lesson_id":"69" "user_id":"95">
So I can do:
parse_2.each do |line|
  puts "I'm processing Lesson #{line['lesson_id']} the User #{line['user_id']}"
end
# I'm processing Lesson 5 the User 3
# I'm processing Lesson 69 the User 95
data_rows_only = csv.drop(1)
will do it
csv.drop(1).each do |row|
  # ...
end
will loop it
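Tying this back to the controller in the question, here is a sketch of csv_import using CSV.foreach with headers: true; the header names ("name", "cid", "year") and the use of the uploaded file's path are assumptions about the CSV, not something stated in the original post:
def csv_import
  n = 0
  # headers: true consumes the first line as headers, so it never becomes a Student
  CSV.foreach(params[:dump][:file].path, headers: true) do |row|
    s = Student.new
    s.name    = row["name"] # assumed header names
    s.cid     = row["cid"]
    s.year_id = find_year_id_from_year_title(row["year"])
    n += 1 if s.save
  end
  flash[:message] = "CSV Import Successful, #{n} new students added to the database."
  redirect_to(students_url)
end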
