How to use Rails PostgreSQL connection raw.put_copy_data from a string instead of a file

I usually copy data into my Postgres database in Rails using the following import module.
In this case I am uploading a file that is already in the format the Postgres COPY command expects.
module Import #< ActiveRecord::Base
  class Customer
    include ActiveModel::Model
    include EncodingSupport

    attr_accessor :file
    validates :file, :presence => true

    def process file=nil
      file ||= @file.tempfile
      ActiveRecord::Base.connection.execute('truncate customers')
      conn = ActiveRecord::Base.connection_pool.checkout
      raw = conn.raw_connection
      raw.exec("COPY customers FROM STDIN WITH (FORMAT CSV, DELIMITER ',', NULL ' ', HEADER true)")
      # open up your CSV file, looping through line by line and getting each line into a format suitable for pg's COPY...
      data = file.open
      data.gets # read off the first line before streaming
      ticker = 0
      counter = 0
      success_counter = 0
      failed_records = []
      data.each_with_index do |line, index|
        raw.put_copy_data line
        counter += 1
      end
      # once all done...
      raw.put_copy_end
      while res = raw.get_result do; end # very important to do this after a copy
      ActiveRecord::Base.connection_pool.checkin(conn)
      return { :csv => false, :item_count => counter, :processed_successfully => counter, :errored_records => failed_records }
    end
  end
end
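For reference, a minimal way to invoke this module (a sketch only; the params structure is hypothetical, assuming the file arrives as a standard multipart upload):

importer = Import::Customer.new(file: params[:import][:file]) # params key is hypothetical
if importer.valid?
  result = importer.process # streams the tempfile into Postgres via COPY
end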
I now have another file that needs to be reformatted first, so I have another module that converts it from a text file to a CSV and trims out unnecessary content. Once it is ready I'd like to pass the data to the module above and have Postgres take it into the database.
def process file=nil
  file ||= @file.tempfile
  ticker = 0
  counter = 0
  col_order = [:warehouse_id, :customer_type_id, :pricelist_id]
  data = col_order.to_csv
  file.each do |line|
    line.strip!
    if item_line?(line)
      row = built_line # item_line? and built_line are helpers defined elsewhere in this module
      data += col_order.map { |col| row[col] }.to_csv
    else
      line.empty?
    end
    ticker += 1
    counter += 1
    if ticker == 1000
      p counter
      ticker = 0
    end
  end
  pg_import data
end
My problem is that the 'process' method returns data as
"warehouse_id,customer_type_id,pricelist_id\n201,A01,0AA\n201,A02,0AC
which means when I pass it to pg_import I can't iterate over the data, because pg_import expects it to be in the following format:
[0] "201,A01,0AA\r\n",
[1] "201,A02,0AC\r\n",
[2] "201,A03,oAE\r\n"
What command can I use to convert the string data so that I can iterate over it in the following loop?
data.each_with_index do |line, index|
  raw.put_copy_data line
  counter += 1
end
This probably has a really simple solution; I just wasn't expecting to be able to use put_copy_data without having a file to iterate over...

Solved this with the following processing before and within the loop.
data = CSV.parse(data)
data.shift # drop the header row
data.each_with_index do |line, index|
  line = line.to_csv
  raw.put_copy_data line
  counter += 1
end
This converts the string into an array of arrays with CSV, and the shift gets rid of the header row. That allows me to iterate over the arrays, convert each array back into a CSV string, and pass it into the database.
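As an aside, since the data is a single newline-delimited string, String#each_line would also work without a full CSV round-trip (a sketch, assuming no quoted field contains an embedded newline):

lines = data.each_line.drop(1) # skip the header row
lines.each do |line|
  raw.put_copy_data line
  counter += 1
end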

Related

How to check if a header exists before importing data in Ruby CSV?

I want to write the header only once, in the first row, when importing data to CSV in Ruby, but the header is written many times in the output file.
job_datas.each do |job_data|
  @company_job = job_data # converted etc....
  save_job_to_csv(@company_job)
end

def save_job_to_csv(job_data)
  filepath = "tmp/jobs/jobs.csv"
  CSV.open(filepath, "a", :headers => true) do |csv|
    if csv.blank?
      csv << CompanyJob.attribute_names
    end
    csv << job_data.attributes.values
  end
end
Can anyone give me a solution? Thank you so much!
You are calling the save_job_to_csv method for each job_data, and pushing the header every time with csv << CompanyJob.attribute_names.
filepath = "tmp/jobs/jobs.csv"
CSV.open(filepath, "a", :headers => true) do |csv|
  # push header once
  csv << CompanyJob.attribute_names
  # push every job record
  job_datas.each do |job_data|
    @company_job = job_data # converted etc....
    csv << @company_job.attributes.values
  end
end
The above script can be wrapped in a method. But if you would like a separate method that just saves the CSV, you need to refactor: first prepare an array of rows with the header at the front, then pass it to a method that only writes to the CSV, as in the sketch below.
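A minimal sketch of that refactor (the conversion step and method names here are illustrative, not from the original code):

def save_rows_to_csv(filepath, rows)
  CSV.open(filepath, "w") do |csv|
    rows.each { |row| csv << row }
  end
end

rows = [CompanyJob.attribute_names] # header first
job_datas.each do |job_data|
  @company_job = convert(job_data) # hypothetical conversion step
  rows << @company_job.attributes.values
end
save_rows_to_csv("tmp/jobs/jobs.csv", rows)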
You could do something similar to this:
def save_job_to_csv(job_data)
  filepath = "tmp/jobs/jobs.csv"
  unless File.file?(filepath)
    File.open(filepath, 'w') do |file|
      file.puts(job_data.attribute_names.join(','))
    end
  end
  CSV.open(filepath, "a", :headers => true) do |csv|
    csv << job_data.attributes.values
  end
end
It just checks beforehand whether the file exists and, if not, adds the header. If you want tabs as column separators, you just have to change the value passed to join and add the col_sep parameter to CSV.open():
file.puts(job_data.attribute_names.join("\t"))
CSV.open(filepath, "a", :headers => true, col_sep: "\t") do |csv|

Data is overwriting instead of appending to CSV

I am using a rake task and the CSV library to loop through one CSV, extract and alter the data I need, and then append each new row of data to a second CSV. However each row seems to be overwriting/replacing the previous row in the new CSV instead of being appended after it. I've looked at the documentation and googled but can't find any examples of appending rows to the CSV differently.
require 'csv'

namespace :replace do
  desc "replace variant id with variant sku"
  task :sku => :environment do
    file = "db/master-list-3-28.csv"
    CSV.foreach(file) do |row|
      msku, namespace, key, valueType, value = row
      valueArray = value.split('|')
      newValueString = ""
      valueArray.each_with_index do |v, index|
        recArray = v.split('*')
        handle = recArray[0]
        vid = recArray[1]
        newValueString << handle
        newValueString << "*"
        variant = ShopifyAPI::Variant.find(vid)
        newValueString << variant.sku
      end
      # end of value: save the newValueString to the new csv
      newFile = Rails.root.join('lib/assets', 'newFile.csv')
      CSV.open(newFile, "wb") do |csv|
        csv << [newValueString]
      end
    end
  end
end
Your mode when opening the file is wrong: it should be a+. See details in the docs: http://ruby-doc.org/core-2.2.4/IO.html#method-c-new
Also, you might want to open that file just once and not with every line.
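A sketch of both fixes applied to the task above (the inner loop body is elided; since the file is now opened once per run, plain "w" would also work if you want to rebuild the output from scratch each time):

newFile = Rails.root.join('lib/assets', 'newFile.csv')
CSV.open(newFile, "a+") do |csv|
  CSV.foreach(file) do |row|
    # ... build newValueString from row as above ...
    csv << [newValueString]
  end
end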

How to import a large (5.5 GB) CSV file into PostgreSQL using Ruby on Rails?

I have a huge CSV file, 5.5 GB in size, and it has more than 100 columns. I want to import only specific columns from the CSV file. What are the possible ways to do this?
I want to import it into two different tables: only one field into one table and the rest of the fields into another table.
Should I use the COPY command in PostgreSQL, the CSV class, or a gem like SmartCSV for this purpose?
Regards,
Suresh.
If I had 5 GB of CSV, I'd rather import it without Rails! But you may have a use case that needs Rails...
Since you've said RAILS, I suppose you are talking about a web request and ActiveRecord...
If you don't care about waiting (and tying up one instance of your server process) you can do this:
Before you start, notice two things: 1) the use of a temp table, so in case of errors you don't mess with your destination table - this is optional, of course; 2) the use of an option to truncate the destination table first.
CONTROLLER ACTION:
def updateDB
  remote_file = params[:remote_file] # #<ActionDispatch::Http::UploadedFile>
  truncate = (params[:truncate] == 'true')
  if remote_file
    result = Model.csv2tempTable(remote_file.original_filename, remote_file.tempfile)
    if result[:result]
      Model.updateFromTempTable(truncate)
      flash[:notice] = 'success.'
    else
      flash[:error] = 'Errors: ' + result[:errors].join(" ==>")
    end
  else
    flash[:error] = 'Error: no file given.'
  end
  redirect_to somewhere_else_path
end
MODEL METHODS:
# References:
# http://www.kadrmasconcepts.com/blog/2013/12/15/copy-millions-of-rows-to-postgresql-with-rails/
# http://stackoverflow.com/questions/14526489/using-copy-from-in-a-rails-app-on-heroku-with-the-postgresql-backend
# http://www.postgresql.org/docs/9.1/static/sql-copy.html
#
def self.csv2tempTable(uploaded_name, uploaded_file)
  errors = []
  begin
    # read csv file
    file = uploaded_file
    Rails.logger.info "Creating temp table...\n From: #{uploaded_name}\n "
    # init connection
    conn = ActiveRecord::Base.connection
    rc = conn.raw_connection
    # remove columns created_at/updated_at
    rc.exec "drop table IF EXISTS #{TEMP_TABLE};"
    rc.exec "create table #{TEMP_TABLE} (like #{self.table_name});"
    rc.exec "alter table #{TEMP_TABLE} drop column created_at, drop column updated_at;"
    # copy it!
    rc.exec("COPY #{TEMP_TABLE} FROM STDIN WITH CSV HEADER")
    while !file.eof?
      # Add row to copy data
      l = file.readline
      if l.encoding.name != 'UTF-8'
        Rails.logger.info "line encoding is #{l.encoding.name}..."
        # ENCODING:
        # If the source string is already encoded in UTF-8, then just calling .encode('UTF-8') is a no-op,
        # and no checks are run. However, converting it to UTF-16 first forces all the checks for invalid byte
        # sequences to be run, and replacements are done as needed.
        # Reference: http://stackoverflow.com/questions/2982677/ruby-1-9-invalid-byte-sequence-in-utf-8?rq=1
        l = l.encode('UTF-16', 'UTF-8').encode('UTF-8', 'UTF-16')
      end
      Rails.logger.info "writing line with encoding #{l.encoding.name} => #{l[0..80]}"
      rc.put_copy_data( l )
    end
    # We are done adding copy data
    rc.put_copy_end
    # Display any error messages
    while res = rc.get_result
      e_message = res.error_message
      if e_message.present?
        errors << "Error executing SQL: \n" + e_message
      end
    end
  rescue StandardError => e
    errors << "Error in csv2tempTable: \n #{e} => #{e.to_yaml}"
  end
  if errors.present?
    Rails.logger.error errors.join("*******************************\n")
    { result: false, errors: errors }
  else
    { result: true, errors: [] }
  end
end
# copy from TEMP_TABLE into self.table_name
# If <truncate> = true, truncates self.table_name first
# If <truncate> = false, updates lines from TEMP_TABLE into self.table_name
#
def self.updateFromTempTable(truncate)
  begin
    Rails.logger.info "Refreshing table #{self.table_name}...\n Truncate: #{truncate}\n "
    # init connection
    conn = ActiveRecord::Base.connection
    rc = conn.raw_connection
    if truncate
      rc.exec "TRUNCATE TABLE #{self.table_name}"
      return false unless check_exec(rc)
      rc.exec "INSERT INTO #{self.table_name} SELECT *, '#{DateTime.now}' as created_at, '#{DateTime.now}' as updated_at FROM #{TEMP_TABLE}"
      return false unless check_exec(rc)
    else
      # remove lines from self.table_name that are present in temp
      rc.exec "DELETE FROM #{self.table_name} WHERE id IN ( SELECT id FROM #{TEMP_TABLE} )"
      return false unless check_exec(rc)
      # copy lines from temp into self + include timestamps
      rc.exec "INSERT INTO #{self.table_name} SELECT *, '#{DateTime.now}' as created_at, '#{DateTime.now}' as updated_at FROM #{TEMP_TABLE};"
      return false unless check_exec(rc)
    end
  rescue StandardError => e
    Rails.logger.error "Error in updateFromTempTable: \n #{e} => #{e.to_yaml}"
    return false
  end
  true
end
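The question also asks about importing only specific columns. COPY FROM expects the input rows to match its column list exactly, so one way (a sketch, not part of the answer above; the column names are hypothetical) is to project the columns in Ruby while streaming, which keeps memory flat even for a 5.5 GB file:

require 'csv'

wanted = %w[name email] # hypothetical subset of the 100+ columns
rc = ActiveRecord::Base.connection.raw_connection
rc.exec("COPY #{TEMP_TABLE} (name, email) FROM STDIN WITH CSV")
CSV.foreach(uploaded_file.path, headers: true) do |row|
  rc.put_copy_data(wanted.map { |col| row[col] }.to_csv)
end
rc.put_copy_end
while res = rc.get_result do; end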

Buffered/RingBuffer IO in Ruby + Amazon S3 non-blocking chunk reads

I have huge CSV files (100 MB+) on Amazon S3 and I want to read them in chunks and process them using the Ruby CSV library. I'm having a hard time creating the right IO object for CSV processing:
buffer = TheRightIOClass.new
bytes_received = 0
RightAws::S3Interface.new(<access_key>, <access_secret>).retrieve_object(bucket, key) do |chunk|
  bytes_received += buffer.write(chunk)
  if bytes_received >= 1*MEGABYTE
    bytes_received = 0
    csv(buffer).each do |row|
      process_csv_record(row)
    end
  end
end

def csv(io)
  @csv ||= CSV.new(io, headers: true)
end
I don't know what the right setup should be here, or what TheRightIOClass is. I don't want to load the entire file into memory with StringIO. Is there a buffered IO or ring buffer in Ruby for this?
If anyone has a good solution using threads (no processes) and pipes, I would love to see it.
You can use StringIO and do some clever error handling to ensure you have an entire row in a chunk before handling it. The Packer class in this example just accumulates the parsed rows in memory until you flush them to disk or a database (a minimal sketch of such a class follows the snippet).
packer = Packer.new
object = AWS::S3.new.buckets[bucket].objects[path]
io = StringIO.new
csv = ::CSV.new(io, headers: true)
object.read do |chunk|
  # Append the most recent chunk and rewind the IO
  io << chunk
  io.rewind
  last_offset = 0
  begin
    while row = csv.shift do
      # Store the parsed row unless we're at the end of a chunk
      unless io.eof?
        last_offset = io.pos
        packer << row.to_hash
      end
    end
  rescue ArgumentError, ::CSV::MalformedCSVError => e
    # Only rescue malformed UTF-8 and CSV errors if we're at the end of a chunk
    raise e unless io.eof?
  end
  # Seek to our last offset, create a new StringIO with that partial row & advance the cursor
  io.seek(last_offset)
  io.reopen(io.read)
  io.read
  # Flush our accumulated rows to disk every 1 meg
  packer.flush if packer.bytes > 1*MEGABYTES
end
# Read the last row
io.rewind
packer << csv.shift.to_hash
packer
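The Packer class itself is not shown in the original answer; a minimal sketch of what it might look like (the flush target is a placeholder, and the byte accounting is deliberately rough):

class Packer
  attr_reader :bytes

  def initialize
    @rows = []
    @bytes = 0
  end

  def <<(row_hash)
    @rows << row_hash
    @bytes += row_hash.to_s.bytesize # rough size estimate
  end

  def flush
    # Placeholder: write @rows to disk or bulk-insert into a database here.
    @rows.clear
    @bytes = 0
  end
end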

Ignore first line on CSV parse in Rails

I am using the code from this tutorial to parse a CSV file and add the contents to a database table. How would I ignore the first line of the CSV file? The controller code is below:
def csv_import
  @parsed_file = CSV::Reader.parse(params[:dump][:file])
  n = 0
  @parsed_file.each do |row|
    s = Student.new
    s.name = row[0]
    s.cid = row[1]
    s.year_id = find_year_id_from_year_title(row[2])
    if s.save
      n = n + 1
      GC.start if n % 50 == 0
    end
    flash.now[:message] = "CSV Import Successful, #{n} new students added to the database."
  end
  redirect_to(students_url)
end
This question kept popping up when I was searching for how to skip the first line with the CSV / FasterCSV libraries, so here's the solution if you end up here.
The solution is:
CSV.foreach("path/to/file.csv",{:headers=>:first_row}) do |row|
HTH.
@parsed_file.each_with_index do |row, i|
  next if i == 0
  ....
If you identify your first line as headers then you get back a Row object instead of a simple Array.
When you grab cell values, it seems like you need to use .fetch("Row Title") on the Row object.
This is what I came up with. I'm skipping nil with my if conditional.
CSV.foreach("GitHubUsersToAdd.csv",{:headers=>:first_row}) do |row|
username = row.fetch("GitHub Username")
if username
puts username.inspect
end
end
Using this simple code, you can read a CSV file and ignore the first line, which holds the header or field names:
CSV.foreach(File.join(File.dirname(__FILE__), filepath), headers: true) do |row|
  puts row.inspect
end
You can do whatever you want with row. Don't forget headers: true.
require 'csv'

csv_content = <<EOF
lesson_id,user_id
5, 3
69, 95
EOF

parse_1 = CSV.parse csv_content
parse_1.size # => 3 # it treats all lines as equal data

parse_2 = CSV.parse csv_content, headers: true
parse_2.size # => 2 # it ignores the first line, as it's the header

parse_1
# => [["lesson_id", "user_id"], ["5", " 3"], ["69", " 95"]]
parse_2
# => #<CSV::Table mode:col_or_row row_count:3>
And here is the fun part:
parse_1.each do |line|
  puts line.inspect # each object is an Array
end
# ["lesson_id", "user_id"]
# ["5", " 3"]
# ["69", " 95"]

parse_2.each do |line|
  puts line.inspect # each object is a CSV::Row
end
# #<CSV::Row "lesson_id":"5" "user_id":" 3">
# #<CSV::Row "lesson_id":"69" "user_id":" 95">
So therefore I can do:
parse_2.each do |line|
puts "I'm processing Lesson #{line['lesson_id']} the User #{line['user_id']}"
end
# I'm processing Lesson 5 the User 3
# I'm processing Lesson 69 the User 95
data_rows_only = csv.drop(1)
will do it, and
csv.drop(1).each do |row|
  # ...
end
will loop over it.
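For completeness, a sketch of the above assuming csv was read up front with CSV.read (the filename is hypothetical):

require 'csv'

csv = CSV.read("students.csv") # array of arrays, header row included
csv.drop(1).each do |row|
  # row is a plain Array of the data cells, header already skipped
end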
