Buffered/RingBuffer IO in Ruby + Amazon S3 non-blocking chunk reads - ruby-on-rails

I have huge CSV files (100MB+) on Amazon S3 and I want to read them in chunks and process them with Ruby's CSV library. I'm having a hard time creating the right IO object for CSV processing:
buffer = TheRightIOClass.new
bytes_received = 0
RightAws::S3Interface.new(<access_key>, <access_secret>).retrieve_object(bucket, key) do |chunk|
  bytes_received += buffer.write(chunk)
  if bytes_received >= 1*MEGABYTE
    bytes_received = 0
    csv(buffer).each do |row|
      process_csv_record(row)
    end
  end
end

def csv(io)
  @csv ||= CSV.new(io, headers: true)
end
I don't know what the right setup should be here, or what TheRightIOClass is. I don't want to load the entire file into memory with StringIO. Is there a buffered IO or ring buffer in Ruby for this?
If anyone has a good solution using threads (no processes) and pipes, I would love to see it.

You can use StringIO and do some clever error handling to ensure you have an entire row in a chunk before handling it. The Packer class in this example just accumulates the parsed rows in memory until you flush them to disk or a database (a minimal sketch of it follows the code below).
packer = Packer.new
object = AWS::S3.new.buckets[bucket].objects[path]
io = StringIO.new
csv = ::CSV.new(io, headers: true)

object.read do |chunk|
  # Append the most recent chunk and rewind the IO
  io << chunk
  io.rewind

  last_offset = 0
  begin
    while row = csv.shift do
      # Store the parsed row unless we're at the end of a chunk
      unless io.eof?
        last_offset = io.pos
        packer << row.to_hash
      end
    end
  rescue ArgumentError, ::CSV::MalformedCSVError => e
    # Only rescue malformed UTF-8 and CSV errors if we're at the end of a chunk
    raise e unless io.eof?
  end

  # Seek to our last offset, create a new StringIO with that partial row & advance the cursor
  io.seek(last_offset)
  io.reopen(io.read)
  io.read

  # Flush our accumulated rows to disk every 1 MB
  packer.flush if packer.bytes > 1*MEGABYTES
end

# Read the last row
io.rewind
packer << csv.shift.to_hash
packer
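The Packer class itself isn't shown here; below is a minimal sketch of what it might look like, inferred from how it is used above (the names and the flush target are assumptions, and the persistence step is left as a stub):
class Packer
  attr_reader :rows, :bytes

  def initialize
    @rows = []
    @bytes = 0
  end

  # Accumulate a parsed row (a Hash) and keep a rough running byte count.
  def <<(row)
    @rows << row
    @bytes += row.to_s.bytesize
    self
  end

  # Persist the accumulated rows (to disk, a database, etc.) and reset the buffer.
  def flush
    # write @rows somewhere durable here -- left as a stub
    @rows.clear
    @bytes = 0
  end
end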

Related

Rake task handle 404

I am using a rake task to take data from one CSV, call the Shopify API using that data, and save the response to another CSV. The problem is that I have no error handling in place, so if the Shopify API cannot find the resource I provided, the whole task gets aborted. What is the best way to handle the error so that if the resource is not found in Shopify, it is simply skipped and the task proceeds to the next row?
The line calling the Shopify API in the code below is:
variant = ShopifyAPI::Variant.find(vid)
namespace :replace do
  desc "replace variant id with variant sku"
  task :sku => :environment do
    file = "db/master-list-3-28.csv"
    newFile = Rails.root.join('lib/assets', 'newFile.csv')
    CSV.open(newFile, "a+") do |csv|
      CSV.foreach(file) do |row|
        msku, namespace, key, valueType, value = row
        valueArray = value.split('|')
        newValueString = ""
        valueArray.each_with_index do |v, index|
          recArray = v.split('*')
          handle = recArray[0]
          vid = recArray[1]
          newValueString << handle
          newValueString << "*"
          # use api call to retrieve variant sku using handle and vid
          # replace vid with sku and save to csv
          variant = ShopifyAPI::Variant.find(vid)
          sleep 1
          # puts variant.sku
          newValueString << variant.sku
          if index < 2
            newValueString << "|"
          end
        end
        # end of value: save the newValueString to the new csv
        csv << [newValueString]
      end
    end
  end
end
Here's a simple way to get it done:
begin
  variant = ShopifyAPI::Variant.find(vid)
rescue
  next
end
If an exception is raised, the code in the rescue branch runs; here next skips the rest of the current iteration and moves on to the next entry.
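A bare rescue also swallows unrelated failures (timeouts, rate limits, typos). If you're on the ActiveResource-based shopify_api gem, a 404 normally surfaces as ActiveResource::ResourceNotFound, so a narrower rescue (the exception class is an assumption about your gem version) keeps other errors visible:
begin
  variant = ShopifyAPI::Variant.find(vid)
rescue ActiveResource::ResourceNotFound
  # The variant id no longer exists in Shopify -- skip it and move on.
  next
end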

How to use Rails postgresql connection raw.put_copy_data from string instead of a file

I usually copy data into my Postgres database in Rails using the following import module.
In this case I am uploading a file that is already in the format Postgres's COPY command expects.
module Import # < ActiveRecord::Base
  class Customer
    include ActiveModel::Model
    include EncodingSupport

    attr_accessor :file
    validates :file, :presence => true

    def process file=nil
      file ||= @file.tempfile
      ActiveRecord::Base.connection.execute('truncate customers')

      conn = ActiveRecord::Base.connection_pool.checkout
      raw = conn.raw_connection
      raw.exec("COPY customers FROM STDIN WITH (FORMAT CSV, DELIMITER ',', NULL ' ', HEADER true)")

      # open up your CSV file looping through line by line and getting the line into a format suitable for pg's COPY...
      data = file.open
      data.gets

      ticker = 0
      counter = 0
      success_counter = 0
      failed_records = []

      data.each_with_index do |line, index|
        raw.put_copy_data line
        counter += 1
      end

      # once all done...
      raw.put_copy_end
      while res = raw.get_result do; end # very important to do this after a copy
      ActiveRecord::Base.connection_pool.checkin(conn)

      return { :csv => false, :item_count => counter, :processed_successfully => counter, :errored_records => failed_records }
    end
I now have another file that needs to be formatted first, so I have another method that converts it from a text file to CSV and trims out the unnecessary content. Once it is ready I'd like to pass the data to the module above and have Postgres take it into the database.
def pg_import file=nil
  file ||= @file.tempfile
  ticker = 0
  counter = 0

  col_order = [:warehouse_id, :customer_type_id, :pricelist_id]
  data = col_order.to_csv

  file.each do |line|
    line.strip!
    if item_line?(line)
      row = built_line
      data += col_order.map { |col| row[col] }.to_csv
    else
      line.empty?
    end

    ticker += 1
    counter += 1
    if ticker == 1000
      p counter
      ticker = 0
    end
  end

  pg_import data
end
My problem is that the 'process' method returns data as
"warehouse_id,customer_type_id,pricelist_id\n201,A01,0AA\n201,A02,0AC"
which means when I pass it to pg_import I can't iterate over the data, because it expects it to be in the following format:
[0] "201,A01,0AA\r\n",
[1] "201,A02,0AC\r\n",
[2] "201,A03,oAE\r\n"
What command can I use to convert the string data so that I can iterate over it in
data.each_with_index do |line, index|
  raw.put_copy_data line
  counter += 1
end
?
This probably has a really simple solution; I just wasn't expecting to be able to use put_copy_data without having a file to iterate over...
Solved this by adding the following before and within the loop.
data = CSV.parse(data)
data.shift

data.each_with_index do |line, index|
  line = line.to_csv
  raw.put_copy_data line
  counter += 1
end
CSV.parse converts the string into an array of arrays, and shift drops the header row. That lets me iterate over the rows: each inner array is converted back into a CSV string with to_csv and then passed to the database.
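If the accumulated string ever gets large, CSV.parse builds the whole array of arrays in memory first. Since each line of the string is already a valid CSV row, an alternative (a sketch, not part of the original solution) is to stream it with String#each_line and just skip the header:
data.each_line.with_index do |line, index|
  next if index.zero?      # skip the header row
  raw.put_copy_data line   # each line already ends in a newline, which COPY expects
  counter += 1
end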

optimizing reading database and writing to csv file

I'm trying to read a large number of rows from the database (over 100,000) and write them to a CSV file on a VPS Ubuntu server. It turns out the server doesn't have enough memory.
I was thinking about reading 5000 rows at once, writing them to the file, then reading another 5000, and so on.
How should I restructure my current code so that memory isn't exhausted?
Here's my code:
def write_rows(emails)
  File.open(file_path, "w+") do |f|
    f << "email,name,ip,created\n"
    emails.each do |l|
      f << [l.email, l.name, l.ip, l.created_at].join(",") + "\n"
    end
  end
end
The method is called from a Sidekiq worker with:
write_rows(user.emails)
Thanks for the help!
The problem here is that when you call emails.each, ActiveRecord loads all the records from the database and keeps them in memory. To avoid this you can use the method find_each:
require 'csv'

BATCH_SIZE = 5000

def write_rows(emails)
  CSV.open(file_path, 'w') do |csv|
    csv << %w{email name ip created}
    emails.find_each do |email|
      csv << [email.email, email.name, email.ip, email.created_at]
    end
  end
end
By default find_each loads records in batches of 1000 at a time. If you want to load batches of 5000 records, you have to pass the :batch_size option to find_each:
emails.find_each(:batch_size => 5000) do |email|
  ...
More information about the find_each method (and the related find_in_batches) can be found in the Ruby on Rails Guides.
I've used the CSV class to write the file instead of joining fields and lines by hand. This is not intended to be a performance optimization, since writing to the file shouldn't be the bottleneck here.
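For reference, the related find_in_batches mentioned above yields each batch as an array instead of yielding records one by one. The same export written that way might look like this (a sketch, functionally equivalent to the version above):
require 'csv'

def write_rows(emails)
  CSV.open(file_path, 'w') do |csv|
    csv << %w{email name ip created}
    emails.find_in_batches(:batch_size => 5000) do |batch|
      batch.each do |email|
        csv << [email.email, email.name, email.ip, email.created_at]
      end
    end
  end
end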

Save large text file without using too much memory

I have a model that creates a KML file. I treat the KML as a string and forward it to a mailer, which then delivers it:
def write_kml(coords3d, time)
  kml = String.new
  kml << header
  coords3d.each do |coords|
    coordinates = String.new
    coords.each do |coord|
      lat = coord[0].to_f
      lng = coord[1].to_f
      coordinates << "#{lng},#{lat},0 "
      kml << polygon(coordinates)
    end
  end
  kml << footer
  kml
end
This gets used here:
CsvMailer.kml_send(kml, time, mode, email).deliver
Mailer:
def kml_send(kml, time, mode, email)
#time = (time / 60).to_i
#mode = mode
gen_time = Time.now
file_name = gen_time.strftime('%Y-%m-%d %H:%M:%S') + " #{#mode.to_s}" + " #{#time.to_s}(mins)"
attachments[file_name + '(KML).kml'] = { mime_type: 'text/kml', content: kml}
mail to: email, subject: ' KML Filem'
end
It takes up a huge amount of memory. Some of these files are quite large (200MB), so on Heroku, for example, they take up too much space.
I had some ideas about using S3, but I would need to create the file to begin with, so it would still use memory. Can I write straight to S3 without using memory?
You can do this with S3 multipart uploads, since they don't require you to know the file size upfront.
Parts have to be at least 5MB in size, so the easiest way to use this is to write your data to an in-memory buffer and upload a part to S3 every time you get past 5MB. There's a limit of 10,000 parts per upload, so if your file size is going to be over 50GB you'd need to know that ahead of time so that you can make the parts bigger.
Using the fog library, that would look a little like this:
def upload_chunk connection, upload_id, chunk, chunk_index
  md5 = Base64.encode64(Digest::MD5.digest(chunk)).strip
  connection.upload_part('bucket', 'a_key', upload_id, chunk_index, chunk, 'Content-MD5' => md5)
end

connection = Fog::Storage::AWS.new(:aws_access_key_id => '...', :region => '...', :aws_secret_access_key => '...')
upload_id = connection.initiate_multipart_upload('bucket', 'a_key').body['UploadId']

chunk_index = 1
kml = String.new
kml << header
coords3d.each do |coords|
  # append to kml
  if kml.bytesize > 5 * 1024 * 1024
    upload_chunk connection, upload_id, kml, chunk_index
    chunk_index += 1
    kml = ''
  end
end
upload_chunk connection, upload_id, kml, chunk_index

# when you've uploaded all the chunks
connection.complete_multipart_upload('bucket', 'a_key', upload_id)
You could probably come up with something neater if you created an uploader class to wrap the buffer and stuck all the S3 logic in there. Then your KML code doesn't have to know whether it has an actual string or a string that flushes to S3 periodically.
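A rough sketch of such a wrapper (the class and method names are made up for illustration; depending on your fog version, complete_multipart_upload also expects the ordered list of part ETags, which this sketch collects from each upload_part response):
class S3ChunkedUploader
  MIN_PART_SIZE = 5 * 1024 * 1024 # S3 multipart parts must be at least 5MB, except the last one

  def initialize(connection, bucket, key)
    @connection = connection
    @bucket = bucket
    @key = key
    @buffer = ''
    @part_index = 1
    @etags = []
    @upload_id = connection.initiate_multipart_upload(bucket, key).body['UploadId']
  end

  # Append data to the buffer; flush a part to S3 whenever it passes 5MB.
  def <<(data)
    @buffer << data
    flush_part if @buffer.bytesize >= MIN_PART_SIZE
    self
  end

  # Upload whatever is left and close out the multipart upload.
  def finish
    flush_part unless @buffer.empty?
    @connection.complete_multipart_upload(@bucket, @key, @upload_id, @etags)
  end

  private

  def flush_part
    md5 = Base64.encode64(Digest::MD5.digest(@buffer)).strip
    part = @connection.upload_part(@bucket, @key, @upload_id, @part_index, @buffer, 'Content-MD5' => md5)
    @etags << part.headers['ETag']
    @part_index += 1
    @buffer = ''
  end
end
The KML builder then only sees something that responds to <<:
uploader = S3ChunkedUploader.new(connection, 'bucket', 'a_key')
uploader << header
coords3d.each { |coords| uploader << polygon_for(coords) } # polygon_for stands in for whatever builds each fragment
uploader << footer
uploader.finish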

How to split a text file into 2 files using number of lines?

I have been facing an issue with file handling. I have a text file with 1000 lines, and I want to split it into two files of 500 lines each.
I wrote the following code for that, but it splits the file by byte size rather than by line count.
class Hello
  def chunker f_in, out_pref, chunksize = 500
    File.open(f_in, "r") do |fh_in|
      until fh_in.eof?
        ch_path = "/my_applications/#{out_pref}_#{"%05d" % (fh_in.pos / chunksize)}.txt"
        puts "choose path: "
        puts ch_path
        File.open(ch_path, "w") do |fh_out|
          fh_out << fh_in.read(chunksize)
          puts "FH out : "
          puts fh_out
        end
      end
    end
  end
end

f = Hello.new
f.chunker "/my_applications/hello.txt", "output_prefix"
I am able to split the parent file by chunk size in bytes, but I want it split by number of lines. How can I achieve that?
Calculate the middle line as a pivot, then write each line to one of the two output files accordingly:
out1 = File.open('output_prefix1', 'w')
out2 = File.open('output_prefix2', 'w')

File.open('/my_applications/hello.txt') do |file|
  pivot = file.lines.count / 2
  file.rewind
  file.lines.each_with_index do |line, index|
    if index < pivot
      out1.write(line)
    else
      out2.write(line)
    end
  end
end

out1.close
out2.close
file = File.readlines('hello.txt')
File.open('first_half.txt', 'w') {|new_file| new_file.puts file[0...500]}
File.open('second_half.txt', 'w') {|new_file| new_file.puts file[500...1000]}
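If the input file is ever too big to hold in memory with File.readlines, a streaming variant (a sketch, not part of the original answers) makes two passes: one to count the lines, one to write each line straight to the right output file:
half = File.foreach('hello.txt').count / 2 # first pass: just count the lines

File.open('first_half.txt', 'w') do |out1|
  File.open('second_half.txt', 'w') do |out2|
    File.foreach('hello.txt').with_index do |line, index|
      (index < half ? out1 : out2).write(line)
    end
  end
end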
