Copying objects to new S3 folder, but only from certain folders? - ruby-on-rails

I want to copy a bunch of PDFs from folders in my S3 bucket into a folder called "Archive". The folders in question are named with a DD.MM.YYYY pattern. The question is: how do I isolate my copy request so it only grabs PDFs from folders in the bucket that follow that DD.MM.YYYY naming structure?
Here is the rake task I built. However, I'm receiving a "Don't know how to build this task" error.
namespace :courts do
  task update_ocr_documents: :environment do
    OcrDocument.find_each do |ocr_document|
      ::Courts::Operations::OcrDocuments::RecognizeExisting.run(id: ocr_document.id, skip_auth: true)
    end
  end

  task sync_ocr_bucket: :environment do
    ::Courts::SyncronizeBucketJob.perform_now
  end

  task move_old_documents: :environment do
    bucket = Settings.file_storage.ocr_documents.s3_credentials.bucket
    file_util = Courts::Aws::FileUtil.new(bucket) # Should be called S3Util?
    file_util.get_objects { |obj| obj.key.ends_with?('.pdf') }
    # TODO: get only PDF files that are in folders named dd.mm.yyyy (see the sketch below)
    file_util.move_objects('archive')
  end
end
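To address the TODO above: since the dated folders form the key prefix, one approach is to match the whole key against a DD.MM.YYYY pattern instead of only checking the extension. This is a minimal sketch reusing the Courts::Aws::FileUtil wrapper from the task; only the regex is new, and it assumes keys look like "12.03.2021/contract.pdf":
dated_pdf_key = %r{\A\d{2}\.\d{2}\.\d{4}/.*\.pdf\z}
file_util.get_objects { |obj| obj.key.match?(dated_pdf_key) }
file_util.move_objects('archive')
As for the "Don't know how to build this task" error, that usually just means rake cannot find a task by that name; a common cause is the file not living under lib/tasks with a .rake extension, or the task being invoked with the wrong name (for example, it should be run as bundle exec rails courts:move_old_documents).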

Related

Rails - Resave All Models for S3 Migration

rails 6.1.3.2
aws-sdk-s3 gem
I currently have a Rails app in production that uses ActiveStorage to attach image data to a wrapper Image model. It uses the local strategy to save images to disk, and I am migrating it to S3. I am not using Paperclip or anything similar.
I succeeded in setting it up. It is currently set to use local storage as the primary service and S3 as a mirror, so that I can write to two places during the migration. However, the documentation says that it will only save new images to S3 upon create and update of a record. I would like to "re-save" all models in production to force the migration to happen. Does anyone know how to do this?
Looks like it was already answered!
If you happen to be stuck with only access to the Rails Console like I was, this solution worked perfectly. If you copy-paste this code into the console, it will begin to produce output of the S3 uploads. After 5k of those, I was done. An immense thank you to Tayden for the solution.
all_services = [ActiveStorage::Blob.service.primary, *ActiveStorage::Blob.service.mirrors]
# Iterate through each blob
ActiveStorage::Blob.all.each do |blob|
  # Select services where the file exists
  services = all_services.select { |service| service.exist? blob.key }
  # Skip the blob if the file doesn't exist anywhere
  next unless services.present?
  # Select services where the file doesn't exist yet
  mirrors = all_services - services
  # Open the local file (if one exists)
  local_file = File.open(services.find { |service| service.is_a? ActiveStorage::Service::DiskService }.path_for(blob.key)) if services.any? { |service| service.is_a? ActiveStorage::Service::DiskService }
  # Upload the local file to the mirrors (if one exists)
  mirrors.each do |mirror|
    mirror.upload blob.key, local_file, checksum: blob.checksum
  end if local_file.present?
  # If no local file exists, download a remote copy and upload it to the mirrors (thanks @Rystraum)
  services.first.open blob.key, checksum: blob.checksum do |temp_file|
    mirrors.each do |mirror|
      mirror.upload blob.key, temp_file, checksum: blob.checksum
    end
  end unless local_file.present?
end
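After running the loop, a quick way to sanity-check the migration is to count the blobs each mirror is still missing. This is just a sketch under the same primary/mirror setup as above (it assumes the configured service is a mirror service, so ActiveStorage::Blob.service responds to mirrors):
# Count blobs that each mirror is still missing
ActiveStorage::Blob.service.mirrors.each do |mirror|
  missing = ActiveStorage::Blob.all.reject { |blob| mirror.exist?(blob.key) }.count
  puts "#{mirror.class.name}: #{missing} blobs missing"
end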

How to migrate local storage (active storage) to google cloud storage

I'm trying to migrate my Rails app to Google Cloud.
I've connected Active Storage to the bucket created on GCS.
I've uploaded the "storage" folder to the bucket, but all the images in the app return 404 errors.
How can I correctly migrate the local storage folder to GCS?
Thanks in advance.
This question is very similar to this one; as mentioned on that thread:
DiskService uses a different folder structure than the cloud storage service on Google.
DiskService uses the first characters of the key as intermediate folders, while cloud services just use the key itself and put all variants in a separate folder.
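For illustration, with a hypothetical key, the difference in paths looks like this (the two-character folder pairs mirror the folder = [key[0..1], key[2..3]] line in the task below):
key = "abcd1234efgh"
disk_path  = File.join(key[0..1], key[2..3], key) # => "ab/cd/abcd1234efgh" (DiskService)
cloud_path = key                                  # => "abcd1234efgh" (GCS/S3 services)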
You can create a rake task to copy files to cloud storage, for example:
namespace :active_storage do
  desc "Migrates active storage local files to cloud"
  task migrate_local_to_cloud: :environment do
    raise 'Missing storage_config param' if !ENV.has_key?('storage_config')
    require 'yaml'
    require 'erb'
    require 'google/cloud/storage'
    config_file = Pathname.new(Rails.root.join('config/storage.yml'))
    configs = YAML.load(ERB.new(config_file.read).result) || {}
    config = configs[ENV['storage_config']]
    client = Google::Cloud.storage(config['project'], config['credentials'])
    bucket = client.bucket(config.fetch('bucket'))
    ActiveStorage::Blob.find_each do |blob|
      key = blob.key
      folder = [key[0..1], key[2..3]].join('/')
      file_path = Rails.root.join('storage', folder.to_s, key)
      file = File.open(file_path, 'rb')
      md5 = Digest::MD5.base64digest(file.read)
      file.rewind # reading the file for the digest left the IO at EOF; rewind before uploading
      bucket.create_file(file, key, content_type: blob.content_type, md5: md5)
      file.close
      puts key
    end
  end
end
Executed as: rails active_storage:migrate_local_to_cloud storage_config=google.
You can find useful documentation here.
I would write a migration that iterates over all models with attachments and "reassigns" the current image from the local file in the directory, so that it gets synced to GCS. Also have a look at the Active Storage guide.
Try the mirror solution described in How to sync new ActiveStorage mirrors?: mirror first, then sync.
This worked for my migration from local storage to the S3 service.

Heroku - how to write into "tmp" directory?

I need to use the tmp folder on Heroku (Cedar) for writing some temporary data. I am trying to do that this way:
open("#{Rails.root}/tmp/#{result['filename']}", 'wb') do |file|
file.write open(image_url).read
end
But this produces an error:
Errno::ENOENT: No such file or directory - /app/tmp/image-2.png
This code runs properly on localhost, but I cannot make it work on Heroku.
What is the proper way to save some files to the tmp directory on Heroku (Cedar stack)?
Thank you
EDIT:
I am running a method with Delayed Job that needs to have access to the tmp file.
EDIT2:
What I am doing:
files.each_with_index do |f, index|
  unless f.nil?
    result = JSON.parse(buffer)
    filename = "#{Time.now.to_i.to_s}_#{result['filename']}" # thumbnail name
    thumb_filename = "#{Rails.root}/tmp/#{filename}"
    image_url = f.file_url + "/convert?rotate=exif"
    open("#{Rails.root}/tmp/#{result['filename']}", 'wb') do |file|
      file.write open(image_url).read
    end
    img = Magick::Image.read(image_url).first
    target = Magick::Image.new(150, 150) do
      self.background_color = 'white'
    end
    img.resize_to_fit!(150, 150)
    target.composite(img, Magick::CenterGravity, Magick::CopyCompositeOp).write(thumb_filename)
    key = File.basename(filename)
    s3.buckets[bucket_name].objects[key].write(:file => thumb_filename)
    # save the path to the new thumbnail to the database
    f.update_attributes(:file_url_thumb => "https://s3-us-west-1.amazonaws.com/bucket/#{filename}")
  end
end
I have information about images in the database. The images themselves are stored in an Amazon S3 bucket. I need to create thumbnails for these images. So I go through the images one by one, load each image, temporarily save it, resize it, and afterwards upload the thumbnail to the S3 bucket.
But this procedure doesn't seem to be working on Heroku, so how could I do that (my app is running on Heroku)?
Is /tmp included in your git repo? Removed in your .slugignore? The directory may just not exist out on Heroku.
Try tossing in a quick mkdir before the write:
Dir.mkdir(File.join(Rails.root, 'tmp'))
Or even in an initializer or something...
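If you go the initializer route, a slightly more defensive sketch could look like this (the file name config/initializers/tmp_dir.rb is just an example):
# config/initializers/tmp_dir.rb
require 'fileutils'
FileUtils.mkdir_p(Rails.root.join('tmp')) # mkdir_p is a no-op if the directory already exists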
Here's an elegant way
f = File.new("tmp/filename.txt", 'w')
f << "hi there"
f.close
Dir.entries(Dir.pwd + "/tmp") # See your newly created file in /tmp
Don't forget that whenever your app restarts (for any reason, including those outside your control), your files will be deleted, as they are only stored ephemerally.
Try it with heroku restart; you will see that the new file you created is no longer there.
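If the file really is only needed for the lifetime of the job, a Tempfile sidesteps the question of whether tmp/ exists at all. A rough sketch (the "thumb" prefix and ".png" extension are illustrative; image_url is as in the question's snippet):
require 'open-uri'
require 'tempfile'

Tempfile.create(['thumb', '.png']) do |file|
  file.binmode
  file.write URI.open(image_url).read
  file.flush
  # resize and upload to S3 here, before the block ends and the file is deleted
end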

Generating a CSV and uploading it to S3 when finished in a background job

I'm providing users with the ability to download an extremely large amount of data via CSV. To do this, I'm using Sidekiq and putting the task off into a background job once they've initiated it. In the background job I generate a CSV containing all of the proper data, store it in /tmp, and then call save! on my model, passing the location of the file to the Paperclip attribute, which then stores it in S3.
All of this works perfectly fine locally. My problem now lies with Heroku and its ability to store files for only a short duration, dependent on what node you're on. My background job is unable to find the tmp file that gets saved because of how Heroku deals with these files. I guess I'm searching for a better way to do this. If there's some way that everything can be done in memory, that would be awesome. The only problem is that Paperclip expects an actual file object as an attribute when you're saving the model. Here's what my background job looks like:
class CsvWorker
  include Sidekiq::Worker

  def perform(report_id)
    puts "Starting the jobz!"
    report = Report.find(report_id)
    items = query_ranged_downloads(report.start_date, report.end_date)
    csv = compile_csv(items)
    update_report(report.id, csv)
  end

  def update_report(report_id, csv)
    report = Report.find(report_id)
    report.update_attributes(csv: csv, status: true)
    report.save!
  end

  def compile_csv(items)
    clean_items = items.compact
    path = File.new("#{Rails.root}/tmp/uploads/downloads_by_title_#{Process.pid}.csv", "w")
    csv_string = CSV.open(path, "w") do |csv|
      csv << ["Item Name", "Parent", "Download Count"]
      clean_items.each do |row|
        if !row.item.nil? && !row.item.parent.nil?
          csv << [
            row.item.name,
            row.item.parent.name,
            row.download_count
          ]
        end
      end
    end
    return path
  end
end
I've omitted the query method for readability's sake.
I don't think Heroku's temporary file storage is the problem here. The warnings around that mostly center on the facts that a) dynos are ephemeral, so anything you write can and will disappear without notice; and b) dynos are interchangeable, so the presence of inter-request tempfiles is a matter of luck when you have more than one web dyno running. However, in no situation do temporary files just vanish while your worker is running.
One thing I notice is that you're actually creating two temporary files with the same name:
> path = File.new("/tmp/filename", "w")
=> #<File:/tmp/filename>
> path.fileno
=> 3
> CSV.open(path, "w") do |csv| csv << %w(foo bar baz); puts csv.fileno end
4
=> nil
You could change the path = line to just set the filename (instead of opening it for writing), and then make update_report open the filename for reading. I haven't dug into what Paperclip does when you give it an empty, already-overwritten, opened-for-writing file handle, but changing that flow may well fix the issue.
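Here's a rough sketch of that first suggestion, using the same method names as the worker above (the elided column logic stays the same):
def compile_csv(items)
  clean_items = items.compact
  path = "#{Rails.root}/tmp/uploads/downloads_by_title_#{Process.pid}.csv"
  CSV.open(path, "w") do |csv|
    csv << ["Item Name", "Parent", "Download Count"]
    # ... same row logic as before ...
  end
  path
end

def update_report(report_id, csv_path)
  report = Report.find(report_id)
  report.update_attributes(csv: File.open(csv_path), status: true)
  report.save!
end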
Alternately, you could do this in memory instead: generate the CSV as a string and give it to Paperclip as a StringIO. (Paperclip supports certain non-file objects, including StringIOs, using e.g. Paperclip::StringioAdapter.) Try something like:
# returns a CSV as a string
def compile_csv(items)
  CSV.generate do |csv|
    # ...
  end
end

def update_report(report_id, csv)
  report = Report.find(report_id)
  report.update_attributes(csv: StringIO.new(csv), status: true)
  report.save!
end

Uploading thousands of images with Paperclip to S3

I have ~16,000 images I'm trying to upload to Amazon. Right now, they're on my local file system. I'd like to upload them to S3 using Paperclip, but I do NOT want to upload them to my server first. I'm using Heroku and they limit slug size.
Is there a way to use a rake task to upload the images directly from my local file system to S3 via Paperclip?
You can configure your app to use Amazon S3 for Paperclip storage in development (see my example, and the config sketch after the task below) and upload the files using a rake task like this:
Let's say your folder of images is in your_app_folder/public/images; you can create a rake task similar to this.
namespace :images do
  desc "Upload images."
  task :create => :environment do
    @images = Dir["#{RAILS_ROOT}/public/images/*.*"]
    for image in @images
      MyModel.create(:image => File.open(image))
    end
  end
end
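The development-side S3 configuration the answer refers to could look roughly like this; the bucket and credential names are placeholders, and it assumes MyModel already declares has_attached_file :image:
# config/environments/development.rb (sketch)
config.paperclip_defaults = {
  storage: :s3,
  s3_credentials: {
    bucket: ENV['S3_BUCKET_NAME'],
    access_key_id: ENV['AWS_ACCESS_KEY_ID'],
    secret_access_key: ENV['AWS_SECRET_ACCESS_KEY']
  }
}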
Yes. I did something similar on my first personal Rails project. Here's a previous SO question (Paperclip S3 download remote images) whose answer links to the post where I found my answer so long ago (http://trevorturk.com/2008/12/11/easy-upload-via-url-with-paperclip/).
Great answer, Johnny Grass, and great question, Chris. I had a few hundred tif files on my local machine, plus Heroku, Paperclip, and S3. Some of the tiff files were > 100 MB, so getting Heroku to pay attention for that long required Delayed Job and some extra work. Since this was mostly a one-time batch process (5 different image forms created from each, with 5x uploads), the idea of a rake task fit perfectly. Here, in case it helps, is the rake task I created, assuming, as Johnny wrote, that your development database has current data (use a pg backup to get a fresh set of ids) and is connected to S3.
I have a model called "Item" with an attachment "image". I wanted to check whether existing Items already had an image and, if not, upload a new one. The effect is to mirror a directory of source files. A good extension might be to check the dates and see if the local tif is newer.
# lib/image_management.rake
namespace :images do
  desc 'upload images through paperclip with postprocessing'
  task :create => :environment do
    directory = "/Volumes/data/historicus/_projects/deeplandscapes/library/tifs/*.tif"
    images = Dir[directory]
    puts "\n\nProcessing #{images.length} images in #{directory}..."
    items_with_errors = []
    items_updated = []
    items_skipped = []
    images.each do |image|
      # find the needed record
      image_basename = File.basename(image)
      id = image_basename.gsub("it_", "").gsub(".tif", "").to_i
      if id > 0
        item = Item.find(id) rescue nil
        # check if it has an image already
        if item
          unless item.image.exists?
            # create the image
            success = item.update_attributes(:image => File.open(image))
            if success
              items_updated << item
              print ' u '
            else
              items_with_errors << item
              print ' e '
            end
          else
            items_skipped << item
            print ' s '
          end
        else
          print "[#{id}] "
        end
      else
        print " [no id for #{image_basename}] "
      end
    end
    unless items_with_errors.empty?
      puts "\n\nThe following items had errors: "
      items_with_errors.each do |error_image|
        puts "#{error_image.id}: #{error_image.errors.full_messages}"
      end
    end
    puts "\n\nUpdated #{items_updated.length} items."
    puts "Skipped #{items_skipped.length} items."
    puts "Update complete.\n"
  end
end
