Tracking Upload Progress of File to S3 Using Ruby aws-sdk

Firstly, I am aware that there are quite a few questions that are similar to this one in SO. I have read most, if not all of them, over the past week. But I still can't make this work for me.
I am developing a Ruby on Rails app that allows users to upload mp3 files to Amazon S3. The upload itself works perfectly, but a progress bar would greatly improve user experience on the website.
I am using the aws-sdk gem which is the official one from Amazon. I have looked everywhere in its documentation for callbacks during the upload process, but I couldn't find anything.
The files are uploaded one at a time, directly to S3, so the app does not need to load a whole file into memory. Multiple-file upload is not needed either.
I figured that I may need to use jQuery to make this work, and I am fine with that.
I found this that looked very promising: https://github.com/blueimp/jQuery-File-Upload
And I even tried following the example here: https://github.com/ncri/s3_uploader_example
But I just could not make it work for me.
The documentation for aws-sdk also BRIEFLY describes streaming uploads with a block:
obj.write do |buffer, bytes|
  # writing fewer than the requested number of bytes to the buffer
  # will cause write to stop yielding to the block
end
But this is barely helpful. How does one "write to the buffer"? I tried a few intuitive options that would always result in timeouts. And how would I even update the browser based on the buffering?
Is there a better or simpler solution to this?
Thank you in advance.
I would appreciate any help on this subject.

The "buffer" object yielded when passing a block to #write is an instance of StringIO. You can write to the buffer using #write or #<<. Here is an example that uses the block form to upload a file.
# open in binary mode so the bytes are passed through unchanged
file = File.open('/path/to/file', 'rb')
obj = s3.buckets['my-bucket'].objects['object-key']
obj.write(:content_length => file.size) do |buffer, bytes|
  buffer.write(file.read(bytes))
  # you could do some interesting things here to track progress
end
file.close
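
One way to fill in that comment line is to count the bytes as they pass through the block. The following is a minimal sketch under stated assumptions, not aws-sdk code: copy_with_progress is a hypothetical helper, and two StringIO objects stand in for the opened file and for the buffer that #write yields, so the example runs without S3.

```ruby
require 'stringio'

# Hypothetical helper (not part of aws-sdk): copies `source` into `sink`
# in chunks of `chunk_size` bytes and yields the fraction completed
# after every chunk.
def copy_with_progress(source, sink, total_bytes, chunk_size: 4)
  sent = 0
  while (chunk = source.read(chunk_size))
    sink.write(chunk)
    sent += chunk.bytesize
    yield(sent.to_f / total_bytes) if block_given?
  end
  sent
end

source = StringIO.new('0123456789') # stands in for the opened file
sink   = StringIO.new               # stands in for the yielded buffer
progress = []
copy_with_progress(source, sink, source.size) do |fraction|
  progress << fraction.round(2)
end
```

In a Rails app the fraction would then have to be stored somewhere the browser can poll; that part is left out here.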

After reading the source code of the AWS gem, I adapted (mostly copied) its multipart upload method so that it yields the current progress, based on how many chunks have been uploaded:
s3 = AWS::S3.new.buckets['your_bucket']
file = File.open(filepath, 'r', encoding: 'BINARY')
file_to_upload = "#{s3_dir}/#{filename}"
upload_progress = 0
opts = {
  content_type: mime_type,
  cache_control: 'max-age=31536000',
  estimated_content_length: file.size,
}
part_size = self.compute_part_size(opts)
parts_number = (file.size.to_f / part_size).ceil.to_i
obj = s3.objects[file_to_upload]
begin
  obj.multipart_upload(opts) do |upload|
    until file.eof? do
      break if upload.aborted?
      upload.add_part(file.read(part_size))
      upload_progress += 1.0/parts_number
      # Yields the Float progress (0.0 to 1.0) and the multipart upload
      # object for the file that's being uploaded
      yield(upload_progress, upload) if block_given?
    end
  end
ensure
  # close the file whether or not the upload succeeded
  file.close
end
The compute_part_size method is defined here and I've modified it to this:
def compute_part_size(options)
  max_parts = 10000
  min_size = 5242880 # 5 MB
  estimated_size = options[:estimated_content_length]
  [(estimated_size.to_f / max_parts).ceil, min_size].max.to_i
end
This code was tested on Ruby 2.0.0p0
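
To see what those numbers mean in practice, here is a small worked example of the same arithmetic. It is only a sketch: part_size_for and parts_for are hypothetical names, and the two file sizes are made up for illustration.

```ruby
MAX_PARTS     = 10_000
MIN_PART_SIZE = 5 * 1024 * 1024 # 5 MB, the same value as 5242880 above

# Same rule as compute_part_size: never go below 5 MB per part and
# never need more than 10,000 parts.
def part_size_for(file_size)
  [(file_size.to_f / MAX_PARTS).ceil, MIN_PART_SIZE].max
end

# Number of parts, and therefore of progress steps, for a given file size.
def parts_for(file_size)
  (file_size.to_f / part_size_for(file_size)).ceil
end

small = 12 * 1024 * 1024         # a 12 MB file
large = 100 * 1024 * 1024 * 1024 # a 100 GB file

small_parts = parts_for(small) # 3 parts of at most 5 MB each
large_parts = parts_for(large) # 10,000 parts of about 10.7 MB each
```

A 12 MB file therefore reports progress in only three steps (roughly 33%, 67%, 100%), so the progress bar is coarse for small files.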

Related

Ruby on Rails - How to convert to images some elements from a word document

Context
On our platform we allow users to upload Word documents. Those documents are stored in Google Drive and then downloaded again to our platform in HTML format to create a section where users can interact with that content.
Rails 5.0.7
Ruby 2.5.7p206
selenium-webdriver 3.142.7 (latest stable version compatible with our ruby and rails versions)
Problem
Some of the documents contain charts or graphics that are not processed correctly, so the result at the end of the process is wrong.
We have been trying to fix this at the point where we receive the Word document, before sending it to Google Drive.
I'm looking for a simple way to export the entire chart and/or table as an image. If anyone knows of a way to do this, the advice would be much appreciated.
Edit 1: Adding some screenshots:
This screenshot is from the original word doc:
And this is how it looks in our systems:
Here are the approaches I have tried that haven't worked for me so far.
Approach 1
Use Nokogiri to read the document and find the nodes that contain the charts (we found that they are called drawing), then use Selenium to navigate through the file and take a screenshot of that particular section.
The problem with this approach is that our gem versions are not compatible with the latest versions of Selenium and its web drivers (Chrome or Firefox), so it is not possible to perform this action.
Another problem, apparently due to security restrictions, is that Selenium is not able to browse to local files and open them.
options = Selenium::WebDriver::Firefox::Options.new(binary: '/usr/bin/firefox', headless: true)
driver = Selenium::WebDriver.for :firefox, options: options
path = "#{Rails.root}/doc_file.docx"
driver.navigate.to("file://#{path}")
# Here occurs the first issue, it is not able to navigate to the file
puts "Title: #{driver.title}"
puts "URL: #{driver.current_url}"
# Below is the code that I am trying to use to replace the images with the modified images
drawing_elements = driver.find_elements(:css, 'w|drawing')
modified_paragraphs = []
drawing_elements.each do |drawing_element|
  paragraph_element = drawing_element.find_element(:xpath, '..')
  paragraph_element.screenshot.save('paragraph.png')
  modified_paragraph = File.read('paragraph.png')
  modified_paragraphs << modified_paragraph
end
driver.quit
file = File.open(File.join(Rails.root, 'doc_file.docx'))
doc = Nokogiri::XML(file)
drawing_elements = doc.css('w|drawing')
drawing_elements.each_with_index do |drawing_element, i|
  paragraph_element = drawing_element.parent
  paragraph_element.replace(modified_paragraphs[i])
end
new_doc_file = File.write('modified_doc.docx', doc.to_xml)
s3_client.put_object(bucket: bucket, key: @document_path, body: new_doc_file)
File.delete('doc_file.docx')
Approach 2
Use Nokogiri to get the drawing elements and then try to convert them directly to images using rmagick or mini_magick.
This only works if the drawing element actually contains an image; in that case it is converted correctly. The problem arises when the drawing element contains not an image but other elements such as graphicData, pic, blipFill, or blip. The code then has to loop through the element and rebuild it, but at that point the element seems to be malformed and cannot be rebuilt.
Another issue with this approach arises when it finds elements that seem to make up an SVG file: it also needs to loop through all the elements and try to rebuild them, and, as above, the element seems to be malformed.
response = s3_client.get_object(bucket: bucket, key: @document_path)
docx = response.body.read
Zip::File.open_buffer(docx) do |zip|
  doc = zip.find_entry("word/document.xml")
  doc_xml = doc.get_input_stream.read
  doc = Nokogiri::XML(doc_xml)
  drawing_elements = doc.xpath("//w:drawing")
  drawing_elements.each do |drawing_element|
    node = get_chil_by_name(drawing_element, "graphic")
    if node.xpath("//a:graphicData/a:pic/a:blipFill/a:blip").any?
      img_data = node.xpath("//a:graphicData/a:pic/a:blipFill/a:blip").first.attributes["r:embed"].value
      img = Magick::Image.from_blob(img_data).first
      img.write("node.jpeg")
      node.replace("<img src='#{img.to_blob}'/>")
    elsif node.xpath("//a:graphicData/a:svg").any?
      svg_data = node.xpath("//a:graphicData/a:svg").to_s
      Prawn::Document.generate("node.pdf") do |pdf|
        pdf.svg svg_data, at: [0, pdf.cursor], width: pdf.bounds.width
      end
    else
      puts "unsupported format"
    end
  end
  # update the file in S3
  s3.put_object(bucket: bucket, key: @document_path, body: doc)
end
Approach 3
Convert the elements, starting from their parents, to a PDF file and then to an image.
This has basically the same issue as approach 2: it needs to loop through all the elements and try to rebuild them, and we haven't found a way to do that.

Paperclip Nginx 504 Gateway Time-out

I have a Rails 4 application that allows users to upload videos using the jQuery Dropzone plugin and the paperclip gem. Each uploaded video is encoded into multiple formats and uploaded to Amazon S3 in the background using the delayed_paperclip, av-transcoder and sidekiq gems.
Everything works fine with most videos, but with larger files (around 1.1GB), once the upload reaches what seems to be the end of the Dropzone progress bar, the server returns an Nginx 504 Gateway Time-out.
As for the servers, the Rails app runs on Nginx + Passenger on a couple of servers behind a load balancer (which also uses Nginx). I do not have timeouts set in the upstream section of the load balancer, and client_max_body_size is set to 2000M on both the load balancer and the servers. I tried setting passenger_pool_idle_time to a large value (600), which didn't help, and I also tried setting send_timeout (600s); nothing made any difference.
Note: When making those changes, I did them on the host files of both servers as well as of the load balancer and always restarted nginx afterwards.
I've read also several answers regarding similar problems like this one and this one but still can't figure this out, google wasn't much more helpful either.
Some extra notes for those unfamiliar with the paperclip/delayed_paperclip process: the file is uploaded to the server, and at that point the operation is done as far as the user is concerned. In the background, the post-processing of the videos (encoding and uploading to S3) is pushed to Redis as a job, and Sidekiq processes it whenever it has time and resources.
What could be causing this issue? How can I debug this and solve it?
UPDATE
Thanks to Sergey's answer I was able to solve the issue. Since I was restricted to a specific version of Paperclip, I couldn't update it to the newest version that has the fix, therefore I'll leave here what I ended up doing.
In the engine that I use to handle the uploads I've added the following code in the engine_name.rb file to override the methods from Paperclip that needed fixing:
Paperclip::AbstractAdapter.class_eval do
  def copy_to_tempfile(src)
    link_or_copy_file(src.path, destination.path)
    destination
  end
  def link_or_copy_file(src, dest)
    Paperclip.log("Trying to link #{src} to #{dest}")
    FileUtils.ln(src, dest, force: true) # overwrite existing
    @destination.close
    @destination.open.binmode
  rescue Errno::EXDEV, Errno::EPERM, Errno::ENOENT => e
    Paperclip.log("Link failed with #{e.message}; copying link #{src} to #{dest}")
    FileUtils.cp(src, dest)
  end
end
Paperclip::AttachmentAdapter.class_eval do
  def copy_to_tempfile(source)
    if source.staged?
      link_or_copy_file(source.staged_path(@style), destination.path)
    else
      source.copy_to_local_file(@style, destination.path)
    end
    destination
  end
end
Paperclip::Storage::Filesystem.class_eval do
  def flush_writes #:nodoc:
    @queued_for_write.each do |style_name, file|
      FileUtils.mkdir_p(File.dirname(path(style_name)))
      begin
        move_file(file.path, path(style_name))
      rescue SystemCallError
        File.open(path(style_name), "wb") do |new_file|
          while chunk = file.read(16 * 1024)
            new_file.write(chunk)
          end
        end
      end
      unless @options[:override_file_permissions] == false
        resolved_chmod = (@options[:override_file_permissions] &~ 0111) || (0666 &~ File.umask)
        FileUtils.chmod( resolved_chmod, path(style_name) )
      end
      file.rewind
    end
    after_flush_writes # allows attachment to clean up temp files
    @queued_for_write = {}
  end
  private
  def move_file(src, dest)
    # Support hardlinked files
    if File.identical?(src, dest)
      File.unlink(src)
    else
      FileUtils.mv(src, dest)
    end
  end
end

I faced a similar issue a while ago. Maybe my experience will help.
We had an m3.medium instance on Amazon with 4 GB of memory.
Users could upload large video files, and we got a 504 error when uploading files larger than 400 MB.
While monitoring and logging the upload process, we found that Paperclip creates 4 files per attachment, so all of the instance's resources were being spent on file system work.
The problem is described here:
https://github.com/thoughtbot/paperclip/issues/1642
The proposed solution is to use links instead of files when possible. You can see the corresponding code changes here:
https://github.com/arnonhongklay/paperclip/commit/cd80661df18d7cd112944bfe26d90cb87c928aad
However, 2 days ago Paperclip was updated to version 5.2.0, which implements a similar solution.
So it now creates only one file per attachment. Our file system is no longer overloaded, and after updating to 5.2.0 we stopped receiving the 504 error.
Conclusion:
Use the monkey patch from the link above if you are restricted to a specific Paperclip version for some reason.
Otherwise, update Paperclip to version 5.2.0. That should help.
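
The idea behind that fix, linking instead of copying, can be tried in isolation. This is a minimal sketch that uses only Ruby's standard library, not Paperclip's actual code: link_or_copy and the file names are hypothetical.

```ruby
require 'fileutils'
require 'tmpdir'

# Hypothetical helper: tries a hard link first, which writes no second
# copy of the data, and falls back to a real copy if linking is not
# possible (for example across file systems).
def link_or_copy(src, dest)
  FileUtils.ln(src, dest, force: true)
  :linked
rescue SystemCallError, NotImplementedError
  FileUtils.cp(src, dest)
  :copied
end

result = nil
dest_content = nil
Dir.mktmpdir do |dir|
  src  = File.join(dir, 'upload.mp4')
  dest = File.join(dir, 'staged.mp4')
  File.write(src, 'video bytes') # stands in for a large upload
  result = link_or_copy(src, dest)
  dest_content = File.read(dest)
end
```

When the link succeeds, both names point at the same data on disk, so a large upload is not written out a second time.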

reading large csv files in a rails app takes up a lot of memory - Strategy to reduce memory consumption?

I have a Rails app that allows users to upload CSV files and schedule the reading of multiple CSV files with the help of the delayed_job gem. The problem is that the app reads each file in its entirety into memory and then writes to the database. If just one file is being read, that is fine, but when multiple files are read, the RAM on the server fills up and the app hangs.
I am trying to find a solution for this problem.
One solution I researched is to break the CSV file into smaller parts, save them on the server, and read the smaller files (see this link).
example: split -b 40k myfile segment
Not my preferred solution. Are there any other approaches where I don't have to break up the file? Solutions must be Ruby code.
Thanks,

You can use CSV.foreach to read the CSV file one row at a time instead of loading it all at once:
path = Rails.root.join('data/uploads/.../upload.csv') # or, whatever
CSV.foreach(path) do |row|
  # process row[i] here
end
If it's run in a background job, you could additionally call GC.start every n rows.
How it works
CSV.foreach operates on an IO stream, as you can see here:
def CSV.foreach(path, options = Hash.new, &block)
  # ...
  open(path, options) do |csv|
    csv.each(&block)
  end
end
The csv.each part reads from the underlying IO line by line, much like IO#each (shown here, built on rb_io_getline_1), and leaves each line to be garbage collected once it has been processed:
static VALUE
rb_io_each_line(int argc, VALUE *argv, VALUE io)
{
    // ...
    while (!NIL_P(str = rb_io_getline_1(rs, limit, io))) {
        rb_yield(str);
    }
    // ...
}
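
Building on CSV.foreach, rows can also be handed to the database code in small batches, so that only one batch is in memory at a time. The following is a sketch under stated assumptions: each_csv_batch is a hypothetical helper, and a Tempfile with made-up rows stands in for the uploaded file.

```ruby
require 'csv'
require 'tempfile'

# Hypothetical helper: streams the CSV at `path` and yields arrays of at
# most `batch_size` rows (as hashes keyed by header), one batch at a time.
def each_csv_batch(path, batch_size: 1000)
  CSV.foreach(path, headers: true).each_slice(batch_size) do |rows|
    yield rows.map(&:to_h)
  end
end

# A tiny stand-in for an uploaded file.
file = Tempfile.new(['upload', '.csv'])
file.write("id,name\n1,a\n2,b\n3,c\n")
file.close

batches = []
each_csv_batch(file.path, batch_size: 2) { |batch| batches << batch }
file.unlink
```

Here the batches are collected into an array only so the result can be inspected. In the background job, each batch would be written to the database inside the block and then dropped.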

upload an RMagick-generated file from Heroku to Amazon S3

I am creating a Rails app which is hosted on Heroku and that allows the user to generate animated GIFs on the fly based on an original JPG that's hosted somewhere in the web (think of it as a crop-resize app). I tried Paperclip but, AFAIK, it does not handle dynamically-generated files. I am using the aws-sdk gem and this is a code snippet of my controller:
im = Magick::Image.read(@animation.url).first
fr1 = im.crop(@animation.x1,@animation.y1,@animation.width,@animation.height,true)
str1 = fr1.to_blob
fr2 = im.crop(@animation.x2,@animation.y2,@animation.width,@animation.height,true)
str2 = fr2.to_blob
list = Magick::ImageList.new
list.from_blob(str1)
list.from_blob(str2)
list.delay = @animation.delay
list.iterations = 0
That is for the basic creation of a two-frame animation. RMagick can generate a GIF on my development computer with this line:
list.write("#{Rails.public_path}/images/" + @animation.filename)
I tried uploading the list structure to S3:
# upload to Amazon S3
s3 = AWS::S3.new
bucket = s3.buckets['mybucket']
obj = bucket.objects[@animation.filename]
obj.write(:single_request => true, :content_type => 'image/gif', :data => list)
But I don't have a size method in RMagick::ImageList that I can use to specify that. I tried "precompiling" the GIF into another RMagick::Image:
anim = Magick::Image.new(@animation.width, @animation.height)
anim.format = "GIF"
list.write(anim)
But Rails crashes with a segmentation fault:
/path/to/my_controller.rb:103: [BUG] Segmentation fault ruby 1.8.7 (2010-01-10 patchlevel 249) [universal-darwin11.0]
Abort trap: 6
Line 103 corresponds to list.write(anim).
So right now I have no idea how to do this and would appreciate any help I receive.

As per @mga's request in his answer to his original question...
a non-filesystem-based approach is pretty simple:
rm_image = Magick::Image.from_blob(params[:image][:datafile].read)[0]
# [0] because from_blob returns an array
# the blob, presumably, can have multiple images data in it
a_thumbnail = rm_image.resize_to_fit(150, 150)
# just as an example of doing *something* with it before writing
s3_bucket.objects['my_thumbnail.jpg'].write(a_thumbnail.to_blob, {:acl=>:public_read})
Voila! reading an uploaded file, manipulating it with RMagick, and writing it to s3 without ever touching the filesystem.

Since this project is hosted in Heroku I cannot use the filesystem so that is why I was trying to do everything via code. I found that Heroku does have a temporary-writable folder: http://devcenter.heroku.com/articles/read-only-filesystem
This works just fine in my case since I don't need the file after this request.
The resulting code:
im = Magick::Image.read(@animation.url).first
fr1 = im.crop(@animation.x1,@animation.y1,@animation.width,@animation.height,true)
fr2 = im.crop(@animation.x2,@animation.y2,@animation.width,@animation.height,true)
list = Magick::ImageList.new
list << fr1
list << fr2
list.delay = @animation.delay
list.iterations = 0
# the GIF has to be written to a file first
list.write("#{Rails.root}/tmp/#{@animation.filename}.gif")
# upload to Amazon S3
s3 = AWS::S3.new
bucket = s3.buckets['mybucket']
obj = bucket.objects[@animation.filename]
obj.write(:file => "#{Rails.root}/tmp/#{@animation.filename}.gif")
It would be interesting to know if a non-filesystem-writing solution is possible.

I am updating this answer for AWS SDK Version 2 which should be:
rm_image = Magick::Image.from_blob(params[:image][:datafile].read)[0]
# [0] because from_blob returns an array
# the blob, presumably, can have multiple images data in it
a_thumbnail = rm_image.resize_to_fit(150, 150)
# just as an example of doing *something* with it before writing
s3 = Aws::S3::Resource.new
bucket = s3.bucket('mybucket')
obj = bucket.object('filename')
obj.put(body: a_thumbnail.to_blob)

I think there are a few things going on here. First, the documentation for RMagick is sub-par, and it's easy to get side-tracked. The code you're using to generate the gif can be a little simpler. I cooked up a very contrived example here:
#!/usr/bin/env ruby
require 'rubygems'
require 'RMagick'
# read in source file
im = Magick::Image.read('foo.jpg').first
# make two slightly different frames
fr1 = im.crop(0, 100, 300, 300, true)
fr2 = im.crop(0, 200, 300, 300, true)
# create an ImageList
list = Magick::ImageList.new
# add our images to it
list << fr1
list << fr2
# set some basic values
list.delay = 100
list.iterations = 0
# write out an animated gif to the filesystem
list.write("foo.gif")
This code works -- it reads in a jpg I have locally, and writes out a 2-frame animation. Obviously I've hardcoded some values here, but there's no reason this shouldn't work for you, although I am running ruby 1.9.2 and probably a different version of RMagick, but this is basic code.
The second issue is totally unrelated -- is it possible to upload an image generated in IM to S3 without actually hitting the filesystem? Basically, will this ever work:
obj.write(:single_request => true, :content_type => 'image/gif', :data => list)
I'm not sure if it is or not. I experimented with calling list.to_blob, but it only outputs the first frame, and it's as a JPG, although I didn't spend much time on it. You might be able to fool list.write into outputting somewhere, but rather than going down that road, I would personally just output the file unless that is impossible for some reason.

ruby reading files from S3 with open-URI

I'm having some problems reading a file from S3. I want to be able to load the ID3 tags remotely, but using open-URI doesn't work, it gives me the following error:
ruby-1.8.7-p302 > c=TagLib2::File.new(open(URI.parse("http://recordtemple.com.s3.amazonaws.com/music/745/original/The%20Stranger.mp3?1292096514")))
TypeError: can't convert Tempfile into String
from (irb):8:in `initialize'
from (irb):8:in `new'
from (irb):8
However, if i download the same file and put it on my desktop (ie no need for open-URI), it works just fine.
c=TagLib2::File.new("/Users/momofwombie/Desktop/blah.mp3")
is there something else I should be doing to read a remote file?
UPDATE: I just found this link, which may explain a little bit, but surely there must be some way to do this...
Read header data from files on remote server

You might want to check out AWS::S3, a Ruby library for Amazon's Simple Storage Service.
Do an AWS::S3::S3Object.find for the file and then use about to retrieve the metadata.
This solution assumes you have the AWS credentials and permission to access the S3 bucket that contains the files in question.

TagLib2::File.new doesn't take a file handle, which is what you are passing to it when you use open without a read.
Add on read and you'll get the contents of the URL, but TagLib2::File doesn't know what to do with that either, so you are forced to read the contents of the URL, and save it.
I also noticed you are unnecessarily complicating your use of OpenURI. You don't have to parse the URL using URI before passing it to open. Just pass the URL string.
require 'open-uri'
fname = File.basename($0) << '.' << $$.to_s
File.open(fname, 'wb') do |fo|
  fo.print open("http://recordtemple.com.s3.amazonaws.com/music/745/original/The%20Stranger.mp3?1292096514").read
end
c = TagLib2::File.new(fname)
# do more processing...
File.delete(fname)
I don't have TagLib2 installed but I ran the rest of the code and the mp3 file downloaded to my disk and is playable. The File.delete would clean up afterwards, which should put you in the state you want to be in.
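
The same save-then-parse pattern can also be written with Tempfile, so the cleanup happens even if the tag parsing raises. This is only a sketch: with_local_copy is a hypothetical helper, and a StringIO stands in for the object that open(url) returns, so the example runs without network access or TagLib2.

```ruby
require 'tempfile'
require 'stringio'

# Hypothetical helper: copies any readable IO into a temporary file,
# yields that file's path, and removes the file afterwards, even if the
# block raises.
def with_local_copy(io, basename = 'download')
  Tempfile.create(basename) do |tmp|
    tmp.binmode
    IO.copy_stream(io, tmp)
    tmp.flush
    yield tmp.path
  end
end

remote = StringIO.new('fake mp3 bytes') # stands in for open(url)
seen_path = nil
size = with_local_copy(remote) do |path|
  seen_path = path
  File.size(path) # a real caller would pass the path to TagLib2::File.new
end
```

Because the helper yields a path rather than a file handle, it fits libraries that, like TagLib2::File.new here, only accept a file name.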

This solution isn't going to work much longer. Paperclip > 3.0.0 has removed to_file. I'm using S3 & Heroku. What I ended up doing was copying the file to a temporary location and parsing it from there. Here is my code:
dest = Tempfile.new(upload.spreadsheet_file_name)
dest.binmode
upload.spreadsheet.copy_to_local_file(:default_style, dest.path)
file_loc = dest.path
...
CSV.foreach(file_loc, :headers => true, :skip_blanks => true) do |row|
  # ...
end

This seems to work instead of open-URI:
Mp3Info.open(mp3.to_file.path) do |mp3info|
  puts mp3info.tag.artist
end
Paperclip has a to_file method that downloads the file from S3.
