I have a directory of PDF files, each under 20MB (each PDF represents an ad), on an AWS EC2 large instance. I'm trying to upload each PDF file to S3 using Ruby and DM-Paperclip.
Most files upload successfully but some seem to take hours with the CPU hanging at 100%. I've located the line of code that causes the issue by printing debug statements in the relevant section.
# Takes an array of pdf file paths and uploads each to S3 using dm-paperclip
def save_pdfs(pdf_files)
  pdf_files.each do |path|
    pdf = File.open(path)
    ad = Ad.new
    ad.pdf.assign(pdf) # <= Last debug statement is printed before this line
    begin
      ad.save
    rescue => e
      # log error
    ensure
      pdf.close
    end
  end
end
To help troubleshoot the issue I attached strace to the process while it was stuck at 100%. The result was hundreds of thousands of lines like this:
...
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=3543, ...}) = 0
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=3543, ...}) = 0
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=3543, ...}) = 0
... 500K lines
Followed by a few thousand:
...
brk(0x1224d0000) = 0x1224d0000
brk(0x1224f3000) = 0x1224f3000
brk(0x122514000) = 0x122514000
...
During an upload that doesn't hang, strace looks like this:
...
ppoll([{fd=12, events=POLLOUT}], 1, NULL, NULL, 8) = 1 ([{fd=12, revents=POLLOUT}])
fstat(12, {st_mode=S_IFSOCK|0777, st_size=0, ...}) = 0
fcntl(12, F_GETFL) = 0x2 (flags O_RDWR)
write(12, "%PDF-1.3\n%\342\343\317\323\n8 0 obj\n<</Filter"..., 4096) = 4096
ppoll([{fd=12, events=POLLOUT}], 1, NULL, NULL, 8) = 1 ([{fd=12, revents=POLLOUT}])
write(12, "S\34\367\23~\277u\272,h\204_\35\215\35\341\347\324\310\307u\370#\364\315\t~^\352\272\26\374"..., 4096) = 4096
ppoll([{fd=12, events=POLLOUT}], 1, NULL, NULL, 8) = 1 ([{fd=12, revents=POLLOUT}])
write(12, "\216%\267\2454`\350\177\4\36\315\211\7B\217g\33\217!e\347\207\256\264\245vy\377\304\256\307\375"..., 4096) = 4096
...
The PDF files that cause this issue seem random. They are all valid PDF files, and they are all relatively small, ranging from ~100KB to ~50MB.
Are the seemingly excessive stat system calls in the strace output related to my issue?
It appears to be a problem with the original file permissions on the failing files, carried over from the machine they were sent from. Either have the script assign 0644 to every PDF before it is saved to the server, or be aware that the script may be reusing whatever permissions it picks up from the client.
Basically, the server OS and its configuration reject the file because its permissions are not 0644 when it is written to disk.
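If the permissions theory holds, one way to rule it out is to normalize the mode of every local file before handing it to Paperclip. Below is a minimal sketch that reuses the save_pdfs method from the question; the File.chmod call is the only addition.

def save_pdfs(pdf_files)
  pdf_files.each do |path|
    File.chmod(0644, path) # force predictable permissions on the local copy before upload
    pdf = File.open(path)
    ad = Ad.new
    ad.pdf.assign(pdf)
    begin
      ad.save
    rescue => e
      # log error
    ensure
      pdf.close
    end
  end
end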
Starting point:
There is a video called myVideo.mp4 in a folder (/1_original_videos) in a bucket called myBucket in Google Cloud Storage.
myBucket
  --> /1_original_videos
        --> myVideo.mp4
Goal:
The goal is to take this video, split it into chunks in a Cloud Function called myCloudFunction, and save the chunks in a subfolder called chunks in myBucket. Dividing the video into chunks is not a problem; the problem is reading the video.
myCloudFunction must be triggered with an HTTP trigger.
                  _______________
myVideo.mp4 ---->|myCloudFunction|----> chunk0.mp4, chunk1.mp4, chunk2.mp4, ... , chunkN-1.mp4
                  ‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾
                         ^
                         |
                         |
                         |
                    HTTP trigger
If the video were on my local computer, in order to read it, the following would be enough:
import cv2
cap = cv2.VideoCapture("/some/path/in/my/local/computer/myVideo.mp4")
Attempts:
Path with authenticated URL:
import cv2
cap = cv2.VideoCapture("https://storage.cloud.google.com/myBucket/1_original_videos/myVideo.mp4")
When testing this approach, this is the resulting message (see complete code below):
"File Cannot be Opened"
Complete code:
import cv2

def video2chunks(request):
    # Request:
    REQUEST_JSON = request.get_json()
    # If the HTTP request contains a key called "start" (e.g. {"start": "whatever"}):
    if REQUEST_JSON and 'start' in REQUEST_JSON:
        try:
            # Create VideoCapture object:
            cap = cv2.VideoCapture("https://storage.cloud.google.com/myBucket/1_original_videos/myVideo.mp4")
            # If no VideoCapture object is created:
            if not cap.isOpened():
                message = "File Cannot be Opened"
            # If a VideoCapture object is created, compute some of the video parameters:
            else:
                fps = int(cap.get(cv2.CAP_PROP_FPS))
                size = (int(cap.get(cv2.CAP_PROP_FRAME_WIDTH)), int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT)))
                fourcc = int(cv2.VideoWriter_fourcc('X', 'V', 'I', 'D'))  # XVID codec
                message = "Video downloaded successfully. Some params are: "
                message += "FPS= " + str(fps) + " | size= " + str(size)
        except Exception as e:
            message = str(e)
    else:
        message = "You did not provide a key called start "
    return message
I have been trying to find examples or a better way to do this in a Cloud Function but so far have been unsuccessful. Any alternatives would also be very much appreciated.
I'm not aware of whether the cv2 library supports reading directly from Cloud Storage in some way. Nonetheless, as Christoph points out, you may download the file, process it, and upload the results. The code will be essentially the same as running locally.
One thing to note is that Cloud Functions offer a temporary directory, which is where I chose to store the file. However, it's important to know that any file stored there actually consumes part of your function's RAM, so the allocated function memory should be sized accordingly. You may also notice the temp files are deleted before exiting the function; this is just a best practice in Cloud Functions.
import cv2
import os
from google.cloud import storage

def myfunc(request):
    # Substitute the variables below for whatever suits your needs
    # BUCKET_ID :: The bucket ID
    # INPUT_IMAGE_GCS :: Path to the GCS object
    # OUTPUT_IMAGE_PATH :: Path to save the resulting image/video
    # Read video and save to /tmp directory
    bucket = storage.Client().bucket(BUCKET_ID)
    blob = bucket.blob(INPUT_IMAGE_GCS)
    blob.download_to_filename('/tmp/video.mp4')
    # Video processing stuff
    vidcap = cv2.VideoCapture('/tmp/video.mp4')
    success, image = vidcap.read()
    cv2.imwrite('/tmp/frame.jpg', image)
    # Save results to GCS
    img_blob = bucket.blob('potato/frame.jpg')
    img_blob.upload_from_filename(OUTPUT_IMAGE_PATH)
    # Delete tmp resources to free memory
    os.remove('/tmp/video.mp4')
    os.remove('/tmp/frame.jpg')
    return '', 200
The problem as the user sees it:
User has a long list of documents (200+) they need to download to an iOS device
User starts the download, with each file downloaded in succession
At the end of the download queue, they discover that one of the files fails (and it's always one of two specific files that are 25MB+)
They retry the job (which only downloads the failed document) and it succeeds
What I'm seeing as a developer:
My app pulls down the document as a blob
When I inspect the blob (within my Typescript app code), it has a size > 0
I call this.file.writeFile(directoryPath, fileName, blob, {replace: true}), which calls the Ionic File wrapper around Cordova File Plugin
However, when I look at the blob in the write method of FileWriter.js, it has a size of zero
This all results in the error spitting out as:
{"type":"error","bubbles":false,"cancelBubble":false,"cancelable":false,"lengthComputable":false,"loaded":0,"total":0,"target":{"fileName":"","length":0,"localURL":"cdvfile://localhost/persistent/downloaded-assets/9ce34f8a-6201-4023-9f5b-de6133bd5699/{{redacted}}","position":0,"readyState":2,"result":null,"error":{},"onwritestart":null,"onprogress":null,"onwriteend":null,"onabort":null}}
What I'm getting from this is that somewhere between calling file.writeFile on the Ionic File wrapper in my TypeScript code and the FileWriter.write method in the Cordova package, my blob is getting corrupted, lost, or emptied somehow.
It's difficult to debug because the layer between these two points is minified in the Xcode debugger, so it would also be nice to hear suggestions about how I might debug this better myself.
Do we have any idea what might be going on here? Is it a memory issue on iOS? Does Cordova timeout over multiple repeated requests?
A few things to note:
The full list of files downloads fine every time I try within the iOS simulator in Xcode. This is leading me to believe it might be a memory issue, but I'm not sure.
The failure always happens after about 200 files, and on one of two files that are 25-30MB+
As far as debugging goes, the earliest I can see my blob reduced to 0 is here https://github.com/apache/cordova-plugin-file/blob/4a92bbbea755aa9e5bf8cfd160fb9da1cd3287cd/www/FileWriter.js#L107 (though I might be debugging incorrectly)
EDIT - After a little more digging, I was able to see exactly where the Ionic plugin went wrong:
The code I used:
private writeFileInChunks(writer: FileWriter, file: Blob) {
  console.log('SIZE OF FILE AT START', file.size);
  const BLOCK_SIZE = 1024 * 1024;
  let writtenSize = 0;

  function writeNextChunk() {
    const size = Math.min(BLOCK_SIZE, file.size - writtenSize);
    console.log('CALCULATED SIZE:', size);
    const chunk = file.slice(writtenSize, writtenSize + size);
    console.log('SIZE OF CHUNK TO WRITE', chunk.size);
    writtenSize += size;
    writer.write(chunk);
  }

  return getPromise<any>((resolve, reject) => {
    writer.onerror = reject as (event: ProgressEvent) => void;
    writer.onwrite = () => {
      if (writtenSize < file.size) {
        writeNextChunk();
      } else {
        resolve();
      }
    };
    writeNextChunk();
  });
}
The output for the failed document:
SIZE OF FILE AT START: 34012899
CALCULATED SIZE: 1048576
SIZE OF CHUNK TO WRITE: 0
On retry:
SIZE OF FILE AT START: 34012899
CALCULATED SIZE: 1048576
SIZE OF CHUNK TO WRITE: 1048576
CALCULATED SIZE: 1048576
SIZE OF CHUNK TO WRITE: 1048576
...
...
...
CALCULATED SIZE: 458467
SIZE OF CHUNK TO WRITE: 458467
So for whatever reason, after a large number of previous downloads, that file.slice step results in an empty/corrupted blob.
Any ideas on how to correct this?
Ran another test with some expanded logging:
private writeFileInChunks(writer: FileWriter, file: Blob) {
  ...

  function writeNextChunk() {
    const size = Math.min(BLOCK_SIZE, file.size - writtenSize);
    console.log('CALCULATED SIZE:', size);
    console.log('WRITTEN SIZE', writtenSize);
    console.log('SUMS TO:', writtenSize + size);
    console.log('FILE SIZE BEFORE SLICE:', file.size);
    const chunk = file.slice(writtenSize, writtenSize + size);
    console.log('SIZE OF CHUNK TO WRITE', chunk.size);
    writtenSize += size;
    writer.write(chunk);
  }

  ...
  ...
}
Output came to:
CALCULATED SIZE: 1048576
WRITTEN SIZE: 0
SUMS TO: 1048576
FILE SIZE BEFORE SLICE: 34012899
SIZE OF CHUNK TO WRITE 0
Further confirming the issue
I'm trying to upload PNG/JPG images of less than 10 kilobytes to Amazon S3.
The upload succeeds, but the file is uploaded and stored with 0 bytes.
When I try to view the image at the link provided by S3, I get a blank page.
If I upload an image bigger than 10 kilobytes, it's fine.
Does anyone have any idea what the problem is?
file_name = account[:img_token] + File.extname(img.original_filename)
file = Tempfile.new(file_name, encoding: 'ascii-8bit')
file.write(img.read)
path = file.path
bucket_name = 'bucket'
s3 = AWS::S3.new(access_key_id: ENV['S3_ACCESS_KEY'], secret_access_key: ENV['S3_SECRET_ACCESS_KEY'])
link = 'https://s3-eu-west-1.amazonaws.com/' + bucket_name + '/' + file_name
key = file_name
object = s3.buckets[bucket_name].objects[key].write(file: path, acl: 'public-read')
It looks like you are passing a file to the SDK that is seeked to the end of the file. I suspect the SDK is calling #read and getting nil back. Try rewinding the file first.
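For reference, a minimal sketch of what that could look like with the code from the question (same variable names and the same old aws-sdk v1 calls shown there); file.rewind flushes Ruby's write buffer and resets the position, so the data is actually on disk before the SDK reads the path:

file_name = account[:img_token] + File.extname(img.original_filename)
file = Tempfile.new(file_name, encoding: 'ascii-8bit')
file.write(img.read)
file.rewind # flush buffered bytes to disk and reset the file position

path = file.path
bucket_name = 'bucket'
s3 = AWS::S3.new(access_key_id: ENV['S3_ACCESS_KEY'], secret_access_key: ENV['S3_SECRET_ACCESS_KEY'])
object = s3.buckets[bucket_name].objects[file_name].write(file: path, acl: 'public-read')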
According to this answer
Write, read, and delete objects containing from 1 byte to 5 terabytes
of data each. The number of objects you can store is unlimited.
When I upload a file and send it to S3 (a 5 MB CSV file) on Heroku, I get error H13 (Connection closed without response). I added the following lines to environment.rb and application.rb:
if Rack::Utils.respond_to?("key_space_limit=")
  Rack::Utils.key_space_limit = 262144 # 4 times the default size
end
but it did not help.
My code is:
def import_from_csv(csv_file, user, organization, related_object, after_import = nil)
  file_upload = FileUpload.upload_to_s3!(user, csv_file)
  delay.process_csv(file_upload, related_object, organization, after_import)
end
I am creating a Rails app, hosted on Heroku, that allows the user to generate animated GIFs on the fly based on an original JPG hosted somewhere on the web (think of it as a crop-and-resize app). I tried Paperclip but, AFAIK, it does not handle dynamically generated files. I am using the aws-sdk gem, and this is a snippet from my controller:
im = Magick::Image.read(@animation.url).first
fr1 = im.crop(@animation.x1, @animation.y1, @animation.width, @animation.height, true)
str1 = fr1.to_blob
fr2 = im.crop(@animation.x2, @animation.y2, @animation.width, @animation.height, true)
str2 = fr2.to_blob
list = Magick::ImageList.new
list.from_blob(str1)
list.from_blob(str2)
list.delay = @animation.delay
list.iterations = 0
That is the basic creation of a two-frame animation. RMagick can generate a GIF on my development computer with this line:
list.write("#{Rails.public_path}/images/" + #animation.filename)
I tried uploading the list structure to S3:
# upload to Amazon S3
s3 = AWS::S3.new
bucket = s3.buckets['mybucket']
obj = bucket.objects[@animation.filename]
obj.write(:single_request => true, :content_type => 'image/gif', :data => list)
But I don't have a size method in RMagick::ImageList that I can use to specify that. I tried "precompiling" the GIF into another RMagick::Image:
anim = Magick::Image.new(@animation.width, @animation.height)
anim.format = "GIF"
list.write(anim)
But Rails crashes with a segmentation fault:
/path/to/my_controller.rb:103: [BUG] Segmentation fault ruby 1.8.7 (2010-01-10 patchlevel 249) [universal-darwin11.0]
Abort trap: 6
Line 103 corresponds to list.write(anim).
So right now I have no idea how to do this and would appreciate any help I receive.
As per @mga's request in his answer to his original question...
A non-filesystem-based approach is pretty simple:
rm_image = Magick::Image.from_blob(params[:image][:datafile].read)[0]
# [0] because from_blob returns an array
# the blob, presumably, can have multiple images data in it
a_thumbnail = rm_image.resize_to_fit(150, 150)
# just as an example of doing *something* with it before writing
s3_bucket.objects['my_thumbnail.jpg'].write(a_thumbnail.to_blob, {:acl=>:public_read})
Voila! Reading an uploaded file, manipulating it with RMagick, and writing it to S3 without ever touching the filesystem.
Since this project is hosted on Heroku, I cannot use the filesystem, which is why I was trying to do everything in code. I found that Heroku does have a temporary writable folder: http://devcenter.heroku.com/articles/read-only-filesystem
This works just fine in my case since I don't need the file after this request.
The resulting code:
im = Magick::Image.read(@animation.url).first
fr1 = im.crop(@animation.x1, @animation.y1, @animation.width, @animation.height, true)
fr2 = im.crop(@animation.x2, @animation.y2, @animation.width, @animation.height, true)
list = Magick::ImageList.new
list << fr1
list << fr2
list.delay = @animation.delay
list.iterations = 0
# gotta packet the file
list.write("#{Rails.root}/tmp/#{@animation.filename}.gif")
# upload to Amazon S3
s3 = AWS::S3.new
bucket = s3.buckets['mybucket']
obj = bucket.objects[@animation.filename]
obj.write(:file => "#{Rails.root}/tmp/#{@animation.filename}.gif")
It would be interesting to know if a non-filesystem-writing solution is possible.
I am updating this answer for AWS SDK version 2; the code should be:
rm_image = Magick::Image.from_blob(params[:image][:datafile].read)[0]
# [0] because from_blob returns an array
# the blob, presumably, can have multiple images data in it
a_thumbnail = rm_image.resize_to_fit(150, 150)
# just as an example of doing *something* with it before writing
s3 = Aws::S3::Resource.new
bucket = s3.bucket('mybucket')
obj = bucket.object('filename')
obj.put(body: a_thumbnail.to_blob)
I think there are a few things going on here. First, the documentation for RMagick is sub-par, and it's easy to get sidetracked. The code you're using to generate the GIF can be a little simpler. I cooked up a very contrived example here:
#!/usr/bin/env ruby
require 'rubygems'
require 'RMagick'
# read in source file
im = Magick::Image.read('foo.jpg').first
# make two slightly different frames
fr1 = im.crop(0, 100, 300, 300, true)
fr2 = im.crop(0, 200, 300, 300, true)
# create an ImageList
list = Magick::ImageList.new
# add our images to it
list << fr1
list << fr2
# set some basic values
list.delay = 100
list.iterations = 0
# write out an animated gif to the filesystem
list.write("foo.gif")
This code works: it reads in a JPG I have locally and writes out a two-frame animation. Obviously I've hardcoded some values here, but there's no reason this shouldn't work for you. I am running Ruby 1.9.2 and probably a different version of RMagick, but this is basic code.
The second issue is totally unrelated -- is it possible to upload an image generated in IM to S3 without actually hitting the filesystem? Basically, will this ever work:
obj.write(:single_request => true, :content_type => 'image/gif', :data => list)
I'm not sure whether it is or not. I experimented with calling list.to_blob, but it only outputs the first frame, and as a JPG, although I didn't spend much time on it. You might be able to fool list.write into outputting somewhere, but rather than going down that road, I would personally just write out the file unless that is impossible for some reason.
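If you do want to keep experimenting with the no-filesystem route, one hedged idea, untested here and possibly dependent on the RMagick version, is to set the output format inside the block that to_blob accepts and then hand the resulting string to the SDK as :data, reusing the bucket and object setup from earlier in the thread:

# Sketch only: verify that this really yields a multi-frame GIF with your RMagick version.
gif_data = list.to_blob { self.format = 'GIF' } # ask ImageMagick to encode the whole list as a GIF

s3 = AWS::S3.new
bucket = s3.buckets['mybucket']
obj = bucket.objects[@animation.filename]
obj.write(:data => gif_data, :content_type => 'image/gif')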