Getting a datafile from an AWS S3 bucket and parsing it in Rails?

I'm creating a Ruby script in Rails that will:
1) create an S3 object with the AWS S3 SDK
2) iterate through the bucket and download (get) each file
3) store each file in memory and convert it to a string
4) parse the string for data and re-upload the file, based on the parsed data, to an appropriate folder in the same bucket
Here is the code I have so far in a Rails job:
def aws_get
  io = IO.new(1)
  bucket_col = []
  s3 = Aws::S3::Resource.new(
    region: 'us-east-1',
    access_key_id: Rails.application.credentials.dig(:aws, :access_key_id),
    secret_access_key: Rails.application.credentials.dig(:aws, :secret_access_key)
  )
  s3.bucket('missouridata').objects.each do |object|
    obj = s3.bucket('missouridata').object(object.key)
    file = obj.get(response_target: io)
    ???
  end
end
The question marks are where I don't know what to do next. How do I take the file stored in memory and convert it to a string to be parsed?

I have the perfect solution for you. I have been using the fog gem to manipulate S3 buckets for a while; it can do pretty much anything for you.
Here is the reference link.
https://www.ironin.it/blog/manipulating-files-on-amazon-s3-storage-with-rubys-fog-gem.html
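For what it's worth, the aws-sdk gem can also do this directly: Object#get returns a response whose body is a StringIO, so you can read it straight into a String without a response_target. A rough sketch along the lines of the question's job (parse_target_folder is a hypothetical placeholder for your parsing logic):
require 'aws-sdk-s3'

s3 = Aws::S3::Resource.new(
  region: 'us-east-1',
  access_key_id: Rails.application.credentials.dig(:aws, :access_key_id),
  secret_access_key: Rails.application.credentials.dig(:aws, :secret_access_key)
)
bucket = s3.bucket('missouridata')

bucket.objects.each do |summary|
  # get.body is a StringIO; #read turns it into a plain String
  contents = summary.object.get.body.read

  # parse_target_folder is a hypothetical helper that inspects the data
  # and returns the destination folder name
  folder = parse_target_folder(contents)
  bucket.object("#{folder}/#{File.basename(summary.key)}").put(body: contents)
end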

Related

How to copy list of public S3 files to private S3 bucket

In Rails, using the aws-sdk gem, what is the easiest way to copy a list of public files (say 5k of them) hosted on S3 (not my account) into my private bucket? I want to keep the same file and path names.
Example:
http://target.com.s3.amazonaws.com/assets/videos/abc123.mp4 (public)
http://myexample.com.s3.amazonaws.com/assets/videos/abc123.mp4 (private)
I would like to read the files into memory and stream them directly into S3; I won't have disk space with my hosting provider (Heroku). These files are MP4s and are about 3-4 MB in size.
Here's my approach (UNTESTED):
vid_file = 'http://example.com.s3.amazonaws.com/assets/videos/abc123.mp4'
vid_response = HTTParty.get(vid_file)
if vid_response.code == 200
  filename = File.basename(vid_file) # TODO: include the s3 folder before the object filename
  s3 = Aws::S3::Resource.new(region: ENV['AWS_REGION'])
  obj = s3.bucket(ENV['S3_BUCKET']).object(filename)
  obj.put(body: vid_response.body)
end
However, is there a way with the SDK to direct AWS to perform an internal copy between the S3 buckets, given that I don't have the keys for the first bucket (but the objects are public)? If not, is my approach above correct (streaming into memory, posting to S3)?
One easy solution, if you know the file name pattern, is to use something like wget and then a Ruby S3 client to upload to your private bucket. I understand why you would want to use memory instead of disk, but honestly, assuming you have a couple of gigs free, your internet connection is probably the bottleneck.
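For example, roughly (a sketch assuming wget is on the PATH and there is enough local disk for each file):
require 'aws-sdk-s3'
require 'tmpdir'

url = 'http://target.com.s3.amazonaws.com/assets/videos/abc123.mp4'
local_path = File.join(Dir.tmpdir, File.basename(url))

# download with wget, then push the local file up with the aws-sdk gem
system('wget', '-O', local_path, url)

s3 = Aws::S3::Resource.new(region: ENV['AWS_REGION'])
s3.bucket(ENV['S3_BUCKET']).object('assets/videos/abc123.mp4').upload_file(local_path)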
1) There is no SDK feature for an 'internal copy' of public S3 objects into one's private S3 bucket.
2) The source below works and keeps the same S3 directory structure:
vid_file = 'http://example.com.s3.amazonaws.com/assets/videos/abc123.mp4'
vid_response = HTTParty.get(vid_file)
if vid_response.code == 200
  uri_path = URI(vid_file).path
  uri_path.slice!(0) # slice!(0) removes the leading slash, which would otherwise create an empty s3 folder
  s3 = Aws::S3::Resource.new(region: ENV['AWS_REGION'])
  obj = s3.bucket(ENV['S3_BUCKET']).object(uri_path)
  obj.put(body: vid_response.body) unless obj.exists?
end

Ruby aws-sdk - ".exists?" says the file doesn't exist even though I see it in the bucket

I have been stuck all afternoon on checking whether an uploaded file exists in AWS S3. I am using Ruby on Rails and the aws-sdk gem, v2.
First of all - the file exists in the bucket, it is located here:
test_bucket/users/10/file_test.pdf
There's no typo, this is the exact path. Also, the bucket + credentials are set up correctly.
And here's how I try to check the existence of the file:
config = {
  region: 'us-west-1',
  bucket: AWS_S3_CONFIG['bucket'],
  key: AWS_S3_CONFIG['access_key_id'],
  secret: AWS_S3_CONFIG['secret_access_key']
}
Aws.config.update({
  region: config[:region],
  credentials: Aws::Credentials.new(config[:key], config[:secret]),
  s3: { region: 'us-east-1' }
})
bucket = Aws::S3::Resource.new.bucket(config[:bucket])
puts bucket.object("file_test.pdf").exists?
The output is always false.
I also tried puts bucket.object("test_bucket/users/10/file_test.pdf").exists?, but still false.
Also, I tried making the file public in the AWS S3 dashboard, but no success, still false. The file is visible when clicking the generated link, but when I use the aws-sdk to check whether the file exists, the output is still false.
What am I doing wrong?
Thank you.
You need to pass the full path to the object (not including the bucket name) - users/10/file_test.pdf
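In other words (a sketch reusing the config from the question):
bucket = Aws::S3::Resource.new.bucket(config[:bucket])
puts bucket.object('users/10/file_test.pdf').exists?  # => true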

Rails 4, Fog, Amazon S3 - retrieving all the images from a specific folder in a bucket as an array

I am using Amazon S3, Rails 4, and the Fog gem. I have an Amazon bucket called uipstudy with 100 folders, each containing about 20 images. I use the following to get all the images in a specific folder (in my application_helper.rb, which is included in application_controller.rb).
def get_files(image_folder)
  connection = Fog::Storage.new(
    provider: 'AWS',
    aws_access_key_id: '######',
    aws_secret_access_key: '#######'
  )
  connection.directories.get('uipimages', prefix: image_folder).files.map do |file|
    file.key
  end
end
In my controller I have this; in this example I am looking in the folder "1" in the uipstudy bucket.
# Amazon solution:
@images = get_files('1')
@images.each do |image|
  image = "https://s3.amazonaws.com/uipstudy/#{image}"
  @image_array << image
end
The problem is that it's returning the files inside the folder labelled "1", but also those in 10, 11, 12, 13, and so on. I assumed that the prefix was an absolute match, but apparently it is not. Is there a way to enforce that the prefix matches exactly the folder specified?
I think you should be able to make a small change in your script to get the behavior you want. Simply append a forward slash to the prefix so that it clearly shows you want things that are like a directory instead of any/all things that begin with a particular character.
So, that would get you something like:
directory = connection.directories.get('uipimages', prefix: image_folder + '/')
directory.files.map do |file|
  file.key
end
(I just split it into two commands to make it easier to format and read.)
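Folded back into the question's helper, that would look roughly like this (a sketch):
def get_files(image_folder)
  connection = Fog::Storage.new(
    provider: 'AWS',
    aws_access_key_id: '######',
    aws_secret_access_key: '#######'
  )
  # The trailing slash restricts the listing to keys under "1/", so keys under
  # "10/", "11/", etc. no longer match the prefix
  connection.directories.get('uipimages', prefix: "#{image_folder}/").files.map(&:key)
end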
Below is my solution using the aws-sdk gem.
# initialize the S3 client
s3 = AWS::S3.new
bucket = s3.buckets[ENV['AWS_BUCKET']]

# regex for .ipa files in the _inbox folder
regex = %r{_inbox/(?:[^/]+/)*[^/]+\.ipa}i

# get and process the .ipa files
bucket.objects.select { |o| o.key.match(regex) }.each do |ipa|
  # ... process each matching object here
end

Dealing with null bytes when creating ec2 with user_data using fog

I am trying to provision an EC2 instance using fog; here is the code that I am using:
compute = Fog::Compute.new(
  provider: 'AWS',
  region: 'us-east-1',
  aws_access_key_id: ACCESS_KEY,
  aws_secret_access_key: SECRET_ACCESS_KEY
)
options = {
  image_id: 'ami-xxxxxx',
  flavor_id: 'm1.small',
  # custom security group created in the AWS account with open ports
  groups: ['myGroup'],
  private_key_path: '~/.ssh/id_rsa',
  public_key_path: '~/.ssh/id_rsa.pub',
  username: 'ec2-user',
  user_data: File.read(Rails.root.join('public', 'somefile.zip'))
}
compute.servers.bootstrap options
When I run this, I get the following error:
Fog::JSON::EncodeError: string contains null byte
from /home/gaurish/.rvm/gems/ruby-2.0.0-p247/gems/multi_json-1.8.2/lib/multi_json/adapters/oj.rb:20:in `dump'
As you may notice above, I am supplying a ZIP file for the user_data option, and this is where I think the problem occurs. My guess is that the zip file, or the process of encoding it to Base64, somehow introduces a null byte ("\0"), which means Oj can't encode it to JSON.
Now:
Can anyone verify whether this is a bug in fog, or am I doing something wrong?
Are there any workarounds to avoid the null bytes?
Versions used:
Fog 1.19
multi_json-1.8.2
oj-2.2.3
I have solved this issue. Here is how:
file = File.open(path, 'rb') #path => path to zip file
contents = file.read
file.close
user_data = Base64.encode64 contents
Now this user_data can be safely passed into the options[:user_data] hash without null byte errors. The issue is being tracked here:
https://github.com/fog/fog/issues/2506
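The encoded string then goes into the bootstrap options from the question, e.g.:
options[:user_data] = user_data # Base64-encoded zip contents from above
compute.servers.bootstrap options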

How to copy file across buckets using aws-s3 or aws-sdk gem in ruby on rails

The aws-s3 documentation says:
# Copying an object
S3Object.copy 'headshot.jpg', 'headshot2.jpg', 'photos'
But how do I copy headshot.jpg from the photos bucket to the archive bucket, for example?
Thanks!
Deb
The aws-sdk gem's S3Object#copy_to copies data from the current object to another object in S3. S3 handles the copy, so the client does not need to fetch the data and upload it again. You can also change the storage class and metadata of the object when copying.
It uses the copy_object method internally, so the copy functionality allows you to copy objects within or between your S3 buckets, and optionally to replace the metadata associated with the object in the process.
There are two approaches: the standard method (download, then re-upload) and the copy method (server-side copy).
Code sample:
require 'aws-sdk'

AWS.config(
  access_key_id: '***',
  secret_access_key: '***',
  max_retries: 10
)

file = 'test_file.rb'
bucket_0 = { name: 'bucket_from', endpoint: 's3-eu-west-1.amazonaws.com' }
bucket_1 = { name: 'bucket_to', endpoint: 's3.amazonaws.com' }

# upload the file to the source bucket
s3_interface_from = AWS::S3.new(s3_endpoint: bucket_0[:endpoint])
bucket_from = s3_interface_from.buckets[bucket_0[:name]]
bucket_from.objects[file].write(open(file))

# server-side copy into the destination bucket
s3_interface_to = AWS::S3.new(s3_endpoint: bucket_1[:endpoint])
bucket_to = s3_interface_to.buckets[bucket_1[:name]]
bucket_to.objects[file].copy_from(file, bucket: bucket_from)
Using the right_aws gem:
# With s3 being an S3 object acquired via S3Interface.new
# Copies key1 from bucket b1 to key1_copy in bucket b2:
s3.copy('b1', 'key1', 'b2', 'key1_copy')
The gotcha I ran into is that if you have pics/1234/yourfile.jpg, the bucket is only "pics" and the key is "1234/yourfile.jpg".
I got the answer from here: How do I copy files between buckets using s3 from a rails application?
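To make that gotcha concrete with the right_aws call above (a sketch):
# bucket is "pics", key is "1234/yourfile.jpg" -- not bucket "pics/1234"
s3.copy('pics', '1234/yourfile.jpg', 'archive', '1234/yourfile.jpg')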
For anyone still looking, AWS has documentation for this. It's actually very simple with the aws-sdk gem:
bucket = Aws::S3::Bucket.new('source-bucket')
object = bucket.object('source-key')
object.copy_to(bucket: 'target-bucket', key: 'target-key')
When using the AWS SDK gem's copy_from or copy_to, there are three things that aren't copied by default: the ACL, the storage class, and server-side encryption. You need to specify them as options.
from_object.copy_to from_object.key, {:bucket => 'new-bucket-name', :acl => :public_read}
https://github.com/aws/aws-sdk-ruby/blob/master/lib/aws/s3/s3_object.rb#L904
Here's a simple ruby class to copy all objects from one bucket to another bucket: https://gist.github.com/edwardsharp/d501af263728eceb361ebba80d7fe324
Multiple images can easily be copied using the aws-sdk gem as follows:
require 'aws-sdk'

image_names = ['one.jpg', 'two.jpg', 'three.jpg', 'four.jpg', 'five.png', 'six.jpg']

Aws.config.update({
  region: "destination_region",
  credentials: Aws::Credentials.new('AWS_ACCESS_KEY_ID', 'AWS_SECRET_ACCESS_KEY')
})

# create the client once and reuse it for every copy
s3 = Aws::S3::Client.new
image_names.each do |img|
  s3.copy_object({
    bucket: "destination_bucket_name",
    copy_source: URI.encode_www_form_component("/source_bucket_name/path/to/#{img}"),
    key: "path/where/to/save/#{img}"
  })
end
If you have a large number of images, it is a good idea to put the copying process in a background job.
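For instance, a minimal ActiveJob sketch (the class name and bucket names are illustrative):
class CopyImagesJob < ApplicationJob
  queue_as :default

  def perform(image_names)
    s3 = Aws::S3::Client.new
    image_names.each do |img|
      s3.copy_object(
        bucket: 'destination_bucket_name',
        copy_source: "/source_bucket_name/path/to/#{img}",
        key: "path/where/to/save/#{img}"
      )
    end
  end
end

# enqueue from a controller or console
CopyImagesJob.perform_later(['one.jpg', 'two.jpg'])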
I believe that in order to copy between buckets you must read the file's contents from the source bucket and then write them back to the destination bucket via your application's memory space. There's a snippet showing this using aws-s3 here, and another approach using right_aws here.
The aws-s3 gem does not have the ability to copy files between buckets without moving them to your local machine. If that's acceptable to you, then the following will work:
AWS::S3::S3Object.store 'dest-key', open('http://url/to/source.file'), 'dest-bucket'
I ran into the same issue that you had, so I cloned the source code for AWS-S3 and made a branch that has a copy_to method that allows for copying between buckets, which I've been bundling into my projects and using when I need that functionality. Hopefully someone else will find this useful as well.
View the branch on GitHub.
