How to load a mat file from a Google Cloud Storage bucket in a Jupyter notebook

I am trying to train a model on ~16 GB of image data. I need to import an annotations.mat file from my Cloud Storage bucket. However, since loadmat requires a file path, I am not sure how to pass it a path inside a Cloud Storage bucket. I tried to create a pickle file of the mat data instead, but Jupyter Notebook crashes.
Current attempt:
from google.cloud import storage
client = storage.Client()
bucket = client.get_bucket('bucket-id')
blob = bucket.get_blob('path/to/annotations.pkl')
# crashes here
print(blob.download_as_string())
I want to do something like:
import scipy.io as sio
client = storage.Client()
bucket = client.get_bucket('bucket-id')
matfile = sio.loadmat(bucket_path + 'path/to/annotations.pkl')
Does anyone know how to load a mat file from a Cloud Storage bucket?

I haven't found a way to load a blob object directly as a mat file in Python. However, there is a workaround: instead of passing the blob object to loadmat, download it to a temporary file and pass that file's path to loadmat.
To reproduce the scenario, I followed the Google Cloud Storage Python example and uploaded a mat file to a bucket. The following Python code downloads the blob to a temporary file, reads it using loadmat, and finally removes the temporary file:
import os
from google.cloud import storage
import scipy.io
bucket_name = '<BUCKET NAME>'
mat_file_path = '<PATH>/<MAT FILENAME>'
temp_mat_filename = 'temp.mat'
storage_client = storage.Client()
bucket = storage_client.get_bucket(bucket_name)
blob = bucket.blob(mat_file_path)
# Download mat file to temporary mat file
blob.download_to_filename(temp_mat_filename)
# Get mat object from temporary mat file
mat = scipy.io.loadmat(temp_mat_filename)
# Remove the temporary mat file
os.remove(temp_mat_filename)
Hope it helps :)
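If the file fits comfortably in memory, the temporary file may not be needed at all: scipy.io.loadmat also accepts an open file-like object, so the blob's bytes can be wrapped in io.BytesIO. The following is only a minimal sketch of that idea, not something tested against the setup in the question. The helper name load_blob_in_memory and the FakeBlob class are made up for illustration; the only client call assumed is Blob.download_as_bytes() from google-cloud-storage. json.load stands in for the loader so the sketch runs without scipy or network access.

```python
import io
import json


def load_blob_in_memory(blob, loader):
    """Download `blob` into memory and hand it to `loader` as a file-like object."""
    buffer = io.BytesIO(blob.download_as_bytes())
    return loader(buffer)


class FakeBlob:
    """Hypothetical stand-in for google.cloud.storage.Blob, for offline runs."""

    def __init__(self, payload):
        self._payload = payload

    def download_as_bytes(self):
        return self._payload


if __name__ == "__main__":
    # json.load plays the role that scipy.io.loadmat would play for a mat file.
    blob = FakeBlob(b'{"annotations": [1, 2, 3]}')
    print(load_blob_in_memory(blob, json.load))
```

With a real blob, the call would presumably look like load_blob_in_memory(bucket.blob('path/to/annotations.mat'), scipy.io.loadmat).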

Uploading objects to the bucket is described in the documentation.
Here is the URL where you can find more info:
https://cloud.google.com/storage/docs/uploading-objects

Related

How to process video with OpenCV2 in Google Cloud function?

Starting point:
There is video called myVideo.mp4 in a folder (/1_original_videos) in a Bucket called myBucket in Google Cloud Storage.
myBucket
-->/1_original_videos
   -->myVideo.mp4
Goal:
The goal is to take this video, split it into chunks in a Cloud Function myCloudFunction and save the chunks in a subfolder called chunks in myBucket. The part of dividing into chunks is not a problem. The problem is reading the video.
myCloudFunction must be triggered with an HTTP trigger.
                  _______________
myVideo.mp4 ---->|myCloudFunction|----> chunk0.mp4, chunk1.mp4, chunk2.mp4, ... , chunkN-1.mp4
                  ‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾
                         ^
                         |
                         |
                         |
                    HTTP trigger
If the video were on my local computer, in order to read it, the following would be enough:
import cv2
cap = cv2.VideoCapture("/some/path/in/my/local/computer/myVideo.mp4")
Attempts:
Path with authenticated URL:
import cv2
cap = cv2.VideoCapture("https://storage.cloud.google.com/myBucket/1_original_videos/myVideo.mp4")
When testing this approach, this is the resulting message (see complete code below):
"File Cannot be Opened"
Complete code:
import cv2
def video2chunks(request):
    # Request:
    REQUEST_JSON = request.get_json()
    # If the HTTP request contains a key called "start" (e.g. {"start": "whatever"}):
    if REQUEST_JSON and 'start' in REQUEST_JSON:
        try:
            # Create VideoCapture object:
            cap = cv2.VideoCapture("https://storage.cloud.google.com/myBucket/1_original_videos/myVideo.mp4")
            # If no VideoCapture object is created:
            if not cap.isOpened():
                message = "File Cannot be Opened"
            # If a VideoCapture object is created, compute some of the video parameters:
            else:
                fps = int(cap.get(cv2.CAP_PROP_FPS))
                size = (int(cap.get(cv2.CAP_PROP_FRAME_WIDTH)), int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT)))
                fourcc = int(cv2.VideoWriter_fourcc('X', 'V', 'I', 'D'))  # XVID codec
                message = "Video downloaded successfully. Some params are: "
                message += "FPS= " + str(fps) + " | size= " + str(size)
        except Exception as e:
            message = str(e)
    else:
        message = "You did not provide a key called start "
    return message
I have been trying to find examples or a better way to do this in a Cloud Function but so far have been unsuccessful. Any alternatives would also be very much appreciated.
I'm not aware of a way for the cv2 library to read directly from Cloud Storage. Nonetheless, as Christoph points out, you can download the file, process it, and upload the results. The code will be essentially the same as when running locally.
One thing to note is that Cloud Functions offers a temporary directory (/tmp), which is where I chose to store the files. However, it's important to know that any file stored there consumes part of your function's RAM, so the allocated function memory should be sized accordingly. You may also notice that the temp files are deleted before exiting the function; this is a best practice in Cloud Functions.
import cv2
import os
from google.cloud import storage
def myfunc(request):
    # Substitute the variables below for whatever suits your needs
    # BUCKET_ID :: The bucket ID
    # INPUT_IMAGE_GCS :: Path to GCS object
    # Read video and save to /tmp directory
    bucket = storage.Client().bucket(BUCKET_ID)
    blob = bucket.blob(INPUT_IMAGE_GCS)
    blob.download_to_filename('/tmp/video.mp4')
    # Video processing stuff
    vidcap = cv2.VideoCapture('/tmp/video.mp4')
    success, image = vidcap.read()
    cv2.imwrite("/tmp/frame.jpg", image)
    # Save results to GCS ('potato/frame.jpg' is the destination object name; change as needed)
    img_blob = bucket.blob('potato/frame.jpg')
    img_blob.upload_from_filename('/tmp/frame.jpg')
    # Delete tmp resources to free memory
    os.remove('/tmp/video.mp4')
    os.remove('/tmp/frame.jpg')
    return '', 200
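The download, process, upload, clean-up sequence described above can also be sketched with a try/finally, so the temporary files are freed even if the processing step raises. This is an illustrative sketch only: process_object, FakeBucket, FakeBlob and upper_case are hypothetical names, the fakes mimic just the three client calls used in the answer (bucket.blob, download_to_filename, upload_from_filename), and a trivial byte transform stands in for the cv2 work.

```python
import os
import shutil
import tempfile


def process_object(bucket, source_name, dest_name, transform):
    """Download source_name, run transform(local_in, local_out), upload the result."""
    workdir = tempfile.mkdtemp()  # usually resolves under /tmp on Linux
    # Keep the base names so extension-sensitive tools still recognise the files.
    local_in = os.path.join(workdir, "in_" + os.path.basename(source_name))
    local_out = os.path.join(workdir, "out_" + os.path.basename(dest_name))
    try:
        bucket.blob(source_name).download_to_filename(local_in)
        transform(local_in, local_out)
        bucket.blob(dest_name).upload_from_filename(local_out)
    finally:
        # Always free the temporary files, even if a step above raised.
        shutil.rmtree(workdir, ignore_errors=True)


class FakeBlob:
    """Hypothetical in-memory stand-in for a Cloud Storage blob."""

    def __init__(self, objects, name):
        self.objects, self.name = objects, name

    def download_to_filename(self, filename):
        with open(filename, "wb") as handle:
            handle.write(self.objects[self.name])

    def upload_from_filename(self, filename):
        with open(filename, "rb") as handle:
            self.objects[self.name] = handle.read()


class FakeBucket:
    """Hypothetical in-memory stand-in for a Cloud Storage bucket."""

    def __init__(self, objects):
        self.objects = dict(objects)

    def blob(self, name):
        return FakeBlob(self.objects, name)


def upper_case(src, dst):
    """Placeholder for the real processing step (for example, the cv2 code)."""
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        fout.write(fin.read().upper())


if __name__ == "__main__":
    bucket = FakeBucket({"in/hello.txt": b"hello"})
    process_object(bucket, "in/hello.txt", "out/hello.txt", upper_case)
    print(bucket.objects["out/hello.txt"])
```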

How to upload a file to an s3 bucket with a custom resource in aws cdk

I need to upload a zip file to an S3 bucket after its creation. I'm aware of the s3_deployment package, but it doesn't fit my use case because I need the file to be uploaded only once, on stack creation. The s3_deployment package would upload the zip on every update.
I have the following custom resource defined; however, I'm not sure how to pass the body of the file to the custom resource. I've tried opening the file in binary mode, but that returns an error.
app_data_bootstrap = AwsCustomResource(self, "BootstrapData",
    on_create={
        "service": "S3",
        "action": "putObject",
        "parameters": {
            "Body": open('app_data.zip', 'rb'),
            "Bucket": f"my-app-data",
            "Key": "app_data.zip",
        },
        "physical_resource_id": PhysicalResourceId.of("BootstrapDataBucket")
    },
    policy=AwsCustomResourcePolicy.from_sdk_calls(resources=AwsCustomResourcePolicy.ANY_RESOURCE)
)
I don't believe that's possible unless you write a custom script that runs before cdk deploy and uploads your local files to an intermediary S3 bucket. Then you can write a custom resource that, on the on_create event, copies the contents of the intermediary bucket to the bucket that was created via CDK.
Read this paragraph from s3_deployment in CDK docs:
This is what happens under the hood:
1. When this stack is deployed (either via cdk deploy or via CI/CD), the contents of the local website-dist directory will be archived and uploaded to an intermediary assets bucket. If there is more than one source, they will be individually uploaded.
2. The BucketDeployment construct synthesizes a custom CloudFormation resource of type Custom::CDKBucketDeployment into the template. The source bucket/key is set to point to the assets bucket.
3. The custom resource downloads the .zip archive, extracts it and issues aws s3 sync --delete against the destination bucket (in this case websiteBucket). If there is more than one source, the sources will be downloaded and merged pre-deployment at this step.
So, in order to replicate step 1, you have to write a small script that creates an intermediary bucket and uploads your local files to it. Such a script could look like this:
#!/bin/sh
# Create the intermediary bucket, then upload the local files to it
aws s3 mb s3://<intermediary_bucket> --region <region_name>
aws s3 sync <local_directory> s3://<intermediary_bucket>
Then your custom resource can be something like this:
*Note that this will work for copying one object, you can change the code to copy multiple objects.
import json
import boto3
import cfnresponse
def lambda_handler(event, context):
    print('Received request:\n%s' % json.dumps(event, indent=4))
    resource_properties = event['ResourceProperties']
    if event['RequestType'] == 'Create':  # What happens when the resource is created
        try:
            s3 = boto3.resource('s3')
            copy_source = {
                'Bucket': 'intermediary_bucket',
                'Key': 'path/to/filename.extension'
            }
            bucket = s3.Bucket('otherbucket')
            obj = bucket.Object('otherkey')
            obj.copy(copy_source)
        except Exception:
            cfnresponse.send(event, context, cfnresponse.FAILED, {})
            raise
        else:
            # The copy returns no content, so there is no data to report back
            cfnresponse.send(event, context, cfnresponse.SUCCESS, {})
    else:  # Update and Delete: nothing to do, but CloudFormation still expects a response
        cfnresponse.send(event, context, cfnresponse.SUCCESS, {})
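As the note above says, the handler copies a single object. One possible way to extend it to several objects is to iterate over a key prefix in the intermediary bucket and copy each key under the same name. The sketch below assumes the boto3 resource API (Bucket.objects.filter and Object.copy); copy_prefix and the Fake* classes are hypothetical names, and the fakes exist only so the example runs without AWS credentials.

```python
def copy_prefix(s3, source_bucket, dest_bucket, prefix=""):
    """Copy every object under `prefix` from source_bucket to dest_bucket."""
    copied = []
    destination = s3.Bucket(dest_bucket)
    for summary in s3.Bucket(source_bucket).objects.filter(Prefix=prefix):
        copy_source = {"Bucket": source_bucket, "Key": summary.key}
        destination.Object(summary.key).copy(copy_source)
        copied.append(summary.key)
    return copied


class FakeObject:
    """Hypothetical stand-in for an S3 object (and object summary)."""

    def __init__(self, s3, bucket, key):
        self.s3, self.bucket, self.key = s3, bucket, key

    def copy(self, source):
        body = self.s3.data[source["Bucket"]][source["Key"]]
        self.s3.data.setdefault(self.bucket, {})[self.key] = body


class FakeBucket:
    """Hypothetical stand-in for an S3 bucket resource."""

    def __init__(self, s3, name):
        self.s3, self.name = s3, name
        self.objects = self  # lets bucket.objects.filter(...) work

    def filter(self, Prefix=""):
        keys = sorted(k for k in self.s3.data.get(self.name, {}) if k.startswith(Prefix))
        return [FakeObject(self.s3, self.name, k) for k in keys]

    def Object(self, key):
        return FakeObject(self.s3, self.name, key)


class FakeS3:
    """Hypothetical in-memory stand-in for boto3.resource('s3')."""

    def __init__(self, data):
        self.data = data  # {bucket_name: {key: bytes}}

    def Bucket(self, name):
        return FakeBucket(self, name)


if __name__ == "__main__":
    data = {"intermediary_bucket": {"data/a.txt": b"A", "data/b.txt": b"B"}}
    print(copy_prefix(FakeS3(data), "intermediary_bucket", "otherbucket", "data/"))
```

Inside the Lambda handler, the same function would be called with boto3.resource('s3') in place of the fake.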
An alternative to all of this is to open an issue in the AWS CDK GitHub repo and ask them to add your use case.

Local file for Google Speech

I followed this page:
https://cloud.google.com/speech/docs/getting-started
and I could reach the end of it without problems.
In the example though, the file
'uri':'gs://cloud-samples-tests/speech/brooklyn.flac'
is processed.
What if I want to process a local file? In case this is not possible, how can I upload my .flac via command line?
Thanks
You're now able to process a local file by specifying a local path instead of the google storage one:
gcloud ml speech recognize '/Users/xxx/cloud-samples-tests/speech/brooklyn.flac' \
    --language-code='en-US'
You can run this command with the gcloud tool (https://cloud.google.com/speech-to-text/docs/quickstart-gcloud).
Solution found:
I created my own bucket (my_bucket_test) and uploaded the file there via:
gsutil cp speech.flac gs://my_bucket_test
If you don't want to create a bucket (which costs extra time and money), you can stream the local file instead. The following code is copied directly from the Google Cloud docs:
def transcribe_streaming(stream_file):
    """Streams transcription of the given audio file."""
    import io
    from google.cloud import speech
    client = speech.SpeechClient()
    with io.open(stream_file, "rb") as audio_file:
        content = audio_file.read()
    # In practice, stream should be a generator yielding chunks of audio data.
    stream = [content]
    requests = (
        speech.StreamingRecognizeRequest(audio_content=chunk) for chunk in stream
    )
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=16000,
        language_code="en-US",
    )
    streaming_config = speech.StreamingRecognitionConfig(config=config)
    # streaming_recognize returns a generator.
    responses = client.streaming_recognize(
        config=streaming_config,
        requests=requests,
    )
    for response in responses:
        # Once the transcription has settled, the first result will contain the
        # is_final result. The other results will be for subsequent portions of
        # the audio.
        for result in response.results:
            print("Finished: {}".format(result.is_final))
            print("Stability: {}".format(result.stability))
            alternatives = result.alternatives
            # The alternatives are ordered from most likely to least.
            for alternative in alternatives:
                print("Confidence: {}".format(alternative.confidence))
                print(u"Transcript: {}".format(alternative.transcript))
Here is the URL in case the package's function names change over time: https://cloud.google.com/speech-to-text/docs/streaming-recognize
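The comment in the sample notes that, in practice, the stream should be a generator yielding chunks of audio data rather than a single-element list. A minimal sketch of such a generator follows; the name audio_chunks and the 4096-byte chunk size are assumptions chosen for illustration, not values taken from the docs.

```python
import os
import tempfile


def audio_chunks(path, chunk_size=4096):
    """Yield the file at `path` as successive byte chunks of at most chunk_size."""
    with open(path, "rb") as audio_file:
        while True:
            chunk = audio_file.read(chunk_size)
            if not chunk:
                return
            yield chunk


if __name__ == "__main__":
    # Write a dummy file so the sketch runs without a real recording.
    with tempfile.NamedTemporaryFile(delete=False) as handle:
        handle.write(b"\x00" * 10000)
    try:
        print([len(chunk) for chunk in audio_chunks(handle.name)])
    finally:
        os.remove(handle.name)
```

In the sample above, stream = [content] could then be replaced with stream = audio_chunks(stream_file), so the whole file no longer has to be read into memory first.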

retrieve carrierwave file uploaded to Amazon S3

I'm using Rails 3.2 and CarrierWave, and recently switched to storing on Amazon S3. My setup and uploads are all working fine.
1. I have image_uploader.rb to upload and store images. Displaying them all works fine
2. I have file_uploader.rb to upload and store files. I've even taken it a step further to upload ZIP files and extract a version so that both the ZIP file and TXT files are stored in the correct place on S3.
My problem is I run a method on the TXT file. In the past, I used storage :file
With that I was able to:
Dir.chdir("public/uploads/")
import_file = Dir['*.TXT'].first
f = File.new(import_file)
Now that I'm using storage :fog, I can't seem to retrieve/File.new/open the file.
I see the file with the usual commands:
@upload1.team_file # stored file
@upload1.team_file.url # url
@upload1.team_file_url(:data_file).to_s # version created
I've been poring over all kinds of very limited leads on retrieving and/or opening the file, but everything I try seems to return errors, such as:
Errno::ENOENT: No such file or directory - https://teamfiles.s3.amazonaws.com/data_files…
Thoughts on the difference here of retrieving and USING a file from AmazonS3? Thanks!
Pulling from multiple threads, APIs, etc. I'm answering my own question with what I've found. I welcome any corrections or improvements:
To retrieve CarrierWave files uploaded to Amazon S3, you have to understand that File.open(@upload.file_url) does NOT work, because the URL is not a local path. With OpenURI loaded, open(@upload.file_url) instead fetches the remote object and returns an IO-like object (a Tempfile or StringIO), not a file at a path you already know. (ref: Ruby OpenURI )
I use: open_uri_url = open(@upload.file_url)
You then have to find the specific file you want in the downloaded content. For me, I then find a ZIP file that was uploaded to Amazon S3 and extract the specific file within the ZIP file that I want, which has a unique *.ABC extension:
zip_content_file = Zip::File.open(open_uri_url).map{|content| content if content.to_s.split('.').last == "ABC"}.compact.first
Now, from here, where to extract to?? I create a unique directory in the Rails tmp directory to extract the file to, use it and then delete the directory:
tmp_directory = "tmp/extracts/#{@upload.parent_id}/"
FileUtils.mkdir_p(tmp_directory) unless File.directory?(tmp_directory)
extract = zip_content_file.extract(tmp_directory + zip_content_file.to_s)
Now that the file has been found in the ZIP file stored on Amazon S3 and extracted, I can open it, read it, etc.:
f = File.new(tmp_directory + extract.to_s)
I hope this helps with Carrierwave, AmazonS3, ZIP files and using them once uploaded.

Django Custom File Storage - write contents to file

I'm trying to write a custom Django storage backend that writes the contents of an uploaded file to an output file while also saving the file as it normally would. I assumed I could do this by overriding the _open function of Django, but no luck. Anyone know how to accomplish this? Here's what I've been messing around with:
from django.core.files.storage import FileSystemStorage
class TestStore(FileSystemStorage):
    def _open(self, name, mode='rb'):
        data = open(name, 'rb')
        dataRead = data.read()
        filename = '/home/somewhere/testdir/output.txt'
        FILE = open(filename, 'w')
        FILE.write(dataRead)
        FILE.close()
        data.close()
        return name
If you already created the file you can just output it with content-disposition:
response = HttpResponse(data, content_type='text/plain')
response['Content-Disposition'] = 'attachment; filename=filename'
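For the original goal (writing the uploaded contents to an extra file while still saving normally), one possible approach is to hook into saving rather than opening, since _open is only called when a stored file is read back. The sketch below is an assumption-laden illustration, not a confirmed solution: it assumes the storage's _save(name, content) method is the right hook and that content offers chunks() and seek() like a Django File. CopyOnSaveMixin, FakeStorage, FakeContent and DemoStore are hypothetical names, and the fakes only exist so the example runs without Django installed.

```python
import io
import os
import tempfile


class CopyOnSaveMixin:
    """Write a copy of each saved file to `copy_path`, then save as usual."""

    copy_path = "output.txt"  # hypothetical destination for the extra copy

    def _save(self, name, content):
        with open(self.copy_path, "wb") as extra_copy:
            for chunk in content.chunks():
                extra_copy.write(chunk)
        content.seek(0)  # rewind so the normal save sees the whole file
        return super()._save(name, content)


class FakeContent(io.BytesIO):
    """Hypothetical stand-in for django.core.files.File."""

    def chunks(self, chunk_size=4):
        self.seek(0)
        while True:
            data = self.read(chunk_size)
            if not data:
                break
            yield data


class FakeStorage:
    """Hypothetical stand-in for FileSystemStorage that keeps files in a dict."""

    def __init__(self):
        self.saved = {}

    def _save(self, name, content):
        self.saved[name] = content.read()
        return name


class DemoStore(CopyOnSaveMixin, FakeStorage):
    pass


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        store = DemoStore()
        store.copy_path = os.path.join(workdir, "output.txt")
        store._save("upload.txt", FakeContent(b"hello world"))
        with open(store.copy_path, "rb") as handle:
            print(handle.read(), store.saved["upload.txt"])
```

With Django, the equivalent class would presumably be declared as class TestStore(CopyOnSaveMixin, FileSystemStorage), with copy_path pointing at the desired output file.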
