Which apache-beam feature should I use to run a function as the first step in the pipeline and take its output? - google-cloud-dataflow

I'm trying to create a Dataflow pipeline and deploy it in the Cloud environment. I have code like the below, which reads files from a specific folder within a GCS bucket:
from google.cloud import storage

def read_from_bucket():
    file_paths = []
    client = storage.Client()
    bucket = client.bucket("bucket_name")
    blobs_specific = list(bucket.list_blobs(prefix="test_folder/"))
    for blob in blobs_specific:
        file_paths.append(blob)
    return file_paths
The above function returns a list of the file paths present in that GCS folder.
This list is then passed to the code below, which filters the files by extension and stores them in the respective folders of the GCS bucket.
import pathlib
from google.cloud import storage

def write_to_bucket():
    client = storage.Client()
    for blob in client.list_blobs("dataplus-temp"):
        source_bucket = client.bucket("source_bucket")
        source_blob = source_bucket.blob(blob.name)
        file_extension = pathlib.Path(blob.name).suffix
        if file_extension == ".json":
            destination_bucket = client.bucket("destination_bucket")
            new_blob = source_bucket.copy_blob(source_blob, destination_bucket, 'source_bucket/JSON/{}'.format(source_blob))
        elif file_extension == ".txt":
            destination_bucket = client.bucket("destination_bucket")
            new_blob = source_bucket.copy_blob(source_blob, destination_bucket, 'Text/{}'.format(source_blob))
I have to execute the above implementation using a Dataflow pipeline, in such a way that the file paths go through the Dataflow pipeline and the files get stored in the respective folders. I created a Dataflow pipeline like below, but I'm not sure whether I used the right PTransforms.
pipe = beam.Pipeline()
(
    pipe
    | "Read data from bucket" >> beam.ParDo(read_from_bucket)
    | "Write files to folders" >> beam.ParDo(write_to_bucket)
)
pipe.run()
and executed it like:
python .\filename.py
--region asia-east1
--runner DataflowRunner
--project project_name
--temp_location gs://bucket_name/tmp
--job_name test
I'm getting the error below after execution:
return inputs[0].windowing
AttributeError: 'PBegin' object has no attribute 'windowing'
I have checked the apache-beam documentation but somehow couldn't understand it. I'm new to apache-beam and just started as a beginner, so please bear with me if this question is silly.
Kindly help me figure out how to solve this.

To solve your issue you have to use the existing Beam IO connectors to read from and write to Cloud Storage. The error occurs because beam.ParDo is applied directly to the pipeline root (a PBegin) rather than to a PCollection; a pipeline has to start with a root transform such as a read or beam.Create. For example:
import logging

import apache_beam as beam
from apache_beam.io import ReadFromText, WriteToText
from apache_beam.options.pipeline_options import PipelineOptions

def run():
    pipeline_options = PipelineOptions()
    with beam.Pipeline(options=pipeline_options) as p:
        (
            p
            | "Read data from bucket" >> ReadFromText("gs://your_input_bucket/your_folder/*")
            | "Write files to folders" >> WriteToText("gs://your_dest_bucket")
        )

if __name__ == "__main__":
    logging.getLogger().setLevel(logging.INFO)
    run()
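If you also need the extension-based routing from your original functions (copying blobs into JSON/ and Text/ folders), a minimal sketch, assuming your bucket and folder names as placeholders, is to feed the listed blob names into the pipeline with beam.Create and do the copy inside a DoFn:

import pathlib

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from google.cloud import storage

class CopyByExtension(beam.DoFn):
    def process(self, blob_name):
        # A client per element keeps the sketch simple; the bucket names are placeholders.
        client = storage.Client()
        source_bucket = client.bucket("source_bucket")
        destination_bucket = client.bucket("destination_bucket")
        source_blob = source_bucket.blob(blob_name)
        extension = pathlib.Path(blob_name).suffix
        if extension == ".json":
            source_bucket.copy_blob(source_blob, destination_bucket, "JSON/{}".format(blob_name))
        elif extension == ".txt":
            source_bucket.copy_blob(source_blob, destination_bucket, "Text/{}".format(blob_name))
        yield blob_name

def run():
    # Listing happens at pipeline-construction time on the launcher machine;
    # the copies themselves run on the Dataflow workers.
    file_names = [b.name for b in storage.Client().list_blobs("source_bucket", prefix="test_folder/")]
    with beam.Pipeline(options=PipelineOptions()) as p:
        (
            p
            | "File names" >> beam.Create(file_names)
            | "Copy by extension" >> beam.ParDo(CopyByExtension())
        )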

Related

Apache Beam dataflow on GCS: ReadFromText from multiple path

I was wondering if it is possible to use the ReadFromText PTransform and pass it multiple paths.
My PTransform's expand method is:
def expand(self, pcoll):
    dataset = (
        pcoll
        | "Read Dataset from text file" >> beam.io.ReadFromText(self._source)
    )
and self._source right now is a string holding a single path with a glob pattern:
self._source = "gs://bucket1/folder/*"
From the documentation it says:
Args:
  file_pattern (str): The file path to read from as a local file path or a
    GCS ``gs://`` path. The path can contain glob characters
    (``*``, ``?``, and ``[...]`` sets).
But even though it works great if I use gs://folder/*.gz (I have multiple files under a path), I can't seem to make it work if I have different paths (or, in my case, paths in different buckets).
I tried it with the ls command, with something like:
gsutil ls gs://{bucket1/folder,bucket2/folder}/*
But if I try it with the beam pipeline it doesn't work and gives me
ERROR: (gcloud.dataflow.flex-template.run) unrecognized arguments:
Is there a way to make it work ?
As you explained in your comment, you can solve it with a for loop when building the Beam pipeline, for example:
bucket_paths = [
    "gs://bucket/folder/file*.txt",
    "gs://bucket2/folder/file*.txt"
]

with beam.Pipeline(options=PipelineOptions()) as p:
    for i, bucket_path in enumerate(bucket_paths):
        (p
         | f"Read Dataset from text file {i}" >> beam.io.ReadFromText(bucket_path)
         ....
        )
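If you want the per-path reads merged into a single PCollection instead of left as independent branches, a sketch under the same assumptions (the bucket paths are placeholders) is to collect the reads in a list and apply beam.Flatten:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

bucket_paths = [
    "gs://bucket/folder/file*.txt",
    "gs://bucket2/folder/file*.txt",
]

with beam.Pipeline(options=PipelineOptions()) as p:
    # One ReadFromText per path, then merge all of them into one PCollection.
    merged = [
        p | f"Read Dataset from text file {i}" >> beam.io.ReadFromText(path)
        for i, path in enumerate(bucket_paths)
    ] | "Merge" >> beam.Flatten()
    merged | "Print" >> beam.Map(print)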

Beam/Dataflow read Parquet files and add file name/path to each record

I'm using the Apache Beam Python SDK and I'm trying to read data from Parquet files using apache_beam.io.parquetio, but I also want to add the filename (or path) to the data since it contains data as well. I went over the suggested pattern here and read that parquetio is similar to fileio, but it doesn't seem to implement the functionality that lets you go over the files and add the file name to each record.
Anyone figured a good way to implement this?
Thanks!
If the number of files is not tremendous, you can get all the files before you read them through the IO.
import glob

import apache_beam as beam

filelist = glob.glob('/tmp/*.parquet')

p = beam.Pipeline()

class PairWithFile(beam.DoFn):
    def __init__(self, filename):
        self._filename = filename

    def process(self, e):
        yield (self._filename, e)

file_with_records = [
    (p
     | 'Read %s' % (file) >> beam.io.ReadFromParquet(file)
     | 'Pair %s' % (file) >> beam.ParDo(PairWithFile(file)))
    for file in filelist
] | beam.Flatten()
Then your PCollection contains (filename, record) tuples, where each record is a dict mapping the Parquet column names to the values of one row.
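To run it and inspect the output, a minimal follow-up (just a sketch) is:

# Print each (filename, record) pair and run the pipeline to completion.
file_with_records | 'Print' >> beam.Map(print)
p.run().wait_until_finish()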

How to process video with OpenCV2 in Google Cloud function?

Starting point:
There is a video called myVideo.mp4 in a folder (/1_original_videos) in a bucket called myBucket in Google Cloud Storage.
myBucket
-->/1_original_videos
   -->myVideo.mp4
Goal:
The goal is to take this video, split it into chunks in a Cloud Function myCloudFunction and save the chunks in a subfolder called chunks in myBucket. The part of dividing into chunks is not a problem. The problem is reading the video.
myCloudFunction must be triggered with an HTTP trigger.
_______________
myVideo.mp4 ---->|myCloudFunction|----> chunk0.mp4, chunk1.mp4, chunk2.mp4, ... , chunkN-1.mp4
‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾
^
|
|
|
HTTP trigger
If the video were on my local computer, in order to read it, the following would be enough:
import cv2
cap = cv2.VideoCapture("/some/path/in/my/local/computer/myVideo.mp4")
Attempts:
Path with authenticated URL:
import cv2
cap = cv2.VideoCapture("https://storage.cloud.google.com/myBucket/1_original_videos/myVideo.mp4")
When testing this approach, this is the resulting message (see complete code below):
"File Cannot be Opened"
Complete code:
import cv2

def video2chunks(request):
    # Request:
    REQUEST_JSON = request.get_json()
    # If the HTTP request contains a key called "start" (e.g. {"start": "whatever"}):
    if REQUEST_JSON and 'start' in REQUEST_JSON:
        try:
            # Create VideoCapture object:
            cap = cv2.VideoCapture("https://storage.cloud.google.com/myBucket/1_original_videos/myVideo.mp4")
            # If no VideoCapture object is created:
            if not cap.isOpened():
                message = "File Cannot be Opened"
            # If a VideoCapture object is created, compute some of the video parameters:
            else:
                fps = int(cap.get(cv2.CAP_PROP_FPS))
                size = (int(cap.get(cv2.CAP_PROP_FRAME_WIDTH)), int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT)))
                fourcc = int(cv2.VideoWriter_fourcc('X', 'V', 'I', 'D'))  # XVID codec
                message = "Video downloaded successfully. Some params are: "
                message += "FPS= " + str(fps) + " | size= " + str(size)
        except Exception as e:
            message = str(e)
    else:
        message = "You did not provide a key called start"
    return message
I have been trying to find examples or a better way to do this in a Cloud Function but so far have been unsuccessful. Any alternatives would also be very much appreciated.
I'm not aware of whether the cv2 library supports reading directly from Cloud Storage in some way. Nonetheless, as Christoph points out, you may download the file, process it, and upload the results. The code will be essentially the same as when running locally.
One thing to note is that Cloud Functions offer a temporary directory (/tmp), which is where I chose to store the files. However, it's important to know that any file stored there actually consumes part of your function's RAM, so the allocated function memory should be sized accordingly. Also, you may notice the temp files are deleted before exiting the function; this is just a best practice in Cloud Functions.
import os

import cv2
from google.cloud import storage

def myfunc(request):
    # Substitute the variables below for whatever suits your needs:
    # BUCKET_ID :: The bucket ID
    # INPUT_IMAGE_GCS :: Path to the GCS object
    # OUTPUT_IMAGE_PATH :: Path to save the resulting image/video
    # Read video and save to /tmp directory
    bucket = storage.Client().bucket(BUCKET_ID)
    blob = bucket.blob(INPUT_IMAGE_GCS)
    blob.download_to_filename('/tmp/video.mp4')
    # Video processing stuff
    vidcap = cv2.VideoCapture('/tmp/video.mp4')
    success, image = vidcap.read()
    cv2.imwrite("/tmp/frame.jpg", image)
    # Save results to GCS
    img_blob = bucket.blob('potato/frame.jpg')
    img_blob.upload_from_filename(OUTPUT_IMAGE_PATH)
    # Delete tmp resources to free memory
    os.remove('/tmp/video.mp4')
    os.remove('/tmp/frame.jpg')
    return '', 200
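To go from a single frame to the chunking described in the question, a sketch along the same download-to-/tmp lines could look like the following; the 10-second chunk length, the mp4v codec, and the chunks/ prefix are assumptions, not part of the original answer:

import os

import cv2
from google.cloud import storage

def video2chunks(request):
    bucket = storage.Client().bucket("myBucket")
    bucket.blob("1_original_videos/myVideo.mp4").download_to_filename("/tmp/myVideo.mp4")

    cap = cv2.VideoCapture("/tmp/myVideo.mp4")
    fps = int(cap.get(cv2.CAP_PROP_FPS))
    size = (int(cap.get(cv2.CAP_PROP_FRAME_WIDTH)), int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT)))
    fourcc = cv2.VideoWriter_fourcc(*"mp4v")
    frames_per_chunk = fps * 10  # 10-second chunks (assumption)

    def upload_chunk(index):
        # Upload a finished chunk to the chunks/ subfolder and free the /tmp space.
        bucket.blob(f"chunks/chunk{index}.mp4").upload_from_filename(f"/tmp/chunk{index}.mp4")
        os.remove(f"/tmp/chunk{index}.mp4")

    chunk_index, frame_index, writer = 0, 0, None
    while True:
        success, frame = cap.read()
        if not success:
            break
        if frame_index % frames_per_chunk == 0 and writer is not None:
            # Close and upload the previous chunk before starting a new one.
            writer.release()
            upload_chunk(chunk_index)
            chunk_index += 1
            writer = None
        if writer is None:
            writer = cv2.VideoWriter(f"/tmp/chunk{chunk_index}.mp4", fourcc, fps, size)
        writer.write(frame)
        frame_index += 1

    if writer is not None:
        # Flush and upload the final, possibly shorter, chunk.
        writer.release()
        upload_chunk(chunk_index)
        chunk_index += 1

    cap.release()
    os.remove("/tmp/myVideo.mp4")
    return f"Wrote {chunk_index} chunks", 200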

How to upload a file to an s3 bucket with a custom resource in aws cdk

I need to upload a zip file to an S3 bucket after its creation. I'm aware of the s3_deployment package, but it doesn't fit my use case because I need the file to be uploaded only once, on stack creation. The s3_deployment package would upload the zip on every update.
I have the following custom resource defined; however, I'm not sure how to pass the body of the file to the custom resource. I've tried opening the file in binary mode, but that returns an error.
app_data_bootstrap = AwsCustomResource(self, "BootstrapData",
    on_create={
        "service": "S3",
        "action": "putObject",
        "parameters": {
            "Body": open('app_data.zip', 'rb'),
            "Bucket": f"my-app-data",
            "Key": "app_data.zip",
        },
        "physical_resource_id": PhysicalResourceId.of("BootstrapDataBucket")
    },
    policy=AwsCustomResourcePolicy.from_sdk_calls(resources=AwsCustomResourcePolicy.ANY_RESOURCE)
)
I don't believe that's possible unless you write a custom script that runs before your cdk deploy to upload your local files to an intermediary S3 bucket. Then you can write a custom resource that copies the content of the intermediary bucket, on the on_create event, to the bucket that was created via CDK.
Read this paragraph from s3_deployment in the CDK docs:
This is what happens under the hood:
1. When this stack is deployed (either via cdk deploy or via CI/CD), the contents of the local website-dist directory will be archived and uploaded to an intermediary assets bucket. If there is more than one source, they will be individually uploaded.
2. The BucketDeployment construct synthesizes a custom CloudFormation resource of type Custom::CDKBucketDeployment into the template. The source bucket/key is set to point to the assets bucket.
3. The custom resource downloads the .zip archive, extracts it and issues aws s3 sync --delete against the destination bucket (in this case websiteBucket). If there is more than one source, the sources will be downloaded and merged pre-deployment at this step.
So in order to replicate step 1, you have to write a small script that creates an intermediary bucket and uploads your local files to it. A sample of that script can look like this:
#!/bin/sh
aws s3 mb s3://<intermediary_bucket> --region <region_name>
aws s3 sync <local_folder> s3://<intermediary_bucket>
Then your custom resource can be something like this:
*Note that this will work for copying one object; you can change the code to copy multiple objects.
import json

import boto3
import cfnresponse

def lambda_handler(event, context):
    print('Received request:\n%s' % json.dumps(event, indent=4))
    resource_properties = event['ResourceProperties']
    if event['RequestType'] in ['Create']:  # What happens when the resource is created
        try:
            s3 = boto3.resource('s3')
            copy_source = {
                'Bucket': 'intermediary_bucket',
                'Key': 'path/to/filename.extension'
            }
            bucket = s3.Bucket('otherbucket')
            obj = bucket.Object('otherkey')
            obj.copy(copy_source)
        except:
            cfnresponse.send(event, context, cfnresponse.FAILED, {})
            raise
        else:
            cfnresponse.send(event, context, cfnresponse.SUCCESS, {})
    elif event['RequestType'] == 'Delete':  # What happens when the resource is deleted
        cfnresponse.send(event, context, cfnresponse.SUCCESS, {})
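To tie that handler back into the stack, one possibility (a sketch using CDK v1 Python; the construct IDs, asset path, runtime, and bucket ARNs are assumptions) is to deploy it as a Lambda function and point a CustomResource's service token directly at it:

from aws_cdk import core
from aws_cdk import aws_iam as iam
from aws_cdk import aws_lambda as _lambda

# Inside your Stack's __init__:
copy_fn = _lambda.Function(
    self, "BootstrapDataFn",
    runtime=_lambda.Runtime.PYTHON_3_8,
    handler="index.lambda_handler",
    # Folder containing the handler above as index.py (plus cfnresponse.py if it is not vendored).
    code=_lambda.Code.from_asset("lambda/bootstrap_data"),
    timeout=core.Duration.minutes(5),
)

# Allow the function to read the intermediary bucket and write the destination bucket.
copy_fn.add_to_role_policy(iam.PolicyStatement(
    actions=["s3:GetObject", "s3:PutObject"],
    resources=["arn:aws:s3:::intermediary_bucket/*", "arn:aws:s3:::otherbucket/*"],
))

# CloudFormation invokes the Lambda on Create/Update/Delete of this resource.
core.CustomResource(self, "BootstrapData", service_token=copy_fn.function_arn)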
An alternative to all of this is to open an issue in the AWS CDK GitHub repo and ask them to add your use case.

Local file for Google Speech

I followed this page:
https://cloud.google.com/speech/docs/getting-started
and I could reach the end of it without problems.
In the example though, the file
'uri':'gs://cloud-samples-tests/speech/brooklyn.flac'
is processed.
What if I want to process a local file? In case this is not possible, how can I upload my .flac via command line?
Thanks
You're now able to process a local file by specifying a local path instead of the Google Storage one:
gcloud ml speech recognize '/Users/xxx/cloud-samples-tests/speech/brooklyn.flac' \
    --language-code='en-US'
You can send this command by using the gcloud tool (https://cloud.google.com/speech-to-text/docs/quickstart-gcloud).
Solution found:
I created my own bucket (my_bucket_test) and uploaded the file there via:
gsutil cp speech.flac gs://my_bucket_test
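If the audio is short (the synchronous API accepts roughly up to a minute of audio sent inline), another option is to skip the bucket and pass the bytes directly. A sketch, assuming a local FLAC file named speech.flac and the current google-cloud-speech client:

import io

from google.cloud import speech

client = speech.SpeechClient()

# Read the local file and send its bytes inline in the request.
with io.open("speech.flac", "rb") as audio_file:
    content = audio_file.read()

audio = speech.RecognitionAudio(content=content)
config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.FLAC,
    language_code="en-US",
)

response = client.recognize(config=config, audio=audio)
for result in response.results:
    print("Transcript: {}".format(result.alternatives[0].transcript))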
If you don't want to create a bucket (it costs extra time and money), you can stream the local file. The following code is copied directly from the Google Cloud docs:
def transcribe_streaming(stream_file):
    """Streams transcription of the given audio file."""
    import io

    from google.cloud import speech

    client = speech.SpeechClient()

    with io.open(stream_file, "rb") as audio_file:
        content = audio_file.read()

    # In practice, stream should be a generator yielding chunks of audio data.
    stream = [content]

    requests = (
        speech.StreamingRecognizeRequest(audio_content=chunk) for chunk in stream
    )

    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=16000,
        language_code="en-US",
    )

    streaming_config = speech.StreamingRecognitionConfig(config=config)

    # streaming_recognize returns a generator.
    responses = client.streaming_recognize(
        config=streaming_config,
        requests=requests,
    )

    for response in responses:
        # Once the transcription has settled, the first result will contain the
        # is_final result. The other results will be for subsequent portions of
        # the audio.
        for result in response.results:
            print("Finished: {}".format(result.is_final))
            print("Stability: {}".format(result.stability))
            alternatives = result.alternatives
            # The alternatives are ordered from most likely to least.
            for alternative in alternatives:
                print("Confidence: {}".format(alternative.confidence))
                print(u"Transcript: {}".format(alternative.transcript))
Here is the URL in case the package's function names are edited over time: https://cloud.google.com/speech-to-text/docs/streaming-recognize
