Is Dataflow making use of Google Cloud Storage's gzip transcoding? - google-cloud-dataflow

I am trying to process JSON files (10 GB uncompressed/2 GB compressed) and I want to optimize my pipeline.
According to the official docs, Google Cloud Storage (GCS) can transcode gzip files, meaning the application receives them uncompressed when they are tagged correctly.
Google Cloud Dataflow (GCDF) gets better parallelism when dealing with uncompressed files, so I was wondering whether setting this meta tag on GCS has a positive effect on performance.
Since my input files are relatively large, does it make sense to unzip them so that Dataflow splits them in smaller chunks?

You should not use this meta tag. It's dangerous, as GCS would report the size of your file incorrectly (e.g. report the compressed size, while Dataflow/Beam would read the uncompressed data).
In any case, the splitting of uncompressed files relies on reading in parallel from different segments of a file, and this is not possible if the file is originally compressed.
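For illustration, a minimal sketch with the Beam Python SDK (the bucket and file patterns are placeholders): only the read over uncompressed objects can be split into byte ranges and parallelised, whereas each gzip file has to be read whole by a single worker.

    import apache_beam as beam
    from apache_beam.io import ReadFromText
    from apache_beam.io.filesystem import CompressionTypes

    with beam.Pipeline() as p:
        # Uncompressed objects: Beam can read different byte ranges of the same
        # file in parallel, so a 10 GB input is split across many workers.
        plain = p | 'ReadPlain' >> ReadFromText('gs://my-bucket/input/*.json')

        # Gzip objects: each file must be read end-to-end by one worker,
        # so parallelism is limited to the number of files.
        gz = p | 'ReadGzip' >> ReadFromText(
            'gs://my-bucket/input/*.json.gz',
            compression_type=CompressionTypes.GZIP)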

Related

Reading video during cloud dataflow, using GCSfuse, download locally, or write new Beam reader?

I am building a Python cloud video pipeline that will read video from a bucket, perform some computer vision analysis and return frames back to a bucket. As far as I can tell, there is no Beam read method to pass GCS paths to OpenCV, similar to TextIO.read(). My options moving forward seem to be: download the file locally (they are large), use GCSfuse to mount on a local worker (possible?), or write a custom source method. Anyone have experience on what makes the most sense?
My main confusion came from this question:
Can google cloud dataflow (apache beam) use ffmpeg to process video or image data
How would ffmpeg have access to the path? It's not just a question of uploading the binary? There needs to be a Beam method to pass the item, correct?
I think that you will need to download the files first and then pass them through.
However, instead of saving the files locally, is it possible to pass the bytes through to OpenCV? Does it accept any sort of byte stream or input stream?
You could have one ParDo which downloads the files using the GCS API, then passes them to OpenCV through a stream, ByteChannel, stdin pipe, etc.
If that is not available, you will need to save the files to disk locally and then pass OpenCV the filename. This could be tricky because you may end up using too much disk space, so make sure to garbage-collect the files properly and delete them from local disk after OpenCV processes them.
I'm not sure, but you may also need to select a certain VM machine type to ensure you have enough disk space, depending on the size of your files.
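A rough sketch of that ParDo approach in the Beam Python SDK, assuming cv2 (OpenCV), NumPy and the Beam GCS client are available on the workers; the DoFn name and the GCS paths are made up for illustration.

    import apache_beam as beam
    import cv2
    import numpy as np
    from apache_beam.io.gcp.gcsio import GcsIO

    class DecodeImageFn(beam.DoFn):
        """Downloads an object from GCS and decodes it in memory, no local file."""
        def process(self, gcs_path):
            # Read the whole object as bytes via the GCS API.
            with GcsIO().open(gcs_path, 'rb') as f:
                data = f.read()
            # cv2.imdecode accepts an in-memory byte buffer for still images.
            img = cv2.imdecode(np.frombuffer(data, dtype=np.uint8), cv2.IMREAD_COLOR)
            if img is not None:
                yield gcs_path, img.shape

    # Usage: paths | 'Decode' >> beam.ParDo(DecodeImageFn())

Note that this in-memory route works for still images; cv2.VideoCapture generally wants a file path, which is where the save-to-local-disk fallback above (and the disk-space caveat) comes in.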

Using EC2 to resize images stored on S3 on demand

We need to serve the same image in a number of possible sizes in our app. The library consists of tens of thousands of images which will be stored on S3, so storing the same image in all its possible sizes does not seem ideal. I have seen a few mentions on Google that EC2 could be used to resize S3 images on the fly, but I am struggling to find more information. Could anyone please point me in the direction of some more info or, ideally, some code samples?
Tip
It was not obvious to us at first, but never serve images to an app or website directly from S3; it is highly recommended to use CloudFront instead. There are three reasons:
Cost - CloudFront is cheaper
Performance - CloudFront is faster
Reliability - S3 will occasionally not serve a resource when queried frequently, i.e. more than 10-20 times a second. This took us ages to debug, as resources would randomly not be available.
The above are not necessarily failings of S3, as it's meant to be a storage service and not a content delivery service.
Why not store all image sizes, assuming you aren't talking about hundreds of different possible sizes? Storage cost is minimal. You would also then be able to serve your images up through Cloudfront (or directly from S3) such that you don't have to use your application server to resize images on the fly. If you serve a lot of these images, the amount of processing cost you save (i.e. CPU cycles, memory requirements, etc.) by not having to dynamically resize images and process image requests in your web server would likely easily offset the storage cost.
What you need is an image server. Yes, it can be hosted on EC2. These links should help you get started: https://github.com/adamdbradley/foresight.js/wiki/Server-Resizing-Images
http://en.wikipedia.org/wiki/Image_server
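For the resize-on-the-fly option, here is a minimal sketch of what a resize endpoint on EC2 might do, assuming boto3 and Pillow are available; the bucket, key and target width are placeholders. In practice you would cache the result (e.g. behind CloudFront, or write it back to S3) so each size is only generated once.

    import io
    import boto3
    from PIL import Image

    s3 = boto3.client('s3')

    def resized_image(bucket, key, width):
        """Fetches an image from S3, scales it to `width`, and returns JPEG bytes."""
        original = s3.get_object(Bucket=bucket, Key=key)['Body'].read()
        img = Image.open(io.BytesIO(original))
        ratio = width / float(img.width)
        img = img.resize((width, int(img.height * ratio)))
        out = io.BytesIO()
        img.save(out, format='JPEG', quality=85)
        return out.getvalue()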

Batch processing of image

I have been assigned to write a Windows application which will copy images from a source folder and its subfolders (there could be any number of subfolders, holding up to 50 GB of images). Each image might vary in size from a few KB to 20 MB. I need to resize and compress the pictures.
I am clueless and wondering if that can be done without hitting the CPU too hard while, on the other hand, still running reasonably fast.
Is it possible? Can you guide me on the best way to implement this?
Image processing is always a CPU-intensive task. You could do little tricks like lowering the priority of the process that's performing the image processing so it impacts your machine less, but there's very little trade-off to be made.
As for how to do it:
Write a script that looks for all the files in the current directory and its sub-directories. If you're not sure how, do a Google search. You could do this in Perl, Python, PHP, C#, or even a BAT file. A sketch of such a script follows below.
Call one of the 10,000,000 free or open-source programs to do image conversion. The most widely used Linux program is ImageMagick, and there's a Windows version of it available too.
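A minimal sketch of such a script in Python, assuming Pillow is installed; the source/destination paths and the size limit are placeholders. (Shelling out to ImageMagick from the same loop would work just as well.)

    import os
    from PIL import Image

    SRC = r'C:\images\source'        # placeholder paths
    DST = r'C:\images\resized'
    MAX_SIZE = (1920, 1920)          # longest side after resizing

    for root, _dirs, files in os.walk(SRC):
        for name in files:
            if not name.lower().endswith(('.jpg', '.jpeg', '.png', '.bmp')):
                continue
            src_path = os.path.join(root, name)
            rel = os.path.relpath(src_path, SRC)
            dst_path = os.path.join(DST, os.path.splitext(rel)[0] + '.jpg')
            os.makedirs(os.path.dirname(dst_path), exist_ok=True)
            with Image.open(src_path) as img:
                img.thumbnail(MAX_SIZE)                       # shrink, keep aspect ratio
                img.convert('RGB').save(dst_path, 'JPEG', quality=80)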

Cannot upload files bigger than 8GB to Amazon S3 by multi-part upload Java API due to broken pipe

I implemented S3 multi-part upload in Java, both high level and low level version, based on the sample code from
http://docs.amazonwebservices.com/AmazonS3/latest/dev/index.html?HLuploadFileJava.html and http://docs.amazonwebservices.com/AmazonS3/latest/dev/index.html?llJavaUploadFile.html
When I uploaded files of size less than 4 GB, the upload processes completed without any problem. When I uploaded a file of size 13 GB, the code started throwing IO exceptions (broken pipe). After multiple retries, it still failed.
Here is the way to repeat the scenario. Take the 1.1.7.1 release:
create a new bucket in the US Standard region
create a large EC2 instance as the client to upload the file
create a file of 13 GB in size on the EC2 instance
run the sample code from either the high-level or low-level S3 API documentation page on the EC2 instance
test any of the three part sizes: the default part size (5 MB), or set the part size to 100,000,000 or 200,000,000 bytes.
So far the problem shows up consistently. I did a tcpdump. It appeared the HTTP server (S3 side) kept resetting the TCP stream, which caused the client side to throw an IO exception (broken pipe) after the uploaded byte count exceeded 8 GB. Has anyone had similar experiences when uploading large files to S3 using multi-part upload?
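The report above concerns the Java SDK, but for reference this is how the same multi-part upload with an explicit part size looks in boto3 (a Python stand-in, not the original code); the file path and bucket name are placeholders.

    import boto3
    from boto3.s3.transfer import TransferConfig

    s3 = boto3.client('s3')
    config = TransferConfig(
        multipart_threshold=8 * 1024 * 1024,    # use multi-part above 8 MB
        multipart_chunksize=100 * 1000 * 1000)  # ~100,000,000-byte parts, as tested above

    s3.upload_file('/data/large-13gb.bin', 'my-bucket', 'large-13gb.bin', Config=config)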

How can I read sections of a large remote file (via tcpip?)

A client has a system which reads large files (up to 1 GB) containing multiple video images. Access is via an indexing file which "points" into the larger file. This works well on a LAN. Does anyone have any suggestions as to how I can access these files over the internet if they are held on a remote server? The key constraint is that we cannot afford the time necessary to download the whole file before accessing individual images within it.
You could put your big file behind an HTTP server like Apache, then have your client side use HTTP Range headers to fetch the chunk it needs.
Another alternative would be to write a simple script in PHP, Perl or server-language-of-your-choice which takes the required offsets as input and returns the chunk of data you need, again over HTTP.
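A small sketch of the Range-header approach in Python, assuming the requests library and a server that honours Range requests; the URL, offset and length are placeholders taken from the index file.

    import requests

    def fetch_chunk(url, offset, length):
        """Fetches `length` bytes starting at `offset` without downloading the whole file."""
        headers = {'Range': 'bytes=%d-%d' % (offset, offset + length - 1)}
        resp = requests.get(url, headers=headers)
        resp.raise_for_status()      # expect 206 Partial Content
        return resp.content

    # Example: read the 4 KB image record the index file points to.
    chunk = fetch_chunk('http://example.com/video-archive.bin', 1048576, 4096)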
If I understand the question correctly, it depends entirely on the format chosen to contain the images as a video. If the container has been designed in such a way that the information about each image is accessible just before or just after the image, rather than at the end of the container, you can extract images and their metadata from the video container as the data arrives, working on what you have downloaded so far. You will need to have an idea of the binary format used.
FTP does let you use 'paged files', where sections of the file can be transferred independently:
To transmit files that are discontinuous, FTP defines a page structure. Files of this type are sometimes known as "random access files" or even as "holey files". In FTP, the sections of the file are called pages. -- RFC 959
I've never used it myself though.
