Is it possible to call Dataflow using Cloud Run in GCP? - google-cloud-dataflow

Is it possible to call Dataflow from Cloud Run in GCP, or is there another way to run a Dataflow job every 30 minutes?
I would be thankful if someone could share reference material on how to implement this.
Running dataflow locally on Cloud Run

You could do the following:
Since you want to invoke Dataflow every 30 minutes, or at some other regular interval, consider using Cloud Scheduler. Cloud Scheduler is a fully managed, cron-like service that lets you invoke a URL every 30 minutes, or at whatever frequency you need.
The URL that you invoke can be a Google Cloud Function. The code inside your function will be the code that launches your Dataflow template.
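As a rough sketch of what the function might assemble, the snippet below builds the request for the Dataflow `projects.locations.templates.launch` REST method. The function name, project, region, bucket paths and parameter names are all placeholders made up for illustration, and the authenticated HTTP call itself is left out.

```python
from datetime import datetime, timezone

DATAFLOW_API = "https://dataflow.googleapis.com/v1b3"


def build_template_launch(project, region, template_gcs_path, parameters, now=None):
    """Return (url, query, body) for a classic-template launch request.

    Hypothetical helper: it only builds the request, it does not send it.
    """
    now = now or datetime.now(timezone.utc)
    # A timestamp suffix keeps successive scheduled runs from sharing a job name.
    job_name = "scheduled-job-" + now.strftime("%Y%m%d-%H%M%S")
    url = f"{DATAFLOW_API}/projects/{project}/locations/{region}/templates:launch"
    # The template location is passed as a query parameter, not in the body.
    query = {"gcsPath": template_gcs_path}
    body = {"jobName": job_name, "parameters": dict(parameters)}
    return url, query, body


if __name__ == "__main__":
    url, query, body = build_template_launch(
        "my-project",  # placeholder project ID
        "us-central1",  # placeholder region
        "gs://my-bucket/templates/my-template",  # placeholder template path
        {"input": "gs://my-bucket/in.csv"},  # placeholder template parameter
    )
    print(url, query, body)
```

Inside the function you would send this as an authenticated POST, for example through the Google API client library, using the function's service account.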

You should consider using a Flex Template and invoking the Dataflow job using the REST API and Cloud Scheduler.
If you want to run a regular Dataflow job, you could follow Romin's advice: use a Cloud Function to run custom code that launches the Dataflow job, and invoke the function from Cloud Scheduler. This is more complex, IMHO. Using Flex Templates might just be easier.
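To make the Flex Template route a bit more concrete, here is a minimal sketch of a Cloud Scheduler job resource whose HTTP target is the Dataflow `flexTemplates:launch` REST method. The helper name, project, region, template path, service account and job name are placeholders, and the dict would still have to be submitted to the Cloud Scheduler API (or translated into the equivalent console or gcloud settings).

```python
import base64
import json


def build_scheduler_job(project, region, template_spec_path, parameters,
                        service_account, schedule="*/30 * * * *"):
    """Build a Cloud Scheduler job resource that POSTs to flexTemplates:launch.

    Hypothetical helper: all argument values are supplied by the caller.
    """
    launch_body = {
        "launchParameter": {
            "jobName": "scheduled-flex-job",  # placeholder, fixed for simplicity
            "containerSpecGcsPath": template_spec_path,
            "parameters": dict(parameters),
        }
    }
    uri = (
        "https://dataflow.googleapis.com/v1b3/"
        f"projects/{project}/locations/{region}/flexTemplates:launch"
    )
    return {
        "schedule": schedule,  # unix-cron syntax: every 30 minutes by default
        "timeZone": "Etc/UTC",
        "httpTarget": {
            "uri": uri,
            "httpMethod": "POST",
            "headers": {"Content-Type": "application/json"},
            # The Scheduler REST API takes the body as base64-encoded bytes.
            "body": base64.b64encode(json.dumps(launch_body).encode("utf-8")).decode("ascii"),
            # An OAuth token is attached so the call to the Google API is authorized.
            "oauthToken": {"serviceAccountEmail": service_account},
        },
    }
```

The job name is fixed here, so if runs can overlap you would probably want a unique name per run, which is one case where putting a small function in between makes sense.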

Related

ECS Fargate - Is it possible to create container instances dynamically?

I am working on a project that needs to create multiple instances of a container dynamically, based on a count received from an AWS Lambda function. Each container will execute its own task. I have done a lot of research but am still not sure how to achieve this. Also, how do I delete a container instance once its task has finished executing?
You're describing the use case AWS Batch has been built for. It essentially allows you to submit tasks that are being processed in Docker Containers and manages the lifecycle of those containers for you. Since pre:invent 2020 it also supports Fargate.
An alternative would be using a Step Function that processes the output of the Lambda function and dynamically creates ECS tasks from it. These would be tasks without a service, so they just terminate when they're done processing. Depending on the number of jobs you have, I'd prefer AWS Batch.
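For the standalone-task route, here is a minimal sketch of how the count from the Lambda could be turned into ECS `RunTask` requests. It assumes the requests are later sent with boto3's `ecs.run_task(**kwargs)`; the helper name, cluster, task definition and subnet IDs are placeholders. A single `RunTask` call starts at most 10 tasks, hence the batching.

```python
def build_run_task_calls(task_count, cluster, task_definition, subnets, max_per_call=10):
    """Split task_count into keyword-argument dicts for ECS RunTask.

    Hypothetical helper: it only prepares the arguments, it does not call AWS.
    """
    calls = []
    remaining = task_count
    while remaining > 0:
        batch = min(remaining, max_per_call)
        calls.append({
            "cluster": cluster,
            "taskDefinition": task_definition,
            "launchType": "FARGATE",
            "count": batch,
            # Fargate tasks use awsvpc networking, so subnets must be given.
            "networkConfiguration": {
                "awsvpcConfiguration": {
                    "subnets": list(subnets),
                    "assignPublicIp": "DISABLED",
                }
            },
        })
        remaining -= batch
    return calls


if __name__ == "__main__":
    # 23 requested containers become three calls of 10, 10 and 3 tasks.
    for kwargs in build_run_task_calls(23, "my-cluster", "my-task:1", ["subnet-placeholder"]):
        print(kwargs["count"])
```

Because the tasks are not part of a service, nothing restarts them; each one stops when its container exits.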

Using Dataflow vs. Cloud Composer

I'd like to get some clarification on whether Cloud Dataflow or Cloud Composer is the right tool for the job, and I wasn't clear from the Google Documentation.
Currently, I'm using Cloud Dataflow to read a non-standard csv file -- do some basic processing -- and load it into BigQuery.
Let me give a very basic example:
# file.csv
type\x01date
house\x0112/27/1982
car\x0111/9/1889
From this file we detect the schema and create a BigQuery table, something like this:
`table`
type (STRING)
date (DATE)
And, we also format our data to insert (in python) into BigQuery:
DATA = [
("house", "1982-12-27"),
("car", "1889-11-09")
]
This is a vast simplification of what's going on, but this is how we're currently using Cloud Dataflow.
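For illustration, the conversion described above might look roughly like this in plain Python, outside of any Dataflow pipeline. It assumes the delimiter is the `\x01` control character and that dates are month/day/year; the function name is made up.

```python
from datetime import datetime

DELIMITER = "\x01"  # assumed: the non-standard field separator is the SOH character


def parse_rows(text):
    """Return (header, rows), with dates converted from M/D/YYYY to ISO format."""
    # Skip blank lines and the "# file.csv" style comment line.
    lines = [line for line in text.splitlines() if line and not line.startswith("#")]
    header = lines[0].split(DELIMITER)
    rows = []
    for line in lines[1:]:
        kind, raw_date = line.split(DELIMITER)
        iso_date = datetime.strptime(raw_date, "%m/%d/%Y").date().isoformat()
        rows.append((kind, iso_date))
    return header, rows


if __name__ == "__main__":
    sample = "type\x01date\nhouse\x0112/27/1982\ncar\x0111/9/1889\n"
    print(parse_rows(sample))
```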
My question then is, where does Cloud Composer come into the picture? What additional features could it provide on the above? In other words, why would it be used "on top of" Cloud Dataflow?
Cloud Composer (which is backed by Apache Airflow) is designed for scheduling tasks at a small scale.
Here is an example to help you understand:
Say you have a CSV file in GCS and, using your example, you use Cloud Dataflow to process it and insert the formatted data into BigQuery. If this is a one-off thing, you have just finished it, and that's perfect.
Now let's say your CSV file is overwritten at 01:00 UTC every day, and you want to run the same Dataflow job to process it every time it's overwritten. If you don't want to run the job manually at exactly 01:00 UTC, weekends and holidays included, you need something that runs the job for you periodically (in our example, at 01:00 UTC every day). Cloud Composer can help in this case. You provide a config to Cloud Composer that says which jobs to run (operators), when to run them (a job start time) and how often to run them (daily, weekly or even yearly).
That already seems useful. However, what if the CSV file is overwritten not at 01:00 UTC but at any time of day? How would you choose the daily run time? Cloud Composer provides sensors, which can monitor a condition (in this case, the CSV file's modification time). Cloud Composer can guarantee that it kicks off a job only once the condition is satisfied.
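The sensor idea can be illustrated without Airflow at all. The sketch below is plain Python, not the Airflow API: it polls a caller-supplied function for the file's modification time and returns once the file has changed. All names and the default intervals are made up for the example.

```python
import time


def wait_for_update(get_mtime, last_seen, poke_interval=60.0, timeout=3600.0, sleep=time.sleep):
    """Poll get_mtime() until it returns a value newer than last_seen.

    Returns the new modification time, or None if `timeout` seconds of
    waiting pass without a change.
    """
    waited = 0.0
    while True:
        current = get_mtime()
        if current > last_seen:
            return current  # condition satisfied: the job can be kicked off
        if waited >= timeout:
            return None  # give up, as a sensor with a timeout would
        sleep(poke_interval)
        waited += poke_interval
```

In Composer you would not write this loop yourself; a sensor task sits in front of the Dataflow task in the DAG and does the waiting for you.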
There are a lot more features that Cloud Composer/Apache Airflow provides, including DAGs for running multiple jobs, retries of failed tasks, failure notifications and a nice dashboard. You can learn more from their documentation.
For the basics of your described task, Cloud Dataflow is a good choice. Big data that can be processed in parallel is a good choice for Cloud Dataflow.
The real world of processing big data is usually messy. Data is usually somewhat to very dirty, arrives constantly or in big batches, and needs to be processed in time-sensitive ways. Usually it takes the coordination of more than one task or system to extract the desired data. Think of load, transform, merge, extract and store types of tasks. Big data processing is often glued together using shell scripts and/or Python programs. This makes automation, management, scheduling and control processes difficult.
Google Cloud Composer is a big step up from Cloud Dataflow. Cloud Composer is a cross-platform orchestration tool that supports AWS, Azure and GCP (and more) with management, scheduling and processing abilities.
Cloud Dataflow handles tasks. Cloud Composer manages entire processes coordinating tasks that may involve BigQuery, Dataflow, Dataproc, Storage, on-premises, etc.
My question then is, where does Cloud Composer come into the picture?
What additional features could it provide on the above? In other
words, why would it be used "on top of" Cloud Dataflow?
If you need / require more management, control, scheduling, etc. of your big data tasks, then Cloud Composer adds significant value. If you are just running a simple Cloud Dataflow task on demand once in a while, Cloud Composer might be overkill.
Cloud Composer (Apache Airflow) is designed for scheduling tasks.
Cloud Dataflow (Apache Beam) handles the tasks themselves.
For me, the Cloud Composer is a step up (a big one) from Dataflow. If I had one task, let's say to process my CSV file from Storage to BQ I would/could use Dataflow. But if I wanted to run the same job daily I would use Composer.

Calling dataflow job automatically on a finite interval

Is it possible to run a Dataflow job automatically every 10 minutes? Any insights on how to achieve this?
Yes. This is explained in the blog post Scheduling Dataflow pipelines using App Engine Cron Service or Cloud Functions.

Does spring-cloud-dataflow provide support for scheduling applications defined as tasks?

I have been looking at using projects built using spring-cloud-task within spring-cloud-dataflow. Having looked at the example projects and the documentation, the indication seems to be that tasks are launched manually through the dashboard or the shell. Does spring-cloud-dataflow provide any way of scheduling task definitions so that they can run for example on a cron schedule? I.e. Can you create a spring-cloud-task app which itself has no knowledge of a schedule, but deploy it to the dataflow server and configure the scheduling there?
Among the posts and blogs I have looked at I noticed the following:
https://spring.io/blog/2016/01/27/introducing-spring-cloud-task
Some of the Q&A afterwards hints at this being a possibility, with the reference to triggers, but I think this was discussed before it was released.
Any advice would be greatly appreciated, many thanks.
There are a few ways you could launch Tasks in Spring Cloud Data Flow. The following options are available today.
Launch it using TriggerTask; with this you could choose to launch it either with fixedDelay or via a cron expression - example here.
Launch it via an event in a streaming pipeline. Imagine a use case where you want to create a "thumbnail" whenever there's a new image (event) in an S3 bucket or in a file-system directory; the "thumbnail" operation could be a task in this case - example here.
Lastly, in the upcoming releases, we will port over "scheduler" functionality from Spring XD to Spring Cloud Data Flow.
Yes, Spring Cloud Data Flow does provide a scheduling option. To enable it, you need to add the argument below when starting the server:
--spring.cloud.dataflow.features.schedules-enabled=true

Running periodic Dataflow job

I have to join data from Google Datastore and Google BigTable to produce a report. I need to execute that operation every minute. Is it possible to accomplish this with Google Cloud Dataflow (assuming the processing itself does not take long and/or can be split into independent parallel jobs)?
Should I have an endless loop inside "main" that creates and executes the same pipeline again and again?
If most of the time in such a scenario is taken up by bringing up the VMs, is it possible to instruct Dataflow to use customer VMs instead?
Thanks,
If you expect that your job is small enough to complete in 60 seconds you could consider using the Datastore and BigTable APIs from within a DoFn in a Streaming job. Your pipeline might look something like:
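// Emit one element per minute, forever, to act as a trigger.
PCollection<Long> impulse = p.apply(
    CountingInput.unbounded().withRate(1, Duration.standardMinutes(1)));
// readFromDatastore and readFromBigTable are DoFns that call the respective APIs.
PCollection<A> input1 = impulse.apply(ParDo.of(readFromDatastore));
PCollection<B> input2 = impulse.apply(ParDo.of(readFromBigTable));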
...
This produces a single input every minute, forever. Running as a streaming pipeline, the VMs will continue running.
After reading from both APIs you can then window/join as necessary.
