Spring Cloud Data Flow and Spring Cloud Task batch jobs - spring-cloud-dataflow

We have been using Spring Batch for the below use cases:

- Read data from a file, process it, and write to a target database (the batch kicks off when the file arrives)
- Read data from a remote database, process it, and write to a target database (runs on a scheduled interval, triggered by Autosys)

With the plan to move all online apps to Spring Boot microservices on PCF, we are looking at doing a similar exercise on the batch side if it adds value.
In the new world, the Spring Cloud Task batch job will be reading the file from S3 storage (ECS S3).
I am looking for a good design here (staying away from too many pipes/filters and orchestration if possible); the input data ranges from 1MM to 20MM records.
ECS S3 will notify on file arrival by sending an HTTP request, so the workflow would be: cloud stream http source -> launch a cloud batch job task that reads from the object store, processes, and saves records to the target database.
A Spring Cloud Task triggered from the PCF scheduler would read from the remote database, process, and save to the target database.
With the above design, I don't see the value of wrapping the Spring Batch job into a cloud task and running it in PCF with Spring Cloud Data Flow.
Am I missing something here? Is PCF/Spring Cloud Data Flow overkill in this case?

Orchestrating batch jobs in a cloud setting can bring new benefits to the solution. For instance, the resiliency model that PCF supports could be useful. Spring Cloud Task (SCT) applications typically run in short-lived containers; if one goes down, PCF will bring it back up and rerun it.
Both of the options listed above are feasible, and it comes down to the use case with respect to the frequency at which you're processing the incoming data. Whether it is really real-time, or whether it can happily run on a schedule, is something you'd have to determine to make the decision.
As for the applicability of Spring Cloud Data Flow (SCDF) + PCF, again, it comes down to your business requirements. You may not be using it now, but Spring Batch Admin has reached EOL in favor of SCDF's Dashboard. The following questions might help you realize the SCDF + SCT value proposition.
Do you have to monitor the overall batch jobs' status, progress, and health? Maybe you have requirements to assemble multiple batch jobs as a DAG? How about visually composing a series of tasks and orchestrating them entirely from the Dashboard?
Also, when the batch jobs are used together with SCT, SCDF, and the PCF Scheduler, you get the added benefit of monitoring all of this from the PCF Apps Manager.
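If the http-source -> task-launch handoff in the question is the concern, note that a task definition registered in SCDF can also be launched over its REST API (`POST /tasks/executions`). A minimal sketch of building such a request; the server URL, task name, and argument here are hypothetical:

```python
from urllib import request, parse

def build_task_launch_request(scdf_url, task_name, arguments=None):
    """Build (but do not send) a POST request asking the SCDF server to
    launch a previously registered task definition. SCDF exposes task
    launching at POST /tasks/executions, with the definition name passed
    as the 'name' parameter."""
    params = {"name": task_name}
    if arguments:
        # Comma-separated command-line args for the task, e.g. the object key to process
        params["arguments"] = ",".join(arguments)
    url = scdf_url.rstrip("/") + "/tasks/executions?" + parse.urlencode(params)
    return request.Request(url, method="POST")

# Hypothetical file-arrival handler: launch the batch task for the new object
req = build_task_launch_request(
    "http://scdf-server.example.com",               # hypothetical SCDF server URL
    "s3-to-db-job",                                 # hypothetical task definition name
    arguments=["--file=s3://bucket/incoming/data.csv"],
)
# request.urlopen(req) would actually submit the launch request
```

This is only a sketch of the handoff; in practice the `tasklauncher` sink application plays this role inside a stream.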

Related

Spring Cloud Data Flow preconfigured tasks

Spring Cloud Data Flow is a great solution, and I'm currently trying to find a way to preconfigure tasks so that they can be triggered manually.
The use case is very simple:

- as a DevOps engineer, I should have the ability to preconfigure tasks, which includes creating the execution graph and the application and deploy parameters, and saving the task with all parameters needed for execution.
- as a User with the role ROLE_DEPLOY, I should have the ability to start, stop, restart, and monitor the execution of the preconfigured tasks.

A preconfigured task is a task saved with all parameters needed for execution.
Is it possible to have such functionality?
Thank you.
You may want to review the continuous deployment section in the reference guide. There are built-in lifecycle semantics for orchestrating tasks in SCDF.
Assuming you have the Task applications already built and that the DevOps persona is familiar with how the applications work together, they can either interactively or programmatically build a composed-task definition and task/job parameters in SCDF.
The above step persists the task definition in the SCDF's database. Now, the same definition can be launched manually or via a cronjob schedule.
If nothing really changes to the definition, app version, or the task/parameters, yes, anyone with the ROLE_DEPLOY role can interact with it.
You may also find the CD for Tasks guide to be useful reference material. Perhaps repeat it locally (with the desired security configuration) in your environment to get a feel for how it works.

Using Dataflow vs. Cloud Composer

I'd like to get some clarification on whether Cloud Dataflow or Cloud Composer is the right tool for the job; it wasn't clear to me from the Google documentation.
Currently, I'm using Cloud Dataflow to read a non-standard csv file -- do some basic processing -- and load it into BigQuery.
Let me give a very basic example:
# file.csv
type\x01date
house\x0112/27/1982
car\x0111/9/1889
From this file we detect the schema and create a BigQuery table, something like this:
`table`
type (STRING)
date (DATE)
And we also format our data (in Python) for insertion into BigQuery:
DATA = [
("house", "1982-12-27"),
("car", "1889-11-09")
]
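The detect-and-format step described above could be sketched roughly as follows; the type-inference rule here is my own simplification (date-or-string from a sample value), not the actual pipeline:

```python
import datetime

def infer_type(value):
    """Guess a BigQuery column type from a sample value (simplified:
    MM/DD/YYYY -> DATE, everything else -> STRING)."""
    try:
        datetime.datetime.strptime(value, "%m/%d/%Y")
        return "DATE"
    except ValueError:
        return "STRING"

def parse_file(text, delimiter="\x01"):
    """Parse the \x01-delimited file into a (schema, rows) pair, with
    DATE values reformatted to ISO (YYYY-MM-DD) for BigQuery."""
    lines = text.strip().splitlines()
    header = lines[0].split(delimiter)
    sample = lines[1].split(delimiter)
    schema = [(name, infer_type(v)) for name, v in zip(header, sample)]
    rows = []
    for line in lines[1:]:
        row = []
        for (name, col_type), value in zip(schema, line.split(delimiter)):
            if col_type == "DATE":
                value = datetime.datetime.strptime(value, "%m/%d/%Y").date().isoformat()
            row.append(value)
        rows.append(tuple(row))
    return schema, rows

schema, data = parse_file("type\x01date\nhouse\x0112/27/1982\ncar\x0111/9/1889")
# schema -> [("type", "STRING"), ("date", "DATE")]
# data   -> [("house", "1982-12-27"), ("car", "1889-11-09")]
```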
This is a vast simplification of what's going on, but this is how we're currently using Cloud Dataflow.
My question then is, where does Cloud Composer come into the picture? What additional features could it provide on the above? In other words, why would it be used "on top of" Cloud Dataflow?
Cloud Composer (which is backed by Apache Airflow) is designed for task scheduling at a relatively small scale.
Here is an example to help you understand:
Say you have a CSV file in GCS and, to use your example, you use Cloud Dataflow to process it and insert the formatted data into BigQuery. If this is a one-off thing, you have just finished and it's perfect.
Now let's say your CSV file is overwritten at 01:00 UTC every day, and you want to run the same Dataflow job to process it every time it's overwritten. If you don't want to manually run the job exactly at 01:00 UTC regardless of weekends and holidays, you need a thing to periodically run the job for you (in our example, at 01:00 UTC every day). Cloud Composer can help you in this case. You can provide a config to Cloud Composer, which includes what jobs to run (operators), when to run them (a job start time), and at what frequency (daily, weekly, or even yearly).
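Such a Composer config is just an Airflow DAG file. A minimal sketch, assuming the Dataflow job has been deployed as a template; the bucket, template path, and project values are placeholders:

```python
from datetime import datetime
from airflow import DAG
from airflow.providers.google.cloud.operators.dataflow import (
    DataflowTemplatedJobStartOperator,
)

# Run at 01:00 UTC every day; the operator launches a (hypothetical)
# pre-built Dataflow template that does the CSV -> BigQuery processing.
with DAG(
    dag_id="daily_csv_to_bq",
    start_date=datetime(2023, 1, 1),
    schedule_interval="0 1 * * *",  # cron: 01:00 UTC daily
    catchup=False,
) as dag:
    run_dataflow = DataflowTemplatedJobStartOperator(
        task_id="run_csv_pipeline",
        template="gs://my-bucket/templates/csv_to_bq",  # placeholder template path
        project_id="my-project",                        # placeholder project
        location="us-central1",
    )
```

This is a configuration sketch, not a tested pipeline; operator names and parameters should be checked against the Airflow Google provider version in your Composer environment.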
It seems cool already. However, what if the CSV file is overwritten not at 01:00 UTC but at any time during the day? How would you choose the daily running time? Cloud Composer provides sensors, which can monitor a condition (in this case, the CSV file's modification time). Cloud Composer can guarantee that it kicks off a job only when the condition is satisfied.
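Conceptually, a sensor is just a guarded poll: keep checking a condition and only launch the downstream job once it holds. A pure-Python sketch of the idea (not actual Airflow sensor code):

```python
def run_when_updated(get_mtime, last_seen_mtime, launch_job, polls):
    """Poll the file's modification time up to `polls` times and launch
    the job only once the file has been overwritten (mtime changed).
    Returns True if the job was launched."""
    for _ in range(polls):
        mtime = get_mtime()
        if mtime != last_seen_mtime:
            launch_job()
            return True
        # a real sensor would sleep between polls here
    return False

# Simulated object store: the mtime changes on the third poll
mtimes = iter([100, 100, 250])
launched = []
ran = run_when_updated(lambda: next(mtimes), 100, lambda: launched.append("job"), polls=3)
# ran -> True, launched -> ["job"]
```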
There are a lot more features that Cloud Composer/Apache Airflow provide, including having a DAG to run multiple jobs, failed task retry, failure notification and a nice dashboard. You can also learn more from their documentations.
For the basics of your described task, Cloud Dataflow is a good choice. Big data that can be processed in parallel is a good choice for Cloud Dataflow.
The real world of processing big data is usually messy. Data is usually somewhat to very dirty, arrives constantly or in big batches, and needs to be processed in time-sensitive ways. Usually it takes the coordination of more than one task/system to extract the desired data. Think of load, transform, merge, extract, and store types of tasks. Big data processing is often glued together with shell scripts and/or Python programs. This makes automation, management, scheduling, and control processes difficult.
Google Cloud Composer is a big step up from Cloud Dataflow. Cloud Composer is a cross platform orchestration tool that supports AWS, Azure and GCP (and more) with management, scheduling and processing abilities.
Cloud Dataflow handles tasks. Cloud Composer manages entire processes coordinating tasks that may involve BigQuery, Dataflow, Dataproc, Storage, on-premises, etc.
My question then is, where does Cloud Composer come into the picture? What additional features could it provide on the above? In other words, why would it be used "on top of" Cloud Dataflow?
If you need / require more management, control, scheduling, etc. of your big data tasks, then Cloud Composer adds significant value. If you are just running a simple Cloud Dataflow task on demand once in a while, Cloud Composer might be overkill.
Cloud Composer (backed by Apache Airflow) is designed for task scheduling.
Cloud Dataflow (backed by Apache Beam) handles the tasks themselves.
For me, Cloud Composer is a step up (a big one) from Dataflow. If I had one task, say processing my CSV file from Storage into BQ, I would/could use Dataflow. But if I wanted to run the same job daily, I would use Composer.

How to scale the Spring Cloud Data Flow server

Normally we run the Spring Cloud Data Flow server jar on a single machine. But what if, over time, we create many flows on that machine, and the server gets overloaded and becomes a single point of failure? Is there a way to run the Spring Cloud Data Flow server jar on another machine and shift flows onto it, so that we can avoid such failures and make our complete system more resilient and robust? Or does this expansion happen automatically when we deploy the complete system on PCF/Cloud Foundry?
SCDF is a simple Boot application. It doesn't retain any state about the stream/task applications itself, but it does keep track of the DSL definitions in the database.
It is common to provision multiple instances of SCDF-server and a load balancer in front for resiliency.
In PCF specifically, if you scale the SCDF-server to >1, PCF will automatically load-balance the incoming traffic (from SCDF Shell/GUI). It is also important to note that PCF will automatically restart the application instance, if it goes down for any reason. You will be set up for multiple levels of resiliency this way.

Can slave processes be dynamically provisioned based on load using Spring Cloud Data Flow?

We are currently using Spring Batch remote chunking to scale batch processing. We are thinking of using Spring Cloud Data Flow, but would like to know whether slaves can be dynamically provisioned based on load.
We are deployed in Google Cloud, so we would also want to consider Spring Cloud Data Flow's support for Kubernetes, if Cloud Data Flow fits our needs.
When using the batch extensions of Spring Cloud Task (specifically the DeployerPartitionHandler), workers are dynamically launched as needed. That PartitionHandler allows you to configure a maximum number of workers; it will then process each partition as an independent worker up to that max (processing the rest of the partitions as the others finish up). The "dynamic" aspect is really controlled by the number of partitions returned by the Partitioner: the more partitions returned, the more workers launched.
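A toy illustration of that relationship between partitions and the worker cap (pure Python, not the actual DeployerPartitionHandler API; it also simplifies "launch as others finish" into fixed waves):

```python
def run_partitions(partitions, max_workers):
    """Process partitions in waves of at most `max_workers` concurrent
    'workers'; returns the waves to show the launch pattern. In the real
    PartitionHandler each wave member is a dynamically launched task."""
    waves = []
    for start in range(0, len(partitions), max_workers):
        waves.append(partitions[start:start + max_workers])
    return waves

# 7 partitions returned by the Partitioner, at most 3 workers at a time:
waves = run_partitions([f"partition-{i}" for i in range(7)], max_workers=3)
# -> three waves: 3 workers, then 3, then 1
```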
You can see a simple example configured to use Cloud Foundry in this repo: https://github.com/mminella/S3JDBC. The main difference between it and what you'd need is that you'd swap out the CloudFoundryTaskLauncher for a KubernetesTaskLauncher and its appropriate configuration.

Spring Cloud Data Flow distributed processing

How does Spring Cloud Data Flow take care of distributed processing? If the server is deployed on PCF and say there are 2 instances, how will the input data be distributed between these 2 instances?
Also, how are failures handled when deployed on PCF? PCF will spawn a new instance for failed one. But will it also take care of deploying the stream or manual intervention is required there?
You should make the distinction between what the Spring Cloud Dataflow documentation calls "the server" and the apps that make up a managed stream.
"The server" is only there to receive deployment requests and honor them, spawning the apps that make up your stream(s). If you deploy multiple instances of "the server", there is nothing special about it: PCF will front them with a load balancer, and either instance will handle your REST requests. When deploying on PCF, state is maintained in a bound service, so there is nothing special here either.
If you're rather referring to "the apps", i.e. deploying a stream with some or all of its parts using more than one instance, e.g.
stream create foo --definition "time | log"
stream deploy foo --properties "app.log.count=3"
then by default, it's up to the binder implementation to choose how to distribute data. This often means round robin balancing.
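Round-robin balancing here simply means messages cycle through the available instances; a quick sketch of the idea:

```python
from itertools import cycle

def round_robin(messages, instance_count):
    """Distribute messages across instances the way a binder's default
    round-robin balancing would: message i goes to instance i % n."""
    assignment = {i: [] for i in range(instance_count)}
    instances = cycle(range(instance_count))
    for msg in messages:
        assignment[next(instances)].append(msg)
    return assignment

dist = round_robin(["m0", "m1", "m2", "m3", "m4"], instance_count=3)
# -> {0: ["m0", "m3"], 1: ["m1", "m4"], 2: ["m2"]}
```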
If you want to control how data pertaining to the same conceptual domain object ends up on the same app instance, you should tell Dataflow how to do so. Something like
stream deploy bar --properties "app.x.producer.partitionKeyExpression=<someDomainConcept>"
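Under partitioning, the binder evaluates the key expression per message and hashes the result, so all messages with the same key land on the same consumer instance. A sketch of the idea; the modulo hash here is illustrative, and the actual binder hashing may differ:

```python
def target_instance(partition_key, instance_count):
    """Pick the consumer instance for a message: same key -> same instance."""
    return hash(partition_key) % instance_count

# All events for the same domain key go to the same instance out of 3
a = target_instance(42, 3)       # e.g. customer id 42
b = target_instance(42, 3)       # same key, same instance
other = target_instance(7, 3)    # a different key may land elsewhere
```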
As for handling failures, I'm not sure what you're asking. The deployed apps are the stream. Once a request for that many instances of the stream components has been sent to and received by PCF, PCF will take care of honouring that request. It's out of the hands of Data Flow at that point, and this is exactly why the boundary for the Spring Cloud Deployer contract has been set there (the same goes for other runtimes).
