Orchestration/notification of processing events - spring-cloud-dataflow

I have the following SCDF use case.
I have a couple hundred files to process and load into the DB.
A producer gets a single file, reads the first N rows and sends them to the source (RabbitMQ), then reads the next N rows and sends them to the source again, and so on until the file is done.
A consumer receives these file chunks (from RabbitMQ), does some minor enrichment, and writes them to the DB (sink).
I will have more than one stream running (say 4) to process these files in parallel.
My question is: does SCDF have a mechanism to know when all consumers have completed (and hence the queues are exhausted), so I know when to start some other process (another stream, a task, anything) that needs the DB fully populated before it begins?

Yes, sink1 is the only consumer of source1. In a streaming application, there is no concept of “COMPLETED”. By definition, stream processing is logically unbounded, and stream apps (sources and sinks) are designed to run forever. Tasks, on the other hand, are short-lived, finite processes that exit when they are complete. The application logic defines when the task is complete. Processing a file, or a chunk of a file, is the most common use case. A stream can monitor a file system, or a remote file source such as SFTP or S3, and launch a task whenever a new file appears. The task processes the file and marks the execution as COMPLETE.

This type of use case is better suited for task/batch. See https://dataflow.spring.io/docs/recipes/batch/sftp-to-jdbc/ which details the recommended architecture. You can define a composed task to run the ingest and then the next task.

Related

How does dataflow manage current processes during upscaling streaming job?

When a Dataflow streaming job with autoscaling enabled is deployed, it starts with a single worker.
Let's assume that the pipeline reads Pub/Sub messages, does some DoFn operations and uploads the results into BQ.
Let's also assume that the Pub/Sub backlog is already fairly large.
So the pipeline starts, loads some Pub/Sub messages and processes them on a single worker.
After a couple of minutes it realizes that extra workers are needed and creates them.
Many Pub/Sub messages are already loaded and being processed, but not acked yet.
And here is my question: how will Dataflow manage those elements that are being processed but not yet acked?
My observations suggest that Dataflow sends many of those in-flight messages to a newly created worker, and we can see the same element being processed at the same time on two workers.
Is this expected behavior?
Another question: what next? Does the first win, or the new one?
I mean, we have the same Pub/Sub message still being processed on the first worker and on the new one.
What if the process on the first worker is faster and finishes first? Will it be acked and go downstream, or will it be dropped because a new process for this element has started and only the new one can be finalized?
Dataflow provides exactly-once processing of every record. Funnily enough, this does not mean that user code is run only once per record, whether by the streaming or batch runner.
It might run a given record through a user transform multiple times, or it might even run the same record simultaneously on multiple workers; this is necessary to guarantee at-least once processing in the face of worker failures. Only one of these invocations can “win” and produce output further down the pipeline.
More information here - https://cloud.google.com/blog/products/data-analytics/after-lambda-exactly-once-processing-in-google-cloud-dataflow-part-1
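The "only one invocation wins" idea can be illustrated with a toy model. This is only a sketch under stated assumptions, not how Dataflow is implemented: `CommitLog`, `user_transform`, the worker names and the record ID `msg-1` are all hypothetical. Two threads stand in for two workers that run the same user code on the same record; the commit step accepts the first result and rejects the duplicate.

```python
import threading


class CommitLog:
    """Toy stand-in for a runner's commit step: first commit per record ID wins."""

    def __init__(self):
        self._lock = threading.Lock()
        self._committed = {}

    def try_commit(self, record_id, worker, output):
        # Atomically accept the output only if this record has no committed result yet.
        with self._lock:
            if record_id in self._committed:
                return False
            self._committed[record_id] = (worker, output)
            return True

    def outputs(self):
        with self._lock:
            return {rid: out for rid, (_, out) in self._committed.items()}


def user_transform(payload):
    # Hypothetical user code; it may run more than once for the same record.
    return payload.upper()


def process(log, record_id, payload, worker, calls):
    calls.append((worker, record_id))  # record that user code ran on this worker
    result = user_transform(payload)
    return log.try_commit(record_id, worker, result)


log = CommitLog()
calls = []
threads = [
    threading.Thread(target=process, args=(log, "msg-1", "hello", w, calls))
    for w in ("worker-a", "worker-b")
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

After both threads finish, `calls` has two entries (the user code ran twice) while `log.outputs()` holds a single result for `msg-1`. Which worker wins depends on timing, which is why side effects inside user code should be idempotent.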

Dataflow control high fanout between steps

I have 3 dataflow steps in a Dataflow pipeline.
1. Reads from Pub/Sub, saves to a table and splits into multiple events (emitted via the context output).
2. For each split, queries the DB and decorates the event with additional data.
3. Publishes to another Pub/Sub topic for further processing.
PROBLEM:
After step 1, the data is split into 10K to 20K events.
Now in step 2 it runs out of database connections. (I have a static Hikari connection pool.)
It works absolutely fine with less data. I am using an n1-standard-32 machine.
What should I do to limit the input to the next step, so that parallelism is restricted or events to the next step are throttled?
I think the basic idea is to reduce parallelism when executing step 2 (with massive parallelism you will need 20k connections for 20k events, because the 20k events are processed in parallel).
Ideas include:
Stateful ParDo: execution is serialized per key per window, which means only one connection is needed per key and window, because only one element is processed at a given time for a given key and window.
One connection per bundle: you can initialize a connection at startBundle and make elements within the same bundle use the same connection (if my understanding is correct, execution within a bundle is likely serialized).
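The connection-per-bundle idea can be sketched as follows. This is a minimal illustration, not Beam code: `EnrichFn` only mirrors the shape of a Beam Python DoFn (`start_bundle`, `process`, `finish_bundle`), while `FakeConnection` and `run_bundle` are hypothetical stand-ins for a real database connection and for the runner.

```python
class FakeConnection:
    """Hypothetical stand-in for a database connection; counts how many were opened."""

    opened = 0

    def __init__(self):
        FakeConnection.opened += 1
        self.closed = False

    def lookup(self, event_id):
        # Pretend to query the database for extra data about the event.
        return {"id": event_id, "extra": f"details-for-{event_id}"}

    def close(self):
        self.closed = True


class EnrichFn:
    """Mirrors the lifecycle of a Beam DoFn without depending on Beam."""

    def start_bundle(self):
        # One connection for the whole bundle instead of one per element.
        self.conn = FakeConnection()

    def process(self, event_id):
        yield self.conn.lookup(event_id)

    def finish_bundle(self):
        self.conn.close()


def run_bundle(fn, elements):
    # Hypothetical driver playing the role of the runner for a single bundle.
    fn.start_bundle()
    try:
        out = []
        for element in elements:
            out.extend(fn.process(element))
        return out
    finally:
        fn.finish_bundle()


fn = EnrichFn()
results = run_bundle(fn, range(1000))
```

Here 1000 elements share one connection. The number of connections then scales with the number of bundles being processed concurrently rather than with the number of events, though bundle sizes are chosen by the runner and are not guaranteed.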

Does Google Cloud Dataflow optimize around IO bound processes?

I have a Python beam.DoFn which is uploading a file to the internet. This process uses 100% of one core for ~5 seconds and then proceeds to upload a file for 2-3 minutes (and uses a very small fraction of the cpu during the upload).
Is Dataflow smart enough to optimize around this by spinning up multiple DoFns in separate threads/processes?
Yes, Dataflow will spin up multiple instances of a DoFn using Python multiprocessing.
However, keep in mind that if you use a GroupByKey, the ParDo will process the elements for a particular key serially. You still achieve parallelism on the worker, since you are processing multiple keys at once, but if all of your data is on a single "hot key" you may not achieve good parallelism.
Are you using TextIO.Write in a batch pipeline? I believe that the files are prepared locally and then uploaded after your main DoFn has finished. That is, the file is not uploaded until the PCollection is complete and will not receive more elements.
I don't think it streams out the files as you are producing elements.

scheduled task or windows service

My team is having a debate about which is better: a Windows service or scheduled tasks. We have a server dedicated to running jobs, and currently they are all scheduled tasks. Some jobs take files, rename them and place them in other directories on the network. Other jobs extract data from SQL, modify it, and ship it elsewhere. Other jobs FTP files out. There is a lot of variety, but all in all, they are fairly straightforward.
I am partial to having each of these run as a Windows service instead of a scheduled task, because it is so much easier to monitor a Windows service than a scheduled task. Some are diametrically opposed. In the end, none of us has enough experience to provide actual factual comparisons between the two methods. I am looking for some feedback on what others have experienced.
If it runs constantly - windows service.
If it needs to be run at various intervals - scheduled task.
Scheduled Task - When an activity is to be carried out on some fixed/predefined schedule. It takes less memory and fewer OS resources. It requires no installation. It can have a UI (e.g. send a reminder mail to defaulters).
Windows Service - When continuous monitoring is required. It keeps the OS busier by consuming more resources. It requires install/uninstall when changing versions. No UI at all (e.g. process a mail as soon as it arrives).
Use them wisely.
Scheduling jobs with the built-in functionality is a perfectly valid use. You would have to recreate the full functionality in order to create a good service, and unless you want to react to specific events, I see no reason to move a nightly job into a service.
It's different when you want to process a file after it was posted in a folder. That's something I would create a service for, one that uses the file system watcher to monitor a folder.
I think it's reinventing the wheel.
While there is nothing wrong with using the Task Scheduler, it is, itself, a service. But we have the same requirements where I work, and we have a general-purpose program that does several of these jobs. I interpreted your post to say that you would run an individual service for each task. I would consider writing a single, database-driven (service) program to do all your tasks; that way, when you add a new one, it is simply a data entry chore and not a whole new program to write. If you practice change control, this difference can be significant. If you have more than a few tasks, the effort may be comparable. This approach will also allow you to craft a logging mechanism best suited to your operations.
This is a portion of the requirements document for our task program, to give you an idea of where to start:
This program needs to be database driven.
It needs to run as a windows service.
The program needs to be able to process "jobs" in the following manner:
Jobs need to be able to check for the existence of a source file, and take action based on whether or not the source file exists (i.e. proceed with processing, vs. report that the file isn't there, vs. ignore it because it is not critical that the file isn't there).
Jobs need to be able to copy a file from a source to a target location or
Copy a file from source, to a staging location, perform "processing", and then copy either the original file or a result of the "processing" to the target location or
Copy a file from source, to a staging location, perform "processing", and the processing is the end result.
The sources and destination that jobs might copy to and from can be disparate: UNC, SFTP, FTP, etc.
The "processing", can be, encrypting/decrypting a file, parsing a data file for correct format, feeding the file to the mainframe via terminal emulation, etc., usually implemented by calling a command line passing parameters to an .exe
Jobs need to be able to clean up after themselves, as required. i.e. delete intermediate or original files, copy files to an archive location, etc.
The program needs to be able to determine the success and failure of each phase of a job and take appropriate action which would be logging, and possibly other notification, abort further processing on failure, etc.
Jobs need to be configured to activate at certain set times, or at certain intervals (optionally during certain set hours) i.e. every 15 mins from 9:00 - 5:00.
There needs to be a UI to add new jobs.
There needs to be a button to push to fire off a job as if a timer event had activated it.
The standard display of the program should show an operator what is going on and whether the program is functioning properly.
All of this is predicated on the premise that it is a given that you write your own software. There are several enterprise task scheduler programs available on the market, as well. Buying off the shelf may be a better solution for you.
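As a rough illustration of the scheduling requirement in the list above ("every 15 mins from 9:00 - 5:00"), here is a minimal sketch of a job record and a due check. The `Job` fields, the `is_due` helper and the job name are hypothetical, and 5:00 is assumed to mean 17:00. In a database-driven program the `Job` instances would be loaded from a table rather than constructed in code.

```python
from dataclasses import dataclass
from datetime import datetime, time, timedelta
from typing import Optional


@dataclass
class Job:
    """Hypothetical job definition; in practice one row of a jobs table."""

    name: str
    interval: timedelta          # how often the job should fire
    window_start: time           # earliest time of day the job may run
    window_end: time             # latest time of day the job may run
    last_run: Optional[datetime] = None


def is_due(job: Job, now: datetime) -> bool:
    """Return True if the job should fire at `now`."""
    # Outside the allowed hours the job never fires.
    if not (job.window_start <= now.time() <= job.window_end):
        return False
    # A job that has never run fires as soon as it is inside its window.
    if job.last_run is None:
        return True
    return now - job.last_run >= job.interval


# "Every 15 mins from 9:00 - 5:00", assuming 5:00 means 17:00.
job = Job("copy-report", timedelta(minutes=15), time(9, 0), time(17, 0))
```

A service loop would call `is_due` for every job on each timer tick and set `last_run` after a job fires. The "button to push" requirement then amounts to running the job without consulting `is_due`.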

Using Erlang to manage multiple instances of an external process

I have a single threaded process, which takes an input file and produces an output file (takes file in and file out paths as inputs). I want to use Erlang to create, manage and close multiple instances of this process.
Basically, whenever the client process needs to produce the output file, the client connects to the Erlang server with the input and output paths. The server starts a new process, feeds it the paths, and then, once the process is done, terminates it.
I have a basic understanding of how gen_server etc. work, but I want to know whether I can use Erlang to create and delete instances of an external process (e.g. a JAR). What library should I look into?
Look at ports. http://www.erlang.org/doc/man/erlang.html#open_port-2
The os:cmd function is probably the closest, see http://www.erlang.org/doc/man/os.html. It does assume that your processes run and then finish - the "deleting" part is not covered.

Resources