Publish dataflow job status once completed onto Google Pub/Sub - google-cloud-dataflow

Currently I am using a Flex Template to launch a job from a microservice. I am trying to find a better way (than polling the job status) to get the Dataflow job status. Basically, I want the Dataflow job itself to publish its status to Pub/Sub when it completes.
Can someone help me with this?

There is currently no way to make a Dataflow job itself send its status to a Pub/Sub topic.
Instead, you can create a log export (sink) that routes your Dataflow logs to a Pub/Sub topic, using inclusion and exclusion filters, and then perform text searches on the Pub/Sub messages to deduce the status of your job. For example, you can create an inclusion filter on the log name dataflow.googleapis.com%2Fjob-message, and, among the received messages, one that contains a string like "Workflow failed." comes from a batch job that failed.
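Here is a minimal sketch of that setup using the google-cloud-logging Java client; the project ID, sink name, topic name, and exact filter string are assumptions you would adapt to your environment:

import com.google.cloud.logging.Logging;
import com.google.cloud.logging.LoggingOptions;
import com.google.cloud.logging.Sink;
import com.google.cloud.logging.SinkInfo;

public class CreateDataflowStatusSink {
  public static void main(String[] args) throws Exception {
    Logging logging = LoggingOptions.getDefaultInstance().getService();

    // Route only the Dataflow "job-message" log entries, where job status messages land.
    String filter =
        "resource.type=\"dataflow_step\" AND "
            + "logName=\"projects/my-project/logs/dataflow.googleapis.com%2Fjob-message\"";

    SinkInfo sinkInfo =
        SinkInfo.newBuilder(
                "dataflow-job-status-sink", // assumed sink name
                SinkInfo.Destination.TopicDestination.of("my-project", "dataflow-job-status"))
            .setFilter(filter)
            .build();

    Sink sink = logging.create(sinkInfo);
    System.out.println("Created sink: " + sink.getName());
    logging.close();
  }
}

A subscriber on that topic can then match strings such as "Workflow failed." in the exported entries to infer the job outcome.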

Related

How to get lineage info of dataflow jobs?

I am new to Dataflow and am trying to get lineage information about any Dataflow job, for an app I am trying to build. I am trying to fetch at least the source and destination names from a job and, if possible, find out the transformations applied to the PCollections in the pipeline, something like a trace of the function calls.
I have been analyzing the logs for different kinds of jobs, but could not figure out a definite way to fetch any of the information I am looking for.
You should be able to get this information from the graph itself. One way to do this would be to implement your own runner which delegates to the Dataflow runner.
For Dataflow, you could also fetch the job (whose steps give the topology) from the service via the Dataflow API.
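If you go the API route, a minimal sketch (using the google-api-services-dataflow client; the project, region, and job ID below are placeholders) that pulls the job with its steps looks like this:

import java.util.Collections;
import com.google.api.client.googleapis.javanet.GoogleNetHttpTransport;
import com.google.api.client.json.gson.GsonFactory;
import com.google.api.services.dataflow.Dataflow;
import com.google.api.services.dataflow.model.Job;
import com.google.api.services.dataflow.model.Step;
import com.google.auth.http.HttpCredentialsAdapter;
import com.google.auth.oauth2.GoogleCredentials;

public class DumpJobTopology {
  public static void main(String[] args) throws Exception {
    GoogleCredentials credentials =
        GoogleCredentials.getApplicationDefault()
            .createScoped(Collections.singletonList("https://www.googleapis.com/auth/cloud-platform"));

    Dataflow dataflow =
        new Dataflow.Builder(
                GoogleNetHttpTransport.newTrustedTransport(),
                GsonFactory.getDefaultInstance(),
                new HttpCredentialsAdapter(credentials))
            .setApplicationName("job-lineage-inspector")
            .build();

    // JOB_VIEW_ALL includes the job's steps, i.e. the pipeline topology.
    Job job =
        dataflow.projects().locations().jobs()
            .get("my-project", "us-central1", "my-job-id")
            .setView("JOB_VIEW_ALL")
            .execute();

    for (Step step : job.getSteps()) {
      // Each step describes one transform in the graph; the properties of read/write
      // steps typically carry the source and sink identifiers you are after.
      System.out.println(step.getKind() + " | " + step.getName() + " | " + step.getProperties());
    }
  }
}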

Unable to stop a streaming dataflow on google cloud

There is a streaming dataflow running on Google Cloud (Apache Beam 2.5). The dataflow was showing some system lag, so I tried to update it with the --update flag. Now the old dataflow is in the Updating state and the new dataflow that was initiated by the update process is in the Pending state.
At this point everything is stuck and I am unable to stop/cancel the jobs. The old job is still in the Updating state and no state-change operation is permitted. I tried to change the state of the job using gcloud dataflow jobs cancel and the REST API, but it says the job cannot be updated because it is in the RELOAD state. The newly initiated job is in the not started/pending state, and I am unable to change its state either; it says the job is not in a condition to perform this operation.
Please let me know how to stop/cancel/delete this streaming dataflow.
Did you try to cancel the job from both the gcloud command-line tool and the web console UI? If nothing works, I think you need to contact Google Cloud Support.
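If both the console and gcloud keep rejecting the request, one more thing worth trying before opening a support case is issuing the cancel directly against the Dataflow API by setting the job's requested state. A minimal sketch follows (project, region, and job ID are placeholders, and there is no guarantee it succeeds on a job stuck in RELOAD):

import java.util.Collections;
import com.google.api.client.googleapis.javanet.GoogleNetHttpTransport;
import com.google.api.client.json.gson.GsonFactory;
import com.google.api.services.dataflow.Dataflow;
import com.google.api.services.dataflow.model.Job;
import com.google.auth.http.HttpCredentialsAdapter;
import com.google.auth.oauth2.GoogleCredentials;

public class CancelStuckJob {
  public static void main(String[] args) throws Exception {
    Dataflow dataflow =
        new Dataflow.Builder(
                GoogleNetHttpTransport.newTrustedTransport(),
                GsonFactory.getDefaultInstance(),
                new HttpCredentialsAdapter(
                    GoogleCredentials.getApplicationDefault()
                        .createScoped(Collections.singletonList(
                            "https://www.googleapis.com/auth/cloud-platform"))))
            .setApplicationName("job-canceller")
            .build();

    // Cancellation is requested by updating the job with a requested state.
    Job cancelRequest = new Job().setRequestedState("JOB_STATE_CANCELLED");
    dataflow.projects().locations().jobs()
        .update("my-project", "us-central1", "my-job-id", cancelRequest)
        .execute();
  }
}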

Drain DataFlow job and start another one right after, cause to message duplication

I have a dataflow job, that subscribed to messages from PubSub:
p.apply("pubsub-topic-read", PubsubIO.readMessagesWithAttributes()
.fromSubscription(options.getPubSubSubscriptionName()).withIdAttribute("uuid"))
I see in the docs that there is no guarantee against duplication, and Beam suggests using withIdAttribute.
This works perfectly until I drain an existing job, wait for it to finish, and start another one; then I see millions of duplicate BigQuery records (my job writes Pub/Sub messages to BigQuery).
Any idea what I'm doing wrong?
I think you should be using the update feature instead of draining the pipeline and starting a new one. In the latter approach, state is not shared between the two pipelines, so Dataflow is not able to identify messages already delivered from Pub/Sub. With the update feature you should be able to continue your pipeline without duplicate messages.
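As a minimal sketch of the update approach (the job name, subscription, and remaining transforms are placeholders; the runner, project, and region are assumed to come in via the usual command-line options):

import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class UpdateInsteadOfDrain {
  public static void main(String[] args) {
    DataflowPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(DataflowPipelineOptions.class);

    // Replace the running job in place, keeping its state (including the Pub/Sub
    // deduplication state), instead of draining it and starting a fresh job.
    options.setJobName("my-existing-job-name"); // must match the running job's name
    options.setUpdate(true);                    // equivalent to passing --update

    Pipeline p = Pipeline.create(options);
    p.apply("pubsub-topic-read",
        PubsubIO.readMessagesWithAttributes()
            .fromSubscription("projects/my-project/subscriptions/my-sub")
            .withIdAttribute("uuid"));
    // ... the rest of the pipeline (e.g. the BigQuery write) goes here unchanged ...
    p.run();
  }
}

Note that an update only succeeds if the new pipeline is compatible with the running one (matching or explicitly mapped transform names).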

How to read from pubsub source in parallel using dataflow

I am very new to Dataflow, and I am looking to build a pipeline that will use Pub/Sub as its source.
I have worked on a streaming pipeline with Flink as the streaming engine and Kafka as the source; there you can set the parallelism in Flink so that messages are read from Kafka and processed in parallel rather than sequentially.
I am wondering whether the same is possible with Pub/Sub and Dataflow, or whether it will only read messages sequentially.
Take a look at the PubSubToBigQuery pipeline. It uses Pub/Sub as a source and reads data in parallel: by default, multiple threads each read a message off Pub/Sub and hand it off to downstream transforms for processing.
Note that the PubSubToBQ pipeline can also be run as a template pipeline, which works well for many users: just launch the pipeline from the Template UI and set the appropriate parameters to point to your Pub/Sub and BigQuery locations. Some users prefer to use it that way, but it depends on where you want to store your data.
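For illustration, here is a minimal streaming pipeline sketch (the subscription name and worker counts are placeholders) showing that no explicit parallelism setting is needed on the read; Dataflow fans elements out across threads and workers, bounded by the worker settings:

import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;

public class ParallelPubSubRead {
  public static void main(String[] args) {
    DataflowPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(DataflowPipelineOptions.class);
    options.setStreaming(true);
    // Unlike Flink's explicit parallelism, Dataflow decides the parallelism itself;
    // the worker counts only bound how far it can scale out.
    options.setNumWorkers(2);
    options.setMaxNumWorkers(10);

    Pipeline p = Pipeline.create(options);
    p.apply("ReadFromPubSub",
            PubsubIO.readStrings()
                .fromSubscription("projects/my-project/subscriptions/my-sub"))
        .apply("Process", ParDo.of(new DoFn<String, String>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            // Each message is handled independently, so many can be in flight at once.
            c.output(c.element().toUpperCase());
          }
        }));
    p.run();
  }
}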

JSR352: Monitoring Status of Job, Step and Partitions?

IBM's implementation of JSR352 provides a REST API which can be used to trigger jobs, restart them, and get the job logs. Can it also be used to get the status of each step and each partition of a step?
I want to build a job monitoring console from which I can trigger jobs and monitor the status of the steps and partitions in real time, without having to look into the job log (after I trigger a job, it should periodically give me the status of the steps and partitions).
How should I go about doing this?
You can subscribe to our batch events, a JMS topic tree where we publish messages at various stages of the batch job lifecycle (job started/ended, step checkpointed, etc.).
See the Knowledge Center documentation and this whitepaper as well for more information.
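A plain JMS subscriber is enough to consume these events. Here is a minimal sketch, assuming a connection factory has been configured on the server; the JNDI name jms/batchConnectionFactory and the topic string below are illustrative, not fixed names, and the actual topic tree is described in the Knowledge Center:

import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.Message;
import javax.jms.MessageConsumer;
import javax.jms.Session;
import javax.jms.TextMessage;
import javax.jms.Topic;
import javax.naming.InitialContext;

public class BatchEventsListener {
  public static void main(String[] args) throws Exception {
    InitialContext ctx = new InitialContext();
    ConnectionFactory cf = (ConnectionFactory) ctx.lookup("jms/batchConnectionFactory"); // assumed JNDI name

    Connection conn = cf.createConnection();
    Session session = conn.createSession(false, Session.AUTO_ACKNOWLEDGE);
    // Subscribe somewhere under the batch events topic root; adjust the topic/wildcard
    // to the names your provider and the documentation use.
    Topic topic = session.createTopic("batch/jobs/execution/#");
    MessageConsumer consumer = session.createConsumer(topic);
    conn.start();

    while (true) {
      Message m = consumer.receive();
      if (m instanceof TextMessage) {
        // Each event message describes a job/step/partition state transition.
        System.out.println(((TextMessage) m).getText());
      }
    }
  }
}

Your monitoring console can update its view from these events instead of polling the REST API or parsing the job log.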
