I am new to Dataflow and am trying to get lineage information about any Dataflow job, for an app I am trying to build. I want to fetch at least the source and destination names from a job and, if possible, find out the transformations applied to the PCollections in the pipeline, something like a trace of the function calls.
I have been analyzing the logs for different kinds of jobs, but could not figure out a definite way to fetch any of the information I am looking for.
You should be able to get this information from the graph itself. One way to do this would be to implement your own runner which delegates to the Dataflow runner.
For Dataflow, you could also fetch the job (whose steps will give the topology) from the service via the Dataflow API.
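For example, here is a rough sketch of that using the generated Java client for the Dataflow API (google-api-services-dataflow); the project, region, and job ID are placeholders, and the exact auth/builder setup may differ between client library versions:

import java.util.Collections;

import com.google.api.client.googleapis.javanet.GoogleNetHttpTransport;
import com.google.api.client.json.jackson2.JacksonFactory;
import com.google.api.services.dataflow.Dataflow;
import com.google.api.services.dataflow.model.Job;
import com.google.api.services.dataflow.model.Step;
import com.google.auth.http.HttpCredentialsAdapter;
import com.google.auth.oauth2.GoogleCredentials;

public class JobTopology {
  public static void main(String[] args) throws Exception {
    // Placeholder identifiers for illustration.
    String project = "my-project";
    String region = "us-central1";
    String jobId = "my-job-id";

    GoogleCredentials credentials = GoogleCredentials.getApplicationDefault()
        .createScoped(Collections.singletonList("https://www.googleapis.com/auth/cloud-platform"));

    Dataflow dataflow = new Dataflow.Builder(
            GoogleNetHttpTransport.newTrustedTransport(),
            JacksonFactory.getDefaultInstance(),
            new HttpCredentialsAdapter(credentials))
        .setApplicationName("job-topology-inspector")
        .build();

    // JOB_VIEW_ALL includes the steps, i.e. the job topology.
    Job job = dataflow.projects().locations().jobs()
        .get(project, region, jobId)
        .setView("JOB_VIEW_ALL")
        .execute();

    for (Step step : job.getSteps()) {
      // Step kind and name hint at sources, sinks, and transforms;
      // step.getProperties() carries more detail about inputs and outputs.
      System.out.println(step.getKind() + " : " + step.getName());
    }
  }
}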
Currently I am using a Flex Template to launch a job from a microservice. I am trying to find a better way (than the job polling method) to get the Dataflow job status. Basically, I am trying to have the Dataflow job status published to Pub/Sub by the Dataflow job itself when it completes.
Can someone help me with this?
There is currently no way to make a Dataflow job itself send its status to a Pub/Sub topic.
Instead, you can create a logs export that sends your Dataflow logs to a Pub/Sub topic, using inclusion and exclusion filters, and then perform text searches on the Pub/Sub messages to deduce the status of your job. For example, you can create an inclusion filter on "dataflow.googleapis.com%2Fjob-message", and, among the received messages, one that contains a string like "Workflow failed." comes from a batch job that failed.
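On the consuming side, a minimal sketch with the Cloud Pub/Sub Java client, assuming a subscription attached to the sink's topic (the project and subscription names are placeholders):

import com.google.cloud.pubsub.v1.AckReplyConsumer;
import com.google.cloud.pubsub.v1.MessageReceiver;
import com.google.cloud.pubsub.v1.Subscriber;
import com.google.pubsub.v1.ProjectSubscriptionName;
import com.google.pubsub.v1.PubsubMessage;

public class JobStatusListener {
  public static void main(String[] args) {
    // Placeholder names for illustration.
    ProjectSubscriptionName subscription =
        ProjectSubscriptionName.of("my-project", "dataflow-job-messages-sub");

    MessageReceiver receiver = (PubsubMessage message, AckReplyConsumer consumer) -> {
      // Each message carries an exported LogEntry; a plain text search is
      // enough to spot terminal job states.
      String logEntry = message.getData().toStringUtf8();
      if (logEntry.contains("Workflow failed.")) {
        System.out.println("A batch job failed: " + logEntry);
      }
      consumer.ack();
    };

    Subscriber subscriber = Subscriber.newBuilder(subscription, receiver).build();
    subscriber.startAsync().awaitRunning();
    subscriber.awaitTerminated(); // keep listening until the process is stopped
  }
}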
I am very new to Dataflow, and I am looking to build a pipeline which will use Pub/Sub as the source.
I have worked on a streaming pipeline with Flink as the streaming engine and Kafka as the source; there we can set the parallelism in Flink to read messages from Kafka so that message processing happens in parallel rather than sequentially.
I am wondering whether the same is possible with Pub/Sub -> Dataflow, or whether it will only read messages in sequential order.
Take a look at the PubSubToBigQuery pipeline. It uses Pub/Sub as a source and reads data in parallel: by default, multiple threads each read a message off Pub/Sub and hand it off to downstream transforms for processing.
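If you end up writing your own pipeline instead, a minimal sketch with the Beam Java SDK could look like the following; the subscription and table names are placeholders, and the ParDo is a stand-in for your own processing:

import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;

public class PubSubToBq {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply("ReadFromPubSub",
            // Reads from the subscription are parallelized across workers and
            // threads automatically; no explicit parallelism setting is needed.
            PubsubIO.readStrings()
                .fromSubscription("projects/my-project/subscriptions/my-sub"))
        .apply("ToTableRow", ParDo.of(new DoFn<String, TableRow>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            // Placeholder transform: wrap the raw payload in a single column.
            c.output(new TableRow().set("payload", c.element()));
          }
        }))
        .apply("WriteToBigQuery",
            BigQueryIO.writeTableRows()
                .to("my-project:my_dataset.my_table")
                .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
                .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

    p.run();
  }
}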
Please note that the PubSubToBQ pipeline can also be run as a template pipeline, which works well for many users. Just launch the pipeline from the Template UI and set the appropriate parameters to point to your Pub/Sub and BigQuery locations. Some users prefer to use it that way, but this depends on where you want to store your data.
In the Dataflow UI, I can check the data watermark at various steps of the job (ex. at step GroupByKey, the data watermark is 2017-05-24 (10:51:58)). Is it possible to access this data via the Dataflow API?
Yes, you can use the gcloud command line tool to access the API.
gcloud beta dataflow metrics list <job_id> --project=<project_name>
Look for metrics ending in data-watermark, for example:
F82-windmill-data-watermark
However, these metrics are not yet easy to interpret, since the naming is based on an optimized view of the Dataflow graph rather than the pipeline graph that your code and the UI present, and it uses identifiers like FX.
It might be best to take all the data-watermarks and grab the minimum value, which would show the oldest timestamp for elements not yet fully processed by the pipeline.
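If you prefer calling the API directly rather than shelling out to gcloud, a hedged sketch with the generated Java client (google-api-services-dataflow) might look like this; it assumes an already-authenticated Dataflow client, and the project, region, and job ID are placeholders:

import java.io.IOException;

import com.google.api.services.dataflow.Dataflow;
import com.google.api.services.dataflow.model.JobMetrics;
import com.google.api.services.dataflow.model.MetricUpdate;

public class WatermarkFetcher {
  // 'dataflow' is an authenticated client built with the usual
  // google-api-services-dataflow setup (transport, JSON factory, credentials).
  static void printWatermarks(Dataflow dataflow, String project, String region,
      String jobId) throws IOException {
    JobMetrics metrics = dataflow.projects().locations().jobs()
        .getMetrics(project, region, jobId)
        .execute();
    if (metrics.getMetrics() == null) {
      return;
    }
    for (MetricUpdate metric : metrics.getMetrics()) {
      String name = metric.getName().getName();
      if (name.endsWith("data-watermark")) {
        // The scalar is a timestamp; taking the minimum across all such
        // metrics gives the oldest element not yet fully processed.
        System.out.println(name + " = " + metric.getScalar());
      }
    }
  }
}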
What information are you looking for in particular?
See:
https://cloud.google.com/sdk/gcloud/reference/beta/dataflow/
Not sure whether this is the right place to ask, but I am currently trying to run a Dataflow job that will partition a data source into multiple chunks in multiple places. However, I feel that if I try to write to too many tables at once in one job, the Dataflow job is more likely to fail on an HTTP transport exception error, and I assume there is some bound on how much I/O, in terms of sources and sinks, I can wrap into one job?
To avoid this scenario, the best solution I can think of is to split this one job into multiple Dataflow jobs, but that would mean I need to process the same data source multiple times (once per Dataflow job). That is okay for now, but ideally I would like to avoid it if my data source grows huge later.
Therefore I am wondering whether there is any rule of thumb for how many data sources and sinks I can group into one job? And is there any better solution for my use case?
From the Dataflow service description of structuring user code:
The Dataflow service is fault-tolerant, and may retry your code multiple times in the case of worker issues. The Dataflow service may create backup copies of your code, and can have issues with manual side effects (such as if your code relies upon or creates temporary files with non-unique names).
In general, Dataflow should be relatively resilient. You can Partition your data based on the location you would like it output. The writes to these output locations will be automatically divided into bundles, and any bundle which fails to get written will be retried.
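For instance, here is a rough sketch of the Partition approach with the Beam Java SDK, assuming the set of output locations is known up front; the bucket path and the routing rule are placeholders:

import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.transforms.Partition;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionList;

public class PartitionedWrite {
  static void writePartitioned(PCollection<String> records, int numOutputs) {
    // Route each record to a partition; the rule used here (a hash of the
    // record) is just a stand-in for your own routing logic.
    PCollectionList<String> parts = records.apply(
        Partition.of(numOutputs, new Partition.PartitionFn<String>() {
          @Override
          public int partitionFor(String record, int numPartitions) {
            return (record.hashCode() & Integer.MAX_VALUE) % numPartitions;
          }
        }));

    // One sink per partition, all within the same job; bundles that fail to
    // write are retried by the service.
    for (int i = 0; i < numOutputs; i++) {
      parts.get(i).apply("WritePart" + i,
          TextIO.write().to("gs://my-bucket/output/part-" + i + "/records"));
    }
  }
}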
If the location you want to write to is not already supported you can look at writing a custom sink. The docs there describe how to do so in a way that is fault tolerant.
There is a bound on how many sources and sinks you can have in a single job. Do you have any details on how many you expect to use? If it exceeds the limit, there are also ways to use a single custom sink instead of several sinks, depending on your needs.
If you have more questions, feel free to comment. In addition to knowing more about what you're looking to do, it would help to know if you're planning on running this as a Batch or Streaming job.
Our solution to this was to write a custom GCS sink that supports partitions. Though with the responses I got I'm unsure whether that was the right thing to do or not. Writing Output of a Dataflow Pipeline to a Partitioned Destination
I am using Aggregators to log some runtime stats of a Dataflow job, and I want to flush them to either GCS or BQ when the pipeline completes (or when each transform completes).
Currently I am doing it by, in addition to using the Aggregators, also creating a side output with a TupleTag at the same time and flushing that side output PCollection.
However, I am wondering whether there might be any other handy way to flush the Aggregators themselves directly?
Your method of using a side output PCollection should produce semantically equivalent results to using an Aggregator. (For example, both Aggregators and side outputs will not include duplicate values when a bundle fails and has to be retried.) The main difference is that partial results for Aggregators are available during pipeline execution in the monitoring UI and programmatically.
Within Java, you can use PipelineResult.getAggregatorValues(). If you get the PipelineResult from the non-blocking DataflowPipelineRunner, that will let you query aggregators as the job runs. If you use the BlockingDataflowPipelineRunner, Pipeline.run() blocks and you won't get the PipelineResult until after the job completes.
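For reference, here is a rough sketch of that pattern, assuming the Dataflow Java SDK 1.x API; the input/output paths, the DoFn, and the aggregator name are placeholders:

import java.util.Map;

import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.PipelineResult;
import com.google.cloud.dataflow.sdk.io.TextIO;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.transforms.Aggregator;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.ParDo;
import com.google.cloud.dataflow.sdk.transforms.Sum;

public class QueryAggregators {

  // A DoFn that counts processed elements with an Aggregator.
  static class CountingFn extends DoFn<String, String> {
    final Aggregator<Long, Long> processed =
        createAggregator("processed", new Sum.SumLongFn());

    @Override
    public void processElement(ProcessContext c) {
      processed.addValue(1L);
      c.output(c.element());
    }
  }

  public static void main(String[] args) throws Exception {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
    CountingFn fn = new CountingFn();

    p.apply(TextIO.Read.from("gs://my-bucket/input/*"))      // placeholder input
     .apply(ParDo.of(fn))
     .apply(TextIO.Write.to("gs://my-bucket/output/out"));   // placeholder output

    PipelineResult result = p.run();

    // With the non-blocking runner this can be polled while the job runs;
    // with the blocking runner it is only reached after the job finishes.
    Map<String, Long> perStep =
        result.getAggregatorValues(fn.processed).getValuesAtSteps();
    perStep.forEach((step, value) -> System.out.println(step + " -> " + value));
  }
}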
There's also command-line support: gcloud alpha dataflow metrics tail JOB_ID