Check data watermark at different steps via the Dataflow API - google-cloud-dataflow

In the Dataflow UI, I can check the data watermark at various steps of the job (e.g. at the GroupByKey step, the data watermark is 2017-05-24 (10:51:58)). Is it possible to access this data via the Dataflow API?

Yes, you can use the gcloud command line tool to access the API.
gcloud beta dataflow metrics list <job_id> --project=<project_name>
Look for metrics whose names end in data-watermark, for example:
F82-windmill-data-watermark
However, these names are not yet easy to interpret, since they are based on an optimized view of the Dataflow execution graph rather than the pipeline graph you see in your code and in the UI, and they use internal identifiers like FX.
It might be best to take all the data-watermarks and grab the minimum value, which would show the oldest timestamp for elements not yet fully processed by the pipeline.
What information are you looking for in particular?
See:
https://cloud.google.com/sdk/gcloud/reference/beta/dataflow/
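If you would rather call the API programmatically, the same metrics are available via the projects.jobs.getMetrics method. Below is a minimal sketch using the generated Java client (google-api-services-dataflow) that takes the minimum of all data-watermark values, as suggested above; the application name is a placeholder, and the assumption that the scalar values are microseconds since the epoch is worth verifying against your own job's output.

import java.util.Collections;

import com.google.api.client.googleapis.auth.oauth2.GoogleCredential;
import com.google.api.client.googleapis.javanet.GoogleNetHttpTransport;
import com.google.api.client.json.jackson2.JacksonFactory;
import com.google.api.services.dataflow.Dataflow;
import com.google.api.services.dataflow.model.JobMetrics;
import com.google.api.services.dataflow.model.MetricUpdate;

public class MinDataWatermark {
  public static void main(String[] args) throws Exception {
    String project = args[0];
    String jobId = args[1];

    Dataflow dataflow = new Dataflow.Builder(
            GoogleNetHttpTransport.newTrustedTransport(),
            JacksonFactory.getDefaultInstance(),
            GoogleCredential.getApplicationDefault()
                .createScoped(Collections.singleton(
                    "https://www.googleapis.com/auth/cloud-platform")))
        .setApplicationName("watermark-reader") // placeholder name
        .build();

    // Same data that `gcloud beta dataflow metrics list <job_id>` prints.
    JobMetrics metrics = dataflow.projects().jobs()
        .getMetrics(project, jobId)
        .execute();

    long minWatermarkMicros = Long.MAX_VALUE;
    if (metrics.getMetrics() != null) {
      for (MetricUpdate update : metrics.getMetrics()) {
        String name = update.getName().getName(); // e.g. "F82-windmill-data-watermark"
        if (name != null && name.endsWith("data-watermark") && update.getScalar() != null) {
          // Assumption: the scalar is the watermark as microseconds since the epoch.
          long micros = ((Number) update.getScalar()).longValue();
          minWatermarkMicros = Math.min(minWatermarkMicros, micros);
        }
      }
    }
    System.out.println("Oldest data watermark (micros since epoch): " + minWatermarkMicros);
  }
}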

Related

How to get lineage info of dataflow jobs?

I am new to Dataflow and am trying to get lineage information about any Dataflow job, for an app I am trying to build. I am trying to fetch at least the source and destination names from a job and, if possible, find out the transformations applied to the PCollections in the pipeline, something like a trace of the function calls.
I have been analyzing the logs for different kinds of jobs, but could not figure out a definite way to fetch any of the information I am looking for.
You should be able to get this information from the graph itself. One way to do this would be to implement your own runner which delegates to the Dataflow runner.
For Dataflow, you could also fetch the job (whose steps give the pipeline topology) from the service via the Dataflow API.
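As a rough illustration of the second option, here is a minimal sketch using the generated Java client (google-api-services-dataflow): fetch the job with the JOB_VIEW_ALL view and print its steps, which describe the topology. The application name is a placeholder, and exactly which step properties identify sources and sinks depends on how the service rendered your pipeline.

import java.util.Collections;

import com.google.api.client.googleapis.auth.oauth2.GoogleCredential;
import com.google.api.client.googleapis.javanet.GoogleNetHttpTransport;
import com.google.api.client.json.jackson2.JacksonFactory;
import com.google.api.services.dataflow.Dataflow;
import com.google.api.services.dataflow.model.Job;
import com.google.api.services.dataflow.model.Step;

public class JobTopology {
  public static void main(String[] args) throws Exception {
    String project = args[0];
    String jobId = args[1];

    Dataflow dataflow = new Dataflow.Builder(
            GoogleNetHttpTransport.newTrustedTransport(),
            JacksonFactory.getDefaultInstance(),
            GoogleCredential.getApplicationDefault()
                .createScoped(Collections.singleton(
                    "https://www.googleapis.com/auth/cloud-platform")))
        .setApplicationName("lineage-inspector") // placeholder name
        .build();

    // JOB_VIEW_ALL includes the step list, i.e. the job topology.
    Job job = dataflow.projects().jobs()
        .get(project, jobId)
        .setView("JOB_VIEW_ALL")
        .execute();

    for (Step step : job.getSteps()) {
      // Each step has a kind (e.g. ParallelRead, ParallelDo, ParallelWrite)
      // and a property map describing its inputs, outputs and user-visible name.
      System.out.println(step.getKind() + " : " + step.getName());
      System.out.println("  properties: " + step.getProperties());
    }
  }
}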

Is it possible to track Metrics within a sink such as FileIO?

Playing around with org.apache.beam.sdk.metrics, I was wondering the following: can you track metrics from an "obscure" stage (and by obscure I mean that the code of that stage is not yours) to catch failures, delays and the like, for example within the CassandraIO connector when inserts fail?
If so, how can I access that information?
So far I've been tracking metrics within my very own stages doing Metrics.counter("new_counter", "new_metric").inc(n) and similar.
The metrics need to be added manually to each connector, much like you already do with your own metrics (i.e. Metrics.counter(.....).inc(..)).
If the connector has metrics of its own, it will publish them. Unfortunately, it seems like that's not the case for CassandraIO. : (
If you are interested, I would invite you to submit a pull request to the Apache Beam repository to add metrics that you find interesting for CassandraIO or any other connector that you like.
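For reference, here is a minimal sketch of what that instrumentation looks like: a DoFn that counts failed writes the same way your own stages do, plus a query of that counter from the PipelineResult after the run. The my-connector namespace, the failed_writes name, and the WriteWithMetrics DoFn are illustrative, not anything that exists in CassandraIO today.

import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.metrics.Counter;
import org.apache.beam.sdk.metrics.MetricNameFilter;
import org.apache.beam.sdk.metrics.MetricQueryResults;
import org.apache.beam.sdk.metrics.MetricResult;
import org.apache.beam.sdk.metrics.Metrics;
import org.apache.beam.sdk.metrics.MetricsFilter;
import org.apache.beam.sdk.transforms.DoFn;

// Illustrative DoFn: this is the kind of instrumentation a connector would need
// to carry internally for its failures to show up as metrics.
class WriteWithMetrics extends DoFn<String, Void> {
  private final Counter failedWrites =
      Metrics.counter("my-connector", "failed_writes"); // hypothetical namespace/name

  @ProcessElement
  public void processElement(ProcessContext c) {
    try {
      // ... perform the external write for c.element() here ...
    } catch (Exception e) {
      failedWrites.inc(); // same pattern as Metrics.counter(...).inc(...) in your own stages
    }
  }
}

class QueryMetricsExample {
  // Query the counter back from the pipeline result after (or during) the run.
  static void printFailedWrites(PipelineResult result) {
    MetricQueryResults metrics = result.metrics().queryMetrics(
        MetricsFilter.builder()
            .addNameFilter(MetricNameFilter.named("my-connector", "failed_writes"))
            .build());
    for (MetricResult<Long> counter : metrics.getCounters()) {
      System.out.println("failed_writes (attempted): " + counter.getAttempted());
    }
  }
}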

See job graph size in Google Dataflow

I'm getting the following error message when attempting to run a pipeline:
The job graph is too large. Please try again with a smaller job graph, or split your job into two or more smaller jobs.
According to the docs, the limit is 10 MB. However, I would like to know how big the graph actually is, to make debugging easier.
Is there any way to see the size of the graph?
As mentioned in the comment, use the --dataflow_job_file option. Note that there's no need to specify a GCS path; you can write it out locally. You can also pass the --dry_run option to avoid actually submitting the job.
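The flags above are the Python SDK spellings; if you happen to be on the Java SDK, the corresponding debug option is, as far as I know, dataflowJobFile (worth verifying against your SDK version). A rough sketch with an illustrative local path:

import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.beam.runners.dataflow.options.DataflowPipelineDebugOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class GraphSizeCheck {
  public static void main(String[] args) throws Exception {
    // Pass --runner=DataflowRunner --project=... --tempLocation=... on the command line.
    DataflowPipelineDebugOptions options = PipelineOptionsFactory.fromArgs(args)
        .withValidation()
        .as(DataflowPipelineDebugOptions.class);
    // Ask the runner to dump the serialized job graph locally at submission time.
    options.setDataflowJobFile("/tmp/job-graph.json"); // illustrative path

    Pipeline pipeline = Pipeline.create(options);
    // ... build the pipeline as usual ...

    try {
      pipeline.run();
    } catch (RuntimeException e) {
      // Even if submission is rejected (e.g. "job graph is too large"),
      // the graph file has typically already been written at this point.
      System.out.println("Submission failed: " + e.getMessage());
    }

    // The size of this JSON file is roughly what is compared against the 10 MB limit.
    System.out.println("Job graph: "
        + Files.size(Paths.get("/tmp/job-graph.json")) + " bytes");
  }
}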

Simple inquiry about streaming data directly into Cloud SQL using Google DataFlow

So I am working on a little project that sets up a streaming pipeline using Google Dataflow and Apache Beam. I went through some tutorials and was able to get a pipeline up and running streaming into BigQuery, but I am going to want to stream it into a full relational DB (i.e. Cloud SQL). I have searched through this site and throughout Google, and it seems that the best route to achieve that would be to use JdbcIO. I am a bit confused here because when I look up info on how to do this, it all refers to writing to Cloud SQL in batches and not full-out streaming.
My simple question is: can I stream data directly into Cloud SQL, or would I have to send it via batch instead?
Cheers!
You should use JdbcIO - it does what you want, and it makes no assumption about whether its input PCollection is bounded or unbounded, so you can use it in any pipeline and with any Beam runner; the Dataflow Streaming Runner is no exception to that.
In case your question is prompted by reading its source code and seeing the word "batching": it simply means that, for efficiency, it writes multiple records per database call. The overloaded use of the word "batch" can be confusing, but here it just means avoiding the overhead of an expensive database call for every single record.
In practice, the number of records written per call is at most 1000 by default, but in general depends on how the particular runner chooses to execute this particular pipeline on this particular data at this particular moment, and can be less than that.
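For concreteness, here is a minimal sketch of JdbcIO.write() applied to an unbounded PCollection read from Pub/Sub. The subscription, the JDBC URL (using the Cloud SQL MySQL socket factory), the table, and the credentials are all placeholders, and the withBatchSize(1000) call only makes the default per-call batching explicit.

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.io.jdbc.JdbcIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class StreamToCloudSql {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply("ReadFromPubSub",
            PubsubIO.readStrings()
                .fromSubscription("projects/my-project/subscriptions/my-sub")) // placeholder
     .apply("WriteToCloudSql",
            JdbcIO.<String>write()
                .withDataSourceConfiguration(
                    JdbcIO.DataSourceConfiguration.create(
                            "com.mysql.jdbc.Driver",
                            // Placeholder Cloud SQL JDBC URL via the socket factory.
                            "jdbc:mysql://google/mydb?cloudSqlInstance=my-project:region:instance"
                                + "&socketFactory=com.google.cloud.sql.mysql.SocketFactory")
                        .withUsername("user")       // placeholder
                        .withPassword("password"))  // placeholder
                .withStatement("INSERT INTO events (payload) VALUES (?)")
                .withBatchSize(1000) // default: multiple records per database call
                .withPreparedStatementSetter(
                    (element, statement) -> statement.setString(1, element)));

    p.run();
  }
}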

Streaming Dataflow pipeline with no sink

We have a streaming Dataflow pipeline running on Google Cloud Dataflow workers, which needs to read from a PubSub subscription, group messages, and write them to BigQuery. The built-in BigQuery sink does not fit our needs as we need to target specific datasets and tables for each group. As custom sinks are not supported for streaming pipelines, it seems like the only solution is to perform the insert operations in a ParDo.
Is there any known issue with not having a sink in a pipeline, or anything to be aware of when writing this kind of pipeline?
There should not be any issues with writing a pipeline without a sink. In fact, a sink is a type of ParDo in streaming.
I recommend that you use a custom ParDo and use the BigQuery API with your custom logic. Here is the definition of the BigQuerySink; you can use that code as a starting point.
You can define your own DoFn similar to StreamingWriteFn to add your custom ParDo logic, which will write to the appropriate BigQuery dataset/table.
Note that this is using Reshuffle instead of GroupByKey. I recommend that you use Reshuffle, which will also group by key, but avoid unnecessary windowing delays. In this case it means that the elements should be written out as soon as they come in, without extra buffering/delay. Additionally, this allows you to determine BQ table names at runtime.
Edit: I do not recommend using the built-in BigQuerySink to write to different tables. The suggestion is to use the BigQuery API in your custom DoFn rather than the BigQuerySink.
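To make that concrete, here is a minimal sketch of the approach, assuming the upstream steps produce KV<"dataset.table", TableRow> pairs and using the google-cloud-bigquery client for the streaming inserts. The class names, the element shape, and the coder wiring are illustrative assumptions, and error handling/retries are left out.

import com.google.api.services.bigquery.model.TableRow;
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.InsertAllRequest;
import com.google.cloud.bigquery.InsertAllResponse;
import com.google.cloud.bigquery.TableId;
import org.apache.beam.sdk.coders.KvCoder;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.io.gcp.bigquery.TableRowJsonCoder;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.Reshuffle;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

// Sketch of the per-element write; assumes upstream steps have produced
// KV<"dataset.table", TableRow> pairs (the grouping/routing logic is yours).
class WriteToDynamicBigQueryTable extends DoFn<KV<String, TableRow>, Void> {
  private transient BigQuery bigquery;

  @Setup
  public void setup() {
    // One client per DoFn instance; uses application-default credentials.
    bigquery = BigQueryOptions.getDefaultInstance().getService();
  }

  @ProcessElement
  public void processElement(ProcessContext c) {
    String[] destination = c.element().getKey().split("\\.", 2); // "dataset.table"
    InsertAllResponse response = bigquery.insertAll(
        InsertAllRequest.newBuilder(TableId.of(destination[0], destination[1]))
            .addRow(c.element().getValue()) // TableRow is a Map<String, Object>
            .build());
    if (response.hasErrors()) {
      // Real code would retry or route failures to a dead-letter output.
      throw new RuntimeException("Insert failed: " + response.getInsertErrors());
    }
  }
}

// Wiring it up: Reshuffle (rather than GroupByKey) avoids extra windowing delay.
class DynamicBigQueryWriteExample {
  static void attach(PCollection<KV<String, TableRow>> routedRows) {
    routedRows
        .setCoder(KvCoder.of(StringUtf8Coder.of(), TableRowJsonCoder.of()))
        .apply(Reshuffle.<String, TableRow>of())
        .apply(ParDo.of(new WriteToDynamicBigQueryTable()));
  }
}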

Resources