Beam.io.WriteToPubSub throws error "The given pcoll PDone[WriteToPubSub/Write/NativeWrite.None] is not a dict, an iterable or a PCollection" - google-cloud-dataflow

I'm getting an error whenever I use "WriteToPubSub". The code below is me trying to debug the issue. My actual code is trying to take data from failures of WriteToBigQuery in order to push it to a deadletter pubsub topic. But when I tried to do that I kept encountering the error below.
I am running Apache Beam 2.27, Python 3.8
import apache_beam as beam
from apache_beam.runners.interactive.interactive_runner import InteractiveRunner
from apache_beam.io.gcp.bigtableio import WriteToBigTable
from apache_beam.runners import DataflowRunner
import apache_beam.runners.interactive.interactive_beam as ib
from apache_beam.options import pipeline_options
from apache_beam.options.pipeline_options import GoogleCloudOptions
import google.auth
import json
import pytz
# Setting up the Apache Beam pipeline options.
options = pipeline_options.PipelineOptions(flags=[])
# Sets the project to the default project in your current Google Cloud environment.
_, options.view_as(GoogleCloudOptions).project = google.auth.default()
# Sets the Google Cloud Region in which Cloud Dataflow runs.
options.view_as(GoogleCloudOptions).region = 'asia-east1'
# Sets the job name
options.view_as(GoogleCloudOptions).job_name = 'data_ingest'
# IMPORTANT! Adjust the following to choose a Cloud Storage location.
dataflow_gcs_location = '[REDACTED]'
# Dataflow Staging Location. This location is used to stage the Dataflow Pipeline and SDK binary.
options.view_as(GoogleCloudOptions).staging_location = '%s/staging' % dataflow_gcs_location
# Dataflow Temp Location. This location is used to store temporary files or intermediate results before finally outputting to the sink.
options.view_as(GoogleCloudOptions).temp_location = '%s/temp' % dataflow_gcs_location
# The directory to store the output files of the job.
output_gcs_location = '%s/output' % dataflow_gcs_location
ib.options.recording_duration = '1m'
# The Google Cloud PubSub topic for this example.
topic = "[REDACTED]"
output_topic = "[REDACTED]"
subscription = "[REDACTED]"
deadletter_topic = "[REDACTED]"
class PrintValue(beam.DoFn):
    def process(self, element):
        print(element)
        return [element]
p = beam.Pipeline(InteractiveRunner(),options=options)
data = p | beam.io.ReadFromPubSub(topic=topic) | beam.ParDo(PrintValue()) | beam.io.WriteToPubSub(topic=deadletter_topic)
ib.show(data, include_window_info=False)
The error given is
ValueError: The given pcoll PDone[WriteToPubSub/Write/NativeWrite.None] is not a dict, an iterable or a PCollection.
Can someone spot what the problem is?
No matter what I do, WriteToPubSub says it's receiving PDone.
EDIT:
If I use p.run(), I get the following error instead:
'PDone' object has no attribute 'to_runner_api'
In both cases, the pipeline does not try to run, it immediately errors out.
EDIT:
I've realised the problem
p = beam.Pipeline(InteractiveRunner(),options=options)
It is this line. If I remove the InteractiveRunner, everything works. I'm not sure why.

Beam Terminology
Apache Beam has a few base concepts that we should adhere to when using this programming model.
Pipeline
In simple terms, a pipeline is a series of tasks for a desired output. It can be as simple as a linear flow or could have a complex branching of tasks. The fundamental concept is read from input source(s), perform some transformations and emit to output(s).
Mathematically, a Beam pipeline is just a directed acyclic graph (DAG) of tasks.
PCollection
In simple terms, a PCollection is an immutable bag of elements that can be distributed across machines. Each step in a Beam pipeline has a PCollection as its input and output (apart from sources and sinks).
PCollection is a powerful distributed data structure that a beam pipeline operates on. It could be bounded or unbounded based on your source type.
PTransforms
In simple terms, transforms are the operations of your pipeline. A transform provides processing logic, and this logic is applied to each element of one or more input PCollections.
Example : PTransform<PCollection<X>,PCollection<Y>> will transform X to Y.
Based on processing paradigm, beam provides us multiple core transforms - ParDo, GroupByKey, Flatten, Combine etc.
I/O Transforms
When you create a pipeline, you need a data source to read from, such as a file or a database. Likewise, you want to emit your result data to an external storage system, such as a topic or an object store. The transforms that deal with external input and output are I/O transforms.
Usually for an external source, you will have the following
Source : A PTransform that reads data from an external system (like a file or a database). It expects a PBegin (pipeline entry point) and returns a PCollection.
PTransform<PBegin,PCollection>
This would be one of the entry points of your pipeline.
Sink : A PTransform that writes data to an external system (like a topic or storage). It expects a PCollection and returns a PDone (pipeline exit point).
PTransform<PCollection,PDone>
This would be one of the exit points of your pipeline.
A combination of a source and a sink is an I/O connector, like RedisIO, PubSubIO etc. Beam provides multiple built-in connectors, and one can also write a custom one.
There are various further concepts and extensions of the above that allow users to program complex requirements that can run on different runners. This is what makes Beam so powerful.
Solution
In your case, ib.show(data, include_window_info=False) is throwing the below error
ValueError: The given pcoll PDone[WriteToPubSub/Write/NativeWrite.None] is not a dict, an iterable or a PCollection.
Source Code
This happens because data holds the result of beam.io.WriteToPubSub(topic=deadletter_topic), which is a sink and returns a PDone, not a PCollection.
For your use case of publishing BQ write failures to PubSub, you could do something like the following:
data = (p
        | beam.io.ReadFromPubSub(topic=topic)
        | 'Write to BQ' >> beam.io.WriteToBigQuery( ...))
# Note: WriteToPubSub expects bytes, so encode the failed rows before publishing.
(data[beam.io.gcp.bigquery.BigQueryWriteFn.FAILED_ROWS]
 | 'publish failed' >> beam.io.WriteToPubSub(topic=deadletter_topic))
However, if this does not solve your issue, posting the code would be useful. Otherwise, you could write a custom PTransform with output tags that writes to BQ and returns the failures (via tuple tags) for publishing to PubSub.
P.S. : WriteToBigQuery is not a sink, but a custom PTransform that writes to BigQuery and returns failures.
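One detail the snippet above leaves open is the element type: WriteToPubSub publishes bytes, while the failed-rows output carries Python objects. The helper below is a minimal sketch of an encoding step, assuming each failed element is a (destination, row) tuple with a dict row. The name encode_failed_row and the JSON layout are illustrative choices, not part of Beam.

```python
import json


def encode_failed_row(failed_element):
    """Turn one failed-row element into UTF-8 JSON bytes for Pub/Sub.

    Assumption: the element is a (destination, row) tuple and row is a dict.
    The payload layout is a hypothetical choice for this sketch.
    """
    destination, row = failed_element
    payload = {"destination": str(destination), "row": row}
    # default=str keeps values such as datetimes from raising a TypeError.
    return json.dumps(payload, default=str, sort_keys=True).encode("utf-8")


# Hypothetical wiring inside the pipeline (shown as a comment, not executed):
#   (data[beam.io.gcp.bigquery.BigQueryWriteFn.FAILED_ROWS]
#    | 'encode failed' >> beam.Map(encode_failed_row)
#    | 'publish failed' >> beam.io.WriteToPubSub(topic=deadletter_topic))
```

Keeping the encoding in a plain function makes it easy to unit test without running a pipeline.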

Related

How to directly access/use Tensorflow Extended StatisticsGen statistics?

I'm experimenting with TFX for common ML pipeline work. I'm struggling somewhat to actually use the StatisticsGen component to inspect and analyze data statistics.
While in case of TFDV I can access statistics in a straightforward manner:
import tensorflow_data_validation as tfdv
stats = tfdv.generate_statistics_from_csv('data.csv', delimiter=',')
stats # This gives a JSON-like output
in case of TFX itself, StatisticsGen generates a binary FeatureStats.pb file in artifacts/StatisticsGen/statistics/...
How can I extract the actual statistics from StatisticsGen to use them for checking data (or any other purpose)? I'm aware that the interactive context can visualize stats, but this is unhelpful in a production environment.

Apache Beam - Parallelize Google Cloud Storage Blob Downloads While Maintaining Grouping of Blobs

I’d like to be able to maintain a grouping of entities within a single PCollection element, but parallelize the fetching of those entities from Google Cloud Storage (GCS), i.e. PCollection<Iterable<String>> --> PCollection<Iterable<String>> where the starting PCollection is an Iterable of file paths and the resulting PCollection is an Iterable of file contents. Alternatively, PCollection<String> --> PCollection<Iterable<String>> would also work and perhaps even be preferable, where the starting PCollection is a glob pattern, and the resulting PCollection is an iterable of the contents of the files which matched the glob.
My use-case is that at a point in my pipeline I have as input PCollection<String>. Each element of the PCollection is a GCS glob pattern. It’s important that files which match the glob be grouped together because the content of the files–once all files in a group are read–needs to be grouped downstream in the pipeline. I originally tried using FileIO.matchAll and a subsequent GroupByKey. However, the matchAll, window, and GroupByKey combination lacked any guarantee that all files matching the glob would be read and in the same window before performing the GroupByKey transform (though I may be misunderstanding Windowing). It’s possible to achieve the desired results if a large time span WindowFn is applied, but it’s still probabilistic rather than a guarantee that all files will be read before grouping. It’s also the main goal of my pipeline to maintain the lowest possible latency.
So my next, and currently operational, plan was to use an AsyncHttpClient to fan out fetching file contents via GCS HTTP API. I feel like this goes against the grain in Beam and is likely sub-optimal in terms of parallelization.
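For illustration, the fan-out described above can be sketched in Python with a thread pool. This is not the poster's Java/AsyncHttpClient code: fetch_group and fetch_one are hypothetical names, and fetch_one stands in for whatever call actually downloads one object from GCS.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_group(paths, fetch_one, max_workers=8):
    """Fetch every path of one group concurrently, keeping the group together.

    paths: the file paths matched by a single glob (one group).
    fetch_one: hypothetical callable that downloads one path's contents.
    Returns the contents as a list in the same order as paths.
    """
    if not paths:
        return []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, so the grouping and
        # ordering survive even though downloads overlap in time.
        return list(pool.map(fetch_one, paths))
```

The parallelism here lives inside a single element's processing, which is why it sidesteps windowing but also hides the work from the runner's own scheduling.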
So I’ve started investigating SplittableDoFn. My current plan is to allow splitting such that each entity in the input Iterable (i.e. each matched file from the glob pattern) could be processed separately. I've been able to modify FileIO#MatchFn (defined here in the Java SDK) to provide mechanics for a PCollection<String> -> PCollection<Iterable<String>> transform between an input of GCS glob patterns and an output of an Iterable of matches for the glob.
The challenge I’ve encountered is: how do I go about grouping/gathering the split invocations back into a single output value in my DoFn? I’ve tried using stateful processing and using a BagState to collect file contents along the way, but I realized part way along that the ProcessElement method of a splittable DoFn may only accept ProcessContext and Restriction tuples and no other args, and therefore no StateId args referring to a StateSpec (it throws an invalid argument error at runtime).
I noticed in the FilePatternWatcher example in the official SDF proposal doc that a custom tracker was created wherein FilePath objects are kept in a set and presumably added to the set via tryClaim. This seems as though it could work for my use-case, but I don’t see/understand how to go about implementing a #SplitRestriction method using a custom RestrictionTracker.
I would be very appreciative if anyone were able to offer advice. I have no preference for any particular solution, only that I want to achieve the ability to maintain a grouping of entities within a single PCollection element, but parallelize the fetching of those entities from Google Cloud Storage (GCS).
Would joining the output PCollections help you?
PCollectionList
    .of(collectionOne)
    .and(collectionTwo)
    .and(collectionThree)
    ...
    .apply(Flatten.pCollections())

Apache Beam: Reading in PCollection as PBegin for a pipeline

I'm debugging this beam pipeline and my end goal is to write all of the strings in a PCollection to a text file.
I've set a breakpoint right after the PCollection I want to inspect is created, and what I've been trying to do is create a new Pipeline that
1. Reads in this output PCollection as the initial input
2. Prints it to a file (using TextIO.write().to("/Users/my/local/fp"))
I'm struggling with #1 of how to read in the PCollection as initial input.
The skeleton of what I've been trying:
Pipeline p2 = Pipeline.create();
p2.apply(/* READ IN THE PCOLLECTION HERE */)
  .apply(TextIO.write().to("/Users/my/local/fp"));
p2.run();
Any thoughts or suggestions would be appreciated
In order to read a pcollection into input, you need to read it from a source. I.e. some data stored in BigQuery, Google Cloud Storage, etc. There are specific source transforms you can use to read from each of these locations. Depending on where you have stored your data you will need to use the correct source and pass in the relevant parameters (i.e. the GCS path, BigQuery table)
Please take a look at the Minimal Word Count Example on the apache beam website (Full source on github). I suggest starting from this code and iterating on it until you build the pipeline you need.
In this example files are read from GCS
p.apply(TextIO.read().from("gs://apache-beam-samples/shakespeare/*"))
Please also see this guide on using IOs and also this list of beam IO transforms. If you just want a basic example working, you can use Create.of to read from variables in your program.

Slowness / Lag in beam streaming pipeline in group by key stage

Context
Hi all, I have been using Apache Beam pipelines to generate a columnar DB to store in GCS. I have a data stream coming in from Kafka, with a window of 1m.
I want to transform all data of that 1m window into a columnar DB file (ORC in my case, can be Parquet or anything else), I have written a pipeline for this transformation.
Problem
I am experiencing general slowness. I suspect it could be due to the group by key transformation, as I have only one key. Is there really a need to do that? If not, what should be done instead? I read that combine isn't very useful for this, as my pipeline isn't really aggregating the data but creating a merged file. What I need exactly is an iterable list of objects per window, which will be transformed to ORC files.
Pipeline Representation
input -> window -> group by key (only 1 key) -> pardo (to create DB) -> IO (to write to GCS)
What I have tried
I have tried using the profiler, scaling horizontally/vertically. Using the profiler I saw more than 50% of the time going into group by key operation. I do believe the problem is of hot keys but I am unable to find a solution on what should be done. When I removed the group by key operation, my pipeline keeps up with the Kafka lag (ie, it doesn't seem to be an issue at Kafka end).
Code Snippet
p.apply("ReadLines", KafkaIO.<Long, byte[]>read().withBootstrapServers("myserver.com:9092")
        .withTopic(options.getInputTopic())
        .withTimestampPolicyFactory(MyTimePolicy.myTimestampPolicyFactory())
        .withConsumerConfigUpdates(Map.of("group.id", "mygroup-id")).commitOffsetsInFinalize()
        .withKeyDeserializer(LongDeserializer.class)
        .withValueDeserializer(ByteArrayDeserializer.class).withoutMetadata())
    .apply("UncompressSnappy", ParDo.of(new UncompressSnappy()))
    .apply("DecodeProto", ParDo.of(new DecodePromProto()))
    .apply("MapTSSample", ParDo.of(new MapTSSample()))
    .apply(Window.<TSSample>into(FixedWindows.of(Duration.standardMinutes(1)))
        .withTimestampCombiner(TimestampCombiner.END_OF_WINDOW))
    .apply(WithKeys.<Integer, TSSample>of(1))
    .apply(GroupByKey.<Integer, TSSample>create())
    .apply("CreateTSORC", ParDo.of(new CreateTSORC()))
    .apply(new WriteOneFilePerWindow(options.getOutput(), 1));
Wall Time Profile
https://gist.github.com/anandsinghkunwar/4cc26f7e3da7473af66ce9a142a74c35
The problem indeed seems to be a hot key issue. I had to change my pipeline to create a custom IO for ORC files and bump the number of shards up to 50 for my case. I removed the GroupByKey entirely. Since Beam doesn't yet automatically determine the number of shards for FileIO.write(), you'll have to manually choose a number that suits your workload.
Also, enabling streaming engine API in Google Dataflow sped up the ingestion even more.
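The custom IO from this answer isn't shown, so the following is only a plain-Python sketch of the underlying idea: instead of sending every element to one constant key, spread elements over a fixed number of shards so the work can be split. The names shard_key and group_by_shard are hypothetical, and the dict-based grouping is a local stand-in for what the runner would do per window.

```python
import zlib

NUM_SHARDS = 50  # the value reported in the answer; tune for your workload


def shard_key(element_id, num_shards=NUM_SHARDS):
    """Map an element identifier to a stable shard in [0, num_shards)."""
    # crc32 is deterministic across processes, unlike the built-in hash()
    # for strings, which is randomized per interpreter run.
    return zlib.crc32(str(element_id).encode("utf-8")) % num_shards


def group_by_shard(element_ids, num_shards=NUM_SHARDS):
    """Local stand-in for keying and grouping: shard -> list of elements."""
    groups = {}
    for element_id in element_ids:
        groups.setdefault(shard_key(element_id, num_shards), []).append(element_id)
    return groups
```

With one key, a single worker has to handle the whole window. With N shards, up to N workers can each build one output file per window, at the cost of producing N files instead of one.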

large numpy matrix as dataflow side input

I'm trying to write a Dataflow pipeline in Python that requires a large numpy matrix as a side input. The matrix is saved in cloud storage. Ideally, each Dataflow worker would load the matrix directly from cloud storage.
My understanding is that if I say matrix = np.load(LOCAL_PATH_TO_MATRIX), and then
p | "computation" >> beam.Map(computation, matrix)
the matrix gets shipped from my laptop to each Dataflow worker.
How could I instead direct each worker to load the matrix directly from cloud storage? Is there a beam source for "binary blob"?
Your approach is correct.
What Dataflow does, in this case, is handle the NumPy matrix as a side input. This means that it's uploaded once from your machine to the service, and the Dataflow service will send it to each worker.
Given that the matrix is large, this will make your workers use I/O to receive it from the service, and carry the burden of keeping the whole matrix in memory, but it should work.
If you want to avoid computing/loading the matrix in your machine, you can upload your matrix to GCS as a text file, read that file in, and obtain the matrix. You can do something like so:
matrix_file = 'gs://mybucket/my/matrix'
p | beam.ParDo(ComputationDoFn(matrix_file))
And your DoFn could be something like:
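import io
import apache_beam as beam
import numpy as np
from apache_beam.io.filesystems import FileSystems

class ComputationDoFn(beam.DoFn):
    def __init__(self, matrix_file):
        self._matrix_file = matrix_file
        self._matrix = None

    def start_bundle(self):
        # We check because one DoFn instance may be reused
        # for different bundles.
        if self._matrix is None:
            self._matrix = self.load_matrix(self._matrix_file)

    def process(self, element):
        # Now process the element (placeholder: use self._matrix here)
        yield element

    def load_matrix(self, matrix_file):
        # Load the file from GCS. Assumes the matrix was saved with np.save.
        with FileSystems.open(matrix_file) as f:
            return np.load(io.BytesIO(f.read()))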
I hope this makes sense. I can flesh out the functions if you feel like you need some more help.
