How to send a signal to all the PTransform objects running on Dataflow? - google-cloud-dataflow

I'm implementing an Apache Beam application running on Dataflow that consumes data from Cloud Pub/Sub, transforms its format, and sends the results to another Cloud Pub/Sub topic. It loads definitions of the streaming data, which describe the names and types of the keys and how each record should be transformed. The definitions are stored in GCS and loaded when the application starts.
My question is how to update the definitions and notify each PTransform object running on Dataflow of the change. Is it possible to do that online, or do we have to drain and recreate the Dataflow job?
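For reference, a minimal sketch of the pipeline shape described in the question: read from a Pub/Sub subscription, reformat each message according to definitions loaded when the worker starts, and publish to another topic. The subscription and topic names and the definition-loading logic are placeholders, not the actual application.

```java
import java.util.Collections;
import java.util.Map;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.StreamingOptions;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;

public class ReformatPipeline {

  /** Reformats each message using definitions loaded once per worker at startup. */
  static class ReformatFn extends DoFn<String, String> {
    private transient Map<String, String> definitions;

    @Setup
    public void setup() {
      // Placeholder: the real application would download and parse the
      // definition file from GCS here. Because this runs only when a worker
      // (re)starts, editing the file later does not affect running workers,
      // which is exactly the limitation the question asks about.
      definitions = Collections.singletonMap("old_key", "new_key");
    }

    @ProcessElement
    public void processElement(ProcessContext c) {
      // Placeholder transformation: rename keys according to the definitions.
      String out = c.element();
      for (Map.Entry<String, String> e : definitions.entrySet()) {
        out = out.replace("\"" + e.getKey() + "\"", "\"" + e.getValue() + "\"");
      }
      c.output(out);
    }
  }

  public static void main(String[] args) {
    StreamingOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(StreamingOptions.class);
    options.setStreaming(true);

    Pipeline p = Pipeline.create(options);
    p.apply("ReadFromPubSub",
            PubsubIO.readStrings().fromSubscription("projects/my-project/subscriptions/input-sub"))
        .apply("Reformat", ParDo.of(new ReformatFn()))
        .apply("WriteToPubSub",
            PubsubIO.writeStrings().to("projects/my-project/topics/output-topic"));
    p.run();
  }
}
```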

Related

How does apache beam access bigtable data?

If BigtableIO.Read is run in Dataflow, is the data accessed via a Bigtable node, or does it go directly to the Bigtable tablets?
The Bigtable architecture documentation states:
client requests go through a front-end server before they are sent to a Cloud Bigtable node
and goes on to say:
A Cloud Bigtable table is sharded into blocks of contiguous rows, called tablets to help balance the workload of queries... Tablets are stored on Colossus, Google's file system, in SSTable format
(The concern is whether, if a Dataflow job runs at the same time as users are making individual requests that definitely go through the nodes, there will be a small or large amount of contention from the Dataflow job. I would guess that if the Dataflow job went through the nodes there would be significantly more contention than if it hit the tablets directly.)
The Beam Bigtable connector uses Cloud Bigtable's public API, so requests go through the Bigtable front-end server nodes.
See here for a bit more detail on the Beam connector's use of the Bigtable client API.
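For context, a minimal sketch of what such a BigtableIO.Read looks like in a Beam pipeline; the project, instance and table ids are placeholders. Per the answer above, this read still goes through the Bigtable front-end/node layer rather than reaching tablets directly.

```java
import com.google.bigtable.v2.Row;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigtable.BigtableIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.values.PCollection;

public class BigtableReadExample {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Each row arrives as a com.google.bigtable.v2.Row protobuf.
    PCollection<Row> rows =
        p.apply("ReadFromBigtable",
            BigtableIO.read()
                .withProjectId("my-project")
                .withInstanceId("my-instance")
                .withTableId("my-table"));

    // Downstream transforms would process the Bigtable rows here.
    p.run();
  }
}
```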

What is the benefit of using google cloud pub/sub service in a streaming pipeline

Can anyone explain the benefit of adopting the Google Cloud Pub/Sub service in a streaming pipeline?
I saw one of the showcased event-streaming pipeline examples, and it used Pub/Sub to ingest the event data before connecting to the Google Cloud Dataflow service to transform it. Why not connect to the event data directly through Dataflow?
Thanks.
Dataflow needs a source to get the data from. In a streaming pipeline you can use different sources, and each one has its own characteristics that may or may not fit your scenario.
With Pub/Sub you can easily publish events to a topic using a client library or the API directly, and it guarantees at-least-once delivery of each message.
When you connect it to a Dataflow streaming pipeline, you get a resilient architecture (Pub/Sub keeps redelivering a message until Dataflow acknowledges that it has processed it) and near real-time processing. In addition, Dataflow can use Pub/Sub metrics to scale up or down depending on the number of messages in the backlog.
Finally, the Dataflow runner uses an optimized version of the PubsubIO connector that provides additional features. I suggest checking this documentation, which describes some of them.
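For illustration, a minimal sketch of publishing an event to a topic with the Pub/Sub Java client library, as mentioned above; the project id, topic name and payload are placeholders.

```java
import com.google.api.core.ApiFuture;
import com.google.cloud.pubsub.v1.Publisher;
import com.google.protobuf.ByteString;
import com.google.pubsub.v1.PubsubMessage;
import com.google.pubsub.v1.TopicName;

public class PublishEvent {
  public static void main(String[] args) throws Exception {
    TopicName topic = TopicName.of("my-project", "events-topic");
    Publisher publisher = Publisher.newBuilder(topic).build();
    try {
      PubsubMessage message = PubsubMessage.newBuilder()
          .setData(ByteString.copyFromUtf8("{\"event\":\"signup\",\"user\":\"123\"}"))
          .build();
      // publish() returns a future that resolves to the server-assigned message id.
      ApiFuture<String> messageId = publisher.publish(message);
      System.out.println("Published message id: " + messageId.get());
    } finally {
      publisher.shutdown();
    }
  }
}
```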

How does Apache Beam handle multiple windows

We are using the Beam 2.2 Java SDK with the Google Dataflow runner.
We receive batches of 4 hours' worth of data in Pub/Sub (which does not carry timestamps), which we need to throttle, processing the resulting windows one by one because each of them requires some state information. We can assign timestamps to this data and create windows, but this may result in multiple windows being prepared and emitted at the same time. Does Beam process these windows one by one, or do we need to ensure this explicitly?
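To make the setup concrete, here is a rough sketch (assuming the Beam 2.2 Java SDK; the subscription name and timestamp extraction are placeholders) of the timestamp assignment and fixed windowing the question describes. It only illustrates the window assignment step; it does not by itself control the order in which windows are emitted or processed.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.StreamingOptions;
import org.apache.beam.sdk.transforms.WithTimestamps;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;
import org.joda.time.Instant;

public class WindowedBatches {
  public static void main(String[] args) {
    StreamingOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(StreamingOptions.class);
    options.setStreaming(true);
    Pipeline p = Pipeline.create(options);

    PCollection<String> records =
        p.apply("ReadBatch",
            PubsubIO.readStrings()
                .fromSubscription("projects/my-project/subscriptions/batch-sub"));

    records
        // Placeholder: assume each record starts with an epoch-millis field.
        .apply("AssignEventTimestamps",
            WithTimestamps.of((String r) -> new Instant(Long.parseLong(r.split(",")[0])))
                // Needed (though deprecated) because the assigned event times
                // lie hours behind the Pub/Sub publish-time timestamps.
                .withAllowedTimestampSkew(Duration.standardHours(5)))
        // Split the 4-hour batch into hourly event-time windows.
        .apply("HourlyWindows",
            Window.<String>into(FixedWindows.of(Duration.standardHours(1))));

    p.run();
  }
}
```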

Can I make a custom source & sink from a local server (files or DBs) to Dataflow directly?

I would like to build a custom source & sink from a local server (files or databases) to Dataflow directly, so I wonder whether that is possible.
If it's possible, what should I be careful about when building it?
FYI, I have never built a custom source & sink.
But I have used GCS and Dataflow before.
Dataflow's custom IO framework can read from arbitrary sources and write to arbitrary sinks. You can certainly write connectors to various types of files and databases.
However, when executing pipelines on a remote service such as Google Cloud Dataflow, depending on several factors, the workers may not be able to reach services running on your local machine. Moreover, such local services may not scale well enough to support a performant data-processing pipeline.
Thus, it might be better to move data to a cloud-based service, like Google Cloud Storage or Google BigQuery.
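For example, a minimal sketch of that approach, reading staged files from Google Cloud Storage and writing them to BigQuery with the built-in TextIO and BigQueryIO connectors; the bucket, project, dataset, table names and schema are placeholders.

```java
import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import java.util.Arrays;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.TypeDescriptor;

public class GcsToBigQuery {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Placeholder schema: a single string column holding each input line.
    TableSchema schema = new TableSchema().setFields(Arrays.asList(
        new TableFieldSchema().setName("line").setType("STRING")));

    p.apply("ReadFromGCS", TextIO.read().from("gs://my-bucket/input/*.csv"))
        .apply("ToTableRow",
            MapElements.into(TypeDescriptor.of(TableRow.class))
                .via(line -> new TableRow().set("line", line)))
        .apply("WriteToBigQuery",
            BigQueryIO.writeTableRows()
                .to("my-project:my_dataset.my_table")
                .withSchema(schema)
                .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
                .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED));

    p.run();
  }
}
```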

Apache Samza local storage - OrientDB / Neo4J graph instead of KV store

Apache Samza uses RocksDB as the storage engine for local storage. This allows for stateful stream processing and here's a very good overview.
My use case:
I have multiple streams of events that I wish to process taken from a system such as Apache Kafka.
These events create state - the state I wish to track is based on previous messages received.
I wish to generate new stream events based on the calculated state.
The input stream events are highly connected, and a graph database such as OrientDB / Neo4j would be the ideal medium for querying the data to create the new stream events.
My question:
Is it possible to use a non-KV store as the local storage for Samza? Has anyone ever done this with OrientDB / Neo4J and is anyone aware of an example?
I've been evaluating Samza and I'm by no means an expert, but I'd recommend you read the official documentation, and even read through the source code; other than the fact that it's in Scala, it's remarkably approachable.
In this particular case, toward the bottom of the documentation's page on State Management you have this:
Other storage engines
Samza’s fault-tolerance mechanism (sending a local store’s writes to a replicated changelog) is completely decoupled from the storage engine’s data structures and query APIs. While a key-value storage engine is good for general-purpose processing, you can easily add your own storage engines for other types of queries by implementing the StorageEngine interface. Samza’s model is especially amenable to embedded storage engines, which run as a library in the same process as the stream task.
Some ideas for other storage engines that could be useful: a persistent heap (for running top-N queries), approximate algorithms such as bloom filters and hyperloglog, or full-text indexes such as Lucene. (Patches accepted!)
I actually read through the code for the default StorageEngine implementation about two weeks ago to gain a better sense of how it works. I definitely don't know enough to say much intelligently about it, but I can point you at it:
https://github.com/apache/samza/tree/master/samza-kv-rocksdb/src/main/scala/org/apache/samza/storage/kv
https://github.com/apache/samza/tree/master/samza-kv/src/main/scala/org/apache/samza/storage/kv
The major implementation concerns seem to be:
Logging all changes to a topic so that the store's state can be restored if a task fails.
Restoring the store's state in a performant manner.
Batching writes and caching frequent reads in order to save on trips to the raw store.
Reporting metrics about the use of the store.
Do the input stream events define one global graph, or a separate graph for each matching Kafka/Samza partition? That matters because Samza state is local, not global.
If it's one global graph, you can update/query a separate graph system from the Samza task's process method. Titan on Cassandra would be one such graph system.
If it's multiple separate graphs, you can use the current RocksDB KV store to mimic graph-database operations. Titan on Cassandra does just that: it uses Cassandra's KV store to store and query the graph. Graphs are stored either as an adjacency matrix (set [i, j] to 1 if nodes i and j are connected) or as an edge list. For each node, use the node id as the key and store its set of neighbors as the value, as in the sketch below.
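A rough sketch of that edge-list idea using Samza's older (pre-1.0) task API and its KeyValueStore: each node id is a key in the local RocksDB-backed store and the value is its set of neighbors. The store name, message format, and omitted serde configuration are assumptions, not from the question.

```java
import java.util.HashSet;
import java.util.Set;
import org.apache.samza.config.Config;
import org.apache.samza.storage.kv.KeyValueStore;
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.task.InitableTask;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskContext;
import org.apache.samza.task.TaskCoordinator;

public class GraphStateTask implements StreamTask, InitableTask {

  // Local, changelog-backed store: node id -> set of neighbor ids.
  private KeyValueStore<String, Set<String>> graphStore;

  @Override
  @SuppressWarnings("unchecked")
  public void init(Config config, TaskContext context) {
    // "graph-store" must be declared in the job config with a RocksDB
    // factory and serdes for the key and value types (omitted here).
    graphStore = (KeyValueStore<String, Set<String>>) context.getStore("graph-store");
  }

  @Override
  public void process(IncomingMessageEnvelope envelope,
                      MessageCollector collector,
                      TaskCoordinator coordinator) {
    // Placeholder message format: "sourceNodeId targetNodeId".
    String[] edge = ((String) envelope.getMessage()).split(" ");
    String source = edge[0];
    String target = edge[1];

    // Append the new neighbor to the source node's adjacency set.
    Set<String> neighbors = graphStore.get(source);
    if (neighbors == null) {
      neighbors = new HashSet<>();
    }
    neighbors.add(target);
    graphStore.put(source, neighbors);

    // New stream events derived from the updated local graph state would be
    // emitted here via the MessageCollector.
  }
}
```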
