What is the benefit of using google cloud pub/sub service in a streaming pipeline - google-cloud-dataflow

Can anyone explain the benefit of adopting the Google Cloud Pub/Sub service in a streaming pipeline?
I saw one of the event streaming pipeline examples showcased, and it was using Pub/Sub to ingest the event data before connecting to the Google Cloud Dataflow service to transform it. Why does it not connect to the event data directly through Dataflow?
Thanks.

Dataflow needs a source to get the data from. If you are using a streaming pipeline you can use different options as a source, and each of them has its own characteristics that may fit your scenario.
With Pub/Sub you can easily publish events to a topic using a client library or the API directly, and it guarantees at-least-once delivery of each message.
When you connect it to a Dataflow streaming pipeline, you get a resilient architecture (Pub/Sub keeps redelivering a message until Dataflow acknowledges that it has been processed) and near real-time processing. In addition, Dataflow can use Pub/Sub metrics to scale up or down depending on the number of messages in the backlog.
Finally, the Dataflow runner uses an optimized version of the PubsubIO connector which provides additional features. I suggest checking the documentation that describes some of these features.
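For illustration, here is a minimal sketch of such a streaming pipeline using the Beam Java SDK; the project, subscription and topic names are hypothetical placeholders.

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.StreamingOptions;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.TypeDescriptors;

public class PubSubToDataflowSketch {
  public static void main(String[] args) {
    StreamingOptions options =
        PipelineOptionsFactory.fromArgs(args).as(StreamingOptions.class);
    options.setStreaming(true);

    Pipeline pipeline = Pipeline.create(options);

    pipeline
        // Pub/Sub redelivers each message until the pipeline acknowledges it,
        // which is what makes the architecture resilient.
        .apply("ReadFromPubSub",
            PubsubIO.readStrings()
                .fromSubscription("projects/my-project/subscriptions/my-subscription"))
        // Any transformation step; upper-casing is just a stand-in.
        .apply("Transform",
            MapElements.into(TypeDescriptors.strings())
                .via((String msg) -> msg.toUpperCase()))
        // Publish the transformed events to another topic.
        .apply("WriteToPubSub",
            PubsubIO.writeStrings().to("projects/my-project/topics/my-output-topic"));

    pipeline.run();
  }
}

Running it with --runner=DataflowRunner gives you the optimized Pub/Sub integration mentioned above.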

Related

Is there a GCP equivalent to AWS SQS?

I'm curious to understand the implementation of GCP's Pub/Sub. Although the name Pub/Sub points to a publish-subscribe design pattern, it seems closer to AWS's SQS (a queue) than to AWS SNS (which uses the publish-subscribe model). Why I think this is: GCP's Pub/Sub
Allows up to 10,000 subscriptions per project.
Allows filtering on subscriptions.
Even allows ordering (beta), which should involve a FIFO queue somewhere.
Exposes a synchronous API for the request/response pattern.
It makes me wonder if subscriptions in Pub/Sub are merely SQS-style queues.
I would like your opinions on this comparison. The confusion comes from the lack of implementation details on Pub/Sub and the name, which suggests a particular design pattern.
Regards,
The division for messaging in GCP is along slightly different lines than what you may see in AWS. GCP breaks down messaging into three categories:
Torrents: Messaging pipelines that are designed to handle large amounts of throughput on pipes that are persistent. In other words, one creates a new pipeline rarely and sends messages over it for long periods of time. The scaling pattern for torrents is a relatively small number of pipelines transmitting a lot of data. For this category, Cloud Pub/Sub is the right product.
Trickles: Messaging pipelines that are largely ephemeral or require broadcast to a very large number of end-user devices. These pipelines have a low throughput but the number of pipelines can be extremely large. Firebase Cloud Messaging is the product that fits into this category.
Queues: Messaging pipelines where one has more control over the end-to-end message delivery. These pipelines are not really high throughput nor is the number of pipelines large, but more advanced properties are supported, e.g., the ability to delay or cancel the delivery of a message. Cloud Tasks fits in this category, though Cloud Pub/Sub is also adopting features that make it more and more viable for this use case.
So Cloud Pub/Sub is the publish/subscribe aspects of SQS+SNS, where SNS is used as a means to distribute messages to different SQS queues. It also serves as the big-data ingestion mechanism a la Kinesis. Firebase Cloud Messaging covers the portions of SNS designed to reach end user devices. Cloud Tasks (and Cloud Pub/Sub, more and more) provide functionality of a single queue in SQS.
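As a rough sketch of that SNS+SQS-style fan-out, here is how one topic with several subscriptions could be created with the Pub/Sub Java admin clients; the project, topic and subscription names are hypothetical placeholders.

import com.google.cloud.pubsub.v1.SubscriptionAdminClient;
import com.google.cloud.pubsub.v1.TopicAdminClient;
import com.google.pubsub.v1.PushConfig;

public class FanOutSketch {
  public static void main(String[] args) throws Exception {
    try (TopicAdminClient topicAdmin = TopicAdminClient.create();
         SubscriptionAdminClient subAdmin = SubscriptionAdminClient.create()) {
      // The topic plays the SNS role: publishers only know about the topic.
      topicAdmin.createTopic("projects/my-project/topics/orders");

      // Each subscription plays the SQS role: an independent queue that
      // buffers its own copy of every message until it is acknowledged.
      for (String sub : new String[] {"billing", "analytics"}) {
        subAdmin.createSubscription(
            "projects/my-project/subscriptions/" + sub,
            "projects/my-project/topics/orders",
            PushConfig.getDefaultInstance(), // default = pull delivery
            60);                             // ack deadline in seconds
      }
    }
  }
}

Each subscriber then pulls from its own subscription and acknowledges messages independently, which is why a single subscription feels very much like an SQS queue.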
You are correct to say that GCP Pub/Sub is close to AWS SQS. As far as I know, there is no exact SNS equivalent available in GCP, but I think the closest tool is GCM (Google Cloud Messaging). You are not the only one who has had this query:
AWS SNS equivalent in GCP stack

Read from Google PubSub and then read from Bigtable based on the PubSub message Topic

In Beam (Dataflow 2.0.0), I am reading from a Pub/Sub topic and then trying to fetch a few rows from Bigtable based on the message from the topic. I couldn't find a way in the Beam documentation to scan Bigtable based on the Pub/Sub messages. I tried to write a ParDo function and pipe it into the Beam pipeline, but in vain.
BigtableIO gives a read option, but that is applied at the pipeline level rather than per message, and I am not sure it would work in a streaming fashion for my use case.
Can anyone please let me know if this is doable, i.e. streaming from Pub/Sub and reading Bigtable based on the message content?
P.S.: I am using the Java API with Beam 2.0.
PCollection<String> keyLines =
    pipeline.apply(PubsubIO.readMessagesWithAttributes()
            .fromSubscription("*************"))
        .apply("PubSub Message to Payload as String",
            ParDo.of(new PubSubMessageToStringConverter()));
Now I want keyLines to act as the row keys to scan the Bigtable. I am using the below code snippet for BigtableIO. I can see RowFilter.newBuilder() and ByteKeyRange, but both of them seem to work in batch mode, not in a streaming fashion.
pipeline.apply("read",
    BigtableIO.read()
        .withBigtableOptions(optionsBuilder)
        .withTableId("**********"));
pipeline.run();
Please advise.
You should be able to read from Bigtable in a ParDo. You would have to use the Cloud Bigtable or HBase API directly. It is better to initialize the client in the @Setup method of your DoFn (example). Please post more details if it does not work.
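A minimal sketch of that suggestion, using the Cloud Bigtable HBase client inside a DoFn; the project, instance and table names are hypothetical placeholders.

import org.apache.beam.sdk.transforms.DoFn;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;
import com.google.cloud.bigtable.hbase.BigtableConfiguration;

class BigtableLookupFn extends DoFn<String, String> {
  private transient Connection connection;

  @Setup
  public void setup() {
    // One Bigtable connection per DoFn instance, created once per worker.
    connection = BigtableConfiguration.connect("my-project", "my-instance");
  }

  @ProcessElement
  public void processElement(ProcessContext c) throws Exception {
    // Use the incoming Pub/Sub payload as the Bigtable row key.
    try (Table table = connection.getTable(TableName.valueOf("my-table"))) {
      Result result = table.get(new Get(Bytes.toBytes(c.element())));
      if (!result.isEmpty()) {
        c.output(Bytes.toString(result.value()));
      }
    }
  }

  @Teardown
  public void teardown() throws Exception {
    if (connection != null) {
      connection.close();
    }
  }
}

You would then apply ParDo.of(new BigtableLookupFn()) to the keyLines collection from the question.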

How to send a signal to all the PTransform objects running on Dataflow?

I'm now implementing an Apache Beam application running on Dataflow which consumes data from Cloud Pub/Sub, transforms its format, and sends the results to another Cloud Pub/Sub topic. It loads the definitions of the streaming data, which describe the names and types of the keys and how each record should be transformed. The definitions are stored in GCS and loaded when the application starts.
My question is about how to update the definitions and notify each PTransform object running on Dataflow of the changes. Is it possible to do that online, or do we have to drain and recreate the Dataflow app?

Writing to memcache from within a streaming dataflow pipeline

Is it possible to write to Memcache from a streaming Dataflow pipeline? Or do I need to write to a Pub/Sub topic and create another Compute Engine or App Engine application?
Yes, the Dataflow workers can communicate with any external services that you need; they are just VMs with no special restrictions or permissions.
If you are just writing out data to Memcache, the Sink API will likely be useful.
For Redis, I created a DoFn with a Redis client.
It is possible to do some tricks if you need batch writing. For example:
link
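A minimal sketch of such a DoFn, using the Jedis client (an assumption; the answer does not say which Redis client was used) and batching the writes per bundle with a Redis pipeline; the host, port and key scheme are hypothetical placeholders.

import java.util.ArrayList;
import java.util.List;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;
import redis.clients.jedis.Jedis;
import redis.clients.jedis.Pipeline;

class RedisWriteFn extends DoFn<KV<String, String>, Void> {
  private transient Jedis jedis;
  private transient List<KV<String, String>> buffer;

  @Setup
  public void setup() {
    // One Redis connection per DoFn instance.
    jedis = new Jedis("my-redis-host", 6379);
  }

  @StartBundle
  public void startBundle() {
    buffer = new ArrayList<>();
  }

  @ProcessElement
  public void processElement(ProcessContext c) {
    buffer.add(c.element());
  }

  @FinishBundle
  public void finishBundle() {
    // Batch writing trick: one pipelined round trip per bundle
    // instead of one network call per element.
    Pipeline redisPipeline = jedis.pipelined();
    for (KV<String, String> kv : buffer) {
      redisPipeline.set(kv.getKey(), kv.getValue());
    }
    redisPipeline.sync();
  }

  @Teardown
  public void teardown() {
    if (jedis != null) {
      jedis.close();
    }
  }
}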

Can I make a custom source & sink from a local server (files or DBs) to Dataflow directly?

I would like to build a custom source & sink that goes from a local server (files or DBs) to Dataflow directly, so I wonder whether that is possible.
If it's possible, what should I be careful about when making it?
FYI, I have never made a custom source & sink,
but I have used GCS and Dataflow once.
Dataflow's custom IO framework can read from arbitrary sources and write to arbitrary sinks. You can certainly write connectors for various types of files and databases.
However, when executing pipelines on a remote service like Google Cloud Dataflow, the workers may, depending on several factors, not be able to access services running on your local machine. Moreover, such local services may not scale well enough to support a performant data-processing pipeline.
Thus, it might be better to move the data to a cloud-based service, like Google Cloud Storage or Google BigQuery.
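For example, after staging the files to Cloud Storage (e.g. with gsutil cp ./data/*.csv gs://my-bucket/input/), a pipeline can read them with the built-in TextIO connector instead of a custom local-file source; the bucket and paths below are hypothetical placeholders.

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class GcsReadSketch {
  public static void main(String[] args) {
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    Pipeline pipeline = Pipeline.create(options);

    pipeline
        // Reads every object matching the glob from GCS; unlike files on a
        // local machine, the Dataflow workers can reach the bucket directly.
        .apply("ReadFromGCS", TextIO.read().from("gs://my-bucket/input/*.csv"))
        // Writes the (here unchanged) lines back to GCS.
        .apply("WriteToGCS", TextIO.write().to("gs://my-bucket/output/result"));

    pipeline.run();
  }
}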
