Read from Google PubSub and then read from Bigtable based on the PubSub message Topic - google-cloud-dataflow

In Beam (Dataflow 2.0.0), I am reading from a PubSub topic and then trying to fetch a few rows from Bigtable based on the message from the topic. I couldn't find a way in the Beam documentation to scan Bigtable based on the PubSub messages. I tried to write a ParDo function and pipe it into the Beam pipeline, but in vain.
BigtableIO gives an option to read, but that sits outside the pipeline, and I am not sure it would work in the streaming fashion my use case requires.
Can anyone please let me know if this is doable, i.e. stream from PubSub and read from Bigtable based on the message content?
P.S.: I am using the Java API in Beam 2.0.
PCollection<String> keyLines =
    pipeline.apply(PubsubIO.readMessagesWithAttributes()
        .fromSubscription("*************"))
    .apply("PubSub Message to Payload as String",
        ParDo.of(new PubSubMessageToStringConverter()));
Now I want keyLines to act as the row keys to scan the Bigtable. I am using the code snippet below for Bigtable. I can see 'RowFilter.newBuilder()' and 'ByteKeyRange', but both of them seem to work in batch mode rather than in a streaming fashion.
pipeline.apply("read",
BigtableIO.read()
.withBigtableOptions(optionsBuilder)
.withTableId("**********");
pipeline.run();
Please advise.

You should be able to read from Bigtable in a ParDo. You would have to use the Cloud Bigtable or HBase API directly. It is better to initialize the client in the @Setup method of your DoFn (example). Please post more details if it does not work.
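For illustration, a minimal sketch of that pattern, assuming the Cloud Bigtable HBase client (bigtable-hbase); the project, instance and table names below are placeholders. The connection is created once per DoFn instance in @Setup and reused for point lookups in @ProcessElement:

import org.apache.beam.sdk.transforms.DoFn;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;
import com.google.cloud.bigtable.hbase.BigtableConfiguration;

// Looks up one Bigtable row per incoming PubSub-derived key.
// "my-project", "my-instance" and "my-table" are placeholder names.
class BigtableLookupFn extends DoFn<String, String> {
    private transient Connection connection;

    @Setup
    public void setup() {
        // Create the (relatively expensive) client once per DoFn instance.
        connection = BigtableConfiguration.connect("my-project", "my-instance");
    }

    @ProcessElement
    public void processElement(ProcessContext c) throws Exception {
        try (Table table = connection.getTable(TableName.valueOf("my-table"))) {
            Result row = table.get(new Get(Bytes.toBytes(c.element())));
            if (!row.isEmpty()) {
                c.output(Bytes.toString(row.value())); // value of the first cell in the row
            }
        }
    }

    @Teardown
    public void teardown() throws Exception {
        if (connection != null) {
            connection.close();
        }
    }
}

You could then apply it to the keyLines collection in place of BigtableIO.read(), e.g. keyLines.apply("Bigtable lookup", ParDo.of(new BigtableLookupFn())).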

Related

What is the benefit of using google cloud pub/sub service in a streaming pipeline

Can anyone explain the benefit of adopting the Google Cloud Pub/Sub service in a streaming pipeline?
I saw one of the event streaming pipeline examples showcased, and it used Pub/Sub to ingest the event data before connecting to the Google Cloud Dataflow service to transform it. Why does it not connect to the event data directly through Dataflow?
Thanks.
Dataflow needs a source to get the data from. If you are using a streaming pipeline you can choose between different sources, and each of them has its own characteristics that may fit your scenario.
With Pub/Sub you can easily publish events to a topic using a client library or the API directly, and it guarantees at-least-once delivery of each message.
When you connect it to a Dataflow streaming pipeline, you get a resilient architecture (Pub/Sub keeps redelivering a message until Dataflow acknowledges that it has been processed) and near real-time processing. In addition, Dataflow can use Pub/Sub metrics to scale up or down depending on the number of messages in the backlog.
Finally, the Dataflow runner uses an optimized version of the PubsubIO connector which provides additional features. I suggest checking this documentation, which describes some of these features.
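For illustration, a minimal sketch (with placeholder project and topic names) of publishing a single event with the Pub/Sub Java client library; Pub/Sub then takes care of at-least-once delivery to the subscription that the Dataflow pipeline reads from:

import com.google.cloud.pubsub.v1.Publisher;
import com.google.protobuf.ByteString;
import com.google.pubsub.v1.PubsubMessage;
import com.google.pubsub.v1.TopicName;

public class PublishExample {
    public static void main(String[] args) throws Exception {
        // "my-project" and "events-topic" are placeholder names.
        Publisher publisher = Publisher.newBuilder(TopicName.of("my-project", "events-topic")).build();
        try {
            PubsubMessage message = PubsubMessage.newBuilder()
                .setData(ByteString.copyFromUtf8("{\"event\":\"click\"}"))
                .build();
            publisher.publish(message).get(); // block until the message ID comes back
        } finally {
            publisher.shutdown();
        }
    }
}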

How Kapacitor gets streams in the TICK architecture

As far as I know, Kapacitor can work on streams or batches. In the case of batches, it fetches data from InfluxDB and operates on that.
But how does it work with streams? Does it subscribe to InfluxDB or to Telegraf? I assume it subscribes to InfluxDB, so whenever any client writes data to InfluxDB, Kapacitor also receives that data. Is this understanding correct, or does it subscribe directly to Telegraf?
This question is important to us because we want to use Azure IoT Hub in place of Telegraf. We will read the data from Azure IoT Hub and write it to InfluxDB, and we hope that we can use Kapacitor streams here.
Thanks in advance.
Kapacitor subscribes to InfluxDB. By default, if you do not specify which databases to subscribe to, it subscribes to all of them. In Kapacitor's config file you can list the InfluxDB databases you want to subscribe to.

Connect flume sink to another sink

Apologies for the newbie question, but I have a slightly quirky Flume question.
Scenario:
I have Sink A, which grabs data, performs filtering, then sends successful data to some HTTP service.
I now want to log all successfully processed data to Hive.
Is there a way in Flume to grab data in a sink, then put it back into Flume to be picked up by another sink?
Ideally, I would like to write a Sink B that writes this data to Hive.
So answering my own question:
Is there a way in Flume to grab data in a sink, then put it back into Flume to be picked up by another sink?
No - not from what I found.
Ideally, I would like to write a Sink B that writes this data to Hive.
I ended up piping this data to a Kafka topic, which was then read by a new Flume agent that I had to write.
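For anyone facing the same problem, a rough sketch of that hand-off, assuming Sink A forwards each successfully processed event to a Kafka topic with the plain Kafka producer API; the topic name and the kafka.* property keys are made up for this example, and the second Flume agent then reads that topic with a Kafka source and writes to Hive:

import java.util.Properties;
import org.apache.flume.Channel;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.Transaction;
import org.apache.flume.conf.Configurable;
import org.apache.flume.sink.AbstractSink;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// Sketch of "Sink A": takes events from the channel, sends them to the HTTP
// service (omitted here) and republishes successful ones to a Kafka topic
// that the Hive-bound Flume agent consumes.
public class ForwardingHttpSink extends AbstractSink implements Configurable {
    private KafkaProducer<String, byte[]> producer;
    private String topic;

    @Override
    public void configure(Context context) {
        // "kafka.topic" and "kafka.servers" are invented property names for this sketch.
        topic = context.getString("kafka.topic", "processed-events");
        Properties props = new Properties();
        props.put("bootstrap.servers", context.getString("kafka.servers", "localhost:9092"));
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
        producer = new KafkaProducer<>(props);
    }

    @Override
    public Status process() throws EventDeliveryException {
        Channel channel = getChannel();
        Transaction tx = channel.getTransaction();
        tx.begin();
        try {
            Event event = channel.take();
            if (event == null) {
                tx.commit();
                return Status.BACKOFF;
            }
            // ... filter the event and POST it to the HTTP service here ...
            // On success, hand the payload back via Kafka for the Hive agent.
            producer.send(new ProducerRecord<>(topic, event.getBody()));
            tx.commit();
            return Status.READY;
        } catch (Exception e) {
            tx.rollback();
            throw new EventDeliveryException("Failed to process event", e);
        } finally {
            tx.close();
        }
    }

    @Override
    public synchronized void stop() {
        if (producer != null) {
            producer.close();
        }
        super.stop();
    }
}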

How to send a signal to all the PTransform objects running on Dataflow?

I'm now implementing an Apache Beam application running on Dataflow which consumes data from Cloud Pub/Sub, transforms its format, and sends the results to another Cloud Pub/Sub topic. It loads definitions of the streaming data which describe the names and types of keys and how each record should be transformed. The definitions are stored in GCS and loaded when the application starts.
My question is about how to update the definitions and notify every PTransform object running on Dataflow of the changes. Is it possible to do that online, or do we have to drain and recreate the Dataflow app?

Benefit of Apache Flume

I am new to Apache Flume.
I understand that Apache Flume can help transport data.
But I still fail to see the ultimate benefit offered by Apache Flume.
If I can configure some software, or write my own, to decide which data goes where, why do I need Flume?
Maybe someone can explain a situation that shows Apache Flume's benefit?
Reliable transmission (if you use the file channel):
Flume sends batches of small events. Every time it sends a batch to the next node it waits for an acknowledgment before deleting it. The storage in the file channel is optimized to allow recovery after a crash.
I think the biggest benefit you get out of Flume is extensibility: basically every component, from source to interceptor to sink, is extensible.
We use Flume and read data with a custom Kafka source; the data is JSON, which we parse in the custom Kafka source and then pass on to the HDFS sink. It works reliably on 5 nodes. We only extended the Kafka source; the HDFS sink functionality we got out of the box.
At the same time, being from the Hadoop ecosystem, you get great community support and multiple options to use the tools in different ways.
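As a small illustration of that extensibility, here is a sketch of a custom interceptor (one of the extension points mentioned above) that stamps every event with a header before it reaches the channel; the class name and the header key are invented for this example:

import java.util.List;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

// Minimal custom interceptor: adds a header to each event. The nested Builder
// is what you reference in the agent configuration to attach it to a source.
public class SourceTagInterceptor implements Interceptor {

    @Override
    public void initialize() {
        // No setup needed for this example.
    }

    @Override
    public Event intercept(Event event) {
        event.getHeaders().put("ingest-source", "kafka"); // invented header key
        return event;
    }

    @Override
    public List<Event> intercept(List<Event> events) {
        for (Event event : events) {
            intercept(event);
        }
        return events;
    }

    @Override
    public void close() {
        // Nothing to clean up.
    }

    public static class Builder implements Interceptor.Builder {
        @Override
        public Interceptor build() {
            return new SourceTagInterceptor();
        }

        @Override
        public void configure(Context context) {
            // No configuration parameters in this sketch.
        }
    }
}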
