Apologies for the newbie question, but I have a slightly quirky Flume question.
Scenario:
I have Sink A, which grabs data, performs filtering, then sends successful data to some HTTP service.
I now want to log all successfully processed data to Hive.
Is there a way in Flume to grab data in a sink, then put it back into Flume to be picked up by another sink?
Ideally, I would like to write a Sink B that writes this data to Hive.
So answering my own question:
Is there a way in Flume to grab data in a sink, then put it back into Flume to be picked up by another sink?
No - not from what I found.
Ideally, I would like to write a Sink B that writes this data to Hive.
I ended up piping this data to a Kafka topic, which was then read by a new Flume agent that I had to write.
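For reference, here is a minimal sketch of the sink side of that setup, assuming a custom sink class (hypothetically named HttpThenKafkaSink) that republishes each successfully processed event body to a Kafka topic; the topic and broker names are placeholders:

import java.util.Properties;

import org.apache.flume.Channel;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.Transaction;
import org.apache.flume.conf.Configurable;
import org.apache.flume.sink.AbstractSink;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class HttpThenKafkaSink extends AbstractSink implements Configurable {

    private KafkaProducer<String, byte[]> producer;
    private String topic;

    @Override
    public void configure(Context context) {
        // Topic name comes from the agent's properties file; the default is a placeholder.
        topic = context.getString("kafka.topic", "processed-events");
    }

    @Override
    public synchronized void start() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
        producer = new KafkaProducer<>(props);
        super.start();
    }

    @Override
    public Status process() throws EventDeliveryException {
        Channel channel = getChannel();
        Transaction tx = channel.getTransaction();
        tx.begin();
        try {
            Event event = channel.take();
            if (event == null) {
                tx.commit();
                return Status.BACKOFF;
            }
            // ... existing filtering + HTTP POST logic goes here ...
            // On success, republish the body so a second agent (Kafka source -> Hive sink)
            // can pick it up.
            producer.send(new ProducerRecord<>(topic, event.getBody()));
            tx.commit();
            return Status.READY;
        } catch (Exception e) {
            tx.rollback();
            throw new EventDeliveryException("Failed to process event", e);
        } finally {
            tx.close();
        }
    }

    @Override
    public synchronized void stop() {
        if (producer != null) {
            producer.close();
        }
        super.stop();
    }
}

The second agent then only needs the stock Kafka source, a channel, and the Hive sink wired together in its properties file, with no custom code on that side.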
Related
Is there a Serilog sink that just writes to a buffer in memory? What I am thinking about is a sink that will store X lines, and then I could access those X lines and show them on a web page via an API controller. This would be more for viewing recent errors that occurred in the application.
I looked on the GitHub sink page (https://github.com/serilog/serilog/wiki/Provided-Sinks) but did not see one and just wondered if there was something I was missing.
Serilog doesn't have a built-in Sink that writes to memory, but you could easily write one just for that. Take a look, for example, at the DelegatingSink that is used in Serilog's unit tests, which is 80% of what you would need... You'd just have to store the events in an in-memory data structure.
Another option would be to use the mssqlserver sink, write the events to a simple table, and display them in your web app.
A third option (which would be my recommendation) would be to just install Seq, which is free for development and single-user deployment, and just write the logs to Seq through their sink. That will save you from having to write the web app, and will give you search and filtering out-of-the-box.
In Beam (Dataflow 2.0.0), I am reading a PubSub topic and then trying to fetch a few rows from Bigtable based on the message from the topic. I couldn't find a way in the Beam documentation to scan Bigtable based on the PubSub messages. I tried to write a ParDo function and pipe it into the Beam pipeline, but to no avail.
BigtableIO gives an option to read, but that is configured outside of the pipeline, and I am not sure it would work in the streaming fashion my use case requires.
Can anyone please let me know if this is doable, i.e., streaming from PubSub and reading Bigtable based on the message content?
P.S.: I am using the Java API in Beam 2.0.
PCollection<String> keyLines =
    pipeline.apply(PubsubIO.readMessagesWithAttributes()
            .fromSubscription("*************"))
        .apply("PubSub Message to Payload as String",
            ParDo.of(new PubSubMessageToStringConverter()));
Now I want keyLines to act as the row keys to scan Bigtable. I am using the code snippet below from BigtableIO. I can see 'RowFilter.newBuilder()' and 'ByteKeyRange', but both of them seem to work in batch mode rather than in a streaming fashion.
pipeline.apply("read",
BigtableIO.read()
.withBigtableOptions(optionsBuilder)
.withTableId("**********");
pipeline.run();
Please advise.
You should be able to read from Bigtable in a ParDo. You would have to use the Cloud Bigtable or HBase API directly. It is better to initialize the client in the @Setup method of your DoFn (example). Please post more details if it does not work.
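To make that concrete, here is a minimal sketch of such a DoFn, assuming the Cloud Bigtable HBase client; the project, instance, and table names are placeholders, and BigtableLookupFn is just an illustrative name, not an existing class:

import org.apache.beam.sdk.transforms.DoFn;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

import com.google.cloud.bigtable.hbase.BigtableConfiguration;

// Hypothetical DoFn that looks up one Bigtable row per incoming key.
public class BigtableLookupFn extends DoFn<String, String> {

    private transient Connection connection;
    private transient Table table;

    @Setup
    public void setup() throws Exception {
        // Connect once per DoFn instance; the names below are placeholders.
        connection = BigtableConfiguration.connect("my-project", "my-instance");
        table = connection.getTable(TableName.valueOf("my-table"));
    }

    @ProcessElement
    public void processElement(ProcessContext c) throws Exception {
        // The element is the row key extracted from the PubSub message.
        Result result = table.get(new Get(Bytes.toBytes(c.element())));
        if (!result.isEmpty()) {
            c.output(Bytes.toString(result.value()));
        }
    }

    @Teardown
    public void teardown() throws Exception {
        if (table != null) table.close();
        if (connection != null) connection.close();
    }
}

You would then plug it in after your converter, e.g. keyLines.apply("Lookup in Bigtable", ParDo.of(new BigtableLookupFn()));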
I'm implementing an Apache Beam application running on Dataflow which consumes data from a Cloud PubSub topic, transforms its format, and sends the results to another Cloud PubSub topic. It loads definitions of the streaming data which describe the names and types of keys and how each record should be transformed. The definitions are stored in GCS and loaded when the application starts.
My question is how to update the definitions and notify each PTransform object running in the pipeline of the changes. Is it possible to do that online, or do we have to drain and recreate the Dataflow app?
I plan to create a system where I can read web logs in real time and use Apache Spark to process them. I am planning to use Kafka to pass the logs to Spark Streaming to aggregate statistics. I am not sure if I should do some data parsing (raw to JSON, ...), and if so, where the appropriate place to do it is (Spark script, Kafka, somewhere else...). I would be grateful if someone could guide me. It's all quite new to me. Cheers
Apache Kafka is a distributed pub-sub messaging system. It does not provide any way to parse or transform data; that is not what it is for. However, any Kafka consumer can process, parse, or transform the data published to Kafka, then republish the transformed data to another topic or store it in a database or file system.
There are many ways to consume data from Kafka; one of them is the one you suggested, real-time stream processors (Apache Flume, Apache Spark, Apache Storm, ...).
So the answer is no, Kafka does not provide any way to parse the raw data. You can transform/parse the raw data with Spark, but you can also write your own consumer (there are Kafka client libraries for many languages) or use any other pre-built consumer such as Apache Flume, Apache Storm, etc.
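If you go the Spark route, the parsing would live in the Spark job itself. Here is a minimal sketch using Spark Structured Streaming's Kafka source (the broker address, topic name, and log schema are placeholder assumptions, and it needs the spark-sql-kafka connector on the classpath):

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.from_json;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class WebLogParsingJob {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
                .appName("weblog-parsing")
                .getOrCreate();

        // Hypothetical schema for one parsed web-log record.
        StructType logSchema = new StructType()
                .add("ip", DataTypes.StringType)
                .add("path", DataTypes.StringType)
                .add("status", DataTypes.IntegerType);

        // Read raw lines from Kafka; broker and topic are placeholders.
        Dataset<Row> raw = spark.readStream()
                .format("kafka")
                .option("kafka.bootstrap.servers", "localhost:9092")
                .option("subscribe", "weblogs")
                .load();

        // The parsing happens here, in the consumer (Spark), not in Kafka itself.
        Dataset<Row> parsed = raw
                .selectExpr("CAST(value AS STRING) AS json")
                .select(from_json(col("json"), logSchema).as("log"))
                .select("log.*");

        // A real job would aggregate and write somewhere useful; console is just for the sketch.
        parsed.writeStream()
                .format("console")
                .start()
                .awaitTermination();
    }
}

The same split applies if you stay on the older DStream-based Spark Streaming API: Kafka only moves the raw lines, and the consumer does the raw-to-JSON parsing and the aggregation.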
I am new with Apache Flume.
I understand that Apache Flume can help transport data.
But I still fail to see the ultimate benefit offered by Apache Flume.
If I can configure or write software myself to decide which data goes where, why do I need Flume?
Maybe someone can explain a situation that shows Apache Flume's benefit?
Reliable transmission (if you use the file channel):
Flume sends batches of small events. Every time it sends a batch to the next node, it waits for an acknowledgment before deleting the batch locally. The storage in the file channel is optimized to allow recovery after a crash.
I think the biggest benefit that you get out of Flume is extensibility. Basically, every component, from sources to interceptors to sinks, is extensible.
We use Flume and read data using a custom Kafka source. The data is in JSON form; we parse it in the custom Kafka source and then pass it on to the HDFS sink. It works reliably on 5 nodes. We only extended the Kafka source; the HDFS sink functionality we got out of the box.
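To give a feel for how little code an extension takes, here is a hedged sketch of a custom interceptor (a different extension point than the custom Kafka source we wrote, but the same idea) that keeps only JSON-looking events and tags them with a header; the class name and the naive JSON check are purely illustrative:

import java.nio.charset.StandardCharsets;
import java.util.Iterator;
import java.util.List;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

// Hypothetical interceptor: drops non-JSON events and tags the rest.
public class JsonFilterInterceptor implements Interceptor {

    @Override
    public void initialize() { }

    @Override
    public Event intercept(Event event) {
        String body = new String(event.getBody(), StandardCharsets.UTF_8).trim();
        // Naive check just for the sketch; a real implementation would use a JSON parser.
        if (body.startsWith("{") && body.endsWith("}")) {
            event.getHeaders().put("format", "json");
            return event;
        }
        return null; // returning null drops the event
    }

    @Override
    public List<Event> intercept(List<Event> events) {
        Iterator<Event> it = events.iterator();
        while (it.hasNext()) {
            if (intercept(it.next()) == null) {
                it.remove();
            }
        }
        return events;
    }

    @Override
    public void close() { }

    public static class Builder implements Interceptor.Builder {
        @Override
        public Interceptor build() {
            return new JsonFilterInterceptor();
        }

        @Override
        public void configure(Context context) { }
    }
}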
At the same time, since Flume comes from the Hadoop ecosystem, you get great community support and multiple options to use the tools in different ways.