Avro events from Kafka to HDFS as ORC files with Flume

I have a Kafka cluster that receives Avro events from producers.
I would like to use Flume to consume these events and write them to HDFS as ORC files.
Is this possible with Flume?
Does anyone have an example configuration file demonstrating how to do it?
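Flume's HDFS sink cannot write ORC directly (its hdfs.fileType only covers SequenceFile, DataStream and CompressedStream), so one commonly suggested workaround is to point Flume's Hive sink at a transactional Hive table stored as ORC and let Hive streaming produce the ORC files. Below is a rough, unverified sketch of such an agent; the broker, metastore, topic and table names are placeholders, and because the Hive sink only ships DELIMITED and JSON serializers, the Avro payload would still have to be converted to one of those formats (for example with a custom interceptor) before it reaches the sink.

a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Kafka source consuming the producers' topic
a1.sources.r1.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.r1.kafka.bootstrap.servers = broker1:9092
a1.sources.r1.kafka.topics = avro-events
a1.sources.r1.channels = c1

a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

# Hive sink streaming into an ORC-backed transactional table
a1.sinks.k1.type = hive
a1.sinks.k1.channel = c1
a1.sinks.k1.hive.metastore = thrift://metastore-host:9083
a1.sinks.k1.hive.database = default
a1.sinks.k1.hive.table = events_orc
a1.sinks.k1.serializer = JSON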

Related

Fluentd to Kafka on Kubernetes connection Issues

I'm having issues connecting Fluentd to Kafka for a centralized logging PoC I'm working on.
I'm currently using the following configuration:
Minikube
Fluentd
fluent/fluentd-kubernetes-daemonset:v1.14.3-debian-kafka2-1.0 (Docker image)
Configuration: I have the FLUENT_KAFKA2_BROKERS=<INTERNAL KAFKA BOOTSTRAP IP>:9092 and FLUENT_KAFKA2_DEFAULT_TOPIC=logs environment variables set in my YAML for the Fluentd daemonset.
Kafka
I was sort of expecting to see the logs appear in a Kafka consumer running against the same broker listening on the "logs" topic. No dice.
Could anyone recommend next steps for troubleshooting and/or a good reference? I've done a good bit of searching and have only found a few people posting about setting up the fluentd-kafka plugin. Also, would it make sense for me to explore a Fluent Bit Kafka setup as an alternative?
In general, to forward log events to a Kafka topic you need to use a Fluentd output plugin.
Fluentd ships the fluent-plugin-kafka plugin, as described in the Fluentd docs, for both input and output use cases. For the output case, the plugin acts as a Kafka producer and publishes messages to topics. The kafka-connect-fluentd plugin can also be used as an alternative.
Fluent Bit, a sub-project of Fluentd, is a good lightweight alternative, but which one to use depends on your particular use case.
Fluent Bit has a limited set of filtering options and is not as pluggable and flexible as Fluentd. The latter has more configuration options and filters and can be integrated with a much larger number of input and output sources; it is essentially designed to handle heavy throughput, aggregating from multiple inputs, processing data, and routing to different outputs. More on the comparison here and here.
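For reference, a minimal output section for the kafka2 type of fluent-plugin-kafka (the type the *-kafka2-* daemonset images use) looks roughly like the following; the broker and topic mirror the environment variables above and the match pattern is just an example:

<match **>
  @type kafka2
  brokers <INTERNAL KAFKA BOOTSTRAP IP>:9092
  default_topic logs
  <format>
    @type json
  </format>
  <buffer topic>
    flush_interval 10s
  </buffer>
</match>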

How to transfer data from MSSQL to BigQuery using Apache Beam (Dataflow)

I want to transfer data from MSSQL to GCP BigQuery or GCP Cloud Storage using Apache Beam on Dataflow, but I found that there is no built-in connector to do that. Does anyone know how to approach this task using Apache Beam or Dataflow?
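There is no MSSQL-specific connector, but Beam's generic JdbcIO works with any JDBC driver, including Microsoft's SQL Server driver. One possible approach, sketched here without guarantees, is to read with JdbcIO and write with BigQueryIO in the Java SDK; connection details, the query, column names and table names below are placeholders.

import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.TableRowJsonCoder;
import org.apache.beam.sdk.io.jdbc.JdbcIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class MssqlToBigQuery {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply("ReadFromMssql", JdbcIO.<TableRow>read()
        .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
            "com.microsoft.sqlserver.jdbc.SQLServerDriver",
            "jdbc:sqlserver://<HOST>:1433;databaseName=<DB>")
            .withUsername("<USER>")
            .withPassword("<PASSWORD>"))
        .withQuery("SELECT id, name FROM dbo.source_table")
        .withCoder(TableRowJsonCoder.of())
        // Map each JDBC row to a BigQuery TableRow
        .withRowMapper((JdbcIO.RowMapper<TableRow>) rs ->
            new TableRow().set("id", rs.getLong("id")).set("name", rs.getString("name"))))
     .apply("WriteToBigQuery", BigQueryIO.writeTableRows()
        .to("<PROJECT>:<DATASET>.target_table")
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

    p.run().waitUntilFinish();
  }
}

Running this on Dataflow is then a matter of passing --runner=DataflowRunner along with the usual project, region and temp location options.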

Big Data project requirements using twitter streams

I am currently trying to break into data engineering, and I figured the best way to do this was to get a basic understanding of the Hadoop stack (I played around with the Cloudera quickstart VM and went through the tutorial) and then try to build my own project. I want to build a data pipeline that ingests Twitter data, stores it in HDFS or HBase, and then runs some sort of analytics on the stored data. I would also prefer to use real-time streaming data, not historical/batch data. My data flow would look like this:
Twitter Stream API --> Flume --> HDFS --> Spark/MapReduce --> Some DB
Does this look like a good way to bring in my data and analyze it?
Also, how would you guys recommend I host/store all this?
Would it be better to have one instance on AWS EC2 for Hadoop to run on, or should I run it all in a local VM on my desktop?
I plan to have only one node cluster to start.
First of all, Spark Streaming can read from Twitter, and in CDH, I believe that is the streaming framework of choice.
Your pipeline is reasonable, though I might suggest using Apache NiFi (which is in the Hortonworks HDF distribution), or Streamsets, which is installable in CDH easily, from what I understand.
Note, these are running completely independently of Hadoop. Hint: Docker works great with them. HDFS and YARN are really the only complex components that I would rely on a pre-configured VM for.
Both NiFi and StreamSets give you a drag-and-drop UI for hooking Twitter to HDFS and "other DB".
Flume can work, and a single pipeline is easy to set up, but it just hasn't matured to the level of the other streaming platforms. Personally, I like a Logstash -> Kafka -> Spark Streaming pipeline better, partly because Logstash configuration files are nicer to work with (the Twitter input plugin is built in), and Kafka works with a bunch of tools.
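As a rough illustration of that Logstash option (credentials, broker and topic are placeholders), the pipeline config would look something like this:

input {
  twitter {
    consumer_key => "<CONSUMER_KEY>"
    consumer_secret => "<CONSUMER_SECRET>"
    oauth_token => "<ACCESS_TOKEN>"
    oauth_token_secret => "<ACCESS_TOKEN_SECRET>"
    keywords => ["hadoop", "spark"]
  }
}
output {
  kafka {
    bootstrap_servers => "broker1:9092"
    topic_id => "tweets"
    codec => json
  }
}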
You could also try out Kafka with Kafka Connect, or use Apache Flink for the whole pipeline.
Primary takeaway: you can bypass Hadoop here, or at least have something like this
Twitter > Streaming Framework > HDFS
.. > Other DB
... > Spark
Regarding running locally or not, as long as you are fine with paying for idle hours on a cloud provider, go ahead.

Spark cluster: Standalone mode without HDFS

We have a standalone Spark cluster. With a cluster, if the RDD memory storage is not enough, it spills the data to disk. Where exactly is the data spilled to when there is no HDFS? Local disk of each slave node?
Thanks!
As far as I know, all data is spilled to the local directory defined by spark.local.dir, independent of HDFS access.
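For example, in conf/spark-defaults.conf on each worker you can point this at one or more local scratch directories (the paths here are made up); in standalone mode the SPARK_LOCAL_DIRS environment variable in spark-env.sh overrides this setting:

spark.local.dir /data1/spark-scratch,/data2/spark-scratch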

How to make Ganglia graphs work with Flume

I am monitoring a multi-agent Flume setup with Ganglia. I am able to produce metrics which I can view as JSON data in the browser. Can anyone suggest how to view these metrics as graphs?
When starting the Flume agent, provide the following Java options:
-Dflume.monitoring.type=ganglia
-Dflume.monitoring.hosts=<host:port of ganglia monitor daemon>
Depending on the Flume installation, these can be set in flume-env.sh (in mine the file is in /etc/flume/conf).
You won't be able to run HTTP and Ganglia monitoring at the same time.
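For example, in flume-env.sh the options can be appended to JAVA_OPTS like this (the host name is a placeholder; 8649 is gmond's default port), after which the Flume metrics should appear as graphs under the agent's host in the Ganglia web UI:

export JAVA_OPTS="$JAVA_OPTS -Dflume.monitoring.type=ganglia -Dflume.monitoring.hosts=ganglia-host:8649"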

Resources