Benefit of Apache Flume

I am new to Apache Flume.
I understand that Apache Flume can help transport data.
But I still fail to see the ultimate benefit it offers.
If I can configure or write software myself to decide which data goes where, why do I need Flume?
Maybe someone can explain a situation that shows Apache Flume's benefit?

Reliable transmission (if you use the file channel):
Flume sends events in small batches. Every time it sends a batch to the next node, it waits for an acknowledgment before deleting it. The storage in the file channel is optimized to allow recovery after a crash.
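For illustration, a minimal agent definition along these lines (agent/component names, the port and the directories are placeholders) pairs a source and a sink through a durable file channel:

    # Hypothetical Flume agent: source -> file channel -> sink
    agent1.sources  = src1
    agent1.channels = ch1
    agent1.sinks    = sink1

    agent1.sources.src1.type = netcat
    agent1.sources.src1.bind = 0.0.0.0
    agent1.sources.src1.port = 44444
    agent1.sources.src1.channels = ch1

    # File channel: events are persisted to disk until the next hop acknowledges them
    agent1.channels.ch1.type = file
    agent1.channels.ch1.checkpointDir = /var/flume/checkpoint
    agent1.channels.ch1.dataDirs = /var/flume/data

    agent1.sinks.sink1.type = logger
    agent1.sinks.sink1.channel = ch1

If the agent or the machine crashes, undelivered events are replayed from the checkpoint and data directories on restart.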

I think the biggest benefit that you get out of Flume is extensibility. Basically all components, from the source through interceptors to the sink, are extensible.
We use Flume and read data with a custom Kafka source; the data is JSON, we parse it in the custom Kafka source and then pass it on to the HDFS sink. It works reliably on 5 nodes. We extended only the Kafka source; the HDFS sink functionality we got out of the box.
At the same time, being from the Hadoop ecosystem, you get great community support and multiple options to use the tools in different ways.
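To make that wiring concrete, a sketch of such an agent might look like the following; the custom source class name and the paths are hypothetical, only the HDFS sink settings are stock Flume:

    # Hypothetical agent: custom JSON-parsing Kafka source feeding the built-in HDFS sink
    agent1.sources  = kafkaSrc
    agent1.channels = ch1
    agent1.sinks    = hdfsSink

    # Custom source class (illustrative name, not a class that ships with Flume)
    agent1.sources.kafkaSrc.type = com.example.flume.JsonKafkaSource
    agent1.sources.kafkaSrc.channels = ch1

    agent1.channels.ch1.type = memory
    agent1.channels.ch1.capacity = 10000

    # Built-in HDFS sink, used as-is
    agent1.sinks.hdfsSink.type = hdfs
    agent1.sinks.hdfsSink.channel = ch1
    agent1.sinks.hdfsSink.hdfs.path = /flume/events
    agent1.sinks.hdfsSink.hdfs.fileType = DataStream

Only the source type points at custom code; everything downstream is configuration.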

Related

What is recommended solution for monitoring heterogeneous infrastructure?

I am looking for a monitoring tool for the following use cases:
Collect basic metrics about virtual machines (CPU usage, memory usage, I/O, available disk space)
Extract metrics from SQL Server (probably by running some queries)
Extract information from an external service about processing, i.e. how many processing jobs are currently running and for how long. I am thinking about writing Python scripts, but don't know how to combine them with a monitoring tool
Have the ability to plot charts and manage alerts, and it would be nice to be able to send not only mails, but also messages to Slack/MS Teams.
I was thinking about Prometheus, because it has wmi_exporter, node_exporter, a SQL exporter, and Alertmanager with the possibility to send notifications to multiple destinations, but I don't know what to do with this external service and the Python scripts.
Any suggestions?
Prometheus can definitely do what you say you need done. Some of it may not be trivial, but you can fill in the blanks yourself.
E.g. you can get machine metrics basically out of the box by firing up a node_exporter and having it scraped by Prometheus, but I don't think it exposes, e.g., information on all running processes. The latter might require you to write an agent/exporter: a simple web server that exposes metrics on /metrics; there is a Python client library to help with that. Or have said processes (assuming they're your code) push metrics to a Pushgateway instead, if they're short-lived batch jobs.
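As a rough sketch of that exporter idea (the metric names, the port and the stub that queries the external service are all made up), using the Prometheus Python client library:

    # Minimal custom exporter sketch built on the Prometheus Python client library.
    # Metric names, the port and the external-service call are illustrative.
    import time
    from prometheus_client import Gauge, start_http_server

    RUNNING_JOBS = Gauge('external_service_running_jobs',
                         'Number of processing jobs currently running')
    OLDEST_JOB_AGE = Gauge('external_service_oldest_job_age_seconds',
                           'How long the oldest running job has been running')

    def collect_from_external_service():
        # Replace with a real call to the external service's API
        return {'running': 3, 'oldest_age': 127.0}

    if __name__ == '__main__':
        start_http_server(8000)          # exposes /metrics on this port
        while True:
            stats = collect_from_external_service()
            RUNNING_JOBS.set(stats['running'])
            OLDEST_JOB_AGE.set(stats['oldest_age'])
            time.sleep(30)               # Prometheus scrapes on its own schedule

Point a scrape job at that port and the two gauges show up in Prometheus like any other metric.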
Oh, and for charts/dashboards you probably want Grafana, as Prometheus' abilities in that area are rather limited and Grafana integrates rather well with Prometheus.

web logs parsing for Spark Streaming

I plan to create a system where I can read web logs in real time and use Apache Spark to process them. I am planning to use Kafka to pass the logs to Spark Streaming to aggregate statistics. I am not sure if I should do some data parsing (raw to JSON ...), and if yes, where the appropriate place to do it is (the Spark script, Kafka, somewhere else...). I will be grateful if someone can guide me. It's kind of new stuff to me. Cheers
Apache Kafka is a distributed pub-sub messaging system. It does not provide any way to parse or transform data; it is not meant for that. But any Kafka consumer can process, parse or transform the data published to Kafka and republish the transformed data to another topic, or store it in a database or file system.
There are many ways to consume data from Kafka; one of them is the one you suggested, real-time stream processors (Apache Flume, Apache Spark, Apache Storm, ...).
So the answer is no, Kafka does not provide any way to parse the raw data. You can transform/parse the raw data with Spark, but you can just as well write your own consumer, since there are Kafka clients for many languages, or use any other pre-built consumer such as Apache Flume, Apache Storm, etc.
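For instance, a small Structured Streaming sketch in PySpark that reads the raw lines from Kafka and parses them inside the Spark job could look like this (the broker address, topic name and log format are assumptions, and the spark-sql-kafka package must be on the classpath):

    # Sketch: read raw web-log lines from a Kafka topic and parse them in Spark.
    # Broker address, topic name and the log fields are illustrative assumptions.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, regexp_extract, window

    spark = SparkSession.builder.appName("weblog-stats").getOrCreate()

    raw = (spark.readStream
           .format("kafka")
           .option("kafka.bootstrap.servers", "broker1:9092")
           .option("subscribe", "weblogs")
           .load())

    # Kafka delivers bytes; cast to string and pull fields out with regexes
    lines = raw.selectExpr("CAST(value AS STRING) AS line", "timestamp")
    parsed = lines.select(
        regexp_extract(col("line"), r'^(\S+)', 1).alias("client_ip"),
        regexp_extract(col("line"), r'"\S+\s+(\S+)', 1).alias("path"),
        col("timestamp"))

    # Example aggregation: requests per path per minute
    counts = parsed.groupBy(window(col("timestamp"), "1 minute"), col("path")).count()

    query = (counts.writeStream
             .outputMode("update")
             .format("console")
             .start())
    query.awaitTermination()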

Zabbix & external monitoring systems

I need to make Zabbix friends with another monitoring system.
My company uses Zabbix for monitoring. Our partner plans to use another system.
We need to exchange monitoring data.
I'm interested in cooperation with the following systems: BMC Patrol, MS SCOM, NetCool, Portal.
What is the best way to integrate them?
Maybe via SNMP?
Replicate hosts and metrics into your Zabbix (use the Zabbix trapper item type and also set the Allowed hosts value), and then just use some suitable zabbix_sender implementation to push data into Zabbix.
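A minimal sender sketch in Python (server, host name and item key are placeholders; the trapper item must already exist on the Zabbix side), or you can simply call the stock zabbix_sender binary instead:

    # Minimal Zabbix sender-protocol sketch: push one value to a trapper item.
    # Server, host name and item key are placeholders.
    import json
    import socket
    import struct

    def zabbix_send(server, host, key, value, port=10051):
        payload = json.dumps({
            "request": "sender data",
            "data": [{"host": host, "key": key, "value": str(value)}],
        }).encode("utf-8")
        # Zabbix header: "ZBXD" + 0x01 + little-endian 8-byte payload length
        packet = b"ZBXD\x01" + struct.pack("<Q", len(payload)) + payload
        with socket.create_connection((server, port)) as sock:
            sock.sendall(packet)
            return sock.recv(1024)  # contains the processed/failed summary

    print(zabbix_send("zabbix.example.com", "partner-host", "partner.cpu.load", 0.42))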
IMO it's a terrible idea, because of latency, syncing, ... Do you really need the data (item values), or do you only need to visualize data from different data sources in one graph?
Regarding BMC Patrol, you can use the History Loader/Propagator KM to export the monitoring data:
https://docs.bmc.com/docs/display/public/unixlinux912/PATROL+KM+for+History+Loader
or you can use the 'dump_hist' command to dump the history data from the agents:
https://docs.bmc.com/docs/display/pia9600/dump_hist+uility
Regarding Netcool events, you could get the information using different approaches. For example, depending on the version, you could get the events from the HTTP interface, as described below:
https://www.ibm.com/support/knowledgecenter/en/SSNFET_9.2.0/com.ibm.netcool_OMNIbus.doc_7.4.0/omnibus/wip/api/reference/omn_api_http_httpinterface.html
Or perhaps you could create a flat file gateway to read the events and write them to a file:
https://www.ibm.com/support/knowledgecenter/en/SSSHTQ/omnibus/gateways/flatfilegw/wip/concept/flatfilegw_intro.html

How to use the Ganglia UI with Flume?

I am interested in monitoring my multi-agent Apache Flume setup. I have enabled the inbuilt Ganglia server, which provides me the Flume metrics through JSON data. Now I am interested in viewing this info in graphs/charts. To achieve this I am using the Ganglia web UI, and I have these questions: do I have to install gmond and gmetad to achieve it, and if not, how will I use the existing Ganglia info with the Ganglia web UI?
Thanks in advance.
You'll need both, IMHO. Moreover, I think Flume can communicate directly with a gmond by appending some properties to JAVA_OPTS; see the Hortonworks docs.
You'll need gmetad because it stores your data in RRD files, and the web UI queries it to display graphs.
Graphite can do the job too.
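For reference, the JAVA_OPTS approach mentioned above boils down to Flume's built-in monitoring properties, roughly like this in flume-env.sh (the gmond host and port are placeholders):

    # flume-env.sh: point Flume's built-in Ganglia reporter at a gmond instance
    export JAVA_OPTS="$JAVA_OPTS -Dflume.monitoring.type=ganglia -Dflume.monitoring.hosts=gmond-host:8649"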

Capture / monitor system data of application servers in Graphite

I am using a Graphite server to capture my metrics data and turn it into graphs. I have 4 application servers behind a load balancer. My aim is to capture system data such as CPU usage, memory usage, disk load, etc., for all 4 application servers. I set up a Graphite environment on a separate server, and I want to push the system data from all the application servers to Graphite and have it displayed as graphs. I don't know what needs to be done to feed system data to Graphite. My thinking was to install statsd on all application servers and feed the system data to Graphite, but it looks like statsd does not support system data, only application data.
Can anyone help me get onto the right track? Thanks in advance.
Running collectd with a Graphite agent would be an excellent start to gather the information you're after.
There is an almost unlimited number of ways to get your data into Graphite.
You can find a list of tools that have known to work very well with graphite on the readthedocs.org page: http://graphite.readthedocs.org/en/0.9.10/tools.html
There is also an example script that gathers load average from the system in the carbon project: example-client.py
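In the same spirit as that example client, a tiny sketch that pushes the 1-minute load average to Carbon's plaintext listener (host, port and metric path are assumptions; 2003 is the usual default port):

    # Sketch: push the 1-minute load average to Carbon's plaintext listener.
    # Host, port and the metric path are illustrative assumptions.
    import os
    import socket
    import time

    CARBON_HOST = "graphite.example.com"
    CARBON_PORT = 2003

    load1, _, _ = os.getloadavg()
    line = "servers.app01.loadavg.1min %f %d\n" % (load1, int(time.time()))

    with socket.create_connection((CARBON_HOST, CARBON_PORT)) as sock:
        sock.sendall(line.encode("ascii"))

Run something like this from cron on each application server, or let collectd's write_graphite plugin do the same thing continuously.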
