I know the enterprise way (Cloudera, for example): through Cloudera Manager in a browser, or through the Cloudera REST API, one can access monitoring and configuration facilities.
But how can I schedule (run and rerun) the Flume agent lifecycle, and monitor the agents' running/failure status, without CM? Is there anything for this in the Flume distribution itself?
Flume's JSON Reporting API can be used to monitor health and performance; see the Monitoring section of the Flume User Guide.
I tried adding flume.monitoring.type and flume.monitoring.port to flume-ng at startup, and it completely fits my needs.
Let's create a simple agent a1, for example, which listens on localhost:44444 and logs to the console through a logger sink:
# flume.conf
a1.sources = s1
a1.channels = c1
a1.sinks = d1
a1.sources.s1.channels = c1
a1.sources.s1.type = netcat
a1.sources.s1.bind = localhost
a1.sources.s1.port = 44444
a1.sinks.d1.channel = c1
a1.sinks.d1.type = logger
a1.channels.c1.type = memory
a1.channels.c1.capacity = 100
a1.channels.c1.transactionCapacity = 10
Run it with the additional flume.monitoring.type and flume.monitoring.port parameters:
flume-ng agent -n a1 -c conf -f flume.conf -Dflume.root.logger=INFO,console -Dflume.monitoring.type=http -Dflume.monitoring.port=44123
Then monitor the output in a browser at localhost:44123/metrics:
{"CHANNEL.c1":{"ChannelCapacity":"100","ChannelFillPercentage":"0.0","Type":"CHANNEL","EventTakeSuccessCount":"570448","ChannelSize":"0","EventTakeAttemptCount":"570573","StartTime":"1567002601836","EventPutAttemptCount":"570449","EventPutSuccessCount":"570448","StopTime":"0"}}
Just try some load:
dd if=/dev/urandom count=1024 bs=1024 | base64 | nc localhost 44444
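Since the original question also asks about monitoring running/failure status without CM, here is a minimal sketch of polling that endpoint from a script. It assumes the agent above with the HTTP metrics server on port 44123; the 80% alert threshold is just an example value.
import json

try:                      # Python 3
    from urllib.request import urlopen
except ImportError:       # Python 2
    from urllib2 import urlopen

METRICS_URL = "http://localhost:44123/metrics"   # flume.monitoring.port from above

def channel_fill_percentages(url=METRICS_URL):
    """Return {channel name: fill percentage} from Flume's JSON reporting endpoint."""
    metrics = json.load(urlopen(url))
    return {name: float(attrs["ChannelFillPercentage"])
            for name, attrs in metrics.items()
            if attrs.get("Type") == "CHANNEL"}

if __name__ == "__main__":
    for channel, fill in channel_fill_percentages().items():
        # e.g. alert (or feed Zabbix/Nagios) when a channel starts backing up
        status = "OK" if fill < 80.0 else "WARNING: channel is backing up"
        print("%s fill=%.1f%% %s" % (channel, fill, status))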
Related
I was learning Flume and encountered an error: "Could not configure source r1 due to: Failed to configure component!"
My requirements are as follows:
Flume1 listens on a port and sends the data it receives to Flume2, Flume3, and Flume4, which print it to the console.
Data containing "ATguigu" is sent to Flume2, data containing "Shangguigu" is sent to Flume3, and all other data is sent to Flume4.
My four configuration files are:
flume1:
a1.sources=r1
a1.channels=c1 c2 c3
a1.sinks=k1 k2 k3
a1.sources.r1.type=netcat
a1.sources.r1.bind=localhost
a1.sources.r1.port=5555
# Interceptor
a1.sources.r1.interceptors=i1
a1.sources.r1.interceptors.i1.type=com.cmcz.flume.interceptor.EventHeaderInterceptor$MyBuilder
# channel selector:multiplexing channel selector
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = title
a1.sources.r1.selector.mapping.at = c1
a1.sources.r1.selector.mapping.st = c2
a1.sources.r1.selector.default = c3
a1.channels.c1.type=memory
a1.channels.c1.capacity=10000
a1.channels.c1.transactionCapacity=100
a1.channels.c2.type=memory
a1.channels.c2.capacity=10000
a1.channels.c2.transactionCapacity=100
a1.channels.c3.type=memory
a1.channels.c3.capacity=10000
a1.channels.c3.transactionCapacity=100
a1.sinks.k1.type=avro
a1.sinks.k1.hostname=localhost
a1.sinks.k1.port=6666
a1.sinks.k2.type=avro
a1.sinks.k2.hostname=localhost
a1.sinks.k2.port=7777
a1.sinks.k3.type=avro
a1.sinks.k3.hostname=localhost
a1.sinks.k3.port=8888
a1.sources.r1.channnels=c1 c2 c3
a1.sinks.k1.channel=c1
a1.sinks.k2.channel=c2
a1.sinks.k3.channel=c3
flume2:
a2.sources=r1
a2.channels=c1
a2.sinks=k1
a2.sources.r1.type=avro
a2.sources.r1.bind=localhost
a2.sources.r1.port=6666
a2.channels.c1.type=memory
a2.channels.c1.capacity=10000
a2.channels.c1.transactionCapacity=100
a2.sinks.k1.type=logger
a2.sources.r1.channels=c1
a2.sinks.k1.channel=c1
flume3:
a3.sources=r1
a3.channels=c1
a3.sinks=k1
a3.sources.r1.type=avro
a3.sources.r1.bind=localhost
a3.sources.r1.port=7777
a3.channels.c1.type=memory
a3.channels.c1.capacity=10000
a3.channels.c1.transactionCapacity=100
a3.sinks.k1.type=logger
a3.sources.r1.channels=c1
a3.sinks.k1.channel=c1
flume4:
a4.sources=r1
a4.channels=c1
a4.sinks=r1
a4.sources.r1.type=avro
a4.sources.r1.bind=localhost
a4.sources.r1.port=8888
a4.channels.c1.type=memory
a4.channels.c1.capacity=10000
a4.channels.c1.transactionCapacity=100
a4.sinks.k1.type=logger
a4.sources.r1.channels=c1
a4.sinks.k1.channel=c1
And then I start them:
flume-ng agent -c $FLUME_HOME/conf -f $FLUME_HOME/jobs/multiplexing/flume4.conf -n a4 -Dflume.root.logger=INTO,console
flume-ng agent -c $FLUME_HOME/conf -f $FLUME_HOME/jobs/multiplexing/flume3.conf -n a3 -Dflume.root.logger=INTO,console
flume-ng agent -c $FLUME_HOME/conf -f $FLUME_HOME/jobs/multiplexing/flume2.conf -n a2 -Dflume.root.logger=INTO,console
flume-ng agent -c $FLUME_HOME/conf -f $FLUME_HOME/jobs/multiplexing/flume1.conf -n a1 -Dflume.root.logger=INTO,console
An error occurred during startup of flume1:
Source r1 failed to start.
What should I do?
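For reference, once the source does start, a quick way to exercise the routing described above is to push a few sample lines into the netcat source. This is only a sketch: it assumes port 5555 from the configuration and that the custom interceptor (not shown) sets the title header based on the event body.
import socket

# Sample lines matching the routing rules described in the question
SAMPLES = ["hello ATguigu", "hello Shangguigu", "hello something else"]

def send_lines(lines, host="localhost", port=5555):
    # One connection, one event per line (the netcat source splits on newlines)
    sock = socket.create_connection((host, port))
    try:
        for line in lines:
            sock.sendall((line + "\n").encode("utf-8"))
    finally:
        sock.close()

if __name__ == "__main__":
    send_lines(SAMPLES)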
I submit my code to a Spark standalone cluster. The submit command is as follows:
nohup ./bin/spark-submit \
--master spark://ES01:7077 \
--executor-memory 4G \
--num-executors 1 \
--total-executor-cores 1 \
--conf "spark.storage.memoryFraction=0.2" \
./myCode.py 1>a.log 2>b.log &
I specify 4 GB of executor memory in the command above, but when I monitor the executor process with the top command, I notice the memory usage keeps growing. The current top output is below:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
12578 root 20 0 20.223g 5.790g 23856 S 61.5 37.3 20:49.36 java
My total memory is 16 GB, so 37.3% is already more than the 4 GB I specified, and it is still growing.
Using the ps command, you can see that this is the executor process:
[root@ES01 ~]# ps -awx | grep spark | grep java
10409 ? Sl 1:43 java -cp /opt/spark-1.6.0-bin-hadoop2.6/conf/:/opt/spark-1.6.0-bin-hadoop2.6/lib/spark-assembly-1.6.0-hadoop2.6.0.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar:/opt/hadoop-2.6.2/etc/hadoop/ -Xms4G -Xmx4G -XX:MaxPermSize=256m org.apache.spark.deploy.master.Master --ip ES01 --port 7077 --webui-port 8080
10603 ? Sl 6:16 java -cp /opt/spark-1.6.0-bin-hadoop2.6/conf/:/opt/spark-1.6.0-bin-hadoop2.6/lib/spark-assembly-1.6.0-hadoop2.6.0.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar:/opt/hadoop-2.6.2/etc/hadoop/ -Xms4G -Xmx4G -XX:MaxPermSize=256m org.apache.spark.deploy.worker.Worker --webui-port 8081 spark://ES01:7077
12420 ? Sl 10:16 java -cp /opt/spark-1.6.0-bin-hadoop2.6/conf/:/opt/spark-1.6.0-bin-hadoop2.6/lib/spark-assembly-1.6.0-hadoop2.6.0.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar:/opt/hadoop-2.6.2/etc/hadoop/ -Xms1g -Xmx1g -XX:MaxPermSize=256m org.apache.spark.deploy.SparkSubmit --master spark://ES01:7077 --conf spark.storage.memoryFraction=0.2 --executor-memory 4G --num-executors 1 --total-executor-cores 1 /opt/flowSpark/sparkStream/ForAsk01.py
12578 ? Sl 21:03 java -cp /opt/spark-1.6.0-bin-hadoop2.6/conf/:/opt/spark-1.6.0-bin-hadoop2.6/lib/spark-assembly-1.6.0-hadoop2.6.0.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar:/opt/hadoop-2.6.2/etc/hadoop/ -Xms4096M -Xmx4096M -Dspark.driver.port=52931 -XX:MaxPermSize=256m org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler#10.79.148.184:52931 --executor-id 0 --hostname 10.79.148.184 --cores 1 --app-id app-20160511080701-0013 --worker-url spark://Worker#10.79.148.184:52660
Below is the code. It is very simple, so I do not think there is a memory leak.
# Imports were omitted in the original post; these are the usual PySpark 1.6 ones
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.sql import HiveContext, Row
from pyspark.sql import functions as func
from pyspark.sql.window import Window

if __name__ == "__main__":
    dataDirectory = '/stream/raw'
    sc = SparkContext(appName="Netflow")
    ssc = StreamingContext(sc, 20)

    # Read CSV File
    lines = ssc.textFileStream(dataDirectory)
    lines.foreachRDD(process)
    ssc.start()
    ssc.awaitTermination()
The code for the process function is below. Please note that I am using HiveContext, not SQLContext, here, because SQLContext does not support window functions.
def getSqlContextInstance(sparkContext):
    if ('sqlContextSingletonInstance' not in globals()):
        globals()['sqlContextSingletonInstance'] = HiveContext(sparkContext)
    return globals()['sqlContextSingletonInstance']

def process(time, rdd):
    if rdd.isEmpty():
        return sc.emptyRDD()

    sqlContext = getSqlContextInstance(rdd.context)

    # Convert CSV File to Dataframe
    parts = rdd.map(lambda l: l.split(","))
    rowRdd = parts.map(lambda p: Row(router=p[0], interface=int(p[1]), flow_direction=p[9], bits=int(p[11])))
    dataframe = sqlContext.createDataFrame(rowRdd)

    # Get the top 2 interfaces of each router
    dataframe = dataframe.groupBy(['router','interface']).agg(func.sum('bits').alias('bits'))
    windowSpec = Window.partitionBy(dataframe['router']).orderBy(dataframe['bits'].desc())
    rank = func.dense_rank().over(windowSpec)
    ret = dataframe.select(dataframe['router'],dataframe['interface'],dataframe['bits'], rank.alias('rank')).filter("rank<=2")
    ret.show()
    dataframe.show()
Actually, I found that the code below causes the problem:
# Get the top 2 interface of each router
dataframe = dataframe.groupBy(['router','interface']).agg(func.sum('bits').alias('bits'))
windowSpec = Window.partitionBy(dataframe['router']).orderBy(dataframe['bits'].desc())
rank = func.dense_rank().over(windowSpec)
ret = dataframe.select(dataframe['router'],dataframe['interface'],dataframe['bits'], rank.alias('rank')).filter("rank<=2")
ret.show()
If I remove these 5 lines, the code can run all night without showing any memory increase, but adding them back causes the executor's memory usage to grow to a very high number.
Basically, the above code is just some window + groupBy in Spark SQL. So is this a bug?
Disclaimer: this answer isn't based on debugging, but more on observations and the documentation Apache Spark provides
I don't believe that this is a bug to begin with!
Looking at your configuration, we can see that you are focusing mostly on executor tuning, which isn't wrong, but you are forgetting the driver part of the equation.
Looking at the cluster overview diagram in the Apache Spark documentation, each worker has an executor; in your case, however, the worker node is the same machine as the driver node, which is what happens when you run locally or on a single-node standalone cluster.
Further, the driver takes 1 GB of memory by default unless tuned with the spark.driver.memory setting. On top of that, you should not forget the heap usage of the JVM itself, and the web UI, which is also served by the driver as far as I know.
When you delete the lines of code you mentioned, your code is left without any actions; map is just a transformation, so nothing is executed and you don't see any memory increase at all.
The same applies to groupBy: it is just a transformation and will not be executed until an action such as show is called further down the stream.
That said, try to minimize your driver memory and the overall number of cores in Spark (spark.cores.max controls the total number of cores for the application), and then cascade down to the executors. Moreover, I would add spark.python.profile.dump to your configuration so you can see a profile of your Spark job execution, which can help you understand the case better and tune your cluster to your needs.
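For instance, a sketch of how those settings could be applied where the context is built in the code above; the values are placeholders to experiment with, not recommendations:
from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext

# Hypothetical tuning sketch. Note that driver memory itself normally has to be
# passed on the spark-submit command line (--driver-memory), because the driver
# JVM is already running by the time this code executes.
conf = (SparkConf()
        .setAppName("Netflow")
        .set("spark.cores.max", "1")                    # total cores used by the app
        .set("spark.python.profile", "true")            # enable the PySpark profiler
        .set("spark.python.profile.dump", "/tmp/spark-profiles"))  # dump directory

sc = SparkContext(conf=conf)
ssc = StreamingContext(sc, 20)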
As far as I can see from your 5 lines, maybe the groupBy is the issue; would you try reduceByKey instead and see how it performs?
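A sketch of what that could look like at the RDD level, assuming the same CSV layout as in the question (router in column 0, interface in column 1, bits in column 11); it only covers the per-(router, interface) aggregation, not the window ranking:
# Hypothetical sketch: the same per-(router, interface) bit totals computed with
# reduceByKey on the raw RDD, which combines partial sums map-side before the shuffle.
def aggregate_bits(rdd):
    parts = rdd.map(lambda l: l.split(","))
    # key = (router, interface), value = bits
    pairs = parts.map(lambda p: ((p[0], int(p[1])), int(p[11])))
    return pairs.reduceByKey(lambda a, b: a + b)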
My Flume configuration file is as follows:
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.channels = c1
a1.sources.r1.command = tail -F /home/hadoop/flume-1.5.0-bin/log_exec_tail
# Describe the sink
a1.sinks.k1.type = logger
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
And I start my Flume agent with the following script:
bin/flume-ng agent -n a1 -c conf -f conf/flume_log.conf -Dflume.root.logger=INFO,console
Question 1: The run result is as follows; however, I don't know whether it ran successfully or not.
Question 2: There are also the following instructions, and I don't understand what they mean with respect to testing Flume:
NOTE: To test that the Flume agent is running properly, open a new terminal window and change directories to /home/horton/solutions/:
horton@ip:~$ cd /home/horton/solutions/
Run the following script, which writes log entries to nodemanager.log:
$ ./test_flume_log.sh
If successful, you should see new files in the /user/horton/flume_sink directory in HDFS
Stop the logagent Flume agent
As per your Flume configuration, whenever the file /home/hadoop/flume-1.5.0-bin/log_exec_tail changes, the agent tails it and appends the new lines to the console.
So, to test that it is working correctly:
1. Run the command bin/flume-ng agent -n a1 -c conf -f conf/flume_log.conf -Dflume.root.logger=INFO,console
2. Open another terminal and add a few lines to the file /home/hadoop/flume-1.5.0-bin/log_exec_tail (see the sketch below)
3. Save it
4. Now check the terminal where you started the Flume command
5. You should see the newly added lines displayed
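For step 2, a minimal sketch (assuming the file path from the exec source command above) that appends a few test lines:
import time

# Path taken from the exec source's tail -F command above
LOG_FILE = "/home/hadoop/flume-1.5.0-bin/log_exec_tail"

with open(LOG_FILE, "a") as f:
    for i in range(5):
        f.write("test event %d at %s\n" % (i, time.ctime()))
        f.flush()          # make sure tail -F picks up each line promptly
        time.sleep(1)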
I am obtaining CPU and RAM statistics for the openvpn process by running the following command in a Python script on a Linux Debian 7 box.
ps aux | grep openvpn
The output is parsed and sent to a zabbix monitoring server.
I currently use the following Python script called psperf.py.
If I want CPU% stats I run: psperf 2
#!/usr/bin/env python

import subprocess, sys, shlex

psval=sys.argv[1] #ps aux val to extract such as CPU etc #2 = %CPU, 3 = %MEM, 4 = VSZ, 5 = RSS

#https://stackoverflow.com/questions/6780035/python-how-to-run-ps-cax-grep-something-in-python
proc1 = subprocess.Popen(shlex.split('ps aux'),stdout=subprocess.PIPE)
proc2 = subprocess.Popen(shlex.split('grep openvpn'),stdin=proc1.stdout,stdout=subprocess.PIPE,stderr=subprocess.PIPE)

proc1.stdout.close() # Allow proc1 to receive a SIGPIPE if proc2 exits.
out,err=proc2.communicate()

#string stdout?
output = (format(out))

#create output list
output = output.split()

#make ps val an integer to enable list location
psval = int(psval)

#extract value to send to zabbix from output list
val = output[psval]

#OUTPUT
print val
This script works fine for obtaining the data relating to openvpn. However, I now want to reuse the script by passing in the name of the process to extract data for, so that I don't need a separate script for each process. For example, I might want CPU and RAM statistics for the zabbix process.
I have tried various solutions, including the following, but I get an "index out of range" error.
For example I run: psperf 2 apache
#!/usr/bin/env python

import subprocess, sys, shlex

psval=sys.argv[1] #ps aux val to extract such as CPU etc #2 = %CPU, 3 = %MEM, 4 = VSZ, 5 = RSS
psname=sys.argv[2] #process details/name

#https://stackoverflow.com/questions/6780035/python-how-to-run-ps-cax-grep-something-in-python
proc1 = subprocess.Popen(shlex.split('ps aux'),stdout=subprocess.PIPE)
proc2 = subprocess.Popen(shlex.split('grep', psname),stdin=proc1.stdout,stdout=subprocess.PIPE,stderr=subprocess.PIPE)

proc1.stdout.close() # Allow proc1 to receive a SIGPIPE if proc2 exits.
out,err=proc2.communicate()

#string stdout?
output = (format(out))

#create output list
output = output.split()

#make ps val an integer to enable list location
psval = int(psval)

#extract value to send to zabbix from output list
val = output[psval]

#OUTPUT
print val
Error:
root@Deb764opVPN:~# python /usr/share/zabbix/externalscripts/psperf.py 4 openvpn
Traceback (most recent call last):
  File "/usr/share/zabbix/externalscripts/psperf.py", line 25, in <module>
    val = output[psval]
IndexError: list index out of range
I haven't used the shlex module before, so it is new to me. It was necessary in order to pipe the ps aux command to grep securely, avoiding shell=True, which is a security hazard (http://docs.python.org/2/library/subprocess.html).
I adapted the script from: How to run "ps cax | grep something" in Python?
I believe it is to do with how shlex handles my arguments, but I'm not too sure how to go forward.
Can you help? Specifically, how can I successfully pass a value to the grep command?
I can see this being beneficial to many others who pipe commands.
Regards
Aidan
I carried on researching and solved it using the following:
#!/usr/bin/env python
import subprocess, sys # , shlex
psval=sys.argv[1] #ps aux val to extract such as CPU etc #2 = %CPU, 3 = %MEM, 4 = VSZ, 5 = RSS
psname=sys.argv[2] #process details/name
#http://www.cyberciti.biz/tips/grepping-ps-output-without-getting-grep.html
proc1 = subprocess.Popen(['ps', 'aux'], stdout=subprocess.PIPE)
proc2 = subprocess.Popen(['grep', psname], stdin=proc1.stdout,stdout=subprocess.PIPE)
proc1.stdout.close() # Allow proc1 to receive a SIGPIPE if proc2 exits.
stripres = proc2.stdout.read()
#TEST RESULT
print stripres
#create output list
output = stripres.split()
#make ps val an integer to enable list location
psval = int(psval)
#extract value to send to zabbix from output list
val = output[psval]
#OUTPUT
print val
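As a side note, the grep step can also be done in Python itself, which avoids spawning grep and makes it easy to skip the monitoring script's own process (whose command line also contains the target name). A minimal sketch under the same argument conventions:
#!/usr/bin/env python
# Hypothetical variant: filter the ps output in Python instead of piping to grep.
import os, subprocess, sys

psval = int(sys.argv[1])    # column to extract: 2 = %CPU, 3 = %MEM, 4 = VSZ, 5 = RSS
psname = sys.argv[2]        # process name to look for
own_pid = str(os.getpid())

output = subprocess.check_output(["ps", "aux"]).decode("utf-8", "replace")
rows = [line.split() for line in output.splitlines()]
matches = [cols for cols in rows
           if len(cols) > psval and psname in " ".join(cols) and cols[1] != own_pid]

if not matches:
    sys.exit("no process matching %r found" % psname)

# first matching process; same column numbering as before
print(matches[0][psval])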
Regards
Aidan
I am trying to feed a normal text file to Flume as the source, with HDFS as the sink. The source, channel, and sink show as registered and started, but nothing is arriving in the HDFS output directory. I'm new to Flume; can anyone help me through this?
The contents of my Flume .conf file are:
agent12.sources = source1
agent12.channels = channel1
agent12.sinks = HDFS
agent12.sources.source1.type = exec
agent12.sources.source1.command = tail -F /usr/sap/sample.txt
agent12.sources.source1.channels = channel1
agent12.sinks.HDFS.channels = channel1
agent12.sinks.HDFS.type = hdfs
agent12.sinks.HDFS.hdfs.path= hdfs://172.18.36.248:50070:user/root/xz
agent12.channels.channel1.type =memory
agent12.channels.channel1.capacity = 1000
The agent is started using:
/usr/bin/flume-ng agent -n agent12 -c usr/lib//flume-ng/conf/sample.conf -f /usr/lib/flume-ng/conf/flume-conf.properties.template