Loading a file into HDFS using Flume

I want to load a text file from my system into HDFS.
This is my conf file:
agent.sources = seqGenSrc
agent.sinks = loggerSink
agent.channels = memoryChannel
agent.sources.seqGenSrc.type = exec
agent.sources.seqGenSrc.command = tail -F my.system.IP/D:/salespeople.txt
agent.sinks.loggerSink.type = hdfs
agent.sinks.loggerSink.hdfs.path = hdfs://IP.address:port:user/flume
agent.sinks.loggerSink.hdfs.filePrefix = events-
agent.sinks.loggerSink.hdfs.round = true
agent.sinks.loggerSink.hdfs.roundValue = 10
agent.sinks.loggerSink.hdfs.roundUnit = minute
agent.channels.memoryChannel.type = memory
agent.channels.memoryChannel.capacity = 1000
agent.channels.memoryChannel.transactionCapacity = 100
agent.sources.seqGenSrc.channels = memoryChannel
agent.sinks.loggerSink.channel = memoryChannel
When I run it, I get the following, and then it gets stuck:
13/07/23 16:30:44 INFO nodemanager.DefaultLogicalNodeManager: Starting Channel memoryChannel
13/07/23 16:30:44 INFO nodemanager.DefaultLogicalNodeManager: Waiting for channel:
memoryChannel to start. Sleeping for 500 ms
13/07/23 16:30:44 INFO nodemanager.DefaultLogicalNodeManager: Starting Sink loggerSink
13/07/23 16:30:44 INFO nodemanager.DefaultLogicalNodeManager: Starting Source seqGenSrc
13/07/23 16:30:44 INFO source.ExecSource: Exec source starting with command:tail -F 10.48.226.27/D:/salespeople.txt
Where am I wrong, or what could be the error?

I assume you want to write your file to /user/flume, so your path should be:
agent.sinks.loggerSink.hdfs.path = hdfs://IP.address:port/user/flume
Since your agent uses tail -F, there is no message that tells you it is finished (because it never is). If you want to know whether your file has been created, you have to look in the /user/flume folder.
I'm using a configuration like yours and it works perfectly. You could try adding
-Dflume.root.logger=INFO,console to get more information.
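For reference, here is a minimal sketch of the corrected sink line and the launch command. The agent name agent comes from the configuration above; the file name my.conf and IP.address:port are placeholders to adjust for your NameNode and your own conf file:

# corrected HDFS sink path: a slash (not a colon) between the port and the directory
agent.sinks.loggerSink.hdfs.path = hdfs://IP.address:port/user/flume

# start the agent with console logging to make debugging easier
flume-ng agent -n agent -c conf -f my.conf -Dflume.root.logger=INFO,console

Once events start flowing, you can check the output with hadoop fs -ls /user/flume.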

Related

How to monitor Apache Flume agents status?

I know the enterprise way (Cloudera, for example): with CM (via browser) or the Cloudera REST API, one can access monitoring and configuration facilities.
But how do you schedule (run and rerun) the Flume agents' lifecycle, and monitor their running/failure status, without CM? Are there such things in the Flume distribution?
Flume's JSON Reporting API can be used to monitor health and performance.
I tried adding flume.monitoring.type/port to flume-ng at startup, and it completely fits my needs.
Let's create a simple agent a1, for example, which listens on localhost:44444 and logs to the console as a sink:
# flume.conf
a1.sources = s1
a1.channels = c1
a1.sinks = d1
a1.sources.s1.channels = c1
a1.sources.s1.type = netcat
a1.sources.s1.bind = localhost
a1.sources.s1.port = 44444
a1.sinks.d1.channel = c1
a1.sinks.d1.type = logger
a1.channels.c1.type = memory
a1.channels.c1.capacity = 100
a1.channels.c1.transactionCapacity = 10
Run it with the additional parameters flume.monitoring.type and flume.monitoring.port:
flume-ng agent -n a1 -c conf -f flume.conf -Dflume.root.logger=INFO,console -Dflume.monitoring.type=http -Dflume.monitoring.port=44123
Then monitor the output in a browser at localhost:44123/metrics:
{"CHANNEL.c1":{"ChannelCapacity":"100","ChannelFillPercentage":"0.0","Type":"CHANNEL","EventTakeSuccessCount":"570448","ChannelSize":"0","EventTakeAttemptCount":"570573","StartTime":"1567002601836","EventPutAttemptCount":"570449","EventPutSuccessCount":"570448","StopTime":"0"}}
Just try some load:
dd if=/dev/urandom count=1024 bs=1024 | base64 | nc localhost 44444
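The same metrics can also be scraped from a script instead of a browser, which is handy for cron-based health checks. A minimal sketch, assuming curl and python are available on the host and that 44123 is the monitoring port configured above (the 90% threshold is just an illustrative value):

# fetch the JSON metrics and pretty-print them
curl -s http://localhost:44123/metrics | python -m json.tool

# exit non-zero if channel c1 is more than 90% full
curl -s http://localhost:44123/metrics | python -c 'import sys, json; m = json.load(sys.stdin); sys.exit(1 if float(m["CHANNEL.c1"]["ChannelFillPercentage"]) > 90 else 0)'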

FLUME EXCEPTION

I am trying to configure Flume and am following this link. The following is the command I run:
flume-ng agent -n TwitterAgent -c conf -f /usr/lib/apache-flume-1.7.0-bin/conf/flume.conf
The result I got, containing an error, is:
17/01/31 12:04:08 INFO source.DefaultSourceFactory: Creating instance of source Twitter, type com.cloudera.flume.source.TwitterSource
17/01/31 12:04:08 ERROR node.PollingPropertiesFileConfigurationProvider: Failed to load configuration data.
Exception follows. org.apache.flume.FlumeException:
Unable to load source type:
com.cloudera.flume.source.TwitterSource, class:
com.cloudera.flume.source.TwitterSource.
(This is only part of the result; I copied just the error portion.)
Can anyone help solve this error, please? I need to fix it to get to step 24, which is the last step.
Please find the CDH 5.12 Flume Twitter setup below:
1. Here is file /usr/lib/flume-ng/conf/flume.conf:
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS
TwitterAgent.sources.Twitter.type= com.cloudera.flume.source.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey = xxxxxxxxxxxxxxxxxxxxx
TwitterAgent.sources.Twitter.consumerSecret = xxxxxxxxxxxxxxxxxxxxxx
TwitterAgent.sources.Twitter.accessToken = xxxxxxxxxxxxxxx
TwitterAgent.sources.Twitter.accessTokenSecret = xxxxxxxxxxxxxxxxxx
TwitterAgent.sources.Twitter.keywords = Hadoop,BigData
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://quickstart.cloudera:8020/user/cloudera/flume/tweets/
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000
TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000
TwitterAgent.channels.MemChannel.transactionCapacity = 100
2. Copy the flume-env.sh.template file to flume-env.sh:
~]$ sudo cp /usr/lib/flume-ng/conf/flume-env.sh.template /usr/lib/flume-ng/conf/flume-env.sh
3. Set JAVA_HOME and FLUME_CLASSPATH in the flume-env.sh file as follows:
export JAVA_HOME=/usr/java/jdk1.7.0_67-cloudera
FLUME_CLASSPATH="/usr/lib/flume-ng/lib/flume-sources-1.0-SNAPSHOT.jar"
4. If you don't find /usr/lib/flume-ng/lib/flume-sources-1.0-SNAPSHOT.jar on your system, then download the apache-flume-1.6.0-bin distribution and use its lib folder in place of the current lib folder.
Link: https://www.apache.org/dist/flume/1.6.0/apache-flume-1.6.0-bin.tar.gz
4.1. Rename the old lib folder.
4.2. Download the archive from the link above to your Cloudera desktop and do the following:
~]$ sudo mv /usr/lib/flume-ng/lib /usr/lib/flume-ng/lib_cloudera
~]$ sudo mv /home/cloudera/Desktop/apache-flume-1.6.0-bin/lib /usr/lib/flume-ng/lib
5. Now run the Flume agent command:
~]$ flume-ng agent --conf-file /usr/lib/flume-ng/conf/flume.conf --name TwitterAgent -Dflume.root.logger=INFO,console -n TwitterAgent
This should run successfully.
All the Best.
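As a quick sanity check once the agent is up, you can confirm that tweet files are being written to the HDFS path from the configuration above. This is only a sketch; the FlumeData.* pattern is illustrative, so pick an actual file name from the listing:

# list the files the HDFS sink has created so far
hadoop fs -ls hdfs://quickstart.cloudera:8020/user/cloudera/flume/tweets/

# peek at the beginning of the collected events
hadoop fs -cat /user/cloudera/flume/tweets/FlumeData.* | head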

The run result of Flume and how to test Flume

[screenshots of the flume-ng run output]
my flume configuration file is as follows:
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.channels = c1
a1.sources.r1.command = tail -F /home/hadoop/flume-1.5.0-bin/log_exec_tail
# Describe the sink
a1.sinks.k1.type = logger
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
And I start my Flume agent with the following script:
bin/flume-ng agent -n a1 -c conf -f conf/flume_log.conf -Dflume.root.logger=INFO,console
Question 1: The run result is shown in the screenshots above; however, I don't know whether it ran successfully or not!
Question 2: There are also the following instructions, and I don't understand what they mean with regard to testing Flume:
NOTE: To test that the Flume agent is running properly, open a new terminal window and change directories to /home/horton/solutions/:
horton@ip:~$ cd /home/horton/solutions/
Run the following script, which writes log entries to nodemanager.log:
$ ./test_flume_log.sh
If successful, you should see new files in the /user/horton/flume_sink directory in HDFS
Stop the logagent Flume agent
As per your Flume configuration, whenever the file /home/hadoop/flume-1.5.0-bin/log_exec_tail changes, the exec source will tail it and append the results to the console.
So, to test that it is working correctly:
1. Run the command bin/flume-ng agent -n a1 -c conf -f conf/flume_log.conf -Dflume.root.logger=INFO,console
2. Open another terminal and add a few lines to the file /home/hadoop/flume-1.5.0-bin/log_exec_tail (for example, as shown below)
3. Save it
4. Now check the terminal where you launched the flume-ng command
5. You should see the newly added lines displayed
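A minimal way to generate those test lines (a sketch; any editor or command that appends to the tailed file will do):

# append a timestamped line to the file the exec source is tailing
echo "test event $(date)" >> /home/hadoop/flume-1.5.0-bin/log_exec_tail

# or append several lines at once
seq 1 10 | sed 's/^/test event /' >> /home/hadoop/flume-1.5.0-bin/log_exec_tail

Each appended line should show up as a logged event in the agent's console within a second or two.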

Issue with copying file into HDFS using FLUME

I have a file in local file system which I want to move in HDFS using FLUME.
hduser@ubuntu:~$ ls -ltr /home/hduser/Desktop/flume_test_dir/
total 7060
-rwxrw-rw- 1 hduser hduser 7226791 Nov 6 10:31 airports.csv
hduser@ubuntu:~$ hadoop fs -ls hdfs://localhost:54310//user/hduser/flume/spool5
Found 2 items
-rw-r--r-- 1 hduser supergroup 0 2015-11-07 00:20 hdfs://localhost:54310/user/hduser/flume/spool5/FlumeData.1446884442571.tmp
-rw-r--r-- 1 hduser supergroup 137560 2015-11-07 00:21 hdfs://localhost:54310/user/hduser/flume/spool5/FlumeData.1446884464560.tmp
So my actual file size is 7226791 bytes. After Flume executes, it creates two files of size 137560 and 0.
So the problem is that the full file is not getting copied into HDFS, and it is also getting split. I want to move the full file as a single file. I am using the configuration below. What changes need to be made?
#Flume Configuration Starts
# Define a file channel called fileChannel on agent_slave_1
agent_slave_1.channels.fileChannel1_1.type = file
# on Ubuntu FS
agent_slave_1.channels.fileChannel1_1.capacity = 200000
agent_slave_1.channels.fileChannel1_1.transactionCapacity = 1000
# Define a source for agent_slave_1
agent_slave_1.sources.source1_1.type = spooldir
# on Ubuntu FS
#Spooldir in my case is /home/hduser/Desktop/flume_test_dir
agent_slave_1.sources.source1_1.spoolDir = /home/hduser/Desktop/flume_test_dir/
agent_slave_1.sources.source1_1.fileHeader = false
agent_slave_1.sources.source1_1.fileSuffix = .COMPLETED
agent_slave_1.sinks.hdfs-sink1_1.type = hdfs
#Sink is /user/hduser/flume/spool5/ under hdfs
agent_slave_1.sinks.hdfs-sink1_1.hdfs.path = hdfs://localhost:54310//user/hduser/flume/spool5/
agent_slave_1.sinks.hdfs-sink1_1.hdfs.batchSize = 1000
agent_slave_1.sinks.hdfs-sink1_1.hdfs.rollSize = 268435456
agent_slave_1.sinks.hdfs-sink1_1.hdfs.rollInterval = 0
agent_slave_1.sinks.hdfs-sink1_1.hdfs.rollCount = 50000000
agent_slave_1.sinks.hdfs-sink1_1.hdfs.writeFormat=Text
agent_slave_1.sinks.hdfs-sink1_1.hdfs.fileType = DataStream
agent_slave_1.sources.source1_1.channels = fileChannel1_1
agent_slave_1.sinks.hdfs-sink1_1.channel = fileChannel1_1
agent_slave_1.sinks = hdfs-sink1_1
agent_slave_1.sources = source1_1
agent_slave_1.channels = fileChannel1_1
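For what it's worth, the HDFS sink only renames its .tmp files into final files when it rolls or closes them, so the roll settings largely decide how the output gets cut up. Below is a hedged sketch of the sink properties often used to end up with roughly one closed file per spooled input; the values are illustrative, not a confirmed fix for this setup:

# disable size-, count-, and time-based rolling...
agent_slave_1.sinks.hdfs-sink1_1.hdfs.rollSize = 0
agent_slave_1.sinks.hdfs-sink1_1.hdfs.rollCount = 0
agent_slave_1.sinks.hdfs-sink1_1.hdfs.rollInterval = 0
# ...and close (finalize) an output file after it has been idle for a while (seconds)
agent_slave_1.sinks.hdfs-sink1_1.hdfs.idleTimeout = 60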

Not able to get output in hdfs directory using hdfs as sink in flume

I am trying to give a normal text file to Flume as the source, with HDFS as the sink. The source, channel, and sink all show as registered and started, but nothing is appearing in the HDFS output directory. I'm new to Flume; can anyone help me through this?
The contents of my Flume .conf file are:
agent12.sources = source1
agent12.channels = channel1
agent12.sinks = HDFS
agent12.sources.source1.type = exec
agent12.sources.source1.command = tail -F /usr/sap/sample.txt
agent12.sources.source1.channels = channel1
agent12.sinks.HDFS.channels = channel1
agent12.sinks.HDFS.type = hdfs
agent12.sinks.HDFS.hdfs.path= hdfs://172.18.36.248:50070:user/root/xz
agent12.channels.channel1.type =memory
agent12.channels.channel1.capacity = 1000
The agent is started using:
/usr/bin/flume-ng agent -n agent12 -c usr/lib//flume-ng/conf/sample.conf -f /usr/lib/flume-ng/conf/flume-conf.properties.template
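One thing to compare against the first answer above: the hdfs.path here uses a colon before the user directory, whereas the working form is host, port, then a slash-separated path, and a sink takes a single channel property. A hedged sketch of what those lines usually look like; the 8020 port is an assumption (use whatever fs.defaultFS reports, not the 50070 web UI port):

# sink binds to one channel; path is hdfs://<namenode-host>:<rpc-port>/<dir>
agent12.sinks.HDFS.channel = channel1
agent12.sinks.HDFS.hdfs.path = hdfs://172.18.36.248:8020/user/root/xz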
