Issue with copying a file into HDFS using Flume

I have a file in the local file system which I want to move into HDFS using Flume.
hduser@ubuntu:~$ ls -ltr /home/hduser/Desktop/flume_test_dir/
total 7060
-rwxrw-rw- 1 hduser hduser 7226791 Nov 6 10:31 airports.csv
hduser@ubuntu:~$ hadoop fs -ls hdfs://localhost:54310//user/hduser/flume/spool5
Found 2 items
-rw-r--r-- 1 hduser supergroup 0 2015-11-07 00:20 hdfs://localhost:54310/user/hduser/flume/spool5/FlumeData.1446884442571.tmp
-rw-r--r-- 1 hduser supergroup 137560 2015-11-07 00:21 hdfs://localhost:54310/user/hduser/flume/spool5/FlumeData.1446884464560.tmp
So my actual file size is 7226791 bytes. After Flume runs, it creates two files of size 137560 and 0.
So the problem is that the full file is not getting copied into HDFS, and it is also getting split. I want to move the whole file, and move it as a single file. I am using the configuration below. What change needs to be made there?
#Flume Configuration Starts
# Define a file channel called fileChannel on agent_slave_1
agent_slave_1.channels.fileChannel1_1.type = file
# on Ubuntu FS
agent_slave_1.channels.fileChannel1_1.capacity = 200000
agent_slave_1.channels.fileChannel1_1.transactionCapacity = 1000
# Define a source for agent_slave_1
agent_slave_1.sources.source1_1.type = spooldir
# on Ubuntu FS
#Spooldir in my case is /home/hduser/Desktop/flume_test_dir
agent_slave_1.sources.source1_1.spoolDir = /home/hduser/Desktop/flume_test_dir/
agent_slave_1.sources.source1_1.fileHeader = false
agent_slave_1.sources.source1_1.fileSuffix = .COMPLETED
agent_slave_1.sinks.hdfs-sink1_1.type = hdfs
#Sink is /user/hduser/flume/spool5/ under hdfs
agent_slave_1.sinks.hdfs-sink1_1.hdfs.path = hdfs://localhost:54310//user/hduser/flume/spool5/
agent_slave_1.sinks.hdfs-sink1_1.hdfs.batchSize = 1000
agent_slave_1.sinks.hdfs-sink1_1.hdfs.rollSize = 268435456
agent_slave_1.sinks.hdfs-sink1_1.hdfs.rollInterval = 0
agent_slave_1.sinks.hdfs-sink1_1.hdfs.rollCount = 50000000
agent_slave_1.sinks.hdfs-sink1_1.hdfs.writeFormat = Text
agent_slave_1.sinks.hdfs-sink1_1.hdfs.fileType = DataStream
agent_slave_1.sources.source1_1.channels = fileChannel1_1
agent_slave_1.sinks.hdfs-sink1_1.channel = fileChannel1_1
agent_slave_1.sinks = hdfs-sink1_1
agent_slave_1.sources = source1_1
agent_slave_1.channels = fileChannel1_1
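One direction that is commonly suggested for keeping the output in a single HDFS file (a sketch only, not verified against this exact setup; the property names are standard Flume HDFS sink settings, the values are assumptions) is to disable every roll trigger and let an idle timeout close the file once the spooled input has been drained:
# Disable all roll triggers (0 = never roll on time / size / event count)
agent_slave_1.sinks.hdfs-sink1_1.hdfs.rollInterval = 0
agent_slave_1.sinks.hdfs-sink1_1.hdfs.rollSize = 0
agent_slave_1.sinks.hdfs-sink1_1.hdfs.rollCount = 0
# Close the file (drop the .tmp suffix) after 60 seconds with no new events
agent_slave_1.sinks.hdfs-sink1_1.hdfs.idleTimeout = 60
Note that the spooling directory source still breaks the CSV into line events, so the HDFS file is a reassembled copy rather than a byte-for-byte move.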

Related

NixOS service systemd unit's $PATH does not contain expected dependency

I have the following definition:
hello123 =
  (pkgs.writeScriptBin "finderapp" ''
    #!${pkgs.stdenv.shell}
    # Call hello with a traditional greeting
    ls ${pkgs.ffmpeg-full}/bin/ffmpeg
    ffmpeg --help
    echo hello
  ''
  );
And the service:
systemd.services = {
  abcxyz = {
    enable = true;
    description = "abcxyz";
    serviceConfig = {
      WorkingDirectory = "%h/temp/";
      Type = "simple";
      ExecStart = "${hello123}/bin/finderapp";
      Restart = "always";
      RestartSec = 60;
    };
    wantedBy = [ "default.target" ];
  };
};
However, this fails when it tries to execute ffmpeg:
Jul 10 19:47:54 XenonKiloCranberry systemd[1]: Started abcxyz.
Jul 10 19:47:54 XenonKiloCranberry finderapp[10042]: /nix/store/9yx9s5yjc6ywafadplblzdfaxqimz95w-ffmpeg-full-4.2.3/bin/ffmpeg
Jul 10 19:47:54 XenonKiloCranberry finderapp[10042]: /nix/store/bxfwljbpvl21wsba00z5dm9dmshsk3bx-finderapp/bin/finderapp: line 5: ffmpeg: command not found
Jul 10 19:47:54 XenonKiloCranberry finderapp[10042]: hello
Why is this failing? I assume it's correctly getting ffmpeg as a runtime dependency (verified with nix-store -q --references ...) as stated in another question here: https://stackoverflow.com/a/68330101/1663462
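For reference, the runtime-dependency check mentioned above can be run against the script's store path taken from the log (a sketch; the hash is the one from this particular build, so yours will differ):
nix-store -q --references /nix/store/bxfwljbpvl21wsba00z5dm9dmshsk3bx-finderapp
ffmpeg-full should show up in that list even though it is missing from $PATH.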
If I add an echo $PATH to the script, it outputs the following:
Jul 10 19:53:44 XenonKiloCranberry finderapp[12011]: /nix/store/x0jla3hpxrwz76hy9yckg1iyc9hns81k-coreutils-8.31/bin:/nix/store/97vambzyvpvrd9wgrrw7i7svi0s8vny5-findutils-4.7.0/bin:/nix/store/srmjkp5pq8c055j0lak2hn0ls0fis8yl-gnugrep-3.4/bin:/nix/store/p34p7ysy84579lndk7rbrz6zsfr03y71-gnused-4.8/bin:/nix/store/vfzp1mavwiz5w3v10hs69962k0gwl26c-systemd-243.7/bin:/nix/store/x0jla3hpxrwz76hy9yckg1iyc9hns81k-coreutils-8.31/sbin:/nix/store/97vambzyvpvrd9wgrrw7i7svi0s8vny5-findutils-4.7.0/sbin:/nix/store/srmjkp5pq8c055j0lak2hn0ls0fis8yl-gnugrep-3.4/sbin:/nix/store/p34p7ysy84579lndk7rbrz6zsfr03y71-gnused-4.8/sbin:/nix/store/vfzp1mavwiz5w3v10hs69962k0gwl26c-systemd-243.7/sbin
Or these paths basically:
/nix/store/x0jla3hpxrwz76hy9yckg1iyc9hns81k-coreutils-8.31/bin
/nix/store/97vambzyvpvrd9wgrrw7i7svi0s8vny5-findutils-4.7.0/bin
/nix/store/srmjkp5pq8c055j0lak2hn0ls0fis8yl-gnugrep-3.4/bin
/nix/store/p34p7ysy84579lndk7rbrz6zsfr03y71-gnused-4.8/bin
/nix/store/vfzp1mavwiz5w3v10hs69962k0gwl26c-systemd-243.7/bin
/nix/store/x0jla3hpxrwz76hy9yckg1iyc9hns81k-coreutils-8.31/sbin
/nix/store/97vambzyvpvrd9wgrrw7i7svi0s8vny5-findutils-4.7.0/sbin
/nix/store/srmjkp5pq8c055j0lak2hn0ls0fis8yl-gnugrep-3.4/sbin
/nix/store/p34p7ysy84579lndk7rbrz6zsfr03y71-gnused-4.8/sbin
/nix/store/vfzp1mavwiz5w3v10hs69962k0gwl26c-systemd-243.7/sbin
Which shows that ffmpeg is not in there.
I don't think this is the most elegant solution, as the dependencies have to be known in the service definition instead of in the package/derivation, but it's a solution nonetheless.
We can add additional paths with path = [ pkgs.ffmpeg-full ];:
abcxyz = {
  enable = true;
  description = "abcxyz";
  path = [ pkgs.ffmpeg-full ];
  serviceConfig = {
    WorkingDirectory = "%h/temp/";
    Type = "simple";
    ExecStart = "${hello123}/bin/finderapp";
    Restart = "always";
    RestartSec = 60;
  };
  wantedBy = [ "default.target" ];
};
In addition to the previous answers (not using PATH at all, or adding to PATH via the systemd config), you can add it to PATH inside the wrapper script, making the script more self-sufficient and making the extended PATH available to subprocesses, if ffmpeg or any other command needs it (probably not in this case).
The ls command has no effect on subsequent commands, as it shouldn't.
What you want is to add ffmpeg's bin directory to PATH:
hello123 =
  (pkgs.writeScriptBin "finderapp" ''
    #!${pkgs.stdenv.shell}
    # Call hello with a traditional greeting
    PATH="${pkgs.ffmpeg-full}/bin''${PATH:+:''${PATH}}"
    ffmpeg --help
    echo hello
  ''
  );
The part ${PATH:+:${PATH}} (written as ''${...} in the Nix string so that Nix passes a literal ${...} through to the shell) takes care of the colon and any pre-existing PATH, if there is one. The simplistic :${PATH} could effectively add . to PATH when PATH was empty, although that would be rare.
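A quick shell illustration of that expansion (plain bash, independent of Nix; /new/bin is just a placeholder):
PATH="" ; echo "/new/bin${PATH:+:${PATH}}"          # prints /new/bin (no stray colon)
PATH="/usr/bin" ; echo "/new/bin${PATH:+:${PATH}}"  # prints /new/bin:/usr/bin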

Count the number of running processes with Telegraf

I'm using Telegraf, InfluxDB and Grafana to build a monitoring system for a distributed application. The first thing I want to do is count the number of Java processes running on a machine.
But when I run my query, the number of processes is nearly random (somewhere between 1 and 8 instead of consistently 8).
I think there is a mistake in my Telegraf configuration but I don't see where. I tried changing the interval but nothing changed: it seems InfluxDB doesn't have all the data.
I'm running CentOS 7 and Telegraf v1.5.0 (git: release-1.5 a1668bbf).
All the Java processes I want to count:
[root@localhost ~]# pgrep -f java
10665
10688
10725
10730
11104
11174
16298
22138
My telegraf.conf :
[global_tags]
# Configuration for telegraf agent
[agent]
interval = "5s"
round_interval = true
metric_batch_size = 1000
metric_buffer_limit = 10000
collection_jitter = "0s"
flush_interval = "10s"
flush_jitter = "0s"
precision = ""
debug = true
quiet = false
logfile = "/var/log/telegraf/telegraf.log"
hostname = "my_server"
omit_hostname = false
My input.conf :
# Read metrics about disk usage
[[inputs.disk]]
fielddrop = [ "inodes*" ]
mount_points=["/", "/workspace"]
# File
[[inputs.filestat]]
files = ["myfile.log"]
# Read the number of running java process
[[inputs.procstat]]
user = "root"
pattern = "java"
My query and its response were shown as screenshots (not reproduced here).
If you just want to count PIDs, a good way is to use the exec input like this:
[[inputs.exec]]
commands = ["pgrep -c java"] # command to execute
name_override = "the_name" # measurement name
data_format = "value" # parse the command's stdout as a single value
data_type = "integer"
For the command, use pgrep -c java without the -f option: -f matches against the full command line ("full") and also counts the pgrep command itself (so you end up with almost the same problem as with procstat).
Solution found here
With pattern matching, if the pattern matches multiple PIDs, multiple data points are generated with identical tags and timestamps. When these points are sent to InfluxDB, only the last point is stored.
Example of what may happen with your configuration:
00:00 => pid 1
00:05 => pid 2
00:10 => pid 1
00:15 => pid 5
00:20 => pid 7
00:25 => pid 3
00:30 => pid 3
00:35 => pid 4
00:40 => pid 6
00:45 => pid 7
00:50 => pid 6
00:55 => pid 5
Different PIDs stored over one minute = 7 (PID 8 was not stored a single time).
Since it's random, you sometimes hit all 8 different PIDs in a minute, but most of the time you don't.
To differentiate between processes whose tags are otherwise the same, use pid_tag = true :
[[inputs.procstat]]
user = "root"
pattern = "java"
pid_tag = true
However, if you just want to count the number of processes (and don't care about the stats), just use the exec plugin with a custom command like pgrep -c -f java. This is more efficient than having multiple time series (with pid_tag you end up with one per PID).
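A minimal exec sketch for that counting approach (assuming the value data format; the measurement name java_process_count is just a placeholder):
[[inputs.exec]]
commands = ["pgrep -c -f java"] # prints a single integer
name_override = "java_process_count" # placeholder measurement name
data_format = "value" # parse stdout as one value
data_type = "integer"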

FLUME EXCEPTION

I am trying to configure flume and am following this link. The following command works for me:
flume-ng agent -n TwitterAgent -c conf -f /usr/lib/apache-flume-1.7.0-bin/conf/flume.conf
The result I got, with the error, is:
17/01/31 12:04:08 INFO source.DefaultSourceFactory: Creating instance of source Twitter, type com.cloudera.flume.source.TwitterSource
17/01/31 12:04:08 ERROR node.PollingPropertiesFileConfigurationProvider: Failed to load configuration data.
Exception follows. org.apache.flume.FlumeException:
Unable to load source type:
com.cloudera.flume.source.TwitterSource, class:
com.cloudera.flume.source.TwitterSource.
(This is part of the result, I just copied the error part of it)
Can anyone help me solve this error, please? I need to fix it to move on to step 24, which is the last step.
Please find the CDH 5.12 Flume Twitter setup below:
1. Here is the file /usr/lib/flume-ng/conf/flume.conf:
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS
TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey = xxxxxxxxxxxxxxxxxxxxx
TwitterAgent.sources.Twitter.consumerSecret = xxxxxxxxxxxxxxxxxxxxxx
TwitterAgent.sources.Twitter.accessToken = xxxxxxxxxxxxxxx
TwitterAgent.sources.Twitter.accessTokenSecret = xxxxxxxxxxxxxxxxxx
TwitterAgent.sources.Twitter.keywords = Hadoop,BigData
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://quickstart.cloudera:8020/user/cloudera/flume/tweets/
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000
TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000
TwitterAgent.channels.MemChannel.transactionCapacity = 100
2. Copy the flume-env.sh.template file to flume-env.sh:
~]$ sudo cp /usr/lib/flume-ng/conf/flume-env.sh.template /usr/lib/flume-ng/conf/flume-env.sh
3. Set JAVA_HOME and FLUME_CLASSPATH in flume-env.sh file as:
export JAVA_HOME=/usr/java/jdk1.7.0_67-cloudera
FLUME_CLASSPATH="/usr/lib/flume-ng/lib/flume-sources-1.0-SNAPSHOT.jar"
4. If you don't find /usr/lib/flume-ng/lib/flume-sources-1.0-SNAPSHOT.jar on your system, then download apache-flume-1.6.0-bin and copy its lib folder over the current lib folder.
Link: https://www.apache.org/dist/flume/1.6.0/apache-flume-1.6.0-bin.tar.gz
4.1. Rename the old lib folder.
4.2. Download the above link to your Cloudera desktop and do the following:
~]$ sudo mv /usr/lib/flume-ng/lib /usr/lib/flume-ng/lib_cloudera
~]$ sudo mv /home/cloudera/Desktop/apache-flume-1.6.0-bin/lib /usr/lib/flume-ng/lib
5. Now run Flume Agent Command:
~]$ flume-ng agent --conf-file /usr/lib/flume-ng/conf/flume.conf --name TwitterAgent -Dflume.root.logger=INFO,console -n TwitterAgent
This should run successfully.
All the Best.
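Once the agent is up, one way to check that tweets are landing (a sketch, using the sink path from the config above) is to list the HDFS directory:
~]$ hadoop fs -ls hdfs://quickstart.cloudera:8020/user/cloudera/flume/tweets/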

loading file into hdfs using flume

I want to load a text file from my system into HDFS.
This is my conf file:
agent.sources = seqGenSrc
agent.sinks = loggerSink
agent.channels = memoryChannel
agent.sources.seqGenSrc.type = exec
agent.sources.seqGenSrc.command = tail -F my.system.IP/D:/salespeople.txt
agent.sinks.loggerSink.type = hdfs
agent.sinks.loggerSink.hdfs.path = hdfs://IP.address:port:user/flume
agent.sinks.loggerSink.hdfs.filePrefix = events-
agent.sinks.loggerSink.hdfs.round = true
agent.sinks.loggerSink.hdfs.roundValue = 10
agent.sinks.loggerSink.hdfs.roundUnit = minute
agent.channels.memoryChannel.type = memory
agent.channels.memoryChannel.capacity = 1000
agent.channels.memoryChannel.transactionCapacity = 100
agent.sources.seqGenSrc.channels = memoryChannel
agent.sinks.loggerSink.channel = memoryChannel
When I run it, I get the following, and then it gets stuck:
13/07/23 16:30:44 INFO nodemanager.DefaultLogicalNodeManager: Starting Channel memoryChannel
13/07/23 16:30:44 INFO nodemanager.DefaultLogicalNodeManager: Waiting for channel:
memoryChannel to start. Sleeping for 500 ms
13/07/23 16:30:44 INFO nodemanager.DefaultLogicalNodeManager: Starting Sink loggerSink
13/07/23 16:30:44 INFO nodemanager.DefaultLogicalNodeManager: Starting Source seqGenSrc
13/07/23 16:30:44 INFO source.ExecSource: Exec source starting with command:tail -F 10.48.226.27/D:/salespeople.txt
Where am I wrong, or what could be the error?
I assume you want to write your file to /user/flume, so your path should be :
agent.sinks.loggerSink.hdfs.path = hdfs://IP.address:port/user/flume
As your agent uses tail -F, there is no message that tells you it is finished (because it never is). If you want to know whether your file has been created, you have to look at the /user/flume folder.
I'm using a configuration like yours and it works perfectly. You could also try adding
-Dflume.root.logger=INFO,console to the command line to get more information.
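For example, the logger option can be appended to the usual agent command (a sketch; the config file path is a placeholder and the agent name matches this question's agent):
flume-ng agent -n agent -c conf -f /path/to/your-flume.conf -Dflume.root.logger=INFO,console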

Not able to get output in hdfs directory using hdfs as sink in flume

I am trying to give a normal text file to Flume as the source, with HDFS as the sink. The source, channel and sink show as registered and started, but nothing is coming into the HDFS output directory. I'm new to Flume; can anyone help me through this?
The configuration in my Flume .conf file is:
agent12.sources = source1
agent12.channels = channel1
agent12.sinks = HDFS
agent12.sources.source1.type = exec
agent12.sources.source1.command = tail -F /usr/sap/sample.txt
agent12.sources.source1.channels = channel1
agent12.sinks.HDFS.channels = channel1
agent12.sinks.HDFS.type = hdfs
agent12.sinks.HDFS.hdfs.path= hdfs://172.18.36.248:50070:user/root/xz
agent12.channels.channel1.type =memory
agent12.channels.channel1.capacity = 1000
The agent is started using:
/usr/bin/flume-ng agent -n agent12 -c usr/lib//flume-ng/conf/sample.conf -f /usr/lib/flume-ng/conf/flume-conf.properties.template
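For reference, a couple of things in this setup commonly need adjusting (a sketch, with assumptions: the NameNode RPC port is taken as the default 8020, since 50070 is the HDFS web UI port). The hdfs.path should be host:port followed by /path, and the sink's channel property is singular:
agent12.sinks.HDFS.channel = channel1
agent12.sinks.HDFS.hdfs.path = hdfs://172.18.36.248:8020/user/root/xz
Also, flume-ng expects -c to point at the configuration directory and -f at the agent configuration file:
/usr/bin/flume-ng agent -n agent12 -c /usr/lib/flume-ng/conf -f /usr/lib/flume-ng/conf/sample.conf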
