Apache Flume agent not starting but not showing an error

I'm attempting to run an Apache Flume agent on an AWS EC2 cluster, but when I start the agent it neither starts nor throws an obvious error.
I'm just starting with the simple example from Apache's documentation.
When I run:
ubuntu@ip-172-31-41-5:~/Flume$ ./bin/flume-ng agent --conf conf --conf-file example.conf --name a1 -Dflume.root.logger=DEBUG,console
The console output is the following:
Info: Sourcing environment configuration script /home/ubuntu/Flume/conf/flume-env.sh
Info: Including HBASE libraries found via (/home/ubuntu/hbase-2.4.4/bin/hbase) for HBASE access
Info: Including Hive libraries found via () for Hive access
+ exec /usr/lib/jvm/java-8-openjdk-amd64/bin/java -Xmx20m -Dflume.root.logger=DEBUG,console -Dflume.root.logger=DEBUG,console -cp '/home/ubuntu/Flume/conf:/home/ubuntu/Flume/lib/*:/home/ubuntu/hbase-2.4.4/conf:/usr/lib/jvm/java-8-openjdk-amd64/lib/tools.jar:/home/ubuntu/hbase-2.4.4:/home/ubuntu/hbase-2.4.4/lib/shaded-clients/hbase-shaded-client-2.4.4.jar:/home/ubuntu/hbase-2.4.4/lib/client-facing-thirdparty/audience-annotations-0.5.0.jar:/home/ubuntu/hbase-2.4.4/lib/client-facing-thirdparty/commons-logging-1.2.jar:/home/ubuntu/hbase-2.4.4/lib/client-facing-thirdparty/htrace-core4-4.2.0-incubating.jar:/home/ubuntu/hbase-2.4.4/lib/client-facing-thirdparty/log4j-1.2.17.jar:/home/ubuntu/hbase-2.4.4/lib/client-facing-thirdparty/slf4j-api-1.7.30.jar:/home/ubuntu/hbase-2.4.4/conf:/lib/*' -Djava.library.path=:/usr/java/packages/lib/amd64:/usr/lib/x86_64-linux-gnu/jni:/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu:/usr/lib/jni:/lib:/usr/lib org.apache.flume.node.Application --conf-file example.conf --name a1 ./bin/flume-ng agent --conf-file example.conf --name a1
The agent doesn't throw an error, but it never gets further than this. I have also tried some variations, including --conf-file conf/example.conf.
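For example, one of those variations was the same command with the config path pointing into the conf directory:
ubuntu@ip-172-31-41-5:~/Flume$ ./bin/flume-ng agent --conf conf --conf-file conf/example.conf --name a1 -Dflume.root.logger=DEBUG,console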
Flume and Java appear to be installed correctly:
Flume
ubuntu@ip-172-31-41-5:~/Flume$ ./bin/flume-ng version
Source code repository: https://git.apache.org/repos/asf/flume.git
Revision: 1a15927e594fd0d05a59d804b90a9c31ec93f5e1
Compiled by rgoers on Sun Oct 16 14:44:15 MST 2022
From source with checksum bbbca682177262aac3a89defde369a37
Java
ubuntu@ip-172-31-41-5:~/Flume$ java -version
openjdk version "11.0.17" 2022-10-18
OpenJDK Runtime Environment (build 11.0.17+8-post-Ubuntu-1ubuntu222.04)
OpenJDK 64-Bit Server VM (build 11.0.17+8-post-Ubuntu-1ubuntu222.04, mixed mode, sharing)
Example.conf
# example.conf: A single-node Flume configuration
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
# Describe the sink
a1.sinks.k1.type = logger
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
flume-env.sh
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# If this file is placed at FLUME_CONF_DIR/flume-env.sh, it will be sourced
# during Flume startup.
# Enviroment variables can be set here.
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
# Give Flume more memory and pre-allocate, enable remote monitoring via JMX
# export JAVA_OPTS="-Xms100m -Xmx2000m -Dcom.sun.management.jmxremote"
# Let Flume write raw event data and configuration information to its log files for debugging
# purposes. Enabling these flags is not recommended in production,
# as it may result in logging sensitive user information or encryption secrets.
# export JAVA_OPTS="$JAVA_OPTS -Dorg.apache.flume.log.rawdata=true -Dorg.apache.flume.log.printconfig=true "
# Note that the Flume conf directory is always included in the classpath.
#FLUME_CLASSPATH=""
The only clue that I have is in flume.log, which shows the following error. I've even copied example.conf into the main Flume directory, but it doesn't seem to make a difference.
03 Dec 2022 21:10:03,538 ERROR [main] (org.apache.flume.node.Application.main:506) - A fatal error occurred while running. Exception follows.
org.apache.flume.conf.ConfigurationException: Unable to read file /home/ubuntu/Flume/example.conf
    at org.apache.flume.node.FileConfigurationSource.<init>(FileConfigurationSource.java:52) ~[flume-ng-node-1.11.0.jar:1.11.0]
    at org.apache.flume.node.FileConfigurationSourceFactory.createConfigurationSource(FileConfigurationSourceFactory.java:40) ~[flume-ng-node-1.11.0.jar:1.11.0]
    at org.apache.flume.node.ConfigurationSourceFactory.getConfigurationSource(ConfigurationSourceFactory.java:39) ~[flume-ng-node-1.11.0.jar:1.11.0]
    at org.apache.flume.node.Application.main(Application.java:476) ~[flume-ng-node-1.11.0.jar:1.11.0]
Caused by: java.nio.file.NoSuchFileException: /home/ubuntu/Flume/example.conf
    at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86) ~[?:1.8.0_352]
    at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102) ~[?:1.8.0_352]
    at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107) ~[?:1.8.0_352]
    at sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:214) ~[?:1.8.0_352]
    at java.nio.file.Files.newByteChannel(Files.java:361) ~[?:1.8.0_352]
    at java.nio.file.Files.newByteChannel(Files.java:407) ~[?:1.8.0_352]
    at java.nio.file.Files.readAllBytes(Files.java:3152) ~[?:1.8.0_352]
    at org.apache.flume.node.FileConfigurationSource.<init>(FileConfigurationSource.java:49) ~[flume-ng-node-1.11.0.jar:1.11.0]
    ... 3 more

Related

How do I capture hs_err_pid.log in container environments

When I run my jar application, the JVM crashes with the error:
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x0000000000003fd6, pid=7, tid=0x00007ff273404b38
#
# JRE version: OpenJDK Runtime Environment (Zulu 8.60.0.22-SA-linux-musl-x64) (8.0_322-b06) (build 1.8.0_322-b06)
# Java VM: OpenJDK 64-Bit Server VM (25.322-b06 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C 0x0000000000003fd6
#
# Core dump written. Default location: /usr/app/core or core.7
#
# An error report file with more information is saved as:
# /usr/app/hs_err_pid7.log
#
# If you would like to submit a bug report, please visit:
# http://www.azul.com/support/
#
The error says that a report file is saved as hs_err_pid7.log, but I'm not able to view it because as soon as the JVM crashes, the container stops running. Is there a way to save this file outside the container?
Thanks
Try to mount a persistent volume rather than a bind volume!
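A rough sketch of that idea, assuming the image is called my-java-app (a placeholder) and the JVM writes its report to /usr/app as the crash output above indicates: mount a named volume at that path so the file outlives the container, then read it back from a short-lived container.
docker volume create crashdumps
docker run -v crashdumps:/usr/app my-java-app
# after the crash, read the report from a throwaway container
docker run --rm -v crashdumps:/usr/app alpine cat /usr/app/hs_err_pid7.log
If you prefer not to mount over /usr/app, the JVM flag -XX:ErrorFile=<path> can redirect the report into whatever directory you do mount.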

Flink Statefun netty DisconnectedException

I am trying to run a Flink Statefun (version 3.2.0) application on my local machine using Docker, with a single task manager and a single job manager. The application is a pipeline of multiple services that communicate with each other by sending messages through Kafka to HTTP function endpoints served with aiohttp and gunicorn. At the beginning of the pipeline is a service that pulls results from Amazon S3 and sends them to the rest of the pipeline, at a rate of about 8000-10000 requests per minute.
When I run it, it runs successfully at first, but looking at the Docker logs for the Flink worker (task manager) container, I repeatedly see these warnings:
2022-03-18 17:35:43,315 WARN org.apache.flink.statefun.flink.core.nettyclient.NettyRequest [] - Exception caught while trying to deliver a message: (attempt #0)ToFunctionRequestSummary(address=Address(analytics-transformer, dispatch, 77ce0dcb-347c-4c03-bc32-f7ebb734b930), batchSize=1, totalSizeInBytes=1323, numberOfStates=0)
org.apache.flink.statefun.flink.core.nettyclient.exceptions.DisconnectedException: Disconnected
18:25:27,594 WARN org.apache.flink.statefun.flink.core.nettyclient.NettyRequest [] - Exception caught while trying to deliver a message: (attempt #0)ToFunctionRequestSummary(address=Address(web, statefun, 82936819-b3d9-4a24-b4eb-81a189d6306c), batchSize=1, totalSizeInBytes=1434, numberOfStates=0)
org.apache.flink.shaded.netty4.io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer
Eventually I see this warning as well:
2022-03-18 18:06:44,848 WARN org.apache.flink.statefun.flink.core.nettyclient.NettyRequest [] - Exception caught while trying to deliver a message: (attempt #0)ToFunctionRequestSummary(address=Address(web, statefun, f004409f-77be-433c-8ab1-ae5f9dad605c), batchSize=1, totalSizeInBytes=1172, numberOfStates=0)
java.lang.IllegalStateException: FixedChannelPool was closed
And after some time the flink master fails due to request timeout and has to restart:
org.apache.flink.statefun.flink.core.functions.StatefulFunctionInvocationException: An error occurred when attempting to invoke function FunctionType(analytics-transformer, dispatch).
at org.apache.flink.statefun.flink.core.functions.StatefulFunction.receive(StatefulFunction.java:50) ~[statefun-flink-core.jar:3.2.0]
at org.apache.flink.statefun.flink.core.functions.ReusableContext.apply(ReusableContext.java:74) ~[statefun-flink-core.jar:3.2.0]
at org.apache.flink.statefun.flink.core.functions.LocalFunctionGroup.processNextEnvelope(LocalFunctionGroup.java:60) ~[statefun-flink-core.jar:3.2.0]
at org.apache.flink.statefun.flink.core.functions.Reductions.processEnvelopes(Reductions.java:164) ~[statefun-flink-core.jar:3.2.0]
at org.apache.flink.statefun.flink.core.functions.AsyncSink.drainOnOperatorThread(AsyncSink.java:119) ~[statefun-flink-core.jar:3.2.0]
at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.runThrowing(StreamTaskActionExecutor.java:50) ~[flink-dist_2.12-1.14.3.jar:1.14.3]
at org.apache.flink.streaming.runtime.tasks.mailbox.Mail.run(Mail.java:90) ~[flink-dist_2.12-1.14.3.jar:1.14.3]
at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMailsWhenDefaultActionUnavailable(MailboxProcessor.java:338) ~[flink-dist_2.12-1.14.3.jar:1.14.3]
at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMail(MailboxProcessor.java:324) ~[flink-dist_2.12-1.14.3.jar:1.14.3]
at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:201) ~[flink-dist_2.12-1.14.3.jar:1.14.3]
at org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:809) ~[flink-dist_2.12-1.14.3.jar:1.14.3]
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:761) ~[flink-dist_2.12-1.14.3.jar:1.14.3]
at org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:958) ~[flink-dist_2.12-1.14.3.jar:1.14.3]
at org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:937) ~[flink-dist_2.12-1.14.3.jar:1.14.3]
at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:766) ~[flink-dist_2.12-1.14.3.jar:1.14.3]
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:575) ~[flink-dist_2.12-1.14.3.jar:1.14.3]
at java.lang.Thread.run(Unknown Source) ~[?:?]
Caused by: java.lang.IllegalStateException: Failure forwarding a message to a remote function Address(analytics-transformer, dispatch, 77d07eb3-f499-4265-a456-b0f75d738830)
at org.apache.flink.statefun.flink.core.reqreply.RequestReplyFunction.onAsyncResult(RequestReplyFunction.java:170) ~[statefun-flink-core.jar:3.2.0]
at org.apache.flink.statefun.flink.core.reqreply.RequestReplyFunction.invoke(RequestReplyFunction.java:124) ~[statefun-flink-core.jar:3.2.0]
at org.apache.flink.statefun.flink.core.functions.StatefulFunction.receive(StatefulFunction.java:48) ~[statefun-flink-core.jar:3.2.0]
... 16 more
Caused by: org.apache.flink.statefun.flink.core.nettyclient.exceptions.RequestTimeoutException
I am guessing this is a load issue due to the number of incoming requests, where the worker is unable to handle them all. This is what I have configured for each of the HTTP function endpoints in the module.yaml:
spec:
  functions: <function>
  urlPathTemplate: <url>
  transport:
    type: io.statefun.transports.v1/async
    call: 15min
    connect: 15min
    pool_ttl: 45s
    pool_size: 1024
    payload_max_bytes: 33554432
I find that decreasing the pool size to a small value like 20 reduces the number of warnings, but then later on I see this warning a lot:
2022-03-18 15:44:52,566 WARN org.apache.flink.statefun.flink.core.nettyclient.NettyRequest [] - Exception caught while trying to deliver a message: (attempt #0)ToFunctionRequestSummary(address=Address(analytics-transformer, dispatch, 7facef98-b659-442e-846f-4e4d45559555), batchSize=1, totalSizeInBytes=739, numberOfStates=0)
org.apache.flink.shaded.netty4.io.netty.channel.pool.FixedChannelPool$AcquireTimeoutException: Acquire operation took longer then configured maximum time
which I'm assuming means that connections could not be established before the timeout, due to the small pool size relative to the large number of requests.
Here is the flink-conf.yaml:
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This file is the base for the Apache Flink configuration
statefun.flink-job-name: Statefun Application
#==============================================================================
# Configurations strictly required by Stateful Functions. Do not change.
#==============================================================================
classloader.parent-first-patterns.additional: org.apache.flink.statefun;org.apache.kafka;com.google.protobuf
#==============================================================================
# Fault tolerance, checkpointing and recovery.
# For more related configuration options, please see: https://ci.apache.org/projects/flink/flink-docs-master/ops/config.html#fault-tolerance
#==============================================================================
# Uncomment the below to enable checkpointing for your application
#execution.checkpointing.mode: EXACTLY_ONCE
#execution.checkpointing.interval: 5sec
restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 2147483647
restart-strategy.fixed-delay.delay: 1sec
state.backend.local-recovery: true
state.backend: rocksdb
state.backend.rocksdb.timer-service.factory: ROCKSDB
state.backend.rocksdb.localdir: /local/state/rocksdb
state.backend.rocksdb.memory.partitioned-index-filters: true
state.backend.rocksdb.checkpoint.transfer.thread.num: 8
state.backend.rocksdb.thread.num: 4
state.checkpoints.dir: file:///checkpoint-dir
state.backend.incremental: true
taskmanager.state.local.root-dirs: file:///local/state/recovery
#==============================================================================
# Recommended memory configurations. Users may change according to their needs.
#==============================================================================
jobmanager.memory.process.size: 1g
taskmanager.memory.process.size: 4g
#==============================================================================
# Support easy upgrades as the module.yaml file updates
#==============================================================================
pipeline.auto-generate-uids: false
execution.savepoint.ignore-unclaimed-state: true
statefun.async.max-per-task: 163840
execution.checkpointing.mode: EXACTLY_ONCE
execution.checkpointing.interval: 5sec
I have also tried increasing the number of gunicorn workers and setting taskmanager.network.netty.server.numThreads to 100 in the flink-conf.yaml, but this does not seem to fix the issue.
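For reference, the extra line I added to flink-conf.yaml for that experiment was:
taskmanager.network.netty.server.numThreads: 100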

Docker container Application logs to ELK stack without filebeat

I'm using Elastic Cloud as it appears to be the most suitable option for quickly setting up application logging. I have 24 Docker containers running on different nodes, and some containers also have a number of replicas. I want to export the logs from inside the Docker containers to the ELK stack. I don't want to install Filebeat in each of my containers because that seems to go directly against Docker's separation-of-duties mantra.
How do I get logs from my application containers to the Logstash server?
You can send your syslog to Logstash by configuring rsyslogd like this
# /etc/rsyslog.d/99-ship-syslog.conf
*.*;syslog;auth,authpriv.none action(
type="omfwd"
Target="myremote.elk-server.net"
Port="5001"
Protocol="udp"
)
If you don't have rsyslog running yet, you can add it like so (alpine linux example):
# Dockerfile
FROM alpine:3.7
RUN apk update \
&& apk add rsyslog
COPY rsyslog.conf /etc/rsyslog.conf
EXPOSE 514 514/udp
VOLUME [ "/var/log", "/etc/rsyslog.d" ]
ENTRYPOINT [ "rsyslogd", "-n" ]
--
# rsyslogd.conf
#
# if you experience problems, check:
# http://www.rsyslog.com/troubleshoot
#### MODULES ####
module(load="imuxsock") # local system logging support (e.g. via logger command)
#module(load="imklog") # kernel logging support (previously done by rklogd)
module(load="immark") # --MARK-- message support
module(load="imudp") # UDP listener support
input(type="imudp" port="514")
# Log all kernel messages to the console.
# Logging much else clutters up the screen.
#kern.* action(type="omfile" file="/dev/console")
# Log anything (except mail) of level info or higher.
# Don't log private authentication messages!
*.info;mail.none;authpriv.none;cron.none action(type="omfile" file="/var/log/messages")
# The authpriv file has restricted access.
authpriv.* action(type="omfile" file="/var/log/secure")
# Log all the mail messages in one place.
mail.* action(type="omfile" file="/var/log/maillog")
# Log cron stuff
cron.* action(type="omfile" file="/var/log/cron")
# Everybody gets emergency messages
*.emerg action(type="omusrmsg" users="*")
# Save news errors of level crit and higher in a special file.
uucp,news.crit action(type="omfile" file="/var/log/spooler")
# Save boot messages also to boot.log
local7.* action(type="omfile" file="/var/log/boot.log")
# log every host in its own directory
if $fromhost-ip then /var/log/$fromhost-ip/messages
# Include all .conf files in /etc/rsyslog.d
$IncludeConfig /etc/rsyslog.d/*.conf
$template GRAYLOGRFC5424,"<%PRI%>%PROTOCOL-VERSION% %TIMESTAMP:::date-rfc3339% %HOSTNAME% %APP-NAME% %PROCID% %MSGID% %STRUCTURED-DATA% %msg%\n"
*.info;mail.none;authpriv.none;cron.none;*.* ##graylog:514;GRAYLOGRFC5424 # forward everything to remote server
As you're running a Java application, you can even send your logs directly to syslog. Here's a small configuration example with log4j:
log4j.rootLogger=INFO, SYSLOG
log4j.appender.SYSLOG=org.apache.log4j.net.SyslogAppender
log4j.appender.SYSLOG.syslogHost=myremote.elk-server.net
log4j.appender.SYSLOG.layout=org.apache.log4j.PatternLayout
log4j.appender.SYSLOG.layout.conversionPattern=%d{ISO8601} %-5p [%t] %c{2} %x - %m%n
log4j.appender.SYSLOG.Facility=LOCAL1

Running spark examples on Cloudera VM 5.7

I am learning Hadoop, machine learning and Spark. I have downloaded the Cloudera 5.7 Quick Start VM. I have also downloaded the examples from https://github.com/apache/spark as a zip file and copied them to the Cloudera VM. I am having trouble running the machine learning examples, or any examples, from https://github.com/apache/spark. I tried running the simple word2vec example but failed. Below are my steps and the error I get:
[cloudera@quickstart.cloudera] cd /spark-master/examples/src/main/python/ml
[cloudera@quickstart.cloudera] spark-submit word2vec_example.py
All examples I try to run fail with the below error.
Traceback (most recent call last):
File "/home/cloudera/training/spark-master/examples/src/main/python/ml/word2vec_example.py", line 23, in
from pyspark.sql import SparkSession
I did a search for the file pyspark.sql but I could only find the below file
cd /spark-master
find . -name pyspark.sql
./python/docs/pyspark.sql.rst
Please advise on how I can resolve these errors so that I can run this example and speed up my machine learning and big data learning.
The code for the word2vec example is below:
cat word2vec_example.py
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
from __future__ import print_function
# $example on$
from pyspark.ml.feature import Word2Vec
# $example off$
from pyspark.sql import SparkSession
if __name__ == "__main__":
    spark = SparkSession\
        .builder\
        .appName("Word2VecExample")\
        .getOrCreate()
    # $example on$
    # Input data: Each row is a bag of words from a sentence or document.
    documentDF = spark.createDataFrame([
        ("Hi I heard about Spark".split(" "), ),
        ("I wish Java could use case classes".split(" "), ),
        ("Logistic regression models are neat".split(" "), )
    ], ["text"])
    # Learn a mapping from words to Vectors.
    word2Vec = Word2Vec(vectorSize=3, minCount=0, inputCol="text", outputCol="result")
    model = word2Vec.fit(documentDF)
    result = model.transform(documentDF)
    for feature in result.select("result").take(3):
        print(feature)
    # $example off$
    spark.stop()
line 23: spark = SparkSession\
SparkSession is new in Spark 2.0, and Cloudera only ships with Spark 1.6 by default. You can either download the examples from Spark 1.6 or install Spark 2.0 on Cloudera.
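A quick way to confirm which Spark version the VM actually runs (and therefore whether SparkSession is available) is to check the version banner:
spark-submit --version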

Issues with Flume HDFS sink from Twitter

I currently have this configuration in Flume:
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
# The configuration file needs to define the sources,
# the channels and the sinks.
# Sources, channels and sinks are defined per agent,
# in this case called 'TwitterAgent'
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS
TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey = YPTxqtRamIZ1bnJXYwGW
TwitterAgent.sources.Twitter.consumerSecret = Wjyw9714OBzao7dktH0csuTByk4iLG9Zu4ddtI6s0ho
TwitterAgent.sources.Twitter.accessToken = 2340010790-KhWiNLt63GuZ6QZNYuPMJtaMVjLFpiMP4A2v
TwitterAgent.sources.Twitter.accessTokenSecret = x1pVVuyxfvaTbPoKvXqh2r5xUA6tf9einoByLIL8rar
TwitterAgent.sources.Twitter.keywords = hadoop, big data, analytics, bigdata, cloudera, data science, data scientiest, business intelligence, mapreduce, data warehouse, data warehousing, mahout, hbase, nosql, newsql, businessintelligence, cloudcomputing
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://hadoop1:8020/user/flume/tweets/%Y/%m/%d/%H/
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000
TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000
TwitterAgent.channels.MemChannel.transactionCapacity = 100
The Twitter app auth keys are correct.
And I keep getting this error in the flume log file:
ERROR org.apache.flume.SinkRunner
Unable to deliver event. Exception follows.
org.apache.flume.EventDeliveryException: java.lang.IllegalArgumentException: java.net.UnknownHostException: hadoop1
at org.apache.flume.sink.hdfs.HDFSEventSink.process(HDFSEventSink.java:446)
at org.apache.flume.sink.DefaultSinkProcessor.process(DefaultSinkProcessor.java:68)
at org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:147)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.lang.IllegalArgumentException: java.net.UnknownHostException: hadoop1
at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:414)
at org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:164)
at org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:129)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:448)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:410)
at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:128)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2310)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:89)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2344)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2326)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:353)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:194)
at org.apache.flume.sink.hdfs.BucketWriter$1.call(BucketWriter.java:227)
at org.apache.flume.sink.hdfs.BucketWriter$1.call(BucketWriter.java:221)
at org.apache.flume.sink.hdfs.BucketWriter$8$1.run(BucketWriter.java:589)
at org.apache.flume.sink.hdfs.BucketWriter.runPrivileged(BucketWriter.java:161)
at org.apache.flume.sink.hdfs.BucketWriter.access$800(BucketWriter.java:57)
at org.apache.flume.sink.hdfs.BucketWriter$8.call(BucketWriter.java:586)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
... 1 more
Caused by: java.net.UnknownHostException: hadoop1
... 23 more
Does anyone here know why and could explain it to me?
Thanks in advance.
According to the exception, the problem is that the host hadoop1 is unknown.
According to the Flume configuration file, the path you have given is
hdfs://hadoop1:8020/user/flume/tweets/%Y/%m/%d/%H/
which is supposed to be accessible from the machine running the Flume agent. Since machine names cannot be used to access HDFS without being in the same domain, you need to access HDFS using the IP address as set in core-site.xml.
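For example, the sink path would point at the NameNode address instead of the hostname; the IP below is only a placeholder for whatever NameNode address appears in your core-site.xml:
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://10.0.0.1:8020/user/flume/tweets/%Y/%m/%d/%H/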
