How to improve Apache Flume performance to write data in HBase - flume

I am using apache-flume 1.4.0 with hbase 0.94.10 and hadoop 1.1.2.
The Flume agent has a spooling directory as the source, HBase as the sink, and a file channel. It runs successfully, but very slowly. What should I do to improve the HBase write performance?
The Flume agent configuration is below:
agent1.sources = spool
agent1.channels = fileChannel
agent1.sinks = sink
agent1.sources.spool.type = spooldir
agent1.sources.spool.spoolDir = /opt/spoolTest/
agent1.sources.spool.fileSuffix = .completed
agent1.sources.spool.channels = fileChannel
#agent1.sources.spool.deletePolicy = immediate
agent1.sinks.sink.type = org.apache.flume.sink.hbase.HBaseSink
agent1.sinks.sink.channel = fileChannel
agent1.sinks.sink.table = test
agent1.sinks.sink.columnFamily = log
agent1.sinks.sink.serializer = org.apache.flume.sink.hbase.RegexHbaseEventSerializer
agent1.sinks.sink.serializer.regex = (.*)^C(.*)^C(.*)^C(.*)^C(.*)^C(.*)^C(.*)^C(.*)^C(.*)^C(.*)^C(.*)^C(.*)^C(.*)^C(.*)
agent1.sinks.sink.serializer.colNames = id,no_fill_reason,adInfo,locationInfo,handsetInfo,siteInfo,reportDate,ipaddress,headerContent,userParaContent,reqParaContent,otherPara,others,others1
agent1.sinks.sink1.batchSize = 100
agent1.channels.fileChannel.type = file
agent1.channels.fileChannel.checkpointDir = /usr/flumeFileChannel/chkpointFlume
agent1.channels.fileChannel.dataDirs = /usr/flumeFileChannel/dataFlume
agent1.channels.fileChannel.capacity = 10000000
agent1.channels.fileChannel.transactionCapacity = 100000
What should the capacity and transaction capacity of the file channel and the batch size of the sink be?
Please help me.
Thanks in advance.
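One thing visible in the posted configuration: the batch size is set on agent1.sinks.sink1 while the sink is actually named sink, so that line is ignored and the sink falls back to its default batch size. As a rough, illustrative starting point rather than a definitive answer, the sink batch size is usually raised well above the default and kept below the channel's transactionCapacity, for example:
# illustrative values only - tune against your own throughput tests
agent1.sinks.sink.batchSize = 1000
# transactionCapacity only needs to be >= the largest batch a source or sink uses
agent1.channels.fileChannel.transactionCapacity = 10000
# capacity bounds how many events the channel can buffer on disk
agent1.channels.fileChannel.capacity = 1000000
Larger sink batches mean fewer HBase RPCs and fewer file-channel commits per event, which is usually where the time goes with a file channel.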

Related

The first few messages are lost when transmitted to MQTT clients that were offline

I have a VerneMQ server and MQTT clients using the paho MQTT library (with Python or C - it doesn't matter). Both subscribers and publishers use QoS 2 and clean_session == False. The problem is that when the subscriber is offline and I send some messages, some of them are lost. After a detailed study of the parameters, I found out that the first max_inflight_messages are lost. What I mean: in the config file vernemq.conf I set max_inflight_messages = 20 (the default). The subscriber goes offline, I send 21 messages, the subscriber comes back online, and the first 20 are lost while the 21st is delivered. I tried it many times with different numbers of messages - same result: the first 20 messages are lost, and from the 21st onward they are received. When I set max_inflight_messages = 1, the first message is lost and the others are received. Any ideas? My vernemq.conf file:
allow_anonymous = on
allow_register_during_netsplit = off
allow_publish_during_netsplit = off
allow_subscribe_during_netsplit = off
allow_unsubscribe_during_netsplit = off
allow_multiple_sessions = off
coordinate_registrations = on
max_inflight_messages = 20
max_online_messages = 1000
max_offline_messages = 1000
max_message_size = 0
upgrade_outgoing_qos = off
listener.max_connections = 10000
listener.nr_of_acceptors = 10
listener.tcp.default = 0.0.0.0:1883
listener.vmq.clustering = 0.0.0.0:44053
listener.http.default = 0.0.0.0:8888
systree_enabled = on
systree_interval = 20000
graphite_enabled = off
graphite_host = localhost
graphite_port = 2003
graphite_interval = 20000
shared_subscription_policy = prefer_local
plugins.vmq_passwd = off
plugins.vmq_acl = on
plugins.vmq_diversity = off
plugins.vmq_webhooks = off
plugins.vmq_bridge = off
metadata_plugin = vmq_plumtree
vmq_acl.acl_file = ./etc/vmq.acl
vmq_acl.acl_reload_interval = 10
vmq_passwd.password_file = ./etc/vmq.passwd
vmq_passwd.password_reload_interval = 10
vmq_diversity.script_dir = ./share/lua
vmq_diversity.auth_postgres.enabled = off
vmq_diversity.postgres.ssl = off
vmq_diversity.postgres.password_hash_method = crypt
vmq_diversity.auth_cockroachdb.enabled = off
vmq_diversity.cockroachdb.ssl = on
vmq_diversity.cockroachdb.password_hash_method = bcrypt
vmq_diversity.auth_mysql.enabled = off
vmq_diversity.mysql.password_hash_method = password
vmq_diversity.auth_mongodb.enabled = off
vmq_diversity.mongodb.ssl = off
vmq_diversity.auth_redis.enabled = off
vmq_bcrypt.pool_size = 1
log.console = both
log.console.level = debug
log.console.file = ./log/console.log
log.error.file = ./log/error.log
log.syslog = off
log.crash = on
log.crash.file = ./log/crash.log
log.crash.maximum_message_size = 64KB
log.crash.size = 10MB
log.crash.rotation = $D0
log.crash.rotation.keep = 5
nodename = VerneMQ#127.0.0.1
distributed_cookie = vmq
erlang.async_threads = 64
erlang.max_ports = 262144
leveldb.maximum_memory.percent = 70
The problem was in the paho MQTT library usage: when the client connects to the broker it receives all the queued messages, but the handlers for those messages are only assigned when it subscribes to the concrete topic, so messages that arrive before the subscribe call have no handler attached.

Flume configuration not working, showing exception

I am trying to transfer a 9 GB file from a spool directory to HDFS using Flume. I have the following Flume configuration.
#initialize agent's source, channel and sink
wagent.sources = wavetronix
wagent.channels = memoryChannel2
wagent.sinks = flumeHDFS
# Setting the source to spool directory where the file exists
wagent.sources.wavetronix.type = spooldir
wagent.sources.wavetronix.spoolDir = /johir/WAVETRONIX/output/Yesterday
wagent.sources.wavetronix.fileHeader = false
wagent.sources.wavetronix.basenameHeader = true
#agent.sources.wavetronix.fileSuffix = .COMPLETED
# Setting the channel to memory
wagent.channels.memoryChannel2.type = memory
# Max number of events stored in the memory channel
wagent.channels.memoryChannel2.capacity = 50000
agent.channels.memoryChannel2.batchSize = 1000
wagent.channels.memoryChannel2.transactioncapacity = 1000
# Setting the sink to HDFS
wagent.sinks.flumeHDFS.type = hdfs
#agent.sinks.flumeHDFS.useLocalTimeStamp = true
wagent.sinks.flumeHDFS.hdfs.path =/user/root/WAVETRONIXFLUME/%Y-%m-%d/
wagent.sinks.flumeHDFS.hdfs.useLocalTimeStamp = true
wagent.sinks.flumeHDFS.hdfs.filePrefix= %{basename}
wagent.sinks.flumeHDFS.hdfs.fileType = DataStream
# Write format can be text or writable
wagent.sinks.flumeHDFS.hdfs.writeFormat = Text
# use a single csv file at a time
wagent.sinks.flumeHDFS.hdfs.maxOpenFiles = 1
wagent.sinks.flumeHDFS.hdfs.rollCount=0
wagent.sinks.flumeHDFS.hdfs.rollInterval=0
wagent.sinks.flumeHDFS.hdfs.rollSize = 6400000
wagent.sinks.flumeHDFS.hdfs.batchSize =1000
# never rollover based on the number of events
wagent.sinks.flumeHDFS.hdfs.rollCount = 0
# rollover file based on max time of 1 min
#agent.sinks.flumeHDFS.hdfs.rollInterval = 0
# agent.sinks.flumeHDFS.hdfs.idleTimeout = 600
# Connect source and sink with channel
wagent.sources.wavetronix.channels = memoryChannel2
wagent.sinks.flumeHDFS.channel = memoryChannel2
But I am getting the following exception.
Exception in thread "SinkRunner-PollingRunner-DefaultSinkProcessor"
java.lang.OutOfMemoryError: Java heap space
at java.util.concurrent.ConcurrentHashMap.putVal(ConcurrentHashMap.java:1043)
at java.util.concurrent.ConcurrentHashMap.putIfAbsent(ConcurrentHashMap.java:1535)
at java.lang.ClassLoader.getClassLoadingLock(ClassLoader.java:463)
at java.lang.ClassLoader.loadClass(ClassLoader.java:404)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at org.apache.log4j.spi.LoggingEvent.<init>(LoggingEvent.java:165)
at org.apache.log4j.Category.forcedLog(Category.java:391)
at org.apache.log4j.Category.log(Category.java:856)
at org.slf4j.impl.Log4jLoggerAdapter.warn(Log4jLoggerAdapter.java:479)
at org.apache.flume.sink.hdfs.HDFSEventSink.process(HDFSEventSink.java:461)
at org.apache.flume.sink.DefaultSinkProcessor.process(DefaultSinkProcessor.java:68)
at org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:147)
at java.lang.Thread.run(Thread.java:745)
Can anyone help me to solve this problem?
Please edit the file ${FLUME_HOME}/conf/flume-env.sh and add the following line:
export JAVA_OPTS="-Xms1000m -Xmx12000m -Dcom.sun.management.jmxremote"
You can adjust the "Xmx" and "Xms" options.
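If raising the heap is not an option, the other side of the same equation is to keep the memory channel from outgrowing the heap. Note that in the posted configuration the batchSize line uses the agent. prefix instead of wagent. and transactioncapacity is spelled in lower case, so neither of those settings appears to take effect. A minimal sketch of a smaller, byte-bounded channel (the numbers are assumptions, not recommendations):
# illustrative values only
wagent.channels.memoryChannel2.capacity = 10000
wagent.channels.memoryChannel2.transactionCapacity = 1000
# cap the channel's heap footprint in bytes; keep this well under the agent's -Xmx
wagent.channels.memoryChannel2.byteCapacity = 536870912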

Flume Multiplexing not working

I have configured my Flume agent as below. Somehow the agent doesn't run properly; it keeps hanging without any errors. Is there any problem with the configuration below?
FYI: I have a file named "country" with a hard-coded header called state.
#Define sources, sink and channels
foo.sources = s1
foo.channels = chn-az chn-oth
foo.sinks = sink-az sink-oth
#
### # # Define a source on agent and connect to channel memory-channel.
foo.sources.s1.type = exec
foo.sources.s1.command = cat /home/hadoop/flume/country.txt
foo.sources.s1.batchSize = 1
foo.sources.s1.channels = chn-ca chn-oth
#selector configuration
foo.sources.s1.selector.type = multiplexing
foo.sources.s1.selector.header = state
foo.sources.s1.selector.mapping.AZ = chn-az
foo.sources.s1.selector.default = chn-oth
#
#
### Define a memory channel on agent called memory-channel.
foo.channels.chn-az.type = memory
foo.channels.chn-oth.type = memory
#
#
##Define sinks that outputs to hdfs.
foo.sinks.sink-az.channel = chn-az
foo.sinks.sink-az.type = hdfs
foo.sinks.sink-az.hdfs.path = hdfs://master:9099/user/hadoop/flume
foo.sinks.sink-az.hdfs.filePrefix = statefilter
foo.sinks.sink-az.hdfs.fileType = DataStream
foo.sinks.sink-az.hdfs.writeFormat = Text
foo.sinks.sink-az.batchSize = 1
foo.sinks.sink-az.rollInterval = 0
#
foo.sinks.sink-oth.channel = chn-oth
foo.sinks.sink-oth.type = hdfs
foo.sinks.sink-oth.hdfs.path = hdfs://master:9099/user/hadoop/flume
foo.sinks.sink-oth.hdfs.filePrefix = statefilter
foo.sinks.sink-oth.hdfs.fileType = DataStream
foo.sinks.sink-oth.batchSize = 1
foo.sinks.sink-oth.rollInterval = 0
Thanks,
Vinoth
Regarding the channels list configured at the source:
foo.sources.s1.channels = chn-ca chn-oth
I think chn-ca should be chn-az.
Nevertheless, I think such a configuration will never work since the "state" header used by the selector is not created by any Flume component. You must introduce an interceptor for that, typically the Regex Extractor Interceptor.
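A minimal sketch of such an interceptor on the same source, assuming the state code is the first comma-separated field of each line in country.txt (the regex is an assumption about the file layout and will need adjusting):
foo.sources.s1.interceptors = i1
foo.sources.s1.interceptors.i1.type = regex_extractor
# capture the leading two-letter state code into a header named "state"
foo.sources.s1.interceptors.i1.regex = ^([A-Z]{2}),
foo.sources.s1.interceptors.i1.serializers = s1
foo.sources.s1.interceptors.i1.serializers.s1.name = state
With the header populated, the multiplexing selector can route AZ events to chn-az, and anything else falls through to chn-oth via the default mapping.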

NoSuchMethodError in Flume with Kafka

Kafka version is 0.7.2, Flume version is 1.5.0, flume + kafka plugin: https://github.com/baniuyao/flume-kafka
error info:
2014-08-20 18:55:51,755 (conf-file-poller-0) [ERROR - org.apache.flume.node.PollingPropertiesFileConfigurationProvider$FileWatcherRunnable.run(PollingPropertiesFileConfigurationProvider.java:149)] Unhandled error
java.lang.NoSuchMethodError: scala.math.LowPriorityOrderingImplicits.ordered()Lscala/math/Ordering;
at kafka.producer.ZKBrokerPartitionInfo$$anonfun$kafka$producer$ZKBrokerPartitionInfo$$getZKTopicPartitionInfo$1.apply(ZKBrokerPartitionInfo.scala:172)
flume configuration:
agent_log.sources = r1
agent_log.sinks = kafka
agent_log.channels = c1
agent_log.sources.r1.type = exec
agent_log.sources.r1.channels = c1
agent_log.sources.r1.command = tail -f /var/log/test.log
agent_log.channels.c1.type = memory
agent_log.channels.c1.capacity = 1000
agent_log.channels.c1.trasactionCapacity = 100
agent_log.sinks.kafka.type = com.vipshop.flume.sink.kafka.KafkaSink
agent_log.sinks.kafka.channel = c1
agent_log.sinks.kafka.zk.connect = XXXX:2181
agent_log.sinks.kafka.topic = my-replicated-topic
agent_log.sinks.kafka.batchsize = 200
agent_log.sinks.kafka.producer.type = async
agent_log.sinks.kafka.serializer.class = kafka.serializer.StringEncoder
What could be causing this error? Thanks.
scala.math.LowPriorityOrderingImplicits.ordered()
A NoSuchMethodError like this usually points to a Scala version mismatch: the Kafka 0.7.2 client was built against a different version of the Scala standard library than the one Flume is loading. Make sure a scala-library jar matching the Kafka build is in your Flume lib directory.
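A hedged sketch of that fix (both the jar location and the version are placeholders to verify against your Kafka 0.7.2 build):
# copy the matching Scala runtime onto Flume's classpath
cp /path/to/scala-library-<matching-version>.jar ${FLUME_HOME}/lib/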

multiple flume twitter agents

I'm learning Hadoop, Flume, etc., and one of the projects I started was sentiment analysis, which is OK, but now I'm trying to expand by collecting multiple sets of data. This is my flume.conf:
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS HDFS2
TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey = xxx
TwitterAgent.sources.Twitter.consumerSecret = xxxx
TwitterAgent.sources.Twitter.accessToken = xxx
TwitterAgent.sources.Twitter.accessTokenSecret = xxxx
TwitterAgent.sources.Twitter.keywords = bbc
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://xxx:8020/user/flume/tweets/
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000
TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000
TwitterAgent.channels.MemChannel.transactionCapacity = 100
What I'm hoping to achieve is to put all tweets about bbc in the above location, but also use the following config to put tweets about liverpool into a separate folder:
TwitterAgent.sources.Twitter.keywords = liverpool
TwitterAgent.sinks.HDFS2.channel = MemChannel
TwitterAgent.sinks.HDFS2.type = hdfs
TwitterAgent.sinks.HDFS2.hdfs.path = hdfs://xxx:8020/user/flume/tweets/liverpool/
TwitterAgent.sinks.HDFS2.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS2.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS2.hdfs.batchSize = 1000
TwitterAgent.sinks.HDFS2.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS2.hdfs.rollCount = 10000
TwitterAgent.channels.MemChannel2.type = memory
TwitterAgent.channels.MemChannel2.capacity = 10000
TwitterAgent.channels.MemChannel2.transactionCapacity = 10
This isn't working and I can't work out why. Can anyone point me in the right direction?
This answer is probably a bit late but I think it doesn't work because you can only have one open connection to the Twitter Streaming API using the same app.
https://dev.twitter.com/discussions/14935
https://dev.twitter.com/discussions/7542
@kurrik (Arne Roomann-Kurrik):
Which streaming endpoint are you using? For general streams, you should only make one connection from the same IP. For userstreams, one or two connections from the same IP. For site streams, multiple connections are supported (note that site streams is still in limited beta).
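Given that constraint, a minimal sketch of one way to restructure the agent is a single Twitter source (one streaming connection) carrying both keyword sets, fanned out to both channels with Flume's default replicating selector. Note that each sink then receives every matching tweet; actually splitting bbc and liverpool tweets into separate folders would additionally need an interceptor plus a multiplexing selector. Also, in the posted second block, HDFS2 is pointed at MemChannel while MemChannel2 is defined but never added to TwitterAgent.channels.
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel MemChannel2
TwitterAgent.sinks = HDFS HDFS2
TwitterAgent.sources.Twitter.keywords = bbc, liverpool
# default (replicating) selector: every event goes to both channels
TwitterAgent.sources.Twitter.channels = MemChannel MemChannel2
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS2.channel = MemChannel2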
