Multiple Flume Twitter agents

I'm learning Hadoop, Flume, etc., and one of the projects I started was sentiment analysis, which is going OK. Now I'm trying to expand by collecting multiple sets of data. This is my flume.conf:
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS HDFS2
TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey = xxx
TwitterAgent.sources.Twitter.consumerSecret = xxxx
TwitterAgent.sources.Twitter.accessToken = xxx
TwitterAgent.sources.Twitter.accessTokenSecret = xxxx
TwitterAgent.sources.Twitter.keywords = bbc
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://xxx:8020/user/flume/tweets/
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000
TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000
TwitterAgent.channels.MemChannel.transactionCapacity = 100
What I'm hoping to achieve is to put all tweets about bbc in the above location, but also to use the following config to put tweets about liverpool into a separate folder:
TwitterAgent.sources.Twitter.keywords = liverpool
TwitterAgent.sinks.HDFS2.channel = MemChannel
TwitterAgent.sinks.HDFS2.type = hdfs
TwitterAgent.sinks.HDFS2.hdfs.path = hdfs://xxx:8020/user/flume/tweets/liverpool/
TwitterAgent.sinks.HDFS2.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS2.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS2.hdfs.batchSize = 1000
TwitterAgent.sinks.HDFS2.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS2.hdfs.rollCount = 10000
TwitterAgent.channels.MemChannel2.type = memory
TwitterAgent.channels.MemChannel2.capacity = 10000
TwitterAgent.channels.MemChannel2.transactionCapacity = 10
This isn't working and I can't work out why. Can anyone point me in the right direction?

This answer is probably a bit late, but I think it doesn't work because you can only have one open connection to the Twitter Streaming API per app.
https://dev.twitter.com/discussions/14935
https://dev.twitter.com/discussions/7542
#kurrik Arne Roomann-Kurrik
Which streaming endpoint are you using? For general streams, you should only make one connection from the same IP. For userstreams, one or two connections from the same IP. For site streams, multiple connections are supported (note that site streams is still in limited beta).
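Given that limit, a workaround is to keep a single TwitterSource (one open connection) that tracks both keyword sets, and route events to two channels and two HDFS sinks with an interceptor plus a multiplexing channel selector. Below is a minimal, untested sketch; the regex_extractor regex, the topic header name, and the case handling are simplifying assumptions, and the credentials and roll settings from the original config are omitted for brevity.
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel MemChannel2
TwitterAgent.sinks = HDFS HDFS2
# Single Twitter connection tracking both keyword sets
TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel MemChannel2
TwitterAgent.sources.Twitter.keywords = bbc, liverpool
# Tag events whose body mentions "liverpool" with a "topic" header
TwitterAgent.sources.Twitter.interceptors = i1
TwitterAgent.sources.Twitter.interceptors.i1.type = regex_extractor
TwitterAgent.sources.Twitter.interceptors.i1.regex = (?i)(liverpool)
TwitterAgent.sources.Twitter.interceptors.i1.serializers = s1
TwitterAgent.sources.Twitter.interceptors.i1.serializers.s1.name = topic
# Route tagged events to MemChannel2, everything else to MemChannel
TwitterAgent.sources.Twitter.selector.type = multiplexing
TwitterAgent.sources.Twitter.selector.header = topic
TwitterAgent.sources.Twitter.selector.mapping.liverpool = MemChannel2
TwitterAgent.sources.Twitter.selector.mapping.Liverpool = MemChannel2
TwitterAgent.sources.Twitter.selector.default = MemChannel
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://xxx:8020/user/flume/tweets/
TwitterAgent.sinks.HDFS2.channel = MemChannel2
TwitterAgent.sinks.HDFS2.type = hdfs
TwitterAgent.sinks.HDFS2.hdfs.path = hdfs://xxx:8020/user/flume/tweets/liverpool/
TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel2.type = memory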

Related

The first few messages are lost when transmitted to MQTT clients that were offline

I have a VerneMQ server and MQTT clients using the paho MQTT library (with Python or C, it doesn't matter). Both subscribers and publishers use QoS 2 and clean_session == False. The problem is that when the subscriber is offline and I send some messages, some of them are lost. After a detailed study of the parameters, I found out that the first max_inflight_messages are lost. What I mean is this: in vernemq.conf I left max_inflight_messages = 20 (the default). The subscriber goes offline, I send 21 messages, the subscriber comes back online, and the first 20 are lost while the 21st is delivered. I tried it many times with different numbers of messages, with the same result: the first 20 messages are lost, and the 21st and onward are received. When I set max_inflight_messages = 1, the first message is lost and the others are received. Any ideas? My vernemq.conf:
allow_anonymous = on
allow_register_during_netsplit = off
allow_publish_during_netsplit = off
allow_subscribe_during_netsplit = off
allow_unsubscribe_during_netsplit = off
allow_multiple_sessions = off
coordinate_registrations = on
max_inflight_messages = 20
max_online_messages = 1000
max_offline_messages = 1000
max_message_size = 0
upgrade_outgoing_qos = off
listener.max_connections = 10000
listener.nr_of_acceptors = 10
listener.tcp.default = 0.0.0.0:1883
listener.vmq.clustering = 0.0.0.0:44053
listener.http.default = 0.0.0.0:8888
systree_enabled = on
systree_interval = 20000
graphite_enabled = off
graphite_host = localhost
graphite_port = 2003
graphite_interval = 20000
shared_subscription_policy = prefer_local
plugins.vmq_passwd = off
plugins.vmq_acl = on
plugins.vmq_diversity = off
plugins.vmq_webhooks = off
plugins.vmq_bridge = off
metadata_plugin = vmq_plumtree
vmq_acl.acl_file = ./etc/vmq.acl
vmq_acl.acl_reload_interval = 10
vmq_passwd.password_file = ./etc/vmq.passwd
vmq_passwd.password_reload_interval = 10
vmq_diversity.script_dir = ./share/lua
vmq_diversity.auth_postgres.enabled = off
vmq_diversity.postgres.ssl = off
vmq_diversity.postgres.password_hash_method = crypt
vmq_diversity.auth_cockroachdb.enabled = off
vmq_diversity.cockroachdb.ssl = on
vmq_diversity.cockroachdb.password_hash_method = bcrypt
vmq_diversity.auth_mysql.enabled = off
vmq_diversity.mysql.password_hash_method = password
vmq_diversity.auth_mongodb.enabled = off
vmq_diversity.mongodb.ssl = off
vmq_diversity.auth_redis.enabled = off
vmq_bcrypt.pool_size = 1
log.console = both
log.console.level = debug
log.console.file = ./log/console.log
log.error.file = ./log/error.log
log.syslog = off
log.crash = on
log.crash.file = ./log/crash.log
log.crash.maximum_message_size = 64KB
log.crash.size = 10MB
log.crash.rotation = $D0
log.crash.rotation.keep = 5
nodename = VerneMQ@127.0.0.1
distributed_cookie = vmq
erlang.async_threads = 64
erlang.max_ports = 262144
leveldb.maximum_memory.percent = 70
The problem was in how the paho MQTT library was being used: when the client connects to the broker, it receives all the queued messages, but the handlers for those messages were only assigned when it subscribed to a concrete topic.
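A minimal sketch of that fix with paho-mqtt 1.x in Python (the broker address, client id, and topic are illustrative):
import paho.mqtt.client as mqtt

# Catch-all handler registered BEFORE connecting, so messages queued while the
# client was offline (the broker starts delivering them right after CONNACK,
# before any per-topic callback added at subscribe time exists) are handled.
def on_message(client, userdata, msg):
    print(msg.topic, msg.payload)

client = mqtt.Client(client_id="subscriber-1", clean_session=False)
client.on_message = on_message
client.connect("localhost", 1883)
client.subscribe("some/topic", qos=2)
client.loop_forever()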

Flume: Kafka sink not able to get all tweets

I'm trying to collect tweets from the Twitter API and write them to Kafka via Flume, using a Kafka sink. The problem is that the Kafka sink does not receive all the tweets collected by Flume. I ran ZooKeeper and the Kafka server, created the topic twitter, and listened to the topic with a consumer. For instance, for 1000 tweets that Flume collects, Kafka only displays 100 after 1 minute of processing.
Here is the Flume conf file:
TwitterAgent.sources = Twitter
TwitterAgent.channels= MemChannel
TwitterAgent.sinks = kafka
TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource
TwitterAgent.sources.Twitter.consumerKey =
TwitterAgent.sources.Twitter.consumerSecret =
TwitterAgent.sources.Twitter.accessToken =
TwitterAgent.sources.Twitter.accessTokenSecret =
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sinks.kafka.channel = MemChannel
TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10020
TwitterAgent.channels.MemChannel.transactionCapacity = 1300
TwitterAgent.sinks.kafka.type = org.apache.flume.sink.kafka.KafkaSink
TwitterAgent.sinks.kafka.topic = twitter
TwitterAgent.sinks.kafka.brokerList = localhost:9092
TwitterAgent.sinks.kafka.batchsize = 100
TwitterAgent.sinks.kafka.request.required.acks = -1
Thanks for your help
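One thing worth checking, as an assumption based on the Flume 1.6-era Kafka sink property names: the keys are case-sensitive, so batchsize is likely ignored (the sink then falls back to its default), and the acks option is usually spelled requiredAcks (or passed as a kafka.* producer property) rather than request.required.acks. A hedged sketch of that block:
TwitterAgent.sinks.kafka.type = org.apache.flume.sink.kafka.KafkaSink
TwitterAgent.sinks.kafka.channel = MemChannel
TwitterAgent.sinks.kafka.topic = twitter
TwitterAgent.sinks.kafka.brokerList = localhost:9092
# Property names are case-sensitive: batchSize, not batchsize
TwitterAgent.sinks.kafka.batchSize = 100
# Producer acknowledgements; assumed name for this Flume version
TwitterAgent.sinks.kafka.requiredAcks = -1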

Flume Multiplexing not working

I have configured my Flume agent as below. Somehow, the agent doesn't run properly; it keeps hanging without any errors. Is there any problem with the configuration below?
FYI: I have a file named "country" with a hard-coded header called state.
#Define sources, sink and channels
foo.sources = s1
foo.channels = chn-az chn-oth
foo.sinks = sink-az sink-oth
#
### # # Define a source on agent and connect to channel memory-channel.
foo.sources.s1.type = exec
foo.sources.s1.command = cat /home/hadoop/flume/country.txt
foo.sources.s1.batchSize = 1
foo.sources.s1.channels = chn-ca chn-oth
#selector configuration
foo.sources.s1.selector.type = multiplexing
foo.sources.s1.selector.header = state
foo.sources.s1.selector.mapping.AZ = chn-az
foo.sources.s1.selector.default = chn-oth
#
#
### Define a memory channel on agent called memory-channel.
foo.channels.chn-az.type = memory
foo.channels.chn-oth.type = memory
#
#
##Define sinks that outputs to hdfs.
foo.sinks.sink-az.channel = chn-az
foo.sinks.sink-az.type = hdfs
foo.sinks.sink-az.hdfs.path = hdfs://master:9099/user/hadoop/flume
foo.sinks.sink-az.hdfs.filePrefix = statefilter
foo.sinks.sink-az.hdfs.fileType = DataStream
foo.sinks.sink-az.hdfs.writeFormat = Text
foo.sinks.sink-az.batchSize = 1
foo.sinks.sink-az.rollInterval = 0
#
foo.sinks.sink-oth.channel = chn-oth
foo.sinks.sink-oth.type = hdfs
foo.sinks.sink-oth.hdfs.path = hdfs://master:9099/user/hadoop/flume
foo.sinks.sink-oth.hdfs.filePrefix = statefilter
foo.sinks.sink-oth.hdfs.fileType = DataStream
foo.sinks.sink-oth.batchSize = 1
foo.sinks.sink-oth.rollInterval = 0
Thanks,
Vinoth
Regarding the channels list configured at the source:
foo.sources.s1.channels = chn-ca chn-oth
I think chn-ca should be chn-az.
Nevertheless, I think such a configuration will never work since the "state" header used by the selector is not created by any Flume component. You must introduce an interceptor for that, typically the Regex Extractor Interceptor.
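For example, something along these lines (untested; the regex is an illustrative assumption that the two-letter state code appears as its own uppercase token in each line of country.txt):
foo.sources.s1.interceptors = i1
foo.sources.s1.interceptors.i1.type = regex_extractor
# Capture a two-letter uppercase token from the event body into the "state" header
foo.sources.s1.interceptors.i1.regex = \\b([A-Z]{2})\\b
foo.sources.s1.interceptors.i1.serializers = s1
foo.sources.s1.interceptors.i1.serializers.s1.name = state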

Graphite carbon-relay not working

I have two Graphite setups and I am trying to relay traffic between the two, but somehow carbon-relay is not working.
My cache runs on 2003/2004 and the relay on 2013/2014.
Here are the configurations:
#carbon file
[cache:b]
LINE_RECEIVER_PORT = 2003
PICKLE_RECEIVER_PORT = 2004
CACHE_QUERY_PORT = 7012
[relay]
LINE_RECEIVER_INTERFACE = 0.0.0.0
LINE_RECEIVER_PORT = 2013
PICKLE_RECEIVER_INTERFACE = 0.0.0.0
PICKLE_RECEIVER_PORT = 2014
RELAY_METHOD = rules
REPLICATION_FACTOR = 1
DESTINATIONS = 127.0.0.1:2003:a, aa.bb.cc.dd:2003:b
#relay-rules file
[default]
default = true
destinations = 127.0.0.1:2003:a, aa.bb.cc.dd:2003:b
Any pointers would be helpful.
As part of a recent project at work, I figured out that the carbon daemons use the pickle protocol when sending data to their destinations.
So the destinations of carbon-relay should point to carbon-cache's pickle receiver port instead.
#carbon.conf
....
[relay]
LINE_RECEIVER_INTERFACE = 0.0.0.0
LINE_RECEIVER_PORT = 2013
PICKLE_RECEIVER_INTERFACE = 0.0.0.0
PICKLE_RECEIVER_PORT = 2014
RELAY_METHOD = rules
REPLICATION_FACTOR = 1
DESTINATIONS = 127.0.0.1:2004:a, aa.bb.cc.dd:2004:b
Also modify relay-rules.conf with the same destinations as those specified in carbon.conf:
relay-rules.conf
.....
[default]
default = true
destinations = 127.0.0.1:2004:a, aa.bb.cc.dd:2004:b

How to improve Apache Flume performance when writing data to HBase

I'm using Apache Flume 1.4.0 with HBase 0.94.10 and Hadoop 1.1.2.
The Flume agent has a spooling directory as source, HBase as sink, and a file channel. It is running successfully but very slowly. What should I do to improve HBase write performance?
The Flume agent conf is below:
agent1.sources = spool
agent1.channels = fileChannel
agent1.sinks = sink
agent1.sources.spool.type = spooldir
agent1.sources.spool.spoolDir = /opt/spoolTest/
agent1.sources.spool.fileSuffix = .completed
agent1.sources.spool.channels = fileChannel
#agent1.sources.spool.deletePolicy = immediate
agent1.sinks.sink.type = org.apache.flume.sink.hbase.HBaseSink
agent1.sinks.sink.channel = fileChannel
agent1.sinks.sink.table = test
agent1.sinks.sink.columnFamily = log
agent1.sinks.sink.serializer = org.apache.flume.sink.hbase.RegexHbaseEventSerializer
agent1.sinks.sink.serializer.regex = (.*)^C(.*)^C(.*)^C(.*)^C(.*)^C(.*)^C(.*)^C(.*)^C(.*)^C(.*)^C(.*)^C(.*)^C(.*)^C(.*)
agent1.sinks.sink.serializer.colNames = id,no_fill_reason,adInfo,locationInfo,handsetInfo,siteInfo,reportDate,ipaddress,headerContent,userParaContent,reqParaContent,otherPara,others,others1
agent1.sinks.sink1.batchSize = 100
agent1.channels.fileChannel.type = file
agent1.channels.fileChannel.checkpointDir = /usr/flumeFileChannel/chkpointFlume
agent1.channels.fileChannel.dataDirs = /usr/flumeFileChannel/dataFlume
agent1.channels.fileChannel.capacity = 10000000
agent1.channels.fileChannel.transactionCapacity = 100000
What should the capacity and transactionCapacity of the file channel and the batchSize of the sink be?
Please help me.
Thanks in advance.
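Two hedged observations about this configuration rather than a definitive answer: the batch size is set on agent1.sinks.sink1 while the declared sink is named sink, so that line is probably ignored and the sink falls back to its default batch size; and a sink's batchSize should not exceed the transactionCapacity of the channel it drains. A sketch of how those lines might be aligned (the values are illustrative, not tuning recommendations):
# Sink name must match the one declared in agent1.sinks ("sink", not "sink1")
agent1.sinks.sink.batchSize = 1000
# Keep the sink batchSize <= the channel's transactionCapacity
agent1.channels.fileChannel.transactionCapacity = 100000
agent1.channels.fileChannel.capacity = 10000000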
