I'm trying to collect tweets from the Twitter API and write them to Kafka via Flume, using a Kafka sink. The problem is that the Kafka sink does not receive all the tweets that Flume collects. I ran ZooKeeper and the Kafka server, created a topic named twitter, and listened to the topic with a consumer. For instance, for 1000 tweets that Flume collects, Kafka only displays 100 after 1 minute of processing.
Here is the Flume conf file:
TwitterAgent.sources = Twitter
TwitterAgent.channels= MemChannel
TwitterAgent.sinks = kafka
TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource
TwitterAgent.sources.Twitter.consumerKey =
TwitterAgent.sources.Twitter.consumerSecret =
TwitterAgent.sources.Twitter.accessToken =
TwitterAgent.sources.Twitter.accessTokenSecret =
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sinks.kafka.channel = MemChannel
TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10020
TwitterAgent.channels.MemChannel.transactionCapacity = 1300
TwitterAgent.sinks.kafka.type = org.apache.flume.sink.kafka.KafkaSink
TwitterAgent.sinks.kafka.topic = twitter
TwitterAgent.sinks.kafka.brokerList = localhost:9092
TwitterAgent.sinks.kafka.batchsize = 100
TwitterAgent.sinks.kafka.request.required.acks = -1
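For reference, this is roughly the console consumer used to watch the topic (a sketch: the ZooKeeper address and the --from-beginning flag are assumptions for an older Kafka matching the brokerList-based sink above; newer Kafka versions use --bootstrap-server localhost:9092 instead):
# old-style consumer, reads the twitter topic from the start
bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic twitter --from-beginning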
Thanks for your help
Related
I am using Gatling and InfluxDB on Windows 10. I am trying to send some results from Gatling to InfluxDB, but the results are not being pushed to InfluxDB. Can someone help me?
My Gatling graphite config (from gatling.conf) is:
data {
  #writers = [console, file, graphite]
  console {
    #light = false
    #writePeriod = 5
  }
  file {
    bufferSize = 8192 # FileDataWriter's internal data buffer size, in bytes
  }
  leak {
    #noActivityTimeout = 30 # Period, in seconds, for which Gatling may have no activity before considering a leak may be happening
  }
  graphite {
    # light = false # only send the all* stats
    host = "localhost" # The host where the Carbon server is located
    port = 2003 # The port to which the Carbon server listens to (2003 is default for plaintext, 2004 is default for pickle)
    protocol = "tcp" # The protocol used to send data to Carbon (currently supported : "tcp", "udp")
    rootPathPrefix = "gatling" # The common prefix of all metrics sent to Graphite
    bufferSize = 8192 # Internal data buffer size, in bytes
    writePeriod = 1 # Write period, in seconds
  }
}
My InfluxDB config file is:
[[graphite]]
enabled = true
database = "gatlingdb"
retention-policy = ""
bind-address = ":2003"
protocol = "tcp"
consistency-level = "one"
batch-size = 5000
batch-pending = 10
batch-timeout = "1s"
udp-read-buffer = 0
separator = "."
templates = [
"gatling.*.*.*.count measurement.simulation.request.status.field",
"gatling.*.*.*.min measurement.simulation.request.status.field",
"gatling.*.*.*.max measurement.simulation.request.status.field",
"gatling.*.*.*.percentiles50 measurement.simulation.request.status.field",
"gatling.*.*.*.percentiles75 measurement.simulation.request.status.field",
"gatling.*.*.*.percentiles95 measurement.simulation.request.status.field",
"gatling.*.*.*.percentiles99 measurement.simulation.request.status.field"
]
Not sure why it is not working.
Uncomment the writers line, #writers = [console, file, graphite], so that the graphite data writer is actually enabled.
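For reference, the relevant part of gatling.conf would then look roughly like this (a minimal sketch based on the snippet above; everything else can stay as posted):
data {
  writers = [console, file, graphite]   # uncommented: enables the graphite DataWriter alongside console and file
  graphite {
    host = "localhost"                  # graphite listener host
    port = 2003                         # must match bind-address = ":2003" in the InfluxDB [[graphite]] section
  }
}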
I followed the Spark Streaming + Flume integration guide, but I can't get any events in the end.
(https://spark.apache.org/docs/latest/streaming-flume-integration.html)
Can anyone help me analyze it?
In Flume, I created the file "avro_flume.conf" as follows:
# Describe/configure the source
a1.sources = r1
a1.channels = c1
a1.sources.r1.type = avro
a1.sources.r1.channels = c1
a1.sources.r1.bind = 123.57.54.113
a1.sources.r1.port = 4141
# Describe the sink
a1.sinks = k1
a1.sinks.k1.type = avro
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
a1.sinks.k1.hostname = 123.57.54.113
a1.sinks.k1.port = 6666
a1.sources = r1
a1.sinks = spark
a1.channels = c1
In the file, 123.57.54.113 is the IP of the local host.
I start the programs as follows:
1. Start the Flume agent:
flume-ng agent -c . -f conf/avro_spark.conf -n a1
2. Start the Spark Streaming example:
bin/run-example org.apache.spark.examples.streaming.FlumeEventCount 123.57.54.113 6666
3. Then I start the avro-client:
flume-ng avro-client -c . -H 123.57.54.113 -p 4141 -F test/log.01
4. "test/log.01" is a file created with echo and contains some strings.
In the end, there are no events at all.
What's the problem?
Thanks!
I see "a1.sinks = spark" under heading "Binding the source and sink to the channel". But the sink with name "spark" is not defined elsewhere in your configuration.
Are you trying approach 1 or approach 2 from "https://spark.apache.org/docs/latest/streaming-flume-integration.html"?
Try removing the line "a1.sinks = spark" if you are trying approach 1.
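For approach 1, the top-of-file declarations should then name only the avro sink that is actually configured, roughly:
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# k1 stays the avro sink already defined above (hostname 123.57.54.113, port 6666)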
For approach 2, use the following template:
agent.sinks = spark
agent.sinks.spark.type = org.apache.spark.streaming.flume.sink.SparkSink
agent.sinks.spark.hostname = <hostname of the local machine>
agent.sinks.spark.port = <port to listen on for connection from Spark>
agent.sinks.spark.channel = memoryChannel
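For approach 2 the custom sink also has to be on the Flume agent's classpath. Per the integration guide, that means copying the spark-streaming-flume-sink jar (plus a matching scala-library and commons-lang3) into Flume's lib or plugins.d directory; the exact versions below are assumptions and must match your Spark and Scala build:
# put the Spark sink and its dependencies where flume-ng can load them (versions are assumptions)
cp spark-streaming-flume-sink_2.10-1.6.0.jar scala-library-2.10.5.jar commons-lang3-3.3.2.jar $FLUME_HOME/lib/
# restart the agent, then run the polling variant of the example (FlumePollingEventCount, if your Spark distribution ships it) against the sink's host/port
bin/run-example org.apache.spark.examples.streaming.FlumePollingEventCount 123.57.54.113 <port configured for the spark sink>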
Kafka version is 0.7.2, Flume version is 1.5.0, and the Flume + Kafka plugin is https://github.com/baniuyao/flume-kafka
Error info:
2014-08-20 18:55:51,755 (conf-file-poller-0) [ERROR - org.apache.flume.node.PollingPropertiesFileConfigurationProvider$FileWatcherRunnable.run(PollingPropertiesFileConfigurationProvider.java:149)] Unhandled error
java.lang.NoSuchMethodError: scala.math.LowPriorityOrderingImplicits.ordered()Lscala/math/Ordering;
at kafka.producer.ZKBrokerPartitionInfo$$anonfun$kafka$producer$ZKBrokerPartitionInfo$$getZKTopicPartitionInfo$1.apply(ZKBrokerPartitionInfo.scala:172)
Flume configuration:
agent_log.sources = r1
agent_log.sinks = kafka
agent_log.channels = c1
agent_log.sources.r1.type = exec
agent_log.sources.r1.channels = c1
agent_log.sources.r1.command = tail -f /var/log/test.log
agent_log.channels.c1.type = memory
agent_log.channels.c1.capacity = 1000
agent_log.channels.c1.trasactionCapacity = 100
agent_log.sinks.kafka.type = com.vipshop.flume.sink.kafka.KafkaSink
agent_log.sinks.kafka.channel = c1
agent_log.sinks.kafka.zk.connect = XXXX:2181
agent_log.sinks.kafka.topic = my-replicated-topic
agent_log.sinks.kafka.batchsize = 200
agent_log.sinks.kafka.producer.type = async
agent_log.sinks.kafka.serializer.class = kafka.serializer.StringEncoder
What could be the error? Thanks!
scala.math.LowPriorityOrderingImplicits.ordered()
Perhaps you need to add the Scala standard library jar to your Flume lib directory.
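That NoSuchMethodError usually points to a Scala version mismatch: the scala-library on Flume's classpath is a different version than the one Kafka 0.7.2 was compiled against. A hedged first step (paths and jar name below are assumptions; use whatever your Kafka build actually bundles):
# copy the scala-library that ships with your Kafka 0.7.2 build into Flume's lib (exact path/version are assumptions)
cp $KAFKA_HOME/lib/scala-library-*.jar $FLUME_HOME/lib/
# remove any scala-library of a different version from Flume's lib before restarting the agent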
I'm learning Hadoop, Flume, etc., and one of the projects I started was sentiment analysis, which is OK, but now I'm trying to expand by collecting multiple sets of data. This is my flume.conf:
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS HDFS2
TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey = xxx
TwitterAgent.sources.Twitter.consumerSecret = xxxx
TwitterAgent.sources.Twitter.accessToken = xxx
TwitterAgent.sources.Twitter.accessTokenSecret = xxxx
TwitterAgent.sources.Twitter.keywords = bbc
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://xxx:8020/user/flume/tweets/
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000
TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000
TwitterAgent.channels.MemChannel.transactionCapacity = 100
What I'm hoping to achieve is to put all tweets about bbc in the above location, but also use the following config to put tweets about liverpool into a separate folder:
TwitterAgent.sources.Twitter.keywords = liverpool
TwitterAgent.sinks.HDFS2.channel = MemChannel
TwitterAgent.sinks.HDFS2.type = hdfs
TwitterAgent.sinks.HDFS2.hdfs.path = hdfs://xxx:8020/user/flume/tweets/liverpool/
TwitterAgent.sinks.HDFS2.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS2.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS2.hdfs.batchSize = 1000
TwitterAgent.sinks.HDFS2.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS2.hdfs.rollCount = 10000
TwitterAgent.channels.MemChannel2.type = memory
TwitterAgent.channels.MemChannel2.capacity = 10000
TwitterAgent.channels.MemChannel2.transactionCapacity = 10
This isn't working and I can't work out why. Can anyone point me in the right direction?
This answer is probably a bit late, but I think it doesn't work because you can only have one open connection to the Twitter Streaming API using the same app.
https://dev.twitter.com/discussions/14935
https://dev.twitter.com/discussions/7542
@kurrik (Arne Roomann-Kurrik):
Which streaming endpoint are you using? For general streams, you should only make one connection from the same IP. For userstreams, one or two connections from the same IP. For site streams, multiple connections are supported (note that site streams is still in limited beta).
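If both keyword sets have to come from a single agent, one workaround sketch (untested, and it does not split tweets by keyword on its own) is a single Twitter source carrying both keywords, fanned out to both channels with Flume's replicating channel selector; both HDFS paths then receive the full stream, and per-keyword routing would additionally need an interceptor plus a multiplexing selector:
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel MemChannel2
TwitterAgent.sinks = HDFS HDFS2
# one streaming connection carrying both keyword sets
TwitterAgent.sources.Twitter.keywords = bbc, liverpool
# copy every event to both channels (replicating is also Flume's default selector)
TwitterAgent.sources.Twitter.selector.type = replicating
TwitterAgent.sources.Twitter.channels = MemChannel MemChannel2
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS2.channel = MemChannel2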
I'm using apache-flume 1.4.0 with hbase 0.94.10 and hadoop 1.1.2.
The Flume agent has a spool directory as source, HBase as sink, and a file channel. It runs successfully, but very slowly. What should I do to improve HBase write performance?
The Flume agent conf is as below:
agent1.sources = spool
agent1.channels = fileChannel
agent1.sinks = sink
agent1.sources.spool.type = spooldir
agent1.sources.spool.spoolDir = /opt/spoolTest/
agent1.sources.spool.fileSuffix = .completed
agent1.sources.spool.channels = fileChannel
#agent1.sources.spool.deletePolicy = immediate
agent1.sinks.sink.type = org.apache.flume.sink.hbase.HBaseSink
agent1.sinks.sink.channel = fileChannel
agent1.sinks.sink.table = test
agent1.sinks.sink.columnFamily = log
agent1.sinks.sink.serializer = org.apache.flume.sink.hbase.RegexHbaseEventSerializer
agent1.sinks.sink.serializer.regex = (.*)^C(.*)^C(.*)^C(.*)^C(.*)^C(.*)^C(.*)^C(.*)^C(.*)^C(.*)^C(.*)^C(.*)^C(.*)^C(.*)
agent1.sinks.sink.serializer.colNames = id,no_fill_reason,adInfo,locationInfo,handsetInfo,siteInfo,reportDate,ipaddress,headerContent,userParaContent,reqParaContent,otherPara,others,others1
agent1.sinks.sink1.batchSize = 100
agent1.channels.fileChannel.type = file
agent1.channels.fileChannel.checkpointDir = /usr/flumeFileChannel/chkpointFlume
agent1.channels.fileChannel.dataDirs = /usr/flumeFileChannel/dataFlume
agent1.channels.fileChannel.capacity = 10000000
agent1.channels.fileChannel.transactionCapacity = 100000
What should the capacity and transaction capacity of the file channel be, and what batch size should the sink use?
Please help me.
Thanks in advance.