Spark Streaming + Flume integration: no events received

I followed the Spark Streaming + Flume integration guide, but I can't get any events in the end.
(https://spark.apache.org/docs/latest/streaming-flume-integration.html)
Can anyone help me analyze it?
In Flume, I created the file "avro_flume.conf" as follows:
# Describe/configure the source
a1.sources = r1
a1.channels = c1
a1.sources.r1.type = avro
a1.sources.r1.channels = c1
a1.sources.r1.bind = 123.57.54.113
a1.sources.r1.port = 4141
# Describe the sink
a1.sinks = k1
a1.sinks.k1.type = avro
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
a1.sinks.k1.hostname = 123.57.54.113
a1.sinks.k1.port = 6666
a1.sources = r1
a1.sinks = spark
a1.channels = c1
In this file, 123.57.54.113 is the IP of the local host.
I start the programs as follows:
1. Start the agent:
flume-ng agent -c . -f conf/avro_spark.conf -n a1
2. Start the Spark Streaming example:
bin/run-example org.apache.spark.examples.streaming.FlumeEventCount 123.57.54.113 6666
3. Then I start the Avro client:
flume-ng avro-client -c . -H 123.57.54.113 -p 4141 -F test/log.01
4. "test/log.01" is a file created with echo that contains some strings.
In the end, there are no events at all.
What's the problem?
Thanks!

I see "a1.sinks = spark" under heading "Binding the source and sink to the channel". But the sink with name "spark" is not defined elsewhere in your configuration.
Are you trying approach 1 or approach 2 from "https://spark.apache.org/docs/latest/streaming-flume-integration.html"?
Try removing the line "a1.sinks = spark" if you are trying approach 1.
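For reference, a cleaned-up approach 1 configuration, sketched only from the values in your own file (one source, one channel, one Avro sink, no duplicate declarations), would be:

# Avro source that the avro-client sends to
a1.sources = r1
a1.sources.r1.type = avro
a1.sources.r1.bind = 123.57.54.113
a1.sources.r1.port = 4141
a1.sources.r1.channels = c1
# In-memory channel
a1.channels = c1
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Avro sink that pushes to the host/port FlumeEventCount listens on
a1.sinks = k1
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = 123.57.54.113
a1.sinks.k1.port = 6666
a1.sinks.k1.channel = c1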
For approach 2 use the following template:
agent.sinks = spark
agent.sinks.spark.type = org.apache.spark.streaming.flume.sink.SparkSink
agent.sinks.spark.hostname = <hostname of the local machine>
agent.sinks.spark.port = <port to listen on for connection from Spark>
agent.sinks.spark.channel = memoryChannel
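Note that approach 2 also requires the spark-streaming-flume-sink JAR (and the matching Scala library) to be on the Flume agent's classpath; the linked integration guide describes that setup.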

Related

Flume Multiplexing not working

I have configured my Flume agent as below. Somehow, the agent doesn't run properly; it keeps hanging without any errors. Is there any problem with the configuration below?
FYI: I have a file named "country" with a hard-coded header called "state".
#Define sources, sink and channels
foo.sources = s1
foo.channels = chn-az chn-oth
foo.sinks = sink-az sink-oth
#
# Define a source on the agent and connect it to the memory channels.
foo.sources.s1.type = exec
foo.sources.s1.command = cat /home/hadoop/flume/country.txt
foo.sources.s1.batchSize = 1
foo.sources.s1.channels = chn-ca chn-oth
#selector configuration
foo.sources.s1.selector.type = multiplexing
foo.sources.s1.selector.header = state
foo.sources.s1.selector.mapping.AZ = chn-az
foo.sources.s1.selector.default = chn-oth
#
#
# Define memory channels on the agent.
foo.channels.chn-az.type = memory
foo.channels.chn-oth.type = memory
#
#
# Define sinks that output to HDFS.
foo.sinks.sink-az.channel = chn-az
foo.sinks.sink-az.type = hdfs
foo.sinks.sink-az.hdfs.path = hdfs://master:9099/user/hadoop/flume
foo.sinks.sink-az.hdfs.filePrefix = statefilter
foo.sinks.sink-az.hdfs.fileType = DataStream
foo.sinks.sink-az.hdfs.writeFormat = Text
foo.sinks.sink-az.batchSize = 1
foo.sinks.sink-az.rollInterval = 0
#
foo.sinks.sink-oth.channel = chn-oth
foo.sinks.sink-oth.type = hdfs
foo.sinks.sink-oth.hdfs.path = hdfs://master:9099/user/hadoop/flume
foo.sinks.sink-oth.hdfs.filePrefix = statefilter
foo.sinks.sink-oth.hdfs.fileType = DataStream
foo.sinks.sink-oth.batchSize = 1
foo.sinks.sink-oth.rollInterval = 0
Thanks,
Vinoth
Regarding the channels list configured at the source:
foo.sources.s1.channels = chn-ca chn-oth
I think chn-ca should be chn-az.
Nevertheless, I think such a configuration will never work since the "state" header used by the selector is not created by any Flume component. You must introduce an interceptor for that, typically the Regex Extractor Interceptor.
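For example, a sketch of such an interceptor on the source; the interceptor keys are standard Flume configuration, but the regex itself is an assumption that you must adapt to the actual line format of country.txt so the capture group yields the state code:

foo.sources.s1.interceptors = i1
foo.sources.s1.interceptors.i1.type = regex_extractor
# Hypothetical pattern: assumes each line starts with a two-letter state code
foo.sources.s1.interceptors.i1.regex = ^([A-Z]{2})
foo.sources.s1.interceptors.i1.serializers = s1
foo.sources.s1.interceptors.i1.serializers.s1.name = state

With the header populated, events whose state is AZ go to chn-az and everything else falls through to chn-oth, as the selector mapping intends.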

NoSuchMethod error in flume with kafka

Kafka version is 0.7.2, Flume version is 1.5.0, Flume + Kafka plugin: https://github.com/baniuyao/flume-kafka
error info:
2014-08-20 18:55:51,755 (conf-file-poller-0) [ERROR - org.apache.flume.node.PollingPropertiesFileConfigurationProvider$FileWatcherRunnable.run(PollingPropertiesFileConfigurationProvider.java:149)] Unhandled error
java.lang.NoSuchMethodError: scala.math.LowPriorityOrderingImplicits.ordered()Lscala/math/Ordering;
at kafka.producer.ZKBrokerPartitionInfo$$anonfun$kafka$producer$ZKBrokerPartitionInfo$$getZKTopicPartitionInfo$1.apply(ZKBrokerPartitionInfo.scala:172)
flume configuration:
agent_log.sources = r1
agent_log.sinks = kafka
agent_log.channels = c1
agent_log.sources.r1.type = exec
agent_log.sources.r1.channels = c1
agent_log.sources.r1.command = tail -f /var/log/test.log
agent_log.channels.c1.type = memory
agent_log.channels.c1.capacity = 1000
agent_log.channels.c1.transactionCapacity = 100
agent_log.sinks.kafka.type = com.vipshop.flume.sink.kafka.KafkaSink
agent_log.sinks.kafka.channel = c1
agent_log.sinks.kafka.zk.connect = XXXX:2181
agent_log.sinks.kafka.topic = my-replicated-topic
agent_log.sinks.kafka.batchsize = 200
agent_log.sinks.kafka.producer.type = async
agent_log.sinks.kafka.serializer.class = kafka.serializer.StringEncoder
What could be the error? Thanks!
The missing method scala.math.LowPriorityOrderingImplicits.ordered() points to a Scala version mismatch: the plugin was compiled against a different Scala version than the one Flume is loading. Perhaps you need to put the Scala standard library, in the version the plugin was built with, in your Flume lib directory.

scapy dns sniff with additional records

I have a Python/scapy sniffer for DNS.
I am able to sniff DNS messages and get the IP/UDP source and destination addresses and ports, as well as the DNS payload, but I have problems parsing additional answers and additional records when there is more than one.
From scapy's ls(DNS), ls(DNSQR) and ls(DNSRR) I can see which fields exist, but I don't know how to get the additional records.
I would appreciate some help or a solution to work this out.
My Python/scapy script is below:
#!/usr/bin/env python
from scapy.all import *
from datetime import datetime
import time
import datetime
import sys

############# MODIFY THIS PART IF NECESSARY ###############
interface = 'eth0'
filter_bpf = 'udp and port 53'

# ------ SELECT/FILTER MSGS
def select_DNS(pkt):
    pkt_time = pkt.sprintf('%sent.time%')
    # ------ SELECT/FILTER DNS MSGS
    try:
        if DNSQR in pkt and pkt.dport == 53:
            # queries
            print '[**] Detected DNS Message at: ' + pkt_time
            p_id = pkt[DNS].id
            cli_ip = pkt[IP].src
            cli_port = pkt.sport
            srv_ip = pkt[IP].dst
            srv_port = pkt.dport
            query = pkt[DNSQR].qname
            q_class = pkt[DNSQR].qclass
            qr_class = pkt[DNSQR].sprintf('%qclass%')
            type = pkt[DNSQR].sprintf('%qtype%')
        elif DNSRR in pkt and pkt.sport == 53:
            # responses
            pkt_time = pkt.sprintf('%sent.time%')
            p_id = pkt[DNS].id
            srv_ip = pkt[IP].src
            srv_port = pkt.sport
            cli_ip = pkt[IP].dst
            cli_port = pkt.dport
            response = pkt[DNSRR].rdata
            r_class = pkt[DNSRR].rclass
            rr_class = pkt[DNSRR].sprintf("%rclass%")
            type = pkt[DNSRR].sprintf("%type%")
            ttl = pkt[DNSRR].ttl
            len = pkt[DNSRR].rdlen
            print response
    except:
        pass

# ------ START SNIFFER
sniff(iface=interface, filter=filter_bpf, store=0, prn=select_DNS)
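The script above only ever reads the first DNSRR in a response. To walk every answer and every additional record, one option is the sketch below; it assumes the classic scapy layout where the an/ns/ar sections are chains of DNSRR layers linked through .payload, with the counts held in ancount/nscount/arcount:

def iter_records(section, count):
    # Walk one DNS section (an, ns or ar); each record's payload is the next record.
    rr = section
    for _ in range(count):
        if not isinstance(rr, DNSRR):
            break
        yield rr
        rr = rr.payload

Inside select_DNS, for a response packet, it could be used like this:

dns = pkt[DNS]
for rr in iter_records(dns.an, dns.ancount):  # answers
    print rr.sprintf('%rrname% %type% %rdata%')
for rr in iter_records(dns.ar, dns.arcount):  # additional records
    print rr.sprintf('%rrname% %type% %rdata%')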

Graphite carbon-relay not working

I have two Graphite setups and I am trying to relay traffic between the two, but somehow carbon-relay is not working.
My cache runs on 2003/2004 and the relay on 2013/2014.
Here are the configurations:
#carbon file
[cache:b]
LINE_RECEIVER_PORT = 2003
PICKLE_RECEIVER_PORT = 2004
CACHE_QUERY_PORT = 7012
[relay]
LINE_RECEIVER_INTERFACE = 0.0.0.0
LINE_RECEIVER_PORT = 2013
PICKLE_RECEIVER_INTERFACE = 0.0.0.0
PICKLE_RECEIVER_PORT = 2014
RELAY_METHOD = rules
REPLICATION_FACTOR = 1
DESTINATIONS = 127.0.0.1:2003:a, aa.bb.cc.dd:2003:b
#relay-rules file
[default]
default = true
destinations = 127.0.0.1:2003:a, aa.bb.cc.dd:2003:b
Any pointers will be helpful
As part of a recent project at work, I figured out that the carbon daemons use the pickle protocol when sending data to their destinations.
So the destinations of carbon-relay should point at carbon-cache's pickle receiver port instead.
#carbon.conf
....
[relay]
LINE_RECEIVER_INTERFACE = 0.0.0.0
LINE_RECEIVER_PORT = 2013
PICKLE_RECEIVER_INTERFACE = 0.0.0.0
PICKLE_RECEIVER_PORT = 2014
RELAY_METHOD = rules
REPLICATION_FACTOR = 1
DESTINATIONS = 127.0.0.1:2004:a, aa.bb.cc.dd:2004:b
Also modify the relay-rules.conf with the same destinations specified in carbon.conf
relay-rules.conf
.....
[default]
default = true
destinations = 127.0.0.1:2004:a, aa.bb.cc.dd:2004:b

how to improve apache flume performance to write data in hbase

I'm using Apache Flume 1.4.0 with HBase 0.94.10 and Hadoop 1.1.2.
The Flume agent has a spooling directory source, an HBase sink, and a file channel. It runs successfully but very slowly. What should I do to improve HBase write performance?
The Flume agent configuration is below:
agent1.sources = spool
agent1.channels = fileChannel
agent1.sinks = sink
agent1.sources.spool.type = spooldir
agent1.sources.spool.spoolDir = /opt/spoolTest/
agent1.sources.spool.fileSuffix = .completed
agent1.sources.spool.channels = fileChannel
#agent1.sources.spool.deletePolicy = immediate
agent1.sinks.sink.type = org.apache.flume.sink.hbase.HBaseSink
agent1.sinks.sink.channel = fileChannel
agent1.sinks.sink.table = test
agent1.sinks.sink.columnFamily = log
agent1.sinks.sink.serializer = org.apache.flume.sink.hbase.RegexHbaseEventSerializer
agent1.sinks.sink.serializer.regex = (.*)^C(.*)^C(.*)^C(.*)^C(.*)^C(.*)^C(.*)^C(.*)^C(.*)^C(.*)^C(.*)^C(.*)^C(.*)^C(.*)
agent1.sinks.sink.serializer.colNames = id,no_fill_reason,adInfo,locationInfo,handsetInfo,siteInfo,reportDate,ipaddress,headerContent,userParaContent,reqParaContent,otherPara,others,others1
agent1.sinks.sink1.batchSize = 100
agent1.channels.fileChannel.type = file
agent1.channels.fileChannel.checkpointDir = /usr/flumeFileChannel/chkpointFlume
agent1.channels.fileChannel.dataDirs = /usr/flumeFileChannel/dataFlume
agent1.channels.fileChannel.capacity = 10000000
agent1.channels.fileChannel.transactionCapacity = 100000
What should the capacity and transaction capacity of the file channel and the batch size of the sink be?
Please help me.
Thanks in advance.
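One detail worth checking in the posted configuration: the batch size is set on "sink1", but the sink is declared as "sink", so that setting is silently ignored. A sketch of the fix, with an assumed starting value to experiment with rather than a verified tuning:

# Apply the batch size to the sink that is actually declared
# ("sink", not "sink1"); 1000 is an assumed starting point
agent1.sinks.sink.batchSize = 1000

Larger sink batches mean fewer HBase round trips per event; the channel's transactionCapacity (here 100000) must stay at least as large as the sink batch size.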
