I have tested Apache Flume to transfer files from local disk to HDFS. But if the source files come from multiple servers (i.e., transferring files from different servers' local filesystems to HDFS), can I run just one Flume instance and simply add more agents to flume-conf.properties?
If so, how would I edit the following parameters in flume-conf.properties:
agent1.sources.spooldirSource1.spoolDir = ?(server1/path)
agent2.sources.spooldirSource2.spoolDir = ?(server2/path)
Also, how do I run Flume? The command
./flume-ng agent -n agent -c conf -f apache-flume-1.4.0-bin/conf/flume-conf.properties
only runs one agent. What about two or more?
Add as many sources as you need, but configure them to use the same channel, which will then feed the same sink. So it's something like this (note that this snippet is incomplete):
agent1.sources.spooldirSource1.spoolDir = server1/path
agent1.sources.spooldirSource1.channels = myMemoryChannel
agent1.sources.spooldirSource2.spoolDir = server2/path
agent1.sources.spooldirSource2.channels = myMemoryChannel
Using the same channel for two sources isn't good practice; you can easily get an OutOfMemory error on the channel (with a MemoryChannel). In this case it's better to use one channel per source (within the same agent):
a1.sources = r1 r2
a1.sinks = k1 k2
a1.channels = c1 c2
then link source r1 to channel c1 and source r2 to channel c2.
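For example, a minimal sketch of that wiring, assuming spooling-directory sources, memory channels, and HDFS sinks; all paths below are placeholders:
# one spooling-directory source per directory
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /path/from/server1
a1.sources.r1.channels = c1
a1.sources.r2.type = spooldir
a1.sources.r2.spoolDir = /path/from/server2
a1.sources.r2.channels = c2
# one channel per source
a1.channels.c1.type = memory
a1.channels.c2.type = memory
# one sink per channel, each writing to its own HDFS path
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = /flume/server1
a1.sinks.k2.type = hdfs
a1.sinks.k2.channel = c2
a1.sinks.k2.hdfs.path = /flume/server2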
Related
I need to run identical steps (manipulation of files) on different servers (agents) that have different files. E.g.:
server A has files A1-A3
server B has files B1-B6
server C has a file C1
Every server has to run the steps on its own file list independently of, and in parallel with, the other servers. Every file has to be processed in 3 steps: step 1, then step 2, then send to a network share (step 3).
So my ideas were:
Idea 1. Create a map:
filelist = [
    [serverhostname: 'serverA', files: ['A1', 'A2', 'A3']],
    [serverhostname: 'serverB', files: ['B1', 'B2', 'B3', 'B4', 'B5', 'B6']],
    [serverhostname: 'serverC', files: ['C1']]
]
Idea 2. Generate parallel stages and pass the map into them. I read the article Generating parallel stages in Jenkinsfile according to passed parameters but couldn't combine the examples into working code (a sketch is given after this list).
Idea 3. To limit the load on the servers (step 2 eats CPU, step 3 eats network bandwidth), I want each server to process only one file at a time rather than its whole file set.
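A minimal scripted-pipeline sketch combining Idea 1 and Idea 2 (with the one-file-at-a-time constraint of Idea 3), assuming each server is a Jenkins agent labeled with its hostname; the echo steps are placeholders for the real file manipulation:
def filelist = [
    [serverhostname: 'serverA', files: ['A1', 'A2', 'A3']],
    [serverhostname: 'serverB', files: ['B1', 'B2', 'B3', 'B4', 'B5', 'B6']],
    [serverhostname: 'serverC', files: ['C1']]
]
def branches = [:]
filelist.each { entry ->
    branches[entry.serverhostname] = {
        node(entry.serverhostname) {
            // files are handled one at a time on each server; the servers run in parallel
            entry.files.each { f ->
                stage("step1 ${f}") { echo "step1 on ${f}" }
                stage("step2 ${f}") { echo "step2 on ${f}" }
                stage("step3 ${f}") { echo "send ${f} to the network share" }
            }
        }
    }
}
parallel branches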
I am trying to create an LQR controller for the acrobot system from scratch:
# imports (module paths may differ across pydrake versions)
import numpy as np
from pydrake.multibody.plant import MultibodyPlant
from pydrake.multibody.parsing import Parser
from pydrake.multibody.tree import UniformGravityFieldElement
from pydrake.systems.controllers import LinearQuadraticRegulator
file_name = "acrobot.sdf" # from drake/multibody/benchmarks/acrobot/acrobot.sdf
acrobot = MultibodyPlant()
parser = Parser(plant=acrobot)
parser.AddModelFromFile(file_name)
acrobot.AddForceElement(UniformGravityFieldElement([0, 0, -9.81]))
acrobot.Finalize()
acrobot_context = acrobot.CreateDefaultContext()
shoulder = acrobot.GetJointByName("ShoulderJoint")
elbow = acrobot.GetJointByName("ElbowJoint")
shoulder.set_angle(context=acrobot_context, angle=0.0)
elbow.set_angle(context=acrobot_context, angle=0.0)
Q = np.identity(4)
R = np.identity(1)
N = np.zeros([4, 4])
controller = LinearQuadraticRegulator(acrobot, acrobot_context, Q, R)
Running this script, I get an error on the last line:
RuntimeError: Vector-valued input port acrobot_actuation must be either fixed or connected to the output of another system.
None of my attempts to fix or connect the input ports were successful.
P.S. I know that AcrobotPlant exists, but the idea is to build the LQR from the SDF on the fly.
P.P.S. Why does acrobot.get_num_input_ports() return 5 instead of 1?
Here are the deltas I had to apply to get it to at least pass that error:
https://github.com/EricCousineau-TRI/drake/commit/e7167fb8a
Main notes:
You have to either (a) use plant_context.FixInputPort on the relevant input ports, or (b) use a DiagramBuilder to compose systems via AddSystem + Connect(output_port, input_port). A sketch of option (a) follows these notes.
I'd recommend naming the MultibodyPlant instance plant, so that you can refer to model instances directly.
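A minimal sketch of option (a), assuming the variables from the script above; the exact spelling varies across Drake versions (newer releases use InputPort.FixValue rather than Context.FixInputPort), and depending on the version you may also need to fix the plant's remaining input ports and point LinearQuadraticRegulator at the actuation port:
# fix the actuation input at the nominal operating point (zero torque)
actuation_port = acrobot.get_actuation_input_port()
acrobot_context.FixInputPort(actuation_port.get_index(),
                             np.zeros(acrobot.num_actuators()))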
Does this help some?
P.P.S. Why does acrobot.get_num_input_ports() return 5 instead of 1?
It's because it's a MultibodyPlant instance, which has several input ports besides the actuation port. Preview from plot_system_graphviz (the rendered diagram is not reproduced here):
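A quick way to list those ports by name, assuming the script above (accessor names may vary slightly across pydrake versions):
for i in range(acrobot.get_num_input_ports()):
    print(acrobot.get_input_port(i).get_name())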
I am trying to connect a Twitter source to a Hive sink using Flume.
My properties file is given below:
# Naming the components on the current agent.
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = k1
# Describing/Configuring the source
TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource
#TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
TwitterAgent.sources.Twitter.consumerKey = xxx
TwitterAgent.sources.Twitter.consumerSecret = xxx
TwitterAgent.sources.Twitter.accessToken = xxx
TwitterAgent.sources.Twitter.accessTokenSecret = xxx
TwitterAgent.sources.Twitter.keywords = kafka, flume, hadoop, hive
# Describing/Configuring the sink
TwitterAgent.channels = MemChannel
TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.sinks = k1
TwitterAgent.sinks.k1.type = hive
TwitterAgent.sinks.k1.channel = MemChannel
TwitterAgent.sinks.k1.hive.metastore = thrift://xxxx:9083
TwitterAgent.sinks.k1.hive.database = sample
TwitterAgent.sinks.k1.hive.table = tweets_twitter
#TwitterAgent.sinks.k1.hive.partition = user_location
TwitterAgent.sinks.k1.useLocalTimeStamp = false
TwitterAgent.sinks.k1.round = true
TwitterAgent.sinks.k1.roundValue = 10
TwitterAgent.sinks.k1.roundUnit = minute
TwitterAgent.sinks.k1.serializer = DELIMITED
TwitterAgent.sinks.k1.serializer.delimiter = "\t"
TwitterAgent.sinks.k1.serializer.serdeSeparator = '\t'
#TwitterAgent.sinks.k1.serializer.fieldnames =user_friends_count,user_location,user_email
# Describing/Configuring the channel
TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 100000
TwitterAgent.channels.MemChannel.transactionCapacity = 1000
TwitterAgent.channels.MemChannel.byteCapacity = 6912212
# Binding the source and sink to the channel
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sinks.k1.channel = MemChannel
I am not creating any database or table in Hive here. Do I need to create the database, table, partition column, and field names before starting the agent?
If so, where should I get the schema of the Twitter streaming data?
I am starting the Flume agent with the command below:
bin/flume-ng agent --conf ./conf/ -f conf/twitter_hive.conf -Dflume.root.logger=DEBUG,console -n TwitterAgent --classpath "/usr/hdp/2.6.3.0-235/hive-hcatalog/share/hcatalog/*":"/usr/hdp/2.6.3.0-235/hive/lib/*"
Where should I get the schema of the Twitter data to create the Hive tables referenced in the properties file?
The Hive sink was introduced in Flume 1.6, and per the documentation, yes: the metastore, database name, and table name are mandatory. The partition, however, is optional, as Flume can create missing partitions.
As for the Twitter schema, this seems to be a problem others have faced as well, and I found this link quite useful (you may have already come across it). It mentions some of the data structures available in Hive that you may need for working with data that arrives in JSON format. You may need to alter some bits and pieces for your scenario, but it should give you a good start.
I hope this helps.
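For reference, a rough sketch of the kind of table the Hive sink expects, using the three fields from the commented-out serializer.fieldnames line as stand-ins (the column names and types are assumptions, not the real Twitter schema); the Hive streaming API behind the sink needs a bucketed, ORC-backed, transactional table:
CREATE DATABASE IF NOT EXISTS sample;
-- bucketing, ORC storage and transactional=true are required for Hive streaming ingest
CREATE TABLE sample.tweets_twitter (
  user_friends_count INT,
  user_location STRING,
  user_email STRING
)
CLUSTERED BY (user_location) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');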
I use a channel as a backup in Flume without any sink, and it works correctly. Below is my working configuration, but how can I set the directory or file name dynamically? (I want to name them by date: when the date changes, a new directory should be created dynamically and the previous one kept as a backup.)
# Name the components on this agent
a1.sources = r1
a1.channels =c1
# Describe/configure the source r1
a1.sources.r1.type = http
a1.sources.r1.port = 40441
a1.sources.r1.bind = X.X.X.X
a1.sources.r1.channels = c1
# Use a channel which buffers events in file
a1.channels.c1.type = file
a1.channels.c1.dataDirs = /data/disk11/flume/Test/dataDirs{%y%m%d}
a1.channels.c1.checkpointDir = /data/disk11/flume/Test/checkpointDir{%y%m%d}
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
I have a Flume agent that ingests data into Elasticsearch. The agent uses a spoolDir source. There is another agent which writes files into the spoolDir of the Elasticsearch agent.
Over time the number of files grows, and the gap between processed and unprocessed files keeps increasing.
I want to increase the number of events processed by the Flume agent to speed up ingestion.
Here is the configuration of the Flume agent:
agent04.sources = s1
agent04.channels = ch1
agent04.channels = memoryChannel
agent04.channels.memoryChannel.type = memory
agent04.channels.memoryChannel.capacity=100000
agent04.channels.memoryChannel.transactionCapacity=1000
agent04.sources.s1.channels = memoryChannel
agent04.sources.s1.type = spooldir
agent04.sources.s1.spoolDir = /DataCollection/Flume_Cleaner_Output/Json_Elastic/
agent04.sources.s1.deserializer.maxLineLength = 100000
agent04.sinks = elasticsearch
agent04.sinks.elasticsearch.channel = memoryChannel
agent04.sinks.elasticsearch.type=org.css.cssElasticsearchSink
agent04.sinks.elasticsearch.batchSize=400
agent04.sinks.elasticsearch.hostNames = elastic-node01.css.org
agent04.sinks.elasticsearch.indexName = all_collections
agent04.sinks.elasticsearch.indexType = live_tweets
agent04.sinks.elasticsearch.indexNameBuilder= org.css.sa.flume.elasticsearch.sink.indexNameBuilder.HeaderValueBasedIndexNameBuilder
agent04.sinks.elasticsearch.clusterName = css_rai_social
agent04.sinks.elasticsearch.serializer = org.jai.flume.sinks.elasticsearch.serializer.ElasticSearchJsonBodyEventSerializer
agent04.sinks.elasticsearch.cache_period_ms=90d
Why are you chaining two Flume agents using the spooldir? That will be really slow and is a surprising configuration: you're incurring the cost of frequent fsyncs as each batch gets processed.
I recommend chaining them with an Avro sink and an Avro source instead, and raising the batch size to at least 1000 (computers really like batches, and Flume is set up to exploit that). A sketch of the hand-off is below.
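A minimal sketch of that hand-off, assuming a hypothetical name for the upstream agent (upstream), an arbitrary port (4545), and the existing channel names; adjust to your real agent, host, and channel names:
# upstream agent: replace the file-writing sink with an Avro sink
upstream.sinks = avroOut
upstream.sinks.avroOut.type = avro
upstream.sinks.avroOut.channel = memoryChannel
# hostname of the machine running agent04 (placeholder)
upstream.sinks.avroOut.hostname = elastic-ingest-host
upstream.sinks.avroOut.port = 4545
upstream.sinks.avroOut.batch-size = 1000
# agent04: replace the spooldir source with an Avro source
agent04.sources.s1.type = avro
agent04.sources.s1.bind = 0.0.0.0
agent04.sources.s1.port = 4545
agent04.sources.s1.channels = memoryChannel
# and raise the Elasticsearch sink batch size as suggested above
agent04.sinks.elasticsearch.batchSize = 1000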