Twitter source to Hive sink using Flume

I am trying to connect a Twitter source to a Hive sink using Flume.
My properties file is given below:
# Naming the components on the current agent.
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = k1
# Describing/Configuring the source
TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource
#TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
TwitterAgent.sources.Twitter.consumerKey = xxx
TwitterAgent.sources.Twitter.consumerSecret = xxx
TwitterAgent.sources.Twitter.accessToken = xxx
TwitterAgent.sources.Twitter.accessTokenSecret = xxx
TwitterAgent.sources.Twitter.keywords = kafka, flume, hadoop, hive
# Describing/Configuring the sink
TwitterAgent.channels = MemChannel
TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.sinks = k1
TwitterAgent.sinks.k1.type = hive
TwitterAgent.sinks.k1.channel = MemChannel
TwitterAgent.sinks.k1.hive.metastore = thrift://xxxx:9083
TwitterAgent.sinks.k1.hive.database = sample
TwitterAgent.sinks.k1.hive.table = tweets_twitter
#TwitterAgent.sinks.k1.hive.partition = user_location
TwitterAgent.sinks.k1.useLocalTimeStamp = false
TwitterAgent.sinks.k1.round = true
TwitterAgent.sinks.k1.roundValue = 10
TwitterAgent.sinks.k1.roundUnit = minute
TwitterAgent.sinks.k1.serializer = DELIMITED
TwitterAgent.sinks.k1.serializer.delimiter = "\t"
TwitterAgent.sinks.k1.serializer.serdeSeparator = '\t'
#TwitterAgent.sinks.k1.serializer.fieldnames =user_friends_count,user_location,user_email
# Describing/Configuring the channel
TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 100000
TwitterAgent.channels.MemChannel.transactionCapacity = 1000
TwitterAgent.channels.MemChannel.byteCapacity = 6912212
# Binding the source and sink to the channel
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sinks.k1.channel = MemChannel
I am not creating any database or table in Hive here. Do I need to create the database, table, partition column, and field names before starting the agent?
If so, where can I get the schema of the Twitter streaming data?
I am starting the Flume agent with the command below:
bin/flume-ng agent --conf ./conf/ -f conf/twitter_hive.conf -Dflume.root.logger=DEBUG,console -n TwitterAgent --classpath "/usr/hdp/2.6.3.0-235/hive-hcatalog/share/hcatalog/*":"/usr/hdp/2.6.3.0-235/hive/lib/*"
Where can I get the schema of the Twitter data so I can create the Hive table referenced in the twitter_hive.conf properties file?

HiveSink was introduced in version 1.6 and, as per the documentation, yes: the metastore, database name and table name are mandatory. The partition, however, is optional, since Flume can create missing partitions.
As for the Twitter schema, this is a problem others seem to have faced as well, and I found this link quite useful (you may already have come across it). It describes some of the data structures available in Hive that you may need in order to work with data arriving in JSON format. You may need to adapt bits and pieces for your scenario, but it should give you a good start.
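For reference, here is a minimal HiveQL sketch of the kind of table the Hive sink can write to. The column names are only placeholders taken from the commented-out fieldnames line in your config, and the bucketing column and bucket count are assumptions; it also assumes ACID transactions are enabled on your Hive installation. The Hive sink writes through Hive streaming, so the target table must be bucketed, stored as ORC, and marked transactional:
-- Minimal sketch only: column names, bucketing column and bucket count are assumptions.
CREATE DATABASE IF NOT EXISTS sample;

CREATE TABLE sample.tweets_twitter (
  user_friends_count INT,
  user_location      STRING,
  user_email         STRING
)
CLUSTERED BY (user_location) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');
With a table like this in place, you would also uncomment the serializer.fieldnames line so the DELIMITED serializer knows which delimited field maps to which column.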
I hope this helps.

Related

InfluxDB JSON (MQTT): rename fields with wildcards

I need help with an issue I am having sending the data from my Zigbee thermometers to InfluxDB via Telegraf.
This is the path:
Zigbee Sonoff SNZB-02 --> Tasmota ZBBridge --> MQTT --> Telegraf --> InfluxDB
The ID of the Zigbee thermometer (0x4EF9) might change, since it is randomly assigned to the device; in Tasmota I am able to assign a "friendly name", in this case ZB_Sonoff_Temp01.
With simple Tasmota devices I have no issues: each device has its own entry in the MQTT topic, and Telegraf plays nicely with those.
My issue is with the data from the Zigbee bridge, since it has a single topic and the output in InfluxDB is a bit difficult to work with.
Example MQTT message for Zigbee thermometer:
tele/tasmota_ABDCEF/ZB_Sonoff_Temp01/SENSOR {"ZbReceived":{"0x4EF9":{"Device":"0x4EF9","Name":"ZB_Sonoff_Temp01","Humidity":94.84,"Endpoint":1,"LinkQuality":34}}}
The data is in JSON format, as you can see.
In Telegraf I am using mqtt_consumer; here is the config, /etc/telegraf/telegraf.d/mqtt.conf:
[[inputs.mqtt_consumer]]
servers = ["tcp://192.168.10.10:1883"]
## Topics that will be subscribed to.
topics = [
"tele/tasmota_ABCDEF/ZB_Sonoff_Temp01/SENSOR"
]
qos = 0
connection_timeout = "30s"
username = "user"
password = "password"
data_format = "json"
[[outputs.influxdb]]
urls = ["http://localhost:8086"]
database = "test_temp"
# skip_database_creation = true
And this is my /etc/telegraf/telegraf.conf:
[global_tags]
[agent]
logfile = "/var/log/telegraf/telegraf.log"
interval = "10s"
round_interval = true
metric_batch_size = 1000
metric_buffer_limit = 10000
collection_jitter = "0s"
flush_interval = "10s"
flush_jitter = "0s"
precision = ""
hostname = ""
omit_hostname = false
This is the data in InfluxDB:
time ZbReceived_0x4EF9_Endpoint ZbReceived_0x4EF9_Humidity ZbReceived_0x4EF9_LinkQuality ZbReceived_0x4EF9_Temperature host topic
---- -------------------------- -------------------------- ----------------------------- ----------------------------- ---- -----
2021-12-24T15:43:55.26962955Z 1 99.99 26 32.24 influxdb-test tele/tasmota_ABCDEF/ZB_Sonoff_Temp01/SENSOR
2021-12-24T15:43:55.560162845Z 1 21 25.18 influxdb-test tele/tasmota_ABDCEF/ZB_Sonoff_Temp01/SENSOR
This could be OK, since I am able to pick out the device via the "topic" value, but the problem is that the field "ZbReceived_0x4EF9_Temperature" is not stable if the device changes ID when re-associating with Zigbee, which might happen.
The workaround I found is to add a rename processor for the fields:
[[processors.rename]]
[[processors.rename.replace]]
field = "ZbReceived_0x4EF9_Temperature"
dest = "Temperature"
[[processors.rename.replace]]
field = "ZbReceived_0x4EF9_Humidity"
dest = "Humidity"
[[processors.rename.replace]]
field = "ZbReceived_0x4EF9_Endpoint"
dest = "Endpoint"
[[processors.rename.replace]]
field = "ZbReceived_0x4EF9_LinkQuality"
dest = "LinkQuality"
This renames the fields as I want (there is no Humidity here, but it is not always pushed, so that is fine; I am dropping the database between changes):
time Endpoint LinkQuality Temperature host topic
---- -------- ----------- ----------- ---- -----
2021-12-24T15:47:09.992947108Z 1 21 23 influxdb-test tele/tasmota_ABCDEF/ZB_Sonoff_Temp01/SENSOR
2021-12-24T15:47:25.868416967Z 1 13 27.06 influxdb-test tele/tasmota_ABCDEF/ZB_Sonoff_Temp01/SENSOR
I don't like this solution very much, since it hardcodes the device IDs in the Telegraf config, so whenever I add or change a sensor I need to edit Telegraf.
What I am now looking for is a wildcard or some other method to rename the fields independently of the device ID, for example:
[[processors.rename]]
[[processors.rename.replace]]
field = "*_Temperature"
dest = "Temperature"
However, I have not been able to find anything like that; I have read the docs for all the processors (including strings) but could not find a way to achieve this.
Do you have any tips that could help me?
Thank you very much, and happy holidays!
I needed to use processors.regex.field_rename, but I was running an older version of Telegraf (1.20.2-1) rather than the latest (1.21.1-1), so field_rename was not available until I upgraded.
The regex I used:
[[processors.regex]]
[[processors.regex.field_rename]]
pattern = '^ZbReceived_\w+_(.*)'
replacement = "${1}"
The result:
time Endpoint Humidity LinkQuality Temperature host topic
---- -------- -------- ----------- ----------- ---- -----
2021-12-24T23:51:48.289677669Z 1 0 25.26 influxdb-test tele/tasmota_ABCDEF/ZB_Sonoff_Temp01/SENSOR

Can we nicely convert MQTT topic descriptions into InfluxDB tags, using the mqtt_consumer Telegraf input plugin?

I am publishing data to MQTT topics with the Mosquitto broker, and am trying to pipe the data points through to my InfluxDB database. Currently I am trying to do this using the mqtt_consumer input plugin for Telegraf.
A simplified version of my telegraf.conf file looks like this:
# Global Agent Configuration
[agent]
interval = "1m"
# Input Plugins
[[inputs.mqtt_consumer]]
name_prefix = "sensors_"
servers = ["localhost:1883"]
qos = 0
connection_timeout = "30s"
topics = [
"/inside/kitchen/temperature"
]
persistent_session = false
data_format = "value"
data_type = "float"
# Output Plugins
[[outputs.influxdb]]
database = "telegraf"
urls = [ "http://localhost:8086" ]
Now, I can manually publish a data point using MQTT via the following:
~ $ mosquitto_pub -d -t /inside/kitchen/temperature -m 23.5
where I have used the MQTT topic /inside/kitchen/temperature and a value of 23.5.
Examining my InfluxDB database, I see the following data points:
name: sensors_mqtt_consumer
time topic value
---- ----- -----
2020-06-27T20:08:50 /inside/kitchen/temperature 23.5
2020-06-27T20:08:40 /inside/kitchen/temperature 23.5
Is there any way I can use the MQTT topic name to populate the InfluxDB tags properly? For example, I would like something like:
name: temperature
time location room value
---- ----- ----- -----
2020-06-27T20:08:50 inside kitchen 23.5
2020-06-27T20:08:40 inside kitchen 23.5
Is there a way to do this with the InfluxDB/Mosquitto/Telegraf configuration? Later I will be adding more topics (locations, etc.) and measurements (humidity, voltage, etc.), so the solution should allow for this.
(I know this can be done by choosing data_format = "influx" as described here, where Telegraf interprets the message as InfluxDB line protocol and passes it through directly. However, I would then have to publish to the topic like this:
mosquitto_pub -d -t /inside/kitchen/temperature -m "temperature,location=inside,room=kitchen value=23.5"
where you can see that most of the information has been entered twice, even though it already exists in the topic. The same can be said for the data_format = "json" option. What I need is more of a mapping.)
I found a comprehensive answer in the Telegraf repo's issue threads; I have copied it here for convenience. Since I had control over the message source, I used the last option.
You have a few options; the example configs below are untested and might need tweaking.
You can have multiple plugins, one for each type, and use name_override to vary the measurement name:
[[inputs.mqtt_consumer]]
name_override = "my_integer"
topics = ["/XXX"]
data_format = "value"
data_type = "integer"
[[inputs.mqtt_consumer]]
name_override = "my_string"
topics = ["/SSS"]
data_format = "value"
data_type = "string"
You can have a single plugin reading all values as strings, and rename the fields and convert the integers using processors:
[[inputs.mqtt_consumer]]
topics = ["/#"]
data_format = "value"
data_type = "string"
[[processors.rename]]
order = 1
[processors.rename.tagpass]
topic = ["/SSS"]
[[processors.rename.replace]]
field = "value"
dest = "my_string"
[[processors.rename]]
order = 2
[processors.rename.tagpass]
topic = ["/XXX"]
[[processors.rename.replace]]
field = "value"
dest = "my_integer"
[[processors.converter]]
order = 3
[processors.converter.tagpass]
topic = ["/XXX"]
[[processors.converter.fields]]
integer = ["my_integer"]
Last "option" is to write a format that can be converted automatically on read, such as InfluxDB Line Protocol which will give you the most control over how the data looks. I recommend this if it is possible to modify the incoming data:
[[inputs.mqtt_consumer]]
topics = ["/#"]
data_format = "influx"
I am new to Telegraf, but I have had some success with something along the lines of:
[[inputs.mqtt_consumer]]
name_override = "sensors"
servers = ["localhost:1883"]
qos = 0
connection_timeout = "30s"
topics = [
"/inside/kitchen/temperature"
]
persistent_session = false
data_format = "value"
data_type = "float"
[[processors.regex]]
namepass = ["sensors"]
# Base topic present
[[processors.regex.tags]]
key = "topic"
pattern = "(.*)/.*/.*/"
replacement = "${1}"
result_key = "location"
[[processors.regex.tags]]
key = "topic"
pattern = ".*/(.*)/.*"
replacement = "${1}"
result_key = "room"
[[processors.regex.tags]]
key = "topic"
pattern = ".*/.*/(.*)"
replacement = "${1}"
result_key = "value"
The regex processor uses Go regular expression syntax.

How do I give the channel directories (checkpoint & data dir) date-based names dynamically?

I use a channel as a backup in Flume, without any sink, and it is working correctly. Below is my working configuration, but how can I give the directory or file name dynamically? (I want date-based names: when the date changes, a new directory should be created dynamically and the previous one should remain as a backup.)
# Name the components on this agent
a1.sources = r1
a1.channels =c1
# Describe/configure the source r1
a1.sources.r1.type = http
a1.sources.r1.port = 40441
a1.sources.r1.bind = X.X.X.X
a1.sources.r1.channels = c1
# Use a channel which buffers events in file
a1.channels.c1.type = file
a1.channels.c1.dataDirs = /data/disk11/flume/Test/dataDirs{%y%m%d}
a1.channels.c1.checkpointDir = /data/disk11/flume/Test/checkpointDir{%y%m%d}
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

How to increase the processing rate of flume agent

I have a Flume agent that ingests data into Elasticsearch. The agent uses a spoolDir source. There is another agent which writes files into the spoolDir of the Elasticsearch agent.
Over time, the gap between the processed files and the unprocessed files keeps growing.
I want to increase the number of events processed by the Flume agent to speed up ingestion.
Here is the configuration of the Flume agent:
agent04.sources = s1
agent04.channels = ch1
agent04.channels = memoryChannel
agent04.channels.memoryChannel.type = memory
agent04.channels.memoryChannel.capacity=100000
agent04.channels.memoryChannel.transactionCapacity=1000
agent04.sources.s1.channels = memoryChannel
agent04.sources.s1.type = spooldir
agent04.sources.s1.spoolDir = /DataCollection/Flume_Cleaner_Output/Json_Elastic/
agent04.sources.s1.deserializer.maxLineLength = 100000
agent04.sinks = elasticsearch
agent04.sinks.elasticsearch.channel = memoryChannel
agent04.sinks.elasticsearch.type=org.css.cssElasticsearchSink
agent04.sinks.elasticsearch.batchSize=400
agent04.sinks.elasticsearch.hostNames = elastic-node01.css.org
agent04.sinks.elasticsearch.indexName = all_collections
agent04.sinks.elasticsearch.indexType = live_tweets
agent04.sinks.elasticsearch.indexNameBuilder= org.css.sa.flume.elasticsearch.sink.indexNameBuilder.HeaderValueBasedIndexNameBuilder
agent04.sinks.elasticsearch.clusterName = css_rai_social
agent04.sinks.elasticsearch.serializer = org.jai.flume.sinks.elasticsearch.serializer.ElasticSearchJsonBodyEventSerializer
agent04.sinks.elasticsearch.cache_period_ms=90d
Why are you chaining two Flume agents using the spoolDir? That will be really slow and is a surprising configuration: you are incurring the cost of frequent fsyncs as each batch gets processed.
I recommend chaining them using the Avro Sink and Avro Source instead (a sketch follows below). I would also raise the batch size to at least 1000. (Computers really like batches, and Flume is set up to take advantage of that.)
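A rough, untested sketch of that chaining; the upstream agent name (agent03), hostname and port are hypothetical, and only the Avro-related pieces are shown:
# Sketch only: the agent03 name, hostname and port below are hypothetical.
# Upstream agent: replace its file-writing sink with an Avro sink
# pointing at the Elasticsearch agent.
agent03.sinks = avroSink
agent03.sinks.avroSink.type = avro
agent03.sinks.avroSink.channel = memoryChannel
agent03.sinks.avroSink.hostname = elastic-agent-host.css.org
agent03.sinks.avroSink.port = 4545
agent03.sinks.avroSink.batch-size = 1000

# Downstream agent (agent04): replace the spooldir source with an Avro source.
agent04.sources.s1.type = avro
agent04.sources.s1.bind = 0.0.0.0
agent04.sources.s1.port = 4545
agent04.sources.s1.channels = memoryChannel

# And raise the Elasticsearch sink batch size.
agent04.sinks.elasticsearch.batchSize = 1000
The Avro sink/source pair hands events over via RPC instead of going through files on disk, which removes the spoolDir scanning and fsync overhead between the two agents.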

Apache Flume multiple agent

I have tested Apache Flume for transferring files from the local filesystem to HDFS. But if the source files come from multiple servers (i.e. transferring files from different servers' local filesystems to HDFS), can I just run one Flume instance and add more agents to flume-conf.properties?
If I can, how should I edit the following parameters in flume-conf.properties:
agent1.sources.spooldirSource1.spoolDir = ?(server1/path)
agent2.sources.spooldirSource2.spoolDir = ?(server2/path)
Also, how do I run Flume?
./flume-ng agent -n agent -c conf -f apache-flume-1.4.0-bin/conf/flume-conf.properties
This can only run one Flume agent. What about two or more?
Add multiple sources as needed, but configure them to use the same channel, which will then feed the same sink. So it's something like this (note that the snippet is incomplete):
agent1.sources.spooldirSource1.spoolDir = server1/path
agent1.sources.spooldirSource1.channels = myMemoryChannel
agent1.sources.spooldirSource2.spoolDir = server2/path
agent1.sources.spooldirSource2.channels = myMemoryChannel
Using the same channel for two sources isn't good practice; you can easily run the channel out of memory (for a MemoryChannel) in a case like this.
It is better to use one channel per source (within the same agent):
a1.sources = r1 r2
a1.sinks = k1 k2
a1.channels = c1 c2
Then link source r1 to channel c1 and source r2 to channel c2, as sketched below.
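Spelled out, a minimal sketch might look like this; the spoolDir paths, the HDFS sink and its paths, and the channel types are placeholders:
# Sketch only: spoolDir paths, sink type/paths and channel types are placeholders.
a1.sources = r1 r2
a1.sinks = k1 k2
a1.channels = c1 c2

a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /server1/path
a1.sources.r1.channels = c1

a1.sources.r2.type = spooldir
a1.sources.r2.spoolDir = /server2/path
a1.sources.r2.channels = c2

a1.channels.c1.type = memory
a1.channels.c2.type = memory

a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/server1
a1.sinks.k1.channel = c1

a1.sinks.k2.type = hdfs
a1.sinks.k2.hdfs.path = /flume/server2
a1.sinks.k2.channel = c2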
