Let us assume I have various MQTT clients that send data on some topic, for instance temperature sensors on tele/temp/%devicename%/SENSOR, in JSON format such as
{"Time":"2020-03-24T20:17:04","DS18S20":{"Temperature":22.8},"TempUnit":"C"}
My basic telegraf.conf looks as follows:
# Influxdb Output
[[outputs.influxdb]]
database = "telegraf"
# sensors
[[inputs.mqtt_consumer]]
name_override = "sensor"
topics = ["tele/temp/+/SENSOR"]
data_format = "json"
My problem now is that I fail to do basic operations on that JSON data.
I do not want to save the host and the topic. How can I drop fields?
The topic contains the %devicename%. How can I add it as a tag?
I cannot use json_query, since the %devicename% is embedded in the field name and there is only one field anyway.
How can I rename the field %devicename%_Temperature to "temperature"?
In general, I would like an easy way to keep only the measurements of a multitude of sensors in the following format:
timestamp | temperature | device
2020-03-24T20:17:04 | 22.8 | DS18S20
Thanks a lot!
If you don't want to save the topic in InfluxDB, set the following as part of your inputs.mqtt_consumer:
topic_tag = ""
Adding the device name from the topic as a tag can be done using processors.regex:
[[processors.regex]]
[[processors.regex.tags]]
key = "topic"
pattern = ".*/(.*)/.*"
replacement = "${1}"
result_key = "devicename"
In this case the second-to-last segment of the topic (the %devicename%) becomes the devicename tag.
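Pulling the pieces together with the field rename asked about above, a possible full sketch looks like the following. It assumes a recent Telegraf whose regex processor supports the field_rename sub-table, and that the JSON parser emits a field named like DS18S20_Temperature; adapt the names to your payload.
[[inputs.mqtt_consumer]]
  name_override = "sensor"
  topics = ["tele/temp/+/SENSOR"]
  data_format = "json"
  # The topic tag is kept here so the regex processor below can read it.

[[processors.regex]]
  # Extract the device name from the second-to-last topic segment into its own tag.
  [[processors.regex.tags]]
    key = "topic"
    pattern = ".*/(.*)/.*"
    replacement = "${1}"
    result_key = "device"

  # Rename e.g. DS18S20_Temperature to temperature.
  # (field_rename is an assumption that your Telegraf version ships this sub-table.)
  [[processors.regex.field_rename]]
    pattern = "^.*_Temperature$"
    replacement = "temperature"

[[outputs.influxdb]]
  database = "telegraf"
  # Drop the tags you do not want to store.
  tagexclude = ["host", "topic"]
Instead of topic_tag = "", the topic tag is dropped at the output with tagexclude so it is still available to the processor; the result should come out roughly as sensor,device=DS18S20 temperature=22.8 <timestamp>.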
I have a Kafka producer that reads data from two large files and sends them in the JSON format with the same structure:
def create_sample_json(row_id, data_file):
    return {'row_id': int(row_id), 'row_data': data_file}
The producer breaks every file into small chunks, creates a JSON message from each chunk, and finally sends them in a for loop.
The two files are sent simultaneously through multithreading.
I want to join those streams (s1.row_id == s2.row_id) and eventually do some stream processing while my producer is still sending data to Kafka. Because the producer generates a huge amount of data from multiple sources, I can't wait to consume it all first; the processing must happen concurrently.
I am not sure if the Table API is a good approach, but this is my PyFlink code so far:
from pyflink.datastream.stream_execution_environment import StreamExecutionEnvironment
from pyflink.table import EnvironmentSettings
from pyflink.table.expressions import col
from pyflink.table.table_environment import StreamTableEnvironment
KAFKA_SERVERS = 'localhost:9092'
def log_processing():
    env = StreamExecutionEnvironment.get_execution_environment()
    env.add_jars("file:///flink_jar/kafka-clients-3.3.2.jar")
    env.add_jars("file:///flink_jar/flink-connector-kafka-1.16.1.jar")
    env.add_jars("file:///flink_jar/flink-sql-connector-kafka-1.16.1.jar")

    settings = EnvironmentSettings.new_instance() \
        .in_streaming_mode() \
        .build()

    t_env = StreamTableEnvironment.create(stream_execution_environment=env, environment_settings=settings)

    t1 = f"""
        CREATE TEMPORARY TABLE table1(
            row_id INT,
            row_data STRING
        ) WITH (
            'connector' = 'kafka',
            'topic' = 'datatopic',
            'properties.bootstrap.servers' = '{KAFKA_SERVERS}',
            'properties.group.id' = 'MY_GRP',
            'scan.startup.mode' = 'latest-offset',
            'format' = 'json'
        )
    """

    t2 = f"""
        CREATE TEMPORARY TABLE table2(
            row_id INT,
            row_data STRING
        ) WITH (
            'connector' = 'kafka',
            'topic' = 'datatopic',
            'properties.bootstrap.servers' = '{KAFKA_SERVERS}',
            'properties.group.id' = 'MY_GRP',
            'scan.startup.mode' = 'latest-offset',
            'format' = 'json'
        )
    """

    p1 = t_env.execute_sql(t1)
    p2 = t_env.execute_sql(t2)
Please tell me what I should do next.
Questions:
1) Do I need to consume data in my consumer class separately and then insert it into those tables, or will the data be consumed by what is implemented here (since I passed the connector name, topic, bootstrap.servers, etc.)?
2) If so:
2.1) How can I join those streams in Python?
2.2) How can I avoid reprocessing previous data, given that my producer will send thousands of messages? I want to make sure not to run duplicate queries.
3) If not, what should I do?
Thank you very much.
1) Do I need to consume data in my consumer class separately and then insert it into those tables, or will the data be consumed by what is implemented here?
The latter: the data will be consumed by the 'kafka' table connector you defined. You also need to define a sink table as the target of the insert; the sink table could be another Kafka connector table with the topic you want to output to.
2.1) How can I join those streams in Python?
You can write SQL that joins table1 and table2 and then inserts the result into your sink table, all from Python; see the sketch below.
2.2) How can I avoid reprocessing previous data, given that my producer will send thousands of messages?
You can filter these messages before the join or before the insert; a WHERE clause is enough in your case.
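A minimal sketch of what that could look like, continuing from the t_env above (the sink topic 'joined_topic', the output column names, and the WHERE condition are assumptions for illustration):
    sink = f"""
        CREATE TEMPORARY TABLE sink_table(
            row_id INT,
            row_data_1 STRING,
            row_data_2 STRING
        ) WITH (
            'connector' = 'kafka',
            'topic' = 'joined_topic',
            'properties.bootstrap.servers' = '{KAFKA_SERVERS}',
            'format' = 'json'
        )
    """
    t_env.execute_sql(sink)

    # Join the two source tables on row_id, filter, and write to the sink.
    # An inner join of two append-only Kafka sources stays append-only,
    # so it can be written to a Kafka sink.
    t_env.execute_sql("""
        INSERT INTO sink_table
        SELECT t1.row_id, t1.row_data, t2.row_data
        FROM table1 AS t1
        JOIN table2 AS t2 ON t1.row_id = t2.row_id
        WHERE t1.row_id > 0
    """).wait()
Note that in your code both tables read from the same topic 'datatopic'; for a meaningful join the two streams would normally come from different topics or carry a field that tells them apart.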
How can I use Telegraf to extract timestamp and sensor value from an MQTT message and insert it into a PostgreSQL database with separate timestamp and sensor value columns?
I am receiving this JSON object from MQTT:
{"sensor": "current", "data": [[1614945972418042880, 1614945972418042880], [1614945972418294528, 0.010058338362502514], [1614945972418545920, 0.010058338362502514]]}
It contains two fields: "sensor" and "data". The "sensor" field contains a string value that identifies the type of sensor, and the "data" field contains an array of arrays, where each sub-array contains a timestamp and a sensor value. I am using Telegraf to output this data to a PostgreSQL database. I would like to separate the timestamp and sensor value, flatten them out of the list, and use the sensor name as the column name. How can I configure Telegraf to do this?
So my table would look like this:
timestamp | current
1614945972418042880 | 1614945972418042880
1614945972418294528 | 0.010058338362502514
[[inputs.mqtt_consumer]]
servers = ["tcp://localhost:1883"]
topics = ["your_topic"]
data_format = "json"
json_query = "data.*"
tag_keys = ["sensor","timestamp"]
measurement = "sensors"
I have two databases in Influxdb:
base1
base2
And metrics like these:
base1.example.allRequests.all.percentiles99 807 1607947555
base2.example.allRequests.all.percentiles99 807 1607947555
I have a solution with influx.conf where I'm sending data to different ports:
http://localhost:2003 for base1.....
http://localhost:2004 for base2.....
[[graphite]]
…
bind-address = ":2003"
database = "base-1"
…
[[graphite]]
…
bind-address = ":2004"
database = "base-2"
…
However, I think there is a better solution for this case, where data could be routed based on measurement rather than just by port. Can someone please help me or suggest how to do it?
I am new to the Google Cloud data platform as well as to the Apache Beam API. I would like to aggregate data based on multiple keys. In my requirement I will get a transaction feed having fields like customer id, customer name, transaction amount and transaction type. I would like to aggregate the data based on customer id & transaction type. Here is an example.
customer id,customer name,transaction amount,transaction type
cust123,ravi,100,D
cust123,ravi,200,D
cust234,Srini,200,C
cust444,shaker,500,D
cust123,ravi,100,C
cust123,ravi,300,C
The output should be:
cust123,ravi,300,D
cust123,ravi,400,C
cust234,Srini,200,C
cust444,shaker,500,D
In the Google examples, most are based on a single key (group by a single key). Can anyone please help me with what my PTransform should look like for this requirement and how to produce the aggregated data along with the rest of the fields?
Regards,
Ravi.
Here is an easy way: I concatenated all the keys together to form a single key, then did the sum, and after that split the key to organize the output the way you wanted. Please let me know if you have any questions.
The code does not expect a header row in the CSV file. I kept it short to show the main point you are asking about.
import apache_beam as beam
import sys

class Split(beam.DoFn):
    def process(self, element):
        """
        Splits each row on commas and returns a (combined key, amount) tuple
        for the row to process.
        """
        customer_id, customer_name, transaction_amount, transaction_type = element.split(",")
        return [
            (customer_id + "," + customer_name + "," + transaction_type, float(transaction_amount))
        ]

if __name__ == '__main__':
    p = beam.Pipeline(argv=sys.argv)
    input = 'aggregate.csv'
    output_prefix = 'C:\\pythonVirtual\\Mycodes\\output'
    (p
     | 'ReadFile' >> beam.io.ReadFromText(input)
     | 'parse' >> beam.ParDo(Split())
     | 'sum' >> beam.CombinePerKey(sum)
     # Rebuild the CSV row as id,name,total,type from the combined key and the summed amount.
     | 'convertToString' >> beam.Map(lambda kv: '%s,%s,%s,%s' % (kv[0].split(",")[0], kv[0].split(",")[1], kv[1], kv[0].split(",")[2]))
     | 'write' >> beam.io.WriteToText(output_prefix)
    )
    p.run().wait_until_finish()
It will produce output like the below:
cust234,Srini,200.0,C
cust444,shaker,500.0,D
cust123,ravi,300.0,D
cust123,ravi,400.0,C
I store data as XML files in Data Lake Store, with one folder per source system.
At the end of every day, I would like to run some kind of log analytics to find out how many new XML files were stored in Data Lake Store under each folder. I have enabled Diagnostic Logs and also added the OMS Log Analytics suite.
I would like to know the best way to produce this report.
It is possible to produce such an aggregate report (and even create an alert/notification). Using Log Analytics, you can create a query that searches for any instances when a file is written to your Azure Data Lake Store, based on either a common root path or a file naming pattern:
AzureDiagnostics
| where ( ResourceProvider == "MICROSOFT.DATALAKESTORE" )
| where ( OperationName == "create" )
| where ( Path_s contains "/webhdfs/v1/##YOUR PATH##")
Alternatively, the last line could also be:
| where ( Path_s contains ".xml")
...or a combination of both.
You can then use this query to create an alert that notifies you, at a given interval (e.g. every 24 hours), of the number of files that were created.
Depending on what you need, you can shape the query in these ways:
If you use a common file naming pattern, you can find a match where the path contains that file naming pattern.
If you use a common path, you can find a match where the path matches the common path.
If you want to be notified of all the instances (not just specific ones), you can use an aggregating query, and an alert when a threshold is reached/exceeded (i.e. 1 or more events):
AzureDiagnostics
| where ( ResourceProvider == "MICROSOFT.DATALAKESTORE" )
| where ( OperationName == "create" )
| where ( Path_s contains ".xml")
| summarize AggregatedValue = count(OperationName) by bin(TimeGenerated, 24h), OperationName
With the query, you can create the alert by following the steps in this blog post: https://azure.microsoft.com/en-gb/blog/control-azure-data-lake-costs-using-log-analytics-to-create-service-alerts/.
Let us know if you have more questions or need additional details.