Get the average of data coming from thousands of sensors - google-cloud-dataflow

I've been trying to build a Dataflow pipeline that takes in data from Pub/Sub and publishes it to Bigtable or BigQuery. I can write the raw data for 1 sensor, but I can't do it for thousands of sensors once I try to calculate the mean of a 60-second window of data.
To illustrate the scenario:
My data payload
data = {
"timestamp": "2021-01-27 13:56:01.634717+08:00",
"location": "location1",
"name" : "name1",
"datapoint1" : "Some Integer",
"datapoint2" : "Some Integer",
"datapoint3" : "Some String",
.....
"datapointN" : "Some Integer",
}
In my example there will be thousands of sensors, each with the full name "{location}_{name}". For each sensor, I would like to window the data into 60-second windows and calculate the average of that data.
The final form I am expecting
I will take this final form, which exists as one element per sensor per window, and insert it into Bigtable and BigQuery
finalform = {
"timestamp": "2021-01-27 13:56:01.634717+08:00",
"location": "location1",
"name" : "name1",
"datapoint1" : "Integer That Has Been Averaged",
"datapoint2" : "Integer That Has Been Averaged",
"datapoint3" : "String that has been left alone",
.....
"datapointN" : "Integer That Has Been Averaged",
}
My solution so far, which needs help:
import json
import apache_beam as beam
from apache_beam import window

p = beam.Pipeline()
rawdata = p | "Read" >> beam.io.ReadFromPubSub(topic=topic)
jsonData = rawdata | "Parse Json" >> beam.Map(json.loads)
windoweddata = jsonData | beam.WindowInto(window.FixedWindows(60))
groupedData = windoweddata | beam.GroupBy(location=lambda s: s["location"], name=lambda s: s["name"])
Now after the last line I am stuck. I want to apply CombineValues so that I can use Mean.
However, after applying GroupBy I get a tuple of (named key, values). When I then run a ParDo to split the JSON into (key, value) tuples to prepare it for CombineValues, all the data gets mixed up again, and sensor data from various locations ends up mixed together in the PCollection.
My challenges
So, in its clearest form, I have 2 main challenges:
How do I apply CombineValues to my pipeline?
How do I apply Mean to the pipeline while ignoring the "string" type entries?
Any help will be greatly welcomed.
My partial solution so far with help from chamikara
import apache_beam as beam
import apache_beam.runners.interactive.interactive_beam as ib
from apache_beam import window
class AverageFn(beam.CombineFn):
    def create_accumulator(self):
        print(dir(self))
        return (1, 2, 3, 4, 5, 6)

    def add_input(self, sum_count, input):
        print("add input", sum_count, input)
        return sum_count

    def merge_accumulators(self, accumulators):
        print(accumulators)
        data = zip(*accumulators)
        return data

    def extract_output(self, sum_count):
        print("extract_output", sum_count)
        data = sum_count
        return data
with beam.Pipeline() as pipeline:
    total = (
        pipeline
        | 'Create plant counts' >> beam.Create([
            {
                "timestamp": "2021-01-27 13:55:41.634717+08:00",
                "location": "L1",
                "name": "S1",
                "data1": 1,
                "data2": "STRING",
                "data3": 3,
            },
            {
                "timestamp": "2021-01-27 13:55:41.634717+08:00",
                "location": "L1",
                "name": "S2",
                "data1": 1,
                "data2": "STRING",
                "data3": 3,
            },
            {
                "timestamp": "2021-01-27 13:55:41.634717+08:00",
                "location": "L2",
                "name": "S3",
                "data1": 1,
                "data2": "STRING",
                "data3": 3,
            },
            {
                "timestamp": "2021-01-27 13:55:51.634717+08:00",
                "location": "L1",
                "name": "S1",
                "data1": 1,
                "data2": "STRING",
                "data3": 3,
            },
            {
                "timestamp": "2021-01-27 13:55:51.634717+08:00",
                "location": "L1",
                "name": "S2",
                "data1": 1,
                "data2": "STRING",
                "data3": 3,
            },
            {
                "timestamp": "2021-01-27 13:55:51.634717+08:00",
                "location": "L2",
                "name": "S3",
                "data1": 1,
                "data2": "STRING",
                "data3": 3,
            },
            {
                "timestamp": "2021-01-27 13:56:01.634717+08:00",
                "location": "L1",
                "name": "S1",
                "data1": 1,
                "data2": "STRING",
                "data3": 3,
            },
            {
                "timestamp": "2021-01-27 13:56:01.634717+08:00",
                "location": "L1",
                "name": "S2",
                "data1": 1,
                "data2": "STRING",
                "data3": 3,
            },
            {
                "timestamp": "2021-01-27 13:56:01.634717+08:00",
                "location": "L2",
                "name": "S3",
                "data1": 1,
                "data2": "STRING",
                "data3": 3,
            },
        ])
        | beam.GroupBy(location=lambda s: s["location"], name=lambda s: s["name"])
        | beam.CombinePerKey(AverageFn())
        | beam.Map(print))

Please see the Combine section (particularly CombinePerKey) here. You should first arrange your data into a PCollection of KVs with an appropriate key (for example, a combination of location and name). This PCollection can be followed by a CombinePerKey with a CombineFn implementation that combines the given data objects (by averaging the respective fields).
That averaging should be done within your CombineFn implementation, where you combine the relevant fields and ignore the string fields.
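As an illustration of that keying step, here is a minimal, self-contained sketch (not from the original answer) that keys each record by a (location, name) tuple and averages a single numeric field per key; the field names are illustrative, and it assumes beam.combiners.MeanCombineFn for the single-field case. The full multi-field CombineFn is worked out in the accepted answer below.
import apache_beam as beam

# Sketch: key by (location, name), then average one illustrative field per key.
with beam.Pipeline() as p:
    (
        p
        | beam.Create([
            {"location": "L1", "name": "S1", "data1": 1.0},
            {"location": "L1", "name": "S1", "data1": 3.0},
            {"location": "L2", "name": "S3", "data1": 10.0},
        ])
        | "Key by sensor" >> beam.Map(lambda d: ((d["location"], d["name"]), d["data1"]))
        | "Mean per sensor" >> beam.CombinePerKey(beam.combiners.MeanCombineFn())
        | beam.Map(print))  # e.g. (('L1', 'S1'), 2.0) and (('L2', 'S3'), 10.0)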

The final answer is below. The breakthrough for me was realising not to use GroupBy but to use beam.Map instead, because beam.Map is a 1-to-1 transformation. I transform each row of my data into a (key, data) tuple, where the key is whatever I specify as the unique identifier for that row using beam.Row(); later I collect those tuples and act on them using CombinePerKey.
import apache_beam as beam
import apache_beam.runners.interactive.interactive_beam as ib
from apache_beam import window
DATA = [
    {
        "timestamp": "2021-01-27 13:55:41.634717+08:00",
        "location": "L1",
        "name": "S1",
        "data1": 1,
        "data2": "STRING",
        "data3": 5,
        "data4": 5,
    },
    {
        "timestamp": "2021-01-27 13:55:41.634717+08:00",
        "location": "L1",
        "name": "S2",
        "data1": 9,
        "data2": "STRING",
        "data3": 2,
        "data4": 2,
    },
    {
        "timestamp": "2021-01-27 13:55:41.634717+08:00",
        "location": "L2",
        "name": "S3",
        "data1": 10,
        "data2": "STRING",
        "data3": 4,
        "data4": 1,
    },
    {
        "timestamp": "2021-01-27 13:55:51.634717+08:00",
        "location": "L1",
        "name": "S1",
        "data1": 11,
        "data2": "STRING",
        "data3": 2,
        "data4": 7,
    },
    {
        "timestamp": "2021-01-27 13:55:51.634717+08:00",
        "location": "L1",
        "name": "S2",
        "data1": 1,
        "data2": "STRING",
        "data3": 4,
        "data4": 8,
    },
    {
        "timestamp": "2021-01-27 13:55:51.634717+08:00",
        "location": "L2",
        "name": "S3",
        "data1": 9,
        "data2": "STRING",
        "data3": 7,
        "data4": 8,
    },
    {
        "timestamp": "2021-01-27 13:56:01.634717+08:00",
        "location": "L1",
        "name": "S1",
        "data1": 2,
        "data2": "STRING",
        "data3": 3,
        "data4": 5,
    },
    {
        "timestamp": "2021-01-27 13:56:01.634717+08:00",
        "location": "L1",
        "name": "S2",
        "data1": 6,
        "data2": "STRING",
        "data3": 7,
        "data4": 6,
    },
    {
        "timestamp": "2021-01-27 13:56:01.634717+08:00",
        "location": "L2",
        "name": "S3",
        "data1": 8,
        "data2": "STRING",
        "data3": 1,
        "data4": 2,
    },
]
class AverageFn2(beam.CombineFn):
    def create_accumulator(self):
        # Accumulator is (payload dict, count of rows seen)
        return {}, 0

    def add_input(self, accumulator, input):
        rowdata, count = accumulator
        # Sum numeric fields; non-numeric (string) fields keep the first value seen
        for key, value in input.items():
            if key in rowdata:
                try:
                    rowdata[key] += float(value)
                except (TypeError, ValueError):
                    pass  # string field: keep the value already stored
            else:
                rowdata[key] = value
        return rowdata, count + 1

    def merge_accumulators(self, accumulators):
        rowdatas, counts = zip(*accumulators)
        payload = {}
        # Combine all the accumulators
        for dictionary in rowdatas:
            for key, value in dictionary.items():
                if key in payload:
                    try:
                        payload[key] += float(value)
                    except (TypeError, ValueError):
                        pass  # string field: keep the value already stored
                else:
                    payload[key] = value
        return payload, sum(counts)

    def extract_output(self, accumulator):
        rowdata, count = accumulator
        # Divide the summed numeric fields by the row count; leave strings alone
        for key, value in rowdata.items():
            try:
                rowdata[key] = float(value) / count
            except (TypeError, ValueError):
                pass
        return rowdata
with beam.Pipeline() as pipeline:
    total = (
        pipeline
        | 'Create plant counts' >> beam.Create(DATA)
        | beam.Map(lambda item: (beam.Row(location=item["location"], name=item["name"]), item))
        | beam.CombinePerKey(AverageFn2())
        | beam.Map(print))
Hope this helps another Dataflow newbie like myself.
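For completeness, here is a hedged sketch (not part of the answer above) of how the same key-and-combine steps might slot back into the streaming pipeline from the question; it assumes the topic variable from the question and the AverageFn2 class defined above.
import json

import apache_beam as beam
from apache_beam import window
from apache_beam.options.pipeline_options import PipelineOptions

# Sketch only: 'topic' is the Pub/Sub topic path from the question and
# AverageFn2 is the CombineFn defined above.
options = PipelineOptions(streaming=True)
with beam.Pipeline(options=options) as p:
    (
        p
        | "Read" >> beam.io.ReadFromPubSub(topic=topic)
        | "Parse Json" >> beam.Map(json.loads)
        | "Window" >> beam.WindowInto(window.FixedWindows(60))
        | "Key by sensor" >> beam.Map(
            lambda item: (beam.Row(location=item["location"], name=item["name"]), item))
        | "Average" >> beam.CombinePerKey(AverageFn2())
        | "Drop key" >> beam.Map(lambda kv: kv[1])
        | "Print" >> beam.Map(print))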

Related

Time Series Insights not showing sub-object properties of a key/value pair

I have an application that is pushing data into IoT Hub which is being used as a data source for TSI. Below is an example message:
{
"EnqueuedTimeUtc": "2021-06-17T22:00:47.2170000Z",
"Properties": {},
"SystemProperties": {
"connectionDeviceId": "Device1",
"connectionAuthMethod": "{\"scope\":\"device\",\"type\":\"sas\",\"issuer\":\"iothub\",\"acceptingIpFilterRule\":null}",
"connectionDeviceGenerationId": "637425408342887985",
"contentType": "application/json",
"contentEncoding": "utf-8",
"enqueuedTime": "2021-06-17T22:00:47.2170000Z"
},
"Body": {
"topic": {
"namespace": "spBv1.0",
"edgeNodeDescriptor": "Routed Group/E2",
"groupId": "Routed Group",
"edgeNodeId": "E2",
"deviceId": "D2",
"type": "DBIRTH"
},
"payload": {
"timestamp": "2021-06-17T22:00:47.082Z",
"metrics": [{
"name": "Ramp1",
"timestamp": "2021-06-17T22:00:47.082Z",
"dataType": "Int32",
"metaData": {},
"properties": {
"Quality": {
"type": "Int32",
"value": 192
},
"My Property": {
"type": "String",
"value": "{\"\":\"\"}"
}
},
"value": 77
}],
"seq": 1
}
}
}
I found documentation showing that my array of 'metrics' is supported as shown here:
https://learn.microsoft.com/en-us/azure/time-series-insights/concepts-json-flattening-escaping-rules
With this message, I can see 'Ramp1' show up in TSI with a value and timestamp as expected. However, the 'properties' under each metric do not show up. In this example that is 'Quality' and 'My Property'. Is there a way to get this data into TSI with an association to 'Ramp1'?

How to generate the pyarrow schema for the dynamic values

I am trying to write a Parquet schema for my JSON message that needs to be written back to a GCS bucket using apache_beam.
My JSON is like below:
data = {
"name": "user_1",
"result": [
{
"subject": "maths",
"marks": 99
},
{
"subject": "science",
"marks": 76
}
],
"section": "A"
}
The result array in the above example can have many values; the minimum is 1.
This is the schema you need:
import pyarrow as pa

schema = pa.schema(
    [
        pa.field("name", pa.string()),
        pa.field(
            "result",
            pa.list_(
                pa.struct(
                    [
                        pa.field("subject", pa.string()),
                        pa.field("marks", pa.int32()),
                    ]
                )
            ),
        ),
        pa.field("section", pa.string()),
    ]
)
If you have a file containing one record per line:
{"name": "user_1", "result": [{"subject": "maths", "marks": 99}, {"subject": "science", "marks": 76}], "section": "A"}
{"name": "user_2", "result": [{"subject": "maths", "marks": 10}, {"subject": "science", "marks": 75}], "section": "A"}
You can load it using:
from pyarrow import json as pa_json
table = pa_json.read_json('filename.json', parse_options=pa_json.ParseOptions(explicit_schema=schema))
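Since the question mentions writing the records back to a GCS bucket with apache_beam, here is a hedged sketch of how the same schema could be handed to WriteToParquet; the bucket path is a placeholder, and 'schema' and 'data' refer to the objects defined above.
import apache_beam as beam
from apache_beam.io.parquetio import WriteToParquet

# Sketch only: the gs:// path is a placeholder; 'schema' is the pyarrow
# schema above and 'data' is the dict from the question.
with beam.Pipeline() as p:
    (
        p
        | beam.Create([data])
        | WriteToParquet(
            file_path_prefix="gs://your-bucket/output",
            schema=schema,
            file_name_suffix=".parquet"))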

Losing geojson's feature property information when visualising by using folium

Currently I have already made a geojson file (called output):
{"type": "FeatureCollection", "features": [ {"type": "Feature", "geometry": {"type": "Point", "coordinates": [103.815381, 1.279109]}, "properties": {"temperature": 24, "marker-symbol": "park", "marker-color": "#AF4646"}}, {"type": "Feature", "geometry": {"type": "MultiLineString", "coordinates": [[[103.809297, 1.294906], [103.799445, 1.283906], [103.815381, 1.294906]]]}, "properties": {"temperature": 24, "stroke": "#AF4646"}}]}
It contains a multiline string type and a point type. The expected output should look like this (visualised using geojson.io), where all the properties (e.g. the colour of the line string and the marker, and the forest icon of the marker) are kept:
My goal is to generate an HTML file or an image file (the best choice) of this map, so I turned to folium. However, when I use the following commands:
m = folium.Map(location=[1.2791,103.8154], zoom_start=12)
folium.GeoJson(output, name='test').add_to(m)
m.save('map.html')
The visualisation then looks like this:
All the property information has been wiped out. Is there any way to keep that property information? Thanks.
The provided GeoJSON (output) contains styling properties defined in the simplestyle spec, which are not supported by Leaflet's L.geoJSON.
The leaflet-simplestyle plugin can be utilized, which extends L.geoJSON to support the simplestyle spec. Here is an example of how to utilize it in folium:
import folium
from folium.elements import JSCSSMixin
from folium.map import Layer
from jinja2 import Template
class StyledGeoJson(JSCSSMixin, Layer):
    """
    Creates a GeoJson layer which supports the simplestyle spec.
    """
    _template = Template(u"""
        {% macro script(this, kwargs) %}
            var {{ this.get_name() }} = L.geoJson({{ this.data }},
                {
                    useSimpleStyle: true,
                    useMakiMarkers: true
                }
            ).addTo({{ this._parent.get_name() }});
        {% endmacro %}
        """)

    default_js = [
        ('leaflet-simplestyle', 'https://unpkg.com/leaflet-simplestyle'),
    ]

    def __init__(self, data, name=None, overlay=True, control=True, show=True):
        super(StyledGeoJson, self).__init__(name=name, overlay=overlay,
                                            control=control, show=show)
        self._name = 'StyledGeoJson'
        self.data = data
Usage
output = {"type": "FeatureCollection", "features": [ {"type": "Feature", "geometry": {"type": "Point", "coordinates": [103.815381, 1.279109]}, "properties": {"temperature": 24, "marker-symbol": "park", "marker-color": "#AF4646"}}, {"type": "Feature", "geometry": {"type": "MultiLineString", "coordinates": [[[103.809297, 1.294906], [103.799445, 1.283906], [103.815381, 1.294906]]]}, "properties": {"temperature": 24, "stroke": "#AF4646"}}]}
m = folium.Map(location=[1.2791,103.8154], zoom_start=14)
StyledGeoJson(output).add_to(m)
m
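To produce the HTML file the question asks for, the styled map can then be saved just like a regular folium map:
m.save('map.html')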

InfluxDB write points with same timestamp but different measurement

Requirement:
I want to create an InfluxDB database to store time-series data from multiple sensors reporting temperatures from different locations.
Problem:
When I write points to the database with the same timestamp but different tags (for example: location) and field (temperature) values, InfluxDB overwrites both the tag and field values with those of the latest point.
I followed the documentation available on their website; they show a sample database matching the above requirement, but I am not able to find the schema used.
Example Table with duplicate timestamps
Additional Information:
Sample Input:
json_body_1 = [
{
"measurement": "cpu_load_short",
"tags": {
"host": "server02",
"region": "us-west"
},
"time": "2009-11-10T23:00:00Z",
"fields": {
"Float_value": 0.7,
"Int_value": 6,
"String_value": "Text",
"Bool_value": False
}
},
{
"measurement": "cpu_load_short",
"tags": {
"host": "server01",
"region": "us-west"
},
"time": "2009-11-10T23:00:00Z",
"fields": {
"Float_value": 1.0,
"Int_value": 2,
"String_value": "Text",
"Bool_value": False
}
}]
I used the example given in the official documentation; still, instead of 2 records, I get only one. Please note the host tag is different, which should ideally make each point unique.
Documentation
Today I too faced the same issue, and now I have come to the solution. :)
from influxdb import InfluxDBClient
client = InfluxDBClient(host='host name', port=8086, database='test_db',username='writer', password=Config.INFLUXDB_WRITE_PWD)
points = [{
"measurement": "cpu_load_short",
"tags": {
"host": "server02",
"region": "us-west"
},
"time": "2009-11-10T23:00:00Z",
"fields": {
"Float_value": 0.7,
"Int_value": 6,
"String_value": "Text",
"Bool_value": False
}
},
{
"measurement": "cpu_load_short",
"tags": {
"host": "server01",
"region": "us-west"
},
"time": "2009-11-10T23:00:00Z",
"fields": {
"Float_value": 1.0,
"Int_value": 2,
"String_value": "Text",
"Bool_value": False
}
}]
status = client.write_points(points, database='test_db', batch_size=10000, protocol='json')
Here is the output:
> select * from cpu_load_short;
name: cpu_load_short
time                 Bool_value  Float_value  Int_value  String_value  host      region
----                 ----------  -----------  ---------  ------------  ----      ------
1257894000000000000  false       1            2          Text          server01  us-west
1257894000000000000  false       0.7          6          Text          server02  us-west
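The same check can also be run from the Python client; a small sketch that reuses the client created above:
# Sketch: reuses the 'client' object created above.
result = client.query('select * from cpu_load_short', database='test_db')
for point in result.get_points(measurement='cpu_load_short'):
    print(point)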

How can I restore data from a previous result in the browser?

I have a query that I ran in the neo4j web browser:
MATCH p=(z)<-[*]-(a)-[:foo]->(b) WHERE b.value = "bar" return p
This returned a large number of connected nodes. Something happened that seems to have deleted all of these nodes (in a separate incident), but I still have the output of the old query. The code section of the browser has the response data listed:
...
"graph": {
"nodes": [
{
"id": "1148578",
"labels": [
"text"
],
"properties": {
"value": "bar",
"timestamp": 1478747946867
}
},
...
Is there a way for me to recreate all of this data from the output of an old query?
You could use apoc.load.json to do this. Note that this solution will not preserve the internal node ids. APOC is a procedure library that extends built-in Neo4j functionality.
Given the JSON file
{"graph": {
"nodes": [
{
"id": "32496",
"labels": [
"Person"
],
"properties": {
"born": 1967,
"name": "Carrie-Anne Moss"
}
},
{
"id": "32505",
"labels": [
"Movie"
],
"properties": {
"tagline": "Evil has its winning ways",
"title": "The Devil's Advocate",
"released": 1997
}
},
{
"id": "32494",
"labels": [
"Movie"
],
"properties": {
"tagline": "Welcome to the Real World",
"title": "The Matrix",
"released": 1999
}
},
{
"id": "32495",
"labels": [
"Person"
],
"properties": {
"born": 1964,
"name": "Keanu Reeves"
}
}
],
"relationships": [
{
"id": "83204",
"type": "ACTED_IN",
"startNode": "32495",
"endNode": "32505",
"properties": {
"role": "Kevin Lomax"
}
},
{
"id": "83183",
"type": "ACTED_IN",
"startNode": "32496",
"endNode": "32494",
"properties": {
"role": "Trinity"
}
},
{
"id": "83182",
"type": "ACTED_IN",
"startNode": "32495",
"endNode": "32494",
"properties": {
"role": "Neo"
}
}
]
}
}
We can recreate the graph using this query:
CALL apoc.load.json("https://dl.dropboxusercontent.com/u/67572426/small_movie_graph.json") YIELD value AS row
WITH row, row.graph.nodes AS nodes
UNWIND nodes AS node
CALL apoc.create.node(node.labels, node.properties) YIELD node AS n
SET n.id = node.id
WITH row
UNWIND row.graph.relationships AS rel
MATCH (a) WHERE a.id = rel.startNode
MATCH (b) WHERE b.id = rel.endNode
CALL apoc.create.relationship(a, rel.type, rel.properties, b) YIELD rel AS r
RETURN *
