How to aggregate telemetry data from different devices contained in one asset? (ThingsBoard)

I'm working with ThingsBoard Community Edition 2.0.
I have one asset that contains two different devices. Both devices send telemetry data with the same key. I want to be able to show the sum of both values as the total for the asset.
Did anyone figure out how to do it?
Thanks.

The idea is basically to create a memory attribute containing a JSON of the data you want to save.
It's not straightforward, but I found a way that works.
Main steps:
Change the originator to the asset.
Each time telemetry comes in, use an enrichment node to fetch the memory attribute, then use a script node to move it from metadata to msg.
Merge the memory with the incoming telemetry (replacing old values with the new ones).
Compute what you want (min, max, mean, standard deviation, sum, etc.) and save it as telemetry or an attribute.
In parallel, save the merged memory back (note that you can't save JS objects in telemetry or attributes; you have to JSON.stringify() the object to save it and JSON.parse() it when you read it back via the enrichment node). A sketch of the script node follows below.
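Here is a minimal sketch of that script node (ThingsBoard script nodes run JavaScript). It assumes the enrichment node has copied a server attribute named memory into metadata and that both devices report a telemetry key called power; both names are placeholders for your own setup.

```javascript
// Hypothetical ThingsBoard rule-chain script node.
// Assumes: metadata.memory holds the stringified per-device values saved
// earlier, metadata.deviceName identifies the sending device, and the
// incoming msg looks like {"power": 12.5}.

// Restore the stored per-device values (empty object on the first message).
var memory = metadata.memory ? JSON.parse(metadata.memory) : {};

// Replace this device's old value with the incoming one.
memory[metadata.deviceName] = msg.power;

// Compute the asset-level aggregate (here: a simple sum).
var total = 0;
for (var device in memory) {
    total += memory[device];
}

// Output the aggregate (to save as asset telemetry) together with the
// serialized memory (to save back as an asset attribute in parallel).
return {
    msg: { totalPower: total, memory: JSON.stringify(memory) },
    metadata: metadata,
    msgType: msgType
};
```

Downstream save nodes can then store totalPower as asset telemetry and memory as a server attribute on the asset.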
Hope it helps
Corentin

Related

How are Dataflow bundles created after GroupBy/Combine?

Setup:
read from pubsub -> window of 30s -> group by user -> combine -> write to cloud datastore
Problem:
I'm seeing DatastoreIO writer errors because entities with the same key are present in the same transaction.
Question:
I want to understand how my pipeline combines results into bundles after a GroupBy/Combine operation. I would expect a bundle to be created for every window after the combine. But apparently, a bundle can contain multiple occurrences of the same user?
Can re-execution (retries) of bundles cause this behavior?
Is this bundling dependent on the runner?
Is deduplication an option? If so, how would I best approach that?
Note that I'm not looking for a replacement for the datastore writer at the end of the pipeline, I already know that we can use a different strategy. I'm merely trying to understand how the bundling happens.
There are two answers to your question. One is specific to your use case, and the other is in general about bundling / windowing in streaming.
Specific to your pipeline
I am assuming that the 'key' for Datastore is the User ID? In that case, if you have events from the same user in more than one window, your GroupByKey or Combine operations will have one separate element for every pair of user+window.
So the question is: What are you trying to insert into datastore?
An individual user's resulting aggregate over all time? In that case, you'd need to use a Global Window.
A user's resulting aggregate for every 30 seconds in time? Then you need to use the window as part of the key you use to insert to datastore. Does that help / make sense?
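For the second option, here is a hedged sketch (the kind name, key format, and KV<String, Long> aggregate type are assumptions, not your actual pipeline) of folding the window's end timestamp into the Datastore key, so that per-window aggregates for the same user never collide inside one commit:

```java
import com.google.datastore.v1.Entity;
import com.google.datastore.v1.Key;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.windowing.BoundedWindow;
import org.apache.beam.sdk.values.KV;

import static com.google.datastore.v1.client.DatastoreHelper.makeKey;
import static com.google.datastore.v1.client.DatastoreHelper.makeValue;

// Converts each (user, count) aggregate into a Datastore entity whose key
// name includes the window end, e.g. "user42|2018-05-01T12:00:30.000Z",
// so two windows of the same user map to two distinct entities.
class ToWindowedEntity extends DoFn<KV<String, Long>, Entity> {
  @ProcessElement
  public void processElement(ProcessContext c, BoundedWindow window) {
    String userId = c.element().getKey();
    Key key = makeKey("UserWindowAggregate", userId + "|" + window.maxTimestamp()).build();
    c.output(Entity.newBuilder()
        .setKey(key)
        .putProperties("count", makeValue(c.element().getValue()).build())
        .build());
  }
}
```

The resulting PCollection<Entity> can then be written as before with DatastoreIO.v1().write().withProjectId(...).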
Happy to help you design your pipeline to do what you want. Chat with me in the comments or via SO chat.
The larger question about bundling of data
Bundling strategies will vary by runner. In Dataflow, you should consider the following two factors:
Every worker is assigned a key range. Elements for the same key will be processed by the same worker.
Windows belong to single elements; but a bundle may contain elements from multiple windows. As an example, if the data freshness metric makes a big jump*, a number of windows may be triggered - and elements of the same key in different windows would be processed in the same bundle.
* When can data freshness jump suddenly? A single element with a very old timestamp that is very slow to process may hold the watermark back for a long time. Once that element is processed, the watermark may jump a long way forward, to the next-oldest element (check out this lecture on watermarks ;)).

Real-time stream processing for IoT through Google Cloud Platform

I want to do real-time stream processing for IoT through GCP Pub/Sub and Cloud Dataflow, and perform analytics through BigQuery. I am seeking help on how to implement this.
Here is the architecture for IoT real-time stream processing:
I'm assuming you mean that you want to stream some sort of data from outside the Google Cloud Platform into BigQuery.
Unless you're transforming the data somehow, I don't think that Dataflow is necessary.
Note that BigQuery has its own streaming API, so you don't necessarily have to use Pub/Sub to get data into BigQuery.
In any case, these are the steps you should generally follow.
Method 1
Create a service account (and download the .json key file from IAM in the Google Cloud Console)
Write your application to get the data you want to stream in
Inside that application, use the service account to stream directly into a BQ dataset and table (a sketch follows below)
Analyse the data on the BigQuery console (https://bigquery.cloud.google.com)
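A minimal sketch of steps 2-3, assuming a dataset iot_data with an existing table data_dump and a service-account key file at /path/key.json (all names are placeholders), using the google-cloud-bigquery client's streaming insert:

```java
import com.google.auth.oauth2.ServiceAccountCredentials;
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.InsertAllRequest;
import com.google.cloud.bigquery.InsertAllResponse;
import com.google.cloud.bigquery.TableId;

import java.io.FileInputStream;
import java.util.HashMap;
import java.util.Map;

public class StreamToBigQuery {
  public static void main(String[] args) throws Exception {
    // Authenticate with the service account key downloaded in step 1.
    BigQuery bigquery = BigQueryOptions.newBuilder()
        .setCredentials(ServiceAccountCredentials.fromStream(
            new FileInputStream("/path/key.json")))
        .build()
        .getService();

    // One raw reading from a device; the column names are assumptions.
    Map<String, Object> row = new HashMap<>();
    row.put("device_id", "sensor-01");
    row.put("temperature", 21.4);
    row.put("event_time", "2018-05-01T12:00:00Z");

    // Stream the row straight into the data_dump table.
    InsertAllResponse response = bigquery.insertAll(
        InsertAllRequest.newBuilder(TableId.of("iot_data", "data_dump"))
            .addRow(row)
            .build());

    if (response.hasErrors()) {
      System.err.println("Insert errors: " + response.getInsertErrors());
    }
  }
}
```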
Method 2
Set up a Pub/Sub topic
Write an application that collects the information you want to stream in
Push it to Pub/Sub
Configure Dataflow to pull from Pub/Sub, transform the data however you need to, and push it to BigQuery
Analyse the data on the BigQuery console as above.
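A minimal sketch of Method 2, assuming a Pub/Sub topic carrying simple "device_id,temperature" strings and an existing BigQuery table my-project:iot_data.readings (topic, table, and message format are placeholders):

```java
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.TypeDescriptor;

public class PubSubToBigQuery {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(
        PipelineOptionsFactory.fromArgs(args).withValidation().create());

    p.apply("ReadFromPubSub",
            PubsubIO.readStrings().fromTopic("projects/my-project/topics/iot-events"))
     // Turn each "device_id,temperature" message into a BigQuery row.
     .apply("ToTableRow", MapElements.into(TypeDescriptor.of(TableRow.class))
            .via(msg -> {
              String[] parts = msg.split(",");
              return new TableRow()
                  .set("device_id", parts[0])
                  .set("temperature", Double.parseDouble(parts[1]));
            }))
     // Append to an existing table (schema assumed to match the rows).
     .apply("WriteToBigQuery", BigQueryIO.writeTableRows()
            .to("my-project:iot_data.readings")
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER));

    p.run();
  }
}
```

Run it on Dataflow by passing --runner=DataflowRunner, --project, and --tempLocation on the command line.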
Raw Data
If you just want to put very raw data (no processing) into BQ, then I'd suggest using the first method.
Semi Processed / Processed Data
If you actually want to transform the data somehow, then I'd use the second method as it allows you to massage the data first.
Try to always use Method 1
However, I'd usually recommend using the first method, even if you want to transform the data somehow.
That way, you have a data_dump table (raw data) in your dataset, and you can still use Dataflow after that to transform the data and put the result into an aggregated table.
This gives you maximum flexibility because it allows you to create potentially n transformed datasets from the single data_dump table in BQ.

Marking a key as complete in a GroupBy | Dataflow Streaming Pipeline

To our streaming pipeline, we want to submit unique GCS files, each file containing multiple events, and each event containing a key (for example, device_id). As part of the processing, we want to shuffle by this device_id so as to achieve some form of worker-to-device_id affinity (more background on why we want to do this is in another SO question). Once all events from the same file are complete, we want to reduce (GroupBy) by their source GCS file (which we will make a property of the event itself, something like file_id) and finally write the output to GCS (could be multiple files).
The reason we want to do the final GroupBy is because we want to notify an external service once a specific input file has completed processing. The only problem with this approach is that since the data is shuffled by the device_id and then grouped at the end by the file_id, there is no way to guarantee that all data from a specific file_id has completed processing.
Is there something we could do about it? I understand that Dataflow provides exactly-once guarantees, which means all the events will eventually be processed, but is there a way to set a deterministic trigger to say that all data for a specific key has been grouped?
EDIT
I wanted to highlight the broader problem we are facing here. The ability to mark file-level completeness would help us checkpoint different stages of the data as seen by external consumers. For example, it would allow us to trigger per-hour or per-day completeness signals, which are critical for us to generate reports for that window. Given that these stages/barriers (hour/day) are clearly defined on the input (GCS files are date/hour partitioned), it is only natural to expect the same of the output. But with Dataflow's model, this seems impossible.
Similarly, although Dataflow guarantees exactly-once, there will be cases where the entire pipeline needs to be restarted because something went horribly wrong. In those cases, it is almost impossible to restart from the correct input marker, since there is no guarantee that what was already consumed has been completely flushed out. The DRAIN mode tries to achieve this, but as mentioned, if the entire pipeline is messed up and draining itself cannot make progress, there is no way to know which part of the source should be the starting point.
We are considering using Spark since its micro-batch-based streaming model seems to fit better. We would still like to explore Dataflow if possible, but it seems that we won't be able to achieve this without storing these checkpoints externally from within the application. If there is an alternative way of providing these guarantees from Dataflow, it would be great. The idea behind broadening this question was to see if we are missing an alternate perspective that would solve our problem.
Thanks
This is actually tricky. Neither Beam nor Dataflow have a notion of a per-key watermark, and it would be difficult to implement that level of granularity.
One idea would be to use a stateful DoFn instead of the second shuffle. This DoFn would need to receive the number of elements expected in the file (from either a side-input or some special value on the main input). Then it could count the number of elements it had processed, and only output that everything has been processed once it had seen that number of elements.
This would be assuming that the expected number of elements can be determined ahead of time, etc.
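A minimal sketch of that idea, assuming (as noted above) the expected per-file event count is known ahead of time and carried on the main input as the value of each KV; the names here are illustrative, not your pipeline's:

```java
import org.apache.beam.sdk.state.CombiningState;
import org.apache.beam.sdk.state.StateSpec;
import org.apache.beam.sdk.state.StateSpecs;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.Sum;
import org.apache.beam.sdk.values.KV;

// Input: one KV<file_id, expectedEventCountForThatFile> per processed event.
// Output: the file_id, emitted once all of its events have been seen.
class FileCompletionFn extends DoFn<KV<String, Long>, String> {

  @StateId("processed")
  private final StateSpec<CombiningState<Long, long[], Long>> processedSpec =
      StateSpecs.combining(Sum.ofLongs());

  @ProcessElement
  public void processElement(
      ProcessContext c,
      @StateId("processed") CombiningState<Long, long[], Long> processed) {
    processed.add(1L);
    long expected = c.element().getValue();  // total events in this file
    if (processed.read() == expected) {
      // Every event of this file has been processed: notify downstream.
      c.output(c.element().getKey());
    }
  }
}
```

Because state is kept per key, the input must be keyed by file_id; this stateful DoFn stands in for the second shuffle mentioned above.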

Can TextIO write to prefixes derived from the window maxTimestamp?

I am processing a windowed stream of PubSub messages and I would like to archive them to GCS. I'd like the archived files to have a prefix that's derived from the window timestamp (something like gs://bucket/messages/2015/01/messages-2015-01-01.json). Is this possible with TextIO.Write, or do I need to implement my own FileBasedSink?
This can be done with the recently added feature for windowed writes in TextIO. Please see the documentation for TextIO, in particular see withWindowedWrites and to(FilenamePolicy). This feature is also present in AvroIO.
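For example, a hedged sketch of a FilenamePolicy that builds the gs://bucket/messages/... prefix from the window's maxTimestamp (the bucket, path layout, and date patterns are assumptions):

```java
import org.apache.beam.sdk.io.FileBasedSink.FilenamePolicy;
import org.apache.beam.sdk.io.FileBasedSink.OutputFileHints;
import org.apache.beam.sdk.io.FileSystems;
import org.apache.beam.sdk.io.fs.ResolveOptions.StandardResolveOptions;
import org.apache.beam.sdk.io.fs.ResourceId;
import org.apache.beam.sdk.transforms.windowing.BoundedWindow;
import org.apache.beam.sdk.transforms.windowing.PaneInfo;
import org.joda.time.format.DateTimeFormat;
import org.joda.time.format.DateTimeFormatter;

// Names output files like gs://bucket/messages/2015/01/messages-2015-01-01-00-of-01.json
class PerWindowFiles extends FilenamePolicy {
  private static final DateTimeFormatter DIR = DateTimeFormat.forPattern("yyyy/MM").withZoneUTC();
  private static final DateTimeFormatter FILE = DateTimeFormat.forPattern("yyyy-MM-dd").withZoneUTC();

  @Override
  public ResourceId windowedFilename(int shardNumber, int numShards, BoundedWindow window,
                                     PaneInfo paneInfo, OutputFileHints hints) {
    String relative = String.format("messages/%s/messages-%s-%02d-of-%02d.json",
        DIR.print(window.maxTimestamp()), FILE.print(window.maxTimestamp()),
        shardNumber, numShards);
    return FileSystems.matchNewResource("gs://bucket", true /* isDirectory */)
        .resolve(relative, StandardResolveOptions.RESOLVE_FILE);
  }

  @Override
  public ResourceId unwindowedFilename(int shardNumber, int numShards, OutputFileHints hints) {
    throw new UnsupportedOperationException("This sink is only used with windowed writes.");
  }
}

// Usage:
// messages.apply(TextIO.write()
//     .to(new PerWindowFiles())
//     .withTempDirectory(FileSystems.matchNewResource("gs://bucket/tmp", true))
//     .withWindowedWrites()
//     .withNumShards(1));
```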
Are you simply looking for the function TextIO.Write.Bound<String>.withSuffix() or TextIO.Write.Bound<String>.to()? It seems these would allow you to provide a suffix or prefix for the output filename.
Right now, TextIO.Write does not support operation in streaming mode – writing to GCS is tricky, e.g., because you can't write to a file concurrently from multiple workers and you can't append to files once they close. We have plans to add streaming support to TextIO.
You'll get the best support for this today using BigQuery rather than GCS – because we already support BigQuery writes during streaming, and you choose which table you write to based on the window name, and BigQuery supports writes from many different workers at once.
TextIO.Write ought to work. No need for a custom file sink.
In your case, you want to write your PubSub messages to an output text file - not locally, but on GCS. You ought to be able to use:
PCollection.apply(TextIO.Write.to(...))
Since you are processing a stream of PubSub messages, your window is unbounded, and your PubSub data source already provides a timestamp for each element in the PCollection.
If you wish to assign your own timestamp, your ParDo transform needs to use a DoFn that outputs elements using ProcessContext.outputWithTimestamp().
In summary, you can use TextIO.Write after ensuring the elements in your PCollection are output with a timestamp.
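For reference, a minimal Beam-style sketch of such a DoFn; the KV shape (payload, event-time millis) is an assumption about your data:

```java
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;
import org.joda.time.Instant;

// Re-stamps each element with a timestamp taken from its payload.
class AssignEventTimestamps extends DoFn<KV<String, Long>, String> {
  @ProcessElement
  public void processElement(ProcessContext c) {
    // Note: emitting a timestamp earlier than the element's current one
    // requires overriding getAllowedTimestampSkew().
    c.outputWithTimestamp(c.element().getKey(), new Instant(c.element().getValue()));
  }
}
```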

Is it possible to create a local file to store data?

I'm currently using the DataApi to keep data items synchronized between the handheld and the wearable.
Still, I want to make sure that all data is stored and none of it is lost in the process.
I'm currently reading GPS parameters while the wearable is not connected to the handheld, and when they connect, they sync the data items.
How reliable is the DataApi?
Is my idea of creating a local file doubling my effort?
How can I create a local file on my Wear device and then access it?
Syncing data using the DataApi is reliable and I recommend using it; if you come across a scenario where sync is not happening reliably, that should be considered a bug and reported as such. One issue folks run into is that they create the same data item again and don't get the onDataChanged() callback, but that is by design: if the very same data is added multiple times, there is no change, hence no callback triggers.
Another factor you might want to consider is whether the data you create on one node is for consumption by all other nodes or only a targeted one; DataApi syncs data across all connected nodes so if I create a data item on watch1 and want to sync that with my phone and if there is a watch2 in the picture as well, watch2 also gets the same data.
If you end up using the DataApi, I strongly recommend putting in place a policy that removes the data once it is synced and consumed; otherwise data will accumulate with no supervision and you'll eventually run out of space.
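A hedged sketch of that pattern with the GoogleApiClient-era Wearable DataApi the question refers to (the path and keys are placeholders):

```java
import com.google.android.gms.common.api.GoogleApiClient;
import com.google.android.gms.wearable.DataMap;
import com.google.android.gms.wearable.PutDataMapRequest;
import com.google.android.gms.wearable.PutDataRequest;
import com.google.android.gms.wearable.Wearable;

public class GpsSyncHelper {

  // On the watch: queue one GPS reading for sync.
  public static void sendGpsFix(GoogleApiClient client, double lat, double lng, long time) {
    PutDataMapRequest request = PutDataMapRequest.create("/gps/" + time);
    DataMap map = request.getDataMap();
    map.putDouble("lat", lat);
    map.putDouble("lng", lng);
    map.putLong("time", time);  // a distinct timestamp makes each item distinct, so onDataChanged() fires
    PutDataRequest putRequest = request.asPutDataRequest().setUrgent();
    Wearable.DataApi.putDataItem(client, putRequest);
  }

  // On the phone, after consuming the item in onDataChanged(): free the space.
  public static void deleteConsumed(GoogleApiClient client, android.net.Uri itemUri) {
    Wearable.DataApi.deleteDataItems(client, itemUri);
  }
}
```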
To answer your questions:
I don't know how reliable it effectively is, but we had problems where data updates didn't trigger the appropriate listeners on the watch side. So I'm not sure. Maybe someone has an official statement on this?
I think it depends on the amount of data you want to store. So I suggest you first become clear about the amount and then choose the format. Keep in mind that there is also the possibility of storing data in SharedPreferences.
These guys here tried to save an image on the watch, but it makes no difference whether it is an image file, text, or any other kind of file.
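For question 3, a minimal sketch of writing readings to app-private internal storage on the watch and reading them back later (file name and CSV layout are assumptions):

```java
import android.content.Context;
import java.io.BufferedReader;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;

public class LocalGpsLog {
  private static final String FILE_NAME = "gps_log.csv";

  // Append one reading as a "time,lat,lng" line to app-private storage.
  public static void append(Context context, double lat, double lng, long time) throws IOException {
    String line = time + "," + lat + "," + lng + "\n";
    try (FileOutputStream out = context.openFileOutput(FILE_NAME, Context.MODE_APPEND)) {
      out.write(line.getBytes("UTF-8"));
    }
  }

  // Read the log back, e.g. to re-sync once the handheld is connected.
  public static void readAll(Context context) throws IOException {
    try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(context.openFileInput(FILE_NAME), "UTF-8"))) {
      String line;
      while ((line = reader.readLine()) != null) {
        // Parse "time,lat,lng" and re-sync or display as needed.
      }
    }
  }
}
```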
