I have been looking at real-time stream processing for IoT using Google Cloud Pub/Sub and Cloud Dataflow, with analytics performed in BigQuery. I am seeking help on how to implement this.
Here is the architecture for IoT real-time stream processing:
I'm assuming you mean that you want to stream some sort of data from outside the Google Cloud Platform into BigQuery.
Unless you're transforming the data somehow, I don't think that Dataflow is necessary.
Note that BigQuery has its own streaming API, so you don't necessarily have to use Pub/Sub to get data into BigQuery.
In any case, these are the steps you should generally follow.
Method 1
Create a service account (and download its .json key file from IAM in the Google Cloud Console)
Write your application to get the data you want to stream in
Inside that application, use the service account to stream directly into a BQ dataset and table
Analyse the data on the BigQuery console (https://bigquery.cloud.google.com)
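To make Method 1 concrete, here is a minimal sketch using the google-cloud-bigquery Java client and the BigQuery streaming API; the key-file path, dataset, table and column names are placeholders you would replace with your own:

```java
import com.google.auth.oauth2.ServiceAccountCredentials;
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.InsertAllRequest;
import com.google.cloud.bigquery.InsertAllResponse;
import com.google.cloud.bigquery.TableId;

import java.io.FileInputStream;
import java.util.HashMap;
import java.util.Map;

public class StreamToBigQuery {
  public static void main(String[] args) throws Exception {
    // Authenticate with the service account key downloaded from IAM.
    BigQuery bigquery = BigQueryOptions.newBuilder()
        .setCredentials(ServiceAccountCredentials.fromStream(
            new FileInputStream("/path/to/service-account.json")))
        .build()
        .getService();

    // One row of telemetry; column names must match the table schema.
    Map<String, Object> row = new HashMap<>();
    row.put("device_id", "sensor-42");
    row.put("temperature", 21.5);
    row.put("event_time", "2019-01-01T00:00:00Z");

    // Stream the row directly into the dataset/table via the streaming API.
    InsertAllResponse response = bigquery.insertAll(
        InsertAllRequest.newBuilder(TableId.of("iot_data", "raw_events"))
            .addRow(row)
            .build());

    if (response.hasErrors()) {
      System.err.println("Insert errors: " + response.getInsertErrors());
    }
  }
}
```

Rows streamed this way are typically queryable in the BigQuery console within seconds.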
Method 2
Set up a Pub/Sub topic
Write an application that collects the information you want to stream in
Push the data to Pub/Sub
Configure Dataflow to pull from Pub/Sub, transform the data however you need to, and push it to BigQuery
Analyse the data on the BigQuery console as above.
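As a rough sketch of Method 2 with the Apache Beam Java SDK (the subscription, table reference, schema and the trivial transform are all placeholders for your own resources):

```java
import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;

import java.util.Arrays;

public class PubSubToBigQuery {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    TableSchema schema = new TableSchema().setFields(Arrays.asList(
        new TableFieldSchema().setName("device_id").setType("STRING"),
        new TableFieldSchema().setName("payload").setType("STRING")));

    p.apply("ReadFromPubSub",
            PubsubIO.readStrings().fromSubscription(
                "projects/<your-project>/subscriptions/<your-subscription>"))
     .apply("Transform", ParDo.of(new DoFn<String, TableRow>() {
        @ProcessElement
        public void processElement(ProcessContext c) {
          // Any massaging of the raw message would happen here.
          c.output(new TableRow()
              .set("device_id", "unknown")
              .set("payload", c.element()));
        }
     }))
     .apply("WriteToBigQuery",
            BigQueryIO.writeTableRows()
                .to("<your-project>:iot_data.processed_events")
                .withSchema(schema)
                .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
                .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED));

    p.run();
  }
}
```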
Raw Data
If you just want to put very raw data (no processing) into BQ, then I'd suggest using the first method.
Semi Processed / Processed Data
If you actually want to transform the data somehow, then I'd use the second method as it allows you to massage the data first.
Try to always use Method 1
However, I'd almost always recommend using the first method, even if you want to transform the data somehow.
That way, you have a data_dump table (raw data) in your dataset and you can still use DataFlow after that to transform the data and put it back into an aggregated table.
This gives you maximum flexibility because it allows you to create potentially n transformed datasets from the single data_dump table in BQ.
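The second hop (data_dump to aggregated table) doesn't even have to be a Dataflow job; a periodic query job can do it. A sketch with the google-cloud-bigquery Java client, using hypothetical table and column names:

```java
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.JobInfo;
import com.google.cloud.bigquery.QueryJobConfiguration;
import com.google.cloud.bigquery.TableId;

public class AggregateDataDump {
  public static void main(String[] args) throws Exception {
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

    // Aggregate the raw dump into a per-device summary table.
    QueryJobConfiguration query = QueryJobConfiguration.newBuilder(
            "SELECT device_id, AVG(temperature) AS avg_temp, COUNT(*) AS readings "
                + "FROM `iot_data.data_dump` GROUP BY device_id")
        .setDestinationTable(TableId.of("iot_data", "device_summary"))
        .setWriteDisposition(JobInfo.WriteDisposition.WRITE_TRUNCATE)
        .setUseLegacySql(false)
        .build();

    bigquery.query(query);  // Runs the job and waits for the result.
  }
}
```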
I'm working with Thingsboard Community Edition 2.0.
I have one asset that contains two different devices. Both devices send telemetry data with the same key. I want to be able to show the sum of both values as the total for the asset.
Did anyone figure out how to do it?
Thanks.
The idea is basically to create a memory attribute containing a JSON representation of the data you want to save.
It's not straightforward, but I found a way that works.
Main steps:
Change the originator to the asset.
Each time telemetry arrives, use an enrichment node to get the memory attribute, then use a script node to move it from metadata into the msg.
Merge the memory with the incoming telemetry (replacing old values with new ones).
Compute what you want (min, max, mean, standard deviation, sum, etc.) and save it as telemetry or an attribute.
In parallel, save the merged memory (note that you can't save JS objects in telemetry or attributes: you have to use JSON.stringify() to save it and JSON.parse() to read it back via the enrichment node).
Hope it helps
Corentin
I was going to start developing programs with Google Cloud Pub/Sub and just wanted to confirm this once.
From the Beam documentation, data loss can only occur if data was declared late by Pub/Sub. Is it safe to assume that the data will always be delivered without any message drops (late data) when using a global window?
From the concepts of watermark and lateness, I have come to the conclusion that these are critical when custom windowing with event-time triggers is applied to the data being received.
When you're working with streaming data, choosing a global window basically means that you are going to completely ignore event time. Instead, you will be taking snapshots of your data in processing time (that is, as they arrive) using triggers. Therefore, you can no longer define data as "late" (nor as "early" or "on time", for that matter).
You should choose this approach if you are not interested in the time at which these events actually happened but, instead, you just want to group them according to the order in which they were observed. I would suggest that you go through this great article on streaming data processing, especially the part under When/Where: Processing-time windows which includes some nice visuals comparing different windowing strategies.
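For illustration, this is roughly what such a processing-time snapshot looks like in the Beam Java SDK; the one-minute trigger interval is arbitrary and `messages` is assumed to be an unbounded PCollection read from, say, Pub/Sub:

```java
import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
import org.apache.beam.sdk.transforms.windowing.GlobalWindows;
import org.apache.beam.sdk.transforms.windowing.Repeatedly;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

public class ProcessingTimeSnapshots {
  // Event time is ignored: every element lands in the single global window,
  // and a pane is emitted roughly once per minute of processing time.
  static PCollection<String> snapshot(PCollection<String> messages) {
    return messages.apply(
        Window.<String>into(new GlobalWindows())
            .triggering(Repeatedly.forever(
                AfterProcessingTime.pastFirstElementInPane()
                    .plusDelayOf(Duration.standardMinutes(1))))
            .withAllowedLateness(Duration.ZERO)
            .accumulatingFiredPanes());
  }
}
```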
I'm trying to figure out how we "seed" the window state for some of our streaming Dataflow jobs. The scenario: we have a stream of forum messages, and we want to emit a running count of messages for each topic for all time, so we have a streaming Dataflow job with a global window and triggers that emit each time a record for a topic comes in. All good so far. But prior to the stream source, we have a large file which we'd like to process to get our historical counts. Also, because topics live forever, we need the historical count to inform the outputs from the stream source, so we kind of need the same logic to run over the file, then start running over the stream source when the file is exhausted, while keeping the window state.
Current ideas:
Write a custom unbounded source that does just that. Reads over the file until it's exhausted and then starts reading from the stream. Not much fun because writing custom sources is not much fun.
Run the logic in batch mode over the file, and as the last step emit the state to a stream sink somehow, then have a streaming version of the logic start up that reads from both the state stream and the data stream, and somehow combines the two. This seems to make some sense, but not sure how to make sure that the streaming job reads everything from the state source, to initialise, before reading from the data stream.
Pipe the historical data into a stream, write a job that reads from both the streams. Same problems as the second solution, not sure how to make sure one stream is "consumed" first.
EDIT: Latest option, and what we're going with, is to write the calculation job such that it doesn't matter at all what order the events arrive in, so we'll just push the archive to the pub/sub topic and it will all work. That works in this case, but obviously it affects the downstream consumer (need to either support updates or retractions) so I'd be interested to know what other solutions people have for seeding their window states.
You can do what you suggested in bullet point 2: run two pipelines (in the same main), with the first one populating a Pub/Sub topic from the large file. This is similar to what the StreamingWordExtract example does.
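A skeleton of that two-pipelines-in-one-main shape with the Beam Java SDK (project, topic, subscription and file paths are placeholders, and the counting logic is elided):

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class SeedThenStream {
  public static void main(String[] args) {
    String topic = "projects/<your-project>/topics/forum-messages";
    String subscription = "projects/<your-project>/subscriptions/forum-messages-sub";

    // Pipeline 1: replay the historical file into the same Pub/Sub topic the
    // live producers publish to. The subscription below should already exist,
    // otherwise the replayed messages are simply dropped.
    Pipeline backfill = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
    backfill
        .apply("ReadArchive", TextIO.read().from("gs://<your-bucket>/history/*.json"))
        .apply("Replay", PubsubIO.writeStrings().to(topic));
    backfill.run().waitUntilFinish();

    // Pipeline 2: the streaming job with the global window / triggers /
    // per-topic counting described in the question, reading both the
    // replayed history and the live stream from the same subscription.
    Pipeline streaming = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
    streaming.apply("ReadMessages", PubsubIO.readStrings().fromSubscription(subscription));
    // ...windowing, triggering and counting transforms go here...
    streaming.run();
  }
}
```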
I am processing a windowed stream of PubSub messages and I would like to archive them to GCS. I'd like the archived files to have a prefix that's derived from the window timestamp (something like gs://bucket/messages/2015/01/messages-2015-01-01.json). Is this possible with TextIO.Write, or do I need to implement my own FileBasedSink?
This can be done with the recently added feature for windowed writes in TextIO. Please see the documentation for TextIO, in particular see withWindowedWrites and to(FilenamePolicy). This feature is also present in AvroIO.
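For example, something along these lines (bucket, window size and shard count are placeholders); passing a custom FilenamePolicy to to(...) gives full control if you want date-based paths like the gs://bucket/messages/2015/01/... layout in the question:

```java
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

public class ArchiveToGcs {
  static void archive(PCollection<String> jsonMessages) {
    jsonMessages
        // One window per day of event time.
        .apply(Window.<String>into(FixedWindows.of(Duration.standardDays(1))))
        // Windowed writes produce a separate set of files per window/pane;
        // withNumShards(1) gives a single output file per window/pane.
        .apply(TextIO.write()
            .to("gs://<your-bucket>/messages/messages")
            .withSuffix(".json")
            .withWindowedWrites()
            .withNumShards(1));
  }
}
```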
Are you simply looking for the function TextIO.Write.Bound<String>.withSuffix() or TextIO.Write.Bound<String>.to()? It seems these would allow you to provide a suffix or prefix for the output filename.
Right now, TextIO.Write does not support operation in streaming mode – writing to GCS is tricky, e.g., because you can't write to a file concurrently from multiple workers and you can't append to files once they close. We have plans to add streaming support to TextIO.
You'll get the best support for this today using BigQuery rather than GCS – because we already support BigQuery writes during streaming, and you choose which table you write to based on the window name, and BigQuery supports writes from many different workers at once.
TextIO.Write ought to work. No need for a custom file sink.
In your case, you want to write your PubSub messages to an output text file - not locally, but on remote GCS. You ought to be able to use something like:
yourPCollection.apply(TextIO.Write.to("gs://<your-bucket>/<output-prefix>"));
Since you are processing a stream of PubSub messages, your PCollection is unbounded, and your PubSub data source already provides a timestamp for each element in the PCollection.
If you wish to assign a timestamp, your ParDo transform needs to use a DoFn that outputs elements using ProcessContext.outputWithTimestamp().
In summary, you can use TextIO.Write after ensuring the elements in your PCollection are output with a timestamp.
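For reference, a DoFn that assigns its own timestamps might look roughly like this with the current Beam Java SDK (the wall-clock timestamp is just a stand-in for whatever event time you parse out of the message):

```java
import org.apache.beam.sdk.transforms.DoFn;
import org.joda.time.Instant;

// Emits each element with an explicitly assigned event timestamp.
public class AddTimestampFn extends DoFn<String, String> {
  @ProcessElement
  public void processElement(ProcessContext c) {
    String message = c.element();
    // In practice you would parse the event time out of the message payload;
    // here we simply use the current wall-clock time as a stand-in.
    c.outputWithTimestamp(message, Instant.now());
  }
}
```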
I'm setting up a server-client solution with ASP.NET MVC4 and a WCF service, and was thinking you might have some input on a couple of questions.
The WCF service gets its data from a 3rd-party service which is quite slow. So my plan is the following scenario:
User logs in, setting off a jQuery AJAX request to the MVC controller
The controller requests the data from the WCF service
The service retrieves a small amount of data from the 3rd party and, before it returns it...
HERE IT COMES: the service spawns a background thread to download a large amount of data from the 3rd party
The service returns the small amount of data
The client gets the small amount of data and displays it, but also starts polling the service for the large amount of data
The large amount of data is downloaded to the WCF service and put into a cache database
The service returns the large amount of data to the client upon the next polling request.
My questions:
Am I not thinking straight about this?
What kind of background-threading mechanism should I use? The WCF service is hosted in IIS.
Is polling from the client the right way to retrieve the next chunk of data?
Thanks for your time!