Transform data to pubsub events - google-cloud-dataflow

I have a Dataflow pipeline that collects user data like navigation, purchases, CRUD actions, etc. I have a requirement to be able to identify patterns in real time and then dispatch Pub/Sub events that other services can listen to in order to provide the user real-time tips, offers or promotions.
I'm thinking of starting by grouping the events by user id and then, if they match a pattern, creating a PCollection that contains the event names that need to be triggered via Pub/Sub.
Is this the right approach? Is there a better way?

This could certainly work for some use cases.
If you use session-based windowing in combination with early firings (triggering upon the arrival of each element), you can have all the data needed to identify patterns each time a new element arrives.
However, depending on the rate at which user data is pushed and the size of the session, this might result in holding a lot of data in the PCollection and repeating the pattern matching many times (on the same data), since you have to reuse all the data in the session. Furthermore, you cannot use elements that arrived before this session.
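For illustration, here is a minimal sketch of that session-plus-early-firings setup in the Beam Python SDK; the 10-minute gap, the placeholder match_patterns rule and the sample input are all assumptions:

```python
import apache_beam as beam
from apache_beam import window
from apache_beam.transforms import trigger

def match_patterns(user_and_events):
    # Placeholder rule (an assumption): re-runs over the whole session on each firing.
    user_id, events = user_and_events
    if len(list(events)) >= 3:
        yield (user_id, "SHOW_PROMOTION")

with beam.Pipeline() as p:
    user_events = p | beam.Create(
        [("user-1", "navigation"), ("user-1", "purchase"), ("user-1", "crud")])
    (user_events
     | "SessionWindow" >> beam.WindowInto(
           window.Sessions(10 * 60),                            # 10-minute gap
           trigger=trigger.Repeatedly(trigger.AfterCount(1)),   # fire on each element
           accumulation_mode=trigger.AccumulationMode.ACCUMULATING)
     | "GroupByUser" >> beam.GroupByKey()    # each firing sees the whole session so far
     | "MatchPatterns" >> beam.FlatMap(match_patterns)
     | "Print" >> beam.Map(print))
```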
Sometimes you might be better off keeping a state for each user (without redoing the pattern matching on all of the user's data for this session). Using state would in fact remove the need to work with windowing.
The new process would now look like this:
For each element that arrives:
Fetch the current state
Calculate the new state (based on the old state and the new element)
If needed, emit a message to PubSub.
To hold your state, you could use Bigtable or Datastore.
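As one possible variant (the answer above suggests Bigtable or Datastore as the state store), here is a minimal sketch of the fetch-state / update-state / maybe-emit loop using Beam's built-in per-key state in the Python SDK; the purchase-counting rule and the event names are assumptions:

```python
import apache_beam as beam
from apache_beam.coders import VarIntCoder
from apache_beam.transforms.userstate import ReadModifyWriteStateSpec

class DetectPattern(beam.DoFn):
    # Per-user state; here it is just a purchase counter for illustration.
    COUNT = ReadModifyWriteStateSpec("purchase_count", VarIntCoder())

    def process(self, element, count=beam.DoFn.StateParam(COUNT)):
        user_id, event = element                 # input must be keyed by user id
        current = count.read() or 0              # 1. fetch the current state
        if event == "purchase":
            current += 1
        count.write(current)                     # 2. calculate and store the new state
        if current == 3:                         # 3. if needed, emit a message
            yield (user_id, "LOYALTY_OFFER")
```

In a real pipeline the emitted elements would then be serialized to bytes and handed to beam.io.WriteToPubSub.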

Related

EventStore - read specific time frame from stream

I have a system that produces thousands of messages per hour. It is a location tracking system that gathers events from different devices and does different calculations based on those messages.
I'm trying to evaluate whether Event Store suits this use case. My plan is to associate a stream per device and accumulate messages in those streams.
Now the question - will I be able to read those messages for a specific time frame in the past? I don't want to replay all the messages from the beginning, I just need fast access to messages from date1 to date2.
Any ideas? So far, what I saw in the docs only relates to reading all messages either from the beginning or from the end and filtering during the process. But this pattern doesn't look optimal to me. Am I doing something wrong?
The EventStoreDB index allows you to read events from a specific stream by the event number, but not by date. What you wrote about reading from the beginning or the end is not entirely correct, as you can read from any position in the stream, both backwards and forwards, but then again that has nothing to do with dates.
Essentially, the date when the event was written to the database is considered unimportant and transient. For example, if you decide to move your data to another store using replication, all the events will get a new date. That's why, if the date is important, it should be stored somewhere in the event data or metadata. EventStoreDB doesn't know anything about the event payload (or metadata), and doesn't index it.
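For example, here is a tiny sketch of carrying the business timestamp inside the event itself so a consumer or projection can filter on it between date1 and date2; the event type, field names and helper are hypothetical:

```python
import json
from datetime import datetime, timezone
from uuid import uuid4

def make_device_event(device_id: str, payload: dict):
    # Hypothetical helper: the point is only that the timestamp travels in the
    # event data/metadata, because EventStoreDB does not index by date.
    stream = f"device-{device_id}"               # one stream per device, per the question
    event = {
        "eventId": str(uuid4()),
        "eventType": "DeviceReadingRecorded",
        "data": json.dumps(payload),
        "metadata": json.dumps(
            {"recordedAt": datetime.now(timezone.utc).isoformat()}),
    }
    return stream, event

stream, event = make_device_event("device-42", {"lat": 52.52, "lon": 13.40})
# A reader scanning the device's stream forwards can then filter on
# metadata["recordedAt"] between date1 and date2.
```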
If you are looking for a kind of database that allows you to query records by time, your best bet is to look at time series databases like Prometheus and InfluxDB. These databases are specifically designed to index primarily by timestamp, and are optimised to store data like sensor readings where each reading is a replacement of the previous one. EventStoreDB is not designed for that purpose; it's a database built to support event-sourced applications, and sensor readings are not that.

TFDDataset detect changes after call to refresh

Is there any way of detecting whether the data in a TFDDataset has changed as a result of a call to the dataset's Refresh function?
The nature of the Refresh method is that it discards the tuples fetched into its internal storage, so after calling it you have no result set for comparison. Hence the only way would be to store the original result set before calling it.
But in your comment you've mentioned that your overall aim is to know whether a certain dataset has changed as a result of another user's modification. That said, it sounds like you are polling the tables, which is not efficient in general.
If that is so, I would suggest considering either database events (if your DBMS supports them) or, better yet, a business tier (ideally combined with database events). These events or that tier would then generate an event received by the client only when something in the database actually changes, saving (potentially lots of) empty round trips.

In Dataflow with PubsubIO is there any possibility of late data in a global window?

I was going to start developing programs with Google Cloud Pub/Sub. I just wanted to confirm this once.
From the Beam documentation, data loss can only occur if data is declared late by Pub/Sub. Is it safe to assume that the data will always be delivered without any message drops (late data) when using a global window?
From the concepts of watermark and lateness, I have come to the conclusion that these metrics are critical in conditions where custom windowing is applied over the data being received with event-based triggers.
When you're working with streaming data, choosing a global window basically means that you are going to completely ignore event time. Instead, you will be taking snapshots of your data in processing time (that is, as it arrives) using triggers. Therefore, you can no longer define data as "late" (nor as "early" or "on time", for that matter).
You should choose this approach if you are not interested in the time at which these events actually happened but, instead, you just want to group them according to the order in which they were observed. I would suggest that you go through this great article on streaming data processing, especially the part under When/Where: Processing-time windows, which includes some nice visuals comparing different windowing strategies.
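As a rough sketch of that processing-time style in the Beam Python SDK (the subscription path and the 60-second firing interval are assumptions, and streaming pipeline options are omitted):

```python
import apache_beam as beam
from apache_beam import window
from apache_beam.transforms import trigger

p = beam.Pipeline()  # streaming options (e.g. --streaming) omitted for brevity

snapshots = (
    p
    | "Read" >> beam.io.ReadFromPubSub(
          subscription="projects/my-project/subscriptions/my-sub")
    | "GlobalWindow" >> beam.WindowInto(
          window.GlobalWindows(),
          # Fire a pane roughly every 60s of processing time; event time is
          # ignored, so no element is ever considered "late".
          trigger=trigger.Repeatedly(trigger.AfterProcessingTime(60)),
          accumulation_mode=trigger.AccumulationMode.DISCARDING)
)
```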

Marking a key as complete in a GroupBy | Dataflow Streaming Pipeline

To our streaming pipeline, we want to submit unique GCS files, each file containing information about multiple events, and each event containing a key (for example, device_id). As part of the processing, we want to shuffle by this device_id so as to achieve some form of worker-to-device_id affinity (more background on why we want to do this is in this other SO question). Once all events from the same file are complete, we want to reduce (GroupBy) by their source GCS file (which we will make a property of the event itself, something like file_id) and finally write the output to GCS (could be multiple files).
The reason we want to do the final GroupBy is because we want to notify an external service once a specific input file has completed processing. The only problem with this approach is that since the data is shuffled by the device_id and then grouped at the end by the file_id, there is no way to guarantee that all data from a specific file_id has completed processing.
Is there something we could do about it? I understand that Dataflow provides exactly_once guarantees which means all the events will be eventually processed but is there a way to set a deterministic trigger to say all data for a specific key has been grouped?
EDIT
I wanted to highlight the broader problem we are facing here. The ability to mark file-level completeness would help us checkpoint different stages of the data as seen by external consumers. For example, this would allow us to trigger per-hour or per-day completeness, which is critical for us to generate reports for those windows. Given that these stages/barriers (hour/day) are clearly defined on the input (GCS files are date/hour partitioned), it is only natural to expect the same of the output. But with Dataflow's model, this seems impossible.
Similarly, although Dataflow guarantees exactly-once processing, there will be cases where the entire pipeline needs to be restarted because something went horribly wrong - in those cases, it is almost impossible to restart from the correct input marker, since there is no guarantee that what was already consumed has been completely flushed out. The DRAIN mode tries to achieve this, but, as mentioned, if the entire pipeline is messed up and draining itself cannot make progress, there is no way to know which part of the source should be the starting point.
We are considering using Spark, since its micro-batch-based streaming model seems to fit better. We would still like to explore Dataflow if possible, but it seems that we won't be able to achieve this without storing these checkpoints externally from within the application. If there is an alternative way of providing these guarantees from Dataflow, it would be great. The idea behind broadening this question was to see if we are missing an alternative perspective that would solve our problem.
Thanks
This is actually tricky. Neither Beam nor Dataflow have a notion of a per-key watermark, and it would be difficult to implement that level of granularity.
One idea would be to use a stateful DoFn instead of the second shuffle. This DoFn would need to receive the number of elements expected in the file (from either a side-input or some special value on the main input). Then it could count the number of elements it had processed, and only output that everything has been processed once it had seen that number of elements.
This would be assuming that the expected number of elements can be determined ahead of time, etc.
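A minimal sketch of that counting idea in the Beam Python SDK; carrying the expected total on every element (rather than via a side input), and all names, are assumptions:

```python
import apache_beam as beam
from apache_beam.coders import VarIntCoder
from apache_beam.transforms.userstate import CombiningValueStateSpec

class MarkFileComplete(beam.DoFn):
    # Per-file_id counter of elements processed so far.
    SEEN = CombiningValueStateSpec("seen", VarIntCoder(), sum)

    def process(self, element, seen=beam.DoFn.StateParam(SEEN)):
        file_id, (event, expected_count) = element   # input keyed by file_id
        seen.add(1)
        if seen.read() == expected_count:
            # Every element from this file has been seen; emit a marker that a
            # downstream step can use to notify the external service.
            yield (file_id, "COMPLETE")
```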

The best way to handle erratic data on iOS

I am working on an application where I have a connection to a database. The database contains from 300 MB to 4 GB worth of data, as each customer has their own database. The issue I am having is in gathering the data: because of the potential database size, just downloading and storing the information locally isn't possible. The data can get quite complex and can vary. For example:
A customer has a Job and they want to search for that job from the app.
I then fetch a list of jobs matching the search criteria.
The customer sees the job they want to view and I start the gathering process.
This job can potentially touch many tables, sometimes repeatedly.
There is the jobs table, and a relational table to map to a person. Then there is another table that contains non-customer relational information, then there are calendar events associated with the job, which in turn can involve different people. Then there are emails attached to the job, which in turn can bring in additional people and events.
So I have a working model that gathers all of this information. The problem I have is that I cannot figure out a great method of signaling to my view that the data is completely downloaded. My initial thought was to use the NotificationCenter to send a message when certain parts of the task were finished, allowing the core Job object to notify the view when everything was complete.
I know this is a pretty generalized question, but I'm honestly stumped as to how to take an unknown number of table results and translate that into a notice that my app can actually use.
My initial recommendation would be Core Data. It's designed for this kind of problem. No, I'm not saying to download the entire database into Core Data. I'm saying to use Core Data to manage your object model, because that's what it's good at.
As you receive data from the server, compose it into NSManagedObjects and stick them in the data store. On the UI side, create an NSFetchedResultsController to keep you informed as the data updates asynchronously. You don't necessarily need to persist this store. You could just keep it in memory and throw it away whenever you're done with the query, but keeping it on disk could be a nice caching solution. Again, don't think of Core Data as "a local database." Think of it as a model persistence engine that you can query for objects.
One advantage of this model is that you can provide the best available data to the user as it becomes available. But say you really don't want to show the information until it's all available. That's fine, too. Just let the network side keep updating its context, and then only save it when everything's complete. That way NSFetchedResultsController gets a single atomic update. The nice thing about Core Data is that it has these concepts built in, so you can adjust your update strategy without requiring a massive redesign.
The Notification Center will work great for this.
Post the notification at logical points in your data load to trigger a UI update for your users.
