How to predict event sequence with RNN/LSTM? - machine-learning

I have a system which continuously generates event sequences; an event sequence looks like the following in JSON:
...
{'event_id':1001, timestamp:ts1, attr1:numerical_val1, attr2:categorical_val2, etc...}
{'event_id':1003, timestamp:ts2, attr1:numerical_val1, attr3:numerical_val3, attr4:categorical_val4, etc...}
{'event_id':1001, timestamp:ts3, attr1:numerical_val1, attr2:categorical_val2, etc...}
{'event_id':1005, timestamp:ts4, attr1:numerical_val1, attr5:numerical_val5, result:<success|fail> etc...}
The generated events share common attributes such as event_id, timestamp, and attr1, but also carry event-specific attributes, as you can see in the example. The events are generated sequentially; each event means that something (a sub-procedure) has happened in the system, and its attributes may be numerical or categorical, indicating performance or a result. There is one specific event (assume it's 1005) which indicates whether the overall procedure succeeded or failed.
The question is how to predict a potential failure (the label is univariate: success or fail) from the input stream of those events, specifically:
What model should be used? I feel an RNN/LSTM is a good fit for time-series prediction, but I am not sure how it can be applied to this kind of event sequence prediction.
How should the input data be prepared for the RNN/LSTM, given that the events do not always have the same set of attributes?
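One possible way to prepare such data (purely illustrative, with made-up attribute lists) is to fix the union of all attributes up front and turn every event into a fixed-length numeric vector: a presence flag marks attributes the event does not carry, and categorical values are mapped to integer codes (or one-hot vectors), so each timestep fed to the RNN/LSTM has the same shape. A minimal Java sketch of that encoding:

import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Hypothetical encoder: each event becomes [event_id, value, presence, value, presence, ...]
// so every timestep has the same length regardless of which attributes the event carried.
public class EventEncoder {

    // Assumed, fixed ordering of all attributes the system can emit.
    private static final List<String> NUMERIC_ATTRS = List.of("attr1", "attr3", "attr5");
    private static final List<String> CATEGORICAL_ATTRS = List.of("attr2", "attr4");

    public static double[] encode(int eventId, Map<String, Double> numeric, Map<String, Integer> categorical) {
        double[] vec = new double[1 + 2 * (NUMERIC_ATTRS.size() + CATEGORICAL_ATTRS.size())];
        int i = 0;
        vec[i++] = eventId;                       // could also be one-hot encoded
        for (String attr : NUMERIC_ATTRS) {
            Double v = numeric.get(attr);
            vec[i++] = (v == null) ? 0.0 : v;     // value, 0 when missing
            vec[i++] = (v == null) ? 0.0 : 1.0;   // presence flag, so "missing" differs from "value 0"
        }
        for (String attr : CATEGORICAL_ATTRS) {
            Integer code = categorical.get(attr); // categorical value already mapped to an integer code
            vec[i++] = (code == null) ? 0.0 : code;
            vec[i++] = (code == null) ? 0.0 : 1.0;
        }
        return vec;
    }

    public static void main(String[] args) {
        // Roughly the second event from the example: event_id 1003 with attr1, attr3 and attr4.
        double[] timestep = encode(1003, Map.of("attr1", 0.7, "attr3", 12.5), Map.of("attr4", 2));
        System.out.println(Arrays.toString(timestep));
    }
}

A sequence of such vectors (one per event, ordered by timestamp, padded or truncated to a common length) could then serve as the input to a many-to-one RNN/LSTM whose single output is the success/fail label.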

Related

General principle to implement node-based workflow as seen in Unreal, Blender, Alteryx and the like?

This topic is difficult to Google, because of "node" (not node.js), and "graph" (no, I'm not trying to make charts).
Despite being a pretty well-rounded and experienced developer, I can't piece together a mental model of how these sorts of editors get data in a sensible way, in a sensible order, from node to node. Especially in the Alteryx case, where a Sort module, for example, needs its entire upstream dataset before proceeding. And some nodes can send a single output to multiple downstream consumers.
I was able to understand trees and whatnot in my old data structures course back in the day, and I successfully understood and adapted the basic graph concepts from https://www.python.org/doc/essays/graphs/ in a real project. But that was a static structure, and data weren't being passed from node to node.
Where should I be starting, and/or what concept am I missing that I could use to implement something like this? Something to let users chain together some boxes to slice and dice text files or data records with some basic operations like sort and join? I'm using C#, but the answer ought to be language independent.
This paradigm is called dataflow programming; it works with streams of data that are passed from instruction to instruction to be processed.
Dataflow programs can be written in textual or visual form, and besides the software you have mentioned, there are many programs that include some sort of dataflow language.
To create your own dataflow language you have to do the following (a minimal sketch pulling these pieces together appears at the end of this answer):
Create program modules or objects that represent your processing nodes and realize different sorts of data processing. Processing nodes usually have one or more data inputs and one or more data outputs and implement some data-processing algorithm inside them. Nodes may also have control inputs that govern how a given node processes data. A typical dataflow algorithm calculates an output data sample from one or many input data stream values, as FIR filters do. However, a processing algorithm can also have data value feedback (output values are in some way mixed back with input values), as in IIR filters, or accumulate values in some way to calculate the output value.
Create a standard API for passing data between processing nodes. It can differ for different kinds of data and controlling signals, but it must be standard, because processing nodes should 'understand' each other. Data is usually passed as plain values. Controlling signals can be plain values, events, or a more advanced control language, depending on your needs.
Create an arrangement to link your nodes and to pass data between them. You can build your own machinery or use standard facilities like pipes, message queues, etc. For example, this functionality can be implemented as a tree-like structure whose nodes are your processing nodes, each holding references to the next nodes and to the appropriate inputs that process data coming from the current node's output.
Create some kind of node iterator that starts at the beginning of the dataflow graph and, for each processing node:
provides the next data input values
invokes the node's data-processing methods
updates the data output values
passes the updated output values to the inputs of downstream processing nodes
Create a tool for configuring node parameters and the links between them. It can be just a simple text file edited with a text editor, or a sophisticated visual editor with a GUI for drawing the dataflow graph.
Regarding your note about the Sort module in Alteryx: perhaps data values are simply accumulated inside that module and then sorted.
Here you can find an even more detailed description of dataflow programming languages.
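As a rough, hypothetical illustration of the steps above (a node object wrapping a processing algorithm, explicit links to downstream consumers, and a push-style driver that moves each value through the graph), here is a minimal Java sketch; none of the names come from a real framework, and the equivalent C# is structurally the same:

import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// A node wraps a processing function, remembers its downstream consumers,
// and pushes every value it produces to each of them in turn.
final class Node<I, O> {
    private final Function<I, O> process;                 // the node's processing algorithm
    private final List<Node<O, ?>> downstream = new ArrayList<>();

    Node(Function<I, O> process) { this.process = process; }

    <R> Node<O, R> connect(Node<O, R> next) {             // link this node's output to another node's input
        downstream.add(next);
        return next;
    }

    void push(I value) {                                  // receive an input, process it, fan the result out
        O out = process.apply(value);
        for (Node<O, ?> next : downstream) {
            next.push(out);
        }
    }
}

public class DataflowSketch {
    public static void main(String[] args) {
        Node<String, String> toUpper = new Node<>(s -> s.toUpperCase());
        Node<String, Integer> length = new Node<>(String::length);
        Node<Integer, Integer> print = new Node<>(n -> { System.out.println(n); return n; });

        toUpper.connect(length).connect(print);           // build the graph: toUpper -> length -> print

        for (String record : List.of("sort", "join", "filter")) {
            toUpper.push(record);                         // drive the stream through the graph
        }
    }
}

An accumulating node like Alteryx's Sort would simply buffer everything it receives in push() and only emit downstream once it is told the stream has ended.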

Remove duplicates across window triggers/firings

Let's say I have an unbounded PCollection of sentences keyed by userid, and I want a constantly updated value for whether the user is annoying. We can calculate whether a user is annoying by passing all of the sentences they've ever said into the function isAnnoying(). Forever.
I set the window to global with a trigger afterElement(1), accumulatingFiredPanes(), do a GroupByKey, then have a ParDo that emits (userid, isAnnoying).
That works forever and keeps accumulating the state for each user, etc. Except it turns out that the vast majority of the time a new sentence does not change whether a user isAnnoying, so most of the times the window fires and emits a (userid, isAnnoying) tuple, it's a redundant update and the I/O was unnecessary. How do I catch these duplicate updates and drop them, while still getting an update every time a sentence comes in that does change the isAnnoying value?
Today there is no way to directly express "output only when the combined result has changed".
One approach that you may be able to apply to reduce data volume, depending on your pipeline: use .discardingFiredPanes() and then follow the GroupByKey with an immediate filter that drops any zero values, where "zero" means the identity element of your CombineFn. I'm using the fact that the associativity requirements of Combine mean you must be able to independently calculate the incremental "annoying-ness" of a sentence without reference to the history.
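A rough sketch of that shape in the Beam Java SDK, using Sum.integersPerKey over a hypothetical per-sentence integer delta as a stand-in for the real incremental "annoying-ness" CombineFn (the names here are assumptions, not your actual types):

import org.apache.beam.sdk.transforms.Filter;
import org.apache.beam.sdk.transforms.Sum;
import org.apache.beam.sdk.transforms.windowing.AfterPane;
import org.apache.beam.sdk.transforms.windowing.GlobalWindows;
import org.apache.beam.sdk.transforms.windowing.Repeatedly;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

public class DropUnchangedPanes {

  // scores: one (userid, incremental annoying-ness) pair per sentence, where 0 is the
  // identity element of the Sum CombineFn, i.e. "this sentence changes nothing".
  static PCollection<KV<String, Integer>> changedOnly(PCollection<KV<String, Integer>> scores) {
    return scores
        .apply(Window.<KV<String, Integer>>into(new GlobalWindows())
            .triggering(Repeatedly.forever(AfterPane.elementCountAtLeast(1)))
            .discardingFiredPanes()                 // each pane carries only the newly arrived elements
            .withAllowedLateness(Duration.ZERO))
        .apply(Sum.integersPerKey())                // combine just the elements in this pane
        // If the pane's combined delta is the identity (0), nothing changed for this user,
        // so drop it rather than emitting a redundant (userid, isAnnoying) update.
        .apply(Filter.by((KV<String, Integer> kv) -> kv.getValue() != 0));
  }
}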
When BEAM-23 (cross-bundle mutable per-key-and-window state for ParDo) is implemented, you will be able to manually maintain the state and implement this sort of "only send output when the result changes" logic yourself.
However, I think this scenario likely deserves explicit consideration in the model. It blends the concepts embodied today by triggers and the accumulation mode.

Real-time pipeline feedback loop

I have a dataset with potentially corrupted/malicious data. The data is timestamped. I'm rating the data with a heuristic function. After a period of time I know that all new data items coming in with certain IDs need to be discarded, and they represent a significant portion of the data (up to 40%).
Right now I have two batch pipelines:
The first one just runs the rating over the data.
The second one first filters out the corrupted data and runs the analysis.
I would like to switch from batch mode (say, running every day) to an online processing mode (hoping for a delay of less than 10 minutes).
The second pipeline uses a global window, which makes processing easy. When the corrupted data key is detected, all other records are simply discarded (and using the discarded keys from previous days as a pre-filter is also easy). Additionally, it makes it easier to make decisions about the output data, since all historic data for a given key is available during processing.
The main question is: can I create a loop in a Dataflow DAG? Let's say I would like to accumulate the quality rates given to each session window I process, and if the rate sum is over X, a filter function in an earlier stage of the pipeline should filter out the malicious keys.
I know about side inputs, but I don't know whether they can change at runtime.
I'm aware that a DAG by definition cannot have a cycle, but how do I achieve the same result without one?
One idea that comes to mind is to use a side output to mark an ID as malicious and create a fake unbounded output/input. The output would dump the data to some storage, and the input would load it every hour and stream it so it can be joined.
Side inputs in the Beam programming model are windowed.
So you were on the right path. It seems reasonable to structure the pipeline in two parts: 1) computing a detection model for the malicious data, and 2) taking the model as a side input and the data as the main input, and filtering the data according to the model. The second part of the pipeline will get the model for the matching window, which seems to be exactly what you want.
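A sketch of that two-part shape in the Beam Java SDK; the KV<String, Double> element type, the 0.9 threshold, and DetectBadKeysFn are hypothetical stand-ins for your heuristic rating and detection logic:

import java.util.List;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.View;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionView;
import org.joda.time.Duration;

public class FilterByDetectionModel {

  // Part 1: a hypothetical detector that looks at rated records and emits the keys judged malicious.
  static class DetectBadKeysFn extends DoFn<KV<String, Double>, String> {
    @ProcessElement
    public void processElement(ProcessContext c) {
      if (c.element().getValue() > 0.9) {          // assumed threshold on the heuristic rating
        c.output(c.element().getKey());
      }
    }
  }

  static PCollection<KV<String, Double>> filtered(PCollection<KV<String, Double>> rated) {
    PCollection<KV<String, Double>> windowed =
        rated.apply(Window.<KV<String, Double>>into(FixedWindows.of(Duration.standardMinutes(10))));

    // The per-window "model" (here just a list of malicious keys), exposed as a side input.
    PCollectionView<List<String>> badKeys =
        windowed.apply(ParDo.of(new DetectBadKeysFn())).apply(View.asList());

    // Part 2: the main input filtered against the model for the matching window.
    return windowed.apply(
        ParDo.of(new DoFn<KV<String, Double>, KV<String, Double>>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            if (!c.sideInput(badKeys).contains(c.element().getKey())) {
              c.output(c.element());
            }
          }
        }).withSideInputs(badKeys));
  }
}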
In fact, this is one of the main examples in the MillWheel paper (page 2), upon which Dataflow's streaming runner is based.

How can I emit summary data for each window even if a given window was empty?

It is really important for my application to always emit a "window finished" message, even if the window was empty. I cannot figure out how to do this. My initial idea was to output an int for each record processed, use Sum.integersGlobally, and then emit a record based off that, giving me a singleton per window; I could then simply emit one summary record per window, with 0 if the window was empty. Of course, this fails: you have to use withoutDefaults, which will then emit nothing if the window was empty.
Cloud Dataflow is built around the notion of processing data that is likely to be highly sparse. By design, it does not conjure up data to fill in those gaps of sparseness, since this would be cost-prohibitive for many cases. For a use case like yours, where non-sparsity is practical (creating non-sparse results for a single global key), the workaround is to join your main PCollection with a heartbeat PCollection consisting of empty values. So for the example of Sum.integersGlobally, you would Flatten your main PCollection<Integer> with a secondary PCollection<Integer> that contains exactly one value of zero per window. This assumes you're using an enumerable type of window (e.g. FixedWindows or SlidingWindows; Sessions are by definition non-enumerable).
Currently, the only way to do this would be to write a data generator program that injects the necessary stream of zeroes into Pub/Sub with timestamps appropriate for the type of windows you will be using. If you write to the same Pub/Sub topic as your main input, you won't even need to add a Flatten to your code. The downside is that you have to run this as a separate job somewhere.
In the future (once our Custom Source API is available), we should be able to provide a PSource that accepts an enumerable WindowFn plus a default value and generates an appropriate unbounded PCollection.
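For what it's worth, in today's Beam Java SDK the Flatten-plus-heartbeat idea can be sketched roughly as follows; GenerateSequence (which postdates this answer) stands in both for the Pub/Sub zero generator and for the real main input:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.GenerateSequence;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Flatten;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.Sum;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionList;
import org.apache.beam.sdk.values.TypeDescriptors;
import org.joda.time.Duration;

public class HeartbeatSums {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Stand-in for the real stream: one "1" per processed record.
    PCollection<Integer> mainInput =
        p.apply("FakeMain", GenerateSequence.from(0).withRate(5, Duration.standardMinutes(1)))
         .apply("ToOnes", MapElements.into(TypeDescriptors.integers()).via((Long x) -> 1));

    // Heartbeat: roughly one zero per minute, so every window sees at least one element.
    PCollection<Integer> heartbeat =
        p.apply("Heartbeat", GenerateSequence.from(0).withRate(1, Duration.standardMinutes(1)))
         .apply("ToZeros", MapElements.into(TypeDescriptors.integers()).via((Long x) -> 0));

    // Flatten, window, and sum: empty-looking windows still produce a 0 summary.
    PCollectionList.of(mainInput).and(heartbeat)
        .apply(Flatten.pCollections())
        .apply(Window.<Integer>into(FixedWindows.of(Duration.standardMinutes(1))))
        .apply(Sum.integersGlobally().withoutDefaults());

    p.run();
  }
}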

Why did #sideInput() method move from Context to ProcessContext in Dataflow beta

I wonder why the #sideInput() method has moved to the ProcessContext class?
Previously I could do some additional processing in the #startBundle() method and cache the result.
Doing that in #processElement() sounds less efficient. Of course I could do the preprocessing before passing the data to the view, but there still is the overhead of calling #sideInput() for each element...
Thanks,
G
Great question. The reason is that we added support for windowed PCollections as side inputs. This enables additional scenarios, including using side inputs with unbounded PCollections in streaming mode.
Before the change, we only supported side inputs that were globally windowed, and the entire side input PCollection was available while processing every element of the main input PCollection. This worked fine for bounded PCollections in traditional batch-style processing, but didn't extend to windowed or unbounded PCollections.
After the change, the window of the current element you are processing in your ParDo controls what subset of the side input is visible. (And so you can't access side inputs in startBundle(), where there is no current element and hence no current window.)
For example, consider a streaming pipeline processing your website logs and providing real-time updates to a live usage dashboard. You've got two unbounded input PCollections: one contains new user signups and the other contains user clicks. You can identify which user clicks come from new users by windowing both PCollections by hour and doing a ParDo over the user clicks that takes the new user signups as a side input. Now when you process a user click that falls in a given hour, you automatically see just the subset of the new user signups from the same hour. You can do different variants on this by changing the windowing functions and moving element timestamps forward in time on the side input, like continuing to window the user clicks per hour but using the new signups from the last 24 hours.
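A sketch of that hourly clicks-plus-signups example in current Beam Java syntax (the element types and names are assumptions, and current SDKs use the @ProcessElement annotation rather than the beta-era override):

import java.util.List;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.View;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionView;
import org.joda.time.Duration;

public class NewUserClicks {

  // clicks: (userId, pageUrl) events; signups: userIds of newly signed-up users.
  static PCollection<KV<String, String>> fromNewUsers(
      PCollection<KV<String, String>> clicks, PCollection<String> signups) {

    // Window both unbounded inputs by hour.
    PCollection<KV<String, String>> hourlyClicks =
        clicks.apply(Window.<KV<String, String>>into(FixedWindows.of(Duration.standardHours(1))));

    PCollectionView<List<String>> newSignups =
        signups
            .apply(Window.<String>into(FixedWindows.of(Duration.standardHours(1))))
            .apply(View.asList());

    return hourlyClicks.apply(
        ParDo.of(new DoFn<KV<String, String>, KV<String, String>>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            // sideInput() is resolved against this element's window, so we see only
            // the signups from the same hour as the click being processed.
            if (c.sideInput(newSignups).contains(c.element().getKey())) {
              c.output(c.element());
            }
          }
        }).withSideInputs(newSignups));
  }
}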
I do agree this change makes it harder to cache any postprocessing on your side input. We added View.asMultimap to handle a common case where you turn the Iterable into a lookup table. If your post-processing is element-wise, you can do it with a ParDo before creating the PCollectionView. For anything else right now, I'd recommend doing it lazily from within processElement. I'd be interested in hearing about other patterns that occur, so we can work on ways to make them more efficient.
