How to set a TTL for all items in a MapState in Flink? - join

I would like to clean up all entries in a MapState at a given timestamp.
I am considering two ways to do:
Hold the cleanup timestamp in a ValueState, Register a timer for the cleanup timestamp, When the timer fires clear the MapState. Though the cleanup timestamp might be the same, this would happen for every item added to the MapState. I am relying on Flink de-duping the timers.
Calculate the TTL based on (cleanup timestamp - current timestamp), use StateTtlConfig to set a TTL for the MapState
Which is a better approach (performance, accuracy etc.)?
Does StateTtlConfig work for even time processing?

If your intention is to clear out all entries in the MapState at the same time, then I would not use StateTtlConfig, since Flink will spend 8 bytes to store a timer with each map entry. This is a lot of unnecessary storage overhead.
With StateTtlConfig, the state expiry can only be specified in terms of processing time.
Also keep in mind that StateTtlConfig cannot be added to, or removed from, an existing state descriptor.

If You want to clean all records at given timestamp, then option 1 should be more performant definetely. In general using TTL on flink state will add additional overhead, since it will be checked every time key in MapState is accessed. It depends on how do You exactly use the state(how many records do You plan to store there, how often will You access it and the type of the state), but I'd say that if You just want to drop all records at given timestamp then option 1 is better idea.

I think use timerService to register a timer gives you better performance.but event-time timers only fire with watermarks coming in, you may also schedule and coalesce these timers with the next watermark by using the current one.
val coalescedTime = ctx.timerService.currentWatermark + 1
ctx.timerService.registerEventTimeTimer(coalescedTime)

Related

How to check and release the memory occupied by DolphinDB stream engine?

I have already unsubscribed a table in DolphinDB, but when I executed the function getStreamEngineStat().AsofJoinEngine, there was still memory occupied by the engine.
Does the function return the real-time memory information on the stream engine? How can I check the current memory status and release the memory?
getStreamEngineStat().AsofJoinEngine does return the real-time memory information.
Undefining the subscription is not equivalent to dropping the engine. In your case, if you want to release the memory occupied by the engine, you need first use dropStreamEngine and set the variable returned by createEngine to be NULL.
Specific examples are as follows:
Suppose you define the engine as follows:
ajEngine=createAsofJoinEngine(name="aj1", leftTable=trades, rightTable=quotes, outputTable=prevailingQuotes, metrics=<[price, bid, ask, abs(price-(bid+ask)/2)]>, matchingColumn=`sym, timeColumn=`time, useSystemTime=false, delayedTime=1)
After undefining the subscription, you can correctly release memory by the following code:
dropStreamEngine("aj1") // 1.release the specified engine
ajEngine = NULL // 2. Release the engine handler from memory
If you find that the memory of the asofjoin engine is too large in the subscription, you can also specify the parameter garbageSize to clean up historical data that is no longer needed. The size of garbageSize needs to be set case by case, roughly according to how many records each key will have per hour.
The principle is: First, garbageSize is set for the key, that is, the data within each key group will be cleaned up. If garbageSize is too small, frequent cleaning of historical data will bring unnecessary overhead; if garbageSize is too large, it may not reach the threshold of garbageSize and cannot trigger cleaning, resulting in a leftover of unwanted historical data.
Therefore, it is recommended to clean up once per hour and set the garbageSize with a rough estimate.

Combine session and tumbling window: time windows that are aligned to the first event for each key

i read about flink`s window assigners over here: https://ci.apache.org/projects/flink/flink-docs-stable/dev/stream/operators/windows.html#window-assigners , but i cant find any solution for my problem.
as part of my project i need a windowing that the timer will start given the first element of the key and will be closed and set ready for processing after X minutes. for example:
first keyA comes at (hh:mm:ss) 00:00:02, i want all keyA will be windowing until 00:01:02, and then the timer of 1 minutes will start again only when keyA will be given as input.
Is it possible to do something like that in flink? is there a workaround?
hope i made it clear enough.
Implementing keyed windows that are aligned with the first event, rather than with the epoch, is quite difficult, in general, which I believe is why this isn't supported by Flink's window API. The problem is that with an out-of-order stream using event time processing, as earlier events arrive you may need to revise your notion of when the window began, and when it should end. For example, if the first keyA arrives at 00:00:02, but then some time later an event with keyA arrives with a timestamp of 00:00:01, now suddenly the window should end at 00:01:01, rather than 00:01:02. And if the out-of-orderness is large compared to the window length, handling this becomes quite complex -- imagine, for example, that the event from 00:00:01 arrives 2 minutes after the event from 00:00:02.
Rather than trying to implement this with the window API, I would use a KeyedProcessFunction. If you only need to support processing time windows, then these concerns about out-of-orderness do not apply, and the solution can be fairly simple. It suffices to keep one object in keyed state, which might be a list holding all of the events in the window, or a counter or other aggregator, depending on what you're trying to accomplish.
When an event arrives, if the state (for this key) is null, then there is no open window for this key. Initialize the state (i.e., create a new, empty list, or set the counter to zero), and create a Timer to fire at the appropriate time. Then regardless of whether the state had been null, add the incoming event to the state (i.e., append it to the list, or increment the counter).
When the timer fires, emit the window's result and reset the state to null.
If, on the other hand, you want to do this with event time windows, first sort the stream and then use the same approach. Note that you won't be able to handle late events, so plan your watermarking accordingly (reducing the likelihood of late events to a manageable level), or go for a more complex implementation.

Circumventing negative side effects of default request sizes

I have been using Reactor pretty extensively for a while now.
The biggest caveat I have had coming up multiple times is default request sizes / prefetch.
Take this simple code for example:
Mono.fromCallable(System::currentTimeMillis)
.repeat()
.delayElements(Duration.ofSeconds(1))
.take(5)
.doOnNext(n -> log.info(n.toString()))
.blockLast();
To the eye of someone who might have worked with other reactive libraries before, this piece of code
should log the current timestamp every second for five times.
What really happens is that the same timestamp is returned five times, because delayElements doesn't send one request upstream for every elapsed duration, it sends 32 requests upstream by default, replenishing the number of requested elements as they are consumed.
This wouldn't be a problem if the environment variable for overriding the default prefetch wasn't capped to minimum 8.
This means that if I want to write real reactive code like above, I have to set the prefetch to one in every transformation. That sucks.
Is there a better way?

Stream de-duplication on Dataflow | Running services on Dataflow services

I want to de-dupe a stream of data based on an ID in a windowed fashion. The stream we receive has and we want to remove data with matching within N-hour time windows. A straight-forward approach is to use an external key-store (BigTable or something similar) where we look-up for keys and write if required but our qps is extremely large making maintaining such a service pretty hard. The alternative approach I came up with was to groupBy within a timewindow so that all data for a user within a time-window falls within the same group and then, in each group, we use a separate key-store service where we look up for duplicates by the key. So, I have a few questions about this approach
[1] If I run a groupBy transform, is there any guarantee that each group will be processed in the same slave? If guaranteed, we can group by the userid and then within each group compare the sessionid for each user
[2] If it is feasible, my next question is to whether we can run such other services in each of the slave machines that run the job - in the example above, I would like to have a local Redis running which can then be used by each group to look up or write an ID too.
The idea seems off what Dataflow is supposed to do but I believe such use cases should be common - so if there is a better model to approach this problem, I am looking forward to that too. We essentially want to avoid external lookups as much as possible given the amount of data we have.
1) In the Dataflow model, there is no guarantee that the same machine will see all the groups across windows for the key. Imagine that a VM dies or new VMs are added and work is split across them for scaling.
2) Your welcome to run other services on the Dataflow VMs since they are general purpose but note that you will have to contend with resource requirements of the other applications on the host potentially causing out of memory issues.
Note that you may want to take a look at RemoveDuplicates and use that if it fits your usecase.
It also seems like you might want to be using session windows to dedupe elements. You would call:
PCollection<T> pc = ...;
PCollection<T> windowed_pc = pc.apply(
Window<T>into(Sessions.withGapDuration(Duration.standardMinutes(N hours))));
Each new element will keep extending the length of the window so it won't close until the gap closes. If you also apply an AfterCount speculative trigger of 1 with an AfterWatermark trigger on a downstream GroupByKey. The trigger would fire as soon as it could which would be once it has seen at least one element and then once more when the session closes. After the GroupByKey you would have a DoFn that filters out an element which isn't an early firing based upon the pane information ([3], [4]).
DoFn(T -> KV<session key, T>)
|
\|/
Window.into(Session window)
|
\|/
Group by key
|
\|/
DoFn(Filter based upon pane information)
It is sort of unclear from your description, can you provide more details?
Sorry for not being clear. I gave the setup you mentioned a try, except for the early and late firings part, and it is working on smaller samples. I have a couple of follow up questions, related to scaling this up. Also, I was hoping I could give you more information on what the exact scenario is.
So, we have incoming data stream, each item of which can be uniquely identified by their fields. We also know that duplicates occur pretty far apart and for now, we care about those within a 6 hour window. And regarding the volume of data, we have atleast 100K events every second, which span across a million different users - so within this 6 hour window, we could get a few billion events into the pipeline.
Given this background, my questions are
[1] For the sessioning to happen by key, I should run it on something like
PCollection<KV<key, T>> windowed_pc = pc.apply(
Window<KV<key,T>>into(Sessions.withGapDuration(Duration.standardMinutes(6 hours))));
where key is a combination of the 3 ids I had mentioned earlier. Based on the definition of Sessions, only if I run it on this KV would I be able to manage sessions per-key. This would mean that Dataflow would have too many open sessions at any given time waiting for them to close and I was worried if it would scale or I would run into any bottle-necks.
[2] Once I perform Sessioning as above, I have already removed the duplicates based on the firings since I will only care about the first firing in each session which already destroys duplicates. I no longer need the RemoveDuplicates transform which I found was a combination of (WithKeys, Combine.PerKey, Values) transforms in order, essentially performing the same operation. Is this the right assumption to make?
[3] If the solution in [1] going to be a problem, the alternative is to reduce the key for sessioning to be just user-id, session-id ignoring the sequence-id and then, running a RemoveDuplicates on top of each resulting window by sequence-id. This might reduce the number of open sessions but still would leave a lot of open sessions (#users * #sessions per user) which can easily run into millions. FWIW, I dont think we can session only by user-id since then the session might never close as different sessions for same user could keep coming in and also determining the session gap in this scenario becomes infeasible.
Hope my problem is a little more clear this time. Please let me know any of my approaches make the best use of Dataflow or if I am missing something.
Thanks
I tried out this solution at a larger scale and as long as I provide sufficient workers and disks, the pipeline scales well although I am seeing a different problem now.
After this sessionization, I run a Combine.perKey on the key and then perform a ParDo which looks into c.pane().getTiming() and only rejects anything other than an EARLY firing. I tried counting both EARLY and ONTIME firings in this ParDo and it looks like the ontime-panes are actually deduped more precisely than the early ones. I mean, the #early-firings still has some duplicates whereas the #ontime-firings is less than that and has more duplicates removed. Is there any reason this could happen? Also, is my approach towards deduping using a Combine+ParDo the right one or could I do something better?
events.apply(
WithKeys.<String, EventInfo>of(new SerializableFunction<EventInfo, String>() {
#Override
public java.lang.String apply(EventInfo input) {
return input.getUniqueKey();
}
})
)
.apply(
Window.named("sessioner").<KV<String, EventInfo>>into(
Sessions.withGapDuration(mSessionGap)
)
.triggering(
AfterWatermark.pastEndOfWindow()
.withEarlyFirings(AfterPane.elementCountAtLeast(1))
)
.withAllowedLateness(Duration.ZERO)
.accumulatingFiredPanes()
);

Remove duplicates across window triggers/firings

Let's say I have an unbounded pcollection of sentences keyed by userid, and I want a constantly updated value for whether the user is annoying, we can calculate whether a user is annoying by passing all of the sentences they've ever said into the funcion isAnnoying(). Forever.
I set the window to global with a trigger afterElement(1), accumulatingFiredPanes(), do GroupByKey, then have a ParDo that emits userid,isAnnoying
That works forever, keeps accumulating the state for each user etc. Except it turns out the vast majority of the time a new sentence does not change whether a user isAnnoying, and so most of the times the window fires and emits a userid,isAnnoying tuple it's a redundant update and the io was unnecessary. How do I catch these duplicate updates and drop while still getting an update every time a sentence comes in that does change the isAnnoying value?
Today there is no way to directly express "output only when the combined result has changed".
One approach that you may be able to apply to reduce data volume, depending on your pipeline: Use .discardingFiredPanes() and then follow the GroupByKey with an immediate filter that drops any zero values, where "zero" means the identity element of your CombineFn. I'm using the fact that associativity requirements of Combine mean you must be able to independently calculate the incremental "annoying-ness" of a sentence without reference to the history.
When BEAM-23 (cross-bundle mutable per-key-and-window state for ParDo) is implemented, you will be able to manually maintain the state and implement this sort of "only send output when the result changes" logic yourself.
However, I think this scenario likely deserves explicit consideration in the model. It blends the concepts embodied today by triggers and the accumulation mode.

Resources