Reactor 3.x - limit the time of a groupBy Flux - project-reactor

Is there any way to force a Flux generated by groupBy() to complete after a period of time (or, similarly, to limit the maximum number of "open" groups), regardless of whether the upstream has completed? I have something like the following:
Flux<Foo> someFastPublisher;

someFastPublisher
    .groupBy(f -> f.getKey())
    .delayElements(Duration.ofSeconds(1)) // rate limit each group
    .flatMap(g -> g)                      // unwind the group
    .subscribe();
and am running into a case where the Flux hangs, presumably because the number of groups is greater than the flatMap's concurrency. I could increase the flatMap concurrency, but there's no easy way to tell what the maximum possible number of groups is. Instead, I know the Foos being grouped by Foo.key are going to be close to each other in time/publication order, so I would rather put some sort of time window on the groupBy Flux than rely on flatMap concurrency (and ending up with two different groups with the same key() isn't a big deal).
I'm guessing the groupBy Fluxes won't onComplete until someFastPublisher onCompletes - i.e. the Fluxes handed off to flatMap just stay "open" (even though they're not likely to ever get a new event).
I am able to work around this either by prefetching Integer.MAX_VALUE in the groupBy or by setting the flatMap concurrency to Integer.MAX_VALUE, as sketched below - but is there a way to control the "life" of the group?
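A minimal sketch of the concurrency workaround (same pipeline as above, only the flatMap call changes):
// never leave a group unsubscribed, at the cost of unbounded concurrency
someFastPublisher
    .groupBy(f -> f.getKey())
    .delayElements(Duration.ofSeconds(1))
    .flatMap(g -> g, Integer.MAX_VALUE)
    .subscribe();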

Yes: you can apply a take(Duration) to the groups in order to ensure they close early; a new group with the same key will open after that:
source.groupBy(v -> v.intValue() % 2)
      .flatMap(group -> group
          .take(Duration.ofMillis(1000))
          .count()
          .map(c -> "group " + group.key() + " size = " + c)
      )
      .log()
      .blockLast();
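Applied to the pipeline from the question, that would look roughly like the sketch below (the 30-second window is an arbitrary choice, and the per-element delay is assumed to belong inside each group):
someFastPublisher
    .groupBy(f -> f.getKey())
    .flatMap(group -> group
        .take(Duration.ofSeconds(30))           // close the group after 30s so flatMap can move on
        .delayElements(Duration.ofSeconds(1)))  // rate limit within the group
    .subscribe();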

Related

Reactor groupBy: What happens with remaining items after GroupedFlux is canceled?

I need to group an infinite Flux by key with high cardinality.
For example:
the group key is a domain URL
calls to one domain should be strictly sequential (the next call happens after the previous one has completed)
calls to different domains should be concurrent
the time interval between items with the same key (URL) is unknown, but is expected to be bursty: several items emitted in a short period of time, then a long pause until the next burst.
queue
    .groupBy(keyMapper, groupPrefetch)
    .flatMap(
        { group ->
            group.concatMap(
                { task -> makeSlowRemoteCall(task) },
                0
            )
            .takeUntil { remoteCallResult -> remoteCallResult == DONE }
            .timeout(groupTimeout, Mono.empty())
            .then()
        },
        concurrency
    )
I cancel the group in two cases:
The makeSlowRemoteCall() result indicates that, with high probability, there will be no new items in this group in the near future.
The next item is not emitted within groupTimeout. I use the timeout(timeout, fallback) variant to suppress the TimeoutException and allow flatMap's inner publisher to complete successfully.
I want possible future items with the same key to form a new GroupedFlux and be processed by the same flatMap inner pipeline.
But what happens if the GroupedFlux has remaining unrequested items when I cancel it?
Does the groupBy operator re-queue them into a new group with the same key, or are they lost forever? If the latter, what is the proper way to solve my problem? I am also not sure whether I need to set the concatMap() prefetch to 0 in this case.
I think the groupBy() operator is not a good fit for my task, with an infinite source and a lot of groups: it creates infinite groups, so it is necessary to somehow cancel idle groups downstream, but it is not possible to cancel a GroupedFlux with a guarantee that it has no unconsumed elements.
I think it would be great to have a groupBy variant that emits finite groups.
Something like groupBy(keyMapper, boundaryPredicate): when boundaryPredicate returns true, the current group completes and the next element with the same key starts a new group.

How do I select a random element from an ets set in Erlang/Elixir?

I have a large number of processes that I need to keep track of in an ets set, and from which I then need to randomly select single processes. So I created the set like this:
:ets.new(:pid_lookup, [:set, :protected, :named_table])
Then, for argument's sake, let's just stick self() in it 1000 times:
Enum.map 1..1000, fn x -> :ets.insert(:pid_lookup, {x, self()}) end
Now I need to select one at random. I know I could just select a random one using :ets.lookup(:pid_lookup, :rand.uniform(1000)), but what if I don't know the size of the set (in the above case, 1000) in advance?
How do I find out the size of an ets set? And/or is there a better way to choose a random pid from an ets data structure?
If the keys are sequential numbers:
tab = :ets.new(:tab, [])
Enum.each(1..1000, & :ets.insert(tab, {&1, :value}))
size = :ets.info(tab, :size)
# size = 1000
value_picked_randomly = :ets.lookup(tab, Enum.random(1..size))
:ets.info(tab, :size) returns the size of the table, i.e. the number of records inserted into it.
If you don't know what the keys are:
first = :ets.first(tab)
:ets.lookup(tab, first)

# an anonymous function cannot call itself by name, so pass it to itself
func = fn key, func ->
  if function_that_may_return_true() do
    case :ets.next(tab, key) do
      :"$end_of_table" -> throw(:reached_end_of_table)
      next_key -> func.(next_key, func)
    end
  else
    :ets.lookup(tab, key)
  end
end

func.(first, func)
func will iterate over the ets table and return the value at a randomly reached position.
This will be time-consuming, so it is not an ideal solution for tables with a large number of records.
As I understood from the comments, this is an XY Problem.
What you essentially need is to keep track of a changing list and pick one of its elements at random. ETS in general, and :ets set tables in particular, are by no means intended to be queried for their size; they serve different purposes.
Spawn an Agent within your supervision tree that holds the list of PIDs of the already started servers, and use Kernel.length/1 to query its size, or even Enum.random/1 if the list is not really huge (the latter traverses the whole enumerable to get a random element), as sketched below.
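A minimal sketch of that approach (PidRegistry is a hypothetical module name):
defmodule PidRegistry do
  use Agent

  # hold the list of registered PIDs in an Agent named after this module
  def start_link(_opts), do: Agent.start_link(fn -> [] end, name: __MODULE__)

  def register(pid), do: Agent.update(__MODULE__, &[pid | &1])

  # Enum.random/1 raises on an empty list, so register at least one PID first
  def random_pid, do: Agent.get(__MODULE__, &Enum.random/1)
end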

Esper EPL statement each time a value has increased a multiple

I am looking for an EPL statement which fires an event each time a certain value has increased by a specified amount, with any number of events in between. For example:
Consider a stream which continuously provides new prices.
I want to get a notification, e.g. if the price is greater than the first price + 100. Something like:
select * from pattern[a=StockTick -> every b=StockTick(b.price>=a.price+100)];
But how do I arrange to get the next event(s) when the increase is >= 200, >= 300 and so forth?
Diverse tests with contexts and windows have not been successful so far, so I appreciate any help! Thanks!
Contexts would be the right way to go.
You could start by defining a start event like this:
create schema StartEvent(threshold int);
And then have a context that uses the start event:
create context ThresholdContext initiated by StartEvent as se
terminated after 5 years
context ThresholdContext select * from pattern[a=StockTick -> every b=StockTick(b.price>=context.se.threshold)];
You can generate the StartEvent using "insert into" from the same pattern (you probably want to remove the "every"), have the listener send in a StartEvent, or declare another pattern that fires just once to create a StartEvent.
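A rough sketch of the insert-into variant (assuming price is an integer and the next threshold should sit 100 above the price that triggered the match):
// hypothetical: re-arm the context with a new threshold each time the pattern fires
insert into StartEvent
select b.price + 100 as threshold
from pattern [a=StockTick -> b=StockTick(b.price >= a.price + 100)];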

specific query with cypher

I need help with a specific query. I am using Neo4j. My database consists of companies (nodes) and transactions between them (relationships). Each PAID relationship has these properties:
amount - the amount of the transaction
year - the year of the transaction
month - the month of the transaction
What I need is to find all cycles in the graph starting at node A, where the transactions occurred one after another.
A valid example would be: A PAID B in March, B PAID C in April, C PAID A in June.
So is there any way to get all cycles from node A such that the transactions occur in chronological order?
You may want to set up a sample graph on the Neo4j console to share, or at least tell us more about which version of Neo4j you are using. But if you're on 2.0 and you store year and month as long or integer, then maybe you could try something like
MATCH a-[ab:PAID]->b-[bc:PAID]->c-[ca:PAID]->a
WHERE (ab.year + ab.month) > (bc.year + bc.month) > (ca.year + ca.month)
RETURN a,b,c
EDIT:
Actually, that was hasty: the additions won't work that way, of course, but the structure should be OK. Maybe
WHERE ((ab.year > bc.year) or (ab.year = bc.year AND ab.month > bc.month))
AND ((bc.year > ca.year) OR (bc.year = ca.year AND bc.month > ca.month))
or
WHERE (ab.year * 12 + ab.month) > (bc.year * 12 + bc.month) > (ca.year * 12 + ca.month)
If you only use the dates for this kind of comparison, consider storing them as a single property, perhaps as milliseconds since the epoch (1 January 1970 GMT). That makes comparisons very easy, as sketched below. But if you need to return and display dates frequently, then keeping them separate might make sense.
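A sketch of the fixed-length cycle using a single, hypothetical date property holding such a timestamp:
MATCH a-[ab:PAID]->b-[bc:PAID]->c-[ca:PAID]->a
// flip the comparisons to < if the payments should run in ascending date order
// around the cycle, as in the question's March -> April -> June example
WHERE ab.date > bc.date AND bc.date > ca.date
RETURN a, b, c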
EDIT2:
I can't think of a way to build your "r1.date < r2.date" condition into the pattern itself, which means matching all variable-depth cycles and then discarding some (most) of them. That is bound to become expensive in a large graph, and you may be better off building a traversal or server plugin, which can make complex iterative decisions during the traversal. In 2.0, thanks to Wes' elegant collection slicing, you could try something like this
MATCH path=a-[ab:PAID*..10]->a
WHERE ALL (ix IN range(0, length(ab)-2)
           WHERE ((ab[ix]).year * 12 + (ab[ix]).month) < ((ab[ix+1]).year * 12 + (ab[ix+1]).month))
RETURN path
The same could probably be achieved in 1.9 with HEAD() and TAIL(). Again, share sample data in a console and maybe someone else can pitch in.

Getting lots of data from Mnesia - fastest way

I have a record:
-record(bigdata, {mykey,some1,some2}).
Is doing a
mnesia:match_object({bigdata, mykey, some1,'_'})
the fastest way of fetching more than 5000 rows?
Clarification:
Creating "custom" keys is an option (so I can do a read) but is doing 5000 reads fastest than match_object on one single key?
I'm curious as to the problem you are solving, how many rows are in the table, etc.; without that information this might not be a relevant answer, but...
If you have a bag, then it might be better to use read/2 on the key and then traverse the list of records returned. It would be best, if possible, to structure your data to avoid select and match.
In general, select/2 is preferred to match_object as it tends to better avoid full table scans; dirty_select is going to be faster than select/2, assuming you do not need transactional support (see the sketch below). And, if you can live with the constraints, Mnesia allows you to go against the underlying ets table directly, which is very fast, but look at the documentation as it is appropriate only in very rarefied situations.
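A rough sketch of the select/dirty_select form of the query from the question (assuming MyKey and Some1 are bound to the values being matched):
%% the match head mirrors the #bigdata{} record layout; '$_' returns whole records
MatchSpec = [{{bigdata, MyKey, Some1, '_'}, [], ['$_']}],
Records = mnesia:dirty_select(bigdata, MatchSpec).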
Mnesia is more of a key-value storage system, and it will traverse all of its records to find a match.
To fetch quickly, you should design the storage structure to directly support the query: make some1 the key or an index, then fetch with read or index_read, for example:
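A sketch of the index route (Value stands for the some1 value being looked up; dirty_index_read/3 skips the transaction):
mnesia:add_table_index(bigdata, some1),
Records = mnesia:dirty_index_read(bigdata, Value, some1).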
Whether this is the fastest way to return more than 5000 rows depends on the problem in question. What is the database structure? What do we want? What is the record structure? After that, it boils down to how you write your read functions. If we are sure about the primary key, then we use mnesia:read/1 or mnesia:read/2; if not, it is better (and more elegant) to use Query List Comprehensions, which are more flexible for searching nested records and for complex conditional queries. See the usage below:
-include_lib("stdlib/include/qlc.hrl").

-record(bigdata, {mykey, some1, some2}).

%% query list comprehensions
select(Q) ->
    %% to prevent nested transactions, and to ensure this also works
    %% whether the table is fragmented or not, we use mnesia:activity/4
    case mnesia:is_transaction() of
        false ->
            F = fun(QH) -> qlc:e(QH) end,
            mnesia:activity(transaction, F, [Q], mnesia_frag);
        true ->
            qlc:e(Q)
    end.

%% to read by a given field (or even several),
%% use a list comprehension and pass the guards
%% to filter the records accordingly
read_by_field(some2, Value) ->
    QueryHandle = qlc:q([X || X <- mnesia:table(bigdata),
                              X#bigdata.some2 == Value]),
    select(QueryHandle).

%% selecting by several conditions
read_by_several() ->
    %% you can pass as many guard expressions as you need
    QueryHandle = qlc:q([X || X <- mnesia:table(bigdata),
                              X#bigdata.some2 =< 300,
                              X#bigdata.some1 > 50]),
    select(QueryHandle).

%% it is also possible to pass a fun which does the
%% record selection inside the query list comprehension
auto_reader(ValidatorFun) ->
    QueryHandle = qlc:q([X || X <- mnesia:table(bigdata),
                              ValidatorFun(X) == true]),
    select(QueryHandle).

read_using_auto() ->
    F = fun({bigdata, _SomeKey, _, _Some2}) -> true;
           (_) -> false
        end,
    auto_reader(F).
So I think if you want the fastest way, we need more clarification and detail about the problem. Speed depends on many factors!
