Split events based on criteria and handle in order - project-reactor

I have the following problem: given a stream of events that each have a partitionId property (0-10, for example), I'd like incoming events to be split according to the partitionId so that events with the same partitionId are handled in the order they're received.
With a more or less even distribution, that would lead to 10 events (one per partition) being handled in parallel.
Besides creating 10 single-threaded dispatchers and sending each event to the right dispatcher, is there a way to accomplish the above using Project Reactor?
Thanks.

The code below
splits the source stream into partitions,
creates a ParallelFlux with one "rail" per partition,
schedules the "rails" onto separate threads,
collects the results.
Having a dedicated thread for each partition guarantees its values are processed in their original order.
@Test
public void partitioning() throws InterruptedException {
    final int N = 10;
    Flux<Integer> source = Flux.range(1, 10000).share();
    // partition the source into N publishers
    Publisher<Integer>[] publishers = new Publisher[N];
    for (int i = 0; i < N; i++) {
        final int idx = i;
        publishers[idx] = source.filter(v -> v % N == idx);
    }
    // create a ParallelFlux, each 'rail' containing a single partition
    ParallelFlux.from(publishers)
            // schedule partitions onto different threads
            .runOn(Schedulers.newParallel("proc", N))
            // process each partition on its own thread, i.e. in order
            .map(it -> {
                String threadName = Thread.currentThread().getName();
                Assert.assertEquals("proc-" + (it % N + 1), threadName);
                return it;
            })
            // collect results on a single 'rail'
            .sequential()
            // and on a single thread called 'subscriber-1'
            .publishOn(Schedulers.newSingle("subscriber"))
            .subscribe(it -> {
                String threadName = Thread.currentThread().getName();
                Assert.assertEquals("subscriber-1", threadName);
            });
    Thread.sleep(1000);
}
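For reference, Reactor's groupBy operator expresses this kind of key-based partitioning directly, without manually building the publisher array. A minimal sketch under the same assumptions (the v % 10 key stands in for the event's partitionId, and the doOnNext body is a placeholder for the per-event work):

Scheduler scheduler = Schedulers.newParallel("partition", 10);
Flux.range(1, 10000)
        // one GroupedFlux per key; order within a group is preserved
        .groupBy(v -> v % 10)
        // publishOn takes a single worker per subscription, so each
        // group stays pinned to one thread of the parallel scheduler
        .flatMap(group -> group
                .publishOn(scheduler)
                .doOnNext(v -> { /* handle the event here, in order */ }))
        .subscribe();

One caveat: flatMap must allow at least as many concurrent inner subscriptions as there are distinct keys (the default of 256 is plenty for 10 partitions), otherwise groupBy can stall.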

Related

Search for sequence in Uint8List

Is there a fast (native) method to search for a sequence in a Uint8List?
///
/// Return index of first occurrence of seq in list
///
int indexOfSeq(Uint8List list, Uint8List seq) {
...
}
EDIT: Changed List<int> into Uint8List
No. There is no built-in way to search for a sequence of elements in a list.
I am also not aware of any dart:ffi based implementations.
The simplest approach would be:
extension IndexOfElements<T> on List<T> {
  int indexOfElements(List<T> elements, [int start = 0]) {
    if (elements.isEmpty) return start;
    var end = length - elements.length;
    if (start > end) return -1;
    var first = elements.first;
    var pos = start;
    outer:
    while (true) {
      pos = indexOf(first, pos);
      if (pos < 0 || pos > end) return -1;
      for (var i = 1; i < elements.length; i++) {
        if (this[pos + i] != elements[i]) {
          // mismatch: move past this occurrence of `first` and retry;
          // the label is needed so we continue the outer search loop
          pos++;
          continue outer;
        }
      }
      return pos;
    }
  }
}
This has worst-case time complexity O(length*elements.length). There are several more algorithms with better worst-case complexity, but they also have larger constant factors and more expensive pre-computations (KMP, BMH). Unless you search for the same long list several times, or do so in a very, very long list, they're unlikely to be faster in practice (and they'd probably have an API where you compile the pattern first, then search with it.)
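For what it's worth, Java's standard library ships this same brute-force scan as Collections.indexOfSubList, which suggests the naive approach is the accepted default for this problem; a small self-contained illustration:

import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class SubListSearch {
    public static void main(String[] args) {
        List<Integer> haystack = Arrays.asList(1, 2, 3, 4, 2, 3, 5);
        List<Integer> needle = Arrays.asList(2, 3, 5);
        // indexOfSubList does a straightforward O(n*m) scan,
        // just like the Dart extension above; prints 4
        System.out.println(Collections.indexOfSubList(haystack, needle));
    }
}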
You could use dart:ffi to bind to memmem from string.h as you suggested.
We do the same with binding to malloc from stdlib.h in package:ffi (source).
final DynamicLibrary stdlib = Platform.isWindows
    ? DynamicLibrary.open('kernel32.dll')
    : DynamicLibrary.process();
final PosixMalloc posixMalloc = stdlib
    .lookupFunction<Pointer Function(IntPtr), Pointer Function(int)>('malloc');
Edit: as lrn pointed out, we cannot expose the inner data pointer of a Uint8List at the moment, because the GC might relocate it.
One could use dart_api.h: pass the TypedData through the FFI trampoline as a Dart_Handle, and call Dart_TypedDataAcquireData from dart_api.h to access the inner data pointer.
(If you want to use this in Flutter, we would need to expose Dart_TypedDataAcquireData and Dart_TypedDataReleaseData in dart_api_dl.h (https://github.com/dart-lang/sdk/issues/40607); I've filed https://github.com/dart-lang/sdk/issues/44442 to track this.)
Alternatively, we could address https://github.com/dart-lang/sdk/issues/36707 so that the inner data pointer of a Uint8List could be exposed directly in the FFI trampoline.

How to eagerly merge two Flux?

Flux<Long> flux1 = Flux
        .<Long>create(fluxSink -> {
            for (long i = 0; i < 20; i++) {
                fluxSink.next(i);
            }
        })
        .filter(aLong -> aLong % 2 == 0)
        .doOnNext(aLong -> System.out.println("flux 1 : " + aLong));

Flux<Long> flux2 = Flux
        .<Long>create(fluxSink -> {
            for (long i = 0; i < 20; i++) {
                fluxSink.next(i);
            }
        })
        .filter(aLong -> aLong % 2 == 1)
        .doOnNext(aLong -> System.out.println("flux 2 : " + aLong));

Flux.merge(flux1, flux2)
        .doOnNext(System.out::println)
        .then()
        .block();
I create two Flux<Long> as in the code above.
flux1 creates an even-number stream (0, 2, 4, 6, 8, ...)
flux2 creates an odd-number stream (1, 3, 5, 7, 9, ...)
I expected merging flux1 and flux2 to produce something like 0, 1, 2, 3, 4, ... or 0, 2, 1, 3, 4, ..., depending on timing,
but it always drains flux1 and then flux2: (flux1 start) 0, 2, 4, 6, 8, ... 16, 18, (flux1 end) (flux2 start) 1, 3, 5, 7, ... 17, 19.
How can I subscribe to multiple Flux eagerly so that their events interleave?
Both streams run on the same thread. When you subscribe, flux1 starts pushing data until it's finished; only then is the thread free for flux2 to continue. The merge operator emits values in the order they arrive; it doesn't switch between the first and the second stream.
If you want the streams to run concurrently, you need to run them on different threads, e.g. by using the publishOn operator.
Flux<Long> flux1 = Flux
        .<Long>create(fluxSink -> {
            for (long i = 0; i < 20; i++) {
                fluxSink.next(i);
            }
        })
        .publishOn(Schedulers.newSingle("thread-x"))
        .filter(aLong -> aLong % 2 == 0)
        .doOnNext(aLong -> System.out.println("flux 1 : " + aLong));
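The same change applied to the second stream gives a complete sketch; with each source on its own scheduler (the scheduler names here are arbitrary), the merged output interleaves nondeterministically. Note the added fluxSink.complete(), without which then().block() would wait forever:

Flux<Long> flux2 = Flux
        .<Long>create(fluxSink -> {
            for (long i = 0; i < 20; i++) {
                fluxSink.next(i);
            }
            fluxSink.complete(); // let the merged pipeline terminate
        })
        .publishOn(Schedulers.newSingle("thread-y"))
        .filter(aLong -> aLong % 2 == 1)
        .doOnNext(aLong -> System.out.println("flux 2 : " + aLong));

Flux.merge(flux1, flux2)
        .doOnNext(System.out::println)
        .then()
        .block();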

Reactive way of implementing 'standard pagination'

I am just starting with Spring Reactor and want to implement something I would call 'standard pagination'; I don't know if there is a technical term for this. Basically, no matter what start and end date is passed to the method, I want to return the same amount of data, evenly distributed.
This will be used for some chart drawing in the future.
I figured out a rough draft with an algorithm that does exactly that; unfortunately, before I can filter the results I need to either count() or take the last index() and block to get this number.
This block is surely not the reactive way to do it, and it also makes the flux call the DB twice for the data (or am I missing something?).
Is there any operator that can help me get the result of count() down the stream for further usage? It would have to be computed before the stream can be processed anyway, but I'd like to get rid of calling the DB twice.
I am using the MongoDB reactive driver.
Flux<StandardEntity> results = Flux.from(
        mongoCollectionManager.getCollection(channel)
            .find(and(gte("lastUpdated", begin), lte("lastUpdated", end))))
    .map(d -> new StandardEntity(d.getString("price"), d.getString("lastUpdated")));

Long lastIndex = results
    .count()
    .block();

final double standardPage = 10.0D;
final double step = lastIndex / standardPage;
final double[] counter = {0.0D};

return results
    .take(1)
    .mergeWith(
        results
            .skip(1)
            .filter(e -> {
                if (lastIndex > standardPage)
                    if (counter[0] >= step) {
                        counter[0] = counter[0] - step + 1;
                        return true;
                    } else {
                        counter[0] = counter[0] + 1;
                        return false;
                    }
                else
                    return true;
            }));
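One way to avoid both the block() and the second subscription (and hence the second DB query) is to buffer the result set once with collectList() and derive the count from the list size. A rough sketch of that idea, reusing the results flux and the standardPage value of 10 from above:

return results
        .collectList()
        .flatMapMany(list -> {
            // the count is now just list.size(); no second query needed
            final double step = Math.max(list.size() / 10.0D, 1.0D);
            final List<StandardEntity> sampled = new ArrayList<>();
            // take every step-th element for an evenly distributed page
            for (double pos = 0.0D; pos < list.size(); pos += step) {
                sampled.add(list.get((int) pos));
            }
            return Flux.fromIterable(sampled);
        });

This trades memory (the whole result set is buffered at once) for a single round trip, which is usually acceptable for chart-sized data; alternatively, results.cache() would also prevent the second query while keeping the existing counting logic.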

Neo4j : Difference between cypher execution and Java API call?

Neo4j : Enterprise version 3.2
I see a tremendous difference between the following two calls in terms of speed. Here are the settings and query/API.
Page Cache : 16g | Heap : 16g
Number of rows/nodes -> 600K
Cypher code (ignore any syntax issues) | Time Taken : 50 sec.
USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM 'file:///xyx.csv' AS row
WITH row
CREATE (n:ObjectTension) SET n = row
From Java (session pool, with 15 sessions at a time as an example):
Thread_1 : Time Taken : 8 sec / 10K
Map<String, Object> params = new HashMap<String, Object>();
Map<String, Object> pList = new HashMap<String, Object>();
try (Transaction tx = Driver.session().beginTransaction()) {
    for (int i = 0; i < 10000; i++) {
        pList.put(i, i * i);
        params.put("props", pList);
        String query = "CREATE (n:Label {props})";
        // String query = "CREATE (n:Label) SET n = {props}";
        tx.run(query, params);
    }
}
Thread_2 : Time taken is 9 sec / 10K (same code as Thread_1)
.
.
.
Thread_3 : Basically the above code is reused. It's just an example.
Thread_N where N = (600K / 10K)
Hence, the overall time taken is around 2~3 minutes.
The questions are the following:
How does the CSV load work internally? Does it open a single session with multiple transactions within?
Or does it create multiple sessions based on the "USING PERIODIC COMMIT 10000" parameter, i.e. 600K/10000 = 60 sessions?
What's the best way to write via Java?
The idea is to achieve the same write performance via Java as with the CSV load, which writes 12000 nodes in ~5 seconds or even better.
Your Java code is doing something very different than your Cypher code, so it really makes no sense to compare processing times.
You should change your Java code to read from the same CSV file. File IO is fairly expensive, but your Java code is not doing any.
Also, whereas your pure Cypher query is creating nodes with a fixed (and presumably relatively small) number of properties, your Java pList is growing in size with every loop iteration -- so each Java loop iteration creates a node with between 1 and 10K properties! This may be the main reason why your Java code is much slower.
[UPDATE 1]
If you want to ignore the performance difference between using and not using a CSV file, the following (untested) code should give you an idea of what similar logic would look like in Java. In this example, the i loop assumes that your CSV file has 10 columns (you should adjust the loop to use the correct column count). Also, this example gives all the nodes the same properties, which is OK as long as you have not created a contrary uniqueness constraint.
Session session = Driver.session();
Map<String, Object> pList = new HashMap<String, Object>();
for (int i = 0; i < 10; i++) {
    pList.put(Integer.toString(i), i * i);
}
Map<String, Object> params = new HashMap<String, Object>();
params.put("props", pList);
String query = "CREATE (n:Label) SET n = {props}";
for (int j = 0; j < 60; j++) {
    try (Transaction tx = session.beginTransaction()) {
        for (int k = 0; k < 10000; k++) {
            tx.run(query, params);
        }
        tx.success(); // mark the transaction to be committed on close
    }
}
[UPDATE 2 and 3, copied from chat and then fixed]
Since the Cypher planner is able to optimize, the actual internal logic is probably a lot more efficient than the Java code I provided (above). If you want to also optimize your Java code (which may be closer to the code that Cypher actually generates), try the following (untested) code. It sends 10000 rows of data in a single run() call, and uses the UNWIND clause to break it up into individual rows on the server.
Session session = Driver.session();
Map<String, Integer> pList = new HashMap<String, Integer>();
for (int i = 0; i < 10; i++) {
    pList.put(Integer.toString(i), i * i);
}
List<Map<String, Integer>> rows = Collections.nCopies(10000, pList);
Map<String, Object> params = new HashMap<String, Object>();
params.put("rows", rows);
String query = "UNWIND {rows} AS row CREATE (n:Label) SET n = row";
for (int j = 0; j < 60; j++) {
    try (Transaction tx = session.beginTransaction()) {
        tx.run(query, params);
        tx.success(); // mark the transaction to be committed on close
    }
}
You can also try creating the nodes with the Java API directly, instead of relying on Cypher:
createNode - http://neo4j.com/docs/java-reference/current/javadocs/org/neo4j/graphdb/GraphDatabaseService.html#createNode-org.neo4j.graphdb.Label...-
setProperty - http://neo4j.com/docs/java-reference/current/javadocs/org/neo4j/graphdb/PropertyContainer.html#setProperty-java.lang.String-java.lang.Object-
Also, as the previous answer mentioned, the props variable has different values in your two cases.
Additionally, notice that you re-create the query string on every iteration (String query = "CREATE (n:Label {props})";), which may incur repeated query parsing unless Neo4j itself caches it.
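A minimal sketch of that approach, assuming an embedded GraphDatabaseService instance named graphDb (the embedded Java API runs in the same JVM as the database, unlike the Bolt driver used above):

import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Label;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Transaction;

try (Transaction tx = graphDb.beginTx()) {
    for (int i = 0; i < 10000; i++) {
        // createNode and setProperty bypass Cypher parsing entirely
        Node node = graphDb.createNode(Label.label("Label"));
        node.setProperty(Integer.toString(i % 10), i * i);
    }
    tx.success(); // commit on close
}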

I cannot understand the effectiveness of an algorithm in the Dart SDK

I cannot understand the effectiveness of an algorithm in the Dart SDK.
Here is the algorithm (List factory in dart:core, file list.dart)
factory List.from(Iterable other, { bool growable: true }) {
  List<E> list = new List<E>();
  for (E e in other) {
    list.add(e);
  }
  if (growable) return list;
  int length = list.length;
  List<E> fixedList = new List<E>(length);
  for (int i = 0; i < length; i++) {
    fixedList[i] = list[i];
  }
  return fixedList;
}
If growable is false then both lists will be created.
List<E> list = new List<E>();
List<E> fixedList = new List<E>(length);
But the creation of list #1 in this case is redundant because it's a duplicate of Iterable other. It just wastes CPU time and memory.
The following algorithm would be more efficient because it won't create the unnecessary list #1 when growable is false:
factory List.from(Iterable other, { bool growable: true }) {
  if (growable) {
    List<E> list = new List<E>();
    for (E e in other) {
      list.add(e);
    }
    return list;
  }
  List<E> fixedList = new List<E>(other.length);
  var i = 0;
  for (E e in other) {
    fixedList[i++] = e;
  }
  return fixedList;
}
Or am I wrong and missed some subtleties of programming?
We usually avoid invoking the length getter on iterables since it can have linear performance and side effects. For example:
List list = [1, 2, 3];
Iterable iterable1 = list.map((x) {
  print(x);
  return x + 1;
});
Iterable iterable2 = iterable1.where((x) => x > 2);
var fixedList = new List.from(iterable2, growable: false);
If List.from invoked the length getter it would run over all elements twice (where does not cache its result). It would furthermore execute the side-effect (printing 1, 2, 3) twice. For more information on Iterables look here.
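The same hazard exists in any language with lazy sequences. A rough Java analog of the count-then-copy pattern (the Iterable below rebuilds its Stream pipeline on every iterator() call, so iterating twice repeats the side effect):

import java.util.stream.Stream;

public class LazyTwice {
    public static void main(String[] args) {
        // each call to iterator() builds a fresh lazy pipeline
        Iterable<Integer> mapped = () -> Stream.of(1, 2, 3)
                .peek(System.out::println) // side effect, like print(x) above
                .map(x -> x + 1)
                .iterator();
        // counting (first pass) prints 1, 2, 3
        int length = 0;
        for (int x : mapped) length++;
        // copying (second pass) prints 1, 2, 3 again
        Integer[] copy = new Integer[length];
        int i = 0;
        for (int x : mapped) copy[i++] = x;
    }
}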
Eventually we want to change the List.from code so that we avoid the second allocation and the copying. To do this we need (internal) functionality that transforms a growable list into a fixed-length list. Tracking bug: http://dartbug.com/9459
It looks like it was just an incremental update to the existing function.
See this commit and this diff
The function started just with
List<E> list = new List<E>();
for (E e in other) {
  list.add(e);
}
and had some more bits added as part of a fairly major refactoring of numerous libraries.
I would say that the best thing to do is to raise a bug report on dartbug.com, and either add a patch, or commit a CL - see instructions here: https://code.google.com/p/dart/wiki/Contributing (Note, you do need to jump through some hoops first, but once you're set up, it's all good).
It might also be worth dropping a note to one of the committers or reviewers from the original commit to let them know your plans.
