How to create buffers of the latest message per group-by id with Project Reactor?

I have a stream that I'm grouping by id (so that different ids can be processed in parallel, while messages within the same id are processed in order). I'd like to write to MongoDB in batches that contain each id's message at most once (to make sure that each batch operation updates any given document only once), and after every write I'd like to build the next batch from the latest message of each group. I've created an image to demonstrate what I mean: the orange circles are the intended batches, and each rectangle represents a thread that holds a GroupedFlux<?>.
I would like to know which operator enables doing this.

How do you group by #1, #2, etc. and run the groups on separate threads? I don't think there is any guarantee that group #1 will always run on the same thread.
Anyway, I think this could be a solution to your problem: execute the groups in parallel, then merge them, write to Mongo in a reactive way, and run the other subscribers on separate threads.
import java.time.Duration;

import lombok.Value;
import reactor.core.publisher.Flux;
import reactor.core.scheduler.Schedulers;

public class Test1 {
    public static void main(String[] args) throws InterruptedException {
        Flux<TestData> cache = Flux.just(
                TestData.of("#1", "a"), TestData.of("#1", "b"), TestData.of("#1", "c"), TestData.of("#1", "d"),
                TestData.of("#2", "a"), TestData.of("#2", "b"), TestData.of("#2", "c"), TestData.of("#2", "d"),
                TestData.of("#3", "a"), TestData.of("#3", "b"), TestData.of("#3", "c"), TestData.of("#3", "d"),
                TestData.of("#4", "a"), TestData.of("#4", "b"), TestData.of("#4", "c"), TestData.of("#4", "d"))
            .delayElements(Duration.ofMillis(100))
            .groupBy(TestData::getX)     // group by id (#1 ... #4)
            .parallel(4)                 // process the groups on up to 4 rails
            .flatMap(g -> g)
            .sequential()
            .groupBy(TestData::getY)     // re-group by payload
            .map(g -> g.cache(1))        // each group replays only its latest element to late subscribers
            .cache()
            .flatMap(g -> g);

        cache
            .subscribeOn(Schedulers.parallel())
            .subscribe(next -> System.out.printf("All %s%n", next)); // DO WHATEVER YOU WANT

        Thread.sleep(1000);
        System.out.println("-------------------- 1 second passed ------------------------------");
        cache
            .subscribeOn(Schedulers.parallel())
            .subscribe(next -> System.out.printf("Latest after 1 sec passed %s%n", next));

        Thread.sleep(1000);
        System.out.println("--------------- 2 second passed -----------------------");
        cache
            .subscribe(next -> System.out.printf("Unique 2 sec passed %s%n", next));
    }

    @Value(staticConstructor = "of")
    public static class TestData {
        private final String x;
        private final String y;
    }
}
Outputs:
All Test1.TestData(x=#1, y=a)
All Test1.TestData(x=#1, y=b)
All Test1.TestData(x=#1, y=c)
All Test1.TestData(x=#1, y=d)
All Test1.TestData(x=#2, y=a)
All Test1.TestData(x=#2, y=b)
All Test1.TestData(x=#2, y=c)
All Test1.TestData(x=#2, y=d)
All Test1.TestData(x=#3, y=a)
-------------------- 1 second passed ------------------------------
Latest after 1 sec passed Test1.TestData(x=#3, y=a)
Latest after 1 sec passed Test1.TestData(x=#2, y=b)
Latest after 1 sec passed Test1.TestData(x=#2, y=c)
Latest after 1 sec passed Test1.TestData(x=#2, y=d)
All Test1.TestData(x=#3, y=b)
Latest after 1 sec passed Test1.TestData(x=#3, y=b)
All Test1.TestData(x=#3, y=c)
Latest after 1 sec passed Test1.TestData(x=#3, y=c)
All Test1.TestData(x=#3, y=d)
Latest after 1 sec passed Test1.TestData(x=#3, y=d)
All Test1.TestData(x=#4, y=a)
Latest after 1 sec passed Test1.TestData(x=#4, y=a)
All Test1.TestData(x=#4, y=b)
Latest after 1 sec passed Test1.TestData(x=#4, y=b)
All Test1.TestData(x=#4, y=c)
Latest after 1 sec passed Test1.TestData(x=#4, y=c)
All Test1.TestData(x=#4, y=d)
Latest after 1 sec passed Test1.TestData(x=#4, y=d)
--------------- 2 second passed -----------------------
Unique 2 sec passed Test1.TestData(x=#4, y=a)
Unique 2 sec passed Test1.TestData(x=#4, y=b)
Unique 2 sec passed Test1.TestData(x=#4, y=c)
Unique 2 sec passed Test1.TestData(x=#4, y=d)

Related

Sample 1 in N elements

I have a long-running Flux and would like to log 1 in N elements to monitor progress. The following code instead logs one element every N milliseconds.
Flux
    .fromStream(
        IntStream
            .range(1, 101)
            .mapToObj(Integer::valueOf))
    .sample(Duration.ofMillis(2))
    .subscribe(e -> log.debug(e.toString()));
It sounds like sample(Publisher...) could be used to log 1 in N elements by producing a Mono.just("") for the one element and Mono.empty() for the rest, but the method does not supply the element being sampled. Any ideas on how to solve this?
You can use Flux#index to get an indexed Flux and log every n-th element; note that the index is zero-based, so elements 0, n, 2n, ... are the ones logged. This can be cleanly incorporated into your chain using Flux#transform.
The log1InN utility method:
<T> Flux<T> log1InN(Flux<T> source, int n) {
    return source.index()
        .doOnNext(e -> {
            if (e.getT1() % n == 0) {
                log.info("Element {}: {}", e.getT1(), e.getT2());
            }
        })
        .map(Tuple2::getT2);
}
And then use it to log, for example, every tenth element:
Flux.fromStream(
        IntStream
            .range(1, 101)
            .mapToObj(Integer::valueOf))
    .transform(f -> log1InN(f, 10))
    .subscribe();
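If the indexed stream is consumed only for logging anyway, an inline variant without the utility method (assuming the same log field) could simply filter on the zero-based index:
Flux.fromStream(IntStream.range(1, 101).mapToObj(Integer::valueOf))
    .index()                           // pair each element with its 0-based index
    .filter(t -> t.getT1() % 10 == 0)  // keep every tenth element
    .subscribe(t -> log.info("Element {}: {}", t.getT1(), t.getT2()));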

Dask regex extract comparison failing with NotImplementedError

I have a Dask dataframe that looks like this:
class1 statement class2 value
<geoentity_Pic_de_Font_Blanca_2986043> <hasLatitude> 42.64991^^<degrees> 42.64991
<geoentity_Pic_de_Font_Blanca_2986043> <hasLongitude> 1.53335^^<degrees> 1.53335
<geoentity_Pic_de_Font_Blanca_2986043> <hasGeonamesEntityId> 2986043 NaN
<geoentity_Pic_de_Font_Blanca_2986043> rdfs:label Pic de Font Blanca NaN
I'm trying to check whether the number in class1 matches the one in class2 for all the <hasGeonamesEntityId> rows, so that I can get rid of those rows, since they would then carry unnecessarily duplicated data.
I tried:
df[(df['statement'] == '<hasGeonamesEntityId>') & (df['class1'].str.extract(r'_(\d+)>$') == df['class2'])].head()
but this gives me the following error:
E:\WPy-3710\python-3.7.1.amd64\lib\site-packages\dask\dataframe\core.py in __getitem__(self, key)
3347 graph = HighLevelGraph.from_collections(name, dsk, dependencies=[self, key])
3348 return new_dd_object(graph, name, self, self.divisions)
-> 3349 raise NotImplementedError(key)
3350
3351 def __setitem__(self, key, value):
NotImplementedError: Dask DataFrame Structure:
0 1
npartitions=442
bool bool
... ...
... ... ...
... ...
... ...
Dask Name: and_, 3978 tasks
My dtypes are:
class1 category
statement category
class2 object
value category
I'm not sure why this is failing, since the extract on its own seems to return the correct substring. Does anybody know what I'm doing wrong?
It's hard to say without a reproducible example, but it looks like you're trying to index a Dask DataFrame with another Dask DataFrame, which isn't supported and probably isn't what you want.
Using just pandas
In [18]: df = pd.DataFrame({"A": ['a1', 'b2', 'c3']})
In [19]: df[df.A.str.extract(r'(\d)') == '1']
Out[19]:
A
0 NaN
1 NaN
2 NaN
That's because .str.extract returns a DataFrame. Set expand=False to get a 1D Series:
In [20]: df[df.A.str.extract(r'(\d)', expand=False) == '1']
Out[20]:
A
0 a1
Which works for Dask as well
In [21]: df = dd.from_pandas(df, 2)
In [22]: df[df.A.str.extract(r'(\d)', expand=False) == '1']
Out[22]:
Dask DataFrame Structure:
A
npartitions=1
0 object
2 ...
Dask Name: getitem, 5 tasks
In [23]: _.compute()
Out[23]:
A
0 a1
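Applied to the original query, the same expand=False fix should be all that's needed; a sketch against the DataFrame from the question (not tested here):
duplicated = df[
    (df['statement'] == '<hasGeonamesEntityId>')
    & (df['class1'].str.extract(r'_(\d+)>$', expand=False) == df['class2'])
]
duplicated.head()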

Hash value of String that would be stable across iOS releases?

The iOS documentation for String.hash says:
You should not rely on this property having the same hash value across
releases of OS X.
(It's strange that it speaks of OS X in the iOS documentation.)
Well, I need a hashing function that will not change across iOS releases. It can be simple; I do not need anything like SHA. Is there some library for that?
There is another question about this here, but the accepted (and only) answer there simply states that we should respect the note in the documentation.
Here is a non-crypto hash, for Swift 3:
func strHash(_ str: String) -> UInt64 {
    var result = UInt64(5381)
    let buf = [UInt8](str.utf8)
    for b in buf {
        result = 127 * (result & 0x00ffffffffffffff) + UInt64(b)
    }
    return result
}
It was derived somewhat from a C++11 constexpr:
constexpr uint64_t str2int(char const* input) {
    return *input                            // test for null terminator
        ? (static_cast<uint64_t>(*input) +   // add char to end
           127 * (str2int(input + 1)         // prime 127 shifts left almost 7 bits
                  & 0x00ffffffffffffff))     // mask right 56 bits
        : 5381;                              // start with prime number 5381
}
Unfortunately, the two don't yield the same hash. To do that, you'd need to reverse the iteration order in strHash:
for b in buf.reversed() { ... }
But that runs about 13x slower, roughly comparable to the djb2hash String extension that I got from https://useyourloaf.com/blog/swift-hashable/
Here are some benchmarks, for a million iterations:
hashValue execution time: 0.147760987281799
strHash execution time: 1.45974600315094
strHashReversed time: 18.7755110263824
djb2hash execution time: 16.0091370344162
sdbmhash crashed
For C++, str2int is roughly as fast as Swift 3's hashValue:
str2int execution time: 0.136421
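Usage is a one-liner; because strHash depends only on the UTF-8 bytes of the string, the result should stay stable across runs and OS releases (the input value below is just an example):
let h = strHash("2986043")  // always the same UInt64 for the same string
print(h)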

Create query with multiple answers for each cell

So I have got 3 sheets with 2 predefined ranges, as you can see in the example:
# RangeA    # RangeB    # Wanted Result
========    ========    ===============
 A | B         A               A
--------    --------    ---------------
 1 | a         a               a
 2 | a         b               1
 3 | a                         2
 4 | b                         3
 5 | b                         b
 6 | b                         4
 7 | c                         5
 8 | c                         6
 9 | c
...
Now I would like to have a formula that produces the wanted result. I have been searching for quite a long time today, but I wasn't successful. I hope there is somebody who can help me.
I hope the example is clear enough to show what I want to do.
Thanks in advance for your time.
I solved it in the end with Google Apps Script.
The function I used is pretty simple, just two for loops:
/**
 * Merge merges two arrays to get one list of wanted values.
 * @param needle {Array} is a list of wanted values
 * @param haystack {Array} is a list of values and their group
 * @return returns a list of merged values in the format group, value,
 *         value, group ...
 */
function Merge(needle, haystack) {
  var result = [];
  // Set default values if a parameter is not set.
  needle = needle || [[]];
  haystack = haystack || [[]];
  // Filter the first array and remove empty items. # RangeB
  needle = needle.filter(function(item) {
    return item[0];
  });
  // Filter the second array and remove empty or incomplete items. # RangeA
  haystack = haystack.filter(function(item) {
    return item[0] && item[1];
  });
  // Merge both arrays to get the # Wanted Result
  needle.forEach(function(item) {
    result.push([item[0]]);
    haystack.forEach(function(row) {
      if (item[0] == row[1]) {
        result.push([row[0]]);
      }
    });
  });
  // Check that the result has a length
  if (result.length > 0) {
    return result;
  }
  // else return null to avoid the #error message
  return null;
}
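In the sheet itself, the script can then presumably be used as a custom function; assuming named ranges RangeB (the group list) and RangeA (the value/group pairs) exist, the wanted-result column would be filled with:
=Merge(RangeB, RangeA)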

Lua Nested Unpack Bug?

Question:
I'm trying to unpack an array into an array, but it only works if it's the last item unpacked; if there is anything after it, only its first element is unpacked. The following is a very basic example of what I'm trying to do. Is there a better way to do this, or is this a bug I'll have to cope with? I don't want to use table.insert, since adding the elements within the table definition with something like unpack seems much more readable.
Code:
print ("Error 1")
local table1 = { {1,1}, {2,2}, {3,3} }
local table2 = { {0,0}, unpack (table1), {4,4} }
for n,item in ipairs (table2) do print (unpack(item)) end
print ("Good")
table1 = { {1,1}, {2,2}, {3,3} }
table2 = { {0,0}, unpack (table1) }
for n,item in ipairs (table2) do print (unpack(item)) end
print ("Error 2")
table1 = { {1,1}, {2,2}, {3,3} }
table2 = { {0,0}, unpack (table1), unpack (table1) }
for n,item in ipairs (table2) do print (unpack(item)) end
Output:
Error 1
0 0
1 1 -- {2,2} & {3,3} cut off.
4 4
Good
0 0
1 1 -- All elements unpacked.
2 2
3 3
Error 2
0 0
1 1 -- {2,2} & {3,3} cut off.
1 1 -- All elements unpacked.
2 2
3 3
Note:
I'm running version 5.1.
This is not a bug. A function call that returns multiple values is adjusted to exactly one value unless the call is the last (or only) expression in an expression list. The manual says so at http://www.lua.org/manual/5.1/manual.html#2.5
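As a workaround, you can append the tables explicitly instead of unpacking in the middle of a constructor; a sketch for Lua 5.1 (the helper name append is chosen here just for illustration):
local function append(dst, src)
  -- copy every element of src onto the end of dst
  for _, v in ipairs(src) do
    dst[#dst + 1] = v
  end
  return dst
end

local table1 = { {1,1}, {2,2}, {3,3} }
local table2 = append(append({ {0,0} }, table1), { {4,4} })
for n, item in ipairs(table2) do print(unpack(item)) end
-- prints: 0 0 / 1 1 / 2 2 / 3 3 / 4 4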
