How to un-subscribe an item from a flux - project-reactor

Consider the following flux
FluxSink<String> sink;
Flux<String> flux1 = Flux
.<String>create(emitter -> {
sink = emitter;
},...)
.cache()
.publish()
.autoConnect();
So to add an item we can do sink.next("4"), and to subscribe:
flux1.subscribe(item -> log.info("item: " + item));
By filtering flux1, say on element "2", we are not removing that item from the flux.
I know the Flux publisher is immutable.
If we can add to it via sink, how can we remove an item from flux1?

Proper thinking gives proper answers
Think of a Flux as an immutable stream of messages. It is like a river: you can add water to it, but you cannot take back the water you have already sent down the flow. You can, however, filter that water further downstream.
In case you need to "remove" illegal elements from the stream, you can filter them:
flux1.filter(e -> !e.equals("2"))
.subscribe(item -> log.info("item: " + item));
The flux is not a data structure of the kind we are used to; it is a stream of data. You cannot modify it at the point where the data is supplied, but you can manipulate how the data travels to the endpoint.
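To make this concrete, here is a minimal, self-contained sketch (not from the original answer) of the setup above. It assumes Reactor 3; the AtomicReference holder and the class name are just one illustrative way of capturing the sink outside the create lambda. Element "2" still reaches the source flux - it is only dropped on its way to the filtering subscriber.

import java.util.concurrent.atomic.AtomicReference;
import reactor.core.publisher.Flux;
import reactor.core.publisher.FluxSink;

public class FluxFilterSketch {
    public static void main(String[] args) {
        // Holder so the sink handed to the create lambda is usable outside it.
        AtomicReference<FluxSink<String>> sinkRef = new AtomicReference<>();

        Flux<String> flux1 = Flux.<String>create(sinkRef::set)
                .cache()
                .publish()
                .autoConnect();

        // Subscribing triggers autoConnect, which runs the create callback
        // and populates sinkRef. "2" is still emitted by the source; it is
        // simply dropped on the way to this subscriber.
        flux1.filter(e -> !e.equals("2"))
             .subscribe(item -> System.out.println("item: " + item));

        sinkRef.get().next("1");
        sinkRef.get().next("2"); // filtered out downstream
        sinkRef.get().next("3");
    }
}

Running this prints item: 1 and item: 3.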

Related

Java 8- forEach method iterator behaviour

I recently started checking out the new Java 8 features.
I've come across the forEach method, which iterates over a Collection.
Say I have an ArrayList<Integer> with the values {1,2,3,4,5}:
list.forEach(i -> System.out.println(i));
This statement iterates over the list and prints the values in it.
I'd like to know how to specify that I want it to iterate over some specific values only.
For example, I want it to start from the 2nd value and iterate up to the 2nd-to-last value, or something like that - or over alternate elements.
How am I going to do that?
To iterate over a section of the original list, use the subList method:
list.subList(1, list.size() - 1)
.stream() // This line is optional, since List already has a forEach method taking a Consumer as parameter
.forEach(...);
This is the concept of streams. After one operation, the results of that operation become the input for the next.
So for your specific example, you can follow Joni's suggestion above. But if you're asking in general, then you can create a filter so that you only get the values you want to loop over.
For example, if you only wanted to print the even numbers, you could apply a filter on the stream before the forEach, like this:
List<Integer> intList = Arrays.asList(1,2,3,4,5);
intList.stream()
.filter(e -> (e & 1) == 0)
.forEach(System.out::println);
You can similarly pick out the stuff you want to loop over before reaching your terminal operation (in your case the forEach) on the stream. I suggest you read this stream tutorial to get a better idea of how they work: http://winterbe.com/posts/2014/07/31/java8-stream-tutorial-examples/
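For the "alternate elements" part of the question, one possible approach (a sketch, not in the original answers; the class name is made up) is to stream over the indices rather than the values:

import java.util.Arrays;
import java.util.List;
import java.util.stream.IntStream;

public class SubsetIteration {
    public static void main(String[] args) {
        List<Integer> list = Arrays.asList(1, 2, 3, 4, 5);

        // From the 2nd value up to, but not including, the last one: prints 2, 3, 4
        list.subList(1, list.size() - 1).forEach(System.out::println);

        // Alternate elements, selected by index: prints 1, 3, 5
        IntStream.range(0, list.size())
                 .filter(i -> i % 2 == 0)
                 .mapToObj(list::get)
                 .forEach(System.out::println);
    }
}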

SPARK - Joining two data streams - maintenance of cache

It is evident that the out-of-the-box join capability in Spark Streaming does not cover a lot of real-life use cases, because it joins only the data contained in the micro-batch RDDs.
The use case is to join data from two Kafka streams, enrich each object in stream1 with its corresponding object from stream2 in Spark, and save it to HBase.
The implementation would:
maintain a dataset in memory of objects from stream2, adding or replacing objects as and when they are received
for every element in stream1, access the cache to find a matching object from stream2, save it to HBase if a match is found, or put it back on the Kafka stream if not.
This question is about exploring Spark Streaming and its API to find a way to implement the above.
You can join the incoming RDDs to other RDDs -- not just the ones in that micro-batch. Basically you keep a "running total" RDD that you fill something like:
var globalRDD1: RDD[...] = sc.emptyRDD
var globalRDD2: RDD[...] = sc.emptyRDD
dstream1.foreachRDD(rdd => if (!rdd.isEmpty) globalRDD1 = globalRDD1.union(rdd))
dstream2.foreachRDD(rdd => if (!rdd.isEmpty) {
globalRDD2 = globalRDD2.union(rdd)
globalRDD1.join(globalRDD2).foreach(...) // etc, etc
})
A good start would be to look into mapWithState. This is a more efficient replacement for updateStateByKey. These are defined on PairDStreamFunctions, so assuming your objects of type V in stream2 are identified by some key of type K, your first point would go like this:
def stream2: DStream[(K, V)] = ???
def maintainStream2Objects(key: K, value: Option[V], state: State[V]): (K, V) = {
value.foreach(state.update(_))
(key, state.get())
}
val spec = StateSpec.function(maintainStream2Objects)
val stream2State = stream2.mapWithState(spec)
stream2State is now a stream where each batch contains the (K, V) pairs with the latest value seen for each key. You can join this stream with stream1 to perform the further logic for your second point.

Is it possible to read a message from a PubSub and separate its data in different elements of a PCollection<String>? If so, how?

Now, I have the below code:
PCollection<String> input_data =
pipeline
.apply(PubsubIO
.Read
.withCoder(StringUtf8Coder.of())
.named("ReadFromPubSub")
.subscription("/subscriptions/project_name/subscription_name"));
Looks like you want to read some messages from Pub/Sub, split each of them into multiple parts on space characters, and then feed the parts to the rest of your pipeline. No special configuration of PubsubIO is needed, because this is not a "reading data" problem - it's a "transforming data you have already read" problem. You simply need to insert a ParDo which takes your "composite" record and breaks it down the way you want, e.g.:
PCollection<String> input_data =
pipeline
.apply(PubsubIO
.Read
.withCoder(StringUtf8Coder.of())
.named("ReadFromPubSub")
.subscription("/subscriptions/project_name/subscription_name"))
.apply(ParDo.of(new DoFn<String, String>() {
public void processElement(ProcessContext c) {
String composite = c.element();
for (String part : composite.split(" ")) {
c.output(part);
}
}}));
I take it you mean that the data you want is present in different elements of the PCollection and want to extract and group it somehow.
A possible approach is to write a DoFn function that processes each String in the PCollection. You output a key value pair for each piece of data you want to group. You can then use the GroupByKey transform to group all the relevant data together.
For example you have the following messages from pubsub in your PCollection:
User 1234 bought item A
User 1234 bought item B
The DoFn function will output a key value pair with the user id as key and the item bought as value. ( <1234,A> , <1234, B> ).
Using the GroupByKey transform you group the two values together in one element. You can then perform further processing on that element.
This is a very common pattern in big data, called MapReduce.
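As a rough sketch of that approach - in the style of the (older) Dataflow SDK used in the snippets above, and assuming the message format "User <id> bought item <item>"; itemsPerUser is just an illustrative name:

PCollection<KV<String, Iterable<String>>> itemsPerUser =
    input_data
        .apply(ParDo.of(new DoFn<String, KV<String, String>>() {
            public void processElement(ProcessContext c) {
                // "User 1234 bought item A" -> key "1234", value "A"
                String[] parts = c.element().split(" ");
                c.output(KV.of(parts[1], parts[4]));
            }}))
        .apply(GroupByKey.<String, String>create());

Each element of itemsPerUser then holds all items seen for one user and can be processed further in another ParDo.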
You can also output an Iterable<A> and then use Flatten to squash it. Unsurprisingly this is called flatMap in many next-gen data processing platforms, cf. Spark and Flink.
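A sketch of that variant, under the same SDK assumptions as above (the parts name is illustrative, and it assumes Flatten.iterables() is available in your SDK version): the DoFn emits one Iterable per message and Flatten squashes it back into individual elements.

PCollection<String> parts =
    input_data
        .apply(ParDo.of(new DoFn<String, Iterable<String>>() {
            public void processElement(ProcessContext c) {
                // Emit all space-separated parts of the message as one Iterable.
                c.output(Arrays.asList(c.element().split(" ")));
            }}))
        .apply(Flatten.<String>iterables()); // depending on the SDK version you may need to set an IterableCoder explicitly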

Dart: DoubleLinkedQueue length after entry prepend

I'm manipulating entries inside a DoubleLinkedQueue via the DoubleLinkedQueueElement.append/prepend methods. This results in the new elements being inserted into the queue, but fails to update the length, and the toList() method results in an error being thrown.
I understand queues are only supposed to have elements added at the start/end, but it looks like the interface should allow for adding in the middle via the entries. I find it hard to believe such a common/well understood data structure would have a bug at this point - so am I using DoubleLinkedQueues incorrectly? Is there another data structure that I should be using? I'm looking to merge values from another iterable into my own sorted iterable - a SplayTreeSet might get me there in n log n time, but a simple merge should get me there in linear time...
Example of code that acts unexpectedly:
main() {
var q = new DoubleLinkedQueue<int>.from([1]);
q.firstEntry().prepend(0);
print('length: ${q.length}');
int i = 0;
for (var qi in q){
print('${i++}: $qi');
}
}
Output:
length: 1
0: 0
1: 1
It looks like the length getter just returns an internal counter. This is done because counting the elements every time might take very long for long lists.
The internal counter is only updated if you use the methods that operate directly on the queue, rather than the prepend method of an element. In your example you should use q.addFirst(0);, which results in the length being updated. The .prepend() method just inserts a new element and changes the pointers, which means the elements are traversed correctly, but the counter is still wrong.
Unfortunately it looks like you cannot insert elements in the middle of the list this way, nor can you make the list recount its elements. You should consider filing a bug at www.dartbug.com.
// Update:
toList() throws an error because there are more elements than length.

how to sort documents using the erlang map reduce in riak

I'm using Riak to store JSON documents right now, and I want to sort them based on some attribute. Let's say there's a key, i.e.
{
"someAttribute": "whatever",
"order": 1
}
So I want to sort the documents based on "order".
I am currently retrieving the documents in Riak with the Erlang interface. I can retrieve a document back as a string, but I don't really know what to do after that. I'm thinking the map function just emits the JSON document itself, and in the reduce function I'd check whether the item I'm looking at has a higher "order" than the head of the rest of the list, and if so append it to the beginning, and then return a lists:reverse.
Despite my ideas above I've had zero results after almost an entire day; I'm so confused by the Erlang interface in Riak. Can someone provide insight on how to write this map/reduce function, or just how to parse the JSON document?
As far as I know, you do not have access to the input list in Map. You emit a document from Map as a one-element list.
Inputs (all the docs to handle, as {Bucket, Key}) -> Map (handles a single doc) -> Reduce (the whole list emitted from Map).
Map is executed once per doc, on many nodes, whereas Reduce is done once, on the so-called coordinator node (the one where the query was called).
Solution:
Define Inputs (as a list or bucket)
Retrieve Value in Map and emit whole doc or {Id, Val_to_sort_by)
Sort in Reduce (using a regular lists:keysort)
This is not a map reduce solution but you should check out Riak Search.
So I "solved" the problem using JavaScript; I still can't do it using Erlang.
Here is my query:
{"inputs":"test",
"query":[{"map":{"language":"javascript",
"source":"function(value, keyData, arg){ var data = Riak.mapValuesJson(value)[0]; var obj = {}; obj[data.order] = data; return [ obj ];}"}},
{"reduce":{"language":"javascript",
"source":"function(values, arg){ return [ values.reduce(function(acc, item){ for(var order in item){ acc[order] = item[order]; } return acc; }) ];}",
"keep":true}}
]
}
So in the map phase, all I do is create a new object, obj, with the order as the key and the data itself as the value. Visually, obj looks like this:
{"1":{"firstName":"John","order":1}}
In the reduce phase, I'm just merging everything into the accumulator, so basically that is the sort if you think about it, because when you're done, everything has been put in order for you. I put two JSON documents in for testing: one is above, the other is just firstName: Billie, order: 2. Here is my result for the query above:
[{"1":{"firstName":"John","order":1},"2":{"firstName":"Billie","order":2}}]
So it works! But I still need to do this in Erlang - any insights?
