Java stream pipeline to create a map of lists

It's not real code, but an analogy to my exact scenario. My main concern here is the java.util.stream pipeline.
Let's assume I have the following object model:
@Data
class ItemRequest {
    String action;
    List<ItemId> itemIds;
}

@Data
class ItemId {
    ItemType itemType;
    long itemKey;
}

@Getter
@AllArgsConstructor
enum ItemType {
    ITEM_TYPE_1("backingService1", "type 1 description"),
    ITEM_TYPE_2("backingService1", "type 2 description"),
    ITEM_TYPE_3("backingService2", "type 3 description"),
    ITEM_TYPE_4("backingService2", "type 4 description"),
    ITEM_TYPE_5("backingService2", "type 5 description"),
    ITEM_TYPE_6("backingService3", "type 6 description"),
    ITEM_TYPE_7("backingService3", "type 7 description");
    // and so on

    private final String backingService;
    private final String description;
}
Each ItemType is served by a different backend microservice, and the ItemType enum has a getter that returns the backing service. So I want to break down my ItemRequest by backing service.
I can easily do this imperatively, or with two stream pipelines, but I want to do it in a single pipeline.
In simple terms, my question is: how can I combine the following two steps into one pipeline?
Map<String, ItemRequest> breakItemRequestAsPerBackingService(ItemRequest originalItemRequest) {
    Map<String, List<ItemId>> collect = originalItemRequest
            .getItemIds()
            .stream()
            .collect(Collectors.groupingBy(
                    e -> e.getItemType().getBackingService()));
    return collect
            .entrySet()
            .stream()
            .collect(toMap(
                    Map.Entry::getKey,
                    e -> new ItemRequest(
                            originalItemRequest.getAction(),
                            e.getValue())));
}

Your second operation
collect.entrySet().stream()
    .collect(toMap(
        Map.Entry::getKey,
        e -> new ItemRequest(originalItemRequest.getAction(), e.getValue())));
is keeping the result key of the previous operation and only applying a function to the values. You can apply a function to the result of a previous Collector using collectingAndThen. To use it with groupingBy for the map values, you have to realize that groupingBy(f) is a short-hand for groupingBy(f, toList()), so toList() is the collector to combine with collectingAndThen.
Map<String, ItemRequest> breakItemRequestAsPerBackingService(ItemRequest originalItemRequest) {
    return originalItemRequest.getItemIds().stream()
        .collect(Collectors.groupingBy(e -> e.getItemType().getBackingService(),
            Collectors.collectingAndThen(Collectors.toList(),
                list -> new ItemRequest(originalItemRequest.getAction(), list))
        ));
}
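For completeness, a minimal usage sketch. It assumes all-args constructors on ItemRequest and ItemId (as the snippets above already do); the "CREATE" action and the sample item keys are just illustrative values:

// Minimal usage sketch; the constructors, "CREATE" action and item keys are illustrative assumptions.
ItemRequest original = new ItemRequest("CREATE", Arrays.asList(
        new ItemId(ItemType.ITEM_TYPE_1, 1L),
        new ItemId(ItemType.ITEM_TYPE_3, 2L),
        new ItemId(ItemType.ITEM_TYPE_4, 3L)));

Map<String, ItemRequest> byService = breakItemRequestAsPerBackingService(original);
// Expected grouping:
//   "backingService1" -> ItemRequest("CREATE", [ITEM_TYPE_1])
//   "backingService2" -> ItemRequest("CREATE", [ITEM_TYPE_3, ITEM_TYPE_4])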

Related

How to chain an indefinite number of flatMap operators in Reactor?

I have some initial state in my application and a few policies that decorate this state with reactively fetched data (each policy's Mono returns a new instance of the state with additional data). Eventually I get the fully decorated state.
It basically looks like this:
public interface Policy {
    Mono<State> apply(State currentState);
}
Usage for fixed number of policies would look like that:
Flux.just(baseState)
    .flatMap(firstPolicy::apply)
    .flatMap(secondPolicy::apply)
    ...
    .subscribe();
It basically means that the input state for each Mono is the result of accumulating the initial state and the outputs of that Mono's predecessors.
For my case policies number is not fixed and it comes from another layer of the application as a collection of objects that implements Policy interface.
Is there any way to achieve a similar result as in the given code (with 2 flatMaps), but for an unknown number of policies? I have tried Flux's reduce method, but it only works if the policy returns a value, not a Mono.
This seems difficult because you're streaming your baseState, then trying to do an arbitrary number of flatMap() calls on that. There's nothing inherently wrong with using a loop to achieve this, but I like to avoid that unless absolutely necessary, as it breaks the natural reactive flow of the code.
If you instead iterate and reduce the policies into a single policy, then the flatMap() call becomes trivial:
Flux.fromIterable(policies)
    .reduce((p1, p2) -> s -> p1.apply(s).flatMap(p2::apply))
    .flatMap(p -> p.apply(baseState))
    .subscribe();
If you're able to edit your Policy interface, I'd strongly suggest adding a static combine() method to reference in your reduce() call to make that more readable:
interface Policy {
    Mono<State> apply(State currentState);

    public static Policy combine(Policy p1, Policy p2) {
        return s -> p1.apply(s).flatMap(p2::apply);
    }
}
The Flux then becomes much more descriptive and less verbose:
Flux.fromIterable(policies)
    .reduce(Policy::combine)
    .flatMap(p -> p.apply(baseState))
    .subscribe();
As a complete demonstration, swapping out your State for a String to keep it shorter:
interface Policy {
    Mono<String> apply(String currentState);

    public static Policy combine(Policy p1, Policy p2) {
        return s -> p1.apply(s).flatMap(p2::apply);
    }
}

public static void main(String[] args) {
    List<Policy> policies = new ArrayList<>();
    policies.add(x -> Mono.just("blah " + x));
    policies.add(x -> Mono.just("foo " + x));

    String baseState = "bar";
    Flux.fromIterable(policies)
        .reduce(Policy::combine)
        .flatMap(p -> p.apply(baseState))
        .subscribe(System.out::println); // Prints "foo blah bar"
}
If I understand the problem correctly, then the simplest solution is to use a regular for loop:
Flux<State> flux = Flux.just(baseState);
for (Policy policy : policies)
{
    flux = flux.flatMap(policy::apply);
}
flux.subscribe();
Also, note that if you have just a single baseState you can use Mono instead of Flux, as sketched below.
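A minimal sketch of that Mono variant, assuming the same Policy interface and policies collection as above:

// Minimal sketch of the Mono variant, assuming the same Policy interface and policies list.
Mono<State> mono = Mono.just(baseState);
for (Policy policy : policies)
{
    mono = mono.flatMap(policy::apply);
}
mono.subscribe();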
UPDATE:
If you are concerned about breaking the flow, you can extract the for loop into a method and apply it via the transform operator:
Flux.just(baseState)
    .transform(this::applyPolicies)
    .subscribe();

private Publisher<State> applyPolicies(Flux<State> originalFlux)
{
    Flux<State> newFlux = originalFlux;
    for (Policy policy : policies)
    {
        newFlux = newFlux.flatMap(policy::apply);
    }
    return newFlux;
}

How do I write to multiple files in Apache Beam?

Let me simplify my case. I'm using Apache Beam 0.6.0. My final processed result is PCollection<KV<String, String>>. And I want to write values to different files corresponding to their keys.
For example, let's say the result consists of
(key1, value1)
(key2, value2)
(key1, value3)
(key1, value4)
Then I want to write value1, value3 and value4 to key1.txt, and write value2 to key2.txt.
And in my case:
The key set is determined while the pipeline is running, not when the pipeline is constructed.
The key set may be quite small, but the number of values corresponding to each key may be very, very large.
Any ideas?
Handily, I wrote a sample of this case just the other day.
This example is Dataflow 1.x style.
Basically you group by each key, and then you can do this with a custom transform that connects to Cloud Storage. The caveat is that your list of lines per file shouldn't be massive (it has to fit into memory on a single instance, but considering that you can run high-mem instances, that limit is pretty high).
...
PCollection<KV<String, List<String>>> readyToWrite = groupedByFirstLetter
    .apply(Combine.perKey(AccumulatorOfWords.getCombineFn()));

readyToWrite.apply(
    new PTransformWriteToGCS("dataflow-experiment", TonyWordGrouper::derivePath));
...
And then the transform doing most of the work is:
public class PTransformWriteToGCS
        extends PTransform<PCollection<KV<String, List<String>>>, PCollection<Void>> {

    private static final Logger LOG = Logging.getLogger(PTransformWriteToGCS.class);
    private static final Storage STORAGE = StorageOptions.getDefaultInstance().getService();

    private final String bucketName;
    private final SerializableFunction<String, String> pathCreator;

    public PTransformWriteToGCS(final String bucketName,
            final SerializableFunction<String, String> pathCreator) {
        this.bucketName = bucketName;
        this.pathCreator = pathCreator;
    }

    @Override
    public PCollection<Void> apply(final PCollection<KV<String, List<String>>> input) {
        return input.apply(ParDo.of(new DoFn<KV<String, List<String>>, Void>() {

            @Override
            public void processElement(
                    final DoFn<KV<String, List<String>>, Void>.ProcessContext arg0)
                    throws Exception {
                final String key = arg0.element().getKey();
                final List<String> values = arg0.element().getValue();
                final String toWrite = values.stream().collect(Collectors.joining("\n"));
                final String path = pathCreator.apply(key);
                BlobInfo blobInfo = BlobInfo.newBuilder(bucketName, path)
                        .setContentType(MimeTypes.TEXT)
                        .build();
                LOG.info("blob writing to: {}", blobInfo);
                Blob result = STORAGE.create(blobInfo,
                        toWrite.getBytes(StandardCharsets.UTF_8));
            }
        }));
    }
}
Just write a loop in a ParDo function!
More details: I had the same scenario today; the only difference is that in my case key=image_label and value=image_tf_record. So, like what you have asked, I am trying to create separate TFRecord files, one per class, with each record file containing a number of images. HOWEVER, I'm not sure if there might be memory issues when the number of values per key is very high, as in your scenario:
(Also my code is in Python)
class WriteToSeparateTFRecordFiles(beam.DoFn):

    def __init__(self, outdir):
        self.outdir = outdir

    def process(self, element):
        l, image_list = element
        writer = tf.python_io.TFRecordWriter(self.outdir + "/tfr" + str(l) + '.tfrecord')
        for example in image_list:
            writer.write(example.SerializeToString())
        writer.close()
And then in your pipeline, just after the stage where you get the key-value pairs, add these two lines:
(p
 | 'GroupByLabelId' >> beam.GroupByKey()
 | 'SaveToMultipleFiles' >> beam.ParDo(WriteToSeparateTFRecordFiles(outdir))
)
You can use FileIO.writeDynamic() for that:
PCollection<KV<String, String>> readfile = (something you read..);

readfile.apply(FileIO.<String, KV<String, String>>writeDynamic()
    .by(KV::getKey)
    .withDestinationCoder(StringUtf8Coder.of())
    .via(Contextful.fn(KV::getValue), TextIO.sink())
    .to("somefolder")
    .withNaming(key -> FileIO.Write.defaultNaming(key, ".txt")));

p.run();
In the Apache Beam 2.2 Java SDK, this is natively supported in TextIO and AvroIO, using TextIO.write().to(DynamicDestinations) and AvroIO.write().to(DynamicDestinations) respectively. See e.g. this method.
Update (2018): prefer FileIO.writeDynamic() together with TextIO.sink() and AvroIO.sink() instead.
Just write the lines below in your ParDo class:
from apache_beam.io import filesystems

eventCSVFileWriter = filesystems.FileSystems.create(gcsFileName)
for record in list(Records):
    eventCSVFileWriter.write(record)
If you want the full code I can help you with that too.

Reactor 3 'interval buffer' Flux?

How can I use existing Flux operators to make a Flux emit its incoming values as multiple Lists, with a minimum delay between emissions?
This can be achieved with a non-trivial set of composed operators.
import java.time.Duration;
import java.util.*;

import reactor.core.publisher.*;

public class DelayedBuffer {

    public static void main(String[] args) {
        Flux.just(1, 2, 3, 6, 7, 10)
            .flatMap(v -> Mono.delayMillis(v * 1000)
                .doOnNext(w -> System.out.println("T=" + v))
                .map(w -> v)
            )
            .compose(f -> delayedBufferAfterFirst(f, Duration.ofSeconds(2)))
            .doOnNext(System.out::println)
            .blockLast();
    }

    public static <T> Flux<List<T>> delayedBufferAfterFirst(Flux<T> source, Duration d) {
        return source
            .publish(f -> {
                return f.take(1).collectList()
                        .concatWith(f.buffer(d).take(1))
                        .repeatWhen(r -> r.takeUntilOther(f.ignoreElements()));
            });
    }
}
(Note however, that the expected emission pattern may be better matched with a custom operator due to time being involved.)
I thought buffer(Duration) would fit your need, but it doesn't.
Edit: leaving this in case someone with your exact same need is tempted to use that operator. This variant of buffer splits the sequence into consecutive time windows (each producing a buffer). That is, the new delay starts at the end of the previous one, not whenever a new out-of-delay element is emitted (see the sketch below for contrast).
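For contrast, a minimal sketch of that plain time-based buffer applied to the same simulated source as the demo above; buffers are cut on a fixed two-second cadence rather than the delay restarting after each out-of-delay element:

// Minimal sketch: plain buffer(Duration) on the same simulated source, for contrast
// with delayedBufferAfterFirst above; buffers are cut on a fixed 2-second cadence.
Flux.just(1, 2, 3, 6, 7, 10)
    .flatMap(v -> Mono.delay(Duration.ofSeconds(v)).map(w -> v))
    .buffer(Duration.ofSeconds(2))
    .doOnNext(System.out::println)
    .blockLast();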

Google Dataflow Out Of Heap When Creating Multiple Tagged Outputs

I have many large unpartitioned BigQuery tables and files that I would like to partition in various ways, so I decided to try and write a Dataflow job to achieve this. The job, I think, is simple enough. I tried to write it with generics so that I could easily apply it to both TextIO and BigQueryIO sources. It works fine with small tables, but I keep getting java.lang.OutOfMemoryError: Java heap space when I run it on large tables.
In my main class I either read a file with target keys (made with another DF job) or run a query against a BigQuery table to get a list of keys to shard by. My main class looks like this:
Pipeline sharder = Pipeline.create(opts);

// a functional interface that shows the tag map how to get a tuple tag
KeySelector<String, TableRow> bqSelector = (TableRow row) ->
    (String) row.get("COLUMN") != null ? (String) row.get("COLUMN") : "null";

// a utility class to store a tuple tag list and hash map of String TupleTag
TupleTagMap<String, TableRow> bqTags = new TupleTagMap<>(new ArrayList<>(inputKeys), bqSelector);

// custom transform
ShardedTransform<String, TableRow> bqShard = new ShardedTransform<String, TableRow>(bqTags, TableRowJsonCoder.of());

String source = "PROJECTID:ADATASET.A_BIG_TABLE";
String destBase = "projectid:dataset.a_big_table_sharded_";

TableSchema schema = bq.tables().get("PROJECTID", "ADATASET", "A_BIG_TABLE").execute().getSchema();

PCollectionList<TableRow> shards = sharder.apply(BigQueryIO.Read.from(source)).apply(bqShard);
for (PCollection<TableRow> shard : shards.getAll()) {
    String shardName = StringUtils.isNotEmpty(shard.getName()) ? shard.getName() : "NULL";
    shard.apply(BigQueryIO.Write.to(destBase + shardName)
        .withWriteDisposition(WriteDisposition.WRITE_TRUNCATE)
        .withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED)
        .withSchema(schema));
    System.out.println(destBase + shardName);
}
sharder.run();
I generate a set of TupleTags to use in a custom transform. I created a utility class that stores a TupleTagList and HashMap so that I can reference the tuple tags by key:
public class TupleTagMap<Key, Type> implements Serializable {

    private static final long serialVersionUID = -8762959703864266959L;
    final private TupleTagList tList;
    final private Map<Key, TupleTag<Type>> map;
    final private KeySelector<Key, Type> selector;

    public TupleTagMap(List<Key> t, KeySelector<Key, Type> selector) {
        map = new HashMap<>();
        for (Key key : t)
            map.put(key, new TupleTag<Type>());
        this.tList = TupleTagList.of(new ArrayList<>(map.values()));
        this.selector = selector;
    }

    public Map<Key, TupleTag<Type>> getMap() {
        return map;
    }

    public TupleTagList getTagList() {
        return tList;
    }

    public TupleTag<Type> getTag(Type t) {
        return map.get(selector.getKey(t));
    }
}
Then I have this custom transform, which basically uses the tuple map to output a PCollectionTuple and then moves it into a PCollectionList to return to the main class:
public class ShardedTransform<Key, Type>
        extends PTransform<PCollection<Type>, PCollectionList<Type>> {

    private static final long serialVersionUID = 3320626732803297323L;
    private final TupleTagMap<Key, Type> tags;
    private final Coder<Type> coder;

    public ShardedTransform(TupleTagMap<Key, Type> tags, Coder<Type> coder) {
        this.tags = tags;
        this.coder = coder;
    }

    @Override
    public PCollectionList<Type> apply(PCollection<Type> in) {
        PCollectionTuple shards = in.apply(ParDo.of(
            new ShardFn<Key, Type>(tags)).withOutputTags(
                new TupleTag<Type>(), tags.getTagList()));
        List<PCollection<Type>> shardList = new ArrayList<>(tags.getMap().size());
        for (Entry<Key, TupleTag<Type>> e : tags.getMap().entrySet()) {
            PCollection<Type> shard = shards.get(e.getValue()).setName(e.getKey().toString()).setCoder(coder);
            shardList.add(shard);
        }
        return PCollectionList.of(shardList);
    }
}
The actual DoFn is dead simple; it just uses the lambda provided in the main class to find the matching tuple tag in the hash map for side output:
public class ShardFn<Key, Type> extends DoFn<Type, Type> {

    private static final long serialVersionUID = 961325260858465105L;
    private final TupleTagMap<Key, Type> tags;

    ShardFn(TupleTagMap<Key, Type> tags) {
        this.tags = tags;
    }

    @Override
    public void processElement(DoFn<Type, Type>.ProcessContext c) throws Exception {
        Type element = c.element();
        TupleTag<Type> tag = tags.getTag(element);
        if (tag != null)
            c.sideOutput(tags.getTag(element), element);
    }
}
The Beam model doesn't have good support for dynamic partitioning / large numbers of partitions right now. Your approach chooses the number of shards at graph construction time, and then the resulting ParDos likely all fuse together, so you've got each worker trying to write to 80 different BQ tables at the same time. Each write requires some local buffering, so it's probably just too much.
There's an alternate approach which will do the parallelization across tables (but not across elements). This would work well if you have a large number of relatively small output tables. Use a ParDo to tag each element with the table it should go to, and then do a GroupByKey. This gives you a PCollection<KV<Table, Iterable<ElementsForThatTable>>>. Then process each KV<Table, Iterable<ElementsForThatTable>> by writing the elements to the table (a rough sketch follows).
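A rough sketch of that tag-then-group approach, in Dataflow 1.x style to match the code above; the table-name keying mirrors the question's destBase/"COLUMN" usage, and writeTableRows() is a hypothetical helper standing in for the hand-rolled BigQuery write:

// Rough sketch only: key each row by its destination table, group, then write each group
// by hand. destBase and "COLUMN" mirror the question's code; writeTableRows() is a
// hypothetical helper for the manual BigQuery insert, not a real SDK method.
PCollection<TableRow> rows = sharder.apply(BigQueryIO.Read.from(source));

PCollection<KV<String, Iterable<TableRow>>> grouped = rows
    .apply(ParDo.of(new DoFn<TableRow, KV<String, TableRow>>() {
        @Override
        public void processElement(ProcessContext c) throws Exception {
            TableRow row = c.element();
            String key = row.get("COLUMN") != null ? (String) row.get("COLUMN") : "NULL";
            c.output(KV.of(destBase + key, row));
        }
    }))
    .apply(GroupByKey.<String, TableRow>create());

grouped.apply(ParDo.of(new DoFn<KV<String, Iterable<TableRow>>, Void>() {
    @Override
    public void processElement(ProcessContext c) throws Exception {
        writeTableRows(c.element().getKey(), c.element().getValue()); // manual BQ write, not shown
    }
}));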
Unfortunately, for now you'll have to do the BQ write by hand to use this option. We're looking at extending the Sink APIs with built-in support for this. And since the Dataflow SDK is being further developed as part of Apache Beam, we're tracking that request here: https://issues.apache.org/jira/browse/BEAM-92

Test that either one thing holds or another in AssertJ

I am in the process of converting some tests from Hamcrest to AssertJ. In Hamcrest I use the following snippet:
assertThat(list, either(contains(Tags.SWEETS, Tags.HIGH))
.or(contains(Tags.SOUPS, Tags.RED)));
That is, the list may contain either the one or the other. How can I express this in AssertJ? The anyOf function (of course, any is something other than either, but that would be a second question) takes a Condition; I have implemented that myself, but it feels as if this should be a common case.
Edited: since 3.12.0, AssertJ provides satisfiesAnyOf, which succeeds if one of the given assertions succeeds:
assertThat(list).satisfiesAnyOf(
    listParam -> assertThat(listParam).contains(Tags.SWEETS, Tags.HIGH),
    listParam -> assertThat(listParam).contains(Tags.SOUPS, Tags.RED)
);
Original answer:
No, this is an area where Hamcrest is better than AssertJ.
To write the following assertion:
Set<String> goodTags = newLinkedHashSet("Fine", "Good");
Set<String> badTags = newLinkedHashSet("Bad!", "Awful");
Set<String> tags = newLinkedHashSet("Fine", "Good", "Ok", "?");
// contains is statically imported from ContainsCondition
// anyOf succeeds if one of the conditions is met (logical 'or')
assertThat(tags).has(anyOf(contains(goodTags), contains(badTags)));
you need to create this Condition:
import static org.assertj.core.util.Lists.newArrayList;

import java.util.Collection;

import org.assertj.core.api.Condition;

public class ContainsCondition extends Condition<Iterable<String>> {

    private Collection<String> collection;

    public ContainsCondition(Iterable<String> values) {
        super("contains " + values);
        this.collection = newArrayList(values);
    }

    static ContainsCondition contains(Collection<String> set) {
        return new ContainsCondition(set);
    }

    @Override
    public boolean matches(Iterable<String> actual) {
        Collection<String> values = newArrayList(actual);
        for (String string : collection) {
            if (!values.contains(string)) return false;
        }
        return true;
    }
}
It might not be what you want if you expect that the presence of your tags in one collection implies they are not in the other one.
Inspired by this thread, you might want to use this little repo I put together, which adapts the Hamcrest Matcher API to AssertJ's Condition API. It also includes a handy-dandy conversion shell script. The core idea looks roughly like the sketch below.
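A minimal sketch of that adapter idea, assuming only the standard Hamcrest and AssertJ APIs (the class name is illustrative; recent AssertJ versions ship a similar org.assertj.core.api.HamcrestCondition out of the box):

// Minimal sketch of wrapping a Hamcrest Matcher as an AssertJ Condition
// (illustrative only; newer AssertJ versions provide a HamcrestCondition already).
import org.assertj.core.api.Condition;
import org.hamcrest.Matcher;
import org.hamcrest.StringDescription;

public class MatcherCondition<T> extends Condition<T> {

    private final Matcher<? super T> matcher;

    public MatcherCondition(Matcher<? super T> matcher) {
        super(StringDescription.toString(matcher));
        this.matcher = matcher;
    }

    @Override
    public boolean matches(T value) {
        return matcher.matches(value);
    }
}

With that in place, the either/or check could be phrased as something like assertThat(list).is(anyOf(new MatcherCondition<>(contains(Tags.SWEETS, Tags.HIGH)), new MatcherCondition<>(contains(Tags.SOUPS, Tags.RED)))).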
