How to create custom Combine.PerKey in beam sdk 2.0

After a lot of guesswork and Beam SDK 2.0 code reading, we figured out how to create a custom combine function in Beam SDK 2.0, since the Dataflow SDK 1.x syntax did not work there.
However, we can't figure out how to create a custom combine PER KEY function in Beam SDK 2.0. Any help or pointers (or better yet an actual example) would be greatly appreciated. (We scoured the internet for documentation or examples and found none; we also attempted to look at the code within Beam SDK 2.0's Combine class, but couldn't figure it out, especially since the PerKey class now has a private constructor, so we can't extend it any longer.)
In case it helps, here's how we correctly created a custom combiner (without keys) in Beam SDK 2.0, but we can't figure out how to create one with a key:
public class CombineTemplateIntervalsIntoBlocks
    extends Combine.AccumulatingCombineFn<ImmutableMySetOfIntervals, TemplateIntervalAccum, ArrayList<ImmutableMySetOfIntervals>> {

  public CombineTemplateIntervalsIntoBlocks() {
  }

  @Override
  public TemplateIntervalAccum createAccumulator() {
    return new TemplateIntervalAccum();
  }
and then
public class TemplateIntervalAccum
    implements Combine.AccumulatingCombineFn.Accumulator<ImmutableMySetOfIntervals, TemplateIntervalAccum, ArrayList<ImmutableMySetOfIntervals>>, Serializable {
...

You don't need to create your CombineFn differently to use a Combine.PerKey.
You can extend either AccumulatingCombineFn (which puts the merging logic in the accumulator) or CombineFn (which puts the merging logic in the CombineFn itself). There are also other options such as BinaryCombineFn and IterableCombineFn.
Say that you have a CombineFn<InputT, AccumT, OutputT> called combineFn:
You can use Combine.globally(combineFn) to create a PTransform that takes a PCollection<InputT> and combines all the elements.
Or, you can use Combine.perKey(combineFn) to create a PTransform that takes a PCollection<KV<K, InputT>> and combines all the values associated with each key. This corresponds to the Combine.PerKey I believe you are referring to.
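For instance, here is a minimal sketch reusing your own combiner from the question (the keyed input collection and its key type are hypothetical):

import java.util.ArrayList;
import org.apache.beam.sdk.transforms.Combine;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

// keyedIntervals is a hypothetical keyed input:
PCollection<KV<String, ImmutableMySetOfIntervals>> keyedIntervals = ...;

// The same CombineFn as in the question, applied per key:
PCollection<KV<String, ArrayList<ImmutableMySetOfIntervals>>> blocksPerKey =
    keyedIntervals.apply(Combine.perKey(new CombineTemplateIntervalsIntoBlocks()));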

Related

What is the use of requireBinding?

I am relatively new to Guice and trying to understand the usage of requireBinding and when/why to use it.
As per my understanding, while creating an injector Guice goes through the configure() methods of all the modules and builds a dependency graph.
If Guice builds the dependency graph by itself, then why does a module need to add a requireBinding? As far as I understand, requireBinding adds an explicit dependency on a class, which Guice's dependency graph seems to be doing anyway.
I would like to understand when we should use requireBinding and what the impact of not using it in a module is.
I have read Guice's official documentation and searched all the existing questions on Stack Overflow and other blogs, but couldn't find a satisfying answer to the above question.
Adding to the original question.
Looking at the source code of AbstractModule, the implementation looks like:
protected void requireBinding(Key<?> key) {
  this.binder().getProvider(key);
}

protected void requireBinding(Class<?> type) {
  this.binder().getProvider(type);
}
You would assume this has no side effects, as it's a "get" call. But looking at the binder itself, it adds an element of type ProviderLookup to its list of elements:
public <T> Provider<T> getProvider(Dependency<T> dependency) {
  ProviderLookup<T> element = new ProviderLookup(this.getElementSource(), dependency);
  this.elements.add(element);
  return element.getProvider();
}
I've always thought of requireBinding() as a contract for the Module.
You are correct that the graph would eventually fail when you call get() on the provider or try to inject an object that depends on the binding. However, I believe requireBinding will cause a failure when the Injector is created, rather than when the object is created (via the injector). When I was at Google, it functioned more as a contract, less as something with consequential behavior.
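For illustration, here's a minimal sketch of that contract (FooService, Client, and ClientModule are hypothetical names):

import com.google.inject.AbstractModule;
import com.google.inject.Guice;
import com.google.inject.Injector;

public class ClientModule extends AbstractModule {
  @Override
  protected void configure() {
    // Contract: some other installed module must bind FooService.
    requireBinding(FooService.class);
    bind(Client.class);
  }
}

// If nothing binds FooService, this fails with a CreationException when
// the injector is created, not later when a Client is first injected:
Injector injector = Guice.createInjector(new ClientModule());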

Dynamic work rebalancing with custom unbounded source and reader for google cloud dataflow

I have the following implementation of a custom reader and a custom source:
public class CustomPubsubReader extends UnboundedReader {....}
public class CustomPubsubSource extends UnboundedSource {....}
Going through the documentation, it appears that dynamic work rebalancing is applicable only for bounded sources.
In my case, I see that only one worker node is created to read messages from the custom source, even though the message queue is receiving thousands of elements per second.
If I used PubsubIO.Read(), for example, it would create more than one worker in this case in streaming mode.
Is there any way to scale out when using a custom source with Cloud Dataflow?
Thanks!
The UnboundedSource may implement generateInitialSplits (Dataflow SDK 1.x) or split (Beam SDK 2.0) to produce multiple readers for a given source.
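For example, a rough sketch of what the 2.0 override could look like (MyMessage, MyCheckpointMark, and the sharding scheme are assumptions for illustration):

import java.util.ArrayList;
import java.util.List;
import org.apache.beam.sdk.io.UnboundedSource;
import org.apache.beam.sdk.options.PipelineOptions;

public class CustomPubsubSource extends UnboundedSource<MyMessage, MyCheckpointMark> {
  @Override
  public List<CustomPubsubSource> split(int desiredNumSplits, PipelineOptions options)
      throws Exception {
    // Each returned source gets its own reader, which is what allows the
    // service to spread the reading across multiple workers.
    List<CustomPubsubSource> splits = new ArrayList<>(desiredNumSplits);
    for (int i = 0; i < desiredNumSplits; i++) {
      splits.add(new CustomPubsubSource(/* shard i of the subscription */));
    }
    return splits;
  }

  // createReader(...), getCheckpointMarkCoder(), etc. omitted here.
}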
See the Javadoc for more details.

Get classes that have fields annotated with Redstone's @Field()

I have some Dart classes in my project where I annotate some fields with Redstone Mapper's @Field() annotation.
How can I get all these classes at runtime?
I've seen the private Map _cache in redstone_mapper_factory... but it's private.
I'm aware that I can use the reflection package to scan these classes myself; however, all of them are already being detected and stored by the Redstone mapper, so I'd like to leverage that.
You can use dart:mirrors to do that.
But I don't think it's possible to get that from Redstone itself; you should probably ask on GitHub, or even make the change yourself and open a pull request. It should not be difficult, as it is just a getter on _cache.
https://github.com/redstone-dart/redstone_mapper

What does @ do in Dart programs?

I've just spent 20 hours learning the basics of the Dart language, but when I find the @ prefix in an open-source Dart program such as here, which most programs seem to use, I wonder what the @ directive does in those programs...
For your information, the official documentation says the following:
Metadata
Use metadata to give additional information about your code. A metadata annotation begins with the character @, followed by either a reference to a compile-time constant (such as deprecated) or a call to a constant constructor.
Three annotations are available to all Dart code: @deprecated, @override, and @proxy. For examples of using @override and @proxy, see the section called "Extending a Class". Here's an example of using the @deprecated annotation:
However, what "additional information" does the # directive add to the code? If you create an instance by writing the following constructor
#todo('seth', 'make this do something')
, instead of the following constructor, which is the default:
todo('seth", 'make this do something')
, what is the benefit I can get from the first constructor?
I've got that using the built-in metadata such as #deprecated and #override can give me an advantage of being warned in running the app, but what can I get from the case on the custom #todo, or the aforementioned linked sample code over Github?
Annotations can be accessed through the dart:mirrors library. You can use custom annotations whenever you want to provide additional information about a class, method, etc. For instance, @MirrorsUsed is used to provide the dart2js compiler with extra information to optimize the size of the generated JavaScript.
Annotations are generally more useful to framework or library authors than application authors. For instance, if you were creating a REST server framework in Dart, you could use annotations to turn methods into web resources. For example, it might look something like the following (assuming you have created the @GET annotation):
@GET('/users/')
List<User> getUsers() {
// ...
}
You could then have your framework scan your code at server startup using mirrors to find all methods with the @GET annotation and bind the method to the URL specified in the annotation.
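The mechanics are not Dart-specific; for comparison, here is the same startup scan written with Java reflection (the GET annotation, UserResource, and the route printing are made up for this sketch):

import java.lang.annotation.*;
import java.lang.reflect.Method;

@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface GET {
  String value();
}

class UserResource {
  @GET("/users/")
  java.util.List<String> getUsers() { return java.util.List.of(); }
}

class RouteScanner {
  // Find every method carrying @GET and bind its path to the method.
  static void scan(Class<?> resource) {
    for (Method m : resource.getDeclaredMethods()) {
      GET get = m.getAnnotation(GET.class);
      if (get != null) {
        System.out.println("bind " + get.value() + " -> " + m.getName());
      }
    }
  }
}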
You can do some 'reasoning' about code.
You can query for fields/methods/classes/libraries/... that have a specific annotation.
You acquire those code parts using reflection. In Dart reflection is done by the 'dart:mirrors' package.
You can find a code example here: How to retrieve metadata in Dartlang?
An example where annotations are regularly used is serialization or database persistence, where you add metadata to the class that the serialization/persistence framework can use as configuration settings to know how to process a field or method.
For example, you add an @Entity() annotation to indicate that a class should be persisted.
On each field that should be persisted you add another annotation like @Column().
Many persistence frameworks generate the database table automatically from this metadata.
For this they need more information, so you add an @Id() on the field that should be used as the primary key, and @Column(name: 'first_name', type: 'varchar', length: 32) to define parameters for the database table and columns.
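A compact sketch of that kind of metadata in Java (the annotations are defined inline here to keep the example self-contained; real frameworks such as JPA ship similar ones):

import java.lang.annotation.*;

@Retention(RetentionPolicy.RUNTIME) @Target(ElementType.TYPE)
@interface Entity {}

@Retention(RetentionPolicy.RUNTIME) @Target(ElementType.FIELD)
@interface Id {}

@Retention(RetentionPolicy.RUNTIME) @Target(ElementType.FIELD)
@interface Column {
  String name();
  String type() default "varchar";
  int length() default 255;
}

// The persistence framework reads this metadata to generate the table.
@Entity
class Person {
  @Id long id;
  @Column(name = "first_name", type = "varchar", length = 32) String firstName;
}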
This is just an example. The limit is your imagination.

How does an interpreter use a DSL?

I'm using an interpreter for my domain-specific language rather than a compiler (despite the performance cost). I'm struggling to understand some of the concepts, though:
Suppose I have a DSL (in XML style) for a game so that developers can make building objects easily:
<building>
<name> hotel </name>
<capacity> 10 </capacity>
</building>
The DSL script is parsed, then what happens?
Does it execute an existing method for creating a new building? As I understand it, it does not simply transform the DSL into a lower-level language (as this would then need to be compiled).
Could someone please describe what an interpreter would do with the resulting parsed tree?
Thank you for your help.
Much depends on your specific application details. For example, are name and capacity required? I'm going to give a fairly generic answer, which might be a bit overkill.
Assumptions:
All nested properties are optional
There are many nested properties, possibly of varying depth
This invites two ideas: structuring your interpreter as a recursive descent parser and using some sort of builder for your objects. In your specific example, you'd have a BuildingBuilder that looks something like this (in Java):
public class BuildingBuilder {
  public BuildingBuilder() { ... }
  public BuildingBuilder setName(String name) { ... return this; }
  public BuildingBuilder setCapacity(int capacity) { ... return this; }
  ...
  public Building build() { ... }
}
Now, when your parser encounters a building element, use the BuildingBuilder to build a building. Then add that object to whatever context the DSL applies to (city.addBuilding(building)).
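For instance, a sketch of that step using Java's built-in DOM parser (City and the game.xml file name are assumptions for the example; Building and BuildingBuilder are from above):

import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.*;

void loadBuildings(City city) throws Exception {
  Document doc = DocumentBuilderFactory.newInstance()
      .newDocumentBuilder().parse(new File("game.xml"));
  NodeList buildings = doc.getElementsByTagName("building");
  for (int i = 0; i < buildings.getLength(); i++) {
    Element e = (Element) buildings.item(i);
    Building building = new BuildingBuilder()
        .setName(e.getElementsByTagName("name").item(0).getTextContent().trim())
        .setCapacity(Integer.parseInt(
            e.getElementsByTagName("capacity").item(0).getTextContent().trim()))
        .build();
    // Apply the parsed object to whatever context the DSL targets:
    city.addBuilding(building);
  }
}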
Note that if name and capacity are exhaustive and always required, you can just create a building by passing the two parameters directly. You can also construct the building and set the properties as they are encountered instead of using the builder (the Builder pattern is nice when you have many optional properties and want the resulting object to be immutable).
If this is in a non-object-oriented context, you'll end up implementing some sort of buildBuilding function that takes the current context and the inner XML of the building element. Effectively, you are building a recursive descent parser by hand, with an XML library providing the actual parsing of individual elements.
However you implement it, you will probably appreciate having a direct semantic mapping between XML elements in your DSL and methods/objects in your interpreter.
