Access side input in a non - anonymous DoFn - google-cloud-dataflow

How to access the elements of a side input if I have my class extend DoFn?
For example:
Say I have a ParDo transform like:
PCollection<String> data = myData.apply("Get data",
ParDo.of(new MyClass()).withSideInputs(myDataView));
And I have a class:-
static class MyClass extends DoFn<String,String>
{
//How to access side input here
}
c.sideInput() isn't working in this case.
Thanks.

In this case, the problem is that the processElement method in your DoFn does not have access to the PCollectionView instance in your main method.
You can pass the PCollectionView to the DoFn in the constructor:
class MyClass extends DoFn<String,String>
{
private final PCollectionView<..> mySideInput;
public MyClass(PCollectionView<..> mySideInput) {
// List, or Map or anything:
this.mySideInput = mySideInput;
}
#ProcessElement
public void processElement(ProcessContext c) throws IOException
{
// List or Map or any type you need:
List<..> sideInputList = c.sideInput(mySideInput);
}
}
You would then pass the side input to the class when you instantiate it, and indicate it as a side input like so:
p.apply(ParDo.of(new MyClass(mySideInput)).withSideInputs(mySideInput));
The explanation for this is that when you use an anonymous DoFn, the process method has a closure with access to all the objects within the scope that encloses the DoFn (among them is the PCollectionView). When you're not using an anonymous DoFn, there is no closure, and you need another way of passing the PCollectionView.

So although the answer above is correct, it is still a little incomplete.
So once you finish implementing the above answer, you need to execute your pipeline like this:
p.apply(ParDo.of(new MyClass(mySideInput)).withSideInputs(mySideInput));

Related

Side inputs vs normal constructor parameters in Apache Beam

I have a general question on side inputs and broadcasting in the context of Apache Beam. Does any additional variables, lists, maps that are need for computation during processElement, need to be passed as side input? Is it ok if they are passed as normal constructor arguments for the DoFn ? For example, what if I have some fixed (not computed) values variables (constants, like start date, end date) that I want to make use of during the per element computation of processElement. Now, I can make singleton PCollectionViews out of each of those variables separately and pass them to the DoFn constructor as side input. However, instead of doing that, can I not just pass each of those constants as normal constructor arguments to the DoFn? Am I missing anything subtle here?
In terms of code, when should I do:
public static class MyFilter extends DoFn<KV<String, Iterable<MyData>> {
// these are singleton views
private final PCollectionView<LocalDateTime> dateStartView;
private final PCollectionView<LocalDateTime> dateEndView;
public MyFilter(PCollectionView<LocalDateTime> dateStartView,
PCollectionView<LocalDateTime> dateEndView){
this.dateStartView = dateStartView;
this.dateEndView = dateEndView;
}
#ProcessElement
public void processElement(ProcessContext c) throws Exception{
// extract date values from the singleton views here and use them
As opposed to :
public static class MyFilter extends DoFn<KV<String, Iterable<MyData>> {
private final LocalDateTime dateStart;
private final LocalDateTime dateEnd;
public MyFilter(LocalDateTime dateStart,
LocalDateTime dateEnd){
this.dateStart = dateStart;
this.dateEnd = dateEnd;
}
#ProcessElement
public void processElement(ProcessContext c) throws Exception{
// use the passed in date values directly here
Notice that in these examples, startDate and endDate are fixed values and not the dynamic results of any previous computation of the pipeline.
When you call something like pipeline.apply(ParDo.of(new MyFilter(...)) the DoFn gets instantiated in the main program that you use to start the pipeline. It then gets serialized and passed to the runner for execution. Runner then decides where to execute it, e.g. on a fleet of a 100 VMs each of which will receive its own copy of the code and serialized data. If the member variables are serializable and you don't mutate them at execution time, it should be fine (link, link), the DoFn will get deserialized on each node with all the fields populated, and will get executed as expected. However you don't control the number of instances or basically their lifecycle (to some extent), so mutate them at your own risk.
The benefit of PCollections and side inputs is that you are not limited to static values, so for couple of simple unmutable values you should be fine .

Any way to use Pipeline and DataflowPipelineOptions instances inside ParDo?

As it is apparent that Pipeline and DataflowPipelineOptions instances can be used inside ParDo when written in static block only.Since, processElement() is not static. Therefore, ParDo can't accommodate any logic inside this function which deals with objects of Pipeline and DataflowPipelineOptions.
Sample :
public void processElement(ProcessContext c)
{
PCollection<String> dummy=p.apply("dummy"); // It will fail with a SerializableException (What I want to do)
}
So, just wanted to check whether the same can be made possible by calling a static function from ParDo itself.
public void processElement(ProcessContext c)
{
logicCaller(); // It's a static function
}
The logicCaller() static method can contain anything from creation of a simple PCollection to any ParDo block execution having some logical steps inside.
I tried this way around and the function got called successfully as well. The weird behaviour was that everything got executed inside that static function but the ParDo part got skipped without any error or exception.
Any insights?

How to create read transform using ParDo and DoFn in Apache Beam

According to the Apache Beam documentation the recommended way
to write simple sources is by using Read Transforms and ParDo. Unfortunately the Apache Beam docs has let me down here.
I'm trying to write a simple unbounded data source which emits events using a ParDo but the compiler keeps complaining about the input type of the DoFn object:
message: 'The method apply(PTransform<? super PBegin,OutputT>) in the type PBegin is not applicable for the arguments (ParDo.SingleOutput<PBegin,Event>)'
My attempt:
public class TestIO extends PTransform<PBegin, PCollection<Event>> {
#Override
public PCollection<Event> expand(PBegin input) {
return input.apply(ParDo.of(new ReadFn()));
}
private static class ReadFn extends DoFn<PBegin, Event> {
#ProcessElement
public void process(#TimerId("poll") Timer pollTimer) {
Event testEvent = new Event(...);
//custom logic, this can happen infinitely
for(...) {
context.output(testEvent);
}
}
}
}
A DoFn performs element-wise processing. As written, ParDo.of(new ReadFn()) will have type PTransform<PCollection<PBegin>, PCollection<Event>>. Specifically, the ReadFn indicates it takes an element of type PBegin and returns 0 or more elements of type Event.
Instead, you should use an actual Read operation. There are a variety provided. You can also use Create if you have a specific set of in-memory collections to use.
If you need to create a custom source you should use the Read transform. Since you're using timers, you likely want to create an Unbounded Source (a stream of elements).

What is the purpose of the getter methods in Components in Dagger 2?

I am trying to understand Components in Dagger 2. Here is an example:
#Component(modules = { MyModule.class })
public interface MyComponent {
void inject(InjectionSite injectionSite);
Foo foo();
Bar bar();
}
I understand what the void inject() methods do. But I don't understand what the other Foo foo() getter methods do. What is the purpose of these other methods?
Usage in dependent components
In the context of a hierarchy of dependent components, such as in this example, provision methods such as Foo foo() are for exposing bindings to a dependent component. "Expose" means "make available" or even "publish". Note that the name of the method itself is actually irrelevant. Some programmers choose to name these methods Foo exposeFoo() to make the method name reflect its purpose.
Explanation:
When you write a component in Dagger 2, you group together modules containing #Provides methods. These #Provides methods can be thought of as "bindings" in that they associate an abstraction (e.g., a type) with a concrete way of resolving that type. With that in mind, the Foo foo() methods make the Component able to expose its binding for Foo to dependent components.
Example:
Let's say Foo is an application Singleton and we want to use it as a dependency for instances of DependsOnFoo but inside a component with narrower scope. If we write a naive #Provides method inside one of the modules of MyDependentComponent then we will get a new instance. Instead, we can write this:
#PerFragment
#Component(dependencies = {MyComponent.class }
modules = { MyDependentModule.class })
public class MyDependentComponent {
void inject(MyFragment frag);
}
And the module:
#Module
public class MyDepedentModule {
#Provides
#PerFragment
DependsOnFoo dependsOnFoo(Foo foo) {
return new DependsOnFoo(foo);
}
}
Assume also that the injection site for DependentComponent contains DependsOnFoo:
public class MyFragment extends Fragment {
#Inject DependsOnFoo dependsOnFoo
}
Note that MyDependentComponent only knows about the module MyDependentModule. Through that module, it knows it can provide DependsOnFoo using an instance of Foo, but it doesn't know how to provide Foo by itself. This happens despite MyDependentComponent being a dependent component of MyComponent. The Foo foo() method in MyComponent allows the dependent component MyDependentComponent to use MyComponent's binding for Foo to inject DependsOnFoo. Without this Foo foo() method, the compilation will fail.
Usage to resolve a binding
Let's say we would like to obtain instances of Foo without having to call inject(this). The Foo foo() method inside the component will allow this much the same way you can call getInstance() with Guice's Injector or Castle Windsor's Resolve. The illustration is as below:
public void fooConsumer() {
DaggerMyComponent component = DaggerMyComponent.builder.build();
Foo foo = component.foo();
}
Dagger is a way of wiring up graphs of objects and their dependencies. As an alternative to calling constructors directly, you obtain instances by requesting them from Dagger, or by supplying an object that you'd like to have injected with Dagger-created instances.
Let's make a coffee shop, that depends on a Provider<Coffee> and a CashRegister. Assume that you have those wired up within a module (maybe to LightRoastCoffee and DefaultCashRegister implementations).
public class CoffeeShop {
private final Provider<Coffee> coffeeProvider;
private final CashRegister register;
#Inject
public CoffeeShop(Provider<Coffee> coffeeProvider, CashRegister register) {
this.coffeeProvider = coffeeProvider;
this.register = register;
}
public void serve(Person person) {
cashRegister.takeMoneyFrom(person);
person.accept(coffeeProvider.get());
}
}
Now you need to get an instance of that CoffeeShop, but it only has a two-parameter constructor with its dependencies. So how do you do that? Simple: You tell Dagger to make a factory method available on the Component instance it generates.
#Component(modules = {/* ... */})
public interface CoffeeShopComponent {
CoffeeShop getCoffeeShop();
void inject(CoffeeService serviceToInject); // to be discussed below
}
When you call getCoffeeShop, Dagger creates the Provider<Coffee> to supply LightRoastCoffee, creates the DefaultCashRegister, supplies them to the Coffeeshop constructor, and returns you the result. Congratulations, you are the proud owner of a fully-wired-up coffeeshop.
Now, all of this is an alternative to void injection methods, which take an already-created instance and inject into it:
public class CoffeeService extends SomeFrameworkService {
#Inject CoffeeShop coffeeShop;
#Override public void initialize() {
// Before injection, your coffeeShop field is null.
DaggerCoffeeShopComponent.create().inject(this);
// Dagger inspects CoffeeService at compile time, so at runtime it can reach
// in and set the fields.
}
#Override public void alternativeInitialize() {
// The above is equivalent to this, though:
coffeeShop = DaggerCoffeeShopComponent.create().getCoffeeShop();
}
}
So, there you have it: Two different styles, both of which give you access to fully-injected graphs of objects without listing or caring about exactly which dependencies they need. You can prefer one or the other, or prefer factory methods for the top-level and members injection for Android or Service use-cases, or any other sort of mix and match.
(Note: Beyond their use as entry points into your object graph, no-arg getters known as provision methods are also useful for exposing bindings for component dependencies, as David Rawson describes in the other answer.)

Adding Dagger to an existing project

I'm trying to add Dagger to an existing web application and am running into a design problem.
Currently our Handlers are created in a dispatcher with something like
registerHandler('/login', new LoginHandler(), HttpMethod.POST)
Inside the login handler we might call a function like
Services.loginService.login('username', 'password');
I want to be able to inject the loginService into the handler, but am having trouble figuring out the best approach. There is a really long list of handlers in the dispatcher, and injecting them all as instance variables seems like a large addition of code.
Is there a solution to this type of problem?
Based on your comment about having different services to inject. I would propose next solution.
ServicesProvider:
#Module(injects = {LoginHandler.class, LogoutHandler.class})
public class ServicesProvider {
#Provides #Singleton public LoginService getLoginService() {
return new LoginService();
}
}
LoginHandler.java:
public class LoginHandler extends Handler {
#Inject LoginService loginService;
}
HttpNetwork.java
public class HttpNetwork extends Network {
private ObjectGraph objectGraph = ObjectGraph.create(new ServicesProvider());
public registerHandler(String path, Handler handler, String methodType) {
getObjectGraph().inject(handler);
}
}
There is one week point in this solution - you can't easily change ServiceProvider for test purpose (or any other kind of purpose). But if you inject it also (for example with another object graph or just through constructor) you can fix this situation.

Resources