Prior to the most recent SDK I was relying on the ability to access my sideInput inside of startBundle of my DoFn. I’m not sure of the history of refactoring but I seem to be having issues doing this now.
Essentially, I have an array that I want to process within my process() method, and the array is reasonably sized so that it will fit in memory.
Is it valid to expect to access a sideInput within startBundle? And if so, how can I do that if startBundle is sent a Context instead of a ProcessContext?
Example:
@Override
public void startBundle(DoFn<KV<String, Iterable<String>>, String>.Context c) throws Exception {
    uniqueIds = Lists.newArrayList(c.sideInput(iterableView));
    super.startBundle(c);
}
The history is explained here: Why did #sideInput() method move from Context to ProcessContext in Dataflow beta
Do you need to do any processing on your side input to prepare it for use in processElement? If not, then I'd suggest just using View.asList() or View.asMap() and calling that directly in processElement() -- Dataflow will do caching when possible to make this cheap. (Note View.asList() is currently available on Github and will be in the next Maven release.)
If you need to do processing on your side input, and you are using the (default) GlobalWindow, then you can lazily initialize a local variable from within processElement(). However, if you are using Window.into(), you'll need to invalidate that cache every time the element's window changes.
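Here is a minimal sketch of that lazy-initialization approach under the default GlobalWindow, assuming iterableView is a PCollectionView<Iterable<String>> handed to the DoFn at construction time (class and field names here are illustrative; imports follow the Dataflow 1.x SDK the question uses):

import java.util.List;

import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.values.KV;
import com.google.cloud.dataflow.sdk.values.PCollectionView;
import com.google.common.collect.Lists;

public class UniqueIdsDoFn extends DoFn<KV<String, Iterable<String>>, String> {
    private final PCollectionView<Iterable<String>> iterableView;
    private transient List<String> uniqueIds; // lazily built per-instance cache

    public UniqueIdsDoFn(PCollectionView<Iterable<String>> iterableView) {
        this.iterableView = iterableView;
    }

    @Override
    public void processElement(ProcessContext c) throws Exception {
        if (uniqueIds == null) {
            // First element seen by this instance: materialize the side input once.
            // Safe only under the GlobalWindow; with Window.into() the cache would
            // need to be invalidated whenever the element's window changes.
            uniqueIds = Lists.newArrayList(c.sideInput(iterableView));
        }
        // ... process c.element() against uniqueIds ...
    }
}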
Related
We are using Apache Beam and would like to set up the logback MDC. The logback MDC is a great, GREAT resource: when a request comes in and you store, say, a userId (in our case it's custId, fileId, requestId), then any time a developer logs, that information is magically stamped onto the log line. The developer no longer forgets to add it to every log statement he writes.
I am starting with an end-to-end integration-type test with the Apache Beam direct runner embedded in our microservice for testing (in production, the microservice calls Dataflow). Currently, I am seeing that the MDC is good up until after the expand() methods are called. Once the processElement methods are called, the context is of course gone, since I am in another thread.
So, trying to fix this piece first: where should I put this context such that I can restore it at the beginning of that thread?
As an example, if I have an Executor.execute(runnable), then I simply transfer context using that runnable like so
import java.util.Map;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.MDC;

public class MDCContextRunnable implements Runnable {
    private static final Logger log = LoggerFactory.getLogger(MDCContextRunnable.class);

    private final Map<String, String> mdcSnapshot;
    private final Runnable runnable;

    public MDCContextRunnable(Runnable runnable) {
        this.runnable = runnable;
        mdcSnapshot = MDC.getCopyOfContextMap(); // capture on the submitting thread
    }

    @Override
    public void run() {
        try {
            MDC.setContextMap(mdcSnapshot); // restore on the executing thread
            runnable.run();
        } catch (Exception e) {
            // must log errors before the MDC is cleared
            log.error("message", e); // logs the error along with the MDC
        } finally {
            MDC.clear();
        }
    }
}
So I need to do the same with Apache Beam, basically. I need to:
Have a point to capture the MDC
Have a point to restore the MDC
Have a point to clear out the MDC to prevent it leaking to another request (really in case I missed something, which seems to happen now and then)
Any ideas on how to do this?
Oh, bonus points if the MDC can be there when any exceptions are logged by the framework!!!! (i.e. ideally, frameworks are supposed to do this for you, but Apache Beam seems like it is not doing this. Most web frameworks have this built in.)
thanks,
Dean
Based on the context and examples you gave, it sounds like you want to use MDC to automatically capture more information for your own DoFns. Your best bet for this is, depending on the lifetime you need your context available for, to use either the StartBundle/FinishBundle or Setup/Teardown methods on your DoFns to create your MDC context (see this answer for an explanation of the differences between the two). The important thing is that these methods are executed for each instance of a DoFn, meaning they will be called on the new threads created to execute these DoFns.
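As a rough sketch of what that could look like (this is not a Beam-provided facility; the snapshot field and class name are illustrative): capture the MDC map when the DoFn is constructed on the submitting machine, then restore it per bundle on whatever worker thread executes it:

import java.util.HashMap;
import java.util.Map;

import org.apache.beam.sdk.transforms.DoFn;
import org.slf4j.MDC;

public class MdcAwareDoFn extends DoFn<String, String> {
    private final HashMap<String, String> mdcSnapshot; // HashMap is Serializable

    public MdcAwareDoFn(Map<String, String> mdcContext) {
        // e.g. pass in MDC.getCopyOfContextMap() (may be null when the MDC is empty)
        this.mdcSnapshot = mdcContext == null ? new HashMap<>() : new HashMap<>(mdcContext);
    }

    @StartBundle
    public void startBundle() {
        MDC.setContextMap(mdcSnapshot); // restore on whatever thread runs this bundle
    }

    @FinishBundle
    public void finishBundle() {
        MDC.clear(); // don't leak context into other work on this thread
    }

    @ProcessElement
    public void processElement(ProcessContext c) {
        // log statements here carry custId / fileId / requestId automatically
        c.output(c.element());
    }
}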
Under the Hood
I should explain what's happening here and how this approach differs from your original goal. The way Apache Beam executes is that your written pipeline runs on your own machine and performs pipeline construction (which is where all the expand calls occur). Once a pipeline is constructed, however, it is sent to a runner, which often executes in a separate application unless it's the Direct Runner; the runner then either directly executes your user code or runs it in a Docker environment.
In your original approach it makes sense that you would successfully apply MDC to all logs until execution begins, because execution might not only be occurring in a different thread, but potentially also a different application or machine. However, the methods described above are executed as part of your user code, so setting up your MDC there will allow it to function on whatever thread/application/machine is executing transforms.
Just keep in mind that those methods get called for every DoFn instance and you will often have multiple DoFns per thread, which is something you may need to be wary of depending on how MDC works.
I've run into the same question repeatedly whenever using a new DI framework... how do you run a massively-parallel operation kicked off from an HttpRequest, where each thread needs its own unique copy of the dependencies? In my case, I'm using Ninject.
The specific case I always run into is a CPU-intensive report, using Parallel.ForEach, that needs to use an Entity Framework DbContext; the EF context must be unique to the thread, but outside of these special reports it must be InRequestScope.
How do you achieve this with Ninject? Preferably in a way that allows disposing of the EF context with each task in the Parallel.ForEach, since the data loaded with the context would otherwise just stay in the context and consume memory.
Note that this report is big enough to warrant Parallel.ForEach but small enough that it can run synchronously on a web request and not timeout the browser (<60 seconds). Maybe I'm weird, but I run into this need a lot.
The solution has several different moving parts that, IMO, aren't terribly well documented in Ninject. The upside is that after implementing something like this, you should start feeling comfortable with Ninject in a hurry!
First, you need to change the scope of your objects so they use the HttpContext if it exists, and if not, use the current thread as a fallback. There is no documentation for this, but there is a DefaultScopeCallback that was added to the settings a while back. Set that property to your own scope callback, which uses the same code as the Ninject.Web.Common source to get the HttpContext, but then uses "?? Thread.CurrentThread" as the fallback. Do that in the CreateKernel code that should have been created automatically when you installed the NuGet package.
(I have substituted the StandardScopeCallbacks.Thread(ctx) where I used to have Thread.CurrentThread, since the former could conceivably change at some point. Currently those two are identical in what they do.)
private static IKernel CreateKernel()
{
    var settings = new NinjectSettings { DefaultScopeCallback = DefaultScopeCallback };
    var kernel = new StandardKernel(settings);

    // The rest of the default implementation of CreateKernel left out for brevity
}

private static Object DefaultScopeCallback(Ninject.Activation.IContext ctx)
{
    var scope = ctx.Kernel.Components.GetAll<INinjectHttpApplicationPlugin>()
        .Select(c => c.GetRequestScope(ctx)).FirstOrDefault(s => s != null);
    return scope ?? Ninject.Infrastructure.StandardScopeCallbacks.Thread(ctx);
}
Also, don't forget that the Kernel needs to be set aside as a static object for access later. You don't want to new up a Kernel every time you need it; I make mine accessible via "MyConfig.ObjectFactory". While this is a code smell of the service-locator anti-pattern, we're going to great lengths here to avoid the anti-pattern as much as possible.
Second, according to the commit description, the DefaultScopeCallback only affects explicit bindings with no explicit scope. So if, like me, you were depending on a bunch of implicit bindings that you hadn't added, you now need to configure them:
kernel.Bind(i => i.From(Assembly.GetExecutingAssembly(), Assembly.GetAssembly(typeof(Bll.MyConfig)))
    .SelectAllClasses()
    .BindToSelf());
If you don't like doing the above, there's another way of setting the default scope for all implicit bindings that is arguably more elegant. Changing default object scope with Ninject 2.2
Third, if you'd like to clear all cached objects from the scope at the end of each Parallel operation so that memory usage doesn't skyrocket due to EF caching or whatnot, here's how to clear the Ninject cache scoped to the current thread:
Parallel.ForEach(myList, i =>
{
    var threadDb = MyConfig.ObjectFactory.Get<MyContext>();
    CreateModelsForItem(i, threadDb);
    MyConfig.ObjectFactory.Components.Get<Ninject.Activation.Caching.ICache>().Clear(Thread.CurrentThread);
});
Note that I did some testing without that Clear line at the end, and it seemed like the EF context was getting re-used even after that HttpRequest finished and I generated the report several more times. This was not what I wanted, so the Clear operation was important. Really, the behavior I want is closer to InCallScope, but trying to get InRequestScope with InCallScope as a fallback is a can of worms I'll open on another day.
I have two separate pipelines, say 'P1' and 'P2'. As per my requirement, I need to run P2 only after P1 has completely finished its execution. I need to get this entire operation done through a single template.
Basically, a template gets created the moment it encounters a run() call, say p1.run().
So from what I can see, I would need to handle the two different pipelines using two different templates, but that would not satisfy my strict order-based pipeline execution requirement.
Another way I could think of is calling p1.run() inside a ParDo of p2 and keeping p2's run() waiting until p1's run() finishes. I tried this but got stuck at the exception given below.
java.io.NotSerializableException: PipelineOptions objects are not serializable and should not be embedded into transforms (did you capture a PipelineOptions object in a field or in an anonymous class?). Instead, if you're using a DoFn, access PipelineOptions at runtime via ProcessContext/StartBundleContext/FinishBundleContext.getPipelineOptions(), or pre-extract necessary fields from PipelineOptions at pipeline construction time.
Is it not possible at all to call the run() of a pipeline inside a transform (say, a ParDo) of another pipeline?
If that is the case, then how do I satisfy my requirement of running two different pipelines in sequence from a single template?
A template can contain only a single pipeline. In order to sequence the execution of two separate pipelines each of which is a template, you'll need to schedule them externally, e.g. via some workflow management system (such as what Anuj mentioned, or Airflow, or something else - you might draw some inspiration from this post for example).
We are aware of the need for better sequencing primitives in Beam within a single pipeline, but do not have a concrete design yet.
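For the case where you control the launching program (i.e. you are not going through a template at all), strict sequencing is straightforward: run P1, block on its result via waitUntilFinish(), then run P2. A minimal sketch, assuming p1 and p2 are your two fully constructed pipelines:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;

public class SequentialRunner {
    // p1 and p2 are assumed to be the two fully constructed pipelines.
    static void runInOrder(Pipeline p1, Pipeline p2) {
        PipelineResult first = p1.run();
        first.waitUntilFinish();          // block until P1 has completely finished

        PipelineResult second = p2.run(); // P2 starts only after P1 is done
        second.waitUntilFinish();
    }
}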
I have an ExecutorService that runs several solvers in parallel. Each solver modifies several internal variables whose values must be returned.
It is not possible to encapsulate all the variables in a class to be returned via a Callable object, for compatibility reasons. Therefore, making the solvers either Callable or Runnable makes no difference in my case, as I cannot retrieve all the variables I need.
I considered the following two options:
Each solver accesses a synchronized class and writes its values there.
Access the objects (solvers) that have been submitted to the executor in order to get their variables via get methods.
I prefer the second option, but I can't find a way to gain access to the submitted objects.
Any suggestion (for any of the options)?
You didn't elaborate on the "compatibility issues", so I can only suggest a general solution for what you described.
Since you use ExecutorService, I believe that you use ThreadPoolExecutor (or a subclass of it) as the implementation of that interface. If that's the case, I suggest overriding the ThreadPoolExecutor.afterExecute(Runnable r, Throwable t) method. It's called after any submitted Runnable has completed its execution. Its default implementation is empty.
Your implementation should follow these steps:
Check if t != null. If so, process Throwable t which caused a solver to abort.
Check the type of r and if you recognize it, retrieve its results. Of course, it will be simpler if all your solvers have a common API.
Store results somewhere.
But look out - ThreadPoolExecutor.afterExecute() is called from the thread that ran the Runnable r, so the 3rd step will most likely need to be synchronized.
Putting it all together, your code can look like this:
@Override
protected void afterExecute(Runnable r, Throwable t) {
    super.afterExecute(r, t);
    if (t != null) {
        // handle t
    } else {
        // Note: this cast assumes solvers were passed in via execute(); submit()
        // wraps tasks in a FutureTask, so r would not be the Solver itself.
        Solver solver = (Solver) r;
        Results results = solver.getResults();
        synchronized (allSolutions) { // afterExecute runs on the worker thread
            allSolutions.addResults(results);
        }
    }
}
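For completeness, here's a self-contained sketch of how that wiring might look end to end; Solver and the list-based result aggregation are stand-ins for your real types. The key subtlety is that tasks must go in via execute(), because submit() wraps them in a FutureTask and afterExecute() would then see the wrapper instead of the solver:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class SolverPoolDemo {
    // Stand-in solver: does its work in run() and exposes results afterwards.
    static class Solver implements Runnable {
        private int result;
        @Override public void run() { result = 42; /* real solving work here */ }
        int getResults() { return result; }
    }

    public static void main(String[] args) throws InterruptedException {
        List<Integer> allSolutions = new ArrayList<>();

        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                4, 4, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>()) {
            @Override
            protected void afterExecute(Runnable r, Throwable t) {
                super.afterExecute(r, t);
                if (t == null && r instanceof Solver) {
                    // Runs on the worker thread, so shared state is synchronized.
                    synchronized (allSolutions) {
                        allSolutions.add(((Solver) r).getResults());
                    }
                }
            }
        };

        for (int i = 0; i < 10; i++) {
            pool.execute(new Solver()); // execute(), not submit(): submit() would
                                        // wrap the Solver in a FutureTask
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        System.out.println(allSolutions); // ten results, collected by afterExecute()
    }
}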
The question is: how can I stop a method from being called twice when the first call has not "completed" because its handler is waiting, for example, for a URL to load?
Here is the situation:
I have written a Flash client which interfaces with a Java server using a binary encrypted protocol (I would have loved not to re-invent the whole client/server object communication stack, but I had to encrypt the data in such a way that simple tools like Tamper Data and Charles Proxy could not pick it up, as they can when just using SSL).
The API presents itself to Flash as an ActionScript swf file, and the API itself is a singleton.
The API exposes some simple methods, including:
login()
getBalance()
startGame()
endGame()
Each method will call my HttpCommunicator class.
HttpCommunicator.as (with error handling and stuff removed):
public class HttpCommunicator {
    private var _externalHandler:Function;

    public function communicate(data:String, externalHandler:Function):void {
        // do encryption
        // add message numbers etc. to data
        this._externalHandler = externalHandler;
        request.data = encrypt(addMessageNumbers(data));
        loader.addEventListener(Event.COMPLETE, handleComplete);
        loader.load(request);
    }

    private function handleComplete(event:Event):void {
        var loader:URLLoader = URLLoader(event.target);
        var data:String = decrypt(loader.data);
        // check message numbers match etc.
        _externalHandler(data);
    }
}
The problem with this is that I can't protect the same HttpCommunicator object from being called twice before the first call has handled the COMPLETE event, unless:
I create a new HttpCommunicator object every single time I want to send a message. (I also want to avoid creating a URLLoader each time, but this is not my code, so it will be more problematic to know how it behaves.)
I do something like synchronize on communicate. This would effectively block, but that is better than corrupting the data transmission. In theory, the Flash client should not call the same API function twice in a row, but I bet it will happen.
I implement a queue of messages. However, this also needs synchronization around the push and pop methods, which I can't find how to do.
Will option 1 even work? If I have a singleton with a method, say getBalance, and the getBalance method has:
// class is instantiated through a factory as a singleton
public class API {
    var balanceCommunicator:HttpCommunicator = new HttpCommunicator(); // create one for all future calls

    public function getBalance(playerId:uint, handler:Function):Number {
        balanceCommunicator.communicate(...); // this doesn't block
        // do other stuff
    }
}
Will the second call trounce the first call's communicator variable? I.e., will it behave as if it's static, since there is only one copy of the API object?
Say there was a button on the GUI labeled "update balance" and the user kept clicking on it, at the same time as, say, a URLLoader COMPLETE event handler being called which also calls the API's getBalance() function (i.e. if Flash were multithreaded).
Well, first off, with the exception of the networking APIs, Flash is not multithreaded. All ActionScript runs in one single thread.
You could fairly easily create a semaphore-like system where each call to communicate passed in a "key" as well as the arguments you already specified. That "key" would just be a string that represented the type of call you're doing (getBalance, login, etc). The "key" would be a property in a generic object (Object or Dictionary) and would reference an array (it would have to be created if it didn't exist).
If the array was empty then the call would happen as normal. If not then the information about the call would be placed into an object and pushed into the array. Your complete handler would then have to just check, after it finished a call, if there were more requests in the queue and if so dequeue one of them and run that request.
One thing about this system would be that it would still allow different types of requests to happen in parallel - but you would have to have a new URLLoader per request (which is perfectly reasonable as long as you clean it up after each request is done).
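A minimal sketch of that keyed request queue, shown in Java for concreteness (the same structure translates directly to ActionScript, where, since all ActionScript runs on one thread, no locking is needed). The class and method names here are illustrative:

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

public class KeyedRequestQueue {
    // One FIFO queue of pending requests per call type ("getBalance", "login", ...).
    private final Map<String, Deque<Runnable>> queues = new HashMap<>();

    public void enqueue(String key, Runnable request) {
        Deque<Runnable> queue = queues.computeIfAbsent(key, k -> new ArrayDeque<>());
        queue.addLast(request);
        if (queue.size() == 1) {
            request.run(); // nothing in flight for this key: fire immediately
        }
        // otherwise the request waits its turn in the queue
    }

    // Call this from the COMPLETE handler once the in-flight request for `key` finishes.
    public void onComplete(String key) {
        Deque<Runnable> queue = queues.get(key);
        queue.removeFirst();         // drop the request that just completed
        if (!queue.isEmpty()) {
            queue.peekFirst().run(); // kick off the next queued request
        }
    }
}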