How to set up logback MDC in apache beam and dataflow? - google-cloud-dataflow

We are using Apache Beam and would like to set up the logback MDC. Logback MDC is a great, GREAT resource: when a request comes in and you store, say, a userId (in our case it's custId, fileId, requestId), then any time a developer logs, it magically stamps that information onto the developer's log. The developer no longer forgets to add it to every log statement he writes.
I am starting with an end-to-end, integration-type test with the Apache Beam direct runner embedded in our microservice for testing (in production, the microservice calls Dataflow). Currently, I see that the MDC is intact up until after the expand() methods are called. Once the processElement methods are called, the context is of course gone, since I am in another thread.
So, trying to fix this piece first: where should I put this context such that I can restore it at the beginning of that thread?
As an example, if I have an Executor.execute(runnable), then I simply transfer the context using that runnable, like so:
import java.util.Map;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.MDC;

public class MDCContextRunnable implements Runnable {
    private static final Logger log = LoggerFactory.getLogger(MDCContextRunnable.class);

    private final Map<String, String> mdcSnapshot;
    private final Runnable runnable;

    public MDCContextRunnable(Runnable runnable) {
        this.runnable = runnable;
        mdcSnapshot = MDC.getCopyOfContextMap();
    }

    @Override
    public void run() {
        try {
            MDC.setContextMap(mdcSnapshot);
            runnable.run();
        } catch (Exception e) {
            // Must log errors before the MDC is cleared
            log.error("message", e); // logs the error along with the MDC
        } finally {
            MDC.clear();
        }
    }
}
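For context, this is roughly how the wrapper gets used (a minimal sketch; the MDC keys and the work inside the lambda are just placeholders):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.slf4j.MDC;

public class MdcHandoffExample {
    public static void main(String[] args) {
        ExecutorService executor = Executors.newFixedThreadPool(4);

        // Keys are set on the request thread...
        MDC.put("custId", "42");
        MDC.put("requestId", "abc-123");

        // ...and the wrapper carries the snapshot over to the worker thread,
        // so any logging inside the task includes custId and requestId.
        executor.execute(new MDCContextRunnable(() -> {
            // do the actual work here
        }));

        executor.shutdown();
    }
}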
So I need to do the same thing with Apache Beam, basically. I need to:
Have a point to capture the MDC
Have a point to restore the MDC
Have a point to clear out the MDC to prevent it leaking into another request (really just in case I missed something, which seems to happen now and then)
Any ideas on how to do this?
Oh, bonus points if the MDC can be there when any exceptions are logged by the framework! (i.e. ideally, frameworks are supposed to do this for you, but Apache Beam does not seem to be doing it. Most web frameworks have this built in.)
thanks,
Dean

Based on the context and examples you gave, it sounds like you want to use MDC to automatically capture more information for your own DoFns. Your best bet for this is, depending on the lifetime you need your context available for, to use either the StartBundle/FinishBundle or Setup/Teardown methods on your DoFns to create your MDC context (see this answer for an explanation of the differences between the two). The important thing is that these methods are executed for each instance of a DoFn, meaning they will be called on the new threads created to execute these DoFns.
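As a rough illustration (a sketch, not an official Beam recipe), a DoFn along these lines could capture the MDC at pipeline-construction time and restore/clear it around each bundle. The class name, field names, and String element types are placeholders, and the snapshot is copied into a HashMap so it serializes along with the DoFn:

import java.util.HashMap;
import java.util.Map;
import org.apache.beam.sdk.transforms.DoFn;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.MDC;

public class MdcRestoringFn extends DoFn<String, String> {
    private static final Logger LOG = LoggerFactory.getLogger(MdcRestoringFn.class);

    // Captured on the submitting thread when the pipeline is constructed.
    private final HashMap<String, String> mdcSnapshot;

    public MdcRestoringFn() {
        Map<String, String> current = MDC.getCopyOfContextMap();
        this.mdcSnapshot = current == null ? new HashMap<>() : new HashMap<>(current);
    }

    @StartBundle
    public void startBundle() {
        // Restore the context on whatever worker thread executes this bundle.
        MDC.setContextMap(mdcSnapshot);
    }

    @ProcessElement
    public void processElement(@Element String element, OutputReceiver<String> out) {
        LOG.info("processing element"); // carries custId/fileId/requestId from the snapshot
        out.output(element);
    }

    @FinishBundle
    public void finishBundle() {
        // Clear so the context cannot leak into whatever this thread runs next.
        MDC.clear();
    }
}

With Setup/Teardown instead, the context would live for the lifetime of the DoFn instance rather than per bundle; since Teardown is only best-effort, clearing in FinishBundle is the safer place to guard against leakage.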
Under the Hood
I should explain what's happening here and how this approach differs from your original goal. The way Apache Beam executes is that your written pipeline runs on your own machine and performs pipeline construction (which is where all the expand calls occur). However, once a pipeline is constructed, it is sent to a runner, which is often a separate application unless you're using the Direct Runner, and then the runner either directly executes your user code or runs it in a Docker environment.
In your original approach it makes sense that you would successfully apply MDC to all logs until execution begins, because execution might not only be occurring in a different thread, but potentially also a different application or machine. However, the methods described above are executed as part of your user code, so setting up your MDC there will allow it to function on whatever thread/application/machine is executing transforms.
Just keep in mind that those methods get called for every DoFn instance, and you will often have multiple DoFns per thread, which is something you may need to be wary of depending on how MDC works.

Related

How to do Parallel operations with Thread Scope in Web app using Ninject

I've run into the same question repeatedly whenever using a new DI framework: how do you run a massively-parallel operation kicked off from an HttpRequest, where each thread needs its own unique copy of the dependencies? In my case, I'm using Ninject.
The specific case I always run into is a CPU-intensive report, using Parallel.ForEach, that needs to use an Entity Framework DbContext; the EF context must be unique to the thread, but outside of these special reports it must be InRequestScope.
How do you achieve this with Ninject? Preferably allow disposing the EF context with each task on the Parallel.ForEach, since the data loaded with the context would just stay in the context and consume memory.
Note that this report is big enough to warrant Parallel.ForEach but small enough that it can run synchronously on a web request and not timeout the browser (<60 seconds). Maybe I'm weird, but I run into this need a lot.
The solution has several different moving parts that, IMO, aren't terribly well-documented parts of Ninject. The upside is that after implementing something like this, you should start feeling comfortable with Ninject in a hurry!
First, you need to change the scope for your objects so they use the HttpContext if it exists, and if not, use the current thread as a fallback. There is no documentation for this, but there is a DefaultScopeCallback that was added to the settings a while back. Set that property to your own scope callback which uses the same code in the Ninject.Web.Common source to get the HttpContext, but then use "?? Thread.CurrentThread" as the fallback. Do that in the CreateKernel code that should have been created automatically when you installed the NuGet package.
(I have substituted the StandardScopeCallbacks.Thread(ctx) where I used to have Thread.CurrentThread, since the former could conceivably change at some point. Currently those two are identical in what they do.)
private static IKernel CreateKernel()
{
    var settings = new NinjectSettings { DefaultScopeCallback = DefaultScopeCallback };
    var kernel = new StandardKernel(settings);
    // The rest of the default implementation of CreateKernel left out for brevity
}

private static Object DefaultScopeCallback(Ninject.Activation.IContext ctx)
{
    var scope = ctx.Kernel.Components.GetAll<INinjectHttpApplicationPlugin>()
        .Select(c => c.GetRequestScope(ctx)).FirstOrDefault(s => s != null);
    return scope ?? Ninject.Infrastructure.StandardScopeCallbacks.Thread(ctx);
}
Also, don't forget that the Kernel needs to be set aside as a static object for access later. You don't want to new-up a new Kernel every time you need it; I make mine accessible via "MyConfig.ObjectFactory". While this is a code smell of the service locator anti-pattern, we're going to great lengths here to avoid the anti-pattern as much as possible.
Second, according to the commit description, the DefaultScopeCallback only affects explicit bindings with no explicit scope. So if, like me, you were depending on a bunch of implicit bindings that you hadn't added, you now need to configure them:
kernel.Bind(i => i.From(Assembly.GetExecutingAssembly(), Assembly.GetAssembly(typeof(Bll.MyConfig)))
    .SelectAllClasses()
    .BindToSelf());
If you don't like doing the above, there's another way of setting the default scope for all implicit bindings that is arguably more elegant. Changing default object scope with Ninject 2.2
Third, if you'd like to clear all cached objects from the scope at the end of each Parallel operation so that memory usage doesn't skyrocket due to EF caching or whatnot, here's how to clear the Ninject cache scoped to the current thread:
Parallel.ForEach(myList, i =>
{
    var threadDb = MyConfig.ObjectFactory.Get<MyContext>();
    CreateModelsForItem(i, threadDb);
    MyConfig.ObjectFactory.Components.Get<Ninject.Activation.Caching.ICache>().Clear(Thread.CurrentThread);
});
Note that I did some testing without that Clear line at the end, and it seemed like the EF Context was getting re-used even if that HttpRequest finished and I generated the report several more times. This was not what I wanted, so the Clear operation was important. Really, the behavior I want is closer to InCallScope, but trying to get InRequestScope with InCallScope as a fallback is a can of worms I'll open on another day.

sideInput from startBundle

Prior to the most recent SDK I was relying on the ability to access my sideInput inside startBundle of my DoFn. I'm not sure of the history of the refactoring, but I seem to be having issues doing this now.
Essentially I have an array that I want to process against within my process() method, and the array is reasonably sized, so it will fit in memory.
Is it valid to expect to access a sideInput within startBundle? And if so, how can I do that if startBundle is sent a Context instead of a ProcessContext?
Example:
@Override
public void startBundle(DoFn<KV<String, Iterable<String>>, String>.Context c) throws Exception {
    uniqueIds = Lists.newArrayList(c.sideInput(iterableView));
    super.startBundle(c);
}
The history is explained here: Why did #sideInput() method move from Context to ProcessContext in Dataflow beta
Do you need to do any processing on your side input to prepare it for use in processElement? If not, then I'd suggest just using View.asList() or View.asMap() and calling that directly in processElement() -- Dataflow will do caching when possible to make this cheap. (Note View.asList() is currently available on Github and will be in the next Maven release.)
If you need to do processing on your side input, and you are using the (default) GlobalWindow, then you can lazily initialize a local variable from within processElement(). However, if you are using Window.into(), you'll need to invalidate that cache every time the element's window changes.
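For what it's worth, here is a minimal sketch of that lazy-initialization approach, assuming the default GlobalWindow and the Dataflow 1.x SDK style used in the question (the view, field, and class names are placeholders):

import java.util.List;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.values.KV;
import com.google.cloud.dataflow.sdk.values.PCollectionView;
import com.google.common.collect.Lists;

public class LazySideInputFn extends DoFn<KV<String, Iterable<String>>, String> {
    private final PCollectionView<Iterable<String>> iterableView;
    private transient List<String> uniqueIds; // built once per DoFn instance

    public LazySideInputFn(PCollectionView<Iterable<String>> iterableView) {
        this.iterableView = iterableView;
    }

    @Override
    public void processElement(ProcessContext c) {
        if (uniqueIds == null) {
            // First element seen by this instance: the side input is available here,
            // so do the one-time preparation that used to live in startBundle.
            uniqueIds = Lists.newArrayList(c.sideInput(iterableView));
        }
        // ... use uniqueIds to process c.element() ...
    }
}

If you are using Window.into(), you would additionally remember which window populated uniqueIds and rebuild it whenever the element's window changes, as noted above.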

PerRequestLifetimeManager and Task.Factory.StartNew - Dependency Injection with Unity

How to manage new tasks with PerRequestLifeTimeManager?
Should I create another container inside a new task?(I wouldn't like to change PerRequestLifeTimeManager to PerResolveLifetimeManager/HierarchicalLifetimeManager)
[HttpPost]
public ActionResult UploadFile(FileUploadViewModel viewModel)
{
    var cts = new CancellationTokenSource();
    CancellationToken cancellationToken = cts.Token;

    Task.Factory.StartNew(() =>
    {
        // _fileService = DependencyResolver.Current.GetService<IFileService>();
        _fileService.ProcessFile(viewModel.FileContent);
    }, cancellationToken);
}
You should read this article about DI in multi-threaded applications. Although it is written for a different DI library, you'll find most of the information applicable to the concept of DI in general. To quote a few important parts:
Dependency injection forces you to wire all dependencies together in a single place in the application: the Composition Root. This means that there is a single place in the application that knows about how services behave, whether they are thread-safe, and how they should be wired. Without this centralization, this knowledge would be scattered throughout the code base, making it very hard to change the behavior of a service.

In a multi-threaded application, each thread should get its own object graph. This means that you should typically call [Resolve<T>()] once at the beginning of the thread's execution to get the root object for processing that thread (or request). The container will build an object graph with all the root object's dependencies. Some of those dependencies will be singletons, shared between all threads. Other dependencies might be transient; a new instance is created per dependency. Other dependencies might be thread-specific, request-specific, or with some other lifestyle. The application code itself is unaware of the way the dependencies are registered, and that's the way it is supposed to be.

The advice of building a new object graph at the beginning of a thread also holds when manually starting a new (background) thread. Although you can pass on data to other threads, you should not pass on container-controlled dependencies to other threads. On each new thread, you should ask the container again for the dependencies. When you start passing dependencies from one thread to the other, those parts of the code have to know whether it is safe to pass those dependencies on. For instance, are those dependencies thread-safe? This might be trivial to analyze in some situations, but it prevents you from changing those dependencies for other implementations, since now you have to remember that there is a place in your code where this is happening, and you need to know which dependencies are passed on. You are decentralizing this knowledge again, making it harder to reason about the correctness of your DI configuration and making it easier to misconfigure the container in a way that causes concurrency problems.
So you should not spin off new threads from within your application code itself. And you should definitely not create a new container instance, since this can cause all sorts of performance problems; you should typically have just one container instance per application.
Instead, you should pull this infrastructure logic into your Composition Root, which allows your controller's code to be simplified. Your controller code should not be more than this:
[HttpPost]
public ActionResult UploadFile(FileUploadViewModel viewModel)
{
    _fileService.ProcessFile(viewModel.FileContent);
}
On the other hand, you don't want to change the IFileService implementation, because multi-threading shouldn't be its concern. Instead, we need some infrastructural logic that we can place in between the controller and the file service, without either of them having to know about it. The way to do this is by implementing a proxy class for the file service and placing it in your Composition Root:
private sealed class AsyncFileServiceProxy : IFileService {
    private readonly ILogger logger;
    private readonly Func<IFileService> fileServiceFactory;

    public AsyncFileServiceProxy(ILogger logger, Func<IFileService> fileServiceFactory)
    {
        this.logger = logger;
        this.fileServiceFactory = fileServiceFactory;
    }

    void IFileService.ProcessFile(FileContent content) {
        // Run on a new thread
        Task.Factory.StartNew(() => {
            this.BackgroundThreadProcessFile(content);
        });
    }

    private void BackgroundThreadProcessFile(FileContent content) {
        // Here we run on a different thread and the
        // services should be requested on this thread.
        var fileService = this.fileServiceFactory.Invoke();
        try {
            fileService.ProcessFile(content);
        }
        catch (Exception ex) {
            // logging is important, since we run on a
            // different thread.
            this.logger.Log(ex);
        }
    }
}
This class is a small piece of infrastructural logic that allows processing files on a background thread. The only thing left is to configure the container to inject our AsyncFileServiceProxy instead of the real file service implementation. There are multiple ways to do this. Here's an example:
container.RegisterType<ILogger, YourLogger>();
container.RegisterType<RealFileService>();
container.RegisterType<Func<IFileService>>(() => container.Resolve<RealFileService>(),
    new ContainerControlledLifetimeManager());
container.RegisterType<IFileService, AsyncFileServiceProxy>();
One part however is missing here from the equation, and that is how to deal with scoped lifestyles, such as the per-request lifestyle. Since you are running stuff on a background thread, there is no HttpContext, and this basically means that you need to start some 'scope' to simulate a request (since your background thread is basically its own new request). This however is where my knowledge about Unity stops. I'm very familiar with Simple Injector, and with Simple Injector you would solve this using a hybrid lifestyle (that mixes a per-request lifestyle with a lifetime-scope lifestyle) and explicitly wrap the call to BackgroundThreadProcessFile in such a scope. I imagine the solution in Unity to be very close to this, but unfortunately I don't have enough knowledge of Unity to show you how. Hopefully somebody else can comment on this, or add an extra answer to explain how to do this in Unity.

DI Container and custom-scoped state in legacy system

I believe I understand the basic concepts of DI / IoC containers, having written a couple of applications using them and having read a lot of Stack Overflow answers as well as Mark Seemann's book. There are still some cases that I have trouble with, especially when it comes to integrating a DI container into a large existing architecture where the DI principle hasn't really been used (think big ball of mud).
I know the ideal scenario is to have a single composition root / object graph per operation but in a legacy system this might not be possible without major refactoring (only the new and some select refactored old parts of the code could have dependencies injected through constructor and the rest of the system using the container as a service locator to interact with the new parts). This effectively means that a stack trace deep within an operation might include several object graphs with calls being made back and forth between new subsystems (single object graph until exiting into an old segment) and traditional subsystems (service locator call at some point to code under DI container).
With the (potentially faulty, I might be overthinking this or be completely wrong in assuming this kind of hybrid architecture is a good idea) assumptions out of the way, here's the actual problem:
Let's say we have a thread pool executing scheduled jobs of various types defined in database (or any external place). Each separate type of scheduled job is implemented as a class inheriting a common base class. When the job is started, it gets fed the information about which targets it should write its log messages to and the configuration it should use. The configuration could probably be handled by just passing the values as method parameters to whatever class needs them but if the job implementation gets larger than say 10-20 classes, it doesn't seem very handy.
Logging is the larger problem. Subsystems the job calls probably also need to write things to the log, and usually in examples this is done by just requesting an instance of ILog in the constructor. But how does that work in this case, when we don't know the details / implementation until runtime? Since:
Due to (non-DI-container-controlled) legacy system segments in the call chain (and there potentially being multiple separate object graphs), a child container cannot be used to inject the custom logger for a specific sub-scope
Manual property injection would basically require the complete call chain (including all legacy subsystems) to be updated
A simplified example to help better perceive the problem:
class JobXImplementation : JobBase {
    // through constructor injection
    ILoggerFactory _loggerFactory;
    JobXExtraLogic _jobXExtras;

    public void Run(JobConfig configurationFromDatabase)
    {
        ILog log = _loggerFactory.Create(configurationFromDatabase.targets);
        // if there were no legacy parts in the call chain, I would register log as an instance
        // in a child container, Resolve the next part of the call chain, and everyone
        // requesting ILog would get the correct logging targets
        // do stuff
        _jobXExtras.DoStuff(configurationFromDatabase, log);
    }
}

class JobXExtraLogic {
    public void DoStuff(JobConfig configurationFromDatabase, ILog log) {
        // call to legacy sub-system
        var old = new OldClass(log, configurationFromDatabase.SomeRandomSetting);
        old.DoOldStuff();
    }
}

class OldClass {
    public void DoOldStuff() {
        // moar stuff
        var old = new AnotherOldClass();
        old.DoMoreOldStuff();
    }
}

class AnotherOldClass {
    public void DoMoreOldStuff() {
        // call to a new subsystem
        var newSystemEntryPoint = DIContainerAsServiceLocator.Resolve<INewSubsystemEntryPoint>();
        newSystemEntryPoint.DoNewStuff();
    }
}

class NewSubsystemEntryPoint : INewSubsystemEntryPoint {
    public void DoNewStuff() {
        // want to log something...
    }
}
I'm sure you get the picture by this point.
Instantiating old classes through DI is a non-starter since many of them use (often multiple) constructors to inject values instead of dependencies and would have to be refactored one by one. The caller basically implicitly controls the lifetime of the object and this is assumed in the implementations (the way they handle internal object state).
What are my options? What other kinds of problems could you possibly see in a situation like this? Is trying to only use constructor injection in this kind of environment even feasible?
Great question. In general, I would say that an IoC container loses a lot of its effectiveness when only a portion of the code is DI-friendly.
Books like Working Effectively with Legacy Code and Dependency Injection in .NET both talk about ways to tease apart objects and classes to make DI viable in code bases like the one you described.
Getting the system under test would be my first priority. I'd pick a functional area to start with, one with few dependencies on other functional areas.
I don't see a problem with moving beyond constructor injection to setter injection where it makes sense, and it might offer you a stepping stone to constructor injection. Adding a property is usually less invasive than changing an object's constructor.

Exception thrown Constructor Injection - AutoFac Dependency Injection

I have an Autofac DI container and use constructor injection to inject configuration settings into my SampleClass. The ConfigurationManager class is registered as a SingleInstance, so the same single instance is used.
public ConfigurationManager()
{
    // Load the configuration settings
    GetConfigurationSettings();
}

public SampleClass(IConfigurationManager configurationManager)
{
    _configurationManager = configurationManager;
}
I am loading the configuration settings from an App.config file in the constructor of the ConfigurationManager. My problem is that I am also validating the configuration settings, and if they are not in the App.config file an exception is thrown, which causes the program to crash. This means I can't handle the exception and return a response.
Am I doing this the wrong way? Is there a better way to load the configuration settings, or is there a way to handle the exception being thrown?
Edit
ConfigurationManager configurationManager = new ConfigurationManager();
configurationManager.GetConfigurationSettings();
// Try/catch around the above for the exception thrown if config settings fail

// Register the instance above with Autofac
builder.RegisterInstance(configurationManager).As<IConfigurationManager>().SingleInstance();

// Old way of registering the configurationManager
builder.Register(c => new ConfigurationManager()).As<IConfigurationManager>().SingleInstance();
You are doing absolutely the right thing. Why? You are preventing the system from starting when the application isn't configured correctly. The last thing you want to happen is that the system actually starts and fails later on. Fail fast! However, make sure that this exception doesn't get lost. You could make sure the exception gets logged.
One note though. The general advice is to do as little as possible in the constructor of a type: just store the incoming dependencies in instance variables and that's it. This way construction of a type is really fast and can never really fail. In general, building up the dependency graph should be quick and should not fail. In your case this would not really be a problem, since you want the system to fail as soon as possible (during start-up). Still, for the sake of complying with the general advice, you might want to extract this validation process out of that type. So instead of calling GetConfigurationSettings inside the constructor, call it directly from the composition root (the code where you wire up the container) and supply the valid configuration settings object to the constructor of the ConfigurationManager. This way you not only make the ConfigurationManager simpler, but you can let the system fail even faster.
The core issue is that you are mixing the composition and execution of your object graph by doing some execution during composition. In the DI style, constructors should be as simple as possible. When your class is asked to perform some meaningful work, such as when the GetConfigurationSettings method is called, that is your signal to begin in earnest.
The main benefit of structuring things in this way is that it makes everything more predictable. Errors during composition really are composition errors, and errors during execution really are execution errors.
The timing of work is also more predictable. I realize that application configuration doesn't really change during runtime, but let's say you had a class which reads a file. If you read it in the constructor during composition, the file's contents may change by the time you use that data during execution. However, if you read the file during execution, you are guaranteed to avoid the timing issues that inevitably arise with that form of caching.
If caching is a part of your algorithm, as I imagine it is for GetConfigurationSettings, it still makes sense to implement that as part of execution rather than composition. The cached values may not have the same lifetime as the ConfigurationManager instance. Even if they do, encoding that into the constructor leaves you only one option, whereas an execution-time cache offers far more flexibility, and it solves your exception ambiguity issue.
I would not call throwing exceptions at composition time a good practice. This is because composition can involve fairly complex and indirect execution logic, making reasonable exception handling virtually impossible. I doubt you could come up with anything better than the awful
try
{
    var someComponent = context.Resolve<SampleClass>();
}
catch
{
    // Yeah, just stub all exceptions cause you have no idea of what to expect
}
I'd recommend redesigning your classes so that their constructors do not throw exceptions unless they really, really need to (e.g. if the class is absolutely useless with a null-valued constructor parameter). Then you'll need some methods that initialize your app, handle errors, and possibly interact with the user to do that.
