JSR 352: How to collect data from the Writer of each Partition of a Partitioned Step? - jsr352

So, I have 2 partitions in a step which writes into a database. I want to record the number of rows written in each partition, get the sum, and print it to the log.
I was thinking of using a static variable in the Writer and using the Step Context/Job Context to get it in afterStep() of the Step Listener. However, when I tried it I got null. I am able to get these values in close() of the Reader.
Is this the right way to go about it? Or should I use a Partition Collector/Reducer/Analyzer?
I am using Java Batch in WebSphere Liberty, and I am developing in Eclipse.

I was thinking of using a static variable in the Writer and use Step Context/Job Context to get it in afterStep() of the Step Listener. However when I tried it I got null.
The ItemWriter might already be destroyed at this point, but I'm not sure.
Is this the right way to go about it?
Yes, it should be good enough. However, you need to ensure the total row count is shared across all partitions, because the batch runtime maintains a StepContext clone per partition. You should use JobContext instead.
I think using PartitionCollector and PartitionAnalyzer is a good choice, too. The PartitionCollector interface has a method, collectPartitionData(), that collects data coming from its partition. Once collected, the batch runtime passes this data to the PartitionAnalyzer for analysis. Note that there are:
N PartitionCollector per step (1 per partition)
N StepContext per step (1 per partition)
1 PartitionAnalyzer per step
The number of records written can be passed via the StepContext's transient user data. Since each StepContext is reserved for its own step partition, the transient user data won't be overwritten by other partitions.
Here's the implementation:
MyItemWriter:
    @Inject
    private StepContext stepContext;

    @Override
    public void writeItems(List<Object> items) throws Exception {
        // ... write the items ...
        // Add this chunk's rows to the count not yet handed to the collector.
        Object userData = stepContext.getTransientUserData();
        int partRowCount = userData != null ? (Integer) userData : 0;
        stepContext.setTransientUserData(partRowCount + items.size());
    }
MyPartitionCollector:
    @Inject
    private StepContext stepContext;

    @Override
    public Serializable collectPartitionData() throws Exception {
        // Called after each chunk checkpoint and once more when the partition ends,
        // so return the rows written since the last call and reset the counter.
        Object userData = stepContext.getTransientUserData();
        int partRowCount = userData != null ? (Integer) userData : 0;
        stepContext.setTransientUserData(0);
        return partRowCount;
    }
MyPartitionAnalyzer:
    private int rowCount = 0;

    @Override
    public void analyzeCollectorData(Serializable fromCollector) throws Exception {
        // Runs on the main step thread and receives the data from every partition.
        rowCount += (Integer) fromCollector;
        System.out.printf("%d rows processed so far (all partitions).%n", rowCount);
    }
Reference: JSR352 v1.0 Final Release.pdf
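To see why it matters what the collector hands over, here is a minimal plain-Java simulation of the collector/analyzer flow. It is only a sketch: it does not use the javax.batch API, all class and method names (PartitionCountSimulation, PartitionState, Analyzer) are made up for illustration, and it assumes the collector is invoked after every chunk and once more at the end of the partition. Each partition reports only the rows written since its last collection, so the single analyzer can simply add everything up.

```java
// Plain-Java simulation; none of these classes belong to the javax.batch API.
class PartitionCountSimulation {

    /** Stands in for one partition's private transient user data. */
    static class PartitionState {
        private int rowsSinceLastCollect = 0;

        // What a writer would do after writing a chunk.
        void recordWrite(int rows) {
            rowsSinceLastCollect += rows;
        }

        // What a collector would do: hand over the delta and reset it.
        int collectAndReset() {
            int delta = rowsSinceLastCollect;
            rowsSinceLastCollect = 0;
            return delta;
        }
    }

    /** Stands in for the single analyzer that sees data from every partition. */
    static class Analyzer {
        private int total = 0;

        void analyze(int fromCollector) {
            total += fromCollector;
        }

        int total() {
            return total;
        }
    }

    /** Runs one partition: collect after each chunk, then once more at the end. */
    static void runPartition(int[] chunkSizes, Analyzer analyzer) {
        PartitionState state = new PartitionState();
        for (int size : chunkSizes) {
            state.recordWrite(size);
            analyzer.analyze(state.collectAndReset());
        }
        // The final collection adds 0 because the counter was already reset.
        analyzer.analyze(state.collectAndReset());
    }

    /** Sums the rows of all partitions through one shared analyzer. */
    static int totalRows(int[][] partitions) {
        Analyzer analyzer = new Analyzer();
        for (int[] chunks : partitions) {
            runPartition(chunks, analyzer);
        }
        return analyzer.total();
    }

    public static void main(String[] args) {
        int[][] partitions = { {10, 10, 3}, {10, 7} };
        // prints "40 rows written (all partitions)"
        System.out.println(totalRows(partitions) + " rows written (all partitions)");
    }
}
```

If the partitions reported a running total instead of a delta, the analyzer would count earlier chunks again on every call, which is why the counter is reset on collection.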

Let me offer a bit of an alternative on the accepted answer and add some comments.
PartitionAnalyzer variant - Use analyzeStatus() method
Another technique would be to use analyzeStatus, which is only called once at the end of each partition and is passed the partition-level exit status.
public void analyzeStatus(BatchStatus batchStatus, String exitStatus)
In contrast, the above answer using analyzeCollectorData gets called at the end of each chunk on each partition.
E.g.
public class MyItemWriteListener extends AbstractItemWriteListener {
    @Inject
    StepContext stepCtx;

    // Running total of rows written by this partition.
    private int newCount = 0;

    @Override
    public void afterWrite(List<Object> items) throws Exception {
        // Update 'newCount' based on items.size().
        newCount += items.size();
        stepCtx.setExitStatus(Integer.toString(newCount));
    }
}
Obviously this only works if you weren't using the exit status for some other purpose. You can set the exit status from any artifact (though this freedom might be one more thing to have to keep track of).
Comments
The API is designed to let an implementation dispatch individual partitions across JVMs (e.g. in Liberty you can see this here). Using a static variable ties you to a single JVM, so it's not a recommended approach.
Also note that both the JobContext and the StepContext are implemented in the "thread-local"-like fashion we see in batch.

Related

Can an Azure Function be Executed for Multiple Environments

I've encountered a dependency injection scenario which I cannot find a way through.
We currently have an Azure function.
We are using dependency injection via the FunctionsStartup attribute.
That all works fine, until I get asked to make it work for multiple environments.
The tester found it too onerous to deploy to 7 different environments, so I was asked to re-jig the function so that it runs (in a loop) for those environments.
That means 7 different IConfigurations and somehow having 7 separate compartmentalised IOC registrations of services.
I can't think of a way of doing that, without significantly re-structuring the way abstractions are being resolved. Even if you set up registrations in a loop and inject an IEnumerable of a service, when it goes to resolve a child dependency, it just pulls the last one registered, rather than the one which was meant to correlate with the current item being iterated.
So, something like this (using Autofac):
Registration
foreach (var configuration in configurations)
{
containerBuilder.Register<ICosmosDbService<AccountUsage>>(sp =>
{
var dBConfig = CosmosDBHelper.GetProjectDatabaseConfig(configuration.Value, Project.Jupiter);
return CosmosClientInitializer<AccountUsage>.Initialize(dBConfig);
}).As<ICosmosDbService<AccountUsage>>();
}
Usage
private readonly IEnumerable<IAccountUsageService> _accountUsageService;
public JobScheduler(IEnumerable<IAccountUsageService> accountUsageService)
{
_accountUsageService = accountUsageService;
}
[FunctionName("JobScheduler")]
public async Task Run([TimerTrigger("0 */2 * * * *")] TimerInfo myTimer, ILogger log)
{
log.LogInformation($"Job Scheduler Timer trigger function executed at: {DateTime.Now}");
try
{
foreach (var usageService in _accountUsageService)
{
var logs = await usageService.GetCurrentAccountUsage("gfkjdsasjfa");
// ...
}
}
I realise this kind of DI usage is not ideal (and does not even work).
Is there a way to structure an Azure Function such that it can execute for different configurations in a compartmentalised manner? Or is this really just fighting against the technology?
You've got a couple of ways to do this: either inject the right dependencies into the function constructor, or resolve them dynamically using a service-locator type approach with a named instance.
Let's consider the second approach and what it would mean for your implementation. As you demonstrated, you'd be looping through your instances and resolving the dependency you want to use, then invoking it:
foreach (var usageService in _accountUsageService)
{
var logs = await usageService.GetCurrentAccountUsage("named-instance");
logs.DoSomething();
}
This is technically possible, but now you're doing batch processing: you're doing more than one piece of work that's been triggered by a single event (the timer object), which means you have to deal with a couple of extra problems. What should you do if there's a failure with one of the instances, and what should you do if one of the instances is running slowly?
Ideally, you want functions to do the smallest bit of work they can and complete quickly. You don't want failure or slowness with one particular instance impacting the other instances. By breaking it down to the smallest piece of work (think: one event trigger does one piece of work), you can take advantage of the functions runtime for things like retries on failures, and threading and concurrency are handled for you by the runtime.
You could then think about a couple of ways you could do this. a) multiple function signatures and a service resolver approach, e.g.
public class JobScheduler
{
    private readonly IEnumerable<IAccountUsageService> _accountUsageService;

    public JobScheduler(IEnumerable<IAccountUsageService> accountUsageService)
    {
        _accountUsageService = accountUsageService;
    }

    [FunctionName("FirstInstance")]
    public async Task FirstInstance([TimerTrigger("%MetricPoller:Schedule%")] TimerInfo myTimer)
    {
        // GetNamedInstance stands for whatever lookup resolves the service for one environment.
        var logs = await _accountUsageService.GetNamedInstance("instance-a");
        logs.DoSomething();
    }

    [FunctionName("SecondInstance")]
    public async Task SecondInstance([TimerTrigger("%MetricPoller:Schedule%")] TimerInfo myTimer)
    {
        var logs = await _accountUsageService.GetNamedInstance("instance-b");
        logs.DoSomething();
    }
}
or b), multiple classes with the necessary dependencies injected
public class JobSchedulerFirstInstance
{
    private readonly ILogs _logs;

    public JobSchedulerFirstInstance(ILogs logs)
    {
        _logs = logs;
    }

    [FunctionName("FirstInstance")]
    public Task FirstInstance([TimerTrigger("%MetricPoller:Schedule%")] TimerInfo myTimer)
    {
        _logs.DoSomething();
        return Task.CompletedTask;
    }
}
I'd personally lean towards the multiple-classes approach and register named instances with my container. It needs a bit of extra wire-up work, but you'll end up with lots of small, very similar classes that are basically just plumbing that the functions runtime executes.

Side inputs vs normal constructor parameters in Apache Beam

I have a general question on side inputs and broadcasting in the context of Apache Beam. Do any additional variables, lists, or maps that are needed for computation during processElement need to be passed as side inputs? Is it OK if they are passed as normal constructor arguments to the DoFn? For example, what if I have some fixed (not computed) values (constants, like a start date and an end date) that I want to use during the per-element computation in processElement? I can make singleton PCollectionViews out of each of those variables separately and pass them to the DoFn constructor as side inputs. However, instead of doing that, can I not just pass each of those constants as normal constructor arguments to the DoFn? Am I missing anything subtle here?
In terms of code, when should I do:
// The output type is assumed to match the input type; the original snippet omitted it.
public static class MyFilter extends DoFn<KV<String, Iterable<MyData>>, KV<String, Iterable<MyData>>> {
    // these are singleton views
    private final PCollectionView<LocalDateTime> dateStartView;
    private final PCollectionView<LocalDateTime> dateEndView;

    public MyFilter(PCollectionView<LocalDateTime> dateStartView,
                    PCollectionView<LocalDateTime> dateEndView) {
        this.dateStartView = dateStartView;
        this.dateEndView = dateEndView;
    }

    @ProcessElement
    public void processElement(ProcessContext c) throws Exception {
        // extract date values from the singleton views here and use them
    }
}
As opposed to :
// The output type is assumed to match the input type; the original snippet omitted it.
public static class MyFilter extends DoFn<KV<String, Iterable<MyData>>, KV<String, Iterable<MyData>>> {
    private final LocalDateTime dateStart;
    private final LocalDateTime dateEnd;

    public MyFilter(LocalDateTime dateStart,
                    LocalDateTime dateEnd) {
        this.dateStart = dateStart;
        this.dateEnd = dateEnd;
    }

    @ProcessElement
    public void processElement(ProcessContext c) throws Exception {
        // use the passed in date values directly here
    }
}
Notice that in these examples, dateStart and dateEnd are fixed values and not the dynamic results of any previous computation of the pipeline.
When you call something like pipeline.apply(ParDo.of(new MyFilter(...))), the DoFn gets instantiated in the main program that you use to start the pipeline. It then gets serialized and passed to the runner for execution. The runner then decides where to execute it, e.g. on a fleet of 100 VMs, each of which will receive its own copy of the code and serialized data. If the member variables are serializable and you don't mutate them at execution time, it should be fine: the DoFn will get deserialized on each node with all the fields populated, and will get executed as expected. However, you don't control the number of instances or, to some extent, their lifecycle, so mutate them at your own risk.
The benefit of PCollections and side inputs is that you are not limited to static values, so for a couple of simple immutable values, constructor arguments should be fine.
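The serialize-then-deserialize step described above can be tried out without Beam. The following is only a sketch: DateRangeFilter is a hypothetical stand-in for a DoFn (it is not a Beam class), and roundTrip uses plain Java serialization to imitate, roughly, what happens when the object is shipped to a worker. It shows that immutable, serializable constructor arguments arrive intact on the other side.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.time.LocalDateTime;

class SerializedFilterDemo {

    /** Hypothetical stand-in for a DoFn: serializable, with immutable fields. */
    static class DateRangeFilter implements Serializable {
        private static final long serialVersionUID = 1L;
        private final LocalDateTime dateStart;
        private final LocalDateTime dateEnd;

        DateRangeFilter(LocalDateTime dateStart, LocalDateTime dateEnd) {
            this.dateStart = dateStart;
            this.dateEnd = dateEnd;
        }

        /** True when the timestamp lies in [dateStart, dateEnd). */
        boolean accepts(LocalDateTime t) {
            return !t.isBefore(dateStart) && t.isBefore(dateEnd);
        }
    }

    /** Serializes and deserializes the filter, imitating shipment to a worker. */
    static DateRangeFilter roundTrip(DateRangeFilter original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(original);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                return (DateRangeFilter) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("Serialization round trip failed", e);
        }
    }

    public static void main(String[] args) {
        DateRangeFilter onWorker = roundTrip(new DateRangeFilter(
                LocalDateTime.of(2020, 1, 1, 0, 0),
                LocalDateTime.of(2020, 2, 1, 0, 0)));
        // The copy still carries the dates given to the constructor: prints "true".
        System.out.println(onWorker.accepts(LocalDateTime.of(2020, 1, 15, 12, 0)));
    }
}
```

A field that is not serializable would make the round trip fail, and a field mutated on one copy would not be visible on the others, which matches the caveats above.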

Amazon SWF queries

Over the last couple of years, I have done a fair amount of work on Amazon SWF, but the following points are still unclear to me and I have not been able to find any straightforward answers on any forums yet.
These are pretty basic requirements, I suppose, and I'm sure others have come across them too. It would be great if someone could clarify these.
1. Is there a simple way to return a workflow execution result (maybe just something as simple as a boolean) back to the workflow starter?
2. Is there a way to catch an Activity timeout exception, so that we can run customised actions in such scenarios?
3. Why doesn't WorkflowExecutionHistory contain Activities, why just Events?
4. Why is there no simple way of restarting a workflow from the point it failed?
I am considering using SWF for more business processes at my workplace, but these limitations/doubts are holding me back!
FINAL WORKING SOLUTION
public class ReturnResultActivityImpl implements ReturnResultActivity {
    // Completed when the workflow delivers its result to this activity.
    private SettableFuture<WorkflowResult> future;

    public ReturnResultActivityImpl() {
    }

    public ReturnResultActivityImpl(SettableFuture<WorkflowResult> future) {
        this.future = future;
    }

    public void returnResult(WorkflowResult workflowResult) {
        System.out.println("Marking future as completed");
        future.set(workflowResult);
    }
}

public class WorkflowResult {
    private boolean success;
    private String note;

    public WorkflowResult(boolean s, String n) {
        this.success = s;
        this.note = n;
    }

    public boolean isSuccess() {
        return success;
    }

    public String getNote() {
        return note;
    }
}

public class WorkflowStarter {
    @Autowired
    ReturnResultActivityClient returnResultActivityClient;
    @Autowired
    DummyWorkflowClientExternalFactory dummyWorkflowClientExternalFactory;
    @Autowired
    AmazonSimpleWorkflowClient swfClient;

    String domain = "test-domain";
    boolean isRegister = true;
    int days = 7;
    int terminationTimeoutSeconds = 5000;
    int threadPollCount = 2;
    int taskExecutorThreadCount = 4;

    public String testWorkflow() throws Exception {
        SettableFuture<WorkflowResult> workflowResultFuture = SettableFuture.create();
        // A random task list name routes the result back to this starter instance only.
        String taskListName = "testTaskList-" + RandomStringUtils.randomAlphabetic(8);
        ReturnResultActivity activity = new ReturnResultActivityImpl(workflowResultFuture);
        SpringActivityWorker activityWorker = buildReturnResultActivityWorker(taskListName, Arrays.asList(activity));
        DummyWorkflowClientExternalFactory factory = new DummyWorkflowClientExternalFactoryImpl(swfClient, domain);
        factory.getClient().doSomething(taskListName);
        // Block until the workflow calls the result activity, or give up after 20 seconds.
        WorkflowResult result = workflowResultFuture.get(20, TimeUnit.SECONDS);
        return "Call result note - " + result.getNote();
    }

    public SpringActivityWorker buildReturnResultActivityWorker(String taskListName, List activityImplementations)
            throws Exception {
        return setupActivityWorker(swfClient, domain, taskListName, isRegister, days, activityImplementations,
                terminationTimeoutSeconds, threadPollCount, taskExecutorThreadCount);
    }
}

public class Workflow {
    @Autowired
    private DummyActivityClient dummyActivityClient;
    @Autowired
    private ReturnResultActivityClient returnResultActivityClient;

    @Override
    public void doSomething(final String resultActivityTaskListName) {
        Promise<Void> activityPromise = dummyActivityClient.dummyActivity();
        returnResult(resultActivityTaskListName, activityPromise);
    }

    // Runs only once 'waitFor' is ready, i.e. after the dummy activity has completed.
    @Asynchronous
    private void returnResult(final String taskListname, Promise waitFor) {
        ActivitySchedulingOptions schedulingOptions = new ActivitySchedulingOptions();
        schedulingOptions.setTaskList(taskListname);
        WorkflowResult result = new WorkflowResult(true, "All successful");
        returnResultActivityClient.returnResult(result, schedulingOptions);
    }
}
The standard pattern is to host a special activity in the workflow starter process that is used to deliver the result. Use a process-specific task list to make sure that it is routed to the correct instance of the starter. Here are the steps to implement it:
Define an activity to receive the result, for example "returnResultActivity". Make this activity's implementation complete the Future passed to its constructor upon execution.
When the workflow is started, it receives "resultActivityTaskList" as an input argument. At the end, the workflow calls this activity with the workflow result. The activity is scheduled on the passed task list.
The workflow starter creates an ActivityWorker and an instance of a Future. Then it creates an instance of "returnResultActivity" with that future as a constructor parameter.
Then it registers the activity instance with the activity worker and configures it to poll on a randomly generated task list name. Then it calls "start workflow execution", passing the generated task list name as an input argument.
Then it waits on the Future to complete. The future.get() call is going to return the workflow result.
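The steps above can be sketched in plain Java to show the control flow on its own. This is only an illustration under stated assumptions: it does not use SWF or the Flow Framework, the TASK_LISTS map stands in for SWF routing an activity task to whoever polls a given task list, and every name in it (ResultCallbackSketch, runWorkflow, startAndWait) is hypothetical.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

class ResultCallbackSketch {

    // Simulated routing: task list name -> result activity hosted by a starter.
    private static final Map<String, Consumer<String>> TASK_LISTS = new ConcurrentHashMap<>();

    /** Simulated workflow: does its work, then calls the result activity on the given task list. */
    static void runWorkflow(String resultTaskList) {
        String result = "All successful";
        TASK_LISTS.get(resultTaskList).accept(result);
    }

    /** Starter: hosts the result activity, starts the workflow, then waits on the future. */
    static String startAndWait() {
        CompletableFuture<String> resultFuture = new CompletableFuture<>();
        // A unique task list name makes sure the result reaches this starter only.
        String taskList = "result-" + UUID.randomUUID();
        // The "activity" simply completes the future it was constructed with.
        TASK_LISTS.put(taskList, resultFuture::complete);

        ExecutorService workflowHost = Executors.newSingleThreadExecutor();
        try {
            // The workflow runs elsewhere (here: another thread) and gets the task list as input.
            workflowHost.submit(() -> runWorkflow(taskList));
            return resultFuture.get(20, TimeUnit.SECONDS);
        } catch (Exception e) {
            throw new IllegalStateException("No workflow result received", e);
        } finally {
            workflowHost.shutdown();
            TASK_LISTS.remove(taskList);
        }
    }

    public static void main(String[] args) {
        System.out.println("Call result note - " + startAndWait());
    }
}
```

The timeout on get() matters in the real setup as well: if the workflow fails before calling the result activity, the starter would otherwise wait forever.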
Yes, if you are using the AWS Flow Framework, a timeout exception is thrown when an activity times out. If you are not using the Flow Framework, then you are making your life 100 times harder. BTW, a workflow timeout is thrown into the parent workflow as a timeout exception as well. It is not possible to catch a workflow timeout exception from within the timing-out instance itself. In this case it is recommended not to rely on the workflow timeout, but to create a timer that fires and notifies the workflow logic that some business event has timed out.
Because a single activity execution has multiple events associated with it. It should be pretty easy to write code that converts the history to whatever representation of activities you like. Such code would just match the events that relate to each activity. Each event always has a reference to the related events, so it is easy to roll them up into a higher-level representation.
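As a rough sketch of such a roll-up, the code below groups events by the ID of the "scheduled" event that the later events refer back to. The Event class is a simplified, hypothetical model rather than the AWS SDK's history event type; it assumes only that started, completed, failed, timed-out and canceled activity events carry a scheduledEventId pointing at the matching ActivityTaskScheduled event.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class HistoryRollupSketch {

    /** Simplified, hypothetical history event; real SWF events carry many more attributes. */
    static final class Event {
        final long eventId;
        final String type;
        final long scheduledEventId; // 0 when the event does not refer to a scheduled event

        Event(long eventId, String type, long scheduledEventId) {
            this.eventId = eventId;
            this.type = type;
            this.scheduledEventId = scheduledEventId;
        }
    }

    /** Groups activity events by the ID of their ActivityTaskScheduled event. */
    static Map<Long, List<String>> rollUp(List<Event> history) {
        Map<Long, List<String>> activities = new LinkedHashMap<>();
        for (Event e : history) {
            long key;
            switch (e.type) {
                case "ActivityTaskScheduled":
                    key = e.eventId; // the scheduled event starts a new activity
                    break;
                case "ActivityTaskStarted":
                case "ActivityTaskCompleted":
                case "ActivityTaskFailed":
                case "ActivityTaskTimedOut":
                case "ActivityTaskCanceled":
                    key = e.scheduledEventId; // later events point back to it
                    break;
                default:
                    continue; // not an activity event in this simplified model
            }
            activities.computeIfAbsent(key, k -> new ArrayList<>()).add(e.type);
        }
        return activities;
    }

    /** A small made-up history with two activities, the second of which times out. */
    static List<Event> sampleHistory() {
        return Arrays.asList(
                new Event(1, "WorkflowExecutionStarted", 0),
                new Event(5, "ActivityTaskScheduled", 0),
                new Event(6, "ActivityTaskStarted", 5),
                new Event(7, "ActivityTaskCompleted", 5),
                new Event(8, "ActivityTaskScheduled", 0),
                new Event(9, "ActivityTaskStarted", 8),
                new Event(10, "ActivityTaskTimedOut", 8));
    }

    public static void main(String[] args) {
        rollUp(sampleHistory()).forEach((id, events) ->
                System.out.println("activity scheduled at event " + id + ": " + events));
    }
}
```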
Unfortunately, there is no easy answer to this one. Ideally, SWF would support restarting a workflow by copying its history up to the failure point, but that is not supported. I personally believe that a workflow should be written so that it never fails but always handles failures itself. Obviously, that doesn't work for failures due to unexpected conditions. In that case, writing the workflow so that it can be restarted from the beginning is the simplest approach.

Dynamic table name when writing to BQ from dataflow pipelines

As a followup question to the following question and answer:
https://stackoverflow.com/questions/31156774/about-key-grouping-with-groupbykey
I'd like to confirm with the Google Dataflow engineering team (@jkff) whether the 3rd option proposed by Eugene is at all possible with Google Dataflow:
"have a ParDo that takes these keys and creates the BigQuery tables, and another ParDo that takes the data and streams writes to the tables"
My understanding is that a ParDo/DoFn processes each element. How could we specify a table name (a function of the keys passed in from side inputs) when writing out from processElement of a ParDo/DoFn?
Thanks.
Updated with a DoFn, which obviously does not work since c.element().getValue() is not a PCollection.
PCollection<KV<String, Iterable<String>>> output = ...;
public class DynamicOutput2Fn extends DoFn<KV<String, Iterable<String>>, Integer> {
    private final PCollectionView<List<String>> keysAsSideinputs;

    public DynamicOutput2Fn(PCollectionView<List<String>> keysAsSideinputs) {
        this.keysAsSideinputs = keysAsSideinputs;
    }

    @Override
    public void processElement(ProcessContext c) {
        List<String> keys = c.sideInput(keysAsSideinputs);
        String key = c.element().getKey();
        //the below is not working!!! How could we write the value out to a sink, be it gcs file or bq table???
        c.element().getValue().apply(ParDo.of(new FormatLineFn()))
            .apply(TextIO.Write.to(key));
        c.output(1);
    }
}
The BigQueryIO.Write transform does not support this. The closest thing you can do is to use per-window tables, and encode whatever information you need to select the table in the window objects by using a custom WindowFn.
If you don't want to do that, you can make BigQuery API calls directly from your DoFn. With this, you can set the table name to anything you want, as computed by your code. It could be looked up from a side input, or computed directly from the element the DoFn is currently processing. To avoid making too many small calls to BigQuery, you can batch up the requests and send them in finishBundle().
You can see how the Dataflow runner does the streaming import here:
https://github.com/GoogleCloudPlatform/DataflowJavaSDK/blob/master/sdk/src/main/java/com/google/cloud/dataflow/sdk/util/BigQueryTableInserter.java
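The batching idea can be sketched independently of Dataflow and BigQuery. In the hypothetical helper below, add() plays the role of the per-element call and flush() plays the role of the end-of-bundle call. TableSink, the "events_" table-name prefix and all other names are assumptions made for illustration; a real DoFn would issue the actual BigQuery insert where the sink is invoked.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class PerTableBatcher {

    /** Hypothetical sink; a real pipeline would call the BigQuery API here. */
    interface TableSink {
        void insertAll(String table, List<String> rows);
    }

    private final Map<String, List<String>> buffers = new HashMap<>();
    private final int maxBatchSize;
    private final TableSink sink;

    PerTableBatcher(int maxBatchSize, TableSink sink) {
        this.maxBatchSize = maxBatchSize;
        this.sink = sink;
    }

    /** Per-element step: pick the table from the key and buffer the row. */
    void add(String key, String row) {
        String table = "events_" + key; // table name computed by user code (made-up scheme)
        List<String> buffer = buffers.computeIfAbsent(table, t -> new ArrayList<>());
        buffer.add(row);
        if (buffer.size() >= maxBatchSize) {
            flushTable(table); // keep individual requests bounded in size
        }
    }

    /** End-of-bundle step: send whatever is still buffered. */
    void flush() {
        for (String table : new ArrayList<>(buffers.keySet())) {
            flushTable(table);
        }
    }

    private void flushTable(String table) {
        List<String> rows = buffers.remove(table);
        if (rows != null && !rows.isEmpty()) {
            sink.insertAll(table, rows);
        }
    }

    public static void main(String[] args) {
        PerTableBatcher batcher = new PerTableBatcher(2,
                (table, rows) -> System.out.println(table + " <- " + rows));
        batcher.add("a", "row1");
        batcher.add("b", "row2");
        batcher.add("a", "row3"); // table events_a reaches the batch size and is sent
        batcher.flush();          // the remaining row for events_b is sent
    }
}
```

Buffering per table keeps each request addressed to a single destination, and the size limit stops one large bundle from turning into a single oversized request.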

Cloud Dataflow to BigQuery - too many sources

I have a job that, among other things, inserts some of the data it reads from files into a BigQuery table for later manual analysis.
It fails with the following error:
job error: Too many sources provided: 10001. Limit is 10000., error: Too many sources provided: 10001. Limit is 10000.
What does it refer to as "source"? Is it a file or a pipeline step?
Thanks,
G
I'm guessing the error is coming from BigQuery and means that we are trying to upload too many files when we create your output table.
Could you provide some more details on the error and its context (like a snippet of the command-line output, if using the BlockingDataflowPipelineRunner) so I can confirm? A jobId would also be helpful.
Is there something about your pipeline structure that is going to result in a large number of output files? That could either be a large amount of data or perhaps finely sharded input files without a subsequent GroupByKey operation (which would let us reshard the data into larger pieces).
The note in "In Google Cloud Dataflow BigQueryIO.Write occur Unknown Error (http code 500)" mitigates this issue:
Dataflow SDK for Java 1.x: as a workaround, you can enable this experiment: --experiments=enable_custom_bigquery_sink
In Dataflow SDK for Java 2.x, this behavior is default and no experiments are necessary.
Note that in both versions, temporary files in GCS may be left over if your job fails.
// Forces a GroupByKey by pairing every element with a null value and grouping on the element itself.
public static class ForceGroupBy<T> extends PTransform<PCollection<T>, PCollection<KV<T, Iterable<Void>>>> {
    private static final long serialVersionUID = 1L;

    @Override
    public PCollection<KV<T, Iterable<Void>>> apply(PCollection<T> input) {
        PCollection<KV<T, Void>> syntheticGroup = input.apply(
            ParDo.of(new DoFn<T, KV<T, Void>>() {
                private static final long serialVersionUID = 1L;

                @Override
                public void processElement(
                        DoFn<T, KV<T, Void>>.ProcessContext c)
                        throws Exception {
                    c.output(KV.of(c.element(), (Void) null));
                }
            }));
        return syntheticGroup.apply(GroupByKey.<T, Void>create());
    }
}

Resources