Amazon SWF for long running business workflows - amazon-swf

I'm evaluating Amazon SWF as an option to build a distributed workflow system. The main language will be Java, so the Flow framework is an obvious choice. There's just one thing that keeps puzzling me and I would get some other opinions before I can recommend it as a key component in our architecture:
The examples are all about tasks that produce a result after a deterministic, relatively short period of time (i.e. after some minutes). In our real-life business workflow, the matter looks different, here we have tasks that could take potentially weeks to complete. I checked the calculator already, having workflows that live 30 days or so do not lead to a cost explosion, so it seems they already counted in that possibility.
Did anybody use SWF for some scenario like this and can share any experience? Are there any recommendations, best practices how to design a workflow like this? Is Flow the right choice here?
It seems to me Activity implementations are expected to eventually return a value synchronously, however, for long running transactions we would rather use messages to send back worker results asynchronously.
Any helpful feedback is appreciated.

From the Amazon Simple Workflow Service point of view an activity execution is a pair of API calls: PollForActivityTask and RespondActivityTaskCompleted that share a task token. There is no requirement for those calls coming from the same thread, process or even host.
By default AWS Flow Framework executes an activity synchronously. Use #ManualActivityCompletion annotation to indicate that activity is not complete upon return of the activity method. Such activity should be explicitly completed (or failed) using provided ManualActivityCompletionClient.
Here is an example taken from the AWS Flow Framework Developer Guide:
#ManualActivityCompletion
public String getName() {
ActivityExecutionContext executionContext = contextProvider.getActivityExecutionContext();
String taskToken = executionContext.getTaskToken();
sendEmail("abc#xyz.com",
"Please provide a name for the greeting message and close task with token: " + taskToken);
return "This will not be returned to the caller";
}
public class CompleteActivityTask {
public void completeGetNameActivity(String taskToken) {
AmazonSimpleWorkflow swfClient = new AmazonSimpleWorkflowClient(…); //pass in user credentials
ManualActivityCompletionClientFactory manualCompletionClientFactory = new ManualActivityCompletionClientFactoryImpl(swfClient);
ManualActivityCompletionClient manualCompletionClient
= manualCompletionClientFactory.getClient(taskToken);
String result = "Hello World!";
manualCompletionClient.complete(result);
}
public void failGetNameActivity(String taskToken, Throwable failure) {
AmazonSimpleWorkflow swfClient
= new AmazonSimpleWorkflowClient(…); //pass in user credentials
ManualActivityCompletionClientFactory manualCompletionClientFactory
= new ManualActivityCompletionClientFactoryImpl(swfClient);
ManualActivityCompletionClient manualCompletionClient
= manualCompletionClientFactory.getClient(taskToken);
manualCompletionClient.fail(failure);
}
}
That an activity is implemented using #ManualActivityCompletion is an implementation detail. Workflow code calls it through the same interface and doesn't treat any differently than any activity implemented synchronously.

Related

Can an Azure Function be Executed for Multiple Environments

I've encountered a dependency injection scenario which I cannot find a way through.
We currently have an Azure function.
We are using dependency injection via the FunctionsStartup attribute.
That all works fine, until I get asked to make it work for multiple environments.
The tester found it too onerous to deploy to 7 different environments, so I was asked to re-jig the function so that it runs (in a loop) for those environments.
That means 7 different IConfigurations and somehow having 7 separate compartmentalised IOC registrations of services.
I can't think of a way of doing that, without significantly re-structuring the way abstractions are being resolved. Even if you set up registrations in a loop and inject an IEnumerable of a service, when it goes to resolve a child dependency, it just pulls the last one registered, rather than the one which was meant to correlate with the current item being iterated.
So, something like this (using Autofac):
Registration
foreach (var configuration in configurations)
{
containerBuilder.Register<ICosmosDbService<AccountUsage>>(sp =>
{
var dBConfig = CosmosDBHelper.GetProjectDatabaseConfig(configuration.Value, Project.Jupiter);
return CosmosClientInitializer<AccountUsage>.Initialize(dBConfig);
}).As<ICosmosDbService<AccountUsage>>();
}
Usage
private readonly IEnumerable<IAccountUsageService> _accountUsageService;
public JobScheduler(IEnumerable<IAccountUsageService> accountUsageService)
{
_accountUsageService = accountUsageService;
}
[FunctionName("JobScheduler")]
public async Task Run([TimerTrigger("0 */2 * * * *")] TimerInfo myTimer, ILogger log)
{
log.LogInformation($"Job Scheduler Timer trigger function executed at: {DateTime.Now}");
try
{
foreach (var usageService in _accountUsageService)
{
var logs = await usageService.GetCurrentAccountUsage("gfkjdsasjfa");
// ...
}
}
I realise this kind of DI usage is not ideal (and does not even work).
Is there a way to structure an Azure Function such that it can execute for different configurations in a compartmentalised manner? Or is this really just fighting against the technology?
You've got a couple of ways to do this - either inject the right dependencies into the function constructor, or resolve them dynamically using a service-locater type approach with a named instance.
Let's consider the second approach and what it would mean for your implementation. As you demonstrated, you'd be looping through your instances and resolving the dependency you want to use, then invoking it
foreach (var usageService in _accountUsageService)
{
var logs = await usageService.GetCurrentAccountUsage("named-instance");
logs.DoSomething();
}
This is technically possible, but now you're doing batch processing - you're doing more than once piece of work that's been triggered by a single event (the timer object), which means you have to deal with a couple of extra problems. What should you do if there's a failure with one of the instances, and what to do if one of the instances is running slowly?
Ideally, you want functions to do the smallest bit of work they can, and complete quickly - You don't want failure or slowness with one particular instance impacting the other instances. By breaking it down to the smallest piece of work (think, one event trigger does one piece of work) then you can take advantage of the functions runtime for things like retries on failures, and threading and concurrency is now being done for you by the runtime.
You could then think about a couple of ways you could do this. a) multiple function signatures and a service resolver approach, e.g.
public class JobScheduler
{
public JobScheduler(IEnumerable<IAccountUsageService> accountUsageService)
{
_accountUsageService = accountUsageService;
}
[FunctionName("FirstInstance")]
public Task FirstInstance([TimerTrigger("%MetricPoller:Schedule%")] TimerInfo myTimer)
{
var logs = await _accountUsageService.GetNamedInstance("instance-a");
logs.DoSomething();
}
[FunctionName("SecondInstance")]
public Task SecondInstance([TimerTrigger("%MetricPoller:Schedule%")] TimerInfo myTimer)
{
var logs = _accountUsageService.GetNamedInstance("instance-b");
logs.DoSomething();
}
}
or b), multiple classes with the necessary dependencies injected
public class JobSchedulerFirstInstance
{
public JobSchedulerFirstInstance(ILogs logs)
{
_logs = logs;
}
[FunctionName("FirstInstance")]
public Task FirstInstance([TimerTrigger("%MetricPoller:Schedule%")] TimerInfo myTimer)
{
_logs.DoSomething();
}
}
I'd personally lean towards multiple classes approach, and register named instances with my container. A bit of extra wire up work needed, but you'll end up with lots of small classes that all look very similar that are basically jus t plumbing that the functions runtime executes.

Periodic Work Request running immediately - Xamarin

I am trying to create a notification to be triggered on a schedule. I want the notification to show once a day around a set time (doesn't have to be exact). So what I'm trying to do is create a PeriodicWorkRequest that runs every 24 hours, and delay the start of the work request to the time of day I want it to run. I can get it running on a schedule, but whenever I create it, it just runs right away. I found some questions saying you can use .setInitialDelay() on the builder, however this doesn't seem to be available in the Xamarin Work Manager API. I have also tried .SetPeriodStartTime() on the Builder class after finding the method in the builder class definition, but this doesn't seem to be affecting anything.
MessagingCenter.Subscribe<StartNotificationRemindersMessage>(this, "StartNotificationRemindersMessage", message => {
TimeSpan startDelay = DateTime.Now.AddMinutes(2) - DateTime.Now;
PeriodicWorkRequest notificationsWorkRequest = PeriodicWorkRequest.Builder.From<NotificationsWorker>(1, TimeUnit.Days).SetPeriodStartTime(startDelay).Build();
WorkManager.Instance.Enqueue(notificationsWorkRequest);
});
Here I'm creating the periodic work request and trying to add a 2 minute delay to the start.
And here is the NotificationsWorker class.
class NotificationsWorker : Worker
{
public NotificationsWorker(Context context, WorkerParameters workerParameters) : base(context, workerParameters)
{
}
public override Result DoWork()
{
CrossLocalNotifications.Current.Show("GCS Reminder", "Testing reminder notifications");
return Result.InvokeSuccess();
}
}
If someone can show me what I'm doing wrong, it would be much appreciated.
Thanks!
There is an extension method named SetPeriodStartTime. It's extremely hard to find any documentation on it, but I've been experimenting with it and it appears to do the trick.
E.g.
PeriodicWorkRequest workRequest = PeriodicWorkRequest.Builder
.From<YourWorkerClass>(repeatInterval, flexInterval)
.SetPeriodStartTime(TimeSpan.FromHours(1))
.Build();
setInitialDelay() is now available on the builder in the Xamarin Work Manager API for Xamarin Android
I use setInitialDelay() to make sure PeriodicWorkRequest is delayed for a while you get a WorkRequest so Just cast it to a PeriodicWorkRequest
PeriodicWorkRequest notificationsWorkRequest = (PeriodicWorkRequest)PeriodicWorkRequest.Builder.From<YourWorkerClass>(repeatInterval,repeatIntervalTimeUnit).SetInitialDelay(repeatInterval,repeatIntervalTimeUnit).Build();
WorkManager.GetInstance(Android.App.Application.Context).Enqueue(notificationsWorkRequest );
Then to use it in Xamarin.Forms I use a dependency service, messaging center, or I just define it in the Android Main activity class depending on my reason for using work manager
example would be
PeriodicWorkRequest notificationsWorkRequest = (PeriodicWorkRequest) PeriodicWorkRequest.Builder.From<NotificationsWorker>(24, Java.Util.Concurrent.TimeUnit.Hours).SetInitialDelay(24, Java.Util.Concurrent.TimeUnit.Hours).Build();

Amazon SWF queries

Over the last couple of years, I have done a fair amount of work on Amazon SWF, but the following points are still unclear to me and I am not able to find any straight forward answers on any forums yet.
These are pretty basic requirements I suppose, sure others might have come across too. Would be great if someone can clarify these.
Is there a simple way to return a workflow execution result (maybe just something as simple as boolean) back to workflow starter?
Is there a way to catch Activity timeout exception, so that we can do run customised actions in such scenarios?
Why doesn't WorkflowExecutionHistory contains Activities, why just Events?
Why there is no simple way of restarting a workflow from the point it failed?
I am considering to use SWF for more business processes at my workplace, but these limitations/doubts are holding me back!
FINAL WORKING SOLUTION
public class ReturnResultActivityImpl implements ReturnResultActivity {
SettableFuture future;
public ReturnResultActivityImpl() {
}
public ReturnResultActivityImpl(SettableFuture future) {
this.future = future;
}
public void returnResult(WorkflowResult workflowResult) {
System.out.print("Marking future as Completed");
future.set(workflowResult);
}
}
public class WorkflowResult {
public WorkflowResult(boolean s, String n) {
this.success = s;
this.note = n;
}
private boolean success;
private String note;
}
public class WorkflowStarter {
#Autowired
ReturnResultActivityClient returnResultActivityClient;
#Autowired
DummyWorkflowClientExternalFactory dummyWorkflowClientExternalFactory;
#Autowired
AmazonSimpleWorkflowClient swfClient;
String domain = "test-domain;
boolean isRegister = true;
int days = 7;
int terminationTimeoutSeconds = 5000;
int threadPollCount = 2;
int taskExecutorThreadCount = 4;
public String testWorkflow() throws Exception {
SettableFuture<WorkflowResult> workflowResultFuture = SettableFuture.create();
String taskListName = "testTaskList-" + RandomStringUtils.randomAlphabetic(8);
ReturnResultActivity activity = new ReturnResultActivityImpl(workflowResultFuture);
SpringActivityWorker activityWorker = buildReturnResultActivityWorker(taskListName, Arrays.asList(activity));
DummyWorkflowClientExternalFactory factory = new DummyWorkflowClientExternalFactoryImpl(swfClient, domain);
factory.getClient().doSomething(taskListName)
WorkflowResult result = workflowResultSettableFuture.get(20, TimeUnit.SECONDS);
return "Call result note - " + result.getNote();
}
public SpringActivityWorker buildReturnResultActivityWorker(String taskListName, List activityImplementations)
throws Exception {
return setupActivityWorker(swfClient, domain, taskListName, isRegister, days, activityImplementations,
terminationTimeoutSeconds, threadPollCount, taskExecutorThreadCount);
}
}
public class Workflow {
#Autowired
private DummyActivityClient dummyActivityClient;
#Autowired
private ReturnResultActivityClient returnResultActivityClient;
#Override
public void doSomething(final String resultActivityTaskListName) {
Promise<Void> activityPromise = dummyActivityClient.dummyActivity();
returnResult(resultActivityTaskListName, activityPromise);
}
#Asynchronous
private void returnResult(final String taskListname, Promise waitFor) {
ActivitySchedulingOptions schedulingOptions = new ActivitySchedulingOptions();
schedulingOptions.setTaskList(taskListname);
WorkflowResult result = new WorkflowResult(true,"All successful");
returnResultActivityClient.returnResult(result, schedulingOptions);
}
}
The standard pattern is to host a special activity in the workflow starter process that is used to deliver the result. Use a process specific task list to make sure that it is routed to a correct instance of the starter. Here are the steps to implement it:
Define an activity to receive the result. For example "returnResultActivity". Make this activity implementation to complete the Future passed to its constructor upon execution.
When the workflow is started it receives "resultActivityTaskList" as an input argument. At the end the workflow calls this activity with a workflow result. The activity is scheduled on the passed task list.
The workflow starter creates an ActivityWorker and an instance of a Future. Then it creates an instance of "returnResultActivity" with that future as a constructor parameter.
Then it registers the activity instance with the activity worker and configures it to poll on a randomly generated task list name. Then it calls "start workflow execution" passing the generated task list name as an input argument.
Then it wait on the Future to complete. The future.get() is going to return the workflow result.
Yes, if you are using the AWS Flow Framework a timeout exception is thrown when activity is timed out. If you are not using the Flow framework than you are making your life 100 times harder. BTW the workflow timeout is thrown into a parent workflow as a timeout exception as well. It is not possible to catch a workflow timeout exception from within the timing out instance itself. In this case it is recommended to not rely on workflow timeout, but just create a timer that would fire and notify workflow logic that some business event has timed out.
Because a single activity execution has multiple events associated to it. It should be pretty easy to write code that converts history to whatever representation of activities you like. Such code would just match the events that relate to each activities. Each event always has a reference to the related events, so it is easy to roll them up into higher level representation.
Unfortunately there is no easy answer to this one. Ideally SWF would support restarting workflow by copying its history up to the failure point. But it is not supported. I personally believe that workflow should be written in a way that it never fails but always deals with failures without failing. Obviously it doesn't work in case of failures due to unexpected conditions. In this case writing workflow in a way that it can be restarted from the beginning is the simplest approach.

SideInputs kill dataflow performance

I'm using dataflow to generate a large amount of data.
I've tested two versions of my pipeline: one with a side input (of varying sizes), and other other without.
When I run the pipeline without the side input, my job will finish in about 7 minutes. When I run my job with the side input, my job will never finish.
Here's what my DoFn looks like:
public class MyDoFn extends DoFn<String, String> {
final PCollectionView<Map<String, Iterable<TreeMap<Long, Float>>>> pCollectionView;
final List<CSVRecord> stuff;
private Aggregator<Integer, Integer> dofnCounter =
createAggregator("DoFn Counter", new Sum.SumIntegerFn());
public MyDoFn(PCollectionView<Map<String, Iterable<TreeMap<Long, Float>>>> pcv, List<CSVRecord> m) {
this.pCollectionView = pcv;
this.stuff = m;
}
#Override
public void processElement(ProcessContext processContext) throws Exception {
Map<String, Iterable<TreeMap<Long, Float>>> pdata = processContext.sideInput(pCollectionView);
processContext.output(AnotherClass.generateData(stuff, pdata));
dofnCounter.addValue(1);
}
}
And here's what my pipeline looks like:
final Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).withValidation().create());
PCollection<KV<String, TreeMap<Long, Float>>> data;
data = p.apply(TextIO.Read.from("gs://where_the_files_are/*").named("Reading Data"))
.apply(ParDo.named("Parsing data").of(new DoFn<String, KV<String, TreeMap<Long, Float>>>() {
#Override
public void processElement(ProcessContext processContext) throws Exception {
// Parse some data
processContext.output(KV.of(key, value));
}
}));
final PCollectionView<Map<String, Iterable<TreeMap<Long, Float>>>> pcv =
data.apply(GroupByKey.<String, TreeMap<Long, Float>>create())
.apply(View.<String, Iterable<TreeMap<Long, Float>>>asMap());
DoFn<String, String> dofn = new MyDoFn(pcv, localList);
p.apply(TextIO.Read.from("gs://some_text.txt").named("Sizing"))
.apply(ParDo.named("Generating the Data").withSideInputs(pvc).of(dofn))
.apply(TextIO.Write.named("Write_out").to(outputFile));
p.run();
We've spent about two days trying various methods of getting this to work. We've narrowed it down to the inclusion of the side input. If the processContext is modified to not use the side input, it will still be very slow as long as it's included. If we don't call .withSideInput() it's very fast again.
Just to clarify, we've tested this on sideinput sizes from 20mb - 1.5gb.
Very grateful for any insight.
Edit
Including a few job ID's:
2016-01-20_14_31_12-1354600113427960103
2016-01-21_08_04_33-1642110636871153093 (latest)
Please try out the Dataflow SDK 1.5.0+, they should have addressed the known performance problems of your issue.
Side inputs in the Dataflow SDK 1.5.0+ use a new distributed format when running batch pipelines. Note that streaming pipelines and pipelines using older versions of the Dataflow SDK are still subject to re-reading the side input if the view can not be cached entirely in memory.
With the new format, we use an index to provide a block based lookup and caching strategy. Thus when looking into a list by index or looking into a map by key, only the block that contains said index or key will be loaded. Having a cache size which is greater than the working set size will aid in performance as frequently accessed indices/keys will not require re-reading the block they are contained in.
Side inputs in the Dataflow SDK can, indeed, introduce slowness if not used carefully. Most often, this happens when each worker has to re-read the entire side input per main input element.
You seem to be using a PCollectionView created via asMap. In this case, the entire side input PCollection must fit into memory of each worker. When needed, Dataflow SDK will copy this data on each worker to create such a map.
That said, the map on each worker may be created just once or multiple times, depending on several factors. If its size is small enough (usually less than 100 MB), it is likely that the map is read only once per worker and reused across elements and across bundles. However, if its size cannot fit into our cache (or something else evicts it from the cache), the entire map may be re-read again and again on each worker. This is, most often, the root-cause of the slowness.
The cache size is controllable via PipelineOptions as follows, but due to several important bugfixes, this should be used in version 1.3.0 and later only.
DataflowWorkerHarnessOptions opts = PipelineOptionsFactory.fromArgs(args).withValidation().create().cloneAs(DataflowWorkerHarnessOptions.class);
opts.setWorkerCacheMb(500);
Pipeline p = Pipeline.create(opts);
For the time being, the fix is to change the structure of the pipeline to avoid excessive re-reading. I cannot offer you a specific advice there, as you haven't shared enough information about your use case. (Please post a separate question if needed.)
We are actively working on a related feature we refer to as distributed side inputs. This will allow a lookup into the side input PCollection without constructing the entire map on the worker. It should significantly help performance in this and related cases. We expect to release this very shortly.
I didn't see anything particularly suspicious about the two jobs you have quoted. They've been cancelled relatively quickly.
I'm manually setting the cache size when creating the pipeline in the following manner:
DataflowWorkerHarnessOptions opts = PipelineOptionsFactory.fromArgs(args).withValidation().create().cloneAs(DataflowWorkerHarnessOptions.class);
opts.setWorkerCacheMb(500);
Pipeline p = Pipeline.create(opts);
for a side input of ~25mb, this speeds up the execution time considerably (job id 2016-01-25_08_42_52-657610385797048159) vs. creating a pipeline in the manner below (job id 2016-01-25_07_56_35-14864561652521586982)
PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
However, when the side input is increased to ~400mb, no increase in cache size improves performance. Theoretically, is all the memory indicated by the GCE machine type available for use by the worker? What would invalidate or evict something from the worker cache, forcing the re-read?

Is it possible to keep track of the state when processing logs by DataFlow?

Is it possible to have the DataFlow process maintain the state. There are log processing tools that allow for that by providing fast access (propriety / in-memory) files available for the real time process to keep track of the state on the logs while processing them.
A use case example would be with tracking registration steps taken by users. The registration steps would come in different logs and the data form those logs would be assembled by the real time process into one final database record (for each registered user) that is written to a database.
Can my DataFLow code keep track of the many registration steps (streaming input) by users and once user's registration steps are completed then have the DataFLow process write the records to the database (one record per user).
I don't know much about DataFlow architecture. It must be using some (proprietary / in-memory nosql) data storage for keeping track of things it needs to keep track of (ex. when it tries to produce top 100 customers). Is that fast access data storage also available to the DataFlow processes to use?
Thanks
As danielm said, state is not yet exposed. The good news is you may not need it for your use case.
If you have a PCollection<KV<UserId, LogEvent>> you can use a CombineFn and Combine.perKey to take all of the LogEvents for a specific UserId and combine them into a single output. The CombineFn tells Dataflow how to create an accumulator, update it by incorporating input elements, and then extract a final output. Transforms like Top actually use a CombineFn (with a Heap as the accumulator) rather than an actual state API.
If your events are of different types, you can still do something like this. For instance, if you have two logs, you can do:
PCollection<KV<UserId, LogEvent1>> events1 = ...;
PCollection<KV<UserId, LogEvent2>> events2 = ...;
// Create tuple tags for the value types in each collection.
final TupleTag<LogEvent1> tag1 = new TupleTag<LogEvent1>();
final TupleTag<LogEvent2> tag2 = new TupleTag<LogEvent2>();
//Merge collection values into a CoGbkResult collection
PCollection<KV<UserIf, CoGbkResult>> coGbkResultCollection =
KeyedPCollectionTuple.of(tag1, pt1)
.and(tag2, pt2)
.apply(CoGroupByKey.<UserId>create());
// Access results and do something.
PCollection<T> finalResultCollection =
coGbkResultCollection.apply(ParDo.of(
new DoFn<KV<K, CoGbkResult>, T>() {
#Override
public void processElement(ProcessContext c) {
KV<K, CoGbkResult> e = c.element();
// Get all LogEvent1 values
Iterable<LogEvent1> event1s = e.getValue().getAll(tag1);
// There will only be one LogEvent2
LogEvent2 event2 = e.getValue().getOnly(tag2);
... Do Something to compute T ....
c.output(...some T...);
}
}));
The above example was adapted from docs on CoGroupByKey which have information.
Dataflow does not currently expose the underlying state mechanism that it uses. However, this is definitely on the radar for a future update.

Resources