How to end a reactor event loop in ACE

I found two ways of ending a reactor event loop in ace:
1. ACE_Reactor::instance()->end_reactor_event_loop();
2. ACE_Reactor::instance()->close()
What is the difference between them? Which should I use?

It depends on what you want to do. Take a look at this documentation.
Basically, the difference between the two is:
end_reactor_event_loop stops the reactor from processing messages, but it doesn't free resources and doesn't drop any messages already in the queues.
close, on the other hand, does the above and also releases all the resources associated with the implementation behind ACE_Reactor::instance(), consequently dropping messages, deleting all sockets and handlers associated with the reactor, etc.
So you can choose one or the other depending on what you are doing. Beyond that, you would need to provide more details.


How are Dataflow bundles created after GroupBy/Combine?

Setup:
read from pubsub -> window of 30s -> group by user -> combine -> write to cloud datastore
Problem:
I'm seeing DataStoreIO writer errors as objects with similar keys are present in the same transaction.
Question:
I want to understand how my pipeline combines results into bundles after a group by/combine operation. I would expect the bundle to be created for every window after the combine. But apparently, a bundle can contain more than 2 occurrences of the same user?
Can re-execution (retries) of bundles cause this behavior?
Is this bundling dependent on the runner?
Is deduplication an option? If so, how would I best approach that?
Note that I'm not looking for a replacement for the datastore writer at the end of the pipeline, I already know that we can use a different strategy. I'm merely trying to understand how the bundling happens.
There are two answers to your question. One is specific to your use case, and the other is in general about bundling / windowing in streaming.
Specific to your pipeline
I am assuming that the 'key' for Datastore is the User ID? In that case, if you have events from the same user in more than one window, your GroupByKey or Combine operations will have one separate element for every pair of user+window.
So the question is: What are you trying to insert into datastore?
An individual user's resulting aggregate over all time? In that case, you'd need to use a Global Window.
A user's resulting aggregate for every 30 seconds in time? Then you need to use the window as part of the key you use to insert to datastore. Does that help / make sense?
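For the second option, here is a minimal sketch of what "use the window as part of the key" could look like. It is plain Python, not Datastore or Beam code: the helper names and the `user#timestamp` key format are made up for illustration, and it assumes 30-second fixed windows aligned to the Unix epoch and timezone-aware event times.

```python
from datetime import datetime, timezone

WINDOW_SECONDS = 30  # matches the 30s window from the question


def window_start(event_time: datetime, size: int = WINDOW_SECONDS) -> datetime:
    """Return the start of the epoch-aligned fixed window containing event_time."""
    epoch = int(event_time.timestamp())
    return datetime.fromtimestamp(epoch - epoch % size, tz=timezone.utc)


def datastore_key_name(user_id: str, event_time: datetime) -> str:
    """Hypothetical key name: one entity per (user, window) pair."""
    stamp = window_start(event_time).strftime("%Y%m%dT%H%M%SZ")
    return f"{user_id}#{stamp}"


# Two aggregates for the same user in different windows get different keys,
# so they no longer collide when they end up in the same write batch.
first = datastore_key_name("alice", datetime(2024, 1, 1, 12, 0, 29, tzinfo=timezone.utc))
second = datastore_key_name("alice", datetime(2024, 1, 1, 12, 0, 44, tzinfo=timezone.utc))
```

In a real pipeline you would take the window start from the window attached to the element rather than recomputing it from a timestamp.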
Happy to help you design your pipeline to do what you want. Chat with me in the comments or via SO chat.
The larger question about bundling of data
Bundling strategies will vary by runner. In Dataflow, you should consider the following two factors:
Every worker is assigned a key range. Elements for the same key will be processed by the same worker.
Windows are assigned per element, but a bundle may contain elements from multiple windows. As an example, if the data freshness metric makes a big jump*, a number of windows may be triggered, and elements of the same key in different windows would be processed in the same bundle.
* When can data freshness jump suddenly? A stream with a single element that has a very old timestamp and is very slow to process may hold the watermark for a long time. Once this element is processed, the watermark may jump a lot, to the next oldest element (check out this lecture on watermarks).
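To make the watermark jump concrete, here is a toy model in plain Python. It is not how Dataflow computes watermarks internally: it simply assumes the watermark is the oldest timestamp still pending and that a 30-second fixed window fires once the watermark reaches its end. Both function names are made up for the illustration.

```python
def watermark(pending_event_times):
    """Toy watermark: the oldest event time that has not been processed yet."""
    return min(pending_event_times)


def windows_fired(watermark_before, watermark_after, window_size=30):
    """End timestamps of fixed windows that close when the watermark advances."""
    first_end = (watermark_before // window_size + 1) * window_size
    return list(range(first_end, watermark_after + 1, window_size))


# One very old, slow element (t=5) holds the watermark back.
pending = [5, 121, 125, 130]
before = watermark(pending)

# Once it is processed, the watermark jumps to the next oldest element.
pending.remove(5)
after = watermark(pending)

# Every window that ended in between becomes ready at the same moment.
fired = windows_fired(before, after)
```

In this model four windows close in a single step, so the per-window results for one user can all become available together and land in the same bundle.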

Marking a key as complete in a GroupBy | Dataflow Streaming Pipeline

We want to submit unique GCS files to our streaming pipeline. Each file contains information about multiple events, and each event also contains a key (for example, device_id). As part of the processing, we want to shuffle by this device_id so as to achieve some form of worker-to-device_id affinity (more background on why we want to do it is in another SO question). Once all events from the same file are complete, we want to reduce (GroupBy) by their source GCS file (which we will make a property of the event itself, something like file_id) and finally write the output to GCS (could be multiple files).
The reason we want to do the final GroupBy is because we want to notify an external service once a specific input file has completed processing. The only problem with this approach is that since the data is shuffled by the device_id and then grouped at the end by the file_id, there is no way to guarantee that all data from a specific file_id has completed processing.
Is there something we could do about it? I understand that Dataflow provides exactly-once guarantees, which means all the events will eventually be processed, but is there a way to set a deterministic trigger to say all data for a specific key has been grouped?
EDIT
I wanted to highlight the broader problem we are facing here. The ability to mark file-level completeness would help us checkpoint different stages of the data as seen by external consumers. For example, this would allow us to trigger per-hour or per-day completeness, which is critical for us to generate reports for that window. Given that these stages/barriers (hour/day) are clearly defined on the input (GCS files are date/hour partitioned), it is only natural to expect the same of the output. But with Dataflow's model, this seems impossible.
Similarly, although Dataflow guarantees exactly-once, there will be cases where the entire pipeline needs to be restarted since something went horribly wrong - in those cases, it is almost impossible to restart from the correct input marker since there is no guarantee that what was already consumed has been completely flushed out. The DRAIN mode tries to achieve this but as mentioned, if the entire pipeline is messed up and draining itself cannot make progress, there is no way to know which part of the source should be the starting point.
We are considering using Spark since its micro-batch based streaming model seems to fit better. We would still like to explore Dataflow if possible, but it seems that we won't be able to achieve this without storing these checkpoints externally from within the application. If there is an alternative way of providing these guarantees from Dataflow, that would be great. The idea behind broadening this question was to see if we are missing an alternate perspective which would solve our problem.
Thanks
This is actually tricky. Neither Beam nor Dataflow has a notion of a per-key watermark, and it would be difficult to implement that level of granularity.
One idea would be to use a stateful DoFn instead of the second shuffle. This DoFn would need to receive the number of elements expected in the file (from either a side-input or some special value on the main input). Then it could count the number of elements it had processed, and only output that everything has been processed once it had seen that number of elements.
This would be assuming that the expected number of elements can be determined ahead of time, etc.
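The counting idea can be sketched in plain Python as below. This is not Beam code: `FileCompletionTracker` is a hypothetical stand-in for the per-key state a stateful DoFn would hold, and it assumes the expected count may arrive before or after the elements themselves.

```python
from collections import defaultdict


class FileCompletionTracker:
    """Toy stand-in for per-file state: counts elements against an expected total."""

    def __init__(self):
        self._expected = {}            # file_id -> expected number of elements
        self._seen = defaultdict(int)  # file_id -> elements processed so far
        self._announced = set()        # files already reported as complete

    def _check(self, file_id):
        """Return True exactly once, when the file first becomes complete."""
        expected = self._expected.get(file_id)
        if expected is None or file_id in self._announced:
            return False
        if self._seen[file_id] >= expected:
            self._announced.add(file_id)
            return True
        return False

    def set_expected(self, file_id, count):
        """Record the expected count (e.g. from a side input or marker element)."""
        self._expected[file_id] = count
        return self._check(file_id)

    def observe(self, file_id):
        """Record one processed element for file_id."""
        self._seen[file_id] += 1
        return self._check(file_id)
```

Whichever call returns True is the point where you would emit the "file complete" output and notify the external service.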

Use of serialized target queues for Concurrent queues in iOS

I was going through this excellent blog post (http://www.humancode.us/2014/08/14/target-queues.html) about target queues in iOS, and I could not help but wonder why we need such a mechanism. In the example, we are specifying a serialised target queue for a custom concurrent queue. Can we not achieve the same by executing the blocks from the original concurrent queue on a serialised queue instead?
What's the point of having a serialised target queue for a concurrent queue?
If I understood you correctly, you're asking why someone would run serial tasks on a concurrent queue.
You would need that kind of behaviour when most tasks on some resource can be performed concurrently (that is, simultaneously), but some tasks are, by nature, unsafe to perform concurrently with others.
The most common example is the readers/writers problem. Say you are accessing some resource on a file system. It's OK to read it even from different threads: every reader will get exactly what it needs. But then comes the need to update the contents of that file. Modifying it while someone reads it leads to unpredictable results: the reader is not guaranteed to get the right, expected data (it may be partially from the old version and partially from the new one). Even worse, there can be two writers (for example, if the file contents are changed both by the application user and from some central storage over the network), and the result will be some crazy mix of the two versions (it may even end up corrupted).
So each writer needs to wait until all other tasks have finished (no one reads, no one writes), and each reader needs to wait until no writing task is in progress (no one writes, no matter how many readers there are).
Wikipedia has a nice article on this one. I haven't run into any other practical situations where you would need this, but I believe there are more of them.
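The waiting rules above can be written down as a small readers-writer lock. This is a Python sketch using `threading.Condition`, not GCD code (with dispatch queues the same effect is usually obtained with barrier blocks), and `ReadWriteLock` is a name made up for the example. It is the simplest variant, so a steady stream of readers can starve a writer.

```python
import threading


class ReadWriteLock:
    """Many concurrent readers, or exactly one writer, never both."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0      # number of readers currently inside
        self._writing = False  # True while a writer is inside

    def acquire_read(self):
        with self._cond:
            # Readers only wait for an active writer, not for other readers.
            while self._writing:
                self._cond.wait()
            self._readers += 1

    def release_read(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()  # a waiting writer may proceed

    def acquire_write(self):
        with self._cond:
            # Writers wait until nobody reads and nobody writes.
            while self._writing or self._readers > 0:
                self._cond.wait()
            self._writing = True

    def release_write(self):
        with self._cond:
            self._writing = False
            self._cond.notify_all()  # wake waiting readers and writers
```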
Hope it answers your question

Using Asana events API for task monitoring

I'm trying to use the Asana events API to track changes in one of our projects, more specifically task movement between sections.
Our workflow is as follows:
We have a project divided into sections.
Each section represents a step in the process. When one step is done, the task is moved to the section below.
When a given task reaches a specific step we want to pass it to an external system. It doesn't have to be the full info - basic things + url would be enough.
My idea was to use https://asana.com/developers/api-reference/events to implement a pull-based mechanism to obtain recent changes in tasks.
My problems are:
The events API seems to generate a lot of information, but not the useful kind. Moving one single task between sections generates 3 events (2 "changed" actions, one "added" action marked as "system"). During work, many tasks will be moved between many sections, but I'm interested in only one specific section. How can I find items moved into that section? I know that there's a resource->text field, but it gives me something like moved from X to Y (ProjectName), which is probably a human-readable message that might change in the future.
According to the documentation, the resource key should contain task data, but the only info I see is id and name, which is not enough for my case. Is it possible to get hold of tags using the events API? Or any other data that would allow us to classify tasks in our system?
Can I listen for events for a specific section instead of tracking the whole project?
Ideas or suggestions are welcome. Thanks
In short:
Yes, answer below.
Yes, answer below.
Unfortunately not, sections are really tasks with a bit of extra functionality. Currently the API represents the relationship between sections and the tasks in them via the memberships field on a task and not the other way.
This should help you achieve what you are looking for, I think.
Let's say you have a project Ninja Pipeline with 2 sections Novice & Expert. Keep in mind, sections are really just tasks whose name ends with a : character with a few extra features in that tasks can belong to them.
Events "bubble up" from children to their parents; therefore, when you move the Wombat task in this project from the Novice section to Expert, you get 3 events. Starting from the top level and going down, they are:
The Ninja Pipeline project changed.
The Wombat task changed.
A story was added to the Wombat task.
For your use case, the most interesting event is the second one, about the task changing. What you really want to know is the value of the memberships field on the task now that it has changed. If the task is now a member of the section you are interested in, take action; otherwise ignore it.
By default, many resources in the API are represented in compact form which usually only includes the id & name. Use the input/output options in order to expand objects or select specific fields you need.
In this case your best bet is to include the query parameter opt_expand=resource when polling events on the project. This should expand all of the resource objects in the payload. For events of type: "task" then if resource.memberships[0].section.id=<id_of_the_section> is true, take action, otherwise ignore.
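As a rough sketch of that check, the filter below works on events that have already been fetched and decoded into Python dicts. The payload shape is an assumption based on the fields mentioned in this thread (type, action, resource.memberships, section.id), and `tasks_in_section` is a hypothetical helper, not part of any Asana client library. It looks at every membership instead of only the first one, in case the task belongs to more than one project.

```python
def tasks_in_section(events, section_id):
    """Return expanded task resources that are now members of section_id."""
    matches = []
    for event in events:
        # Only "task changed" events are interesting here; project-level
        # and story events are skipped.
        if event.get("type") != "task" or event.get("action") != "changed":
            continue
        resource = event.get("resource") or {}
        for membership in resource.get("memberships") or []:
            section = membership.get("section") or {}
            if section.get("id") == section_id:
                matches.append(resource)
                break
    return matches


# Made-up sample payload mirroring the three events described above.
sample_events = [
    {"type": "project", "action": "changed",
     "resource": {"id": 1, "name": "Ninja Pipeline"}},
    {"type": "task", "action": "changed",
     "resource": {"id": 2, "name": "Wombat",
                  "memberships": [{"project": {"id": 1},
                                   "section": {"id": 20, "name": "Expert:"}}]}},
    {"type": "story", "action": "added",
     "resource": {"id": 3}},
]
moved = tasks_in_section(sample_events, section_id=20)
```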

Event manager process in erlang. Named processes or Pids?

I have an event manager process that dispatches events to subscribers (e.g. http_session_created, http_session_destroyed). If a Pid is used instead of a named process, I must pass it into every function that works with the event manager, but if a named process is used, the code will be clearer.
Which variant is right?
Thank you!
While naming a process makes no actual difference to the process itself, registering it makes it global. In essence, you are telling the system that here is a global service which anyone can use.
From your description, it sounds more like you are giving them names to save the small effort of carrying them around in your loop. If this is the case, I would put their pids in a record with all the other state data you carry around. This indicates the type of the processes much better.
If you have a fixed set of "subscriber" processes, then use registered names IMO.
If, on the contrary, you have a publish/subscribe sort of architecture where subscribers come and go, then you need an infrastructure to track those and from this point it doesn't really matter if you use Pid() or names.
One of the drawbacks of using registered names is that you need to track them in your code base to avoid "collisions". So it is up to you: personally, I tend to favor named processes as, like you say, it makes reading the code clearer. One way or another, OTP doesn't care.
