How do I set the inputBuffer of an Akka Streams publisher?

I'm using Akka Streams in a context where sinks for a single source will come and go. For this reason I'm creating a publisher from a source and attaching subscribers as the need arises:
val publisher = mySource.runWith(Sink.publisher(true))
with
publisher.subscribe(subscriber1) // There will be others
Some of the subscribers will be faster than others, and I'd like to allow the faster ones to go ahead independently of the slowest, at least to the extent permitted by the input buffer of the publisher. This buffer is described by the comment on the Sink.publisher(true) method:
If fanout is true, the materialized Publisher will support multiple Subscribers and the size of the inputBuffer configured for this stage becomes the maximum number of elements that the fastest [[org.reactivestreams.Subscriber]] can be ahead of the slowest one before slowing the processing down due to back pressure.
My problem is that I don't know how to set this inputBuffer value "for this stage". The closest I have seen is described in the Dropping Broadcast section of this article but this seems to insist on the use of the Flow DSL. I believe that I can't use the DSL because of my need to continually attach new Subscribers.
As a result, my overall stream rate is held back by the slowest subscriber. A related goal is to make sure the different subscribers run on different threads (without creating explicit actors as subscribers).

It'd look something like (for Akka Streams 2.0.1):
Sink.asPublisher(true).addAttributes(Attributes.inputBuffer(initialSize, maxSize))

Related

Is it sufficient to set the ROS publisher buffer to 1 and the subscriber buffer to 1000 and still not lose any messages?

I am trying to understand subscriber and publisher buffers. If I set the subscriber buffer to 1000 and the publisher buffer to 1, is there any chance that I lose messages? Could anyone please explain this to me?
Yes, in theory you may lose messages with these settings; in practice it depends.
Theory: spinner threads
On both sides, publisher and subscriber, there are so-called spinner threads responsible for handling the callbacks (sending messages on the publisher side and evaluating them on the subscriber side). These spinner threads run in parallel with the main thread. If messages arrive from the main thread faster than the spinner thread can process them, up to queue-size messages are buffered, after which the oldest ones are thrown away. Therefore, if you publish at a very high rate, the publisher-side spinner thread might drop older messages, while if your callback function on the subscriber side takes too long to execute, your subscriber queue will start dropping messages. To improve this, you can use multi-threaded spinners, which increase the number of spinner threads and enable concurrency so that the callback queue is processed more quickly. Read more about it here.
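The drop-oldest behaviour described above can be illustrated with a small simulation. This is only a sketch in plain Python, not ROS code: DropOldestQueue and simulate are hypothetical names, and the "ticks" stand in for the relative speed of the main thread and the spinner thread.

```python
from collections import deque


class DropOldestQueue:
    """Fixed-size queue that discards the oldest message when it is full."""

    def __init__(self, queue_size):
        self._items = deque(maxlen=queue_size)
        self.dropped = 0

    def push(self, msg):
        # A deque with maxlen silently evicts the oldest entry on append,
        # so count the eviction before it happens.
        if len(self._items) == self._items.maxlen:
            self.dropped += 1
        self._items.append(msg)

    def pop(self):
        return self._items.popleft() if self._items else None


def simulate(queue_size, produced_per_tick, consumed_per_tick, ticks):
    """Produce and consume at fixed rates; return (received, dropped)."""
    queue = DropOldestQueue(queue_size)
    received = []
    next_id = 0
    for _ in range(ticks):
        for _ in range(produced_per_tick):
            queue.push(next_id)
            next_id += 1
        for _ in range(consumed_per_tick):
            msg = queue.pop()
            if msg is not None:
                received.append(msg)
    return received, queue.dropped
```

With a queue size of 1 and a producer three times faster than the consumer, only the newest message of each burst survives; with a queue size of 1000 nothing is dropped over the same run, but the consumer falls further and further behind.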
Practice: Choosing the queue size
The publisher queue size you should set depends on the rate at which you publish and on whether you publish in bursts. If you publish in bursts or at higher frequencies (e.g. > 10 Hz), a publisher queue size of 1 won't be sufficient. On the subscriber side it is harder to give recommendations, as it also depends on how long the callback takes to process the information.
It is also possible to set the queue size to 0, which results in an arbitrarily large queue, but this can be problematic because the required memory might grow indefinitely, at least until your computer freezes. Furthermore, a large queue size is often a disadvantage: if the callback takes long to execute, you might be working on very outdated data while the queue gets longer and longer.
Alternative communication patterns
If you want to guarantee that information is actually being processed (e.g. real-time or safety-relevant information), ROS topics are probably the wrong choice. Depending on what precisely you need, the other two communication methods, services or actions, might be an alternative. But for things like large streams of safety-relevant real-time data, there is no perfect communication mechanism in ROS1.

In Dataflow with PubsubIO is there any possibility of late data in a global window?

I was going to start developing programs in Google cloud Pubsub. Just wanted to confirm this once.
From the beam documentation the data loss can only occur if data was declared late by Pubsub. Is it safe to assume that the data will always be delivered without any message drops (Late data) when using a global window?
From the concepts of watermark and lateness I have come to a conclusion that these metrics are critical in conditions where custom windowing is applied over the data being received with event based triggers.
When you're working with streaming data, choosing a global window basically means that you are going to completely ignore event time. Instead, you will be taking snapshots of your data in processing time (that is, as they arrive) using triggers. Therefore, you can no longer define data as "late" (nor as "early" or "on time", for that matter).
You should choose this approach if you are not interested in the time at which these events actually happened but, instead, you just want to group them according to the order in which they were observed. I would suggest that you go through this great article on streaming data processing, especially the part under When/Where: Processing-time windows which includes some nice visuals comparing different windowing strategies.

Marking a key as complete in a GroupBy | Dataflow Streaming Pipeline

To our streaming pipeline, we want to submit unique GCS files, each file containing information on multiple events, and each event containing a key (for example, device_id). As part of the processing, we want to shuffle by this device_id so as to achieve some form of worker-to-device_id affinity (more background on why we want to do this is in this other SO question). Once all events from the same file are complete, we want to reduce (GroupBy) by their source GCS file (which we will make a property of the event itself, something like file_id) and finally write the output to GCS (could be multiple files).
The reason we want to do the final GroupBy is because we want to notify an external service once a specific input file has completed processing. The only problem with this approach is that since the data is shuffled by the device_id and then grouped at the end by the file_id, there is no way to guarantee that all data from a specific file_id has completed processing.
Is there something we could do about it? I understand that Dataflow provides exactly_once guarantees which means all the events will be eventually processed but is there a way to set a deterministic trigger to say all data for a specific key has been grouped?
EDIT
I wanted to highlight the broader problem we are facing here. The ability to mark file-level completeness would help us checkpoint different stages of the data as seen by external consumers. For example, this would allow us to trigger per-hour or per-day completeness, which is critical for us to generate reports for that window. Given that these stages/barriers (hour/day) are clearly defined on the input (GCS files are date/hour partitioned), it is only natural to expect the same of the output. But with Dataflow's model, this seems impossible.
Similarly, although Dataflow guarantees exactly-once, there will be cases where the entire pipeline needs to be restarted since something went horribly wrong - in those cases, it is almost impossible to restart from the correct input marker since there is no guarantee that what was already consumed has been completely flushed out. The DRAIN mode tries to achieve this but as mentioned, if the entire pipeline is messed up and draining itself cannot make progress, there is no way to know which part of the source should be the starting point.
We are considering using Spark since its micro-batch based streaming model seems to fit better. We would still like to explore Dataflow if possible, but it seems that we won't be able to achieve this without storing these checkpoints externally from within the application. If there is an alternative way of providing these guarantees from Dataflow, it would be great. The idea behind broadening this question was to see if we are missing an alternate perspective which would solve our problem.
Thanks
This is actually tricky. Neither Beam nor Dataflow has a notion of a per-key watermark, and it would be difficult to implement that level of granularity.
One idea would be to use a stateful DoFn instead of the second shuffle. This DoFn would need to receive the number of elements expected in the file (from either a side-input or some special value on the main input). Then it could count the number of elements it had processed, and only output that everything has been processed once it had seen that number of elements.
This would be assuming that the expected number of elements can be determined ahead of time, etc.
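The counting logic that such a stateful DoFn would hold per file can be sketched outside of Beam. This is a plain-Python illustration only, not Beam API: FileCompletionTracker and its methods are hypothetical names, and in a real pipeline the two dictionaries would live in per-key state. The sketch assumes each element is delivered exactly once and that the expected count may arrive before or after the elements.

```python
class FileCompletionTracker:
    """Tracks, per file_id, how many elements were seen versus expected."""

    def __init__(self):
        self._expected = {}  # file_id -> expected element count
        self._seen = {}      # file_id -> elements processed so far

    def set_expected(self, file_id, count):
        """Record the expected count; return True if the file is already complete."""
        self._expected[file_id] = count
        return self._is_complete(file_id)

    def observe(self, file_id):
        """Count one processed element; return True when the file becomes complete."""
        self._seen[file_id] = self._seen.get(file_id, 0) + 1
        return self._is_complete(file_id)

    def _is_complete(self, file_id):
        expected = self._expected.get(file_id)
        return expected is not None and self._seen.get(file_id, 0) == expected
```

The notification to the external service would be emitted at the single point where either method returns True for a given file_id.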

Making use of workers in Custom Sink

I have a custom sink which will publish the final result from a pipeline to a repository.
I am getting the inputs for this pipeline from BigQuery and GCS.
The custom writer in the sink is invoked on all workers. The custom writer just collects the objects to be pushed and returns them as part of the WriteResult. Finally, I merge these records in CustomWriteOperation.finalize() and push them into my repository.
This works fine for smaller files. But my repository will not accept a result larger than 5 MB. It also accepts no more than 20 writes per hour.
If I push the result from the workers, then the limit on the number of writes will be violated. If I write it in CustomWriteOperation.finalize(), then it may violate the size limit, i.e. 5 MB.
The current approach is to write in chunks in CustomWriteOperation.finalize(). As this is not executed on many workers, it might delay my job. How can I make use of workers in finalize(), and how can I specify the number of workers to be used inside a pipeline for a specific job (i.e., the write job)?
Or is there any better approach?
The Sink API doesn't explicitly allow tuning of bundle size.
One workaround might be to use a ParDo to group records into bundles. For example, you can use a DoFn to randomly assign each record a key between 1 and N. You could then use a GroupByKey to group the records into KV<Integer, Iterable<Records>>. This should produce N groups of roughly the same size.
As a result, an invocation of Sink.Writer.write could write all the records with the same key at once and since write is invoked in parallel the bundles would be written in parallel.
However, since a given KV pair could be processed multiple times or in multiple workers at the same time, you will need to implement some mechanism to create a lock so that you only try to write each group of records once.
You will also need to handle failures and retries.
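The random-key grouping described above can be sketched without any Dataflow dependency. The following is a plain-Python illustration, not Dataflow code: assign_random_keys and group_by_key are hypothetical helpers standing in for the DoFn and the GroupByKey, and the locking and retry handling mentioned above are deliberately left out.

```python
import random
from collections import defaultdict


def assign_random_keys(records, num_groups, seed=None):
    """Pair each record with a random key in 1..num_groups (the DoFn step)."""
    rng = random.Random(seed)
    return [(rng.randint(1, num_groups), record) for record in records]


def group_by_key(keyed_records):
    """Collect records that share a key (the GroupByKey step)."""
    groups = defaultdict(list)
    for key, record in keyed_records:
        groups[key].append(record)
    return dict(groups)
```

Each resulting group would then be handed to one write call, so the number of writes is bounded by N while the groups stay roughly equal in size.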
So, if I understand correctly, you have a repository that:
accepts no more than X write operations per hour (I suppose if you try to do more, you get an error from the API you're writing to), and
limits each write operation to at most Y in size (with similar error reporting).
That means it is not possible to write more than X*Y data in 1 hour, so I suppose, if you want to write more than that, you would want your pipeline to wait longer than 1 hour.
Dataflow currently does not provide built-in support for enforcing either of these limits. However, it seems like you should be able to simply do retries with randomized exponential back-off to get around the first limitation (here's a good discussion), so it only remains to make sure individual writes are not too big.
Limiting individual writes can be done in your Writer class in the custom sink. You can maintain a buffer of records and have write() add to the buffer, flushing it by issuing the API call (with exponential back-off, as mentioned) when the next record would push it past the allowed write size, and flush one more time in close().
This way you will write bundles that are as big as possible but not bigger, and if you add retry logic, throttling will also be respected.
Overall, this seems to fit well in the Sink API.
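The buffering writer described above might look roughly like this. It is a plain-Python sketch rather than the Dataflow Sink API: BufferingWriter, the send callable, and the retry parameters are all assumptions for illustration, and send is assumed to raise an exception when the repository rejects a write.

```python
import random
import time


class BufferingWriter:
    """Buffers records and flushes them in writes of at most max_bytes."""

    def __init__(self, send, max_bytes, max_retries=5, base_delay=1.0,
                 sleep=time.sleep):
        self._send = send              # hypothetical call to the repository API
        self._max_bytes = max_bytes
        self._max_retries = max_retries
        self._base_delay = base_delay
        self._sleep = sleep            # injectable so tests need not wait
        self._buffer = []
        self._size = 0

    def write(self, record):
        if len(record) > self._max_bytes:
            raise ValueError("single record exceeds the write size limit")
        # Flush first if adding this record would exceed the limit.
        if self._buffer and self._size + len(record) > self._max_bytes:
            self.flush()
        self._buffer.append(record)
        self._size += len(record)

    def flush(self):
        if not self._buffer:
            return
        payload = b"".join(self._buffer)
        for attempt in range(self._max_retries):
            try:
                self._send(payload)
                break
            except Exception:  # a real writer would catch only throttling errors
                if attempt == self._max_retries - 1:
                    raise
                # Randomized exponential back-off: base * 2^attempt, with jitter.
                delay = self._base_delay * (2 ** attempt) * random.uniform(0.5, 1.5)
                self._sleep(delay)
        self._buffer = []
        self._size = 0

    def close(self):
        self.flush()
```

With a 10-byte limit, writing records of 4, 4, 4 and 2 bytes and then closing yields two API calls of 8 and 6 bytes.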
I am working with Sam on this, and here are the actual limits imposed by our target system: 100 GB per API call, and a maximum of 25 API calls per day.
Given these limits, the retry method with back-off logic may cause the upload to take many days to complete, since we don't have control over the number of workers.
Another approach would be to leverage FileBasedSink to write many files in parallel. Once all these files are written, finalize (or copyToOutputFiles) can combine files until total size reaches 100 GB and push to target system. This way we leverage parallelization from writer threads, and honor the limit from target system.
Thoughts on this, or any other ideas?
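As a rough illustration of the combining step proposed above, files can be packed greedily, in order, into batches that stay under the per-call limit. This is a plain-Python sketch: batch_files is a hypothetical helper, and the sizes are assumed to be known up front.

```python
def batch_files(files, max_batch_bytes):
    """Group (name, size) pairs, in order, into batches of at most max_batch_bytes."""
    batches = []
    current = []
    current_size = 0
    for name, size in files:
        if size > max_batch_bytes:
            raise ValueError(f"{name} alone exceeds the per-call limit")
        # Start a new batch if this file would push the current one over the limit.
        if current and current_size + size > max_batch_bytes:
            batches.append(current)
            current = []
            current_size = 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches
```

The number of batches is the number of API calls needed, which can be checked against the daily call limit before any upload starts.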

Use it for JSON data transfer

I am trying to use RabbitMQ for a distributed system that would work something like:
a producer puts in a queue a JSON-formatted list of order ids
several consumers pull from that queue, run the business logic on those order ids, and put the result (also JSON-formatted) into another queue
from the second queue, another consumer will take the data and pass it back to the caller
I am still very new to RabbitMQ and I am wondering if this model is the right approach, given that the data should come back as fast as possible (sometimes in a matter of seconds, 5 at most), so there are real-time requirements.
Also, how large can a message passed to a queue be? The JSON that the producer will get back will be fairly large, depending on what the consumer does.
Thanks for any ideas!
See page 47 in this presentation (InfoQ) for a great comparison between different messaging formats.
There's nothing wrong with the design you suggested.
The slight wrinkle is that enforcing "real time requirements" isn't straightforward. For instance, it's not currently possible to expire messages within a queue, so this would need to be handled by the clients when consuming messages.
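One way a client could handle expiry itself is to stamp each message when it is produced and discard stale ones on consumption. This is a minimal sketch in plain Python, not RabbitMQ client code: the sent_at field, the function names, and the 5-second default are assumptions, and it presumes producer and consumer clocks are reasonably in sync.

```python
import json
import time


def make_message(order_ids, now=None):
    """Serialize a list of order ids together with the time it was produced."""
    sent_at = time.time() if now is None else now
    return json.dumps({"sent_at": sent_at, "order_ids": order_ids})


def consume(body, max_age_seconds=5.0, now=None):
    """Return the order ids, or None if the message is older than max_age_seconds."""
    message = json.loads(body)
    current = time.time() if now is None else now
    if current - message["sent_at"] > max_age_seconds:
        return None  # too old to be useful to the caller; skip it
    return message["order_ids"]
```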
The total size of messages in RabbitMQ <=1.8.1 was bounded by the amount of available RAM. As of 2.0.0, it's bounded by the amount of available disk space (i.e. rabbit will page messages to disk if it's running low on memory). Individual message sizes are recorded as 32-bit integers (IIRC), so individual messages cannot be larger than ~4GB; if this is a problem, consider saving the JSONs to network storage and passing some ID to them in the messages. Other than this, there aren't any constraints.
