How to make use of x-stream-offset when consuming a RabbitMQ Stream to keep track of what has been processed?

The official documentation states that one can use the x-stream-offset consumer argument to specify where to start reading the Stream from. But it doesn't say where to get these values from.
For example, what does timestamp mean here? The point in time when the chunk was inserted into the stream? If one is reading messages from the Stream, is there a way to get this timestamp to store it so that when the process is restarted it can continue from where it left off?
Same with the numerical offset value... how does a consumer know the current offset when reading messages from the stream?

The offset of each message can be retrieved from its headers, usually as:
message.properties.headers["x-stream-offset"]
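For example, with the RabbitMQ Java client, the starting point is passed as a consumer argument and the current offset is read back from each delivery's headers. A minimal sketch (the queue name and starting offset are hypothetical):
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import java.util.Map;
public class StreamOffsetExample {
    public static void main(String[] args) throws Exception {
        Connection conn = new ConnectionFactory().newConnection();
        Channel channel = conn.createChannel();
        channel.basicQos(100); // stream queues require a prefetch count with manual acks
        // "x-stream-offset" also accepts "first", "last", "next", or a java.util.Date;
        // the timestamp form starts from the first chunk stored at or after that time.
        Map<String, Object> consumeArgs = Map.of("x-stream-offset", 42L);
        channel.basicConsume("my-stream", false, consumeArgs,
            (consumerTag, delivery) -> {
                long offset = (Long) delivery.getProperties().getHeaders().get("x-stream-offset");
                // Persist this offset durably; on restart, pass offset + 1 to resume
                // from where the consumer left off.
                channel.basicAck(delivery.getEnvelope().getDeliveryTag(), false);
            },
            consumerTag -> { });
    }
}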

Related

Dataflow to process late and out-of-order data for batch and stream messages?

My company receives both batch and stream-based event data. I want to process the data using Google Cloud Dataflow over a predictable time period. However, I realize that in some instances the data comes late or out of order. How can Dataflow be used to handle late or out-of-order data?
This is a homework question, and I would like to know which single answer below is correct.
a. Set a single global window to capture all data
b. Set sliding window to capture all the lagged data
c. Use watermark and timestamps to capture the lagged data
d. Ensure every datasource type (stream or batch) has a timestamp, and use the timestamps to define the logic for lagged data.
My reasoning: I believe 'C' is the answer, but a watermark is actually different from late data, so please confirm. Also, since the question mentions both batch and stream data, I wonder if 'D' could be the answer, since 'batch' (bounded collection) mode doesn't have timestamps unless they come from the source or are set programmatically. So I am a bit confused about the answer.
Please help. I am a non-native English speaker, so I may have missed some cues in the question.
How to use Dataflow to handle late or out of order
This is a big question. I will try to give some simple explanations and also provide some resources that might help you understand.
Bounded data collection
You have already gotten a sense of it: bounded data does not have a lateness problem. By the nature of bounded data, the full data set can be read before the pipeline starts.
Unbounded data collection
Your 'C' is correct, and a watermark is different from late data. In implementation, a watermark is a monotonically increasing timestamp. When Beam/Dataflow sees a record with an event timestamp earlier than the watermark, the record is treated as late data (this is only conceptual; you might want to check [1] for a more detailed discussion).
Here are [1] through [4] as references for this topic:
https://docs.google.com/document/d/12r7frmxNickxB5tbpuEh_n35_IJeVZn1peOrBrhhP6Y/edit#heading=h.7a03n7d5mf6g
https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102
https://www.oreilly.com/library/view/streaming-systems/9781491983867/
https://docs.google.com/presentation/d/1ln5KndBTiskEOGa1QmYSCq16YWO9Dtmj7ZwzjU7SsW4/edit#slide=id.g19b6635698_3_4
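To make the mechanics concrete, here is a minimal Beam Java sketch (the Event type, window size, and lateness are illustrative assumptions): the watermark firing produces the on-time pane, anything arriving within the allowed lateness produces additional late panes, and anything later is dropped.
import org.apache.beam.sdk.transforms.windowing.AfterPane;
import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;
PCollection<Event> windowed = events.apply(
    Window.<Event>into(FixedWindows.of(Duration.standardMinutes(5)))
        // Fire once when the watermark passes the end of the window...
        .triggering(AfterWatermark.pastEndOfWindow()
            // ...and once more for each element that arrives late but
            // within the allowed lateness below.
            .withLateFirings(AfterPane.elementCountAtLeast(1)))
        .withAllowedLateness(Duration.standardHours(1))
        .accumulatingFiredPanes());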
B and C may be the answer.
With sliding windows you have the order of the data, so if you receive the data in position 9 and you don't receive the data in position 8, you know that data 8 is delayed and can wait for it. The problem is that if the latest data is delayed, you can't know it is delayed, and you lose it. https://en.wikipedia.org/wiki/Sliding_window_protocol
A watermark waits a period of time for the lagged data; if that time passes and the data doesn't arrive, you lose this data.
So the answer is C, because B says to "capture all the lagged data", while C does not use the word all.

Apache BEAM pipeline IllegalArgumentException - Timestamp skew?

I have an existing BEAM pipeline that is handling the data ingested (from Google Pubsub topic) by 2 routes. The 'hot' path does some basic transformation and stores them in Datastore, while the 'cold' path performs fixed hourly windowing for deeper analysis before storage.
So far the pipeline had been running fine, until I started to do some local buffering on the data before publishing to Pubsub (so data arriving at Pubsub may be a few hours 'late'). The error that gets thrown is as below:
java.lang.IllegalArgumentException: Cannot output with timestamp 2018-06-19T14:00:56.862Z. Output timestamps must be no earlier than the timestamp of the current input (2018-06-19T14:01:01.862Z) minus the allowed skew (0 milliseconds). See the DoFn#getAllowedTimestampSkew() Javadoc for details on changing the allowed skew.
at org.apache.beam.runners.core.SimpleDoFnRunner$DoFnProcessContext.checkTimestamp(SimpleDoFnRunner.java:463)
at org.apache.beam.runners.core.SimpleDoFnRunner$DoFnProcessContext.outputWithTimestamp(SimpleDoFnRunner.java:429)
at org.apache.beam.sdk.transforms.WithTimestamps$AddTimestampsDoFn.processElement(WithTimestamps.java:138)
It seems to be referencing the section of my code (the WithTimestamps call) that performs the hourly windowing, as below:
Window<KV<String, Data>> window = Window.<KV<String, Data>>into(
        FixedWindows.of(Duration.standardHours(1)))
    .triggering(Repeatedly.forever(pastEndOfWindow()))
    .withAllowedLateness(Duration.standardSeconds(10))
    .discardingFiredPanes();
PCollection<KV<String, List<Data>>> keyToDataList = eData.apply("Add Event Timestamp", WithTimestamps.of(new EventTimestampFunction()))
.apply("Windowing", window)
.apply("Group by Key", GroupByKey.create())
.apply("Sort by date", ParDo.of(new SortDataFn()));
I'm not sure I understand exactly what I've done wrong here. Is it because the data is arriving late that the error is thrown? As I understand it, if data arrives past the allowed lateness it should be discarded, not throw an error like the one I'm seeing.
Would setting an unlimited timestamp skew resolve this? The data that's late can be exempt from analysis; I just need to ensure that errors don't get thrown that would choke the pipeline. There's also nowhere else where I add or change the timestamps for the data, so I'm not sure why the errors are thrown.
It looks like your DoFn is using outputWithTimestamp and you are trying to set a timestamp which is older than the input element's timestamp. Timestamps of output elements are typically derived from the inputs; this is important to ensure the correctness of the watermark computation.
You may be able to work around this by increasing both the allowed timestamp skew and the window's allowed lateness; however, some data may be lost, and it is up to you to determine whether such loss is acceptable in your scenario.
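For reference, with the (deprecated, for exactly this data-loss reason) skew API that appears in your stack trace, that workaround would look something like the sketch below, reusing the names from your snippet; the six-hour figure is only a guess at your buffering window:
eData.apply("Add Event Timestamp",
    WithTimestamps.of(new EventTimestampFunction())
        // Allow output timestamps up to six hours earlier than the element's
        // current (Pubsub arrival) timestamp.
        .withAllowedTimestampSkew(Duration.standardHours(6)))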
Another alternative is not to use outputWithTimestamp and instead use the Pubsub message timestamp to process each message. Then output each element as a KV, where the RealTimestamp is computed in the same way you currently compute the timestamp (just don't use it in WithTimestamps), GroupByKey, and write the KVs to Datastore.
Other questions you can ask yourself are:
Why are the input elements associated with a more recent timestamp than the output elements?
Do you really need to buffer that much data before publishing to Pubsub?

Control rate of individual topic consumption in Kafka Streams 0.9.1.0-cp1?

I am trying to backprocess data in Kafka topics using a Kafka Streams application that involves a join. One of the streams to be joined has a much larger volume of data per unit of time in the corresponding topic. I would like to control the consumption from the individual topics so that I get roughly the same event timestamps from each topic in a single consumer.poll(). However, there doesn't appear to be any way to control the behavior of the KafkaConsumer backing the source stream. Is there any way around this? Any insight would be appreciated.
Currently Kafka cannot rate-limit either producers or consumers.
Refer:
https://cwiki.apache.org/confluence/display/KAFKA/KIP-13+-+Quotas
But if you are using Apache Spark as the stream processing platform, you can limit the input rate for the Kafka receivers.
On the consumer side you can use the consume([num_messages=1][, timeout=-1])
function (from the confluent-kafka Python client) instead of poll.
consume([num_messages=1][, timeout=-1]):
Consumes a list of messages (possibly empty on timeout). Callbacks may be executed as a side effect of calling this method.
The application must check the returned Message object’s Message.error() method to distinguish between proper messages (error() returns None) and errors for each Message in the list (see error().code() for specifics). If the enable.partition.eof configuration property is set to True, partition EOF events will also be exposed as Messages with error().code() set to _PARTITION_EOF.
num_messages (int) – The maximum number of messages to return (default: 1).
timeout (float) – The maximum time to block waiting for message, event or callback (default: infinite (-1)). (Seconds)
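If you can drop down to a plain KafkaConsumer (outside Kafka Streams), another workaround is to pause and resume partitions yourself so both topics advance together in event time. A rough Java sketch, assuming a consumer recent enough to expose record timestamps (0.10+), already subscribed to the hypothetical topics "fast-topic" and "slow-topic":
import java.util.List;
import java.util.stream.Collectors;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.common.TopicPartition;
long fastTime = Long.MIN_VALUE; // highest event timestamp seen on the big topic
long slowTime = Long.MIN_VALUE; // highest event timestamp seen on the small topic
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(100);
    for (ConsumerRecord<String, String> r : records) {
        if (r.topic().equals("fast-topic")) fastTime = Math.max(fastTime, r.timestamp());
        else slowTime = Math.max(slowTime, r.timestamp());
        process(r); // hypothetical per-record handler
    }
    List<TopicPartition> fast = consumer.assignment().stream()
        .filter(tp -> tp.topic().equals("fast-topic"))
        .collect(Collectors.toList());
    // Pause the high-volume topic while it runs more than a minute ahead
    // in event time; resume it once the other topic catches up.
    if (fastTime - slowTime > 60_000) consumer.pause(fast);
    else consumer.resume(fast);
}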

Why is it not safe to use Socket.ReceiveLength?

Well, even Embarcadero states that it is not guaranteed to return an accurate count of the bytes ready to read from the socket buffer. But if you look at it, when you pass -1 to Socket.ReceiveBuf (which is what ReceiveLength wraps), it calls ioctlsocket with FIONREAD to determine the amount of data pending in the network's input buffer that can be read from the socket.
So how is that unsafe or bad?
e.g: ioctlsocket(Socket.SocketHandle, FIONREAD, Longint(i));
The documentation you mention specifically says (emphasis mine)
Note: ReceiveLength is not guaranteed to be accurate for streaming socket connections.
This means that the length is not known ahead of time because the data is being supplied by a stream. Obviously, if you don't know ahead of time how big the data being sent is, you can't properly set the length the client should expect.
Consider it like generic code to copy a file. If you don't know ahead of time how big the file is you'll be copying, you can't predict how many bytes you'll be copying. In the case of the socket, the stream size that's supplying the socket isn't known in advance (for instance, for data being generated real-time and sent), so there's no way to inform the client socket how much to expect.
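The usual remedy is not to ask the socket how much data is pending, but to have the sender frame each message with an explicit length prefix, so the reader always knows how many bytes to wait for. A minimal Java sketch of that idea (the framing convention is an assumption, not part of the Delphi API discussed above):
import java.io.DataInputStream;
import java.io.IOException;
import java.net.Socket;
static byte[] readMessage(Socket socket) throws IOException {
    DataInputStream in = new DataInputStream(socket.getInputStream());
    int length = in.readInt();     // the sender wrote the payload size first
    byte[] payload = new byte[length];
    in.readFully(payload);         // blocks until all `length` bytes have arrived
    return payload;
}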

Using PARSE on a PORT! value

I tried using PARSE on a PORT! and it does not work:
>> parse open %test-data.r [to end]
** Script error: parse does not allow port! for its input argument
Of course, it works if you read the data in:
>> parse read open %test-data.r [to end]
== true
...but it seems it would be useful to be able to use PARSE on large files without first loading them into memory.
Is there a reason why PARSE couldn't work on a PORT! ... or is it merely not implemented yet?
The easy answer is: no, we can't...
The way PARSE works, it may need to roll back to a prior part of the input string, which might in fact be the head of the complete input, when it meets the last character of the stream.
Ports copy their data to a string buffer as they get their input, so in fact there is never any "prior" string for PARSE to roll back to. It's like quantum physics... just by looking at it, it's not there anymore.
But as you know, in Rebol... "no" isn't an answer. ;-)
That being said, there is a way to parse data from a port as it's being grabbed, but it's a bit more work.
What you do is use a buffer, and
APPEND buffer COPY/part connection amount
Depending on your data, amount could be 1 byte or 1 KB; use whatever makes sense.
Once the new input is added to your buffer, parse it, and add logic to know whether you matched part of that buffer.
If something positively matched, you REMOVE/part what matched from the buffer, and continue parsing until nothing parses.
You then repeat the above until you reach the end of the input.
I've used this in a real-time EDI TCP server which has an "always on" TCP port, in order to break up a (potentially) continuous stream of input data which actually piggybacks messages end to end.
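For illustration, here is the same buffer-and-consume loop sketched in Java (the newline delimiter and handler are hypothetical; the Rebol version uses APPEND, PARSE, and REMOVE/part as described above):
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
static void parseStream(InputStream in) throws IOException {
    StringBuilder buffer = new StringBuilder();
    byte[] chunk = new byte[1024];            // like COPY/part connection 1024
    int n;
    while ((n = in.read(chunk)) != -1) {      // loop until the "port" closes
        buffer.append(new String(chunk, 0, n, StandardCharsets.UTF_8));
        int end;
        // A newline marks a complete message here; EDI would use its own delimiter.
        while ((end = buffer.indexOf("\n")) >= 0) {
            handle(buffer.substring(0, end)); // consume the matched message...
            buffer.delete(0, end + 1);        // ...and keep the unmatched tail
        }
    }
}
static void handle(String message) { System.out.println(message); } // hypothetical handler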
details
The best way to set up this system is to use /no-wait and loop until the port closes (you receive none instead of "").
Also make sure you have a way of checking for data-integrity problems (like a skipped byte or an erroneous message) while you are parsing; otherwise, you will never reach the end.
In my system, when the buffer grew beyond a specific size, I tried an alternate rule which skipped bytes until a pattern might be found further down the stream. If one was found, an error was logged, the partial message was stored, and an alert was raised for the sysadmin to sort out the message.
HTH !
I think that Maxim's answer is good enough. At the moment, PARSE on a port is not implemented. I don't think it's impossible to implement later, but we must solve other issues first.
Also, as Maxim says, you can do it even now, but it depends very much on what exactly you want to do.
You can certainly parse large files without needing to read them completely into memory. It's always good to know what you expect to parse. For example, all large files, like music and video files, are divided into chunks, so you can just use copy|seek to get these chunks and parse them.
Or if you want to get just the titles of multiple web pages, you can read, say, the first 1024 bytes and look for the title tag there; if that fails, read more bytes and try again...
That's exactly what would have to be done to allow PARSE on a port natively anyway.
And feel free to add a WISH in the CureCode database: http://curecode.org/rebol3/
