Can input and output topics be the same for a Pulsar Function?

Is it an anti-pattern to have the same input and output topic when using Pulsar functions?
In my case, I have been using just one topic, with a Cassandra sink consuming its messages. I was thinking of creating a function that reads messages from this topic and sends the transformed messages back to the same topic. The sink would then write to Cassandra only the processed messages, because only those respect the schema.
Is this a bad practice?

I would not recommend it. You would need to filter out the transformed messages in your function, otherwise you would get an infinite loop, and the sink would also have to filter out the raw messages. Both filters are a waste of resources.
It would be far better to have distinct topics for the raw and the transformed messages. Is there something preventing you from doing that?
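For illustration, here is a minimal sketch of the two-topic layout using the Pulsar Functions Python SDK (the topic names and the transformation are placeholders, not taken from the question):

from pulsar import Function

class TransformFunction(Function):
    # Receives raw messages from the input topic; whatever process() returns
    # is published to the function's output topic.
    def process(self, input, context):
        # Placeholder transformation: shape the record so that it matches
        # the schema the Cassandra sink expects.
        return input.strip().upper()

The function would then be deployed with distinct input and output topics, for example:

pulsar-admin functions create \
  --py transform.py --classname transform.TransformFunction \
  --inputs persistent://public/default/raw-events \
  --output persistent://public/default/clean-events

and the Cassandra sink would be configured on clean-events only, so neither the function nor the sink has to filter anything.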

Related

Can a wireshark capture filter scan for two different patterns?

I need a capture filter that looks for a 4-byte machine ID that can occur in several places in the UDP payload. Specifically,
udp[18:4]==0x76123AA6 or udp[20:4]==0x76123AA6 or udp[25:4]==0x76123AA6
Experimenting with pieces of this filter does locate some desired packets, but if I use the above filter, some packets are not passed through the filter as expected.
Do the no-loop rules of the BPF interpreter prevent this kind of multiple-match filtering?
I found the reason I seemed to be missing some packets: the search patterns must be listed in the capture filter in order of increasing offset.
For example, this filter will potentially miss some packets:
udp[18:4]==0x12345678 or udp[24:4]==0x12345678 or udp[20:4]==0x12345678
because the last pattern listed (20:4) refers to an earlier offset in the data than a pattern listed before it (24:4).
In order to work as expected the filter must be written like this:
udp[18:4]==0x12345678 or udp[20:4]==0x12345678 or udp[24:4]==0x12345678
I had unknowingly written the filter correctly in the original question; my actual code had the patterns in the wrong order.
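For reference, a small sketch of applying such a filter programmatically (Scapy here; the interface name and capture count are arbitrary placeholders), with the offsets already listed in ascending order as described above:

from scapy.all import sniff

# BPF capture filter with the byte offsets in ascending order (18, 20, 25)
machine_id_filter = (
    "udp[18:4]==0x76123AA6 or "
    "udp[20:4]==0x76123AA6 or "
    "udp[25:4]==0x76123AA6"
)

# Capture ten matching packets on eth0 and print a one-line summary of each
packets = sniff(iface="eth0", filter=machine_id_filter, count=10)
packets.summary()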

General principle to implement node-based workflow as seen in Unreal, Blender, Alteryx and the like?

This topic is difficult to Google, because of "node" (not node.js), and "graph" (no, I'm not trying to make charts).
Despite being a pretty well-rounded and experienced developer, I can't piece together a mental model of how these sorts of editors get data from node to node in a sensible way and in a sensible order. Especially in the Alteryx case, where a Sort module, for example, needs its entire upstream dataset before proceeding, and some nodes can send a single output to multiple downstream consumers.
I was able to understand trees and what not in my old data structures course back in the day, and successfully understand and adapt the basic graph concepts from https://www.python.org/doc/essays/graphs/ in a real project. But that was a static structure and data weren't being passed from node to node.
Where should I be starting, and what concept am I missing, that I could use to implement something like this? Something that lets users chain together some boxes to slice and dice text files or data records with basic operations like sort and join? I'm using C#, but the answer ought to be language-independent.
This paradigm is called Dataflow Programming: it works with streams of data that are passed from instruction to instruction to be processed.
Dataflow programs can be written in textual or visual form, and besides the software you have mentioned there are many other programs that include some sort of dataflow language.
To create your own dataflow language you have to:
Create program modules or objects that represent your processing nodes, each implementing a different kind of data processing. Processing nodes usually have one or more data inputs and one or more data outputs, and implement some data-processing algorithm internally. Nodes may also have control inputs that determine how a given node processes its data. A typical dataflow algorithm calculates an output sample from one or many input stream values, as FIR filters do, for example. A processing algorithm can also feed output values back into its inputs, as in IIR filters, or accumulate values in some way before calculating an output value.
Create a standard API for passing data between processing nodes. It can differ for different kinds of data and control signals, but it must be standard, because processing nodes have to 'understand' each other. Data is usually passed as plain values. Control signals can be plain values, events, or a more advanced control language, depending on your needs.
Create a mechanism to link your nodes and pass data between them. You can build your own machinery or use standard facilities such as pipes, message queues, etc. For example, this can be implemented as a tree-like structure whose nodes are your processing nodes, each holding references to the next nodes and to the particular inputs that receive data from the current node's output.
Create some kind of node iterator that starts at the beginning of the dataflow graph and visits each processing node, where it:
provides the next input data values
invokes the node's data-processing methods
updates the node's output values
passes the updated output values to the inputs of downstream processing nodes
Create a tool for configuring node parameters and the links between nodes. It can be a simple text file edited with a text editor, or a sophisticated visual editor with a GUI for drawing the dataflow graph.
Regarding your note about the Sort module in Alteryx: perhaps the data values are simply accumulated inside the module and then sorted.
Here you can find an even more detailed description of dataflow programming languages.
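To make the steps above concrete, here is a minimal sketch in Python (all class and function names are invented for illustration): nodes expose a process() method, links are plain references between nodes, and a naive scheduler pushes each node's output to the input ports of downstream nodes, firing a node once all of its ports are filled.

class Node:
    inputs = ()                       # names of this node's input ports

    def __init__(self):
        self.downstream = []          # (node, input_name) pairs fed by this node's output
        self.received = {}            # input values that have arrived so far

    def connect(self, node, input_name):
        self.downstream.append((node, input_name))

    def process(self, **kwargs):
        raise NotImplementedError

class Source(Node):
    def __init__(self, values):
        super().__init__()
        self.values = values

    def process(self):
        return list(self.values)

class Sort(Node):
    inputs = ("data",)

    def process(self, data):
        # Needs its entire upstream dataset before it can emit anything,
        # matching the observation about the Alteryx Sort module above.
        return sorted(data)

class Join(Node):
    inputs = ("left", "right")

    def process(self, left, right):
        return list(zip(left, right))

def run(sources):
    # Naive scheduler: evaluate the sources, push each output to downstream
    # input ports, and fire a node once all of its declared ports are full.
    pending, last = list(sources), None
    while pending:
        node = pending.pop(0)
        last = node.process(**node.received)
        for child, port in node.downstream:
            child.received[port] = last
            if all(p in child.received for p in child.inputs):
                pending.append(child)
    return last

numbers, labels = Source([3, 1, 2]), Source(["a", "b", "c"])
sort, join = Sort(), Join()
numbers.connect(sort, "data")
sort.connect(join, "left")
labels.connect(join, "right")
print(run([numbers, labels]))   # [(1, 'a'), (2, 'b'), (3, 'c')]

A real implementation would add cycle detection, per-port type checks, and a proper topological ordering, but the structure (nodes, typed ports, links, and an iterator that drives data through the graph) is the same one described in the list above.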

How to structure a Dask application that processes a fixed number of inputs from a queue?

We have a requirement to implement the following. Given a Redis channel that will provide a known number of messages:
For each message consumed from the channel:
Get a JSON document from Redis
Parse the JSON document, extracting a list of result objects
Aggregate across all result objects to produce a single result
We would like to distribute both steps 1 and 2 across many workers, and avoid collecting all results into memory. We would also like to display progress bars for both steps.
However, we can't see a nice way to structure the application so that we can see progress and keep work moving through the system without blocking at inopportune times.
For example, in step 1 if we read from the Redis channel into a queue then we can pass the queue to Dask, in which case we start processing each message as it comes in without waiting for all messages. However, we can't see a way to show progress if we use a queue (presumably because a queue typically has an unknown size?)
If we collect from the Redis channel into a list and pass this to Dask then we can see progress, but we have to wait for all messages from Redis before we can start processing the first one.
Is there a recommended way to approach this kind of problem?
If your Redis channels are safe for concurrent access, then you might submit many futures, each pulling an element from the channel. These would run on different machines.
from dask.distributed import Client, progress
client = Client(...)  # connect to the distributed scheduler
# one future per expected message; pure=False because each call returns different data
futures = [client.submit(pull_from_redis_channel, ..., pure=False) for _ in range(n_items)]
futures2 = client.map(process, futures)   # parse each document as soon as it arrives
progress(futures2)                        # progress bar over the parsing stage
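To also cover the aggregation step without gathering everything at once, one option (a sketch, assuming your partial results can be merged pairwise by some combine function you supply) is to fold the parsed results as their futures complete:

from dask.distributed import as_completed

total = None
for future in as_completed(futures2):
    partial = future.result()          # one parsed document's results
    # combine() is a hypothetical pairwise reduction you would provide
    total = partial if total is None else combine(total, partial)

This keeps only the running aggregate and one partial result in local memory at a time.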

Best Practice ETL with Dataflow and Lookup

What's the best practice for implementing a standard streaming ETL process that writes fact tables and some smaller dimension tables to BigQuery?
I'm trying to understand how to handle the following things:
How to do a simple dimension lookup in a streaming pipeline?
In case the answer is sideInput - how to handle lookups for values that don't exist yet in the dimension? How to update the sideInput?
When side inputs receive late data on a specific window, they will be recomputed. If you do the lookup after this, then you'll be able to see the element in the side input.
Currently, the Beam model does not include semantics for re-triggering the ParDo that consumes the side input, so you'd need to somehow make sure to (re)do the lookup after the side input has been recomputed.
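As a rough illustration of the side-input approach (Beam Python SDK; the Pub/Sub topics, the JSON message shapes, and the 60-second windows are assumptions, not anything prescribed by Beam or by the question):

import json
import apache_beam as beam
from apache_beam.transforms import window

def dim_to_kv(raw_bytes):
    # Placeholder parser: dimension updates arrive as JSON {"key": ..., "name": ...}
    record = json.loads(raw_bytes)
    return record["key"], record["name"]

with beam.Pipeline() as p:
    dims = (p
            | "ReadDims" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/dim-updates")
            | "DimKV" >> beam.Map(dim_to_kv)
            | "DimWindow" >> beam.WindowInto(window.FixedWindows(60)))

    facts = (p
             | "ReadFacts" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/facts")
             | "FactJson" >> beam.Map(json.loads)
             | "FactWindow" >> beam.WindowInto(window.FixedWindows(60)))

    enriched = facts | "Lookup" >> beam.Map(
        # The side input is materialized per window, so a key that arrives in the
        # dimension stream later becomes visible to lookups in later windows.
        lambda fact, dim_map: {**fact, "dim_name": dim_map.get(fact["dim_key"], "UNKNOWN")},
        dim_map=beam.pvalue.AsDict(dims))

A lookup that misses can either fall back to a default like the "UNKNOWN" above or be routed to a separate output and retried in a later window, in line with the caveat about re-triggering described above.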

Control rate of individual topic consumption in Kafka Streams 0.9.1.0-cp1?

I am trying to backprocess data in Kafka topics using a Kafka Streams application that involves a join. One of the streams to be joined has a much larger volume of data per unit of time in its topic. I would like to control consumption from the individual topics so that I get roughly the same event timestamps from each topic in a single consumer.poll(). However, there doesn't appear to be any way to control the behavior of the KafkaConsumer backing the source stream. Is there any way around this? Any insight would be appreciated.
Currently, Kafka itself does not provide a way to rate-limit either producers or consumers.
Refer:
https://cwiki.apache.org/confluence/display/KAFKA/KIP-13+-+Quotas
But if you are using Apache Spark as the stream processing platform, you can limit the input rate for the Kafka receivers.
On the consumer side, you can use the confluent-kafka Python client's consume([num_messages=1][, timeout=-1]) function instead of poll:
consume([num_messages=1][, timeout=-1]):
Consumes a list of messages (possibly empty on timeout). Callbacks may be executed as a side effect of calling this method.
The application must check the returned Message object’s Message.error() method to distinguish between proper messages (error() returns None) and errors for each Message in the list (see error().code() for specifics). If the enable.partition.eof configuration property is set to True, partition EOF events will also be exposed as Messages with error().code() set to _PARTITION_EOF.
num_messages (int) – The maximum number of messages to return (default: 1).
timeout (float) – The maximum time to block waiting for message, event or callback (default: infinite (-1)). (Seconds)
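A short sketch with the confluent-kafka Python client (the broker address, group id, topic names, batch size, and handle() are all placeholders) showing consume() with a bounded batch instead of poll():

from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "backfill-join",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["small-volume-topic", "large-volume-topic"])

while True:
    # Pull at most 100 messages per call; an empty list means the timeout expired.
    msgs = consumer.consume(num_messages=100, timeout=1.0)
    for msg in msgs:
        if msg.error():
            continue          # e.g. _PARTITION_EOF when enable.partition.eof is set
        handle(msg)           # placeholder for the join/backfill logic

Bounding each batch this way lets the application decide how much it reads from the faster topic before taking more from the slower one, although the broker itself still does not enforce any rate limit.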

Resources