Is there an alternative choice for job.coordinator.system? - apache-samza

I want to use Samza, but in our case Kafka topic creation is restricted (topic creation has to be reviewed and must have a concrete purpose).
So, is there any other choice for "job.coordinator.system"? I would also need an introduction to how it is used.
Thanks a lot!

As of Samza 0.12.0, Samza auto-creates 3 categories of streams:
Coordinator stream to durably store job model information, e.g. the host affinity mapping.
Checkpoint stream to store checkpoints for consumption of input streams
Changelog streams for any stores with changelog enabled
It is very common to use Kafka log-compacted topics for each of these, but theoretically any System implementation would work. You can even define your own.
Some details on Samza's stream abstractions can be found here:
http://samza.apache.org/learn/documentation/0.12/container/streams.html
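For concreteness, here is a hedged sketch of what that could look like in the job's properties file, assuming a hypothetical custom system factory called com.example.MyDurableSystemFactory (not a real class); the property names follow the 0.12-era configuration reference, so verify them against your Samza version:

```
# Hypothetical system backed by something other than Kafka; the factory class
# is a placeholder for your own org.apache.samza.system.SystemFactory
# implementation, and it must provide durable, log-compacted-style streams.
systems.my-durable-system.samza.factory=com.example.MyDurableSystemFactory

# Point the coordinator stream at that system instead of Kafka.
job.coordinator.system=my-durable-system

# The checkpoint and changelog streams can be redirected the same way
# (a non-Kafka checkpoint system also needs a matching checkpoint manager).
task.checkpoint.system=my-durable-system
stores.my-store.changelog=my-durable-system.my-store-changelog
```

Note that this does not remove the need for durable storage somewhere; it only moves it off Kafka. If the concern is purely the topic-creation review process, it may be simpler to get the handful of internal topics approved once.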

Related

Why are read-only nodes called read-only in the case of data store replication?

I was going through the article, https://learn.microsoft.com/en-us/azure/architecture/patterns/cqrs which says, "If separate read and write databases are used, they must be kept in sync". One obvious benefit I can understand from having separate read replicas is that they can be scaled horizontally. However, I have some doubts:
It says, "Updating the database and publishing the event must occur in a single transaction". My understanding is that there is no guarantee that the updated data will be available immediately on the read-only nodes because it depends on when the event will be consumed by the read-only nodes. Did I get it correctly?
Data must first be written to the read-only nodes before it can be read, i.e. write operations are also performed on the read-only nodes. Why are they called read-only nodes? Is it because the write operations on these nodes are performed not directly by the data-producing application, but rather by some serverless function (e.g. AWS Lambda or Azure Function) that picks up the event from the topic (e.g. a Kafka topic) to which the write-only node has sent the event?
Is the data sharded across the read-only nodes or does every read-only node have the complete set of data?
All of these have "it depends"-like answers...
Yes, usually, although some implementations might choose to (try to) update read models transactionally with the update. With multiple nodes you're quickly forced to learn the CAP theorem, though, and so in many CQRS contexts, eventual consistency is just accepted as a feature, as the gains from tolerating it usually significantly outweigh the losses.
I suspect the bit you quoted actually refers to updating the write store and publishing the event in a single transaction. Even this can be difficult to achieve, and it is one of the problems event sourcing seeks to solve (see the outbox-style sketch after these answers).
Yes. It's trivially obvious - in this context - that data must be written before it can be read, but your apps as consumers of the data see them as read-only.
Both are valid outcomes. Usually this part is less an application concern and is more delegated to the capabilities of your chosen read-model infrastructure (Mongo, Cosmos, Dynamo, etc).
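To make the "single transaction" point concrete, here is a minimal outbox-style sketch in Java using plain JDBC. The table names, the OrderWriteService class, and the separate relay process are illustrative placeholders, not something prescribed by the article: the write-model row and the event row are committed atomically, and a relay later ships outbox rows to the topic the read side consumes, which is exactly where the eventual consistency seen by the read replicas comes from.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import javax.sql.DataSource;

// Illustrative "transactional outbox" sketch: the write model and the event
// are committed in ONE database transaction; a separate relay process later
// reads the outbox table and publishes the events (e.g. to a Kafka topic)
// that the read-model updaters consume.
public class OrderWriteService {
    private final DataSource writeDb; // assumed to point at the write database

    public OrderWriteService(DataSource writeDb) {
        this.writeDb = writeDb;
    }

    public void placeOrder(String orderId, String payload) throws SQLException {
        try (Connection conn = writeDb.getConnection()) {
            conn.setAutoCommit(false);
            try (PreparedStatement order = conn.prepareStatement(
                     "INSERT INTO orders(id, payload) VALUES (?, ?)");
                 PreparedStatement event = conn.prepareStatement(
                     "INSERT INTO outbox(aggregate_id, event_type, payload) VALUES (?, ?, ?)")) {
                order.setString(1, orderId);
                order.setString(2, payload);
                order.executeUpdate();

                event.setString(1, orderId);
                event.setString(2, "OrderPlaced");
                event.setString(3, payload);
                event.executeUpdate();

                conn.commit(); // state change and event become visible atomically
            } catch (SQLException e) {
                conn.rollback();
                throw e;
            }
        }
    }
}
```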

Source stream outside scdf

I have read the documentation and the following scenario is not clear to me:
the producer is outside SCDF while the processor and sink are inside.
In all the examples provided, the three components are inside.
From my point of view there are two possible solutions:
The producer outside SCDF produces messages to the topic configured in SCDF
There is another binder outside SCDF, and the processor/sink connect to that external binder
If somebody could provide a sample it would be very useful
It is not entirely clear from your question what you mean by “outside SCDF”. I assume you are referring to existing code that you want to use as a source for an SCDF stream. It may already produce messages using a supported messaging middleware, for example it writes to a Kafka topic, or it can be modified to do so, but for some reason you cannot use SCDF to manage its deployment. The simplest way to do this is to use the source topic as a named destination in your stream definition: :my-topic > processor | sink
https://dataflow.spring.io/docs/feature-guides/streams/named-destinations/
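As a rough sample of the first option combined with the named destination above: the external application writes to the topic with a plain Kafka client, and the SCDF-managed processor/sink pick it up via :my-topic. The broker address and payload here are placeholders.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

// External (non-SCDF) producer writing to the topic that the stream
// ":my-topic > processor | sink" consumes as a named destination.
public class ExternalSource {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // same broker the SCDF apps bind to
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("my-topic", "some-key", "{\"hello\":\"scdf\"}"));
        }
    }
}
```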

MQTT cluster information

I am quite confused about MQTT clustering. It doesn't seem to be part of the MQTT protocol, and I was wondering if each MQTT broker implementation has its own way to implement it. Also, do you know what kind of information is shared between cluster nodes? It seems like the cluster retains information related to the pub/sub session but not the messages. Is that correct? Thanks!
No, there is nothing in the MQTT protocol about clustering brokers. There is support for bridging topics between two brokers, but this is purely at the message level; it carries no information about clients or sessions.
Any clustering is implemented independently by a given broker, and what information is shared also depends on that implementation, but it would need to include the following:
Client Session information, including subscriptions
Messages (e.g. retained messages and messages queued for offline/persistent sessions)
Information about which messages have been delivered to which clients
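As an aside on the bridging mentioned above (which is not clustering), a minimal Mosquitto-style bridge configuration looks roughly like this; the connection name, hostname, and topic pattern are placeholders, and other brokers have their own bridge syntax:

```
# mosquitto.conf fragment: forward messages on sensors/# to a second broker.
# Only messages are bridged; no client or session state is shared.
connection bridge-to-remote
address remote-broker.example.com:1883
topic sensors/# both 0
```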

how data is stored in the storage device in IPFS

I am going through the concept of IPFS, and one of the important aspects of IPFS is Bitswap, which basically deals with how blocks of data are requested using wantlists.
My questions are about what happens once a peer gets the wantlists from other peers:
how does it actually fetch the data from the actual storage device?
What are the steps involved in it?
How does the conversion happen with respect to different storage protocols, based on the Bitswap requests?
Please help me with these answers.
I'm still learning, so questions like this are a good opportunity to dig deeper :)
how does it actually fetch the data from the actual storage device?
What are the steps involved in it?
Based on the Bitswap API docs, it looks like Bitswap operates on a provided libp2p instance and a blockstore instance.
The blockstore instance is an abstraction over the actual data storage, which could be a software abstraction of almost anything: a storage service like S3, a virtualized device, or a real device.
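For intuition only (this is not the actual js-ipfs API), the blockstore contract Bitswap is handed boils down to something like the hypothetical interface below, with the concrete backend (filesystem repo, object store, block device, ...) hidden behind the implementation:

```java
import java.util.Optional;

// Hypothetical sketch of a content-addressed blockstore contract.
interface Blockstore {
    Optional<byte[]> get(String cid);   // fetch a block by its content identifier
    void put(String cid, byte[] block); // persist a block under its CID
    boolean has(String cid);            // lets Bitswap answer wantlist queries cheaply
}
```

When a peer's wantlist arrives, Bitswap conceptually just asks the blockstore for each wanted CID and sends back whatever blocks it has.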
Based on the configuration bits I've read, fetching could be done over whichever transports the libp2p instance was configured with and any connected node also supports (on a per-node basis).
Assuming multiple transports are supported on both ends between two nodes, I don't know how the best connection is negotiated/dictated by libp2p...
How does the conversion happen with respect to different storage protocols, based on the Bitswap requests?
IIUC, at the block level there would not be any conversion happening - that would happen at a higher level in the stack (IPLD).
I read through these to get a better understanding:
Bitswap spec
JS-IPFS Bitswap implementation
JS-IPFS Blockservice

Apache Samza local storage - OrientDB / Neo4J graph instead of KV store

Apache Samza uses RocksDB as the storage engine for local storage. This allows for stateful stream processing and here's a very good overview.
My use case:
I have multiple streams of events that I wish to process taken from a system such as Apache Kafka.
These events create state - the state I wish to track is based on previous messages received.
I wish to generate new stream events based on the calculated state.
The input stream events are highly connected and a graph such as OrientDB / Neo4J is the ideal medium for querying the data to create the new stream events.
My question:
Is it possible to use a non-KV store as the local storage for Samza? Has anyone ever done this with OrientDB / Neo4J and is anyone aware of an example?
I've been evaluating Samza and I'm by no means an expert, but I'd recommend reading the official documentation, and even reading through the source code; other than the fact that it's in Scala, it's remarkably approachable.
In this particular case, toward the bottom of the documentation's page on State Management you have this:
Other storage engines
Samza’s fault-tolerance mechanism (sending a local store’s writes to a replicated changelog) is completely decoupled from the storage engine’s data structures and query APIs. While a key-value storage engine is good for general-purpose processing, you can easily add your own storage engines for other types of queries by implementing the StorageEngine interface. Samza’s model is especially amenable to embedded storage engines, which run as a library in the same process as the stream task.
Some ideas for other storage engines that could be useful: a persistent heap (for running top-N queries), approximate algorithms such as bloom filters and hyperloglog, or full-text indexes such as Lucene. (Patches accepted!)
I actually read through the code for the default StorageEngine implementation about two weeks ago to gain a better sense of how it works. I definitely don't know enough to say much intelligently about it, but I can point you at it:
https://github.com/apache/samza/tree/master/samza-kv-rocksdb/src/main/scala/org/apache/samza/storage/kv
https://github.com/apache/samza/tree/master/samza-kv/src/main/scala/org/apache/samza/storage/kv
The major implementation concerns seem to be:
Logging all changes to a topic so that the store's state can be restored if a task fails.
Restoring the store's state in a performant manner.
Batching writes and caching frequent reads in order to save on trips to the raw store.
Reporting metrics about the use of the store.
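As a rough illustration of what such a custom engine might look like, here is a skeleton that hides an embedded graph database behind Samza's StorageEngine interface. The method signatures follow my reading of the 0.12-era interface, and EmbeddedGraphDb is a placeholder for whatever embedded graph library you wrap, so treat this as a sketch and check the sources linked above:

```java
import java.util.Iterator;

import org.apache.samza.storage.StorageEngine;
import org.apache.samza.system.IncomingMessageEnvelope;

// Sketch only: signatures are from the 0.12-era StorageEngine interface,
// and EmbeddedGraphDb stands in for an embedded graph library
// (OrientDB, embedded Neo4j, ...).
public class GraphStorageEngine implements StorageEngine {
    private final EmbeddedGraphDb graph; // hypothetical embedded graph handle

    public GraphStorageEngine(EmbeddedGraphDb graph) {
        this.graph = graph;
    }

    @Override
    public void restore(Iterator<IncomingMessageEnvelope> envelopes) {
        // Replay the changelog topic to rebuild the local graph after a restart.
        while (envelopes.hasNext()) {
            IncomingMessageEnvelope envelope = envelopes.next();
            graph.applyChange(envelope.getKey(), envelope.getMessage());
        }
    }

    @Override
    public void flush() {
        graph.commit(); // make pending writes durable locally
    }

    @Override
    public void stop() {
        graph.close();
    }
}

// Placeholder wrapper around the embedded graph database.
interface EmbeddedGraphDb {
    void applyChange(Object key, Object value);
    void commit();
    void close();
}
```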
Do the input stream events define one global graph, or multiple graphs, one per matching Kafka/Samza partition? That is important, as Samza state is local, not global.
If it's one global graph, you can update/query a separate graph system from the Samza task's process method. Titan on Cassandra would be one such graph system.
If it's multiple separate graphs, you can use the current RocksDB KV store to mimic graph database operations. Titan on Cassandra does just that: it uses Cassandra's KV store to store and query the graph. Graphs are stored either as an adjacency matrix (set [i,j] to 1 if nodes i and j are connected) or as adjacency lists. For the adjacency-list approach, use each node as the key and store its set of neighbours as the value, as in the sketch below.
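Here is a minimal sketch of that adjacency-list idea on top of Samza's local KV store; it assumes the store is configured with serdes that can handle Set<String> values, and the class and method names are mine:

```java
import java.util.HashSet;
import java.util.Set;

import org.apache.samza.storage.kv.KeyValueStore;

// Adjacency-list graph on top of a Samza local KV store:
// each node id maps to the set of its neighbours.
public class KvGraph {
    private final KeyValueStore<String, Set<String>> store;

    public KvGraph(KeyValueStore<String, Set<String>> store) {
        this.store = store;
    }

    public void addEdge(String from, String to) {
        Set<String> neighbours = store.get(from);
        if (neighbours == null) {
            neighbours = new HashSet<>();
        }
        neighbours.add(to);
        store.put(from, neighbours); // also flows to the changelog for recovery
    }

    public Set<String> neighbours(String node) {
        Set<String> result = store.get(node);
        return result != null ? result : new HashSet<>();
    }
}
```

You would call addEdge from the task's process method as connected events arrive, and query neighbours when emitting the derived stream events.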
