Kafka Streams wait function with dependent objects - stream

I am creating a Kafka Streams application that receives different JSON objects from different topics, and I want to implement some kind of wait function, but I'm not sure how best to implement it.
To simplify the problem I'll use simplified entities in the following section; I hope the problem can be described well with them.
In one of my streams I receive car objects, and every car has an id. In a second stream I receive person objects; every person also has a car id and is assigned to the car with that id.
I want my Kafka Streams application to read from both input streams (topics) and enrich the car object with the four persons that have the same car id. A car object should only be forwarded to the next downstream processor once all four persons have been added to it.
My plan is to create one input stream for the car objects and one for the person objects, parse the JSON data into the internal object representation, merge both streams together and apply a "selectKey" function on the merged stream to extract the keys from the entities.
After that I would push the data into a custom transformation function, which has a state store included. Inside this transform function I would store every arriving car object with its id in the state store. As soon as new person objects arrive, I would add them to the respective car object in the state store (please ignore the case of late-arriving cars here). As soon as four persons are in a car object, I would forward the object to the next stream function and remove the car object from the state store.
Would this be a suitable approach? I'm not sure about scalability, because when running multiple instances I have to make sure that car and person objects with the same id are processed by the same application instance. I would use the selectKey function for this; would that work?
Thanks!

The basic design looks sound to me.
However, selectKey() itself will not be sufficient, because transform() (in contrast to DSL operators) does not trigger auto-repartitioning. Thus, you need to repartition manually via through().
stream.selectKey(...)
.through("user-created-topic")
.transform(...);
https://docs.confluent.io/current/streams/upgrade-guide.html#auto-repartitioning
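For completeness, a minimal sketch of such a transformer with an attached state store, assuming Car and Person classes with getCarId()/addPerson()/getPersons(), a carSerde, and made-up store and topic names (the exact Transformer interface differs slightly between Kafka versions):

import org.apache.kafka.common.serialization.Serde;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Transformer;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.Stores;

public static KStream<String, Car> completeCars(StreamsBuilder builder,
                                                KStream<String, Object> mergedStream,
                                                Serde<Car> carSerde) {

    // State store that buffers cars until all four persons have arrived.
    builder.addStateStore(Stores.keyValueStoreBuilder(
            Stores.persistentKeyValueStore("car-store"), Serdes.String(), carSerde));

    return mergedStream
            .selectKey((key, value) -> value instanceof Car
                    ? ((Car) value).getCarId()
                    : ((Person) value).getCarId())
            .through("cars-and-persons-by-car-id")   // manual repartitioning by car id
            .transform(() -> new Transformer<String, Object, KeyValue<String, Car>>() {
                private KeyValueStore<String, Car> store;

                @Override
                @SuppressWarnings("unchecked")
                public void init(ProcessorContext context) {
                    store = (KeyValueStore<String, Car>) context.getStateStore("car-store");
                }

                @Override
                public KeyValue<String, Car> transform(String carId, Object value) {
                    if (value instanceof Car) {
                        store.put(carId, (Car) value);   // remember the car until its persons arrive
                        return null;
                    }
                    Car car = store.get(carId);
                    if (car == null) {
                        return null;                     // person arrived before its car: ignored, as in the question
                    }
                    car.addPerson((Person) value);
                    if (car.getPersons().size() < 4) {
                        store.put(carId, car);           // still waiting for more persons
                        return null;
                    }
                    store.delete(carId);                 // complete: clean up and forward downstream
                    return KeyValue.pair(carId, car);
                }

                @Override
                public void close() { }
            }, "car-store");
}

Because the repartition topic is keyed by car id, all cars and persons with the same id end up in the same partition and therefore on the same application instance, which addresses the scalability concern from the question.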


Event Store DB: temporal queries

Regarding the question asked here:
Suppose we have ProductCreated and ProductRenamed events which both contain the title of the product. Now we want to query EventStoreDB for all events of type ProductCreated and ProductRenamed with a given title. I want all these events in order to check whether there is any product in the system which has been created with, or renamed to, the given title, so that I can throw a duplicate-title exception in the domain.
I am using MongoDB to build UI reports from all the published events, and everything is fine there. But for checking some invariants, like checking for unique values, I have to query the event store for certain events along with their criteria and, by iterating over them, decide whether there is a product created with the same title which has not been renamed, or a product renamed to the same title.
For such queries, the only way the event store provides is creating a one-time projection with the proper JavaScript code, which filters and emits the required events to a new stream. Then all I have to do is fetch events from the newly generated stream which is filled by the projection.
Now the odd thing is, projections are great for subscriptions and generating new streams, but they seem odd for doing real-time queries. Immediately after I create a projection with the HTTP API, I check the new resulting stream for the query result, but it seems the workers have not had the chance to produce the result yet and I get a 404 response. After waiting for a couple of seconds, however, the new stream pops up and gets filled with the result.
There are too many things wrong with this approach:
First, it seems that if the event store is filled with millions of events across many streams, it won't be able to process and filter all of them into the resulting stream immediately. It does not even create the stream immediately, let alone populate it, so I have to wait for some time and check for the result, hoping that the projection is done.
Second, I have to fetch multiple times and issue multiple HTTP GET requests, which seems slow. The new JVM client is not ready yet.
Third, I have to delete the resulting stream after I'm done with the result, and failing to do so will leave the event store with millions of orphaned query-result streams.
I wish I could pass the JavaScript to some API and get the result page by page, like querying MongoDB, without worrying about the projection, new streams and timing issues.
I have seen a query section in the Admin UI, but I don't know what it is for, and unfortunately the documentation doesn't help much.
Am I expecting the event store to do something that is impossible?
Do I have to create a read model inside the bounded context for doing such checks?
I am using my events to rehydrate the aggregates and would like to use the same events for such simple queries without bringing in other techniques.
I believe it would not be a separate bounded context since the check you want to perform belongs to the same bounded context where your Product aggregate lives. So, the projection that is solely used to prevent duplicate product names would be a part of the same context.
You can indeed use a custom projection to check it but I believe the complexity of such a solution would be higher than having a simple read model in MongoDB.
It is also fine to use an existing projection to do the check if you have one. It might not be what you would otherwise prefer if the aim of the existing projection is to show things in the UI.
For the collection that you could use for duplicates check, you can have the document schema limited to the id only (string), which would be the product title. Since collections are automatically indexed by the id, you won't need any additional indexes to support the duplicate check query. When the product gets renamed, you'd need to delete the document for the old title and add a new one.
Again, you will get a small time window when the duplicate can slip in. It's then up to the business to decide if the concern is real (it's not, most of the time) and what's the consequence of the situation if it happens one day. You'd be able to find a duplicate when projecting events quite easily and decide what to do when it happens.
Practically, when you have such a projection, all it takes is to build a simple domain service bool ProductTitleAlreadyExists.
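As a rough illustration of that read model with the MongoDB Java driver (the collection name, class name and event-handler shape are made up for the sketch, not prescribed by the answer):

import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

import static com.mongodb.client.model.Filters.eq;

public class ProductTitleIndex {

    private final MongoCollection<Document> titles;

    public ProductTitleIndex(String connectionString) {
        // One document per product title; _id is the title itself, so the default index covers the lookup.
        titles = MongoClients.create(connectionString)
                .getDatabase("read_models")
                .getCollection("products_by_title");
    }

    // Domain service used by the command handler before creating or renaming a product.
    public boolean productTitleAlreadyExists(String title) {
        return titles.find(eq("_id", title)).first() != null;
    }

    // Projection handlers keep the collection up to date.
    public void onProductCreated(String title) {
        titles.insertOne(new Document("_id", title));
    }

    public void onProductRenamed(String oldTitle, String newTitle) {
        titles.deleteOne(eq("_id", oldTitle));
        titles.insertOne(new Document("_id", newTitle));
    }
}

The command handler calls productTitleAlreadyExists before accepting a create or rename, accepting the small time window described above in which a concurrent change could still let a duplicate slip in.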

How does Firebase choose what to store in its cache with isPersistenceEnabled = true in iOS

I have an app that is using Firebase quite extensively to store data that contains relationships. I want to make sure I am using Firebase as safely as possible in offline mode. The safety concern I have can be demonstrated in the following example:
Assume I have a Zoo model where each individual zoo is stored in Firebase as a subnode of "/zoos".
I have an Animal model where each individual animal is stored in Firebase as a subnode of "/animals".
A Zoo can have Animals which are stored in an ordered list. Specifically, the Zoo model contains an Animal array e.g. [Animal]. This list of Animals is stored in Firebase as a set of position-reference pairs at "/zoos/myZoo/animals" which will contain nodes like:
{0: "animals/fidoTheDog"},
{1: "animals/jillTheCat"}
When I add a new Animal to a Zoo, I need to know how many animals are currently in that zoo so I can add the new animal in the right position like:
{2: "animals/jakeTheSnake"}
If I am offline and happen to read the location "zoos/myZoo/animals" to get the list of animals so I can add in the right position, I want to make sure I have accurate data. I know that if someone else wrote to that position while I am offline and added another animal in position 2, I will get stale data, and when I add an animal in position 2, I will overwrite his entry at "zoos/myZoo/animals/2" when I go online again. So that is an issue.
But, if I know I will be the only one writing to that location, can I be relatively sure that Firebase will hold the crucial data at "zoos/myZoo/animals" for me since I am using isPersistenceEnabled = true? In other words, will Firebase just keep that data in cache as long as I have recently written to that location or recently read from that location?
Or do I explicitly need to specify "keepSynced(true)" on that location? This gets to the core general version of the question - How does Firebase choose what to store in its cache with isPersistenceEnabled = true? Especially if I have not specifically set keepSynced(true) on any particular locations. Will Firebase just prioritize recently read data and then when the 10mb limit is hit, discard the old stuff first? Does it matter if I wrote the data to that location a long time ago but consistently read from that location? Will it still maintain that location in the cache because it was recently read? Will it ever discard data before hitting the 10mb limit?
I'm a little bit of a newbie so thank you for your patience with me!
-------------- FOLLOW UP QUESTIONS --------------
A couple follow up questions.
I think the approach suggested in the blog (given by Frank in comments) of using childByAutoId sounds good. So if I am saving a zoo with many animals (in order), then it sounds like I would loop through the animals and use childByAutoId to create a new key for each animal, whose value will be the reference to the location of the animal object. Can I be sure that the keys I create in rapid succession (looping will probably be very fast) will ultimately sort correctly when ordered lexicographically? I'm looking at this blog post and assuming that is the case. https://firebase.googleblog.com/2015/02/the-2120-ways-to-ensure-unique_68.html
Suppose I am doing something more complicated like inserting an animal at the beginning of the list in position zero. Then before doing the operation, I would sync down the list of animals in the zoo as suggested in the blog post you sent. https://firebase.googleblog.com/2014/04/best-practices-arrays-in-firebase.html. If the user is offline, I obviously can’t be sure that I will have the freshest copy. But suppose I am ok with that because users will only be working with their own data and only on their own device. In that case, does it help to use keepSynced(true) on the path to the zoo? Or since the amount of data the user is working with is well, well under 10mb (the whole database right now is 300k for 10ish active users), can I just assume the cache will store the data in the zoo path (whether keepSynced or not) because we never flirt with the 10mb limit in any case?
Thank you!
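A minimal sketch of the push-ID approach discussed in the follow-up, shown with the Firebase Java/Android-style API for brevity (push() there is the counterpart of childByAutoId() on iOS; the ZooWriter class and the zoo paths simply mirror the example above and are not from the original question):

import com.google.firebase.database.DatabaseReference;
import com.google.firebase.database.FirebaseDatabase;

import java.util.List;

public class ZooWriter {

    // Adds each animal reference under an auto-generated push key instead of a numeric index,
    // so concurrent or offline writers can never overwrite each other's slots.
    public void addAnimals(String zooId, List<String> animalRefs) {
        DatabaseReference animals = FirebaseDatabase.getInstance()
                .getReference("zoos/" + zooId + "/animals");

        // Optional: keep this location in the local cache even when nothing is actively listening.
        animals.keepSynced(true);

        for (String animalRef : animalRefs) {
            // e.g. animalRef = "animals/jakeTheSnake"
            animals.push().setValue(animalRef);
        }
    }
}

Because push keys begin with a timestamp component and the client keeps them monotonic even within the same millisecond, keys generated in a fast loop still sort back into insertion order when read lexicographically, which is the guarantee the linked blog post describes.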

Chord Join DHT - join protocol for second node

I have a distributed hash table (DHT) which is running on multiple instances of the same program, either on multiple machines or, for testing, on different ports on the same machine. These instances are started one after the other. First, the base node is started, then the other nodes join it.
I am a little unsure how I should implement the join of the second node in a way that also works for all the other nodes (all of them, of course, run the same program) without having to define all the border cases.
For a node to join, it sends a join message first, which gets passed to the correct node (here it's just the base node) and then answered with a notify message.
With these two messages the predecessor of the base node and the successor of the joining node get set. But how do the remaining properties get set? I know that the nodes occasionally send a stabilise message to their successor, which compares the sender to its own predecessor and answers with a notify message containing that predecessor in case it differs from the sender of the message.
Now, the base node can't send such a message, because it doesn't know its successor; the new node can send one, but the base node's predecessor is already valid, so nothing changes.
I am guessing that, in the end, both properties should point to the respective other node for the two nodes to be fully joined.
Here is another diagram of what I think the sequence should be when the third node joins. But again, when do I update the properties based on a stabilise message, and when do I send a notify message back? In the diagram it is easy to see, but in code it is hard to decide.
The trick here is to set the successor to the same value as the predecessor if the successor is still NULL after the join message has been received. Everything else gets handled nicely by the rest of the protocol.
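A rough sketch of that rule together with the stabilise/notify handling described in the question (NodeRef, Notify and Stabilise are placeholder types, not the asker's actual classes, and the transport is omitted):

class NodeRef {
    final String id;
    NodeRef(String id) { this.id = id; }
    @Override public boolean equals(Object o) { return o instanceof NodeRef && ((NodeRef) o).id.equals(id); }
    @Override public int hashCode() { return id.hashCode(); }
}

class Notify    { final NodeRef node;   Notify(NodeRef node)      { this.node = node; } }
class Stabilise { final NodeRef sender; Stabilise(NodeRef sender) { this.sender = sender; } }

class ChordNode {
    private final NodeRef self;
    private NodeRef predecessor;   // null until somebody joins behind us
    private NodeRef successor;     // null on the base node until the second node joins

    ChordNode(NodeRef self) { this.self = self; }

    // Join request from a new node that belongs directly behind us.
    void onJoin(NodeRef joiner) {
        predecessor = joiner;
        if (successor == null) {
            successor = joiner;               // second node in the ring: successor and predecessor are the same node
        }
        send(joiner, new Notify(self));       // the joiner sets us as its successor
    }

    // Periodic task: ask our successor whether we are still its predecessor.
    void stabilise() {
        if (successor != null) {
            send(successor, new Stabilise(self));
        }
    }

    // Stabilise received from a node that believes we are its successor.
    void onStabilise(NodeRef sender) {
        if (predecessor == null) {
            predecessor = sender;
        } else if (!predecessor.equals(sender)) {
            send(sender, new Notify(predecessor));   // somebody sits between the sender and us
        }
    }

    // Notify tells us which node to use as our successor.
    void onNotify(NodeRef newSuccessor) {
        successor = newSuccessor;
    }

    private void send(NodeRef target, Object message) { /* transport omitted in this sketch */ }
}

With the NULL check in onJoin, the base node has both pointers set immediately after the second node joins, and the normal stabilise/notify rounds take care of later joins such as the third node.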

Parse.com data caching and synchronization

What is the best strategy to synchronize Parse objects across the application?
Take Twitter as an example: there are many Tweet objects, and the same tweet object can be shown in multiple places, say viewController1 and viewController2, so it is not efficient for both of them to hold deep copies of the same Parse object.
When I increase the likeCount of Tweet_168 in viewController2, how should I update the likeCount of Tweet_168 in viewController1?
I created a singleton container class (TweetContainer) so that every Parse request goes through it, and it checks whether the incoming objectIds are already in the container:
A) If an objectId is already there, it updates the existing object's fields and discards the new object (to keep a single deep copy of each Parse object).
B) If it is not, it adds the new object.
(This process is fast as I'm using hashmaps)
This container holds deep copies of those objects and gives shallow copies to viewControllers, so editing a Tweet in one viewController results in it being updated in all viewControllers!
Taking one step further, let's say Tweet objects have pointers to Author objects. When an Author object is updated, I want all of them to be updated (say image change). I can create a new AuthorContainer with the same strategy and give shallow copies to Tweet objects in TweetContainer.
In an ideal world I could propagate every update to the cloud and refresh every object from the cloud before showing it to the user, but that's not feasible bandwidth- or latency-wise.
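The container described above is essentially an identity map. A minimal sketch of the pattern, with a plain Tweet placeholder instead of the real Parse objects (class and field names are made up for illustration):

import java.util.HashMap;
import java.util.Map;

// Placeholder for the Parse-backed model object; only the fields needed for the sketch.
class Tweet {
    final String objectId;
    int likeCount;
    Tweet(String objectId, int likeCount) { this.objectId = objectId; this.likeCount = likeCount; }
}

// Identity map: one canonical instance per objectId; every screen holds a reference to that instance.
class TweetContainer {
    private static final TweetContainer INSTANCE = new TweetContainer();
    private final Map<String, Tweet> byId = new HashMap<>();

    static TweetContainer shared() { return INSTANCE; }

    // Called for every query result before it reaches a view controller.
    Tweet register(Tweet incoming) {
        Tweet existing = byId.get(incoming.objectId);
        if (existing == null) {
            byId.put(incoming.objectId, incoming);   // case B: first time we see this id
            return incoming;
        }
        existing.likeCount = incoming.likeCount;     // case A: copy fields, drop the duplicate instance
        return existing;
    }
}

An AuthorContainer would follow the same shape, and Tweet objects would hold the shared Author references instead of their own copies.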

Database Design without NULL values and Repeating Data for iOS App

Having figured out most of my data-model for a new iOS app, I'm now stuck with a problem that I've been thinking about for a while.
An 'Experiment' has a name, description and owner. It also has one 'Action' and one 'Event'.
An 'Event' could be different things: Time, Location or Speed.
Depending on what the 'Event' is, it can have a different 'Type'. For example, Time could be one-off, interval, date-range, repeating or random. Location could be area or exact location.
Each 'Type' then has a value that has a data type unique to itself. The Time One-Off could be a date value of 12:15pm and the Location Exact could be a GeoPoint value of (30.0, -20.0).
The Problem
How do I design the data model so that the database is not riddled with NULL values?
How do I design the data model to be extensible if I add more 'Events' and 'Types'?
Thoughts
As an Experiment only has one Action and one Event, it would be wrong to separate these two into different tables; however, not doing so would cause the Experiment table to be full of NULL values, as I'd have to have columns for Event, Event Type and Event Type Value to cover all of the possible data types one could enter for an Event Type Value (date, int, string, geopoint, etc.).
Separating the Event and Event Type into a separate table would probably fix the NULL value issue, however I'd be left with repeating data, especially in the case of time, e.g. an Event with Type One-Off at 12:00pm, as this would also exist in other experiments, not just one. (Unless I create EVERY possibility and populate a separate table with these - how could I easily do that, though?)
Maybe I'm over complicating things, maybe I'm missing something so simple that I'm going to kick myself when I see it.
You need to think about your data model in terms of objects, not tables. Core Data works with object graphs, so everything in Core Data is an object, and in Objective-C you work with objects; this is why you don't need an ORM tool. If you think in terms of objects, then I think the model below (it obviously needs work, but you should get the point) makes sense.
The advantage of separating your concepts out into objects like this is that you can look at your problem from multiple angles. In other words, you can look at it from the Experiment angle or from the Event angle. I suspect you will want to do something with the data, such as use your Time object in your code to show on a calendar or set a reminder, or fetch all the events for all experiments of a specific type, etc. By encapsulating these data items in objects in Core Data, everything is ready for you to leverage, manipulate and modify in your code.
It also removes the NULL value issue you identified, because you won't be creating objects for null values, only for values that are relevant to your experiment. That being said, you might want to break down the model even further depending upon the specifics of your program.
You also would not have the repeating data issue you mention if you design this properly. Again, you're not dealing with rows in a table, you are dealing with objects. If you create an Event Type object with "one-off 12:00pm", you can assign that Event Type object, through its relationship, to as many Events as you wish. You don't create the object again, you simply reference it.
When you think of the relationships, think "X can be associated with Y". For example: "An Experiment can be associated with only 1 Event", "An Event Type can be associated with many Events", "An Event can be associated with only 1 Event Type". Taking this approach sets you up for extensibility down the road. Imagine you want to add a new Event Type: you simply create a new event entity and associate it with your Event Type entity.
My suggestion is to think about your object model relative to how you anticipate using the objects in your code (and how you anticipate accessing the objects via queries). That should help drive how you construct it (e.g. if you need a time object, make sure you have that in your object model; if you need an alert object, make sure you have that in your object model). Let the model do the work for you, and try not to write a lot of code to assemble the equivalent of an object model within Objective-C, or start creating objects in code and populating them with data from your data store.
(EDIT: Replace the "event" relationship in the diagram under time, location & speed with "event types")
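To make the relationships concrete, here is a rough sketch of the object graph written as plain classes (Java is used only as neutral notation; in the app these would be Core Data entities with the same relationships, and all names are illustrative):

import java.util.ArrayList;
import java.util.List;

// "An Experiment can be associated with only 1 Event" (and one Action).
class Experiment {
    String name;
    String description;
    String owner;
    Action action;
    Event event;
}

class Action {
    String name;
}

// "An Event can be associated with only 1 Event Type".
class Event {
    String kind;        // e.g. "time", "location" or "speed"
    EventType type;
}

// "An Event Type can be associated with many Events", so a type like "one-off 12:00pm"
// exists once and is referenced wherever it is needed: no repeated rows, no NULL columns.
class EventType {
    String name;        // e.g. "one-off", "exact location"
    Object value;       // whatever representation fits: a time of 12:15pm, a geo point (30.0, -20.0), ...
    List<Event> events = new ArrayList<>();   // back-reference to the events sharing this type
}

Adding a new Event or Type later is then just a matter of creating another object and wiring up the relationships, which is what keeps the model extensible.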
