I am currently trying to find a straightforward and performant way to classify records with Kafka Streams.
All the records contain at least an id and a failed property.
(id is just a String and failed is Boolean)
The idea is to initially classify all incoming records as "message".
Once one of the incoming records has the failed field set, this should be "persisted" somewhere and the record should be classified as "failure".
From then on, every incoming record with the same id should be classified as "failure" as well, regardless of whether its failed property is set.
I'm thinking about either using the internal state store of Kafka Streams (together with the interactive query feature) or an external database that is queried each time a record comes in. The Kafka Streams state store sounds like the more lightweight solution to me.
Here is a little concept sketch to, hopefully, help understand the problem.
Does someone have an idea on how to tackle this the right way?
Thank you
All the best
- Tim
Your approach sounds good to me. I don't think you need the interactive query feature, though. Just define a custom Transformer and attach a key-value store to it. During processing, if you get a message with failed=true, you put the ID into the store. For each incoming message with failed=false, you additionally check the store to see whether there was a previous failed message with the same ID.
To persist failed messages, you just split your stream into two (for example with branch()) and write failed messages into a dedicated topic.
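A minimal sketch of that idea in Java, assuming a hypothetical record type MyRecord with getId() and isFailed() accessors plus a Serde for it; topic and store names are made up for illustration:

```java
import org.apache.kafka.common.serialization.Serde;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Transformer;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.StoreBuilder;
import org.apache.kafka.streams.state.Stores;

public class FailureClassification {

    // Hypothetical record shape from the question: an id and a failed flag.
    interface MyRecord {
        String getId();
        boolean isFailed();
    }

    // Placeholder: supply your own Serde for MyRecord (e.g. a JSON Serde).
    static Serde<MyRecord> myRecordSerde = null;

    // Transformer that remembers failed IDs in an attached key-value store.
    static class FailureClassifier
            implements Transformer<String, MyRecord, KeyValue<String, String>> {

        private KeyValueStore<String, Boolean> failedIds;

        @Override
        @SuppressWarnings("unchecked")
        public void init(ProcessorContext context) {
            failedIds = (KeyValueStore<String, Boolean>) context.getStateStore("failed-ids");
        }

        @Override
        public KeyValue<String, String> transform(String key, MyRecord value) {
            if (value.isFailed()) {
                failedIds.put(value.getId(), true); // remember this ID from now on
            }
            boolean failure = value.isFailed() || failedIds.get(value.getId()) != null;
            // For brevity only the classification is emitted; a real topology
            // would probably forward the whole record alongside it.
            return KeyValue.pair(value.getId(), failure ? "failure" : "message");
        }

        @Override
        public void close() {}
    }

    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // Register the store and connect it to the transformer by name.
        StoreBuilder<KeyValueStore<String, Boolean>> store = Stores.keyValueStoreBuilder(
                Stores.persistentKeyValueStore("failed-ids"),
                Serdes.String(), Serdes.Boolean());
        builder.addStateStore(store);

        KStream<String, String> classified = builder
                .stream("records", Consumed.with(Serdes.String(), myRecordSerde))
                .transform(FailureClassifier::new, "failed-ids");

        // Split the stream and persist failures to a dedicated topic.
        KStream<String, String>[] branches = classified.branch(
                (id, cls) -> cls.equals("failure"),
                (id, cls) -> true);
        branches[0].to("failures");
        branches[1].to("messages");
    }
}
```

Since the store is persistent, it is backed by a changelog topic, so the set of failed IDs survives restarts and rebalances.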
I have a system that produces thousands of messages per hour: a location tracking system that gathers events from different devices and does different calculations based on those messages.
I'm trying to evaluate whether an event store suits this use case. My plan is to associate a stream with each device and accumulate messages in those streams.
Now the question: will I be able to read those messages for a specific time frame in the past? I don't want to replay all the messages from the beginning; I just need fast access to the messages from date1 to date2.
Any ideas? So far, everything I've seen in the docs relates to reading all messages either from the beginning or from the end and filtering during the process. That pattern doesn't look very optimal to me. Am I doing something wrong?
The EventStoreDB index allows you to read events from a specific stream by event number, but not by date. What you wrote about reading from the beginning or the end is not entirely correct: you can read from any position in the stream, both backwards and forwards, but again that has nothing to do with dates.
Essentially, the date when the event was written to the database is considered unimportant and transient. For example, if you decide to move your data to another store using replication, all the events will get a new date. That's why, if the date is important, it should be stored somewhere in the event data or metadata. EventStoreDB doesn't know anything about the event payload (or metadata) and doesn't index it.
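As a rough illustration of carrying the date in the event itself, here is a sketch with the EventStoreDB Java gRPC client (com.eventstore.dbclient). The stream name and payload are invented, and builder method names differ between client versions, so treat it as illustrative rather than exact:

```java
import com.eventstore.dbclient.EventData;
import com.eventstore.dbclient.EventStoreDBClient;
import com.eventstore.dbclient.EventStoreDBConnectionString;
import java.time.Instant;
import java.util.Map;

public class AppendWithTimestamp {
    public static void main(String[] args) throws Exception {
        EventStoreDBClient client = EventStoreDBClient.create(
                EventStoreDBConnectionString.parseOrThrow("esdb://localhost:2113?tls=false"));

        // The timestamp travels inside the event data, where it is under
        // your control and survives replication.
        Map<String, Object> payload = Map.of(
                "deviceId", "sensor-1",
                "reading", 42.0,
                "recordedAt", Instant.now().toString());

        EventData event = EventData.builderAsJson("reading-taken", payload).build();
        client.appendToStream("device-sensor-1", event).get();
    }
}
```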
If you are looking for a kind of database that allows you to query records by time, your best bet is a time-series database like Prometheus or InfluxDB. Those databases are specifically designed to index primarily by timestamp and are optimised to store data like sensor readings, where each reading replaces the previous one. EventStoreDB is not designed for that purpose; it is a database built to support event-sourced applications, and sensor readings are not that.
I'm facing a situation where I have multiple robots, most running full ROS stacks (complete with Master), and I'd like to selectively route some topics through another messaging framework to the other robots (some of which are not running ROS).
The naive way to do this works: set up a node that subscribes to the ROS topics in question and sends them over the network, after which another node publishes them (if it's ROS). Great, but it seems odd to have to do this much serializing. Right now the message goes from its message type to the ROS serialization, back to the message type, then to a different serialization format (currently Pickle), across the network, then back to the message type, then back to the ROS serialization, then back to the message type.
So the question is: can I simplify this? How can I operate on the ROS-serialized data (i.e. subscribe without rospy automagically deserializing for me)? http://wiki.ros.org/rospy/Overview/Publishers%20and%20Subscribers suggests that I can access the connection information as a dict of strings, which may be half of the solution, but how can the other end take the connection information and republish it without first deserializing and then immediately reserializing?
Edit: I just found https://gist.github.com/wkentaro/2cd56593107c158e2e02 , which seems to solve half of this. It uses AnyMsg to avoid deserializing on the ROS subscriber side, but then when it republishes it still deserializes and immediately reserializes the message. Is what I'm asking impossible?
Just to close the loop on this, it turns out you can publish AnyMsgs, it's just that the linked examples chose not to.
Has anyone posted a response to this problem? There have been other posts with no answers. Our situation is that we are pushing messages onto a topic that backs a KTable in the first step of our stream process. We then pull a small amount of data from those messages and pass it along. We do multiple computations on that smaller amount of data for grouping and aggregation. At the end of the streaming process, we simply want to join back to the original topic via a KTable to pick up the full message content again. The result of the join is only a subset of the data because it cannot find the entries in the KTable.
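For concreteness, here is a rough reconstruction of the kind of topology being described, with hypothetical names (FullMessage, toSummary, fullMessageSerde) standing in for the real types; the last step is where the join finds only a subset of the entries once multiple instances run:

```java
StreamsBuilder builder = new StreamsBuilder();

// The input topic backs a KTable holding the full messages.
KTable<String, FullMessage> originals =
        builder.table("input-topic", Consumed.with(Serdes.String(), fullMessageSerde));

// Pull a small part of each message and run the grouping/aggregation on it.
KTable<String, Summary> summaries = originals.toStream()
        .mapValues(FullMessage::toSummary)  // keep only the small part
        .groupByKey()
        .reduce((a, b) -> b);               // stand-in for the real aggregations

// Join back to the original topic to pick up the full message content again.
KStream<String, FullMessage> enriched =
        summaries.toStream().join(originals, (summary, full) -> full);
```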
This is just the beginning of the problem. In another case, we are using KTables as indexes for lookups meant to enrich the incoming data. Think of these lookups as identifying whether we have seen a specific pattern in a streaming message before. If we have seen the pattern, we want to tag the message with an ID (used for grouping) pulled from an existing KTable. If we have not seen the pattern before, we assign it an ID and place it back into the KTable to tag future messages. What we have found is that there is no guarantee that the information will be present in the KTable for future messages. This lack of a guarantee seems to make KTables useless. We cannot figure out why there is so little discussion of this on the forums.
Finally, none of this seemed to be a problem when running a single instance of the streams application. However, as soon as our data got large and we were forced to run 10 instances of the app, everything broke. Also, there is no way we could use things like GlobalKTables, because there is too much data to load into a single machine's memory.
What can we do? We are currently planning to abandon KTables altogether and use something like Hazelcast to store the lookup data. Should we just move to Hazelcast Jet and drop Kafka Streams altogether?
Adding flow:
Kafka data flow
I'm sorry for this non-answer answer, but I don't have enough points to comment...
The behavior you describe is definitely inconsistent with my understanding and experience with streams. If you can share the topology (or a simplified one) that is causing the problem, there might be a simple mistake we can point out.
Once we get more info, I can edit this into a "real" answer...
Thanks!
-John
To our streaming pipeline, we want to submit unique GCS files, each file containing information about multiple events, each event also containing a key (for example, device_id). As part of the processing, we want to shuffle by this device_id so as to achieve some form of worker-to-device_id affinity (more background on why we want to do this is in another SO question). Once all events from the same file are complete, we want to reduce (GroupBy) by their source GCS file (which we will make a property of the event itself, something like file_id) and finally write the output to GCS (could be multiple files).
The reason we want to do the final GroupBy is that we want to notify an external service once a specific input file has completed processing. The only problem with this approach is that, since the data is shuffled by device_id and then grouped at the end by file_id, there is no way to guarantee that all data from a specific file_id has completed processing.
Is there something we could do about it? I understand that Dataflow provides exactly-once guarantees, which means all the events will eventually be processed, but is there a way to set a deterministic trigger that says all data for a specific key has been grouped?
EDIT
I wanted to highlight the broader problem we are facing here. The ability to mark file-level completeness would help us checkpoint different stages of the data as seen by external consumers. For example, this would allow us to trigger per-hour or per-day completeness, which is critical for us to generate reports for that window. Given that these stages/barriers (hour/day) are clearly defined on the input (GCS files are date/hour partitioned), it is only natural to expect the same of the output. But with Dataflow's model, this seems impossible.
Similarly, although Dataflow guarantees exactly-once processing, there will be cases where the entire pipeline needs to be restarted because something went horribly wrong. In those cases, it is almost impossible to restart from the correct input marker, since there is no guarantee that what was already consumed has been completely flushed out. DRAIN mode tries to achieve this, but as mentioned, if the entire pipeline is messed up and draining itself cannot make progress, there is no way to know which part of the source should be the starting point.
We are considering using Spark, since its micro-batch-based streaming model seems to fit better. We would still like to explore Dataflow if possible, but it seems we won't be able to achieve this without storing these checkpoints externally from within the application. If there is an alternative way of providing these guarantees from Dataflow, that would be great. The idea behind broadening this question was to see if we are missing an alternate perspective that would solve our problem.
Thanks
This is actually tricky. Neither Beam nor Dataflow has a notion of a per-key watermark, and it would be difficult to implement that level of granularity.
One idea would be to use a stateful DoFn instead of the second shuffle. This DoFn would need to receive the number of elements expected in the file (from either a side input or some special value on the main input). It could then count the number of elements it has processed, and only output that everything has been processed once it has seen that number of elements.
This assumes that the expected number of elements can be determined ahead of time, etc.
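A rough sketch of that idea with the Beam Java SDK, assuming elements arrive keyed by file_id and that a hypothetical FileEvent type carries the expected per-file element count:

```java
import org.apache.beam.sdk.coders.VarLongCoder;
import org.apache.beam.sdk.state.StateSpec;
import org.apache.beam.sdk.state.StateSpecs;
import org.apache.beam.sdk.state.ValueState;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

// Counts elements per file_id and emits the file_id once every expected
// element has been seen. State is scoped per key and window.
class FileCompletionFn extends DoFn<KV<String, FileEvent>, String> {

    @StateId("seen")
    private final StateSpec<ValueState<Long>> seenSpec =
            StateSpecs.value(VarLongCoder.of());

    @ProcessElement
    public void process(ProcessContext ctx,
                        @StateId("seen") ValueState<Long> seen) {
        long count = (seen.read() == null ? 0L : seen.read()) + 1;
        seen.write(count);
        if (count == ctx.element().getValue().getExpectedCount()) {
            // Every element of this file has been processed: emit a marker
            // that a downstream step can use to notify the external service.
            ctx.output(ctx.element().getKey());
        }
    }
}
```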
Is there an option in DynamoDB to store an auto-incremented ID as the primary key in tables? I also need to store the server time in tables as "created at" fields (e.g., user created at). But I can't find any way to get the server time from DynamoDB or any other AWS service.
Can you guys help me with:
Working with auto-incremented IDs in DynamoDB tables
Storing server time in tables for "created at"-like fields.
Thanks.
Actually, there are very few features in DynamoDB, and this is precisely its main strength: simplicity.
There is no way to automatically generate IDs or UUIDs.
There is no way to auto-generate a date.
For the "date" problem, it should be easy to generate it on the client side. May I suggest you to use the ISO 8601 date format ? It's both programmer and computer friendly.
Most of the time, there is a better way than using automatic IDs for items. This is often a bad habit carried over from the SQL or MongoDB world. For instance, an e-mail address or a login will make a perfect ID for a user. But I know there are specific cases where generated IDs might be useful.
In those cases, you need to build your own system. In this SO answer and this article from the DynamoDB-Mapper documentation, I explain how to do it. I hope it helps.
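For reference, one common way to build such a system is a counter item bumped with DynamoDB's atomic ADD update. A sketch with the AWS SDK for Java v2, with made-up table and attribute names:

```java
import java.util.Map;
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.ReturnValue;
import software.amazon.awssdk.services.dynamodb.model.UpdateItemRequest;
import software.amazon.awssdk.services.dynamodb.model.UpdateItemResponse;

public class Counters {
    // Returns the next ID for the named counter. ADD is atomic, so
    // concurrent callers each receive a distinct value.
    public static long nextId(DynamoDbClient db, String counterName) {
        UpdateItemResponse resp = db.updateItem(UpdateItemRequest.builder()
                .tableName("counters")
                .key(Map.of("name", AttributeValue.builder().s(counterName).build()))
                .updateExpression("ADD current_value :inc")
                .expressionAttributeValues(
                        Map.of(":inc", AttributeValue.builder().n("1").build()))
                .returnValues(ReturnValue.UPDATED_NEW)
                .build());
        return Long.parseLong(resp.attributes().get("current_value").n());
    }
}
```

Note that every generated ID costs a write against the counter item, so under load it can become a hot key.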
Rather than working with auto-incremented IDs, consider working with GUIDs. You get higher theoretical throughput and better failure handling, and the only thing you lose is the natural time order, which is better handled by dates.
Higher throughput, because you don't need to ask DynamoDB to generate the next available ID (which would require some resource somewhere obtaining a lock, getting some numbers, and making sure nothing else gets those numbers). Better failure handling comes in when you lose your connection to DynamoDB (DynamoDB goes down, or you are bursty and your application is doing more work than the currently provisioned throughput allows). A write-only application can continue "working", generating data complete with IDs and queueing it up to be written to DynamoDB, without ever worrying about ID collisions.
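A minimal sketch of that approach with the AWS SDK for Java v2 (table and attribute names are made up): the client mints the ID itself, so the write never waits on a central ID allocator.

```java
import java.time.Instant;
import java.util.Map;
import java.util.UUID;
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.PutItemRequest;

public class CreateUser {
    public static void main(String[] args) {
        DynamoDbClient db = DynamoDbClient.create();
        db.putItem(PutItemRequest.builder()
                .tableName("users")
                .item(Map.of(
                        // GUID primary key, generated client-side
                        "id", AttributeValue.builder().s(UUID.randomUUID().toString()).build(),
                        // time order handled by a date attribute instead
                        "created_at", AttributeValue.builder().s(Instant.now().toString()).build()))
                .build());
    }
}
```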
I've created a small web service just for this purpose. See this blog post, which explains how I'm using stateful.co with DynamoDB in order to simulate auto-increment functionality: http://www.yegor256.com/2014/05/18/cloud-autoincrement-counters.html
Basically, you register an atomic counter at stateful.co and increment it every time you need a new value, through a RESTful API.