What is the best way to ensure idempotence when using Cloud DataFlow and PubSub?
We currently have a system which processes and stores records in a MySQL database. I'm curious about using DataFlow for some of our reporting, but wanted to understand what I would need to do to ensure that I didn't accidentally double count (or more than double count) the same messages.
My confusion comes in two parts: first, ensuring I only send the messages once, and second, ensuring I process them only once.
My gut would be as follows:
Whenever an event I'm interested in is recorded in our MySQL database, transform it into a PubSub message and publish it to PubSub.
Assuming success, record the PubSub id that's returned alongside the MySQL record. That way, if it has a PubSub id, I know I've sent it and I don't need to send it again. If the publish to PubSub fails, then I know I need to send it again. All good.
But if the write to MySQL fails after the PubSub write succeeds, I might end up publishing the same message to PubSub again, so I need something on the DataFlow side to handle both this case and the case where PubSub delivers a message twice (as per https://cloud.google.com/pubsub/subscriber#guarantees).
What's the best way to handle this? In AppEngine or other systems I would have a check against the datastore to see if the new record I'm creating exists, but I'm not sure how you'd do that with DataFlow. Is there a way I can easily implement a filter to stop a message being processed twice? Or does DataFlow handle this already?
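Dataflow can de-duplicate messages on the receiver side based on an arbitrary message attribute (selected by idLabel), as outlined in Using Record IDs. On the producer side, you'll want to make sure that you populate that attribute deterministically and uniquely from the MySQL record, so that a republished record carries the same ID as the original. If this is done correctly, Dataflow will process each logical record exactly once. Note that, per the Dataflow documentation, this de-duplication only covers messages published within about 10 minutes of each other, so a retry long after the original publish may not be caught.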
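One way to populate the attribute deterministically is to derive it from the record's table name and primary key, so that every publish attempt for the same row yields the same ID. The following is a minimal sketch under that assumption; the `RecordIds` class, the `messageIdFor` method, and the `table:primaryKey` key format are hypothetical names chosen for illustration, not part of any Pub/Sub or Dataflow API.

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;

// Hypothetical helper: derives a stable message ID from a MySQL row's identity.
final class RecordIds {
    private RecordIds() {}

    // The same (table, primaryKey) pair always produces the same ID, so a
    // republished record carries the ID of the original publish attempt.
    static String messageIdFor(String table, long primaryKey) {
        String naturalKey = table + ":" + primaryKey; // assumed key format
        // Name-based UUID: a pure function of the input bytes.
        return UUID.nameUUIDFromBytes(naturalKey.getBytes(StandardCharsets.UTF_8)).toString();
    }
}
```

The resulting string would be attached to the outgoing message as the attribute that the pipeline is configured to de-duplicate on.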
Related
I'm using KafkaIO in dataflow to read messages from one topic. I use the following code.
    KafkaIO.<String, String>read()
        .withReadCommitted()
        .withBootstrapServers(endPoint)
        .withConsumerConfigUpdates(new ImmutableMap.Builder<String, Object>()
            .put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, true)
            .put(ConsumerConfig.GROUP_ID_CONFIG, groupName)
            .put(ConsumerConfig.AUTO_COMMIT_INTERVAL_MS_CONFIG, 8000)
            .put(ConsumerConfig.REQUEST_TIMEOUT_MS_CONFIG, 2000)
            .build())
        // .commitOffsetsInFinalize()
        .withTopics(Collections.singletonList(topicNames))
        .withKeyDeserializer(StringDeserializer.class)
        .withValueDeserializer(StringDeserializer.class)
        .withoutMetadata();
I run the Dataflow program locally using the direct runner, and everything runs fine. When I run another instance of the same program in parallel (i.e. another consumer), I see duplicate messages being processed by the pipeline.
I have provided a consumer group id, so a second consumer started with the same group id (a different instance of the same program) shouldn't process elements that have already been processed by the first consumer, right?
How does this behave with the Dataflow runner?
I don't think the options you have set guarantee non-duplicate delivery of messages across pipelines.
ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG: This is a flag for the Kafka consumer, not for the Beam pipeline itself. It appears to be best-effort and periodic, so you might still see duplicates across multiple pipelines.
withReadCommitted(): This just means that Beam will not read uncommitted messages. Again, it will not prevent duplicates across multiple pipelines.
See here for the protocol the Beam source uses to determine the starting point of the Kafka source.
To guarantee non-duplicate delivery, you probably have to read from different topics or different subscriptions.
I have a Dataflow job that subscribes to messages from PubSub:
    p.apply("pubsub-topic-read", PubsubIO.readMessagesWithAttributes()
        .fromSubscription(options.getPubSubSubscriptionName()).withIdAttribute("uuid"))
I see in the docs that there is no guarantee against duplicates, and that Beam suggests using withIdAttribute.
This works perfectly until I drain an existing job, wait for it to finish, and start another one. Then I see millions of duplicate BigQuery records (my job writes PubSub messages to BigQuery).
Any idea what I'm doing wrong?
I think you should use the update feature instead of draining the pipeline and starting a new one. With drain-and-restart, state is not shared between the two pipelines, so Dataflow cannot identify messages that were already delivered from PubSub. With the update feature, you should be able to continue your pipeline without duplicate messages.
I am using Twilio to send/receive texts in a Rails 4.2 app. I am sending in bulk, around 1000 at a time, and receiving sporadically.
Currently, when I receive a text I save it to the DB (to, from, body) and then pass that record to an ActiveJob worker to process later. For sending messages, I currently persist the Twilio params to another DB and pass that record to a different ActiveJob worker. Since I often send in batches, I have two workers. The first outgoing message worker sends a single message. The second one queries the DB to find all the users who should receive the message, creates a DB record for each message that should be sent, and then passes that record to the first outgoing message worker. So the second one basically just creates a bunch of jobs for the first one to process.
Right now I have the workers destroying the records once they finish processing (both incoming and outgoing). I am worried about not persisting things in case the server, Redis, or Resque goes down, but I do not know if this is actually a good design pattern. It was suggested to me to just use a vanilla Ruby object and pass its id to the worker, but I am not sure how that affects data reliability. So is it overkill to be creating all these DBs, and should I just be creating vanilla Ruby objects and passing those objects' ids to the workers?
Any and all insight is appreciated,
Drew
It seems to me that the approach of sending a minimal amount of data to your jobs is the best approach. Check out the 'Best Practices' section on the sidekiq wiki: https://github.com/mperham/sidekiq/wiki/Best-Practices
What if your queue backs up and that quote object changes in the meantime? Don't save state to Sidekiq, save simple identifiers. Look up the objects once you actually need them in your perform method.
Also, in terms of reliability: you should be worried about your job queue going down. It happens. You either design your system to tolerate such a failure or you find a job queue system that has higher reliability guarantees (but even then, no queue system can guarantee 100% message deliverability). Sidekiq Pro has better reliability guarantees than Sidekiq (non-Pro), but if you design your jobs with a little bit of forethought, you can create jobs that scan your database after a crash and re-queue any jobs that may have been lost.
How much work you spend designing fault-tolerant solutions really just depends on how critical it is that your information makes it from point A to point B :)
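The "scan your database after a crash and re-queue" idea can be sketched independently of the job library. The example below is a minimal in-memory sketch in Java rather than Ruby; `PendingRecord`, `LostJobSweeper`, and the grace period are hypothetical names and parameters. It assumes, as in the question, that workers destroy a record once it has been processed, so any record that still exists well after it was created is a candidate for re-queueing.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a persisted message row awaiting processing.
final class PendingRecord {
    final long id;
    final long createdAtMillis;

    PendingRecord(long id, long createdAtMillis) {
        this.id = id;
        this.createdAtMillis = createdAtMillis;
    }
}

final class LostJobSweeper {
    private LostJobSweeper() {}

    // Returns the ids of records that have existed for at least graceMillis.
    // Since workers delete records on success, these are likely lost jobs.
    static List<Long> findStale(List<PendingRecord> rows, long nowMillis, long graceMillis) {
        List<Long> stale = new ArrayList<>();
        for (PendingRecord row : rows) {
            if (nowMillis - row.createdAtMillis >= graceMillis) {
                stale.add(row.id);
            }
        }
        return stale;
    }
}
```

Only the ids are returned, in line with the advice to pass simple identifiers to jobs; re-queued jobs should themselves be idempotent, since a "stale" record may belong to a job that is merely slow.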
We've been tasked with implementing push notifications in our iOS and Android app. One of the features of the app is chat messaging, so we would like to push notify our users when they receive a message. The messages can be generated from the web app, so regardless of the origin, the chat messages get inserted into a Chat SQL Table via C# Web Services.
In my research I found PushSharp would be a good fit for our C# backend -- trying to avoid having to pay for a push notification service if we can. What I'm having a difficult time visualizing is how to trigger the push notification when a new message gets inserted to the DB table.
What's the best practice? I assume manually polling for new records is not.
Any advice would be appreciated.
M.
It's probably too late, but for newcomers who land here, I suggest trying Debezium. It consumes an event for each row-level change made to the database. Only committed changes are visible, so your application doesn't have to worry about transactions or changes that are rolled back.
There are a couple of solutions available to you. Some depend on the level of control you have over the table. Here are a few ideas:
Use a daemon to run a script that periodically checks for new entries and sends pushes when necessary. The script can rely on an id field (probably the primary key) to record the last row it checked and then pick up from there on the next run. You can use supervise or monit to set that up, but there are many other solutions out there that might be a better fit for your server.
A simpler solution would be to create a cron job entry that triggers the script mentioned above periodically.
If you don't control the original table, you can create a TRIGGER in MySQL that inserts a record into a separate table that you control entirely and can poll.
If you don't want to poll (which is in fact not preferable if you have a lot of data to go through at a high rate), you'll have to look into message queue systems (like RabbitMQ) or into pub/sub (I personally like Redis PUB/SUB).
Without more information about what your current architecture is, it's difficult to give you more details or point you to a better solution.
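The polling approach described above, where the script remembers the last id it checked and picks up from there, can be sketched as follows. This is a minimal in-memory sketch: `ChatRow` and `NewRowPoller` are hypothetical names, and the list passed to `poll` stands in for the result of a query such as "select rows with id greater than the last seen id".

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for a row in the chat table.
final class ChatRow {
    final long id;
    final String body;

    ChatRow(long id, String body) {
        this.id = id;
        this.body = body;
    }
}

final class NewRowPoller {
    private long lastSeenId;

    NewRowPoller(long startAfterId) {
        this.lastSeenId = startAfterId;
    }

    // Returns rows with an id above the high-water mark, in id order,
    // and advances the mark so the next call skips them.
    List<ChatRow> poll(List<ChatRow> table) {
        List<ChatRow> fresh = new ArrayList<>();
        for (ChatRow row : table) {
            if (row.id > lastSeenId) {
                fresh.add(row);
            }
        }
        fresh.sort(Comparator.comparingLong((ChatRow r) -> r.id));
        if (!fresh.isEmpty()) {
            lastSeenId = fresh.get(fresh.size() - 1).id;
        }
        return fresh;
    }
}
```

In a real deployment the high-water mark would need to be persisted, otherwise a restart of the daemon would re-send notifications for every existing row.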
According to the documentation:
Q: How many times will I receive each message?
Amazon SQS is
engineered to provide “at least once” delivery of all messages in its
queues. Although most of the time each message will be delivered to
your application exactly once, you should design your system so that
processing a message more than once does not create any errors or
inconsistencies.
Is there any good practice to achieve the exactly-once delivery?
I was thinking about using the DynamoDB “Conditional Writes” as distributed locking mechanism but... any better idea?
Some reference to this topic:
At-least-once delivery (Service Behavior)
Exactly-once delivery (Service Behavior)
FIFO queues are now available and provide ordered, exactly-once processing out of the box.
https://aws.amazon.com/sqs/faqs/#fifo-queues
Check your region for availability.
The best solution really depends on exactly how critical it is that you not perform the action suggested in the message more than once. For some actions, such as deleting a file or resizing an image, it doesn't really matter if it happens twice, so it is fine to do nothing. When it is more critical not to do the work a second time, I use an identifier for each message (generated by the sender), and the receiver tracks duplicates by marking the ids as seen in memcached. This is fine for many things, but probably not if life or money depends on it, especially if there are multiple consumers.
Conditional writes sound like a clever solution, but it has me wondering if perhaps AWS isn't such a great solution for your problem if you need a bullet proof exactly-once solution.
Another alternative for distributed locking is Redis Cluster, which can also be provisioned with Amazon ElastiCache. Redis supports transactions, which guarantee that concurrent calls are executed in sequence.
One of the advantages of using a cache is that you can set expiration timeouts, so if your message processing fails, the lock is released automatically when the timeout expires.
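The "mark the message as seen, with an expiration" idea can be illustrated without a cache server. Below is a minimal single-process sketch in which a `ConcurrentHashMap` stands in for the cache; `ExpiringSeenSet` and `tryMark` are hypothetical names, and the current time is passed in explicitly so the expiry behaviour is easy to follow. A real deployment with multiple consumers would need a shared store such as Redis or memcached instead.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// In-memory stand-in for a cache that holds "seen" markers with a timeout.
final class ExpiringSeenSet {
    private final ConcurrentMap<String, Long> expiryByMessageId = new ConcurrentHashMap<>();
    private final long ttlMillis;

    ExpiringSeenSet(long ttlMillis) {
        this.ttlMillis = ttlMillis;
    }

    // Returns true if the caller may process the message: either it has not
    // been seen before, or the previous marker has expired.
    boolean tryMark(String messageId, long nowMillis) {
        final boolean[] acquired = {false};
        // compute() runs atomically per key, so two callers cannot both acquire.
        expiryByMessageId.compute(messageId, (id, currentExpiry) -> {
            if (currentExpiry == null || currentExpiry <= nowMillis) {
                acquired[0] = true;
                return nowMillis + ttlMillis;
            }
            return currentExpiry;
        });
        return acquired[0];
    }
}
```

The timeout is a trade-off: too short and a slow consumer may overlap with a redelivery, too long and a failed message stays blocked until the marker expires.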
In this blog post the usage of a low-latency control database like Amazon DynamoDB is also recommended:
https://aws.amazon.com/blogs/compute/new-for-aws-lambda-sqs-fifo-as-an-event-source/
Amazon SQS FIFO queues ensure that the order of processing follows the
message order within a message group. However, it does not guarantee
only once delivery when used as a Lambda trigger. If only once
delivery is important in your serverless application, it’s recommended
to make your function idempotent. You could achieve this by tracking a
unique attribute of the message using a scalable, low-latency control
database like Amazon DynamoDB.
In short: we can put or update an item in a DynamoDB table using a condition expression with attribute_not_exists (for put) or if_not_exists (for update). Please check the example here:
https://stackoverflow.com/a/55110463/9783262
If the put/update operation throws an exception because the condition failed, the message has already been seen, so we return from the Lambda without further processing. If no exception is thrown, we go on to process the message (https://aws.amazon.com/premiumsupport/knowledge-center/lambda-function-idempotent/).
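The control flow described above (attempt a conditional write first, and only process the message if the write succeeds) can be sketched as follows. This is an in-memory illustration, not DynamoDB code: a `ConcurrentHashMap` stands in for the control table, and `putIfAbsent` plays the role of a put guarded by attribute_not_exists. `IdempotentHandler` and `handle` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

final class IdempotentHandler {
    // Stand-in for the control table, keyed by the message's unique attribute.
    private final ConcurrentMap<String, Boolean> controlTable = new ConcurrentHashMap<>();
    // Records the bodies that were actually processed, for illustration only.
    final List<String> processed = new ArrayList<>();

    // Returns true if the message was processed, false if it was a duplicate.
    boolean handle(String messageId, String body) {
        // Succeeds only for the first writer of this id, like a conditional put.
        boolean firstTime = controlTable.putIfAbsent(messageId, Boolean.TRUE) == null;
        if (!firstTime) {
            return false; // already seen: return without further processing
        }
        processed.add(body); // the real work would happen here
        return true;
    }
}
```

Note that this sketch marks the message before processing it, so a crash between the two steps would leave the message marked but unprocessed; whether that is acceptable depends on the application.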
The following resources were helpful for me too:
https://ably.com/blog/sqs-fifo-queues-message-ordering-and-exactly-once-processing-guaranteed
https://aws.amazon.com/blogs/aws/introducing-amazon-sns-fifo-first-in-first-out-pub-sub-messaging/
https://youtu.be/8zysQqxgj0I