I have two microservices, e.g M1 and M2. M1 is responsible for managing user transactions e.g orders. When an order is completed, the summary data is sent to M2 via Message bus. M2 is reponsible for generating reports on orders. Our transactions completes without checking if the message was processed successfully or not by M2. The problem is that some orders does not appear in the reports as the messages are not processed successfully(because of any Random Issue). What is the best way to make the data consistent between the two services. I am implementing a mechanism to pull data from M1 and identifying the gaps using the reference numbers(Its is a sequencial number) which I know is not a good approch as I may not know the last reference number that I have is actually the last. Any suggestions or improvements will be highly appreciated.
Thanks.
I have tried a pull data mechanism but I do not think that a good idea.
you may take a look at SAGA Pattern.
Saga goal is to form a Service span Transaction with capability of Compensate transaction.also you can use two phase commit pattern 2PC but it is not recommended mostly since it is resource span transaction which means it occupies resource until end of transaction and thats not good unless you insist immediate transaction which is short enough to release the resources soon as possible.
back to saga, you may hear the term called routing Slip. the routing slip forms chained steps to fulfill the transaction over services. this routing slip can be implemented in two ways of choreography and orchestration. in case of failure happened in any steps the compensation will triggered for all steps that are done. this compensate may be a rollback or any other strategy that should take place. eg :
order added
the inventory item allocated
shipping service proceeded with error and compensate take place as
the allocated item in inventory released
order removed or canceled
i use Masstransit.
read SAGA State Machine where the steps in transaction are coordinated and updates the state for the transaction
read Courier Routing Slip where the routing slips for transaction are defined.
watch Masstransit Serries tut by Chris Patterson
Also, please consider retry policy in case of failure occurred. it can be done with Masstransit configuration.
Related
In the Sidekiq wiki it talks about the need for jobs to be idempotent and transactional. Conceptually this makes sense to me, and this SO answer has what looks like an effective approach at a small scale. But it's not perfect. Jobs can disappear in the middle of running. We've noticed certain work is incomplete and when we look in the logs they cut short in the middle of the work as if the job just evaporated. Probably due to a server restart or something, but it often doesn't find its way back into the queue. super_fetch tries to address this, but it errs on the side of duplicating jobs. With that we see a lot of jobs that end up running twice simultaneously. Having a database transaction cannot protect us from duplicate work if both transactions start at the same time. We'd need locking to prevent that.
Besides the transaction, though, I haven't been able to figure out a graceful solution when we want to do things in bulk. For example, let's say I need to send out 1000 emails. Options I can think of:
Spawn 1000 jobs, which each individually start a transaction, update a record, and send an email. This seems to be the default, and it is pretty good in terms of idempotency. But it has the side effect of creating a distributed N+1 query, spamming the database and causing user facing slowdowns and timeouts.
Handle all of the emails in one large transaction and accept that emails may be sent more than once, or not at all, depending on the structure. For example:
User.transaction do
users.update_all(email_sent: true)
users.each { |user| UserMailer.notification(user).deliver_now }
end
In the above scenario, if the UserMailer loop halts in the middle due to an error or a server restart, the transaction rolls back and the job goes back into the queue. But any emails that have already been sent can't be recalled, since they're independent of the transaction. So there will be a subset of the emails that get re-sent. Potentially multiple times if there is a code error and the job keeps requeueing.
Handle the emails in small batches of, say, 100, and accept that up to 100 may be sent more than once, or not at all, depending on the structure, as above.
What alternatives am I missing?
One additional problem with any transaction based approach is the risk of deadlocks in PostgreSQL. When a user does something in our system, we may spawn several processes that need to update the record in different ways. In the past the more we've used transactions the more we've had deadlock errors. It's been a couple of years since we went down that path, so maybe more recent versions of PostgreSQL handle deadlock issues better. We tried going one further and locking the record, but then we started getting timeouts on the user side as web processes compete with background jobs for locks.
Is there any systematic way of handling jobs that gracefully copes with these issues? Do I just need to accept the distributed N+1s and layer in more caching to deal with it? Given the fact that we need to use the database to ensure idempotency, it makes me wonder if we should instead be using delayed_job with active_record, since that handles its own locking internally.
This is a really complicated/loaded question, as the architecture really depends on more factors than can be concisely described in simple question/answer formats. However, I can give a general recommendation.
Separate Processing From Delivery
start a transaction, update a record, and send an email
Separate these steps out. Better to avoid doing both a DB update and email send inside a transaction, batched or not.
Do all your logic and record updates inside transactions separately from email sends. Do them individually or in bulk or perhaps even in the original web request if it's fast enough. If you save results to the DB, you can use transactions to rollback failures. If you save results as args to email send jobs, make sure processing entire batch succeeds before enqueing the batch. You have flexibility now b/c it's a pure data transform.
Enqueue email send jobs for each of those data transforms. These jobs must do little to no logic & processing! Keep them dead simple, no DB writes -- all processing should have already been done. Only pass values to an email template and send. This is critical b/c this external effect can't be wrapped in a transaction. Making email send jobs a read-only for your system (it "writes" to email, external to your system) also gives you flexibility -- you can cache, read from replicas, etc.
By doing this, you'll separate the DB load for email processing from email sends, and they are now dealt with separately. Bugs in your email processing won't affect email sends. Email send failures won't affect email processing.
Regarding Row Locking & Deadlocks
There shouldn't be any need to lock rows at all anymore -- the transaction around processing is enough to let the DB engine handle it. There also shouldn't be any deadlocks, since no two jobs are reading and writing the same rows.
Response: Jobs that die in the middle
Say the job is killed just after the transaction completes but before the emails go out.
I've reduced the possibility of that happening as much as possible by processing in a transaction separately from email sending, and making email sending as dead simple as possible. Once the transaction commits, there is no more processing to be done, and the only things left to fail are systems generally outside your control (Redis, Sidekiq, the DB, your hosting service, the internet connection, etc).
Response: Duplicate jobs
Two copies of the same job might get pulled off the queue, both checking some flag before it has been set to "processing"
You're using Sidekiq and not writing your own async job system, so you need to consider job system failures out of your scope. What remains are your job performance characteristics and job system configurations. If you're getting duplicate jobs, my guess is your jobs are taking longer to complete than the configured job timeout. Your job is taking so long that Sidekiq thinks it died (since it hasn't reported back success/fail yet), and then spawns another attempt. Speed up or break up the job so it will succeed or fail within the configured timeout, and this will stop happening (99.99% of the time).
Unlike web requests, there's no human on the other side that will decide whether or not to retry in an async job system. This is why your job performance profile needs to be predictable. Once a system gets large enough, I'd expect completely separate job queues and workers based on differences like:
expected job run time
expected job CPU/mem/disk usage
expected job DB or other I/O usage
job read only? write only? both?
jobs hitting external services
jobs users are actively waiting on
This is a super interesting question but I'm afraid it's nearly impossible to give a "one size fits all" kind of answer that is anything but rather generic. What I can try to answer is your question of individual jobs vs. all jobs at once vs. batching.
In my experience, generally the approach of having a scheduling job that then schedules individual jobs tends to work best. So in a full-blown system I have a schedule defined in clockwork where I schedule a scheduling job which then schedules the individual jobs:
# in config/clock.rb
every(1.day, 'user.usage_report', at: '00:00') do
UserUsageReportSchedulerJob.perform_now
end
# in app/jobs/user_usage_report_scheduler_job.rb
class UserUsageReportSchedulerJob < ApplicationJob
def perform
# need_usage_report is a scope to determine the list of users who need a report.
# This could, of course, also be "all".
User.need_usage_report.each(&UserUsageReportJob.method(:perform_later))
end
end
# in app/jobs/user_usage_report_job.rb
class UserUsageReportJob < ApplicationJob
def perform(user)
# the actual report generation
end
end
If you're worried about concurrency here, tweak Sidekiq's concurrency settings and potentially the connection settings of your PostgreSQL server to allow for the desired level of concurrency. I can say that I've had projects where we've had schedulers that scheduled tens of thousands of individual (small) jobs which Sidekiq then happily took in in batches of 10 or 20 on a low priority queue and processed over a couple of hours with no issues whatsoever for Sidekiq itself, the server, the database etc.
I am using ASP.NET MVC with NServiceBus and where as the vast majority of commands can be executed with eventual consistency in mind, there are a small minority of tasks where immediate consistency would appear to simplify things.
I have done plenty of research on the various methods used to accomplish this but few come with any kind of justification as to why that particular method is preferable. I don't have any experience with NSB in a production environment, so it would also be nice to know if any methods limit scalability in any way.
The following are broadly the methods I have come across: -
No synchronisation, fake the information back to the client. My reservations with this one are firstly, you have to deal with the case where you have faked the data and the command has failed (unlikely scenario) and more importantly, if the initialisation of any data within the command is complex, the ability to fake this data is not necessarily feasible anyway.
Reply (or publish event to be recieved by client) when the task is completed. My reservation with this one is that it means that the distributed architecture becomes more complex and I am not sure if load balanced clients would cause issues as only one of the client machines should be recieving the reply.
Poll the read store until data is present. My reservation with this one is that it puts the read store under more load than the other options.
Are there any options which are better than the above three and if so, why? If not, which of the above three are better and why?
I am assuming that the answer is not subjective and one suits using NServiceBus to implement the command infrastructure in CQRS better than the others.
Thanks.
My take on this is that the actual endpoint should not be performing the work but be handing it off to some 'Task' (Application Service / Operation Script) object. That object is performing the work immediately.
So for cases where you absolutely have to have 100% consistency rather call that same task object rather than sending a command for later processing. You may still want that command for other scenarios.
According to the documentation:
Q: How many times will I receive each message?
Amazon SQS is
engineered to provide “at least once” delivery of all messages in its
queues. Although most of the time each message will be delivered to
your application exactly once, you should design your system so that
processing a message more than once does not create any errors or
inconsistencies.
Is there any good practice to achieve the exactly-once delivery?
I was thinking about using the DynamoDB “Conditional Writes” as distributed locking mechanism but... any better idea?
Some reference to this topic:
At-least-once delivery (Service Behavior)
Exactly-once delivery (Service Behavior)
FIFO queues are now available and provide ordered, exactly once out of the box.
https://aws.amazon.com/sqs/faqs/#fifo-queues
Check your region for availability.
The best solution really depends on exactly how critical it is that you not perform the action suggested in the message more than once. For some actions such as deleting a file or resizing an image it doesn't really matter if it happens twice, so it is fine to do nothing. When it is more critical to not do the work a second time I use an identifier for each message (generated by the sender) and the receiver tracks dups by marking the ids as seen in memchachd. Fine for many things, but probably not if life or money depends on it, especially if there a multiple consumers.
Conditional writes sound like a clever solution, but it has me wondering if perhaps AWS isn't such a great solution for your problem if you need a bullet proof exactly-once solution.
Another alternative for distributed locking is Redis cluster, which can also be provisioned with AWS ElasticCache. Redis supports transactions which guarantee that concurrent calls will get executed in sequence.
One of the advantages of using cache is that you can set expiration timeouts, so if your message processing fails the lock will get timed release.
In this blog post the usage of a low-latency control database like Amazon DynamoDB is also recommended:
https://aws.amazon.com/blogs/compute/new-for-aws-lambda-sqs-fifo-as-an-event-source/
Amazon SQS FIFO queues ensure that the order of processing follows the
message order within a message group. However, it does not guarantee
only once delivery when used as a Lambda trigger. If only once
delivery is important in your serverless application, it’s recommended
to make your function idempotent. You could achieve this by tracking a
unique attribute of the message using a scalable, low-latency control
database like Amazon DynamoDB.
In short - we can put item or update item in dynamodb table with condition expretion attribute_not_exists(for put) or if_not_exists(for update), please check example here
https://stackoverflow.com/a/55110463/9783262
If we get an exception during put/update operations, we have to return from a lambda without further processing, if not get it then process the message (https://aws.amazon.com/premiumsupport/knowledge-center/lambda-function-idempotent/)
The following resources were helpful for me too:
https://ably.com/blog/sqs-fifo-queues-message-ordering-and-exactly-once-processing-guaranteed
https://aws.amazon.com/blogs/aws/introducing-amazon-sns-fifo-first-in-first-out-pub-sub-messaging/
https://youtu.be/8zysQqxgj0I
My organization is building a new version of our ticketing site and is looking for the best way to build an online waiting room when the number of users in our purchase path exceeds a certain limit. The best version of this queue would let new users in after existing users have either completed their purchase or have exceeded a timeout limit after entering the path.
I'm trying to get an idea of how this has been implemented by other organizations. Has anyone out there done something similar or have any experience with this? We have some ideas, but I'd like to get a sense of what solutions have been tried and what problems those solutions have run up against.
Just to be complete, this site is being built in Ruby on Rails, though I'd love to hear about how people have solved this regardless of platform.
Edit: To clarify: The need for the queue is not primarily to reduce load, but to limit the speed at which the web is purchasing tickets relative to people buying in other ways, like over the phone.
Before I outline one method for this, I want to point out that what you want to do doesn't make a lot of sense. Services on the web aren't like a physical store, where I can walk up and see that it's crowded and decide to stay or not. Queueing people on your site strikes me as shifting the blame from you (unable or unwilling to adequately provision resources) to me (punishing me for trying to use your site).
If you're selling something like show tickets, where quantity is limited and each item is tied to a seat, I think it's better to reserve items and time out those reservations if they aren't paid for in a timely manner. Ticketmaster does this, and I think it's a much better solution than blocking people at the door.
If you still want to go down this path, then I'd design the system like this:
As customers come to your site, record their arrival time. As they interact with the site, record a "last seen" time. "Last seen" will be used to determine activeness. You'll need a background job running very frequently to expire sessions quickly.
Once your limit is hit, you have an ordered queue of people who are blocked. As customers complete their transaction or time out, you'll mark the next person in the queue for entry into the purchase path.
For queued users, their browsers will make a request on a regular basis, checking to see if you've let them in yet. If yes, they proceed to the purchase path. If no, they continue to wait.
The purchase path needs a mechanism to check if someone is trying to circumvent your waiting area, and sends them back.
You might find the Online queuing for ticketing guide helpful. Check their repository at GitHub.
They've integration with Ruby On Rails, PHP, .NET, iOS, Android and similar platforms.
Queue-it enables you to gain control of website overload during extreme traffic peaks by offloading end users into an online queue.
When a peak traffic event occurs on a website, the online queue system sends users to the virtual waiting room environment where the users wait and are redirected back to the website at a rate it can handle.
I was looking at the slave/pool modules and it seems similar to what I
want, but it also seems like I have a single point of failure in my
application (if the master node goes down).
The client has a list of gateways (for the sake of fallback - all do
the same thing) which accept connections, and one is chosen from
randomly by the client. When the client connects all nodes are
examined to see which has the least load and then the IP of the least-
loaded server is forwarded back to the client. The client then
connects to this server and everything is executed there.
In summary, I want all nodes to act as both gateways and to actually
process client requests. The load balancing is only done when the
client initially connects - all of the actual packets and processed on
the client's "home" node.
How would I do this?
I don't know if there is this modules implemented yet but what I can say, load balance is overrated. What I can argue is, random placing of jobs is best bet unless you know far more information how load will come in future and in most of cases you really doesn't. What you wrote:
When the client connects all nodes are examined to see which has the least load and then the IP of the least- loaded server is forwarded back to the client.
How you know that all those least loaded node will not be highest loaded just in next ms? How you know that all those high loaded nodes which you will not include in list will not drop load just in next ms? You really can't know it unless you have very rare case.
Just measure (or compute) your node's performance and set node's probability be chosen depend of it. Choose node randomly regardless of current load. Use this as initial approach. When you set it up, then you can try make up some more sophisticated algorithm. I bet that it will be very hard work to beat this initial approach. Trust me, very hard.
Edit: To be more clear in one subtle detail, I strongly argue that you can't predict future load from current and historical load but you should use knowledge about tasks durations probability and current decomposition of task's lifetime. This work is so hard to try achieve.
The purpose of a supervision tree is to manage the processes not necessarily forward requests. There is no reason you couldn't use different code to send requests directly to members of the list of available processes. See the pool:get_nodes or pool:get_node() functions for one way to get those lists.
You can let the pool module handle the management of the processes (restarting, monitoring, and killing processing) and use some other module to transparently redirect requests to the pool of processes. Maybe you were looking for distributed pools though? It'll be hard to get away from the master process in erlang whithout going to distributed nodes. The whole running system is pretty much one large supervision tree.
I recently remembered the pg module which allows you to setup process groups. messages sent to the group go to every process in the group. It might get you part way toward what you want. you would have to write the code to decide which process handles the request for real but you would get a pool without a master using it.