How to avoid Race Conditions when a restful architecture doesn't make sense - asp.net-mvc

I'm making a website that I don't think makes sense to implement with a restful architecture (at least not the portion relevant to this problem), but it's causing some problems with race conditions across multiple servers which share a database.
My website has info about users of another product, so it has a Users table (not users of my site though). Users have many files.
Users and files are populated by an automated service, not manually on the site. The service posts the files to the server, the server parses them and gets the username from the file. If the username is new, it creates a new user row in the table. It then returns info about the file to the service that made the request.
The problems I'm seeing are race conditions when multiple requests come in at the same time for related objects, causing things like violations of unique indexes in the db.
For example, there is a unique key on username. This code can be a problem if 2 requests from the automated service for files from the same user come in at the same time.
var myuser = db.users.FirstOrDefault(u => u.username == username);
if (myuser == null)
{
    myuser = new user(username);
    db.users.AddObject(myuser);
}
db.SaveChanges();
Request 1 will see that there is no user with username foo, so the if condition returns true. Request 2 sees the same thing, not knowing that request 1 already began creating the user, and when request 2 tries to save, it violates the unique key.
Is there a common pattern or solution to this problem? I know this wouldn't be a problem if the server were RESTful, but I don't think it's really feasible for the service to change the way it makes requests, so I'd like that to stay the same if possible. Right now, it just posts the file to the server, not knowing whether the user of that file existed already, or whether that file was posted to the server yet (it may post it more than once). Those objects are created if they don't exist yet, and if they do, the list of items is updated. But as far as the service is concerned, it just wants to know certain info about the file, and isn't concerned with whether or not it already exists in my db.
I think it'd be too slow for it to create the user via one request, then create the file via another request, and then ask for info about the file in yet another request. Also, the service runs multiple requests at a time via Parallel.ForEach, and it'd be too slow for it to run in a single thread.

The first thing is separation of concerns. If you have an automated service populating the data, then that service (or another piece of middleware) should be responsible for creating the database records. This shouldn't happen at run time in response to a request to your website.
Second, if you must do it this way, that is what locks are for. Each request to your website runs in its own thread(s). So, if multiple threads need to access the same volatile resource (your DB), you need some form of locking: either pessimistic locks, where the first thread in wins and any further threads can only touch that table or row (depending on the lock granularity) once the first has completed its work, or optimistic concurrency, where a late writer detects the conflict and retries.
Third, this is pretty much exactly what RESTful architecture attempts to solve. You can use ETags to version your resources so any attempt to POST to an outdated resource will return an HTTP error (409 Conflict) directing the client to refetch the original resource.
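For the specific duplicate-username race above, one common approach (a sketch, assuming the EF ObjectContext-style API from the question; the exception type to catch depends on your EF and database versions) is to let the unique index arbitrate: attempt the insert, and if another request won the race, catch the violation and re-read the winner's row.
// Sketch: rely on the unique index on username to resolve the race.
// Exception handling is illustrative; adjust for your EF/database versions.
var myuser = db.users.FirstOrDefault(u => u.username == username);
if (myuser == null)
{
    myuser = new user(username);
    db.users.AddObject(myuser);
    try
    {
        db.SaveChanges();
    }
    catch (UpdateException) // unique key violation: another request created the user first
    {
        db.Detach(myuser);                                    // discard our duplicate
        myuser = db.users.First(u => u.username == username); // use the row that won the race
    }
}
Alternatively, wrap the check-and-insert in a serializable transaction, or push it down to the database as an atomic upsert (a MERGE statement, for example); catching the violation keeps the common path lock-free.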

Related

How does Rails handle multiple incoming requests? [closed]

How does Rails handle multiple requests from different users without colliding? What is the logic?
E.g., user1 logs in and browses the site. At the same time, user2, user3, ... log in and browse. How does Rails manage this situation without any data conflict between users?
One thing to bear in mind here is that even though users are using the site simultaneously in their browsers, the server may still only be handling a single request at a time. Request processing may take less than a second, so requests can be queued up to be processed without causing significant delays for the users. Each response starts from a blank slate, taking only information from the request and using it to look up data from the database. It does not carry anything over from one request to the next. This is called the "stateless" paradigm.
If the load increases, more rails servers can be added. Because each response starts from scratch anyway, adding more servers doesn't create any problems to do with "sharing of information", since all information is either sent in the request or loaded from the database. It just means that more requests can be handled per second.
Where there is a feeling of "continuity" for the user, for example staying logged into a website, this is done via cookies, which are stored on their machine and sent through as part of the request. The server can read this cookie information from the request and, for example, NOT redirect someone to the login page, as the cookie tells it they have already logged in as user 123 or whatever.
In case your question is about how Rails tells users apart, the answer is that it uses cookies to store the session. You can read more about it here.
Also, data does not conflict, since you get a fresh instance of the controller for each request.
RailsGuides:
When your application receives a request, the routing will determine which controller and action to run, then Rails creates an instance of that controller and runs the method with the same name as the action.
That is guaranteed not by Rails but by the database that the web application uses. The property you mentioned is called isolation. This is among several properties that a practical database has to satisfy, known as ACID.
This is achieved using a "session": a bunch of data specific to the given client, available server-side.
There are plenty of ways for a server to store a session; typically Rails uses a cookie: a small (around 4 kB) dataset that is stored in the user's browser and sent with every request. For that reason you don't want to store too much in there. However, you usually don't need much: just enough to identify the user while still making it hard to impersonate them.
Because of that, Rails stores the session itself in the cookie (as this guide says). It's simple and requires no setup. Some think that cookie store is unreliable and use persistence mechanisms instead: databases, key-value stores and the like.
Typically the workflow is as follows:
A session id is stored in a cookie when the server decides to initialize a session
The server receives a request from the user and fetches the session by its id
If the session says that it represents user X, Rails acts as if it's actually that user
Since different users send different session ids, Rails treats them as different users and outputs data relevant to the detected one, on a per-request basis.
Before you ask: yes, it is possible to steal another person's session id and act in that person's name. It's called session hijacking, and it's only one of many possible security issues you might run into unless you're careful. That same page offers some more insight on how to protect your users.
Additionally, you could use something like Puma, a multithreaded server...

How to properly handle asynchronous database replication?

I'm considering using Amazon RDS with read replicas to scale our database.
Some of our controllers in our web application are read/write, some of them are read-only. We already have an automated way for identifying which controllers are read-only, so my first approach would have been to open a connection to the master when requesting a read/write controller, else open a connection to a read replica when requesting a read-only controller.
In theory, that sounds good. But then I stumbled upon the replication lag concept, which basically says that a replica can be several seconds behind the master.
Let's imagine the following use case then:
The browser posts to /create-account, which is read/write, thus connecting to the master
The account is created, transaction committed, and the browser gets redirected to /member-area
The browser opens /member-area, which is read-only, thus connecting to a replica. If the replica is even slightly behind the master, the user account might not exist yet on the replica, thus resulting in an error.
How do you realistically use read replicas in your application, to avoid these potential issues?
I worked with an application which used pseudo-vertical partitioning. Since only a handful of the data was time-sensitive, the application usually fetched from slaves, and from the master only in selected cases (a rough sketch of this routing idea follows the list below).
As an example: when the user updated their password, the application would always ask the master during authentication. When changing non-time-sensitive data (like user preferences) it would display a success dialog along with information that it might take a while until everything is updated.
Some other ideas which might or might not work depending on the environment:
After an update, compute the entity checksum, store it in the application cache, and when fetching the data always check it against the checksum
Use browser storage/cookies to store a delta, ensuring the user always sees the latest version
Add an "up-to-date" flag and invalidate it synchronously on every slave node before/after an update
Whatever solution you choose, keep in mind it's subject to the CAP theorem.
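Going back to the routing approach above, a rough sketch of the idea (hypothetical helper; the Config properties and connection strings are made up) is simply to let the data access layer pick a connection based on how fresh the read must be:
using System;
using System.Data.SqlClient;

public static class ReplicaRouter
{
    private static readonly Random Rng = new Random();

    // Time-sensitive reads (authentication, "read your own writes") go to the
    // master; everything else goes to a randomly chosen replica.
    public static SqlConnection OpenConnectionFor(bool timeSensitive)
    {
        var replicas = Config.ReplicaConnectionStrings;   // hypothetical config
        var connectionString = timeSensitive
            ? Config.MasterConnectionString
            : replicas[Rng.Next(replicas.Length)];

        var connection = new SqlConnection(connectionString);
        connection.Open();
        return connection;
    }
}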
This is a hard problem, and there are lots of potential solutions. One option is to look at what Facebook did.
TL;DR: read requests get routed to the read-only copy, but if you do a write, then for the next 20 seconds all your reads go to the writable master.
The other main problem we had to address was that only our master databases in California could accept write operations. This fact meant we needed to avoid serving pages that did database writes from Virginia because each one would have to cross the country to our master databases in California. Fortunately, our most frequently accessed pages (home page, profiles, photo pages) don't do any writes under normal operation. The problem thus boiled down to, when a user makes a request for a page, how do we decide if it is "safe" to send to Virginia or if it must be routed to California?
This question turned out to have a relatively straightforward answer. One of the first servers a user request to Facebook hits is called a load balancer; this machine's primary responsibility is picking a web server to handle the request but it also serves a number of other purposes: protecting against denial of service attacks and multiplexing user connections to name a few. This load balancer has the capability to run in Layer 7 mode where it can examine the URI a user is requesting and make routing decisions based on that information. This feature meant it was easy to tell the load balancer about our "safe" pages and it could decide whether to send the request to Virginia or California based on the page name and the user's location.
There is another wrinkle to this problem, however. Let's say you go to editprofile.php to change your hometown. This page isn't marked as safe so it gets routed to California and you make the change. Then you go to view your profile and, since it is a safe page, we send you to Virginia. Because of the replication lag we mentioned earlier, however, you might not see the change you just made! This experience is very confusing for a user and also leads to double posting. We got around this concern by setting a cookie in your browser with the current time whenever you write something to our databases. The load balancer also looks for that cookie and, if it notices that you wrote something within 20 seconds, will unconditionally send you to California. Then when 20 seconds have passed and we're certain the data has replicated to Virginia, we'll allow you to go back for safe pages.
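The same "route to the master for a short window after a write" trick can also be implemented in the application instead of the load balancer. A rough, hypothetical ASP.NET-style sketch (the cookie name, its format, and the 20-second window are all illustrative):
using System;
using System.Web;

public static class StickyMasterRouting
{
    private const string LastWriteCookie = "last_write_at";
    private static readonly TimeSpan StickyWindow = TimeSpan.FromSeconds(20);

    // Call this after any database write: remember when the user last wrote.
    public static void MarkWrite(HttpResponseBase response)
    {
        response.Cookies.Add(new HttpCookie(LastWriteCookie,
            DateTimeOffset.UtcNow.ToUnixTimeSeconds().ToString()));
    }

    // Call this before choosing a connection: reads within the window go to the master.
    public static bool MustUseMaster(HttpRequestBase request)
    {
        var cookie = request.Cookies[LastWriteCookie];
        long unixSeconds;
        if (cookie == null || !long.TryParse(cookie.Value, out unixSeconds))
            return false;

        var lastWrite = DateTimeOffset.FromUnixTimeSeconds(unixSeconds);
        return DateTimeOffset.UtcNow - lastWrite < StickyWindow;
    }
}
The write path calls MarkWrite, the read path consults MustUseMaster when picking a connection, and once the window passes reads fall back to the replicas.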

Rails application design: Queueing, Resque, Background Services, and Redis

I am designing a Rails app that takes in requests, uses data within the request to call a 3rd party web service, processes the reply, and then sends out a response to the original requestor and also issues a PUT request to yet another service.
I am trying to wrap my head around how to design this Rails app as it's different from the canonical Rails structure.
The objects are Lists and Tasks. Each List has many Tasks, and each Task belongs to a List.
The request I would get is something like:
http://myrailsapp.heroku.com/v1/lists?id=1&from=2012-02-12&to=2012-02-14&priority=high
In this example I am requesting tasks from 2/12/2012 to 2/14/2012 with a high priority in List #1
I would then issue a 3rd party web service call like this:
http://thirdpartywebservice.com/v1/lists?id=4128&from=2012-02-12&to=2012-02-14&priority=high
As you can see, some processing was done on the data (the id was changed in this case).
The results are then sent back to the requestor and to another web service via PUT.
My question is, how do I set up the Rails app to handle these types of behaviors? How does the controller structure change? This looks like a good use case for queues, how do I distribute multiple concurrent requests among queues?
For one thing I don't need data persistence (data can be discarded after the response is sent out) and also data structure design is simplified. (I don't think I need ruby objects, simply dictionaries or hashes representing these would be lighter weight and quicker to implement)
Edit
So I broke down the workflow of the app into these components:
Parse incoming request
Construct the 3rd party web service request
Send 3rd party request
Enqueue a worker to process the expected response
Process the response once it arrives
Send the parsed result back as a response
Which of the standard Rails controllers handles each of these steps? What are the models needed besides Lists and Tasks?
You should still use a database because passing data to Resque is messy. Rather, you should store it in the database and then pass the id to the workers, fetch the data, commit any new data or delete the record. It's really up to you but this method is cleaner. You can also use a push service like faye to let the user know when the processing is complete.
If you expect to have many concurrent requests, I would recommend Sidekiq as it's less of a memory hog. Having 4-5 resque workers can already suck up about 512 MB. The controller structure should not change. Please comment on anything you need clarified and I'll be happy to update my answer.
EDIT
You would want to use a separate database store, such as Postgres. Not sure if it's important what models you need, but essentially this is what should be happening.
In your controller, create a Request object which contains the query params you want to query this 3rd party service with. Then enqueue a job to be handled by Sidekiq/Resque (let's call it ThirdPartyRequest), passing in the id of the Request object you just created as an argument. Then render a view showing the Request object. At this point Request#response is still empty because it hasn't been processed yet, so let the user know it's still processing.
A worker then handles your ThirdPartyRequest job. ThirdPartyRequest should fetch the Request object and obtain the query params needed to contact the third party service. It makes that call and gets a response; update the Request object with this response, then save it.
class ThirdPartyRequest
  # Resque pulls jobs for this worker from this queue (queue name assumed here).
  @queue = :third_party_requests

  def self.perform(request_id)
    request = Request.find(request_id)
    # contact third party service
    request.response = ...
    request.save
  end
end
The user can continually refresh the page to check on their Request object. Once it gets updated with the response, they will know it's completed. If you want the page to refresh automatically, look into faye/juggernaut/private_pub or a SaaS solution like Pusher.

Design for long running ASP.net MVC web request

I'm aware of the model that involves a scheduled task running in the background which runs jobs registered with a web request, but how about this for an idea that keeps everything within ASP.NET...
User uploads a CSV file with, perhaps, several thousand rows. The rows are persisted to the database. I think this would take maybe a minute or so which would be an acceptable wait.
Request returns to the browser and then an automatic Ajax request would go back to the server and request, say, ten rows at a time and process them. (Each row requires a number of web service requests.)
Ajax call returns, display is updated and then another automatic Ajax request goes back for more rows. This repeats until all rows are completed.
If user leaves the web page, then they could return and restart the job.
Any thoughts?
Cheers, Ian.
If I get you right, you actually don't need any "interaction" between background jobs and the long-running request; you just want to "launch" background jobs with incoming requests? Not such a good idea. Take a look at the Quartz.NET project: it is a scheduler embeddable into an ASP.NET application, and it will handle this stuff for you without the need for requests. Of course, if there is an app pool shutdown, your scheduler goes down too, but you can't guarantee that won't happen even with your long-running-requests solution, which depends on a browser waiting on the other side.
Also take a look at this interesting article from Phil Haack on this topic, with his own little scheduler library specific to ASP.NET:
http://haacked.com/archive/2011/10/16/the-dangers-of-implementing-recurring-background-tasks-in-asp-net.aspx
A server-side program (or ideally a service) could still be quick and dirty and would be more reliable. You could still do step 1 as you have proposed: upload the file and insert the data (don't forget to increase the maxRequestLength value, and possibly the executionTimeout, in web.config). Then have a program running on the server that checks for new records and processes them.
If the user needs status you could store an entry in the database for each file and update the database record when the import is complete.
Maybe I'm reading the question and interpreting it in a weird way, but why couldn't you read the file into the database and store, in a table, the line of the file you've processed up to? You could then track your progress via the db and just send small JSON objects telling the user how far along you are. That way, if their connection drops, you can keep processing their request, and if they return later you can notify them of how far along the job is. Also, if multiple clients are connecting, you can use the db to queue and throttle (by serializing) the workload. And if the user connects mid-job with another file, their new request will be queued up after their current job.
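A rough sketch of that idea as an MVC action (hypothetical names: the ImportRows table, the per-row web service call, and the batch size are all illustrative). Each Ajax call processes the next batch of pending rows and returns progress as JSON, so progress lives in the database and survives a dropped connection:
using System.Linq;
using System.Web.Mvc;

public class ImportController : Controller
{
    // Driven by repeated Ajax calls: process the next batch of unprocessed
    // rows for an import job and report how far along it is.
    [HttpPost]
    public ActionResult ProcessNextBatch(int importJobId, int batchSize = 10)
    {
        var pending = db.ImportRows
            .Where(r => r.ImportJobId == importJobId && !r.Processed)
            .OrderBy(r => r.LineNumber)
            .Take(batchSize)
            .ToList();

        foreach (var row in pending)
        {
            CallExternalWebServicesFor(row);   // the per-row web service calls from the question
            row.Processed = true;
        }
        db.SaveChanges();

        var total = db.ImportRows.Count(r => r.ImportJobId == importJobId);
        var done = db.ImportRows.Count(r => r.ImportJobId == importJobId && r.Processed);
        return Json(new { Done = done, Total = total, Finished = done == total });
    }
}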

Client Server API pattern in REST (unreliable network use case)

Let's assume we have a client/server interaction happening over an unreliable network (packet drop). A client is calling the server's RESTful API (over HTTP over TCP):
issuing a POST to http://server.com/products
server is creating an object of "product" resource (persists it to a database, etc)
server is returning 201 Created with a Location header of "http://server.com/products/12345"
! The TCP packet containing the HTTP response gets dropped, and eventually this leads to a TCP connection reset
I see the following problem: the client will never get an ID of a newly created resource yet the server will have a resource created.
Questions: Is this application level behavior or should framework take care of that? How should a web framework (and Rails in particular) handle a situation like that? Are there any articles/whitepapers on REST for this topic?
The client will receive an error when the server does not respond to the POST. The client would then normally re-issue the request as they assume that it has failed. Off the top of my head I can think of two approaches to this problem.
One is that the client can generate some kind of request identifier, such as a guid, which it includes in the request. If the server receives a POST request with a duplicate GUID then it can refuse it.
The other approach is to PUT instead of POST to create. If you cannot get the client to generate the URI then you can ask the server to provide a new URI with a GET and then do a PUT to that URI.
If you search for something like "make POST idempotent" you will probably find a bunch of other suggestions on how to do this.
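A minimal sketch of the first suggestion (a client-generated GUID per create request), assuming a hypothetical ASP.NET MVC products controller; the X-Request-Id header, the model, and the CreateRequestId column are illustrative:
using System;
using System.Linq;
using System.Web.Mvc;

public class ProductsController : Controller
{
    // The client generates a GUID per create request and sends it with every
    // retry; the server refuses to create a second product for the same GUID
    // and instead replays the resource created the first time.
    [HttpPost]
    public ActionResult Create(ProductInput input)
    {
        Guid requestId;
        if (!Guid.TryParse(Request.Headers["X-Request-Id"], out requestId))
            return new HttpStatusCodeResult(400);   // client must supply a request id

        var existing = db.Products.FirstOrDefault(p => p.CreateRequestId == requestId);
        if (existing != null)
            return Json(existing);                  // retry of a request we already handled

        var product = new Product { CreateRequestId = requestId, Name = input.Name };
        db.Products.Add(product);
        db.SaveChanges();                           // a unique index on CreateRequestId guards the race

        Response.StatusCode = 201;
        return Json(product);
    }
}
Whether a retry gets a 200 with the existing resource or a redirect to it is a design choice; the important part is that the same GUID can never create two products.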
If it isn't reasonable for duplicate resources to be created (e.g. products with identical titles, descriptions, etc.), then unique identifiers can be generated on the server which can be tracked against created resources to prevent duplicate requests from being processed. Unlike Darrel's suggestion of generating unique IDs on the client, this would also prevent separate users from creating duplicate resources (which you may or may not find desirable). Clients will be able to distinguish between "created" responses and "duplicate" responses by their response codes (201 and 303 respectively, in my example below).
A sketch of generating such an identifier, in this case a hash of a canonical representation of the request (written as C# here rather than pseudocode; the helper methods, the products store, and the response helpers are illustrative):
// The canonical representation need not contain every field in the
// request, just those which contribute to its "identity".
string CanonicalProductId(ProductRequest request)
{
    var tags = string.Join(",", request.Tags.OrderBy(t => t));
    var canonical = string.Join("|", request.Name, request.Maker, tags, request.Description);
    using (var sha = SHA256.Create())
        return Convert.ToBase64String(sha.ComputeHash(Encoding.UTF8.GetBytes(canonical)));
}

ActionResult ProductPost(ProductRequest request)
{
    var id = CanonicalProductId(request);
    if (products.ContainsKey(id))
        return SeeOther(products[id]);        // 303: an identical product already exists
    products[id] = CreateProductFrom(request);
    return Created(products[id]);             // 201: new resource
}
This ID may or may not be part of the created resources' URIs. Personally, I'd be inclined to track them separately — at the cost of an extra lookup table — if the URIs were going to be exposed to users, as hashes tend to be ugly and difficult for humans to remember.
In many cases, it also makes sense to "expire" these unique hashes after some time. For example, if you were to make a money transfer API, a user transferring the same amount of money to the same person a few minutes apart probably indicates that the client never received the "success" response. If a user transfers the same amount of money to the same person once a month, on the other hand, they're probably paying their rent. ;-)
The problem as you describe it boils down to avoiding what are called double-adds. As mentioned by others, you need to make your POSTs idempotent.
This can be easily implemented at the framework level. The framework can keep a cache of completed responses. The requests have to carry a request unique (a unique identifier) so that any retries are treated as such, and not as new requests.
If the successful response gets lost on its way to the client, the client will retry with the same request unique, and the server will then respond with its cached response.
You are left with the durability of the cache, how long to keep responses, etc. One approach is to remove responses from the server cache after a given period of time; this will depend on your app domain and traffic, and can be left as a configurable step on the framework piece. Another approach is to force the client to send acknowledgements. The acks can be sent either as separate requests (note that these could be lost too), or as extra data piggybacked on real requests.
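As a rough illustration of that framework-level piece (hypothetical types; MemoryCache stands in for whatever store you use, and the retention period is the configurable knob mentioned above):
using System;
using System.Runtime.Caching;

// CachedResponse is a placeholder type for the stored status code and body.
public class IdempotencyCache
{
    private readonly MemoryCache cache = MemoryCache.Default;
    private readonly TimeSpan retention = TimeSpan.FromHours(24);   // configurable

    // A retry carrying the same request unique is answered from the cache
    // instead of being re-executed.
    public bool TryGetCached(string requestUnique, out CachedResponse response)
    {
        response = cache.Get(requestUnique) as CachedResponse;      // null if absent or expired
        return response != null;
    }

    public void Store(string requestUnique, CachedResponse response)
    {
        cache.Set(requestUnique, response, DateTimeOffset.UtcNow.Add(retention));
    }
}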
Although what I suggest is similar to what others suggest, I strongly encourage you to keep this layer of network resiliency doing only that: dealing with dropped requests/responses. Don't let it also deal with duplicate resources from separate requests, which is an application-level task. Merging both pieces will mash all the functionality together and will not leave you with a clear separation of responsibilities.
Not an easy problem, but if you keep it clean you can make your app much more resilient to bad networks without introducing too much complexity.
And for some related experiences by others go here.
Good luck.
As the other responders have pointed out, the basic problem here is that the standard HTTP POST method is not idempotent like the other methods. There is an effort underway to establish a standard for an idempotent POST method known as Post-Once-Exactly, or POE.
Now I'm not saying that this is a perfect solution for everybody in the situation you describe, but if it is the case that you are writing both the server and the client, you may be able to leverage some of the ideas from POE. The draft is here: https://datatracker.ietf.org/doc/html/draft-nottingham-http-poe-00
It isn't a perfect solution, which is probably why it hasn't really taken off in the six years since the draft was submitted. Some of the problems, and some clever alternate options are discussed here:
http://tech.groups.yahoo.com/group/rest-discuss/message/7646
HTTP is a stateless, client-initiated protocol: the server can't open a connection to the client on its own; all connections are initiated by the client. So you can't solve such an error purely on the server side.
The only solution I can think of: if you know which client created the product, you can supply it with the products it created, provided it pulls that information. If the client never contacts you again, you won't be able to transmit information about the new product.
