I'm building a Rails app that needs to connect to a custom TCP data service, which uses XML messages to exchange data. Functionally, this is not a problem, but I'm having trouble architecting it in a way that feels "clean".
Brief overview:
User logs in to the Rails app. At login, the credentials are validated with the data service and a "context id" is returned.
Request:
<login><username>testuser</username><password>mypass</password></login>
Response:
<reply><context_id>123456</context_id></reply>
This context_id is basically a session token. All subsequent requests for this user must supply this context_id in the XML message.
Request:
<history><context_id>123456</context_id><start_date>1/1/2010</start_date><end_date>1/31/2010</end_date></history>
Response:
<reply><history_item>...</history_item><history_item>..</history_item></reply>
I have hidden away all the XML building/parsing in my models, which is working really well. I can store the context_id in the user's session and retrieve it in my controllers, passing it to the model functions.
@transactions = Transaction.find({ :context_id => 123456, :start_date => '1/1/2010', :end_date => '1/31/2010' })
From a design point of view, I have 2 problems I'd like to solve:
Passing the context_id to every Model action is a bit of a pain. It would be nice if the model could just retrieve the id from the session itself, but I know this breaks the separation of concerns rule.
There is a TcpSocket connection that gets created/destroyed by the models on every request. The connection is not tied to the context_id directly, so it would be nice if the socket could be stored somewhere and retrieved by the models, so I'm not reestablishing the connection for every request.
This probably sounds really convoluted, and I'm probably going about this all wrong. If anybody has any ideas I'd love to hear them.
Technical details: I'm running Apache/mod_rails, and I have 0 control over the TCP service and its architecture.
Consider moving the API access to a new class, and store the TcpSocket instance and the context ID there. Change your models to talk to this API access class instead of talking to the socket themselves.
Add an around_filter to your controller(s) that pulls the context ID out of the session, stores it into the API access class, and nils it after running the action. As long as your Rails processes remain single-threaded, you'll be fine. If you switch to a multi-threaded model, you'll also need to change the API access class to store the context ID and the TcpSocket in thread-local storage, and you'll need one TcpSocket per thread.
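A minimal sketch of what that might look like (all class, host, and method names here are illustrative assumptions, not from the original post; note that Ruby's stdlib class is spelled TCPSocket):
require 'socket'

# Owns the long-lived socket plus the per-request context ID.
class DataService
  class << self
    attr_accessor :context_id

    def socket
      # Reuse one connection per process; connect lazily on first use.
      @socket ||= TCPSocket.new('dataservice.example.com', 9000)
    end

    def request(xml)
      socket.write(xml)
      socket.gets # read and parse the XML reply here
    end
  end
end

# The around_filter scopes the context ID to the current request, so
# models can call DataService without being handed the ID explicitly.
class ApplicationController < ActionController::Base
  around_filter :with_service_context

  private

  def with_service_context
    DataService.context_id = session[:context_id]
    yield
  ensure
    DataService.context_id = nil
  end
end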
I don't understand the idea of TIdHTTPSession in TIdHTTPServer. What is it for? Is it a kind of container for a request, or what is it? How do I use it properly after I have enabled AutoSessionStart? And what will happen if I do not enable AutoSessions?
For example, say we have some shared resource FMyMessages: TStringList; Then how should I access this shared resource with/without sessions?
TIdHTTPSession has an FLock: TIdCriticalSection; member - so maybe I should use it to lock my shared resource FMyMessages from other threads if I have AutoSessions, otherwise I should use my own critical section?
Also, how can I count the active sessions at any given moment? I tried this, but it doesn't work:
Server.Contexts.Count.ToString;
I don't understand the idea of TIdHTTPSession in TIdHTTPServer? What is it for? Is it a kind of container for a request, or what is it? How do I use it properly after I have enabled AutoSessionStart?
HTTP is a stateless protocol. It does not remember information from one request to the next. And it does not even guarantee or require that the TCP connection itself remain open between requests.
That is where sessions come into play. The server can create a session object to store information, such as during a client login, and that session's unique ID is sent to the client via an HTTP cookie, which the client can send back to the server on subsequent requests to reuse the same session object. Eventually, the session will time out and be destroyed if you do not end it explicitly, such as during client logout.
And what will happen if I do not enable AutoSessions?
The server will simply not automatically create a new session object if one does not exist yet for each request. You would have to create a new session manually on an as-needed basis instead.
For example, say we have some shared resource FMyMessages: TStringList; Then how should I access this shared resource with/without sessions?
Sessions have nothing to do with accessing shared resources, and everything to do with persisting per-client state data, such as user logins, database connections, etc.
TIdHTTPSession has an FLock: TIdCriticalSection; member - so maybe I should use it to lock my shared resource FMyMessages from other threads if I have AutoSessions, otherwise I should use my own critical section?
No. You can use Indy's TIdThreadSafeStringList instead.
Also, how can I count the active sessions at any given moment? I tried this, but it doesn't work:
Server.Contexts.Count.ToString;
The Contexts property stores the active client TCP connections. That has nothing to do with HTTP sessions. Those are stored in the server's SessionList property instead.
I'm making a website that I don't think makes sense to implement with a RESTful architecture (at least not the portion relevant to this problem), but it's causing some problems with race conditions across multiple servers which share a database.
My website has info about users of another product, so it has a Users table (not users of my site though). Users have many files.
Users and files are populated by an automated service, not manually on the site. The service posts the files to the server; the server parses them and gets the username from the file. If the username is new, it creates a new user row in the table. It then returns information about the file to the service that made the request.
The problems I'm seeing are race conditions when multiple requests come in at the same time for related objects, which cause things like unique-index violations in the db.
For example, there is a unique key on username. This code can be a problem if 2 requests from the automated service for files from the same user come in at the same time.
var myuser = db.users.FirstOrDefault(u => u.username == username);
if (myuser == null)
{
    // Race window: another request can insert the same username
    // between this check and SaveChanges().
    myuser = new user(username);
    db.users.AddObject(myuser);
}
db.SaveChanges();
Request 1 will see that there is no user with username foo, so the if condition returns true. Request 2 sees the same thing, not knowing that request 1 already began creating the user, and when request 2 tries to save, it violates the unique key.
Is there a common pattern or solution to this problem? I know this wouldn't be a problem if the server were RESTful, but I don't think it's feasible for the service to change the way it makes requests, so I'd like that to stay the same if possible. Right now, it just posts the file to the server, not knowing whether the user of that file already exists or whether that file was already posted to the server (it may post it more than once). Those objects are created if they don't exist yet; if they do, the list of items is updated. As far as the service is concerned, it just wants certain info about the file, and isn't concerned with whether or not it already exists in my db.
I think it'd be too slow for it to try to create a user via a request, then try to create the file via a request, and then request info about the file in another request. Also, the service runs multiple requests at a time via Parallel.ForEach, and it'd be too slow for it to run it in a single thread.
The first thing is separation of concerns. If you have an automated service populating the data, then that service (or another piece of middleware) should be responsible for creating the database records. This shouldn't happen at run time in response to a request to your website.
Second, if you must do it this way, that is what locks are for. Each request to your website runs in its own thread(s). So, if multiple threads need to access the same volatile resource (your DB), then you need to institute locking, so that the first thread in wins and any further threads can only interact with that table or row (depending on the type of lock) once the first has completed its work.
Third, this is pretty much exactly what RESTful architecture attempts to solve. You can use ETags to version your resources so any attempt to POST to an outdated resource will return an HTTP error (409 Conflict) directing the client to refetch the original resource.
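One common variant of this, sketched here in Ruby/ActiveRecord terms (the asker's code is C#/EF, so the names are just illustrative), is to let the unique index itself be the arbiter: attempt the insert and, if a concurrent request won the race, rescue the violation and re-read the row it created.
def find_or_create_user(username)
  User.find_by_username(username) || User.create!(:username => username)
rescue ActiveRecord::RecordNotUnique
  # another request inserted the same username first; the lookup now succeeds
  User.find_by_username(username)
end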
I have an analytics engine which periodically packages a bunch of stats in JSON format. I want to send these packages to a Rails server. Upon a package arriving, the Rails server should examine it, generate a model instance out of it (for historical purposes), and then display the contents to the user. I've thought of two approaches.
1) Have a little app residing on the same host as the Rails server listening for these packages (using ZeroMQ). Upon receiving a package, the app would invoke a Rails action through curl, passing on the package as a parameter. My concern with this approach is that my Rails server checks that only signed-in users can access actions which affect models. By creating an action accessible to this listening app (and therefore to other entities), am I exposing myself to a major security flaw?
2) The second approach is to simply have the listening app dump the package into a special database table. The Rails server will then periodically check this table for new packages. Upon detecting one or more, it will process them and remove them from the table.
This is the first time I'm doing something like this, so if you have techniques or experiences you can share for better solutions, I'd love to learn.
Thank you.
You can restrict access to a certain call by limiting the IP address that is allowed to make the request in routes.rb:
post "/analytics" => "analytics#create", :constraints => { :ip => /127\.0\.0\.1/ }
If you want the users to see updates, you can use polling to refresh the page every minute or so.
1) Yes, you are exposing a major security hole unless:
Your ZeroMQ app provides the data needed to do authentication and authorization on the Rails side
Your Rails app is configured to listen only on the 127.0.0.1 interface and is thus not accessible from the outside
Like Benjamin suggests, you restrict specific routes to certain IPs
2) This approach looks a lot like what delayed_job does. You might want to take a look at https://github.com/collectiveidea/delayed_job and use a rake task to add a new job.
In short, your listening app will call a rake task that adds a custom delayed_job when receiving a packet. Then let delayed_job handle the load. You benefit from delayed_job's goodness (different queues, scaling, ...). The hard part is getting the result back.
One idea would be to associate a unique ID with each job, and have the delayed_job task write its result to a data store which associates the job ID with the result. This data store can be a simple relational table
+----+--------+
| ID | Result |
+----+--------+
or a memcached/redis/whatever instance. You just need to poll that data store looking for the result associated with the job ID, and delete everything once you are done displaying it to the user.
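As a hypothetical sketch of the polling side (model, column, and route names here are assumptions):
# The client polls GET /results/:id until the worker has stored a row.
class ResultsController < ApplicationController
  def show
    result = JobResult.find_by_job_id(params[:id])
    if result
      render :json => { :result => result.value }
      result.destroy # clean up once the result has been delivered
    else
      head :no_content # not ready yet; the client should poll again
    end
  end
end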
3) Why don't you directly POST the data to the Rails server?
Following Benjamin's lead, I implemented a filter for this particular action.
def verify_ip
  @ips = ['127.0.0.1']
  unless @ips.include?(request.remote_ip)
    redirect_to root_url
  end
end
The listening app on the localhost now invokes the action, passing the JSON package received from the analytics engine as a param. Thank you.
I am designing a Rails app that takes in requests, uses data within the request to call a 3rd party web service, process the reply and then sends out a response to the original requestor and also issues a PUT request to yet another service.
I am trying to wrap my head around how to design this Rails app as it's different from the canonical Rails structure.
The objects are Lists and Tasks. Each List has many Tasks, and each Task belongs to a List.
The request I would get is something like:
http://myrailsapp.heroku.com/v1/lists?id=1&from=2012-02-12&to=2012-02-14&priority=high
In this example I am requesting tasks from 2/12/2012 to 2/14/2012 with a high priority in List #1
I would then issue a 3rd party web service call like this:
http://thirdpartywebservice.com/v1/lists?id=4128&from=2012-02-12&to=2012-02-14&priority=high
As you can see some processing was done on the data (id was changed in this case)
The results are then sent back to the requestor and to another web service via PUT.
My question is, how do I set up the Rails app to handle these types of behaviors? How does the controller structure change? This looks like a good use case for queues, how do I distribute multiple concurrent requests among queues?
For one thing, I don't need data persistence (data can be discarded after the response is sent out), and the data structure design is simplified. (I don't think I need Ruby objects; simple dictionaries or hashes representing these would be lighter weight and quicker to implement.)
Edit
So I broke down the work flow of the app into these components
Parse incoming request
Construct the 3rd party web service request
Send 3rd party request
Enqueue a worker to process the expected response
Process the response once it arrives
Send the parsed result back as a response
Which of the standard ruby controllers handle each of these steps? What are the models needed besides Lists and Tasks?
You should still use a database because passing data to Resque is messy. Rather, you should store it in the database and then pass the id to the workers, fetch the data, commit any new data or delete the record. It's really up to you but this method is cleaner. You can also use a push service like faye to let the user know when the processing is complete.
If you expect to have many concurrent requests, I would recommend Sidekiq as it's less of a memory hog. Having 4-5 resque workers can already suck up about 512 MB. The controller structure should not change. Please comment on anything you need clarified and I'll be happy to update my answer.
EDIT
You would want to use a separate database store, such as Postgres. Not sure if it's important what models you need, but essentially this is what should be happening.
In your controller, create a Request object which contains the query params you want to query this 3rd party service with. Then enqueue a job to be handled by Sidekiq/Resque; let's call this ThirdPartyRequest, and pass in the id of the Request object you just created as an argument. Then render a view here showing the Request object. Request#response will still be empty because it hasn't been processed yet, so let the user know it's still processing.
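A rough sketch of that controller flow (model, worker, and field names here are assumptions for illustration, not from the original answer):
class RequestsController < ApplicationController
  def create
    # Persist the query params so the worker can fetch them by id.
    req = Request.create!(:query_params => params.slice(:id, :from, :to, :priority).to_json)
    Resque.enqueue(ThirdPartyRequest, req.id)
    # Show the still-unprocessed Request; its response is empty for now.
    render :json => { :request_id => req.id, :status => 'processing' }
  end
end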
A worker then handles your ThirdPartyRequest job. ThirdPartyRequest should fetch the Request object and obtain the query params needed to contact the third party service. It does that, then gets a response back. Update the Request object with this response, then save it.
class ThirdPartyRequest
  # Resque reads the queue name from this instance variable.
  @queue = :third_party_requests

  def self.perform(request_id)
    request = Request.find(request_id)
    # contact third party service
    request.response = ...
    request.save
  end
end
The user can continually refresh the page to check on their Request object. Once it gets updated with the response, they will know it's completed. If you want the page to refresh automatically, look into faye/juggernaut/private_pub or a SaaS solution like Pusher.
Let's assume we have a client/server interaction happening over unreliable network (packet drop). A client is calling server's RESTful api (over http over tcp):
issuing a POST to http://server.com/products
server is creating an object of "product" resource (persists it to a database, etc)
server is returning 201 Created with a Location header of "http://server.com/products/12345"
! TCP packet containing the HTTP response gets dropped and eventually this leads to a TCP connection reset
I see the following problem: the client will never get an ID of a newly created resource yet the server will have a resource created.
Questions: Is this application level behavior or should framework take care of that? How should a web framework (and Rails in particular) handle a situation like that? Are there any articles/whitepapers on REST for this topic?
The client will receive an error when the server does not respond to the POST. The client would then normally re-issue the request as they assume that it has failed. Off the top of my head I can think of two approaches to this problem.
One is that the client can generate some kind of request identifier, such as a GUID, which it includes in the request. If the server receives a POST request with a duplicate GUID, then it can refuse it.
The other approach is to PUT instead of POST to create. If you cannot get the client to generate the URI then you can ask the server to provide a new URI with a GET and then do a PUT to that URI.
If you search for something like "make POST idempotent" you will probably find a bunch of other suggestions on how to do this.
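For illustration, a hedged sketch of the client side of the GUID approach (the X-Request-Id header name is an assumption; any agreed-upon field works):
require 'securerandom'
require 'net/http'
require 'uri'

# Generate the ID once and reuse it verbatim on every retry, so the
# server can recognize and refuse the duplicate.
request_id = SecureRandom.uuid
uri = URI.parse('http://server.com/products')

post = Net::HTTP::Post.new(uri.path)
post['X-Request-Id'] = request_id
post.set_form_data('name' => 'Widget') # illustrative payload

response = Net::HTTP.start(uri.host, uri.port) { |http| http.request(post) }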
If it isn't reasonable for duplicate resources to be created (e.g. products with identical titles, descriptions, etc.), then unique identifiers can be generated on the server which can be tracked against created resources to prevent duplicate requests from being processed. Unlike Darrel's suggestion of generating unique IDs on the client, this would also prevent separate users from creating duplicate resources (which you may or may not find desirable). Clients will be able to distinguish between "created" responses and "duplicate" responses by their response codes (201 and 303 respectively, in my example below).
A sketch, in Ruby, for generating such an identifier, in this case a hash of a canonical representation of the request (products, http303, http201, and create_product_from stand in for your store and response helpers):
require 'digest'

def product_post(request)
  # the canonical representation need not contain every field in
  # the request, just those which contribute to its "identity"
  tags = request.tags.sort.join(',')
  canonical = [request.name, request.maker, tags, request.desc].join('|')
  id = Digest::SHA256.hexdigest(canonical)

  if products.key?(id)
    http303 products[id] # duplicate request: point at the existing resource
  else
    products[id] = create_product_from(request)
    http201 products[id] # newly created
  end
end
This ID may or may not be part of the created resources' URIs. Personally, I'd be inclined to track them separately — at the cost of an extra lookup table — if the URIs were going to be exposed to users, as hashes tend to be ugly and difficult for humans to remember.
In many cases, it also makes sense to "expire" these unique hashes after some time. For example, if you were to make a money transfer API, a user transferring the same amount of money to the same person a few minutes apart probably indicates that the client never received the "success" response. If a user transfers the same amount of money to the same person once a month, on the other hand, they're probably paying their rent. ;-)
The problem as you describe it boils down to avoiding what are called double-adds. As mentioned by others, you need to make your posts idempotent.
This can be easily implemented at the framework level. The framework can keep a cache of completed responses. Each request has to carry a unique request ID so that any retries are treated as such, and not as new requests.
If the successful response gets lost on its way to the client, the client will retry with the same request ID, and the server will then respond with its cached response.
You are left with questions about the durability of the cache, how long to keep responses, etc. One approach is to remove responses from the server cache after a given period of time; this will depend on your app domain and traffic, and can be left as a configurable step in the framework piece. Another approach is to force the client to send acknowledgements. The acks can be sent either as separate requests (note that these could be lost too), or as extra data piggybacked on real requests.
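A minimal sketch of such a framework-level response cache, assuming an in-memory store and a client-supplied request ID (all names here are illustrative):
class IdempotencyCache
  def initialize(ttl = 600)
    @ttl = ttl
    @entries = {} # request_id => [response, stored_at]
    @lock = Mutex.new
  end

  # Return the cached response for a retried request; otherwise run
  # the block, cache its result, and return it.
  def fetch(request_id)
    @lock.synchronize do
      @entries.delete_if { |_, (_, at)| Time.now - at > @ttl }
      entry = @entries[request_id]
      return entry[0] if entry
      response = yield
      @entries[request_id] = [response, Time.now]
      response
    end
  end
end

Usage would look something like CACHE.fetch(request_id) { create_product! }, where the request ID comes from the client as described above.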
Although what I suggest is similar to what others suggest, I strongly encourage you to keep this layer of network resiliency focused on just that: dealing with dropped requests/responses. Preventing duplicate resources from separate requests is an application-level task. Merging both pieces will mix all the functionality together and will not leave you with a clear separation of responsibilities.
Not an easy problem, but if you keep it clean you can make your app much more resilient to bad networks without introducing too much complexity.
And for some related experiences by others go here.
Good luck.
As the other responders have pointed out, the basic problem here is that the standard HTTP POST method is not idempotent like the other methods. There is an effort underway to establish a standard for an idempotent POST method known as Post-Once-Exactly, or POE.
Now I'm not saying that this is a perfect solution for everybody in the situation you describe, but if it is the case that you are writing both the server and the client, you may be able to leverage some of the ideas from POE. The draft is here: https://datatracker.ietf.org/doc/html/draft-nottingham-http-poe-00
It isn't a perfect solution, which is probably why it hasn't really taken off in the six years since the draft was submitted. Some of the problems, and some clever alternate options are discussed here:
http://tech.groups.yahoo.com/group/rest-discuss/message/7646
HTTP is a stateless protocol, meaning the server can't initiate a connection to the client; all connections are initiated by the client. So you can't solve such an error on the server side.
The only solution I can think of: if you know which client created the product, you can supply it with the products it created when it pulls that information. If the client never contacts you again, you won't be able to transmit information about the new product.