I was wondering what kind of concurrency models folks use to process inbound HL7 messages (ADT, etc.) and persist them in a normalized data model (relational or NoSQL).
I am struggling with the trade-off between sequential message processing (mapping to a NoSQL DB) and multi-threading when transforming/processing them in the application tier (Java, .NET, whatever):
Example: if I process messages received and transformed by Cloverleaf (transformed to be compliant with an internal web/REST API's expected payload), and sent to an internal web/REST API server (multi-threaded Java web app), then I can't guarantee I am processing the messages sequentially due to threading.
If I process messages sequentially then mapping will be slow...
Whether you can process the messages asynchronously depends on the characteristics of the messages, and your processing logic. Consider this sequence:
you get a registration for a new patient
you get an episode listed against the patient
you get a merge message merging the new patient with a different patient
If you process the last message before the second-to-last one, what happens? Will you treat it as an error because you have a new episode on a merged patient?
This is why there is no simple answer to the question. It depends.
If the sending application is using MLLP then you might not have any choice but to do sequential processing. Most MLLP clients will wait for an accept acknowledgment before sending the next message.
For many healthcare use cases the sequence does matter. For example if the sending application is producing ORU^R01 messages then it could send preliminary results first and then final results later. If you are presenting that data to users you wouldn't want to allow the preliminary results to overwrite the final results just because your application happened to process the messages out of order.
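If the ordering constraints are per patient (which they often are), one common compromise - not a universal answer - is to keep messages for the same patient in order while processing different patients in parallel. A rough Java sketch of that idea, where Hl7Message and persist() are hypothetical placeholders for your own parser and persistence code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch only: keep messages for one patient in order, process patients in parallel.
// Hl7Message and persist() are hypothetical placeholders.
public class PartitionedHl7Processor {
    private final List<ExecutorService> lanes = new ArrayList<>();

    public PartitionedHl7Processor(int laneCount) {
        for (int i = 0; i < laneCount; i++) {
            // A single-threaded executor guarantees FIFO processing within its lane.
            lanes.add(Executors.newSingleThreadExecutor());
        }
    }

    public void submit(Hl7Message msg) {
        // Messages for the same patient always hash to the same lane, so the
        // registration, episode and merge for that patient are handled in arrival order.
        int lane = Math.floorMod(msg.patientId().hashCode(), lanes.size());
        lanes.get(lane).execute(() -> persist(msg));
    }

    private void persist(Hl7Message msg) {
        // map/normalize the message and write it to the target data store
    }

    public record Hl7Message(String patientId, String raw) {}
}
```

Note that a merge touches two patient identifiers, so even this scheme doesn't remove the need to handle the merge-before-episode case described above.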
A normalized data model and a NoSQL persistence layer are generally a contradiction in terms.
I have an MVC web site, where users can search for large recordsets from SQL Server and Oracle databases. Some of these recordsets can be very large, with many thousands of records. Sadly, it is a user requirement that they do not make their searches more specific.
When a user posts their search request to the database, my web page is hanging before often timing out (due to the amount of time taken to query the database).
We are thinking about removing the expensive database calls from the MVC site, and sending the query to a separate process to run in the background. When the query is complete, we can notify the user.
My proposed solution is:
1) When the user completes the search form on the web page, simply display a message that the results are being generated and will be sent when complete
2) Store the submitted SQL query in a database table that holds the list of queries waiting to be processed
3) Create a Windows Service which checks this table every couple of minutes for new queries
4) The Windows Service then runs the query; when it completes, it creates a CSV of the results and emails it to the user (a rough sketch of this loop is below)
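Roughly, the polling loop in steps 3 and 4 would look something like this (sketched in Java only to illustrate the shape; the real service would be .NET, and PendingQuery, QueryQueueDao, runQueryToCsv and emailCsv are just placeholder names):

```java
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Background service sketch: poll a "pending queries" table every couple of
// minutes, run each query, write a CSV and email it to the requester.
public class QueryPollingService {
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    private final QueryQueueDao queue;

    public QueryPollingService(QueryQueueDao queue) {
        this.queue = queue;
    }

    public void start() {
        scheduler.scheduleWithFixedDelay(this::pollOnce, 0, 2, TimeUnit.MINUTES);
    }

    private void pollOnce() {
        List<PendingQuery> pending = queue.fetchAndMarkInProgress();
        for (PendingQuery q : pending) {
            try {
                byte[] csv = runQueryToCsv(q);      // execute the expensive query
                emailCsv(q.requesterEmail(), csv);  // notify the user with the results
                queue.markDone(q);
            } catch (Exception e) {
                queue.markFailed(q, e.getMessage()); // leave a trail for retries/alerts
            }
        }
    }

    private byte[] runQueryToCsv(PendingQuery q) { /* run SQL, serialize rows to CSV */ return new byte[0]; }
    private void emailCsv(String to, byte[] csv) { /* send the attachment */ }

    public interface QueryQueueDao {
        List<PendingQuery> fetchAndMarkInProgress();
        void markDone(PendingQuery q);
        void markFailed(PendingQuery q, String reason);
    }
    public record PendingQuery(long id, String sql, String requesterEmail) {}
}
```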
I am looking for advice and comments on my approach above. What do folks think of this as a way to process expensive database calls in the background?
Generally speaking the requests will be made infrequently, but as mentioned, will be for a great amount of data. There is a chance that two or more requests could be made at the same time, but this will be infrequent.
I will also look at optimising the databases.
Grateful for any tips.
Martin :)
Another option is to supplement the existing code to execute the query on a separate thread, so that periodic keep-alive updates can be sent to the requesting page while you wait for the query results. Similar to the way the insurance quote aggregator pages work.
A second option is to make the results available as a hyperlink when they are ready and then communicate that either through the website or by email to the user.
Option three: if these queries are not completely ad hoc, you could profile for the most frequent combinations and pre-compute them periodically, placing the results into new tables (sort of halfway to optimising the current database structure).
The caveat there is that the data won't be as up to date - but given the time the queries currently take, it probably isn't that important for the results to be up to the second.
Whichever solution you choose, I think it's going to depend on user expectations. Do they know what they want, send one big query, and go away happy with the results? Or do they try several queries to find the right combination of parameters? If the latter, then waiting for an email delivery of results might not be acceptable to them; but if what they want is a downloadable results document and they know what they want first time, then it may be. The only problem I see here is emails going astray or taking longer than the user expects, causing the request to be resubmitted multiple times and increasing the server workload - caching queries and results is probably a very good idea.
I would suggest introducing a layer of abstraction such as a messaging broker. The request goes into a queue, a batch layer consumes the request from the queue, and once the heavy work is done the batch layer notifies the web layer again via the messaging broker - the Request-Reply pattern. A rough sketch of the flow is below.
In addition, on the database side it is always good to optimize queries.
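A minimal in-process illustration of that Request-Reply flow (the two BlockingQueues simply stand in for broker queues such as RabbitMQ or ActiveMQ; QueryRequest, QueryResult and runExpensiveQuery are hypothetical):

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Request-Reply sketch: the web layer enqueues a request and returns immediately;
// the batch layer does the heavy work and replies on a second queue.
public class RequestReplySketch {
    record QueryRequest(long id, String sql, String replyTo) {}
    record QueryResult(long id, String csvPath) {}

    static final BlockingQueue<QueryRequest> requestQueue = new LinkedBlockingQueue<>();
    static final BlockingQueue<QueryResult> replyQueue = new LinkedBlockingQueue<>();

    public static void main(String[] args) throws InterruptedException {
        // "Batch layer": consumes requests, does the heavy work, replies when done.
        Thread worker = new Thread(() -> {
            try {
                while (true) {
                    QueryRequest req = requestQueue.take();
                    String csvPath = runExpensiveQuery(req);            // long-running DB work
                    replyQueue.put(new QueryResult(req.id(), csvPath)); // notify the web layer
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        worker.setDaemon(true);
        worker.start();

        // "Web layer": enqueue the request; a separate listener would pick up the
        // reply and email or otherwise notify the user.
        requestQueue.put(new QueryRequest(1L, "SELECT ...", "user@example.com"));
        QueryResult done = replyQueue.take();
        System.out.println("Query " + done.id() + " finished, results at " + done.csvPath());
    }

    static String runExpensiveQuery(QueryRequest req) {
        return "/tmp/results-" + req.id() + ".csv"; // placeholder for the real query + CSV export
    }
}
```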
I am building an app using the Phoenix framework that will use thousands of channels; users will be sending messages with longitude/latitude info every second.
So, I will be getting thousands of messages each second.
The question is, how should I store each message I get in the DB?
Please bear with me while I describe my thoughts, as I am considering the following options for storing messages in the DB:
Use the same connection process to store the message in the DB.
Possible caveat:
Would this affect the latency and performance of the channel?
Create a dedicated DB process for each channel to store its messages.
Possible caveat:
Then, if I have 100,000 channels, I will need another 100,000 processes to store messages. Is this fine, considering that Erlang processes are light and cheap?
Create one DB process for all channels; each message from any channel will be queued, then stored by this single DB process.
Possible caveat:
With one process storing all messages from thousands of channels, its message queue will grow large - will it be slow?
So, which is the recommended approach for storing messages arriving every second from thousands of channels?
EDIT
I will be using DynamoDB, which will scale to handle thousands of concurrent requests with ease.
The most important question is whether the request in the connection channel can be completed before it's written to the DB or not. You need to consider what would happen if the connection process responded back to the client and then something happened to the DB, so that the data was never written. If that data loss is acceptable, then the DB access can be completed asynchronously; if not, it needs to be synchronous, i.e. respond to the client only after the DB has confirmed that it stored the request.
If the data can be stored in the database asynchronously then it's better to either spawn a new process to complete the write or add it to a queue (options 2 and 3). If the data has to be stored synchronously then it's easier to handle it in the same process (option 1). Please note that the request has to be copied between processes, which may affect performance if the DB write is handled in a different process and the message size is big.
It's possible to improve the reliability of the asynchronous write, for example by storing the request somewhere persistent before replying to the client and then attempting the DB write, which can be retried if the DB is down. But this complicates things a bit.
You also need to determine the bottleneck: what would be the slowest part of the architecture? If it's the DB, then it doesn't matter whether you create one queue of requests to the DB or a new process for each connection; the requests will pile up either in the memory of that single process or in the number of created processes.
I would probably determine how many parallel connections the DB can handle without sacrificing too much latency and create a pool of processes to handle the requests, then create a queue to dispatch requests to those pooled processes. To handle bigger messages I would obtain a token (permission to write) from the queue and connect to the DB directly, to avoid copying the message too much. That architecture would be easier to extend if any bottlenecks are found later, e.g. persistently store incoming messages before they can be written to the DB, or balance requests to additional nodes when the DB is overloaded.
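As a rough sketch of the pooled-writer idea - written in Java purely to show the shape, since a real Phoenix app would use a pool of Elixir processes (e.g. a poolboy-style pool); LocationMessage and writeToDynamo are hypothetical placeholders:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// A fixed pool of writers drains a bounded dispatch queue of DB write requests.
public class PooledDbWriter {
    private final ExecutorService pool;

    public PooledDbWriter(int writerCount, int queueCapacity) {
        // writerCount ~= how many parallel writes the DB handles without hurting latency;
        // the bounded queue is the dispatch queue in front of the pooled writers.
        pool = new ThreadPoolExecutor(
                writerCount, writerCount,
                0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueCapacity),
                new ThreadPoolExecutor.CallerRunsPolicy()); // back-pressure when the queue is full
    }

    public void submit(LocationMessage msg) {
        // Fire-and-forget: acceptable only if losing a write on a crash is acceptable
        // (see the synchronous-vs-asynchronous discussion above).
        pool.execute(() -> writeToDynamo(msg));
    }

    private void writeToDynamo(LocationMessage msg) {
        // call the DynamoDB client here
    }

    public record LocationMessage(String channelId, double lat, double lon, long timestamp) {}
}
```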
I am developing a solution that queries a SOAP web service for certain transactions. Once retrieved, these transactions are meant to be saved in a database, after which a callback URL is invoked to send some data to another server. How would you best architect a solution to this problem? My point of confusion is whether gen_server or gen_fsm should be used, and if so, which component of the solution goes where, i.e. with gen_server, which tasks go to the server and which go to the client.
Think about what tasks can happen in parallel in your system. Can the SOAP queries be done in parallel, and does it make sense? Can the transactions be written to the database while the SOAP service is being queried again? Is the data for the other server supposed to be sent after the transactions are written to the database, or can it be sent at the same time?
Once you know these answers, you can build your pipeline. One example would be:
One process that regularly queries the SOAP service and calls a function with each batch of transactions.
In this function, two processes are started; one that writes the transactions to the database and one that sends data to the server. These processes are independent of each other.
This is just one example, and depending on your requirements, you might have to structure things another way.
In general, these processes can all be gen_servers, as none of them have any clear states to transition between (so gen_fsm isn't needed). In fact, the two worker processes don't have to be gen_servers at all, since they just do one task and then die. Using a gen_server in that case would be overkill.
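To make the pipeline shape concrete, here is the same idea sketched outside of OTP (Java used only for illustration; fetchTransactions, saveToDatabase and postCallback are hypothetical placeholders for the real SOAP, database and callback code):

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// A scheduler polls the SOAP service; each batch fans out into two independent tasks.
public class TransactionPipeline {
    private final ScheduledExecutorService poller = Executors.newSingleThreadScheduledExecutor();
    private final ExecutorService workers = Executors.newFixedThreadPool(4);

    public void start() {
        // "One process that regularly queries the SOAP service..."
        poller.scheduleWithFixedDelay(this::pollOnce, 0, 30, TimeUnit.SECONDS);
    }

    private void pollOnce() {
        List<Transaction> batch = fetchTransactions(); // the SOAP call
        if (batch.isEmpty()) return;
        // "...two processes are started; one writes to the DB, one sends data onward."
        workers.submit(() -> saveToDatabase(batch));
        workers.submit(() -> postCallback(batch));
    }

    private List<Transaction> fetchTransactions() { return List.of(); }
    private void saveToDatabase(List<Transaction> batch) { /* insert rows */ }
    private void postCallback(List<Transaction> batch) { /* invoke the callback URL */ }

    public record Transaction(String id, String payload) {}
}
```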
I am using Twilio to send/receive texts in a Rails 4.2 app. I am sending in bulk, around 1000 at a time, and receiving sporadically.
Currently when I receive a text I save it to the DB (to, from, body) and then pass that record to an ActiveJob worker to process later. For sending messages I currently persist the Twilio params to another DB table and pass that record to a different ActiveJob worker. Since I am often sending in batches I have two workers: the first outgoing-message worker sends a single message; the second one queries the DB, finds all the users who should receive the message, creates a DB record for each message that should be sent, and then passes each record to the first outgoing-message worker. So the second one basically just creates a bunch of jobs for the first one to process.
Right now I have the workers destroying the records once they finish processing (both incoming and outgoing). I am worried about not persisting things in case the server, Redis, or Resque go down, but I do not know if this is actually a good design pattern. It was suggested to me to just use a vanilla Ruby object and pass its id to the worker, but I am not sure how that affects data reliability. So is it overkill to be creating all these DB records, and should I just be creating vanilla Ruby objects and passing those objects' ids to the workers?
Any and all insight is appreciated,
Drew
It seems to me that the approach of sending a minimal amount of data to your jobs is the best approach. Check out the 'Best Practices' section on the sidekiq wiki: https://github.com/mperham/sidekiq/wiki/Best-Practices
What if your queue backs up and that quote object changes in the meantime? Don't save state to Sidekiq, save simple identifiers. Look up the objects once you actually need them in your perform method.
Also, in terms of reliability - you should be worried about your job queue going down. It happens. You either design your system to be fault tolerant of such a failure, or you find a job queue system that has higher reliability guarantees (but even then no queue system can guarantee 100% message delivery). Sidekiq Pro has better reliability guarantees than Sidekiq (non-pro), but if you design your jobs with a little bit of forethought, you can create jobs that can scan your database after a crash and re-queue any jobs that may have been lost.
How much work you spend designing fault-tolerant solutions really just depends on how critical it is that your information makes it from point A to point B :)
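To make the "simple identifiers plus re-queue after a crash" idea concrete, here is a sketch (in Java rather than Ruby, just to show the shape; MessageRepository, JobQueue and OutgoingMessage are hypothetical stand-ins for your models and queue):

```java
import java.util.List;
import java.util.Optional;

// The job payload is just a record id; the worker re-loads the row when it runs,
// and a recovery pass can re-enqueue anything still marked pending after a crash.
public class OutgoingMessageWorker {
    private final MessageRepository repo;
    private final JobQueue queue;

    public OutgoingMessageWorker(MessageRepository repo, JobQueue queue) {
        this.repo = repo;
        this.queue = queue;
    }

    // perform(id): look the object up only when the job actually runs,
    // so a stale copy of the record is never serialized into the queue.
    public void perform(long messageId) {
        Optional<OutgoingMessage> msg = repo.findPending(messageId);
        if (msg.isEmpty()) return;        // already sent or deleted; nothing to do
        sendViaTwilio(msg.get());
        repo.markSent(messageId);         // keep the row, just flip its state
    }

    // After the queue or Redis goes down, scan the DB for rows that never
    // completed and put their ids back on the queue.
    public void requeueUnsent() {
        for (long id : repo.findAllPendingIds()) {
            queue.enqueue(id);
        }
    }

    private void sendViaTwilio(OutgoingMessage msg) { /* call the Twilio API */ }

    public interface MessageRepository {
        Optional<OutgoingMessage> findPending(long id);
        List<Long> findAllPendingIds();
        void markSent(long id);
    }
    public interface JobQueue { void enqueue(long messageId); }
    public record OutgoingMessage(long id, String to, String body) {}
}
```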
I'm working on a Rails application that periodically needs to perform large numbers of IO-bound operations. These operations can be performed asynchronously. For example, once per day, for each user, the system needs to query Salesforce.com to fetch the user's current list of accounts (companies) that he's tracking. This results in huge numbers (potentially > 100k) of small queries.
Our current approach is to use ActiveMQ with ActiveMessaging. Each of our users is pushed onto a queue as a different message. Then, the consumer pulls the user off the queue, queries Salesforce.com, and processes the results. But this approach gives us horrible performance. Within a single poller process, we can only process a single user at a time. So, the Salesforce.com queries become serialized. Unless we run literally hundreds of poller processes, we can't come anywhere close to saturating the server running the poller.
We're looking at EventMachine as an alternative. It has the advantage of allowing us to kick off large numbers of Salesforce.com queries concurrently within a single EventMachine process. So, we get great parallelism and utilization of our server.
But there are two problems with EventMachine. 1) We lose the reliable message delivery we had with ActiveMQ/ActiveMessaging. 2) We can't easily restart our EventMachine processes periodically to lessen the impact of memory growth. For example, with ActiveMessaging, we have a cron job that restarts the poller once per day, and this can be done without worrying about losing any messages. But with EventMachine, if we restart the process, we could literally lose hundreds of messages that were in progress. The only way I can see around this is to build a persistence/reliable-delivery layer on top of EventMachine.
Does anyone have a better approach? What's the best way to reliably execute large numbers of asynchronous IO-bound operations?
I maintain ActiveMessaging, and have been thinking about the issues of a multi-threaded poller also, though perhaps not at the same scale you guys are. I'll give you my thoughts here, but am also happy to discuss further on the ActiveMessaging list, or via email if you like.
One catch is that the poller is not the only serialized part of this. STOMP subscriptions, if you use client ack in order to prevent losing messages on interrupt, will only be sent a new message on a given connection when the prior message has been ack'd. Basically, you can only have one message being worked on at a time per connection.
So to keep using a broker, the trick will be to have many broker connections/subscriptions open at once. The current poller is pretty heavy for this, as it loads up a whole Rails env per poller, and one poller is one connection. But there is nothing magical about the current poller; I could imagine writing a poller as an EventMachine client implemented to create new connections to the broker and get many messages at once.
In my own experiments lately, I have been thinking about using Ruby Enterprise Edition and having a master process that forks many poller worker processes so as to get the benefit of the reduced memory footprint (much like Passenger does), but I think the EM trick could work as well.
I am also an admirer of the Resque project, though I do not know that it would be any better at scaling to many workers - I think the workers might be lighter weight.
http://github.com/defunkt/resque
I've used AMQP with RabbitMQ in a way that would work for you. Since ActiveMQ implements AMQP, I imagine you can use it in a similar way. I have not used ActiveMessaging, which, although it seems like an awesome package, I suspect may not be appropriate for this use case.
Here's how you could do it, using AMQP (a rough code sketch follows the steps):
Have the Rails process send a message saying "get info for user i".
The consumer pulls this off the message queue, making sure to specify that the message requires an 'ack' to be permanently removed from the queue. This means that if the message is not acknowledged as processed, it is returned to the queue for another worker eventually.
The worker then spins the message off into the thousands of small requests to Salesforce.
When all of these requests have successfully returned, another callback should be fired to ack the original message and return a "summary message" that has all the info germane to the original request. The key is using a message queue that lets you acknowledge successful processing of a given message, and making sure to do so only when relevant processing is complete.
Another worker pulls that message off the queue and performs whatever synchronous work is appropriate. Since all the latency-inducing bits have already been performed, I imagine this should be fine.
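As a concrete sketch of steps 2-4 - manual acks, fan out the per-user work, and ack only when everything succeeded - here is roughly what the consumer could look like with the RabbitMQ Java client (queue names and fetchAccountsFromSalesforce are made up; your actual consumer would be Ruby, but the ack mechanics are the same):

```java
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import com.rabbitmq.client.DeliverCallback;
import java.nio.charset.StandardCharsets;

// Consume with manual acks; ack the original message only after all the
// per-user sub-requests have completed successfully.
public class UserSyncConsumer {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost");
        Connection conn = factory.newConnection();
        Channel ch = conn.createChannel();

        ch.queueDeclare("user.sync", true, false, false, null);
        ch.queueDeclare("user.sync.results", true, false, false, null);
        ch.basicQos(1); // one unacked message per consumer at a time

        DeliverCallback onMessage = (consumerTag, delivery) -> {
            long tag = delivery.getEnvelope().getDeliveryTag();
            String userId = new String(delivery.getBody(), StandardCharsets.UTF_8);
            try {
                // The many small Salesforce requests happen here; if any of them
                // fail we do NOT ack, so the broker redelivers the message later.
                String summary = fetchAccountsFromSalesforce(userId);
                ch.basicPublish("", "user.sync.results", null, summary.getBytes(StandardCharsets.UTF_8));
                ch.basicAck(tag, false); // only now is the original message gone for good
            } catch (Exception e) {
                ch.basicNack(tag, false, true); // requeue for another worker
            }
        };
        ch.basicConsume("user.sync", false /* manual ack */, onMessage, consumerTag -> {});
    }

    private static String fetchAccountsFromSalesforce(String userId) {
        return "{\"user\":\"" + userId + "\",\"accounts\":[]}"; // placeholder summary
    }
}
```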
If you're using (C)Ruby, try to never combine synchronous and asynchronous stuff in a single process. A process should either do everything via EventMachine, with no blocking code, or only talk to an EventMachine process via a message queue.
Also, writing asynchronous code is incredibly useful, but also difficult to write, difficult to test, and bug-prone. Be careful. Investigate using another language or tool if appropriate.
Also check out "cramp" and "beanstalk".
Someone sent me the following link: http://github.com/mperham/evented/tree/master/qanat/. This is a system that's somewhat similar to ActiveMessaging except that it is built on top of EventMachine. It's almost exactly what we need. The only problem is that it seems to only work with Amazon's queue, not ActiveMQ.