Persistent messages over channels - erlang

I am building an app using phoenix framework that will use thousands of channels, users will be sending messages of long, lat info each one seconds.
So, I will be getting thousands of messages each second.
The question is, how would I store each message I get into DB?
Please bear with me to describe my thoughts about this, as I am thinking of the following options to handle message storing into DB:
Use the same connection process to store the message into DB.
Possible caveat:
Would this effect the latency time and performance of the channel?
Create a DB dedicated process for each channel to store its messages.
Possible caveat:
Then, if I have 100'000 channel, I will need another 100'000 process to store messages, is this fine? considering that erlang processes are lite and cheap?
Create one DB process for all channel, then each message from any channel will be queued, then stored by this individual DB process.
Possible caveat:
One process to store all messages of thousands of channels, the message queue will go high, it will be slow?
So, which is the recommended approach to store messages coming each second from thousands of channels?
EDIT
I will be using dynamo db which will scale to handle thousands of concurrent request with ease.

The most important question is if the request in the connection channel can be completed before it's written to the DB or not. You need to consider what would happen if the connection process responded back to the client and something happened to the DB so that it has been written. If the data loss is acceptable then the DB access can be completed asynchronously, if not, then it needs to be synchronous, e.g. respond to the client only after the DB confirmed that it has stored the request.
If the data can be stored to the database asynchronously then it's better to either spawn a new process to complete it or add it to a queue (2 and 3). If the data has to be stored synchronously then it's easier to handle it in the same process (1). Please note that the request has to be copied between processes which may affect performance if the DB write is handled in a different process and the message size is big.
It's possible to improve the reliability of the asynchronous write for example by storing the request somewhere persistently before it's written to the DB, then reply back to the client, and then try to complete the DB write, which can then be retried if the DB is down. But it complicates this a bit.
You also need to determine the bottleneck, what would the slowest part of the architecture. If the DB then it doesn't matter if you create one queue of requests to the DB or if you create a new process for each connection, the requests will pile up either in the memory of that single process or in the amount of created processes.
I would probably determine how many parallel connections the DB can handle without sacrificing on latency too much and create a pool of processes to handle the requests. Then I would create a queue to dispatch requests to those pooled processes. To handle bigger messages I would obtain a token (permission to write) from the queue and connect to the DB directly to avoid copying the message too much. That architecture would be easier to extend if any bottlenecks have been found later, e.g. persistently store incoming messages before they can be written to the DB or balance requests to additional nodes when the DB is overloaded.

Related

Should data being used by ActiveJob (resque) be persisted or put into a ruby object and passed by object id?

I am using Twilio to send/receive texts in a Rails 4.2 app. I am sending in bulk, around 1000 at a time, and receiving sporadically.
Currently when I receive a text I save it to the DB (to, from, body) and then pass that record to an ActiveJob worker to process later. For sending messages I currently persist the Twilio params to another DB and pass that record to a different ActiveJob worker. Since I am often doing it in batches I have two workers. The first outgoing message worker sends a single message. The second one queries the DB and finds all the user who should receive the message, creates a DB record for each message that should be sent, and then passes that record to the first outgoing message worker. So the second one basically just creates a bunch of jobs for the first one to process.
Right now I have the workers destroying the records once they finish processing (both incoming and outgoing). I am worried about not persisting things incase the server, redis, or resque go down but I do not know if this is actually a good design pattern. It was suggested to me just to use a vanilla ruby object and pass it's id to the worker but I am not sure how that effects data reliability. So is it over kill to be creating all these DBs and should I just be creating vanilla ruby objects and passing those object's ids to the workers?
Any and all insight is appreciated,
Drew
It seems to me that the approach of sending a minimal amount of data to your jobs is the best approach. Check out the 'Best Practices' section on the sidekiq wiki: https://github.com/mperham/sidekiq/wiki/Best-Practices
What if your queue backs up and that quote object changes in the meantime? Don't save state to Sidekiq, save simple identifiers. Look up the objects once you actually need them in your perform method.
Also in terms of reliability - you should be worried about your job queue going down. It happens. You either design your system to be fault tolerant of a failure or you find a job queue system that has higher reliability guarantees (but even then no queue system can guarantee 100% message deliverability). Sidekiq pro has better reliability guarantees than sidekiq (non-pro), but if you design your jobs with a little bit of forethought, you can create jobs that can scan your database after a crash and re-queue any jobs that may have been lost.
How much work you spend desinging fault tolerant solutions really just depends how critical it is that your information make it from point A to point B :)

Is it a good idea to use MQ to store data in DB?

I'm going to use rabbitMQ as a message broker and switch most of the scripts to sending data to queue instead of performing direct writes/reads. Consumer will get those messages and perform corresponding operations. In my dreams this will give me more flexibility choosing DB engine, app level sharding and so on. But is it a good idea generally? Or am I missing something? Current write load is ~15k inserts/deletes for mysql and 30-50k sets for redis instances. Read load is the same ~15-20k selects, and 50-70k gets for redis.
The biggest issue you'll face will be the fact that your DB writes will be asynchronously processed. If a client writes data to the DB and then instantly reads it back, the value might not be what it originally inserted because the Rabbit queue might have been very busy or slow, delaying the update operation. Or an admin might accidentally purge your queue and then you'll have all these clients thinking their transactions had been committed but nothing will have been stored.
This sounds like a classic case of premature optimization. It's a solution in search of a problem, and you should probably avoid doing it.
With amqp you can run a none asynchronous operations using a RPC way, with this kind of architecture you should figure out all problems related with asynchronous operations.

Speeding up web service by writing to redis first, disk after?

I have a web service that runs multiple DB queries and takes roughly ~500ms-1,000ms (depending on how much I/O EC2 decides to give me at the given junction if invocation). Users want stuff faster than 1,000ms, and understandably so. What I'm thinking of doing is taking the request parameters, stuffing them into a redis queue without writing to disk, and then running a job in an asynchronous queue which does the disk writes. Does something like this happen normally in practice? am I insane for suggesting it?
So long as your Redis is persisting to disk on regular intervals, this should work. You want to limit the number of scenarios where you might lose data. A sufficiently aggressive persistence schedule for Redis should work for most cases.
Try to give feedback to the user immediately that their action has been received and is being processed. Nothing is more confusing than a slight delay before it appears that might prompt people to attempt the upload again.

Twitter Streaming API recording and Processing using Windows Azure and F#

A month ago I tried to use F# agents to process and record Twitter StreamingAPI Data here. As a little exercise I am trying to transfer the code to Windows Azure.
So far I have two roles:
One worker role (Publisher) that puts messages (a message being the json of a tweet) to a queue.
One worker role (Processor) that reads messages from the queue, decodes the json and dumps the data into a cloud table.
Which leads to lots of questions:
Is it okay to think of a worker role as an agent ?
In practice the message can be larger than 8 KB so I am going to need to use a blob storage and pass as message the reference to the blob (or is there another way?), will that impact performance ?
Is it correct to say that if needed I can increase the number of instances of the Processor worker role, and the queue will magically be processed faster ?
Sorry for pounding all these questions, hope you don't mind,
Thanks a lot!
There is an opensource library named Lokad.Cloud which can process big message transparently, you can check it on http://code.google.com/p/lokad-cloud/
Is it okay to think of a worker role as an agent?
Yes, definitely.
In practice the message can be larger than 8 KB so I am going to need to use a blob storage and pass as message the reference to the blob (or is there another way?), will that impact performance ?
Yes, using the technique you're talking about (saving the JSON to blob storage with a name of "JSONMessage-1" and then sending a message to a queue with contents of "JSONMessage-1") seems to be the standard way of passing messages in Azure that are bigger than 8KB. As you're making 4 calls to Azure storage rather than 2 (1 to get the queue message, 1 to get the blob contents, 1 to delete from the queue, 1 to delete the blob) it will be slower. Will it be noticeably slower? Probably not.
If a good number of messages are going to be smaller than 8KB when Base64 encoded (this is a gotcha in the StorageClient library) then you can put in some logic to determine how to send it.
Is it correct to say that if needed I can increase the number of instances of the Processor worker role, and the queue will magically be processed faster ?
As long as you've written your worker role so that it's self contained and the instances don't get in each others way, then yes, increasing the instance count will increase the through put.
If you're role is mainly just reading and writing to storage, you might benefit by multi-threading the worker role first, before increasing the instance count which will save money.
Is it okay to think of a worker role
as an agent ?
This is the perfect way to think of it. Imagine the workers at McDonald's. Each worker has certain tasks and they communicate with each other via messages (spoken).
In practice the message can be larger
than 8 KB so I am going to need to use
a blob storage and pass as message the
reference to the blob (or is there
another way?), will that impact
performance?
As long as the message is immutable this is the best way to do it. Strings can be very large and thus are allocated to the heap. Since they are immutable passing around references is not an issue.
Is it correct to say that if needed I
can increase the number of instances
of the Processor worker role, and the
queue will magically be processed
faster?
You need to look at what your process is doing and decide if it is IO bound or CPU bound. Typically IO bound processes will have an increase in performance by adding more agents. If you are using the ThreadPool for your agents the work will be balanced quite well even for CPU bound processes but you will hit a limit. That being said don't be afraid to mess around with your architecture and MEASURE the results of each run. This is the best way to balance the amount of agents to use.

Executing large numbers of asynchronous IO-bound operations in Rails

I'm working on a Rails application that periodically needs to perform large numbers of IO-bound operations. These operations can be performed asynchronously. For example, once per day, for each user, the system needs to query Salesforce.com to fetch the user's current list of accounts (companies) that he's tracking. This results in huge numbers (potentially > 100k) of small queries.
Our current approach is to use ActiveMQ with ActiveMessaging. Each of our users is pushed onto a queue as a different message. Then, the consumer pulls the user off the queue, queries Salesforce.com, and processes the results. But this approach gives us horrible performance. Within a single poller process, we can only process a single user at a time. So, the Salesforce.com queries become serialized. Unless we run literally hundreds of poller processes, we can't come anywhere close to saturating the server running poller.
We're looking at EventMachine as an alternative. It has the advantage of allowing us to kickoff large numbers of Salesforce.com queries concurrently within a single EventMachine process. So, we get great parallelism and utilization of our server.
But there are two problems with EventMachine. 1) We lose the reliable message delivery we had with ActiveMQ/ActiveMessaging. 2) We can't easily restart our EventMachine's periodically to lessen the impact of memory growth. For example, with ActiveMessaging, we have a cron job that restarts the poller once per day, and this can be done without worrying about losing any messages. But with EventMachine, if we restart the process, we could literally lose hundreds of messages that were in progress. The only way I can see around this is to build a persistance/reliable delivery layer on top of EventMachine.
Does anyone have a better approach? What's the best way to reliably execute large numbers of asynchronous IO-bound operations?
I maintain ActiveMessaging, and have been thinking about the issues of a multi-threaded poller also, though not perhaps at the same scale you guys are. I'll give you my thoughts here, but am also happy to discuss further o the active messaging list, or via email if you like.
One trick is that the poller is not the only serialized part of this. STOMP subscriptions, if you do client -> ack in order to prevent losing messages on interrupt, will only get sent a new message on a given connection when the prior message has been ack'd. Basically, you can only have one message being worked on at a time per connection.
So to keep using a broker, the trick will be to have many broker connections/subscriptions open at once. The current poller is pretty heavy for this, as it loads up a whole rails env per poller, and one poller is one connection. But there is nothing magical about the current poller, I could imagine writing a poller as an event machine client that is implemented to create new connections to the broker and get many messages at once.
In my own experiments lately, I have been thinking about using Ruby Enterprise Edition and having a master thread that forks many poller worker threads so as to get the benefit of the reduced memory footprint (much like passenger does), but I think the EM trick could work as well.
I am also an admirer of the Resque project, though I do not know that it would be any better at scaling to many workers - I think the workers might be lighter weight.
http://github.com/defunkt/resque
I've used AMQP with RabbitMQ in a way that would work for you. Since ActiveMQ implements AMQP, I imagine you can use it in a similar way. I have not used ActiveMessaging, which although it seems like an awesome package, I suspect may not be appropriate for this use case.
Here's how you could do it, using AMQP:
Have Rails process send a message saying "get info for user i".
The consumer pulls this off the message queue, making sure to specify that the message requires an 'ack' to be permanently removed from the queue. This means that if the message is not acknowledged as processed, it is returned to the queue for another worker eventually.
The worker then spins off the message into the thousands of small requests to SalesForce.
When all of these requests have successfully returned, another callback should be fired to ack the original message and return a "summary message" that has all the info germane to the original request. The key is using a message queue that lets you acknowledge successful processing of a given message, and making sure to do so only when relevant processing is complete.
Another worker pulls that message off the queue and performs whatever synchronous work is appropriate. Since all the latency-inducing bits have already performed, I imagine this should be fine.
If you're using (C)Ruby, try to never combine synchronous and asynchronous stuff in a single process. A process should either do everything via Eventmachine, with no code blocking, or only talk to an Eventmachine process via a message queue.
Also, writing asynchronous code is incredibly useful, but also difficult to write, difficult to test, and bug-prone. Be careful. Investigate using another language or tool if appropriate.
also checkout "cramp" and "beanstalk"
Someone sent me the following link: http://github.com/mperham/evented/tree/master/qanat/. This is a system that's somewhat similar to ActiveMessaging except that it is built on top of EventMachine. It's almost exactly what we need. The only problem is that it seems to only work with Amazon's queue, not ActiveMQ.

Resources