Ruby IMAP IDLE concurrency - how to tackle?

I'm trying to build a (private, for now) web application that will use IMAP IDLE connections to show people's emails as they arrive.
I'm having a hard time figuring out how to put this together - and how it would fit with my Heroku RoR server.
I've written a basic script for connecting to an IMAP server and idling; it looks something like this (simplified):
imap = Net::IMAP.new(server, port, usessl)
imap.login(username, password)
imap.select("INBOX")
imap.add_response_handler do |resp|
  if resp.kind_of?(Net::IMAP::UntaggedResponse) && resp.name == "EXISTS"
    # New mail received. Ping back and process.
  end
end
# Re-issue IDLE every 10 minutes so the server doesn't drop the session
# (Net::IMAP#idle accepts a timeout and sends DONE when it expires).
loop do
  imap.idle(10 * 60) { }
end
This makes one connection to the IMAP server and starts idling. As you can see, the final loop makes it blocking.
I would like to have multiple IMAP connections idling at the same time for my users. Initially I just wanted to put each of them in a thread, like so:
Thread.new do
start_imap_idling(server, port, usessl, username, password)
end
I'm not that sharp on threads yet, but with this solution won't I still have to block my main thread so the worker threads can keep running? So if I do something like:
User.find_each do |user|
  Thread.new do
    start_imap_idling(user.server, user.port, user.usessl, user.username, user.password)
  end
end
loop do
  # Wait
end
That would work, but not without the loop at the bottom to keep the main thread (and with it the process) alive while the threads run?
My question is how best to mesh this with my Ruby on Rails application on Heroku. I can't block the main thread with that last loop - so how do I run this? Another server? An extra dyno - perhaps a worker? I've been reading a bit about EventMachine - could this solve my problem, and if so, how should I go about writing it?
Another thing is that I would like to be able to add new IMAP clients and remove current ones on the fly. How might that look? Something with a queue perhaps?
Any help and comments are very much appreciated!

I'm not familiar with the specifics of RoR, EventMachine, etc. -- but it seems like you'd want to set up a producer/consumer.
The producer is your thread that's listening for changes from the IMAP server. When it gets changes it writes them to a queue. It seems like you'd want to set up multiple producers, one for each IMAP connection.
Your consumer is a thread that blocks on read from the queue. When something enters the queue it unblocks and processes the event.
Your main thread would then be free to do whatever you want. It sounds like you'd want your main thread doing things like adding new IMAP clients (i.e., producers) and removing current ones on the fly.
As for where you'd run these things: You could run the consumers and producer in one executable, in separate executables on the same machine, or on different machines... all depending upon your circumstances.
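A minimal sketch of that layout in Ruby, using the standard library's thread-safe Queue; start_imap_idling (assumed here to yield an event per new message) and process_new_mail are hypothetical stand-ins:

events = Queue.new   # thread-safe FIFO shared by the producers and the consumer

# Producers: one thread per IMAP connection, each pushing events onto the queue.
producers = users.map do |user|
  Thread.new do
    start_imap_idling(user) do |response|   # hypothetical: yields on each EXISTS
      events << { user: user, response: response }
    end
  end
end

# Consumer: Queue#pop blocks while the queue is empty, so no busy-waiting.
consumer = Thread.new do
  loop do
    item = events.pop
    process_new_mail(item)   # hypothetical processing hook
  end
end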
HTH

Related

How do I ping thousands of hosts and show the status as quickly as possible

I need to implement a task in Rails which pings a huge number of hosts and checks whether each is reachable. The results have to be displayed as fast as possible, and the number of hosts can scale up to some 10k or maybe even more.
So far I have tried doing it with a thread pool, and it's taking far too long.
work_q = Queue.new
@hosts.each { |x| work_q.push x }
workers = (0...200).map do
  Thread.new do
    begin
      while host = work_q.pop(true)
        ping_count = 1
        server = host.address
        result = `ping -q -c #{ping_count} #{server}`
        if $?.exitstatus == 0
          @res[host.hostname] = "up"
        else
          @res[host.hostname] = "down"
        end
      end
    rescue ThreadError
      # pop(true) raises ThreadError once the queue is empty
    end
  end
end
workers.each(&:join)
I also tried to use Sidekiq to implement it as an async task. Here is how I thought of implementing it:
1. Pass the host IP to the job queue in Sidekiq and record the last job/worker id.
2. Check the status of the last worker id and persist it in some way (not sure of the best way to persist the worker id).
3. Schedule a task to check the completion status of the last worker. Once the last worker has completed, restart Sidekiq and ping all the hosts over again.
4. This way you have the latest status (up or down) of all hosts, based on ping results, at a minimal interval.
5. Whenever a user clicks to check the status of the hosts, show them the result, and it will be the latest one.
Can anybody suggest other ideas for doing this in a better way?
Thanks for the help.
Adding the comment as an answer
I think the problem might be that you're executing a new process for each ping, so the OS spends quite a bit of time on process allocation. Have you tried using a library such as net-ping? This should reduce the time considerably. Also, since the work is I/O-bound, you can increase the number of threads to something like 2k (depending on the Ruby implementation), as most of those threads will be sleeping most of the time.
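A rough sketch of that combination, reusing the @hosts/@res structures from the question (Net::Ping::ICMP avoids spawning a subprocess but needs root; Net::Ping::TCP is an unprivileged alternative):

require 'net/ping'   # gem install net-ping

work_q = Queue.new
@hosts.each { |host| work_q.push(host) }
res_mutex = Mutex.new   # @res is written from many threads, so guard it

workers = (0...2000).map do
  Thread.new do
    begin
      while host = work_q.pop(true)
        # ICMP echo without forking a subprocess (root privileges required)
        up = Net::Ping::ICMP.new(host.address, nil, 1).ping?
        res_mutex.synchronize { @res[host.hostname] = up ? "up" : "down" }
      end
    rescue ThreadError
      # queue drained
    end
  end
end
workers.each(&:join)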
I've never used it, but a bit of searching just turned up a library called PacketFu which allows you to send ICMP packets from Ruby (it relies on libpcap). It also allows you to sniff packets from Ruby.
So, here's an idea:
Rather than creating a new ping process for each and every host you want to ping, use PacketFu to send ICMP echo requests to each host directly from the main Ruby process. At the same time, in a different thread, sniff packets using PacketFu's Capture class and match up the source IPs with the addresses you are trying to ping.
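Untested, but a sketch of what that might look like (the interface name and the exact PacketFu calls are assumptions; check the PacketFu docs before relying on them):

require 'packetfu'   # gem install packetfu; needs libpcap and root

iface  = 'eth0'   # assumption: your capture/send interface
config = PacketFu::Utils.whoami?(:iface => iface)

last_reply = {}
mutex = Mutex.new

# Sniffer thread: record the last time each source IP answered.
Thread.new do
  cap = PacketFu::Capture.new(:iface => iface, :start => true, :filter => 'icmp')
  cap.stream.each do |raw|
    pkt = PacketFu::Packet.parse(raw)
    if pkt.is_a?(PacketFu::ICMPPacket)
      mutex.synchronize { last_reply[pkt.ip_saddr] = Time.now }
    end
  end
end

# Main thread: send echo requests directly, no subprocess per host.
hosts.each do |ip|
  pkt = PacketFu::ICMPPacket.new(:config => config)
  pkt.icmp_type = 8          # ICMP echo request
  pkt.ip_daddr  = ip
  pkt.recalc
  pkt.to_w(iface)
  sleep 0.01                 # pace the sends; don't flood anyone
end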
You will have to make sure all program state is accessed in a thread-safe manner or things will go BOOM! If you also have a web server or something running in-process for the user-facing interface, it also has to play nice and safe and not stick its little hands in the program state without locking!
Make sure the amount of memory you use is bounded, too! Don't try to keep a record of each and every ICMP echo reply which comes back, or you will have a memory-eating monster! It would be better to just record the last time when a reply came back from each host.
Just one more piece of advice before I send you on your way. Many hosts have firewall rules which limit how many pings they will accept per second. Even if they don't, I assume you are a nice person and don't want to DOS anyone. So don't get out of control and start machine-gunning those pings out at poor innocent people.

Ruby on Rails with IMAP IDLE for multiple accounts

I'm currently building a Ruby on Rails app that allows users to sign in via Gmail and keeps a constant IDLE connection to their inbox. Emails need to arrive in the app as soon as they come into the Gmail inbox.
Currently I have the following in terms of implementation, and some issues that I really need some help figuring out.
At the moment, when the Rails app boots, it creates a thread per user which authenticates and runs in a loop to keep the IDLE connection alive.
Every 10-15 minutes, the thread "bounces IDLE", so that a little data is transferred to make sure the IDLE connection stays alive.
The major issue, I think, is scalability and the number of connections the app holds to Postgres. Each thread seems to require its own Postgres connection, which on Heroku is hard-limited by the plan's maximum connection count (20 for basic, 500 for the plans above that).
I really need help with the following:
What's the best way to keep all these IDLE connections alive while reducing the number of threads and database connections needed?
Note: a user's Gmail token may need refreshing when it runs out, so this requires access to the database.
Are there any other suggestions for how this may be implemented?
EDIT:
I have implemented something similar to the OP in this question: Ruby IMAP IDLE concurrency - how to tackle?
There is no need to spawn a new thread for each IMAP session. These can be done in a single thread.
Maintain an Array (or Hash) of all users and their IMAP sessions. Spawn a single thread and, in that thread, send a keep-alive to each of the connections one after the other; run that loop periodically. This will give you far more concurrency than your current approach.
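A minimal sketch of that idea. One caveat: stdlib Net::IMAP#idle blocks the calling thread, so this sketch uses NOOP as the periodic keep-alive (per RFC 3501, NOOP also prompts the server to deliver pending untagged responses, which reach any registered response handlers); genuinely concurrent IDLE needs the EventMachine approach below.

# sessions: user id => an authenticated, SELECTed Net::IMAP connection,
# each with an add_response_handler watching for EXISTS responses.
sessions = {}

keeper = Thread.new do
  loop do
    sessions.each_value do |imap|
      begin
        imap.noop   # keeps the session alive and polls for new messages
      rescue IOError, Net::IMAP::Error
        # a real implementation would reconnect and re-login here
      end
    end
    sleep 60
  end
end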
A longer-term approach would be to use EventMachine, which allows many IMAP connections in the same thread. If you are processing web requests in the same process, you should create a separate thread for EventMachine. This approach can give you phenomenal concurrency. See https://github.com/ConradIrwin/em-imap for an EventMachine-compatible IMAP library.
Start an EventMachine in Rails
Since you are on Heroku, you are probably using Thin, which already starts an EventMachine for you. However, should you ever move to another host and use another web server (e.g. Phusion Passenger), you can start an EventMachine with a Rails initializer:
module IMAPManager
  def self.start
    if defined?(PhusionPassenger)
      PhusionPassenger.on_event(:starting_worker_process) do |forked|
        # for Passenger, we need to avoid orphaned threads
        if forked && EM.reactor_running?
          EM.stop
        end
        Thread.new { EM.run }
        die_gracefully_on_signal
      end
    else
      # facilitates debugging
      Thread.abort_on_exception = true
      # just spawn a thread and start it up
      Thread.new { EM.run } unless defined?(Thin)
      # Thin is built on EventMachine and doesn't need this thread
    end
  end

  def self.die_gracefully_on_signal
    Signal.trap("INT") { EM.stop }
    Signal.trap("TERM") { EM.stop }
  end
end

IMAPManager.start
(adapted from a blog post by Joshua Siler.)
Share 1 connection
What you have is a good start, but having O(n) threads with O(n) database connections is probably hard to scale. However, since most of these database connections are idle most of the time, one might consider sharing a single database connection.
As @Deepak Kumar mentioned, you can use the EM IMAP adapter to maintain the IMAP IDLE connections. In fact, since you are using EM within Rails, you might be able to simply use Rails' database connection pool by making your changes through the Rails models. More information on configuring the connection pool can be found here.
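For reference, a sketch of what an em-imap IDLE session looks like, adapted loosely from the em-imap README; treat the exact calls as assumptions to verify, and the credential/hook names as made up:

require 'em-imap'   # gem 'em-imap'

EM.run do
  client = EM::IMAP.new('imap.gmail.com', 993, true)
  client.connect.bind! do
    client.login(user.email, user.password)   # assumption: credentials on your User model
  end.bind! do
    client.select('INBOX')
  end.bind! do
    # Issues IDLE and yields on each untagged response that arrives.
    client.wait_for_new_emails do |response|
      process_new_mail(user)   # hypothetical hook
    end
  end.errback do |error|
    # reconnect / token refresh would go here
  end
end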

What available message solutions are there for inter-process communication in ruby?

I have a rails app using delayed_job. I need my jobs to communicate with each other for things like "task 5 is done" or "this is the list of things that need to be processed for task 5".
Right now I have a special table just for this, and I always access the table inside a transaction. It's working fine. I want to build out a cleaner API/DSL for it, but first I wanted to check whether existing solutions already cover this. Weirdly, I haven't found a single thing; I'm either googling completely wrong, or the task is so simple (set and get values inside a transaction) that no one has abstracted it out yet.
Am I missing something?
clarification: I'm not looking for a new queueing system; I'm looking for a way for background tasks to communicate with one another - basically just safely shared variables. Do the frameworks below offer this facility? It's a shame that delayed_job does not.
use case: "do these 5 tasks in parallel, and then when they are all done, do this 1 final task." So each of the 5 tasks checks whether it is the last one to finish, and if it is, it fires off the final task (see the sketch below).
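A sketch of that check done the way described above, with the bookkeeping read and written inside a transaction (Batch, Task, and FinalJob are hypothetical names):

# Run inside each of the 5 parallel jobs, after its own work succeeds.
Task.transaction do
  # Lock the parent batch row so two finishers can't both see "not done yet".
  batch = Batch.lock.find(batch_id)
  Task.find(task_id).update!(done: true)
  if Task.where(batch_id: batch.id, done: false).none?
    FinalJob.enqueue(batch.id)   # hypothetical: kick off the final task
  end
end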
I use Resque. There are also lots of plugins, which should make inter-process comms easier.
Using Redis has another advantage: you can use its pub/sub channels for communication between workers/services, as in the sketch below.
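A minimal sketch with the redis-rb client (the channel name and fire_final_task hook are made up):

require 'redis'   # gem install redis

# In the worker that just finished a task:
Redis.new.publish('tasks', 'task 5 is done')

# In a listening worker; subscribe blocks, so run it in its own thread/process:
Redis.new.subscribe('tasks') do |on|
  on.message do |_channel, message|
    fire_final_task if message == 'task 5 is done'   # hypothetical hook
  end
end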
Another approach (untested by me): http://www.zeromq.org/, which also has Ruby bindings. If you like to test new stuff, give ZeroMQ a try.
Update
To clarify/explain/extend my comments above:
Why I would switch from DelayedJob to Resque: the advantage mentioned above, that the queue and the messages live in one system, because Redis offers both.
Further sources:
https://github.com/blog/542-introducing-resque
https://github.com/defunkt/resque#readme
If I had to stay on DJ, I would extend the worker classes with Redis or ZeroMQ (only examples here) to get messaging into my existing background jobs.
I would not try messaging with ActiveRecord/MySQL (not even queueing, actually!) because such a DB isn't the best-performing system for this use case, especially if the application has many background workers, huge queues, and countless message exchanges in short periods.
If it is a small app with few workers, you could also implement simple messaging via the DB, but even there I would prefer memcached instead; messages are short-lived chunks of data that can be handled entirely in memory.
Shared variables will never be a good solution. Think of multiple machines where your application and your workers can live. How would you ensure safe variable transfer between them?
Okay, someone could mention DRb (distributed Ruby), but it seems hardly used anymore (I've never seen a real-world example so far).
If you want to play around with DRb, however, read this short introduction.
My personal preference order: Messaging (real) > Database driven messaging > Variable sharing
memcached
rabbitmq
You can use pipes:

require 'timeout'

reader, writer = IO.pipe

fork do
  reader.close   # the child only writes
  loop do
    payload = { name: 'Kris' }
    # NOTE: puts/gets framing works here because this Marshal dump contains
    # no newline; length-prefixing is the robust way to frame binary data.
    writer.puts Marshal.dump(payload)
    sleep(0.5)
  end
end

writer.close     # the parent only reads
loop do
  begin
    Timeout.timeout(1) do
      p Marshal.load(reader.gets)   # => {:name=>"Kris"}
    end
  rescue Timeout::Error
    # no-op, no messages to receive
  end
end
Pipes are one-way and are read as a byte stream; they are expressed as a pair, a reader and a writer. To get two-way communication you need two sets of pipes.

Executing large numbers of asynchronous IO-bound operations in Rails

I'm working on a Rails application that periodically needs to perform large numbers of IO-bound operations. These operations can be performed asynchronously. For example, once per day, for each user, the system needs to query Salesforce.com to fetch the user's current list of accounts (companies) that he's tracking. This results in huge numbers (potentially > 100k) of small queries.
Our current approach is to use ActiveMQ with ActiveMessaging. Each of our users is pushed onto a queue as a different message. Then the consumer pulls a user off the queue, queries Salesforce.com, and processes the results. But this approach gives us horrible performance. Within a single poller process, we can only handle one user at a time, so the Salesforce.com queries become serialized. Unless we run literally hundreds of poller processes, we can't come anywhere close to saturating the server running the poller.
We're looking at EventMachine as an alternative. It has the advantage of allowing us to kick off large numbers of Salesforce.com queries concurrently within a single EventMachine process. So we get great parallelism and utilization of our server.
But there are two problems with EventMachine. 1) We lose the reliable message delivery we had with ActiveMQ/ActiveMessaging. 2) We can't easily restart our EventMachine processes periodically to lessen the impact of memory growth. For example, with ActiveMessaging we have a cron job that restarts the poller once per day, and this can be done without worrying about losing any messages. But with EventMachine, if we restart the process, we could literally lose hundreds of messages that were in progress. The only way I can see around this is to build a persistence/reliable-delivery layer on top of EventMachine.
Does anyone have a better approach? What's the best way to reliably execute large numbers of asynchronous IO-bound operations?
I maintain ActiveMessaging, and have been thinking about the issues of a multi-threaded poller as well, though perhaps not at the same scale you're dealing with. I'll give you my thoughts here, but am also happy to discuss further on the ActiveMessaging list, or via email if you like.
One trick is that the poller is not the only serialized part of this. With STOMP subscriptions, if you use client -> ack (to avoid losing messages on interrupt), the broker will only send a new message on a given connection once the prior message has been ack'd. Basically, you can only have one message being worked on at a time per connection.
So to keep using a broker, the trick will be to have many broker connections/subscriptions open at once. The current poller is pretty heavy for this, as it loads up a whole Rails env per poller, and one poller is one connection. But there is nothing magical about the current poller; I could imagine writing a poller as an EventMachine client that creates many connections to the broker and receives many messages at once.
In my own experiments lately, I have been thinking about using Ruby Enterprise Edition and having a master thread that forks many poller worker threads so as to get the benefit of the reduced memory footprint (much like passenger does), but I think the EM trick could work as well.
I am also an admirer of the Resque project, though I do not know that it would be any better at scaling to many workers - I think the workers might be lighter weight.
http://github.com/defunkt/resque
I've used AMQP with RabbitMQ in a way that would work for you. Since ActiveMQ implements AMQP, I imagine you could use it in a similar way. I have not used ActiveMessaging, which, although it seems like an awesome package, I suspect may not be appropriate for this use case.
Here's how you could do it, using AMQP:
Have the Rails process send a message saying "get info for user i".
The consumer pulls this off the message queue, making sure to specify that the message requires an 'ack' to be permanently removed from the queue. This means that if the message is not acknowledged as processed, it is eventually returned to the queue for another worker.
The worker then fans the message out into the thousands of small requests to SalesForce.
When all of these requests have successfully returned, another callback should be fired to ack the original message and return a "summary message" with all the info germane to the original request. The key is using a message queue that lets you acknowledge successful processing of a given message, and making sure to do so only when the relevant processing is complete.
Another worker pulls that message off the queue and performs whatever synchronous work is appropriate. Since all the latency-inducing bits have already been performed, I imagine this should be fine.
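A sketch of the ack-on-success part of this with the Bunny RabbitMQ client (Bunny, the queue name, and fetch_salesforce_accounts are assumptions, not what the poster used):

require 'bunny'   # gem install bunny

conn = Bunny.new.tap(&:start)
ch   = conn.create_channel
q    = ch.queue('user_sync', durable: true)

# manual_ack: true means an unacked message returns to the queue if we die.
q.subscribe(manual_ack: true, block: true) do |delivery, _props, payload|
  begin
    fetch_salesforce_accounts(payload.to_i)   # hypothetical fan-out of small queries
    ch.ack(delivery.delivery_tag)             # ack only after everything succeeded
  rescue => e
    ch.nack(delivery.delivery_tag, false, true)   # requeue for another worker
  end
end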
If you're using (C)Ruby, try to never combine synchronous and asynchronous stuff in a single process. A process should either do everything via Eventmachine, with no code blocking, or only talk to an Eventmachine process via a message queue.
Also, writing asynchronous code is incredibly useful, but also difficult to write, difficult to test, and bug-prone. Be careful. Investigate using another language or tool if appropriate.
Also check out "cramp" and "beanstalk".
Someone sent me the following link: http://github.com/mperham/evented/tree/master/qanat/. This is a system that's somewhat similar to ActiveMessaging except that it is built on top of EventMachine. It's almost exactly what we need. The only problem is that it seems to only work with Amazon's queue, not ActiveMQ.

requeue a sweatshop job in RabbitMQ

I am working on a Rails application where customer refunds are handed to a Sweatshop worker. If a refund fails (because we cannot reach the payment processor at that time) I want to requeue the job.
class RefundWorker < Sweatshop::Worker
  def process_refund(job)
    if refund   # performs the refund; implementation elided
      Transaction.find(job[:transaction]).update_attributes(:status => 'completed')
    else
      sleep 3
      RefundWorker.async_process_refund(job)   # requeue the job
    end
  end
end
Is there any better way to do this? I haven't found any "delay" feature in RabbitMQ, and this is the best solution I've come up with so far. I want to avoid a busy loop while requeueing.
Have you looked at things like Ruote and Minion?
Some links here: http://delicious.com/alexisrichardson/rabbitmq+work+ruby
You could also try Celery which does not speak native Ruby but does speak HTTP+JSON.
All of the above work with RabbitMQ, so may help you.
Cheers
alexis
Have a timed-delivery service? You'd send the message to be delivered as a payload, wrapped up with a time-to-deliver, and the service would hold onto the message until the specified time had been reached. Nothing like that exists in the RabbitMQ server or any of the AMQP client libraries, as far as I'm aware, but it'd be a useful thing to have.
It doesn't seem like AMQP (or at least RabbitMQ) supports the idea of "delay this job", so requeueing the same job from inside the worker when it fails seems to be the best solution at this time.
I have the code working in a demo environment, and it's meeting my needs so far.
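For what it's worth, newer RabbitMQ versions can emulate a delay server-side with per-queue message TTLs and dead-letter exchanges, which avoids sleeping inside the worker. A sketch with the Bunny client (queue names are made up, and this relies on RabbitMQ features added after this thread):

require 'bunny'   # gem install bunny

conn = Bunny.new.tap(&:start)
ch   = conn.create_channel

ch.queue('refunds', durable: true)   # the queue the workers consume

# Failed refunds sit here with no consumer; when the TTL expires,
# RabbitMQ dead-letters them back onto the work queue automatically.
ch.queue('refunds.wait', durable: true, arguments: {
  'x-message-ttl'             => 3_000,   # the 3-second delay, server-side
  'x-dead-letter-exchange'    => '',      # default exchange
  'x-dead-letter-routing-key' => 'refunds'
})

# On failure, publish the job here instead of sleeping and re-enqueueing:
ch.default_exchange.publish(job.to_json, :routing_key => 'refunds.wait')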
