I'd like to run a small amount of code asynchronously in my rails application. The code logs some known information. I'd like this task to not block the rest of the response from my app. The operation is way too lightweight and frequent for it to be done as a delayed job tasks.
I'm thinking to just use:
Thread.new do
# my logging code
end
and call it a day. Is this achieving what I want it to achieve? Are there any drawbacks?
It may be overkill for your particular usage, but have you considered using some form of Message Queuing Middleware such as STOMP, AMQP, OpenWire or even Jabber?
The basic outline (pseudo-code!) would be:
s = client.create_connection(user,pass,server_ip,port)
s.message_send("Log Message Goes Here")
You would then have at least one "consumer" at the other end of the message queue which would write the log message to a file/database/chatroom/IRC Channel/whatever-you-want-really-it's-all-code... :)
It would also mean that if in future you wanted to hand-off high-intensity processing jobs (invoice generation for example) you would already have the infrastructure in place to do so.
Also if you're looking for a really easy Messaging server, I recommend RabbitMQ - it's written in Erlang (but don't let that put you off!) and is very easy to setup.
This is my first post so I can't post any more than two links, but I've published links to all the technologies mentioned above in a gist # https://gist.github.com/1090372
Aside from handling some baseline resource contention things, this should be sufficient. The database resource comes to mind if you're logging there...a file if you're logging there. But the basic approach is fine I would think. Simpler is better...
Related
I want to build a service that notifies me when a url returns status 200. I'm currently using a sidekiq worker, if the status == 200, it updates my database (row.available = true), if not, it raises an exception and retries the worker in n seconds, n amount of times.
Though this works, it doesn't feel efficient or scalable (1000's checks would result in 1000's of exceptions, and on certain platforms that's bad news -- JRuby), and I'm sure there is a way I can build an internal service to manage this url monitoring that doesn't rely on sidekiq (perhaps in Go, or another, more suited Ruby gem). However, I have no idea where to begin, and so I'd appreciate some general direction.
Writing and running a simple link checker is easy. Doing that for 1000s of links quickly, without redundancy, and handling dead and slow-responding links without bogging down your entire system gets harder.
I'd use three threads, plus two queues:
A dispatcher thread that only reads from the database. It is responsible for finding and queuing URLs to be checked in to a "to be checked" queue.
A worker thread that consumes from the first queue and pushes results into the "updated URL results" queue.
An updater/consumer thread that takes the result of a thread in #2 and updates the database.
Ruby has some built-in classes to help:
Thread
Queue
I'd highly recommend Typhoeus and Hydra for use in the middle thread. The documentation for these two classes cover a lot of what you need to do as far as handling multiple threads running in parallel.
I wouldn't write this code as part of a Rails application. There is no value added by Rails to this, nor is it necessary. I would either require Active Record and piggy-back on the existing database.yaml settings and models, or use Rails' "runner" to run the code as an adjunct to the Rails code.
Or, I'd write a small, application-specific, piece of code to run on a different server to avoid bogging down the Rails server. Using something like MySQL or PostgreSQL drivers would let you talk to the same database that Rails uses. In this case I'd use the Sequel gem to act as the ORM, but that's because I prefer it over Active Record.
There are a lot of things to consider as you write this code, including retries of failed URLs, sensing redirections and updating the source URLs to reflect them to avoid wasting time, and not beating up the hosting servers causing you to be banned.
I've written several apps for this purpose over the years and doing it right takes a lot of forethought, so think out your design up front otherwise you could end up with some major rewrites later on.
According to the documentation:
Q: How many times will I receive each message?
Amazon SQS is
engineered to provide “at least once” delivery of all messages in its
queues. Although most of the time each message will be delivered to
your application exactly once, you should design your system so that
processing a message more than once does not create any errors or
inconsistencies.
Is there any good practice to achieve the exactly-once delivery?
I was thinking about using the DynamoDB “Conditional Writes” as distributed locking mechanism but... any better idea?
Some reference to this topic:
At-least-once delivery (Service Behavior)
Exactly-once delivery (Service Behavior)
FIFO queues are now available and provide ordered, exactly once out of the box.
https://aws.amazon.com/sqs/faqs/#fifo-queues
Check your region for availability.
The best solution really depends on exactly how critical it is that you not perform the action suggested in the message more than once. For some actions such as deleting a file or resizing an image it doesn't really matter if it happens twice, so it is fine to do nothing. When it is more critical to not do the work a second time I use an identifier for each message (generated by the sender) and the receiver tracks dups by marking the ids as seen in memchachd. Fine for many things, but probably not if life or money depends on it, especially if there a multiple consumers.
Conditional writes sound like a clever solution, but it has me wondering if perhaps AWS isn't such a great solution for your problem if you need a bullet proof exactly-once solution.
Another alternative for distributed locking is Redis cluster, which can also be provisioned with AWS ElasticCache. Redis supports transactions which guarantee that concurrent calls will get executed in sequence.
One of the advantages of using cache is that you can set expiration timeouts, so if your message processing fails the lock will get timed release.
In this blog post the usage of a low-latency control database like Amazon DynamoDB is also recommended:
https://aws.amazon.com/blogs/compute/new-for-aws-lambda-sqs-fifo-as-an-event-source/
Amazon SQS FIFO queues ensure that the order of processing follows the
message order within a message group. However, it does not guarantee
only once delivery when used as a Lambda trigger. If only once
delivery is important in your serverless application, it’s recommended
to make your function idempotent. You could achieve this by tracking a
unique attribute of the message using a scalable, low-latency control
database like Amazon DynamoDB.
In short - we can put item or update item in dynamodb table with condition expretion attribute_not_exists(for put) or if_not_exists(for update), please check example here
https://stackoverflow.com/a/55110463/9783262
If we get an exception during put/update operations, we have to return from a lambda without further processing, if not get it then process the message (https://aws.amazon.com/premiumsupport/knowledge-center/lambda-function-idempotent/)
The following resources were helpful for me too:
https://ably.com/blog/sqs-fifo-queues-message-ordering-and-exactly-once-processing-guaranteed
https://aws.amazon.com/blogs/aws/introducing-amazon-sns-fifo-first-in-first-out-pub-sub-messaging/
https://youtu.be/8zysQqxgj0I
I'm new to Erlang but I would like to get started with an application which feels applicable to the technology due to the concurrency desires I have.
This picture highlights what i want to do.
http://imagebin.org/163917
Where messages are pulled from a queue and routed to worker processes which have previously been setup as a result of a user making some input a form in a Django app. The setup requires some additional database (preexisting database so I don't want to use ETS/DETS for this bit) lookup which then talks to the message router and creates a relevant process.
My issue comes with given that I may want to ask my Django app in the future for all the workers that need to be setup and task them in the first place, what is the best way to communicate here. I favour HTTP/ json and have read up what little I can find on Mochiweb and MochiJson and I think that would do what I want. I was planning on having a OTP supervisor and application, so would it be sensible to have a seperate mochiweb process which then passes erlang messages to the router?
I have struggled a little with mochiweb due to all the tutorials talking about how you use a script to create a directory structure, which seems to put mochiweb centric to a design - which isn't want I want here, I want a lightweight mochiweb process that does occassional work.
Please tear this apart, all comments welcome.
Cheers
Dave
mochiweb is awesome but I think what you actually want is webmachine. The complete documentation is available here and here. In a nutshell, webmachine is a toolkit for making REST applications, which I think is what you want. It uses mochiweb behind the scenes but hides all of the complex (and undocumented) details. When you create a webmachine project you'll get a complete OTP application and a default resource. From there you'll do something like the following:
Add your own resources (or modify + rename the default one).
Modify the dispatcher so your resources and paths make sense for your app.
Add code to create and monitor your worker processes - probably a gen_server and a supervisor. See this and related articles for ideas. Note you'll want to start both under the main supervisor provided to you when you created your project.
Modify your resources to communicate with your gen_server.
I didn't quite follow everything else you are asking - it may be easier to answer any follow-up questions in comments.
I have a rails app using delayed_job. I need my jobs to communicate with each other for things like "task 5 is done" or "this is the list of things that need to be processed for task 5".
Right now I have a special table just for this, and I always access the table inside a transaction. It's working fine. I want to build out a cleaner api/dsl for it, but first wanted to check if there were existing solutions for this already. Weirdly I haven't found a single things, I'm either googling completely wrong, or the task is so simple (set and get values inside a transaction) that no one has abstracted it out yet.
Am I missing something?
clarification: I'm not looking for a new queueing system, I'm looking for a way for background tasks to communicate with one another. Basically just safely shared variables. Do the below frameworks offer this facility? It's a shame that delayed job does not.
use case: "do these 5 tasks in parallel, and then when they are all done, do this 1 final task." So, each of the 5 tasks checks to see if it's the last one, and if it is, it fires off the final task.
I use resque. Also there are lots of plugins, which should make inter-process comms easier.
Using redis has another advantage: you can use the pub-sub channels for communication between workers/services.
Another approach (but untested by me): http://www.zeromq.org/, which also has ruby bindings. If you like to test new stuff, then try zeromq.
Update
To clarify/explain/extend my comments above:
Why I should switch from DelayedJobs to Resque is the mentioned advantage that I have queue and messages in one system because Redis offers this.
Further sources:
https://github.com/blog/542-introducing-resque
https://github.com/defunkt/resque#readme
If I had to stay on DJ I would extend the worker classes with redis or zeromq/0mq (only examples here) to get the messaging in my extisting background jobs.
I would not try messaging with ActiveRecord/MySQL (not even queueing actually!) because this DB isn't the best performing system for this use case especially if the application has too many background workers and huge queues and uncountable message exchanges in short times.
If it is a small app with less workers you also could implement a simple messaging via DB, but also here I would prefer memcache instead; messages are short living data chunk which can be handled in-memory only.
Shared variables will never be a good solution. Think of multiple machines where your application and your workers can live on. How you would ensure a save variable transfer between them?
Okay, someone could mention DRb (distributed ruby) but it seems not really used anymore. (never seen a real world example so far)
If you want to play around with DRb however, read this short introduction.
My personal preference order: Messaging (real) > Database driven messaging > Variable sharing
memcached
rabbitmq
You can use Pipes:
reader, writer = IO.pipe
fork do
loop do
payload = { name: 'Kris' }
writer.puts Marshal.dump(payload)
sleep(0.5)
end
end
loop do
begin
Timeout::timeout(1) do
puts Marshal.load(reader.gets) # => { name: 'Kris' }
end
rescue Timeout::Error
# no-op, no messages to receive
end
end
One way
Read as a byte stream
Pipes are expressed as a pair, a reader and a writer. To get two way communication you need two sets of pipes.
I'm working on a Rails application that periodically needs to perform large numbers of IO-bound operations. These operations can be performed asynchronously. For example, once per day, for each user, the system needs to query Salesforce.com to fetch the user's current list of accounts (companies) that he's tracking. This results in huge numbers (potentially > 100k) of small queries.
Our current approach is to use ActiveMQ with ActiveMessaging. Each of our users is pushed onto a queue as a different message. Then, the consumer pulls the user off the queue, queries Salesforce.com, and processes the results. But this approach gives us horrible performance. Within a single poller process, we can only process a single user at a time. So, the Salesforce.com queries become serialized. Unless we run literally hundreds of poller processes, we can't come anywhere close to saturating the server running poller.
We're looking at EventMachine as an alternative. It has the advantage of allowing us to kickoff large numbers of Salesforce.com queries concurrently within a single EventMachine process. So, we get great parallelism and utilization of our server.
But there are two problems with EventMachine. 1) We lose the reliable message delivery we had with ActiveMQ/ActiveMessaging. 2) We can't easily restart our EventMachine's periodically to lessen the impact of memory growth. For example, with ActiveMessaging, we have a cron job that restarts the poller once per day, and this can be done without worrying about losing any messages. But with EventMachine, if we restart the process, we could literally lose hundreds of messages that were in progress. The only way I can see around this is to build a persistance/reliable delivery layer on top of EventMachine.
Does anyone have a better approach? What's the best way to reliably execute large numbers of asynchronous IO-bound operations?
I maintain ActiveMessaging, and have been thinking about the issues of a multi-threaded poller also, though not perhaps at the same scale you guys are. I'll give you my thoughts here, but am also happy to discuss further o the active messaging list, or via email if you like.
One trick is that the poller is not the only serialized part of this. STOMP subscriptions, if you do client -> ack in order to prevent losing messages on interrupt, will only get sent a new message on a given connection when the prior message has been ack'd. Basically, you can only have one message being worked on at a time per connection.
So to keep using a broker, the trick will be to have many broker connections/subscriptions open at once. The current poller is pretty heavy for this, as it loads up a whole rails env per poller, and one poller is one connection. But there is nothing magical about the current poller, I could imagine writing a poller as an event machine client that is implemented to create new connections to the broker and get many messages at once.
In my own experiments lately, I have been thinking about using Ruby Enterprise Edition and having a master thread that forks many poller worker threads so as to get the benefit of the reduced memory footprint (much like passenger does), but I think the EM trick could work as well.
I am also an admirer of the Resque project, though I do not know that it would be any better at scaling to many workers - I think the workers might be lighter weight.
http://github.com/defunkt/resque
I've used AMQP with RabbitMQ in a way that would work for you. Since ActiveMQ implements AMQP, I imagine you can use it in a similar way. I have not used ActiveMessaging, which although it seems like an awesome package, I suspect may not be appropriate for this use case.
Here's how you could do it, using AMQP:
Have Rails process send a message saying "get info for user i".
The consumer pulls this off the message queue, making sure to specify that the message requires an 'ack' to be permanently removed from the queue. This means that if the message is not acknowledged as processed, it is returned to the queue for another worker eventually.
The worker then spins off the message into the thousands of small requests to SalesForce.
When all of these requests have successfully returned, another callback should be fired to ack the original message and return a "summary message" that has all the info germane to the original request. The key is using a message queue that lets you acknowledge successful processing of a given message, and making sure to do so only when relevant processing is complete.
Another worker pulls that message off the queue and performs whatever synchronous work is appropriate. Since all the latency-inducing bits have already performed, I imagine this should be fine.
If you're using (C)Ruby, try to never combine synchronous and asynchronous stuff in a single process. A process should either do everything via Eventmachine, with no code blocking, or only talk to an Eventmachine process via a message queue.
Also, writing asynchronous code is incredibly useful, but also difficult to write, difficult to test, and bug-prone. Be careful. Investigate using another language or tool if appropriate.
also checkout "cramp" and "beanstalk"
Someone sent me the following link: http://github.com/mperham/evented/tree/master/qanat/. This is a system that's somewhat similar to ActiveMessaging except that it is built on top of EventMachine. It's almost exactly what we need. The only problem is that it seems to only work with Amazon's queue, not ActiveMQ.