Configure request timeout on Heroku - ruby-on-rails

Is it possible to configure the request timeout on heroku?
I have some report queries that take longer than the default 30 seconds Heroku provides. Ideally, I'd like to set it for specific requests (just the reports).
Thanks.

Heroku seems to encourage responses of 500ms or less. Since they don't mention altering the request timeout on that page, my guess would be that they don't support changing it (though you could ask Heroku Support).
What kind of work are you doing for your report?
1. Can you run the calculation work in a background job and store the result (e.g. in memcache, in another table in your database, etc.)? How up-to-date does the information need to be?
2. Can you speed up your query? What does your query plan look like? Is there room for improvement by changing how the data is queried or adding indexes?
3. Would an approach like HTTP streaming help?
I would suggest trying approaches 1 or 2 first and seeing if they help. If you give us more detail on what exactly you're doing, we may be able to help more.
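A minimal sketch of approach 1, the precompute-and-cache idea, assuming Sidekiq is available; ReportWorker, the Report model, and generate_data are hypothetical names for illustration:

```ruby
# app/workers/report_worker.rb -- a hypothetical Sidekiq worker that
# precomputes the slow report and caches the result for the controller.
class ReportWorker
  include Sidekiq::Worker

  def perform(report_id)
    report = Report.find(report_id)   # assumes a Report model
    data   = report.generate_data     # the slow query lives here
    # Cache for an hour; the controller reads this instead of querying.
    Rails.cache.write("report/#{report_id}", data, expires_in: 1.hour)
  end
end
```

The controller would then serve Rails.cache.read("report/#{params[:id]}") and enqueue ReportWorker.perform_async(params[:id]) when the cache is cold, rather than running the query inside the 30-second request window.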

Related

How to process data on the server with rails and heroku

I am developing a website using Ruby on Rails and I am doing a bit of rough planning. I can and have deployed Rails websites before, just adding to the database and retrieving from the database based on my use case, but this time around it's a bit different. I am adding to the database, but I need the data to be processed on the server before it is sent back to the user, or before he retrieves it. What I do not get is how I am going to process the data on the server. I know this doesn't follow the normal pattern for asking questions; I would search for it with Google except I don't know what I am looking for. A nudge in the right direction will do.
What I want to do exactly is have users register and click a button (request) which puts the user's id in an array. On the server I then need to randomly (or not randomly) connect two users based on some qualities. This process keeps running indefinitely, so that a user can come back later to check whether he has already been connected with someone.
This kind of logic typically belongs in the controller, or perhaps in the models. You should read the Rails docs, particularly on controllers: http://guides.rubyonrails.org/action_controller_overview.html
I think you may find a lot of benefit from running a background job for this that is constantly looking for matches. You could have an infinitely running Sidekiq process that is queued up with users. Then once one finishes, just fire it up again.
Or you could create a rake task that does a User.find_each and have it run again when the task finishes. But that would become blocking if you end up having a lot of users. I'd recommend one job per user and just bloat the system with them. This way you can scale out both horizontally and vertically.
You'll want to learn about ActiveJob and Sidekiq to help achieve what you're looking for :). Sidekiq requires Redis, which you'll have to set up as well. I'd recommend the redis-rails gem to help with the integration.
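A rough sketch of that one-job-per-user matching idea, assuming Sidekiq; MatchWorker, the matched/waiting columns, and the pairing predicate are hypothetical:

```ruby
# app/workers/match_worker.rb -- tries to pair one waiting user with
# another; if nobody suitable is waiting yet, it re-enqueues itself.
class MatchWorker
  include Sidekiq::Worker

  def perform(user_id)
    user = User.find(user_id)
    return if user.matched?                  # already paired meanwhile

    # Placeholder for the "some qualities" check from the question.
    partner = User.where(matched: false, waiting: true)
                  .where.not(id: user.id).first

    if partner
      # In production you'd wrap this in a transaction or row lock so
      # two workers can't grab the same partner concurrently.
      user.update!(matched: true, partner_id: partner.id)
      partner.update!(matched: true, partner_id: user.id)
    else
      MatchWorker.perform_in(30.seconds, user_id)  # try again shortly
    end
  end
end
```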
To build on BenMorganIO's answer, I think this is a job for a background worker: a job that is processed in the background, so it doesn't slow down your app. A good example of this is firing off an email in the background.
There are primarily 3 gems I've seen for this:
delayed_job
Resque
Sidekiq (just celebrated 5 years!)
Those should point you in the right direction.
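For the email example specifically, newer Rails versions let you hand the work to any of those backends through ActiveJob with a one-word change (WelcomeMailer is a hypothetical mailer):

```ruby
# Synchronous: blocks the web request while the mail is sent.
WelcomeMailer.signup(user).deliver_now

# Asynchronous: enqueues a delivery job on your configured ActiveJob
# backend (Sidekiq, Resque, delayed_job, ...) and returns immediately.
WelcomeMailer.signup(user).deliver_later
```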

Respond with large amount of objects through a Rails API

I currently have an API for one of my projects, plus a service that is responsible for generating export files as CSVs, archiving them, and storing them somewhere in the cloud.
Since my API is written in Rails and my service in plain Ruby, I use the Her gem in the service to interact with the API. But I find my current implementation rather unperformant, since I do a Model.all in my service, which in turn triggers a request that may contain far too many objects in the response.
I am curious on how to improve this whole task. Here's what I've thought of:
implement pagination at API level and call Model.where(page: xxx) from my service;
generate the actual CSV at API level and send the CSV back to the service (this may be done sync or async).
If I were to use the first approach, how many objects should I retrieve per page? How big should a response be?
If I were to use the second approach, this would bring quite an overhead to the request (and I guess API requests shouldn't take that long) and I also wonder whether it's really the API's job to do this.
What approach should I follow? Or, is there something better that I'm missing?
You need to pass a lot of information through a Ruby process; that's never simple. I don't think you're missing anything here.
If you decide to generate CSVs at the API level, then what do you gain by maintaining the service? You could ditch the service altogether, because replacing it with an nginx proxy would do the same thing better (if you're just streaming the response from the API host).
If you decide to paginate, there will be a performance hit for sure, but nobody can tell you exactly how much you should paginate. Bigger pages will be faster but consume more memory (reducing throughput by letting you run fewer workers); smaller pages will be slower and consume less memory, but demand more workers because of IO wait times. The exact numbers will depend on the IO response times of your API app, your cloud provider, and your infrastructure. I'm afraid no one can give you a simple answer to follow without a stress test, and once you set up a stress test you will get a number of your own anyway, which beats anybody's estimate.
A suggestion: write a bit more about your problem, the constraints you are working under, and so on, and maybe someone can help you with a more radical solution. For some reason I get the feeling that what you're really looking for is a background processor like Sidekiq or delayed_job; or maybe connecting your service to the DB directly through a DB view, if you are anxious to decouple your apps; or an nginx proxy for API responses; or nothing at all... but I really can't tell without more information.
I think it really depends on how you want to define 'performance' and what the goal of your API is. If you want to make sure no request to your API takes longer than 20 msec to respond, then adding pagination would be a reasonable approach, especially if the CSV generation is just an edge case and the API is really built for other services. The number of items per page would then be limited by the speed at which you can deliver them. Your service would not become more performant (rather less so), since it needs to call the API multiple times.
Creating an async call (maybe with a webhook as a callback) would be worth adding to your API if you think it is a valid use case for services to dump the whole record set.
Having said that, I think strictly speaking it is the job of the API to be quick and responsive. So maybe try to figure out how caching can improve response times, so that paging through all the records stays reasonable. On the other hand, it is the job of the service to be mindful of the number of calls to the API, so maybe store old records locally and only poll for updates instead of dumping the whole set of records each time.
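If you go the pagination route, the service-side loop might look something like the sketch below. It assumes the API honours page/per_page params; Record, the endpoint URL, the field names, and the page size of 500 are all placeholders to tune with the stress test mentioned above:

```ruby
require "csv"
require "her"

# One-time Her configuration (the URL is a placeholder).
Her::API.setup url: "https://api.example.com" do |c|
  c.use Faraday::Request::UrlEncoded
  c.use Her::Middleware::DefaultParseJSON
  c.use Faraday::Adapter::NetHttp
end

# Hypothetical Her model backed by the Rails API's /records endpoint.
class Record
  include Her::Model
end

CSV.open("export.csv", "w") do |csv|
  csv << %w[id name created_at]          # header row (placeholder fields)
  page = 1
  loop do
    batch = Record.where(page: page, per_page: 500).to_a
    break if batch.empty?
    batch.each { |r| csv << [r.id, r.name, r.created_at] }
    page += 1
  end
end
```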

How should I schedule many Google Search scrapes over the course of a day?

Currently, my Nokogiri script iterates through Google's SERPs until it finds the position of the target website. It does this for each keyword of each website that each user specifies (users are capped on the number of websites & keywords they can track).
Right now it runs as a rake task that's hard-scheduled every day and batches all scrapes at once by looping through all the websites in the database. But I'm concerned about scalability and about swarming Google with a batch of requests.
I'd like a solution that scales and can run these scrapes over the course of the day. I'm not sure what kind of solution is available or what I'm really looking for.
Note: the number of websites/keywords changes from day to day as users add and delete their websites and keywords. I don't mean to make this question too broad, but is this the kind of thing Beanstalkd/Stalker (job queuing) can be used for?
You will have to balance two issues: scalability for lots of users versus Google shutting you down for scraping in violation of their terms of use.
So your system will need to be able to distribute tasks to various different IPs to conceal your bulk scraping, which suggests at least two levels of queuing: one to manage all the jobs and send them to each separate IP for subsequent searching and collecting of results, and queues on each separate machine to hold the requested searches until they are executed and the results returned.
I have no idea what Google's thresholds are (I am sure they don't advertise them), but exceeding them and getting cut off would obviously be devastating for what you are trying to do, so your simple looping rake task is exactly what you shouldn't do after a certain number of users.
So yes, use a queue of some sort, but realize that you probably have a different goal from the typical goal of a queue: you want to deliberately delay jobs rather than offload work to avoid UI delays. So you will be seeking ways to slow down the queue rather than have it just execute job after job as they arrive.
Based on a cursory inspection of DelayedJob and BackgroundJobs, it looks like DelayedJob has what you need with its run_at attribute. But I am only speculating here, and I am sure an expert would have more to say.
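To make the run_at idea concrete, a sketch with delayed_job; the Keyword model and ScrapeJob payload are hypothetical, and the even 24-hour spread is an arbitrary starting point:

```ruby
# A plain delayed_job payload: any object with a #perform method works.
ScrapeJob = Struct.new(:keyword_id) do
  def perform
    keyword = Keyword.find(keyword_id)
    # ... run the Nokogiri scrape for this single keyword here ...
  end
end

# Spread one job per keyword evenly across the next 24 hours instead of
# firing them all at Google at once.
keywords = Keyword.all.to_a
interval = 24.hours / [keywords.size, 1].max

keywords.each_with_index do |keyword, i|
  Delayed::Job.enqueue(ScrapeJob.new(keyword.id),
                       run_at: (interval * i).from_now)
end
```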
If I'm understanding correctly, it sounds like one of these tools might fit the bill:
Delayed_job: https://github.com/tobi/delayed_job
or
BackgroundJobs: http://codeforpeople.rubyforge.org/svn/bj/trunk/README
I've used both of them, and found them easy to work with.
There are definitely some background job libraries that might work.
delayed_job: https://github.com/collectiveidea/delayed_job (beware of the unmaintained branch from tobi!)
resque: https://github.com/defunkt/resque
However, you might think about just scheduling a cron job that runs more times during the day and processes fewer items per run (a sketch follows this answer).
SaaS solution: http://momentapp.com/ "Launch delayed jobs with scheduled http requests" - disclaimer: (a) it's in beta, (b) I am not affiliated with this service.
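The run-more-often-with-smaller-batches cron idea might look like this, assuming the whenever gem for scheduling and a naive modulo split of sites into 24 hourly buckets (Website and the task name are hypothetical):

```ruby
# config/schedule.rb (whenever gem) -- run hourly instead of once a day:
every 1.hour do
  rake "scrapes:run_batch"
end

# lib/tasks/scrapes.rake -- each hourly run handles only the websites
# whose turn it is this hour:
namespace :scrapes do
  task run_batch: :environment do
    bucket = Time.now.hour
    Website.find_each do |site|
      next unless site.id % 24 == bucket   # naive 24-bucket split
      # ... scrape this site's keywords ...
    end
  end
end
```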

Send delayed emails in Rails on Heroku

I have a table in my database with a list of emails to be sent, each at a specific time (precision down to the minute).
I'm on Heroku, and I don't want to spend anything right now. Is there a way to do this? The only way I thought of was to create a daemon/cron somewhere else and make it call a private URL every minute. Any other ideas? Any way to have some background process or something that can handle this (on Heroku and without paying extra for add-ons)?
thanks!
Heroku's free cron addon runs only once a day, so it is not suitable. Their paid cron addon runs only once an hour, so it is also not suitable. Running a daemon/cron elsewhere is a hack that will become problematic very quickly. It's fundamentally bad architecture.
Using delayed_job with a single Heroku Worker makes sense. Plus, delayed_job lets you specify exactly when each job should be run, down to a 5-second granularity. Yes, it is $36/mo to do this, but it frees you from doing things the wrong way. And if you expect that you will not need the Worker most of the time, you can look into auto-scaling delayed_job on Heroku so the Worker is only turned on when you need it.
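The enqueueing side of that might look like the following; ScheduledEmail and NotificationMailer are hypothetical stand-ins for the table and mailer in the question:

```ruby
# Somewhere in a periodic task or an after_create hook: enqueue each
# pending row so delayed_job delivers it at its scheduled minute.
ScheduledEmail.where(sent: false).find_each do |email|
  NotificationMailer.delay(run_at: email.send_at).notify(email.id)
end
```

The 5-second granularity mentioned above corresponds to the delayed_job worker's default polling interval.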
There is a whole bunch of free online services that would be more than happy to request your web page on a schedule that you set. You don't need to spend or code anything. Just Google :)

Best way to run rails with long delays

I'm writing a Rails web service that interacts with various pieces of hardware scattered throughout the country.
When a call is made to the web service, the Rails app then attempts to contact the appropriate piece of hardware, get the needed information, and reply to the web client. The time between the client's call and the reply may be up to 10 seconds, depending upon lots of factors.
I do not want to split the web service call in two (ask for information, answer immediately with a pending reply, then force another api call to get the actual results).
I basically see two options: either run JRuby and use multithreading, or run several regular Ruby instances and hope that not many people try to use the service at once. JRuby seems like the much better solution, but it still doesn't seem to be mainstream or have out-of-the-box support at Heroku and EngineYard. The multiple-instance solution feels like a total kludge.
1) Am I right about my two options? Is there a better one I'm missing?
2) Is there an easy deployment option for JRuby?
"I do not want to split the web service call in two (ask for information, answer immediately with a pending reply, then force another api call to get the actual results)."
From an engineering perspective, this seems like it would be the best alternative.
Why don't you want to do it?
There's a third option: if you host your Rails app with Passenger and enable global queueing, you can do this transparently. I have some actions that take several minutes with no issues (caveat: some browsers may time out, but that may not be a concern for you).
If you're worried about browser timeouts, or you cannot control the deployment environment, you may want to process the work in the background (a sketch follows this list):
User requests data
You enter request into a queue
Your web service returns a "ticket" identifier to check the progress
A background process processes the jobs in the queue
The user polls back, referencing the "ticket" id
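A compressed sketch of that ticket flow, written with modern ActiveJob conventions; HardwareRequest, HardwareJob, query_hardware, and the routes are hypothetical names:

```ruby
# Steps 1-3: accept the request, queue the slow work, return a ticket.
class HardwareRequestsController < ApplicationController
  def create
    ticket = HardwareRequest.create!(status: "pending")
    HardwareJob.perform_later(ticket.id)     # step 4 runs in a worker
    render json: { ticket_id: ticket.id }, status: :accepted
  end

  # Step 5: the client polls this endpoint with the ticket id.
  def show
    ticket = HardwareRequest.find(params[:id])
    render json: { status: ticket.status, result: ticket.result }
  end
end

# The background job talks to the slow hardware and stores the answer.
class HardwareJob < ApplicationJob
  def perform(ticket_id)
    ticket = HardwareRequest.find(ticket_id)
    ticket.update!(status: "done", result: query_hardware(ticket))
  end

  private

  # Hypothetical: contact the remote device and return its reply.
  def query_hardware(ticket)
    # ... device-specific protocol goes here ...
  end
end
```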
As far as hosting JRuby goes, I've deployed a couple of small internal applications using the glassfish gem, but I'm not sure how much I would trust it for customer-facing apps. Just make sure you set config.threadsafe! in production.rb. I've heard good things about Trinidad, too.
You can also run the web service call in a delayed background job so that it isn't tying up a web server, and the work can even run on a separate physical box. This is also a much more scalable approach. If you make the web call using AJAX, you can poll the server every second or two to see if your results are ready; that way your client is not held in limbo while the results are being calculated, and the request does not time out.
