RoR: Long-running query, Heroku timeouts: webhooks? Ajax? - ruby-on-rails

So here's the issue: we have data that the users want displayed. We've optimized and indexed the query to be about as fast as I think it's going to get. We might shave off a second or two, but with the amount of data involved there's not much more we can do.
The query runs great when limited to a day or two of data; however, the users are running it for a week or two of data. A query for two or three weeks of data takes about 40 seconds, and with a Heroku timeout of 30 seconds, that doesn't work. So we need a solution.
Searching here and on Google, I see comments that webhooks or Ajax would work as our solution. However, I've been unable to find a real concrete example. I also saw a comment where someone said you could send some kind of response that would "reset the clock." That sounded intriguing, but again I couldn't find an example.
We're kind of under the gun and the users are unhappy, so we need a solution that is fast and simple. Thanks in advance for your help!

I faced a similar problem. I have what is essentially a 'bulk download' page in my Sinatra app, which my client app calls to import data into a local webSQL db in the browser.
The download page (let's call it "GET /download") queries a CouchDB database to get all of the documents in a given view, and it's this part of the process (much like your query) that takes a long time.
I was able to work around the issue by using Sinatra's streaming API. As mentioned above, you can 'reset' the clock in Heroku by sending at least one byte in the response before the 30-second window is up. This resets the clock for another 55s (and each additional byte you send keeps resetting the clock again), so while you're still sending data, the connection can be kept open indefinitely.
In Sinatra, I was able to replace this:
get '/download' do
  body = db.view 'data/all'
end
..with something like this:
get '/download' do
  # stream is a Sinatra helper that effectively does 'chunked transfer encoding'
  stream do |out|
    # Long query starts here
    db.view 'data/all' do |row|
      # As each row comes back from the db, stream it to the front-end
      out << row
    end
  end
end
This works well for me, because the 'time to first byte' (i.e. the time taken for the db query to return the first row) is well under the 30s limit.
The only downside is that previously I was getting all of the results back from the database into my Sinatra app, then calculating an MD5 sum of the entire result to use as the ETag header. My client app could use this to do a conditional HTTP GET (i.e. if it tried to download again and no data had been modified, I could just send a 304 Not Modified); it could also compare it against its own checksum of the data it received (to ensure it hadn't been corrupted/modified in transit).
Once you start streaming data in the response, you lose the ability to calculate an MD5 sum of the content to send as an HTTP header (because you don't have all of the data yet, and you can't send headers after you've started sending body content).
I'm thinking about changing this to some sort of paginated, multiple-AJAX-call solution, as Zenph suggested above. That, or use some sort of worker process (e.g. Resque, DelayedJob) to offload the query to; but then I'm not sure how I'd notify the client when the data is ready to be fetched.
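For the Rails app in the original question, a roughly equivalent streaming approach is ActionController::Live (available since Rails 4). This is only a minimal sketch with illustrative controller, model, and query names, not the asker's actual code:

class ReportsController < ApplicationController
  include ActionController::Live

  def download
    response.headers['Content-Type'] = 'text/plain'
    # Illustrative query; find_each fetches rows in batches rather than
    # loading the whole result set into memory at once.
    Report.where(created_at: 3.weeks.ago..Time.current).find_each do |row|
      # Each write sends bytes to the client, which keeps Heroku's rolling
      # timeout window from expiring while the query is still running.
      response.stream.write("#{row.to_json}\n")
    end
  ensure
    response.stream.close
  end
end

The same caveat about headers applies: anything like an ETag has to be decided before the first byte is written.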
Hope this helps.

Related

How to display a holding screen whilst ActiveJob retrieves lots of data from an external API

I have an application that makes API requests to salesforce using restforce.
Specifically the application finds a contact object, returns IDs for all related objects and then pulls the full record for every related object based on their ID.
This takes a long time for two reasons:
There are a lot of requests to an external API; each usually takes a fraction of a second to reply, and for some contacts there can be 500+ individual requests.
There is often a large amount of data being pulled back via each request.
All requests currently fall within the Salesforce REST API limits, but I'm getting timeout errors from my development server because it can take 5+ minutes for some of these requests to complete.
Rails 4.2 - How best to handle this?
My question is how do I best get rails to handle this?
I can fire the API requests either from the controller (which definitely violates the skinny-controller principle) or from the view (via helper methods, which seems like a dodgy hack).
Ideally I'd like to get it running in a background job, but I'm unsure whether I can include all the authentication and other methods in a job in the same way I can include helper methods.
Even if I could get it to work in a background job, I'm unsure what best practice might be for the user experience. Ideally I'd like to route the user to a page telling them to "hang tight, go get a coffee" with a progress bar, and then automatically route them to the final page once the request is complete...
But I'm unsure how to generate a temporary display until the job has completed.
Could anyone recommend any gems or strategies that might help me digest this problem?
You should definitely use a background job for this.
Give the job a database object, which it will update to signal that it has finished, and maybe from time to time to indicate progress.
On the user side, simply tell them that the background job is working, possibly with a progress indicator, and display the result once the database object given to the job tells you it's ready.
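A minimal sketch of that pattern, assuming Rails 4.2's ActiveJob and an illustrative ImportStatus model with state and progress columns (fetch_related_ids and pull_full_record are hypothetical stand-ins for the existing Restforce lookup logic):

class SalesforceImportJob < ActiveJob::Base
  queue_as :default

  def perform(import_status_id, contact_id)
    status = ImportStatus.find(import_status_id)
    client = Restforce.new                      # credentials from ENV, as in a controller
    ids = fetch_related_ids(client, contact_id) # hypothetical: your existing ID lookup
    ids.each_with_index do |id, i|
      pull_full_record(client, id)              # hypothetical: your existing record fetch
      status.update(progress: (i + 1) * 100 / ids.size)
    end
    status.update(state: 'finished')
  end
end

The "hang tight" page can then poll a small JSON endpoint (e.g. with setInterval and an XHR call) that renders status.state and status.progress, and redirect to the final page once the state is 'finished'.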

Heroku Request Timeout (H12) with Big Data

I have a Ruby on Rails application which gets lots of data from social media sites like Twitter, Facebook etc.
There is an index page that shows the records paginated; I am using Kaminari for paging.
My issue is big data, I guess. Let's say I have millions of records and want to show them on my index page with Kaminari. When I try to load the page in the browser, Heroku gives me an H12 error (request timeout).
What can I do to improve my app's performance? My idea is to fetch only the records that will be shown on the current index page - likewise, when the Kaminari second-page link is clicked, to fetch only the second page of records from the database. That's the basic idea, but I don't know where to start or how to implement it.
Here's an example piece of code from my controller:
@ca_responses = @ca_responses_for_adaptors.where(:ca_request_id => @conditions)
  .order(sort_column + " " + sort_direction)
  .page(params[:page]).per(5)
@ca_responses: My records
@ca_responses_for_adaptors: Records based on adaptor. Think of it as admin - this returns all of the records.
@conditions: Gets the specified adaptor's records, for example only Twitter-related records.
You could start by creating a page cache table which will be filled with the data for your search results. That could be one approach.
There could be a few downsides, but if I knew the exact problem I could propose a better solution. I doubt that you will be listing a million users on one page and then accessing them by paginating through the pages (?), or am I mistaken?
EDIT:
There could be a few problems with pagination. The first is how the paginating gems work: they fetch all the data, and then when you click on a page number they only take the next 5 elements (or however many you have set) from the whole list. The problem here is fetching all the data before paginating. If you have a million records, this could take a while for every page. You could define a new method that runs a SQL query selecting one page's worth of data from the database, using an OFFSET clause to fetch the data only for that page. In this case the paginate gem is useless, so you would need to remove it.
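A rough sketch of that idea with ActiveRecord's limit/offset, reusing the variable names from the question's controller (this is essentially the LIMIT/OFFSET query described above, written through the ORM):

per_page = 5
page     = [params[:page].to_i, 1].max

@ca_responses = @ca_responses_for_adaptors
                  .where(:ca_request_id => @conditions)
                  .order(sort_column + " " + sort_direction)
                  .limit(per_page)
                  .offset((page - 1) * per_page)

This only ever asks the database for the five rows being displayed on the current page.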
The second option is to use something like a user_cache table. By this I mean creating a new table that holds just a few records - the records that will be displayed on the screen. That table will be smaller than the usual user table, so it would be faster to search through it.
There could be other, more advanced solutions, but I doubt you would want to use them in your application.
Kaminari already paginates your records as expected.
Heroku is prone to random timeout errors due to its random router.
Try to reproduce the problem locally. You may have bottlenecks in your code which do indeed make your request take too long to return. You should not have any problem requesting 5 items from the database, so the code before or after that query may be what takes a long time to run.
If everything is OK locally with production data, you may add new_relic to analyze your requests and see whether some problem occurs specifically in production (and why).
If it appears the Heroku router is indeed the problem, you can still try to use Unicorn as a webserver, but you have to take special care that your app does not consume too much memory (each Unicorn worker consumes the RAM of a whole app instance, and you may hit Heroku memory limits, which would produce R14 errors in place of those H12s).
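If you go that route, a minimal config/unicorn.rb for Heroku looks roughly like this; the worker count is the knob that trades memory for concurrency, and the 2-worker default and 25-second timeout here are just illustrative starting points:

# config/unicorn.rb
worker_processes Integer(ENV['WEB_CONCURRENCY'] || 2)
timeout 25          # kill stuck workers just before Heroku's 30s router timeout
preload_app true

before_fork do |server, worker|
  ActiveRecord::Base.connection.disconnect! if defined?(ActiveRecord::Base)
end

after_fork do |server, worker|
  ActiveRecord::Base.establish_connection if defined?(ActiveRecord::Base)
end

Watch the memory graph after each worker you add; R14 warnings mean you have gone one worker too far.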

Accessing huge volumes of data from Facebook

So I am working on a Rails application, and the person I am designing it for has what seem like extremely hefty data volume requirements. They want to gather ALL posts by a user that logs into the application, and all of the posts for each of their friends for the past year.
Before this particular level of detail was communicated to me, I built the thing using the fb_graph gem and would paginate through posts. The first problem is that it takes a very long time to do this, even when I change the number of posts requested per page. Second, I frequently run into OAuth error #613 (more than 600 requests per 600 seconds). After increasing each request to 200 posts I hit this limit less often, but it still takes an incredibly long time to get all of this data.
I am not particularly familiar with the FQL alternative, but it seems to me that we are going to have to prioritize either speed or volume of data. Is there a way that I am missing that would allow me to quickly retrieve this level of information?
Edit: I do save all posts to the database as I retrieve them. What is required is to make one pass through and grab all of the posts for the past year, for the user and friends. This process takes a long time and I am basically wondering if there is any way that it can be sped up.
One thing that I'd like to point out here:
You should implement some kind of local caching for the users' posts. I mean, instead of querying FB for the posts each time, you should save the posts in your local database and only check for new posts (whenever needed).
This is faster and saves you many API requests.
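A rough sketch of that caching idea, assuming a Post model with fb_id, message and posted_at columns; fetch_new_posts_since is a hypothetical stand-in for whatever fb_graph call you already use, restricted with the Graph API's 'since' parameter:

def sync_posts_for(user)
  latest = user.posts.maximum(:posted_at)
  # Only ask Facebook for posts newer than the most recent one already stored.
  fetch_new_posts_since(user, latest).each do |data|   # hypothetical helper
    user.posts.find_or_create_by(fb_id: data[:id]) do |post|
      post.message   = data[:message]
      post.posted_at = data[:created_time]
    end
  end
end

The first pass is still slow, but every pass after that only transfers the delta, which also keeps you further away from the 600-requests-per-600-seconds limit.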

Searching for a song while using multiple API's

I'm going to attempt to create an open project which compares the most common MP3 download providers.
This will require a user to enter a track/album/artist name, e.g. Deadmau5, and the site will then pull the relevant prices from the APIs.
I have a few questions that some of you may have encountered before:
Should I have one server-side page that requests all the data, so it is all loaded simultaneously? If so, how would you deal with timeouts or any other problems that may arise? Or should the page load first, and then each price get pulled in one by one (Ajax)? What are your experiences when running a comparison check?
The main feature will be to compare prices, but how can I be sure that the products are the same? I was thinking of using running time and track numbers, but I would still have to set one source as my primary.
I'm making this a wiki, please add and edit any issues that you can think of.
Thanks for your help. Look out for a future blog!
I would check Amazon first. They will give you a SKU (the barcode on the back of the album; I think Amazon calls it an EAN). If the other providers use this, you can make sure they are looking at the right item.
I would cache all results into a database, and expire them after a reasonable time. This way when you get 100 requests for Britney Spears, you don't have to hammer the other sites and slow down your application.
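A small sketch of that cache-and-expire idea, assuming an illustrative PriceResult model with provider, query and price columns (fetch_price_from is a hypothetical stand-in for the real API call):

def cached_price(provider, query)
  result = PriceResult.find_or_initialize_by(provider: provider, query: query)
  # Refresh when missing or older than whatever expiry window you consider reasonable.
  if result.new_record? || result.updated_at < 1.day.ago
    result.update(price: fetch_price_from(provider, query))   # hypothetical API call
  end
  result.price
end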
You should also make sure you are multithreading whatever requests you are doing server-side. Curl, for instance, allows you to pull multiple URLs and assign a user-defined callback. I'd have the callback send some data so you can update your page as the results come back. GETTUNES => the curl callback returns some data for each URL while the connection is open, and you parse it on the client side.
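In Ruby, a rough equivalent of that parallel-fetch pattern is the Typhoeus gem's Hydra (the provider_urls hash and the body handling here are illustrative):

require 'typhoeus'

provider_urls = { 'amazon' => 'https://example.com/amazon?q=Deadmau5' }   # illustrative
prices = {}

hydra = Typhoeus::Hydra.new
provider_urls.each do |provider, url|
  request = Typhoeus::Request.new(url)
  request.on_complete do |response|
    prices[provider] = response.body   # parse the price out of the body here
  end
  hydra.queue(request)
end
hydra.run   # blocks until every queued request has completed

Each callback fires as its response arrives, so you can push partial results to the page instead of waiting for the slowest provider.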

How to Get Results From a Background Process

I am designing a Ruby on Rails application that requests XML feeds, reads them in, and parses them into objects to be used in views. Since requesting an XML feed and receiving it can take several seconds from some sources, I need a way to offload these tasks from my front-line application tier. I do not want my application servers to take more than a few hundred milliseconds to process a request. Currently the application serving processes sit and wait for the XML feed data to be returned so they can parse it and finish returning the user's request. I am aware of DelayedJob; however, given that the result of this action is to be returned to the user in real time, I am unsure how to offload it to a background task and receive the result.
If I offload this task to a background task how does the result get returned to the user loading the page?
One common model for this sort of thing is to use your preferred background job library (you mention DelayedJob, which seems to be a popular one) to offload the task from the request/response cycle, and then set up AJAX polling on the client to update the page with the results once they become available.
You can have your main returned page fire an AJAX request at a second tier of servers that handle the XML retrieval, and return HTML for the section of the page that will contain that information. That way you aren't running any asynchronous jobs (from the server's point of view) and the retrieval won't start until the AJAX request comes in, which will reduce the bandwidth you waste on bots.
This is a standard use of AJAX, so I'm not sure whether I'm missing something in your problem that makes it inappropriate for you.
The most common approach is to use AJAX and DelayedJob here, but it is only a usability improvement - instead of the user waiting 5 seconds for the page to load, they get an empty or half-empty page with a spinner for 5 seconds. The only way (in my opinion) to really improve the user experience is to fetch and process those XML feeds periodically and display the cached result to the user.
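A minimal sketch of that periodic-caching idea: a scheduled task (cron, the whenever gem, Heroku Scheduler, etc.) refreshes the cache outside the request cycle, and the controller only reads what has already been parsed. FeedSource, fetch_xml and parse_feed are illustrative stand-ins for your own feed list and parsing code:

class RefreshFeedsJob
  def self.perform
    parsed = FeedSource.all.map { |source| parse_feed(fetch_xml(source.url)) }
    Rails.cache.write('parsed_feeds', parsed, expires_in: 10.minutes)
  end
end

# In the controller: no external request happens during the user's request.
def index
  @feeds = Rails.cache.read('parsed_feeds') || []
end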
If you are open to Perl code running on your server, I'd lift a piece of LiveJournal infrastructure: Gearman and TheSchwartz
Sounds like you want Gearman - and it has Ruby client bindings.
(see http://www.livejournal.com/doc/server/lj.install.workers_setup_install.html)
