Heroku Request Timeout (H12) with Big Data - ruby-on-rails

I have a Ruby on Rails application which gets lots of data from social media sites like Twitter, Facebook etc.
There is an index page that shows records paginated. I am using Kaminari for paging.
My issue is big data, I guess. Let's say I have millions of records and want to show them on my index page with Kaminari. When I open the page in a browser, Heroku gives me an H12 error (request timeout).
What can I do to improve my app's performance? My idea is to fetch only the records that will be shown on the index page; likewise, when the second Kaminari page link is clicked, to fetch only the second page of records from the database. That is the basic idea, but I don't know where to start or how to implement it.
Here is an example piece of code from my controller:
@ca_responses = @ca_responses_for_adaptors.where(:ca_request_id => @conditions)
.order(sort_column + " " + sort_direction)
.page(params[:page]).per(5)
@ca_responses: my records
@ca_responses_for_adaptors: records based on adaptor. Think of an admin: this returns all of the records.
@conditions: gets the specified adaptor's records, for example only Twitter-related records.

You could start by creating a page cache table which will be filled with the data for your search results. That could be one approach.
There could be a few downsides, but if I knew the exact problem I could propose a better solution. I doubt that you will be listing a million users on one page and then accessing them by paginating through the pages (?), or am I mistaken?
EDIT:
There could be a few problems with pagination. The first is how paginating gems work when you hand them an in-memory collection: they fetch all the data, and when you click on a page number they only slice out the next 5 elements (or however many you have set) from the whole list. The problem here is fetching all the data before paginating; if you have a million records, that could take a while for every page. You could define a new method that runs a SQL query selecting a fixed amount of data from the database, and set an OFFSET instruction to fetch the data only for that page. In this case the pagination gem is useless, so you would need to remove it. (Note that Kaminari called on an ActiveRecord relation, as in your snippet, already issues a single LIMIT/OFFSET query rather than loading everything.)
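The manual LIMIT/OFFSET idea can be sketched like this. This is a minimal sketch: `PER_PAGE` and `page_window` are illustrative helper names, and the commented controller lines reuse the relation and column names from the question.

```ruby
# Hypothetical helper: compute LIMIT/OFFSET for a given page number.
PER_PAGE = 5

def page_window(page, per_page = PER_PAGE)
  page = [page.to_i, 1].max  # treat nil/0/garbage as page 1
  { limit: per_page, offset: (page - 1) * per_page }
end

# In the controller this could look like (ActiveRecord builds a single
# LIMIT/OFFSET query, so only one page of rows is ever loaded):
#
# window = page_window(params[:page])
# @ca_responses = @ca_responses_for_adaptors
#                   .where(:ca_request_id => @conditions)
#                   .order(sort_column + " " + sort_direction)
#                   .limit(window[:limit])
#                   .offset(window[:offset])
```

For example, `page_window(3)` yields a limit of 5 and an offset of 10, i.e. rows 11 through 15.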
The second option is something like a user cache: create a new table that holds just a few records, the ones that will be displayed on the screen. That table will be smaller than the usual user table, so it would be faster to search through it.
There could be other, more advanced solutions, but I doubt you could (or would want to) use them in your application.

Kaminari already paginates your records as expected.
Heroku is prone to random timeout errors due to its random routing.
Try to reproduce the problem locally. You may have bottlenecks in your code that indeed make your request take too long to return. Requesting 5 items from the database should not be a problem, so you may have code before or after that query that takes a long time to run.
If everything is ok on local with production data, you may add new_relic to analyze your requests and see whether some problem occurs specifically on production (and why).
If it appears the Heroku router is indeed the problem, you can still try Unicorn as a web server, but you have to take special care that your app does not consume too much memory: each Unicorn worker consumes the RAM of a whole app instance, and you may hit Heroku's memory limits, which would produce R14 errors in place of those H12s.
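If you do try Unicorn on Heroku, a minimal config might look like the sketch below. This is an assumption-laden starting point, not a tuned config: `WEB_CONCURRENCY` is a conventional environment variable for the worker count, and the fork hooks assume a standard Rails app with ActiveRecord.

```ruby
# config/unicorn.rb -- minimal sketch; tune worker_processes to your dyno's RAM
worker_processes Integer(ENV["WEB_CONCURRENCY"] || 2)
timeout 25       # fail fast, just under Heroku's 30s router timeout
preload_app true # load the app once, then fork workers from it

before_fork do |server, worker|
  # Forked children must not share the parent's DB connection
  defined?(ActiveRecord::Base) && ActiveRecord::Base.connection.disconnect!
end

after_fork do |server, worker|
  defined?(ActiveRecord::Base) && ActiveRecord::Base.establish_connection
end
```

The `timeout 25` line matters here: it lets Unicorn kill a stuck worker before the router's H12 fires, so you get a usable error instead of a dropped connection.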

Related

Best Practice for storing Query Results for page refreshes

I have a search control that breaks results into multiple pages. Currently it operates by executing the query then storing the results in a session variable. When a second, third, x page of results is requested it pulls the data from session to display the required elements.
However once the page is navigated away from the session data remains unnecessarily. It seems like I'd want to remove it once I no longer need it. On the other hand, if the user navigates back to the search results, keeping it in session data means I don't need to re-run the query... though it seems possibly dangerous.
I'm wondering if there's a standard practice that's worth following in this scenario that someone could link me to or suggest to improve the design that's in place.
Thanks.

How should I indicate a long request to the user in Rails?

I have a Rails application with a huge database (hundreds of gigabytes) and a lot of different options for what to do with the data. In some cases, like changing data, this can be done in a background task, which I do with Sidekiq. But in other cases, like viewing data with a lot of rows and columns or running complex SQL queries, getting the data takes quite long.
What I want to do, is show the user that something is happening when he clicks a link. Like the progress bar many browsers have, but more obvious, so even users not used to working with browsers should see that something is happening and loading.
The question is how to do this. I have already tried different options with AJAX and jQuery, but most of the time I could only do this for certain actions, and I basically want it for the whole application. Every time the user sends a request to load a new page, I want to immediately show him that something is happening.
The closest I came was a JavaScript handler that was always triggered. The problem was that it fired on literally every event and forced a page reload: when I toggled an element, it showed the progress bar and then reloaded the page.
In essence, what I'm looking for:
My application runs on Ruby on Rails 4, and every time a new page is loaded I want to show the user that his request is being processed, so that even if the request takes a couple of seconds, the user won't get nervous, because he knows that something is happening.
I would really appreciate any help for finding a solution, because I can't seem to find any...
You should use a JavaScript animation to show the user that something is happening.
For example : http://ricostacruz.com/nprogress/

Accessing huge volumes of data from Facebook

So I am working on a Rails application, and the person I am designing it for has what seem like extremely hefty data volume requirements. They want to gather ALL posts by a user that logs into the application, and all of the posts for each of their friends for the past year.
Before this particular level of detail was communicated to me, I built the thing using the fb_graph gem and paginated through posts. I am running into the fact that, first, it takes a very long time to do this, even when I change the number of posts requested per page. Second, I frequently run into OAuth error #613: more than 600 requests per 600 seconds. After increasing each request to 200 posts I hit this limit less often, but it still takes an incredibly long time to get all of this data.
I am not particularly familiar with the FQL alternative, but it seems to me that we are going to have to either prioritize speed or volume of data. Is there a way that I am missing that would allow me to quickly retrieve this level of information?
Edit: I do save all posts to the database as I retrieve them. What is required is to make one pass through and grab all of the posts for the past year, for the user and friends. This process takes a long time and I am basically wondering if there is any way that it can be sped up.
One thing that I'd like to point out here:
You should implement some kind of local caching for users' posts. I mean, instead of querying FB for the posts every time, save the posts in your local database and only check for new posts whenever needed.
This is faster and saves you many API requests.
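The "only check for new posts" part might be sketched like this. The filtering helper is pure Ruby so it is self-contained; the commented lines show where it would sit in the app, and `Post`, `posted_at`, and the fb_graph call shown there are assumed names, not confirmed API.

```ruby
# Given posts fetched from the API and the timestamp of the newest post we
# already have cached locally, keep only what is actually new.
def new_posts(fetched, newest_cached_at)
  return fetched if newest_cached_at.nil?  # empty cache: everything is new
  fetched.select { |p| p[:posted_at] > newest_cached_at }
end

# Hypothetical usage in the Rails app:
#
# newest = Post.where(user_id: user.id).maximum(:posted_at)
# fetched = user.feed(since: newest&.to_i)   # fb_graph-style call, illustrative
# Post.create!(new_posts(fetched, newest))
```

After the first full import, each sync only pulls posts newer than the cache watermark, which is what keeps you under the 600-requests-per-600-seconds limit.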

RnR: Long running query, Heroku timeouts: webhooks? Ajax?

So here's the issue: we have data that the users want displayed. We optimized and indexed the query to be as fast as I think it's going to get. We might shave off a second or two, but with this amount of data there is not much more we can do.
Anyway, the query runs great when limited to a day or two of data; however, the users are running it for a week or two of data. The query for two or three weeks of data takes about 40 seconds, and with a Heroku timeout of 30 seconds, that doesn't work. So we need a solution.
Searching here and on Google, I see comments that webhooks or AJAX would work as our solution. However, I've been unable to find a real concrete example. I also saw a comment where someone said we could send some kind of response that would "reset the clock," which sounded intriguing, but again I couldn't find an example.
We're kind of under the gun, the users are unhappy, so we need a solution that is fast and simple. Thanks in advance for your help!
I faced a similar problem. I have what is essentially a 'bulk download' page in my Sinatra app, which my client app calls to import data into a local webSQL db in the browser.
The download page (let's call it "GET /download") queries a CouchDB database to get all of the documents in a given view, and it's this part of the process (much like your query) that takes a long time.
I was able to work around the issue by using Sinatra's streaming API. As mentioned above, you can 'reset' the clock in Heroku by sending at least one byte in the response before the 30s time is up. This resets the clock for another 55s (and each additional byte you send keeps resetting the clock again), so that while you're still sending data, the connection can be kept open indefinitely.
In Sinatra, I was able to replace this:
get '/download' do
  body = db.view 'data/all'
end
..with something like this:
get '/download' do
  # stream is a Sinatra helper that effectively does 'chunked transfer encoding'
  stream do |out|
    # Long query starts here
    db.view 'data/all' do |row|
      # As each row comes back from the db, stream it to the front-end
      out << row
    end
  end
end
This works well for me, because the 'time to first byte' (ie. the time taken for the db query to return the first row) is well under the 30s limit.
The only downside is that previously I was getting all of the results back from the database into my Sinatra app, then calculating an MD5 sum of the entire result to use as the ETag header. My client app could use this to do a conditional HTTP GET (ie. if it tried to download again and no data had been modified, I could just send a 304 Not Modified); plus it could be compared against its own checksum of the data it received (to ensure it hadn't been corrupted/modified in transit).
Once you start streaming data in the response, you lose the ability to calculate an MD5 sum of the content to send as an HTTP header (because you don't have all of the data yet, and you can't send headers after you've started sending body content).
I'm thinking about changing this to some sort of paginated, multiple-AJAX-call solution, as Zenph suggested above; or using some sort of worker process (eg. Resque, DelayedJob) to offload the query to, but then I'm not sure how I'd notify the client when the data is ready to be fetched.
Hope this helps.
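On the "how would I notify the client" question, one common pattern is polling: the worker records its status under a known key, and the client polls a cheap endpoint until the result is ready. A minimal sketch, with an in-memory `STATUS` hash standing in for Redis or a database table, and `start_export` standing in for the body of a Resque/DelayedJob worker:

```ruby
# In-memory stand-in for a shared status store (Redis, a jobs table, etc.)
STATUS = {}

# Stand-in for the background worker: record progress, do the slow query,
# then record where the finished result can be fetched from.
def start_export(job_id)
  STATUS[job_id] = { state: "working" }
  # ... long-running query would happen here, inside the worker process ...
  STATUS[job_id] = { state: "done", result_url: "/downloads/#{job_id}" }
end

# Cheap endpoint the client polls every few seconds (GET /status/:job_id)
def poll(job_id)
  STATUS.fetch(job_id, { state: "unknown" })
end
```

The polling request returns in milliseconds, so it never comes near the 30s router limit; only the worker process ever runs the 40-second query.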

Searching for a song while using multiple API's

I'm going to attempt to create an open project which compares the most common MP3 download providers.
This will require a user to enter a track/album/artist name, e.g. Deadmau5, and it will then pull the relevant prices from the APIs.
I have a few questions that some of you may have encountered before:
Should I have one server-side page that requests all the data, so it is all loaded simultaneously? If so, how would you deal with timeouts or any other problems that may arise? Or should the page load first, and then each price get pulled in one by one (AJAX)? What are your experiences with running a comparison check?
The main feature will be to compare prices, but how can I be sure that the products are the same? I was thinking of running time and track numbers, but I would still have to set one source as my primary.
I'm making this a wiki, please add and edit any issues that you can think of.
Thanks for your help. Look out for a future blog!
I would check Amazon first. They will give you a SKU (the barcode on the back of the album; I think Amazon calls it an EAN). If the other providers use this, you can make sure they are looking at the right item.
I would cache all results in a database and expire them after a reasonable time. That way, when you get 100 requests for Britney Spears, you don't have to hammer the other sites and slow down your application.
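The cache-with-expiry idea can be sketched like this. It is an in-process stand-in for something like `Rails.cache.fetch(key, expires_in: 1.hour) { ... }`; the class and names are illustrative.

```ruby
# A tiny TTL cache: return the stored value while it is fresh, otherwise run
# the block, store its result with an expiry time, and return it.
class SearchCache
  Entry = Struct.new(:value, :expires_at)

  def initialize(ttl_seconds)
    @ttl = ttl_seconds
    @store = {}
  end

  def fetch(key)
    entry = @store[key]
    return entry.value if entry && Time.now < entry.expires_at
    value = yield                                  # cache miss: hit providers
    @store[key] = Entry.new(value, Time.now + @ttl)
    value
  end
end
```

With this in front of the provider lookups, those 100 "Britney Spears" requests within the TTL trigger a single upstream query; the other 99 are served locally.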
You should also make sure you are multithreading whatever requests you are doing server-side. curl, for instance, allows you to pull multiple URLs and assign a user-defined callback. I'd have the callback send some data so you can update your page as the results come back: e.g. for a GETTUNES request, the curl callback returns some data for each URL while the connection is open, and you parse it on the client side.
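The parallel-fetch idea might be sketched like this in Ruby: one thread per URL, gathering results as each finishes. The fetcher block is passed in so the sketch stays self-contained; in a real app it would do the actual HTTP request (e.g. with Net::HTTP, or a curl-multi wrapper like Typhoeus).

```ruby
# Fire off one request per URL concurrently and collect url => result pairs.
# A provider that raises is recorded by its error class instead of crashing
# the whole comparison.
def fetch_all(urls, &fetcher)
  urls.map { |url|
    Thread.new do
      [url, fetcher.call(url)]
    rescue StandardError => e
      [url, e.class.name]
    end
  }.map(&:value).to_h
end

# Hypothetical usage with real HTTP:
#   fetch_all(provider_urls) { |url| Net::HTTP.get_response(URI(url)).body }
```

Total wall-clock time is then roughly the slowest provider's response time, not the sum of all of them, and one misbehaving provider can't take the page down.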
