Accessing huge volumes of data from Facebook - ruby-on-rails

So I am working on a Rails application, and the person I am designing it for has what seem like extremely hefty data volume requirements. They want to gather ALL posts by a user that logs into the application, and all of the posts for each of their friends for the past year.
Before this level of detail was communicated to me, I built the thing using the fb_graph gem, paginating through posts. I am running into two problems. First, it takes a very long time, even when I change the number of posts requested per page. Second, I frequently hit OAuth error #613 (more than 600 requests per 600 seconds). After increasing each request to 200 posts I hit this limit less often, but it still takes an incredibly long time to get all of this data.
I am not particularly familiar with the FQL alternative, but it seems to me that we are going to have to either prioritize speed or volume of data. Is there a way that I am missing that would allow me to quickly retrieve this level of information?
Edit: I do save all posts to the database as I retrieve them. What is required is to make one pass through and grab all of the posts for the past year, for the user and friends. This process takes a long time and I am basically wondering if there is any way that it can be sped up.

One thing that I'd like to point out here:
You should implement some kind of local caching for users' posts. That is, instead of querying FB for the posts each time, save the posts in your local database and only check for new posts (whenever needed).
This is faster and saves you many API requests.
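A minimal sketch of the "only check for new posts" idea, in plain Ruby. Everything here is hypothetical: PostCache stands in for a posts table, and the fetcher is any callable that takes a "since" timestamp, so no real Graph API call is shown.

```ruby
# Hypothetical in-memory stand-in for a posts table keyed by post id.
class PostCache
  def initialize
    @posts = {} # post id => post hash
  end

  # Timestamp of the newest post already stored, or nil if the cache is empty.
  def newest_timestamp
    @posts.values.map { |post| post[:created_time] }.max
  end

  # Ask the remote side only for posts newer than what we already hold.
  # `fetcher` is any callable taking a Time (or nil) and returning post hashes.
  # Returns the number of posts the fetcher handed back.
  def refresh(fetcher)
    fresh = fetcher.call(newest_timestamp)
    fresh.each { |post| @posts[post[:id]] = post }
    fresh.size
  end

  def size
    @posts.size
  end
end
```

The first refresh still has to pull the whole year, so this does not shorten the initial pass; it only keeps later passes small.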


What's the most efficient way to create an alert queue for a model with hundreds of millions of entries?

I am currently working on an application in Rails (though language/framework shouldn't matter for this question since it is more of a theoretical one). I'm working on wrapping my head around this problem:
Say I am tracking millions of blogs online and am plugged into their RSS feeds. My app pings these feeds every few minutes to see if there has been any new activity across any of these millions of blogs. If there is new activity, I want to alert the users of my application who have signed up to receive alerts for those specific blogs.
Does it make sense to have a user_blog_alerts table (where a user can specify custom keywords to be alerted about) and continuously check this table against every new entry that comes in from my feed? And when there is a match, to add them to a queue (using Redis)?
What is the best, most efficient way to build and model this alerting system? Am I even thinking about this in the right way? Are there any good examples or tutorials on this when working with such large amounts of data?
I'm not sure what the right way to do this is, but the thought of continuously scanning a table over and over sounds exhausting (ie. unscalable).
Off the top of my head: what if you created a LIST for every blog in Redis? The values would be the user IDs of those who want an alert, and the key name would contain the blog ID (e.g., "user_blog_alerts:12345").
Then, when you get a new post for blog 12345, it's a simple lookup to see if that key exists. If it does, fire off alerts for each user in the list.
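Roughly, the lookup could look like the sketch below. FakeRedis is a made-up in-memory stand-in so the example runs without a server; with a real Redis client the three methods would correspond to the RPUSH, EXISTS and LRANGE commands. The helper names are assumptions as well.

```ruby
# In-memory stand-in for the few Redis list commands the idea needs.
class FakeRedis
  def initialize
    @lists = Hash.new { |hash, key| hash[key] = [] }
  end

  # Append a value to the list at key (like RPUSH); returns the new length.
  def rpush(key, value)
    @lists[key] << value
    @lists[key].size
  end

  # True if the key has been created (like EXISTS).
  def exists?(key)
    @lists.key?(key)
  end

  # Whole list at key (like LRANGE key 0 -1).
  def lrange_all(key)
    @lists.key?(key) ? @lists[key].dup : []
  end
end

# Key naming follows the "user_blog_alerts:12345" pattern described above.
def alert_key(blog_id)
  "user_blog_alerts:#{blog_id}"
end

def subscribe(store, blog_id, user_id)
  store.rpush(alert_key(blog_id), user_id)
end

# User ids to notify when a new post arrives for blog_id.
def subscribers_for(store, blog_id)
  key = alert_key(blog_id)
  store.exists?(key) ? store.lrange_all(key) : []
end
```

A list will happily hold the same user ID twice, so if users can subscribe more than once a Redis SET may be the safer structure.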

client-server fetch large data

I have an application that needs a list of data, but the data may be very large. If I'm going to show this list in the client (a mobile app), I can't fetch all of it from the server because of the limited space on mobile devices.
For example, in the Facebook app there are tons of newsfeed items on the server, and the user only sees some of them. If the user wants to see more, they need to scroll down and refresh. How do I implement something like this on both the client and the server? (Currently my server is written in Ruby on Rails, and the client is iOS.)
And once the client gets the data, should it be stored in memory or in a local database? I'm worried about the memory limits of mobile phones.
On the server side, you could write an API that supports pagination and a custom result count: for example, return the first 20 results, and when the user scrolls all the way down your view on the iPhone, fetch the next items.
If you plan your design well, you'll get something very flexible that you'll be able to change later if you realize that 20 results is too much/not enough, etc.
Depending on your app's architecture and the amount of data your app will handle, you might also need to provide API methods based on the last-updated time, to ensure you're not missing data (e.g., if you call your second get?start=20 a few minutes after the first one, the start index might not have the same meaning).
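To make the difference concrete, here is a small plain-Ruby sketch of both styles over an in-memory array. The method names and the :updated_at field are assumptions for illustration; a real server would run the equivalent queries against its database.

```ruby
# Offset-based page: `start` is the index of the first item, `count` the page size.
def page_by_offset(items, start:, count: 20)
  items[start, count] || []
end

# Timestamp-based page: the `count` newest items strictly older than `before`.
# New items arriving at the top of the feed do not shift this window.
def page_before(items, before:, count: 20)
  items.select { |item| item[:updated_at] < before }
       .sort_by { |item| item[:updated_at] }
       .reverse
       .first(count)
end
```

With the second form the client sends the timestamp of the oldest item it already shows, rather than an index that may have moved since the first call.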
As for storing data locally, it all depends on what you want to achieve. Are you sure you need to save everything the user has downloaded? You could store only the most recently fetched items in a local SQLite database and query them the next time your app starts up, before refreshing the view (I don't know how it is implemented in Facebook's iPhone app but at least it looks like it's done that way).

Heroku Request Timeout (H12) with Big Data

I have a Ruby on Rails application which gets lots of data from social media sites like Twitter, Facebook etc.
There is an index page that shows records as paged. I am using Kaminari for paging.
My issue is big data, I guess. Let's say I have millions of records and want to show them on my index page with Kaminari. When I try to load the page in the browser, Heroku gives me an H12 error (request timeout).
What can I do to improve my app's performance? My idea is to fetch only the records that will be shown on the index page. Likewise, when the link to the second Kaminari page is clicked, only the second page's records would be fetched from the database. That is the basic idea, but I don't know where to start or how to implement it.
Here is an example piece of code from my controller:
@ca_responses = @ca_responses_for_adaptors.where(:ca_request_id => @conditions)
                                          .order(sort_column + " " + sort_direction)
@ca_responses: My records.
@ca_responses_for_adaptors: Records based on adaptor. For an admin, this returns all of the records.
@conditions: Selects the records for a specified adaptor, for example only Twitter-related records.
You could start by creating a page cache table that is filled with the data for your search results. That could be one approach.
There could be a few downsides, but if I knew the exact problem I could propose a better solution. I doubt that you will be listing a million users on one page and then accessing them by paging through, or am I mistaken?
There could be a few problems with pagination. The first is that pagination gems work like this: they fetch all the data, and then when you click on a page number they return only the next 5 elements (or however many you have set) from the whole list. The problem here is fetching all the data before paginating. If you have a million records, this could take a while for every page. You could define a new method that runs an SQL query to select a limited amount of data from the database, with an OFFSET clause so that only the data for that page is fetched. In that case the pagination gem is useless, so you would need to remove it.
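As a rough illustration of the LIMIT/OFFSET arithmetic, here is a plain-Ruby sketch. The helper names are invented, and the table and column names passed in are placeholders; nothing here touches a database.

```ruby
# Translate a 1-based page number into LIMIT/OFFSET values.
# Page numbers below 1 are treated as page 1.
def page_window(page, per_page)
  page = [page.to_i, 1].max
  { limit: per_page, offset: (page - 1) * per_page }
end

# Build the SQL for one page. Interpolating like this is only acceptable
# for trusted, hard-coded identifiers, never for raw user input.
def page_sql(table, order_column, page, per_page)
  window = page_window(page, per_page)
  "SELECT * FROM #{table} ORDER BY #{order_column} " \
    "LIMIT #{window[:limit]} OFFSET #{window[:offset]}"
end
```

In ActiveRecord the same window is normally expressed by chaining limit and offset onto the relation, so the hand-built string is only there to show what ends up in the query.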
The second option is to use something like a user_cache table. By this I mean creating a new table that holds just a few records: the ones that will be displayed on the screen. That table will be smaller than the usual user table, so it will be faster to search through.
There could be other, more advanced solutions, but I doubt you could (or would want to) use them in your application.
Kaminari already paginates your records as expected.
Heroku is prone to random timeout errors due to its random router.
Try to reproduce the problem locally. You may have bottlenecks in your code that make the request take too long to return. You should not have any problem requesting 5 items from the database, so there may be code before or after the query that takes a long time to run.
If everything is OK locally with production data, you could add New Relic to analyze your requests and see whether some problem occurs specifically in production (and why).
If it appears that the Heroku router is indeed the problem, you can still try using Unicorn as the web server, but take special care that your app does not consume too much memory (each Unicorn worker consumes the RAM of a whole app, and you may hit Heroku's memory limits, which would produce R14 errors in place of the H12s).

Searching for a song while using multiple API's

I'm going to attempt to create an open project which compares the most common MP3 download providers.
This will require a user to enter a track/album/artist name (e.g. Deadmau5); the app will then pull the relevant prices from the APIs.
I have a few questions that some of you may have encountered before:
Should I have one server-side page that requests all the data so that it is all loaded simultaneously? If so, how would you deal with timeouts or any other problems that may arise? Or should the page load first, and then each price get pulled in one by one (AJAX)? What are your experiences when running a comparison check?
The main feature will be to compare prices, but how can I be sure that the products are the same? I was thinking of running time and track numbers, but I would still have to set one source as my primary.
I'm making this a wiki, please add and edit any issues that you can think of.
Thanks for your help. Look out for a future blog!
I would check Amazon first. They will give you a SKU (the barcode on the back of the album; I think Amazon calls it an EAN). If the other providers use this too, you can make sure they are looking at the same item.
I would cache all results in a database and expire them after a reasonable time. That way, when you get 100 requests for Britney Spears, you don't have to hammer the other sites and slow down your application.
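A minimal sketch of the expire-after-a-while idea, kept in memory rather than in a database so it runs on its own. ExpiringCache and its injectable clock are assumptions made for the example.

```ruby
# Tiny time-to-live cache. `clock` is injectable so expiry can be exercised
# without sleeping; by default it uses the wall clock.
class ExpiringCache
  def initialize(ttl_seconds, clock: -> { Time.now })
    @ttl = ttl_seconds
    @clock = clock
    @entries = {} # key => [value, stored_at]
  end

  # Return the cached value if it is still fresh; otherwise run the block
  # (the expensive provider lookup), store its result and return it.
  def fetch(key)
    value, stored_at = @entries[key]
    return value if stored_at && (@clock.call - stored_at) < @ttl

    fresh = yield
    @entries[key] = [fresh, @clock.call]
    fresh
  end
end
```

The block passed to fetch is where the slow call to the other sites would go, so a hundred searches for the same artist inside the TTL cost one remote lookup.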
You should also make sure you are multithreading whatever requests you make server-side. Curl, for instance, lets you pull multiple URLs and assign a user-defined callback. I'd have the callback send some data back so that you can update your page as the results come in. GETTUNES => the curl callback returns some data for each URL while the connection is open, and you parse it on the client side.
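The same shape can be sketched with plain Ruby threads instead of curl. This is an illustration only: fetch_all is a made-up helper, and the providers are callables standing in for real HTTP requests to each store.

```ruby
# Run every provider lookup in its own thread and hand each outcome to
# `on_result` as soon as it is ready. A failing provider is reported as an
# error entry instead of aborting the other lookups.
def fetch_all(providers, query, on_result)
  lock = Mutex.new
  threads = providers.map do |name, lookup|
    Thread.new do
      outcome = begin
        { provider: name, price: lookup.call(query) }
      rescue StandardError => e
        { provider: name, error: e.message }
      end
      # Serialize callback invocations so the callback need not be thread-safe.
      lock.synchronize { on_result.call(outcome) }
    end
  end
  threads.each(&:join)
end
```

Because each provider reports separately, one slow or broken store only delays its own row in the comparison.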

Rails - Tracking Referrals to Conversions

We just launched and are looking to better understand where the users who are converting to registered users are actually coming from. We can see our traffic sources and referrals via Google Analytics and our other web statistics programs, but in volume, it's difficult to tie these specifically to which users in our database have converted and from where.
We have several "goals" in Google Analytics setup to better help track conversions, but what are others doing to associate user signups with inbound traffic sources?
One thought we've been kicking around: capturing the referrer on the first page load and passing it along in the session to the registration form, where it is stored in the user record.
Any other solutions that are working successfully for you?
Indeed, I would suggest storing the referrer in the user record. Then you can write some code to sensibly draw out additional data from the URL. For instance, you could parse Google URL's to determine the keywords used to discover your site. And your code could detect things like referrals from ad runs, specific SEO campaigns you're running, or partner deals you have going.
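For instance, pulling the search keywords out of a stored referrer might look like this sketch. It assumes the search engine puts the query in a q parameter, which is not guaranteed; parse_referrer is a made-up helper name.

```ruby
require "uri"

# Extract the host and, when present, the `q` search parameter from a
# referrer URL. Returns nils if the URL cannot be parsed.
def parse_referrer(url)
  uri = URI.parse(url)
  params = URI.decode_www_form(uri.query.to_s).to_h
  { host: uri.host, keywords: params["q"] }
rescue URI::InvalidURIError, ArgumentError
  { host: nil, keywords: nil }
end
```

The same function is a natural place to recognize campaign parameters or partner domains once you know which ones matter to you.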
It would be beneficial to spend some time building an admin-only page to visualize these conversions to help you better learn what is working and what isn't. And when things are going well, such a page is encouraging for the whole team!
Capturing the referrer is a good start. You should store it in a persistent cookie instead of the session, so that if the user returns tomorrow the same referral information is still there.
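A rough sketch of the "first referrer wins" rule, using a plain Hash in place of the browser's cookie jar so it runs outside Rails. The :first_referrer name and the one-year lifetime are assumptions; in a Rails controller, cookies.permanent and request.referer would play these roles.

```ruby
ONE_YEAR = 365 * 24 * 60 * 60 # cookie lifetime in seconds (an arbitrary choice)

# Record the first referrer seen for a visitor and never overwrite it.
# `cookies` is any Hash-like store standing in for the real cookie jar.
# Returns the referrer on record, or nil if none has been captured yet.
def remember_first_referrer(cookies, referrer, now: Time.now)
  return cookies[:first_referrer][:value] if cookies.key?(:first_referrer)
  return nil if referrer.nil? || referrer.empty?

  cookies[:first_referrer] = { value: referrer, expires: now + ONE_YEAR }
  referrer
end
```

At signup, the value on record is what gets copied into the user (or tracking) row.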
I've created a gem to automate tracking and saving referral infos. See for more info.
Some notes on designing tracking (I've tried to cover these in the gem already):
It might be better to save tracking data in a separate table, so that deleting a user account doesn't delete the information about how that account was created. Then you can answer questions like "where do bogus user accounts come from?"
Save cookies to the DB as well. If you are using Google Analytics, you can parse Google's cookies to get additional information about the visitor, such as the number of visits or campaign information.
It's also good to save user agents etc., so that you can distinguish between mobile and desktop browsers.
In the end it's good to visualize the information and conversions. But in the beginning it's hard to know what data you want to visualize and how. So try to capture as much data as possible, and decide later how to crunch it with scripts.
