Incremental search over only your followers with the Twitter API - twitter

Is there a smart way, using the Twitter API, to quickly run an incremental search over only your own followers?
For example, suppose there is a user Alice who is followed by one million users. One day, Alice wants to send a DM to one of her followers, but she only remembers that his/her name starts with the letters 'Bo'. So she wants to filter her one million followers by the name prefix 'Bo'.
How can Alice use the Twitter API to get all of her followers whose names start with 'Bo'?
Method 1. Call followers/ids 200 times and filter: Since followers/ids returns at most 5,000 followers at a time, you would have to call it at least 200 times to get all one million follower IDs. Only after that could you filter the users by names starting with 'Bo'. This is extremely slow, because the filtering cannot start until all 200 requests have finished, and it scales poorly, because it takes a lot of memory to hold and filter the loaded follower list for each user.
Method 2. Call users/search repeatedly and check whether each result follows me: Unfortunately, users/search only exposes the first 1,000 matches, so a follower who ranks beyond the 1,000th match will never appear.
I want to know if there is a better way than these. The two methods above are just too stupid.
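For concreteness, Method 1 boils down to something like the sketch below, written against the Ruby twitter gem (cursor handling and method names vary across gem versions, so treat this as illustrative; on a real account with a million followers, the rate limit would stop the loop long before 200 pages):

ids = []
cursor = -1
until cursor.zero?                          # next_cursor of 0 means no more pages
  page = Twitter.follower_ids("alice", cursor: cursor)  # up to 5,000 IDs per call
  ids.concat(page.ids)
  cursor = page.next_cursor
end
# followers/ids returns only numeric IDs, so a second pass through users/lookup
# (100 profiles per call) is needed before the names can even be filtered:
bo_followers = ids.each_slice(100).flat_map do |batch|
  Twitter.users(*batch).select { |u| u.name.start_with?("Bo") }
end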
P.S.
Lady Gaga seems to be followed by 28 million users. https://twitter.com/ladygaga

Related

Timeline Reconstruction When a User Is Followed

This question is very similar to this one; however, that one has no answers. I posted this one with more clarity in hopes of receiving an answer.
According to this presentation, Twitter uses a fan-out method to push Tweets into each individual user's timeline in Redis. Naturally, this fan-out only takes place when a user you're following Tweets something.
Suppose a new user, who has never followed anyone before (and consequently has no Tweets in their timeline), decides to follow someone. Using just the method above, they would have to wait until the user they followed Tweeted something before anything showed up in their timeline. After some observation, this is not the case: Twitter pulls in the followed user's latest Tweets right away.
Now suppose that a new user follows 5 users: how does Twitter organize those users' Tweets and push them into the new user's timeline in Redis?
Suppose a user already follows 5 users and has a fair number of Tweets from them in their timeline. When they follow another 5 users, how are those users' individual Tweets pushed into the first user's timeline in Redis in the correct order? More importantly, how does Twitter calculate how many Tweets to bring in from each user (given that timelines are capped at 800 Tweets)?
Here is how I would try to implement this, if I understand your question correctly.
Store each tweet in a hash. The key of the hash could be something like: tweet:<tweetID>.
Store the IDs of a given user's tweets in a sorted set named user:<userID>:tweets. Use each tweet's unix timestamp as its score, so the tweets appear in the correct order. You can then get the 800 most recent tweet IDs for the user with the instruction ZREVRANGEBYSCORE:
ZREVRANGEBYSCORE user:<userID>:tweets +inf -inf LIMIT 0 800
When a user follows a new person, you copy the list of IDs returned by this instruction into the follower's timeline (either in the application code or using a Lua script). This timeline is again represented as a sorted set, with unix timestamps as scores. If you do the copy in the application code, which is perfectly acceptable with Redis, don't forget to use pipelining so that your multiple writes to the sorted set happen in a single network round trip. It will greatly improve performance.
To get the timeline content, use pipelining too. Request the tweet IDs using ZREVRANGEBYSCORE with a LIMIT option and/or a timestamp as the lower bound if you don't want tweets posted before a certain date.
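As a concrete illustration of the copy-on-follow step, here is a minimal Ruby sketch using the redis gem. The key names (user:<id>:tweets, user:<id>:timeline) follow the scheme above, and the 800-entry cap mirrors the limit mentioned in the question:

require "redis"

redis = Redis.new

# Backfill follower_id's timeline with the 800 most recent tweets of the user
# they just followed, then trim so the merged timeline stays capped at 800.
def backfill_timeline(redis, follower_id, followee_id)
  recent = redis.zrevrangebyscore("user:#{followee_id}:tweets",
                                  "+inf", "-inf",
                                  limit: [0, 800], with_scores: true)

  timeline = "user:#{follower_id}:timeline"
  redis.pipelined do |pipe|                 # one network round trip for all writes
    recent.each { |tweet_id, ts| pipe.zadd(timeline, ts, tweet_id) }
    pipe.zremrangebyrank(timeline, 0, -801) # keep only the 800 newest entries
  end
end

Because both sorted sets use unix timestamps as scores, tweets copied in from several newly followed users interleave into the correct chronological order automatically.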

Accessing huge volumes of data from Facebook

So I am working on a Rails application, and the person I am designing it for has what seem like extremely hefty data-volume requirements: they want to gather ALL posts by a user who logs into the application, plus all of the posts by each of that user's friends for the past year.
Before this particular level of detail was communicated to me, I built the thing using the fb_graph gem and paginated through posts. I am running into two problems: first, it takes a very long time, even when I change the number of posts requested per page. Second, I frequently hit OAuth error #613 (more than 600 requests per 600 seconds). After increasing each request to 200 posts I hit this limit less often, but it still takes an incredibly long time to fetch all of this data.
I am not particularly familiar with the FQL alternative, but it seems we are going to have to prioritize either speed or volume of data. Is there a way I am missing that would let me retrieve this much information quickly?
Edit: I do save all posts to the database as I retrieve them. What is required is to make one pass through and grab all of the posts from the past year for the user and their friends. This process takes a long time, and I am basically wondering whether there is any way to speed it up.
One thing that I'd like to point out here:
You should implement some kind of local caching for users' posts. That is, instead of querying FB for the posts every time, save the posts in your local database and only check for new posts (whenever needed).
This is faster and saves you many API requests.
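A minimal sketch of that idea in a Rails app. Post is a hypothetical local model, and fetch_posts_since is a hypothetical wrapper around your Graph API client (fb_graph or otherwise) that passes Facebook's since parameter so only unseen posts come back:

def refresh_posts_for(user)
  # Newest post already cached locally; nil on the very first run.
  newest = Post.where(fb_user_id: user.fb_id).maximum(:posted_at)

  # Hypothetical helper: pages through the user's feed, yielding only posts
  # newer than `newest`, so repeat runs cost a handful of API calls.
  fetch_posts_since(user, newest) do |post|
    Post.create!(fb_user_id: user.fb_id,
                 fb_post_id: post["id"],
                 message:    post["message"],
                 posted_at:  Time.parse(post["created_time"]))
  end
end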

Can I use the Twitter gem to pull Twitter .user data from multiple Twitter names at the same time?

Ok. I've got a Ruby on Rails application and have successfully gotten authentication working (thanks to other responses here). I can tweet, read from the timeline, pull the followers of the authenticated user, etc.
However, what I'm looking to do now is pull the authenticated user's follower list, determine which (if any) of those accounts are verified, and then display the verified accounts. I can pull each follower's nickname; however, it seems I have to make each call individually (.user(ID#1), then .user(ID#2), and so on), which results in me hitting the rate-limit cap.
Therefore, what I'm looking for is a way to pass multiple IDs and get back the user information for each ID in one call (or at least fewer than one API call per ID). I feel like this has to be possible.
Yup, just do:
Twitter.users(id1, id2, id3)
Docs: http://rdoc.info/github/jnunemaker/twitter/Twitter/Client#users-instance_method
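Applied to the verified-followers problem above, a hedged sketch: assuming follower_ids is an array of IDs you've already fetched via followers/ids, batch them through users/lookup (at most 100 IDs per call) and keep the verified ones. The verified attribute is assumed from the API's user object; check your gem version's accessor:

verified = follower_ids.each_slice(100).flat_map do |batch|
  Twitter.users(*batch).select { |u| u.verified }
end

verified.each { |u| puts u.screen_name }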

If I call Twitter API to get all of my followers, how many calls to the API is that?

If I want to download a list of all of my followers by calling the Twitter API, how many calls is that? Is it one call, or is it the number of followers I have?
Thanks!
Sriram
If you just need the IDs of your followers, you can specify:
http://api.twitter.com/1/followers/ids.json?screen_name=yourScreenName&cursor=-1
The documentation for this call is here. This call will return up to 5,000 follower IDs per call, and you'll have to keep track of the cursor value on each call. If you have fewer than 5,000 followers, you can omit the cursor parameter.
If, however, you need to get the full details for all your followers, you will need to make some additional API calls.
I recommend using statuses/followers to fetch the follower profiles since you can request up to 100 profiles per API call.
When using statuses/followers, you just specify whose followers you wish to fetch. The results are returned in the order in which the followers followed the specified user. This method does not require authentication; however, it does use a cursor, so you'll need to manage the cursor ID on each call. Here's an example:
http://api.twitter.com/1/statuses/followers.json?screen_name=yourScreenName&cursor=-1
Alternatively, you can use users/lookup to fetch the follower profiles by specifying a comma-separated list of user IDs. You must authenticate in order to make this request, but you can fetch any user profiles you want -- not just those following the specified user. An example call would be:
http://api.twitter.com/1/users/lookup.json?user_id=123123,5235235,456243,4534563
So, if you had 2,000 followers, you would use just one call to obtain all of your follower IDs via followers/ids, if that was all you needed. If you needed the full profiles, you would burn 20 calls using statuses/followers, and you would use 21 calls when alternatively using users/lookup due to the additional call to followers/ids necessary to fetch the IDs.
Note that for all Twitter API calls, I recommend using JSON, since it is a much more lightweight document format than XML. You will typically transfer only about 1/3 to 1/2 as much data over the wire, and I find that (in my experience) Twitter times out less often when serving JSON.
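Putting the cursor handling together, here is a sketch in Ruby against the v1 endpoint shown above (the unauthenticated v1 API has since been retired, so this illustrates the paging pattern rather than a drop-in client):

require "net/http"
require "json"

def all_follower_ids(screen_name)
  ids    = []
  cursor = -1
  until cursor.zero?                       # next_cursor of 0 means no more pages
    uri  = URI("http://api.twitter.com/1/followers/ids.json" \
               "?screen_name=#{screen_name}&cursor=#{cursor}")
    page = JSON.parse(Net::HTTP.get(uri))
    ids.concat(page["ids"])                # up to 5,000 IDs per call
    cursor = page["next_cursor"]
  end
  ids
end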
http://dev.twitter.com/doc/get/followers/ids
Reading this, it looks like it should be only 1 call, since you're just pulling back an XML or JSON page -- unless you have more than 5,000 followers, in which case you would have to make a call for each page of the paginated values.

How do social networking websites compute friend updates?

Social networking websites probably maintain tables for users, friends, and events...
How do they use these tables to compute friends' events in an efficient and scalable manner?
Many of the social networking sites like Twitter don't use an RDBMS at all, but rather a message-queue application. A lot of them start out with an existing application like RabbitMQ. Some of them get big enough that they have to heavily customize it or build their own; Twitter is in the process of doing this for the second time.
A message-queue application works by holding messages from one service for one or more other services. For instance, say service Frank is publishing messages to a queue foo, and Joe and Jill are subscribed to Frank's foo queue. The application keeps track of whether Joe and Jill have received each message, and once every subscriber to the queue has received a message, it discards it. Frank fires messages and forgets about them. Joe and Jill ask for messages from foo, get whatever messages they haven't gotten yet, and do whatever they need to do with each message -- perhaps keeping it around, perhaps not.
The message-queue application guarantees that everyone who is supposed to get a message can and will get it when they request it. The publisher can send messages confident that subscribers will eventually get them. This has the benefit of being completely asynchronous and not requiring costly joins.
EDIT: I should also mention that storage for this kind of thing at high scale is usually heavily denormalized, so Joe and Jill may each store a copy of the exact same message. This is considered acceptable because it helps the application scale to billions of users.
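To make the Frank/Joe/Jill example concrete, here is an illustrative Ruby sketch using the bunny gem against RabbitMQ (one of the off-the-shelf queue applications mentioned above). A fanout exchange gives every bound queue its own copy of each message, and the broker tracks what each subscriber has and hasn't received:

require "bunny"

conn = Bunny.new
conn.start

channel = conn.create_channel
foo     = channel.fanout("frank.foo")      # Frank's exchange

%w[joe jill].each do |name|
  channel.queue("#{name}.inbox", durable: true)
         .bind(foo)
         .subscribe do |_delivery, _props, body|
           puts "#{name} got: #{body}"     # each subscriber gets its own copy
         end
end

foo.publish("Frank posted an update")      # fire and forget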
Other reading:
http://www.rabbitmq.com/
http://qpid.apache.org/
The mainstay data structure of social networking sites is the graph. On Facebook the graph is undirected (when you're someone's friend, they're your friend). On Twitter the graph is directed (you follow someone, but they don't necessarily follow you).
The two popular ways to represent graphs are adjacency lists and adjacency matrices.
An adjacency list is simply a list of the edges in the graph. Consider users identified by integer user IDs:
User1  User2
1      2
1      3
2      3
The undirected interpretation of these records is that user 1 is friends with users 2 and 3 and user 2 is also friends with user 3.
Representing this in a database table is trivial: it is the familiar many-to-many join table, and SQL queries to find a particular user's friends are quite easy to write.
Now that you know a particular user's friends, you just need to join those results to the updates table, which contains all of the users' updates indexed by user ID.
As long as all these tables are properly indexed, you'd have a pretty easy time designing efficient queries to answer the questions you're interested in.
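In Rails terms, a hedged sketch of that schema and the feed query (model and column names are illustrative, not prescribed):

class User < ActiveRecord::Base
  has_many :friendships
  has_many :friends, through: :friendships  # the adjacency list above
  has_many :updates
end

class Friendship < ActiveRecord::Base       # join table: user_id, friend_id
  belongs_to :user
  belongs_to :friend, class_name: "User"
end

class Update < ActiveRecord::Base           # user_id, body, created_at
  belongs_to :user
end

# Friends' updates for one user: a single indexed IN query against updates.
def friend_feed(user)
  Update.where(user_id: user.friend_ids).order(created_at: :desc).limit(50)
end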
Travis wrote a great post on this:
Activity Logs and Friend Feeds on Rails & pfeed
For the small scale, doing a join on users.friends and users.events with query caching is probably fine, but it slows down pretty quickly as friends and events grow. You could also try an event-based model in which, every time a user creates an event, an entry is created in a join table (perhaps called "friends_events"). Then, whenever a user wants to see what events their friends have created, they can simply join their own ID against the friends_events table and find out. In this way you avoid grabbing all of a user's friends and then joining the friends with the events table (see the sketch below).
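One reading of that event-based model, sketched with ActiveRecord (the friends_events table, column names, and the Rails 6+ insert_all bulk write are all illustrative):

class Event < ActiveRecord::Base
  belongs_to :user
  after_create :fan_out

  # Write one friends_events row per friend at creation time, so reading a
  # feed later is a single indexed lookup instead of a join across friends.
  def fan_out
    rows = user.friend_ids.map do |friend_id|
      { user_id: friend_id, event_id: id, created_at: created_at }
    end
    FriendsEvent.insert_all(rows) unless rows.empty?
  end
end

# Reading the feed: only the user's own id touches friends_events.
FriendsEvent.where(user_id: current_user.id).order(created_at: :desc).limit(50)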
