How do social networking websites compute friend updates? - scalability

Social networking websites probably maintain tables for users, friends and events...
How do they use these tables to compute friends events in an efficient and scalable manner?

Many of the social networking sites like Twitter don't use an RDBMS at all but a message queue application. A lot of them start out with an existing application like RabbitMQ. Some of them get big enough that they have to heavily customize it or build their own. Twitter is in the process of doing this for the second time.
A message queue application works by holding messages from one service for one or more other services. For instance, say service Frank is publishing messages to a queue foo, and Joe and Jill are subscribed to Frank's foo queue. The application keeps track of whether Joe and Jill have received each message, and once every subscriber to the queue has received a message, it discards it. Frank fires off messages and forgets about them. Joe and Jill ask for messages from foo and get whatever messages they haven't gotten yet. Joe and Jill then do whatever they need to do with each message, perhaps keeping it around, perhaps not.
The message queue application guarantees that everyone who is supposed to get a message can and will get it when they request it. The publisher can send messages confident that subscribers will get them eventually. This has the benefit of being completely asynchronous and not requiring costly joins.
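To make the Frank/Joe/Jill flow concrete, here is a minimal in-memory sketch in Python. It is not how RabbitMQ or any real broker works internally. The class name TinyQueue and its methods are made up for illustration, and it assumes subscribers are registered before anything is published.

```python
from collections import defaultdict


class TinyQueue:
    """Hypothetical in-memory queue that tracks delivery per subscriber."""

    def __init__(self):
        self.subscribers = set()
        self.messages = {}                # message id -> body
        self.pending = defaultdict(list)  # subscriber -> ids not yet fetched
        self.waiting = {}                 # message id -> subscribers still owed it
        self.next_id = 0

    def subscribe(self, name):
        self.subscribers.add(name)

    def publish(self, body):
        # Fire and forget: the publisher never learns who has read what.
        if not self.subscribers:
            return
        msg_id = self.next_id
        self.next_id += 1
        self.messages[msg_id] = body
        self.waiting[msg_id] = set(self.subscribers)
        for name in self.subscribers:
            self.pending[name].append(msg_id)

    def fetch(self, name):
        # Hand over everything this subscriber has not seen yet.
        delivered = []
        for msg_id in self.pending.pop(name, []):
            delivered.append(self.messages[msg_id])
            self.waiting[msg_id].discard(name)
            if not self.waiting[msg_id]:
                # Every subscriber has it, so the queue can drop it.
                del self.waiting[msg_id]
                del self.messages[msg_id]
        return delivered


foo = TinyQueue()
foo.subscribe("joe")
foo.subscribe("jill")
foo.publish("frank: hello")
foo.publish("frank: lunch?")

joe_first = foo.fetch("joe")      # both messages
joe_second = foo.fetch("joe")     # nothing new for joe
held_before = len(foo.messages)   # still held, jill has not asked yet
jill_first = foo.fetch("jill")
held_after = len(foo.messages)    # discarded once everyone has them
```

Note that nothing here joins tables at read time: each subscriber just drains its own list.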
EDIT: I should also mention that storage for this kind of thing at high scale is usually heavily denormalized. So Joe and Jill may each be storing a copy of the exact same message. This is considered OK because it helps the application scale to billions of users.
Other reading:
http://www.rabbitmq.com/
http://qpid.apache.org/

The mainstay data structure of social networking sites is the graph. On Facebook the graph is undirected (when you're someone's friend, they're your friend). On Twitter the graph is directed (you follow someone, but they don't necessarily follow you).
The two popular ways to represent graphs are adjacency lists and adjacency matrices.
An adjacency list, stored as table rows, is simply a list of the edges of the graph. Consider users identified by an integer user id.
User1, User2
1 2
1 3
2 3
The undirected interpretation of these records is that user 1 is friends with users 2 and 3 and user 2 is also friends with user 3.
Representing this in a database table is trivial. It is the many-to-many join table that we are familiar with. SQL queries to find the friends of a particular user are quite easy to write.
Now that you know a particular user's friends, you just need to join those results to the updates table. This table contains every user's updates, indexed by user id.
As long as all these tables are properly indexed, you'd have a pretty easy time designing efficient queries to answer the questions you're interested in.
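Here is a rough sketch of that join using Python's built-in sqlite3 module and the three edges from the example above. The table and column names (friendships, updates, user1, user2) are my own assumptions, not anything a particular site uses.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE friendships (user1 INTEGER NOT NULL, user2 INTEGER NOT NULL,
                              PRIMARY KEY (user1, user2));
    CREATE INDEX idx_friendships_user2 ON friendships (user2);
    CREATE TABLE updates (id INTEGER PRIMARY KEY, user_id INTEGER NOT NULL,
                          body TEXT NOT NULL);
    CREATE INDEX idx_updates_user ON updates (user_id);
    INSERT INTO friendships VALUES (1, 2), (1, 3), (2, 3);
    INSERT INTO updates (user_id, body) VALUES
        (1, 'user 1 posted'), (2, 'user 2 posted'), (3, 'user 3 posted');
""")


def friend_updates(conn, user_id):
    # Undirected graph: the user may appear in either column of an edge,
    # so collect friends from both sides before joining to updates.
    rows = conn.execute("""
        SELECT u.user_id, u.body
        FROM (SELECT user2 AS friend_id FROM friendships WHERE user1 = :me
              UNION
              SELECT user1 FROM friendships WHERE user2 = :me) AS f
        JOIN updates AS u ON u.user_id = f.friend_id
        ORDER BY u.id DESC
    """, {"me": user_id})
    return rows.fetchall()


updates_for_user_2 = friend_updates(conn, 2)
```

For a directed graph like Twitter's you would drop the second half of the UNION and only look at the column that means "follower".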

Travis wrote a great post on this:
Activity Logs and Friend Feeds on Rails & pfeed

At small scale, doing a join on users.friends and users.events with query caching is probably fine, but it slows down pretty quickly as friends and events grow. You could also try an event-based model in which, every time a user creates an event, an entry is created for each of their friends in a join table (perhaps called "friends_events"). Then whenever a user wants to see what events their friends have created, they can simply look up their own id in the friends_events table. In this way you avoid grabbing all of a user's friends and then joining those friends with the events table.
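One way to read the friends_events idea is as a fan-out on write: pay the cost when the event is created so that reading a feed is a single lookup. A minimal sketch with sqlite3 follows. The schema, the viewer_id column, and both function names are assumptions on my part.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE friends (user_id INTEGER, friend_id INTEGER);
    CREATE TABLE events (id INTEGER PRIMARY KEY, creator_id INTEGER, body TEXT);
    CREATE TABLE friends_events (viewer_id INTEGER, event_id INTEGER);
    CREATE INDEX idx_friends_events_viewer ON friends_events (viewer_id);
    -- user 1 is friends with users 2 and 3 (one row per direction)
    INSERT INTO friends VALUES (1, 2), (2, 1), (1, 3), (3, 1);
""")


def create_event(db, creator_id, body):
    # Write path: store the event once, then add one feed row per friend.
    event_id = db.execute(
        "INSERT INTO events (creator_id, body) VALUES (?, ?)",
        (creator_id, body),
    ).lastrowid
    db.execute(
        "INSERT INTO friends_events (viewer_id, event_id) "
        "SELECT friend_id, ? FROM friends WHERE user_id = ?",
        (event_id, creator_id),
    )
    return event_id


def feed(db, viewer_id):
    # Read path: one indexed lookup by viewer, no walk over the friend list.
    return db.execute(
        "SELECT e.creator_id, e.body FROM friends_events fe "
        "JOIN events e ON e.id = fe.event_id "
        "WHERE fe.viewer_id = ? ORDER BY e.id DESC",
        (viewer_id,),
    ).fetchall()


create_event(db, 1, "user 1 checked in")
create_event(db, 2, "user 2 uploaded a photo")
feed_for_3 = feed(db, 3)
```

The trade-off is write amplification: a user with many friends produces many friends_events rows per event.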

Related

What's the most efficient way to create an alert queue for a model with hundreds of millions of entries?

I am currently working on an application in Rails (though language/framework shouldn't matter for this question since it is more of a theoretical one). I'm working on wrapping my head around this problem:
Say I am tracking millions of blogs online and am plugged into their RSS feeds. My app pings these feeds every few minutes to see if there has been any new activity across any of these millions of blogs. If there is, I want to alert the users of my application who have signed up to receive alerts for those specific blogs.
Does it make sense to have a user_blog_alerts table (where a user can specify custom keywords to be alerted about) and continuously check this table against every new entry that comes in from my feed? And when there is a match, to add them to a queue (using Redis)?
What is the best, most efficient way to build and model this alerting system? Am I even thinking about this in the right way? Are there any good examples or tutorials on this when working with such large amounts of data?
I'm not sure what the right way to do this is, but the thought of continuously scanning a table over and over sounds exhausting (i.e. unscalable).
Off the top of my head, what if you created a LIST for every blog in Redis? The values would be the user IDs of those who wanted an alert. The key name would contain the blog id (e.g. "user_blog_alerts:12345").
Then when you got a new post for blog 12345 it's a simple lookup to see if that key exists. If it does, then fire off alerts for each user in the list.
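A small sketch of that lookup in Python. To keep it runnable without a Redis server, a plain dict stands in for Redis here. The comments name the Redis commands I would expect to use in its place, and the function names are made up.

```python
from collections import defaultdict

# Stand-in for Redis: key -> list of user ids.
alert_lists = defaultdict(list)


def key_for(blog_id):
    return f"user_blog_alerts:{blog_id}"


def subscribe(user_id, blog_id):
    # With Redis this would be an RPUSH onto the blog's list.
    alert_lists[key_for(blog_id)].append(user_id)


def users_to_alert(blog_id):
    # With Redis: LRANGE key 0 -1, which returns an empty list
    # when the key does not exist, so no separate existence check is needed.
    return list(alert_lists.get(key_for(blog_id), []))


subscribe(7, 12345)
subscribe(9, 12345)
alerts_for_new_post = users_to_alert(12345)   # users 7 and 9
alerts_for_quiet_blog = users_to_alert(999)   # nobody subscribed
```

The work per incoming post is proportional to the number of subscribers of that one blog, not to the size of the whole alerts table. If the same user could subscribe twice, a Redis SET (SADD / SMEMBERS) would avoid duplicate alerts.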

What is the best way to store a user's Facebook friends list in my database?

Overview
I'm creating a Ruby on Rails website which uses Facebook to login.
For each user I have a database entry which stores their Facebook User ID along with other basic information.
I'm also using the Koala gem in order to retrieve a user's friendlist from Facebook, but I'm unsure as to how I should store this data...
Option 1
I could store the user's friends as a serialized hash in the User table, then if I wanted to display a list of all the current user's friends, I could grab this hash and do something along the lines of SELECT * FROM Users WHERE facebook_user_id IN (ids from the hash)
Each time the user logs in I could update this field to store the latest friends list.
Option 2
I could create a Friend table and store friendship information in here, where a User has many Friends. So there would be a row for each friendship, (User1 and User2 columns). Then to display a list of the current user's friends I could do something like SELECT User2 FROM Friends WHERE User1 = current_user
This seems like the better option to me, but...
It has the disadvantage that there would be many rows... If there were 100,000 users, each with 100 friends, that's now 10,000,000 rows in the Friends table.
It also means each time the user logs in, I'd need to loop over their Facebook friends list returned using Koala and create a Friend record if someone on their friendlist is in my User table and there isn't a corresponding entry in the Friends table. This seems like it'd be slow if a user has 1000 Facebook friends?
I'd appreciate any guidance on how it would be best to achieve this.
Apologies for the badly worded question, I'll try and reword/organise it shortly.
Thanks for any help in advance.
If you need to store a lot of data, then you need to store a lot of data. If you are like most, you probably won't run into that problem sooner than you have the cash to solve it. In other words, you are probably assuming you'll have more traffic and data than you'll get, at least in the short-term. So I doubt this is an issue, even though it is a good sign that you are thinking about it now rather than later.
As I mentioned in my comment below, the easiest solution is to have a tie table with a row for each side of the friend relationship (a has_many :friends, through: :facebook_friend_relationships, class_name: 'FacebookFriend' on FacebookFriend, per the design mentioned below). But your question seemed to be about how to reduce the number of records, so that is what the remainder of the answer will address.
If you have to store in the DB and you know for sure that you will absolutely have every FB user on the planet hitting your site because it is so awesome, but they won't all hit at once, then if you are limited in storage, you may want to use an LRU algorithm (remove the least recently used records), possibly with timed expiration also. You could just have a cron job that queries the DB and then deletes old/unused records to do this. It wouldn't be perfect, but it would be a simple solution.
You could also archive older data rather than throw it away. So, frequently used data could stay in the table of active users, and then you might offload older data to another table or even another database (and you might see the apartment and second_base gems for that). However, once you get to that size, you're probably looking at a number of other architectural solutions that have much less to do with ActiveRecord models/associations or schema design. Though it pays to plan ahead, I wouldn't worry about that excessively until you are sure that the application will get enough users to justify investing the time in that.
Even though ActiveRecord has some caching, you could just avoid the DB and cache friends in memory yourself in the beginning for speed, especially if you don't yet have many users, which you probably don't. If you think you'll run out of memory because of the high number of users, LRU might be a good option here also, and lru_redux looks interesting. Again, you might want to time the cache as well, so that it expires and the friends are fetched again when it does. Even just storing the results in the user session may be adequate, i.e. in the controller action method, just do @friends ||= Something.find_friends(fb_user_id), and the latter is what most might do as a first shot at it while getting started.
If you use ActiveRecord, consider using includes in your query in the controller (or on the association in the model) to avoid N+1 queries. That will speed things up.
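To show what LRU plus timed expiration amounts to, here is a minimal sketch in Python rather than Ruby. It is not lru_redux. FriendCache, load_friends and the clock parameter are hypothetical names, and the clock is injectable only so the example can be checked without sleeping.

```python
import time
from collections import OrderedDict


class FriendCache:
    """Hypothetical in-memory cache: LRU eviction plus a per-entry time limit."""

    def __init__(self, max_users=1000, ttl_seconds=600, clock=time.monotonic):
        self.max_users = max_users
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self.entries = OrderedDict()  # fb_user_id -> (expires_at, friends)

    def get(self, fb_user_id, load_friends):
        now = self.clock()
        entry = self.entries.get(fb_user_id)
        if entry is not None and entry[0] > now:
            self.entries.move_to_end(fb_user_id)  # mark as most recently used
            return entry[1]
        friends = load_friends(fb_user_id)        # miss or expired: fetch again
        self.entries[fb_user_id] = (now + self.ttl_seconds, friends)
        self.entries.move_to_end(fb_user_id)
        while len(self.entries) > self.max_users:
            self.entries.popitem(last=False)      # evict least recently used
        return friends


calls = []
fake_now = [0.0]


def fake_load(fb_user_id):
    # Stand-in for the real friend lookup (Koala, the DB, ...).
    calls.append(fb_user_id)
    return [fb_user_id + 1]


cache = FriendCache(max_users=2, ttl_seconds=10, clock=lambda: fake_now[0])
cache.get(1, fake_load)
cache.get(1, fake_load)         # served from the cache, no second load
calls_while_fresh = list(calls)
fake_now[0] = 11.0
cache.get(1, fake_load)         # entry expired, loaded again
cache.get(2, fake_load)
cache.get(3, fake_load)         # over capacity: user 1 is evicted
cached_users = list(cache.entries)
```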
For the schema design, maybe:
User - users table with email and authN info. Look at the Devise gem.
FacebookUser - info about the Facebook user.
FacebookFriendRelationship - a tie model with (id and) two columns, one for one FacebookUser id and one for the other.
By separating the authN info (User) from the FB data (FacebookUser and FacebookFriendRelationship), you make it easier to have other social media accounts, etc. each with information specific to those accounts in other tables.
The complexity comes in FacebookUser's relationship with friends if the goal is to minimize rows in the relationship table. To halve the number of rows, you'd have a single row for a relationship where the id of FacebookUser could be in either foreign key column. Either the user has a friend or is a friend, so you could have two has_many :through associations on FacebookFriend that each use a different foreign key in FacebookFriendRelationship. Or you could do HABTM without the model and use foreign_key and association_foreign_key options in each association. Either way, you could add a method to add both associations together (because they are arrays). Instead, you could use custom SQL in a single has_many if you didn't care about having to use ActiveRecord to remove associations the normal way. However, per your comments, I think you want to avoid this complexity, and I agree with you, unless you really must limit the number of relationship rows. However, it isn't the number of tie table rows that will eat the data, it is going to be all of the user info you keep in the FacebookFriends table.
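At the SQL level, the single-row-per-friendship idea might look like the sketch below (Python with sqlite3, not ActiveRecord). Storing the smaller id first is my own convention for keeping the pair unique, and the table and column names are assumptions.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE friend_relationships (
        low_id  INTEGER NOT NULL,
        high_id INTEGER NOT NULL,
        PRIMARY KEY (low_id, high_id),
        CHECK (low_id < high_id)
    )
""")
db.execute("CREATE INDEX idx_rel_high ON friend_relationships (high_id)")


def add_friendship(db, a, b):
    # Normalize the pair so (1, 2) and (2, 1) land on the same row.
    low, high = min(a, b), max(a, b)
    db.execute(
        "INSERT OR IGNORE INTO friend_relationships VALUES (?, ?)", (low, high)
    )


def friends_of(db, user_id):
    # The user can sit in either column, so read both sides and merge.
    rows = db.execute(
        "SELECT high_id FROM friend_relationships WHERE low_id = ? "
        "UNION SELECT low_id FROM friend_relationships WHERE high_id = ? "
        "ORDER BY 1",
        (user_id, user_id),
    )
    return [row[0] for row in rows]


add_friendship(db, 1, 2)
add_friendship(db, 2, 1)   # same friendship, ignored
add_friendship(db, 3, 2)
row_count = db.execute("SELECT COUNT(*) FROM friend_relationships").fetchone()[0]
friends_of_2 = friends_of(db, 2)
```

This is the "add both associations together" step done in one query. The price is that every read has to look at both columns.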

RavenDB Complex Model

I am trying to learn RavenDB by replacing my RDBMS in a project that I've already worked on so that I'm using it in a real situation. I've come to a standstill while trying to create the database, and I'd love to know the best way to model this in a document database. Every possibility I come up with either ends up looking like a relational database or ends up repeating vast amounts of information. Repeating the information in the database isn't a big deal, but keeping it all up to date when changes occur would be.
I'm hoping that I'm stuck in SQL mode and I'm just completely unable to see an obvious answer.
Here are the basic objects I need to record data for:
-Event
-Person
-Organization
-Cabin
Simple Requirements:
-A person can be a part of multiple organizations.
-An organization can have many members (people).
-A person can attend multiple events.
-An event has many people that attend.
-Some details about a cabin may change depending on the event (e.g. Accommodations).
Complex Requirements:
-I need to be able to reserve cabins for an event so that a single cabin is not used by two events at once. (with RDBMS I would just create an "EventCabins" table).
-I need to be able to record which people are attending an event. People attending an event will have information associated with them that is not part of Person or Event.
-I need to be able to record which organizations are attending an event. Organizations attending will have information associated with them that is not part of Organization or Event.
-I need to be able to record which People are assigned to which cabins in a particular event.
-I need to be able to record which People are attending a particular event as a part of an organization (it's not required to attend as a part of an organization). Even though a person can belong to more than one organization, he/she can only attend as a part of one of those organizations for a particular event. He/she might attend as a part of a different organization for another event.
-In the program, the user will be looking at only one event at a time. In that event, the user can look at attenders grouped by cabin or grouped by organization.
It seems obvious that I will need separate collections for Events, People, Organizations, and Cabins. Fulfilling the complex requirements above is where I hit the wall.
Do I put Attenders inside the Event collection? If so, then what do I do with Cabins and Organizations?
Do I create a separate collection for Attenders? If so, then there will be 4 different related collections that I will need to store Ids for and query at various times (Organizations, Cabins, Events, People). This seems opposite of the document database approach.
Thanks!
It seems to me that you should just use a relational database for this project.
If you want to use RavenDB, I would suggest using completely separate collections for all of these objects, while keeping references to the other documents. Then you could query the database using the .Include functionality. The best way would be to create map/reduce indexes for each of the cases you need, such as an index that returns an Event object filled with all of the invited people.
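To illustrate "separate collections that keep references", here is a language-neutral sketch using plain Python dicts. It does not use the RavenDB client API at all. The attendance document and its field names are my own assumption about one way to hold the event-specific details from the question.

```python
from collections import defaultdict

# Each collection is a dict of id -> document.
people = {
    "people/1": {"name": "Ann"},
    "people/2": {"name": "Bo"},
    "people/3": {"name": "Cy"},
}

# One attendance document per person per event. It carries the
# event-specific details and points at the other documents by id.
attendances = [
    {"event": "events/1", "person": "people/1",
     "organization": "organizations/1", "cabin": "cabins/1"},
    {"event": "events/1", "person": "people/2",
     "organization": None, "cabin": "cabins/1"},
    {"event": "events/1", "person": "people/3",
     "organization": "organizations/1", "cabin": "cabins/2"},
    {"event": "events/2", "person": "people/1",
     "organization": "organizations/2", "cabin": "cabins/2"},
]


def attenders_grouped(event_id, field):
    # Roughly what an index grouping attenders by cabin or organization
    # would precompute for a single event.
    groups = defaultdict(list)
    for doc in attendances:
        if doc["event"] == event_id:
            groups[doc[field]].append(people[doc["person"]]["name"])
    return dict(groups)


by_cabin = attenders_grouped("events/1", "cabin")
by_organization = attenders_grouped("events/1", "organization")
```

Because a person has exactly one attendance document per event, the "one organization per event" rule falls out of the shape of the data.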

efficiently display total number of users and number of users currently on the application

In a rails application how do you efficiently display the total number of users and current number of users online?
Sam,
There are many methods for tracking online users. For example, authlogic has a last_request_at column which tracks when they last made a request to the site. Though, it's not very efficient to run a query for that every page load. I personally use Redis for tracking that sort of activity.
Here is a great example: Redis in practice, who is online
Hopefully this helps.
The common way to get the list of your online users is to store sessions in the DB (use ActiveRecord SessionStore), then retrieve recently updated sessions from the DB, deserialize them and see which users they belong to.
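Whichever store you pick, the "who is online" count usually boils down to "who made a request in the last N minutes". A minimal in-memory sketch of that idea in Python follows. The dict stands in for a last_request_at column or a Redis key per user, and the function names and the 5-minute window are assumptions.

```python
import time

last_request_at = {}  # user_id -> timestamp of the most recent request


def touch(user_id, now=None):
    # Call this on every request made by a logged-in user.
    last_request_at[user_id] = time.time() if now is None else now


def online_count(window_seconds=300, now=None):
    # A user counts as online if they were seen inside the window.
    now = time.time() if now is None else now
    cutoff = now - window_seconds
    return sum(1 for seen in last_request_at.values() if seen >= cutoff)


touch(1, now=1000)
touch(2, now=1290)
touch(3, now=600)
online_now = online_count(window_seconds=300, now=1300)
```

The total number of users is just a COUNT on the users table. If that becomes slow, keeping a counter that is incremented on signup is a common shortcut.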

Need advice on MongoDB Schema for Chat App. Embedded vs Related Documents

I'm starting a MongoDB project just for kicks and as a chance to learn MongoDB/NoSQL schemas. It'll be a live chat app and the stack includes: Rails 3, Ruby 1.9.2, Devise, Mongoid/MongoDB, CarrierWave, Redis, JQuery.
I'll be handling the live chat polling/message queueing separately. Not sure how yet, either Node.js, APE or custom EventMachine app. But in regards to Mongo, I'm thinking to use it for everything else in the app, specifically chat logs and historical transcripts.
My question is how best to design the schema, as all my previous experience has been with MySQL and relational DB schemas. And as a sub-question, when is it best to use embedded documents vs related documents?
The app will have:
Multiple accounts which have multiple rooms
Multiple rooms
Multiple users per room
List of rooms a user is allowed to be in
Multiple user chats per room
Searchable chat logs on a per room and per user basis
Optional file attachment for a given chat
Given Mongo (at least last time I checked) has a document limit of 4MB, I don't think having a collection for rooms and storing all room chats as embedded documents would work out so well.
From what I've thought about so far, I'm thinking of doing something like:
A collection for accounts
A collection for rooms
Each room relates back to an account
Related documents in chats collections for all chat messages in the room
Embedded Document listing all users currently in the room
A collection for users
Embedded Document listing all the rooms the user is currently in
Embedded Document listing all the rooms the user is allowed to be in
A collection for chats
Each chat relates back to a room in the rooms collection
Each chat relates back to a user in the users collection
Embedded document with info about optional uploaded file attachment.
My main concern is how far do I go until this ends up looking like a relational schema and I defeat the purpose? There is definitely more relating than embedding going on.
Another concern is that, from what I've heard, referencing related documents is much slower than accessing embedded documents.
I want to make generic queries such as:
Give me all rooms for an account
Give me all chats in a room (or filtered via date range)
Give me all chats from a specific user
Give me all uploaded files in a given room or for a given org
etc
Any suggestions on how to structure the schema efficiently in a way that scales? Thanks everyone.
I think you're pretty much on the right track. I'd use a capped collection for chat lines, with each line containing the user ID, room ID, timestamp, and what was said. This data would expire once the capped collection's "end" is reached, so if you needed a historical log you'd want to copy data out of the capped collection into a "log" collection periodically, but capped collections are specifically designed for logging-style applications where you aren't going to be deleting documents, and insertion order matters. In the case of chat, it's a perfect match.
The only other change I'd suggest would be to maintain uploads in a separate collection, as well.
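If capped collections are new to you, the behaviour is close to a fixed-size ring buffer. Here is a rough in-memory analogy in Python, not MongoDB code. ChatBuffer, its seq counter and the archive method are hypothetical names for the periodic copy into a log collection.

```python
from collections import deque


class ChatBuffer:
    """Fixed-size buffer of recent chat lines plus a permanent log."""

    def __init__(self, capacity):
        self.recent = deque(maxlen=capacity)  # oldest lines fall off the end
        self.log = []                         # stand-in for the "log" collection
        self.next_seq = 1
        self.archived_upto = 0

    def say(self, user_id, room_id, text, timestamp):
        self.recent.append({
            "seq": self.next_seq, "user_id": user_id, "room_id": room_id,
            "timestamp": timestamp, "text": text,
        })
        self.next_seq += 1

    def archive(self):
        # Copy every line not yet archived into the permanent log.
        for line in self.recent:
            if line["seq"] > self.archived_upto:
                self.log.append(line)
                self.archived_upto = line["seq"]


chat = ChatBuffer(capacity=3)
chat.say(1, 10, "hi", 100)
chat.say(2, 10, "hello", 101)
chat.archive()
chat.say(1, 10, "how are you?", 102)
chat.say(2, 10, "fine", 103)
chat.say(1, 10, "good", 104)
chat.archive()
recent_seqs = [line["seq"] for line in chat.recent]
logged_seqs = [line["seq"] for line in chat.log]
```

The catch is visible in the sketch: if archive runs less often than the buffer wraps around, the lines that fell off in between are gone, so size the collection with the copy interval in mind.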
I am a big fan of MongoDB as a document database as well. But are you sure you are using MongoDB for the right reason? What is MongoDB powerful at?
It's a subjective question, but for me in-place (atomic) updates over documents are what make MongoDB powerful. And I can't really see you using them that much. On top of that, you are hitting the document size limit problem as well. (From experience I can tell you that embedding files in MongoDB is not a good idea.) You want to have a live chat application on top of the database too.
Your document schema seems logical. But I wouldn't go with MongoDB for this kind of project, where your application depends heavily on inserts. I would go for CouchDB.
With CouchDB you wouldn't have to worry about the attachments problem; you can embed them easily. "_changes" would make your life much, much easier for either building a live chat application, long polling, or feeding a search engine (if you want to implement one).
And I saw an open source showcase project in couchone. It has some similarities with your goals: Anologue. You should check it out.
PS : Sorry it was a little off topic but I couldn't hold myself.

Resources