Best MongoDB schema for twitter clone? - twitter

I know similar questions have been asked, but looking for a very basic answer to a basic question. I am new to MongoDB and making a twitter style app (blogs, followers, etc) and I'm wondering the best schema to use.
Right now I have (on a very high level):
Member {
login: string,
pass: string,
posts: [
{
title: string,
blog: string,
comments: [ { comment: string } ]
}
]
}
There is more to it, but that gives you the idea. Now the problem is I'm looking to add the "follow" feature and I'm not sure the best route to go.
I could add a "following" embedded doc to the Member, but I'm just not sure using mongoDB what the smartest method would be. My main concearn would obviously be the main "feed" page where you see all of the people you are following's posts.

This is not an ideal schema for a Twitter clone. The main problem is that "posts" is an evergrowing array which means mongo will have to move your massive document every few posts because it ran out of document padding. Additionally there's a hard (16mb) size limit to documents which makes this schema restrictive at best.
The ideal schema depends on whether or not you expect Twitter's load. The "perfect" mongodb schema in terms of maintainability and easy of use is not the same as the one I'd use for something with Twitter's throughput. For example, in the former case I'd use a posts collection with a document per post. In the high throughput scenario I'd start making bucket documents for small groups of posts (say, one per "get more" page). Additionally in the high throughput scenario you'd have to keep the follower's timeline up to date in seperate user timeline documents while in low throughput scenarios you can simply query them.

This question is the same the one how widely used in the blog post example and how to model blog posts and comments. You just have to apply the same concepts here. You have the following options:
embedded documents
dedicated collections and performing multiple queries
The pros and cons have been widely discussed. Embedded docs can only be 16MB large and it is not possible to return individual parts of an matched array in MongoDB...make your choice.
Not going any further because as said: the same question has been discussed in numerous questions about "schema design". Just google "Schema Design MongoDB" or look for the same on SO.

Adding a "following" array to the Member document should work well. It should contain the user IDs of the people that member is following. Your code will have to retrieve the list and construct a query that retrieves the tweets of those users. As Mongo is nonrelational, there's no way to construct a query that joins the Member and Tweet collections and does this in a single query, but you should be able to reduce network overhead by doing this on the database server, using server-side code execution: http://www.mongodb.org/display/DOCS/Server-side+Code+Execution.

Related

Dynamic Queries using Couch_Potato

The documentation for creating a fairly straightforward view is easy enough to find:
view :completed, :key => :name, :conditions => 'doc.completed === true'
How, though, does one construct a view with a condition created on the fly? For example, if I want to use a query along the lines of
doc.owner_id == my_var
Where my_var is set programatically.
Is this even possible? I'm very new to NoSQL so apologies if I'm making no sense.
Views in CouchDB are incrementally built / indexed as data is inserted / updated into that particular database. So in order to take full advantage of the power behind views you won't want to dynamically query them. You'll want to construct your views in such a way that you can efficiently access the data based on the expected usage patterns of the application. In my experience it's not uncommon to have multiple views each giving you a different way to access / query the same data. I find it helpful to think of CouchDB views as a way to systematically denormalize your documents.
On the other hand there are also ways to generalize your indexes in your views so you can use a single view for endless combinations of queries.
For example, you have an "articles" database, and each article document contains a list of tags. If you want to set up a query to dynamically retrieve all articles tagged with a handful of tags, you could emit multiple entries to the view on the same document:
// this article is tagged with "tag1","tag2","tag3"
emit("tag1",doc._id);
emit("tag2",doc._id);
emit("tag3",doc._id);
....
Now you have a way to query: Give me all articles tagged with these words: ["tag1","tag2",etc]
For more info on how to query multiple keys see "Parameter -> keys" in the table of Querying Options here:
http://wiki.apache.org/couchdb/HTTP_view_API#Querying_Options
One problem with the above example is it would produce duplicates if a single document was tagged with both or all of the tags you were querying for. You can easily de-dupe the results of the view by using a CouchDB "List Function". More info about list functions can be found here:
http://guide.couchdb.org/draft/transforming.html
Another way to construct views for even more robust "dynamic" access to the data would be to compose your indexes out of complex data types such as JavaScript arrays. Also incorporating "range queries" can help. So for example if you have a 3-item array in your index, but only have the first 2 values, you can set up a range query to pull all documents that match the first 2 items of the array. Some useful info about that can be found here:
http://guide.couchdb.org/draft/views.html
Refer to the "startkey", and "endkey" options under "Querying Options" table here:
http://wiki.apache.org/couchdb/HTTP_view_API#Querying_Options
It's good to know how CouchDB indexes itself. It uses a "B+ tree" data structure:
http://guide.couchdb.org/draft/btree.html
Keep this in mind when thinking about how to compose your indexes. This has specific implications about how you need to construct your indexes. For example, you can't expect to get good performance on a view if you query with a range on the first item in the array. For example:
startkey = [a,1,2]
endkey = [z,1,2]
You'll get the performance you'd expect if your query is:
startkey = [1,2,a]
endkey = [1,2,z]
This, in more general terms, means that index order does matter when querying views. Not just on basis of performance, but on basis of what documents will be returned. If you index a document in a view with [1,2,3], you can't expect it to show up in query for index [3,2,1], [2,1,3], or any other combination.
In my experience, most data-access problems can be solved elegantly and efficiently with CouchDB and the basic tools it provides. If / when your project needs true dynamic access to the data, I generally still use CouchDB for common data access needs, but I'll also integrate ElasticSearch using an ElasticSearch plugin which streams your data from CouchDB into ElasticSearch as it becomes available:
http://www.elasticsearch.org/
https://github.com/elasticsearch/elasticsearch-river-couchdb

Applying distinct on specific field in CloudSearch query

I am examining AWS CloudSearch for system's new searching engine.
Assume that there are articles and some comments written on each articles. The search API should return articles that are matching or having any matching comments. So is there any possible way to retrieve DISTINCT values(in this case, unique ID of the article) from CloudSearch with single query execution? If not, what would be the nice solution to resolve this feature's requirement with CloudSearch?
I know there's text-array type for document field in CloudSearch but it seems expensive to update documents since N of comments on single article can be more than thousands.
I faced similar problem, putting comments is not an option in your case as array elements cannot be more than 1000 in cloudsearch. I will make two search domains, articles and comments. I will issue search query to both of them in parallel (async or multithreaded depending upon the language), articles will always generate non duplicate ids but on the results of comments query you have to apply the logic to an article id only once and always pick the top one, as results are sorted by matching score.

Parsing Wikipedia countries, regions, cities

Is it possible to get a list of all Wikipedia countries, regions and cities with relations between them? I couldn't find any API appropriate for this task.
What is be the easiest way to parse all the information I need?
PS: I know, that there are another datasources I can get this information from. But I am interested in Wikipedia...
[2020 update] this is now best done using the Wikidata Query Service, you can run super specific queries with a bit of SPARQL, example: Find all countries and their label. See Wikidata Query Help
It might be a bit tedious to get the whole graph but you can get most of the data from the experimental/non-official Wikidata Query API.
I suggest the following workflow:
Go to an instance of the kind of entities you want to work with, say Estonia (Q191) and look for its instance of (P31) properties, you will find: country, sovereign state, member of the UN, member of the EU, etc.
Use the Wikidata Query API claim command to output every entity
that as the chosen P31 property. Lets try with country (Q6256):
http://wdq.wmflabs.org/api?q=claim[31:6256]
It outputs an array of numeric ids: that's your countries! (notice that the result is still incomplete as there are only 141 items found: either countries are missing from Wikidata, or, as suggested by Nemo in comments, some countries are to be found in country (Q6256) subclasses(P279))
You may want more than ids though, so you can ask Wikidata Official API for entities data:
https://www.wikidata.org/w/api.php?action=wbgetentities&ids=Q16&format=json&props=labels|claims&languages=en|fr
(here Canada(Q16) data, in json, with only claims and labels data, in English and French. Look at the documentation to adapt parameters to your needs)
You can query multiple entities at a time, with a limit of 50, as follow:
https://www.wikidata.org/w/api.php?action=wbgetentities&ids=Q16|Q17|Q20|Q27|Q28|Q29|Q30|Q31|Q32|Q33|Q34|Q35|Q36|Q37|Q38|Q39|Q40|Q41|Q43|Q45|Q77|Q79|Q96|Q114&format=json&props=labels|claims&languages=en|fr
From every countries data, you could look for entities registered as administrative subdivisions (P150) and repeat on those new entities.
Aternatively, you can get all the tree of administrative subdivisions with the tree command. For instance, for France(Q142) that would be http://wdq.wmflabs.org/api?q=tree[142][150] Tadaaa, 36994 items! But that's way harder to refine given the different kinds of subdivision you can encounter from a country to another. And avoid doing this kind of query from a browser, it might crash.
You now just have to find cities by countries by refining this last query with the claim command, and the appropriate sub-class(P279) of municipality(Q15284) entity (all available here): for France, that's commune (Q484170), so your request looks like
http://wdq.wmflabs.org/api?q=tree[142][150] AND claim[31:484170]
then repeat for all the countries: have fun!
You should go with Wikidata and/or dbpedia.
Personally I'd start with Wikidata as it's directly using MediaWiki, with the same API so you can use similar code. I would use pywikibot to get started. Like that you can still request pages from Wikipedia where that makes sense (e.g. list pages or categories).
Here's a nice overview of ways to access Wikidata

Databases for filtering pop culture quotes or celebrity-related data out of text (e.g., tweets)?

I am trying to mine social media data, such as tweets. However, social media data have a lot of noise- for example people discussing celebrities or quoting a movie/TV/song, that is something most generally that is not about themselves or somebody they actually know personally.
So, is: are there any dynamic (i.e., automatically updated) databases on the most popular current celebrities? Movie quotes that they are in or song lyrics that they sing would also be relevant.
I don't think such a curated list exists. Smaller ones do exist, for example the 100 top movies quotes on Wikipedia. However, these are not updated.
One possibility is to filter out the aspects of your input that appear on another social media site that tracks trends, such as Delicious. Unless you are looking for trends, something that rises to the top of two trending sites likely ... is just a trend.
Delicious has a nice Python wrapper for its API.
In Pythonic pseudocode,
data = social-media.content
data = filter(lambda datum: datum not in delicious.content-list,data)

Building a (simple) twitter-clone with CouchDB

I'm trying to build a (simple) twitter-clone which uses CouchDB as Database-Backend.
Because of its reduced feature set, I'm almost finished with coding, but there's one thing left I can't solve with CouchDB - the per user timeline.
As with twitter, the per user timeline should show the tweets of all people I'm following, in a chronological order. With SQL it's a quite simple Select-Statement, but I don't know how to reproduce this with CouchDBs Map/Reduce.
Here's the SQL-Statement I would use with an RDBMS:
SELECT * FROM tweets WHERE user_id IN [1,5,20,33,...] ORDER BY created_at DESC;
CouchDB schema details
user-schema:
{
_id:xxxxxxx,
_rev:yyyyyy,
"type":"user",
"user_id":1,
"username":"john",
...
}
tweet-schema:
{
"_id":"xxxx",
"_rev":"yyyy",
"type":"tweet",
"text":"Sample Text",
"user_id":1,
...
"created_at":"2011-10-17 10:21:36 +000"
}
With view collations it's quite simple to query CouchDB for a list of "all tweets with user_id = 1 ordered chronologically".
But how do I retrieve a list of "all tweets which belongs to the users with the ID 1,2,3,... ordered chronologically"? Do I need another schema for my application?
The best way of doing this would be to save the created_at as a timestamp and then create a view, and map all tweets to the user_id:
function(doc){
if(doc.type == 'tweet'){
emit(doc.user_id, doc);
}
}
Then query the view with the user id's as keys, and in your application sort them however you want(most have a sort method for arrays).
Edited one last time - Was trying to make it all in couchDB... see revisions :)
Is that a CouchDB-only app? Or do you use something in between for additional buisness logic. In the latter case, you could achieve this by running multiple queries.
This might include merging different views. Another approach would be to add a list of "private readers" for each tweet. It allows user-specific (partial) views, but also introduces the complexity of adding the list of readers for each new tweet, or even updating the list in case of new followers or unfollow operations.
It's important to think of possible operations and their frequencies. So when you're mostly generating lists of tweets, it's better to shift the complexity into the way how to integrate the reader information into your documents (i.e. integrating the readers into your tweet doc) and then easily build efficient view indices.
If you have many changes to your data, it's better to design your database not to update too many existing documents at the same time. Instead, try to add data by adding new documents and aggregate via complex views.
But you have shown an edge case where the simple (1-dimensional) list-based index is not enough. You'd actually need secondary indices to filter by time and user-ids (given that fact that you also need partial ranges for both). But this not possible in CouchDB, so you need to work around by shifting "query" data into your docs and use them when building the view.

Resources