Parsing Wikipedia countries, regions, cities - parsing

Is it possible to get a list of all Wikipedia countries, regions and cities with relations between them? I couldn't find any API appropriate for this task.
What is be the easiest way to parse all the information I need?
PS: I know, that there are another datasources I can get this information from. But I am interested in Wikipedia...

[2020 update] this is now best done using the Wikidata Query Service, you can run super specific queries with a bit of SPARQL, example: Find all countries and their label. See Wikidata Query Help
It might be a bit tedious to get the whole graph but you can get most of the data from the experimental/non-official Wikidata Query API.
I suggest the following workflow:
Go to an instance of the kind of entities you want to work with, say Estonia (Q191) and look for its instance of (P31) properties, you will find: country, sovereign state, member of the UN, member of the EU, etc.
Use the Wikidata Query API claim command to output every entity
that as the chosen P31 property. Lets try with country (Q6256):
http://wdq.wmflabs.org/api?q=claim[31:6256]
It outputs an array of numeric ids: that's your countries! (notice that the result is still incomplete as there are only 141 items found: either countries are missing from Wikidata, or, as suggested by Nemo in comments, some countries are to be found in country (Q6256) subclasses(P279))
You may want more than ids though, so you can ask Wikidata Official API for entities data:
https://www.wikidata.org/w/api.php?action=wbgetentities&ids=Q16&format=json&props=labels|claims&languages=en|fr
(here Canada(Q16) data, in json, with only claims and labels data, in English and French. Look at the documentation to adapt parameters to your needs)
You can query multiple entities at a time, with a limit of 50, as follow:
https://www.wikidata.org/w/api.php?action=wbgetentities&ids=Q16|Q17|Q20|Q27|Q28|Q29|Q30|Q31|Q32|Q33|Q34|Q35|Q36|Q37|Q38|Q39|Q40|Q41|Q43|Q45|Q77|Q79|Q96|Q114&format=json&props=labels|claims&languages=en|fr
From every countries data, you could look for entities registered as administrative subdivisions (P150) and repeat on those new entities.
Aternatively, you can get all the tree of administrative subdivisions with the tree command. For instance, for France(Q142) that would be http://wdq.wmflabs.org/api?q=tree[142][150] Tadaaa, 36994 items! But that's way harder to refine given the different kinds of subdivision you can encounter from a country to another. And avoid doing this kind of query from a browser, it might crash.
You now just have to find cities by countries by refining this last query with the claim command, and the appropriate sub-class(P279) of municipality(Q15284) entity (all available here): for France, that's commune (Q484170), so your request looks like
http://wdq.wmflabs.org/api?q=tree[142][150] AND claim[31:484170]
then repeat for all the countries: have fun!

You should go with Wikidata and/or dbpedia.
Personally I'd start with Wikidata as it's directly using MediaWiki, with the same API so you can use similar code. I would use pywikibot to get started. Like that you can still request pages from Wikipedia where that makes sense (e.g. list pages or categories).
Here's a nice overview of ways to access Wikidata

Related

Applying distinct on specific field in CloudSearch query

I am examining AWS CloudSearch for system's new searching engine.
Assume that there are articles and some comments written on each articles. The search API should return articles that are matching or having any matching comments. So is there any possible way to retrieve DISTINCT values(in this case, unique ID of the article) from CloudSearch with single query execution? If not, what would be the nice solution to resolve this feature's requirement with CloudSearch?
I know there's text-array type for document field in CloudSearch but it seems expensive to update documents since N of comments on single article can be more than thousands.
I faced similar problem, putting comments is not an option in your case as array elements cannot be more than 1000 in cloudsearch. I will make two search domains, articles and comments. I will issue search query to both of them in parallel (async or multithreaded depending upon the language), articles will always generate non duplicate ids but on the results of comments query you have to apply the logic to an article id only once and always pick the top one, as results are sorted by matching score.

Desire 2 Learn Org Unit ID

What is the API call for finding a particular orgUnit ID for a particular course? I am trying to pull grades and a class list from API but I can not do it without the orgUnitID
There's potentially a few ways to go about this, depending on the kind of use-case you're in. Firstly, you can traverse the organizational structure to find the details of the course offering you're looking for. Start from the organization's node (the root org) and use the route to retrieve an org's descendants to work your way down: you'll want to restrict this call to only course-offering type nodes (org unit type ID '3' by default). This process will almost certainly require fetching a large amount of data, and then parsing through it.
If you know the course offering's Code (the unique identifier your organization uses to define course offerings), or the name, then you can likely find the offering in the list of descendants by matching against those values.
You can also make this search at a smaller scope in a number of ways:
If you already know the Org Unit ID for a node in the structure that's related to the course offering (for example, the Department or Semester that's a parent of the course offering), you can start your search from that node and you'll have a lot fewer nodes to parse through.
If your calling user context (or a user context that you know, and can authenticate as) is enrolled in the course offering, or in a known parent org (like a Department), then you can fetch the list of all that user's enrollments, and parse through those to find the single course offering you're looking for. (Note that this enrollments route sends back data as a paged result set, and not as a simple JSON array, so you may have to make several calls to work your way through a number of data pages before finding the one you want.)
In all these scenarios, the process will end up with you retrieving a JSON structure that will contain the Org Unit ID which you can then persist and use directly later.

Building a (simple) twitter-clone with CouchDB

I'm trying to build a (simple) twitter-clone which uses CouchDB as Database-Backend.
Because of its reduced feature set, I'm almost finished with coding, but there's one thing left I can't solve with CouchDB - the per user timeline.
As with twitter, the per user timeline should show the tweets of all people I'm following, in a chronological order. With SQL it's a quite simple Select-Statement, but I don't know how to reproduce this with CouchDBs Map/Reduce.
Here's the SQL-Statement I would use with an RDBMS:
SELECT * FROM tweets WHERE user_id IN [1,5,20,33,...] ORDER BY created_at DESC;
CouchDB schema details
user-schema:
{
_id:xxxxxxx,
_rev:yyyyyy,
"type":"user",
"user_id":1,
"username":"john",
...
}
tweet-schema:
{
"_id":"xxxx",
"_rev":"yyyy",
"type":"tweet",
"text":"Sample Text",
"user_id":1,
...
"created_at":"2011-10-17 10:21:36 +000"
}
With view collations it's quite simple to query CouchDB for a list of "all tweets with user_id = 1 ordered chronologically".
But how do I retrieve a list of "all tweets which belongs to the users with the ID 1,2,3,... ordered chronologically"? Do I need another schema for my application?
The best way of doing this would be to save the created_at as a timestamp and then create a view, and map all tweets to the user_id:
function(doc){
if(doc.type == 'tweet'){
emit(doc.user_id, doc);
}
}
Then query the view with the user id's as keys, and in your application sort them however you want(most have a sort method for arrays).
Edited one last time - Was trying to make it all in couchDB... see revisions :)
Is that a CouchDB-only app? Or do you use something in between for additional buisness logic. In the latter case, you could achieve this by running multiple queries.
This might include merging different views. Another approach would be to add a list of "private readers" for each tweet. It allows user-specific (partial) views, but also introduces the complexity of adding the list of readers for each new tweet, or even updating the list in case of new followers or unfollow operations.
It's important to think of possible operations and their frequencies. So when you're mostly generating lists of tweets, it's better to shift the complexity into the way how to integrate the reader information into your documents (i.e. integrating the readers into your tweet doc) and then easily build efficient view indices.
If you have many changes to your data, it's better to design your database not to update too many existing documents at the same time. Instead, try to add data by adding new documents and aggregate via complex views.
But you have shown an edge case where the simple (1-dimensional) list-based index is not enough. You'd actually need secondary indices to filter by time and user-ids (given that fact that you also need partial ranges for both). But this not possible in CouchDB, so you need to work around by shifting "query" data into your docs and use them when building the view.

Best MongoDB schema for twitter clone?

I know similar questions have been asked, but looking for a very basic answer to a basic question. I am new to MongoDB and making a twitter style app (blogs, followers, etc) and I'm wondering the best schema to use.
Right now I have (on a very high level):
Member {
login: string,
pass: string,
posts: [
{
title: string,
blog: string,
comments: [ { comment: string } ]
}
]
}
There is more to it, but that gives you the idea. Now the problem is I'm looking to add the "follow" feature and I'm not sure the best route to go.
I could add a "following" embedded doc to the Member, but I'm just not sure using mongoDB what the smartest method would be. My main concearn would obviously be the main "feed" page where you see all of the people you are following's posts.
This is not an ideal schema for a Twitter clone. The main problem is that "posts" is an evergrowing array which means mongo will have to move your massive document every few posts because it ran out of document padding. Additionally there's a hard (16mb) size limit to documents which makes this schema restrictive at best.
The ideal schema depends on whether or not you expect Twitter's load. The "perfect" mongodb schema in terms of maintainability and easy of use is not the same as the one I'd use for something with Twitter's throughput. For example, in the former case I'd use a posts collection with a document per post. In the high throughput scenario I'd start making bucket documents for small groups of posts (say, one per "get more" page). Additionally in the high throughput scenario you'd have to keep the follower's timeline up to date in seperate user timeline documents while in low throughput scenarios you can simply query them.
This question is the same the one how widely used in the blog post example and how to model blog posts and comments. You just have to apply the same concepts here. You have the following options:
embedded documents
dedicated collections and performing multiple queries
The pros and cons have been widely discussed. Embedded docs can only be 16MB large and it is not possible to return individual parts of an matched array in MongoDB...make your choice.
Not going any further because as said: the same question has been discussed in numerous questions about "schema design". Just google "Schema Design MongoDB" or look for the same on SO.
Adding a "following" array to the Member document should work well. It should contain the user IDs of the people that member is following. Your code will have to retrieve the list and construct a query that retrieves the tweets of those users. As Mongo is nonrelational, there's no way to construct a query that joins the Member and Tweet collections and does this in a single query, but you should be able to reduce network overhead by doing this on the database server, using server-side code execution: http://www.mongodb.org/display/DOCS/Server-side+Code+Execution.

What is the best approach for a interpreting an text input for geocoding purposes?

Consider the following site:
http://maps.google.com
It has a main text input, where the user can type business, countries, provinces, cities, addresses and zip codes. I wonder which is the best way to implement a search like this. I realize that probably Google Maps uses a full text search with all kinds of data in the same table, and it has a chance of having a parser which classifies the input (i.e. between numeric, like zip codes and coordinates, and textual, like business and addresses).
With the data spread in many tables and systems, a parser is essential. The parser could be built from regular expressions, or could be built with IA tools like Artificial Neural Networks and Genetic Algorithms.
Which approach would you recommend?
It might be best to aggregate the data from all of your tables into a search index. Lucene is a free search engine, similar to how Google's search engine works (inverted index), and it should allow you to search by any of those values or any combination of them with relative ease.
http://lucene.apache.org/java/docs/
Lucene comes with its own query language (again, very similar to Google's or any other Internet search sites syntax). The only drawback of using something like Lucene is you would need to build its index. You wouldn't be querying your database directly (which could get very complicated...inverted index are pretty much designed for what your trying to do), so you need to periodically gather up new information from your database and add it to your index. It might also be necessary to rebuild your index to remove unneeded data.
With Lucene, you get a pretty flexible query syntax that most people are familiar with (because pretty much everyone searches the internet), it performs very well, and is not terribly complicated. By using Lucene, you avoid the hit of using regular expressions (which are not the most performant text searching mechanism), and you don't have to write your own parser. Should be a win-win, aside from a little learning curve to build a Lucene index generator and figure out how to query that index.
I'd have the data in one database. If the data got to big or I knew it would be huge, I'd assign an id to each business, address etc, then have other tables which reference this data.
Regular Expressions would only be necessary if the user could define what they want to search for:
business: Argos
But then what happens if they want an Argos in Manchester (Sorry, I'm English), maybe then get the location of the user based on their IP but what happens if they say:
business: Argos Scotland
Now you don't know if the company has two words, or if there is a location next to it. All of this has to be taken into consideration.
P.s Sorry if that made no sense.
You will need to pre process the query before doing a full text search on it. If you are using a GIS database, then you will already have columns like city, areacode, country etc. Convert your query into tokens seperated on space or commas, or both. Then hit individual columns to see match. This way you will know what part of the query is the city, the areacode etc.
You could also try some naive approximation approaches,example - 6 consecutive numbers will probably be an area code. Look for common words like "road" , "restaurant" , "street" etc which will be part of many queries and then use some approximation to figure out what they are looking for. Hope this helps.

Resources